The risk of misclassifying subjects within principal component based asset index


Material Information

Title:
The risk of misclassifying subjects within principal component based asset index
Physical Description:
Mixed Material
Language:
English
Creator:
Sharker, MA Yushuf
Nasser, Mohammed
Abedin, Jaynal
Arnold, Benjamin F.
Luby, Stephen P.
Publisher:
BioMed Central (Emerging Themes in Epidemiology)
Publication Date:

Notes

Abstract:
The asset index is often used as a measure of socioeconomic status in empirical research as an explanatory variable or to control confounding. Principal component analysis (PCA) is frequently used to create the asset index. We conducted a simulation study to explore how accurately the principal component based asset index reflects the study subjects’ actual poverty level, when the actual poverty level is generated by a simple factor analytic model. In the simulation study using the PC-based asset index, only 1% to 4% of subjects preserved their real position in a quintile scale of assets; between 44% and 82% of subjects were misclassified into the wrong asset quintile. If the PC-based asset index explained less than 30% of the total variance in the component variables, then we consistently observed more than 50% misclassification across quintiles of the index. The frequency of misclassification suggests that the PC-based asset index may not provide a valid measure of poverty level and should be used cautiously as a measure of socioeconomic status.
Keywords: Principal component analysis, Socio-economic status, Asset index, Wealth index
General Note:
Sharker et al. Emerging Themes in Epidemiology 2014, 11:6 http://www.ete-online.com/content/11/1/6; Pages 1-8
General Note:
doi:10.1186/1742-7622-11-6 Cite this article as: Sharker et al.: The risk of misclassifying subjects within principal component based asset index. Emerging Themes in Epidemiology 2014 11:6.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
© 2014 Sharker et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
System ID:
AA00024689:00001

Full Text

PAGE 1

EMERGING THEMES IN EPIDEMIOLOGY
Sharker et al. Emerging Themes in Epidemiology 2014, 11:6
http://www.ete-online.com/content/11/1/6

ANALYTIC PERSPECTIVE (Open Access)

The risk of misclassifying subjects within principal component based asset index

MA Yushuf Sharker1*, Mohammed Nasser2, Jaynal Abedin1, Benjamin F Arnold3 and Stephen P Luby4

Abstract
The asset index is often used as a measure of socioeconomic status in empirical research as an explanatory variable or to control confounding. Principal component analysis (PCA) is frequently used to create the asset index. We conducted a simulation study to explore how accurately the principal component based asset index reflects the study subjects' actual poverty level, when the actual poverty level is generated by a simple factor analytic model. In the simulation study using the PC-based asset index, only 1% to 4% of subjects preserved their real position in a quintile scale of assets; between 44% and 82% of subjects were misclassified into the wrong asset quintile. If the PC-based asset index explained less than 30% of the total variance in the component variables, then we consistently observed more than 50% misclassification across quintiles of the index. The frequency of misclassification suggests that the PC-based asset index may not provide a valid measure of poverty level and should be used cautiously as a measure of socioeconomic status.

Keywords: Principal component analysis, Socio-economic status, Asset index, Wealth index

Introduction
Socioeconomic status (SES) is commonly measured in social science and public health research by combining diverse factors including wealth, education level and occupation [1]. The asset index, which has also been called the wealth index, is created by measuring an individual's assets and is widely used as a proxy measure of socioeconomic status. However, the asset index is a measure of the acquired assets and, as such, is a subset of total socioeconomic status. Nevertheless, public health researchers commonly use the asset index in regression models either to estimate its direct effect on outcomes or to control for its confounding effect on disease-exposure associations [2-4].
To derive the asset index, researchers commonly gather information on asset ownership, usually through the administration of a questionnaire, and then frequently apply principal component analysis (PCA) as a data compression technique. The PCA method generates as many principal components as there are variables in the dataset. The first principal component (PC) is a weighted sum of the observed asset variables that accounts for more of the variability of the observed data than any other principal component. This first PC is considered as an asset index [5]. PCA allows the analyst to replace various asset variables with the univariate first PC score that best assigns the subjects into different categories. Subjects may then be classified into quintiles according to their asset index. For example, the first quintile, consisting of the lowest 20% of values of the index, represents persons with the fewest assets (the poorest subject category) and the fifth quintile, consisting of the highest 20% of index values, represents persons with the most assets (the wealthiest category).

*Correspondence: mayushuf@gmail.com
1 icddr,b, 68 Shahid Tajuddin Ahamed Sarani, Mohakhali, 1212 Dhaka, Bangladesh
Full list of author information is available at the end of the article

Conceptually, there is a "true" measure of socioeconomic status which cannot be determined and is associated with various outcomes, for example, a specific health outcome. Since we cannot determine the true measure of socioeconomic status, we measure either related proxy variables, such as income, or manifest variables, such as presence of assets. Economic proxy and manifest variables are assumed to represent a person's true economic status.
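The construction described above (first PC score as the index, then quintiles) can be sketched in a few lines. This is a minimal illustration and not the authors' program: it assumes NumPy, works on the covariance matrix of centred variables (many applications standardize the variables first), and the names `pc_asset_index`, `quintile` and `toy_assets` are hypothetical.

```python
import numpy as np

def pc_asset_index(assets):
    """Return first-PC scores, loadings, and the share of variance explained.

    assets: (n_subjects, n_variables) array of asset variables.
    """
    X = np.asarray(assets, dtype=float)
    Xc = X - X.mean(axis=0)                    # centre each asset variable
    cov = np.cov(Xc, rowvar=False)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    loadings = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
    scores = Xc @ loadings                     # weighted sum = first PC score
    explained = eigvals[-1] / eigvals.sum()    # proportion of variance explained
    return scores, loadings, explained

def quintile(scores):
    """Assign quintile 1 (lowest 20%) to 5 (highest 20%) by rank."""
    scores = np.asarray(scores)
    ranks = scores.argsort().argsort()         # 0 = smallest score
    return ranks * 5 // len(scores) + 1

rng = np.random.default_rng(0)
toy_assets = rng.normal(size=(100, 5))         # hypothetical data, 100 subjects
index, loadings, explained = pc_asset_index(toy_assets)
groups = quintile(index)
```

With 100 subjects each quintile receives exactly 20 of them; `explained` is the quantity the article later relates to misclassification.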
When proxy variables are not available, researchers may use an asset index derived using PCA [5-8]. It is expected that the asset index would retain the order of the "true" socioeconomic status of the study subjects with negligible

PAGE 2

error and that it would provide the real quintile membership of a subject. There are, however, no standard tools available to validate the performance of the asset index, including the PC-based asset index, in terms of retaining the order of the true socioeconomic status of an individual, because the underlying true socioeconomic status is unknown.

Several authors who have applied the PC-based asset index have attempted to validate its credibility in different ways [5,6,9,10]. These studies reported satisfactory performance of PCA-based asset indices for explaining the variability of the fertility rate of a country, educational outcomes (schooling of children, dropout of children from schools) and income or expenditure based inequality measures.

Howe et al. compared four different methods to measure an asset index, including applying PCA on all categories of asset variables and applying PCA on binary coded asset variables. Using the data from the 2004-2005 Malawi Integrated Household Survey, they found that the PC-based index had modest agreement with consumption expenditure (kappa = 0.11 and 0.10), which is an intensive measure of household wealth used by economists as the optimal measure to assess income and welfare [11]. Howe et al. also reported the proportion of subjects that were misclassified by the PC-based asset index into the wrong asset quintiles when compared with consumption expenditure: 71% for all categories of asset variables, and 70% for binary coded asset variables. They concluded that a PC-based asset index was not a reliable proxy for consumption expenditure [12].

Kolenikov and Angeles used simulations to assess the performance of a PCA-based asset index for ranking the subjects compared to simulated welfare. They reported that the PCA-based asset index misclassified subjects into the wrong asset quintiles when compared to welfare quintiles, but did not explore the reasons behind the misclassification [10].
Howe et al. performed a systematic review of 17 articles with 36 datasets to see how the PC-based asset index performed compared to consumption expenditure and found that most of the asset indices poorly reflected consumption expenditure. The study considered different measures of asset indices in addition to PC-based asset indices but did not focus on reasons for poor performance of the asset indices [7].

In the published literature on asset index measurement using PCA, the proportion of variance explained by the first PC was low, ranging from 12% to 34% [2,3,5,8,13]. Since researchers replace multiple asset variables by the single first PC score, a higher proportion of variance explained by the first PC is important for carrying enough information from the multiple asset variables [14]. However, we have little information about how the proportion of variance explained by the first PC could affect the performance of the PC-based asset index.

The use of misclassified covariates to control confounding can bias the exposure-disease association estimates [15,16]. If a PC-based asset index does not properly categorize study subjects when compared to real measures of wealth, the PC-based asset index may not be a good index for controlling confounding in exposure-disease association analyses. The objective of this study was to verify whether a PC-based asset index would yield the same quintile rank as the quintile rank that was artificially imposed in the simulated data. This study also explored the possible reasons why PCA might perform poorly for asset index measurement.

Methods
In this study, we performed a simulation experiment. In each simulation, we generated 100 random numbers from the uniform distribution over five different non-overlapping ranges as a measure of asset index. This simulated asset index was considered the true asset index of a group of 100 subjects. We then generated the asset variables using pre-specified loadings and the simulated index through a confirmatory factor model as described in Kolenikov and Angeles [10]. A confirmatory factor model extracts variables by taking proportions of the index plus the random error, which is usually the measurement error. We used four sets of loadings for generating data, which yielded four models. The process for generating the data and
simulations are described in Table 1 and Figure 1. We wrote a customized program in R to perform the simulation experiment [17]. The program was tested by another co-author to check the reproducibility of the results.

We then tested the performance of a PC-based asset index without any specific distributional assumption by using a real measurement of expenditure data, collected from an intensive qualitative survey, that was a skewed

Table 1 Data generating process in the simulation
• We generated an artificial latent factor which is assumed to be the real asset index. We used 100 data points from the uniform distribution with five different arbitrary non-overlapping ranges: (0,3], (3,5], (5,8], (8,10] and (10,14]. We drew 20 sample points from each range and stored the position index of each subject based on the latent factor.
• We considered the normalized loading vectors
  V1 = (0.79, 0.54, 0.13, 0.01, 0.26),
  V2 = (0.73, 0.52, -0.20, 0.00, -0.4),
  V3 = (0.67, 0.4, -0.5, -0.01, -0.4) and
  V4 = (-0.02, 0.14, -0.57, -0.51, -0.63).
• We generated the data matrix Y based on the confirmatory factor analysis model used in Kolenikov and Angeles [10].
• We generated five-dimensional random variables using the loading vectors and normal errors with mean 0 and variance from (0,4].
• We performed PCA on Y and generated the asset index.
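The data-generating steps in Table 1 can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions rather than the authors' R program: each asset variable is taken to be loading times latent index plus normal error, and a single error variance is drawn uniformly from (0, 4] per replication, which the table does not specify. The names `RANGES`, `simulate_assets` and the seed are hypothetical.

```python
import random

RANGES = [(0, 3), (3, 5), (5, 8), (8, 10), (10, 14)]   # ranges from Table 1
V1 = (0.79, 0.54, 0.13, 0.01, 0.26)                    # first loading vector, Table 1

def simulate_assets(loadings, rng, per_range=20):
    """Return (latent, Y): the latent index and the generated asset variables."""
    # 20 uniform draws from each range; subjects are stored range by range,
    # so subject i belongs to true quintile i // per_range + 1.
    latent = [rng.uniform(lo, hi) for lo, hi in RANGES for _ in range(per_range)]
    # Assumption: one error variance per replication, uniform on (0, 4].
    sigma = (4.0 * (1.0 - rng.random())) ** 0.5
    # Factor model: asset_j = loading_j * latent + normal error.
    Y = [[lam * xi + rng.gauss(0.0, sigma) for lam in loadings] for xi in latent]
    return latent, Y

rng = random.Random(2014)          # arbitrary seed for reproducibility
latent, Y = simulate_assets(V1, rng)
```

Because the five ranges do not overlap and each supplies 20 subjects, the true quintile of every subject is known by construction, which is what makes the later misclassification counts possible.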

PAGE 3

Figure 1 Flowchart of the simulation. [Flowchart boxes: generate uniform random variable with five different non-overlapping ranges, plus one real data set; population loading vectors; data matrix; PCA; store the first PC score; store the position index of each subject based on the PC; sort and store the position index of subjects. Outputs: frequency of unchanged position, explained proportion of variance, mean degree of misclassification, estimates of loading vector.]

proxy measure of the economic status. We performed a similar experiment using the expenditure data instead of the simulated asset index. The only difference between the simulated index and the expenditure data was that in the simulation experiment we generated a different asset index for each set of loadings, whereas for the expenditure data the asset index was fixed. We generated the artificial asset variables using the observed expenditure data and the same weights used in the simulated asset index. To make the results comparable with the units of the simulated asset index, we standardized the expenditure data $x$ as $(x - \operatorname{median}(x)) / \operatorname{MAD}(x)$, where $\operatorname{MAD}(x)$ is the median absolute deviation about the median of the variable $x$. This standardized measurement was considered to be the true index for our experiment. We used the median for standardization to keep the position index of the subjects the same as in the original rank according to the expenditure data.

We repeated the experiment 10,000 times for each of the four models, for a total of 40,000 replications for both the simulated index and the real expenditure data. If the PC score retained the order of the true index generated by the models, the position of an observation in the PC score and in the true index should be the same. We recorded the frequency of the same position index, which we defined as the frequency of unchanged positions.
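The median/MAD standardization and the count of unchanged positions described above can be sketched with the standard library alone. The function names and the five expenditure values are hypothetical, and the sketch assumes no ties and a non-zero MAD.

```python
from statistics import median

def mad_standardize(x):
    """Return (x - median(x)) / MAD(x), MAD being the median absolute deviation."""
    med = median(x)
    mad = median(abs(v - med) for v in x)      # assumed non-zero
    return [(v - med) / mad for v in x]

def positions(values):
    """Rank position (0 = smallest) of every subject."""
    order = sorted(range(len(values)), key=values.__getitem__)
    pos = [0] * len(values)
    for rank, subject in enumerate(order):
        pos[subject] = rank
    return pos

def unchanged_positions(true_index, pc_score):
    """Count subjects whose rank position is identical in both orderings."""
    return sum(a == b for a, b in zip(positions(true_index), positions(pc_score)))

spending = [120.0, 80.0, 300.0, 95.0, 150.0]   # hypothetical expenditure values
z = mad_standardize(spending)
```

Since the standardization is a positive linear rescaling, `unchanged_positions(spending, z)` equals the number of subjects, which is the rank-preserving property the text relies on.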
We estimated the mean degree of misclassification in the PC-based asset index by classifying the true index and the PC score into their quintiles and counting the number of observations where the quintile membership was different. The probability of misclassification was estimated by dividing the total number of observations classified into different quintiles by the total number of observations.

We stored the proportion of variance explained by the first PC and the estimates of the loadings for each of the replications. We explored the dependency pattern among the frequency of unchanged position, the probability of misclassification, and the explained proportion of variance using scatter plots. To assess the effect of the five loading estimates on the relationship between the proportion of explained variance and asset quintile misclassification, we constructed a parallel coordinate plot. In a parallel coordinate plot, the estimate of a loading vector that consisted of five elements was plotted on the five parallel vertical coordinates (E1-E5) and the plotted points were connected horizontally. Each connected line corresponded to a simulation result of loading vector estimates. Finally, we used different colors for the two clusters and identified the characteristics of loading estimates between the two clusters.
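The probability of misclassification defined above is a simple proportion, sketched here in plain Python. The function names are hypothetical, ties are broken by input order (an assumption), and the two toy scores are illustrative only.

```python
def quintiles(values):
    """Quintile 1..5 for each subject, by rank (ties broken by input order)."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)
    q = [0] * n
    for rank, subject in enumerate(order):
        q[subject] = rank * 5 // n + 1
    return q

def misclassification_probability(true_index, pc_score):
    """Share of subjects whose quintile differs between the two indices."""
    qt, qp = quintiles(true_index), quintiles(pc_score)
    return sum(a != b for a, b in zip(qt, qp)) / len(qt)

true_index = list(range(100))
reversed_score = [-v for v in true_index]       # ordering fully reversed
shifted_score = [v + 0.5 for v in true_index]   # ordering preserved
```

A score that preserves the order gives 0.0, while a fully reversed score gives 0.8, because only the middle quintile maps onto itself when the ranking is flipped.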

PAGE 4

Results
Our simulation study showed a range of 0% to 98% misclassification in the PC-based asset quintile. The median probability of misclassification varied from 44% to 82% depending on the loading vector used for data generation (Table 2). Because of the way unchanged positions and asset quintile misclassification are defined, simulations with fewer than 20% unchanged positions produced up to 100% asset quintile misclassification. However, if the percentage of unchanged positions was greater than 20%, the mean degree of misclassification reduced dramatically (Figure 2A).

The pattern of the probability of misclassification using real expenditure data was similar to that for the simulated asset index. The observed misclassification ranged from 0% to 96%. The median probability of misclassification varied from 68% to 79% (Table 3).

The simulated data subjects were much more likely to retain their position when the first PC explained a large fraction (>90%) of the variance (Figure 2B). Iterations in which the PC explained close to 100% of the variance accounted for those instances in which up to 100% of subjects retained their position. However, a large number of subjects did not retain their position even though the explained proportion of variance was greater than 80% (marked as red points in Figure 2B).

The scatter plots between the proportion of explained variance and the probability of misclassification generated two clusters (Figure 2C). One cluster (marked as green) indicated that the mean degree of misclassification decreased with the increasing proportion of explained variance. The other cluster (marked as red) consistently showed more than 80% misclassification irrespective of the proportion of explained variance (Figure 2C).
The parallel coordinate plot of the loading vector estimates in each experiment revealed that the sign of the loadings has an important role in limiting the misclassification of the subjects. The red cluster included simulation results where the index consistently provided more than 80% misclassification. For those instances in the simulation, the signs of the loading estimates were found to be opposite to the real ones (the signs of the loadings that were used for generating the true index). In this situation, despite the higher proportion of explained variance, the PC-based index might not retain the original position of subjects. In the green cluster, where a higher proportion of explained variance reduces the proportion of misclassification, we observed the retention of the signs of the loading estimates (Figure 3).

Discussion
In this article we evaluated whether PCA retained the order of subjects based on a true asset index using a simulation experiment. We also used expenditure data collected in a different study to address the distributional limitations of the simulated asset index.

We found that PCA does not reliably maintain the order of the true asset index. PCA changes the position of up to 98% of subjects, and the magnitude of the position change was usually enough to classify the subjects into the wrong asset quintiles. We observed a relatively higher probability of misclassification when we considered observed expenditure data, which was positively skewed, as a true index. The skewed distribution of the underlying latent factor introduced more risk of misclassification in a PC-based index. Our findings are supported by Kolenikov and Angeles [10], who reported an increased risk of misclassification in a PC-based asset index for data generated from skewed underlying factors.
Table 2 Descriptive statistics of the number of unchanged order, and probability of misclassification into the wrong quintile for four different vectors in simulated data

                  Number of              Maximum positive          Probability of
                  unchanged position     dispersion of position    misclassification
                  V1   V2   V3   V4      V1   V2   V3   V4         V1    V2    V3    V4
Minimum            0    0    0    0       1    1    1    5         0     0     0     .04
First quartile     2    1    0    0      20   24   38   42         .25   .30   .46   .49
Median             4    3    1    1      37   44   93   76         .44   .50   .82   .67
Mean               7    6    4    2      38   50   68   69         .41   .51   .65   .66
Third quartile     7    6    4    3      52   92   97   97         .55   .82   .89   .87
Maximum           98   98   98   26      99   99   99   99         .97   .98   .98   .97

In our simulations, the sign of the loading of the asset variables retained by the PCA was an important determinant of the probability of misclassification in a PC-based asset quintile. A change in the sign means a change in the direction of contribution of an asset variable to the index. In the real world, an asset might positively contribute to relative wealth, but in the PC-based index, this might appear negatively. For example, the loading of agricultural land appeared with a negative sign in the

PAGE 5

PC-based asset index in Howe et al. [12]. The opposite sign may appear, possibly due to the data coding and measurement scales selected and the underlying correlation structure within variables. Researchers often select asset variables in such a way that they are positively correlated with each other. When a PC-based asset index is constructed from positively correlated variables, the loadings of those variables should appear with positive signs. Our study suggests that loadings might be assigned a negative sign because the underlying correlation among the asset variables might vary between populations. If so, the variables with negative loadings may be problematic because the presence of such an asset inappropriately moves a subject to a lower level than its true level in the index, and our simulations suggest that this could contribute importantly to misclassification [8,13].

An increased proportion of variance explained by the first PC score increases the probability of generating an index that reflects the underlying economic status. To ensure that the first PC explains a higher proportion of the variance of the dataset, variables should be well correlated with each other. It is possible that asset variables might be classified into subgroups and/or might be redundant based on the correlation structure. When this occurs, the first PC represents the subgroup of variables that contains the major source of variability of the total dataset and may not account for the contribution of all variables [14].

Figure 2 Scatter plots between frequency of unchanged position and probability of misclassification (A), proportion of explained variance and frequency of unchanged position (B) and proportion of explained variance and probability of misclassification (C). The data in red refer to those simulations where the probability of misclassification was consistently more than 80% irrespective of different levels of explained proportion of variance. The data in green refer to those instances where the probability of misclassification is negatively correlated with explained proportion of variance.
Table 3 Descriptive statistics of the number of unchanged order, and probability of misclassification into the wrong quintile for four different vectors for real expenditure data

                  Number of              Maximum positive          Probability of
                  unchanged position     dispersion of position*   misclassification
                  V1   V2   V3   V4      V1   V2   V3   V4         V1    V2    V3    V4
Minimum            0    0    0    0      22113 [column breaks lost] 0     0     0     .10
First quartile     1    1    0    0      56   63   73   71         .55   .59   .67   .65
Median             2    2    1    1      76   81   88   86         .68   .72   .79   .76
Mean               4    3    2    2      70   74   80   80         .64   .68   .74   .73
Third quartile     5    4    3    3      89   93   96   95         .79   .84   .87   .86
Maximum           86   95   88   19      99   99   99   99         .96   .96   .97   .96

*Real expenditure data has 112 observations, so the dispersion of position was rescaled to 100.

In such situations, the first PC alone might not be sufficient either to account for the contribution of all asset variables or to explain a sufficient amount of variability required to reduce the misclassification of subjects. The situation becomes more difficult when asset variables are

PAGE 6

categorical. Proper variable selection and use of appropriate correlation for nominal and ordinal variables, such as polychoric correlations, could improve the power of the PC-based index to explain the variability [10].

Figure 3 Parallel coordinate plot of the elements of loading vectors. The green color corresponds to the cluster in which an increasing proportion of explained variance decreases the percentage of misclassification into quintiles. The dark line indicates the population loading vector based on which data were generated. Inset: the scatter plot of the proportion of explained variance against the percentage of misclassification, color-matched to the parallel coordinate plot to show which loading vector estimates are linked with each cluster.

To use PCA for an asset index, the sign of loadings should be examined in addition to the proportion of variance explained by the first PC in order to increase our confidence in the accuracy of the ranking of real wealth. The signs of the loadings should be internally consistent with our understanding of what constitutes wealth in the study population. Additionally, checking consistency between wealth groups with respect to their existing asset variables and checking the robustness of the asset index with regard to different asset variables could help measure the level of reliability, as was done by Filmer and Pritchett [5].
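The check recommended above (look at loading signs as well as explained variance) might be automated along the following lines. This sketch assumes NumPy and is not a procedure from the article: the helper `first_pc_report`, the default expectation that every asset contributes positively, and the step that orients the eigenvector (whose overall sign is mathematically arbitrary) are illustrative choices. The 0.30 threshold echoes the 30% figure quoted in the abstract.

```python
import numpy as np

def first_pc_report(assets, expected_sign=None, min_explained=0.30):
    """Summarise first-PC loadings, explained variance and sign conflicts."""
    X = np.asarray(assets, dtype=float)
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    loadings = eigvecs[:, -1]
    explained = float(eigvals[-1] / eigvals.sum())
    if expected_sign is None:
        expected_sign = np.ones(X.shape[1])    # assumption: every asset adds wealth
    expected_sign = np.asarray(expected_sign, dtype=float)
    # An eigenvector is defined only up to sign; orient it so that it agrees
    # with the expected directions on balance (illustrative convention).
    if np.dot(loadings, expected_sign) < 0:
        loadings = -loadings
    conflicting = np.flatnonzero(loadings * expected_sign < 0)
    return {
        "loadings": loadings,
        "explained": explained,
        "low_explained": explained < min_explained,
        "conflicting_variables": conflicting.tolist(),
    }

rng = np.random.default_rng(1)
wealth = rng.uniform(0, 14, size=200)                 # hypothetical latent index
lam = np.array([0.79, 0.54, 0.13, 0.01, 0.26])        # V1 from Table 1
toy = np.outer(wealth, lam) + rng.normal(scale=0.5, size=(200, 5))
report = first_pc_report(toy)
```

Variables listed under `conflicting_variables` are the ones whose loading sign contradicts the analyst's expectation and would merit a closer look before the index is used for ranking.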
Although the PC-based asset index is a poor proxy for the standard consumption expenditure, it continues to be used because it is so much easier to deploy [7]. For example, after the publication of the seminal paper by Filmer and Pritchett [5], several applications of PCA for estimating the asset index appeared [2,3,6,8,13]. Some of these papers considered validity checking based on the correlation between the PC-based index and some other proxy variables. If the correlation or association measurement approaches 1, the order of the index approaches the order of the observed proxy. In addition to the correlation with some reliable proxies, the characteristics of the PCA-based index, such as the loadings, the signs of the loadings and the proportion of explained variance, should be reported as validation tools; these were rarely considered in validating indices of wealth.

To even engage in an exploration of possible algorithms applied to proxies of economic status, and to examine those against a standard, implies an acceptance that the underlying data-generating distribution follows this model. Ideally, there would exist a measurable standard against which we could compare algorithms applied to proxies and thus be able to argue for one approach versus another based on estimates of risk (e.g., the probability of misclassifying the quintile to which a subject belongs). However, such a measurable standard does not exist for economic status. We have taken an approach that would identify which algorithms applied to proxies are best with regard to some loss function at predicting the latent variable under the best circumstances, where this sort of latent variable model is true. Thus

PAGE 7

the results should be interpreted knowing that the possible simulations (data-generating models) and possible methods for summarizing the manifest variables are but a tiny subset of the possible combinations. Our conclusions are meant to provide some intuition for problems that could arise, but can of course not be seen as proof by simulation.

Kolenikov and Angeles [10] showed that a heavy-tailed distribution of the SES index, such as a lognormal distribution, notably affects the coefficient estimates and the frequency of misclassification. They reported the marginal effect as 15% for overall misclassification and 30% misclassification in the first quintile in the PC-based procedures. All other distributions, including a bimodal skewed distribution, have rather mild effects on misclassification. In this study, we only considered the asset index variables to be uniformly distributed. This limited the misclassification due to the distribution and allowed us to explore other contributors to misclassification. We measured the performance of the PCA method using data with continuous variables only. We expected that this would create fewer errors than datasets with a mix of continuous and categorical variables, where there would be even greater misclassification using the PCA method [10]. Therefore, our estimates of misclassification are conservative. We considered a data matrix of only five dimensions. However, the results are still generalizable to higher dimensions because the rank-preserving capacity of PCA in an asset/wealth index should remain the same in higher dimensions. The work we present here could be expanded to use constructs simulated from actual asset variables in empirical datasets, which would be of higher dimension and include a mix of continuous and categorical asset variables.
Through repeated simulation experiments using artificial and real proxy data for latent variables, we showed that PCA does not retain the order of the true asset index and produces a high proportion of misclassification into the asset quintiles. Since the first PC score does not reliably maintain the original order of a latent construct, we should search for an alternative index that maintains the original order.

If investigators use PCA to create an asset index, they should report the proportion of variance explained and the loadings. Careful selection of asset variables, proper measurements and coding, and suitable correlation estimates for categorical asset variables are recommended to increase the capacity of the first PC to explain variability. If the proportion of explained variance is less than 30%, the risk of misclassification could be high (more than 50%), so the index should be interpreted with caution. We recommend checking for consistency and robustness for any level of explained variance. If the goal of the asset index is to control for confounding, then investigators should consider including the asset variables as the original covariates in the model, which we expect (though have not tested) could control for confounding more completely than PC-based indices.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
YS developed the simulation study design and drafted the manuscript, MN provided theoretical support in developing the study design, JA provided input in programming and data presentation, SL supervised the process. All the authors critically reviewed, provided intellectual input to the manuscript and approved the final version of the manuscript.

Acknowledgements
The authors thank Alan E. Hubbard for his critical review and input to the manuscript, Nadia Ali Rimi for providing the dataset, and Dorothy Southern and Diana Diaz Granadoz for assistance in manuscript writing. icddr,b is thankful to the Governments of Australia, Bangladesh, Canada, Sweden and the UK for providing unrestricted support.
Author details
1 icddr,b, 68 Shahid Tajuddin Ahamed Sarani, Mohakhali, 1212 Dhaka, Bangladesh.
2 Department of Statistics, University of Rajshahi, 6205 Rajshahi, Bangladesh.
3 School of Public Health, University of California, Berkeley, USA.
4 Global Disease Detection Branch, Division of Global Health Protection, Center for Global Health, Centers for Disease Control and Prevention, Georgia, USA.

Received: 20 December 2013 Accepted: 10 June 2014 Published: 18 June 2014

References
1. Braveman PA, Cubbin C, Egerter S, Chideya S, Marchi KS, Metzler M, Posner S: Socioeconomic status in health research. J Am Med Assoc 2005, 294(22):2879-2888.
2. Halder AK, Kabir M: Child mortality inequalities and linkage with sanitation facilities in Bangladesh. J Health Popul Nutr 2008, 26:64-73.
3. Luby SP, Halder AK: Associations among handwashing indicators, wealth, and symptoms of childhood respiratory illness in urban Bangladesh. Trop Med Int Health 2008, 13:835-844.
4. Van Ryn M, Burke J: The effect of patient race and socio-economic status on physicians' perceptions of patients. Soc Sci Med 2000, 50:813-828.
5. Filmer D, Pritchett LH: Estimating wealth effects without expenditure data - or tears: an application to educational enrollments in states of India. Demography 2001, 38:115-132.
6. Bollen KA, Glanville JL, Stecklov G: Socio-economic status, permanent income, and fertility: A latent-variable approach. Popul Stud 2007, 61:15-34.
7. Howe LD, Hargreaves JR, Gabrysch S, Huttly SRA: Is the wealth index a proxy for consumption expenditure? A systematic review. J Epidemiol Community Health 2009, 63:871-880.
8. McKenzie DJ: Measuring inequality with asset indicators. J Popul Econ 2005, 18:229-260.
9. Bollen KA, Glanville JL, Stecklov G: Economic status proxies in studies of fertility in developing countries: Does the measure matter? Popul Stud 2002, 56:81-96.
10. Kolenikov S, Angeles G: Socioeconomic status measurement with discrete proxy variables: is principal component analysis a reliable answer. Rev Income Wealth 2009, 55:128-165.
11. Atkinson AB: Poverty and Social Security. New York: Harvester Wheatsheaf; 1989.
12. Howe LD, Hargreaves JR, Huttly SR: Issues in the construction of wealth indices for the measurement of socio-economic position in low-income countries. Emerg Themes Epidemiol 2008, 5:3.
13. Houweling TA, Kunst AE, Mackenbach JP: Measuring health inequality among children in developing countries: does the choice of the indicator of economic status matter? Int J Equity Health 2003, 2(1):8.
14. Jolliffe I: Principal Component Analysis. New York: Springer; 2002.

PAGE 8

15. Ahlbom A, Steineck G: Aspects of misclassification of confounding factors. Am J Ind Med 1992, 21(1):107-112.
16. Greenland S, Robins JM: Confounding and misclassification. Am J Epidemiol 1985, 122(3):495-506.
17. R Core Team: R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2013. [http://www.R-project.org/]