PAGE 4
I would like to acknowledge many people who gave me support and advice while writing this thesis. First, I want to thank my advisor, Distinguished Professor Dr. George Casella, who gave me the opportunity to work as his research assistant and to get involved with the project I present here. His wealth of knowledge was an invaluable source of help. It would not have been possible without his constant feedback and guidance. I also want to thank my graduate committee members, Professor Dr. Clyde Schoolfield and Professor Dr. Mark Settles, for their additional assistance in my research. They provided a foundation for me to develop the correct methodologies to address specific issues within cluster analysis I discuss here. I would like to give a special acknowledgment to Professor Dr. Alvaro Cofre for his constant support and advice. His teaching helped me to discover the beauty of mathematics. His example served as an inspiration for how I instruct in my own classroom. His encouragement brought me where I am today. I am indebted to Vikneswaran Gopal for all his help and useful hints in the implementation of the codes to run the simulations included in this work. Finally and most importantly, I would like to thank Alejandra for her unconditional love, and my parents, Jorge and Edith, for their love, encouragement and support in all that I do.
PAGE 5
                                                                    page
ACKNOWLEDGMENTS ... 4
LIST OF TABLES ... 7
LIST OF FIGURES ... 8
ABSTRACT ... 9
CHAPTER
1 INTRODUCTION ... 10
  1.1 Clustering Problem in NIR Spectroscopy Data ... 10
  1.2 Cluster Analysis Methods and Our Methodology ... 13
2 THEORETICAL FRAMEWORK ... 15
  2.1 Notation Setup and Preliminaries ... 15
  2.2 Hypothesis Testing and Bayesian Analysis ... 16
  2.3 Model and Distributions ... 18
  2.4 Prior on the Partitions ... 20
  2.5 Estimation of the Bayes Factor ... 21
3 CALIBRATION PROBLEM ... 24
  3.1 Bayes Factor under the Null Distribution ... 26
  3.2 Difficulties in Determining the Exact Null Distribution of the Bayes Factor ... 29
  3.3 Saddlepoint Approximation ... 31
  3.4 MCMC and Simulation Alternative ... 32
4 SIMULATION STUDIES ... 35
  4.1 Goodness of the Approximation ... 35
  4.2 Error of Approximation ... 36
  4.3 Minimum Cluster Size ... 36
  4.4 Optimal Number of Partitions ... 39
  4.5 Sensitivity of the Procedure ... 41
5 ANALYSIS OF THE NIR DATA ... 44
6 CONCLUSIONS ... 48
PAGE 6
A ON THE PROBLEM OF PARTITIONS ... 50
  A.1 Generating a Random Partition ... 50
  A.2 Derivation in the General Case ... 51
  A.3 Partitions of an Integer ... 53
B DERIVATION OF THE MARGINAL DISTRIBUTION ... 55
C BAYES FACTORS AND HYPOTHESIS TESTING ... 58
REFERENCES ... 60
BIOGRAPHICAL SKETCH ... 63
PAGE 7
Table                                                               page
4-1 Posterior probabilities for 2500 simulations of 50 observations and 3 clusters. The number of considered partitions per iteration is 52. ... 37
4-2 Cut-off points for minimum cluster size 1 and α-level 0.05, based on 5000 simulations. The numbers in parentheses correspond to the standard errors after 6 repetitions. ... 37
4-3 Cut-off points for minimum cluster size 15% of the observations and α-level 0.05, based on 5000 simulations. ... 38
4-4 Posterior probabilities for 1000 simulations of 50 observations and 3 clusters. ... 41
4-5 Posterior probabilities after 500000 iterations for the observations in Fig. 4-3. The MCS is 20% of the observations. ... 41
4-6 Posterior probabilities after 500000 iterations for the observations in Fig. 4-4. The MCS is 20% of the observations. ... 43
5-1 Test of the hypotheses H0: no clusters vs. H1: k clusters, with minimum cluster size 20% of the total number of observations. ... 44
A-1 Number of partitions p(n, k) for n = 1, ..., 6. ... 54
PAGE 8
Figure                                                              page
1-1 Scatterplots of the first five principal components of the NIR data for label 4S520701. ... 13
4-1 Histograms of the null posterior probabilities for n = 50 and k = 3 clusters based on 2500 simulations. ... 36
4-2 Histograms of the null posterior probabilities for n = 50 and k = 2 and 3 clusters based on 5000 simulations. The minimum cluster size is set equal to 1 observation and 15% of the observations in each case. ... 39
4-3 Scatter-plot of 50 observations generated from a bivariate Normal distribution with mean $\mu = (1, 1)'$ and variance-covariance matrix $\Sigma = \mathrm{diag}(1/4, 1/4)$ ... 42
4-4 Scatter-plot of 50 observations generated from two bivariate Normal distributions with different means. ... 43
5-1 Scatter-plots for the first two principal components of the NIR spectra for labels i00F-0173-01, i00F-0183-01, i01S-0026-18 and i02S-0302-16 ... 45
5-2 Convergence of the posterior probabilities for testing k = 2, 3 and 4 clusters. ... 46
5-3 Histograms of the null posterior probabilities for n = 65 and k = 2 and 3 clusters based on 100000 simulations. The minimum cluster size is set equal to 15% of the observations in each case. ... 47
PAGE 9
9
PAGE 10
10
PAGE 11
11
PAGE 12
Figure 1-1 shows the scatterplots for the first 5 principal components of the NIR spectra for label 4S520701. Although visual inspection of the scatterplots suggests the presence of clusters in some of them, it is difficult to draw a conclusion about the existence of clusters in the data. Therefore, an objective measurement of the degree of evidence for the presence of clusters is necessary. Such a measurement can be implemented as a decision
PAGE 13
Figure 1-1. Scatterplots of the first five principal components of the NIR data for label 4S520701.

criterion to avoid the subjectivity of visual inspection, which, in addition, is not even possible when the number of principal components exceeds three. Hence, a statistic expressing the true presence or absence of clusters would greatly facilitate the analysis of complex data sets and is required.
PAGE 14
14
PAGE 15
15
PAGE 16
2{1 )intermsoftheparameterasfollowsH0:=1vs.H1:>1:Noticethatthealternativehypothesis,H1,hasacomplexstructure.Forthisreason,wewillconsiderrstthemoresimplehypothesis 16
PAGE 17
(2-2), that is,

$P(H_0 \mid Y) = \frac{1}{1 + BF_{10}}$.    (2-4)

This quantity can be used as a model comparison criterion. For our hypothesis setting, small values of $P(H_0 \mid Y)$ will provide evidence against $H_0$. Since we will use the posterior probability as a measurement of the strength of the evidence against the null hypothesis, we will refer to this quantity as the Bayesian P-value.
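As a small numerical illustration of (2-4), the following sketch converts a Bayes factor into the posterior probability of the null and back. It assumes, as the form $1 + BF_{10}$ suggests, equal prior probabilities for the two hypotheses; the function names are hypothetical and are not part of the thesis code.

```python
def posterior_null(bf10: float) -> float:
    """Posterior probability of H0 given BF10, assuming equal prior odds."""
    if bf10 < 0:
        raise ValueError("a Bayes factor cannot be negative")
    return 1.0 / (1.0 + bf10)


def bayes_factor_from_posterior(p0: float) -> float:
    """Invert the relation above: BF10 = (1 - P(H0|Y)) / P(H0|Y)."""
    if not 0.0 < p0 <= 1.0:
        raise ValueError("the posterior probability must lie in (0, 1]")
    return (1.0 - p0) / p0


if __name__ == "__main__":
    # A Bayes factor of 1 leaves the two hypotheses equally likely.
    print(posterior_null(1.0))
    # Large Bayes factors push the Bayesian P-value toward zero.
    print(posterior_null(99.0))
```

Under this assumption a Bayesian P-value below 0.5 corresponds exactly to $BF_{10} > 1$.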
PAGE 18
2{3 ),involvesconsideringallthepossiblepartitions!2Sn;kthatgeneratekclusters.Then,ifwedene!1tobetheonlyexistingclusterwhen=1,usingtheLawofTotalProbabilitywecanrewritetheBayesfactorintermsofthepartitions!as (a)ba1 (2rj)a+1e1=b2rj;
PAGE 19
2pnj=21 nj2+1p=2 2njX`=1(Y(j)`Y(j))01((Y(j)`Y(j)))+nj 2{6 )simpliestopXr=11 22rjnjs2rj+nj 19
PAGE 20
(2-5), for every partition $\omega \in S_{n,k}$. We obtain
PAGE 21
(2-5) and to determine the posterior probabilities $P(H_0 \mid Y)$ in closed form. However, notice that the sum we need to compute is over the set of all possible partitions, whose number of elements is given by the Stirling number of the second kind. This introduces two new difficulties to the problem:

1. The number of summands involved in the calculation is often too large, even if the number of observations and clusters is relatively small. For instance, if n = 48 and k = 2, we have S(48, 2) = 140,737,488,355,327.
2. In order to compute the sum, we need to list all the possible partitions, which is extremely difficult.

To overcome these difficulties, we will consider an indirect approach and estimate the value of the Bayes factor using Monte Carlo techniques. Let $\pi$ and $g$ be distributions on the partition space $S_{n,k}$. Suppose that $\pi$ is the prior of interest and we can sample $\omega^{(1)}, \ldots, \omega^{(M)}$ from $g$. If $M$ is large enough, we can estimate
PAGE 22
$\sum_{i=1}^{M} \pi(\omega^{(i)})$

where the expression in (2-10), while possibly biased, is proven to reduce the Mean Squared Error (see Casella and Robert, 1998, and Van Dijk and Kloeck, 1984). Notice that if we consider $g$ as the prior of interest in the first place, then importance sampling is not needed and we just compute the Monte Carlo sum.

To generate a random partition of n objects into k clusters we use the following strategy, suggested by Jim Pitman (personal communication). We use a vector of length n with n - k 0's and k 1's, one of them in the first position. Then, we randomly generate a permutation of the remaining n - 1 elements to distribute (uniformly) the k - 1 1's in the last n - 1 places. Each 1 represents the start of a cluster. For example, if n = 5 and k = 3, the vector 11001 corresponds to the partition of five objects into clusters of size 1, 3, 1. Once the partition has been generated, we randomly permute the Y vector, and place the $Y_i$'s in the given partition. Although not immediately obvious (see the appendix for details), the probability of this partition $\omega$ is given by
PAGE 23
1. Generate a random string of 1's and 0's as above.
2. Generate a random permutation of $Y_1, \ldots, Y_n$.
3. Cluster the permuted $Y_1, \ldots, Y_n$ according to the random string.
4. Calculate the ratio of marginals from (2-8).
5. Approximate the Bayes factor using (2-10).

1. Draw candidate $\omega'$ from $g$.
2. At iteration t:
   (a) With probability $a$, draw candidate $\omega'$ from the random walk starting from $\omega^{(t)}$, and with probability $1 - a$ draw candidate $\omega'$ independently from $g$.
   (b) Compute the Metropolis-Hastings ratio
       $MH = \frac{g(\omega') \left[ \frac{a}{n(k-1)} + (1-a)\, g(\omega^{(t)}) \right]}{g(\omega^{(t)}) \left[ \frac{a}{n(k-1)} + (1-a)\, g(\omega') \right]}$
   (c) With probability $\min(1, MH)$ set $\omega^{(t+1)} = \omega'$, otherwise set $\omega^{(t+1)} = \omega^{(t)}$.

This is a Metropolis-Hastings algorithm with stationary distribution $g$.
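The first three steps of the sampling scheme (random string, random permutation, clustering) can be sketched as follows. This is a minimal illustration in Python rather than the code used for the thesis simulations; the function names and the use of the standard `random` module are assumptions made for the example.

```python
import random


def random_start_string(n, k, rng=random):
    """Binary string of length n with k ones, the first entry always 1.

    The remaining k - 1 ones are placed uniformly in the last n - 1 places.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ones = set(rng.sample(range(1, n), k - 1))
    return [1] + [1 if i in ones else 0 for i in range(1, n)]


def cluster_sizes(bits):
    """Each 1 starts a new cluster; each 0 extends the current one."""
    if not bits or bits[0] != 1:
        raise ValueError("the string must start with 1")
    sizes = []
    for b in bits:
        if b == 1:
            sizes.append(1)
        else:
            sizes[-1] += 1
    return sizes


def random_partition(y, k, rng=random):
    """Permute y at random and cut it into k clusters using a random string."""
    bits = random_start_string(len(y), k, rng)
    shuffled = list(y)
    rng.shuffle(shuffled)
    clusters, start = [], 0
    for size in cluster_sizes(bits):
        clusters.append(shuffled[start:start + size])
        start += size
    return clusters


if __name__ == "__main__":
    print(cluster_sizes([1, 1, 0, 0, 1]))  # the n = 5, k = 3 example from the text
    print(random_partition(["y1", "y2", "y3", "y4", "y5"], 3))
```

Because every cluster starts with a 1, no cluster can be empty, which matches the requirement that a partition into k clusters has exactly k non-empty cells.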
PAGE 24
(2-2), using posterior probabilities as a measurement of the strength of the evidence against the null. However, we notice that the Bayesian P-value is not a P-value in the usual sense. Our Bayesian P-value measures (based on the data) how likely the null hypothesis is to be true as opposed to the alternative being tested and, in this sense, it is a valid measurement of evidence against the null. But, in contrast to the standard P-value, there is no error calibration for our procedure. For instance, we lack a reference distribution, such as the null distribution, which determines the variability of the statistic when the null hypothesis is true. Hence, one of the main drawbacks of our procedure is that we cannot easily set up cut-off points to determine when we have strong evidence against the null hypothesis. All we know is that posterior probabilities below 0.5 suggest the presence of clusters, and the lower the better. But how low should the Bayesian P-value be in order for the experimenter to make a good decision?

Another difficulty is how to compare different outcomes from an experiment. Since the variability of the posterior probabilities under the null hypothesis is unknown, it is hard to tell if the difference between two different outcomes is due to the actual sensitivity of the test in detecting clusters or if it is purely due to the random variation associated with the experiment. These problems are not new in Bayesian analysis and some solutions have been presented in the literature. For instance, Jeffreys (1961) developed a scale to judge the evidence in favor of or against $H_0$ brought by the data. The scale goes as follows:
PAGE 25
(2-5) in terms of the data, Y, as
PAGE 26
26
PAGE 27
Observethat,foranygivenpartition!2Sn;k,wecanwrite (2)(k1)aPk+1j=1uj+2=b2n 1 .Hence,ifthenullhypothesisholdsandifX=(x1;:::;xn1)isavectorofindependentandidenticallydistributed21randomvariables,wehaveujd=nj1Xi=nj1xi(j=1;:::;k)andk+1Xj=1ujd=n1Xi=1xi;wherewetaken0=1.Thisresultleadstothefollowingproposition. (2)(k1)aPn1i=1xi+2=b2n 27
PAGE 28
2 (2)(k1)aPn1i=1xi+2=b2n Proof. 2 =X2Pn;kX()(!)T(Xj);where()=f!:!hasclustersofsizedeterminedbyg.SinceT(Xj)dependsonthepartitions!onlythroughtheclusterssize,weobtainBF10(Y)d=X2Pn;kT(Xj)X()(!)=X2Pn;k()T(Xj);where()=P()(!). Theimportanceofthepreviousresultsliesmainlyontwokeyconsequences.First,theyprovidewithaknownprobabilisticstructureforeachoneofthecomponentspresentintheBayesfactor.ThisstructurewillbefundamentalinordertoobtainthenulldistributionoftheBayesfactorandconsequentlyinobtainingthedesiredposteriorprobabilities. 28
PAGE 29
1 and 2 remain valid componentwise. On the other hand, the diagonal structure of the variance-covariance matrices $\Sigma_i$ under consideration induces independence between the coordinates of $Y_i$ (i = 1, ..., n) and consequently between the factors of the product in (3-2). Hence, no correlation is induced by our calculations and the generalization to higher dimensions proceeds in the obvious manner.

Unfortunately, even with these results the determination of the exact null distribution of the Bayes factor is difficult. In the following sections we will briefly discuss some of the problems that prevent us from obtaining the null distribution in closed form and some of the alternatives that can be considered to estimate it.
PAGE 30
3{4 )hasaknowndistributionandthesimilaritieswithT(Yj!)(forp=1)suggestthatthenulldistributionoftheBayesfactor(oratleastitsmoments)canbedeterminedanalytically.However,thederivationofthenulldistributionofLisbasedonthesametechniquesusedtodeterminetheasymptoticdistributionofthelikelihoodratiotest(Jorgensen,1993)whichwecannotbeimplementedinderivingthenulldistributionofT,duetosomeimportantdierences,namely: 1. Theshiftingterm2=binThasnoequivalentinthestatisticL. 2. TheexponentsinLsatisfyPkj=1j=,whileinT,kXj=1nj ThedenominatorofLcorrespondstothepooledvariance,whileinT,s2=1 30
PAGE 31
1 permitsustoobtainanapproximationfor 1+BF10(X)(3{5)whereX=(x1;:::;xn1)isavectorofiid21randomvariables.Foranyxedvalueof2,weproceedasfollows(Eastonetal.,1986).Firstwecompute~R(T)=1T+(k+1)2T2 3{5 ).Then,weestimatethetailprobabilityofP(X)byP(P(X)>a)Z1Tl"(k+1)~R00(t) 2#1=2expf(k+1)(~R(t)t~R0(T0))gdt;where~R(Tl)=a.Thus,inordertoapproximatethetailprobabilitiesof( 3{5 )weonlyneedtocompute1;2;3and4,whichcanbeobtainedfromtheequations1=12=2213=3312+2314=4413322+12221641;whereiisthei-thmomentofP(X). 31
PAGE 32
3.1 allow us to simulate observations from the null distribution of the posterior probabilities. Then, we can use these generated values either to construct histograms or to do density estimation, depending on the interest. Here we will focus on the generation of histograms, which is enough to capture the main features of the null distribution. The histograms will also permit us (via asymptotic results) to obtain cut-off points for the posterior probabilities at any frequentist α-level we want.

Again, the main difficulty lies in the calculation of the sum in the Bayes factor. As we have already pointed out, this requires us to list all the partitions in $P_{n,k}$. Nevertheless, we can proceed in a similar way as in Section 2.5. Since when the null hypothesis is true we only care about the cluster sizes, we follow the same strategy to generate the partitions according to $g$, but we do not take into account the permutations of the elements in the given partition. Thus, we need to correct the probabilities given by $g$, so that they do not take into account the number of redundant partitions that lead to the same cluster sizes.
PAGE 33
In order to construct the histogram, we need to simulate $BF_{10}(X)$ several times. To obtain N simulations, the final algorithm is:

For $\ell = 1, \ldots, N$ repeat:
  For $i = 1, \ldots, M$:
  1. Generate a random string of 1's and 0's according to g0.
  2. Generate $x_1, \ldots, x_{n-1}$ iid $\chi^2_1$ random variables.
  3. Compute $\sum_{i=1}^{n-1} x_i$ and $\sum_{i=n_{j-1}}^{n_j - 1} x_i$ according to the random string.
  4. Compute $T(X \mid \cdot)$ for the resulting cluster sizes.
  5. Approximate the Bayes factor using (3-6).

Finally, to estimate the cut-off points for any given α-level we use the following result (Sen and Singer 1993)
PAGE 34
(3-1) depends on the values of $\sigma^2$, the common variance of the observations under the null. Thus, in order to generate a sample from the null distribution, we need to set a value for $\sigma^2$. We will overcome this difficulty by estimating $\sigma^2$ by the sample variance. More precisely, we will center and re-scale our observations in the standard way.
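Centering and re-scaling "in the standard way" is read here as subtracting the sample mean and dividing by the sample standard deviation. The sketch below shows this for a single coordinate; applying it coordinate by coordinate to multivariate data is an assumption that fits the diagonal covariance structure used in the model, and the function name is hypothetical.

```python
import statistics


def standardize(values):
    """Center at the sample mean and scale by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # uses the n - 1 denominator
    if sd == 0:
        raise ValueError("cannot re-scale a constant sample")
    return [(v - mean) / sd for v in values]


if __name__ == "__main__":
    z = standardize([2.0, 4.0, 6.0, 8.0])
    # After standardizing, the sample mean is 0 and the sample variance is 1.
    print(statistics.fmean(z), statistics.variance(z))
```

After this transformation the sample variance equals 1, so the null distribution can be simulated with $\sigma^2 = 1$.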
PAGE 35
In Figure 4-1 we see the results for n = 50 and k = 3 based on 2500 simulations. For this case, the number of elements in the partition space is p(50, 3) = 208. All of them were considered to obtain the exact Bayes factor (first row), and samples of size 52 (25% of the total number of elements in the partition space) were drawn from g0 to compute the approximations (second row). We also considered the uniform prior (first column) and the g0 prior (second column) in our calculations. We observe that the histograms are virtually identical and that the differences between the empirical 0.05-percentiles are less than 0.012 in all cases. The results are similar in all the cases we studied, indicating that our method is fairly accurate in approximating the null distribution of the posterior probabilities and suggesting that the choice of prior for the partition space has little effect on the calculations.
PAGE 36
Figure 4-1. Histograms of the null posterior probabilities for n = 50 and k = 3 clusters based on 2500 simulations.

Table 4-1 presents the results corresponding to 6 replications for the case n = 50, k = 3. The empirical 0.05 percentiles are obtained based on 2500 simulations, sampling 52 out of 208 partitions per iteration. The obtained standard error is less than 0.003, which is fairly small considering the number of simulations per repetition. Similar results are obtained when changing the values of n and k, indicating that convergence of the empirical α-percentile is reached moderately fast.
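The mean and standard error summarizing the six replications can be recomputed directly from the percentiles listed in Table 4-1. The sketch below assumes the standard error is the sample standard deviation divided by the square root of the number of replications, which reproduces the tabulated values; the function name is hypothetical.

```python
import math
import statistics

# Empirical 0.05 percentiles from the six replications reported in Table 4-1.
REPLICATED_PERCENTILES = [0.15416, 0.16448, 0.17061, 0.17005, 0.17298, 0.16455]


def mean_and_se(values):
    """Mean and standard error of the mean across replications."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


if __name__ == "__main__":
    mean, se = mean_and_se(REPLICATED_PERCENTILES)
    print(f"mean = {mean:.5f}, SE = {se:.5f}")
```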
PAGE 37
Table 4-1. Posterior probabilities for 2500 simulations of 50 observations and 3 clusters. The number of considered partitions per iteration is 52.

Percentile   Replications
0.05         0.15416   0.16448   0.17061   0.17005   0.17298   0.16455
Mean         0.16614
SE           0.00277

of the observations falling in the vicinity of 1. However, looking at our simulations, we observe that a considerable number of observations fall below 0.5. In Table 4-2 we show the results corresponding to the empirical 0.05 percentile for n = 50, 60, 70 and k = 2, 3, 4. The values are obtained as the average of 6 repetitions of 5000 simulations each. In parentheses we report the respective standard errors. We observe not only that the cut-off points for the 0.05-level test are fairly small, but also the following pattern. For every n, the value of the cut-off points decreases as the number of clusters increases; that is, about 5% of the generated posterior probabilities are located closer to zero as k increases. The general behavior is that for fixed n, as k increases the histograms, while still skewed to the left, tend to spread more mass to smaller values, resulting in fatter tails instead of the expected thin tails.

Table 4-2. Cut-off points for minimum cluster size 1 and α-level 0.05, based on 5000 simulations. The numbers in parentheses correspond to the standard errors after 6 repetitions.

            Observations
Clusters    50                  60                  70
2           0.15261 (0.00198)   0.18647 (0.00161)   0.20709 (0.00230)
3           0.09782 (0.00153)   0.13556 (0.00311)   0.16973 (0.00141)
4           0.05454 (0.00034)   0.09268 (0.00095)   0.13836 (0.00118)

The most likely explanation for this phenomenon is that the number of elements that constitutes a cluster is not defined. Therefore, our procedure tends to consider as their
PAGE 38
Table 4-3 shows the results of simulations obtained under the same conditions we described above, but setting the MCS equal to 15% of the observations. We can see how the introduction of the MCS as a new parameter reverses the pattern observed in Table 4-2 and also increases the value of the empirical 0.05-percentiles.

Table 4-3. Cut-off points for minimum cluster size 15% of the observations and α-level 0.05, based on 5000 simulations.

            Observations
Clusters    50                  60                  70
2           0.25523 (0.00381)   0.31751 (0.00322)   0.36356 (0.00313)
3           0.29041 (0.00208)   0.39503 (0.00294)   0.51345 (0.00511)
4           0.29198 (0.00226)   0.51517 (0.00210)   0.68352 (0.00243)

In general, as the MCS increases (and therefore the probability of finding clusters by chance decreases) the histograms become more skewed to the left and tend to concentrate more mass near one rather than spreading it in the tails, as we can observe in Figure 4-2. In other words, the extra restriction provides more intuitive results for the simulated posterior probabilities.
PAGE 39
Figure 4-2. Histograms of the null posterior probabilities for n = 50 and k = 2 and 3 clusters based on 5000 simulations. The minimum cluster size is set equal to 1 observation and 15% of the observations in each case.
PAGE 40
In Table 4-4 we show the results for 50 observations and 3 clusters where 2500 simulations were considered. We observe that the discrepancy in the results is less than 0.005, which is due not only to the number of partitions but also to the number of simulations considered. Similar results are obtained for the other cases, suggesting that sampling about 25% of the total number of elements in the partition space is sufficient to obtain reasonable results.
PAGE 41
Table 4-4. Posterior probabilities for 1000 simulations of 50 observations and 3 clusters.

Partitions/Iteration   0.05 percentile
all                    0.14341
52                     0.14852
104                    0.14139
208                    0.14368
500                    0.14229

In Figure 4-3 we show the scatter-plot corresponding to 50 observations from a bivariate Normal distribution with mean $\mu = (1, 1)'$ and variance-covariance matrix $\Sigma = \mathrm{diag}(1/4, 1/4)$. The posterior probabilities were obtained after 500000 iterations considering an MCS of 20% of the observations. The results for k = 2, 3 and 4 clusters are listed in Table 4-5.

Table 4-5. Posterior probabilities after 500000 iterations for the observations in Fig. 4-3. The MCS is 20% of the observations.

k    P(H0)
2    0.59895
3    0.72213
4    0.91140
PAGE 42
Figure 4-3. Scatter-plot of 50 observations generated from a bivariate Normal distribution with mean $\mu = (1, 1)'$ and variance-covariance matrix $\Sigma = \mathrm{diag}(1/4, 1/4)$.

We observe that the posterior probabilities are fairly high in all the cases, showing very weak evidence for the presence of clusters. Notice that the smallest value is obtained for testing two clusters, but the posterior probability is still too high to be considered significant according to our calibrations. Other simulations agree with these results.

The other case we need to consider is when we have at least two clusters. We generate data sets composed of observations coming from multivariate Normal distributions with different means, depending on the number of clusters we want to test. This way, we purposely created clusters in the data sets to be tested. In Figure 4-4 we show the scatter-plot corresponding to 50 observations. Of them, 35 were generated from a bivariate Normal distribution with mean $\mu = (1, 2)'$ and variance-covariance matrix $\Sigma = \mathrm{diag}(1/2, 1/2)$, and the remaining 15 were generated from a bivariate Normal distribution with mean $\mu = (3, 1)'$ and the same variance-covariance matrix. By construction, we have two clusters in the data, which can be easily noticed. The posterior probabilities obtained after 500000 iterations are listed in Table 4-6. The MCS is 20% of the total number of observations.
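A data set of the kind just described can be generated with a few lines of code. The sketch below is an illustration only, not the code used for the thesis: it relies on the diagonal covariance matrix to draw each coordinate independently, and the seed and function name are arbitrary choices made for the example.

```python
import random


def bivariate_normal_sample(n, mean, variances, rng):
    """Draw n points with independent Normal coordinates (diagonal covariance)."""
    return [
        tuple(rng.gauss(m, v ** 0.5) for m, v in zip(mean, variances))
        for _ in range(n)
    ]


rng = random.Random(2008)  # arbitrary seed, chosen only for reproducibility

# Two-cluster data: 35 points around (1, 2) and 15 points around (3, 1),
# both with variance-covariance matrix diag(1/2, 1/2).
cluster_a = bivariate_normal_sample(35, (1.0, 2.0), (0.5, 0.5), rng)
cluster_b = bivariate_normal_sample(15, (3.0, 1.0), (0.5, 0.5), rng)
two_cluster_data = cluster_a + cluster_b

# Null case: 50 points from a single Normal with mean (1, 1) and diag(1/4, 1/4).
null_data = bivariate_normal_sample(50, (1.0, 1.0), (0.25, 0.25), rng)

if __name__ == "__main__":
    print(len(two_cluster_data), len(null_data))
```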
PAGE 43
Figure 4-4. Scatter-plot of 50 observations generated from two bivariate Normal distributions with different means.

Table 4-6. Posterior probabilities after 500000 iterations for the observations in Fig. 4-4. The MCS is 20% of the observations.

k    P(H0)
2    2.04E-05
3    3.14E-05
4    0.00669

The posterior probabilities obtained for the data are very small, indicating strong evidence against the null hypothesis in all the tests. Observe that although the data set was constructed with two clusters, we still have strong evidence for the existence of three or four clusters. This happens because we have two very distinguishable groups, and each group is fairly easy to separate into two further groups. Nevertheless, the strongest evidence points to the existence of two clusters, which is the true number. Similar results are observed in other simulations, indicating that our procedure for detecting clusters is fairly accurate.
PAGE 44
Figure 5-1 shows the scatter-plots for the first 2 principal components for 4 ears of maize, obtained using the methodology described above. Finally, we apply our procedure and compute the posterior probabilities for the treated data. We test for k = 2, 3 and 4 clusters, which are the interpretable numbers of clusters in the context of the experiment. Table 5-1 shows the posterior probabilities for the data in Figure 5-1. The values were obtained after 500000 iterations. Also, the minimum cluster size is set to 20% of the total number of observations. This quantity defines a meaningful cluster size for the experimenter and prevents the test from finding clusters due to extreme observations.

Table 5-1. Test of the hypotheses H0: no clusters vs. H1: k clusters, with minimum cluster size 20% of the total number of observations.

Label          n    k = 2      k = 3      k = 4
i00F-0173-01   96   0.009098   0.840622   0.976126
i00F-0183-01   96   0.877506   0.980882   0.999744
i01S-0026-18   96   0.000266   0.392694   0.044762
i02S-0302-16   96   0.499539   0.972905   0.997341
PAGE 45
Figure 5-1. Scatter-plots for the first two principal components of the NIR spectra for labels i00F-0173-01, i00F-0183-01, i01S-0026-18 and i02S-0302-16.

In Figure 5-2 we show the convergence of the procedure for ear i00F-0173-01. When comparing these results with the values in Table 4-3 we see that for the data set i00F-0173-01 we have strong evidence for the existence of 2 clusters but not for 3 or 4. This conclusion is clearly supported by the respective scatter-plot. For ear i00F-0183-01 we do not find evidence of the existence of clusters and we conclude that the null hypothesis is true. The conclusion is also fairly obvious by visual examination of the scatter-plot. The same conclusion holds for ear i02S-0302-16, although it is not immediately obvious by inspection of the scatter-plot. In this case, the smallest Bayesian P-value is 0.499539, which is above the corresponding cut-off point at level 0.05.
PAGE 46
Figure 5-2. Convergence of the posterior probabilities for testing k = 2, 3 and 4 clusters.

In the data from ear i01S-0026-18, we find strong evidence for the presence of clusters in all the tests. The greatest value among the posterior probabilities is 0.392694, corresponding to the test for 3 clusters. This value is still below the cut-off point at level 0.05 for the corresponding test and therefore provides strong evidence for the existence of 3 clusters in the data. On the other hand, we notice that the minimum posterior probability is reached in the test for 2 clusters. However, these numbers should not be compared directly because of the differences among their respective null distributions of the posterior probabilities. Visual inspection of the scatter-plot also suggests the existence of clusters, although it is not clear how many of them there are in the data.
PAGE 47
In Figure 5-3 we can see the histograms for n = 65 and k = 2 and 3, based on 100000 simulations constructed for this purpose.

Figure 5-3. Histograms of the null posterior probabilities for n = 65 and k = 2 and 3 clusters based on 100000 simulations. The minimum cluster size is set equal to 15% of the observations in each case.
PAGE 48
48
PAGE 49
49
PAGE 50
(2-11) is not difficult, but some care must be taken in counting partitions, especially with respect to ordered versus unordered partitions. To be very clear, we start with an example. Suppose that n = 8 and k = 4, which is a small set of partitions, but big enough to be interesting. We know that the number of partitions of 8 objects into k cells, with no empty cell, is the Stirling Number of the Second Kind, $S_{8,4} = 1701$. The strategy outlined in Section 2.5 will, for this case, generate $\binom{7}{3} = 35$ partitions. The only possible cluster sizes for n = 8 and k = 4 are

Partition    Number of 01 Strings
PAGE 51
731!3! whichisS8;4,theStirlingNumberoftheSecondKind(andgivingusanalternativerepresentationofthisnumber). 51
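The counts in this example can be checked numerically. The sketch below computes the Stirling number of the second kind with the standard recurrence $S(n,k) = k\,S(n-1,k) + S(n-1,k-1)$ and compares it with the standard identity that sums multinomial coefficients over ordered cluster sizes and divides by $k!$. This identity is offered only as an illustration and may differ in form from the representation derived in this appendix; all function names are hypothetical.

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind via the standard recurrence."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def compositions(n, k):
    """Yield ordered cluster sizes (n1, ..., nk), each at least 1, summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(1, n - k + 2):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def multinomial(sizes):
    """Number of ways to assign labelled objects to ordered cells of given sizes."""
    total = factorial(sum(sizes))
    for s in sizes:
        total //= factorial(s)
    return total


if __name__ == "__main__":
    strings = list(compositions(8, 4))
    print(len(strings), comb(7, 3))  # number of 0-1 strings for n = 8, k = 4
    print(sum(multinomial(c) for c in strings) // factorial(4), stirling2(8, 4))
    print(stirling2(48, 2))  # the count quoted in Section 2.5
```

Each 0-1 string corresponds to one ordered vector of cluster sizes, which is why the number of strings equals the number of compositions of 8 into 4 parts.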
PAGE 52
A{2 )and( A{4 )resultsintheprobabilityofapartition!beinggivenby, A{4 ).Cancellingtermsandapplying( A{3 )showsthatP!2Pkg(!)=1.Asanalnote,itfollowsfrom( A{1 )and( A{4 )thatanotherrepresentationoftheStirlingNumberoftheSecondKindis.Xn1+:::+nk=nn1n2:::nknn1n2:::nk 52
PAGE 53
2.5. For minimum cluster size m, start with k blocks of the form [1 0 ... 0], which consist of one 1 and m - 1 zeros. Place one block at the beginning of the string, then randomly allocate the remaining k - 1 blocks and n - mk zeros. As before, each 1 signifies the beginning of a cluster, but now each cluster will have at least m objects. An argument similar to that leading to (A-5) will show that under the present generation scheme, the probability of a partition with at least m objects in each cluster is
PAGE 54
(A-7) can be easily computed, for any (n, k), using R or another programming language. Table A-1 shows the number of partitions p(n, k) for n = 1, ..., 6.

Table A-1. Number of partitions p(n, k) for n = 1, ..., 6.

        k
n       1   2   3   4   5   6
1       1
2       1   1
3       1   1   1
4       1   2   1   1
5       1   2   2   1   1
6       1   3   3   2   1   1

The topic of integer partitions has been extensively researched from combinatorial, number theoretical and analytic aspects. For further references and results see Andrews (1976).
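As an illustration of how such counts can be computed, the sketch below uses the standard recurrence $p(n,k) = p(n-1,k-1) + p(n-k,k)$ for the number of partitions of the integer n into exactly k positive parts. This is a well-known recurrence and is not necessarily the expression given in (A-7); the function name is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def num_partitions(n, k):
    """Number of partitions of the integer n into exactly k positive parts."""
    if n == k:
        return 1  # all parts equal to 1 (also covers n = k = 0)
    if k <= 0 or k > n:
        return 0
    # Either the smallest part is 1 (remove it), or every part is at least 2
    # (subtract 1 from each of the k parts).
    return num_partitions(n - 1, k - 1) + num_partitions(n - k, k)


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, [num_partitions(n, k) for k in range(1, n + 1)])
    print(num_partitions(50, 3))  # size of the partition space used in Chapter 4
```

The printed rows reproduce Table A-1, and the last line gives the 208 cluster-size configurations used for n = 50 and k = 3.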
PAGE 55
2.3 ,themarginaldistributionofthedataYgivenapartition!ism(Yj!)=ZZkYj=1njY`=1N(Y(j)`jj;j)N(j(j)0;2j)(j)djdj:First,observethat (B{1) 2"njX`=1(Y(j)`j)01j(Y(j)`j)+(j(j)0)0[2j]1(j(j)0)#):Now,njX`=1(Y(j)`j)01j(Y(j)`j)+(j(j)0)0[2j]1(j(j)0)=njX`=1(Y(j)`Y(j))01j(Y(j)`Y(j))+nj(Y(j)j)01j(Y(j)j)+(j(j)0)01
PAGE 56
B{1 )isproportionaltoexp(1 2"njX`=1(Y(j)`Y(j))01j(Y(j)`Y(j))+nj2+1 2pnj=21 nj2+1p=2 2njX`=1(Y(j)`Y(j))01((Y(j)`Y(j)))+nj (a)ba1 (2rj)a+1e1=b2rjtheexpressioninbracesin( B{2 )simpliestopXr=11 22rjnjs2rj+nj 2np=21 (a)bapkkYj=11 (2rj)nj=2+a+1exp1 22rjnjs2rj+nj
PAGE 58
1R1f(xj)q1()d:Thisquantitycanbeusedasameasurementoftheevidenceagainstthenullcontainedinthedata.WecanalsodenetheBayesfactorofH0relativetoH1tobetheratiooftheposteriorprobabilitiesofthenullandalternativehypothesesovertheratiooftheprior 58
PAGE 59
59
PAGE 60
Andrews, G. (1976), "The Theory of Partitions," Addison-Wesley, Reading MA.
Auffermann, W. F., Ngan, S. C. and Hu, X. (2002), "Cluster Significance Testing Using the Bootstrap," NeuroImage, 17, 583-591.
Baye, T. M., Pearson, T. C. and Settles, A. M. (2006), "Development of a Calibration to Predict Maize Seed Composition Using Single Kernel Near-Infrared Spectroscopy," Journal of Cereal Science, 43, 236-243.
Bolshakova, N., Azuaje, F. and Cunningham, P. (2005), "An Integrated Tool for Microarray Data Clustering and Cluster Validity Assessment," Bioinformatics, 21, 451-455.
Bona, M. (2004), "Combinatorics of Permutations," Chapman & Hall/CRC.
Booth, J. G., Casella, G. and Hobert, J. P. (2008), "Clustering Using Objective Functions and Stochastic Search," Journal of the Royal Statistical Society, Series B, 70, 119-140.
Casella, G. and Robert, C. (1998), "Post-Processing Accept-Reject Samples: Recycling and Rescaling," Journal of Computational and Graphical Statistics, 7, 139-157.
Dudley, J. W., Lambert, R. J. and Alexander, D. E. (1974), "Seventy Generations of Selection for Oil and Protein Concentration in the Maize Kernel," Seventy Generations of Selection for Oil and Protein in Maize, J. W. Dudley, Ed. (Madison, WI: Crops Society of America), 181-212.
Easton, G. S. and Ronchetti, E. (1986), "General Saddlepoint Approximations with Applications to L Statistics," Journal of the American Statistical Association, 81, 420-423.
Ghosh, J. K., Delampady, M. and Samanta, T. (2006), "An Introduction to Bayesian Analysis: Theory and Methods," Springer.
Gould, H. W. (1960), "Stirling Number Representation Problems," Proceedings of the American Mathematical Society, 11, 447-451.
Glaser, R. E. (1980), "A Characterization of Bartlett's Statistic Involving Incomplete Beta Functions," Biometrika, 67, 53-58.
Hartigan, J. A. (1975), "Clustering Algorithms," New York: Wiley.
Janni, J., Weinstock, B. A., Hagen, L. and Wright, S. (2008), "Novel Near-Infrared Sampling Apparatus for Single Kernel Analysis of Oil Content in Maize," Applied Spectroscopy, 62, 423-426.
Jeffreys, H. (1961), "Theory of Probability," Third Edition, Oxford University Press, Oxford.
PAGE 61
Kass, R. E. and Raftery, A. E. (1995), "Bayes Factor and Model Uncertainty," Journal of the American Statistical Association, 90, 773-795.
Kerr, M. K. and Churchill, G. A. (2001), "Bootstrapping Cluster Analysis: Assessing the Reliability of Conclusions from Microarray Experiments," Proceedings of the National Academy of Sciences of the United States of America, 98, 8961-8965.
Lambert, R. J., Alexander, D. E. and Han, Z. J. (1998), "A High Oil Pollinator Enhancement of Kernel Oil and Effects on Grain Yields of Maize Hybrids," Agronomy Journal, 90, 211-215.
Lavine, M. and Schervish, M. (1999), "Bayes Factors: What They Are and What They Are Not," American Statistician, 53, 119-122.
McCullagh, P. and Yang, J. (2006), "How Many Clusters?," Technical Report, Department of Statistics, University of Chicago.
Orman, B. A. and Schumann, R. A. (1991), "Comparison of Near-Infrared Spectroscopy Calibration Methods for the Prediction of Protein, Oil, and Starch in Maize Grain," Journal of Agricultural and Food Chemistry, 39, 883-886.
Pan, D. (2000), "Starch Synthesis in Maize," Carbohydrate Reserves in Plants: Synthesis and Regulation, A. K. Gupta and N. Kaur, Eds. (Amsterdam: Elsevier), 125-146.
Pitman, J. (1996), "Some Developments of the Blackwell-MacQueen Urn Scheme," Statistics, Probability and Game Theory, IMS Lecture Notes Monograph Series, 30, 245-267, Institute of Mathematical Statistics, Hayward, CA.
Robert, C. P. (2001), "The Bayesian Choice," Second Edition, Springer-Verlag, New York.
Sen, P. K. and Singer, J. M. (1993), "Large Sample Methods in Statistics: An Introduction with Applications," Chapman & Hall, New York-London.
Sugar, C. and James, G. (2003), "Finding the Number of Clusters in a Data Set: An Information Theoretic Approach," Journal of the American Statistical Association, 98, 750-763.
Tibshirani, R., Walther, G. and Hastie, T. (2001), "Estimating the Number of Clusters in a Data Set Via the Gap Statistic," Journal of the Royal Statistical Society, Series B, 63, 411-423.
Van Dijk, H. and Kloeck, T. (1984), "Experiments with Some Alternatives for Simple Importance Sampling in Monte Carlo Integration," Bayesian Statistics 4, J. Bernardo, M. DeGroot, D. Lindley and A. Smith, Eds. (North-Holland, Amsterdam).
PAGE 62
Williams, P. and Norris, K. (2001), "Near-Infrared Technology in the Agricultural and Food Industries, 2nd Edition," American Association of Cereal Chemists, Inc, Minn., USA.
PAGE 63
Claudio Fuentes was born in Santiago, Chile, in 1977. Upon graduation from high school, he enrolled as a student at the Pontificia Universidad Catolica de Chile, where he received the degree of Bachelor of Science in mathematics in 2001, and his Master in Statistics in December 2003. During his undergraduate and graduate education he was appointed as a teaching assistant in both the faculty of physics and the faculty of mathematics. In August 2005, Fuentes entered the Ph.D. program in the Department of Statistics at the University of Florida. During his education there, he was appointed as a teaching assistant and graduate teaching instructor. He worked as a research assistant for his advisor, Distinguished Professor Dr. George Casella. In August 2008 he earned the degree of Master of Science in statistics with a thesis on Cluster Analysis. He is currently working toward a Ph.D. in statistics under Dr. Casella, with interest in inference on selected populations.