Testing for the Existence of Clusters with Applications to NIR Spectroscopy Data

Permanent Link: http://ufdc.ufl.edu/UFE0022716/00001

Material Information

Title: Testing for the Existence of Clusters with Applications to NIR Spectroscopy Data
Physical Description: 1 online resource (63 p.)
Language: english
Creator: Fuentes, Claudio
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2008

Subjects

Subjects / Keywords: analysis, bayes, bayesian, cluster, factor, hypothesis, posterior, probability, statistical, testing
Statistics -- Dissertations, Academic -- UF
Genre: Statistics thesis, M.S.Stat.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: The detection and determination of clusters has long been of special interest to researchers in many fields. Many approaches to cluster analysis exist, but most of them determine the clusters from the distance between the observations. Although these methods have been shown to work well, they are usually too sensitive to the metric that defines the distance, and they lack statistical procedures that facilitate decision making. In this paper we develop a procedure that permits testing for clusters using Bayesian tools. Specifically, we study the hypothesis test H sub 0: kappa = 1 vs. H sub 1: kappa > 1, where kappa denotes the number of clusters in a certain population. This problem can be solved by looking at the Bayes factors as in a model selection problem, and making decisions according to the posterior probabilities P(H sub 0 | data). Since the procedure is entirely data dependent, we calibrate our results by estimating the frequentist null distribution of the posterior probabilities. Hence we can establish appropriate cutoff points to reject the null hypothesis at any desired alpha-level in the usual sense. While our setting allows (in theory) for explicit calculation of the Bayes factors and the posterior probabilities, our method is for the most part too computationally intensive. To overcome this difficulty we propose an estimation procedure based on MCMC techniques. Finally, we present simulation studies that validate our conclusions, and we apply our method to NIR spectroscopy data coming from a genetic study in maize.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Claudio Fuentes.
Thesis: Thesis (M.S.Stat.)--University of Florida, 2008.
Local: Adviser: Casella, George.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2008
System ID: UFE0022716:00001



Full Text

PAGE 1

1

PAGE 2

2

PAGE 3

3

PAGE 4

I would like to acknowledge many people who gave me support and advice while writing this thesis. First, I want to thank my advisor, Distinguished Professor Dr. George Casella, who gave me the opportunity to work as his research assistant and to get involved with the project I present here. His wealth of knowledge was an invaluable source of help. It would not have been possible without his constant feedback and guidance.

I also want to thank my graduate committee members, Professor Dr. Clyde Schoolfield and Professor Dr. Mark Settles, for their additional assistance in my research. They provided a foundation for me to develop the correct methodologies to address specific issues within cluster analysis I discuss here.

I would like to give a special acknowledgment to Professor Dr. Alvaro Cofre for his constant support and advice. His teaching helped me to discover the beauty of mathematics. His example served as an inspiration for how I instruct in my own classroom. His encouragement brought me where I am today.

I am indebted to Vikneswaran Gopal for all his help and useful hints in the implementation of the codes to run the simulations included in this work.

Finally and most importantly, I would like to thank Alejandra for her unconditional love, and my parents, Jorge and Edith, for their love, encouragement and support in all that I do.

PAGE 5

                                                                        page
ACKNOWLEDGMENTS ... 4
LIST OF TABLES ... 7
LIST OF FIGURES ... 8
ABSTRACT ... 9

CHAPTER
1 INTRODUCTION ... 10
  1.1 Clustering Problem in NIR Spectroscopy Data ... 10
  1.2 Cluster Analysis Methods and Our Methodology ... 13
2 THEORETICAL FRAMEWORK ... 15
  2.1 Notation Setup and Preliminaries ... 15
  2.2 Hypothesis Testing and Bayesian Analysis ... 16
  2.3 Model and Distributions ... 18
  2.4 Prior on the Partitions ... 20
  2.5 Estimation of the Bayes Factor ... 21
3 CALIBRATION PROBLEM ... 24
  3.1 Bayes Factor under the Null Distribution ... 26
  3.2 Difficulties in Determining the Exact Null Distribution of the Bayes Factor ... 29
  3.3 Saddlepoint Approximation ... 31
  3.4 MCMC and Simulation Alternative ... 32
4 SIMULATION STUDIES ... 35
  4.1 Goodness of the Approximation ... 35
  4.2 Error of Approximation ... 36
  4.3 Minimum Cluster Size ... 36
  4.4 Optimal Number of Partitions ... 39
  4.5 Sensitivity of the Procedure ... 41
5 ANALYSIS OF THE NIR DATA ... 44
6 CONCLUSIONS ... 48

PAGE 6

A ON THE PROBLEM OF PARTITIONS ... 50
  A.1 Generating a Random Partition ... 50
  A.2 Derivation in the General Case ... 51
  A.3 Partitions of an Integer ... 53
B DERIVATION OF THE MARGINAL DISTRIBUTION ... 55
C BAYES FACTORS AND HYPOTHESIS TESTING ... 58
REFERENCES ... 60
BIOGRAPHICAL SKETCH ... 63

PAGE 7

Table                                                                   page
4-1 Posterior probabilities for 2500 simulations of 50 observations and 3 clusters. The number of considered partitions per iteration is 52. ... 37
4-2 Cutoff points for minimum cluster size 1 and $\alpha$-level 0.05, based on 5000 simulations. The numbers in parentheses correspond to the standard errors after 6 repetitions. ... 37
4-3 Cutoff points for minimum cluster size 15% of the observations and $\alpha$-level 0.05, based on 5000 simulations. ... 38
4-4 Posterior probabilities for 1000 simulations of 50 observations and 3 clusters. ... 41
4-5 Posterior probabilities after 500000 iterations for the observations in Fig. 4-3. The MCS is 20% of the observations. ... 41
4-6 Posterior probabilities after 500000 iterations for the observations in Fig. 4-4. The MCS is 20% of the observations. ... 43
5-1 Test of the hypotheses $H_0$: no clusters vs. $H_1$: k clusters, with minimum cluster size 20% of the total number of observations. ... 44
A-1 Number of partitions p(n, k) for n = 1, ..., 6. ... 54

PAGE 8

Figure                                                                  page
1-1 Scatter plots of the first five principal components of the NIR data for label 4S520701. ... 13
4-1 Histograms of the null posterior probabilities for n = 50 and k = 3 clusters based on 2500 simulations. ... 36
4-2 Histograms of the null posterior probabilities for n = 50 and k = 2 and 3 clusters based on 5000 simulations. The minimum cluster size is set equal to 1 observation and 15% of the observations in each case. ... 39
4-3 Scatter-plot of 50 observations generated from a bivariate Normal distribution with mean $\mu = (1, 1)'$ and variance-covariance matrix $\Sigma = \mathrm{diag}(1/4, 1/4)$. ... 42
4-4 Scatter-plot of 50 observations generated from two bivariate Normal distributions with different means. ... 43
5-1 Scatter-plots for the first two principal components of the NIR spectra for labels i00F-0173-01, i00F-0183-01, i01S-0026-18 and i02S-0302-16. ... 45
5-2 Convergence of the posterior probabilities for testing k = 2, 3 and 4 clusters. ... 46
5-3 Histograms of the null posterior probabilities for n = 65 and k = 2 and 3 clusters based on 100000 simulations. The minimum cluster size is set equal to 15% of the observations in each case. ... 47

PAGE 9

9

PAGE 10

10

PAGE 11

11

PAGE 12

Figure 1-1 shows the scatter plots for the first 5 principal components of the NIR spectra for label 4S520701. Although visual inspection of the scatter plots suggests the presence of clusters in some of them, we notice that it is difficult to reach a conclusion about the existence of clusters in the data. Therefore, an objective measurement of the degree of evidence for the presence of clusters is necessary. Such a measurement can be implemented as a decision

PAGE 13

Figure 1-1. Scatter plots of the first five principal components of the NIR data for label 4S520701.

criterion to avoid the subjectivity of the visual inspection, which, in addition, is not even possible when the number of principal components exceeds three. Hence, a statistic expressing the true presence or absence of clusters would greatly facilitate the analysis of complex data sets and is required.

PAGE 14

14

PAGE 15

15

PAGE 16

(2-1) in terms of the parameter $\kappa$ as follows:

$$H_0: \kappa = 1 \quad \text{vs.} \quad H_1: \kappa > 1.$$

Notice that the alternative hypothesis, $H_1$, has a complex structure. For this reason, we will first consider the simpler hypothesis

PAGE 17

(2-2), that is,

$$P(H_0 \mid Y) = \frac{1}{1 + BF_{10}}. \qquad (2\text{-}4)$$

This quantity can be used as a model comparison criterion. For our hypothesis setting, small values of $P(H_0 \mid Y)$ will provide evidence against $H_0$. Since we will use the posterior probability as a measurement of the strength of the evidence against the null hypothesis, we will refer to this quantity as the Bayesian P-value.
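The relation between the Bayes factor and the Bayesian P-value can be sketched in a few lines. This is a minimal illustration, assuming that (2-4) has the form $P(H_0 \mid Y) = 1/(1 + BF_{10})$, which corresponds to equal prior probabilities for the two hypotheses; the function name `posterior_null_probability` is hypothetical and not taken from the thesis.

```python
def posterior_null_probability(bf10):
    """Posterior probability of H0 given the Bayes factor BF10.

    Assumes equal prior probabilities for H0 and H1, so that
    P(H0 | Y) = 1 / (1 + BF10).
    """
    if bf10 < 0:
        raise ValueError("A Bayes factor cannot be negative.")
    return 1.0 / (1.0 + bf10)


# A Bayes factor of 1 leaves the two hypotheses equally likely;
# larger values of BF10 push the posterior probability of H0 toward 0.
for bf10 in (0.0, 1.0, 3.0, 99.0):
    print(bf10, posterior_null_probability(bf10))
```

Large values of the Bayes factor therefore translate into small Bayesian P-values, which is the direction of evidence against the null described above.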

PAGE 18

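(2-3), involves considering all the possible partitions $\omega \in S_{n,k}$ that generate k clusters. Then, if we define $\omega_1$ to be the only existing cluster when $\kappa = 1$, using the Law of Total Probability we can rewrite the Bayes factor in terms of the partitions $\omega$ as

$$\frac{1}{\Gamma(a)\, b^{a}}\, \frac{1}{(\sigma^2_{rj})^{a+1}}\, e^{-1/(b \sigma^2_{rj})};$$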

PAGE 19

2pnj=21 nj2+1p=2 2njX`=1(Y(j)`Y(j))01((Y(j)`Y(j)))+nj 2{6 )simpliestopXr=11 22rjnjs2rj+nj 19

PAGE 20

(2-5), for every partition $\omega \in S_{n,k}$. We obtain

PAGE 21

(2-5) and to determine the posterior probabilities $P(H_0 \mid Y)$ in closed form. However, notice that the sum we need to compute is over the set of all possible partitions, which has as many elements as determined by the Stirling number of the second kind. This introduces two new difficulties to the problem:

1. The number of summands involved in the calculation is often too large, even if the number of observations and clusters is relatively small. For instance, if n = 48 and $\kappa = 2$, we have S(48, 2) = 140,737,488,355,327.
2. In order to compute the sum, we need to list all the possible partitions, which is extremely difficult.

To overcome these difficulties, we will consider an indirect approach and estimate the value of the Bayes factor using Monte Carlo techniques. Let $\pi$ and $g$ be distributions on the partition space $S_{n,k}$. Suppose that $\pi$ is the prior of interest and we can sample $\omega^{(1)}, \ldots, \omega^{(M)}$ from $g$. If M is large enough, we can estimate

PAGE 22

$\sum_{i=1}^{M} \pi(\omega^{(i)})$

where the expression in (2-10), while possibly biased, is proven to reduce the Mean Squared Error (see Casella and Robert, 1998, and Van Dijk and Kloeck, 1984). Notice that if we consider $g$ as the prior of interest in the first place, then the importance sampling is not needed and we just compute the Monte Carlo sum.

To generate a random partition of n objects into k clusters we use the following strategy, suggested by Jim Pitman (personal communication). We use a vector of length n with n - k 0's and k 1's, one of them in the first position. Then, we randomly generate a permutation of the remaining n - 1 elements to distribute (uniformly) the k - 1 1's in the last n - 1 places. Each 1 represents the start of a cluster. For example, if n = 5 and k = 3, the vector 11001 corresponds to the partition of five objects into clusters of size 1, 3, 1. Once the partition has been generated, we randomly permute the Y vector, and place the $Y_i$'s in the given partition. Although not immediately obvious (see the appendix for details), the probability of this partition $\omega$ is given by

PAGE 23

1. Generate a random string of 1's and 0's as above.
2. Generate a random permutation of $Y_1, \ldots, Y_n$.
3. Cluster the permuted $Y_1, \ldots, Y_n$ according to the random string.
4. Calculate the ratio of marginals from (2-8).
5. Approximate the Bayes factor using (2-10).

1. Draw candidate $\omega'$ from $g$.
2. At iteration t:
   (a) With probability a, draw candidate $\omega'$ from the random walk starting from $\omega^{(t)}$, and with probability 1 - a draw candidate $\omega'$ independently from $g$.
   (b) Compute the Metropolis-Hastings ratio
   $$MH = \frac{g(\omega') \Big/ \left[\frac{a}{n(k-1)} + (1-a)\, g(\omega')\right]}{g(\omega^{(t)}) \Big/ \left[\frac{a}{n(k-1)} + (1-a)\, g(\omega^{(t)})\right]}.$$
   (c) With probability min(1, MH) set $\omega^{(t+1)} = \omega'$, otherwise set $\omega^{(t+1)} = \omega^{(t)}$.

This is a Metropolis-Hastings algorithm with stationary distribution $g$.
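The first three steps of the sampling scheme (random 0/1 string, random permutation, clustering) can be sketched as follows. This is an illustrative sketch rather than the code used for the thesis simulations: the function names are hypothetical, and choosing the positions of the remaining k - 1 ones with `random.sample` is assumed to be equivalent to permuting the last n - 1 entries of the string.

```python
import random


def random_cluster_string(n, k, rng):
    """Return a 0/1 list of length n with k ones, the first entry being 1.

    The remaining k - 1 ones are placed uniformly at random among the
    last n - 1 positions. Each 1 marks the start of a cluster.
    """
    bits = [0] * n
    bits[0] = 1
    for position in rng.sample(range(1, n), k - 1):
        bits[position] = 1
    return bits


def cluster_sizes(bits):
    """Translate a 0/1 string into the list of consecutive cluster sizes."""
    sizes = []
    for bit in bits:
        if bit == 1:
            sizes.append(1)      # a 1 opens a new cluster
        else:
            sizes[-1] += 1       # a 0 extends the current cluster
    return sizes


def random_partition(observations, k, rng):
    """Permute the observations and cut them according to a random string."""
    bits = random_cluster_string(len(observations), k, rng)
    permuted = list(observations)
    rng.shuffle(permuted)
    clusters, start = [], 0
    for size in cluster_sizes(bits):
        clusters.append(permuted[start:start + size])
        start += size
    return clusters


rng = random.Random(2008)
print(cluster_sizes([1, 1, 0, 0, 1]))        # the n = 5, k = 3 example: sizes 1, 3, 1
print(random_partition(range(1, 9), 4, rng))  # one random partition of 8 objects into 4 clusters
```

Steps 4 and 5 require the ratio of marginals and the estimator from the thesis equations, which are not reproduced in this sketch.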

PAGE 24

(2-2), using posterior probabilities as a measurement of the strength of the evidence against the null. However, we notice that the Bayesian P-value is not a P-value in the usual sense. Our Bayesian P-value measures (based on the data) how likely the null hypothesis is to be true in opposition to the alternative being tested and, in this sense, it is a valid measurement of evidence against the null. But, in contrast to the standard P-value, there is no error calibration for our procedure. For instance, we lack a reference distribution, such as the null distribution, which determines the variability of the statistic when the null hypothesis is true. Hence, one of the main drawbacks of our procedure is that we cannot easily set up cutoff points to determine when we have strong evidence against the null hypothesis. All we know is that posterior probabilities below 0.5 suggest the presence of clusters, and the lower the better. But how low should the Bayesian P-value be in order for the experimenter to make a good decision?

Another difficulty is how to compare different outcomes from an experiment. Since the variability of the posterior probabilities under the null hypothesis is unknown, it is hard to tell if the difference between two different outcomes is due to the actual sensitivity of the test in detecting clusters or if it is purely due to the random variation associated with the experiment.

These problems are not new in Bayesian analysis and some solutions have been presented in the literature. For instance, Jeffreys (1961) developed a scale to judge the evidence in favor of or against $H_0$ brought by the data. The scale goes as follows:

PAGE 25

(2-5) in terms of the data, Y, as

PAGE 26

26

PAGE 27

Observethat,foranygivenpartition!2Sn;k,wecanwrite (2)(k1)aPk+1j=1uj+2=b2n 1 .Hence,ifthenullhypothesisholdsandifX=(x1;:::;xn1)isavectorofindependentandidenticallydistributed21randomvariables,wehaveujd=nj1Xi=nj1xi(j=1;:::;k)andk+1Xj=1ujd=n1Xi=1xi;wherewetaken0=1.Thisresultleadstothefollowingproposition. (2)(k1)aPn1i=1xi+2=b2n 27

PAGE 28

2 (2)(k1)aPn1i=1xi+2=b2n Proof. 2 =X2Pn;kX()(!)T(Xj);where()=f!:!hasclustersofsizedeterminedbyg.SinceT(Xj)dependsonthepartitions!onlythroughtheclusterssize,weobtainBF10(Y)d=X2Pn;kT(Xj)X()(!)=X2Pn;k()T(Xj);where()=P()(!). Theimportanceofthepreviousresultsliesmainlyontwokeyconsequences.First,theyprovidewithaknownprobabilisticstructureforeachoneofthecomponentspresentintheBayesfactor.ThisstructurewillbefundamentalinordertoobtainthenulldistributionoftheBayesfactorandconsequentlyinobtainingthedesiredposteriorprobabilities. 28

PAGE 29

1 and 2 remain valid componentwise. On the other hand, the diagonal structure of the variance-covariance matrices $\Sigma_i$ under consideration induces independence between the coordinates of $Y_i$ (i = 1, ..., n) and consequently between the factors of the product in (3-2). Hence, no correlation is induced by our calculations and the generalization to higher dimensions proceeds in the obvious manner.

Unfortunately, even with these results the determination of the exact null distribution of the Bayes factor is difficult. In the following sections we will briefly discuss some of the problems that prevent us from obtaining the null distribution in closed form and some of the alternatives that can be considered to estimate it.

PAGE 30

3{4 )hasaknowndistributionandthesimilaritieswithT(Yj!)(forp=1)suggestthatthenulldistributionoftheBayesfactor(oratleastitsmoments)canbedeterminedanalytically.However,thederivationofthenulldistributionofLisbasedonthesametechniquesusedtodeterminetheasymptoticdistributionofthelikelihoodratiotest(Jorgensen,1993)whichwecannotbeimplementedinderivingthenulldistributionofT,duetosomeimportantdierences,namely: 1. Theshiftingterm2=binThasnoequivalentinthestatisticL. 2. TheexponentsinLsatisfyPkj=1j=,whileinT,kXj=1nj ThedenominatorofLcorrespondstothepooledvariance,whileinT,s2=1 30

PAGE 31

1 permitsustoobtainanapproximationfor 1+BF10(X)(3{5)whereX=(x1;:::;xn1)isavectorofiid21randomvariables.Foranyxedvalueof2,weproceedasfollows(Eastonetal.,1986).Firstwecompute~R(T)=1T+(k+1)2T2 3{5 ).Then,weestimatethetailprobabilityofP(X)byP(P(X)>a)Z1Tl"(k+1)~R00(t) 2#1=2expf(k+1)(~R(t)t~R0(T0))gdt;where~R(Tl)=a.Thus,inordertoapproximatethetailprobabilitiesof( 3{5 )weonlyneedtocompute1;2;3and4,whichcanbeobtainedfromtheequations1=12=2213=3312+2314=4413322+12221641;whereiisthei-thmomentofP(X). 31

PAGE 32

3.1 allow us to simulate observations from the null distribution of the posterior probabilities. Then, we can use these generated values to either construct histograms or do density estimation, depending on the interest. Here we will focus on the generation of histograms, which is enough to capture the main features of the null distribution. The histograms will also permit us (via asymptotic results) to obtain cutoff points for the posterior probabilities at any frequentist $\alpha$-level we want.

Again, the main difficulty lies in the calculation of the sum in the Bayes factor. As we have already pointed out, this requires us to list all the partitions in $P_{n,k}$. Nevertheless, we can proceed in a similar way as in Section 2.5. Since when the null hypothesis is true we only care about the cluster sizes, we follow the same strategy to generate the partitions according to $g$, but we do not take into account the permutations of the elements in the given partition. Thus, we need to correct the probabilities given by $g$, so that they do not take into account the number of redundant partitions that lead to the same cluster sizes.
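Once a sample of posterior probabilities has been simulated under the null, a cutoff point at a given $\alpha$-level can be read off as an empirical quantile. The sketch below is only an illustration under stated assumptions: it takes the smallest order statistic whose empirical distribution function reaches $\alpha$, which may differ from the exact quantile convention used in the thesis, and the names `empirical_cutoff` and `reject_null` are hypothetical.

```python
import math


def empirical_cutoff(null_probabilities, alpha):
    """Empirical alpha-quantile of simulated null posterior probabilities.

    Returns the smallest order statistic whose empirical distribution
    function is at least alpha (an assumed quantile convention).
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1.")
    ordered = sorted(null_probabilities)
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[index]


def reject_null(posterior_probability, cutoff):
    """Reject H0 when the observed Bayesian P-value falls below the cutoff."""
    return posterior_probability < cutoff


# Toy values standing in for simulated null posterior probabilities.
simulated = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.6]
cutoff = empirical_cutoff(simulated, 0.25)
print(cutoff, reject_null(0.05, cutoff), reject_null(0.40, cutoff))
```

In practice the list of simulated values would contain thousands of draws, as in the simulation studies, so that the lower tail of the histogram is estimated with small error.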

PAGE 33

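In order to construct the histogram, we need to simulate $B_{10}(X)$ several times. To obtain N simulations, the final algorithm is:

For $\ell = 1, \ldots, N$ repeat:
For $i = 1, \ldots, M$
1. Generate a random string of 1's and 0's according to $g_0$.
2. Generate $x_1, \ldots, x_{n-1}$ iid $\chi^2_1$ random variables.
3. Compute $\sum_{i=1}^{n-1} x_i$ and $\sum_{i=n_{j-1}}^{n_j - 1} x_i$ according to the random string.
4. Compute $T(X \mid \cdot)$ for the cluster sizes given by the random string.
5. Approximate the Bayes factor using (3-6).

Finally, to estimate the cutoff points for any given $\alpha$-level we use the following result (Sen and Singer 1993)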

PAGE 34

(3-1) depends on the value of $\sigma^2$, the common variance of the observations under the null. Thus, in order to generate a sample from the null distribution, we need to set a value for $\sigma^2$. We will overcome this difficulty by estimating $\sigma^2$ by the sample variance. More precisely, we will center and re-scale our observations in the standard way.
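Centering and re-scaling can be illustrated with a short sketch. It assumes that "the standard way" means subtracting the sample mean and dividing by the sample standard deviation (with the n - 1 denominator), applied to one coordinate at a time; the function name `standardize` is hypothetical.

```python
import statistics


def standardize(values):
    """Center by the sample mean and scale by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # uses the n - 1 denominator
    return [(value - mean) / sd for value in values]


# Toy coordinate of a data set; after standardizing, the sample mean is 0
# and the sample variance is 1, up to floating-point rounding.
column = [2.1, 3.4, 1.9, 4.2, 2.8, 3.0]
scaled = standardize(column)
print(statistics.fmean(scaled), statistics.variance(scaled))
```

With the data standardized this way, the common variance under the null can be taken as 1 when simulating from the null distribution.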

PAGE 35

In Figure 4-1 we see the results for n = 50 and k = 3 based on 2500 simulations. For this case, the number of elements in the partition space is p(50, 3) = 208. All of them were considered to obtain the exact Bayes factor (first row), and samples of size 52 (25% of the total number of elements in the partition space) were drawn from $g_0$ to compute the approximations (second row). We also considered the uniform prior (first column) and the $g_0$ prior (second column) in our calculations. We observe that the histograms are virtually identical and that the difference between the empirical 0.05-percentiles is less than 0.012 in all the cases. The results are similar in all the cases we studied, indicating that our method is fairly accurate in approximating the null distribution of the posterior probabilities and suggesting that the choice of the prior for the partition space has little effect on the calculations.

PAGE 36

Figure 4-1. Histograms of the null posterior probabilities for n = 50 and k = 3 clusters based on 2500 simulations.

Table 4-1 presents the results corresponding to 6 replications for the case n = 50, k = 3. The empirical 0.05 percentiles are obtained based on 2500 simulations, sampling 52 out of 208 partitions per iteration. The obtained standard error is less than 0.003, which is fairly small considering the number of simulations per repetition. Similar results are obtained changing the values of n and k, indicating that convergence of the empirical $\alpha$-percentile is reached moderately fast.
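The mean and standard error reported in Table 4-1 can be reproduced from the six replicated percentiles. The sketch below assumes that the standard error is the sample standard deviation divided by the square root of the number of replications; the function name `mean_and_standard_error` is hypothetical.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return the sample mean and the standard error of the mean."""
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


# Empirical 0.05 percentiles from the six replications in Table 4-1.
percentiles = [0.15416, 0.16448, 0.17061, 0.17005, 0.17298, 0.16455]
mean, standard_error = mean_and_standard_error(percentiles)
print(round(mean, 5), round(standard_error, 5))
```

Rounded to five decimals, the two numbers agree with the Mean and SE entries of the table.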

PAGE 37

Table 4-1. Posterior probabilities for 2500 simulations of 50 observations and 3 clusters. The number of considered partitions per iteration is 52.

  alpha   Empirical percentiles (6 replications)
  0.05    0.15416   0.16448
          0.17061   0.17005
          0.17298   0.16455
  Mean    0.16614
  SE      0.00277

of the observations falling in the vicinity of 1. However, looking at our simulations, we observe that a considerable number of observations fall below 0.5.

In Table 4-2 we show the results corresponding to the empirical 0.05 percentile for n = 50, 60, 70 and k = 2, 3, 4. The values are obtained as the average of 6 repetitions of 5000 simulations each. In parentheses we report the respective standard errors. We observe not only that the cutoff points for the 0.05-level test are fairly small, but also the following pattern. For every n, the value of the cutoff points decreases as the number of clusters increases, that is, about 5% of the generated posterior probabilities are located closer to zero as k increases. The general behavior is that for fixed n, as k increases the histograms, while still skewed to the left, tend to spread more mass to smaller values, resulting in fatter tails instead of the expected thin tails.

Table 4-2. Cutoff points for minimum cluster size 1 and $\alpha$-level 0.05, based on 5000 simulations. The numbers in parentheses correspond to the standard errors after 6 repetitions.

  Clusters   Observations
             50                  60                  70
  2          0.15261 (0.00198)   0.18647 (0.00161)   0.20709 (0.00230)
  3          0.09782 (0.00153)   0.13556 (0.00311)   0.16973 (0.00141)
  4          0.05454 (0.00034)   0.09268 (0.00095)   0.13836 (0.00118)

The most likely explanation for this phenomenon is that the number of elements that constitutes a cluster is not defined. Therefore, our procedure tends to consider as their

PAGE 38

Table 4-3 shows the results of simulations obtained under the same conditions we described above, but setting the MCS equal to 15% of the observations. We can see how the introduction of the MCS as a new parameter reverses the pattern observed in Table 4-2 and also increases the value of the empirical 0.05-percentiles.

Table 4-3. Cutoff points for minimum cluster size 15% of the observations and $\alpha$-level 0.05, based on 5000 simulations.

  Clusters   Observations
             50                  60                  70
  2          0.25523 (0.00381)   0.31751 (0.00322)   0.36356 (0.00313)
  3          0.29041 (0.00208)   0.39503 (0.00294)   0.51345 (0.00511)
  4          0.29198 (0.00226)   0.51517 (0.00210)   0.68352 (0.00243)

In general, as the MCS increases (and therefore the probability of finding clusters by chance decreases) the histograms become more skewed to the left and tend to concentrate more mass near one than spread in the tails, as we can observe in Figure 4-2. In other words, the extra restriction provides more intuitive results for the simulated posterior probabilities.

PAGE 39

Figure 4-2. Histograms of the null posterior probabilities for n = 50 and k = 2 and 3 clusters based on 5000 simulations. The minimum cluster size is set equal to 1 observation and 15% of the observations in each case.

PAGE 40

In Table 4-4 we show the results for 50 observations and 3 clusters where 2500 simulations were considered. We observe that the discrepancy in the results is less than 0.005, which is due not only to the number of partitions but also to the considered number of simulations. Similar results are obtained for the other cases, suggesting that sampling about 25% of the total number of elements in the partition space is sufficient to obtain reasonable results.

PAGE 41

Table 4-4. Posterior probabilities for 1000 simulations of 50 observations and 3 clusters.

  alpha   Part/Iteration   Empirical percentile
  0.05    all              0.14341
          52               0.14852
          104              0.14139
          208              0.14368
          500              0.14229

In Figure 4-3 we show the scatter-plot corresponding to 50 observations from a bivariate Normal distribution with mean $\mu = (1, 1)'$ and variance-covariance matrix $\Sigma = \mathrm{diag}(1/4, 1/4)$. The posterior probabilities were obtained after 500000 iterations considering a MCS of 20% of the observations. The results for k = 2, 3 and 4 clusters are listed in Table 4-5.

Table 4-5. Posterior probabilities after 500000 iterations for the observations in Fig. 4-3. The MCS is 20% of the observations.

  Clusters   P(H0)
  2          0.59895
  3          0.72213
  4          0.91140

PAGE 42

Figure 4-3. Scatter-plot of 50 observations generated from a bivariate Normal distribution with mean $\mu = (1, 1)'$ and variance-covariance matrix $\Sigma = \mathrm{diag}(1/4, 1/4)$.

We observe that the posterior probabilities are fairly high in all the cases, showing very weak evidence for the presence of clusters. Notice that the smallest value is obtained for testing two clusters, but still the posterior probability is too high to be considered significant according to our calibrations. Other simulations agree with these results.

The other case we need to consider is when we have at least two clusters. We generate data sets composed of observations coming from multivariate Normal distributions with different means, depending on the number of clusters we want to test. This way, we purposely create clusters in the data sets to be tested. In Figure 4-4 we show the scatter-plot corresponding to 50 observations. Of them, 35 were generated from a bivariate Normal distribution with mean $\mu = (1, 2)'$ and variance-covariance matrix $\Sigma = \mathrm{diag}(1/2, 1/2)$, and the remaining 15 were generated from a bivariate Normal distribution with mean $\mu = (3, 1)'$ and the same variance-covariance matrix. By construction, we have two clusters in the data, which can be easily noticed. The posterior probabilities obtained after 500000 iterations are listed in Table 4-6. The MCS is 20% of the total number of observations.
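A data set with the same design as the two-cluster example can be generated with the standard library alone. This is a sketch under the stated design (35 points with mean (1, 2) and 15 points with mean (3, 1), both with covariance diag(1/2, 1/2)); it does not reproduce the actual draws shown in the thesis figure, and the function name and seed are arbitrary choices.

```python
import math
import random


def simulate_two_clusters(seed=0):
    """Simulate 50 bivariate Normal points forming two clusters.

    Because the covariance matrix is diagonal, the two coordinates are
    independent and can be drawn separately with standard deviation
    sqrt(1/2).
    """
    rng = random.Random(seed)
    sd = math.sqrt(0.5)
    groups = [((1.0, 2.0), 35), ((3.0, 1.0), 15)]  # (mean, number of points)
    data = []
    for (mean_x, mean_y), count in groups:
        for _ in range(count):
            data.append((rng.gauss(mean_x, sd), rng.gauss(mean_y, sd)))
    return data


points = simulate_two_clusters(seed=42)
print(len(points), points[:3])
```

Changing the list of means and counts gives data sets with a different number of clusters, which is how the remaining sensitivity checks could be set up.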

PAGE 43

Figure 4-4. Scatter-plot of 50 observations generated from two bivariate Normal distributions with different means.

Table 4-6. Posterior probabilities after 500000 iterations for the observations in Fig. 4-4. The MCS is 20% of the observations.

  Clusters   P(H0)
  2          2.04E-05
  3          3.14E-05
  4          0.00669

The posterior probabilities obtained for the data are very small, indicating strong evidence against the null hypothesis in all the tests. Observe that although the data set was constructed with two clusters, we still have strong evidence for the existence of three or four clusters. This happens because we have two very distinguishable groups, and each group is fairly easy to separate into two other groups. Nevertheless, the strongest evidence points to the existence of two clusters, which is the true number. Similar results are observed in other simulations, indicating that our procedure for detecting clusters is fairly accurate.

PAGE 44

Figure 5-1 shows the scatter-plots for the first 2 principal components for 4 ears of maize, obtained using the methodology described above.

Finally, we apply our procedure and compute the posterior probabilities for the treated data. We test for k = 2, 3 and 4 clusters, which are the interpretable numbers of clusters in the context of the experiment. Table 5-1 shows the posterior probabilities for the data in Figure 5-1. The values were obtained after 500000 iterations. Also, the minimum cluster size is set to 20% of the total number of observations. This quantity defines a meaningful cluster size for the experimenter and prevents the test from finding clusters due to extreme observations.

Table 5-1. Test of the hypotheses $H_0$: no clusters vs. $H_1$: k clusters, with minimum cluster size 20% of the total number of observations.

  Label          n    k = 2      k = 3      k = 4
  i00F-0173-01   96   0.009098   0.840622   0.976126
  i00F-0183-01   96   0.877506   0.980882   0.999744
  i01S-0026-18   96   0.000266   0.392694   0.044762
  i02S-0302-16   96   0.499539   0.972905   0.997341

PAGE 45

Figure 5-1. Scatter-plots for the first two principal components of the NIR spectra for labels i00F-0173-01, i00F-0183-01, i01S-0026-18 and i02S-0302-16.

In Figure 5-2 we show the convergence of the procedure for ear i00F-0173-01. When comparing these results with the values in Table 4-3 we see that for the data set i00F-0173-01 we have strong evidence for the existence of 2 clusters but not for 3 or 4. This conclusion is clearly supported by the respective scatter-plot. For ear i00F-0183-01 we do not find evidence of the existence of clusters and we conclude that the null hypothesis is true. The conclusion is also fairly obvious by visual examination of the scatter-plot. The same conclusion holds for ear i02S-0302-16, although it is not immediately obvious by inspection of the scatter-plot. In this case, the smallest Bayesian P-value is 0.499539, which is above the corresponding cutoff point at level 0.05.

PAGE 46

Figure 5-2. Convergence of the posterior probabilities for testing k = 2, 3 and 4 clusters.

In the data from ear i01S-0026-18, we find strong evidence for the presence of clusters in all the tests. The greatest value among the posterior probabilities is 0.392694, corresponding to the test for 3 clusters. This value is still below the cutoff point at level 0.05 for the corresponding test and therefore provides strong evidence for the existence of 3 clusters in the data. On the other hand, we notice that the minimum posterior probability is reached in the test for 2 clusters. However, these numbers should not be compared directly because of the differences among their respective null distributions of the posterior probabilities. Visual inspection of the scatter-plot also suggests the existence of clusters, although it is not clear how many of them there are in the data.

PAGE 47

In Figure 5-3 we can see the histograms for n = 65 and k = 2 and 3, based on 100000 simulations, constructed for this purpose.

Figure 5-3. Histograms of the null posterior probabilities for n = 65 and k = 2 and 3 clusters based on 100000 simulations. The minimum cluster size is set equal to 15% of the observations in each case.

PAGE 48

48

PAGE 49

49

PAGE 50

(2-11) is not difficult, but some care must be taken in counting partitions, especially with respect to ordered versus unordered partitions. To be very clear, we start with an example. Suppose that n = 8 and k = 4, which is a small set of partitions, but big enough to be interesting. We know that the number of partitions of 8 objects into k cells, with no empty cell, is the Stirling Number of the Second Kind, $S_{8,4} = 1701$. The strategy outlined in Section 2.5 will, for this case, generate $\binom{7}{3} = 35$ partitions. The only possible cluster sizes for n = 8 and k = 4 are

  Partition   Number of 01 Strings

PAGE 51

731!3! which is $S_{8,4}$, the Stirling Number of the Second Kind (and giving us an alternative representation of this number).
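The counts quoted in this example can be checked numerically. The sketch below uses the standard recurrence S(n, k) = k S(n-1, k) + S(n-1, k-1) for Stirling numbers of the second kind, which is a textbook identity rather than the representation derived in this appendix; the function name `stirling2` is hypothetical.

```python
from math import comb


def stirling2(n, k):
    """Stirling number of the second kind via the standard recurrence.

    S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1), with S(0, 0) = 1.
    """
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]


print(stirling2(8, 4))    # partitions of 8 objects into 4 non-empty cells
print(comb(7, 3))         # 0/1 strings of length 8 with 4 ones, first entry 1
print(stirling2(48, 2))   # the n = 48, two-cluster case from Section 2.5
```

The gap between 1701 set partitions and 35 strings is what makes the random permutation of the observations necessary: each string only fixes the cluster sizes in order, not which observations fall in each cluster.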

PAGE 52

A{2 )and( A{4 )resultsintheprobabilityofapartition!beinggivenby, A{4 ).Cancellingtermsandapplying( A{3 )showsthatP!2Pkg(!)=1.Asanalnote,itfollowsfrom( A{1 )and( A{4 )thatanotherrepresentationoftheStirlingNumberoftheSecondKindis.Xn1+:::+nk=nn1n2:::nknn1n2:::nk 52

PAGE 53

2.5. For minimum cluster size m, start with k blocks of the form [1 0 ... 0], which consist of one 1 and m - 1 zeros. Place one block at the beginning of the string, then randomly allocate the remaining k - 1 blocks and n - mk zeros. As before, each 1 signifies the beginning of a cluster, but now each cluster will have at least m objects. An argument similar to that leading to (A-5) will show that under the present generation scheme, the probability of a partition with at least m objects in each cluster is


(A–7) can be easily computed, for any (n, k), using R or another programming language. Table A-1 shows the number of partitions p(n, k) for n = 1, ..., 6.

Table A-1. Number of partitions p(n, k) for n = 1, ..., 6.

  n    k=1  k=2  k=3  k=4  k=5  k=6
  1     1
  2     1    1
  3     1    1    1
  4     1    2    1    1
  5     1    2    2    1    1
  6     1    3    3    2    1    1

The topic of integer partitions has been extensively researched from combinatorial, number theoretical and analytic aspects. For further references and results see Andrews, 1976.
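The entries of Table A-1 can be reproduced with a few lines of code. The sketch below assumes that p(n, k) denotes the number of partitions of the integer n into exactly k positive parts and uses the standard recurrence p(n, k) = p(n − 1, k − 1) + p(n − k, k); whether this coincides with the form of (A–7) in the thesis is an assumption.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def p(n, k):
    """Number of partitions of the integer n into exactly k positive parts."""
    if n == 0 and k == 0:
        return 1
    if n <= 0 or k <= 0 or k > n:
        return 0
    # Either the smallest part equals 1 (remove it), or every part is at
    # least 2 (subtract 1 from each of the k parts).
    return p(n - 1, k - 1) + p(n - k, k)

table = {n: [p(n, k) for k in range(1, n + 1)] for n in range(1, 7)}
for n, row in table.items():
    print(n, row)
```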


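2.3, the marginal distribution of the data $Y$ given a partition $\omega$ is
\[ m(Y \mid \omega) = \int\!\!\int \prod_{j=1}^{k} \left[ \prod_{\ell=1}^{n_j} N\!\left(Y^{(j)}_{\ell} \mid \mu_j, \Sigma_j\right) \right] N\!\left(\mu_j \mid \mu^{(j)}_0, \tau^2 \Sigma_j\right) \pi(\Sigma_j)\, d\mu_j\, d\Sigma_j . \]
First, observe that
\[ \cdots \exp\left\{ -\frac{1}{2} \left[ \sum_{\ell=1}^{n_j} \left(Y^{(j)}_{\ell} - \mu_j\right)' \Sigma_j^{-1} \left(Y^{(j)}_{\ell} - \mu_j\right) + \left(\mu_j - \mu^{(j)}_0\right)' \left[\tau^2 \Sigma_j\right]^{-1} \left(\mu_j - \mu^{(j)}_0\right) \right] \right\}. \tag{B–1} \]
Now,
\[ \sum_{\ell=1}^{n_j} \left(Y^{(j)}_{\ell} - \mu_j\right)' \Sigma_j^{-1} \left(Y^{(j)}_{\ell} - \mu_j\right) + \left(\mu_j - \mu^{(j)}_0\right)' \left[\tau^2 \Sigma_j\right]^{-1} \left(\mu_j - \mu^{(j)}_0\right) \]
\[ = \sum_{\ell=1}^{n_j} \left(Y^{(j)}_{\ell} - \bar{Y}^{(j)}\right)' \Sigma_j^{-1} \left(Y^{(j)}_{\ell} - \bar{Y}^{(j)}\right) + n_j \left(\bar{Y}^{(j)} - \mu_j\right)' \Sigma_j^{-1} \left(\bar{Y}^{(j)} - \mu_j\right) + \left(\mu_j - \mu^{(j)}_0\right)' \frac{1}{\tau^2} \Sigma_j^{-1} \left(\mu_j - \mu^{(j)}_0\right) \cdots \]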


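(B–1) is proportional to
\[ \exp\left\{ -\frac{1}{2} \left[ \sum_{\ell=1}^{n_j} \left(Y^{(j)}_{\ell} - \bar{Y}^{(j)}\right)' \Sigma_j^{-1} \left(Y^{(j)}_{\ell} - \bar{Y}^{(j)}\right) + \frac{n_j}{n_j \tau^2 + 1} \cdots \right] \right\} \]
[...]
\[ \left(\frac{1}{2\pi}\right)^{p n_j / 2} \left(\frac{1}{n_j \tau^2 + 1}\right)^{p/2} \exp\left\{ -\frac{1}{2} \left[ \sum_{\ell=1}^{n_j} \left(Y^{(j)}_{\ell} - \bar{Y}^{(j)}\right)' \Sigma_j^{-1} \left(Y^{(j)}_{\ell} - \bar{Y}^{(j)}\right) + \frac{n_j}{n_j \tau^2 + 1} \cdots \right] \right\} \cdots \frac{1}{\Gamma(a)\, b^{a}} \frac{1}{(\sigma^2_{rj})^{a+1}}\, e^{-1/(b\, \sigma^2_{rj})} \]
the expression in braces in (B–2) simplifies to
\[ \sum_{r=1}^{p} \frac{1}{2 \sigma^2_{rj}} \left[ n_j s^2_{rj} + \frac{n_j}{n_j \tau^2 + 1} \cdots \right] \]
[...]
\[ \left(\frac{1}{2\pi}\right)^{n p / 2} \left(\frac{1}{\Gamma(a)\, b^{a}}\right)^{pk} \prod_{j=1}^{k} \cdots \frac{1}{(\sigma^2_{rj})^{n_j/2 + a + 1}} \exp\left\{ -\frac{1}{2 \sigma^2_{rj}} \left[ n_j s^2_{rj} + \frac{n_j}{n_j \tau^2 + 1} \cdots \right] \right\} \cdots \]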
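The algebra behind the factor $n_j / (n_j \tau^2 + 1)$ can be checked numerically in one coordinate. The sketch below assumes a diagonal $\Sigma_j$ (as the $\sigma^2_{rj}$ notation suggests), drops the common factor $1/\sigma^2_{rj}$, and verifies the standard completing-the-square identity; the function name and the test values are hypothetical.

```python
import math

def split_quadratic(y, mu, mu0, tau2):
    """Return both sides of the identity
    sum (y_l - mu)^2 + (mu - mu0)^2 / tau2
      = sum (y_l - ybar)^2
        + (n + 1/tau2) * (mu - m)^2
        + n / (n * tau2 + 1) * (ybar - mu0)^2,
    where m = (n * ybar + mu0 / tau2) / (n + 1 / tau2)."""
    n = len(y)
    ybar = sum(y) / n
    lhs = sum((v - mu) ** 2 for v in y) + (mu - mu0) ** 2 / tau2
    within = sum((v - ybar) ** 2 for v in y)
    m = (n * ybar + mu0 / tau2) / (n + 1 / tau2)
    rhs = (within
           + (n + 1 / tau2) * (mu - m) ** 2
           + n / (n * tau2 + 1) * (ybar - mu0) ** 2)
    return lhs, rhs

lhs, rhs = split_quadratic([1.2, 0.7, 2.5, 1.9], mu=0.3, mu0=1.0, tau2=2.0)
print(math.isclose(lhs, rhs))
```

Only the middle term on the right depends on $\mu_j$, so integrating over $\mu_j$ leaves the within-cluster sum of squares and the term carrying $n_j / (n_j \tau^2 + 1)$.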


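[...] $\pi_1 \int_{\Theta_1} f(x \mid \theta)\, q_1(\theta)\, d\theta$. This quantity can be used as a measurement of the evidence against the null contained in the data. We can also define the Bayes factor of $H_0$ relative to $H_1$ to be the ratio of the posterior probabilities of the null and alternative hypotheses over the ratio of the prior probabilities.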
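The relation between the Bayes factor and the posterior probability of the null can be made concrete with a short sketch. It assumes the usual definition, posterior odds of $H_0$ divided by prior odds of $H_0$, and uses hypothetical function names; the prior probability 0.5 is only an illustrative default.

```python
import math

def bayes_factor_01(posterior_h0, prior_h0):
    """Bayes factor of H0 relative to H1: posterior odds over prior odds."""
    posterior_odds = posterior_h0 / (1 - posterior_h0)
    prior_odds = prior_h0 / (1 - prior_h0)
    return posterior_odds / prior_odds

def posterior_h0(bf01, prior_h0=0.5):
    """Posterior probability of H0 implied by a Bayes factor and a prior."""
    posterior_odds = bf01 * prior_h0 / (1 - prior_h0)
    return posterior_odds / (1 + posterior_odds)

bf = bayes_factor_01(0.392694, 0.5)
print(bf, posterior_h0(bf, 0.5))
```

With equal prior probabilities the prior odds equal 1, so the Bayes factor coincides with the posterior odds of the null.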


Andrews, G. (1976), "The Theory of Partitions," Addison-Wesley, Reading MA.
Auffermann, W. F., Ngan, S. C. and Hu, X. (2002), "Cluster Significance Testing Using the Bootstrap," NeuroImage, 17, 583–591.
Baye, T. M., Pearson, T. C. and Settles, A. M. (2006), "Development of a Calibration to Predict Maize Seed Composition Using Single Kernel Near-Infrared Spectroscopy," Journal of Cereal Science, 43, 236–243.
Bolshakova, N., Azuaje, F. and Cunningham, P. (2005), "An Integrated Tool for Microarray Data Clustering and Cluster Validity Assessment," Bioinformatics, 21, 451–455.
Bona, M. (2004), "Combinatorics of Permutations," Chapman & Hall/CRC.
Booth, J. G., Casella, G. and Hobert, J. P. (2008), "Clustering Using Objective Functions and Stochastic Search," Journal of Royal Statistical Society, Series B, 70, 119–140.
Casella, G. and Robert, C. (1998), "Post-Processing Accept-Reject Samples: Recycling and Rescaling," Journal of Computational and Graphical Statistics, 7, 139–157.
Dudley, J. W., Lambert, R. J. and Alexander, D. E. (1974), "Seventy Generations of Selection for Oil and Protein Concentration in the Maize Kernel," Seventy Generations of Selection for Oil and Protein in Maize, J. W. Dudley, Ed. (Madison, WI: Crops Society of America), 181–212.
Easton, G. S. and Ronchetti, E. (1986), "General Saddlepoint Approximations with Applications to L Statistics," Journal of the American Statistical Association, 81, 420–423.
Ghosh, J. K., Delampady, M. and Samanta, T. (2006), "An Introduction to Bayesian Analysis: Theory and Methods," Springer.
Gould, H. W. (1960), "Stirling Number Representation Problems," Proceedings of the American Mathematical Society, 11, 447–451.
Glaser, R. E. (1980), "A Characterization of Bartlett's Statistic Involving Incomplete Beta Functions," Biometrika, 67, 53–58.
Hartigan, J. A. (1975), "Clustering Algorithms," New York: Wiley.
Janni, J., Weinstock, B. A., Hagen, L. and Wright, S. (2008), "Novel Near-Infrared Sampling Apparatus for Single Kernel Analysis of Oil Content in Maize," Applied Spectroscopy, 62, 423–426.
Jeffreys, H. (1961), "Theory of Probability," Third Edition, Oxford University Press, Oxford.


Kass, R. E. and Raftery, A. E. (1995), "Bayes Factor and Model Uncertainty," Journal of the American Statistical Association, 90, 773–795.
Kerr, M. K. and Churchill, G. A. (2001), "Bootstrapping Cluster Analysis: Assessing the Reliability of Conclusions from Microarray Experiments," Proceedings of the National Academy of Sciences of the United States of America, 98, 8961–8965.
Lambert, R. J., Alexander, D. E. and Han, Z. J. (1998), "A High Oil Pollinator Enhancement of Kernel Oil and Effects on Grain Yields of Maize Hybrids," Agronomy Journal, 90, 211–215.
Lavine, M. and Schervish, M. (1999), "Bayes Factors: What They Are and What They Are Not," American Statistician, 53, 119–122.
McCullagh, P. and Yang, J. (2006), "How Many Clusters?," Technical Report, Department of Statistics, University of Chicago.
Orman, B. A. and Schumann, R. A. (1991), "Comparison of Near-Infrared Spectroscopy Calibration Methods for the Prediction of Protein, Oil, and Starch in Maize Grain," Journal of Agriculture and Food Chemistry, 39, 883–886.
Pan, D. (2000), "Starch Synthesis in Maize," Carbohydrate Reserves in Plants: Synthesis and Regulation, A. K. Gupta and N. Kaur, Eds. (Amsterdam: Elsevier), 125–146.
Pitman, J. (1996), "Some Developments of the Blackwell-MacQueen Urn Scheme," Statistics, Probability and Game Theory, IMS Lecture Notes Monograph Series, 30, 245–267, Institute of Mathematical Statistics, Hayward, CA.
Robert, C. P. (2001), "The Bayesian Choice," Second Edition, Springer-Verlag, New York.
Sen, P. K. and Singer, J. M. (1993), "Large Sample Methods in Statistics: An Introduction with Applications," Chapman & Hall, New York-London.
Sugar, C. and James, G. (2003), "Finding the Number of Clusters in a Data Set: An Information Theoretic Approach," Journal of the American Statistical Association, 98, 750–763.
Tibshirani, R., Walther, G. and Hastie, T. (2001), "Estimating the Number of Clusters in a Data Set Via the Gap Statistic," Journal of the Royal Statistical Society, Series B, 63, 411–423.
Van Dijk, H. and Kloek, T. (1984), "Experiments with Some Alternatives for Simple Importance Sampling in Monte Carlo Integration," Bayesian Statistics 4, J. Bernardo, M. DeGroot, D. Lindley and A. Smith, Eds. (North-Holland, Amsterdam).


Williams, P. and Norris, K. (2001), "Near-Infrared Technology in the Agricultural and Food Industries, 2nd Edition," American Association of Cereal Chemists, Inc, Minn., USA.


Claudio Fuentes was born in Santiago, Chile, in 1977. Upon graduation from high school, he enrolled as a student at the Pontificia Universidad Católica de Chile, where he received the degree of Bachelor of Science in mathematics in 2001, and his Master in Statistics in December 2003. During his undergraduate and graduate education he was appointed as a teaching assistant in both the faculty of physics and the faculty of mathematics. In August 2005, Fuentes entered the Ph.D. program in the Department of Statistics at the University of Florida. During his education there, he was appointed as a teaching assistant and graduate teaching instructor. He worked as a research assistant for his advisor, Distinguished Professor Dr. George Casella. In August 2008 he earned the degree of Master of Science in statistics with a thesis on Cluster Analysis. He is currently working toward a Ph.D. in statistics under Dr. Casella, with interest in inference on selected populations.