Citation
Statistical Models for Clustering Dynamic Gene Expression Profiles

Material Information

Title:
Statistical Models for Clustering Dynamic Gene Expression Profiles
Copyright Date:
2008

Subjects

Subjects / Keywords:
Approximation ( jstor )
Covariance ( jstor )
Fourier series ( jstor )
Gene expression ( jstor )
Genes ( jstor )
Modeling ( jstor )
Parametric models ( jstor )
Signals ( jstor )
Statistical discrepancies ( jstor )
Statistical models ( jstor )

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright the author. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
8/31/2011

Downloads

This item is only available as the following downloads:


Full Text

PAGE 4

Firstandforemost,Iwouldliketothankmyadvisor,Dr.RonglingWu,andmyco-advisor,Dr.RamonLittell,forallthehelpandforbeingoutstandingadvisors.Ithankthemfortheirpatienceandwonderfulguidance,forsharingtheirwealthofknowledgeandfortheopportunitiestheyhavebestoweduponme.TheyhavegivenmeknowledgeandadvicethatIwillbeabletoapplythroughoutmycareerandmanyotherpartsofmylife,aswell.Withouttheiradvice,concernandunderstandingthisdissertationwouldnothavebeenpossible.Ialsowouldliketothankthemembersofmydissertationcommittee,Dr.MalayGhosh,Dr.AlexandreTrindadeandDr.MatiasKirst,forservingonmycommittee.Eitherasteachersinmycoursesoroutsideoftheclassroom,eachhasplayedauniqueandspecialroleforme.Next,IwouldliketoexpressmysinceregratitudetoDr.RandyCarterforhisencouragementandhelpingmetoseethebrightsideoflife.Withouthisunderstanding,mentoringandeasy-goingpersonalitythiswouldnothavebeensuchanenjoyableexperienceinmytimehereattheUniversityofFlorida.Iameternallygratefultohimforhisfriendshipandhelpingmetoputlifeintoperspectivewhenthingsgotdicult.IwouldalsoliketogivespecialthankstoDr.P.V.RaoandDr.MichaelResnickfortheirsupportandadvice.Last,butnotleast,Iwouldliketothankmyfamily,fortheirlove,supportandpatience.Iwouldespeciallyliketothankmymother,YonghwaLee,andmyhus-band,SanggiLee,whohaveprovidedsuchunwaveringsupportandencouragementthroughoutthisendeavor.Theyhavebeeninspirationstomeandhavealways iv

PAGE 5

v

PAGE 6

page ACKNOWLEDGMENTS ............................. iv LISTOFTABLES ................................. viii LISTOFFIGURES ................................ x ABSTRACT .................................... xii CHAPTER 1INTRODUCTION .............................. 1 1.1GeneExpressionandMicroarrayData ................ 1 1.2StatisticalToolsAvailableintheLiterature ............. 3 1.3SpecicAims .............................. 6 2GENERALSTATISTICALMODEL .................... 8 2.1FiniteMixtureModel .......................... 8 2.2ModelingtheMeanVector ....................... 8 2.3ModelingtheCovarianceMatrix .................... 9 2.3.1AutoregressiveModel ...................... 10 2.3.2AntedependenceModel ..................... 11 2.4LikelihoodFunctionandComputationalAlgorithm ......... 13 2.5HypothesisTesting ........................... 16 2.6ModelSelection ............................. 18 2.6.1NumberofMixtureComponents ................ 18 2.6.2OrderTestforARorSADModel ............... 20 3PRELIMINARYANALYSIS ......................... 21 3.1Material ................................. 21 3.2Methodology .............................. 22 3.2.1FourierSeriesApproximation .................. 22 3.2.2GeneClassication ....................... 23 3.2.3MixtureModel .......................... 25 3.3Results .................................. 27 3.4Discussion ................................ 28 vi

PAGE 7

.............. 31 4.1Introduction ............................... 31 4.2ModelforSingleExperiment ...................... 33 4.2.1Methodology ........................... 33 4.2.1.1Mixturemodel .................... 33 4.2.1.2Modellingthemeancurveandcovariancestructure 34 4.2.1.3Computationalalgorithm .............. 35 4.2.2WorkedExample ........................ 37 4.2.2.1Material ........................ 37 4.2.2.2Results ........................ 38 4.2.3MonteCarloSimulation ..................... 40 4.2.4Discussion ............................ 42 4.3FunctionalAnalysisofSparseGeneExpressionData ........ 47 4.3.1Problem ............................. 47 4.3.2MaterialandModel ....................... 48 4.3.3LikelihoodFunctionandComputationalAlgorithm ..... 49 4.3.4Results .............................. 49 4.3.5Discussion ............................ 51 4.4JointModelforMultipleExperiments ................. 52 4.4.1Material ............................. 53 4.4.2ModelandComputation .................... 54 4.4.3Results .............................. 55 4.4.4Discussion ............................ 57 5WAVELET-BASEDAPPROACHFORGENECLUSTERING ......................... 59 5.1Introduction ............................... 59 5.2WaveletTransformation ........................ 61 5.3Thresholding .............................. 63 5.4ModelingtheCovarianceStructureofSmoothCoecients ..... 65 5.5LikelihoodofSmoothCoecientsandComputationalAlgorithm .. 67 5.6MonteCarloSimulation ........................ 70 5.7Discussion ................................ 71 6SUMMARIESANDPROSPECTS ..................... 76 APPENDIX ADERIVATIONOF( 4{2 )-( 4{5 ) ....................... 80 BWAVELETDIMENSIONALITYREDUCTION .............. 85 REFERENCES ................................... 89 BIOGRAPHICALSKETCH ............................ 96 vii

PAGE 8

Table page 3{1CophenetCorrelationCoecientsforDierentClusterAnalysisApproacheswithFiveComponentsfor632Genefromcdc15Experiment ....... 25 3{2Log-LikelihoodRatio(LR)TestStatisticsandNECValuesUnderDif-ferentNumbers(k)ofMixtureComponents ................ 27 4{1MLEsoftheFirst-OrderFourierParametersforSixDierentPeriodicPatternsofGeneExpressionAmong632GenesCollectedfromtheYeastGenome(cdc15)BasedontheFirst-OrderApproximation ........ 40 4{2MLEsoftheFirst-OrderFourierParametersforThreePeriodicPatternsofGeneExpressionAmong1000SimulatedGeneswithaGivenSetofValues(a0;a1;b1;T)UndertheResidualVarianceof2=0:3(Theaver-agesofparameterestimatesarecalculatedandthemeansquareerrorsoftheestimatesaregivenintheparentheses) ................. 43 4{3MLEsoftheFirst-OrderFourierParametersforThreePeriodicPatternsofGeneExpressionAmong1000SimulatedGeneswithaGivenSetofValues(a0;a1;b1;T)UndertheResidualVarianceof2=2:0(Theaver-agesofparameterestimatesarecalculatedandthemeansquareerrorsoftheestimatesaregivenintheparentheses) ................. 43 4{4MLEsoftheSecond-OrderFourierParametersforThreePeriodicPat-ternsofGeneExpressionAmong1000SimulatedGeneswithaGivenSetofValues(a0;a1;b1;T)UndertheResidualVarianceof2=0:3(Theaveragesofparameterestimatesarecalculatedandthemeansquareerrorsoftheestimatesaregivenintheparentheses) ............ 44 4{5MLEsoftheSecond-OrderFourierParametersforThreePeriodicPat-ternsofGeneExpressionAmong1000SimulatedGeneswithaGivenSetofValues(a0;a1;b1;T)UndertheResidualVarianceof2=2:6(Theaveragesofparameterestimatesarecalculatedandthemeansquareerrorsoftheestimatesaregivenintheparentheses) ............ 44 4{6MLEwith13ComponentsfromSparseData(cdc15) ........... 50 4{7AICandBICforSparseData(cdc15) .................... 51 4{8AICandBICforJointModel ........................ 55 viii

PAGE 9

........................ 56 5{1ParameterValuesoftheSimulatedDatatoDemonstratetheWaveletApproach ................................... 70 5{2MLEandMSEwithFullDimensionandWaveletfor32TimePoints .. 72 5{3MLEandMSEwithFullDimensionandWaveletfor64TimePoints .. 73 ix

PAGE 10

Figure page 3{1DierentTemporalPatternsofGeneExpressionProlesofFiveRan-domlyChosenGenesfromcdc15Experiment(NumbersInsidetheBrack-etsareEstimatesoftheFourierCoecients,(^ai0;^ai1;^bi1;^Ti)) ....... 24 3{2FiveDistinctTemporalPatternsofGeneExpressionProlesDetectedfor632GenesCollectedfromcdc15ExperimentUsingtheTraditionalClusterAnalysis ............................... 28 3{3TwoDistinctTemporalPatternsofGeneExpressionProlesDetectedfor632GenesfromtheMixtureModel-BasedClassicationApproach(TheNumbersInsidetheBracketsaretheEstimatesoftheFourierCo-ecients(a0;a1;b1;T)) ............................ 29 4{1ResidualPlot ................................. 38 4{2ComponentNumber-DependentAICandBICValuesofModelFittingbyaFourierSeriesFunctionofOrder1to3for632GenesCollectedfromtheYeastGenome(cdc15) .......................... 39 4{3SixPeriodicPatternsofGeneExpressionProlesApproximatedbyaFirst-OrderFourierSeriesFunctionfor632GenesCollectedfromtheYeastGenome(cdc15) ............................ 41 4{4ComponentNumber-DependentAICandBICValuesofModelFittingbyaFirst-andSecond-orderFourierSeriesFunctionfor1000SimulatedGenesUnderDierentResidualVariances ................. 42 4{5VariogramforTime-DependentGeneExpressionProles ......... 46 4{6CorrelogramforTime-DependentGeneExpressionProles ........ 46 4{713DierentPatternsfromSparseData(cdc15) .............. 50 4{8DierentPatternsforJointModel ...................... 57 5{1SmoothCoecientsbytheHaarWaveletTransformationatDierentResolutions .................................. 63 5{2OriginalandEstimatedPatternsfor32TimePoints ............ 71 5{3OriginalandEstimatedPatternsfor64TimePoints ............ 72 x

PAGE 11

...................... 73 5{5AICandBICfor64TimePoints ...................... 74 xi

PAGE 12

DNAmicroarrayanalysishasemergedasaleadingtechnologytoenhanceourunderstandingofgeneregulationandfunctionincellularmechanismcontrolsonagenomicscale.Thistechnologyhasadvancedtounravelthegeneticmachineryofbiologicalrhythmsbycollectingmassivegeneexpressiondataduringatimecourse.Currentlyavailableanalysisoftime-dependentgeneexpressiondatahasbeenlimitedtothecharacterizationofgenesandarrayswithsimilarexpressionpatternsbyusingclusteringapproaches,withnoconsiderationofthedevelopmentalmechanismsunderlyinggeneexpression.Wepresentageneralmixturemodelforcatalogingtime-dependentgeneexpressionprolesinwhichthetemporalpatternofgeneexpressionismodeledbyFourierseriesapproximationstodetermineperiodicgeneexpressionprolesandthetime-dependentcovariancematrixstructuredbyautoregressiveorantedependencemodels. Themodelisalsoextendedtoamoregeneralsituationinwhichlongitudinalgeneexpressionprolesforeachgenearemeasuredatunequallyspacedtimeintervalsanddierentgeneshavedierentmeasurementpatternsandisfurtherextendedtomultipleexperimentsofgeneexpression.Theadvantagesofthis xii

PAGE 13

Wealsoimplementtheideaofwaveletdimensionreductionintothemixturemodelforgeneclustering,aimedtode-noisethedatabytransforminganinherentlyhigh-dimensionalbiologicalproblemtoitstractablelow-dimensionalrepresentation.Asarstattemptofitskind,wecapitalizeonthesimplestHaarwaveletshrinkagetechniquetobreakanoriginalsignaldownintospectrumbytakingitsaveragesanddierencesand,subsequently,todetectgeneclustersthatdierinthesmoothcoecientsextractingfromnoisytimeseriesgeneexpressiondata.Thiswavelet-basedmodelwillhavemanyimplicationsforaddressingbiologicallymeaningfulhypothesesattheinterplaybetweengeneactions/interactionsanddevelopmentalpathwaysinvariouscomplexbiologicalprocessesornetworks. xiii

PAGE 14

11 ]).Thetechnologyisbeingmoreandmorewidelyappliedinbiologicalandmedicalresearchtoaddressawiderangeoffundamentalquestions. Biologicalmoleculescanbedividedintoroughlytwocategories,nucleicacidsandproteins.Deoxyribonucleicacid(DNA)andribonucleicacid(RNA)botharenucleicacidsthatfunctiontostoreandencodetheinformationusedtobuildproteins.Chromosomes,themolecularunitsofheredity,arecomposedofDNAorganizedintogenes,whileRNA,alessstablenucleicacid,isusedtodirecttheprocessofproteinsynthesis.Underregulatedconditions,specicregionsofDNAcorrespondingtoparticulargenesaretranscribedintoRNAthatisthentranslatedintoproteins.Proteinsareperhapsbestknownfortheirenzymaticroleinbiologicalcatalysisbuttheyarealsoneededforstructureandsupport,movement,andcellularcommunication.Thecomprehensivestudyofgenescananswerhowto 1

PAGE 15

eectivelyutilizevastdatabasesofnucleicacidsequencesandproteinsequences,inpartthroughtheapplicationsofaspecictechnique,cDNAmicroarray.SeeArmstrongandWiel(2004[ 5 ]),andYangandSpeed(2002[ 85 ])forabriefreviewofmicroarrayexperiments. Amicroarrayisaglassslidecontainingtinyspotsconsistingofwhatareknownasprobesequences.Dependingontheplatformused,probesareeithersingle-strandedcDNA,longoligonucleotidesorshortoligonucleotides.Oligonu-cleotidesareshortersequentialbase-pairsegments,rangingfrom15to70nu-cleotidesinlength,takenfromthehundredsofnucleotidesinaDNAsegmentthatfunctionasagene(Aitman2001[ 2 ];Jordan2002[ 51 ]).TargetRNAisgenerallyextractedfromsamplesofinterest,reversetranscribedintocDNA,labelledwithuorescentdyeandthenhybridizedtothearray.Thesocalledtwocolorarraysarethemostcommonlyused.Thatis,twodierentsamplesarelabelledwithdierentdyes,Cy3(green)andCy5(red),andthenhybridizedsimultaneouslytothesampleslide.Laserexcitationofeachhybridizedtargetcausesaspecicemissionwithngerprint-likespectrabeingproduced.Thespectraaremeasuredbyscanningconfocallasermicroscope,whicharethenimportedintosoftwarethatmergesthetwoimagesandgivesthemspeciccolors.Thegeneexpressionintensityofeachtargetismeasuredinthegivenregion,andthesevaluesareimputedintodatasets(BrownandBotstein1999[ 12 ]).TheideaisthattheuorescentintensityofaspotisequivalenttotheamountofRNAexpressedinthesamplesothatgenesinvolvedinspecicprocessesordiseasescanbeidentiedbylooking,forexample,atdier-encesbetweencelllines,cancertypesorresponsetodrugtreatment.Regardingofgenefunction,predictionscanalsobemadeandgeneticpathwaysmightbecomeclearer.

PAGE 16

Oneofthestrengthsofmicroarraydataexperimentsistheabilitytoscreenthousandsofgenesatthesametimeandoneofthefundamentaltasksofmicroar-raydataanalysisistoidentifygenesthatareregulateddierentlyforaprioridenedbiologicallyrelevantgroupsofsamples.Alsothekeypointissummarizingthelargequantitiesofdataintosmallercomponentsandcurrentlyclusteringanddiscriminantanalysisarecommonlyapplied. 12 ]). Analysisofgeneexpressiondatahasfrequentlyreliedonclustertechniquesbywhichgenesaregroupedintermsoftheirsimilarorcorrelatedexpressionproles(Spellmanetal.1998[ 76 ];Eisenetal.1998[ 29 ];Ramonietal.2002[ 71 ]).Eisenetal.([ 29 ])appliedcorrelation-basedhierarchicalclusteringmethodtocharacterizegeneexpressionpattern.Correlation-basedclusteringmethods,however,restontheassumptionthatthesetofobservationsforeachgeneareindependentandidenticallydistributed.Whilethisassumptionholdswhenexpressionmeasuresaretakenfromindependentbiologicalsamples,itisknowntobenolongervalidwhentheobservationsareactuallyrealizationsofatimeseries,whereeachobservationmaydependonpriorones(BoxandJenkins1976[ 10 ];WestandHarrison1997[ 82 ]).Ramonietal.appliedaspecializedversionofamoregeneralmethodscalledBayesianclusteringbydynamics(BCD)(seeRamonietal.2002[ 72 ])in

PAGE 17

theirpaper.ThisarticlepresentsaBayesianmethodformodel-basedclustering,distance-basedheuristicsearchprocedure,ofgeneexpressiondynamics. Hierarchicalclusteringisoneofthemajoranalyticaltoolsforgeneexpressiondatafrommicroarrayexperimentsbutamajorproblemisassessingthereliabilityoftheclusteringresults.Severalstatisticalapproachesbasedonanormalmixturemodelhavebeenrecentlydeveloped(GhoshandChinnaiyan2002[ 36 ];McLachlanetal.2002[ 60 ]),inwhicheachmixturecorrespondstoatypicalgeneexpressionpattern.GhoshandChinnaiyan(2002[ 36 ])appliedamixturemodel-basedapproachfortheanalysisofmicroarraydata.McLachlanetal.(2002[ 60 ])alsodevelopedsoftwareEMMIX-GENEforthespecicpurposeofamodel-basedapproachtotheclusteringofmicroarrayexpressiondata,inparticular,oftissuesamplesonaverylargenumberofgenes.Theyillustratedthisnewapproachintheclusteringoftwowell-knowndatasetsinthemicroaaryliterature,thecolondataanalyzedinitiallyinAlonetal.(1999[ 4 ]),andtheleukaemiadatarstanalyzedinGolubetal.(1999[ 39 ]).Therehavebeennumerousapplicationsofbothclusteringandmixturemodel-basedapproachestothediscoveryandcharacterizationofgenome-widedierentiallyexpressedgenesinseveralorganisms(Golubetal.1999[ 39 ];Lossosetal.2000[ 57 ]). Whilemuchmicroarraydatahasbeenaccumulatedinstaticexpressionex-periments,inwhichasnapshotofgeneexpressionlevelsistakenfordierenttreatments,thecollectionoftime-seriesgeneexpressiondatahasreceivedanin-creasinginterestfordevelopmentalgenetics(Holteretal.2001[ 44 ];Zhaoetal.2001[ 86 ];Bar-Josephetal.2003[ 7 ];Ramonietal.2002[ 71 ]).Oneofthefunda-mentalissuesintime-seriesgeneexpressionanalysisistheidenticationofgeneswithalteredexpressionprolesovertime.Asforstaticexpressiondata,clusteranalysisisusefulforcataloguingofgeneexpressiondatabasedontheirdynamicfeature[ 71 ].Guoetal.(2003[ 42 ])andParketal.(2003[ 65 ])independently

PAGE 18

proposedamethodfortestingthestatisticalsignicanceoftime-dependentgeneexpressiondata,aimedatidentifyinggeneswithvaryingexpressionprolesovertime.Bar-Josephetal.(2003[ 7 ])developedanalgorithmtorepresenttimeseriesofgeneexpressionascontinuouscurvesusingacubicsplinemethodandusedthisalgorithmtoestimatemissingvaluesintimeseries.Zhaoetal.(2001[ 86 ])usedasingle-pulsemodel(SPM)toidentifyperiodictranscriptsintheSaccharomycescerevisiaeyeastmicroarraydatabasedontheassumptionthatthecellcyclereg-ulatedtranscriptswillpeakonlyoncepercycle.Afewotheralgorithmsalsohavebeendevelopedtomodelthetemporalpatternsofgeneexpression.Forexample,Holteretal.(2001[ 44 ])deducedthetimetranslationalmatrixforDNAmicroarraygeneexpressiondatasetsbymodelingthemwithinalinearframeworkbyusingthecharacteristicmodesobtainedbysingularvaluedecomposition(SVD).AlsoQianetal.(2001[ 70 ])developedanewresource,calledPartsList.Thesystemisbasedontheexistingfoldclassicationsandfunctionsasaformofcompanionannotationforthem.Theirmainideaisthatofcomparisonthroughranking. Oneofthemainlimitationsfortheseavailableapproachesfortime-seriesgeneexpressionanalysisisthattheyhavenotmadeuseofmathematicalfunctionsthatempowerinvestigatorstomodeltime-dependentchangesofgeneexpression.Thereareseveraladvantagesfortheincorporationofmathematicalfunctionsinclusteringexpressionpatterns.First,byapproximatinglevelsofgeneexpressionatdierenttimepointsusingamathematicalfunction,wedonotneedtoestimatetheexpectedexpressionvaluesforeachgenegroupateachpointand,thus,increasethepowerofthemodelduetothereductionofthenumberofunknownparameters.Second,ifthemathematicalfunctionhasasolidbiologicalfoundationforgeneexpressionproles,suchanintegrationcanpotentiallyprovideresultsthatareclosertobiologicalrealitythanthosefromtheapproacheswithnomathematicalimplementation.Third,whileclusteranalysisisabletodetectindependentgroups

PAGE 19

ofgeneswithdierentexpressionproles,mathematicalfunction-basedcataloguingapproachesallowfortheidenticationofanypossiblecurvesincludingthosethatoverlapduringsomeperiodofatimecourse.Last,butnotleast,thismodelprovidesatestableframeworkforbiologicalhypothesesattheinterplaybetweengeneexpressionanddevelopmentalpattern. IthasbeenwellknownthatthetranscriptlevelsofmanyDNAmicroarraysintermsoftheamountofmRNAsvaryperiodicallywithinthecelldivisioncycle(Spellmanetal.1998[ 76 ];DeLichtenbergetal.2005[ 22 ]).Theregulationofthesegenesinaperiodicmannercoincidentwiththecellcyclemayhelptomaintainproperorderduringcelldivisionortoconservelimitedresources.Theoscillationofcellcycle-regulatedgenescanbemathematicallydescribedbyasimpleperiodicfunctionsuchasFourierseriesapproximationexpressedasalinearcombinationofcosineandsinewaves.Thus,byestimatingtheparametersthatdenetheperiodiccurvesforindividualgenes,wecandeterminethedierencesinthetemporalpatternofgeneexpression. Inordertodeterminetheinteractionbetweenthegenesanddevelopment,itisnecessarytomeasureatime-courseofexpressionexperiments.Whilestaticdatafromasamplepopulationareassumedtobeindependentidenticallydistributed,timeseriesdataexhibitastrongautocorrelationbetweensuccessivetimepoints.Avarietyofstationaryandnonstationarymodelscanbeusedtomodelthecovariancestructure.

PAGE 20

Themodelneedstobeextendedtoamoregeneralsituationinwhichlongitu-dinalgeneexpressionprolesforeachgenearemeasuredatunequallyspacedtimeintervalsordatahavemissingvaluesatsomeparticulartimepoints.Alsotheremightbeacaseofhavingmultipleexperimentsofgeneexpressionanddierentgeneshavedierentmeasurementpatternssothatgeneexpressionprolesmightbeexpresseddierentlyinresponsetodierentexperimentalconditions.Sothemodelisfurtherextendedtomultipleexperimentsofgeneexpression.Theadvantagesofthisprocedurelieinthebiologicalrelevanceofresultsobtainedandtheconstruc-tionofageneralframeworkwithinwhichtheinterplaybetweengeneexpressionanddevelopmentcanbetested. Althoughthisparametricmodelispowerfulforthecharacterizationofdierentgenegroups,somelimitationsmaypreventitsbroadanddeepusesinsomeparticularsituations.Forhighdimensionaldata,highdimensionmaymakethecomputationofinversecovariancematrixunstableandthedeterminantapproachinginnityorzero.Ourparametricmodelbasedonamultivariatenormaldensityfunctionwillbeaectedforhigh-dimension,sparsedatabecauseoftheiraccumulatedcontaminationbynoises.Anecienttreatmentofmicroarraydataisthroughdimensionalityreduction,suchaswaveletde-noisingtransformation.Ithasbeenshownthatthelowdimensionalitymodelsarenotonlycomputationallyecient,butalsomorerobustthanhighdimensionalitymodels.Hereweimplementtheideaofwaveletdimensionreductionintothemixturemodelforgeneclustering,aimedtode-noisethedatabytransformingahigh-dimensionalproblemtoitstractablelow-dimensionalrepresentation.

PAGE 21

Letybea-dimensionalsampleofsizen.Inthemixturemodel,eachsubjectyi=(yi1;:::;yi)0,wherei=1;:::;n,isassumedtohavearisenfromoneandonlyone(sayg)ofaknownorunknownnumberofcomponents,eachbeingmodeledbyamultivariatenormaldistributiondensityfg,i.e., (2)=2jgj1=2exp1 2(yiug)01g(yiug):(2{1) Assumingthattherearekpatterns(components)contributingtothevariationamongdierentcurves,suchamixturemodelisthenexpressedas where$=($1;:::;$k)isavectorofthemixtureproportionswhicharecon-strainedtobenon-negativeandsumtounity;u=(u1;:::;uk)containsthecomponent-specicmeanvectors;gisthecovariancematrixforthegthcompo-nent;andfg(yi;ug;g)isdenedasinEq.( 2{1 ). 8

PAGE 22

(Spellmanetal.1998[ 76 ];DeLichtenbergetal.2005[ 22 ]).Theregulationofthesegenesinaperiodicmannercoincidentwiththecellcyclemayhelptomaintainproperorderduringcelldivisionortoconservelimitedresources.Theoscillationofcellcycle-regulatedgenescanbemathematicallydescribedbyasimpleperiodicFourierfunctionexpressedasalinearcombinationofcosineandsinewaves.Thus,byestimatingtheparametersthatdenetheperiodiccurvesforindividualgenes,wecandeterminethedierencesinthetemporalpatternofgeneexpression. Forperiodicallyregulatedgenes,theycanbeapproximatedbyFourierseries(see,forexample,Lasser1996[ 53 ]).Fourierseriesapproximationcanassessperiodicity.So,byapplyingaFourierseriesapproximation,wecanidentifythegeneswhoseRNAlevelsvariedperiodicallywithinthecellcycleandfurtherndtheassociatedamplitudesandphases.Foragivengene,letu(t)denotetheexpecteduorescenceratiovalueatatimepointt(t=1;:::;).Notethattheratiovaluesarelogtransformed(base2forsimplicity,sothatlog2(Cy5=Cy3))totreatinductionsorrepressionsofidenticalmagnitudeasnumericallyequalbutwithoppositesign.Thenu(t)ismodeledbyFourierseriesapproximation,i.e., 2a0+Xm=1amcos2mt T+bmsin2mt T; wherea0isthegene-specicfundamentalfrequency,amandbmarethegenespecicamplitudecoecients,whichdeterminethetimesatwhichthegeneachievespeakandtroughexpressionlevels,respectively,andTisthegene-specicperiodofthecellcycle. 15 ])toobtainmoreparsimonyandstability.Lessrestrictiveconstraints

PAGE 23

canbeimposedbyareparameterizationofthecovariancematricesintermsoftheireigenvaluedecompositions,forexample,inBaneldandRaftery(1993[ 6 ]).TheimportanceofparsimoniouscovariancestructuresisdemonstratedbyLeeandGeisser(1975[ 54 ])andGeisser(1980[ 34 ]).Also,repeatedmeasurementsofacontinuousresponsevariableareobservedovertimeoneachsubjectandweassumethatobservationsondierentsubjectsareassumedtobeindependentandthusonlywithin-subjectcovariancestructureswillbeconsidered.Thus,ourfocusonthechoiceofacovariancestructureisparsimoniouscovariancestructureasafunctionofthetimesofmeasurement. 20 ]).Itcanbeassumedthatresidualvariancesandcovariancesarecommontoallcomponents.Thatis,incasewhererepeatedmeasurementsareequallyspacedintime,theresidualvariance(2)atdierenttimepointsisconstantandthecovariancebetweentwodierenttimepoints,tcandtc0,iscc0=2jcc0,whereistheresidualcorrelationbetweentwoadjacenttimepoints.Thecommoncovariancematrixforeachcomponentisthenexpressedas whereisthenumberoftimepoints. WhiletheAR(1)modelgreatlyfacilitatesthemathematicalcalculationofthecovariancematrix,i.e.,itprovidesaclosed-formexpressionofmatrixdeterminant

PAGE 24

andinverseforanynumberoftimepoints,thestationarityassumptioncanbeeasilyviolatedwithheterogeneousvariancesacrossdierenttimepoints.Toremovetheheteroscedasticproblemoftheresidualvariance,whichviolatesabasicassumptionofthesimpleAR(1)model,twoapproachescanbeused.Therstapproachistomodeltheresidualvariancebyaparametricfunctionoftime,asoriginallyproposedbyPletcherandGeyer(1999[ 66 ]).Butthisapproachneedstoimplementadditionalparametersforcharacterizingthetime-dependentchangeofthevariance.ThesecondapproachistoembedCarrollandRuppert's(1984[ 13 ])transform-both-sides(TBS)modelintothegrowth-incorporatednitemixturemodel(Wuetal.2004[ 84 ]),whichdoesnotneedanymoreparameters.BothempiricalanalyseswithrealexamplesandcomputersimulationssuggestthattheTBS-basedmodelcanincreasetheprecisionofparameterestimationandcomputationaleciency.Furthermore,theTBSmodelpreservesoriginalbiologicalmeansofthecurveparametersalthoughstatisticalanalysesarebasedontransformeddata. 33 ]).Theantedependencemodelrelaxesthestationarityassumptionandhenceshowspowerfulapplications. Theantedependencemodelstatesthatanobservationataparticulartimetdependsonthepreviousones,withthedegreeofdependencedecayingwithtimelag.Ifanobservationattimetisindependentofallobservationsbeforetr,thisantedependencemodelisthoughttobeoforderr.Theantedependencemodelisextendedtotthestructureoftime-dependentvarianceandcorrelation,leadingto

PAGE 25

thestructuredantedependence(SAD)model(Nu~nez-AntonandZimmerman2000[ 63 ]). Theconceptofantedependenceisequivalentto(y(1);;y())havingaMarkoviandependenceoforderr.Theorderrservesasamemorygauge,wherer=0correspondstoindependenceandr=1toarbitrarymultivariatedependence.Aparametricallyspecieddenitionforindividualiisgivenas wherer=min(r;t1),t;tt0'sareunrestrictedantedependenceparameters,andindependentnormalrandomvariablei(t)mayhavetime-dependentvariances,2(t),termedinnovationvariances. LikeARmodels,theantedependencemodel(2.5)allowsforserialcorrelationwithinsubjects,butunlikeARmodelsitdoesnotassumethatthevariancesareconstantnorthatcorrelationsbetweenmeasurementsequidistantintimeareequal.Theantedependencemodel(2.5)iscalledtheunstructuredantedependencemodeloforderr(UAD(r))because(r+1)(2r)=2parameters,includingthevariances,2(t),andcovariances,(t;tt0),amongmeasurementsatdierenttimepoints,arenotexpressedasafunctionofasmallersetofparameters(Nu~nez-AntonandZimmerman2000[ 63 ]). TomaketheUAD(r)modelmoreparsimonious,Nu~nez-AntonandZimmerman(2000[ 63 ])proposedso-calledstructuredantedependence(SAD)models.OneusefulclassmodelstheautoregressivecoecientswiththeBox-Coxpowerlawandmodeltheinnovationvarianceswithaparametricfunction,i.e.,

PAGE 26

wheretandtt0aremeasurementtimes,w(;)equals(1)=if6=0andequalslog()if=0,andv()isafunctionofrelativelyfewparameters(e.g.,alow-orderpolynomial).Thus,dierentfromGabriel's(1962[ 33 ])originaltreatment,weonlyneedtoestimatethreeparameterstomodeltheinnovationvariancesifaquadraticpolynomialisused,regardlessofthenumberoftimepoints. AsasimpliedexamplewiththeSAD(1)modelinwhichinnovationvariancesareconstantovertimepoints,Jarezicetal.(2003[ 46 ])derivedtheanalyticalformsforvarianceandcovariancefunctionsamongtime-dependentmeasurements,expressed,respectively,as forequallyspacedrepeatedmeasurements.Itcanbeseenthatalthoughconstantinnovationvariancesareassumed,theresidualvariancecanchangewithtime(Jarezicetal.2003[ 46 ]).Also,forthesimplestSADmodel,thecorrelationfunctionisnon-stationarybecausethecorrelationdoesnotdependonlyonthetimeintervalt2t1butalsodependsonthestartandendpointsoftheinterval,t1andt2. 21 ]),Nelder-Meadsimplexalgorithm(NelderandMead1965[ 62 ]),BFGSquasi-Newtonmethod(see,forexample,FletcherandReeves1964[ 30 ]),andBayesianapproachesimplementedwiththeMCMCalgorithm(RobertandCasella1999[ 73 ]),ashavevariousotherlesswell-knownapproaches.

PAGE 27

Thelikelihoodofparameters=($,u,g)0constructedonthebasisofthemixturemodeldenedasinEqs.( 2{1 )and( 2{2 )isexpressedas Theunknownparameterscontaintwotypes:one,p=($),describingtheproportionofgenegroupsandthesecond,q=(u;g),describingthenormaldistributionofgeneexpressionprole.Alltheparameterscanbeestimatedbymaximizingthelog-likelihoodbydierentiatingthelog-likelihoodfunctionwithrespecttoanyelementintheunknownvector(p;q),i.e., @logL(yjp;q)=nXi=1"Pkg=1fg(yi;q)@$g @qfg(yi:q) @qlogfg(yi:q) @qlogfg(yi:q))set=0: Denetheposteriorprobabilityofgeneithatbelongstopatterngby gi=$gfg(yi:q) Themaximumlikelihoodestimates(MLEs)oftheunknownparametersareobtainedas

PAGE 28

^$g=Pni=1gi Pni=1gi: Ifweassumethateachcomponenthasthesamecovariancematrix,gshouldbereplacedby.Thentoobtain^,nshouldbeinthedenominatorinEq.( 2{12 )insteadofPni=1gi. WecanimplementtheEMalgorithm(Dempsteretal.1977[ 21 ])toestimate($;u;).IntheEstep,theposteriorprobabilityforeachgeneiscalculatedbyEq.( 2{11 )andtheunknownparametersarethenestimated/updatedintheMstepusingEq.( 2{12 ).Thisprocedureisrepeateduntiltheestimatesconverge. Inourmodel,themeanvectorugwillnotbeestimateddirectly;instead,itisapproximatedbytheFourierseriesequation.SuchanapproximationisreasonablebecausedynamicgeneexpressionthatregulatesperiodiccellcyclesfollowstheFourierseriesequation.WhentheFourierseriesapproximationisusedtomodelthemeanvector,wewillonlyneedtoestimatethemathematicalparameters,ug=(ag0;agm;bgm;Tg;m=1;:::;;g=1:::;k)thatdenetheshapeoftime-dependentgeneexpressionproleforgenegroupg.Ontheotherhand,thecovariancematrixcanbestructuredbytheAR(1)orSAD(1)modelwiththestructuringparameters,v=(;2)or(;2),respectively. TheEMalgorithmcanstillbeusedtoestimatetheparametersthatmodelthemeanandcovariancestructures.However,therearenoclosedformsavailablefortheEMalgorithmtoestimatealltheseparameters.Onepowerfulapproachforestimatingtheunknownparameters,ugandv,isNelder-Meadsimplexalgorithm(NelderandMead1965[ 62 ]).Nelder-Meadsimplexalgorithmisadirect

PAGE 29

searchmethodthatusesonlyfunctionvalues(doesnotrequirederivativesanddoesnotusenumericaloranalyticgradients)andhandlesnon-smoothobjectivefunctions.Thisisreferredtoasunconstrainednonlineroptimization.Ateachstepofthesearch,anewpointinornearthecurrentsimplexisgenerated.Thefunctionvalueatthenewpointiscomparedwiththefunction'svaluesattheverticesofthesimplexand,usually,oneoftheverticesisreplacedbythenewpoint,givinganewsimplex.Thisstepisrepeateduntilthediameterofsimplexislessthanthespeciedtolerance. AnotheravailableapproachistheBFGSquasi-Newtonmethod.Newton'smethodoftenconvergesfasterthanconjugategradientmethodsbutitiscomplexandexpensivetocomputetheHessianmatrix.Thereis,however,aclassofalgo-rithmsthatisbasedonNewton'smethod,butwhichdoesnotrequirecalculationofsecondderivativesandthesearecalledquasi-Newton(orsecant)methods.TheyupdateanapproximateHessianmatrixateachiterationofthealgorithmandtheupdateiscomputedasafunctionofthegradient.AlargenumberofHessianup-datingmethodshavebeendeveloped.TheformulaofBroyden,Fletcher,GoldfarbandShanno(BFGS)isthoughttobethemosteective(see,forexample,FletcherandReeves1964[ 30 ]).TheBFGSquasi-NewtonmethodusestheBFGSformulaforupdatingtheapproximationoftheHessianmatrix.TheformulagivenbyBFGSisHu+1=Hu+quq0u

PAGE 30

trajectories.Therstimportanthypothesisconcernsoveralldierencesintran-scriptionalexpressionproleamongdierentgroupsofmicroarraygenes,whichisexpressedasH0:uguforg=1;:::;kH1:Atleastoneoftheequalitiesabovedoesnothold: BecauseeachofFouriercoecientsdescribesadierentaspectofperiodiccurves,hypothesescanbemadefortheseparametersindividually.Inaddition,thepeaktotroughratio,am=bm,reectstheamplitudeofexpressionproleandcanbetestedforitsdierencesamongthegenegroupsdetected.Forexample,ifthemeancurveismodeledbasedontheFourierseriesoforderone,i.e., 2a0+a1cos2t T+b1sin2t T; thehypothesiscanbeexpressedasH0:ag1=bg1a1=b1forg=1;:::;kH1:Atleastoneoftheequalitiesabovedoesnothold:

PAGE 31

Theslopeofgeneexpressionprolemaychangewithtime,whichsuggeststheoccurrenceofgeneexpressiontimeinteractioneectsduringatimecourse.Thedierentiationofu(t)withrespecttotimetrepresentsaslopeofgeneexpression.Iftheslopesataparticulartimepointtaredierentbetweenthecurvesofdif-ferentgenegroups,thismeansthatsignicantgeneexpressiontimeinteractionoccursbetweenthistimepointandnext.Thetestforgeneexpressiontimeinteractioncanbeformulatedwiththehypotheses:H0:d dtug(t)=d dtu(t)vs:H1:d dtug(t)6=d dtu(t);g=1;:::;k: 2.6.1NumberofMixtureComponents 56 ];ChenandKalbeisch2004[ 16 ]).Wepresentavailablecriteriaformodelselectioninthissection.Anobviouswayofapproachingtheproblemistousethelog-likelihoodratioteststatisticstotest.Howeverwithmixturemodels,regularityconditionsdonotholdfor2log[L(H0)=L(H1)]tohaveitsusualasymptoticnulldistributionofchi-squaredwithdegreesoffreedomequaltothedierenceinthenumberofparametersinthetwohypotheses.Ingeneral,themaximizedlikelihoodofyisanincreasingfunctionofk,andthus,ifwewantaselectioncriterionforchoosingthenumberofpatternsinthemixture,thelikelihoodcannotbeused. OneoftheleadingselectionmethodsistheAkaikeInformationCriterion(AIC;Akaike1974[ 3 ]).Thisisdesignedtobeanapproximatelyunbiasedestimatorof

PAGE 32

theexpectedKullback-Leiblerinformationofattedmodel.TheminimumAICproducesaselectedmodelwhichisclosetothebestpossiblechoice.Hisproposalinthecontextofthemixturemodelistochoosethenumberofcomponents,k,whichminimizesAIC(k)=2logL(^)+2N(k),whereN(k)isthenumberofindependentparameterswithinthemodel.TheBayesianInformationCriterion(BIC;Schwartz1978[ 75 ])isalsoavailableandBIC(k)=2logL(^)+N(k)log(n),whereN(k)isthenumberofindependentparametersin^.TheselectedmodelistheonewiththesmallestBIC.Thereisnoclearconsensusonwhichcriterionisthebesttouse,althoughtheempiricalworkofFraleyandRaftery(1998[ 31 ])seemstofavorBIC.Sinceinformationcriteriapenalizemodelswithadditionalparameters,theAICandBICmodelorderselectioncriteriaarebasedonparsimony.NoticethatsinceBICimposesagreaterpenaltyforadditionalparametersthandoesAIC,BICalwaysprovidesamodelwithanumberofparametersnogreaterthanthatchosenbyAIC. CeleuxandSoromenho(1996[ 14 ])developedamethodforchoosingthenumberofmixturecomponentsbasedonNormalizedEntropyCriterion(NEC).AccordingtoNEC,thelog-likelihood,logL($;u;jy),Lkforsimplicity,withkpatternscanbedecomposedintotwoparts,i.e.,Lk=Ck+Ek; istheclassicationlog-likelihood,andEk=kXg=1nXi=1giloggi

PAGE 33

Thenormalizedentropycriteriontobeminimizedforassessingthenumberofmixturecomponentsisgivenby NECk=Ek ButsinceNEC1isnotdenedascanbeseenfromabove,weareunabletocomparesituationsk=1vs.k>1directlywithNECk.Here,weuseageneralprocedurefortheonecomponentcasetodecidebetweenk=1andk>1(Biernackietal.1999[ 9 ]).LetkbethevaluethatminimizesNECk(2kksup)withksupbeinganupperboundforthenumberofmixturecomponents.WechoosekifNECk1,otherwisewedeclareonecomponentforthedata. 46 ])proposedanadhocapproachformodelselectiontotestanoptimalorderoftheSADmodel.Theirstrategyistoincreasetheantedependenceorderuntiltheadditionalantedependencecoecientisclosetozero. Nu~nez-AntonandZimmerman(2000[ 63 ])proposedusingtheAICtoselectthebestmodel.HurvichandTsai(1989[ 45 ])showedthatAICcandrasticallyunderestimatetheexpectedKullback-Leiblerinformationwhenonlyfewrepeatedmeasurementsareavailable.Instead,theyderivedacorrectedAIC,expressedas AICC=log^2+1+r= where^2isthewhitenoisevariance,isthenumberofrepeatedmeasurementsandristheorderofthemodel.ThenumberofparametersisheavilypenalizedsothatmodelsselectedbyAICCaretypicallymuchmoreparsimoniousthanthoseselectedbyAIC(HurvichandTsai1989[ 45 ]).Analternativecriterion,BIC,thatselectsamodelwithamaximumposteriorprobability(Schwarz1978[ 75 ])canalsobeusedtodeterminethebestantedependenceorder.

PAGE 34

76 ])reportedtheresultsof800periodicallyexpressedtranscriptionalgenesinthegenomeofyeast(Saacharomycescerevisiae).DNAmicroarrayswereusedtoanalyzemRNAlevelsfor6yeaststrainsincellculturesthathavebeensynchronizedbythreeindependentmethods,factorarrest,elutriationandarrestofacdc15temperature-sensitivemutant.Eachmethodproducespopulationsofyeastcellssynchronizedintermsoftheirphaseinthecellcycle.AllthemicroarraydatareportedinSpellmanetal.(1998[ 76 ])aregivenathttp://cellcycle-www.stanford.edu. AsdescribedbySpellmanetal.(1998[ 76 ]),RNAwasextractedfromeachofthesamplesandfromacontrolsample(unsynchronizedcells).cDNAswerelabelledwithCy5uor(red)forsynchronizedsamplesandCy3uor(green)forthecontrols.MixedlabelledcontrolandexperimentalcDNAswerehybridizedtoindividualmicroarrayscontainingall6178yeastgenes.Theaverageuorescenceintensityforeachuorwithineacharrayspotwasrecorded.Thedataofgeneexpressionweremeasuredbynormalizeduorescenceratio,log2(Cy5/Cy3),atdierenttimepoints.Somespotswereeliminatedforfurtheranalysisbecauseofpoorquality,whichledtomissingvaluesforaportionofgenesatsometimepoints.TheyusedDNAmicroarraystoanalyzemRNAlevelsincellculturesthathadbeensynchronizedbythreeindependentmethods.TheyanalyzedthesedataandidentiedgeneswhoseRNAlevelsweresimilartotheRNAlevelsofgenesalreadyknowntoberegulatedbythecellcycle(geneswhoseRNAlevelsvariedperiodicallyduringthecellcycle).Theyfoundthat800yeastgenesarecellcycleregulated 21

PAGE 35

(periodicallyregulated).Forourpreliminaryanalysishereweusedtime-dependentgeneexpressiondataderivedfromthecdc15experiment.Itcontains632genesthathavecompletedataduringall24timepoints(=24). 29 ])isdonebasedontherstorderFouriercoecientestimates,(^ai0;^ai1;^bi1;^Ti).Andalsomixturemodel-basedanalysisisdonebasedontheestimatedFourierparameters,(^ai0;^ai1;^bi1;^Ti).WhiletheresultsfromourapproacharebroadlyconsistentwiththosebySpellmanetal.(1998),ourapproachdisplayssignicantadvantagesinseveralstatisticalandbiologicalaspects. 2{13 ),wherea0isthegene-specicfun-damentalfrequency,a1andb1arethegenespecicamplitudecoecients,whichdeterminethetimesatwhichthegeneachievespeakandtroughexpressionlevels,respectively,andTisthegene-specicperiodofthecellcycle. Oftenwehavenopriorinformationabouttheperiods.However,theparameterT,periodofthecellcycle,canbeestimatedusingfastFouriertransform(FFTinmatlab).AcommoncomputationaltechniquestostudyperiodicdataistheFFTalgorithm(Priestley1981[ 68 ]),whichndsperiodicitiesbysearchingforsharp

PAGE 36

peaksintheordinaryperiodogramscalculatedfromtheFouriertransformofthetimeseries(Durbin1967[ 28 ]).AcommonuseofFouriertransformsistondthefrequencydomain,andthediscreteFouriertransformofthesignalisfoundbytakingfastFouriertransform.TheFFTfunctionsarebasedonalibrarycalledFFTW(http:www.tw.organdseeDuhamelandVetterli1990[ 27 ];FrigoandJohnson1997[ 32 ]).NowforaxedvalueofT,^T,thegenespeciccoecients,i=(ai0;ai1;bi1),areestimatedforeachgeneibynonlinearcurvettingintheleastsquaressense.Thatis,anobservedgeneexpressiongiventimet,yi(t),thecoecients,i,arefoundtobest-ttheequation,mini1 2kF(i;t)yi(t)k22=1 2Xt[F(i;t)yi(t)]2; 2{13 ).WefoundthattheFourierseriesoforderoneisappropriatebasedonthevalueoftheresiduals.Fig. 3{1 illustratesveexamplesofgeneswhoseexpressionprolesarettedbyleastsquaresnonlinearcurvettingbasedonFourierseriesapproximationoforderone.TheestimatedFouriercoecientsaredierentbecausethesevegenesdisplaydierentialgeneexpressioncurves.Itseemsthat,tomodelgene1inFig. 3{1 ,aFourierseriesapproximationofhigherorderisneeded. TheestimatedFouriercoecientsxi=(^ai0;^ai1;^bi1;^Ti)0forallgenesestablisha(6324)datamatrixinwhichtherowsrepresentdierentgenesandthecolumnsrepresentdierentparameters.Twodierentapproacheswereusedtoclassifygenesbasedontheirrst-orderFouriercoecientestimatesthatdenethetemporalpatternofgeneexpression. 3{1 liststhecopheneticcorrelation

PAGE 37

Figure3{1. DierentTemporalPatternsofGeneExpressionProlesofFiveRan-domlyChosenGenesfromcdc15Experiment(NumbersInsidetheBracketsareEstimatesoftheFourierCoecients,(^ai0;^ai1;^bi1;^Ti)) coecientswhichcomparethedistanceinformation(Z),generatedbyeachlinkagefunction,andthedistanceinformation(W),generatedbyeachdistancefunction.ThecopheneticcorrelationbetweenZandWcanbedenedasc=Pi
PAGE 38

Table3{1. CophenetCorrelationCoecientsforDierentClusterAnalysisAp-proacheswithFiveComponentsfor632Genefromcdc15Experiment EuclideanSeuclideanMahalanobisCityblockMinkowski Average0.95010.83250.79880.95050.9501Single0.88090.66810.64360.86470.8809Complete0.90910.75110.60880.90460.9091Ward0.80860.67210.67760.81450.8086CosineCorrelationHammingJaccard

PAGE 39

coecients,beingspecictopatterng;and=0BBBBBBB@2a0a0a1a0b1a0Ta1a02a1a1b1a1Tb1a0b1a12b1b1TTa0Ta1Tb12T1CCCCCCCA ThelikelihoodofestimatedFourierparameters(x)isconstructedonthebasisofthemixturemodeldenedabove,expressedas Theunknownparameters($;u;)areestimatedbymaximizingthelog-likelihoodandimplementingtheEMalgorithm(Dempsteretal.1977[ 21 ]).Denetheposteriorprobabilityofgeneithatbelongstopatterngby gi=$gfg(xi;ug;) Themaximumlikelihoodestimates(MLEs)oftheunknownsareobtainedas ^$g=Pni=1gi

PAGE 40

Table3{2. Log-LikelihoodRatio(LR)TestStatisticsandNECValuesUnderDif-ferentNumbers(k)ofMixtureComponents kLRNEC 25400.020938010.100948380.121658380.2743 IntheEstep,theposteriorprobabilityforeachgeneiscalculatedbyEq.( 3{2 )andtheunknownparametersarethenestimated/updatedintheMstepusingEq.( 3{3 ).Thisprocedureisrepeateduntiltheestimatesconverge. 3{1 forveparticularexamples).Instep2,theestimatesoftheFouriercoecientswereusedtosort632cellcycle-regulatedgenesintodierentgroups.Theclusteringalgorithmdetectedvedierentgroups.Eachgroupcontainsadierentnumberofgenesbutwithasimilarexpressionpatternwithineachgroup.AsshowninFig. 3{2 ,thesevegroupsdisplayanincreasedpeak/trough(a1=b1)ratiofromgroup1to5anddierentperiods.ThevegroupsclassiedbytheFouriercoecientsareconsistentforsomegeneswiththosesortedaccordingtothephaseofexpression(Spellmanetal.,1998[ 76 ]).Thedierenceofgeneclusteringbetweenthesetwoapproacheswasevaluatedbythelikelihood.Accordingtoourapproach,thelikelihoodcalculatedfortheveclassiedgroupsis1323.2,whereasthelikelihoodofSpellmanetal.'sapproachis505.4.ThissuggeststhatourapproachmaybestatisticallybettersupportedthanSpellmanetal.'sapproach. Themixturemodel-basedclassicationapproachwasfurtherusedtogroupthesamegenesbasedonthefourestimatedFouriercoecients.Byassumingdierentnumbersofcomponentsinvolvedinthemixturemodel,wecalculatedtheLRvaluesforgeneclustering.AsshowninTable 3{2 ,theLRvaluesunderdierentnumbers

PAGE 41

Figure3{2. FiveDistinctTemporalPatternsofGeneExpressionProlesDetectedfor632GenesCollectedfromcdc15ExperimentUsingtheTraditionalClusterAnalysis ofexpressionpatternsareallsignicant,suggestingthatthesecellregulatedgenesdisplaydierentiatedexpressiontrajectoriesduringtimecourse.Asacriteriontobestenumeratethemixturecomponents,theNECvalueswerecalculatedusingEq.( 2{14 )andthesevaluesincreasedwiththenumberofcomponentsinvolvedfromtwotove(seeTable 3{2 ).Thus,theoptimalnumberoftemporalpatternsfortheexpressionof632genesconsideredissuggestedtobetwobasedonNEC.Thesegroupsarestrikinglydierentinthetime-dependentshapeofexpressionprole.Group1hasthepeaktotroughratioof0.10andashortperiodvalue3.22,assupposedto1.67and12.19forgroup2,respectively(seeFig. 3{3 ).Theproportionsofthegenesbelongingtogroup1and2wereestimatedas0.15and0.85,respectively.Thesegroupsofgenesarethussuggestedtoregulatecellcyclesforyeastinadierentmanner.

PAGE 42

Figure3{3. TwoDistinctTemporalPatternsofGeneExpressionProlesDetectedfor632GenesfromtheMixtureModel-BasedClassicationApproach(TheNumbersInsidetheBracketsaretheEstimatesoftheFourierCoecients(a0;a1;b1;T)) expressiontrajectories,followedbytheclassicationoftemporalpatternsforgeneexpressionbasedonstatisticalapproaches.Themostimportantadvantageofourapproach,ascomparedtopreviouslypublishedones,istoclustergenesaccordingtotheirunderlyingexpressionmechanismsandallowforthetestofhypothesesattheinterplaybetweengeneexpressionanddevelopment.Forexample,ourapproachcanexplorecellcyclegeneticmechanismsthatcontrolwhereandwhencellsdivideduringdevelopment. ThesimilaritiesamongtheprolesweredetectedbycomparingtheFouriercoecientsthatjointlydenetheshapeofexpressioncurves.Thishasthenmadeitpossibletoassignco-expressedgenestoclusters.Eachobtainedclustercontainscell-cyclerelatedgenesthatshowasimilarperiodicbehavior(Figs. 3{2 and 3{3 ).Theobtainedclusterswerethencomparedagainsttheclassicationbytraditionalclusteringapproaches.Whiletheresultsarebroadlyconsistentformanygenesbetweenthetwoapproaches,itseemsthatourapproachisstatisticallymoresupported. Ourapproachcanbelimitedinclusteringthosegenesthatarenotperiodicallyexpressedduringthecellcycle.Forexample,gene1inFig. 3{1 cannotbettedbytheFourierseriesapproximationpossiblybecauseitisnotdirectlyrelatedtothecellcycle.Suchnon-periodicgenesmaybeeasiertodetectusingtraditional

PAGE 43

clusteringapproaches.Ourapproachhasbeenderivedforasinglemicroarrayexperiment.Tobettercharacterizegeneclusters,however,multipleexperimentseachwithadierentmeasurementscheduleareoftenused.Forexample,Spellmanetal.(1998[ 76 ])usedthreeseparateexperimentsbasedonelutriation,-factorandmutationcdc15tosynchronizethecellcyclesofyeasts.Theideapresentedinthispreliminaryanalysiscanbeextendedtocombinethesethreedierentexperiments,sothatthepowertodetectdistinctgeneclusterscanbeincreased.Fromabiologicalperspective,thiscombinationanalysiscanalsoallowforthecharacterizationofhowagivengenegroupisperiodicallyexpresseddierentlyinresponsetodierentexperimentalconditions.

PAGE 44

38 ]).Celldivision(Mitchison2003[ 61 ]),circadianrhythms(Crosthwaite2004[ 18 ];Proloetal.2005[ 69 ]),morphogenesisofperiodicstructures,suchassomitesinvertebrates(Daleetal.2003[ 19 ]),andcomplexlifecyclesofsomemicroorganisms(Lakin-ThomasandBrody2004[ 52 ];Roveryetal.2005[ 74 ])areallexcellentrepresentativesofbiologicalrhythms.Fromamechanisticperspective,rhythmicbehaviorarisesingeneticandmetabolicnetworksasaresultofnonlinearitiesassociatedwithvariousmodesofcellularregulation. Thecomplexityandinherentperiodicityofrhythmicprocessescanbewellde-scribedbymathematicalmodels.Forexample,mathematicalmodelsofincreasingcomplexityforthegeneticregulatorynetworkproducingcircadianrhythmsintheyDrosophilapredicttheoccurrenceofsustainedcircadianoscillationsofthelimitcycletype(Leloupetal.1999[ 55 ]).Whenincorporatingtheeectoflight,themodelsaccountforphaseshiftingoftherhythmbylightpulsesandforentrainmentbylight-darkcycles.Themodelsalsoprovideanexplanationforthelong-termsuppressionofcircadianrhythmsbyasinglepulseoflight.Stochasticsimulationspermittotesttherobustnessofcircadianoscillationswithrespecttomolecularnoise(Gonzeetal.2002[ 40 ];Gonzeetal.2003[ 41 ]). 31

PAGE 45

Therecentdevelopmentofgeneexpressiontechnologieshaveoeredbiologistswithanuniqueopportunitytostudythemechanismsthatcontrolaparticularbiologicalrhythmmoreclosely.Pandaetal.(2002[ 64 ])employedhigh-densityoligonucleotidearraystotracegeneexpressioninthemousetissuesamplestakenevery4hoursduringtwocompletecircadiancycles.Theyalsoidentiedclustersofcircadian-regulatedgenesamongmorethan7000geneswithacosinewave-ttingalgorithm(Harmeretal.2000[ 43 ]).About650cyclingtranscriptsweredetectedtobeundercircadianregulationspecictoeitherthesuprachiasmaticnucleiortheliver.Thesestudiesallowedtheauthorstostudyhowthemajoroscillatorinthesuprachiasmaticnucleiandintheliverregulatesbehavioralandphysiologicalrhythmsinthewholeorganism. Toassociatetheproleofgeneexpressionwithaphysiologicalfunctionofinterest,itiscrucialtoclusterthetypesofgeneexpressionbasedontheirperiodicpatterns.Statisticalmodellingandalgorithmsplayacentralroleinthecatalogingofdynamicgeneexpressionproles.Althoughanumberofcomputationalmodelshavebeendevelopedforgeneclusteringinthepastyears(Eisenetal.1998[ 29 ];Ramonietal.2002[ 71 ];GhoshandChinnaiyan2002[ 36 ];McLachlanetal.2002[ 60 ]),powerfulapproachesfordetectingtemporalpatternsofgeneexpressionarestillinanearlystage,despitethefundamentalimportanceofthisissueindrawingacomprehensivepictureoflifeprocesses.Guoetal.(2003[ 42 ])andParketal.(2003[ 65 ])independentlyproposedamethodfortestingthestatisticalsignicanceoftime-dependentgeneexpressiondata,inanattempttoidentifygeneswithvaryingexpressionprolesovertime.SimilarworkhasbeendonebyQianetal.(2001[ 70 ])andHolteretal.(2001[ 44 ]).However,allthesemethodologicalstudieshavenotconsideredtheperiodicrhythmsofgeneexpression.Also,manystatisticalissuesrelatedtotime-seriesautocorrelationshavenotbeenextensivelyexplored.

PAGE 46

Inthischapter,wewillpresentanewstatisticalmodelforclusteringmir-croarraygenesbasedontheirdistinctpatternsofperiodicexpression.Giventherhythmicexpressionofgenes,acommoncomputationaltechnique,theFourierseriesapproximation(Priestley1981[ 68 ]),willbeincorporatedintoamixture-basedframeworkinwhichperiodicitiesofgeneexpressioncanbeestimated.ManypreviousstudiessuggestedthattheFourierseriesapproximationcanwellbeusedtotcircadiangeneexpressionanalysisandndpossibleperiodicsignalsinvariousorganismsincludingyeastandhumancells(Spellmanetal.1998[ 76 ];Wichertetal.2004[ 83 ]).Glynnetal.(2006[ 37 ])developedanewalgorithmbasedontheFourierseriesapproximationtocharacterizeperiodicpatternsforunevenlyspacedgeneexpressiontimeseries.ThroughtheimplementationoftheFourierseriesapproximationinthemixturemodel,itispossibletotestseveralbiologicallymeaningfulcharacteristicssuchassharppeaksintheordinaryperiodogramscal-culatedfromtheFouriertransformofthetimeseries(Durbin1967[ 28 ]).Wewilluseapublishedexampleforcellcycle-regulatedgenesinSaccharomycescerevisiae(Spellmanetal.1998[ 76 ])tovalidateourmodel.Thestatisticalbehaviorofthemodelwillbeexaminedthroughsimulationstudies. 4.2.1Methodology 4.2.1.1Mixturemodel 59 ])havebeenwidelyusedtomodelthedistributionsofavarietyofrandomphenomena.Multivariatenormalityisgenerallyassumedformultivariatedataofacontinuousnature.Also,inamixturemodel-basedclusteranalysis,itisreasonabletotmixturesofellipticallysymmetriccomponentdensitiessincetheclustersinthedataareoftenessentiallyellipticalinshape.Withinthisclassofcomponentdensities,themultivariatenormaldensityisaconvenientchoice.Thismultivariatenormal

PAGE 47

mixturemodelcanbeemployedtodetectdierentpatternsingeneexpressionproles. Assumethatngenesaremeasuredateachofevenlyspacedtimepoints.Let(y1,...,yn)beobserved-dimensionalgeneexpressiondata.Supposetherearekcomponentsinthemixturemodel.Thismeansthateachgeneisassumedtoarisefromone(andonlyone)ofthekpossibleperiodicpatternsofexpression,thedistributionofwhoseexpressiondataisexpressedasthek-componentmixtureprobabilitydensityfunctionasdenedinEq.( 2{2 ).Inthecaseofmultivariatenormalmixturemodelsfg(yi;ug;)inEq.( 2{2 )isdenotedbythemultivariatenormalprobabilitydensitywithmeanvectorugandthecommoncovariancematrix,denedasinEq.( 2{1 ). Fromtheaboveanalysis,dierentgeneexpressionpatternscanbedetectedbyestimatingthenumberofmixturecomponents,thetime-dependentmeansofeachcomponentandthecovariancematrix.Bycomparingthedierencesofthemeanvectoramongdierentcomponents,itispossibletotesthowresponsecurvesdieramongdierentcomponentsand,further,addressfundamentallyimportantbiologicalissuesregardingtheinterplaybetweengeneexpressionandbiologicalrhythms. InEq.( 2{1 ),themeanvectorforgeneexpressionpatterng,ug=(ug(1);:::;ug()),canbemodelledbyaFourierseriesapproximationofvariousorders.Thus,thelog

PAGE 48

ratiogeneexpressionvalueofthegthgeneclusteratatimepointtcanbeex-pressedas 2a0g+a1gcos2t Tg+b1gsin2t Tgfortherstorder1 2a0g+a1gcos2t Tg+b1gsin2t Tg+a2gcos4t Tg+b2gsin4t Tgforthesecondorder1 2a0g+a1gcos2t Tg+b1gsin2t Tg+a2gcos4t Tg+b2gsin4t Tg+a3gcos6t Tg+b3gsin6t Tgforthethirdorder; wherea0gisthepattern-specicfundamentalfrequency,(a1g,a2g,a3g)and(b1g,b2g,b3g)arethepattern-specicamplitudecoecients,whichdeterminethetimesatwhichthegeneachievespeakandtroughexpressionlevels,respectively,andTgistheperiodoftheexpressionofthegthgenecluster. Anumberofapproacheshavebeenproposedtomodelthecovariancestructureofserialmeasurements.Acommonlyusedapproachforstructuringthecovarianceistherstorderautoregressive(AR(1))model,denedasinEq.( 2{4 )(DavidianandGiltinan1995[ 20 ];VerbekeandMolenberghs2000[ 79 ]).OneadvantageofusingtheAR(1)modelisthatitprovidesageneralexpressionforcalculatingthedeterminantandinverseofthematrixforanynumberoftimepointsmeasured.Notethatitassumesvariancestationarityandcorrelationstationarity,i.e.,theresidualvarianceatdierenttimepointsisthesame,expressedas2,andthecorrelationbetweentwodierenttimepointstcandtc0decreasesexponentiallyinwithtimelag,expressedascorr(tc;tc0)=jtctc0j.

PAGE 49

denedtobe1or0accordingtoyiarisingornotfromthegthcomponentofthemixturemodel,i=1;:::;n;g=1;:::;k.Thevariables(z1;:::;zn)areindependentandfromamultinomialdistributionconsistingofonedrawonkcategorieswithprobabilities($1;:::;$k).Thenwehave(z1;z2;:::;zn)iidMultk(1;$); IntheM-step,theintentistochoosethevalueof=(u;;$)0thatmax-imizesE[logLc()jy].ThederivationprocessoftheestimatesoftheunknownparametersintheMstepisgivenintheAppendixA.TheEandM-stepsareiter-atedrepeatedlyuntiltheparameterestimatesconverge.Themaximumlikelihoodestimates(MLEs)oftheunknownparametersareobtainedasfollows: ^$g=Pni=1ig ^c0g=[^a0g;^a1g;^b1g]=Pni=1igy0i1F(Tg)

PAGE 50

^2=Pni=1Pkg=1ig(yiug)0R1(yiug) ^=Pni=1Pkg=1igh1 120R1+P1t=22tP1t=1tt+1i whereF(Tg)=0BBBBBBB@1 2cos(2(1) 2cos(2(2) 2cos(2() 4.2.2.1Material 76 ])usedthreeindependentexperiments,alphafactor,elutriationandarrestofcdc15,tosynchronizecells.Theyfound800periodicallyexpressedtranscriptionalgenesinthegenomeofSaacharomycescerevisiaeyeast.Fortheanalysisdescribedinthissection,weusedtimedependentgeneexpressiondataderivedfromarrestofacdc15,temperature-sensitivemutant.Itcontains632genes,outof800genes,thathavecompletedataduringall24timepoints.

PAGE 51

Figure4{1: ResidualPlot 4{1 presentsaplotoftheresidualsagainstthettedgeneexpressions,showingthattheresidualsfallwithinahorizontalbandcenteredaroundzero,displayingnosystematictendenciestobepositiveandnegative.Thatis,thettedmodelisappropriate(withR2=0:5419). WhentheFourierseriesapproximationisusedtoclustertheperiodicpatternsofgeneexpression,twoissuesshouldbedeterminedinthefollowingsequence.First,whatisthebestorderforFourierseriesfunctionthatexplainsthetime-dependentdata.Second,whatistheoptimalnumberofcomponentsforthemixturemodelthateachcorrespondtoadierentexpressionpattern.Modelselectioncriteria,AICandBIC,wereusedtodeterminethebestFourierseriesorderandbestnumberofmixturecomponentsforSpellmanetal.'sdata.Thetwocriteriaprovidesimilarresults(seeFig. 4{2 ),althoughouranalysiswillbemostlybasedonBIC.ItseemsthatahigherorderofFourierseriescanbetterttime-dependentdatathanalowerorder,reectingthecomplexityofdynamicchangesofgeneexpression.AlowerorderofFourierseriestendstodetectasmallnumberofgeneexpressionpatternsthanahigherorder.Forexample,therst

PAGE 52

Figure4{2. ComponentNumber-DependentAICandBICValuesofModelFittingbyaFourierSeriesFunctionofOrder1to3for632GenesCollectedfromtheYeastGenome(cdc15) orderdetects6-8components,whereasthesecondandthirdorderdetects16-18andover20components,respectively.ThismakessensebecauseahigherorderofFourierseriesfunctionhasmorepowertodiscernsubtledierencesingeneexpressionproles.AlthoughahigherorderofFourierseriesisfavorable,itneedsextensivecomputation.Forasakeofcomputation,ouranalysiswillbebasedontheresultsfromtherstorderapproximation. WecanalsodosimulationstudyinordertodeterminestatisticallysignicantorderoftheFourierseries.Firstwesimulatedatasetsbasedonthettedrst-orderFourierseries.ThenbasedonthesimulateddatasetswetthemodelwithdierentordersofFourierseriestocalculatethepercentageofsimulationreplicatesinwhichthecorrectorderisdetermined.Thispercentageisregardedasempiricalpowerfortheestimationofthecorrectorder. Fig. 4{3 illustratessixdierentperiodicpatternsofgeneexpressionprolesconcordantwithcellcycles,drawnwiththeestimatesofFourierparametersestimatedfromtheproposedmodel(seeTable 4{1 ).Thesesixpatternsdierdramaticallyintheoverallshapeofcurvesdenedbyparameterset(^a0;^a1;^b1;^T).

PAGE 53

Table4{1. MLEsoftheFirst-OrderFourierParametersforSixDierentPeriodicPatternsofGeneExpressionAmong632GenesCollectedfromtheYeastGenome(cdc15)BasedontheFirst-OrderApproximation Pattern123456 Proportion^$0.227130.321290.087610.162270.115950.08576Meanvector^a00.01415-0.012950.17013-0.19802-0.01802-0.07566^a10.21832-0.18415-0.503830.529600.39643-0.53134^b1-0.00846-0.03000-0.658750.29930-0.402040.60837^T2.090502.0707510.776609.9317610.4777010.98770Covariance^0.44358^20.27205 Basedontheseestimates,anumberofhypothesestestscanbemaderegardingthedevelopmentalpatternsofgeneexpression.Table 4{1 alsoprovidestheestimatesoftheproportionsofmixturecomponents.Thesixpatternsoccuratdierentfrequenciesamongtheobservedgenes. Thedetectionofasmallernumberofperiodicpatternsbythepreviousanalysis(inChapter3)suggeststhatthepreviousapproachislesspowerfultoidentifythedierencesofgeneexpressionprolesthanthemixturemodel-basedapproachthatincorporatesFourierseriesapproximation.Also,themixture-basedapproachweusedinthissectioncanallowfortheestimationoftheproportionsoftheoccurrenceofeachexpressionpattern,whichisadvantageouscomparedtotheclusteringapproach.

PAGE 54

Figure4{3. SixPeriodicPatternsofGeneExpressionProlesApproximatedbyaFirst-OrderFourierSeriesFunctionfor632GenesCollectedfromtheYeastGenome(cdc15) function.TheFourierparametersusedforsimulationareassignedbyvaluesthatarewithintheirspacesaccordingtoSpellmanetal.'s(1998[ 76 ])data.Thetime-dependentcovariancematrixofgeneexpressionisstructuredbytheAR(1)model.Dierentvariancesareusedinthesimulationinordertoexaminetheeectofresidualerrorsonparameterestimation. TheoptimalnumberofcomponentsforthesimulateddataisdeterminedbycalculatingAICandBICvalues.AsshowninFig. 4{4 ,themodelcancorrectlyestimatethenumberofcomponents.Basedontheresultsfromthesimulateddatasets,themodelcanprovidereasonablyaccurateandpreciseestimatesofallFourierparameters(seeTables 4{2 and 4{3 ).Theprecisionofparameterestimationdependsontheproportionofageneexpressionpattern,beingbetterformorethanlessfrequentpattern.Asexpected,increasingresidualvariancewillreducetheestimationprecisionofparameters(Table 4{2 vs. 4{3 ).Toshowtherobustnessofthemodel,anadditionalsimulationstudybasedonasecond-orderFourierseriesapproximationwasperformed.Theresultssuggestthatallparameterscanbe

PAGE 55

Figure4{4. ComponentNumber-DependentAICandBICValuesofModelFit-tingbyaFirst-andSecond-orderFourierSeriesFunctionfor1000SimulatedGenesUnderDierentResidualVariances reasonablyestimatedevenifthenumberofparametersbeingestimatedisincreased(seeTables 4{4 and 4{5 ,andFig. 4{4 forcalculatedAICandBICvalues). 64 ]).Statisticalapproachesforanalyzingtime-dependentgeneexpressiondatahavebeenproposed(Qianetal.2001[ 70 ];Holteretal.2001[ 44 ];Parketal.2003[ 65 ];Glynnetal.2006[ 37 ]),butmostofthemarestillbelimitedinbothbiologicalandstatisticalaspects. First,theseavailableapproaches,mostlybasedonaclusteringanalysis,werenotimplementedwithbiologicalprinciplesofgeneexpressionthatarerelatedtoalifeprocess.Forthisreason,theresultsobtainedfromtheseapproachesmaynotbebiologicallyrelevantand,thus,maybelessusefulfordecipheringthedevelopmentalmachineryofgeneexpression.Themodelproposedinthissectionintegratesmathematicalaspectsofperiodicgeneexpressionintotheanalyticalframework,therebyallowingfortheinterplaybetweengeneexpressionanddevelopment.

PAGE 56

Table4{2. MLEsoftheFirst-OrderFourierParametersforThreePeriodicPat-ternsofGeneExpressionAmong1000SimulatedGeneswithaGivenSetofValues(a0;a1;b1;T)UndertheResidualVarianceof2=0:3(Theaveragesofparameterestimatesarecalculatedandthemeansquareerrorsoftheestimatesaregivenintheparentheses) Pattern123 Proportion$/^$(MSE)0.5850/0.5851(0.0002)0.1000/0.1001(0.0002)0.3150/0.3148(0.0005)Meanvectora0/^a0(MSE)0.3000/0.30571(0.0218)0.0300/0.0511(0.0378)0.0600/0.0696(0.0171)a1/^a1(MSE)1.0000/1.0041(0.0081)1.0000/1.0007(0.0081)0.9000/0.9061(0.0121)b1/^b1(MSE)0.2000/0.1910(0.0118)0.0200/0.0084(0.0141)0.0100/0.0160(0.0150)T/^T(MSE)6.0000/6.0034(0.0043)10.0000/10.0150(0.0188)16.0000/15.9730(0.0392)Covariance/^(MSE)0.6000/0.5888(0.0044)2/^2(MSE)0.3000/0.3013(0.0036) Table4{3. MLEsoftheFirst-OrderFourierParametersforThreePeriodicPat-ternsofGeneExpressionAmong1000SimulatedGeneswithaGivenSetofValues(a0;a1;b1;T)UndertheResidualVarianceof2=2:0(Theaveragesofparameterestimatesarecalculatedandthemeansquareerrorsoftheestimatesaregivenintheparentheses) Pattern123 Proportion$/^$(MSE)0.5850/0.5839(0.0070)0.1000/0.1007(0.0093)0.3150/0.3154(0.0112)Meanvectora0/^a0(MSE)0.3000/0.2983(0.0385)0.0300/0.0480(0.0606)0.0600/0.0535(0.0519)a1/^a1(MSE)1.0000/1.0016(0.0148)1.0000/0.9956(0.0631)0.9000/0.9035(0.0383)b1/^b1(MSE)0.2000/0.2022(0.0290)0.0200/0.0296(0.0435)0.0100/0.0157(0.0199)T/^T(MSE)6.0000/5.9997(0.0088)10.0000/10.0030(0.0575)16.0000/15.9930(0.1212)Covariance/^(MSE)0.6000/0.5993(0.0041)2/^2(MSE)2.0000/1.9945(0.0250)

PAGE 57

Table4{4. MLEsoftheSecond-OrderFourierParametersforThreePeriodicPat-ternsofGeneExpressionAmong1000SimulatedGeneswithaGivenSetofValues(a0;a1;b1;T)UndertheResidualVarianceof2=0:3(Theaveragesofparameterestimatesarecalculatedandthemeansquareerrorsoftheestimatesaregivenintheparentheses) Pattern123 Proportion$/^$(MSE)0.5850/0.5850(0.0004)0.1000/0.1000(0.0003)0.3150/0.3150(0.0003)Meanvectora0/^a0(MSE)0.3000/0.2972(0.0166)0.0300/0.0340(0.0291)0.0600/0.0592(0.0212)a1/^a1(MSE)1.0000/1.0002(0.0056)1.0000/1.0018(0.0169)0.9000/0.8994(0.0109)b1/^b1(MSE)0.2000/0.1998(0.0103)0.0200/0.0211(0.0214)0.0100/0.0150(0.0147)a2/^a2(MSE)0.2000/0.2000(0.0030)0.2000/0.1997(0.0112)0.4000/0.3985(0.0081)b2/^b2(MSE)0.0400/0.0398(0.0051)0.0100/0.0121(0.0103)0.0400/0.0441(0.0155)T/^T(MSE)6.0000/6.0002(0.0030)10.0000/10.0010(0.0164)16.0000/15.9880(0.0397)Covariance/^(MSE)0.6000/0.5998(0.0044)2/^2(MSE)0.3000/0.2995(0.0039) Table4{5. MLEsoftheSecond-OrderFourierParametersforThreePeriodicPat-ternsofGeneExpressionAmong1000SimulatedGeneswithaGivenSetofValues(a0;a1;b1;T)UndertheResidualVarianceof2=2:6(Theaveragesofparameterestimatesarecalculatedandthemeansquareerrorsoftheestimatesaregivenintheparentheses) Pattern123 Proportion$/^$(MSE)0.5850/0.5836(0.0088)0.1000/0.1010(0.0088)0.3150/0.3154(0.0109)Meanvectora0/^a0(MSE)0.3000/0.3028(0.0452)0.0300/0.0627(0.0874)0.0600/0.0619(0.0638)a1/^a1(MSE)1.0000/1.0042(0.0174)1.0000/0.9910(0.0753)0.9000/0.9014(0.0416)b1/^b1(MSE)0.2000/0.2004(0.0302)0.0200/0.0388(0.0569)0.0100/0.0232(0.0367)a2/^a2(MSE)0.2000/0.2008(0.0079)0.2000/0.1987(0.0425)0.4000/0.3979(0.0256)b2/^b2(MSE)0.0400/0.0409(0.0163)0.0100/0.0231(0.0325)0.0400/0.0478(0.0416)T/^T(MSE)6.0000/6.0006(0.0092)10.0000/9.9890(0.0522)16.0000/15.9770(0.1049)Covariance/^(MSE)0.6000/0.5994(0.0040)2/^2(MSE)2.6000/2.5935(0.0323)

PAGE 58

Second,themixturemodel-basedapproachallowsfortheestimationofthefrequenciesofvariouspatternsofgeneexpressionandthecalculationoftheposteriorprobabilityofeachgenethatbelongstoaparticularpattern.Third,existingapproachestoclustergenesofsimilarexpressionpatternbasedonasimilaritymeasurehavenotconsideredtheautocorrelationoftimeseriesdataand,therefore,failtoremovesystematicmeasurementerrors.Intheproposedmodel,thestatisticalprincipleoffunctionaldataanalysishasbeenembeddedintothemodelbystructuringthetime-dependentcovariancematrix.This,ononehand,de-noisesrepeatedmeasurementerrorsandincreasestheeectivenessofthemodel,andenhancethemodel'spower,ontheotherhand,duetoareducednumberofparametersbeingestimated.Asarstattemptofitskind,wehaveusedasimplestationaryAR(1)modeltodemonstrateourmodel.Inordertoexpandthemodel'sutility,othermorerealisticnon-stationarymodelsshouldbeused. Therehavebeenmanyapproachesavailabletomodelthenon-stationarystructureofcovariancematrix,includingparametricandnon-parametric.Theparametricapproachcanbeusedwhenthetime-dependentcovariancematrixfollowsaparticularpattern.Forexample,thevarianceforgrowthtraitsincreasesinstantlywithtimeand,therefore,theSADmodelshouldbeabetterchoice(Zhaoetal.2001[ 86 ]).ItseemsthatasimilarideaabouttheSADmodelcanalsobeusedintheexampleofperiodicgeneexpressiondatabecausethesamplingvariancetendstodecreaseinstantlywithtime(seeFig. 4{5 ).Byplottingthetime-dependentcorrelationsforgeneexpressiondataagainsttimelags,aperiodicpatternisobserved(seeFig. 4{6 ).Thus,amodelthatcanspecifytheperiodicpatternofgeneexpressionintheexampleusedshouldbederived.Ifnoparticularpatternisobserved,non-parametricapproacheswillbemoreappropriateforthemodellingofthestructureofthecovariancematrix.Inpractice,multiple

PAGE 59

Figure4{5: VariogramforTime-DependentGeneExpressionProles Figure4{6: CorrelogramforTime-DependentGeneExpressionProles approachesshouldbeemployed,fromwhichamodelselectioncriterion,suchasAICorBIC,canbeusedtodetermineanoptimalapproach. Themixture-basedapproachincorporatedbyFourierseriesapproximationisapromisingtechniquefordetectingperiodicgeneexpressionpatterns.Arealdataanalysisfromthisapproachsuggeststhatitwouldbeusefulfortheidenticationofgeneclustersintermsoftheirperiodicexpressionpatterns.Throughsimulationstudies,thisapproachhasproventoprovidereasonableaccuracyandprecisionofparameterestimationandshouldbedirectlyusedtoanalyzearealdatasetofperiodicgeneexpression. Thisapproachcanbemodiedorextendedinthefollowingareas.First,inpractice,itisdiculttoassurethateachgeneismeasuredatthesameschedule

PAGE 60

andevenlyspacedtimeintervals.Thesetwoissuesshouldbeincorporatedintothemodel,increasingitsvisibilitytowideproblems.Second,whenrepeatedmea-surementincludesahighnumberoftimepoints,thestructuringofthecovariancematrixmaybequicklyill-behaved.Approachesfordimensionreductionshouldbeincorporatedintothemodel.Withtheseandothermodications,theapproachforgeneclusteringpresentedinthissectionwillbeinanexcellentpositiontoaddressdevelopment-relevantquestionsinorganisms. 68 ]).Timeseriesproducedbybiologicalexperimentsare,however,oftenunevenlyspaced,foravarietyofreasons,manyofwhichhavetodowiththelimitationsoftheinstrumentsandwithinherentexperimentalconstraintsonbiologicalsamples.SecondlimitationoftheFFTalgorithmisthatitdoesnottoleratemissingvalues.Whenmissingvaluesinthetimeseriesarepresent,datamustbeimputedpriortotheapplicationoftheFFTalgorithm.Third,FFTscoresdonotdirectlyaddressthesignicanceoftheobservedperiodicityandhavetobevalidatedbyaheuristiccutoscoreorpermutationstudies.

PAGE 61

76 ]).These772geneshavegeneexpressionvalueswithatleast4timepointsduringall24timepoints. Themeanvectorsandthecovariancematricescanbemodeledbytherst-orderFourierseries(asinEq.( 2{13 ))andAR(1)model(asinEq.( 2{4 )),respec-tively,basedonamixturemodelframework.However,thestructuresofthemeanvectorsandthecovariancematriceswilldependonthestructuresofeachdatapointsinceeachgeneismeasuredatunequallyspacedtimeintervalsanddierentgeneshavedierentmeasurementpatterns.Forexample,letusassumethatithgenehasthreemissingvaluesattimepoints1,3and6.Thenthemeanvectorandthecovariancematrixcanbewrittenasu=(u(2);u(4);u(5);u(7);u(8);:::;u(23);u(24))0211 2{13 ).Sothenyi=(yi2;yi4;yi5;yi7;yi8;:::;yi23;yi24)0211 2{2 )withamultivariatenormaldensityfgdenedasinEq.( 2{1 ),where=21forthisparticularithgene.

PAGE 62

Inourmodel,themeanvectorugwillnotbeestimateddirectlysinceitisapproximatedbytheFourierseriesequation.Sowewillneedtoestimatethemathematicalparameters,ug=(ag0;ag1;bg1;Tg;g=1:::;k)thatdenetheshapeoftime-dependentgeneexpressionproleforgenegroupg,whereisthenumberoftimepointsanddependsoneachgene.FortheparametersinthecovariancematrixmodeledbyAR(1)weneedtoestimatethestructuringparametersv=(;2).WecanimplementtheEMandsimplexalgorithmstoestimatetheseunknownparametersinthesimilarfashionasinEqs.( 4{2 )through( 4{5 ). 4{7 showsthecalculatedAICandBIC.Table 4{6 andFig. 4{7 showtheestimatedparametersbasedon13-componentand13dierentperiodicgeneexpressionprolesaccordingtocellcycles,respectively.

PAGE 63

Table4{6: MLEwith13ComponentsfromSparseData(cdc15) Pattern^a0^a1^b1^T^$^^2 Figure4{7: 13DierentPatternsfromSparseData(cdc15) Thesepatternsdramaticallydierintermsoftheoverallshapeofcurvesandtheperiodsdenedbytheestimatedparameters. Wenoticethattheresultsweobtainedhere(withmissingdata)andtheonesfromtheprevioussection(withcompletedata)aredierent.Inthissection,wettedamodelwithmanygeneexpressionvaluesatcertaintimepointsaremissing.Themissinggeneexpressionvaluesatthosetimepointscanbecalculatedfromtheestimatedperiodicpatterns.Theincompletedataincludedfortheanalysisinthissectionmightplayedanimportantroleinmakingtheresultsdierfromtheresults

PAGE 64

Table4{7: AICandBICforSparseData(cdc15) #ofcomponentsAICBIC 2299863003732890028975428263283605282322835362759427738727452276198272422743392722627440102717827415112709627356122694427228132683727144142685027180152716027514162722727604 weobtainedintheprevioussection.However,thoseexpressionvaluesatsomespotscouldbepoorquality,whichledtomissingvaluesforaportionofgenesatsometimepoints.Ifthatisthecaseandtheyarestillincludedinthemodeltting,itmightbemisleadingtheresults.Forexample,11thpatternmightnotreectthedatawell(seeFig. 4{7 ).Theestimatedproportionforthisgroupisonlyaboutoneandthiskindofparticulardatapointsmightbesomethingweneedtomakesurethatthemissingvaluesarenotduetothepoorquality. 78 ]).ThequalityoftheresultsfromeachmicroarraywillvarywithcDNApurity,variationsintheprintingprocess,RNAquality,successincarryingoutthehybridizationproctoclasandscanningeectiveness.Thequalityofindividualspotsshouldbeassessed.Theeasiestsolutiontomissingvaluesistore-dothewholeexperiment,butitisnotrealisticcauseitcanbeveryexpensive.Howeverthereshouldbeanumberofdatacleaning

PAGE 65

steps,lowlevelanalysisofmicroarraydataorareliablemethodtodealwiththisprobleminordertoidentifytheimportantgenesamongstthemanywhicharemeasured. Methodshavebeencreatedinordertodealwiththesemissingvaluesinamorerealisticapproach.Mostcommonlyusedmethodsarereplacingmissingdatawithzerosorbyonaverageofexpressioninarow.Thesemethodsdonottakeintoconsiderationthecorrelationstructureofthedata,andisthereforenotveryprecise.Themostcrucialdisadvantageofthesemethodsisthattheyarenotrobusttomissingvaluesinthematrix.Thesealgorithmsofmethodscanonlybecalculatedifthedatasetsarecomplete.Ork-nearestneighborsandsingularvaluedecomposition(SVD)couldbeusedandthesemethodsdotakeintoconsiderationthecorrelationdata.AlthoughtheseapproacheshavenotbeenusedgreatlyintheDNAmicroarrayanlaysis,thesemethodsareusedasstatisticalmethodsforclassication,suchasTextCategorizationandfacerecognition(Jiangsheng2002[ 49 ]). Thecombinationofavoidingmissingvaluesindataandimprovementofclusteringmethodswillincreasethevalidityofgeneexpressioninterpretation.Oneofthemajorproblemswiththecurrentclusteringmethods,suchashierarchicalandk-meansclustering,isthattheapplicationofclusteringmethodspartitionsadatasetintoclusters,wheresimilardataareassignedtothesameclusterwhereasdissimilardatashouldbelongtodierentclusters.Sinceitisalmostimpossibletocompletelygetridofthenoiseindata,thereneedstobemoredevelopmentsofdealingwiththesekindofdata.

PAGE 66

Sofartheapproachhasbeenderivedforasinglemicroarrayexperiment,cdc15arrest.Tobettercharacterizegeneclusters,however,multipleexperimentseachwithadierentmeasurementscheduleareoftenused.Forexample,Spellmanetal.(1998[ 76 ])usedthreeseparateexperimentstosynchronizethecellcyclesofyeast.InthischapterwepresentajointmodelformultipleexperimentsofmicroarraystudyforyeastbySpellmanetal.(1998[ 76 ]).Itcanbeextendedtocombinethesethreedierentexperiments,sothatthepowertodetectdistinctgeneclusterscanbeincreased.Fromabiologicalperspective,thiscombinationanalysiscanalsoallowforthecharacterizationofhowagivengenegroupisperiodicallyexpresseddierentlyinresponsetodierentexperimentalconditions. 76 ]))thathadbeensynchronizedbythreeindependentmethods:factorarrest,elutriation,andarrestofacdc15temperature-sensitivemutant.Spellmanetal.(1998[ 76 ])measuredtherelativelevelsofmRNAasafunctionoftimeincellculturesthathadbeensynchronizedinthreeindependentways. FirsttheyusedpheromonetoarrestMATacellsinG1.SecondtheyusedcentrifugalelutriationtoobtainsmallG1cells.Finallytheyusedatemperature-sensitivemutation,cdc15,which,attherestrictivetemperature,arrestscellslateinmitosis.Threemethodswereusedbecauseeachintroducescharacteristicartifacts.Forinstance,usedofpheromonehasregulatoryconsequencescharacteristicofmating,whereasuseoftemperature-sensitivemutantscancauseheatshock.Thedatasetweusedhereiscompletedata(n=481genes)from800genesthattheyfoundascellcycleregulated.Notethatfactorarrest,cdc15arrestandelutriationhave18,24and14timepoints,respectively.

PAGE 67

(2)3=2j3j1=2exp1 2(y3iug3)013(y3iug3): ThelikelihoodofunknownparametersisthenL(jy1;y2;y3)="nYi=1k1Xg1=1$g1fg1(y1i;ug1;1)#"nYi=1k2Xg2=1$g2fg2(y2i;ug2;2)#"nYi=1k3Xg3=1$g3fg3(y3i;ug3;3)#:

PAGE 68

Table4{8: AICandBICforJointModel #ofcomponentsin(;cdc15;elutriation)AICBIC (12,10,10)2793628616(12,10,11)2796228663(12,11,10)2777628477(12,11,11)2780228524(12,12,10)2771428437(12,12,11)2774028484(13,10,10)2793428636(13,10,11)2796028682(13,11,10)2777428496(13,11,11)2780028543(13,12,10)2771328456(13,12,11)2773828503(14,10,10)2797328696(14,10,11)2799928742(14,11,10)2781328557(14,11,11)2783928603(14,12,10)2775228516(14,12,11)2777828563 4{8 ).TheestimatedFourierparametersalongwithproportionsandparametersincovariancematrixaregiveninTable 4{9 (seealsoFig. 4{8 ).Aswecanseefromtheresultstheyhavedierentgeneexpressionprolesanddierentnumberofpatternsforeachexperiments.Inotherwords,theyhavedierentinteractionbetweengenesanddevelopmentwithdierentexperi-ments.Dierentgeneshavedierentmeasurementpatternsandgeneexpressionprolesareexpresseddierentlyinresponsetodierentexperimentalconditions.

PAGE 69

Table4{9. MLEwith13-Component(alpha),12-Component(cdc15)and10-Component(elutriation)forJointModel

PAGE 70

Figure4{8: DierentPatternsforJointModel 12 ]).Onemustbedecidedwhichgenesaretobeprintedonthearrays,whichsourcesofRNAaretobehybridizedtothearraysandonhowmanyarraysthehybridizationswillbereplicated.TobetterselectdierentiallyexpressedgenesortobetterndgroupsofgeneswhoseexpressionprolescanreliablyclassifythedierentRNAsourcesintomeaningfulgroupsgeneexpressionsunderdierentexperimentalconditionscanbeconsidered.Bycombiningthesedierentexperimentsthepowertodetectdistinctgeneclusterscanbeincreasedandalsofromthebiologicalperspective,itcanalsoallowforthecharacterizationofhowagivengenegroupisperiodicallyexpresseddierentlyinresponsetodierentexperimentalconditions.

PAGE 71

Ourmethodinthissectionforidentifyinggenesregulatedbythecellcyclereliedonmodelingthelargebodyofdataalreadyhasbeenpublishedtoaggregatedatafrommanyexperimentsandthereforeimprovethepowertodetectgeneclusters.Thisallowedustodistinguishcellcycleregulationfromconfoundingpatternssuchasthosecausedbyheatshockresponsewhenacultureisshiftedfromonetemperaturetoanother.

PAGE 72

44 ]).Inthepreviouschapter,ageneralmixturemodel,calledfunctionalgeneclustering,hasbeenderivedtocatalogueperiodicgeneexpressionprolesbyincorporatingthedevelopmentalmachineryofgeneexpressionthatisdescribedbyrobustmathematicalequations.Thisapproachhasproventobestatisticallypowerfulforthedetectionandesti-mationofgeneexpressionpatternsandalsobiologicallymeaningfulfortestingtheinterplaybetweengeneexpressionanddevelopment. Thecentralideaoffunctionalgeneclusteringistomathematicallymodelthemeanvectorsforeachgenepatternandthestructureofcovariancematrixamonggeneexpressionmeasuredatamultitudeofdiscretetimepoints.Suchmathematicalmodellinghastwomajoradvantages.First,manybiologicalprocesseshaveacertainpattern,whichcanbedescribedrobustlybymathematicalfunctions.Byestimatingtheparametersthatdeterminemathematicalfunctions,thegeneticdierentiationovertimecoursecanbeestimatedandtested.Theresultsfromthesebiologicallyjustiedmodelsare,therefore,moreclosertobiologicalreality.Second,insteadofestimatingeverymeanateachtimepointandeveryelementinthecovariancematrix,ourmodelonlyneedstoestimateareducednumberof 59

PAGE 73

mathematicalparametersthatmodelthemean-covariancestructures.Thisprovidesgreaterpowertodetectsignicantlydierentiatedpatternsduringatimecourse. Despiteitsstatisticalandbiologicalrelevances,functionalgeneclusteringhastwosignicantlimitationsthatmaypreventitsbroadanddeepusesinsomeparticularsituations.First,itdoesnotallowthenumberofrepeatedmeasurements(denedasthedimensionalityofobservation)tounlimitedlyincreaseforrobustparameterestimation.Whileincreaseddimensionalitypossessesricherinformation,modellingoftime-dependentvariancesandcorrelationsbasedonparametricapproaches,suchasautoregressive(Diggleetal.2002[ 23 ]),antedependencemodels(ZimmermanandNu~nez-Anton2001[ 88 ])orBrownianprocess(Syetal.1997[ 77 ]),willbecomeill-behaved.Highdimensionmaymakethecomputationofinversecovariancematrixunstableandthedeterminantapproachinginnityorzero.Second,inpractice,thesparsityofadatasetincreasesexponentiallywithitsdimensionality.Functionalgeneclusteringbasedonamultivariatenormaldensityfunctionwillbeaectedforhigh-dimensionaldatabecauseoftheiraccumulatedcontaminationbynoises. Anecienttreatmentofhigh-dimensionalmicroarraydataisthroughdi-mensionalityreduction,i.e.,thetransformationthatbringsdatafromahigh-tolow-orderdimension.Ithasbeenshownthatthelowdimensionalitymodelsarenotonlycomputationallyecient,butalsomorerobustthanhighdimensional-itymodels.Wavelettransformsthatpreservesignalpatternandyieldbetterorcomparativeclassicationaccuracy(Donoho1995[ 24 ];DonohoandJohnstone1994[ 25 ];Donohoetal.1995[ 26 ];henceforthDDD)provideapowerfulmeansfordimensionalityreduction.Inthischapter,wewillderiveawavelet-basedde-noisingmethodfordevelopmentalmicroarrayanalysis.Byreducingthedimensionalityofdata,thismethodcanimprovetheaccuracyandpowerofgeneclusterdetectioninmostsituations.

PAGE 74

58 ];Vidakovic1999[ 80 ];JensenandCour-Harbo2001[ 47 ]).Therstsequenceofcoecientsiscalledsmoothcoecients(approximationcoecientsinsomeliterature),usuallydenotedbyc-j,thatcorrespondtoanapproximationprocessoftheoriginalsignal.Thesecondsequenceofcoecientsiscalleddetailcoecients,usuallydenotedbyd-j,thatcorrespondtothedetailinformation(subtleties)leftbytheapproximationprocess.Superscript(orsubscriptlater)-jdenotestheresolutionlevelatwhichtheinitialsequenceissplitintosmoothanddetailcoecients.Sincedetailcoecientsarecontaminatedseverelybyrandomerrors,shrinkingthemtozerocanhelptoreducetheoverallnoiselevelofthesignal(DDD). ThesimplestwavelettransformisthediscreteHaartransform.IntheHaartransform,detailcoecientsarecalculatedsimplybysubtractingsuccessivevaluesinthesequence(Walker1999[ 81 ]).Thedataofgeneexpressionprolesmeasuredatmultipletimepointscanbeexpressed,forgenei,asyi=fyi(1);yi(2);yi(3);yi(4);:::;yi(3);yi(2);yi(1);yi()g:

PAGE 75

81 ]).AgeneralHaarwavelettransformprocedureisgivenintheAPPENDIXB. Thepatternofthesmoothcoecientsinthewaveletspaceresemblesthesignalpatterninthetimespace.InFig. 5{1 ,c-0representsasampleofrepeatedmeasurementsofapatternat24timepoints.ThesmoothcoecientsoftherstandsecondHaarwavelettransformationareplottedasc-1andc-2,respectively.Thepatternofc-1andc-2coecientsconformtothesignalpatternalthoughthey

PAGE 76

Figure5{1. SmoothCoecientsbytheHaarWaveletTransformationatDierentResolutions areintwodierentresolutionlevels.Becauseofthesimilarity,itisreasonabletomodeltheoriginalpatternbasedonlow-dimensionalsmoothcoecients. 1 ]),thatremovescoecientsbelowathresholdvalue,determinedbythenoisevariance.Thesecondisthesoftthresholdlter(Hs)thatshrinksthewaveletcoecientsaboveandbelowthethreshold.Softthresholdingreducescoecientstowardzero(Ghaeletal.1997[ 35 ]).Theprocessofdenoisingisnecessarilylossyinthatthedenoisedsignalisirreversiblydierentthanthenoisysignal.Thus,thresholdingisthecauseofthislossofinformation. Ifwedesiretheresultingsignaltobesmooth,thesoftthresholdltershouldbeused,althoughthehardthresholdlterperformsbetter.Inpractice,itis

PAGE 77

diculttochooseathresholdvalue.Asmallthresholdvaluecreatesanoiseresultinginneartheinput,whilealargethresholdvalueintroducesbias.Theoptimalthresholdissomewherein-between. Manyapproacheshavebeenavailabletodeterminethethresholdlevel.Auniversalmethodassignsathresholdlevelequaltothevariancetimes,expressedas whereisthesamplesize(DonohoandJohnstone1994[ 25 ]).Thehardthreshold-ingruleisdenedas AspointedoutbyDonohoandJohnstone(1994[ 25 ])andJohnstoneandSilverman(1997[ 50 ])that,forasequenceofnormaldistributedrandomvariablesztN(0;2t);t=1;:::;, sothatifadetailcoecientistrulyzerothenwithahighprobabilityitisesti-matedaszerobythehardthresholdingrule.Theexpectednumberofjzt Thefollowingprocedureisproposedtoperformdatadimensionalityreductionthroughwavelettransforms: (1) Selectproperorthogonalwaveletlters; (2) Calculatetheempiricalvariancesforthedetailcoecients;

PAGE 78

(3) Applythehardthresholdingruletothedetailcoecients; (4) Truncatethewholelevelofthedetailcoecientsiftheyareallsettozeroby(3),andkeepthewholelevelofthedetailcoecientsotherwise; (5) Repeatprocedure(1)to(4)foruserprescribedjtimes.Ourmixture-basedgeneclusteringapproachwillbeconductedonthebasisofthesmoothcoecientsc-jiftherestofdetailcoecientsaretruncatedoronthebasisofthesmoothcoecientsc-j+1otherwise. Dierentwaveletltersvaryinlterlength.Alongerlterlengthwavelettendsto\average"overmoresignalpoints.BoundaryconditionsbecomesanissueforallwaveletsexceptfortheHaarwaveletwhenthewaveletlterlengthisgreaterthantwo(JensenandCour-Harbo2001[ 47 ]).Thepurposeofhardthresholdingistoreducethedimensionalityofthedatabytruncatingcertainlevelsofdetailcoecients.Thevarianceestimator2-j(q)foreachdetailcoecientd-j(q)issuggestedbyDonohoandJohnstone(1994[ 25 ])andJohnstoneandSilverman(1997[ 50 ]),i.e., whereMADdenotesthemedianabsolutedeviationfrom0and0.6745ischosentoadjustforanormaldistribution.

PAGE 79

23 ]),expressedas where0<<1istheproportionparameterwithwhichthecorrelationdecayswithtimelag.Theparametersthatmodelthestructureofthecovariancematrixarearrayedinv=(;2). Alternatively,thetime-specicchangeofvarianceandcorrelationintheanalysisoflongitudinaltraitscanbemodeledbyZimmermanandNu~nez-Anton's(1997[ 87 ])SADmodel.Usingmatrixnotation,theerrortermcanbeexpressedase=A 46 ])as

PAGE 80

whereDistheinnovationvariance-covariancematrixandisexpressedas:D=0BBBBBBB@2(1)00002(2)00...............0002()1CCCCCCCA: 67 ])or,forsimplicity,isassumedasaconstant2.If2(t)isassumedtobeaconstant,theresidualmatrixcontainstheparameters,v=(;2). Theresidualvariancesandcovariancesofthesmoothcoecientsforthedevelopmentaltraitmeasuredattimepointsafterwavelettransformscanbederived,withexpressionsdependingontheresolutionleveloftransform.Forexample,thevarianceofrst-resolutionsmoothcoecientc-1i(1)isexpressedas var(c-1i(1))=varyi(1)+yi(2) 2(2(1)+2(2)2(1;2)); andthecovariancebetweenc-1i(1)andc-1i(2)isexpressedas cov(c-1i(1);c-1i(2))=covyi(1)+yi(2) 2((1;3)+(1;4)+(2;3)+(2;4)): Eqs.( 5{7 )and( 5{8 )aswellasvariancesandcovariancesbetweenalltheothertimepointscanbemodeledbytheAR(1)modelormatrix( 5{6 )fortheSAD(1)model.

PAGE 81

modelcontainingkgenegroupscanberewritten,intermsofc-j,as where($1;:::;$k)arethemixtureproportionscorrespondingtothefrequenciesofkdierentgenegroups,-j=f$g;w-jcg;-jgkg=1containsunknownparametersaboutthemeanvectorandcovariancematrixafterthewavelettransformandfg(c-ji;w-jcg;-j)isthemultivariatenormaldistributionofgeneithatbelongstogenegroupgexpressedas (2)-j=2j-jj1=2exp1 2(c-jiw-jcg)01-j(c-jiw-jcg); inwhichc-ji=fc-ji(1);:::;c-ji(-j)gisavectorofsmoothcoecientsforindividuali,w-jcg=fw-jcg(1);:::;w-jcg(-j)gisavectorofexpectedsmoothcoecientsforgenegroupgand-jisthe(-j-j)residualcovariancematrixforthesmoothcoecients. Thevectorw-jcgcanbemodeledusingFourierseriesapproximation.Thestructuredapproachmakesuseofideasthatthegeneexpressionvalueovertimecanbettedbyasetofmathematicalparametersuginformu(t;ug)forpatterng.Thisisexpressed,fortherstresolutiontransformation,as Thus,byestimatingugwithtransformeddataatanappropriatetransformationresolution-j,ageneexpressioncurveduringagiventimecoursecanbedrawnforindividualgenepatterns.Dierencesforthecurvecanbecomparedandtestedforthestatisticalsignicanceofgeneticcontrolovertime. ThestandardEMalgorithmisderivedtoestimatetheparameterscontainedinthelikelihood( 5{9 ).Letag=(a0g;a1g;b1g)0andTgbetheunknownparametersin

PAGE 82

themeanvectorforgthpattern.First,calculatetheposteriorprobabilityofgeneithatbelongstothepatterngas 2P4t=1cos(2(t) 2P4t=1sin(2(t) 2P8t=5cos(2(t) 2P8t=5sin(2(t) 2P-jt=-j3cos(2(t) 2P-jt=-j3sin(2(t) 4a21 4a25:::1 4a24-j71 4a2a11 4a2:::1 4a24(-j1)7.........1 4a24p71 4a24(-j1)71 4a24(-j2)7:::a11CCCCCCCA-j-j;a1=(1+1 2(2+2+3));a2=(1+2+32+43+34+25+6) and-jisreduceddimensionfromwiththe-jisthe2ndresolution.BecausetherearenoclosedformsfortheestimationofandTg,thesimplexalgorithmis

PAGE 83

Table5{1. ParameterValuesoftheSimulatedDatatoDemonstratetheWaveletApproach Pattern123 Proportionp0.5850.1000.315Meanvectora00.3000.0300.060a11.0001.0000.900b10.2000.0200.010T6.00010.00016.000Covariance0.60020.300 implementedtoprovidetheestimatesofthesetwoparameters.TheEandMstepsareiterateduntiltheestimatesarestable. 5{1 ).TheperiodicexpressionprolesofeachgenefollowamultivariatenormaldistributionwithmeanvectortbytheFourierseriesapproximationandcovariancematrixtbytheAR(1)model.Theparametersfortherst-orderFourierseries(a0;a1;b1;T)usedtoapproximategeneexpressionprolesaregiveninTable 5{1 .Twodierentdimensionsofsimulationareassumed,32and64timepoints. Simulateddatawereanalyzedbytraditional(full-dimension)andwavelet-baseddimensionreductionmodelsforfunctionalgeneclustering,withresultssummarizedinTables 5{2 and 5{3 .Foreachmodel,theAICandBICvalueswerecalculatedtodeterminetheoptimalnumberofcomponents,eachcorrespondingto

PAGE 84

Figure5{2: OriginalandEstimatedPatternsfor32TimePoints ageneexpressionpattern,forthemixturemodel.AsshowninFigs. 5{4 and 5{5 ,bothmodelscorrectlyestimatethenumberofcomponentsforthesimulateddata. Thewavelet-basedapproachcanprovidereasonableestimatesofthemath-ematicalparametersfortheFourierseriesapproximationofeachgenepattern,withtheestimationprecisioncomparabletothatbythefull-dimensionmodelfordierenttimepoints(seeTables 5{2 and 5{3 andalsoFigs. 5{2 and 5{3 ).Asignicantadvantageofthewavelet-basedoverfull-dimensionapproachliesinitshighercomputationaleciency.Forthesimulationexperimentwith32timepoints,thewavelet-basedapproachneeds13min,whichisfasterbytwofoldsthanthefull-dimensionapproach.Withanincreasingnumberoftimepoints,thisdierenceismagnied.Withdoubletimepoints,thewavelet-basedapproachonlyneeds4moreminutes,whereasthecomputationaltimeofthefull-dimensionapproachisquadrupled.

PAGE 85

Table5{2: MLEandMSEwithFullDimensionandWaveletfor32TimePoints Pattern 123 Proportion^$(MSE)Full0.5851(0.0002)0.1001(0.0002)0.3148(0.0006)Wavelet0.5874(0.0042)0.0981(0.0037)0.3145(0.0022)Meanvector^a0(MSE)Full0.3057(0.0218)0.0511(0.0378)0.0696(0.0171)Wavelet0.3046(0.0245)0.0446(0.0313)0.0709(0.0225)^a1(MSE)Full1.0041(0.0081)1.0007(0.0081)0.9061(0.0121)Wavelet1.0078(0.0144)1.0076(0.0243)0.9072(0.0132)^b1(MSE)Full0.1910(0.0118)0.0084(0.0141)0.0160(0.0150)Wavelet0.1810(0.0354)0.0099(0.0151)0.0093(0.0157)^T(MSE)Full6.0034(0.0043)10.0150(0.0188)15.9730(0.0392)Wavelet6.0098(0.0129)10.0130(0.0304)15.9870(0.0619)Covariance^(MSE)Full0.5888(0.0044)Wavelet0.6042(0.0102)^2(MSE)Full0.3013(0.0036)Wavelet0.2999(0.0028)ComputationaltimeFull-dimension43minWavelet13min Figure5{3: OriginalandEstimatedPatternsfor64TimePoints

PAGE 86

Table5{3: MLEandMSEwithFullDimensionandWaveletfor64TimePoints Pattern 123 Proportion^$(MSE)Full0.5850(0.0006)0.1000(0.0005)0.3150(0.0005)Wavelet0.5849(0.0005)0.1001(0.0004)0.3150(0.0005)Meanvector^a0(MSE)Full0.2992(0.0103)0.0338(0.0226)0.0606(0.0135)Wavelet0.2994(0.0103)0.0377(0.0289)0.0605(0.0134)^a1(MSE)Full0.9999(0.0040)1.0015(0.0140)0.8993(0.0073)Wavelet1.0002(0.0102)1.0014(0.0133)0.8990(0.0075)^b1(MSE)Full0.1998(0.0055)0.0200(0.0171)0.0102(0.0092)Wavelet0.1985(0.0306)0.0253(0.0185)0.0111(0.0106)^T(MSE)Full6.0000(0.0009)10.0010(0.0074)16.0000(0.0125)Wavelet6.0001(0.0041)9.9997(0.0074)15.9990(0.0156)Covariance^(MSE)Full0.6003(0.0033)Wavelet0.6021(0.0080)^2(MSE)Full0.2999(0.0028)Wavelet0.2988(0.0041)ComputationaltimeFull-dimension157minWavelet17min Figure5{4: AICandBICfor32TimePoints

PAGE 87

Figure5{5: AICandBICfor64TimePoints clustering,proposedtocharacterizedierentpatternsofgeneexpression(seeChapter4),hasemergedasapowerfultoolingenomicresearch.Foramultivariateresponsedatasetcontaminatedbynoises,thedetectionofpatternssuersfromtheso-called\curseofdimensionality".Asanincreasinglypopularmeansfordatacompressionandde-noisinginthecontextofsignalandimageprocessing(DonohoandJohnstone1994[ 25 ];JohnstoneandSilverman1997[ 50 ]),waveletshrinkagehasbeenusedtocataloguegeneexpressiondynamics.Thiswavelet-basedmodelprojectshigherdimensionaldatatoamanageablelowerdimensionalsubspace. Periodicgeneexpressionisoneofthefundamentalprocessesinbiology.TheFourierseriesapproximationhasbeensuccessfullyusedtocharacterizetheperi-odicityofgeneexpression.Formanysignals,Fourieranalysisisextremelyusefulbecausesignal'sfrequencycontentisofgreatimportance.Ifasignaldoesnotchangemuchovertime,thatis,ifitiswhatiscalledastationarysignal,adirectapproximationoftheFourierseriesfunctionisfeasible.However,mostinterestingsignalscontainnumerousnon-stationaryortransitorycharacteristics:drift,trends,abruptchanges,andbeginningsandendsofevents.Thesecharacteristicsare

PAGE 88

oftenthemostimportantpartofthesignalsandFourieranalysisisnotsuitedtodetectingthem. Weimplementtheideaofwaveletdimensionreductionintothemixturemodelforgeneclustering,aimedtode-noisethedatabytransforminganinherentlyhigh-dimensionalbiologicalproblemtoitstractablelow-dimensionalrepresenta-tion.Asarstattemptofitskind,wecapitalizeonthesimplestHaarwaveletshrinkagetechniquetobreakanoriginalsignaldownintospectrumbytakingitsaveragesanddierencesand,subsequently,todetectgeneclustersthatdierinthesmoothcoecientsextractingfromnoisytimeseriesgeneexpressiondata.Thiswavelet-basedmodelwillhavemanyimplicationsforaddressingbiologicallymeaningfulhypothesesattheinterplaybetweengeneactions(orinteractions)anddevelopmentalpathwaysinvariouscomplexbiologicalprocessesornetworks.

PAGE 89

Therehavebeenanumberofstatisticalapproachesdevelopedtocataloguetime-seriespatternsofgeneexpression(Ramonietal.2002[ 71 ];Guoetal.2003[ 42 ];Parketal.2003[ 65 ];Qianetal.2001[ 70 ];Holteretal.2001[ 44 ]).OurapproachisbasedontheapproximationofaperiodicFourierseriesfunctionandtimedependentcovariancematrixinamixturemodelframeworktotranscriptionalexpressiontrajectories,followedbytheclassicationoftemporalpatternsforgeneexpressionbasedonstatisticalapproaches.Themostimportantadvantageofourapproach,ascomparedtothesepublishedones,istoclustergenesaccordingtotheirunderlyingexpressionmechanismsandallowforthetestofhypothesesattheinterplaybetweengeneexpressionanddevelopment.Forexample,ourapproachcanexplorecellcyclegeneticmechanismsthatcontrolwhereandwhencellsdivideduringdevelopment. WeappliedtheapproachtoadatasetofgeneexpressionprolesfromtheyeastSaccharomycescerevisiaethatwasmeasuredtostudythemitoticcellcycle(Spellmanetal.1998[ 76 ]).Theyusedthreeseparateexperimentsbasedonelutriation,-factorandmutationcdc15tosynchronizethecellcyclesofyeasts.Firstwederivedtimedependenttemporalpatternsusingasinglemicroarrayexperiment,cdc15arrest.ThesimilaritiesamongtheprolesweredetectedbycomparingtheFouriercoecientsthatjointlydenetheshapeofexpressioncurves.Thishasthenmadeitpossibletoassignco-expressedgenestoclusters.Eachobtainedclustercontainscell-cyclerelatedgenesthatshowasimilarperiodicbehavior.Theobtainedclusterswerethencomparedagainsttheclassicationbytraditionalclusteringapproaches.Whiletheresultsarebroadlyconsistentformany 76

PAGE 90

genesbetweenthetwoapproaches,itseemsthatourapproachisstatisticallymoresupported. Akeypointinmicroarraytimeseriesexperimentsistoextractthecontinuousrepresentationofallgenesthroughoutthetime-courseoftheexperiments.Howeverdataareoftenverynoisyandtherearealargenumberofmissingtimepointsinmanycases,makinggene-specicinterpolationinfeasible.Aparticularproblemariseswhenseriesarenotsampleduniformlyforexample,inSpellmanetal.(1998[ 76 ])andEisenetal.(1998[ 29 ]).Anotherchallengearisesfromthevariabilityinthetimingofbiologicalprocesses.Therateatwhichsimilarunderlyingprocessessuchasthecellcycleunfoldcanbeexpectedtodieracrossorganisms,geneticvariantsandenvironmentalconditions.Forexample,Spellmanetal.(1998[ 76 ])analyzetimeseriesdatafortheyeastcellcycleinwhichdierentmethodswereusedtosynchronizethecells.Itisclearthatthecyclelengthsacrossthedierentexperimentsvaryconsiderably,andthattheseriesbeginandendatdierentphasesofthecellcycle.Thus,oneneedsamethodtoalignsuchseriestomakethemcomparable.Someofthetimeseriesexperimentsareperformedtodetectperiodicgenes.Identifyingsuchgenesischallengingbecauseofthenoisemayhavedierentphaseandamplitude,andbecauseofthenoisepresentinalltimeseriesexpressionexperiments. Ourapproachhasbeenderivednotonlyforasinglemicroarrayexperimentbutalsoextendedtocombinethreedierentmicroarrayexperiments,elutriation,alphafactorandmutationcdc15.TobettercharacterizegeneclustersmultipleexperimentseachwithadierentmeasurementscheduleareusedfromSpellmanetal.'sstudy.Bycombiningthesethreedierentexperimentsthepowertodetectdistinctgeneclusterscanbeincreased.Fromabiologicalperspective,thiscombinationanalysiscanalsoallowforthecharacterizationofhowagiven

PAGE 91

genegroupisperiodicallyexpresseddierentlyinresponsetodierentexperimentalconditions. Wealsoappliedtheideaofwaveletdimensionreductionintothemixturemodelforgeneclustering,aimedtode-noisethedatanytransforminganinherentlyhigh-dimensionalbiologicalproblemtoitstractablelow-dimensionalrepresentation.WecapitalizeonthesimplestHaarwaveletshrinkagetechniquetobreakdownanoriginalsignalintospectrumtodetectgeneclustersbysimulationstudy.Thiswaveletbasedmodelwillhavemanyimplicationsforaddressingbiologicallymean-ingfulhypothesesattheinterplaybetweengeneinteractionsanddevelopmentalpathwaysinvariouscomplexbiologicalprocesses. Thereareanumberofpointstobeaddressedwhenclusteringtimeseriesexpressiondata.First,mostclusteringalgorithmstreattheirinputasavectorofindependentsamples,anddonottakeintoaccountthetemporalrelationshipbetweenthetimepoints,andthedurationeachtimepointrepresents.Thus,thesealgorithmscannotbenetfromtheknowndependenciesamongconsecutivepoints.Inaddition,sincemanytimeseriesaresamplednon-uniformly,suchindependenceassumptionmightskewtheresults.Anotherimportantproblemisinferringcausalityfromtimeseriesexpressionexperiment.Sincesomegenesactasregulatorsofothergenes,bylookingatthetemporalexpressionpatternsofgeneswemightbeabletoinferrelationshipbetweenregulatorsandthegenestheyregulate,andexplainhowgenesareregulatedinthecell.Whenanalyzingtimeseriesexpressiondatasets,weareinterestednotonlyintheclustersthemselvesbutalsointherelationshipsbetweenthedierentclusters.Ourapproach,however,mightbelimitedinclusteringthosegenesthatarenotperiodicallyexpressedduringthecellcycle.ForexamplesomegenescannotbettedbytheFourierseriesapproximationpossiblybecauseitisnotdirectlyrelatedtothecellcycle.Forthosegenes,amoreappropriatemathematicalfunctionthatcanbetterdenetheirprole

PAGE 92

curvesshouldbeincorporated.Ifnoappropriatefunctionexists,non-parametricapproachthatdoesnotrelyuponanyshapeofcurvecanbedeveloped.Hypothesistestsregardingtheinterplaybetweengeneexpressionanddevelopmentpatternscanbeformulatedinasimilarway.

PAGE 93

4{2 )-( 4{5 ) Thelog-likelihoodofparameters=($;u;)0constructedonthebasisofthemixturemodelisexpressedaslogL(y)=nXi=1log[kXg=1$gfg(yi;ug;)] andtheposteriorprobabilityithgenethatbelongstogthgroupisig=E[zigjyi]=Pr[zig=1jyi]=$gfg(yi)

PAGE 94

Notethat@logL(y) 2cos(2(1) 2cos(2(2) 2cos(2() andsolveitforcg.ThennXi=1igu0g1F(Tg)=nXi=1igy0i1F(Tg): andifF(Tg)01F(Tg)isinvertable^c0g=[^a0g;^a1g;^b1g]=Pni=1igy0i1F(Tg)

PAGE 95

Notethatfg(yi)canbewrittenasfg(yi)=1 (2)=2(2)=2jRj1=2exp1 22(yiug)0R1(yiug) 22fg(yi)(yiug)0R1(yiug) so^2=Pni=1Pkg=1ig(yiug)0R1(yiug)

PAGE 96

FirstnotethatR1=1 120BBBBBBBBBBBBBB@1000:::::::::01+200:::::::::001+20:::::::::0...........................00000:::1+200000::::::11CCCCCCCCCCCCCCA; 22(yiug)0R1(yiug)] 220R1] @"1 22(12)fXt=12t21Xt=1tt+1+21Xt=22tg#=1 120R1+1Xt=22t1Xt=1tt+1#; 22(yiug)0R1(yiug)] 120R1+1Xt=22t1Xt=1tt+1g#:

PAGE 97

Therefore,wehave@logL(y) 120R1+2Xt=22t1Xt=1tt+1g#set=0: 120R1+P1t=22tP1t=1tt+1i 120R1+P1t=22tP1t=1tt+1i

PAGE 98

Thewavelettransformattemptstorepeatedlysplitsaninitialsequenceintodetailcoecientsthatquantifylocaluctuationsataparticularscale,andsmoothcoecientsthatquantifyremaininglow-frequencyvariationinthesignalafterthehigh-frequencydetailisremoved.ThesimplestwavelettransformisthediscreteHaartransform.IntheHaartransform,detailcoecientsarecalculatedsimplybysubtractingsuccessivevaluesinthesequence(Walker1999[ 81 ]).Givenasequenceyi=fyi(1);yi(2);:::;yi()gforindividualisampledatequallyspacedintervals,therst-resolutiondetailcoecientsarecalculatedbyd-1(q)=yi(2q1)yi(2q) Wavelettransformspresentaprocessofdecomposingthesignalintotwocomponents,eachcaringinformationaboutsourcesignal.Insodoing,ltersfromthelterbankwillbeused,whichcomeinpairs:lowpassandhighpass.Lowpass 85

PAGE 99

lteredsignalcontainsinformationaboutslowchangingcomponentofthesignal,lookingverysimilartotheoriginalsignal,onlytwotimesshorterintermofnumberofsamples(timepoints).Highpasslteredsignalcontainsinformationaboutfastchangingcomponentofthesignal.Inpractice,wavelettransformationcanbecalculatedusingso-calledpyramidalgorithm(Mallat1989[ 58 ])byrepeatedlyapplyinglowpasslterfhqgandhighpasslterfgqgonthesmoothcoecientsatdierentresolutionlevels,respectively.Mathematically,theqthelementofthesmoothanddetailcoecientsatthejthresolutioncanbeexplicitlywrittenasc-j(q)=(Hc-j+1)q=Xq0hq02qc-j+1(q0)d-j(q)=(Gc-j+1)q=Xq0gq02qc-j+1(q0) whereHandGarethecorrespondingoperatorsforlowpasslterandhighpasslter,respectively.Thewavelettransformationcorrespondstomanipulationsofasequenceofresolutiondependentorthogonaltransformationmatrix(JensenandCour-Harbo2001[ 47 ];Vidakovic1999[ 80 ])symbolicallyexpressedasH-j=0B@HG1CA wherethelowpasslterhaselementsh0=h1=1

PAGE 100

Regardinganoriginalsequenceofsignalyi=fyi(1);:::;yi()gassmoothcoecientsc0,theHaarwavelettransformationcanbewrittenasc-1i(q)=Xt2Tht2qyi(t)=1 Ingeneral,for=2J,thetransformationmatrixcorrespondingtotheHaarwaveletatresolutionleveljcanbewrittenasH-j=0BBBBBBBBBBBBBBBBBBBB@1

PAGE 101

Thus,wehavethesmoothanddetailcoecientsofyiorugatresolutionjexpressedas0BBBBBBBBBBBB@c-ji=c-ji(q)2Jjq=1d-ji=d-ji(q)2Jjq=1d-j+1i=d-ji(q)2Jj+1q=1...d-1i=d-ji(q)2J1q=11CCCCCCCCCCCCA=Hyi;0BBBBBBBBBBBBBB@w-jcg=nw-jcg(q)o2Jjq=1w-jdg=nw-jdg(q)o2Jjq=1w-j+1dg=nw-jdg(q)o2Jj+1q=1...w-1dg=nw-jdg(q)o2J1q=11CCCCCCCCCCCCCCA=Hug:

PAGE 102

[1] E.AboufadelandS.Schilicker,DiscoveringWavelets,WileyandSons,NewYork,1999. [2] T.J.Aitman,\DNAmicroarraysinmedicalpractice,"BritishMedicalJournal,vol.323,pp.611{615,2001. [3] H.Akaike,\Anewlookatthestatisticalmodelidentication,"IEEETransactionsonAutomaticControl,vol.19,pp.716{723,1974. [4] U.Alon,N.Barkai,D.A.Notterman,K.Gish,S.Ybarra,D.MackandA.J.Levine,\Broadpatternsofgeneexpressionrevealedbyclusteringanalysisoftumorandnormalcolontissuesprobedbyoligonucleotidearrays,"PNAS,vol.96,pp.6745{6750,1999. [5] N.J.ArmstrongandM.A.Wiel,\Microarraydataanalysis:fromhypothesestoconclustionsusinggeneexpressiondata,"CellularOncology,vol.26,pp.279{290,2004. [6] J.D.BaneldandA.Raftery,\Model-basedGaussianandnon-Gaussianclustering,"Biometrics,vol.49,pp.803{821. [7] Z.Bar-Joseph,G.K.Gerber,D.K.Giord,T.S.JaakkolaandI.Simon,\Continuousrepresentationsoftimeseriesgeneexpressiondata,"JournalofComput.Biol.,vol.10,pp.341{356,2003. [8] Z.Bar-Joseph,\Analyzingtimeseriesgeneexpressiondata,"Bioinformatics,vol.20,pp.2493{2503,2004. [9] C.Biernacki,G.Celeux,andG.Govaert,\AnimprovementoftheNECcriterionforassessingthenumberofclustersinamixturemodelsource,"Non-linearAnaly.,vol.20,pp.267{272,1999. [10] G.E.BoxandG.M.Jenkins,TimeSeriesAnalysis:ForecastingandControl,Holden-Day,SanFrancisco,1976. [11] A.Bramza,A.Robinson,G.CameronandM.Ashburner,\One-stopshopformicroarraydata,"Nature,vol.403,pp.699{700,2000. [12] P.BrownandD.Botstein,\ExploringthenewworldofthegenomewithDNAmicroarraydata,"NatureGenetics,vol.21,pp.33{37,1999. 89

PAGE 103

[13] R.J.CarrollandD.Ruppert,\Power-transformationswhenttingtheoreti-calmodelstodata,"Am.Stat.Assoc.,vol.79,pp.321{328,1984. [14] G.CeleuxandG.Soromenho,\Anentropycriterionforassessingthenumberofclustersinamixturemodel,"J.Class,vol.13,pp.195{212,1996. [15] P.CheesemanandJ.Stuts,BayesianClassication(AutoClass):TheoryandResults,AAAIPress,California,1996. [16] H.F.ChenandJ.D.Kalbeisch,\Testingfornitemixturemodelwithtwocomponents,"JRSSB,vol.66,pp.95{115,2004. [17] R.J.Cho,M.J.Campbell,E.A.Winzeler,L.Steinmetz,A.Conway,L.Wodicka,T.G.Wolfsberg,A.E.Gabrielian,D.Landsman,D.J.LockhartandR.W.Davis,\Agenome-widetranscriptionalanalysisofthemitoticcellcycle,"MolecularCell,vol.2,pp.65{73,1998. [18] S.K.Crosthwaite,\Circadianclocksandnaturalantisense,"RNAFEBSLett.,vol.567,pp.49{54,2004. [19] J.K.Dale,M.Maroto,M.L.Dequeant,P.Malapert,M.McGrewandO.Pourquie,\Periodicnotchinhibitionbylunaticfringeunderliesthechicksegmentationclock,"Nature,vol.421,pp.275{278,2003. [20] M.DavidianandD.M.Giltinan,NonlinearModelsforRepeatedMeasure-mentData,ChampmanandHall,London,1995. [21] A.P.Dempster,N.M.LairdandD.B.Rubin,\MaximumlikelihoodfromincompletdataviatheEMalgorithm,"JRSSB,vol.39,pp.1{38,1977. [22] U.DeLichtenberg,L.J.Jensen,S.BrunakandP.Bork,\Dynamiccomplexformationduringtheyeastcellcycle,"Science,vol.307,pp.724{727,2005. [23] P.J.Diggle,P.Heagerty,K.Y.LiangandS.L.Zeger,AnalysisofLongitudi-nalData,OxfordUniversityPress,Oxford,2002. [24] D.L.Donoho,\De-noisingbysoft-thresholding,"IEEETransactionofInformationTheory,vol.41,pp.613{627,1995. [25] D.L.DonohoandI.M.Johnstone,\Idealspatialadptationbywaveletshrinkage,"Biometrika,vol.81,pp.425{455,1994. [26] D.L.Donoho,I.M.Johnstone,G.KerkyacharianandD.Picard,\Waveletshrinkage,"JRSSB,vol.57,pp.301{369,1995. [27] P.DuhamelandM.Vetterli,\FastFouriertransforms:atutorialreviewandastateoftheart,"SignlaProcess,vol.19(4),pp.259{299,1990. [28] J.Durbin,\Testsofserialindependencebasedonthecumulatedperi-odogram,"Bull.Int.Stat.Inst.,vol.42,pp.1039{1049,1967.

PAGE 104

[29] M.B.Eisen,P.T.Spellman,P.O.BrownandD.Botstein,\Clusteranalysisanddisplayofgenome-wideexpressionpatterns,"PNAS,vol.95,pp.14863{14868,1998. [30] R.FletcherandC.MReeves,\Functionminimizationbyconjugategradi-ents,"ComputerJournal,vol.7,pp.148-154,1964. [31] C.FraleyandA.E.Raftery,HowManyClusters?WhichClusteringMethod?AnswersviaModel-basedClusterAnalysis,Tech.Rep.329,DepartmentofStatistics,U.ofWashington,1998. [32] M.FrigoandS.G.Johnson,TheFastestFourierTransformintheWest,Tech.Rep.MIT-LCS-TR-728,MassachusettsInstituteofTechnology,1997. [33] K.R.Gabriel,\Ante-dependenceanalysisofanorderedsetofvariables,"Ann.Math.Stat.,vol.33,pp.201{212,1962. [34] S.Geisser,GrowthCurveAnalysis,NorthHollandPublishingCo.,Amster-dam,1980. [35] S.P.Ghael,A.M.SayeedandR.G.Baraniuk,\ImprovedwaveletdenoisingviaempiricalWienerltering,"Proc.ofSPIE,vol.3169,pp.389{399,1997. [36] D.GhoshandA.M.Chinnaiyan,\Mixturemodelingofgeneexpressiondatafrommicroarrayexperiments,"Bioinformatics,vol.18,pp.275{286,2002. [37] E.F.Glynn,J.ChenandA.R.Mushegian,\DetectingperiodicpatternsinunevenlyspacedgeneexpressiontimeseriesusingLomb-Scargleperi-odograms,"Bioinformatics,vol.22(3),pp.310{316,2006. [38] A.Goldbeter,\Computationalapproachestocellularrhythms,"Nature,vol.420,pp.238{245,2002. [39] T.R.Golub,D.K.Slonim,P.Tamayo,C.Huard,M.Gaasenbeek,J.P.Mesirov,H.Coller,M.L.Loh,J.R.Downing,M.A.Caligiuri,C.D.BloomeldandE.S.Lander,\Molecularclassicationofcancer:classdiscoveryandclasspredictionbygeneexpressionmonitoring,"Science,vol.286,pp.531{537,1999. [40] D.Gonze,J.HalloyandA.Goldbeter,\Stochasticversusdeterministicmodelsforcircadianrhythms,"J.Biol.Physics,vol.28,pp.637{653,2002. [41] D.Gonze,J.Halloy,J.C.LeloupandA.Goldbeter,\Stochasticmodelsforcircadianrhythms:inuenceofmolecularnoiseonperiodicandchaoticbehavior,"C.R.Biologies,vol.326,pp.189{203,2003. [42] W.Guo,T.Ahang,X.Shen,J.Z.YuandR.J.KOhel,\DevelopmentofSCARmarkerlinkedtoamajorQTLforhishberstrengthanditsusagein

PAGE 105

molecular-markerassistedselectioninuplandcotton,"CropScience,vol.43,pp.2252{2256,2003. [43] S.L.Harmer,J.B.Hogenesch,M.Straume,H.-S.Chang,B.Han,T.Zhu,X.Wang,J.A.KrepsandS.A.Kay,\Orchestratedtranscriptionofkeypathwaysinarabidopsisbythecircadianclock,"SignalTransductionVirtualJournal,vol.290,pp.2110{2113,2000. [44] N.S.Holter,A.MaritanandM.Cieplak,\Dynamicmodelingofgeneexpressiondata,"PNAS,vol.98,pp.1693{1698,2001. [45] C.M.HurvichandC.L.Tsai,\Regressionandtimeseriesmodelselectioninsamllsmaples,"Biometrika,vol.76,pp.297{307,1989. [46] F.Jarezic,R.ThompsonandW.G.Hill,\Structuredantedependencemodelsforgeneticanalysisofrepeatedmeasuresonmultiplequantitativetraits,"GeneticalResearch,vol.82,pp.55-65,2003. [47] A.JensenandA.Cour-Harbo,RipplesinMathematics,Springer,NewYork,2001. [48] C.J.JiangandZ.B.Zeng,\Multipletraitanalysisofgeneticmappingforquantitativetraitloci,"Genetics,vol.140,pp.1111{1127,1995. [49] Y.Jiangsheng,Methodofk-NearestNeighbors,InstituteofComputationalLinguistics,PekingUniversity,China,2002. [50] I.M.JohnstoneandB.W.Silverman,\Waveletthresholdestimatorsfordatawithcorrelatednoise,"JRSSB,vol.59,pp.319{351,1997. [51] B.Jordan,\Historicalbackgroundandanticipateddevelopments,"AnnalsoftheNewYorkAcademyofScience,vol.975,pp.24{32,2002. [52] P.L.Lakin-ThomasandS.Brody,\Circadianrhythmsinmicroorganisms:newcomplexities,"Annu.Rev.Microbiol.,vol.58,pp.489{519,2004. [53] R.Lasser,IntroductiontoFourierSeries,MarcelDekker,Inc.,NewYork,1996. [54] J.C.LeeandS.Geisser,\Applicationsofgrowthcurveprediction,"Sankhya,vol.37,pp.239{256,1975. [55] J.-C.Leloup,D.GonzeandA.Goldbeter,\LimitcyclemodelsforcircadianrhythmsbasedontranscriptionalregulationinDrosophilaandNeurospora,"J.Biol.Rhythms,vol.14,pp.433{448,1999. [56] Y.T.Lo,N.R.MendellandD.B.Rubin,\Testingthenumberofcompo-nentsinanormalmixture,"Biometrika,vol.88,pp.767{778,2001.

PAGE 106

[57] I.S.Lossos,A.A.Alizadeh,M.B.Eisen,W.C.Chan,P.O.Brown,D.Botstein,L.M.StaudtandR.Levy,\OngoingimmunoglobulinsomaticmutationingerminalcenterBcell-likebutnotinactivatedBcell-likediuselargecelllymphomas,"PNAS,vol.97,pp.10209{10213,2000. [58] S.G.Mallat,\Atheoryformultiresolutinosignaldecomposition:thewaveletrepresentation,"IEEETransactiononPatternAnalysisandMachineIntelligence,vol.11,pp.674{693,1989. [59] G.McLachlanandD.Peel,FiniteMixtureModels,WileyandSons,Inc.,NewYork,1997. [60] G.J.McLachlan,R.W.BeanandD.Peel,\Amixtuemodel-basedapproachtotheclusteringofmicroarrayexpressiondata,"Bioinformatics,vol.18,pp.414{422,2002. [61] J.M.Mitchison,\Growthduringthecellcycle,"Int.Rev.Cytol,vol.226,pp.165{258,2003. [62] J.A.NelderandR.Mead,\Asimplexmethodforfunctionminimization,"ComputerJ.,vol.7,pp.308{313,1965. [63] V.Nu~nez-AntonandD.L.Zimmerman,\Modelingnonstationarylongitudi-naldata,"Biometrics,vol.56,pp.699{705,2000. [64] S.Panda,T.K.Sato,A.M.Castrucci,M.D.Rollag,W.J.DeGrip,J.B.Hogenesch,I.ProvencioandS.A.Kay,\Melanopsin(Opn4)requirementfornormallight-inducedcircadianphaseshifting,"SignalTransductionVirtualJournal,vol.298,pp.2213{2216,2002. [65] T.Park,S.G.Yi,S.Lee,S.Y.Lee,D.H.Yoo,J.I.AhnandY.S.Lee,\Statisticaltestsforidentifyingdierentiallyexpressedgenesintime-coursemicroarrayexperiments,"Bioinformatics,vol.19,pp.694{703,2003. [66] S.D.PletcherandC.J.Geyer,\Thegeneticanalysisofage-dependenttraits:modelingacharacterprocess,"Genetics,vol.153,pp.825{835,1999. [67] M.Pourahmadi,\Jointmean-covariancemodelswithapplicationstolongitudinaldata:unconstrainedparameterization,"Biometrika,vol.86,pp.677{690,1999. [68] M.B.Priestley,SpectralAnalysisandTimeSeriesAcademicPress,SanDiego,1981. [69] L.M.Prolo,J.S.TakahashiandE.D.Herzog,\Circadianrhythmgenerationandentrainmentinastrocytes,"J.ofNeuroscience,vol.25(2),pp.404{408,2005.

PAGE 107

[70] J.Qian,B.Stenger,C.A.Wilson,J.Lin,R.Jansen,S.A.Teichmann,J.Park,W.G.Krebs,H.Yu,V.Alexandrov,N.EcholsandM.Gerstein,\Partslist:aweb-basedsystemfordynamicallyrankingproteinfoldsbasedondisparateattributes,includingwhole-genomeexpressionandinteractioninformation,"NucleicAcidsResearch,vol.29,pp.1750{1764,2001. [71] M.F.Ramoni,P.SebastianiandI.S.Kohane,\Clusteranalysisofgeneexpressiondynamics,"PNAS,vol.99,pp.9121{9126,2002. [72] M.Ramoni,P.SebastianiandP.R.Cohen,\Bayesianclusteringbydynam-ics,"Mach.Learn,vol.47,pp.91{121,2002. [73] C.P.RobertandG.Casella,MonteCarloStatisticalMethods,Springer,NewYork,1999. [74] C.Rovery,M.V.La,S.Robineau,K.Matsumoto,P.RenestoandD.Raoult,\PreliminarytranscriptionalanalsysisofspoTgenefamilyandofmembraneproteinsinRickettsiaconoriiandRickettsiafelis,"AnnalsofNewYorkAcademyofScience,vol.1063,pp.79{82,2005. [75] G.Schwarz,\Estimatingthedimensionofamodel,"AnnalsofStatistics,vol.6,pp.461{464,1978. [76] P.T.Spellman,G.Sherlock,M.Q.Zhang,V.R.Iyer,K.Aners,M.B.Eisen,P.O.Brown,D.BotsteinandB.Futcher,\Comprehensiveidenticationofcellcycle-regulatedgenesoftheYeastsaccharomycescerevisiaebymicroarrayhybridization,"MolecularBiologyoftheCell,vol.9,pp.3273{3297,1998. [77] J.P.Sy,J.M.TaylorandW.G.Cumberland,\AstochasticmodelfortheanalysisofbivariatelongitudinalAIDSdata,"Biometrics,vol.53,pp.542{555,1997. [78] O.Troyankskaya,M.Cantor,G.Sherlock,P.Brown,T.Hastie,R.Tibshirani,D.BotsteinandR.Altman,\MissingvalueestimationmethodsforDNAMicroarrays,"Bioinformatics,vol.17,pp.520{525,2001. [79] G.VerbekeandG.Molenberghs,LinearMixedModelsforLongitudinalData,Springer-Verlag,NewYork,2000. [80] B.Vidakovic,StatisticalModelingbyWaveletWileyandSons,NewYork,1999. [81] J.S.Walker,APrimeronWaveletforTheirScienticApplications,WileyandSons,NewYork,1999. [82] M.WestandJ.Harrison,BayesianForecastingandDynamicModels,Springer,NewYork,1997.

PAGE 108

[83] S.Wichert,K.FokianosandK.Strimmer,\Identifyingperiodicallyexpressedtranscriptsinmicroarraytimeseriesdata,"Bioinformatics,vol.20(1),pp.5{20,2004. [84] R.Wu,C.X.Ma,M.LinandG.Casella,\Ageneralframeworkforanalyzingthegeneticarchitectureofdevelopmentalcharacteristics,"Genetics,vol.166,pp.1541{1551,2004. [85] Y.H.YangandT.Speed,\DesignissuesforcDNAmicroarrayexperiments,"Genetics,vol.3,pp.579{588,2002. [86] L.P.Zhao,R.PrenticeandL.Breeden,\Statisticalmodelingoflargemicroarraydatasetstoidentifystimulus-responseproles,"PNAS,vol.98,pp.5631{5636,2001. [87] D.L.ZimmermanandV.Nu~nez-Anton,StructuredAntedependenceModelsforLognitudinalData,Springer-Verlag,NewYork,1997. [88] D.L.ZimmermanandV.Nu~nez-Anton,\Parametricmodelingofgrowthcurvedata:anoverview(withdiscussion),"Test,vol.10,pp.1{73,2001.

PAGE 109

Bong-RaeKimwasborninSeoul,Korea.ShereceivedaBachelorofArtsandScienceinhistory.ShewasintroducedtostatisticsbyadearfriendofhersandcametotheU.S.tostudysomethingdierentthanwhatshehasmajoredinbecauseshelikedachallenge.SheworkedasastatisticianforawhilebackinKoreaonceshereceivedherM.S.inStatistics.ShecamebacktotheU.S.forherPh.D.in2000.SheenteredthePh.D.programofstatisticsattheUniversityofFlorida.Sincethen,shehasbeeninvolvedincollaborativeresearchinstatisticalapplicationsinpublichealthandmedicineandstatisticalgenetics. 96