Context-Based Classification Via Data-Dependent Mixtures of Logistic and Hidden Markov Model Classifiers


Material Information

Title:
Context-Based Classification Via Data-Dependent Mixtures of Logistic and Hidden Markov Model Classifiers
Physical Description:
1 online resource (115 p.)
Language:
english
Creator:
Yuksel,Seniha E
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:
2011
Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering, Computer and Information Science and Engineering
Committee Chair:
Gader, Paul D
Committee Members:
Rangarajan, Anand
Ho, Jeffrey
Wilson, Joseph N
Aytug, Haldun

Subjects

Subjects / Keywords:
based -- classification -- context -- experts -- hidden -- hmm -- markov -- me -- mhmme -- mixture -- model -- variational -- vmec
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre:
Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
This research addresses the problems encountered when designing classifiers for classes that contain multiple subclasses whose characteristics are dependent on the context. It is sometimes the case that when the appropriate context is chosen, classification is relatively easy, whereas in the absence of contextual information, classification may be difficult. Therefore, this research focuses on simultaneous learning of context and classification. One area where such problems are encountered is landmine detection. The signals collected by radar and metal detector systems show a lot of variability depending on a number of factors. Some of these factors can be listed as follows: (i) landmines are diverse in terms of their physical properties (large, small, plastic, metal), (ii) they may be buried at various depths in the ground (shallow, deep), (iii) they may be covered with clutter (stones, bushes, bottles, etc.), and (iv) the signals collected from these mines in different environments show a huge variability based on temperature, humidity, and soil conditions. When all of these factors are combined, one mine may look more similar to a mine of a different type (or to metallic clutter) than to itself in a different environment. Moreover, the exact reason that caused this confusion may not always be clear or available. In such cases, a group of data that looks similar based on some unknown factor defines a context, and machine learning algorithms are needed to find these contexts and to learn classifiers within each context for a mine versus non-mine decision to be made. The mixture of experts (ME) model is suitable for problems such as landmine detection that require context-based classification. During training, ME simultaneously learns the contexts and expert classifiers within each context, all within one probabilistic framework. In initial experiments applying the standard ME model to landmine detection, significantly increased classification rates were achieved, providing insightful parameters that can be understood by a system operator. However, two problems are encountered with the ME model. The first problem is over-fitting, which means that the model could be learning noise, and it may not generalize to future data. To prevent over-fitting, the variational mixture of experts model for multi-class classification is developed, and the lower bound associated with the model is derived. The lower bound provides a measure of correctness, and is used as an excellent stopping criterion. The second problem is that the ME model is not suitable for time series or sequential data, as it requires fixed-length features and it cannot represent the dependency between a sequence of observations. To overcome these issues, a novel model, the mixture of hidden Markov model experts (MHMME), has been introduced. The MHMME model can divide time series data into meaningful contexts, and learn an expert hidden Markov model (HMM) for each of these contexts. Although motivated by landmine detection, MHMME is a general method. Our experiments on multiple real and synthetic datasets show that MHMME can perform better than ME and HMMs, and can do well in comparison to state-of-the-art models.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Seniha E Yuksel.
Thesis:
Thesis (Ph.D.)--University of Florida, 2011.
Local:
Adviser: Gader, Paul D.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2011
System ID:
UFE0043347:00001




Full Text

CONTEXT-BASED CLASSIFICATION VIA DATA-DEPENDENT MIXTURES OF LOGISTIC AND HIDDEN MARKOV MODEL CLASSIFIERS

By

SENIHA ESEN YUKSEL

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2011

© 2011 Seniha Esen Yuksel

To my wonderful mother, whose strength never ceases to amaze me

ACKNOWLEDGMENTS

I would like to extend my heartiest gratitude to my advisor Dr. Paul Gader for his guidance and support during my PhD. Not only did he have the most interesting projects, a positive attitude, and a great lab, but he also never left me without funding at the worst economic times. I feel very fortunate to have met him and to be one of his students, and I hope to carry his professionalism and his excitement with me for the rest of my career.

I would like to thank my committee members Dr. Joseph Wilson, Dr. Anand Rangarajan, Dr. Jeffrey Ho and Dr. Haldun Aytug for their valuable suggestions and insightful comments. Dr. Ho has been very encouraging throughout my PhD studies. Dr. Wilson has been a major help in the first part of my dissertation. I would like to also thank him for diligently proofreading my papers. Many thanks to Dr. Jeremy Bolton for emailing me one day to collaborate on a project on multiple instance learning. It was a semester well spent working with him, and it has given me many ideas for future research.

I would like to also express my thanks to Dr. Rolf Hummel and Dr. Thierry Dubroca from the Material Science and Engineering Department. Our collaborative research in the last two semesters of my PhD was truly inspiring. I enjoyed having a major role in their explosive detection project, and doing interdisciplinary research.

It was a pleasure to meet all the former and current PhD students at the CSI: Florida lab: Xuping Zhang, Ganesh Ramachandran, Andres Vasquez Mendez, Alina Zare, Ken Watford, Ryan Close, Brandon Smock, and Taylor Glenn. Also, I would like to thank all my friends at the CISE Department: Hale Erten, Jeeyoung Kim, Gokhan Kaya, Ritwik Kumar, Karthik Gurumoorthy, Venkatakrishnan Ramaswamy, Manu Sethi, Oneil Smith, and Ajit Rajwade. It has been a privilege to meet all these smart, fun and unique individuals.

I would like to thank all my teachers who have supported me to this day. Prof. Gonul Turhan Sayan and Prof. Gozde Bozdagi from my undergraduate studies, and my MSc advisor Prof. Aly A. Farag, always had confidence in me and supported me. This PhD was a collaborative effort of all the people who supported me here and overseas. I would like to thank all the doctors and the nurses in Turkey who kept my father alive, gave me peace of mind, and gave me a chance to show him my diploma. And last but not least, I would like to thank my family: my mother Prof. Oznur Yuksel, my father Arch. Huseyin Yuksel, and my soon-to-be surgeon brother Dr. Mehmet Eren Yuksel. You are my role models with your endurance and your strength. Nothing could have been possible without your support and love.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Original Contributions

2 DESCRIPTION OF DATASETS
  2.1 Landmine Dataset
  2.2 Chicken Pieces Dataset for 2D Shape Recognition

3 MIXTURE OF EXPERTS (ME)
  3.1 ME Classification Model
  3.2 Advances in ME for Classification
  3.3 Modifications to ME to Handle Sequential Data
  3.4 Comparison to Popular Algorithms
  3.5 Experimental Results on Landmine Data with the Original ME Model

4 VARIATIONAL MIXTURE OF EXPERTS FOR CLASSIFICATION (VMEC)
  4.1 VMEC Model
    4.1.1 VMEC Lower Bound
    4.1.2 Training the VMEC Model
  4.2 Experimental Results on Synthetic Data
  4.3 Experimental Results on the Landmine Dataset

5 FUNDAMENTALS OF HIDDEN MARKOV MODELS (HMM)
  5.1 Evaluation
  5.2 Decoding
  5.3 Reestimation
    5.3.1 Baum-Welch (BW) Training
    5.3.2 Minimum Classification Error (MCE) Training
  5.4 Finding Multiple Models from Time-Series Data

6 MIXTURE OF HIDDEN MARKOV MODEL EXPERTS (MHMME)
  6.1 How ME Compares to Other Models

  6.2 Mixture of Hidden Markov Model Experts
    6.2.1 Training of the MHMME Model
    6.2.2 Two-Class Case
  6.3 Results on Synthetic Data
  6.4 Results on the Landmine Dataset
  6.5 Results on the Chicken Pieces Dataset
  6.6 Discussion and Future Work

7 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Landmine data collection from two temperate sites
2-2 Proportion of landmine and clutter objects
2-3 Properties of landmines
2-4 Distribution of samples in the chicken pieces dataset
2-5 Lengths of the sequences in the chicken pieces dataset
3-1 Naming conventions for receiver operating characteristics (ROC) curves
4-1 Probability of false alarm (PFA) at 90% probability of detection (PD)
6-1 Classification rates on the landmine dataset for 10-fold training
6-2 Classification rates on the chicken pieces dataset for 2-fold training

LIST OF FIGURES

1-1 Simplified example for mixture of experts (ME)
2-1 Landmine detection systems
2-2 Signatures from landmine detection sensors
2-3 Position of the target in a grid cell
2-4 Argand diagrams for mine and clutter objects
2-5 Chicken pieces dataset
2-6 Curvature features
3-1 Classification results on a simple example
3-2 ME architecture for classification
3-3 Receiver operating characteristics (ROC) on landmine data
3-4 Dominant output using edge histogram descriptors and angle model features
4-1 Variational ME for classification (VMEC) model parameter dependence
4-2 Synthetic data for VMEC experiment
4-3 VMEC results for one expert
4-4 VMEC results for five experts
4-5 Landmine results with VMEC
4-6 The lower bound of the VMEC training
5-1 Symmetric likelihood matrix
5-2 Hierarchical clustering
6-1 Mixture of hidden Markov model experts (MHMME) architecture
6-2 Synthetic data
6-3 MHMME results on synthetic data
6-4 MHMME objective function
6-5 Cluster centers of the landmine data
6-6 Distribution of the sequences to the experts

6-7 Contexts defined by the gate hidden Markov models (HMM) 1-4
6-8 Contexts defined by the gate HMMs 5-8
6-9 Viterbi paths of the sequences highly weighted by gate HMMs 1-4
6-10 Viterbi paths of the sequences highly weighted by gate HMMs 5-8
6-11 Sequences highly weighted by the gate's first four HMMs
6-12 ROC curve of MHMME for the landmine dataset
6-13 Contexts learned at the gate
6-14 Log-likelihoods of gate and experts
6-15 Sequences highly weighted by the gate's first four HMMs
6-16 Probability of each decision

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

CONTEXT-BASED CLASSIFICATION VIA DATA-DEPENDENT MIXTURES OF LOGISTIC AND HIDDEN MARKOV MODEL CLASSIFIERS

By

Seniha Esen Yuksel

August 2011

Chair: Paul D. Gader
Major: Computer Engineering

This research addresses the problems encountered when designing classifiers for classes that contain multiple subclasses whose characteristics are dependent on the context. It is sometimes the case that when the appropriate context is chosen, classification is relatively easy, whereas in the absence of contextual information, classification may be difficult. Therefore, this research focuses on simultaneous learning of context and classification.

One area where such problems are encountered is landmine detection. The signals collected by radar and metal detector systems show a lot of variability depending on a number of factors. Some of these factors can be listed as follows: (i) landmines are diverse in terms of their physical properties (large, small, plastic, metal), (ii) they may be buried at various depths in the ground (shallow, deep), (iii) they may be covered with clutter (stones, bushes, bottles, etc.), and (iv) the signals collected from these mines in different environments show a huge variability based on temperature, humidity, and soil conditions. When all of these factors are combined, one mine may look more similar to a mine of a different type (or to metallic clutter) than to itself in a different environment. Moreover, the exact reason that caused this confusion may not always be clear or available. In such cases, a group of data that looks similar based on some unknown factor defines a context, and machine learning algorithms are needed to find these contexts and to learn classifiers within each context for a mine versus non-mine decision to be made.

The mixture of experts (ME) model is suitable for problems such as landmine detection that require context-based classification. During training, ME simultaneously learns the contexts and expert classifiers within each context, all within one probabilistic framework. In initial experiments applying the standard ME model to landmine detection, significantly increased classification rates were achieved, providing insightful parameters that can be understood by a system operator. However, two problems are encountered with the ME model. The first problem is over-fitting, which means that the model could be learning noise, and it may not generalize to future data. To prevent over-fitting, the variational mixture of experts model for multi-class classification is developed, and the lower bound associated with the model is derived. The lower bound provides a measure of correctness, and is used as an excellent stopping criterion. The second problem is that the ME model is not suitable for time series or sequential data, as it requires fixed-length features and it cannot represent the dependency between a sequence of observations. To overcome these issues, a novel model, the mixture of hidden Markov model experts (MHMME), has been introduced. The MHMME model can divide time series data into meaningful contexts, and learn an expert hidden Markov model (HMM) for each of these contexts. Although motivated by landmine detection, MHMME is a general method. Our experiments on multiple real and synthetic datasets show that MHMME can perform better than ME and HMMs, and can do well in comparison to state-of-the-art models.

CHAPTER 1
INTRODUCTION

Many datasets in the real world show different patterns based on multiple contexts. For example, electricity usage has seasonal patterns, and within each season electricity consumption of socio-economic groups shows different patterns. In this example, both the seasons and the socio-economic groups define the contexts for electricity consumption. In electrocardiogram (EKG) heartbeat classification, ethnic and social groups define a healthy beat. For example, athletes typically have slower resting heartbeats. In handwriting recognition, a single character may be impossible to read without the context of the word in which it appears.

The aforementioned problems are also encountered in landmine data, so it is difficult to represent the class of all mines with a single model. Going one step deeper, one can think of the class of all mines as having multiple subclasses such as high metal anti-tank (HMAT) mines, low metal anti-tank (LMAT) mines, high metal anti-personnel (HMAP) mines, and low metal anti-personnel (LMAP) mines. However, the features of these subclasses are interlaced, so it is also difficult to appoint a model as an expert to identify a particular subclass of mines [1, 2].

Furthermore, the data might be available from multiple sources. For example, in landmine detection, there are two types of sensors that are widely used: ground penetrating radars (GPR) and metal detectors, also known as electromagnetic induction (EMI) sensors. While the HMAP and HMAT mines can appear similar to a metal detector, they may look quite different to a GPR sensor. Similarly, given an object with a low energy EMI signature, GPR will be better at distinguishing between the LMAP mines and clutter.

One approach to building classifiers for which there are multiple contexts is to have expert models for each context. Unfortunately, contexts are generally hard to define, they are often interlaced, and they do not have sharp boundaries. Moreover, contextual information might be inherent in the data, but unknown to the data modeler.

Therefore, the classifier should be able to learn both the context and the experts, ideally in a single model. In cases where the context information is unknown, a context is defined as a group of similar signatures.

The mixture of experts (ME) model introduced by Jacobs and Jordan et al. [3, 4] is suitable for such problems as it solves non-linear problems by a probabilistic division of the input space. In the ME model, a gate makes soft partitions of the input space and defines those regions where the individual expert opinions are trustworthy. A simple example to demonstrate the division of the input space is shown in Fig. 1-1, where the gate simplifies a nonlinear classification problem into two linear classification problems, and the experts learn the simple parametrized surfaces in these regions. While the ME has been used to solve both regression and classification problems, the focus of this study is the classification version. In the rest of this dissertation, ME will refer to the original model in [3, 4].

Figure 1-1. Simplified classification example for ME. The gate divides the region in two with a soft decision, then the experts learn the simple surfaces to separate the two classes. Modified with permission [5].

Prior to embarking on research on variations of the ME model, experiments were conducted with the original ME model for landmine detection. The experts of the ME model give soft decision boundaries for different types of mines (HMAT, LMAT, etc.), and learn specialized models for each type.

The ME model significantly increased the classification rates while providing insightful parameters that can be understood by a system operator.

For complicated input examples, however, ME is prone to over-training due to the expectation-maximization (EM) training, especially if the number of experts is increased. To regularize ME, the variational mixture of experts (VME) was developed and used by several researchers [6-10] for regression problems. Waterhouse [9] discussed how to adjust VME to classification, but the steps to train the model were not made clear since the equations were applicable to vector-valued parameters as opposed to matrices for each expert. Also, the previous algorithms used the number of iterations as a stopping condition, which may lead to premature termination of the algorithm or to unnecessary training. In addition, the ME model is not suitable for time series data, and the several extensions in the literature [11-16] concern regression and use a one-step-ahead or multi-step-ahead prediction. In multi-step prediction the last d values of the time-series data are used as features in a neural network, but such algorithms cannot handle data with varying length, and the use of multilayer network type approaches prevents them from completely describing the temporal properties of time-series data.

1.1 Original Contributions

A variational mixture of experts model for multi-class classification (VMEC) is developed, and (1) a complete learning algorithm for VMEC is given, (2) a variational lower bound is derived for VMEC, and (3) the efficacy of the approach is presented on synthetic data as well as on landmine data to compare to the ME trained with EM. The advantages of the VMEC model can be listed as follows:

- The VMEC model regularizes the model parameters and resists over-fitting,
- Gives distributions of the parameters instead of point estimates,
- The lower bound provides a measure of correctness and can be used as a stopping criterion.

A novel mixture of hidden Markov model experts (MHMME) is developed that can both decompose time-series data into multiple contexts and learn expert classifiers for each context. In MHMME, a gate of hidden Markov models cooperates with a set of hidden Markov model experts that provide multi-class classification. MHMME shares the following advantages of the ME model:

- The MHMME model provides a divide and conquer approach, is probabilistic, and has soft boundaries, all of which make it very suitable for context learning.
- Unlike the traditional mixture models where the mixture coefficient is a scalar, in the MHMME model the mixture coefficient (i.e. the gate) depends on the input, which supports the context learning.
- The learning of the contexts and the classifiers is accomplished simultaneously in one model.

MHMME is suitable for multiclass classification. In addition, MHMME has the following three advantages that set it apart from the other models:

- The MHMME model is suitable for time-series data and sequential data of varying lengths due to the use of the hidden Markov models.
- There is no hard clustering of time-series data, so sequences can freely move between contexts and classifiers during training.
- Experiments on synthetic and real data show that MHMME can perform better than ME and HMMs, and can do well in comparison to state-of-the-art models.

The update equations that lead to the simultaneous training of the experts and the gate are given in Chapter 6. Throughout this study, the words temporal, sequential, and time-series data will be used interchangeably, as we are interested not only in time-series data but in any data that carry sequential information.

CHAPTER 2
DESCRIPTION OF DATASETS

2.1 Landmine Dataset

At present, there are between 60,000,000 and 100,000,000 active buried landmines around the world [17], and the detection of these landmines is a hard problem because of the various sizes and different types of mines, as well as the changes in their signatures with changes in temperature, humidity, and soil conditions. To address this problem we consider a dataset consisting of mine and non-mine object data collected using robotic systems with ground penetrating radar (GPR) and wide-band electromagnetic induction (WEMI) sensors.

Landmines are typically categorized into four groups according to their metallic content and according to their intended targets. These categories are high metal anti-tank (HMAT), high metal anti-personnel (HMAP), low metal anti-tank (LMAT), and low metal anti-personnel (LMAP). To create a controlled environment for algorithm development, clutter and mine objects were placed into grid cells in two temperate sites, and data was collected using the robotic systems. The properties of these temperate sites are given in Table 2-1 and Table 2-2. Table 2-1 gives the number of objects in each site. It can be observed that the first site has more clutter objects and fewer mine objects. The second site contains more mine objects and more mine types. In Table 2-2, the percentage of landmine and clutter objects is given.

Table 2-1. Landmine data collection from two temperate sites

  Data / Site                  Temperate Site 1   Temperate Site 2
  Total number of data         224                220
  Number of mine objects       44                 112
  Number of clutter objects    148                68
  Number of blank grid cells   32                 40
  Number of mine types         12                 26
  Grid area                    1.5 m x 1.5 m      1 m x 1 m

Table 2-2. Proportion of landmine and clutter objects

  Notation   Meaning                              Proportion (%)
  HMAP       High metal anti-personnel mine       6.7
  LMAP       Low metal anti-personnel mine        14.8
  HMAT       High metal anti-tank mine            2.4
  LMAT       Low metal anti-tank mine             11
  HMC        High metal clutter                   20.2
  MMC        Medium metal clutter                 6.3
  LMC        Low metal clutter                    6.3
  NMC        Non-metallic clutter or blank cell   32.1

To detect these landmines, robotic systems with GPR and WEMI sensors have been developed as shown in Fig. 2-1. The WEMI sensors are good at finding the mines with metallic content, whereas the GPR sensors are good at finding the plastic mines. A GPR sensor's operating frequency is between 200 MHz and 7 GHz, and it collects data at 2 cm intervals in 24 channels. The WEMI sensor, also called the metal detector, collects complex responses in 21 frequencies between 330 Hz and 90,030 Hz, spaced at constant intervals in log space, at 1 cm intervals in three channels. This results in a 21 x 3 x 150 complex data matrix for each grid cell. The metal detector is completely described in [18, 19].

Signatures of data collected by GPR and WEMI are shown in Fig. 2-2. In general, landmines show hyperbolic patterns in the GPR data [20], and show smoother patterns in WEMI data when compared to non-metallic clutter. Anti-personnel mines are generally placed close to the ground surface, show some faint metallic signature, and display some GPR response that can often be confused with surface clutter. Anti-tank mines are found at deeper levels and are bigger in size, so they yield better GPR signatures, but as they are found at greater depths, their EMI signals tend to be faint, especially for the ones with low metal content. These properties are summarized in Table 2-3.

In the data collection, adjoining lanes were divided into grid cells and the object was assumed to be at the center of the cell as shown in Fig. 2-3.

Figure 2-1. The robotic systems for landmine detection. The vehicle on the left has a GPR sensor, while the cart on the right has a metal detector sensor. In the 3-D data collected from the GPR, the path of the moving vehicle is denoted as the down-track, and the channels in the GPR sensor are denoted as the cross-track.

Table 2-3. Properties of landmines

                        Low Metal (LM)   High Metal (HM)
  Anti-personnel (AP)   Shallow          Shallow
                        Some GPR         Some GPR
                        Faint metal      Good metal
  Anti-tank (AT)        Deep             Deep
                        Good GPR         Good GPR
                        Faint metal      Good metal

The data were collected in two opposite directions over the same set of grid locations. The feature extraction algorithms were designed to give a score from the center of the grid cell. In this study, the angle prototype matching (APM) [21] and gradient angle model (GRANMA) based K-nearest neighborhood (K-NN) [22] feature extraction algorithms were used for EMI data; and linear prediction processing (LPP) [20], spectral feature extraction (SF) [23] and edge histogram descriptor (EHD) [24, 25] methods were used for GPR data.

Figure 2-2. Signatures collected from the GPR and WEMI sensors for an LMAP and an LMAT. A) An LMAP object's GPR data on the left, EMI data on the right. B) An LMAT object's GPR data on the left, EMI data on the right. In the GPR data, the x-axis shows the down-track direction, and the y-axis shows the depth. In the WEMI data, for both the real (top) and imaginary response (bottom), the x-axis shows the down-track direction, the y-axis shows the energy, and the z-axis shows the frequency.

APM and GRANMA feature extractors use argand diagrams. An argand diagram is a plot of the in-phase component against the quadrature-phase component, and the angles between the points in an argand diagram have been shown to provide discriminative features for different types of landmines [22].

Figure 2-3. The lanes are divided into grid cells of width 1 m and 1.5 m. Each object is placed at the center of a grid cell.

Examples of argand diagrams are shown in Fig. 2-4: for two different types of mines in Fig. 2-4(A-B), for a metallic object in Fig. 2-4(C), and for nonmetallic objects in Fig. 2-4(D). The APM method finds the angle sequences, uses a point-wise median (median among candidates at a given frequency) as the prototype, and computes the Euclidean distance as a goodness-of-fit measure. The GRANMA method uses a parametric model of the WEMI data, whose parameters are used for classification in a K-NN framework.

The first of the GPR features, LPP, is the maximum energy value of the background-removed data around an alarm location declared by a pre-screener. The background in the GPR is removed using a dynamic linear predictor, where the current background vector sample is approximated by a linear combination of past background vector samples. The linear prediction coefficients are obtained by minimizing the mean-square prediction error, and the LPP output is the difference between the current GPR vector sample and its predicted value generated from the linear prediction model.
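To make the linear-prediction step concrete, the following is a minimal sketch (not the dissertation's actual implementation) of background removal with a dynamic linear predictor: each vector sample is approximated by a least-squares combination of the previous p samples, and the residual plays the role of the prediction error whose energy the LPP feature summarizes. The window length p and the synthetic input are illustrative assumptions.

    import numpy as np

    def lpp_residual(scans, p=5):
        """scans: (T, D) array of GPR vector samples along the down-track.
        Returns the prediction residual at each position (zero for t < p)."""
        T, D = scans.shape
        residual = np.zeros_like(scans)
        for t in range(p, T):
            past = scans[t - p:t].T              # (D, p) past background samples
            # coefficients c minimizing ||scans[t] - past @ c||^2
            c, *_ = np.linalg.lstsq(past, scans[t], rcond=None)
            residual[t] = scans[t] - past @ c    # prediction error
        return residual

    # Toy usage: a smooth background plus a localized anomaly
    rng = np.random.default_rng(0)
    bg = np.sin(np.linspace(0, 3, 100))[:, None] + 0.01 * rng.normal(size=(100, 8))
    bg[60] += 0.5                                # anomaly at position 60
    energy = np.abs(lpp_residual(bg)).max(axis=1)
    print(energy.argmax())                       # the anomaly position stands out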

The second GPR feature extractor is the spectral algorithm. The spectral feature is created by estimating the energy density spectrum (EDS) of an alarm at the declaration position and matching the estimated EDS with a template that is obtained from a landmine target. The LPP and spectral features were fused by the direct sum of the two (after proper scaling in LPP) divided by the square root of spectral compactness. This is called the Spectral LPP feature.

Figure 2-4. Argand diagrams for mine and clutter objects. An argand diagram is the plot of the real EMI response I(w) vs. the imaginary EMI response Q(w). Metallic objects show smoother and distinct shapes, such as the LMAT mine in (A), the LMAP mine in (B), and the metallic clutter object in (C). Nonmetallic objects have noisy argand plots, such as the nonmetallic clutter object in (D).

The third GPR feature, EHD, is a translation-invariant feature based on the local edge distribution of the 3D GPR signatures [24, 25]. With EHD, a 3D signature is divided into windows. In each window, local edges are characterized as vertical, horizontal, diagonal, antidiagonal edges and non-edges.

Then the local edge distribution for each window is represented by a histogram. Finally, each object is given a confidence value by comparing it to a number of prototypes using a possibilistic K-NN rule.

2.2 Chicken Pieces Dataset for 2D Shape Recognition

The chicken pieces database [26] consists of 446 binary images of chicken pieces from five classes, a sample of which is displayed in Fig. 2-5. The number of samples in each class is given in Table 2-4. This dataset is publicly available at http://algoval.essex.ac.uk/data/sequence/chicken/. Its curvature features were made available by Bicego et al., and are fully explained in [27]. Briefly, the curvature features were obtained as follows: the contours of the binary images were found by a Canny edge detector, and approximated by line segments. The initial point was set to the rightmost point lying on the horizontal line passing through the object centroid. Then the curvature value at each point was computed as the angle between two consecutive segments at that point and recorded in a counterclockwise manner, as demonstrated in Fig. 2-6. The resulting sequences have different lengths, as summarized in Table 2-5.

Table 2-4. Distribution of samples in the chicken pieces dataset

  Wing   Back   Drumstick   Thigh and Back   Breast   Total
  117    76     96          61               96       446

Table 2-5. Lengths of the sequences in the chicken pieces dataset

  Min. length   Max. length   Mean length   Median length
  18            104           54            51
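As an illustration of the curvature feature just described, the sketch below computes the turning angle between consecutive segments of an ordered polygon contour. The square contour is a made-up toy input; the real pipeline (Canny edges, segment approximation, choice of the initial point) is not reproduced here.

    import numpy as np

    def curvature_sequence(vertices):
        """vertices: (N, 2) ordered polygon vertices (counterclockwise).
        Returns the signed turning angle at each vertex."""
        v = np.asarray(vertices, dtype=float)
        s_in = v - np.roll(v, 1, axis=0)         # incoming segment at each vertex
        s_out = np.roll(v, -1, axis=0) - v       # outgoing segment at each vertex
        a_in = np.arctan2(s_in[:, 1], s_in[:, 0])
        a_out = np.arctan2(s_out[:, 1], s_out[:, 0])
        # wrap the angle difference to (-pi, pi]
        return (a_out - a_in + np.pi) % (2 * np.pi) - np.pi

    square = [(1, 0), (1, 1), (0, 1), (0, 0)]    # toy contour
    print(curvature_sequence(square))            # four +pi/2 turns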

Figure 2-5. A sample from the chicken pieces database. The rows represent the wing, back, drumstick, thigh and back, and breast classes, respectively.

Figure 2-6. Curvature features. Contours of binary images were found by a Canny edge detector, and approximated by line segments. The curvature value at each point was computed as the angle between two consecutive segments, and was recorded in a counterclockwise manner. The initial point is the rightmost point lying on the horizontal line passing through the object centroid.

CHAPTER 3
MIXTURE OF EXPERTS (ME)

The original mixture of experts (ME) model introduced by Jacobs et al. [3] can be viewed as a tree-structured architecture that is based on the principle of divide and conquer, and has three main components: several experts that are either regression functions or classifiers; a gate that makes soft partitions of the input space and defines the regions where the expert opinions are trustworthy; and a probabilistic model to combine the gate and the experts. The model is a weighted sum of the experts, where the weights are the input-dependent gates.

In this simplified form, the original ME model has three important properties: (i) it allows for the individual experts to specialize in smaller parts of a bigger problem, (ii) it uses soft partitions of the data, and (iii) it allows the splits to be formed along hyperplanes at arbitrary orientations in the input space [28]. These properties allow the model to represent non-stationary or piecewise continuous data in complex regression processes, and to identify the non-linearities in classification problems. Therefore, to understand the systems that produce such non-stationary or piecewise continuous data, ME has been revisited and revived over the years in many publications. In these studies, the key elements have been the three components of the original ME: the gate, the experts, and a model to combine the gate and the experts. The linear gate and the experts of the original ME model have been improved with more complicated regression or classification functions, the learning algorithm has been changed, and the mixture model has been modified for density estimation and for time series data representation.

In the past twenty years, there have been solid statistical and experimental analyses on ME, and a considerable number of studies have been published in all areas of regression, classification and fusion. The ME models have been found useful in conjunction with many of the current classification or regression algorithms because of their modular and flexible structure.

In the late 2000's, several ME studies were published, including [14-16, 29-54]. To see the benefit that ME models have provided, we can look at a 2008 survey that identified the top 10 most influential algorithms in the data mining area. It cited C4.5, k-Means, support vector machines (SVM), Apriori, expectation-maximization (EM), PageRank, AdaBoost, k-nearest neighborhood (kNN), naive Bayes, and classification and regression trees (CART) [55]. ME is closely related to most of these algorithms, and has been compared to, or combined with, most of them to improve their performance. Specifically, MEs have often been trained with EM and have been initialized using k-Means [37, 56].

It has been found that decision trees have the potential advantage of computational scalability, handling data of mixed types, handling missing values, and dealing with irrelevant inputs [57]. However, decision trees carry the limitations of low prediction accuracy and high variance [58]. ME can be regarded as a statistical approach to decision tree modeling where the decisions are treated as hidden multinomial random variables. Therefore, ME carries the advantages of decision trees, but improves on them with its soft boundaries, lower variance, and its probabilistic framework to allow for inference procedures, measures of uncertainty, and Bayesian approaches [59]. On the other hand, decision trees have been combined in ensembles, forming random forests, to increase the performance of a single ensemble and increase prediction accuracy while keeping other decision tree advantages [60]. Similarly, ME has been combined with boosting and, with a gating function that is a function of confidence, ME has been shown to provide an effective dynamic combination for the outputs of the experts [61].

ME is flexible enough to be combined with a variety of different models. It has been combined with SVM [35, 36, 54] to partition the input space and to allocate different kernel functions for different input regions. Recently, ME has been combined with Gaussian processes (GP) to make them accommodate nonstationary covariance and noise.

A single GP has a fixed covariance matrix, and its solution typically requires the inversion of a large matrix. With the mixture of GP experts model [29, 30], the computational complexity of inverting a large matrix can be replaced with several inversions of smaller matrices, providing the ability to solve larger scale datasets. ME has also been used to make a hidden Markov model (HMM) with time-varying transition probabilities that are conditioned on the input [16, 41].

ME has been regarded as a mixture model for estimating conditional probability distributions, and with this interpretation, ME's statistical properties were investigated during 1994-2006. The convergence properties [4], the bias and variance analysis [62], upper bounds on the approximation error [63], rate of approximation to the true mean in regression [64], bounds on the Vapnik-Chervonenkis dimension of ME [65], regularity conditions for consistency [66], identifiability of ME [67], function approximation properties [68], asymptotic normality conditions [69], and consistency properties [70, 71] have been analyzed. These statistical properties led to the development of various Bayesian training methods between 1996-2010, and it has been possible to train ME with variational methods and Markov chain Monte Carlo (MCMC). The Bayesian training methods have introduced prior knowledge into the training, helped avoid over-training, and opened the search for the best model (the number of experts and the depth of the tree) during 1994-2007. In the meantime, the model has been used in a very wide range of applications, from finance to biology.

3.1 ME Classification Model

In the ME model, a set of expert networks and a gating network cooperate with each other to solve a non-linear supervised learning problem by dividing the input space into a nested set of regions. The gating network makes a soft split of the whole input space, and the experts learn the simple parametrized surfaces in these partitions of the regions. The parameters of these surfaces in both the gate and the expert networks can be learned using the EM algorithm.

A simple example to demonstrate the division of the input space is shown in Fig. 3-1. In Fig. 3-1(A) the blue points from the first class and the red points from the second class are not linearly separable. However, once the data is partitioned into two by the gate shown in Fig. 3-1(B), the data on either side of the gate is linearly separable, and can be classified by linear experts as shown in Fig. 3-1(C) and Fig. 3-1(D). The parameters of the gate and the experts have been plotted as lines. The final decision has been explicitly shown in Fig. 3-1(E). The decisions of the gate and the experts are probabilistic, so these regions are actually soft partitions, and the final decision is made by taking the maximum of the probabilistic outputs.

Figure 3-1. Classification results on a simple example. Panels: A) Classification data. B) Gating weights. C) Expert 1 weights. D) Expert 2 weights. E) Classification results. The blue and the red points in (A) belong to class 1 and class 2, respectively. The gate divides the region into two in (B), such that, to the left of the gating line, the first expert is responsible as shown in (C), and to the right of the gating line, the second expert is responsible as shown in (D). In (E), classification results are shown as areas, where the black region corresponds to class 1 (blue points) and the white region corresponds to class 2 (red points).

Let D = {X, Y} denote the data, where X = {x^{(n)}}_{n=1}^{N} is the input, Y = {y^{(n)}}_{n=1}^{N} is the target, and N is the number of training points. Also, let \Theta = {\theta_g, \theta_e} denote the set of all parameters, where \theta_g is the set of the gate parameters and \theta_e is the set of the expert parameters. Unless necessary, we will denote an input vector x^{(n)} with x, and a target vector y^{(n)} with y from now on. A superscript (n) will be used to indicate that a variable depends on an input x^{(n)}.

The derivation of a training algorithm for the ME model makes use of the assumption that P(i | j, x, \Theta) = 0 if i \neq j, where i and j denote any two experts. With this assumption, the total probability theorem applies [12]. Furthermore, the probability of the gate g given \theta_g is independent of the parameters of the experts \theta_e. Therefore, the total probability of observing y can be written in terms of the experts:

  P(y | x, \Theta) = \sum_{i=1}^{I} P(y, i | x, \Theta)
                   = \sum_{i=1}^{I} P(i | x, \theta_g) P(y | i, x, \theta_e)
                   = \sum_{i=1}^{I} g_i(x, \theta_g) P(y | i, x, \theta_e)    (3-1)

where I is the number of experts, the function g_i(x, \theta_g) \equiv P(i | x, \theta_g) representing the probability of the i-th expert given x is the gate, and P(y | i, x, \theta_e) is the probability of the i-th expert generating y given x. The latter will be denoted by P_i(y) from now on. The ME training algorithm maximizes the log-likelihood of the probability in Eq. 3-1 to learn the parameters of the experts and the gate. The gate is the scalar defined by the softmax function:

  g_i(x, v) = \frac{e^{\beta_i(x, v)}}{\sum_{j=1}^{I} e^{\beta_j(x, v)}}    (3-2)

where \beta_i(x, v) are linear functions of the gate parameter v, given by \beta_i(x, v) = v_i^T [x, 1] in the original ME. The softmax function is a smooth version of the winner-take-all model.

For a K-class classification, there are K parameter sets {{w_{ik}}_{i=1}^{I}}_{k=1}^{K} to be learned in each expert, corresponding to the parameters of each class, as shown in Fig. 3-2. The desired output y^{(n)} is of length K, and y_k^{(n)} = 1 if x^{(n)} belongs to class k and 0 otherwise. For I the number of experts and k: 1...K the class index, the experts follow the multinomial probability model:

  P_i(y) = \prod_k \hat{y}_{ik}^{y_k}    (3-3)

where \hat{y}_{ik} are the expert outputs per class, computed by softmax functions:

  \hat{y}_{ik}^{(n)} = \frac{\exp(w_{ik}^T [x^{(n)}, 1])}{\sum_{r=1}^{K} \exp(w_{ir}^T [x^{(n)}, 1])}    (3-4)

To make a single prediction, the outputs are computed per class:

  \hat{y}_k^{(n)} = \sum_i g_i(x^{(n)}) \hat{y}_{ik}^{(n)}

and for practical purposes, the input x^{(n)} is classified as belonging to the class k that gives the maximum \hat{y}_k^{(n)}, k: 1...K. It can also be the case that the ME model may serve as a component in a larger model and that the probabilities can be used in a higher level reasoning system.

In training the ME, indicator variables Z = {{z_i^{(n)}}_{n=1}^{N}}_{i=1}^{I} are introduced such that

  z_i^{(n)} = 1 if x^{(n)} \in R_i, and 0 otherwise,

where R_i is the region specified by expert i.
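A minimal numpy sketch of the forward computation in Eqs. 3-2 to 3-4 may make the shapes concrete: a softmax gate over I experts, a softmax over K classes within each expert, and the gate-weighted class prediction. The dimensions and random parameters are illustrative assumptions, not values from this study.

    import numpy as np

    def softmax(a, axis=-1):
        a = a - a.max(axis=axis, keepdims=True)  # for numerical stability
        e = np.exp(a)
        return e / e.sum(axis=axis, keepdims=True)

    def me_forward(x, V, W):
        """x: (d,) input; V: (I, d+1) gate parameters v_i;
        W: (I, K, d+1) expert parameters w_ik."""
        xb = np.append(x, 1.0)                   # the augmented input [x, 1]
        g = softmax(V @ xb)                      # gate g_i(x, v), Eq. 3-2
        y_ik = softmax(W @ xb, axis=1)           # expert outputs, Eq. 3-4
        y_k = g @ y_ik                           # sum_i g_i(x) * yhat_ik
        return g, y_ik, y_k

    rng = np.random.default_rng(0)
    I, K, d = 2, 3, 4                            # experts, classes, input dim
    g, y_ik, y_k = me_forward(rng.normal(size=d),
                              rng.normal(size=(I, d + 1)),
                              rng.normal(size=(I, K, d + 1)))
    print(y_k.sum(), y_k.argmax())               # posteriors sum to 1; decision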

Figure 3-2. ME architecture for classification.

With the indicator variables, Eq. 3-1 can be turned into the complete data distribution

  P(Y, Z | w, v) = \prod_n \prod_i [ g_i^{(n)} P_i(y^{(n)}) ]^{z_i^{(n)}}    (3-5)

The complete log likelihood can be written as:

  l(\Theta; D; Z) = \sum_{n=1}^{N} \sum_{i=1}^{I} z_i^{(n)} { \log g_i^{(n)} + \log P_i(y^{(n)}) }    (3-6)

which is a function of the missing random variables z_i. Therefore, the EM algorithm is employed to average out z_i and maximize the expected complete data likelihood E_Z(\log P(D, Z | \Theta)). The expectation of the likelihood in Eq. 3-6 results in:

  Q(\Theta, \Theta^{(p)}) = \sum_{n=1}^{N} \sum_{i=1}^{I} h_i^{(n)} { \log g_i^{(n)} + \log P_i(y^{(n)}) } = \sum_{i=1}^{I} (Q_{g_i} + Q_{e_i})    (3-7)

where p is the iteration index,

  Q_{g_i} = \sum_n h_i^{(n)} \log g_i^{(n)}    (3-8)

  Q_{e_i} = \sum_n h_i^{(n)} \log P_i(y^{(n)})    (3-9)

and

  h_i^{(n)} = E[z_i^{(n)} | D] = \frac{g_i^{(n)} P_i(y^{(n)})}{\sum_j g_j^{(n)} P_j(y^{(n)})}    (3-10)

The parameter \Theta is estimated by the iterations through the E and M steps given by:

1. E-step: Compute h_i^{(n)}, the expectation of the indicator variables.
2. M-step: Find a new estimate for the parameters, such that v_i^{(p+1)} = \arg\max_{v_i} Q_{g_i}, and \theta_i^{(p+1)} = \arg\max_{\theta_i} Q_{e_i}.

There are three important points regarding the training. Firstly, Eq. 3-8 describes the cross-entropy between g_i and h_i. In the M step, h_i is held constant, so g_i learns to approximate h_i. Remembering that 0 \le h_i \le 1 and 0 \le g_i \le 1, the maximum Q_{g_i} is reached if both g_i and h_i are 1 and the others (g_j, h_j, j \neq i) are zero. This is in line with the initial assumption from Eq. 3-1 that each pattern belongs to one and only one expert. If the experts are actually sharing a pattern, they pay an entropy price for it. Because of this property, the ME algorithm is also referred to as competitive learning among the experts, as the experts are rewarded or penalized for sharing the data. A more detailed explanation about the effect of entropy on the training is provided in [12, 28].

The second important point is that, by observing Eq. 3-8 and Eq. 3-9, the gate and the expert parameters are estimated separately, owing to the use of the hidden variables. This decoupling gives a modular structure to the ME training, and has led to the development of the hierarchical mixtures of experts (HME) and to the use of other modular networks at the experts and the gates.
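For concreteness, the following sketch computes the E-step responsibilities of Eq. 3-10 for a batch of points, using the multinomial expert likelihood of Eq. 3-3 with one-hot targets. It reuses the kind of quantities produced by the forward-pass sketch above; all shapes and values are illustrative assumptions.

    import numpy as np

    def e_step(G, Y_hat, Y):
        """G: (N, I) gate values g_i^(n); Y_hat: (N, I, K) expert outputs;
        Y: (N, K) one-hot targets. Returns h: (N, I), Eq. 3-10."""
        # For one-hot y, P_i(y) = prod_k yhat_ik^{y_k} is just the expert's
        # predicted probability of the true class.
        P = np.einsum('nik,nk->ni', Y_hat, Y)
        joint = G * P                            # g_i^(n) * P_i(y^(n))
        return joint / joint.sum(axis=1, keepdims=True)

    # Toy usage: 2 points, 2 experts, 2 classes
    G = np.array([[0.7, 0.3], [0.4, 0.6]])
    Y_hat = np.array([[[0.9, 0.1], [0.2, 0.8]],
                      [[0.5, 0.5], [0.1, 0.9]]])
    Y = np.array([[1.0, 0.0], [0.0, 1.0]])
    print(e_step(G, Y_hat, Y))                   # rows sum to 1

In the M-step these h values are held fixed: the gate parameters maximize Q_{g_i} (driving g toward h), while each expert maximizes its responsibility-weighted likelihood Q_{e_i}.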

The probability model for HME [4, 28] is:

  P(y | x, \Theta) = \sum_{i=1}^{I} g_i(x, \theta_{g_i}) \sum_{j=1}^{I} g_{j|i}(x, \theta_{g_{j|i}}) P_{ij}(y, \theta_e)    (3-11)

where g_i is the output of the gate in the top layer, g_{j|i} is the output of the j-th gate connected to the i-th gate of the top layer, and \theta_{g_i} and \theta_{g_{j|i}} are their parameters, respectively.

The third important point is that \max_{v_i} Q_{g_i} cannot be solved analytically because of the softmax function. Therefore, one can use the iterative recursive least square (IRLS) technique for linear gate and expert models [28], the extended IRLS algorithm for non-linear gates and experts [4], or use the generalized EM (GEM) algorithm, which increases the Q function but does not necessarily fully maximize the likelihood [72].
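The two-level mixture of Eq. 3-11 can be sketched in the same style as the single-level forward pass above; the branching factors, shapes, and random parameters are again illustrative assumptions rather than a definitive implementation.

    import numpy as np

    def softmax(a, axis=-1):
        a = a - a.max(axis=axis, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=axis, keepdims=True)

    def hme_forward(x, V_top, V_low, W):
        """V_top: (I, d+1) top gate; V_low: (I, J, d+1) nested gates g_{j|i};
        W: (I, J, K, d+1) experts. Returns the (K,) class posterior, Eq. 3-11."""
        xb = np.append(x, 1.0)
        g_i = softmax(V_top @ xb)                # top-layer gate
        g_ji = softmax(V_low @ xb, axis=1)       # gates within each branch
        y_ijk = softmax(W @ xb, axis=2)          # expert class outputs
        return np.einsum('i,ij,ijk->k', g_i, g_ji, y_ijk)

    rng = np.random.default_rng(1)
    I, J, K, d = 2, 2, 3, 4
    y_k = hme_forward(rng.normal(size=d), rng.normal(size=(I, d + 1)),
                      rng.normal(size=(I, J, d + 1)),
                      rng.normal(size=(I, J, K, d + 1)))
    print(y_k, y_k.sum())                        # a distribution over K classes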

3.2 Advances in ME for Classification

The ME model was designed mainly for function approximation rather than classification; however, it has a significant appeal for multi-class classification due to the idea of training the gating network together with the individual classifiers through one protocol. In fact, more recently, the ME model has been getting attention as a means of finding sub-clusters within the data, and learning experts for each of these sub-clusters. In doing so, the ME model can benefit from the existence of common characteristics among data of different classes. This is an advantage compared to the other classifiers, which do not consider the other classes when finding the class conditional density estimates from the data of each class.

Waterhouse and Robinson provided a nice overview on the parameter initialization, learning rates and stability issues in [73] for multi-class classification. Since then, there have been reports in the literature [74-76] that networks trained by the IRLS algorithm perform poorly in multi-class classification. Although these arguments have merit, Ng and McLachlan [77] claimed that if the step-size parameters are small enough, then the log-likelihood is monotonically increasing and the IRLS is stable.

The IRLS algorithm makes a batch update, updates all the parameter vectors of an expert {w_{ik}}_{k=1}^{K} at once, and implicitly assumes that these parameters are independent. Chen et al. [76] pointed out that IRLS updates result in an incomplete Hessian matrix, in which the off-diagonal elements are non-zero, implying the dependence of the parameters. In fact, in multi-class classification, each parameter vector in an expert relates to the others through the softmax function in Eq. 3-4, and therefore, these parameter vectors cannot be updated independently. Chen et al. noted that such updates result in unstable log-likelihoods, so they suggested using the exact Hessian matrix in the inner loop of the EM algorithm. However, the use of the exact Hessian matrix results in expensive computations; therefore, they proposed using the generalized Bernoulli density in the experts for multi-class classification as an approximation to the multinomial density. With this approximation, all of the off-diagonal block matrices in the Hessian matrix are zero matrices, and the parameter vectors are separable. This approximation resulted in simplified Newton-Raphson updates and required less time; however, the error rates increased due to the fact that it was merely an approximation.

Following this study, Ng and McLachlan [77] ran several experiments to show that the convergence of the IRLS algorithm is stable if the learning rate is kept small enough, and the log-likelihood is monotonically increasing even though the assumption of independence is incorrect. However, they also suggested using the expectation-conditional maximization (ECM) algorithm, with which the parameter vectors can be estimated separately. The ECM algorithm basically suggests learning the parameter vectors one by one, and using the updated parameters while learning the next parameter vector. In doing so, the maximizations are over smaller dimensional parameter spaces and are simpler than a full maximization, and the convergence property of the EM algorithm is kept (at least on the subsets). In 2007, Ng and McLachlan [78] presented an ME network for binary classification, in which the interdependency between the hierarchical data was taken into account by incorporating a random effects term into the experts and the gates.

The random effects term in an expert provided information as to whether there is a significant difference in local outputs from each expert network, and it was shown to increase the classification rates.

Recently, ME has found interest in classification studies that can benefit from the existence of sub-clusters within the data. In a study by Titsias and Likas [79], the ME classification formulation was written more explicitly, such that the probability of selecting an expert and the probability of selecting the sub-cluster of a class within an expert were written separately as:

  l(\Theta; D; Z) = \sum_{k=1}^{K} \sum_{x \in X_k} \sum_{i=1}^{I} z_i(x) { \log g_i P(C_k | i) p(x | C_k, i, \theta_{ki}) }    (3-12)

where p(x | C_k, i, \theta_{ki}) is the Gaussian model of a sub-cluster corresponding to class C_k, and \theta_{ki} are the corresponding parameters.

In a 2008 study by Xing and Hu [37], unsupervised clustering was used to initialize the ME model. In the first stage, a fuzzy C-means (FCM)-like algorithm was used for partitioning all the unlabeled data into several clusters, and a small fraction of these samples from the cluster centers were chosen as training data. In the second stage, several parallel two-class MEs were trained with the corresponding two-class training datasets. Finally, given a test datum, its class label was determined as the largest vote of the MEs.

In using clustering approaches for the initialization of ME, a good cluster can be a good initialization to the gate and speed up the training significantly. On the other hand, a strong clustering may lead to an unnecessary number of experts, or lead to the over-training of the classifier. It might also force the ME to a local optimum. Therefore, it would be interesting to see the effect of initialization with clustering on the ME model. Another interesting study would be to see the performance of ME for a K-class problem where K > 2 and compare it to the \binom{K}{2} comparisons of 2-class MEs and the decision from their popular vote.

3.3 Modifications to ME to Handle Sequential Data

The original formulation of ME was for static data, and was based on the statistical independence assumption of the training pairs. Hence, it did not have a formulation to handle causal dependencies. To overcome this problem, several studies extended ME for time-series data.

In the past decade, ME has been applied to regression of time-series data in a variety of applications. It was found to be a good fit in such applications where the time-series data is non-stationary, meaning the time series switch their dynamics in different regions, and it is difficult for a single model to capture the entire dynamics of the data. For example, Weigend et al. [12] used ME to predict the daily electricity demand of France, which switches among regimes depending on the weather, seasons, holidays, and workdays, establishing daily, seasonal and yearly patterns. Similarly, Lu [39] used ME for climate prediction because the time series dynamics switch between different regions. For such data, ME showed success in finding both the decomposition and the piecewise solution in parallel. In most of the following papers, the main idea is that the gating divides the data into contexts (regimes), and the experts learn local models for each context.

For time-series data, Zeevi et al. [80] and Wong et al. [81] generalized the autoregressive models for time series data by combining them with an ME structure. In the speech-processing community, Chen et al. [11, 74, 82] used HME for text-dependent speaker identification. In [74] the features were calculated from a window of utterances to introduce some temporal information to the solution. In [11] a modified HME structure was introduced. In the modified HME, a new gating network was added to make use of the transitional spectral information while the original HME architecture dealt with instantaneous information.

The new gating network, called the S-gating network, was placed at the top of the tree and, given a paired observation (X, y) with X = x_1, ..., x_T, the probabilistic model was modified as:

  P(y | X, \Theta) = \sum_{t=1}^{T} \lambda_X(x_t, \theta_\lambda) \sum_i g_i(x_t, v_i) \sum_j g_{j|i}(x_t, v_{ij}) P(y | x_t, \theta_{ij})    (3-13)

with

  \lambda_X(x_t) = \frac{P(x_t | \theta_\lambda)}{\sum_{s=1}^{T} P(x_s | \theta_\lambda)}    (3-14)

where P(x_t | \theta_\lambda) is a Gaussian distribution. Then, for a speaker identification system of population K, they selected the unknown speaker k that gives the highest regression probability out of the K models. Using this model, Chen et al. modified the EM-update equations and solved for the parameters of the top-most Gaussian gate analytically, whereas the rest of the parameters for the expert and gating networks were found iteratively. With this work, HME gained the capability to handle sequences of observations, but the experts and gating networks (except the extra gate) were still linear models.

For non-stationary data, Cacciatore and Nowlan [83] suggested using recurrence in the gating net, such that the input to the gating network was the ratio of the outputs at the two preceding time steps. Weigend et al. [12] developed gated ME to handle time-series data that switches regimes. A gating network in the form of a multilayer perceptron combines the outputs of the neural network experts. Hence, while the gate discovers the hidden regimes, the experts learn to predict the next observed value. This gated ME approach was extended by Coelho et al. [14], where training was accomplished using genetic algorithms instead of gradient descent. A similar idea to detect switching regimes was also visited by Liehr et al. [84], where the gating was a transition matrix, and the experts were Gaussian functions.

Most of these ME models that handle time-series data on regression use a one-step-ahead or multi-step-ahead prediction, in which the last d values of the time-series data are used as a feature of d dimensions in a neural network. The benefit of using a sliding window type technique is that a sequential supervised learning problem can be converted into a classical supervised learning problem. However, these models cannot handle data with varying length, and the use of multilayer network type approaches prevents them from completely describing the temporal properties of time-series data. Such problems were discussed by Dietterich in [85], for predicting the pronunciation of English words from spelling. As an example, Dietterich noted that the only difference between the words 'thought' and 'though' is the final 't', yet it influences the pronunciation of the initial 'th'. Hence, a small window might cause the model to miss the longer-range interactions, or a large window might consider features that are not really relevant. Therefore, it is of interest to develop models that can fully use the information from time-series or sequential data by using models such as the hidden Markov models, which we show in Chapter 6.
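The sliding-window conversion discussed above is simple to state in code: the last d values become a fixed-length feature vector and the next value becomes the target, so sequence prediction reduces to ordinary supervised learning. The window length and the toy series are illustrative assumptions.

    import numpy as np

    def sliding_window(series, d=3):
        """Turn a 1-D series into (N, d) features and (N,) next-step targets."""
        s = np.asarray(series, dtype=float)
        X = np.stack([s[t - d:t] for t in range(d, len(s))])
        y = s[d:]
        return X, y

    X, y = sliding_window(np.sin(np.linspace(0, 6, 50)))
    print(X.shape, y.shape)                      # (47, 3) (47,)

Note the limitation the text points out: every example must share the same window length d, which is exactly what makes variable-length sequences awkward for this family of models.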

In a number of studies by Wang et al. [16] and by Li [41], the experts are the states of a hidden Markov model (HMM), and the goal is to estimate the HMM parameters. The architecture introduced by Wang et al. [16] was named the time-line hidden Markov experts, and it segments the trajectory into several states and learns each of them using a local expert. The learning of the time-line HMM is to estimate the parameters of the HMM for observing the training samples. In time-line hidden Markov experts, transition probabilities are also time-varying and designed to be conditioned on the input. The local experts can be some sort of a regression model depending on the particular modeling case. The time-line HMM combines the experts using the state probabilities of the HMM, which are determined by the previous state of the HMM and the state transition probabilities simulated by the state transition network.

Similarly, the probability density estimation of an HMM has been mimicked using ME systems in [15, 86-89]. Zhao et al. [86] mimicked an HMM by using an HME system in which hierarchically organized, relatively small neural networks were trained to perform the probability density estimation. Meila et al. [87] implemented the transition matrix with a bank of gating networks, and the observation probability densities were modified to be centered around the experts. Bengio et al. [15] extended the ME model in a structure named an input-output HMM (IOHMM), with a linear feedback loop and two sets of experts. The first set of experts compute the output given the current state and input, whereas the second set of experts make use of the feedback loop and compute the next state distribution given the input and the previous state (i.e. the state transition probabilities). The learning is carried out with an EM algorithm, where the update equations are similar to the forward backward recursions of HMM. A nice feature of this study is that the transition probabilities are time-varying and depend on the input to the system. However, the limitation of both the gated ME and the IOHMM is that the experts are multilayer perceptrons, which may not always be appropriate for time-series modeling.

Fritsch et al. [88] developed a neural network based system for estimating emission probabilities in an HMM system, such that the neural network has as many output nodes as there are HMM states, and the neural network produces estimates of posterior state probabilities. The outputs of these neural networks are then combined with HME.

To remove the independent and identically distributed (i.i.d.) assumption of the data that was necessary in the original HME model, and to find a model appropriate for time-series data, Jordan et al. [90] described a decision tree with Markov temporal structure named a hidden Markov decision tree (HMDT), in which each decision in the decision tree is dependent on the decision taken at the node at the previous step. The result was an HMM in which the state at each moment in time was factorized, and the factors are coupled to form a decision tree. This model is an effort to combine adaptive graphical probabilistic models such as the HMM, HME, IOHMM and factorial HMM [91] in a variational study.

The benefit of these approaches of modeling an HMM with HME is that the restrictions on HMM because of the Markov property can be relaxed, or one can replace the observation distributions with more complex distributions. However, the reason such approaches did not become as popular as HMMs might be the need to learn more parameters and to find elegant training algorithms. On the other hand, in none of these studies has HME been used for the multi-class classification of sequential data. Therefore, in Chapter 6, we develop the mixture of HMM experts model, and provide its derivation and the training algorithm.

3.4 Comparison to Popular Algorithms

A comparison of ME with some of the popular models was given at the beginning of the chapter. In this section, we extend these comparisons to more models.

HME can be regarded as a statistical approach to decision tree modeling where the decisions are treated as hidden multinomial random variables. Therefore, in comparison to decision trees such as CART [92], HME uses soft boundaries, and allows us to develop inference procedures, measures of uncertainty, and Bayesian approaches. Also, it was discussed by Haykin [59] that an HME can recover from a poor decision somewhere further up the tree, whereas a standard decision tree suffers from a greediness problem, and gets stuck once a decision is made.

HME bears a resemblance to the boosting algorithm [93, 94] in that the weak classifiers are combined to create a single strong classifier. ME finds the subsets of patterns that naturally exist within the data, and learns these easier subsets to solve the bigger problem. In boosting, on the other hand, each classifier becomes an expert on difficult patterns on which other classifiers make an error. Hence, the mixture coefficient in boosting depends on the classification error and provides a linear combination, whereas the mixture coefficient (the gate) in the HME depends on the input and makes a probabilistic combination of experts. This difference in mixing makes the training of these two algorithms significantly different, such that, in boosting, the classifiers are trained sequentially based on the data that was filtered by the previously trained networks.

Once the training data is specified, the boosting networks are learned independently, and combined with a mixture coefficient that is learned from the classification error. In HME, the experts compete with each other for the right to learn particular patterns; hence, all the experts are updated at each iteration depending on the gate's selection. In an effort to incorporate boosting into ME, Waterhouse and Cook [95] initialized a split of the training set learned from boosting to different experts. In boosted ME, Avnimelech and Intrator [61] added the experts one by one, and trained the new experts with the data on which the other experts were not confident. The gating function in this case was a function of the confidence. One can think of these boosted ME algorithms as a smart preprocessing of the data. Another study on preprocessing was published by Tang et al. [56], where self organizing maps (SOM) were used, in which, as opposed to feeding all the data into the expert networks, only the local regions of the input space found by SOM were assigned to the individual experts. Even though the performance was reported to get better for the dataset in this study, hard clustering of the data prior to ME training may restrict the performance of ME for other datasets.

The multivariate adaptive regression splines (MARS) model partitions the input space into overlapping regions and fits a univariate spline to the training data in each region. The same mixture coefficient argument also applies to the MARS model [96], which has the equational form of a sum of weighted splines. In comparison to the latent variables of HME, MARS defines the states by the proximity of the observed variables. On the other hand, the Bayesian MARS is non-parametric, and requires sampling methods. It was found that HME requires less memory as it is a parametric regression approach, and the variational Bayesian inference for HME converges faster in time [51].

In comparison to neural networks (NN), Nowlan and Hinton [97] showed that when dealing with relatively small training sets, ME was better at generalizing than a comparable single back-propagation network on a vowel recognition task. HME was shown to learn much faster than back-propagation for the same number of parameters by Jordan and Jacobs [98]; however, it is questionable if this was a good comparison since the NNs were trained using a gradient-descent algorithm, whereas HME was trained using second order methods.


HME was shown to learn much faster than back-propagation for the same number of parameters by Jordan and Jacobs [98]; however, it is questionable whether this was a fair comparison, since the NNs were trained using a gradient-descent algorithm, whereas the HME was trained using second-order methods. Additionally, HME provides insightful and interpretable results that NNs do not provide. In terms of degree-of-approximation bounds, NNs and ME were found to be equivalent by Zeevi et al. [63]. Other models of ME include the max-min propagation neural network by Estevez et al. [99], where the softmax function was replaced with max (min) units, and the model by Lima et al. [35], where neural networks were used at the gate and support vector machines at the experts.

3.5 Experimental Results on Landmine Data with the Original ME Model

In this section, the results of the original ME model [3, 4] are presented on the landmine data. For landmine detection, it is convenient to present the experimental results using receiver operating characteristic (ROC) curves. The ROC curve plots the probability of detection (PD) vs. the probability of false alarm (PFA), which are defined as PD = Hit/(Hit + Miss) and PFA = FA/(FA + Reject), where the hit, miss, false alarm (FA) and reject options are defined in Table 3-1.

Table 3-1. Naming conventions for receiver operating characteristic (ROC) curves

Algorithm \ Actual    Mine    Non-mine
Mine                  Hit     False alarm
Non-mine              Miss    Reject
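As a concrete illustration of these definitions, the sketch below computes PD and PFA from raw decision counts by sweeping a threshold over mine-confidence values to trace an ROC curve. It is a minimal sketch of the formulas above, not the scoring code used in these experiments; the variable names are illustrative.

import numpy as np

def roc_curve(confidences, labels):
    """Trace (PFA, PD) pairs by sweeping a threshold over mine confidences.

    confidences : array of mine-confidence scores, one per object
    labels      : array of ground-truth labels, 1 = mine, 0 = non-mine
    """
    pd_list, pfa_list = [], []
    for t in np.sort(confidences)[::-1]:          # threshold from high to low
        declared_mine = confidences >= t
        hit    = np.sum(declared_mine & (labels == 1))
        miss   = np.sum(~declared_mine & (labels == 1))
        fa     = np.sum(declared_mine & (labels == 0))
        reject = np.sum(~declared_mine & (labels == 0))
        pd_list.append(hit / (hit + miss))        # PD = Hit / (Hit + Miss)
        pfa_list.append(fa / (fa + reject))       # PFA = FA / (FA + Reject)
    return np.array(pfa_list), np.array(pd_list)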


The data for the experiments were collected in August 2007 from two geographically separate test sites that also have different soil properties. Combined, the two sites had a total of 11 high metal anti-tank mines, 49 low metal anti-tank mines, 30 high metal anti-personnel mines, and 66 low metal anti-personnel mines, as well as 90 high metallic clutter objects, 28 medium metallic clutter objects, 28 low metallic clutter objects, 143 non-metallic clutter objects, and blanks. At each site, there were two collections; the first collection was performed in the northwest (NW) direction, and the second collection was done in the southeast (SE) direction. The samples from both of these collections were combined with a special requirement: if an object was included in the training, it was included using the data from both collections (hence both the NW and the SE collections of the same object went into training), and neither was included in the testing. More on this dataset was given in Sec. 2.1.

The experiments were trained and tested with 10-fold cross-validation; i.e., the dataset was divided into 10 subsets, 9 subsets were used for training and 1 subset was used for testing, and this validation was continued until all the subsets were tested. Eight classes were trained, corresponding to the HMAT, HMAP, LMAT, and LMAP mines as well as the different kinds of clutter (HMC, LMC, NMC) and blank grid squares. An HME architecture of depth two with eight experts was used. To obtain a final mine confidence value, the confidences from the first four (mine) classes were summed. To obtain a final non-mine confidence value, the confidences from the four non-mine classes were summed, then converted to a mine confidence by subtracting from 1.

The two WEMI metal detector classifiers, namely the angle prototype match and the angle model based K-NN, gave 90/18 and 90/30 detection rates, respectively, as shown in Fig. 3-3. The two GPR classifiers, namely the spectral LPP and the EHD, gave detection rates of 90/50 and 90/51, respectively. When the HME algorithm was used, the detection rates increased to 90/10, outperforming the single use of any of the algorithms. When the HME algorithm is run using just the EHD feature from the GPR data and the angle model feature from the WEMI data, the regions picked by the HME for all classes are displayed in Fig. 3-4. By finding the maximum confidence values of all these regions, the dominant classes are displayed in Fig. 3-4(A). This result is in line with what an expert would expect for each class of objects. Hence, instead of having just a confidence value, such a graph gives the technician the opportunity to visually understand where and why a confidence value was generated.


Figure 3-3. ROC curves on landmine data with 10-fold cross-validation. The two WEMI metal detector classifiers, namely the angle prototype match and the angle model based k-NN, gave 90/18 and 90/30 detection rates, respectively. The two GPR classifiers, namely the spectral LPP and the EHD, gave detection rates of 90/50 and 90/51, respectively. When the HME algorithm was used, the detection rates increased to 90/10, outperforming any single use of the algorithms.

In doing so, (s)he may choose to accept a low-confidence object in the low metal anti-personnel region as a mine, as those are the hardest mines to detect.

In summary, the ME and HME models combine simple expert models with mixing weights, called gating functions, that depend on the input. Learning the ME (or HME) means assigning the experts to various regions, and training each to best adapt to its region. HMEs find the true segmentation when there is one, and they come up with good segmentations when there is none. Thus, they were found to be very useful tools by the landmine detection community, as published in [45]. With the ME model, a general decision can be made about which category a landmine falls into.


Figure 3-4. The dominant and the soft regions of each class found from the HME using the EHD and angle model features. The maximum of the soft regions is taken as the dominant class, and displayed in the top-leftmost figure.

Hence, some future work that would be useful to the landmine detection community is to design and run algorithms specifically trained to detect different kinds of mines.

There are two shortcomings to this approach. The first shortcoming is that the EM training may lead to over-fitting. To overcome this problem, a variational ME that regularizes ME and avoids over-fitting is introduced in Chapter 4. The second shortcoming is that the ME works on static features only and cannot handle time-series data. In the current landmine-detection setting, the object was assumed to be under the radar, so static features were extracted from the region of interest. However, such a setting does not make use of the surrounding information, and it ignores the time-series nature of data collected from a moving radar. To overcome these issues, a mixture of HMM experts algorithm is developed in Chapter 6.


CHAPTER 4
VARIATIONAL MIXTURE OF EXPERTS FOR CLASSIFICATION (VMEC)

Variational methods, also called ensemble methods, variational Bayes, or variational free energy minimization, are a technique for approximating a complicated posterior probability distribution P(w | D) by a simpler ensemble Q(w, \theta). The key to this approach is that, as opposed to the traditional approaches where the parameter w is optimized to find the mode of the distribution, variational methods define an approximating probability distribution over the parameters, Q(w, \theta), and optimize this distribution by varying \theta so that it approximates the posterior distribution of the parameters P(w | D) well [100]. Hence, instead of point estimates for w representing the mode of a distribution, variational methods produce an estimate of the whole distribution.

The mixture of experts (ME) training is accomplished through expectation-maximization (EM); however, EM training is sensitive to initialization and does not regularize the parameters or use any prior information. Thus it is prone to over-training. To address some of these problems, a variational mixture of experts (VME) was proposed in [6, 10] for regression, and used in [32, 33] for 3D motion tracking. The similarities of the classification and regression algorithms were discussed in [9], although a clear framework was not given for classification. Also, the previous algorithms used the number of iterations as a stopping condition, which may lead to premature termination of the algorithm or to unnecessary training. In this study, a complete learning algorithm for variational mixture of experts classification (VMEC) is introduced.

To state more explicitly how our approach builds on the work by Waterhouse [9]: in a K-class classification, rather than a single weight vector, there are K weight vectors for each expert. Thus, instead of a single hyperparameter, our VMEC model assumes K hyperparameters for each expert. Similarly, instead of assuming a single distribution per expert, there are K distributions per expert in VMEC.


In addition, a distribution over the hyperparameters was not assumed in [9], but such an assumption is necessary to calculate the lower bound, which provides an excellent stopping criterion that counters over-training. With these assumptions, the joint distribution was modified to be a product of the aforementioned distributions, and the lower bound was derived using this modified joint distribution. Thus, in this chapter, (1) a complete learning algorithm for VMEC is given, (2) a variational lower bound is derived for VMEC, and (3) the efficacy of the approach is presented on synthetic data as well as on landmine data, in comparison to the ME trained with EM. In the rest of the chapter, ME indicates the original mixture of experts model trained with the EM algorithm.

4.1 VMEC Model

Following the same notation as the ME algorithm in Chapter 3, we let \Theta = \{\{v_i, w_{ik}\}_{i=1}^{I}\}_{k=1}^{K} denote the gate and expert parameters, and place Gaussian priors with precisions \Psi = \{\{\beta_i, \alpha_{ik}\}_{i=1}^{I}\}_{k=1}^{K} on the parameters of the experts and the gates:

    P(w | \alpha) = \prod_{i,k} P(w_{ik} | \alpha_{ik}) = \prod_{i,k} N(w_{ik} | 0, \alpha_{ik}^{-1} I),
    P(v | \beta) = \prod_{i} P(v_i | \beta_i) = \prod_{i} N(v_i | 0, \beta_i^{-1} I).

We also assume that the hyperparameters are Gamma distributed:

    P(\beta) = \prod_{i} P(\beta_i) = \prod_{i} \mathrm{Gam}(\beta_i | c_0, d_0),
    P(\alpha) = \prod_{i,k} P(\alpha_{ik}) = \prod_{i,k} \mathrm{Gam}(\alpha_{ik} | a_0, b_0).

The dependence of the hyperparameters and the parameters is illustrated in Fig. 4-1. Using the distributions of the hyperparameters, the joint distribution can be written as:

    P(\Theta, \Psi, Z, D) = P(Y, Z | w, v) \, P(w | \alpha) \, P(\alpha) \, P(v | \beta) \, P(\beta).    (4-1)


Figure 4-1. Directed acyclic graph representing the VMEC model, showing the dependence of the hyperparameters and the parameters.

In the variational approach, the goal is to find the distribution Q that best approximates the posterior distribution, so the evidence P(D) is decomposed using

    \log P(D) = L(Q) + \mathrm{KL}(Q \| P),    (4-2)

where L is the lower bound

    L(Q) = \int Q(\Theta, \Psi, Z) \log \frac{P(\Theta, \Psi, Z, D)}{Q(\Theta, \Psi, Z)} \, d\Theta \, d\Psi \, dZ,    (4-3)

and KL is the Kullback-Leibler divergence

    \mathrm{KL}(Q \| P) = -\int Q(\Theta, \Psi, Z) \log \frac{P(\Theta, \Psi, Z | D)}{Q(\Theta, \Psi, Z)} \, d\Theta \, d\Psi \, dZ.    (4-4)

The Q distribution minimizes the KL-divergence; however, working on the KL-divergence directly would be intractable, so we maximize the lower bound L [101]. We assume the approximating distribution factorizes as:

    Q(\Theta, \Psi, Z) = Q(Z) \prod_i Q(v_i) Q(\beta_i) \prod_k Q(w_{ik}) Q(\alpha_{ik}).    (4-5)

Plugging the joint distribution (Eq. 4-1) and the Q distribution (Eq. 4-5) into the lower bound equation (Eq. 4-3), and taking the expectations with respect to all the other variables, we obtain the Q distributions as:


    Q(w_{ik}) = N(w_{ik} | \bar{w}_{ik}, A_{w_{ik}}),    (4-6)
    Q(\alpha_{ik}) = \mathrm{Gam}(\alpha_{ik} | a_p, b_p),    (4-7)
    Q(v_i) = N(v_i | \bar{v}_i, A_{v_i}),    (4-8)
    Q(\beta_i) = \mathrm{Gam}(\beta_i | c_p, d_p),    (4-9)
    Q(Z) = \prod_{n=1}^{N} \prod_{i=1}^{I} \left( h_i^{(n)} \right)^{z_i^{(n)}}.    (4-10)

Here \bar{w}_{ik} and \bar{v}_i are the means of the Gaussians, and they are found using Newton-Raphson updates. The covariance matrices A_{w_{ik}} and A_{v_i} are the inverses of the negative Hessian matrices, A_{w_{ik}} = -H_w^{-1} and A_{v_i} = -H_v^{-1}, which are explained below. For a learning rate \rho, the expert parameters are found by

    w_{ik}^{(p+1)} = w_{ik}^{(p)} - \rho H_w^{-1} G_w,    (4-11)

where

    G_w = \sum_n h_i^{(n)} \left( y_k^{(n)} - \hat{y}_{ik}^{(n)} \right) x^{(n)} - \bar{\alpha}_{ik} w_{ik},    (4-12)
    H_w = -\sum_n h_i^{(n)} \hat{y}_{ik}^{(n)} \left( 1 - \hat{y}_{ik}^{(n)} \right) x^{(n)} (x^{(n)})^T - \bar{\alpha}_{ik} I.    (4-13)

Similarly, the updates to the gating parameters follow

    v_i^{(p+1)} = v_i^{(p)} - \rho H_v^{-1} G_v,    (4-14)

where

    G_v = \sum_n \left( h_i^{(n)} - g_i^{(n)} \right) x^{(n)} - \bar{\beta}_i v_i,    (4-15)
    H_v = -\sum_n g_i^{(n)} \left( 1 - g_i^{(n)} \right) x^{(n)} (x^{(n)})^T - \bar{\beta}_i I.    (4-16)
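To make these updates concrete, the sketch below implements one penalized Newton-Raphson step for a single expert's class-k weight vector, following Eqs. 4-11 through 4-13, together with the Gamma hyperparameter update that follows in the text (Eqs. 4-17 to 4-19). It is a minimal sketch under assumed shapes (N samples, d features); the names and the default learning rate are illustrative, not taken from the experiments.

import numpy as np

def expert_newton_step(w, X, y_k, y_hat, h, alpha_bar, rho=1.0):
    """One Newton-Raphson update of one expert's weights (Eqs. 4-11 to 4-13).

    w         : (d,)   current weights w_ik
    X         : (N, d) input vectors x^(n)
    y_k       : (N,)   targets y_k^(n) in {0, 1}
    y_hat     : (N,)   expert outputs yhat_ik^(n)
    h         : (N,)   posterior responsibilities h_i^(n)
    alpha_bar : scalar posterior mean of the precision alpha_ik
    """
    d = w.shape[0]
    # G_w = sum_n h (y - yhat) x  -  alpha_bar * w
    G = X.T @ (h * (y_k - y_hat)) - alpha_bar * w
    # H_w = -sum_n h yhat (1 - yhat) x x^T  -  alpha_bar * I
    H = -(X * (h * y_hat * (1 - y_hat))[:, None]).T @ X - alpha_bar * np.eye(d)
    w_new = w - rho * np.linalg.solve(H, G)       # w - rho H^{-1} G
    A = np.linalg.inv(-H)                         # covariance A_w = -H^{-1}
    return w_new, A

def update_alpha(w_bar, A, a0, b0):
    """Hyperparameter update, alpha_bar = a_p / b_p (Eqs. 4-17 to 4-19, below)."""
    d = w_bar.shape[0]
    a_p = a0 + d / 2.0
    b_p = b0 + 0.5 * (w_bar @ w_bar + np.trace(A))
    return a_p / b_p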


The Newton-Raphson updates are continued in a loop until the P(Y, Z | w, v) updates converge; the parameters found at the last iteration are taken to be \bar{w}_{ik} and \bar{v}_i, with A_{w_{ik}} and A_{v_i} as their covariance matrices. The updates for the expert hyper-hyperparameters are

    a_p = a_0 + \frac{d}{2},    (4-17)
    b_p = b_0 + \frac{1}{2} \left( \bar{w}_{ik}^T \bar{w}_{ik} + \mathrm{Trace}(A_{w_{ik}}) \right),    (4-18)

and similar equations apply for the gate hyper-hyperparameters c_p and d_p. As a result, the hyperparameter updates become

    \alpha_{ik}^{(p+1)} = a_p / b_p,    (4-19)

and

    \beta_i^{(p+1)} = c_p / d_p.    (4-20)

4.1.1 VMEC Lower Bound

Parameter updates are continued until the change in the lower bound falls below a given small number (10^{-5} in our case), and the lower bound provides a test of correctness, as it is supposed to be nondecreasing at each parameter re-estimation. Expanding the integral and evaluating the expectations, we arrive at the closed-form solution for the lower bound as:


    L(Q) = \int Q(\Theta, \Psi, Z) \log \frac{P(\Theta, \Psi, Z, D)}{Q(\Theta, \Psi, Z)} \, d\Theta \, d\Psi \, dZ
         = \sum_{n,i} E_{Z,v,w}[\log P(Y, Z | X, v, w)]
           + \sum_i E[\log P(v_i | \beta_i)] + \sum_i E[\log P(\beta_i)]
           + \sum_{i,k} E[\log P(w_{ik} | \alpha_{ik})] + \sum_{i,k} E[\log P(\alpha_{ik})]
           - \sum_i E[\log Q(v_i)] - \sum_i E[\log Q(\beta_i)]
           - \sum_{i,k} E[\log Q(w_{ik})] - \sum_{i,k} E[\log Q(\alpha_{ik})] - E[\log Q(Z)],

where

    E_{w,\alpha}[\log P(w_{ik} | \alpha_{ik})] = \frac{d}{2} \left[ \psi(a_p) - \log b_p \right] - \frac{d}{2} \log(2\pi) - \frac{a_p}{2 b_p} \left[ \bar{w}_{ik}^T \bar{w}_{ik} + \mathrm{Trace}(A_{w_{ik}}) \right],    (4-21)
    E[\log P(\alpha_{ik})] = a_0 \log b_0 - b_0 (a_p / b_p) - \log \Gamma(a_0) + (a_0 - 1) \left[ \psi(a_p) - \log b_p \right],    (4-22)
    E[\log Q(\alpha_{ik})] = -\log \Gamma(a_p) + (a_p - 1) \psi(a_p) + \log b_p - a_p,    (4-23)
    E_w[\log Q(w_{ik})] = -\frac{1}{2} \log |A_{w_{ik}}| - \frac{d}{2} \left( \log(2\pi) + 1 \right),    (4-24)
    E_Z[\log Q(Z)] = \sum_n \sum_i h_i^{(n)} \log h_i^{(n)},    (4-25)
    E_{Z,v}[\log P(Z | v_i)] = \sum_n h_i^{(n)} \log \bar{g}_i^{(n)},    (4-26)
    E_{Z,w}[\log P(Y | Z, w_{ik})] = \sum_n h_i^{(n)} \log \bar{P}_i(y^{(n)}).    (4-27)

Here \psi is the digamma function. Expressions for the gate, E[\log P(\beta_i)], E[\log Q(\beta_i)], E_{v,\beta}[\log P(v_i | \beta_i)], and E_v[\log Q(v_i)], are similar to those of the experts. The training algorithm of the VMEC model updates the parameters through the E and M steps until the change in the lower bound becomes less than a threshold (10^{-5} in our case).

4.1.2 Training the VMEC Model

1. For a 1-of-K class problem, initialize the number of experts I, the parameters, and the hyperparameters.

2. E step: Compute the expert and gating outputs \hat{y}_{ik}^{(n)} and g_i^{(n)}, as well as the expert probabilities P_i(y^{(n)}) and the posterior probabilities h_i^{(n)}.


3. M step: Compute the new expert parameters w_{ik}^{(p+1)} and the new gating parameters v_i^{(p+1)} using the Newton-Raphson updates.

4. Update the hyperparameters \alpha_{ik}^{(p+1)} and \beta_i^{(p+1)}.

5. Check the convergence of the lower bound: go to Step 2 if L^{(p+1)} - L^{(p)} > 10^{-5}; else terminate.
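A minimal driver loop for these five steps might look like the following sketch. The functions initialize, e_step, m_step, update_hyperparameters, and lower_bound are hypothetical placeholders for the computations in Eqs. 4-6 through 4-27, shown only to make the control flow and the stopping criterion explicit.

def train_vmec(data, n_experts, max_iter=500, tol=1e-5):
    """Skeleton of the VMEC training loop (Steps 1-5 above)."""
    params, hyper = initialize(data, n_experts)          # Step 1
    L_old = -float("inf")
    for _ in range(max_iter):
        expectations = e_step(data, params)              # Step 2
        params = m_step(data, expectations, hyper)       # Step 3 (Newton-Raphson)
        hyper = update_hyperparameters(params)           # Step 4
        L_new = lower_bound(data, params, hyper)         # Step 5: lower bound check
        if L_new - L_old <= tol:                         # converged
            break
        L_old = L_new
    return params, hyper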


4.2 Experimental Results on Synthetic Data

Synthetic data was generated by sampling from two Gaussian distributions with standard deviation 0.1 at means (0.5, 0.7) and (0.7, 0.7), as shown in Fig. 4-2. For testing, 200 points were generated from each class. For training, the number of points was increased at each iteration, from 10 points to 60 points per class. At each iteration, a different set of training points was used.

Figure 4-2. Synthetic data for the VMEC experiment. Synthetic data were generated by sampling from two Gaussian distributions with standard deviation 0.1 at means (0.5, 0.7) and (0.7, 0.7).

When only one expert is used, VMEC and ME give similar results on both the training and the testing set, as expected. In Fig. 4-3, the classification rates are displayed for one expert. However, if the number of experts is increased, ME tends to over-carve the regions, and thus overtrains. This behavior is shown in Fig. 4-4. For five experts, ME does better than VMEC on the training set, but its classification rates on the testing set decrease because of over-training. The consistent behaviour of VMEC is explained by the fact that its gate prefers fewer experts. Ultimately, this characteristic can be used to determine the maximum number of experts required for the best classification results.

Figure 4-3. VMEC and ME comparison on synthetic data for one expert. A) Number of correct classifications on training data with one expert. B) Number of correct classifications on test data with one expert. The number of training data per class is increased from 10 to 55 in 5-point increments. At each experiment, a different dataset was generated. For varying numbers of training data, when there is only one expert, VMEC and ME give similar results both on the training set and on the testing set, as expected.
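A sketch of the data generation described at the beginning of this section, with illustrative names; the per-class training size is swept from 10 to 60 in the experiments.

import numpy as np

def make_synthetic(n_per_class, rng=np.random.default_rng(0)):
    """Two classes from isotropic Gaussians (std 0.1) at (0.5, 0.7) and (0.7, 0.7)."""
    X1 = rng.normal(loc=(0.5, 0.7), scale=0.1, size=(n_per_class, 2))
    X2 = rng.normal(loc=(0.7, 0.7), scale=0.1, size=(n_per_class, 2))
    X = np.vstack([X1, X2])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y

X_test, y_test = make_synthetic(200)   # 200 test points per class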


Figure 4-4. VMEC results for five experts. A) Number of correct classifications on training data. B) Number of correct classifications on test data. The number of training data per class is increased from 10 to 55 in 5-point increments. When the number of experts is increased, ME does a better job of classification than VMEC on the training set, but does worse on the testing set. VMEC avoids over-training, and keeps giving consistent classification rates.

4.3 Experimental Results on the Landmine Dataset

For the landmine dataset, the two classifiers, GRANMA and EHD, were described in Sec. 2.1 and in [21, 25, 45]. Using the confidences obtained from these two classifiers, the VMEC and ME models were run five times using five classes, five experts, and ten-fold cross-validation.


Table 4-1. Probability of false alarm (PFA) at 90% probability of detection (PD)

Experiment    1       2       3       4       5       Average
ME            0.13    0.16    0.15    0.18    0.15    0.1540 +/- 0.0182
VMEC          0.11    0.11    0.10    0.13    0.12    0.1140 +/- 0.0114

The five classes represent the high metal mines (HMAT and HMAP), LMAT, LMAP, metallic clutter, and non-metallic clutter. The number of experts was found experimentally: if it is increased, the ME model tends to over-fit, and it either carves the input space unnecessarily, or learns the same expert, also unnecessarily. On average, the VMEC model decreased the probability of false alarm at 90% PD to 11.4% from the 15.4% of ME. The individual performances of the ME and the VMEC models are given in Table 4-1 for each experiment on the mine data. In Fig. 4-5, ROC curves zoomed around 90% PD are displayed for one experiment with five classes, five experts, and ten-fold cross-validation. In Fig. 4-6, the increase in the lower bound in one of the ten folds is displayed. The sharp increase in the lower bound corresponds to one of the hyperparameter updates.

Figure 4-5. Landmine detection results for GRANMA and EHD features. ROC curves of ME, VMEC, GRANMA and EHD. In a setting of 5 classes and 5 experts, with 10-fold cross-validation, the VMEC model consistently increases the detection rates to around 90/11.6 from the 90/16.2 of ME.


Figure 4-6. The lower bound of the VMEC training. A sudden increase in the lower bound corresponds to an update of the hyperparameters.

VMEC improved the performance over ME in these experiments with real-world data, and a part of these results was published in [46]. Since the number of experts is relatively small, the optimal number was found experimentally. For other datasets that require more experts, automated approaches as in [7, 10, 102, 103] can be investigated. Also, the comparisons were performed on networks of depth one, hence an immediate extension of this work would be to investigate the hierarchical VMEC.

On the other hand, both the VMEC and ME models work on static features, and cannot handle time-series data as features. For example, in landmine detection, the confidence values and the features that were used in the GRANMA and EHD classifiers were obtained with the assumption that the robot was already on the target, and that the target was in the center. This assumption may work fine for grid-based landmine detection efforts, but it does not use the temporal structure that is found in lane-based landmine detection data. To incorporate this temporal structure into an ME model, a new model, referred to as the mixture of hidden Markov model experts, is introduced in Chapter 6.


CHAPTER 5
FUNDAMENTALS OF HIDDEN MARKOV MODELS (HMM)

Hidden Markov models (HMMs) are major tools for analysing time-series data, and they have been extensively used in a variety of applications, including, but not limited to, speech recognition [104, 105], landmine detection [106-108], handwriting recognition [109], behavior detection [110], face recognition [111], music analysis [44], crash detection [112], and understanding genetic expression [113]. The literature on HMMs is immense, so in this chapter only the fundamentals of HMMs that relate to this study are given. Throughout this chapter, the notation of Rabiner et al. [105] will be used, where an HMM is characterized by the following:

W = number of states
M = number of observation symbols
T = length of the observation sequence
V = \{v_1, \ldots, v_M\}, the discrete set of possible observation symbols
O = O_1 O_2 \ldots O_T denotes an observation sequence, where O_t is the observation at time t, and it is one of the observation symbols in V
Q = q_1 q_2 \ldots q_T, a fixed state sequence
S = \{S_1, S_2, \ldots, S_W\}, the individual states
q_t = the state at time t
The initial state distribution \pi = \{\pi_i\}_{i=1}^{W}, where \pi_i = P(q_1 = S_i) is the probability of being in state i at time t = 1
The state transition probability A = \{\{a_{ij}\}_{i=1}^{W}\}_{j=1}^{W}, where a_{ij} = P(q_{t+1} = S_j | q_t = S_i) is the probability of being in state j at time t+1 given that we are in state i at time t
The observation symbol probability distribution B = \{\{b_j(m)\}_{j=1}^{W}\}_{m=1}^{M}, where b_j(m) = P(v_m \text{ at } t | q_t = j) is the probability of observing the symbol v_m given that we are in state j
\lambda = (A, B, \pi) is the compact notation for the complete parameter set of an HMM

There are three main problems related to HMMs:


1. Given a model, what is the probability that the observation sequence was generated from this model, P(O | \lambda)?

2. Given a model, what is the best state sequence, i.e., the best path Q to follow, that maximizes P(Q | O, \lambda)?

3. How do you find the model that best describes the observation sequences?

The answers to the first two questions are rather standard. The first problem is efficiently solved using a forward-backward algorithm, while the second can be solved using the Viterbi algorithm. The answer to the third question has received much investigation, but the most popular algorithms are the segmental K-means [114] and the Baum-Welch (BW) [105] re-estimation formulas. These two algorithms are generative, and they are applied in an attempt to increase the probability of a sequence given a model. Additionally, it has been observed that generative models are not powerful enough to distinguish between some sequences, so discriminative models have been developed in [115]. In the rest of this chapter, the fundamentals of BW learning are given in Section 5.3.1, and minimum classification error (MCE) based discriminative learning is briefly explained in Section 5.3.2. Finally, in Section 5.4, a literature review covers the studies that use HMMs to find multiple models in time-series data.

5.1 Evaluation

The straightforward way to compute P(O | \lambda) is to find P(O | Q, \lambda) and P(Q | \lambda), and to sum their product over all possible state sequences, where

    P(O | Q, \lambda) = b_{q_1}(O_1) \, b_{q_2}(O_2) \cdots b_{q_T}(O_T),    (5-1)
    P(Q | \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T},    (5-2)


and

    P(O | \lambda) = \sum_Q P(O | Q, \lambda) \, P(Q | \lambda)    (5-3)
                   = \sum_Q \pi_{q_1} b_{q_1}(O_1) \, a_{q_1 q_2} b_{q_2}(O_2) \cdots a_{q_{T-1} q_T} b_{q_T}(O_T)    (5-4)
                   = \sum_Q \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}} b_{q_t}(O_t),    (5-5)

which is the standard HMM model. To reduce the number of multiplications in computing this model, forward-backward algorithms are generally used to solve the summation inductively.

5.2 Decoding

Decoding finds the optimal state sequence associated with a given observation sequence. Of the many criteria that define what is best, the most widely used criterion is to find the single best path that maximizes P(Q | O, \lambda), which is equivalent to maximizing P(Q, O | \lambda) [105]. The Viterbi algorithm [116] inductively keeps the best state sequence ending in each of the W states as the intermediate result for the observation sequence. Hence, it gives the best paths into the W states and singles out the path with the highest probability.
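To make the evaluation and decoding problems concrete, the sketch below implements the forward recursion for P(O | \lambda) and the Viterbi recursion for the best path, for a discrete-observation HMM in the Rabiner notation above. It is a minimal illustration, not the implementation used in the later experiments.

import numpy as np

def forward(pi, A, B, O):
    """Evaluation: P(O | lambda) by the forward recursion.
    pi: (W,) initial distribution; A: (W, W) transitions;
    B: (W, M) emission probabilities b_j(m); O: list of symbol indices."""
    alpha = pi * B[:, O[0]]                       # alpha_1(j) = pi_j b_j(O_1)
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]             # induction step
    return alpha.sum()                            # P(O | lambda)

def viterbi(pi, A, B, O):
    """Decoding: the single best state path maximizing P(Q, O | lambda)."""
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, O[0]]
    psi = []
    for o in O[1:]:
        scores = delta[:, None] + logA            # scores[i, j]: reach j via i
        psi.append(scores.argmax(axis=0))         # best predecessor of each state
        delta = scores.max(axis=0) + logB[:, o]
    path = [int(delta.argmax())]
    for back in reversed(psi):                    # backtrack through predecessors
        path.append(int(back[path[-1]]))
    return list(reversed(path)), float(delta.max())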


5.3 Reestimation

Reestimation algorithms can be divided into two groups: generative and discriminative training approaches. In generative training, Viterbi [116], BW [105], and segmental K-means [114] are the most commonly used algorithms. In areas of DNA or protein-sequence modelling, where Viterbi paths play an important role, Viterbi learning may be more desirable. If all possible paths are to be considered, BW is preferred. Recently, variational training of HMMs [117-119], which offers a Bayesian way of learning an approximate posterior distribution, is becoming more common. Conversely, discriminative approaches [20, 115, 120, 121] are based on separating the two (or more) classes via methods such as the minimization of the classification error.

5.3.1 Baum-Welch (BW) Training

Maximum likelihood estimation maximizes P(O | \lambda) over all the parameters \lambda for a given observation sequence O. BW, also referred to as the expectation-maximization (EM) algorithm, iterates over the E and the M steps. In the E step, the EM algorithm calculates the expectation of the unknown (hidden) data given a probabilistic model. In the M step, it computes the maximum likelihood estimate of the complete data, where the complete data consists of both the observed data and the expectation of the hidden states. BW is the most widely used algorithm to learn the HMM parameters; it provides a good generative solution and is guaranteed to converge to a local maximum [122], but it is sensitive to initialization, and may not find the global solution.

5.3.2 Minimum Classification Error (MCE) Training

The MCE model was introduced in [115] and later extended to MCE-HMMs in [123], where HMMs are the discriminative classifiers. In MCE-HMM, the total misclassification error is written in terms of an empirical loss function, and the discrete HMM (DHMM) parameters minimizing this loss are learned using a gradient descent approach. Let \Lambda = \{\lambda_k\}_{k=1}^{K} be the set of parameters of all classes. Given a set of training observation sequences \{O^{(n)}\}, n = 1, 2, \ldots, N, the empirical loss function is defined as:

    L(\Lambda) = \sum_{n=1}^{N} \sum_{k=1}^{K} l_k(O^{(n)}; \Lambda) \, I(O^{(n)} \in K_k).    (5-6)

When there are enough training sequences, the empirical loss is an estimate of the expected loss. Minimizing the empirical loss is equivalent to minimizing the total misclassification error, and the DHMM parameters are estimated by a gradient descent on L(\Lambda). Here l_k(O; \Lambda) is a sigmoid function, with misclassifications corresponding to loss values in (0.5, 1], computed as

    l_k(O; \Lambda) = \frac{1}{1 + \exp(-\gamma d_k(O) + \theta)},    (5-7)


where d_k is a misclassification measure of the sequence O. If d_k is much less than 0, meaning a correct classification, no loss is incurred. When d_k is positive, the classification error count becomes the penalty. Generally, \theta is set to 0 and \gamma is set to a value greater than 1. The misclassification measure of a sequence for the kth class is computed as

    d_k(O) = -g_k(O, \Lambda) + \log \left[ \frac{1}{K - 1} \sum_{j, j \neq k} \exp\left( \eta \, g_j(O, \Lambda) \right) \right]^{1/\eta},    (5-8)

where \eta is a positive number and g_k(O, \Lambda) is the log of the likelihood found by the Viterbi algorithm. As \eta approaches \infty, the term in the bracket becomes \max_{j, j \neq k} g_j(O; \Lambda). So d_k(O) > 0 implies a misclassification, and d_k(O) \leq 0 means a correct decision. The log-likelihood of the Viterbi algorithm is found by

    g_k(O, \Lambda) = \log \left[ \max_Q g_k(O, Q, \Lambda) \right],    (5-9)

where

    g_k(O, Q, \Lambda) = P(O, Q; \lambda_k) = \pi_{q_0}^{(k)} \prod_{t=1}^{T-1} a_{q_t q_{t+1}}^{(k)} \prod_{t=1}^{T} b_{q_t}^{(k)}(o_t).    (5-10)

With the empirical loss as defined in Eq. 5-6, the parameters are trained with a gradient descent approach.
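A small sketch of the loss computation in Eqs. 5-7 and 5-8, assuming the per-class Viterbi log-likelihoods g_k have already been computed; the names and the default constants are illustrative.

import numpy as np

def mce_loss(g, true_class, eta=2.0, gamma=2.0, theta=0.0):
    """MCE misclassification measure and sigmoid loss for one sequence.

    g          : (K,) Viterbi log-likelihoods g_k(O, Lambda) for each class
    true_class : index k of the correct class
    """
    others = np.delete(g, true_class)
    # d_k = -g_k + (1/eta) log( (1/(K-1)) sum_{j != k} exp(eta g_j) )   (Eq. 5-8)
    d_k = -g[true_class] + np.log(np.mean(np.exp(eta * others))) / eta
    # sigmoid loss; misclassification (d_k > 0) maps to losses in (0.5, 1]  (Eq. 5-7)
    l_k = 1.0 / (1.0 + np.exp(-gamma * d_k + theta))
    return d_k, l_k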


5.4 Finding Multiple Models from Time-Series Data

Finding multiple models in time-series data dates back to the late 1980s [124], and has found applications in both clustering and classification problems, including surveillance [125, 126], fMRI analysis [127], seismology [128], speech synthesis [129], speaker adaptation [130, 131], speaker clustering [132], handwriting recognition [133], and object classification [134]. The survey in [135] lists many of the online datasets, and [136] provides a comparison of the algorithms in the data mining area. These approaches learn and generate multiple models within a group of data, and they are often unsupervised. Some of the algorithms listed below find multiple models from each class of data and, for classification, test a given time-series sequence on all the models from all the classes. If the output information is not explicitly used in the algorithm, we will refer to these approaches as clustering algorithms.

Most of the time-series clustering algorithms convert time-series data to static data using HMMs and employ traditional clustering algorithms, such as K-means or hierarchical agglomerative clustering, on this static data. One of the earliest HMM clustering papers is by Rabiner et al. [124], where an HMM model was trained on a training set, all the sequences with poor likelihood scores were separated out, and new models were learned for the outliers. The recognition rates were significantly enhanced if a structure was found in the outliers, but the performance was very sensitive to the threshold for separating the training set into clusters. A similar approach was taken by Li et al. [137] and Szamonek et al. [138], where an HMM model was fit to all the data, and the number of clusters as well as the number of states were gradually increased based on the objects with the least likelihoods and the Bayesian information criterion (BIC). Since these algorithms considered splitting every model at each iteration, redistributing the objects and relearning the HMMs, they required a lot of training in very small steps.

A faster way than the top-down approach of splitting the models is the bottom-up approach of fitting an HMM model to each and every sequence, and clustering the sequences based on the similarities of their likelihoods or HMM models. Perhaps the most popular of the earlier clustering solutions was developed by Smyth [139], where an HMM model was fit to each training sequence, resulting in N HMM models for N sequences. Then all the N training sequences were tested on all the N HMM models, and the log-likelihoods were stored in a similarity matrix, as shown in Fig. 5-1. Using this similarity matrix, the data sequences with similar likelihoods were clustered together into K groups using hierarchical clustering, as in Fig. 5-2, and a new set of K HMMs were then trained on these clusters using the BW updates [105].
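The sketch below outlines this similarity-matrix construction, assuming hypothetical helpers fit_hmm (BW training of one HMM) and log_likelihood (forward evaluation); it follows the scheme in [139] rather than any code from this study.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def smyth_cluster(sequences, n_clusters):
    """Cluster sequences via pairwise HMM log-likelihoods, as in Smyth's scheme."""
    models = [fit_hmm(s) for s in sequences]          # one HMM per sequence
    N = len(sequences)
    L = np.array([[log_likelihood(models[j], sequences[i])
                   for j in range(N)] for i in range(N)])
    S = 0.5 * (L + L.T)                               # symmetrized similarity matrix
    D = S.max() - S                                   # turn similarity into distance
    # hierarchical clustering into K groups; one BW-trained HMM per group follows
    labels = fcluster(linkage(D[np.triu_indices(N, 1)], method="complete"),
                      t=n_clusters, criterion="maxclust")
    return labels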


In the same year, a similar approach was developed by Korkmazskiy et al. [140], with K-means clustering on the log-likelihoods, but the final HMMs were learned with a discriminative HMM model [123] rather than with BW. In 1999, Oates et al. [141] constructed the similarity matrix from dynamic time warping (DTW) values instead of the log-likelihoods, then clustered it with hierarchical agglomerative clustering and selected a prototype from each cluster. An initial HMM was fit to the prototype sequence, and all the other sequences were introduced one by one into the cluster with retraining of the HMMs.

Figure 5-1. Symmetric likelihood matrix. One HMM is trained on each sequence, resulting in as many HMM models as the number of training sequences. Then each sequence is tested against all the HMMs, and the symmetrized likelihoods are stored as a matrix.

The previous approaches in [137, 139, 141] used hard clustering, where each sequence was assigned to a single cluster, and an HMM was trained only on those sequences in a cluster. In 2003, Alon et al. [125] introduced a soft clustering approach, where the expectations of the cluster memberships were updated in an EM type of learning.


Figure 5-2. Hierarchical clustering of likelihoods.

Although criticized in [142] for making strong assumptions about the shape of the overall distributions, this approach allowed for the simultaneous learning of the cluster and the mixture parameters. In 2006, for very large amounts of data, Tsoi et al. [143] proposed using Gaussian mixture models to group the data into clusters, and trained the HMMs on randomly selected subsets of data from each cluster. More recently, in order not to make assumptions about the shape of the clusters, spectral approaches were used in [142, 144]. In [144], the similarity matrix was obtained by modelling each sequence as a hidden Markov model, just as in [139], but spectral methods were applied to this similarity matrix for dimensionality reduction. With this approach, the significant eigenvectors of the affinity matrix were used to map the original samples into a lower dimensional space. These vectors were then clustered by the K-means clustering algorithm. Similarly, Porikli et al. [145] define an affinity matrix A whose elements are a_{ij} = e^{-d(s_i, s_j)/2\sigma^2}, where s_i and s_j are two sequences, and the distance is computed from d(i, j) = |p_{ij} + p_{ji} - p_{ii} - p_{jj}|, where p_{ij} refers to testing sequence s_i on the model trained from sequence s_j. The affinity matrix is then decomposed into its eigenvectors, and clusters are formed using connected component analysis. Instead of the log-likelihoods, in [142], a probability product kernel (PPK) is used to find the Gram matrix, which is followed by spectral methods, as in [144].


The PPK integrates over the entire space of all possible samples given an HMM model, and may capture the properties of the data better. Among these models, we initialize our model in the next chapter with [139]; however, any of [125, 137, 141, 142, 144, 145] could also be used.

A study closely related to our algorithm has been published by Bicego et al. [134], who found HMMs to be very effective in dealing with object occlusions, and used them for object classification in [27, 146, 147]. They also trained one HMM per sequence, but instead of clustering the similarity matrix, they treated each row of the un-symmetrized similarity matrix as a feature vector, and used these feature vectors in standard classification techniques as follows (a small sketch of this feature construction follows the list):

1. Given a set of sequences T = \{O_1, \ldots, O_N\} to be clustered, let R = \{P_1, \ldots, P_R\} be a set of R representative objects, where R \subseteq T.

2. Train one HMM \lambda_r for each sequence P_r \in R.

3. Represent each sequence O_i by the feature vector

    D_R(O_i) = \frac{1}{L_i} \left[ \log P(O_i | \lambda_1), \, \log P(O_i | \lambda_2), \, \ldots, \, \log P(O_i | \lambda_R) \right]^T,

where L_i is the length of the sequence O_i.

4. Perform clustering in Euclidean space using any clustering technique (hierarchical agglomerative complete link, K-means, etc.).
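The sketch referenced above builds the length-normalized log-likelihood features of step 3, reusing the hypothetical fit_hmm and log_likelihood helpers from the earlier sketch.

import numpy as np

def likelihood_features(sequences, representatives):
    """Bicego-style features: D_R(O_i) = (1/L_i) [log P(O_i | lambda_r)]_r."""
    models = [fit_hmm(p) for p in representatives]    # step 2: one HMM per prototype
    feats = []
    for seq in sequences:                             # step 3: length-normalized rows
        row = [log_likelihood(m, seq) / len(seq) for m in models]
        feats.append(row)
    return np.asarray(feats)                          # ready for step 4 (e.g., K-means)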


The main drawback of this approach is the high dimensionality of the resulting feature space, which is equal to the cardinality of the dataset; dimensionality reduction was therefore handled in [27] using principal component analysis (PCA) and Fisher discriminant analysis (FDA), and also using matching pursuit to select a set of representatives. In a very recent study, Garcia et al. [148] normalize the likelihood matrix so that each column adds up to one. Hence, they view these columns as probability density functions over the approximated model space, conditioned on each of the individual sequences. In doing so, they find the distance between two sequences from the KL-divergence between the two discrete probability density functions.

Although it has not been presented as clustering research, one close relative of our study is relative density nets (RDN) by Brown et al. [149]. RDN is a discriminative learner built from a network of HMM models, in which many small HMMs are placed into the input layer of a feed-forward network. The hidden layer of the network performs all pairwise comparisons between the HMMs, and the weighted sums of these pairwise comparisons are passed through a softmax function to make the final decision. An alternative to RDNs was introduced in [150]: a generative algorithm formed from products of HMMs (PoHMM). The algorithm trains the HMMs independently using BW, but uses gradient based minimization of contrastive divergence to calculate the log-likelihood of the PoHMM. Although the training of the HMMs is straightforward, the log-likelihood of the overall architecture requires the use of Gibbs sampling methods and ends up being computationally very costly.

From a different perspective, clustering can be done to find the different segments of a sequence. For a music recommendation system, Qi et al. [44] consider a mixture of HMMs approach, named the DP-HMM, in which the mixture coefficient is based on a Dirichlet process (DP) prior. The DP-HMM is used to analyse the characteristics of different segments of a given piece of music. The number of HMMs in the mixture is decided by the DP prior, and the learning of the HMM parameters is achieved using variational methods. To compute the similarity between the HMM mixture models, samples are generated from each model using Monte Carlo sampling. For a sample set S_g from the mixture model M_g for music g, and a sample set S_h from the mixture model M_h for music h, the distance between two HMM mixture models is defined as:

    D(M_g, M_h) = \frac{1}{2} \left[ \log p(S_h | M_g) - \log p(S_h | M_h) \right] + \frac{1}{2} \left[ \log p(S_g | M_h) - \log p(S_g | M_g) \right].


There are also other studies that define a distance between HMM models: based on the state-observation probability matrices for discrete [151, 152] and continuous [153] HMMs, for ergodic HMMs [154, 155], for left-to-right HMM models [156], based on the KL-divergence [157-162], based on the probability product kernel [163], and based on the stationary cumulative distribution [164]. A comparative study was published by Mohammad et al. [165]. These studies use the parameters of the HMMs, hence they assume the existence of a true HMM model. Due to this assumption, such distances did not find a lot of interest in the clustering of time sequences, for which the HMM model is unknown. However, it should be possible to incorporate these distances into clustering, in addition to comparing the likelihoods.


CHAPTER 6
MIXTURE OF HIDDEN MARKOV MODEL EXPERTS (MHMME)

Time series or sequential data often show multiple patterns owing to the different contexts in which they appear. For example, electricity usage has seasonal patterns; in addition, within each season, the electricity consumption of different socio-economic groups shows different patterns. Another example is electrocardiogram (EKG) heartbeat classification. Athletes typically have slow resting heartbeats compared to non-athletes, and without this information an athlete might be considered to have an unhealthy beat. Therefore, the socio-economic group and even the ethnicity of people define a context in classifying a healthy vs. an unhealthy beat. Unfortunately, unlike in these examples, contexts are generally hard to define, they are often interlaced, and they do not have sharp boundaries. Moreover, context information might be inherent in the data, but not known to the data modeler. In such cases, we define a context as a group of similar signatures.

In this chapter, we introduce a novel model, the mixture of hidden Markov model experts (MHMME), that can both decompose time series data into multiple contexts and learn expert classifiers for each context. In this model, a gate of hidden Markov models defines the contexts, and cooperates with a set of hidden Markov model experts which provide multi-class classification. The main advantages of our model can be summarized as follows:

- The MHMME model is suitable for time series data and sequential data of varying lengths due to the use of hidden Markov models.
- The MHMME model provides a divide-and-conquer approach, is probabilistic, and has soft boundaries, all of which make it very suitable for context learning.
- There is no hard clustering of the data, so sequences can freely move between contexts and classifiers during training. Moreover, the final decision is a mixture of the decisions of all the classifiers.
- The learning of the contexts and the classifiers is accomplished simultaneously, in one model.
- MHMME is suitable for multi-class classification.


The MHMME model extends the ME model to be suitable for time series data by the use of HMMs. Thus the learning algorithm is completely modified, but the essence of ME is retained. While the HMMs at the gate make soft partitions of the input space and define the regions where the expert opinions are trustworthy, the HMMs at the experts specialize in finding models that discriminate among the data within these regions. Hence, multiple models are learned from the data, and the detection rates are increased by the competition of the experts to claim regions of the input space. There is no restriction on the dimension of the data, and varying-length sequences can be handled.

In this study, our goal is to increase classification rates by training discriminative models for the sequences of different classes that fall in the same context because of their highly similar structures. Our MHMME algorithm can learn these contexts and the classification parameters in one framework, and adjusts itself naturally during the iterations. Although the idea is similar to the ME, the gate and experts are now HMMs, and training requires learning the HMM parameters instead of linear weights. The HMM models at each expert can have different structures (number of states, etc.), and the algorithm can easily be extended to a hierarchical mixture of HMM experts.

In Sec. 6.1, we compare the MHMME model to some of the related models that were reviewed in Sec. 3.3 and Sec. 5.4. Then, in Sec. 6.2, we derive the update equations that combine the ME architecture with hidden Markov model updates, and provide the training algorithm for the simultaneous probabilistic learning of the contexts from multi-class data and the discriminative classification of the data in these contexts.

6.1 How ME Compares to Other Models

ME finds multiple sub-regions within the data and trains the experts to specialize in these regions. Unfortunately, ME models cannot operate on time series data. In the literature review in Sec. 3.3, a number of models [11, 12, 14, 166] were described that extend the ME architecture to make it suitable for time series data.


These models are only applicable to regression, and they use a one-step-ahead or multi-step-ahead prediction in which the last d values of the time series data are used as features in a neural network. Such models cannot handle data of varying length, and the use of multilayer network-type approaches prevents them from completely describing the temporal properties of a time series dataset. Additionally, the studies in the ME literature that combine HMMs and HMEs [15, 86-89] handle quite different problems despite the similar names. These models were covered in Sec. 3.3. Briefly, in [86-88], each state of the HMM is an ME, whereas in our model, each HMM is a part of an expert. In [36], one can think of the hidden conditional random fields as the gate and the SVMs as the experts, but temporal sequences were not of interest there.

Due to the soft partitioning at the gates, the MHMME model might be compared to time series clustering algorithms. Most of the time series clustering approaches reviewed in Sec. 5.4 learn one HMM per sequence to construct a similarity matrix. However, such approaches start with the assumption that fitting one HMM to one sequence gives reliable results. This may not always be the case if the sequence is not sufficiently long or if it does not have a repeating pattern. For example, in 2D shape recognition from boundaries, the sequences are of varying lengths, shorter, and not self-repeating. Moreover, those algorithms learn HMM models that are of the same HMM topology (same number of states, number of symbols, and the like). Our MHMME model does not require fitting a single HMM model to a single sequence. Instead, it uses all the available data and gives a weight to each sequence by the use of the HMMs at the gate. Then the experts make a classification decision and learn which HMMs are experts for this weighted data. In addition, because of its modular structure, MHMME can handle experts with different HMM topologies. Moreover, our model does not do a hard clustering at the gate; therefore, during learning, the data can change its probability and start contributing to a different gate or expert.


In comparison to neural network type studies such as [149], the differences are similar to the main differences between neural networks and ME: (1) our gate is obtained from the HMMs, as opposed to being weight vectors; (2) our gate is input-dependent, whereas the weights in [149] are fixed once they are learned; (3) our experts provide a discriminative learning between the HMMs; and (4) the experts provide visually and statistically interpretable results. The contexts found at the gates produce sensible outputs.

6.2 Mixture of Hidden Markov Model Experts

For all the hidden Markov models, we define:

W = number of states.
M = number of symbols in the codebook.
T = length of an observation sequence.
V = \{v_1, \ldots, v_M\}, the discrete set of observation symbols.
O = O_1 O_2 \ldots O_T denotes an observation sequence, where O_t is the observation at time t.
Q = q_1 q_2 \ldots q_T is a fixed state sequence, where q_t is the state at time t.
S = \{S_1, S_2, \ldots, S_W\} are the individual states.
\lambda_{ik} = the HMM model for the kth class at the ith expert.
\tilde{\lambda}_i = the ith HMM model at the gate.
I = number of experts, with expert index i = 1, \ldots, I.
K = number of classes, with class index k = 1, \ldots, K.

The MHMME architecture is illustrated in Fig. 6-1. In this architecture, the gate has I HMM models, and its function is to make soft partitions in the HMM space and define the contexts where the individual expert opinions are trustworthy. Each branch of the gate is connected to an expert, and an expert has K HMMs, one for each class.


Figure 6-1. MHMME architecture. A gate makes partitions in the HMM-space, and the experts learn to discriminate the classes in these partitions.

We denote the HMM models at the gate with \tilde{\Theta} = \{\tilde{\lambda}_i\}_{i=1}^{I}, the HMM models at the experts with \Lambda_i = \{\lambda_{ik}\}_{k=1}^{K}, and, finally, we denote the set of all the gate and expert parameters as \Theta = \{\tilde{\Theta}, \Lambda\}. Let the data be denoted by D = \{O, Y\}, where O = \{O^{(n)}\}_{n=1}^{N} represents the input time series sequences, and Y = \{y^{(n)}\}_{n=1}^{N} represents the class-coded true outputs of the training data, such that

    y_k^{(n)} = 1 if x^{(n)} belongs to class k, and 0 otherwise.

The gate and experts make a decision for a multi-class classification problem following the probability model:

    P(D; \Theta) = \prod_{n=1}^{N} \sum_{i=1}^{I} g_i(O^{(n)}, \tilde{\lambda}_i) \, P_i(y^{(n)}, \Lambda_i).    (6-1)


Here, g_i(O^{(n)}, \tilde{\lambda}_i) is the gate's probabilistic estimate that the sequence O^{(n)} belongs to the HMM-space that is defined by expert i. In other words, g_i(O^{(n)}, \tilde{\lambda}_i) = P(i | O^{(n)}, \tilde{\lambda}_i), the probability of selecting the ith expert given the sequence O^{(n)}. The second term in Eq. 6-1, P_i(y^{(n)}, \Lambda_i), is the probability that the ith expert has generated y^{(n)} given the sequence O^{(n)}. In the rest of this chapter, we will denote g_i(O^{(n)}, \tilde{\lambda}_i) by g_i^{(n)}, and P_i(y^{(n)}, \Lambda_i) by P_i(y^{(n)}).

The gate's probabilistic estimate is obtained by a softmax function that considers the confidences of all the HMM models at the gate, given as:

    g_i^{(n)} = \frac{\exp f(O^{(n)} | \tilde{\lambda}_i)}{\sum_{m=1}^{I} \exp f(O^{(n)} | \tilde{\lambda}_m)},    (6-2)

where f(O^{(n)} | \tilde{\lambda}_i) is the Viterbi log-likelihood of the observation O^{(n)} for the HMM model \tilde{\lambda}_i. Similar to the gate, the HMMs at the experts compute the Viterbi log-likelihood

    f(O^{(n)} | \lambda_{ik}) = \log P_{HMM}(O^{(n)}, Q, \lambda_{ik}),    (6-3)

where the Viterbi likelihood P_{HMM}(O, Q, \lambda_{ik}) is

    P_{HMM}(O, Q, \lambda_{ik}) = \pi_{q_0}^{(ik)} \prod_{t=1}^{T-1} a_{q_t q_{t+1}}^{(ik)} \prod_{t=1}^{T} b_{q_t}^{(ik)}(o_t).    (6-4)

These log-likelihoods are converted to probabilities by a softmax function, and the output of expert i for class k is \hat{y}_{ik}^{(n)}, computed as:

    \hat{y}_{ik}^{(n)} = \frac{\exp f(O^{(n)} | \lambda_{ik})}{\sum_{r=1}^{K} \exp f(O^{(n)} | \lambda_{ir})},    (6-5)

which is also the mean of its multinomial probability model; i.e., for a given sequence O^{(n)}, expert i produces a prediction with probability P_i(y^{(n)}) following a multinomial distribution with mean \hat{y}_{ik} such that:

    P_i(y^{(n)}) = \prod_k \hat{y}_{ik}^{y_k}.    (6-6)
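The sketch below shows how the gate and expert outputs of Eqs. 6-2 through 6-6 might be computed from a table of Viterbi log-likelihoods, using a numerically stabilized softmax; the helper viterbi_loglik is a hypothetical stand-in for the Viterbi scoring of Eq. 6-3.

import numpy as np

def softmax(scores):
    """Stable softmax over the last axis (subtracting the max avoids overflow)."""
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mhmme_outputs(O, gate_hmms, expert_hmms):
    """Gate weights g_i (Eq. 6-2) and expert outputs yhat_ik (Eq. 6-5) for one sequence.

    gate_hmms   : list of I gate HMMs (the lambda~_i)
    expert_hmms : I x K nested list of expert HMMs (the lambda_ik)
    """
    g = softmax(np.array([viterbi_loglik(m, O) for m in gate_hmms]))
    y_hat = np.array([softmax(np.array([viterbi_loglik(m, O) for m in row]))
                      for row in expert_hmms])      # shape (I, K)
    y_final = g @ y_hat                             # mixture output (Eq. 6-7, below)
    return g, y_hat, y_final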


Finally, the output of the MHMME architecture, \{\hat{y}_k\}_{k=1}^{K}, is a weighted sum of the expert outputs:

    \hat{y}_k^{(n)} = \sum_i g_i^{(n)} \hat{y}_{ik}^{(n)}.    (6-7)

It is worth noticing the relationship between the mixture model in Eq. 6-7 and its probabilistic counterpart in Eq. 6-1. The parameters of the MHMME model are learned using the probabilistic model in Eq. 6-1, which will be explained in the next section. Typically, the class k that gives the maximum \hat{y}_k^{(n)} is selected as the final decision, i.e., as the class of the sequence O^{(n)}.

6.2.1 Training of the MHMME Model

The marginal distribution P(D; \Theta) in Eq. 6-1 can be solved much more simply by introducing latent variables Z, and by solving the equivalent formulation P(D, Z; \Theta) with the expectation-maximization (EM) algorithm. These latent variables are Z = \{\{z_i^{(n)}\}_{n=1}^{N}\}_{i=1}^{I} such that

    z_i^{(n)} = 1 if O^{(n)} \in R_i, and 0 otherwise,

where R_i is the region specified by expert i. Hence, the complete data distribution becomes:

    P(D, Z; \Theta) = \prod_n \prod_i \left[ g_i^{(n)} P_i(y^{(n)}) \right]^{z_i^{(n)}}.    (6-8)

So, in the E step, we find the expectations of the hidden variables [4, 28] as:

    h_i^{(n)} = g_i^{(n)} P_i(y^{(n)}) \Big/ \left( \sum_j g_j^{(n)} P_j(y^{(n)}) \right).    (6-9)

In the M step, we maximize the expected complete data likelihood E_Z(\log P(D, Z; \Theta)), which gives the objective function


    Q(\Theta, \Theta^{(p)}) = E_Z(l(\Theta, D, Z)) = \sum_{n=1}^{N} \sum_{i=1}^{I} h_i^{(n)} \log g_i^{(n)} + \sum_{n=1}^{N} \sum_{i=1}^{I} h_i^{(n)} \log P_i(y^{(n)}).    (6-10)

In the M step, the h_i's are kept fixed, so the two terms on the right side of the equation are decoupled, and can be computed independently for the experts and the gate. We refer to these objective functions as Q_g for the gate, and as Q_e for the experts, given as:

    Q_g = \sum_{n=1}^{N} \sum_{i=1}^{I} h_i^{(n)} \log g_i^{(n)},    (6-11)
    Q_e = \sum_{n=1}^{N} \sum_{i=1}^{I} h_i^{(n)} \log P_i(y^{(n)}),    (6-12)

and we search for the HMM parameters that maximize these objective functions:

    \lambda_{ik}^{(p+1)} = \arg\max_{\lambda_{ik}} Q_e,    (6-13)
    \tilde{\lambda}_i^{(p+1)} = \arg\max_{\tilde{\lambda}_i} Q_g.    (6-14)

Explicitly, the HMM parameters to be estimated in the experts are \lambda_{ik} = \{A^{(ik)}, B^{(ik)}\}. We will denote each element of the A matrix as a_{rj}^{(ik)}, with r = 1, \ldots, W and j = 1, \ldots, W, and each element of the B matrix as b_{mj}^{(ik)}, with m = 1, \ldots, M and j = 1, \ldots, W. To ensure that the estimated parameters satisfy the constraints a_{rj} \geq 0, \sum_{j=1}^{W} a_{rj} = 1, b_{mj} \geq 0, and \sum_{m=1}^{M} b_{mj} = 1, we map these parameters using the log, and map them back with softmax functions:


    a_{rj} \rightarrow \tilde{a}_{rj} = \log a_{rj},    (6-15)
    a_{rj} = \frac{\exp \tilde{a}_{rj}}{\sum_{j'=1}^{W} \exp \tilde{a}_{rj'}},    (6-16)
    b_{mj} \rightarrow \tilde{b}_{mj} = \log b_{mj},    (6-17)
    b_{mj} = \frac{\exp \tilde{b}_{mj}}{\sum_{m'=1}^{M} \exp \tilde{b}_{m'j}}.    (6-18)

Such mappings are common in gradient based training models such as [123, 167]. Let p denote the iteration number. The HMM parameters that maximize the objective functions are found by gradient ascent updates as:

    \tilde{a}_{rj}^{(ik)}(p+1) = \tilde{a}_{rj}^{(ik)}(p) + \rho \, \frac{\partial Q_e(\Theta^{(p)})}{\partial \tilde{a}_{rj}^{(ik)}(p)},    (6-19)
    \tilde{b}_{mj}^{(ik)}(p+1) = \tilde{b}_{mj}^{(ik)}(p) + \rho \, \frac{\partial Q_e(\Theta^{(p)})}{\partial \tilde{b}_{mj}^{(ik)}(p)},    (6-20)

where

    \frac{\partial Q_e(\Theta)}{\partial \tilde{a}_{rj}^{(ik)}} = \sum_{n=1}^{N} h_i^{(n)} \left( y_k^{(n)} - \hat{y}_{ik}^{(n)} \right) \frac{\partial f(O^{(n)}, \lambda_{ik})}{\partial a_{rj}^{(ik)}} \frac{\partial a_{rj}^{(ik)}}{\partial \tilde{a}_{rj}^{(ik)}},    (6-21)
    \frac{\partial Q_e(\Theta)}{\partial \tilde{b}_{mj}^{(ik)}} = \sum_{n=1}^{N} h_i^{(n)} \left( y_k^{(n)} - \hat{y}_{ik}^{(n)} \right) \frac{\partial f(O^{(n)}, \lambda_{ik})}{\partial b_{mj}^{(ik)}} \frac{\partial b_{mj}^{(ik)}}{\partial \tilde{b}_{mj}^{(ik)}},    (6-22)

and the gradients are


    \frac{\partial f(O^{(n)}, \lambda_{ik})}{\partial a_{rj}^{(ik)}} = \frac{1}{a_{rj}^{(ik)}} \sum_{t=1}^{T} \delta\left( q_t^{(n)} = r, \, q_{t+1}^{(n)} = j \right),    (6-23)
    \frac{\partial b_{mj}^{(ik)}}{\partial \tilde{b}_{mj}^{(ik)}} = b_{mj}^{(ik)} \left[ 1 - b_{mj}^{(ik)} \right],    (6-24)
    \frac{\partial f(O^{(n)}, \lambda_{ik})}{\partial b_{mj}^{(ik)}} = \frac{1}{b_{mj}^{(ik)}} \sum_{t=1}^{T} \delta\left( q_t^{(n)} = j, \, O_t^{(n)} = v_m \right).    (6-25)

Similarly, the updates for the gates are:

    \frac{\partial Q_g(\Theta)}{\partial \tilde{a}_{rj}^{(i)}} = \sum_{n=1}^{N} \left( h_i^{(n)} - g_i^{(n)} \right) \frac{\partial f(O^{(n)}, \tilde{\lambda}_i)}{\partial a_{rj}^{(i)}} \frac{\partial a_{rj}^{(i)}}{\partial \tilde{a}_{rj}^{(i)}},    (6-26)
    \frac{\partial Q_g(\Theta)}{\partial \tilde{b}_{mj}^{(i)}} = \sum_{n=1}^{N} \left( h_i^{(n)} - g_i^{(n)} \right) \frac{\partial f(O^{(n)}, \tilde{\lambda}_i)}{\partial b_{mj}^{(i)}} \frac{\partial b_{mj}^{(i)}}{\partial \tilde{b}_{mj}^{(i)}}.    (6-27)

Hence, the algorithm computes the expectation of the hidden variables (h_i) in the E step, and learns the HMMs of both the gate and the experts in the M step.
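As an illustration of Eqs. 6-15 through 6-25, the sketch below performs one reparameterized gradient-ascent step on the transition matrix of a single expert HMM, given the Viterbi state paths of the training sequences. The transition-count logic follows Eq. 6-23, and the learning rate and names are illustrative assumptions, not values from the experiments.

import numpy as np

def update_transitions(A, paths, weights, resid, rho=0.05):
    """One gradient-ascent step on A in the log/softmax parameterization.

    A       : (W, W) current transition matrix of one expert HMM lambda_ik
    paths   : list of Viterbi state paths q^(n) (arrays of state indices)
    weights : (N,) posteriors h_i^(n) from the E step
    resid   : (N,) residuals y_k^(n) - yhat_ik^(n)
    """
    W = A.shape[0]
    grad_f = np.zeros((len(paths), W, W))
    for n, q in enumerate(paths):                  # Eq. 6-23: transition counts
        for t in range(len(q) - 1):
            grad_f[n, q[t], q[t + 1]] += 1.0
    grad_f /= A                                    # divide counts by a_rj
    # chain rule through the softmax map (Eq. 6-21; cf. Eq. 6-24 for b)
    grad = (weights * resid) @ grad_f.reshape(len(paths), -1)
    grad = grad.reshape(W, W) * A * (1.0 - A)
    A_tilde = np.log(A) + rho * grad               # ascent step in log space
    e = np.exp(A_tilde - A_tilde.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # map back: rows sum to one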


The complete algorithm is given below.

Algorithm 6.2.1: MHMME TRAINING(K, X, Y)

    Initialize the number of experts I.
    Initialize the gating HMM parameters \{\tilde{\lambda}_i\}_{i=1}^{I}.
    Initialize the expert HMM parameters \{\{\lambda_{ik}\}_{i=1}^{I}\}_{k=1}^{K}.
    while |Q(\Theta, \Theta^{(p-1)}) - Q(\Theta, \Theta^{(p)})| / Q(\Theta, \Theta^{(p)}) > 10^{-5} do
        E STEP:
            Compute the Viterbi log-likelihoods f(O | \lambda_{ik}) from Eq. 6-3.
            Compute the expert outputs \hat{y}_{ik}^{(n)} from Eq. 6-5.
            Compute the expert probabilities P_i(y) from Eq. 6-6.
            Compute the gating outputs g_i^{(n)} from Eq. 6-2.
            Compute the posterior probabilities h_i^{(n)} from Eq. 6-9.
        M STEP:
            Expert updates: while Q_e in Eq. 6-12 is increasing (i.e., \Delta Q_e > 10^{-5}), for each expert:
                Map a_{rj} \rightarrow \tilde{a}_{rj} and b_{mj} \rightarrow \tilde{b}_{mj} (Eqs. 6-15 & 6-17).
                Update A from Eq. 6-19.
                Update B from Eq. 6-20.
                Map \tilde{a}_{rj} \rightarrow a_{rj} and \tilde{b}_{mj} \rightarrow b_{mj} (Eqs. 6-16 & 6-18).
            Gate updates: while Q_g in Eq. 6-11 is increasing (i.e., \Delta Q_g > 10^{-5}):
                Map a_{rj} \rightarrow \tilde{a}_{rj} and b_{mj} \rightarrow \tilde{b}_{mj} (Eqs. 6-15 & 6-17).
                Update A using the gradients in Eq. 6-26.
                Update B using the gradients in Eq. 6-27.
                Map \tilde{a}_{rj} \rightarrow a_{rj} and \tilde{b}_{mj} \rightarrow b_{mj} (Eqs. 6-16 & 6-18).
        Compute Q(\Theta, \Theta^{(p)}) from Eq. 6-10.
    Compute \hat{y}_k^{(n)} from Eq. 6-7.
    Make a decision k^* = \arg\max_k \{\hat{y}_k^{(n)}\}_{k=1}^{K}.
    return (k^*)


A very important point is that the HMMs at each expert are affected by all the data points, but the effect of each data point is weighted by the gate. Therefore, even if a sequence does not have a high enough weight to guide the training of a classifier, it may still affect the training, albeit slightly. As a result, the HMMs can avoid over-training while they specialize for each context.

6.2.2 Two-Class Case

In this section, we show the learning for a single expert in a two-class problem. Remember that at each expert, we are trying to maximize Q_e:

    Q_e = \sum_{n=1}^{N} \sum_{i=1}^{I} h_i^{(n)} \log P_i(y^{(n)}).

In the M step, the h_i's are fixed, so we assume them to be 1. The training for each expert is decoupled, so assume one of the experts is already picked (hence we drop the summation over i). We also assume just two classes. Then

    Q = \sum_n \log P_i(y^{(n)})
      = \sum_n \log \left( \prod_k \hat{y}_{ik}^{y_k} \right)
      = \sum_n \sum_k y_k \log(\hat{y}_{ik})
      = \sum_n \sum_k y_k \log \left( \frac{\exp f(O^{(n)} | \lambda_{ik})}{\sum_{r=1}^{K} \exp f(O^{(n)} | \lambda_{ir})} \right)
      = \sum_n \sum_k y_k \log \frac{1}{1 + \exp\left( f(O^{(n)} | \lambda_{i, r \neq k}) - f(O^{(n)} | \lambda_{i, r = k}) \right)}.

Remembering that the f(\cdot) terms are the log-likelihoods of an HMM and are negative numbers, the only way for Q to increase is if the first log-likelihood term is decreased and the second log-likelihood term is increased. Thus, at each expert, the HMMs are adjusted such that the correct hypothesis becomes more probable, while the incorrect hypothesis becomes less probable.


6.3 Results on Synthetic Data

For an illustrative example, we generated 80 sequences from two classes as follows: the [0, 1] domain was divided into 10 intervals, and the x values were uniformly sampled from each of these intervals. Then, for the first class, 20 sequences were generated from y = x + N(0, \sigma^2), where \sigma = 0.04, and 20 sequences were generated from y = -x + 1 + N(0, \sigma^2). This results in 40 training sequences (each of length 10) for the first class. Similarly, 40 training sequences were generated for the second class, 20 of which were generated from y = x^2 + N(0, \sigma^2), and the remaining 20 from y = -x^2 + 1 + N(0, \sigma^2). These sequences are displayed in Fig. 6-2. Using the same parameters and the same protocol, a test set was generated that also had 80 sequences, with 40 from each class.

The y values were used as the data, and the x values were ignored; therefore, the features were just the 1D sequences, as if the data were projected onto the y-axis. These values were then discretized to five symbols linearly spaced in [0, 1]. Hence, the classification is actually not as simple as it seems, because the data were discretized to only five symbols, and learning models that discriminate y = x from y = x^2 without the time information is not trivial.
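A sketch of this data generation and five-symbol quantization, under the assumptions stated above (length-10 sequences, \sigma = 0.04); the names are illustrative.

import numpy as np

def make_sequences(f, n_seq=20, length=10, sigma=0.04, rng=np.random.default_rng(0)):
    """Sample sequences y = f(x) + N(0, sigma^2), one x per interval of [0, 1]."""
    seqs = []
    for _ in range(n_seq):
        # one uniform sample from each of the 10 sub-intervals of [0, 1]
        x = (np.arange(length) + rng.random(length)) / length
        seqs.append(f(x) + rng.normal(0.0, sigma, size=length))
    return seqs

class1 = make_sequences(lambda x: x) + make_sequences(lambda x: -x + 1)
class2 = make_sequences(lambda x: x**2) + make_sequences(lambda x: -x**2 + 1)

# discretize the y values to five symbols linearly spaced in [0, 1]
codebook = np.linspace(0.0, 1.0, 5)
def quantize(seq):
    seq = np.asarray(seq)
    return np.abs(seq[:, None] - codebook[None, :]).argmin(axis=1)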


To discriminate these sequences, an MHMME model was trained with two experts. All the HMMs in the MHMME model were fixed to have three states and five symbols. To initialize the gate, the data were clustered by k-means into two clusters, and a Baum-Welch (BW) HMM was fit to each of these clusters. To initialize the experts, the sequences that were highly weighted by the gates were clustered using the HMM clustering approach in [139]. With this initialization, the first HMM in the gate immediately gave more weight to the y = x and y = x^2 sequences, and the second HMM in the gate gave more weight to the y = -x + 1 and y = -x^2 + 1 sequences.

Figure 6-2. Synthetic data for two classes. The sequences that belong to the first class are displayed in red, and the sequences that belong to the second class are displayed in black. The first class has 20 sequences generated from the function y = x + N(0, \sigma^2), overlaid on each other, and 20 sequences generated from y = -x + 1 + N(0, \sigma^2), also overlaid on each other. Similarly, the second class has 20 sequences generated from the function y = x^2 + N(0, \sigma^2), and 20 sequences generated from y = -x^2 + 1 + N(0, \sigma^2).

Upon the MHMME training, the first expert learned HMM models that discriminate between y = x and y = x^2, and the second expert learned HMM models that discriminate between y = -x + 1 and y = -x^2 + 1. The patterns learned by each HMM are shown in Fig. 6-3.

Figure 6-3. MHMME results on synthetic data. The first HMM in the gate learns a model for y = x and y = x^2, and the second HMM in the gate learns a model for y = -x and y = -x^2. Then, the first expert learns to discriminate between y = x and y = x^2, and the second expert learns HMM models to distinguish between y = -x and y = -x^2.


The initial classification rates were 65% on the training data and 60% on the test data. After MHMME learning, the classification rates reached 98% on the training data and 94% on the test data. The improvement in the objective function with respect to the outer iterations is displayed in Fig. 6-4. By one outer iteration, we mean a complete E-M step in which all the parameters of the experts and the gate are updated. Note that at each update of an HMM, there are several inner iterations that run until convergence, as given in Algorithm 6.2.1. The red curve is the objective function of the gate, Q_g, in Eq. 6-11, and the green curve is the objective function of the experts, Q_e, in Eq. 6-12. Q_g and Q_e are summed to get the total objective function Q in Eq. 6-10, which is displayed as the solid blue curve.

The patterns in the iterations point to an interesting observation. In the first iteration, the gate stays the same, and the experts get a quick update. Then the update of the experts attains a smaller incline, while the gate shows a significant increase for the next two iterations. Finally, when the gate updates become almost constant, the experts keep adjusting themselves until both the experts and the gate reach a steady solution. With these adjustments at each iteration, the experts strive to best represent the sequences that are highly weighted by their corresponding gates.

It should be noted that with a different initialization or with a different number of experts, the gate could partition the space in a different way, as there are many ways to solve this problem. In that case, one expert/gate combination learns whatever the other expert/gate combinations are not learning, and finds meaningful patterns for the data that received low confidences from the other experts. With this interpretation, one can compare it to AdaBoost; however, there is at least one major difference: the experts and the gates are learned in conjunction with each other, and updated simultaneously, whereas in AdaBoost, the experts are learned successively, and there is no going back once an expert is learned.


Figure 6-4. MHMME objective function. The objective functions for the gate (Q_g) and the experts (Q_e) versus the number of iterations. The total objective function Q, which is the sum of Q_g and Q_e, is displayed in blue.

6.4 Results on the Landmine Dataset

In landmine detection, landmines and metallic clutter show a smoother EMI response than nonmetallic clutter. Moreover, each type of landmine has a unique EMI response depending on the distribution of metal in the target. However, this unique response cannot always be observed in its perfect shape: it may be cluttered with noise, and may be scaled depending on the depth at which the mine is buried. Therefore, the MHMME model can find multiple models from the data that represent each of these contexts, and do a better classification than those models that ignore the context.

The metal detector's EMI response can be modeled as

    S(w) = A \left[ I(w) + iQ(w) \right],    (6-28)

where w is the frequency, A is the magnitude, I(w) is the real (in-phase) response, and Q(w) is the imaginary (quadrature) response of the complex system. The term I(w) + iQ(w) describes the shape of the response as a function of the 21 frequencies [19].


The real EMI response I(w) is plotted against the imaginary EMI response Q(w) over the 21 frequencies; such a plot of the real response with respect to the imaginary response is called an argand diagram. The shape of an argand diagram is indicative of the type and distribution of metal in a target [18], and mines of the same type show similar argand curves that are scaled replicas of each other depending on the depth. To eliminate the variation in magnitude due to depth, the magnitude is typically normalized to [0, 1] [2]. Examples of argand diagrams were given in Sec. 2.1. Using the shape information from the argand diagrams, it is possible to discriminate between some landmines and clutter.

The WEMI sensor's response is a 21 x 3 x 150 complex data matrix for each grid. Similar to the existing systems [22], the data vector from the middle channel at the center of the cell was used for analysis. The argand sequences were discretized to the 50 cluster centers found by fuzzy C-means (FCM) clustering. These cluster centers are displayed in Fig. 6-5. All the HMMs were set to have 3 states, found experimentally. In the rest of this section, we will visualize the properties of MHMME on one fold of a two-fold training. Then we will report our classification results with 10-fold cross-validation training and testing.

Figure 6-5. Cluster centers of the landmine data. The cluster centers are found using FCM clustering, and used to quantize the landmine data.
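The sketch below illustrates the preprocessing just described: normalizing an argand curve's magnitude to [0, 1] and quantizing each 21-point complex sequence against an FCM codebook. The codebook is assumed to be precomputed (50 complex cluster centers); the names are illustrative.

import numpy as np

def normalize_argand(s):
    """Scale a complex EMI response so its magnitude lies in [0, 1] (depth removal)."""
    mag = np.abs(s)
    return s / mag.max() if mag.max() > 0 else s

def quantize_argand(s, codebook):
    """Map each of the 21 complex samples to the nearest of the 50 FCM centers."""
    s = normalize_argand(np.asarray(s))
    # distance from every sample to every codebook center in the complex plane
    d = np.abs(s[:, None] - codebook[None, :])
    return d.argmin(axis=1)          # discrete symbol sequence for the HMMs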


For visualization purposes, the results are displayed on one fold of a 2-fold training. Of the 888 sequences of mines and non-mines, 444 were used in the training. The MHMME architecture was set to have 8 experts, which corresponds to 8 HMMs at the gate and 2 HMMs at each expert. To initialize the gate, the first 4 HMMs were computed from the sequences in the mine class (class 1) using the clustering approach in [139], which was described in the previous chapter. Similarly, the last 4 HMMs were initialized from the nonmine sequences, using the same clustering approach. This results in 8 HMMs, which are the initial parameters of the gate. Then, all the training sequences were tested with all the HMMs, and those that received high likelihoods were used to train an HMM for each expert. For example, if gate HMM 1 gave higher likelihoods to 20 sequences from the first class and 10 sequences from the second class, the first HMM of the first expert was initialized with BW-HMM training on the 20 sequences from the first class, and the second HMM of the first expert was initialized with BW-HMM training on the 10 sequences from the second class.

After the MHMME learning, the distribution of the sequences over the gate is shown in Fig. 6-6. On the x-axis, objects from the mine class and the nonmine class are separated by the yellow line, such that the first 156 objects belong to the mine class, and the next 288 objects belong to the nonmine class. The colors represent the true labels of all these objects. On the y-axis, each object is placed at the gate that has the highest weight for that object. For example, the 7th gate successfully models metallic clutter (MC) and nonmetallic clutter (NMC) objects, and does not get confused by any mines. On the other hand, the second gate predominantly models the HMAP and HMAT mines, but does get confused by some metallic clutter. Similarly, the 4th gate picks some LMAP and LMAT objects, and gets confused mostly with MC. Considering the metal content of the targets, such a division makes a lot of sense.

Actually, there is no such hard assignment, and all of the assignments in Fig. 6-6 have a probability attached to them. If we find the sequences with the highest probabilities at each gate, we find the sequences that represent the context that the gate has learned. These sequences are displayed in Fig. 6-7 and in Fig. 6-8.


Figure 6-6. Distribution of the sequences by the gate to the experts. On the x-axis, objects from the mine class and the nonmine class are separated by the yellow line, such that the first 156 objects belong to the mine class, and the next 288 objects belong to the nonmine class. The colors represent the true labels of all these objects. On the y-axis, each object is placed at the gate that has the highest weight for that object.

In these figures, the type of the mine or nonmine object is given on the y-axis. For example, the first context is defined by the LMAT mines that have a particular shape. Similarly, LMAT and HMAP objects of a particular shape share the second context. The Viterbi paths corresponding to the sequences in Fig. 6-7 and in Fig. 6-8 are given in Fig. 6-9 and Fig. 6-10, to show the difference in the paths for each context. Classification results are shown in Fig. 6-11. In (A), the sequences that are highly weighted by each gate HMM are displayed; this is the same figure as Fig. 6-6, only without the colors.

The experts help correctly classify the sequences that create confusion, resulting in the final classification in Fig. 6-11(B), which has fewer misclassified objects.

Figure 6-7. Contexts defined by the gate HMMs 1-4 (A: gate HMM 1, B: gate HMM 2, C: gate HMM 3, D: gate HMM 4). Each column shows the top three argand sequences among all the training sequences that are assigned the highest weight by the gate HMM. On the y-axis, the type of the mine or nonmine object is given.

To report our classification results, we ran twenty experiments of 10-fold cross-validation. The ROC curve for the 10-fold cross-validation experiment is displayed in Fig. 6-12; a 90% true positive rate is attained at a 30% false alarm rate. The average classification rates of the 20 independent runs are given in Table 6-1. Each entry is described below.

Cl-HMM: Sequences from each class are clustered into 4 clusters using [139], and an HMM is learned for each cluster, resulting in 8 HMMs. A test sequence is assigned to the class whose HMM yields the highest log-likelihood. This method is also used in initializing the gate of the MHMME model.

Figure 6-8. Contexts defined by the gate HMMs 5-8 (A: gate HMM 5, B: gate HMM 6, C: gate HMM 7, D: gate HMM 8). Each column shows the top three sequences among all the training sequences that are assigned the highest weight by the gate HMM. On the y-axis, the type of the mine or nonmine object is given.

Gate: The HMMs at the gate of the final MHMME model are used as classifiers to test their individual performance. The first four HMMs are assumed to represent the first class, and the next four HMMs are assumed to represent the second class.

Experts: The HMMs at the experts of the final MHMME model are used as classifiers to test their individual performance. A test sequence is assigned to the class for which it has the highest log-likelihood over any of the 16 HMMs at the experts.
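All three baselines above share the same maximum-log-likelihood decision rule. A minimal sketch, again in hmmlearn style; hmm_classes, a class tag for each HMM, is a hypothetical helper of ours:

```python
import numpy as np

def classify_max_loglik(hmms, hmm_classes, seq):
    """Return the class tag of the HMM that scores the sequence highest."""
    ll = [h.score(seq[:, None]) for h in hmms]     # one log-likelihood per HMM
    return hmm_classes[int(np.argmax(ll))]

# E.g., for the Experts entry: 16 expert HMMs tagged mine (1) or nonmine (0).
# y_hat = classify_max_loglik(expert_hmms, expert_classes, test_seq)
```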

Figure 6-9. Viterbi paths of the sequences highly weighted by gate HMMs 1-4 (A: gate HMM 1, B: gate HMM 2, C: gate HMM 3, D: gate HMM 4), corresponding to the original sequences in Fig. 6-7. Each column shows the Viterbi paths of the top three sequences among all the training sequences that are assigned the highest weight by the gate HMM.

MHMME+SVM: For each sequence, the log-likelihoods obtained from all the HMMs in the MHMME model are concatenated to form a 30-D feature vector, which is then used to train an rbsvm (see the sketch following Table 6-1).

Table 6-1. Classification rates on the landmine dataset for 10-fold training

    Model        Mean   Standard deviation
    Cl-HMM       0.70   0.02
    Gate         0.71   0.05
    Experts      0.61   0.02
    MHMME        0.80   0.05
    MHMME+SVM    0.83   0.04
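A sketch of the MHMME+SVM stage, assuming scikit-learn's RBF-kernel SVC stands in for the rbsvm; all_hmms, which collects every gate and expert HMM of the trained model, and the train/test variables are our placeholders, and the feature dimension simply equals the number of HMMs collected:

```python
import numpy as np
from sklearn.svm import SVC

def loglik_features(hmms, seqs):
    """One row per sequence: its log-likelihood under every HMM in the model."""
    return np.array([[h.score(s[:, None]) for h in hmms] for s in seqs])

F_train = loglik_features(all_hmms, train_seqs)
clf = SVC(kernel="rbf", gamma="scale").fit(F_train, train_labels)
y_hat = clf.predict(loglik_features(all_hmms, test_seqs))
```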

Figure 6-10. Viterbi paths of the sequences highly weighted by gate HMMs 5-8 (A: gate HMM 5, B: gate HMM 6, C: gate HMM 7, D: gate HMM 8), corresponding to the original sequences in Fig. 6-8. Each column shows the Viterbi paths of the top three sequences among all the training sequences that are assigned the highest weight by the gate HMM.

6.5 Results on the Chicken Pieces Dataset

The chicken pieces sequences were quantized to 20 symbols, which were found using fuzzy C-means clustering of the data (see the note below). All the HMMs were set to 3 states and 20 symbols. The MHMME model was initialized to have 10 experts, which corresponds to 2 experts per class. The HMMs at the gate were initialized using the clustering approach in [139]. Therefore, by initialization, the context at each gate was defined to be a specific cluster of a given class. Within this context, it is the expert's duty to learn models that discriminate between the classes. To be able to compare our results to those in the literature, a two-fold cross-validation was used.
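The quantization here is the same FCM procedure sketched in the previous section, only with 20 centers; assuming the chicken-piece samples are scalar contour values (an assumption on our part), the call becomes:

```python
# Stack the 1-D samples as a (n_points, 1) array and learn 20 symbols.
centers = fuzzy_c_means(np.concatenate(sequences)[:, None], c=20)
```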

Figure 6-11. Sequences highly weighted by the gate are shown in (A). The final classification result is shown in (B). The objects to the left of the yellow line should be assigned to class 1, and the objects to the right of the yellow line should be assigned to class 2. The objects at the top left and at the bottom right of (B) correspond to the ones that are misclassified.

Figure 6-12. ROC curve of the MHMME and the MHMME+SVM for the landmine dataset for a 10-fold cross-validation experiment. A 90% true positive rate is attained at a 30% false alarm rate.

The contexts (i.e., the similar sequences in the HMM sense) are shown in Fig. 6-13. Each column represents the context learned by each HMM at the gate. The three sequences in each column are the ones that attain the highest log-likelihood among all the training sequences when tested against the corresponding HMM at the gate. The log-likelihoods of the gate and the experts for the training data of a given class are shown in Fig. 6-14. Each point corresponds to a data sequence from the training set of a given class. The x-axis shows the difference of the log-likelihoods between two experts. The y-axis shows the log-likelihoods of the gate HMMs. The HMMs at the gate define a context, and the HMM experts specialize within these contexts. As a result, the experts and the gate complement each other, resulting in the butterfly effect: if a gate/expert pair is doing badly in a region of the space, another gate/expert pair performs better and dominates the classification decision.
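This complementary behavior follows from the data-dependent mixture form of the model output. In generic notation (a reconstruction for intuition; the exact expressions are derived in the earlier chapters, so treat the normalization below as an assumption),

$$P(y \mid O) \;=\; \sum_{k=1}^{K} g_k(O)\, P_k(y \mid O), \qquad g_k(O) \;=\; \frac{P(O \mid \lambda_k^{\mathrm{gate}})}{\sum_{j=1}^{K} P(O \mid \lambda_j^{\mathrm{gate}})},$$

where $\lambda_k^{\mathrm{gate}}$ is the $k$-th gate HMM and $P_k(y \mid O)$ is the decision of the $k$-th expert's class HMMs. Wherever a gate HMM assigns an observation sequence $O$ a low likelihood, $g_k(O)$ is small and expert $k$ is effectively switched off in that region of the space.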

The sequences for which each gate HMM has the maximum probability are shown in Fig. 6-15(A). Note that, by initialization, every two HMMs at the gate are supposed to specialize in one class; however, there is a great overlap between these classes. The experts are able to distinguish between these sequences, resulting in the better classification rates in Fig. 6-15(B). These class assignments are probabilistic, and their probabilities are shown in Fig. 6-16. A perfect classification would have had red, green, blue, magenta, and black colors, respectively, in each region between the yellow lines. Any deviation from this color arrangement is a misclassification. For example, a blue point in the leftmost region (1st class) indicates a misclassification, and means that the sequence from the first class has been labeled as belonging to the 3rd class (blue) instead of its true label (red). Note that most of the misclassified objects have lower probabilities in Fig. 6-16.

When compared to the ME-only and HMM-only models, our results show that the combination of ME and HMM models in an MHMME model increases the performance of any single classifier. In addition, we compare MHMME to its individual components, i.e., the HMMs at the experts and at the gate. Our results show that the MHMME combination of the individual components increases the classification rates. Each classifier is briefly described below.

PCA+ME: To bring all the sequences to a manageable fixed length for ME, all the sequences are interpolated to length 110, and principal component analysis (PCA) is applied. Sequences are transformed using the 10 most significant eigenvectors, and feature vectors of length 10 are then used to train an ME model (see the sketch below).

PCA+VMEC: The same procedure as PCA+ME is applied, but using variational ME for classification (VMEC) instead of ME [46]. The gamma hyperparameters are set to 1.

HMM: A single HMM is trained per class, using Baum-Welch training [105]. A test sequence is assigned to the class whose HMM (among the five class HMMs) yields the highest log-likelihood.
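A minimal sketch of the PCA+ME preprocessing: linear interpolation to a fixed length, then projection onto the 10 leading eigenvectors with scikit-learn's PCA. The variable train_seqs, holding 1-D contour sequences, is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def to_fixed_length(seq, L=110):
    """Resample a 1-D sequence to length L by linear interpolation."""
    return np.interp(np.linspace(0, 1, L), np.linspace(0, 1, len(seq)), seq)

X = np.vstack([to_fixed_length(np.asarray(s, dtype=float)) for s in train_seqs])
features = PCA(n_components=10).fit_transform(X)   # 10-D inputs for the ME model
```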

Figure 6-13. Contexts learned by each HMM at the gate (A-J: contexts 1-10). The three sequences in which the corresponding HMM is most confident are given in (A)-(J).

Cl-HMM: Sequences from each class are clustered into 2 clusters using [139], and an HMM is learned for each cluster, resulting in 10 HMMs. A test sequence is assigned to the class whose HMM (any of the ten HMMs) yields the highest log-likelihood. These HMMs are also used in initializing the gate of the MHMME model.

Gate: The HMMs at the gate of the final MHMME model are used as classifiers to test their individual performance.

Experts: The HMMs at the experts of the final MHMME model are used as classifiers to test their individual performance.

Figure 6-14. The log-likelihoods of the gate and the experts for data of a given class. Each point corresponds to a data sequence from the training set of a given class. The x-axis shows the difference of the log-likelihoods between two experts. The y-axis shows the log-likelihoods of the gate HMMs.

Figure 6-15. Classification results of MHMME on the chicken pieces dataset. Sequences highly weighted by the gate are shown in (A). The final classification result is shown in (B). Yellow lines indicate the true labels of the objects on the x-axis, and the classification results are given on the y-axis.

MHMME+SVM: For each sequence, the log-likelihoods obtained from all the HMMs in the MHMME model are concatenated to form a 15-D feature vector, which is then used to train an rbsvm.

Figure 6-16. The probability corresponding to each decision in Fig. 6-15(B) is shown. The legend on the right side shows the color of each classifier decision. A perfect classification would have had red, green, blue, magenta, and black colors, respectively, in each region between the yellow lines. Any deviation from this color arrangement is a misclassification.

The chicken pieces dataset has been previously investigated in the literature. The study in [147] learns multiple HMMs and inputs the outputs of the HMMs as features to an SVM. These classifiers are briefly described below, and the classification rate comparisons of these classifiers are given in Table 6-2.

rbsvm+log-space: One HMM is learned per class. Then, each sequence is represented by a 5-D feature vector consisting of the log-likelihoods obtained from the 5 HMMs. These feature vectors are then used to train a radial basis support vector machine (rbsvm) [168]. The classification rates were taken from [147].

rbsvm+emit-space: Similar to the previous approach, a 3-state HMM is learned per class.

For each sequence and each HMM, the sum of the emission probabilities at each state is recorded. These emission probabilities are then concatenated to form a 15-D feature vector, which is then used to train an rbsvm, as detailed in [147].

The classification rates of these classifiers are given in Table 6-2. These rates are the average of 20 independent training runs, each of which employs a 2-fold cross-validation. Note that there are 5 classes, so a random classification would be 20%.

Table 6-2. Classification rates on the chicken pieces dataset for 2-fold training

    Model              Mean   Standard deviation
    PCA+ME             0.43   0.01
    PCA+VMEC           0.44   0.01
    HMM                0.41   0.05
    Cl-HMM             0.61   0.03
    Gate               0.59   0.08
    Experts            0.53   0.04
    MHMME              0.73   0.02
    MHMME+SVM          0.87   0.01
    rbsvm+log-space    0.75   0.005
    rbsvm+emit-space   0.80   0.005

Among these classifiers, PCA+ME and PCA+VMEC are the two ME-only methods, whereas HMM and Cl-HMM are the HMM-only methods. When compared to these, MHMME performs better than both the HMM-only and the ME-only methods. In addition, the final output is a combination of the experts with a gate, and this combination is better than the individual experts and the gate alone. The MHMME results are comparable to an SVM trained on log-likelihoods. Moreover, MHMME+SVM surpasses all of the other algorithms.

6.6 Discussion and Future Work

The MHMME algorithm allows for competitive learning at the experts. If one expert makes a better prediction than the others, it receives that data point with a higher weight. On the other hand, an expert also learns to ignore the sequences that are handled better by its competitors. Due to the competition between the clusters, one expert may try to model all or most of the data. When this happens, the other experts act like trash collectors and collect whichever sequences do not fit well with the competing model.

In such cases, classification rates increase if the expert can find a pattern in this outlier data.

Although we only used a single-layer model, it is mathematically straightforward to extend it into a hierarchical model. In the future, it would be interesting to see whether or not another level of hierarchy would increase the classification rates. It is also possible to use variational learning to investigate the optimum number of clusters, as well as the number of states in each HMM.

CHAPTER 7
CONCLUSION

This research addresses the problems encountered when designing classifiers for classes that contain multiple subclasses whose characteristics are dependent on the context. It is sometimes the case that when the appropriate context is chosen, classification is relatively easy, whereas in the absence of contextual information, classification may be difficult. Therefore, in this study, simultaneous learning of context and classification has been addressed for static and sequential data.

The mixture of experts model has been applied to the landmine detection problem and has been shown to increase the classification rates of two other classifiers. The ME model provides a probabilistic approach for dividing the data into regions and for learning experts in each region. To prevent the ME from over-training, a variational ME classification (VMEC) model has been introduced, and a lower bound for VMEC training has been derived. Results have been demonstrated for classes that contain multiple subclasses. The VMEC learning has been shown to further increase the classification rates of ME on the landmine dataset while providing regularized parameters, and distributions for the parameters instead of point estimates.

Thirdly, a novel method, the mixture of hidden Markov model experts (MHMME), has been developed. The updates of the HMM parameters in an ME framework have been derived, and the benefits of ME have been extended to time-series data. The MHMME model allows for the simultaneous probabilistic learning of the sub-regions from multi-class sequential data and the discrimination of the classes in these sub-regions. The output is a mixture of the HMM decisions, but the mixture coefficient is not fixed once it is learned; rather, it depends on the input data. The MHMME model has been tested on a synthetic dataset, the landmine dataset, and the chicken pieces dataset. It has been shown to perform better than ME-only and HMM-only models, and to do well in comparison to state-of-the-art models.

REFERENCES

[1] C. Ratto, P. Torrione, K. Morton, and L. Collins, "Context-dependent landmine detection with ground-penetrating radar using a hidden Markov context model," in IEEE International Symposium on Geoscience and Remote Sensing (IGARSS), July 2010, pp. 4192.

[2] H. Frigui, L. Zhang, and P. Gader, "Context-dependent multisensor fusion and its application to landmine detection," IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 6, pp. 2528, June 2010.

[3] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Comput., vol. 3, no. 1, pp. 79, 1991.

[4] M. I. Jordan and L. Xu, "Convergence results for the EM approach to mixtures of experts architectures," Neural Networks, vol. 8, pp. 1409, 1995.

[5] G. Hinton, "Introduction to machine learning, decision trees and mixtures of experts," Lecture notes, 2007.

[6] C. M. Bishop and M. Svensen, "Bayesian hierarchical mixtures of experts," in Proceedings Nineteenth Conference on Uncertainty in Artificial Intelligence, 2003, pp. 57.

[7] N. Ueda and Z. Ghahramani, "Optimal model inference for Bayesian mixture of experts," in Proc. IEEE Workshop on Neural Networks for Signal Processing, 2000, vol. 1, pp. 145.

[8] S. Waterhouse, D. Mackay, and T. Robinson, "Bayesian methods for mixtures of experts," in Adv. Neur. Inf. Proc. Sys. (NIPS) 7, 1996, pp. 351.

[9] S. R. Waterhouse, Classification and Regression using Mixtures of Experts, Ph.D. dissertation, Department of Engineering, University of Cambridge, 1997.

[10] F. Peng, R. A. Jacobs, and M. A. Tanner, "Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition," J. Amer. Statist. Assoc., no. 91, pp. 953, 1996.

[11] K. Chen, D. Xie, and H. Chi, "A modified HME architecture for text-dependent speaker identification," IEEE Transactions on Neural Networks, vol. 7, pp. 1309, 1996.

[12] A. S. Weigend, M. Mangeas, and A. N. Srivastava, "Nonlinear gated experts for time series: Discovering regimes and avoiding overfitting," International Journal of Neural Systems, no. 6, pp. 373, 1995.

[13] A. S. Weigend and S. Shi, "Predicting daily probability distributions of S&P 500 returns," Journal of Forecasting, vol. 19, no. 4, pp. 375, 2000.

[14] A. Coelho, C. Lima, and F. Von Zuben, "Hybrid genetic training of gated mixtures of experts for nonlinear time series forecasting," in IEEE International Conference on Systems, Man and Cybernetics, 2003, vol. 5, pp. 4625.

[15] Y. Bengio and P. Frasconi, "Input output HMMs for sequence processing," IEEE Transactions on Neural Networks, vol. 7, pp. 1231, 1995.

[16] X. Wang, P. Whigham, D. Deng, and M. Purvis, "Time-line hidden Markov experts for time series prediction," Neural Information Processing - Letters and Reviews, vol. 3, no. 2, pp. 39, May 2004.

[17] "Hidden killers: The global landmine crisis," report 10575, United States Department of State, Sep. 1998.

[18] E. B. Fails, P. A. Torrione, W. R. Scott, Jr., and L. M. Collins, "Performance of a four parameter model for modeling landmine signatures in frequency domain wideband electromagnetic induction detection systems," 2007, vol. 6553, pp. 65530D1, SPIE.

[19] W. Scott, "Broadband array of electromagnetic induction sensors for detecting buried landmines," in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), July 2008, vol. 2, pp. II-II.

[20] H. Frigui, P. Gader, and D. Ho, "Real-time landmine detection with ground penetrating radar using discriminative and adaptive hidden Markov models," EURASIP Journal on Applied Signal Processing, vol. 1, pp. 1867, January 2005.

[21] G. Ramachandran, P. Gader, and J. Wilson, "Fast physics-based mine detection algorithms for wide-band electromagnetic induction sensors," in SPIE Defense, Security and Sensing, April 2009, pp. 7303.

[22] G. Ramachandran, P. Gader, and J. Wilson, "GRANMA: Gradient angle model algorithm on wideband EMI data for land-mine detection," IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 3, pp. 535, July 2010.

[23] K. Ho, L. Carin, P. Gader, and J. Wilson, "An investigation of using the spectral characteristics from ground penetrating radar for landmine/clutter discrimination," IEEE Transactions on Geoscience and Remote Sensing, vol. 46, no. 4-2, pp. 1177, April 2008.

[24] H. Frigui and P. Gader, "Detection and discrimination of landmines based on edge histogram descriptors and fuzzy k-nearest neighbors," in IEEE International Conference on Fuzzy Systems, 2006, pp. 1494.

[25] H. Frigui and P. Gader, "Detection and discrimination of landmines in ground-penetrating radar based on edge histogram descriptors and a possibilistic K-nearest neighbor classifier," IEEE Transactions on Fuzzy Systems, vol. 17, no. 1, pp. 185, Feb. 2009.

[26] J. V. G. Andreu and A. Crespo, "Selecting the toroidal self-organizing feature maps (TSOFM) best organized to object recognition," in International Conference on Neural Networks, 1997, vol. 2, pp. 1341.

[27] M. Bicego, V. Murino, and M. A. T. Figueiredo, "Similarity-based classification of sequences using hidden Markov models," Pattern Recogn., vol. 37, no. 12, pp. 2281, 2004.

[28] M. I. Jordan, "Hierarchical mixtures of experts and the EM algorithm," Neural Computation, vol. 6, pp. 181, 1994.

[29] E. Meeds and S. Osindero, "An alternative infinite mixture of Gaussian process experts," in Advances in Neural Information Processing Systems (NIPS), 2006, pp. 883.

[30] C. Yuan and C. Neubauer, "Variational mixture of Gaussian process experts," in Advances in Neural Information Processing Systems (NIPS) 21, 2009, pp. 1897.

[31] K.-A. L. Cao, E. Meugnier, and G. J. McLachlan, "Integrative mixture of experts to combine clinical factors and gene markers," Bioinformatics, vol. 26, no. 9, pp. 1192, 2010.

[32] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas, "Learning to reconstruct 3D human motion from Bayesian mixture of experts, a probabilistic discriminative approach," Tech. Rep. CSRG-502, University of Toronto, October 2004.

[33] C. Sminchisescu, A. Kanaujia, and D. Metaxas, "BM3E: Discriminative density propagation for visual tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 11, pp. 2030, Nov. 2007.

[34] J. Saragih, S. Lucey, and J. Cohn, "Deformable model fitting with a mixture of local experts," in 12th International Conference on Computer Vision (ICCV), Sept. 2009, pp. 2248.

[35] C. A. M. Lima, A. L. V. Coelho, and F. J. Von Zuben, "Hybridizing mixtures of experts with support vector machines: Investigation into nonlinear dynamic systems identification," Information Sciences, vol. 177, no. 10, pp. 2049, 2007.

[36] B. Yao, D. Walther, D. Beck, and L. Fei-Fei, "Hierarchical mixture of classification experts uncovers interactions between brain regions," in Advances in Neural Information Processing Systems (NIPS) 22, 2009, pp. 2178.

[37] H.-J. Xing and B.-G. Hu, "An adaptive fuzzy c-means clustering-based mixtures of experts model for unlabeled data classification," Neurocomput., vol. 71, pp. 1008, January 2008.

[38] L. Bo, C. Sminchisescu, A. Kanaujia, and D. Metaxas, "Fast algorithms for large scale conditional 3D prediction," in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2008, pp. 1.

[39] Z. Lu, "A regularized minimum cross-entropy algorithm on mixtures of experts for time series prediction and curve detection," Pattern Recognition Letters, vol. 27, no. 9, pp. 947, 2006.

[40] C. E. Rasmussen and Z. Ghahramani, "Infinite mixtures of Gaussian process experts," in Advances in Neural Information Processing Systems (NIPS) 14, 2002, pp. 881.

[41] Y. Li, "Hidden Markov models with states depending on observations," Pattern Recogn. Lett., vol. 26, no. 7, pp. 977, 2005.

[42] R. Ebrahimpour, M. R. Moradian, A. Esmkhani, and F. M. Jafarlou, "Recognition of Persian handwritten digits using characterization loci and mixture of experts," Journal of Digital Content Technology and its Applications, vol. 3, no. 3, pp. 42, 2009.

[43] R. Ebrahimpour, E. Kabir, H. Esteky, and M. R. Yousefi, "View-independent face recognition with mixture of experts," Neurocomputing, vol. 71, no. 4, pp. 1103, 2008.

[44] Y. Qi, J. Klein-Seetharaman, and Z. Bar-Joseph, "A mixture of feature experts approach for protein-protein interaction prediction," BMC Bioinformatics, vol. 8, no. S10, pp. S6, 2007.

[45] S. E. Yuksel, G. Ramachandran, P. Gader, J. Wilson, D. Ho, and G. Heo, "Hierarchical methods for landmine detection with wideband electro-magnetic induction and ground penetrating radar multi-sensor systems," in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), July 2008, vol. 2, pp. II-II.

[46] S. E. Yuksel and P. Gader, "Variational mixture of experts for classification with applications to landmine detection," in International Conference on Pattern Recognition (ICPR), 2010, pp. 2981.

[47] M. S. Yumlu, F. S. Gurgen, and N. Okay, "Financial time series prediction using mixture of experts," in 18th International Symposium on Computer and Information Sciences (ISCIS), 2003, vol. 2869, pp. 553.

[48] M. Versace, R. Bhatt, O. Hinds, and M. Shiffer, "Predicting the exchange traded fund DIA with a combination of genetic algorithms and neural networks," Expert Systems with Applications, vol. 27, no. 3, pp. 417, 2004.

[49] I. Guler, E. Derya Ubeyli, and N. Fatma Guler, "A mixture of experts network structure for EEG signals classification," in 27th Annual International Conference of the Engineering in Medicine and Biology Society (IEEE-EMBS), 2005, pp. 2707.

[50] Y. Jiang and P. Guo, "Mixture of experts for stellar data classification," in International Symposium on Neural Networks (ISNN), 2005, vol. 2, pp. 310.

[51] S. Mossavat, O. Amft, B. de Vries, P. Petkov, and W. Kleijn, "A Bayesian hierarchical mixture of experts approach to estimate speech quality," in Second International Workshop on Quality of Multimedia Experience (QoMEX), Jun. 2010, pp. 200.

[52] H. Harb, L. Chen, and J.-Y. Auloge, "Mixture of experts for audio classification: an application to male-female classification and musical genre recognition," in IEEE International Conference on Multimedia and Expo (ICME), Jun. 2004, vol. 2, pp. 1351.

[53] F. Michel and N. Paragios, "Image transport regression using mixture of experts and discrete Markov random fields," in IEEE International Symposium on Biomedical Imaging: From Nano to Macro (ISBI), Apr. 2010, pp. 1229.

[54] L. Cao, "Support vector machines experts for time series forecasting," Neurocomputing, pp. 321, 2003.

[55] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, pp. 1, 2008.

[56] B. Tang, M. Heywood, and M. Shepherd, "Input partitioning to mixture of experts," in Proc. International Joint Conference on Neural Networks (IJCNN), 2002, vol. 1, pp. 227.

[57] A. A. Montillo, "Random forests," Lecture in Statistical Foundations of Data Analysis, April 2009.

[58] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer-Verlag, 2001.

[59] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd edition, Prentice Hall, Upper Saddle River, NJ, 1999.

[60] T.-K. Kim, "Boosting and random forest for visual recognition," International Conference on Computer Vision (ICCV) Tutorial, 2009.

[61] R. Avnimelech and N. Intrator, "Boosted mixture of experts: an ensemble learning scheme," Neural Comput., vol. 11, no. 2, pp. 483, 1999.

[62] R. A. Jacobs, "Bias/variance analyses of mixtures-of-experts architectures," Neural Computation, vol. 9, pp. 369, 1997.

[63] A. Zeevi, R. Meir, and V. Maiorov, "Error bounds for functional approximation and estimation using mixtures of experts," IEEE Transactions on Information Theory, vol. 44, no. 3, pp. 1010, May 1998.

[64] W. Jiang and M. Tanner, "On the approximation rate of hierarchical mixtures-of-experts for generalized linear models," Neural Comput., vol. 11, no. 5, pp. 1183, 1999.

[65] W. Jiang, "The VC dimension for mixtures of binary classifiers," Neural Computation, vol. 12, no. 6, pp. 1293, June 2000.

[66] W. Jiang and M. Tanner, "On the asymptotic normality of hierarchical mixtures-of-experts for generalized linear models," IEEE Transactions on Information Theory, vol. 46, no. 3, pp. 1005, May 2000.

[67] W. Jiang and M. A. Tanner, "On the identifiability of mixtures-of-experts," Neural Networks, vol. 12, pp. 1253, 1999.

[68] W. Jiang and M. A. Tanner, "Hierarchical mixtures-of-experts for exponential family regression models: Approximation and maximum likelihood estimation," Annals of Statistics, vol. 27, pp. 987, 1999.

[69] A. Carvalho and M. Tanner, "Mixtures-of-experts of autoregressive time series: asymptotic normality and model specification," IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 39, Jan. 2005.

[70] Y. Ge and W. Jiang, "A note on mixtures of experts for multiclass responses: approximation rate and consistent Bayesian inference," in Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006, pp. 329.

[71] Y. Ge and W. Jiang, "On consistency of Bayesian inference with mixtures of logistic regression," Neural Comput., vol. 18, pp. 224, January 2006.

[72] L. Xu, M. I. Jordan, and G. E. Hinton, "An alternative model for mixtures of experts," in Advances in Neural Information Processing Systems (NIPS) 7, MIT Press, 1995, pp. 633.

[73] S. Waterhouse and A. Robinson, "Classification using hierarchical mixtures of experts," in Proc. IEEE Workshop on Neural Networks for Signal Processing IV, 1994, pp. 177.

[74] K. Chen, D. Xie, and H. Chi, "Speaker identification based on the time-delay hierarchical mixture of experts," in Proc. IEEE International Conference on Neural Networks, Nov./Dec. 1995, vol. 4, pp. 2062.

[75] V. Ramamurti and J. Ghosh, "Advances in using hierarchical mixture of experts for signal classification," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1996, pp. 3569.

[76] K. Chen, L. Xu, and H. Chi, "Improved learning algorithms for mixture of experts in multiclass classification," Neural Netw., vol. 12, pp. 1229, November 1999.

[77] S.-K. Ng and G. McLachlan, "Using the EM algorithm to train neural networks: misconceptions and a new algorithm for multiclass classification," IEEE Transactions on Neural Networks, vol. 15, no. 3, pp. 738, May 2004.

[78] S. Ng and G. McLachlan, "Extension of mixture-of-experts networks for binary classification of hierarchical data," Artificial Intelligence in Medicine, vol. 41, no. 1, pp. 57, Sept. 2007.

[79] M. K. Titsias and A. Likas, "Mixture of experts classification using a hierarchical mixture model," Neural Computation, vol. 14, pp. 2221, 2002.

[80] A. Zeevi, R. Meir, and R. Adler, "Time series prediction using mixtures of experts," in Advances in Neural Information Processing Systems (NIPS) 9, 1997, pp. 309.

[81] C. Wong and W. Li, "On a logistic mixture autoregressive model," Biometrika, vol. 88, no. 3, pp. 833, 2001.

[82] K. Chen, D. Xie, and H. Chi, "Combine multiple time-delay HMEs for speaker identification," in Proc. IEEE International Conference on Neural Networks, Jun. 1996, vol. 4, pp. 2015.

[83] T. W. Cacciatore and S. J. Nowlan, "Mixtures of controllers for jump linear and non-linear plants," in Advances in Neural Information Processing Systems (NIPS) 6, 1993, pp. 719.

[84] S. Liehr, K. Pawelzik, J. Kohlmorgen, S. Lemm, and K.-R. Müller, "Hidden Markov mixtures of experts for prediction of non-stationary dynamics," in NNSP'99 Workshop on Neural Networks for Signal Processing, 1999, pp. 195.

[85] T. G. Dietterich, "Machine learning for sequential data: A review," in Proceedings of the Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition, 2002, pp. 15.

[86] Y. Zhao, R. Schwartz, J. Sroka, and J. Makhoul, "Hierarchical mixtures of experts methodology applied to continuous speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1995, vol. 5, pp. 3443.

[87] M. Meila and M. I. Jordan, "Learning fine motion by Markov mixtures of experts," in Advances in Neural Information Processing Systems (NIPS) 8, MIT Press, 1996, pp. 1003.

[88] J. Fritsch, M. Finke, and A. Waibel, "Context dependent hybrid HME-HMM speech recognition using polyphone clustering decision trees," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1997, pp. 1759.

[89] J. Fritsch and M. Finke, "Improving performance on Switchboard by combining hybrid HME/HMM and mixture of Gaussians acoustic models," in Proceedings of Eurospeech, 1997, pp. 1963.

[90] M. I. Jordan, Z. Ghahramani, and L. K. Saul, "Hidden Markov decision trees," in Proc. Conf. Advances in Neural Information Processing Systems (NIPS), 1996, pp. 501.

[91] Z. Ghahramani and M. I. Jordan, "Factorial hidden Markov models," Mach. Learn., vol. 29, pp. 245, November 1997.

[92] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Wadsworth, 1984.

[93] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, pp. 197, 1990.

[94] L. Mason, J. Baxter, P. Bartlett, and M. Frean, "Boosting algorithms as gradient descent," in Advances in Neural Information Processing Systems (NIPS) 12, 2000, pp. 512.

[95] S. Waterhouse and G. Cook, "Ensemble methods for phoneme classification," in Advances in Neural Information Processing Systems (NIPS) 9, Cambridge, MA: MIT Press, 1997.

[96] J. H. Friedman, "Multivariate adaptive regression splines," Annals of Statistics, vol. 19, no. 1, pp. 1, 1991.

[97] S. J. Nowlan and G. E. Hinton, "Evaluation of adaptive mixtures of competing experts," in Advances in Neural Information Processing Systems (NIPS) 3, 1991.

[98] M. Jordan and R. Jacobs, "Hierarchical mixtures of experts and the EM algorithm," in Proceedings of 1993 International Joint Conference on Neural Networks (IJCNN), 1993, vol. 2, pp. 1339.

[99] P. Estevez and R. Nakano, "Hierarchical mixture of experts and max-min propagation neural networks," in Proc. IEEE International Conference on Neural Networks, Nov./Dec. 1995, vol. 1, pp. 651.

[100] D. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge Univ. Press, 2003.

[101] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[102] D. M. Titterington, A. F. M. Smith, and U. E. Makov, Statistical Analysis of Finite Mixture Distributions, John Wiley, New York, 1985.

[103] R. A. Jacobs, F. Peng, and M. A. Tanner, "A Bayesian approach to model selection in hierarchical mixtures-of-experts architectures," Neural Netw., vol. 10, no. 2, pp. 231, 1997.

[104] L. R. Rabiner and B. H. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine, pp. 4, January 1986.

[105] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," in Proceedings of the IEEE, 1989, pp. 257.

[106] Y. Zhao, P. Gader, P. Chen, and Y. Zhang, "Training DHMMs of mine and clutter to minimize landmine detection errors," IEEE Trans. Geosciences and Remote Sensing, vol. 41, no. 5, pp. 1016, May 2003.

[107] P. Gader, M. Mystkowski, and Y. Zhao, "Landmine detection with ground penetrating radar using hidden Markov models," IEEE Trans. Geosciences and Remote Sensing, vol. 39, no. 6, pp. 1231, 2001.

[108] X. Zhang, S. E. Yuksel, P. Gader, and J. Wilson, "Simultaneous feature and HMM model learning for landmine detection using ground penetrating radar," in 6th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS), Aug. 2010, pp. 1.

[109] M. Mohamed and P. Gader, "Generalized hidden Markov models. II. Application to handwritten word recognition," IEEE Transactions on Fuzzy Systems, vol. 8, no. 1, pp. 82, Feb. 2000.

[110] R. Brooks, J. Schwier, and C. Griffin, "Behavior detection using confidence intervals of hidden Markov models," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 6, pp. 1484, Dec. 2009.

[111] Y. Nankaku and K. Tokuda, "Face recognition using hidden Markov eigenface models," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007, vol. 2, pp. II-II.

[112] G. Singh and H. Song, "Using hidden Markov models in vehicular crash detection," IEEE Transactions on Vehicular Technology, vol. 58, no. 3, pp. 1119, March 2009.

[113] S. Eddy, "Profile hidden Markov models," Bioinformatics, vol. 14, no. 9, pp. 755, 1998.

[114] B.-H. Juang and L. Rabiner, "The segmental k-means algorithm for estimating parameters of hidden Markov models," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, no. 9, pp. 1639, Sept. 1990.

[115] B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Transactions on Signal Processing, vol. 40, no. 12, pp. 3043, Dec. 1992.

[116] G. D. Forney, "The Viterbi algorithm," Proceedings of the IEEE, vol. 61, no. 3, pp. 268, 1973.

[117] D. J. C. MacKay, "Ensemble learning for hidden Markov models," 1997.

[118] S. Ji, B. Krishnapuram, and L. Carin, "Variational Bayes for continuous hidden Markov models and its application to active learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 522, 2006.

[119] C. A. McGrory and M. Titterington, "Variational Bayesian analysis for hidden Markov models," Australian and New Zealand Journal of Statistics.

[120] S. Kapadia, Discriminative Training of Hidden Markov Models, Ph.D. dissertation, University of Cambridge, March 1998.

[121] H. Kuo, E. Fosler-Lussier, H. Jiang, and C. Lee, "Discriminative training of language models for speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002.

[122] J. A. Bilmes, "A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," Technical report, University of California, Berkeley, 1997.

[123] B.-H. Juang, W. Hou, and C.-H. Lee, "Minimum classification error rate methods for speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 3, pp. 257, May 1997.

[124] L. R. Rabiner, C. H. Lee, B. H. Juang, and J. G. Wilpon, "HMM clustering for connected word recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1989, pp. 405.

[125] J. Alon, S. Sclaroff, G. Kollios, and V. Pavlovic, "Discovering clusters in motion time-series data," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003, pp. 375.

[126] E. Swears, A. Hoogs, and A. Perera, "Learning motion patterns in surveillance video using HMM clustering," in IEEE Workshop on Motion and Video Computing, 2008, pp. 1.

[127] C. Goutte, P. Toft, E. Rostrup, F. A. Nielsen, and L. K. Hansen, "On clustering fMRI time series," NeuroImage, vol. 9, no. 3, pp. 298, 1999.

[128] R. H. Shumway, "Time-frequency clustering and discriminant analysis," Statistics and Probability Letters, vol. 63, no. 3, pp. 307, 2003.

[129] A. W. Black and P. Taylor, "Automatically clustering similar units for unit selection in speech synthesis," in Proc. of Eurospeech, 1997, pp. 601.

[130] K. Yu and M. Gales, "Discriminative cluster adaptive training," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1694, Sept. 2006.

[131] M. Gales, "Cluster adaptive training of hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 417, 2000.

[132] M. Naito, L. Deng, and Y. Sagisaka, "Speaker clustering for speech recognition using the parameters characterizing vocal-tract dimensions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1998, vol. 2, pp. 981.

[133] G. Mayraz and G. E. Hinton, "Recognizing handwritten digits using hierarchical products of experts," IEEE Trans. Pattern Anal. Mach. Intell. (PAMI), vol. 24, no. 2, pp. 189, 2002.

[134] M. Bicego and V. Murino, "Investigating hidden Markov models' capabilities in 2D shape classification," IEEE Trans. Pattern Anal. Mach. Intell. (PAMI), vol. 26, no. 2, pp. 281, 2004.

[135] T. W. Liao, "Clustering of time series data - a survey," Pattern Recognition, vol. 38, no. 11, pp. 1857, 2005.

[136] E. Keogh and S. Kasetty, "On the need for time series data mining benchmarks: A survey and empirical demonstration," Data Mining and Knowledge Discovery, vol. 7, pp. 349, 2003.

[137] C. Li and G. Biswas, "A Bayesian approach to temporal data clustering using hidden Markov models," in Proceedings of the Seventeenth International Conference on Machine Learning (ICML), 2000, pp. 543.

[138] Z. Szamonek and C. Szepesvari, "X-mHMM: an efficient algorithm for training mixtures of HMMs when the number of mixtures is unknown," in IEEE International Conference on Data Mining, 2005, pp. 27.

[139] P. Smyth, "Clustering sequences with hidden Markov models," in Advances in Neural Information Processing Systems (NIPS), 1997, pp. 648.

[140] F. Korkmazskiy, B.-H. Juang, and F. Soong, "Generalized mixture of HMMs for continuous speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1997, vol. 2, pp. 1443.

[141] T. Oates, L. Firoiu, and P. R. Cohen, "Clustering time series with hidden Markov models and dynamic time warping," in Proceedings of the IJCAI Workshop on Neural, Symbolic and Reinforcement Learning Methods for Sequence Learning, 1999.

[142] T. Jebara, Y. Song, and K. Thadani, "Spectral clustering and embedding with hidden Markov models," in Proceedings of the 18th European Conference on Machine Learning (ECML), September 2007.

[143] A. C. Tsoi, S. Zhang, and M. Hagenbuchner, "Hierarchical hidden Markov models: An application to health insurance data," in Data Mining, Lecture Notes in Computer Science, vol. 3755, 2006, pp. 244.

[144] J. Yin and Q. Yang, "Integrating hidden Markov models and spectral analysis for sensory time series clustering," in IEEE International Conference on Data Mining, 2005, pp. 27.

[145] F. Porikli, "Clustering variable length sequences by eigenvector decomposition using HMM," in International Workshop on Structural and Syntactic Pattern Recognition, Lecture Notes in Computer Science, Springer-Verlag, 2004, p. 352.

[146] M. Bicego, V. Murino, and M. A. Figueiredo, "Similarity-based clustering of sequences using hidden Markov models," in Machine Learning and Data Mining in Pattern Recognition, vol. LNAI 2734, 2003, pp. 86.

[147] M. Bicego, E. Pekalska, D. M. J. Tax, and R. P. W. Duin, "Component-based discriminative classification for hidden Markov models," Pattern Recogn., vol. 42, no. 11, pp. 2637, 2009.

[148] D. Garcia-Garcia, E. Hernandez, and F. Diaz de Maria, "A new distance measure for model-based sequence clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 7, pp. 1325, July 2009.

[149] A. D. Brown and G. E. Hinton, "Relative density nets: A new way to combine backpropagation with HMM's," in Advances in Neural Information Processing Systems (NIPS), 2001, pp. 1149.

[150] G. Taylor and G. Hinton, "Products of hidden Markov models: It takes N > 1 to tango," in Proc. of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 2009.

[151] S. E. Levinson, L. R. Rabiner, and M. Sondhi, "An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition," Bell Syst. Tech. J., vol. 62, no. 4, pp. 1035, April 1983.

[152] R. B. Lyngsø, C. N. S. Pedersen, and H. Nielsen, "Metrics and similarity measures for hidden Markov models," in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB), 1999, pp. 178.

[153] T. Kosaka and S. Sagayama, "Tree-structured speaker clustering for fast speaker adaptation," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1994, vol. 1, pp. 245.

[154] B. H. Juang and L. R. Rabiner, "A probabilistic distance measure for hidden Markov models," AT&T Technical Journal, vol. 64, no. 2, pp. 391, 1985.

[155] L. Xie, V. Ugrinovskii, and I. Petersen, "Probabilistic distances between finite-state finite-alphabet hidden Markov models," IEEE Transactions on Automatic Control, vol. 50, no. 4, pp. 505, April 2005.

[156] M. Vihola, M. Harju, and P. Salmela, "Two dissimilarity measures for HMMs and their application in phoneme model clustering," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002, vol. 1, pp. 933.

[157] M. Falkhausen, H. Reininger, and D. Wolf, "Calculation of distance measures between hidden Markov models," in Proc. of Eurospeech, 1995, pp. 1487.

[158] M. Do, "Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models," IEEE Signal Processing Letters, vol. 10, no. 4, pp. 115, Apr. 2003.

[159] Z. Rached, F. Alajaji, and L. Campbell, "The Kullback-Leibler divergence rate between Markov sources," IEEE Transactions on Information Theory, vol. 50, no. 5, pp. 917, May 2004.

[160] M. Mohammad and W. Tranter, "A novel divergence measure for hidden Markov models," in Proceedings of the IEEE SoutheastCon, April 2005, pp. 240.

[161] M. Vidyasagar, "Stochastic modelling over a finite alphabet and algorithms for finding genes from genomes," in Lecture Notes in Control and Information Sciences, Conference on Control of Uncertain Systems, 2006, vol. 329, pp. 345.

[162] J.-Y. Chen, P. A. Olsen, and J. R. Hershey, "Word confusability: measuring hidden Markov model similarity," in Interspeech, 2007, pp. 2089.

[163] L. Chen and H. Man, "Fast schemes for computing similarities between Gaussian HMMs and their applications in texture image classification," EURASIP J. Appl. Signal Process., pp. 1984, 2005.

[164] J. Zeng, J. Duan, and C. Wu, "A new distance measure for hidden Markov models," Expert Systems with Applications, vol. 37, no. 2, pp. 1550, 2010.

[165] M. Mohammad and W. Tranter, "Comparing distance measures for hidden Markov models," in Proceedings of the IEEE SoutheastCon, April 2006, pp. 256.

[166] A. Weigend and N. Gershenfeld, Eds., Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley, MA, 1994.

[167] O. Missaoui, H. Frigui, and P. Gader, "Land-mine detection with ground-penetrating radar using multistream discrete hidden Markov models," IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 6, pp. 2080, June 2011.

[168] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, USA, 2001.

BIOGRAPHICAL SKETCH

Seniha Esen Yuksel has been pursuing her Doctor of Philosophy degree in the Department of Computer and Information Science and Engineering at the University of Florida (UF) since 2006. Previously, she received her Master of Science degree from the Department of Electrical and Computer Engineering at the University of Louisville, Kentucky, in 2005, and her Bachelor of Science degree from the Department of Electrical and Electronics Engineering at the Middle East Technical University, Ankara, Turkey, in 2003. She was the recipient of the UF College of Engineering Outstanding International Student Award in 2010, the winner of the National Science Foundation Writing Contest in 2010, and the recipient of the Phyllis M. Meek Spirit of Susan B. Anthony Award at UF in 2008. She also received travel awards from the Association for Computing Machinery (ACM), the Computing Research Association (CRA), and Google. Her research interests include machine learning, pattern recognition, and computer vision.