
Novel Mixture Models to Learn Complex and Evolving Patterns In High Dimensional Data

Permanent Link: http://ufdc.ufl.edu/UFE0041009/00001

Material Information

Title: Novel Mixture Models to Learn Complex and Evolving Patterns In High Dimensional Data
Physical Description: 1 online resource (120 p.)
Language: english
Creator: Somaiya, Manas
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: bayesian, carlo, chain, data, evolving, expectation, gibbs, learning, machine, markov, maximization, mining, mixture, model, models, monte, patterns, sampling, statistical
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy. NOVEL MIXTURE MODELS TO LEARN COMPLEX AND EVOLVING PATTERNS IN HIGH-DIMENSIONAL DATA. By Manas H. Somaiya. December 2009. Chair: Sanjay Ranka. Cochair: Christopher Jermaine. Major: Computer Engineering. In statistics, a probability mixture model is a probability distribution that is a convex combination of other probability distributions. Mixture models have been used by mathematicians and statisticians to model observed data since as early as 1894. However, significant advances in the fitting of finite mixture models via Maximum Likelihood Estimation (MLE) have come only in the last 30 years, specifically because of the development of the Expectation Maximization (EM) algorithm. In the last decade, because of the arrival of fast computers and recent developments in Markov chain Monte Carlo (MCMC) methods, interest in Bayesian inference of mixture models has grown considerably. While the classical mixture model and its variants remain excellent tools for developing generative models of data, we can learn more informative models under certain real-life data generation scenarios by making a few subtle yet fundamental changes to the classical mixture model. To generate a data point, the classical mixture model selects one of the generative components by performing a multinomial trial over the mixing proportions, and then manifests the various data attributes based on the selected component. Thus, for any given data point, only a single component is a possible generator. However, there are many real-life situations where it makes far more sense to model a data point as being generated using multiple components. We propose two such novel mixture modeling frameworks that allow multiple components to influence data generation, along with associated learning algorithms. Furthermore, both the mixing proportions and the generating components in the classical mixture model are fixed and do not vary with time. However, there are many data sets where the time associated with a data point is very important information, and needs to be incorporated in the generative model. To introduce these temporal elements, we propose a new class of mixture models that allow the mixing proportions and the mixture components to evolve in a piece-wise linear fashion.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Manas Somaiya.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Ranka, Sanjay.
Local: Co-adviser: Jermaine, Christopher.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2010-12-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0041009:00001





Full Text

ACKNOWLEDGMENTS

I would like to express my gratitude to my advisors Dr. Sanjay Ranka and Dr. Chris Jermaine for their excellent guidance and mentoring, and for their encouragement and support during my pursuit of the doctorate. I would also like to thank Dr. Alin Dobra for both agreeing to serve on my committee, and for being available to discuss new ideas related to my work and general technological advancements in the field of Computer Science and Engineering. I would like to thank Dr. Sartaj Sahni and Dr. Ravindra Ahuja for being on my committee and for guidance and support. This endeavor would not be complete without the support of my family and friends. I would like to express my sincere thanks to them for sticking with me through thick and thin.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Mixture Models
  1.2 Motivation
  1.3 Our Approach
  1.4 Contributions

2 BRIEF SURVEY OF RELATED WORK
  2.1 Visualization Based Approaches
  2.2 Information Theoretic Co-clustering
  2.3 Subspace Clustering
  2.4 Other Approaches
  2.5 Temporal Models

3 LEARNING CORRELATIONS USING MIXTURE-OF-SUBSETS MODEL
  3.1 Introduction
  3.2 The MOS Model
    3.2.1 Preliminaries
    3.2.2 Formal Model and PDF
    3.2.3 Example Data Generation under the MOS Model
    3.2.4 Example Evaluation of the MOS PDF
  3.3 Learning the Model via Expectation Maximization
    3.3.1 Preliminaries
    3.3.2 The E-Step
    3.3.3 The M-Step
    3.3.4 Computing the Parameter Masks
  3.4 Example - Bernoulli Data
    3.4.1 MOS Model
    3.4.2 Expectation Maximization
  3.5 Example - Normal Data
    3.5.1 MOS Model
    3.5.2 Expectation Maximization
  3.6 Experimental Evaluation
    3.6.1 Synthetic Data
    3.6.2 Bernoulli Data - Stocks Data
    3.6.3 Normal Data - California Stream Flow
  3.7 Related Work
  3.8 Conclusions and Future Work
  3.9 Our Contributions

4 MIXTURE MODELS TO LEARN COMPLEX PATTERNS IN HIGH-DIMENSIONAL DATA
  4.1 Introduction
  4.2 Model
    4.2.1 Generative Process
    4.2.2 Bayesian Framework
  4.3 Learning the Model
    4.3.1 Conditional Distributions
    4.3.2 Speeding Up the Mask Value Updates
  4.4 Experiments
    4.4.1 Synthetic Dataset
    4.4.2 NIPS Papers Dataset
  4.5 Related Work
  4.6 Conclusions
  4.7 Our Contributions

5 MIXTURE MODELS WITH EVOLVING PATTERNS
  5.1 Our Approach
  5.2 Formal Definition of the Model
  5.3 Learning the Model
  5.4 Experiments
    5.4.1 Synthetic Datasets
    5.4.2 Streamflow Dataset
    5.4.3 E. coli Dataset
  5.5 Related Work
  5.6 Conclusions
  5.7 Our Contributions

APPENDIX

A STRATIFIED SAMPLING FOR THE E-STEP
B SPEEDING UP THE MASK VALUE UPDATES

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Parameter values θij for the PDFs associated with the random variables Nj
3-2 Appearance probabilities αi for each component Ci
3-3 Example of market basket data
3-4 Comparison of the execution time (100 iterations) of our EM learning algorithms for the synthetic datasets
3-5 Number of days for which the p-values fall in the top 1% of all p-values for the Southern California High Flow Component
3-6 Number of days for which the p-values fall in the top 1% of all p-values for the North Central California High Flow Component
3-7 Number of days for which the p-values fall in the top 1% of all p-values for the Low Flow Component
4-1 The four generating components for the synthetic dataset. The generator for each attribute is expressed as a triplet of parameter values (Mean, Standard deviation, Weight)
4-2 Parameter values learned from the dataset after 1000 Gibbs iterations. We have computed the average over the last 100 iterations. Each attribute is expressed as a triplet of parameter values (Mean, Standard deviation, Weight). All values have been rounded off to their respective precisions.
4-3 Appearance probabilities of the clusters learned from the NIPS dataset
B-1 Details of the datasets used for qualitative testing of the beta approximation
B-2 Quantitative testing of the beta approximation

LIST OF FIGURES

3-1 Outline of our EM algorithm
3-2 Generating components for the 16-attribute dataset. A pixel indicates the probability value of the Bernoulli random variable associated with an attribute. A white pixel (a masked attribute) indicates 0 and a black pixel (unmasked attribute) indicates 1.
3-3 Example data points from the 16-attribute dataset. For example, the leftmost data point was generated by the leftmost and the rightmost components from Figure 3-2.
3-4 Components learned using Monte Carlo EM with stratified sampling after 100 iterations. A pixel indicates the probability value of the Bernoulli random variable associated with an attribute. White pixels are masked attributes. Darker pixels indicate unmasked attributes with higher probability values.
3-5 Generating components for the 36-attribute dataset
3-6 Components learned from the 36-attribute dataset using Monte Carlo EM with stratified sampling after 100 iterations.
3-7 Stock components learned by a 20-component MOS model. Along the columns are the 40 chosen stocks grouped by the type of stock; and along the rows are the components learned by the model. Each cell in the figure indicates the probability value of the Bernoulli random variable in greyscale, with white being 0 and black being 1.
3-8 Components learned by a 20-component MOS model. Only the sites with non-zero parameter masks are shown. The diameter of the circle at a site is proportional to the square root of the ratio of the mean parameter µij to the mean flow µj for that site, on a log scale.
3-9 Some of the components learned by a 20-component standard Gaussian Mixture Model. The diameter of the circle at a site is proportional to the square root of the ratio of the mean parameter µij to the mean flow µj for that site, on a log scale.
4-1 The generative model. A circle denotes a random variable in the model
4-2 Clusters learned from the NIPS papers dataset. For each cluster, we report the word and its associated Bernoulli probability
4-3 More clusters learned from the NIPS papers dataset. For each cluster, we report the word and its associated Bernoulli probability
5-1 Evolving model parameters learned from the synthetic dataset
5-2 Flow components learned from the streamflow dataset
5-3 Change in prevalence of the flow components shown in Figure 5-2 with time
5-4 Evolving model parameters learned from the E. coli dataset
A-1 The structure of computation for the Q function
A-2 A simplified structure of computation for the Q function
A-3 Computing an estimate for c1,i
B-1 Comparison of the PDFs for the conditional distribution of the weight parameter with its beta approximation for 4 datasets. Each chart is normalized for easy comparison and has been zoomed in to the region where the mass of the PDFs is concentrated. Details about the datasets can be found in Tables B-1 and B-2


CHAPTER 1
INTRODUCTION

1.1 Mixture Models

Since Karl Pearson [1] in 1894 used a mixture of two univariate normal probability density functions to fit the dataset containing measurements on the ratio of forehead to body length of 1000 crabs sampled from the Bay of Naples, mixture models have been used by mathematicians and statisticians to model observed data. However, significant advances have been made in the fitting of finite mixture models via the method of Maximum Likelihood Estimation (MLE) only in the last 30 years, specifically because of the development of the Expectation Maximization (EM) algorithm by Dempster et al. [2] in 1977. In the last decade, because of the arrival of fast computers and recent developments in Markov chain Monte Carlo (MCMC) methods, a lot of interest has been observed in Bayesian inference of mixture models. For a detailed discussion of mixture models, we refer the reader to McLachlan and Basford [3], and McLachlan and Peel [4].

… sports fan, doctor, etc. This allows each data point to be modeled with high precision, and yet still allows for learning very general roles such as father and sports fan that are important, and yet cannot describe any data point completely. In order to allow multiple components in a mixture model to simultaneously influence the generation of a data point, we need to design a mathematical framework that not only allows multiple components to be selected simultaneously, and provides a clean way for these components to interact in order to generate various data attributes, but is also amenable to machine learning and statistical methods that would allow us to learn such models given suitable data sets.

Furthermore, both the mixing proportions and the generating components in the classical mixture model are fixed and do not vary with time. However, there are many data sets where the time associated with a data point is very important information, and needs to be incorporated in the generative model. For example, a hospital may have a data set consisting of antibiotic resistance measurements of E. coli bacteria collected from its patients over a period of time. An epidemiologist, a scientist who traces the spread of diseases through a population, would be interested in learning both the key strains of E. coli bacteria and the change in their prevalence over this period of time, using this data set. Similarly, a statistician analyzing trends in news stories would be interested in mining topics (and their associated features, i.e. words) that evolve over time. In the next section, we outline our approach to addressing these novel mixture models.

1.3 Our Approach

In Chapter 3, we propose a new probabilistic framework for modeling correlations in high dimensional data, called the MOS model. The key ideas behind the MOS model are that it allows an entity to be modeled as being generated by multiple components rather than one component alone, and that each of the components in the MOS model can only influence a subset of the data attributes. The former idea is implemented by switching from the multinomial distribution to a multidimensional Bernoulli distribution for the mixing proportions, while the latter is achieved by introducing binary mask variables for each attribute-component pair. The model allows for user-given constraints on these mask variables, and we show a very simple optimization scheme that can handle multiple constraint scenarios. We formulate the inference of the MOS model as a Maximum Likelihood Estimation (MLE) problem, and develop an Expectation Maximization (EM) algorithm for learning models under the MOS framework. Computing the E-step of our EM algorithm is intractable, due to the fact that any subset of components could have produced each data point. Thus, we also propose a unique Monte Carlo algorithm that makes use of stratified sampling to accurately approximate the E-step, as outlined in Appendix A. However, there are two potential drawbacks of this approach. The first drawback is the general criticism of EM and MLE that the resulting point estimate does not give the user a good idea of the accuracy of the learned model. The second drawback of our proposed approach is the intractability of the E-step of our algorithm, which is the reason that we make use of Monte Carlo methods to estimate the E-step. To address these concerns we redefine the model in a Bayesian framework, as outlined in Chapter 4. We also drop the binary parameter masks in favor of a real-valued parameter weight that indicates the strength of the influence of a particular component over a data attribute, rather than simply whether it chooses to influence it or not. This subtle but fundamental change allows us to drop the user-given optimization scheme and makes the model

… Appendix B to speed up this computation manyfold. In Chapter 5, we propose a new class of mixture models that takes temporal information into account in the data generation process. We allow the mixing proportions to vary with time, and adopt a piece-wise linear strategy for trends to keep the model simple yet informative. The value of a model parameter in any of the segments is simply an interpolation based on the value at the start of the segment and the value at the end of the segment. This simple strategy works very well for many parameterized probability density functions. We set this model up in a Bayesian framework, and derive a Gibbs Sampling algorithm (an MCMC technique) for learning this class of models. All of our models are truly data-type agnostic. It is easily possible to handle any data type for which a reasonable probabilistic model can be formulated: a Bernoulli model for binary data, a multinomial model for categorical data, a normal model for numerical data, a Gamma model for non-negative numerical data, a probabilistic graphical model for hierarchical data, and so on. Furthermore, all the models trivially permit mixtures of different data types within each data record, without transforming the data into a single representation (such as treating binary data as numerical data that happens to have 0-1 values). For each of the three models, we have shown their usefulness in learning underlying patterns using both synthetic and real-life datasets. We summarize our contributions in the next section, and review related research work in the next chapter.
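To make the piece-wise linear strategy described above concrete, the following minimal Python sketch interpolates a single evolving parameter within its segment. The function names, segment layout, and numbers are our own illustration, not the dissertation's notation:

    # Hypothetical sketch of piece-wise linear parameter evolution.
    def interpolate(start_value, end_value, t, t_start, t_end):
        """Linearly interpolate a model parameter inside one segment."""
        frac = (t - t_start) / float(t_end - t_start)
        return start_value + frac * (end_value - start_value)

    def evolving_parameter(segments, t):
        """segments: list of (t_start, t_end, value_at_start, value_at_end).
        Returns the parameter value at time t under the piece-wise linear model."""
        for t_start, t_end, v0, v1 in segments:
            if t_start <= t <= t_end:
                return interpolate(v0, v1, t, t_start, t_end)
        raise ValueError("time %r falls outside all segments" % t)

    # Example: a mixing proportion that rises from 0.2 to 0.5 over the first
    # segment and then decays back to 0.1 over the second.
    segments = [(0, 10, 0.2, 0.5), (10, 20, 0.5, 0.1)]
    print(evolving_parameter(segments, 15))  # 0.3

Because the value inside each segment is fully determined by its two endpoint values, only the endpoints need to be inferred, which is what keeps the model "simple yet informative."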


CHAPTER 2
BRIEF SURVEY OF RELATED WORK

2.1 Visualization Based Approaches

Cadez et al. [5] present a probabilistic mixture modeling based framework to model customer behavior in transactional data. In their model, each transaction is generated by one of the k components (customer profiles). Associated with each customer is a set of k weights that govern the probability of an individual to engage in a shopping behavior like one of the customer profiles. Thus, they model a customer as a mixture of the customer profiles. Cadez et al. [6] propose a generative framework for probabilistic model based clustering of individuals where data measurements for each individual may vary in size. In this generative model, each individual has a set of membership probabilities that she belongs to one of the k clusters, and each of these k clusters has a parameterized data generating probability distribution. Cadez et al. model the set of data sequences associated with an individual as a mixture of these k data generating clusters. They also outline an EM approach that can be applied to this model and show an example of how to cluster individuals based on their web browsing data under this model.

2.2 Information Theoretic Co-clustering

In information-theoretic co-clustering [7] the goal is to model a two-dimensional matrix in a probabilistic fashion. Co-clustering groups both the rows and the columns of the matrix, thus forming a grid; this grid is treated as defining a probability distribution. The abstract problem that co-clustering tries to solve is to minimize the difference between the distribution defined by the grid and the distribution represented by the original matrix. In information-theoretic co-clustering, this difference is measured by the mutual loss of information between the two distributions. Recently, the original work on information-theoretic co-clustering has been extended by other researchers. Dhillon and Guan [8] have shown that one of the common problems for a divisive clustering algorithm based on information theoretic co-clustering is that it can easily get stuck in a poor local maximum while dealing with sparse high dimensional data. They suggest a two-fold approach to escape the local maxima: to use a special prior distribution for their Bayesian approach, and to use a local search strategy to move away from a bad local maximum. They have shown excellent results using these strategies on word-document co-occurrence data from the well known 20 newsgroups dataset. As noted earlier, every co-clustering is based on an approximation of the original data matrix. The quality of the co-clustering clearly relies on the goodness of this matrix approximation. Banerjee et al. [9] have devised a general partitional co-clustering framework that is based on search for a good matrix approximation. They introduce a large class of loss functions called Bregman divergences to measure the approximation error of a co-clustering. They show that popular loss functions like squared Euclidean distance and KL-divergence are special cases of Bregman divergences. Based on these loss functions, they introduce a new Minimum Bregman Information principle that leads to a meta-algorithm for co-clustering of objects. They further show that well known loss minimization based algorithms like k-means and information theoretic co-clustering are special cases of this meta-algorithm.

The authors of [10] extend the idea of co-clustering to higher order co-clustering, for example categories, documents and terms in text mining. They specifically focus on a special type of co-clustering where there is a central object that connects to other data types so as to form star-like interrelationships between the various types of objects to be co-clustered. They model such a co-clustering problem as a consistent fusion of many pair-wise co-clustering problems, with structural constraints based on the interrelationships between the objects. They argue that each of the subproblems may not be locally optimal; however, when all the subproblems are connected using the common object, the solution can be globally optimal. They refer to such partitioning of problems as consistent bipartite graph co-partitions and prove that such partitions can be found using semi-definite programming.

2.3 Subspace Clustering

Aggarwal et al. [11] introduce the concept of Projected Clustering (PROCLUS) where each cluster in the clustering of objects may be based on a separate set of subspaces of the data. Thus the idea is to compute the cluster not only based on the data points but also based on the various dimensions of the data. Their approach to solving the projected clustering is to combine the use of the k-medoid technique and locality analysis to find relevant dimensions for each medoid.

Aggarwal and Yu [12] have designed a clustering algorithm known as arbitrarily ORiented CLUSter generation (ORCLUS) that eliminates the problem of rectangular clusters returned by the usual projected clustering, by clustering in arbitrarily aligned subspaces of lower dimensionality. They also make improvements in the scalability of the approach by adding provision for progressive random sampling and extended cluster feature vectors. Woo et al. [13] indicate that selecting the correct set of correlated attributes for subspace clustering is a challenge because both data grouping and dimension selection need to happen at the same time. They propose a novel approach called FINDIT that determines these correlations based on two factors: a dimension oriented distance measure, and a voting strategy that takes into account nearby neighbors. Yang et al. [14] have introduced a model called δ-clusters that captures the objects that have coherence (i.e. similar trends) on a subset of data attributes rather than closeness (i.e. small distance). A residue metric is introduced to measure coherence among objects in a cluster. Their formulation of the problem is NP-hard. However, they provide a randomized algorithm that iteratively improves the clustering from an initial seed. Friedman and Meulman [15] have proposed a method called Clustering on Subset of Attributes (COSA) that can be used together with the standard distance based clustering approaches, which allows for detection of groups of data points that cluster on subsets of the attribute space rather than all of them together. COSA relies on weight values for different attributes to allow for computation of inter-object distances for clustering. The second set of subspace clustering algorithms try to find dense regions in lower-dimensional projections of the data space and combine them to form clusters. This type of combinatorial bottom-up approach was first proposed in Frequent Itemset Mining [16] for transactional data and later generalized to create algorithms such as

CLIQUE [17], ENCLUS [18], MAFIA [19], Cell-based Clustering Method (CBF) [20], CLTree [21] and DOC [22]. Agrawal et al. [17] have proposed a density based subspace clustering approach called CLIQUE, which first identifies dense regions of the data space by partitioning it into equal volume cells. Once the dense cells are identified, the data points are separated according to the troughs of the density functions. The clusters are then nothing but the union of connected highly dense areas within a subspace. Cheng et al. [18] have proposed Entropy based Clustering (ENCLUS), which as its name suggests uses an entropy based criterion to evaluate correlation amongst data attributes to identify good subspaces for subspace clustering, along with coverage and density as suggested in CLIQUE. Goil et al. [19] propose the use of adaptive grids in the approach dubbed MAFIA, for efficient and scalable computation of subspace clustering. They successfully argue that the number of bins in a bottom-up subspace clustering approach determines the speed of computation and quality of clustering. They make a case for more bins in the dense regions of the data as opposed to uniform sized bins over all data intervals. They also introduce a scalable parallel framework using a shared-nothing architecture to handle large datasets. Chang and Jin [20] have proposed a cell based clustering method that relies on an efficient cell creation algorithm for subspace clustering. Their algorithm uses a space partitioning technique and a split index to keep track of cells along each data dimension. It also has the capability to identify cells with more than a certain threshold density as clusters, and mark them in the split index. They have shown that by using an innovative

… Liu et al. [21] have proposed a clustering technique based on decision tree (CLTree) construction. The main idea is to use a decision tree to partition the data space into dense and sparse regions at different levels of detail (i.e. number of attributes involved at the tree nodes). A modified decision tree algorithm, with the help of virtual data points, helps in the initial decision tree construction. In the next step, tree pruning strategies are used to simplify the tree. The final clustering is nothing but the union of hyper-rectangular dense regions from the tree. Procopiuc et al. [22] start with the definition of an optimal projective cluster based on the density requirements of a projected clustering. Based on this notion of optimal cluster, they have developed a Monte Carlo algorithm dubbed Density-based Optimal Clustering (DOC) that computes, with high probability, a good approximation of an optimal projective cluster. The overall clustering is found by taking the greedy approach of computing each cluster one by one rather than any partition based strategy.

2.4 Other Approaches

Griffiths and Ghahramani [23] have derived a distribution on infinite binary matrices that can be used as a prior for models in which objects are represented in terms of a set of latent features. They derive this prior as the infinite limit of a simple distribution on finite binary matrices. They also show that the same distribution can be specified in terms of a simple stochastic process which they coin the Indian Buffet Process (IBP). IBP provides a very useful tool for defining non-parametric Bayesian models with latent variables. IBP allows each object to possess potentially any combination of the infinitely many latent features. Graham and Miller [24] have proposed a naive-Bayes mixture model that allows each component in the mixture its own feature subset, with all other features explained by a single shared component. This means, for each feature a given component uses

… McLachlan et al. [25] present a mixture model based approach called EMMIX-GENE to cluster microarray expression data from tissue samples, each of which consists of a large number of genes. In their approach, a subset of relevant genes are selected and then grouped into disjoint components. The tissue samples are then clustered by fitting mixtures of factor analyzers on these components.

2.5 Temporal Models

Blei and Lafferty [26] have developed a Bayesian hierarchical dynamic topic model that captures the evolution of topics in an ordered repository of documents. Though exact inference is not possible for their model, they have developed efficient and accurate approximations using variational Kalman filters and variational wavelet regression for learning this class of topic models. Wang and McCallum [27] have developed a topic model that explicitly models time jointly with word co-occurrence patterns, called Topics over Time. This model differs from other approaches in two significant ways: time is not discretized, and no Markov assumptions are made about state transitions. Because of this, word co-occurrences over both narrow and broad time periods can be identified more easily. Chakrabarti et al. [28] have devised a framework for evolutionary clustering that is primarily concerned with maintaining temporal smoothness of the clustering, i.e. maximizing the fit for current data while minimizing deviation from historical clustering.

The authors of [29] have extended the classical mixture model by allowing the mixture proportions to evolve over time. They employ simple linear regression, and the mixing proportions at a given time can be computed easily via a linear formula, given the mixing proportions at the start time and the mixing proportions at the end time.

CHAPTER 3
LEARNING CORRELATIONS USING MIXTURE-OF-SUBSETS MODEL

3.1 Introduction

… [3], and McLachlan and Peel [4]. The classical mixture model allows only a single component to generate each data point. However, there are many real life situations where it makes far more sense to model a data point as being generated using multiple components. Imagine that the items purchased by each shopper at a retail store are recorded in a database, and the goal is to build an informative model for the buying patterns of different classes of customers. We could make the classic assumption that each customer belongs to one class, in which case membership in a given class should attempt to completely describe all of the buying patterns of each member customer. Unfortunately, given the possible diversity of customers and items for sale, this may not be realistic. It may be more accurate and natural to try to explain the behavior of each shopper as resulting from the influence of several classes. For example, the items collected in the shopper's cart may be influenced by the fact that she belongs to the classes wife, mother, sports fan, doctor, and avid reader. This allows each data point to be modeled with high precision, and yet

… fan, an avid reader, and a doctor. As this customer makes her purchase, one of the data attributes that is collected is a boolean value indicating whether or not the shopper purchased a recent biography of a popular sports figure. Membership in both the sports fan and the avid reader classes should be relevant to producing this boolean value, but membership in the doctor class should not be. In the generative model proposed in this chapter, called the Mixture of Subsets model or MOS model for short, each multi-attribute data point (the item set purchased by a shopper in our example) is generated by a subset of the possible classes, and each possible class influences a subset of the data attributes. The MOS model facilitates this by allowing each class to specify the parameters for a generative probability density function, for each attribute where the class is relevant. The other attributes are ignored by the class. In our example, we might expect that the decision whether or not the book purchase is made would be governed by a Bernoulli (yes/no) random variable with probability density function f. Since the sports fan and avid reader classes are relevant to this purchase, each of them supplies possible parameter values to the Bernoulli variable, which are denoted as θ_{sports fan,book} and θ_{avid reader,book}, respectively.¹ The class doctor is not relevant to this purchase, and hence it supplies the default parameter value

¹ θ_{sports fan,book} is the probability that a sports fan purchases the book. Thus, f(yes | θ_{sports fan,book}) = θ_{sports fan,book}, and f(no | θ_{sports fan,book}) = (1 − θ_{sports fan,book}).

… θ_{sports fan,book}, the second variable uses the parameter θ_{avid reader,book}, and the third variable uses the parameter θ_{default,book}. As a result, the probability that the shopper purchases the book, given that she is a reader, a sports fan, and a doctor, is simply:

(1/3) f(yes | θ_{sports fan,book}) + (1/3) f(yes | θ_{avid reader,book}) + (1/3) f(yes | θ_{default,book})

In this way, each data point is produced by a set of classes, and each attribute of the data point is produced by a mixture over the subset of the data point's classes that are relevant to the attribute in question. In this chapter, we present learning algorithms that, given a database, are suitable for learning the classes present in the data, the way that the classes influence data attributes, and the set of classes that influenced each data point in the database. Other papers have explored related ideas before. In recent years, the machine learning community has begun to consider generative models which allow each data point to be produced simultaneously by multiple classes (examples include the Chinese Restaurant Process [30, 31] and the Indian Buffet Process [23]). Starting with the seminal paper on subspace clustering [17], the data mining community has been quite interested in finding patterns in subspaces of the data space. The MOS model combines ideas from both of these research threads into a single, unified framework that is amenable to processing using statistical machine learning methods. We explain in theory how our model and algorithms can be applied to zero-one Bernoulli data as well as numerical data. We also present experimental results using models learned from real high-dimensional data like the stock movements dataset
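Returning to the book-purchase probability above, a minimal Python sketch of the computation follows. The three parameter values are illustrative stand-ins of our own choosing, since the chapter does not fix them numerically:

    # Worked version of the book-purchase example. For a Bernoulli density,
    # f(yes | theta) is simply theta.
    theta_sports_fan_book = 0.30   # assumed value for the sports fan class
    theta_avid_reader_book = 0.20  # assumed value for the avid reader class
    theta_default_book = 0.05      # assumed default parameter for this attribute

    relevant_params = [theta_sports_fan_book,
                       theta_avid_reader_book,
                       theta_default_book]

    # Each of the three active classes (sports fan, avid reader, and doctor
    # via the default) is equally likely to be the dominant generator of this
    # attribute, hence the 1/3 weights in the expression above.
    p_buy = sum(relevant_params) / len(relevant_params)
    print(p_buy)  # (0.30 + 0.20 + 0.05) / 3 ~= 0.1833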

… Section 3.3 of the chapter discusses our EM algorithm for learning the MOS model from a dataset. Sections 3.4 and 3.5 of the chapter discuss how to apply the MOS model to Bernoulli and normal models. Section 3.6 of the chapter details some example applications of the model, Section 3.7 discusses related work, and Section 3.8 concludes the chapter.

3.2 The MOS Model

3.2.1 Preliminaries

Mixture modeling is a common machine learning and data mining technique that is based upon the statistical concept of maximum likelihood estimation (MLE). MLE begins with a probability distribution F parameterized on Θ. Given a dataset X = {x1, x2, …, xn}, in MLE we attempt to choose Θ so as to maximize the probability that F would have produced X after n trials. Formally, the goal is to select Θ so as to maximize the sum:

Σ_{a=1}^{n} log F(xa | Θ)
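As a minimal illustration of this objective, the sketch below evaluates the log-likelihood sum for a single Bernoulli parameter and picks the best value by a coarse grid search. The data and the grid are our own toy example, not anything from the dissertation:

    import math

    # Choose the parameter that maximizes sum_a log F(x_a | theta), where F
    # is a single Bernoulli density; the grid search is purely illustrative.
    X = [1, 0, 1, 1, 0, 1, 1, 1]

    def log_likelihood(theta, data):
        return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

    best_theta = max((t / 100.0 for t in range(1, 100)),
                     key=lambda t: log_likelihood(t, X))
    print(best_theta)  # ~0.75, the sample mean, as MLE predicts for a Bernoulli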


where

Pr[S1] = ∏_{∀Ci ∈ S1} αi · ∏_{∀Ci ∉ S1} (1 − αi), and Pr[S2 | S1] = 1 / |S1|^d.

The outer sum over ∀S1 ∈ 2^C represents all possible combinations of active component subsets S1 ⊆ C. The inner sum over ∀S2 ∈ S1^d represents all possible dominant component assignments once a particular component subset S1 has been chosen. Combining these terms, we obtain: … This is discussed further in Sections 3.4 and 3.5 of the chapter, as well as in the example that follows.

… The θij and γj values for our example are given in Table 3-1. Notice that we have added an additional default component in the table, as specified by the MOS model. A '-' in the (i, j) position in Table 3-1 means that the parameter mask Mij = 0. That is, the component Ci has no effect on attribute Aj, and it simply makes use of the default parameter γj to generate that attribute. In our example, θ13 = '-' and γ3 = 0.1 means that a Woman has a 10% chance of buying Baby Oil on a shopping trip. To continue with our example, let us assume the appearance probabilities for the generative components αi are as shown in Table 3-2. To generate a data point, we go through the following three-step process: (1) select the subset of active components S1 by performing an independent Bernoulli trial for each component Ci using its appearance probability αi; (2) for each attribute Aj, select a dominant component S2[j] uniformly at random from S1; and (3) manifest each attribute Aj using the PDF f, parameterized by θij for the dominant component Ci = S2[j] when Mij = 1, or by the default parameter γj when Mij = 0.
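The following Python sketch (ours, not the dissertation's code) simulates this three-step process using the parameter values of Tables 3-1 and 3-2. The fallback to the default component when no component happens to be active is our assumption; the chapter's handling of an empty S1 is not recoverable from the extracted text:

    import random

    # Sampling one transaction under the MOS generative process, with the
    # Woman / Mother / Business owner parameters of Tables 3-1 and 3-2.
    # A theta of None stands for a masked attribute (M_ij = 0): the component
    # defers to the default parameter gamma_j for that attribute.
    alpha = {"Woman": 0.6, "Mother": 0.2, "Business owner": 0.2}
    theta = {
        "Woman":          [0.6, None, None, None, 0.5],
        "Mother":         [0.3, 0.6, 0.6, None, 0.4],
        "Business owner": [0.2, None, None, 0.3, None],
    }
    gamma = [0.4, 0.1, 0.1, 0.1, 0.4]

    def generate_point():
        # Step 1: one Bernoulli trial per component picks the active subset S1.
        s1 = [c for c, a in alpha.items() if random.random() < a]
        point = []
        for j in range(5):
            # Step 2: each attribute picks a dominant component uniformly
            # from S1 (an empty S1 falls back to the default component here).
            dominant = random.choice(s1) if s1 else None
            p = theta[dominant][j] if dominant else gamma[j]
            if p is None:   # masked attribute: use the default parameter
                p = gamma[j]
            # Step 3: manifest the attribute from the selected Bernoulli PDF.
            point.append(1 if random.random() < p else 0)
        return point

    print(generate_point())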

3.2.4 Example Evaluation of the MOS PDF

In this subsection, we evaluate the MOS PDF with the help of the example used in the previous subsection. Continuing with our example, we assume that our data point is xa = 10011, and we want to evaluate F_MOS(xa = 10011 | Θ). That is, we want to compute the probability that this transaction would be produced by the model. The first step is to form the power set 2^C, from which the generating subset S1 is chosen. Here, we have three customer classes/components C = {C1, C2, C3}, and hence:

2^C = { {}, {C1}, {C2}, {C3}, {C1, C2}, {C1, C3}, {C2, C3}, {C1, C2, C3} }

… For each generating subset S1, we iterate through all the strings S2 ∈ S1^5, and sum up the values. To illustrate this, let us select S1 = {C2, C3} and S2 = (C3, C2, C3, C2, C2), meaning item Skirt had dominant customer class Business Owner, item Diapers had dominant customer class Mother, and so on. Now, given that xa = 10011 and the values of θ, γ, and α given in Tables 3-1 and 3-2, the contribution of this particular S2 will be:

Pr[S1] · Pr[S2 | S1] · f(1 | θ31) f(0 | θ22) f(0 | γ3) f(1 | γ4) f(1 | θ25)
  = [(1 − α1) · α2 · α3] · (1/2)^5 · (0.2 × 0.4 × 0.9 × 0.1 × 0.4)
  = 0.016 · (1/32) · 0.00288
  = 0.00000144
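This arithmetic can be checked mechanically. The short Python sketch below (ours) recomputes the contribution of this particular S1, S2 pair from the table values and recovers the same number:

    # Reproduce the contribution for S1 = {C2, C3}, S2 = (C3, C2, C3, C2, C2)
    # on x_a = 10011, using the parameters of Tables 3-1 and 3-2.
    alpha = {"C1": 0.6, "C2": 0.2, "C3": 0.2}
    theta = {  # None marks a masked attribute (fall back to gamma)
        "C1": [0.6, None, None, None, 0.5],
        "C2": [0.3, 0.6, 0.6, None, 0.4],
        "C3": [0.2, None, None, 0.3, None],
    }
    gamma = [0.4, 0.1, 0.1, 0.1, 0.4]

    x = [1, 0, 0, 1, 1]
    s1 = ["C2", "C3"]
    s2 = ["C3", "C2", "C3", "C2", "C2"]

    def bernoulli(p, v):
        return p if v == 1 else 1.0 - p

    # Pr[S1]: components in S1 appear, all others do not.
    pr_s1 = 1.0
    for c, a in alpha.items():
        pr_s1 *= a if c in s1 else (1.0 - a)

    pr_s2_given_s1 = (1.0 / len(s1)) ** len(x)  # uniform dominant assignments

    pr_attrs = 1.0
    for j, c in enumerate(s2):
        p = theta[c][j] if theta[c][j] is not None else gamma[j]
        pr_attrs *= bernoulli(p, x[j])

    print(pr_s1 * pr_s2_given_s1 * pr_attrs)  # 1.44e-06, matching the text

Summing this quantity over every subset S1 and every dominant-assignment string S2 yields the full value of F_MOS(xa | Θ).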

3.3 Learning the Model via Expectation Maximization

3.3.1 Preliminaries

MLE is a standard method for estimating the parameters of a parametric distribution. Unfortunately, this sort of maximization is intractable in general, and as such many general techniques exist for approximately performing this sort of maximization. The difficulty in the general case arises from the fact that certain important data are not visible during the maximization process; these are referred to as the hidden data. In the MOS model, the hidden data are the identities of the components that formed the set S1 used to generate each data point, as well as the particular components that were used to generate each of the data point's attributes. If these values were known, then the maximization would be a straightforward exercise in college-level calculus. Without these values, however, the problem becomes intractable. One of the most popular methods for dealing with this intractability is the Expectation Maximization (EM) algorithm [2]. This chapter assumes a basic familiarity with EM; for an excellent tutorial on the basics of the EM framework, we refer the reader to Bilmes [32]. In the EM algorithm, we start with an initial guess of the parameters Θ, and then alternate between performing an expectation (E) step and a maximization (M) step. In the E step, an expression for the expected value of the log-likelihood formula with respect to the hidden data is computed. This expectation is computed with respect to the current value of the parameter set Θ. This effectively removes the

… Figure 3-1. The remainder of this section considers how the various update rules for the parts of Θ are derived. First, we consider the E-Step of the algorithm, on which the update calculations for each α, θ, and M all depend. Then, we derive generic update rules for the α and M parameters. The update rules for the various θ parameters depend upon the particular application of the MOS framework, and exactly what form the underlying PDF f takes. Subsequent sections of the chapter consider how to derive update rules for each of the θ under Bernoulli and normal (or Gaussian) models.

3.3.2 The E-Step

Maximizing the MOS likelihood would be relatively easy if we knew which attributes of the data point xa were generated by which of the components Ci in the MOS mixture model. However, this information is unobserved or hidden. Let za represent the hidden variable which indicates the subset of components that contributed to the various attributes of the data point xa. We define the complete-data likelihood function as L(Θ | X, Z) = F(X, Z | Θ). In the E-step of the EM algorithm we evaluate the expected value of the complete-data log-likelihood log F(X, Z | Θ) with respect to the unknown data Z, given the observed data X and the current parameter estimates Θ. So, we define our objective function Q that we want to maximize as:

Q(Θ′, Θ) = E[ log F(X, Z | Θ′) | X, Θ ]

Expanding the expectation, we can write Q as:

Q(Θ′, Θ) = Σ_xa Σ_{all possible za} F(za | xa, Θ) log F(xa, za | Θ′)

Now, let us take a closer look at the inner sum, which is summed over all possible za values. Here, za represents the hidden assignments in the two-step process that generated a data point xa: the choice of the active component subset S1, and the choice of a dominant component S2[j] for each attribute. Using these assignments, we can rewrite Q as: … The expressions for H′ and G′ are similar to H and G, but contain the variables α′, θ′, and M′ instead of the constant values α, θ, and M. Here f is the univariate probability density

… Appendix A. A comparison of the results and execution times of learning based on complete-computation EM and the Monte Carlo EM can be seen in Section 3.6.1. In the remainder of the body of the chapter, we simply assume that it is possible to compute the Q function using reasonable computational resources.

3.3.3 The M-Step

… As outlined in Appendix A, let us first define an identifier function I that takes as its parameter a boolean function b: I(b) = 1 if b is true, and I(b) = 0 otherwise. Using this, we can rewrite the Q function as:

Q(Θ′, Θ) = Σ_xa Σ_S1 ( Σ_{i=1}^{k} l(xa, S1, i ∈ S1) log α′i
         + Σ_{i=1}^{k} l(xa, S1, i ∉ S1) log(1 − α′i)
         + Σ_{i=1}^{k} Σ_{j=1}^{d} l(xa, S1, i ∈ S1 ∧ S2[j] = i) log G′ij )

Much of the complexity in this equation comes from terms that are actually constants computed (or estimated, as outlined in Appendix A) during the E-step of the algorithm. We can simplify the Q function considerably by defining the following three constants:

c1,i = Σ_xa Σ_S1 l(xa, S1, i ∈ S1)
c2,i = Σ_xa Σ_S1 l(xa, S1, i ∉ S1)
c3,i,j,a = Σ_S1 l(xa, S1, i ∈ S1 ∧ S2[j] = i)

Given this, we can re-write the Q function as:

… Taking the partial derivative of Q with respect to α′i and setting it to zero yields

c1,i (1 − α′i) − c2,i α′i = 0  ⟹  α′i = c1,i / (c1,i + c2,i)

This gives us a very simple rule for updating each α′i. When the parameter mask M′ij = 1, G′ij reduces to f(xaj | θ′ij). Now, using this, we can find the values of θ′ij that would maximize Q by taking a partial derivative of Q with respect to θ′ij and equating that to zero. Deriving the exact update rules for each θ′ij depends upon the nature of the underlying data, that is, the underlying distribution f. A more detailed discussion of how this may be accomplished for Bernoulli data and normal data can be seen in the next two full sections of the chapter.
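Written out in code, the α update is a one-liner. The c1 and c2 values below are illustrative numbers of our own, standing in for the E-step constants:

    # The alpha update rule above. c1[i] and c2[i] are the E-step constants.
    c1 = [412.7, 88.2, 153.9]   # expected weight of "component i was active"
    c2 = [587.3, 911.8, 846.1]  # expected weight of "component i was inactive"

    alpha_new = [c1[i] / (c1[i] + c2[i]) for i in range(len(c1))]
    print(alpha_new)  # e.g. alpha'_1 ~= 0.41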

3.3.4 Computing the Parameter Masks

… The user may constrain the parameter masks in several ways:

1. The user may prescribe exactly how many of the Mij values must be zero (or equivalently, non-zero), in order to limit the amount of information present in the model.

2. The user may prescribe exactly how many of the Mij values must be zero, and also constrain the number of non-zero values per row (that is, per component). In other words, the user may give a maximum or minimum (or both) on the dimensionality of the components that are learned.

3. The user may prescribe exactly how many of the Mij values must be zero, and also constrain the number of non-zero values per column (that is, per data attribute). In other words, the user may constrain the number of times that a given attribute can be part of any component. This might be useful in making sure that all attributes actually appear in one or more components.

We now consider how the various M′ij values can be computed during the M-Step of the EM algorithm for each of the three numbered cases given above.

… Using the two preceding equations, we can define this contribution, denoted qij(θ′ij, γj), as: …
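Although the page defining qij is only partially recoverable, the selection step it drives can be sketched. Assuming the contribution values qij of setting each mask to one have already been computed, a greedy selection under the first constraint scenario (a global budget of non-zero masks) might look like the following; this is our sketch, not the dissertation's implementation:

    # Keep exactly `budget` non-zero masks, choosing the attribute-component
    # pairs whose contribution q_ij to the Q function is largest.
    def choose_masks(q, budget):
        """q: dict mapping (i, j) -> contribution of setting M_ij = 1.
        Returns a mask dict with `budget` ones and zeros elsewhere."""
        ranked = sorted(q, key=q.get, reverse=True)
        keep = set(ranked[:budget])
        return {ij: (1 if ij in keep else 0) for ij in q}

    q = {(0, 0): 2.1, (0, 1): -0.3, (1, 0): 0.9, (1, 1): 4.2}
    print(choose_masks(q, 2))  # keeps (1, 1) and (0, 0)

The per-row and per-column constraint scenarios would add bookkeeping on top of the same ranking idea.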

… Section 3.6.

3.4 Example - Bernoulli Data

An example of market basket data is shown in Table 3-3, where each row is a data point (a transaction) and each column is an attribute (an item). Each item can be treated as a binary variable whose value is 1 if it is present in a transaction, and 0 otherwise. Presence of items together in a transaction indicates the underlying correlation amongst them. For example, there is a good chance that we will observe Diapers and Baby Oil together in real-life transactions. We hope to capture such underlying correlations in the market basket data using our MOS model. Besides the standard market basket data, many other types of real life data can be modeled or transformed into market-basket-style data so as to capture these kinds of correlations. We show a case study of this type of data using stock movement information from the S&P 500 in Section 3.6.

3.4.1 MOS Model

… Data generation under this model follows the process described in Section 3.2.3.

3.4.2 Expectation Maximization

As in Section 3.3.2, however, we will be able to come up with a further simplified expression for Q(Θ′, Θ), since we know the underlying PDF fj for each of the attribute random variables Nj. In particular, for an item Aj, the generating customer class Ci, and a transaction xa,

f(xaj | θij) = (θij)^xaj (1 − θij)^(1 − xaj)

Similarly, for an item Aj, the default parameter vector γ, and a transaction xa,

f(xaj | γj) = (γj)^xaj (1 − γj)^(1 − xaj)

Using these values, G′ij reduces to:

G′ij = M′ij · f(xaj | θ′ij) + (1 − M′ij) · f(xaj | γj)

In the M-step, we have to compute the values of α′i, θ′ij and M′ij that maximize the expected value of the log-likelihood function Q(Θ′, Θ). We can compute the α′i values as

… before:

α′i = c1,i / (c1,i + c2,i)

Computing the parameter masks follows the two-step process of Section 3.3.4. In the first step, we assume that M′ij = 1, and solve for the θ′ij that would maximize Q(Θ′, Θ):

∂/∂θ′ij [ Σ_{xa : xaj = 0} c3,i,j,a log(1 − θ′ij) + Σ_{xa : xaj = 1} c3,i,j,a log θ′ij ] = 0
⟹ θ′ij = ( Σ_{xa : xaj = 1} c3,i,j,a ) / ( Σ_xa c3,i,j,a )

In the second step, we compute the parameter masks as outlined in Section 3.3.4. Using the values from the equations above, the expression for qij(θ′ij, γj) for the market basket case is as follows: …
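In code, the Bernoulli update is a weighted fraction. The inputs below are illustrative numbers of our own for a single fixed (i, j) pair:

    # Bernoulli M-step update: theta'_ij is the c3-weighted fraction of
    # records whose attribute j equals 1.
    x_j = [1, 0, 1, 1, 0]           # attribute j across five records
    c3 = [0.8, 0.1, 0.6, 0.9, 0.2]  # c3_{i,j,a} for this fixed (i, j)

    num = sum(w for w, v in zip(c3, x_j) if v == 1)
    den = sum(c3)
    theta_new = num / den
    print(theta_new)  # (0.8 + 0.6 + 0.9) / 2.6 ~= 0.885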

3.5 Example - Normal Data

3.5.1 MOS Model

…

3.5.2 Expectation Maximization

As in Section 3.3.2, however, we will be able to come up with a further simplified expression for Q(Θ′, Θ), since we know the underlying PDF fj for each of the attribute random variables Nj. In particular, for an attribute Aj, the generating Gaussian Ci, and a data point xa,

f(xaj | θij) = (1 / √(2π σij²)) · exp( −(xaj − µij)² / (2σij²) ), where θij = (µij, σij)

Using this, G′ij reduces to:

G′ij = M′ij · f(xaj | θ′ij) + (1 − M′ij) · f(xaj | γj)

… As before:

α′i = c1,i / (c1,i + c2,i)

Computing the parameter masks follows the two-step process of Section 3.3.4. In the first step, we set M′ij = 1, and solve for the µ′ij and σ′ij that would maximize Q(Θ′, Θ):

µ′ij = ( Σ_xa c3,i,j,a · xaj ) / ( Σ_xa c3,i,j,a )
(σ′ij)² = ( Σ_xa c3,i,j,a · (xaj − µ′ij)² ) / ( Σ_xa c3,i,j,a )

In the second step, we compute the parameter masks as outlined in Section 3.3.4. Using the values from the equations above, the expression for qij(θ′ij, γj) for normally distributed data is as follows: …

3.6 Experimental Evaluation

In this section, we show how the models developed in Sections 3.4 and 3.5 can be used to interpret real-world data.
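The normal updates are a weighted mean and a weighted variance. A short sketch with illustrative inputs of our own:

    # Normal M-step updates: c3-weighted mean and c3-weighted variance for a
    # single fixed (i, j) pair.
    x_j = [4.2, 5.1, 3.8, 4.9]
    c3 = [0.9, 0.4, 0.7, 0.5]

    den = sum(c3)
    mu_new = sum(w * v for w, v in zip(c3, x_j)) / den
    var_new = sum(w * (v - mu_new) ** 2 for w, v in zip(c3, x_j)) / den
    sigma_new = var_new ** 0.5
    print(mu_new, sigma_new)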

3.6.1 Synthetic Data

… The generating components for the synthetic datasets are shown in Figures 3-2 and 3-5. The masked attributes appear as white squares (probability zero) and the un-masked attributes are black squares (probability one). To illustrate the sort of data that would be produced using these components, Figure 3-3 shows four example data points produced by the 16-attribute generator. For the four-attribute and the nine-attribute datasets, we learned the MOS model using both the fully deterministic computation and the Monte Carlo E-step. For the rest of the datasets, we learned the MOS model using just the Monte Carlo E-step (the deterministic E-step was too slow). For the four-attribute dataset, the total number of samples for the E-step Monte Carlo sampling was set to be 100,000 (i.e. 100 samples

… The components learned are shown in Figures 3-4 and 3-6. The execution times of the learning algorithm, measured on a computer with an Intel Xeon 2.8 GHz processor and 4 GB RAM, are shown in Table 3-4. … Comparing the learned components in Figures 3-4 and 3-6 against the generating components, the Monte Carlo EM always recovered the positions of the parameter masks correctly. We observed the learned probability values in all the generative components to be consistently higher than 0.75, though a bit less than the correct value of 1.0. The model compensated for this slightly lower value by slightly increasing the learned appearance probability for each component. The learned values were observed to be between 0.5 and 0.6 as opposed to the correct

3.6.2 Bernoulli Data - Stocks Data

… We applied the Bernoulli MOS model of Section 3.4 with the goal of observing the correlations amongst these stocks. We set constraints on the parameter masks to allow a minimum of 4 and a maximum of 14 non-zero masks per component, and a total of 180 non-zero parameter masks in the model. All the appearance probabilities were initialized to the same random floating point number between 0 and 1. All the theta values of an attribute were initialized by picking randomly from a normal distribution centered around the underlying default parameter gamma for that attribute and having a standard deviation of 0.05. The initial total number of samples for the E-step Monte Carlo sampling was set to be 2,800,000 (i.e. 1000 samples per record). For this dataset, we picked the best model (highest log-likelihood value) from 20 random initializations.

… The stock components learned by a 20-component MOS model are shown in Figure 3-7. Along the columns are the 40 chosen stocks represented by their symbols; and along the rows are the components learned by the model. We have grouped the columns according to the types of the companies. The components are shown in descending order of appearance probability. The probability values of the Bernoulli random variables are shown in greyscale, with white being 0 and black being 1, with a step of 0.1. Thus, the lighter areas in the figure show downward movement of stocks, while the darker areas show upward movement of stocks.

3.6.3 Normal Data - California Stream Flow

… We applied the normal MOS model of Section 3.5. One of the reasons to select this dataset was that historical information about flood and drought events in California is well-known.

… The components learned are shown in Figure 3-8. The first component shown in Figure 3-8 has high flows in the southern part of California. The second component has high flows in northern and central California. The third component has sites that are very close to the neighboring states Arizona and Nevada. The fourth component has low flows all over California.

… Figure 3-8. The third component is interesting because it singles out sites that are very close to the neighboring states of Arizona and Nevada. Probably this indicates that the flow of water at these sites depends more on the weather events in those states rather than California. In Figure 3-9, we have shown some of the components from a 20-component standard Gaussian Mixture Model learned from the same dataset. We can clearly observe that it is more difficult to interpret and understand the spatial correlations amongst various sites in these components, as opposed to the components in the MOS model, because each attribute is defined and active in all components. Based on the components identified by the MOS model, it is useful to estimate on which particular days in the dataset each of these components was active. To do this, we take each component Ci, and for each day xa, we generate 10,000 randomly generated component subsets S1, with and without the restriction that the component under consideration Ci must be present in this generating subset. For example, if the current component under consideration was say C3 in a 5-component model, then we would randomly generate 20,000 subsets of the components. The first 10,000 of those subsets would be generated with the condition that C3 must be present in them, and the remaining subsets will not have any such restriction. Hence they may or may not have the component C3. Next, we compute the ratio of the average likelihood of the data for the day xa being generated by the inclusive subsets to the average likelihood of the data being generated by the no-restriction subsets. We repeat this process for each day in the dataset. This gives us a principled way to compare the various days in the dataset and say if it were likely that a particular component would be present in the generative subset of components for that day. Mathematically, we compute the following ratio for
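A sketch of this day-scoring procedure in Python follows. It is our illustration: `likelihood` is a placeholder callback standing in for an evaluation of the MOS PDF of Section 3.2 (assumed here rather than implemented), and the uniform coin per component draws subsets uniformly over all subsets:

    import random

    def day_score(day, target, components, likelihood, n_samples=10000):
        """Ratio of average likelihoods with and without forcing `target`
        into the sampled active subsets S1."""
        def random_subset(force_target):
            s1 = {c for c in components if random.random() < 0.5}
            if force_target:
                s1.add(target)
            return s1

        inclusive = [likelihood(day, random_subset(True))
                     for _ in range(n_samples)]
        unrestricted = [likelihood(day, random_subset(False))
                        for _ in range(n_samples)]
        # Large ratios suggest the target component was likely part of the
        # generative subset on this day.
        return (sum(inclusive) / n_samples) / (sum(unrestricted) / n_samples)

    # Toy demonstration with a dummy likelihood that favors subsets with "C3".
    demo = day_score("1995-03-10", "C3", ["C1", "C2", "C3", "C4", "C5"],
                     lambda day, s1: 2.0 if "C3" in s1 else 1.0,
                     n_samples=2000)
    print(demo)  # > 1, as expected when the component helps explain the day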

… The counts are summarized in Tables 3-5 and 3-6. The high flow component in southern California was likely to be active for a few days in Feb-March 1978, Feb-March 1980, March 1983, March 1991, Feb 1992, Jan-Feb 1993, and Jan and March 1995. There were heavy rains and storms in southern California during Feb/March of 1978. Similarly, a series of 6 major storms hit California in February 1980; the southern region was the hardest hit and received extensive rainfall. Because of a strong El Nino effect, storms and flooding were observed in California in early 1983. Medium to heavy rains were also observed in March 1991 and February 1992. Heavy rainfall was observed in southern California and Mexico throughout January of 1993. Heavy rain in southern California was also observed in early January of 1995 and mid March of 1995. The high flow component in northern and central California was likely to be active for a few days in February 1976, November 1977, January 1978, December 1981, April 1982, Feb-March 1983, December 1983, February 1986, and Jan-March 1995. In February of 1976 northern California was hit by a snowstorm. Strong El Nino storms and flooding were observed in 1982-83. The flood in February 1986 was caused by a storm that produced substantial rainfall and excessive runoff in the northern one-half of California. Heavy rain and melting of snow caused flooding in north and central California in January-March of 1995. One more interesting thing to be observed is that both the high flow components seem not to be active during the droughts of 1976-77 and 1987-92. Because of the parameter masks and the default component, each learned MOS component has only manifested the attributes where it makes significant

… contributions as compared to the default component. The default component in effect becomes the background against which the other components are learned. Hence, we are able to observe and analyze the underlying correlations in the subspaces of the data space. This case study clearly shows that with some domain knowledge, the MOS model can be a very useful tool to perform this kind of exploratory data analysis.

3.7 Related Work

In information-theoretic co-clustering [7] the goal is to model a two-dimensional matrix in a probabilistic fashion. Recently, the original work on information-theoretic co-clustering has been extended by other researchers [8, 9, 10]. Co-clustering groups both the rows and the columns of the matrix, thus forming a grid; this grid is treated as defining a probability distribution. The abstract problem that co-clustering tries to solve is to minimize the difference between the distribution defined by the grid and the distribution represented by the original matrix. In information-theoretic co-clustering, this difference is measured by the mutual loss of information between the two distributions. Though co-clustering and the MOS model are related, the most fundamental difference between them is that co-clustering treats rows and columns as being equivalent, and simply tries to model their joint distribution. The MOS model associates a much deeper set of semantics with the matrix that is being modeled. In the MOS model, the difference between rows and columns is treated as being fundamental; unlike in co-clustering, rows are not clustered in the MOS model.

… Consider the California stream flow data of Section 3.6. The drought component learned (and depicted in Figure 3-8) covers almost every river in the state, since all very low flows are strongly correlated. However, the different high-flow components cover various subsets of rivers: those that have a high flow during the spring runoff, those that have a high flow during winter storms, those that have a high flow during summer thunderstorms, and so on. A partitioning that did not allow such overlapping components would not allow the drought component to influence all rivers, while at the same time every high flow component influences only a few. Subspace clustering is an extension of feature selection that tries to find meaningful localized clusters in multiple, possibly overlapping subspaces in the dataset. There are two main subtypes of subspace clustering algorithms based on their search strategy. The first set of algorithms try to find an initial clustering in the original dataset and iteratively improve the results by evaluating subspaces of each cluster. Hence, in some sense, they perform regular clustering in a reduced dimensional subspace to obtain better clusters in the full dimensional space. PROCLUS [11], ORCLUS [12], FINDIT [13], δ-clusters [14] and COSA [15] are examples of this approach. The most fundamental difference between clustering and the MOS model is the goal of the approach. Clustering generally tries to determine membership of rows (data points), and tries to group them together based on similarity measures. The MOS model tries to find a set of probabilistic generators for the entire dataset. Rather than partitioning the

… in Frequent Itemset Mining [16] for transactional data and later generalized to create algorithms such as CLIQUE [17], ENCLUS [18], MAFIA [19], Cell-based Clustering Method (CBF) [20], CLTree [21] and DOC [22]. These methods determine locality by creating bins for each dimension, and use those bins to form a multi-dimensional static or data-driven dynamic grid. They then identify dense regions in this grid by counting the number of data points that fall into these bins. Adjacent dense bins are then combined to form clusters. A data point could fall into multiple bins and thus be a part of more than one (possibly overlapping) cluster. This approach is probably the closest to our work, since these dense bins can be viewed as being similar to the components in the MOS model that could have combined to form the dataset. However, the key difference is that these APRIORI-style methods use a combinatorial framework, while the MOS model uses a probabilistic model-based framework to find these dense subspaces in the dataset. This model-based approach allows for a generic MLE solution while keeping the model data-agnostic. It also provides a probabilistic model-based interpretation of the data. Another difference is that the output of the MOS model has a bounded complexity, because the size of the model is an input parameter. However, for subspace clustering, typically some sort of density cutoff is an input parameter, and hence the size of the output can vary depending upon that input parameter and the distribution of the data in the dataset. In the past, several data mining approaches have been suggested to use mixture models to interpret and visualize data. Cadez et al. [5] present a probabilistic mixture modeling based framework to model customer behavior in transactional data. In their

… model, they describe each customer as a mixture of customer profiles. Cadez et al. [6] propose a generative framework for probabilistic model based clustering of individuals where data measurements for each individual may vary in size. In this generative model, each individual has a set of membership probabilities that she belongs to one of the k clusters, and each of these k clusters has a parameterized data generating probability distribution. Cadez et al. model the set of data sequences associated with an individual as a mixture of these k data generating clusters. They also outline an EM approach that can be applied to this model and show an example of how to cluster individuals based on their web browsing data under this model. The key difference between this approach and the MOS model lies in two aspects. First, Cadez et al. model an individual as a mixture of data-generating clusters, whereas the MOS model would model the data points as a mixture of subsets of data generating components. Second, the goal of their approach is to group individuals into clusters, whereas the goal of the MOS framework is simply to learn a model that provides a probabilistic model-based interpretation of the observed data. The EM algorithm itself was first proposed by Dempster et al. [2]. In the intervening years it has seen widespread use in many different disciplines. Work on improving EM continues to this day. For example, Amari [33] has presented a unified information geometrical framework to study stochastic models of neural networks by using the EM


Griffiths and Ghahramani [23] have derived a distribution on infinite binary matrices that can be used as a prior for models in which objects are represented in terms of a set of latent features. They derive this prior as the infinite limit of a simple distribution on finite binary matrices. They also show that the same distribution can be specified in terms of a simple stochastic process which they coin the Indian Buffet Process (IBP). The IBP provides a very useful tool for defining non-parametric Bayesian models with latent variables, and allows each object to possess potentially any combination of the infinitely many latent features. While the IBP provides a clean way to formulate priors that allow an object to possess many latent features at the same time, defining how these latent features combine to generate the observable properties of an object is left to the application. For example, the linear-Gaussian IBP model used to model simple images by Griffiths and Ghahramani [23] combines the latent features using a simple linear additive relationship. One can envision combining latent features using such arithmetic or logical operations; however, it is not clear what such a combination would mean in the context of a generative model. The MOS model provides a complete framework that not only allows multiple components to simultaneously generate a data point, but also defines, in a meaningful way, how these components combine during this generative process. Under the MOS model, each attribute of the data space is generated by a mixture of the selected latent features. This allows for a richer and more powerful interaction among the features than any simple linear relationship based on arithmetic operators.

Graham and Miller [24] have proposed a naive-Bayes mixture model that allows each component in the mixture its own feature subset, with all other features explained by a single shared component.


McLachlan et al. [25] present a mixture model based approach called EMMIX-GENE to cluster microarray expression data from tissue samples, each of which consists of a large number of genes. In their approach, a subset of relevant genes is selected and then grouped into disjoint components. The tissue samples are then clustered by fitting mixtures of factor analyzers on these components. The MOS model also follows a multi-step approach where first a set of active components is selected, and then each attribute of the data point is manifested under the influence of a mixture of active components. The key difference is that the groups of genes from EMMIX-GENE form non-overlapping subsets of the feature space, while the MOS model components allow for overlapping subsets of the feature space.
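This multi-step generative process can be made concrete with a short sketch. The code below is a minimal illustration of the idea, not the dissertation's implementation: the parameter values, array names, and the simplifications (binary data, no default component, resampling when no component is active) are all our own assumptions. Each data point first selects a subset of active components via independent appearance probabilities; each attribute is then produced by one component drawn from the active set in proportion to (renormalized) per-attribute weights.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical 3-component model over 4 binary attributes.
    gamma = np.array([0.6, 0.2, 0.2])          # appearance probabilities
    w = np.array([[0.4, 0.1, 0.1, 0.4],        # per-attribute weights w[i, j]
                  [0.1, 0.4, 0.4, 0.1],
                  [0.4, 0.1, 0.4, 0.1]])
    lam = np.array([[0.9, 0.2, 0.1, 0.8],      # Bernoulli parameters lam[i, j]
                    [0.1, 0.9, 0.8, 0.2],
                    [0.7, 0.3, 0.9, 0.1]])

    def sample_point():
        # Select the active subset of components by independent coin flips;
        # resample until at least one component is active.
        active = rng.random(gamma.size) < gamma
        while not active.any():
            active = rng.random(gamma.size) < gamma
        idx = np.flatnonzero(active)
        x = np.empty(lam.shape[1], dtype=int)
        for j in range(lam.shape[1]):
            # Each attribute is generated by a mixture of the active
            # components, with weights renormalized for this attribute.
            p = w[idx, j] / w[idx, j].sum()
            i = rng.choice(idx, p=p)
            x[j] = int(rng.random() < lam[i, j])
        return x

    print(sample_point())

Unlike a classical mixture, two different components here can leave their imprint on different attributes of the same data point, which is exactly the overlap that disjoint gene groupings such as EMMIX-GENE's cannot express.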




Table 3-1. Parameter values lambda_ij for the PDFs associated with the random variables N_j.

    Component        lambda_i1  lambda_i2  lambda_i3  lambda_i4  lambda_i5
    Woman            0.6        -          -          -          0.5
    Mother           0.3        0.6        0.6        -          0.4
    Business owner   0.2        -          -          0.3        -
    Default          0.4        0.1        0.1        0.1        0.4

Table 3-2. Appearance probabilities gamma_i for each component C_i.

    Component        gamma_i
    Woman            0.6
    Mother           0.2
    Business owner   0.2

Table 3-3. Example of market basket data.

    TID  Skirt  Diapers  Baby oil  Printer paper  Shampoo
    1    1      0        0         0              1
    2    1      0        0         1              0
    3    0      1        1         0              0
    4    0      0        0         1              1
    5    0      1        1         0              1

Table 3-4. Comparison of the execution time (100 iterations) of our EM learning algorithms for the synthetic data sets.

    Number of dimensions  Complete EM        Monte Carlo Sampling EM
    4                     787 seconds        246 seconds
    9                     463,516 seconds    1,906 seconds
    16                    -                  2,490 seconds
    36                    -                  4,278 seconds

Figure 3-1. Outline of our EM algorithm.

    Choose initial value for each gamma, lambda, M
    While the model continues to improve:
        Apply the appropriate update rule to get each new gamma
        Apply the appropriate update rule to get each new lambda
        Apply the appropriate update rule to get each new M
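The loop in Figure 3-1 can be illustrated with a small runnable example. Because the exact MOS E-step is the subject of Section 3.3 and Appendix A, the sketch below substitutes a standard Bernoulli mixture, for which both steps have familiar closed forms; the function names, toy data, and convergence test are our own and are not the dissertation's code.

    import numpy as np

    def em_bernoulli_mixture(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        pi = np.full(k, 1.0 / k)                  # mixing proportions
        lam = rng.uniform(0.25, 0.75, (k, d))     # Bernoulli parameters
        for _ in range(iters):
            # E-step: responsibility of each component for each point
            logp = (X @ np.log(lam).T
                    + (1 - X) @ np.log(1 - lam).T
                    + np.log(pi))
            r = np.exp(logp - logp.max(axis=1, keepdims=True))
            r /= r.sum(axis=1, keepdims=True)
            # M-step: closed-form parameter updates
            nk = r.sum(axis=0)
            pi = nk / n
            lam = np.clip((r.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
        return pi, lam

    X = (np.random.default_rng(1).random((200, 6)) < 0.5).astype(float)
    pi, lam = em_bernoulli_mixture(X, k=2)
    print(pi)

The MOS algorithm keeps this outer structure but replaces the E-step with the (exact or Monte Carlo sampled) computation over component subsets, and the M-step with update rules for gamma, lambda, and the masks M.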


Table 3-5. Number of days for which the p-values fall in the top 1% of all p-values for the Southern California High Flow Component (year-by-year counts, 1976 to 1995).

Table 3-6. Number of days for which the p-values fall in the top 1% of all p-values for the North Central California High Flow Component (year-by-year counts, 1976 to 1995).


Table 3-7. Number of days for which the p-values fall in the top 1% of all p-values for the Low Flow Component (year-by-year counts, 1976 to 1995).

Figure 3-2. Generating components for the 16-attribute dataset. A pixel indicates the probability value of the Bernoulli random variable associated with an attribute. A white pixel (a masked attribute) indicates 0 and a black pixel (an unmasked attribute) indicates 1.

Figure 3-3. Example data points from the 16-attribute dataset. For example, the leftmost data point was generated by the leftmost and the rightmost components from Figure 3-2.


Figure 3-4. Components learned using Monte Carlo EM with stratified sampling after 100 iterations. A pixel indicates the probability value of the Bernoulli random variable associated with an attribute. White pixels are masked attributes. Darker pixels indicate unmasked attributes with higher probability values.

Figure 3-5. Generating components for the 36-attribute dataset.

Figure 3-6. Components learned from the 36-attribute dataset using Monte Carlo EM with stratified sampling after 100 iterations.


Figure 3-7. Stock components learned by a 20-component MOS model. Along the columns are the 40 chosen stocks, grouped by the type of stock; along the rows are the components learned by the model. Each cell in the figure indicates the probability value of the Bernoulli random variable in greyscale, with white being 0 and black being 1.


Figure 3-8. Components learned by a 20-component MOS model. Only the sites with non-zero parameter masks are shown. The diameter of the circle at a site is proportional to the square root of the ratio of the mean parameter mu_ij to the mean flow mu_j for that site, on a log scale.


Figure 3-9. Some of the components learned by a 20-component standard Gaussian Mixture Model. The diameter of the circle at a site is proportional to the square root of the ratio of the mean parameter mu_ij to the mean flow mu_j for that site, on a log scale.


[3, 4]. A classical mixture model for this example would view the data set as being generated by a simple mixture of customer classes, with each class being modeled as a multinomial component in the mixture. Under such a model, when a customer enters a store she chooses one of the customer classes by performing a multinomial trial according to the mixture proportions, and then a random vector generated using the selected class would produce the actual rental record. The problem with such a model is that it only allows one component (customer class) to generate a data point (rental record), and thus does not account for the underlying data generation mechanism in which a customer belongs to multiple classes. More complex hierarchical mixture models [6, 5] have been proposed to interpret and visualize such data.
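For concreteness, the classical single-component generative process just described looks like the following sketch. All parameter values and names here are made up for illustration; the point is simply that exactly one class is responsible for the whole record.

    import numpy as np

    rng = np.random.default_rng(0)
    mix = np.array([0.5, 0.3, 0.2])          # mixing proportions over classes
    probs = np.array([[0.8, 0.1, 0.1, 0.2],  # per-class Bernoulli parameters
                      [0.1, 0.7, 0.6, 0.1],  # (one row per customer class)
                      [0.2, 0.2, 0.1, 0.9]])

    def sample_record():
        c = rng.choice(len(mix), p=mix)      # one multinomial trial picks a class
        x = (rng.random(probs.shape[1]) < probs[c]).astype(int)
        return x, c                          # the single class c emits everything

    record, cls = sample_record()
    print(cls, record)

A customer who is, say, both a parent and a business owner cannot be represented by any single row of `probs`, which is precisely the limitation the models in this chapter remove.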


The Indian Buffet Process (IBP) [23] is perhaps the best existing choice for such data. It is a recently derived distribution that can be used as a prior distribution for Bayesian generative models, and it allows each data point to belong to potentially any combination of the infinitely many classes. While the IBP provides a clean mathematical framework for a data point to be generated by multiple classes, it does not define how these classes combine to generate the actual data. We feel that this is the key aspect of a generative model for the example scenario, and a successful approach to model such multi-class data must address it.


Section 4.3 of the chapter discusses our Gibbs sampler for learning the model from a data set. Section 4.4 of the chapter details some example applications of the model, Section 4.5 discusses related work, and Section 4.6 concludes the chapter.






[34].






The conditional distribution for each weight parameter m_{i,j} takes the form:

    F(m_{i,j} \mid X, c, m, \gamma, \Theta) \propto \beta(m_{i,j} \mid q, r) \prod_{a} \prod_{j} w_{g_{a,j},\,j}^{\,I(c_{a,g_{a,j}} = 1)}


Appendix B, using both synthetic and real-life data sets. For the rest of this chapter, we assume that such an approximation exists and works very well both on synthetic and real-life data sets. In the next section, we discuss our experimental results based on both synthetic and real-life data sets.

The generating components are given in Table 4-1. In the learning phase, the parameters for the mean and standard deviation in the generators were initialized to the mean and the standard deviation of the data set. The weight for each attribute was set to 1/4. The parameters a and b controlling the prior for the appearance probability were set to 100 and 300, respectively. The parameters q and r control the prior for the weights.
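A sketch of this initialization, under assumed array shapes (k components, d attributes), is given below. The variable names are ours, and drawing the initial appearance probabilities from their Beta(a, b) prior is an assumption; the text fixes only the hyperparameter values.

    import numpy as np

    def init_state(X, k, a=100.0, b=300.0, seed=0):
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        return {
            "mu":    np.tile(X.mean(axis=0), (k, 1)),  # means start at the data mean
            "sigma": np.tile(X.std(axis=0),  (k, 1)),  # stds start at the data std
            "w":     np.full((k, d), 0.25),            # attribute weights = 1/4
            "gamma": rng.beta(a, b, size=k),           # appearance probs ~ Beta(a, b)
        }

    state = init_state(np.random.default_rng(1).normal(size=(100, 4)), k=4)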


The learned parameter values are shown in Table 4-2.


1000. The parameters q and r that control the prior for weights were set to 1 each.

The clusters learned are shown in Figures 4-2 and 4-3. For each component in the model, we report all the words that have weights at least five times larger than the default weight, and with Bernoulli probability indicating presence of the word (i.e., p > 0.5). Only non-empty clusters meeting the above criteria are shown. The appearance probabilities of the components are listed in Table 4-3.
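The reporting rule just described amounts to a simple filter over the learned parameters. The sketch below shows that filter under assumed array names and shapes; it is an illustration of the criterion, not code from the dissertation.

    import numpy as np

    def report_words(vocab, w, w_default, p, factor=5.0, p_min=0.5):
        # Keep a word when its weight is at least `factor` times the default
        # component's weight and its Bernoulli parameter indicates presence.
        keep = (w >= factor * w_default) & (p > p_min)
        return [(vocab[j], p[j]) for j in np.flatnonzero(keep)]

    vocab = np.array(["network", "speech", "theorem"])
    print(report_words(vocab,
                       np.array([0.9, 0.1, 0.6]),    # learned weights
                       np.array([0.1, 0.1, 0.1]),    # default weights
                       np.array([0.99, 0.4, 0.8])))  # Bernoulli parameters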


Cadez et al. [5] present a probabilistic mixture modeling based framework to model customer behavior in transactional data. In their model, each transaction is generated by one of the k components (customer profiles). Associated with each customer is a set of k weights that govern the probability of an individual to engage in a shopping behavior like one of the customer profiles. Thus, they model a customer as a mixture of the customer profiles. The key difference between this approach and our model lies in how the data is modeled. In this case, our model would view each transaction as a weighted mixture of subsets of the customer profiles. As noted in the introduction, this allows a transaction to be generated in which a customer could act out multiple customer profiles at the same time. This may provide a more natural generative process to interpret and visualize transactional data.

Cadez et al. [6] propose a generative framework for probabilistic model-based clustering of individuals where data measurements for each individual may vary in size. In this generative model, each individual has a set of membership probabilities that she belongs to one of the k clusters, and each of these k clusters has a parameterized data-generating probability distribution. Cadez et al. model the set of data sequences associated with an individual as a mixture of these k data-generating clusters. They also outline an EM approach that can be applied to this model and show an example of how to cluster individuals based on their web browsing data under this model. Cadez et al. model an individual as a mixture of data-generating clusters, whereas our model would view the data points as a mixture of subsets of data-generating components. Second, the goal of their approach is to group individuals into clusters, whereas our goal is to simply learn a model that provides a probabilistic model-based interpretation of the observed data.


Griffiths and Ghahramani [23] have derived a prior distribution for Bayesian generative models that allows each data point to belong to potentially any combination of the infinitely many classes. However, they do not define how these classes combine to generate the actual data. While their work has significant impact for Bayesian mixture modeling, from a data mining perspective the key aspect for the current problem is not how multiple classes can be selected, but how they interact with each other to produce observable data.

Graham and Miller [24] have proposed a naive-Bayes mixture model that allows each component in the mixture its own feature subset via the use of binary switch variables, with all other features explained by a single shared component. While this allows a component to choose its influence over a subset of data attributes, there is no framework to indicate a strong or a weak influence. Under this model, only two components can influence a data point at the same time (the generating component and the shared component), which still prevents multiple classes from interacting simultaneously for data generation.

In Somaiya et al. [35], we have presented a mixture-of-subsets model that allows multiple components to influence a data point, where each component can choose to influence a subset of the data attributes. We have also developed an EM algorithm for learning models under the MOS framework, and formulated a unique Monte Carlo approach that makes use of stratified sampling to perform the E-step in our EM algorithm. There are two key differences in our approach here. First, the previous work suffers from the general criticism of MLE-based approaches: it only provides a point estimate for the model parameters. Hence, the user is left with no clue about the error in this estimate. The key benefit of using the Bayesian framework here is that the output is a distribution of model parameters rather than a single point estimate.




Table 4-1. The four generating components for the synthetic data set. The generator for each attribute is expressed as a triplet of parameter values (Mean, Standard deviation, Weight).

    Appearance probability  Attribute #1     Attribute #2     Attribute #3     Attribute #4
    0.2492                  (300, 20, 0.4)   (600, 20, 0.1)   (900, 20, 0.1)   (1200, 20, 0.4)
    0.2528                  (600, 20, 0.1)   (900, 20, 0.4)   (1200, 20, 0.4)  (300, 20, 0.1)
    0.2328                  (900, 20, 0.4)   (1200, 20, 0.1)  (300, 20, 0.4)   (600, 20, 0.1)
    0.2339                  (1200, 20, 0.1)  (300, 20, 0.4)   (600, 20, 0.1)   (900, 20, 0.4)

Table 4-2. Parameter values learned from the data set after 1000 Gibbs iterations. We have computed the average over the last 100 iterations. Each attribute is expressed as a triplet of parameter values (Mean, Standard deviation, Weight). All values have been rounded off to their respective precisions.

    Appearance probability  Attribute #1        Attribute #2        Attribute #3        Attribute #4
    0.3284                  (298, 20.4, 0.37)   (600, 19.4, 0.11)   (900, 19.7, 0.10)   (1201, 19.6, 0.43)
    0.3658                  (600, 19.8, 0.10)   (901, 20.3, 0.42)   (1200, 19.0, 0.38)  (299, 19.2, 0.10)
    0.3286                  (898, 20.8, 0.45)   (1197, 20.8, 0.09)  (303, 21.2, 0.33)   (599, 19.6, 0.13)
    0.3201                  (1201, 20.4, 0.12)  (300, 19.7, 0.40)   (598, 19.1, 0.10)   (900, 20.5, 0.39)

Table 4-3. Appearance probabilities of the clusters learned from the NIPS dataset.

    Cluster #  Appearance probability
    1          0.3374
    2          0.1497
    3          0.1901
    4          0.4025
    5          0.2785
    6          0.2597
    7          0.1293
    8          0.2557
    9          0.4036
    10         0.1774
    11         0.2745
    12         0.1192


Figure 4-1. The generative model. A circle denotes a random variable in the model.


Cluster 1: theorem 0.9975, theory 0.9955, tion 0.7772, variables 0.8628, zero 0.9795

Cluster 2: processing 0.9967, pulse 0.8582, separation 0.9846, signal 0.9966, sound 0.9776, speech 0.9940

Cluster 3: membrane 0.9971, neuron 0.9981, pulse 0.5786, spike 0.9964, spikes 0.9931, stimulus 0.9930, supported 0.9931, synapse 0.8484, synapses 0.9639, synaptic 0.9935, temporal 0.8806

Cluster 4: input 0.9992, layer 0.9988, network 0.9997, neural 0.9995, output 0.9983, target 0.7364, trained 0.9986, training 0.9993, unit 0.9970, values 0.9828, weight 0.9978

Cluster 5: data 0.9987, hmm 0.6940, performance 0.9961, recognition 0.9966, set 0.9990, speech 0.9950, test 0.9956, trained 0.9978, training 0.9994, vector 0.9973

Cluster 6: images 0.9984, pca 0.5951, pixel 0.9967, segmentation 0.8851, structure 0.7222, theory 0.5410, vertical 0.8439, vision 0.9961, visual 0.9973, white 0.5620

Figure 4-2. Clusters learned from the NIPS papers dataset. For each cluster, we report the word and its associated Bernoulli probability.


Cluster 7: dynamic 0.9914, learning 0.9979, policy 0.9943, reinforcement 0.9975, reward 0.9912, states 0.9387, sutton 0.9877, system 0.9986, temporal 0.8441, trajectories 0.9413, trajectory 0.9826, transition 0.8682, trial 0.9660, world 0.9703

Cluster 8: hmm 0.9322, likelihood 0.9975, model 0.9989, parameter 0.9974, prior 0.9008, probabilities 0.9868, probability 0.9495, statistical 0.9645, term 0.9153, variable 0.9717, variables 0.9850, variance 0.9875

Cluster 9: term 0.9899, tion 0.6632, values 0.8855, vector 0.9919, zero 0.9980

Cluster 10: spatial 0.9780, stimuli 0.9918, stimulus 0.9866, supported 0.9569, visual 0.9961

Cluster 11: learning 0.9987, network 0.9993, system 0.9969, term 0.9036, trained 0.9795, training 0.9963, unit 0.9906, volume 0.6785, weight 0.9949, william 0.9602

Cluster 12: circuit 0.9985, implementation 0.9941, input 0.9957, output 0.9799, pulse 0.9460, system 0.9984, transistor 0.9869, vlsi 0.9971, voltage 0.9956

Figure 4-3. More clusters learned from the NIPS papers dataset. For each cluster, we report the word and its associated Bernoulli probability.





The priors on the mixing proportions at start time (b), middle time (m) and end time (e) take the form:

    p(b) \propto \alpha_b^{-3/2} \exp\left(-\frac{1}{2\alpha_b}\right) \frac{1}{\Gamma^k(\alpha_b)} \prod_{j=1}^{k} b_j^{\alpha_b - 1}

    p(m) \propto \alpha_m^{-3/2} \exp\left(-\frac{1}{2\alpha_m}\right) \frac{1}{\Gamma^k(\alpha_m)} \prod_{j=1}^{k} m_j^{\alpha_m - 1}

    p(e) \propto \alpha_e^{-3/2} \exp\left(-\frac{1}{2\alpha_e}\right) \frac{1}{\Gamma^k(\alpha_e)} \prod_{j=1}^{k} e_j^{\alpha_e - 1}

The conditional posteriors for the mixing proportions at start time, middle time and end time can be written as:

    p(b_j \mid -) = G(b_j \mid \alpha_b) \prod_i p(c_i \mid b_j)

    p(m_j \mid -) = G(m_j \mid \alpha_m) \prod_i p(c_i \mid m_j)

    p(e_j \mid -) = G(e_j \mid \alpha_e) \prod_i p(c_i \mid e_j)
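To make the role of the three anchor vectors concrete, the following is a small sketch, with made-up anchor values, of how a mixing-proportion vector evolves piece-wise linearly between its start-, middle-, and end-time values. Placing the anchors at the ends and midpoint of a unit time window is our simplifying assumption. Note that each interpolated value is a convex combination of two probability vectors, so it remains a valid set of proportions without renormalization.

    import numpy as np

    def mixing_at(t, b, m, e):
        # t in [0, 1]; anchors b, m, e sit at t = 0, 0.5, and 1.
        if t <= 0.5:
            return b + (m - b) * (t / 0.5)       # first linear piece
        return m + (e - m) * ((t - 0.5) / 0.5)   # second linear piece

    b = np.array([0.7, 0.3])   # proportions at the start of time
    m = np.array([0.4, 0.6])   # proportions at the middle of time
    e = np.array([0.8, 0.2])   # proportions at the end of time
    print(mixing_at(0.25, b, m, e))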


Experimental setup. To test the learning capabilities of the model in controlled environments, we generated many simple synthetic data sets with a small number of clusters. We allowed the mixing proportions of the clusters to vary following simple elliptical and beta-PDF-like functions. We assumed a total of 100 time ticks, and 20 data points per time tick, giving a total of 2000 data points. We assumed one-dimensional normal generators for all clusters, and generated the data. For learning, we ran 1500 iterations of our Gibbs sampling algorithm and report the results averaged over the last 500 of them.

The actual generating mixing proportions are shown in Figure 5-1, while the learned mixing proportions are indicated by dashed lines.
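A sketch of this data-generation setup is given below. The exact curves and generator parameters used in the experiments are not specified in the text, so the elliptical prevalence curve and the two normal generators here are placeholders of our own choosing.

    import numpy as np

    rng = np.random.default_rng(0)
    ticks, per_tick = 100, 20
    t = np.linspace(0.0, 1.0, ticks)
    # Elliptical curve for the prevalence of cluster 1, kept inside (0, 1).
    p1 = 0.1 + 0.8 * np.sqrt(np.clip(1.0 - (2.0 * t - 1.0) ** 2, 0.0, 1.0))

    data = []
    for ti, p in zip(t, p1):
        z = rng.random(per_tick) < p                      # cluster memberships
        x = np.where(z, rng.normal(0.0, 1.0, per_tick),   # cluster 1 generator
                        rng.normal(5.0, 1.0, per_tick))   # cluster 2 generator
        data.append((ti, x))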


As seen in Figure 5-1, our model and learning algorithm have done a very good job of constructing a piece-wise linear model around the actual generating mixing proportions. This shows that our modeling framework and associated learning algorithm perform well under smoothly changing mixing proportions.

Experimental setup. The California Stream Flow Dataset is a data set that we have created by collecting the stream flow information at various US Geological Survey (USGS) locations scattered in California. This information is publicly available at the USGS website. We have collected the daily flow information, measured in cubic feet per second (CFPS), from 80 sites from 1st January, 1976 through 31st December, 1995. Thus, we have a data set containing 7305 records, with each record containing 80 attributes. Each attribute is a real number indicating the flow at a particular site in CFPS. We normalize each attribute across the records so that all values fall in [0, 1]. We assume that each attribute is produced by a normally distributed random variable, and hence try to learn its parameters: mean and standard deviation. Along with each data point, we also recognize its time-stamp as the day of the year. We ignore the data for February 29th from the leap years. We collate the data points based on the day of the year. Thus we obtain a data set which has 365 time ticks, and 20 data points per time tick. One of the reasons to select this data set was that historical information about precipitation in California is well known, and hopefully we will observe changes in the prevalence of high and low water flows consistent with it. We learn a two-component model that allows evolving mixing proportions from this data set. We allow for six time slices, so that we can get a good sense of the change in mixing proportions. Another significant change that we make is that we assume that the mixing proportions at the start of time (1st January) are the same as the mixing proportions at the end of time (31st December).
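The collation step can be sketched as follows, assuming the daily flows live in a pandas DataFrame indexed by date (the random data and names here are stand-ins for the actual USGS records).

    import numpy as np
    import pandas as pd

    def collate_by_day_of_year(df):
        # Drop Feb 29 from leap years.
        df = df[~((df.index.month == 2) & (df.index.day == 29))]
        # Per-site min-max scaling so all values fall in [0, 1].
        norm = (df - df.min()) / (df.max() - df.min())
        # 365 groups (time ticks), one row per year in each group.
        return norm.groupby([norm.index.month, norm.index.day])

    dates = pd.date_range("1976-01-01", "1995-12-31", freq="D")
    flows = pd.DataFrame(np.random.rand(len(dates), 3), index=dates)
    groups = collate_by_day_of_year(flows)   # ~20 data points per time tick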


The two components learned are shown in Figure 5-2, and the change in prevalence of these flows can be seen in Figure 5-3.

Experimental setup. We apply our model to real-life resistance data describing the resistance profile of E. coli isolates collected from a group of hospitals. E. coli is a food-borne pathogen and a bacterium that normally resides in the lower intestine of warm-blooded animals. There are hundreds of strains of E. coli, and some strains can cause illness such as serious food poisoning in humans. The data set consists of 9660 E. coli isolates tested against 27 antibiotics, collected over a period from year 2004 to year 2007. Each data point represents the susceptibility of a single isolate collected at one of several real-life hospitals. We use a Bernoulli generator indicating susceptible or resistant for each of the test results. Undetermined states and missing values are ignored for this experiment. We set the number of mixture components to be 5, and allow for a total of 3 time slices. We run our learning algorithm for 5000 iterations, and report the results averaged over the last 3000 iterations.


The learned components are shown in Figure 5-4. For each strain, we have shown its susceptibility against the 27 antibiotics as a probability. We also show how the prevalence of each strain has changed over time from the year 2004 to the year 2007.

There has been some interest [26, 27] in mining document classes that evolve over time. However, these generative models are Latent Dirichlet Allocation (LDA) style models, which are specific to document clustering, and do not extend to mixture models that would allow arbitrary probabilistic generators. There is also some existing work related to evolutionary clustering [28]. However, the primary focus there is how to ensure smoothness in the evolution of clusters.


Song et al. [29] have proposed a Bayesian mixture model with linear regression mixing proportions. However, they allow only the mixing proportions to evolve over time, and only as a simple linear regression between the values at the start of time and the values at the end of time. This poses two limitations: richer trends in mixing proportions cannot be learned, and the components themselves are fixed over time.


Figure 5-1. Evolving model parameters learned from the synthetic data set.


Figure 5-2. Components learned by a 2-component evolving mixing proportions model. The diameter of the circle at a site is proportional to the ratio of the mean parameter to the mean flow for that site.

Figure 5-3. Change in prevalence of the flow components shown in Figure 5-2 with time.


Figure 5-4. Evolving model parameters learned from the E. coli dataset. A) Cluster 1. B) Cluster 2. C) Cluster 3. D) Cluster 4. E) Cluster 5. F) Mixing proportions.


In Section 3.3.2, we have shown that computing the exact value of Q is impractical even for a moderate-sized data set. In this appendix, we discuss how we compute an unbiased estimator \hat{Q} by sampling from the set of strings generated by all S_1, S_2 combinations. We also present a stratified sampling based approach, and an allocation scheme that attempts to minimize the variance of this estimator.

Let us first define an identifier function I that takes a boolean parameter b:

    I(b) = \begin{cases} 0 & \text{if } b = \text{false} \\ 1 & \text{if } b = \text{true} \end{cases}

Using this identifier function, we can define

    l(x_a, S_1, b) = \sum_{S_2} I(b)\, H_{a, S_1, S_2}

where the sum ranges over all possible values of S_2, and rewrite the Q function of Chapter 3 in these terms. The structure of this computation is laid out in Figure A-1. Note that the number of rows in this table is exponential in terms of k, the number of components in the model. Computing the exact sum across all the rows and columns becomes prohibitively expensive even for moderate values of k; hence the need for sampling amongst the cells in this figure to estimate Q. Simple uniform random sampling over these cells is possible, but stratifying the cells allows the variance of the estimate to be controlled.


A simplified structure of the same computation is shown in Figure A-2. Hence, we can write the Q function as a double sum of per-cell terms:

    Q(\theta', \theta) = \sum_{a=1}^{n} \sum_{j=1}^{k} t(x_a, j)


Thus, the estimator for Q can be written as:

    \hat{Q}(\theta', \theta) = \sum_{a=1}^{n} \sum_{j=1}^{k} \hat{t}(x_a, j)

Given that we want to have a limited, fixed budget for the total number of samples, we now introduce an allocation scheme to determine the sample size for each cell so as to minimize the variance of \hat{Q}. Let n_{sample} be the total number of samples that we want from the whole population. Since the estimator \hat{Q} is a sum of \hat{t} estimators, the variance of the estimator \hat{Q} can be expressed as the sum of the variances of the \hat{t} estimators.


Substituting this value of n_{x_a,j} into the variance expression and solving yields the following solution:

    n_{x_a,j} = n_{sample} \cdot \frac{\sqrt{\mathrm{Var}(\hat{t}(x_a, j))}}{\sum_{a'} \sum_{j'} \sqrt{\mathrm{Var}(\hat{t}(x_{a'}, j'))}}

so that each cell receives samples in proportion to the standard deviation of its estimator. We sample each cell accordingly, and subsequently prove that the resulting estimator is unbiased. Let us now estimate the constant associated with \log \theta'_i.
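Returning to the allocation rule above, the following is a minimal sketch of variance-minimizing (Neyman-style) allocation for a fixed sampling budget. The per-stratum standard deviations would in practice be estimated from the previous EM iteration; the values and names below are ours.

    import numpy as np

    def allocate(n_sample, strata_std):
        # Samples proportional to each stratum's standard deviation;
        # the total matches the budget up to rounding.
        raw = n_sample * strata_std / strata_std.sum()
        return np.maximum(1, np.round(raw)).astype(int)  # at least one sample each

    print(allocate(1000, np.array([0.5, 2.0, 0.1, 1.4])))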


The procedure for computing this estimate is given in Figure A-3. Now, we show that the estimator c1_{estimate} is an unbiased estimator for c_{1,i}. Let S_{1,1}, S_{1,2}, S_{1,3}, ... be all possible values of S_1, and let S_{2,1}, S_{2,2}, S_{2,3}, ... be all possible values of S_2. Let \delta_{1,1}, \delta_{1,2}, \delta_{1,3}, ... be the sampling variables associated with S_{1,1}, S_{1,2}, S_{1,3}, ..., and let \delta_{2,1}, \delta_{2,2}, \delta_{2,3}, ... be the sampling variables associated with S_{2,1}, S_{2,2}, S_{2,3}, .... Then, based on the procedure ComputeC1(i) as defined above, we can write the value of c1_{estimate}(i) in terms of these sampling variables.




To summarize, in this appendix we have presented a stratified sampling based scheme to compute an unbiased estimator \hat{Q} for our EM algorithm. We have also shown a principled way in which we can update the sample sizes for each of the strata after every iteration, so as to minimize the variance of this estimator.

Figure A-1. The structure of computation for the Q function.

Figure A-2. A simplified structure of computation for the Q function.


Figure A-3. Computing an estimate for c_{1,i}.


In Chapter 4, we derived the conditional distribution of the weight parameter m_{i,j}:

    F(m_{i,j} \mid X, c, m, \gamma, \Theta) \propto \beta(m_{i,j} \mid q, r) \prod_{a} \prod_{j} w_{g_{a,j},\,j}^{\,I(c_{a,g_{a,j}} = 1)}


As listed in Table B-1, the synthetic data sets consist of four-dimensional real-valued and zero-one data. The real-life real-valued data set was created using water level records at various sites in California collected by USGS. The real-life zero-one data set was created using upward/downward stock movements for a subset of the S&P 500. For each data set we have randomly picked one of the iterations of the learning algorithm, and one of the m_{i,j} values. In Figure B-1, we show a plot of both the original conditional distribution and the beta approximation for all four data sets. Each subplot has been normalized for easy comparison, and we have zoomed in to the region where the mass of the distributions is concentrated. Visually, it is hard to tell the approximation apart from the original distribution.

For a quantitative comparison between the two distributions, we can compute the KL-divergence between them. The Kullback-Leibler divergence (also known as relative entropy) is a measure of the difference between two probability distributions A and B. It measures the extra information required to code samples from the original distribution A while actually using codes based on the approximating distribution B. For discrete distributions A and B, the KL-divergence of B from A is defined as:

    KL(A \,\|\, B) = \sum_i A(i) \log \frac{A(i)}{B(i)}
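The discretization-based version of this check can be sketched as follows: evaluate both densities on a grid, normalize them into histograms, and sum A(i) log(A(i)/B(i)). The "original conditional" below is a toy stand-in, and fitting the beta approximation by simple moment matching is our assumption; the dissertation's fitting procedure is described in the text.

    import numpy as np
    from math import lgamma

    def beta_pdf(x, p, q):
        log_b = lgamma(p) + lgamma(q) - lgamma(p + q)   # log Beta(p, q)
        return np.exp((p - 1) * np.log(x) + (q - 1) * np.log(1 - x) - log_b)

    def kl_discrete(pdf_a, pdf_b, grid):
        a = pdf_a(grid); a /= a.sum()                   # normalized histograms
        b = pdf_b(grid); b /= b.sum()
        m = a > 0
        return float(np.sum(a[m] * np.log(a[m] / b[m])))

    grid = np.linspace(1e-4, 1 - 1e-4, 10_000)
    target = lambda x: beta_pdf(x, 3.0, 7.0) * (0.9 + 0.2 * x)  # toy conditional
    # Moment-match a beta to the target's mean and variance.
    wts = target(grid) / target(grid).sum()
    mu = np.sum(grid * wts)
    var = np.sum((grid - mu) ** 2 * wts)
    s = mu * (1 - mu) / var - 1                                 # p + q
    approx = lambda x: beta_pdf(x, mu * s, (1 - mu) * s)
    print(kl_discrete(target, approx, grid))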


In Table B-2, we show the computed values of the KL-divergence of the beta approximation from the original conditional distribution, using both Matlab's built-in quadrature and simple discretization. It becomes quite clear that, based on the KL-divergence, we have a very good approximation of the original distribution.

Table B-1. Details of the data sets used for qualitative testing of the beta approximation.

    Id  Type       Generators  Data points  Dimensions  Components
    1   Real-life  Normal      500          80          5
    2   Synthetic  Normal      1000         4           5
    3   Synthetic  Bernoulli   2000         4           5
    4   Real-life  Bernoulli   2800         41          10

Table B-2. Quantitative testing of the beta approximation.

    Id  Iteration #  Component #  Dimension #  KL (quadrature)  KL (discretized)
    1   10           2            3            0.000000         0.066571
    2   50           3            1            0.000000         0.234133
    3   20           2            4            0.000000         0.000001
    4   57           2            1            0.000009         0.363120


Figure B-1. Comparison of the PDFs for the conditional distribution of the weight parameter with its beta approximation for 4 datasets: A) Dataset 1, B) Dataset 2, C) Dataset 3, D) Dataset 4. Each chart is normalized for easy comparison and has been zoomed in to the region where the mass of the PDFs is concentrated. Details about the datasets can be found in Tables B-1 and B-2.


[1] K. Pearson, Contributions to the mathematical theory of evolution, Philosophical Transactions of the Royal Society of London. A, vol. 185, pp. 71-110, 1894.

[2] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, vol. B-39, pp. 1-38, 1977.

[3] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker, 1988.

[4] G. J. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.

[5] I. Cadez, P. Smyth, and H. Mannila, Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction, in KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2001, pp. 37.

[6] I. Cadez, S. Gaffney, and P. Smyth, A general probabilistic framework for clustering individuals and objects, in KDD '00: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2000, pp. 140.

[7] I. S. Dhillon, S. Mallela, and D. S. Modha, Information-theoretic co-clustering, in KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2003, pp. 89.

[8] I. S. Dhillon and Y. Guan, Information theoretic clustering of sparse co-occurrence data, in ICDM '03: Proceedings of the Third IEEE International Conference on Data Mining. Washington, DC, USA: IEEE Computer Society, 2003, p. 517.

[9] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha, A generalized maximum entropy approach to Bregman co-clustering and matrix approximation, in KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2004, pp. 509.

[10] B. Gao, T.-Y. Liu, X. Zheng, Q.-S. Cheng, and W.-Y. Ma, Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering, in KDD '05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. New York, NY, USA: ACM Press, 2005, pp. 41.


[11] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park, Fast algorithms for projected clustering, in SIGMOD '99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 1999, pp. 61.

[12] C. C. Aggarwal and P. S. Yu, Finding generalized projected clusters in high dimensional spaces, in SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 2000, pp. 70.

[13] K.-G. Woo, J.-H. Lee, M.-H. Kim, and Y.-J. Lee, FINDIT: a fast and intelligent subspace clustering algorithm using dimension voting, Information & Software Technology, vol. 46, no. 4, pp. 255, 2004.

[14] J. Yang, W. Wang, H. Wang, and P. Yu, Delta-clusters: Capturing subspace correlation in a large dataset, in ICDE '02: Proceedings of the 18th International Conference on Data Engineering. Los Alamitos, CA, USA: IEEE Computer Society, 2002, pp. 517.

[15] J. Friedman and J. Meulman, Clustering objects on subsets of attributes, Journal of the Royal Statistical Society Series B (Statistical Methodology), vol. 66, no. 4, pp. 815, 2004.

[16] R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, in VLDB '94: Proceedings of the 20th International Conference on Very Large Databases. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1994, pp. 487.

[17] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, in SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 1998, pp. 94.

[18] C.-H. Cheng, A. W. Fu, and Y. Zhang, Entropy-based subspace clustering for mining numerical data, in KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 1999, pp. 84.

[19] H. Nagesh, S. Goil, and A. Choudhary, MAFIA: Efficient and scalable subspace clustering for very large data sets, 1999.

[20] J.-W. Chang and D.-S. Jin, A new cell-based clustering method for large, high-dimensional data in data mining applications, in SAC '02: Proceedings of the 2002 ACM Symposium on Applied Computing. New York, NY, USA: ACM Press, 2002, pp. 503.


[21] B. Liu, Y. Xia, and P. S. Yu, Clustering through decision tree construction, in CIKM '00: Proceedings of the Ninth International Conference on Information and Knowledge Management. New York, NY, USA: ACM Press, 2000, pp. 20.

[22] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali, A Monte Carlo algorithm for fast projective clustering, in SIGMOD '02: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 2002, pp. 418.

[23] T. Griffiths and Z. Ghahramani, Infinite latent feature models and the Indian buffet process, in Advances in Neural Information Processing Systems 18, Y. Weiss, B. Scholkopf, and J. Platt, Eds. Cambridge, MA: MIT Press, 2006, pp. 475.

[24] M. Graham and D. Miller, Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection, IEEE Transactions on Signal Processing, vol. 54, no. 4, pp. 1289-1303, 2006.

[25] G. J. McLachlan, R. W. Bean, and D. Peel, A mixture model-based approach to the clustering of microarray expression data, Bioinformatics, vol. 18, no. 3, pp. 413, 2002.

[26] D. M. Blei and J. D. Lafferty, Dynamic topic models, in ICML '06: Proceedings of the 23rd International Conference on Machine Learning. New York, NY, USA: ACM, 2006, pp. 113.

[27] X. Wang and A. McCallum, Topics over time: a non-Markov continuous-time model of topical trends, in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2006, pp. 424.

[28] D. Chakrabarti, R. Kumar, and A. Tomkins, Evolutionary clustering, in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2006, pp. 554.

[29] X. Song, C. Jermaine, S. Ranka, and J. Gums, A Bayesian mixture model with linear regression mixing proportions, in KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2008, pp. 659.

[30] D. J. Aldous, Exchangeability and related topics, in Lecture Notes in Mathematics. Berlin: Springer, 1985, vol. 1117.

[31] J. Pitman, Combinatorial stochastic processes, Notes for Saint Flour Summer School, 2002.

[32] J. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, University of Berkeley, Tech. Rep. ICSI-TR-97-021, 1998.


[33] S. Amari, Information geometry of the EM and em algorithms for neural networks, Neural Networks, vol. 8, no. 9, pp. 1379, 1995.

[34] C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 2005.

[35] M. Somaiya, C. Jermaine, and S. Ranka, Learning correlations using the mixture-of-subsets model, ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 4, pp. 1, 2008.


Manas hails from the small town of Jamnagar in the western state of Gujarat in India. He did his schooling in the L. G. Haria High School in Jamnagar. Later, he moved to Ahmedabad to obtain his B.E. in electronics and communications from Nirma Institute of Technology in 2000. After completing his undergraduate education in India, he moved to the U.S.A. for his graduate studies. He earned his M.S. in computer networking from North Carolina State University in 2001. He earned his M.S. and Ph.D. in computer engineering from the University of Florida in 2009.