Citation
Clustering on Feature Subspaces in Data Mining

Material Information

Title:
Clustering on Feature Subspaces in Data Mining
Creator:
Namkoong, Young
Place of Publication:
[Gainesville, Fla.]
Publisher:
University of Florida
Publication Date:
2010
Language:
English
Physical Description:
1 online resource (131 p.)

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering
Computer and Information Science and Engineering
Committee Chair:
Dankel, Douglas D.
Committee Members:
Liu, Chien-Lian
Bermudez, Manuel E.
Arroyo, Amauri A.
Zhou, Lei
Graduation Date:
8/7/2010

Subjects

Subjects / Keywords:
Algorithms ( jstor )
Automatic classification ( jstor )
Barolo ( jstor )
Bee clustering ( jstor )
Datasets ( jstor )
Experimental results ( jstor )
International conferences ( jstor )
Mining ( jstor )
Statistical models ( jstor )
T distribution ( jstor )
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
clustering, data, feature
Genre:
Electronic Thesis or Dissertation
born-digital ( sobekcm )
Computer Engineering thesis, Ph.D.

Notes

Abstract:
For a given data matrix, classification methods form a set of groups so that data objects within each group are mutually homogeneous but heterogeneous to those in the other groups. In particular, this grouping process for data without class labels is called unsupervised classification or clustering. When little or no prior knowledge is available about the structure of data, clustering is considered a promising data analysis technique and has been successfully applied to solve various problems in environmental pollution research, gene expression data analysis, and text mining.

To discover interesting patterns in these applications, most clustering algorithms take into account all of the features represented in the data. However, in many cases, most of these features may be meaningless and disturb the analysis process. For this reason, selecting the proper subset of features needed to represent the clusters, called feature selection, has been an interesting problem in clustering to improve the performance of these algorithms.

Many previous approaches have investigated efficient techniques to select relevant features for clustering, but several drawbacks have been identified. First, most previous feature selection approaches assume that features are divided into two subsets: those that are meaningful and those that are meaningless for describing clusters. However, the original feature vector can consist of a number of feature subsets, making these previous feature selection approaches unsuitable. Second, many datasets in various application domains contain multiple different clusters co-existing in different subspaces of the whole feature space. In this case, feature selection techniques can be inappropriate for extracting the features that identify all clusters in the dataset.

In this dissertation, we investigate these problems. For the first issue, we present a novel approach to reveal multiple disjoint feature subsets for various clusters based on the mixture model. These multiple feature subsets can be represented by a feature partition. Our approach seeks the desired feature partition in which each feature subset can be expressed by a best-fit mixture model minimizing the model selection criterion. For the other problem, finding a special cluster structure (called a subspace cluster), we propose a new approach to create constrained feature subspaces. By utilizing the property that strong relationships can exist between features, meaningful feature subspaces can be created and a large number of non-informative feature subspaces can be pruned. Based on these informative feature subspaces, we show that the various subspace clusters can be represented by fitting several finite mixture models. Extensive experimental results illustrate our ability to extract useful information from data as to which features contribute to the discovery of clusters. In addition, we demonstrate that our approaches can be applied to various research areas. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2010.
Local:
Adviser: Dankel, Douglas D.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2012-08-31
Statement of Responsibility:
by Young Namkoong.

Record Information

Source Institution:
UFRGP
Rights Management:
Copyright Namkoong, Young. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
8/31/2012
Resource Identifier:
709593016 ( OCLC )
Classification:
LD1780 2010 ( lcc )


Full Text


I would like to thank all the people who provided me with their help during my Ph.D. years. First of all, I would like to thank my advisor, Dr. Douglas D. Dankel II. Without his patience, encouragement, and constant inspiration, this dissertation would not have been possible. I am grateful to my supervisory committee members, Dr. Jonathan C. L. Liu, Antonio A. Arroyo, Manuel Bermudez, and Lei Zhou, for their invaluable suggestions and comments. I met many nice people in Gainesville. Especially, I thank Youngseok Kim, Dooho Park, and Heh-Young Moon, who have shared joys and sorrows with me during many years in Gainesville. Finally, my deepest gratitude goes to my family: my parents (Hail Namgoong and Jung-in Hyun), my sister (Hee Namgoong), and my brother's family (Seok-Hwan Namgoong, Mina Song, and Seon Namgoong). They encouraged, supported, and prayed for me my whole life. My study would not have been possible without their love, patience, and concern. I hope that my graduation gives them happiness and pride.


ACKNOWLEDGMENTS 4
LIST OF TABLES 7
LIST OF FIGURES 8
ABSTRACT 9
CHAPTER
1 INTRODUCTION 11
  1.1 Feature-Subset-Wise Clustering Using Stochastic Search 12
  1.2 Robust Clustering Using Finite Mixture Model on Variable Subspaces 15
  1.3 Outline of This Dissertation 18
2 BACKGROUNDS OF CLUSTERING 19
  2.1 The Basic Concepts of Clustering 19
  2.2 Proximity Measures in Clustering 21
  2.3 Hierarchical Clustering 24
  2.4 Partitional Clustering 25
    2.4.1 K-means Clustering 26
    2.4.2 Fuzzy Clustering 27
  2.5 Model-Based Clustering 28
  2.6 Clustering Based on Stochastic Search 30
    2.6.1 Simulated Annealing 31
    2.6.2 Deterministic Annealing 31
  2.7 Other Clustering Algorithms 32
    2.7.1 Sequential Clustering 32
    2.7.2 Clustering Based on The Graph Theory and Structure 33
    2.7.3 Kernel-Based Clustering 34
  2.8 Biclustering 35
    2.8.1 The Basic Concept of Biclustering 35
    2.8.2 Related Work of Biclustering 35
  2.9 Subspace Clustering 38
    2.9.1 Subspace Clustering Using The Bottom-Up Techniques 38
    2.9.2 Subspace Clustering Based on Top-Down Approach 40
    2.9.3 Other Subspace Clustering Algorithms 41
3 FEATURE-SUBSET-WISE CLUSTERING USING STOCHASTIC SEARCH 43
  3.1 Introduction 43
  3.2 Related Work 45
  3.3 Model Formulation 47


    3.4.1 Estimation of The Finite Mixture Model Given Vk and Gk 48
    3.4.2 Deterministic Annealing EM Algorithm 50
    3.4.3 Model Selection Based on Information Criteria 51
    3.4.4 Objective Function for Determining The Optimal Vk and Gk 52
  3.5 Stochastic Searching for The Desired Vk and Gk 53
    3.5.1 Biased Random Walk Algorithm 54
    3.5.2 Split-and-Merge Move Algorithm 56
    3.5.3 Simulated Annealing-Based Algorithm 58
  3.6 Experimental Results 59
    3.6.1 Synthetic Dataset 59
    3.6.2 Incremental Estimation 64
    3.6.3 The UCI Machine Learning Dataset 68
  3.7 Applications 72
    3.7.1 The River Dataset for Analysis of Water Quality 72
    3.7.2 The United Nations World Statistics Dataset 75
  3.8 Conclusion 79
4 ROBUST CLUSTERING USING FINITE MIXTURE MODEL ON VARIABLE SUBSPACES 81
  4.1 Introduction 81
  4.2 Related Work 83
  4.3 Robust Subspace Clustering by Fitting A Multivariate t-mixture Model 84
    4.3.1 Extracting The Local Strong Correlated Features 86
    4.3.2 Constructing Feature Subspaces with The Correlated Features 89
    4.3.3 The Mixture Model of Multivariate t-distribution 91
    4.3.4 EM for Mixtures of Multivariate t-distributions 92
    4.3.5 Determining K via Model Selection Criterion 94
  4.4 Experimental Results 95
    4.4.1 Experiments on Synthetic Datasets 95
    4.4.2 Experiments on Real Datasets 104
  4.5 Conclusion 116
5 CONCLUSION 118
REFERENCES 122
BIOGRAPHICAL SKETCH 131


Table  Page
3-1 Parameter estimates of dataset-I 62
3-2 Parameter estimates of dataset-II 63
3-3 Parameter estimates of dataset-III 65
3-4 The summary of experimental results of UCI datasets 69
3-5 The groups of the Nakdong River observations 72
3-6 The result of model-based clustering for each subset of variables 73
3-7 Estimates of means for each feature subset in the UN dataset 76
4-1 The description of UCI datasets used in the experiments 104
4-2 The overall experimental results on Iris dataset 106
4-3 The local cluster purities on discovered various subspaces for Iris dataset 106
4-4 The overall experimental results on Ecoli dataset 108
4-5 The local cluster purities on discovered various subspaces for Ecoli dataset 108
4-6 The identified subspace clustering results on the various feature subspaces of the wine dataset 111
4-7 The identified subspace clustering results on the various feature subspaces of the diabetes dataset 112
4-8 The identified subspace clustering results on the various feature subspaces of the waveform dataset 115


Figure  Page
1-1 Plots of dataset containing four clusters. 15
1-2 Plots of clusters existing on different subspaces. 15
3-1 Arbitrary example of feature selection and feature subset wise clustering 44
3-2 Two synthetic datasets (200×7 and 150×11) 59
3-3 Trajectories of chains of BIC(V(·)) and K̂ for dataset-II 61
3-4 Synthetic datasets 64
3-5 Performance evaluation of the incremental estimation 66
3-6 Performance evaluation against the size of dataset 67
3-7 Classification accuracies of the various feature selection approaches 70
3-8 Line plots of river watershed dataset 73
3-9 Discovered feature subsets of watershed dataset 74
3-10 The arcsin(√p) transformation of the Third-level students of women 76
3-11 Discovered clusters for each feature subset in UN dataset 77
4-1 Illustration of robust feature subspace clustering 85
4-2 Illustration of constructing feature subspaces with the correlated features 90
4-3 Dendrograms for the different feature subsets in dataset-I 96
4-4 Some examples of the data plots on different feature subspaces in dataset-I 97
4-5 Clustering error rates for different feature subspaces in dataset-I 98
4-6 Dendrograms for different feature subsets in dataset-II 99
4-7 Some examples of the data plots on different feature subspaces in dataset-II 100
4-8 Clustering error rates for different feature subspaces in dataset-II 101
4-9 Comparison of the subspace clustering algorithms with Rand Index 105
4-10 Dendrograms for different feature subsets in wine dataset 110
4-11 Dendrograms of different feature subsets for waveform dataset 114



In many fields of manufacturing, marketing, and scientific areas such as genome research and biometrics, most raw data contain little or no prior knowledge (e.g., class labels). A natural way to analyze these data is to classify them into a number of groups based on a similarity measure or a probability density model. The resulting groups are structured so that each group is heterogeneous to the other groups while the data objects in the same group are homogeneous [11]. This unsupervised classification technique, called clustering, is one of the most popular approaches for exploratory data analysis [45, 47].

During the last decades, numerous clustering algorithms have been proposed. They mainly attempt to identify groups of data objects based on their attributes or features. However, in recent years, these traditional clustering algorithms have been faced with a new data structure that can be difficult to process. This structure can be easily explained by using gene expression data, a popular data structure in bioinformatics. Gene expression data are a by-product of microarray experiments that measure the expression levels of genes under the experimental conditions in a single experiment [47]. Gene expression data consist of a matrix where each row represents a single gene and each column represents an experimental condition. Each element of this matrix indicates the expression ratio of a gene over a specific experimental condition.

For gene expression data, many traditional clustering algorithms attempt to discover gene clusters based on all the experimental conditions in the data matrix, or vice versa. However, further investigation of gene expression data identified an interesting characteristic. Specifically, a subset of genes has shown nearly coincidental pathways on a subset of experimental conditions [47]. This indicates that these genes are associated with a particular subset of experimental conditions. Unfortunately, this characteristic prevents most previous conventional clustering algorithms from


[47]. This also implies that only a few features contribute to represent the interesting cluster structure and the other features may be meaningless or even disturb a better understanding of the cluster analysis [62]. For this reason, selecting the proper subset of features to represent interesting clusters is preferred. This process is called feature selection or variable selection in clustering.

In this dissertation, we propose novel approaches to deal with these atypical cluster structures using various techniques such as statistical modeling and combinatorial optimization. The first type of cluster structure can be revealed by selecting multiple non-overlapped feature (variable) subsets for model-based clustering. Another cluster structure is a kind of subspace cluster that is embedded in different feature subspaces. In our approach, we show that these various subspace clusters can be represented by fitting several finite mixture models.

[62]. Regardless, little research work on feature selection has been done in clustering because the absence of class labels makes this task more difficult [62, 63, 88]. Feature selection in clustering can be divided into two categories: model-based clustering and non-model-based clustering [88].

In model-based clustering, it is assumed that a dataset is composed of several subgroups. To represent these subgroups with an integrated group, one natural way is to use the finite mixture model so that each subgroup follows a particular probability distribution [112]. To be specific, the general form of the mixture model f(x; Θ) for a fixed number of clusters K is defined as


For feature selection in model-based clustering, Talavera (2000) selected a subset of features by evaluating the dependencies among features [103]. Mitra et al. (2002) utilized the maximum information compression index for measuring feature similarity to remove redundant features [74]. Liu et al. proposed a feature selection method, tailored for text clustering, that uses some efficient feature selection methods such as Information Gain and Entropy-based Ranking [67]. In Law et al. (2003), feature selection was performed in mixture-based clustering by estimating feature saliency via the EM algorithm [62]. This work was extended to an algorithm that performs both feature selection and clustering simultaneously [61]. Dy and Brodley (2004) considered feature selection criteria for clustering through the EM algorithm [34]. Most previous approaches try to find relevant features based on the clustering result, so they are restricted in determining the number of clusters or choosing the clustering model, called model selection. Raftery and Dean (2006) investigated this problem and attempted to select variables by comparing two models; one contains the informative variables for clustering while the variables in the other model are meaningless [88]. By using the Bayesian Information Criterion (BIC) [97] for the model comparison and a greedy search algorithm for finding the suboptimal model, this method deals with variable selection for clustering and the model selection problem simultaneously [88]. However, the previous methods commonly produce only a single subset of variables to identify clusters based on all the features in the entire samples. This implies that they are limited in identifying the subset of features for each cluster separately [63]. Recently, Li et al. (2008) explored finding these localized feature subsets by employing a sequential backward search [63]. However, this method still finds only a single subset of features for each cluster and does not handle the model selection problem.


In this approach, it is assumed that a data matrix consists of multiple variable subsets where each can be represented by a set of clusters. For each subset of variables, model-based clustering of the observations is performed. Under a fixed number of clusters for observation-wise clustering, the best clustering result can be obtained by maximum likelihood estimation via the EM algorithm [30]. Because the EM algorithm is susceptible to local optima, we employed the deterministic annealing EM (DAEM) algorithm [106]. In model-based clustering under a subset of variables, determining an estimate of the number of clusters and the appropriate parametric models should be considered. For these issues, we exploited the BIC.

Selecting multiple subsets of the variables is equivalent to finding a partition of variables whose variable subsets fit the best finite mixture model. Finding this partition is very challenging because the number of all possible partitions, known as the Bell number B(m) [102], grows hyper-exponentially as m increases (e.g., B(13) ≈ 2.8 × 10^7 and B(21) ≈ 4.75 × 10^14) [15]. Since each variable subset accompanies the model-based clustering procedure, considering all possible partitions is not feasible in practice. For this reason, stochastic search methods are a reasonable approach, so several successful applications of clustering using this technique have been utilized in our method [15]. In contrast to the previous research, our approach performs both the multiple variable subset selection and the model-based clustering simultaneously. The output provides useful information about what each variable subset contributes to discovering the meaningful clustering results. Also, our approach shows robustness to the initial partition of variables because of the use of stochastic search methods.
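To illustrate how quickly this search space explodes, the Bell numbers can be computed directly with the Bell triangle. The following short Python sketch only illustrates the growth rate cited above; it is not code from the dissertation.

```python
def bell_numbers(n):
    """Return the Bell numbers B(0)..B(n) using the Bell triangle."""
    row, bells = [1], [1]
    for _ in range(n):
        new_row = [row[-1]]                  # next row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells

b = bell_numbers(21)
print(b[13])   # 27644437        (about 2.8 * 10**7 partitions of 13 features)
print(b[21])   # 474869816156751 (about 4.75 * 10**14 partitions of 21 features)
```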


Figure 1-1. Plots of dataset containing four clusters.
Figure 1-2. Plots of clusters existing on different subspaces. (B: dim-2 and dim-3; C: dim-1 and dim-3)


The dataset shown in Figure 1-1 was generated by using the instructions in [83]. It consists of four clusters where each cluster contains 200 3-dimensional data points. The first two clusters were created so that they can be revealed on two dimensions (i.e., dim-1 and dim-2) out of the three dimensions. The other two clusters were generated in a similar manner. However, a different input parameter was used so that they exist on a different subspace (i.e., dim-2 and dim-3).

If this dataset is projected on the subspace with dim-1 and dim-2, as shown in Figure 1-2A, cluster #1 and cluster #2 can be clearly identified. However, the other two clusters (cluster #3 and #4) are regarded as a single cluster. Likewise, Figure 1-2B shows that cluster #3 and cluster #4 can be discovered on dim-1 and dim-3. Furthermore, when the dataset is projected on dim-2 and dim-3, the clustering results may be quite different from the expected ones, as shown in Figure 1-2C.

When we use the algorithms for feature selection in clustering to discover these clusters, most of them are not able to determine relevant features for all those clusters. Also, the feature-subset-wise clustering approach is not suitable for finding the proper feature partition to identify all those clusters. In this example, the feature-subset-wise clustering algorithm suggests a single feature subset which contains all features. However, as can be seen, each cluster corresponds to different feature subspaces, so the problem of identifying them needs to be addressed. For this problem, a new clustering algorithm was developed, called subspace clustering. The main objective of subspace clustering is to find different clusters existing on the different feature subspaces. For this purpose, many subspace clustering algorithms attempted to find different subspace clusters showing a high density on specific feature subspaces. By performing these approaches, some meaningful subspace clusters can be discovered. However,


For these problems, we propose a novel method utilizing the correlation coefficients to determine the meaningful feature subspaces. In our approach, we noted that features of datasets in many application areas are usually related with specific domain knowledge for efficient representation of data objects. Based on this characteristic, considering strong correlations among features can lead to the discovery of the informative clusters lying in the feature subspaces. In addition, by selecting only these feature subsets, the search space for subspace clustering can be drastically reduced.

For finding these particular feature subspaces wherein features are strongly correlated with each other, we utilized the localized mean correlation threshold. To be specific, when a feature is fixed, some features showing a strong correlation with this feature can be regarded as forming a group as a feature subspace. In this way, several informative feature subspaces can be constructed. Since the possible feature subspaces grow exponentially as the number of features increases, the feature subspace construction method should satisfy the monotonicity property. For this part, the hierarchical clustering algorithm was utilized to build the possible feature subspaces.

When many clustering algorithms are applied on real-world datasets, they consider the fact that many atypical data samples, called noisy data or outliers, can exist in these real datasets. Since the performance of clustering algorithms can be affected by these outliers or noisy data, the robustness property can be an important issue. To alleviate problems from noisy data and outliers, we used a mixture model with a heavy-tailed distribution, called a multivariate Student's t-distribution, to represent clusters. In our approach, the number of clusters for the best-fit mixture model is decided by evaluating a parsimonious model selection criterion.
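As a rough illustration of grouping strongly correlated features into candidate subspaces, the sketch below computes a Pearson correlation matrix and groups each feature with those whose absolute correlation exceeds a per-feature mean threshold. The thresholding rule and function names here are assumptions for illustration only; the dissertation's actual construction additionally uses hierarchical clustering to keep the subspace search monotone.

```python
import numpy as np

def correlated_feature_groups(X):
    """Group each feature with the features whose absolute Pearson correlation
    with it exceeds that feature's mean off-diagonal correlation (a simple,
    data-driven 'localized mean correlation' style threshold)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    m = corr.shape[0]
    threshold = (corr.sum(axis=1) - 1.0) / (m - 1)   # mean correlation per feature
    groups = []
    for j in range(m):
        members = [k for k in range(m) if k != j and corr[j, k] >= threshold[j]]
        groups.append(tuple(sorted([j] + members)))
    return [list(g) for g in sorted(set(groups))]    # drop duplicate groups

X = np.random.default_rng(0).normal(size=(200, 6))  # toy data for demonstration
print(correlated_feature_groups(X))
```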


Chapter 2 reviews the background of clustering. The summary of various proximity measurements (or distance metrics), discussed in Section 2.2, is an important part of cluster analysis because the clustering results or structures can depend on the use of the specific proximity measurement. The remaining sections cover a variety of clustering algorithms, including hierarchical clustering, partitional clustering, model-based clustering, stochastic search technique-based clustering, etc. In addition, we discuss several advanced clustering techniques such as biclustering and subspace clustering.

In Chapter 3, we propose a new concept to overcome the shortcomings of the previous feature selection algorithms. Before we proceed to discuss the details of our approach, we briefly discuss the need for and the contribution of our approach. After discussing the related work that covers the in-depth background of the specific topics and research trends related to the algorithms for feature selection in clustering, our new clustering model formulation is introduced in Section 3.3. The new cluster structure is represented by using the proposed model that maximizes the relevant parameters through the stochastic EM algorithm and a model selection scheme applied to the sub-data matrix relevant to the subsets of variables. In Section 3.5, we discuss the various stochastic search algorithms used in our approach.

Chapter 4 is organized as follows. In Section 4.2, subspace clustering algorithms are summarized. Section 4.3 describes a method to select strongly correlated feature groups, a mixture model of multivariate Student t-distributions and its parameter estimation process maximizing the likelihood via the EM algorithm, and a model selection criterion to determine the best-fit mixture model. Experimental results and discussion are given in Section 4.4. Section 4.5 presents conclusions.

Finally, in Chapter 5, we provide a conclusion and discuss future work.


In this chapter, we introduce the background for this research. First, we briefly review the basic concepts of clustering. Then, we discuss the summary of various proximity measurements (or distance metrics). The remaining sections cover a variety of clustering algorithms, including hierarchical clustering, partitional clustering, model-based clustering, stochastic search technique-based clustering, etc. In addition, we discuss several advanced clustering techniques such as feature selection in clustering, biclustering, and subspace clustering.

In genetics, there are approximately 85,759,586,764 bases in 82,853,685 sequence records in the GenBank database [37], and Wal-Mart deals with approximately 800 million transactions per day [25]. From this tremendous amount of raw data, it is difficult to obtain the information appropriate to particular subjective applications. However, it is fortunate that the raw data consist of a set of features representing particular objects, events, situations, or phenomena, etc. Based on this fact, a dataset can be expressed as a set of groups where each group has particular properties that distinguish it from the other groups. This set of properties is called a pattern. People have attempted to use various approaches to find desirable patterns from the data, resulting in the development of many techniques. These approaches can be categorized into two areas: classification and clustering. When one creates patterns representing groups of data, labeled data can be useful to evaluate the correctness of the patterns being created. In this manner, a procedure finding patterns with labeled data is called classification. A classification technique is primarily composed of two steps: a training step and a testing step. The main goal of the


[112]. This fact implies that a best cluster result does not exist because of the lack of some kind of standard criterion for clustering. Therefore, as a preprocessing step, a user should identify a specific clustering criterion and select (or develop) an appropriate proximity measure.

A general clustering procedure consists of the following five steps: feature selection, proximity measure, clustering criterion, clustering algorithm, and validation of cluster results [105]. Usually, the data contain numerous features and many of them may be redundant. These redundant features may degrade the performance of clustering. Therefore, it is necessary that the proper subset of features be selected to limit the loss of information from the raw data as much as possible. These selected features should be distinguishable, easily interpretable, and robust to noise [112]. Under the selected features, the cluster membership of each data object is determined by the degree of similarity to a particular cluster. This is done through a proximity measure, a quantitative measure of homogeneity (or heterogeneity) between two feature vectors. We discuss details of proximity measures in Section 2.2. After determining a proximity measure, a specific clustering criterion function should be constituted. A clustering criterion is usually defined by the type of rules being sensible to both the characteristics and subjectivity of the given dataset, or an objective cost function used for solving an optimization problem. Once the clustering criterion is determined, it is


[11, 105, 112]. We review various clustering algorithms in later sections. Finally, for the obtained clusters, the issues of how the clusters are constructed and their validity should be assessed. This step is called the validation of the clustering results. Occasionally, this step includes the interpretation or analysis of the obtained clusters.

As a fundamental data analysis technique, clustering can be applied to various areas such as water pollution research in environmental science, market-basket analysis, text mining in data mining, and microarray data analysis in bioinformatics [45]. For more details, we cite Xu et al.'s summary of the applications of clustering [111, 112].

[105, 112]. Assume that we have a dataset x and R is the set of real numbers. For two feature vectors x1, x2 ∈ x, a distance function (or a dissimilarity measure) between x1 and x2, D(x1, x2), satisfies the following conditions: (i) There exists a minimum value for D, denoted by D0 ∈ R, such that

Based on the definition of a proximity measure, various distance functions can be defined. Note that distance functions depend on the various types of features, such as continuous, discrete, or dichotomous [105, 112]. Theodoridis et al. introduced the most common distance measure for continuous features as follows [105]:

D(x1, x2) = ( Σ_{i=1}^{n} w_i |x1i − x2i|^p )^{1/p}    (2)

where x1i and x2i are the ith dimensional coordinates of x1 and x2, and wi is the ith weight coefficient. From equation (2), a variety of distance measures can be defined. If wi = 1 for i = 1, ..., n, then equation (2) can be transformed into the Minkowski distance (i.e., an unweighted distance measure),

D(x1, x2) = ( Σ_{i=1}^{n} |x1i − x2i|^p )^{1/p}

In addition, there are two famous special cases of the Minkowski distance. First, the Minkowski distance (2) with p = 2 is called the Euclidean distance, defined as

D(x1, x2) = ( Σ_{i=1}^{n} (x1i − x2i)^2 )^{1/2}

Second, for the case of p = 1, the Minkowski distance is known as the Manhattan distance. That is,

D(x1, x2) = Σ_{i=1}^{n} |x1i − x2i|
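The weighted and unweighted Minkowski family described above can be written compactly; the following is a minimal illustrative NumPy sketch, not the cited references' implementation.

```python
import numpy as np

def minkowski(x1, x2, p=2, w=None):
    """Weighted Minkowski distance; w=None gives the unweighted form.
    p=2 is the Euclidean distance, p=1 the Manhattan distance."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    w = np.ones_like(x1) if w is None else np.asarray(w, float)
    return float((w * np.abs(x1 - x2) ** p).sum() ** (1.0 / p))

a, b = [1.0, 2.0, 3.0], [2.0, 0.0, 3.0]
print(minkowski(a, b, p=2))   # Euclidean: sqrt(1 + 4 + 0) = 2.236...
print(minkowski(a, b, p=1))   # Manhattan: 1 + 2 + 0 = 3.0
```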


Another distance measure can be derived from equation (2) by setting p = 2. This is a general form of the weighted L2-norm distance measure (2), where C is a covariance matrix. C is defined as C = E[(x − μ)(x − μ)^T], where μ is the mean vector and E is the expected value of a random variable x. In particular, the squared Mahalanobis distance is closely related to model-based clustering [36]. In model-based clustering, the shape and/or volume are influenced by the covariance matrix C [36]. A cluster created on the basis of the squared Mahalanobis distance is typically shaped as a hyper-ellipsoid. We discuss these issues in Section 2.5.

When we recall the goal of clustering, the correlation coefficient can be utilized for a sensible distance measure. One of the most famous correlation coefficients is the Pearson correlation coefficient, expressed by

r(x1, x2) = Σ_{i} (x1i − x̄1)(x2i − x̄2) / [ ( Σ_{i} (x1i − x̄1)^2 )^{1/2} ( Σ_{i} (x2i − x̄2)^2 )^{1/2} ]    (2)

where x̄1 is the average of x1. The range of equation (2) is between −1 and +1, which indicates the degree of the linear relationship between data objects. Thereby, we can define the Pearson correlation coefficient-based distance measure, which has the range [0, 1], as follows:

D(x1, x2) = (1 − r(x1, x2)) / 2    (2)

Finally, another efficient way to measure the similarity between a pair of data objects with continuous random variables is to use the cosine similarity. For multidimensional data as a vector x, the inner product of two vectors indicates how close they are to each other. From the definition of the inner product of two vectors x1 and x2, denoted by


s(x1, x2) = ⟨x1, x2⟩ / ( ‖x1‖ ‖x2‖ )

where ‖x1‖ is the length of vector x1.

For more details about proximity measures, see [84], [105], and [111].
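The remaining measures reviewed in this section, the squared Mahalanobis distance, the Pearson-correlation-based distance on [0, 1], and the cosine similarity, can be sketched as follows. This is illustrative NumPy code under the definitions above, not the dissertation's implementation.

```python
import numpy as np

def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T C^{-1} (x - mu)."""
    d = np.asarray(x, float) - np.asarray(mu, float)
    return float(d @ np.linalg.inv(cov) @ d)

def pearson_distance(x1, x2):
    """Distance in [0, 1] derived from the Pearson correlation r: (1 - r) / 2."""
    r = np.corrcoef(x1, x2)[0, 1]
    return (1.0 - r) / 2.0

def cosine_similarity(x1, x2):
    """Inner product of x1 and x2 normalized by their lengths."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    return float(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))
```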


[105, 111]. For instance, in a single linkage method, αi = 1/2, …

Divisive clustering operates in a manner opposite to that of the agglomerative methods. Due to performance issues, the agglomerative methods are commonly used in most hierarchical clustering algorithms. One notable advantage of hierarchical clustering algorithms is that the dendrogram provides an easy interpretation of the data. However, hierarchical clustering algorithms are not flexible because the dendrogram is constructed from the given specific dataset. If some data are inserted, deleted, or updated, they may cause a change in the structure of the dendrogram, resulting in a degradation of performance. Also, they are sensitive to outliers and noisy data.

New hierarchical clustering algorithms that attempt to improve these issues have been developed. Guha et al. (1998) presented CURE, an efficient agglomerative clustering algorithm [42]. They attacked problems with traditional clustering algorithms that are sensitive to outliers and restrictive cluster shapes. They employed multiple representative points for each cluster for generating non-spherical shapes and wide variances. CURE is favorable to large databases by combining random sampling and partitioning. Guha et al. (1999) extended their work to non-metric similarity measures and developed a robust clustering algorithm, called ROCK, using links to measure similarity of categorical attributes [43]. For more details of hierarchical clustering see [11], [47], [105], and [111].


where j = 1, ..., N.

In most cases, the ideal partition can be achieved by minimizing the total sum of the distances between each data point and the centroid of its cluster. This implies that partitional clustering can be naturally transformed into an optimization problem.

For many of the shortcomings discussed above, a tremendous amount of advanced work has been proposed.
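The basic K-means iteration just described, assign each point to its nearest centroid and then move each centroid to the mean of its members, can be sketched in a few lines. Initialization and stopping details below are simplified assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: nearest-centroid assignment followed by mean update,
    repeated until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```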


For example, Bradley and Fayyad suggested an improved K-means clustering, which is robust to the initial centroids, by generating M random sets [16]. For more details and the advanced works of K-means clustering, see [11], [47], and [112].

[112]. In fuzzy clustering, each data object may become a member of all clusters with a partial degree of membership [12]. This property can be propitious in the case of handling data points on the boundaries of clusters. Fuzzy C-means clustering (FCM) is a well-known fuzzy clustering algorithm [12]. In this research, we refer to some notation and background in [12, 84, 112]. For the unlabeled data x, the main goal of the FCM is to find a partition matrix U containing c clusters (U = [u_ik], c × n) and a cluster prototype matrix V (d × c), minimizing the following objective function

J(U, V) = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_ik)^m d^2(x_k, v_i)    (2)

In the cost function (2), u_ik (0 ≤ u_ik ≤ 1) stands for the membership coefficient of the kth data point in the ith cluster. u_ik has the following two conditions: Σ_{i=1}^{c} u_ik = 1 for all k, and 0 < Σ_{k=1}^{n} u_ik < n. The weighting exponent m (m > 1) controls the degree of fuzziness; usually, the value of m can be set to 2. When m approaches 1, the membership becomes close to a boolean (hard) assignment. d^2(x_k, v_i) is an L2-norm distance. The Gustafson-Kessel algorithm employed the Mahalanobis distance as the distance function [44]. U and V can be obtained by computing the gradients of J(U, V) with respect to V and U, respectively. For details of calculating U and V, see [84]. By calculating gradients, U and V can be obtained as


Based on (2) and (2), FCM performs the following iterative steps to find the cluster prototype minimizing J(U, V). First, set the initial cluster prototype V(0), T as the maximum number of iterations, ε for the stopping rule, and t = 0 for the first iteration. Then, until satisfying the following condition, ‖V(t) − V(t+1)‖ ≤ ε or t = T, repeat the following steps:

(i) Update t ← t + 1.
(ii) Calculate U(t) with equation (2) and V(t−1).
(iii) Compute V(t) through (2) and U(t−1).
(iv) If the stopping rules have been satisfied, stop and return (U, V) set by (U(t), V(t)).

The main drawbacks of the FCM are a sensitivity to the initial partition and to noisy data points. To overcome these problems, Krishnapuram et al. (1993) applied the framework of possibility theory to the fuzzy clustering problem and presented a possibilistic C-means (PCM) algorithm [59]. PCM is discriminated from the FCM due to the cast of typicality so that the clusters can be interpreted as a possibilistic partition. PCM alleviated sensitivity to the noisy data but it was still easily affected by the initial prototype [112]. Pal et al. (2005) proposed a hybrid approach, called possibilistic fuzzy c-means clustering (PFCM), combining the FCM and the PCM algorithms [82]. For more details on fuzzy clustering see [112].
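The alternating steps (i)-(iv) above can be written directly from the standard FCM update formulas. The code below is a simplified illustration, memberships U updated from the current prototypes and prototypes V from the membership-weighted means; it is not the exact procedure or notation of the cited references.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, eps=1e-5, seed=0):
    """Sketch of FCM: alternate the membership update U and the prototype
    update V until the prototypes stop moving."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    U = None
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per point
        V_new = (U.T ** m @ X) / (U.T ** m).sum(axis=1, keepdims=True)
        if np.linalg.norm(V_new - V) < eps:
            V = V_new
            break
        V = V_new
    return U, V
```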


[36, 45, 88]. The overall data can be represented by a finite mixture model. The general form of a finite mixture model is

f(x; Θ) = Σ_{k=1}^{K} π_k f_k(x; θ_k)    (2)

where K is the number of clusters, f_k is the probability density function of cluster k, and π_k is the mixture proportion of the kth cluster (0 ≤ π_k ≤ 1 and Σ_{k=1}^{K} π_k = 1). When we consider the mixture model for clustering, typically the likelihood of the data (given the mixture model) is used as the objective function to maximize. The likelihood for the mixture model (2) can be written as

L(Θ; x) = Π_{i=1}^{n} f(x_i; Θ)

Equation (2) can be transformed into the log-likelihood function, denoted by

ln L(Θ; x) = Σ_{i=1}^{n} ln f(x_i; Θ)

In general, we achieve the maximum likelihood estimators by taking a partial derivative of the log-likelihood function with respect to θ_k and setting it to zero. However, equation (2) may be inappropriate for finding the maximum likelihood estimators directly. For this, the EM algorithm [14, 30, 47, 111] is a popular algorithm to find the maximum likelihood estimators of the latent variables or parameters. The EM algorithm for mixture models can be described as follows [14]. When we have a complete dataset x = {x1, ..., xn}, xi = (xi_obs, zi), where xi_obs is the observable variables and zi = (zi1, ..., ziK) is the unobserved parameters. If xi is a member of the ℓ-th mixture component, zi_ℓ = 1; otherwise, zi_ℓ = 0. Then, the log-likelihood of the complete data can be rewritten as ln(L(π_k, θ_k, z_ik; x)) = Σ_{i=1}^{n} Σ_{k=1}^{K} z_ik ln(π_k f_k(xi; θ_k)). Let Θ^(t) and Θ^(t+1) be the current and updated parameter estimators, respectively. At the E-step after the initialization step, the conditional expectation of the log-likelihood can be expressed by Q(Θ^(t), Θ^(t+1)) = E[ L(π_k, θ_k, z_ik; x) | x_obs ]. In the M-step, we choose parameters maximizing the Q(Θ^(t), Θ^(t+1)) function. These E-steps and M-steps are repeated until the process converges to the specified threshold [14, 30, 36, 71, 111].
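For the Gaussian case, the E- and M-steps described above take a simple closed form. Below is a compact illustrative implementation of plain EM for a Gaussian mixture (without the deterministic-annealing variant used later in the dissertation); it is a sketch, not the author's code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture: the E-step computes the posterior
    responsibility of each component for each point; the M-step re-estimates
    mixing proportions, means, and covariances."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, K, replace=False)].astype(float)
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    resp = np.full((n, K), 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities z_ik
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], cov[k])
                                for k in range(K)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions, means, covariances
        Nk = resp.sum(axis=0)
        pi = Nk / n
        mu = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, cov, resp
```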


[36]. In their study, a cluster is expressed by a mixture model and each mixture component follows a multivariate normal distribution. This implies that the geometric properties of the clusters depend on the covariances because their eigenvalue decomposition can be utilized as parameters. If Σ_k is the covariance matrix of the k-th Gaussian mixture component, Σ_k = λ_k D_k A_k D_k^T, where λ_k is a mixture proportional constant, D_k is the orthogonal matrix of eigenvectors, and A_k is a diagonal matrix whose elements are eigenvalues [36]. From this framework, they introduced two multivariate normal mixture methods, hierarchical agglomerative clustering based on the classification likelihood and the EM algorithm for maximum likelihood estimation [36, 111]. Celeux and Govaert (1992) proposed a general classification EM algorithm [19]. They deemed clustering to be the partitioning problem in the mixture context, utilizing the classification maximum likelihood criteria. McLachlan et al. (1999) developed the EMMIX software, a fitting algorithm for normal or t-distribution mixture models by the EM algorithm [71].

[45]. Xu and Wunsch (2009) classified these clustering algorithms as search technique-based clustering, and Theodoridis and Koutroumbas (2006) assigned them to the category of cost optimization-based clustering [105, 112]. Among the


[105] and [112], we briefly review simulated annealing and deterministic annealing techniques.

Simulated annealing (SA) [57] is a global search technique analogous to the annealing process in physics [95, 112]. In SA, the temperature control parameter T plays an important role that indicates randomness. Starting with a sufficiently large value for T and the initial cluster result, SA-based clustering seeks a candidate clustering result optimizing the specific objective cost function E. Until T = 0 (meaning that the final solution becomes stable), the algorithm runs repeatedly with a gradually decreasing T. Let C_i and C_{i+1} be the current and the candidate cluster result, respectively. At each step, the candidate cluster is accepted with the probability P, expressed as

P = exp( −ΔE / T )    (2)

where ΔE = E(C_i) − E(C_{i+1}). This implies that the candidate cluster is always accepted if E(C_{i+1}) > E(C_i). The procedure is derived from the basic concept of the Metropolis-Hastings algorithm [72], which is discussed later. [95] introduced the basic concept of the SA algorithm, and details of SA-based clustering are discussed in [105] and [112].
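The acceptance rule and cooling loop of SA-based clustering can be sketched generically as below. The `energy` and `perturb` callables are placeholders (for example, a clustering criterion and a move that reassigns one object); the objective is treated as a score to be maximized, matching the acceptance condition stated above. This is an illustrative sketch, not the dissertation's SA algorithm of Section 3.5.

```python
import math
import random

def accept(e_old, e_new, T):
    """Metropolis-style rule: always keep an improving candidate; otherwise
    keep it with probability exp(-(e_old - e_new) / T)."""
    if e_new >= e_old:
        return True
    return random.random() < math.exp(-(e_old - e_new) / T)

def sa_search(initial, energy, perturb, T0=1.0, alpha=0.95, steps=1000):
    """Generic SA loop over candidate clusterings with geometric cooling."""
    state, best, T = initial, initial, T0
    for _ in range(steps):
        cand = perturb(state)
        if accept(energy(state), energy(cand), T):
            state = cand
            if energy(state) > energy(best):
                best = state
        T *= alpha                       # cooling schedule
    return best
```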


[28] rather than just randomly in the search space [112]. The process of finding the expected clustering result is based on minimizing the free energy, which can be achieved via calculating the following equation:

This implies that each data point will be assigned to only a single cluster with the highest probability as T goes to 0, as part of the cooling process. The updated cluster centers can be obtained by

∂/∂C d(x, C) = 0,    (2)

[112].

[112]. For this reason, sequential clustering has been a challenging research area in clustering. In addition, as sequential data, such as DNA sequences or customer transaction data, have


A typical form of the sequential clustering algorithm is introduced in [105]. In a naive approach, each sequential cluster is constructed by either assigning a data element into one of the existing clusters or creating its own singleton cluster. The basic constraints for data in sequential clustering are that all vectors appear to the algorithm one at a time, and there is no constraint on the number of clusters. This results in the final clustering result being sensitive to the sequence of the data. Several approaches address this problem and propose improved methods such as the allowance of a gray area through the use of two threshold parameters [105]. Another improvement identifies the problem of being inflexible to the set of clusters [105]. By allowing the merging of two clusters having some distance between them, clusters may become more reasonable.

Xu and Wunsch (2009) introduced sequential clustering dealing with data having specific sequential features [112]. Some different types of proximity measures such as insertion, deletion, and substitution are defined for sequence comparison. The degree of similarity is calculated by an edit distance. Usually, when we compute a sequence comparison problem, it can be transformed into a global (or local) alignment problem and the expected solution should be one that minimizes the cost of the edit distance. The Needleman-Wunsch algorithm is a famous sequence alignment algorithm based on dynamic programming [80, 112]. Although the Needleman-Wunsch algorithm computes the global sequence alignment, the local sequence alignment also should be treated as an important result [112]. The Smith-Waterman algorithm compares segments of all possible lengths instead of each sequence in its whole length and selects sequences maximizing the similarity measure.
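The edit distance mentioned above is the standard dynamic-programming recurrence over insertions, deletions, and substitutions; a small self-contained sketch (illustrative costs, not the cited algorithms' scoring schemes):

```python
def edit_distance(s, t, ins=1, dele=1, sub=1):
    """Dynamic-programming edit distance, the building block of sequence comparison."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * dele
    for j in range(1, n + 1):
        D[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + dele,
                          D[i][j - 1] + ins,
                          D[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub))
    return D[m][n]

print(edit_distance("ACGTG", "ACTTGA"))   # 2 (one substitution, one insertion)
```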


[112]. Depending on the proximity measures and the specific clustering criteria, there have been many graph theory-based clustering methods proposed and these are discussed in [112]. For more details, see [105] and [112].

[112]. A detailed explanation of the kernel trick related to clustering can be found in [105] and [112]. After performing the clustering based on this kernel trick, the cluster results can be shaped as a hypersphere. A well-known kernel-based clustering algorithm is support vector clustering (SVC), which discovers the smallest hypersphere enclosing the data mapped into the high-dimensional feature space [10]. The clustering process via SVC consists of two steps: construction of the cluster boundaries by using the support vector machine and assignment of a cluster label to the data points. More research related to kernel-based clustering is discussed in [112]. One notable advantage of kernel-based clustering methods such as SVC is that they can discover clustering results with arbitrary boundaries. On the other hand, the clustering results are dependent on the particular kernel function and the relevant parameters


[105]. Most kernel-based clustering techniques may degrade in performance for large datasets [112].

2.8.1 The Basic Concept of Biclustering

[68, 104, 112].

[46]. After Cheng and Church's first introduction of this concept to gene expression data, biclustering has emerged as a promising scheme for microarray data analysis. Many reviews of existing biclustering algorithms have been performed [68, 104]. In particular, Madeira et al. [68] described and compared various biclustering algorithms and the various types of bicluster structures such as exclusive-column biclusters, checkerboard-structure biclusters [58], non-overlapping biclusters [40, 75], overlapping biclusters [98, 113], etc. Besides biological data analysis, block clustering has been applied to the data mining research areas of document text mining or collaborative filtering under the name of co-clustering


[38, 98]. The methods used for block clustering algorithms may be categorized by mixture model [40, 75], Bayesian approach [36, 88], greedy heuristic approach [23], stochastic approach [8, 18], etc. In the following we review several block clustering algorithms related to our work.

Cheng and Church (2000) attempted to find biclusters with a high similarity score and with low residue, indicating that genes show a similar tendency on the subset of the conditions present in the block cluster [23]. Biclustering is an optimization problem to find a maximum bicluster with the lowest mean squared residue as the score of the objective function in the biclustering algorithm. Based on the proof that finding the largest bicluster is NP-hard, they presented greedy heuristic algorithms to find it. Since these algorithms are deterministic, they may not be efficient in finding multiple biclusters.

Yang et al. (2003) extended Cheng and Church's approach by coping with missing data and allowing each entry in the data matrix to be assigned to one or more biclusters [113]. Cho et al. (2004) proposed two different iterative biclustering algorithms to minimize the total sum of the squared residue [24]. In contrast to the previous approaches, biclusters can be expressed through statistical modeling. From the statistical viewpoint, it is assumed that the data have been generated based on several different probability distributions [45]. This implies that each cluster can be represented by a particular probability density function, so each cluster set can be expressed by a mixture model that tries to provide a statistical inference.

Govaert and Nadif presented an EM-based block clustering algorithm that finds block clusters by simultaneously grouping both sides of the data into the specified number of clusters fitting a statistical model [40, 41]. They assumed that a dataset can be represented by a set of block mixture models and its maximized likelihood estimates could be obtained by the block EM algorithm.
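Cheng and Church's score can be stated concretely: the mean squared residue of a submatrix measures how far its entries deviate from an additive row-plus-column model, with low values indicating coherent biclusters. The following sketch computes it for given row and column index sets; it illustrates the published formula and is not the dissertation's code.

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Mean squared residue of the bicluster defined by (rows, cols) in matrix A."""
    sub = A[np.ix_(rows, cols)]
    row_mean = sub.mean(axis=1, keepdims=True)
    col_mean = sub.mean(axis=0, keepdims=True)
    all_mean = sub.mean()
    residue = sub - row_mean - col_mean + all_mean
    return float((residue ** 2).mean())
```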


All of the block clustering algorithms discussed so far are based on the local optimization method. As a solution to …

Sheng et al. (2003) developed a biclustering algorithm using the Bayesian method [99]. By using multinomial distributions, they modeled the data under every variable in a bicluster. In a bicluster, the multinomial distributions for different variables are mutually independent. For parameter estimation of this probabilistic model, they used Gibbs sampling to avoid the local minima frequently addressed in the EM algorithm.

Bryan et al. (2006) proposed a biclustering algorithm using simulated annealing, called SAB, to find biclusters in a gene expression dataset [18]. By combining simulated annealing with Cheng and Church's greedy biclustering algorithm, SAB was able to retrieve biclusters larger than those of Cheng and Church's approach with the biological interpretation.

Wolf et al. (2006) considered a global biclustering algorithm based on low variance [109]. In their work, they assumed that a data matrix consists of k biclusters and that the remaining data elements do not belong to the k flexible biclusters. The goal of this algorithm is to find k multiple biclusters by minimizing the variances within clusters, like k-means clustering. Due to the risk of local minima, they employed two methods: simulated annealing and linear programming.

Angiulli et al. (2008) proposed an algorithm to find k biclusters with a fixed overlapping degree constraint [8]. They introduced gain as a measurement criterion by combining the mean squared residue, the row variance, and the size of the biclusters. This algorithm retrieves a high quality bicluster through transformations that improve the gain function (i.e., smaller residue, or higher variance, or larger volume of the bicluster). A random move scheme was used to escape local minima. Since this algorithm searches one bicluster at a time, it must be repeated k times to obtain k biclusters.


For identifying the subspace clusters and determining the relevant subspaces, searching for all possible subspaces and evaluating the cluster validity is apparently intractable. For this reason, several sophisticated algorithms based on heuristic search methods were proposed. They can be classified into two categories: bottom-up search algorithms and top-down search methods. In the following, we further discuss several well-known subspace clustering algorithms for each category [83].

[105]. Relevant algorithms can be further divided into two sub-groups with respect to the strategies used to determine the thresholds for the dense regions that can be regarded as a subspace cluster. The first sub-group utilizes a multidimensional grid to evaluate the density of the range. CLIQUE and ENCLUS are well-known algorithms in this group [6, 22]. On the


[39, 83].

CLIQUE (Clustering In QUEst) is one of the earliest subspace clustering algorithms [6]. In CLIQUE, all dense regions in each dimensional space are unraveled through the following two steps: first, create dense units by partitioning the data objects into a number of equal-length intervals, then choose the units having higher density than the user-specified threshold. This procedure is repeated until the number of dimensions is equal to that of the original space. In CLIQUE, the subspaces that contain clusters consisting of dense units are automatically determined. In addition, CLIQUE does not require any assumptions related to a particular distribution of the dataset. However, the performance of CLIQUE is heavily affected by user-specified parameters such as the size of the dense units and the threshold for high-density regions on feature subspaces. Furthermore, the computational cost increases exponentially with respect to the dimensionality of the dataset. A large number of overlapped clusters can be identified because many dense regions in some subspaces can also be examined at the higher-dimensional subspaces [105].

ENCLUS (ENtropy-based CLUStering) follows the basic concept of CLIQUE for subspace clustering [22]. However, it is differentiated from CLIQUE by using an entropy-based method to select subspaces. To evaluate the clusters in the subspaces of the original space, ENCLUS utilized several criteria based on concepts from information theory. Specifically, a criterion used in ENCLUS considered multiple correlations of dimensions, called the total correlation. ENCLUS is flexible in finding clusters forming an arbitrary shape. It is allowed to discover overlapped subspace clusters. However, similar to CLIQUE, ENCLUS depends on user-specified parameters such as the size of the grid [22, 105, 112].
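The first, one-dimensional pass of a CLIQUE-style bottom-up search can be illustrated as follows: each dimension is partitioned into equal-length intervals and only the units denser than a user-specified threshold are kept; higher-dimensional candidate units would then be formed from these. Parameter names are illustrative assumptions, not CLIQUE's actual interface.

```python
import numpy as np

def dense_units_1d(X, n_intervals=10, density_threshold=0.05):
    """Keep the 1-D grid units whose fraction of points exceeds the threshold."""
    n, d = X.shape
    dense = []
    for j in range(d):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_intervals + 1)
        counts, _ = np.histogram(X[:, j], bins=edges)
        for u, c in enumerate(counts):
            if c / n > density_threshold:
                dense.append((j, u))        # (dimension index, interval index)
    return dense
```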


Similar to CLIQUE, MAFIA (Merging of Adaptive Finite Intervals) also uses a bottom-up method [39]. However, for generating dense units, MAFIA utilized a modified … [105, 112].

[3, 5].

PROCLUS is a projected clustering algorithm based on K-medoids, instead of K-means [3]. The term medoids means a set of data objects, each of which represents a cluster center that minimizes the average dissimilarity to all data objects in the cluster. To unravel clusters, the PROCLUS algorithm requires two critical user-specified parameters: the number of clusters K and the average number of dimensions where each cluster exists. Given these two parameters, PROCLUS attempts to discover clusters through three steps: initialization, iteration, and refinement. The computational cost of


[3, 105, 112].

The ORCLUS algorithm extended PROCLUS by examining the feature subspaces without the axis-parallel restriction [5]. ORCLUS requires a computational cost O(K_init^3 + K_init N + K_init^2 d^3) with respect to time complexity, where N is the size of the data, d is the dimensionality of each data object, and K_init is the number of initial seeds of medoids. This implies that the number of initial seeds affects the scalability of the ORCLUS algorithm. Similar to PROCLUS, the ORCLUS algorithm is restricted to hyperspherical clusters [105].

pCluster is a top-down subspace clustering algorithm for finding patterns of data by using shifting and/or scaling procedures [108]. With initial clusters created on a high-dimensional space, the pCluster algorithm generates candidate clusters in lower-dimensional spaces through an algorithm based on a depth-first approach.

The δ-Clusters algorithm is a subspace clustering approach that captures coherent patterns of a subset of data objects under a subset of features [114]. As a metric for evaluating the coherence among the data objects, the Pearson correlation was utilized. As an initialization step, the δ-Clusters algorithm randomly generates clusters. Then, until there is no more improvement in the quality of clusters, it iteratively updates clusters by randomly changing both features and data objects. The clustering results may vary with the number of clusters. In addition, the size of clusters affects both the performance of this algorithm and the clustering results.

DOC (Density-based Optimal projective Clustering) is a hybrid of both bottom-up and top-down approaches [86]. The main idea of DOC is to find projected clusters that show a goodness of clustering for a specified subspace. At each step, the


[112].

OptiGrid is a subspace clustering algorithm that attempts to unravel subspace clusters by recursively partitioning data into an optimal grid structure [48]. Each cell in the grid structure consists of a hyperplane with a subset of features, showing a high density or having a high possibility of containing many subspace clusters.

CLTree is a variant of the subspace clustering algorithms that utilizes a decision tree technique [65]. This algorithm seems to be similar to the OptiGrid algorithm in that its discovered clusters are constructed by partitioning the data for each dimension into a number of groups. However, the identified clusters found by CLTree have a hyper-rectangular structure in the feature subspaces. In addition, in contrast to OptiGrid, CLTree finds the desired clusters using a bottom-up method that assesses the high density of data objects for each dimension. The CLTree algorithm outperforms other bottom-up strategy-based subspace clustering algorithms with respect to scalability. However, it requires significant computational cost for building the decision tree structure.

Besides the above algorithms, many researchers have proposed various subspace clustering techniques as follows: an algorithm performing projection with human interaction [2], clustering data objects on the projected subspaces using histograms [81], projected clustering using hierarchical clustering for selecting relevant dimensions [115], a cell-based subspace clustering approach [20], and a method that uncovers clusters consisting of features with different local weights [32]. For details of these algorithms, see [11, 83, 112].


In clustering, the majority of the prior research work related to feature selection was interested in dividing features into two subsets: a contributing one and a non-contributing one. However, data can often comprise several feature subsets where each feature subset constructs clusters differently. In this chapter, we present a novel model-based clustering approach using stochastic search to discover the multiple feature subsets that have different model-based clusters.

[62]. For this reason, it is preferred to choose a proper subset of features to represent the clustering model, which is called feature (variable) selection in clustering. However, little research on feature selection has been done in clustering because the absence of class labels makes this task more difficult [61, 63, 88].

In feature selection for model-based clustering, the prior literature has proposed two approaches. First, some feature selection algorithms divide features into two subsets: one with contributing features and the other with (almost) non-contributing features in clustering objects. An example of this feature selection approach is given in Figure 3-1A. In Law et al. (2002), feature selection was performed in mixture-based clustering by estimating feature saliency via the EM algorithm [62]. Some researchers have considered conducting feature selection and clustering simultaneously. Law et al. (2004) incorporated their previous approach [62] with the use of the minimum message length criterion for the Gaussian mixture model to select the best number of clusters [61]. This approach has been extended by Constantinopoulos et al. (2006) by employing


Figure 3-1. Arbitrary example of feature selection and feature-subset-wise clustering. (B: multiple feature-blocks-wise clustering)

a Bayesian framework to be more robust for sparse datasets [26]. Raftery and Dean (2006) investigated this problem and attempted to select variables by comparing two models: one contains the informative variables for clustering while the other model's variables are meaningless [88]. By using the Bayesian Information Criterion (BIC) for model comparison and a greedy search algorithm for finding a suboptimal model, this method deals with variable selection for clustering and the model selection problem simultaneously [88]. However, when we expect more than one contributing feature subset that constructs different feature-subset-wise clusters, all the above methods are inappropriate. Figure 3-1B shows an example of multiple feature subsets where each contributes to identifying different clusters.

To find feature-subset-wise clusters, this study assumes that each feature subset is independent of the other feature subsets and follows the multivariate normal mixture


(1)), it is considered a contributing subset; otherwise, it is regarded as a non-contributing feature subset. The best number of clusters for each feature subset and the best feature partition are determined based on the BIC values. Searching for the optimal feature partition is very challenging because the number of possible partitions, known as the Bell number B(m), grows super-exponentially with the number of features m [15]. Even when a feature partition is given, searching for the best clusters of each feature subset involves significant computation. For this reason, we considered stochastic search methods.

This chapter is organized as follows: In Section 3.2, we summarize the previous work related to feature selection in clustering. In Section 3.3 and Section 3.4, we describe the structure of our model and the relevant procedures to estimate the finite mixture model. In Section 3.5, we examine several stochastic search methods for finding multiple feature subsets. In Section 3.6, we report the experimental results obtained so far and compare the algorithms provided in Section 3.5. In addition, to reduce the execution time, the incremental estimation technique that was applied to our approach is discussed. In Section 3.7, using various real datasets such as a river watershed dataset and a United Nations (UN) World Statistics dataset, we demonstrate that the proposed method can be successfully applied to social and economic research as well as environmental science research. Section 3.8 presents our conclusions.

[52, 63]. On the other hand, feature selection in unsupervised learning or clustering is relatively challenging because the absence of class labels makes this task more difficult [62, 63, 88]. In the previous chapter, we discussed that feature selection in clustering can be divided into two categories: model-based clustering and non-model-based clustering [88].
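For a single candidate feature subset, the "best number of clusters by BIC" step can be emulated with an off-the-shelf Gaussian mixture implementation. The sketch below uses scikit-learn's standard EM fit rather than the DAEM algorithm employed in the dissertation, so it is only an approximation of the selection step described above.

```python
from sklearn.mixture import GaussianMixture

def best_k_by_bic(X_subset, k_max=8, seed=0):
    """Fit Gaussian mixtures for K = 1..k_max on one feature subset and
    return the K with the smallest BIC, together with all BIC scores."""
    scores = {}
    for k in range(1, k_max + 1):
        gm = GaussianMixture(n_components=k, covariance_type="full",
                             random_state=seed).fit(X_subset)
        scores[k] = gm.bic(X_subset)
    best_k = min(scores, key=scores.get)
    return best_k, scores
```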

PAGE 46

26 61 ].Inthelterapproaches,selectingfeaturesisdeterminedbyassessingtherelevanceoffeaturesinthedataset.Incontrast,thewrapperapproachregardsthefeatureselectionalgorithmasawrappertotheclusteringalgorithmtoevaluateandchoosearelevantsubsetoffeatures[ 26 ]. Kimetal.(2000)performsfeatureselection(ordimensionalityreduction)inK-meansclusteringthroughanevolutionaryalgorithm[ 56 ].Talavera(2000)selectedthesubsetoffeaturesbyevaluatingthedependenciesamongfeatureswhereirrelevantfeaturesshowalowcorrelationtootherrelevantfeatures[ 103 ].Sahami(1998)usedthemutualinformationformeasuringfeaturesimilaritytoremoveredundantfeatures[ 96 ].AnotherapproachtothesameproblemhasbeenproposedbyMitraetal.(2002),whichutilizedthemaximuminformationcompressionindex[ 74 ].Liuetal.proposedafeatureselectionmethodtailoredfortextclusteringthatusedsomeefcientfeatureselectionmethodssuchasInformationGainandEntropy-basedRanking[ 67 ].RothandLange(2003)selectfeaturesdeterminingautomaticallytherelevancetothepreviousresult,ontheGaussianmixturemodelwhichisoptimizedviatheEMalgorithm[ 94 ].InLawetal.(2003),featureselectionwasperformedinmixture-basedclusteringbyestimatingfeaturesaliencyviatheEMalgorithm[ 62 ].Thisworkwasextendedinthealgorithmtoperformbothfeatureselectionandclusteringsimultaneously[ 61 ].DyandBrodleyusedthenormalizedlikelihoodandclusterseparabilitytoevaluatethesubsetoffeatures[ 33 35 ].TheyalsoconsideredthefeatureselectioncriteriaforclusteringthroughtheEMalgorithm[ 34 ].Whilemostpreviousapproachestrytondrelevantfeaturesbasedontheclusteringresult,theyarerestrictedtodeterminethenumberofclustersorchoosetheclusteringmodel,calledmodelselection.Thisproblemhasbeenrelativelylessinteresting.VaithyanathanandDom(1999)triedtodetermineboththefeaturesetandthenumberofclustersthroughevaluatingthemarginallikelihoodandcross-validated 46

PAGE 47

107 ].In[ 26 ],thecombinedmethodintegratingamixturemodelformulationandBayesianmethodisusedtodealwithboththefeatureselectionandthemodelselectionproblemsatthesametime.RafteryandDean(2006)investigatedthisproblemandattemptedtoselectvariablesbycomparingtwomodels,whereonecontainstheinformativevariablesforclusteringwhilethevariablesintheothermodelaremeaningless[ 88 ].ByusingtheBayesianInformationCriterion(BIC)[ 97 ]forthemodelcomparisonandthegreedysearchalgorithmforndingthesuboptimalmodel,thismethoddealswiththevariableselectionforclusteringandthemodelselectionproblemssimultaneously[ 88 ]. Inourapproach,itisassumedthatfeaturesubsetsareindependentofeachotherandthateachfeaturesubsethasdifferentmultivariatenormalmixturemodels.Thus,thelog-likelihoodfunctionforthekthfeaturesubsetis where(UVkijkg)isthemultivariatenormalprobabilitydensityfunctionofthegthclusterbasedonthekthfeaturesubset,pkgandkgarethecorrespondingmixingprobability 47

PAGE 48

Notethatmaximizing( 3 )isequivalenttomaximizing( 3 )foreachfeaturesubset.ThedetailsofthisprocedurearediscussedinSection 3.4 3 )whenfeaturesubsetsVkandthenumberofclustersGkaregiven.ThenwedescribehowtheBICisusedtodeterminethebestfeaturepartitionandthenumberofclustersforeachfeaturesubset. 30 ]forkinequation( 3 ),thecompletedatalog-likelihoodfunctioncanbeconstructedas where,forthekthfeaturesubset,themissingvariablezkgi=1iftheithobservationbelongstoclusterCkgandzkgi=0otherwise.TheEMalgorithmrepeatedlyupdatesthekbyalternatingtheE-stepandtheM-stepuntilitconverges[ 30 ].InthetthE-step,theconditionalexpectationofthecompletedatalog-likelihoodiscomputed: InthetthM-step,thet+1thnewparameterestimateiscalculatedbymaximizingQk(k,(t)k).Thedetailsofthisalgorithmarewelldescribedin[ 14 70 ]. Foraspecicinstanceoftheaboveprocess,aGaussianmixturemodelhasbeenwidelyused[ 14 106 ].TheGaussianprobabilitydensityfunction(PDF)totheUVkis 48

PAGE 49

(2)k=2jkgj1=2expn1 2(UVkkg)T1kg(UVkkg)o(3) wherekgandkgdenotethemeanvectorandthecovariancematrixoftheg-thGaussianmixturecomponent,andkisdenotedbythedimensionalityofUVk.ForthemixturemodeltotheGaussianPDF( 3 ),theQfunctionintheE-stepatthe(t+1)-thiterationiscomputedasfollows: wherei=1,...,nandg=1,...,Gk. IntheM-stepatthe(t+1)-thiteration,therelevantparameterestimatesaredescribedasfollows:Forg=1,...,Gk. 49

PAGE 50

nXi=1g(UVki;(t)k).(3) Withtheinitial(0)k,theaboveEandMstepsrepeatedlyalternateuntilsatisfyingtheL((t+1)kjUVk)L((t)kjUVk). 45 ].However,thisrequiresaprohibitivelyhighcomputationalcostandstillhasthepossibilityofapoorlocalmaxima[ 106 ].Animprovedapproach,calledthedeterministicannealingEM(DAEM)algorithm[ 106 ],istousethemodiedlog-likelihoodincludingthethermodynamicfreeenergyparameter(0<<1),similartothecoolingparameterTinsimulatedannealing(SA)[ 57 ].Specically,theDAEMalgorithmstartswithasmallenoughinitialwhichiscloseto0.Then,untilbecomes1,theDAEMalgorithmperformstheEandMstepsbygraduallyincreasingtoobtainabetterlocal(maybeglobal)maximum. BasedontheresultoftheDAEMalgorithm[ 106 ],thefreeenergyparametercanbeappliedtotheposteriorprobability( 3 ),henceanewposteriorprobabilityforDAEMisdenotedby wherei=1,...,nandg=1,...,Gk.Thisisdirectlyappliedtothe( 3 ),( 3 ),and( 3 ).Inparticular,weslightlymodiedtherangeofsothatbecomesstablebygraduallydecreasingfromapositiveintegerastheinitialvalueof. 50

PAGE 51

60 ].ForX,,and^,theKL-divergencebetweentheprobabilitydensityfunctionofthetruemodel(X;)andanestimatedmodelg(X;^)canbedenedasKL((X;)jj'(X;^))=Z(X;)log(X;) KL-divergenceisequaltozerowhenthettedmodelisequivalenttothetruemodel.Otherwise,KL-divergenceisalwayspositiveandgrowsasthedissimilaritybetweentwomodelsincreases[ 92 ].ThispropertyimpliesthatminimizingKL-divergenceisequivalenttomaximizethefollowingtermderivedfromtheequation( 3 ): Basedonthisproperty,theAkaikeInformationCriterion(AIC)wasproposedtoestimatetheKL-divergencebetweenthetruemodelandthettedmodel[ 7 ].Inthemodelselectionprocess,anestimatedmodelisregardedasthebestttedmodelwhenthescoreofAICisminimized.ForthegivenVk,AIC(Vk)isdenotedby whereLk(^kjUVk)isthemaximumlog-likelihood,andkisthenumberofparameters^k.Using( 3 ),themodelselectionprocessstartswithGk=1.UntilsatisfyingthestoppingcriterionthatthescoreofAIC(Vk)isnolongerdecreasing,thisprocessrepeatsbyincreasingGk=Gk+1.Inparticular,whenthebestttedmodelforaUkcontains 51

PAGE 52

Notethatminimizingequation( 3 )impliesthediscoveryoffeaturepartitionconsistingofmultiplefeaturesubsetswhereeachcanbeexpressedbythebest-tmixturemodelforclustering.Insuchaway,amodelselectioncriterioncanbeutilizedasanobjectivefunctioninourapproach.TheAICcriterionisawidelyusedmodelselectioncriterion.However,becauseBICcriterionismoreparsimoniousthanAIC,weusedBICasanobjectivefunction. 97 ]isdesignedtoselectthebestmodelbasedonboththedegreeofdatatnessandthecomplexityofthemodel.Inthisresearch,theBICisutilizedasanobjectivefunctiontodeterminethebestpartitionoffeaturesaswellasthebestnumberofclustersGkforeachfeaturesubset.Becausefeaturesubsetsareindependentofeachother,theBICforafeaturepartitionVis whereBIC(Vk)isdenotedby where^kisthemaximumlog-likelihoodestimate,kisthenumberofelementsin^k,andnisthenumberofobservations.AlowerBICvalueindicatesabettermodel. 52

PAGE 53

93 ], with theStirlingnumberofthesecondkind.TheBellnumbergrowssuper-exponentiallyasthenumberoffeaturesincreases[ 15 ],itisachallengingproblemtosearchfortheoptimalfeaturepartitionVbasedonamodelselectioncriterion.Furthermore,giventhefeaturepartitionV,anobservation-wiseclusteringusingtheEMalgorithmraisesanon-negligibleamountofcomputationalcost.Inthisresearch,weconsideredtheMonteCarlo(MC)optimization[ 91 ]basedontheMetropolis-Hastings(MH)algorithmasastochasticsearchmethodtominimizetheBIC(V). WhenthecurrentstateSisgivenintheMHalgorithm,thecandidatestateS0isgeneratedwithaproposedprobabilitydensityp(S0jS).ThetransitionfromthecurrentstateStothecandidatestateS0isdeterminedwithanacceptanceprobability,(S0,S),asdescribedbelow: where(S)isthetargetingprobabilitymassfunctionofS. IfthecurrentstateSisacceptedwithprobability( 3 ),thenthenextstateissettothecandidatestateS0.Otherwise,thecurrentstateremainsasthenextstate.Anadvantageofthisalgorithmisthatthereisnoneedtoknoweitherthenormalizingconstant(S)or(S0)becausetheycanceleachother. 53

PAGE 54

3 ): whereVisthesetofallpossiblefeaturepartitions.Notethat,althoughthesizeofVcanbeverylarge,itisstillnite.BecausethepartitionspaceVisverylarge,itisinfeasibletosearchforVbyevaluatingtheBIC(V)forallpossiblefeaturepartitions.Thus,therandompartitionsgeneratedfromBIC(V)andBIC(V)areevaluatedforeachgeneratedpartition.ItgeneratesmorerandompartitionsclosetothemodeofBIC(V),whereBIC(V)islow,andfewerrandompartitionsawayfromthemode,whereBIC(V)ishigh.UsefulapplicationsoftheMCoptimizationmethodaredemonstratedin[ 15 53 ].Theperformanceofthesemethodsmayvarydependingonhowp(SjS0)isdened. 15 ].LetV(1)bethefeaturepartitionthatisstochasticallyacceptedatthe1thiteration.Atthethiterationinthebiasedrandomwalk(BR)algorithm,V()issetbyV(1)andacandidatefeaturepartitionV0()isgeneratedbyupdating(orswitching)thefeaturesubsetmembershipofarandomlyselectedfeatureinV().Then,theBRalgorithmacceptsV0()withthefollowingprobability: becausep(V0()jV())=p(V()jV0()).Thedetailsofthebiasedrandomwalkalgorithmaredescribedin[ 15 ].Algorithm 1 showsabiasedrandomwalkmethodappliedtoourapproach. Althoughthebiasedrandomwalkalgorithmshouldtheoreticallyexploretheentirepartitionspace,ittendstogetstuckinalocalmode. 54

PAGE 56

15 ].Thisalgorithmissocalledthesplit-and-merge(SM)algorithm.TheSMalgorithmworkswithtwomainprocedures:thesplitmoveandthemergemove.Inthemergemove,V0()isconstructedbymergingtworandomlyselectedfeaturesubsetsinthecurrentfeaturepartitionV().Similarly,ifthesplitmoveischosen,V0()isobtainedbyrandomlysplittingafeaturesubsetoftheV()intotwonon-emptyfeaturesubsets.V0()generatedbyasplitmoveisacceptedwiththefollowingprobability wherep(V0()jV())istheprobabilityofamergemove,denotedbyp(V0()jV())=1pm 2!,isthesizeofthefeaturesubsetthatwillbedivided,K0isthenumberoffeaturesubsetswhosesizeisgreaterthan2,pmistheprobabilityofselectingthemerge-moveprocedure,andKisthenumberoffeaturesubsetsinV().TheacceptanceprobabilityofV0()byamerge-moveprocedureismin[1,(1=)][ 15 ].Basedonthissuccessfulapplication,weemployedthismethodforsearchingthemostexpectedvariablesubsets,asshowninAlgorithm 2 BecauseBIC(V)isinvariantforbothBRandSMmethodsdescribedabove,amixtureofthesetwosearchmethodsisalsoinvariant[ 15 ].Ateachiteration,thishybridapproach,calledtheHRalgorithm,canupdatethecurrentfeaturepartitionusingoneofthesetwomethods(BRorSM)selectedbyacertainprobability. 56

PAGE 57

Procedure3merge-move(V(),x) 57

PAGE 58

57 ]techniqueutilizesanannealingprocesstosearchfortheglobaloptimum.IntheSAalgorithm,thetemperatureparameterT(T>0)determineshoweasilyacandidatestatetendstobeaccepted;thehigherthetemperature,thehighertheacceptancerate.TosearchfortheV,wecombinedtheSAtechniquewiththeBRalgorithm,whichiscalledtheSA-basedBRsearch(SABR)algorithm[ 78 ].TheSABRalgorithmstartswiththeinitialfeaturepartitionV(1)andtheinitialtemperatureparameterT(1)whichissettoahighvalue.UntilT!0(i.e.,thenaliterationstepbecomesstable),theSABRseeksVbygraduallyandrepeatedlydecreasingT.Atthethiterationstep,theacceptanceprobabilityofacandidatefeaturepartitionV0()is whereT()=T(1),2andisacoolingrate(0<<1). 58

PAGE 59

BResultofdataset-I CLineplotsofdataset-II(15011) DResultofdataset-II Twosyntheticdatasets(2007and15011) 59

PAGE 60

Fordataset-I(Table 3-1 ),n=200observationsweregeneratedusingsevenfeatures,whereeachbelongedtooneoffourfeaturesubsets:truefeaturepartitionV=ff1,2g,f3,5,6g,f4g,f7gg.InFigure 3-2A ,eachlinerepresentsavectoroffeaturesforasimulatedobservation. LetBIC(V())and^KbetheBICvalueandthenumberoffeaturesubsets,respectively.Withdataset-I,thefoursearchmethodsconvergedonthefeaturepartitionV(usedfordatageneration)andthe^K,regardlessoftheinitialpartitions. Fordataset-II(Table 3-2 andFigure 3-2C ),n=150observationsweregeneratedusing11features.Figures 3-3A 3-3C ,and 3-3E showtheconvergenceofthefoursearchmethodsonthefeaturepartitionusedforthegeneratingdataset,BIC(V()),and^KwhenthealgorithmstartedwiththeinitialpartitionV(0)consistingofmsingletonfeaturesubsets.However,asshowninFigures 3-3B 3-3D ,and 3-3F ,whenthealgorithmstartedwithaninitialpartitionVconsistingofonelargefeaturesubset,theBRalgorithmcouldnotescapealocalmaximumuntilthelastiteration.Theotherthree(SM,HR,andSABR)methodsconvergedonthedesiredfeaturepartition,implyingthattheyarelesssensitivetoinitialfeaturepartitionsthantheBRalgorithm.Figures 3-2B and 3-2D showtheclustersbasedoneachfeaturesubsetfordataset-Ianddataset-II,respectively. 60

PAGE 61

TrajectoriesofchainsofBIC(V()),and^Kfordataset-II 61

PAGE 62

Parameterestimatesofdataset-I 4 4 Featureindices(Vk) f1,2gf3,5,6gf4gf7g 3211 3211 Mixingprobability(pkg) ^p12=0.32^p22=0.81 ^p13=0.30 Means(kg)

PAGE 63

Parameterestimatesofdataset-II 3 3 Featureindices(Vk) f1,2,3,4,5gf6,7,8,9gf10,11g 321 321 Mixingprobability(pkg) ^p12=0.34^p22=0.6 ^p13=0.33 Means(kg)

PAGE 64

BDataset-IV CDataset-V Syntheticdatasets Insummary,SM,HR,andSABRfoundthecorrectfeaturesubsetsandobtainedparameterestimatesneartheparametervaluesusedindatagenerationforbothdataset-Ianddataset-II. 79 ].BydecreasingthenumberofunnecessarilyrepeatedevaluationofBIC(V()),thetotalcomputationtimecanbedrasticallyreduced.Inaddition,theinitialization-insensitivepropertyoftheDAEMalgorithmsupportstheuseoftheincrementalestimation. Inourapproach,theincrementalestimationwasevaluatedwithvarioustypesofdatasets[ 77 ].Severaldatasetsweregeneratedonthebasisofthefollowingparameters,K,Gks,V,pkgs,gds,andgdswhereg2f1,...,Gkg,d2f1,...,Dg,andk2f1,...,Kg.ForV,VkiscomposedofthedifferentnumberofCVkganditscorrespondingpkg.EachXnliesinaGaussiandistributioncorrespondingtokgandkg.Forexample,thedataset-IIIcontainsthreefeaturesubsets:V1,V2,andV3.EachVkhas3,2,and1 64

PAGE 65

[(v1,v2);(v3,v4);(v5,v6,v7)] ^G1=3^G2=2^G3=1 ^p11=0.30^p12=0.40^p13=0.30 ^p21=0.50^p22=0.50^p31=1.00 ^11=[7.15,7.00]^12=[4.06,4.15]^13=[1.13,0.99] ^21=[4.92,4.93]^22=[8.99,8.97]^31=[13.96,13.96,13.97] kg Parameterestimatesofdataset-III mixturecomponents,respectively.Table 3-3 providesthevaluesoftheparametersforcreatingdataset-III.Fig. 3-4A illustratestheexpectedresultsaswellastheoverallstructuresofdataset-III.Dataset-IVanddataset-V,seeFig. 3-4 ,havedifferentshapesandmorecomplexstructuresthandataset-III.Inparticular,thestructureofdataset-IVissometimescalledacheckerboardstructure. Ourfeaturepartitionsearchalgorithmwasexecutedwiththedatasetsfor1.0104iterations.Tocovertheoverallcasesforeachdataset,weused5differentinitialpartitionV(0)s.Forinstance,theV(0)susedfordataset-IIIare:(1)V(0)=f(v1),(v2),(v3),(v4),(v5),(v6),(v7)g,(2)V(0)=f(v1,v2,v3,v4,v5,v6,v7)g,(3)V(0)=f(v1),(v2),(v3),(v4),(v5,v6),(v7)g,(4)V(0)=f(v1,v2,v7),(v3),(v4),(v5,v6)g,and(5)V(0)=f(v1,v5),(v2,v3,v7),(v4),(v6)g.Specically,eachVkin(1)isasingletonfeaturesubset,V(0)of(2)isafeaturesubsetwithallfeatures,andtheremainingthreeV(0)swererandomlygenerated.Tosupportenoughrandomness,T(0)=400.0and=0.997.FortheDAEMalgorithm,(0)=4.0and(i)=(i1)0.998. AsshowninTable 3-3 ,ourmethodfoundbothfeaturesubsetsandparameterestimatesnearthetrueparametervalues.Ontheotherhand,fordataset-III,whentheconventionalclusteringwasperformedwithallfeatures,thebest-tmixturemodel-basedclustersconsistingofthreemixturecomponentswerediscoveredandtwofeatures(i.e.,v1andv2)wereidentiedastheinformativefeaturesforclustering.Thisimplies 65

PAGE 66

Performanceevaluationoftheincrementalestimation thattheconventionalclusteringalgorithmsarelimitedtosimultaneouslyidentifyingmultiplefeaturesubsetsfordiverseclusteringresults.Fordataset-IVanddataset-V,ourapproachrevealedtheexpectedfeaturepartitionaswellasthemixturemodelrepresentingtheclustersforeachfeaturesubset. Ourmethodseeksthedesiredfeaturepartitionusingtheincrementalestimationandhasbeentestedonthethreedatasetsdescribedabove.Fig. 3-5 showstheresultsofthetotalevaluationtime.Inallcases,thegrowthrateoftheexecutiontimeforeachiterationbecomesstableafterafewthousandsofiterations.Incontrasttotheapproachthatestimatesatalliterations,theincrementalestimationmethodshowedbetterperformancewithdrasticallyreducedcomputationaltime. Basedontheaforementionedexperimentalresults,weshowedthattheincrementalestimationtechniquecanreducecomputationalcost.However,thistechniquestillmaynotbelimitedinscalabilitybecausethemainproceduresofthisapproachrequiremuchcomputationalcost.First,computingparameterestimatesviatheEMalgorithmrequiresO(NDI)time,whereNisthenumberofdatasamples,Disthenumberof 66

PAGE 67

Performanceevaluationagainstthesizeofdataset dimensions,andIisthenumberofiterations.Second,whenweconsiderthemodelselectionproceduretodeterminethenumberofclusters,thetimecomplexitywillbeO(NDIK),whereKisthenumberofclusters.ThetotalcomputationcostdependsonIforsatisfyingtheconvergencecondition.Whentheincrementalestimationtechniqueisapplied,themodelestimationandmodelselectionprocedureswillbeperformedatmosttwotimes.Sincetheseallproceduresareexecutedateachiterationforndthedesiredfeaturepartitionmodel,thetotalcomputationofourapproachgrowslinearlyasthenumberofiterationsincreases. Forthisscalabilityissue,wetestedourapproachusingtheSA-basedbiasedrandomwalkalgorithmonseveralsyntheticdatasetswith15features.Thenumberofiterationswassetby1.0104.Forthisexperiment,asimilarapproachtothatusedinthepreviousexperimentswasusedforgeneratingdatasetshavingfourdifferentsizesofdata(1000,5000,10000,and20000).Foreachsizeofdata,wegenerated10differentdatasets.Basedonthesedatasets,theSA-basedbiasedrandomwalkalgorithmwasevaluatedwithrespecttothetotalrunningtime.Figure 3-6 showsasummaryofrelevant 67

PAGE 68

9 ].Thewinerecognitiondataset,availableatUCIMachineLearningrepository,has178observations,13features,and3classes.Althoughmanyclusteringalgorithmsusedthisdatasetintheirexperiments,mostofthemrequiredanumberofclustersasaconstraintsoitisdifculttocomparedirectlybetweenthemandourmethod.Whentheconventionalclusteringisexecutedwithallfeatures,wefoundBICvaluesof7201.01,7206.56,and7498.07withGk=1,2,and3clusters,respectively.ThisimpliesthatusingtheconventionalclusteringmethodisnotappropriatetodeterminethenumberofclustersbasedontheBICvaluesforthisdataset.Whenthefeature-subset-wiseclusteringmethodisapplied,wefoundtheBIC(V)=6934.42fortheoptimalfeaturepartitionV.Thispartitionconsistsoftwofeaturesubsets.TherstfeaturesubsetcontainsAlcohol,Malicacid,Ash,Alcalinityofash,Magnesium,andProline.Thesecondfeaturesubsetincludestotalphenols,avanoids,nonavanoidphenols,proanthocyanins,colorintensity,hue,andOD280/OD315ofdilutedwines.Foreachfeaturesubset,twoclustershavebeendiscovered.Fortherstfeaturesubset,therstclustercontains55observations.Amongalltheobservationsbelongingtotherst 68

PAGE 69

D ^G(G) 178 13 3(N/A) [2,2]([N/A) [(v1,...,v5,v13);(v6,...,v12)] Diabetes 768 8 2(N/A) [3,2,3](N/A) [(v1,v7);(v2,v3,v4,v5,v6);(v8)] Waveform 5000 21 3(N/A) [1,1,3](N/A) [(v1);(v21);(v2,v3,...,v19,v20)] ThesummaryofexperimentalresultsofUCIdatasets cluster,53observations(about96.4%)arethememberoftherstclassintheoriginaldataset.Thesecondclustercontainsmostoftheobservationscorrespondingtothesecondandthethirdclassesintheoriginaldataset.Ontheotherhand,therstclusterinthesecondfeaturesubsetcontainsmostlyobservationsbelongingtotherstandthesecondclusterintheoriginaldataset.Thesecondclusterinthesecondfeaturesubsetcontains51observations,consistingof4observationscorrespondingtothesecondclassand47(outof48)observationsofthethirdclassintheoriginaldataset.Basedontheseresults,itcanberegardedasthattherstfeaturesubsetcontributestoextracttherstclusteroftheoriginaldatasetandthesecondfeaturesubsetispotentiallyusefultoidentifythethirdclusterfromtheoriginaldataset.Bycombiningtheclusteringresultsofthesetwofeaturesubsets,threenewclusterscanbeconstructed.Interestingly,wefoundthatthisclusteringresultmatchestheoriginalclasslabelsclosely.Thisimpliesthatourmethodcanalsobeutilizedtoidentifytheoriginalclusters. Thediabetesdatasetconsistof768instances(500and268instancestestingpositiveandnegativefordiabetes,respectively)describedby8numeric-valuedattributes(e.g.,Diastolicbloodpressure(mmHg)andAge(years)).Thewaveformdatasetconsistsof500021-dimensionalinstancesthatcanbegroupedby3classesofwaves.AllfeaturesexceptoneassociatedwithclasslabelsweregeneratedbycombiningtwoofthreeshiftedtriangularwaveformsandthenaddingGaussiannoise[ 66 ]. Becausetheclusteringresultsofourapproachcanbemuchdifferentfromthoseoftheothermethods,itcanbedifcultincomparingbetweenthem.Moreover,thisimplies 69

PAGE 70

Classicationaccuraciesofthevariousfeatureselectionapproaches thedifcultyindirectlyevaluatingtheclusteringaccuracy.Therefore,inourexperiments,weassessedclusteringaccuracyusingthefeaturesubsetthatshowsthemostsimilarclusteringresultstotheoriginalclasslabels.Table 3-4 summarizestheexperimentalresultswithdatasetsincludingthepropertiesofthesedatasets. Intheexperimentwiththediabetesdataset,the2ndfeatureblockcontainedtwomixturecomponentsanditsclusteringaccuracytothetrueclasslabelswas0.6602.Forthe1stfeatureblock,v1hadsimilardegreesofnegativerelationshipswithv7inallthreeclusters.The3rdfeatureblockcontainingasinglefeaturewascomposedofthreemixturecomponents.Boththe1standthe3rdfeatureblocksshowedrelativelymeaninglessanalyticalinformationthanthe2ndfeatureblock. Theexperimentalresultsofthewaveformdatasetseemtocoveraresultofthefeatureselectionmethod.Eachoftwofeatures,v1andv21,wasrepresentedbyasingletonclusterbasedonunivariateGaussiandistribution.Thisimpliesthattheycanberegardedaslessinformativefeatures.Forthe3rdfeatureblockcontainingallfeatures 70

PAGE 71

UsingthesethreeUCIdatasets,weevaluatedtheperformanceofourapproachandotherfeatureselectionmethodsinbothclassicationandclustering.Figure 3-7 summarizestherelevantresultswithrespecttotheclassicationaccuracy.Wecomparedourapproachwiththefollowingvariousexistingapproaches:featureselectionafterperformingdimensionalityreductionwithPCA(PCA-DR),featureselectionthroughgeneticprogramming(GP),utilizingthenaivebayesclassier(NB),techniquesobtainingtheappropriateparametersforaback-propagationnetworkwithoutusingfeatureselectionapproach(PSOBPN1)andwithusingfeatureselectionapproach(PSOBPN2),variableselectiononmixturemodel-basedclustering(VS-Mclust),independentvariablegroupanalysis(AIVGA),andourapproachusingtheSA-basedbiasedrandomwalkalgorithm(SABR)[ 49 51 64 88 ].Inaddition,aclassicationusingthedecisiontree(C4.5(J48))techniquewithallfeatureswasevaluated. Obviously,mostfeatureselectiontechniquesinclassication(i.e.,PCA-DR,GP,NB,PSOBPN1,andPSOBPN2)showedbetterclassicationaccuraciesthanfeatureselectionalgorithmsinclustering.Thisisbecausetheclassicationprocedurewasperformedbasedonthetrainingdataincludingclasslabels.Decisiontree-basedclassicationshowedcomparableperformancetothefeatureselectionapproachesforbothclassicationandclustering.Forthediabetesdataset,VS-Mclustidentied8clustersafterselectingthreevariables.Becauseofthedifcultyinevaluatingtheclusteringaccuracybydirectlycomparingtheidentiedclusterindiceswithactualclasslabels,theclusteringaccuracywascalculatedbyutilizingtheconfusionmatrixbetweentheobtainedclustersandtheactualclasslabels.Likewise,forthewaveformdataset,21distinctclusterswerediscoveredbyVS-Mclust,sowecalculatedtherelevantclusteringaccuracybyusingthesamemannerfordiabetesdataset.Interestingly,thetwoapproachesforfeatureselectioninclusteringshowedbetterorcomparable 71

PAGE 72

Alluvium(adepositofsand,mud,etc.) 62 FracturalRock 13 Riverwater 5 Table3-5. ThegroupsoftheNakdongRiverobservations performancetotheclassicationtechniqueusingthedecisiontree(C4.5).Ourapproachdemonstratedsimilarorcomparableperformancetootheralgorithmsinclustering.Obviously,thisresultimpliesthatnoneofthesealgorithmsshowsthebestperformanceonallthedatasets.However,ourapproachcanbedifferentiatedfromtheotherapproachesinthatitautomaticallydeterminesthenumberofclusterstorepresentthebest-tmixturemodel.Furthermore,ourapproachalsoautomaticallyextractsthemultiplefeatureblockswhiletheotheralgorithmsidentiesonlyasinglecommonfeaturesubset. 73 ]. Toanalyzethesephenomena,watersamplesshouldbereasonablyclassiedintovariousclusteringcriterionssuchasgeologicalorchemicalprocess.Itisalsoimportanttoidentifythechemicalfactorsasbeingstronglyorweaklyrelevanttothespecic 72

PAGE 73

Lineplotsofriverwatersheddataset indexof subsetof mixture (p21,p22) (p31,p32,p33) (p4) (0.100,0.900) (0.0733,0.5833,0.3434) (1.000) pH[8.3143,7.2024,6.0479] Eh[4.8575,5.9931] EC[7.0520,6.0696,5.8864] SO4[3.5531] DO[2.1763,1.4078] Na[5.0138,3.1724,2.3182] K[1.1860] Ca[3.1422,3.6792,3.7750] Cl[5.1545,3.5227,3.0320] NO3[2.6357,0.3736,4.0717] Theresultofmodel-basedclusteringforeachsubsetofvariables clusteringmodel.Fortheseissues,ourapproachhasbeenappliedtoarealdataset.Thewatersampledataset,collectedfromtheNakdongriverwatershedinSouthKorea,contains93locationsand16chemicalelementssuchasNa+,Ca2+,Cl,SO24,HCO3,andNO3andotherfactorssuchaspH,Eh(Redoxpotential),EC(ElectricalConductivity),andDO(DissolvedOxygen).Formoredetailsaboutthisdataset,see[ 73 ].Forhigheraccuracy,alldataexceptpHweretransformedbynaturallog.Toavoidthelog(x0)problem,80samplesand13attributesfromtheoriginaldatasethavebeenselected.Figure 3-8 showstheplotsfortheNakdongriverdatasetafterpreprocessing.Inthis 73

PAGE 74

Discoveredfeaturesubsetsofwatersheddataset gure,eachlines(withshape'--')representsthevectorofeachobservationstotheallfeaturesinthedataset.Thisdatasethas3typesofobservationsdetailedinTable 3-5 withthesizeofeachgroup. Figure 3-9 depictstheresultswiththeminimizedBIC(V).Therstfeaturesubset,shownatFigure 3-9A ,consistsofpH,Mg2+,SiO2,HCO3,andNO3havingthemultivariatehydrogeologicalmixtureofthreeclusters,whichagreeswithclustersofshallowgroundwater(C11),deepgroundwater(C12),andriverwater(C13).ForV2,shownatFigure 3-9B ,EhandDOarefactorsforthemixtureoftwoclustersbasedontheredoxprocess.Specically,clusterC21andclusterC22explainthereductionandoxidationprocess,respectively.InFigure 3-9C ,clusterC33ofV3containingEC,Ca2+, 74

PAGE 75

3-9D ,thegroupofK+andSO24consistsofasingleclustershowingaweakmeaningofwaterpollution.ThisimpliesthatV4isthegroupcontainingtheremainingchemicalelementsand/orotherfactorswhichhaverelativelylesscontributionforclustering.DetailsofexperimentalresultsareshownatTable 3-6 1 ].Thedatasetconsistedof117countriesand11featuresthatwererelatedwiththepopulationanddevelopmentofacountry;Energyconsumptionpercapita(kilogramsoilequivalent),GNI(grossofnetincomepercapitacurrentUS$),GDP(grossofdomesticproductionpercapitacurrentUS$),governmenteducationexpenditure(%ofGDP),populationdensity(perkm2),third-levelstudents(women,%oftotal),urbanpopulationgrowthrate2000-2005(%perannum),contraceptiveuse(ages15-49,%ofcurrentlymarriedwomen),totalfertilityrate2005-2010(livebirthsperwoman),lifeexpectancyofwomenatbirth2005-2010(years),andimportspercapita(1000US$).Toreduceskewnessofdata,weappliedthelog()transformationtovefeatures;energyconsumptionpercapita,GNI,GDP,importspercapita,andpopulationdensity.Thesefeaturesaredenoted`NRG,`GNI,`GDP,`IMP,and`PDN,respectively.Similarly,toreduceskewnessandboundaryeffects,thearcsin(p )transformationwasappliedtothepercentagefeaturessuchasthird-levelstudents,contraceptiveuse,andgovernmenteducationexpenditure[ 100 ].ThesefeatureswillbedenotedaTLS,aCNT,andaGOV,respectively.Asanexample,Figure 3-10 showsthedistribution(kerneldensityestimation)ofTLSandaTLS.Itshowsthat,afterthearcsin(p )transformation,bothtailsofthedistributionbecomelowenoughtousethemixtureofnormals.Lifeexpectancyofwomen,totalfertilityrate,andurbanpopulation 75

PAGE 76

BAftertransformation Thearcsin(p )transformationoftheThird-levelstudentsofwomen aTLS LIF FRT aCNT UPG PDN aGOV 7.69 9.11 9.13 0.85 78.03 8.08 1.97 0.95 1.18 4.16 0.21 2 4.99 6.70 6.73 0.67 61.88 5.49 4.42 0.54 4.11 EstimatesofmeansforeachfeaturesubsetintheUNdataset growthrateareusedwithouttransformation.ThesefeatureswillbecalledLIF,FRT,andUPG,respectively. Whenourfeature-subset-wiseclusteringalgorithmwasapplied,theoptimalfeaturepartitionVwasfoundwithBIC(V)=1825.06.Thispartitionconsistedoffourfeaturesubsets(V1,V2,V3,andV4):V1andV2constructedtwoclustersofcountriesandV3andV4createdonlyasinglecluster. Figure 3-11 depictstheclustersofcountriesbasedonV1andV2.Themembershipofeachcountrywasassignedtotheclusterthathadthehighestposteriormembershipprobability.Table 3-7 showsthemeanestimatesoffeaturesineachcluster. 76

PAGE 77

B2ndfeaturesubset DiscoveredclustersforeachfeaturesubsetinUNdataset Therstfeaturesubset,V1,containedsixeconomy-relatedfeatures:`IMP,LIF,aTLS,`GDP,`GNI,and`NRG.Figure 3-11A illustratestwoclustersofcountrieswithmediumgraycolorandblackcolorbasedontherstfeaturesubset,V1.Thelightgraycolorindicatescountriesthatarenotincludedinthisstudy.Cluster1inmediumgrayincludedcountriesmostlyinEurope,EastAsia,NorthandSouthAmerica,andOceaniawithrelativelyhighGDPsandGNIs(seeTable 3-7 ).Ontheotherhand,Cluster2in 77

PAGE 78

Thesecondfeaturesubset,V2,containedthreefertility-relatedfeaturessuchasUPG,aFRT,andaCNT.ThesefeaturesalsoconstructedtwoclustersofcountriesasshowninFigure 3-11B .Table 3-7 showsthatcountriesinCluster1(coloredinmediumgray)hadarelativelylowfertilityrate,lowrateofurbanpopulationgrowth,andahighpercentageofcontraceptiveuse.Overall,populationgrowthwasmorecontrolledinthecountriesinCluster1thanthoseinCluster2. Apparently,V1andV2foundsimilarclusters,exceptforIndia,Bangladesh,Indonesia,thePhilippines,Malaysia,SouthAfrica,Zimbabwe,andsomeMiddle-Eastcountries.India,Bangladesh,SouthAfrica,andZimbabwewerecategorizedascountrieswithcharacteristicsreectinglesseconomicdevelopmentinclusteringbasedonV1;however,theywerecategorizedascountrieswithmorepopulation-controllingcharacteristicsintheclusteringbasedonV2.Indonesia,thePhilippines,Malaysia,andsomeMiddle-Eastcountrieshadoppositecharacteristics. Tocompareourfeature-subset-wiseapproachwithconventionalclustering,weappliedthemultivariatenormalmixturemodelwithoutpartitioningfeatures.WefoundthattheBIC(V)wasthelowestwhenthenumberofclustersG=2(BIC(V)=1923.09,1908.62,and2026.53withG=1,2,and3clusters,respectively).Theconventionalclustering(G=2)providedthesameresultsastheclustersbasedonV1(economicdevelopmentrelatedfeaturesubset),asinFigure 3-11A .Thisresultseemstooccurbecauseeconomicdevelopment-relatedfeaturesdominateothers. 78

PAGE 79

Ourmethodcanbeusedinvariousapplicationareassuchastextdataminingandgeneexpressiondataanalysis.Specically,intextdatamining,therearemanyfeaturestorepresenttextdocuments.Throughourapproach,theycanbepartitionedoverthevariousfeaturesubsetsthatidentifygenres,authors,styles,andothercategories,andeachdocumentcanbeassigneddifferentclustersacrosstheabovediversefeaturesubsets.However,approachesshouldbeconsideredtoaddresstheproblemofhowto 79

PAGE 80

80

PAGE 81

Insubspaceclustering,themajorityofthepriorresearchworkwasaffectedbyuser-speciedthresholds.Furthermore,itcanbedifculttointerpretthediscoveredsubspaceclusters,becausetheyhavearbitraryshapes.Tomitigatetheseproblems,wenoticedthatthefeaturesofreal-worlddatasetsareusuallyrelatedwithspecicdomainknowledgeforefcientrepresentationofdatasamples.Byutilizingthispropertyforconstructingthefeaturesubspaces,thesearchspaceforsubspaceclusteringcanbemuchreduced.Inthischapter,weproposeanewmodel-basedclusteringapproachontheseconstrainedfeaturesubspaces.Toimprovetherobustness,weusedamultivariateStudentt-distribution. Onereasonablewaytocounteracttheseproblemsistoutilizeefcientmethodsthatattempttondasubsetoffeaturessupportingtheexplanationofclusters.Thisfeatureselectionapproachcanbeaneffectivetechniqueinthatitenablesrepresentingclustersusingonlyafewrelevantfeatures[ 27 ].However,featureselectiontechnique 81

PAGE 82

83 ]. Insubspaceclustering,somenaiveapproachesattempttondclustersoneverypossiblefeaturesubspacebyevaluatingthedenseregions.However,theseapproachesproducealargenumberofclustersandmanyofthemmaynotbedesired.Furthermore,theassumptionthatallfeaturesareindependenteachotherisnotapplicableinpractice.Therefore,weneedtoconsiderhowtodeterminemeaningfulfeaturesubspaces.Areasonableassumptionisthatclusteringshouldbeperformedonlyonthefeaturesubspacesconsistingofstronglycorrelatedfeatures.Anotherprobleminsubspaceclusteringishowtodealwithnoisydataoroutliers.Theseatypicaldatasamplesmaycausedeteriorationinthequalityoftheclusters.Inparticular,thisproblemhasbeenfrequentlydiscussedintheGaussianmixturemodel-basedclustering,whichissensitivetonoisydataoroutliers. Inthischapter,weinvestigateasubspaceclusteringapproachforhighdimensionaldatasets,consideringstrongcorrelationsamongmultiplefeaturestoimprovethequalityofclustersembeddedinthefeaturesubspace.Toalleviateproblemsfromnoisydataandoutliers,weusedamixturemodelwithaheavy-taileddistribution,calledamultivariateStudent'st-distribution,torepresenttheclusters.Inourapproach,thenumberofclustersforthebest-tmixturemodelisdecidedbyevaluatingaparsimoniousmodelselectioncriterion. Therestofthischapterisorganizedasfollows.InSection 4.2 ,subspaceclusteringalgorithmsaresummarized.Section 4.3 describesamethodtoselectstronglycorrelatedfeaturegroups,amixturemodelofmultivariatet-distributionanditsparameterestimationviamaximumlikelihoodestimation,andmodelselectioncriterion 82

PAGE 83

4.4 .Section 4.5 presentsconclusions. 83 ]. Thebottom-upstyleapproaches(e.g.[ 6 22 39 65 86 ])searchforsubspaceclusterswithhigherdensitythanauser-speciedthreshold.Forinstance,CLIQUEstartswithagridofthenon-overlappingrectangularunitsintheoriginalfeaturespace.Then,itattemptstondclustersexistingonthefeaturesubspacesbysearchingfordenseunitsormergingtheconnecteddenseunitsthathavebeenalreadydiscovered.However,duringthisprocess,theprocedureevaluatingthedensityofdataobjectsembeddedinallpossiblefeaturesubspacescanleadtoexponentiallyincreasingthecomputationalcost.Specically,foragivenD-dimensionalfeaturevector,therecanexist2D1nonemptydistinctfeaturesubsetsasthepossiblefeaturesubspaces.Toavoidthisproblem,thesebottom-upstylesubspaceclusteringalgorithmstrytosatisfyaparticularproperty,calledthedownwardclosureproperty;ifdenseregionsarediscoveredinaD-dimensionalspace,thisimpliesthattheycanalsoresideinanyofits(D1)-dimensionalsubspaces[ 105 ].Basedonthismonotonicpropertyofdensity,thesebottom-upstylesubspaceclusteringalgorithmscanreducetheinfeasiblesearchspaceandndclustershavingvarioussizesandshapes. Ontheotherhand,thetop-downapproachessuchas[ 4 ]createequallyweightedclustersinthewholedimensionalspaceattheinitializationstep.Then,theyalternativelyperformtheclusteringprocessbyassigningdifferentweightsforeachclustersandregeneratingnewsubspaceclustersbasedontheupdatedweights.Incontrasttothebottom-upsubspaceclusteringapproaches,thetop-downapproachesrequireexpensive 83

PAGE 84

114 ].Beginningwithinitialresults,-Clustersiterativelyimprovesthequalityofeachclusters.However,thisalgorithmrequiresknowledgeofthenumberofclustersandeachindividualsizewiththeprocessingtimeaffectedbytheclustersize. 21 22 83 ]. Inmanypracticalapplications,datasamplescanbedrawnindependentlyandidenticallydistributedfromunderlyingpopulationswithinaparticularprobabilitydistribution.Ontheotherhand,featuresofthedatasetareusuallyrelatedwithspecicdomainknowledgeforefcientrepresentationofdataobjects.Thisimpliesthattherecanbeanumberofsubsetsoffeatureswhicharestronglycorrelatedtoeachother,henceclusteringontheseconstrainedfeaturespacescanaccompanythediscoveryoftheinformativesubspaceclusters.Inaddition,byselectingonlythesefeaturesubsets,thesearchspaceforsubspaceclusteringcanbedrasticallyreduced. Anotherissueofreal-worlddatasetsisthattheyusuallycontainmanyatypicaldatasamples,callednoisydataoroutliers.Inthecontextofclusteranalysis,howtodealwiththeseoutliershasbeenextensivelydiscussedbecausetheycanaffectthe 84

PAGE 85

Illustrationofrobustfeaturesubspaceclustering 85

PAGE 86

29 76 ],andconstructinganitemixturemodelbasedonaheavy-taileddistributionsuchastheLaplacedistributionortheStudentt-distribution[ 27 101 ]. Inthischapter,weemploytheStudentt-distributiontoselectabest-tmixturemodel-basedclustersonthefeaturesubspaces.Inthemixturemodel-basedclustering,determiningthenumberofclusterscanbeequivalenttondingthebest-tmixturemodel.Aparsimoniousmodelselectioncriterionwasutilizedforthispart.Inthissection,wedescribemethodsrelatedwiththeaboveissues.Figure 4-1 illustratesourapproach.Ascanbeseen,anumberofdifferentfeaturesubspacescanbecreatedbycombiningvariousfeatures,anddatasubmatricescanbeconstructedbasedonthesefeaturesubspaces.Then,eachsubmatrixcanberepresentedbyanitemixturemodel. 55 ].Manymeasurementssuchas2-statisticandPearsoncorrelationhavebeenusedtoassessthepairwisecorrelationbetweentwofeatures[ 110 116 ].Inparticular,Pearsoncorrelation,denotedby,isapopularmeasurementtoevaluatethemagnitudeoflineardependencebetweentwofeatures.Tobespecic,giventwofeaturesv1andv2,aPearsoncorrelation,v1v2,canbeexpressedas whereviandviareameanandastandarddeviationofvi,respectively. SupposethatXisadatamatrixconsistingofNsamples,witheachrepresentedbyaD-dimensionalfeaturevector,V=(v1,...,vD).Inthiscase,correlationsamongD

PAGE 87

PNi=1(Xij1Xj1)2q PNi=1(Xij2Xj2)2,(4) whereX`=1 Sincethevaluesofrcontainthelinearrelationshipbetweentwofeatures,thedirectuseofrmaynotbesuitableforevaluatingthemagnitudeofthepairwisecorrelation.Instead,anelement-wisesquaredmatrixofr,denotedbyR2,canbeusedbecauseallelementsinR2arebetween0and1.ThevaluesoftheelementsinR2arecalledthecoefcientofdetermination[ 90 ]. Whenoneextractssubsetsoffeatureswithstrongpairwisecorrelations,manystatisticalmeasuresforassessingcorrelationsrequireauser-speciedthreshold.Thiscangiverisetodifferentoutputsdependingonthevalueofthreshold.Forthisproblem,basedontheempiricalresultsofmanypreviousapproaches[ 21 ],apossiblethresholdcanbeobtainedthroughthefollowingequation: Thatis,foragivennumberoffeatures(variables),theequation( 4 )calculatestheaverageofallpairwisecorrelationsofthesefeatures.Wecallthisthresholdtheglobalmeancorrelationthreshold.Basedontheresultofthis,ifR2j1j2>=h,itcanbeconsideredthattwofeaturesvj1andvj2arestronglycorrelated.Thisimpliesthattherelevantfeaturescanformafeaturesubspacetoconstructinformativeclusters.Otherwise,therelevantfeaturescanbethoughtofaslooselycorrelatedandthepairofthemcanbeignoredduringtheconstructionofthecandidatefeaturesubspaces. 87

PAGE 88

Ourapproachfocusesonidentifyingclusterstructureslyingontheconstrainedfeaturesubspacewherethefeaturesarestronglycorrelated.Forconstructingtheseconstrainedfeaturesubspaces,extractingpossiblesubsetsoffeaturesthathavelocallystrongcorrelationswitheachothershouldbeconsidered.Forthisproblem,weconsideredthefollowingprocedure. Supposethatafeaturevd1,whered12f1,...,Dg,isselected.Then,thecoefcientsofdeterminationforallotherfeatureswithvd1isexpressedbyacolumnvectorofR2,denotedbyR2d1=(R2d11,...,R2d1D)T.ForR2d1,themean,denotedbyE(R2d1),canbecomputedby WhenweassumethattheR2d1andtheE(R2d1)followthepropertyoftheglobalmeancorrelationthreshold,aR2d1d2greaterthanE(R2d1)canbedeemedtoindicatearelativelystrongcorrelationbetweenafeaturevd2foragivenvd1.Basedonthisresult,afeaturesubset,Vd1,canbeconstructedbyselectingvd2s,satisfying 88

PAGE 89

AlthoughDd1
PAGE 90

Illustrationofconstructingfeaturesubspaceswiththecorrelatedfeatures Inourapproach,thehierarchicalstructureisconstructedbymergingfeatures(orfeaturesubsets)withthenearestfeaturehavingthelargestvalueofR2amongtheremainingfeatures.Todoso,thesinglelinkagealgorithmwasused.Forsinglelinkage,thedistancebetweenapairoffeaturesubsetsisdeterminedbythetwoclosestfeaturesubsetstothedifferentfeaturesubsets.Assumethatthereexisttwofeaturesubsets,V1andV2.Then,thesinglelinkagecanbecalculatedby Basedontheabovedistancemeasureandproximitycriterion,forcreatingthepossiblefeaturesubspaces,theagglomerativehierarchicalclusteringalgorithmcanbeperformedasfollows;First,wesetsinglefeatureswithinafeaturesubsetVdastheleafnodesofthehierarchicalclusteringresult,thenwestartconstructingfeaturesubspacesusinghierarchicalclustering.Thelargerfeaturesubspacescanbeconstructedbyagglomeratingthemostsimilarfeatures(orfeaturesubsets)ateachstep.Afterobtaining 90

PAGE 91

4-2 illustratesthisprocedure.Forthesefeaturesubsets,amixturemodel-basedclusteringisperformed.Duringthisfeaturesubspaceconstructionprocess,someredundantfeaturesubspacescanbecreated.Thesewillbeignoredatthemixturemodel-basedclusteringstep. where,forthekthmixturecomponent,fk(x;k)istheprobabilitydensityofxandkisthevectoroftheunknownparameters. AnitemixturemodelofStudentt-distributioncanberepresentedbyaweightedsumofmultivariateStudentt-distribution.ForD-dimensionaldatax,theprobabilitydensityfunction(p.d.f.)ofthemultivariateStudentt-distributionisdenotedby 2 2Dk 2,(4) wherei=1,...,N,k=1,...,K,k=(k,k,k),and(xi,k;k)=(xik)T1k(xik)thatkandkareameanvectorandapositive,semi-denitecovariancematrix, 91

PAGE 92

85 ].kisadegreeoffreedomthatparameterizesthe`robustness'oftheStudentt-distribution.Inparticular,asincreases,theStudentt-distributionbecomesclosertotheGaussiandistribution. Thegammafunction(m)isexpressedas Interestingly,thep.d.f.ofmultivariateStudentt-distributioncanbewrittenbyusingthemultivariateGaussiandistributionwithmeankandcovariancematrixk=wi, 2 whereaweightwiistheunknownparameterthatfollowsagammadistributiong(wi;k 92

PAGE 93

Theestimationofmaximizingthelikelihoodfunction( 4 )canbeobtainedviatheEMalgorithm[ 85 ].Suppose(t)istheestimationofobtainedafterthetthiterationofthealgorithm.IntheE-step,theconditionalexpectationofthecomplete-datalog-likelihood(Qfunction)iscomputed: Asaresult,theposteriorprobabilitythatxibelongstothekthmixturecomponentiscomputedas where(t)k=(^(t)k,^(t)k,^(t)k)T.Theconditionalexpectationofwi,givenzik=1andxi,is IntheM-step,newparameterestimates(t+1)maximizingQ(,(t))satisfyaconvergencecondition.Ateachiterationstep,theestimatesofk,k,andkcanbeupdatedas 93

PAGE 94

4 ). whereN(t)k=PNi=1z(t)ikand()=dlog() 4 ),see[ 101 ]. 90 112 ].Consideringthecharacteristicsofmixturemodel-basedclusteringthroughtheEMalgorithm,weusedtheintegratedcompletedlikelihood(ICL),amodelselectioncriterionthatprovidesmoreparsimony 94

PAGE 95

13 ].TheICLcanbedenotedby where ThisICLcanbewrittenasaBIC-likeapproximationform,thatis, wheretikistheconditionalprobability,denotedby and^zikisaMaximumAPosteriori(MAP)obtainedfromtheparameterestimates.AmixturemodelthatminimizestheICLcriterionisselectedasadesiredclusteringresult.Algorithm 5 summarizesourapproach. 95

PAGE 96

Dendrogramsforthedifferentfeaturesubsetsindataset-I Inaddition,thefourthfeatureblockcontainingtwonon-informativefeatureswasdrawnfromaGaussiandistributionN(3,I2)andaddedtodataset-I.ThenumberofdatasamplesforGaussianmixturecomponentsrangedfrom90to300. 96

PAGE 97

Someexamplesofthedataplotsondifferentfeaturesubspacesindataset-I Thedataset-II(N=600andD=13)wasgeneratedinasimilarmannerbutcontainedamuchcomplexstructure.ItconsistsofvedifferentfeatureblockswhereeachwasgeneratedfromamixtureofGaussiandistributionswiththefollowingparameters: 97

PAGE 98

Clusteringerrorratesfordifferentfeaturesubspacesindataset-I andthree-dimensionalGaussiandistributionN(3,I3). Totestthenoiserobustness,noisydatasampleswereuniformlygeneratedwithvedifferentnoiseratios(0%,20%,40%,60%,80%,and100%)andaddedtoeachsyntheticdataset. 98

PAGE 99

Dendrogramsfordifferentfeaturesubsetsindataset-II 99

PAGE 100

Someexamplesofthedataplotsondifferentfeaturesubspacesindataset-II Inhighdimensionaldata,manyfeaturesareirrelevantincontributingtotheclusteringresults.Thatis,consideringthosefeaturesthataremeaninglessforclusteringmayexacerbatetheperformanceofthealgorithms.Tomitigatethisproblem,asapreprocessingstep,featuresrelatedwithclusteringcanbeselectedthrougheitherthelikelihoodratiotestorthemodelselectioncriterion[ 69 ].Inourapproach,weusedtheBICcriteriontoltersomeirrelevantfeaturesforclusteringinthepreprocessingstep. Figure 4-3 showsthedendrogramsforthesubsetsoffeaturesthatarestronglycorrelatedinthedataset-I.FromtheresultofFigure 4-3A ,twopossiblefeaturesubspaces(i.e.,fv1,v2gandfv1,v2,v3g)canbeconstructed.Likewise,basedonthe 100

PAGE 101

Clusteringerrorratesfordifferentfeaturesubspacesindataset-II resultofFigure 4-3B ,featuresubspacesfv1,v2gandfv1,v2,v4gcanbegenerated.FromtheresultofFigure 4-3C ,threefeaturesubspacesfv3,v4g,fv1,v3,v4g,andfv1,v2,v3,v4gcanbeconstructed.FromtheresultofFigure 4-3D ,featuresubspacesfv3,v4g,fv2,v3,v4gcanbeconstructed.FromtheresultofFigure 4-3E ,threefeaturesubspacesfv1,v2g,fv5,v6g,andfv1,v2,v5,v6gcanbeconstructed.AsshowninFigure 4-3F ,onlyonefeaturesubspace(i.e.,fv5,v6g)canbegenerated.Amongthesecandidatefeaturesubspaces,someredundantfeaturesubspacescanbegenerated(e.g.,fv5,v6ginFigure 4-3F ).Toreducetheunnecessarycomputationalcost,theycanbeignoredduringtheclusteringprocess.SincethenumberoffeaturesD=6,thepossiblenumberoffeaturesubspacesbasedonthenaivesubspaceclustering 101

PAGE 102

2.However,sincethenumberofselectedfeatureshavingastrongcorrelationwithagivenfeaturecanbesmallerthanD,thepossiblenumberoffeaturesubspacesforsubspaceclusteringcanbereducedtolessthanD2.Inaddition,sincesomeduplicatescanexistamongtheextractedfeaturesubspaces,thenumberoffeaturesubspacesisfurtherreduced.Forexample,intheexperimentswithdataset-I,themaximumnumberofpossiblefeaturesubspacescanbe15.However,duringtheexperiments,ourapproachonlyselected9featuresubspacesthatareexpectedtocontainmeaningfulclusterstructures. Forvisualization,Figure 4-4 showsexamplesofthedataplotsexistingonthevariousfeaturesubspaces.Theyarethefeatureblocksusedforgeneratingthisdataset.Specically,thefeaturesubspacefv1,v2gcontainstwomixturecomponents,showninFigure 4-4A .Thefeaturesubspacefv3,v4gcanberepresentedbythemixturemodel-basedclusterswiththreecomponents,showninFigure 4-4B .IntheFigure 4-4C ,twoclustereddatacanbeidentiedonthefeaturesubspacewithfv5,v6g.Foreachcreatedfeaturesubspace,amixturemodel-basedclusteringwasperformed. Althoughthedataset-Iwasgeneratedbasedonthefourfeatureblocks,manyotherfeaturesubspacescanbeextractedfromthedataset.Thischaracteristiccanmakeitdifcultwhenevaluatingthequalityofclustersthatarediscoveredonthedifferentfeaturespaces.Forthisreason,toevaluatethequalityofclusters,weonlyconsideredclustersthatwerediscoveredonfeaturesubspacesshowninFigure 4-4 .Toassessthequalityofclusteringresults,clusteringerrorratecanbecalculatedasthenumberdatapointsassignedtodesiredclustersdividedbythenumberofdatasamples[ 50 ],denotedby 102

PAGE 103

Figure 4-5 showstheclusteringerrorrateswithrespecttothechangeofnoiseratioforthedatasamplesofthevariousfeaturesubspacesembeddingondataset-Iwithuniformnoise,showninFigure 4-4 .Obviously,themixturemodel-basedclusteringresultsusingthemultivariateStudentt-distributionmodelsshowedbetterperformancethantheGaussianmixturemodelswithrespecttotherobustnessinauniformnoisecondition.Inaddition,asthestructureofdatasetontheextractedfeaturesubspacebecomesmorecomplex,theapproachusingthemultivariateStudentt-distributionshowedbetterperformancethanthemethodbasedontheGaussiandistribution. Forthesubsetsoffeaturesthatarestronglycorrelatedinthedataset-II,10dendrogramshavebeenconstructedbasedonthehierarchicalclusteringalgorithm,showninFigure 4-6 .Ascanbeseen,sincethedataset-IIismuchlargerandhasamorecomplexstructurethanthedataset-I,morecandidatefeaturesubspacescanexistindataset-II.Figure 4-7 showsexamplesofthedataplotsexistingforthevariousfeaturesubspacesinthedataset-II.Similartothecaseofdataset-I,theyarethefeatureblocksusedforgeneratingdataset-II.Tobespecic,inthefeaturesubspacefv1,v2g,datasamplescanberepresentedbyamixturemodelwiththreecomponents,showninFigure 4-7A .Onthefeaturesubspacefv3,v4g,clustersthatarerepresentedbythemixturemodelwithfourcomponentscanbefound,asshowninFigure 4-7B .Figure 4-7C showsthatthreehyper-ellipsoidalclustersexistingonthefeaturesubspacefv5,v6,v7g.Basedonthefeaturesubspacefv8,v9,v10g,fourmixturemodel-basedclustereddatacanbeidentied,asshowninFigure 4-7D Figure 4-8 showstheclusteringerrorswithrespecttothenoiseratioforseveraldatasamplesofthevariousfeaturesubspacesembeddingondataset-IIwithuniformnoise.AspresentedinFigure 4-8A andFigure 4-8C ,boththeStudentt-distributionmixture 103

PAGE 104

Samples Dimensions Classes Iris 150 4 3 Wine 178 13 3 Ecoli 336 7 8 Yeast 1484 8 10 Diabetes 768 8 2 Waveform 5000 21 3 ThedescriptionofUCIdatasetsusedintheexperiments modelandtheGaussianmixturemodelshowedlowclusteringerrorrates.Ontheotherhand,forthedatasetsonthefeaturesubspacewithamorecomplexstructure(e.g.,fv5,v6,v7gandfv8,v9,v10g),themixtureofmultivariateStudentt-distributionmodelsdemonstratedbetternoiserobustnessthantheGaussianmixturemodelsinanoisycondition,asshowninFigure 4-8B andFigure 4-8D 9 ]wereused,becausethesearemoresuitablefordiscussingandvalidatingtheperformanceoftheproposedalgorithm.Forallthesedatasets,classlabelsassignedtothedatasampleswereremovedbeforeexperiments.AsummaryofthesedatasetsisprovidedinTable 4-1 Ouralgorithm'sperformancewasevaluatedbycomparingwithotherthreesubspaceclusteringalgorithms(i.e.,EWKM,LAC,andPROCLUS)[ 3 31 54 ].UsingUCIdatasets,experimentalresultsforfouralgorithmsincludingourapproachwerecalculatedbyperforming20executionsandevaluatedusingtheaverageoftheRandIndex[ 89 ].Foragivendatasetx,assumethatwehavetwopartitionsofx,c1andc2,wherec1consistsofclasslabelsandc2consistsofclusteredlabels.Then,RandIndexisdenedas 104

PAGE 105

ComparisonofthesubspaceclusteringalgorithmswithRandIndex wherecooisthenumberofpairsofsamplesinxthatareinthesamesetinc1andinthesamesetinc2,cxxisthenumberofpairsofsamplesinxthatareindifferentsetinc1andindifferentsetinc2,andNisthenumberofdatasamples,respectively[ 105 ]. Figure 4-9 showstherelevantresults.Ascanbeseen,formostdatasets,ourapproachobtainsthehighestscoreoftheotherthreesubspacealgorithms.However,sincenoneofthesealgorithmsshowsthebestperformanceonallthedatasets,itisnotsatisfactorythatouralgorithmoutperformtheseotherapproachesonlyusingtheseresults.Nevertheless,ouralgorithmhasanobviousadvantageinthatitdeterminesthenumberofclusterstorepresentthebest-tmixturemodelbuttheotheralgorithmsrequiretheusertoprovidethenumberofclustersandotheruser-speciedparameters. Intheseexperiments,sinceallclusteringresultsbasedonvariousfeaturesubspacescanberepresentedbydifferentparameterestimatessuchasthenumberofclustersandmixtureproportions,itwouldbeinappropriatetoevaluatethediscoveredclustersbyonlyusingtheRandIndexorNormalizedMutualInformation[ 89 ]whichmeasuresthesimilaritybetweentheclusterindicesthroughthemaximumaposteriori 105

PAGE 106

Groupsofclustersbymerging Class Featuresubsets ofclusters basedonthehighestclusterpurity coveragerate (SW,PL,PW) 2 (SL,PL,PW) 3 (PL,PW) 3 Setosa:IrisSetosa,Versicolor:IrisVersicolor,Verginica:IrisVerginica TheoverallexperimentalresultsonIrisdataset Featuresubsets IrisSetosa IrisVersicolor IrisVirginica (SW,PL,PW) 1.000 1.000 1.000 (SL,PL,PW) 1.000 0.920 0.880 (PL,PW) 0.980 0.980 0.680 ThelocalclusterpuritiesondiscoveredvarioussubspacesforIrisdataset (MAP)classicationofthedatasamplesandtheprovidedclasslabels.Furthermore,itishardtoinferthevalidityofsubspaceclustersthroughthecomparisonofthesevarioussubspaceclusteringresults.Forthisreason,onecanconsiderthatadiscoveredclustercanconsistofoneormoreclassesintheoriginaldataset.Inthiscase,theclusteringaccuracycanbecalculatedbycomputingthehighestfrequenciesforeachclasslabelthatbelongstothecluster.Wecallthistypeofclusteringaccuracytheclasscover-agerate.Besidestheclasscoveragerate,wealsoneedtoinvestigatetheclassesconsistingofthediscoveredclusters.Forthis,weconsideredthehomogeneityforeachoftheclassesthatcorrespondtoaspeciccluster.Wecallthisprobabilityasthelocalclusterpuritythatiscomputedbytheratioofthenumberoftheestimatedclasslabelscorrespondingtoadesiredtrueclass.Byutilizingthesemeasures,wediscussthevariousanalysesfortheseexperimentalresults. 106

PAGE 107

Table 4-2 showstheexperimentalresultswiththeirisdataset.Duringthefeaturesubspaceconstructionprocess,afeaturesubspace(PL,PW)wasrepeatedlyevaluatedsoduplicateoneswerediscarded.Asaresult,threedistinctfeaturesubspaceshavebeenextracted.Ascanbeseen,basedonthefeaturesubspace(SW,PL,PW),twoclustersweresuccessfullydiscovered.Thisresultsupportsourpreviousdiscussionthatthetwogroups(excludingIrisSetosa)cannotbelinearlyseparated.Incontrast,fortheothertwofeaturesubspaces,(SL,PL,PW)and(PL,PW),theclusteringresultsshowthatthreeclustershavebeendiscovered.Fromthisresult,wecaninferthatthesepalwidth(SW)featureaffectstheseparabilityofthesethreegroups.Table 4-3 showsthelocalclusterpurityofeachclassbasedontheresultsoftheTable 4-2 .ThisresultidentiesthatdatasamplesoftheIrisVirginicaclassarerelativelylesslikelytobeassignedintothedesiredclasswhenpartitioningthedatasetintothreeclasses. 4-4 Table 4-4 showstheresultsofourapproachontheEcolidataset.Althoughtheoriginaldatasetwasclassiedby8groups,someoriginalclassescontainedtoofewdatasamplestoberegardedasacluster.Inaddition,someofthosesmallclasses 107

PAGE 108

Groupsofclustersbymerging Class Featuresubsets ofclusters basedonthehighestclusterpurity coveragerate (mcg,gvh) 2 (mcg,gvh,alm1) 3 (alm1,alm2) 2 (mcg,alm1,alm2) 2 (mcg,aac,alm1,alm2) 2 cp:cytoplasmim:innermembranewithoutsignalsequencepp:perisplasmimU:innermembrane,uncleavablesignalsequenceom:outermembraneomL:outermembranelipoproteinimL:innermembranelipoproteinimS:innermembrane,cleavablesignalsequence TheoverallexperimentalresultsonEcolidataset Featuresubsets cp im pp imU om omL imL imS (mcg,gvh) 0.944 0.831 0.923 0.743 0.950 0.800 0.500 1.000 (mcg,gvh,alm1) 0.986 0.922 0.865 0.943 1.000 0.800 1.000 0.500 (alm1,alm2) 0.881 0.779 0.846 0.829 0.850 1.000 1.000 1.000 (mcg,alm1,alm2) 0.874 0.779 0.885 0.800 0.950 1.000 0.500 0.500 (mcg,omL,alm1,alm2) 0.867 0.779 0.904 0.714 0.850 1.000 0.500 0.500 ThelocalclusterpuritiesondiscoveredvarioussubspacesforEcolidataset (e.g.,imU,imL,andimS)showedsimilarpatternswhichcanbegroupedtogether.Asaresult,formostclusteringalgorithmresults,theoriginal8groupsweremergedinto2or3groups.Besides,twofeaturesinacandidatefeaturesubspace(lip,chg)showedhighcoefcientsofdetermination.However,basedonthemixturemodel-basedclustering,thisfeaturesubsetwasrepresentedbyasinglecluster,soitwasignoredinthisresult.TheobtainedsubspaceclusteringresultssummarizedinTable 4-4 canbedividedintotwogroupsandeachofthemcanbeinterpretedasfollows:Fortherstgroup(i.e.,(mcg,gvh)and(mcg,gvh,alm1)),thefeature`alm1'contributestoconstructingthenewclusterstructurewiththreemixturecomponentsandimprovingtheclasscoveragerate.Threefeaturesubsetsinthesecondgroupshowthesamegroupsofthemerged 108

PAGE 109

Table 4-5 showsthelocalclusterpurityfortheclusteringresultsthatarediscoveredonvariousfeaturesubsetsoftheEcolidataset.Basedontheresultsofthelocalclusterpurity,theobtainedclusteringresultsttingtheStudent'st-mixturemodelcanbeinterpretedbygroupingtheclassessothattheclusterindexofthehighestlocalclusterpurityisthesame. 4-10 illustratesthedendrogramsconstructedfromthedifferentfeaturesubsetsinwhichthestrongcorrelationsexist.Inthisgure,eachnumberirepresentthefeaturevi.Basedonthesedendrograms,30distinctfeaturesubspacesthatarelikelytocontaintheinformativeclusterswerecreated. Table 4-6 showsasummaryofthesubspaceclusteringresultsofthewinedataset.Inthefeaturesubsetscolumn,theithfeatureviwasrepresentedbynumberi.Among30constructedfeaturesubspaces,twofeaturesubspaces(i.e.,(34)and(89))wereignoredbecausetheywererepresentedbyamodel-basedclusterwithasinglemixturecomponent.BasedonthedendrogramsshowninFigure 4-10 ,thesesubspaceclusteringresultscanbegroupedbysatisfyingthedownwardclosureproperty.Thisprocesssupportstheinterpretationofthediscoveredsubspaceclusteringresults.AsshowninTable 4-6 ,eachgroupofsubspaceclusteringresultsshowsasimilartrendwithrespecttothegroupsofthemergedclustersaswellastheclasscoveragerate.Therstfourgroupsonthefeaturesubsetscolumnillustratethatthosefeaturesubspacescanbe 109

PAGE 110

Dendrogramsfordifferentfeaturesubsetsinwinedataset 110

PAGE 111

Groupsofclustersbymerging Class Featuresubsets ofclusters basedonthehighestclusterpurity coveragerate (16713) 2 (146713) 2 (1456713) 2 (3413) 2 (34513) 2 (3451013) 2 (1513) 2 (158913) 2 (1358913) 2 (113) 2 (11013) 2 (141013) 3 (67912) 2 (6791213) 2 (6791112) 2 (67891112) 2 (678912) 2 (4678912) 2 (1112) 2 (110111213) 2 (267101112) 2 (67) 2 (6712) 2 (671112) 2 (2671112) 3 (678) 2 (167813) 3 (13467813) 3 Theidentiedsubspaceclusteringresultsonthevariousfeaturesubspacesofthewinedataset relatedtoextracttheBaroloclass.Specically,inthesecondgroupwiththreedistinctfeaturesubsets(i.e.,(3413),(34513),and(3451013)),theBroloclassshowedthelowerlevelsofAsh(v3),Alcalinityofash(v4),andProline(v13)thantheGrignolinoandtheBarberaclasses.Incontrast,thelattersevengroupsoffeaturesubsetsshowthattheycancontributetoidentifyingtheBarberaclass.Forexample,accordingtotheresultofthesixthgroupoffeaturesubsets(i.e.,(6791112)and(67891112)),theBarberaclassillustratedarelativelyhigherlevelthantheothertwoclasseswithrespecttothefollowingvefeatures:v6:Totalphenols,v7:Flavanoids,v9:Proanthocyanins,v11:Hue,andv12:OD280/OD315ofdilutedwines.Inaddition,basedontheresultsoftheclasscoveragerates,theproposedsubspaceclusteringalgorithmdemonstrated 111

PAGE 112

Groupsofclustersbymerging Overall Featuresubsets ofclusters basedonthehighestclusterpurity clusterpurity (18) 3 (25) 2 (258) 3 (2568) 4 (346) 3 (3468) 3 (456) 2 (245) 3 (2456) 4 (24567) 4 (128) 2 Theidentiedsubspaceclusteringresultsonthevariousfeaturesubspacesofthediabetesdataset thattheclusterscanbesuccessfullyidentiedbyttingthemixturemodel.Inthisexperiment,foursubspaceclusteringresultsconsistingofthreeclusterswereidentied.Amongthem,twosubspaceclusteringresults(i.e.,(141013)and(13467813)showedthattheirclasscoveragerateswereabout0.9326and0.9382.Comparedwithsomepreviousfeatureselectionalgorithms[ 61 87 ](0.9339and0.9101),ourapproachshowedsimilarorslightlyimprovedclusteringaccuracies. Table 4-7 showstheexperimentalresultswiththediabetesdataset.Thenumbersinthefeaturesubsetscolumnrepresenttherelevantfeaturesdescribedabove.Sincethereareonlytwoclassesinthisdataset,weneedtoconsideradifferentwaytoanalyzethesevariousclusteringresults.First,theclusteringresultsfortwofeaturesubspaces 112


As shown in Table 4-7, the clustering results on the three feature subspaces (i.e., (25), (456), and (128)) were represented by two mixture components and showed similar clustering accuracies. On five different feature subspaces, the mixture model-based clustering results consist of three mixture components. Since the original dataset is composed of two classes, these clustering results may not be directly compared with the class labels. Instead, those clusters can provide hidden relationships among the relevant features. For example, the feature subspace (18) was composed of the following three clusters: i) old and relatively few pregnancies, ii) young and relatively few pregnancies, and iii) old and many pregnancies. However, by merging two clusters (e.g., {C1, C2} in this result), these clustering results show similar results to the mixture model-based clustering results with two components. In this case, in particular, the merged group of cluster i) and cluster iii), together with cluster ii), commonly showed that the feature `number of times pregnant' has a negative relationship with the feature `age'. This implies that those two clusters contain similar patterns. For the other feature subspaces consisting of three or four mixture components, each cluster can be interpreted in a similar way.

Using the dendrograms shown in Figure 4-11, 63 distinct feature subspaces for mixture model-based clustering were generated. During the construction of the feature subspaces, many redundant feature subsets were filtered. Among the clustering results based on those 63 feature subspaces, the 46 selected ones are shown in Table 4-8.


Figure 4-11. Dendrograms of different feature subsets for the waveform dataset


Table 4-8. The identified subspace clustering results on the various feature subspaces of the waveform dataset

Feature subsets         Number of clusters    Cluster-1 (1657) / Cluster-2 (1646) / Cluster-3 (1696)    Class coverage rate
(19101116)              2                     891                     0.7011
(1910111621)            2                     989                     0.7275
(6781415)               2                     865                     0.6895
(56781415)              2                     1062                    0.7532
(5678131415)            2                     1209                    0.7896
(45678131415)           2                     1195                    0.7738
(4567812131415)         2                     1108                    0.7449
(24567812131415)        3                     851  1411  1587         0.7700
(34567812131415)        3                     844  1378  1599         0.7644
(678141516)             2                     1012                    0.7353
(5678141516)            2                     1145                    0.7692
(567813141516)          2                     1169                    0.7674
(4567813141516)         2                     1191                    0.7740
(456781213141516)       3                     857  1415  1568         0.7682
(3456781213141516)      3                     1079 1395  1562         0.8074
(67814151617)           2                     1211                    0.7902
(567814151617)          2                     1179                    0.7702
(5678914151617)         2                     1208                    0.7760
(567891314151617)       3                     844  1379  1598         0.7644
(4567891314151617)      3                     1108 1394  1529         0.8060
(56789131415161718)     3                     1082 1394  1562         0.8078
(678914151617)          2                     1212                    0.7792
(67891415161718)        2                     1172                    0.7642
(6789101415161718)      3                     844  1379  1598         0.7644
(678910141516171819)    3                     1080 1401  1555         0.8074
(89151617)              2                     965                     0.7311
(8915161718)            2                     986                     0.7265
(89101115161718)        2                     1242                    0.7876
(8910111516171819)      2                     1163                    0.7626
(910111718)             2                     963                     0.7293
(459101112131718)       2                     1151                    0.7586
(5671314)               2                     865                     0.6895
(45671314)              2                     989                     0.7275
(4567121314)            2                     1123                    0.7606
(456711121314)          2                     1207                    0.7778
(3456711121314)         2                     1160                    0.7616
(78141516)              2                     865                     0.6895
(7814151617)            2                     953                     0.7197
(78914151617)           2                     1129                    0.7626
(7891415161718)         2                     1179                    0.7702
(789101415161718)       2                     1194                    0.7718
(78910141516171819)     3                     844  1379  1598         0.7644
(678910141516171820)    3                     1110 1391  1529         0.8062
(4581013)               2                     928                     0.7151
(458101321)             2                     989                     0.7275
(1458101321)            2                     1211                    0.7902

The remaining results were not included for one of the following reasons: first, the desired mixture model was composed of a single mixture component; second, in several feature subsets, the relationships between features were either positive correlations or anticorrelations, producing a complex cluster structure.
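The local cluster purity and class coverage rate reported in Tables 4-5 through 4-8, as well as the grouping of clusters by their highest-purity class used when interpreting these tables, can be computed roughly as in the following sketch. This is only a minimal illustration: it assumes that local cluster purity is the fraction of a cluster's objects belonging to its majority class and that the class coverage rate is the fraction of all objects whose class matches the majority class of their cluster; the exact definitions used in the experiments may differ, and the function names are ours.

    from collections import Counter

    def local_cluster_purity(cluster_labels, class_labels):
        """For each cluster, return (majority class, purity = majority count / cluster size)."""
        purity = {}
        for c in set(cluster_labels):
            members = [cls for cl, cls in zip(cluster_labels, class_labels) if cl == c]
            majority_class, count = Counter(members).most_common(1)[0]
            purity[c] = (majority_class, count / len(members))
        return purity

    def class_coverage_rate(cluster_labels, class_labels):
        """Fraction of objects whose class matches the majority class of their cluster."""
        purity = local_cluster_purity(cluster_labels, class_labels)
        covered = sum(1 for cl, cls in zip(cluster_labels, class_labels)
                      if purity[cl][0] == cls)
        return covered / len(class_labels)

    def merge_clusters_by_majority_class(cluster_labels, class_labels):
        """Group clusters whose highest local purity points to the same class."""
        purity = local_cluster_purity(cluster_labels, class_labels)
        groups = {}
        for c, (majority_class, _) in purity.items():
            groups.setdefault(majority_class, []).append(c)
        return groups

    # Toy usage: three clusters over two classes.
    clusters = [0, 0, 0, 1, 1, 2, 2, 2]
    classes  = ['a', 'a', 'b', 'b', 'b', 'a', 'a', 'b']
    print(local_cluster_purity(clusters, classes))
    print(class_coverage_rate(clusters, classes))
    print(merge_clusters_by_majority_class(clusters, classes))

With these helpers, merging three mixture components down to two groups (as in the diabetes analysis above) amounts to reading off which clusters share the same majority class.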


In practical application domains, most datasets consist of a number of features, some of which are strongly correlated with each other. This property provides several advantages. By using the strong pairwise correlations between features, the number of feature subspaces to be constructed can be drastically reduced, thereby reducing the computational cost. In addition, the extracted feature subspaces can be regarded as containing meaningful structures, so an improvement in the quality of the clusters can be expected.

In this chapter, we presented a novel subspace clustering approach that considers strong pairwise correlations among features. In this approach, the possible feature


Through the experimental results with synthetic datasets, we demonstrated that our approach is robust to noisy datasets and successfully finds the desired clusters that can be represented by a finite mixture model on the extracted feature subspaces. Our approach was also applied to several real datasets obtained from the UCI Machine Learning Repository. Based on these extensive experimental results, we found many subspace clusters that were not discovered by previous feature selection approaches. In addition, we demonstrated that these identified subspace clusters contain meaningful information.
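The construction of candidate feature subspaces from strong pairwise correlations, as used throughout this chapter, can be sketched as follows. This is a rough illustration rather than the exact procedure: it assumes Pearson's correlation coefficient, a hypothetical threshold of 0.5 for a "strong" correlation, and SciPy's average-linkage hierarchical clustering on the distance 1 - |correlation|; the thresholds, linkage criterion, and function names are placeholders.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def correlated_feature_subsets(X, threshold=0.5):
        """For each feature, collect the features whose absolute Pearson
        correlation with it exceeds the (hypothetical) threshold."""
        corr = np.corrcoef(X, rowvar=False)   # d x d correlation matrix
        d = corr.shape[0]
        subsets = {}
        for i in range(d):
            relevant = [j for j in range(d) if j != i and abs(corr[i, j]) >= threshold]
            if relevant:
                subsets[i] = sorted([i] + relevant)
        return subsets

    def feature_subspaces_by_agglomeration(X, n_subspaces=3):
        """Merge features with agglomerative hierarchical clustering on the
        distance 1 - |correlation|, yielding candidate feature subspaces."""
        corr = np.corrcoef(X, rowvar=False)
        dist = 1.0 - np.abs(corr)
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method='average')
        labels = fcluster(Z, t=n_subspaces, criterion='maxclust')
        return {k: np.where(labels == k)[0].tolist() for k in np.unique(labels)}

    # Toy usage on random data with 6 features, where features 0 and 1 are correlated.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))
    X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)
    print(correlated_feature_subsets(X))
    print(feature_subspaces_by_agglomeration(X))

The linkage step corresponds to cutting the dendrograms over features (as in Figures 4-10 and 4-11) to obtain the feature subspaces on which mixture models are then fitted.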


In clustering, the use of statistical models provides the following advantages. First, since the given data are assumed to follow a particular statistical model, the constructed models are supported by the related theoretical background. Second, the constructed models can not only be adapted to particular problems but can also be applied to similar types of problems by adjusting the relevant parameters.

These advantages carry over to feature selection techniques, and many feature selection algorithms for clustering have been proposed based on these characteristics. However, they share a common shortcoming in that they select only features related to a particular cluster structure. This implies that they are limited in discovering the various feature subsets, each of which contributes to identifying the different clusters it contains. For this reason, a new clustering technique, called subspace clustering, has been of interest in recent years.

Although many existing subspace clustering algorithms have shown that they can find various clusters lying in different feature subspaces, they have drawbacks with respect to scalability and their dependence on user-specified thresholds. In this dissertation, we developed novel methods to deal with these issues. The first approach efficiently reveals multiple feature subsets for various clusters based on the mixture model. The second method generalizes the first approach by allowing it to consider overlapping feature subspaces.

In many applications of cluster analysis, data can be composed of a number of feature subsets, each of which is represented by a number of diverse mixture model-based clusters. However, most feature selection algorithms have rarely been interested in this kind of cluster structure because they account only for the discovery of a single informative feature subset for clustering.
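Both approaches rest on fitting a finite mixture model on a candidate feature subspace and letting a model selection criterion choose the number of components. The sketch below illustrates that step, using scikit-learn's Gaussian mixture with BIC as a stand-in; the experiments in this dissertation use Student's t-mixture models and may rely on a different criterion, so this is only an approximation of the idea, and the function name and parameters are ours.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def best_mixture_on_subspace(X, feature_subset, max_components=5, random_state=0):
        """Fit mixtures with 1..max_components components on the given
        feature subspace and return the fit minimizing BIC."""
        Xs = X[:, feature_subset]
        best_model, best_bic = None, np.inf
        for k in range(1, max_components + 1):
            gmm = GaussianMixture(n_components=k, covariance_type='full',
                                  random_state=random_state).fit(Xs)
            bic = gmm.bic(Xs)
            if bic < best_bic:
                best_model, best_bic = gmm, bic
        return best_model, best_bic

    # Toy usage: two well-separated groups on features 0 and 1.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(100, 3)),
                   rng.normal(5, 1, size=(100, 3))])
    model, bic = best_mixture_on_subspace(X, feature_subset=[0, 1])
    labels = model.predict(X[:, [0, 1]])
    print(model.n_components, bic)

A subspace whose best-fit model has a single component carries no cluster structure and can be discarded, which is how the uninformative feature subspaces mentioned earlier were filtered out.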


By identifying a number of meaningful non-overlapping feature blocks, each of which can be represented by a finite mixture model, the proposed feature-block-wise clustering approach demonstrated the ability to overcome problems experienced by many previous feature selection algorithms for clustering. However, it has several drawbacks. First, it considers only non-overlapping feature subspaces during the clustering process. In practice, many datasets consist of a feature vector that can form a number of feature subspaces, and some of these features can be associated with multiple feature subspaces. For this kind of dataset, it may


For this problem, a novel method utilizing correlation coefficients was presented for determining meaningful feature subspaces. Specifically, when a feature is given, the features showing a strong correlation with it can be regarded as its relevant features. By collecting these relevant features for a given feature, a feature subset that will be used for constructing the feature subspaces is created. After obtaining these feature subsets, the feature subspaces are constructed by merging features (or feature subsets) with an agglomerative hierarchical clustering technique.

Many diverse mixture model-based clusters can be discovered on these feature subspaces. To overcome the sensitivity of the Gaussian mixture model to outliers, we utilized the Student's t-distribution mixture model for clustering. Experimental results with both synthetic and real datasets showed that our approach can efficiently identify meaningful subspace clusters within an acceptable computational cost. In addition, the search space in our approach grows linearly as the number of dimensions increases. This approach can be generalized, making it useful in various application areas such as text data mining, gene expression data analysis, and pattern extraction of human activity in ubiquitous computing.

For future work, we will consider the following problem. Using the subspace clustering algorithms presented in this dissertation, we can effectively extract the desired subspace clusters from the user's perspective. However, real applications require dealing with categorical datasets as well as numerical datasets. Unfortunately, the presented algorithms are restricted to numerical datasets and cannot be directly applied to categorical data. One reason is the difficulty in measuring the distance between data objects due to the absence of an


[1]
[2] Aggarwal, C. C. A Human-Computer Interactive Method for Projected Clustering. IEEE Transactions on Knowledge and Data Engineering, 16 (2004): 448.
[3] Aggarwal, C. C., Procopiuc, C. M., Wolf, J. L., Yu, P. S., and Park, J. S. Fast Algorithms for Projected Clustering. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'99), (1999): 61.
[4] Aggarwal, C. C. and Yu, P. S. Finding Generalized Projected Clusters in High Dimensional Spaces. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX (2000): 70.
[5] Aggarwal, C. C. and Yu, P. S. Redefining Clustering for High-Dimensional Applications. IEEE Transactions on Knowledge and Data Engineering, 14 (2002): 210.
[6] Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'98), (1998): 94.
[7] Akaike, H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19 (1974): 716.
[8] Anguilli, F., Cesario, E., and Pizzuti, C. Random walk biclustering for microarray data. Information Sciences: An International Journal, 178 (2008): 1479.
[9] Asuncion, A. and Newman, D. J. UCI Machine Learning Repository. Department of Information and Computer Science, University of California, Irvine, CA (2007).
[10] Ben-Hur, A., Horn, D., Siegelmann, H., and Vapnik, V. A support vector clustering method. In Proceedings of the Fifteenth International Conference on Pattern Recognition, (2000): 724.
[11] Berkhin, P. A Survey of Clustering Data Mining Techniques. In Grouping Multidimensional Data: Recent Advances in Clustering (J. Kogan et al.), (2006).
[12] Bezdek, J. C. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, NY, (1981).
[13] Biernacki, C., Celeux, G., and Govaert, G. Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (2000), no. 7: 719.
[14] Bishop, C. M. Pattern Recognition and Machine Learning. Springer, (2006).


[15] Booth, J. G., Casella, G., and Hobert, J. P. Clustering using Objective Functions and Stochastic Search. Journal of the Royal Statistical Society, Ser. B, 70 (2008): 119.
[16] Bradley, P. S., Fayyad, U., and Reina, C. Scaling Clustering Algorithms to Large Databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD'98), (1998): 9.
[17] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Chapman & Hall/CRC, 1984.
[18] Bryan, K., Cunningham, P., and Bolshakova, N. Application of Simulated Annealing to the Biclustering of Gene Expression Data. IEEE Transactions on Information Technology in Biomedicine, 10 (2006): 519.
[19] Celeux, G. and Govaert, G. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis, 14 (1992): 315.
[20] Chang, J.-W. and Jin, D.-S. A new cell-based clustering method for large, high-dimensional data in data mining applications. In Proceedings of the ACM Symposium on Applied Computing, (2002).
[21] Chen, M.-S. and Chuang, K.-T. Clustering Categorical Data Using the Correlated-Force Ensemble. In Proceedings of the Fourth SIAM International Conference on Data Mining, (2004).
[22] Cheng, C. H., Fu, A. W., and Zhang, Y. Entropy-based Subspace Clustering for Mining Numerical Data. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (1999): 84.
[23] Cheng, Y. and Church, G. M. Biclustering of Expression Data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), (2000): 93.
[24] Cho, H., Dhillon, I. S., Guan, Y., and Sra, S. Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data. 2004.
[25] CIO-Online. http://www.cio.de/ (2008).
[26] Constantinopoulos, C., Titsias, M. K., and Likas, A. Bayesian Feature and Model Selection for Gaussian Mixture Model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28 (2006): 1013.
[27] Cord, A., Ambroise, C., and Cocquerez, J.-P. Feature selection in robust clustering based on Laplace mixture. Pattern Recognition Letters, 27 (2006): 627.


[28] Cover, T. M. and Thomas, J. A. Elements of Information Theory, 2nd Edition. Wiley Series in Telecommunications and Signal Processing (2006).
[29] Dave, R. N. and Krishnapuram, R. Robust Clustering Methods: A Unified View. IEEE Transactions on Fuzzy Systems, 5 (1997), no. 2: 270.
[30] Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Ser. B, 39 (1977): 1.
[31] Domeniconi, C., Gunopulos, D., Ma, S., Yan, B., Al-Razgan, M., and Papadopoulos, D. Locally adaptive metrics for clustering high dimensional data. Data Mining and Knowledge Discovery, 14 (2000): 63.
[32] Domeniconi, C., Papadopoulos, D., Gunopulos, D., and Ma, S. Subspace Clustering of High Dimensional Data. In Proceedings of the Fourth SIAM International Conference on Data Mining, (2004).
[33] Dy, J. and Brodley, C. E. Feature Subset Selection and Order Identification for Unsupervised Learning. In Proceedings of the Seventeenth International Conference on Machine Learning, (2000).
[34] Dy, J. and Brodley, C. E. Feature Selection for Unsupervised Learning. Journal of Machine Learning Research, 5 (2004): 845.
[35] Dy, J., Brodley, C. E., Kak, A., Broderick, L. S., and Aisen, A. M. Unsupervised Feature Selection Applied to Content-Based Retrieval of Lung Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25 (2003): 373.
[36] Fraley, C. and Raftery, A. E. Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association, 97 (2002), no. 458: 611.
[37] GenBank. National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/Genbank/ (2008).
[38] George, T. and Merugu, S. A Scalable Collaborative Filtering Framework Based on Co-Clustering. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM 2005), (2005): 625.
[39] Goil, S., Nagesh, H., and Choudhary, A. MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets. Tech. Rep. CPDC-TR-9906-010, Northwestern University, Evanston, IL 60208, 1999.
[40] Govaert, G. and Nadif, M. Clustering with block mixture models. Pattern Recognition, 36 (2003), no. 2: 463.
[41] Govaert, G. and Nadif, M. An EM Algorithm for the Block Mixture Model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (2005), no. 4: 643.


[42] Guha, S., Rastogi, R., and Shim, K. CURE: An Efficient Clustering Algorithm for Large Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'98), (1998): 73.
[43] Guha, S., Rastogi, R., and Shim, K. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In Proceedings of the Fifteenth International Conference on Data Engineering (ICDE'99), (1999): 512.
[44] Gustafson, D. E. and Kessel, W. C. Fuzzy clustering with a fuzzy covariance matrix. In Proceedings of the 28th IEEE Conference on Decision and Control, (1979).
[45] Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining. The MIT Press, (2001).
[46] Hartigan, J. A. Direct Clustering of a Data Matrix. Journal of the American Statistical Association, 67 (1972), no. 337: 123.
[47] Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning (Data Mining, Inference, and Prediction). Springer (2001).
[48] Hinneburg, A. and Keim, D. A. Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB'99), (1999): 506.
[49] Honkela, A., Seppa, J., and Alhoniemi, E. Agglomerative independent variable group analysis. Neurocomputing, 71 (2008): 1311.
[50] Huang, Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2 (1998): 283.
[51] Ienco, D. and Meo, R. Exploration and Reduction of the Feature Space by Hierarchical Clustering. In Proceedings of the SIAM International Conference on Data Mining (SDM 2008), (2008): 577.
[52] Jain, A. and Zongker, D. Feature Selection: Evaluation, Application, and Small Sample Performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19 (1997), no. 2: 153.
[53] Jerrum, M. and Sinclair, A. The Markov Chain Monte Carlo Method: An approach to approximate counting and integration. In Approximation Algorithms for NP-hard Problems (ed. Dorit Hochbaum), (1996).
[54] Jing, L. P., Ng, M. K., and Huang, Z. X. An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Transactions on Knowledge and Data Engineering, 19 (2007), no. 8: 1026.
[55] Ke, Y., Cheng, J., and Ng, W. Mining quantitative correlated patterns using an information-theoretic approach. In Proceedings of the Twelfth ACM SIGKDD


[56] Kim, Y., Street, W. N., and Menczer, F. Feature selection in unsupervised learning via evolutionary search. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2000), (2000): 365.
[57] Kirkpatrick, S., Gelatt, C., and Vecchi, M. Optimization by simulated annealing. Science, 220 (1983), no. 4598: 671.
[58] Kluger, Y., Basri, R., Chang, J. R., and Gerstein, M. Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions. Genome Research, 13 (2003), no. 4: 703.
[59] Krishnapuram, R. and Keller, J. M. A Possibilistic Approach to Clustering. IEEE Transactions on Fuzzy Systems, 1 (1993): 98.
[60] Kullback, S. and Leibler, R. A. On Information and Sufficiency. The Annals of Mathematical Statistics, 22 (1951), no. 1: 79.
[61] Law, M. H., Figueiredo, M. A. T., and Jain, A. K. Simultaneous Feature Selection and Clustering Using Mixture Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26 (2004), no. 9: 1154.
[62] Law, M. H., Jain, A. K., and Figueiredo, M. A. T. Feature Selection in Mixture-Based Clustering. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2002), 15 (2002): 625.
[63] Li, Y., Dong, M., and Hua, J. Localized feature selection for clustering. Pattern Recognition Letters, 29 (2008): 10.
[64] Lin, S.-W., Chen, S.-C., Wu, W.-J., and Chen, C.-H. Parameter determination and feature selection for back-propagation network by particle swarm optimization. Knowledge and Information Systems, 21 (2009): 249.
[65] Liu, B., Xia, Y., and Yu, P. S. Clustering Through Decision Tree Construction. In Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA (2000): 20.
[66] Liu, H. and Motoda, H. Computational Methods for Feature Selection. Chapman & Hall/CRC (2007).
[67] Liu, T., Liu, S., Chen, Z., and Ma, W. An Evaluation on Feature Selection for Text Clustering. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), (2003): 488.
[68] Madeira, S. C. and Oliveira, A. L. Biclustering algorithms for biological data analysis: a survey. IEEE Transactions on Computational Biology and Bioinformatics, 1 (2004), no. 1: 24.


[69] McLachlan, G. J., Bean, R. W., and Peel, D. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18 (2002), no. 3: 413.
[70] McLachlan, G. J. and Peel, D. Finite Mixture Models. John Wiley, NY, (2000).
[71] McLachlan, G. J., Peel, D., Basford, K. E., and Adams, P. The EMMIX software for the fitting of mixtures of normal and t-components. Journal of Statistical Software, 4 (1999).
[72] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21 (1953): 1087.
[73] Min, J., Yun, S., Kim, K., Kim, H., Hahn, J., and Lee, K. Nitrate contamination of alluvial groundwaters in the Nakdong River basin, Korea. Geosciences Journal, 6 (2002), no. 1: 35.
[74] Mitra, P., Murthy, C. A., and Pal, S. K. Unsupervised Feature Selection Using Feature Similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24 (2002): 301.
[75] Nadif, M. and Govaert, G. Block Clustering of Contingency Table and Mixture Model. In Proceedings of the 6th International Symposium on Data Analysis (IDA), (2005).
[76] Namkoong, Y., Heo, G., and Woo, Y. An Extension of Possibilistic Fuzzy C-Means with Regularization. In Proceedings of the 2010 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2010), (2010).
[77] Namkoong, Y., Joo, Y., and Dankel, D. Feature Subset-Wise Mixture Model-Based Clustering via Local Search Algorithm. In Advances in Artificial Intelligence, 23rd Canadian Conference on Artificial Intelligence (Canadian AI 2010), Ottawa, Canada, Lecture Notes in Computer Science 6085, Springer, (2010): 135.
[78] Namkoong, Y., Joo, Y., and Dankel, D. Partitioning Features for Model-Based Clustering using Reversible Jump MCMC Technique. In Proceedings of the 23rd International FLAIRS Conference (FLAIRS 2010), (2010).
[79] Neal, R. Markov chain sampling methods for Dirichlet process mixture models. Technical Report 9815, Department of Statistics, University of Toronto (1998).
[80] Needleman, S. B. and Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48 (1970), no. 3: 443.
[81] Ng, E., Fu, A., and Wong, R. Projective clustering by histograms. IEEE Transactions on Knowledge and Data Engineering, 17 (2005), no. 3: 369.


[82] Pal, N. R., Pal, K., Keller, J. M., and Bezdek, J. C. A Possibilistic Fuzzy c-Means Clustering Algorithm. IEEE Transactions on Fuzzy Systems, 13 (2005), no. 4: 517.
[83] Parsons, L., Haque, E., and Liu, H. Subspace Clustering for High Dimensional Data: A Review. SIGKDD Explorations, 6 (2004): 90.
[84] Pedrycz, W. Knowledge-Based Clustering: From Data to Information Granules. Wiley-Interscience, (2005).
[85] Peel, D. and McLachlan, G. J. Robust mixture modelling using the t-distribution. Statistics and Computing, 10 (2000): 339.
[86] Procopiuc, C. M., Jones, M., Agarwal, P. K., and Murali, T. M. A Monte Carlo algorithm for fast projective clustering. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, WI (2002): 418.
[87] Raftery, A. E. and Dean, N. Variable Selection for Model-Based Clustering. Technical Report 452, Department of Statistics, University of Washington, Seattle, WA (2004).
[88] Raftery, A. E. and Dean, N. Variable Selection for Model-Based Clustering. Journal of the American Statistical Association, 101 (2006), no. 473: 168.
[89] Rand, W. M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66 (1971): 846.
[90] Rawlings, J. O., Pantula, S. G., and Dickey, D. A. Applied Regression Analysis: A Research Tool. Springer, New York (1998).
[91] Robert, C. P. and Casella, G. Monte Carlo Statistical Methods, 2nd ed. Springer, New York, 2005.
[92] Roberts, S. J., Holmes, C., and Denison, D. Minimum-Entropy Data Partitioning Using Reversible Jump Markov Chain Monte Carlo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23 (2001), no. 8: 909.
[93] Rota, G.-C. The Number of Partitions of a Set. American Mathematical Monthly, 71 (1964), no. 5: 498.
[94] Roth, V. and Lange, T. Feature Selection in Clustering Problems. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2003), (2003).
[95] Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach, 2nd Edition. Prentice Hall, Saddle River, NJ (2002).
[96] Sahami, M. Using Machine Learning to Improve Information Access. Ph.D. thesis, Department of Computer Science, Stanford University, (1998).


[97] Schwarz, G. Estimating the dimension of a model. Annals of Statistics, 6 (1978), no. 2: 461.
[98] Shafiei, M. and Milios, E. Model-based Overlapping Co-clustering. In Proceedings of the Fourth Workshop on Text Mining, Sixth SIAM International Conference on Data Mining, (2006).
[99] Sheng, Q., Moreau, Y., and De Moor, B. Biclustering Microarray Data by Gibbs Sampling. Bioinformatics, 19 (2003): ii196-ii205.
[100] Sheskin, D. J. Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, 2003.
[101] Shoham, S. Robust clustering by deterministic agglomeration EM of mixtures of multivariate t-distributions. Pattern Recognition, 35 (2002): 1127.
[102] Stanley, R. P. Enumerative Combinatorics, Vol. I. Cambridge University Press, New York (1997).
[103] Talavera, L. Dependency-Based Feature Selection for Clustering Symbolic Data. Intelligent Data Analysis, 4 (2000): 19.
[104] Tanay, A., Sharan, R., and Shamir, R. Biclustering algorithms: A survey. In Handbook of Computational Molecular Biology (ed. Srinivas Aluru), (2006).
[105] Theodoridis, S. and Koutroumbas, K. Pattern Recognition, 3rd Edition. Academic Press (2006).
[106] Ueda, N. and Nakano, R. Deterministic annealing EM algorithm. Neural Networks, 11 (1998): 271.
[107] Vaithyanathan, S. and Dom, B. Generalized Model Selection for Unsupervised Learning in High Dimensions. In Proceedings of the Advances in Neural Information Processing Systems (NIPS'99), (1999): 970.
[108] Wang, H., Wang, W., Yang, J., and Yu, P. S. Clustering by pattern similarity in large data sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2002), (2002): 394.
[109] Wolf, T., Brors, B., Hofmann, T., and Georgii, E. Global Biclustering of Microarray Data. In Workshops Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), (2006): 125.
[110] Xiong, H., Shekhar, S., Tan, P.-N., and Kumar, V. Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2004): 334.


[111] Xu, R. and Wunsch II, D. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16 (2005), no. 3: 645.
[112] Xu, R. and Wunsch II, D. Clustering (IEEE Press Series on Computational Intelligence). Wiley-IEEE Press (2009).
[113] Yang, J., Wang, H., Wang, W., and Yu, P. S. Enhanced Biclustering on Expression Data. In Proceedings of the 3rd IEEE International Symposium on BioInformatics and BioEngineering (BIBE 2003), (2003): 321.
[114] Yang, J., Wang, W., Wang, H., and Yu, P. S. δ-Clusters: Capturing Subspace Correlation in a Large Data Set. In Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE 2002), San Jose, CA (2002): 517.
[115] Yip, K., Cheung, D., and Ng, M. HARP: A practical projected clustering algorithm. IEEE Transactions on Knowledge and Data Engineering, 16 (2004), no. 11: 1387.
[116] Zhang, X., Pan, F., and Wang, W. CARE: Finding Local Linear Correlations in High Dimensional Data. In Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE 2008), Cancun, Mexico (2008): 130.


Younghwan Namkoong (Younghwan Namgoong) was born in Seoul, Republic of Korea. He received a Bachelor of Science and a Master of Science in computer science from Korea University, Seoul, Republic of Korea, in 1999 and 2001, respectively. He received a Master of Science in computer science from the University of Southern California in May 2003. He received a Ph.D. in computer engineering from the University of Florida in the summer of 2010. During his academic career, he received the National Scholarship for studying abroad from the Ministry of Information and Communication, Republic of Korea. His research interests include data mining, pattern recognition, and statistical learning. Relevant journal and conference papers have been published and are under review.