
Clustering on Feature Subspaces in Data Mining

Permanent Link: http://ufdc.ufl.edu/UFE0041585/00001

Material Information

Title: Clustering on Feature Subspaces in Data Mining
Physical Description: 1 online resource (131 p.)
Language: english
Creator: Namkoong, Young
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2010

Subjects

Subjects / Keywords: clustering, data, feature
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: For a given data matrix, classification methods form a set of groups so that data objects within each group are mutually homogeneous but heterogeneous to those in the other groups. In particular, this grouping process for data without class labels is called unsupervised classification or clustering. When little or no prior knowledge is available about the structure of data, clustering is considered a promising data analysis technique and has been successfully applied to solve various problems in environmental pollution research, gene expression data analysis, and text mining. To discover interesting patterns in these applications, most clustering algorithms take into account all of the features represented in the data. However, in many cases, most of these features may be meaningless and disturb the analysis process. For this reason, selecting the proper subset of features needed to represent the clusters, called feature selection, has been an interesting problem in clustering to improve the performance of these algorithms. Many previous approaches have investigated efficient techniques to select relevant features for clustering, but several drawbacks have been identified. First, most previous feature selection approaches assume that features are divided into two subsets: those that are meaningful and those that are meaningless to describe clusters. However, the original feature vector can consist of a number of feature subsets, making these previous feature selection approaches unsuitable. Second, many datasets in various application domains contain multiple different clusters co-existing in different subspaces of the whole feature space. In this case, feature selection techniques can be inappropriate to extract features identifying all clusters in the dataset. In this dissertation, we investigate these problems. For the first issue, we present a novel approach to reveal multiple disjoint feature subsets for various clusters based on the mixture model. These multiple feature subsets can be represented by a feature partition. Our approach seeks the desired feature partition in which each feature subset can be expressed by a best-fit mixture model minimizing the model selection criterion. For the other problem, finding a special cluster structure (called a subspace cluster), we propose a new approach to create constrained feature subspaces. By utilizing the property that strong relationships can exist between features, meaningful feature subspaces can be created and a large number of non-informative feature subspaces can be pruned. Based on these informative feature subspaces, we show that the various subspace clusters can be represented by fitting several finite mixture models. Extensive experimental results illustrate our ability to extract useful information from data as to which features contribute to the discovery of clusters. In addition, we demonstrate that our approaches can be applied to various research areas.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Young Namkoong.
Thesis: Thesis (Ph.D.)--University of Florida, 2010.
Local: Adviser: Dankel, Douglas D.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2012-08-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2010
System ID: UFE0041585:00001



This item has the following downloads:


Full Text

ACKNOWLEDGMENTS

I would like to thank all people who provided me with their help during my Ph.D. years. First of all, I would like to thank my advisor, Dr. Douglas D. Dankel II. Without his patience, encouragement, and constant inspiration, this dissertation would not have been possible. I am grateful to my supervisory committee members, Dr. Jonathan C. L. Liu, Antonio A. Arroyo, Manuel Bermudez, and Lei Zhou, for their invaluable suggestions and comments. I met many nice people in Gainesville. Especially, I thank Youngseok Kim, Dooho Park, and Heh-Young Moon, who have shared joys and sorrows with me during many years in Gainesville. Finally, my deepest gratitude goes to my family: my parents (Hail Namgoong and Jung-in Hyun), my sister (Hee Namgoong), and my brother's family (Seok-Hwan Namgoong, Mina Song, and Seon Namgoong). They encouraged, supported, and prayed for me my whole life. My study would not have been possible without their love, patience, and concern. I hope that my graduation gives them happiness and pride.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 Feature-Subset-Wise Clustering Using Stochastic Search
   1.2 Robust Clustering Using Finite Mixture Model on Variable Subspaces
   1.3 Outline of This Dissertation

2 BACKGROUNDS OF CLUSTERING
   2.1 The Basic Concepts of Clustering
   2.2 Proximity Measures in Clustering
   2.3 Hierarchical Clustering
   2.4 Partitional Clustering
       2.4.1 K-means Clustering
       2.4.2 Fuzzy Clustering
   2.5 Model-Based Clustering
   2.6 Clustering Based on Stochastic Search
       2.6.1 Simulated Annealing
       2.6.2 Deterministic Annealing
   2.7 Other Clustering Algorithms
       2.7.1 Sequential Clustering
       2.7.2 Clustering Based on The Graph Theory and Structure
       2.7.3 Kernel-Based Clustering
   2.8 Biclustering
       2.8.1 The Basic Concept of Biclustering
       2.8.2 Related Work of Biclustering
   2.9 Subspace Clustering
       2.9.1 Subspace Clustering Using The Bottom-Up Techniques
       2.9.2 Subspace Clustering Based on Top-Down Approach
       2.9.3 Other Subspace Clustering Algorithms

3 FEATURE-SUBSET-WISE CLUSTERING USING STOCHASTIC SEARCH
   3.1 Introduction
   3.2 Related Work
   3.3 Model Formulation
   3.4 …
       3.4.1 Estimation of The Finite Mixture Model Given Vk and Gk
       3.4.2 Deterministic Annealing EM Algorithm
       3.4.3 Model Selection Based on Information Criteria
       3.4.4 Objective Function for Determining The Optimal Vk and Gk
   3.5 Stochastic Searching for The Desired Vk and Gk
       3.5.1 Biased Random Walk Algorithm
       3.5.2 Split-and-Merge Move Algorithm
       3.5.3 Simulated Annealing-Based Algorithm
   3.6 Experimental Results
       3.6.1 Synthetic Dataset
       3.6.2 Incremental Estimation
       3.6.3 The UCI Machine Learning Dataset
   3.7 Applications
       3.7.1 The River Dataset for Analysis of Water Quality
       3.7.2 The United Nations World Statistics Dataset
   3.8 Conclusion

4 ROBUST CLUSTERING USING FINITE MIXTURE MODEL ON VARIABLE SUBSPACES
   4.1 Introduction
   4.2 Related Work
   4.3 Robust Subspace Clustering by Fitting A Multivariate t-mixture Model
       4.3.1 Extracting The Local Strong Correlated Features
       4.3.2 Constructing Feature Subspaces with The Correlated Features
       4.3.3 The Mixture Model of Multivariate t-distribution
       4.3.4 EM for Mixtures of Multivariate t-distributions
       4.3.5 Determining K via Model Selection Criterion
   4.4 Experimental Results
       4.4.1 Experiments on Synthetic Datasets
       4.4.2 Experiments on Real Datasets
   4.5 Conclusion

5 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Parameter estimates of dataset-I
3-2 Parameter estimates of dataset-II
3-3 Parameter estimates of dataset-III
3-4 The summary of experimental results of UCI datasets
3-5 The groups of the Nakdong River observations
3-6 The result of model-based clustering for each subset of variables
3-7 Estimates of means for each feature subset in the UN dataset
4-1 The description of UCI datasets used in the experiments
4-2 The overall experimental results on Iris dataset
4-3 The local cluster purities on discovered various subspaces for Iris dataset
4-4 The overall experimental results on Ecoli dataset
4-5 The local cluster purities on discovered various subspaces for Ecoli dataset
4-6 The identified subspace clustering results on the various feature subspaces of the wine dataset
4-7 The identified subspace clustering results on the various feature subspaces of the diabetes dataset
4-8 The identified subspace clustering results on the various feature subspaces of the waveform dataset

LIST OF FIGURES

1-1 Plots of dataset containing four clusters
1-2 Plots of clusters existing on different subspaces
3-1 Arbitrary example of feature selection and feature-subset-wise clustering
3-2 Two synthetic datasets (200×7 and 150×11)
3-3 Trajectories of chains of BIC(V(·)) and the estimated K for dataset-II
3-4 Synthetic datasets
3-5 Performance evaluation of the incremental estimation
3-6 Performance evaluation against the size of dataset
3-7 Classification accuracies of the various feature selection approaches
3-8 Line plots of river watershed dataset
3-9 Discovered feature subsets of watershed dataset
3-10 The arcsin(√p) transformation of the Third-level students of women
3-11 Discovered clusters for each feature subset in UN dataset
4-1 Illustration of robust feature subspace clustering
4-2 Illustration of constructing feature subspaces with the correlated features
4-3 Dendrograms for the different feature subsets in dataset-I
4-4 Some examples of the data plots on different feature subspaces in dataset-I
4-5 Clustering error rates for different feature subspaces in dataset-I
4-6 Dendrograms for different feature subsets in dataset-II
4-7 Some examples of the data plots on different feature subspaces in dataset-II
4-8 Clustering error rates for different feature subspaces in dataset-II
4-9 Comparison of the subspace clustering algorithms with Rand Index
4-10 Dendrograms for different feature subsets in wine dataset
4-11 Dendrograms of different feature subsets for waveform dataset

ABSTRACT

For a given data matrix, classification methods form a set of groups so that data objects within each group are mutually homogeneous but heterogeneous to those in the other groups. In particular, this grouping process for data without class labels is called unsupervised classification or clustering. When little or no prior knowledge is available about the structure of data, clustering is considered a promising data analysis technique and has been successfully applied to solve various problems in environmental pollution research, gene expression data analysis, and text mining.

To discover interesting patterns in these applications, most clustering algorithms take into account all of the features represented in the data. However, in many cases, most of these features may be meaningless and disturb the analysis process. For this reason, selecting the proper subset of features needed to represent the clusters, called feature selection, has been an interesting problem in clustering to improve the performance of these algorithms.

Many previous approaches have investigated efficient techniques to select relevant features for clustering, but several drawbacks have been identified. First, most previous feature selection approaches assume that features are divided into two subsets: those that are meaningful and those that are meaningless to describe clusters. However, the original feature vector can consist of a number of feature subsets, making these previous feature selection approaches unsuitable. Second, many datasets in various application domains contain multiple different clusters co-existing in different subspaces of the whole feature space. In this case, feature selection techniques can be inappropriate to extract features identifying all clusters in the dataset.

In this dissertation, we investigate these problems. For the first issue, we present a novel approach to reveal multiple disjoint feature subsets for various clusters based on the mixture model. These multiple feature subsets can be represented by a feature partition. Our approach seeks the desired feature partition in which each feature subset can be expressed by a best-fit mixture model minimizing the model selection criterion. For the other problem, finding a special cluster structure (called a subspace cluster), we propose a new approach to create constrained feature subspaces. By utilizing the property that strong relationships can exist between features, meaningful feature subspaces can be created and a large number of non-informative feature subspaces can be pruned. Based on these informative feature subspaces, we show that the various subspace clusters can be represented by fitting several finite mixture models. Extensive experimental results illustrate our ability to extract useful information from data as to which features contribute to the discovery of clusters. In addition, we demonstrate that our approaches can be applied to various research areas.

CHAPTER 1
INTRODUCTION

In many fields of manufacturing, marketing, and scientific areas such as genome research and biometrics, most raw data contain little or no prior knowledge (e.g., class labels). A natural way to analyze these data is to classify them into a number of groups based on a similarity measure or a probability density model. The resulting groups are structured so that each group is heterogeneous to the other groups while the data objects in the same group are homogeneous [11]. This unsupervised classification technique, called clustering, is one of the most popular approaches for exploratory data analysis [45, 47].

During the last decades, numerous clustering algorithms have been proposed. They mainly attempt to identify groups of data objects based on their attributes or features. However, in recent years, these traditional clustering algorithms have been faced with a new data structure that can be difficult to process. This structure can be easily explained by using gene expression data, a popular data structure in bioinformatics. Gene expression data are a by-product of microarray experiments that measure the expression levels of genes under the experimental conditions in a single experiment [47]. Gene expression data consist of a matrix where each row represents a single gene and each column represents an experimental condition. Each element of this matrix indicates the expression ratio of a gene over a specific experimental condition.

For gene expression data, many traditional clustering algorithms attempt to discover gene clusters based on all the experimental conditions in the data matrix, or vice versa. However, further investigation of gene expression data identified an interesting characteristic. Specifically, a subset of genes has shown nearly coincidental pathways on a subset of experimental conditions [47]. This indicates that these genes are associated with a particular subset of experimental conditions. Unfortunately, this characteristic prevents most previous conventional clustering algorithms from discovering such clusters [47].

This also implies that only a few features contribute to represent the interesting cluster structure and the other features may be meaningless or even disturb a better understanding of the cluster analysis [62]. For this reason, selecting the proper subset of features to represent interesting clusters is preferred. This process is called feature selection or variable selection in clustering.

In this dissertation, we propose novel approaches to deal with these atypical cluster structures using various techniques such as statistical modeling and combinatorial optimization. The first type of cluster structure can be revealed by selecting multiple non-overlapped feature (variable) subsets for model-based clustering. Another cluster structure is a kind of subspace cluster that is embedded in different feature subspaces. In our approach, we show that these various subspace clusters can be represented by fitting several finite mixture models.

1.1 Feature-Subset-Wise Clustering Using Stochastic Search

… [62]. Regardless, little research work on feature selection has been done in clustering because the absence of class labels makes this task more difficult [62, 63, 88]. Feature selection in clustering can be divided into two categories: model-based clustering and non-model-based clustering [88].

In model-based clustering, it is assumed that a dataset is composed of several subgroups. To represent these subgroups with an integrated group, one natural way is to use the finite mixture model so that each subgroup follows a particular probability distribution [112]. To be specific, the general form of the mixture model f(x; \Theta) for a fixed number of clusters K is defined as

f(x; \Theta) = \sum_{k=1}^{K} \pi_k f_k(x; \theta_k).

For feature selection in model-based clustering, Talavera (2000) selected a subset of features by evaluating the dependencies among features [103]. Mitra et al. (2002) utilized the maximum information compression index for measuring feature similarity to remove redundant features [74]. Liu et al. proposed a feature selection method tailored for text clustering, using efficient feature selection methods such as Information Gain and Entropy-based Ranking [67]. In Law et al. (2003), feature selection was performed in mixture-based clustering by estimating feature saliency via the EM algorithm [62]. This work was extended to an algorithm that performs both feature selection and clustering simultaneously [61]. Dy and Brodley (2004) considered the feature selection criteria for clustering through the EM algorithm [34]. Most previous approaches try to find relevant features based on the clustering result, so they are limited in determining the number of clusters or choosing the clustering model, called model selection. Raftery and Dean (2006) investigated this problem and attempted to select variables by comparing two models; one contains the informative variables for clustering while the variables in the other model are meaningless [88]. By using the Bayesian Information Criterion (BIC) [97] for the model comparison and the greedy search algorithm for finding the suboptimal model, this method deals with variable selection for clustering and the model selection problem simultaneously [88]. However, the previous methods commonly produce only a single subset of variables to identify clusters based on all the features in the entire samples. This implies that they are limited in the identification of the subset of features for each cluster separately [63]. Recently, Li et al. (2008) explored finding these localized feature subsets by employing a sequential backward search [63]. However, this method still finds only a single subset of features for each cluster and does not handle the model selection problem.
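To make the BIC-based model comparison concrete, the short sketch below fits Gaussian mixtures with different numbers of components and keeps the one with the lowest BIC. It is only an illustration of the criterion, not the method of [88] or of this dissertation, and it assumes NumPy and scikit-learn (GaussianMixture) are available.

# Illustrative only: pick the number of mixture components by minimizing BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian clusters in two dimensions.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

best_k, best_bic = None, np.inf
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          n_init=5, random_state=0).fit(X)
    bic = gmm.bic(X)          # lower BIC = better fit/complexity trade-off
    if bic < best_bic:
        best_k, best_bic = k, bic
print(f"selected K = {best_k}, BIC = {best_bic:.1f}")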

In this approach, it is assumed that a data matrix consists of multiple variable subsets where each can be represented by a set of clusters. For each subset of variables, model-based clustering of the observations is performed. Under the fixed number of clusters for observation-wise clustering, the best clustering result can be obtained by maximum likelihood estimation via the EM algorithm [30]. Because the EM algorithm is susceptible to local optima, we employed the deterministic annealing EM (DAEM) algorithm [106]. In model-based clustering under the subset of variables, determining an estimation of the number of clusters and the appropriate parametric models should be considered. For these issues, we exploited the BIC.

Selecting multiple subsets of the variables is equivalent to finding a partition of variables whose variable subsets fit the best finite mixture model. Finding this partition is very challenging because the number of all possible partitions, known as the Bell number B(m) [102], grows hyper-exponentially as m increases (e.g., B(13) ≈ 2.7 × 10^7 and B(21) ≈ 4.75 × 10^14) [15]. Since each variable subset accompanies the model-based clustering procedure, considering all possible partitions is not feasible in practice. Due to this reason, stochastic search methods can be a reasonable approach, so several successful applications of clustering using this technique have been utilized in our method [15]. In contrast to the previous research, our approach performs both the multiple variable subsets selection and the model-based clustering simultaneously. The output provides useful information about what each variable subset contributes to discovering the meaningful clustering results. Also, our approach shows robustness to the initial partition of variables because of the use of stochastic search methods.
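The growth of this search space can be checked directly. The following short sketch, added here purely as an illustration (plain Python, no dependencies), computes Bell numbers with the Bell triangle and reproduces the magnitudes quoted above.

# Bell numbers B(m): the number of partitions of an m-element set,
# computed with the Bell triangle. Illustrates the search-space growth only.
def bell_numbers(m_max):
    bells = [1]                      # B(0) = 1
    row = [1]
    for _ in range(m_max):
        new_row = [row[-1]]          # next row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        bells.append(new_row[0])
        row = new_row
    return bells

B = bell_numbers(21)
print(f"B(13) = {B[13]:,}")           # 27,644,437 (about 2.7e7)
print(f"B(21) = {float(B[21]):.2e}")  # about 4.75e14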

1.2 Robust Clustering Using Finite Mixture Model on Variable Subspaces

[Figure 1-1. Plots of dataset containing four clusters.]

[Figure 1-2. Plots of clusters existing on different subspaces. A: dim-1 and dim-2; B: dim-2 and dim-3; C: dim-1 and dim-3.]

The dataset shown in Figure 1-1 was generated by using the instructions in [83]. It consists of four clusters where each cluster contains 200 3-dimensional data points. The first two clusters were created so that they can be revealed on two dimensions (i.e., dim-1 and dim-2) out of the three dimensions. The other two clusters were generated in a similar manner. However, a different input parameter was used so that they exist on a different subspace (i.e., dim-2 and dim-3).

If this dataset is projected on the subspace with dim-1 and dim-2, as shown in Figure 1-2A, cluster #1 and cluster #2 can be clearly identified. However, the other two clusters (cluster #3 and #4) are regarded as a single cluster. Likewise, Figure 1-2B shows that cluster #3 and cluster #4 can be discovered on dim-1 and dim-3. Furthermore, when the dataset was projected on dim-2 and dim-3, the clustering results may be quite different from the expected ones, as shown in Figure 1-2C.

When we use the algorithms for feature selection in clustering to discover these clusters, most of them are not able to determine relevant features for all those clusters. Also, the feature-subset-wise clustering approach is not suitable for finding the proper feature partition to identify all those clusters. In this example, the feature-subset-wise clustering algorithm suggests a single feature subset which contains all features. However, as can be seen, each cluster corresponds to different feature subspaces, so the problem of identifying them needs to be addressed. For this problem, a new clustering algorithm was developed, called subspace clustering. The main objective of subspace clustering is to find different clusters existing on the different feature subspaces. For this purpose, many subspace clustering algorithms attempted to find different subspace clusters showing a high density on the specific feature subspaces. By performing these approaches, some meaningful subspace clusters can be discovered. However, …

For these problems, we propose a novel method utilizing the correlation coefficients to determine the meaningful feature subspaces. In our approach, we noted that features of datasets in many application areas are usually related with specific domain knowledge for efficient representation of data objects. Based on this characteristic, considering strong correlations among features can lead to the discovery of the informative clusters lying in the feature subspaces. In addition, by selecting only these feature subsets, the search space for subspace clustering can be drastically reduced.

For finding these particular feature subspaces wherein features are strongly correlated with each other, we utilized the localized mean correlation threshold. To be specific, when a feature is fixed, some features showing a strong correlation with this feature can be regarded as forming a group as a feature subspace. In this way, several informative feature subspaces can be constructed. Since the possible feature subspaces grow exponentially as the number of features increases, the feature subspace construction method should satisfy the monotonicity property. For this part, the hierarchical clustering algorithm was utilized to build the possible feature subspaces.

When many clustering algorithms are applied to real-world datasets, they must consider the fact that many atypical data samples, called noisy data or outliers, can exist in these real datasets. Since the performance of clustering algorithms can be affected by these outliers or noisy data, the robustness property can be an important issue. To alleviate problems from noisy data and outliers, we used a mixture model with a heavy-tailed distribution, called a multivariate Student's t-distribution, to represent clusters. In our approach, the number of clusters for the best-fit mixture model is decided by evaluating a parsimonious model selection criterion.
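As a rough illustration of the correlation-based grouping idea (the actual procedure, including the localized mean correlation threshold, is described in Chapter 4), the sketch below groups features whose absolute Pearson correlation with a fixed seed feature exceeds a threshold. The 0.7 threshold and the greedy seed loop are assumptions made only for this example, and NumPy is an assumed dependency.

# Illustrative sketch: form candidate feature subspaces from features that are
# strongly correlated with a fixed seed feature. The threshold and the greedy
# loop are assumptions for this example, not the dissertation's algorithm.
import numpy as np

def correlated_feature_groups(X, threshold=0.7):
    """X: (n_samples, n_features). Returns a list of feature-index groups."""
    corr = np.abs(np.corrcoef(X, rowvar=False))    # |Pearson correlation| matrix
    unassigned = set(range(corr.shape[1]))
    groups = []
    while unassigned:
        seed = min(unassigned)                     # fix one feature
        group = {j for j in unassigned if corr[seed, j] >= threshold} | {seed}
        groups.append(sorted(group))
        unassigned -= group
    return groups

rng = np.random.default_rng(1)
base = rng.normal(size=(500, 1))
# Features 0-2 are strongly correlated; features 3-4 are independent noise.
X = np.hstack([base + 0.1 * rng.normal(size=(500, 3)), rng.normal(size=(500, 2))])
print(correlated_feature_groups(X))                # e.g., [[0, 1, 2], [3], [4]]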

1.3 Outline of This Dissertation

Chapter 2 reviews the background of clustering. The summary of various proximity measurements (or distance metrics), discussed in Section 2.2, is an important part of cluster analysis because the clustering results or structures can depend on the use of the specific proximity measurement. The remaining sections cover a variety of clustering algorithms, including hierarchical clustering, partitional clustering, model-based clustering, stochastic search technique-based clustering, etc. In addition, we discuss several advanced clustering techniques such as biclustering and subspace clustering.

In Chapter 3, we propose a new concept to overcome the shortcomings of the previous feature selection algorithms. Before we proceed to discuss the details of our approach, we briefly discuss the need for and contribution of our approach. After discussing the related work that covers the in-depth background of the specific topics and research trends related to the algorithms for feature selection in clustering, our new clustering model formulation is introduced in Section 3.3. The new cluster structure is represented by using the proposed model that maximizes the relevant parameters through the stochastic EM algorithm and a model selection scheme applied to the sub-data matrix relevant to the subsets of variables. In Section 3.5, we discuss the various stochastic search algorithms used in our approach.

Chapter 4 is organized as follows. In Section 4.2, subspace clustering algorithms are summarized. Section 4.3 describes a method to select strongly correlated feature groups, a mixture model of multivariate Student t-distributions and its parameter estimation process maximizing the likelihood via the EM algorithm, and a model selection criterion to determine the best-fit mixture model. Experimental results and discussion are given in Section 4.4. Section 4.5 presents conclusions.

Finally, in Chapter 5, we provide a conclusion and discuss future work.

CHAPTER 2
BACKGROUNDS OF CLUSTERING

In this chapter, we introduce background for this research. First, we briefly review the basic concepts of clustering. Then, we discuss the summary of various proximity measurements (or distance metrics). The remaining sections cover a variety of clustering algorithms, including hierarchical clustering, partitional clustering, model-based clustering, stochastic search technique-based clustering, etc. In addition, we discuss several advanced clustering techniques such as feature selection in clustering, biclustering, and subspace clustering.

2.1 The Basic Concepts of Clustering

… [37] in genetics, there are approximately 85,759,586,764 bases in 82,853,685 sequence records in the GenBank database, and Wal-Mart deals with approximately 800 million transactions per day [25]. From the tremendous amount of raw data, it is difficult to obtain the information appropriate to particular subjective applications. However, it is fortunate that the raw data consist of a set of features representing the particular objects, events, situations, or phenomena, etc. Based on this fact, a dataset can be expressed as a set of groups where each group has particular properties that distinguish it from the other groups. This set of properties is called a pattern. People have attempted to use various approaches to find desirable patterns from the data, resulting in the development of many techniques. These approaches can be categorized into two areas: classification and clustering. When one creates patterns representing groups of data, labeled data can be useful to evaluate the correctness of the patterns being created. In this manner, a procedure finding patterns with labeled data is called classification. A classification technique is primarily composed of two steps: a training step and a testing step. The main goal of the …

… [112]. This fact implies that a best cluster result does not exist because of the lack of some kind of standard criterion for clustering. Therefore, as a preprocessing step, a user should identify a specific clustering criterion and select (or develop) an appropriate proximity measure.

A general clustering procedure consists of the following 5 steps: feature selection, proximity measure, clustering criterion, clustering algorithm, and validation of cluster results [105]. Usually, the data contain numerous features and many of them may be redundant. These redundant features may degrade the performance of clustering. Therefore, it is necessary that the proper subset of features be selected to limit the loss of information from the raw data as much as possible. These selected features should be distinguishable, easily interpretable, and robust to noise [112]. Under the selected features, the cluster membership of each data object is determined by the degree of similarity to a particular cluster. This is done through a proximity measure, a quantitative measure of homogeneity (or heterogeneity) between two feature vectors. We discuss details of proximity measures in Section 2.2. After determining a proximity measure, a specific clustering criterion function should be constituted. A clustering criterion is usually defined by the type of rules being sensible to both the characteristics and subjectivity of the given dataset, or an objective cost function used for solving an optimization problem. Once the clustering criterion is determined, it is …

… [11, 105, 112]. We review various clustering algorithms in later sections. Finally, for the obtained clusters, the issues of how the clusters are constructed and their validity should be assessed. This step is called the validation of the clustering results. Occasionally, this step includes the interpretation or analysis of the obtained clusters.

As a fundamental data analysis technique, clustering can be applied to various areas such as water pollution research in environmental science, market-basket analysis, text mining in data mining, and microarray data analysis in bioinformatics [45]. For more details, we cite Xu et al.'s summary of the applications of clustering [111, 112].

2.2 Proximity Measures in Clustering

… [105, 112]. Assume that we have a dataset x and R is the set of real numbers. For two feature vectors x_1, x_2 \in x, a distance function (or a dissimilarity measure) between x_1 and x_2, D(x_1, x_2), satisfies the following conditions: (i) there exists a minimum value for D, denoted by D_0 \in R, such that …
Based on the definition of a proximity measure, various distance functions can be defined. Note that distance functions depend on the various types of features, such as continuous, discrete, or dichotomous [105, 112]. Theodoridis et al. introduced the most common distance measure for continuous features as follows [105]:

d(x_1, x_2) = \left( \sum_{i=1}^{n} w_i |x_{1i} - x_{2i}|^p \right)^{1/p},

where x_{1i} and x_{2i} are the ith dimensional coordinates of x_1 and x_2, and w_i is the ith weight coefficient. From this equation, a variety of distance measures can be defined. If w_i = 1 for i = 1, ..., n, then it can be transformed into the Minkowski distance (i.e., an unweighted distance measure),

d_p(x_1, x_2) = \left( \sum_{i=1}^{n} |x_{1i} - x_{2i}|^p \right)^{1/p}.

In addition, there are two famous special cases of the Minkowski distance. First, the Minkowski distance with p = 2 is called the Euclidean distance, defined as

d_2(x_1, x_2) = \left( \sum_{i=1}^{n} (x_{1i} - x_{2i})^2 \right)^{1/2}.

Second, for the case of p = 1, the Minkowski distance is known as the Manhattan distance. That is,

d_1(x_1, x_2) = \sum_{i=1}^{n} |x_{1i} - x_{2i}|.

Another distance measure can be derived from the weighted form by setting p = 2. This is a general form of the weighted L2-norm distance measure, defined as

d(x_1, x_2) = \left( (x_1 - x_2)^T B (x_1 - x_2) \right)^{1/2}.

A well-known special case is the (squared) Mahalanobis distance, obtained with B = C^{-1}:

d_M^2(x_1, x_2) = (x_1 - x_2)^T C^{-1} (x_1 - x_2),

where C is a covariance matrix. C is defined as C = E[(x - \mu)(x - \mu)^T], where \mu is the mean vector and E is the expected value of a random variable x. In particular, the squared Mahalanobis distance is closely related to model-based clustering [36]. In model-based clustering, the shape and/or volume are influenced by the covariance matrix C [36]. A cluster created on the basis of the squared Mahalanobis distance is typically shaped as a hyper-ellipsoid. We discuss these issues in Section 2.5.

When we recall the goal of clustering, the correlation coefficient can be utilized as a sensible distance measure. One of the most famous correlation coefficients is the Pearson correlation coefficient, expressed by

r(x_1, x_2) = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sqrt{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2} \sqrt{\sum_{i=1}^{n} (x_{2i} - \bar{x}_2)^2}},

where \bar{x}_1 is the average of x_1. The range of this coefficient is between -1 and +1, which indicates the degree of the linear relationship between data objects. Thereby, we can define the Pearson correlation coefficient-based distance measure, which has the range [0, 1], as follows:

D(x_1, x_2) = \frac{1 - r(x_1, x_2)}{2}.

Finally, another efficient way to measure the similarity between a pair of data objects with continuous random variables is to use the cosine similarity. For multidimensional data as a vector x, the inner product of two vectors indicates how close they are to each other. From the definition of the inner product of two vectors x_1 and x_2, the cosine similarity can be written as

s(x_1, x_2) = \frac{x_1^T x_2}{\|x_1\| \|x_2\|},

where \|x_1\| is the length of vector x_1.

For more details about proximity measures, see [84], [105], and [111].
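The following is a compact numeric sketch of the measures above; it is purely illustrative (NumPy is an assumed dependency) and not code from this dissertation.

# Minimal implementations of the proximity measures discussed above (illustrative).
import numpy as np

def minkowski(x1, x2, p=2):
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)     # p=2: Euclidean, p=1: Manhattan

def mahalanobis_sq(x1, x2, C):
    diff = x1 - x2
    return float(diff @ np.linalg.inv(C) @ diff)          # squared Mahalanobis distance

def pearson_distance(x1, x2):
    r = np.corrcoef(x1, x2)[0, 1]                         # correlation in [-1, 1]
    return (1.0 - r) / 2.0                                # mapped into [0, 1]

def cosine_similarity(x1, x2):
    return float(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))

x1, x2 = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 7.0])
print(minkowski(x1, x2, p=2), minkowski(x1, x2, p=1))
print(mahalanobis_sq(x1, x2, C=np.eye(3)))     # with C = I this is the squared Euclidean distance
print(pearson_distance(x1, x2), cosine_similarity(x1, x2))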

2.3 Hierarchical Clustering

… [105, 111]. For instance, in a single linkage method, α_i = 1/2, …

The divisive clustering operates in an opposite manner than the agglomerative methods. Due to performance issues, the agglomerative methods are commonly used in most hierarchical clustering algorithms. One notable advantage of hierarchical clustering algorithms is that the dendrogram provides an easy interpretation of the data. However, hierarchical clustering algorithms are not flexible because the dendrogram is constructed from the given specific dataset. If some data are inserted, deleted, or updated, they may cause a change in the structure of the dendrogram, resulting in a degradation of performance. Also, they are sensitive to outliers and noisy data.

New hierarchical clustering algorithms that attempt to improve these issues have been developed. Guha et al. (1998) presented CURE, an efficient agglomerative clustering algorithm [42]. They attacked problems with traditional clustering algorithms that are sensitive to outliers and restrictive cluster shapes. They employed multiple representative points for each cluster for generating non-spherical shapes and wide variances. CURE is favorable to large databases by combining random sampling and partitioning. Guha et al. (1999) extended their work to non-metric similarity measures and developed a robust clustering algorithm, called ROCK, using links to measure similarity of categorical attributes [43]. For more details of hierarchical clustering see [11], [47], [105], and [111].
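For a concrete sense of the agglomerative (bottom-up) procedure, the sketch below runs single-linkage clustering with SciPy and cuts the resulting dendrogram into two clusters. It is a generic illustration (SciPy is an assumed dependency), not an implementation of CURE or ROCK.

# Basic agglomerative hierarchical clustering with SciPy (illustrative only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),    # two toy clusters
               rng.normal(4, 0.5, size=(20, 2))])

Z = linkage(X, method="single")                     # "complete", "average", "ward" are other common linkages
labels = fcluster(Z, t=2, criterion="maxclust")     # cut the dendrogram into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram discussed above.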

2.4 Partitional Clustering

2.4.1 K-means Clustering

…, where j = 1, ..., N.

In most cases, the ideal partition can be achieved by minimizing the total sum of the distances between each data point and the centroid of its cluster. This implies that partitional clustering can be naturally transformed into an optimization problem.

For many of the shortcomings discussed above, there has been tremendous advanced work proposed. For example, Bradley and Fayyad suggested an improved K-means clustering, which is robust to the initial centroid, by generating M random sets … [16]. For more details and the advanced works of K-means clustering, see [11], [47], and [112].

2.4.2 Fuzzy Clustering

… [112]. In fuzzy clustering, each data object may become a member of all clusters with a partial degree of membership [12]. This property could be propitious in the case of handling data points on the boundaries of clusters. Fuzzy C-means clustering (FCM) is a well-known fuzzy clustering algorithm [12]. In this research, we refer to some notation and background in [12, 84, 112]. For the unlabeled data x, the main goal of the FCM is to find a partition matrix U containing c clusters (U = [u_{ik}]_{c \times n}) and a cluster prototype matrix V (d \times c), minimizing the following objective function:

J(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m d^2(x_k, v_i).

In this cost function, u_{ik} (0 \le u_{ik} \le 1) stands for the membership coefficient of the kth data point in the ith cluster. u_{ik} satisfies the following two conditions: \sum_{i=1}^{c} u_{ik} = 1 for all k, and 0 < \sum_{k=1}^{n} u_{ik} < n. The fuzzifier m (m > 1) controls the degree of fuzziness; usually, the value of m can be set to 2. When m approaches 1, the membership becomes a boolean shape. d^2(x_k, v_i) is an L2-norm distance. The Gustafson-Kessel algorithm employed the Mahalanobis distance as a distance function [44]. U and V can be obtained by computing the gradients of J(U, V) with respect to V and U, respectively. For details of calculating U and V, see [84]. By calculating gradients, U and V can be obtained as

u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{d(x_k, v_i)}{d(x_k, v_j)} \right)^{2/(m-1)} \right]^{-1}, \qquad v_i = \frac{\sum_{k=1}^{n} u_{ik}^m x_k}{\sum_{k=1}^{n} u_{ik}^m}.

Based on these update equations, FCM performs the following iterative steps to find the cluster prototype minimizing J(U, V). First, set the initial cluster prototype V(0), T as the maximum number of iterations, ε for the stopping rule, and t = 0 for the first iteration. Then, until the condition ‖V(t) − V(t+1)‖ ≤ ε or t = T is satisfied, repeat the following steps:

(i) Update t ← t + 1.
(ii) Calculate U(t) with the membership update equation and V(t−1).
(iii) Compute V(t) through the prototype update equation and U(t−1).
(iv) If the stopping rules have been satisfied, stop and return (U, V) set by (U(t), V(t)).

The main drawbacks of the FCM are a sensitivity to the initial partition and to noisy data points. To overcome these problems, Krishnapuram et al. (1993) applied the framework of possibility theory to the fuzzy clustering problem and presented a possibilistic C-means (PCM) algorithm [59]. PCM is discriminated from the FCM due to the cast of typicality, so that the clusters could be interpreted as a possibilistic partition. PCM alleviated sensitivity to the noisy data but it was still easily affected by the initial prototype [112]. Pal et al. (2005) proposed a hybrid approach, called possibilistic fuzzy c-means clustering (PFCM), combining the FCM and the PCM algorithms [82]. For more details on fuzzy clustering see [112].
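The sketch below is one way these FCM steps can be written in NumPy; it is only an illustration (m = 2, a random initial membership matrix, and a simple convergence check are assumptions), not a reference implementation.

# Minimal fuzzy C-means following the update equations above (illustrative).
import numpy as np

def fcm(X, c=2, m=2.0, max_iter=100, eps=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                                    # memberships sum to 1 for each point
    for _ in range(max_iter):
        V = (U**m @ X) / (U**m).sum(axis=1, keepdims=True)           # prototype update
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        w = d ** (-2.0 / (m - 1.0))
        U_new = w / w.sum(axis=0, keepdims=True)                      # membership update
        if np.linalg.norm(U_new - U) < eps:
            U = U_new
            break
        U = U_new
    return U, V

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
U, V = fcm(X, c=2)
print(np.round(V, 2))          # two prototypes, one near (0, 0) and one near (3, 3)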

2.5 Model-Based Clustering

… [36, 45, 88]. The overall data can be represented by a finite mixture model. The general form of a finite mixture model can be written as

f(x; \Theta) = \sum_{k=1}^{K} \pi_k f_k(x; \theta_k),

where K is the number of clusters, f_k is the probability density function of cluster k, and \pi_k is the mixture proportion of the kth cluster (0 \le \pi_k \le 1 and \sum_{k=1}^{K} \pi_k = 1). When we consider the mixture model for clustering, typically the likelihood of the data (given the mixture model) is used as the objective function to maximize. The likelihood for the mixture model can be written as

L(\Theta; x) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k),

which can be transformed into the log-likelihood function, denoted by

\ln L(\Theta; x) = \sum_{i=1}^{n} \ln \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k).

In general, we achieve the maximum likelihood estimators by taking a partial derivative of the log-likelihood function with respect to \theta_k and setting it to zero. However, this equation may be inappropriate for finding the maximum likelihood estimators directly. For this, the EM algorithm [14, 30, 47, 111] is a popular algorithm to find the maximum likelihood estimators of the latent variables or parameters. The EM algorithm for mixture models can be described as follows [14]. We have a complete dataset x = {x_1, ..., x_n}, with x_i = (x_i^{obs}, z_i), where x_i^{obs} is the observable variables and z_i = (z_{i1}, ..., z_{iK}) is the unobserved parameters. If x_i is a member of the \ell-th mixture component, z_{i\ell} = 1; otherwise, z_{i\ell} = 0. Then, the log-likelihood of the complete data can be rewritten as \ln L(\pi_k, \theta_k, z_{ik}; x) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \ln(\pi_k f_k(x_i; \theta_k)). Let \Theta^{(t)} and \Theta^{(t+1)} be the current and updated parameter estimators, respectively. At the E-step after the initialization step, the conditional expectation of the log-likelihood can be expressed by Q(\Theta^{(t)}, \Theta^{(t+1)}) = E[L(\pi_k^{(t)}, \theta_k, z_{ik}; x) | x^{obs}]. In the M-step, we choose parameters maximizing the Q(\Theta^{(t)}, \Theta^{(t+1)}) function. These E-steps and M-steps are repeated until the process converges to the specified threshold [14, 30, 36, 71, 111].
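To illustrate these E- and M-steps concretely, here is a compact EM loop for a one-dimensional Gaussian mixture (NumPy assumed). It is a plain sketch of the standard algorithm, not the DAEM variant used later in this dissertation.

# Illustrative EM for a 1-D Gaussian mixture: the E-step computes responsibilities,
# the M-step re-estimates mixing proportions, means, and variances.
import numpy as np

def em_gmm_1d(x, K=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, np.var(x))
    for _ in range(n_iter):
        # E-step: posterior probability (responsibility) of each component for each point.
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood.
        Nk = resp.sum(axis=0)
        pi = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x, K=2))     # proportions near (0.6, 0.4), means near (-2, 3)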

… [36]. In their study, a cluster is expressed by a mixture model and each mixture component follows a multivariate normal distribution. This implies that the geometric properties of the clusters depend on the covariances, because their eigenvalue decomposition can be utilized as parameters. If \Sigma_k is the covariance matrix of the k-th Gaussian mixture component, \Sigma_k = \lambda_k D_k A_k D_k^T, where \lambda_k is a proportionality constant, D_k is the orthogonal matrix of eigenvectors, and A_k is a diagonal matrix whose elements are eigenvalues [36]. From this framework, they introduced two multivariate normal mixture methods, hierarchical agglomerative clustering based on the classification likelihood and the EM algorithm for maximum likelihood estimation [36, 111]. Celeux and Govaert (1992) proposed a general classification EM algorithm [19]. They treated clustering as a partitioning problem in the mixture context, utilizing the classification maximum likelihood criteria. McLachlan et al. (1999) developed the EMMIX software, a fitting algorithm for normal or t-distribution mixture models by the EM algorithm [71].

2.6 Clustering Based on Stochastic Search

… [45]. Xu and Wunsch (2009) classified these clustering algorithms as search technique-based clustering, and Theodoridis and Koutroumbas (2006) assigned them to the category of cost optimization-based clustering [105, 112].

Among the … [105] and [112], we briefly review simulated annealing and deterministic annealing techniques.

2.6.1 Simulated Annealing

… [57], is a global search technique analogous to the annealing process in physics [95, 112]. In SA, the temperature control parameter T plays an important role that indicates randomness. Starting with a sufficiently large value for T and the initial cluster result, SA-based clustering seeks a candidate clustering result optimizing the specific objective cost function E. Until T = 0 (meaning that the final solution becomes stable) the algorithm runs repeatedly with a gradually decreasing T. Let C_i and C_{i+1} be the current and the candidate cluster result, respectively. At each step, the candidate cluster is accepted with the probability P, expressed as

P = \min\{1, \exp(\Delta E / T)\},

where \Delta E = E(C_{i+1}) - E(C_i). This implies that the candidate cluster is always accepted if E(C_{i+1}) > E(C_i). The procedure is derived from the basic concept of the Metropolis-Hastings algorithm [72], which is discussed later. [95] introduced the basic concept of the SA algorithm, and details of SA-based clustering are discussed in [105] and [112].
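A minimal sketch of this accept/cool loop is given below (plain Python). The neighbor move, geometric cooling schedule, and toy objective are placeholders for illustration; they are not the clustering moves developed later in this dissertation.

# Illustrative simulated-annealing loop for an objective E to be maximized.
import math, random

def accept(delta_e, T):
    """Always accept improvements; accept worse candidates with probability exp(delta_e / T)."""
    return delta_e > 0 or random.random() < math.exp(delta_e / T)

def simulated_annealing(initial, neighbor, energy, T0=1.0, cooling=0.95, steps=1000):
    current, e_current = initial, energy(initial)
    T = T0
    for _ in range(steps):
        candidate = neighbor(current)
        e_candidate = energy(candidate)
        if accept(e_candidate - e_current, T):
            current, e_current = candidate, e_candidate
        T *= cooling                                    # gradually decrease the temperature
    return current, e_current

# Toy usage: maximize -x^2 (best solution near x = 0).
best, _ = simulated_annealing(10.0, lambda x: x + random.uniform(-1, 1), lambda x: -x * x)
print(round(best, 2))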

2.6.2 Deterministic Annealing

… [28] rather than just randomly in the search space [112]. The process of finding the expected clustering result is based on minimizing the free energy, which can be achieved via calculating the following equation: …

This implies that each data point will be assigned to only a single cluster with the highest probability as T goes to 0, in part of the cooling process. The updated cluster centers can be obtained by

\frac{\partial}{\partial C} d(x, C) = 0

[112].

2.7 Other Clustering Algorithms

2.7.1 Sequential Clustering

… [112]. For this reason, sequential clustering has been a challenging research area in clustering. In addition, as sequential data, such as DNA sequences or customer transaction data, have …

A typical form of the sequential clustering algorithm is introduced in [105]. In a naive approach, each sequential cluster is constructed by either assigning a data element into one of the existing clusters or creating its own singleton cluster. The basic constraint for data in sequential clustering is that all vectors appear to the algorithm one at a time, and there is no constraint on the number of clusters. This results in the final clustering result being sensitive to the sequence of the data. Several approaches address this problem and propose improved methods such as the allowance of a gray area through the use of two threshold parameters [105]. Another improvement identifies the problem of being inflexible to the set of clusters [105]. By allowing the merging of two clusters having some distance between them, clusters may become more reasonable.

Xu and Wunsch (2009) introduced sequential clustering dealing with data having specific sequential features [112]. Some different types of proximity measures such as insertion, deletion, and substitution are defined for sequence comparison. The degree of similarity is calculated by an edit distance. Usually, when we compute a sequence comparison problem, it can be transformed into a global (or local) alignment problem and the expected solution should be one that minimizes the cost of the edit distance. The Needleman-Wunsch algorithm is a famous sequence alignment algorithm based on dynamic programming [80, 112]. Although the Needleman-Wunsch algorithm computes the global sequence alignment, the local sequence alignment also should be treated as an important result [112]. The Smith-Waterman algorithm compares segments of all possible lengths instead of each sequence in its whole length and selects sequences maximizing the similarity measure.
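As a concrete illustration of the edit-distance idea (unit insertion, deletion, and substitution costs are assumed here; this is the plain Levenshtein recurrence rather than the Needleman-Wunsch or Smith-Waterman scoring schemes):

# Classic dynamic-programming edit distance with unit operation costs (illustrative).
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                        # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                        # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[m][n]

print(edit_distance("ACGTTG", "ACTTAG"))   # 2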

2.7.2 Clustering Based on The Graph Theory and Structure

… [112]. Depending on the proximity measures and the specific clustering criteria, many graph theory-based clustering methods have been proposed, and these are discussed in [112]. For more details, see [105] and [112].

2.7.3 Kernel-Based Clustering

… [112]. A detailed explanation of the kernel trick related to clustering can be found in [105] and [112]. After performing the clustering based on this kernel trick, the cluster results can be shaped as a hypersphere. A well-known kernel-based clustering algorithm is support vector clustering (SVC), which discovers the smallest hypersphere enclosing data mapped into the high-dimensional feature space [10]. The clustering process via SVC consists of two steps: construction of the cluster boundaries by using the support vector machine and assignment of a cluster label to the data points. More research related to kernel-based clustering is discussed in [112]. One notable advantage of kernel-based clustering methods such as SVC is that they can discover clustering results with arbitrary boundaries. On the other hand, the clustering results are dependent on the particular kernel function and the relevant parameters …

… [105]. Most kernel-based clustering techniques may degrade in performance for large datasets [112].

2.8 Biclustering

2.8.1 The Basic Concept of Biclustering

… [68, 104, 112].

2.8.2 Related Work of Biclustering

… [46]. After Cheng and Church's first introduction of this concept to gene expression data, biclustering has emerged as a promising scheme for microarray data analysis. Several reviews of existing biclustering algorithms have been performed [68, 104]. In particular, Madeira et al. [68] described and compared various biclustering algorithms and the various types of bicluster structures such as exclusive-columns biclusters, checkerboard-structure biclusters [58], non-overlapping biclusters [40, 75], overlapping biclusters [98, 113], etc. Besides biological data analysis, block clustering has been applied to the data mining research areas of document text mining or collaborative filtering under the name of co-clustering …

… [38, 98]. The methods used for block clustering algorithms may be categorized by mixture model [40, 75], Bayesian approach [36, 88], greedy heuristic approach [23], stochastic approach [8, 18], etc. In the following we review several block clustering algorithms related to our work.

Cheng and Church (2000) attempted to find biclusters with a high similarity score and a low residue, indicating that genes show a similar tendency on the subset of the conditions present in the block cluster [23]. Biclustering is an optimization problem to find a maximum bicluster with the lowest mean squared residue as the score of the objective function in the biclustering algorithm. Based on the proof that finding the largest bicluster is NP-hard, they presented greedy heuristic algorithms to find it. Since these algorithms are deterministic, they may not be efficient in finding multiple biclusters.

Yang et al. (2003) extended Cheng and Church's approach by coping with missing data and allowing each entry in the data matrix to be assigned to one or more biclusters [113]. Cho et al. (2004) proposed two different iterative biclustering algorithms to minimize the total sum of the squared residue [24]. In contrast to the previous approaches, biclusters can be expressed through statistical modeling. From the statistical viewpoint, it is assumed that data have been generated based on several different probability distributions [45]. This implies that each cluster can be represented by a particular probability density function, so each cluster set can be expressed by a mixture model that tries to provide a statistical inference.

Govaert and Nadif presented an EM-based block clustering algorithm that finds block clusters by simultaneously grouping both sides of the data into the specified number of clusters fitting a statistical model [40, 41]. They assumed that a dataset can be represented by a set of block mixture models and its maximized likelihood estimates could be obtained by the block EM algorithm. All of the block clustering algorithms discussed so far are based on the local optimization method. As a solution to …

Sheng et al. (2003) developed a biclustering algorithm using the Bayesian method [99]. By using multinomial distributions, they modeled the data under every variable in a bicluster. In a bicluster, the multinomial distributions for different variables are mutually independent. For parameter estimation of this probabilistic model, they used Gibbs sampling to avoid the local minima frequently addressed in the EM algorithm.

Bryan et al. (2006) proposed a biclustering algorithm using simulated annealing, called SAB, to find biclusters in a gene expression dataset [18]. By combining simulated annealing with Cheng and Church's greedy biclustering algorithm, SAB was able to retrieve biclusters larger than those of Cheng and Church's approach with the biological interpretation.

Wolf et al. (2006) considered a global biclustering algorithm based on low variance [109]. In their work, they assumed that a data matrix consists of k biclusters and that the remaining data elements do not belong to the k flexible biclusters. The goal of this algorithm is to find k multiple biclusters by minimizing the variances within clusters like k-means clustering. Due to the risk of local minima, they employed two methods: simulated annealing and linear programming.

Angiulli et al. (2008) proposed an algorithm to find k biclusters with a fixed overlapping degree constraint [8]. They introduced gain as a measurement criterion by combining the mean squared residue, the row variance, and the size of the biclusters. This algorithm retrieves a high quality bicluster through transformations that improve the gain function (i.e., smaller residue, or higher variance, or larger volume of the bicluster). The random move scheme was used to escape local minima. Since this algorithm searches one bicluster at a time, it must be repeated k times to obtain k biclusters.
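For reference, the mean squared residue score that several of the algorithms above optimize can be computed as in the following sketch (NumPy assumed; illustrative only):

# Mean squared residue H(I, J) of a bicluster with rows I and columns J, in the
# sense of Cheng and Church: small H means the submatrix varies coherently.
import numpy as np

def mean_squared_residue(A, rows, cols):
    sub = A[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    overall = sub.mean()
    residue = sub - row_means - col_means + overall
    return float((residue ** 2).mean())

A = np.arange(20, dtype=float).reshape(4, 5)      # a perfectly additive (coherent) matrix
print(mean_squared_residue(A, rows=[0, 1, 3], cols=[0, 2, 4]))   # 0.0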

2.9 Subspace Clustering

… For identifying the subspace clusters and determining the relevant subspaces, searching for all possible subspaces and evaluating the cluster validity is apparently intractable. For this reason, several sophisticated algorithms based on heuristic search methods were proposed. They can be classified into two categories: bottom-up search algorithms and top-down search methods. In the following, we further discuss several well-known subspace clustering algorithms for each category [83].

2.9.1 Subspace Clustering Using The Bottom-Up Techniques

… [105]. Relevant algorithms can be further divided into two sub-groups with respect to the strategies used to determine the thresholds for the dense regions that can be regarded as a subspace cluster. The first sub-group utilizes a multidimensional grid to evaluate the density of the range. CLIQUE and ENCLUS are well-known algorithms in this group [6, 22]. On the …

… [39, 83].

CLIQUE (Clustering In QUEst) is one of the earliest subspace clustering algorithms [6]. In CLIQUE, all dense regions in each dimensional space are unraveled through the following two steps: first, create dense units by partitioning the data objects into a number of equal-length intervals, then choose the units having higher density than the user-specified threshold. This procedure is repeated until the number of dimensions is equal to that of the original space. In CLIQUE, the subspaces that contain clusters consisting of dense units are automatically determined. In addition, CLIQUE does not require any assumptions related to a particular distribution of the dataset. However, the performance of CLIQUE is heavily affected by user-specified parameters such as the size of the dense units and the threshold for high-density regions on feature subspaces. Furthermore, the computational cost exponentially increases with respect to the dimensionality of the dataset. A large number of overlapped clusters can be identified because many dense regions in some subspaces can also be examined at the higher-dimensional subspaces [105].

ENCLUS (ENtropy-based CLUStering) follows the basic concept of CLIQUE for subspace clustering [22]. However, it is differentiated from CLIQUE by using an entropy-based method to select subspaces. To evaluate the clusters in the subspaces of the original space, ENCLUS utilized several criteria based on the concept of information theory. Specifically, a criterion used in ENCLUS considered multiple correlation of dimensions, called the total correlation. ENCLUS is flexible enough to find clusters forming an arbitrary shape. It is allowed to discover overlapped subspace clusters. However, similar to CLIQUE, ENCLUS depends on user-specified parameters such as the size of the grid [22, 105, 112].

Similar to CLIQUE, MAFIA (Merging of Adaptive Finite Intervals) also uses a bottom-up method [39]. However, for generating dense units, MAFIA utilized a modified …

… [105, 112].

2.9.2 Subspace Clustering Based on Top-Down Approach

… [3, 5].

PROCLUS is a projected clustering algorithm based on K-medoids, instead of K-means [3]. The term medoids means a set of data objects, each of which represents a cluster center that minimizes the average dissimilarity to all data objects in the cluster. To unravel clusters, the PROCLUS algorithm requires two critical user-specified parameters: the number of clusters K and the average number of dimensions where each cluster exists. Given these two parameters, PROCLUS attempts to discover clusters through three steps: initialization, iteration, and refinement. The computational cost of …

… [3, 105, 112].

The ORCLUS algorithm extended PROCLUS by examining the feature subspaces without the axis-parallel restriction [5]. ORCLUS requires a computational cost of O(K_init^3 + K_init N + K_init^2 d^3) with respect to time complexity, where N is the size of the data, d is the dimensionality of each data object, and K_init is the number of initial seeds of medoids. This implies that the number of initial seeds affects the scalability of the ORCLUS algorithm. Similar to PROCLUS, the ORCLUS algorithm is restricted to hyperspherical clusters [105].

pCluster is a top-down subspace clustering algorithm for finding patterns of data by using shifting and/or scaling procedures [108]. With initial clusters created on a high-dimensional space, the pCluster algorithm generates candidate clusters in lower-dimensional spaces through an algorithm based on a depth-first approach.

The δ-Clusters algorithm is a subspace clustering approach that captures coherent patterns of a subset of data objects under a subset of features [114]. As a metric for evaluating the coherence among the data objects, the Pearson correlation was utilized. As an initialization step, the δ-Clusters algorithm randomly generates clusters. Then, until there is no more improvement of the quality of clusters, it iteratively updates clusters by randomly changing both features and data objects. The clustering results may vary with the number of clusters. In addition, the size of clusters affects both the performance of this algorithm and the clustering results.

DOC (Density-based Optimal projective Clustering) is a hybrid of both bottom-up and top-down approaches [86]. The main idea of DOC is to find projected clusters that show a goodness of clustering for a specified subspace. At each step, the …

… [112].

OptiGrid is a subspace clustering algorithm that attempts to unravel subspace clusters by recursively partitioning data into an optimal grid structure [48]. Each cell in the grid structure consists of a hyperplane with a subset of features, showing a high density or having a high possibility of containing many subspace clusters.

CLTree is a variant of the subspace clustering algorithms utilizing a decision tree technique [65]. This algorithm seems to be similar to the OptiGrid algorithm in that its discovered clusters are constructed by partitioning data for each dimension into a number of groups. However, the identified clusters found by CLTree have a hyper-rectangular structure in the feature subspaces. In addition, in contrast to OptiGrid, CLTree finds desired clusters using a bottom-up method that assesses the high density of data objects for each dimension. The CLTree algorithm outperforms other bottom-up strategy-based subspace clustering algorithms with respect to scalability. However, it requires significant computational cost for building the decision tree structure.

Besides the above algorithms, many researchers have proposed various subspace clustering techniques, as follows: an algorithm performing projection with human interaction [2], clustering data objects on the projected subspaces using histograms [81], projected clustering using hierarchical clustering for selecting relevant dimensions [115], a cell-based subspace clustering approach [20], and a method that uncovers clusters consisting of features with different local weights [32]. For details of these algorithms, see [11], [83], and [112].

In clustering, the majority of the prior research work related to feature selection has been interested in dividing features into two subsets: a contributing one and a non-contributing one. However, data can often comprise several feature subsets where each feature subset constructs clusters differently. In this chapter, we present a novel model-based clustering approach using stochastic search to discover the multiple feature subsets that have different model-based clusters.

[62]. For this reason, it is preferred to choose a proper subset of features to represent the clustering model, which is called feature (variable) selection in clustering. However, little research on feature selection has been done in clustering because the absence of class labels makes this task more difficult [61, 63, 88].

In feature selection for model-based clustering, the prior literature has proposed two approaches. First, some feature selection algorithms divide features into two subsets: one with contributing features and the other with (almost) non-contributing features for clustering objects. An example of this feature selection approach is given in Figure 3-1A. In Law et al. (2002), feature selection was performed in mixture-based clustering by estimating feature saliency via the EM algorithm [62]. Some researchers have considered conducting feature selection and clustering simultaneously. Law et al. (2004) incorporated their previous approach [62] with the use of the minimum message length criterion for the Gaussian mixture model to select the best number of clusters [61]. This approach has been extended by Constantinopoulos et al. (2006) by employing

Figure 3-1. Arbitrary example of feature selection (A) and multiple feature-block-wise clustering (B).

a Bayesian framework to be more robust for sparse datasets [26]. Raftery and Dean (2006) investigated this problem and attempted to select variables by comparing two models: one contains the informative variables for clustering while the other model's variables are meaningless [88]. By using the Bayesian Information Criterion (BIC) for model comparison and a greedy search algorithm for finding the suboptimal model, this method deals with variable selection for clustering and the model selection problem simultaneously [88]. However, when we expect more than one contributing feature subset, each constructing different feature-subset-wise clusters, all the above methods are inappropriate. Figure 3-1B shows an example of multiple feature subsets where each contributes to identifying different clusters.

To find feature-subset-wise clusters, this study assumes that each feature subset is independent of the other feature subsets and follows the multivariate normal mixture

(i.e., $G_k > 1$), it is considered a contributing subset; otherwise, it is regarded as a non-contributing feature subset. The best number of clusters for each feature subset and the best feature partition are determined based on the BIC values. Searching for the optimal feature partition is very challenging because the number of possible partitions, known as the Bell number B(m), grows super-exponentially with the number of features m [15]. Even when a feature partition is given, searching for the best clusters of each feature subset involves significant computation. For this reason, we considered stochastic search methods.

This chapter is organized as follows: In Section 3.2, we summarize the previous work related to feature selection in clustering. In Section 3.3 and Section 3.4, we describe the structure of our model and the relevant procedures to estimate the finite mixture model. In Section 3.5, we examine several stochastic search methods for finding multiple feature subsets. In Section 3.6, we report the experimental results obtained so far and compare the algorithms provided in Section 3.5. In addition, to reduce the execution time, the incremental estimation technique that was applied to our approach is discussed. In Section 3.7, using various real datasets such as a river watershed dataset and a United Nations (UN) World Statistics dataset, we demonstrate that the proposed method can be successfully applied to social and economic research as well as environmental science research. Section 3.8 presents our conclusions.

[52, 63]. On the other hand, feature selection in unsupervised learning or clustering is relatively challenging because the absence of class labels makes this task more difficult [62, 63, 88]. In the previous chapter, we discussed that feature selection in clustering can be divided into two categories: model-based clustering and non-model-based clustering [88]. However, when we see the research work for feature selection in

[26, 61]. In the filter approaches, selecting features is determined by assessing the relevance of features in the dataset. In contrast, the wrapper approach regards the feature selection algorithm as a wrapper around the clustering algorithm to evaluate and choose a relevant subset of features [26].

Kim et al. (2000) performed feature selection (or dimensionality reduction) in K-means clustering through an evolutionary algorithm [56]. Talavera (2000) selected the subset of features by evaluating the dependencies among features, where irrelevant features show a low correlation to other relevant features [103]. Sahami (1998) used mutual information for measuring feature similarity to remove redundant features [96]. Another approach to the same problem was proposed by Mitra et al. (2002), which utilized the maximum information compression index [74]. Liu et al. proposed a feature selection method tailored for text clustering that used some efficient feature selection methods such as Information Gain and Entropy-based Ranking [67]. Roth and Lange (2003) selected features by automatically determining their relevance to the clustering result, based on a Gaussian mixture model optimized via the EM algorithm [94]. In Law et al. (2003), feature selection was performed in mixture-based clustering by estimating feature saliency via the EM algorithm [62]. This work was extended into an algorithm performing both feature selection and clustering simultaneously [61]. Dy and Brodley used the normalized likelihood and cluster separability to evaluate the subset of features [33, 35]. They also considered feature selection criteria for clustering through the EM algorithm [34]. While most previous approaches try to find relevant features based on the clustering result, they are restricted in determining the number of clusters or choosing the clustering model, called model selection. This problem has received relatively little attention. Vaithyanathan and Dom (1999) tried to determine both the feature set and the number of clusters by evaluating the marginal likelihood and cross-validated

[107]. In [26], a combined method integrating a mixture model formulation and a Bayesian method is used to deal with both the feature selection and the model selection problems at the same time. Raftery and Dean (2006) investigated this problem and attempted to select variables by comparing two models, where one contains the informative variables for clustering while the variables in the other model are meaningless [88]. By using the Bayesian Information Criterion (BIC) [97] for the model comparison and a greedy search algorithm for finding the suboptimal model, this method deals with the variable selection for clustering and the model selection problems simultaneously [88].

In our approach, it is assumed that feature subsets are independent of each other and that each feature subset has a different multivariate normal mixture model. Thus, the log-likelihood function for the kth feature subset is

$L_k(\Theta_k \mid U_{V_k}) = \sum_{i=1}^{n} \log \sum_{g=1}^{G_k} p_{kg}\, \phi(U_{V_k,i} \mid \theta_{kg}),$

where $\phi(U_{V_k,i} \mid \theta_{kg})$ is the multivariate normal probability density function of the gth cluster based on the kth feature subset, and $p_{kg}$ and $\theta_{kg}$ are the corresponding mixing probability

Note that maximizing the overall log-likelihood is equivalent to maximizing it for each feature subset. The details of this procedure are discussed in Section 3.4. We first describe how to estimate the parameters when the feature subsets $V_k$ and the number of clusters $G_k$ are given. Then we describe how the BIC is used to determine the best feature partition and the number of clusters for each feature subset.

Using the EM algorithm [30] for $\Theta_k$ in the equation above, the complete data log-likelihood function can be constructed as

$\log L_c(\Theta_k \mid U_{V_k}, Z_k) = \sum_{i=1}^{n} \sum_{g=1}^{G_k} z_{kgi} \left\{ \log p_{kg} + \log \phi(U_{V_k,i} \mid \theta_{kg}) \right\},$

where, for the kth feature subset, the missing variable $z_{kgi} = 1$ if the ith observation belongs to cluster $C_{kg}$ and $z_{kgi} = 0$ otherwise. The EM algorithm repeatedly updates $\Theta_k$ by alternating the E-step and the M-step until it converges [30]. In the tth E-step, the conditional expectation of the complete data log-likelihood is computed:

$Q_k(\Theta_k, \Theta_k^{(t)}) = E\!\left[ \log L_c(\Theta_k \mid U_{V_k}, Z_k) \,\middle|\, U_{V_k}, \Theta_k^{(t)} \right].$

In the tth M-step, the (t+1)th new parameter estimate is calculated by maximizing $Q_k(\Theta_k, \Theta_k^{(t)})$. The details of this algorithm are well described in [14, 70].

For a specific instance of the above process, a Gaussian mixture model has been widely used [14, 106]. The Gaussian probability density function (PDF) for $U_{V_k}$ is

$\phi(U_{V_k} \mid \mu_{kg}, \Sigma_{kg}) = (2\pi)^{-\delta_k/2}\, |\Sigma_{kg}|^{-1/2} \exp\!\left\{ -\tfrac{1}{2} (U_{V_k} - \mu_{kg})^T \Sigma_{kg}^{-1} (U_{V_k} - \mu_{kg}) \right\},$

where $\mu_{kg}$ and $\Sigma_{kg}$ denote the mean vector and the covariance matrix of the gth Gaussian mixture component, and $\delta_k$ denotes the dimensionality of $U_{V_k}$. For the mixture model with this Gaussian PDF, the Q function in the E-step at the (t+1)th iteration is computed using the posterior probabilities

$\gamma_g(U_{V_k,i}; \Theta_k^{(t)}) = \frac{p_{kg}^{(t)}\, \phi(U_{V_k,i} \mid \mu_{kg}^{(t)}, \Sigma_{kg}^{(t)})}{\sum_{g'=1}^{G_k} p_{kg'}^{(t)}\, \phi(U_{V_k,i} \mid \mu_{kg'}^{(t)}, \Sigma_{kg'}^{(t)})},$

where $i = 1, \ldots, n$ and $g = 1, \ldots, G_k$.

In the M-step at the (t+1)th iteration, the relevant parameter estimates are computed as follows, for $g = 1, \ldots, G_k$:

$\hat{p}_{kg}^{(t+1)} = \frac{1}{n}\sum_{i=1}^{n} \gamma_g(U_{V_k,i}; \Theta_k^{(t)}), \qquad \hat{\mu}_{kg}^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma_g(U_{V_k,i}; \Theta_k^{(t)})\, U_{V_k,i}}{\sum_{i=1}^{n} \gamma_g(U_{V_k,i}; \Theta_k^{(t)})},$

$\hat{\Sigma}_{kg}^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma_g(U_{V_k,i}; \Theta_k^{(t)})\, (U_{V_k,i} - \hat{\mu}_{kg}^{(t+1)})(U_{V_k,i} - \hat{\mu}_{kg}^{(t+1)})^T}{\sum_{i=1}^{n} \gamma_g(U_{V_k,i}; \Theta_k^{(t)})}.$

With the initial $\Theta_k^{(0)}$, the above E and M steps repeatedly alternate until the change $L(\Theta_k^{(t+1)} \mid U_{V_k}) - L(\Theta_k^{(t)} \mid U_{V_k})$ becomes negligibly small.

[45]. However, this requires a prohibitively high computational cost and still has the possibility of a poor local maximum [106]. An improved approach, called the deterministic annealing EM (DAEM) algorithm [106], is to use a modified log-likelihood including the thermodynamic free energy parameter $\beta$ ($0 < \beta < 1$), similar to the cooling parameter T in simulated annealing (SA) [57]. Specifically, the DAEM algorithm starts with a small enough initial $\beta$ which is close to 0. Then, until $\beta$ becomes 1, the DAEM algorithm performs the E and M steps while gradually increasing $\beta$ to obtain a better local (maybe global) maximum.

Based on the result of the DAEM algorithm [106], the free energy parameter can be applied to the posterior probability above, hence a new posterior probability for DAEM is denoted by

$\gamma_g^{\beta}(U_{V_k,i}; \Theta_k^{(t)}) = \frac{\left\{ p_{kg}^{(t)}\, \phi(U_{V_k,i} \mid \theta_{kg}^{(t)}) \right\}^{\beta}}{\sum_{g'=1}^{G_k} \left\{ p_{kg'}^{(t)}\, \phi(U_{V_k,i} \mid \theta_{kg'}^{(t)}) \right\}^{\beta}},$

where $i = 1, \ldots, n$ and $g = 1, \ldots, G_k$. This is directly applied to the M-step updates above. In particular, we slightly modified the range of $\beta$ so that the estimation becomes stable, by gradually decreasing $\beta$ from a positive integer used as its initial value.
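As a concrete illustration of the E and M steps above, the following minimal sketch (in Python, not the implementation used in this work) performs one tempered EM iteration for a Gaussian mixture on a single feature subset; the exponent `beta` plays the role of the DAEM free-energy parameter, and `beta = 1` recovers the plain EM update. The function name and arguments are illustrative assumptions.

    import numpy as np
    from scipy.stats import multivariate_normal

    def daem_step(U, weights, means, covs, beta=1.0):
        """One tempered E/M iteration for a Gaussian mixture on data U (n x d)."""
        n = U.shape[0]
        G = len(weights)
        # E-step: tempered posterior responsibilities gamma_{ig}
        dens = np.column_stack([
            weights[g] * multivariate_normal.pdf(U, means[g], covs[g])
            for g in range(G)
        ]) ** beta
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates of mixing probabilities, means, covariances
        Ng = resp.sum(axis=0)
        new_weights = Ng / n
        new_means = (resp.T @ U) / Ng[:, None]
        new_covs = []
        for g in range(G):
            diff = U - new_means[g]
            new_covs.append((resp[:, g, None] * diff).T @ diff / Ng[g])
        return new_weights, new_means, new_covs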

[60]. For data X, true parameters $\theta$, and estimates $\hat{\theta}$, the KL-divergence between the probability density function of the true model $f(X; \theta)$ and an estimated model $g(X; \hat{\theta})$ can be defined as

$KL\!\left( f(X;\theta) \,\|\, g(X;\hat{\theta}) \right) = \int f(X;\theta) \log \frac{f(X;\theta)}{g(X;\hat{\theta})}\, dX.$

The KL-divergence is equal to zero when the fitted model is equivalent to the true model. Otherwise, the KL-divergence is always positive and grows as the dissimilarity between the two models increases [92]. This property implies that minimizing the KL-divergence is equivalent to maximizing the following term derived from the equation above:

$\int f(X;\theta) \log g(X;\hat{\theta})\, dX.$

Based on this property, the Akaike Information Criterion (AIC) was proposed to estimate the KL-divergence between the true model and the fitted model [7]. In the model selection process, an estimated model is regarded as the best fitted model when the score of AIC is minimized. For the given $V_k$, $AIC(V_k)$ is denoted by

$AIC(V_k) = -2\, L_k(\hat{\Theta}_k \mid U_{V_k}) + 2\lambda_k,$

where $L_k(\hat{\Theta}_k \mid U_{V_k})$ is the maximum log-likelihood and $\lambda_k$ is the number of parameters in $\hat{\Theta}_k$. Using this criterion, the model selection process starts with $G_k = 1$ and repeats by increasing $G_k = G_k + 1$ until the stopping criterion that the score of $AIC(V_k)$ is no longer decreasing is satisfied. In particular, when the best fitted model for a $U_{V_k}$ contains only a single mixture component, the corresponding feature subset is regarded as non-contributing.
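The stopping rule described above can be sketched as a simple loop, shown below under the assumption of a helper `fit_mixture(U, G)` that returns the maximized log-likelihood and parameter count of a G-component fit; this is an illustration of the selection procedure, not the code used in this study.

    def select_num_components(U, fit_mixture, max_G=20):
        """Grow G until the AIC score stops decreasing; return the chosen G."""
        best_G, best_aic = None, float("inf")
        for G in range(1, max_G + 1):
            loglik, n_params = fit_mixture(U, G)
            aic = -2.0 * loglik + 2.0 * n_params
            if aic >= best_aic:
                break  # stopping criterion: AIC no longer decreasing
            best_G, best_aic = G, aic
        return best_G, best_aic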

Note that minimizing this criterion implies the discovery of a feature partition consisting of multiple feature subsets, each of which can be expressed by the best-fit mixture model for clustering. In such a way, a model selection criterion can be utilized as an objective function in our approach. The AIC criterion is a widely used model selection criterion. However, because the BIC criterion is more parsimonious than AIC, we used BIC as our objective function.

The BIC [97] is designed to select the best model based on both the degree of data fitness and the complexity of the model. In this research, the BIC is utilized as an objective function to determine the best partition of features as well as the best number of clusters $G_k$ for each feature subset. Because feature subsets are independent of each other, the BIC for a feature partition V is

$BIC(V) = \sum_{k=1}^{K} BIC(V_k),$

where $BIC(V_k)$ is denoted by

$BIC(V_k) = -2\, L_k(\hat{\Theta}_k \mid U_{V_k}) + \lambda_k \log n,$

where $\hat{\Theta}_k$ is the maximum log-likelihood estimate, $\lambda_k$ is the number of elements in $\hat{\Theta}_k$, and n is the number of observations. A lower BIC value indicates a better model.
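The bookkeeping behind this objective can be sketched as follows (a minimal illustration, assuming the per-subset maximized log-likelihoods and parameter counts are available from the fitted mixtures):

    import numpy as np

    def bic_subset(loglik, n_params, n_obs):
        # BIC(V_k) = -2 L_k + lambda_k log n; lower is better
        return -2.0 * loglik + n_params * np.log(n_obs)

    def bic_partition(subset_fits, n_obs):
        # subset_fits: iterable of (loglik, n_params) pairs, one per feature subset
        return sum(bic_subset(ll, k, n_obs) for ll, k in subset_fits)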

[93], with

$B(m) = \sum_{j=1}^{m} S(m, j),$

where $S(m, j)$ is the Stirling number of the second kind. Because the Bell number grows super-exponentially as the number of features increases [15], it is a challenging problem to search for the optimal feature partition V based on a model selection criterion. Furthermore, given the feature partition V, an observation-wise clustering using the EM algorithm incurs a non-negligible amount of computational cost. In this research, we considered Monte Carlo (MC) optimization [91] based on the Metropolis-Hastings (MH) algorithm as a stochastic search method to minimize the BIC(V).

When the current state S is given in the MH algorithm, the candidate state S' is generated with a proposal probability density p(S'|S). The transition from the current state S to the candidate state S' is determined with an acceptance probability $\alpha(S', S)$, as described below:

$\alpha(S', S) = \min\!\left\{ 1, \frac{\pi(S')\, p(S \mid S')}{\pi(S)\, p(S' \mid S)} \right\},$

where $\pi(S)$ is the target probability mass function of S.

If the candidate state S' is accepted with this probability, then the next state is set to S'. Otherwise, the current state remains as the next state. An advantage of this algorithm is that there is no need to know the normalizing constant of either $\pi(S)$ or $\pi(S')$ because they cancel each other.
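A generic acceptance test for this scheme looks as follows; here the `log_target` values would be $-BIC(V)/2$ under the target used in this chapter, and a symmetric proposal is assumed so that the Hastings correction cancels. This is an illustrative sketch rather than the exact routine used in the experiments.

    import math, random

    def mh_accept(log_target_current, log_target_candidate):
        """Metropolis-Hastings acceptance for a symmetric proposal."""
        delta = log_target_candidate - log_target_current
        return delta >= 0 or random.random() < math.exp(delta)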

The target distribution over feature partitions is taken to be a decreasing function of BIC(V),

$\pi(V) \propto \exp\!\left\{ -\tfrac{1}{2}\, BIC(V) \right\}, \qquad V \in \mathcal{V},$

where $\mathcal{V}$ is the set of all possible feature partitions. Note that, although the size of $\mathcal{V}$ can be very large, it is still finite. Because the partition space $\mathcal{V}$ is very large, it is infeasible to search for V by evaluating BIC(V) for all possible feature partitions. Thus, random partitions are generated from $\pi(V)$ and BIC(V) is evaluated for each generated partition. This generates more random partitions close to the mode of $\pi(V)$, where BIC(V) is low, and fewer random partitions away from the mode, where BIC(V) is high. Useful applications of the MC optimization method are demonstrated in [15, 53]. The performance of these methods may vary depending on how p(S'|S) is defined.

[15]. Let $V^{(\tau-1)}$ be the feature partition that is stochastically accepted at the ($\tau$-1)th iteration. At the $\tau$th iteration in the biased random walk (BR) algorithm, $V^{(\tau)}$ is set to $V^{(\tau-1)}$ and a candidate feature partition $V'^{(\tau)}$ is generated by updating (or switching) the feature subset membership of a randomly selected feature in $V^{(\tau)}$. Then, the BR algorithm accepts $V'^{(\tau)}$ with the following probability:

$\alpha(V'^{(\tau)}, V^{(\tau)}) = \min\!\left\{ 1, \exp\!\left[ \tfrac{1}{2}\left( BIC(V^{(\tau)}) - BIC(V'^{(\tau)}) \right) \right] \right\},$

because $p(V'^{(\tau)} \mid V^{(\tau)}) = p(V^{(\tau)} \mid V'^{(\tau)})$. The details of the biased random walk algorithm are described in [15]. Algorithm 1 shows a biased random walk method applied to our approach.

Although the biased random walk algorithm should theoretically explore the entire partition space, it tends to get stuck in a local mode.
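One biased-random-walk move can be sketched as below, assuming features are labelled 0..m-1, a partition is a list of disjoint sets, and the target is proportional to $\exp\{-BIC(V)/2\}$ as above; `bic_fn` is an assumed callback that fits the per-subset mixtures and returns BIC(V).

    import copy, math, random

    def br_move(partition, bic_fn):
        """Move one randomly chosen feature to another (possibly new) subset."""
        cand = copy.deepcopy(partition)
        m = sum(len(s) for s in cand)
        j = random.randrange(m)                        # feature to relocate
        src = next(i for i, s in enumerate(cand) if j in s)
        cand[src].remove(j)
        targets = [i for i in range(len(cand)) if i != src] + [len(cand)]
        dst = random.choice(targets)
        if dst == len(cand):
            cand.append(set())                         # open a new subset
        cand[dst].add(j)
        cand = [s for s in cand if s]                  # drop empty subsets
        delta = 0.5 * (bic_fn(partition) - bic_fn(cand))
        accept = delta >= 0 or random.random() < math.exp(delta)
        return cand if accept else partition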

[15]. This algorithm is the so-called split-and-merge (SM) algorithm. The SM algorithm works with two main procedures: the split move and the merge move. In the merge move, $V'^{(\tau)}$ is constructed by merging two randomly selected feature subsets in the current feature partition $V^{(\tau)}$. Similarly, if the split move is chosen, $V'^{(\tau)}$ is obtained by randomly splitting a feature subset of $V^{(\tau)}$ into two non-empty feature subsets. A $V'^{(\tau)}$ generated by a split move is accepted with probability $\min[1, \psi]$, where $\psi$ is the Hastings ratio built from the merge-move and split-move proposal probabilities; here $\rho$ is the size of the feature subset that will be divided, $K_0$ is the number of feature subsets whose size is at least 2, $p_m$ is the probability of selecting the merge-move procedure, and K is the number of feature subsets in $V^{(\tau)}$. The acceptance probability of a $V'^{(\tau)}$ generated by a merge-move procedure is $\min[1, (1/\psi)]$ [15]. Based on this successful application, we employed this method for searching for the most promising variable subsets, as shown in Algorithm 2.

Because the target distribution based on BIC(V) is invariant for both the BR and SM methods described above, a mixture of these two search methods is also invariant [15]. At each iteration, this hybrid approach, called the HR algorithm, can update the current feature partition using one of these two methods (BR or SM) selected with a certain probability.
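The proposal mechanics of the two moves can be sketched as follows (acceptance ratios omitted, since they depend on the proposal probabilities discussed above); the helper names are illustrative assumptions.

    import random

    def merge_move(partition):
        """Union two randomly chosen feature subsets (requires >= 2 subsets)."""
        i, j = random.sample(range(len(partition)), 2)
        merged = partition[i] | partition[j]
        rest = [s for k, s in enumerate(partition) if k not in (i, j)]
        return rest + [merged]

    def split_move(partition):
        """Split one subset of size >= 2 into two random non-empty parts."""
        candidates = [k for k, s in enumerate(partition) if len(s) >= 2]
        k = random.choice(candidates)
        items = list(partition[k])
        random.shuffle(items)
        cut = random.randrange(1, len(items))          # both parts non-empty
        rest = [s for idx, s in enumerate(partition) if idx != k]
        return rest + [set(items[:cut]), set(items[cut:])]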

Procedure 3. merge-move($V^{(\tau)}$, x)

The simulated annealing (SA) [57] technique utilizes an annealing process to search for the global optimum. In the SA algorithm, the temperature parameter T (T > 0) determines how easily a candidate state tends to be accepted; the higher the temperature, the higher the acceptance rate. To search for V, we combined the SA technique with the BR algorithm, which is called the SA-based BR search (SABR) algorithm [78]. The SABR algorithm starts with the initial feature partition $V^{(1)}$ and the initial temperature parameter $T^{(1)}$, which is set to a high value. Until $T \to 0$ (i.e., the final iteration step becomes stable), the SABR seeks V by gradually and repeatedly decreasing T. At the $\tau$th iteration step, the acceptance probability of a candidate feature partition $V'^{(\tau)}$ is

$\alpha(V'^{(\tau)}, V^{(\tau)}) = \min\!\left\{ 1, \exp\!\left[ \frac{BIC(V^{(\tau)}) - BIC(V'^{(\tau)})}{T^{(\tau)}} \right] \right\},$

where $T^{(\tau)} = \eta\, T^{(\tau-1)}$, $\tau \geq 2$, and $\eta$ is a cooling rate ($0 < \eta < 1$).
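The annealed acceptance rule can be sketched as below, treating BIC(V) as the energy; the values $T^{(1)} = 400.0$ and $\eta = 0.997$ used in the experiments later in this chapter would be typical settings. This is an illustrative sketch, not the exact implementation.

    import math, random

    def sa_accept(bic_current, bic_candidate, temperature):
        """Accept a worse candidate with probability exp(-(increase in BIC)/T)."""
        if bic_candidate <= bic_current:
            return True
        return random.random() < math.exp(-(bic_candidate - bic_current) / temperature)

    # Cooling schedule: T starts high and is multiplied by eta (0 < eta < 1)
    # at each step, e.g. T = 400.0 and eta = 0.997.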

Figure 3-2. Two synthetic datasets (200x7 and 150x11). B: result of dataset-I; C: line plots of dataset-II (150x11); D: result of dataset-II.

For dataset-I (Table 3-1), n = 200 observations were generated using seven features, where each belonged to one of four feature subsets: true feature partition V = {{1,2}, {3,5,6}, {4}, {7}}. In Figure 3-2A, each line represents a vector of features for a simulated observation.

Let $BIC(V^{(\tau)})$ and $\hat{K}$ be the BIC value and the number of feature subsets, respectively. With dataset-I, the four search methods converged on the feature partition V (used for data generation) and the corresponding $\hat{K}$, regardless of the initial partitions.

For dataset-II (Table 3-2 and Figure 3-2C), n = 150 observations were generated using 11 features. Figures 3-3A, 3-3C, and 3-3E show the convergence of the four search methods on the feature partition used for generating the dataset, $BIC(V^{(\tau)})$, and $\hat{K}$ when the algorithm started with the initial partition $V^{(0)}$ consisting of m singleton feature subsets. However, as shown in Figures 3-3B, 3-3D, and 3-3F, when the algorithm started with an initial partition V consisting of one large feature subset, the BR algorithm could not escape a local maximum until the last iteration. The other three methods (SM, HR, and SABR) converged on the desired feature partition, implying that they are less sensitive to initial feature partitions than the BR algorithm. Figures 3-2B and 3-2D show the clusters based on each feature subset for dataset-I and dataset-II, respectively.

Figure 3-3. Trajectories of chains of $BIC(V^{(\tau)})$ and $\hat{K}$ for dataset-II.

Table 3-1. Parameter estimates of dataset-I. Feature indices ($V_k$): {1,2}, {3,5,6}, {4}, {7}; estimated numbers of clusters: 3, 2, 1, 1; mixing probabilities include $\hat{p}_{12} = 0.32$, $\hat{p}_{13} = 0.30$, and $\hat{p}_{22} = 0.81$.

Table 3-2. Parameter estimates of dataset-II. Feature indices ($V_k$): {1,2,3,4,5}, {6,7,8,9}, {10,11}; estimated numbers of clusters: 3, 2, 1; mixing probabilities include $\hat{p}_{12} = 0.34$, $\hat{p}_{13} = 0.33$, and $\hat{p}_{22} = 0.6$.

Figure 3-4. Synthetic datasets (B: dataset-IV; C: dataset-V).

In summary, SM, HR, and SABR found the correct feature subsets and obtained parameter estimates near the parameter values used in data generation for both dataset-I and dataset-II.

[79]. By decreasing the number of unnecessarily repeated evaluations of $BIC(V^{(\tau)})$, the total computation time can be drastically reduced. In addition, the initialization-insensitive property of the DAEM algorithm supports the use of the incremental estimation.

In our approach, the incremental estimation was evaluated with various types of datasets [77]. Several datasets were generated on the basis of the following parameters: K, the $G_k$s, V, the $p_{kg}$s, and the component means and standard deviations, where $g \in \{1, \ldots, G_k\}$, $d \in \{1, \ldots, D\}$, and $k \in \{1, \ldots, K\}$. For V, each $V_k$ is composed of a different number of clusters $C_{V_k g}$ and corresponding $p_{kg}$. Each $X_n$ is drawn from a Gaussian distribution with the corresponding $\mu_{kg}$ and $\Sigma_{kg}$. For example, dataset-III contains three feature subsets: $V_1$, $V_2$, and $V_3$. Each $V_k$ has 3, 2, and 1

mixture components, respectively. Table 3-3 provides the values of the parameters for creating dataset-III. Fig. 3-4A illustrates the expected results as well as the overall structures of dataset-III. Dataset-IV and dataset-V (see Fig. 3-4) have different shapes and more complex structures than dataset-III. In particular, the structure of dataset-IV is sometimes called a checkerboard structure.

Table 3-3. Parameter estimates of dataset-III.
Feature subsets: [(v1, v2); (v3, v4); (v5, v6, v7)]
$\hat{G}_1 = 3$, $\hat{G}_2 = 2$, $\hat{G}_3 = 1$
Mixing probabilities: $\hat{p}_{11} = 0.30$, $\hat{p}_{12} = 0.40$, $\hat{p}_{13} = 0.30$; $\hat{p}_{21} = 0.50$, $\hat{p}_{22} = 0.50$; $\hat{p}_{31} = 1.00$
Means: $\hat{\mu}_{11} = [7.15, 7.00]$, $\hat{\mu}_{12} = [4.06, 4.15]$, $\hat{\mu}_{13} = [1.13, 0.99]$; $\hat{\mu}_{21} = [4.92, 4.93]$, $\hat{\mu}_{22} = [8.99, 8.97]$; $\hat{\mu}_{31} = [13.96, 13.96, 13.97]$

Our feature partition search algorithm was executed with the datasets for $1.0 \times 10^4$ iterations. To cover the overall cases for each dataset, we used 5 different initial partitions $V^{(0)}$. For instance, the $V^{(0)}$s used for dataset-III are: (1) $V^{(0)}$ = {(v1), (v2), (v3), (v4), (v5), (v6), (v7)}, (2) $V^{(0)}$ = {(v1, v2, v3, v4, v5, v6, v7)}, (3) $V^{(0)}$ = {(v1), (v2), (v3), (v4), (v5, v6), (v7)}, (4) $V^{(0)}$ = {(v1, v2, v7), (v3), (v4), (v5, v6)}, and (5) $V^{(0)}$ = {(v1, v5), (v2, v3, v7), (v4), (v6)}. Specifically, each $V_k$ in (1) is a singleton feature subset, $V^{(0)}$ of (2) is a single feature subset with all features, and the remaining three $V^{(0)}$s were randomly generated. To support enough randomness, $T^{(0)} = 400.0$ and $\eta = 0.997$. For the DAEM algorithm, $\beta^{(0)} = 4.0$ and $\beta^{(i)} = \beta^{(i-1)} \times 0.998$.

As shown in Table 3-3, our method found both the feature subsets and parameter estimates near the true parameter values. On the other hand, for dataset-III, when the conventional clustering was performed with all features, the best-fit mixture model-based clusters consisting of three mixture components were discovered and two features (i.e., v1 and v2) were identified as the informative features for clustering. This implies

that the conventional clustering algorithms are limited in simultaneously identifying multiple feature subsets for diverse clustering results. For dataset-IV and dataset-V, our approach revealed the expected feature partition as well as the mixture model representing the clusters for each feature subset.

Figure 3-5. Performance evaluation of the incremental estimation.

Our method seeks the desired feature partition using the incremental estimation and has been tested on the three datasets described above. Fig. 3-5 shows the results for the total evaluation time. In all cases, the growth rate of the execution time per iteration becomes stable after a few thousand iterations. In contrast to the approach that estimates at all iterations, the incremental estimation method showed better performance with drastically reduced computational time.

Based on the aforementioned experimental results, we showed that the incremental estimation technique can reduce computational cost. However, this technique may still be limited in scalability because the main procedures of this approach require much computational cost. First, computing parameter estimates via the EM algorithm requires O(NDI) time, where N is the number of data samples, D is the number of

dimensions, and I is the number of iterations. Second, when we consider the model selection procedure to determine the number of clusters, the time complexity will be O(NDIK), where K is the number of clusters. The total computation cost depends on I for satisfying the convergence condition. When the incremental estimation technique is applied, the model estimation and model selection procedures will be performed at most two times. Since all these procedures are executed at each iteration to find the desired feature partition model, the total computation of our approach grows linearly as the number of iterations increases.

Figure 3-6. Performance evaluation against the size of dataset.

For this scalability issue, we tested our approach using the SA-based biased random walk algorithm on several synthetic datasets with 15 features. The number of iterations was set to $1.0 \times 10^4$. For this experiment, an approach similar to that used in the previous experiments was used for generating datasets having four different sizes of data (1000, 5000, 10000, and 20000). For each size of data, we generated 10 different datasets. Based on these datasets, the SA-based biased random walk algorithm was evaluated with respect to the total running time. Figure 3-6 shows a summary of the relevant

[9]. The wine recognition dataset, available at the UCI Machine Learning repository, has 178 observations, 13 features, and 3 classes. Although many clustering algorithms have used this dataset in their experiments, most of them required the number of clusters as a constraint, so it is difficult to compare directly between them and our method. When the conventional clustering is executed with all features, we found BIC values of 7201.01, 7206.56, and 7498.07 with $G_k$ = 1, 2, and 3 clusters, respectively. This implies that using the conventional clustering method is not appropriate to determine the number of clusters based on the BIC values for this dataset. When the feature-subset-wise clustering method is applied, we found BIC(V) = 6934.42 for the optimal feature partition V. This partition consists of two feature subsets. The first feature subset contains Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, and Proline. The second feature subset includes total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, and OD280/OD315 of diluted wines. For each feature subset, two clusters have been discovered. For the first feature subset, the first cluster contains 55 observations. Among all the observations belonging to the first

cluster, 53 observations (about 96.4%) are members of the first class in the original dataset. The second cluster contains most of the observations corresponding to the second and the third classes in the original dataset. On the other hand, the first cluster in the second feature subset contains mostly observations belonging to the first and the second classes in the original dataset. The second cluster in the second feature subset contains 51 observations, consisting of 4 observations corresponding to the second class and 47 (out of 48) observations of the third class in the original dataset. Based on these results, it can be concluded that the first feature subset contributes to extracting the first cluster of the original dataset and the second feature subset is potentially useful to identify the third cluster from the original dataset. By combining the clustering results of these two feature subsets, three new clusters can be constructed. Interestingly, we found that this clustering result matches the original class labels closely. This implies that our method can also be utilized to identify the original clusters.

Table 3-4. The summary of experimental results on the UCI datasets.
Wine: N = 178, D = 13, 3 classes; $\hat{G}$ = [2, 2]; feature subsets [(v1, ..., v5, v13); (v6, ..., v12)]
Diabetes: N = 768, D = 8, 2 classes; $\hat{G}$ = [3, 2, 3]; feature subsets [(v1, v7); (v2, v3, v4, v5, v6); (v8)]
Waveform: N = 5000, D = 21, 3 classes; $\hat{G}$ = [1, 1, 3]; feature subsets [(v1); (v21); (v2, v3, ..., v19, v20)]

The diabetes dataset consists of 768 instances (500 and 268 instances testing negative and positive for diabetes, respectively) described by 8 numeric-valued attributes (e.g., diastolic blood pressure (mmHg) and age (years)). The waveform dataset consists of 5000 21-dimensional instances that can be grouped into 3 classes of waves. All features except one associated with class labels were generated by combining two of three shifted triangular waveforms and then adding Gaussian noise [66].

Because the clustering results of our approach can be much different from those of the other methods, it can be difficult to compare between them. Moreover, this implies

the difficulty in directly evaluating the clustering accuracy. Therefore, in our experiments, we assessed clustering accuracy using the feature subset that shows the clustering results most similar to the original class labels. Table 3-4 summarizes the experimental results with these datasets, including their properties.

Figure 3-7. Classification accuracies of the various feature selection approaches.

In the experiment with the diabetes dataset, the 2nd feature block contained two mixture components and its clustering accuracy with respect to the true class labels was 0.6602. For the 1st feature block, v1 had similar degrees of negative relationship with v7 in all three clusters. The 3rd feature block, containing a single feature, was composed of three mixture components. Both the 1st and the 3rd feature blocks showed relatively less meaningful analytical information than the 2nd feature block.

The experimental results on the waveform dataset seem to cover a result of the feature selection method. Each of two features, v1 and v21, was represented by a singleton cluster based on a univariate Gaussian distribution. This implies that they can be regarded as less informative features. For the 3rd feature block containing all features

Using these three UCI datasets, we evaluated the performance of our approach and other feature selection methods in both classification and clustering. Figure 3-7 summarizes the relevant results with respect to the classification accuracy. We compared our approach with the following existing approaches: feature selection after performing dimensionality reduction with PCA (PCA-DR), feature selection through genetic programming (GP), utilizing the naive Bayes classifier (NB), techniques obtaining the appropriate parameters for a back-propagation network without using a feature selection approach (PSOBPN1) and with a feature selection approach (PSOBPN2), variable selection for mixture model-based clustering (VS-Mclust), independent variable group analysis (AIVGA), and our approach using the SA-based biased random walk algorithm (SABR) [49, 51, 64, 88]. In addition, a classification using the decision tree (C4.5 (J48)) technique with all features was evaluated.

Obviously, most feature selection techniques in classification (i.e., PCA-DR, GP, NB, PSOBPN1, and PSOBPN2) showed better classification accuracies than feature selection algorithms in clustering. This is because the classification procedure was performed based on training data including class labels. Decision tree-based classification showed comparable performance to the feature selection approaches for both classification and clustering. For the diabetes dataset, VS-Mclust identified 8 clusters after selecting three variables. Because of the difficulty in evaluating the clustering accuracy by directly comparing the identified cluster indices with the actual class labels, the clustering accuracy was calculated by utilizing the confusion matrix between the obtained clusters and the actual class labels. Likewise, for the waveform dataset, 21 distinct clusters were discovered by VS-Mclust, so we calculated the relevant clustering accuracy in the same manner as for the diabetes dataset. Interestingly, the two approaches for feature selection in clustering showed better or comparable

performance to the classification technique using the decision tree (C4.5). Our approach demonstrated similar or comparable performance to the other algorithms in clustering. Obviously, this result implies that none of these algorithms shows the best performance on all the datasets. However, our approach can be differentiated from the other approaches in that it automatically determines the number of clusters to represent the best-fit mixture model. Furthermore, our approach also automatically extracts the multiple feature blocks while the other algorithms identify only a single common feature subset.

Table 3-5. The groups of the Nakdong River observations: Alluvium (a deposit of sand, mud, etc.), 62; Fractural rock, 13; River water, 5.

[73]. To analyze these phenomena, water samples should be reasonably classified according to various clustering criteria such as geological or chemical processes. It is also important to identify the chemical factors as being strongly or weakly relevant to the specific

clustering model. For these issues, our approach has been applied to a real dataset. The water sample dataset, collected from the Nakdong river watershed in South Korea, contains 93 locations and 16 chemical elements such as Na+, Ca2+, Cl-, SO4^2-, HCO3-, and NO3-, and other factors such as pH, Eh (redox potential), EC (electrical conductivity), and DO (dissolved oxygen). For more details about this dataset, see [73].

Figure 3-8. Line plots of the river watershed dataset.

Table 3-6. The result of model-based clustering for each subset of variables. Mixing probabilities: $(p_{21}, p_{22}) = (0.100, 0.900)$; $(p_{31}, p_{32}, p_{33}) = (0.0733, 0.5833, 0.3434)$; $(p_{4}) = (1.000)$. Cluster means by variable: pH [8.3143, 7.2024, 6.0479]; Eh [4.8575, 5.9931]; EC [7.0520, 6.0696, 5.8864]; SO4 [3.5531]; DO [2.1763, 1.4078]; Na [5.0138, 3.1724, 2.3182]; K [1.1860]; Ca [3.1422, 3.6792, 3.7750]; Cl [5.1545, 3.5227, 3.0320]; NO3 [2.6357, 0.3736, 4.0717].

For higher accuracy, all data except pH were transformed by the natural log. To avoid taking the log of non-positive values, 80 samples and 13 attributes from the original dataset have been selected. Figure 3-8 shows the plots for the Nakdong river dataset after preprocessing. In this

figure, each line (drawn with shape '--') represents the vector of each observation over all the features in the dataset. This dataset has 3 types of observations, detailed in Table 3-5 with the size of each group.

Figure 3-9. Discovered feature subsets of the watershed dataset.

Figure 3-9 depicts the results with the minimized BIC(V). The first feature subset, shown in Figure 3-9A, consists of pH, Mg2+, SiO2, HCO3-, and NO3-, having a multivariate hydrogeological mixture of three clusters, which agrees with clusters of shallow groundwater (C11), deep groundwater (C12), and river water (C13). For V2, shown in Figure 3-9B, Eh and DO are factors for the mixture of two clusters based on the redox process. Specifically, cluster C21 and cluster C22 explain the reduction and oxidation processes, respectively. In Figure 3-9C, cluster C33 of V3, containing EC, Ca2+,

In Figure 3-9D, the group of K+ and SO4^2- consists of a single cluster showing a weak indication of water pollution. This implies that V4 is the group containing the remaining chemical elements and/or other factors which have relatively less contribution to clustering. Details of the experimental results are shown in Table 3-6.

[1]. The dataset consisted of 117 countries and 11 features that were related to the population and development of a country: energy consumption per capita (kilograms oil equivalent), GNI (gross national income per capita, current US$), GDP (gross domestic product per capita, current US$), government education expenditure (% of GDP), population density (per km2), third-level students (women, % of total), urban population growth rate 2000-2005 (% per annum), contraceptive use (ages 15-49, % of currently married women), total fertility rate 2005-2010 (live births per woman), life expectancy of women at birth 2005-2010 (years), and imports per capita (1000 US$). To reduce skewness of the data, we applied the log() transformation to five features: energy consumption per capita, GNI, GDP, imports per capita, and population density. These features are denoted ℓNRG, ℓGNI, ℓGDP, ℓIMP, and ℓPDN, respectively. Similarly, to reduce skewness and boundary effects, the arcsin(√) transformation was applied to the percentage features such as third-level students, contraceptive use, and government education expenditure [100]. These features will be denoted aTLS, aCNT, and aGOV, respectively. As an example, Figure 3-10 shows the distribution (kernel density estimation) of TLS and aTLS. It shows that, after the arcsin(√) transformation, both tails of the distribution become low enough to use the mixture of normals. Life expectancy of women, total fertility rate, and urban population

growth rate are used without transformation. These features will be called LIF, FRT, and UPG, respectively.

Figure 3-10. The arcsin(√) transformation of the third-level students (women) feature; B: after transformation.

Table 3-7. Estimates of means for each feature subset in the UN dataset.

When our feature-subset-wise clustering algorithm was applied, the optimal feature partition V was found with BIC(V) = 1825.06. This partition consisted of four feature subsets (V1, V2, V3, and V4): V1 and V2 constructed two clusters of countries, and V3 and V4 created only a single cluster each.

Figure 3-11 depicts the clusters of countries based on V1 and V2. The membership of each country was assigned to the cluster that had the highest posterior membership probability. Table 3-7 shows the mean estimates of the features in each cluster.

Figure 3-11. Discovered clusters for each feature subset in the UN dataset; B: 2nd feature subset.

The first feature subset, V1, contained six economy-related features: ℓIMP, LIF, aTLS, ℓGDP, ℓGNI, and ℓNRG. Figure 3-11A illustrates two clusters of countries, in medium gray color and black color, based on the first feature subset, V1. The light gray color indicates countries that are not included in this study. Cluster 1 in medium gray included countries mostly in Europe, East Asia, North and South America, and Oceania with relatively high GDPs and GNIs (see Table 3-7). On the other hand, Cluster 2 in

The second feature subset, V2, contained three fertility-related features: UPG, FRT, and aCNT. These features also constructed two clusters of countries, as shown in Figure 3-11B. Table 3-7 shows that countries in Cluster 1 (colored in medium gray) had a relatively low fertility rate, a low rate of urban population growth, and a high percentage of contraceptive use. Overall, population growth was more controlled in the countries in Cluster 1 than in those in Cluster 2.

Apparently, V1 and V2 found similar clusters, except for India, Bangladesh, Indonesia, the Philippines, Malaysia, South Africa, Zimbabwe, and some Middle East countries. India, Bangladesh, South Africa, and Zimbabwe were categorized as countries with characteristics reflecting less economic development in the clustering based on V1; however, they were categorized as countries with more population-controlling characteristics in the clustering based on V2. Indonesia, the Philippines, Malaysia, and some Middle East countries had the opposite characteristics.

To compare our feature-subset-wise approach with conventional clustering, we applied the multivariate normal mixture model without partitioning features. We found that the BIC(V) was the lowest when the number of clusters G = 2 (BIC(V) = 1923.09, 1908.62, and 2026.53 with G = 1, 2, and 3 clusters, respectively). The conventional clustering (G = 2) provided the same results as the clusters based on V1 (the economic development related feature subset), as in Figure 3-11A. This result seems to occur because the economic development-related features dominate the others.

Our method can be used in various application areas such as text data mining and gene expression data analysis. Specifically, in text data mining, there are many features to represent text documents. Through our approach, they can be partitioned over the various feature subsets that identify genres, authors, styles, and other categories, and each document can be assigned different clusters across the above diverse feature subsets. However, approaches should be considered to address the problem of how to


In subspace clustering, the majority of the prior research work was affected by user-specified thresholds. Furthermore, it can be difficult to interpret the discovered subspace clusters, because they have arbitrary shapes. To mitigate these problems, we observe that the features of real-world datasets are usually related to specific domain knowledge for efficient representation of data samples. By utilizing this property for constructing the feature subspaces, the search space for subspace clustering can be much reduced. In this chapter, we propose a new model-based clustering approach on these constrained feature subspaces. To improve the robustness, we used a multivariate Student t-distribution.

One reasonable way to counteract these problems is to utilize efficient methods that attempt to find a subset of features supporting the explanation of clusters. This feature selection approach can be an effective technique in that it enables representing clusters using only a few relevant features [27]. However, the feature selection technique

[83]. In subspace clustering, some naive approaches attempt to find clusters in every possible feature subspace by evaluating the dense regions. However, these approaches produce a large number of clusters, and many of them may not be desired. Furthermore, the assumption that all features are independent of each other is not applicable in practice. Therefore, we need to consider how to determine meaningful feature subspaces. A reasonable assumption is that clustering should be performed only on the feature subspaces consisting of strongly correlated features. Another problem in subspace clustering is how to deal with noisy data or outliers. These atypical data samples may cause deterioration in the quality of the clusters. In particular, this problem has been frequently discussed in Gaussian mixture model-based clustering, which is sensitive to noisy data or outliers.

In this chapter, we investigate a subspace clustering approach for high-dimensional datasets, considering strong correlations among multiple features to improve the quality of clusters embedded in the feature subspaces. To alleviate problems from noisy data and outliers, we used a mixture model with a heavy-tailed distribution, called the multivariate Student's t-distribution, to represent the clusters. In our approach, the number of clusters for the best-fit mixture model is decided by evaluating a parsimonious model selection criterion.

The rest of this chapter is organized as follows. In Section 4.2, subspace clustering algorithms are summarized. Section 4.3 describes a method to select strongly correlated feature groups, a mixture model of multivariate t-distributions and its parameter estimation via maximum likelihood estimation, and the model selection criterion. The experimental results are discussed

in Section 4.4. Section 4.5 presents conclusions.

[83]. The bottom-up style approaches (e.g., [6, 22, 39, 65, 86]) search for subspace clusters with higher density than a user-specified threshold. For instance, CLIQUE starts with a grid of non-overlapping rectangular units in the original feature space. Then, it attempts to find clusters existing in the feature subspaces by searching for dense units or merging the connected dense units that have already been discovered. However, during this process, the procedure evaluating the density of data objects embedded in all possible feature subspaces can lead to exponentially increasing computational cost. Specifically, for a given D-dimensional feature vector, there can exist $2^D - 1$ nonempty distinct feature subsets as the possible feature subspaces. To avoid this problem, these bottom-up style subspace clustering algorithms try to exploit a particular property, called the downward closure property: if dense regions are discovered in a D-dimensional space, this implies that they also reside in any of its (D-1)-dimensional subspaces [105]. Based on this monotonic property of density, these bottom-up style subspace clustering algorithms can reduce the infeasible search space and find clusters having various sizes and shapes.

On the other hand, the top-down approaches such as [4] create equally weighted clusters in the whole dimensional space at the initialization step. Then, they alternately perform the clustering process by assigning different weights to each cluster and regenerating new subspace clusters based on the updated weights. In contrast to the bottom-up subspace clustering approaches, the top-down approaches require expensive

[114]. Beginning with initial results, δ-Clusters iteratively improves the quality of each cluster. However, this algorithm requires knowledge of the number of clusters and each individual cluster size, with the processing time affected by the cluster size.

[21, 22, 83]. In many practical applications, data samples can be drawn independently and identically distributed from underlying populations within a particular probability distribution. On the other hand, the features of the dataset are usually related to specific domain knowledge for efficient representation of data objects. This implies that there can be a number of subsets of features which are strongly correlated with each other, hence clustering on these constrained feature spaces can accompany the discovery of the informative subspace clusters. In addition, by selecting only these feature subsets, the search space for subspace clustering can be drastically reduced.

Another issue with real-world datasets is that they usually contain many atypical data samples, called noisy data or outliers. In the context of cluster analysis, how to deal with these outliers has been extensively discussed because they can affect the

Figure 4-1. Illustration of robust feature subspace clustering.

[29, 76], and constructing a finite mixture model based on a heavy-tailed distribution such as the Laplace distribution or the Student t-distribution [27, 101].

In this chapter, we employ the Student t-distribution to select the best-fit mixture model-based clusters on the feature subspaces. In mixture model-based clustering, determining the number of clusters can be equivalent to finding the best-fit mixture model. A parsimonious model selection criterion was utilized for this part. In this section, we describe methods related to the above issues. Figure 4-1 illustrates our approach. As can be seen, a number of different feature subspaces can be created by combining various features, and data submatrices can be constructed based on these feature subspaces. Then, each submatrix can be represented by a finite mixture model.

[55]. Many measurements such as the $\chi^2$-statistic and the Pearson correlation have been used to assess the pairwise correlation between two features [110, 116]. In particular, the Pearson correlation, denoted by $\rho$, is a popular measurement to evaluate the magnitude of linear dependence between two features. To be specific, given two features $v_1$ and $v_2$, the Pearson correlation $\rho_{v_1 v_2}$ can be expressed as

$\rho_{v_1 v_2} = \frac{E\left[ (v_1 - \mu_{v_1})(v_2 - \mu_{v_2}) \right]}{\sigma_{v_1} \sigma_{v_2}},$

where $\mu_{v_i}$ and $\sigma_{v_i}$ are the mean and the standard deviation of $v_i$, respectively.

Suppose that X is a data matrix consisting of N samples, each represented by a D-dimensional feature vector, $V = (v_1, \ldots, v_D)$. In this case, correlations among D

$r_{j_1 j_2} = \frac{\sum_{i=1}^{N} (X_{ij_1} - \bar{X}_{j_1})(X_{ij_2} - \bar{X}_{j_2})}{\sqrt{\sum_{i=1}^{N} (X_{ij_1} - \bar{X}_{j_1})^2}\,\sqrt{\sum_{i=1}^{N} (X_{ij_2} - \bar{X}_{j_2})^2}},$

where $\bar{X}_{\ell} = \frac{1}{N}\sum_{i=1}^{N} X_{i\ell}$.

Since the values of r express the signed linear relationship between two features, the direct use of r may not be suitable for evaluating the magnitude of the pairwise correlation. Instead, an element-wise squared matrix of r, denoted by $R^2$, can be used because all elements in $R^2$ are between 0 and 1. The values of the elements in $R^2$ are called the coefficient of determination [90].

When one extracts subsets of features with strong pairwise correlations, many statistical measures for assessing correlations require a user-specified threshold. This can give rise to different outputs depending on the value of the threshold. For this problem, based on the empirical results of many previous approaches [21], a possible threshold can be obtained through the following equation:

$h = \frac{2}{D(D-1)} \sum_{j_1 < j_2} R^2_{j_1 j_2}.$

That is, for a given number of features (variables), this equation calculates the average of all pairwise correlations of these features. We call this threshold the global mean correlation threshold. Based on this, if $R^2_{j_1 j_2} \geq h$, it can be considered that two features $v_{j_1}$ and $v_{j_2}$ are strongly correlated. This implies that the relevant features can form a feature subspace to construct informative clusters. Otherwise, the relevant features can be thought of as loosely correlated, and the pair can be ignored during the construction of the candidate feature subspaces.
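The screening step can be sketched as follows; `strongly_correlated_pairs` is an illustrative helper (not part of the original implementation) that squares the Pearson correlation matrix element-wise and keeps pairs whose coefficient of determination reaches the global mean correlation threshold h.

    import numpy as np

    def strongly_correlated_pairs(X):
        """Return feature pairs (j1, j2) with R^2 >= h, and the threshold h."""
        R2 = np.corrcoef(X, rowvar=False) ** 2
        D = R2.shape[0]
        iu = np.triu_indices(D, k=1)
        h = R2[iu].mean()                  # global mean correlation threshold
        pairs = [(int(i), int(j)) for i, j in zip(*iu) if R2[i, j] >= h]
        return pairs, h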

Our approach focuses on identifying cluster structures lying on the constrained feature subspaces where the features are strongly correlated. For constructing these constrained feature subspaces, extracting possible subsets of features that have locally strong correlations with each other should be considered. For this problem, we considered the following procedure.

Suppose that a feature $v_{d_1}$, where $d_1 \in \{1, \ldots, D\}$, is selected. Then, the coefficients of determination of all other features with $v_{d_1}$ are expressed by a column vector of $R^2$, denoted by $R^2_{d_1} = (R^2_{d_1 1}, \ldots, R^2_{d_1 D})^T$. For $R^2_{d_1}$, the mean, denoted by $E(R^2_{d_1})$, can be computed by

$E(R^2_{d_1}) = \frac{1}{D-1} \sum_{d_2 \neq d_1} R^2_{d_1 d_2}.$

When we assume that $R^2_{d_1}$ and $E(R^2_{d_1})$ follow the property of the global mean correlation threshold, an $R^2_{d_1 d_2}$ greater than $E(R^2_{d_1})$ can be deemed to indicate a relatively strong correlation of a feature $v_{d_2}$ with the given $v_{d_1}$. Based on this result, a feature subset $V_{d_1}$ can be constructed by selecting the $v_{d_2}$s satisfying $R^2_{d_1 d_2} \geq E(R^2_{d_1})$.

Although $D_{d_1} < D$,
Figure 4-2. Illustration of constructing feature subspaces with the correlated features.

In our approach, the hierarchical structure is constructed by merging features (or feature subsets) with the nearest feature having the largest value of $R^2$ among the remaining features. To do so, the single linkage algorithm was used. For single linkage, the distance between a pair of feature subsets is determined by the two closest features belonging to the different feature subsets. Assume that there exist two feature subsets, $V_1$ and $V_2$. Then, the single linkage distance can be calculated by

$D_{SL}(V_1, V_2) = \min_{v_{j_1} \in V_1,\ v_{j_2} \in V_2} \left( 1 - R^2_{j_1 j_2} \right).$

Based on the above distance measure and proximity criterion, for creating the possible feature subspaces, the agglomerative hierarchical clustering algorithm can be performed as follows. First, we set the single features within a feature subset $V_d$ as the leaf nodes of the hierarchical clustering result; then we start constructing feature subspaces using hierarchical clustering. The larger feature subspaces can be constructed by agglomerating the most similar features (or feature subsets) at each step.
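A minimal sketch of this construction is given below, assuming the dissimilarity between features is taken as $1 - R^2$ (consistent with merging the most strongly correlated features first); each merge level of the single-linkage tree yields one candidate feature subspace, and duplicates are filtered afterwards.

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    def candidate_subspaces(R2):
        """Candidate feature subspaces from single-linkage merges on 1 - R^2."""
        dist = squareform(1.0 - R2, checks=False)      # condensed distance matrix
        Z = linkage(dist, method="single")
        current = [{i} for i in range(R2.shape[0])]    # leaf nodes: single features
        subsets = []
        for a, b, _, _ in Z:
            merged = current[int(a)] | current[int(b)]
            current.append(merged)
            subsets.append(frozenset(merged))
        return subsets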

After obtaining the complete hierarchy, the candidate feature subspaces are taken from its nested merges; Figure 4-2 illustrates this procedure. For these feature subsets, a mixture model-based clustering is performed. During this feature subspace construction process, some redundant feature subspaces can be created. These will be ignored at the mixture model-based clustering step.

A finite mixture model with K components can be written as

$f(x; \Theta) = \sum_{k=1}^{K} \pi_k\, f_k(x; \theta_k),$

where, for the kth mixture component, $f_k(x; \theta_k)$ is the probability density of x and $\theta_k$ is the vector of the unknown parameters.

A finite mixture model of Student t-distributions can be represented by a weighted sum of multivariate Student t-distributions. For D-dimensional data x, the probability density function (p.d.f.) of the multivariate Student t-distribution is denoted by

$f(x_i; \theta_k) = \frac{\Gamma\!\left(\frac{\nu_k + D}{2}\right)\, |\Sigma_k|^{-1/2}}{(\pi \nu_k)^{D/2}\, \Gamma\!\left(\frac{\nu_k}{2}\right)\, \left[ 1 + \delta(x_i, \mu_k; \Sigma_k)/\nu_k \right]^{(\nu_k + D)/2}},$

where $i = 1, \ldots, N$, $k = 1, \ldots, K$, $\theta_k = (\mu_k, \Sigma_k, \nu_k)$, and $\delta(x_i, \mu_k; \Sigma_k) = (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)$, where $\mu_k$ and $\Sigma_k$ are a mean vector and a positive semi-definite covariance matrix,

[85]. $\nu_k$ is a degree of freedom that parameterizes the robustness of the Student t-distribution. In particular, as $\nu_k$ increases, the Student t-distribution becomes closer to the Gaussian distribution.

The gamma function $\Gamma(m)$ is expressed as

$\Gamma(m) = \int_{0}^{\infty} t^{m-1} e^{-t}\, dt.$

Interestingly, the p.d.f. of the multivariate Student t-distribution can be written by using the multivariate Gaussian distribution with mean $\mu_k$ and covariance matrix $\Sigma_k / w_i$,

$f(x_i; \theta_k) = \int_{0}^{\infty} \phi\!\left(x_i; \mu_k, \Sigma_k / w_i\right)\, g\!\left(w_i; \tfrac{\nu_k}{2}, \tfrac{\nu_k}{2}\right)\, dw_i,$

where the weight $w_i$ is an unknown parameter that follows a gamma distribution $g(w_i; \tfrac{\nu_k}{2}, \tfrac{\nu_k}{2})$.

The estimates maximizing the likelihood function (4) can be obtained via the EM algorithm [85]. Suppose $\Theta^{(t)}$ is the estimate of $\Theta$ obtained after the tth iteration of the algorithm. In the E-step, the conditional expectation of the complete-data log-likelihood (the Q function), $Q(\Theta, \Theta^{(t)})$, is computed. As a result, the posterior probability that $x_i$ belongs to the kth mixture component is computed as

$z_{ik}^{(t)} = \frac{\pi_k^{(t)}\, f(x_i; \theta_k^{(t)})}{\sum_{k'=1}^{K} \pi_{k'}^{(t)}\, f(x_i; \theta_{k'}^{(t)})},$

where $\theta_k^{(t)} = (\hat{\mu}_k^{(t)}, \hat{\Sigma}_k^{(t)}, \hat{\nu}_k^{(t)})^T$. The conditional expectation of $w_i$, given $z_{ik} = 1$ and $x_i$, is

$w_{ik}^{(t)} = \frac{\nu_k^{(t)} + D}{\nu_k^{(t)} + \delta(x_i, \mu_k^{(t)}; \Sigma_k^{(t)})}.$

In the M-step, the new parameter estimates $\Theta^{(t+1)}$ maximizing $Q(\Theta, \Theta^{(t)})$ are obtained until a convergence condition is satisfied; at each iteration step, the estimates of $\pi_k$, $\mu_k$, and $\Sigma_k$ are updated in closed form.
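The two E-step quantities can be sketched as below; the latent scale weight $w_{ik}$ shrinks toward zero for points with a large Mahalanobis distance, which is what makes the t-mixture robust to outliers. The function is an illustrative sketch of the standard update, not the code used in this study.

    import numpy as np

    def t_scale_weights(X, mu, sigma_inv, nu):
        """w_ik = (nu + d) / (nu + delta(x_i, mu; Sigma)) for one component."""
        d = X.shape[1]
        diff = X - mu
        delta = np.einsum("ni,ij,nj->n", diff, sigma_inv, diff)   # Mahalanobis
        return (nu + d) / (nu + delta)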

where $N_k^{(t)} = \sum_{i=1}^{N} z_{ik}^{(t)}$ and $\psi(\cdot) = \frac{d \log \Gamma(\cdot)}{d(\cdot)}$ is the digamma function, which appears in the update of $\nu_k$. For the details of this derivation, see [101].

[90, 112]. Considering the characteristics of mixture model-based clustering through the EM algorithm, we used the integrated completed likelihood (ICL), a model selection criterion that provides more parsimony

[13]. The ICL can be denoted by

$ICL(K) = -2 \log f(x, \hat{z} \mid \hat{\Theta}) + \lambda \log N,$

where $\hat{z}$ denotes the completed (estimated) cluster labels and $\lambda$ is the number of free parameters. This ICL can be written in a BIC-like approximation form, that is,

$ICL(K) = BIC(K) - 2 \sum_{i=1}^{N} \sum_{k=1}^{K} \hat{z}_{ik} \log t_{ik},$

where $t_{ik}$ is the conditional probability, denoted by

$t_{ik} = \frac{\pi_k\, f(x_i; \theta_k)}{\sum_{k'=1}^{K} \pi_{k'}\, f(x_i; \theta_{k'})},$

and $\hat{z}_{ik}$ is a Maximum A Posteriori (MAP) assignment obtained from the parameter estimates. A mixture model that minimizes the ICL criterion is selected as the desired clustering result. Algorithm 5 summarizes our approach.
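Under the convention used here (lower BIC is better), the BIC-like form of ICL can be sketched as follows; the extra term penalizes fits whose responsibilities $t_{ik}$ are far from 0/1, i.e. poorly separated components. This is an illustrative computation consistent with the formulas above, not the exact routine of Algorithm 5.

    import numpy as np

    def icl_bic(loglik, n_params, n_obs, resp, eps=1e-12):
        """ICL = BIC - 2 * sum_i sum_k zhat_ik log t_ik (MAP hard assignments)."""
        bic = -2.0 * loglik + n_params * np.log(n_obs)
        z_map = (resp == resp.max(axis=1, keepdims=True)).astype(float)
        entropy_term = -2.0 * np.sum(z_map * np.log(resp + eps))
        return bic + entropy_term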

Figure 4-3. Dendrograms for the different feature subsets in dataset-I.

In addition, the fourth feature block, containing two non-informative features, was drawn from a Gaussian distribution $N(3, I_2)$ and added to dataset-I. The number of data samples for the Gaussian mixture components ranged from 90 to 300.

Figure 4-4. Some examples of the data plots on different feature subspaces in dataset-I.

Dataset-II (N = 600 and D = 13) was generated in a similar manner but contained a much more complex structure. It consists of five different feature blocks, where each was generated from a mixture of Gaussian distributions with the following parameters:

and a three-dimensional Gaussian distribution $N(3, I_3)$.

Figure 4-5. Clustering error rates for different feature subspaces in dataset-I.

To test the noise robustness, noisy data samples were uniformly generated with different noise ratios (0%, 20%, 40%, 60%, 80%, and 100%) and added to each synthetic dataset.

Figure 4-6. Dendrograms for different feature subsets in dataset-II.

Figure 4-7. Some examples of the data plots on different feature subspaces in dataset-II.

In high-dimensional data, many features are irrelevant to the clustering results. That is, considering those features that are meaningless for clustering may degrade the performance of the algorithms. To mitigate this problem, as a preprocessing step, features related to clustering can be selected through either the likelihood ratio test or a model selection criterion [69]. In our approach, we used the BIC criterion to filter some irrelevant features for clustering in the preprocessing step.

Figure 4-3 shows the dendrograms for the subsets of features that are strongly correlated in dataset-I. From the result of Figure 4-3A, two possible feature subspaces (i.e., {v1, v2} and {v1, v2, v3}) can be constructed. Likewise, based on the

result of Figure 4-3B, feature subspaces {v1, v2} and {v1, v2, v4} can be generated. From the result of Figure 4-3C, three feature subspaces {v3, v4}, {v1, v3, v4}, and {v1, v2, v3, v4} can be constructed. From the result of Figure 4-3D, feature subspaces {v3, v4} and {v2, v3, v4} can be constructed. From the result of Figure 4-3E, three feature subspaces {v1, v2}, {v5, v6}, and {v1, v2, v5, v6} can be constructed. As shown in Figure 4-3F, only one feature subspace (i.e., {v5, v6}) can be generated. Among these candidate feature subspaces, some redundant feature subspaces can be generated (e.g., {v5, v6} in Figure 4-3F). To reduce the unnecessary computational cost, they can be ignored during the clustering process.

Figure 4-8. Clustering error rates for different feature subspaces in dataset-II.

Since the number of features is D = 6, the possible number of feature subspaces based on the naive subspace clustering

is $2^D - 1$. However, since the number of selected features having a strong correlation with a given feature can be smaller than D, the possible number of feature subspaces for subspace clustering can be reduced to less than $D^2$. In addition, since some duplicates can exist among the extracted feature subspaces, the number of feature subspaces is further reduced. For example, in the experiments with dataset-I, the maximum number of possible feature subspaces can be 15. However, during the experiments, our approach only selected 9 feature subspaces that are expected to contain meaningful cluster structures.

For visualization, Figure 4-4 shows examples of the data plots existing on the various feature subspaces. They are the feature blocks used for generating this dataset. Specifically, the feature subspace {v1, v2} contains two mixture components, shown in Figure 4-4A. The feature subspace {v3, v4} can be represented by the mixture model-based clusters with three components, shown in Figure 4-4B. In Figure 4-4C, two clustered groups of data can be identified on the feature subspace with {v5, v6}. For each created feature subspace, a mixture model-based clustering was performed.

Although dataset-I was generated based on the four feature blocks, many other feature subspaces can be extracted from the dataset. This characteristic can make it difficult to evaluate the quality of clusters that are discovered on the different feature subspaces. For this reason, to evaluate the quality of clusters, we only considered clusters that were discovered on the feature subspaces shown in Figure 4-4. To assess the quality of the clustering results, the clustering error rate can be calculated as the number of data points not assigned to their desired clusters divided by the total number of data samples [50], denoted by

Figure 4-5 shows the clustering error rates with respect to the change in noise ratio for the data samples of the various feature subspaces embedded in dataset-I with uniform noise, shown in Figure 4-4. Obviously, the mixture model-based clustering results using the multivariate Student t-distribution models showed better performance than the Gaussian mixture models with respect to robustness in a uniform noise condition. In addition, as the structure of the dataset on the extracted feature subspace becomes more complex, the approach using the multivariate Student t-distribution showed better performance than the method based on the Gaussian distribution.

For the subsets of features that are strongly correlated in dataset-II, 10 dendrograms have been constructed based on the hierarchical clustering algorithm, shown in Figure 4-6. As can be seen, since dataset-II is much larger and has a more complex structure than dataset-I, more candidate feature subspaces can exist in dataset-II. Figure 4-7 shows examples of the data plots for the various feature subspaces in dataset-II. Similar to the case of dataset-I, they are the feature blocks used for generating dataset-II. To be specific, in the feature subspace {v1, v2}, data samples can be represented by a mixture model with three components, shown in Figure 4-7A. On the feature subspace {v3, v4}, clusters that are represented by a mixture model with four components can be found, as shown in Figure 4-7B. Figure 4-7C shows three hyper-ellipsoidal clusters existing on the feature subspace {v5, v6, v7}. Based on the feature subspace {v8, v9, v10}, four mixture model-based clusters can be identified, as shown in Figure 4-7D.

Figure 4-8 shows the clustering errors with respect to the noise ratio for several data samples of the various feature subspaces embedded in dataset-II with uniform noise. As presented in Figure 4-8A and Figure 4-8C, both the Student t-distribution mixture

model and the Gaussian mixture model showed low clustering error rates. On the other hand, for the datasets on the feature subspaces with a more complex structure (e.g., {v5, v6, v7} and {v8, v9, v10}), the mixture of multivariate Student t-distribution models demonstrated better noise robustness than the Gaussian mixture models in a noisy condition, as shown in Figure 4-8B and Figure 4-8D.

Table 4-1. The description of the UCI datasets used in the experiments (samples, dimensions, classes): Iris (150, 4, 3), Wine (178, 13, 3), Ecoli (336, 7, 8), Yeast (1484, 8, 10), Diabetes (768, 8, 2), Waveform (5000, 21, 3).

[9] were used, because these are more suitable for discussing and validating the performance of the proposed algorithm. For all these datasets, class labels assigned to the data samples were removed before the experiments. A summary of these datasets is provided in Table 4-1.

Our algorithm's performance was evaluated by comparing it with three other subspace clustering algorithms (i.e., EWKM, LAC, and PROCLUS) [3, 31, 54]. Using the UCI datasets, experimental results for the four algorithms including our approach were calculated by performing 20 executions and evaluated using the average of the Rand Index [89].

For a given dataset x, assume that we have two partitions of x, $c_1$ and $c_2$, where $c_1$ consists of class labels and $c_2$ consists of cluster labels. Then, the Rand Index is defined as

$RI(c_1, c_2) = \frac{c_{oo} + c_{xx}}{N(N-1)/2},$

where $c_{oo}$ is the number of pairs of samples in x that are in the same set in $c_1$ and in the same set in $c_2$, $c_{xx}$ is the number of pairs of samples in x that are in different sets in $c_1$ and in different sets in $c_2$, and N is the number of data samples, respectively [105].

Figure 4-9. Comparison of the subspace clustering algorithms with the Rand Index.

Figure 4-9 shows the relevant results. As can be seen, for most datasets, our approach obtains a higher score than the other three subspace algorithms. However, since none of these algorithms shows the best performance on all the datasets, it is not sufficient to claim that our algorithm outperforms these other approaches based only on these results. Nevertheless, our algorithm has an obvious advantage in that it determines the number of clusters to represent the best-fit mixture model, whereas the other algorithms require the user to provide the number of clusters and other user-specified parameters.
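For reference, the pair-counting computation of this index can be sketched as below (an O(N^2) illustration; the experiments themselves may use a contingency-table formulation):

    from itertools import combinations

    def rand_index(labels_a, labels_b):
        """Fraction of sample pairs on which the two partitions agree."""
        agree, total = 0, 0
        for i, j in combinations(range(len(labels_a)), 2):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += (same_a == same_b)
            total += 1
        return agree / total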

In these experiments, since all the clustering results based on various feature subspaces can be represented by different parameter estimates, such as the number of clusters and the mixture proportions, it would be inappropriate to evaluate the discovered clusters by only using the Rand Index or Normalized Mutual Information [89], which measure the similarity between the cluster indices obtained through the maximum a posteriori (MAP) classification of the data samples and the provided class labels. Furthermore, it is hard to infer the validity of subspace clusters through the comparison of these various subspace clustering results. For this reason, one can consider that a discovered cluster can consist of one or more classes in the original dataset. In this case, the clustering accuracy can be calculated by computing the highest frequencies for each class label that belongs to the cluster. We call this type of clustering accuracy the class coverage rate. Besides the class coverage rate, we also need to investigate the classes that make up the discovered clusters. For this, we considered the homogeneity of each of the classes that correspond to a specific cluster. We call this probability the local cluster purity, which is computed as the ratio of the number of estimated class labels corresponding to a desired true class. By utilizing these measures, we discuss the various analyses of these experimental results.

Table 4-2. The overall experimental results on the Iris dataset. Feature subsets and numbers of clusters: (SW, PL, PW), 2; (SL, PL, PW), 3; (PL, PW), 3. (Setosa: Iris Setosa, Versicolor: Iris Versicolor, Virginica: Iris Virginica.)

Table 4-3. The local cluster purities on the discovered subspaces for the Iris dataset (Iris Setosa, Iris Versicolor, Iris Virginica): (SW, PL, PW): 1.000, 1.000, 1.000; (SL, PL, PW): 1.000, 0.920, 0.880; (PL, PW): 0.980, 0.980, 0.680.
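The following sketch gives one plausible formulation of these two measures from a cluster-by-class contingency table; because the verbal definitions above admit some interpretation, these functions should be read as illustrative assumptions rather than the exact formulas used for the tables below.

    import numpy as np

    def class_coverage_rate(contingency):
        """contingency[c, k] = #samples of class k in cluster c;
        coverage = fraction of samples captured by each cluster's majority class."""
        return contingency.max(axis=1).sum() / contingency.sum()

    def local_cluster_purity(contingency):
        """Per class: fraction of its samples in that class's majority cluster."""
        best_cluster = contingency.argmax(axis=0)
        hits = contingency[best_cluster, np.arange(contingency.shape[1])]
        return hits / contingency.sum(axis=0)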

Table 4-2 shows the experimental results with the Iris dataset. During the feature subspace construction process, the feature subspace (PL, PW) was repeatedly evaluated, so duplicate ones were discarded. As a result, three distinct feature subspaces have been extracted. As can be seen, based on the feature subspace (SW, PL, PW), two clusters were successfully discovered. This result supports our previous discussion that the two groups (excluding Iris Setosa) cannot be linearly separated. In contrast, for the other two feature subspaces, (SL, PL, PW) and (PL, PW), the clustering results show that three clusters have been discovered. From this result, we can infer that the sepal width (SW) feature affects the separability of these three groups. Table 4-3 shows the local cluster purity of each class based on the results of Table 4-2. This result indicates that data samples of the Iris Virginica class are relatively less likely to be assigned to the desired class when partitioning the dataset into three classes.

Table 4-4 shows the results of our approach on the Ecoli dataset. Although the original dataset was classified into 8 groups, some original classes contained too few data samples to be regarded as a cluster. In addition, some of those small classes

(e.g., imU, imL, and imS) showed similar patterns which can be grouped together. As a result, for most clustering algorithm results, the original 8 groups were merged into 2 or 3 groups. Besides, two features in a candidate feature subspace, (lip, chg), showed high coefficients of determination. However, based on the mixture model-based clustering, this feature subset was represented by a single cluster, so it was ignored in this result.

Table 4-4. The overall experimental results on the Ecoli dataset. Feature subsets and numbers of clusters: (mcg, gvh), 2; (mcg, gvh, alm1), 3; (alm1, alm2), 2; (mcg, alm1, alm2), 2; (mcg, aac, alm1, alm2), 2. (cp: cytoplasm; im: inner membrane without signal sequence; pp: periplasm; imU: inner membrane, uncleavable signal sequence; om: outer membrane; omL: outer membrane lipoprotein; imL: inner membrane lipoprotein; imS: inner membrane, cleavable signal sequence.)

Table 4-5. The local cluster purities on the discovered subspaces for the Ecoli dataset (columns: cp, im, pp, imU, om, omL, imL, imS):
(mcg, gvh): 0.944, 0.831, 0.923, 0.743, 0.950, 0.800, 0.500, 1.000
(mcg, gvh, alm1): 0.986, 0.922, 0.865, 0.943, 1.000, 0.800, 1.000, 0.500
(alm1, alm2): 0.881, 0.779, 0.846, 0.829, 0.850, 1.000, 1.000, 1.000
(mcg, alm1, alm2): 0.874, 0.779, 0.885, 0.800, 0.950, 1.000, 0.500, 0.500
(mcg, omL, alm1, alm2): 0.867, 0.779, 0.904, 0.714, 0.850, 1.000, 0.500, 0.500

The obtained subspace clustering results summarized in Table 4-4 can be divided into two groups, and each of them can be interpreted as follows. For the first group (i.e., (mcg, gvh) and (mcg, gvh, alm1)), the feature alm1 contributes to constructing the new cluster structure with three mixture components and improving the class coverage rate. The three feature subsets in the second group show the same groups of the merged


Table 4-5 shows the local cluster purity for the clustering results discovered on the various feature subsets of the Ecoli dataset. Based on the local cluster purities, the clustering results obtained by fitting the Student's t-mixture model can be interpreted by grouping the classes so that the cluster index of the highest local cluster purity is the same.

For the wine dataset, Figure 4-10 illustrates the dendrograms constructed from the different feature subsets in which strong correlations exist. In this figure, each number i represents the feature vi. Based on these dendrograms, 30 distinct feature subspaces that are likely to contain informative clusters were created. Table 4-6 shows a summary of the subspace clustering results of the wine dataset. In the feature subsets column, the i-th feature vi is represented by the number i. Among the 30 constructed feature subspaces, two feature subspaces (i.e., (34) and (89)) were ignored because they were represented by a model-based cluster with a single mixture component. Based on the dendrograms shown in Figure 4-10, these subspace clustering results can be grouped by satisfying the downward closure property. This process supports the interpretation of the discovered subspace clustering results. As shown in Table 4-6, each group of subspace clustering results shows a similar trend with respect to the groups of the merged clusters as well as the class coverage rate. The first four groups in the feature subsets column illustrate that those feature subspaces can be related to extracting the Barolo class.
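One way to realize the dendrogram-based construction sketched above is to cluster the features themselves, using one minus the absolute pairwise correlation as the distance, and to treat every group of two or more features as a candidate subspace. The sketch below is our illustrative reading of that procedure; the linkage method, the cut threshold, and the function names are assumptions rather than the exact settings used in this work.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def candidate_subspaces(X, threshold=0.5, method='average'):
    """Group features by agglomerative clustering on a correlation distance."""
    corr = np.corrcoef(X, rowvar=False)      # feature-by-feature correlations
    dist = 1.0 - np.abs(corr)                # strong correlation -> small distance
    iu = np.triu_indices_from(dist, k=1)     # condensed form expected by linkage
    Z = linkage(dist[iu], method=method)     # dendrogram over the features
    labels = fcluster(Z, t=threshold, criterion='distance')
    groups = {}
    for feature, g in enumerate(labels):
        groups.setdefault(g, []).append(feature)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Cutting the dendrogram at several heights rather than at a single threshold yields nested candidate subspaces, which matches the grouping of the results by the downward closure property discussed above.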


Figure 4-10. Dendrograms for different feature subsets in the wine dataset.


Table 4-6. The identified subspace clustering results on the various feature subspaces of the wine dataset.
Feature subsets | Number of clusters | Groups of clusters by merging based on the highest cluster purity | Class coverage rate
(16713) | 2 | |
(146713) | 2 | |
(1456713) | 2 | |
(3413) | 2 | |
(34513) | 2 | |
(3451013) | 2 | |
(1513) | 2 | |
(158913) | 2 | |
(1358913) | 2 | |
(113) | 2 | |
(11013) | 2 | |
(141013) | 3 | |
(67912) | 2 | |
(6791213) | 2 | |
(6791112) | 2 | |
(67891112) | 2 | |
(678912) | 2 | |
(4678912) | 2 | |
(1112) | 2 | |
(110111213) | 2 | |
(267101112) | 2 | |
(67) | 2 | |
(6712) | 2 | |
(671112) | 2 | |
(2671112) | 3 | |
(678) | 2 | |
(167813) | 3 | |
(13467813) | 3 | |

Specifically, in the second group with three distinct feature subsets (i.e., (3413), (34513), and (3451013)), the Barolo class showed lower levels of Ash (v3), Alcalinity of ash (v4), and Proline (v13) than the Grignolino and Barbera classes. In contrast, the latter seven groups of feature subsets show that they can contribute to identifying the Barbera class. For example, according to the result of the sixth group of feature subsets (i.e., (6791112) and (67891112)), the Barbera class exhibited relatively higher levels than the other two classes with respect to the following five features: v6 (Total phenols), v7 (Flavanoids), v9 (Proanthocyanins), v11 (Hue), and v12 (OD280/OD315 of diluted wines). In addition, based on the class coverage rates, the proposed subspace clustering algorithm demonstrated that the clusters can be successfully identified by fitting the mixture model.
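Statements such as "the Barolo class showed lower levels of Ash, Alcalinity of ash, and Proline" can be checked by profiling the discovered clusters on the subspace features. A minimal sketch of such a profile, using standardized per-cluster means so that "relatively higher" and "lower" are comparable across features, is shown below; the variable and function names are hypothetical.

```python
import numpy as np

def cluster_profiles(X_sub, labels):
    """Standardized per-cluster feature means: how many standard deviations
    each feature lies above (positive) or below (negative) the overall mean."""
    mu, sigma = X_sub.mean(axis=0), X_sub.std(axis=0)
    Z = (X_sub - mu) / sigma
    return {k: Z[labels == k].mean(axis=0) for k in np.unique(labels)}

# For the second group of wine subspaces, profiling on v3, v4, and v13
# (hypothetical column indices 2, 3, 12) should show one cluster with
# negative entries on all three features, matching the Barolo reading above.
# profiles = cluster_profiles(X_wine[:, [2, 3, 12]], cluster_labels)
```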


In this experiment, four subspace clustering results consisting of three clusters were identified. Among them, two subspace clustering results (i.e., (141013) and (13467813)) showed class coverage rates of about 0.9326 and 0.9382. Compared with some previous feature selection algorithms [61, 87] (0.9339 and 0.9101), our approach showed similar or slightly improved clustering accuracies.

Table 4-7. The identified subspace clustering results on the various feature subspaces of the diabetes dataset.
Feature subsets | Number of clusters | Groups of clusters by merging based on the highest cluster purity | Overall cluster purity
(18) | 3 | |
(25) | 2 | |
(258) | 3 | |
(2568) | 4 | |
(346) | 3 | |
(3468) | 3 | |
(456) | 2 | |
(245) | 3 | |
(2456) | 4 | |
(24567) | 4 | |
(128) | 2 | |

Table 4-7 shows the experimental results with the diabetes dataset. The numbers in the feature subsets column represent the relevant features described above. Since there are only two classes in this dataset, we need to consider a different way to analyze these various clustering results. First, the clustering results for two feature subspaces


As shown in Table 4-7, the clustering results on the three feature subspaces (i.e., (25), (456), and (128)) were represented by two mixture components and showed similar clustering accuracies. On five different feature subspaces, the mixture model-based clustering results consist of three mixture components. Since the original dataset is composed of two classes, these clustering results may not be directly compared with the class labels. Instead, those clusters could reveal hidden relationships among the relevant features. For example, the feature subspace (18) was composed of the following three clusters: i) old and relatively few pregnancies, ii) young and relatively few pregnancies, and iii) old and many pregnancies. However, by merging two clusters (e.g., {C1, C2} in this result), these clustering results become similar to the mixture model-based clustering results with two components. In this case, in particular, the merged group (consisting of cluster i and cluster iii) and cluster ii both showed that the feature 'number of times pregnant' has a negative relationship with the feature 'age'. This implies that those two clusters contain similar patterns. For the other feature subspaces consisting of three or four mixture components, each cluster can be interpreted in a similar way.

The waveform dataset [17] was also used to evaluate the proposed approach. Using the dendrograms shown in Figure 4-11, 63 distinct feature subspaces for mixture model-based clustering were generated. During the construction of the feature subspaces, many redundant feature subsets were filtered. Among the clustering results based on those 63 feature subspaces, the 46 selected ones are shown in Table 4-8.


Figure 4-11. Dendrograms of different feature subsets for the waveform dataset.


Table 4-8. The identified subspace clustering results on the various feature subspaces of the waveform dataset (per-cluster counts under the columns Cluster-1 (1657), Cluster-2 (1646), and Cluster-3 (1696)).
Feature subsets | Number of clusters | Per-cluster counts | Class coverage rate
(19101116) | 2 | 891 | 0.7011
(1910111621) | 2 | 989 | 0.7275
(6781415) | 2 | 865 | 0.6895
(56781415) | 2 | 1062 | 0.7532
(5678131415) | 2 | 1209 | 0.7896
(45678131415) | 2 | 1195 | 0.7738
(4567812131415) | 2 | 1108 | 0.7449
(24567812131415) | 3 | 851, 1411, 1587 | 0.7700
(34567812131415) | 3 | 844, 1378, 1599 | 0.7644
(678141516) | 2 | 1012 | 0.7353
(5678141516) | 2 | 1145 | 0.7692
(567813141516) | 2 | 1169 | 0.7674
(4567813141516) | 2 | 1191 | 0.7740
(456781213141516) | 3 | 857, 1415, 1568 | 0.7682
(3456781213141516) | 3 | 1079, 1395, 1562 | 0.8074
(67814151617) | 2 | 1211 | 0.7902
(567814151617) | 2 | 1179 | 0.7702
(5678914151617) | 2 | 1208 | 0.7760
(567891314151617) | 3 | 844, 1379, 1598 | 0.7644
(4567891314151617) | 3 | 1108, 1394, 1529 | 0.8060
(56789131415161718) | 3 | 1082, 1394, 1562 | 0.8078
(678914151617) | 2 | 1212 | 0.7792
(67891415161718) | 2 | 1172 | 0.7642
(6789101415161718) | 3 | 844, 1379, 1598 | 0.7644
(678910141516171819) | 3 | 1080, 1401, 1555 | 0.8074
(89151617) | 2 | 965 | 0.7311
(8915161718) | 2 | 986 | 0.7265
(89101115161718) | 2 | 1242 | 0.7876
(8910111516171819) | 2 | 1163 | 0.7626
(910111718) | 2 | 963 | 0.7293
(459101112131718) | 2 | 1151 | 0.7586
(5671314) | 2 | 865 | 0.6895
(45671314) | 2 | 989 | 0.7275
(4567121314) | 2 | 1123 | 0.7606
(456711121314) | 2 | 1207 | 0.7778
(3456711121314) | 2 | 1160 | 0.7616
(78141516) | 2 | 865 | 0.6895
(7814151617) | 2 | 953 | 0.7197
(78914151617) | 2 | 1129 | 0.7626
(7891415161718) | 2 | 1179 | 0.7702
(789101415161718) | 2 | 1194 | 0.7718
(78910141516171819) | 3 | 844, 1379, 1598 | 0.7644
(678910141516171820) | 3 | 1110, 1391, 1529 | 0.8062
(4581013) | 2 | 928 | 0.7151
(458101321) | 2 | 989 | 0.7275
(1458101321) | 2 | 1211 | 0.7902

The remaining results were not included for one of the following reasons. First, the desired mixture model was composed of a single mixture component. Second, in several feature subsets, the relationships between features were either positive or negative correlations, making the cluster structure complex.


In practical application domains, most datasets consist of a number of features, some of which are strongly correlated with each other. This property provides several advantages. By using the strong pairwise correlations between features, the number of feature subspaces to be constructed can be drastically reduced, thereby lowering the computational cost; for the wine dataset, for example, only 30 candidate subspaces were evaluated rather than the 2^13 - 1 = 8,191 possible non-empty feature subsets. Additionally, the extracted feature subspaces can be expected to contain meaningful structures, improving the quality of the resulting clusters.

In this chapter, we presented a novel subspace clustering approach that exploits strong pairwise correlations among features. In this approach, the possible feature subspaces are constructed by grouping strongly correlated features.


Through the experimental results with synthetic datasets, we demonstrated that our approach is robust to noisy datasets and successfully finds the desired clusters, each of which can be represented by a finite mixture model on the extracted feature subspaces. Our approach was also applied to several real datasets obtained from the UCI Machine Learning Repository. Based on these extensive experimental results, we found many subspace clusters that were not discovered by previous feature selection approaches. In addition, we demonstrated that these identified subspace clusters contain meaningful information.


In clustering, the use of statistical models provides the following advantages. First, since the given data are assumed to follow a particular statistical model, the constructed models are supported by the related theoretical background. Second, the constructed models can not only be adapted to particular problems but can also be applied to similar types of problems by adjusting the relevant parameters.

These advantages carry over to feature selection techniques. Based on these characteristics, many feature selection algorithms for clustering have been proposed. However, they have a common shortcoming in that they select only features related to a particular cluster structure. This implies that they are limited in discovering various feature subsets, each of which contributes to identifying the different clusters it contains. For this reason, a new clustering technique, called subspace clustering, has been of interest in recent years.

Although many existing subspace clustering algorithms have shown that they can find various clusters lying in different feature subspaces, they have many drawbacks with respect to scalability and their dependence on user-specified thresholds. In this dissertation, we have developed novel methods to deal with the above issues. The first approach efficiently revealed multiple feature subsets for various clusters based on the mixture model. The second method generalized the first approach by allowing it to consider overlapping feature subspaces.

In many applications of cluster analysis, data can be composed of a number of feature subsets, each of which is represented by a number of diverse mixture model-based clusters. However, in most feature selection algorithms, this kind of cluster structure has rarely been of interest because these algorithms account only for the discovery of a single informative feature subset for clustering.


By identifying a number of meaningful non-overlapping feature blocks, each of which can be represented using a finite mixture model, the proposed feature-block-wise clustering approach demonstrated the ability to overcome problems experienced by many previous algorithms for feature selection in clustering. However, it has several drawbacks. First, it considered only non-overlapping feature subspaces during the clustering process. In practice, however, many datasets can consist of a feature vector that forms a number of feature subspaces, and some of these features can be associated with multiple feature subspaces. For this kind of dataset, it may not be possible to capture all of the meaningful cluster structures with disjoint feature blocks alone.
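To make the feature-block-wise idea concrete, the sketch below scores a candidate partition of the feature vector by fitting an independent mixture model to each block and summing a model selection criterion over the blocks (lower is better); the partition with the best total score is preferred. Gaussian mixtures and BIC are used here only as stand-ins for the mixture models and the criterion developed in the dissertation, and the function name, parameters, and example partitions are our own.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def partition_score(X, blocks, max_components=5, random_state=0):
    """Sum of the best per-block BIC values for a disjoint feature partition.
    `blocks` is a list of lists of column indices, e.g. [[0, 2], [1, 3, 4]]."""
    total = 0.0
    for block in blocks:
        Xb = X[:, block]
        best = np.inf
        for k in range(1, max_components + 1):
            gm = GaussianMixture(n_components=k, n_init=3,
                                 random_state=random_state).fit(Xb)
            best = min(best, gm.bic(Xb))
        total += best
    return total

# Comparing two candidate partitions of a six-feature dataset:
# partition_score(X, [[0, 1, 2], [3, 4, 5]])
# partition_score(X, [[0, 1], [2, 3, 4, 5]])
```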


For this problem, a novel method utilizing correlation coefficients has been presented for determining meaningful feature subspaces. To be specific, when a feature is given, the features showing a strong correlation with it can be regarded as relevant features. By collecting these relevant features for a given feature, a feature subset that will be used for constructing the feature subspaces is created. After obtaining these feature subsets, the feature subspaces are constructed by merging features (or feature subsets) with an agglomerative hierarchical clustering technique.

Many diverse mixture model-based clusters can be discovered based on these feature subspaces. To overcome the sensitivity of the Gaussian mixture model to outliers, we utilized the Student's t-distribution mixture model for clustering. Experimental results with both synthetic and real datasets showed that our approach can efficiently identify meaningful subspace clusters at an acceptable computational cost. In addition, the search space in our approach grows linearly as the number of dimensions increases. This approach can be generalized, making it useful in various application areas such as text data mining, gene expression data analysis, and pattern extraction for human activity in ubiquitous computing.

For future work, we will consider the following problem. Using the subspace clustering algorithms presented in this dissertation, we can effectively extract the desired subspace clusters from the user's perspective. However, real applications require dealing with categorical datasets as well as numerical datasets. Unfortunately, the presented algorithms are restricted to clustering numerical datasets. One reason is the difficulty in measuring the distance between data objects due to the absence of an inherent ordering or distance measure for categorical values.
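Since the robustness argument above rests on replacing Gaussian components with Student's t components, a brief sketch of how such a mixture can be fitted may be helpful. The code below runs EM for a multivariate t mixture with a fixed, shared degrees-of-freedom value; the latent scale weights u down-weight outlying samples, which is the source of the robustness. This is a simplified illustration under our own assumptions (fixed degrees of freedom shared across components, names chosen by us); the approach followed in this work, after [70, 85], also covers estimation of the degrees of freedom.

```python
import numpy as np
from scipy.special import gammaln

def fit_t_mixture(X, n_components, dof=4.0, n_iter=200, seed=0):
    """EM for a mixture of multivariate Student's t distributions (fixed dof)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, n_components, replace=False)].astype(float)
    cov = np.array([np.cov(X, rowvar=False) for _ in range(n_components)])
    weights = np.full(n_components, 1.0 / n_components)

    def log_t_pdf(X, m, S, nu):
        L = np.linalg.cholesky(S)
        diff = np.linalg.solve(L, (X - m).T)
        maha = np.sum(diff ** 2, axis=0)          # squared Mahalanobis distances
        logdet = 2.0 * np.sum(np.log(np.diag(L)))
        logp = (gammaln((nu + d) / 2) - gammaln(nu / 2)
                - 0.5 * (d * np.log(nu * np.pi) + logdet)
                - 0.5 * (nu + d) * np.log1p(maha / nu))
        return logp, maha

    for _ in range(n_iter):
        # E-step: responsibilities r and latent scale weights u per component.
        logp = np.empty((n, n_components))
        maha = np.empty((n, n_components))
        for j in range(n_components):
            logp[:, j], maha[:, j] = log_t_pdf(X, mu[j], cov[j], dof)
        logp += np.log(weights)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        u = (dof + d) / (dof + maha)
        # M-step: weighted updates of mixing weights, means, and covariances.
        for j in range(n_components):
            w = r[:, j] * u[:, j]
            mu[j] = w @ X / w.sum()
            diff = X - mu[j]
            cov[j] = (diff * w[:, None]).T @ diff / r[:, j].sum()
            cov[j] += 1e-6 * np.eye(d)            # keep covariances well-conditioned
        weights = r.mean(axis=0)
    return weights, mu, cov, r
```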




[1]
[2] Aggarwal, C. C. A Human-Computer Interactive Method for Projected Clustering. IEEE Transactions on Knowledge and Data Engineering, 16 (2004): 448.
[3] Aggarwal, C. C., Procopiuc, C. M., Wolf, J. L., Yu, P. S., and Park, J. S. Fast Algorithms for Projected Clustering. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'99), (1999): 61.
[4] Aggarwal, C. C. and Yu, P. S. Finding Generalized Projected Clusters in High Dimensional Spaces. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX (2000): 70.
[5] Aggarwal, C. C. and Yu, P. S. Redefining Clustering for High-Dimensional Applications. IEEE Transactions on Knowledge and Data Engineering, 14 (2002): 210.
[6] Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'98), (1998): 94.
[7] Akaike, H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19 (1974): 716.
[8] Anguilli, F., Cesario, E., and Pizzuti, C. Random walk biclustering for microarray data. Information Sciences: An International Journal, 178 (2008): 1479.
[9] Asuncion, A. and Newman, D. J. UCI Machine Learning Repository. Department of Information and Computer Science, University of California, Irvine, CA (2007).
[10] Ben-Hur, A., Horn, D., Siegelmann, H., and Vapnik, V. A support vector clustering method. In Proceedings of the Fifteenth International Conference on Pattern Recognition (2000): 724.
[11] Berkhin, P. A Survey of Clustering Data Mining Techniques. In Grouping Multidimensional Data: Recent Advances in Clustering (J. Kogan et al.), (2006).
[12] Bezdek, J. C. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, NY, (1981).
[13] Biernacki, C., Celeux, G., and Govaert, G. Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (2000). 7: 719.
[14] Bishop, C. M. Pattern Recognition and Machine Learning. Springer, (2006).
[15] Booth, J. G., Casella, G., and Hobert, J. P. Clustering using Objective Functions and Stochastic Search. Journal of the Royal Statistical Society, Ser. B, 70 (2008): 119.
[16] Bradley, P. S., Fayyad, U., and Reina, C. Scaling Clustering Algorithms to Large Databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD'98), (1998): 9.
[17] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Chapman & Hall/CRC, 1984.
[18] Bryan, K., Cunningham, P., and Bolshakova, N. Application of Simulated Annealing to the Biclustering of Gene Expression Data. IEEE Transactions on Information Technology in Biomedicine, 10 (2006): 519.
[19] Celeux, G. and Govaert, G. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis, 14 (1992): 315.
[20] Chang, J.-W. and Jin, D.-S. A new cell-based clustering method for large, high-dimensional data in data mining applications. In Proceedings of the ACM Symposium on Applied Computing, (2002).
[21] Chen, M.-S. and Chuang, K.-T. Clustering Categorical Data Using the Correlated-Force Ensemble. In Proceedings of the Fourth SIAM International Conference on Data Mining, (2004).
[22] Cheng, C. H., Fu, A. W., and Zhang, Y. Entropy-based Subspace Clustering for Mining Numerical Data. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (1999): 84.
[23] Cheng, Y. and Church, G. M. Biclustering of Expression Data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), (2000): 93.
[24] Cho, H., Dhillon, I. S., Guan, Y., and Sra, S. Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data. 2004.
[25] CIO-Online. http://www.cio.de/ (2008).
[26] Constantinopoulos, C., Titsias, M. K., and Likas, A. Bayesian Feature and Model Selection for Gaussian Mixture Model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28 (2006): 1013.
[27] Cord, A., Ambroise, C., and Cocquerez, J.-P. Feature selection in robust clustering based on Laplace mixture. Pattern Recognition Letters, 27 (2006): 627.
[28] Cover, T. M. and Thomas, J. A. Elements of Information Theory, 2nd Edition. Wiley Series in Telecommunications and Signal Processing (2006).
[29] Dave, R. N. and Krishnapuram, R. Robust Clustering Methods: A Unified View. IEEE Transactions on Fuzzy Systems, 5 (1997). 2: 270.
[30] Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum Likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Ser. B, 39 (1977): 1.
[31] Domeniconi, C., Gunopulos, D., Ma, S., Yan, B., Al-Razgan, M., and Papadopoulos, D. Locally adaptive metrics for clustering high dimensional data. Data Mining and Knowledge Discovery, 14 (2000): 63.
[32] Domeniconi, C., Papadopoulos, D., Gunopulos, D., and Ma, S. Subspace Clustering of High Dimensional Data. In Proceedings of the Fourth SIAM International Conference on Data Mining, (2004).
[33] Dy, J. and Brodley, C. E. Feature Subset Selection and Order Identification for Unsupervised Learning. In Proceedings of the Seventeenth International Conference on Machine Learning (2000).
[34] Dy, J. and Brodley, C. E. Feature Selection for Unsupervised Learning. Journal of Machine Learning Research, (2004). 5: 845.
[35] Dy, J., Brodley, C. E., Kak, A., Broderick, L. S., and Aisen, A. M. Unsupervised Feature Selection Applied to Content-Based Retrieval of Lung Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25 (2003): 373.
[36] Fraley, C. and Raftery, A. E. Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association, 97 (2002). 458: 611.
[37] GenBank. National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/Genbank/ (2008).
[38] George, T. and Merugu, S. A Scalable Collaborative Filtering Framework Based on Co-Clustering. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM 2005), (2005): 625.
[39] Goil, S., Nagesh, H., and Choudhary, A. MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets. Tech. Rep. CPDC-TR-9906-010, Northwestern University, Evanston, IL 60208, 1999.
[40] Govaert, G. and Nadif, M. Clustering with block mixture models. Pattern Recognition, 36 (2003). 2: 463.
[41] Govaert, G. and Nadif, M. An EM Algorithm for the Block Mixture Model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (2005). 4: 643.
[42] Guha, S., Rastogi, R., and Shim, K. CURE: An Efficient Clustering Algorithm for Large Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'98), (1998): 73.
[43] Guha, S., Rastogi, R., and Shim, K. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In Proceedings of the Fifteenth International Conference on Data Engineering (ICDE'99), (1999): 512.
[44] Gustafson, D. E. and Kessel, W. C. Fuzzy clustering with a fuzzy covariance matrix. In Proceedings of the 28th IEEE Conference on Decision and Control, (1979).
[45] Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining. The MIT Press, (2001).
[46] Hartigan, J. A. Direct Clustering of a Data Matrix. Journal of the American Statistical Association, 67 (1972). 337: 123.
[47] Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning (Data Mining, Inference, and Prediction). Springer (2001).
[48] Hinneburg, A. and Keim, D. A. Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB'99), (1999): 506.
[49] Honkela, A., Seppa, J., and Alhoniemi, E. Agglomerative independent variable group analysis. Neurocomputing, 71 (2008): 1311.
[50] Huang, Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2 (1998): 283.
[51] Ienco, D. and Meo, R. Exploration and Reduction of the Feature Space by Hierarchical Clustering. In Proceedings of the SIAM International Conference on Data Mining (SDM 2008), (2008): 577.
[52] Jain, A. and Zongker, D. Feature Selection: Evaluation, Application, and Small Sample Performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19 (1997). 2: 153.
[53] Jerrum, M. and Sinclair, A. The Markov Chain Monte Carlo Method: An approach to approximate counting and integration. In Approximation Algorithms for NP-hard Problems (ed. Dorit Hochbaum), (1996).
[54] Jing, L. P., Ng, M. K., and Huang, Z. X. An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Transactions on Knowledge and Data Engineering, 19 (2007). 8: 1026.
[55] Ke, Y., Cheng, J., and Ng, W. Mining quantitative correlated patterns using an information-theoretic approach. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2006).
[56] Kim, Y., Street, W. N., and Menczer, F. Feature selection in unsupervised learning via evolutionary search. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2000), (2000): 365.
[57] Kirkpatrick, S., Gelatt, C., and Vecchi, M. Optimization by simulated annealing. Science, 220 (1983). 4598: 671.
[58] Kluger, Y., Basri, R., Chang, J. R., and Gerstein, M. Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions. Genome Research, 13 (2003). 4: 703.
[59] Krishnapuram, R. and Keller, J. M. A Possibilistic Approach to Clustering. IEEE Transactions on Fuzzy Systems, 1 (1993): 98.
[60] Kullback, S. and Leibler, R. A. On Information and Sufficiency. The Annals of Mathematical Statistics, 22 (1951). 1: 79.
[61] Law, M. H., Figueiredo, M. A. T., and Jain, A. K. Simultaneous Feature Selection and Clustering Using Mixture Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26 (2004). 9: 1154.
[62] Law, M. H., Jain, A. K., and Figueiredo, M. A. T. Feature Selection in Mixture-Based Clustering. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2002), (2002). 15: 625.
[63] Li, Y., Dong, M., and Hua, J. Localized feature selection for clustering. Pattern Recognition Letters, (2008). 29: 10.
[64] Lin, S.-W., Chen, S.-C., Wu, W.-J., and Chen, C.-H. Parameter determination and feature selection for back-propagation network by particle swarm optimization. Knowledge and Information Systems, (2009). 21: 249.
[65] Liu, B., Xia, Y., and Yu, P. S. Clustering Through Decision Tree Construction. In Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA (2000): 20.
[66] Liu, H. and Motoda, H. Computational Methods for Feature Selection. Chapman & Hall/CRC (2007).
[67] Liu, T., Liu, S., Chen, Z., and Ma, W. An Evaluation on Feature Selection for Text Clustering. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), (2003): 488.
[68] Madeira, S. C. and Oliveira, A. L. Survey of clustering algorithms. IEEE Transactions on Computational Biology and Bioinformatics, 1 (2004). 1: 24.
[69] McLachlan, G. J., Bean, R. W., and Peel, D. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18 (2002). 3: 413.
[70] McLachlan, G. J. and Peel, D. Finite Mixture Models. John Wiley, NY, (2000).
[71] McLachlan, G. J., Peel, D., Basford, K. E., and Adams, P. The EMMIX software for the fitting of mixtures of normal and t-components. Journal of Statistical Software, 4 (1999).
[72] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. Equation of state calculations by fast computing machines. Journal of Chemical Physics, (1953). 21: 1087.
[73] Min, J., Yun, S., Kim, K., Kim, H., Hahn, J., and Lee, K. Nitrate contamination of alluvial groundwaters in the Nakdong River basin, Korea. Geosciences Journal, 6 (2002). 1: 35.
[74] Mitra, P., Murthy, C. A., and Pal, S. K. Unsupervised Feature Selection Using Feature Similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24 (2002): 301.
[75] Nadif, M. and Govaert, G. Block Clustering of Contingency Table and Mixture Model. In Proceedings of the 6th International Symposium on Data Analysis (IDA), (2005).
[76] Namkoong, Y., Heo, G., and Woo, Y. An Extension of Possibilistic Fuzzy C-Means with Regularization. In Proceedings of the 2010 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2010), (2010).
[77] Namkoong, Y., Joo, Y., and Dankel, D. Feature Subset-Wise Mixture Model-Based Clustering via Local Search Algorithm. Advances in Artificial Intelligence, 23rd Canadian Conference on Artificial Intelligence (Canadian AI 2010), Ottawa, Canada. Lecture Notes in Computer Science 6085, Springer, (2010): 135.
[78] Namkoong, Y., Joo, Y., and Dankel, D. Partitioning Features for Model-Based Clustering using Reversible Jump MCMC Technique. In Proceedings of the 23rd International FLAIRS Conference (FLAIRS 2010), (2010).
[79] Neal, R. Markov chain sampling methods for Dirichlet process mixture models. Technical Report 9815, Department of Statistics, University of Toronto (1998).
[80] Needleman, S. B. and Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48 (1970). 3: 443.
[81] Ng, E., Fu, A., and Wong, R. Projective clustering by histograms. IEEE Transactions on Knowledge and Data Engineering, 17 (2005). 3: 369.
[82] Pal, N. R., Pal, K., Keller, J. M., and Bezdek, J. C. A Possibilistic Fuzzy c-Means Clustering Algorithm. IEEE Transactions on Fuzzy Systems, 13 (2005). 4: 517.
[83] Parsons, L., Haque, E., and Liu, H. Subspace Clustering for High Dimensional Data: A Review. SIGKDD Explorations, 6 (2004): 90.
[84] Pedrycz, W. Knowledge-Based Clustering: From Data to Information Granules. Wiley-Interscience, (2005).
[85] Peel, D. and McLachlan, G. J. Robust mixture modelling using the t-distribution. Statistics and Computing, 10 (2000): 339.
[86] Procopiuc, C. M., Jones, M., Agarwal, P. K., and Murali, T. M. A Monte Carlo algorithm for fast projective clustering. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, WI (2002): 418.
[87] Raftery, A. E. and Dean, N. Variable Selection for Model-Based Clustering. Technical Report 452, Department of Statistics, University of Washington, Seattle, Washington (2004).
[88] Raftery, A. E. and Dean, N. Variable Selection for Model-Based Clustering. Journal of the American Statistical Association, 101 (2006). 473: 168.
[89] Rand, W. M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66 (1971): 846.
[90] Rawlings, J. O., Pantula, S. G., and Dickey, D. A. Applied Regression Analysis: A Research Tool. Springer, New York (1998).
[91] Robert, C. P. and Casella, G. Monte Carlo Statistical Methods, 2nd Edition. Springer, New York (2005).
[92] Roberts, S. J., Holmes, C., and Denison, D. Minimum-Entropy Data Partitioning Using Reversible Jump Markov Chain Monte Carlo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23 (2001). 8: 909.
[93] Rota, G.-C. The Number of Partitions of a Set. American Mathematical Monthly, 71 (1964). 5: 498.
[94] Roth, V. and Lange, T. Feature Selection in Clustering Problems. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2003), (2003).
[95] Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach, 2nd Edition. Prentice Hall, Saddle River, NJ (2002).
[96] Sahami, M. Using Machine Learning to Improve Information Access. Ph.D. thesis, Department of Computer Science, Stanford University, (1998).
[97] Schwarz, G. Estimating the dimension of a model. Annals of Statistics, 6 (1978). 2: 461.
[98] Shafiei, M. and Milios, E. Model-based Overlapping Co-clustering. In Proceedings of the Fourth Workshop on Text Mining, Sixth SIAM International Conference on Data Mining, (2006).
[99] Sheng, Q., Moreau, Y., and De Moor, B. Biclustering Microarray Data by Gibbs Sampling. Bioinformatics, 19 (2003): ii196-ii205.
[100] Sheskin, D. J. Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, 2003.
[101] Shoham, S. Robust clustering by deterministic agglomeration EM of mixtures of multivariate t-distributions. Pattern Recognition, 35 (2002): 1127.
[102] Stanley, R. P. Enumerative Combinatorics, Vol. I. New York: Cambridge University Press (1997).
[103] Talavera, L. Dependency-Based Feature Selection for Clustering Symbolic Data. Intelligent Data Analysis, (2000). 4: 19.
[104] Tanay, A., Sharan, R., and Shamir, R. Biclustering algorithms: A survey. Handbook of Computational Molecular Biology (ed. Srinivas Aluru), (2006).
[105] Theodoridis, S. and Koutroumbas, K. Pattern Recognition, 3rd Edition. Academic Press (2006).
[106] Ueda, N. and Nakano, R. Deterministic annealing EM algorithm. Neural Networks, (1998). 11: 271.
[107] Vaithyanathan, S. and Dom, B. Generalized Model Selection for Unsupervised Learning in High Dimensions. In Proceedings of the Advances in Neural Information Processing Systems (NIPS'99), (1999): 970.
[108] Wang, H., Wang, W., Yang, J., and Yu, P. S. Clustering by pattern similarity in large data sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2002), (2002): 394.
[109] Wolf, T., Brors, B., Hofmann, T., and Georgii, E. Global Biclustering of Microarray Data. In Workshops Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), (2006): 125.
[110] Xiong, H., Shekhar, S., Tan, P.-N., and Kumar, V. Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2004): 334.
[111] Xu, R. and Wunsch II, D. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16 (2005). 3: 645.
[112] Xu, R. and Wunsch II, D. Clustering (IEEE Press Series on Computational Intelligence). Wiley-IEEE Press (2009).
[113] Yang, J., Wang, H., Wang, W., and Yu, P. S. Enhanced Biclustering on Expression Data. In Proceedings of the 3rd IEEE International Symposium on BioInformatics and BioEngineering (BIBE 2003), (2003): 321.
[114] Yang, J., Wang, W., Wang, H., and Yu, P. S. δ-Clusters: Capturing Subspace Correlation in a Large Data Set. In Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE 2002), San Jose, CA (2002): 517.
[115] Yip, K., Cheung, D., and Ng, M. HARP: A practical projected clustering algorithm. IEEE Transactions on Knowledge and Data Engineering, 16 (2004). 11: 1387.
[116] Zhang, X., Pan, F., and Wang, W. CARE: Finding Local Linear Correlations in High Dimensional Data. In Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE 2008), Cancun, Mexico (2008): 130.


Younghwan Namkoong (Younghwan Namgoong) was born in Seoul, Republic of Korea. He received a Bachelor of Science and a Master of Science in computer science from Korea University, Seoul, Republic of Korea, in 1999 and 2001, respectively. He received a Master of Science in computer science from the University of Southern California in May 2003. He received a Ph.D. in computer engineering from the University of Florida in the summer of 2010. During his academic career, he received the National Scholarships for studying abroad from the Ministry of Information and Communication, Republic of Korea. His research interests include data mining, pattern recognition, and statistical learning. Relevant journal and conference papers have been published and are under review.