Citation
An Effective and Robust Method for Active Constrained Clustering

Material Information

Title:
An Effective and Robust Method for Active Constrained Clustering
Creator:
Kriminger, Evan G
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
2015
Language:
English
Physical Description:
1 online resource (145 p.)

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
PRINCIPE, JOSE C
Committee Co-Chair:
RANGARAJAN, ANAND
Committee Members:
SHEA, JOHN MARK
BANERJEE, ARUNAVA
Graduation Date:
12/18/2015

Subjects

Subjects / Keywords:
Active learning (jstor)
Datasets (jstor)
Distance functions (jstor)
Error rates (jstor)
International conferences (jstor)
Machine learning (jstor)
Oracles (jstor)
Outliers (jstor)
Sonar (jstor)
Voting (jstor)
Electrical and Computer Engineering -- Dissertations, Academic -- UF
active-learning -- clustering -- constrained-clustering -- semi-supervised-learning
Genre:
bibliography (marcgt)
theses (marcgt)
government publication (state, provincial, territorial, dependent) (marcgt)
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Electrical and Computer Engineering thesis, Ph.D.

Notes

Abstract:
Constrained clustering methods seek to partition a dataset in accordance with the distance properties of the data, while also adhering to a set of linkage constraints between pairs of samples. While distance information is present for each sample, pairwise constraints only exist for a small subset of samples. In addition to balancing constraints and inherent distance properties, propagating constraint information from constrained samples introduces large computational demands. The performance of constrained clustering methods also tends to be much less reliable than equivalent unconstrained methods. To overcome these difficulties, this study presents efficient and robust methods for all aspects of the constrained clustering problem. A novel constrained variant of hierarchical agglomerative clustering is presented, which manipulates the unsupervised structure of the data to satisfy constraints. By allowing regions around constrained samples to agglomerate naturally, the constraints dictate the resulting clustering at a local level, rather than through difficult-to-set parameters. The hierarchical structure is also exploited by an active constraint selection technique that chooses informative sample pairs for query, which helps to establish the goals of the user at multiple scales and provide redundancy for error detection. In real applications, constraint sets will be provided by humans, who are imperfect oracles. The detrimental effects of errors in the constraint set are well-known, but little has been done to handle this issue. We present a flexible framework for removing constraints which are possibly harmful to the resulting clustering. The method is based on identifying conflict within the constraint set. A case study on the nature of real human feedback is provided, allowing the development of realistic models for constraint errors. Finally, the proposed stack of methods for constrained clustering, active constraint selection, and error removal is tested on a large dataset with induced errors. The results are compared with supervised classification to determine the true utility of constrained clustering for practical applications with a comparable amount of human interaction. (en)
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2015.
Local:
Adviser: PRINCIPE, JOSE C.
Local:
Co-adviser: RANGARAJAN, ANAND.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2016-12-31
Statement of Responsibility:
by Evan G Kriminger.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Embargo Date:
12/31/2016
Classification:
LD1780 2015 (lcc)


Full Text

AN EFFECTIVE AND ROBUST METHOD FOR ACTIVE CONSTRAINED CLUSTERING

By

EVAN G. KRIMINGER

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2015

© 2015 Evan G. Kriminger

To Leroy Brown

ACKNOWLEDGMENTS

I thank my advisor Jose C. Príncipe for emphasizing creativity and forward thinking, and knowing exactly what I needed to grow as a researcher. My committee members, Arunava Banerjee, John Shea, and Anand Rangarajan, have great perspectives on the field of machine learning and their supervision is much appreciated. I thank Tory Cobb and the ONR Naval Surface Warfare Center for their support, guidance, and for providing a sounding board for all of the work in this dissertation. I thank Choudur Lakshminarayan at HP Labs for helping me start my research career with interesting problems and discussions. I thank Kau-Fui Wong for guidance at the University of Miami, and the many great teachers I have had over the years.

I thank the CNEL old-timers who established both the research and barbecue cultures of CNEL, including Sohan, Memming, Stefan, Rakesh, Luis, Bot, Alex (my lab mentor), and Erion (my ONR project benefactor). I thank Austin, a true role model and an impassioned succulent keeper. A hearty "sup nerd" to Erica. Thanks to the many CNEL labmates and visitors who were there for the fun times. They are too many to name, so I offer them an anonymous, but heartfelt thanks. I thank Abe for getting me excited about academics long ago.

I thank the people who made Gainesville life great over these past 6 years. This includes Carlos and Ashley and their southern hospitality, Eder the hip brozilian, Catia the fierce competitor, Hammam who has shared the Ph.D. experience since our recruiting days, Mikey Texas, Kan, Jihye, Hong, and my noble ONR shipmate Matt. Special thanks to the brotherhood of night lab with Ryan and Goktug. Thanks to my nonacademic lifelong friends: Ben, Jerry, Aytana and her crew, Aldrick, Sean, Karly, Kenton, Lisa and the kitties, Richard, Diana, Lorena, Noah, Rachel, Chris, Drew, and the Yerkes family. Thanks to my grandmother, who kept me fed and suited. Thanks to my sister whose quick wit and sharp dress are second to none. Thanks to my loving parents, who have always offered unconditional support.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Clustering
    1.1.1 Representations
    1.1.2 Vector Quantization
    1.1.3 Mixture Models
    1.1.4 Spectral Clustering
    1.1.5 Hierarchical Clustering
    1.1.6 Other Approaches
  1.2 Constrained Clustering
    1.2.1 Constraining K-Means
    1.2.2 Representation Modification
    1.2.3 Constraint Propagation
    1.2.4 Other Methods
    1.2.5 The State of Constrained Clustering
  1.3 Active Learning
    1.3.1 Active Learning for Classification
    1.3.2 Active Constraint Selection
  1.4 Outline and Contributions

2 ACTIVE LEARNING FOR CLASS DISCOVERY
  2.1 Model Trees Active Learning
  2.2 Open Set Classification
    2.2.1 Automatic Target Recognition
      2.2.1.1 Supervised methods
      2.2.1.2 Unsupervised methods
    2.2.2 Proposed System Description
      2.2.2.1 Classification with thresholds
      2.2.2.2 New object detection
  2.3 Experiments
  2.4 Summary

3 ACTIVE CONSTRAINED CLUSTERING WITH THE DENDROGRAM
  3.1 Propagation-Free Constrained Hierarchical Clustering
    3.1.1 Manipulating the Dendrogram
    3.1.2 Efficient Computation
    3.1.3 Constraint Removal
  3.2 Hierarchical Query Selection
    3.2.1 Informative Merges
    3.2.2 Query Sample Selection
    3.2.3 Adding Redundancy
  3.3 Experiments
  3.4 Summary

4 CONSTRAINED CLUSTERING UNDER UNCERTAINTY
  4.1 Constraint Errors
  4.2 Removal of Harmful Constraints
  4.3 Vote Based Constraint Removal
  4.4 Experiments
    4.4.1 Case Study
    4.4.2 UCI Benchmarks
  4.5 Summary

5 PRACTICALITY OF CONSTRAINED CLUSTERING
  5.1 Pairwise Constraints Versus Labels
  5.2 Comparison with Classification
  5.3 Discussion

6 CONCLUSION
  6.1 Summary
  6.2 Future Work

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Model trees classification comparison
2-2 New object detection abilities of model trees active learning
3-1 The UCI datasets used in this study
3-2 UCI results for active constrained clustering
3-3 Wall-clock computation time on MNIST dataset
4-1 Conditional response distributions
4-2 Performance on flower dataset
4-3 The UCI datasets used in the error removal experiment
4-4 Error removal statistics for UCI

LIST OF FIGURES

1-1 Representations of data
1-2 Unsupervised hierarchical clustering
1-3 Instance-level constraints
1-4 Trivial solution of constrained k-means clustering
1-5 Informative samples for classification
1-6 Informativeness and coherence of a constraint set
2-1 Branching process of model trees active learning
2-2 Active learning for sonar image processing
2-3 New object detection with learned parameters
2-4 Example sonar images
2-5 Failure under self-learning
2-6 Performance of model trees active learning
2-7 Parameter sensitivity of model trees
3-1 Must-link constraints in multi-modal data
3-2 Manipulating the dendrogram
3-3 Class overlap causes constraint noise
3-4 Population based merge criterion
3-5 Selecting samples to query
3-6 Examples of constraint structure
3-7 Performance as a function of the number of queries
3-8 Performance on the MNIST dataset
3-9 Visualization of MNIST clustering
4-1 Illustration of relevance and votes
4-2 Histogram of responses
4-3 UCI results for VBCR versus error rate

5-1 Recreating labels with pairwise queries
5-2 PFCHC with labels
5-3 Classification versus constrained clustering
5-4 Classification versus constrained clustering with error

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

AN EFFECTIVE AND ROBUST METHOD FOR ACTIVE CONSTRAINED CLUSTERING

By

Evan G. Kriminger

December 2015

Chair: Jose C. Príncipe
Major: Electrical and Computer Engineering

Constrained clustering methods seek to partition a dataset in accordance with the distance properties of the data, while also adhering to a set of linkage constraints between pairs of samples. While distance information is present for each sample, pairwise constraints only exist for a small subset of samples. In addition to balancing constraints and inherent distance properties, propagating constraint information from constrained samples introduces large computational demands. The performance of constrained clustering methods also tends to be much less reliable than equivalent unconstrained methods. To overcome these difficulties, this study presents efficient and robust methods for all aspects of the constrained clustering problem.

A novel constrained variant of hierarchical agglomerative clustering is presented, which manipulates the unsupervised structure of the data to satisfy constraints. By allowing regions around constrained samples to agglomerate naturally, the constraints dictate the resulting clustering at a local level, rather than through difficult-to-set parameters. The hierarchical structure is also exploited by an active constraint selection technique that chooses informative sample pairs for query, which helps to establish the goals of the user at multiple scales and provide redundancy for error detection.

In real applications, constraint sets will be provided by humans, who are imperfect oracles.

The detrimental effects of errors in the constraint set are well-known, but little has been done to handle this issue. We present a flexible framework for removing constraints which are possibly harmful to the resulting clustering. The method is based on identifying conflict within the constraint set. A case study on the nature of real human feedback is provided, allowing the development of realistic models for constraint errors. Finally, the proposed stack of methods for constrained clustering, active constraint selection, and error removal is tested on a large dataset with induced errors. The results are compared with supervised classification to determine the true utility of constrained clustering for practical applications with a comparable amount of human interaction.

CHAPTER 1
INTRODUCTION

Data clustering is an unsupervised method for extracting the structure of datasets. Clustering approaches seek to partition the dataset such that data samples within a cluster are similar and are distinct from samples in other clusters. Applying clustering as a preprocessing step helps to organize samples to facilitate analysis. The resulting structure that is obtained may influence the discovery of new information about the data. For instance, a single class may be composed of multiple clusters, which indicate different subclasses.

Clustering has been successfully applied in many fields with diverse data types. Document clustering seeks to group documents based on the topics that they represent Steinbach et al. (2000); Willett (1988); Xu et al. (2003). Such organization allows groups of similar documents to be retrieved easily based on search terms. Clustering can be useful in the analysis of social networks Scott (2012); Wasserman & Faust (1994). With information about the relationships between users of the network, communities can be detected. Collaborative filtering and recommender systems Breese et al. (1998); Linden et al. (2003); Sarwar et al. (2001) take advantage of clustering to find groups of users or products that are similar. Users may be clustered based on their taste in similar products, which enables the system to provide recommendations. Image segmentation may be considered as a clustering problem Felzenszwalb & Huttenlocher (2004); Fu & Mui (1981); Shi & Malik (2000), as segments consist of groups of pixels similar in color and location. In image retrieval Datta et al. (2008); Jain & Vailaya (1996), large databases of images are clustered, allowing images of desired object types to be retrieved from the database as desired.

[Footnote: Portions of this chapter are published in the following manuscript: Kriminger, E., Cobb, J. T., and Príncipe, J. C. (2015). Online active learning for automatic target recognition. Oceanic Engineering, IEEE Journal of, 40(3), 583-591.]

In the biomedical domain Alon et al. (1999); Ben-Dor et al. (1999); Jiang et al. (2004), gene expression levels are clustered to better understand functional groups of genes, and to classify tissues based on similar expression patterns.

While clustering extracts structure from the data, this does not necessarily correspond to the goals of the user. In the presence of noise, the clustering problem becomes quite ambiguous. A small amount of human feedback can help guide the clustering process to improve performance. For instance, in image segmentation feedback can determine the granularity of the resulting clustering. Perhaps the user specifies a few pairs of pixels that should be in the same segment. When this side information is incorporated alongside the geometry of the dataset, the process is known as constrained or semi-supervised clustering. It is a challenge to achieve a balance between the side information and the distance information.

Side information only improves the resulting clustering if it is present for meaningful samples or pairs of samples. In active learning, the samples for which feedback is acquired are selected by the algorithm itself to gain as much information as possible. Active learning techniques consider the nature of the dataset and the mechanics of the clustering or classification algorithms in determining which samples would provide the most benefit if annotated by a human oracle.

The work in this dissertation considers the entire process of active constrained clustering, consisting of constraint selection with active learning and constrained clustering of the data based on those constraints. Practical issues are of chief concern. As the feedback from human oracles inevitably contains errors, methods for mitigating these errors while preserving information are presented. To develop these methods, we present a summary of the literature on clustering, constrained clustering, and active learning in this chapter.

1.1 Clustering

Classification is a well-defined problem, in that the set of training labels provides a metric for performance. The clustering problem, however, lacks such an objective basis for performance, and simply seeks a natural partition of the dataset. Clustering methods must rely on principles derived from intuition of what constitutes a cluster. At a fundamental level, this intuition attempts to group the data such that samples within a cluster are similar, and samples in different clusters are dissimilar. Therefore, measures of similarity or dissimilarity must be defined in order to understand clustering methods. The representation of the similarity structure of a dataset is only one way of categorizing the vast taxonomy of clustering methods. There are many classes of clustering algorithms. In this section we overview the many approaches to purely unsupervised data clustering.

1.1.1 Representations

Consider a dataset X consisting of N samples. In the feature vector representation of the dataset, each sample x_i ∈ X is associated with a series of values. These values may be real numbers, integers, or categorical values, but for the sake of simplicity we discuss feature vectors in a d-dimensional Euclidean space, where x_i ∈ R^d. In the feature space, distances between samples may be calculated, such as the Euclidean distance,

    d_Euc(x_i, x_j) = √((x_i − x_j)^T (x_i − x_j)),

as well as many more exotic distances Pekalska & Duin (2005). A feature space and a distance measure on that space allow for intuitive notions of clusters to be quantified even in high dimensional spaces, such as distances between points and a cluster center.

In the similarity representation of a dataset, rather than a feature vector for each sample x_i ∈ X, a similarity measure, s_ij, exists for each pair of samples x_i, x_j ∈ X. A dissimilarity or distance measure can easily be converted into a similarity. For instance, distances are commonly transformed via s_ij = exp(−γ d(x_i, x_j)²), where γ ∈ [0, ∞). However, it must be noted that this imposes some distortion on the data and depends heavily on the kernel bandwidth.

Figure 1-1. Representations of data. A) The 2-dimensional feature space of an example dataset. B) The similarity matrix of the dataset created by transforming the distances with a Gaussian kernel.

In this case s_ij ∈ (0, 1], which is a common assumption on the values of similarities. An example of feature versus similarity representation is seen in Figure 1-1. It may also be the case that feature vectors are not available and the similarities are measured directly. This might occur in situations where humans provide pairwise comparisons about data objects that are too complicated to be represented by a vector. In the special case that s_ij ∈ {0, 1}, the similarity matrix actually forms a connectivity matrix for a graph. Each form of data representation elicits its own challenges and advantages for the clustering methods that focus on them.

In addition to the information present in the feature space or similarities of the data, the user may have knowledge about the desired number of clusters, k. While most clustering methods are formulated under the assumption of such knowledge, there are also methods that automatically determine the number of clusters. This may be accomplished during the clustering process, with parameters that control the scale on which clusters are identified. Specific measures and algorithms exist explicitly for detecting the number of clusters.
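As a concrete companion to the Gaussian-kernel transformation above, the following minimal Python/NumPy sketch (an illustration added here, not code from the dissertation) converts a feature-vector dataset into a similarity matrix of the kind shown in Figure 1-1B; the argument gamma plays the role of γ:

    import numpy as np

    def gaussian_similarity(X, gamma=1.0):
        """Build s_ij = exp(-gamma * d(x_i, x_j)^2) for all pairs of rows of X."""
        sq_norms = np.sum(X ** 2, axis=1)
        # Squared Euclidean distances: ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
        np.maximum(sq_dists, 0.0, out=sq_dists)  # guard against tiny negatives
        return np.exp(-gamma * sq_dists)

All entries of the result lie in (0, 1], matching the assumption on similarity values noted above.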

1.1.2 Vector Quantization

The vector quantization problem seeks a set of prototype vectors for the dataset. Each cluster consists of the group of samples associated with each prototype. One of the earliest clustering algorithms is k-means Lloyd (1982); MacQueen (1967); Steinhaus (1957), which seeks k prototypes or cluster centers μ ∈ R^d (the means) that minimize the within-cluster sum of squares objective. In words, this objective is satisfied by the partition which minimizes the sum of the squared distances between the samples of each cluster and their cluster center. This optimization problem is

    argmin_{{X_i}} Σ_{i=1}^{k} Σ_{x∈X_i} ||x − μ_i||²,

where {X_i} for i = 1, ..., k is a partition of the dataset, with X_i containing the samples in the i-th cluster, and ||·|| is the Euclidean norm.

The k-means objective is solved with an iterative procedure, which alternates between an assignment step and an update step. After initializing the centers, the assignment step finds the nearest center to every sample, thus assigning it to a cluster. The update step computes the new cluster centers by averaging the samples in each cluster. The algorithm stops when the samples no longer change clusters during the assignment step. While convergence to a global optimum is not guaranteed, a local optimum is reached. Performance is dependent on the initialization of the centers, for which many procedures exist, though it is common to simply select them randomly.
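The alternating procedure just described is short enough to state in full. The following NumPy sketch of Lloyd's algorithm is illustrative only (random initialization and the iteration cap are assumptions of this sketch, not prescriptions from the text):

    import numpy as np

    def kmeans(X, k, n_iter=100, rng=None):
        """Lloyd's algorithm: alternate the assignment and update steps."""
        rng = np.random.default_rng(rng)
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = np.full(len(X), -1)
        for _ in range(n_iter):
            # Assignment step: each sample joins its nearest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break  # no sample changed cluster: a local optimum
            labels = new_labels
            # Update step: each center becomes the mean of its assigned samples.
            for i in range(k):
                if np.any(labels == i):
                    centers[i] = X[labels == i].mean(axis=0)
        return labels, centers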

There are many related algorithms to k-means. K-medoids Kaufman & Rousseeuw (2009), k-medians Jain et al. (1999), and k-modes Chaturvedi et al. (2001) function similarly, but with different statistics than the mean for determining cluster centers. Weighted kernel k-means Dhillon et al. (2004a) is the k-means algorithm performed on the data projected by the function φ. Thus, the objective becomes

    argmin_{{X_i}} Σ_{i=1}^{k} Σ_{x∈X_i} w(x) ||φ(x) − μ_i||²,

where w(x) is a positive weight associated with sample x that controls each sample's importance in the average. In this formulation, the centers are the average of the mapped samples

    μ_i = Σ_{x∈X_i} w(x) φ(x) / Σ_{x∈X_i} w(x).

Kernel k-means is so-named because inner products in the reproducing kernel Hilbert space (RKHS) of the mapped data φ(x) ∈ H can be computed with evaluations of a symmetric, positive definite kernel function κ: X × X → R Aronszajn (1950). Therefore, rather than evaluating the objective above directly, the distance between mapped data and centers can be computed using ⟨φ(x_i), φ(x_j)⟩ = κ(x_i, x_j). Each κ induces a different mapping φ. The Gaussian kernel κ(x_i, x_j) = exp(−γ ||x_i − x_j||²) is popular and induces an infinite dimensional and nonlinear φ, which can handle non-spherical clusters better than the standard k-means.

Rather than the hard cluster assignments of k-means, fuzzy methods allow samples to exist in multiple clusters, which is useful for samples along cluster boundaries. Fuzzy c-means Bezdek (2013) is similar to k-means, but the objective sums over all data points for each cluster, rather than just those points belonging to the cluster. The fuzzy c-means objective is

    argmin_{{μ_i}} Σ_{i=1}^{k} Σ_{j=1}^{N} w_ij^m ||x_j − μ_i||²,

where w_ij > 0 is a membership function capturing the soft assignment of sample x_j to cluster i, and m ≥ 1 establishes the level of fuzziness.

In fuzzy c-means, the assignment step is

    w_ij = 1 / Σ_{n=1}^{k} ( ||x_j − μ_i|| / ||x_j − μ_n|| )^{2/(m−1)}

and the update step is a weighted average, very similar to the center equation above, where a cluster center is computed as the mean of all samples, weighted by their membership value for that cluster:

    μ_i = Σ_{j=1}^{N} w_ij^m x_j / Σ_{j=1}^{N} w_ij^m.

Mode-seeking algorithms find the modes of density functions Cheng (1995a); Fukunaga & Hostetler (1975). In terms of clustering, this is a vector quantization approach, as the modes serve as cluster centers. The mean shift algorithm is derived from the kernel density estimator Parzen (1962); Silverman (1986) of a density function f,

    f̂(x) = (1 / (N σ^d)) Σ_{i=1}^{N} κ_σ(x, x_i),

where σ is a bandwidth parameter of the kernel. Computing the gradient of this estimator and setting it to zero provides a condition for local modes of the density. After some arrangement, this condition is

    x = Σ_{i=1}^{N} κ_σ(x, x_i) x_i / Σ_{i=1}^{N} κ_σ(x, x_i).

Finding fixed points is accomplished by an iterative procedure in which the mean shift

    m(x) = Σ_{i=1}^{N} κ_σ(x, x_i) x_i / Σ_{i=1}^{N} κ_σ(x, x_i) − x

is computed and then the update x ← x + m(x) is applied for each sample x. All samples that converge to the same point belong to the same cluster. Since mean shift finds the local density maxima, it does not require the number of clusters to be specified. Thus, it is called a nonparametric method, although there are parameters, such as the kernel bandwidth, which dictates the number of clusters that are discovered.
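The fixed-point iteration for mean shift can be sketched directly from the update above. This NumPy fragment is illustrative (Gaussian kernel; the bandwidth sigma and stopping tolerance are assumed values):

    import numpy as np

    def mean_shift(X, sigma=1.0, n_iter=100, tol=1e-5):
        """Shift each point to the kernel-weighted mean of the data until it
        stops moving; the converged points are the local density modes."""
        modes = X.astype(float).copy()
        for _ in range(n_iter):
            shifted = np.empty_like(modes)
            for i, x in enumerate(modes):
                w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * sigma ** 2))
                shifted[i] = (w @ X) / w.sum()  # x + m(x) is this weighted mean
            moved = np.max(np.abs(shifted - modes))
            modes = shifted
            if moved < tol:
                break
        return modes  # samples whose modes coincide form one cluster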

Self-organizing maps (SOM) Kohonen (1982) are a tool for visualizing high dimensional datasets, but can also be considered as a form of vector quantization clustering. The user of an SOM specifies a fixed lattice of discrete points in a visualizable space (R, R², or R³). Each node, or neuron, v is associated with a weight vector w(v) ∈ R^d in the same space as the data. The SOM undergoes a competitive learning process, where each sample is compared with the weights of the neurons, and the neuron with the closest weight to the sample is the best matching unit (BMU). The weights of the BMU and of the neurons in the local neighborhood of the BMU neuron are adapted towards the sample. For a sample x, a neuron v, and BMU v*, this update is

    w(v) ← w(v) + θ(v, v*) α(t) (x − w(v)),

where θ(v, v*) is a neighborhood function describing how close v is to the BMU, and α(t) is a time-dependent weighting coefficient on the update. Through updating the weights of each neuron, the SOM recreates the structure of the high-dimensional data on the lattice of neurons in the low-dimensional space. For clustering purposes, samples are associated with their BMUs.

While the aforementioned methods operate on feature vectors, affinity propagation Frey & Dueck (2007) is an algorithm in which exemplar samples are chosen through a message passing procedure based on similarity values between samples. Affinity propagation relies on two types of messages. The responsibility r(i, k) is sent from sample i to sample k, quantifying how well k serves as an exemplar for i. The availability a(i, k) is sent from k back to i, reflecting the suitability of k as an exemplar, as judged by other samples. After initializing availabilities to zero, and responsibilities to the similarities s_ik, the quantities are updated by iterating over

    r(i, k) ← s_ik − max_{k' s.t. k'≠k} { a(i, k') + s_ik' }

    a(i, k) ← min { 0, r(k, k) + Σ_{i' s.t. i'∉{i,k}} max{0, r(i', k)} }

    a(k, k) ← Σ_{i' s.t. i'≠k} max{0, r(i', k)}.

Affinity propagation is a nonparametric method; the preference values s_kk placed on the diagonal of the similarity matrix control the number of exemplars that are found, while a damping factor stabilizes the message updates.
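The three message updates translate almost line-for-line into code. The vectorized NumPy sketch below is illustrative (the damping value and iteration count are assumptions; preferences sit on the diagonal of S):

    import numpy as np

    def affinity_propagation(S, n_iter=200, damping=0.5):
        """Message passing on an N x N similarity matrix S; returns the
        exemplar index selected for each sample."""
        N = S.shape[0]
        R = np.zeros((N, N))  # responsibilities r(i, k)
        A = np.zeros((N, N))  # availabilities a(i, k)
        idx = np.arange(N)
        for _ in range(n_iter):
            # r(i,k) <- s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
            M = A + S
            best = M.argmax(axis=1)
            first = M[idx, best]
            M[idx, best] = -np.inf
            second = M.max(axis=1)
            R_new = S - first[:, None]
            R_new[idx, best] = S[idx, best] - second
            R = damping * R + (1 - damping) * R_new
            # a(i,k) <- min(0, r(k,k) + sum of positive r(i',k)), i' not in {i,k}
            Rp = np.maximum(R, 0)
            np.fill_diagonal(Rp, R.diagonal())     # keep r(k,k) unclipped
            A_new = Rp.sum(axis=0)[None, :] - Rp
            diag = A_new.diagonal().copy()         # this column sum is a(k,k)
            A_new = np.minimum(A_new, 0)
            np.fill_diagonal(A_new, diag)
            A = damping * A + (1 - damping) * A_new
        return (A + R).argmax(axis=1)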

1.1.3 Mixture Models

Mixture models are a class of probabilistic models, which consist of k components Figueiredo & Jain (2002); Lindsay (1995); McLachlan & Peel (2004). The likelihood of a sample, x, in a mixture model is

    p(x | Θ) = Σ_{j=1}^{k} π_j p(x | θ_j),

where π_j is the prior probability of the j-th mixture component, θ_j is the parameter vector of the j-th mixture component, and Θ = {π_1, ..., π_k, θ_1, ..., θ_k}. These components may be thought of as clusters; thus learning the parameters of a mixture model is one approach to clustering.

Let X = {x_i}_{i=1}^{N} be the joint random variable over the entire dataset and Z = {z_i}_{i=1}^{N} be the joint variable of the labels that specify to which component each sample belongs. Z is a hidden or latent variable. The maximum likelihood estimator of the parameters, Θ, seeks to maximize the marginal likelihood of the data, p(X | Θ). This is a marginal likelihood because it arises from summing the joint likelihood of X and Z over the latent variables in Z,

    p(X | Θ) = Σ_Z p(X, Z | Θ).

This is intractable, because we are summing over all possible Z, which contains all possible sequences of component labels. However, the Expectation Maximization (EM) Dempster et al. (1977) algorithm offers an iterative method to compute the maximum likelihood estimator. In the "E" step, the marginal log likelihood is computed by taking the expectation over the latent variables, given the data and the current estimate of the parameters Θ^(t),

    Q(Θ | Θ^(t)) = E_{Z | X, Θ^(t)} [ log p(X, Z | Θ^(t)) ].

This is followed by the "M" step, during which updated parameters, Θ^(t+1), are computed by maximizing the previously computed quantity,

    Θ^(t+1) = argmax_Θ Q(Θ | Θ^(t)).

The EM procedure is repeated until convergence. When the components of the mixture model are multivariate normal distributions, analytic solutions of the E and M update equations are available, making the EM algorithm an efficient choice for learning Gaussian mixture models (GMMs). The k-means algorithm can be considered a special case of the EM algorithm.

The mixture model framework is very flexible, and models can be designed for even more complicated scenarios. Mixture models are commonly used in the modeling of words and topics in documents. For example, in latent Dirichlet allocation Blei et al. (2003), documents contain a mixture of latent topics, which explain the words contained within that document. Maximum likelihood estimation with the EM algorithm is not the only way to solve a mixture model. Markov chain Monte Carlo (MCMC) and other sampling methods Gilks (2005) are commonly used in inference, as well as the variational Bayes approach Attias (1999).
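When the components are multivariate normal, the E and M steps have the closed forms alluded to above. The following NumPy/SciPy sketch of EM for a GMM is illustrative only (the small ridge added to the covariances and the fixed iteration count are assumptions of this sketch):

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_em(X, k, n_iter=100, rng=None):
        """EM for a Gaussian mixture: responsibilities in the E-step,
        weight/mean/covariance re-estimation in the M-step."""
        rng = np.random.default_rng(rng)
        N, d = X.shape
        pi = np.full(k, 1.0 / k)
        mu = X[rng.choice(N, size=k, replace=False)].astype(float)
        cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
        for _ in range(n_iter):
            # E-step: gamma_ij proportional to pi_j * N(x_i; mu_j, cov_j)
            resp = np.column_stack([
                pi[j] * multivariate_normal.pdf(X, mu[j], cov[j]) for j in range(k)
            ])
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: maximize Q component by component
            Nj = resp.sum(axis=0)
            pi = Nj / N
            mu = (resp.T @ X) / Nj[:, None]
            for j in range(k):
                D = X - mu[j]
                cov[j] = (resp[:, j, None] * D).T @ D / Nj[j] + 1e-6 * np.eye(d)
        return pi, mu, cov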

While finite mixture models require the specification of k, they can be extended to include an unknown number of mixture components. This is achieved using the Dirichlet process as a prior on the parameters of the mixture components, which results in the Dirichlet process mixture model (DPMM) Antoniak (1974); Ferguson (1973); MacEachern & Muller (1998). The DPMM can be specified as follows:

    x_i | θ_i ∼ F(θ_i)
    θ_i | G ∼ G
    G ∼ DP(G_0, α).

In this model, x_i is a data sample and F(θ_i) is its likelihood with the parameter vector θ_i. G is a distribution over the parameter vector θ_i; for instance, if a θ_i represents the covariance of a normal distribution, then G might be a Wishart distribution. G itself arises from the Dirichlet process, DP(G_0, α), where G_0 is called the base distribution and α is the concentration parameter. The interesting properties of the Dirichlet process can be observed by integrating out the G, to reveal the prior on θ_i given its past values,

    θ_i | θ_1, ..., θ_{i−1} ∼ (1 / (i − 1 + α)) Σ_{j=1}^{i−1} δ(θ_j) + (α / (i − 1 + α)) G_0,

where δ(θ) is the distribution with probability 1 at θ and 0 everywhere else. This equation reveals that the Dirichlet process yields distributions over θ_i such that θ_i takes on one of the previously drawn values with a probability of (i − 1)/(i − 1 + α), or a new value, drawn from G_0, with a probability of α/(i − 1 + α). Therefore increasing α increases the prior on the number of unique θ_i, and thus the number of distinct clusters. There are many methods for inference in the DPMM, including MCMC Neal (2000) and variational methods Blei & Jordan (2006). The simplest is the use of Gibbs sampling when conjugate priors are used (typically with multivariate normal or multinomial likelihoods).
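The predictive rule above is often called the Chinese restaurant process, and a few lines of Python (illustrative, sampling only the prior and ignoring the likelihood F) make the role of α concrete:

    import numpy as np

    def crp_assignments(N, alpha, rng=None):
        """Sample cluster labels from the DP prior: sample i joins an existing
        cluster c with probability n_c / (i + alpha), or opens a new cluster
        (a fresh draw "from G_0") with probability alpha / (i + alpha)."""
        rng = np.random.default_rng(rng)
        labels, counts = [0], [1]
        for i in range(1, N):
            probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
            c = rng.choice(len(probs), p=probs)
            if c == len(counts):
                counts.append(1)
            else:
                counts[c] += 1
            labels.append(c)
        return np.array(labels)

Running this with increasing alpha produces partitions with more distinct clusters, matching the role of the concentration parameter described above.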

1.1.4 Spectral Clustering

Spectral clustering has its roots in the graph cut problem Boykov et al. (2001); Xing et al. (2003). An undirected graph G = (V, E) consists of the set of vertices V and the set of edges E, with each edge carrying a non-negative weight. A dataset can be represented by a graph if the samples x ∈ X are considered the vertices, and the weight for the edge between vertices v_i and v_j is the similarity s_ij. Instead of using the similarity itself for the edge weights, a threshold can be put on the similarities such that a value of 1 is assigned if the two samples are sufficiently close, and 0 otherwise. The k-nearest neighbor graph is also common, where an edge exists between samples that are both in the other's set of k-nearest neighbors. Regardless of the preprocessing to construct a graph, we will denote the weights as s_ij. The degree of a vertex v_i is d_i = Σ_{j=1}^{N} s_ij, and the degree matrix D is a diagonal matrix with the degrees along its diagonal.

The unnormalized graph Laplacian matrix is defined as L = D − S, where S is the weight/similarity matrix. The graph Laplacian is a useful object for studying graphs. For the purpose of clustering, the graph Laplacian has the interesting property that if the graph consists of k disjoint subsets of vertices that are connected, then L will have exactly k eigenvalues equal to zero. Furthermore, the corresponding eigenvectors serve as indicators for which samples are included in each cluster. In fact, even in cases where the vertices are not so obviously separable (which only occurs under conditions where similarity values equal 0, such as with thresholded similarities), the spectrum of the graph Laplacian is useful for clustering. In the case of fully connected graphs, only a single eigenvalue has a value of 0. However, the eigenvectors associated with the smallest k eigenvalues still hold meaning in regards to clustering the data, if the data does in fact form k reasonably separated clusters. These eigenvectors actually form a representation of the data in which clustering is trivial. The simplest spectral clustering algorithm consists of computing the k eigenvectors {u_i}_{i=1}^{k} associated with the k smallest eigenvalues of L, and arranging these length-N column vectors in a matrix X_new = [u_1, u_2, ..., u_k]. Let the rows of X_new be the new representations of our N samples. Clustering these samples is a much easier task and can be accomplished with k-means.
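The simplest algorithm above fits in a few lines. This NumPy/SciPy sketch is illustrative (it hands the final step to an off-the-shelf k-means):

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def spectral_clustering(S, k):
        """Unnormalized spectral clustering: embed with the k bottom
        eigenvectors of L = D - S, then k-means the rows of the embedding."""
        L = np.diag(S.sum(axis=1)) - S
        vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
        X_new = vecs[:, :k]              # rows are the new representations
        _, labels = kmeans2(X_new, k, minit='points')
        return labels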

Other graph Laplacians exist, and yield different spectral clustering algorithms. Two normalized graph Laplacians are

    L_rw = I − D^{−1} S
    L_sym = I − D^{−1/2} S D^{−1/2},

with the identity matrix I, and "sym" and "rw" standing for symmetric and random walk, respectively Chung (1997). There are popular spectral clustering algorithms using the normalized graph Laplacians, L_rw Shi & Malik (2000) and L_sym Ng et al. (2002).

Spectral clustering can be understood as a relaxation of the graph cut problem. The graph cut problem seeks to cut edges to produce k disjoint components, while removing the smallest total sum of weights. Solving this problem favors the suboptimal solution of isolating single vertices, so different cuts are used to ensure more balanced components. The ratio cut Hagen & Kahng (1992) is

    RatioCut({X_i}_{i=1}^{k}) = Σ_{i=1}^{k} cut(X_i, X \ X_i) / |X_i|,

where the cut is the sum of weights between the two sets of vertices (samples). The normalized cut Shi & Malik (2000) is

    Ncut({X_i}_{i=1}^{k}) = Σ_{i=1}^{k} cut(X_i, X \ X_i) / vol(X_i),

where vol(X_i) is the graph theoretic volume of component X_i, defined to be the sum of the degrees of its vertices. These cut objectives contain normalizations to account for the size of each resulting component, preventing the selection of trivial cuts. Many other graph cut objectives have been studied, including an information-theoretic one called the information cut Jenssen et al. (2007).
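Both cut objectives are simple to evaluate for a candidate partition, which makes the role of the normalizations easy to see. A minimal NumPy sketch (illustrative, not from the text):

    import numpy as np

    def cut_value(S, clusters, normalized=False):
        """RatioCut divides each cluster's cut by |X_i|; Ncut divides by
        vol(X_i), the sum of the degrees of its vertices."""
        deg = S.sum(axis=1)
        total = 0.0
        for idx in clusters:  # each cluster given as an array of sample indices
            mask = np.zeros(S.shape[0], dtype=bool)
            mask[idx] = True
            cut = S[mask][:, ~mask].sum()  # weight crossing the cluster boundary
            total += cut / (deg[mask].sum() if normalized else mask.sum())
        return total

Isolating a single vertex keeps the numerator small but makes the denominator tiny as well, which is exactly how the normalized objectives discourage trivial cuts.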

1.1.5 Hierarchical Clustering

The aforementioned clustering algorithms are considered partitional methods since they produce a single partition of the dataset. Another class of methods are known as hierarchical because they produce clusterings at multiple scales. There are two forms of hierarchical clustering. Agglomerative hierarchical clustering Johnson (1967) begins with an initial partition consisting of each sample as a singleton cluster, and cluster structure is built incrementally until all samples exist in the same cluster. In contrast, divisive hierarchical clustering begins with all samples in a single cluster and incrementally splits clusters until all samples are separate. While there are a few divisive methods Guenoche et al. (1991), it is generally a more difficult problem and agglomerative methods are much more common.

Agglomerative hierarchical clustering begins with each sample in its own cluster. Index the stages of hierarchical clustering with t, and let the partition present at time t be G_t = {X_i}_{i=1}^{N_t}, where N_t is the number of groups at time t. There are N − 1 total stages of hierarchical clustering, each stage representing a merge of two groups within G_t. Therefore, N_0 = N and N_{N−1} = 1. At each stage, the next merge is selected based on the decision rule

    argmin_{X_i, X_j ∈ G_t} link(X_i, X_j),

where link(·, ·) is known as the linkage function, which determines the distance between two groups of samples. There are a variety of linkage functions, with varying computational complexities and characteristics in the clusterings they produce. The single-linkage Sibson (1973) finds the minimum of the distances between the samples in the two groups,

    link_single(X_i, X_j) = min_{x_i ∈ X_i, x_j ∈ X_j} d(x_i, x_j).

The complete-linkage Defays (1977) is similar, except it measures the maximum distance between groups. The average linkage Sokal & Michener (1958) averages all distances between groups,

    link_av(X_i, X_j) = (1 / (|X_i| |X_j|)) Σ_{x_i ∈ X_i} Σ_{x_j ∈ X_j} d(x_i, x_j).

Many linkages exist, including ones with statistical bases Ward Jr. (1963) and graph theoretic linkage functions Zhang et al. (2012). The relative performance of different linkage functions is highly dependent on the application.

Figure 1-2. Unsupervised hierarchical clustering. A) The dendrogram created by hierarchical agglomerative clustering, with the dashed line indicating a cut at 5 clusters. B) The resulting clustering produced by this cut.

Once two groups are selected which minimize the linkage criterion above, those groups are merged. Thus, there is one fewer group, since |G_t| = |G_{t−1}| − 1. The linkage functions between groups not involved in the merge can be used in the next stage, but for the newly formed group, linkages must be computed with each of the existing groups. Hierarchical clustering has a complexity of O(n³), owing to the repeated searches over all pairs of possible merges and the recomputation of linkages at each stage. However, more efficient hierarchical clustering algorithms exist Zhang et al. (1996), with modifications for larger datasets.

The result of hierarchical clustering is a dendrogram, as seen in Figure 1-2. The horizontal axis contains the samples, ordered so that samples that merge the earliest are adjacent. The vertical axis contains the linkage values, and groups are connected at the level of the linkage value at which they merged. To acquire a single partition from hierarchical clustering, the dendrogram is cut at a certain level. There are many criteria for selecting the point at which the dendrogram is cut Dubes (1987); Fraley & Raftery (1998); McCullagh & Yang (2008). It is common to cut at the point which yields a partition of k clusters.
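In practice the merge sequence and the cut are each one call; the following sketch uses SciPy (an illustrative tool choice, not one named in the text), where method selects among the single, complete, and average linkages defined above:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(100, 2)                       # placeholder N x d dataset
    Z = linkage(X, method='average')                 # the full sequence of merges
    labels = fcluster(Z, t=5, criterion='maxclust')  # cut yielding 5 clusters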

1.1.6 Other Approaches

Density-based clustering methods find clusters in high density regions of the data. Points in sparse regions can be rejected as outliers. These methods often are nonparametric (in the sense of the number of clusters) and have no assumptions about the shape of clusters, other than the restrictions for dense regions. The most famous algorithm of this type is density-based spatial clustering of applications with noise (DBSCAN) Ester et al. (1996). In defining dense regions, DBSCAN has two parameters: ε, which is a neighborhood size, and minPts, which is a threshold on the population of that neighborhood. Samples are divided into core points, density-reachable points, and outliers. A core point contains at least minPts other samples within its neighborhood of distance ε. For x_i to be density-reachable from x_j, there must be a series of core points, each in the ε-neighborhood of the previous, extending from x_j, and with the last being x_i or at least containing x_i in its neighborhood. Outliers are samples which are not reachable from any other point. Clusters are then defined as groups of samples which are density-reachable from at least one common core point. A variant of DBSCAN called ordering points to identify the clustering structure (OPTICS) Ankerst et al. (1999) incorporates the neighborhood size needed to acquire minPts.

Grid-based clustering is related to density-based clustering and focuses on partitioning the space into cells, rather than purely partitioning the data samples. Grid-based techniques are particularly adept at handling large datasets. The statistical information grid (STING) Wang et al. (1997) and CLIQUE Agrawal et al. (2005) are two well-known grid-based methods.
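For reference, a minimal usage sketch of DBSCAN through scikit-learn (an illustrative tool choice); eps plays the role of ε and min_samples of minPts:

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(300, 2)   # placeholder dataset
    labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
    # Clusters are groups of density-reachable samples; outliers get label -1.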

Non-negative matrix factorization (NMF) Ding et al. (2005); Lin (2007) is a problem which attempts to factor a d × n matrix V as the product of a d × k matrix W and a k × n matrix H, where V, W, and H all have non-negative elements. Finding such a factorization can be achieved by solving

    argmin_{W ≥ 0, H ≥ 0} ||V − WH||²_F,

where ||·||_F is the Frobenius norm. The clustering effect of NMF can be seen when V is the data matrix, with each column being one of the n samples. The resulting W contains the k cluster centers in its columns, and H is an indicator matrix, where H_ji > 0 indicates that sample x_i is in cluster j.

The Markov cluster (MCL) algorithm Van Dongen (2000) is a graph clustering method, based on the idea that a random walk on the graph formed by the samples and their similarities will tend to stay within clusters and only occasionally jump between different clusters. The random walk is captured in a stochastic matrix consisting of the transition probabilities between vertices or samples. Let this N × N matrix be M, with M_ij being the probability of transitioning between sample x_j and x_i. Constructing a transition matrix from the similarities can be done simply by normalizing the columns of the similarity matrix S, such that they sum to 1. The MCL algorithm is an iterative process, alternating between two operations on M. In the expansion process, M is replaced by the matrix power M^e, where e is a positive integer. Essentially, M^e is the transition matrix for e steps of the random walk. The inflation process then raises the elements to the power r > 1 and renormalizes down the columns, which pushes smaller transition probabilities towards zero. The M that results after many iterations is sparse, with positive probabilities indicating samples that are attractors for the rest of the samples in their cluster.
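The expansion and inflation operations are each a single line, so the whole loop can be sketched compactly (illustrative NumPy code; e, r, and the iteration count are assumed values):

    import numpy as np

    def markov_cluster(S, e=2, r=2.0, n_iter=50):
        """MCL: column-normalize S into a stochastic matrix, then alternate
        expansion (matrix power M^e) and inflation (elementwise power r)."""
        M = S / S.sum(axis=0, keepdims=True)   # columns sum to 1
        for _ in range(n_iter):
            M = np.linalg.matrix_power(M, e)   # e steps of the random walk
            M = M ** r                         # sharpen transition probabilities
            M /= M.sum(axis=0, keepdims=True)  # renormalize down the columns
        return M  # rows retaining positive mass indicate attractor samples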

To get a full account of the vast landscape of clustering algorithms and applications there are many resources Aggarwal & Reddy (2013); Jain (2010); Jain et al. (1999); Xu & Wunsch (2008). The right clustering algorithm for a given dataset depends on many factors. The representation of the dataset itself dictates if graph or similarity based clustering methods need to be used. The expected shapes of clusters are an important consideration, as k-means or mixture models work well for simple spherical clusters, but spectral or hierarchical clustering may be needed to capture more complicated geometries. Large datasets warrant special methods such as density or grid-based clustering. If clustering is being used in an exploratory sense, the number of clusters may not be known, requiring nonparametric algorithms such as affinity propagation, DPMM, DBSCAN, and mean shift.

1.2 Constrained Clustering

The problem of clustering with side information is known as semi-supervised or constrained clustering. The supervision typically takes on one of two forms: cluster labels or instance-level constraints Wagstaff & Cardie (2000). Clustering with a given set of labels has been explored, for instance in Basu et al. (2002), where samples with known labels for each cluster are used as initial centers for k-means, but is less common than clustering with pairwise constraints. Instance-level constraints are known as must-link and cannot-link constraints, expressing whether two samples are in the same or different clusters, respectively (Figure 1-3). From a practical point of view, constraints can be easier to acquire than direct labels, since a constraint is found simply by comparing two objects, while a label requires the annotator to choose from amongst the entire set of possible classes. Consider, for instance, the problem of document clustering. It is easier for a human annotator to group two sports articles than it is for the expert to decide between sports, current events, or local news.

Incorporating pairwise constraints with the underlying distance properties of the dataset to produce a coherent clustering is a difficult task. This requires establishing a balance between these two forms of information. Constraints will at times conflict with the natural or unsupervised clustering of the data. Of course external information is highly useful for enforcing unintuitive partitions of the data and it is certainly reasonable for side information to override the distance information.

Figure 1-3. Instance-level constraints. The two forms of instance-level constraints: must-link (ML) and cannot-link (CL).

However, constraints arise from imperfect sources, which completely confounds the issue of which source of information is trustworthy. The treatment of imperfect oracles will be discussed in later chapters. Distances and pairwise constraints are also difficult to compare, because while distances are known for every pair of samples, constraints are only present for a few. The forces of attraction or repulsion carried by a must-link or cannot-link constraint should influence not only the samples involved, but the local neighborhoods of the constrained samples. This process is known as constraint propagation, and the mechanism for propagating constraints to local neighborhoods is one of the major questions that must be asked when designing a constrained clustering method.

There are a few approaches to ensuring that constraints are enforced. Many methods enforce constraints directly. This is accomplished either by creating methods that simply build clusters without violating constraints or by designing clustering criteria in which constraint violations are penalized. Rather than modifying the clustering algorithm directly, the representation of the data can be modified such that constraints tend to be satisfied in the new representation. The idea that must-link samples should be close, while cannot-link samples should be distant, is behind many methods of this type.

These are not well-defined categories of constrained clustering methods, and many algorithms involve a combination of ideas. We first present the constrained clustering methods related to k-means, as these were some of the earliest approaches, and as they grew in complexity, they spanned many of the ideas behind the entire field of constrained clustering.

1.2.1 Constraining K-Means

While k-means is simple to implement, the resulting clustering depends heavily on the initialization of the centers. It is thus a popular algorithm to bring into the semi-supervised domain, because the addition of side information can greatly enhance its robustness. One of the earliest algorithms for constrained clustering with k-means is COP-KMeans Wagstaff et al. (2001), which modifies the assignment step such that samples are assigned to the cluster which contains a sample they are must-linked with, or to the nearest cluster that would not violate any of the cannot-link constraints.

The pairwise constrained k-means (PCKMeans) algorithm Basu et al. (2004a) is a more advanced form of constrained k-means, which modifies the usual k-means objective function consisting of the sum of squared distances between samples and the nearest of the k centers. This objective is augmented in PCKMeans to include a punishment term for violating constraints,

    J_PCKMeans = Σ_{i=1}^{k} Σ_{x∈X_i} ||x − μ_i||² + Σ_{(x_i,x_j)∈ML s.t. ℓ(x_i)≠ℓ(x_j)} w_ij + Σ_{(x_i,x_j)∈CL s.t. ℓ(x_i)=ℓ(x_j)} w_ij,

where ML is the set of all must-link constraints, CL is the set of all cannot-link constraints, ℓ(x) is the center to which x belongs, and w_ij is a weight that punishes constraint violations. The resulting PCKMeans algorithm minimizes this objective, similarly to k-means. However, for PCKMeans the assignment step does not select the nearest cluster, but the cluster with the lowest squared distance plus sum of the violation penalties.
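The penalized assignment rule can be sketched directly from the objective. The NumPy fragment below is a simplification for illustration (a single greedy sweep against the current labels, with one shared weight w), not the published PCKMeans code:

    import numpy as np

    def pckmeans_assign(X, centers, labels, ml, cl, w=1.0):
        """Assign each sample to the cluster minimizing squared distance plus
        w per must-link/cannot-link violation against current labels."""
        k = len(centers)
        for i, x in enumerate(X):
            costs = np.sum((centers - x) ** 2, axis=1)
            for a, b in ml:      # must-link: penalize clusters != partner's
                if i in (a, b):
                    j = b if i == a else a
                    costs += w * (np.arange(k) != labels[j])
            for a, b in cl:      # cannot-link: penalize cluster == partner's
                if i in (a, b):
                    j = b if i == a else a
                    costs += w * (np.arange(k) == labels[j])
            labels[i] = int(costs.argmin())
        return labels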

Other than their inclusion in the objective function, PCKMeans exploits constraints by using them to initialize the k centers. The transitive property is an important concept for must-link constraints. If samples a and b are linked, and samples b and c are linked, then samples a and c must also be linked. The constraint set can be augmented to include all the logically deducible must-links via the transitive property. A group of samples all joined with must-links is known as a transitive closure or a chunklet. PCKMeans finds the k largest transitive closure neighborhoods, which are likely to be separate clusters, and initializes the algorithm with the centers of these neighborhoods. Note that the set of cannot-links can be expanded through transitive closure as well: if a and b are linked and b and c are not linked, then a and c must also be separated.
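Computing the transitive closure just described is a classic union-find exercise; a minimal sketch (illustrative, not from the text):

    def chunklets(n, must_links):
        """Merge must-linked samples with union-find; samples sharing a root
        form one chunklet (a transitive closure)."""
        parent = list(range(n))

        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]  # path halving
                a = parent[a]
            return a

        for a, b in must_links:
            parent[find(a)] = find(b)
        return [find(a) for a in range(n)]

    # Example: chunklets(5, [(0, 1), (1, 2)]) places samples 0, 1, 2 together.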

K-Means is highly dependent on the distance metric used to represent distances between points. Therefore, rather than punishing clusterings that violate constraints, a parametrized distance metric can be learned, such that constraints are satisfied when k-means is performed with the new metric. Changing the metric with which the samples are compared is equivalent to mapping the data. For instance, the Mahalanobis distance Mahalanobis (1936) is

    ||x − y||_A = √((x − y)^T A (x − y)),

where A is a positive semi-definite matrix (A ⪰ 0). This is equivalent to working with mapped samples A^{1/2} x and using the standard Euclidean distance. The metric pairwise constrained k-means (MPCK-Means) algorithm Bilenko et al. (2004) learns a Mahalanobis metric A_i for each cluster i. The A_i are local metrics, rather than a single global A, which increases the ability to satisfy constraints. In MPCK-Means the local metrics are learned by modifying the PCKMeans objective such that Mahalanobis distances with the local scalings are used. The penalties on constraint violations are also changed such that violations of close must-links and distant cannot-links are more egregious. The resulting MPCK-Means objective is

    J = Σ_{i=1}^{k} Σ_{x∈X_i} ( ||x − μ_i||²_{A_i} − log(det(A_i)) ) + Σ_{(x_i,x_j)∈ML s.t. ℓ(x_i)≠ℓ(x_j)} w_ij f_M(x_i, x_j) + Σ_{(x_i,x_j)∈CL s.t. ℓ(x_i)=ℓ(x_j)} w_ij f_C(x_i, x_j),

where the term log(det(A_i)) is for normalization, and f_M(x_i, x_j) and f_C(x_i, x_j) are the distance weighting penalties for constraint violation. For a must-link violation, if x_i is in cluster z_i and x_j is in z_j, then f_M averages the distances in the metrics of both clusters,

    f_M(x_i, x_j) = (1/2) ||x_i − x_j||²_{A_{z_i}} + (1/2) ||x_i − x_j||²_{A_{z_j}}.

For a cannot-link violation, if x_i and x_j are both in cluster z, then f_C is

    f_C(x_i, x_j) = maxDist_{A_z} − ||x_i − x_j||²_{A_z},

where maxDist_{A_z} is the maximum distance between any two points in the dataset with respect to the metric of cluster z.

1.2.2 Representation Modification

The local Mahalanobis distances learned for each cluster in MPCK-Means are one example of a common strategy for constrained clustering, where the dataset that a clustering algorithm receives is modified such that constraints are easily satisfied. One way of doing so is evaluating distances with a different metric that is optimized with respect to the constraint set. Metric learning Davis et al. (2007); Weinberger et al. (2006); Yang & Jin (2006) was first used for supervised problems, for instance, in classification, where the data can be projected to better separate the classes. The projection, such as the matrix A in a Mahalanobis distance, is learned to maximize the classification rate on a training set. For clustering, criteria can be developed such that the constraints are satisfied after the projection is applied.

This approach has been taken in Xing et al. (2002), in which there exists a set Ω_m := {(x_i, x_j) : x_i and x_j are similar} of sample pairs known to be similar, and a set Ω_c := {(x_i, x_j) : x_i and x_j are dissimilar} of pairs known to be dissimilar. This is a more general case of must and cannot-link pairwise constraints. The problem is then to find the positive semi-definite Mahalanobis matrix A which minimizes the distance between pairs in Ω_m, while maintaining a base distance between sample pairs in Ω_c. More explicitly, this optimization problem is

    min_A Σ_{(x_i,x_j)∈Ω_m} ||x_i − x_j||²_A
    s.t. Σ_{(x_i,x_j)∈Ω_c} ||x_i − x_j||_A ≥ 1,
         A ⪰ 0,

which is a convex problem. If A is diagonal, meaning a scaling for each dimension is learned, then a solution can be found with Newton-Raphson. For a full A, which is a linear transformation on the dataset, gradient descent is used.

Other linear methods exist for learning representations that attempt to satisfy the constraints. Semi-supervised dimensionality reduction (SSDR) Zhang et al. (2007) learns a linear mapping of the data to a lower dimensional space. This mapping must minimize the distance between must-link pairs and maximize the distance between cannot-link pairs. Both concerns are balanced in a single objective with a scaling parameter. Relevant components analysis (RCA) Bar-Hillel et al. (2003) takes the chunklets formed by the transitive closure of the must-link constraints, and uses these neighborhoods to determine which features of the data are most relevant. A global scaling of the features reflects this relevance or irrelevance. Assume the constraints are divided into k chunklets, where chunklet i is C_i and a total of p samples are involved in chunklets.

RCA computes the centered covariance of these points,

    Ĉ = (1/p) Σ_{i=1}^{k} Σ_{x∈C_i} (x − m̂_i)(x − m̂_i)^T,

where m̂_i is the mean of samples in chunklet i. The data is then projected with the whitening transformation given by x_new = Ĉ^{−1/2} x. RCA is shown to maximize the mutual information between the original and projected data, such that the average intra-chunklet distance remains below a set distance.
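The RCA computation is short enough to sketch in full; this NumPy fragment is illustrative (it assumes the pooled covariance is nonsingular):

    import numpy as np

    def rca_whiten(X, chunklet_indices):
        """Pool the chunklet-centered covariance C-hat, then map each sample
        through the whitening transform x_new = C^(-1/2) x."""
        d = X.shape[1]
        C = np.zeros((d, d))
        p = sum(len(idx) for idx in chunklet_indices)
        for idx in chunklet_indices:           # idx: indices of one chunklet
            D = X[idx] - X[idx].mean(axis=0)   # subtract the chunklet mean
            C += D.T @ D
        C /= p
        vals, vecs = np.linalg.eigh(C)         # C^(-1/2) via eigendecomposition
        W = vecs @ np.diag(vals ** -0.5) @ vecs.T
        return X @ W.T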

Local linear metric adaptation (LLMA) Chang & Yeung (2004) also uses the transitive closures formed by the must-link constraints, to produce locally linear projections of the data, similar to MPCK-Means. Different linear projections on local regions of the data provide a globally nonlinear projection.

The linear projection implied by the Mahalanobis distance can be limited. Imagine a two class dataset in which class 1 has two modes and class 2 consists of a single mode sitting between the two modes of class 1. Squeezing the space to satisfy must-link constraints in class 1 violates cannot-link constraints, as does expanding the space to satisfy cannot-link constraints. Nonlinear projections can accommodate such tricky situations. Kernel-based metric learning Baghshah & Shouraki (2010) allows for a nonlinear extension to the projection learning problems by learning a linear mapping in the RKHS that is nonlinear in the space of the data. Start with a linear projection learning problem for satisfaction of must-link and cannot-link constraints, where a new representation y is formed by y = W^T x. A desirable projection W maximizes the distance between cannot-link pairs of projected samples and minimizes the distance between must-link pairs. This is captured in the optimization problem

    argmax_{W : W^T W = I} ( Σ_{(x_i,x_j)∈CL} ||y_i − y_j||² / Σ_{(x_i,x_j)∈ML} ||y_i − y_j||² ) + J(W),

where J(W) is a regularization penalty. Now consider this problem in the RKHS, consisting of the elements φ_x ∈ H, which are the features associated with a reproducing kernel κ(x_i, x_j). A linear operator W: H → H creating a new representation y = Wφ(x) in the RKHS represents a nonlinear transformation for the data in R^d. A similar approach was taken in Yeung & Chang (2007), but with an objective function for the mapping in which the must-link and cannot-link terms were added together.

Nonlinear distances other than the one implicitly defined by a kernel can also be learned. The Bregman distance Bregman (1967), given a convex, twice differentiable function φ(x): R^d → R, is defined

    d_Breg(x_1, x_2) = φ(x_1) − φ(x_2) − (x_1 − x_2)^T ∇φ(x_2).

A Bregman distance can be found such that constraints are satisfied when this distance is used Wu et al. (2012). Finding such a Bregman distance amounts to selecting the convex function φ(x), which characterizes the Bregman distance, that satisfies an objective function. Like other methods, this objective attempts to bring must-link pairs closer and drive cannot-link pairs apart.

Similarity neural networks Chopra et al. (2005) provide another means of learning a distance function that implies a nonlinear projection of the data. A similarity neural network consists of two identical neural networks, each mapping a sample to a single or low dimensional space where they can be compared more accurately with respect to the goals of the user. Pairwise constraints have been used to train the network Maggini et al. (2012) to create a similarity measure in which must-link pairs are close and cannot-link pairs are distant.

A few methods iteratively build representations of the data, and in each stage cluster the data using an auxiliary clustering method, to quantify the satisfaction of the constraints for the current stage of the representation. The representation is then adjusted to account for the unsatisfied constraints. One such method is the BoostCluster algorithm Liu et al. (2007), in which a kernel matrix encoding similarity is updated based on the satisfaction or violation of constraints by the auxiliary clustering method.

Building on BoostCluster, the boosting clustering (BOC) algorithm Sublemontier et al. (2011) selects a projection of the data that maximizes its objective based on satisfaction of the constraint set and agreement with the data representation. An auxiliary clustering of the projected data reveals which constraints are violated, updating weights that parametrize the objective which dictates the projection that is selected in the next iteration. DistBoost Hertz et al. (2004) uses a weak learner, which is a base constrained clustering method that produces hypotheses for whether or not each pair of samples is linked. Like in boosting for classification Freund & Schapire (1996), the weak learner is trained with weights assigning the relative importance of each sample. In DistBoost, the weights on unconstrained pairs gradually decay, to allow only some of the natural geometry of the data to guide the process. The resulting distance function is a composite of all the hypotheses for pair linkages created during the iterations of DistBoost.

These methods for learning representations of the data, either explicitly or implicitly through a distance or similarity function, have the benefit of not being tied to any particular clustering algorithm. A clustering algorithm of choice can be used on the projected data or employing the learned distance function. Considering that the new representations of the data are a smooth transformation of the original data, one does not have to worry about the propagation of information from constrained samples to the unconstrained samples in their local neighborhoods.

1.2.3 Constraint Propagation

A considerable amount of effort in the field of constrained clustering involves the propagation of constraint information from the constrained samples to their surrounding neighbors, including a review paper on the topic Fu & Lu (2015). Without propagating constraints, very trivial clustering solutions can be found, which still satisfy the constraints. For instance, consider a constrained k-means algorithm like PCKMeans. If a constraint conflicts considerably with the natural k-means clustering of the data, less cost is accrued in adjusting the grouping of a single sample to satisfy the constraint, than in attempting to fundamentally change the natural tendencies of k-means for that dataset. This effect is shown in Figure 1-4.

Figure 1-4. Trivial solution of constrained k-means clustering. A) The unsupervised k-means clustering with single samples changed to satisfy constraints can incur less cost than a smooth clustering. B) When constraints are propagated to local neighborhoods, the unintuitive clustering indicated by the constraints can be found.

Propagating constraints requires some decision on what constitutes a neighborhood, so that constraints do not affect samples that they are not associated with. Generally speaking, constraint information is shared through dense regions of the dataset, based on the simple observation that every sample in a cluster satisfies the same constraints.

The concept of constraint propagation arises especially in methods based on spectral clustering, as the similarity matrix must be modified in a meaningful way since constraints cannot be directly satisfied. It was first proposed Kamvar et al. (2003) to simply set the similarity between must-link samples to 1 and to 0 between cannot-link samples. The normalized graph Laplacian is then computed and spectral clustering is performed as usual. A few approaches modify the underlying objective of spectral clustering to account for pairwise constraints. In flexible constrained spectral clustering Wang & Davidson (2010), the objective function of the graph cut problem, which is z^T L z, where z ∈ {−1, 1}^N is the vector of binary cluster labels, is modified to include constraints. Let Q be an N by N matrix such that Q_ij = 1 if (x_i, x_j) ∈ ML, Q_ij = −1 if (x_i, x_j) ∈ CL, and Q_ij = 0 if no constraint is present for the pair. Then z^T Q z is a measure of how well the constraints are satisfied, as it is the sum of z_i z_j Q_ij over all pairs i, j.

PAGE 39
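To make this measure concrete, the sketch below (Python with NumPy; an illustration of the quantity just defined, with our own function names) builds $Q$ from constraint lists and evaluates $z^T Q z$ for two candidate labelings:

import numpy as np

def constraint_matrix(n, must_links, cannot_links):
    # Q[i, j] = 1 for must-links, -1 for cannot-links, 0 otherwise.
    Q = np.zeros((n, n))
    for i, j in must_links:
        Q[i, j] = Q[j, i] = 1.0
    for i, j in cannot_links:
        Q[i, j] = Q[j, i] = -1.0
    return Q

# Four samples: a labeling that satisfies both constraints scores higher
# than one that violates them.
Q = constraint_matrix(4, must_links=[(0, 1)], cannot_links=[(0, 2)])
z_good = np.array([1, 1, -1, 1])
z_bad = np.array([1, -1, 1, 1])
print(z_good @ Q @ z_good)  # 4.0: both constraints satisfied
print(z_bad @ Q @ z_bad)    # -4.0: both constraints violated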

A lower bound is put on this measure, and the resulting inequality constraint on $z^T Q z$ is included with the objective of the graph cut. Constrained 1-spectral clustering (COSC) (Rangapuram & Hein, 2015) restricts the normalized cut problem to the space of partitions consistent with respect to the constraint set.

For the two-cluster case, affinities are propagated from constrained samples with a method that can be interpreted as a Gaussian process (Lu & Carreira-Perpinan, 2008). With the cluster label $z_i \in \mathbb{R}$, $z_i > 0$ corresponds to cluster 1 and $z_i \le 0$ corresponds to cluster 2. In this setting, the covariance of the labels serves as an adjacency or affinity measurement between samples, i.e., $s_{ij} = E[z_i z_j]$. Therefore, the probability of a label configuration is $p(z) \propto \exp(-\tfrac{1}{2} z^T S^{-1} z)$. It is assumed that for $(x_i, x_j) \in ML$, $z_i - z_j \sim \mathcal{N}(0, \sigma_m^2)$, and that for $(x_i, x_j) \in CL$, $z_i + z_j \sim \mathcal{N}(0, \sigma_c^2)$. Under these assumptions, the label posterior $p(z \mid ML, CL)$ can be computed analytically, which suggests the use of a new adjacency matrix, $\tilde{S}$, that incorporates the constraints. Namely, $\tilde{S}$ should be the covariance matrix such that $p(z \mid ML, CL) \propto \exp(-\tfrac{1}{2} z^T \tilde{S}^{-1} z)$. $\tilde{S}$ is easily computable, and may then be used as the adjacency matrix for traditional spectral clustering algorithms.

The exhaustive and efficient constraint propagation (E2CP) algorithm (Lu & Ip, 2010) learns a new similarity matrix $F$ with fully propagated constraints by solving

\[ \min_F \; \frac{1}{2}\|F - Z\|_F^2 + \frac{\mu}{2}\,\mathrm{tr}\!\left(F^T L F + F L F^T\right), \]

where $L$ is the normalized graph Laplacian, $Z$ is the matrix containing pairwise constraints, with 1 encoding must-links, $-1$ for cannot-links, and 0 otherwise, and the second term is weighted by a trade-off parameter $\mu$. The first term encourages the learned similarities to match the constraints; the second term maintains smoothness with respect to the similarity measure.
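Since the gradient of this objective with respect to $F$ is $(F - Z) + \mu(LF + FL)$, a few steps of gradient descent already propagate constraint values along the graph. The sketch below (Python with NumPy) is our own minimal illustration of minimizing the stated objective, not the solver of the cited work:

import numpy as np

def e2cp_gradient_descent(Z, L, mu=1.0, step=0.1, iters=200):
    # Minimize 0.5*||F - Z||_F^2 + 0.5*mu*tr(F'LF + FLF');
    # the gradient is (F - Z) + mu*(LF + FL).
    F = Z.copy()
    for _ in range(iters):
        F -= step * ((F - Z) + mu * (L @ F + F @ L))
    return F

# Tiny 3-node chain graph 0-1-2 with one must-link between nodes 0 and 1.
W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
dinv = 1.0 / np.sqrt(W.sum(axis=1))
L = np.eye(3) - (W * dinv[:, None]) * dinv[None, :]   # normalized Laplacian
Z = np.zeros((3, 3)); Z[0, 1] = Z[1, 0] = 1.0
print(np.round(e2cp_gradient_descent(Z, L), 2))  # similarity spreads to neighbors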

The semi-supervised clustering via random walk (SCRAWL) algorithm (He et al., 2014) propagates constraints with a multi-level random walk, with transition probabilities derived from the constraints and from similarity values which define the edges of the graph. The walks occur at different scales, with one determining the vertices around constrained samples that are affected. Larger-scale structures called components contain the groups of affected samples. A higher-level random walk determines the influence of constraints on edges between components.

Hierarchical agglomerative clustering is particularly easy to modify for accepting constraints. Constrained complete-link (CCL) (Klein et al., 2002) alters the distance matrix such that if $(x_i, x_j) \in ML$, then $d(x_i, x_j) = 0$. These must-links are propagated by iterating over all samples $x_j$ involved in a must-link constraint and modifying the distance for all pairs of samples:

\[ d(x_i, x_k) = \min_{x_j \in ML} \{ d(x_i, x_k),\; d(x_i, x_j) + d(x_j, x_k) \}. \]

Thus, a must-link constraint serves as a "shortcut" for distances between unconstrained points. For a pair $(x_i, x_j) \in CL$, the distance is set to some high value such as $\max_{(x_a, x_b)} d(x_a, x_b) + 1$, and no merges are made which violate this constraint. Thus, constrained hierarchical clustering can proceed until no viable merges remain, avoiding the need to establish the number of clusters. Constrained hierarchical clustering has also been expanded to new types of constraints and theoretical analysis (Davidson & Ravi, 2009). Other constrained variants of hierarchical clustering exist. The linkage function has been modified to include a penalty for violating constraints (Miyamoto & Terami, 2011). A threshold on linkage values can be learned from the constraints, which extracts a single partition from the dendrogram that reflects the constraints (Daniels & Giraud-Carrier, 2006).
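A minimal sketch of the CCL shortcut update described above (Python with NumPy; our own illustration of the distance modification, not the authors' code):

import numpy as np

def ccl_shortcut_distances(D, must_links):
    # Zero out must-link distances, then relax every pair through any
    # constrained sample: d(i, k) = min(d(i, k), d(i, j) + d(j, k)).
    D = D.copy()
    linked = set()
    for i, j in must_links:
        D[i, j] = D[j, i] = 0.0
        linked.update((i, j))
    for j in linked:
        D = np.minimum(D, D[:, [j]] + D[[j], :])
    return D

D = np.array([[0., 5., 9.],
              [5., 0., 2.],
              [9., 2., 0.]])
print(ccl_shortcut_distances(D, [(0, 1)]))
# d(0, 2) drops from 9 to 2: the must-link (0, 1) acts as a shortcut.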

1.2.4 Other Methods

There are quite a few constrained clustering approaches which build on Gaussian mixture models (GMMs), since they are popular models for clustered data and since the probabilistic framework is easily adapted to include new information. Constraints were included in the hidden Markov random field (HMRF) formulation (Basu et al., 2004b). In this work, the constraint set is $\Omega := \{\omega_{ij}\}_{i,j=1}^N$, where $\omega_{ij} = 1$ if $x_i$ and $x_j$ have a must-link constraint, $\omega_{ij} = -1$ if the samples have a cannot-link constraint, and $\omega_{ij} = 0$ if there is no constraint. The constraint set and dataset $X := \{x_i\}_{i=1}^N$ are observed variables, while the set of sample cluster assignments $Z := \{z_i\}_{i=1}^N$ and cluster model parameters $\Theta := \{\theta_j\}_{j=1}^k$ are hidden variables. It is the goal of this method to maximize the joint probability of the data, cluster assignments, and cluster parameters, given the constraints:

\[ p(X, Z, \Theta \mid \Omega) = p(\Theta \mid \Omega)\, p(Z \mid \Theta, \Omega)\, p(X \mid Z, \Theta, \Omega). \]

The second term captures the influence of the constraints on cluster assignments. It achieves this via a potential function, $\nu(i, j)$, which takes on a value of $w_{ij}$ to punish constraint violations and is 0 otherwise. Therefore,

\[ p(Z \mid \Theta, \Omega) = p(Z \mid \Omega) \propto \exp\Big(-\sum_{i,j} \nu(i, j)\Big), \]

where the rightmost expression is missing a normalization term. The third term of the joint probability represents the likelihood of the data given cluster assignments and parameters. By the independence assumption, it can be written as

\[ p(X \mid Z, \Theta, \Omega) = p(X \mid Z, \Theta) = \prod_{i=1}^N p(x_i \mid z_i, \Theta). \]

The cluster assignments and parameters that maximize the joint probability are found with the HMRF-KMeans algorithm, which provides an EM approach for the clustering problem with label configurations modified by the constraints. HMRF-KMeans was shown to be a special case of the weighted kernel k-means objective (Dhillon et al., 2004b).

There exist a few methods which incorporate constraints in a similar way, by modifying label configuration probabilities. The addition of constraints often results in the EM updates losing the analytical solution that the GMM has; thus these algorithms are typically approximate. Rather than strict linking constraints, the probabilistic framework facilitates soft constraints that capture varying levels of preference for grouping given pairs of samples. In penalized probabilistic clustering (PPC) (Lu & Leen, 2004), constraints are used to modify the cluster label prior, $p(Z \mid \Theta) = \prod_i \pi_{z_i}$. This modification is in the form of a weighting function $g(Z)$ which, given the constraint set, changes the prior into $p(Z \mid \Theta, \Omega) \propto g(Z) \prod_i \pi_{z_i}$, up to normalization. A constrained version of the EM algorithm (Shental et al., 2004) changes the E-step such that the expectation does not occur over all possible label configurations, but only over those which do not violate any constraints. Constrained DPMM clustering has been presented for Gibbs sampling inference (Vlachos et al., 2009). This is actually simpler than for finite mixture models, as for a sampling-based approach, the posterior can be set to 0 for clusters that would violate constraints.

Nearly every common clustering method has a constrained analogue. A constrained version of DBSCAN was introduced (Ruiz et al., 2007), which satisfies the constraints directly. Affinity propagation has constrained versions, in which message passing disseminates the information present in the constraints (Arzeno & Vikalo, 2015; Givoni & Frey, 2009). A constrained version of mean shift called semi-supervised kernel mean shift (SKMS) clustering (Anand et al., 2014) projects the data to an RKHS, and enforces must-link constraints by projecting the data onto the null space of the vectors formed by the must-link constraints. This projection collapses linked samples together. Cannot-link constraints are forced to remain at a fixed distance apart. A few overviews of constrained clustering have been written (Basu et al., 2008; Truong & Battiti, 2013).

1.2.5 The State of Constrained Clustering

It has been 15 years since the first constrained clustering methods were presented. Three open issues have been brought up (Wagstaff, 2007). The first issue is in evaluating the effectiveness of a given constraint set. Not all constraint sets provide useful information. In fact, the inclusion of certain constraint sets may actually yield worse performance compared with unsupervised clustering (Davidson et al., 2006; Wagstaff et al., 2006). The second issue involves intelligently selecting query pairs such that more informative constraint sets can be acquired using fewer queries of the human oracles that annotate constraints. The third issue is in how to effectively propagate constraint information, as we have discussed previously.

Aside from these issues, which are still relevant today, a modern account of the status of constrained clustering is lacking. The increasing size of datasets necessitates scalable methods. Many of the modifications that allow clustering algorithms to accept constraints are computationally intensive. Propagating constraints is inherently an iterative process, and often the entire set of pairwise distances must be re-evaluated at each step. Even direct satisfaction methods are burdened by constraints, as tests of constraint satisfaction can interfere with efficient implementations of distance calculations. Few methods (Li et al., 2015; Semertzidis et al., 2015) have designed scalable approaches to the constrained clustering process.

The way in which side information is acquired has changed since the formulation of constrained clustering. While the early creators of constrained clustering algorithms imagined pairwise constraints being produced by scientists or trained experts in the application from which the data arises, anonymous oracles have become a more common occurrence. Large-scale crowdsourcing solutions (Kittur et al., 2008) provide a highly accessible platform for acquiring human feedback. However, with this accessibility comes issues of reliability (Nowak & Ruger, 2010). Moving forward, the field of constrained clustering must actively account for the issue of imperfect oracles. These issues are the focus of this dissertation.

On the matter of constraint selection, this is the focus of a field known as active learning. Active learning is so named because, in a supervised or semi-supervised scenario, it actively selects the samples for which there is supervision, to best improve performance. We first present the basics of active learning, which began in the classification domain. We then discuss active constraint selection, which is the second open issue in constrained clustering (Wagstaff, 2007).

1.3 Active Learning

In both supervised and unsupervised approaches, feedback from a human expert may be incorporated to ensure a high level of performance and clarify any ambiguities suggested by the data. Selecting which samples receive feedback from the operator to improve the learning system is the domain of active learning (Settles, 2009). A query is the presentation of the data to the human, who then determines a label or constraint for those queried samples, depending on the problem. It is usually the case that interaction with the human operator is expensive, owing to the human time involved with answering queries. The number of queries must be minimized, while still obtaining as much useful information as possible. Typically, it is assumed that there is a known allotted number of queries. In other cases, the algorithm itself must determine when sufficient feedback has been acquired.

There are a few scenarios for which active learning may be presented. These scenarios differ in the samples that are considered candidates for query and the nature of the communication channel with the oracle. In pool-based sampling (Lewis & Gale, 1994), a static pool of candidate samples for query exists. Queries can be made in batches, such that a group of the best candidates is selected, presented to the oracle, and returned to the algorithm. Alternatively, samples can be selected one by one from the pool of candidates for feedback, thus allowing the active learning method to account for the new information that accumulates as each query is answered. Serial query selection can be useful because the active learning mechanism can avoid redundancy and focus on regions of the data space that are revealed to be difficult. For instance, if a group of samples in close proximity to one another are deemed to be the most interesting candidates by the active learning criterion, how does one prevent a batch-mode query strategy from selecting every sample in this group? That requires the strategy to explicitly avoid redundancy. With serial query selection, the most interesting sample of that group can be selected, and with the feedback the model is updated. Recalculating the active learning criterion now indicates that those candidates are no longer interesting.

While this sequential approach has its benefits, it may not be the most practical. As discussed in the previous section, crowdsourced solutions are more accessible than in-house active learning systems. Serial querying places more demands on the communication channel with the user, and thus batch-mode querying must be properly explored. Aside from pool-based sampling, there is stream-based selective sampling (Freund et al., 1997), in which the decision to query an object is made as each new sample of streaming data arises. In some special cases, unobserved points in the input space can be synthesized and considered for query, thus allowing even more control over the information that is gained by the classifier. This scenario is known as membership query synthesis (Angluin, 1988). We discuss pool-based sampling approaches, as the ideas from this general setting can be applied to stream-based problems, since both settings involve the detection of informative samples.

1.3.1 Active Learning for Classification

In the classification domain, candidates for query are single unlabeled samples, and the feedback from the human operator is in the form of labels for those samples. Active learning algorithms are typically based on some statistic of the sample that captures its suitability for query. One of the most popular active learning frameworks is uncertainty sampling (Lewis & Gale, 1994), in which the feature of interest is a measure of uncertainty with respect to the classifier. The sample with the highest measure of uncertainty is selected for query. Consider samples $x \in X$, where $X$ is a pool of candidates for query. Let the class label be $y \in Y$, with class posterior $P(y \mid x)$. Let the MAP class be

\[ \hat{y}_1 = \arg\max_y p(y \mid x), \]

and the "runner-up" class be

\[ \hat{y}_2 = \arg\max_{y \in Y \setminus \hat{y}_1} p(y \mid x). \]

Then, some example uncertainty sampling strategies are least confident (LC), margin sampling (M) (Scheffer et al., 2001), and maximum entropy (H) (Shannon, 2001), represented by

\[ x_{LC} = \arg\max_x \{ 1 - p(\hat{y}_1 \mid x) \}, \]
\[ x_{M} = \arg\min_x \{ p(\hat{y}_1 \mid x) - p(\hat{y}_2 \mid x) \}, \]
\[ x_{H} = \arg\max_x \Big( -\sum_{y_i \in Y} p(y_i \mid x) \log p(y_i \mid x) \Big), \]

respectively. In all these strategies, uncertainty sampling chooses the sample for which the classification decision is least clear. These are samples on the boundaries between classes, as shown in Figure 1-5.

Figure 1-5. Informative samples for classification. Given the training data (red and blue), the candidate sample between the two classes is informative, while the sample clearly in the red class is not.
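Given a matrix of class posteriors, each of the three criteria above is only a few lines. The sketch below (Python with NumPy) is a minimal illustration of the formulas; the names are our own:

import numpy as np

def least_confident(P):
    # P: (n_samples, n_classes) class posteriors; returns index to query.
    return int(np.argmax(1.0 - P.max(axis=1)))

def margin(P):
    top2 = np.sort(P, axis=1)[:, -2:]              # two largest posteriors
    return int(np.argmin(top2[:, 1] - top2[:, 0])) # smallest margin

def max_entropy(P):
    H = -np.sum(P * np.log(P + 1e-12), axis=1)
    return int(np.argmax(H))

P = np.array([[0.90, 0.05, 0.05],   # confident sample
              [0.40, 0.35, 0.25],   # ambiguous sample
              [0.60, 0.30, 0.10]])
print(least_confident(P), margin(P), max_entropy(P))  # 1 1 1

All three strategies agree on this small pool, but they can differ when the posterior mass is spread over many classes (favoring entropy) versus split between two (favoring the margin).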

The second major class of active learning algorithms is query-by-committee (Seung et al., 1992). In the query-by-committee framework there exists an ensemble of $N$ models $\theta_1, \ldots, \theta_N$, each representing a classifier that is consistent with the labeled training data, but with differing opinions on the unlabeled samples. Each model $\theta_i$ in the ensemble provides its own label $y_i(x)$ for a candidate sample $x$. Therefore, samples which engender the most conflicting hypotheses among the models are considered interesting samples worthy of query. One measure of conflict is the vote entropy, and applying this rule makes the following query decision,

\[ x_{VE} = \arg\max_x \; -\sum_{y \in Y} \frac{N_y}{N} \log \frac{N_y}{N}, \]

where $N_y = \sum_{i=1}^N 1_y(y_i(x))$, i.e., the number of models voting for sample $x$ to be in class $y$.

Another class of methods seeks samples which would most greatly change the model if their labels were known. The expected gradient length (EGL) (Settles et al., 2008) functions on probabilistic classifiers, with parameters $\theta$ learned via gradient ascent. The magnitude of the gradient of the objective function is an indicator of the impact that a sample will have on learning. Of course, since the label of the sample is unknown, an expectation of the magnitude of the gradient must be taken over all classes with respect to the estimated class posteriors. The EGL decision is therefore

\[ x_{EGL} = \arg\max_x \sum_i p(y_i \mid x)\, \big\| \nabla J(\theta^{+(x, y_i)}) \big\|, \]

where $\nabla J(\theta^{+(x, y_i)})$ is the gradient, with respect to the parameters, of the objective function of the model that is trained with sample $x$ in class $y_i$. A similar method is expected error reduction (Roy & McCallum, 2001), which estimates the expected error of each sample over all possible decisions for that sample.

Density methods consider both the uncertainty of a sample and the number of unlabeled samples in its neighborhood, thus better characterizing the impact of that query. The information density (Settles & Craven, 2008) is a query selection criterion with a term that values samples close to others,

\[ x_{ID} = \arg\max_x \; \phi(x) \left( \frac{1}{|X|} \sum_{x' \in X} \mathrm{sim}(x, x') \right)^{\beta}, \]

where $\phi(x)$ is some base informativeness measure of $x$, such as from uncertainty sampling or query-by-committee, and $\beta$ is a parameter controlling the importance of the density term.
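Returning to the vote entropy rule above, the following sketch (Python with NumPy; our own illustration) picks the sample on which a committee of classifiers disagrees the most:

import numpy as np

def vote_entropy_query(votes, n_classes):
    # votes: (n_models, n_samples) integer labels from the committee.
    n_models, n_samples = votes.shape
    H = np.zeros(n_samples)
    for s in range(n_samples):
        p = np.bincount(votes[:, s], minlength=n_classes) / n_models
        p = p[p > 0]
        H[s] = -np.sum(p * np.log(p))
    return int(np.argmax(H))

votes = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [0, 2, 1],
                  [0, 1, 1]])    # 4 models, 3 candidate samples
print(vote_entropy_query(votes, n_classes=3))  # 1: the committee splits on sample 1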

Many of these selection criteria are designed with a serial mode of querying in mind. It can become quite computationally expensive to retrain models after each sample is labeled. Criteria that require the model to be trained separately with each candidate sample belonging to each class before making a decision compound this difficulty. Uncertainty sampling and query-by-committee do not have this problem, but some redundancy penalty must be implemented to factor in the effect a sample will have when it is labeled. Another issue in active learning is the stopping criterion, which determines when further queries are no longer needed. A few approaches have offered stopping criteria (Bloodgood & Vijay-Shanker, 2009; Vlachos, 2008; Zhu et al., 2010), but often the true stopping criterion is simply the budget on time or money for human interaction (Settles, 2009).

1.3.2 Active Constraint Selection

Active learning is a more well-defined task for classification because the labeled training data provide known information, and the problem reduces to finding samples that are not explained by that knowledge. In active constraint selection, in which pairs of samples are queried to receive constraints, there is no such reference point. Pairs of samples are also more cumbersome to work with than individual samples. A single sample can easily be compared to other samples, but it is difficult to compare two pairwise constraints. In terms of the search space, for classification only the $N$ samples in the pool must be considered, while $\frac{N(N-1)}{2}$ unique pairs exist in that pool. Active constraint selection has been the focus of much less research than active classification, which places it as an open issue in constrained clustering (Wagstaff, 2007).

Regardless, active constraint selection involves some of the same intuition and faces similar difficulties as active learning for classification. As with classification, a pair of samples is a good candidate for query if their relationship is unclear and if knowing their relationship would be informative.

Many active constraint selection methods attempt to find indices which quantify the ambiguity or informativeness of a possible constraint. This is similar to the uncertainty sampling idea. The query-by-committee framework is also represented, with methods that use clustering ensembles to identify difficult samples. Avoiding redundancy is an important issue in active constraint selection, because it is not widely assumed that with each acquired constraint the data can be reclustered entirely. As stated earlier, constrained clustering tends to be a computationally intensive process, and so it is easier to simply design query selection techniques with redundancy in mind.

Active constraint selection is a very important issue for constrained clustering because clustering is inherently a subjective task that needs to be guided to satisfy the goals of the user. Proper constraint selection not only improves performance, but prevents possible failures that can occur when the wrong constraint set is used (Davidson et al., 2006). In that work, two measures for quantifying the utility of a constraint set were proposed: informativeness and coherence. Informativeness is the amount of information that the constraint set provides relative to an unsupervised clustering of the data. Coherence measures the agreement amongst the constraints themselves. These properties are illustrated in Figure 1-6.

Figure 1-6. Informativeness and coherence of a constraint set. A) The constraint set is informative because it contains information not obvious from the data. B) The two constraints are not coherent because they offer contradictory information.

One of the earliest active constraint selection methods for clustering was the farthest-first query selection (FFQS) algorithm (Basu et al., 2004a). FFQS begins with an explore phase, in which samples are iteratively selected that are furthest from all previously queried samples, which have been grouped into must-link neighborhoods. Recall that a must-link neighborhood consists of the must-link constraints logically deducible from the transitive closure of existing must-link constraints. Each new sample selected in FFQS is queried against each must-link neighborhood in turn, stopping when a neighborhood that the new sample belongs to is found. If no must-link result is obtained during this process, the new sample is established as a new neighborhood. Following the explore phase, the consolidate phase randomly selects new samples and compares them with existing must-link neighborhoods until a match is found. The consolidate phase of FFQS was improved with min-max farthest-first query selection (MMFFQS) (Mallapragada et al., 2008), which provided a min-max criterion for selecting uncertain samples with respect to the existing neighborhoods.

The FFQS and MMFFQS procedures are intuitively simple but, in our experience, are some of the most reliable and effective. The must-link neighborhoods created by the explore phase serve as "classes." When a sample is linked to one of the must-link neighborhoods, it gains a tremendous amount of information, as we then know that it is linked to every sample in that neighborhood and not linked to every other neighborhood. This is the rich information that is present in class labels, but typically absent in pairwise constraints. However, FFQS involves a serial querying style, and is thus not applicable when processing cannot be done after each bit of feedback from the oracle, as is the case with batch querying.
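The explore phase is essentially farthest-first traversal. The sketch below (Python with NumPy) is our own illustration of that selection rule, with the human oracle stubbed out as a callable; it is not the authors' implementation:

import numpy as np

def ffqs_explore(X, oracle_same_class, n_queries):
    # Grow must-link neighborhoods by repeatedly selecting the sample
    # farthest from all queried samples; oracle_same_class(i, j) -> bool
    # stands in for the human oracle.
    neighborhoods = [[0]]            # seed with an arbitrary first sample
    queried = [0]
    for _ in range(n_queries):
        # Farthest-first: maximize distance to the nearest queried sample.
        dists = np.linalg.norm(X[:, None, :] - X[queried][None, :, :], axis=2)
        candidate = int(np.argmax(dists.min(axis=1)))
        queried.append(candidate)
        for hood in neighborhoods:   # query against each neighborhood in turn
            if oracle_same_class(candidate, hood[0]):
                hood.append(candidate)
                break
        else:                        # no must-link found: new neighborhood
            neighborhoods.append([candidate])
    return neighborhoods

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = [0, 0, 1, 1]
oracle = lambda i, j: labels[i] == labels[j]
print(ffqs_explore(X, oracle, n_queries=3))  # [[0, 1], [3, 2]]: both neighborhoods found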

The query-by-committee framework is represented in active clustering (Al-Razgan & Domeniconi, 2009; Duan et al., 2008; Greene & Cunningham, 2007; Okabe & Yamada, 2014; Yang et al., 2012). There are a few approaches for generating diverse clustering ensembles (Vega-Pons & Ruiz-Shulcloper, 2011), including using different clustering algorithms, parameter values, initialization values, subsets of features, or projections of the data. As with classification, ensemble disagreement is the selection criterion. Disagreement across the ensemble can be summarized in the co-associativity matrix $A$, where $A_{ij}$ is the fraction of ensemble clusterings that put samples $x_i$ and $x_j$ in the same cluster.

In Abin & Beigy (2014), the sequential approach for active constraint selection (SACS) algorithm made two assumptions on the nature of informative constraints. The first assumption seeks pairs of samples that exist in separate dense regions of data, but with a short distance between them (possible cluster boundary samples). The second seeks constraints for samples in the same dense region, with a great distance between them. Active fuzzy constrained clustering (AFCC) (Grira et al., 2008) iteratively computes constrained clusterings of the data, and selects new constraints based on non-redundant samples on the boundaries of the least well-defined clusters, as measured by the fuzzy hypervolume. A measure called the ability to separate between clusters (ASC) (Vu et al., 2010) has been developed, which rewards pairs of samples that span sparse regions in the data space, thus defining transitional regions between possible clusters. The k-nearest neighbor graph is employed to propagate constraints between neighbors, thus ensuring non-redundancy. Active constrained clustering by examining spectral eigenvectors (ACCESS) (Xu et al., 2005), in the two-cluster case, examines eigenvectors of the similarity matrix to identify samples on cluster boundaries.
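Computing the co-associativity matrix defined above from an ensemble of partitions is direct (Python with NumPy; a sketch with our own names):

import numpy as np

def co_association(partitions):
    # partitions: (n_clusterings, n_samples) integer labels.
    # A[i, j] = fraction of clusterings placing samples i and j together.
    P = np.asarray(partitions)
    n_clusterings, n_samples = P.shape
    A = np.zeros((n_samples, n_samples))
    for labels in P:
        A += (labels[:, None] == labels[None, :]).astype(float)
    return A / n_clusterings

partitions = [[0, 0, 1, 1],
              [0, 1, 1, 1],
              [0, 0, 0, 1]]
A = co_association(partitions)
print(round(A[0, 1], 3), round(A[2, 3], 3))  # 0.667 0.667

Pairs with co-association near 0 or 1 are settled; pairs near one half are where the ensemble disagrees, making them natural query candidates.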

It is very natural to consider human-guided hierarchical clustering, and this idea drives a few constraint selection strategies. In the original constrained complete-link (Klein et al., 2002) it was hinted that querying top-level merges is a good practice. Hierarchical confidence-based active clustering (HCAC) (Nogueira et al., 2012) provides a measure for the confidence of merges. A human oracle is queried to clarify the ambiguity in merges that are not confident, though this method does not prescribe which samples are queried. The clusters themselves are presented to the user using some visualization tactic. In the constrained active clustering 1 (CAC1) algorithm (Biswas & Jacobs, 2011), an initial set of constraints is used to learn two distributions of merge distances, one for must-links and one for cannot-links. These distributions are used to select merges which are ambiguous for query.

A few principles are fairly consistent in active constraint selection. It is important to define cluster boundaries and make non-redundant queries. However, one of the most important functions of query selection is to reveal unintuitive clusterings of the data. In real problems, we wish for clusters to be tied to classes. Thus, it is unreasonable to assume that clusters are separated, dense regions. The boundaries between these "clusters" may not even be more sparse than within the cluster itself. The number of clusters is also a piece of information that may not be known. Good constraint selection methods will not only clarify cluster boundaries, but discover clusters which cannot be discovered from the structure of the data itself.

1.4 Outline and Contributions

This dissertation focuses on the practical issues associated with active constrained clustering. In Chapter 2, a method for active learning in the open set classification domain is presented. This method seeks to use human feedback to discover new classes. It is applied to the automatic target recognition problem for sonar images. In Chapter 3, a constrained clustering method based on the dendrogram of hierarchical clustering is presented. This method is based on a novel concept: constraints need not be propagated during agglomerative hierarchical clustering, and are instead used to manipulate the dendrogram produced by an unconstrained run of hierarchical clustering. This avoids the need to perform the computationally difficult propagation of these constraints and permits the use of any highly optimized unconstrained hierarchical clustering method. A complementary method for active constraint selection is provided, which also utilizes the dendrogram, providing an extremely efficient and reliable active constrained clustering framework. The problem of a noisy oracle is treated in Chapter 4.

Human oracles are imperfect, and errors in their feedback are inevitable even for highly trained annotators. A method for removing harmful constraints is provided and tested using noise models developed from real human feedback. Finally, in Chapter 5, a comparative study of constrained clustering and classification is provided to determine the utility of constrained clustering, given similar amounts of information.

CHAPTER 2
ACTIVE LEARNING FOR CLASS DISCOVERY

Portions of this chapter are published in the following manuscript: Kriminger, E., Cobb, J. T., and Príncipe, J. C. (2015). Online active learning for automatic target recognition. IEEE Journal of Oceanic Engineering, 40(3), 583-591.

A related problem to clustering is open set classification. In open set classification, training data is available for some classes, but unknown classes may exist in the dataset. In fact, if the training labels are converted into an equivalent set of pairwise constraints, constrained clustering may be used to solve this problem. In this chapter we handle the online open set classification problem. In the online case, the data source is streaming and decisions must be made on each sample as it arrives. Clustering is not well-suited to the online case, as constrained clustering methods do not make incremental decisions without recomputing the entire clustering. Classification models can easily label incoming samples. The challenge is then to create an open set classification method capable of discovering new classes in an online fashion.

Active learning is important for open set classification because a human oracle can help establish the unknown classes. This problem motivates new active learning approaches with a focus on streaming data and class discovery. The query-by-committee framework is a promising query selection strategy, since there is no longer the difficulty of retraining, reclassifying, and reapplying active learning algorithms each time a query is issued. Instead, when information is gained, the committee is pruned such that incorrect hypotheses are simply removed. The difficulty in this framework is in generating a diverse committee of models, each consistent with the training data. If the range of hypotheses present in the committee is not representative of the uncertainty of each sample, then disagreement among the various models in the committee will not serve as a good basis for query, since this disagreement is only founded on the randomness with which the committee was synthesized.

We propose to fuse the philosophies of uncertainty sampling and query-by-committee by dynamically generating committees that reflect the uncertainty of the samples.

In this chapter, an active learning method which seeks to unite query-by-committee and uncertainty sampling approaches is presented. An ensemble of models is built dynamically to represent multiple reasonable hypotheses for the data and reflect the uncertainty in each incoming sample. From the ensemble, informative samples are selected as targets for query. With the proposed model trees algorithm, a system for open set classification is developed, which both classifies incoming samples and detects new clusters from the data stream in an unsupervised fashion. The system employs model trees, but introduces a class consisting of outliers as a potential decision. Model trees uses queries to clarify ambiguous samples, including those that may be outliers. The outlier bin can then be clustered separately to detect new objects. Our approach is motivated by the automatic target recognition problem of sonar image processing.

2.1 Model Trees Active Learning

Our active learning algorithm, model trees (MT), is based on an adaptive ensemble of mixture models of the data. Initially, there is a single model, with likelihoods $f_{\theta_y}$ for each class $y \in \{1, \ldots, C\}$ that are estimated from a training set of known objects. For a new sample $x$ that is encountered, the features $\epsilon_i(x)$ for $i = 0, \ldots, C$ are calculated as in Algorithm 2-2. $\epsilon_i(x)$ is converted into a Gibbs distribution,

\[ p(y = i \mid x) = \frac{\exp(-\beta\, \epsilon_i(x))}{\sum_{i=0}^{C} \exp(-\beta\, \epsilon_i(x))}, \]

where $\beta > 0$ is a temperature constant controlling the degree to which the distribution is flattened.
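A minimal sketch of this conversion (Python with NumPy; our own illustration, writing the temperature as beta and shifting the features for numerical stability, which leaves the distribution unchanged):

import numpy as np

def gibbs(eps, beta):
    # Convert divergence features eps_i(x) into p(y = i | x).
    # Larger beta sharpens the distribution; smaller beta flattens it.
    w = np.exp(-beta * (eps - eps.min()))
    return w / w.sum()

eps = np.array([1.2, 0.3, 2.5, 0.9])     # outlier bin plus three classes
print(np.round(gibbs(eps, beta=6.0), 3)) # near one-hot on the smallest feature
print(np.round(gibbs(eps, beta=0.5), 3)) # flattened: alternatives keep mass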

Figure 2-1. Branching process of model trees active learning. From the model, a feature $\epsilon_i(x)$ is produced for each of the $C$ classes and for the outlier bin ($y = 0$). Drawing from the Gibbs distribution determines if there will be a new model in the ensemble.

Typically, we would use $x$ to update the likelihood parameters of the class $y^* := \arg\max_i p(y = i \mid x)$ (or ignore the sample if $y^* = 0$). However, if the decision is very difficult, i.e., $p(\cdot \mid x)$ has a small margin, then not as much confidence can be put on that decision. It is advisable, then, to consider other likely alternatives. Rather than making a hard decision on $x$, a soft decision can be made by drawing from the Gibbs distribution above. Let the result of this draw be $y_{alt}$. If $y^*$ is the same as $y_{alt}$, then only one model is updated, and the number of hypotheses stays the same. If $y_{alt}$ is a different outcome, then we first update $\theta^{(1)}_{y^*}$ with $x$, and then update $\theta^{(2)}_{y_{alt}}$.

The superscripts (1) and (2) represent different hypotheses which form our committee. The branching process is illustrated in Figure 2-1. By continuing in this way, a "tree" of models is built, with each path through the tree representing different hypotheses. When many of our classification decisions carry a large amount of uncertainty, the number of branches in the tree will grow. This is an indication that the classifier could be significantly improved with feedback from a human operator. By setting a limit on the number of branches, we have a means of determining when to initiate a query that is related to the amount of uncertainty in our models.

The sample that is to be queried should best reduce the number of competing hypotheses. The amount of disagreement among the hypotheses can be quantified by the entropy of the labels given to that sample. Let the label given to $x$ by the $j$th member of the committee be $y^{(j)}(x)$. We can construct a histogram of the committee's votes to compute the entropy for each sample. The label distribution across a committee of size $M$ is given by

\[ p_{com}(Y(x) = l) = \frac{1}{M} \sum_{j=1}^{M} \delta(l, y^{(j)}(x)) \]

for a label outcome $l$, where $\delta$ is the Kronecker delta. Then the Shannon entropy for the label of $x$ is

\[ H(Y(x)) = -\sum_{i=0}^{C} p_{com}(Y(x) = i) \log p_{com}(Y(x) = i). \]

The sample chosen for query is simply the one with maximum label entropy,

\[ x_{query} = \arg\max_x H(Y(x)). \]

The sample in question is then presented to the human operator, who specifies its true label. All branches which incorrectly labeled the sample are removed, thus pruning the tree, with the more accurate models remaining. If all hypotheses are assumed to be equally likely, then the maximum label entropy query decision maximizes the expected number of branches which are removed from the tree.

Algorithm 2-1. Model trees active learning
  Input: temperature constant beta, max committee size MaxPaths, likelihood parameters {theta_i} for i = 1, ..., C
  Output: committee of labels
  Initialize: M = 1
  For each new test sample x do
    For j = 1 to M do
      Compute p(. | x) from the Gibbs distribution
      y* <- argmax_i p(y = i | x)
      Update theta^(j)_{y*} with x
      Draw y_alt ~ p(. | x)
      If y_alt != y* then
        M <- M + 1
        theta^(M)_{y_alt} <- theta^(j)_{y_alt} updated with x   (new branch)
      End if
    End for
    If M > MaxPaths then   (query)
      Find x_query by maximum label entropy
      Receive query response y_true
      Delete committee members {j | y^(j)(x_query) != y_true}
    End if
  End for

2.2 Open Set Classification

In the open set classification problem, the number of classes is unknown, and training data is only available for a subset of the possible classes. Open set classification is a problem for which active learning is particularly useful: human feedback helps establish new classes. The use of model trees as an active learning criterion is tested on an open set classification problem called automatic target recognition.

2.2.1 Automatic Target Recognition

Side-look sonar images are used extensively in mine countermeasure (MCM) applications, where the sonar images are used to detect and classify various types of underwater unexploded ordnances (UXOs) (Sternlicht et al., 2011) or sea mines (Dobeck et al., 1997). It is not feasible for human operators to inspect each sonar image, due to limitations in the underwater communication channel and the time burden it would put on the operator. Automatic target recognition (ATR) systems aim to alleviate the burden from human operators by quickly and accurately finding targets of interest in these data.

The ATR problem possesses some interesting challenges. While many sonar images are acquired during scans of the seabed, few of these images actually contain objects of interest. Therefore, with respect to constructing models of these objects, the problem is data-poor. Features are extracted from the raw sonar images, which represent the shape of the objects present. To provide sufficient discriminative power, the dimensionality of the feature space can be quite large relative to the number of available samples. It is therefore difficult to estimate models, and self-labeled samples must often be used for training to improve the models. Furthermore, the survey platform acquires data from multiple viewing angles as it makes passes over the seabed. Thus, observations of the same object may be very scattered in feature space, owing to effects such as shadows and partial concealment by the seafloor substrate.

A full ATR system consists of sensor hardware, image/sonar processing, and the algorithms for object detection and classification. Many ATR systems incorporate some or all of these techniques. In Reed et al. (2003), mine-like objects are detected using a Markov random field model, the object's shadow is extracted using a co-operating statistical snake (CSS), and a classification decision is made by comparing to the shadows of known shapes. In Hasanbelliu et al. (2009), objects are compared to templates with a nonlinear extension of the matched filter based on correntropy. An active learning system is developed for use with side-scan sonar imaging systems in Dura et al. (2005), in which a kernel classifier is built using a small number of samples labeled by a human expert.

In Dobeck et al. (1997), mine-like regions of the sonar images are detected, features are automatically selected, and objects are classified with a k-nearest neighbor attractor-based neural net classifier fused (or cascaded) with an optimal discrimination filter classifier.

The ATR system we propose here seeks to classify objects as they appear in a sonar survey, which is an online rather than batch classification task. An online approach is attractive in that it enables reactive behaviors from a sensing platform, such as changing a search pattern to improve detection performance mid-survey. Rather than maintaining a single classifier, an ensemble of classifiers is built adaptively, such that likely alternative hypotheses about new data may be captured. Unlikely samples are stored in an outlier bin, from which new objects can be detected using unsupervised techniques. An active learning scheme is proposed in conjunction with our ensemble, such that a small amount of human feedback may be utilized to improve performance in the online and nonstationary environment.

2.2.1.1 Supervised methods

There are two main classes of algorithms commonly used in ATR: supervised and unsupervised. The supervised method of classification requires a fixed number of classes and a training set of labeled data points, from which the classifier can be built. Supervised learning is reliable, but only if the statistics of the training data are consistent with those of the data which will be encountered during operation. Moreover, it is an oversimplification of the real-world scenario of mine detection, where the number of different objects of interest is unknown a priori. Therefore, mine detection can only be solved with supervised techniques if only a few known types of mines are to be detected. Under this huge constraint (closed set classification), a training set can be built, comprised of templates for the desired mine types. Nevertheless, even in this case, threshold values have to be set to determine when a sample does not belong to any of the mine types.

The ability to make decisions on samples as they are acquired is important in ATR systems for a variety of reasons. The large amount of uninteresting data that is continuously acquired is a burden to store and process in batch mode. It is preferred to reject these points as they arrive, with an online system. In addition, the task may require real-time detection, and hence the decision on a sample cannot wait for more observations. Classification is a very natural way to group samples as they are acquired. While standard classifiers lack the ability to detect new objects, their ability to efficiently process ATR data online is quite useful. Real ATR tasks are quite dynamic, and the classification of desired objects is confounded by the presence of natural and man-made clutter. Assumptions on noise and stationarity are violated, which can limit the effectiveness of templates generated in a much different environment. In addition, it is often useful for the ATR system to detect novel object types other than those which have been specified initially (open set classification). Therefore, we must consider training data to be a limited picture of the possible objects that will be observed in practice.

2.2.1.2 Unsupervised methods

The unsupervised method, commonly called clustering, seeks to find structure in the data itself, without labels. Observations of an object are assumed to be found in clusters, which can be distinguished from observations of different objects. This is not necessarily the case, since there may be overlaps between objects, or one object may be split into more than one cluster. The latter case requires cluster merging, which is automatically done in classification because of the label information. Moreover, clustering algorithms require the specification of the number of clusters, or in our case, the number of objects which are to be detected. This is quite a large assumption, and cannot be made in the ATR problem, as a group of observations may be tied to one or many objects.

Without specifying the number of clusters, this is a considerably more difficult problem. Dirichlet process mixture modeling (DPMM) (Escobar & West, 1995) is a non-parametric Bayesian clustering approach that fits the data to a possibly infinite number of component densities. The Gaussian mean shift (GMS) (Cheng, 1995b) is a method which seeks modes of the underlying density function of the data, by iteratively improving an estimate of these modes. Although there is no need to specify the number of clusters, the free kernel parameter in the algorithm controls the number of clusters discovered. Furthermore, many unsupervised methods are not designed for online use. Clustering the full dataset can become infeasible as more and more samples are acquired.

Nonetheless, unsupervised methods are highly desirable in the ATR problem, because they allow for the discovery of new objects, and do not require the creation of labels, which is expensive and a bottleneck. Even if only a few objects are of interest, and there exist many training examples for these objects, many uninteresting objects will be encountered in the field. The superfluous objects include large rocks and debris, biological clutter, and artifacts from turns and shadows. Not accommodating these various objects puts the burden of excluding them on a properly tuned threshold or anomaly detection system. It is difficult to set these parameters for use in a nonstationary environment, and even small changes can result in missed detections or false alarms. If models can be built for the clutter objects as well, then the problem becomes one of discrimination, which requires no arbitrary thresholds.

2.2.2 Proposed System Description

We introduce an ATR system which is designed to exploit as much external information about the problem as possible, without making any assumptions that will be violated in the dynamic environment. A training set is used, condensing the best knowledge available on a set of targets of interest, which gives the system an idea of the dispersion present in the data as a characteristic of the environment and sensors, rather than simply a closed set of templates for future objects.

Human feedback can be incorporated to verify previous decisions, but the system does not require this interaction, and can utilize it whenever it is available. The ATR system is designed with online functionality in mind, such that each new observation is processed without reprocessing the full dataset. Previously unseen objects are detected, and no assumptions are made on the number of objects that will be encountered, allowing deployment in any unknown environment.

The system consists of three components. The first is a classifier with information-theoretic features (Hasanbelliu et al., 2011), which determines if a new observation belongs to an existing object or should be placed into the outlier bin. The second is model trees active learning, which maintains an ensemble of these classifiers, representing alternate hypotheses; when human feedback is available, this feedback removes incorrect hypotheses. Finally, a new object detection algorithm works on the outlier bin to determine if a subset of those samples represents a new object. The flowchart for processing a sonar observation with our system is shown in Figure 2-2.

2.2.2.1 Classification with thresholds

In the ATR problem, a class represents an object, and the samples which make up the class represent the multiple observations of that object. Let the feature vector of an observation be $x$. Let there be $C$ known classes, designated by the label $y \in \{1, \ldots, C\}$. For each known class $y$, we have a likelihood $f_{\theta_y}(x)$, where $\theta_y$ is an estimate of the likelihood parameters. These parameters are estimated with the dynamic tree (DT) estimator developed in previous work (Kampa et al., 2012). Dynamic trees are graphical models, where a group of sensor measurements is represented probabilistically as the leaf nodes in a forest of DTs, with root nodes representing the objects themselves. DTs are particularly well-suited to online estimation, and in our tests have performed much better than methods such as maximum likelihood when very few training samples are available relative to the dimensionality of the data.

Figure 2-2. Active learning for sonar image processing. A sonar image is preprocessed, resulting in a new data point in feature space. The model trees active learning algorithm consists of an ensemble of classifiers. Querying the human results in a pruning of these models. New object detection is continually run, altering the models as new objects are detected.

In this paper the $f_{\theta_i}(\cdot)$ are multivariate normal distributions. Rather than using the generative model for maximum likelihood or maximum a posteriori classification, we use an information-theoretic feature based on a probability divergence, $D$. For each class, this feature is the change the likelihood undergoes when its parameters are re-estimated with the inclusion of the sample to be classified, $x_{new}$. The classification procedure with this feature is described in Algorithm 2-2.

Algorithm 2-2. Classification with divergence-of-likelihoods feature
  Input: sample to classify x, likelihood parameters {theta_i} for i = 1, ..., C, training sets {X_i} for i = 1, ..., C
  Output: class label y*
  For i = 1 to C do
    X~_i <- X_i union {x}
    theta~_i <- DynamicTree(X~_i)
    eps_i(x) <- D(f_{theta_i} || f_{theta~_i})
  End for
  y* <- argmin_i eps_i(x)

This divergence-based feature performs comparably to negative log-likelihood for classification. However, the value of divergence quantifies the impact a sample has on our existing models, which is important in our data-sparse ATR problem, where individual samples can influence the model substantially. It is therefore a more natural feature for selecting outlier bin thresholds than likelihoods, as will be developed below. For $D$, the Cauchy-Schwarz (CS) divergence is used instead of the Kullback-Leibler because it is more efficient to compute for mixtures of multivariate normal distributions (Kampa et al., 2011). For two continuous pdfs, $f$ and $g$, the CS divergence is defined to be

\[ D_{CS}(f; g) = -\log\left( \frac{\int f(x)\, g(x)\, dx}{\sqrt{\int f(x)^2\, dx \int g(x)^2\, dx}} \right). \]

Classification with DT estimators and CS divergence features was introduced in prior work (Hasanbelliu et al., 2012).
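For one-dimensional densities the definition can be checked numerically on a grid. The sketch below (Python with NumPy) is our own illustration of the formula, not the closed-form computation for Gaussian mixtures used in the cited work:

import numpy as np

def cs_divergence_1d(f, g, grid):
    # Numerical Cauchy-Schwarz divergence between two 1-D pdfs on a grid.
    fx, gx = f(grid), g(grid)
    cross = np.trapz(fx * gx, grid)
    norm = np.sqrt(np.trapz(fx ** 2, grid) * np.trapz(gx ** 2, grid))
    return -np.log(cross / norm)

gauss = lambda mu, s: (lambda x: np.exp(-(x - mu) ** 2 / (2 * s ** 2))
                       / (s * np.sqrt(2 * np.pi)))
grid = np.linspace(-10, 10, 5001)
print(cs_divergence_1d(gauss(0, 1), gauss(0, 1), grid))  # ~0 for identical pdfs
print(cs_divergence_1d(gauss(0, 1), gauss(3, 1), grid))  # grows with separation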

In this formulation, we have assumed that the number of classes is known and that there exist training samples for each class. This is only possible when the set of objects to be detected is known in advance and training examples of these objects from multiple views are available. For an ATR system capable of detecting new objects as well as known ones, a means of accommodating new classes of objects must be developed.

For some samples, the likelihood divergence is high for all classes. This is the case when a new object has been detected, for instance. Rather than assign these samples to the most likely class, we set them aside in an outlier bin. Binned samples can be grouped with other bin samples if a new object is present, or simply remain in the bin instead of corrupting existing object models if they are not observations of interest. We establish a threshold $T_i$ on the likelihood divergence $\epsilon_i$ for each class $i$, based on the maximum value that is encountered in the training set:

\[ T_i = \max_{x \in X_i} \epsilon_i(x). \]

We define $y = 0$ for samples in the outlier bin, and there is no accompanying $\theta_0$. The likelihood divergence for the bin, $\epsilon_0(x)$, will be defined as the threshold of the minimum-divergence class,

\[ \epsilon_0(x) = T_{\arg\min_i \epsilon_i(x)}. \]

This definition will be used for stochastically designating samples to the outlier bin in our active learning algorithm.

Figure 2-3. New object detection with learned parameters. The previously classified data serves as a reference to learn clustering parameters that best represent the data. The outlier bin is then clustered to discover new objects.

2.2.2.2 New object detection

Each hypothesis in our committee places samples in its own outlier bin. It is from these bins that the samples comprising a new object are detected as clusters, and a model is built for that object. The statistics of these undesired observations are unknown, but at least from the known classes we have some estimate of the variation present in a group of observations. Since clustering algorithms typically have a parameter which determines the scale on which clusters are identified, we can tune this parameter to the known classes. This should provide a more problem-specific balance of false positives and missed detections than if the parameters were chosen a priori.

Algorithm 2-3. New object detection (run in Algorithm 2-1 for each member of the committee at periodic intervals)
  Input: unlabeled dataset X_u, labeled dataset (X_l, Y_l), base clustering method Cluster, set of candidate clustering parameters Lambda
  Output: labels for X_u
  For lambda in Lambda do
    Y~_l <- Cluster(X_l; lambda)
    r_lambda <- RandIndex(Y_l, Y~_l)
  End for
  lambda* <- argmax_lambda r_lambda
  Y_u <- Cluster(X_u; lambda*)

To determine these clustering parameters, first a base clustering algorithm is chosen. This algorithm must be nonparametric (the number of clusters does not need to be specified) because it is unknown how many objects exist in the outlier bins. The parameters vary for each base clustering method: kernel bandwidth for GMS and concentration parameter for DPMM. By applying the clustering algorithm to the data for known classes, we can find the parameter set which yields a clustering of the data that best matches the existing labels. With the properly set parameter, the outlier bin can be clustered at desired times, as shown in Figure 2-3. The stored labels within the committee are changed when their outlier bins are clustered. See Algorithms 2-1 and 2-3 for a full description of model trees with new object detection.
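A sketch of this parameter selection loop (Python; our own illustration using scikit-learn's MeanShift as the base clusterer and the adjusted Rand index as the agreement score; any nonparametric clusterer with a scale parameter could be substituted for the GMS or DPMM mentioned above):

import numpy as np
from sklearn.cluster import MeanShift
from sklearn.metrics import adjusted_rand_score

def tune_and_cluster(X_labeled, y_labeled, X_outlier, bandwidths):
    # Pick the bandwidth whose clustering of the labeled data best matches
    # the known labels, then cluster the outlier bin with it.
    scores = [adjusted_rand_score(y_labeled,
                                  MeanShift(bandwidth=bw).fit_predict(X_labeled))
              for bw in bandwidths]
    best = bandwidths[int(np.argmax(scores))]
    return MeanShift(bandwidth=best).fit_predict(X_outlier), best

rng = np.random.default_rng(0)
X_l = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
y_l = np.array([0] * 30 + [1] * 30)
X_u = np.vstack([rng.normal(8, 0.5, (20, 2)), rng.normal(-4, 0.5, (20, 2))])
labels, bw = tune_and_cluster(X_l, y_l, X_u, bandwidths=[0.5, 1.0, 2.0, 4.0])
print(bw, np.unique(labels).size)  # the tuned bandwidth recovers two new objects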

Figure 2-4. Example sonar images. Examples of rock, grass, and sand style seabed textures and cylinder, box, and cone shaped objects. Reproduced with permission from the Office of Naval Research, Naval Surface Warfare Center, Panama City, Florida.

2.3 Experiments

We test the active learning and new object recognition components of our ATR system and compare with other methods. The dataset used is fully described in Hasanbelliu et al. (2011) and is a partially synthetic simulation of a target field of objects. Each trial runs on a group of 3 closely spaced objects, with 24 observations of each object. The feature vector is 8-dimensional. The first two features are the approximate $x$-$y$ coordinates of the observations. These coordinates are synthetic and are drawn from gamma, beta, and normal distributions. The difficulty of the classification task varies throughout the dataset and is largely a function of the closeness of the object centers. The next three features are shape features acquired by applying edge detection techniques on the raw sonar images and extracting edges. There are 3 objects found in our dataset: cones, cylinders, and boxes. The final three features are surface features, which characterize the seafloor textures and are also acquired from the raw sonar images.

There are 6 seafloor types, with two varieties each of grass, rock, and sand. Some example sonar images are shown in Figure 2-4. All possible combinations of the 3 shapes, 6 backgrounds, and coordinate distributions are present in the dataset. Thus, the 3 objects that appear in each trial may have the same or different backgrounds and shapes. It is most difficult to detect distinct objects when they possess the same shape and appear over the same background.

One natural way to evaluate the benefit of an active learning methodology is to characterize how the models improve over time. To measure this improvement, in each trial we measure the divergence between the learned model at each time step in testing and the best possible model. In this case, the best model is the one which is trained with all the correctly labeled samples. By computing this divergence for each class, and taking the norm of the divergences of all classes, we have a metric for evaluating the classification system. If we define the estimated likelihood for class $y$ to be $\hat{p}_y := f_{\hat{\theta}_y}(\cdot)$ and the best likelihood to be $p_y$, then our metric is

\[ \sqrt{ D_{CS}(\hat{p}_1; p_1)^2 + \cdots + D_{CS}(\hat{p}_C; p_C)^2 }. \]

This metric is computed for each time step, as the system trains its models with the DT, using self-labeled testing samples. By plotting the metric, averaged over all trials, as a function of time, the progress of learning can be visualized. In fact, this plot can reveal interesting behavior that is not captured in the classification error rate. In Figure 2-5, plotting the model's divergence over time reveals an instance where a classifier utilizing self-learning actually deteriorates as it encounters new samples. A dataset consisting of 3 classes, drawn from 2-dimensional normal distributions, was broken into a training set with 8 samples per class and a testing set of 126 samples. As seen in the figure, the divergence grows as more samples are classified and used to update the models. When a simple implementation of uncertainty sampling with a 9% query rate is implemented on the very same trial, the classifier learns properly.

Figure 2-5. Failure under self-learning. With only 8 samples per class in the training set, the classifier models deteriorate as they are trained with self-labeled samples. Adding active learning with a 9% query rate allows the classifier to learn properly.

In Figure 2-6, the model divergence is plotted over time to compare the model trees algorithm with uncertainty sampling and with the classifier without active learning. The plot is an average over 200 randomly selected trials from the simulated sonar dataset. The DT with likelihood divergence classifier is the default classification scheme. The results for query rates of 9% and 16% are shown. These query rates were determined by setting the temperature parameter of the model trees algorithm to 6 and 3, respectively. The maximum number of paths parameter was fixed at 20. As the uncertainty sampling algorithm does not determine when to query automatically, it was allotted as many queries as made by the model trees. A large improvement can be seen by employing active learning; however, model trees is more query efficient, producing similar results with a 9% query rate as uncertainty sampling achieves with 16%.

In Table 2-1, the classification rates are shown for our classifier equipped with several forms of active learning. The "plain" classifier is the standard classifier used throughout this paper, and uses no active learning.

Figure 2-6. Performance of model trees active learning. Active learning causes the learned models to approach the true models much more quickly than with a plain classifier. With model trees, queries are utilized much more efficiently than with uncertainty sampling.

Table 2-1. Model trees classification comparison

  Queries | Plain | US    | US+Density | EER   | MT
  1       | 83.2% | 84.2% | 84.3%      | 84.3% | 86.1%
  2       | 83.2% | 85.9% | 85.9%      | 86.1% | 86.9%
  3       | 83.2% | 86.2% | 86.4%      | 87.1% | 87.0%

  The plain classifier is compared with uncertainty sampling (US), density-based uncertainty sampling, expected error reduction (EER), and model trees (MT). The scenarios are 1, 2, and 3 queries, made at regular intervals of 10 time steps.

We select three standard active learning methods that can be used with our DT classifier: uncertainty sampling, density-based uncertainty sampling, and expected error reduction, and compare them with model trees. For each of the three objects in our simulated sonar dataset, 10 samples are designated for training, and 14 remain for testing. The 42 testing samples are presented in random order, and the models for each object are learned online via self-learning with the classified samples. For the sake of determining an accurate classification rate for the model trees algorithm, no samples are assigned to the outlier bin, which is ensured by setting its probability to 0.

The maximum number of paths in our tree is set to 20, and the temperature is set to 8. However, the query rate is fixed, so these parameters only affect the number of models in our tree. The implementations of the other three active learning algorithms are taken from Settles (2009), and are parameter free (other than the query times). For comparison we evaluate the algorithms with 1, 2, and 3 queries per trial. These queries are made in intervals of 10 time steps; thus queries would be made at time steps 10, 20, and 30 in the case of 3 queries per trial. At the time of query, all encountered testing samples are considered candidates for query. Querying a sample returns the correct object label of that sample. The classification rate is calculated as the number of correctly classified samples at the end of testing, out of the 42 testing samples. The results are averaged over 100 trials.

All active learning methods significantly improve the classification rate of the plain classifier. Model trees is shown to perform particularly well when very few queries are possible, while the other methods improve quickly as the number of queries rises. This can be attributed to our use of fixed query times in this experiment, rather than querying at the times automatically specified by model trees. By making unnecessary queries, the diversity of our ensemble can be decreased. As a query-by-committee method, the quality of the model trees query decision depends on disagreement amongst the models of the committee. If there is no disagreement, then a query is not likely to have as much impact.

The sensitivity of the query rate of model trees to its two parameters, the temperature and the maximum number of paths, is shown in Figure 2-7. This figure is produced using the same testing setup as was used for calculating the classification rates, and the plotted query rates are averaged over 50 trials. The query rate reduces exponentially in the temperature, while changing the maximum number of paths in the tree merely induces a shift in the query rate.

Figure 2-7. Parameter sensitivity of model trees. The query rate of model trees is not overly sensitive to the maximum number of paths, which can be fixed, allowing the query rate to be established with the temperature parameter.

In practice, the maximum number of paths can be set based on the computational requirements, and the temperature can be set to induce the desired query rate.

We evaluate the ability of the new object detection system in a final experiment in which only one of the three objects in each trial is presented in training (10 training samples). Observations of the two unknown objects arise only during the testing phase, mixed with the remaining 14 observations of the known object. We evaluate algorithms based on their ability to correctly detect the presence of the new objects from the testing observations. We compare our classification-based approach, aided by our new object detection method, with the pure clustering method. We use the Gaussian mean shift both as the base clustering method in our new object detection and as a separate algorithm for determining the number of objects in a set of observations. In the latter case, a purely unsupervised method is used, in which all samples are clustered (including training data). It may seem redundant to cluster the training data as well, but these samples help ensure that the known objects are clustered properly.

Table 2-2. New object detection abilities of model trees active learning

  Objects detected | GMS | MT+NOD | MT+NOD+Query
  1                | 10% | 4%     | 0%
  2                | 22% | 4%     | 3%
  3                | 54% | 89%    | 95%
  4+               | 14% | 3%     | 2%

  The true number of objects is 3, but labeled data for only one of these objects is provided. The performance of Gaussian mean shift (GMS) clustering on the whole dataset is compared with our method of model trees with new object detection (NOD), and our method with 1 allotted query.

In Table 2-2, we show the statistics of the number of objects detected by each method across 100 trials. The kernel bandwidth of the GMS was set with a heuristic rule based on the mean nearest neighbor distance of the clustered samples. The ensemble size in our new object detection algorithm was set to 10. The parameters of the model trees algorithm are the same as in the classification rate experiment. In one scenario, no queries are allowed, while in the other, a single query is made at the end of testing. The outlier bin is enabled, with the thresholds set to the maximum training divergences as described above. At every time step, the new object detection method is run on the outlier bin samples of each branch in the tree.

In this experiment, our method greatly outperforms the pure clustering approach, which reflects the amount of information utilized by our method in comparison to a fully unsupervised approach. By first classifying the more obvious samples, and then only clustering the outlier bin, there is much less room for error. Allowing just one query improves the performance even further, as there is higher label entropy for samples which have been labeled as belonging to both old and new classes in the tree.

2.4 Summary

In this chapter we have introduced a method for detecting and classifying objects in sonar data, which is not limited to a template set of objects and uses human feedback, when available, to improve the fusion of observations. Our active learning method, model trees, is a fusion of query-by-committee and uncertainty sampling approaches, generating a tree of likely hypotheses about the data. Interaction with the operator removes incorrect hypotheses.

The decision of when to query is made automatically by the algorithm, which improves upon existing methods, in which this decision is arbitrary. New objects are recognized from samples that are set aside in the outlier bin, using an approach which estimates good clustering parameters from existing samples. Model trees is shown to greatly improve a classifier, and outperforms three standard active learning methods. Our new object detection method is much more reliable than clustering approaches, as the outlier bin offers a more facilitating subset of the data for recognizing new clusters, and the clustering parameters may be effectively set by using information present in the previously classified samples. The model trees algorithm works seamlessly with the new object detection, thus allowing for an ATR system which can function autonomously in an unknown environment and confirm its decisions with human feedback, when available.

CHAPTER 3
ACTIVE CONSTRAINED CLUSTERING WITH THE DENDROGRAM

Portions of this chapter were submitted for publication in the following manuscript: Kriminger, E., Cobb, J. T., and Príncipe, J. C. (2015). Active hierarchical clustering without must-link constraints. IEEE Transactions on Pattern Analysis and Machine Intelligence, under review.

Unsupervised data clustering produces results that are not necessarily suitable with respect to the goal of the user. To guide the clustering process, it is often possible for the user to provide a small amount of supervision, typically in the form of a set of labels or pairwise constraints between samples. Constrained or semi-supervised clustering methods incorporate this side information into the resulting clustering, along with the distance information provided by the original dataset. Most commonly, pairwise constraints are the preferred form of feedback from the human expert, as it is easier for a human to decide whether two samples represent objects from the same cluster (must-link constraints) or objects from different clusters (cannot-link constraints). Otherwise, the human would have to provide a label for each queried sample from a possibly unknown set of labels.

Constrained clustering methods face the difficult task of not only enforcing the constraints on the samples directly involved in constraints, but propagating information to nearby samples. The propagated constraint information must be coherent with respect to other constraints and the inherent geometry of the data. While many clustering methods have a constrained variant, the addition of constraints typically adds considerable overhead, and reduces the reliability relative to the underlying unconstrained clustering method.

In active clustering, the expert may be unaware of the possible labels for the dataset; thus we instead seek pairs of samples for query. The lack of a training set makes it difficult to apply the frameworks of active classification. For instance, certainty is harder to quantify in clustering because we do not have a boundary between classes from the training set.

certaintyishardertoquantifyinclusteringbecausewedonothaveaboundarybetweenclassesfromthetrainingset.Instead,activeclusteringalgorithmsareguidedbydesignprinciplesbasedonthenatureoftheclusterboundary,widespreadcoverageofthethedataset,thegeometricshapeofclusters,andparticularknowledgeoftheproblemdomainorclusteringalgorithmitself.Thecontributionsofthischapteraretwofold.Werstpresentanovelapproachtoconstrainedhierarchicalclusteringcalledpropagation-freeconstrainedhierarchicalclustering(PFCHC),inwhichconstraintsarenotpropagatedthroughthedistanceinformationasagglomerationproceeds.Instead,theexistingdendrogramstructurefromanunsupervisedrunofhierarchicalclusteringismanipulatedusingthepairwiseconstraints.Theresultingalgorithmaddsverylittleoverheadtostandardhierarchicalclustering,andmayevenimproveperformancebynotcollapsingmulti-modalclasses.Secondly,wedevelopanactiveconstraintselectionmethodthatalsobuildsonthedendrogramofstandardhierarchicalclustering.Wecallthisactiveconstrainedselectionmethodhierarchicalqueryselection(HQS).Thedendrogramautomaticallyextractsthestructureofthedata,thuswemakenounderlyinggeometricassumptionsonclusternumberorgeometry.heproposedmethodsarepresented.Finally,wedemonstratetheimprovementofconstrainedclusteringperformanceon10real-worlddatasetswhenusingourmethods,andperformacasestudyonclusterdiscoveryontheMNISTdigitdataset. 3.1Propagation-FreeConstrainedHierarchicalClusteringMust-linkconstraintsaresimpletoenforce,asthedistancebetweenlinkedsamplescanbesetto0.However,ithasnotbeenconsideredthatthispracticemaybeunnecessaryorevenharmfultotheoverallclusteringofthedata.Theconstraintsreceivedfromtheoracledonotnecessarilydescribeclusters,butrathertheclassesthatthosesamplesbelongto.Inthecaseofmulti-modalclasses,wemayacquiremust-linkconstraintsbetweendistantareasofthedataspace.Thismayplacegreatstresson 77

Figure 3-1. Must-link constraints in multi-modal data. Performing valid merges while ignoring must-links produces an oversegmentation of the data. The labels can be amended later to account for the must-links without tampering with distance information.

An unexplored alternative is to proceed with clustering without must-links, while obeying cannot-links. The must-link constraints are only used initially to get a full transitive closure of cannot-link constraints. Then, once this initial process has finished, the must-link constraints are employed simply to merge the labels of clusters from the same class and refine the oversegmented clustering, as seen in Figure 3-1. The must-link information is still utilized, but the fundamental difference is that we allow local subclusters to agglomerate before linking them, rather than trying to enforce a global union of distant clusters. The benefit of this approach is that we avoid whatever overhead is involved in propagating must-link constraints. We also avoid the possible problems associated with distorting the data space to accommodate must-links between distant samples.

However, there are two fundamental assumptions to make this a viable approach. First, the clustering method must not specify the number of clusters. This is important because we are assuming the number of classes is less than or equal to the number of clusters in real data. We need to produce an oversegmentation which is refined by the must-links in the end. Second, the constraints must be selected carefully enough to span the data well and encompass every cluster and its relation to nearby clusters.
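The transitive closure mentioned above can be computed with a union-find structure over the must-links; every cannot-link is then expanded to all pairs spanning the two must-link components. The following is a minimal sketch, assuming samples are indexed 0 to N-1 and constraints are given as index pairs; the function name is ours, for illustration only.

    import itertools
    from collections import defaultdict

    def augment_constraints(must_links, cannot_links, n):
        """Close must-links transitively, then expand each cannot-link to
        every pair spanning the two must-link components."""
        parent = list(range(n))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path compression
                i = parent[i]
            return i

        for a, b in must_links:
            parent[find(a)] = find(b)

        # group the samples by must-link component
        comp = defaultdict(list)
        for i in range(n):
            comp[find(i)].append(i)

        closed_cl = set()
        for a, b in cannot_links:
            if find(a) == find(b):
                raise ValueError("conflicting constraint set")
            for u, v in itertools.product(comp[find(a)], comp[find(b)]):
                closed_cl.add((min(u, v), max(u, v)))
        return comp, closed_cl

Note that the closure is exactly where a conflicting constraint set fails, which motivates the conflict handling of Section 3.1.3 and the removal methods of the next chapter.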

With both of these issues in mind, hierarchical clustering is the perfect basis for our new paradigm in constrained clustering. The number of clusters in hierarchical clustering can be established simply and efficiently by the viable merges which do not violate cannot-link constraints. An existing hierarchical clustering of the data also facilitates the selection of informative queries, which we demonstrate with an active constraint selection method.

Rather than the previously proposed constrained hierarchical clustering of Klein et al. (2002), which we will hereby denote CHC, we propose propagation-free constrained hierarchical clustering (PFCHC). Hierarchical agglomerative clustering begins with each sample as a singleton in its own group. Let the initial set of groups be G = {X_i}_{i=1}^N, where N is the number of samples and each X_i is a group of unique samples. A series of merges are performed, where in each round, the groups which are merged are found by solving

    \min_{X_i, X_j \in G} \mathrm{link}(X_i, X_j)    (3-1)

where the linkage function link(·, ·) is some measure of group distance, such as the average linkage:

    \mathrm{link}_{av}(X_i, X_j) = \frac{1}{|X_i||X_j|} \sum_{x_i \in X_i} \sum_{x_j \in X_j} d(x_i, x_j).    (3-2)

The selected groups are removed from G and the new merged group is added. When hierarchical agglomerative clustering is performed, there are N-1 resulting merges. Let M_i be the i-th merge to occur, consisting of two groups X_1^{(i)} and X_2^{(i)}.

Unlike CHC, our PFCHC utilizes the merges produced by a standard run of hierarchical clustering. We begin with a given constraint set, and with a transitive closure of these constraints, such that CL refers to all cannot-links which are logically deducible. Each sample is given its own label, thus the label set {L_{x_i}}_{i=1}^N = {1, ..., N}. We iterate through each merge M_i in ascending order. A valid merge is one such that no cannot-link constraint is violated, or

    X_1^{(i)} \times X_2^{(i)} \cap CL = \emptyset.    (3-3)

Let l be the label of the samples in X_1^{(i)}. If the merge M_i is valid, then we modify the labeling such that

    L_x = l \quad \forall x \in X_2^{(i)}.    (3-4)

Algorithm 3-1. Propagation-free constrained hierarchical clustering
    Input: dataset {x_i}_{i=1}^N, set of query pairs Q
    Output: clustering labels {L_{x_i}}_{i=1}^N
    Initialize L_{x_i} = i for all i
    Augment ML and CL with transitive closures
    Perform hierarchical agglomerative clustering
    For each merge in {M_i}_{i=1}^{N-1} do
        For each pair of unique labels (l_1, l_2) ∈ L_1 × L_2 do
            If X_1^{(i)}[l_1] × X_2^{(i)}[l_2] ∩ CL = ∅ then
                Compute d(X_1^{(i)}[l_1], X_2^{(i)}[l_2])
            End if
        End for
        Find l_1 and l_2 with (3-5)
        Set {L_x}_{x ∈ X_2^{(i)}[l_2]} to the value of {L_x}_{x ∈ X_1^{(i)}[l_1]}
    End for
    If L_{x_a} ≠ L_{x_b} for any (x_a, x_b) ∈ ML then
        Replace any L_{x_b} with L_{x_a}
    End if
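The core of PFCHC is a replay of the unsupervised merge sequence with a validity check against the cannot-link closure. The sketch below illustrates that replay on top of SciPy's agglomerative clustering; it simply skips any merge blocked by a cannot-link, so it omits the subgroup refinement of Section 3.1.1 and the final must-link label union, and is not the full Algorithm 3-1.

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    def pfchc_replay(X, cannot_links, method="average"):
        """Replay the unsupervised merges, skipping those that would
        place a cannot-linked pair in the same group. `cannot_links`
        should already contain the transitive closure."""
        n = len(X)
        Z = linkage(X, method=method)          # (n-1) x 4 merge table
        members = {i: {i} for i in range(n)}   # cluster id -> sample set
        labels = np.arange(n)                  # one label per sample
        cl = {frozenset(p) for p in cannot_links}

        def violates(a, b):
            return any(frozenset((u, v)) in cl for u in a for v in b)

        for t, (i, j, _, _) in enumerate(Z):
            a, b = members.pop(int(i)), members.pop(int(j))
            members[n + t] = a | b             # the dendrogram still merges
            if not violates(a, b):
                # if `a` carries several labels from an earlier skipped
                # merge, this sketch arbitrarily reuses the first of them
                labels[list(b)] = labels[next(iter(a))]
        return labels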

3.1.1 Manipulating the Dendrogram

Building a constrained clustering is not as simple as evaluating the merges of the unsupervised dendrogram. A merge of two groups can be prevented, but at higher levels, those groups are still assumed to be merged. Assume the first invalid merge occurs at time n. For all merges {M_i}_{i>n}, either or both of the label sets for the merge, {L_x}_{x ∈ X_1^{(i)}} and {L_x}_{x ∈ X_2^{(i)}}, may contain more than one label (the result of previous merges not being valid). Since a cannot-link constraint exists within X_1^{(i)} and/or within X_2^{(i)}, we must break the offending group into its subgroups based on label, and find the closest valid merge between these subgroups.

Figure 3-2. Manipulating the dendrogram. A) Group 1 contains two different label types due to the cannot-link constraint. However, one of these subgroups can still merge with group 2. B) Since no constrained samples are present in group 2, it can merge with either subgroup. The red subgroup is closer, so group 2 takes on its labels. Note that evaluations of constraints and distances only need occur over constrained samples.

Let X_1^{(i)}[l] be the subgroup of samples from X_1^{(i)} such that {L_x}_{x ∈ X_1^{(i)}[l]} only contains the label l. Let L_1 be the set of unique labels in {L_x}_{x ∈ X_1^{(i)}}. Then we merge labels as was done in (3-4) for the subgroups X_1^{(i)}[l_1] and X_2^{(i)}[l_2] such that

    l_1, l_2 = \arg\min_{l_1 \in L_1, l_2 \in L_2} d(X_1^{(i)}[l_1], X_2^{(i)}[l_2]).    (3-5)

We allow these merges because a cannot-link constraint does not mean that merges at greater distances should be forbidden. For instance, many of the higher level merges will consist of single outliers being added to the rest of the data. It is desirable that these samples be added to the nearest viable cluster.

The result of the PFCHC clustering is an oversegmented dataset. Clusters have grown locally, and the must-link constraints indicate which cluster labels should be joined.

This is simply an operation on the labels. The labels for any pair of clusters for which a must-link constraint exists are merged. This is why it is important to augment the cannot-link set with the transitive closure; otherwise, during this final merge process, constraints would be violated. While the cannot-link constraints will allow the number of clusters to be automatically detected, if the number of clusters is known, the dendrogram may be "cut" as usual in hierarchical clustering. Our PFCHC algorithm is shown in Algorithm 3-1.

3.1.2 Efficient Computation

When considering the merger between groups X_1 and X_2, one only need consider the samples that are involved in a constraint. Thus, in determining constraint violations and evaluating distances when multiple labels are present in each group, such computation need only be done on the small subset of constrained samples. If a group of samples does not involve any constraints, there is certainly only one label present within this group. Such a group forms from the natural unsupervised agglomeration of the data. Eventually, through merges, these groups acquire constrained samples. It is at higher levels that multiple labels may be present and distance computation is required beyond the existing structure of the dendrogram to determine the best of the possible viable merges. PFCHC, wherever possible, manipulates the labels of blobs formed naturally about the constrained samples. Most merges that are evaluated can proceed simply by determining if a cannot-link constraint would be broken. This is in contrast to existing constrained hierarchical clustering approaches that recompute distances after each time that a constraint affects the clustering.

3.1.3 Constraint Removal

Constraints will sometimes offer conflicting information. This occurs when there are errors in the constraint set, an issue which will be discussed in the next chapter. However, conflicts arise even in perfectly annotated constraint sets, as seen in Figure 3-3.

Figure 3-3. Class overlap causes constraint noise. Due to overlapping classes, every constraint set must be dealt with under the assumption that constraint errors occur.

Due to class overlap, it is quite common for a must-link and cannot-link constraint to be found in parallel. It is particularly important to remove contradictory constraints because a transitive closure of the constraint set cannot be computed if there are conflicts. Detecting these harmful constraints is a difficult task, but by providing binary merge candidates at different scales, hierarchical clustering offers a way to properly compare and detect errant constraints.

When considering a merge between groups X_1 and X_2, one can measure N_CL, the number of cannot-link constraints between the two groups, N_CL = |X_1 × X_2 ∩ CL|, and N_ML, the number of must-link constraints, N_ML = |X_1 × X_2 ∩ ML|. If N_CL > N_ML, there is more evidence that the groups should not merge. If N_CL < N_ML, the must-link evidence outweighs the cannot-links, allowing the merge to proceed. For the case of N_CL == N_ML, a way to break the tie must be provided. Preference for must-link constraints can be given during low level merges, while cannot-links can be favored for high level merges. Thus, PFCHC has a very simple way to account for the contradictory constraint sets that exist in real data.
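This per-merge vote is simple to implement. The sketch below is a hypothetical helper (the names are ours, not from the dissertation) that counts the constraints spanning two candidate groups and applies the level-dependent tie-break just described.

    def merge_vote(group1, group2, must_links, cannot_links, low_level=True):
        """Return True if the merge should proceed. Ties go to must-links
        at low-level merges and to cannot-links at high-level merges."""
        pairs = {frozenset((a, b)) for a in group1 for b in group2}
        n_ml = sum(1 for p in must_links if frozenset(p) in pairs)
        n_cl = sum(1 for p in cannot_links if frozenset(p) in pairs)
        if n_cl == n_ml:
            return low_level          # tie-break by merge level
        return n_ml > n_cl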
3.2 Hierarchical Query Selection

Not propagating must-link constraints is beneficial computationally because not only do we not suffer the overhead of propagating these constraints (a process which requires some variant of a shortest path algorithm), but we can operate directly on a dendrogram produced by any standard unconstrained hierarchical clustering method. However, we now rely on an intelligently selected set of constraints to establish the structure of the clusters. Otherwise, the refinement step would not have sufficient information to gather separate clusters into their proper classes. We demonstrate that the dendrogram produced by hierarchical agglomerative clustering is a useful tool for selecting informative queries in an extremely efficient manner, and develop the hierarchical query selection (HQS) algorithm. Based on only a single run of standard unconstrained hierarchical clustering, informative queries can be selected and a clustering which obeys these constraints can be generated with our two complementary algorithms.

3.2.1 Informative Merges

Hierarchical clustering simplifies the problem of informative query selection, because each stage provides a simple two-cluster situation with the two merged groups. This poses a simpler question than the full constraint selection problem: should this merge take place? In addition, the dendrogram adapts to any dataset regardless of its shape. By querying merges, redundancy is naturally avoided as the merges span the dataset in space and occur at a variety of scales. The dendrogram can be constructed simply from a distance matrix, thus the query selection method need not deal with high dimensionality.

Figure 3-4. Population based merge criterion. Under this criterion, merge 1, between the groups enclosed within the blue dashed lines, is more informative than merge 2, between the groups enclosed within the solid red lines. Merge 2 is the highest level merge in terms of linkage distance, but with low impact since it involves a singleton.

In fact, Klein et al. (2002) alluded to an active constraint selection method in which the top level (large distance) merges were utilized for query to clarify where the dendrogram should be cut. However, no method was prescribed for actually selecting which samples to query. Furthermore, there is a problem in simply using the high level merges because many of these will simply consist of single outliers being added to the nearest cluster. Similarly, in this strategy, sparse clusters will dominate query selection. More care must be taken in the evaluation of informative merges than simply using a maximum distance merge criterion.

Using the same notation for merges as used previously, the criterion for informative merges proposed in Klein et al. (2002) would therefore be

    \max_i d(X_1^{(i)}, X_2^{(i)}).    (3-6)

One simple but effective alternative is to use a population based criterion. Querying merges involving the largest population ensures that the queries will impact a large number of samples. Distance information is missing, so outliers do not receive preference, as seen in Figure 3-4. However, population still targets high level merges, which is good for producing queries that reveal the overall structure of the data. The population based merge informativeness criterion is

    \max_i \min\{|X_1^{(i)}|, |X_2^{(i)}|\},    (3-7)

where we take the population of the smaller merged group, such that merges between two large groups, which have more impact on the overall clustering, are favored.

One possible problem with the population criterion is in dealing with greatly imbalanced datasets. The population criterion would over-represent the very dense clusters. A variety of merge informativeness criteria can be constructed which avoid various problems particular to the application. For instance, we can compute the boundary samples of X_1^{(i)} and of X_2^{(i)} by determining which samples in one group are nearest neighbors of the other group. Then evaluating the variance or entropy of the distances spanning the gap between the two boundaries provides an informativeness measure for merges which have a very unclear boundary, regardless of the relative density of the two groups, while distinctly filtering out outliers. However, in our tests the population criterion was used, as it performed reliably and with almost no computational overhead in addition to the hierarchical clustering itself.
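Criterion (3-7) can be evaluated directly from the merge table of a standard agglomerative clustering. A minimal sketch, assuming SciPy's linkage output (the fourth column of each merge row holds the merged population):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    def rank_merges_by_population(X, n_queries, method="average"):
        """Score each merge by the size of its smaller child, criterion
        (3-7), and return the indices of the top n_queries merges."""
        n = len(X)
        Z = linkage(X, method=method)
        size = np.ones(2 * n - 1)             # subtree populations
        scores = np.empty(n - 1)
        for t, (i, j, _, cnt) in enumerate(Z):
            size[n + t] = cnt
            scores[t] = min(size[int(i)], size[int(j)])
        return np.argsort(scores)[::-1][:n_queries]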

Algorithm 3-2. Hierarchical query selection
    Input: dataset {x_i}_{i=1}^N, number of queries N_q
    Output: set of query pairs Q
    Perform hierarchical agglomerative clustering
    For each merge in {M_i}_{i=1}^{N-1} do
        Compute the merge informativeness measure, e.g. min{|X_1^{(i)}|, |X_2^{(i)}|}
    End for
    Compute the top N_q most informative merges
    For i ∈ {1, ..., N-1}, if M_i is an informative merge do
        For j ∈ {1, 2} do
            If Q ∩ X_j^{(i)} = ∅ then
                Let q_j be the centroid of X_j^{(i)}
            Else
                Let q_j be the closest sample from Q ∩ X_j^{(i)} to the group {1, 2} \ j
            End if
        End for
        Add (q_1, q_2) to Q
    End for

3.2.2 Query Sample Selection

With a merge informativeness criterion, we can select the N_q most informative merges to query, where N_q is the number of queries available. Now the actual sample pairs that will be queried must be extracted from the merges. If we are selecting high level merges, it is desirable to select samples near the group boundaries, as this will select constraints across possible cluster boundaries. However, rather than select the nearest samples on the fringe of the merged groups, we propose to query local modes, queries of which will affect more samples. Finally, we will reuse queried points, such that the resulting "skeleton" of constraints can take advantage of transitive closures.

Let the N_q informative merges selected for query be {M_i}_{i=1}^{N_q}. We iterate through the merges from the lowest to highest level. For merge M_j, we must select a sample from X_1^{(j)} and a sample from X_2^{(j)} to be the query pair. If there are no previously queried samples in X_1^{(j)} and/or X_2^{(j)}, which is likely when one of those groups consists of few samples, we select the centroid of that group as the query sample. If X_1^{(j)} and/or X_2^{(j)} contains many previously queried samples, which occurs in the larger groups, we select the nearest one to the opposing merged group (Figure 3-5). The resulting active constraint selection algorithm using the dendrogram is seen in Algorithm 3-2.
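The per-merge sample choice can be sketched as follows; the helper names are hypothetical, and the medoid (the in-group sample with the smallest summed distance, as in Figure 3-5) stands in for the centroid so that an actual dataset sample is always returned.

    import numpy as np

    def pick_query_pair(X, group1, group2, queried):
        """For each side of a merge, reuse the previously queried sample
        closest to the opposing group, or fall back to the group medoid."""
        def medoid(idx):
            idx = np.asarray(idx)
            D = np.linalg.norm(X[idx][:, None] - X[idx][None, :], axis=-1)
            return int(idx[np.argmin(D.sum(axis=1))])

        def pick(side, other):
            prev = [i for i in side if i in queried]
            if not prev:
                return medoid(side)
            # distance from each previously queried sample to the other group
            d = [np.linalg.norm(X[i] - X[other], axis=1).min() for i in prev]
            return prev[int(np.argmin(d))]

        return pick(group1, group2), pick(group2, group1)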

Figure 3-5. Selecting samples to query. In the informative merge shown, group 2 contains no queried samples, so the medoid is selected. The closest of the 3 previously queried samples in group 1 is selected.

3.2.3 Adding Redundancy

The structure of queries formed by HQS establishes the structure of the data well, but in the presence of an imperfect oracle, or a high degree of class overlap, it is advisable to set aside a fraction of the allotted queries for redundancy. If a constraint that should be a cannot-link separating two clusters is erroneously labeled as a must-link, those two clusters will be joined. Given the set of queries produced by HQS, random constraints between the queried samples offer robustness and allow errors to be detected.

Ideally, a few constraints should span between the samples of separate clusters. The structure of the queries for 2-dimensional normal data can be seen in Figure 3-6.

Figure 3-6. Examples of constraint structure. A) The constraints selected by the proposed hierarchical query selection. B) The constraints selected by min-max farthest-first query selection (MMFFQS).

Table 3-1. The UCI datasets used in this study.

    Dataset        #Samples   #Clusters
    Balance        625        2
    Ecoli          336        8
    Glass          214        6
    Iris           150        3
    Segmentation   210        11
    Wine           178        3
    Breast         569        2
    Ionosphere     351        2
    Parkinsons     195        2
    Sonar          208        2

3.3 Experiments

We first demonstrate the performance of our method on 10 UCI datasets Lichman (2013), as seen in Table 3-1. For comparison, 3 constrained clustering methods are used in conjunction with 4 active constraint selection methods.

Table 3-2. UCI results for active constrained clustering. Values represent the normalized mutual information between the resulting clustering and the true labels. The number of queries for each dataset is equal to half the number of samples.

                Balance               Ecoli
                PFCHC  CHC   E2CP     PFCHC  CHC   E2CP
    HQS         0.57   0.54  0.69     0.81   0.78  0.70
    MMFFQS      0.52   0.55  0.58     0.71   0.72  0.58
    ASC         0.05   0.28  0.38     0.54   0.60  0.56
    Rand        0.09   0.29  0.34     0.52   0.58  0.57

                Glass                 Iris
    HQS         0.61   0.45  0.45     0.90   0.90  0.89
    MMFFQS      0.56   0.57  0.43     0.90   0.93  0.95
    ASC         0.28   0.32  0.34     0.77   0.86  0.84
    Rand        0.29   0.34  0.34     0.73   0.82  0.86

                Segmentation          Wine
    HQS         0.78   0.70  0.69     0.95   1.00  0.94
    MMFFQS      0.70   0.76  0.68     0.96   0.99  0.89
    ASC         0.56   0.69  0.65     0.82   0.89  0.87
    Rand        0.52   0.67  0.65     0.76   0.85  0.88

                Breast                Ionosphere
    HQS         0.81   0.80  0.88     0.65   0.55  0.52
    MMFFQS      0.83   0.82  0.85     0.55   0.72  0.24
    ASC         0.08   0.60  0.76     0.20   0.48  0.40
    Rand        0.12   0.57  0.75     0.06   0.46  0.38

                Parkinsons            Sonar
    HQS         0.65   0.72  0.32     0.61   0.57  0.41
    MMFFQS      0.62   0.61  0.35     0.44   0.31  0.23
    ASC         0.27   0.45  0.23     0.08   0.28  0.00
    Rand        0.27   0.36  0.25     0.11   0.19  0.00

As a baseline, the E2CP algorithm Lu & Ip (2010) was used to construct a similarity matrix that incorporates the actively selected constraints with the natural similarity information. In our experience, E2CP was among the most reliable methods of constraint incorporation. We then performed spectral clustering on the resulting similarity matrix. The other two constrained clustering methods were CHC from Klein et al. (2002) and the proposed PFCHC method. For CHC and PFCHC, the average linkage criterion was used. Spectral clustering requires the number of clusters, k, to be specified, while CHC and PFCHC automatically discovered the number of clusters. In addition to our proposed constraint selection algorithm, we compare with random query selection, MMFFQS, and ASC. Other methods were attempted, but these three were selected as their computation scaled well with the size of the data. Aside from normalizing each feature between 0 and 1, no other preprocessing was performed.

The results of Table 3-2 are presented in terms of the normalized mutual information Strehl & Ghosh (2003) (NMI) between the true and estimated clusterings. The NMI is in [0, 1], and is 1 for a perfect clustering. The results are averaged over 8 trials (though some methods, including our own, are deterministic), and the number of queries was set to half of the number of samples present in the dataset. This is a low number of queries relative to the total number of unique pairwise queries, N(N-1)/2.

As seen in the table, the combination of PFCHC and HQS is quite reliable. It is the highest performing configuration on 4 of the 10 datasets, more than any other configuration. Our proposed combination is competitive on all datasets. For instance, on datasets like Wine and Iris where the other constrained clustering methods outperform PFCHC, the results are still quite high. HQS, as a standalone query selection method, performs well in conjunction with the E2CP and CHC constrained clustering methods as well as PFCHC. On 8 of the 10 datasets, HQS is the best query selection method, with MMFFQS performing better on the other two datasets. Note that PFCHC can yield quite poor results when used in conjunction with random queries. This is expected, as it was explained earlier that PFCHC requires queries that span the data space and establish relationships between clusters.
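The NMI reported throughout can be computed with scikit-learn; the geometric averaging below matches the Strehl & Ghosh normalization, and the label vectors are made-up stand-ins.

    from sklearn.metrics import normalized_mutual_info_score

    true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # hypothetical ground truth
    est_labels  = [0, 0, 1, 1, 1, 1, 2, 2, 0]   # hypothetical clustering

    # 1.0 indicates a perfect clustering, up to a permutation of labels
    nmi = normalized_mutual_info_score(true_labels, est_labels,
                                       average_method="geometric")
    print(round(nmi, 3))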

In Figure 3-7, we show the performance as a function of the number of queries, indicated as a fraction of the total number of samples in the dataset. For clarity, only the top performing query selection methods, HQS and MMFFQS, are shown. Results are averaged over 20 trials, and single standard deviation error bars are shown to capture the range of performances by each combination of algorithms. The combination of PFCHC and HQS offers very reliable performance, as the algorithms are deterministic and in no case did the pair produce unfavorable results. This is an important issue in constrained clustering, as it is widely held true that additional queries often deteriorate performance. This is true in particular for MMFFQS, which produces different queries depending on its initialization. On some trials the queries barely aid in the clustering, for instance on the Ionosphere dataset. As a query selector, HQS functions well with the other constrained clustering methods, but matches the best with PFCHC. With this combination, we achieve a nearly monotonic performance increase in the number of queries. The increase is also prominent for a small number of queries, thus making efficient use of constraints. Note that MMFFQS and E2CP require the number of clusters to be specified. Neither PFCHC nor CHC require external information about the dataset.

To simulate a cluster discovery problem, we utilize the MNIST LeCun & Cortes (2010) digit dataset. We use the testing set from MNIST, consisting of 10,000 samples, with 1,000 samples for each of the 10 digits. No preprocessing of the raw 784-pixel greyscale images is done, and distances between images are computed with the Euclidean distance between these intensity vectors. We compare PFCHC and CHC, as these methods do not require the number of clusters to be specified. In Figure 3-8, the clustering performance as a function of the number of queries is shown. With PFCHC and HQS, the performance increases smoothly and monotonically with the number of constraints. HQS performs much better than MMFFQS as a query selection tool for this problem, as there are many clusters, and a systematic method which identifies cluster centers is desirable, rather than one which queries fringe samples at distant locations.

Figure 3-7. Performance as a function of the number of queries. Results on 10 UCI datasets for PFCHC, CHC, and E2CP constrained clustering with HQS and MMFFQS constraint selection methods. A) Balance. B) Ecoli. C) Glass. D) Iris. E) Segmentation. F) Wine.

Figure 3-7. Continued. G) Breast. H) Ionosphere. I) Parkinsons. J) Sonar.

In Figure 3-9, we demonstrate the resulting clustering of the two top performing configurations visually, after 400 queries. Both methods correctly recognize that there are 10 clusters. Thus, we can compute a classification error rate, which is 19% for PFCHC with HQS and 26% for CHC with HQS. This is with only 400 pairwise queries, which is not a large amount of information, being equivalent to only 37 labeled training samples per class, assuming the queries were selected informatively.

In Table 3-3, we compare the wall-clock computation times for PFCHC with HQS versus CHC and MMFFQS on the MNIST dataset.

Figure 3-8. Performance on the MNIST dataset. The performance on 10,000 MNIST samples is plotted against the number of queries for each combination of methods.

Table 3-3. Wall-clock computation time on the MNIST dataset. Performed in MATLAB on an Intel Xeon dual core (E5-2637) 3.50 GHz CPU.

    PFCHC and HQS               CHC and MMFFQS
    Step           Time (s)     Step             Time (s)
    Hier. Clust.   35.0         Distance Matrix  35.2
    HQS            15.8         MMFFQS           3.0
    PFCHC          10.6         CHC              485.3
    Total          61.4         Total            523.5

The simulations were performed in MATLAB on an Intel Xeon dual core (E5-2637) 3.50 GHz CPU. The combination of PFCHC and HQS is 8.5 times faster than CHC and MMFFQS. We used MATLAB's implementation of hierarchical clustering, with an average linkage function. This step was the most computationally intensive one for our method, but was far less costly than producing a constrained clustering with CHC. Unlike our method, which can build from any hierarchical clustering, CHC must propagate constraints and modify a distance matrix.

Figure 3-9. Visualization of MNIST clustering. Clustering of 10,000 MNIST digits with 400 queries. The samples are arranged radially, with their labels producing a color-coded segment at that particular angle. The inner ring represents the true clustering and the outer ring represents the estimated clustering. Both methods correctly identified 10 clusters. A) MNIST clustering with PFCHC and HQS; the error rate for our method is 19%. B) MNIST clustering with CHC and MMFFQS; the error rate is 26%.

3.4 Summary

In this chapter we have proposed that must-link constraints need not be utilized during the process of constrained clustering, as long as sufficiently informative constraints have been selected. When applying this idea to hierarchical clustering, one need only ensure that each merge not violate the cannot-link constraints. Thus, constrained clustering has been reduced to set-theoretic operations on the merges produced by standard hierarchical clustering. The logic behind this approach is that constraints reflect class labels, rather than true clusters. Therefore, in the case of multi-modal data, the data space may need to be significantly distorted to bring linked samples together. For some constrained clustering approaches, such as ones that rely on metric learning, this may cause conflict in the objective function, which seeks to push must-link pairs close together and spread cannot-link pairs far apart. Perhaps this is a reason behind the commonly observed deterioration of results with an increasing number of constraints, and warrants further study.

Must-link constraints are used initially to deduce transitive closures, and at the end of hierarchical clustering to reassign labels of clusters belonging to the same class. Not propagating must-link constraints is computationally desirable, as in the existing CHC this process involves a shortest-path algorithm between all samples, as well as continual modifications to the distance matrix during agglomeration, unless the complete linkage function is used. From a practical perspective, our method can work with the merges produced by any run of hierarchical clustering. This allows the use of existing code that may include exotic linkage functions or that is appropriate for use on large datasets.

We can capitalize on the merges produced by hierarchical clustering to produce informative queries that ensure the high performance of our PFCHC. The resulting HQS algorithm has almost no overhead if the dendrogram has already been constructed. In our tests, the combined use of PFCHC and HQS not only offered the lightest computational requirement by a wide margin, but also consistently performed better than, or comparably to, leading constrained clustering methods.

Active constrained clustering suffers from a distinctive lack of "black box" methods. Typically, integrating constraints greatly increases the computational requirements of the method. The reliability of the constrained variant of a clustering method also tends to be less than the unconstrained version. Our method requires almost no parameters, other than the number of allotted queries. Agglomerative clustering uses the natural distance information from the dataset to form local clusters which encompass the constrained samples. The network of constraints then determines how these subclusters relate to one another. CHC propagates must-link constraints in an intuitive and reasonable way, and it is not prone to suffer performance degradation with an increasing number of constraints. PFCHC provides comparable performance with a much lower overhead.

However, many methods could perhaps benefit from incorporating must-links only as postprocessing. For instance, constrained clustering methods based on metric learning or constraint weighting cost functions could face contradictory information in the presence of multi-modal data. Ignoring the must-links initially could greatly reduce the strain on these optimizations, thus resulting in simpler, more robust algorithms. Note that this idea can only apply to methods with no fixed k, such as a mode-seeking or message passing clustering method. This remains the topic of future work.

CHAPTER 4
CONSTRAINED CLUSTERING UNDER UNCERTAINTY

Machine learning algorithms are typically designed to account for the noise found in real world datasets. However, it is often taken for granted that the ground truth information which guides the machine learning task serves as a perfect oracle. Such information may take the form of a labeled training set for classification, or pairwise linkage constraints for clustering tasks. Errors in the various forms of supervised information are unavoidable. Often this information is acquired from human annotators, so mistakes are inevitable. This is particularly true when the information is "crowdsourced", or when a group of possibly anonymous workers provide the feedback Kittur et al. (2008); Snow et al. (2008). Evaluating the reliability of the annotator becomes an issue in this domain Hsueh et al. (2009); Welinder & Perona (2010). Crowdsourced information is particularly common with the availability of platforms like Amazon's Mechanical Turk (AMT), Clickworker, and others Yuen et al. (2011). Data arising from user reviews, which can be used in collaborative filtering and recommendation engines Bennett & Lanning (2007), provides an abundant source of useful, yet variably reliable data. But even with a trusted human oracle, the tasks can still be ambiguous. Trained experts often annotate scientific or medical datasets, which are difficult by their very nature. Errors may also arise when the decision requires some subjectivity, or in situations where samples can belong to multiple classes, such as in document clustering.

For classification problems, filtering out unreliable labels is a more straightforward task Donmez & Carbonell (2008); Du & Ling (2010); Yan et al. (2011). Samples with labels that are likely to be in error will appear anomalous with respect to the other samples of their class.

(Portions of this chapter were submitted for publication in the following manuscript: Kriminger, E., Cobb, J. T., and Príncipe, J. C. (2015). Constrained clustering under uncertainty. Journal of Machine Learning Research, under review.)

The well-studied methods of anomaly or outlier detection Chandola et al. (2009); Hodge & Austin (2004) can be applied to rectify the unreliable information in a classification training set.

When side information is available in a clustering problem, it is known as semi-supervised or constrained clustering. The information is typically present in the form of must-link and cannot-link constraints Wagstaff & Cardie (2000), which determine whether a pair of samples belong to the same or different clusters, respectively. The nature of pairwise constraints makes them difficult to analyze. Since two samples are involved in each constraint, it is not obvious how constraints interact and conflict. In Davidson et al. (2006), the possible utility of a constraint set is quantified with measures of coherence and consistency. Coherence roughly quantifies the amount of agreement within the constraint set based on geometrical intuition about the distance and directional properties of constraints. Consistency measures the agreement between the constraint set and the underlying structure of the data, represented by an unconstrained clustering. However, these are more comparative features for ranking the difficulty of constrained clustering on different datasets Van Craenendonck & Blockeel (2015).

A few papers have addressed how to handle errors in clustering constraints Coleman et al. (2008); Freund et al. (2008); Nelson & Cohen (2007); Zhu et al. (2015), but all except for Zhu et al. (2015) function in an extremely limited setting. Alternatively, methods for handling soft constraints have been proposed in the constrained clustering literature Law et al. (2004); Lu & Leen (2004), which allow for constraint violation. This does not suggest that errors will be rectified, but rather that the clustering mechanism permits constraints to be violated. The use of soft constraints only corrects errors if bad constraints are canceled out by multiple correct ones involving the exact same neighborhoods of the data space.

Despite the lack of focus on removing erroneous constraints, it is an important issue which greatly limits the practical effectiveness of constrained clustering.

Since clustering is an unsupervised process, the inclusion of side information is particularly useful, as there is no other means of establishing the user's goals. It is well-known that constrained clustering is a relatively brittle process. In fact, the inclusion of constraints can actually harm the clustering result if they are not carefully selected Davidson et al. (2006). Considering that this negative result does not even include the confounding issue of errors in the constraint set, constrained clustering can be quite problematic if applied without handling these errors. Empirical studies Ares et al. (2012); Zeng et al. (2007) have demonstrated the severity of this effect by synthetically inverting constraint values to induce errors.

Constraints are a valuable resource since human time is expensive. It is important for a method which removes unreliable constraints to not be wasteful and toss out possibly valuable information. In fact, it is often recommended to collect multiple opinions for each queried sample to improve the reliability of the feedback Nowak & Ruger (2010); Sheng et al. (2008). An intelligent constraint error removal system could alleviate the need for such redundancy, which is demanding of valuable human hours.

In this chapter, we present a method called vote based constraint removal (VBCR), which intelligently removes only those constraints which are harmful to the clustering process. We also provide an overall study of the nature of real constraints for clustering. We first present the very few treatments of imperfect oracles in the literature. We then develop the VBCR method, in which each constraint has a trust value, which is adjusted in an iterative process by the votes of other constraints. Our method produces an improved constraint set that can be used by any desired constrained clustering algorithm. We then present a case study using the 17 Category Flower Dataset Nilsback & Zisserman (2006), with constraints provided by real human workers through the Amazon Mechanical Turk. The results provide insight into the statistics of errors made by a crowdsourced annotation workforce. To our knowledge, this is the first time constraint errors have been studied with real human feedback as opposed to simulated errors, which is a necessary step as suggested in Ares et al. (2012).

We finally show results on benchmark UCI datasets with varying levels of reliability, to show our method's ability to remove harmful constraints.

4.1 Constraint Errors

For clustering constraints, little has been written on constraint errors, which can also be referred to as constraint noise or imperfect oracles. The severe effects of constraint errors on constrained clustering performance were shown in Ares et al. (2012); Zeng et al. (2007). Combined with the rise of the often unreliable annotation via crowdsourcing, the practical implementation of constrained clustering projects is possibly futile until the imperfect oracle problem is properly addressed. Addressing this problem involves not only developing methods for removing bad constraints, but also the use of real human feedback in constrained clustering experiments. To the knowledge of the authors, the papers which do address constraint noise have only used simulated noise. Understanding the nature and statistics of real errors can be used to help identify and avoid them. It is also important for work in this field to consider the full set of possibilities for the constraint set itself. Specific active constraint selection techniques allow for a more naturally formulated error removal problem. However, for many constrained clustering projects, the constraint set consists of whatever data is available, rather than of specially selected pairs.

Constraint errors are mostly accounted for in the literature by using soft constraints, such that constraint violations are allowed, but punished Nelson & Cohen (2007). In Nelson & Cohen (2007), it was shown that for probabilistic methods of constrained clustering that use weights to penalize constraint violations, there are serious practical issues with model inference in the presence of constraint noise. In addition, performance is sensitive to the weights that are selected. To remedy these problems, Nelson & Cohen (2007) developed a more robust sampling model for chunklets (groups of samples linked by the transitive closure of must-links), where the violation penalties are adjusted based on the confidence of the user for each particular constraint.

However, there is a clear difference between allowing constraint violations and managing errors. It cannot be assumed that bad constraints will even be violated, nor that good constraints will be satisfied. Errors will be avoided only in the extreme case where there are enough constraints such that each bad one is outvoted by multiple good ones spanning similar neighborhoods. The use of expert confidence to scale the constraint violation penalties assumes that humans very reliably evaluate their performance on each query.

In Freund et al. (2008), a theoretical study of the effects of various noise models on constrained clustering was presented. The ability to estimate the number of clusters in such cases was evaluated, but no methods were prescribed for managing noise issues in clustering. In Coleman et al. (2008), the normalized cut problem of spectral clustering is solved concurrently with a correlation clustering problem over the connected components of the constraint graph. In the presence of errors in a constraint graph, the graph may not break nicely along supposed cannot-link constraints. Correlation clustering finds breaks in the constraint graph that best satisfy the constraints. Note that the method of Coleman et al. (2008) can correct errors only when the constraints very explicitly interact with one another via shared samples, and only applies to the two cluster case. For a generic constraint set, cutting the constraint graph is trivial, as most connected components would consist of a single constraint.

A truly general purpose constraint error detection algorithm was developed in Zhu et al. (2015), which seeks must-link and cannot-link constraints that appear to be outliers among their fellows. In this work, each constraint c_i in the constraint set C is represented by the concatenated vector z_i = [ |x_j - x_k|, (x_j + x_k)/2 ], where x_j and x_k are the samples involved in constraint i. This represents the constraint in both relative and absolute position. Associated with each constraint is a label y_i, which is 1 if the constraint is a must-link and 0 if it is a cannot-link. The set of constraint representations, {z_i}_{i=1}^{|C|}, and labels {y_i}_{i=1}^{|C|}, establishes a classification problem.
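The constraint representation of Zhu et al. is straightforward to construct; the sketch below assumes constraints are given as ((j, k), label) tuples, with label 1 for must-link and 0 for cannot-link.

    import numpy as np

    def constraint_features(X, constraints):
        """Build the (z_i, y_i) set: z_i stacks the element-wise difference
        magnitude and the midpoint of the constrained pair."""
        Z, y = [], []
        for (j, k), label in constraints:
            z = np.concatenate([np.abs(X[j] - X[k]), 0.5 * (X[j] + X[k])])
            Z.append(z)
            y.append(label)
        return np.vstack(Z), np.asarray(y)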

The labeled set is used to train a classification random forest. An affinity between any two constraints is provided by the fraction of trees in the forest in which the two constraints are placed in the same class. With this affinity value, an inconsistency measure for each constraint is developed, which describes how unusual the constraint is relative to other constraints of the same type. The authors suggest removing a specified percentage of the constraints with the highest inconsistency values.

The constraint removal method of Zhu et al. (2015) leaves the number of constraints to remove as a free parameter. The selection of this quantity should be dealt with carefully, because removing too many constraints results in the loss of valuable information, and leaving harmful constraints can ruin the clustering performance. Ideally, the number of constraints that are removed should change depending on the needs of each constraint set. In addition, the outlier-based criterion of this method may not be targeting the right constraints for removal. The z representation of a constraint has a dimensionality which is double that of the data itself. Combined with the typical sparseness of constraints, this may not be an accurate way to determine when must-link and cannot-link constraints are directly antagonizing one another.

4.2 Removal of Harmful Constraints

Rather than posing the constraint removal problem as a search for errors, it is perhaps more useful to seek constraints that are harmful to the resulting clustering. Errors are a function of the true labeling of the dataset, which, thanks to noise and class overlap, is less intuitive than the concept of harmfulness, which is a function of the constraint set itself and how a constrained clustering algorithm would respond to it.

There are two forms of information available in the search for harmful constraints. The first is the distance information present in the underlying geometry of the dataset, which consists of the full set of feature vectors {x_i}_{i=1}^N, or the full set of pairwise distance or similarity values {S_ij}. The second source of information is the constraint set itself, C := {c_n}, where constraint c_n involves the sample pair (x_{n1}, x_{n2}).

This is a partial set of sample pairs for which constraints are available. In the case of hard constraints, c_n ∈ {0, 1}, where 0 represents a cannot-link and 1 represents a must-link constraint. In the case of soft constraints, c_n ∈ [0, 1].

It may be tempting to seek to remove constraints which disagree with the distance information, such as must-links that lie across natural cluster boundaries, or cannot-links that fall within dense regions. However, the purpose of constrained clustering is to improve upon the unconstrained clustering by clarifying potential cluster boundaries that are not intuitive. Removing constraints that disagree with the distance information then leaves only those constraints that agree with the clustering that could be achieved without any constraints. Therefore, a constraint removal system must look for disagreement within the constraint set itself.

Constraints that actively interfere with one another cause trouble for constrained clustering methods. Identifying conflict in a constraint set is simple when the constraint set forms a connected graph, because when samples are shared between different constraints, actual logical violations will occur. Finding a clustering that accounts for such a constraint set while adhering to the underlying geometry as best as possible was the subject of Coleman et al. (2008). When constraints do not share samples to form a connected graph, they still convey possibly contradictory information. In deciding if there are any conflicts between the constraints, we must also determine how constraints communicate with one another. The means by which constraints interact with one another is through the data itself. If constraints involve samples within the same dense region of data, it is likely that these constraints either contradict or affirm the information that the other is trying to convey.

In the previous chapter, we described the ability of hierarchical clustering methods to identify constraints that are in conflict. The dendrogram provides two groups of samples to be merged. Constraints with a sample involved in each group provide evidence for or against the merge, and the majority constraint type wins.

Essentially, this is a vote based system in which constraints compete if they are relevant to one another through the structures provided by hierarchical clustering. However, hierarchical clustering structures may not always be the desired way of grouping the data. This approach also lacks a sense of what constitutes a reasonable clustering. For instance, a cannot-link between nearest neighbor samples would be permitted to exist unless that same pair was queried multiple times.

4.3 Vote Based Constraint Removal

We present the vote based constraint removal (VBCR) algorithm. Consider an unconstrained clustering of the data. The adjacency matrix A represents the results of this clustering, with A_ij = 1 if samples i and j are grouped together and A_ij = 0 if the samples are separated. A reasonable clustering of the data provides an idea of which constraints interact and interfere with one another. We define two terms which will be used in defining the communication between constraints. We say that a must-link constraint c_m and a cannot-link constraint c_n are relevant to each other if, in the clustering given by A, the two constraints each have a sample in the same cluster. We say that c_m votes for c_n if the constraints are mutually satisfiable with respect to A. If c_m is a must-link and c_n a cannot-link, then they vote for each other if A_{m1,m2} = 1 and A_{n1,n2} = 0. Relevance and votes are demonstrated in Figure 4-1.

If two constraints are relevant, let R_mn = 1, and if they are not relevant, then R_mn = 0. Likewise, if the constraints vote for each other, V_mn = 1, and if they do not, then V_mn = 0. Here, both R and V are |C| × |C| matrices. Note that we only consider the relationship between constraints of the opposite type, because constraints of the same type do not interfere with each other, despite the fact that they may not both be satisfied in A. Additionally, if soft constraints are used, they are rounded to must-link or cannot-link for the purpose of these definitions.

Rather than simply relying on a single unconstrained clustering, A, to determine the relevance and votes of constraint pairs, it is advisable to use an ensemble of clusterings.
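For a single member of such an ensemble, relevance and votes can be computed directly from the cluster labels. A minimal sketch, assuming pairs[m] = (i, j) gives the samples of constraint m and is_ml[m] its rounded type:

    import numpy as np

    def relevance_and_votes(labels, pairs, is_ml):
        """Fill R and V for one unconstrained clustering; only
        opposite-type constraint pairs are considered."""
        n_c = len(pairs)
        R = np.zeros((n_c, n_c))
        V = np.zeros((n_c, n_c))
        satisfied = np.array([(labels[i] == labels[j]) == ml
                              for (i, j), ml in zip(pairs, is_ml)])
        for m in range(n_c):
            for n in range(n_c):
                if is_ml[m] == is_ml[n]:
                    continue                  # same type: no interference
                clusters_m = {labels[s] for s in pairs[m]}
                clusters_n = {labels[s] for s in pairs[n]}
                if clusters_m & clusters_n:   # a shared cluster: relevant
                    R[m, n] = 1.0
                    if satisfied[m] and satisfied[n]:
                        V[m, n] = 1.0         # mutually satisfied: vote
        return R, V

Over an ensemble, one reasonable combination, consistent with the definition below that a pair is relevant if it is relevant in any member, is an element-wise OR of the per-clustering R and V matrices.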

Figure 4-1. Illustration of relevance and votes. For this particular partition of the dataset, constraint 1 is relevant and votes for constraint 4. Constraint 2 is relevant to 4, but does not vote for it, as they are not mutually satisfied. Constraint 3 is not relevant to constraint 4.

Owing to the ambiguity of the clustering problem, a single clustering only represents one reasonable way to partition the data. Two constraints may not be mutually satisfied in one clustering, but that does not mean that they actively prevent each other from being satisfied. Within an ensemble, multiple reasonable clusterings of the data can be captured. Thus, a pair of constraints are relevant if they are relevant with respect to any clustering of the ensemble. Clustering ensembles can be built using different parameters, such as the k in k-means, different projections of the data, or different clustering algorithms entirely Vega-Pons & Ruiz-Shulcloper (2011).

With R and V calculated for each pair of constraints, we can define a measure that allows us to discriminate between harmful and useful constraints. We define the trust t_m of constraint c_m as the fraction of the total trusted, relevant vote that a constraint receives. This is a recursive definition,

    t_m = \frac{\sum_{c_n \in C} t_n R_{mn} V_{mn}}{\sum_{c_n \in C} t_n R_{mn}}.    (4-1)

To calculate the trust values requires sequential iteration of equation (4-1), in a fixed-point update. This requires an initial condition for the trust vector, t.

If there is no level of certainty associated with each constraint, then the trust values can be initialized as 1s. In the case of soft constraints, a level of certainty is provided. If the constraints c_i ∈ [0, 1], then 0.5 represents complete uncertainty, and the trust values can be initialized to t_i = 2|c_i - 0.5|.

During the update of (4-1), constraints which are not relevant to any other constraints will produce a trust value that is undefined. These are constraints which do not interact with any other constraints, and therefore they can be automatically accepted or rejected based on other criteria. For example, one of these "irrelevant" constraints can be accepted or rejected based on whether it is satisfied in any member of the clustering ensemble, or based on its soft constraint value.

Algorithm 4-1. Vote based constraint removal
    Input: dataset {x_i}_{i=1}^N, constraint set C, trust threshold τ
    Output: reduced constraint set Ĉ
    Initialize t_i = 2|c_i - 0.5| for all c_i ∈ C
    Initialize |C| × |C| zero matrices, R and V
    Generate clustering ensemble A := {A_i}_{i=1}^p from the dataset
    For each A ∈ A do
        Set R_mn = 1 for each relevant constraint pair (c_m, c_n)
        Set V_mn = 1 for each pair that is mutually satisfied in A
    End for
    While not converged do
        t ← ((V ∘ R) t) / (R t)
    End while
    Remove constraints: Ĉ := {c_m | t_m > τ}

Once the trust values have converged, constraints c_m with t_m < τ are removed. Since the converged trust values tend toward the extremes, the choice of threshold is largely dataset independent.
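The fixed point of (4-1) can be computed with a few matrix operations. A minimal sketch, assuming R and V have been accumulated over the ensemble and t0 holds the initial trust values:

    import numpy as np

    def trust_values(R, V, t0, tol=1e-6, max_iter=100):
        """Iterate the trust update (4-1) to a fixed point. Constraints
        with no relevant peers get an undefined (NaN) trust and should
        be accepted or rejected by a separate criterion."""
        t = t0.astype(float).copy()
        for _ in range(max_iter):
            denom = R @ t
            with np.errstate(invalid="ignore", divide="ignore"):
                t_new = ((R * V) @ t) / denom   # Hadamard product, then matvec
            t_new[denom == 0] = np.nan          # irrelevant constraints
            if np.allclose(np.nan_to_num(t_new), np.nan_to_num(t), atol=tol):
                return t_new
            t = t_new
        return t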
While a lower value like 0.8 appears to allow for some constraints that conflict with others, note that the trust of bad constraints does not typically decay all the way to 0. A harmful constraint can still receive some votes from reliable constraints. In this case, the offender would maintain a small share of the vote, which is reflected in a lower trust for the reliable constraints that it does contradict. The fixed point update of (4-1) can be efficiently computed in matrix form, as shown in Algorithm 4-1. The relevant vote in the numerator is simply the Hadamard product of V and R. The division represents element-wise division of two length-|C| vectors.

While we have presented a very simple and reliable implementation of VBCR, it is a flexible and modular framework. For instance, votes and relevance do not need to take hard values of 0 or 1. A soft relevance, for instance, with R_mn ∈ [0, 1], would approach 1 for a pair of identical constraints, and would decay to 0 for distant pairs. The definition of relevance can also be amended. A stricter definition of relevance ensures that the trust values converge closer to the 0 and 1 extremes. This occurs when the sphere of influence of a harmful constraint is reduced to the point where only constraints that it actively interferes with are considered relevant.

4.4 Experiments

To properly validate the performance of a constraint removal system requires a realistic simulation of constraint errors. While past work has simulated errors with intuitive models, no studies have been conducted in which clustering constraints were acquired via real human feedback. This is a necessary step towards better understanding the nature of constraint errors, both for modeling and for the design of error removal systems. Some of the statistics of interest about real human feedback include the frequency of errors and the severity of errors. For instance, the usefulness of soft information relies on the assumption that the errors made by the human oracle are reflected in a constraint value with low certainty. It is also interesting to see for which sample pairs errors are most likely to be made.

Intuitively, a pair of samples from the same class that are very distant in feature space is prone to error. Likewise, a pair of samples from different classes that are very close will be difficult to annotate. All the aforementioned properties are dependent not only on the dataset, but on the data type (e.g. image, document, speech) and the human oracles themselves. Regardless, realistic experiments will improve the state of knowledge and practical usage of constrained clustering techniques. The availability of crowdsourcing platforms greatly simplifies the process of acquiring constraint sets from real humans. Before validating our VBCR algorithm on UCI datasets, we present a case study on active clustering using Amazon's Mechanical Turk to acquire constraints from human clickworkers for pairs of flower images.

4.4.1 Case Study

It is important to select an appropriate dataset to be annotated by anonymous clickworkers, since if it is too simple or too difficult, the results will not possess interesting statistics on constraint errors. For this reason, we select the 17 Category Flower Dataset Nilsback & Zisserman (2006), because a non-expert can decide if two flowers are the same species, but not without error. To increase the difficulty of this dataset, which consists of 17 species with 80 samples each, the 10 species with yellow or white flowers were used. The unique pairs of samples were divided into must-links and cannot-links. Of the must-links, 150 pairs were selected uniformly at random, and another 150 from the cannot-links. An equal amount were selected from the must-links and cannot-links to ensure that enough data exists to get statistics for both constraint types. The queries were random, rather than following a principled active selection technique, because a random constraint set poses the greatest difficulty for the detection of errors, owing to the ambiguity of constraint interaction.

AMT jobs are presented as Human Intelligence Tasks (HITs). Our HITs consisted of 5 pairwise queries each. Thus, the 300 queries were completed in 60 HITs. To provide redundancy, each HIT was assigned to 5 unique workers, so 1,500 responses were obtained in total.

Each query presents a pair of flower images. The possible responses are "definitely different", "maybe different", "not sure", "maybe same", and "definitely same", which are codified as 0, 0.25, 0.5, 0.75, and 1, respectively.

Table 4-1. Conditional response distributions. Frequencies of responses for the flower dataset, given the ground truth constraint value.

    Response   p(c|ML)   p(c|CL)
    0          0.01      0.75
    0.25       0.05      0.15
    0.5        0.01      0.01
    0.75       0.24      0.08
    1          0.70      0.02

Figure 4-2. Histogram of responses. Histograms for the responses to ground truth must-link queries (blue) and cannot-link queries (red).

The distribution of the responses, given the true value of the constraint, is shown in Figure 4-2. For both must-link and cannot-link pairs, the majority of responses were answered correctly, and among those, most were answered with certainty. Certainty was rare among incorrect responses, but still present, which reveals the inconsistency in the self-evaluation of the annotators. The neutral option of "not sure" was rarely used.

The relationship between the feature space distance for a pair of samples and their constraint response value has significance for constraint error removal. We split the set of received constraints into two groups based on their ground truth value. We then compute Pearson's correlation coefficient ρ between the distances and constraint values for the constraints in both of these groups. For the true must-link constraints, ρ = -0.15, and for the cannot-link constraints, ρ = -0.26. Thus, for must-links, shorter distances only barely correspond to higher constraint values (closer to 1). For cannot-links, larger distances are only slightly correlated with lower constraint values (closer to 0). The significance is that pure disagreement with the feature space is not a reliable means of identifying rogue constraints. Humans make their decisions based on different features, so this lack of correlation is expected.

Of the 300 query pairs, errors were made on 72 of them. With the redundancy provided by the 5 assignments per constraint, it is possible to see if errors can mostly be attributed to the difficulty of the sample pair or to the individual worker. Of the 72 constraints for which errors were made, for 46 of them only 1 of the responses was an error. For 15 of the 72 constraints with errors, only 2 of the 5 workers committed an error. Therefore, while some difficult sample pairs stumped a majority of the 5 assigned workers, errors were most frequently made by only a single annotator. If the mean value of the constraint is taken across the 5 responses, only 11 errors are made. An average of 24 errors occur if only a single response is used, which supports the use of redundancy in crowdsourcing situations.

Table 4-2. Performance on the flower dataset. Results averaged over 20 trials.

    Performance in NMI                       Error removal for VBCR
    No removal  Perfect removal  VBCR        Error rate  Correction rate  Removal rate
    0.41        0.52             0.48        10%         78%              39%

The discrete distributions of the responses provide valuable statistics for generating constraint errors for testing purposes. For each constraint, based on its ground truth label, the simulated constraint can be drawn from the appropriate distribution, p(c|ML) or p(c|CL), as seen in Table 4-1.
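Simulating an imperfect oracle from Table 4-1 amounts to sampling a response from the conditional distribution of the true constraint type. A minimal sketch:

    import numpy as np

    rng = np.random.default_rng(0)

    RESPONSES = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    P_GIVEN_ML = np.array([0.01, 0.05, 0.01, 0.24, 0.70])  # Table 4-1
    P_GIVEN_CL = np.array([0.75, 0.15, 0.01, 0.08, 0.02])

    def simulate_response(is_must_link):
        """Draw a soft constraint value from the empirical distribution
        of worker responses for the true constraint type."""
        p = P_GIVEN_ML if is_must_link else P_GIVEN_CL
        p = p / p.sum()   # guard against rounding in the published table
        return rng.choice(RESPONSES, p=p)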

We use this procedure to test the VBCR method on the 17 Category Flower Dataset. We simulate constraints, rather than using only the constraint set from AMT, so that there are multiple trials over which the results can be averaged. The utility of random constraints varies greatly between realizations. In this test, we randomly draw 600 unique pairs, and simulate their constraint values. We use constrained hierarchical clustering Klein et al. (2002), which relies on hard constraints, so soft values are rounded to 0 or 1.

The proposed VBCR method is compared with two benchmarks. The first is a worst-case scenario, in which no constraints are removed and constraints are simply rounded to 0 or 1. The second is perfect error removal, in which all errors, and only the errors, are removed. The normalized mutual information (NMI) Strehl & Ghosh (2003) is used to compare clustering results to the true partition, and takes values in [0, 1], with 1 representing a perfect clustering. The clustering ensemble used by VBCR was created by performing a hierarchical clustering of the data with an average linkage function, and cutting the dendrogram at 15 points. A threshold of 0.8 was used on the trust values, which converged after 10 iterations.

The results are seen in Table 4-2, and are averaged over 20 trials. Ignoring the constraint errors leads to much worse performance than the hypothetical case of perfect removal. Using VBCR achieves a performance very near to perfect removal. While VBCR does not remove all errors, not all constraint errors are harmful to the results, and hence good performance can be achieved while leaving errors and removing constraints which are technically correct. The error removal statistics of VBCR for this test are also shown in the table. The average error rate over the 20 trials was 10%, during which VBCR removed 78% of errors and 39% of the total 600 constraints.
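The dendrogram-cut ensemble used here can be produced with SciPy; the sketch below assumes a condensed pairwise distance vector, as accepted by linkage, and assumes cluster counts 2 through 16 for the 15 cuts (the exact counts used in the experiment are not specified).

    from scipy.cluster.hierarchy import linkage, fcluster

    def dendrogram_ensemble(D_condensed, n_cuts=15):
        """Cut an average-linkage dendrogram at several cluster counts
        to obtain an ensemble of unconstrained clusterings."""
        Z = linkage(D_condensed, method="average")
        return [fcluster(Z, k, criterion="maxclust")
                for k in range(2, 2 + n_cuts)]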

Figure 4-3. UCI results for VBCR versus error rate. Results on 6 UCI datasets for the proposed VBCR, the method of Zhu et al., no constraint removal, and perfect error removal. A) Breast. B) Ecoli. C) Glass. D) Iris. E) Segmentation. F) Wine.

Table 4-3. The UCI datasets used in the error removal experiment.

    Dataset        #Samples   #Clusters
    Breast         569        2
    Ecoli          336        8
    Glass          214        6
    Iris           150        3
    Segmentation   210        11
    Wine           178        3

4.4.2 UCI Benchmarks

The UCI datasets Lichman (2013) provide real datasets to test clustering performance. Properties of the datasets are shown in Table 4-3. The ground truth labels are available, but the imperfect oracle must be simulated. Using a variety of error rates provides a more thorough testing scenario. To achieve different error rates, each element of p(c|ML) or p(c|CL) is raised to a power γ < 1 (then renormalized) to flatten the distribution, thus producing more errors. The proposed VBCR is again compared to perfect removal and to no removal of constraints. We also implement the constraint removal system of Zhu et al. (2015), which can function with the feature vectors of the UCI data, but not the distance matrix of the flower dataset. For this method, the fraction of constraints to remove was set to 0.33, which provided good results across all scenarios. To construct the clustering ensemble for VBCR, k-means was used with 4 different values of k, spanning from 1 fewer to 3 more than the true number of clusters.

Table 4-4. Error removal statistics for UCI. A comparison of VBCR and Zhu et al. for error removal. The results are averaged over all 6 UCI datasets used and all 100 trials, for each error rate. The removal fraction for Zhu et al. is fixed at 0.33.

                     Correction rate (%)     Removal rate (%)
    Error rate (%)   Zhu et al.   VBCR       Zhu et al.   VBCR
    0                100          100        33           22
    6                73           78         -            26
    9                70           78         -            27
    16               63           77         -            31
    21               59           77         -            34
    27               54           77         -            36
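The flattening transform used to sweep the error rate is a one-line renormalized power; the sketch below also estimates the induced error probability for a true must-link, where responses 0 and 0.25 round to cannot-link.

    import numpy as np

    def flatten(p, gamma):
        """Raise each probability to the power gamma < 1 and renormalize,
        spreading mass toward wrong answers."""
        q = np.asarray(p) ** gamma
        return q / q.sum()

    p_ml = np.array([0.01, 0.05, 0.01, 0.24, 0.70])   # p(c|ML), Table 4-1
    for gamma in (1.0, 0.5, 0.25):
        q = flatten(p_ml, gamma)
        print(gamma, q[:2].sum())   # error probability for a true ML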

In Figure 4-3, the clustering performance as a function of the error rate is shown. Query pairs are drawn randomly, with the number of queries equal to half the number of samples for that particular dataset. Different error rates were achieved by varying γ. For comparison, the performance using the complete, perfect constraint set is shown, as well as unconstrained hierarchical clustering. The results are averaged over 100 trials. It is interesting that even when the error rate is 0, removing constraints with the proposed method or with that of Zhu et al. can actually improve the clustering compared with using the complete and correct constraint set. As shown in Davidson et al. (2006), more constraints can actually harm performance, so it is reasonable that intelligent methods of constraint removal can offer improvements, even in the case of perfect oracles. It is clear that using the noisy constraint set without removing errors is inadvisable, as performance decays quickly with a growing constraint error rate. Between the proposed VBCR and the method of Zhu et al. (2015), both methods offer comparable performance for low error rates. However, VBCR maintains performance near the level of perfect removal even as the error rate passes 25%. Note that VBCR does not require the number of removed constraints to be specified.

In Table 4-4, we present a comparison of the error removal statistics between VBCR and Zhu et al., averaged across all datasets for a given error rate. Both methods remove a comparable amount of errors for low error rates, but as the error rate grows, the method of Zhu et al. becomes less efficient at removing errors, while VBCR maintains the same rate of error removal. This could be attributed to the outlier detection ideas on which the method of Zhu et al. is based. The difficulty of identifying outliers increases with the noise level. VBCR is based on the interference between constraints, which is not strongly related to the amount of errors. The fraction of removed constraints remains at a constant 0.33 for Zhu et al. This parameter could be tuned to offer better performance on each dataset, but such knowledge is not available in a practical situation. We selected 0.33 for its reliable performance in all tests.

For the experiments of Zhu et al., 0.50 was recommended. The VBCR method adapts to each constraint set that it faces. When the error rate is low, fewer constraints are removed than when the error rate is high. If all constraints are mutually satisfiable, then no constraints will be removed.

4.5 Summary

Constrained clustering has been well-studied, but there is a distinct lack of attention to practical issues. Of prime importance to the success of constrained clustering in real applications is the handling of constraint errors produced by imperfect oracles. Errors will naturally occur due to the difficulty of some datasets, but the problem is intensified by the prominence of crowdsourcing solutions for annotation. Combined with the possibility of even error-free constraint sets harming clustering performance, constrained clustering methods are prone to failure in practice. Only one general constraint error removal method exists Zhu et al. (2015), which looks for constraints that appear as outliers amongst their peers. However, this method requires the number of removed constraints to be specified. Without a reliable way to estimate this quantity, either valuable constraints will be wasted, or harmful constraints will remain in the dataset.

We proposed the vote based constraint removal (VBCR) algorithm, which identifies constraints which contradict the information of other constraints. This contradiction is measured by considering which constraints are relevant to one another, based on location in feature space, and which constraints vote for each other, based on mutual satisfaction. The votes and relevance are determined using an ensemble of unconstrained clusterings, which capture various reasonable means by which the constraints could interact. The benefit of an ensemble is that it is highly flexible, allowing the use of any clustering methods that suit the dataset, however it may be represented and whether or not the number of clusters is known. VBCR automatically determines how many constraints are to be removed, because conflict between the constraints does not rely on external information.

We provide the first case study in which real human responses are acquired for constraints. Pairs of samples from the 17 Category Flower Dataset were placed on Amazon's Mechanical Turk to receive constraints from anonymous clickworkers. These results show that it is difficult to identify errors from the geometry of the dataset alone. The distributions of responses for the true must-link and cannot-link cases provide realistic models for simulating constraint error. Using this model, we show that without handling errors, constrained clustering methods fail, and that VBCR is capable of achieving nearly the performance of perfect error removal. On UCI datasets, the range of performance across a variety of error rates is studied. Performance benefits are even seen when no errors are made, as we identify constraints that are harmful to clustering even if they are correct, which often occurs in the presence of class overlap.

We hope this work encourages the further study of constraint error removal and the use of constrained clustering in practical applications. From the experiments we have performed, constraint removal is a recommended process even when the error rate is known to be low or zero. Perhaps the idea of removing harmful constraints, such as with VBCR, avoids the problem associated with certain constraint sets deteriorating constrained clustering performance.


CHAPTER 5
PRACTICALITY OF CONSTRAINED CLUSTERING

While constrained clustering techniques have been developed and tested for over a decade, their performance has been presented in a vacuum. Certainly, the use of constraints can improve upon an unconstrained clustering of the data, but the question arises: how good is constrained clustering? While clustering performance measures, such as the Rand index and normalized mutual information, provide a basis for comparison amongst clustering methods, they do not allow direct comparison with classification. The classification error rate, however, is surely a more informative measure of performance in regards to human perception, as human goals best relate to errors made on samples, not pairs of samples. The methods presented in this work are not complete unless presented alongside a recommendation for their use. Designers of data analysis projects can choose between constrained clustering and classification, but the literature contains no analysis of the trade-offs involved in this decision. Of course, more exploratory tasks are only suitable to be framed as constrained clustering, since the designer may not even be aware of the classes present in the dataset. However, even such cases benefit from understanding the sort of performance that can be expected compared with standard classification approaches.

5.1 Pairwise Constraints Versus Labels

Labels are very informative, since they group all samples with like labels and separate all those with different labels. A pairwise constraint only expresses the information between a pair of samples, which relates each sample only to the other one in the constraint, whereas a label relates a sample to all other labeled samples. To overcome this inherent disparity in information between a pairwise constraint and a label, careful constraint selection schemes must be devised to very efficiently capture the important information from the labels in as few pairwise queries as possible.


Generally, as the number of classes rises, labels become increasingly more informative than pairwise constraints. In the two-class case, pairwise constraints are as good as labels. With $k$ classes, there are $\frac{k(k-1)}{2}$ unique pairs of classes. The most efficient constraint set which captures the information of a label set constructs must-link neighborhoods within each class, and establishes a cannot-link constraint between the must-link neighborhoods of separate classes. A label set with $N$ labels per class and $k$ classes can be represented by $k(N-1) + \frac{k(k-1)}{2}$ pairwise constraints. This includes the $N-1$ must-link edges formed by the minimum spanning tree of the labeled samples of each class, and the cannot-links between each class pair. This strategy is seen in Figure 5-1, and a small computational check follows below.

One important consideration in the design of a query selection method is that in many cases, an oracle with real-time interaction is not feasible. Consider the preparation of an Amazon Mechanical Turk job, in which all queries must be specified beforehand, and feedback is received as a batch. Many of the popular active constraint selection techniques involve serial querying schemes, incorporating the results of previous queries into the current selection. For instance, FFQS and MMFFQS build must-link neighborhoods by comparing each new sample with the existing neighborhoods until a match is found, beginning with the closest neighborhoods. In essence, these methods are attempting to produce the idealized constraint set of Figure 5-1. Replicating this process with batch querying is important for practical problems, and was the driving force of our proposed HQS from Chapter 3. Notice that the idealized query selection method does not find many cannot-link constraints. Thus, much faith is put on those few cannot-link constraints, and if any of them are in error, entire classes are joined. It was recommended for the HQS method to anticipate errors by adding redundancy to these important cannot-link constraints.
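The count above is easy to check in code. A minimal Python sketch (the function name is hypothetical) builds the total from per-class spanning trees and one cannot-link per class pair:

def min_constraints(class_sizes):
    # Must-link spanning tree inside each class: n - 1 edges per class,
    # plus one cannot-link between each pair of class neighborhoods.
    k = len(class_sizes)
    must_links = sum(n - 1 for n in class_sizes)
    cannot_links = k * (k - 1) // 2
    return must_links + cannot_links

print(min_constraints([5, 4, 4, 4]))  # 17 samples in 4 classes -> 19

Any split of 17 samples into 4 classes gives 13 must-links and 6 cannot-links, matching the count in Figure 5-1.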


Figure 5-1. Recreating labels with pairwise queries. The 17 colored samples represent labeled data belonging to 4 classes. A minimum of 19 pairwise constraints completely represents this information.

Querying in such a way facilitates the removal of constraint errors. For instance, if each cannot-link is reinforced by two additional parallel queries, one error can be made and corrected by VBCR.

5.2 Comparison with Classification

We test the suite of constrained clustering methods presented in this dissertation against some standard classifiers on the MNIST digit dataset (LeCun & Cortes, 2010). Our methods include hierarchical query selection (HQS) with propagation-free constrained hierarchical clustering (PFCHC) and its associated constraint removal method from Chapter 3. PFCHC can also function quite easily with labels instead of constraints.


Consider the set of merges $\{M_i\}_{i=1}^{N-1}$ produced by unconstrained hierarchical clustering. If group $X_1$ contains labeled samples and group $X_2$ does not, $X_2$ takes on the labels of $X_1$. Therefore, any group containing labels does not have any unlabeled samples. If $X_{\text{train}}$ is the set of labeled data, set $L_x = 0\ \forall x \in X \setminus X_{\text{train}}$, while $L_x$ is known for the labeled data. PFCHC with labels only considers merges between two groups in which one is unlabeled. As shown in Figure 5-2, we proceed up the dendrogram with the labeled samples accumulating the unlabeled ones, until no more unlabeled samples remain.

Figure 5-2. PFCHC with labels. If labels are present instead of pairwise constraints, the unsupervised dendrogram determines the neighborhoods surrounding the labeled samples.

We therefore consider three cases of PFCHC: PFCHC with labels, PFCHC with pairwise constraints and no constraint removal, and PFCHC with pairwise constraints and constraint removal.
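A minimal Python sketch of this label variant follows (assuming SciPy's average-linkage dendrogram as the unsupervised base; the helper is illustrative, not the dissertation's implementation):

import numpy as np
from scipy.cluster.hierarchy import linkage

def pfchc_with_labels(X, labels):
    # Walk the unsupervised dendrogram and let groups containing labels
    # absorb purely unlabeled groups, so clusters grow around the labels.
    # labels: one class id per labeled sample and -1 for unlabeled samples.
    n = len(labels)
    Z = linkage(X, method="average")          # the unsupervised merges
    members = {i: [i] for i in range(n)}      # samples under each node
    node_label = {i: int(l) for i, l in enumerate(labels)}
    out = np.asarray(labels).copy()
    for t, row in enumerate(Z):
        a, b = int(row[0]), int(row[1])
        la, lb = node_label[a], node_label[b]
        if la >= 0 and lb == -1:              # labeled absorbs unlabeled
            out[members[b]] = la
            node_label[n + t] = la
        elif lb >= 0 and la == -1:
            out[members[a]] = lb
            node_label[n + t] = lb
        else:                                 # both unlabeled, or both
            node_label[n + t] = la if la == lb else -2   # labeled: frozen
        members[n + t] = members[a] + members[b]
    return out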


The classifiers we use are the default MATLAB implementations of bagged trees, naive Bayes, and k-nearest neighbors. With pairwise constraints, PFCHC is implemented as described in Chapter 3. The average linkage function is used. For hierarchical query selection, 25% of the available queries are reserved and drawn randomly for robustness. Conflicting constraints are removed by proceeding up the dendrogram and identifying when both must-links and cannot-links span the same merge groups. The majority constraint type remains, while the minority is removed. In the event of a tie, the cannot-link is selected. This enables a transitive closure to be computed.

For the label based methods, labeled data are selected randomly from the 60,000 training samples of MNIST. The trained classifier is then tested on the 10,000 testing samples. For the pairwise constraint based methods, HQS is used on the pool of 60,000 MNIST training samples to select the designated number of queries. The clustering method then proceeds to cluster the combined test set and any of the training samples that are involved in a constraint. Classification results are generated for clustering by finding the optimal cluster-to-class label translation through a search (one realization is sketched below). The first 100 PCA components of the MNIST data are used as features for all methods. The results are averaged over 30 trials.

First, we consider the case with no label or constraint errors, seen in Figure 5-3. PFCHC using pairwise constraints performs poorly with a low number of constraints, but rises relatively quickly to a high level comparable with the label based methods. While there are no errors, the constraint removal method detracts little from the performance of PFCHC without constraint removal. Among the label based methods, PFCHC performs almost identically to the k-nearest neighbor classifier, which is the highest performing classifier. This intuitively makes sense, as PFCHC essentially attempts to be a nearest neighbor method for constrained clustering. Clusters grow around the labeled samples, stopping upon meeting other growing clusters. Since we do not propagate must-link constraints, these clusters can grow in separate locations across the data space, in a very localized manner.
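One standard realization of the cluster-to-class translation mentioned above is a Hungarian assignment on the contingency table; since the text only specifies a search, the following Python sketch is an assumption rather than the exact procedure used:

import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_to_class_accuracy(y_true, y_cluster):
    # Score a clustering as a classifier: find the cluster-to-class
    # translation that maximizes the number of correctly mapped samples.
    classes = np.unique(y_true)
    clusters = np.unique(y_cluster)
    C = np.zeros((len(clusters), len(classes)), dtype=int)
    for ci, c in enumerate(clusters):
        for ki, k in enumerate(classes):
            C[ci, ki] = np.sum((y_cluster == c) & (y_true == k))
    rows, cols = linear_sum_assignment(-C)    # maximize matched counts
    return C[rows, cols].sum() / len(y_true)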


Figure 5-3. Classification versus constrained clustering.

Next, errors are introduced to the labels and constraints. The results are shown in Figure 5-4. MNIST is a very simple dataset to label, but errors occur naturally due to user error or simply an ambiguous sample. The average error rate on MNIST is typically 1%, which we confirmed with an Amazon Mechanical Turk test over 4,000 responses. For a more significant result, we use an error rate of 2%. To induce errors, 2% of the labels or constraints were randomly switched to a different label or constraint. We do not use the pairwise-constraint PFCHC without error removal since, without error removal, there are conflicting constraints. The label based methods, including PFCHC with labels, are robust to these errors, although the k-nearest neighbor classifier and PFCHC take a bigger hit. Since these are local methods, errors in a label induce a region of error.


Figure 5-4. Classification versus constrained clustering with error.

While still comparable to some classifiers, PFCHC with pairwise constraints suffers a much greater decline in the presence of errors, and a more variable performance curve reveals the sensitivity to which samples were in fact in error. With labels for a local method, an incorrect label will not corrupt the information provided by other labeled samples. However, with pairwise constraints, a single error, for instance between two separate must-link neighborhoods, can falsely join them, thus inducing many more errors than the involved samples of the bad constraint.

5.3 Discussion

While it is difficult to achieve the performance of label based methods, there are a few possible advantages with pairwise constraints. Human time is the limiting factor in how many labels or constraints can be acquired, and pairwise queries tend to be faster to answer. For the same reason that labels are informative, they are also difficult for the human, since the oracle must deliberate between all possible options. This difficulty increases as the number of classes rises.


The oracle may be more prone to errors in the presence of a large number of possible responses. From the point of view of the designer, specifying the label set a priori may be impossible for some applications. In more exploratory data mining operations, pairwise queries allow the oracle to specify the class structure of the dataset without explicit knowledge of this structure. Classification or constrained clustering methods may be more robust to these feedback errors. All these issues depend on the task itself, including the type of sample presented to the human, such as text documents, images, and audio files.


CHAPTER 6
CONCLUSION

In this dissertation, particular emphasis has been put on the practical issues facing active constrained clustering. It is hoped that the ideas presented here serve to break some of the barriers that limit its applicability in practice. Revisiting the issues discussed in Wagstaff (2007): in the face of constraint propagation, we suggest exploiting local unsupervised structures around constrained points. This avoids the need to explicitly propagate distance information and allows for scalability. In regards to query selection, the dataset should be spanned by the queries, and query points should be selected from the interior of cluster structures to anticipate noise on the boundaries. For this reason, redundancy should also be included. On the issue of constraint set utility, we provide a flexible framework for quantifying the communication and conflict within constraint sets.

6.1 Summary

In Chapter 2, an active learning technique was developed based on growing an ensemble in response to uncertainty in a data stream. The application of this method to open set classification motivates active learning in the clustering domain. In Chapter 3, methods were developed for active constraint selection and constrained clustering based on the dendrogram of hierarchical agglomerative clustering. Unlike most methods for constrained clustering, propagation-free constrained hierarchical clustering (PFCHC) functions as a wrapper around any standard hierarchical clustering of the data. Thus, it can be made highly efficient and appropriate for large and non-Euclidean datasets. PFCHC functions as a local constrained clustering method, where the constraints are used to manipulate the samples that agglomerate around the constrained samples through the unsupervised structure of the data. The accompanying hierarchical query selection (HQS) technique chooses constraint sets in a batch mode to provide sufficient information about the structure of the data. With added redundancy, HQS facilitates error removal.


Harmful constraints arise as a product of human error and noisy data, and are very damaging to the already delicate constrained clustering process. Given the lack of research on this problem, it is critical to develop methods for removing harmful constraints. The vote based constraint removal (VBCR) algorithm was presented in Chapter 4. VBCR is flexible, as it relies on any clustering ensemble that is appropriate for the dataset. The ensemble provides a reference for which constraints communicate, and which of those constraints are reliable enough to have a vote. By quantifying conflict within the constraint set, there is no need to specify the number of constraints which should be removed. This ensures that the procedure can benefit even perfect constraint sets as a preprocessing step, without removing valuable information.

The practicality of our suite of constrained clustering methods is examined in Chapter 5, which compares constrained clustering to classification. This provides an example of the utility of constrained clustering relative to classification with a similar amount of human interaction. PFCHC can function quite easily with labels instead of pairwise constraints, and provides similar performance to the k-nearest neighbor classifier, as both are local methods. Constrained clustering is shown to perform comparably to classification, but is still vulnerable to constraint errors. Pairwise constraints do carry possible benefits over labels, related to their ease of annotation by human oracles. Therefore, constrained clustering can be considered a competitive choice for data processing projects.

6.2 Future Work

While the methods presented in this dissertation offer a complete system for active constrained clustering, there are a few avenues for future work. The "propagation-free" idea is described in the context of constrained hierarchical clustering, but can be implemented in other constrained clustering approaches as well. For a nonparametric clustering method, the number of clusters is not fixed. If such a method simply abides by the cannot-link constraints, the must-link constraints can be applied afterwards to group cluster labels (see the sketch at the end of this section).


This avoids the process of propagating must-link constraints, which may be unnatural with respect to the geometry of the data as well as computationally intensive. While PFCHC can function with the merges created by any form of hierarchical clustering, in this dissertation only the typical implementation of hierarchical clustering is used, which is an inefficient $O(n^3)$ computation. Using a base hierarchical clustering method optimized for large datasets, we can apply PFCHC to important problems in which constrained clustering has never been applied. Such problems may be good fits for constrained clustering if, for example, information arises from pairwise comparisons.

While the batch mode of querying samples is a necessity for many platforms for human feedback, those queries are made without the benefit of much information other than that present in the data itself. In addition to a query strategy that functions prior to the collection of information, a follow-up batch query strategy could be developed. This strategy would use the information gained through the entire process of constrained clustering to generate queries which truly ensure high performance. An initial constraint set provides information such as the must-link neighborhoods. The constraint removal process reveals constraints which were in conflict. The constrained clustering process itself involves various decisions which were made without the benefit of a query. A follow-up query selection strategy is more straightforward, as it is not based on geometric aspects of the data, but on clarifying the issues that arose during the entire active constrained clustering process.

Finally, in our techniques for constraint removal, we have assumed that conflict in a set of constraints indicates errors. However, for many problems the validity of a pairwise constraint is not so well-defined. Human oracles may annotate pairs of samples with different features of the data in mind. This is a product of data with a hierarchical or overlapping class structure. An orange and an apple are different species, but some oracle may consider them to be the same in the context of the task, as they are both fruits.


If two different varieties of orange are presented, the ambiguity is even more pronounced. Currently, we divide constraints into two sets: one to keep and one to remove. A constraint removal framework could take the set of constraints and separate it into subsets that function harmoniously. Each subset could offer a different mode of clustering, such as with different granularities.
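As a concrete rendering of the first direction above, applying must-links after a clustering that respects only the cannot-links can be sketched with a union-find over the cluster labels (an illustrative assumption, not an implemented method from this dissertation):

def apply_must_links_after(cluster_ids, must_links):
    # cluster_ids: cluster label per sample from a cannot-link-respecting,
    # nonparametric clustering; must_links: list of (i, j) sample pairs.
    parent = {c: c for c in set(cluster_ids)}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]     # path halving
            c = parent[c]
        return c

    for i, j in must_links:                   # samples i, j must co-cluster
        ci, cj = find(cluster_ids[i]), find(cluster_ids[j])
        if ci != cj:
            parent[cj] = ci                   # merge the two cluster labels
    return [find(c) for c in cluster_ids]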


REFERENCES

Abin, A. A., & Beigy, H. (2014). Active selection of clustering constraints: A sequential approach. Pattern Recognition, 47(3), 1443.

Aggarwal, C. C., & Reddy, C. K. (2013). Data clustering: Algorithms and applications. CRC Press.

Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (2005). Automatic subspace clustering of high dimensional data. Data Mining and Knowledge Discovery, 11(1), 5.

Al-Razgan, M., & Domeniconi, C. (2009). Clustering ensembles with active constraints. In Applications of Supervised and Unsupervised Ensemble Methods, (pp. 175). Springer.

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745.

Anand, S., Mittal, S., Tuzel, O., & Meer, P. (2014). Semi-supervised kernel mean shift clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6), 1201.

Angluin, D. (1988). Queries and concept learning. Machine Learning, 2, 319.

Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. In ACM SIGMOD Record, vol. 28, (pp. 49). ACM.

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6), 1152.

Ares, M. E., Parapar, J., & Barreiro, A. (2012). An experimental study of constrained clustering effectiveness in presence of erroneous constraints. Information Processing & Management, 48(3), 537.

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, (pp. 337).

Arzeno, N., & Vikalo, H. (2015). Semi-supervised affinity propagation with soft instance-level constraints. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 37(5), 1041.

Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, (pp. 21). Morgan Kaufmann Publishers Inc.


Baghshah, M. S., & Shouraki, S. B. (2010). Non-linear metric learning using pairwise similarity and dissimilarity constraints and the geometrical structure of data. Pattern Recognition, 43(8), 2982.

Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. In ICML, vol. 3, (pp. 11).

Basu, S., Banerjee, A., & Mooney, R. J. (2002). Semi-supervised clustering by seeding. In ICML, vol. 2, (pp. 27).

Basu, S., Banerjee, A., & Mooney, R. J. (2004a). Active semi-supervision for pairwise constrained clustering. In Proceedings of the 2004 SIAM International Conference on Data Mining, (pp. 333). SIAM.

Basu, S., Bilenko, M., & Mooney, R. J. (2004b). A probabilistic framework for semi-supervised clustering. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 59). ACM.

Basu, S., Davidson, I., & Wagstaff, K. L. (Eds.) (2008). Constrained clustering: Advances in algorithms, theory, and applications. Chapman & Hall/CRC Press, 2nd ed.

Ben-Dor, A., Shamir, R., & Yakhini, Z. (1999). Clustering gene expression patterns. Journal of Computational Biology, 6(3-4), 281.

Bennett, J., & Lanning, S. (2007). The Netflix prize. In Proceedings of KDD Cup and Workshop, vol. 2007, (p. 35).

Bezdek, J. C. (2013). Pattern recognition with fuzzy objective function algorithms. Springer Science & Business Media.

Bilenko, M., Basu, S., & Mooney, R. J. (2004). Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the 21st International Conference on Machine Learning, (pp. 81). ACM.

Biswas, A., & Jacobs, D. (2011). Large scale image clustering with active pairwise constraints. In International Conference in Machine Learning 2011 Workshop on Combining Learning Strategies to Reduce Label Cost, vol. 2. Citeseer.

Blei, D. M., & Jordan, M. I. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1), 121.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993.

Bloodgood, M., & Vijay-Shanker, K. (2009). A method for stopping active learning based on stabilizing predictions and the need for user-adjustable stopping. In Proceedings of the 13th Conference on Computational Natural Language Learning, (pp. 39). Association for Computational Linguistics.


Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11), 1222.

Breese, J. S., Heckerman, D., & Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, (pp. 43). Morgan Kaufmann Publishers Inc.

Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3), 200.

Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3).

Chang, H., & Yeung, D.-Y. (2004). Locally linear metric adaptation for semi-supervised clustering. In Proceedings of the 21st International Conference on Machine Learning, (p. 20). ACM.

Chaturvedi, A., Green, P. E., & Caroll, J. D. (2001). K-modes clustering. Journal of Classification, 18(1), 35.

Cheng, Y. (1995a). Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 790.

Cheng, Y. (1995b). Mean shift, mode seeking, and clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(8), 790.

Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, (pp. 539). IEEE.

Chung, F. R. (1997). Spectral graph theory, vol. 92. American Mathematical Society.

Coleman, T., Saunderson, J., & Wirth, A. (2008). Spectral clustering with inconsistent advice. In Proceedings of the 25th International Conference on Machine Learning, (pp. 152). ACM.

Daniels, K., & Giraud-Carrier, C. (2006). Learning the threshold in hierarchical agglomerative clustering. In 5th International Conference on Machine Learning and Applications, (pp. 270). IEEE.

Datta, R., Joshi, D., Li, J., & Wang, J. Z. (2008). Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR), 40(2), 5.

Davidson, I., & Ravi, S. (2009). Using instance-level constraints in agglomerative hierarchical clustering: Theoretical and empirical results. Data Mining and Knowledge Discovery, 18(2), 257.


Davidson, I., Wagstaff, K. L., & Basu, S. (2006). Measuring constraint-set utility for partitional clustering algorithms. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, vol. 4213, (pp. 115). Springer.

Davis, J. V., Kulis, B., Jain, P., Sra, S., & Dhillon, I. S. (2007). Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, (pp. 209). ACM.

Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4), 364.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), (pp. 1).

Dhillon, I. S., Guan, Y., & Kulis, B. (2004a). Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 551). ACM.

Dhillon, I. S., Guan, Y., & Kulis, B. (2004b). A unified view of kernel k-means, spectral clustering and graph cuts. Citeseer.

Ding, C. H., He, X., & Simon, H. D. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. In SDM, vol. 5, (pp. 606). SIAM.

Dobeck, G., Hyland, J. C., & Smedley, L. (1997). Automated detection and classification of sea mines in sonar imagery. In Proceedings of SPIE 3079, Detection and Remediation Technologies for Mines and Minelike Targets II.

Donmez, P., & Carbonell, J. G. (2008). Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, (pp. 619). ACM.

Du, J., & Ling, C. X. (2010). Active learning with human-like noisy oracle. In IEEE 10th International Conference on Data Mining, (pp. 797). IEEE.

Duan, C., Cleland-Huang, J., & Mobasher, B. (2008). A consensus based approach to constrained clustering of software requirements. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, (pp. 1073). ACM.

Dubes, R. C. (1987). How many clusters are best? An experiment. Pattern Recognition, 20(6), 645.

Dura, E., Zhang, Y., Liao, X., Dobeck, G., & Carin, L. (2005). Active learning for detection of mine-like objects in side-scan sonar imagery. IEEE Journal of Oceanic Engineering, 30(2), 360.


Escobar, M. D., & West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577.

Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, vol. 96, (pp. 226).

Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2), 167.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, (pp. 209).

Figueiredo, M. A., & Jain, A. K. (2002). Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 381.

Fraley, C., & Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8), 578.

Freund, A., Pelleg, D., & Richter, Y. (2008). Clustering from constraint graphs. In Proceedings of the SIAM International Conference on Data Mining, (pp. 301). SIAM.

Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In ICML, vol. 96, (pp. 148).

Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3), 133.

Frey, B. J., & Dueck, D. (2007). Clustering by passing messages between data points. Science, 315(5814), 972.

Fu, K.-S., & Mui, J. (1981). A survey on image segmentation. Pattern Recognition, 13(1), 3.

Fu, Z., & Lu, Z. (2015). Pairwise constraint propagation: A survey. arXiv preprint arXiv:1502.05752.

Fukunaga, K., & Hostetler, L. D. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1), 32.

Gilks, W. R. (2005). Markov chain Monte Carlo. Wiley Online Library.

Givoni, I. E., & Frey, B. J. (2009). Semi-supervised affinity propagation with instance-level constraints. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, (pp. 161).


Greene, D., & Cunningham, P. (2007). Constraint selection by committee: An ensemble approach to identifying informative constraints for semi-supervised clustering. In Machine Learning: ECML 2007, (pp. 140). Springer.

Grira, N., Crucianu, M., & Boujemaa, N. (2008). Active semi-supervised fuzzy clustering. Pattern Recognition, 41(5), 1834.

Guenoche, A., Hansen, P., & Jaumard, B. (1991). Efficient algorithms for divisive hierarchical clustering with the diameter criterion. Journal of Classification, 8(1), 5.

Hagen, L., & Kahng, A. B. (1992). New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 11(9), 1074.

Hasanbelliu, E., Kampa, K., Cobb, J. T., & Príncipe, J. C. (2011). Bayesian surprise metric for outlier detection in on-line learning. In Proceedings of SPIE 8017, Detection and Sensing of Mines, Explosive Objects, and Obscured Targets XVI.

Hasanbelliu, E., Kampa, K., Principe, J., & Cobb, J. (2012). Online learning using a Bayesian surprise metric. In Proceedings of the International Joint Conference on Neural Networks, (pp. 1).

Hasanbelliu, E., Principe, J., & Slatton, C. (2009). Correntropy based matched filtering for classification in sidescan sonar imagery. In Proc. of the IEEE International Conference on Systems, Man and Cybernetics, (pp. 2757).

He, P., Xu, X., Hu, K., & Chen, L. (2014). Semi-supervised clustering via multi-level random walk. Pattern Recognition, 47(2), 820.

Hertz, T., Bar-Hillel, A., & Weinshall, D. (2004). Boosting margin based distance functions for clustering. In Proceedings of the 21st International Conference on Machine Learning, (p. 50). ACM.

Hodge, V. J., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2), 85.

Hsueh, P.-Y., Melville, P., & Sindhwani, V. (2009). Data quality from crowdsourcing: A study of annotation selection criteria. In Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, (pp. 27). Association for Computational Linguistics.

Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651.

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys (CSUR), 31(3), 264.

Jain, A. K., & Vailaya, A. (1996). Image retrieval using color and shape. Pattern Recognition, 29(8), 1233.


Jenssen, R., Erdogmus, D., Hild, K. E., Príncipe, J. C., & Eltoft, T. (2007). Information cut for clustering using a gradient descent approach. Pattern Recognition, 40(3), 796.

Jiang, D., Tang, C., & Zhang, A. (2004). Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering, 16(11), 1370.

Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32(3), 241.

Kampa, K., Hasanbelliu, E., Cobb, J. T., & Príncipe, J. C. (2012). Deformable Bayesian network: A robust framework for underwater sensor fusion. IEEE Journal of Oceanic Engineering, 37(2).

Kampa, K., Hasanbelliu, E., & Principe, J. (2011). Closed-form Cauchy-Schwarz pdf divergence for mixture of Gaussians. In Proceedings of the International Joint Conference on Neural Networks, (pp. 2578).

Kamvar, S., Klein, D., & Manning, C. (2003). Spectral learning. In International Joint Conference of Artificial Intelligence. Stanford InfoLab.

Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: An introduction to cluster analysis, vol. 344. John Wiley & Sons.

Kittur, A., Chi, E. H., & Suh, B. (2008). Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, (pp. 453). ACM.

Klein, D., Kamvar, S. D., & Manning, C. D. (2002). From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proceedings of the 19th International Conference on Machine Learning, (pp. 307). Stanford.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59.

Law, M. H., Topchy, A., & Jain, A. K. (2004). Clustering with soft and group constraints. In Structural, Syntactic, and Statistical Pattern Recognition, (pp. 662). Springer.

LeCun, Y., & Cortes, C. (2010). MNIST handwritten digit database. Available at http://yann.lecun.com/exdb/mnist/.

Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, (pp. 2). Springer-Verlag.

Li, J., Xia, Y., Shan, Z., & Liu, Y. (2015). Scalable constrained spectral clustering. IEEE Transactions on Knowledge and Data Engineering, 27(2), 589.


Lichman, M. (2013). UCI machine learning repository. Available at http://archive.ics.uci.edu/ml.

Lin, C.-J. (2007). Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10), 2756.

Linden, G., Smith, B., & York, J. (2003). Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1), 76.

Lindsay, B. G. (1995). Mixture models: Theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, (pp. i). JSTOR.

Liu, Y., Jin, R., & Jain, A. K. (2007). Boostcluster: Boosting clustering by pairwise constraints. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 450). ACM.

Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129.

Lu, Z., & Carreira-Perpinan, M. A. (2008). Constrained spectral clustering through affinity propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1). IEEE.

Lu, Z., & Ip, H. H. (2010). Constrained spectral clustering via exhaustive and efficient constraint propagation. In Computer Vision – ECCV 2010, (pp. 1). Springer.

Lu, Z., & Leen, T. K. (2004). Semi-supervised learning with penalized probabilistic clustering. In Advances in Neural Information Processing Systems, (pp. 849).

MacEachern, S. N., & Muller, P. (1998). Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 7(2), 223.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, (pp. 281). Oakland, CA, USA.

Maggini, M., Melacci, S., & Sarti, L. (2012). Learning from pairwise constraints by similarity neural networks. Neural Networks, 26, 141.

Mahalanobis, P. C. (1936). On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India, 1(2), 49.

Mallapragada, P. K., Jin, R., & Jain, A. K. (2008). Active query selection for semi-supervised clustering. In Proceedings of the 19th International Conference on Pattern Recognition, (pp. 1). IEEE.

McCullagh, P., & Yang, J. (2008). How many clusters? Bayesian Analysis, 3(1), 101.

McLachlan, G., & Peel, D. (2004). Finite mixture models. John Wiley & Sons.


Miyamoto, S., & Terami, A. (2011). Constrained agglomerative hierarchical clustering algorithms with penalties. In IEEE International Conference on Fuzzy Systems (FUZZ), (pp. 422). IEEE.

Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2), 249.

Nelson, B., & Cohen, I. (2007). Revisiting probabilistic models for clustering with pairwise constraints. In Proceedings of the 24th International Conference on Machine Learning, (pp. 673). ACM.

Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2, 849.

Nilsback, M., & Zisserman, A. (2006). A visual vocabulary for flower classification. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2, (pp. 1447).

Nogueira, B. M., Jorge, A. M., & Rezende, S. O. (2012). Hierarchical confidence-based active clustering. In Proceedings of the 27th Annual ACM Symposium on Applied Computing, (pp. 216). ACM.

Nowak, S., & Ruger, S. (2010). How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation. In Proceedings of the International Conference on Multimedia Information Retrieval, (pp. 557). ACM.

Okabe, M., & Yamada, S. (2014). Active sampling for constrained clustering. Journal of Advanced Computational Intelligence and Intelligent Informatics, 18, 332.

Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, (pp. 1065).

Pekalska, E., & Duin, R. P. (2005). The dissimilarity representation for pattern recognition: Foundations and applications. 64. World Scientific.

Rangapuram, S. S., & Hein, M. (2015). Constrained 1-spectral clustering. arXiv preprint arXiv:1505.06485.

Reed, S., Petillot, Y., & Bell, J. (2003). An automatic approach to the detection and extraction of mine features in sidescan sonar. IEEE Journal of Oceanic Engineering, 28(1), 90.

Roy, N., & McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning, (pp. 441). Morgan Kaufmann.


Ruiz, C., Spiliopoulou, M., & Menasalvas, E. (2007). C-DBSCAN: Density-based clustering with constraints. In Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, (pp. 216). Springer.

Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, (pp. 285). ACM.

Scheffer, T., Decomain, C., & Wrobel, S. (2001). Active hidden Markov models for information extraction. In Advances in Intelligent Data Analysis, (pp. 309). Springer.

Scott, J. (2012). Social network analysis. Sage.

Semertzidis, T., Rafailidis, D., Strintzis, M., & Daras, P. (2015). Large-scale spectral clustering based on pairwise constraints. Information Processing & Management, 51(5), 616.

Settles, B. (2009). Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.

Settles, B., & Craven, M. (2008). An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, (pp. 1070). Association for Computational Linguistics.

Settles, B., Craven, M., & Ray, S. (2008). Multiple-instance active learning. In Advances in Neural Information Processing Systems, (pp. 1289).

Seung, H., Opper, M., & Sompolinsky, H. (1992). Query by committee. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, (pp. 287). ACM.

Shannon, C. E. (2001). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), 3.

Sheng, V. S., Provost, F., & Ipeirotis, P. G. (2008). Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 614). ACM.

Shental, N., Bar-Hillel, A., Hertz, T., & Weinshall, D. (2004). Computing Gaussian mixture models with EM using equivalence constraints. (pp. 465).

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888.

Sibson, R. (1973). SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30.


Silverman, B. W. (1986). Density estimation for statistics and data analysis, vol. 26. CRC Press.

Snow, R., O'Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, (pp. 254). Association for Computational Linguistics.

Sokal, R., & Michener, C. (1958). A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, (38), 1409.

Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. In KDD Workshop on Text Mining, vol. 400, (pp. 525).

Steinhaus, H. (1957). Sur la division des corps matériels en parties. Bull. Acad. Polon. des Sciences, 4(12), 801.

Sternlicht, D., Fernandez, J., Holtzapple, R., Kucik, D. P., Montgomery, T. C., & Loeffler, C. (2011). Advanced sonar technologies for autonomous mine countermeasures. In Proceedings of the MTS/IEEE OCEANS Conference, (pp. 1).

Strehl, A., & Ghosh, J. (2003). Cluster ensembles—a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3, 583.

Sublemontier, J.-H., Martin, L., Cleuziou, G., & Exbrayat, M. (2011). Integrating pairwise constraints into clustering algorithms: Optimization-based approaches. In IEEE 11th International Conference on Data Mining Workshops (ICDMW), (pp. 272). IEEE.

Truong, D. T., & Battiti, R. (2013). A survey of semi-supervised clustering algorithms: From a priori scheme to interactive scheme and open issues.

Van Craenendonck, T., & Blockeel, H. (2015). Limitations of using constraint set utility in semi-supervised clustering. In Proceedings of the European Conference on Machine Learning. To appear.

Van Dongen, S. (2000). A cluster algorithm for graphs. Report - Information Systems, (10), 1.

Vega-Pons, S., & Ruiz-Shulcloper, J. (2011). A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence, 25(3), 337.

Vlachos, A. (2008). A stopping criterion for active learning. Computer Speech & Language, 22(3), 295.

Vlachos, A., Korhonen, A., & Ghahramani, Z. (2009). Unsupervised and constrained Dirichlet process mixture models for verb clustering. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, (pp. 74). Association for Computational Linguistics.


Vu, V.-V., Labroche, N., & Bouchon-Meunier, B. (2010). Boosting clustering by active constraint selection. In Proceedings of the 19th European Conference on Artificial Intelligence, vol. 10, (pp. 297).

Wagstaff, K., & Cardie, C. (2000). Clustering with instance-level constraints. In Proceedings of the International Conference on Machine Learning, (pp. 1103).

Wagstaff, K., Cardie, C., Rogers, S., & Schrodl, S. (2001). Constrained k-means clustering with background knowledge. In ICML, vol. 1, (pp. 577).

Wagstaff, K. L. (2007). Value, cost, and sharing: Open issues in constrained clustering. In 5th International Workshop on Knowledge Discovery in Inductive Databases, (pp. 1).

Wagstaff, K. L., Basu, S., & Davidson, I. (2006). When is constrained clustering beneficial, and why?

Wang, W., Yang, J., & Muntz, R. (1997). STING: A statistical information grid approach to spatial data mining. In VLDB, vol. 97, (pp. 186).

Wang, X., & Davidson, I. (2010). Flexible constrained spectral clustering. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 563). ACM.

Ward Jr., J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236.

Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications, vol. 8. Cambridge University Press.

Weinberger, K., Blitzer, J., & Saul, L. (2006). Distance metric learning for large margin nearest neighbor classification. Advances in Neural Information Processing Systems, 18, 1473.

Welinder, P., & Perona, P. (2010). Online crowdsourcing: Rating annotators and obtaining cost-effective labels. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, (pp. 25). IEEE.

Willett, P. (1988). Recent trends in hierarchical document clustering: A critical review. Information Processing and Management, 24(5), 577.

Wu, L., Hoi, S. C., Jin, R., Zhu, J., & Yu, N. (2012). Learning Bregman distance functions for semi-supervised clustering. IEEE Transactions on Knowledge and Data Engineering, 24(3), 478.


Xing, E. P., & Jordan, M. I. (2003). On semidefinite relaxations for normalized k-cut and connections to spectral clustering.

Xing, E. P., Jordan, M. I., Russell, S., & Ng, A. Y. (2002). Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, (pp. 505).

Xu, Q., desJardins, M., & Wagstaff, K. L. (2005). Active constrained clustering by examining spectral eigenvectors. In Proceedings of the 8th International Conference on Discovery Science, (pp. 294). Springer.

Xu, R., & Wunsch, D. (2008). Clustering, vol. 10. John Wiley & Sons.

Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (pp. 267). ACM.

Yan, Y., Fung, G. M., Rosales, R., & Dy, J. G. (2011). Active learning from crowds. In Proceedings of the 28th International Conference on Machine Learning, (pp. 1161).

Yang, L., & Jin, R. (2006). Distance metric learning: A comprehensive survey. Michigan State University, 2.

Yang, Y., Wang, H., Lin, C., & Zhang, J. (2012). Semi-supervised clustering ensemble based on multi-ant colonies algorithm. In Rough Sets and Knowledge Technology, (pp. 302). Springer.

Yeung, D.-Y., & Chang, H. (2007). A kernel approach for semisupervised metric learning. IEEE Transactions on Neural Networks, 18(1), 141.

Yuen, M.-C., King, I., & Leung, K.-S. (2011). A survey of crowdsourcing systems. In Proceedings of the 3rd IEEE International Conference on Social Computing, (pp. 766). IEEE.

Zeng, E., Yang, C., Li, T., & Narasimhan, G. (2007). On the effectiveness of constraints sets in clustering genes. In Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, (pp. 79). IEEE.

Zhang, D., Zhou, Z.-H., & Chen, S. (2007). Semi-supervised dimensionality reduction. In SDM, (pp. 629). SIAM.

Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Record, vol. 25, (pp. 103). ACM.

Zhang, W., Wang, X., Zhao, D., & Tang, X. (2012). Graph degree linkage: Agglomerative clustering on a directed graph. In Computer Vision – ECCV 2012, (pp. 428). Springer.


Zhu, J., Wang, H., Hovy, E., & Ma, M. (2010). Confidence-based stopping criteria for active learning for data annotation. ACM Transactions on Speech and Language Processing (TSLP), 6(3), 3.

Zhu, X., Loy, C. C., & Gong, S. (2015). Constrained clustering with imperfect oracles. In press.


BIOGRAPHICAL SKETCH

Evan Kriminger was born in Miami, Florida in 1987. Evan graduated from North Miami Beach Senior High School in May 2005. He attended the University of Miami in Coral Gables, Florida, where in May 2009, he received his Bachelor of Science degree in engineering science with minors in physics and mathematics. During his undergraduate studies, Evan was involved in undergraduate research, and served as a teaching assistant for the Experimental Engineering Lab in the Department of Mechanical and Aerospace Engineering. In August 2009, Evan enrolled at the University of Florida to pursue a Ph.D. in the Electrical and Computer Engineering Department. He was a teaching assistant for introductory circuits for three semesters before becoming a research assistant under Dr. Jose Príncipe. He had the opportunity to work with researchers at HP Labs developing statistical methods for outlier detection in gas and oil pipelines. For the past four years, he has collaborated with the Office of Naval Research developing methods for sonar processing. Evan received his Ph.D. in electrical and computer engineering in December 2015. Evan was also an active volunteer during his Ph.D. studies, serving over 600 hours of wildlife rehabilitation for Florida Wildlife Care in Gainesville, Florida.