
Efficient Algorithms for Learning Correlations in Large-Scale Wireless Data

Permanent Link: http://ufdc.ufl.edu/UFE0044610/00001

Material Information

Title: Efficient Algorithms for Learning Correlations in Large-Scale Wireless Data
Physical Description: 1 online resource (121 p.)
Language: english
Creator: Almutairi, Abdullah
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2012

Subjects

Subjects / Keywords: co-clustering -- correlations -- data -- mixture -- models -- networks -- wireless
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Wireless mobile networks are experiencing tremendous growth and increased presence. Data collected for mobile users can be effectively used to design more effective networks as well as provide more effective services to the users. In particular, understanding the behavioral model of the users in the wireless network can help in designing efficient context-aware protocols and services that revolve around the target users. In this dissertation, we explore the use of data mining techniques for analyzing wireless data. Wireless data has many characteristics that make it very challenging from a data mining perspective. In particular, the challenges revolve around a large number of spatio-temporal dimensions representing mobility of the users, their access patterns from multiple websites and time domain. For each of the dimensions, the number of possible values can be from thousands to millions. Further, these values may have spatial and temporal relationships. We present a novel algorithm that reduces the overall time complexity of finding meaningful website access patterns in wireless data. The overall time reduction is by several orders of magnitude resulting in the application of these techniques to significantly larger size problems. When data is available for multiple locations, it is very challenging to develop multiple models that each represent the different behavior of users at the different locations but also capture the commonalities of behavior between multiple locations. We propose a Global Local model that captures the above ideas and can be used to understand the relationship of different locations based on multiple users' mobile behavior. The study is extended to the temporal attributes of the data to learn both the temporal and spatio-temporal correlations present in the wireless data. To find these correlations we propose a Multi-Dimensional Hierarchical Co-Clustering (MDHCC) method.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Abdullah Almutairi.
Thesis: Thesis (Ph.D.)--University of Florida, 2012.
Local: Adviser: Ranka, Sanjay.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2012
System ID: UFE0044610:00001


Full Text

PAGE 1

EFFICIENT ALGORITHMS FOR LEARNING CORRELATIONS IN LARGE-SCALE WIRELESS DATA
By
ABDULLAH ALMUTAIRI
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2012

PAGE 2

© 2012 Abdullah Almutairi

PAGE 3

I dedicate this to my family.

PAGE 4

ACKNOWLEDGMENTS

I would like to thank my advisor Dr. Sanjay Ranka for his guidance and support. I would also like to thank Manas Somaiya for his great help and support. I would also like to express my gratitude to Dr. Ahmed Helmy for his support and feedback. My thanks are also extended to my other committee members Dr. Jih-Kwon Peir, Dr. Beverly Sanders and Dr. Nawari Nawari for agreeing to be on my committee.

This work would not have been possible without the enormous support, dedication and patience of my wife. I would like to express my deep gratitude for her extreme support throughout the years, and for her sticking with me through the good days and the bad days.

My thanks are also extended to all my friends in Gainesville, especially Eisa Alnashmi, Muhammad Almatar, Adel Alsaffar, Talal Alkindery, Mishari Alnahedh, Younis Salmeen, Ahmad and Ibraheem Albasheer.

PAGE 5

TABLE OF CONTENTS
page
ACKNOWLEDGMENTS .... 4
LIST OF TABLES .... 8
LIST OF FIGURES .... 12
ABSTRACT .... 13
CHAPTER
1 INTRODUCTION .... 15
1.1 Problem Statement .... 15
1.2 Related Work .... 17
1.3 Outline of the Dissertation .... 19
1.3.1 Modeling User Web Behavior .... 19
1.4 Incorporating Spatial Information to the Web Behavior .... 21
1.5 Incorporating Temporal Information to the Web Behavior .... 21
1.6 Contributions .... 23
2 A FAST APPROXIMATE POWER MODEL .... 24
2.1 Method .... 25
2.1.1 The POWER Model .... 25
2.1.2 Weight Learning in the POWER Model .... 26
2.1.3 Approximation Algorithm .... 30
2.1.4 The Initial LLapp Computation for the First Weight wi,1 .... 32
2.1.5 LLapp for the Second Update of wi,1 .... 34
2.1.6 LLapp for the Third Update of wi,1 .... 35
2.1.7 LLapp for the First Update of the Remainder Weights wi,?1 .... 37
2.1.8 LLapp for the Second and Third Update of the Remainder Weights wi,?1 .... 38
2.1.9 Handling Ai,j .... 40
2.1.10 Handling Bi,j .... 43
2.1.11 Handling Ci,j .... 43
2.1.12 Handling Di,1 .... 44
2.1.12.1 Partitioning the Range of xa,j2 .... 45
2.1.12.2 Updating Di,?1 .... 46
2.1.13 Handling Ei,1 .... 49
2.2 Experiments .... 49
2.2.1 Synthetic Data Set Results .... 49
2.2.2 NIPS Papers Data Set .... 51
2.2.3 Wireless Data Set .... 52
2.3 Related and Future Work .... 54

PAGE 6

2.4 Discussion .... 55
2.5 Our Contributions .... 55
3 A GLOBAL LOCAL MODELING OF INTERNET USAGE IN LARGE MOBILE SOCIETIES .... 56
3.1 The Global Local Model .... 58
3.2 Experiments .... 59
3.2.1 Synthetic Data Results .... 59
3.2.2 Wireless Data Results .... 59
3.2.2.1 Global Phase .... 62
3.2.2.2 Local Phase .... 63
3.2.3 Internet Usage Similarity Between Locations .... 65
3.3 Related Work .... 66
3.4 Discussion .... 67
3.5 Our Contributions .... 67
4 LEARNING SPATIO-TEMPORAL CORRELATIONS IN LARGE SCALE WIRELESS DATA USING MULTI-DIMENSIONAL CO-CLUSTERING METHODS .... 69
4.1 Preliminaries .... 71
4.1.1 Problem Statement .... 71
4.1.2 Temporal Data Representation .... 72
4.2 Methods .... 74
4.2.1 Multi-Dimensional Information Theoretic Co-Clustering (MDITCC) .... 74
4.2.2 Multi-way Distributional Clustering via Pairwise Interaction (MDC) .... 75
4.2.3 Hierarchical Multi-dimensional Co-clustering .... 76
4.3 Results .... 79
4.3.1 Temporal Correlations in the Wireless Data Results .... 80
4.3.1.1 Multi-Dimensional Information Theoretic Co-Clustering Results .... 80
4.3.1.2 Multi-way Distributional Clustering (MDC) via Pairwise Interaction with Multi-Dimensional Time Results .... 84
4.3.1.3 Multi-way Distributional Clustering (MDC) via Pairwise Interaction with Single-dimension Time Representation Results .... 90
4.3.1.4 Multi-Dimensional Hierarchical Co-Clustering (MDHCC) Results .... 93
4.3.2 Spatio-Temporal Correlations in the Wireless Data .... 98
4.3.2.1 Multi-Dimensional Information Theoretic Co-Clustering (MDITCC) Results .... 99
4.3.2.2 Multi-way Distributional Clustering (MDC) via Pairwise Interaction with Multi-Dimensional Time Representation Results .... 100
4.3.2.3 Multi-way Distributional Clustering via Pairwise Interaction with Single-Dimension Time Representation Results .... 102

PAGE 7

4.3.2.4 Multi-Dimensional Hierarchical Co-Clustering (MDHCC) Results .... 103
4.3.3 Runtime Evaluation .... 106
4.3.4 Meta Analysis .... 107
4.3.5 Overall Discussion .... 108
4.4 Related Work .... 109
4.5 Discussion .... 111
4.6 Our Contributions .... 112
5 CONCLUSIONS .... 114
REFERENCES .... 116
BIOGRAPHICAL SKETCH .... 121

PAGE 8

LIST OF TABLES
Table page
2-1 Constants and random variables used in the generation process for the POWER model .... 27
2-2 The generated and learned 4 patterns used for the first synthetic data set. The underbrace under a number is the string length of the number .... 50
2-3 The generated and learned 4 patterns used for the second synthetic data set. The underbrace under a number is the string length of the number .... 50
2-4 The highest ranked words for some of the components learned from the NIPS data set .... 52
2-5 The highest ranked web domains for some of the components learned from the wireless data set using the approximate version of the POWER model .... 53
3-1 The 4 patterns used to generate the synthetic data. The underbrace under a number is the string length of the number .... 59
3-2 The 4 patterns learned from the synthetic data. The underbrace under a number is the string length of the number .... 60
3-3 The global generated appearance probability () and the learned appearance probability () .... 60
3-4 The local generated appearance probability () and the learned appearance probability () in location 1 .... 60
3-5 The local generated appearance probability () and the learned appearance probability () in location 2 .... 60
3-6 The local generated appearance probability () and the learned appearance probability () in location 3 .... 60
3-7 The local generated appearance probability () and the learned appearance probability () in location 4 .... 60
3-8 The local generated appearance probability () and the learned appearance probability () in location 5 .... 61
3-9 The local generated appearance probability () and the learned appearance probability () in location 6 .... 61
3-10 The local generated appearance probability () and the learned appearance probability () in location 7 .... 61
3-11 The local generated appearance probability () and the learned appearance probability () in location 8 .... 61

PAGE 9

3-12 The local generated appearance probability () and the learned appearance probability () in location 9 .... 61
3-13 The local generated appearance probability () and the learned appearance probability () in location 10 .... 61
3-14 The wireless data locations used in the USC campus .... 62
3-15 The highest ranked web domains for some of the components learned from the USC campus wireless data set global phase .... 63
3-16 The appearance probability () of locations in descending order, sorted per component .... 64
4-1 The co-clusters learned using the MDITCC from the USC campus wireless data for each dimension for January 2008 .... 81
4-2 The co-clusters learned using the MDITCC from the USC campus wireless data for each dimension for February 2008 .... 82
4-3 The co-clusters learned using the MDITCC from the USC campus wireless data for each dimension for March 2008 .... 83
4-4 The co-clusters learned using the MDITCC from the USC campus wireless data for each dimension for the period of January-March 2008 .... 84
4-5 The co-clusters learned from the USC campus wireless data for each dimension for January 2008 .... 85
4-6 The co-clusters learned from the USC campus wireless data for each dimension for February 2008 .... 87
4-7 The co-clusters learned from the USC campus wireless data for each dimension for March 2008 .... 88
4-8 The co-clusters learned from the USC campus wireless data for each dimension for the period from January 2008 to March 2008 .... 89
4-9 The co-clusters learned from the USC campus wireless data for single dimension time representation for January 2008 .... 91
4-10 The co-clusters learned from the Hour days dimension in the USC campus wireless data for single dimension time representation for February 2008 .... 92
4-11 The co-clusters learned from the Hour days dimension in the USC campus wireless data for single dimension time representation for March 2008 .... 92
4-12 The co-clusters learned using the MDC algorithm from the USC campus wireless data for single-dimension time representation for January 2008-March 2008 .... 93

PAGE 10

4-13 The co-clusters learned using MDHCC from the USC campus wireless data for the top level time dimension for January 2008 .... 93
4-14 The co-clusters belonging to cluster 1 learned using MDHCC from the USC campus wireless data for the bottom level time dimension .... 94
4-15 The co-clusters belonging to cluster 2 learned using MDHCC from the USC campus wireless data for the bottom level time dimension .... 94
4-16 The co-clusters learned using MDHCC from the USC campus wireless data for each dimension for February 2008 .... 95
4-17 The co-clusters belonging to cluster 1 learned using MDHCC from the USC campus wireless data for the bottom level time dimension .... 95
4-18 The co-clusters belonging to cluster 2 learned using MDHCC from the USC campus wireless data for the bottom level time dimension .... 95
4-19 The co-clusters learned using MDHCC from the USC campus wireless data for the top level for the period from January to March 2008 .... 96
4-20 The co-clusters belonging to cluster 1 learned using MDHCC from the USC campus wireless data for the bottom level time dimension .... 96
4-21 The co-clusters belonging to cluster 2 learned using MDHCC from the USC campus wireless data for the bottom level time dimension .... 97
4-22 The co-clusters belonging to cluster 5 learned using MDHCC from the USC campus wireless data for the bottom level time dimension .... 97
4-23 The wireless data buildings used in the USC campus .... 98
4-24 The co-clusters learned using the MDITCC algorithm from the USC campus wireless data .... 99
4-25 The co-clusters learned using the MDC algorithm from the USC campus wireless data with a multi-dimensional time representation .... 101
4-26 The co-clusters learned using the MDC algorithm from the USC campus wireless data with a single-dimension time representation .... 103
4-27 The top level co-clusters learned using MDHCC from the USC campus wireless data .... 104
4-28 The bottom level co-clusters belonging to cluster 1 learned using MDHCC from the USC campus wireless data .... 104
4-29 The bottom level co-clusters belonging to cluster 4 learned using MDHCC from the USC campus wireless data .... 104

PAGE 11

4-30 The bottom level co-clusters belonging to cluster 5 learned using MDHCC from the USC campus wireless data .... 105
4-31 The co-clusters learned in the location dimension using MDHCC from the USC campus wireless data .... 105
4-32 The MDHCC runtimes on different numbers and sizes of data dimensions .... 106
4-33 The MDC runtimes on different numbers and sizes of data dimensions .... 106

PAGE 12

LIST OF FIGURES
Figure page
2-1 The generation process for the POWER model. .... 27
2-2 Axis showing the range of values of xa,j2 and the location of its average .... 46
2-3 Partitioning the range of values of xa,j2 into two areas, each with its values average .... 46
3-1 Graph representation of the dissimilarity matrix using the threshold of 0.1 for the locations in the mobile society .... 65
4-1 Temporal attributes in one-dimension .... 72
4-2 Co-clustering of temporal attributes in a single dimension .... 73
4-3 Temporal attributes in a hierarchy .... 73
4-4 Co-clustering of temporal attributes in a hierarchy .... 73
4-5 Temporal attributes in multiple dimensions .... 73
4-6 Co-clustering of temporal attributes in multiple dimensions .... 74
4-7 The shape of the time hierarchy after the clustering of the Monday and Wednesday variables on the day level of the hierarchy .... 77
4-8 The shape of the time hierarchy after the clustering of the Monday and Wednesday variables on the day level of the hierarchy .... 78
4-9 The graph representing the pair-wise interaction between the variables in a one-month wireless data .... 85
4-10 The graph representing the pair-wise interaction between the variables for all months in the wireless data .... 85
4-11 Graph representation of the dissimilarity matrix using the threshold of 0.0075 for the hours of a day for February 2008 in the USC campus .... 90
4-12 The graph representing the pairwise interaction between the variables of wireless data for the analysis of the spatio-temporal correlations .... 100
4-13 Graph representation of the dissimilarity matrix of the access time behavior using the threshold of 0.2 for the locations in the USC campus .... 107

PAGE 13

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

EFFICIENT ALGORITHMS FOR LEARNING CORRELATIONS IN LARGE-SCALE WIRELESS DATA

By
Abdullah Almutairi
August 2012

Chair: Sanjay Ranka
Major: Computer Engineering

Wireless mobile networks are experiencing tremendous growth and increased presence. Data collected for mobile users can be effectively used to design more effective networks as well as provide more effective services to the users. In particular, understanding the behavioral model of the users in the wireless network can help in designing efficient context-aware protocols and services that revolve around the target users. In this dissertation, we explore the use of data mining techniques for analyzing wireless data. Wireless data has many characteristics that make it very challenging from a data mining perspective. In particular, the challenges revolve around a large number of spatio-temporal dimensions representing mobility of the users, their access patterns from multiple websites and time domain. For each of the dimensions, the number of possible values can be from thousands to millions. Further, these values may have spatial and temporal relationships. We present a novel algorithm that reduces the overall time complexity of finding meaningful website access patterns in wireless data. The overall time reduction is by several orders of magnitude resulting in the application of these techniques to significantly larger size problems. When data is available for multiple locations, it is very challenging to develop multiple models that each represent the different behavior of users at the different locations but also capture the commonalities of behavior between

PAGE 14

multiple locations. We propose a Global Local model that captures the above ideas and can be used to understand the relationship of different locations based on multiple users' mobile behavior. The study is extended to the temporal attributes of the data to learn both the temporal and spatio-temporal correlations present in the wireless data. To find these correlations we propose a Multi-Dimensional Hierarchical Co-Clustering (MDHCC) method.

PAGE 15

CHAPTER 1
INTRODUCTION

1.1 Problem Statement

Wireless networks are rapidly growing and expanding their reach. With this growth, the networks must service the increasing number of users and handle mobility patterns, online activity, and load efficiently. Therefore, new design paradigms, efficient protocols and services should be created. Wireless network design based on user activity and behavior is essential to create better network protocols and services. This design paradigm starts by understanding and realistically modeling the users' Internet behavior.

The users' wireless data has proven to be a challenge to collect, process and analyze due to the scale of the traces. Extensive netflow, DHCP and MAC trap traces for thousands of mobile users in a WLAN spanning over 79 buildings and including over 700 APs on the University of Southern California (USC) campus were collected and processed [1]. This is considered by far one of the largest sets of traces processed for mobile networks. Different kinds of information about the users' Internet behavior are available after integrating the traces mentioned above, for example, spatial and location-based information about access points or buildings, temporal information about session times and durations, interest-based information about web domains visited and applications used, and load and traffic information about flow rate and packet rate.

This large multi-dimensional data set with fine granularity can be used for data-driven modeling of large mobile societies. This modeling approach helps in designing better context-aware network protocols and services based on the users' Internet usage behavior. In the first phase, extensive data sets are collected using the network infrastructure (or the mobile devices), plus augmenting information from online directories (e.g., buildings directory, maps) and the web services (e.g., whois lookup service). Data processing is the second phase to cross-correlate acquired information

PAGE 16

from different resources (e.g., access points, IP and MAC addresses), in which multiple data sets are manipulated, integrated and aggregated.

In this dissertation, we developed efficient and scalable algorithms for data mining that learn the user behavior in the large-scale wireless data. We explore the various granularities and scale of information presented in this wireless data. The main challenges of this endeavor are the sheer size of the data (millions to billions of records) and the number of values in each dimension of the data (thousands to millions of web domains visited). Also, the existence of spatial and temporal information in the data should be accounted for since it represents properties of the user behavior and may affect it.

Our main focus in the usage behavior is on learning the users' web domain visitation patterns and discovering the classes of users in a large mobile society based on these patterns. After discovering the classes of users, we incorporate the location information in the data to find the influence of location on the users' online behavior. When data is available for multiple locations, it is very challenging to develop multiple models that each represent the different behavior of users at the different locations while also capturing the commonalities of behavior between multiple locations. We have proposed a Global Local model that captures the above ideas and can be used to understand the relationship of different locations based on multiple users' mobile behavior.

The next step is incorporating the temporal information in the wireless data to discover the influence of time and the combined influence of location-time on the usage behavior. Analyzing temporal information in data can be challenging for many reasons. The continuous nature of time allows it to be represented in many ways. We study which representation of the temporal information makes it amenable to data mining methods. Discovering meaningful correlations in the multi-dimensional wireless data that contains the spatial and temporal information requires co-clustering methods that can handle this kind of data. Therefore, we developed a Multi-Dimensional Hierarchical

PAGE 17

Co-Clustering (MDHCC) method that discovers the temporal and spatio-temporal correlations in the wireless data. In section 1.3, we briefly describe our work in more detail.

1.2 Related Work

Using the observed user behavior to design realistic and practical mobility models has been the focus of many works [33-36]. It has been shown that the most widely used existing mobility models fail to generate realistic mobility characteristics observed from the traces. Realistic mobility modeling is essential for protocol performance [37].

Correlating the user behavior with his location has rarely been covered in research. Ploumidis et al. [38] used a multi-level (network, AP and client) application-based traffic characterization, then grouped APs based on building category to examine variation in application use. A weak correlation was found between the type of application used and some building categories, but traffic characterization was based only on AP traces and only on 7.5 days of traffic. Another application-based study of WLAN usage on a campus [39] evaluated the inbound and outbound traffic of web applications in residential and non-residential locations. Their work didn't find differences in user behavior based on the location. In a previous work [1], locations (i.e., buildings) in a mobile society were clustered together based on the similarity of Internet usage behavior, and it was found that locations of the same type actually cluster together, but no relations were drawn between the type of web domain clusters learned and their appearance in similar types of locations.

Studying spatio-temporal effects on WiFi networks has recently gained some interest in research. A paper [47] modeled the traffic demand of a campus WLAN taking into account the spatial-temporal dimensions. The different spatial scales used were infrastructure-wide, AP-level or client-level, with the time granularities being packet-level, flow-level or aggregate. This is different from our spatio-temporal focus where we concentrate on buildings in a campus for the spatial dimension and hours,

PAGE 18

days and months for the temporal dimension. In the Afanasyev et al. paper [46], location-based correlation was found between the area type of the city and the device type. Smartphones were mostly used in transportation areas, laptops were mostly used in commercial areas with wireless hotspots and desktop computers were mostly used in residential areas. Also, temporal-based correlation was found between the client device type and the time-of-day and the day-of-week. Each client device type exhibited a unique behavior depending on the hour of the day and the day of the week, which is what we show in our work. However, their study was based on the number of active users, the mobility of the users and the online traffic generated. Unlike the great majority of research in this area, our work focuses on the web-domain visitation behavior of the users.

Data clustering is a widely used data mining technique [48, 49]. It deals with the unsupervised grouping of observations into similar clusters. Clustering is used to study many kinds of problems such as data mining, image segmentation, text and pattern recognition and bioinformatics. There are many types of clustering algorithms: the most popular ones are k-means clustering [50], expectation-maximization (EM) [51], DBSCAN [52], hierarchical clustering [53] and co-clustering [54].

Co-clustering (sometimes called biclustering or block clustering) is a clustering technique where rows and columns of a matrix are simultaneously clustered, providing both row clusters and column clusters. Co-clustering uses a two-way clustering across two dimensions in contrast to regular clustering, which uses one-way clustering across only one dimension. This allows for new kinds of patterns to be detected and also was found to lead to better clustering of the data.

There are many strategies for co-clustering, such as the spectral techniques and information theoretic approaches. These strategies differ in the way they measure similarity and by the way they treat the co-occurrence table. The spectral techniques treat the co-occurrence table as an adjacency matrix underlying a bipartite graph; the

PAGE 19

goal is to minimize a cut function that measures the degree of association between the node sets. On the other hand, the information theoretic approach [5] normalizes the co-occurrence table, converting it into a joint probability table, with the goal of minimizing the loss in mutual information between the original table and the clustered version. This approach monotonically increases the mutual information preserved by intertwining both the row and column clusters at all stages.

1.3 Outline of the Dissertation

1.3.1 Modeling User Web Behavior

The POWER model [2] is a new class of mixture models where multiple components can contribute to the generation of a single data point while simultaneously allowing each component to have a varying degree of influence on different data attributes. The POWER framework is a generic, Bayesian framework that solves the problem of learning hidden patterns from data. Classically, various methods based on mixture models [3] have been proposed as one approach to learn such patterns. However, these models do not account for the case where various hidden patterns interact in an overlapping and complex way to generate a single data point. Allowing for this enables learning very generic and intuitive patterns while still precisely modeling specific data points.

While the POWER model is a very useful new class of mixture models that helps in learning hidden overlapping correlations, it suffers from a very long learning time. This long learning time is mainly due to the weight updating step of the Gibbs sampler which is used to learn the parameters of the model. To update all of the kd weights of the model, it takes O(nkd^2) computation time for each iteration of the Gibbs sampler. The POWER model will take a very long time to learn, especially with high-dimensional data, since the computation time is non-linear with respect to the number of dimensions of the data. This renders it impractical to use on very large and high-dimensional data. However, many very large and high-dimensional real life data sets exist where we are interested in finding hidden overlapping correlations, like the netflow wireless


data. In order to be able to use the POWER model practically on this kind of data, the computation time of the weight-updating step needs to be reduced.

While the co-clustering technique was used on the netflow wireless data to discover clusters of users and web domains [1], some hidden correlations may be lost due to the existence of multiple classes of users. A user may belong to many classes of users, such as news junkie, social-network fan, movies fan, sports fan, hacker, and gaming enthusiast, and the usage patterns may be influenced by one or more of these classes. Hence, it makes more sense to model the behavior of each user as resulting from the influence of several classes.

Furthermore, the class of a user and the location the user most likely appears in could be correlated. For example, users most likely visit web domains related to their interests in locations where they are comfortable doing so, like their residence or a location where they hang out. Users in a library are more likely to use the Internet for research than for entertainment. A model is needed to learn this correlation while considering that locations are part of a large mobile society and that the classes of users are the same all over this mobile society.

In the next sections we outline our approach to addressing the POWER model issue and to using the model to learn correlations in a large mobile society. In chapter 2, we propose an approximation to the POWER model weight parameter learning algorithm that reduces the computational time significantly. The overall complexity of the new algorithm is $O(nkdp)$ in practice, where $p$ is the number of partitions used in our approximation and is much smaller than $d$. This allows the model learning time to be linear in the number of attributes of the data and provides a significant speedup over the original algorithm. With a faster POWER model whose learning time is linear in the number of dimensions of the data, we can use the model practically on high-dimensional data like the netflow wireless data.


1.4 Incorporating Spatial Information to the Web Behavior

It is important to model the Internet usage behavior in different types of locations in a large mobile society. We first learn the clustered wireless Internet usage behavior in accessing web domains. Then, we investigate whether, and to what extent, the type of location a user is in can influence Internet usage behavior and correlate with the type of web domain a user is visiting. Also, we wish to know the probability of each web domain cluster appearing in the locations. This gives the level of relationship between each location and all web domain clusters.

In chapter 3, a Global Local model is proposed that learns the Internet usage behavior in different types of locations in a large mobile society. The Global Local model was devised in order to model the influence of a location in a mobile society on a user's web activity and behavior. It is important to understand the web usage in a location in order to design efficient context-aware Internet protocols and services suitable for all locations. A previous work [1] showed how similar types of locations are clustered together based on the users' web behavior; our model shows how the type of a web domain cluster learned from users' behavior in the mobile society as a whole correlates with the type of a certain location in the society. The model also generates the likelihood of the web domain clusters appearing in the locations. This gives the level of relationship between each location and all web domain clusters. The model first learns a global template of clusters from the global wireless data. The learned global template is then imposed on the model learning for each location in the mobile society. This relates the locations in a mobile society to one another and makes it possible to compare locations.

1.5 Incorporating Temporal Information to the Web Behavior

After learning the spatial influence on the patterns of the online behavior, we explore both the temporal and spatio-temporal correlations in the users' behavior in chapter 4. We study the influence of the time of access on the online behavior. Then,


we study the influence of the combined factors of the location and time of the online access on the behavior.

Analyzing data with temporal attributes has its issues. One issue is handling the continuous and granular nature of time, i.e. continuous time can be divided into seconds, minutes, hours, days, etc. Temporal attributes in data can be represented in different ways, such as: (i) one-dimensional data where the columns in the co-occurrence table represent consecutive hours in the time period that spans the data; or (ii) multi-dimensional data where each dimension represents a different granular level of time in the co-occurrence table, e.g. the first dimension can represent the 24 hours in a day, the second dimension can represent days in a week and the third dimension can represent weeks in a month; or (iii) one-dimensional data but using a hierarchy, e.g. the columns of the co-occurrence table would be all the hours of many consecutive days, and, on a higher level, we group the consecutive hours into days, making that level represent days, which is less granular than the lower level. This can be repeated as we go higher, having levels for weeks, months, etc.

Due to the high dimensionality of the wireless data with the spatial and temporal attributes, multi-dimensional co-clustering methods are used to study the correlations. A Multi-Dimensional Hierarchical Co-Clustering (MDHCC) method is developed to discover correlations in multi-dimensional data with hierarchical information. The method uses pairwise interaction between the dimensions in computing the objective function to learn co-clusters in each dimension of the data. The learning on the dimension that contains the hierarchical information starts at the top level by learning the general co-clusters. Then, each general co-cluster is specialized in the next lower level and co-clusters are learned in each general one. The method continues the process until it reaches the bottom level of the hierarchy. The method was found to improve the co-cluster learning for multi-dimensional data by discovering more specialized correlations in the data.
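The three temporal representations listed above can be illustrated with a small sketch. It assumes, purely for illustration, that time is given as a zero-based index of consecutive hours and that the trace starts at the first hour of the first day of a week. The function names are hypothetical and do not come from the dissertation.

```python
def flat_column(hour_index):
    """(i) One-dimensional: one column per consecutive hour of the trace."""
    return hour_index


def multi_dim(hour_index):
    """(ii) Multi-dimensional: (hour of day, day of week, week) coordinates.

    Assumes the trace starts at hour 0 of the first day of a week.
    """
    day = hour_index // 24
    return (hour_index % 24, day % 7, day // 7)


def hierarchy(hour_index):
    """(iii) Hierarchical: the hour column together with its coarser ancestors."""
    day = hour_index // 24
    return {"hour": hour_index, "day": day, "week": day // 7}


if __name__ == "__main__":
    h = 24 * 8 + 5  # sixth hour of the ninth day of the trace
    print(flat_column(h), multi_dim(h), hierarchy(h))
```

In representation (iii) the day and week entries are not extra columns. They only record which coarser group each hour column belongs to, which is the kind of information a hierarchical method can use when it moves from general to more specialized co-clusters.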


1.6 Contributions

To summarize, the contributions of this dissertation are as follows:

We have proposed an innovative approximation that improves the learning time of the POWER model from quadratic to linear in the number of data attributes. The overall time reduction is by several orders of magnitude, resulting in the application of these techniques to significantly larger problems.

We have developed a novel Global Local model that learns the Internet usage behavior in a wireless mobile society as a whole and in locations inside the mobile society. The model gives the level of relationship between the web domain clusters learned globally in the mobile society and the locations inside the mobile society. It has one local model for all locations, allowing different locations to be compared to each other based on the usage behavior.

We have discovered both temporal and spatio-temporal correlations in wireless data in a large mobile society using various multi-dimensional co-clustering methods, both existing and novel. We have studied the effect of the data representation of the temporal attribute of the wireless data when applying the co-clustering methods to the data. We have developed a novel Multi-Dimensional Hierarchical Co-Clustering (MDHCC) method to handle co-clustering of multi-dimensional data with hierarchy.


CHAPTER 2
A FAST APPROXIMATE POWER MODEL

Mixture models have been used to model data and for clustering. A probabilistic type of mixture model exists that uses a combination of the probability distributions of different components to model complex data. The Gaussian Mixture Model (GMM) is a similar model where samples are taken from a mixture of $k$ Gaussian variables to produce points. The PrObabilistic Weighted Ensemble of Roles model, or POWER model [2] for short, is a new class of mixture models where multiple components compete with a varying degree of influence to produce a single data point. This class of mixture models helps in discovering hidden complex patterns in the data.

However, the POWER model suffers from a very long learning time in order to detect these complex overlapping data patterns. It has been shown that it takes 300 hours to learn the 21-component model on a large data set called the NIPS papers data set. The NIPS papers data set represents the top 1000 non-trivial words collected from 1500 conference papers; the data set is therefore arranged as 1500 data points with 1000 data attributes. The long learning time is mainly due to the time complexity of the model, which is $O(nkd^2)$, where $n$ is the number of data points, $k$ is the number of components and $d$ is the number of attributes. This computational cost is incurred specifically in the weight parameter update step of the Gibbs sampler that is used to learn the model. Such a high execution time prohibits scaling the application of the POWER framework to high-dimensional data sets.

In this chapter, we propose an approximation to the POWER model weight parameter learning algorithm that reduces the computational time significantly. The overall complexity of the new algorithm is $O(nkdp)$ in practice, where $p$ is the number of partitions used in our approximation and is much smaller than $d$. This allows


the model learning time to be linear in the number of attributes of the data and provides a significant speedup over the original algorithm. We demonstrate the accuracy of our approximation technique using synthetic and real data sets. We also provide experimental results of the approximate POWER model on the NIPS data set and the wireless data sets and show that the model learned is similar. An implementation of the approximate POWER model on the NIPS papers data set with 1000 dimensions is about 27 times faster than the original version.

2.1 Method

Bayesian inference for the POWER model is accomplished via Gibbs sampling. This algorithm is used to generate samples from a joint probability distribution of many random variables. It is especially useful when it is easy to sample from the conditional distributions of the random variables. The Gibbs sampler is a Markov chain Monte Carlo method that, when run for numerous iterations, reaches a steady state where the samples closely approximate the joint probability distribution of the random variables. The computationally intensive portion of the overall algorithm is the part that learns the weight matrix. In the following, we first briefly describe the model and the learning algorithm (for completeness) and then describe our approximation.

2.1.1 The POWER Model

The POWER model is a new class of Bayesian mixture models that allows for multi-class membership. It handles situations that classical mixture models cannot, like the situation when many classes influence the generation of a single data point. This can provide us with generic and intuitive classes that can be mixed together to generate a data point.

In the POWER model the data set is represented by $X = \langle x_1, x_2, \ldots, x_n \rangle$; each $x_a$ takes the form $x_a = \langle x_{a,1}, x_{a,2}, \ldots, x_{a,d} \rangle$. The value $x_{a,j}$ is assumed to have been sampled


from a random variable $A_j$ corresponding to the $j$th attribute, having a parametrized probability density function $f_j$. The POWER model consists of a mixture of $k$ components $C = \{C_1, C_2, \ldots, C_k\}$. Each component $C_i$ has an appearance probability $\alpha_i$. Associated with each component $C_i$ is a $d$-dimensional parameter vector $\Theta_i$ that parametrizes the probability density function $f_i$. A vector of positive real numbers $W_i = \langle w_{i,1}, w_{i,2}, \ldots, w_{i,d} \rangle$ specifies the strength of influence of each component $C_i$ on the various data attributes. These are called parameter weights, and $\sum_j w_{i,j} = 1$.

During the generation of each data point $x_a$, one or more of the $k$ components are marked as active by performing a Bernoulli trial where the probability of $C_i$ being active is $\alpha_i$. Active components are denoted by $c$. After marking all components as active or inactive, a dominant component is selected for each attribute by performing a weighted multinomial trial among the active components. If $C_i$ is active, then the probability that $C_i$ is dominant for attribute $j$ is proportional to $w_{i,j}$. Dominant components are denoted by $g$; all of the values of $g$ are stored in an $n \times d$ matrix called $G$, where $g_{a,j}$ represents the dominant component for attribute $j$ of data point $a$.

The generation process of the POWER model and the constants and random variables used in it are described in Figure 2-1 and Table 2-1. In the figure, light gray nodes are user-defined constants, unshaded nodes are hidden random variables, circular nodes indicate a continuous value, and square nodes indicate a discrete value. Constants and random variables used in the diagram are explained in the table.

2.1.2 Weight Learning in the POWER Model

The POWER model uses the Gibbs sampling algorithm to learn the parameters of the model. Gibbs sampling is used to generate samples from a joint probability distribution of many random variables. It is especially used when it is hard to sample the joint probability distribution of the random variables but it is simpler to sample from


the conditional distribution of those random variables. The Gibbs sampler is an iterative algorithm: it starts from a random initialization and updates the value of each random variable by sampling from its conditional distribution w.r.t. all other random variables.

Table 2-1. Constants and random variables used in the generation process for the POWER model

Variable      Explanation
n             number of data points
d             number of data attributes
k             number of components in the model
$\alpha$      appearance probability of a component
u, v          user-defined prior parameters for $\alpha$
m             mask values
q, r          user-defined prior parameters for m
w             parameter weights
c             indicator variable for active components
g             indicator variable for dominant component
p             Bernoulli probabilities for the parameterized p.d.f.
$p_a$, $p_b$  user-defined prior parameters for p
x             data variable

Figure 2-1. The generation process for the POWER model.
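The generation process described in Section 2.1.1 can be sketched as follows. This is a minimal illustration and not the author's implementation. It assumes binary attributes drawn from the Bernoulli probabilities `P[i][j]` of the dominant component, and it assumes that the activation trials are repeated until at least one component is active (so at least one entry of `alpha` must be positive). The text does not state how an all-inactive draw is handled, so that rule and all names here are hypothetical.

```python
import random


def generate_point(alpha, W, P, rng):
    """Draw one data point: active set c, dominant components g, values x."""
    k, d = len(alpha), len(W[0])
    active = []
    while not active:  # assumption: redraw until some component is active
        active = [i for i in range(k) if rng.random() < alpha[i]]
    g, x = [], []
    for j in range(d):
        # Weighted multinomial trial among the active components only.
        weights = [W[i][j] for i in active]
        dominant = rng.choices(active, weights=weights, k=1)[0]
        g.append(dominant)
        # The dominant component alone generates attribute j.
        x.append(1 if rng.random() < P[dominant][j] else 0)
    return active, g, x


if __name__ == "__main__":
    rng = random.Random(0)
    alpha = [0.6, 0.4]
    W = [[0.2, 0.3, 0.5], [0.5, 0.3, 0.2]]  # each row sums to 1
    P = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.3]]
    print(generate_point(alpha, W, P, rng))
```

Restricting the multinomial trial to the active components is what lets several components jointly shape a single data point while each one influences only the attributes it dominates.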


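The conditional distribution of the mask value $m$ (the non-normalized weight $w$) is represented as

$$F(m_{i,j} \mid X, c, w, \alpha, \Theta) \propto \gamma(w_{i,j} \mid q, r) \prod_{a} \prod_{j} \frac{w_{g_{a,j},j}\, I(c_{a,g_{a,j}} = 1)}{\sum_{i} w_{i,j}\, I(c_{a,i} = 1)}$$ (2-1)

where $q$ and $r$ are user-defined prior parameters for $m$, $I()$ is an indicator function that evaluates to 1 if the logical argument provided to it is true and to 0 otherwise, and

$$w_{i,j} = \frac{m_{i,j}}{\sum_{j} m_{i,j}}$$

The second term of the RHS in (2-1) is the likelihood function, which takes $O(nd)$ time to update the value of one weight $w_{i,j}$. Then, the time to update the values of $W$ for all $k$ components is $O(nkd^2)$. To clearly show the time requirement of each weight update, the Log Likelihood function (LL) of (2-1) is going to be used from now on. It is also what is being approximated in our proposed method. The Log Likelihood function is obtained by first taking the log of the RHS in (2-1):

$$\log\left(\gamma(w_{i,j} \mid q, r) \prod_{a} \prod_{j} \frac{w_{g_{a,j},j}\, I(c_{a,g_{a,j}} = 1)}{\sum_{i} w_{i,j}\, I(c_{a,i} = 1)}\right)$$ (2-2)

Then focusing on the likelihood function part in (2-2) and converting the products into sums using the log properties:

$$\sum_{a=1}^{n} \sum_{j=1}^{d} \log\left(\frac{w_{g_{a,j},j}\, I(c_{a,g_{a,j}} = 1)}{\sum_{i} w_{i,j}\, I(c_{a,i} = 1)}\right)$$ (2-3)

The weights are going to be divided into the active weight of component $i$ and the sum of active weights for all other components. Below is the Log Likelihood function that we are going to use to update the weights.

$$LL_{i,j} = \sum_{a=1}^{n} \sum_{j_2=1}^{d} \Bigg( \underbrace{\log\left(\frac{w_{i,j_2}}{\bar{w}_{a,j_2} + w_{i,j_2}}\right)}_{g_{a,j_2} = i} + \underbrace{\log\left(\frac{\bar{w}_{a,j_2}}{\bar{w}_{a,j_2} + w_{i,j_2}}\right)}_{g_{a,j_2} \neq i} \Bigg)$$ (2-4)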
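To make the $O(nd)$ cost of a single evaluation concrete, the log likelihood $LL_{i,j}$ can be computed directly as sketched below. The sketch follows the equation as written. It takes the against weight of a cell to be the sum, over active components other than $i$, of their weights for that attribute. The function names and the small example are hypothetical.

```python
import math


def against_weight(W, c, i):
    """n x d matrix: summed weights of the active components other than i."""
    k, d = len(W), len(W[0])
    return [
        [sum(W[i2][j] for i2 in range(k) if i2 != i and c[a][i2] == 1)
         for j in range(d)]
        for a in range(len(c))
    ]


def log_likelihood(W, c, g, i):
    """Direct O(nd) evaluation of LL for component i (given the against weights)."""
    w_bar = against_weight(W, c, i)
    total = 0.0
    for a in range(len(g)):
        for j in range(len(g[a])):
            denom = w_bar[a][j] + W[i][j]
            # Numerator depends on whether component i is dominant for (a, j).
            numer = W[i][j] if g[a][j] == i else w_bar[a][j]
            total += math.log(numer / denom)
    return total


if __name__ == "__main__":
    W = [[0.5, 0.5], [0.25, 0.75]]   # k = 2 components, d = 2 attributes
    c = [[1, 1]]                     # both components active for the one point
    g = [[0, 1]]                     # dominant component per attribute
    print(log_likelihood(W, c, g, 0))
```

Because every weight update changes the whole vector $W_i$, repeating this double loop for each of the $d$ weights gives the quadratic dependence on $d$ discussed in the text.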


where $\bar{w}_{a,j_2}$ are elements of the $n \times d$ matrix called the Against Weight $\bar{W}$. Elements of this matrix represent the sum of active weights from all components besides $i$.

$$\bar{w}_{a,j_2} = \underbrace{\sum_{i_2=1}^{k} w_{i_2,j_2}}_{c_{a,i_2} \neq i}$$

After each weight $w_{i,j}$ is updated, all of the other elements of the $W_i$ vector are modified to maintain the constraint that $\sum_{j=1}^{d} w_{i,j} = 1$. The other elements of $W_i$ change by a factor of $\frac{1 - \hat{w}_{i,j}}{1 - w_{i,j}^{PREV}}$, where $\hat{w}_{i,j}$ is the newly updated $w_{i,j}$ and $w_{i,j}^{PREV}$ is the previous value of $w_{i,j}$. We call this factor an Update Ratio ($UR_j$). We will show below the changes happening to $W_i$ after each weight update. $W_i^{(j)}$ denotes $W_i$ after updating the $j$th weight.

The original $W_i$ vector:

$$W_i = \langle w_{i,1}, w_{i,2}, \ldots, w_{i,d} \rangle$$

After updating the value of $w_{i,1}$ using (2-4), $W_i$ becomes:

$$W_i^{(1)} = \langle \hat{w}_{i,1}, UR_1 w_{i,2}, \ldots, UR_1 w_{i,d} \rangle$$

where $UR_1 = \frac{1 - \hat{w}_{i,1}}{1 - w_{i,1}}$. After updating the value of $w_{i,2}$:

$$W_i^{(2)} = \langle UR_2 \hat{w}_{i,1}, \hat{w}_{i,2}, \ldots, UR_2 UR_1 w_{i,d} \rangle$$


where $UR_2 = \frac{1 - \hat{w}_{i,2}}{1 - UR_1 w_{i,2}}$. In general, after updating the value of $w_{i,j}$:

$$W_i^{(j)} = \left\langle \prod_{t=2}^{j} UR_t\, \hat{w}_{i,1},\ \prod_{t=3}^{j} UR_t\, \hat{w}_{i,2},\ \ldots,\ UR_j\, \hat{w}_{i,j-1},\ \hat{w}_{i,j},\ \prod_{t=1}^{j} UR_t\, w_{i,j+1},\ \ldots,\ \prod_{t=1}^{j} UR_t\, w_{i,d} \right\rangle$$

where $UR_j = \frac{1 - \hat{w}_{i,j}}{1 - w_{i,j} \prod_{t=1}^{j-1} UR_t}$. After updating the values of the whole vector:

$$W_i^{(d)} = \left\langle \prod_{t=2}^{d} UR_t\, \hat{w}_{i,1},\ \prod_{t=3}^{d} UR_t\, \hat{w}_{i,2},\ \ldots,\ UR_d\, \hat{w}_{i,d-1},\ \hat{w}_{i,d} \right\rangle$$

where $UR_d = \frac{1 - \hat{w}_{i,d}}{1 - w_{i,d} \prod_{t=1}^{d-1} UR_t}$.

2.1.3 Approximation Algorithm

In this section, we propose an approximation to Equation 2-4 that reduces the time required to update $W_i$. Let us first simplify equation 2-4:

$$LL_{i,j} = \sum_{a=1}^{n} \sum_{j_2=1}^{d} \Big( \underbrace{\log(w_{i,j_2})}_{g_{a,j_2} = i} + \underbrace{\log(\bar{w}_{a,j_2})}_{g_{a,j_2} \neq i} \Big) - \sum_{a=1}^{n} \sum_{j_2=1}^{d} \log(\bar{w}_{a,j_2} + w_{i,j_2})$$ (2-5)
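The simplification rests on the identity $\log(u/v) = \log u - \log v$ applied to each case of the log likelihood. A short worked step, using the same symbols as the surrounding equations, is:

```latex
\begin{aligned}
g_{a,j_2} = i:\quad
  &\log\frac{w_{i,j_2}}{\bar{w}_{a,j_2} + w_{i,j_2}}
   = \log(w_{i,j_2}) - \log(\bar{w}_{a,j_2} + w_{i,j_2}) \\
g_{a,j_2} \neq i:\quad
  &\log\frac{\bar{w}_{a,j_2}}{\bar{w}_{a,j_2} + w_{i,j_2}}
   = \log(\bar{w}_{a,j_2}) - \log(\bar{w}_{a,j_2} + w_{i,j_2})
\end{aligned}
```

Every pair $(a, j_2)$ falls into exactly one of the two cases, so the subtracted term $\log(\bar{w}_{a,j_2} + w_{i,j_2})$ appears once per pair and can be collected into a single double sum over all $n \cdot d$ cells.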


During the update of the $W_i$ elements using equation 2-5, the $\bar{W}$ elements do not change between updates since they do not depend on $W_i$; the $W_i$ elements, on the other hand, change after each update. Since some values used in computing $LL_{i,j}$ for $W_i$ do not change ($\bar{W}$), and other values ($w_{i,\ast}$ except $w_{i,j}$) change by a certain factor ($UR_j$) after updating $w_{i,j}$, it is possible to use a computed value of $LL_{i,j}$ to compute $LL_{i,j+1}$ in less time. Therefore, $LL_{i,1}$ can first be computed to update $w_{i,1}$, then its value can be used to compute $LL_{i,2}$ without restarting the whole computation, and so on until $LL_{i,d}$ is computed from $LL_{i,d-1}$ and $\hat{w}_{i,d-1}$, to finally update $w_{i,d}$. But the following term needs to be addressed in order to achieve this idea:

$$\sum_{a=1}^{n} \sum_{j_2=1}^{d} \log(\bar{w}_{a,j_2} + w_{i,j_2})$$

Since the $W_i$ elements are summed with elements of $\bar{W}$ inside the log, we have to recompute this term after each update of a $W_i$ element. If the $W_i$ term can be removed or separated out, then it is possible to reuse a computed value of $LL_{i,j}$ to compute $LL_{i,j+1}$ by just adding the terms that have the updated $W_i$ values.

An approximation is proposed to achieve this incremental computation of all $LL_{i,j}$ values in less time than the original POWER model. The approximation comprises a set of $p$ chosen elements $\bar{x}$ with corresponding counts $nx$. The $p$ elements of $\bar{x}$ will be modified after each weight $w_{i,j}$ update. We will denote the $p$ elements of $\bar{x}$ corresponding to $w_{i,j}$ by $\bar{x}_j$. The value $p$ is chosen to be much smaller than the number of attributes $d$. These elements and corresponding counts will be chosen such that

$$\sum_{p=1}^{P} nx_p \log(\bar{x}_{j,p})$$

will approximate

$$\sum_{a=1}^{n} \sum_{j_2=1}^{d} \log\left(1 + \frac{\bar{w}_{a,j_2}}{w_{i,j_2}}\right)$$
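The idea of replacing a long sum of logs with $p$ representative values and counts can be sketched as below. The sketch assumes equal-width regions over the observed range of the values, which is only one possible partitioning and may differ from the rule used in this work. All names are hypothetical.

```python
import math


def bin_summary(xs, p):
    """Summarize xs by (mean, count) pairs over p equal-width regions.

    Equal-width regions are an assumption made for this illustration.
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / p
    sums, counts = [0.0] * p, [0] * p
    for x in xs:
        idx = min(int((x - lo) / width), p - 1) if width > 0 else 0
        sums[idx] += x
        counts[idx] += 1
    # Empty regions contribute nothing and are dropped.
    return [(s / n, n) for s, n in zip(sums, counts) if n > 0]


def approx_sum_log(pairs):
    """Approximate sum(log(x)) by sum(count * log(region mean))."""
    return sum(n * math.log(mean) for mean, n in pairs)


if __name__ == "__main__":
    xs = [1.0 + 0.01 * t for t in range(1000)]
    exact = sum(math.log(x) for x in xs)
    for p in (2, 8, 32):
        print(p, exact, approx_sum_log(bin_summary(xs, p)))
```

Once the $p$ pairs are available, evaluating the approximation costs $O(p)$ instead of $O(nd)$. With region means as representatives, the approximation can never fall below the exact sum, because the log function is concave.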


So the following part of equation (2-5):

$$\sum_{a=1}^{n} \sum_{j_2=1}^{d} -\log(\bar{w}_{a,j_2} + w_{i,j_2})$$

will be replaced by:

$$\sum_{a=1}^{n} \sum_{j_2=1}^{d} -\log(w_{i,j_2}) - \sum_{p=1}^{P} nx_p \log(\bar{x}_{j,p})$$

Section (2.1.11) will discuss in detail the process of computing the $p$ elements of $\bar{x}$ and $nx$. But it should be noted that in the POWER model, $F(m_{i,j})$ is approximated using a Beta distribution. This requires three computations of the conditional in order to fit values for each of the Beta parameters, the two shape parameters and one scale parameter. For this reason, we update each weight three times, obtaining three samples of each weight. The three Beta parameters can then be solved.

We will show in the next sections the steps of computing our approximate Log Likelihood ($LL^{app}_{i,j}$) for the first weight update and the following incremental computations for the next two updates of the first weight. Then, we will show the incremental $LL^{app}_{i,j}$ computations for the remaining weights and their two updates. Afterwards, we will show how we obtain these equations from equation (2-5) by rearranging terms and observing the difference between consecutive equations of $LL_{i,j}$ after weight updates.

2.1.4 The Initial $LL^{app}$ Computation for the First Weight $w_{i,1}$

First we will define new variables used in $LL^{app}$ for $w_{i,1}$. Let $CW = \langle cw_1, cw_2, \ldots, cw_d \rangle$ be a vector of positive integers that represent the number of times $g_{a,j} = i$ for attribute $j$. So $cw_1$ is the number of times $g_{a,1} = i$. Another variable used is $sumCW$, which is the sum of all $cw$ elements, $sumCW = \sum_{j=1}^{d} cw_j$. These variables are used to simplify the $LL^{app}$ equations and to help update the value of $LL^{app}$ between weight updates. The $CW$ vector computation requires scanning the $G$ matrix once; it is only computed once


in the initial $LL^{app}$ computation. The time to compute the vector is then $O(nd)$. The equation to compute $LL^{app}$ for $w_{i,1}$, which will be denoted by $LL^{app}_{i,1}$, is shown below:

$$LL^{app}_{i,1} = \sum_{j_2=1}^{d} cw_{j_2} \log(w_{i,j_2}) + \sum_{a=1}^{n} \sum_{j_2=1}^{d} \Big( \underbrace{\log(\bar{w}_{a,j_2})}_{g_{a,j_2} \neq i} - \log(w_{i,j_2}) \Big) - \sum_{p=1}^{P} nx_p \log(\bar{x}_{1,p})$$

From the computed $LL^{app}_{i,1}$, a first update of $w_{i,1}$ will be sampled. The first update of $w_{i,1}$ will be denoted by $\dot{w}_{i,1}$; this will be used in $\dot{LL}^{app}_{i,1}$, which is the $LL^{app}_{i,1}$ equation used to find the second update of $w_{i,1}$.

Time complexity for computing $LL^{app}_{i,1}$. To compute $LL^{app}_{i,1}$, the $CW$ vector needs to be computed first, which as mentioned will take $O(nd)$. Also, the $p$ elements of $\bar{x}$ and $nx$ need to be computed; this will take $O(ndp)$. Section (2.1.11) will describe the time for computing $\bar{x}$ and $nx$. The $\bar{x}$ and $nx$ elements are only computed once, in $LL^{app}_{i,1}$; in the other $LL^{app}_{i,\ast-1}$ computations the $p$ elements of $\bar{x}$ will only be updated, in constant time for each $\bar{x}$, making the total time of the $\bar{x}$ updates $O(p)$. This is handled in the other computations of $LL^{app}$. After computing all the values of the needed variables, computing the value of $LL^{app}_{i,1}$ requires $O(d)$ time for the first term. The second and third terms of the equation take $O(nd)$ time for computing the sum of $\log(\bar{W})$ over all of its elements. The final term, which represents our approximation, takes $O(p)$ time to compute. The total time for computing the variables needed and then all of the terms of $LL^{app}_{i,1}$ is:

$$\text{Total Time: } O(\underbrace{nd + ndp}_{\text{computing } CW \text{ and } \bar{x}} + \underbrace{d + nd + p}_{\text{computing } LL^{app}_{i,1}}) = O(ndp)$$
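The role of the $CW$ vector can be illustrated with a small sketch: one scan of the $G$ matrix yields the per-attribute counts, after which the first term of $LL^{app}_{i,1}$ needs only $O(d)$ work. The function names below are hypothetical and the example data is made up for illustration.

```python
import math


def count_dominant(G, i):
    """Scan G once (O(nd)) and count, per attribute, how often component i dominates."""
    cw = [0] * len(G[0])
    for row in G:
        for j, component in enumerate(row):
            if component == i:
                cw[j] += 1
    return cw


def first_term(cw, w_i):
    """O(d) evaluation of sum_j cw_j * log(w_{i,j})."""
    return sum(count * math.log(w) for count, w in zip(cw, w_i) if count > 0)


if __name__ == "__main__":
    G = [[0, 1], [0, 0], [1, 0]]      # n = 3 points, d = 2 attributes
    cw = count_dominant(G, 0)
    print(cw, sum(cw), first_term(cw, [0.5, 0.5]))
```

The counts depend only on $G$, which stays fixed while the weights of component $i$ are being sampled, so the scan does not need to be repeated between weight updates.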


$LL^{app}_{i,1}$ will take longer to compute than $LL_{i,1}$, which had a time of $O(nd)$. This is due to the overhead of computing the $p$ elements $\bar{x}$ and $nx$ of our approximation. Even though it takes longer to compute these elements in $LL^{app}_{i,1}$, they will allow the other computations $LL^{app}_{i,\ast-1}$ to take just $O(pd)$ time altogether, making the total time to update all the elements of $W_i$ $O(ndp)$.

2.1.5 $LL^{app}$ for the Second Update of $w_{i,1}$

The original POWER model updates each weight $w_{i,j}$ three times to find three estimates of the beta distribution in order to approximate $F(m_{i,j})$. This leads to somewhat different $LL^{app}$ equations for the second and third update of $w_{i,1}$ and for the second and third update of the other weights $w_{i,\ast-1}$. The $LL^{app}_{i,j}$ of any second update of $w_{i,j}$ will be denoted by $\dot{LL}^{app}_{i,j}$. $\dot{LL}^{app}_{i,1}$ uses $LL^{app}_{i,1}$ to compute its value. It will also use the same $CW$ and $sumCW$ that were computed in $LL^{app}_{i,1}$, and it uses $\dot{w}_{i,1}$, which is the first update of $w_{i,1}$ that was computed from $LL^{app}_{i,1}$. $\dot{UR}_1$, which is the Update Ratio between the first and second update, will also be used; this equals:

$$\dot{UR}_1 = \frac{1 - \dot{w}_{i,1}}{1 - w_{i,1}}$$

Modified values of the $p$ elements of $\bar{x}_1$ will be computed to find $\dot{LL}^{app}_{i,1}$; they will be denoted by $\dot{\bar{x}}_1$. Section (2.1.11) will describe computing those modified values. The term corresponding to the old values of $\bar{x}_1$ then needs to be removed from $LL^{app}_{i,1}$. This is done by adding the term $\sum_{p=1}^{P} nx_p \log(\bar{x}_{1,p})$, since this term was subtracted from $LL^{app}_{i,1}$. Finally, the same term with the modified values of $\bar{x}_1$ is incorporated into $\dot{LL}^{app}_{i,1}$ by adding their negative values. It should be noted that the $p$ elements of $nx$ will not be modified between weight updates. This will also be discussed in section (2.1.11). $\dot{LL}^{app}_{i,1}$ is shown below:


$$\begin{aligned}
\dot{LL}^{app}_{i,1} = {} & LL^{app}_{i,1} + cw_1 \log(\dot{w}_{i,1}) + (sumCW - cw_1) \log(\dot{UR}_1) - cw_1 \log(w_{i,1}) \\
& + n \log(\dot{w}_{i,1}) + (nd - n) \log(\dot{UR}_1) - n \log(w_{i,1}) \\
& + \sum_{p=1}^{P} nx_p \log(\bar{x}_{1,p}) - \sum_{p=1}^{P} nx_p \log(\dot{\bar{x}}_{1,p})
\end{aligned}$$

A second update to $w_{i,1}$ will be sampled from computing $\dot{LL}^{app}_{i,1}$; this will be denoted by $\ddot{w}_{i,1}$. This value will be used similarly in the next update to $w_{i,1}$. The corresponding $LL^{app}$ will be denoted by $\ddot{LL}^{app}_{i,1}$.

Time complexity for computing $\dot{LL}^{app}_{i,1}$. Computing the $p$ elements of $\dot{\bar{x}}_1$ will take $O(p)$ time altogether, as will be shown in section (2.1.11). All the other variables in $\dot{LL}^{app}_{i,1}$ are already available, making the time to compute $\dot{LL}^{app}_{i,1}$ $O(p)$. The total time for computing the variables needed and then all of the terms of $\dot{LL}^{app}_{i,1}$ is:

$$\text{Total Time: } O(\underbrace{p}_{\text{computing the } p \text{ elements of } \dot{\bar{x}}_1} + \underbrace{p}_{\text{computing } \dot{LL}^{app}_{i,1}}) = O(p)$$

2.1.6 $LL^{app}$ for the Third Update of $w_{i,1}$

The third and last $LL^{app}_{i,1}$ equation for $w_{i,1}$, denoted by $\ddot{LL}^{app}_{i,1}$, is identical to $\dot{LL}^{app}_{i,1}$ but with $\dot{w}_{i,1}$ replacing $w_{i,1}$, and $\ddot{w}_{i,1}$ (the second update of $w_{i,1}$ that was computed in $\dot{LL}^{app}_{i,1}$) replacing $\dot{w}_{i,1}$. Also, there is the update ratio between the third and second update of $w_{i,1}$, denoted by $\ddot{UR}_1$, which has the value:

$$\ddot{UR}_1 = \frac{1 - \ddot{w}_{i,1}}{1 - \dot{w}_{i,1}}$$

Finally, just like in computing $\dot{LL}^{app}_{i,1}$, there will be new modified $p$ elements of $\bar{x}_1$, denoted by $\ddot{\bar{x}}_1$; the corresponding term will be added to $\ddot{LL}^{app}_{i,1}$. The term


corresponding to the previous $p$ elements $\dot{\bar{x}}_1$ will be removed from $\ddot{LL}^{app}_{i,1}$. The equation is shown below:

$$\begin{aligned}
\ddot{LL}^{app}_{i,1} = {} & \dot{LL}^{app}_{i,1} + cw_1 \log(\ddot{w}_{i,1}) + (sumCW - cw_1) \log(\ddot{UR}_1) - cw_1 \log(\dot{w}_{i,1}) \\
& + n \log(\ddot{w}_{i,1}) + (nd - n) \log(\ddot{UR}_1) - n \log(\dot{w}_{i,1}) \\
& + \sum_{p=1}^{P} nx_p \log(\dot{\bar{x}}_{1,p}) - \sum_{p=1}^{P} nx_p \log(\ddot{\bar{x}}_{1,p})
\end{aligned}$$

A third and final update for $w_{i,1}$ is then sampled after computing the value of $\ddot{LL}^{app}_{i,1}$, which will be denoted by $\dddot{w}_{i,1}$. $\dddot{w}_{i,1}$ is used to find the third and final estimate for the beta distribution, which in turn gives a final sample of $w_{i,1}$. This is denoted by $\hat{w}_{i,1}$. The update ratio ($UR_1$) from section (2.1.2) can now be computed. All $UR_j$ values will be computed and stored after sampling each weight in $W_i$. $\prod_{t=1}^{d} UR_t$ will also be computed incrementally as we sample all the weights in $W_i$, in order to compute the $W_i^{(d)}$ vector as shown in section (2.1.2).

Time complexity for computing $\ddot{LL}^{app}_{i,1}$. Computing the $p$ elements of $\ddot{\bar{x}}_1$ will take $O(p)$ time altogether, the same as computing the $p$ elements of $\dot{\bar{x}}_1$. Since $\ddot{LL}^{app}_{i,1}$ is just like $\dot{LL}^{app}_{i,1}$, it will take the same total time for the computation, which is $O(p)$. The total time to find $\hat{w}_{i,1}$ after the three updates to $w_{i,1}$ is then:

$$O(ndp + p + p) = O(ndp)$$

which is larger than $O(nd)$, the time to find $\hat{w}_{i,1}$ in the original POWER model. This extra time is needed to compute the $p$ elements of $\bar{x}$ and $nx$ in the initial computation of $LL^{app}_{i,1}$, which will help in greatly reducing the time for all the other updates of $w_{i,\ast-1}$. The next two sections will describe the $LL^{app}$ for all the other weights in $W_i$.


2.1.7 $LL^{app}$ for the First Update of the Remainder Weights $w_{i,\ast-1}$

For all of the other remaining computations of $LL^{app}_{i,j}$ for $w_{i,j}$, the value of $LL^{app}_{i,j-1}$ is going to be used to compute $LL^{app}_{i,j}$ in less time than the original POWER model, $LL^{app}_{i,j-1}$ being the first-update $LL^{app}$ of the previous weight $w_{i,j-1}$ in the vector. The new sample of the previous weight, $\hat{w}_{i,j-1}$, is used in $LL^{app}_{i,j}$. Also, $UR_{j-1}$, which is defined in section (2.1.2) and was computed for the previous weight, is used in $LL^{app}_{i,j}$. The product of all $UR$s up to $UR_{j-1}$ is used; this can be computed incrementally after each $UR$ is found. $LL^{app}_{i,j}$ also has the $p$ elements $\bar{x}$ for both the previous weight, denoted by $\bar{x}_{j-1}$, and the current weight, $\bar{x}_j$. Section (2.1.11) will discuss computing the $p$ elements of $\bar{x}_j$ from $\bar{x}_{j-1}$. The $cw$ and $sumCW$ values are the same ones used for all the weight $w_{i,j}$ updates of $W_i$. $LL^{app}_{i,j}$ is shown below:

$$\begin{aligned}
LL^{app}_{i,j} = {} & LL^{app}_{i,j-1} + cw_{j-1} \log(\hat{w}_{i,j-1}) + (sumCW - cw_{j-1}) \log(UR_{j-1}) - cw_{j-1} \log\Big(\prod_{t=1}^{j-2} UR_t\, w_{i,j-1}\Big) \\
& + n \log(\hat{w}_{i,j-1}) + (nd - n) \log(UR_{j-1}) - n \log\Big(\prod_{t=1}^{j-2} UR_t\, w_{i,j-1}\Big) \\
& + \sum_{p=1}^{P} nx_p \log(\bar{x}_{j-1,p}) - \sum_{p=1}^{P} nx_p \log(\bar{x}_{j,p})
\end{aligned}$$

After computing $LL^{app}_{i,j}$, the first update of $w_{i,j}$ will be sampled, denoted by $\dot{w}_{i,j}$. $\dot{UR}_j$ will also be computed, to be used in the next update of $w_{i,j}$; it will equal:

$$\dot{UR}_j = \frac{1 - \dot{w}_{i,j}}{1 - w_{i,j} \prod_{t=1}^{j-1} UR_t}$$

Note that the product of $UR$s does not take $j - 1$ time to compute. After each new weight $w_{i,j}$ sample in the vector $W_i$, a corresponding $UR_j$ will also be computed. So the product of $UR$s can be computed incrementally. For example, $UR_1$ is available after finding $\hat{w}_{i,1}$; from it, $\prod_{t=1}^{2} UR_t$ can be computed by multiplying $UR_1$ by $UR_2$, which is computed after finding $\hat{w}_{i,2}$. In general, $\prod_{t=1}^{j-1} UR_t$ can be computed in one step by


multiplying $UR_{j-1}$ by $\prod_{t=1}^{j-2} UR_t$, which was available from the $LL^{app}$ computation of the previous weight $w_{i,j-1}$.

Time complexity for computing the first update of $LL^{app}_{i,\ast-1}$. As in previous $LL^{app}$ computations, computing the modified $p$ elements of $\bar{x}_j$ from $\bar{x}_{j-1}$, which are already available, will take $O(p)$ time altogether; this will be discussed in section (2.1.11). Since all the other variables in $LL^{app}_{i,j}$ are already available, it will take constant time to compute the rest of the terms. The total time to compute $LL^{app}_{i,j}$ then becomes $O(p)$.

2.1.8 $LL^{app}$ for the Second and Third Update of the Remainder Weights $w_{i,\ast-1}$

The $LL^{app}$ for the second and third update of $w_{i,j}$ are exactly the same and are denoted by $\dot{LL}^{app}_{i,j}$; they both use the $LL^{app}_{i,j}$ value from the first update of $w_{i,j}$. The only value that differs is $\dot{w}_{i,j}$, which represents the previous update of $w_{i,j}$: if this is the $\dot{LL}^{app}_{i,j}$ of the second update, $\dot{w}_{i,j}$ represents the first update of $w_{i,j}$. On the other hand, if this is the $\dot{LL}^{app}_{i,j}$ of the third update, $\dot{w}_{i,j}$ represents the second update of $w_{i,j}$. The same goes for the $UR$ and the $p$ elements $\bar{x}$ used in $\dot{LL}^{app}_{i,j}$; they are denoted $\dot{UR}_j$ and $\dot{\bar{x}}_j$ respectively, and both correspond to the update that $\dot{w}_{i,j}$ represents. The equation for $\dot{UR}_j$ was shown in the previous section, and getting the $p$ elements of $\dot{\bar{x}}_j$ from $\bar{x}_j$ will be discussed in section (2.1.11). $\dot{LL}^{app}_{i,j}$ is shown below:

$$\begin{aligned}
\dot{LL}^{app}_{i,j} = {} & LL^{app}_{i,j} + cw_j \log(\dot{w}_{i,j}) + (sumCW - cw_j) \log(\dot{UR}_j) - cw_j \log\Big(\prod_{t=1}^{j-1} UR_t\, w_{i,j}\Big) \\
& + n \log(\dot{w}_{i,j}) + (nd - n) \log(\dot{UR}_j) - n \log\Big(\prod_{t=1}^{j-1} UR_t\, w_{i,j}\Big) \\
& + \sum_{p=1}^{P} nx_p \log(\bar{x}_{j,p}) - \sum_{p=1}^{P} nx_p \log(\dot{\bar{x}}_{j,p})
\end{aligned}$$


After computing $\dot{LL}^{app}_{i,j}$, the second or third update of $w_{i,j}$ will be computed, depending on which update $\dot{LL}^{app}_{i,j}$ represents. The second and third update of $w_{i,j}$ are both represented by $\dot{w}_{i,j}$. The second update of $w_{i,j}$ will be used for the $\dot{LL}^{app}_{i,j}$ of the third update of $w_{i,j}$. The third update of $w_{i,j}$, just like the third update of $w_{i,1}$, will be used to find the third and final estimate for the beta distribution, which in turn gives a final sample of $w_{i,j}$, denoted by $\hat{w}_{i,j}$. $UR_j$ can now be computed from $\hat{w}_{i,j}$ and $\prod_{t=1}^{j-1} UR_t$, which was already available.

Time complexity for computing the second and third update of $LL^{app}_{i,\ast-1}$. Just like the second and third update of $w_{i,1}$, $\dot{LL}^{app}_{i,j}$ will be computed in $O(p)$ time. Then the total time to find the $d$ values of $\hat{w}_{i,j}$ is:

$$O(\underbrace{ndp}_{\text{for } w_{i,1}} + \underbrace{3pd}_{\text{for } w_{i,\ast-1}}) = O(ndp)$$

We still need to compute $W_i^{(d)}$ as shown in section (2.1.2); this is achieved after having the $d$ values of $\hat{w}_{i,j}$, the $d$ values of $UR$ and $\prod_{t=1}^{d} UR_t$ (the last one now being fully computed, since it was computed incrementally). $W_i^{(d)}$ can be computed in $O(d)$ by first computing its first element $\hat{w}_{i,1} \prod_{t=2}^{d} UR_t$ from the values we have, as $\hat{w}_{i,1} \frac{1}{UR_1} \prod_{t=1}^{d} UR_t$. Having already computed the quantity $\frac{1}{UR_1} \prod_{t=1}^{d} UR_t$, this will be used to compute the second element of $W_i^{(d)}$, which is $\hat{w}_{i,2} \prod_{t=3}^{d} UR_t$, by multiplying $\frac{1}{UR_1} \prod_{t=1}^{d} UR_t$ by $\frac{\hat{w}_{i,2}}{UR_2}$. The same process is repeated for all $d$ elements, resulting in $W_i^{(d)}$ in $O(d)$. $O(ndp)$ dominates this term, making the total time to find $W_i^{(d)}$ $O(ndp)$, and for $k$ components the time is $O(nkdp)$, which is a big improvement over the original POWER time of $O(nkd^2)$ since $p$ is chosen to be much smaller than $d$.

Now that we have shown our $LL^{app}$ equations and how to use them to compute $W_i^{(d)}$, we will show how the equations were constructed from $LL$. We will divide equation


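(2-5) into three parts to make it easier to show how they were transformed into our $LL^{app}$ equations. The equation below shows equation (2-5) again and its three parts:

$$LL_{i,j} = \sum_{a=1}^{n} \sum_{j_2=1}^{d} \Big( \underbrace{\underbrace{\log(w_{i,j_2})}_{g_{a,j_2} = i}}_{A_{i,j}} + \underbrace{\underbrace{\log(\bar{w}_{a,j_2})}_{g_{a,j_2} \neq i}}_{B_{i,j}} \Big) - \underbrace{\sum_{a=1}^{n} \sum_{j_2=1}^{d} \log(\bar{w}_{a,j_2} + w_{i,j_2})}_{C_{i,j}}$$ (2-6)

The next three sections will discuss getting from equation (2-6) to the $LL^{app}$ equations.

2.1.9 Handling $A_{i,j}$

This section will show how to use the first term of $LL_{i,j}$, which is denoted by $A_{i,j}$, to get the different $LL^{app}$ equations. First we will start with:

$$A_{i,j} = \sum_{a=1}^{n} \sum_{j_2=1}^{d} \underbrace{\log(w_{i,j_2})}_{g_{a,j_2} = i}$$

The first summation can be removed by scanning each column in the $n \times d$ matrix $G$ and counting the number of times the elements of the column equal $i$. We will end up with a count for each column. Those $d$ counts are the elements of $CW$ that we defined earlier. The value of $A_{i,j}$ is the sum of the counts multiplied by their corresponding $\log(w_{i,j})$. $A_{i,j}$ then becomes:

$$A_{i,j} = \sum_{j_2=1}^{d} cw_{j_2} \log(w_{i,j_2})$$

We will start with $A_{i,1}$ and show how its value changes after sampling each weight in $W_i$. With $W_i$:

$$W_i = \langle w_{i,1}, w_{i,2}, \ldots, w_{i,d} \rangle$$

$A_{i,1}$ is: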


$$A_{i,1} = \sum_{j_2=1}^{d} cw_{j_2} \log(w_{i,j_2})$$

After sampling $w_{i,1}$ and getting $W_i^{(1)}$, which is:

$$W_i^{(1)} = \langle \hat{w}_{i,1}, UR_1 w_{i,2}, \ldots, UR_1 w_{i,d} \rangle$$

where $UR_1 = \frac{1 - \hat{w}_{i,1}}{1 - w_{i,1}}$. To compute $LL_{i,2}$, the new weight vector $W_i^{(1)}$ is used in $A_{i,2}$, which becomes:

$$A_{i,2} = cw_1 \log(\hat{w}_{i,1}) + cw_2 \log(UR_1 w_{i,2}) + cw_3 \log(UR_1 w_{i,3}) + \cdots + cw_d \log(UR_1 w_{i,d})$$

which is also:

$$A_{i,2} = cw_1 \log(\hat{w}_{i,1}) + \sum_{j_2=2}^{d} cw_{j_2} \log(UR_1 w_{i,j_2})$$

By rearranging the terms, $A_{i,2}$ becomes:

$$A_{i,2} = cw_1 \log(\hat{w}_{i,1}) + \sum_{j_2=2}^{d} cw_{j_2} \log(UR_1) + \sum_{j_2=2}^{d} cw_{j_2} \log(w_{i,j_2})$$

Taking $\log(UR_1)$ outside of the sum, the sum of the $CW$ elements without $cw_1$ is $sumCW - cw_1$, where $sumCW$ is the sum of all $CW$ elements, which we already defined and which is computed along with $CW$. The third term of $A_{i,2}$ is basically $A_{i,1}$ without $cw_1 \log(w_{i,1})$. $A_{i,2}$ then becomes:


$$A_{i,2} = cw_1 \log(\hat{w}_{i,1}) + (sumCW - cw_1) \log(UR_1) + A_{i,1} - cw_1 \log(w_{i,1})$$

We see how $A_{i,1}$ can be used to get $A_{i,2}$ without the need for a complete recomputation. Now we look at $A_{i,3}$ and at the difference between it and $A_{i,2}$ to get a general idea of the difference between any two consecutive $A_{i,j}$ and $A_{i,j+1}$. After computing $LL_{i,2}$ and getting $W_i^{(2)}$:

$$W_i^{(2)} = \langle UR_2 \hat{w}_{i,1}, \hat{w}_{i,2}, \ldots, UR_2 UR_1 w_{i,d} \rangle$$

where $UR_2 = \frac{1 - \hat{w}_{i,2}}{1 - UR_1 w_{i,2}}$. $A_{i,3}$ using $W_i^{(2)}$ is:

$$A_{i,3} = cw_1 \log(UR_2 \hat{w}_{i,1}) + cw_2 \log(\hat{w}_{i,2}) + cw_3 \log(UR_2 UR_1 w_{i,3}) + \cdots + cw_d \log(UR_2 UR_1 w_{i,d})$$

By analyzing the difference between $A_{i,3}$ and $A_{i,2}$, we can find $A_{i,3}$ from $A_{i,2}$ by rearranging the terms of the previous $A_{i,3}$. $A_{i,3}$ then becomes:

$$A_{i,3} = A_{i,2} + cw_2 \log(\hat{w}_{i,2}) - cw_2 \log(UR_1 w_{i,2}) + (sumCW - cw_2) \log(UR_2)$$

A general equation for $A_{i,j}$ can now be derived from the structure of the difference between any two consecutive $A_{i,j}$ and $A_{i,j+1}$, which is:

$$A_{i,j+1} = A_{i,j} + cw_j \log(\hat{w}_{i,j}) - cw_j \log\Big(\prod_{t=1}^{j-1} UR_t\, w_{i,j}\Big) + (sumCW - cw_j) \log(UR_j)$$
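The incremental update of the $A$ term can be checked numerically with a short sketch. It updates the weights one at a time, rescales the remaining weights by the update ratio so that the vector keeps summing to one, and compares the constant-time update of $A$ with a full recomputation. The new weight values here are simply given as inputs, whereas in the model they would come from the sampler. All names are hypothetical.

```python
import math


def a_term(cw, w):
    """Full O(d) evaluation of sum_j cw_j * log(w_j)."""
    return sum(c * math.log(x) for c, x in zip(cw, w))


def sweep(cw, w, new_values):
    """Update weights in order, tracking A incrementally in O(1) per update."""
    w = list(w)
    sum_cw = sum(cw)
    a = a_term(cw, w)
    history = []
    for j, w_new in enumerate(new_values):
        # w[j] already carries the update ratios of all earlier updates.
        ur = (1.0 - w_new) / (1.0 - w[j])
        a += (cw[j] * math.log(w_new)
              - cw[j] * math.log(w[j])
              + (sum_cw - cw[j]) * math.log(ur))
        # Set the updated weight and rescale all the others by the ratio.
        w = [w_new if t == j else ur * x for t, x in enumerate(w)]
        history.append((a, a_term(cw, w), sum(w)))
    return w, history


if __name__ == "__main__":
    final_w, history = sweep([3, 1, 2], [0.2, 0.3, 0.5], [0.4, 0.1, 0.25])
    for incremental, full, total in history:
        print(incremental, full, total)
```

Each row of the printed history should show the incremental and fully recomputed values agreeing up to floating-point error, with the weights still summing to one.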

Note that the $UR$ product is computed incrementally after each $UR$ is found. We have shown how part of the $LL_{i,j+1}$ equation can be found from $LL_{i,j}$ in constant time. The next section discusses the part $B_{i,j}$.

2.1.10 Handling $B_{i,j}$

The $B_{i,j}$ part of $LL_{i,j}$ is the sum of the logarithms of the elements of the matrix $\bar{W}$ for which $g_{a,j_2} \neq i$. Since the elements of $\bar{W}$ represent the sum of the weights from the components other than $i$, their values do not change after sampling the weights in $W_i$. The matrix $G$ also does not change between weight samples, making the quantity $B_{i,j}$ constant while sampling the elements of $W_i$. So $B_{i,1} = B_{i,2} = \cdots = B_{i,d}$. $B_{i,j}$ is therefore computed only once, in the $LL_{i,1}$ computation.

2.1.11 Handling $C_{i,j}$

We have seen in the previous sections how the $A_{i,j}$ and $B_{i,j}$ parts of $LL_{i,j}$ can be computed just once in $O(nd)$ time when $j = 1$; their values are then reused to get all of the other values $A_{i,\star-1}$ and $B_{i,\star-1}$ in constant time for each attribute $j$. For $C_{i,j}$, which is:

\[ C_{i,j} = -\Big(\sum_{a=1}^{n}\sum_{j_2=1}^{d}\log(\bar{w}_{a,j_2}+w_{i,j_2})\Big) \]

the same method cannot be used to get $C_{i,j}$ from $C_{i,j-1}$ in a faster time, due to the sum inside the log. After each weight update, all $w_{i,j_2}$ values change but the $\bar{w}_{a,j_2}$ values stay the same, and since they are summed inside the log, a constant-time update cannot be achieved. So a complete $O(nd)$ recomputation must be done for each attribute $j$, leading to $O(nd^2)$ to compute $LL$ for the whole weight vector $W_i$. We devise a novel approximation for $C_{i,j}$ that allows computing $C_{i,j}$ from $C_{i,j-1}$ in a faster time. We start with $C_{i,1}$. By factoring out $w_{i,j_2}$ and taking it out of the log, $C_{i,1}$ becomes:
\[ C_{i,1} = -\Big(\sum_{a=1}^{n}\sum_{j_2=1}^{d}\Big(\log\Big(1+\frac{\bar{w}_{a,j_2}}{w_{i,j_2}}\Big)+\log(w_{i,j_2})\Big)\Big) \]

We denote $1+\frac{\bar{w}_{a,j_2}}{w_{i,j_2}}$ by $x_{a,j_2}$, so we have $(nd)$ values of $x_{a,j_2}$, making $C_{i,1}$:

\[ C_{i,1} = -\Big(\sum_{a=1}^{n}\sum_{j_2=1}^{d}\big(\log(x_{a,j_2})+\log(w_{i,j_2})\big)\Big) \]

The first term of the summation is denoted by $D_{i,1}$ and the second term by $E_{i,1}$, as shown below:

\[ C_{i,1} = -\Big(\sum_{a=1}^{n}\sum_{j_2=1}^{d}\big(\underbrace{\log(x_{a,j_2})}_{D_{i,1}}+\underbrace{\log(w_{i,j_2})}_{E_{i,1}}\big)\Big) \]

2.1.12 Handling $D_{i,1}$

With $D_{i,1}$ being:

\[ D_{i,1} = -\Big(\sum_{a=1}^{n}\sum_{j_2=1}^{d}\log(x_{a,j_2})\Big) \]

Our method approximates the $(nd)$ values of $x_{a,j_2}$ in $D_{i,1}$ by $p$ values of $x_{a,j_2}$, where $p$ is a chosen power-of-two value much smaller than $d$; those $p$ values can then be easily handled and updated. The $p$ values are chosen such that each of them approximates a portion of the $(nd)$ values of $x_{a,j_2}$. This is done by partitioning the range of values of $x_{a,j_2}$ into $p$ regions and selecting just one value of $x$ in each region. That $x$ is the average value of $x_{a,j_2}$ in that region; we denote the average value of $x$ for $D_{i,j}$ and region $p$ by $\bar{x}_{j,p}$. The reason $\bar{x}$ was chosen as the representative value of $x$ in each region is that almost half of the values of $x_{a,j_2}$ in the region will be greater than $\bar{x}$ (making $\bar{x}$ an underestimate of these values), and almost half of the values of $x_{a,j_2}$ will be lower than $\bar{x}$ (making $\bar{x}$ an overestimate of these values). Since also the log of
all of the $x_{a,j_2}$ in a region are summed and the logarithm function is always increasing, replacing each $x_{a,j_2}$ in the summation with $\bar{x}$ makes $\log(\bar{x})$ an underestimate of almost half of the values of $\log(x_{a,j_2})$ and an overestimate of the remaining half; when those underestimates and overestimates are summed they even out, resulting in a good approximation of the original sum. This was also achievable because the range of values of $x_{a,j_2}$ is small, so the regions partitioning that range are very small, making the $\log(x_{a,j_2})$ values in each region very similar. With $\log(\bar{x})$ replacing every $\log(x_{a,j_2})$ in each region, the summation is reduced to the number of elements of $x_{a,j_2}$ in the region, denoted by $n_x$, multiplied by $\log(\bar{x})$. $D_{i,1}$ then becomes:

\[ D_{i,1} = -\Big(\sum_{p=1}^{P} n_{x_p}\log(\bar{x}_{1,p})\Big) \]

Next we describe our method of partitioning the range of values of $x_{a,j_2}$ into $p$ regions.

2.1.12.1 Partitioning the Range of $x_{a,j_2}$

Due to the non-uniform scatter of the values of $x_{a,j_2}$ in the range, if the range of values of $x_{a,j_2}$ were partitioned into equal areas there would be an unequal number of values of $x_{a,j_2}$ in each area, and some areas might even be empty. Since we are interested in the average of $x_{a,j_2}$ in each area, we devise a partitioning method based on that average. The method starts by computing the average of all $(nd)$ values of $x_{a,j_2}$. If the method stopped here we would have $\bar{x}_{1,1}$ to approximate the $(nd)$ values of $x_{a,j_2}$ in one area indexed by 1. Figure 2-2 shows this step. $\bar{x}_{1,1}$ becomes the first partitioning point of the range, creating two areas. Then, an average is computed for all $x_{a,j_2}$ values in the first area, and another average is computed for all $x_{a,j_2}$ values in the second area. Now we have $\bar{x}_{1,1}$ and $\bar{x}_{1,2}$ each
approximating the values of $x_{a,j_2}$ falling in their respective areas. Figure 2-3 shows this step.

Figure 2-2. Axis showing the range of values of $x_{a,j_2}$ and the location of its average

Figure 2-3. Partitioning the range of values of $x_{a,j_2}$ into two areas, each with the average of its values

This can be recursively repeated with $\bar{x}_{1,1}$ and $\bar{x}_{1,2}$ becoming the partitioning points that divide their areas. Each new step results in double the areas of the previous step. This is why $p$ needs to be a power of two. The computation time for this partitioning method is $O(ndp)$, due to computing the average of $(nd)$ elements for each partition. After getting a new sample for each weight in $W_i$, all the values of $x_{a,j_2}$ change, which in turn changes the values of $\bar{x}_{1,\star}$. Next we discuss computing all of the remaining $D_{i,\star-1}$ from $D_{i,1}$ by updating the values of $\bar{x}_{1,\star}$.

2.1.12.2 Updating $D_{i,\star-1}$

After having $D_{i,1}$, to compute $D_{i,\star-1}$ the $p$ values of $\bar{x}_{1,\star}$ need to be updated due to the change in the values of $x_{a,j_2}$ from getting a new sample of a weight $w_{i,j_2}$ in $W_i$. We show a way of updating the $p$ values of $\bar{x}_{1,\star}$ in constant time per value, resulting in $O(p)$ for all $p$ values.
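
The recursive mean-splitting partition of Section 2.1.12.1 and the resulting approximation of the sum of logarithms can be sketched as follows. This is an illustrative reading of the description rather than the original C code: the function names are hypothetical, values equal to a region's mean are assigned to the lower area (the text does not say which side they go to), and empty areas are simply dropped. Note that $D_{i,1}$ is the negative of the sum computed here.

```python
import math

def partition_by_means(values, p):
    """Split values into at most p regions by recursively cutting at the region mean."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    regions = [list(values)]
    while len(regions) < p:
        split = []
        for region in regions:
            if not region:
                split.extend([[], []])
                continue
            mean = sum(region) / len(region)
            split.append([v for v in region if v <= mean])  # ties go to the lower area
            split.append([v for v in region if v > mean])
        regions = split
    return [region for region in regions if region]

def approx_log_sum(values, p):
    """Approximate sum(log(v)) by sum over regions of n_region * log(region mean)."""
    total = 0.0
    for region in partition_by_means(values, p):
        mean = sum(region) / len(region)
        total += len(region) * math.log(mean)
    return total

# Hypothetical x values in a narrow range above 1, as x = 1 + wbar / w always is.
values = [1.0 + k / 100.0 for k in range(1, 201)]
exact = sum(math.log(v) for v in values)
approx = approx_log_sum(values, 16)
```

Because the logarithm is concave, each region's term is never smaller than the exact sum over that region, so the approximation slightly overestimates the sum of logs; the gap shrinks as the regions get narrower.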

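We show how the update occurs for $\bar{x}_{1,1}$; the $\bar{x}_{1,\star-1}$ of the other areas follow the same process. Recall that $\bar{x}_{1,1}$ is:

\[ \bar{x}_{1,1} = \frac{1}{n_{x_1}}\underbrace{\sum_{a=1}^{n}\sum_{j_2=1}^{d} x_{a,j_2}}_{x_{a,j_2}\in\,\text{area 1}} \]

By replacing $x_{a,j_2}$ with what it represents:

\[ \bar{x}_{1,1} = \frac{1}{n_{x_1}}\underbrace{\sum_{a=1}^{n}\sum_{j_2=1}^{d}\Big(1+\frac{\bar{w}_{a,j_2}}{w_{i,j_2}}\Big)}_{x_{a,j_2}\in\,\text{area 1}} \]

Since the sum of the 1s in the summation under the condition $x_{a,j_2} \in$ area 1 is $n_{x_1}$, taking out that term gives:

\[ \bar{x}_{1,1} = 1+\frac{1}{n_{x_1}}\underbrace{\sum_{a=1}^{n}\sum_{j_2=1}^{d}\frac{\bar{w}_{a,j_2}}{w_{i,j_2}}}_{x_{a,j_2}\in\,\text{area 1}} \]

By expanding the inner summation and breaking up the outer summation, $\bar{x}_{1,1}$ becomes:

\[ \bar{x}_{1,1} = 1+\frac{1}{n_{x_1}}\Big(\underbrace{\sum_{a=1}^{n}\frac{\bar{w}_{a,1}}{w_{i,1}}+\sum_{a=1}^{n}\frac{\bar{w}_{a,2}}{w_{i,2}}+\cdots+\sum_{a=1}^{n}\frac{\bar{w}_{a,d}}{w_{i,d}}}_{x_{a,j_2}\in\,\text{area 1}}\Big) \]

This is $\bar{x}_{1,1}$ for $D_{i,1}$, which corresponds to $W_i$. After $w_{i,1}$ is sampled (with new value $w'_{i,1}$) and we get $W^{(1)}_i$, $\bar{x}_{1,1}$, which is now denoted by $\bar{x}_{2,1}$, should become:

\[ \bar{x}_{2,1} = 1+\frac{1}{n_{x_1}}\Big(\underbrace{\sum_{a=1}^{n}\frac{\bar{w}_{a,1}}{w'_{i,1}}+\sum_{a=1}^{n}\frac{\bar{w}_{a,2}}{UR_1 w_{i,2}}+\cdots+\sum_{a=1}^{n}\frac{\bar{w}_{a,d}}{UR_1 w_{i,d}}}_{x_{a,j_2}\in\,\text{area 1}}\Big) \]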
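
The relationship between the two means above can be checked on a toy region. The sketch below is a hedged illustration with hypothetical names and data: it assumes the membership of the area stays fixed while the weight changes, covers only the first resampled weight, and keeps one stored term per attribute (the sums inside the parentheses, each divided by the number of elements in the area).

```python
def region_mean(cells, w):
    """Mean of x = 1 + wbar / w[j] over the (j, wbar) cells of one area."""
    return sum(1.0 + wbar / w[j] for j, wbar in cells) / len(cells)

def per_attribute_terms(cells, w):
    """terms[j] = sum of wbar / (n * w[j]) over the cells of attribute j in the area."""
    n = len(cells)
    terms = [0.0] * len(w)
    for j, wbar in cells:
        terms[j] += wbar / (n * w[j])
    return terms

def update_mean(xbar, terms, j, old_value, new_value, ur):
    """Constant-time update of the area mean after the first weight j is resampled."""
    rest = (xbar - (1.0 + terms[j])) / ur   # terms of the other attributes, rescaled by 1/UR
    terms[j] *= old_value / new_value       # term of the resampled attribute
    return rest + 1.0 + terms[j]

# Hypothetical area with four cells given as (attribute index, wbar value).
cells = [(0, 0.4), (1, 0.1), (2, 0.7), (0, 0.9)]
w = [0.2, 0.3, 0.5]
xbar_before = region_mean(cells, w)
terms = per_attribute_terms(cells, w)

new_value = 0.4
ur = (1.0 - new_value) / (1.0 - w[0])
xbar_after = update_mean(xbar_before, terms, 0, w[0], new_value, ur)
w_after = [new_value] + [ur * x for x in w[1:]]
```

Recomputing the mean directly with `region_mean(cells, w_after)` should give the same value as `xbar_after`, without touching the individual cells.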

$\bar{x}_{1,1}$ can be updated to $\bar{x}_{2,1}$ in constant time by storing the value of $\bar{x}_{1,1}$ and the $d$ terms that compose $\bar{x}_{1,1}$, neglecting the term containing 1. Those $d$ terms are stored in a vector called $ALLAVES$:

\[ ALLAVES = \Big\langle \underbrace{\sum_{a=1}^{n}\frac{\bar{w}_{a,1}}{n_{x_1}w_{i,1}}}_{x_{a,j_2}\in\,\text{area 1}},\ \underbrace{\sum_{a=1}^{n}\frac{\bar{w}_{a,2}}{n_{x_1}w_{i,2}}}_{x_{a,j_2}\in\,\text{area 1}},\ \ldots,\ \underbrace{\sum_{a=1}^{n}\frac{\bar{w}_{a,d}}{n_{x_1}w_{i,d}}}_{x_{a,j_2}\in\,\text{area 1}} \Big\rangle \]

We can get $\bar{x}_{2,1}$ from $\bar{x}_{1,1}$ by using $ALLAVES$ with the following steps, where $w'_{i,1}$ is the newly sampled value of $w_{i,1}$:

1. $\bar{x}_{2,1} = \bar{x}_{1,1} - (1 + ALLAVES_1)$
2. $\bar{x}_{2,1} = \frac{1}{UR_1}\bar{x}_{2,1}$
3. $ALLAVES_1 = \frac{w_{i,1}}{w'_{i,1}} ALLAVES_1$
4. $\bar{x}_{2,1} = \bar{x}_{2,1} + 1 + ALLAVES_1$

We got $\bar{x}_{2,1}$ from $\bar{x}_{1,1}$ in just 4 steps. The same steps are applied to $\bar{x}_{1,\star-1}$, resulting in a time of $O(p)$. By observing the changes that occur between any consecutive $\bar{x}_{j-1,\star}$ and $\bar{x}_{j,\star}$, we get the general steps for getting $\bar{x}_{j,\star}$ from $\bar{x}_{j-1,\star}$:

1. $\bar{x}_{j,\star} = \bar{x}_{j-1,\star} - (1 + ALLAVES_{j-1})$
2. $\bar{x}_{j,\star} = \frac{1}{UR_{j-1}}\bar{x}_{j,\star}$
3. $ALLAVES_{j-1} = \frac{\prod_{t=1}^{j-2} UR_t\, w_{i,j-1}}{w'_{i,j-1}} ALLAVES_{j-1}$
4. $\bar{x}_{j,\star} = \bar{x}_{j,\star} + 1 + ALLAVES_{j-1}$

$\prod_{t=1}^{j-2} UR_t$, as mentioned earlier, is computed incrementally, so it takes $O(1)$ to compute for any $\bar{x}_{j,\star}$. For the case of the same weight $w_{i,j}$ getting a second or third update, where we want to get $\dot{\bar{x}}_{j,\star}$ from $\bar{x}_{j,\star}$, there is a slight difference in the steps, which we show below:

1. $\dot{\bar{x}}_{j,\star} = \bar{x}_{j,\star} - (1 + ALLAVES_j)$
2. $\dot{\bar{x}}_{j,\star} = \frac{1}{UR_j}\dot{\bar{x}}_{j,\star}$
3. $ALLAVES_j = \frac{\prod_{t=1}^{j-1} UR_t\, w_{i,j}}{w'_{i,j}} ALLAVES_j$
4. $\dot{\bar{x}}_{j,\star} = \dot{\bar{x}}_{j,\star} + 1 + ALLAVES_j$

2.1.13 Handling $E_{i,1}$

$E_{i,1}$ is similar to $A_{i,1}$ except that the sum in $E_{i,1}$ does not have the condition $g_{a,j_2} = i$:

\[ E_{i,1} = \sum_{a=1}^{n}\sum_{j_2=1}^{d}\log(w_{i,j_2}) \]

This reduces to:

\[ E_{i,1} = \sum_{j_2=1}^{d} n\log(w_{i,j_2}) \]

Following the same method used to find the general $A_{i,j+1}$, and writing $w'_{i,j}$ for the newly sampled value of $w_{i,j}$, $E_{i,j+1}$ is:

\[ E_{i,j+1} = E_{i,j} + n\log(w'_{i,j}) - n\log\Big(\prod_{t=1}^{j-1} UR_t\, w_{i,j}\Big) + (nd - n)\log(UR_j) \]

2.2 Experiments

This section shows and discusses the results of our approximation on both synthetic and real-world data. The POWER model algorithm and its approximation were written in C and run on the CISE department's storm AMD64 CPU server.

2.2.1 Synthetic Data Set Results

We ran our approximate method on 2 sets of synthetic data. The first data set consisted of 1500 data points with 20 attributes generated from a mixture of 4 simple patterns. Table 2-2 shows the original generated patterns and the learned patterns after running our approximate method with 16 partitions. Our method learned all the patterns correctly. The second data set consisted of 2000 data points with 500 attributes, also generated from a mixture of 4 simple patterns. Table 2-3 shows both the generated and learned patterns after running our method with 8 partitions. Our method learned 3 of
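the 4 patterns correctly and learned 60% of the 2nd pattern. This shows that our method works well for simple data sets.

Table 2-2. The 4 generated and learned patterns used for the first synthetic data set. The underbrace under a number gives the length of the run of that number.

ID                     Pattern
Generated pattern 1    $\underbrace{1 \cdots 1}_{10}\,\underbrace{0 \cdots 0}_{10}$
Learned pattern 1      $\underbrace{0.99 \cdots 0.99}_{10}\,\underbrace{0 \cdots 0}_{10}$
Generated pattern 2    $\underbrace{1 \cdots 1}_{20}$
Learned pattern 2      $\underbrace{0.99 \cdots 0.99}_{20}$
Generated pattern 3    $\underbrace{0 \cdots 0}_{10}\,\underbrace{1 \cdots 1}_{10}$
Learned pattern 3      $\underbrace{0 \cdots 0}_{10}\,\underbrace{0.99 \cdots 0.99}_{10}$
Generated pattern 4    $\underbrace{1 \cdots 1}_{5}\,\underbrace{0 \cdots 0}_{10}\,\underbrace{1 \cdots 1}_{5}$
Learned pattern 4      $\underbrace{0.99 \cdots 0.99}_{5}\,\underbrace{0 \cdots 0}_{10}\,\underbrace{0.99 \cdots 0.99}_{5}$

Table 2-3. The 4 generated and learned patterns used for the second synthetic data set. The underbrace under a number gives the length of the run of that number.

ID                     Pattern
Generated pattern 1    $\underbrace{1 \cdots 1}_{250}\,\underbrace{0 \cdots 0}_{250}$
Learned pattern 1      $\underbrace{0.99 \cdots 0.99}_{250}\,\underbrace{0 \cdots 0}_{250}$
Generated pattern 2    $\underbrace{0 \cdots 0}_{100}\,\underbrace{1 \cdots 1}_{300}\,\underbrace{0 \cdots 0}_{100}$
Learned pattern 2      $\underbrace{0.99 \cdots 0.99}_{100}\,\underbrace{0 \cdots 0}_{100}\,\underbrace{0.99 \cdots 0.99}_{100}\,\underbrace{0 \cdots 0}_{100}\,\underbrace{0.99 \cdots 0.99}_{100}$
Generated pattern 3    $\underbrace{0 \cdots 0}_{250}\,\underbrace{1 \cdots 1}_{250}$
Learned pattern 3      $\underbrace{0 \cdots 0}_{250}\,\underbrace{0.99 \cdots 0.99}_{250}$
Generated pattern 4    $\underbrace{1 \cdots 1}_{100}\,\underbrace{0 \cdots 0}_{300}\,\underbrace{1 \cdots 1}_{100}$
Learned pattern 4      $\underbrace{0.99 \cdots 0.99}_{100}\,\underbrace{0 \cdots 0}_{300}\,\underbrace{0.99 \cdots 0.99}_{100}$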

2.2.2 NIPS Papers Data Set

We tested the approximate version of the POWER model on the NIPS papers data set and compared the results with the original version. The data set consists of words collected from 1500 papers. The vocabulary covers 12419 words, and a total of approximately 6.4 million words can be found in the papers. The top 1000 non-trivial words were considered. Each paper was converted to a row of zeros and ones corresponding to the absence and presence of the word respectively. A 0/1 matrix of size 1500 by 1000 was obtained. The number of components k was set to 21. As per the original paper, KL divergence is used on the results from learning the model to rank the attributes according to importance for all components. Table 2-4 shows the highly ranked words for some of the components learned from the NIPS data set.

Discussion. As in the original POWER model, each learned component has a clear and intuitive meaning. For example, Component 1 represents words related to theory and proofs. Component 4 represents words related to hardware and electronics. Component 8 represents words related to the brain and the nervous system. Component 9 is concerned with words related to classification and data mining. Component 10 relates to neural networks. Component 12 relates to natural language processing (NLP). Component 13 relates to statistical and Bayesian methods. Component 17 relates to computer vision and image processing. Component 19 is related to robotics and moving objects. Finally, Component 20 is concerned with speech processing.

The original version of the POWER model learning for the NIPS data set took 300 hours (431 seconds on average for one iteration of the Gibbs sampling). Our version of the POWER model learning with 16 partitions of the approximation took 8.8 hours (16 seconds on average for one iteration of the Gibbs sampling). Thus, based on the per-iteration times, our version is around 27 times faster than the original version.
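
The 0/1 encoding described at the start of this section can be sketched as follows. This is only an illustration: the function name, the whitespace tokenization, and the toy documents and vocabulary are assumptions, and the selection of the top non-trivial words is not reproduced.

```python
def to_presence_matrix(documents, vocabulary):
    """One row per document; entry is 1 if the vocabulary word occurs in it, else 0."""
    matrix = []
    for text in documents:
        present = set(text.lower().split())  # assumed tokenization: split on whitespace
        matrix.append([1 if word in present else 0 for word in vocabulary])
    return matrix

# Hypothetical toy corpus and vocabulary.
docs = [
    "the chip uses an analog circuit",
    "bayesian posterior and likelihood",
    "analog circuit with bayesian likelihood",
]
vocab = ["chip", "analog", "bayesian", "likelihood"]
rows = to_presence_matrix(docs, vocab)
```

With 1500 papers and a 1000-word vocabulary, the same function would produce the 1500 by 1000 matrix used as the model input.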

Table 2-4. The highest ranked words for some of the components learned from the NIPS data set

Id    Words
1     symbol, turn, variables, definition, proof, unique, mathematical, linearly, implies, exact
3     predictor, predicted, deviation, validation, randomly, true, heuristic, test, smaller, modified
4     chip, vlsi, transistor, hardware, analog, digital, gate, circuit, pulse, voltage, processor, implementation, array, design, winner
8     cortical, cortex, evidence, synaptic, cognitive, orientation, stimuli, mechanism, brain, population, sensory, responses, selective, stimulus, receptive
9     dimensionality, ica, principal, pca, unsupervised, cluster, mixture, kernel, diagonal, clustering, images, nearest, reduction, covariance, decomposition
10    inhibitory, oscillation, synapses, neuronal, excitatory, synapse, synaptic, hebbian, strength, oscillator, inhibition, activity, firing, active, spike
12    language, text, interpretation, similarity, structure, context, string, focus, description, assignment
13    likelihood, posterior, probabilistic, markov, mixtures, probabilities, hmm, bayesian, conditional, bayes, densities, monte, mixture, carlo, belief
17    images, object, pixel, vision, scene, image, contour, resolution, detection, edge, segmentation, translation, visual, invariant, edges
19    reinforcement, reward, policy, sutton, agent, controller, action, programming, exploration, robot, learner, trajectory, environment, starting, strategy
20    acoustic, speaker, phoneme, speech, classifier, vowel, window, mlp, database, segmentation, language, forward, sound

2.2.3 Wireless Data Set

We run the approximate POWER model on data gathered from the wireless network at the University of Southern California. The method of obtaining the data can be found in [1]. The data represents the wireless Internet users' activity on campus in March 2008, where access to the 100 most visited web domains on campus was observed for 22,816 users. Each user's access pattern was converted to a row of zeros and ones corresponding to the non-access or access of the web domain respectively. A 0/1 matrix of size 22816 by 100 was obtained. The number of
components k was set to 21 and the number of partitions for the approximation was chosen to be 16. KL divergence is again used on the results from learning the model to rank the web domains according to importance for all components. Table 2-5 shows the highly ranked web domains for some of the components learned from the wireless data set.

Table 2-5. The highest ranked web domains for some of the components learned from the wireless data set using the approximate version of the POWER model

Id    Web Domains
2     windowsmedia, gridserver, microsoft, microsoftoffice2007, ln, yahoo, adrevolver, youtube, llnw, veoh
7     washingtonpost, cnet, mac, apple, facebook, doubleclick, mediaplex
14    ebayrtm, ebayimg, ebay, adrevolver, tribalfusion, mediaplex, yahoo, panthercdn, doubleclick
17    hotmail, live, net, quiettouch, coremetrics, ln, windowsmedia, microsoft, doubleclick, bankofamerica
18    mcafee, hackerwatch, ln, aol, llnw, doubleclick, mediaplex, facebook

Discussion. The approximate version of the POWER model discovers obvious patterns of web domain components. Component 2 captures the domains related to Microsoft Windows applications. Component 7 represents a group of web domains that are always clustered together in [1] and was seen to identify Mac users' web access patterns. Component 14 discovers the cluster of eBay-related domains. Component 17 clusters Microsoft email domains; it also captures the relationship between the bankofamerica domain and the coremetrics domain, which is the online analytics tool it uses. Component 18 clusters the web security domains together. Finally, Component 11 is the cluster of news web domains.

The run time of the model learning for the approximate POWER model was 16.23 hours, with the weight resampling step taking 11.6 seconds on average per iteration; one complete iteration of the approximate POWER model takes 23.38 seconds. This is in
contrast to the original POWER model taking 125 seconds on average per iteration, which makes our model about 5.34 times faster. The overhead of our approximate method appears on data with a small number of attributes, but this overhead disappears when handling data with a large number of attributes, since our method is linear in the number of attributes instead of quadratic like the original version. As the number of attributes in the data increases, the speedup gain increases as well.

2.3 Related and Future Work

Performance issues arise in data mining methods because these methods work on very large data sets. This is even more the case in current scientific and commercial applications, where a higher volume of data is processed. Among the techniques used to improve the speed of data mining methods are parallel structures and techniques [25]. Parallel techniques have been applied to many popular data mining algorithms to improve their performance. Ye et al. [26] developed a parallel version of the Apriori algorithm where the input is distributed among the nodes and each node computes its local candidate k-itemsets. Each node then sends its local k-itemsets to a master node that computes the sum of all candidates and prunes them, resulting in the frequent k-itemsets. Parallelizing k-means clustering has received an extensive amount of study [27–29]. Many of the parallel techniques in k-means clustering relied on partitioning the tasks and data among the nodes; some used the map-reduce framework for the same purpose. In computing the averages in our approximation, the data was also partitioned among a large number of threads to speed up the operation. Monte Carlo algorithms have also been shown to be suited for parallel computation [30].

Also, data reduction [31] techniques are used to improve the performance of data mining methods when working on very large data sets. The technique reduces the number of data points in the input by removing points that are seen not to be a part of any cluster. One way of reducing the data is density-biased sampling [32], where the points
have different probabilities of being part of the input. This probability depends on the value of prespecified data characteristics and the specific analysis requirements.

We have also designed a straightforward parallel version of the approximation running on a CUDA GPU that further improved the run time performance. This is achieved by utilizing the SIMD nature of CUDA to exploit some of the parallelism available in updating the parameters of the model and in computing the partitions of the approximation faster. The GPU version allowed for around a 40 times speedup over the original model on the NIPS data.

2.4 Discussion

In this chapter, we proposed an approximation to the POWER model that reduced the time complexity of learning the model to O(nkd). The approximation allowed the model learning to scale linearly with the number of attributes. This also reduced the running time to a large degree, allowing a serial implementation of the model learning to be about 27 times faster than the original POWER model on the NIPS data set. This is the first work done on speeding up the POWER model and improving its time complexity. We have also shown the speedup gain and the correctness of the results using experiments on a couple of real-life data sets.

2.5 Our Contributions

To summarize, our contributions are as follows:

• We propose an innovative approximation that improves the run time performance of POWER models from quadratic to linear in the number of data attributes. The overall time reduction is by several orders of magnitude, allowing these techniques to be applied to significantly larger problems.
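
The quadratic-to-linear claim can be traced back to the per-term costs stated in Sections 2.1.9 to 2.1.13. The summary below is a hedged reading of those costs, assuming that the number of partitions $p$ is a small constant (8 or 16 in the experiments) and that the factor $k$ comes from repeating the work for each of the $k$ components:

```latex
\begin{align*}
\text{exact, per } W_i &:\quad d \cdot O(nd) = O(nd^2) \\
\text{approximate, per } W_i &:\quad
  \underbrace{O(nd)}_{A_{i,1},\ B_{i,1},\ E_{i,1}}
  + \underbrace{O(ndp)}_{\text{partitioning for } D_{i,1}}
  + \underbrace{d \cdot O(p)}_{\text{updates of } \bar{x}}
  = O(ndp) \\
\text{approximate, all } k \text{ components} &:\quad O(nkdp) = O(nkd) \quad \text{for fixed } p
\end{align*}
```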

CHAPTER 3
A GLOBAL LOCAL MODELING OF INTERNET USAGE IN LARGE MOBILE SOCIETIES

Wireless mobile networks are extensively used throughout the world. They are ever growing to the point of being ubiquitous, and most people nowadays carry a device that can connect to these networks. This puts great pressure on these networks and strains their ability to support the load and demand of their users. By using data-driven modeling and new design paradigms [1], better context-aware network protocols and services can be designed based on the users' Internet usage behavior. In order to help build these network protocols and services, a Global Local model is proposed to understand the Internet usage behavior in different types of locations in a large mobile society. We wish first to learn the clustered wireless Internet usage behavior in accessing web domains. Then, we wish to investigate whether, and to what extent, the type of location a user is in influences the Internet usage behavior and correlates with the type of web domain the user is visiting. Also, we wish to know the probability of each web domain cluster appearing in the locations. This shows the level of relationship between each location and all web domain clusters.

To study the users' Internet usage behavior, extensive netflow, DHCP and MAC trap traces for thousands of mobile users in a WLAN spanning over 79 buildings and including over 700 APs were collected and processed [1]. This is by far the largest set of traces processed in any study of mobile networks to date. Moghaddam et al. [1] provided a systematic method to process the billions of records in this netflow to integrate and aggregate the multi-dimensional data.

The Global Local model mainly uses a generic Bayesian framework called the PrObabilistic Weighted Ensemble of Roles Model, or POWER model for short. This framework is a new class of mixture models [3–19] where multiple components can contribute to the generation of a single data point while simultaneously allowing each component to have a varying degree of influence on different data attributes. One
of the challenges of using the classical mixture model with high dimensional data is that it allows only a single mixture component to generate each data point. However, there are many real-world high dimensional data sets where it makes far more sense to model a data point as being generated using multiple overlapping components. In addition, an unintended consequence of the single-component data generation is that a component cannot limit its influence to only a subset of the data attributes, making it difficult to capture patterns in data subspaces.

Consider the scenario of building an informative model for the web usage patterns of users on a university campus given that we have usage logs for each user. Under the classical mixture model, we would assume that each user belongs to only one class. Membership in a given class should attempt to completely describe all of the web surfing patterns of each member user. Given the diversity in surfing patterns and the variety of websites, however, this is highly unrealistic. In real life, a user could belong to many classes like news-junkie, social-network-fan, movies-fan, sports-fan, hacker, and gaming-enthusiast, and the usage patterns may be influenced by one or more of these classes. Hence, it makes more sense to model the behavior of each user as resulting from the influence of several classes.

Now, consider a user that is a movies-fan, a sports-fan, a gaming-enthusiast, and a hacker. As this user is surfing the web, he finds a new gaming website based on a popular sport. It seems obvious that membership in both the sports-fan and the gaming-enthusiast classes should influence the decision of visiting this website, but membership in the hacker or movies-fan classes should not. Hence, we can conclude that it is probably not realistic for each class to influence each and every one of a user's website visits. Based on the concepts of multi-class membership and that each class should only influence a subset of data attributes, a generative process would allow each data point to be modeled with high precision, while still learning very general roles such as hacker and sports-fan that are important, and yet cannot describe any data point completely.

3.1 The Global Local Model

The Global Local model was devised in order to model the influence of a location in a mobile society on a user's web activity and behavior. It is important to understand the web usage in a location, and to design efficient context-aware Internet protocols and services suitable for all locations. A previous work [1] showed how similar types of locations are clustered together based on the users' web behavior. Our model shows how the type of a web domain cluster learned from users' behavior in the mobile society as a whole correlates with the type of a certain location in the society. The model also generates the likelihood of the web domain clusters appearing in the locations. This gives the level of relationship between each location and all web domain clusters.

The model first learns a global template of clusters from the global wireless data. The global template learned is then imposed on the model learning for each location in the mobile society. Having a global framework to learn the usage behavior over the whole mobile society allows for the retention of learned useful information. This global behavior is then exploited to learn the underlying local behavior in locations inside the mobile society, which can provide a relationship between the local locations and allows for behavior comparison between different locations inside the mobile society.

The Global Local model consists of two phases. The first phase (Global phase) is basically the fast approximate version of the POWER model that was proposed in Chapter 2. This version of the POWER model is run on global wireless data (wireless data from all locations of the mobile society). After the global phase learning reaches a steady state, the second phase (Local phase) continues the POWER model learning process of the global phase while fixing the learning of the parameters $w$ and [symbol lost]. The local phase is run on each location in the mobile society separately, with $w$ and [symbol lost] fixed at the values learned from the global phase. The input data to the local phase in a location is the subset of the global data corresponding to that location. The appearance-probability parameter learned from this phase represents the appearance probability of the clusters learned
in that location. This tells us about the relation between the type of web domain clusters most likely appearing in a certain type of location.

3.2 Experiments

This section shows and discusses the results of the Global Local model on both synthetic and wireless data. The wireless data was collected from a campus-wide analysis at the University of Southern California (USC) in 2008. The Global Local model was written in C++ and run on a 1.73 GHz Intel Core i7 laptop with 6 GB of RAM.

3.2.1 Synthetic Data Results

The Global Local model was tested on synthetic data generated from a mixture of 4 simple patterns with varying appearance probabilities, with 40 attributes across 10 locations. There were 1000 data points for each location, making the total number of data points 10,000. Table 3-1 shows the patterns used to generate the data.

Table 3-1. The 4 patterns used to generate the synthetic data. The underbrace under a number gives the length of the run of that number.

Id    Pattern
1     $\underbrace{1 \cdots 1}_{10}\,\underbrace{0 \cdots 0}_{30}$
2     $\underbrace{0 \cdots 0}_{20}\,\underbrace{1 \cdots 1}_{10}\,\underbrace{0 \cdots 0}_{10}$
3     $\underbrace{0 \cdots 0}_{10}\,\underbrace{1 \cdots 1}_{10}\,\underbrace{0 \cdots 0}_{20}$
4     $\underbrace{0 \cdots 0}_{30}\,\underbrace{1 \cdots 1}_{10}$

The Global Local model correctly learned the patterns, as seen in Table 3-2, as well as both their global and local appearance probabilities. Table 3-3 shows both the generating and the learned appearance probabilities for the global part, and Tables 3-4 to 3-13 show them for the local part.

3.2.2 Wireless Data Results

In this section we test the Global Local model on wireless data gathered from the USC campus in 2008. The data consists of the top 30 web domains visited on campus by the users. The data covers 6284 user records.

Table 3-2. The 4 patterns learned from the synthetic data. The underbrace under a number gives the length of the run of that number.

Id    Pattern
1     $\underbrace{0.99 \cdots 0.99}_{10}\,\underbrace{0 \cdots 0}_{30}$
2     $\underbrace{0.00 \cdots 0.00}_{20}\,\underbrace{1 \cdots 1}_{10}\,\underbrace{0.00 \cdots 0.00}_{10}$
3     $\underbrace{0 \cdots 0}_{10}\,\underbrace{1 \cdots 1}_{10}\,\underbrace{0.00 \cdots 0.00}_{20}$
4     $\underbrace{0.00 \cdots 0.00}_{30}\,\underbrace{0.99 \cdots 0.99}_{10}$

Table 3-3. The global generating appearance probability and the learned appearance probability

Global Appearance Probability
Generating    0.30    0.39    0.21    0.28
Learned       0.29    0.40    0.22    0.28

Table 3-4. The local generating appearance probability and the learned appearance probability in location 1

Location 1 Appearance Probability
Generating    0.50    0.50    0.50    0.50
Learned       0.48    0.48    0.53    0.50

Table 3-5. The local generating appearance probability and the learned appearance probability in location 2

Location 2 Appearance Probability
Generating    0.75    0.75    0.25    0.25
Learned       0.76    0.74    0.28    0.22

Table 3-6. The local generating appearance probability and the learned appearance probability in location 3

Location 3 Appearance Probability
Generating    0.40    0.00    0.40    0.00
Learned       0.41    0.00    0.44    0.00

Table 3-7. The local generating appearance probability and the learned appearance probability in location 4

Location 4 Appearance Probability
Generating    0.00    0.40    0.00    0.40
Learned       0.00    0.40    0.00    0.38

Table 3-8. The local generating appearance probability and the learned appearance probability in location 5

Location 5 Appearance Probability
Generating    0.75    0.25    0.00    0.00
Learned       0.72    0.25    0.00    0.00

Table 3-9. The local generating appearance probability and the learned appearance probability in location 6

Location 6 Appearance Probability
Generating    0.00    0.50    0.75    0.00
Learned       0.00    0.51    0.76    0.00

Table 3-10. The local generating appearance probability and the learned appearance probability in location 7

Location 7 Appearance Probability
Generating    0.00    0.30    0.00    0.30
Learned       0.00    0.32    0.00    0.31

Table 3-11. The local generating appearance probability and the learned appearance probability in location 8

Location 8 Appearance Probability
Generating    0.00    0.50    0.00    0.75
Learned       0.00    0.52    0.00    0.72

Table 3-12. The local generating appearance probability and the learned appearance probability in location 9

Location 9 Appearance Probability
Generating    0.35    0.50    0.00    0.35
Learned       0.37    0.49    0.00    0.36

Table 3-13. The local generating appearance probability and the learned appearance probability in location 10

Location 10 Appearance Probability
Generating    0.25    0.25    0.25    0.25
Learned       0.23    0.30    0.24    0.26
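
The synthetic setup of Section 3.2.1 can be imitated with a deliberately simplified generator. The sketch below is not the POWER generative process: it ignores the weights and any noise, and assumes that each of the four patterns of Table 3-1 appears in a data point independently with its appearance probability, with an attribute set to 1 whenever an active pattern has a 1 there. The function name and the seed are hypothetical; the probabilities are the generating values of Table 3-5 (location 2).

```python
import random

# The four generating patterns of Table 3-1 (40 attributes each).
PATTERNS = [
    [1] * 10 + [0] * 30,
    [0] * 20 + [1] * 10 + [0] * 10,
    [0] * 10 + [1] * 10 + [0] * 20,
    [0] * 30 + [1] * 10,
]

def generate_location(appearance, n_points, rng):
    """Simplified generator: each pattern is switched on independently per data point."""
    rows = []
    for _ in range(n_points):
        row = [0] * len(PATTERNS[0])
        for pattern, prob in zip(PATTERNS, appearance):
            if rng.random() < prob:
                row = [max(r, b) for r, b in zip(row, pattern)]
        rows.append(row)
    return rows

rng = random.Random(0)
location_2 = generate_location([0.75, 0.75, 0.25, 0.25], 1000, rng)

# Pattern 1 is the only pattern covering attribute 0, so this share estimates
# how often pattern 1 appeared in the generated data.
share_pattern_1 = sum(row[0] for row in location_2) / len(location_2)
```

Under this simplification the empirical share of each pattern should be close to its generating probability, which is the quantity the learned values in Tables 3-3 to 3-13 are compared against.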

Each user's visits to those web domains were converted to a row of zeros and ones, a zero meaning the user did not visit the domain and a one meaning the user visited it. Thus, we obtain a 0/1 matrix of size 6284 by 30. This matrix can be divided into 15 disjoint parts. Each part represents the location that portion of the data was gathered from. Table 3-14 shows the locations used in gathering the data, with locations of the same type colored with the same color. The number of components k was set to 11.

Table 3-14. The wireless data locations used in the USC campus

Location Id    Location Name                                        Location code
1              Alpha Chi Omega Sorority                             Sor1
2              Kappa Alpha Theta Sorority                           Sor2
3              Alpha Tau Omega Fraternity                           Frat1
4              Beta Omega Phi Fraternity                            Frat2
5              Sigma Phi Epsilon Fraternity                         Frat3
6              Zeta Beta Tau Fraternity                             Frat4
7              Alpha Kappa Psi Fraternity/Business                  Frat5/Business
8              Fluor Tower Housing                                  hous1
9              Annenberg House Apartment                            hous2
10             Web Tower Housing                                    hous3
11             Annenberg School for Communication & Journalism      jour
12             George Lucas Building School of Cinematic Arts       lucas
13             Wilson Dental Library                                dent
14             Norris Medical Library                               med
15             University Computing Center                          UCC

We also study the similarity in Internet usage between locations. The results from the Local phase of the model, which are the appearance probabilities of the web domain components for all locations, are analyzed using a dissimilarity matrix. Then, a graph representation of the dissimilarity matrix is created and cliques are discovered between the locations. We want to observe whether locations of a similar type have similar Internet usage based on the discovered cliques.

3.2.2.1 Global Phase

The approximate POWER Global phase is run on the global wireless data to learn the model. KL divergence is used on the results to rank the attributes according to
importance for all components. Table 3-15 shows the highly ranked web domains for some of the components learned from the data set.

Table 3-15. The highest ranked web domains for some of the components learned from the USC campus wireless data set global phase

Id    Web domains
1     theplanet, youtube, wikimedia, panthercdn, tfbnw
5     cnet, washingtonpost, apple, mac
6     live
8     facebook, tfbnw, mediaplex
9     co, mozilla

Discussion. The global phase learning of the model captures a clear and intuitive cluster of web domains, based on user behavior, associated with the learned components. It was able to capture the known components usually clustered from this data set. It manages to cluster media-related domains in component 1. Component 5 represents the mac component, which is a group of domains that always cluster together and represent the behavior of mac users. Microsoft web domains are represented in component 6, which in our data is only represented by the live domain. Facebook and its supporting domains are represented in component 8. Component 9 is the mozilla firefox users' cluster. There is always a correlation between the mozilla domain and the co domain.

3.2.2.2 Local Phase

After running the local phase on each location we observe the appearance probability of the components learned from the global phase in each location. The appearance probabilities of the components in the locations are displayed per component and are sorted in descending order by the probability's value in each location. The results are shown in Table 3-16.

Discussion. The local phase of the model captures an intuitive correlation between the components and the types of locations in the wireless mobile society. As can be seen from Table 3-16, component 1, the media component, has a high appearance in
fraternities/sororities and housing locations, which is expected since media like the youtube domain is visited for entertainment, and housing locations, where the students live and spend their time, are a suitable place for using this domain. Also, it has a high appearance in the school of communication and journalism, because the youtube domain is also a video sharing domain whose videos are used for communication and journalistic purposes, since they may contain videos of events or news items. The locations with the lowest appearance of this domain are libraries, a location not suitable for viewing videos; there is little reason to view videos in libraries, where usually reading and studying happens.

Table 3-16. The appearance probability of locations in descending order, sorted per component

Component                   Location and corresponding appearance probability
C1 (media component)        Sor1 0.45, Frat3 0.45, hous1 0.45, Sor2 0.44, jour 0.43, hous2 0.40, Frat1 0.38, Frat2 0.37, Frat5/business 0.30, UCC 0.28, hous3 0.25, lucas 0.23, med 0.16, dent 0.12, Frat4 0.10
C5 (mac component)          jour 0.36, Lucas 0.34, Sor2 0.31, hous1 0.22, hous2 0.20, Frat3 0.14, Dent 0.11, Sor1 0.11, Frat5/business 0.09, Frat1 0.08, Frat2 0.08, hous3 0.07, med 0.07, UCC 0.05, Frat4 0.01
C8 (facebook component)     Sor2 0.43, hous1 0.29, jour 0.28, Frat3 0.25, hous2 0.25, UCC 0.18, Sor1 0.17, Frat4 0.15, Frat5/business 0.15, med 0.14, Frat1 0.13, hous3 0.12, dent 0.10, lucas 0.10, Frat2 0.06

Component 5, the mac component, has a higher appearance in the school of communication and journalism and the school of cinematic arts. This shows macs are used more prominently in these schools due to macs having better software and capabilities for handling videos and media in general. Also, the mac component contains the washingtonpost domain, which is a newspaper, which may also explain its more prevalent appearance in the school of journalism.

Component 8, the facebook component, has a high appearance in sororities and housing locations since it is a social domain used a lot by college students in their
free time. Also, the Annenberg School of Communication and Journalism shows a high appearance of this domain because a lot of prominent people and politicians also use this domain to communicate to the media and the public. Other education locations get a lower appearance of this domain since it is not used for study and research purposes.

3.2.3 Internet Usage Similarity Between Locations
In this section we study the similarity of Internet usage among locations in the mobile society. We achieve this by creating a dissimilarity matrix between the 15 locations available in our wireless data. The dissimilarity between two locations is computed by the cosine distance function between the locations' appearance probabilities of the web domain components. Then, the dissimilarity matrix is mapped to an undirected graph as follows. Nodes in the graph represent locations in the mobile society, and an edge is drawn between two different nodes if their dissimilarity is less than a threshold. Finally, we find cliques within the graph to discover groups of locations with similar Internet usage. Figure 3-1 shows the resulting graph with a threshold of 0.1. Locations of similar type have the same node color, and location codes from Table 4-23 were used to denote the nodes.

Figure 3-1. Graph representation of the dissimilarity matrix using the threshold of 0.1 for the locations in the mobile society
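The graph construction described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function names (cosine_distance, threshold_graph, maximal_cliques) are hypothetical, the clique search is a plain Bron-Kerbosch enumeration chosen for brevity, and the example profiles use only the three component probabilities listed in Table 3-16 for a handful of locations, whereas the analysis above uses the appearance probabilities of the learned components for all 15 locations.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def threshold_graph(profiles, threshold):
    """Adjacency sets: an edge joins two locations whose distance is below threshold."""
    adjacency = {name: set() for name in profiles}
    for a, b in combinations(sorted(profiles), 2):
        if cosine_distance(profiles[a], profiles[b]) < threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


if __name__ == "__main__":
    # Illustrative subset: (C1, C5, C8) appearance probabilities from Table 3-16.
    profiles = {
        "Sor2": [0.44, 0.31, 0.43],
        "hous1": [0.45, 0.22, 0.29],
        "jour": [0.43, 0.36, 0.28],
        "lucas": [0.23, 0.34, 0.10],
        "med": [0.16, 0.07, 0.14],
    }
    graph = threshold_graph(profiles, threshold=0.1)
    for clique in maximal_cliques(graph):
        print(clique)
```

Because the threshold is applied to pairwise distances, lowering it removes edges and tends to split large cliques into smaller, tighter groups of locations.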

3.3 Related Work
Using the observed user behavior to design realistic and practical mobility models has been the focus of many works [33-36]. It has been shown that the most widely used existing mobility models fail to generate realistic mobility characteristics observed from the traces. Realistic mobility modeling is essential for protocol performance [37]. Correlating the user behavior with his location has rarely been covered in research. Ploumidis et al. [38] used a multi-level (network, AP and client) application-based traffic characterization, then grouped APs based on building category to examine variation in application use. A weak correlation was found between the type of application used and some building categories, but the traffic characterization was based only on AP traces and only on 7.5 days of traffic. Another application-based study of WLAN usage on a campus [39] evaluated the inbound and outbound traffic of web applications in residential and non-residential locations. Their work did not find differences in user behavior based on the location. In a previous work [1], locations (i.e., buildings) in a mobile society were clustered together based on the similarity of Internet usage behavior, and it was found that locations of the same type actually cluster together, but no relations were drawn between the type of web domain clusters learned and their appearance in similar types of locations. A smaller scale study of the influence of the region on the Internet usage behavior focused on the usage in a rural village [40]. A difference was found in the dominant Internet traffic type (HTTP vs. peer-to-peer) between urban areas and rural areas. Also, it was found that in the village residence, facebook web domains dominated the web traffic. This supports our idea that the location of the user influences his Internet behavior. Therefore, to the best of our knowledge, this represents the first work in finding a correlation between the type of web domain cluster learned from Internet usage behavior from a mobile society as a whole and the type of location they most likely appear in.

A couple of studies were conducted on the Internet usage behavior on smartphones on wireless networks and 3G mobile networks [41-42]. In one paper [41], an influence of the location on the usage behavior was found, without specifying a relationship between the type of location and the type of web domain visited. The other paper [42] found a correlation between the type of phone online application most used by the user at work and at home. Entertainment and social network applications were found to be most used at home, while mail applications were most used at work. This correlation was also discovered in our work.

3.4 Discussion
In this chapter we have introduced a new model for users' web behavior in a wireless mobile society as a whole and in certain types of locations in the society. The model learned the web domain clusters of the mobile society based on the users' web behavior. It has also produced the appearance probability of each web domain cluster (learned globally in the mobile society) in all locations of the mobile society. This showed the level of the relationship between web domain clusters and locations. We have shown that there is a correlation between the type of web domain cluster and the types of locations in the society where these clusters are more likely to appear. This model helps in building better context-aware network protocols and services by using the data-driven modeling and design paradigm.

3.5 Our Contributions
To summarize, our contributions are as follows:

We present a novel Global Local model that learns the Internet usage behavior in a wireless mobile society as a whole and in locations inside the mobile society. The model gives the level of relationship between the web domain clusters learned globally in the mobile society and the locations inside the mobile society. It has one local model for all locations, allowing different locations to be compared to each other based on the usage behavior.

We realistically describe Internet usage in large mobile societies by an overlap of correlations by using a fast approximate version of the Bayesian mixture model called the POWER model. The model allows us to capture the mixture of

hidden patterns in the user's Internet behavior. It can assign the users to multiple unrelated classes of web users simultaneously based on their usage behavior, a feat classical mixture models are incapable of.

CHAPTER 4
LEARNING SPATIO-TEMPORAL CORRELATIONS IN LARGE SCALE WIRELESS DATA USING MULTI-DIMENSIONAL CO-CLUSTERING METHODS

Wireless networks are growing throughout the world in number, size and number of connected devices. This growth requires looking into alternative paradigms for modeling and designing wireless networks to alleviate the problems facing the general-purpose paradigm in handling the increasing demand and load on the networks. One of the emerging paradigms is the data-driven modeling and design paradigm. In this paradigm, realistic behavioral models for Internet users' website visitation patterns are developed, and behavior-aware network protocols can be developed and parameterized. However, these models require analysis of large-scale wireless datasets to understand the behavior of the Internet users. One such wireless dataset used for analysis is the University of Southern California (USC) campus netflow, DHCP and MAC trap traces for thousands of mobile users in a WLAN from over 79 buildings including over 700 APs collected in 2008 [1]. This is the largest set of traces processed in a study of mobile networks up until now. We processed, filtered and aggregated this dataset to be amenable for running co-clustering methods so as to learn the behavioral models.

In a previous work [43], we identified classes of Internet users that visit clusters of websites by using a new class of mixture models called the POWER model that was run on a subset of the USC wireless data. The POWER model learns complex overlapping patterns of usage behavior from the wireless data. This model allows an Internet user to belong to multiple usage classes based on usage behavior. Not only is this more sensible and more realistic, but it also allows for having general usage classes in the model. We have also developed a novel model to learn location-based correlations in the wireless data. This model, called the Global Local model, learns the influence of a building or a location inside a large campus on the Internet users' behavior. The model generates probabilities of each user class appearing in each building on the campus,

thus allowing us to observe the relationship between the type of a building and the type of the user class most likely to appear in that building. The Global Local model gives one global model for all buildings on campus, which in turn is used to construct a local model for each building. This gives us the ability to compare buildings within the campus. We have found a very distinct correlation between users' classes and building types, e.g., users who visit media websites are more likely to be in the school of journalism or the school of cinematic arts buildings, and users who visit social network websites are usually in housing buildings or in sororities and fraternities.

In this chapter, we analyze the influence of time on the user's web visitation behavior. We use co-clustering methods to learn the influence of the hour, day or month of the user's website visit on his Internet behavior. There are a few challenges in dealing with temporal attributes of data. One of them is representing the temporal attributes of data in a co-occurrence table in such a way that it is amenable to data mining methods. Time is continuous and granular in nature, i.e., it can be divided into seconds, minutes, hours, days, etc. Hence, temporal data attributes can be represented as (i) one-dimensional data with a time unit, e.g., each column in the co-occurrence table representing an hour in the time period that spans the data; or (ii) multi-dimensional data where each dimension represents a time unit in the co-occurrence table, e.g., the first dimension can represent the 24 hours in a day, the second dimension can represent days in a week and the third dimension can represent weeks in a month; or (iii) one-dimensional data but using a hierarchy, e.g., the columns of the co-occurrence table would be all the hours of many consecutive days, and on a higher level we group the consecutive hours into days, making that level represent days, which is less granular than the lower level. This can be repeated as we go higher, having levels for weeks, months, etc. Having temporal attributes represented in multiple levels allows application of hierarchical clustering techniques on the data and helps us find relationships between the attributes at different granularities of time.

Another issue with the temporal attributes of data is dealing with contiguity in the time dimension. We investigate whether contiguous hours and days in the time dimension can be forced to cluster together. Unlike other attributes of data, contiguous temporal elements could be related due to the continuous nature of time.

Additionally, we extend learning the temporal correlations in the Internet usage behavior by also considering the location dimension. We learn the combined spatio-temporal correlations in the usage behavior of wireless Internet users on campus, and show that both the location of the user and the time affect his or her Internet behavior.

4.1 Preliminaries
In this section we briefly discuss the problem of finding spatio-temporal correlations in the data. Also, we mention the various ways of representing temporal information in the data.

4.1.1 Problem Statement
Better understanding of the behavior of wireless Internet network users helps develop more realistic behavioral models. These models, together with behavior-aware network protocols, are important in designing better wireless Internet networks using the data-driven modeling and design paradigm. Learning the temporal and the combined spatio-temporal influence on the users' online behavior helps to achieve this goal. In a previous work, we learned the spatial influence on the users' behavior. We have seen that there is a correlation between the type of web domains visited and the type of building or location in a large wireless network where the web domain was visited. We have also seen in our work and in Saeed et al.'s paper [1] that users in locations of the same type or category share the same pattern of Internet behavior. We expand our analysis to the dimension of time and the combined space-time dimension. We also handle the issues of dealing with the time dimension in the wireless data: how to represent temporal attributes of the data and how to apply co-clustering methods on them. Various multi-dimensional co-clustering methods, some available and

some novel, are used for our analysis to handle the complex, multi-dimensional wireless data that was obtained from the USC campus.

4.1.2 Temporal Data Representation
Temporal attributes of the data can be represented as a single dimension where each hour or day is a column in the co-occurrence table, as in Figure 4-1. Clustering of the attributes in this representation is shown in Figure 4-2. In addition, the single dimension of time can be represented in a hierarchical fashion, as in Figure 4-3. In this representation the bottom level of the hierarchy contains the smallest unit of time to be analyzed, and all data is stored there. The next level of the hierarchy just groups the smaller units of time to form the higher unit of time, and so on. This representation allows the analysis to cover the different granularity levels of time. Clustering can occur in this representation in each level of the hierarchy, as shown in Figure 4-4.

Another representation of the temporal attributes is the multi-dimensional representation, which breaks up the time into its different units such as hours, days, months, etc. A 3-dimensional co-occurrence table can be constructed from the time elements containing the Internet usage data and then be used as the input to co-clustering methods. This is illustrated in Figure 4-5. This can improve on learning correlations in co-clustering by having data from a time element (e.g., hour) in the table against other time elements (e.g., day and month), which allows for learning hours that have similar Internet behavior in certain days and certain months, as illustrated in Figure 4-6.

Figure 4-1. Temporal attributes in one dimension
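The three representations can be made concrete with a small sketch. The Python code below is illustrative only: the record layout (domain, day index, hour) and all function names are assumptions introduced here, not the format of the actual USC traces. It shows a single flattened time dimension, the day-level parent of each hour column (the hierarchy), and a multi-dimensional key that splits time into hour, day of week and week.

```python
from collections import Counter

HOURS_PER_DAY = 24


def flat_hour_index(day_index, hour):
    """Single-dimension column: columns 0-23 are day 0, columns 24-47 are day 1, etc."""
    return day_index * HOURS_PER_DAY + hour


def parent_day(flat_index):
    """Hierarchy: the day-level parent of an hour-level column."""
    return flat_index // HOURS_PER_DAY


def single_dimension_table(records):
    """Co-occurrence counts keyed by (domain, flattened hour column)."""
    table = Counter()
    for domain, day_index, hour in records:
        table[(domain, flat_hour_index(day_index, hour))] += 1
    return table


def multi_dimension_table(records, days_per_week=7):
    """Co-occurrence counts keyed by (domain, hour, day of week, week)."""
    table = Counter()
    for domain, day_index, hour in records:
        key = (domain, hour, day_index % days_per_week, day_index // days_per_week)
        table[key] += 1
    return table


if __name__ == "__main__":
    # Hypothetical records: (web domain, day index from the start of the data, hour).
    records = [("facebook", 0, 9), ("facebook", 1, 9), ("facebook", 1, 9), ("youtube", 8, 23)]
    print(single_dimension_table(records))
    print(multi_dimension_table(records))
```

In the flattened form, grouping columns by parent_day recovers the day level of the hierarchy without storing the data twice, which is why all counts can live at the bottom level.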

Figure 4-2. Co-clustering of temporal attributes in a single dimension

Figure 4-3. Temporal attributes in a hierarchy

Figure 4-4. Co-clustering of temporal attributes in a hierarchy

Figure 4-5. Temporal attributes in multiple dimensions

Figure 4-6. Co-clustering of temporal attributes in multiple dimensions

4.2 Methods
Various multi-dimensional co-clustering methods, both existing and novel, are used to find the temporal and the spatio-temporal correlations in the wireless data. In this section, we present these methods.

4.2.1 Multi-Dimensional Information Theoretic Co-Clustering (MDITCC)
The Information Theoretic Co-Clustering (ITCC) [5] was extended to multiple dimensions in Gao's thesis [45]. The formula for the loss in mutual information for a fixed co-clustering $(C_{D_1}, C_{D_2}, \ldots, C_{D_n})$ was extended to multiple dimensions as follows:

$$I(D_1; D_2; \ldots; D_n) - I(\hat{D}_1; \hat{D}_2; \ldots; \hat{D}_n) = D\big(p(D_1, D_2, \ldots, D_n) \,\|\, q(D_1, D_2, \ldots, D_n)\big)$$

where $D(\cdot \,\|\, \cdot)$ denotes the Kullback-Leibler (KL) divergence, also known as relative entropy, and $q(D_1, D_2, \ldots, D_n)$ is a distribution of the form

$$q(d_1, d_2, \ldots, d_n) = p(\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_n) \prod_{i=1}^{n} p(d_i \mid \hat{d}_i)$$

The MDITCC algorithm is very similar to the ITCC algorithm, only extended to multiple dimensions. Gao used a data cube to represent the multi-dimensional data and its clusters for efficient and fast access to data elements in his method. The compute-intensive parts of the algorithm can optionally be solved with massive

parallelization using a CUDA-enabled GPU for a more efficient implementation. Optimizations were applied to the distance computation using the data cubes in both the serial and parallel implementations of the method.

4.2.2 Multi-way Distributional Clustering via Pairwise Interaction (MDC)
The Multi-way Distributional Clustering [44], or MDC for short, extends the two-way clustering of the co-clustering algorithms and allows for co-clustering over multiple dimensions. This clustering is similar to the Information Theoretic Co-Clustering in using the same objective function, but extends it to multi-way clustering by using pairwise interaction graphs. The Information Theoretic Co-Clustering approach is restricted to dealing with two-dimensional data. It is claimed that the approach can be easily extended to multiple dimensions, given a multi-dimensional mutual information objective function. However, objective functions based on high-order statistics, such as the multi-dimensional mutual information, are considered uncertain and are relatively poorly understood [44]. Also, it is not clear statistically whether reliable estimates can be extracted for a multi-dimensional joint distribution.

The pairwise interaction graph of the MDC is defined as follows. Let $X = \{X_i \mid i = 1, \ldots, m\}$ be the variables to be clustered, and $\tilde{X} = \{\tilde{X}_i \mid i = 1, \ldots, m\}$ be their respective clusterings. Let $G = (V, E)$ be an undirected graph with $V = \tilde{X}$. An edge $e_{ij}$ appears in $E$ if we want to maximize the mutual information between $\tilde{X}_i$ and $\tilde{X}_j$. If there is no mutual information between $\tilde{X}_i$ and $\tilde{X}_j$, or the co-occurrence data is unavailable, then the edge $e_{ij}$ is absent from the graph. If prior knowledge is available we can add weights to edges in $E$, where $w_{ij}$ is the weight on edge $e_{ij}$. If such prior knowledge is not available then $w_{ij} = 1$. The objective function of the pairwise interaction graph $G$ is then defined as:

$$\max_{\{\tilde{X}_i\}} \sum_{e_{ij} \in E} w_{ij} \, I(\tilde{X}_i; \tilde{X}_j)$$
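To make the pairwise objective concrete, the following Python sketch evaluates the weighted sum of mutual information terms for a given set of clusterings. It is an illustration under assumptions, not the MDC implementation: the function names, the dictionary-based co-occurrence tables and the edge tuple layout (i, j, weight, table) are all hypothetical, and the code only scores a fixed clustering rather than searching for the maximizing one.

```python
import math
from collections import defaultdict


def clustered_mutual_information(cooc, row_cluster, col_cluster):
    """I(X~_i; X~_j) in bits, from a co-occurrence table {(row, col): count}
    and the cluster assignment of every row element and column element."""
    total = float(sum(cooc.values()))
    joint = defaultdict(float)
    for (row, col), count in cooc.items():
        joint[(row_cluster[row], col_cluster[col])] += count / total
    p_row = defaultdict(float)
    p_col = defaultdict(float)
    for (a, b), p in joint.items():
        p_row[a] += p
        p_col[b] += p
    return sum(
        p * math.log2(p / (p_row[a] * p_col[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def mdc_objective(edges, clusterings):
    """Weighted sum of pairwise mutual information over the interaction graph.
    edges: iterable of (i, j, weight, cooc); clusterings: {variable: {element: cluster}}."""
    return sum(
        weight * clustered_mutual_information(cooc, clusterings[i], clusterings[j])
        for i, j, weight, cooc in edges
    )


if __name__ == "__main__":
    # Hypothetical user/domain counts; the weight 1.0 encodes "no prior knowledge".
    cooc = {("u1", "facebook"): 5, ("u2", "facebook"): 5,
            ("u3", "google"): 5, ("u4", "google"): 5}
    clusterings = {
        "user": {"u1": 0, "u2": 0, "u3": 1, "u4": 1},
        "domain": {"facebook": 0, "google": 1},
    }
    print(mdc_objective([("user", "domain", 1.0, cooc)], clusterings))
```

Only two-dimensional tables appear in the computation, one per edge of the graph, which is the property that lets the method avoid estimating a full multi-dimensional joint distribution.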

Hence, the multi-dimensional mutual information can be broken into a set of two-dimensional co-occurrence tables between any two random variables that have mutual information between them.

4.2.3 Hierarchical Multi-dimensional Co-clustering
One of the issues of dealing with the temporal attributes of the data is the granular nature of time. When analyzing the temporal attributes we want to handle time in all its levels of granularity. Moreover, the wireless data contains many attributes, such as the web domains visited, the user IDs, the user locations, etc. Co-clustering the wireless data over just one attribute will ignore the influence that all of the other attributes have on the Internet user behavior. Therefore, we have devised the hierarchical multi-dimensional co-clustering algorithm to handle these cases in the wireless data analysis. The algorithm is based on the multi-dimensional co-clustering algorithm Multi-way Distributional Clustering [44]. Changes were made to the algorithm to incorporate hierarchy and to handle the time dimension.

Our algorithm adds hierarchy to the MDC with some other modifications. The hierarchy is added to the dimension of time: in the co-occurrence table of that dimension, the columns represent the hours of many consecutive days. When we go one level higher, the columns of the consecutive hours of each day can be grouped to represent days; when we go another level higher, the consecutive days can be grouped into weeks or months, and so on. Figure 4-3 shows the hierarchy levels of the time units' variables. Each group of hours or days is treated as a co-clustering over the initial clustering of the other dimensions. Hierarchy can easily be added the same way to the other dimensions of data, but for our case the time dimension is the only dimension that requires the hierarchy.

The algorithm starts by following the clustering schedule of the MDC algorithm and clusters each variable it encounters in the schedule in either a divisive or an agglomerative manner, depending on what was specified in the schedule. The variable

that is chosen to have hierarchy must always use agglomerative clustering. The clustering of the hierarchical variable starts at the highest level of the hierarchy. Similar elements of that level get co-clustered together (e.g., months with similar patterns of Internet usage data get clustered together) over all of the other variables. Figure 4-7 shows the hierarchy after clustering Monday and Wednesday together, with their hours variables becoming one group.

Figure 4-7. The shape of the time hierarchy after the clustering of the Monday and Wednesday variables on the day level of the hierarchy

Afterwards, we go one level down and co-cluster the elements of that level belonging to the resulting clusters from the previous level, each separately, over all of the other variables. For example, if Monday and Wednesday were clustered together in the previous level, then when we go one level down, agglomerative clustering is performed on the hour elements in the Monday and Wednesday cluster over all of the other variables, separately from the hour elements in other days. Figure 4-8 shows the result after clustering the hours in the bottom level of the hierarchy. Pseudo-code for the MDHCC is given in Algorithm 1.

The MDHCC algorithm starts by learning general co-clusters and then specializes each learned co-cluster separately, allowing for better clustering to be learned as we go down in the hierarchy. We iteratively repeat the process of going down the hierarchy until we reach the lowest level. Since in our example the hierarchical variable is time and the lowest level is the hours, we perform agglomerative clustering only on contiguous hours in each cluster, since it would not make sense to cluster non-contiguous hours together to learn the users'

Figure 4-8. The shape of the time hierarchy after the clustering of the Monday and Wednesday variables on the day level of the hierarchy

Algorithm 1. The Multi-Dimensional Hierarchical Co-Clustering (MDHCC) algorithm

Input:
    $X_1, \ldots, X_m$ - variables to cluster
    $X^{ML}_1, \ldots, X^{ML}_d$ - hierarchical multi-layered variable $X^{ML}$
    $G = (V, E)$ - pairwise interaction graph
    $S_{up}, S_{down}$ - up/down partition, $S_{up} \cup S_{down} = \{1, \ldots, m\}$
    $S_n = i_1, i_2, \ldots, i_n$ - clustering schedule
Output:
    Clusterings $\tilde{X}_1, \ldots, \tilde{X}_m$

Initialize clusters:
for all $i = 1, \ldots, m$ do
    if $i \in S_{down}$ then
        Place all elements of $X_i$ in a common cluster
    else
        if $i \in S_{up}$ then
            Place each element of $X_i$ in a singleton cluster
        end if
    end if
end for

Main loop:
for all $j = 1, \ldots, n$ do
    Split/merge:
    if $i_j \in S_{down}$ then
        Split each element $\tilde{x}$ of $\tilde{X}_{i_j}$ uniformly at random to two clusters
    else
        if $i_j \in S_{up}$ then
            if $S_j$ is the first $S_{up}$ then
                if $X_{i_j} = X^{ML}$ then
                    for all $k = d, \ldots, 1$ do
                        Merge each element $\tilde{x}$ of $\tilde{X}^{ML}_k$ with its closest peer
                    end for
                else
                    Merge each element $\tilde{x}$ of $\tilde{X}_{i_j}$ with its closest peer
                end if
            else
                Merge each element $\tilde{x}$ of $\tilde{X}_{i_j}$ with its closest peer
            end if
        end if
    end if
    Correct clusters:
    if $X_{i_j} \neq X^{ML}_d$ then
        for all elements $x$ of $X_{i_j}$ do
            Pull $x$ out of its current cluster
            Place $x$ into a cluster, s.t. $\sum_{e_{ij} \in E} w_{ij} I(\tilde{X}_i; \tilde{X}_j)$ is maximized
        end for
    end if
end for
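The contiguity constraint applied at the hour level can be illustrated with a small Python sketch. This is a simplification under stated assumptions rather than the MDHCC implementation: it merges only adjacent hour clusters, and it uses an L1 distance between normalized usage profiles as a stand-in for the mutual-information-based notion of "closest peer" in Algorithm 1. The function names and the profile layout (one count vector per hour, e.g., visits per web domain cluster) are hypothetical.

```python
def normalize(counts):
    """Turn a count vector into a distribution (left unchanged if it sums to zero)."""
    total = float(sum(counts))
    return [c / total for c in counts] if total else list(counts)


def l1_distance(p, q):
    """L1 distance between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def contiguous_agglomerative(hour_profiles, k):
    """Merge adjacent hour clusters until k clusters remain.

    hour_profiles: list of count vectors, one per hour, in time order.
    Returns a list of (first_hour, last_hour) index pairs, inclusive.
    """
    clusters = [(i, i, list(profile)) for i, profile in enumerate(hour_profiles)]
    while len(clusters) > max(k, 1):
        # Only neighbours in time are merge candidates, which enforces contiguity.
        best = min(
            range(len(clusters) - 1),
            key=lambda i: l1_distance(
                normalize(clusters[i][2]), normalize(clusters[i + 1][2])
            ),
        )
        start, _, left = clusters[best]
        _, end, right = clusters[best + 1]
        merged = [a + b for a, b in zip(left, right)]
        clusters[best:best + 2] = [(start, end, merged)]
    return [(start, end) for start, end, _ in clusters]


if __name__ == "__main__":
    # Hypothetical per-hour counts over two web domain clusters.
    profiles = [[10, 0], [9, 1], [0, 10], [1, 9], [5, 5], [5, 5]]
    print(contiguous_agglomerative(profiles, k=3))
```

Restricting candidates to neighbouring clusters also keeps each merge step linear in the number of clusters, whereas unconstrained agglomerative merging would compare all pairs.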

Internet behavior patterns. We will end up with groups of clusters in each level of the hierarchical variable, with each cluster belonging to a cluster one level higher. Also, we will have learned clusters in the other non-hierarchical variables in the data.

4.3 Results
This section shows the results of learning temporal and spatio-temporal correlations on the wireless Internet data using various co-clustering methods. The first subsection shows the results of learning temporal correlations alone, and the second subsection shows results for learning the combined spatio-temporal correlations on the wireless data. In both subsections we run the same co-clustering methods.

The wireless data is the netflow information that was collected from a campus-wide analysis at the University of Southern California (USC) in the period between January 2008 and March 2008. The wireless data contains 9,281,529 records, where each record contains: user ID, web domain visited, user location, time of access. However, the input data has a different format and representation of time based on the experiment.

Dataset 1 (multi-dimensional time analysis with MDITCC): This dataset contains a row for each record of the wireless data and five columns for the temporal correlations experiments, representing the user ID, web domain visited ID, month of access, day of access and hour of access. For the spatio-temporal experiments an extra column is added representing the user location ID.

Dataset 2 (multi-dimensional time analysis with MDC): This dataset consists of a set of co-occurrence tables, one for each edge between vertices (each dimension is represented by a vertex) of the pairwise interaction graph in Figure 4-10. Each co-occurrence table is stored in a separate input file, resulting in seven input files. The number of dimensions or variables is the same as in dataset 1. For the temporal experiments on data for one month, the month dimension is removed, resulting in a pairwise interaction graph as in Figure 4-9 and reducing the input files to five. For the spatio-temporal experiments on all months, the number of input files is twelve. Each row in the input files represents a cell in the co-occurrence table and contains three columns. The first column is the table's row value, the second column is the table's column value and the third is the count stored in the table's cell.
Dataset 3 (single-dimension time analysis with MDC and MDHCC): Dataset 3 has the same format as dataset 2, except that time is represented in this dataset as a

single dimension called hourdaysmonth. This dimension contains the information of all the hours of the days and months in order. This makes the first 24 columns of the time dimension in the co-occurrence table represent the hours of the first day of the month, the second 24 columns represent the hours of the second day of the month, etc. Forming hierarchies and clusters in each level of the time granularity becomes easier in this representation. For the temporal experiments the number of input files is three, while the number of input files for the spatio-temporal experiments is six.

Existing C++ implementations of the MDITCC and MDC methods were used, and the MDHCC method was written in C++. All methods were run on a 1.73 GHz Intel Core i7 laptop with 6 GB of RAM.

4.3.1 Temporal Correlations in the Wireless Data Results
In this subsection we run the co-clustering methods on the wireless data to learn the temporal correlations. The multi-dimensional wireless data used in this subsection consists of the dimensions for user IDs, web domains and time. We run the data with time as both a single dimension and as multiple dimensions. However, in the Multi-Dimensional Hierarchical Co-Clustering (MDHCC) time is treated only as a single dimension due to the method's hierarchy. We run the methods on subsets of the wireless data that correspond to each month separately (January, February and March), and on the total data, to observe the differences in correlations across this time period and the stability of our analysis.

4.3.1.1 Multi-Dimensional Information Theoretic Co-Clustering Results
The MDITCC method is run first on the wireless data for each month separately, then on all of the months. This is to analyze the data and find the patterns for each month alone and then find the patterns for the whole period.

January 2008.
Table 4-1 shows the results for the month of January.

Discussion
We discover some correlations in these results. In the hour dimension, we observe that cluster 1 represents the online behavior in the early morning hours. Cluster 2 represents the behavior in the afternoon hours. Cluster 3 has mostly the

evening hours. We find that in each cluster there was a natural clustering of consecutive hours of a certain period of the day. This could mean that each main period of the day has a different online usage behavior.

Table 4-1. The co-clusters learned using the MDITCC from the USC campus wireless data for each dimension for January 2008
Data Dimensions and Clusters: Cluster Elements
Hour
    Cluster 1: 2:00am-6:59am, 1:00pm-1:59pm, 4:00pm-4:59pm, 10:00pm-10:59pm
    Cluster 2: 9:00am-9:59am, 11:00am-12:59pm, 2:00pm-3:59pm, 5:00pm-5:59pm, 11:00pm-11:59pm
    Cluster 3: 12:00am-1:59am, 7:00am-7:59am, 10:00am-10:59am, 6:00pm-9:59pm
Day
    Cluster 1: Friday, Saturday
    Cluster 2: Thursday, Sunday
Web Domain
    Cluster 4: usc, facebook, apple, tfbnw, washingtonpost, co, mozilla

In the day dimension, we observe that Friday and Saturday were grouped together in cluster 1, representing the online behavior in the beginning of the weekend. Even though Friday is a weekday, people consider it the beginning of the weekend, so it is intuitive for that day to be grouped with Saturday, a weekend day.

In the web domain dimension, the correlation between the facebook domain and its content delivery network domain (tfbnw) was discovered in cluster 4.

February 2008.
Table 4-2 shows some of the results for the month of February.

Discussion
We observe similar correlations for the February data to the January data. For the hour dimension, we observe that cluster 1 represents the morning period behavior. Clusters 2 and 3 both represent the behavior in the afternoon and evening periods.

Table 4-2. The co-clusters learned using the MDITCC from the USC campus wireless data for each dimension for February 2008
Data Dimensions and Clusters: Cluster Elements
Hour
    Cluster 1: 12:00am-1:59am, 4:00am-4:59am, 7:00am-1:59pm, 7:00pm-7:59pm
    Cluster 2: 4:00pm-4:59pm, 6:00pm-6:59pm, 9:00pm-9:59pm
    Cluster 3: 2:00am-3:59am, 5:00am-6:59am, 2:00pm-3:59pm, 5:00pm-5:59pm, 10:00pm-11:59pm
Day
    Cluster 1: Tuesday, Wednesday
    Cluster 2: Sunday, Monday
Web Domain
    Cluster 2: usc, mozilla, facebook, live, tfbnw, cnet
    Cluster 4: google, co, youtube

In the day dimension, we find that consecutive days were clustered together in clusters 1 and 2. This may indicate that online behavior in consecutive days is similar.

In the web domain dimension, cluster 2 found the correlation between the facebook domain and its content delivery network (tfbnw). Cluster 4 found the correlation between the google domain and the youtube domain. This could be related to the fact that google owns youtube, which explains the similar patterns.

March 2008.
In this section we show and discuss the results obtained from running the MDITCC method on the wireless data from March 2008. Table 4-3 shows some of the results found.

Discussion
Similar correlations were also found in the results for March 2008. In the hour dimension, cluster 1 represents the behavior in the morning period. Clusters 2 and 3 represent the behavior in the afternoon and evening periods.

In the day dimension, we observe that cluster 1 contains the weekdays Thursday and Friday while cluster 2 contains Saturday. This may indicate that the usage behavior

on the weekdays is different from the behavior on weekends. We observe that Friday was clustered with a weekend day in the January data, while here it was clustered with a weekday. This could show that Friday could be treated as a weekday or a weekend day, since it is both the end of the weekdays and the beginning of the weekend.

Table 4-3. The co-clusters learned using the MDITCC from the USC campus wireless data for each dimension for March 2008
Data Dimensions and Clusters: Cluster Elements
Hour
    Cluster 1: 3:00am-8:59am, 10:00am-10:59am, 5:00pm-6:59pm
    Cluster 2: 12:00am-1:59am, 1:00pm-1:59pm, 3:00pm-3:59pm, 9:00pm-9:59pm, 11:00pm-11:59pm
    Cluster 3: 9:00am-9:59am, 11:00am-12:59pm, 2:00pm-2:59pm, 4:00pm-4:59pm, 7:00pm-8:59pm, 10:00pm-10:59pm
Day
    Cluster 1: Thursday, Friday
    Cluster 2: Saturday
Web Domain
    Cluster 1: apple, live, aster, mac
    Cluster 4: usc, mozilla, washingtonpost, facebook, tfbnw

In the web domains dimension, we find that cluster 1 found the correlation between the apple and mac domains. Cluster 4 found the correlation between the facebook domain and its content delivery network domain.

January-March 2008.
Table 4-4 shows the results of the MDITCC on the period of January-March 2008.

Discussion
In the hour dimension, we find that cluster 1 represents the afternoon period. Cluster 2 represents the behavior of the morning period. Cluster 3 contains both the behavior of the afternoon and the evening periods.

In the day dimension, we observe that cluster 1 represents the behavior on the weekdays. Cluster 2 may represent the behavior on the weekends since it contains

Friday and Sunday. The day dimension co-clusters show that the usage behavior on the weekdays is different from that on the weekends.

Table 4-4. The co-clusters learned using the MDITCC from the USC campus wireless data for each dimension for the period of January-March 2008
Data Dimensions and Clusters: Cluster Elements
Hour
    Cluster 1: 11:00am-11:59am, 1:00pm-1:59pm, 4:00pm-5:59pm, 7:00pm-7:59pm, 10:00pm-10:59pm
    Cluster 2: 2:00am-10:59am, 6:00pm-6:59pm, 11:00pm-11:59pm
    Cluster 3: 12:00am-1:59am, 12:00pm-12:59pm, 2:00pm-3:59pm, 8:00pm-9:59pm
Day
    Cluster 1: Tuesday-Thursday, Saturday
    Cluster 2: Monday, Friday, Sunday
Month
    Cluster 1: February, March
    Cluster 2: January
Web Domain
    Cluster 4: facebook, tfbnw

In the month dimension, February and March were clustered together in cluster 1. Cluster 2 contains January alone.

In the web domain dimension, the correlation between facebook and its CDN domain is discovered in cluster 4.

4.3.1.2 Multi-way Distributional Clustering (MDC) via Pairwise Interaction with Multi-Dimensional Time Results
The MDC algorithm was run on the wireless data with multi-dimensional time. The dimensions of the wireless data for each month were user IDs, web domains, hours and days. For the whole data the dimension of month was added. Figure 4-9 is the graph representation of the pairwise interaction between the variables for one month of the data, while Figure 4-10 is the graph for all months of the data.

January 2008 results.

Figure 4-9. The graph representing the pair-wise interaction between the variables in a one-month wireless data

Figure 4-10. The graph representing the pair-wise interaction between the variables for all months in the wireless data

We first run the wireless data from the month of January 2008 through the MDC. We learn co-clusters for each dimension of the data. We show the results for the hour, day and web domain dimensions in Table 4-5.

Table 4-5. The co-clusters learned from the USC campus wireless data for each dimension for January 2008
Data Dimensions and Clusters: Cluster Elements
Hour
    Cluster 1: 12:00am-11:59am
    Cluster 2: 7:00pm-11:59pm
    Cluster 3: 12:00pm-6:59pm
Day
    Cluster 1: Saturday, Sunday
    Cluster 2: Thursday, Friday
Web Domain
    Cluster 7: facebook, tfbnw

Discussion
The most interesting correlation discovered was in the hour dimension of the data. The algorithm discovered that Internet usage behavior for consecutive hours

of the day naturally cluster together without forcing contiguity on the clustering. This is intuitive, as users tend to log on for consecutive hours at a time, allowing for a similar pattern to emerge for those hours online. We also see three distinct clusters of usage behavior for the periods of the day: the morning hours, the afternoon hours and the evening hours. People consider these the main parts of the day and they tend to have different activities depending on that part of the day. We see that even online activity has different patterns based on the part of the day.

The day dimension results are also interesting in that we see that the Internet usage behavior for weekends is different from the behavior for weekdays, since those days were clustered separately. This is intuitive since users' schedules for weekends differ from their schedules for weekdays, and this is reflected in their Internet usage behavior.

For the web domain dimension we find that the facebook domain and its content delivery network (tfbnw) got clustered together. This is expected since they basically have the same data and user behavior.

February 2008 results.
After running the wireless data for the month of February 2008 through the MDC, co-clusters were learned for each dimension of the data. We are interested in the time and web domain dimensions, so we show their co-clusters in Table 4-6.

Discussion
Similar to the results found for the January 2008 data, there is a natural clustering of consecutive hours of the day. Also, the clusters of hours represent the three main parts of the day, as in the January 2008 results.

In the day dimension we find that Monday and Wednesday have similar patterns. This indicates that, unlike hours of the day, consecutive days do not naturally cluster together. Given that the data was taken from a university campus with the main users being students, this is also intuitive. In universities, consecutive days tend to have different class schedules. Also, in many universities and at USC, Mondays and

Wednesdays have the same class schedules. This can explain the similar pattern in these days.

Table 4-6. The co-clusters learned from the USC campus wireless data for each dimension for February 2008
Data Dimensions and Clusters: Cluster Elements
Hour
    Cluster 1: 12:00am-12:59pm
    Cluster 2: 1:00pm-5:59pm
    Cluster 3: 6:00pm-11:59pm
Day
    Cluster 1: Monday, Wednesday
    Cluster 2: Sunday, Tuesday
Web Domain
    Cluster 1: apple, facebook, tfbnw
    Cluster 2: cnet, ebay, google, yahoo, co

In the web domain dimension we observe that the MDC algorithm found the relationship between the facebook web domain and the facebook content delivery network domain (tfbnw) in cluster 1. In cluster 2 the search engine domains google and yahoo were clustered together.

March 2008 results.
For the data from the period between Thursday, March 27th and Saturday, March 29th, some of the results are shown in Table 4-7.

Discussion
We find similar results to January and February. First, there is a natural clustering of consecutive hours. Second, the three main parts of the day are clustered together as in the results for the previous months.

The period of the data contains both weekdays and weekend days. We find that Friday and Saturday were clustered together. Even though Friday is considered a weekday, it is often thought of as the beginning of the weekend. Hence, it will have a behavior more similar to a weekend day than a weekday. Thursday, being a regular weekday, was clustered separately.


Table 4-7. The co-clusters learned from the USC campus wireless data for each dimension for March 2008

Data Dimension   Cluster     Cluster Elements
Hour             Cluster 4   12:00am-06:59am
                 Cluster 5   2:00pm-4:59pm
                 Cluster 6   7:00pm-11:59pm
Day              Cluster 1   Thursday
                 Cluster 2   Friday, Saturday
Web Domain       Cluster 1   google, youtube, co
                 Cluster 2   facebook, tfbnw

For the web domain clustering, the algorithm found a correlation between the google domain and the youtube domain. This could be related to the fact that google owns youtube. The facebook domain and its content delivery network domain were clustered together, just as in the previous months' results.

January-March 2008 results. Here we show the results for the period from January 2008 to March 2008. The results are displayed in table 4-8.

Discussion. We find similar results when running the MDC on the whole period as when we run it on each month's data separately. We observe that consecutive hours were clustered together, naturally forming the main parts of a day: morning, afternoon and evening. The clusterings in the day and month dimensions are related, and both have to do with the days on which the data were collected in each month. The January and March data were gathered from Thursday to Sunday and from Thursday to Saturday, respectively, while the February data were from Sunday to Wednesday. Hence, we see that one cluster in the day dimension represents the days from February and the other cluster represents the days from January and March. Accordingly, in the


month dimension clusters we find that January and March were clustered together, while February was in a separate cluster. This indicates that generally the behavior of weekends is different from weekdays regardless of the month. This explains the clustering in the day and month dimensions. In the web domain dimension we find that the MDC algorithm learned the correlation between the facebook and the facebook content delivery network (tfbnw) domains. Also, the algorithm learned the correlation between the apple and mac domains. In cluster 2, the algorithm learned the correlation between the search engine domains google and yahoo. The correlation between google and youtube, which google owns, was also found.

Table 4-8. The co-clusters learned from the USC campus wireless data for each dimension for the period from January 2008 to March 2008

Data Dimension   Cluster     Cluster Elements
Hour             Cluster 1   12:00am-01:59pm
                 Cluster 2   2:00pm-5:59pm
                 Cluster 3   6:00pm-11:59pm
Day              Cluster 1   Sunday, Monday, Tuesday, Wednesday
                 Cluster 2   Thursday, Friday, Saturday
Months           Cluster 1   January, March
                 Cluster 2   February
Web Domain       Cluster 1   aol, apple, cnet, ebay, facebook, mac, tfbnw, usc, mozilla
                 Cluster 2   aster, google, live, washingtonpost, youtube, yahoo, co

Internet usage behavior similarity between hours for February 2008. We study the similarity in usage behavior between the hours of the day. This is achieved by creating a dissimilarity matrix between the 24 hours of the day; the


dissimilarity between hours is computed by the cosine distance function between the mutual information loss values for each hour over all of the web domains. Then, the dissimilarity matrix is mapped to an undirected graph as follows. Nodes in the graph represent the hours in a day, and an edge is drawn between two different nodes if their dissimilarity is less than a threshold. Finally, we find cliques within the graph to discover groups of hours with similar Internet usage. Figure 4-11 shows the resulting graph with a threshold of 0.0075.

Figure 4-11. Graph representation of the dissimilarity matrix using the threshold of 0.0075 for the hours of a day for February 2008 in the USC campus

Discussion. Cliques were discovered among the hours that correspond to the afternoon period co-cluster and the evening period co-cluster, indicating similar behavior in those periods of the day. In addition, the hours in those two periods form cliques with each other, indicating that the online behavior in the afternoon-evening period is similar and is different from the behavior in the morning period. We also find that consecutive hours in a main period of a day, in general, have similar behavior, and consecutive hours of a day period tend to form cliques in the graph.

4.3.1.3 Multi-way Distributional Clustering (MDC) via Pairwise Interaction with Single-dimension Time Representation Results

In this section we show the results of the MDC algorithm on the wireless data with a one-dimension time representation. The single time dimension will be called hourdays


and each column of that dimension in the co-occurrence table contains the data for a single hour.

January 2008 results. We first run the algorithm on the data from January 2008. We show the results in table 4-9.

Table 4-9. The co-clusters learned from the USC campus wireless data for single dimension time representation for January 2008

Hourdays Cluster ID   Cluster Elements
Cluster 1             Friday 12:00am-11:59pm
Cluster 2             Thursday 12:00am-11:59pm
Cluster 3             Saturday 12:00am-11:59pm
                      Sunday 12:00am-11:59pm

Discussion. We observe some interesting findings when we use a single-dimension time variable. There was a natural clustering of consecutive hours even when using time as a single dimension. Also interesting is that the hours of whole days were clustered together, which indicates that the hours of each day are similar to each other. Also, we find that in cluster 3 the weekend days Saturday and Sunday were clustered together, which indicates similar usage behavior on weekends as opposed to the weekdays; this matches the day variable clustering for the January results when multi-dimensional time was used. We can also see a drawback of using single-dimension time: we do not learn meaningful clusters of hours, only of whole days. This makes breaking up time into hours and days better for learning meaningful correlations.

February 2008 results. After running the algorithm on the data from February 2008, we obtained the results shown in table 4-10.

Discussion. Similar to the January results, we see the natural clustering of consecutive hours. Also, the hours of whole days were clustered together, indicating similar behavior across the hours of whole days.

March 2008 results.


For the data from March 2008, the results are shown in table 4-11.

Table 4-10. The co-clusters learned from the Hourdays dimension in the USC campus wireless data for single dimension time representation for February 2008

ID          Hourdays
Cluster 1   Wednesday 12:00am-11:59pm
Cluster 2   Sunday 12:00am-11:59pm
            Monday 12:00am-11:59pm
Cluster 3   Tuesday 12:00am-11:59pm

Table 4-11. The co-clusters learned from the Hourdays dimension in the USC campus wireless data for single dimension time representation for March 2008

ID          Hourdays
Cluster 1   Thursday 12:00am-11:59pm
Cluster 2   Friday 12:00am-11:59pm
            Saturday 12:00am-11:59pm

Discussion. Similar to previous months' results, we have the natural clustering of consecutive hours and the clustering of hours of whole days. We also obtain the same clusters of days as obtained when running the MDC algorithm on the multi-dimensional time variable, which is consistent with our earlier results. However, as stated before, with time in a single dimension we do not learn clusters of fine-grained time (i.e., hours); only clusters of high-level time units were learned.

January 2008-March 2008 results. We run the algorithm on the data from January to March 2008. Table 4-12 shows the results.

Discussion. We find some interesting results running this experiment. The natural clustering of consecutive hours still persists, with the hours of whole days being in one cluster for the most part. We observe that in cluster 1 the Fridays of January and March were clustered together, indicating that the users' behavior on Friday is similar regardless of the month. Cluster 2 contains all of the hours of the consecutive days Monday and Tuesday in the month of February, indicating the similarity of usage behavior between consecutive weekdays. In cluster 4 we observe that the weekend


of January was clustered with the Saturday of March. This indicates that the usage behavior on weekends in general is similar regardless of the month. Also, from clusters 1 and 4 we can infer that weekends have a stronger similarity in behavior than weekdays in general, since they are more likely to form clusters together than weekdays.

Table 4-12. The co-clusters learned using the MDC algorithm from the USC campus wireless data for single-dimension time representation for January 2008-March 2008

ID          Hourdays
Cluster 1   January Friday 12:00am-11:59pm
            March Friday 12:00am-11:59pm
Cluster 2   February Monday 12:00am-11:59pm
            February Tuesday 12:00am-11:59pm
Cluster 4   January Thursday 12:00am-11:59pm
            January Saturday 12:00am-11:59pm
            January Sunday 12:00am-11:59pm
            March Saturday 1:00am-1:59am, 5:00am-11:59pm

4.3.1.4 Multi-Dimensional Hierarchical Co-Clustering (MDHCC) Results

Our MDHCC algorithm was run on the wireless data with a single dimension for time. The dimensions of the wireless data were the user IDs, web domains and the hours of consecutive days. There are two levels in the time hierarchy, with the bottom level being the hours and the top level being the days. For the data covering all three months, another hierarchy level is added for the months.

January 2008 results. The MDHCC algorithm was run on the January 2008 wireless data. Co-clusters were learned for each level in the time dimension. We show some of the results in tables 4-13 to 4-15.

Table 4-13. The co-clusters learned using MDHCC from the USC campus wireless data for the top level time dimension for January 2008

ID          Day (Level 2)
Cluster 1   Friday, Saturday
Cluster 2   Thursday, Sunday
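As an illustration of the two-level time hierarchy described in this subsection, the following minimal sketch lays out the bottom-level hourdays elements and groups them under the top-level day clusters of Table 4-13. It only shows a possible data layout, not the MDHCC algorithm itself. The function names `build_hourdays` and `hours_by_top_cluster` are hypothetical, and the assumption that the hours under each top-level cluster are handled together is ours.

```python
# Days covered by the January 2008 data (Thursday to Sunday).
DAYS = ["Thursday", "Friday", "Saturday", "Sunday"]


def build_hourdays(days):
    """Bottom level: one (day, hour) element per hour of each day, in chronological order."""
    return [(day, hour) for day in days for hour in range(24)]


def hours_by_top_cluster(hourdays, day_clusters):
    """Group bottom-level elements under the top-level cluster of their parent day.

    Assumption (not taken from the source): each top-level cluster's hours
    are collected into one chronologically ordered list.
    """
    day_to_cluster = {day: cid for cid, members in day_clusters.items() for day in members}
    grouped = {cid: [] for cid in day_clusters}
    for day, hour in hourdays:
        grouped[day_to_cluster[day]].append((day, hour))
    return grouped


if __name__ == "__main__":
    # Top-level (Level 2) clusters as reported in Table 4-13.
    day_clusters = {1: ["Friday", "Saturday"], 2: ["Thursday", "Sunday"]}
    grouped = hours_by_top_cluster(build_hourdays(DAYS), day_clusters)
    for cid, elems in grouped.items():
        print("top-level cluster", cid, "holds", len(elems), "hourdays elements")
```

Each top-level cluster here holds 48 bottom-level elements (two days of 24 hours), which is the pool from which hour-level clusters for that day cluster would be formed under this assumption.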


Table 4-14. The co-clusters belonging to cluster 1 learned using MDHCC from the USC campus wireless data for the bottom level time dimension

Hour (Level 1) clusters belonging to cluster 1 in the 2nd level

Cluster      Cluster elements
Cluster 2    Friday 3:00am - Friday 12:59pm
Cluster 4    Friday 3:00pm - Friday 5:59pm
Cluster 6    Friday 8:00pm - Friday 11:59pm
Cluster 7    Saturday 12:00am - Saturday 6:59am
Cluster 8    Saturday 7:00am - Saturday 10:59am
Cluster 10   Saturday 2:00pm - Saturday 5:59pm
Cluster 11   Saturday 6:00pm - Saturday 10:59pm

Table 4-15. The co-clusters belonging to cluster 2 learned using MDHCC from the USC campus wireless data for the bottom level time dimension

Hour (Level 1) clusters belonging to cluster 2 in the 2nd level

Cluster      Cluster elements
Cluster 2    Thursday 2:00am - Thursday 07:59am
Cluster 3    Thursday 8:00am - Thursday 10:59am
Cluster 5    Thursday 12:00pm - Thursday 2:59pm
Cluster 6    Thursday 3:00pm - Thursday 4:59pm
Cluster 7    Thursday 5:00pm - Thursday 6:59pm
Cluster 9    Thursday 8:00pm - Thursday 9:59pm
Cluster 10   Thursday 10:00pm - Thursday 11:59pm
Cluster 15   Sunday 7:00am - Sunday 10:59am
Cluster 16   Sunday 11:00am - Sunday 12:59am
Cluster 17   Sunday 3:00pm - Sunday 4:59pm
Cluster 19   Sunday 5:00pm - Sunday 8:59pm
Cluster 20   Sunday 9:00pm - Sunday 11:59pm

Discussion. We observe that the MDHCC learns the same co-clusters as the MDC with multi-dimensional time in the top level of the time hierarchy, which corresponds to days. We see that smaller co-clusters were learned in each part of the day (morning, afternoon and evening) in comparison to the MDC algorithm. This most likely has to do with the algorithm learning on the elements of each co-cluster of the top level of the time hierarchy separately, and with forcing contiguity of the elements in clustering.

February 2008 results. After running the MDHCC on the February 2008 wireless data, we learn co-clusters for each level in the time dimension. We show some of the results in tables 4-16 to 4-18.


Table 4-16. The co-clusters learned using MDHCC from the USC campus wireless data for each dimension for February 2008

ID          Day (Level 2)
Cluster 1   Sunday, Tuesday
Cluster 2   Monday, Wednesday

Table 4-17. The co-clusters belonging to cluster 1 learned using MDHCC from the USC campus wireless data for the bottom level time dimension

Hour (Level 1) clusters belonging to cluster 1 in the 2nd level

Cluster      Cluster elements
Cluster 2    Sunday 2:00am - Sunday 8:59am
Cluster 4    Sunday 12:00pm - Sunday 3:59pm
Cluster 5    Sunday 4:00pm - Sunday 8:59pm
Cluster 6    Sunday 9:00pm - Sunday 11:59pm, Tuesday 12:00am - Tuesday 2:59am
Cluster 9    Tuesday 9:00am - Tuesday 1:59pm
Cluster 10   Tuesday 2:00pm - Tuesday 5:59pm
Cluster 12   Tuesday 8:00pm - Tuesday 11:59pm

Table 4-18. The co-clusters belonging to cluster 2 learned using MDHCC from the USC campus wireless data for the bottom level time dimension

Hour (Level 1) clusters belonging to cluster 2 in the 2nd level

Cluster      Cluster elements
Cluster 4    Monday 9:00am - Monday 12:59pm
Cluster 5    Monday 1:00pm - Monday 5:59pm
Cluster 6    Monday 6:00pm - Monday 7:59pm
Cluster 8    Monday 9:00pm - Monday 10:59pm
Cluster 17   Wednesday 1:00pm - Wednesday 2:59pm
Cluster 20   Wednesday 6:00pm - Wednesday 7:59pm
Cluster 21   Wednesday 8:00pm - Wednesday 9:59pm
Cluster 22   Wednesday 10:00pm - Wednesday 11:59pm

Discussion. For the February data we observe the same findings as in January. The co-clusters learned in the top level of the time hierarchy (Days) are the same as the co-clusters learned using the MDC algorithm. In addition, smaller co-clusters were learned in the three main parts of the day.

January 2008-March 2008 results.


We run the MDHCC algorithm on the wireless data from January to March 2008 using 2 hierarchy levels, level 1 representing the hours and level 2 representing the month-day. Tables 4-19 to 4-22 show some of the results.

Table 4-19. The co-clusters learned using MDHCC from the USC campus wireless data for the top level for the period from January to March 2008

ID          Month-Day (Level 2)
Cluster 1   January Friday, January Saturday
Cluster 2   March Thursday, March Friday
Cluster 3   January Sunday, March Saturday
Cluster 5   February Tuesday, February Wednesday

Table 4-20. The co-clusters belonging to cluster 1 learned using MDHCC from the USC campus wireless data for the bottom level time dimension

Hour (Level 1) clusters belonging to cluster 1 in the 2nd level

Cluster      Cluster elements
Cluster 2    January Friday 2:00am-5:59am
Cluster 3    January Friday 6:00am-9:59am
Cluster 4    January Friday 10:00am-1:59pm
Cluster 6    January Friday 3:00pm-4:59pm
Cluster 7    January Friday 5:00pm-9:59pm
Cluster 8    January Friday 10:00pm-11:59pm
Cluster 11   January Saturday 4:00am-6:59am
Cluster 12   January Saturday 7:00am-10:59am
Cluster 13   January Saturday 11:00am-12:59pm
Cluster 14   January Saturday 1:00pm-5:59pm
Cluster 15   January Saturday 6:00pm-10:59pm

Discussion. We observe that the MDHCC algorithm, in the top level of the time variable hierarchy (Day-Month level), found the correlation between consecutive days (i.e., Friday and Saturday from January in cluster 1, Thursday and Friday from March in cluster 2, and Tuesday and Wednesday from February in cluster 5). This indicates that consecutive days of the same month have similar patterns of usage behavior. Another interesting finding is cluster 3, which contains the Sunday from January and the Saturday from March. Even though they are from different months, they were clustered together. This may indicate that, in general, weekend days' online usage patterns are very similar and probably more similar than weekdays'. This finding is supported by


the results from running the MDC algorithm with single-dimension time, where weekend days from different months were clustered together and some weekdays from the same month were also clustered together. In the bottom level of the time variable hierarchy (Hour level), we observe that hours from the three main parts of the day (i.e., morning, afternoon and evening) are grouped in smaller clusters than the clusters found when running the MDC algorithm with multi-dimensional time. This is probably due to the difference in how the two methods handle time in the data. The smaller clusters in the three main parts of the day give a clearer picture of the online usage patterns during the main parts of the day.

Table 4-21. The co-clusters belonging to cluster 2 learned using MDHCC from the USC campus wireless data for the bottom level time dimension

Hour (Level 1) clusters belonging to cluster 2 in the 2nd level

Cluster      Cluster elements
Cluster 3    March Thursday 7:00am-10:59am
Cluster 6    March Thursday 3:00pm-4:59pm
Cluster 7    March Thursday 5:00pm-7:59pm
Cluster 10   March Thursday 10:00pm-12:59am
Cluster 15   March Friday 8:00am-9:59am
Cluster 16   March Friday 10:00am-11:59am
Cluster 18   March Friday 1:00pm-2:59pm
Cluster 20   March Friday 4:00pm-9:59pm
Cluster 21   March Friday 10:00pm-11:59pm

Table 4-22. The co-clusters belonging to cluster 5 learned using MDHCC from the USC campus wireless data for the bottom level time dimension

Hour (Level 1) clusters belonging to cluster 5 in the 2nd level

Cluster      Cluster elements
Cluster 1    February Tuesday 12:00am-9:59am
Cluster 2    February Tuesday 10:00am-1:59pm
Cluster 3    February Tuesday 2:00pm-5:59pm
Cluster 4    February Tuesday 6:00pm-9:59pm
Cluster 5    February Tuesday 10:00pm-11:59pm
Cluster 7    February Wednesday 3:00am-9:59am
Cluster 8    February Wednesday 10:00am-12:59pm
Cluster 9    February Wednesday 1:00pm-5:59pm
Cluster 10   February Wednesday 6:00pm-9:59pm
Cluster 11   February Wednesday 10:00pm-10:59pm
Cluster 12   February Wednesday 11:00pm-11:59pm


We can now observe the advantage of using MDHCC over the MDC with single-dimension time. Having a hierarchy helped in learning clusters at the different granularities of time (i.e., months, days and hours). The MDC only managed to cluster whole days together and did not learn similar patterns in the hours of the days, unlike MDHCC, which managed to learn both. The MDHCC algorithm is therefore generally better suited to data with a single dimension of time.

4.3.2 Spatio-Temporal Correlations in the Wireless Data

In this subsection of the results we add the space dimension to our analysis of the wireless data to study the combined effect of both the location and time of the user on their online behavior. The space dimension is represented by the location variable, which contains the name of the building the user was in when accessing the wireless network. For our analysis we studied the behavior in 15 buildings of various categories in the USC campus. Table 4-23 shows the buildings where the wireless data were taken from; the color of the building code table cells highlights the category of each building.

Table 4-23. The wireless data buildings used in the USC campus

Building Id   Building Name                                      Building code
1             Alpha Chi Omega Sorority                           Sor1
2             Kappa Alpha Theta Sorority                         Sor2
3             Alpha Tau Omega Fraternity                         Frat1
4             Beta Omega Phi Fraternity                          Frat2
5             Sigma Phi Epsilon Fraternity                       Frat3
6             Zeta Beta Tau Fraternity                           Frat4
7             Alpha Kappa Psi Fraternity/Business                Frat5
8             Fluor Tower Housing                                Hous1
9             Annenberg House Apartment                          Hous2
10            Web Tower Housing                                  Hous3
11            Annenberg School for Communication & Journalism    Jour
12            George Lucas Building School of Cinematic Arts     Lucas
13            Wilson Dental Library                              DentalLibrary
14            Norris Medical Library                             MedicalLibrary
15            University Computing Center                        UCC


4.3.2.1 Multi-Dimensional Information Theoretic Co-Clustering (MDITCC) Results

In this section we show the results of running the MDITCC method on the wireless data with the location dimension. Table 4-24 shows some of the results of running the MDITCC on the period of January-March 2008.

Table 4-24. The co-clusters learned using the MDITCC algorithm from the USC campus wireless data

Data Dimension   Cluster     Cluster Elements
Hour             Cluster 1   2:00am-2:59am, 7:00am-7:59am, 12:00pm-1:59pm, 3:00pm-3:59pm, 6:00pm-6:59pm, 8:00pm-8:59pm, 11:00pm-11:59pm
                 Cluster 2   12:00am-12:59am, 4:00am-4:59am, 6:00am-6:59am, 8:00am-11:59am, 2:00pm-2:59pm
                 Cluster 3   1:00am-1:59am, 3:00am-3:59am, 5:00am-5:59am, 4:00pm-5:59pm, 7:00pm-7:59pm, 9:00pm-10:59pm
Day              Cluster 1   Monday, Tuesday, Thursday, Friday
                 Cluster 2   Wednesday, Saturday, Sunday
Month            Cluster 1   January, February
                 Cluster 2   March
Location         Cluster 1   Frat1, DentalLibrary, UCC
                 Cluster 2   Sor1, Frat3, lucas
                 Cluster 3   hous3, Frat2, Sor2
                 Cluster 4   jour, hous1, Frat4
                 Cluster 5   Frat5, MedicalLibrary, hous2

Discussion. In the hour dimension, clusters 1 and 3 represent mostly the afternoon-evening period. Cluster 2 represents the behavior in the morning period. In the day dimension, we observe that cluster 1 contains only weekdays. Cluster 2, on the other hand, mostly represents the behavior on the weekends, although it also contains Wednesday. In the month dimension, January and February were grouped together in cluster 1. Cluster 2 contains March alone.


In the location dimension, cluster 1 grouped together the dental library and the computer center. This may indicate a similar usage behavior in these buildings, which is intuitive given that these buildings are used by students to study or do research. Clusters 2, 4 and 5 grouped together residence buildings and some school buildings. This could be related to the fact that students study both in their residences and in the school buildings. Cluster 3 contains only residence buildings, which represents online behavior that is only conducted at home.

4.3.2.2 Multi-way Distributional Clustering (MDC) via Pairwise Interaction with Multi-Dimensional Time Representation Results

To add the space dimension to the wireless data, the pairwise interaction graph from the temporal correlations section is used with the addition of a vertex for the location variable. The vertex of the location variable is connected to all of the vertices of the other variables. Figure 4-12 shows the resulting graph representation of the pairwise interaction between the variables.

Figure 4-12. The graph representing the pairwise interaction between the variables of wireless data for the analysis of the spatio-temporal correlations

We run the MDC on the wireless data with the spatio-temporal attributes spanning the period from January 2008 until March 2008 and obtain the results shown in table 4-25.

Discussion. Adding the location variable contributed to learning better temporal correlations. We find not only that three parts of the day were clustered with a natural grouping of the consecutive hours of the day, but this time the hour clustering represents


intuitively the morning hours in cluster 1. The work/class hours in cluster 2 represent the period from 9:00am until 5:59pm, and the evening hours in cluster 3 run from 6pm until midnight. The most likely reason that the hours' clustering improved when the location variable was added to the co-clustering is that the buildings the data were gathered from include both school and housing buildings, which provided information on when users are in school and when they are in their residence. Intuitively, users are in school during the class hours from 9 to 5 and are at home in the other hours of the day, which include both the morning hours from midnight until 8am and the evening hours from 6pm until midnight.

Table 4-25. The co-clusters learned using the MDC algorithm from the USC campus wireless data with a multi-dimensional time representation

Data Dimension   Cluster     Cluster Elements
Hour             Cluster 1   12:00am-08:59am
                 Cluster 2   9:00am-5:59pm
                 Cluster 3   6:00pm-11:59pm
Day              Cluster 1   Monday, Tuesday, Wednesday
                 Cluster 2   Thursday, Friday, Saturday, Sunday
Months           Cluster 1   January, March
                 Cluster 2   February
Location         Cluster 1   frat5, frat2, frat3, UCC
                 Cluster 2   sor1, hous2, DentalLibrary, MedicalLibrary, frat4
                 Cluster 3   frat1, hous1, Sor2
                 Cluster 4   jour, lucas, hous3
Web Domain       Cluster 2   apple, cnet, ebay, google, live, mac, yahoo, co
                 Cluster 3   washingtonpost, usc, youtube, mozilla
                 Cluster 4   aster, facebook, tfbnw


We have discussed in the MDC results subsection of the temporal correlations results section the co-clustering found in the day and month variables, which is very similar to what we observe here and follows the same reasoning, so it will not be discussed again. In the location variable, we observe that cluster 1 grouped together fraternity buildings, indicating a similar usage behavior in these locations, which are of the same category. Cluster 2 grouped together the only two library buildings in our data with various housing buildings, including a sorority and a fraternity house. Cluster 3 also grouped together various housing buildings. Cluster 4 grouped together the journalism school building and the cinematic arts school building. From the clusters learned we can infer that buildings of the same category exhibit similar online usage behavior. In the web domain variable, we observe that cluster 2 found the correlation between the apple and mac web domains. Also, cluster 2 found the correlation between the search engine web domains google and yahoo. Cluster 3 contains the web domains washingtonpost and youtube, which are both web domains used for media purposes. This co-cluster may have been influenced by cluster 4 from the location variable, since it contains the journalism school building. Finally, cluster 4 found the correlation between facebook and its content delivery network domain (tfbnw).

4.3.2.3 Multi-way Distributional Clustering via Pairwise Interaction with Single-Dimension Time Representation Results

We also add the location variable to our single-dimension time representation wireless data. We show the results of running the MDC on this data in table 4-26.

Discussion. We find from the results that even with the single-dimension time representation data, adding the location variable contributed to obtaining more intuitive clusters of the time dimension. We observe that in the hourdays variable the algorithm managed to learn the work hours of weekdays (e.g., cluster 3, cluster 12 and cluster 16). Also, the algorithm managed to cluster all of the hours of Saturday in cluster 15. This


may indicate that, unlike weekdays, when there is a distinctive online usage behavior in the work hours from 9 to 5, on weekends, when there are no classes or work on campus, we do not observe a distinctive behavior in these hours and the usage behavior is similar for all hours of the weekend day. As previously mentioned, the reason behind the more intuitive clusters when the location variable was added is that it provided a distinction between the user's behavior in residence buildings and in school buildings. This helped in capturing the distinctive online behavior of users during the work/class hours in contrast to the non-work/non-class hours of the day. For the location variable the algorithm managed to capture the similar behavior of users in libraries in cluster 1. Cluster 2 represents the clustering of the school buildings. The various housing buildings are represented in clusters 3 and 4.

Table 4-26. The co-clusters learned using the MDC algorithm from the USC campus wireless data with a single-dimension time representation

Data Dimension   Cluster      Cluster Elements
Hourdays         Cluster 3    March Friday 9:00am-5:59pm
                 Cluster 12   February Tuesday 9:00am-5:59pm
                 Cluster 15   January Saturday 12:00am-11:59pm
                 Cluster 16   February Wednesday 10:00am-5:59pm
Location         Cluster 1    DentalLibrary, MedicalLibrary, frat5, hous2
                 Cluster 2    jour, lucas
                 Cluster 3    sor1, frat2, hous1, frat4
                 Cluster 4    frat1, Sor2, frat3, UCC, hous3

4.3.2.4 Multi-Dimensional Hierarchical Co-Clustering (MDHCC) Results

Tables 4-27 to 4-31 show some of the interesting results of running the MDHCC algorithm on the wireless data with both the spatial and temporal attributes.

Discussion. We observe that in the top level of the time hierarchy, the day-month clusters discovered using the MDHCC on the wireless data with both the spatial and temporal attributes are similar to the clusters discovered with only the temporal attributes of the wireless data. Consecutive weekdays or consecutive weekend days were clustered together.


Table 4-27. The top level co-clusters learned using MDHCC from the USC campus wireless data

Month-Day (Level 2) clusters   Cluster Elements
Cluster 1                      January Saturday, January Sunday
Cluster 2                      January Thursday, January Friday
Cluster 4                      March Thursday, March Friday
Cluster 5                      February Tuesday, February Wednesday

Table 4-28. The bottom level co-clusters belonging to cluster 1 learned using MDHCC from the USC campus wireless data

Hour (Level 1) clusters belonging to cluster 1 in the 2nd level

Cluster      Cluster elements
Cluster 2    January Saturday 3:00am-7:59am
Cluster 3    January Saturday 8:00am-9:59am
Cluster 4    January Saturday 10:00am-1:59pm
Cluster 5    January Saturday 2:00pm-5:59pm
Cluster 6    January Saturday 6:00pm-8:59pm
Cluster 7    January Saturday 9:00pm-9:59pm
Cluster 8    January Saturday 10:00am-11:59am
Cluster 11   January Sunday 3:00am-8:59am
Cluster 12   January Sunday 9:00am-11:59am
Cluster 14   January Sunday 2:00pm-5:59pm
Cluster 15   January Sunday 6:00pm-10:59pm

Table 4-29. The bottom level co-clusters belonging to cluster 4 learned using MDHCC from the USC campus wireless data

Hour (Level 1) clusters belonging to cluster 4 in the 2nd level

Cluster      Cluster elements
Cluster 2    March Thursday 4:00am-8:59am
Cluster 3    March Thursday 9:00am-4:59pm
Cluster 4    March Thursday 5:00pm-9:59pm
Cluster 5    March Thursday 10:00pm-11:59pm, March Friday 12:00am-12:59am
Cluster 6    March Friday 1:00am-8:59am
Cluster 8    March Friday 11:00am-5:59pm
Cluster 10   March Friday 7:00pm-11:59pm


However, in the bottom level of the time hierarchy (Hour level) the results differ somewhat. The algorithm now discovers the work hours grouping on the Thursday in March (the 3rd cluster of the 4th cluster in the second level) and on the Friday in March (the 8th cluster of the 4th cluster in the second level). This is due to having the spatial attribute information, as we explained in the discussion of the MDC with single-dimension time results. Also, we find some findings similar to the results from running the MDHCC on the wireless data with only the temporal attributes, namely the discovery of smaller clusters of hours in the main parts of the day. Even with the wireless data considering both the spatial and temporal attributes, we see the advantage of the MDHCC algorithm over the MDC algorithm on the wireless data with a single-dimension time representation. MDHCC allows both for clustering of the Month-Day level of time, discovering days with similar patterns of usage behavior, and for clustering of the Hour level, discovering hours with similar patterns of usage behavior.

Table 4-30. The bottom level co-clusters belonging to cluster 5 learned using MDHCC from the USC campus wireless data

Hour (Level 1) clusters belonging to cluster 5 in the 2nd level

Cluster      Cluster elements
Cluster 5    February Tuesday 9:00am-10:59am
Cluster 6    February Tuesday 11:00am-2:59pm
Cluster 7    February Tuesday 3:00pm-5:59pm
Cluster 9    February Tuesday 7:00pm-11:59pm
Cluster 11   February Wednesday 5:00am-11:59am
Cluster 12   February Wednesday 12:00pm-1:59pm
Cluster 13   February Wednesday 2:00pm-4:59pm
Cluster 15   February Wednesday 6:00pm-8:59pm
Cluster 16   February Wednesday 9:00pm-11:59pm

Table 4-31. The co-clusters learned in the location dimension using MDHCC from the USC campus wireless data

ID          Location
Cluster 3   frat1, frat3
Cluster 5   jour, lucas
Cluster 7   sor1, frat2


In the location variable we observe that the algorithm found the correlation between fraternity buildings in cluster 3. Also, cluster 7 found the correlation between different types of housing buildings, namely Sorority 1 and Fraternity 2. Finally, cluster 5 represents the grouping of the school buildings of Journalism and Cinematic Arts.

4.3.3 Run Time Evaluation

The effect of the number of dimensions and the size of dimensions on the run time of the MDHCC and MDC methods was studied by observing the increase in run time. The methods were run on the wireless data with two to four dimensions. Also, the location dimension was increased to two and four times its original size. Table 4-32 shows the different run times for the MDHCC method.

Table 4-32. The MDHCC run times on different numbers and sizes of data dimensions

Number of dimensions   15 Locations   30 Locations   60 Locations
2                      44 seconds     44 seconds     59 seconds
3                      216 seconds    217 seconds    241 seconds
4                      251 seconds    324 seconds    448 seconds

Discussion. We observe that the run time increase as the location dimension size increases is linear. This is due to the fact that the increase in dimension size does not affect the number of co-occurrence tables processed in the method. The location size increase only affects the co-occurrence tables that location is a member of. The increase in the run time of the MDHCC is near quadratic in the number of dimensions of the data. This occurs because, as the number of dimensions increases, the number of co-occurrence tables to be processed increases at a near-quadratic rate. Table 4-33 shows the different run times for the MDC method after running the same experiments.

Table 4-33. The MDC run times on different numbers and sizes of data dimensions

Number of dimensions   15 Locations   30 Locations   60 Locations
2                      44 seconds     43 seconds     44 seconds
3                      176 seconds    166 seconds    179 seconds
4                      264 seconds    299 seconds    373 seconds
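As a rough aid for reading the two run time tables, the following sketch computes, for each number of dimensions, how much the run time grows when the location dimension is doubled and quadrupled. It also counts the pairwise co-occurrence tables under the assumption that every pair of dimensions interacts, which gives m(m-1)/2 tables for m dimensions. That fully connected graph is an assumption made for illustration only; the pairwise interaction graph actually used may contain fewer edges. The function names are hypothetical.

```python
from math import comb

# Run times in seconds, copied from Table 4-32 (MDHCC) and Table 4-33 (MDC).
# Keys are the number of dimensions; values are for 15, 30 and 60 locations.
MDHCC = {2: (44, 44, 59), 3: (216, 217, 241), 4: (251, 324, 448)}
MDC = {2: (44, 43, 44), 3: (176, 166, 179), 4: (264, 299, 373)}


def pairwise_tables(num_dims):
    """Number of co-occurrence tables if every pair of dimensions interacts.

    Assumption: a fully connected pairwise interaction graph.
    """
    return comb(num_dims, 2)


def growth_vs_smallest(runtimes):
    """Run time at 30 and 60 locations relative to 15 locations, per dimension count."""
    return {
        dims: tuple(round(t / times[0], 2) for t in times[1:])
        for dims, times in runtimes.items()
    }


if __name__ == "__main__":
    for dims in (2, 3, 4):
        print(dims, "dimensions ->", pairwise_tables(dims), "pairwise tables")
    print("MDHCC growth (30, 60 locations):", growth_vs_smallest(MDHCC))
    print("MDC growth (30, 60 locations):", growth_vs_smallest(MDC))
```

Under the fully connected assumption the table count goes 1, 3, 6 for two, three and four dimensions, which is the quadratic growth in the number of dimensions that the discussion refers to.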


Discussion. The run time increase for the MDC method as the location dimension size increases was also seen to be linear. This is because the number of co-occurrence tables processed stays the same. The run time of the MDC method increases near quadratically as the number of dimensions of the data increases. The same reason as for the MDHCC method applies, since both methods use a pairwise interaction graph. The run time of the MDC method is in general lower than that of the MDHCC, due to the overhead of using the hierarchy in the MDHCC method.

4.3.4 Meta Analysis

The similarity in usage behavior across locations of the mobile network society is analyzed based on the hours of the day. To achieve this, a dissimilarity matrix between the 15 locations of the mobile society is created by computing the cosine distance between the values of the co-occurrence table. The dissimilarity matrix is mapped to an undirected graph. Nodes in the graph represent the locations, and an edge is drawn between two nodes if their dissimilarity is less than a threshold of 0.2. Cliques are discovered in the graph to find groups of locations with a similar access time behavior. Figure 4-13 shows the resulting graph.

Figure 4-13. Graph representation of the dissimilarity matrix of the access time behavior using the threshold of 0.2 for the locations in the USC campus

Discussion. As seen in figure 4-13, cliques are formed between the library locations, which indicates similar behavior in these two locations. Many cliques are formed


between the different residence locations (fraternities, sororities and residential locations). This indicates the similarity in access time behavior in these locations.

4.3.5 Overall Discussion

We find that the MDITCC has its limitations and does not perform well on the multi-dimensional wireless data. Co-clustering methods that use pairwise interaction (i.e., MDC and MDHCC) yield better results and find the expected clustering in the time dimension. This supports the claim that extending the multi-information objective function to more than two dimensions does not provide reliable estimates for the full joint distribution $p(\tilde{X}_1, \ldots, \tilde{X}_m)$. The pairwise interaction co-clustering methods found that online usage behavior in wireless networks is similar in consecutive hours, especially when these hours belong to a main part of a day such as morning, afternoon or evening. The online usage behavior on weekdays is different from that on weekends.

We have seen the MDC method's capability of learning both temporal and spatio-temporal correlations on wireless data with a multi-dimensional time representation. The MDC's pairwise interaction between the multiple dimensions of the wireless data obtains meaningful and intuitive co-clusters for each dimension. However, it is hard to obtain more specialized temporal correlations in the data due to the separation of the temporal attributes between the dimensions. This cannot be solved by using data with a single-dimension time representation on the MDC, which fails to learn any correlations in the lower granularity level of the time dimension and only learns correlations on the higher granularity levels of the time dimension. Using the MDHCC can solve that, obtaining more specialized temporal correlations at one granularity level of time (i.e., hour) using correlations learned from the higher granularity level (i.e., day).

In addition, the MDHCC algorithm succeeds in learning intuitive co-clusters for each dimension of the wireless data. It learns meaningful co-clusters for each level of time for both a multi-dimensional and a single-dimension representation of time. However, it faces the risk of overspecializing


the correlations in the bottom level of the time dimension, which results in singleton clusters learned at that level.

The temporal correlation learning improved in the co-clustering methods that use pairwise interaction when the spatial information was added to the data. This shows the combined influence on usage behavior of both the location of the user and the time of online access. The online usage behavior was found to correlate with the users' daily schedule. The users' behavior at work/class during work/class hours is different from the behavior in the other hours of the day. At the higher scope of the week, the online behavior during weekdays is different from weekends. Spatial correlations were also discovered, showing that online behavior in buildings of the same type is similar.

4.4 Related Work

People are more connected to the Internet than ever before. This is attributed to the spread of WiFi-connected devices that people always carry, such as smartphones, laptops and tablets. However, this increased usage of WiFi networks leads to larger strains on these networks. Therefore, using a data-driven paradigm to model WiFi networks based on usage data has become the focus of many papers [33 36]. This has made understanding users' online behavior an important matter and the topic of many works. Afanasyev et al. [46] studied the usage patterns in a city-wide WiFi network. Distinct classes of usage based on activity, mobility and traffic were found to be dependent on the client device type, whether it was a smartphone, laptop or a desktop computer. Another work [40] studied the online behavior of users in a rural village in Zambia. Behavior differences were discovered between rural and urban areas based on Internet traffic type (HTTP vs. Peer-to-Peer).

Studying spatio-temporal effects on WiFi networks has recently gained some interest in research. A paper [47] modeled the traffic demand of a campus WLAN taking into account the spatial-temporal dimensions. The spatial scales used were infrastructure-wide, AP-level or client-level, with the time granularities being

packet-level, flow-level or aggregate. This is different from our spatio-temporal focus, where we concentrate on buildings in a campus for the spatial dimension and hours, days and months for the temporal dimension. In the Afanasyev et al. paper [46], location-based correlation was found between the area type of the city and the device type. Smartphones were mostly used in transportation areas, laptops were mostly used in commercial areas with wireless hotspots and desktop computers were mostly used in residential areas. Also, temporal-based correlation was found between the client device type and the time-of-day and the day-of-week. Each client device type exhibited a unique behavior depending on the hour of the day and the day of the week, which is what we show in our work. However, their study was based on the number of active users, the mobility of the users and the online traffic generated. Unlike the great majority of research in this area, our work focuses on the web-domain visitation behavior of the users.

Data clustering is a widely used data mining technique [48, 49]. It deals with the unsupervised grouping of observations into similar clusters. Clustering is used to study many kinds of problems such as data mining, image segmentation, text and pattern recognition and bioinformatics. There are many types of clustering algorithms: the most popular ones are k-means clustering [50], expectation-maximization (EM) [51], DBSCAN [52], hierarchical clustering [53] and co-clustering [54].

Co-clustering (sometimes called biclustering or block clustering) is a clustering technique where rows and columns of a matrix are simultaneously clustered, providing both row clusters and column clusters. Co-clustering uses a two-way clustering across two dimensions, in contrast to regular clustering, which uses one-way clustering across only one dimension. This allows for new kinds of patterns to be detected and was also found to lead to better clustering of the data. There are many strategies for co-clustering, such as the spectral techniques and information theoretic approaches. These strategies differ in the way they measure
similarity and by the way they treat the co-occurrence table. The spectral techniques treat the co-occurrence table as an adjacency matrix underlying a bipartite graph; the goal is to minimize a cut function that measures the degree of association between the node sets. On the other hand, the information theoretic approach [5] normalizes the co-occurrence table, converting it into a joint probability table, with the goal of reducing the loss of mutual information between the original table and the clustered version. This approach monotonically increases the mutual information preserved by intertwining both the row and column clusters at all stages.

There have been few attempts at creating a hierarchical co-clustering method. One method [55] applied hierarchy on both the objects and features in the data. The method was based on the co-clustering method [56] and has the advantage of not pre-specifying the number of clusters before the run. Another method [57] aims at generating a dendrogram for different types of data simultaneously by using their relationship information. The method is based on agglomerative hierarchical clustering while applying the union of different types of data. However, these attempts only tackled the problem of hierarchical two-way clustering. To the best of our knowledge, our work is the first attempt at a multi-way hierarchical clustering method.

4.5 Discussion

In this chapter we have analyzed both the temporal correlations and the spatio-temporal correlations in users' online access data of large-scale wireless networks. Multiple multi-dimensional co-clustering methods were used for the data analysis, including the Multi-Dimensional Information Theoretic Clustering (MDITCC), the Multi-way Distributional Clustering (MDC) and our novel Multi-Dimensional Hierarchical Co-clustering method (MDHCC). In addition, different data representations of the temporal attributes (multi-dimensional time vs. single-dimension time) in the data were used to handle the granular nature of the temporal attributes and find the best representation for the analysis.
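
As a minimal illustration of the information theoretic objective of [5] summarized in Section 4.4, the following Python sketch normalizes a co-occurrence table into a joint probability table and measures the mutual information lost under a given row and column clustering. The function names, the toy count table and the cluster labels are assumptions made for this illustration; they are not taken from MDITCC or from the implementation of [5].

```python
import numpy as np

def mutual_information(p):
    """Mutual information (in nats) of a 2-D joint probability table."""
    px = p.sum(axis=1, keepdims=True)   # row marginal, shape (rows, 1)
    py = p.sum(axis=0, keepdims=True)   # column marginal, shape (1, cols)
    mask = p > 0                        # 0 * log 0 is treated as 0
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))

def cocluster_table(p, row_labels, col_labels):
    """Aggregate a joint table into a (row cluster x column cluster) table."""
    q = np.zeros((max(row_labels) + 1, max(col_labels) + 1))
    for i, r in enumerate(row_labels):
        for j, c in enumerate(col_labels):
            q[r, c] += p[i, j]
    return q

def information_loss(counts, row_labels, col_labels):
    """I(X;Y) of the original table minus I of the co-clustered table."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                     # co-occurrence counts -> joint probabilities
    q = cocluster_table(p, row_labels, col_labels)
    return mutual_information(p) - mutual_information(q)

# Hypothetical co-occurrence table with two clean blocks.
counts = np.array([[8, 8, 0, 0],
                   [8, 8, 0, 0],
                   [0, 0, 8, 8],
                   [0, 0, 8, 8]])

aligned_loss = information_loss(counts, [0, 0, 1, 1], [0, 0, 1, 1])
interleaved_loss = information_loss(counts, [0, 1, 0, 1], [0, 1, 0, 1])
print(aligned_loss, interleaved_loss)
```

On this toy table the block-aligned labeling preserves all of the mutual information, while the interleaved labeling loses all of it.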

We found that in multi-dimensional co-clustering, pairwise interaction methods are superior to methods that use a regular multi-dimensional joint distribution in the objective function. This supports the claim that estimates extracted from a multi-dimensional joint distribution are not reliable.

We have discovered interesting temporal correlations in the wireless data when using pairwise interaction methods. This increased our understanding of the online behavior of wireless network users over time. The online behavior of users was dependent on both the time of day and the day of the week. In addition, after adding the spatial attributes of the wireless data and studying the spatio-temporal correlations, we have found that there is a correlation between the location of the user and the time of online access. By adding the location of the user to the analysis we have managed to get better temporal correlations for the users' online behavior. Both the location of the user and his time of access combined influence his online behavior. There is a difference in the online behavior between work/class hours and home hours, and between school buildings and residence buildings.

Our novel MDHCC method proved to yield better results than the MDC method when the temporal attributes were represented in a single dimension. Moreover, it provided more precise information about the online behavior of the users over time. These findings can be used to design better wireless networks under the data-driven modeling and design paradigm.

4.6 Our Contributions

To summarize, the contributions of this chapter are:

We investigate correlations between the wireless Internet user behavior and the time of web access. We observe the influence of the hour of the day, day of the week or the month of the user's access on his Internet behavior patterns. We analyze the large USC campus wireless data by handling the time dimension in different ways and finding the best way to represent time in wireless data.

We investigate the spatio-temporal correlations in the wireless data. We find the influence of both the location and time of the user's web access on his Internet behavior.

We present a novel hierarchical multi-dimensional co-clustering algorithm for learning temporal and spatio-temporal correlations in the wireless data and handling the granularity of time. The algorithm is based on the multi-way distributional clustering algorithm [44].
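
The pairwise interaction idea behind the multi-way distributional clustering algorithm [44] can be sketched as follows: instead of estimating one joint distribution over all clustered dimensions, the objective sums the mutual information of each pair of clustered dimensions. This is a simplified sketch, assuming a small dense count tensor, an interaction graph that connects every pair of dimensions with unit weight, and fixed cluster labels. The function names and the toy data are hypothetical and do not reproduce the MDC or MDHCC implementations.

```python
import itertools
import numpy as np

def mutual_information(p):
    """Mutual information (in nats) of a 2-D joint probability table."""
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0                        # 0 * log 0 is treated as 0
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))

def aggregate(p, labels, axis):
    """Sum the slices of p along `axis` that share a cluster label."""
    labels = np.asarray(labels)
    parts = [
        np.take(p, np.flatnonzero(labels == c), axis=axis).sum(axis=axis, keepdims=True)
        for c in range(labels.max() + 1)
    ]
    return np.concatenate(parts, axis=axis)

def pairwise_objective(counts, labels_per_axis):
    """Sum of I(cluster_i; cluster_j) over all pairs of clustered dimensions."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                     # counts -> joint probabilities
    for axis, labels in enumerate(labels_per_axis):
        p = aggregate(p, labels, axis)
    total = 0.0
    for a, b in itertools.combinations(range(p.ndim), 2):
        # Marginalize out every dimension except the pair (a, b).
        others = tuple(ax for ax in range(p.ndim) if ax not in (a, b))
        total += mutual_information(p.sum(axis=others))
    return total

# Hypothetical user x domain x hour counts with two consistent groups.
counts = np.zeros((4, 4, 4))
counts[:2, :2, :2] = 1
counts[2:, 2:, 2:] = 1

aligned = pairwise_objective(counts, [[0, 0, 1, 1]] * 3)
shuffled = pairwise_objective(counts, [[0, 1, 0, 1]] * 3)
print(aligned, shuffled)
```

Only two-dimensional marginals are ever estimated here, which is the property the discussion above credits for the more reliable results of the pairwise methods.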

CHAPTER 5
CONCLUSIONS

In this dissertation we have improved on an existing data mining algorithm to make it suitable for handling large-scale wireless data. The POWER model, a probabilistic multi-class mixture model, was found to learn hidden and overlapping correlations in the wireless data. This helps in designing better wireless mobile networks using the data-driven paradigm. We have improved on the time complexity of the POWER model, allowing the time complexity of the learning to be linear with respect to the number of dimensions of the data. This made our version of the POWER model efficient in handling the large-scale wireless data and any high-dimensional data. This also improved the speed of the POWER model by orders of magnitude.

We have also introduced a novel model called the Global Local model that learns the behavioral patterns of the users in a large mobile society. The model learns the generic classes of users based on their interests from data gathered from the whole mobile society. Each user in the mobile society will be a member of one or more of these classes. Then, the model finds correlations between these classes and locations inside the mobile society. The probability of these classes appearing in different locations inside the society is learned. The type of user class was found to correlate strongly with the type of location it most likely appears in.

We have studied both the temporal and the spatio-temporal correlations in large-scale wireless data using various co-clustering methods, both existing and novel. We have found that general multi-dimensional information theoretic co-clustering methods fail to find meaningful correlations in multi-dimensional data. Co-clustering methods that use pairwise interaction between the dimensions of data find intuitive and meaningful correlations in multi-dimensional data. The online usage behavior was found to correlate with users' daily and weekly schedules. Learning temporal correlations from data that also contain spatial information was found to improve on the temporal
correlations. The online usage behavior is influenced by both the location of the user and the time of his access combined.

We have introduced a novel co-clustering method called the Multi-Dimensional Hierarchical Co-Clustering (MDHCC) method that learns co-clustering on multi-dimensional data that contain hierarchical information. The method learns co-clusters in each dimension of the data using pairwise interaction between the dimensions in computing the objective function. The co-cluster learning in the dimension that contains the hierarchical information starts at the top level of the hierarchy by learning the general co-clusters in that level. Then, each co-cluster learned in a level is specialized in the lower level by learning inner co-clusters in that co-cluster. The method improves on the co-cluster learning for multi-dimensional data and obtains more specialized correlations in the data.
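
The top-down specialization described above can be sketched in Python. This is only a toy, assuming a dense counts[day, hour, domain] array and substituting a simple greedy agglomerative merge (which keeps as much mutual information with the web-domain dimension as possible) for the actual pairwise interaction co-clustering step of MDHCC. All names and data are hypothetical.

```python
import numpy as np

def mutual_information(table):
    """Mutual information (in nats) between rows and columns of a count table."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0                        # 0 * log 0 is treated as 0
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))

def greedy_merge(table, k):
    """Stand-in clustering: merge rows until k clusters remain, each time
    choosing the merge that preserves the most mutual information."""
    clusters = [[i] for i in range(table.shape[0])]
    rows = [table[i].astype(float) for i in range(table.shape[0])]
    while len(clusters) > k:
        best = None
        for a in range(len(rows)):
            for b in range(a + 1, len(rows)):
                kept = [r for i, r in enumerate(rows) if i not in (a, b)]
                score = mutual_information(np.array(kept + [rows[a] + rows[b]]))
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        rows[a] = rows[a] + rows[b]
        del clusters[b], rows[b]
    return [sorted(c) for c in clusters]

def hierarchical_time_clusters(counts, k_day, k_hour):
    """Cluster days at the top level, then specialize each day cluster by
    clustering the hours using only the data of that day cluster."""
    day_table = counts.sum(axis=1)                  # day x domain
    hierarchy = []
    for days in greedy_merge(day_table, k_day):
        hour_table = counts[days].sum(axis=0)       # hour x domain within the cluster
        hierarchy.append((days, greedy_merge(hour_table, k_hour)))
    return hierarchy

# Hypothetical data: days 0-1 act as weekdays, days 2-3 as a weekend.
counts = np.zeros((4, 4, 3))
counts[:2, :2] = [8, 1, 1]   # weekdays, hours 0-1: mostly domain 0
counts[:2, 2:] = [1, 8, 1]   # weekdays, hours 2-3: mostly domain 1
counts[2:, :1] = [1, 1, 8]   # weekend, hour 0: mostly domain 2
counts[2:, 1:] = [1, 8, 1]   # weekend, hours 1-3: mostly domain 1

result = hierarchical_time_clusters(counts, k_day=2, k_hour=2)
print(result)
```

In this toy the hour clusters differ between the two day clusters, which is the kind of specialized temporal correlation that a single flat clustering of hours would not express.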

REFERENCES

[1] S. Moghaddam, A. Helmy, S. Ranka and M. Somaiya, Data-driven Co-clustering Model of Internet Usage in Large Mobile Societies, 13th ACM Int'l Conf on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWIM), Oct 2010.
[2] M. Somaiya, C. Jermaine and S. Ranka, A POWER Framework for Multi-Class Membership in Bayesian Mixture Models, ACM conference on Knowledge Discovery and Data Mining, 2010.
[3] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker, 1988.
[4] J. A. Hartigan, Direct clustering of a data matrix, Journal of the American Statistical Association, 67(337): pp. 123-129, March 1972.
[5] I. S. Dhillon, S. Mallela and D. S. Modha, Information theoretical co-clustering, In Ninth ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining (KDD03), pp. 89-98, 2003.
[6] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc and J. S. Park, Fast algorithms for projected clustering, In SIGMOD '99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. ACM Press, New York, NY, USA, pp. 61-72.
[7] C. C. Aggarwal and P. S. Yu, Finding generalized projected clusters in high dimensional spaces, In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM Press, New York, NY, USA, pp. 70-81.
[8] K.-G. Woo, J.-H. Lee, M.-H. Kim and Y.-J. Lee, Findit: a fast and intelligent subspace clustering algorithm using dimension voting, Information & Software Technology 46, 4, pp. 255-271.
[9] J. Friedman and J. Meulman, Clustering objects on subsets of attributes, Journal of the Royal Statistical Society Series B (Statistical Methodology) 66, 4, pp. 815-849.
[10] J. Yang, W. Wang, H. Wang and P. Yu, delta-clusters: Capturing subspace correlation in a large data set, In ICDE '02: Proceedings of the 18th International Conference on Data Engineering. IEEE Computer Society, Los Alamitos, CA, USA, pp. 517-528.
[11] R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, In VLDB '94: Proceedings of the 20th International Conference on Very Large Databases. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 487-499.
[12] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, In SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. ACM Press, New York, NY, USA, pp. 94-105.
[13] C.-H. Cheng, A. W. Fu and Y. Zhang, Entropy-based subspace clustering for mining numerical data, In KDD '99: Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, USA, pp. 84-93.
[14] H. Nagesh, S. Goil and A. Choudhary, Mafia: Efficient and scalable subspace clustering for very large data sets, Technical Report CPDC-TR-9906-010, Northwestern University, 2145 Sheridan Road, Evanston IL 60208, June 1999.
[15] J.-W. Chang and D.-S. Jin, A new cell-based clustering method for large, high-dimensional data in data mining applications, In SAC '02: Proceedings of the 2002 ACM Symposium on Applied computing. ACM Press, New York, NY, USA, pp. 503-507.
[16] B. Liu, Y. Xia and P. S. Yu, Clustering through decision tree construction, In CIKM '00: Proceedings of the ninth International Conference on Information and Knowledge Management. ACM Press, New York, NY, USA, pp. 20-29.
[17] C. M. Procopiuc, M. Jones, P. K. Agarwal and T. M. Murali, A Monte Carlo algorithm for fast projective clustering, In SIGMOD '02: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. ACM Press, New York, NY, USA, pp. 418-427.
[18] M. Somaiya, C. Jermaine and S. Ranka, Learning correlations using the mixture-of-subsets model, ACM Trans. Knowl. Discov. Data, 1(4): pp. 1-42, 2008.
[19] G. J. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.
[20] X. Song, C. Jermaine, S. Ranka and J. Gums, A Bayesian Mixture Model with Linear Regression Mixing Proportions, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.
[21] L. Rabiner and B. Juang, An introduction to hidden Markov models, ASSP Magazine, IEEE, vol. 3, no. 1, pp. 4-16, Jan 1986.
[22] W. W. S. Wei, Time series analysis: univariate and multivariate methods, Pearson Addison Wesley, 2006. ISBN 0321322169.
[23] G. C. Reinsel, Elements of multivariate time series analysis, Springer, 2003. ISBN 0387406190.
[24] G. Dong and J. Pei, Sequence data mining, Volume 33 of Advances in database systems, Springer, 2007. ISBN 0387699368.
[25] M. J. Zaki, Data mining parallel and distributed association mining: A survey, IEEE Concurrency, 1999.
[26] Y. Ye and C. Chiang, A Parallel Apriori Algorithm for Frequent Itemsets Mining, In Proceedings of the Fourth International Conference on Software Engineering Research, Management and Applications (SERA '06). IEEE Computer Society, Washington, DC, USA, pp. 87-94.
[27] W. Zhao, H. Ma and Q. He, Parallel K-Means Clustering Based on MapReduce, in Cloud Computing, Vol. 5931, pp. 674-679, 2009.
[28] Y. Zhang, Z. Xiong, J. Mao and L. O, The Study of Parallel K-Means Algorithm, Intelligent Control and Automation, WCICA 2006. The Sixth World Congress on, vol. 2, pp. 5868-5871, 2006.
[29] Z. Lv, Y. Hu, H. Zhong, J. Wu, B. Li and H. Zhao, Parallel K-Means Clustering of Remote Sensing Images Based on MapReduce, Lecture Notes in Computer Science, Volume 6318/2010, pp. 162-170, 2010.
[30] J. S. Rosenthal, Parallel computing and Monte Carlo algorithms, Far East Journal of Theoretical Statistics 4, 2000, pp. 207-236.
[31] D. Barbara, C. Faloutsos, J. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. Ross and K. C. Sevcik, The New Jersey data reduction report, Data Engineering Bulletin, September 1996.
[32] G. Kollios, D. Gunopoulos, N. Koudas and S. Berchtold, An Efficient Approximation Scheme for Data Mining Tasks, Proc. IEEE Int. Conf. on Data Engineering (ICDE01), 2001, pp. 453-462.
[33] W.-J. Hsu, T. Spyropoulos, K. Psounis and A. Helmy, TVC: Modeling spatial and temporal dependencies of user mobility in wireless mobile networks, IEEE/ACM Trans. Netw., 17, 5 (Oct 2009), pp. 1564-1577.
[34] R. Jain, D. Lelescu and M. Balakrishnan, Model T: a model for user registration patterns based on campus WLAN data, Wirel. Netw., 13, 6 (Dec 2007), pp. 711-735.
[35] D. Lelescu, U. C. Kozat, R. Jain and M. Balakrishnan, Model T++: an empirical joint space-time registration model, In Proceedings of the 7th ACM MOBIHOC (Florence, Italy, May 2006). ACM.
[36] M. Kim, D. Kotz and S. Kim, Extracting a Mobility Model from Real User Traces, In Proceedings of the IEEE INFOCOM 2006 (Barcelona, Spain, Apr 2006).
[37] F. Bai, N. Sadagopan and A. Helmy, The IMPORTANT framework for analyzing the Impact of Mobility on Performance Of RouTing protocols for Adhoc NeTworks, Ad Hoc Networks, 1, 4 (Nov 2003), pp. 383-403.
[38] M. Ploumidis, M. Papadopouli and T. Karagiannis, Multi-level application-based traffic characterization in a large-scale wireless network, WOWMOM 2007: pp. 1-9.
[39] T. Henderson, D. Kotz and I. Abyzov, The changing usage of a mature campus-wide wireless network, Computer Networks, 52, 14 (Oct 2008), pp. 2690-2712.
[40] D. L. Johnson, E. M. Belding, K. Almeroth and G. Van Stam, Internet usage and performance analysis of a rural wireless network in Macha, Zambia, In Proceedings of the 4th ACM Workshop on Networked Systems for Developing Regions (NSDR '10). ACM, New York, NY, USA, Article 7, 6 pages.
[41] C. Shepard, C. Tossel, A. Rahmati, L. Zhong and P. Kortum, LiveLab: Measuring Wireless Networks and Smartphone Users in the Field, In Proceedings of The 3rd Workshop on Hot Topics in Measurement & Modeling of Computer Systems (HotMetrics), June 2010.
[42] I. Trestian, S. Ranjan, A. Kuzmanovic and A. Nucci, Measuring serendipity: connecting people, locations and interests in a mobile 3G network, In Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference (IMC '09). ACM, New York, NY, USA, pp. 267-279.
[43] A. Almutairi, S. Ranka and M. Somaiya, A Fast Algorithm for Learning Weighted Ensemble of Roles, The Fifth International Conference on Contemporary Computing, IC3-2012, 2012.
[44] R. Bekkerman, R. El-Yaniv and A. McCallum, Multi-way distributional clustering via pairwise interactions, In Proceedings of the 22nd international conference on Machine learning (ICML '05). ACM, New York, NY, USA, pp. 41-48.
[45] X. Gao, Efficient implementation of multi-dimensional co-clustering, Published: [Gainesville, Fla.]: University of Florida, 2011. Full text: http://purl.fcla.edu/fcla/etd/UFE0043454
[46] M. Afanasyev, T. Chen, G. M. Voelker and A. C. Snoeren, Usage patterns in an urban WiFi network, IEEE/ACM Trans. Netw. 18, 5 (October 2010), pp. 1359-1372.
[47] F. Hernandez-Campos, M. Karaliopoulos, M. Papadopouli and H. Shen, Spatio-temporal modeling of traffic workload in a campus WLAN, In Proceedings of the 2nd annual international workshop on Wireless internet (WICON '06). ACM, New York, NY, USA, Article 1.
[48] P. Berkhin, Survey of clustering data mining techniques, Technical report, Accrue Software, San Jose, CA, 2002.
[49] A. K. Jain, M. N. Murty and P. J. Flynn, Data clustering: a review, ACM Computing Surveys, 31(3): pp. 264-323, 1999.
[50] S. P. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory 28(2): pp. 129-137. doi:10.1109/TIT.1982.
[51] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society. Series B (Methodological) 39(1): pp. 1-38, 1977.
[52] M. Ester, H. P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, pp. 226-231, 1996.
[53] S. C. Johnson, Hierarchical Clustering Schemes, Psychometrika, 2: pp. 241-254, 1967.
[54] J. A. Hartigan, Direct clustering of a data matrix, Journal of the American Statistical Association (American Statistical Association) 67(337): pp. 123-129, 1972.
[55] D. Ienco, R. G. Pensa and R. Meo, Parameter-Free Hierarchical Co-clustering by n-Ary Splits, In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part I (ECML PKDD '09), 2009.
[56] C. Robardet, Contribution à la classification non supervisée: proposition d'une méthode de bi-partitionnement. PhD thesis, Université Claude Bernard - Lyon 1, (Juillet 2002).
[57] J. Li and T. Li, HCC: A Hierarchical Co-Clustering Algorithm, In Proc. 33rd ACM SIGIR Conference, 2010.

BIOGRAPHICAL SKETCH

Abdullah Almutairi received his B.E. from the Department of Computer Engineering at Kuwait University in 2003. Afterwards, he received a scholarship from the university to continue his studies and return as a faculty member. He received his M.S. from the University of Southern California in 2005 and his Ph.D. in computer engineering from the University of Florida in 2012. His research interests are in Data Mining and High Performance Computing.