Citation
Query-Driven Text Analytics for Knowledge Extraction, Resolution, and Inference

Material Information

Title:
Query-Driven Text Analytics for Knowledge Extraction, Resolution, and Inference
Creator:
Grant, Christan Earl
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
Language:
English
Physical Description:
1 online resource (134 p.)

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering
Computer and Information Science and Engineering
Committee Chair:
WANG,ZHE
Committee Co-Chair:
WILSON,JOSEPH N
Committee Members:
GILBERT,JUAN EUGENE
DOBRA,ALIN VIOREL
COWLES,HEIDI WIND
Graduation Date:
8/8/2015

Subjects

Subjects / Keywords:
Analytics ( jstor )
Canopy ( jstor )
Databases ( jstor )
Datasets ( jstor )
Inference ( jstor )
Knowledge bases ( jstor )
Machine learning ( jstor )
Scheduling ( jstor )
SQL ( jstor )
Text analytics ( jstor )
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
analytics -- coreference -- database -- large-scale -- query -- query-driven
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Computer Engineering thesis, Ph.D.

Notes

Abstract:
With the precipitous increase in data, performing text analytics using traditional methods has become increasingly difficult. From now until 2020 the world's data is predicted to double every year. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size with improper methods could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. Complexity of data formats is increasing. No longer can one assume data will be structured numbers and names. Traditionally, to perform analytics, a data scientist extracts parts of large data sources to local machines and performs analytics using R, Python or SAS. Extracting this information is becoming a pain point. Additionally, many algorithms performed over sets of data perform extra work; the data scientist may only be interested in a particular portion of the data. In this dissertation, I introduce query-driven text analytics. Query-driven text analytics is the use of declarative semantics (a query) to direct, restrict and alter computation in analytic systems without a major sacrifice in accuracy. I demonstrate this principle in three ways. First, I add text analytics inside of a relational database where the user can use SQL to bind the scope of the algorithm, e.g. using a SELECT statement. In this way, computation takes place in the same location as storage and the user can take advantage of the query processing provided by the database. Second, I alter an entity resolution algorithm so it uses example queries to drive computation. This demonstrates a method of making a non-trivial algorithm aware of the query. Finally, I describe a method for inferring information from knowledge bases. These techniques perform inference over knowledge bases that model uncertainty for a real scenario and its application within question answering. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2015.
Local:
Adviser: WANG,ZHE.
Local:
Co-adviser: WILSON,JOSEPH N.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2016-02-29
Statement of Responsibility:
by Christan Earl Grant.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Embargo Date:
2/29/2016
Classification:
LD1780 2015 ( lcc )

Full Text

QUERY-DRIVEN TEXT ANALYTICS FOR KNOWLEDGE EXTRACTION, RESOLUTION, AND INFERENCE

By
CHRISTAN EARL GRANT

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2015

© 2015 Christan Earl Grant

To Jesus my Savior, Vanisia my wife, my daughter Caliah, my soon-to-be-born son, and my parents and siblings, whom I strive to impress. Also, to all my brothers and sisters battling injustice while I battled bugs and deadlines.

ACKNOWLEDGMENTS

I had an opportunity to see my dad, a software engineer from Jamaica, work extremely hard to get a master's degree and work as a software engineer. I even had the privilege of sitting in some of his classes as he taught at a local university. Watching my dad work towards intellectual endeavors made me believe that anything is possible. I am extremely privileged to have someone I could look up to as an example of being a man, father, and scholar.

I had my first taste of research when Dr. Joachim Hammer went out of his way to find a task for me on one of his research projects because I was interested in attending graduate school. After working with the team for a few weeks he was willing to give me increased responsibility: he let me attend the 2006 SIGMOD Conference in Chicago. It was at this conference that my eyes were opened to the world of database research.

When I was an early graduate student, Dr. Joseph Wilson exercised superhuman patience with me as I learned to grasp the fundamentals of paper writing. He helped me manage a rocky first few years. His abundance of wisdom would spill, revealing jewels of truth that I still hold sacred. He, along with Peter Dobbins, helped me navigate the road to the Ph.D.

I am delighted to have Dr. Daisy Zhe Wang as a dissertation advisor. I followed her work while she was still a graduate student and I was thrilled to hear she was considering coming to UF. Having the opportunity to watch someone as gifted as Dr. Wang brainstorm and write was an invaluable experience. Additionally, I thank lab mates Clint P. George and Dr. Kun Li, with whom I have worked for many years. I also thank Sean Goldberg, Morteza Shahriari Nia, Yang Chen, Yang Peng, and Xiaofeng Zhou, who have also been mentored by Dr. Wang; I appreciate their valuable feedback.

During the last years of my graduate program there has been a large amount of civil unrest. While these issues do not affect me specifically, they are emotionally difficult to handle and can negatively affect my everyday productivity. It was important for me to have people around me who I know are going through similar circumstances emotionally and still pursuing their degree. That is why I thank Dr. Pierre St. Juste, Dr. Corey Baker, and Jeremy Magruder for discussions about issues that are sacred to one's race and ethnicity.

Finally, I would like to thank all the individuals who regularly attended the ACM Richard Tapia Celebration of Diversity in Computing. In 2007, I found this group because I was purposely searching for community. This is a group of talented intellectuals who continue to spur me towards excellence. Through them I met Dr. Juan Gilbert, who has been an excellent mentor and role model throughout my research career.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    1.1 Database as the Querying Engine
    1.2 Query-Driven Machine Learning
    1.3 Question Answering
2 IN-DATABASE QUERY-DRIVEN TEXT ANALYTICS
    2.1 MADden Introduction
    2.2 MADden System Description
        2.2.1 MADden System Architecture
        2.2.2 Statistical Text Analysis Functions
        2.2.3 MADden Implementation Details
    2.3 Text Analysis Queries and Demonstration
        2.3.1 Dataset for MADden Example
        2.3.2 MADden Text Analytics Queries
        2.3.3 MADden User Interface
    2.4 GPText Introduction
        2.4.1 GPText Related Work
        2.4.2 Greenplum Text Analytics
            2.4.2.1 In-database document representation
            2.4.2.2 ML-based advanced text analysis
        2.4.3 CRF for IE over MPP Databases
            2.4.3.1 Implementation overview
            2.4.3.2 Feature extraction using SQL
            2.4.3.3 Parallel linear-chain CRF training
            2.4.3.4 Parallel linear-chain CRF inference
        2.4.4 GPText Experiments and Results
        2.4.5 GPText Application
        2.4.6 GPText Summary
3 MAKING ENTITY RESOLUTION QUERY-DRIVEN
    3.1 Query-Driven Entity Resolution Introduction
    3.2 Query-Driven Entity Resolution Preliminaries
        3.2.1 Factor Graphs
        3.2.2 Inference over Factor Graphs
        3.2.3 Cross-Document Entity Resolution
    3.3 Query-Driven Entity Resolution Problem Statement
    3.4 Query-Driven Entity Resolution Algorithms
        3.4.1 Intuition of Query-Driven ER
        3.4.2 Single-Node ER
        3.4.3 Multi-query ER
    3.5 Optimization of Query-Driven ER
        3.5.1 Influence Function: Attract and Repel
        3.5.2 Query-proportional ER
        3.5.3 Hybrid ER
        3.5.4 Implementation Details
        3.5.5 Algorithms Summary Discussion
    3.6 Query-Driven Entity Resolution Experiments
        3.6.1 Experiment Setup
        3.6.2 Realtime Query-Driven ER Over NYT
        3.6.3 Single-query ER
        3.6.4 Multi-query ER
        3.6.5 Context Levels
        3.6.6 Parallel Hybrid ER
    3.7 Query-Driven Entity Resolution Related Work
    3.8 Query-Driven Entity Resolution Summary
4 A PROPOSAL OPTIMIZER FOR SAMPLING-BASED ENTITY RESOLUTION
    4.1 Introduction to the Proposal Optimizer
    4.2 Proposal Optimizer Background
    4.3 Accelerating Entity Resolution
    4.4 Proposal Optimizer Algorithms
    4.5 Optimizer
    4.6 Proposal Optimizer Experiment Implementation
        4.6.1 WikiLink Corpus
        4.6.2 Micro Benchmark
    4.7 Proposal Optimizer Summary
5 QUESTION ANSWERING
    5.1 MorpheusQA Introduction
    5.2 MorpheusQA Related Work
        5.2.1 Question Answering Systems
        5.2.2 Ontology Generators
    5.3 MorpheusQA System Architecture
        5.3.1 Using Ontology and Corpora
        5.3.2 Recording
        5.3.3 Ranking
        5.3.4 Executing New Queries
    5.4 MorpheusQA Results
    5.5 MorpheusQA Summary
6 PATH EXTRACTION IN KNOWLEDGE BASES
    6.1 Preliminaries for Knowledge Base Expansion
        6.1.1 Probabilistic Knowledge Base
        6.1.2 Markov Logic Network and Factor Graphs
        6.1.3 Sampling for Marginal Inference
            6.1.3.1 Gibbs sampling
            6.1.3.2 MC-SAT
        6.1.4 Linking Facts in a Knowledge Base
    6.2 Fact Path Expansion Related Work
        6.2.1 SPARQL Query Path Search
        6.2.2 Path Ranking
        6.2.3 FactRank
    6.3 Fact Path Expansion Algorithm
    6.4 Joint Inference of Path Probabilities
        6.4.1 Fuzzy Querying
        6.4.2 PostgreSQL Fact Path Expansion Algorithm
        6.4.3 Graph Database Query
        6.4.4 Fact Path Expansion Complexity
    6.5 Fact Path Expansion Experiments
    6.6 Fact Path Expansion Summary
7 CONCLUSIONS

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
2-1 Listing of current MADden functions
2-2 List of each MADden function and its NLP task
2-3 Abbreviated NFL dataset schema
3-1 Mentions sets M from a corpus
3-2 Example query node q
3-3 Summary of algorithms and their most common methods for proposal jumps
3-4 Features used on the NYT Corpus. The first set of features are token specific features, the middle set are between pairs of mentions and the bottom set are entity wide features
3-5 The performance of the hybrid-repel ER algorithm for queries over the NYT corpus for the first 50 samples
4-1 A table of the techniques to improve the sampling process, each classified by how it affects sampling
5-1 Example SSQ model
5-2 The output of NLP engine
5-3 Term classes and probabilities
5-4 Highest ranked MorpheusQA queries
6-1 The frequency of each term in our cleaned Reverb dataset

LIST OF FIGURES

Figure
2-1 MADden architecture
2-2 Example MADden UI query template
2-3 The GPText architecture over Greenplum database
2-4 The MADlib CRF overall system architecture
2-5 Linear-chain CRF training scalability
2-6 Linear-chain CRF inference scalability
2-7 GPText application
3-1 Three node factor graph. Circles are random variables: those with m_i represent mentions and those with e_i represent entities. Clouds are added for visual emphasis of entity clusters
3-2 A possible initialization for entity resolution
3-3 The correct entity resolution for all mentions
3-4 The entity containing q is internally coreferent; the other entities are not correctly resolved
3-5 Hybrid-repel performance for the first 50 samples for three queries. Each result is averaged over 6 runs
3-6 A comparison of single-query algorithms on a query with selectivity of 11
3-7 A comparison of single-query algorithms with a query node of selectivity 46
3-8 A comparison of selection-driven algorithms with a query node of selectivity 130
3-9 The time until an f1_q score of 0.95 for five queries of increasing selectivities; averaged over three runs
3-10 The progress of the hybrid algorithm across multiple query nodes using different scheduling algorithms. Each result is averaged over three runs
3-11 The performance of zuckerberg query with different levels of context. Each result is averaged over 6 runs
3-12 Hybrid-attract algorithm with random queries run over the Wikilinks corpus. Each plot starts after the Vose structures are constructed
4-1 The high-level interaction of the optimizer
4-2 A distribution of entity sizes from the wikilinks corpus [87] with an initial start and the truth
4-3 Comparison of baseline versus early stopping methods
4-4 The time for compression for varying entity sizes and cardinalities. This is compared with a line representing the time it takes to make 100K insertions
5-1 Abbreviated vehicular ontology
6-1 An example of the increase of the facts and the number of relations over several timestamps
6-2 A sample of nodes and their changing probabilities over time. The figure is darkened to show the many overlapping lines
6-3 Fact Path Expansion queries over the Titan GraphDB
6-4 Fact Path Expansion queries over PostgreSQL
6-5 PostgreSQL results of Fact Path Expansion queries with reset database cache
6-6 Comparison of the experiments with TitanDB and PostgreSQL without cache

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

QUERY-DRIVEN TEXT ANALYTICS FOR KNOWLEDGE EXTRACTION, RESOLUTION, AND INFERENCE

By
Christan Earl Grant
August 2015

Chair: Daisy Zhe Wang
Cochair: Joseph N. Wilson
Major: Computer Engineering

With the precipitous increase in data, performing text analytics using traditional methods has become increasingly difficult. From now until 2020 the world's data is predicted to double every year. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size with improper methods could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. Complexity of data formats is increasing. No longer can one assume data will be structured numbers and names. Traditionally, to perform analytics, a data scientist extracts parts of large data sources to local machines and performs analytics using R, Python or SAS. Extracting this information is becoming a pain point. Additionally, many algorithms performed over sets of data perform extra work; the data scientist may only be interested in a particular portion of the data.

In this dissertation, I introduce query-driven text analytics. Query-driven text analytics is the use of declarative semantics (a query) to direct, restrict and alter computation in analytic systems without a major sacrifice in accuracy. I demonstrate this principle in three ways. First, I add text analytics inside of a relational database where the user can use SQL to bind the scope of the algorithm, e.g. using a SELECT statement. In this way, computation takes place in the same location as storage and the user can take advantage of the query processing provided by the database. Second, I alter an entity resolution algorithm so it uses example queries to drive computation. This demonstrates a method of making a non-trivial algorithm aware of the query. Finally, I describe a method for inferring information from knowledge bases. These techniques perform inference over knowledge bases that model uncertainty for a real scenario and its application within question answering.

CHAPTER 1
INTRODUCTION

From Babylonian-era algorithms for accounting resources [51] to modern day web-scale processing, methods for analyzing data have been central to the progress of successful societies. Data analytics encompasses the algorithms and systems involved in extracting decision grade information from data. Notably, data analytics spans a series of fields including computer science, economics, marketing, physics, sociology and engineering. In a capitalist society the ability to make intelligent business decisions is critical. The globally connected society of modern day has demanded that competitive organizations find more efficient methods of extracting knowledge. If an organization cannot collect, manage and process data as efficiently as its competition, then it will have trouble surviving [85].

From now until 2020 the world's data is predicted to double every year [35]. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size with improper methods could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. Complexity of data formats is increasing; no longer can one assume data will be structured numbers and names. Databases are now storing more of a mix of structured and unstructured data. To support data analytics, queries over disparate data types cannot be an oversight. Additionally, user generated content such as click streams, tweets and videos are examples of new data sources with extremely high rates of growth.

In a typical data scientist's text analytics pipeline, data is extracted from a database, analytics are then performed using R, Python or MATLAB, and the result is added back to the database. With increasing data sizes, the bottleneck of this process is quickly becoming the data transfer time, that is, transferring large amounts of data from and to the database. Oftentimes, large and diverse types of data sources cannot be extracted from a database, either for security reasons or because of the large size. Nor can a global service be taken off-line for processing and updates. To perform text analytics in these scenarios it is preferable to bring the query to the data instead of bringing the data to the query [23].

Text analytics is a class of methods for processing documents to obtain actionable or exploratory information. Text analytic tasks include linguistic processing, knowledge extraction and information visualization. Most text analytics techniques are created for processing information across the full supplied data set. That is, to extract answers from a data set, the full set must be processed. With large data set sizes, this approach becomes prohibitive. To use an analogy, if a clean plate is needed from the kitchen sink one should not use the dishwashing machine. It is our observation that during the majority of exploration tasks, a data scientist may only interpret a small portion of the data set. For example, when clustering data for evaluation a data scientist may only look at a handful of data clusters. When running exploratory analysis over data streams, providing a template or example of expected results may be useful when sifting through noise.

This dissertation defines the category of query-driven text analytics and presents three scenarios demonstrating the efficacy of query-driven techniques.

Query-driven text analytics is the use of declarative semantics to decrease the amount of processing without sacrifice in accuracy. In this dissertation, we demonstrate this in three ways.

- We add machine learning algorithms inside of a parallel relational DBMS where the user can use SQL and UDFs to choose the scope of their algorithm (Chapter 2);
- We alter a machine learning algorithm so it uses an example query to drive computation (Chapters 3 and 5);
- We investigate the use of knowledge-based inference to assist question answering systems (Chapter 5) and understanding the connection between concepts (Chapter 6).

In the following subsections I briefly introduce each contributed area. In addition, I explicitly state the contribution of each work.

1.1 Database as the Querying Engine

When processing large data, often a bottleneck to computation is data movement. Moving data across geographical locations for processing is expensive. In-database analytics (dblytics) aims to build sophisticated analytic algorithms into data parallel systems, such as relational databases and massively parallel processing systems. Using a database as the ecosystem for analytics we get a declarative query interface, query optimization, transactional operations, efficient caching and fault tolerance. I present two projects demonstrating dblytics: MADden and GPText.

MADden is a demonstration of in-database text analysis algorithms [41]. This demonstration focuses on answering queries for sports journalism, in particular NFL data sets, using MADlib-style queries. The demonstration made the following contributions:

- Processing declarative ad hoc queries involving various statistical text analytic functions.
- Joining and querying over multiple data sources of structured and unstructured text.
- Query-time rendering of visualizations over query results, using word clouds, histograms, and ranked lists of documents.

GPText is a system for large-scale text indexing, search and ranking [57]. This is a new system that integrates Greenplum DB, MADlib analytic libraries and the Apache Solr enterprise search platform. Combined with our MADlib algorithms such as Conditional Random Field part-of-speech tagging, GPText is an extremely large and scalable text analytics engine. GPText adds a Solr instance to each parallel Greenplum DB segment, and the database can communicate with the instances over HTTP. Text searches are then parallelized across segments. Using UDFs we can mix sophisticated search predicates, ranking and database queries. In addition, we created an application that demonstrates the scalability of GPText and MADlib algorithms.
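The idea of letting the query bound the scope of an analytic function can be illustrated with a minimal sketch. This is not MADden or GPText code: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL or Greenplum, and the table docs, the toy word lists, and the sentiment function are all hypothetical names introduced only for illustration.

```python
import sqlite3

# Hypothetical toy lexicon; a real system would use a trained model.
POSITIVE = {"great", "win", "strong"}
NEGATIVE = {"bad", "loss", "weak"}

call_log = []  # records every document the analytic function actually touched


def sentiment(text):
    """Toy sentiment score: (number of positive words) - (number of negative words)."""
    call_log.append(text)
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


conn = sqlite3.connect(":memory:")
# Register the Python function as a one-argument SQL function (a UDF).
conn.create_function("sentiment", 1, sentiment)
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, team TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO docs (team, body) VALUES (?, ?)",
    [
        ("Dolphins", "great win and a strong defense"),
        ("Dolphins", "bad loss on the road"),
        ("Jets", "weak offense and a bad loss"),
        ("Bills", "strong finish"),
    ],
)

# The WHERE clause bounds the scope: only matching rows reach the UDF.
rows = conn.execute(
    "SELECT id, sentiment(body) FROM docs WHERE team = ? ORDER BY id",
    ("Dolphins",),
).fetchall()
calls = len(call_log)
```

In this sketch the analytic function runs only on the two rows selected by the predicate, not on all four documents, which is the sense in which the computation is driven by the query.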

1.2 Query-Driven Machine Learning

In query-driven machine learning, the idea is to use examples of desired results to reduce the amount of time spent processing data. To demonstrate this we take a popular clustering problem called entity resolution and make it query-driven.

Entity resolution (ER) is the process of determining records (mentions) in a database that correspond to the same real-world entity. Leading ER systems solve this problem by resolving every record in the database. For large data sets, however, this is an expensive process. Moreover, such approaches are wasteful because in practice, users are interested in only one or a small subset of the entities mentioned in the database. In this work, we introduce new classes of SQL queries involving ER operators: single-query ER and multi-query ER. We develop novel variations of the Metropolis-Hastings algorithm and introduce selectivity-based scheduling algorithms to support the two classes of ER queries.

To support single-query ER queries, we develop three new variations of the Metropolis-Hastings-style Markov chain Monte Carlo algorithm for inference over the CRF-based probabilistic model. More specifically, instead of a uniform sampling distribution, we use a query sampling method that is biased to the arrangement of the probabilistic model. In the first, target-fixed algorithm, we adapt the samples to resolve the query entity. The second, query-proportional algorithm selects mentions based on their probabilistic similarity to the query entity. The third, hybrid algorithm combines the two approaches.

Following the seminal work of Wick et al. [100], we devise an influence function to model the similarity between the mentions and the query entity (attract score). This influence function is best when the cluster of mentions is heterogeneous. In the case when the cluster of mentions is homogeneous, for example the result of a high-quality canopy generation, we show a different algorithm to compute and apply an influence function to generate a repel score for biased sampling.

To support multi-query ER, a naive nested-loop join can be performed using the single-query ER algorithms iteratively to compute resolution one entity at a time. However, such a join algorithm can lead to unoptimized resource allocation if the same number of samples are generated for each target entity, or low throughput if one of the entities has a low convergence rate (e.g., a long sampling process). To alleviate this problem, we discuss three multi-query ER algorithms, which schedule the computation (i.e., sample generation) among different target entities in order to achieve optimum overall convergence rate.

1.3 Question Answering

The next step, after extracting information from large data sets and analysis, is question answering. Question answering is bridging the gap between the way a user asks a question and the way an answer is encoded in the background knowledge. Understanding questions and extracting answers requires the full suite of text mining tasks. The process of question answering is inherently query-driven; all possible questions over a data set cannot be enumerated, therefore any question answering system waits for a user query to initiate an answer discovery process.

Question answering is the holy grail of text analytics. Many text analytic tasks are required to obtain accurate answers. In this work, we extract answers from the web corpus and we distinguish between two portions of the web, namely the surface web and the deep web. The surface web is the set of standard web pages accessible from a browser without authenticating or providing any credentials. These pages include blog posts, company web pages, news articles and more. By contrast, the deep web is the set of pages generated through web forms. That is, accessing deep web pages requires user interaction and parameters.

We present the MorpheusQA system, a question answering system that records user interaction in the deep web in order to answer questions. When a user poses a natural language question, that question is compiled and matched to previous interactions on the deep web and MorpheusQA compiles the set of pages required to answer the question. It then, at query time, interacts with the deep web to extract an answer for the query.

The surface web contains vast amounts of facts that can be extracted using text analytics. These facts are uncertain, that is, there may be conflicting or ill-formed facts. These uncertainties may arise from the extraction process itself, changes in facts over time or the inherent ambiguity of natural language. Hence, these facts are stored along with their probabilities in a probabilistic knowledge base. A probabilistic knowledge base is a database of facts with an inference engine. The probabilistic knowledge base allows queries to compute the probability that a fact is correct given the rest of the facts in the knowledge base. Additionally, knowledge bases can infer missing information.

Given the promise of these types of advanced knowledge bases, we can look at the calculation of the probabilities of a knowledge base. We can procure paths of connected facts from knowledge bases to show connections between entities. Connected paths between entities can help an information seeker understand how two entities correspond. For a web-scale knowledge base, this task is difficult without using large machines. We give examples of how this method helps assist information seekers.

Dissertation outline. This dissertation is outlined in six chapters. Each chapter is independent and discusses any necessary background and related work within. Chapter 2 describes how a database can be used as a location for text analytics through two systems, MADden and GPText. Chapter 3 describes query-driven entity resolution; this is the main contribution of this work. Chapter 4 gives some depth about the entity resolution work by proposing an optimizer to increase its efficiency. Chapter 5 describes the deep web question answering system known as Morpheus. Chapter 6 presents the final part of the dissertation, namely, an algorithm for ranking paths between entities in knowledge bases. In Chapter 7 we summarize the contribution and present some future areas of research.
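To make the notion of a path of connected facts concrete, the following is a minimal sketch, assuming facts are stored as (subject, relation, object) triples. It uses a plain breadth-first search and ignores fact probabilities, so it is not the Fact Path Expansion algorithm of Chapter 6; the function name find_path and the toy facts are hypothetical.

```python
from collections import deque


def find_path(facts, source, target):
    """Return a shortest chain of facts linking source to target, or None.

    Facts are treated as undirected edges between their subject and object.
    """
    adjacency = {}
    for fact in facts:
        subj, _relation, obj = fact
        adjacency.setdefault(subj, []).append((obj, fact))
        adjacency.setdefault(obj, []).append((subj, fact))

    # came_from maps each reached entity to (previous entity, fact used).
    came_from = {source: None}
    queue = deque([source])
    while queue:
        entity = queue.popleft()
        if entity == target:
            break
        for neighbor, fact in adjacency.get(entity, []):
            if neighbor not in came_from:
                came_from[neighbor] = (entity, fact)
                queue.append(neighbor)

    if target not in came_from:
        return None

    # Walk backwards from the target to recover the facts on the path.
    path = []
    entity = target
    while came_from[entity] is not None:
        previous, fact = came_from[entity]
        path.append(fact)
        entity = previous
    path.reverse()
    return path


# Hypothetical toy knowledge base.
facts = [
    ("alice", "works_at", "acme"),
    ("bob", "works_at", "acme"),
    ("bob", "lives_in", "gainesville"),
    ("carol", "lives_in", "gainesville"),
]
path = find_path(facts, "alice", "carol")
missing = find_path(facts, "alice", "dave")
```

Here path lists the four facts that connect alice to carol through acme, bob, and gainesville, while missing is None because dave never appears in the toy knowledge base. A probabilistic variant would additionally score each path using the probabilities attached to its facts.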

CHAPTER 2
IN-DATABASE QUERY-DRIVEN TEXT ANALYTICS

In this chapter, I introduce two systems, MADden [41] and GPText [58], that are created to introduce a text analytics paradigm where the data scientist will rarely have to leave the comfort of the RDBMS, where their data lies. This chapter shows how these systems allow for non-trivial text analytics, sophisticated text search and visualization. By empowering the RDBMS to perform these tasks, we can use the declarative query interface to decide over which data source we should perform analytics; that is, we are performing query-driven data analytics.

The second half of this chapter discusses the GPText project. This work does have significant overlap between me and a fellow Ph.D. student, Kun Li. Each of us contributed equally and it was important to include it in the dissertation as it was the second part of the MADden project. This body of work represents early contributions in statistical text analytics.

2.1 MADden Introduction

For many applications, unstructured text and structured data are both important natural resources to fuel data analysis. For example, a sports journalist covering NFL (National Football League¹; thirty-two American football teams with more than 1700 players) games would need a system that can analyze both the structured statistics (e.g., scores, biographic data) of teams and players and the unstructured tweets, blogs, and news about the games.

In such applications, analytics are performed over text data from many sources. Text analysis uses statistical machine learning (SML) methods to extract structured information (such as part-of-speech tags, entities, relations, sentiments, and topics) from text. The result of the text analysis can be joined with other structured data sources for more advanced analysis. For example, a sports journalist may want to correlate fan sentiment from tweets with statistics describing the player and team performance of the Miami Dolphins².

To answer such queries, a software developer must understand and connect multiple tools, including Lucene for text search, Weka or R for sentiment analysis, and a database to join the structured data with the sentiment results. Using such a complex off-line batch process to answer a single query makes it difficult to make ad hoc queries over ever-evolving text data. These queries are essential for applications, such as computational journalism, e-discovery, and political campaign management, where queries are exploratory in nature, and follow-up queries need to be asked based on the result of previous queries.

MADden implements four important text analysis functions, namely, part-of-speech tagging, entity extraction, classification (e.g., sentiment analysis), and entity resolution. Text analysis functions are implemented using PostgreSQL (a single-threaded database) and Greenplum (a massive parallel processing (MPP) framework). Two SML models and their inference algorithms are adapted: linear-chain Conditional Random Fields (CRF)³ and Naive Bayes [95]. In-database and parallel SML algorithms are implemented in the Greenplum MPP framework. The MADden text analytic library is integrated into the MADlib open-source project⁴. The declarative SQL query interface with MADden text analysis functions provides a higher-level abstraction. Such an abstraction shields users from detailed text analytic algorithms and enables users to focus more on the application specific data explorations.

¹ http://www.nfl.com
² http://www.miamidolphins.com/
³ A CRF is a discriminative probabilistic graphical model used to encode arbitrary relationships for statistical processing.
⁴ MADlib is an open source project for scalable in-database analytics, http://madlib.net
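Inference in a linear-chain model for part-of-speech tagging is commonly done with the Viterbi algorithm. The following is a minimal sketch of that dynamic program over additive log-space scores. It is not the MADden or MADlib implementation: the score tables, the default score for unseen pairs, and the two-word example are hypothetical, and a real CRF would compute these scores from weighted features.

```python
def viterbi(words, tags, emit, trans, default=-10.0):
    """Return the highest-scoring tag sequence under additive (log-space) scores.

    emit[(tag, word)] scores a tag for a word; trans[(prev_tag, tag)] scores
    a transition. Pairs missing from a table receive the default score.
    """
    if not words:
        return []

    # best[i][t] is the best score of any tagging of words[:i + 1] ending in t.
    best = [{t: emit.get((t, words[0]), default) for t in tags}]
    back = []  # back[i][t] is the best previous tag when words[i + 1] gets tag t
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + trans.get((p, t), default))
            scores[t] = (
                best[-1][prev]
                + trans.get((prev, t), default)
                + emit.get((t, word), default)
            )
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy scores for a two-word sentence.
tags = ["NOUN", "VERB"]
emit = {
    ("NOUN", "dolphins"): -0.1,
    ("VERB", "dolphins"): -3.0,
    ("NOUN", "win"): -1.5,
    ("VERB", "win"): -0.4,
}
trans = {
    ("NOUN", "NOUN"): -1.2,
    ("NOUN", "VERB"): -0.3,
    ("VERB", "NOUN"): -0.5,
    ("VERB", "VERB"): -2.0,
}
tagged = viterbi(["dolphins", "win"], tags, emit, trans)
```

With these toy scores the sketch tags "dolphins" as NOUN and "win" as VERB. The running time is linear in the sentence length and quadratic in the number of tags.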


Figure 2-1. MADden architecture

In this chapter we will show the following points using e-journalism over an NFL corpus (our driving example):

- Processing declarative ad hoc queries involving various statistical text analytic functions;
- Joining and querying over multiple data sources with both structured and unstructured textual information;
- Query-time rendering of visualizations over query results, using word clouds, histograms, and ranked lists of documents.

2.2 MADden System Description

In this section we first discuss the general architecture and the basic techniques used in the implementation of the text analytics algorithms in MADden. Then we give an example of POS tagging implementation.

2.2.1 MADden System Architecture

MADden is a four-layered system, as can be seen in Figure 2-1. The user interface is where both naive and advanced users can construct queries over text, structured data,


and models. From the user interface, queries are then passed to the DBMS, where both MADLib and MADden libraries sit on top of the query processor to add statistical and text processing functionality. It is important to emphasize that MADLib and MADden perform functions at the same logical layer. To enable text analytics, MADden works alongside statistical functions found in the MADLib library [44]. These queries are processed using PostgreSQL and Greenplum's Parallel DB architecture to further optimize storage replication and query parallelism.

Table 2-1. Listing of current MADden functions

  Function                   Task
  match(object1, object2)    Entity Resolution
  sentiment(text)            Sentiment Analysis
  entity_find(text)          Detects Named Entities
  viterbi                    Part of speech tags using CRF

2.2.2 Statistical Text Analysis Functions

In this section we describe various text analysis algorithms. Many approaches exist for in-database information extraction. We build on our previous work using Conditional Random Fields (CRFs) for query-time information extraction [94]. We perform the extraction and the inference inside of the database. We rely on information provided in the query to make decisions on the type of algorithm used for extraction. Table ?? describes a list of statistical text analysis tasks.

Entity resolution, or co-reference resolution, is the following problem: given any two mentions of a name, cluster them if and only if they refer to the same real world entity. Certain entities may be misrepresented by the presence of different names, misspellings in the text, or aliases. It is important to resolve these entities appropriately to better understand the data. Increasingly informal text, such as blog posts and tweets, requires entity resolution. MADden uses inverted indices within the database to perform text analysis on documents. We can scan the inverted indices of each document, filtering out documents that do not contain instances of the player names. To handle misspellings and nicknames we use trigram indices to perform approximate matches of searches for names


as database queries [46]. This method allows us to use indices to perform queries on only the relevant portions of the data set; thus we do no extra processing.

We implemented functions to perform classification tasks such as POS tagging and sentiment analysis. These functions work at both a document and sentence level. In sentiment analysis we classify text by polarity, where positive sentiment refers to the positive nature of the expressed opinion and negative sentiment to its negative nature. Much work has already been done in this area for document-level and entity-level sentiment [71, 103]. We can join with other tables and functions within an SQL query, allowing more complex queries to be declaratively realized.

Parallelization. With a parallel database architecture such as Greenplum, we can parallelize to further optimize queries written with MADden. Each node within the parallel DB could run some query over a subset of the data (data parallel). This includes the statistical methods in MADLib, which were all built to be data parallel. Greenplum has a parallel shared-nothing architecture. Data is loaded onto segment servers. When a query is issued, a parallel query optimizer creates a global query plan which is pushed to each of the segment servers. Query-driven algorithms can then be executed in parallel over several data servers.

2.2.3 MADden Implementation Details

Core to many natural language processing tasks, part-of-speech (POS) tagging involves the labeling of terms by the parts of speech within a sentence. We implemented POS tagging in PostgreSQL and Greenplum. Our code is a part of the MADLib open source system.

MADden uses a first-order chain CRF to model the labeling of a sequence of tokens. The factor graph contains observed nodes on each sentence token with latent label variables attached to each token. Factors are functions that connect two nodes or signify the ends of the chain. We generate the features using a function generatemrtbl. This


function produces a table rfactor for single state features and a table mfactor for two state features.

Training the CRF model is a one-time task that is performed outside the DBMS.^5 We use a Python script to parse and import the trained model into tables in the DBMS.

Inference is performed over the stored models in order to find the most probable assignment of labels. This assignment is calculated using the Viterbi dynamic programming algorithm over the label space.

We use the PL/Python language to manage the workflow of all the calculations. The computationally expensive Viterbi function is implemented as a database user-defined function in the C language. The feature generation and execution of inference over a table of sentences is implemented in SQL. When executed in Greenplum the query is performed in parallel.

Implementing POS tagging inside the DBMS allows us to perform inference over a subset of tokens in response to a query instead of performing batch tagging over all tokens. We also get the benefit of using the query engine to parallelize our queries without losing the ability to drive the workflow using PL/Python.

Example Algorithm 1 performs POS tagging for all the sentences that contain the word 'Jaguar'. This query interface allows the user to perform functions on a subset of the data. The segmenttbl holds a list of tokens and their position for each document doc_id. We assume a document is a sequence of tokens.

Algorithm 1 POS tagging on sentences with the word 'Jaguar'

  SELECT DISTINCT ON (segtbl.doc_id) segtbl.doc_id,
         viterbi(segtbl.seglist, mfactor.score, rfactor.score)
  FROM segmenttbl, mfactor, rfactor, segtbl
  WHERE segtbl.doc_id = segmenttbl.doc_id
    AND segmenttbl.seg_text = 'Jaguar';

^5 We use the IIT Bombay package for training, available at http://crf.sourceforge.net
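The Viterbi recurrence behind the most probable label assignment can be sketched outside the database as follows. This is a minimal Python illustration under stated assumptions, not the C user-defined function used in MADden: the function name `viterbi_decode` and the argument layout are hypothetical, with `emit` standing in for the single-state scores (the role of rfactor) and `trans` for the two-state scores (the role of mfactor), both taken to be additive log-space scores.

```python
from typing import List


def viterbi_decode(emit: List[List[float]], trans: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence for one token sequence.

    emit[t][y]  : assumed single-state score of label y at token t
    trans[a][b] : assumed two-state score of label a followed by label b
    Scores are additive (log-space); both layouts are illustrative only.
    """
    if not emit:
        return []
    n_labels = len(emit[0])
    best = list(emit[0])          # best[y]: best score of a path ending in y
    back: List[List[int]] = []    # back-pointers, one list per token after the first

    for t in range(1, len(emit)):
        new_best, pointers = [], []
        for y in range(n_labels):
            # Pick the predecessor label that maximizes the path score into y.
            prev = max(range(n_labels), key=lambda a: best[a] + trans[a][y])
            new_best.append(best[prev] + trans[prev][y] + emit[t][y])
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    # Trace the best path backwards from the best final label.
    last = max(range(n_labels), key=lambda y: best[y])
    path = [last]
    for pointers in reversed(back):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path
```

Because each call labels one token sequence independently, a query that selects only some documents needs to run this step only for those documents, which is the property the query-driven interface relies on.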


2.3 Text Analysis Queries and Demonstration

We first describe various data sources from the NFL domain for computational journalism. We then describe the query-driven user interfaces used for exploratory text analysis applications.

2.3.1 Dataset for MADden Example

Our sample demonstration for MADden involves a variety of NFL-based data sources. The data is represented in Table ?? as an abbreviated schema.^6

The NFLCorpus table holds semi-structured data: textual data from blogs, news articles, and fan tweets with document metadata such as timestamp, tags and type, among others. The tweets were extracted using the Twitter Streaming API^7 with a series of NFL related keywords, and the news articles and blogs were extracted from various sports media websites. These documents vary in size and quality. We have around 25 million tweets from the 2011 NFL season, including plays and recaps from every game in the season.

The statistical data was extracted from the NFL.com player database. Each table contains the player's name, position, number, and a series of statistics of different types (some players show up in multiple tables, others in only one). The Player table holds information about a player in the NFL, including college, birthday, height, weight, as well as years_in_NFL. The Team table holds some basic information about the 32 NFL teams, including location, conference, division, and stadium. TeamStats_2011 holds the team rankings and statistics in a variety of categories (Offense, Defense, Special Teams, Points, etc.). Extracted_Entities stores the extracted entities found in the NFLCorpus documents.

^6 These tables may be extracted to an RDBMS, or defined over an API using a foreign data wrapper.
^7 https://dev.twitter.com/docs/streaming-apis


2.3.2 MADden Text Analytics Queries

Based on our example dataset, suppose a sports journalist wants to do an investigative piece on overall public opinion of all Florida-based NFL teams during the 2011-2012 season. Such a piece would require in-depth analysis of news reports, tweets, and blog postings, among other sources. The standard approach would consist of trawling through the text sources either by hand (with help), or using a series of different text processing toolkits and packages, sometimes specialized for a singular task. Instead of multiple tools, MADden can streamline this process with its declarative, in-database approach to text analytics. A first step may consist of paring the corpora down to just the documents related to the Florida football teams, namely the Miami Dolphins, Jacksonville Jaguars, and Tampa Bay Buccaneers.

Algorithm 2 An entity resolution query using the MADden framework.

  SELECT DISTINCT doc_id
  FROM extracted_entities
  WHERE match('Jaguars', entity) > match_thresh
     OR match('Dolphins', entity) > match_thresh
     OR match('Buccaneers', entity) > match_thresh;

The match function used in Algorithm 2 is an Entity Resolution UDF which calculates a [0, 1]-bounded inverse metric, where terms that are close to our target will have a higher score than those that are less similar. Extracted_Entities is a view constructed using entity_find. This function detects entities using one of two accuracy settings (high accuracy with lower recognition, or low accuracy with higher recognition) on textual documents as they are added to the database (in this case, likely news articles and blogs). Table ?? shows some of the current text analysis functions implemented in MADden.

A journalist may want to explore fan sentiment for the Jacksonville Jaguars based on tweets collected during the NFL season. Utilizing the first query as a building block with some changes, we can construct this query as listed in Algorithm 3.
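To make the thresholded match score of Algorithm 2 concrete, the following Python sketch computes a [0, 1]-bounded similarity from character trigrams and filters entities against a threshold. It is an illustration only: the source does not give the internals of the match UDF, so the Jaccard formula, the padding scheme (modeled loosely on common trigram matching), and the names `trigrams`, `trigram_similarity`, and `matching_entities` are all assumptions.

```python
from typing import Iterable, List, Set


def trigrams(text: str) -> Set[str]:
    """Character trigrams of a lowercased, space-padded string (assumed scheme)."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of trigram sets: 1.0 for identical strings, 0.0 for disjoint ones."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def matching_entities(target: str, entities: Iterable[str],
                      match_thresh: float = 0.5) -> List[str]:
    """Keep entities whose similarity to the target exceeds the threshold,
    mirroring the shape of `match(target, entity) > match_thresh`."""
    return [e for e in entities if trigram_similarity(target, e) > match_thresh]
```

Under these assumptions a near-variant such as 'Jaguar' scores above 0.5 against 'Jaguars' while 'Dolphins' scores 0, so lowering `match_thresh` trades precision for tolerance of misspellings and nicknames.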


Table 2-2. List of each MADden function and its NLP task

  Functions                     Tasks
  match(target, against)        Entity Resolution
  sentiment(text)               Sentiment Analysis
  entity_find(text, boolean)    Detects Named Entities
  pos_tag(text)                 POS tagging
  viterbi                       Part of speech tags using CRF
  pos_extract(text, type)       POS term extraction

Table 2-3. Abbreviated NFL dataset schema

  Tables                Attributes
  NFLCorpus             doc_id, type, text, tstmp, tags
  PlayerStats2011       pid, type_specific_stats
  Player                pid, f_name, l_name, college, etc
  Team_Stats2011        team, points, pass_yds, various_stats
  Team                  team, city, state, stadium
  Extracted_Entities    doc_id, entity

In the query of Algorithm 3, one could accommodate nicknames through OR-matching the extracted entities or an alias table, among other strategies. Notice that going from a singular text analytics task to a more complex analysis only required a small change. Whereas a traditional approach would have users either looking for a customized solution or patching together packages, the declarative SQL approach allows the user to just state what the result is.

And since we are working in SQL, we can combine queries on corpus tables with tables of structured data. For example, if our journalist wants to analyze the media opinion of the state's best receiver, he could consult both the player stats table and the media blogs, as shown in Algorithm 4.

Algorithm 3 Entity resolution and sentiment analysis in MADden.

  SELECT DISTINCT E.doc_id, E.entity, sentiment(S.document)
  FROM extracted_entities as E, NFLCorpus as S
  WHERE E.doc_id = S.doc_id
    AND sentiment(S.document) in ('+', '-')
    AND match('Jaguars', E.entity) > match_thresh
    AND S.type = 'tweet';


Algorithm 4 A MADden query over structured and unstructured data.

  SELECT BestWR.name, sentiment(A.txt), A.txt
  FROM NFLCorpus A, extracted_entities E,
       (SELECT P.fname || ' ' || P.lname as name
        FROM Player P, PlayerStats2011_Rec S
        WHERE S.pid = P.pid
          AND (P.team = 'Jaguars'
               OR P.team = 'Dolphins'
               OR P.team = 'Buccaneers')
        ORDER BY S.rec_yds DESC
        LIMIT 1) as BestWR
  WHERE E.doc_id = A.doc_id
    AND (A.type = 'blog' OR A.type = 'news')
    AND match(BestWR.name, E.entity) > match_thresh;

Algorithm 4 uses the standard structured SQL tables Player and PlayerStats2011_Rec, which represent players and their receiving stats. Our journalist finds the best receiver playing on a Florida NFL team based on receiving yards.^8 The query then discovers all the associated news and blog documents and performs the entity resolution function on the extracted entities, returning the sentiment and text associated with that player. This method is not restricted to single domain analytics. One can run analytics combining different datasets (e.g. State Economies and the NFL), utilizing the same declarative in-database methods seen here.

2.3.3 MADden User Interface

We have given an interactive demonstration of MADden's capabilities. The demonstration is based around MADden UI, a web interface that allows users to perform analytic tasks on our dataset. MADden UI has two forms of interaction: raw SQL queries, and a Mad Lib^9 style interface, with fill-in-the-blank query templates for quick interaction

^8 Total yards over a season, yds/catch, and touchdowns usually decide who the best receiver was at the end of a season.
^9 http://en.wikipedia.org/wiki/Mad_Libs


as shown in Figure 2-2. In a demonstration [41], the user interface was included to assist users in interpreting the results.

Figure 2-2. Example MADden UI query template

2.4 GPText Introduction

Many companies keep large amounts of text data in relational databases. Several challenges exist in performing analysis on such datasets using state-of-the-art systems. First, expensive data transfer costs must be paid up-front to move data between databases and analytics systems. Second, many popular text analytics packages do not scale up to production-sized datasets. In this section, we introduce GPText, a Greenplum parallel statistical text analysis framework that addresses the above problems by supporting statistical inference and learning algorithms natively in a massively parallel processing database system. GPText seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADLib, an open source library for scalable in-database analytics which can be installed on PostgreSQL and Greenplum. In addition, GPText also developed and contributed a linear-chain conditional random field (CRF) module to MADLib to enable information extraction tasks such as part-of-speech tagging and named entity recognition. We show the performance and scalability of the parallel CRF implementation. Finally, we describe an e-discovery application built on the GPText framework.

Text analytics has gained much attention in the big data research community due to the large amounts of text data generated every day in organizations such as companies, government and hospitals in the form of emails, electronic notes and internal documents. Many companies store this text data in relational databases because they rely on databases for their daily business needs. A good understanding of this unstructured text


data is crucial for companies to make business decisions, for doctors to assess their patients, and for lawyers to accelerate document review processes.

Traditional business intelligence pulls content from databases into other massive data warehouses to analyze the data. The typical "data movement process" involves moving information from the database for analysis using external tools and storing the final product back into the database. This movement process is time consuming and prohibitive for interactive analytics. Minimizing the movement of data is a huge incentive for businesses and researchers. One way to achieve this is for the data store to be in the same location as the analytic engine.

While Hadoop has become a popular platform to perform large-scale data analytics, newer parallel processing relational databases can also leverage more nodes and cores to handle large-scale datasets. The Greenplum database, built upon the open source database PostgreSQL, is a parallel database that adopts a shared-nothing massively parallel processing (MPP) architecture. Database researchers and vendors are capitalizing on the increase in database cores and nodes and investing in open-source data analytics ventures such as the MADLib project [23, 45]. MADLib is an open-source library for scalable in-database analytics on Greenplum and PostgreSQL. It provides parallel implementations of many machine learning algorithms.

In this chapter, we motivate in-database text analytics by showing GPText, a powerful and scalable text analysis framework developed on the Greenplum MPP database. GPText inherits scalable indexing, keyword search, and faceted search functionalities from an effective integration of the Solr search engine [33]. GPText uses and contributes statistical methods to the MADLib open-source library. We show that we can use SQL and user-defined aggregates to implement conditional random fields (CRFs) methods for information extraction in parallel. The experiment shows sublinear improvement in runtime for both CRF learning and inference with a linear increase in the number of cores. As far as we know, GPText is the first toolkit for statistical text analysis in


relational database management systems. Finally, we describe the need and requirements for e-discovery applications and show that GPText is an effective platform to develop such sophisticated text analysis applications.

2.4.1 GPText Related Work

Researchers have created systems for large scale text analytics including GATE, PurpleSox and SystemT [15, 26, 60]. While both GATE and SystemT use a rule-based approach for information extraction (IE), PurpleSox uses statistical IE models. However, none of the above systems is natively built for an MPP framework.

The parallel CRF in-database implementation follows the MAD methodology [45]. In a similar vein, researchers show that most of the machine learning algorithms can be expressed through unified RDBMS architectures [31]. Recently, parallel machine learning algorithms have been developed by many research groups [16, 56, 81].

There are several implementations of conditional random fields but only a few large scale implementations for NLP tasks. One example is the PCRFs [76] that are implemented over massively parallel processing systems supporting Message Passing Interface (MPI) such as Cray XT3, SGI Altix, and IBM SP. However, this is not implemented over an RDBMS.

2.4.2 Greenplum Text Analytics

GPText runs on the Greenplum database (GP), which is a shared-nothing massively parallel processing database. The Greenplum database adopts the most widely used master-slave parallel computing model where one master node orchestrates multiple slaves to work until the task is completed and there is no direct communication between slaves. As shown in Figure 2-3, it is a collection of PostgreSQL instances including one master instance and multiple slave instances (segments). The master node accepts SQL queries from clients, then divides the workloads and sends sub-tasks to the segments. Besides harnessing the power of a collection of computer nodes, each node can also be configured with multiple segments to fully utilize the multicore processor. To provide high availability,


GP provides the options to deploy a redundant standby master and mirroring segments in case of master segment and primary segment failure.

The embarrassingly parallel processing capability powered by the Greenplum MPP framework lays the cornerstone to enable GPText to process production-sized text data. Besides the underlying MPP framework, GPText also inherits the features that a traditional database system brings (for example, online expansion, data recovery and performance monitoring, to name a few). On top of the underlying MPP framework, there are two building blocks, MADLib and Solr (illustrated in Figure 2-3), which distinguish GPText from many of the existing text analysis tools. MADLib makes GPText capable of performing sophisticated text data analysis tasks, such as part-of-speech tagging, named-entity recognition, document classification and topic modeling with a vast amount of parallelism. Solr is a reliable and scalable text search platform from the Apache Lucene project and it has been widely deployed in web servers. Its major features include powerful full-text search, faceted search and near real time indexing. As shown in Figure 2-3, GPText uses Solr to create distributed indexing. Each primary segment is associated with exactly one Solr instance where the index of the data in the primary segment is stored for the purpose of load balancing. GPText has all the features that Solr has since Solr is integrated into GPText seamlessly. GPText also contributes statistical methods to MADLib for text analysis, such as CRF.

2.4.2.1 In-database document representation

In GPText, a document can be represented as a vector of counts against a token dictionary, which is a vector of unique tokens in the dataset. For efficient storage and memory, GPText uses a sparse vector representation for each document instead of a naive vector representation. The following is an example with two different vector representations of a document. The dictionary contains all the unique terms (i.e., 1-grams) existing in the corpus.

Dictionary: {am, before, being, bothered, corpus, document, in, is, me,


never, now, one, really, second, the, third, this, until}
Document: {i, am, second, document, in, the, corpus}
Naive vector representation: {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1}
Sparse vector representation: {1,0,1,1,1,1,0,1,1}:{1,3,1,1,1,1,6,1,1}

GPText adopts run-length encoding to compress the naive vector representation using pairs of count-value vectors. The first vector in the sparse vector representation contains count values and the second vector is the number of contiguous appearances of the value in the first vector. Although not apparent in this example, the advantage of the sparse vector representation is dramatic in real world documents, where most of the elements in the vector are zero.

Figure 2-3. The GPText architecture over Greenplum database
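The run-length encoding described above can be sketched in a few lines of Python. This is a generic illustration rather than GPText's in-database sparse vector type; the function names `rle_encode` and `rle_decode` are hypothetical, and the encoder always merges adjacent equal values into a single run.

```python
from typing import List, Tuple


def rle_encode(dense: List[int]) -> Tuple[List[int], List[int]]:
    """Compress a dense count vector into (values, run_lengths)."""
    values: List[int] = []
    runs: List[int] = []
    for x in dense:
        if values and values[-1] == x:
            runs[-1] += 1          # extend the current run
        else:
            values.append(x)       # start a new run
            runs.append(1)
    return values, runs


def rle_decode(values: List[int], runs: List[int]) -> List[int]:
    """Expand (values, run_lengths) back into the dense vector."""
    dense: List[int] = []
    for value, length in zip(values, runs):
        dense.extend([value] * length)
    return dense


# The 16-element naive vector from the example in the text.
naive = [1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
values, runs = rle_encode(naive)
assert rle_decode(values, runs) == naive
```

For a vocabulary of many thousands of terms, a typical document touches only a few of them, so the long zero runs collapse into single (0, length) pairs and the encoded form is far shorter than the dense one.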


2.4.2.2 ML-based advanced text analysis

GPText relies on multiple machine learning modules in MADLib to perform statistical text analysis. Three of the commonly used modules are k-means for document clustering, multinomial naive Bayes (multinomialNB) for document classification, and latent Dirichlet allocation (LDA) for topic modeling, dimensionality reduction and feature extraction. Performing k-means, multinomialNB, or LDA in GPText follows the same pipeline:

1. Create Solr index for documents.
2. Configure and populate Solr index.
3. Create terms table for each document.
4. Create a dictionary of the unique terms across documents.
5. Construct term vector using the term frequency-inverse document frequency (tf-idf) for each document.
6. Run MADLib k-means/MultinomialNB/LDA algorithm.

The following section details the implementation of the CRF learning and inference modules that we developed for GPText applications as part of MADLib.

2.4.3 CRF for IE over MPP Databases

A conditional random field (CRF) is a type of discriminative undirected probabilistic graphical model. Linear-chain CRFs are special CRFs that assume that the next state depends only on the current state. Linear-chain CRFs achieve state-of-the-art accuracy in many real world natural language processing (NLP) tasks such as part of speech tagging and named entity recognition.

2.4.3.1 Implementation overview

Figure 2-4 illustrates the detailed implementation of the CRF module developed for IE tasks in GPText. The top box shows the pipeline of the training phase. The bottom box shows the pipeline for the inference phase. We use declarative SQL statements to extract all features from text. Any features in the state-of-the-art packages can be extracted using one single SQL clause. All of the common features described in the


literature can be extracted with one SQL statement. The extracted features are stored in a relation for either single or two-state features. After the feature extraction, we use user-defined aggregates (UDAs) to calculate the maximum a posteriori (MAP) configuration and probability for inference. For learning, we use UDFs to implement gradient and log-likelihood in parallel.

Figure 2-4. The MADLib CRF overall system architecture

2.4.3.2 Feature extraction using SQL

Text feature extraction is a step in most statistical text analysis methods. We are able to implement all of the seven types of features used in POS and NER using exactly seven SQL statements. These features include:

- Dictionary: does this token exist in a dictionary?
- Regex: does this token match a regular expression?
- Edge: is the label of a token correlated with the label of a previous token?
- Word: does this token appear in the training data?
- Unknown: does this token appear in the training data below a certain threshold?
- Start/End: is this token first/last in the token sequence?
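The per-token feature types in the list above can be illustrated with a small Python sketch. This is not the SQL used in MADLib; the function name `token_features`, the feature tags ("D", "R_", "W_", "U", "S", "End"), and the `rare_threshold` parameter are assumptions made for illustration. Edge features are omitted because they depend on label pairs rather than on a single token.

```python
import re
from typing import Dict, List, Set


def token_features(tokens: List[str],
                   dictionary: Set[str],
                   patterns: Dict[str, str],
                   train_counts: Dict[str, int],
                   rare_threshold: int = 2) -> List[Set[str]]:
    """Return one set of (hypothetical) feature tags per token."""
    features: List[Set[str]] = []
    last = len(tokens) - 1
    for i, tok in enumerate(tokens):
        f: Set[str] = set()
        if tok.lower() in dictionary:            # Dictionary feature
            f.add("D")
        for name, pattern in patterns.items():   # Regex features (unanchored search)
            if re.search(pattern, tok):
                f.add("R_" + name)
        count = train_counts.get(tok, 0)
        if count > 0:                            # Word feature: seen in training data
            f.add("W_" + tok)
        if count < rare_threshold:               # Unknown feature: rare or unseen
            f.add("U")
        if i == 0:                               # Start feature
            f.add("S")
        if i == last:                            # End feature
            f.add("End")
        features.append(f)
    return features
```

Each check is a simple per-token predicate or lookup, which is why each feature type maps naturally onto a single set-oriented SQL statement over a token table.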


There are many advantages to extracting features using SQL. The SQL statements hide a lot of the complexity present in the actual operation. It turns out that each type of feature can be extracted using exactly one SQL statement, making the feature extraction code extremely succinct. Additionally, SQL statements are naively parallel due to the set semantics supported by relational DBMSs. For example, we compute features for each distinct token and avoid re-computing the features for repeated tokens.

In Algorithm 5 and Algorithm 6 we show how to extract edge and regex features, respectively. Algorithm 5 extracts adjacent labels from sentences and stores them in an array. Algorithm 6 shows a query that selects all the sentences that satisfy any of the regular expressions present in the table regextbl.

Algorithm 5 Query for extracting edge features

  SELECT doc2.pos, doc2.doc_id, 'E.', ARRAY[doc1.label, doc2.label]
  FROM segmenttbl doc1, segmenttbl doc2
  WHERE doc1.doc_id = doc2.doc_id AND doc1.pos + 1 = doc2.pos

Algorithm 6 Query for extracting regex features

  SELECT start_pos, doc_id, 'R_' || r.name, ARRAY[-1, label]
  FROM regextbl r, segmenttbl s
  WHERE s.seg_text ~ r.pattern

2.4.3.3 Parallel linear-chain CRF training

Programming Model. In Algorithm 7 we show the parallel CRF training strategy. The algorithm is expressed as a user-defined aggregate. User-defined aggregates are composed of three parts: a transition function (Algorithm 8), a merge function and a finalization function (Algorithm 9). We describe these functions in the following.

In line 1 of Algorithm 7 the Initialization function creates a state object in the database. This object contains coefficient (w), gradient (∇) and log-likelihood (L) variables. This state is loaded (line 3) and saved (line 8) between iterations. We compute


the gradient and log-likelihood of each segment in parallel (line 4), much like a Map function. Then line 7 computes the new coefficients, much like a reduce function.

Algorithm 7 CRF training(z_{1:M})

  Input:  z_{1:M},                                   ▷ Document set
          Convergence(), Initialization(), Transition(), Finalization()
  Output: Coefficients w ∈ R^N
  Initialization/Precondition: iteration = 0
  1: Initialization(state)
  2: repeat
  3:   state ← LoadState()
  4:   for all m ∈ 1..M do
  5:     state ← Transition(state, z_m)
  6:   end for
  7:   state ← Finalization(state)
  8:   WriteState(state)
  9: until Convergence(state, iteration)
  return state.w

Transition strategies. Algorithm 8 contains the logic of computing the gradient and log-likelihood for each tuple using the forward-backward algorithm. This algorithm is invoked in parallel over many segments and the results of these functions are combined using the merge function.

Algorithm 8 transition-lbfgs(state, z_m)

  Input:  state,                                     ▷ Transition state
          z_m,                                       ▷ A Document
          Gradient()
  Output: state
  1: {state.∇, state.L} ← Gradient(state, z_m)
  2: state.num_rows ← state.num_rows + 1
  return state

Finalization strategy. The finalization function invokes the L-BFGS convex solver to get a new coefficient vector.
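The transition, merge, and finalization structure of a user-defined aggregate can be sketched in Python on a deliberately simpler problem. This is an illustration of the aggregate pattern only, under loud assumptions: it fits a least-squares model instead of a CRF, the per-row gradient replaces the forward-backward computation, and the finalization step takes a plain gradient-descent step instead of calling L-BFGS. All names (`State`, `transition`, `merge`, `finalize`, `train`) are hypothetical, and `train` assumes at least one segment.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], float]  # (feature vector x, target y)


@dataclass
class State:
    w: List[float]                       # coefficients
    grad: List[float] = field(default_factory=list)
    loss: float = 0.0
    num_rows: int = 0

    def __post_init__(self) -> None:
        if not self.grad:
            self.grad = [0.0] * len(self.w)


def transition(state: State, row: Row) -> State:
    """Fold one row into the running gradient and loss (stand-in for Algorithm 8)."""
    x, y = row
    err = sum(wi * xi for wi, xi in zip(state.w, x)) - y
    state.grad = [g + err * xi for g, xi in zip(state.grad, x)]
    state.loss += 0.5 * err * err
    state.num_rows += 1
    return state


def merge(a: State, b: State) -> State:
    """Combine partial states computed on different segments."""
    return State(w=a.w,
                 grad=[ga + gb for ga, gb in zip(a.grad, b.grad)],
                 loss=a.loss + b.loss,
                 num_rows=a.num_rows + b.num_rows)


def finalize(state: State, step: float = 0.1) -> State:
    """Produce new coefficients; a gradient step stands in for the L-BFGS solver."""
    n = max(state.num_rows, 1)
    return State(w=[wi - step * g / n for wi, g in zip(state.w, state.grad)])


def train(segments: List[List[Row]], w0: Sequence[float], iterations: int = 200) -> List[float]:
    state = State(w=list(w0))
    for _ in range(iterations):
        partials = []
        for seg in segments:             # each segment could run on its own server
            s = State(w=state.w)
            for row in seg:
                s = transition(s, row)
            partials.append(s)
        merged = partials[0]
        for p in partials[1:]:
            merged = merge(merged, p)
        state = finalize(merged)
    return state.w
```

Because `transition` only accumulates sums, the merged state is the same however the rows are split across segments, which is what lets the database evaluate the aggregate in parallel before a single finalization step.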


Algorithm 9 finalization-lbfgs(state)

  Input:  state,
          LBFGS()                                    ▷ Convex optimization solver
  Output: state
  1: {state.∇, state.L} ← penalty(state.∇, state.L)
  2: instance ← LBFGS.init(state)
  3: instance.lbfgs()                                ▷ invoke the L-BFGS solver
  return instance.state

Limited-memory BFGS (L-BFGS), a variation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, is a leading method for large scale unconstrained convex optimization. We translate an in-memory Java implementation [68] to a C++ in-database implementation. Before each iteration of L-BFGS optimization, we need to initialize the L-BFGS with the current state object. At the end of each iteration, we need to dump the updated variables to the database state for the next iteration.

2.4.3.4 Parallel linear-chain CRF inference

The Viterbi algorithm is used to find the k most likely labelings of a document for CRF models. We chose to implement an SQL clause to drive the Viterbi inference. The Viterbi inference is implemented sequentially and each function call finishes the labeling of one document. However, in Greenplum, Viterbi can be run in parallel over different subsets of the documents on a multi-core machine. So, the CRF inference is naively parallel.

2.4.4 GPText Experiments and Results

In order to evaluate the performance and scalability of the linear-chain CRF learning and inference on Greenplum, we conducted experiments on various data sizes on a 32-core machine with a 2 TB hard drive and 64 GB memory. We used the CoNLL2000 dataset containing 8936 tagged sentences for learning. This dataset is labeled with 45 POS tags. To evaluate the inference performance, we extracted 1.2 million sentences from the New York Times dataset. In Figure 2-5 and Figure 2-6 we show the algorithm is sublinear and


improves with an increase in the number of segments. Our POS implementation achieves 0.9715 accuracy, which is consistent with the state of the art [63].

Figure 2-5. Linear-chain CRF training scalability

Figure 2-6. Linear-chain CRF inference scalability

2.4.5 GPText Application

With the support of the Greenplum MPP data processing framework, an efficient Solr indexing/search engine, and parallelized statistical text analysis modules, GPText positions itself to be a great platform for many applications that need to apply scalable text analytics methods of varying sophistication over unstructured data in databases. One such application is e-discovery.

E-discovery has become increasingly important in legal processes. Large enterprises keep terabytes of text data such as emails and internal documents in their databases. Traditional civil litigation often involves reviews of large amounts of documents on both the plaintiff's side and the defendant's side. E-discovery provides tools to pre-filter documents


for review to provide a more speedy, inexpensive and accurate solution. Traditionally, simple keyword search is used for such document retrieval tasks in e-discovery. Recently, predictive coding based on more sophisticated statistical text analysis methods is gaining more and more attention in civil litigation since it provides higher precision and recall in retrieval. In 2012, judges in several cases approved the use of predictive coding based on the indication of Rules of the Federal Rules of Civil Procedure [1].

GPText developed a prototype of an e-discovery tool using the Enron dataset. Figure 2-7 is a snapshot of the e-discovery application. In the top pane, the keyword 'illegal' is specified and Solr is used to retrieve all relevant emails that contain the keyword, which are displayed in the bottom pane. As shown in the middle pane on the right, it also supports topic discovery in the email corpus using LDA and classification using k-means. K-means uses LDA and CRF to reduce the dimensionality of features to speed up the k-means classification. CRF is also used to extract named entities. The topics are displayed in the Results panel. The tool also provides visualization of aggregate information and faceted search over the email corpus.

Figure 2-7. GPText application


2.4.6 GPText Summary

We introduce GPText, a parallel statistical text analysis framework over an MPP database. With the seamless integration of Solr and MADLib, GPText is a framework with a powerful search engine and advanced statistical text analysis capabilities. We implemented a parallel CRF inference and training module for IE. The functionalities and scalability provided by GPText position it to be a great tool for sophisticated text analytics applications such as e-discovery.


CHAPTER 3
MAKING ENTITY RESOLUTION QUERY-DRIVEN

In this chapter, I present a technique to make an important problem query-driven. I take a model of entity resolution described by McCallum et al. [67] and describe how to express a query over this model to return only the answer requested by the data scientist. This method reduces the amount of work for obtaining the requested answers.

3.1 Query-Driven Entity Resolution Introduction

Entity resolution (ER) is the process of identifying and linking/grouping different manifestations (e.g., mentions, noun phrases, named entities) of the same real world object. It is a crucial task for many applications including knowledge base construction, information extraction, and question answering. For decades, ER has been studied in both database and natural language processing communities to link database records or to perform entity resolution over extracted mentions (noun phrases) in text.

ER is a notoriously difficult and expensive task. Traditionally, entities are resolved using strict pairwise similarity, which usually leads to inconsistencies and low accuracy due to localized, myopic decisions [98]. More recently, collective entity resolution methods have achieved state-of-the-art accuracy because they leverage relational information in the data to determine resolution jointly rather than independently [10]. However, it is expensive to run collective ER based on probabilistic graphical models (GMs), especially for cross-document entity resolution, where ER must be performed over millions of mentions.

In many previous approaches, collective ER is performed exhaustively over all the mentions in a data set, returning all entities. Researchers have developed new methods to perform large-scale cross-document entity resolution over parallel frameworks [86, 98]. However, in many ER applications, users are only interested in one or a small subset of entities. This key observation motivates query-driven ER, an alternative approach to solving the scalability problem for ER.


Compared to previous ER models and algorithms, query-driven techniques in this chapter scale to data sets that are in many cases three orders of magnitude larger. Moreover, the ER model in this chapter is general enough to take both bibliographic records and mentions extracted from unstructured text. Query-driven ER techniques over GMs can also be generalized for other applications to perform query-driven inference.

This work follows a line of research on implementing ML models inside of databases [44, 58, 97]. Researchers use factor graphs because this flexible representation works well with other machine learning algorithms. ER is ubiquitous and an important part of many analytic pipelines; a probabilistic database implementation is natural.

In this chapter, we first introduce SQL-like queries that involve ER operations. These ER operators are an SQL comparison operator (i.e., ER-based equality) that returns true if two mentions map to the same entity. Factor graphs, a type of GM, are used to model the collective entity resolution over extracted mentions from text. Using this ER-based comparison operator, users can pose selection queries to find all mentions that map to a single entity or pose join queries to find mentions that map to the subset of entities that they are interested in resolving.

Because exhaustive ER is expensive, it is common to use blocking techniques to partition the data set into approximately similar groups called canopies. Query-driven ER in this chapter differs from blocking in two important ways: (1) deterministic blocks are replaced by a pairwise distance-based metric, and (2) blocks or canopies are implicit to the query-driven ER data set and do not have to be created in advance. The latter point, implicit blocking, is realized using a data structure created based on the similarity to a query mention. This data structure allows parameters to include or remove mentions from the working data set. This property is similar to the iterative blocking technique [96], which is shown to improve ER accuracy. Such an approach can dramatically amortize the overall ER cost, making it suitable for the pay-as-you-go paradigm in dataspaces [62].
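One way to picture implicit blocking is a structure that ranks mentions by their similarity to the query mention once, and then lets a threshold parameter grow or shrink the working set without rebuilding anything. The Python sketch below is only an illustration of that idea under stated assumptions: the class name `QueryCanopy`, the bigram Jaccard similarity, and the threshold semantics are hypothetical and are not taken from the data structure developed in this chapter.

```python
import bisect
from typing import Callable, List, Set


def bigram_jaccard(a: str, b: str) -> float:
    """Assumed similarity: Jaccard overlap of lowercased character bigrams."""
    def grams(s: str) -> Set[str]:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a), grams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


class QueryCanopy:
    """Mentions ordered by similarity to one query mention (hypothetical structure)."""

    def __init__(self, query: str, mentions: List[str],
                 similarity: Callable[[str, str], float] = bigram_jaccard) -> None:
        scored = sorted(((similarity(query, m), m) for m in mentions),
                        key=lambda pair: -pair[0])
        # Negated scores are ascending, so bisect can locate the cut-off.
        self._neg_scores = [-score for score, _ in scored]
        self._mentions = [m for _, m in scored]

    def working_set(self, threshold: float) -> List[str]:
        """Mentions with similarity >= threshold; lowering it admits more mentions."""
        cut = bisect.bisect_right(self._neg_scores, -threshold)
        return self._mentions[:cut]
```

The sort is paid once per query mention; afterwards each change of the threshold is a binary search, so a query can tighten or relax its canopy cheaply between sampling rounds.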

To support ER driven by queries, we develop three sampling algorithms for MCMC inference over graphical models. More specifically, instead of a uniform sampling distribution, we sample from a distribution that is biased toward the query. We develop query-driven sampling techniques that maximize the resolution of the target query entity (target-fixed) and bias the samples based on the pairwise similarity metric between mentions and query nodes (query-proportional). We also introduce a hybrid method that performs query-proportional sampling over a fixed target. We develop two optimizations to the query-proportional and hybrid methods to model the similarity and dissimilarity between the mentions and the query entity, i.e., attract and repel scores. In the first, target-fixed algorithm, we adapt the samples to resolve the query entity. The second, query-proportional algorithm selects mentions based on their probabilistic similarity to the query entity. The third, hybrid algorithm combines the two approaches. A summary of the approaches can be found in Table 3-3.

When a user is interested in resolving more than one entity, we employ multi-node ER techniques. To implement multi-node ER queries, single-node ER techniques may be naively performed iteratively to resolve one entity at a time. However, such an algorithm can lead to unoptimized resource allocation if the same number of samples is generated for each target entity, or to low throughput if one of the entities has a disproportionately low convergence rate. To alleviate this problem, we present three multi-query ER algorithms that schedule the sample generation among query nodes in order to improve the overall convergence rate.

In summary, the contributions of this chapter are the following:

- We define a query-driven ER problem for cross-document, collective ER over text extracted from unstructured datasets;

- We develop three single-node algorithms that perform focused sampling and reduce convergence time by orders of magnitude compared to a non-query-driven baseline (Section 3.4). We develop two influence functions that use attract and repel techniques to grow or shrink query entities (Section 3.5.1);

- We develop scheduling algorithms to optimize the overall convergence rate of multi-query ER (Section 3.5.2). The best scheduling algorithm is based on the selectivity of the different target entities (Section 3.5.3).

The results show that query-driven ER algorithms are a promising method of enabling real-time, ad hoc, ER-based queries over large datasets. Single-node queries of different selectivity converge to a high-quality entity within 1-2 minutes over a newswire dataset containing 71 million mentions. Experiments also show that such real-time ER query answering allows users to iteratively refine ER queries by adding context to achieve better accuracy (Section 3.6).

3.2 Query-Driven Entity Resolution Preliminaries

In this section we present the foundational concepts used in this chapter. We start with an introduction to factor graphs, then discuss sampling techniques over this model. Finally, we formally introduce state-of-the-art entity resolution approaches and explain their origin.

3.2.1 Factor Graphs

Graphical models are a formalism for specifying complex probability distributions over many interdependent random variables. Factor graphs are bipartite graphical models that can capture arbitrary relationships between random variables through the use of factors [53]. As depicted in Figure 3-1, links always connect random variables (represented as circles) and factor nodes (represented as black squares). Factors are functions that take as input the current setting of the connected random variables and output a positive real-valued scalar indicating the compatibility of the random variables' settings. The probability of a setting of all the random variables is a normalized product of all the factors. Intuitively, the highest-probability settings have variable assignments that yield the highest factor scores.

We use factor graphs to represent complex entity resolution relationships. Nodes (random variables) may correspond to mentions of people, places and organizations in documents. Nodes also represent the random variables that correspond to groups of
mentions (entities); these nodes are accompanied by clouds in Figure 3-1. The factors between mentions and entities give us a sound representation for many possible states. The factor graph model also gives us a simple mathematical expression of the relationship.

Figure 3-1. Three-node factor graph. Circles (random variables) labeled $m_i$ represent mentions and those labeled $e_i$ represent entities. Clouds are added for visual emphasis of entity clusters.

Formally, a factor graph $G = \langle x, \psi \rangle$ contains a set of random variables $x = \{x_i\}_1^n$ and factors $\psi = \{\psi_i\}_1^m$. Each factor $\psi_i$ maps the subset of variables it is associated with to a non-negative compatibility value. The probability of a setting $\omega$ among the set of all possible settings $\Omega$ occurring in the factor graph is given by a probability measure:

$$\pi(\omega) = \frac{1}{Z} \sum_{x \in \omega} \prod_{i=1}^{m} \psi_i(x^i), \qquad Z = \sum_{\omega \in \Omega} \sum_{x \in \omega} \prod_{i=1}^{m} \psi_i(x^i)$$

where $x^i$ is the set of random variables that neighbor the factor $\psi_i$ and $Z$ is the normalizing constant.

Querying graphical models produces the most likely setting for the random variables. A query on a factor graph is defined as a triple $\langle x_q, x_l, x_e \rangle$ where $x_q$ is the set of nodes in question, $x_l$ is a set of latent nodes (entities) that are marginalized, and $x_e$ is a set of evidence nodes (observed mentions). A query task is a sum over all the latent variables and a maximization of the query probability. A query over the factor graph is defined as

$$Q(x_q, x_l, x_e, \pi) = \operatorname*{argmax}_{x_q} \sum_{v_l \in x_l} \pi(x_q \cup v_l \cup x_e).$$
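As a concrete illustration of these definitions, the following is a minimal brute-force sketch, not the implementation used in this chapter. It treats a setting as one full assignment of values to variables, computes the normalized product of factors, and answers a query by summing out the remaining variables. The variable names, the `agree` factor and its values are assumptions made up for the example; enumeration is only feasible for toy graphs.

```python
from itertools import product


def unnormalized(setting, factors):
    """Product of every factor evaluated on its neighboring variables."""
    value = 1.0
    for scope, fn in factors:
        value *= fn(*(setting[name] for name in scope))
    return value


def distribution(domains, factors):
    """Enumerate all settings and divide by the normalizing constant Z."""
    names = list(domains)
    settings = [dict(zip(names, values))
                for values in product(*(domains[n] for n in names))]
    weights = [unnormalized(s, factors) for s in settings]
    z = sum(weights)
    return [(s, w / z) for s, w in zip(settings, weights)]


def query(domains, factors, query_vars, evidence):
    """argmax over the query variables, summing out all other variables
    while keeping the evidence variables fixed."""
    totals = {}
    for setting, p in distribution(domains, factors):
        if all(setting[k] == v for k, v in evidence.items()):
            key = tuple(setting[name] for name in query_vars)
            totals[key] = totals.get(key, 0.0) + p
    return max(totals, key=totals.get)


def agree(a, b):
    """Hypothetical compatibility factor that rewards equal labels."""
    return 3.0 if a == b else 1.0


# Toy chain a - b - c with binary labels (assumed for illustration).
DOMAINS = {"a": [0, 1], "b": [0, 1], "c": [0, 1]}
FACTORS = [(("a", "b"), agree), (("b", "c"), agree)]
```

With the evidence `{"c": 1}` and query variable `a`, the latent variable `b` is summed out and the label of `a` that agrees with the evidence wins. The exhaustive enumeration is exactly what makes exact inference intractable on large graphs, which motivates the sampling approach discussed next.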

To obtain the best setting of the nodes in question, inference is required. We refer the reader to our previous work for a detailed discussion of inference over factor graphs and a derivation of the technique [100].

3.2.2 Inference over Factor Graphs

Several methods exist for performing inference over factor graphs. The entity resolution factor graph, being pairwise, is dense and highly connected. This property suggests the best methods for inference are Markov Chain Monte Carlo (MCMC) methods; in particular, we use a Metropolis-Hastings (MH) variant [53].

The idea of MCMC-MH is to propose modifications to a current setting and use the model to decide whether to accept or reject the proposed setting as a replacement for the current setting. When the models are being scored, only the factors touching nodes with changed values (the Markov blanket) need to be recomputed. We accept or reject changes so the model can iteratively proceed to an optimal setting.

More formally, consider an MCMC transition function $T : \Omega \times \Omega \to [0, 1]$ where, given the current setting $\omega$, we can sample a subsequent setting $\omega'$. The probability of accepting a transition given a graphical model distribution $\pi$ is:

$$A(\omega, \omega') = \min\left(1, \frac{\pi(\omega')\, T(\omega, \omega')}{\pi(\omega)\, T(\omega', \omega)}\right). \qquad (3\text{-}1)$$

Additionally, the intractable partition function $Z$ cancels out, making sample generation inexpensive. This property allows us to calculate the probability of accepting the next state by simply computing the difference in score between the next and current state [100].
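The cancellation of the partition function can be seen in a small sketch of the standard Metropolis-Hastings acceptance rule written in log space. This is an illustration under stated assumptions, not code from this work: the function name and parameters are hypothetical, scores are assumed to be unnormalized log scores (sums of log factor values), and `log_forward` and `log_backward` are the log probabilities of proposing the move and of proposing its reverse.

```python
import math


def acceptance_probability(score_current, score_proposed,
                           log_forward=0.0, log_backward=0.0):
    """Standard MH acceptance probability from unnormalized log scores.

    Only the difference of the two scores is used, so any shared
    additive constant (such as -log Z) has no effect on the result.
    The proposal terms default to 0.0, i.e. a symmetric proposal.
    """
    log_ratio = (score_proposed - score_current) + (log_backward - log_forward)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)
```

For example, a proposal that lowers the log score by 1 under a symmetric proposal is accepted with probability `exp(-1)`, whether or not `log Z` has been subtracted from both scores.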

We say the algorithm converges when a steady state is reached.¹ Intelligently sampling next states decreases the time to convergence. Convergence in MCMC is difficult to verify [25]; we discuss convergence estimation in Section 3.6.1.

3.2.3 Cross-Document Entity Resolution

Cross-document ER is the problem of clustering mentions that appear across independent sets of documents into groups of mentions that correspond to the same real-world entity. These ER tasks typically assume a set of preprocessed documents and perform linking across documents [7, 86]. The scale of the cross-document ER problem is typically several orders of magnitude larger than that of intra-document ER. There are no document boundaries to limit inference scope, and entity mentions may be distributed arbitrarily across millions of documents.

To model cross-document ER, let $\mathcal{M} = \{m_1, \ldots, m_{|\mathcal{M}|}\}$ be the set of mentions in a dataset. Each mention $m_i$ contains a set of attribute-value data points. Let $\mathcal{E} = \{e_1, \ldots, e_{|\mathcal{M}|}\}$ represent the set of entities, where each $e_i$ contains zero or more mentions. Note that we assume the number of entities is no more than the number of mentions and no less than 1: each mention may correspond to a unique entity, or all mentions may correspond to a single entity.

The baseline method of entity resolution is a straightforward application of the MCMC-MH algorithm. We show pseudocode for the baseline method in Algorithm 10. Algorithm 10 takes as input a set of entities $\mathcal{E}$ and samples, which is the number of iterations of the algorithm (or a function to estimate convergence). The algorithm samples two entities from the entity set and moves one random² mention into the other entity. After the move, the algorithm checks for an improvement in the overall score of the model.

¹ We refer to the literature for a more detailed description of convergence [100].
² Given a set $X$, the function $x \sim_u X$ makes a uniform sample from the set $X$ into a variable $x$.
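The baseline procedure described above can be sketched in a few lines. This is a hedged illustration rather than the Factorie implementation: entities are plain Python sets, the model score is assumed to be the sum of pairwise weights between mentions that share an entity, and `same_prefix` is a made-up weight function used only to exercise the loop.

```python
import random


def model_score(entities, weight):
    """Sum the pairwise weights between mentions that share an entity."""
    total = 0.0
    for entity in entities:
        members = sorted(entity)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                total += weight(a, b)
    return total


def baseline_er(mentions, weight, samples, seed=0):
    """Uniform proposals; a move is kept only if the score improves."""
    rng = random.Random(seed)
    entities = [{m} for m in mentions]        # one mention per entity
    for _ in range(samples):
        source = rng.choice(entities)
        target = rng.choice(entities)
        if source is target or not source:
            continue                          # nothing to move
        mention = rng.choice(sorted(source))
        before = model_score(entities, weight)
        source.remove(mention)
        target.add(mention)
        if model_score(entities, weight) <= before:
            target.remove(mention)            # reject: undo the move
            source.add(mention)
    return [e for e in entities if e]         # drop empty entities


def same_prefix(a, b):
    """Hypothetical weight: +1 if the first characters match, else -1."""
    return 1.0 if a[0] == b[0] else -1.0
```

Calling `baseline_er(["a1", "a2", "b1", "b2"], same_prefix, samples=2000)` groups the mentions by prefix. The sketch rescores the whole model on every proposal for clarity; a practical implementation would rescore only the factors touched by the move.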

Algorithm 10. The baseline entity resolution algorithm using Metropolis-Hastings sampling
INPUT: A set of unresolved entities E, each with one mention m.
INPUT: A positive integer samples.
OUTPUT: A set of resolved entities E.
  while samples-- > 0 do
    e_i ~u E
    e_j ~u E
    m ~u e_i
    E' <- move(E, m, e_j)
    if score(E) < score(E') then
      E <- E'
    end if
  end while
  return E

If the model score improves, the changes are kept; otherwise the proposed changes are ignored. The SCORE function sums the weights of all the edges in the given entities to obtain a value for the model. This is equivalent to the probability of the setting $\omega$ as described in Section 3.2.1.

Blocking. Blocking, or canopy generation, is a preprocessing technique that partitions large amounts of data into smaller chunks, or blocks, of items that are likely to be matches [64]. Blocking can use simple and fast techniques such as sorting based on attributes, or more advanced techniques that map similar items onto a vector space [28, 84, 96].

In this chapter, we use two methods of blocking. First, we use an approximate string match over all the mentions in the database. To perform the approximate string filter we use a q-grams technique over all the mentions in the database. This method creates an inverted index for each mention in the database so a query can be performed to look for all words that contain a sufficient number of matching q-grams. This gives us a fast, high-recall filter over many records [42].

The second is an implicit blocking structure created by computing the influence a query node has on the other nodes in the dataset (see Section 3.5.1). This method
uses an estimate of the distance between the query nodes and the candidate mentions to prioritize samples.

3.3 Query-Driven Entity Resolution Problem Statement

In this section, we formally define the problem of query-driven ER. We use an SQL-like formalism to model traditional and query-driven entity resolution.

In a probabilistic database, let a Mentions table contain all the extracted mentions from a text corpus. Its column entity^p represents the probabilistic latent entity labels; they contain a mapping, but that mapping may not represent the current state. The People table holds a watchlist of mentions and relevant contextual information. The context column is an abstract placeholder for text data or richer schemas. This model only assumes there is a master column; the realization of the context column is flexible and implementation dependent.

  Mentions(docID, startpos, mention, entity^p, context)
  People(peopleID, mention, entity^p, context)

We also define a user-defined function coref_map that performs maximum a posteriori (MAP) inference on the latent entity^p random variables. The function takes two instances of mentions, with at least one being from a probabilistic table such as the Mentions table. When the query is executed, the coref_map function returns true if the mentions referenced are coreferent. Below, we describe the traditional exhaustive ER task as well as the single- and multi-node query-driven ER queries.

Exhaustive. The goal of traditional entity resolution is to cluster all mentions in a dataset. All the mentions clustered inside each entity are coreferent with each other and not coreferent with any mention that is a part of a different entity cluster. The process of exhaustive ER can be modeled as a self-join database query where each mention is grouped into coreferent clusters. In Algorithm 11 we create a view displaying the results of a resolved query.

Algorithm 11. Example exhaustive entity resolution query that creates a database view
  CREATE VIEW CorefView AS
  SELECT m.docID, m.startpos, m.mention, m2.mention
  FROM Mentions m, Mentions m2
  WHERE coref_map(m. , m.entity^p, m2.mention, m2.context)

To obtain unique entity clusters, we can perform an aggregation query over the CorefView. In Figure 3-3 we see an example of the result of traditional entity resolution.

Single-node Query. In the ER task, we may only be interested in the mentions of one entity. We represent this entity with a template mention, or query node, q. Single-node entity resolution is modeled as a selection query with a where-clause that includes the template mention q and returns only the mentions that are members of the entity cluster that contains the template mention. Given a template mention q and its context q.context, Algorithm 12 shows the single-node query, based on an example in Section 3.4.2.

Algorithm 12. Single query-node driven entity resolution query
  SELECT m.docID, m.startpos, m.mention
  FROM Mentions m
  WHERE coref_map(m. , m.entity^p, q, q.context)

Here we add parameters to the coref_map function that contain the specific query and its context. It performs ER over the Mentions table but only returns an affirmative value if the labels for the entity cluster match the query node. For example, if the template mention q were 'Mark Zuckerberg', and the query context were keywords such as 'facebook' and 'ceo', the only returned mentions would be those that represent Mark Zuckerberg, the Facebook founder. This is similar to a 'facebook' approximate string search. The emphasis of this chapter is on optimizing this function so that, while performing ER, we perform less work compared to an exhaustive query.

Figure 3-2. A possible initialization for entity resolution

Multi-Query. In many cases, a user may be interested in a watchlist of entities. A watchlist is a subset of the larger mention set. This is common for companies looking for mentions of their products in a dataset. In this case, mentions are only clustered with the entities represented in the watchlist. Algorithm 13 is an example of a join query between the Mentions table and the People table.

Algorithm 13. Multi-query between the People watchlist table and the full mentions set
  SELECT m.docID, m.startpos, m.mention, q
  FROM Mentions m, People q
  WHERE coref_map(m. , m.entity^p, q, q.context)

This function combines a watchlist of terms and performs ER with respect to the specific examples in the watchlist. The multi-query method uses scheduling to perform inference, or a fuzzy equal, over each mention. In Section 3.4.3 we propose scheduling algorithms so that multi-query-node ER gracefully manages multi-query workloads.

3.4 Query-Driven Entity Resolution Algorithms

Query-driven ER is an understudied problem; in this section we describe our approach to query-driven ER with one entity (single-query ER) and with multiple entities (multi-query ER). First, we give a graphical intuition of query-driven ER algorithms.

Figure 3-3. The correct entity resolution for all mentions

Figure 3-4. The entity containing q is internally coreferent; the other entities are not correctly resolved

3.4.1 Intuition of Query-Driven ER

In this section, we remind the reader of the query-driven ER task with a formal definition. Each ER task is given a corpus $G$ and a set of entity mentions $\mathcal{M} = \{m_1, \ldots, m_{|\mathcal{M}|}\}$ extracted from $G$. A user may supply a set of query nodes $Q = \{q_1, \ldots, q_{|Q|}\}$. Each $q_i$, also called a query template, may be a member of $\mathcal{M}$ or a manually declared mention that is appended to the set of mentions. For each node $q_i \in Q$, the task of ER is to compute the set of entities $\mathcal{E} = \{e_1, \ldots, e_{|Q|}\}$ that only contain mentions that are coreferent with the query node,

$$e_{q_i} = \{\, m_i \mid m_i \in \mathcal{M},\ \mathrm{QDER}(\mathcal{M}, m_i, q_i) \,\}.$$

In Section 3.4.2, we describe implementations of the QDER algorithm for $|Q| = 1$. In Section 3.4.3, we describe techniques for scheduling the ER task for the general case of $|Q| > 1$.

Fundamentally, the ER algorithm generates a graphical model and makes new state proposals (jumps) to reach the best state (see Section 3.2). The query-driven algorithms in this section use a query node to facilitate more sophisticated jumps. By making smart proposals we expect faster convergence to an accurate state. As a note to the reader, a summary of the query-driven algorithms can be found in Table 3-3.

Figures 3-2 through 3-4 show an initial configuration and acceptable query-driven entity resolution solutions. An example initial state of this algorithm is shown in Figure 3-2; each mention is initially assigned to a separate entity. Alternatively, the model may be initialized randomly, in an arrangement from a previous entity resolution output, or with all mentions in one entity. Figure 3-3 is the full resolution for the dataset; each mention is correctly assigned to its entity cluster. Figure 3-4 is a result that was resolved with query-driven methods and is only partially resolved. Because the entity containing the query node is completely resolved, the solution is acceptable.

3.4.2 Single-Node ER

Single-node ER algorithms are the class of algorithms that resolve a single query node, as discussed in Section 3.3. In particular, the target-fixed ER algorithm aims to focus a majority of the proposals on resolving the query entity. The algorithm fixes the query node as the target entity and then randomly selects a source node to merge into the entity of the target query node. This focus on building the query entity, a type of importance sampling, means the query entity should be resolved faster than if we sampled each entity uniformly.

A query-driven ER algorithm that only selects the query node as the target entity during sampling will create errors because such an algorithm is unable to remove erroneous mentions from the query entity. To prevent these errors, we allow the algorithm
to occasionally back out of poor decisions; that is, it makes non-query-specific samples. Shown in Algorithm 14, target-fixed entity resolution adapts Algorithm 10 but allows parameters to specify the proportion of time the different sampling methods are selected. In addition to the input entities $\mathcal{E}$ from Algorithm 10, target-fixed entity resolution takes as input a query node q. The output of the algorithm is a resolved query entity and other partially resolved entities.

For each sampling iteration the algorithm can make one of two decisions. The sampler may propose to merge a random source node that is not already a member of the query entity into the target query entity. Alternatively, the algorithm merges a random node with a random entity.

Algorithm 14. Target-fixed entity resolution algorithm
Input: A query node q.
       A set of entities E, each with one mention m.
       A positive integer samples.
Output: A set of resolved entities E'.
   1: E' <- E ∪ {q}
   2: while samples-- > 0 do
   3:   if random() < τ then
   4:     e_i ~u E'
   5:     e_j <- q.entity
   6:     m ~u e_i
   7:   else
   8:     e_j <- {e | ∃e, e ∈ E', e ≠ q.entity}
   9:     e_i <- {e | ∃e, e ∈ E', e ≠ e_j}
  10:     m ~u e_i
  11:   end if
  12:   E'' <- move(E', m, e_j)
  13:   if score(E') < score(E'') then
  14:     E' <- E''
  15:   end if
  16: end while
      return E'

On lines 3 to 6 the algorithm takes a uniform sample from the list of entities. If the sampled entity is the same as the query entity it tries again and samples a distinct entity. A node is drawn from this entity. The probability of this block being entered is $\tau$.
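The two proposal branches of the target-fixed algorithm can be sketched as follows. This is a minimal illustration under several assumptions, not the implementation evaluated in this chapter: the mixing probability is written as `tau` (an assumed name), the query mention is kept fixed in its own entity, the score is a sum of pairwise weights, and `TEAM` and `toy_weight` are a hand-labeled oracle invented for the example.

```python
import random


def model_score(entities, weight):
    """Sum the pairwise weights between mentions that share an entity."""
    total = 0.0
    for entity in entities:
        members = sorted(entity)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                total += weight(a, b)
    return total


def target_fixed_er(mentions, query, weight, samples, tau=0.9, seed=0):
    rng = random.Random(seed)
    query_entity = {query}
    entities = [{m} for m in mentions] + [query_entity]
    for _ in range(samples):
        if rng.random() < tau:
            # Focused proposal: random source, target fixed to the query entity.
            sources = [e for e in entities if e and e is not query_entity]
            if not sources:
                continue
            source, target = rng.choice(sources), query_entity
        else:
            # Back-off proposal: target is a non-query entity, source is any
            # other entity, so mentions can leave the query entity.
            target = rng.choice([e for e in entities if e is not query_entity])
            source = rng.choice([e for e in entities if e is not target])
        movable = sorted(m for m in source if m != query)  # query stays put
        if not movable:
            continue
        mention = rng.choice(movable)
        before = model_score(entities, weight)
        source.remove(mention)
        target.add(mention)
        if model_score(entities, weight) <= before:
            target.remove(mention)   # reject: undo the move
            source.add(mention)
    others = [e for e in entities if e and e is not query_entity]
    return query_entity, others


# Hypothetical hand-labeled oracle, used only to drive the toy weights.
TEAM = {"New York Yankees": "yankees", "Bronx Bombers": "yankees",
        "Yankees": "yankees", "The Yanks": "yankees",
        "NY Giants": "giants", "New York Giants": "giants",
        "Brooklyn Dodgers": "dodgers"}


def toy_weight(a, b):
    return 1.0 if TEAM[a] == TEAM[b] else -1.0
```

Because a move is kept only when the score strictly improves, this greedy sketch never places a Giants or Dodgers mention in the query entity under the toy weights, although like any greedy search it can stall in a local optimum.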

Table 3-1. Mention set M from a corpus
  id     Mention            ...
  m_1    NY Giants          ...
  m_2    Bronx Bombers      ...
  m_3    New York Giants    ...
  m_4    Yankees            ...
  m_5    Brooklyn Dodgers   ...
  m_6    The Yanks          ...

Table 3-2. Example query node q
  id     Mention            ...
  q      New York Yankees   ...

Lines 7 to 10 are entered with probability $1 - \tau$. This block performs a random entity assignment in the same manner as Algorithm 10. This block offsets the aggressive nature of the target-fixed algorithm by probabilistically backing out of any bad merges. Finally, the block from line 12 to line 15 scores the new arrangement and accepts it if it improves the model score. We discuss parameter settings in Section 3.5.5.

Example. Take the synthetic mention set M shown in Table 3-1 and a query node q, the baseball team 'New York Yankees', in Table 3-2. This is the result of the approximate match of query q over a larger dataset (blocking). The mentions of M may be initialized by assigning each mention to its own entity. After a successful run of traditional entity resolution the set of entity clusters is

$$\{\langle q, m_2, m_4, m_6 \rangle, \langle m_1, m_3 \rangle, \langle m_5 \rangle\}.$$

For the query-driven scenario the only entity we are interested in is $\langle q, m_2, m_4, m_6 \rangle$. Each mention in this query entity is an alias for the 'New York Yankees' baseball team. The other two entities represent the 'New York Giants' football team and the 'Brooklyn Dodgers' baseball team, respectively.

The target-fixed algorithm attempts to merge nodes with the query entity one mention at a time, and the merge is accepted if it improves the score of the overall model. We can see in the example that a merge of $m_1$ and $m_3$ may improve the overall model
because they have similar keywords, but one refers to the query entity and the other to a different football team. The target-fixed algorithm can correct this type of error by probabilistically backing out of errors, moving mentions in the query entity to a new entity as shown in lines 7 to 10 of Algorithm 14.

3.4.3 Multi-query ER

A user may want to resolve more than one query entity; that is, she may be interested in resolving a watchlist of entities over the dataset. To support multiple queries, we first merge the canopies of each query node in the watchlist to obtain a subset of the full graphical model containing only the nodes similar to the query nodes. To resolve the entities we can use query-proportional methods iteratively over each query node. We define two classes of schedules, namely, static and dynamic.

Static schedules are formulated before sampling, while dynamic schedules are updated in response to estimated convergence. The two static schedules we develop are random and selectivity-based. In random scheduling each query node from the watchlist is selected in a round-robin style. Selectivity-based scheduling is a method of ordering multi-query samples to schedule proposals in proportion to the selectivity of the query node. Selectivity, in this case, is defined as the number of mentions retrieved using an approximate match over the dataset, or the query node's contribution to the total new graphical model. For example, the selectivity of our query node q in Table 3-2 is simply the size of M, shown in Table 3-1.

The random scheduling method performs well if all query nodes have similar selectivity. Otherwise, if the selectivity of the query nodes varies, one query node may require more sampling than the others. If one query node needs a lot of samples to converge, the whole process may take a long time to complete and cycles may be wasted on other nodes that have already converged.

In addition to scheduling samples in proportion to their selectivity, we can schedule samples dynamically, depending on the progress of each query entity. To perform dynamic
scheduling we need to know how each query entity is progressing towards convergence. To estimate the running convergence we do not use standard techniques from the literature because scheduling needs to occur before the model is close to convergence [25]. Instead, we estimate convergence by measuring the fraction of accepted samples over the last N samples of each query in the watchlist. The two dynamic scheduling algorithms are closest-first and farthest-first. In closest-first we queue up the query node that has the lowest positive average number of accepted proposals over the last N proposals. This scheduling method performs inference for the node that is closest to being resolved so it can move on to other nodes. Alternatively, the farthest-first algorithm schedules the node that has the highest acceptance rate over the last N proposals, that is, the node estimated to be farthest from convergence. This scheduling algorithm makes each query entity progress evenly.

3.5 Optimization of Query-Driven ER

The previous ER techniques aggressively attempt to resolve the query entity. However, if the query node is not representative of the query items, target-fixed ER can lead to undesirable results. We do not explore this trade-off; we assume users can select representative query nodes. In this section, we introduce optimizations to create approximate query-driven samples based on the query node. We first discuss the influence function that is used to make query-driven proposals. We then discuss the attract and repel versions of the influence function, followed by two new algorithms. We end with implementation details and a summary of our query-driven algorithms.

3.5.1 Influence Function: Attract and Repel

To retrieve nodes from a graphical model that are similar to a query node we employ the notion of influence. Our assumption is that nodes that are similar have a high probability of being coreferent. An influence trail score between two nodes in a graphical model can be computed as the product of factors along their active trail, as defined in the literature [100]. For a node $m_i \in \mathcal{M}$ and the query node $q \in \mathcal{M}$, the influence of $m_i$ on the
query node is defined as:

$$I(m_i, q) = \sum_{j \in F} w_j\, \psi_j(m_i, q)$$

where $F$ is the world of pairwise features and the feature weight and log-linear function are, respectively, $w_j$ and $\psi_j$. The influence function $I$ is an implementation of this trail score.

The influence function takes a set of entities (or the equivalent GM) and a query node q as parameters. The parameters to an influence function can be over the whole database or a canopy. Over several invocations of the function, $I$ returns mentions from the graphical model with a frequency proportionate to their influence on q. If a mention has little or no influence, the influence function acts as a blocking function, infrequently returning the mention. Recall that influence is the active-trail distance to the query node. To implement the influence function we build a data structure based on an algorithm by Vose [93], hereafter referred to as a Vose structure.

The input mentions to the blocking algorithms may result in high- or low-quality canopies. A high-quality canopy means most of the mentions in the canopy are associated with the query node. Low-quality canopies, which are more common, correspond to only a small number of mentions being associated with the query node. When initializing query-driven algorithms the canopy quality is important for determining which algorithm to use.

The attract method initializes each mention in the canopy in its own entity, and then mentions are merged until convergence. The target-fixed algorithm discussed in Section 3.4.2 is explained using this method. The attract method works well for low-quality canopies, or canopies that require only a small number of items to merge. Conversely, the repel method works well with high-quality canopies, or when most items in a canopy belong to the query entity.

The repel method initializes all mentions in the canopy into a single entity. Then proposals are made to remove mentions from the entity so we are left with only the nodes
in the query entity. We discuss this method using the hybrid algorithm in Section 3.5.3. To build an influence function for the repel method we can use the same method; we only need to normalize and invert the influence scores. We refer to this as co-influence, or $\bar{I}$.

3.5.2 Query-proportional ER

In the query-proportional sampling algorithm, on every iteration, the source mention and target entity are selected in proportion to their distance to the query entity. Instead of focusing solely on the query entity, this algorithm prioritizes samples using a measure that represents the probability of a mention being coreferent with the query entity.

That is, each node p in the graphical model G is selected based on the active trail between itself and the query node q. This algorithm merges nodes that are similar to the query node with an increased frequency.

Before query-proportional sampling, a data structure for $I$ is created. The $I$ influence structure takes a query node q and the global graphical model E, then returns a sampled mention. As $I$ is called multiple times, the distribution of the nodes returned is proportional to their influence. Algorithm 15 describes the query-proportional algorithm.

Algorithm 15. Query-proportional algorithm
Input: A query node q to drive computation.
       A set of entities E, each with one mention m.
       A positive integer samples.
       A function I that samples from nodes (entities) according to its influence on a mention.
Output: A set of resolved entities E'.
   1: E' <- E ∪ {q}
   2: while samples-- > 0 do
   3:   m_1 <- I(E', q)
   4:   m_2 <- I(E', q)
   5:   E'' <- move(E', m_1, m_2.entity)
   6:   if score(E') < score(E'') then
   7:     E' <- E''
   8:   end if
   9: end while
      return E'
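The sampler that backs the influence function can be illustrated with the alias method in the style of Vose, which supports constant-time draws proportional to a fixed set of weights. This is a minimal sketch and not the structure built in this work: the class name `AliasSampler` is hypothetical, and the influence scores are assumed to be precomputed non-negative weights with at least one positive entry.

```python
import random


class AliasSampler:
    """Alias table: O(n) construction, O(1) proportional sampling."""

    def __init__(self, weights, rng=None):
        n = len(weights)
        total = float(sum(weights))
        scaled = [w * n / total for w in weights]   # mean of scaled is 1.0
        self.prob = [0.0] * n    # probability of keeping column i
        self.alias = [0] * n     # fallback index for column i
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.prob[s] = scaled[s]
            self.alias[s] = l
            # Column l donates the mass needed to fill column s.
            scaled[l] = scaled[l] + scaled[s] - 1.0
            (small if scaled[l] < 1.0 else large).append(l)
        for i in large + small:  # leftovers are full columns
            self.prob[i] = 1.0
        self.rng = rng or random.Random()

    def sample(self):
        """Return an index with frequency proportional to its weight."""
        i = self.rng.randrange(len(self.prob))
        return i if self.rng.random() < self.prob[i] else self.alias[i]
```

Building one `AliasSampler` per query node over the influence scores of the canopy would give the role played by `I(E', q)` in Algorithm 15: `sample()` returns the index of a mention, and mentions with little influence are returned rarely. The two arrays correspond to one floating-point table and one integer table per structure.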

For each iteration, the algorithm selects mentions using the influence function (line 3 and line 4). Then, one mention $m_1$ is moved into the entity of $m_2$. Mentions $m_1$ and $m_2$ have a higher probability of being coreferent, and therefore a merge in the query entity has a higher probability of occurring, compared to the random selections in Algorithm 10. As a corollary, the influence sampling property creates many small entities that are similar to the query entity.

During query-proportional sampling more entities that are similar to the query node are created. Some of the mentions grouped in intermediate entities during query-proportional sampling will move to the query entity. This would be a big advantage when performing entity-to-entity merges as opposed to mention-to-entity merges. In this chapter, we do not investigate this extension to the algorithm.

3.5.3 Hybrid ER

The best of both the target-fixed and query-proportional algorithms can be combined to create a hybrid algorithm. Like the target-fixed algorithm, the hybrid method aggressively fixes the target as the query entity. The hybrid method also chooses its source node using the influence function in the same manner as the query-proportional algorithm.

Algorithm 16 shows the hybrid algorithm using the repel method. With probability $\tau$ the algorithm chooses a mention using the repel method $\bar{I}$ and moves it to an entity that is not the query entity. This is the opposite of merging a node into the query entity. Pseudocode is listed on lines 3 to 5.

3.5.4 Implementation Details

The previous algorithms describe single-process sampling over the set of mentions. The multi-query methods are modeled as several interwoven sequential single-node ER processes. In this section, we describe our implementation of the hybrid algorithm over a parallel database management system.

Algorithm 16. Hybrid-Repel algorithm
Input: A set of entities E, where one contains all the mentions m and the others are empty.
       A positive integer samples.
       A query node q.
       A function I that samples from nodes (entities) according to its influence on a mention.
Output: A set of resolved entities E'.
   1: E' <- E ∪ {q}
   2: while samples-- > 0 do
   3:   if random() < τ then
   4:     m <- I(E', q)
   5:     e_i <- {e | ∃e, e ∈ E', e ≠ q.entity}
   6:   else
   7:     e_i ~u E'
   8:     e_j <- {e | ∃e, e ∈ E', e ≠ e_i}
   9:     m ~u e_j
  10:   end if
  11:   E'' <- move(E', m, e_i)
  12:   if score(E') < score(E'') then
  13:     E' <- E''
  14:   end if
  15: end while
      return E'

An independent Vose structure ($I$, §3.5.1) is created for each query node in the query set. The creation of the Vose structures for the query nodes is parallelized. When the number of query nodes increases, the Vose structures demand more memory from the system. Each Vose structure contains arrays of type double precision and unsigned int. The space for the structures is $O(|Q||M|)$, where $|Q|$ is the number of query nodes in the query and $|M|$ is the number of mentions in the corpus. The Vose structure is accessed on every sample and needs to be in memory. To increase scalability, one could store the full sets of precomputed samples and serialize the Vose structures to disk, but that is not explored here [48].

Sampling over the query nodes for each algorithm can also be performed in parallel. In our method, a thread selects a query node using a random schedule as described in Section 3.4.3. The system uses the Vose structure associated with the query node to set up a proposal move. The system attempts to obtain locks for both entities involved
in the proposal. If the system is unable to obtain a lock on either of the two entities, the system will back out and resample new entities. When the number of query nodes is small the query-driven algorithms experience a lot of contention at the entities containing the query nodes. In these circumstances, the system will back out and either restart the proposal process or attempt a baseline proposal. This avoids waiting for locked entities and keeps the sampling process active. In Section 3.6.6 we demonstrate the parallel hybrid method over a large dataset.

3.5.5 Algorithms Summary Discussion

Algorithms 14, 15 and 16 are modifications of the proposal jumps found in the baseline Algorithm 10. Table 3-3 describes the proposal process for each algorithm by its preferred jump method.

Table 3-3. Summary of algorithms and their most common methods for proposal jumps
  Algorithm            source         target
  Baseline             random         random
  Target-Fixed         random         fixed
  Query-Proportional   proportional   proportional
  Hybrid               proportional   fixed

The target-fixed algorithm builds the query entity by aggressively proposing random samples to merge into the query entity. The query-proportional algorithm uses an influence function to ensure its samples are mostly related to the query node. The hybrid algorithm mixes the aggressiveness of the target-fixed algorithm with the intelligent selection of the source node found in the query-proportional method.

After choosing the correct algorithm, a user needs to have a well-trained model with several features. An advantage of using query-proportional techniques, because so little sampling is required, is that we can interactively test query accuracy. We can also add context or keywords that were discovered from a previous run of the algorithm. This interactive querying workflow will help improve accuracy, which we experimentally verify in Section 3.6.

Parameter settings. The algorithm takes several parameters that affect performance. While not studied in this chapter, parameter settings are robust to change, making parameter selection simple. The first is the number of proposals (samples). This number can be a function of the size of the dataset. Each query node should have the opportunity to be merged into an entity more than once.

The value $\tau$ is in $[0.0, 1.0]$ and represents how often to perform the main type of sampling. This value should be set to a high value, 0.9 for attract algorithms. With probability $1 - \tau$ the algorithms back off to random samples to improve mixing. This value is lowered to counter some of the aggression, particularly in Algorithm 14. The parallel experiments use $\tau = 1$ and back out when there is contention in the threads.

In statistics, a negative binomial distribution is used to model the number of trials it takes for an event to be a success. We can also use a negative binomial function as a decay function for the output of the influence function. We use this function because we want values that are most similar (lowest score) to be sampled more often. We set the r value, or number of failures for the negative binomial function, to 1. We set the p value, or the probability of each success, to a value close to 0.05.

In the multi-query ER algorithms we run inference for K steps before we look to change the query entity. In our experiments we choose a K of 500, and an increasing value from two to 100 thousand in the parallel experiments.

3.6 Query-Driven Entity Resolution Experiments

In this section, we describe the implementation details, the datasets and our experimental setup. Next, we discuss our hypotheses and four corresponding experiments. We then finish with a discussion of the results.

Implementation. We developed the algorithms described in Section 3.4 in Scala 2.9.1 using the Factorie package. Factorie is a toolkit for building imperatively defined factor graphs [65]. This framework allows a templated definition of the factor graph to avoid fully materializing the structure. The training algorithms are also developed

PAGE 66

using Factorie. The algorithms for canopy building and approximate string matching are developed inside of PostgreSQL 9.1 and Greenplum 4.1 using SQL, PL/pgSQL and PL/Python. Inference is performed in-memory on an Intel Core i7 processor at 3.2 GHz with 8 cores and 12 GB of RAM. The approximate string matching on Greenplum is performed on an AMD Opteron 6272 32-core machine with 64 GB.

The parallel experiments were developed entirely in a parallel database, DataPath [4]. DataPath is installed on a 48-core machine with 256 GB.

Data sets. The experiments use three data sets. The first is the English newswire articles from the Gigaword Corpus; we refer to this as the NYT Corpus [39]. The second is a smaller but fully labeled Rexa data set.³ Because it is fully labeled, it allows us to run the more detailed micro benchmarks. The NYT corpus contains 1,655,279 articles and 29,866,129 paragraphs from the years 1994 to 2006. We extracted a total of 71,433,375 mentions using the natural language toolkit named entity extraction parser [12]. Additionally, we compute general statistics about the corpus including the term and document frequency and tf-idf scores for all terms. We manually labeled mentions for each query over the NYT data set.

The second data set, Rexa, is citation data from a publication search engine named Rexa. This data set contains 2454 citations and 9399 authors, of which 1972 are labeled. We perform experiments on the Rexa corpus because it is fully labeled, unlike the NYT Corpus. The Rexa corpus is smaller in total size but it has average sized canopies.

The third data set is the Wikilinks Corpus [87], the largest labeled corpus for entity resolution that we could find at the time of development. It contains 40 million mentions and 3 million entities that were extracted from the web and truthed based on web anchor links to Wikipedia pages. We loaded a million mentions onto DataPath to demonstrate the parallel capabilities.

³ http://cs.neiu.edu/~culotta/data/rexa.html

PAGE 67

3.6.1 Experiment Setup

Table 3-4 lists the features and the weights for each feature.

Features. Features that look for similarity between mention nodes are called affinity features and they are given positive weights. Features that look for dissimilarity between mention nodes are called repulsion factors and they are given negative weights. We implement three classes of features: pairwise token features, pairwise context features and entity-wide features. Pairwise features directly compare token strings on attributes such as equality or matching substrings. Context features compare the information surrounding the mention. We can look at the surrounding sentence, paragraph, document or user-specified keywords. The query nodes are extracted from text and contain a proper document context. With this context, we use a tf-idf weighted cosine similarity score to compare the context of each mention token. Finally, entity-wide features use all mentions inside an entity cluster to make a decision. An example entity-wide feature counts the matching mention strings between two entities.

Models. Features on the NYT and Wikilinks data sets were manually tuned and the features for the Rexa data set were trained using SampleRank [101] with confidence-weighted updates. We manually tune some of the weights in the NYT corpus to make up for the lack of complete training data. The models can be graphically represented as the models in Figure 3-4.

Evaluation metrics. Convergence of MCMC algorithms is difficult to measure, as described in a review by Cowles and Carlin [25]. We estimate the convergence progress by calculating the f1 score of the query node's entity, f1_q. We create this new measure because we are primarily concerned with the query entity. Other measures include B^3 for entity resolution and several others for general MCMC models [7, 25].

The query-specific f1 score is the harmonic mean of the query-specific recall R_q and query-specific precision P_q. To accurately determine the P_q and R_q of each query in this experiment we label each correct query node. Query-specific precision is defined as

PAGE 68

Table 3-4. Features used on the NYT Corpus. The first set of features are token-specific features, the middle set are between pairs of mentions and the bottom set are entity-wide features.

Feature name                     Score+  Score-  Feature type
Equal mention strings            +20     -15     Token specific
Equal first character            +5              Token specific
Equal second character           +3              Token specific
Equal second character           0
Unequal mention strings                  -15     Token specific
Unequal first character          0
Unequal second character         0
Equal substrings                 +30     -150    Token specific
Unequal substrings                       -150    Token specific
Equal string lengths             +10             Token specific
Matching first term              +90     -3      Token specific
No matching first term                   -3      Token specific
Similarity ≥ 0.99                +120            Pairwise
Similarity ≥ 0.90                +105            Pairwise
Similarity ≥ 0.80                +80             Pairwise
Similarity ≥ 0.70                +55             Pairwise
Similarity ≥ 0.60                +35             Pairwise
Similarity ≥ 0.50                +15             Pairwise
Similarity ≥ 0.40                        -5      Pairwise
Similarity ≥ 0.30                        -50     Pairwise
Similarity ≥ 0.20                        -80     Pairwise
Similarity < 0.20                        -100    Pairwise
Matching terms                   +20             Pairwise
Token in context                 +1              Pairwise
No matching keyword              +700    -10     Pairwise
Matching keyword                 +700            Pairwise
Keyword in token                 +70             Pairwise
Extra token                              -500    Pairwise
Matching token in context        +10             Pairwise
Similar neighbor                 +100    -5      Entity-wide
No similar neighbor in entity            -5      Entity-wide
Matching document                +350    -15     Entity-wide
No matching documents in entity          -15     Entity-wide

PAGE 69

$P_q = \frac{|\{\mathrm{relevant}(M)\} \cap \{\mathrm{retrieved}(M)\}|}{|\{\mathrm{retrieved}(M)\}|}$ and query-specific recall $R_q = \frac{|\{\mathrm{relevant}(M)\} \cap \{\mathrm{retrieved}(M)\}|}{|\{\mathrm{relevant}(M)\}|}$. The f1 score for the query node's entity q is defined as:

$$f1_q = 2 \, \frac{R_q \cdot P_q}{R_q + P_q}.$$

The f1_q score is a good indicator of entity and answer quality. For multi-query experiments we calculate the average f1_q scores for each query node. The run of each non-parallel algorithm is averaged over 3 to 10 runs.

3.6.2 Realtime Query-Driven ER Over NYT

In this experiment we show that query-driven entity resolution techniques allow us to obtain near realtime⁴ results on large data sets such as the NYT corpus.

Figure 3-5 shows the f1_q score of the hybrid ER algorithms with three single-query ER queries. The graph shows performance over the first 50 proposals. For example, the 'Zuckerberg' query could be expressed as shown in Algorithm 17.

Algorithm 17 Example ER query over the entity 'zuckerberg'
SELECT * FROM Mention m
WHERE coref_map(m.*, entity^p, 'zuckerberg', context.*)

Recall, a canopy is first generated using an approximate match over the mention set. We use the repel inference function and all the mentions are initialized in one large entity. The 'Richard Hatch' and 'Carnegie Mellon' queries start at an f1_q score of .92 and .97, respectively. The 'Zuckerberg' query starts above .65 and improves to an f1_q score over .8.

These experiments show the repel method removing mismatches from the query entity. The co-influence function is used to quickly identify the mentions that do not belong in the entity and they are proposed to be removed. When a hybrid move is

⁴ We define realtime as only contributing a small or no time loss when this process is a part of an external execution pipeline such as an information extraction pipeline.

PAGE 70

Figure 3-5. Hybrid-repel performance for the first 50 samples for three queries. Each result is averaged over 6 runs

Table 3-5. The performance of the hybrid-repel ER algorithm for queries over the NYT corpus for the first 50 samples. Total time includes the time to build the I data structure and result output. The NYT Corpus contains over 71 million mentions, a large amount for entity resolution problems.

Query            Blocking  Mentions  Inference  Total time
Zuckerberg       24.4s     103       2s         37s
Richard Hatch    28.3s     226       18.5s      59s
Carnegie Mellon  25.9s     1302      68s        124s

proposed, a mention is moved from the large entity group to a new, possibly empty, entity. This method relies on good repulsion features and correct weights.

In Table 3-5 we show the performance of three queries. In addition to the query token we add four columns: blocking time in seconds, canopy size, inference time in seconds and the total compute time. Total time is the complete time taken by each run; this includes building of the influence data structure and result writing. The values in Table 3-5 show the fast performance of query-driven ER over a large database of mentions.

3.6.3 Single-query ER

In this experiment we show a performance comparison between the single-query algorithms summarized in Sections 3.4 and 3.5. We run the query-driven algorithms over

PAGE 71

Figure 3-6. A comparison of single-query algorithms on a query with selectivity of 11

queries with different selectivity levels and show the accuracy over time. Each algorithm uses the attract method, so each mention in the canopy starts in its own entity.

Figure 3-6 shows the runtime of all four algorithms on the Rexa data set with the query 'Nemo Semret', an author with a selectivity of 11. The baseline entity resolution does not get a correct proposal until about 500 seconds. The baseline algorithm takes a long time to accept the first proposal because it is randomly trying to insert mentions into an existing entity. Target-fixed immediately begins to make correct proposals. Hybrid and query-proportional have the best performance and resolve the entity almost instantaneously. The hybrid chooses the most likely nodes to merge into the query entity. As the first couple of proposals are correct merges, hybrid quickly converges to a high accuracy. Due to imperfect features, among the 10 averaged runs a few runs get stuck at a local optimum, causing suboptimal results.

Figure 3-7 shows the runtime of four algorithms for query node id 'A. A. Lazar' with selectivity of 46. The baseline algorithm progresses the slowest. The hybrid algorithm quickly reaches a perfect f1_q score. The query-proportional algorithm lags slightly behind the hybrid method but still reaches a perfect value. The target-fixed algorithm gradually increases to a perfect f1_q score about 60 seconds after hybrid and query-proportional.

PAGE 72

Figure 3-7. A comparison of single-query algorithms with a query node of selectivity 46

Figure 3-8. A comparison of selection-driven algorithms with a query node of selectivity 130

Figure 3-8 shows the runtime of four algorithms with a query 'Michael Jordan' of selectivity 130. The baseline slowly increases over the 100 seconds. The hybrid algorithm again quickly achieves a perfect f1_q score, followed by query-proportional and then target-fixed. The time gap between each of the algorithms increases with the increase in selectivity; hybrid achieves the best performance.

We look deeper at how selectivity affects the rate of convergence. In Figure 3-9 we show the time it takes for each algorithm to reach an f1_q score of 0.95 over increasing

PAGE 73

Figure 3-9. The time until an f1_q score of 0.95 for five queries of increasing selectivities; averaged over three runs

selectivity. We choose five query nodes of increasing selectivity but with the same canopy sizes. The hybrid algorithm runtime increased with the increase in selectivity but only slightly steeper than constant. Target-fixed increased for the first three queries but did not last more than 50 seconds. Query-proportional has only a slight increase in time till convergence for the first three queries. The highest two selectivity queries are expensive for query-proportional and we observe an exponential increase in runtime. These results are consistent with the exponentially large increase in the number of random comparisons needed to find a match for a query entity. The query-proportional algorithm does not focus on the query entity as aggressively as the target-fixed and hybrid algorithms. Recall that the target-fixed and the hybrid algorithm focus on moving correct nodes into the query entity. Query-proportional selects candidate nodes using the influence function but it does not fix the target entity. With the target entity not fixed, the chance of choosing a correct node for the query entity decreases exponentially. This shows that the selectivity of nodes affects the runtime performance of each algorithm. When performing join-driven ER it is important to take the relative selectivity of nodes into account when choosing the best scheduling algorithms.

PAGE 74

Figure 3-10. The progress of the hybrid algorithm across multiple query nodes using different scheduling algorithms. Each result is averaged over three runs

3.6.4 Multi-query ER

In this experiment we study the performance of our different scheduling algorithms for join-driven ER queries. We choose ten query nodes of different selectivity and run the join query scheduling algorithms described in Section 3.4.3. Consider a table like the People table in Section 3.3 with selectivity {130, 63, 68, 7, 12, 12, 301, 11, 46}. The four algorithms, random, closest-first, farthest-first and selectivity-based, are shown in Figure 3-10. The selectivity-based method outperforms the other three algorithms in terms of convergence rate. The jumps in accuracy on the graph correspond to the scheduling algorithms choosing new query nodes and accepting new proposals. It has a high jump when it starts sampling the seventh, and highest selectivity, node. The farthest-first algorithm rises the slowest out of the scheduling algorithms because it tries to stop sampling the high performing query entity and makes proposals for the slowest growing. The selectivity-based method performs well early because the high selectivity queries are sampled first. The high selectivity query makes up a large proportion of the total f1_q score. The large jump in the random method is when it reaches the node with selectivity 301. Notice, closest-first reaches its peak f1_q score the fastest because it tries to get the most out of every query node.
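The observation that high selectivity queries are sampled first can be illustrated with a small Python sketch. This is only one possible reading of a selectivity-first ordering, not the scheduling algorithm of Section 3.4.3. The function name `selectivity_schedule` is hypothetical, and the step budget of 500 per query node mirrors the K value reported in the parameter settings.

```python
def selectivity_schedule(selectivities, steps_per_query):
    """Yield (query_index, step) pairs, visiting high-selectivity queries first."""
    # Assumption: ties keep their original order (Python's sort is stable).
    order = sorted(range(len(selectivities)), key=lambda i: -selectivities[i])
    for query_index in order:
        for step in range(steps_per_query):
            yield query_index, step


# Selectivities listed for the People table in this section.
people_selectivity = [130, 63, 68, 7, 12, 12, 301, 11, 46]

# Record the order in which query nodes are first visited.
visit_order = []
total_steps = 0
for query_index, step in selectivity_schedule(people_selectivity, steps_per_query=500):
    if step == 0:
        visit_order.append(query_index)
    total_steps += 1
```

With this ordering the seventh query node (index 6, selectivity 301) is sampled first, so its large contribution to the average f1_q score would appear early in the run.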

PAGE 75

Figure 3-11. The performance of the zuckerberg query with different levels of context. Each result is averaged over 6 runs

3.6.5 Context Levels

In this experiment we aim to discover how different levels of context specified at query time can improve convergence time and overall accuracy. We take the zuckerberg query and the hybrid-repel algorithm and run ER three times over three levels of context. Each mention in the graph contains a 'paragraph' level of context and we only alter the context of the query node. The 'none' context only activates token-specific features; any context features involving the query node are zeroed out. The 'paragraph' level context is the default context from the NYT corpus and the 'document' level context extends context to the entire news article. Additionally, we add specific keywords from Mark Zuckerberg's DBpedia page to the 'document' and 'paragraph' context levels. We show the performance using the repel method in Figure 3-11.

Adding specific keywords that activate the keyword features is the most effective method for increasing the accuracy of query-driven ER. Query-driven methods allow a user to observe the results and add or remove keywords for specific queries to improve the accuracy. This type of iterative improvement workflow is not feasible with batch methods.

PAGE 76

Figure 3-12. Hybrid-attract algorithm with random queries run over the Wikilinks corpus. Each plot starts after the Vose structures are constructed

3.6.6 Parallel Hybrid ER

This experiment has two objectives: first, how does the hybrid algorithm perform in a canopy size of 1 million queries, and second, what is the effect of increasing the number of query nodes. In Figure 3-12 the hybrid algorithm is able to resolve entities in a short amount of time. The creation time of the Vose structure is about linear in the number of queries. The trend in the graph is that as the ratio of queries to entities increases, the performance benefit of the hybrid-attract method decreases. With more query nodes the construction time increases and the benefits of the algorithm decrease and become no better than the baseline method.

Experiment Summary. Each of the query-driven methods outperforms the baseline methods in terms of runtime while not losing out on accuracy. Across different data set sizes, hybrid algorithms have the most consistent performance. If a system has a quality blocking function then it is better to use the co-influence entity resolution method. With multiple query nodes, selectivity-based is the most consistent performing algorithm. More accurate estimation of MCMC convergence performance could allow the dynamic scheduling algorithms closest-first and farthest-first to achieve higher accuracy. The more contextual information that can be added to query nodes at query time causes higher

PAGE 77

accuracy of the entity resolution algorithms. Parallel query-driven sampling is an effective way to get speedup in an ER data set when the ratio of mentions to entities is low.

3.7 Query-Driven Entity Resolution Related Work

This chapter is related to work in several areas. In this section we describe a selection of the literature that we found most relevant to different parts of the Query-Driven ER task.

Entity Resolution. The state-of-the-art method for entity resolution employs collective classification. Instead of purely pairwise decisions, collective classification methods consider group relationships when making clustering determinations. In a recent tutorial [37], collective classification methods were grouped into three categories: non-probabilistic [10, 29, 49], probabilistic [18, 36, 59, 67, 74, 89] and hybrid approaches [3, 83]. A relevant challenge proposed for entity resolution research by the tutorial is how to efficiently perform entity resolution when a query is involved. This chapter seeks to address this issue.

Entity resolution is generally an expensive, offline batch process. Bhattacharya and Getoor proposed a method for query-time entity resolution [11]. This method performs inference by starting with a query node and performing 'expand and resolve' to resolve entities through resolution of attributes and expansion of hyper-edges. Unfortunately, hyper-edges between records are not always explicit in data sets. This chapter does not assume the presence of any link in the corpus; each entity or mention is independently defined, which is the case for most applications.

A recent paper by Altwaijry, Kalashnikov and Mehrotra [2] has a similar motivation of using SQL queries to drive entity resolution. That work focuses on using predicates in the query to drive computation while this work uses example queries to drive computation. Both techniques are complementary, and combining the two by updating the edge-picking policy described in their paper using our approach makes for an interesting method of optimizing the entity resolution process.

PAGE 78

The term query-driven appears in this chapter and has appeared in others across the literature with different meanings [41]. Our definition of a query node is an example item, a mention, from a data set. A query in Altwaijry et al. [2] is the set of predicates in an SQL statement. Query-driven in Grant et al. [41] refers to the SQL queries used to drive analytics.

It is becoming increasingly normal to work with data sets of extremely large size; in response, researchers have studied streaming and distributed processing. Rao, McNamee and Dredze describe an approach for streaming entity resolution [78]. This approach is fast and approximates entries in an LRU queue of clustered entity chains. We apply these techniques to static data sets and do not yet handle streams of data. Singh, Subramanya, Pereira, and McCallum propose a technique for ER where entities are resolved in parallel blocks and then redistributed and resolved again in new blocks [86]. This parallel distribution method makes large-scale entity resolution tractable. In this chapter, we perform analysis on a similar scale data set but we show that great performance gains can be achieved when a query is specified.

Query specific sampling. Recently, several researchers have explored the idea of focusing sampling of graphical models to speed up inference. Below we discuss the three approaches that use sampling to speed up ER over graphical models.

Query-Aware MCMC [100] found that when performing a query over a graphical model the cost of not sampling a node is exactly the node's influence on the query node. This enables us to ignore some nodes that have low influence over the query node and incur a small amount of error. This influence score can be calculated as the mutual information between two nodes. The authors compare estimation techniques of the intractable mutual information score; this is called the influence trail score. Because ER has a fixed pairwise model, we can use the theory from this work and specialized data structures to gain performance when query-driven sampling.

Type-based MCMC is a method of sampling groups of nodes with the same attribute to increase the progress towards convergence [61]. This approach works well

PAGE 79

when feature sets can be tractably counted and grouped. If query nodes are introduced it is not clear how one may focus type-based sampling.

Other researchers have explored using belief propagation with queries to approximate marginals of factor graphs [20]. However, the entity resolution graph is cyclic and highly connected. MCMC scales with large real-world models better than loopy belief propagation [100].

3.8 Query-Driven Entity Resolution Summary

In this chapter, I propose new approaches for accelerating large-scale entity resolution in the common case that the user is interested in one or a watch list of entities. These techniques can be integrated into existing data processing pipelines or used as a tool for exploratory data analysis. We showed three single-node ER algorithms and three scheduling algorithms for multi-query ER and showed experimentally how their runtime performance is several orders of magnitude better than the baseline.

PAGE 80

CHAPTER 4
A PROPOSAL OPTIMIZER FOR SAMPLING-BASED ENTITY RESOLUTION

4.1 Introduction to the Proposal Optimizer

Recently, an increasing number of organizations are tracking information across social media and the web. To this end, the National Institute of Standards and Technology hosted a three-year track to accelerate the extraction of information and construction of knowledge bases from streaming web resources [34]. This international contest highlighted the many difficulties of dealing with collecting unstructured data across the web. Across the efforts in this contest, we identify entity resolution as a major barrier to progress.

Entity resolution across text corpora is the task of identifying mentions within the documents that correspond to the same real-world entities. To construct knowledge bases or extract accurate information, entity resolution (ER) is a required step. This task is a notoriously computationally difficult problem. Using Markov Chain Monte Carlo (MCMC) techniques exchanges raw performance for a flexible representation and guaranteed convergence [66, 86, 99].

Processing streaming textual documents exacerbates two of the core difficulties of ER. The first difficulty is the computation of large entities, and the second is the excessive computation spent resolving unambiguous entities. Over time, the growing size of large entities makes keeping up with the incoming documents untenable. Optimization that touches these critical portions is wholly understudied. In this chapter, we argue that compression and approximation techniques can efficiently decrease the runtime of traditional ER systems, thus making them usable for streaming environments.

In sampling-based entity resolution, entities are represented as clusters of mentions. A proposal is made to move a random mention from a source entity to a random destination entity. The proposed state is scored and if it improves the global state, the new state is accepted. If the proposal does not improve the global state, the proposal may still be accepted with some small probability. This process is repeated until the state converges.
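The proposal process described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation studied in this chapter: the score is treated as a log-score, a worse state is accepted with probability exp(after - before), and the proposal-ratio correction of a full Metropolis-Hastings sampler is ignored. The names `mh_step` and `toy_score` are hypothetical; `toy_score` simply rewards pairs of mentions that share a first character.

```python
import math
import random


def toy_score(entities):
    """Hypothetical log-score: +1 per same-initial pair in a cluster, -1 otherwise."""
    total = 0.0
    for cluster in entities:
        items = sorted(cluster)
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                total += 1.0 if items[i][0] == items[j][0] else -1.0
    return total


def mh_step(entities, score, rng):
    """Try to move one random mention; return True if the move was accepted."""
    sources = [i for i, cluster in enumerate(entities) if cluster]
    src = rng.choice(sources)              # random non-empty source entity
    dst = rng.randrange(len(entities))     # random destination entity
    if src == dst:
        return False
    mention = rng.choice(sorted(entities[src]))
    before = score(entities)
    entities[src].remove(mention)
    entities[dst].add(mention)
    after = score(entities)
    # Accept improvements; accept a worse state with a small probability.
    if after >= before or rng.random() < math.exp(after - before):
        return True
    # Rejected: undo the move so the state is unchanged.
    entities[dst].remove(mention)
    entities[src].add(mention)
    return False
```

Note that `score` is evaluated over the whole state twice per proposal in this sketch; the cost of that evaluation for large clusters is the bottleneck this chapter is concerned with.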

PAGE 81

Scoring the state of an entity cluster, through pairwise feature computation of the cluster mentions, is O(n^2). For entity clusters larger than 1000 mentions, calculating the score for each proposal can become prohibitively expensive.

Wick et al. present an entity resolution technique that uses a tree structure to organize related entities to reduce the amount of work performed in each step [99]. During each proposal, this approach avoids the pairwise comparison by restricting model calculation to the top nodes of the hierarchy. This approach can avoid massive amounts of computation by organizing the known sets of mentions. This discriminative tree structure is a type of compression.

Singh et al. present a method of efficiently sampling factors to reduce the amount of work performed when computing features [88]. They observe that many factors are redundant and do not need to be computed when calculating the feature score. They use statistical techniques to estimate the feature scores with a user-specified confidence. This approach can be categorized as early stopping for feature computation.

There is no one-size-fits-all sampling algorithm [82]; each of these methods, compression and early stopping, has drawbacks. Compression may slow down insertion speed and requires extra bookkeeping to organize the data structure. Early stopping is not always precise, and adding extra conditionals in the Metropolis-Hastings loop structure slows computation. Applying each technique at appropriate times can remove pain points and accelerate the entity resolution process.

In this chapter, we discuss initial work towards the design of an optimizer that modifies the sampling-based collective entity resolution process to improve sampling performance. Static parameters for evaluating entity resolution rarely hold for the lifetime of a streaming processing task. The optimizer, in the spirit of the eddy database query optimizer [5], dynamically examines the current state of each proposal and suggests methods for evaluating proposals and structuring entities. We train a classifier to decide when the sampling process should use early stopping. Additionally, we use training data

PAGE 82

Figure 4-1. The high-level interaction of the optimizer. As streaming data updates pass to the machine learning model, the optimizer recommends the best algorithms to update the model. Entity resolution is an example of a model that needs to be frequently updated with new data

to decide when is the best time for a particular entity to be compressed. This is done with negligible bookkeeping. We make the following contributions:

- We identify several techniques to speed up sampling past a natural baseline.
- We create rules and techniques for an optimizer to choose parameters and methods at run time.
- We empirically evaluate these methods over a large data set.

We recognize that optimizers can also apply to many different long-running machine learning pipelines. Figure 4-1 depicts that the optimizer supervises the machine learning model. The optimizer determines the methods of processing the streaming updates of the model. As future work, we plan to create a full optimizer to study performance improvements on long-running machine learning tasks.

The outline of the chapter is as follows. In Section 4.2, we give an introduction to factor graph models and entity resolution. In Section 4.3, we further discuss the statistics that an optimizer for entity resolution can use. In Sections 4.4 and 4.5, we discuss the implementation of the optimizer. Finally, in Section 4.6, we examine the benefits by

PAGE 83

testing early stopping and compression over a synthetic and a popular real-world entity resolution data set.

4.2 Proposal Optimizer Background

Factor graphs are a pairwise formalism for expressing arbitrarily complex relationships between random variables [53]. A factor graph $F = \langle \mathbf{x}, \psi \rangle$ contains a set of random variables $\mathbf{x} = \{x_i\}_1^n$ and factors $\psi = \{\psi_i\}_1^m$. Random variables are connected to each other through factors. Factors are a mapping between one or more variables and a real-valued score.

The probability of a setting $\omega$ among the set of all possible settings $\Omega$ occurring in a factor graph is given by a probability measure:

$$\pi(\omega) = \frac{1}{Z} \sum_{x \in \omega} \prod_{i=1}^{m} \psi_i(x^i), \qquad Z = \sum_{\omega \in \Omega} \sum_{x \in \omega} \prod_{i=1}^{m} \psi_i(x^i)$$

where $x^i$ is the set of random variables that neighbor the factor $\psi_i$ and $Z$ is the normalizing constant.

Exact inference over complex factor graphs is computationally expensive because it involves computing the normalizing constant. Therefore, it is popular for researchers to use Markov Chain Monte Carlo (MCMC) approximation techniques to estimate the probability of settings. In particular, for large and dense factor graphs MCMC Metropolis-Hastings (MH) has been shown to be a scalable technique for inference calculation [86].

Cross-document entity resolution, resolving entities across document borders, is usually several orders of magnitude smaller when compared to within-document entity resolution. In large text corpora, the size of entities follows the power law [87]. For example, Figure 4-2 is a generated data set containing 40 million mentions and 3 million entities over 11 million web pages. As documents and mentions are incrementally streamed through, the scale problem becomes a critical issue.

The mentions on disk can be represented as a large array of identifiers. Entities are a collection of mentions and can be represented as such. In the worst case there is an equal

PAGE 84

Figure 4-2. A distribution of entity sizes from the wikilinks corpus [87] with an initial start and the truth

number of entities and mentions. This means each mention is its own individual entity. In the other extreme, all the mentions may be a part of the same entity. For streaming entity resolution, mentions within documents must be matched to the existing set of entities [78]. In this chapter, we assume the entity set is initialized by grouping the most similar mentions; new mentions are assigned to the closest match.

To compute the score at each step, the number of comparisons is proportional to the number of pairwise factors between mentions. The pairwise factors are weighted functions such as approximate string matches, token overlap and n-gram matches. There are additional cluster-wide features calculated at each step. Such features include functions to check whether all mentions in a cluster share the same token. For clusters larger than 1000 mentions, calculating scores of the model becomes extremely expensive. Performing sophisticated techniques over smaller clusters also adds extra overhead. In this chapter, we examine the trade-off of selecting methods to accelerate the feature computation process.

4.3 Accelerating Entity Resolution

In this section, we discuss the acceleration in MCMC-MH sampling for entity resolution. We then motivate how we believe gains can be achieved using compression, sampling acceleration methods and optimizers. We use a large real-world corpus as a motivating example.
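To make the cost that motivates this section concrete, the pairwise and cluster-wide factors described in Section 4.2 can be sketched in Python. The feature functions and weights below are hypothetical stand-ins (a Jaccard token overlap and a shared-token check), not the trained factors used in the experiments. The sketch also counts comparisons to show the quadratic growth in the number of pairwise factors.

```python
from itertools import combinations


def token_overlap(a, b):
    """Jaccard overlap of lowercased whitespace tokens (illustrative pairwise factor)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def shares_common_token(mentions):
    """Cluster-wide feature: do all mentions share at least one token?"""
    token_sets = [set(m.lower().split()) for m in mentions]
    return bool(token_sets) and bool(set.intersection(*token_sets))


def score_cluster(mentions, pair_weight=1.0, cluster_weight=1.0):
    """Return (score, number of pairwise comparisons) for one entity cluster."""
    pair_score = 0.0
    comparisons = 0
    for a, b in combinations(mentions, 2):   # n * (n - 1) / 2 pairs
        pair_score += pair_weight * token_overlap(a, b)
        comparisons += 1
    cluster_score = cluster_weight if shares_common_token(mentions) else 0.0
    return pair_score + cluster_score, comparisons
```

For a cluster of n mentions the loop performs n(n-1)/2 comparisons, so a cluster of 1000 mentions already requires 499,500 pairwise evaluations per scoring pass.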

PAGE 85

The two issues we are investigating are as follows. First, given a source entity, destination entity and mention $(e_s, e_d, m)$, which method can score the proposal in the least amount of time? Second, after the proposal is calculated, should we compress the entity structure? The optimizer will decide when to use each technique.

The total size of all entities in the traditional representation is:

$$\mathrm{sizeof}(E) = \sum_i \left( c + \mathrm{sizeof}(\mathrm{int}) \cdot |e_i| \right), \qquad (4\text{-}1)$$

where sizeof is an abstract function to compute the size of the containing object, $c$ is a class constant and $|e_i|$ is the number of mentions in the entity.

There are many compression techniques, one being to only keep mentions that have a unique representation inside entities. That is, if any mention token is a duplicate, we remove it. This compressed total entity size is:

$$\mathrm{sizeof}(E_{\mathrm{compressed}}) = \sum_i \left( c + \mathrm{sizeof}(\mathrm{int}) \cdot \#(e_i) \right), \qquad (4\text{-}2)$$

where $\#(e_i)$ is the cardinality of the mention tokens in entity $e_i$. We note that when $\#(e_i) \ll |e_i|$, it may be worth compressing the entity $e_i$.

In Figure 4-2, 45% of entities are smaller than 100 mentions in size. Additionally, 82% of entities contain less than 1000 mentions. These numbers suggest that at times we can take advantage of the redundancy within large entities by compressing them. We investigate the wikilinks corpus further in Section 4.6.1.

In addition, Figure 4-2 shows that there is an order of magnitude difference between the sizes of initial entities and the true entity sizes. The entities were initialized by exact string match, a common initialization scheme. This difference gives us some intuition of the trends of the entity resolution process. Additionally, this suggests that there are several distinct representations of entities. During entity resolution the sizes of entities can be expected to grow by an order of magnitude while the total number of smaller entities

PAGE 86

will decrease. We can use this property to track the growth and change of entity sizes over time to understand how to process a particular grouping of entities.

4.4 Proposal Optimizer Algorithms

In this section, we will describe simple algorithms for entity sampling and simple entity compression. After introducing the compression and approximation techniques we discuss how an optimizer can be designed to improve the overall sampling time.

The baseline method performs pairwise comparisons by iterating over the mentions using the order on disk. The mention ids are used to extract the contextual information of each mention from a database. This is the traditional method of computing the pairwise similarity of two clusters. This method results in simple code, so modern compilers are able to perform extreme optimizations such as loop unrolling.

The confidence-based scoring method performs uniform samples of the mentions from the source and destination entity clusters during scoring. This method measures the confidence of the calculated pairwise samples and stops when the confidence of a score exceeds a threshold of 0.95. This is a simplified version of the uniform sampling method described by Singh et al. [88].

The code to collect statistics is shown in Algorithm 18. The add function shows how and what statistics are recorded when each new mention is added. Notice the max and the min are variables in the Stats class that store the current maximum and minimum. The current sum and running mean are also updated with each new value added. The current implementation assumes the values from the pairwise factors follow a Gaussian distribution; the model in Singh et al. makes the same assumption [88].

As entity sizes grow, we can expect to see many repeats of the same or very similar mentions. Reducing the entity size will shrink the effective memory footprint of entities. This is important for a long-running collection of entities. Run-length encoding is the simplest method for compressing entities. This method compresses the near-duplicate mentions. A canonical mention is chosen for each set of exact duplicates and a counter map

PAGE 87

Algorithm 18 Sample code from the stats showing how running statistics are recorded and how the variance can be computed

void Stats::add(long double x) {
  themax = MAX(themax, x);        // running maximum
  themin = MIN(themin, x);        // running minimum
  sum += x;                       // running sum
  ++n;                            // number of values seen so far
  auto delta = x - mean;          // distance from the previous mean
  mean += delta / n;              // update the running mean
  M2 = M2 + delta * (x - mean);   // running sum of squared deviations
}

double Stats::variance(void) const {
  // Sample variance from the running sum of squared deviations.
  // As listed, 0.0 is returned until more than two values have been added.
  if (n > 2)
    return M2 / (n - 1);
  else
    return 0.0;
}

Table 4-1. A table of the techniques to improve the sampling process; each is classified by how it affects sampling

Technique                 Compression  Early Stopping  Overhead
Baseline                  No           No              None
Confidence-based [88]     No           Yes             Medium
Discriminative Tree [99]  Yes          No              Large
Run-Length Encoding       Yes          No              Small

records the number of duplicates that are represented. The compression rates become large for mention clusters with many duplicates.

4.5 Optimizer

Before calculating the MCMC-MH proposal there are several decisions we can make that will affect the runtime and accuracy of the algorithm. At each step we may: approximate the calculation of the entity states; update an entity structure to a compressed format; skip the calculation of the proposal and directly accept or reject. These decisions can be made by observing several features of a source entity,

PAGE 88

destination entity and a source mention. We enumerate a small set of features that can yield information to help us decide how the entity structure should be changed.

The decision to compress an entity takes four main points into consideration. First, the time it takes to compress the entity, C_time. For example, if the time it takes to compress an entity is the same as the time it takes to reach an answer in the uncompressed format, then compression is superfluous. Second, it is important to consider the space saved in memory and the number of additional entities that do not have to be fetched from disk and can now fit in memory, C_space. Third, we need to know how active an entity has been, C_activity. That is, how many additions or subtractions this entity has seen over a long period of time. This information is helpful in understanding the likelihood this entity will be requested for another addition or subtraction. Modifying entity clusters causes them to block. Last, we retain the activity of an entity over a recent, short period of time, C_velocity. This information lets us know whether it is smart for this entity to take the time out for compression while other mentions may be attempting an insertion or removal.

At each proposal step the decision made should maximize the utility. Utility of the decision is a numeric score to represent the gain of performing the proposal calculation. The utility value is a real number in the range ; 1 . A formal model for utility is as follows:

U = C_time + C_space + C_activity + C_velocity

Collecting statistics to measure utility can incur a significant overhead. Not every decision in the optimizer needs to be decided automatically. We can use some simple principles to estimate the utility at each point. In the next section, we examine an entity resolution data set and get some intuition for the development of the optimizer.

4.6 Proposal Optimizer Experiment Implementation

In this section, we first describe the Wikilink data set we use for experiments. Following, we present a micro benchmark to validate our investigation of entity approximation

PAGE 89

andcompression.Wethendiscusstheimplementationofthecompressionandapproximation techniquesoveralargereal-worldcross-documententityresolutioncorpus. 4.6.1WikiLinkCorpus TheWikilinkcorpusisthelargestfully-labeledcross-documententityresolution datasettodate[87].Whendownloaded,thedatasetcontains40millionmentionsand almostthreemillionentities|itisacompressed180GBsofdata.TheWikilinkcorpus wascreatedbycrawlingpagesacrossthewebandextractinganchortagsthatreferenced Wikipediaarticles.Eachpagecontainsmultiplemultiplementionsofdierenttypes.The Wikipediaarticlesactasthetruthforeachmention.Althoughmanuallyconstructedand notwithoutitsbiases,thisisthelargest,fully-labeledentityresolutiondatasetoverweb datathatwecouldndatthetimeofpreparation. 4.6.2MicroBenchmark ToincreaseourintuitionofearlystoppingtechniqueswesimulatedtheMCMC proposalprocesses.Wehypothesizethatarangeofvaluesexist,whereperformingthe baselineclustersamplingfasterthanusingearlystoppingmethods.Wearrangeentity clustersinincreasingsizeandwecomputethetimeinclocktickseachproposaltakes tocomputethearrangementoftheclusters.Thedataintheclustersaredistributed uniformlyforthisexperimentandeachclusterpointwas5dimensional.Forthebaseline clusterscorecomputationweusedapairwisecalculationoftheaveragecosinedistance withandwithoutthemention.Tocomputeearlystoppingwesetacondencethresholdto 0 : 8andtheearlystoppingcodestoppedcomputationwhenthepredictederrorwasunder 20%.Therewasnodierenceintheproposalchoicesofthebaselinemethodortheearly sortingmethod. ThesimulationsweredevelopedinGNUC++11andcompiledwithg++-O3.The CPUwasan8coreInteli7with3.2GHzand12GBsofMemory.Eacharrangementwas run5timesandresultsaverages. 89
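The two scorers compared in this micro benchmark can be sketched as follows. This is a minimal illustration rather than the simulation code itself. The names `RunningStats`, `baseline_average` and `early_stop_average` are hypothetical, and the running statistics follow the update style of Algorithm 18. The stopping rule is an assumption: it uses a normal approximation in which z = 1.2816 corresponds to a two-sided 80% confidence interval, and it stops once the interval half-width falls below 20% of the running mean. Pairs are scanned in index order for determinism, whereas a sampler would draw them at random.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Cosine distance 1 - cos(a, b); assumes equal-length, non-zero vectors.
double cosine_distance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return 1.0 - dot / (std::sqrt(na) * std::sqrt(nb));
}

// Running mean and variance, updated one observation at a time.
struct RunningStats {
    std::size_t n = 0;
    double mean = 0.0;
    double m2 = 0.0;
    void add(double x) {
        ++n;
        const double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }
    double variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }
};

// Baseline: average cosine distance over every pair of points.
double baseline_average(const std::vector<Point>& pts) {
    RunningStats s;
    for (std::size_t i = 0; i < pts.size(); ++i)
        for (std::size_t j = i + 1; j < pts.size(); ++j)
            s.add(cosine_distance(pts[i], pts[j]));
    return s.mean;  // 0.0 when there are fewer than two points
}

struct Estimate {
    double mean;
    std::size_t pairs_used;
};

// Early stopping (assumed rule): stop once the confidence-interval half-width
// is at most max_rel_error times the running mean.
Estimate early_stop_average(const std::vector<Point>& pts, double z = 1.2816,
                            double max_rel_error = 0.20,
                            std::size_t min_pairs = 3) {
    RunningStats s;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        for (std::size_t j = i + 1; j < pts.size(); ++j) {
            s.add(cosine_distance(pts[i], pts[j]));
            if (s.n < min_pairs) continue;
            const double half_width = z * std::sqrt(s.variance() / s.n);
            if (half_width <= max_rel_error * std::fabs(s.mean))
                return {s.mean, s.n};
        }
    }
    return {s.mean, s.n};  // ran out of pairs before the rule fired
}
```

To score a proposal, either function would be called on the cluster with and without the candidate mention and the two averages compared. The bookkeeping in `RunningStats` is the small constant overhead that the early stopping variant pays on every pair.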


Figure 4-3. Comparison of baseline versus early stopping methods

Early stopping or baseline. We first determine when early stopping approaches for proposal scoring are beneficial. For this result we compare the baseline proposal evaluator with a confidence-based scorer for varying entity sizes. The result of this experiment is summarized in Figure 4-3. The x-axis is the number of mentions in the source and destination cluster for each proposal. The y-axis is the number of clock ticks on a log scale.

We observe that for proposals with fewer than 100 and 1000 source and destination mentions, the performance of the baseline proposer is better than or almost equal to that of the more sorted early stopping method. For proposals that contain an entity cluster with 10000 mentions the early stopping method performs significantly better than the baseline method.

Surprisingly, the baseline proposals for entity clusters containing 100K mentions performed over an order of magnitude better than the early stopping method.

The optimization found in predictable code paths makes simple implementations like the baseline method attractive for small cluster sizes and very large cluster sizes. In addition, 82% of the entities in the truthed Wikilinks data set are fewer than 1000 mentions in size and 45% of the entities contain fewer than 100 mentions.
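One way to turn the observations above into a simple optimizer rule is sketched below. It is only an illustration. The cut-off constants are assumptions placed between the cluster sizes measured for Figure 4-3 (1000, 10000 and 100K mentions), not values reported in the experiment, and `choose_scorer` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>

enum class Scorer { Baseline, EarlyStopping };

// Assumed cut-offs: the measurements only cover 100, 1000, 10000 and 100K
// mentions, so the exact crossover points are not known.
constexpr std::size_t kEarlyStopMin = 5000;   // somewhere between 1000 and 10000
constexpr std::size_t kEarlyStopMax = 50000;  // somewhere between 10000 and 100K

// Pick a proposal scorer from the larger of the two clusters in a proposal.
Scorer choose_scorer(std::size_t source_mentions,
                     std::size_t destination_mentions) {
    const std::size_t largest = std::max(source_mentions, destination_mentions);
    if (largest >= kEarlyStopMin && largest < kEarlyStopMax)
        return Scorer::EarlyStopping;
    return Scorer::Baseline;  // small clusters and very large clusters
}
```

Because most entities in the corpus have fewer than 1000 mentions, a rule of this shape would send the large majority of proposals to the baseline scorer.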


Figure 4-4. The time for compression for varying entity sizes and cardinalities. This is compared with a line representing the time it takes to make 100K insertions

The results of the micro benchmark suggest that different proposal estimation techniques are useful at different times. Note that for these techniques a small constant amount of bookkeeping space is required to perform early stopping.

Insertion vs Compression Time. Compressing an entity is an expensive operation. When compressing an entity, it must be locked to prevent any concurrent access. In order to choose the best times to compress an entity cluster, in this micro benchmark we look at the time to compress entities of different cardinalities and compare it to the time it takes to insert entities. Using a synthetic data set we generated entities of varying sizes and cardinality. This experiment is shown in Figure 4-4.

The cardinality number is a ratio of duplicates in the data set. For example, cardinality 0.8 means 8 of 10 items in the data set are duplicates. The graph shows that in the time it takes to compress entities of about 300K the sampler could make 100K samples. We can conclude from these results that compressing large entities is expensive and should only be done if the cluster is prohibitively large and not popular.

Cardinality estimation for millions of entities is a significant overhead. Tracking cardinalities simultaneously for each entity, even using small probabilistic sketches such as HyperLogLog [32], becomes prohibitive for large numbers of entities. By the time the cardinality of an entity needs to be monitored for possible compression, that entity might as well be compressed. We are continuing to look for lighter weight cardinality estimators for millions of mentions so decisions can be made quickly.

4.7 Proposal Optimizer Summary

In this chapter, we describe an initial approach for optimizing sampling for the entity resolution process. We begin to develop an optimizer that attacks two major limitations, the size of the entities and the redundant computation. This chapter motivated the need for the optimizer and examined the feasibility of its creation. Future work includes the implementation of a full optimizer over a large, streaming corpus with resolved entities. We hope to soon have a fully resolved TREC stream corpus¹ and examine the performance of the optimizer on that large data set. Additionally, we hope to compare results with enterprise ER systems such as WOO [9].

¹ After acceptance the http://trec-kba.org/kba-stream-corpus-2014.shtml was linked to Freebase and is now available for researchers [27].
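As a small illustration of the compression and cardinality measure used in the experiment of Section 4.6.2, the sketch below run-length encodes the mentions of one entity and derives the duplicate ratio from the encoded form. It is a sketch under assumptions, not the implementation evaluated in this chapter. Mentions are reduced to plain strings, `Run`, `run_length_encode` and `duplicate_ratio` are hypothetical names, and the cardinality is read as the fraction of items that repeat an already seen value, which matches the "8 of 10 items" example when an entity holds two distinct mentions.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One distinct mention together with the number of copies it represents.
struct Run {
    std::string mention;
    std::size_t count;
};

// Sort so that duplicates become adjacent, then store each run once.
std::vector<Run> run_length_encode(std::vector<std::string> mentions) {
    std::sort(mentions.begin(), mentions.end());
    std::vector<Run> runs;
    for (const std::string& m : mentions) {
        if (!runs.empty() && runs.back().mention == m)
            ++runs.back().count;
        else
            runs.push_back(Run{m, 1});
    }
    return runs;
}

// Assumed reading of "cardinality": (items - distinct items) / items.
double duplicate_ratio(const std::vector<Run>& runs) {
    std::size_t total = 0;
    for (const Run& r : runs) total += r.count;
    if (total == 0) return 0.0;
    return static_cast<double>(total - runs.size()) /
           static_cast<double>(total);
}
```

The sort dominates the cost of this encoding, and the entity would need to stay locked while it runs, which is consistent with the observation that compression only pays off for large, rarely modified clusters.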


CHAPTER 5
QUESTION ANSWERING

Question Answering is the problem of bridging the gap between the way a user asks a question and the way an answer is encoded in the background knowledge. In this work we start with natural language questions and use the deep web as the background knowledge. The deep web, or hidden web, is the set of databases behind web forms on the web. In 2007, these databases were estimated to contain data two orders of magnitude larger than the surface web. Contrary to the surface web, the information in these databases is difficult to obtain.

In this section, I describe a method for accessing the deep web to answer wh-questions. This system is called the Morpheus QA system [40].

5.1 Morpheus QA Introduction

When traveling through a jungle to a destination, it is easy to get lost. The first person to journey somewhere may make a number of mistakes when trying to find the best path to their destination. Those who come later find it easier to reach the destination if a well-marked trail has been created. Olsen and Malizia describe this idea as exploiting trails [72]. Rather than treating a user's discovery experience as a unique entity, one can exploit the fact that a similar search may have already been performed. In one study, almost 40 percent of web queries were repetitions of previous queries [92]. Thus, reuse of prior searches is one way to optimize the search process. Morpheus is a question answering system motivated by reuse of prior web search pathways to yield an answer to a user query. Morpheus follows path finders to their destinations and not only marks the trail, but also provides a taxi service to take followers to similar destinations.

Morpheus focuses on the deep or hidden web to answer questions because of the large stores of quality information provided by the databases that support it [69]. Web forms act as an interface to this information. Morpheus employs user exploration through these web forms to learn the types of data each deep web location provides.


There are two distinct Morpheus user roles. A path finder enters queries in the Morpheus web interface and searches for an answer to the query using an instrumented web browser. This web tracking tool stores the query and the information necessary to revisit the pathways to the page where the path finder found the answer. A path follower uses the Morpheus system much like a regular search engine with a natural language interface. The path follower enters a question in a text box and receives a guided path to the answer. The system exploits previously found paths to provide an answer.

Morpheus represents a user question as a semi-structured query (SSQ). It assumes the query terms belong to classes of a consistent realm-based ontology, that is, one having a singly rooted heterarchy whose subclass/superclass relations have meaningful semantic interpretations. When a path follower enters a query, Morpheus ranks SSQs in the store based on class similarity. Suppose a path follower asks: A 1997 Toyota Camry V6 needs what size tires? In this query the classes associated with terms, e.g. Manufacturer with Toyota, help us identify similar queries.

This chapter discusses related question answering and ontology generation systems in Section 5.2. Section 5.3 explains the Morpheus system and its implementation. In Section 5.4 we describe the current results of our approach. Finally, we conclude with future goals for the system.

5.2 Morpheus QA Related Work

5.2.1 Question Answering Systems

The earliest question answering systems such as BASEBALL [43] and Lunar [102] had closed domains and closed corpora, that is, they support a finite number of questions on corpora containing a fixed set of documents. Morpheus uses the web as its dynamic, open corpus and examines deep web sources to answer questions. This process is federated question answering.


Several other QA systems that use the web as a resource have been developed. Example systems include START¹ and Swingly.² These systems use web pages from web crawlers or search engines to find answers. Morpheus differs in that it seeks out relevant deep web sources, and instead of using a web search engine, it uses only the pages referenced in a previously answered question.

5.2.2 Ontology Generators

The DBpedia³ project is a community of contributors extracting semantic information from Wikipedia and making this information available on the Web. Wikipedia semantics includes disambiguation pages, geo-coordinates, categorization information, images, info-box templates, links to external web pages, and redirects to pages in Wiki markup form [13]. DBpedia does not define any new relations between the Wikipedia categories.

YAGO is a semi-automatically constructed ontology obtained from the Wikipedia pages, info-boxes, categories, and the WordNet⁴ synsets heterarchy [91]. YAGO uses the Wikipedia page titles as its ontology individuals and categories as its ontology classes. YAGO uses only the nouns from WordNet and ignores the WordNet verbs and adjectives. YAGO discovers connections between WordNet synsets and Wikipedia categories, parsing the category names and matching the parsed category components with the WordNet synsets. Each Wikipedia category not having a WordNet match is ignored in the YAGO ontology. The ontology's heterarchy is built using the hypernym and hyponym relations of the WordNet synsets.

We use YAGO's principles to construct ontologies that provide similarity measures for answering questions within the same domain. Thus far, these ontologies can be used to classify terms; however, their classes do not always appropriately categorize query parameters. It is necessary to provide an appropriate level of class granularity. Section 5.3 discusses our approach for identifying classes and their instances from deep web forms and documents.

¹ http://start.csail.mit.edu
² http://swingly.com
³ http://dbpedia.org
⁴ http://wordnet.princeton.edu

5.3 Morpheus QA System Architecture

This section presents ontology and corpora, query processing, ranking queries, and query execution.

5.3.1 Using Ontology and Corpora

Morpheus uses an ontology that contains classes of a particular realm of interest. Each leaf node in the ontology is associated with a corpus of words belonging to a class. For example, we have constructed a vehicular ontology containing classes relevant to the vehicular realm. This ontology provides a structure for reference in the following sections.

Morpheus references the DBpedia categories, Wikipedia pages, and the WordNet synset heterarchy to find class-relevant web pages. First a realm is mapped to a DBpedia category [13]. Using the DBpedia ontology properties broader and narrower, a Markov Blanket [75] is created covering all neighboring categories.

To build a corpus for each of the leaf nodes in the ontology, we extract terms from the Wikipedia pages associated with the DBpedia categories found in its blanket. From this term corpus, we can find the likelihood of a term belonging to a particular class. This assists in classifying terms in a path follower query. The likelihood is determined by the probability of a class given a term using Bayes' rule (Eq. 5-1), since we can easily obtain the term-class and term-corpus probabilities as relative frequencies.

\[ P(\text{class} \mid \text{term}) = \frac{P(\text{term} \mid \text{class})\, P(\text{class})}{P(\text{term})} \tag{5-1} \]

In addition, we employ Latent Dirichlet Allocation (LDA) to identify latent topics of the documents in a corpus [14]. LDA is a Bayesian model that represents a document in the corpus by a distribution over topics, and a topic itself is a distribution over all terms in the corpus. For example, the latent topics reflect the thematic structure of Wikipedia pages. Thus, LDA discovers relevant topic proportions in a document using posterior inference [14]. Given a text document, we tag related documents by matching their similarity over the estimated topic proportions, assisting in ontology and corpora construction. We use LDA as a dimensionality reduction tool. LDA's topic mixtures are represented as feature vectors for each document. We are evaluating support vector machines as a classifier over the document-topic proportions. Due to its fully generative semantics, this usage of LDA could address drawbacks of frequency based approaches (e.g. TF-IDF, LSI, and pLSI) such as dimensionality and failure to find the discriminative set of words for a document.

5.3.2 Recording

The Query Resolution Recorder (QRR) is an instrumented web browser that records the interactions of a path finder answering a question. The path finder also uses the tool to identify ontological classes associated with search terms. Morpheus stores the query, its terms, and its classes as an SSQ. Table 5-1 is an example showing the SSQ model of the query: A 1997 Toyota Camry V6 needs what size tires? The SSQ in Table 5-1 is said to be qualified because the classes associated with its terms have been identified. Using the QRR, the path finder is also able to identify where answers can be found within traversed pages.

Table 5-1. Example SSQ model

    Terms:    1997, Toyota, Camry, V6, size, tires
    Input:    Date, Manufacturer, Model, Engine
    Output:   Measurement, Part

The Query Resolution Method (QRM) is a data structure that models the question answering process. A QRM represents a generalized executable realization of the search process that the path finder followed. The QRM is able to reconstruct the page search path followed by the path finder. Each QRM contains a realm from our ontology, an SSQ, and information to support the query answering process. For each dynamic page, the QRM contains a list of inputs and reference outputs from the URL.

When a path follower submits a query the Morpheus search process parses and tags the query in order to record important terms. The system assigns the most probable realm given the terms in the query as calculated from realm-specific corpora. Once the realm is assigned, an ontology search is performed to assign classes to the terms. An SSQ is constructed and the system attempts to match this new SSQ to existing QRM SSQs. Rather than matching exact query terms, the system matches input and output classes, because a QRM can potentially answer many similar queries.

5.3.3 Ranking

To answer a user's query, a candidate SSQ, Morpheus finds similar qualified SSQs that are associated with QRMs in the Morpheus data store. To determine SSQ similarity, we consider the SSQ's realm, input terms, output terms, and their assigned classes. The class divergence of two classes within the ontology characterizes their dissimilarity. This solution is motivated by the concept of multiple dispatch in CLOS and Dylan programming for generic function type matches [8].

We consider the class match as a type match and we use class divergence to calculate the relevance between the candidate SSQ and a qualified SSQ. Each qualified SSQ will have input terms, output terms, associated classes, and one realm from the QRM. For the candidate SSQ, the relevant classes for terms are determined from the natural language processing engine and corpora. The calculation of a realm for a candidate query is performed using the terms found within the query and any probabilities found with $p(\text{realm} \mid \text{term})$. We match QRMs that belong to the same realm as the candidate SSQ. The relevance of a qualified SSQ to the candidate SSQ is determined by aggregating the divergence measure of input term classes associated with each SSQ. In addition, we order QRMs in the data store by decreasing relevance. The order provides a ranking for the results to the user. The following describes class divergence in detail.


We define class divergence (Eq. 5-2), a quasi-metric, between a source class and a target class using the topological structure of the classes in an ontology. We write $S \preceq T$ for the reflexive transitive closure of the superclass relation. Let $d(P, Q)$ represent the hop distance in the directed ontology inheritance graph from $P$ to $Q$. The class divergence $cd$ between a source and target class ranges from zero for identical classes to one for type incompatible classes. Let $S$ be the source class, $T$ be the target class, and $C$ be a least common ancestor class of $S$ and $T$ (i.e., one that minimizes $d(S, C) + d(T, C)$). The class divergence between $S$ and $T$ is defined by:

\[
cd(S, T) =
\begin{cases}
0 & S.\mathit{Uri} \equiv T.\mathit{Uri} \\
d(S, T)/h & S \preceq T \\
1 & T \preceq S \\
\bigl(d(S, \mathit{root}) + d(S, C) + d(T, C)\bigr)/h & \text{otherwise}
\end{cases}
\tag{5-2}
\]

where $h$ is the longest path in the ontology class heterarchy.

Note, if $S \preceq T$ and $S \not\preceq Q$ then $cd(S, T)$

Figure 5-1. Abbreviated vehicular ontology

5.3.4 Executing New Queries

Once we have ranked the QRMs for a given user query, we can produce answers by re-visiting the pathways stored in the QRMs. The Morpheus Query Executor (QE) evaluates a script of the query resolving process. It simulates a human clicking buttons to follow links, submit forms, and highlight data, forming a textual answer. The QE assumes that because of the auto-generated nature of deep web pages, the location of answers is the same irrespective of page changes. It uses the relative XPath location to the answer node on HTML pages as described in Badica et al. [6].

5.4 Morpheus QA Results

First, we built an ontology for the vehicular realm exploiting the Wikipedia pages, DBpedia categories, and WordNet synsets. For each of the classes in the ontology we built corpora from the corresponding Wikipedia pages. Figure 5-1 shows a subsection of this ontology.

In Table 5-2 we show the data output by the Morpheus parse of the query. It extracts the wh-term that classifies the sentence as a question, identifies the answer class, and locates descriptive phrases to produce the answer. Finally, the engine produces n-grams from phrases in the descriptive information sections.

Table 5-2. The output of the NLP engine

    wh-term               what
    descriptive phrases   1997 Toyota Camry V6
    asking for            size tires
    n-grams               1997, 1997 Toyota, 1997 Toyota Camry,
                          Toyota, Toyota Camry, Toyota Camry V6,
                          Camry, Camry V6, V6

Using the data in Table 5-2 we determine relevant classes in non-increasing order of relevance. Table 5-3 shows the eight best term classes and their probabilities for automotive queries.

Table 5-3. Term classes and probabilities

    Term               Class     P(Class | Term)
    1997               Sedans    404132.77e-14
    1997 Toyota        Engines   7.90e-14
    Toyota             Sedans    3486670.15e-14
    Toyota Camry       Sedans    12147.23e-14
    Toyota Camry V6    Coupes    13.80e-14
    Camry              Sedans    312034.20e-14
    Camry V6           Coupes    13.80e-14
    V6                 Sedans    4464535.40e-14

We found the best classes for the terms in the candidate SSQ. We calculated the class divergence between these classes and the qualified SSQ classes in the QRM store. QRMs are ranked based upon the relevance score and the class divergence measure. Table 5-4 shows the answers produced by the QRE, a Python backend, and the three highest ranked queries. Finally, we execute the best QRMs and display the results to the user.

Table 5-4. Highest ranked Morpheus QA queries

    Query                                               Tagged Classes                             Score
    A 1997 Toyota Camry V6 needs what size tires?       Sedan, Automobile, Engine, Manufacturer    0.91
    What is the tire size for a 1998 Sienna XLE Van?    Van, Manufacturer                          0.72
    Where can I buy an engine for a Toyota Camry V6?    Sedan, Automobile Engine, Manufacturers    0.74

5.5 Morpheus QA Summary

In this work, we propose a novel question answering system that uses the deep web and previously answered user queries to answer similar questions. The system uses a path finder to annotate answer paths so path followers can discover answers to similar questions. Each (question, answer path) pair is assigned a realm, and new questions are matched to existing (question, answer path) pairs. The classification of new question terms into classes is based on term frequency distributions in our realm-specific corpora of web documents. These terms are the input to existing answer paths and we re-execute these paths with the new input to produce answers.

Our solution is composed of a web front end where users can ask questions. The QRR was developed as a Firefox plugin and an associated C# application. Our similarity measures were coded using Java and open source libraries. Answers are produced by the QRE, a Python backend. The data is stored in a PostgreSQL database.

Topic modeling provides a promising approach to identifying pages relevant to a class in a more automated manner. We believe these web form entry annotation methods and form label extraction [69] can yield promising results. Combining this with the method of Elmeleegy et al. [30] may remove the user from the answer path generation process.

Future investigation in this area should look to merge compatible QRMs to answer compound questions, chaining QRMs using the principles of transform composition [73].
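The term classification by relative frequencies summarized above (Eq. 5-1 in Section 5.3.1) can be illustrated with a short sketch. It is not the Morpheus implementation, whose similarity measures were written in Java. `TermClassCounts` is a hypothetical in-memory store for a single realm, and the counts in the usage note are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical term-frequency store for one realm.
class TermClassCounts {
public:
    // Record that `term` occurred `count` times in the corpus of class `cls`.
    void add(const std::string& cls, const std::string& term,
             std::size_t count = 1) {
        assert(count > 0);
        by_class_[cls][term] += count;
        class_total_[cls] += count;
        term_total_[term] += count;
        total_ += count;
    }

    // P(class | term) = P(term | class) P(class) / P(term),
    // with every factor estimated as a relative frequency.
    double posterior(const std::string& cls, const std::string& term) const {
        const auto c = by_class_.find(cls);
        const auto t = term_total_.find(term);
        if (c == by_class_.end() || t == term_total_.end()) return 0.0;
        const auto ct = c->second.find(term);
        if (ct == c->second.end()) return 0.0;
        const double n = static_cast<double>(total_);
        const double class_total = static_cast<double>(class_total_.at(cls));
        const double p_term_given_class =
            static_cast<double>(ct->second) / class_total;
        const double p_class = class_total / n;
        const double p_term = static_cast<double>(t->second) / n;
        return p_term_given_class * p_class / p_term;
    }

private:
    std::map<std::string, std::map<std::string, std::size_t>> by_class_;
    std::map<std::string, std::size_t> class_total_;
    std::map<std::string, std::size_t> term_total_;
    std::size_t total_ = 0;
};
```

For instance, if "camry" occurs three times in a Sedans corpus and once in a Coupes corpus, `posterior("Sedans", "camry")` evaluates to 3/4 regardless of how many other terms the corpora contain, because the corpus size cancels between the prior and the evidence.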


CHAPTER 6
PATH EXTRACTION IN KNOWLEDGE BASES

Knowledge bases are increasingly being augmented using unstructured data to extract actionable information. Typically, KBs are populated with triples of information and then searched with queries to discover a new subset of data. Inference is the task of extracting knowledge that is not explicitly represented. Knowledge bases contain boundless amounts of information that needs to be extracted using new and efficient methods. Extracting sets of information from knowledge bases is an exciting area of current research. In this chapter, we describe a new path traversal process over knowledge bases with uncertainty. I define an algorithm to discover, extract, and rank connected sets of facts in a knowledge base between multiple entities of interest. I empirically show that the path expansion methods described are useful to express relationships between entities.

6.1 Preliminaries for Knowledge Base Expansion

In this section, we discuss the fundamental concepts underlying path expansion. First is a discussion of knowledge bases, and we then formally describe a probabilistic knowledge base. Note that formalisms and previous work in this section are shared across recent publications.

6.1.1 Probabilistic Knowledge Base

In this section we formally describe a probabilistic knowledge base. This definition is derived from Chen et al. [21]. A probabilistic knowledge base is a 5-tuple $\Gamma = (\mathcal{E}, \mathcal{C}, \mathcal{R}, \Pi, \mathcal{L})$ where

1. $\mathcal{E} = \{e_1, \ldots, e_{|\mathcal{E}|}\}$ is a set of entities. Each entity $e \in \mathcal{E}$ refers to a real-world object.

2. $\mathcal{C} = \{C_1, \ldots, C_{|\mathcal{C}|}\}$ is a set of classes or types. Each class $C \in \mathcal{C}$ is a subset of $\mathcal{E}$: $C \subseteq \mathcal{E}$.

3. $\mathcal{R} = \{R_1, \ldots, R_{|\mathcal{R}|}\}$ is a set of relations. Each $R \in \mathcal{R}$ defines a binary relation on $C_i, C_j \in \mathcal{C}$: $R \subseteq C_i \times C_j$. We call $C_i, C_j$ the domain and range of $R$ and use $R(C_i, C_j)$ to denote the relation and its domain and range.

4. $\Pi = \{(r_1, \omega_1), \ldots, (r_{|\Pi|}, \omega_{|\Pi|})\}$ is a set of weighted facts or relationships. For each $(r, \omega) \in \Pi$, $r$ is a tuple $(R, x, y)$, where $R(C_i, C_j) \in \mathcal{R}$, $x \in C_i \in \mathcal{C}$, $y \in C_j \in \mathcal{C}$, and $(x, y) \in R$; $\omega \in \mathbb{R}$ is a weight indicating how likely $r$ is true. We also use $R(x, y)$ to denote the tuple $(R, x, y)$.

5. $\mathcal{L} = \{(F_1, W_1), \ldots, (F_{|\mathcal{L}|}, W_{|\mathcal{L}|})\}$ is a set of weighted clauses or rules. It defines a Markov logic network. For each $(F, W) \in \mathcal{L}$, $F$ is a first-order logic clause, and $W \in \mathbb{R}$ is a weight indicating how likely formula $F$ holds.

We refer the reader to [21] for a discussion of first-order probabilistic logic.

6.1.2 Markov Logic Network and Factor Graphs

Markov logic networks (MLN) [79] combine first-order logic [90] and probabilistic graphical models [53] into a single model. Essentially, an MLN is a set of weighted first-order formulae $(F_i, W_i)$, the weights $W_i$ indicating how likely the formula $F_i$ is true. A simple example of an MLN is:

1. 0.96   born in(Ruth Gruber, New York)

2. 1.40   $\forall x \in$ Person, $\forall y \in$ CITY: born in($x$, $y$) $\rightarrow$ live in($x$, $y$)

It states a fact that Ruth Gruber was born in New York City and a rule that if a person $x$ is born in a city $y$, then $x$ lives in $y$. However, neither statement definitely holds. The weights 0.96 and 1.40 specify how strong they are; stronger rules are less likely to be violated.

An MLN can be viewed as a template for constructing ground factor graphs. In the ground factor graph, each node represents a fact in the KB, and each factor represents the causal relationship among the connected facts. For instance, for the rule above grounded with born in(Ruth Gruber, New York), we have two nodes, one for the head and the other for the body, and a factor connecting them, the values depending on the weight of the rule. The factors together determine a joint probability distribution over the facts in the KB. A factor graph is a set of factors $\Phi = \{\phi_1, \ldots, \phi_N\}$, where each factor $\phi_i$ is a function over a random vector $X_i$ indicating the causal relationships among the random variables in $X_i$. These factors together determine a joint probability distribution over the random vector $X$ consisting of all the random variables in the factors. Mathematically, the factor graph defines a probability distribution over its variables $X$, for which we seek the maximum a posteriori (MAP) configuration:

\[ P(X = x) = \frac{1}{Z} \prod_i \phi_i(x) = \frac{1}{Z} \exp\Bigl(\sum_i w_i n_i(x)\Bigr) \tag{6-1} \]

6.1.3 Sampling for Marginal Inference

Computing the exact $Z$ in Equation 6-1 is intractable due to the large space of possible configurations. Sampling algorithms are typically used to approximate the marginal distribution since direct computation is difficult. The two most popular of these approaches are Gibbs sampling [19] and MC-SAT [77]. These two sampling algorithms are briefly discussed in the following two sections.

6.1.3.1 Gibbs sampling

Gibbs sampling [19] is a special case of the Metropolis-Hastings algorithm [22]. The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution. The Gibbs sampling algorithm is described in Algorithm 19:

Algorithm 19. Gibbs Sampling

    z^{(0)} := <z_1^{(0)}, ..., z_k^{(0)}>
    for t <- 1 to T do
        for i <- 1 to k do
            sample z_i^{(t)} from P(Z_i | z_1^{(t)}, ..., z_{i-1}^{(t)}, z_{i+1}^{(t-1)}, ..., z_k^{(t-1)})
        end for
    end for

It begins with some initial value, which can be determined randomly. Each variable is sampled from the distribution of that variable conditioned on all other variables, making use of the most recent values and updating the variable with its new value. The marginal probability of any variable can be approximated by averaging over all the samples of that variable. Usually, some number of samples at the beginning (the burn-in period) are ignored, and then the values of the remaining samples are averaged to compute the expectation. Gibbs sampling algorithms are implemented in state-of-the-art statistical relational learning and probabilistic logic inference software packages [52, 70].

6.1.3.2 MC-SAT

In real world data sets, considerable numbers of Markov logic rules are deterministic. Deterministic dependencies break a probability distribution into disconnected regions. When deterministic rules are present, Gibbs sampling tends to be trapped in a single region and never converges to the correct answers. MC-SAT [77] solves the problem by wrapping a procedure around the SampleSAT uniform sampler that enables it to sample from highly non-uniform distributions over satisfying assignments.

Algorithm 20. MC-SAT(clauses, weight, num_samples)

    x^{(0)} <- Satisfy(hard clauses)
    for i <- 1 to num_samples do
        M <- {}
        for all c_k in clauses satisfied by x^{(i-1)} do
            with probability 1 - e^{-w_k} add c_k to M
        end for
        sample x^{(i)} ~ U_SAT(M)
    end for

For a more detailed discussion of MC-SAT, we refer to the original publication [77].

6.1.4 Linking Facts in a Knowledge Base

The linking of facts is knowledge. To extract knowledge from large sets of facts we can use the linked structure created through probabilistic rules. This connection is only one method that may be used to link facts. We enumerate four methods that link concepts in a knowledge base: (a) Rules; (b) Records; (c) Record Linking; (d) $\mathbb{R}^d$-space.

In a knowledge base, any fact may be represented as a single triple or as a set of triples. We previously discussed how to perform inference over knowledge bases using rules; this is the first way knowledge bases can be combined to extract information (Section 6.1.2). Another well known method for linking facts together is through SPARQL queries. Large triple stores (recall that each triple may be considered a fact) can be queried using a declarative query language. Data in these triple stores are linked between nodes and edges, where a node may be an entity or a value and edges represent relationships between two nodes. With SPARQL, users declaratively express the information that they would like to search for in a triple store and the language generates a mix of sub-tree templates and operations that produce the requested information. SPARQL is a powerful method for discovering and linking facts.

The third method, record linking, describes the connection of the nodes and edges in a triple store to create a path of connected statements. The connection of facts causes multiple entities that were not previously linked to form a relationship. While these connections may be long and possibly non-informative, in bulk they provide information seekers with a summary of the connection between entities. We study an implementation of this method over a large knowledge base in a relational database management system and also in a graph database.

The fourth method, (d) $\mathbb{R}^d$-space, describes a method of embedding facts into a vector space. Knowledge base type inference can then be performed over the vector space itself. There are several methods of encoding facts into vector spaces [50], but knowledge base operations over the vector space are an area of present research. KB operations in $\mathbb{R}^d$-space are an open problem.

6.2 Fact Path Expansion Related Work

In this section, I describe three works related to fact path expansion. I first discuss the creation of SPARQL queries to extract paths. Following, I discuss two research projects that use paths in knowledge bases and on the web to extract information. The first is path ranking from researchers at Carnegie Mellon University and the second is fact ranking from Yahoo!.


6.2.1 SPARQL Query Path Search

Querying data over triples in a resource description framework (RDF)¹ format is performed using an SQL-like language called the SPARQL Protocol and RDF Query Language (SPARQL). To discover paths in RDF stores, SPARQL has defined property paths as part of the standard.² The property path extension to SPARQL allows for the specification of graph patterns of arbitrary length. Property path queries can define variable ends of paths to return all paths between nodes. We are most interested in a function to extract paths of any length called ArbitraryLengthPath. This definition can be found in the standard.

In our use case, we do not use the RDF format as we do not store the knowledge base in the RDF format. To adapt our methods to an RDF data set we would need to perform approximate matching on RDF elements. We would still need an external method of ranking the paths.

¹ http://www.w3.org/RDF/
² SPARQL 1.1 standard and the property path definition http://www.w3.org/TR/2010/WD-sparql11-query-20101014/#propertypaths.

6.2.2 Path Ranking

Ni Lao et al. investigated the ranking of paths in graphs and knowledge bases [54, 55]. They train a model to learn new instances of relations between two or more entities by paths. The researchers perform cross validation to rank the top returned paths using Mechanical Turk. The graph they describe connects entities using relations in the knowledge graph. Relations, or edges, are directed, and if a path goes in the opposite direction it is described as an inverse walk. For efficiency, paths are also constrained based upon rules and class types. For example, researchers extract a path constrained based on the Horn clause $\text{isa}(x, c) \wedge \text{isa}(x', c) \wedge \text{AthletePlaysInLeague}(x', y) \rightarrow \text{AthletePlaysInLeague}(x, y)$. A random walk is performed to discover common paths, and the top paths are ranked and returned. This work is essentially a type of rule learning over knowledge bases; the end product is a list of paths that are candidates for new logical rules.

In this dissertation, it is not necessary to summarize the paths to create a new link. Instead, we look at the aggregate of all paths to make a statement about the source and destination entity. This work primarily creates a graph over the similarity space, although the techniques can extend to the links generated over the rule space or over the graph space.

6.2.3 Fact Rank

Jain et al. explored a technique called FactRank that used similar entities in a knowledge base fact base to find influential facts and trustworthy facts [47]. Given a set of relations of interest, a graph is created based on all the facts that have the same subjects and objects. Links in the graph are bidirectional and are created when two facts have a matching subject or object. They create a modification of the PageRank algorithm [17] that consistently outperforms the traditional algorithm when discovering the most important facts. This work also connects facts that have an overlapping subject or object, but we do not restrict paths to a set of relation types. We focus on computation time and the discovery of novel paths of facts.

6.3 Fact Path Expansion Algorithm

The goal of the fact path expansion is to collect and rank the most representative paths between a source and destination entity. Starting with a source entity, this algorithm performs a search through the knowledge base space to extract candidate paths. The candidate paths are ranked based on the similarity of the match along the path. We first formalize the definition of the algorithm and provide pseudocode. We separately discuss how to rank the output to obtain the most representative paths. After an explanation of the algorithm we describe the different implementations of the algorithm over a popular knowledge base.


Givenaknowledgebase G whichstorestriples g i 2h s i ;p i ;o i i where g i 2G ,the algorithmtakesasourcequerynode e s andatargetnode e t .Steponeofthealgorithmis tondallthetriples g e s 2G suchthat s g e s s e s or o g e s o e s .Eachoftheseinitialnodes areconsideredthestartentitiesforanypotentialpaths.Onlythesubject s andobject o arelinkedinthepathexploration,andnotthepredicate p ,becauseweareinterestedin theentitiesconnectingfacts.Forthepurposesofexplorationwearenotsointerestedin themeaningofthepredicateorsimilarityofrelations. Themostrecentdiscoverednodesareaddedtotheworkingset W where W G . Next,werecursivelyexpandeachtripleintheworkingsettolookfortriplesthatissimilar totheworkingsettriple.Thatis,foreach w = h s w ;p w ;o w i ,ndallthetriplesin g i 2G suchthat s w s g i or o w o g i Theresultofthisstepbecomesthenewworkingset.We alsoordertheiteminthenewworkingsetbasedonthesumofcalculatedsimilarityscores ofthecurrentmatchandthepreviouspaths.Afterwesortthepaths,thenweleaveitto theendusertodrawconclusionsonthelinkedsetoffacts. Algorithm21 Formalalgorithmicdenitionforfactpathconstruction Input: Aknowledgebase G . Aquerystartentity e s . Aquerytargetentity e t . Output: Linkedpaths P . 1: V ; . Visitedset. 2: W f g w j s e s s g w or o e s o g w ;g w 2Gg 3: while W 6 = ; do 4: W 0 f g w j s g i s g w or o g i o g w ;g w 2W ;g i 2Gg 5: W 0 W 0 n V. Removeallvisitednodes. 6: V V [ W 0 7: P P [f w j e t 2f w s ;w o g ;w 2 W 0 . Emitpathsthatreachthetarget. 8: W W 0 9: endwhile return Sort P . Assumeallprovenanceispreserved. Algorithm21describesthemethodforconstructingtheknowledgebasepaths. Noticethatineachexpansionstep,onlythemostrecentlyexpandednodesareaddedto theworkingset.Thealgorithmendswhentheworkingsetisemptyoruponreachinga 110

PAGE 111

maximum depth (not shown). The variable $P$ represents the linked sets of paths traversed by the algorithm. The final Sort algorithm sorts the paths by their connection similarity.

Figure 6-1. An example of the increase of the facts and the number of relations over several timestamps.

Ranking Fact Paths. The set of candidate paths should be ranked by how representative they are of the set of paths connecting the two entities. A quality path is trustworthy, timely, representative, and relevant. It is important that each fact included in a path is true. An untrue fact, or a fact with a low probability of being correct, renders the resulting path arbitrary.

Each fact is associated with an extraction time. Facts may have actually occurred at a much different time. The extraction time of a fact is distinct from the true range a fact represents. Figure 6-1 shows the number of connections nodes have over time. We see that the number of nodes and the number of edges increase over time. These nodes, which represent facts, also increase in complexity. With facts changing over time, the meaning and possibly the truthfulness of facts may change. In fact, Figure 6-2 shows the change of probabilities over time. Because the addition of facts changes the probabilities, it is important to take the latest probability under consideration.
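The expansion loop of Algorithm 21 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the similarity test is a case-insensitive exact match, two facts are linked when any subject or object of one matches any subject or object of the other (the join condition used in the PostgreSQL listing of Algorithm 22), the `max_depth` parameter stands in for the maximum depth that the pseudocode does not show, and paths are sorted by length because every exact match has the same similarity. The names `Triple`, `similar`, `linked`, `mentions`, and `fact_path_expansion` are hypothetical.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["s", "p", "o"])


def similar(a, b):
    """Stand-in similarity test (assumption): case-insensitive exact match."""
    return a.lower() == b.lower()


def linked(t1, t2):
    """Two facts are linked when any subject/object pair matches."""
    return any(similar(a, b) for a in (t1.s, t1.o) for b in (t2.s, t2.o))


def mentions(triple, entity):
    """True when the entity is the subject or the object of the fact."""
    return similar(triple.s, entity) or similar(triple.o, entity)


def fact_path_expansion(kb, source, target, max_depth=3):
    """Breadth-first expansion of fact paths from `source` toward `target`."""
    # Working set: one single-fact path per triple that mentions the source.
    frontier = [(t,) for t in kb if mentions(t, source)]
    visited = {path[-1] for path in frontier}
    paths = [path for path in frontier if mentions(path[-1], target)]
    for _ in range(max_depth):
        next_frontier = []
        for path in frontier:
            for t in kb:
                # Only expand to facts not seen at an earlier level.
                if t not in visited and linked(path[-1], t):
                    next_frontier.append(path + (t,))
        visited |= {path[-1] for path in next_frontier}
        # Emit the paths whose last fact reaches the target.
        paths.extend(p for p in next_frontier if mentions(p[-1], target))
        frontier = next_frontier
        if not frontier:
            break
    # Assumed ranking: shorter paths first.
    return sorted(paths, key=len)


kb = [
    Triple("pot", "legalized in", "Colorado"),
    Triple("Colorado", "covered by", "FoxNews"),
    Triple("Biden", "visited", "Colorado"),
    Triple("Reddit", "discusses", "Biden"),
]
for path in fact_path_expansion(kb, "pot", "FoxNews"):
    print(" --> ".join("<%s|%s|%s>" % t for t in path))
```

Each path keeps its full tuple of facts, which is one simple way to honor the "provenance is preserved" assumption of the pseudocode.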

PAGE 112

Figure 6-2. A sample of nodes and their changing probabilities over time. The figure is darkened to show the many overlapping lines.

A representative set of paths consists of paths that are unique within the set of all possible paths and that also summarize the candidate paths. We compute what is essentially a TF-IDF score of the edges in the relations and entities mentioned in the candidate set. Lastly, the path should be relevant, meaning we favor the more popular entities and relations; obscure relations or entities may not be useful.

For each path, a pair of global node/edge scores and local node/edge scores are computed. These scores boost the paths in the candidate set that are most representative of the candidate set of paths and minimally representative of the global set. To that end we define the GlobalNode and LocalNode as follows:

$$\mathrm{GlobalNode}(n, N) = \sum_{n \in N_{path}} \frac{|n \in N| - \min f_N}{\max f_N - \min f_N + \text{[\ldots]}}$$

$$\mathrm{LocalNode}(n, N_{path}) = \sum_{n \in N_{path}} \frac{|n \in N_{path}| - \min f_{N_{path}}}{\max f_{N_{path}} - \min f_{N_{path}} + \text{[\ldots]}}$$

where $N_{path}$ is the set of nodes in the candidate paths and $N$ is the global set of nodes. Similarly, the GlobalEdge and LocalEdge are defined as follows:

PAGE 113

$$\mathrm{GlobalEdge}(e, E) = \sum_{e \in E_{path}} \frac{|e \in E| - \min f_E}{\max f_E - \min f_E + \text{[\ldots]}}$$

$$\mathrm{LocalEdge}(e, E_{path}) = \sum_{e \in E_{path}} \frac{|e \in E_{path}| - \min f_{E_{path}}}{\max f_{E_{path}} - \min f_{E_{path}} + \text{[\ldots]}}$$

where $E_{path}$ is the set of edges in the candidate paths and $E$ is the global set of edges. For each path, these values are included with the path length and timeliness to produce a score. Additionally, a truthfulness score of the path is computed. This is the probability that each fact in the path is correct. If the probability of a fact cannot be computed in the knowledge base it is assumed to be true. Each of these values is also normalized, but for simplicity that is not shown. The equation for SCORE is shown below.

$$\mathrm{SCORE}(path) = \sum_{n \in N_{path}} \bigl[ \mathrm{LocalNode}(n, N_{path}) - \mathrm{GlobalNode}(n, N) \bigr] + \sum_{e \in E_{path}} \bigl[ \mathrm{LocalEdge}(e, E_{path}) - \mathrm{GlobalEdge}(e, E) \bigr] + \mathrm{pathLength}(path) + \mathrm{truthfulness}(path) + \mathrm{timeliness}(path)$$

6.4 Joint Inference of Path Probabilities

Discovering the probability of a grounded atom in a database is a well-studied problem. Tools such as Alchemy and GraphLab describe distributed and scalable implementations of inference over similar models [38]. The grounded atoms have weights associated with them, stored in a column of the ProbKB database. To score the likelihood of the path query corresponding to each available fact in the query we compute the joint probability.
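Two of the ingredients described above, the min-max normalized frequency behind the node and edge scores and the truthfulness of a path, can be illustrated with a short Python sketch. It rests on assumptions that are not taken from the text: the smoothing constant `eps` in the denominator is a placeholder, the truthfulness of a path is taken to be the plain product of per-fact probabilities (which treats the facts as independent and is not the Gibbs-sampling inference performed inside ProbKB), and a missing probability is represented by `None` and counted as 1.0, following the rule that a fact without a computable probability is assumed true. The function names and the local counts are hypothetical; the global counts are the term frequencies of Table 6-1.

```python
from collections import Counter


def minmax_frequency(item, counts, eps=1e-9):
    """Min-max normalized frequency of `item` within `counts`.

    `eps` is an assumed smoothing constant that keeps the denominator
    non-zero when every item has the same frequency.
    """
    low, high = min(counts.values()), max(counts.values())
    return (counts[item] - low) / (high - low + eps)


def path_truthfulness(fact_probabilities):
    """Product of per-fact probabilities; None (unknown) counts as 1.0."""
    truth = 1.0
    for prob in fact_probabilities:
        truth *= 1.0 if prob is None else prob
    return truth


# Global term frequencies taken from Table 6-1.
global_counts = Counter(
    {"Biden": 931, "Brutality": 41, "FoxNews": 574, "Marijuana": 752, "Reddit": 95}
)
# Hypothetical counts of the same terms inside one candidate set of paths.
local_counts = Counter({"Marijuana": 3, "FoxNews": 2, "Biden": 1})

for term in local_counts:
    # Local minus global: high when a term is typical of the candidate set
    # but not of the knowledge base as a whole.
    boost = minmax_frequency(term, local_counts) - minmax_frequency(term, global_counts)
    print(term, round(boost, 3))

print(path_truthfulness([0.9, None, 0.5]))
```

The local-minus-global difference mirrors the TF-IDF intuition of the ranking: a term that is frequent everywhere contributes little, while a term concentrated in the candidate paths is boosted.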

PAGE 114

For computational efficiency we assume that each fact in the path is independent. We can then compute the joint probability with the following formula:

$$P(X = x) = \frac{1}{Z} \exp\left( \sum_i W_i \, n_i(x) \right)$$

where $x$ is the set of ground atoms, $n_i(x)$ is the number of true groundings of the rule $F_i$ in $x$, $W_i$ is its weight, and $Z$ is the normalization constant. Computing this value is computationally intractable, so we embed Gibbs sampling techniques in the database to compute the answers within the ProbKB framework.

6.4.1 Fuzzy Querying

In order to create paths between facts we look for techniques to find approximate matches to the set of facts. The possible number of paths is exponential, so much care must be taken when selecting paths. Graph databases can represent facts and materialize connections to make traversals of complex graphs efficient. Because graph databases have materialized and exact graphs they do not provide any special benefits for the traversal of fuzzy graphs. Search engines such as Lucene are optimized for the discovery of approximate results over text documents. However, Lucene is not a data storage system; it would need to be paired with a storage system to perform approximate searches and traversals. PostgreSQL Full-Text Search (FTS) is the full-text search capability inside of the PostgreSQL database management system. This system combines the storage with the fuzzy searching capabilities. Queries are logically performed one at a time, making the physical query plan expensive. In this work, we use database techniques to find matches and the ranking utilities of the PostgreSQL FTS to assist in ranking the results.

6.4.2 PostgreSQL Fact Path Expansion Algorithm

Algorithm 22 describes the path ranking algorithm implemented in PostgreSQL. This algorithm makes a query a step at a time. For each step, the algorithm checks for new nodes by looking for string overlap matches in the subject and object columns. Additionally, each step ensures that no cycle exists by ensuring any new vertex is distinct

PAGE 115

in the current path (i.e., no loops). The results of each previous node are down sampled to fewer than ten percent of the result nodes. For this implementation, the s and o columns are also stored as the tsquery type, that is, a text search type that holds a stemmed version of the original string. This extra column makes it possible to perform fuzzy queries at each step. The ts_rank function computes the compatibility between two values. For more information on the PostgreSQL Full Text Search capability we direct the reader to the PostgreSQL documentation.[3]

Moving forward, to make the implementation a little more tractable for a small number of hops we make three optimizations: there may be cycles in the set, only exact string matches may be connections, and we can sample or sort and rank the candidate paths at each hop. Checking for cycles at each step is expensive but it does keep the size of the intermediate paths low. Performing a fuzzy search in the form of a string token overlap is expensive and also results in noisy output. In practice, instead of performing fuzzy searches we can choose multiple start nodes, one for each canonical representation of the string. Lastly, at the cost of not receiving all possible candidates we can down sample the intermediate nodes. Down sampling at each step reduces redundancy, especially during the later paths. Sorting and ranking the paths at each step is a recursive step in a greedy method for choosing the top-k paths at any point. Unfortunately, this process can also become expensive.

Algorithm 23 describes the PostgreSQL recursive method for discovering paths when facts are described in a triple table. This method first uses the PostgreSQL FTS to find a set of starting nodes. Next, a set of candidate end nodes is also searched for in the triple table. Then, the system recursively looks for paths that are connected to the start nodes by matching either subject s or object o. After a fixed number of iterations, the final

[3] http://www.postgresql.org/docs/current/static/textsearch.html

PAGE 116

Algorithm 22 PostgreSQL code for fact path expansion for two hops and fuzzy joins

    -- triples(docid, s, p, o)
    WITH start_nodes AS (
      -- Facts whose subject or object matches the query term.
      SELECT t.docid, 0 AS level, t.docid::text AS path, t.docid AS endid,
             s, o, qs, qo, ARRAY[docid] AS apath,
             '<' || subject || '|' || predicate || '|' || object || '>' AS statement,
             ts_rank(s, plainto_tsquery('marijuana'))
               + ts_rank(o, plainto_tsquery('marijuana')) AS rank
      FROM triples t
      WHERE t.s @@ plainto_tsquery('marijuana')
         OR t.o @@ plainto_tsquery('marijuana')
    ), onehop AS (
      -- Extend a ten percent sample of the start nodes by one fact.
      SELECT p1.docid AS docid, p1.level + 1 AS level,
             p1.path || ',' || t.docid::text AS path, t.docid AS endid,
             t.s, t.o, t.qs, t.qo,
             array_append(p1.apath, t.docid) AS apath,
             statement || '-->' || '<' || subject || '|' || predicate || '|' || object || '>' AS statement,
             p1.rank + ts_rank(t.s, p1.qo) + ts_rank(t.o, p1.qs)
                     + ts_rank(t.o, p1.qo) + ts_rank(t.s, p1.qs) AS rank
      FROM triples AS t,
           (SELECT docid, level, path, endid, s, o, qs, qo, apath, statement, rank
            FROM start_nodes WHERE random() < 0.1) AS p1
      WHERE p1.docid < t.docid
        AND (p1.s = t.s OR p1.s = t.o OR p1.o = t.s OR p1.o = t.o)
        AND NOT p1.apath @> ARRAY[t.docid]   -- no cycles
    ), twohop AS (
      -- Extend a ten percent sample of the one-hop paths by one more fact.
      SELECT p1.docid AS docid, p1.level + 1 AS level,
             p1.path || ',' || t.docid::text AS path, t.docid AS endid,
             t.s, t.o, t.qs, t.qo,
             array_append(p1.apath, t.docid) AS apath,
             statement || '-->' || '<' || subject || '|' || predicate || '|' || object || '>' AS statement,
             p1.rank + ts_rank(t.s, p1.qo) + ts_rank(t.o, p1.qs)
                     + ts_rank(t.o, p1.qo) + ts_rank(t.s, p1.qs) AS rank
      FROM triples AS t,
           (SELECT docid, level, path, endid, s, o, qs, qo, apath, statement, rank
            FROM onehop WHERE random() < 0.1) AS p1
      WHERE p1.docid < t.docid
        AND (p1.s = t.s OR p1.s = t.o OR p1.o = t.s OR p1.o = t.o)
        AND NOT p1.apath @> ARRAY[t.docid]   -- no cycles
    )
    SELECT * FROM twohop;

PAGE 117

paths are joined with the end nodes and only paths that can join with the end nodes are preserved.

Algorithm 23 PostgreSQL code for recursive fact path expansion with fuzzy search for a 'pot' starting entity and a 'foxnews' end entity

    -- triples(docid, s, p, o)
    WITH RECURSIVE
    start_nodes AS (
      SELECT docid, 0, docid::text
      FROM triples t
      WHERE t.s @@ plainto_tsquery('pot')
         OR t.o @@ plainto_tsquery('pot')
    ),
    end_nodes(term, level, path) AS (
      SELECT docid, 0, docid::text
      FROM triples t
      WHERE t.s @@ plainto_tsquery('foxnews')
         OR t.o @@ plainto_tsquery('foxnews')
    ),
    paths(term, level, path) AS (
      SELECT * FROM start_nodes
      UNION ALL
      SELECT t1.docid, level + 1, p.path::text || ',' || t1.docid::text
      FROM paths AS p, triples AS t1, triples AS t2
      WHERE t1 <> t2
        AND t2.docid = p.term
        AND p.level < 3   -- stop expanding after a fixed number of hops
        AND (t2.qs @@ t1.s OR t2.qs @@ t1.o OR t2.qo @@ t1.s OR t2.qo @@ t1.o)
    )
    SELECT term, level, path
    FROM paths p
    WHERE level < 4
      AND p.term IN (SELECT term FROM end_nodes);
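The recursive pattern of Algorithm 23 can be tried without a PostgreSQL installation. The sketch below uses Python's built-in sqlite3 module, so several things differ from the listing and should be read as assumptions: SQLite has no tsquery type, so the fuzzy `@@` match is replaced by exact string equality; the depth bound is passed as a parameter; and a comma-delimited path string is checked with `instr` to keep paths free of repeated facts. The sample triples are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE triples (docid INTEGER PRIMARY KEY, s TEXT, p TEXT, o TEXT);
    INSERT INTO triples VALUES
      (1, 'pot',      'legalized in', 'colorado'),
      (2, 'colorado', 'covered by',   'foxnews'),
      (3, 'biden',    'visited',      'colorado'),
      (4, 'reddit',   'discusses',    'biden');
    """
)

QUERY = """
WITH RECURSIVE paths(term, level, path) AS (
    -- Start nodes: facts that mention the source entity.
    SELECT docid, 0, ',' || docid || ','
    FROM triples
    WHERE s = :src OR o = :src
  UNION ALL
    -- One more hop: any fact sharing a subject or object with the last fact.
    SELECT t1.docid, p.level + 1, p.path || t1.docid || ','
    FROM paths AS p
    JOIN triples AS t2 ON t2.docid = p.term
    JOIN triples AS t1
      ON (t1.s = t2.s OR t1.s = t2.o OR t1.o = t2.s OR t1.o = t2.o)
    WHERE p.level < :max_hops
      AND instr(p.path, ',' || t1.docid || ',') = 0   -- no repeated facts
)
SELECT path
FROM paths
JOIN triples AS e ON e.docid = paths.term
WHERE e.s = :dst OR e.o = :dst      -- keep paths that end at the target
ORDER BY level, path;
"""

params = {"src": "pot", "dst": "foxnews", "max_hops": 3}
rows = [row[0].strip(",") for row in conn.execute(QUERY, params)]
for path in rows:
    print(path)
```

Bounding the depth inside the recursive member, rather than only in the final SELECT, is what guarantees termination when the fact graph contains cycles.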

PAGE 118

Alternatively, the triple data can be stored in an adjacency list format. That is, all nodes are stored in a 'nodes' table and all edges are stored in an 'edges' table. With a node-list representation it is straightforward to express a query as a self-join of multiple nodes. Algorithm 24 describes a non-recursive version of fact path expansion over the adjacency list representation. The algorithm describes a path involving seven facts with the start and end facts being specified.

Algorithm 24 PostgreSQL code for fact path expansion using database self joins over an edge table and 'pot' as starting entity and a 'FoxNews' target entity

    -- nodes(id, term)
    -- edges(src, dst, edge)

    -- Render the fact stored on the edge (src, dst) as '<subject,edge,object>'.
    CREATE OR REPLACE FUNCTION getFact(src int, dst int) RETURNS TEXT AS $$
      SELECT '<' || nsrc.term || ',' || e.edge || ',' || ndst.term || '>'
      FROM edges e, nodes nsrc, nodes ndst
      WHERE e.src = $1 AND e.dst = $2
        AND nsrc.id = $1 AND ndst.id = $2
    $$ LANGUAGE SQL IMMUTABLE;

    SELECT getFact(g1.src, g1.dst), getFact(g1.dst, g2.src),
           getFact(g2.src, g2.dst), getFact(g2.dst, g3.src),
           getFact(g3.src, g3.dst), getFact(g3.dst, g4.src),
           getFact(g4.src, g4.dst)
    FROM edges g1, edges g2, edges g3, edges g4
    WHERE g1.src = (SELECT id FROM nodes WHERE term = 'pot')
      AND g1.dst = g2.src
      AND g2.dst = g3.src
      AND g3.dst = g4.src
      AND g4.dst = (SELECT id FROM nodes WHERE term = 'FoxNews');
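Because the self-join of Algorithm 24 has one edge alias per hop, a query for any fixed number of hops can be generated mechanically. The following Python sketch builds such a query and runs it with the built-in sqlite3 module. It is an illustration under assumptions: the helper name `khop_sql` and the sample rows are hypothetical, SQLite stands in for PostgreSQL, and the generated query returns raw edge columns instead of calling getFact.

```python
import sqlite3


def khop_sql(hops):
    """Build a self-join over `edges` that follows exactly `hops` edges."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    aliases = ["g%d" % i for i in range(1, hops + 1)]
    select = ", ".join("%s.src, %s.edge, %s.dst" % (a, a, a) for a in aliases)
    tables = ", ".join("edges %s" % a for a in aliases)
    where = [
        "%s.src = (SELECT id FROM nodes WHERE term = :src)" % aliases[0],
        "%s.dst = (SELECT id FROM nodes WHERE term = :dst)" % aliases[-1],
    ]
    # Chain consecutive aliases: the end of one edge is the start of the next.
    where += ["%s.dst = %s.src" % (a, b) for a, b in zip(aliases, aliases[1:])]
    return "SELECT %s FROM %s WHERE %s" % (select, tables, " AND ".join(where))


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, term TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER, edge TEXT);
    INSERT INTO nodes VALUES (1, 'pot'), (2, 'colorado'), (3, 'FoxNews');
    INSERT INTO edges VALUES (1, 2, 'legalized in'), (2, 3, 'covered by');
    """
)

params = {"src": "pot", "dst": "FoxNews"}
two_hop = conn.execute(khop_sql(2), params).fetchall()
print(two_hop)
```

Only the table aliases are interpolated into the SQL text; the entity names are passed as bound parameters, which avoids quoting problems for terms that contain apostrophes.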

PAGE 119

6.4.3 Graph Database Query

With the assumption of the linked graph, a graph database is a new option for searching paths. The Titan Graph Database[4] is a distributed graph database that can support billions of edges and vertices across machines. It is also a transactional database, so information can be stored and queried by many users. Titan is also packaged with a query language called Gremlin that allows path and subgraph queries to be expressed more easily than in SQL.

The Titan graph database physically represents graphs in a method that is similar to the node-list format described above. Nodes are stored in a list and they are sorted by the identifier (id), although another index may be used to obtain another sort order. Each node links to a sorted set of edges and properties. Keeping the edges sorted requires some maintenance but there is a good trade-off with the query performance. Each edge contains the label identifier and a bit for the direction of the label. Following these are the sort key, the adjacent node identifier, the edge identifier, and all other properties.

Algorithm 25 shows a k-hop path query in Gremlin. Each path traversal that needs to be performed is expressed in the Gremlin Java or Groovy language API and an optimizer performs optimizations to improve the traversal. The algorithm first searches for all the nodes in the graph that have a term label equal to the source node. The algorithm then walks the graph to find all the out edges followed by all the incoming vertices of the graph. This process is repeated max_path times. The final candidate paths are filtered to contain only those with a final node term equal to the destination term.

It is also noted that Cypher is an SQL-like declarative language that would allow paths to be expressed more declaratively over a Titan graph. Cypher can be used over arbitrary triples in much the same way SPARQL is used over RDF graphs. The Gremlin language is simpler to set up and sufficiently performs all path queries.

[4] https://github.com/thinkaurelius/titan
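The traversal described above (start at the vertices that match the source term, repeatedly follow out edges, drop paths that revisit a vertex, and keep the paths that end at the destination) can be mimicked over an in-memory adjacency list. The Python sketch below is not a translation of the Gremlin query and does not use any Titan API; it is a plain depth-first enumeration under assumed conventions: the graph is a dictionary from a vertex term to the list of its out-neighbors, paths of up to `max_path` edges are returned, a path stops as soon as it reaches the destination, and the down sampling step is left out. The function name and sample graph are hypothetical.

```python
def khop_simple_paths(out_edges, src, dst, max_path):
    """Enumerate simple paths from src to dst that use at most max_path edges."""
    results = []
    stack = [(src,)]
    while stack:
        path = stack.pop()
        if path[-1] == dst and len(path) > 1:
            results.append(path)       # reached the destination term
            continue
        if len(path) - 1 >= max_path:
            continue                   # hop budget exhausted
        for neighbor in out_edges.get(path[-1], ()):
            if neighbor not in path:   # simple paths only: no repeated vertex
                stack.append(path + (neighbor,))
    return sorted(results, key=len)


graph = {
    "pot": ["colorado", "reddit"],
    "colorado": ["FoxNews", "pot"],
    "reddit": ["colorado"],
}
for path in khop_simple_paths(graph, "pot", "FoxNews", max_path=3):
    print(" -> ".join(path))
```

The number of partial paths on the stack grows with the branching factor at every hop, which is why a hop limit (and, in the graph database query, down sampling) is needed on large knowledge graphs.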

PAGE 120

Algorithm 25 Gremlin code to find all the paths that start at a vertex named src, end at a vertex named dst, and are shorter than max_path

    def khopVertices = g.V('term', src)          // Get the starting nodes
        .outE.inV
        .random(sample)                          // Down sample the current paths
        .loop(max_path){it.loops < max_path}
        .filter{dst == it.term}
        .simplePath                              // Remove cycles

6.4.4 Fact Path Expansion Complexity

Deciding whether a simple path with at least a given number of edges exists between two nodes is an NP-complete problem [24]. The problem can be solved in polynomial time by a non-deterministic Turing machine (equivalently, a candidate path can be verified in polynomial time), and a known NP-complete problem, the Hamiltonian cycle, can be reduced to it. A search for a Hamiltonian cycle asks "Is there a simple cycle in a graph that visits every vertex?" The fact path expansion problem has added complexity in that at each hop a search is performed. All the paths of the graph are not known a priori; this is a weighted simple path problem.

The node list has a slightly higher space complexity when compared to the triple method. The space complexity of the triple method is $O(E)$ where $E$ is the number of edges. The space complexity of the node list method is $O(|V| + E)$, where $|V|$ is the cardinality of the nodes $V$ and $E$ is the number of edges. The time complexity of Fact Path Expansion is the same as that of a depth-first search algorithm, $O(b^k)$, where $b$ is the branching factor of the graph and $k$ is the number of hops. The sort function of the algorithm is $O(p \log p)$ where $p$ is the number of paths returned from the answers.

Prior work by Rubin in 1978 describes the use of Warshall's Theorem to enumerate all the simple paths in a matrix graph [80]. The proposed method has a complexity of $O(N^3)$ where $N$ is the number of vertices. However, such techniques need to assume that the graph has no self loops and no multiple edges; this is an assumption that we are

PAGE 121

not able to make. Additionally, the large size of the knowledge graphs makes the all-pairs approach unsuitable.

Table 6-1. The frequency of each term in our cleaned Reverb data set

    Term        Frequency
    Biden       931
    Brutality   41
    FoxNews     574
    Marijuana   752
    Reddit      95

6.5 Fact Path Expansion Experiments

In this section, we perform experiments to better understand the trends and performance of fact path expansion in both graph and relational databases. We first describe the data set we use and the queries performed over it. We then discuss the timing experiments over both systems.

In Table 6-1 we list the frequencies of the terms used in the experiments. Note that Biden has the most mentions among our set while the term Brutality has the fewest. We selected words that may be of interest to people observing conversations about elections.

Runtime of Methods. Using full-text search techniques to link knowledge base elements can efficiently explore the space of possible links. Exact matching, while significantly quicker than the full-text search approach, lowers the recall of the exploration. The choice between the two methods depends on how quickly a user would like results.

PAGE 122

Figure 6-3. Fact Path Expansion queries over the Titan GraphDB.

Figure 6-3 shows the runtime of the Fact Path Expansion algorithms in the graph database. The Java Virtual Machine (JVM) is set to allocate a maximum of 8 GB to ensure the full graph can fit in memory. PostgreSQL is also given 8 GB of working memory. The performance of the queries in the graph database reflects its time complexity. There is an exponential increase in runtime for each hop because of the large branching factor. Titan is not able to recognize or cache paths, so there is no advantage with repeated or incremental path queries.

Figure 6-4. Fact Path Expansion queries over PostgreSQL

Figure 6-4 shows the runtime of the Fact Path Expansion algorithm in the relational database. The effective cache size of the database queries is 8 GB so the graph is able to fit in memory. With native SQL queries PostgreSQL is able to aggressively perform

PAGE 123

each join and construct the path. When the same query is subsequently run for a longer path there is a slight decrease in runtime because PostgreSQL is able to use the previous query as a partial result. The database results are significantly better with queries of more than two hops. A one-hop query in the graph database is slightly faster than in the relational database.

Figure 6-5. PostgreSQL results of Fact Path Expansion queries with reset database cache

Figure 6-5 shows the same experiment as Figure 6-4 but with the cache reset after each run. The performance decrease at hop 4 is much less. We overlay the aggregate performance of the graph database experiment and the relational database without the cache in Figure 6-6. We see that the mature PostgreSQL database is able to aggressively optimize for longer running queries.

Figure 6-6. Comparison of the experiments with TitanDB and PostgreSQL without cache

PAGE 124

6.6 Fact Path Expansion Summary

In this chapter, we described the theory and implementation of a version of the NP-complete minimum path problem, and its extension over knowledge bases. We showed an implementation inside of the PostgreSQL relational database and also an implementation inside of the Titan graph database. We also described the parameters used in ranking the resulting paths. We showed results of the times and example paths. This work will be continued for further and deeper analysis. Further analysis will include instrumentation of the Titan database cache and indexes. The behavior of the PostgreSQL DBMS is well understood but not in comparison to graph databases. We will also perform the same experiments over other graph and relational databases. A particularly interesting direction is query optimization across recursive calls of different database types.

PAGE 125

CHAPTER 7
CONCLUSIONS

This dissertation focuses on research demonstrating the query-driven text analytics paradigm. It shows general contributions to the database community through three projects. First, it shows how statistical text analytics can be performed in the database, improving analytic workflows. This group of work was one of the first to perform in-database text analytics. Second, this work changes a popular data mining algorithm, entity resolution, to make it aware of the queries, thereby improving computation time. The results showed orders of magnitude improvement over the baseline. Finally, I describe an algorithm for extracting the most representative facts connecting two knowledge base entities. This fact path expansion method ranks knowledge base paths by truthfulness, relevance, timeliness, and representativeness.

The query-driven techniques can be applied to more interactive scenarios. For example, query-driven techniques can be used to develop infrastructure to support multi-user problem solving. These next-generation interactive systems shall allow users to examine the progress of the algorithms at any point in the life of the application; allow users to intervene in order to improve or redirect the algorithm using low latency interactions; and allow multiple users to be added in order to concurrently monitor the progress of algorithms. Each of these requirements advocates for the adaptation of query-driven techniques. This dissertation defines query-driven techniques for text analytics and it lays a foundation for user-focused, data-centric research with critical user requirements.

PAGE 126

REFERENCES

[1] L. Aguiar and J. A. Friedman. Predictive coding: What it is and what you need to know about it. http://newsandinsight.thomsonreuters.com/Legal/Insight/2013.
[2] H. Altwaijry, D. V. Kalashnikov, and S. Mehrotra. Query-driven approach to entity resolution. Proceedings of the VLDB Endowment, 6, 2013.
[3] A. Arasu, R. Christopher, and D. Suciu. Large-scale deduplication with constraints using dedupalog. In International Conference on Data Engineering, pages 952–963. IEEE, 2009.
[4] S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. Perez. The datapath system: A data-centric analytic processing engine for large data warehouses. In Proc. of the 2010 ACM SIGMOD, pages 519–530, NY, USA, 2010. ACM.
[5] R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In ACM SIGMOD Record, volume 29, pages 261–272. ACM, 2000.
[6] C. Badica, A. Badica, and E. Popescu. A new path generalization algorithm for html wrapper induction. In Advances in Web Intelligence and Data Mining, pages 11–20. Springer, 2006.
[7] A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vector space model. In 17th ACL, pages 79–85. ACL, 1998.
[8] K. Barrett, B. Cassels, P. Haahr, D. A. Moon, K. Playford, and P. T. Withington. A monotonic superclass linearization for dylan. In OOPSLA, pages 69–82, 1996.
[9] K. Bellare, C. Curino, A. Machanavajihala, P. Mika, M. Rahurkar, and A. Sane. Woo: A scalable and multi-tenant platform for continuous knowledge base synthesis. Proc. VLDB Endow., 6:1114–1125, Aug. 2013.
[10] I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Trans. KDD, 1, Mar. 2007.
[11] I. Bhattacharya, L. Getoor, and L. Licamele. Query-time entity resolution. In Proc. 12th ACM SIGKDD, KDD '06, pages 529–534, NY, USA, 2006.
[12] S. Bird, E. Loper, and E. Klein. Natural language processing with python. In Natural Language Processing with Python. O'Reilly Media Inc, 2009.
[13] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. Dbpedia - a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the WWW, 7:154–165, September 2009.
[14] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, Mar. 2003.

PAGE 127

[15] P. Bohannon, S. Merugu, C. Yu, V. Agarwal, P. DeRose, A. Iyer, A. Jain, V. Kakade, M. Muralidharan, R. Ramakrishnan, and W. Shen. Purple sox extraction management system. SIGMOD Rec., 37:21–27, Mar. 2009.
[16] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In L. Getoor and T. Scheffer, editors, ICML, pages 321–328. Omnipress, 2011.
[17] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on World Wide Web 7, WWW7, pages 107–117, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B.V.
[18] M. Brocheler, L. Mihalkova, and L. Getoor. Probabilistic similarity logic. In P. Grunwald and P. Spirtes, editors, UAI, pages 73–82. AUAI Press, 2010.
[19] G. Casella and E. I. George. Explaining the gibbs sampler. The American Statistician, 46:167–174, 1992.
[20] A. Chechetka and C. Guestrin. Focused belief propagation for query-specific inference. In AISTATS, May 2010.
[21] Y. Chen and D. Z. Wang. Knowledge expansion over probabilistic knowledge bases. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 649–660, New York, NY, USA, 2014. ACM.
[22] S. Chib and E. Greenberg. Understanding the metropolis-hastings algorithm. The American Statistician, 49:327–335, 1995.
[23] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. Mad skills: new analysis practices for big data. Proc. VLDB Endow., 2:1481–1492, Aug. 2009.
[24] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2009.
[25] M. Cowles and B. Carlin. Markov chain monte carlo convergence diagnostics: a comparative review. Journal of Am Stat, 91:883–904, 1996.
[26] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N. Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Damljanovic, T. Heitz, M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and W. Peters. Text Processing with GATE (Version 6). 2011.
[27] J. Dalton, J. R. Frank, E. Gabrilovich, M. Ringgaard, and A. Subramanya. Fakba1: Freebase annotation of trec kba stream corpus, version 1 (release date 2015-01-26, format version 1, correction level 0), January 2015.

PAGE 128

[28] A. Das Sarma, A. Jain, A. Machanavajjhala, and P. Bohannon. An automatic blocking mechanism for large-scale de-duplication tasks. In Proceedings of the 21st ACM CIKM, pages 1055–1064. ACM, 2012.
[29] X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, SIGMOD '05, pages 85–96, New York, NY, USA, 2005. ACM.
[30] H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting relational tables from lists on the web. Proc. VLDB Endow., 2:1078–1089, 2009.
[31] X. Feng, A. Kumar, B. Recht, and C. Re. Towards a unified architecture for in-rdbms analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 325–336, New York, NY, USA, 2012. ACM.
[32] P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. DMTCS Proceedings, 0, 2008.
[33] T. A. S. Foundation. Apache solr. http://lucene.apache.org/solr.
[34] J. R. Frank, S. J. Bauer, M. Kleiman-Weiner, D. A. Roberts, N. Tripuraneni, C. Zhang, C. Re, E. Voorhees, and I. Soboroff. Evaluating stream filtering for entity profile updates for trec 2013 (kba track overview). Technical report, DTIC Document, 2013.
[35] J. Gantz and D. Reinsel. The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze the Future, 2012.
[36] I. Getoor. A latent dirichlet model for unsupervised entity resolution. In Proceedings of the 6th SIAM International Conference on Data Mining, volume 124, page 47. Society for Industrial Mathematics, 2006.
[37] L. Getoor and A. Machanavajjhala. Entity resolution: Theory, practice & open challenges. In Proceedings of the 38th VLDB, VLDB '12. VLDB Endowment, 2012.
[38] J. E. Gonzalez, Y. Low, C. Guestrin, and D. O'Hallaron. Distributed parallel inference on large factor graphs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 203–212. AUAI Press, 2009.
[39] D. Graff. Ldc2007t07: English gigaword corpus, 2007.
[40] C. Grant, C. P. George, J.-d. Gumbs, J. N. Wilson, and P. J. Dobbins. Morpheus: A deep web question answering system. In Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services, iiWAS '10, pages 841–844, New York, NY, USA, 2010. ACM.

PAGE 129

[41] C. E. Grant, J.-d. Gumbs, K. Li, D. Z. Wang, and G. Chitouras. Madden: query-driven statistical text analytics. In Proceedings of the 21st ACM international conference on Information and knowledge management, CIKM '12, pages 2740–2742, New York, NY, USA, 2012. ACM.
[42] L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, and D. Srivastava. Using q-grams in a dbms for approximate string processing. IEEE Data Engineering Bulletin, 24:28–34, 2001.
[43] B. Green, A. Wolf, C. Chomsky, and K. Laughery. Baseball: an automatic question answerer. In Proc of the Western Joint Computer Conference, volume 19, pages 219–224, San Francisco, CA, USA, 1961. Morgan Kaufmann Publishers Inc.
[44] J. M. Hellerstein, C. Re, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The madlib analytics library: or mad skills, the sql. Proceedings of the VLDB Endowment, 5:1700–1711, Aug. 2012.
[45] J. M. Hellerstein, C. Re, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The madlib analytics library: or mad skills, the sql. Proc. VLDB Endow., 5:1700–1711, Aug. 2012.
[46] A. Jain, P. Ipeirotis, and L. Gravano. Building query optimizers for information extraction: the sqout project. SIGMOD Rec., 37:28–34, March 2009.
[47] A. Jain and P. Pantel. Factrank: Random walks on a web of facts. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 501–509, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[48] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas. Mcdb: A monte carlo approach to managing uncertain data. In Proc. of the 2008 ACM SIGMOD, pages 687–700, NY, USA, 2008. ACM.
[49] D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans. Database Syst., 31:716–767, June 2006.
[50] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thought vectors. arXiv preprint arXiv:1506.06726, 2015.
[51] D. E. Knuth. Ancient babylonian algorithms. Commun. ACM, 15:671–677, July 1972.
[52] S. Kok, P. Singla, M. Richardson, P. Domingos, M. Sumner, H. Poon, and D. Lowd. The alchemy system for statistical relational ai. University of Washington, Seattle, 2005.
[53] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

PAGE 130

[54] N. Lao and W. W. Cohen. Relational retrieval using a combination of path-constrained random walks. Machine learning, 81:53–67, 2010.
[55] N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics, 2011.
[56] G. Lee, J. Lin, C. Liu, A. Lorek, and D. Ryaboy. The unified logging infrastructure for data analytics at twitter. Proc. VLDB Endow., 5:1771–1780, Aug. 2012.
[57] K. Li, C. Grant, D. Z. Wang, S. Khatri, and G. Chitouras. Gptext: Greenplum parallel statistical text analysis framework. In Proceedings of the Second Workshop on Data Analytics in the Cloud, DanaC '13, pages 31–35, New York, NY, USA, 2013. ACM.
[58] K. Li, C. Grant, D. Z. Wang, S. Khatri, and G. Chitouras. Gptext: Greenplum parallel statistical text analysis framework. In Proceedings of the Second Workshop on Data Analytics in the Cloud, pages 31–35. ACM, 2013.
[59] X. Li, P. Morie, and D. Roth. Identification and tracing of ambiguous names: Discriminative and generative approaches. In Proceedings of the National Conference on Artificial Intelligence, pages 419–424. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2004.
[60] Y. Li, F. R. Reiss, and L. Chiticariu. Systemt: a declarative information extraction system. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations, HLT '11, pages 109–114, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[61] P. Liang, M. I. Jordan, and D. Klein. Type-based mcmc. In Human Language Technologies: The 2010 NAACL, HLT '10, pages 573–581, Stroudsburg, PA, USA, 2010. ACL.
[62] J. Madhavan, S. Jeffery, S. Cohen, X. Dong, D. Ko, C. Yu, and A. Halevy. Web-scale data integration: You can only afford to pay as you go. In Proceedings of CIDR, pages 342–350, 2007.
[63] C. D. Manning. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I, CICLing '11, pages 171–189, Berlin, Heidelberg, 2011. Springer-Verlag.
[64] A. McCallum, K. Nigam, and L. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proc. of the 6th SIGKDD, pages 169–178, 2000.

PAGE 131

[65] A. McCallum, K. Schultz, and S. Singh. FACTORIE: Probabilistic programming via imperatively defined factor graphs. In NIPS, pages 1426–1427, 2009.
[66] A. Mccallum and B. Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In NIPS, pages 905–912. MIT Press, 2003.
[67] A. Mccallum and B. Wellner. Conditional Models of Identity Uncertainty with Application to Noun Coreference. In NIPS, 2004.
[68] J. Morales and J. Nocedal. Automatic preconditioning by limited memory quasi-newton updating. SIAM Journal on Optimization, 10:1079–1096, 2000.
[69] H. Nguyen, T. Nguyen, and J. Freire. Learning to extract form labels. Proc. VLDB Endow., 1:684–694, 2008.
[70] F. Niu, C. Re, A. Doan, and J. Shavlik. Tuffy: Scaling up statistical inference in markov logic networks using an rdbms. Proceedings of the VLDB Endowment, 4:373–384, 2011.
[71] B. O'Connor, R. Balasubramanyan, B. Routledge, and N. Smith. From tweets to polls: Linking text sentiment to public opinion time series. In Proc. AAAI Conf. on Weblogs and Social Media, pages 122–129, 2010.
[72] K. Olsen and A. Malizia. Following virtual trails. Potentials, IEEE, 29:24–28, Jan.-Feb. 2010.
[73] M. Pamuk and M. Stonebraker. Transformscout: finding compositions of transformations for software re-use. Master's thesis, MIT, 2007.
[74] H. Pasula, B. Marthi, B. Milch, S. J. Russell, and I. Shpitser. Identity Uncertainty and Citation Matching. In Neural Information Processing Systems, pages 1401–1408, 2002.
[75] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
[76] H. Phan and M. Le Nguyen. Flexcrfs: Flexible conditional random fields, 2004.
[77] H. Poon and P. Domingos. Sound and efficient inference with probabilistic and deterministic dependencies. In AAAI, volume 6, pages 458–463, 2006.
[78] D. Rao, P. McNamee, and M. Dredze. Streaming cross document entity coreference resolution. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 1050–1058. Association for Computational Linguistics, 2010.
[79] M. Richardson and P. Domingos. Markov logic networks. Machine learning, 62-2:107–136, 2006.


[80] F. Rubin. Enumerating all simple paths in a graph. Circuits and Systems, IEEE Transactions on, 25:641–642, 1978.
[81] F. Rusu and A. Dobra. Glade: a scalable framework for efficient analytics. SIGOPS Oper. Syst. Rev., 46:12–18, Feb. 2012.
[82] D. Sculley and C. E. Brodley. Compression and machine learning: A new perspective on feature space vectors. In Data Compression Conference, 2006. DCC 2006. Proceedings, pages 332–341. IEEE, 2006.
[83] W. Shen, X. Li, and A. Doan. Constraint-based entity matching. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 862. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2005.
[84] L. Shu, A. Chen, M. Xiong, and W. Meng. Efficient spectral neighborhood blocking for entity resolution. In 2011 IEEE 27th ICDE, pages 1067–1078, April 2011.
[85] P. Simon. Too Big to Ignore: The Business Case for Big Data. Wiley.com, 2013.
[86] S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 793–803. Association for Computational Linguistics, 2011.
[87] S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Wikilinks: A large-scale cross-document coreference corpus labeled via links to Wikipedia. Technical Report UM-CS-2012-015, 2012.
[88] S. Singh, M. Wick, and A. McCallum. Monte carlo mcmc: efficient inference by approximate sampling. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1104–1113. Association for Computational Linguistics, 2012.
[89] P. Singla and P. Domingos. Entity Resolution with Markov Logic. In IEEE International Conference on Data Mining, pages 572–582. IEEE, 2006.
[90] R. M. Smullyan. First-order logic, volume 21968. Springer, 1968.
[91] F. M. Suchanek. Automated Construction and Growth of a Large Ontology. PhD thesis, Saarland University, 2009.
[92] J. Teevan, E. Adar, R. Jones, and M. A. S. Potts. Information re-retrieval: repeat queries in yahoo's logs. In SIGIR '07: Proc of the 30th annual international ACM SIGIR conference on R and D in information retrieval, pages 151–158, New York, NY, USA, 2007. ACM.
[93] M. D. Vose. A linear algorithm for generating random numbers with a given distribution. IEEE Trans. Softw. Eng., 17:972–975, Sept. 1991.


[94] D. Wang, M. Franklin, M. Garofalakis, J. Hellerstein, and M. Wick. Hybrid in-database inference for declarative information extraction. In Proc. SIGMOD, pages 517–528. ACM, 2011.
[95] D. Z. Wang, E. Michelakis, M. Garofalakis, and J. M. Hellerstein. Bayesstore: managing large, uncertain data repositories with probabilistic graphical models. Proc. VLDB Endow., 1:340–351, August 2008.
[96] S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In 2009 ACM SIGMOD, pages 219–232. ACM, 2009.
[97] M. Wick, A. McCallum, and G. Miklau. Scalable probabilistic databases with factor graphs and mcmc. Proc. VLDB Endow., 3-2:794–804, Sept. 2010.
[98] M. Wick, S. Singh, and A. McCallum. A discriminative hierarchical model for fast coreference at large scale. In Proceedings of the 50th ACL, ACL '12, pages 379–388, 2012.
[99] M. Wick, S. Singh, and A. McCallum. A discriminative hierarchical model for fast coreference at large scale. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 379–388, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[100] M. L. Wick and A. McCallum. Query-aware mcmc. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in NIPS 24, pages 2564–2572, 2011.
[101] M. L. Wick, K. Rohanimanesh, K. Bellare, A. Culotta, and A. McCallum. Samplerank: Training factor graphs with atomic gradients. In ICML, pages 777–784, 2011.
[102] W. Woods. Progress in natural language understanding - an application to lunar geology. In American Federation of Information Processing Societies AFIPS Conference Proc, 42, pages 441–450, 1973.
[103] L. Zhang, R. Ghosh, M. Dekhil, M. Hsu, and B. Liu. Combining lexicon-based and learning-based methods for twitter sentiment analysis. 2011.


BIOGRAPHICAL SKETCH

Christan Grant completed his Bachelor of Science, Master of Science, and Ph.D. in computer science at the University of Florida. His research interests involve novel methods for answering difficult questions. These range from the addition of natural language processing within relational databases to probabilistic knowledge base assisted question answering systems. He has worked on developing a "query-driven" paradigm for text analytics. He is a recipient of the National Science Foundation Graduate Research Fellowship award in the area of "Database Information Retrieval and Web Search". He was also awarded the Florida Georgia LSAMP Bridge to Doctorate Fellowship and a diversity award from the College of Engineering. Christan has served as an external editor reviewer for ACM SIGMOD, ACM VLDB, ACM CIKM, and IEEE ICDE conferences. He is also on the program committee for Broadening Participation in Data Mining. He holds several publications and also a patent from an internship at IBM Almaden Research Center.