
A Realm-based Question Answering System using Probabilistic Modeling

Permanent Link: http://ufdc.ufl.edu/UFE0042585/00001

Material Information

Title: A Realm-based Question Answering System using Probabilistic Modeling
Physical Description: 1 online resource (54 p.)
Language: english
Creator: P George, Clint
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2010

Subjects

Subjects / Keywords: lda, learning, qa, ranking, topics
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, M.S.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Conventional search engines traditionally perform keyword-based searches of web documents. Since the wealth of information on the web is huge and often represented in an unstructured format, the availability of automated search engines that can solve user queries has become essential. This thesis discusses categorizing and representing the information contained within web documents in a more structured and machine-readable form, an ontology, and building a realm-based question answering (QA) system that uses deep web sources by exploiting information from the web along with sample query answering strategies provided by users. The first part of the thesis deals with probabilistic modeling of web documents using a Bayesian framework, Latent Dirichlet Allocation (LDA), and the corresponding document classification framework built on top of it. Subsequently, LDA is employed as a dimensionality reduction tool for the corpus vocabulary and documents. This thesis also describes the inference of document topic proportions for newly encountered documents without retraining the previously learned LDA model. The second part of the thesis describes a realm-based QA system using an ontology that is built from the web. Finally, the techniques to represent user queries that are natural language text in a semi-structured format (SSQ) and to rank similar queries using a novel ontology-based similarity quasi-metric, class divergence, are explained.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Clint P George.
Thesis: Thesis (M.S.)--University of Florida, 2010.
Local: Adviser: Wilson, Joseph N.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2010
System ID: UFE0042585:00001


Full Text

A REALM-BASED QUESTION ANSWERING SYSTEM USING PROBABILISTIC MODELING

By

CLINT PAZHAYIDAM GEORGE

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2010

© 2010 Clint Pazhayidam George

To my parents, George and Gracy Pazhayidam

ACKNOWLEDGMENTS

I would like to express my gratitude to my advisor, Dr. Joseph N. Wilson, for all of the guidance and encouragement he provided me throughout my master's coursework and research. I also thank my thesis committee members, Dr. Paul Gader and Dr. Sanjay Ranka, for all their help and suggestions. In addition, I am grateful to my colleagues Joir-dan Gumbs, Guillermo Cordero, Benjamin Landers, Patrick Meyer, Christopher Shields, and Terence Tai for their assistance in building this system. I am particularly thankful to Peter J. Dobbins and Christan Grant for their insightful comments and ideas. I thank my parents, George and Gracy Pazhayidam, and my family, Christa, Renjith, and Chris, for their everlasting love, care, and encouragement, which motivated me throughout my studies. Finally, I offer my regards to all of my friends who helped me in the completion of this thesis.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 BACKGROUND
   2.1 Question Answering Systems
   2.2 Ontology
       2.2.1 Ontology Building
       2.2.2 The Ontology Representations
   2.3 Morpheus
       2.3.1 The Morpheus Architecture
       2.3.2 Discussion
   2.4 Ontology Generation Systems
       2.4.1 DBpedia Ontology
       2.4.2 WordNet Ontology
       2.4.3 YAGO
   2.5 Topic Models and Document Classification
   2.6 Summary

3 DOCUMENT MODELING
   3.1 Latent Dirichlet Allocation
   3.2 Feature Extraction
       3.2.1 Lemmatization
       3.2.2 Stemming
       3.2.3 Feature Selection or Reduction
       3.2.4 Representing Documents
   3.3 Document Modeling using LDA
       3.3.1 LDA for Topic Extraction
       3.3.2 LDA for Dimensionality Reduction
       3.3.3 Document Classifier from LDA Outputs
       3.3.4 Inferring Document Topic Proportions from the Learned LDA Model
   3.4 The Scope of LDA in Ontology Learning
   3.5 Summary

4 QUERY RANKING
   4.1 Introduction
   4.2 Class Divergence
   4.3 SSQ Matching and Query Ranking
   4.4 Results
   4.5 Summary

5 CONCLUSIONS

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Example semi-structured query
3-1 Estimated topic-words using LDA with three topics
3-2 Estimated topic-words using LDA with two topics
3-3 LDA-based feature reduction with different numbers of topics
3-4 SVM as a regression and classification model
4-1 The Morpheus NLP engine parse
4-2 Top term-classes and probabilities
4-3 Top-ranked queries based on the aggregated class divergence values

LIST OF FIGURES

3-1 The graphical model of Latent Dirichlet Allocation
3-2 LDA topic proportions of the training documents from the whales (first 119) and tires (last 111) domains. The topics from the LDA model, with three topics, are represented by different colored lines.
3-3 The topic proportions of the whales (first 119) and tires (last 111) domains from the LDA model with three topics. The left plot represents documents from both classes in the topic space, red for whales and blue for tires. The right plot represents the document topic proportions in 3-D for the two classes.
3-4 The two-topic LDA's topic proportions for the whales (first 119) and tires (last 111) documents
3-5 Histogram of term entropy values over the LDA matrix, calculated on the whales and tires dataset
3-6 LDA-based feature reduction
3-7 Hellinger (top) and cosine (bottom) distances of the document topic mixtures to the whales' document topic mixture mean
3-8 Classification accuracy of the classifiers built based on Hellinger and cosine distances with varying numbers of topics
3-9 SVM classification model using the LDA document proportions
3-10 SVM classification accuracy of the whales-tires document topic mixtures with varying numbers of topics
3-11 SVC regressed values of the whales-tires document topic mixtures with two topics
3-12 Hellinger (top) and cosine (bottom) distances of the document topic mixtures to the whales' document topic mixture mean
3-13 SVM as a regression and classification model with the LDA model trained on the whole whales and tires documents
3-14 SVM as a regression and classification model with the LDA model trained on 2/3 of the whales and tires documents
4-1 An abbreviated automotive ontology

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

A REALM-BASED QUESTION ANSWERING SYSTEM USING PROBABILISTIC MODELING

By

Clint Pazhayidam George

December 2010

Chair: Joseph N. Wilson
Major: Computer Engineering

Conventional search engines traditionally perform keyword-based searches of web documents. Since the wealth of information on the web is huge and often represented in an unstructured format, the availability of automated search engines that can solve user queries has become essential. This thesis discusses categorizing and representing the information contained within web documents in a more structured and machine-readable form, an ontology, and building a realm-based question answering (QA) system that uses deep web sources by exploiting information from the web along with sample query answering strategies provided by users. The first part of the thesis deals with probabilistic modeling of web documents using a Bayesian framework, Latent Dirichlet Allocation (LDA), and the corresponding document classification framework built on top of it. Subsequently, LDA is employed as a dimensionality reduction tool for the corpus vocabulary and documents. This thesis also describes the inference of document topic proportions for newly encountered documents without retraining the previously learned LDA model. The second part of the thesis describes a realm-based QA system using an ontology that is built from the web. Finally, the techniques to represent user queries that are natural language text in a semi-structured format (SSQ) and to rank similar queries using a novel ontology-based similarity quasi-metric, class divergence, are explained.

CHAPTER 1
INTRODUCTION

Conventional search engines are mainly focused on keyword-based search in the web. So if we use them to answer a question, they usually give us relevant page links based on the keywords in the query. To reach a web page providing a relevant answer to the query, we may have to follow several links or pages. Since the web is a huge database of information, which is often stored in an uncategorized and unstructured format, an automated system that can solve natural language questions has become essential. Current question answering (QA) systems attempt to deal with documents varying from small local collections (closed domain) to the World Wide Web (open domain) [16, 22]. This thesis discusses Morpheus [9], a QA system that is open domain and focused on a realm-based QA strategy. Morpheus provides web pages offering a relevant answer to a user query by re-visiting relevance-ranked prior search pathways. Morpheus has two types of users: one is a pathfinder, who searches for the answers in the web and records the necessary information to revisit the search pathways to the answer. The other type is a path follower, who enters a query to the system and gets an answer. Like many QA systems, Morpheus parses natural language queries and stores them in a semi-structured format, a semi-structured query or SSQ, that contains query terms, assigned term classes, and information about prior search paths. The term classes are drawn from an ontology, and they are used to rank the similarity of prior search queries to a newly encountered query. This thesis focuses on modeling and categorizing web documents in order to help automate the ontology building process from the web. This thesis also describes ranking of the user queries in the SSQ format based on a quasi-metric, class divergence, defined on the ontology heterarchy.

Morpheus's QA follows a hierarchical approach, i.e., it categorizes user queries into a realm and solves the QA problem under that sub-domain or realm. A realm of information in the web is represented by an ontology that contains categories or classes and their relationships. Morpheus's NLP engine tags the query terms with these classes and clusters queries based on the term classes' cumulative divergence calculated from the ontology. In this way, we can rank the similar solved queries stored in the database and reuse them for future searches. Categorizing and extracting information from text is the main step in ontology learning from web documents. This thesis describes a well-known topic modeling method, Latent Dirichlet Allocation [6], and its applications in document modeling, dimensionality reduction, classification, and clustering.

Thesis Organization

The contribution of this thesis is a principled development of a probabilistic framework for information extraction and document classification to help the ontology building process, and a method for ranking prior search queries based on a hierarchy-based similarity measure [9]. This thesis is organized as follows: Chapter 2 discusses the background for this research, such as QA systems in general and the Morpheus system in particular. It also addresses the necessity of an ontology in the Morpheus system and gives an introduction to conventional ontology generating systems. In the end, this chapter shows different document modeling strategies and the motivation behind pursuing LDA for web document modeling in this thesis. Chapter 3 starts by describing the graphical model of LDA. Then, it explains feature extraction steps and topic extraction from web documents using LDA. It also discusses LDA-based dimensionality reduction techniques, document classification, and the inference of document topic mixtures without retraining the learned LDA model. Chapter 4 presents the problem of ranking the prior search queries stored in the SSQ format, based on a novel class divergence measure that is built on the ontological heterarchy. Chapter 5 summarizes the idea of this thesis and discusses the directions of future work.

CHAPTER 2
BACKGROUND

The goal of this chapter is to give an overview of question answering (QA) systems in general and the Morpheus system in particular. It also looks into the need for an ontology and corpora in the Morpheus system and their specifications. Subsequently, it describes some of the existing ontologies and ontology building systems, such as DBpedia, WordNet, and YAGO, and the limitations of their ontologies and methodologies in the scope of Morpheus. This chapter concludes by discussing the applicability of a well-known topic modeling technique, Latent Dirichlet Allocation, in web page categorization and how it aids the ontology and associated corpora building process.

2.1 Question Answering Systems

Early question answering systems are based on the closed corpora and closed domain approach, in which the QA system answers queries based on the questions present in a known domain and corpus. Examples are QA systems such as BASEBALL [10] and Lunar [28]. These systems are specific to a domain, and they hardly scale for the needs of the WWW. On the other hand, the Morpheus system is built based on a more dynamic approach, where we make use of deep web sources and methods from prior successful searches to answer questions [9]. START (http://start.csail.mit.edu) is a QA system that is built based on natural language annotation and parsing of the user queries. START answers queries by matching the annotated questions with candidate answers that are components of text and multi-media. Another approach to question answering is a community-based approach [9], where questions are answered by exploiting the responses to the questions posted by the users in a community. Yahoo! Answers [17] and Aardvark [15] are leading community-based QA systems. Morpheus annotates user query terms with classes from an ontology with the help of the pathfinder. These annotated classes are essential in finding similarities between a newly encountered user query and the solved queries in the data store.

2.2 Ontology

An ontology formally models real world concepts and their relationships. A concept in an ontology gives us an abstract and simplified view of the real world [12] in a machine-readable and language-independent format. In addition, each ontological concept should have a single meaning in an ontological domain. As mentioned in [19], one of the main reasons for building an ontology is to share a common understanding of structural information among people and machines. For example, suppose we have many resources such as web pages, videos, and articles that describe the domain vehicle dynamics. If all of them share an inherent ontology for the terms they use, information extraction and aggregation can be done easily. In addition, an ontology enables the reuse of concepts [19]. For example, if one can build a comprehensive ontology about a specific domain (e.g., automotive), people from a different domain can easily reuse it. In an ontology, a concept is formally described using the class construct. The relationships and attributes of an ontological concept are defined using properties and property restrictions. An ontology definition contains all of these: classes, properties, and restrictions. A class in an ontology can have multiple superclasses and sub-classes. A sub-class of a class represents concepts that are more specific in the ontology domain, while superclasses of a class represent more general concepts in the ontology domain. Finally, a knowledge base represents an ontology together with instances of its classes.

2.2.1 Ontology Building

Determining the scope and domain of an ontology is one of the challenging questions in ontology building. One way to model a domain or realm is to base it on a particular application [19]. In the Morpheus system, we focus on modeling user queries within a particular realm [9]. Thus, we use the query realm as the ontology's domain. For example,

for user queries in an automotive domain, we create an automotive ontology. Moreover, we can extend the scope of an ontology to other domains easily if we follow an iterative ontology building process [19].

2.2.2 The Ontology Representations

As of 2004, the Web Ontology Language, or OWL (http://www.w3schools.com/rdf/rdf_owl.asp), is a standard for formally presenting an ontology. OWL has much stronger machine interpretability than RDF (http://www.w3schools.com/rdf/rdf_reference.asp). In this thesis, we follow the OWL standard to represent an ontology, including classes, individuals, and properties.

2.3 Morpheus

If we search for an answer to a question in a typical search engine such as Google, Bing, or Yahoo, it usually gives us relevant pages based on the keywords in the query. We may need to follow several links or pages to reach a document providing a relevant answer. If we can store such search pathways to an answer for a given user query and reuse them for future searches, it may speed up this process. Morpheus reuses prior web search pathways to yield an answer to a user query [9]. In fact, it provides tools to mark search trails and to follow pathfinders to their destinations. Morpheus's QA strategy is based on exploiting deep (or hidden) web sources to solve questions, since much of the deep web sources' quality information comes from on-line data stores [18]. Web pages are just an interface to this information. Thus, in the Morpheus system, we explore these web pages to learn the characteristics of the data associated with each deep web location.

Morpheus employs two types of users. One type, the pathfinder, enters queries in the Morpheus web interface and searches for an answer to the query using an instrumented web browser [9]. This tool tracks the pathways to the page where the pathfinder finds the answer and other information required to revisit the path. The other type, the path follower, uses the Morpheus system much like a conventional search engine [9]. The path follower types a natural language question in a text box and gets a response from the system. The system uses previously answered queries and their paths to answer new questions. To exploit the prior searches, one should represent queries in an elegant way so that similar queries can be found efficiently and their stored search paths revisited to produce an answer. Morpheus uses a realm-based approach to represent the queries, as described in the following sections.

2.3.1 The Morpheus Architecture

This section presents the core elements of the Morpheus system and their roles in the automated question answering process. This architecture was originally published in [9].

Morpheus Ontology and Corpora: Morpheus requires an ontology that contains classes of a particular realm. For example, the realm ontology Automotive contains categories that are relevant to automobiles. In addition, each leaf class-node in the ontology is associated with a corpus of documents that are tagged to that class. For realm ontology construction, Morpheus uses the Wikipedia categories (as ontological classes) and the DBpedia ontology (as a class taxonomy). Morpheus creates a realm ontology as follows: For a given realm, we first find a mapping category from the DBpedia ontology. Then we grab all the neighboring categories using a concept called a Markov blanket [20] in Bayesian networks. In fact, we grab the category's parents, its children, and its children's other parents using the DBpedia ontology properties broader and narrower. Once the blanket is ready, it grabs all parent categories of the categories in the blanket till it reaches the root. Finally, we grab Wikipedia documents that are tagged with the DBpedia categories and process the document terms (word, n-gram, and term are used interchangeably in this thesis) to construct a corpus for each of the ontological leaf nodes [1, 9].
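The neighborhood-grabbing step described above can be sketched in a few lines. The category graph below is a hypothetical stand-in for DBpedia's broader/narrower relations, not real DBpedia data; the function names are illustrative only.

```python
# Toy sketch of the Markov-blanket-style category grab described above.
# PARENTS encodes a made-up "broader" relation: category -> broader categories.

PARENTS = {
    "Sedans": {"Automobiles"},
    "Tires": {"Vehicle parts"},
    "Vehicle parts": {"Automobiles", "Technology"},
    "Automobiles": {"Transport"},
    "Technology": {"Root"},
    "Transport": {"Root"},
    "Root": set(),
}

def children(cat):
    """Categories whose broader set contains cat (the 'narrower' direction)."""
    return {c for c, ps in PARENTS.items() if cat in ps}

def markov_blanket(cat):
    """Parents, children, and the children's other parents of a category."""
    blanket = set(PARENTS[cat]) | children(cat)
    for child in children(cat):
        blanket |= PARENTS[child] - {cat}
    return blanket

def ancestors(cats):
    """Climb the broader relation from every category until the root."""
    seen, stack = set(), list(cats)
    while stack:
        cat = stack.pop()
        for p in PARENTS.get(cat, set()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

blanket = markov_blanket("Automobiles")
realm_categories = blanket | ancestors(blanket)
```

In a real pipeline the dictionary lookups would be replaced by SPARQL queries against the DBpedia skos:broader triples; the traversal logic stays the same.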

Associating a path follower's query terms with classes: From a class document corpus, we calculate the probability that a term belongs to a particular class. In fact, we determine the likelihood of a class given a term using Bayes' rule (Eq. 2-1). Here, the term-class and term-corpus probabilities are determined as relative frequencies [9]. The higher the probability, the more likely a term is referring to a class.

    P(class | term) = P(term | class) P(class) / P(term)    (2-1)
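Eq. 2-1 with relative-frequency estimates can be sketched as follows. The tiny per-class corpora and the prior choice (class size over total corpus size) are made-up placeholders, not the Morpheus Wikipedia corpora.

```python
# Hedged sketch of Eq. 2-1: score candidate classes for a query term
# using relative frequencies from per-class corpora.
from collections import Counter

class_corpora = {
    "vehicle part": ["tire", "tire", "engine", "brake"],
    "manufacturer": ["toyota", "honda", "tire"],
}

def class_posteriors(term):
    """P(class | term) ∝ P(term | class) * P(class), normalized over classes."""
    total_terms = sum(len(c) for c in class_corpora.values())
    scores = {}
    for cls, corpus in class_corpora.items():
        p_term_given_class = Counter(corpus)[term] / len(corpus)
        p_class = len(corpus) / total_terms   # class prior from corpus size
        scores[cls] = p_term_given_class * p_class
    z = sum(scores.values())                  # plays the role of P(term)
    return {cls: s / z for cls, s in scores.items()} if z else scores

posteriors = class_posteriors("tire")
```

Here "tire" scores higher for the vehicle part class because it occurs there with a higher relative frequency, which is exactly the ranking behavior the text describes.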

Semi-Structured Query: Morpheus represents user-query details in a semi-structured format called a Semi-Structured Query, or SSQ. It includes a user query in natural language format, relevant terms for the query, and placeholders for the term classes. A term class represents a topic or class to which a term belongs. For example, the term tire can be classified into the vehicle part class. Morpheus assumes the term classes are from a consistent realm-based ontology, that is, one having a singly rooted heterarchy whose subclass/superclass relations have meaningful semantic interpretations.

Query Resolution Recorder: The Query Resolution Recorder, or QRR, is an instrumented web browser used to track the search pathways to an answer for a pathfinder's query. It also helps a pathfinder identify ontological classes associated with search terms [8, 9]. Morpheus captures all this information in the format of an SSQ. For example, suppose a pathfinder tries to answer the query "What is the tire size for a 1997 Toyota Camry V6?" using Morpheus. With the help of the QRR, a pathfinder may choose relevant query terms and assign categories to them (see Table 2-1). Moreover, using the QRR, a pathfinder is able to log the search pathways to an answer (e.g., P215/60R16) to this query and select a query realm (e.g., automotive).

Table 2-1. Example semi-structured query

    Terms:   1997   Toyota         Camry   V6       tire size
    Input:   year   manufacturer   model   engine
    Output:                                         tire size

Query Resolution Method: The Morpheus QA process is formalized by the Query Resolution Method, or QRM, data structure [9]. A QRM is usually constructed by a pathfinder with the help of the QRR. Each QRM contains an ontological realm, an SSQ, and information to support the query answering process. QRMs are associated with an ontology that has a particular realm, i.e., an ontological realm. The associated SSQ contains a user query, selected terms and class associations, and output categories for a given query. In addition, the QRM contains the information required to visit each page needed to answer the query, as well as the order of visitation. For each dynamic page, it stores the list of inputs to the URL, such as form inputs, and the possible reference outputs (highlighted portions of the web page).

NLP Engine: The path follower's query is processed in a different manner. The NLP engine parses and annotates queries in order to record important terms. Based on the query terms, Morpheus assigns the most probable query realm from realm-specific corpora. Once the realm is assigned, we use the corresponding realm ontology and associated corpora to assign classes to the terms as mentioned above [9]. Subsequently, the system creates an SSQ, and this new SSQ is used to match existing QRM SSQs in the store.

QRM Ranking Mechanism: To answer a path follower's query that is in SSQ format (a candidate SSQ), Morpheus finds similar SSQs that belong to QRMs in the store (qualified SSQs). The similarity between a candidate SSQ and a qualified SSQ is calculated by aggregating the similarity measures of their assigned classes [9].
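The SSQ record from Table 2-1 can be sketched as a small data structure. The field names below are hypothetical; the thesis does not specify the storage schema at this level.

```python
# Minimal sketch of an SSQ record, mirroring Table 2-1.
from dataclasses import dataclass, field

@dataclass
class SSQ:
    question: str                     # natural language user query
    terms: list                       # relevant query terms, in order
    input_classes: dict = field(default_factory=dict)   # term -> ontology class
    output_classes: list = field(default_factory=list)  # expected answer classes

ssq = SSQ(
    question="What is the tire size for a 1997 Toyota Camry V6?",
    terms=["1997", "Toyota", "Camry", "V6", "tire size"],
    input_classes={"1997": "year", "Toyota": "manufacturer",
                   "Camry": "model", "V6": "engine"},
    output_classes=["tire size"],
)
```

A QRM would then wrap such an SSQ together with the realm identifier and the recorded page-visit sequence.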

Class similarity is based on the heterarchical structure of classes in an ontology. Chapter 4 describes the Morpheus class similarity measure in detail.

Query Resolution Execution: Usually, answering a question using the deep web requires one to navigate through a sequence of one or more web pages. Many of the accesses involve clicks through web forms or links to resulting pages. Each QRM contains the necessary information required to re-run this procedure. Once we have found relevant QRMs, through QRM ranking for a given path follower query, Morpheus gets answers by re-running the stored pathways in the QRMs. The Morpheus Query Executer performs this job by simulating the human button clicks to follow links, submit forms, and highlight HTML page answers. It also has the routines necessary to generalize the QRM query paths to tackle problems associated with the dynamic generation of deep web pages [9].

2.3.2 Discussion

The backbone of the Morpheus system is a realm ontology and the corpora associated with the ontology's categories. Since the Morpheus ontology structure is tightly coupled to the DBpedia ontology, let us consider the DBpedia ontology. DBpedia is created from an automated process that grabs info-box information and the class heterarchy from Wikipedia pages. Thus, the correctness of the DBpedia ontology greatly depends on the Wikipedia class structure. As mentioned in YAGO [26], to use an ontology for type information, its classes must be arranged in a taxonomy. Even though DBpedia categories are arranged in a heterarchy, it cannot be used as a robust ontology. For example, "Effects of the automobile on societies" is in the class named automobiles, but "Effects of the automobile on societies" represents a result of automobiles and not information about automobiles. Another drawback of blindly using the Wikipedia categories and their associated Wikipedia pages for corpus construction is that categories that are assigned to Wikipedia

pages do not follow any standard. Similarly, the class assignment for a Wikipedia page is not mandatory, so many pages do not have any assigned class. As discussed above, the DBpedia ontology and Wikipedia page categorizations are not adequate to Morpheus's needs, even though these too contain huge amounts of information. According to alexa.com, Wikipedia is the 9th most visited website as of May 2010, and it covers almost all fields. So, if we can make a realm-based ontology from Wikipedia, this may help in the process of question answering.

2.4 Ontology Generation Systems

This section talks about existing ontology generators and their applicability in building a Morpheus realm ontology from Wikipedia pages.

2.4.1 DBpedia Ontology

The DBpedia project (http://dbpedia.org) is based on extracting semantic information from Wikipedia and making it available on the Web [1]. Wikipedia semantics include info-box templates, categorization information, images, geo-coordinates, links to external Web pages, disambiguation pages, and redirects between pages in Wiki markup form [1, 5, 13]. The DBpedia maintainers process this structured information and store it in RDF triple format. The extraction procedure is as follows: Wikipedia categories are represented using skos:concept, and category relations are represented using skos:broader [1, 5]. In fact, DBpedia does not define any new relation between the Wikipedia categories. The Wikipedia categories are extracted directly from Wikipedia pages, and there is no quality check on the resultant ontology. Similarly, DBpedia uses the info-box attributes and values as ontological relation domains and their ranges. Since the guidelines for info-box templates are not strictly followed by the Wikipedia editors, info-box attributes are in a wide range of different formats and units of measurement [5].

DBpedia has taken care of these issues using two approaches. (1) A generic approach for mass extraction: all the info-boxes are processed into triples, with the subject being the page URI, the predicate the attribute name, and the object the attribute's value. Even though several heuristics can be applied, the information quality is not guaranteed [1, 5]. (2) A mapping-based approach for maintaining information quality: by manually mapping info-box templates into an ontology, this method solves the issues of synonymous attribute names and templates.

2.4.2 WordNet Ontology

WordNet (http://wordnet.princeton.edu) is a huge manually built lexicon of English. WordNet contains nouns, verbs, adjectives, and adverbs that are grouped into synonym sets, or synsets. Semantic and lexical relations such as hypernym and hyponym connect the WordNet synsets. These relations form a heterarchy of synsets and are very useful in the ontology building process. One of the big advantages of using the man-made WordNet heterarchy for ontology building is that it has a consistent taxonomy of concepts, which is a requirement in the ontology definition [Section 2.2]. However, the cost of keeping it up to date is very high. In addition, if you consider the entire WWW, WordNet represents a small amount of data. To overcome these issues, this thesis proposes a set of tools that assist the automatic ontology building process, and suggests their extensions to include other domains or realms.

2.4.3 YAGO

YAGO is a partly automatically constructed ontology from Wikipedia and WordNet [26]. The YAGO ontology uses an automated process to extract information from Wikipedia pages, info-boxes, and categories and to combine this information with the WordNet synset heterarchy. Since Wikipedia contains more individuals, in the form of Wikipedia pages, than the man-made WordNet ontology, the Wikipedia page titles are used as the YAGO ontology individuals.
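The hypernym/hyponym heterarchy described above can be walked programmatically, which is also the kind of traversal that ontology-based similarity measures rely on. A toy sketch follows; the synset graph is invented for illustration, and real WordNet access would go through a lexicon API rather than a hand-written dictionary.

```python
# Toy sketch of walking a WordNet-style hypernym heterarchy.
HYPERNYMS = {
    "sedan": ["car"],
    "car": ["motor vehicle"],
    "truck": ["motor vehicle"],
    "motor vehicle": ["vehicle"],
    "vehicle": ["artifact"],
    "artifact": [],
}

def ancestor_depths(synset):
    """Map each hypernym ancestor (and the synset itself) to its hop distance."""
    depths, frontier = {synset: 0}, [synset]
    while frontier:
        nxt = []
        for s in frontier:
            for h in HYPERNYMS[s]:
                if h not in depths:
                    depths[h] = depths[s] + 1
                    nxt.append(h)
        frontier = nxt
    return depths

def closest_common_hypernym(a, b):
    """Shared ancestor minimizing the total hop distance from both synsets."""
    da, db = ancestor_depths(a), ancestor_depths(b)
    common = set(da) & set(db)
    return min(common, key=lambda s: da[s] + db[s])
```

For the toy graph above, the closest common hypernym of "sedan" and "truck" is "motor vehicle", which is the sort of shared-ancestor information a heterarchy-based similarity measure aggregates.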

YAGO concentrates mainly on the fields of people, cities, events, and movies [26]. In addition, YAGO uses Wikipedia categories as its ontology classes. The Wikipedia categories are classified into four types, namely conceptual, relational, administrative, and thematic, by parsing the Wikipedia category names. Only the conceptual and relational categories are used for the ontology building process [26]. The ontology's heterarchy is built using the hypernym and hyponym relations of the WordNet synsets. YAGO uses only the nouns from WordNet and ignores the WordNet verbs and adjectives. The connection between a WordNet synset and a Wikipedia category is achieved by parsing the category names and matching the parsed category components with the WordNet synsets [26]. Those Wikipedia categories having no WordNet match are ignored in the YAGO ontology. Additional ontology relationships and individuals are generated by parsing Wikipedia page info-box attributes and values. This information extraction is done with the help of attribute maps of info-box templates and regular expressions [26]. Commonly used info-box templates are also used as classes in the ontology (e.g., outdoors). One drawback of this approach is that info-box assignment is not a mandatory task in creating a new Wikipedia page. In addition, many Wikipedia pages do not have any categories assigned to them. Therefore, using info-box-based information extraction is not feasible for these Wikipedia pages. In addition, YAGO type extraction is mainly based on the assumption that each Wikipedia page has at least one tagged category, and that the assigned Wikipedia categories are relevant to that Wikipedia page. As explained in Section 2.4.1, the Wikipedia page category assignments are subjectively assigned by human editors. Some Wiki pages could be left unassigned, i.e., they have no category, unintentionally. In addition, we cannot completely rely on the relevance of these assigned Wikipedia page categories. This thesis investigates methods for automatic web page categorization and recommendation, using a machine learning technique, topic modeling. The rest of this chapter describes existing document modeling and classification methods.


a machine learning technique, topic modeling. The rest of this chapter describes existing document modeling and classification methods.

2.5 Topic Models and Document Classification

There have been many studies on modeling text documents [25] in the Information Retrieval field. Probabilistic Latent Semantic Indexing (pLSI), a model introduced by Hofmann [14], uses a probabilistic generative process to model the documents in a text corpus. In this model, we assume that every word in a corpus document is sampled from a mixture model. The components of the mixture model are multinomial random variables that represent topics in the corpus [14]. Thus, a document in the corpus can have words from multiple topics, and the mixture components' mixing proportions represent this [21]; pLSI, however, draws each word from a single topic. Even though it has many advantages, the number of model parameters increases linearly with the number of documents in the corpus [6]. In addition, it has inappropriate generative semantics, and the model does not have a natural way to assign probabilities to unseen data [27].

Latent Dirichlet Allocation (LDA) is a probabilistic graphical model for representing latent topics, or hidden variables, of the documents in a text corpus [6]. Due to its fully generative semantics, even at the level of documents, LDA can address the drawbacks of the pLSI model. In this hierarchical Bayesian model, LDA represents a document in the corpus by distributions over topics. For example, for Wikipedia pages, the latent topics reflect the thematic structure of the pages. In addition, a topic itself is a distribution over all terms in the corpus. Thus, LDA can find the relevant topic proportions in a document using posterior inference [6]. Additionally, given a text document, we can use LDA to tag related documents by matching their similarity over the estimated topic proportions. This thesis mainly exploits the classification capability of the LDA model [27].


2.6 Summary

This chapter gave an introduction to question answering systems and a brief overview of the Morpheus system architecture. It also discussed the use of an ontology and corresponding corpora in the Morpheus QA process, and conventional ontology generation methods. Extracting topics from text and categorizing documents are essential in building an ontology and the associated class corpora. This chapter also gave an introduction to document modeling systems and explained how LDA provides a better model than other strategies. The next chapter discusses the applications of LDA for efficient data analysis of large collections of web documents, and how this helps us build an ontology. This is followed by a query ranking algorithm that is built on top of the ontological heterarchy.


CHAPTER 3
DOCUMENT MODELING

The realm ontology is one of the pillars of Morpheus's query matching algorithm. This chapter describes the tools that help to build a realm ontology in a semi-automated fashion. First, I discuss Latent Dirichlet Allocation (LDA), a probabilistic topic model used in this thesis for modeling and classifying Wikipedia pages. Second, I present algorithms for classifying the Wikipedia pages, and show how we can use the categorized web pages for a realm ontology and corpora. In addition, the results section explains the observations from the LDA-based document classifier and their interpretations. This chapter concludes by describing the global impact of these approaches on WWW page categorization and ontology generators.

3.1 Latent Dirichlet Allocation

As described in section 2.5, LDA has some promising advantages over the probabilistic LSI (pLSI) model in document modeling due to its fully generative semantics [6, 27]. LDA assumes that the mixture proportions of topics, for documents in a collection, are drawn from a Dirichlet prior distribution. In the pLSI model, however, the mixture proportions of topics are conditioned on the training documents; therefore, the number of model parameters grows linearly with the number of documents in a training set. LDA's generative process employs the following concepts (notation and terminology adopted from [6, 27]):

- w represents a word or a term in a corpus vocabulary, and V represents the size of the corpus vocabulary.
- w = (w_1, w_2, ..., w_N) is a document in a text corpus, where w_i is the i-th word in the document, and N_d stands for the total number of terms in the document indexed by d. LDA assumes that a document is a bag of words.
- C = {w_1, w_2, ..., w_D} is a text corpus, where D is the number of documents in the corpus.

The generative process of LDA is the following [6, 27]:


- Choose each topic β_k from a Dirichlet distribution Dir_V(η) over the vocabulary of terms; in other words, β_k represents a topic mixture over the entire corpus of terms. Here, we assume Dirichlet smoothing [6] on the multinomial parameter β_k.
- For each document d, choose θ_d, a vector of topic proportions, from a Dirichlet distribution Dir(α) with hyperparameter α.
- For each of the N_d words w_n in a document w_d:
  - Choose a topic z_{d,n} from a multinomial distribution Mult(θ_d) with parameter θ_d.
  - Then, choose w_n from a multinomial distribution p(w_n | z_{d,n}, β) conditioned on z_{d,n} and β.

Thus, the joint probability distribution of a topic mixture θ_d, a set of N topics z, and a set of N words w_n in a document w_d is

p(θ_d, z, w) = p(θ_d | α) p(β | η) ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_n | z_{d,n}, β)    (3-1)

where p(θ_d | α) and p(β | η) are Dirichlet distributions. Then, the marginal distribution of a document d is obtained by summing over z and θ_d as

p(d | α, η) = ∫ p(θ_d | α) p(β | η) ∏_{n=1}^{N} ∑_{z_{d,n}} p(z_{d,n} | θ_d) p(w_n | z_{d,n}, β) dθ_d    (3-2)

Finally, assuming exchangeability over the text documents in a corpus, the probability of the corpus is given by multiplying the probabilities of the single documents [6].

Figure 3-1 shows the LDA graphical model. The nodes represent random variables, shaded nodes stand for observed random variables, and rectangles denote replicated structures. From this LDA model, we can infer the hidden per-word topic assignments z_n, the unknown per-document topic proportions θ_d, and the per-corpus topic distributions β_k. However, the posterior probability formed by equation 3-2 is intractable to compute [6]. Therefore, we usually pursue different approximation techniques, such as variational inference [6] and Markov chain Monte Carlo [11], to estimate the hidden parameters. Section 3.3 describes the Wikipedia page modeling using LDA in more detail.
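The generative process above can be sketched in a few lines. The following toy sampler is only an illustration; the values chosen for K, V, D, N, α, and η are arbitrary and not from the thesis's experiments:

```python
import numpy as np

def generate_corpus(K, V, D, N, alpha, eta, rng):
    """Sample a toy corpus from the LDA generative process."""
    # Per-topic term distributions: beta_k ~ Dirichlet_V(eta)
    beta = rng.dirichlet([eta] * V, size=K)
    docs, thetas = [], []
    for _ in range(D):
        theta = rng.dirichlet([alpha] * K)          # theta_d ~ Dirichlet(alpha)
        words = []
        for _ in range(N):
            z = rng.choice(K, p=theta)              # z_{d,n} ~ Mult(theta_d)
            words.append(rng.choice(V, p=beta[z]))  # w_n ~ Mult(beta_z)
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, beta

rng = np.random.default_rng(0)
docs, thetas, beta = generate_corpus(K=3, V=50, D=5, N=20, alpha=0.1, eta=0.1, rng=rng)
```

With small α and η, each sampled document concentrates on a few topics, which is the behavior the later sections rely on when topic proportions are used as low-dimensional document features.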


Figure 3-1. The graphical model of Latent Dirichlet Allocation

3.2 Feature Extraction

For building a machine learning model for text categorization, we need a training set and a test set. This thesis uses pre-processed text from Wikipedia pages to build the training and test sets. The Wikipedia pages for building the data set are identified with the help of the Wikipedia category heterarchy and the associated web pages. In the pre-processing step, the Wikipedia page text is tokenized into terms or word tokens using the Natural Language Toolkit (NLTK, http://www.nltk.org) and regular expressions. These terms are then processed further to standardize them and remove redundancy. The following sections describe these techniques in detail.

3.2.1 Lemmatization

Lemmatization is a standardization technique that changes different inflected forms of a term into a common form, the lemma, so that the lemma can represent a set of terms (e.g., appeared → appear, women → woman). In text processing, lemmatization helps us reduce the dimensionality of words in a corpus vocabulary. This thesis uses the WordNet-based lemmatizer from NLTK.


3.2.2 Stemming

Stemming is a process to get stems from terms by removing unnecessary grammatical markings (e.g., walking → walk). Many stemming algorithms exist, such as Porter and Lancaster. This thesis uses the Porter stemmer for the words that cannot be lemmatized using the WordNet lemmatizer [3].

3.2.3 Feature Selection or Reduction

Feature selection is a process that removes uninformative terms, or noise, from the extracted text. By reducing the dimensionality of the feature vectors that represent the text documents and the corpus vocabulary, feature selection speeds up the language modeling process. This thesis uses a filter-based approach, removing the English stop-words from the extracted tokens. Stop-words are terms that are common to any document or that are specific to a language or its grammar. Another approach to feature selection is based on the output of a classifier or clustering algorithm. The end of this chapter describes a feature selection method based on the LDA model output.

3.2.4 Representing Documents

This thesis uses a sparse vector space model to represent the text documents in a corpus. The features are the frequencies of terms or words in a document. Each document in the corpus is represented by a sparse vector (following the LDA-C format, http://www.cs.princeton.edu/blei/lda-c) as:

W_d = {id_i : count}, id_i ∈ V    (3-3)

where V represents the corpus vocabulary, id_i is a unique vocabulary id, and count represents the number of occurrences of the term in the document W_d.
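A minimal sketch of this pre-processing and sparse representation, using a plain regular-expression tokenizer and a tiny illustrative stop-word list in place of the NLTK lemmatizer and Porter stemmer the thesis uses (the document strings and the stop-word set here are hypothetical):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "and", "in", "to"}  # toy list

def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(docs):
    """Map each non-stop-word term to a unique vocabulary id."""
    terms = sorted({t for d in docs for t in tokenize(d) if t not in STOPWORDS})
    return {term: i for i, term in enumerate(terms)}

def sparse_vector(text, vocab):
    """Sparse {term_id: count} representation of a document, as in Eq. 3-3."""
    counts = Counter(t for t in tokenize(text) if t in vocab)
    return {vocab[t]: c for t, c in counts.items()}

docs = ["The whale is a marine mammal", "A tire is a ring of rubber"]
vocab = build_vocab(docs)
vec = sparse_vector(docs[0], vocab)
```

In a full pipeline, lemmatization and stemming would be applied between tokenization and vocabulary construction, further collapsing inflected forms into shared ids.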


3.3 Document Modeling using LDA

This section describes LDA as a tool for topic extraction and feature selection from text. First, we go through topic extraction from web documents and the limitations of using the extracted LDA topics for the ontology learning process. Second, LDA is used as a tool for dimensionality reduction [6] and feature selection [30]. Lastly, we go through a classifier for the unseen documents in a corpus, built from the LDA output.

The parameter estimation of LDA is performed using Gibbs sampling [11] on the preprocessed documents from Wikipedia. This thesis uses the R package lda (http://cran.r-project.org/web/packages/lda) for the parameter estimation process. The end results of this estimation are a document-topic matrix θ and a word-topic matrix β. θ is a K × D matrix where each entry represents the frequency of words in each document that were assigned to each topic; the variable K, the total number of topics in the corpus, is an input to the model. β is a K × V matrix where each entry is the frequency with which a word was assigned to a topic. The following sections explain the experiments conducted using the LDA model.

3.3.1 LDA for Topic Extraction

For this experimental study, the Wikipedia pages from the Whales and Tires domains are identified using the DBpedia SPARQL interface (http://dbpedia.org/sparql). For the Whales domain, the Wikipedia categories Whale Products, Whaling, Baleen Whale, Toothed Whale, and Killer Whale (119 web pages) are selected, and for the Tires domain the Wikipedia category Tires and its sub-categories (111 web pages) are selected. After pre-processing [section 3.2], the document terms are used for the LDA model training. Table 3-1 shows the three topics that are extracted from the whales-tires documents. The topics' words are listed in non-increasing order of the probabilities p(topic | term). Table 3-1 shows that there


are two topics (whales and tires) that are dominant in the corpus documents. Figure 3-2 represents the corresponding normalized topic proportions of the training documents.

Figure 3-2. LDA topic proportions of the training documents from the whales (first 119) and tires (last 111) domains. The topics from the LDA model, with three topics, are represented by different colored lines.

Table 3-1. Estimated topic-words using LDA with three topics

Number   Topic 1      Topic 2    Topic 3
1        whale        tire       company
2        dolphin      wheel      tyre
3        species      vehicle    new
4        sea          tread      tire
5        ship         use        bridgeston
6        killer       pressure   plant
7        iwc          wear       rubber
8        orca         system     goodyear
9        population   valve      firestone
10       animal       car        dunlop


Figure 3-3. The topic proportions of the whales (first 119) and tires (last 111) domains from the LDA model with three topics. The left plot represents documents from both classes in the topic space, red for whales and blue for tires. The right plot represents the document topic proportions in 3-D for the two classes.

Additional experimental results can be seen in figure 3-3, figure 3-4, and table 3-2. From these experiments we can infer that, if the documents are from completely different domains (e.g., whales and tires), LDA finds topics and a document-topic mixture that can be used in classifying the corpus documents. However, if the documents are from similar domains (e.g., the subdomains of whales), LDA fails to capture distinguishable topics (figure 3-2). In addition, LDA defines topics over all the terms in a vocabulary; therefore, the topics extracted from the LDA estimation process are sets of words, not names of categories similar to the words used to tag Wikipedia pages. The common reason for the above two issues is the way LDA is designed. It is a probabilistic model based on the frequency of terms in the corpus documents, not on the terms' meanings. Even if documents are tagged to different domains, when the documents share the same terms, LDA will capture the similarity of their topics. Table 3-2 also shows the importance


of lemmatizing and standardizing the terms in the LDA estimation: we see that tire and tyre both appear in topic set 2, but they have the same meaning.

Table 3-2. Estimated topic-words using LDA with two topics

Number   Topic 1      Topic 2
1        whale        tire
2        dolphin      tyre
3        species      wheel
4        sea          rubber
5        ship         vehicle
6        killer       tread
7        iwc          car
8        orca         pressure
9        population   wear
10       animal       system

Figure 3-4. The two-topic LDA's topic proportions for the whales (first 119) and tires (last 111) documents

3.3.2 LDA for Dimensionality Reduction

LDA can be used for dimensionality reduction of the corpus documents in two ways. One method is to represent documents by the document-topic mixture vector θ_j


from the LDA estimation instead of the document word-frequency vector [6]. Thus, we can reduce the dimensionality of a document from the vocabulary size to the number of topics in the corpus. If LDA can find the actual number of topics in a corpus, the documents will be linearly separable using the document-topic mixture vectors. Actual here denotes the topics (e.g., whales and tires) that are used to manually tag the documents in the corpus. For example, figure 3-4 shows that an individual topic mixture proportion is enough for distinguishing the documents from the whales and tires domains. In fact, this document-topic matrix can be used for clustering or classifying the corpus documents [6].

Another approach to dimensionality reduction of the document terms is based on the term-entropy values calculated on the term-topic matrix β [30]. Each element of β represents the conditional probability of a term v given a topic k, p(term_v | topic_k). Our aim is to find the best discriminative terms in the corpus. Shannon entropy is a measure of the uncertainty of a system, so it fits well for finding terms that are useful for classification or regression. Thus, the entropy for a term term_v is calculated by [30]:

H(term_v) = −∑_{k=1}^{K} p(term_v | topic_k) ln p(term_v | topic_k)    (3-4)

Figure 3-5 shows the term-entropy values calculated on the β matrix for the whales and tires documents. If a term's entropy value is high, that term is common among the corpus topics. So we can rank the terms in non-decreasing order of entropy and discard corpus terms by applying a threshold. The remaining terms and their frequencies are then the new features for a document. Figure 3-6 shows the entire entropy-based feature reduction process [30]. Experiments conducted on the documents from the whales and tires domains show that this feature selection process depends highly on the number of topics assigned to the LDA estimator. Table 3-3 shows the number of terms selected on the whales and tires documents when a common threshold (quantile 40%) is applied to the term entropies.
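The entropy computation of equation 3-4 and the quantile-threshold selection can be sketched as follows; the 3 × 2 matrix below is illustrative, not the β estimated in the thesis:

```python
import numpy as np

def term_entropies(beta):
    """Entropy of p(term_v | topic_k) across topics, per Eq. 3-4.

    beta: K x V matrix; column v holds p(term_v | topic_k) for each topic k.
    """
    p = np.clip(beta, 1e-12, None)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=0)

def select_terms(beta, quantile=0.4):
    """Keep ids of terms whose entropy falls at or below the given quantile."""
    h = term_entropies(beta)
    return np.where(h <= np.quantile(h, quantile))[0]

# Term 0 is concentrated in one topic (discriminative, low entropy);
# term 1 is spread evenly across all topics (common, high entropy).
beta = np.array([[0.98, 0.5],
                 [0.01, 0.5],
                 [0.01, 0.5]])
h = term_entropies(beta)
selected = select_terms(beta, quantile=0.4)
```

Only the concentrated term survives the threshold, matching the intuition that high-entropy terms carry little class information.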


Figure 3-5. Histogram of term-entropy values over the LDA β matrix, calculated on the whales and tires data set

Figure 3-6. LDA-based feature reduction

3.3.3 Document Classifier from LDA Outputs

As discussed in section 3.3.1, we cannot directly use the estimated topics from LDA as the topic or class of documents for the ontology building, because the estimated topics are sets of words, or a distribution over the vocabulary words. One way to tackle this problem is to build a document classifier using the document-topic mixture vectors from the


LDA model's output. The following sections describe two different approaches to using the document-topic mixtures in classifying the corpus documents, and their results.

Table 3-3. LDA-based feature reduction with different numbers of topics

                      2 topics   3 topics   5 topics   10 topics   15 topics
Selected # of terms   2408       1664       899        246         104

Document topic proportions' distance: A naive approach is to find the centroid, or mean θ̂, of the tagged documents' topic mixtures θ_d, and classify the unseen documents into a particular class by using the minimal topic-mixture distance to the class mean. Two distances that can be used for this approach are the cosine and Hellinger [23] distances:

Cosine(θ_d, θ̂) = (θ_d · θ̂) / (‖θ_d‖ ‖θ̂‖)    (3-5)

Hellinger(θ_d, θ̂) = ∑_{k=1}^{K} (√θ_{d,k} − √θ̂_k)²    (3-6)

Figure 3-7 shows the corresponding distance plots for the whales (first 40) and tires (last 37) documents. In this experiment the LDA estimation, with 2 topics, is performed on all the documents in the data set, which is extracted from the Wikipedia pages in the whales and tires domains (230 documents). Then, the whale documents' mean topic proportion is calculated on the training set, 2/3 of the LDA document topic proportions matrix. Subsequently, the cosine and Hellinger distances are calculated on the test set, the remaining 1/3 of the matrix. A low Hellinger distance or a high cosine value indicates that the corresponding document is nearer to the class mean. Moreover, we can build classifiers based on this approach by applying a threshold on the distances to a class's mean document topic proportions. For example, to build a classifier using the Hellinger distance and the whales mean, we set the threshold to 0.5; afterwards, we assign all the points below the threshold to the whales class and the rest to the other class. Figure 3-8 shows the classification accuracy of the classifiers built based on this approach


and with varying numbers of topics. The lines with bullets represent the accuracy values on the test set. From this, we can infer that when the number of topics deviates from the desired number of topics in the corpus (for this corpus the desired value is 2), the accuracy of the classifiers deteriorates. The bottom line is that one should find the accurate number of corpus topics for an LDA model before applying the LDA document topic mixtures to any classification purpose. The accuracy values can also be used for selecting the number of topics in an LDA model.

Figure 3-7. Hellinger (top) and cosine (bottom) distances of the document topic mixtures to the whales' document topic mixture mean.

Maximum margin classifier: In this experiment a maximum margin classifier, a support vector machine (SVM) classifier (SVC), is built on the document topic mixture proportions from the LDA model.
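The distance computations of equations 3-5 and 3-6, and the 0.5-threshold rule described above, can be sketched as follows (the class mean vector below is an illustrative stand-in for the whales mean):

```python
import math

def cosine(p, q):
    """Cosine similarity between two topic-proportion vectors (Eq. 3-5)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def hellinger(p, q):
    """Hellinger distance in the squared-root form of Eq. 3-6 (no 1/2 factor)."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def classify(theta, class_mean, threshold=0.5):
    """Assign the class whose mean lies within the Hellinger threshold."""
    return "whales" if hellinger(theta, class_mean) < threshold else "tires"

whales_mean = [0.9, 0.1]  # hypothetical mean topic proportions of the whales class
```

A document whose topic mixture sits near the class mean falls below the threshold and is labeled whales; everything else falls to the other class.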


Figure 3-8. Classification accuracy of the classifiers built based on Hellinger and cosine distances with varying numbers of topics.

The training set is built on 2/3 of the LDA document topic proportions matrix; both classes, whales and tires, are equally distributed. The test set is drawn from the rest of the document topic proportions. This thesis uses the LIBSVM [7] package to train the SVM classifier. Figure 3-9 shows the flow diagram of this experiment. Figure 3-10 shows the accuracy of the SVM models on the test set with 10-fold cross validation [4], with varying numbers of topics applied to the LDA model. From this, we can see that the SVC performs about as well as the distance-based classifiers [figure 3-8] when the LDA model is trained with the desired number of topics. Moreover, the SVC models outperform the distance-based classifiers when the number of topics assigned to the LDA model is incremented.

Figure 3-12 shows the distance measures calculated to the whales mean on the same training set. It also shows the distance of the support vectors (marked as S) that were found


Figure 3-9. SVM classification model using the LDA document proportions

Figure 3-10. SVM classification accuracy of the whales-tires document topic mixtures with varying numbers of topics


in the SVC model built on the same training set to the whales mean. This figure gives us an intuitive idea of the support vector selection from the training set. Figure 3-11 shows the SVM regressed values of the test set built from the whales-tires document topic mixtures with two topics. The tire document with its regression value above -1 (which represents the tires class) is an outlier document: even though this document is tagged with the tires category in Wikipedia, its terms belong to the whales class. We only need to store the SVC model, which is built on a huge corpus of documents, for future classification. This saves the storage space of a classification model.

Figure 3-11. SVC regressed values of the whales-tires document topic mixtures with two topics.

3.3.4 Inferring Document Topic Proportions from the Learned LDA Model

Topic models are useful in analyzing large collections of documents and representing high-dimensional documents in a low-dimensional form. However, fitting an LDA model from documents is computationally expensive, since it uses different approximation techniques such as Gibbs sampling or variational methods. We can tackle this issue by inferring document-topic mixture proportions for the unseen documents from the learned


LDA model without model retraining.

Figure 3-12. Hellinger (top) and cosine (bottom) distances of the document topic mixtures to the whales' document topic mixture mean.

A paper by Yao et al. [29] describes different methods for topic inference based on techniques such as Gibbs sampling, variational inference, and classification. This thesis, however, proposes a novel method to fit the LDA document topic mixtures and document word frequencies into a regression model, as follows. First, to fit the regression model for the document topic proportions, we reduce the vocabulary size by using the dimensionality reduction techniques described in section 3.3.2. Second, we fit a multi-output regression model by fitting multiple regression models, one for each of the LDA topic proportions, with the document feature vectors formed


by using the reduced features. In fact, for a K-topic LDA model, we need K regression models to fit. Let t_i be a topic row of the K × D matrix θ, and let X be the D × V matrix that represents the reduced feature vectors (of dimension V) of the corpus documents. Thus, the regression model for t_i is represented as:

t_i = f(X, w)    (3-7)

where w represents the parameters to be learned for the regression model. In addition, we use these learned regression models for obtaining the document topic mixtures of unseen documents. This bypasses the LDA learning process for the new documents.

Figure 3-13. SVM as a regression and classification model with the LDA model trained on the whole whales and tires documents

Experiments: SVM for both regression and classification. In this experiment, SVM is used for both regression (SVR) [Eq. 3-7] and classification (SVC) [section 3.3.3]. First, we see the results of an SVC model on the document topic proportions received from an SVR model using the test set, 1/3


of the whales and tires documents [Model 1]. This SVR is trained on 2/3 of the document proportions from an LDA model learned with the whole data set, i.e., 230 documents (figure 3-13). Second, we see the results of an SVC model on the document topic proportions received from an SVR model on the test set [Model 2]. This SVR is trained on the document proportions from an LDA model learned with 2/3 of the whales and tires documents (figure 3-14). Table 3-4 shows the classification accuracy of these two experiments compared with the LDA document proportion classifier, Model 0 [section 3.3.3, figure 3-9]. We can see that the Model 2 approach can be used as an alternative to the classification model in which we retrain the whole data set with LDA whenever we get a new document to classify. In addition, the SVM-based approach saves the storage space for models [4], because we only need to store the support vector document proportions and their parameters, which is supposedly far less than the corpus documents and their features.

Figure 3-14. SVM as a regression and classification model with the LDA model trained on 2/3 of the whales and tires documents
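A sketch of the per-topic regression of equation 3-7, with ordinary least squares standing in for the Support Vector Regression the thesis uses, on a toy 2-document, 2-term, 2-topic data set (all values illustrative):

```python
import numpy as np

# Toy stand-in data: X is the D x V reduced feature matrix, theta holds the
# D x K document-topic proportions from a fitted LDA model.
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
theta = np.array([[0.8, 0.2],
                  [0.3, 0.7]])

def fit_topic_regressors(X, theta):
    """Fit one linear regressor per topic column: theta[:, k] ~ X @ w_k.

    The thesis fits K SVR models (Eq. 3-7); least squares stands in here.
    """
    W, *_ = np.linalg.lstsq(X, theta, rcond=None)
    return W  # V x K parameter matrix

def infer_proportions(x, W):
    """Predict topic proportions for an unseen document and renormalise."""
    t = np.clip(x @ W, 1e-9, None)
    return t / t.sum()

W = fit_topic_regressors(X, theta)
pred = infer_proportions(np.array([1.0, 0.0]), W)
```

Once W is fitted, an unseen document's reduced term-frequency vector maps directly to topic proportions, bypassing a fresh LDA estimation as the section describes.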


Table 3-4. SVM as a regression and classification model

Model                                                  Accuracy
SVC with the LDA's document-topic mixture (Model 0)    97.40
Model 1                                                88.31
Model 2                                                84.42

3.4 The Scope of LDA in Ontology Learning

The first step in learning a realm-based ontology from web pages is to find those web pages associated with a particular domain. This thesis proposes a clustering-based approach based on the topics extracted from the LDA estimation. The uncategorized web pages are classified using an LDA-based classifier [section 3.3]. The next step in ontology building is to create the ontology heterarchy. Morpheus's approach to building the ontology heterarchy is motivated by the YAGO heterarchy building process [26], which does not learn the taxonomy from text; instead, it uses the WordNet heterarchy [section 2.4.2] to build the taxonomy. The classes we learn from the web pages are associated with WordNet synsets. For each class in the ontology we will have a class-document model that represents the class in terms of the LDA model learned from text. The class term probabilities from β, an LDA estimation output, can be used for associating terms in a user query with the corresponding classes in an ontology. This LDA-based ontology learning is in progress, and it is not covered in this thesis.

3.5 Summary

This chapter gave a brief introduction to the topic model Latent Dirichlet Allocation and its possible applications to modeling text documents. The highlights are using LDA for reducing the dimensionality of the corpus vocabulary and documents, and the inference of document topic mixtures from the learned LDA model without re-training it. In addition, this chapter compared the results of classification using the document topic proportions with varying numbers of topics. The chapter concluded by describing the scope of the document modeling techniques in ontology learning from uncategorized web documents and in tagging user query terms with ontological class labels.


CHAPTER 4
QUERY RANKING

Query ranking is a challenging problem in any question answering system. Usually the user query-ranking method is based on a similarity measure defined between the user queries. This chapter describes an ontological realm-based class similarity measure and its application to query ranking. The first few sections explain class divergence, a quasi-metric, and a query-ranking algorithm based on it. The later sections present their implementations and results. Finally, the chapter concludes by discussing the pros and cons of the class divergence measure and the query-ranking algorithm.

4.1 Introduction

Morpheus solves a path follower's query, a candidate SSQ, by finding similar qualified SSQs that are associated with QRMs in the data store. We consider the SSQ components [section 2.3], such as the query realm, input terms, and output terms, and their assigned classes, to compute the similarities between SSQs. In the Morpheus system, the classes assigned to query terms belong to a realm ontology. We define class divergence, a quasi-metric that characterizes the similarity or dissimilarity between two classes within a realm ontology [9]. In this metric, we apply the concepts of multiple dispatch in CLOS [24] and Dylan programming [2] for generic function type matches. Once we have the similarity measures between the candidate SSQ and the qualified SSQs in the QRM data store, we order the relevant SSQs and associated QRMs. The order represents a ranking of the query results for the path follower. Sections 4.2 and 4.3 describe this in detail.

4.2 Class Divergence

We employ a measure of compatibility, which we call class divergence (cd), between two classes based on the class topology of an ontology. This algorithm was originally described in [9]. Let S be the qualified class and T be the candidate class. S ⪯ T represents the reflexive transitive closure of the superclass relation, i.e., S is an ancestor of


T. d(P, Q) represents the hop distance from P to Q in the directed ontology inheritance graph. Let C be a common ancestor class of S and T which minimizes d(S, C) + d(T, C). The class divergence between S and T is defined as:

cd(S, T) = 0                                            if S ≡ T
         = d(S, T) / (3h)                               if S ⪯ T
         = (d(S, root) + d(S, C) + d(T, C)) / (3h)      otherwise    (4-1)

where h is the height of the ontology tree. The value of cd(S, T) is in the range of zero (representing identical classes) to one (representing incompatible classes). In addition, if S ⪯ T and S ⋠ Q, then cd(S, T) < cd(S, Q).

Figure 4-1. An abbreviated automotive ontology

For example, cd(bus, coupe) = 6/12. The value is calculated from d(bus, root) = 3, d(bus, land_vehicle) = 1, and d(coupe, land_vehicle) = 2.

4.3 SSQ Matching and Query Ranking

To find the relevance between the candidate SSQ and a qualified SSQ, Morpheus uses the class divergence between their assigned term classes. The qualified SSQs in the Morpheus data store contain input terms, output terms, tagged ontological classes, and a realm from the QRM. For the candidate query, on the other hand, the Morpheus NLP engine parses the query terms and assigns classes from a realm ontology [section 3.4]. In addition, the query realm is inferred from the candidate query terms based on probabilities calculated with p(realm | term). We refer to the sequence of these classes ⟨σ_1, ..., σ_N, ω_1, ..., ω_M, R⟩ as the signature of an SSQ: σ_i represents an input class, ω_i represents an output class, and R is the realm of the query. If an input or output class is unspecified in the query, it is associated with the class ⊥, which is a subclass of every other class, having no subclasses or individuals.
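The class divergence of equation 4-1 and the bus/coupe example can be sketched on a hypothetical slice of the figure 4-1 ontology. The class names, the root, and the height h = 4 below are assumptions chosen to be consistent with the worked example, not the thesis's full ontology:

```python
# child -> parent edges of the superclass relation (assumed slice of Fig. 4-1)
PARENT = {
    "vehicle": "object",
    "land_vehicle": "vehicle",
    "car": "land_vehicle",
    "bus": "land_vehicle",
    "coupe": "car",
}
ROOT, HEIGHT = "object", 4  # h: height of the ontology tree (assumed)

def ancestors(c):
    """Reflexive chain of superclasses from c to the root, with hop counts."""
    hops, d = {}, 0
    while True:
        hops[c] = d
        if c == ROOT:
            return hops
        c, d = PARENT[c], d + 1

def cd(S, T):
    """Class divergence (Eq. 4-1): 0 for identical classes, up to 1 otherwise."""
    if S == T:
        return 0.0
    up_t = ancestors(T)
    if S in up_t:                      # S is an ancestor of T
        return up_t[S] / (3 * HEIGHT)
    up_s = ancestors(S)
    # Common ancestor C minimising d(S, C) + d(T, C)
    C = min(set(up_s) & set(up_t), key=lambda c: up_s[c] + up_t[c])
    return (up_s[ROOT] + up_s[C] + up_t[C]) / (3 * HEIGHT)
```

On this toy graph, cd("bus", "coupe") reproduces the (3 + 1 + 2)/12 computation of the worked example.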


Let Σ be the set of all input classes of an SSQ and Ω be the set of all output classes of an SSQ. Suppose we want to find the similarity of a qualified SSQ Q with signature ⟨Σ_Q, Ω_Q, R_Q⟩ to the candidate SSQ S with signature ⟨Σ_S, Ω_S, R_S⟩. Let ⟨{σ_Q,i} × Σ_S⟩ represent a set with elements calculated by

⟨{σ_Q,i} × Σ_S⟩ = {cd(σ_Q,i, b) | (σ_Q,i, b) ∈ {σ_Q,i} × Σ_S}    (4-2)

where cd(σ_Q,i, b) represents the class divergence between the classes σ_Q,i and b, and × represents the cross product of the elements of the sets Σ_S and {σ_Q,i}. We consider only QRMs that belong to the same realm as the candidate SSQ. To find the similarity between the qualified SSQ Q and the candidate SSQ S that belong to the same realm, we first calculate ⟨{σ_Q,i} × Σ_S⟩ and ⟨{ω_Q,j} × Ω_S⟩ for all elements of Σ_S and Ω_S. The divergence of a qualified SSQ from the candidate SSQ is determined by aggregating the calculated divergence measure values as follows:

divergence(S, Q) = (α/N) ∑_{i=1}^{N} min ⟨{σ_Q,i} × Σ_S⟩ + (β/M) ∑_{j=1}^{M} min ⟨{ω_Q,j} × Ω_S⟩    (4-3)

where N is the number of elements in the set Σ_Q, M is the number of elements in the set Ω_Q, and the positive weights α and β satisfy the condition:

α + β = 1    (4-4)

Finally, we order the QRMs in the store by increasing divergence. The order provides a ranking of the results for the path follower's query. Section 2.3.1 gives a brief overview of the Morpheus query execution process using the ranked QRMs.

4.4 Results

For illustrative purposes, we use the abbreviated automotive ontology (figure 4-1) to find the class divergence between the term classes in this section. Suppose a path follower enters the query "What is the tire size for a 1997 Toyota Camry V6?". The Morpheus natural language processing system parses this query into table 4-1. Table 4-2


shows a set of sample top term-classes and their associated probabilities p(class | term), as tagged by the NLP engine. The results shown here are based on manually constructed classes and probabilities; our aim is to learn these class-term probabilities from the ontology and corpora built from the web documents. The n-grams with duplicated terms (marked * in table 4-2) are discarded based on the p(class | term) values.

Table 4-1. The Morpheus NLP engine parse

Attribute                 Value
WH-question Term          what
Descriptive Information   1997 Toyota Camry V6
Asking For                tire size
n-grams                   1997, 1997 Toyota, 1997 Toyota Camry, Toyota, Toyota Camry, Toyota Camry V6, Camry, Camry V6, V6

Table 4-2. Top term-classes and probabilities

Term            Class          p(Class | Term)
1997            year           1.00
Toyota          manufacturer   0.90
Toyota Camry*   automobile     0.60
Camry           model          0.90
Camry V6*       model          0.65
V6              engine         0.99
tire size       tire_size1     0.99

The final step in QRM ranking is to calculate the class divergence values for the term classes associated with the candidate SSQ and a qualified SSQ, and to aggregate them based on the algorithm described in section 4.3. Table 4-3 shows the divergence values calculated from the candidate SSQ to the three qualified SSQs present in the Morpheus data store. From this experiment we can infer that the aggregate class divergence for an SSQ depends mainly on the terms and their tagged classes. If the assigned term classes of both the candidate query and a qualified query are nearby in the realm ontology, the divergence values will generally be smaller; as a result, the query similarity will be high. This also requires that the pathfinder's query be assigned the correct term classes from the ontology, which is a challenging task.
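The aggregation of equations 4-3 and 4-4 can be sketched as follows; the class names and the toy divergence function are illustrative stand-ins for a real realm ontology's cd:

```python
def ssq_divergence(qual_in, qual_out, cand_in, cand_out, cd, alpha=0.5):
    """Aggregate divergence between a qualified and a candidate SSQ (Eq. 4-3).

    For each qualified class, take the minimum class divergence to any
    candidate class of the same kind, average per side, and combine the two
    sides with weights alpha and beta = 1 - alpha (Eq. 4-4).
    """
    beta = 1.0 - alpha
    d_in = sum(min(cd(q, c) for c in cand_in) for q in qual_in) / len(qual_in)
    d_out = sum(min(cd(q, c) for c in cand_out) for q in qual_out) / len(qual_out)
    return alpha * d_in + beta * d_out

# Toy divergence: 0 for identical class names, 1 otherwise.
toy_cd = lambda a, b: 0.0 if a == b else 1.0
score = ssq_divergence(["model", "year"], ["tire_size"],
                       ["model", "year"], ["tire_size"], toy_cd)
```

Queries whose tagged classes coincide aggregate to zero divergence and rank first, which is the pattern table 4-3 exhibits for the closest qualified query.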


Table 4-3. Top ranked queries based on the aggregated class divergence values

Query                                                      Tagged Classes                                             Score
What is the tire size for a 1998 Toyota Sienna XLE Van?    manufacturer, model, year, tire_size1                      0.000
Where can I buy an engine for a Toyota Camry V6?           engine, manufacturer, model, vehicle_part1                 0.072
A 1997 Toyota Camry needs what size windshield wipers?     year, manufacturer, model, measurement1, vehicle_part1     0.336

4.5 Summary

This chapter presented an ontological heterarchy-based class similarity measure grounded on the concepts of type matching. It also demonstrated a prototype system to characterize query similarity using a class divergence quasi-metric. An initial analysis shows that this is a promising method for matching and clustering similar user queries in a semi-structured format. However, thorough testing is required in some areas. For example, the calculation of probabilities for the user query terms and the query-term class tagging are still in progress. Similarly, more research is required to include the context of the terms in a query while we search for similar queries or SSQs.


CHAPTER 5
CONCLUSIONS

This thesis started by describing the challenges of building a question answering system, and a realm-based approach in particular. Subsequently, it presented the idea of using an ontology and associated corpora to represent the web in a more hierarchical form, helping the Morpheus QA system to match similar queries in the SSQ format. The core idea is to represent user query terms by a set of classes from a realm-based ontology and to find similarities in the space of ontological classes. In addition, we calculate class similarity as a type match based on the class taxonomy of the realm ontology. We can extend this idea of finding class similarities to any system that stores information in a class heterarchy. Assigning classes from a realm ontology also helps in clustering natural language queries that belong to a local realm context.

In addition, we considered the challenging task of building an ontology and associated corpora from the web, which often stores information in unstructured and uncategorized form. To tackle this problem, the thesis built frameworks for categorizing web pages and extracting hidden topics from them. We discussed the well-known topic model Latent Dirichlet Allocation for detecting topics in text. We observed that it is useful for distinguishing domains that are far apart, e.g., whales and tires. However, it fails to distinguish documents when they are from similar domains, e.g., whale types. In addition, we saw that LDA is useful for reducing the dimensionality of documents into a topic space and for building classifiers on that topic space. This LDA-based dimensionality reduction can be applied to any system where one can assume the existence of a topic space on top of the feature space.

Moreover, we saw a promising method that can avoid LDA retraining for the corpus documents, by fitting a multi-output regression model based on Support Vector Regression using the document term-frequency feature vectors and the LDA topic proportions. This is useful in systems where we need to apply LDA to streaming documents. Even though


LDA is seen to be a promising approach to classification and topic extraction, more research is required to find the applicability of LDA topics in learning an ontology taxonomy from a plain text corpus.


REFERENCES

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, and Z. Ives. DBpedia: A nucleus for a web of open data. In 6th International Semantic Web Conference, Busan, Korea, pages 11-15. Springer, 2007.

[2] K. Barrett, B. Cassels, P. Haahr, D. A. Moon, K. Playford, and P. T. Withington. A monotonic superclass linearization for Dylan. SIGPLAN Not., 31(10):69-82, 1996.

[3] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly, Beijing, 2009.

[4] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st ed. 2006, corr. 2nd printing edition, October 2007.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia: a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154-165, September 2009.

[6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

[7] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[8] T. Dohzen, M. Pamuk, S. W. Seong, J. Hammer, and M. Stonebraker. Data integration through transform reuse in the Morpheus project. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 736-738, New York, NY, USA, 2006. ACM.

[9] C. Grant, C. P. George, J.-d. Gumbs, J. N. Wilson, and P. J. Dobbins. Morpheus: A deep web question answering system. In International Conference on Information Integration and Web-based Applications and Services, Paris, France, 2010.

[10] B. Green, A. Wolf, C. Chomsky, and K. Laughery. Baseball: an automatic question answerer. pages 545-549, 1986.

[11] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228-5235, April 2004.

[12] T. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2):199-220, June 1993.

[13] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Burgle, H. Duwiger, and U. Scheel. Faceted Wikipedia search. In BIS, pages 1-11, 2010.


[14] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50-57, New York, NY, USA, 1999. ACM.

[15] D. Horowitz and S. D. Kamvar. The anatomy of a large-scale social search engine. In WWW '10: Proceedings of the 19th International Conference on World Wide Web, pages 431-440, New York, NY, USA, 2010. ACM.

[16] J. Lin and B. Katz. Question answering from the web using knowledge annotation and knowledge mining techniques. In CIKM '03: Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 116-123, New York, NY, USA, 2003. ACM.

[17] Y. Liu and E. Agichtein. On the evolution of the Yahoo! Answers QA community. In SIGIR '08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 737-738, New York, NY, USA, 2008. ACM.

[18] H. Nguyen, T. Nguyen, and J. Freire. Learning to extract form labels. Proc. VLDB Endow., 1(1):684-694, 2008.

[19] N. F. Noy and D. McGuinness. Ontology development 101: A guide to creating your first ontology. Stanford KSL Technical Report KSL-01-05, 2000.

[20] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.

[21] J.-F. Pessiot, Y.-M. Kim, M. R. Amini, and P. Gallinari. Improving document clustering in a learned concept space. Information Processing and Management, 46(2):180-192, 2010.

[22] N. Schlaefer, P. Gieselmann, and T. Schaaf. A pattern learning approach to question answering within the Ephyra framework. In Proceedings of the Ninth International Conference on TEXT, SPEECH and DIALOGUE, 2006.

[23] A. Srivastava and M. Sahami. Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC, 2009.

[24] G. L. Steele, Jr. Common LISP: The Language (2nd ed.). Digital Press, Newton, MA, USA, 1990.

[25] M. Steyvers and T. Griffiths. Probabilistic topic models. Chapter in Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum Associates, 2007.

[26] F. M. Suchanek. Automated Construction and Growth of a Large Ontology. PhD thesis, Saarland University, 2009.


[27] X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 178-185, New York, NY, USA, 2006. ACM.

[28] W. Woods. Progress in natural language understanding - an application to lunar geology. In American Federation of Information Processing Societies (AFIPS) Conference Proceedings, 42, pages 441-450, 1973.

[29] L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 937-946, New York, NY, USA, 2009. ACM.

[30] Z. Zhang, X.-H. Phan, and S. Horiguchi. An efficient feature selection using hidden topic in text categorization. 22nd International Conference on Advanced Information Networking and Applications - Workshops (AINA Workshops 2008), pages 1223-1228, March 2008.


BIOGRAPHICAL SKETCH

Clint Pazhayidam George received his Bachelor of Technology in computer science from the Department of Computer Science and Engineering, College of Engineering Trivandrum, University of Kerala, India, in 2004. After graduation, he worked for four years as a software professional. He graduated with his master's in computer science from the Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, in 2010. His research interests include machine learning, applied statistics, text mining, and pattern recognition.