
EntityScore

Permanent Link: http://ufdc.ufl.edu/UFE0042394/00001

Material Information

Title: EntityScore: Using Facts to Quantify Information in Documents
Physical Description: 1 online resource (174 p.)
Language: english
Creator: Smith, Charles
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2010

Subjects

Subjects / Keywords: Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Information retrieval systems have traditionally relied on matching keywords to retrieve documents. Typically this has been achieved using one of two approaches: a vector-based method in which the documents are considered to be n-dimensional vectors in the space of the terms, and a probabilistic method in which documents are considered relevant if they contain keywords that have a high probability of occurring in the documents considered relevant. While systems using these methods have achieved good results in many cases, they do not take advantage of the underlying meaning of the terms. Attempts have been made to apply additional semantic meaning using natural language processing, but these systems are often slow and are sensitive to the many different ways of expressing the same information. We present a new model of information retrieval using named entities. Using well defined facts about the named entity allows the documents to be ranked according to the total number of facts they contain. To evaluate this system, we propose a new evaluation system based on the number of questions a document answers.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Charles Smith.
Thesis: Thesis (Ph.D.)--University of Florida, 2010.
Local: Adviser: Dankel, Douglas D.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2012-12-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2010
System ID: UFE0042394:00001




Full Text

ENTITYSCORE: USING FACTS TO QUANTIFY INFORMATION IN DOCUMENTS

By

CHARLES SMITH

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010

© 2010 Charles Smith

TABLE OF CONTENTS

                                                                        page
LIST OF TABLES ......................................................... 6
LIST OF FIGURES ........................................................ 7
ABSTRACT ............................................................... 8

CHAPTER

1 INTRODUCTION ......................................................... 9
  1.1 The Growth of Information Retrieval Systems ....................... 9
  1.2 Queries and Automatic Question Answering ......................... 12
  1.3 Our Research ..................................................... 15
  1.4 Related Work ..................................................... 17
  1.5 Study Organization ............................................... 18

2 INFORMATION RETRIEVAL MODELS ........................................ 19
  2.1 The Boolean Model ................................................ 20
    2.1.1 The Fuzzy Set Model ......................................... 22
    2.1.2 The Extended Boolean Model .................................. 24
  2.2 The Probabilistic Model .......................................... 26
  2.3 The Vector Model ................................................. 30
    2.3.1 Latent Semantic Analysis .................................... 32
    2.3.2 Neural Network Model ........................................ 35
  2.4 Techniques for Index Manipulation ................................ 37
    2.4.1 Stemming .................................................... 38
    2.4.2 Stopword Removal ............................................ 39
    2.4.3 Indexing Using Phrases ...................................... 40
    2.4.4 Other Types of Indexing ..................................... 41
  2.5 Techniques for Queries ........................................... 41
    2.5.1 Query Expansion ............................................. 42
    2.5.2 Relevance Feedback .......................................... 44
  2.6 Techniques for Ranking ........................................... 45
    2.6.1 PageRank .................................................... 46
    2.6.2 Hubs and Authorities ........................................ 47
  2.7 Evaluation of Information Retrieval Models ....................... 48
    2.7.1 Precision and Recall ........................................ 48
    2.7.2 Other Evaluations ........................................... 50
    2.7.3 Term Relevance Sets ......................................... 51
  2.8 Conclusion ....................................................... 52

3 QUESTION ANSWERING .................................................. 53
  3.1 Question Answering Systems ....................................... 55
  3.2 Question Categories .............................................. 56
  3.3 Query Preprocessing .............................................. 58
  3.4 Document Retrieval ............................................... 60
  3.5 Passage Retrieval and Answers .................................... 61
  3.6 Evaluation ....................................................... 65
  3.7 Natural Language Processing ...................................... 67
  3.8 Conclusion ....................................................... 69

4 EVALUATION .......................................................... 70
  4.1 Introduction ..................................................... 70
  4.2 Creating Query Relevance Sets .................................... 72
    4.2.1 Converting TREC Topics ...................................... 72
    4.2.2 Comparing Methods ........................................... 75
    4.2.3 Conclusion .................................................. 77
  4.3 New Qrels ........................................................ 78
    4.3.1 Qrel Topics ................................................. 80
    4.3.2 Testing Method .............................................. 80
    4.3.3 Evaluation .................................................. 82
  4.4 Results .......................................................... 84
  4.5 Overview ......................................................... 84
    4.5.1 Single Person Questions ..................................... 85
    4.5.2 Bpref at Different Levels ................................... 86
  4.6 Conclusion ....................................................... 87

5 ENTITYSCORE - INFORMATION QUANTIFICATION ............................ 94
  5.1 Introduction ..................................................... 94
  5.2 Related Work ..................................................... 96
  5.3 Problem Statement ................................................ 98
  5.4 A New Model ..................................................... 100
    5.4.1 Indexing ................................................... 102
      5.4.1.1 Retrieving facts ....................................... 105
      5.4.1.2 Query construction ..................................... 106
      5.4.1.3 Document scoring ....................................... 107
    5.4.2 Query Relevance ............................................ 111
    5.4.3 Ranking .................................................... 114
  5.5 Conclusion ...................................................... 119

6 ENTITYSCORE - EVALUATION ........................................... 121
  6.1 Introduction .................................................... 121
  6.2 Evaluated Systems ............................................... 123
    6.2.1 EntityScore ................................................ 123

    6.2.2 Weighted EntityScore and Counted Systems ................... 124
    6.2.3 Query Expansion and Focused Expansion ...................... 125
    6.2.4 Phrase-Based EntityScore Systems ........................... 127
  6.3 Full Qrel Evaluation ............................................ 127
  6.4 Evaluation Using Partial Information ............................ 130
    6.4.1 Evaluation Without List-based Facts ........................ 131
    6.4.2 Evaluation when Removing Features .......................... 133
  6.5 Evaluating How Systems Identify Specific Information ............ 134
  6.6 Conclusion ...................................................... 135

7 CONCLUSION AND FUTURE WORK ......................................... 150
  7.1 Future Work in Evaluation ....................................... 151
  7.2 Future Work for the EntityScore Model ........................... 153

APPENDIX

A BPREF FOR INDIVIDUAL NAMED ENTITIES USING ALL FACTS ................ 157
B ERROR FOR ENTITY TYPES ............................................. 159

REFERENCES ........................................................... 165
BIOGRAPHICAL SKETCH .................................................. 174

LIST OF TABLES

Table                                                                   page
4-1 Breakdown of documents for TREC David McCullough topic .............. 89
4-2 Breakdown of documents for TREC by question type .................... 89
4-3 Question types for countries ........................................ 90
4-4 Question types for movies ........................................... 90
4-5 Question types for persons .......................................... 90
4-6 Judged object names ................................................. 90
4-7 Comparison of Lucene and Bing ....................................... 91
4-8 Comparison of Lucene and Bing on authors ............................ 91
4-9 Comparison of Lucene and Bing on presidents ......................... 92
5-1 Example of Freebase's types applied to people ...................... 120
5-2 Short example document scores for a person ......................... 120
6-1 Comparison of systems using all judged facts ....................... 137
6-2 Effect of the book feature on Jane Austen query .................... 139
6-3 Country query without list-based facts ............................. 140
6-4 Movie query without list-based facts ............................... 141
6-5 Person query without list-based facts .............................. 142
6-6 Country query with fewer facts ..................................... 143
6-7 Movie query with fewer facts ....................................... 144
6-8 Person query with fewer facts ...................................... 145
6-9 Bpref scores for single facts ...................................... 146

LIST OF FIGURES

Figure                                                                  page
4-1 TREC topic example .................................................. 89
4-2 Comparison of Bing and Lucene at different question levels .......... 93
6-1 Frequency of fact terms for Jane Austen ............................ 138
6-2 Bpref at fact levels for all facts ................................. 147
6-3 Bpref at fact levels without list-based facts ...................... 148
6-4 Bpref at fact levels when systems are missing facts ................ 149
A-1 Bpref at levels for individual persons ............................. 157
A-2 Bpref at levels for individual movies .............................. 158
A-3 Bpref at levels for individual countries ........................... 158
B-1 MAE all facts ...................................................... 159
B-2 RMSE all facts ..................................................... 160
B-3 MAE when missing facts ............................................. 161
B-4 RMSE when missing facts ............................................ 162
B-5 MAE without list-based facts ....................................... 163
B-6 RMSE without list-based facts ...................................... 164

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ENTITYSCORE: USING FACTS TO QUANTIFY INFORMATION IN DOCUMENTS

By

Charles Smith

December 2010

Chair: Douglas D. Dankel II
Major: Computer Engineering - CISE

Information retrieval systems have traditionally relied on matching keywords to retrieve documents. Typically this has been achieved using one of two approaches: a vector-based method in which the documents are considered to be n-dimensional vectors in the space of the terms, and a probabilistic method in which documents are considered relevant if they contain keywords that have a high probability of occurring in the documents considered relevant. While systems using these methods have achieved good results in many cases, they do not take advantage of the underlying meaning of the terms. Attempts have been made to apply additional semantic meaning using natural language processing, but these systems are often slow and are sensitive to the many different ways of expressing the same information. We present a new model of information retrieval using named entities. Using well-defined facts about the named entity allows the documents to be ranked according to the total number of facts they contain. To evaluate this system, we propose a new evaluation system based on the number of questions a document answers.

CHAPTER 1
INTRODUCTION

As the amount of recorded information grows, the ability to retrieve specific information becomes increasingly important. In the past, the task of filing and indexing would be assigned to librarians and archivists. Now data is being generated at such a pace that it is impossible for humans alone to cope. In the year 2006, individuals and corporations generated an estimated 161 exabytes of digital data. This includes images, movies, music, and text, and is equivalent to 1 million digital copies of every book in the Library of Congress. To archive and retrieve data in a timely manner now requires information retrieval systems. This also allows casual users (i.e. users without specialized knowledge) to find the information they need.

1.1 The Growth of Information Retrieval Systems

Information Retrieval (IR) is the task of retrieving information that meets a user's need. IR involves three specific aspects: the informational need of the user, the collection of information available, and the particular document(s) that meets the need. Initially, researchers concentrated on finding the single document that best met the user's information need. In many cases this research has involved small collections of documents where few items were deemed relevant to a particular query. The resulting IR models were: the Boolean model based on set theory, the vector model based on algebraic theory, and the probabilistic model based on probability theory.

With the increased importance of the Internet and world wide web (WWW), collections have become much larger. The WWW is important for research as it contains many different kinds of information. Not only are there text documents but also images, music, and movies. Additionally, while the WWW is typically considered to consist of unstructured data, it contains repositories of structured data such as encyclopedias, newspapers, and other tightly cohesive collections of information.

Recently, IR research has focused on using and improving the traditional models for the WWW or other similarly large corpora. Because the WWW has grown at such a rate and has so much information available, much research has been devoted to addressing this type of corpus. An example of this is the Text Retrieval Conference's (TREC) track for large corpora. Over the years, TREC has moved from a large collection of 20 gigabytes [40], to a 100-gigabyte collection created from a Web crawl made by the Internet Archive in 1997 [39], to the newest Terabyte track composed of web pages from the .gov domain [18]. This growth in test collection sizes mirrors the growth of the WWW, which has been estimated as high as 11.5 billion indexable pages in 2004 [29, 37, 59].

In these larger corpora, queries using the traditional IR models tend to find many relevant documents. While these documents are technically relevant under the models, a user is overwhelmed when receiving millions of documents in response to a query. The large number of documents retrieved is partially due to the fact that with a larger corpus there are many documents that are partly relevant to most queries, but it is also due to a basic assumption of the classical IR models that an entire document can be treated as a list of words. By making this bag-of-words (BOW) assumption, IR systems have been able to ignore much of the natural language aspect of documents and instead focus on learning patterns that allow for successful retrieval.

To deal with the large number of relevant documents, researchers have begun utilizing some of the additional information contained in documents. Attempts have included giving additional weight to terms that appear in the titles of web pages and using specific markups in the HTML to gain insight into what is important on the pages [25, 49]. While these approaches have been somewhat useful they still fall short of using the text's semantic and syntactic information. Some success has been achieved in identifying interaction between keywords, as in the Latent Semantic Network model. This is an extension of the Vector model and uses the document vectors to find the
intra-document and intra-term similarities. While this improves over traditional models, the calculations needed are intractable as formulated [28]. This is also true of methods that use natural language processing to pull more information out of the text.

The most successful models have been those that treat the WWW as a graph and use the hyperlinks for semantic information, most notably PageRank [77] and Hubs and Authorities [55]. The reasoning behind these models is that if pages are important in the context of the WWW, more people will link to them. Thus the hyperlinks between documents have the inherent semantic meaning of votes for documents that are useful. PageRank is the more popular of these algorithms and is the basis of the Google search engine. In this model an imaginary user randomly moves around the graph of the Internet. The more often the user comes to a particular page the more important it is deemed. Interestingly, the Google search engine utilizes the Boolean Model for retrieval, which is generally considered the most antiquated of the classical models. Hubs and Authorities is based on the assumption that some people create pages that point to the most important topics on a subject. In a mutual strengthening, hubs (i.e. pages that have many other pages listed on them) and authorities (i.e. pages that are authoritative) are found and determined to be important.

Utilizing the graph features of the WWW allows these models to treat the links between pages as semantic information. These links are explicit, unlike the semantic information that one extracts from text via natural language processing. The success of these models has influenced research since their development. One research approach uses the graph nature to push relevance from page to page [80], while another utilizes the hierarchical nature of websites to further understand the graph structure [114]. There is even an attempt to use PageRank in a probabilistic model that lacks links [57], instead using language models to simulate the graph nature of the corpus.

Research in IR continues in all areas. As the need for information increases, so does the need for accurate and precise information retrieval. Two specific, long term
areas of interest, for example, are global information access and contextual retrieval. Global information access is the need to access information in any language. Contextual retrieval is the need to customize search engines to the particular user. Since each user has his or her own evaluation of relevance, an IR system that could re-weight relevant documents for the particular user would be highly useful.

1.2 Queries and Automatic Question Answering

In traditional IR research, the information need of a user is considered to be those documents that are relevant to a particular topic. To express that information need, the user provides keywords that are used to match relevant documents. In the Boolean model, the user could use Boolean terms such as "and" and "or" to express more clearly his or her need. In probabilistic and algebraic models the query is treated as a bag of words. In both of these models, the ability of a system to distinguish between relevant and irrelevant documents appears to depend on the ability of the terms used in the query to distinguish between documents.

To improve the ability of users to express their needs or to better capture the needs of current queries there are two main approaches.[1]

    [1] This is not to discount the research that deals with the context of the user and personal weightings. Both of these subjects are interesting but have not been addressed in the current question answering work. As work progresses, both of these will be important as it becomes necessary to gain an understanding of what particular users need rather than the general concept of a user to correctly answer questions.

The first is to expand the scope of the query so that it either finds additional documents or focuses on particular documents through relevance feedback and query expansion. Relevance feedback is the process of altering a query based on documents that a user finds relevant within the initial returned set [70, 84]. This process is used extensively in probabilistic models, although it is often automatic [4]. Closely related to this is the process of query expansion, which expands
the scope of the query to find relevant documents that do not necessarily include the query terms.

Query expansion is used for two reasons. In one case, a query may return no documents. This may occur because the corpus has zero or very few documents containing the search terms. But rather than tell the user that the search failed and have him or her reformulate the query, we can try using words related to the terms in the query to find documents that may be relevant. Query expansion uses a thesaurus or WordNet [69] to find words related to the query terms that return additional documents. While this technique is useful on small corpora, it becomes problematic on the WWW. Here, there are so many documents, it is difficult to formulate queries that result in the retrieval of so few documents that query expansion would be necessary, although there is some evidence that using structured queries allows expansion to improve retrieval performance [54].

Another approach to improving the ability of the user to express his or her information need is to allow the query to be expressed in natural language. More precisely, the query is preprocessed in a way that captures semantic or syntactic information otherwise lost in the bag-of-words approach. One way is simply to identify phrases of meaning in the query that have shown improvement when converted to structured queries [23].

More recently, much emphasis has been placed on question answering. Question answering is not simply the task of answering questions, but rather, in the context of IR, it is retrieving the specific information that satisfies the information need of the user [102]. For example, "Who was the first president of the United States?" and "Name the first president of the United States." both express the same information need. As a result, question answering moves IR somewhat away from simple document retrieval and closer to actual information retrieval.
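The thesaurus-based query expansion described above can be sketched in a few lines. The synonym table here is illustrative stand-in data, not a real thesaurus or WordNet lookup:

```python
# Minimal sketch of query expansion: each query term is supplemented
# with related terms from a thesaurus-like table. SYNONYMS is
# hypothetical example data, standing in for a WordNet lookup.
SYNONYMS = {
    "president": ["leader", "head of state"],
    "first": ["earliest", "1st"],
}

def expand_query(terms):
    """Return the original terms plus any related terms found."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query(["first", "president"]))
# ['first', 'earliest', '1st', 'president', 'leader', 'head of state']
```

A real system would typically also weight the expansion terms lower than the original query terms so that they broaden recall without dominating the ranking.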

Question answering research was initially performed in the context of natural language understanding. Systems were designed to show understanding of algebra problems or simple, artificial worlds [102, 109]. Work was also done on translating queries into structured queries for a database [36]. Question answering has been pushed into IR research by the Text REtrieval Conference (TREC). In the 6th conference, TREC sponsored a natural language processing track [103], which was continued as a question answering track in the 8th conference [105]. The TREC question answering task differs from much of the previous question answering in that the system is expected to rely on a large unstructured corpus. The focus of the track has been to push general purpose question answering systems that can handle a variety of question types with increasing focus on information that is not well formed. This has resulted in systems that use a combination of strict pattern matching and natural language techniques [26, 43].

As a result of this shift, the current question answering systems should be considered a special case of traditional IR systems. To see this, consider the process typically used by a question answering system to answer a question. First, the question is expressed in natural language by the user. Then the question is translated into a query that is fed into a document retrieval engine that uses a traditional IR model. The documents that are returned are then processed until a passage is found or an answer is generated that answers the particular question. This is similar to the process of IR, which generally follows the pattern of query translation, matching relevant documents, and ranking the results.

Current question answering systems are fairly adept at answering questions about facts. Questions that require more complex answers such as lists and definitions are less reliably answered. Future research will focus on several areas of question answering that are still open problems, such as categorizing questions, extracting answers across documents, and reasoning [13]. The ability to categorize a question
allows the system to more reliably create queries and extract answers. Creating taxonomies for questions is at the heart of understanding what the language of a question means and is one of the hardest tasks. In many question answering systems, this is done by hand (i.e., the rules for categorizing questions are written by hand and have evolved over time), although more recently there have been attempts to use machine learning techniques to accomplish this task [38, 41]. Extracting answers across documents and reasoning are both necessary as question answering systems attempt to answer harder questions. Those answered by simple facts are typically easy to answer because both question and answer can often be found in a single document. More complex answer types, such as those requiring lists, may require answers that are not in a single document. Finally, reasoning is necessary for questions that have implied rather than directly stated answers in the corpus [13].

1.3 Our Research

This research presents an IR model for researching subjects on the WWW. To do this we focus on a particular type of document: one that is informative on a subject. We use the term informative to mean a document that answers questions about a subject. This is a different task than is typically undertaken by general search engines on the WWW. For example, when a user issues a query on the term "Microsoft", a search engine might retrieve the home page for the Microsoft corporation. We expect a system based on our model, on the other hand, to retrieve a document that answers the most questions about Microsoft. Since home pages of companies typically lack text, it is unlikely that these pages would rank among the top results of a search, but we expect a document such as the Wikipedia page on Microsoft, or another document that contains information about the search term, to rank high.

This type of system has two benefits. First, it is useful when researching a subject, as the top results are guaranteed to have information about the topic, rather than just
containing the search terms. Also, this system should be less susceptible to gaming the results or spam since the ranking relies on how informative documents are. Documents improve in rankings as a result of becoming more informative, rather than by a method that can corrupt the validity of the rankings.

This new model of IR in which documents are retrieved because they are judged to be informative on a subject is achieved by changing the way that documents are indexed. The motivation for this new IR model is a desire to rank documents according to the probability that they are informative on a particular subject. In the probabilistic model of relevance, the documents are ranked according to the likelihood that their terms will exist in the set of all terms over relevant documents. To achieve a distinction between these two models, we believe it is necessary to expand indexes so that they include the concepts as well as the words contained in the documents.

While there have been attempts to index documents using additional semantic information, these attempts have not produced improvements over bag-of-words indexing [21, 31]. There are several reasons for this. In the case of phrases, the additional information only matters if the query contains the phrase. Knowing which phrases are important and useful is a hard problem. To decide which phrases to index, a system might use a statistical analysis of the co-occurrence of phrases in the documents of the corpus. Since this information is already included in the bag-of-words model, the improvement is limited. When semantic phrases are introduced, the improvement is typically small and is limited to queries with particular phrases. Even in those cases the same documents are typically marked as relevant.

We improve on those past attempts by using question answering on individual documents in a manner approximating reading comprehension. When a document is processed, it is indexed on how well it answers certain questions. For example, if a document contains a named entity such as George Washington, the document is indexed on questions that pertain to people. Documents that answer questions such as
when a person was born or for what he or she was famous are then considered more informative on that person. Using question answering for this task has two benefits. First, the work performed on factoid answering has occurred outside the realm of natural language processing. That is, much of the work has been pattern-based and uses machine learning techniques. This should allow us to avoid the brittleness of systems that rely on understanding the language of documents. This is especially important when considering the vast number of different writing styles that encompass the documents on the WWW. Second, the system is easily expandable for additional question types. For example, if we wish to create a search engine that is good at finding information on cars, we can write a set of questions related to cars. "Who manufactured the car?", "What type of engine does it have?", and "How many doors does it have?" might allow the system to distinguish informative pages on particular types of cars.

1.4 Related Work

There have been efforts to change the way documents are indexed. Gonzalo et al. used Wordnet synsets to improve on bag-of-words indexing in the vector model [33]. Woods proposed using something he calls a Conceptual Index [112]. Croft and Lewis attempted to build a system using case frames to understand the underlying ideas in a document [22]. These approaches proposed indexing of documents involving not simply lists of words but also an index in a concept space. Litkowski proposed indexing documents using semantic relation triples [64]. In this approach documents are parsed for the sentences they contain as well as for relationships between entities in the sentence and the objects of the sentence for different question types. These relationships are then stored in triples and used for question answering. Lin expanded on this work and used the relations for a more general web search [62]. Additional related work is discussed in Chapters 2 and 3.

1.5 Study Organization

The organization of this study is as follows. Chapter 2 gives an overview of classical information retrieval models. Chapter 3 gives background on question answering and how it is a specialization of information retrieval. In Chapter 4, we explore evaluation, using our idea of informativeness to create an evaluation system for search engines. This is placed before our new model to emphasize the independent nature of the evaluation system. In Chapter 5 we present our new model of IR and explain how a system based upon this new model works. Evaluation of the systems based on the new model is covered in Chapter 6 using the evaluation system proposed in Chapter 4. In Chapter 7 we conclude and look to further work related to our new model and evaluation method.
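As a rough, hypothetical sketch of the indexing idea described in Section 1.3 — not the actual system evaluated in Chapters 5 and 6 — a document might be indexed by which simple question patterns it appears to answer about an entity. The patterns and the example document below are invented for illustration:

```python
import re

# Illustrative sketch only: index a document by the question types it
# appears to answer. The patterns here are toy stand-ins for the
# fact-based indexing developed later in this dissertation.
QUESTION_PATTERNS = {
    "born": r"\bborn\b.*\b\d{4}\b",                  # when was the person born?
    "famous_for": r"\bknown for\b|\bfamous for\b",   # what is the person famous for?
}

def index_document(text):
    """Return the set of question types the text appears to answer."""
    return {q for q, pat in QUESTION_PATTERNS.items()
            if re.search(pat, text, re.IGNORECASE)}

doc = "George Washington was born in 1732 and is known for leading the army."
print(sorted(index_document(doc)))  # ['born', 'famous_for']
```

Under this view, a document answering more of the questions for an entity would be ranked as more informative on that entity.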

CHAPTER 2
INFORMATION RETRIEVAL MODELS

Information retrieval has grown in importance due to the rapid rise of information produced. With the amount of information available, finding particular pieces manually has become increasingly difficult. One focus of IR, the world wide web, consists of billions of documents. Additionally, there are many more documents produced than are publicly available. The need to access this data drives information retrieval. While there are other types of information retrieval, the task that we refer to as IR is document retrieval. Given a specific information need, a system retrieves documents that it deems relevant to that need. This chapter discusses the steps involved in IR systems and research related to retrieving documents.

Information retrieval models typically consist of three phases: indexing documents, determining their relevance, and ranking them. Document indexing assists us in determining what information will be useful when locating the document later. The decisions made here influence everything else that composes a model. Indexing can be as basic as the assumption that all of the terms in a document are independent, but to increase performance some models take more complex approaches to deciding what to index. For example, in Latent Semantic Analysis (LSA) document terms are transformed into a representation of the concepts that compose the document, and indexing is performed in this new space. In more traditional models, the index includes all the words in the document. In such cases, the document is viewed as a vector of terms and is therefore also known as the bag-of-words (BOW) approach. This view of the index has proved the most popular, primarily because of its performance, but also because it is fast and easily implemented. The traditional information retrieval models that we examine all use a BOW approach, although later in the chapter we address ways to supplement the index with additional information.
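The bag-of-words view described above can be sketched in a few lines; this is a toy illustration (no stemming, stopword removal, or weighting), not the indexing used by any particular system:

```python
from collections import Counter

def bow_index(doc):
    """Treat a document as an unordered multiset (bag) of its terms,
    discarding word order and any syntactic structure."""
    return Counter(doc.lower().split())

doc = "George Washington was the first president"
index = bow_index(doc)
print(index["george"], index["president"])  # 1 1
```

Because only term counts survive, two documents with the same words in different orders produce identical indexes, which is exactly the information loss the BOW assumption accepts in exchange for speed and simplicity.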

The second aspect of a model is matching queries to the index of documents, which establishes their relevance. To accomplish this, we first need to examine how the document is indexed. In the BOW approach this tends to be some sort of distance measure between the query and the index vectors. The model determines the meaning of the distances and thus the documents relevant to the query.

Query matching leads to the third aspect of information retrieval models: document ranking. Since most queries result in the retrieval of more than one document, there must be some way to distinguish between the documents and identify whether a particular document is more relevant than another. Without this ability, the number of documents retrieved may overwhelm a user. Since the goal of IR is to assist the user by reducing or entirely avoiding information overload, the more effective a system is in providing only the information the user desires, the better the system.

Despite this clear-cut definition of an IR model, the reality of actual models is a bit less straightforward. Most of the time query matching and document ranking are a single process. Since it is difficult for a system to say that a document is incontrovertibly relevant, relevance is usually weakened to a measure of probability. Once documents have been evaluated for relevance, they are usually ranked so the documents with the highest probability are returned first. This chapter presents some alternative ranking algorithms for the WWW that could be combined with any of the classical models to create a model consisting of separate indexing, querying, and ranking aspects.

2.1 The Boolean Model

The Boolean model is one of the easiest to understand and implement, due in large part to its lack of both indexing and ranking choice. These very absences, in light of the definition given above, preclude this model from being understood as a true information retrieval model. It would be more accurate to consider the Boolean model instead as a Boolean query with an added index. However, since the term Boolean model is so widely used, it is more convenient for us to follow suit. The contents of a
Boolean index are immaterial as long as a closed world is assumed to exist. That is, every term in the index is assumed to be in the document set, while any word not in the index is assumed not to be in the document set. Additionally, the Boolean model does not accommodate ranking. Despite this, the Boolean model is the starting point for the classical models of IR.

A query in the Boolean model is based upon set theory and Boolean algebra. Because of the closed world assumption, a query is determined to be relevant to a document if all the query terms appear in the document. More specifically, if the logical combination of the query (using the Boolean operators and, or, not, etc.) is true, the document is acknowledged to be relevant. To determine the truth of any term in the query given a document d, consider whether the term appears in the document:

    q_term = 1 if the term is in document d, and 0 otherwise.

Using this test, the logic of any query is determined by combining the values using Boolean operators. Typically, the query is converted into a disjunctive normal form (DNF) or conjunctive normal form (CNF). If q_cc is a conjunctive clause of the query in DNF, the similarity between a query and a document can be given as:

    sim(q, d_j) = 1 if ∃ q_cc such that (q_cc ∈ q) ∧ (g(q_cc) = 1), and 0 otherwise

where g is a function that evaluates to 1 if the conjunctive clause is true. So by this method a Boolean query is defined using Boolean operators. Because of the simplicity of the query, the system is easy to evaluate on any particular query.

A query can also be considered as a combination of set operations. If the given query is "George AND Washington", a document containing the terms {George, Washington, President} would match, but not documents containing the terms {George,
Bush, President}. The Boolean model retrieves the set of documents containing George and those containing Washington, then takes the intersection of those sets to obtain the results for "George AND Washington".

The Boolean model received serious criticism as IR research surpassed it. Cooper [20], for example, illustrates that the Boolean query does not work as English speakers expect. Because the term AND indicates to users that they will receive more results than with just one term alone, he suggests dropping the logical symbols from the query. Some systems, such as Google, have adopted this approach. Another criticism is that the Boolean model cannot distinguish between relevant results and rank them, because every document is either relevant or not. When a query retrieves a large set of documents, users must extend their queries by adding terms that will narrow the results. The converse situation, when no documents are judged relevant, is equally problematic. If a query contains two terms, and documents match one or the other but not both, are not these two sets of documents at least somewhat relevant? What if a document contains a synonym for one of the words and a match for the other? Ideally, the result set would be expanded to include this document, but without some way to rank the resulting set of documents such a maneuver might deluge the user.

2.1.1 The Fuzzy Set Model

Instead of creating an entirely new model, some researchers have proposed modifications to the Boolean model that deal with its problems. The fuzzy set model, proposed by Bookstein [6], allows documents to be ranked using a Boolean query. The ranking is introduced by changing the strict Boolean query to a query based on fuzzy sets. This can be accomplished in two ways. The first is to add weights to the query terms and use fuzzy set theory to relate the queries to the documents. Bookstein allowed the user to specify to what extent each query term was important to the query. Alternatively, fuzziness can be introduced by changing the index from the closed-world assumption to a system indicating the fuzzy membership of the term to the document (or
vice-versa). The index no longer just lists the terms in the document; rather, each term is given a weight indicating its membership in a fuzzy set. The resulting model not only states that the document is relevant to the query but also provides a range of possible values, so a ranking of results is possible. This latter approach is generally the most prevalent, due to the difficult nature of deciding how to weight terms in a query.

There are different ways to build this index. One possibility is to use the term frequency (tf), the number of times a term appears in a document, multiplied by the inverse document frequency (idf), a measure of how good the term is at distinguishing between documents, as is done in the vector model. We focus on just one way of indexing that involves constructing a thesaurus, a term-term correlation matrix, which associates highly correlated words. This approach was suggested by Ogawa, Morita, and Kobayashi [4, 76]. In this matrix, the correlation between any two terms t_i, t_j can be determined by looking at a cell c_{i,j}, which is given by:

    c_{i,j} = n_{i,j} / (n_i + n_j − n_{i,j})

where n_i is the number of documents containing the term t_i, n_j is the number of documents containing the term t_j, and n_{i,j} is the number of documents containing both. The value c_{i,j} will be 1 if the terms only appear with each other, 0 if they never appear together, and in the range [0, 1] otherwise.

Using the correlation matrix, the fuzzy membership of a document d_k for each term t_i is calculated by finding the maximum value in the matrix over the terms of the document:

    μ_{i,k} = max c_{i,l} for all t_l ∈ d_k.

In this way a document term t_i is related to document d_k. The index for any particular document is the vector of μ_{i,k} over all terms. Once an index containing these fuzzy memberships has been obtained, a method is needed to perform fuzzy Boolean queries using the three operations: and, or, not.
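Before turning to the three operations, the correlation matrix and membership computation can be sketched on a toy corpus (the corpus and names below are invented for illustration; documents are modeled as term sets, matching the binary view of the index):

```python
def correlation(t_i, t_j, docs):
    """Term-term correlation c_{i,j} = n_ij / (n_i + n_j - n_ij)."""
    n_i = sum(1 for d in docs if t_i in d)
    n_j = sum(1 for d in docs if t_j in d)
    n_ij = sum(1 for d in docs if t_i in d and t_j in d)
    denom = n_i + n_j - n_ij
    return n_ij / denom if denom else 0.0

def membership(t_i, doc, docs):
    """mu_{i,k}: max correlation of t_i with any term t_l in document d_k."""
    return max(correlation(t_i, t_l, docs) for t_l in doc)

docs = [{"george", "washington", "president"},
        {"george", "bush", "president"},
        {"washington", "state"}]
# A document containing the term gets full membership (a term correlates
# perfectly with itself); other documents get graded membership through
# co-occurring terms.
full = membership("washington", docs[0], docs)    # 1.0
graded = membership("washington", docs[1], docs)  # 1/3, via "george"
```

The graded, non-binary memberships are what make ranking possible in this model.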

These operations are performed using the following set operations:

    and: μ_{A∩B}(d) = min(μ_A(d), μ_B(d))
    or:  μ_{A∪B}(d) = max(μ_A(d), μ_B(d))
    not: μ_Ā(d) = 1 − μ_A(d).

Combining these methods, Boolean queries can be performed on the fuzzy indexes. For the previous example "George AND Washington", take the minimum of the membership values for each document for each of those terms. Since this similarity is no longer only 1 or 0, the documents can be ranked by similarity. Instead of using maximum and minimum, Baeza-Yates and Ribeiro-Neto [4] suggest using algebraic sums and algebraic products, respectively. They suggest that this method would provide a smoother degree of membership, which they feel is more appropriate to IR systems. Either way, this system provides a ranking capability and indexing that is different from the strict Boolean model. There is currently no conclusive research on whether this provides an increase in performance compared to other models.

2.1.2 The Extended Boolean Model

The extended Boolean model is based on the critique that the Boolean model does not distinguish between non-relevant documents (or relevant documents) [87]. To understand this, consider the previous query "George AND Washington". If some document d_j contains both terms, the Boolean query marks it as relevant. If it does not have both terms, it is marked as non-relevant. So the question is, is there a difference between the documents that have one of the two terms and documents that have none?

To decide between these cases, the extended Boolean model measures the distance between the query and the document. To measure distance, the query is converted to an n-space vector where each dimension is Boolean. That is, if a document is mapped into this space, it receives a 1 in each dimension for which it contains the query term for that dimension; otherwise it receives a 0. In the case of a two-term query
consider the point (1, 1), at which the Boolean model finds the query relevant. The points (1, 0) and (0, 1) are cases in which the document has one or the other term, and the point (0, 0) is the case in which the document has neither of the terms. To distinguish between these points, introduce a simple distance from the point (1, 1). To rank documents in order of increasing distance from the relevant point, a distance measure can be used:

    sim(q_and, d) = 1 − √(((1 − q_1)² + ... + (1 − q_n)²) / n)

where q_n is one of the query terms. This intuition works well for AND-queries but does not work for OR-queries. In the case of OR-queries the Boolean query marks documents relevant if either of the terms is present: (1, 1), (1, 0), (0, 1). So instead of measuring from the relevant points, measure from the non-relevant point, (0, 0), and rank the documents in decreasing distance to that point:

    sim(q_or, d) = √((q_1² + ... + q_n²) / n).

This in itself is a fairly simple solution. Taking this further, instead of a strictly Euclidean distance, use the p-norm of the vector. Given a vector x, the p-norm of that vector is defined as:

    ||x||_p = (|x_1|^p + ... + |x_n|^p)^(1/p).

In this case, adjust the similarity measures:

    sim(q_and, d) = 1 − (((1 − q_1)^p + ... + (1 − q_n)^p) / n)^(1/p)
    sim(q_or, d) = ((q_1^p + ... + q_n^p) / n)^(1/p).

Obviously, in the case p = 2 the original Euclidean distances are used. Setting p to other values gives other interesting results. In the case where p = 1 (the Manhattan distance), a similarity measure is
given as:

    sim(q_or, d_j) = sim(q_and, d_j) = (q_1 + ... + q_n) / n.

This measure is the inner product between the query and document vector when the query does not have weights. This measure of similarity is equivalent to the vector model, discussed below. When p approaches infinity:

    sim(q_and, d) = min(q_i)
    sim(q_or, d) = max(q_i).

In this case, the extended Boolean model can function in a manner equivalent to the fuzzy set model as well. This flexibility makes the extended Boolean model interesting, but it has not found much favor in research.

2.2 The Probabilistic Model

The probabilistic model results from the fact, first explicitly noted by Maron and Kuhns, that since a system cannot make a definitive assessment of relevance, it must deal with the probability that the document is relevant to the user [67]. This insight led to Cooper's development of the Probability Ranking Principle.

    HYPOTHESIS: If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data is available for this purpose, then the overall effectiveness of the system to its users will be the best obtainable on the basis of that data. [19]

In such a system, consider relevance a dichotomous criterion variable defined outside the system. The system has some information about this variable, which is essentially the belief that the document is relevant [83]. Alternatively, probability can be viewed as a continuous variable of expected utility [20]. The various interpretations of probability have been shown to be very similar, if not merely special cases of one another [110]. The resulting model is called the binary independence model, due to the simplifying assumptions made to construct it.

As the name of the model implies, two assumptions define this model: the binary assumption and the independence assumption. The binary assumption is another name for the closed-world assumption made in the Boolean model. The index of a document in the probabilistic model includes each term that appears in the document. The independence assumption states that the appearance of term k_i in document d_j does not predict the appearance of term k_x. This assumption is made for mathematical reasons that are demonstrated below.

Once these assumptions are made, it is necessary to define the probability that a document is relevant to the query. Given the criterion variable, define the similarity as

    sim(q, d_j) = P(R | d_j) / P(R̄ | d_j)

which is the ratio of the probability of relevance R, given the document, over the probability of non-relevance R̄. Since this is what the model will estimate, transform it using Bayes' rule:

    sim(q, d_j) = [P(d_j | R) / P(d_j | R̄)] · [P(R) / P(R̄)].

The factor P(R)/P(R̄), which is the ratio of relevant to non-relevant documents in the collection, is the same for all documents, so it is ignored. The resulting similarity measure is interpreted as the probability, given the set of relevant documents, of selecting document d_j, over the probability of selecting it given the set of non-relevant documents. This still gives a good approximation of the original goal. If the exact set of documents not relevant to the query is known, and d_j is in that set, then P(d_j | R) will be 0. So, given the sets of relevant and non-relevant documents for a query, this ratio would give a non-zero value only if the document was relevant.

Since the sets of documents are still unknown, it remains necessary to decide which documents are relevant while also estimating the probability of choosing a document from the relevant set. Using the independence assumption, an estimate of the
probability of selecting a document given the relevant set is:

    sim(q, d_j) = [∏_{g_i(d_j)=1} P(k_i | R) · ∏_{g_i(d_j)=0} P(k̄_i | R)] / [∏_{g_i(d_j)=1} P(k_i | R̄) · ∏_{g_i(d_j)=0} P(k̄_i | R̄)].

Here P(k_i | R) is the probability that, given a random document from the relevant set, it contains the term k_i, and P(k̄_i | R̄) is the probability that, given a random document from the non-relevant set, it does not contain the term k_i. This allows us to use the sets of relevant and non-relevant documents to estimate the probability that the document is relevant to the query.

To estimate the set of relevant documents, Robertson proposed having an interactive session with the user during which, as documents are retrieved, the relevant ones are marked [83]. In this way, the model would move closer and closer to matching the actual parameters of the ideal set. Baeza-Yates and Ribeiro-Neto suggested instead that the system assume the documents that have been retrieved so far are relevant, and all the documents not retrieved are irrelevant [4]. In this way the model could learn without human assistance. It is possible to use the set of documents that are seen as relevant, S, in place of the actual relevant set, to estimate the probabilities P(k_i | S) and P(k_i | S̄) needed to rank documents.

The primary benefit of the probabilistic model is its theory, which has been determined to be sound. It also allows us to state that the documents are ranked based on their probability of usefulness (or relevance) to the user. The binary independence model still suffers from the independence assumption, from the need to make an initial guess at the parameters, and from ignoring the intra-document frequency of words.

An alternative probabilistic model is the inference network model developed by Turtle and Croft [96]. The inference network model is based upon Bayesian inference networks, which are directed, acyclic graphs. These graphs have nodes containing variables and directed links between them that indicate the belief that some variable influences another variable. In a node, each of the incoming links is represented by a link
matrix that represents P(x | y) for all combinations of the states of the variables. In the case of the inference model, the document nodes have a causal relationship with the term nodes. The document node represents an observation of that particular document. Each document node has directed arcs to the term nodes, indicating a strengthening in the belief of those terms. Internally, each term node contains a link matrix, which can be viewed as the probability that the node is 1 given the set of the states of all the documents that point to it. Each of these 2^n states, where n is the number of documents, represents a subset of the documents being observed while the rest are not.

To obtain an answer from the network, queries are attached to the model of the documents and terms. When adding the queries, additional nodes are inserted into the graph for the query and any additional forms needed for the query (i.e., a Boolean combination of the terms). In this manner more complex queries are created by attaching multiple simple query nodes. The arcs from the term nodes to the query nodes increase belief in the query nodes. The last node in the model represents the user's information need and is connected by arcs from the query nodes.

In this model, documents are ranked by P(d_j ∧ I), which is the probability that document d_j is observed and the information need of the user, I, is met. Given that k represents the n-dimensional link matrix, assume that X is the set of all possible states of k. The probability of both d_j and I being true is then given by:

    P(d_j ∧ I) = P(I | d_j) P(d_j)
    P(I | d_j) = Σ_{k ∈ X} P(k | d_j) P(q | k).

Assuming that the factors that compose k are independent, the probability P(k | d_j) can be given as:

    P(k | d_j) = ∏_{k_i=1} P(k_i | d_j) · ∏_{k_i=0} P(k̄_i | d_j).

Given this definition, it has been shown that by estimating the probabilities in certain ways the inference model can approximate the Boolean and tf-idf models. The tf-idf
model is similar but not the same as the vector model due to the probability P(k | d_j). Since this probability depends on the document and varies from document to document, the rankings are not general. Despite this, the inference model has been shown to give good performance, since it consistently combines evidence from distinct sources [4].

2.3 The Vector Model

The final classical information retrieval model is the vector model, based on the BOW indexing scheme, in which a document is modeled as a vector of terms. The entire vector consists of all the terms in the corpus. Each document is modeled by a vector in the document space that contains a positive weight for each term appearing in the document, or a 0 for those that do not. This space is very similar to those in machine learning problems, and, as a result, the vector model has been the basis of many machine learning techniques used in IR, such as clustering. Salton, Wong, and Yang give clustering as one of the main motivations for this model, arguing that similar documents should be close together, or clustered, in the document space [90].

The desire to have similar documents close together motivates the ranking function for queries. Since the query can be modeled as just another vector in the space, the angle between the query and each document can be measured to determine their closeness. Obviously, if similar documents are close to each other in the document space, and if a query is close to one of these documents, the other documents will be close to that query as well. This clustering works well with a user's intuition of the problem and makes the model useful.

The heart of the vector model involves determining how to index pages so that similar pages are clustered together. Using a document space in which the dimensions are {and, the, dog, cat, a}, the following very small documents illustrate the difficulty of this problem: d1 = {and, the, dog}, d2 = {and, the, cat}, d3 = {a, cat}. Representing the documents by the index vector in which all terms that appear in a document have a weight of 1, the documents are d1 = {1, 1, 1, 0, 0}, d2 = {1, 1, 0, 1, 0}, d3 = {0, 0, 0, 1, 1}. Using the assumption that
document vectors that are similar will be close together, we can use a dot product:

    sim(d_j, d_k) = d_j · d_k.

This similarity measurement finds documents d1 and d2 having a similarity of 2. Documents d2 and d3 have a similarity of 1, and documents d1 and d3 have a similarity of 0. Since documents on cats are expected to be the most similar, this is not an ideal clustering.

One way to resolve this faulty clustering might be to use a stopword list, a list of common words that should be ignored in the results. Using this list, we might ignore and, the, and a, giving a document space of {dog, cat}. In this space the document vectors are d1 = {1, 0}, d2 = {0, 1}, d3 = {0, 1}. In this space d2 and d3 have a similarity of 1, while there is a similarity of 0 between d1 and the other two documents. Stopwords and their implications in IR are discussed further in Section 2.4.2.

Alternatively, information about the frequency of the words could be used. Rather than eliminating the words using a stopword list, a weighting is used so that common words affect the clustering very little. Inverse document frequency is popular for clustering documents, since it gives a low weight to terms that do not distinguish between documents. This weight is generally computed as idf_i = log(N / n_i), where N is the number of documents in the collection, and n_i is the number of documents in which the term occurs [52]. Assuming that the documents are part of a larger collection in which the terms a and the are very common, the similarity will now change so that document d1 is farther away from documents d2 and d3.

In addition to different documents being separated, similar documents should be placed closer together. Typically, this is done using the normalized intra-document term frequency. The term frequency is calculated as the number of times a term appears in a document over the number of times the most frequently occurring term appears in document d_j: tf_j(i) = freq_{i,j} / max_l freq_{l,j}. Because similar documents tend to use terms that are specific to the subject more often, they tend to cluster in the document space. Typically,
the intra-document and inter-document frequencies are combined through the product tf × idf. This has been shown to be related to the term relevance weight derived in the probability retrieval model [85].

One further consideration of this model is to prevent biasing the similarity toward long documents. Longer documents have a better chance of being matched against a query due to the increased likelihood that they contain words found in the query. Normalizing the inner product removes this bias:

    sim(d_j, d_k) = (d_j · d_k) / (||d_j|| ||d_k||).

This is equivalent to finding the cosine of the angle between the two vectors. To find the similarity between a document and a query, create a vector for the query. The weights used for this vector are:

    w_{i,q} = (0.5 + 0.5 · tf_q(i)) · idf_i.

Since most queries have a tf of 1 for all the terms in the query (i.e., the query does not contain repeated words), this is effectively the idf of the terms. Other weights can be used, but Salton and Buckley found these were the most effective [85].

The vector model is popular because of its ease of use and effectiveness, but it lacks the theoretical basis exhibited by the probabilistic model. The underlying idea of documents being close together if they are similar is a good machine learning technique, but there has not been proof that the distances given by the weights above are actually related to documents being relevant to a query. Despite this, the vector model is believed to outperform the probabilistic model in general collections [4].

2.3.1 Latent Semantic Analysis

Latent Semantic Indexing is based upon the assumption that there is some underlying information that traditional IR fails to recognize. This latent structure is obscured by the noise caused by random word choice. In particular, synonymy and polysemy are believed to cause problems in other IR systems. Synonymy refers to the
fact that there are many different ways to refer to the same object. Polysemy refers to the problem that most words have more than one distinct meaning or usage. In general, the problem of synonymy causes recall to suffer. Because the person constructing a query might use a different term to refer to an object than does the writer of the document, the system might not find a document to be relevant when just checking to see if the term appears in the document. Polysemy, on the other hand, generally causes precision to suffer. Due to the slight variations of meaning in terms used in the query, a document might match the term but use it in a way that is different from the original meaning of the query.

Latent Semantic Analysis (LSA), or Latent Semantic Indexing (LSA applied to indexing), attempts to remove the noise by reducing the documents' representations to the latent structure. Deerwester et al. approached this problem by creating a model in which both the documents and terms are in the model space [28]. Documents are considered centroids of the cluster of terms they contain. This also works for queries, as they can be centroids of the terms that they contain. To find a similarity between documents, the distance between documents is measured in this space.

In practice this is modeled by a d × t sized matrix, where d is the number of documents and t is the number of terms in the document set. All the documents in the document set are represented as vectors, in the same way that indexing is done in the Boolean model, and are combined together into the columns of the matrix. To determine the latent structure, the singular-value decomposition (SVD) of this matrix is found. The purpose of the SVD is to develop an orthonormal basis that reflects the structure of the concept space. This factor analysis allows us to decompose the original matrix into matrices representing the orthogonal vectors of the terms and the documents. Given that X is the original association matrix, the SVD gives:

    X = T S D^t
where T is the matrix of eigenvectors related to the term-to-term correlation, S is the matrix of singular values, and D is the matrix of eigenvectors of the document-to-document correlation.

A judgment must now be made as to how many dimensions are desirable in this new space. Think of these new dimensions as the classes associated with the different terms. The reduction of indexing allows the grouping of terms into relationships between documents that are strictly meaningful [27]. This reduces the noise introduced to the vector model by synonymy. To reduce dimensions, the t dimension is collapsed to the new length, n, chosen by deleting the smallest values in the S matrix. Like eigenvalues, the singular values indicate the amount of spread in a particular dimension. When the low values are removed, little information is lost. Additionally, the corresponding rows and columns are removed from the T and D matrices. Now the original X matrix can be reconstructed without creating an exact copy:

    X̂ = T_n S_n D_n^t.

Using this approximation of the X matrix, the similarity measures between documents and terms can be calculated in the concept space instead of in the term-document space. In the concept space, terms are related by taking X X^t = T S² T^t. In this matrix the (i, j) cell corresponds to the similarity between the i-th and j-th terms. This matrix gives the ability to construct a thesaurus of similar words. Document similarity is given by X^t X = D S² D^t. In this case the (i, j) cell corresponds to the similarity between the i-th and j-th documents. The clustering assumption of the vector model holds here as well: in this model, similar documents are expected to be clustered together in the concept space. To query the system, the query is added to the original document space. This new X_q can be multiplied by T S^(−1) to give us D_q. Using this matrix, the documents can be ranked by similarity to the query pseudo-document.

There are a couple of objections to the LSA method. Firstly, there is no way to control what the concept spaces represent. A simple illustration of this involves a
document d = {car, truck, house}. After fitting the document to the concept space, the following dimensions d = {{1.3·car + 0.5·truck}, house} might be observed. Looking at this document in the new concept space, there is an impression that the new dimension is related to the idea of "vehicle", but it is not necessary for this to be the case. Computationally, if combining truck and house together better represented the model in the concept space, it would have been done. This might make sense without a natural language understanding of the concept space, but in light of the meanings of the words this seems less than ideal.

An alternative to the linear algebra-based LSA is to perform probabilistic LSA. This approach has a good statistical basis and is generally approved [46]. Probabilistic LSA models the concept space as concepts in which each term has a probability of being. This may also be a better way to understand and model concepts, as it seems closer to their natural language understanding.

A second objection to the LSA model is purely computational. With an extremely large document and term space, the calculations become computationally intractable. Additionally, each query has to go through multiple matrix multiplications to provide a result. Although this does not mean that LSA is not a good solution, it does limit its usefulness in real-world situations.

2.3.2 Neural Network Model

Neural networks have been used in multiple ways for information retrieval. One such model was proposed by Kwok [58]. In this model the network consists of three layers. The first layer is associated with the query, and each node represents a term in the query. In the next layer, the index term layer, each node represents one term in the vocabulary. The final layer represents documents in the corpus. While Kwok based his neural network model on the probabilistic model, the neural network model is more generally based upon recognizing patterns in the vector space.

In this layout some level of learning is performed when initially building the network. Specifically, the network adapts the weights of the connections between the index nodes
and the document nodes as documents are added to the network. Query and document nodes are treated similarly. Both have a connection from themselves to the term nodes of which they are composed. So if a query contains three terms, three links from the query node to each of the three term nodes exist to represent those terms. The initial weights of those links are given by:

    w_{ja} = tf_{aj} / L_j

where this weight, for the connection between document d_j (which could also be a query) and the node for term a, is the term frequency of term a in document d_j divided by the norm of d_j. In addition, from the documents and queries to the term nodes, there is a connection in the opposite direction. Having connections in two directions allows for spreading activation, so that if enough documents trigger a term that was not in the original document, it too can fire and in turn trigger other documents. This weight is:

    w_{aj} = log(p / (1 − p)) + log((N_w − F_a) / F_a)

where p is a small positive constant, N_w is some large constant, and F_a is a function of the document frequency (the number of documents that contain the term). As each document is added, the weights into the index terms from the document nodes must be updated. Without any additional learning, queries can be added to the system by creating query nodes and attaching them to term nodes. These weights are given as:

    w_{qa} = tf_{aq} / L_q

which is the same connection as the documents. Here the term frequency tf_{aq} is the frequency of term a in query q, and that is divided by the norm of the query. Used together, a document's value is calculated as:

    S_j = Σ_{a=1}^{n} w_{qa} w_{aj}

which is equivalent to the vector-based tf-idf score. Where this model improves upon the traditional vector model is that the relevance can spread to additional documents through the links from documents back to the term nodes. This allows documents that are similar to the initial relevant documents to be considered relevant even though they might not contain the initial terms. While this is a useful model in itself, Kwok also proposed updating the weights in an incremental manner by activating a document, stimulating the network, and updating the weights between the document and the terms that cause it to be considered relevant. While a single document will result in the initial weights, using multiple documents allows the model to approach a probabilistic-model estimation of P(t_a | R), the probability that term a appears in a random document in the relevant set.

This method of using neural networks has not been widely explored on large document sets. One of the main drawbacks is the time it takes for the query to settle. Due to the lack of research, there is no real evidence to determine whether this model works better than others. The ability to find documents that have no overlapping terms with the query is interesting, but without further work it is hard to determine the worth of the model. Additional work has been done applying support vector machines and morphological neural nets and has shown much promise [74, 82].

2.4 Techniques for Index Manipulation

In addition to the models that have been developed for IR, there are also many standard techniques and heuristics for making data more manageable. Two of these are stopword removal, in which terms are removed from the index, and stemming, in which the index is compressed by combining terms into their roots. Although these changes are generally discussed in relation to improving performance, the usual result is a change of the document space. Both methods make semantic choices about what portion of words is important to the real meaning of terms.
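As a toy illustration of the two techniques (the suffix rule below is a crude stand-in, not the full Porter algorithm, and the stopword list is invented for the example):

```python
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to"}

def crude_stem(word):
    """Toy stemmer: strip a trailing -ing and collapse a doubled
    letter, e.g. "trimming" -> "trimm" -> "trim"."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]
    return word

def index_terms(text):
    """Stopword removal followed by stemming."""
    return [crude_stem(w) for w in text.lower().split()
            if w not in STOPWORDS]

index_terms("the stemming of terms and the trimming of an index")
# -> ['stem', 'terms', 'trim', 'index']
```

Both steps shrink and reshape the document space: stopword removal drops dimensions outright, while stemming merges several surface forms into one dimension.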

Another area of research in indexing has been whether the independence assumption can be dropped. These approaches result from the assumption that a semantic meaning exists in the documents which indexing the terms alone does not capture. It is obvious that this is the case since, if the terms in a document are read without taking into account the context, order, sentences, and additional language cues provided, what is being communicated is not understood, although it may be possible to glean the topic of the document. What is less obvious is how to leverage this intuition into an information retrieval system.

2.4.1 Stemming

Stemming is the process of reducing words to their stems and is performed by using a stem dictionary, n-grams, or removal of the affix. Several algorithms have been suggested for the removal of the affix [32], but the Porter algorithm, a set of rules that properly removes most affixes, is the most widely used [79]. An example of this is the word "stemming". Using the Porter algorithm, first strip away the -ing affix, leaving "stemm". Then another rule results in the dropping of the double "mm", replacing it with a single "m", leaving just the stem, "stem". This method works well, and with many implementations widely available, the Porter algorithm has become very popular.

The benefit of stemming is that the dimensionality of documents is reduced. Once a term is reduced to its root, no mismatches between the word and its plurals or some other form of the word with the same meaning remain. Whether or not this provides much benefit to retrieval is an open question. Salton, while developing the SMART system, found that a thesaurus outperformed stemming [88]. Using a more conservative dictionary-assisted trimmer, Strzalkowski and Vauthey found they could improve on SMART's precision by 6 to 8% [93].

One surprising example of stemming providing better results is in a paper on LSI [28]. In comparing their system's results with a vector model system that used stemming for the indexing of documents, the authors noted that the difference in performance
between the systems was likely due to the fact that the authors did not use stemming. When they later stemmed the terms, they found that the two systems produced similar results. This example demonstrates the difficulty in comparing results when research is produced with slight tweaks to the actual document set. Since the LSI system is supposed to eliminate noise due to word choice, it is surprising that stemming words produces a better result.

2.4.2 Stopword Removal

Another technique to reduce the number of indexed terms and the amount of processing that must be done is to remove non-informative words. Non-informative words are usually taken to be words with a low idf, or certain types of words such as articles. Since these words do not help distinguish between documents, including them in the index will not improve recall. Since the words removed do not contain much information, the process is considered safe. However, the number of words that can be removed is small relative to the overall number of terms in the entire corpus, so the performance benefits are likewise small. Despite the limited benefit, this technique is common to many IR systems and can include hundreds of stopwords [4].

An alternative to generic stopword lists was proposed by Wilbur and Sirotkin [108]. Instead of using generic stopwords, this technique identifies words that are not informative in the corpus. Instead of using idf for weighting words, a new scheme was devised that gives terms a higher weight if they occur in pairs of related documents rather than in pairs of unrelated documents [115]. Since stopwords typically have a low idf, they are factored out of systems that use the idf as part of the weight.

The argument against using stopword removal can be shown by searches such as "to be or not to be". In this particular case, a typical stopword list leaves us with just "be". This does not seem to be a persuasive argument under the vector model or any model that ends up weighting terms using idf. Since these terms generally have such a low weight, they are effectively removed from the query normally.

2.4.3 Indexing Using Phrases

The use of phrases was introduced early in IR research with the SMART system, which used phrases as a way to retain some of the semantic meaning of the document [89]. The SMART system is derived from the intuition that some words are not independent but rather contain meaning based on the words around them, e.g., "White House". Neither "white" nor "house" contains special meaning, but many people recognize that the phrase which combines them does have special meaning. Since the BOW approach to indexing assumes that all the terms are independent, the phrase would be lost. One way to utilize this instinct is to index phrases along with single terms.

Putting phrases into the index involves identifying which terms are parts of phrases and which are not. Two types of phrases are considered. The first is the statistical phrase, composed of words having a high rate of occurrences and co-occurrences. Such words are typically in close proximity to one another. The other type of phrase is the syntactic phrase, which is similar to the statistical phrase and uses the words that occur in the same document with a high frequency. In addition, though, a syntactic phrase follows any syntactic constraint on words in the phrase, such as ADJ NOUN. This constraint would handle our original case of "White House", as it is an adjective followed by a noun.

The research on using phrases as index terms has not been very conclusive in determining whether the phrases improve retrieval. Fagan found that statistical phrases had some use but syntactic phrases did not [31]. Croft and Das, on the other hand, found that neither method provided benefit over the best single word relationship [21]. Croft et al. concluded that the relationship of phrases to indexing terms is not well understood, because it is not clear whether the phrases should be indexed or treated as relationships between indexed words [23].
2.4.3 Indexing Using Phrases

The use of phrases was introduced early in IR research with the SMART system, which used phrases as a way to retain some of the semantic meaning of the document [89]. The SMART system is derived from the intuition that some words are not independent but rather contain meaning based on the words around them, e.g. "White House." Neither "white" nor "house" contains special meaning, but many people recognize that the phrase which combines them does have special meaning. Since the BOW approach to indexing assumes that all the terms are independent, the phrase would be lost. One way to utilize this instinct is to index phrases along with single terms.

Putting phrases into the index involves identifying which terms are parts of phrases and which are not. Two types of phrases are considered. The first is the statistical phrase, composed of words having a high rate of occurrences and co-occurrences. Such words are typically in close proximity to one another. The other type of phrase is the syntactic phrase, which is similar to the statistical phrase and uses the words that occur in the same document with a high frequency. In addition, though, a syntactic phrase follows any syntactic constraint on words in the phrase, such as ADJ NOUN. This constraint would handle our original case of "White House," as it is an adjective followed by a noun.

The research on using phrases as index terms has not been very conclusive in determining whether the phrases improve retrieval. Fagan found that statistical phrases had some use but syntactic phrases did not [31]. Croft and Das, on the other hand, found that neither method provided benefit over the best single word relationship [21]. Croft et al. concluded that the relationship of phrases to indexing terms is not well understood because it is not clear whether the phrases should be indexed or treated as relationships between indexed words [23].
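The statistical-phrase idea can be sketched as plain bigram counting over adjacent words; the frequency threshold and whitespace tokenization here are simplifying assumptions, and a syntactic phrase detector would additionally check part-of-speech constraints such as ADJ NOUN.

```python
from collections import Counter

def candidate_phrases(documents, min_count=2):
    """Collect adjacent word pairs across documents; pairs that co-occur
    often enough are candidate statistical phrases."""
    bigrams = Counter()
    for doc in documents:
        words = doc.lower().split()
        bigrams.update(zip(words, words[1:]))
    return {pair for pair, n in bigrams.items() if n >= min_count}

docs = ["the White House said today",
        "reporters outside the White House",
        "a white picket fence"]
print(candidate_phrases(docs))  # includes ('white', 'house')
```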
Due to these mixed results, indexing does not generally account for phrases. Currently, the research on phrases has moved into the realm of queries and question answering.

2.4.4 Other Types of Indexing

The basic assumption of the three classical models of IR is that index terms are independent. As a result, most systems use the BOW approach. LSA addresses this by reducing the representation of the document to one of lower dimensionality. Another way of representing the document is by using the word sense or synsets instead of the actual words. An example of this approach is given by Gonzalo et al. [33]. They took a collection of texts that were hand coded with the WordNet word sense and tested the difference in performance among standard vector modeling using BOW, indexing using word sense, and indexing using synsets. They found that, although using synsets provided better performance, performance depended on the ability to disambiguate between word senses. With an error rate of 30% in the word sense disambiguation, the performance was worse than the standard indexing. To improve performance, Mihalcea and Moldovan developed an algorithm that can distinguish between 55% of nouns and verbs and is 92% accurate [68]. Using this algorithm they created a hybrid index (using synsets for the words for which they could give synsets) that showed improvement over the BOW approach. This approach is promising, but due to the intense nature of the processing it is not yet usable in a practical system.

2.5 Techniques for Queries

Relating queries to documents has been convoluted with ranking. This is seen clearly in the probabilistic model, where each document has a probability of being relevant to the query. Unless documents have a probability of 0, they are considered at least somewhat relevant to the query. Since knowing which documents are relevant is related to ranking, research on query improvements has focused on a user's interaction with the system to improve ranking rather than making a binary decision of relevance.
Two notable exceptions to this rule are query expansion and question answering. Query expansion is the process of adding to the scope of the query to overcome the problems of synonymy. Question answering is a way of allowing users to specify the information that they desire in such a way that the query engine can not only determine the relevance of the document but also retrieve passages that seem to answer the question.

Relevance feedback is related to user feedback. The query engine receives feedback from the user's input as to whether he or she considers the documents returned to be relevant or not. The query engine can then make better judgments on the relevance of documents. A good example of this type of system is the probabilistic model. As documents are retrieved, the user can mark them as relevant or not relevant. This improves the model predictions of the probability of relevance for the additional documents.

2.5.1 Query Expansion

The simplest query expansion is to use a thesaurus to add synonyms of the query terms back into the query. This approach is analogous to indexing using a thesaurus or WordNet. This situation, however, lacks context for deciding how to expand the query. As a result, while query expansion can improve recall, it tends to lessen precision. One method that may improve the performance for this type of expansion is simply to choose the most common meaning (or sense) of a term for expansion, unless the other words in the query imply otherwise [98]. This query expansion method is generally not used unless the query results in very few documents.

Shah and Croft approached the problem of decreased recall with query expansion guided by Cronen-Townsend's technique of finding a query clarity score [24, 91]. The clarity of a word is given as:

    clarity score = \sum_{w} P(w \mid Q) \log_2 \frac{P(w \mid Q)}{P_{col}(w)}
where w is a word in the vocabulary V, Q is the query, P(w|Q) is the probability of the term given the query, and P_{col} is the relative frequency of the word over the entire collection. This is the Kullback-Leibler divergence between the query and collection language models. P(w|Q) can be estimated by sampling documents returned by the query:

    P(w \mid Q) = \sum_{D \in R} P(w \mid D) \prod_{q \in Q} P(q \mid D).

Given that P(D|Q) by the Bayes method is P(Q|D), where P(D) is uniform if the document is relevant and 0 if not (that is, there are no query words in the document), this gives a clarity score based on the probability of words given a document. Both cases estimate the probability of a term, given a document, as:

    P(t \mid D) = \lambda \, tf(t) + (1 - \lambda) P_{col}(t)

where \lambda is a smoothing weight. This formula states that if a term has a high probability of being related to the query, beyond the normal likelihood of it appearing in just any document, the word is that much clearer. For example, the word "war" might have a fairly high frequency across the documents, but if it shows up with even greater frequency in documents that are relevant to a query "Verdun 1916" then it would be considered a high clarity word.

Shah and Croft showed that by removing words with low clarity scores and retaining words with high clarity scores, they could improve the precision of queries. In addition, they found that words falling between the high and low clarity scores (based on observation) could be given limited expansion using WordNet, which also increased precision. Although this seems a promising solution, this method's dependence on the runtime query score somewhat lessens its appeal. To estimate P(w|Q) you must know the set of relevant documents for the query. Despite this drawback, the research shows that limited expansion can increase performance.
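A toy version of this clarity computation might look like the following; the smoothing weight `lam` stands in for the \lambda above, its value is an illustrative assumption, and the sampled documents are assumed to be drawn from the collection.

```python
import math
from collections import Counter

def clarity_score(query, sampled_docs, collection, lam=0.6):
    """Sketch of the query clarity score: the KL divergence between a query
    language model (estimated from documents matching the query) and the
    collection language model. Documents are lists of tokens; sampled_docs
    are assumed to be drawn from the collection."""
    counts = Counter(w for d in collection for w in d)
    total = sum(counts.values())
    p_col = {w: n / total for w, n in counts.items()}

    def p_t_d(t, doc):
        # P(t|D) = lam * tf(t) + (1 - lam) * P_col(t)
        return lam * doc.count(t) / len(doc) + (1 - lam) * p_col.get(t, 0.0)

    # P(w|Q) ~ sum over sampled docs of P(w|D) * prod over q of P(q|D)
    p_wq = Counter()
    for doc in sampled_docs:
        weight = math.prod(p_t_d(q, doc) for q in query)
        for w in set(doc):
            p_wq[w] += p_t_d(w, doc) * weight
    norm = sum(p_wq.values())
    return sum((p / norm) * math.log2((p / norm) / p_col[w])
               for w, p in p_wq.items())
```

A word such as "war" that is much more frequent in the sampled documents than in the collection at large pushes the score up, matching the intuition described above.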
2.5.2 Relevance Feedback

The basic premise of relevance feedback is that the user would like the ranking of documents to change so that relevant documents returned later (i.e. ranked lower than they should be) are moved higher in the rankings based on the relevance of documents previously returned. There are several ways to do this, including the previously discussed method from the probabilistic model. As documents are retrieved, either through user feedback or through relevance decisions made by the system, the model moves closer to an actual prediction of the true relevant set. Relevance feedback consists of using information pertaining to the relevance of documents retrieved and adjusting the model accordingly.

The vector model does not make it immediately apparent how the relevance of a document that has been retrieved can be of use. Unlike in the probabilistic methods, there are no probabilities that can be updated with this new information. Instead, the underlying idea of the vector model, that similar documents should be clustered together in the vector space, is used. Clustering allows the movement of the query vector closer to the documents that are relevant and farther away from documents that are not relevant. There are three similar methods of accomplishing this through adjusting the query vector: Rocchio [84], Ide regular, and Ide dec-hi [50]. The Rocchio method is based upon the optimal query given by Rocchio for multiple relevant documents:

    Q_{opt} = \frac{1}{n} \sum_{D \in R} \frac{D}{|D|} - \frac{1}{N - n} \sum_{D \in \bar{R}} \frac{D}{|D|}

where n is the number of relevant documents in a collection of N documents. Since the entire set of relevant documents is unknown at query time, replace that requirement using the documents judged to be relevant or non-relevant. By combining that with the old query, a new query is generated:

    Q_{i+1} = \alpha Q_i + \beta \sum_{D \in R} \frac{D}{|D|} - \gamma \sum_{D \in \bar{R}} \frac{D}{|D|}
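A minimal sketch of this update for dense term-weight vectors follows; the alpha, beta, and gamma values are illustrative assumptions, not values taken from the literature.

```python
def rocchio_update(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """One Rocchio step: scale the old query by alpha, add beta * D/|D| for
    each judged-relevant document, subtract gamma * D/|D| for each
    judged-non-relevant one. Vectors are lists of term weights."""
    def unit(v):
        length = sum(x * x for x in v) ** 0.5
        return [x / length for x in v] if length else v

    q = [alpha * x for x in query]
    for d in relevant:
        q = [qi + beta * di for qi, di in zip(q, unit(d))]
    for d in non_relevant:
        q = [qi - gamma * di for qi, di in zip(q, unit(d))]
    return q

# Drift a one-term query toward a relevant document's terms:
print(rocchio_update([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]))
# [1.0, 0.75, -0.15]
```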
Thus, rather than completely ignoring the old query, the query is weighted by the documents known to be relevant according to the user. In this way the query slowly drifts towards the relevant set in the vector space. \alpha, \beta, and \gamma are all weights between 0 and 1 that influence how much reliance is placed on each part of the past information to generate the new query.

The Ide regular and Ide dec-hi methods are very similar both to Rocchio and to each other. The main difference between the Ide methods and Rocchio is that, in the Ide methods, the document vectors are not normalized when combined with the query vector. Ide regular is the same as Rocchio, but does not include weights:

    Q_{i+1} = Q_i + \sum_{D \in R} D - \sum_{D \in \bar{R}} D.

Ide dec-hi uses all the relevant documents, but it only uses one of the non-relevant ones. That document is the highest ranked or earliest retrieved non-relevant document:

    Q_{i+1} = Q_i + \sum_{D \in R} D - D_r

where D_r is the highest ranked non-relevant document. All three are considered to give similar results, although in the past Ide dec-hi was considered better [4, 86]. Ide dec-hi is at least more efficient in that it considers fewer non-relevant documents when updating the query. In general, the simplicity of the system and the good results make relevance feedback a good choice for vector model based systems.

2.6 Techniques for Ranking

With the rise in the number of documents on the WWW, the number of documents that can be considered relevant to a query has risen as well. As a result, the amount of information that a user is presented is overwhelming. Studies have shown that a majority of search engine users stop examining results after the third page [51]. While this argues for very high precision searches, it also indicates that the basic model for determining relevance does not work well in large corpora. One possible improvement is
to use the hyperlinks introduced on the WWW as additional semantic information about the pages. Two of the most popular algorithms for ranking using this information are PageRank and Hyperlink-Induced Topic Search (HITS).

2.6.1 PageRank

PageRank, a ranking algorithm developed by Brin and Page [77], is the basis of the Google search engine [9]. The basic premise of this algorithm is that there is semantic information in the links between web pages. This informational need is modeled by a random walk across the WWW. In this model, the system needs to know the likelihood that some page on the Internet is encountered by this random method. This is calculated by starting at one page from a set of high quality pages. This high quality page is determined in advance, and the choice of the page can influence the results. After choosing the page, a random walk down its links is started. Occasionally, the walk resets to a high level page. The more often a page is encountered, the higher it ranks in the results of a search. The result is a popularity ranking where the more popular a page is (or the more central a page is to our graph) the more likely it is to be a high quality page. Thus the semantic information of the links that PageRank uses is a vote for the reliability of the information of the page and, indirectly, the pages that the current page links to as well.

The result of this model is the PageRank algorithm. The PageRank of any particular document is the sum of the contributions of all the pages linking to it. The contribution one page makes to another page's rank is the page's PageRank score divided by its number of links. In this way a page divides its vote equally among all the pages it recommends.

An example of how this might work involves research papers. To assign rankings so seminal works are rated the highest, weights are assigned to one of several important papers. Next consider all of the papers cited by the first set and continue to expand the set by adding additional citations of new papers. Eventually, the papers that are
the most likely candidates to be important will be indicated by their high in-degree of citations, a method that is similar to searching by hand. This shows that content itself is not relevant in determining what is important on the web in this method, but rather the number of people (pages) who believe it is important. This leads to a problem, because there is no need to have the content of a page match a particular query, but rather to convince enough people to link to the page. To avoid this, pages are only considered if they contain all the search terms.

There are two ways to evaluate this algorithm. The easiest is to assume the most popular pages will be ranked highest. Once pages have been marked as relevant using a Boolean query, the most popular pages are ranked highest. This simple method has proven surprisingly effective. Alternatively, this algorithm can be viewed as giving the most central pages the highest weight. Although the result is the same either way, this assumes that the pages that are more central to the WWW graph given by the links are more likely to be relevant to whatever the user desires.

2.6.2 Hubs and Authorities

The HITS algorithm is based on a model of the Internet of hubs and authorities developed by Kleinberg [55]. The model makes the assumption that there are two types of pages that are important on the WWW. The first, authorities, contains the information that a user desires. When a query is generated, Kleinberg reasons that the person doing the search is looking for the highest authority on the subject. What bestows authority upon a page is that many pages point to it, much like the PageRank criteria. The other type of page, hubs, is one that has many outgoing links to high quality authorities. So, Kleinberg noted that there is a relationship that develops between hubs and authorities. Some pages are good at giving the information and some are good at listing pages that do the former. To find the best authorities, one needs to find the best hubs and determine what they point to: a chicken-and-egg problem. To solve this, Kleinberg used another search engine to seed the results. Taking the top n documents
from that search, the pool is expanded by following all the links in the documents down to a depth of d. Once this is done, high quality authorities are identified as the documents that have the largest number (or percentage) of in-degree links from the document pool. To distinguish between documents that are universally popular (such as a link to a popular website editor) and documents that are authorities on a subject, the authorities that contain links from high-quality hubs are strengthened. Through an iterative process of strengthening the hubs and authorities, a conclusion can be made determining which documents are authoritative on these topics and which are universally popular.

One of the major problems of the HITS algorithm is that it relies on a previous search. If the top n documents contain terms unrelated to the search, the process of expanding the pool further dilutes the results, lowering precision.

2.7 Evaluation of Information Retrieval Models

Traditionally, when evaluating algorithms in computer science, metrics such as time and space complexity are used. This applies to IR as well but does not give a good indication of whether the system has adequately performed the task. Instead there must be an evaluation of how well the system has performed in retrieving relevant information. The standard methodology for this task is the Cranfield paradigm, which is to use a standard collection complete with sets of queries and relevance judgments to compare the effectiveness of different retrieval systems [106]. This section discusses precision and recall using these collections, as well as some of the alternative methods.

2.7.1 Precision and Recall

The basic assumption of the Cranfield paradigm is that for every test query all documents in the set are evaluated to be either relevant or non-relevant. Once that is done, systems can be compared using the precision and recall measures. Consider a sample query and the relevant documents. Precision is the ratio of retrieved documents
that are relevant to the total number of documents retrieved:

    Precision = \frac{|R_a|}{|A|}

where |R_a| is the number of documents in the retrieved set that are relevant and |A| is the number of documents retrieved. Recall is simply the ratio of relevant documents retrieved to the total number of relevant documents (|R|):

    Recall = \frac{|R_a|}{|R|}.

These values allow the evaluation of IR systems if it is assumed that the goal is to retrieve the entire set of relevant documents and to not retrieve non-relevant documents. In this case there is an inverse relationship between precision and recall. In general, as more documents are recalled, the precision tends to go down and vice-versa. An example of this is when expanding a query to achieve a higher recall: terms are added to the query that allow more documents to be retrieved that are not relevant to the query (as well as those that are), thus reducing precision while improving recall. Another problem happens when stemming terms, which amplifies the problem of polysemy.

A problem with using precision and recall as measures is the assumption that the entire retrieved set of documents is evaluated at one time. Since most users are only looking at a few of the documents, a measure is preferred that can give an idea of how well the system has ranked relevant documents at the top of the retrieved set. To achieve this, the precision is calculated for different levels of recall. For example, precision is calculated at the point where 10% of the relevant documents have been retrieved. This measure is P@N, where N is the percentage of relevant documents seen.

To actually use these measures, there must be some understanding of what documents are relevant to a query. This is accomplished by using standard collections where queries have been evaluated by hand. This is a difficult task for large collection sizes. Voorhees estimated that it would take a researcher 9 months to evaluate one
query on a collection of 800,000 documents. Instead of judging all the documents directly, large test collections are created by a process of pooling. In this case, many different systems generate a ranked set of documents. Then, those documents are combined to create a list of top documents. These documents are judged relevant or not. While this technique is not ideal, it seems to allow for the relative judgment of systems [106]. These advances, along with the simple nature of these calculations, have kept the use of precision and recall measurements standard in IR research.

2.7.2 Other Evaluations

In addition to using precision and recall, other evaluation measures have been created. Many of these are combinations or variants of precision and recall such as P@N (previously discussed), mean average precision (MAP), R-Precision, and the E-Measure. These measures are given to better understand how well a system performs IR.

The MAP score is the mean of the precision obtained when each relevant document is retrieved. Relevant documents not retrieved are given a precision of 0. This gives an understanding of how well the system does at retrieving relevant documents and ranking them highly. Since it uses many data points to generate this score, it is typically a very stable rating. Its main downside is that there is more than one way to interpret the score, depending on how it is generated. Ranking some relevant documents high and some low gives the same score as if all those documents were ranked in the middle [12].

R-Precision is the precision after R documents have been retrieved, where R is the number of relevant documents in the set. This is similar to the P@N measure but gives a more stable answer for the particular query. This measure is used when the queries have a widely varied number of relevant documents.

The E-Measure is a combination of precision and recall to allow the users to determine whether they are more interested in recall or precision. Proposed by van
Rijsbergen [97], it is given as:

    E = 1 - \frac{1 + \beta^2}{\frac{\beta^2}{R} + \frac{1}{P}}

where \beta is a weighting factor, R is recall, and P is precision. By adjusting the weighting factor, the user specifies whether she or he is more interested in recall or precision. While this measure is not widely used due to the standard nature of most retrieval systems, it is interesting to consider what type of system would optimize for different precision needs given by the user.

2.7.3 Term Relevance Sets

Instead of using binary relevance judgments, some have proposed using an evaluation that does not contain relevance at all [92]. One alternative method to relevance judgments is using term relevance sets [3]. Instead of providing a list of documents that are relevant to a query, a set of terms that will be contained in relevant documents and a set of terms that will be contained in non-relevant documents is provided. An example of how this might work is a query for "General George Washington." For this example, terms that are relevant might include: Valley, Forge, Delaware, General, and Revolution. Non-relevant terms might be University, President, and Carver. Using these vectors, a term relevance score can be calculated. To evaluate a retrieval set, a similarity to the term relevance sets is given by:

    tScore(D_q) = \frac{\sum_{i=1}^{n} \frac{1}{i} \left[ \cos(d_i, ontopic_q) - \cos(d_i, otopic_q) \right]}{\sum_{i=1}^{n} \frac{1}{i}}.

Given an ordered set of documents retrieved for the query D_q, the similarity is calculated by taking the cosine angle of the document d_i with the relevant terms of the query ontopic_q and subtracting the similarity with the non-relevant terms otopic_q. For the entire set, the value for each document is weighted by its rank. It is important to realize that the tScore does not exactly predict the relevance of the document. A relevant document could easily contain some of the non-relevant terms and none of the relevant terms. To compare systems, this formula is calculated over the top n documents. Experiments
on different collections and data sets found that there was a 90% or better correlation with the typical relevant document set evaluations.

In addition to these positive results, the benefits that term relevance sets have are that they are scale-free, unchanging, and global. Scale-free is important when attempting to evaluate performance of search engines on the WWW. Since the collection is constantly growing, there must be either constant evaluation of the updates or a freeze of the collection to have a set of hand evaluated queries. That the term relevance sets are unchanging means the effort going into creating them will be useful for years to come. Finally, the global nature of the term relevance sets allows for sets to be developed on one collection and used on others. This allows comparisons of search engines without the necessity of fixing the corpus between them.

2.8 Conclusion

The classical information retrieval models described in this chapter are the basis for our research. The main problem that statistic-based systems have is that the semantic information contained in the language is not utilized effectively. While there have been some attempts to use latent semantic analysis, link information, and language models to more effectively capture this information, progress has been slow. The next chapter explores question answering research. While question answering is its own research area, it is important in IR due to the ability to express and fulfill an information need that is more granular than document retrieval.
CHAPTER 3
QUESTION ANSWERING

When was George Washington born? What was Babe Ruth's real name? When did men first land on the moon? Who is president? On the surface these seem like fairly simple questions, but for an information retrieval system this type of question answering ranges from somewhat to extremely difficult. Unlike in the previous chapter, where the task is to retrieve documents that match a query, in this case the user desires a specific answer, which might only require a part of a document, or possibly parts of several different documents. All of these questions have at least one correct answer, and the last question may have more than one correct answer.

Question answering (QA) is in some ways more of an information retrieval task than the IR discussed in the previous chapter. While documents are bits of information, there may be information that is irrelevant to the question being asked. Additionally, there may be many documents that would be related to the query if we merely searched for documents that contain some or all of the words in the query. This is especially important in the case of the world wide web (WWW). Since there are so many documents on the WWW, many queries return an overwhelming number of results, forcing the users to wade through them to find the particular information that they require. This chapter focuses on techniques developed to be used on very large, unstructured corpora such as the WWW.

On the one hand, the task requires specific information to be retrieved from the documents. Since this typically involves something more linguistically sophisticated than is required in IR systems, the cost in time and computing resources of processing pages can be significantly higher. So in some ways, using the WWW as a corpus can significantly increase the overhead of a QA system. On the other hand, the large amount of redundancy on the WWW means that statistical methods have been
successful at complementing traditional QA techniques. So far, combining these two techniques together has produced the most successful results.

Question answering research was initially performed on closed domains and focused on understanding the translation of questions to a structured database of information. One of the first question answering systems was BASEBALL, which could take natural language questions about baseball and transform them into queries against a structured database of statistics [36]. Several other systems of this type followed, among them a system by Kochen that could answer questions about a scene of blocks [56] and LUNAR, created by William Woods to answer questions about the geology of the moon [111]. While this type of question answering is interesting and still useful, retrieving answers from a closed or structured domain is somewhat different from the task of retrieving information from an unstructured source. In the case of the former, it is possible to hand code transformations of the syntax or semantics of the questions into structured queries, but it is not clear that this is possible with unstructured databases. To propel this research, the Text REtrieval Conference (TREC) started a question answering track at TREC-8 in 1999 [105]. Much of the research considered in this chapter is derived from work related to TREC or using the TREC data sets.

Related to the task of question answering is information extraction, the process of extracting structured data from unstructured sources. While this is closely related to question answering, the focus of the task is different in that information extraction has a goal of retrieving specific types of information that are known ahead of processing, but in question answering, the nature of the question posed by the user is not usually known to the system before the query. Information extraction techniques, however, can be employed in certain areas of question answering.
3.1 Question Answering Systems

A question answering system consists of several parts. A typical system architecture is shown in Figure ??. Not all of these parts are necessary, but most systems have the majority or all of them present. The first stage, query analysis, is a step that is generally not taken in a traditional information retrieval system. This entails looking at a query and determining what type of information is being requested. Further, the system often needs to know the type of answer required. For example, a system given the query "Where was George Washington born?" might need to know that the answer would be in the form of a place. It might also need to know that when people ask where someone is born, they are looking for a city rather than a country. The context of the question might be important as well. If the user was looking for information about his or her neighbor, George Washington, an answer about the president would not be helpful.

The second stage is similar to the traditional IR task. In this stage, the query that has been generated in the first stage is given to a search engine, and documents that are deemed relevant are retrieved. The third stage is the passage retrieval stage, which is often combined with the fourth stage, answer retrieval. The combination of these last two focuses on finding the particular information that the user requires in a given document. In some systems, rather than through sophisticated linguistic processing, the answer is found in multiple documents by some scheme, such as using n-grams, where the answer n-gram appearing most frequently is presented. These stages also generally rely on query analysis for information such as the expected answer type and possible patterns to use for specific answers.

The final stage is the presentation of the answer to the user. Although the ultimate goal of this system is user interaction, this topic is not covered here. The user interface is an important part of research, though, especially as systems become able to cope with the context in which questions are asked. Some questions are
impossible to answer without understanding the contexts in which they are asked (e.g., "When was I born?"). This type of questioning is outside the scope of this research.

3.2 Question Categories

To start building a system to perform question answering there needs to be an understanding of what types of questions are hard to answer. The first step in doing this is dividing the questions into categories. The most basic category of question that a QA system attempts to answer is factoid or factual questions. These are trivia-type questions like "Where is the Louvre?" In some cases, as many as 18 categories of questions, such as comparison, example, and interpretation, have been enumerated [13, 34].

Another category of question is the definition question. An example of this type of question is given by Hildebrandt [44], "Who is Aaron Copland?", which has multiple answers: American composer, son of a Jewish immigrant, civil rights advocate, American communist. Since there can be more than one answer, this type of question is somewhat more difficult than a simple factoid question. To answer it, the system must have some idea of what type of answer is required.

Even in the example of looking for a location, the system might require the context of a question. If the questioner is in Paris trying to find directions, the answer to "Where is the Louvre?" might need to be a street address, which is a different requirement from that of a student writing a report, who would need to know that it is in Paris, France. This type of information need, which is not necessarily directly stated in the question, seems to make some questions more difficult. Additionally, questions become harder as the information required to answer them becomes more dependent on reasoning [13].
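A first-cut categorization along these lines simply keys on a question's interrogative word to guess an expected answer type. The mapping below is a hypothetical illustration, far coarser than the taxonomies used by real systems:

```python
# Hypothetical interrogative-word -> expected answer type mapping; real
# systems refine "what" questions much further into subcategories.
ANSWER_TYPES = {
    "who": "PERSON",
    "where": "LOCATION",
    "when": "DATE",
    "why": "REASON",
    "how": "MANNER_OR_QUANTITY",
    "what": "OPEN",   # needs further analysis of the question focus
}

def expected_answer_type(question):
    """Guess the expected answer type from the question's first word."""
    first = question.strip().split()[0].lower()
    return ANSWER_TYPES.get(first, "UNKNOWN")

print(expected_answer_type("Where is the Louvre?"))   # LOCATION
print(expected_answer_type("Who is Aaron Copland?"))  # PERSON
```

Note that an imperative phrasing such as "Tell me the birth date of George Washington" defeats this sketch entirely, which is exactly the kind of drawback discussed below.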
Another approach to question taxonomies is given by Pomerantz [78], who elaborates several different taxonomies of questions, three of which are used in automated question answering: wh-words, the forms of expected answers, and the function of expected answers.[1] These taxonomies are used in different QA systems, although the systems discussed tend to use a combination of the different groupings.

[1] Pomerantz also gives three other taxonomies which are interesting, but less often used in automated QA systems. Subject of questions is one of these. An example of this type of question classification is a frequently asked questions (FAQ) document, where questions related to the same subject are grouped together. Another is by the type of source from which the answer is drawn. In this case questions that need answers from newspapers might be grouped differently from questions that need answers from encyclopedias. This might make sense to implement in a web QA system, since there are pockets of structured data on the Web from which specific types of questions could be answered.

Wh-words is the simple classification of English questions into Who, What, When, Where, Why and How. This way of asking and forming questions is taught to schoolchildren and is very common in writing. Limiting questions to this form has two main drawbacks. First, the questions may not come in the form of a question, for example, "Tell me the birth date of George Washington." This statement is identical in purpose to the questions "What is George Washington's birth date?" and "When was George Washington born?" Despite this, it is still reasonable to assume that this statement can be classified using the When classification. A second drawback is that statements that start with a wh-word are not necessarily questions.

Several TREC QA systems use wh-words as the criteria for the categorization of questions, and a couple of them bear closer examination [47, 72]. In these systems the wh-word is the base of a tree for further specification of the type of question given the expected answer. In LASSO [72] this takes the form of subsets for "what" questions, for example what-who questions such as "What costume designer decided that Michael Jackson should wear one glove?" and what-where for questions like "What is the capital
of France?" In the case of the former question the expected answer type is a person, whereas in the case of the latter, the answer type is a city. It would be easier to tell that the answer for the first question is a person if the question were framed as "Who was the costume designer that decided Michael Jackson should wear one glove?" LASSO also used something called the focus of the question to disambiguate between different types of questions. The focus in this case is "costume designer." Knowing the focus of the question makes determining the answer type easier.

The Webclopedia [47] uses a similar approach. In the Webclopedia there are Q-TARGETS, which are expected answer types, and Q-ARGS, which are similar to the focus in LASSO. Where the Webclopedia differs from LASSO is that it has a much more developed category system for types of answers, such as semantic classes: PROPER-PERSON, EMAIL-ADDRESS, LOCATION, and PROPER-ORGANIZATION. In addition, Webclopedia also has classes of answers that describe the expected function of the answer. That is, the function of the answer is meeting the user's informational need. These categories include WHY-FAMOUS, DEFINITION, and ABBREVIATION-EXPANSION [41]. Although these are still subcategories of the wh-word, they differ in that the category given to them is totally dependent on the type of answer that the user needs.

Classifying questions into question taxonomies is an open research area in QA. Burger et al. identify this as an area of research in their QA roadmap [13].

3.3 Query Preprocessing

The two main goals of this stage are to identify what type of answer is being sought and to determine what constraints are being placed on the answer [45]. The first goal can partially be achieved by looking at the interrogative word. "Where" implies the question needs a place, "when" questions need a time, etc. This is not a trivial problem when considering questions such as "what" questions. The question "What are the required courses for Computer Science?" is much more complex and requires more work to classify the type of answer required than the question "When was George
Washington born?" Some systems try to build categories of answers that classify the question type.

Unlike IR, where a query is passed directly to the document retrieval system, most QA systems preprocess the query to further extract meaning. In some cases, this might be a full categorization of the question into categories similar to those presented in the previous section. In other systems, however, this step might involve transforming questions into a form specific to the search engine that is used. Whatever the method, some level of preprocessing is done to alleviate the need to do complex filtering of the candidate documents.

In the case of the very light systems that try to transform questions into queries that are run against traditional search engines, much of the early work was performed on pattern based transformations. The basic problem of giving a question such as "Where is the Louvre?" to a search engine is that a search engine will typically treat the question as a bag of terms. If the search engine uses an algorithm that requires all the terms to appear in the document, good documents that contain the answer might not be retrieved. On the other hand, documents that contain the question without having the answer might be. While this does not mean that the query will not return documents that have the answer, the process of pulling it out of the documents might take longer than if a more precise set of documents were retrieved.

A pattern based transformation can improve this situation by giving a query that will result in retrieving better documents. One way to do this is simply to strip the question words "Where is." A simple stopword list would also accomplish this. A better solution would be to use the syntactic patterns of terms that would be expected to appear in the documents answering the question. Rules for the system could transform the question into the queries "the Louvre is in," "the Louvre is near," and "the Louvre is located in." Using a search engine that allows for phrases to be in the query would then produce a very precise set of documents that contain answers to the question.
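A sketch of such a rule for "Where is X?" questions follows; the specific patterns are illustrative, not drawn from any published system.

```python
import re

# Hand-coded answer-phrase patterns for "Where is X?" questions (illustrative).
WHERE_PATTERNS = ['"{x} is in"', '"{x} is near"', '"{x} is located in"']

def transform_where(question):
    """Turn 'Where is X?' into phrase queries likely to occur in answers."""
    match = re.match(r"where is (.+?)\??$", question.strip(), re.IGNORECASE)
    if not match:
        return [question]  # fall back to the raw question
    return [p.format(x=match.group(1)) for p in WHERE_PATTERNS]

print(transform_where("Where is the Louvre?"))
# ['"the Louvre is in"', '"the Louvre is near"', '"the Louvre is located in"']
```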


Agichtein et al. have performed research trying to improve the query transformations for specific search engines [1]. They have attempted to find what transformations work best for different question types. For each question a number of different transformations are hand coded. Each transformation is then applied to the question, and the resulting query is sent through the search engine. The transformation is then ranked based on how well the documents retrieved match the answer required for the specific search engine used. More recently, Hermjakob et al. have explored using natural techniques (i.e., techniques that humans would use to change a specific question into a query for a search engine) [42]. They found that techniques such as including potential answer type keywords in the query would improve the precision of the system.

For the most part the query analysis stage of QA has been handled by hand-coded rules. This is true of the question classifications as well. Some recent research has focused on using machine learning techniques [60] to replace the hand-coded mechanisms. Another area of research has been in using parse trees to identify question types [41]. So far there is no clear best way to handle query transformation or query classification. Since these techniques are often tightly combined with the answer extraction systems in a question answering system, it is difficult to extract and test different approaches.

3.4 Document Retrieval

The document retrieval step is necessary in most systems, as the corpora are too large to process with the passage retrieval and answer extraction system. To make the system more responsive it makes sense to limit or seed the system with a few high-quality documents that are likely to contain the answer. To utilize the search engine, most QA systems do some sort of query transformation as described in the previous section. Since many QA systems utilize a corpus that is very large, most researchers have been content to use an existing IR system. In the most extreme cases the QA track in TREC allows researchers to get a list of documents from the approved TREC


IR system. A system that relies on this data need not do document retrieval and instead can focus on the answer-pinpointing part of the QA task, although the step would still have been performed [99].

One way that IR research would become more important in QA research would be if combining simple answer retrieval techniques with robust IR were to generate good results. In the most basic form, a QA system could function simply by taking the questions from a user and passing them directly into a search engine such as Google. The system would then look at the document summaries given and derive an answer from the top n documents. It is unclear how well this would work in practice, but if it worked well it would imply that much of the research in QA should be directed towards IR and summarization. Instead, many systems rely on very simple IR systems and more complex query analysis and answer extraction techniques [72]. One example of this is Webclopedia, which uses a search engine capable of ranking queries but only uses its Boolean query capability [47]. With the size of the WWW and the maturity of research in IR, there is no urgency to address this issue under the veil of QA.

Another area in which search engines have been used is to leverage the redundancy of the web into question answering on smaller corpora. Instead of building a system that is specific to a small corpus, the system relies on this redundancy to get an answer. These systems are large-corpus based. Once an answer has been located in the larger corpus, the system looks to find that answer in the smaller corpora [2, 8]. The benefit of this type of system is that the large amount of data increases the likelihood that the answer to a question can be found in a simple relation to the question. Even if it is not, the possibility of using answer mining techniques increases, since there is more than one source that potentially answers a user's question [30].

3.5 Passage Retrieval and Answers

Both passage retrieval and answer extraction are areas where question answering and information retrieval differ, not necessarily in the process, but rather in the


motivation. While some IR systems use passage retrieval to display snippets of documents to the user as a result of a search process, the goal of the IR system is to find relevant documents. The goal of passage retrieval and answer extraction in question answering is to present the user with the information that the user needs.

It is important to note that passage retrieval is not always an end unto itself in a QA system. In some systems, passage retrieval is used as a second step after document retrieval to further limit the amount of natural language processing or similar techniques used to extract answers, and this process is a ranking of sentences in which those with higher query-term density get considered first [47]. At the other extreme, Lin et al. [63] found that users prefer to see passages rather than exact answers. They concluded that the user's information needs might be best met with answers highlighted in the context of the passages that contain them. Either way, passage retrieval can be a useful tool for QA systems. Because passage retrieval is an integrated step in retrieving answers in most systems, it is difficult to provide a direct comparison of the different techniques used. Tellex et al. attempted to compare different passage retrieval algorithms from several top-performing TREC systems and found that performance was hard to determine in isolation [95].

Once a system has the documents or passages that it will use for answer extraction, the next step is to identify the answer. In most cases, this is accomplished through pattern-based extraction. This is not the only way to answer questions, but it has worked well for QA systems that rely on large corpora, such as the web, to answer questions. Alternatively, n-grams have been explored as another statistical method for extracting answers. From the natural language processing (NLP) approach there have been attempts to use the semantic parse tree to better capture the desired answer. Doubtless, as the amount of reasoning that is required to answer questions increases, more understanding of the material will be necessary [13].
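The query-term-density ranking of sentences mentioned above can be sketched as follows. The splitting and scoring here are deliberately naive (no stemming or stopword handling), purely for illustration:

```python
import re

def rank_sentences(document, query_terms):
    """Rank a document's sentences by query-term density: the fraction of a
    sentence's tokens that are query terms. Denser sentences come first."""
    sentences = re.split(r"(?<=[.!?])\s+", document)
    scored = []
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        if tokens:
            density = sum(t in query_terms for t in tokens) / len(tokens)
            scored.append((density, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]

passages = rank_sentences(
    "The Louvre is located in Paris. It opened to the public in 1793. "
    "Many tourists visit every year.",
    {"louvre", "located", "paris"})
print(passages[0])
# The Louvre is located in Paris.
```

A QA system would pass only the top-ranked sentences on to the answer extraction stage, limiting the amount of expensive processing required.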


The answer extraction step is closely related to the query preprocessing step. During that step many systems determine the answer type that is expected for the question. In a simple system the answer type might be used to filter candidate answers, so only answers that match the correct semantic type are presented to the user. One way of doing this is by using n-grams. Documents that have been determined relevant are sifted for the most common n-grams. Once these are found, the leading n-gram can be given as an answer. To prevent the answer from being something like "of the," the answer type determined in the query preprocessing is used to filter the n-grams so that only n-grams that match the correct answer type are considered. An example of this type of system is given by Dumais et al. [30]. Instead of using the entire documents that were retrieved, the summaries were used. From the summaries 1-, 2-, and 3-grams were extracted and weighted by the queries that generated them, along with how well each matched the expected answer types. Lastly, the n-grams were tiled so that overlapping n-grams would become a single answer. This allowed for longer answers and also assured that answers would be weighted further by the addition of supporting n-grams. They found that this not only worked well against the web, but also against the smaller TREC dataset.

As an alternative to using n-grams as answers, patterns have been used. These patterns inform the system about what form the answer sought will take. For example, in the query "When was Mozart born?" the pattern "<NAME> (<ANSWER>-" could be used to find the birth year. In many systems these patterns have been written and tuned by hand. Ravichandran and Hovy suggest a method for generating them automatically using a search engine [81]. For the above example of an answer type BirthYear, the name of a person and that person's birth year would be used as the query. The top n documents returned as relevant by the search engine are split into sentences, and all the sentences with both the name and year are retained. From these sentences the most frequent n-grams are obtained and filtered so that only n-grams containing that


specific name and year remain. At this point the name and year can be replaced by tokens (with <ANSWER> becoming the token for the desired answer in the pattern), and these become the patterns for BirthYear. To ensure the patterns are precise they are then applied to a search, and the precision is calculated as the number of correct answers tagged with <ANSWER> divided by the number of total words tagged with <ANSWER>. The process of finding the answers is essentially the same as doing question answering itself. Instead of starting with a query, the system starts with an answer to a question. Then it goes through the processes of document retrieval and passage retrieval, extracts answers as n-grams, and then filters them using the actual answer rather than an answer type. Similar research by Magnini et al. found that the redundancy of information on the web allowed similar methods to validate answers as well as to extract them [66].

Answer types are taxonomies that are given to answers for a question. While answer types are heavily used in QA systems, it is interesting to note that in many cases they rely on the question taxonomy. Although these taxonomies vary from system to system, an influential one is given by Hovy et al. [48]. This taxonomy breaks answer types into three specific categories:2 abstract, semantic, and syntactic. Abstract answers are answers that are specific to QA and depend on the question type. An example of this is the WhyFamous category. This answer is sought when asking a "who was" question. When a user queries "Who was George Washington?" the answer that is most suitable is the answer to "What was George Washington famous for?" Since this information is not semantic or syntactic in nature, these questions are harder to answer than ones that require information that can be defined in that way. Other types of answers in this category include Definition, Synonym, and Contrast. Semantic answers

2 There are actually 5 categories defined in the paper, but the role and slot are left out as they are not general to QA.


are answers that map to a semantic class in the WordNet ontology. While this could be general, such as a location, Hovy et al. describe a system that is much more specific, with answer types such as ProperCity or ProperState. The syntactic category is a fall-back when the system cannot determine the semantic type desired. There are four syntactic types: Noun, Noun Phrase, Verb Phrase, and Proper Name. In a grouping similar to the abstract category of Hovy et al., Buchholz and Daelemans identify what they call complex answers [11]. These are questions with answer types that they believe will be hard to answer due to the abstract nature of the answer. They assign answer types such as List, Opinion, and Time Dependent to this category. Due to the difficult nature of answering these questions, the frequency of any particular candidate answer is the de facto standard in many systems.

3.6 Evaluation

In information retrieval, systems are evaluated by how well they retrieve documents related to the query. The two main measures are precision and recall. These measures can work in question answering, but since the interest is in retrieving an exact answer to a question, and these measures relate to documents, a new evaluation must be established. Evaluating question answers requires a set of criteria to be developed to determine how to judge the answer. For example, would the user be satisfied with an exact answer, or would a paragraph that included the answer be acceptable? Given the goals of answering questions, several criteria have been developed to explain a user's informational need. These criteria include relevance, correctness, conciseness, completeness, coherence, and justification. Relevance is the most basic criterion: the answer must be related to the question asked. Once an answer is determined to be relevant, the next step is to decide whether the answer is correct. In most cases these two criteria can be considered together when evaluated by a user of the system, but


for the purpose of judging the efficiency of the system these must be addressed as separate issues.

Justification has been considered by the TREC question answering evaluation. In TREC, systems were required to include a document id. If the document did not support the answer given by the system, but the answer was correct, it was deemed "Not supported" [101]. A user study in this area found that users prefer to have answers that are shown in the context of the document in which they are found. The additional context allows users to overcome any suspicions that the system is in some way unreliable in reporting whether a source is trustworthy or not [63]. Completeness is simply the requirement that the answer should not be considered correct if it is only a partial answer.

The next two criteria, conciseness and coherence, are relevant to the perceived need to give users particular information. Conciseness is giving the user only the information that answers the question posed, without any extraneous data. This goal clashes somewhat with the criterion of justification, as the more concise the answer is, the more likely it is that the answer will be given out of context. Coherence relates to the need for the questioner to be able easily to read and understand the answer given. This is not a problem for systems which simply extract answers, but as answers necessarily become more complex there must be some way of judging this [7].

Given these criteria, evaluating question answering is still an open problem. The goal is to allow for automated evaluation, but so far there has not been much success in creating a system to do this. TREC uses human judges. Tests have shown that a single human assessor can judge documents such that the relative performance of systems is maintained. While this allows for the cost of judging systems to be much reduced, it still does not allow for machine learning based on feedback from the judge. Several attempts have been made at automating the evaluation of QA systems. One such system is Qaviar, created by Breck et al. [7]. This automatic evaluation works by


comparing the answer string given by the QA system with a human-generated answer key for factoid questions. They found that they could achieve a 93% correlation with the human judge responses using this system. Despite this, the system remains very brittle for a number of reasons, among which are numerical and temporal answers, phrases, context, and other question types. Since these are linguistic challenges that must be solved to accomplish the task in the first place, it is difficult to see how well evaluation will work. In addition, even human judges can disagree in evaluating an answer, making it even more difficult [101, 104].

Another method for doing automatic evaluation for definition questions was presented by Xu et al. [113]. This system used the Rouge system developed by Lin and Hovy [61] for evaluating summarization of documents. Since the answer keys for definition questions contain nuggets of information that need to be in the answer given by the QA system, it can reasonably be assumed that if there is a correlation between the answers and the nuggets then the answer is likely correct. Under that assumption, Xu et al. found that they could achieve a 0.7 correlation (with 1 being total agreement, and -1 being complete disagreement) between Rouge and the recall-of-nuggets method of evaluating definition questions.

3.7 Natural Language Processing

Natural language processing seems like the goal of question answering, since answering questions and understanding them presents as a language problem. Despite this, there has been less progress in this area than might be expected. As question answering moved onto the WWW, the need for sophisticated NLP decreased, as pattern-based methods could meet the user's need in many situations. Similar to IR, NLP has been used in areas such as stemming to improve techniques that do not rely on an understanding of the linguistic nature of the texts. At the same time, there has been work toward understanding how the semantics of the questions and answers can


improve QA. As of yet, there is no real consensus as to whether NLP is the ultimate answer to QA. Moldovan et al. found that improvements in NLP in all stages of question answering improve the performance of a QA system [73]. In a later paper, Katz and Lin claimed that linguistically sophisticated systems perform worse than systems that are simpler [53]. One of the reasons this seems to be true is that the linguistically based systems act in a very brittle manner. As the language changes, the systems have a hard time coping. While this appears similar to the pattern-based systems described previously, the main difference is that those systems do not know what the symbols in the patterns mean. This difference allows the system to learn surface patterns, whether or not the language is syntactically correct. Using this insight, Katz and Lin proposed that QA systems be based on these simpler techniques and expanded to more elaborate measures when benefits to them were found.

Two specific areas in which they found that NLP could be used to enhance a system were those of semantic symmetry and ambiguous modification [53]. The problem of semantic symmetry is demonstrated by the questions "What do frogs eat?" and "What eats frogs?" which have similar words but different answers. Ambiguous modification can be demonstrated by a question such as "What is the largest volcano in the Solar System?" Because "largest" and "in the Solar System" can modify different head nouns, it is possible that the system might find an answer that includes "largest planet in the Solar System." To allow the system to distinguish between answers to these types of questions, Katz and Lin used a system that preprocessed documents and stored relevant semantic information in semantic relation triples. These are sets of three words that store the relationship between words conveyed in the text. An example of this is "frogs eat flies": a subject and object along with their relationship are stored. To retrieve this information, the query engine creates a triple that indicates the query. In the case of "What do frogs


eat?" the system would generate the expression "frogs eat ?x". The system could then perform a logical analysis to bind the variable and solve the problem. Using logic to answer questions has also been proposed by Moldovan [71]. While there is clearly value to NLP in question answering, current trends limit it to small areas as research reaches its limits using pattern and machine learning techniques. However, it seems likely that this area will provide the next steps forward.

3.8 Conclusion

While automated question answering has been studied in various forms for many years, progress is still very slow. Many of the papers published are based on hand-crafted systems developed over many years. While TREC has allowed for some direct comparisons between different systems on similar data, there is still little progress in determining what parts of the system actually differentiate between the QA systems. It is still unclear if this is a result of the problem itself. Is question answering necessarily such a difficult task that the systems will continue to be tightly coupled, or will a framework develop in which various components can be pulled in and out so that direct comparisons can be made of different methods? It may be that until automated evaluation of answers can be performed in a reliable manner, the building of such systems will not be achieved.


CHAPTER 4
EVALUATION

4.1 Introduction

Prior to introducing our new model, we will address evaluating information retrieval (IR) systems. While this seems to be inconsistent with expectations, the method of evaluation we introduce in this chapter is also derived from an idea of informativeness. By working through a new method of evaluation, we will provide a goal for our new model of IR: to rank documents in order of decreasing information.

The goal of an IR system is to meet an information need, and without an objective way of measuring whether a system meets this need better than another system, there is no way to directly compare the two. Without a way to compare systems, improvements are hard to assess, and evaluation is often left to the opinion of users. This is especially true when considering the world wide web (WWW), where even creating a consistent corpus is extremely hard.1

We believe that there are several problems with creating and maintaining a large corpus and associated query relevance sets (qrels) that make it necessary to find a new way to evaluate IR systems. The first is that no matter how large our starting corpus is, it will not approximate the size and complexity of the WWW. Since we need a consistent way to compare systems, using the WWW itself is not a good option, as every system must have the same index copy to ensure a consistent test. Even considering a larger corpus raises a second problem: How do we create qrels in such a large corpus? Judging whether a document is relevant is not a difficult task in itself, but with millions or billions of documents it is impossible to judge each one. When we move toward judging smaller percentages of documents in the corpus, traditional

1 One excellent effort at creating such a corpus is the Text REtrieval Conference (TREC) large corpus track, which includes documents from the gov2 website, but even this is only a small portion of the actual WWW.


measures such as precision and recall are no longer useful. Even measures such as Bpref rely on the judged documents being ranked highly by the system to give some sense of how the system will perform for a user. This leads to a third problem. If there are really thousands, or millions, of relevant documents, how does marking some of them as relevant give us a good grasp of how the system works? Most IR systems rely on some sort of ranking, either by nature of the IR model they use, or because they use some additional sort of semantic information to rank the documents. Without a set of ranked qrels it is difficult to judge how well a system performs that task.

We propose to address these problems by creating a static set of qrels that are independent of a corpus. Rather than having a set of documents that are retrieved from a corpus, a set of documents will be evaluated and judged by the content they contain. This requires a change from the traditional qrel set, in which a document is considered relevant if it is deemed to have met the information need of the query. Rather, we propose to create a set of questions that we expect a relevant document to answer. In this study we focus on named entities. While it is possible to create sets of questions related to what makes a document relevant for any query, by focusing on named entities we are able to create sets of questions that are specific to the type of entity with which we are dealing. Thus the documents can be ranked by how many of the relevant questions they answer rather than by the single question, "Is this document relevant?"

By ranking the documents we hope to eliminate the need for the documents to be part of a larger corpus. In this way we can construct qrels from the WWW that can be evaluated by systems that have not indexed it. Additionally, this should allow comparison to available search engines on the web, as the documents they return can be judged and compared to other systems, which we will demonstrate.

In this chapter we describe the method for creating ranked qrels. In addition, we compare the ranking method to TREC qrels and also use our qrels to compare


an available search engine to a home-grown IR system. The rest of the chapter is organized as follows: We demonstrate the types of questions and compare them to a TREC qrel in Section 2, and we give several ways of evaluating the new rankings in Section 3. In Section 4 we give initial comparisons of two systems using our evaluations. We conclude by discussing our work, future uses, and research in Section 5.

4.2 Creating Query Relevance Sets

This section contains a discussion of creating qrels. We discuss how TREC describes their qrels and give an example topic. We then compare TREC's method with our own. We give a comparison of the effectiveness of the resulting set of judged documents against the TREC qrel. We also discuss why the TREC corpus is difficult to work with when using our method.

4.2.1 Converting TREC Topics

The purpose of this section is to compare our method's qrels to what has been done in the past, namely, the guidelines posted by TREC for addressing large corpora. TREC's guidelines are inherent in the narratives provided for each topic in their system. Figure 4-1 gives an example of a TREC topic definition, which will help illustrate our point. Here, the title is the query given by a theoretical user. The description interprets the information need implied by the query, and the narrative gives guidance as to whether a document should be judged relevant. In this case, a document that contains information about the correct David McCullough will be considered relevant. Effectively, then, the query relevance set that results from a TREC topic is answering the question, "Is this document relevant?" Although this is a binary question, there are still areas open to the interpretation of the user, and a document might be relevant for one user but not for another. More specifically, if, in the midst of a list of authors and their birth dates, a document lists David McCullough, it is still up to the interpretation of the person judging the document whether it is relevant or not. These topics are then given to the judges, who take a set of documents and judge each document as to whether the document is


relevant according to the narrative. For each topic, the qrel is composed of these judged documents.

The ambiguity of this example could be resolved if we simply state that a document containing David McCullough's birth date is relevant. Furthermore, the more specific we make the narrative, the less we leave open to the interpretation of the judge. In our method, we have chosen to identify the specific information that makes a document relevant in the form of a question that the document must answer. From our McCullough example, we can ask other questions, such as "How many Pulitzer prizes has he won?" or "When did he write Truman?" or "What company published Truman?" A document that answered any of these questions would be at least somewhat relevant to the query. Likewise, the more questions a document answers, the more completely it should meet the information need of the user. Assuming the questions cover the information need specified in the description, under this method the original binary question "Is this document relevant?" can be rephrased as "Does the document answer any of these questions?" The relevant documents also can be ranked using the number of questions that a document answers.

An alternative to creating questions that are specific to an individual query is to create a taxonomy of questions that relate to a type of query. In the case of the topic for David McCullough, we may consider a person search or an author search. Rather than choosing the questions that should be answered, documents would be judged according to a list of questions related to the type of entity for which we were searching. A person search might have questions about birth, career, family, or death. This will provide several benefits. First and most importantly, searches for similar types of information will be judged in a consistent way. While we are often interested in a general search when considering the WWW, improving particular types of queries requires creating a way to predictably compare systems. In the case of the TREC qrels, only two judged topics


express an information need about a person. That is, two of the judged queries match our concept of a person search.

A second benefit is that the subjectivity that would be introduced by choosing questions or, in the case of traditional qrels, determining what makes a document relevant, would be reduced. In our method of creating a qrel, if a document answers a question, it is deemed relevant, and if it does not answer any of the questions it is deemed irrelevant. This gives a clear guideline for relevance, but it does not cover all situations in which documents might be considered relevant to a query. For example, if author questions (questions pertaining to writing books) were not included in judging documents pertaining to David McCullough, documents that listed his books would not be deemed relevant unless they included biographical information about him. Other examples of types of information that might be excluded are temporal information, such as you might find in a newspaper, and interviews, where the subject is interviewed about some topic other than himself or herself. Rather than trying to make a judgment on those issues, we feel that this could be addressed by creating additional questions that cover those situations. The subjectivity then is relegated to the creation of sets of questions that quantify information in the documents.

The third benefit comes from the ability to quantify information. While we do not intend to argue that this is the only or even the best way to rank documents, given that we have a number that indicates how many questions a document answers, it seems natural to assume that the document that answers the most questions is the most likely to meet the information needs of the query and is thus the most relevant. Likewise, a document that answers one question contains some information, but less than a document that answers two. This intuition coincides with our understanding of the vector and probability IR models. In the vector model documents are closer to a query based on distance in an n-dimensional space. If we disregard features that are not related to the answers of the question, then we expect that documents that answer all the questions


would be very close together. Likewise, we might expect that one that answered fewer questions would be closer to the documents that answered many than to the documents that answered none. This intuition also holds true for the probabilistic model. The answers to the questions that make a document relevant would be expected to appear in relevant documents more than in non-relevant documents. So in each case we might expect documents with more answers to be ranked higher when ranking results, given the nature of the model.

4.2.2 Comparing Methods

To compare our method to the TREC judgments we evaluated each of the documents in the qrel for David McCullough and marked them as having answered one of a set of questions about people. We also marked a single question related to an author, "What books were written by this person?" In Table 4-1 we show the number of questions answered by all the documents in the TREC qrel for David McCullough. The number-of-documents column shows how many documents were judged to have answered 0 or more questions. The number of relevant documents is the number of documents that were judged relevant according to the TREC qrel. Therefore, there were 82 documents that answered one question about David McCullough, and 53 of those were judged to be relevant in the TREC qrel. Likewise, 13 documents that were found to answer no questions were judged to be relevant by the TREC judges.

When comparing our method of evaluation to the TREC judgments, we found that there were 36 documents our method marked as relevant that were considered irrelevant according to the TREC judges, 29 of which answered one question and 7 of which answered two questions, as shown in Table 4-1. Most of the discrepancies were related to documents that mentioned that David McCullough was the author of John Adams, which seems to be inconsistently judged. Several other documents marked as relevant that were not under the original judgments were related to documents not specifying the publisher of Truman. There were also 13 documents that were not marked as being


relevant. In these cases the documents contained interviews with David McCullough that did not contain answers to any of the questions provided, or just mentioned him without identifying his profession or books. In total, our judgments resulted in a precision of 70% and a recall of 86.5%. Though it is not perfect, we believe that our method accurately and objectively judged the information in the documents.

In addition to the overall precision and recall being high, our method also works well when we examine the number of questions that a document answers. In Table 4-1, the number of questions that each document answers corresponds to an increasing probability that the document is judged relevant. While this is ideal for our method, it highlights a problem with the TREC corpus: Very few documents contain a large amount of information about David McCullough. Since only two documents contain answers to more than two questions about David McCullough, it is likely that the low co-occurrence of the answers will prevent probabilistic methods from deriving benefit from these terms.

Table 4-2 further anatomizes the makeup of the relevant documents by question. This table shows how many documents were determined to have answered the particular question listed. The number of documents judged relevant shows how many of the documents answering the particular question are considered relevant by the TREC judges. Here it becomes apparent that although the questions WHEN BORN and WHERE BORN predict relevance better than BOOKS and PROFESSION, there are only two documents that answer these questions, since one of the documents contains both answers.2 This highlights the problem with the TREC corpus. Even though it is extremely large, it is not representative of the amount and type of data on the WWW.

2 One document includes David McCullough's wife's name and profession, which is not judged relevant. We consider this to be a mistake of the judge, but even if it is not, it supports our point that the judgments are subjective in their current form. If we make the assumption that it is in fact relevant, the SPOUSE fact predicts better than BOOKS and PROFESSION as well.
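The precision and recall figures above follow from the standard set-based definitions. In the sketch below, the false-positive and false-negative counts are the ones reported in the comparison (29 + 7 = 36, and 13); the agreement count of 84 is our back-calculation from the reported 70% precision, not a number stated explicitly in the text:

```python
# Counts from the comparison against the TREC qrel for David McCullough.
true_pos = 84    # answered >= 1 question and judged relevant (inferred, see lead-in)
false_pos = 36   # answered >= 1 question but judged irrelevant by TREC (29 + 7)
false_neg = 13   # answered no questions yet judged relevant by TREC

precision = true_pos / (true_pos + false_pos)   # 84 / 120
recall = true_pos / (true_pos + false_neg)      # 84 / 97
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
# precision = 0.700, recall = 0.866
```

Note that 84/97 rounds to 86.6%; the 86.5% reported in the text is consistent with truncation rather than rounding of the same ratio.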


When we examined the documents containing a single answer in Table 4-1 that were judged relevant by the TREC judges, we found that many of these were either library listings or references for an article. To further see the deficit of the corpus for ranking we created an alternative qrel with fewer questions defining relevance. In this case, when we limit the judgment of relevance to BOOKS and PROFESSION we still have a qrel that matches the current qrel with the same precision of 70% and recall of 86.5% as when using all the questions. The main difference is that we would then not give any additional weight to documents that contain additional biographical information. Alternatively, we could create a qrel in which the relevance requirement was that it contained biographical information (SPOUSE, WHEN BORN, and WHERE BORN), which would contain only two documents. To further demonstrate the inadequacy of the TREC corpus: using the WWW as the corpus, the query "David McCullough 1933" returns tens of thousands of documents. While not all of these contain David McCullough's birth year and instead just contain the three terms, a cursory examination found that many more documents contained the birth date than the two documents in the TREC corpus.

4.2.3 Conclusion

While TREC is extremely useful for comparing IR systems, it cannot and does not contain the diversity of data found on the WWW. While we have shown that our method of specifying questions to indicate relevance is similar to the traditional specification or narrative of relevance, we have also found that the ranking our method produces does not provide a good set of ranked documents. While we have focused on a single query, we also looked at additional queries in the TREC qrels. We found that most of the documents that are judged have no information related to the query. In the case of the two person queries, David McCullough and Pol Pot, often not even the names, first or last, are included. This does not affect the Bpref scores, as documents judged not relevant that are never returned will, by the nature of


Bpref, not affect the score.3 This requires significant additional work for documents that will affect the scoring very little. Unfortunately, until all the documents are judged there is no way to tell whether scoring the documents will provide benefit.

4.3 New Qrels

In the previous section we discussed using questions to create qrels that were similar to the qrels given for the TREC large corpus track. In this section we expand on that method to create a qrel that addresses our original problems: First, the corpus should be the WWW; second, we cannot judge all documents; and third, the data must be rankable. The problem that we faced with the TREC query was that the documents available did not contain the amount of information that we would expect to find on the WWW. A simple solution to this problem is to find documents that do contain the information and rank them according to how many questions they answer.

The reason that this approach is not taken in IR is that the traditional evaluation measures all compare how well a system does relative to the entire corpus. To intuitively understand this, consider the precision measure:

Precision = |Ra| / |A|     (4-1)

where |Ra| is the number of relevant documents returned by an IR system and |A| is the total number of documents returned. With neither of the terms explicitly dependent on the corpus, it is easy to see that in a binary query we would have high precision if the set of relevant documents were exclusively the only documents to satisfy the Boolean query. More plainly, let D be the total set of documents, where Da is the

3 It turns out that it would be possible to create a qrel for Bpref that would allow a perfect score by simply judging all of the documents a system returns as relevant, effectively gaming the system. The qrels for TREC are actually created by judging the top N documents returned by all the participating systems. This has been shown to be effective in judging the relative performance of systems.


subset of documents containing the query terms and Dc is the set of documents not containing the query terms, such that:

    P(Q|Da) = 1,    P(Q|Dc) = 0

where P(Q|Da) is the probability of the query terms appearing in the set of documents Da and P(Q|Dc) is the probability of the query terms appearing in any other document in the corpus. In this case we expect that any IR system would perform extremely well in terms of precision on a query to which those documents were relevant. Likewise, measures that are based off of precision (e.g., mean average precision, precision at N, etc.) have the same dependency. The Bpref measure does not suffer from the same dependency on the corpus, as it considers only documents that have been judged. When considering the documents that are judged using Bpref, the dependency on the corpus holds only insofar as the IR system uses information from the corpus to determine relevancy. A probabilistic system, for example, would use the statistical characteristics of the terms. Despite this, documents that are relevant to a query remain relevant no matter in what context they are found. Using this insight, our method further expands to assume that a document containing more information continues to do so no matter what its context. By this assumption, documents no longer have to be confined to the corpus in which they were originally found. Essentially, the documents that make up a qrel could be given to any system with the task of ranking them. While the system might perform differently depending on the context, the task would allow systems to be improved without regard to the particular corpus. In this way we are free to choose documents for our qrel from the corpus that we desire (that is, the WWW) as long as we give those documents to any system. The system in turn will be free to use whatever method is desired to rank the documents. This will allow systems that have differing corpora to be evaluated based on their ranking performance.
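The Bpref measure referenced here can be made concrete with a short sketch. This is a minimal illustration of one common formulation of Bpref (due to Buckley and Voorhees), not the evaluation code used in this work; the document identifiers and judgment sets are illustrative. Only judged documents contribute, and unjudged documents in the ranking are skipped:

```python
def bpref(ranking, relevant, nonrelevant):
    """One common formulation of Bpref:
    bpref = (1/R) * sum over retrieved relevant docs r of
            1 - (judged nonrelevant docs ranked above r) / min(R, N)
    where R and N are the judged relevant and nonrelevant counts.
    Unjudged documents in the ranking are ignored entirely."""
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N)
    nonrel_above = 0   # judged nonrelevant docs seen so far
    score = 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_above += 1
        elif doc in relevant:
            if denom == 0:   # no judged nonrelevant docs at all
                score += 1.0
            else:
                score += 1.0 - min(nonrel_above, denom) / denom
    return score / R
```

Note that a judged nonrelevant document that is never returned adds nothing to any penalty, which is why leaving such documents out of the ranking cannot lower the score, as observed above.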


4.3.1 Qrel Topics

To test our method we created qrels for several different types of objects. Our goal is to show the flexibility of this system. Although this is not necessarily appropriate for a truly general search engine, we feel that this method will encompass enough different types of information to give a good indication of how well different systems compare. For our tests we chose three different categories: Countries, Movies, and People. Table 4-3, Table 4-4, and Table 4-5 show the question types that we used for each object. We term this approach the encyclopedia approach, as we expect that an encyclopedia would be ranked first, because encyclopedias tend to contain articles that are dense in facts. For each object type we chose several object names at random and marked documents as having answered the questions about those names. The person object questions are slightly different from those of the other categories. This is due to the fact that not all questions have answers for all people. An immediate example of this is the WHENDIED question, which does not apply to people who are still living. What this should show is that despite the different types of questions, the evaluation will be consistent. Documents with more questions answered will continue to be ranked higher than those with fewer questions. In choosing names for the person category, we chose names of presidents of the United States and authors. In expanding the questions that are judged in these categories, it would be appropriate to further split the person category into subcategories and ask additional questions about each type. While that would allow for further specialization in the types of queries we evaluate, for our present evaluation it does not fit our purpose. Every question can be evaluated for every named object, regardless of whether it applies or has an answer. In the end, the total number of questions that apply will be identical.

4.3.2 Testing Method

In this section we compare the search engine Bing against an IR system using Lucene and its standard scoring method. The main purpose of doing this is to


demonstrate that, even though Lucene uses a very simple vector-based system and Bing is a popular search engine on the WWW, neither does a good job of ranking documents according to our measure. We further demonstrate the usefulness of the portability of our qrel. We are able to objectively compare the performance of the two systems, despite the fact that the Lucene engine does not have access to the entire corpus. While it might be possible to simulate access to the corpus (for example, by judging the relevance of all the documents that a search engine returns), in practical terms this is unrealistic, as every query would have to return all documents that were possibly relevant. There would be far too many documents for a single query to judge in a reasonable time frame, and the possibility would exist that documents that were not returned would have been ranked higher by the system without access to the corpus. To create the qrel for this test we used the Bing API to query the names of each of the objects described in the previous section. For convenience we used only the first 40 documents, and of those only the documents that allowed our spider to retrieve them. These documents were then marked as having answered each of the questions related to the object types, and thus a ranked qrel was created. For our purposes the ranking for Bing was the order in which the documents were returned. For Lucene, we used the TREC corpus as the basis of our scoring. Rather than indexing each document, we created a method by which the score for each document was determined as if the document had been indexed, using the formula:

    ds(q, d) = Σ_{t ∈ q} (idf(t) · qn) · (tf(t, d) · idf(t) · dn)

where ds is the score of the document, tf and idf are the normal term frequency and inverse document frequency, qn is the query norm, and dn is the document norm. For the purpose of ranking, the query norm can be factored out of the formula. The document


norm is calculated as 1/√|d|, where |d| is the length of the document.⁴ In this way a ranked set of documents was created that could be judged and compared.

⁴ Lucene actually stores the norms in the index in an encoded format with a 3-bit mantissa. While this does not significantly alter the rankings, in our experience our method simulates the truncation to exactly match the score that Lucene produces when retrieving the document from the index.

4.3.3 Evaluation

To evaluate the performance of the two systems we used several scoring mechanisms. As we have shown previously, precision and recall do not work when the corpus is not the same between systems, but Bpref will work. In addition to Bpref, we also used two additional scoring methods: root mean squared error (RMSE) and mean average error (MAE). We have already discussed Bpref, when we compared the performance of our method to that of the TREC qrels. While this is less useful here, we still examine the performance by considering all documents that answer at least one question to be relevant and all documents that do not answer any question to be irrelevant. We further expand the use of the Bpref measure by changing the distinction between relevant and irrelevant to answering at least two questions. In cases where the documents are well ranked, this number is expected to be close to the max Bpref score (one) for the highest number of questions. The two additional measures, RMSE and MAE, are used to show the average magnitude of errors over the entire set of judged documents. While each measure is useful, together they give a picture of how well a particular ranking matches the ranked qrel. The MAE is measured as:

    MAE = (1/n) Σ_{i=1}^{n} |f_i − y_i|


where f_i is the predicted rank and y_i is the qrel rank. The RMSE is given as:

    RMSE = √( (1/n) Σ_{i=1}^{n} (f_i − y_i)² )

The main difference between the two measures is that RMSE heavily penalizes errors that are large in magnitude. This corresponds to our desire to have the documents with more information close to the top and the documents without information close to the bottom. Ideally, the system will not rank a document extremely far from where it should be. Alternatively, the MAE demonstrates how well a system does overall. Since our method of ranking documents makes no distinction between the different types of questions, we have many cases where different documents answer an equal number of different questions. For example, one document may answer a question about when a person was born and another about what his or her profession is. We make no attempt to distinguish between the ranks of these two documents. Effectively, both documents are considered equal in the amount of information they contain. This results in a highly compressed qrel. To reflect that, in our measures we treat all documents answering the same number of questions as equal in rank. To evaluate that, we calculated the MAE and RMSE slightly differently than given previously. The MAE is instead calculated as:

    MAE = (1/n) Σ_{i=1}^{n} |E_i|

and the RMSE is:

    RMSE = √( (1/n) Σ_{i=1}^{n} (E_i)² )


Here the E_i is calculated with regard to the upper and lower bounds of the similarly ranked documents. For our purposes we calculate it as:

    E_i = 0                                   if UR_q > f_i > LR_q
    E_i = min(|f_i − UR_q|, |f_i − LR_q|)     otherwise

where UR_q is the highest rank of documents with the same number of questions answered in the qrel and LR_q is the lowest rank of a document with the same number of questions answered. This can be understood to mean that if the document is in the same range as documents with a similar number of questions in the qrel, there is no error. Otherwise, the error is the minimum distance between its ranking and the minimum and maximum of the range where it should appear.

4.4 Results

This section examines the results of running our queries on the two systems. In the first part we examine the overall results of the systems, focusing on object types and the overall average. In the second, we delve further into the Person type and present the results for each of its questions. In the third, we examine the Bpref score at different levels of relevance.

4.5 Overview

When comparing two systems it is important to remember that the purpose of this comparison is to examine how well each ranks the documents in the qrel. The results do not reflect how well the two systems would do with regard to precision or recall when ranking results from the WWW, although clearly there is a similarity between ranking and precision. With that disclaimer, consider the results in Table 4-7. Bpref here is measured using any question deemed relevant, with a higher score being better. Thus, if all the documents that answered questions about the named object came first, a Bpref score of 1 would be obtained. The max MAE is N/2, where N is the total number of


documents. In our measurement, the max error is somewhat less, as we have buckets of similarly ranked documents, which shrinks the max distance from the bucket. MAE and RMSE are error measures, so lower scores are better. While in two cases (movies and persons) the higher Bpref score corresponds to the lower MAE and RMSE, in the case of country this is not true. What this shows is that the system that ranks the documents best for each test does not necessarily rank all documents with answers well. This can be seen in Figure 4-2. Additionally, this test shows that both systems display a fairly large error. This implies that both systems rank the documents fairly poorly relative to the number of questions the document answers about the named entity.

4.5.1 Single Person Questions

In Table 4-8 and Table 4-9 we further expand the results of the comparison of the person objects. These scores do not directly correspond to those given in Table 4-7, as the number of documents that are considered relevant changes for each particular question type. Thus each row can be considered a separate need for a particular type of information, but the query string is the same. As is expected for the two systems involved, neither system has a way of deriving the changing information needs of the user, since the query does not change. The desired outcome is that if the system ranks the documents with the most information first, that will result in documents with these individual questions answered, which will correlate to the overall ranking. We expect that documents that contain an answer to an individual question will all be ranked before documents that do not contain the answer. In that case it might seem that all the documents that answer a particular question, such as WHENBORN, would be ranked higher than all the documents that do not answer that question. That would not be the case, however, if a document that answered two other questions (such as SPOUSE and PROFESSION) was properly ranked higher than a document that only answered the WHENBORN question. So, in the case that documents are ranked by the total number of questions they answer, documents are not expected to


be ranked according to a single question. This will result in scores that are hard to use as a basis for evaluation of a single system. With that consideration, Lucene generally performs better for presidents, but the results are mixed for the author questions, with Bing performing better on average.

4.5.2 Bpref at Different Levels

While the overall results indicate that the two systems are very similar in results, this is not the complete picture. While the Bpref scores presented in the previous sections showed systems that had similar scores, the two systems actually display differences when we look at the top ranked documents. To see that, consider the charts in Figure 4-2. The Bpref scores in these charts are calculated by considering every document that has at least N questions answered to be relevant. In most of these charts we see that Bing starts with a much higher Bpref than Lucene. That gap narrows, and in some cases Lucene's Bpref surpasses Bing's, when the number of questions required for relevance gets smaller. We interpret this data to mean that Bing tends to rank the relevant documents better than Lucene. This can be seen by the fact that at a higher number of questions, Bing has a higher Bpref than Lucene. Lucene seems to have all the relevant documents mixed together. That is, the documents that have at least one question answered are not in order of descending information. This is apparent when considering how the Bpref increases as the number of questions required for relevancy decreases. This indicates that in both systems the documents are interspersed at each level. The person charts give a view of how differing numbers of questions affect the results. The results are shown in several different groupings to highlight that the different types of persons result in Bing outperforming Lucene despite there being differing levels of performance in subsets of the data. The large drop we see in the overall average results from the number of questions varying.
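The level-by-level calculation behind these charts can be sketched as follows. This is a simplified sketch, not the code used to produce Figure 4-2: `answered` maps each judged document to the number of questions it answers about the named entity, and relevance at level N simply means answering at least N questions; the Bpref body is the same judged-documents-only formulation used earlier in the chapter.

```python
def bpref_at_level(ranking, answered, level):
    """Bpref where a judged document is relevant iff it answers
    at least `level` questions about the named entity."""
    relevant = {d for d, q in answered.items() if q >= level}
    nonrelevant = {d for d, q in answered.items() if q < level}
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N)
    nonrel_above = 0   # judged nonrelevant docs seen so far
    score = 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_above += 1
        elif doc in relevant:
            if denom == 0:
                score += 1.0
            else:
                score += 1.0 - min(nonrel_above, denom) / denom
    return score / R

# Sweeping the threshold from the highest number of questions down
# to one traces out a single curve of the kind shown in Figure 4-2:
# curve = [bpref_at_level(ranking, answered, n) for n in range(5, 0, -1)]
```

Lowering the threshold enlarges the relevant set, which is why the measure tends to improve naturally as the number of questions required for relevance decreases.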


Overall, this view of the Bpref shows that Bing outperforms Lucene at ranking documents at higher levels. Both systems, though, show no consistent ranking of documents, which would be shown by a consistent Bpref across the number of questions. This, combined with the fact that the Bpref naturally improves when more documents in the set are considered relevant (i.e., there are no zeroes factored in, as there would be in a system that did not return any of the documents), indicates that neither system performs extremely well, which reinforces the results of the MAE and RMSE.

4.6 Conclusion

In this chapter we have presented a novel way of evaluating the ranking of IR systems. Our method allows us to make an objective comparison between a commercial search engine, Bing, and Lucene. This is an improvement on past methods that would often rely on user studies, comparing n-grams, or other methods of comparing research systems to these systems. Further, the qrels we created are portable and can be used by other systems to measure against the same data without the need for a large corpus to be passed along as well. This reliance on the corpus would also prevent a comparison against a commercial search engine, as the actual corpus is closed and cannot be completely replicated between two systems. We also showed that this method compares well with the qrels created using the TREC corpus. We examined the makeup of the documents included in the qrel and discovered that very few contained much information. The insight that documents are relevant regardless of the context in which they are found led to the realization that documents containing more information will be ranked higher no matter in what corpus they are found. When comparing the two systems, we found a fairly surprising result: both systems performed poorly when ranking the documents. Of the two, Bing was clearly better, though it is important to note that the difference between the two systems is small in our small sample set. When applied to the WWW, we would expect that this difference would be


magnified. This would be in line with our understanding of vector-based IR systems versus graph-based ranking systems.
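The bucketed error measures defined in Section 4.3.3 can be sketched as follows. This is a minimal illustration with our own function names, not the evaluation code itself; `buckets` maps each document to the qrel rank range (LR_q, UR_q) occupied by the documents answering the same number of questions, and the inclusive boundary test is our own choice for the sketch.

```python
import math

def bucket_error(f, lo, hi):
    """E_i: zero if the predicted rank f falls inside the bucket
    [lo, hi] of equally-informative documents, otherwise the
    distance to the nearer bucket boundary."""
    if lo <= f <= hi:
        return 0.0
    return min(abs(f - lo), abs(f - hi))

def mae_rmse(predicted, buckets):
    """predicted: doc -> predicted rank; buckets: doc -> (lo, hi).
    Returns the bucketed MAE and RMSE over all judged documents."""
    errs = [bucket_error(predicted[d], *buckets[d]) for d in predicted]
    n = len(errs)
    mae = sum(errs) / n
    rmse = math.sqrt(sum(e * e for e in errs) / n)
    return mae, rmse
```

Because RMSE squares each bucketed error, a single document placed far from its bucket dominates the score, matching the intent described above of heavily penalizing large displacements.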


Number: 842 David McCullough
<desc> Description:
Give information about David McCullough, author, his life, works, and/or awards.
<narr> Narrative:
The documentation should provide some information about David McCullough, the author of "Harry S. Truman". Pages on some other David McCullough (even if an author) will not be considered relevant. A quote from one of his books, or a complete citation of one of his books, including publisher, will be considered relevant. However, the observation that someone talked with historian and writer David McCullough about history does not make the page relevant.
</top>

Figure 4-1. TREC topic example

Table 4-1. Breakdown of documents for TREC David McCullough topic

    Questions    #Documents    #Documents         %Documents
    Answered     At Level      Judged Relevant    Relevant
    0            47            31                 32.75
    1            82            53                 64.63
    2            36            29                 80.55
    3            0             0                  0.00
    4            1             1                  100.00
    5            1             1                  100.00

Table 4-2. Breakdown of documents for TREC by question type

    Question      #Documents            #Documents         %Documents
    Type          Answering Question    Judged Relevant    Relevant
    BOOKS         88                    65                 73.86
    PROFESSION    69                    50                 72.46
    SPOUSE        3                     2                  66.67
    WHENBORN      1                     1                  100.00
    WHEREBORN     2                     2                  100.00

Table 4-3. Question types for countries

    Question Type    Example Questions
    ANTHEM           What is the national anthem of the country?
    CAPITAL          What is the capital of the country?
    CONTINENT        On what continent is the country?
    CURRENCY         What is the primary or official currency of the country?
    DATEFOUNDED      When was the country founded?
    LANGUAGE         What is the primary or official language of the country?
    LEADER           Who is the current leader of the country?
    WARS             In what wars has the country fought?

Table 4-4. Question types for movies

    Question Type    Example Questions
    ACTORS           What actors starred in the movie?
    COMPOSER         Who composed the music for the movie?
    DIRECTOR         Who was the director of the movie?
    GENRE            What is the genre of the movie?
    LENGTH           How long does the movie run?
    RATING           What is the MPAA rating of the movie?
    STUDIO           What studio produced the movie?
    WRITER           Who was the screenwriter for the movie?

Table 4-5. Question types for persons

    Question Type    Example Questions
    ALIAS            By what other names was the person known?
    BOOKS            What books did the person write?
    PROFESSION       What was the profession of the person?
    SPOUSE           To whom was the person married?
    WHENBORN         When was the person born?
    WHENDIED         When did the person die?
    WHEREBORN        Where was the person born?
    WHEREDIED        Where did the person die?

Table 4-6. Judged object names

    Authors             Presidents           Movies                     Countries
    Agatha Christie     Barack Obama         Citizen Kane               Egypt
    David McCullough    George Washington    Gone with the Wind         France
    Jane Austen         Herbert Hoover       Raiders of the Lost Ark    Mongolia
    Steven King         John Adams           Terminator: Salvation      United States

Table 4-7. Comparison of Lucene and Bing

    Object Type    System    Bpref     RMSE       MAE
    country        Bing      0.5448    7.9927     5.5246
                   Lucene    0.6262    9.1108     6.7374
    movies         Bing      0.5556    12.1739    8.6458
                   Lucene    0.4136    13.5755    11.0108
    persons        Bing      0.4351    12.1540    9.0355
                   Lucene    0.5151    11.4676    8.3820
    average        Bing      0.5118    10.5447    7.7353
                   Lucene    0.5172    11.3846    8.7101

Table 4-8. Comparison of Lucene and Bing on authors

    Question      System    Bpref     RMSE      MAE
    ALIAS         Bing      0.2941    7.2864    3.9857
                  Lucene    0.3201    5.8549    3.5017
    BOOKS         Bing      0.4685    7.2966    3.7725
                  Lucene    0.4290    7.9138    4.0763
    PROFESSION    Bing      0.4572    7.3189    3.8285
                  Lucene    0.4695    7.7231    3.6210
    SPOUSE        Bing      0.4367    6.5575    2.8697
                  Lucene    0.1850    7.3295    3.9465
    WHENBORN      Bing      0.4733    7.2454    4.2332
                  Lucene    0.5008    7.0963    3.9115
    WHENDIED      Bing      0.6095    6.4485    3.6599
                  Lucene    0.4425    8.4176    5.1650
    WHEREBORN     Bing      0.3449    6.8804    3.4745
                  Lucene    0.2874    6.0960    3.2868
    WHEREDIED     Bing      0.3080    5.2264    2.1496
                  Lucene    0.0602    7.2694    3.6996
    average       Bing      0.4240    6.7825    3.4967
                  Lucene    0.3368    7.2126    3.9001

Table 4-9. Comparison of Lucene and Bing on presidents

    Question      System    Bpref     RMSE      MAE
    ALIAS         Bing      0.2222    5.7762    2.1316
                  Lucene    0.3289    4.0754    1.5075
    BOOKS         Bing      0.1071    9.6662    6.4103
                  Lucene    0.4541    3.3128    0.9744
    PROFESSION    Bing      0.4596    8.0345    4.3579
                  Lucene    0.4267    7.3590    4.1079
    SPOUSE        Bing      0.4575    8.2807    5.1019
                  Lucene    0.5098    7.9007    4.4785
    WHENBORN      Bing      0.4131    8.3986    5.1976
                  Lucene    0.5729    6.8875    3.6427
    WHENDIED      Bing      0.4818    8.2732    4.8577
                  Lucene    0.6125    7.1090    3.6491
    WHEREBORN     Bing      0.4465    8.3798    5.0872
                  Lucene    0.5279    7.1902    4.0676
    WHEREDIED     Bing      0.4649    7.8097    4.3237
                  Lucene    0.4516    6.7485    4.1459
    average       Bing      0.3814    8.0774    4.6835
                  Lucene    0.4856    6.3229    3.3217

Figure 4-2. Comparison of Bing and Lucene at different question levels. A) Author. B) Country. C) Movie. D) Person. E) President.

CHAPTER 5
ENTITYSCORE - INFORMATION QUANTIFICATION

5.1 Introduction

As discussed in previous chapters, the typical purpose of information retrieval (IR) systems is to present to the user documents that meet an information need. To accomplish this task we have discussed several different models of IR. The concept of the model is important, not to constrain IR systems to be constructed in a certain way, but rather to understand how each step the IR system takes in retrieving documents goes toward meeting the information need. The IR model contains three major parts: indexing, query relevance, and ranking. The three work together as a process of retrieving information that is relevant to the user of the system. Indexing is how the system stores the data. As discussed in Chapter 2, this is usually a bag-of-words approach, with each word stored along with the number of times it appears in each document. Query relevance is the process or calculation of determining whether a document is relevant to a query or, to be more precise, whether a document meets the information need of a user. This ranges from the Boolean approach, in which the document is considered relevant if a Boolean evaluation of the query is satisfied, to clustering based on vector space, to question answering, which tries to determine if the document answers a specific question. While the first two parts of an IR system would allow a system to retrieve documents, the third part, ranking, has increased in importance as the size of the corpora with
whichIRsystemsworkhasgrown.Rankingisthesystem'sattempttodeterminewhichdocumentsaremorerelevanttothequerywhenmorethanonedocumentisjudgedrelevant.TogethertheseaspectsgiveanunderstandingofhowanIRsystemmeetsaninformationneed.WhileIRsystemscanretrieverelevantdocumentstosomedegree,thereisadisconnectbetweenthedesiretomeetaninformationneedandwhetheranyparticulardocumentmeetsthatneedduetothemodelsbeingmostlyuninformed.Thatis,the 94 <br /> <br /> PAGE 95<br /> <br /> traditionalIRmodelsdonothaveanunderstandingofwhatthetermscontainedinthedocumentorquerymean1.Itwouldmakenodifferenceifthetermswerereplacedwithsymbolsthatwerenotunderstoodbyhumansthesystemswouldperformjustaswell.Thus,thesystemsbasedonthesemodelscannotmakeanobjectivejudgmentaboutwhetheradocumentcontainsinformationinthewaywewouldexpectahumantolookatadocumentandmakethatdecision.Primarily,techniquesthathaveastatisticalbasisareusedinsystemsderivedfromvector-basedorprobabilistic-basedmodels,althoughexceptionstothesemodels,suchasPageRank,doexist.APageRank-orgraph-basedIRsystemusessemanticinformationthatisderivedwhenapersoncreatesapageontheworldwideweb(WWW)linkingtoanotherpage.ThisuseofinformationhasprovedsuccessfulandisthebasisofmanylargecommercialsearchenginessuchasGoogleandBing.Toimproveonthesemethods,wetakeadifferentapproachtocreatinganIRmodel.Ratherthanfocusingonhowtoretrieveorrankdocuments,wetrytodeterminewhatinformationadocumentcontains.Thisinitselfisnotanewendeavorandisthefocusofresearchinnaturallanguageunderstanding,questionanswering,tagging,informationextraction,andthesemanticweb.WhatnoneofthesemethodshasbeenabletoachieveisawayofusingtheinformationtoenhanceIR.Ratherthantryingtoimproveonallqueries,wefocusonnamedentities.Thesearespecicobjectssuchaspersons,movies,books,oranythingelsethatcanbeidentiedbyname.Thisisdifferentfromageneralsearchengineduetothedesiretodeterminewhetheradocumentcontainsinformation.Therearedifferentwaystoidentifyquestionsthatwedeterminetobeimportantaboutthetypeofentity.Theproblemof 
1QuestionAnsweringseemstobeanexceptiontotherule,buttypicallythequestionansweringsystemusesthemeaningoftheparticularquestiontodeterminewhattypeofanswerisdesiredratherthanthemeaningoftheterms. 95 <br /> <br /> PAGE 96<br /> <br /> identifyinghowmuchinformationadocumentcontainsaboutasingleentityisone.Thisisimportantduetothedifcultyofnaturallanguageunderstanding.Bylimitingourselvestonamedentities,wecancreateasystemthatcanidentifyinformationindocumentsduringindexing.Ifwetriedtodealwithalltypesofqueriesoralltypesofinformation,wewouldnotbecapableofunderstandingwhatwasimportantpriortoquerying.Whileinthepasttherehavebeensystemsthathavetriedtoexpandindexingtoincludeconceptsorothersemanticmeaning,theseindexingsystemsdidnotndsuccessinIR[ 64 112 ].Byonlyfocusingonnamedentities,webelievethatwecanfocusonspecictypesofinformationandshapetheIRtoimprovethespecicqueriesaboutthesenamedentities,thusimprovingonpreviousrankings.InthischapterwepresentanewmodelforIRbasedonthequantityofinformationaboutanentityinadocument.InSection2wepresentworkusingrelatedconcepts.InSection3,werestatetheproblemanddiscusstheIRmodelsthatwehavepresentedinthepreviouschapters.Section4concernsthedifferentpartsofthemodel.WeconcludeinSection5. 5.2RelatedWorkNamedentitiesarebeingexploredinseveralareasofIR.InChengetal.theEntityRanksystemisdescribed[ 16 ].Inthissystemthemetinformationneedisthatofnamedentitiesratherthandocuments.Examplesofthismightbeauserwhowantstondthepriceandcoverimageofaparticularbookorastudentwhowantstondalistofprofessorsinaparticularsubjectareaacrossmultipleprograms.Inbothofthesecasesatraditionalsearchenginewouldhelp,buttheuser'sneedwouldmostlikelyrequiremultipledocumentstoresolve.Thisresearchfocusesonhowtorankentitiesbymodelinghowanentitywouldstabilizeastheuserlookedacrossmultipledocuments.ThisworkisfurtherexpandedbyfocusingonimprovingtheindexingofsuchasystemtoimproveperformanceinChengandChang[ 15 ].Thissystemcontrastsfromourresearchbyfocusingontheentitiesthemselvesratherthantheresultingdocuments. 
96 <br /> <br /> PAGE 97<br /> <br /> Luetal.[ 65 ]usednaturallanguageprocessing(NLP)onquerystringstoidentifysemanticentities,whichwerethensupplementedwithadditionalknowledgeaboutentitytypestodeterminerankingsfordocuments.Thisapproachimprovedoverawebsearchengine,butitwaslimitedtoNLPofthequery.Workhasalsobeenperformedwithannotateddata.InChakrabartietal.anewproximityqueryisintroduced,andscoringandindexingisstudiedtosupportthesetypesofqueries[ 14 ].Inthisworkthefocusisonallowingproximityqueriesbetweendocumentsannotatedwithentitynamesandonreducingthesizeoftheindicesneededtosupportsuchqueries.InGraupmannetal.,asystemisdescribedthatusesthetypeofentitiestoallowforbooleansearchesthatincludefeaturesofthedesiredresult,suchastitleandauthorofabook[ 35 ].Therehavebeeneffortstochangethewaydocumentsareindexedbeyondthebag-of-wordsapproach.Gonzaloetal.[ 33 ]usedWordnetsynsetstoimproveonbag-of-wordsindexinginthevectormodel.WoodsproposedusingaConceptualIndex[ 112 ].CroftandLewisattemptedtobuildasystemusingcaseframestounderstandtheunderlyingideasinadocument[ 22 ].Theseapproachesproposedindexingofdocumentsinvolvingnotonlylistsofwordsbutalsoanindexinaconceptspace.Litkowskiproposedindexingdocumentsusingsemanticrelationtriples[ 64 ].Inthisapproach,fordifferentquestiontypes,documentsareparsedforthesentencestheycontainaswellasforrelationshipsbetweenentitiesinthesentenceandtheobjectsofthesentence.Theserelationshipsarethenstoredintriplesandusedforquestionanswering.Linexpandedonthisworkandusedtherelationsforamoregeneralwebsearch[ 62 ].Anotherareaofrelatedresearchispersondisambiguation,whichdetermineswhatdocumentsarerelatedtowhichoneofseveralpersonswiththesamename.Niuetal.[ 75 ]usedmachinelearningtechniques,includingco-occurrenceofwordsandnaturallanguageinformationtodisambiguatebetweennames.Vuetal.[ 107 ]used 97 <br /> <br /> PAGE 98<br /> <br /> 
webdictionariestoclusterdocumentsrelatedtothesamepersonandfoundthattheyoutperformedvectorspacemodels.LiketraditionalIRmodels,theseapproachesdonotidentifysemanticinformationinthedocumentsbutratherfocusondeterminingtowhatparticularnamethedocumentisrelated.Whilewedonotaddressdisambiguationinthisstudy,ourmethodwould,nonetheless,solvetheproblembydeterminingwhichfactsarerelatedtoeachoftheentitiesthataretobedisambiguated.Eachdocumentwouldthenbejudgedtorelatetotheentityaboutwhichitcontainedinformation. 5.3ProblemStatementThespecicproblemthatweareattemptingtosolveis:Givenanamedentity,determinethequantityofinformationonthepagethatrelatestothatentity.AddressingthisproblemisnotinitselfanIRproblem,butourmethodofquantifyingthisproblemlendsitselftoanIRmodel.So,toexpandtheproblemstatementintoamodel,wecreatedasystemthatretrievesthedocumentsthatcontainthemostinformationaboutaparticularnamedentity.ThedifferencebetweenourproblemandthosethatotherIRmodelsaresolvingisnotapparentatrstglance.AllIRsystemsareessentiallyaboutmeetinganinformationneed.Theuserpresentsaninformationneedintheformofaquery,andthesystemreturnsdocumentsthatmeetthatinformationneed.Achievingthisgoalturnsouttobefairlydifcult.Toreallyunderstandtheinformationneedandwhetheritismetmostlikelyrequiresunderstandingofthelanguageinwhichthedocumentandpossiblythequeryareexpressed.Ingeneralsearchenginestherealsomightbeconsiderationoftheintentofthesearch.Thatis,whatdoestheuserintendbytheparticularquery?ButnaturallanguageunderstandinghasnotprogressedveryfarintheeldofIR.TheTRECtrackdealingwithnaturallanguagewasnotwidelysupportedandwaslastofferedaspartofTREC-6.[ 103 ]Questionansweringisanareawherenaturallanguagecanbeused,butcurrentlytherearenosuccessfulcommercialquestionansweringsystemsonthescaleofthecommercialsearchengines.Thereareseveralreasons 98 <br /> <br /> PAGE 99<br /> <br /> 
forthis,butthetwomainreasonsseemtobethatthesearchenginesavailablenowworkfairlywell,andnaturallanguagesystemstendtobeslowwhenusedinrealworldapplications.FirstisthePageRankmodel.Theproblemthatthismodeladdressesis,givenarandomwalkaroundtheWWW,uponwhichdocumentsisausermostlikelytostumble?ThusanIRsystemthatisbuiltuponthismodelndsdocumentsthatareconsideredrelevantandranksthemessentiallybypopularity.Adocumentthatismorepopularwillhavemorelinkstoit,andthusauserismorelikelytodiscoveritrandomly.Ifonedocumenthasmoreinwardlinksthananother,andbotharerelevant,theformerwillberankedhigherthanthelatter.Thus,thismodelmeetstheinformationneedofauserbyassumingthatifapageismorepopularandisjudgedrelevant,itismorelikelytomeettheuser'sinformationneed.While,strictlyspeaking,pagerankisarankingalgorithm,andtheindexingandqueryrelevancecanbeinterchanged,theunderlyingassumptionisthatalldocumentscanberankedifthereisawaytolteroutwhatdocumentsaredenitelynotrelatedtotheuser'squery.Thesecondmodelweconsideristheprobabilisticmodel.Theproblemaddressedbythismodelistheprobabilitythatthedocumentisrelevant.Inthismodeladocumentisconsideredrelevantwhenthetermsthedocumentcontainsaremoreprobablyinthesetoftermsfromalldocumentsthatarerelevantthanthesetoftermsfromthedocumentsthatarenotrelevant.Thusadocumentisconsideredmorelikelytomeettheuser'sinformationneedifitcontainstermsthataremoreofteninthesetofdocumentsthatarerelevanttothequery.Finallyweaddressthevector-spacemodelorclusteringmodel.Theproblemthismodeladdressesisthatofhowcloseonedocumentistoanother.Inthismodeladocumentisconsideredrelevantifitisclosertothequeryinthespaceofdocuments.Whilethisissimilartotheprobabilisticmodel,themaindifferenceisthatthequerytermsaretheonlytermsonwhichwecanjudgethedocuments.Ifthequeryisonlyasingle 99 <br /> <br /> PAGE 100<br /> <br /> 
term,wewouldimaginethatthedocumentsthatareclosesttothequeryarethosethatmostoftenusetheterm.Ifnormalizationisused,themostrelevantdocumentwouldbeoneinwhichthelargestpercentageofthedocument'stermsarequeryterms.Inallofthesemodels,theinformationneedisaddressedwithoutknowledgeorunderstandingofwhatthequeryordocumentsmean.Therearemethodsofinformationretrievalthatdotrytotakeintoaccountthemeaningofeitherthequeryorthedocument.Inparticular,questionansweringinvolvesunderstandingthemeaningofthequerytoimprovetheabilityofthesystemtoretrievetheneededparticularinformationfromthedocuments.Amongothertechniqueswehavediscussed,questionansweringhasbeenshowntoimprovewhenthefocusofthequestionandtheexpectedtypeofansweraredetermined.Thisprocessisgenerallyperformedatruntime,though,andisnotnecessarilyafastprocess.WhileallofthesemodelsareimportantandworkforIR,noneofthemdirectlyaddressesthequestionofwhetheradocumentcontainsinformationaboutthequery.ThepagerankmodelsubstitutesthesemanticinformationitderivesfromthelinksfoundinWWWdocuments.Theprobabilisticmodelassumesthatinformationwillbecontainedinthetermsthatoccurwithhigherprobabilityintherelevantdocumentsoverthenon-relevantdocuments.Thevectormodelusestheoccurrenceofthequerytermsinthedocumentsasindicationsofinformation,weightingthembytheabilityofeachtermtodistinguishbetweenclusters.Questionansweringdoesinfactmakeuseofthemeaningofthequery,butthefocusofthequeryistypicallyaparticularpieceofinformationandisnotajudgmentofanentiredocument.ThislackofunderstandingdoesnotmeanthattheIRsystemsarebrokenbutratherthatthereisadditionalinformationthatcanbeusedtoimproveIR. 5.4ANewModelTounderstandthemotivationofournewmodelitisimportanttorealizethatthemaingoalofourmethodisnotIRbutratherthequanticationofinformationonapage. 
100 <br /> <br /> PAGE 101<br /> <br /> Thisgoalcouldbemetinseveraldifferentways,butwechoosetofocusonwhatisessentiallyataggingmethod.Thedifferencebetweenwhatwearetryingtoachieveandmanyothermethodsoftaggingisthat,inthetypeofinformationthatwearetaggingnotonlydoesadocumentrelatetoaparticularentitybutitalsocontainsaparticularpieceofinformationaboutthatentity.Toquantifyinformationinadocumentitisnecessaryrsttodeterminewhatwewillconsiderinformation.Forasystemtohavetheabilitytoquantify,thethingforwhichitislookingmustbeidentiable.Forexample,wemightconsideritinformativethataquerytermexistsinadocument.Inthatcase,asystemcouldjustcountthenumberoftimesthetermexistedinthedocument.Alternatively,wecouldhaveasystemthatcountsthenumberofsentencesinwhichthetermappears.Continuinginthisvein,wemightdecidetonormalizethecountssothatlongerdocumentsdidnothaveanadvantageovershorterdocuments.Likewise,whenhandlingaquerywithmultipletermswecouldweightthetermsbasedonthenumberofdocumentsthatcontainedeachterm,resultinginasystemsimilartothevectormodel.Toimproveonthis,weconsiderthecasethatthequeryisanamedentity.Namedentitiesarephrasesthatcontainthenameofspecicthingssuchaspersons,locations,companies,orbooks.Forexample: <ENAMEXTYPE="PERSON">DavidMcCullough</ENAMEX>wasbornin<TIMEXTYPE="DATE:DATE">1933</TIMEX>.Thissentencecontainstwonamedentities:apersonandadate.Namedentityextractionisanimportanttaskofinformationextraction.Itisalsousedinquestionansweringtoidentifytypesofinformation.TheMessageUnderstandingConferenceshaveproducedaschemaforentityannotation,whileforquestionanswering,BBNTechnologiesproducedaschema[ 10 17 ].Witheachnamedentitywehaveanameandatypeofentity.InthecaseofDavidMcCulloughwehavethenameofaperson.Thecategoryortypeofanentityallowsus 101 <br /> <br /> PAGE 102<br /> <br /> 
to apply specific types of information. As we discussed in Chapter 4, specific questions can be asked about each person, with more general questions that apply to the category of the entity. For example, a person has a date of birth. We could also further consider the year, month, and day of birth as different pieces of information about a person. If a system existed that could accurately extract or mark each type of information for each type of entity, we could then quantify the information about a named entity by counting the different pieces of information in a document. This is the basis of our model of IR. Given a document and an entity, we can tell how much information the document contains by the number of questions answered or the features of the entity type the document contains. To create a system that implements this model, we must address the three parts of an IR model: indexing, query relevance, and ranking. In the next section we describe several different systems based on different IR techniques that together implement this model.

5.4.1 Indexing

In the traditional IR models, the index of a document D would be the sum of its terms D_i = {t_1, t_2, ..., t_n}. In most cases, each value is a normalized count of each term. However, for our new model we would rather entities be first-class objects in our index. For any particular document there should be an indexing of its entities D_i = {E_1, E_2, ..., E_n}. This will allow our system to easily identify which documents relate to which entities. This in itself is not enough to identify information in the document, but as each entity is stored with the number of times it appears in the document, we have constructed a small hybrid of a bag-of-words approach and a semantic index. What we desire instead is to identify how many features of the entity the document contains, so the index needs to be expanded to include those features. While this is not strictly necessary because the features could be extracted at runtime, it seems essential to the performance of the system that the system does not have to process data at runtime.
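A minimal sketch of such an entity-first index, using plain dictionaries: each document maps the entities it mentions to per-feature scores rather than to raw term counts. The entity and feature names here are illustrative, not drawn from the actual index.

```python
# {doc_id: {entity_name: {feature: score}}}
index = {
    "doc1": {"David McCullough": {"BIRTHDATE": 1.0, "BIRTHPLACE": 0.0}},
    "doc2": {"David McCullough": {"BIRTHDATE": 0.0, "BIRTHPLACE": 0.4}},
}

def documents_for_entity(index, name):
    """All documents whose entity index mentions the named entity."""
    return sorted(doc for doc, entities in index.items() if name in entities)
```

With this shape, finding the documents that relate to an entity is a direct lookup rather than a term search.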
To define the ontology of an entity, we return to QA and the schemas created to improve question answering systems. For every entity category that we wish to index, we intend to create a set of questions that define the features of that category. This process is similar to the one for evaluation described in Chapter 4. Ideally, we could then build a question answering system that would return all the documents that answered any particular question. Practically, though, this type of processing is not available to us. Question answering works by retrieving documents and then trying to determine candidate passages from the documents that contain the answer. Our system needs not only the best candidate passages that contain an answer but all documents that contain the answer. This distinction is subtle but significant. The more documents containing correct answers about a particular entity that are missed, the less sure we can be about our ranking. Thus we would need the question answering system not to answer questions but rather to give a judgment on each document as to whether it answered the particular question about the particular entity. While this may someday be possible, currently this is beyond the state of the art. To get around the limitations of question answering, we choose to approach the problem as one of tagging. Given the significant amount of structured data now available, instead of starting with a question and determining whether a document answers the question, we can retrieve the answer to the question from a structured database. While this limits us to fact-based questions, it also gives us two benefits. First and most importantly, when a fact comes from a structured source we assume that it is accurate. In question answering, much work goes into determining the correct answer, which is not necessary when retrieving from a database of answers. Secondly, when using question answering, it is difficult to know whether the answer to a particular question is for the named entity that is being examined. This is a difficult problem because we might have several entities, especially people, with the same name.
Given a set of facts, an index is created by determining which documents contain those facts. This is the opposite of a question answering or information extraction system, as rather than extracting data or answers from a document, the system will instead start with the named entity, question, and answer, and create the index based on this. Thus the resulting index might be D_i = {E_1: #person; E_2: #movie; ...}, where each entity is defined by a name and has scores for each feature in the feature set of the category of the entity. In the example of person we may use features such as #person = {BIRTHDATE, BIRTHPLACE, ...}. For each feature, the system needs to make a binary decision on whether the document contains the answer. This goal is a difficult problem. Even with the answer, the language of the document can make it difficult even for a native speaker to correctly determine its presence. An example of this is the sentence "David McCullough grew up in Pittsburgh." Does this sentence tell us the birthplace of David McCullough, which is indeed Pittsburgh? Not necessarily. Although not every question answering decision will be this difficult, the variety of language makes it a problem worth solving. However, we are not going to address it at this point, instead offering a heuristic for determining if the document contains the answer. The problem of determining if a document contains information is exactly the original problem of IR. So, to create the index for our new model we are going to use a traditional IR system based on Lucene to determine whether a document contains the particular answer for which we are looking. In this case, a document is considered relevant if it contains the name of the entity and the fact that defines the answer. Instead of placing a 1 or 0 in the index, the system will use the score for each particular feature. So for each entity name, our index is created in the following steps:

1. Retrieve the features relating to the name from the knowledge base.
2. For each fact, construct a query to retrieve documents that contain that fact.
3. Record a score for each (person, fact, document) combination in the index.

Using this method, our system will have an index that allows quantification of the information in documents.
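The three steps above can be sketched as a single loop. Here `retrieve_facts` and `search` are toy stand-ins for the knowledge-base lookup and the Lucene retrieval described in the text; the names, facts, and scores are invented for illustration.

```python
def build_index(names, retrieve_facts, search):
    """Steps 1-3: look up each name's facts, retrieve candidate documents
    for each fact, and record a score per (name, feature, document)."""
    index = {}  # {doc_id: {name: {feature: score}}}
    for name in names:
        for feature, fact in retrieve_facts(name):      # step 1
            for doc_id, score in search(name, fact):    # steps 2 and 3
                index.setdefault(doc_id, {}).setdefault(name, {})[feature] = score
    return index

# Toy stand-ins for the knowledge base and the retrieval step.
facts = lambda name: [("BIRTHDATE", "1933"), ("BIRTHPLACE", "Pittsburgh")]
search = lambda name, fact: [("doc1", 0.5)] if fact == "1933" else [("doc2", 0.8)]
idx = build_index(["David McCullough"], facts, search)
```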
5.4.1.1 Retrieving facts

While there are several ways that we could retrieve facts, including systems that data-mine Wikipedia such as YAGO, which is an ontology derived from Wikipedia and WordNet [94], or using a question answering system, we chose Freebase [5] due to its open API and its structure of information being similar to our needs. Taking advantage of this type of structured data is then a task of deciding which information is useful. Freebase is a collaboratively edited structured database, which human users can update and edit, much like Wikipedia. It offers a web interface as well as a structured query language called MQL, which our system uses to obtain the facts. Freebase's data is essentially flat but does contain different groups of features for each type of object. Examples of the types of objects available in Freebase are shown in Table 5-1. Although the various types are grouped into related categories such as award, book, film, and people in the example types, there is no taxonomy for a person. For each name, the types of facts that relate to the person have been applied. So a query on an author will return facts under the book/author node, and an actor would have a film/actor node if it applied. In our system we do not index the types of named entities. Each name is evaluated on the facts related to the type, so type can be inferred from the set of features that are included in the index for the named entity and document. This situation is reasonable for small sets of names, but it will lead to confusion when dealing with multiple entities of the same name. To deal with that situation, unique names could be assigned for each entity with the same name. Our system does not deal with this situation, as the ranking that we wish to achieve is done on an individual entity and not on the name. Since we are not concerned with multiple entities of the same name, the fact retrieval process allows Freebase to suggest the specific instance of the entity to which the name refers.
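An MQL read takes the shape sketched below: a JSON template in which properties set to null are the blanks the service fills in. The property names here are illustrative rather than verified against the Freebase schema, and the public Freebase service has since been retired, so the response is canned for the example.

```python
# Blank (None) values are the slots an MQL read fills in.
mql_query = [{
    "name": "David McCullough",
    "type": "/people/person",
    "date_of_birth": None,       # blank to be filled
    "place_of_birth": None,
}]

# A canned response of the shape the service returned.
response = {"result": [{"name": "David McCullough",
                        "type": "/people/person",
                        "date_of_birth": "1933",
                        "place_of_birth": "Pittsburgh"}]}

# Keep only the filled-in blanks as (feature, fact) pairs for indexing.
facts = {k: v for k, v in response["result"][0].items()
         if k not in ("name", "type")}
```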
5.4.1.2 Query construction

Once the system has a set of facts, those facts are used to construct queries that retrieve candidate documents to evaluate. This process is similar to an IR system that would be used in question answering. Where a question answering system would use the question to create a focus and answer type that would satisfy the information need, our system already has the particular answer. Since our answers are terms that we expect to find in documents containing the answer, our system can create queries using the answers with an expectation of high precision. This reliance on the answers does not guarantee high recall, as answers might be contained in different forms than simply the terms we retrieved during our fact retrieval step. A good example of this is a document in a language different from the one in which the fact term appears. This type of relevance is outside the scope of our current research. The main concern with creating queries is to limit the number of false positives. The simplest query that could be constructed would be to just use the answer. For example, David McCullough was born in 1933. When constructing a query for documents that answer the question, we could simply use the query 1933. While this query would retrieve all the documents that contained 1933, there would be many documents that would not contain the answer. To further improve on the performance we also add the name of the entity to the query. In our example our query would then be David McCullough 1933. While this appears to be a reasonable solution to the problem, it falls victim to the loss of precision that is found with query expansion. Each term that is added to the query gives another dimension on which a document might be considered relevant. So in our example, documents that contain David, McCullough, or 1933 would all be returned by an IR system. To avoid this situation we have further refined the query to require both the name of the entity and all the terms of the fact in the document. In Lucene this query is created by adding a + to each word to force the system to only consider documents that contain
each term. Thus our candidate documents are going to contain both the name and the terms contained in the fact. The final method our system uses to prevent documents that do not contain the required information from being retrieved is to require the query terms to appear within a certain distance of one another. This particular requirement is a result of the insight that simple facts can be stated in the course of a sentence. Continuing our example of David McCullough's birth date, we would expect the answer to appear in a document in the form David McCullough was born in 1933. As language allows for many variations on this structure it would be easier to merely require the terms to appear in a sentence, but that would not be enough, as we might also find the information in forms such as McCullough, David (1933) or a table of data giving a small biography of each of several persons. This is another area that could be addressed by analyzing the documents post-retrieval, but to keep our indexing system as close to a traditional IR system as possible we chose to solve the problem by creating proximity queries. Proximity queries only require the terms to be within a certain distance of each other. Where a phrase query such as "David McCullough" only matches the exact phrase David McCullough, the proximity query would also match the phrase McCullough, David. If we expanded the distance allowed between terms we could also match documents that contain the middle name as well. By combining these methods of query construction, documents that are retrieved will be required both to contain the terms in which we are interested and to contain them in a proximity in which we would expect the terms to appear in a document that was stating the fact under consideration.
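The required-term and proximity forms described above can be sketched as plain Lucene query strings. The slop value of 2 and the example terms are illustrative choices, not parameters taken from the actual system.

```python
def required_terms_query(name, fact):
    """Prefix every term with '+', marking it required: a document must
    contain the entity name and all fact terms to match at all."""
    terms = name.split() + fact.split()
    return " ".join("+" + t for t in terms)

def proximity_query(name, slop=2):
    """A Lucene proximity ('sloppy phrase') query: the terms must appear
    within `slop` positions, so 'McCullough, David' also matches."""
    return '"%s"~%d' % (name, slop)

q1 = required_terms_query("David McCullough", "Pittsburgh")
q2 = proximity_query("David McCullough")
```

In Lucene's classic query syntax, `+term` marks a required term and `"a b"~N` is a phrase query with slop N; widening the slop admits variants such as a middle name between the two name terms.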
5.4.1.3 Document scoring

Once the system has a set of candidate documents for a named entity and fact, a score needs to be established for each document and entered into the index. Document scoring in this context is the score given to each document for each feature or fact related to the entity type. Ideally, that score would be 1 if the document did contain the fact and 0 otherwise, but this is difficult as we do not know whether the terms contained in the document imply that the document does indeed contain the fact. A good example of this is to consider documents that contain the birthplace of John Adams, the second president of the United States. John Adams was born in Braintree, Massachusetts, now known as Quincy. If our system retrieved the birthplace of John Adams as Quincy, it would attempt to find documents that contained the name John Adams and the term Quincy, which would also commonly occur in documents about his son, John Quincy Adams, the sixth president of the United States. It is also expected that the terms of the name and fact are going to show up in documents coincidentally. This is a normal IR problem. Without natural language techniques it is difficult to determine whether the appearance of the correct terms means that the document does indeed contain the correct information. Even the proximity query that we intend to use will not resolve the previous example. So we choose to accept this while indexing. Our score will be a normalized result of the score applied to the document during the IR process. To score the documents we use a vector IR score: tf·idf. The tf·idf score for the documents is a traditional vector-based score, where the weight of each term is dependent on the frequency of the term in the document. Typically tf·idf is calculated as

tf·idf(t, d) = tf(t, d) · log(N / df_t)

where tf(t, d) is the number of times term t appears in document d. The term frequency is multiplied by the idf, calculated as the log of N, the total number of documents in the corpus, divided by df_t, which is the number of documents in which the term appears. Lucene adds a normalization factor for the length of the document, which is multiplied by the entire score.
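A minimal sketch of the tf·idf weight in the equation above, without Lucene's length normalization; the toy corpus is invented for illustration.

```python
import math

def tf_idf(term, doc, corpus):
    """tf(t, d) * log(N / df_t), per the equation above."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if tf and df else 0.0

corpus = [
    ["david", "mccullough", "born", "1933"],
    ["pittsburgh", "history", "born"],
    ["david", "letterman"],
]
# "mccullough" appears in 1 of 3 documents, so its idf is log(3);
# a term absent from the document scores 0.
score = tf_idf("mccullough", corpus[0], corpus)
```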
The document scores are given for a particular term t for each document. When we score a document we are not concerned just with specific terms, but rather with facts and names of entities. For a document index we consider our desired indexed form: D_i = {E_1: #person; E_2: #movie; ...}. For entity E_1 we wish to score each feature of the #person entity type. In our example of David McCullough's birth date, the document score we wish to index is {David McCullough: BIRTHDATE: ds} where ds is the document score for the birth date feature for David McCullough. It is important to note that, unlike information extraction and related tasks, the information that we retain does not include the actual answer. That information has already been made available to the system through the fact retrieval mechanism. The index merely records whether or not the fact is contained in the document. A lack of judgment on any particular (document, named entity, feature) combination can be viewed as a judgment that the information is not contained in the document. The document score is then calculated as:

ds(n, f, d) = (tf·idf(n, d) + tf·idf(f, d)) / N

where n is a named entity, f is a fact, for all documents that are retrieved for the query, and N is the normalization factor. The normalization used for our system is the maximum document score for a particular feature. That is:

N = max(ds(n, f, D))

where n is a particular named entity, f is a particular feature, and D is the set of all documents on which this fact is evaluated. This is not strictly necessary, but it prevents a single feature from outweighing the other features. For example, if a term has a very low idf score it will result in a lower tf·idf score than a term that has a high idf. It is due to the nature of these scores that terms that appear in more documents have a lower idf. While individually this works well for IR, we desire the combination of the name of the
entity and the fact in a document to hold the same weight as any other fact appearing in the document. The last consideration that the system gives the document score is in the case of a list of facts. Lists of facts are encountered when the particular feature has more than one specific fact that relates. Examples of this are people with multiple aliases, authors of multiple books, or countries when considering leaders over time. Each of these facts can be stated as a single fact. For example, the United States's president is Barack Obama. A document that contained that statement would be scored relevant on that particular feature. If we had another document that contained the fact that George Washington was the first president of the United States, we might want to mark that relevant as well. With time we can look at the distinction between these two documents as one having the current leader and the other having a former leader. While that allows us to distinguish between features, there still exists a possibility that multiple facts will appear in the same document for the same feature. In our example, we considered George Washington as a former leader of the United States. If the document contained information on multiple former presidents, it would seem that the document is more informative than a document that only contained information about a single former president. This concept could be indicated in the index in one of two ways. The first would be to treat each possible fact for a feature as a separate feature. Thus the index for a former leader might be {United States: FORMER LEADER: George Washington: ds; John Adams: ds; ...}. The main drawback to this particular approach is that dividing a feature into multiple features gives more weight to a document that contains more than one. Rather, we treat this as we would any other feature by calculating the document score as:

ds(n, F, d) = Σ_{f_i ∈ F} ds(n, f_i, d)

where F is a set of facts related to a single feature. By adding the related scores together, the concept of more informative in the dimension of that feature is represented
by a higher score. The more facts in the feature that the document scores, the higher this feature's score will be. This results in a score that is similar to a feature with a single fact, with the exception that not every fact in the feature is required for the document to have a score. That is, previously only a document that contained both the terms that composed the fact and the name of the entity would receive a score. For features with multiple facts, the document is scored separately on each fact, and then an aggregate score is normalized and applied. Here the normalization factor is again the maximum document score for the feature space. While the number of possible items in the list would seem to be the appropriate normalization, the goal is still to have equivalent scores in each of our feature spaces.

5.4.2 Query Relevance

Once an index has been established, the system needs a method for determining if documents are relevant to any particular query. For our research, this has been accomplished by limiting queries to the names of the entities that have been indexed by the system. This makes query construction and retrieval simple, but it does not completely answer the question of what makes a document relevant. The simplest way to achieve query relevance is to consider any document that has the name indexed to be relevant. This would result in every document that was originally retrieved for a named entity being considered relevant if any feature's terms appeared in the document. In our model, this is a general search engine related to a named entity. As in a general search engine, any document that contains the search terms would be considered relevant. To improve upon this performance we choose the features of the named entity to define the type of search we wish to make. The following are a couple of examples of how this might work. The first is to have an entity type search. If we had a search for Gone with the Wind, it would be useful to specify that the search was for documents that contained information about the movie or the book, depending on which type of
information we were looking for. This happens by considering only the features that relate to the type of entity to which the search was related. To specify a particular query type, the query can be specified as movie:Gone with the Wind or book:Gone with the Wind. This can be extended to other types of queries beyond those related to entity types. If the user desired documents that contained particular features, it would be a waste of time to create other types of queries. An example of this is a query for documents that contain information about the death of a person. While it might be useful to return documents that contain information in all the features of a person, merely doing this would return documents that did not contain any information about a person's death in addition to those desired. A document that contains a person's birth date or profession would be considered relevant to a person but not to this particular information need. To extend the query relevance mechanism, the system supports creating new query types. In our system a query type is a list of features that will make a document relevant to the named entity given as the query. This can be created ahead of time as preset queries, or it can be given by the user to define the type of query he or she wishes to run. As features get added to the index, the ability to describe what type of information makes a document relevant will increase the expressiveness of describing the information need. If we consider a single feature, such as birth date, this method is equivalent to a question answering system for the question "When was the person born?" When expanding to multiple features our system returns documents that potentially answer multiple questions. This should improve on question answering systems in the area of speed. Question answering has been fairly slow due to the amount of processing at runtime. Since our system is not involved in retrieving an exact answer, the document retrieval time is as fast as any other IR system. The sacrifice is that the answer to the question must already have been returned by the fact retrieval system. Any question that has not already been answered cannot be used or returned by the system. Since
our system is concerned with returning documents that contain information about a particular named entity, this is only a drawback in the overall retrieval system. Our system could be considered for the IR stage of a QA system, as we would expect documents that contain information about a named entity to be likely candidates for other answers related to the entity. A final improvement over a simple query, which is the name of the entity, is the ability to require or exclude certain facts. In the example of a query that targets documents containing information about a person's death, we could create a query type that includes facts such as the date of death, the place of death, and the cause of death. In this case any document that contains any of these three pieces of information would be considered relevant. To further refine this query, we want the user to be able to specify which features must be included for a document to be considered relevant. For example, if the information need required that the cause of death be specified, the query could be George Washington +CAUSEOFDEATH. The reverse of this is to specify exclusions. A query that specified the information need for documents about an author that did not contain information about the books the author wrote might be given as person:David McCullough -BOOKS. While we do not find this type of query particularly useful, the ability to specify what information is required allows query refinement not possible in the search engines available today. The equivalent query expansion to require the birth date of a person would oblige the user to know the birth date prior to querying to add it into the query terms: David McCullough +1933. Alternatively, the user could use other terms that have meaning relative to the information she or he is seeking: David McCullough +born +on or David McCullough +birthday. While these queries will indicate a similar information need, we feel that the ability to use semantic modifiers is much more expressive of the actual information need. When requiring a specific term or several specific terms in relevant documents, the results are limited to documents that do indeed contain those terms.
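The relevance rules described above can be sketched as a check against a document's per-feature scores: relevant if it scores on any query-type feature, on every required feature, and on no excluded feature. The feature names and scores are illustrative, and the parsing of the +/- query syntax is omitted.

```python
def is_relevant(feature_scores, query_features, required=(), excluded=()):
    """Query-type relevance with required (+) and excluded (-) features."""
    has = {f for f, s in feature_scores.items() if s > 0}
    if not has & set(query_features):
        return False                      # no query-type feature present
    if any(f not in has for f in required):
        return False                      # a +feature is missing
    if any(f in has for f in excluded):
        return False                      # a -feature is present
    return True

doc = {"DATEOFDEATH": 0.7, "CAUSEOFDEATH": 0.0, "BIRTHDATE": 0.3}
death_query = ["DATEOFDEATH", "PLACEOFDEATH", "CAUSEOFDEATH"]
```

For this document, the death query alone matches (it scores on the date of death), but adding +CAUSEOFDEATH rejects it, as would excluding BIRTHDATE.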
If that is what the user's information need is, then the query terms do indeed suffice. Alternatively, if the information need is actually for documents with the birth date, our method allows that information need to be expressed. Our method would return all the documents that contained David McCullough's birth date, including those that contained the terms born on and birthday.

5.4.3 Ranking

The IR model our system is implementing is predicated on the idea that documents with more information in them are ranked first. That is, the quantity of information that is contained in the document is measured, and documents with more information are ranked higher than documents with less. As we have shown in the preceding sections, a system can be created that attempts to measure the number of facts that a document contains about a named entity. Thus if our query is a named entity, the system would be correct in ranking documents by the number of facts that the document contained. This is the basis of our ranking system. Given an entity type and a name, it ranks the documents from the most facts contained to the least. In the most general query this can be accomplished in several ways. The first is to consider each document and sum the scores for each feature. We call this score EntityScore, and we use it to rank the documents as:

EntityScore(n, t, d) = Σ_{f ∈ t} ds(n, f, d)

where n is the name of the entity, t is the type of the entity, d is the document, f is a feature for the type of entity, and ds is the document score method used in the indexing step. This results in a score that ranks documents for each name and type according to which has the highest sum of scores for each feature. In the case that ds(n, f, d) is binary, this sum is equal to the number of facts that the system has judged the document contains. In our system the binary assumption does not hold, so this method of calculating EntityScore can result in situations where a short document judged to contain a single feature may be ranked higher than a longer document containing
several facts, due to the normalization factor. The normalization could also result in two documents judged to contain the same features having differing scores, with the shorter document being ranked higher. While this is consistent with the normal IR heuristic that a shorter document that contains the terms is less likely to have done so by chance, when trying to find the document with more information, we might want the longer document to receive preference. In either case our model would want a document that contains more facts to be ranked higher than one that contains fewer. What this means is that if a document contains a fact about an entity, it is likely that the document contains another fact about the same entity. To reflect this we want to weight the score of a feature more if it is found in a document that contains other features of the same entity. This insight leads to the second way to calculate the EntityScore. Rather than simply summing the scores for each feature, we will add a weight to the document that indicates how many features the document contains about an entity. We refer to this score as the WeightedEntityScore and define it as:

WeightedEntityScore(n, t, d) = weight(n, t, d) + Σ_{f ∈ t} ds(n, f, d)

where weight is given as:

weight(n, t, d) = Σ_{f ∈ t} ceiling(ds(n, f, d))

where the ceiling function will be 1 for all nonzero document score values, since ds is a number between 0 and 1 inclusive. The score for a document is the number of features judged to be on the page, plus a weighting factor, which is the sum of the actual feature scores. An example of how this would work is given for the sample document scores in Table 5-2. Under EntityScore, where the score is a sum of document scores of all the features, we would get the following ranking of documents: {d1: 1.0000, d3: 1.0000,
d4: 1.0000, d2: 0.4000}. In this case there is no distinction among documents 1, 3, and 4, as the documents with the same score are returned in a random order.² If we assume that each document does indeed contain information about the features that are nonzero (i.e., the system has not misjudged any documents), the correct ranking would be {d1, d2, d3, d4} with no distinction made between documents 1 and 2 or 3 and 4. That is, both documents d1 and d2 contain two facts about the entity so they should be ranked first, and documents d3 and d4 both contain one fact and should be ranked after documents d1 and d2, with no judgment made on which document should be ranked first at each level. Without calculation it is apparent that the best case scenario for EntityScore in this case is the one given above, and in the worst case it would rank d1 lower than documents 3 and 4 due to the random nature of similarly scored documents. That is, under the EntityScore rank the system could also produce the following ranking: {d3: 1.0000, d4: 1.0000, d1: 1.0000, d2: 0.4000}. This is a valid ranking as EntityScore makes no distinction between documents of the same score. Further exploring the example documents and scores, we find that WeightedEntityScore ranks the documents as: {d1: 3.0000, d2: 2.4000, d3: 1.0000, d4: 1.0000}. The ability of this method to rank documents with more features higher is exactly what the original model prescribes. Additionally, the WeightedEntityScore ranks documents at each level of information. Using this method, we can extend the ranking to our other query relevance examples. The first case to consider is what we refer to as the query type. To further our examples we might create a birth date query. In this query we want to retrieve a document based only on whether it contains a birth date.

² In an actual system this case is extremely unusual due to the normalization of documents. In our experience we have found that documents overlap when using EntityScore only when the documents have identical content.

Referring back to Table
5-2, we can see that the ranking of the documents under EntityScore would be {d4: 1.0000, d1: 0.5000, d2: 0.1500, d3: 0.0000}. In this case we include document 3 even though it is considered not relevant because the ranking mechanism does not make relevance judgments. Under WeightedEntityScore the ranking would be {d4: 2.0000, d1: 1.5000, d2: 1.1500, d3: 0.0000}. While this ranking is the same as the EntityScore, the comparison still breaks down as the query type expands to multiple questions. If we assume that the table contains a subquery of a larger index, the scores would match those given in the previous example. This makes this approach similar to defining a new named entity type. If we had a named entity that only featured BOOKS and WHENBORN, the ranking would return the document that had the most information about that query first, which is the goal of our IR model. However, in this case the assumption that this document is the most informative about the entity is incorrect. The knowledge of additional features that are known about a particular entity means that a query type limits the amount of information used to rank the documents. While this seems to be undesired behavior, it is useful to consider what would happen if either system considered all the features when using a type query. In our example of a birth date query, the ranking of the documents using all the features would again be the same as the original example, with EntityScore being ranked {d1: 1.0000, d3: 1.0000, d4: 1.0000, d2: 0.4000} while WeightedEntityScore was {d1: 3.0000, d2: 2.4000, d3: 1.0000, d4: 1.0000}. In both of these rankings the document with no information about birth date is ranked higher than a document that has information about birth date. In the case of EntityScore it could be ranked first due to the random ordering of the first three documents. This particular example is unlikely as we would not expect the query relevance mechanism to return a document that has no information about the query, but it is easy to imagine a case where the ranking would be unexpected; for example, a query type that had four features out of ten total features of an entity type. In this case we might find a document that only had information about one of the features
ranked higher than a document that had information about all four of the features. This could happen in the case where the document with information about only one feature had information about the six features not used for the query, and the other document had no additional information. It is important to notice that either ranking could be desired for the documents. If the query was meant to find the documents with the most information about the features used in the query we would prefer the first ranking. On the other hand, if we desired a ranking of documents such that the highest document had the most information in total about the named entity, the second ranking would be desired. As described, the first ranking is the result of a query type. Rather than changing the ranking of a single query type due to additional information, we will consider the other query enhancement described in the previous section: the ability to require certain information to be included in a particular query. When describing a query that includes all the features given in Table 5-2, +WHENBORN can be added to the query to require that the birth date feature be included in the answer. This query then can be ranked using all the features, and document 3 will never be included in the ranking. This gives the user a way to specify an information need for a document that has the most information about a name that contains a birth date. Even if the required feature was not a feature that composes the query type, the ranking would still include the documents that contain the most information across the features of the query while containing the required features. Excluding features works the same way, expressing a desire for documents that contain the most information about either an entity type or a query type while not containing any information about a particular feature. Together, these query relevance methods work with the ranking method to create a system where relevance and ranking of documents follow our original motivation of finding the documents with the most information about an entity. Using query types, the
system can specify what information is important for the purpose of ranking documents. Including required and excluded features in the query expresses an information need for the document across all the features of the query type while utilizing all the features of the type for ranking.

5.5 Conclusion

In this chapter we have presented a new model for IR. Starting with a desire to quantify information in a document, we have defined a way to retrieve and rank documents based on the number of facts that they contain about a named entity. While this is not a general search engine, this system could be blended with a general search engine to improve queries related to named entities. In the next chapter we evaluate how well this model ranks documents based on the queries we would expect such a system to handle.

Table 5-1. Example of Freebase's types applied to people

Category   Fact
award      award winner
book       author
film       actor
people     deceased person
people     person
people     profession

Table 5-2. Short example document scores for a person

Document   Document fact   Document score
1          BOOKS           0.5000
1          WHENBORN        0.5000
2          BOOKS           0.2500
2          WHENBORN        0.1500
3          BOOKS           1.0000
3          WHENBORN        0.0000
4          BOOKS           0.0000
4          WHENBORN        1.0000

CHAPTER 6
ENTITYSCORE - EVALUATION
6.1 Introduction

In the previous chapter we introduced a new information retrieval model based on identifying facts about named entities. In this chapter we evaluate the performance of this method both by comparing it to a traditional IR system and by comparing it to other systems based on similar information. The goal of this evaluation is to determine whether this method works well not only for cases in which we expect it to, but also in situations in which the facts that the system uses to quantify information are either wrong or undetermined. To understand the types of situations in which we expect systems based on this model to perform well, we have to revisit the assumptions of our IR model. The first and most important assumption is that the document containing the most information about a named entity is the document that best meets the information need of a user querying for documents related to that named entity. The second assumption is a slight modification of the first: the document that contains more facts about a named entity contains more additional information related to the named entity. While we have not directly addressed this, it is the basic assumption that documents related to the named entity will contain facts about the named entity. The more extensive the information the document contains, the more likely that it will contain various facts that we can identify about the named entity. Under these two assumptions we expect that systems based on our model will perform well on queries in which the system can accurately identify many facts about a named entity. The desire to have more facts is based on the first assumption, and the more facts we correctly identify in a document, the better the ranking of the document will be. When combined with the second assumption, though, we also expect that the system will be more accurate in determining whether a document contains a fact if
it contains other facts about the same named entity. This means that we expect that documents that contain more information will be more accurately identified as having additional information. For example, when considering whether two documents contain a fact about a named entity, it is more probable that the document that contains additional information about the entity also contains the fact than the document that does not contain any additional information.

The converse of these two cases is also easily seen. When there are fewer facts with which to distinguish between documents, we expect the rankings will not be as accurate. An extreme example of this is the case where we have only one fact known about a named entity, where the only distinction is between documents that contain the fact and those that do not. Since there is no distinction among the documents that contain the fact, we would not expect the system to perform any better than a traditional IR system with a query for that particular named entity and fact. Additionally, if the system has incorrect facts for a named entity, the precision of the system will be degraded. Documents will be incorrectly marked as informative, and the reinforcement that we expect of documents containing multiple facts will not occur.

To evaluate how well our model works in the previously described situations we use the ranked qrel sets introduced in Chapter 4. These qrels are evaluated on various questions used in question answering. We present several different experiments that give an objective measure of how our model and our example systems compare to a traditional IR system.

The rest of this chapter consists of several sections describing our test systems and the various tests that we have conducted. In Section 2 we present the systems evaluated in this chapter. Section 3 covers the general query test, in which the systems are expected to rank documents according to the total information they contain, as
defined in Chapter 4. In Section 4 we present the results from limiting the types of facts that are used to evaluate documents. Section 5 evaluates how accurate the systems are at identifying which documents contain specific facts. Finally, in Section 6 we conclude the chapter.

6.2 Evaluated Systems

This section offers an overview of the systems that we have created to evaluate our model. All the systems that we created are based on our new IR model, except for the Lucene IR system we use as a baseline for evaluating our system. This means that each system will have the facts retrieved from the Freebase knowledge base available for querying. The main distinction then is in how each system indexes, performs query evaluation, and finally scores and ranks documents. Additionally, each of the systems uses the TREC large web query corpus to calculate the idf for documents.

6.2.1 EntityScore

EntityScore is our baseline system as described in Chapter 5. In the following experiments and graphs, this system is referred to as ES. In this system the combination of each named entity and fact related to the entity type is sent through an IR system to retrieve documents that potentially contain the fact. From these queries an index is created that contains the entity name, fact type, and score. Because each fact is considered independently, the query that is used to retrieve documents for indexing is constructed in a way that requires both the fact and the entity name be contained in the document to achieve a non-zero score. For example, when querying for David McCullough's place of birth, a query is constructed such that both the name and the place of birth (Pittsburgh) are in the document. There are several ways of achieving this result, but for the purpose of this system we construct the query: +David +McCullough +Pittsburgh. In this query, the + symbol requires the system to treat all terms as required for the document to be considered relevant, but it does not affect the document score. Thus Lucene scores a document that contains all three terms the
same as it would with the plain query: David McCullough Pittsburgh. If a document did not contain all three terms, though, the more restrictive query would not retrieve the document. Thus, our system tries to recover precision that would normally be lost due to query expansion.

The query to the ES system is simply the name of the entity. The document score is the EntityScore described in Chapter 5, where the system adds all the scores for each feature of the entity type. The results are ranked from highest to lowest score.

6.2.2 Weighted EntityScore and Counted Systems

The Weighted EntityScore is an extension of EntityScore as described in Chapter 5. In this chapter we refer to this system as WES. The main difference between the two systems is that, whereas the ES system's score is the score given to a document, the WES system adds a weight equal to the total number of facts found in the document. The purpose of this weighting is to prevent two issues that we expect to affect the ES system. The first is that two different facts will be weighted differently due to the different idf of the terms related to the fact. Thus if two facts have only a single term, with the first fact's term having an idf of 10 and the second fact's term having an idf of 1, the ES system would score documents with the first term higher. The second case is similar but is related to the length of the document (the normalization factor) and the number of times the terms appear in the document (term frequency). In both of these cases the WES system tries to alleviate the effect of these factors by first ranking all documents by the number of different features each contains about a named entity, and then ranking them within each level by the EntityScore.

We have also created a further system, which we refer to as the counted system, based on this method. In this system we merely count the number of features that have a non-zero score without applying the additional sorting at each level. This method should give an indication of whether the EntityScore ranking improves the ranking at each level. If the counted system performs better, it indicates that EntityScore is no better than randomly ordering documents at each level.
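The three ranking schemes just described (ES, counted, and WES) differ only in their sort key. A minimal sketch, assuming per-document fact counts and summed EntityScores have already been computed; the names `rank_documents`, `doc_facts`, and `doc_scores` are ours, not the dissertation's:

```python
def rank_documents(doc_facts, doc_scores, mode="wes"):
    """Rank documents for one named entity.

    doc_facts:  {doc_id: number of distinct facts found in the document}
    doc_scores: {doc_id: summed EntityScore over all the entity's facts}
    """
    if mode == "es":          # EntityScore alone
        key = lambda d: doc_scores[d]
    elif mode == "counted":   # fact count only; order inside a level is left undefined
        key = lambda d: doc_facts[d]
    else:                     # WES: fact count first, EntityScore breaks ties
        key = lambda d: (doc_facts[d], doc_scores[d])
    return sorted(doc_facts, key=key, reverse=True)
```

Because Python's sort is stable, the counted mode leaves documents within a level in their original retrieval order, which mirrors the within-level behavior discussed later in this chapter.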
6.2.3 Query Expansion and Focused Expansion

The Query Expansion system is an implementation of a clustering system in the feature space of the terms relating to the entity type facts. We refer to this system as qe for the purposes of this chapter. When performing traditional query expansion, terms that have a high co-occurrence with the query terms are appended to the query to expand the number of documents that will be related to the query. As discussed in Section 2.5.1, this tends to lead to less precise results because more documents will be considered relevant to the query due to the addition of terms that are not specific to the information need. In the case of this system, though, what we are really creating is an ideal document. This concept is at the very core of the vector model. In the feature space in which the documents exist, the document that is closest to this query would be the most relevant to the query, since it contains the expected information about the named entity. In an ideal set of facts this would mean the query expansion technique is equivalent to the ES systems.

The problem with this assumption is that the set of terms used for the facts can be quite large and possibly inaccurate. In Figure 6-1 we give an example of the query terms and their frequency in the query used for Jane Austen for all features of the PERSON entity type. Many terms appear that we would not expect to have any relevance to Jane Austen, such as discman, sony, and audiobooks, as well as terms that are clearly wrong or misspelled such as austenj, norhthanger, and manfield. While these terms also appear in the ES systems while indexing, the actual effect on the two systems is quite different due to the difference in scoring documents.
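The qe system's "ideal document" can be pictured as a bag-of-words built from the entity name plus every fact's terms, scored against documents with a tf-idf dot product. A rough sketch under that reading; the helper names and the flat `facts` mapping are our assumptions (the actual system runs the expanded query through Lucene):

```python
from collections import Counter

def expand_query(entity_name, facts):
    """Build the expanded bag-of-words query: name terms plus every fact's terms."""
    terms = entity_name.lower().split()
    for fact_terms in facts.values():
        terms.extend(t.lower() for t in fact_terms)
    return Counter(terms)

def score(query_vec, doc_terms, idf):
    """Unnormalized tf-idf dot product between the expanded query and a document."""
    doc_vec = Counter(t.lower() for t in doc_terms)
    return sum(query_vec[t] * doc_vec[t] * idf.get(t, 1.0) ** 2 for t in query_vec)
```

Note that nothing here requires the name and a fact to co-occur: every expansion term contributes independently, which is exactly why list-based facts with many terms can dominate the query.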
When considering the EntityScore based systems, a document only receives a score:

    ds(n, f, d) = (tf-idf(n, d) + tf-idf(f, d)) / N    (6-1)

when both the name and the fact appear in the document. As we described in Sections 5.4.1.3 and 6.2.1, the complete name of the entity as well as all the terms of the fact must appear in a document for the score to be non-zero. This is possible because each feature of the entity type is considered separately when creating the index. On the other hand, when doing query expansion, the semantic meanings of the name and fact are lost due to treating the terms as a bag-of-words. This also affects the system's ability to distinguish between documents that contain a lot of information about list-based facts and facts that only have a single answer. In the example terms for Jane Austen we find that most of them come from the list of books that are related to the author. The same issue is seen in other list-based features such as WARS for the country entity type and ACTORS for the movie type. While this can be alleviated by not using list-based facts, it limits the types of information that the system can use to rank documents. Despite these drawbacks, we expect that the query expansion will perform well when set to the task of ranking documents related to the entity name.

In addition to the straight qe system we also have created a second system that can require a specific fact's terms to appear in the document. This is the Focused Expansion system (fe), and if, in this system, we create a query such as: David McCullough +YEAR BORN, we can require that 1933 be included in any document that the system returns. This allows us to cope with queries in which specific information is required to make a document relevant. We therefore expect fe to perform better than the strict qe in those cases.

Two important caveats make the EntityScore systems preferable to the fe system. The first is that the latter does not implement the exclusion of facts. While we could technically handle such a case, in our experience doing so degrades performance. An
example of this occurs when considering the named entity: George Washington. If we exclude pages with information about his wife, Martha Washington, the system ends up requiring documents to not contain any of the terms of the SPOUSE fact, resulting in the exclusion of all documents with the term Washington. For obvious reasons, this results in no documents being returned. While we could try to alleviate the problem, it was outside the scope of our work to attempt to build a system that could perform the task.

The second weakness of the fe system is not requiring terms of a list-based fact. A good example of where this causes the system to perform poorly is to consider the following query: Jane Austen +BOOKS. If the system required a document to contain all the terms for every book that Jane Austen wrote, then a document with information about just the book Pride and Prejudice would not be considered relevant if it did not also include the terms of all the other books she wrote, such as Persuasion.

6.2.4 Phrase-Based EntityScore Systems

The final approach that we evaluate in this chapter is that of a phrase-based system. While our intention for the current research has been to avoid natural language techniques, it is not extremely robust to rely simply on the terms appearing anywhere in the document as an indication that a fact exists. The ES and WES systems take advantage of other facts about a named entity in the document to reinforce the likelihood of any particular fact also appearing in the document. In the phrase-based system we instead limit the score of a document to be non-zero only if the terms of the fact and the name of the entity appear within a phrase. For current purposes, we test at two phrase lengths, 10 and 50, with the names of those systems being pb_es_10 and pb_es_50 respectively.
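The phrase restriction can be approximated by a sliding window over the document's tokens: the score stays zero unless some window of the given length contains every name term and every fact term. A sketch of that check (our own simplification; it ignores term order and repetitions within the window):

```python
def within_window(tokens, name_terms, fact_terms, window):
    """True if every name term and fact term occurs inside one span of
    `window` consecutive tokens (pb_es_10 corresponds to window=10)."""
    tokens = [t.lower() for t in tokens]
    needed = {t.lower() for t in name_terms} | {t.lower() for t in fact_terms}
    for start in range(len(tokens)):
        if needed <= set(tokens[start:start + window]):
            return True
    return False
```

For the sentence "David McCullough was born in Pittsburgh in 1933", a window of 10 accepts the name/birthplace combination, while a window of 3 rejects it because the name and the fact term never fit in three consecutive tokens.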
6.3 Full Qrel Evaluation

In the previous section we described the systems that we are evaluating. In this section we present the results of evaluating the systems using the full evaluation described in Chapter 4. In this experiment each system is evaluated against the rankings created to compare the Bing and Lucene systems. All fact types given in Table 4-3, Table 4-4, and Table 4-5 are used to rank the documents. Our scoring methods are those presented in Section 4.3.3: RMSE, MAE, and Bpref. In this case, Bpref is the least useful score because it is calculated as though every document that contains a fact is relevant. While that is a useful measure in larger sets when comparing different systems, here it is less so since we are evaluating how well the system ranks documents that are relevant. What it does tell us is whether documents that are not relevant (that is, they do not contain any information about the named entity) are being ranked higher than those that do.

Table 6-1 presents the results of running the systems where the documents are ranked using the fact types from Chapter 4. For our model-based systems (those excluding the Lucene system) there are some differences from our previous evaluation. The main exception is in country, for which Freebase does not make the LEADER fact easily available, so our system does not use it. In the person category, our system uses the additional facts CAUSE OF DEATH, PARENTS, and SIBLINGS.

What we find is that the counted system performs consistently better than all other systems, with the exception of the Bpref score for the person and movie entity types. While the WES system performs second best in all categories on average, it does not perform as well as the counted system despite the fact that both systems rank documents with the same number of features at the same level. The difference between the two systems is that, under the counted system, the ordering of documents at each level is undefined, while the WES system ranks the documents at each level according to the EntityScore of the document. To test whether the EntityScore itself was the cause of the scoring difference, we sorted each level using an ascending EntityScore. For example, if two documents both contained two facts, the document that had the lowest EntityScore would be ranked higher. Testing this method across all entity types resulted in an average Bpref score of 0.8011, an RMSE of 6.1203, and a MAE of 3.7155. While this does suggest that the EntityScore works to rank documents at each level, it means that the way that the counted system sorted documents was better. This can be further seen in the average Bpref at different fact levels in Figure 6-2 and also in further detail in Appendix A. In this case, the ranking for the documents at each level is due to the retrieval order of the Bing system. While we do not address this in our current research, this seems to imply that the better the underlying retrieval system is, the better the ES and WES systems will perform.

Looking further at the results, we find that the phrase-based systems performed worse overall, with the pb_es_50 scoring better of the two. The phrase-based method performs best for the person query. The reason for this is not clear, but there seem to be two possible explanations. One is that documents that contain information about people tend to repeat the person's name more often. The second is similar: documents containing information about movies or countries could tend to include lists of facts about the country (or movie) that may be separate from the name of the entity. These seem to be the same reason, but the second reflects the fact that some documents contain information in table form.

Another point we wish to address from these results is the performance of the qe system. In Table 6-1 we find that the average qe performance is equivalent to that of the ES system. This implies that expanding a single query, with all the terms related to the entity, performs as well as requiring each document to contain all the terms. However, when we examine the performance of each entity type we find that for the person type the ES system performs better, while the qe performs better for the movie category. By examining the Bpref scores at different levels for the individual persons used for this test in Figure A-1, we can see that in the example of Jane Austen, the query expansion performs worse than all the other systems using the fact terms. Some of this difference can be explained by the large number of unrelated terms that are contained in the list of book terms (refer to Figure 6-1
). To test the effect of the book terms on the systems, we removed those terms from the system and reran the specific Jane Austen queries. In Table 6-2 we see that, as we expect, Lucene does not change as we remove fact terms from the available terms for evaluating documents. Beyond that, all other systems rank the documents worse than they do when using the book terms (with the exception of the pb_es_10 for RMSE). In the EntityScore based systems, each term is scored as part of a single feature. In the case of a feature that is a list of facts, the terms only affect whether or not the system considers that a document contains information about that particular feature. When considering this, it is interesting that the qe system changes in Bpref more than the other systems based on our model. We expected that the additional terms that are imprecise would degrade performance, and so removing them would cause the qe system to perform better. Instead, the terms that are related seem to influence the system more than the negative terms. We further explore the effect of adding and removing features from queries in the next section.

6.4 Evaluation Using Partial Information

In the previous section we demonstrated that the EntityScore systems performed better at the task of ranking documents that have been evaluated on the number of facts they contain. While the system performs the ranking task well, this does not imply that the documents with the most information are always ranked highest. One way to understand the problem is to consider a document containing features that the system is not capable of evaluating. Under the assumption that a document with more information would be ranked higher than a document that contains less, this document would never be ranked in the correct place. This problem is analogous to that of other IR systems where a document does not contain the query terms but is indeed relevant to the information need.

For our model we further assume that if a document contains large amounts of information about an entity it will also include facts that we can quantify. This assumption
is useful when we consider the information need expressed by the name of an entity. If we had a theoretical document that contained all possible information about the entity, it would contain all the features that the system could possibly evaluate. Likewise, we expect that a document containing more information about an entity will contain more of the features that we use to rank the documents. Thus we expect that a document ranked higher not only contains more features that the system can evaluate, but also more additional information. This assumption cannot be directly tested. Without an objective way to evaluate additional information in documents, we must instead use the information that has been evaluated: the original features.

In this section we evaluate how well the systems perform when the information that is available to evaluate the documents is less than or different from what is used to judge the documents. This is similar to what was shown in Table 6-2. First, we examine how well the systems perform without the list questions. Second, the systems are further evaluated by removing additional features with which to rank documents.

6.4.1 Evaluation Without List-based Facts

To test the effect of the list-based facts on the results, we evaluate how well the systems rank the documents without any of the list facts. For the country entity type this was the WAR fact. For movies this is the ACTOR, PRODUCERS, STUDIO, and WRITER facts, of which PRODUCERS is not a judged feature. The person entity type excludes the BOOKS, PARENTS, PROFESSION, and SIBLINGS features. Neither the PARENTS nor SIBLINGS features are judged.

What we would expect to see is that the fewer features the system can use to rank the documents, the worse the system's rankings will perform. For example, if we had two documents, d1 and d2, where d1 had two facts including one list-based fact, and d2 had one fact, then the correct ranking would be {d1, d2}. When the list-based question is removed from the information the system can use to rank the documents, the system
could rank the documents either {d1, d2} or {d2, d1}, since both documents would have information about one feature. In this case, the system could continue to rank d1 before d2, which does not increase the error of its ranking, but the error would increase if the documents are ranked {d2, d1}. If both features in d1 are removed from consideration for ranking purposes, the system will always return a ranking with d2 ranked higher. In these cases, we expect the error to increase in relation to the number of documents that did not include the removed features. The other possibility is that the system will either stay ranked the same or even improve. Staying the same would imply either that the features did not contribute to the rankings the system produces or that every document contains those features. Improving, on the other hand, would imply that the feature was being incorrectly applied to documents. This would happen in cases where the field contained incorrect facts, such as we have shown occurred in the Jane Austen book feature.

When considering the results in Table 6-3, Table 6-4, and Table 6-5, we find that, as expected, the country entity sees a small increase in error for the systems, and the movies and person entity queries have a much larger increase in error for both the counted and WES systems. Nevertheless, both of these systems continue to perform better than the other systems. The qe also follows this pattern for country and movie, but for the person entity type query expansion actually improves when the list-based facts are removed. To examine this more closely we ran the experiment twice. First without the BOOK feature, which resulted in a Bpref of 0.8543, an RMSE of 6.2987, and a MAE of 4.1170. While not as large a change as the counted and WES systems, this is indeed an increase in error of about 4%. When run with only the BOOK feature removed, the results were: Bpref of 0.8788, RMSE of 5.1437, and a MAE of 2.8397, for an improvement of 17% on RMSE and 39% in MAE. This result is in line with our expectations of decreased performance as features are removed. The improvement for the book query shows the downside to the query expansion technique, but it is to
be expected as additional terms are added to a query. The expectation is that as the corpus expands from documents that are likely related to the entity to the WWW, the query expansion technique will further degrade when using terms that are not increasing precision, as shown in Shah and Croft [91]. We leave this problem for future research.

6.4.2 Evaluation when Removing Features

In the previous section we experimented with removing the list-based features from the range of features that the systems were able to use for ranking. In this section we continue to remove features, but this time the features are not list-based. In this test we remove half of the features for each entity type: the ANTHEM, CAPITAL, and CONTINENT features for the country type; the LENGTH, RATING, and YEAR features for the movie type; and the SPOUSE, WHEN BORN, WHEN DIED, WHERE BORN, and WHERE DIED features for the person type. These features were picked arbitrarily and could have been any combination of facts.

Here we find somewhat mixed results. In all cases the counted system continues to work best. For all three entity types, both the counted and the WES have similar increases in error. For the person entity type, though, the performance decreases by approximately 50% for both systems. Alternatively, the ES system performs slightly worse for the country query, slightly better for the movie query, and much worse for the person query. This appears to be affected by the additional weight of the name of the entity. In the TREC corpus that the systems use for idf for the terms, the movie terms have a much higher idf than most other terms. The qe system decreases in performance across all types but seems much less affected by the loss of terms than the EntityScore systems.

While the reasons for the differences in changes in the error are not clear, what is important is that our system continues to work well even when features are absent. This can be seen even more clearly in the graphs of the Bpref at different feature levels in Figure 6-2, Figure 6-3, and Figure 6-4. When examining the graphs it is clear that while the Bpref has lowered somewhat at every level, the pattern remains much the same as it did before the features were removed. Documents with more information continue to be ranked higher than documents with less information.
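Since the analysis above leans heavily on Bpref at different fact levels, it may help to see one common formulation of the measure, which looks only at judged documents. This is a textbook Buckley-Voorhees style sketch under our own naming, not the exact code behind these tables:

```python
def bpref(ranking, relevant, nonrelevant):
    """Bpref over judged documents; unjudged documents are simply skipped."""
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N) if N else 1  # guard: no judged non-relevant documents
    nonrel_above, total = 0, 0.0
    for doc in ranking:
        if doc in relevant:
            # each relevant doc is penalized by the judged non-relevant docs above it
            total += 1.0 - min(nonrel_above, denom) / denom
        elif doc in nonrelevant:
            nonrel_above += 1
    return total / R
```

A perfect ranking of the judged documents scores 1.0; every judged non-relevant document that slips above a relevant one pulls the score down.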
6.5 Evaluating How Systems Identify Specific Information

We have shown that the EntityScore based systems perform better than a traditional IR system. Further, we have shown that as information is added to or removed from the facts that the systems can use to evaluate and rank documents, our system still ranks documents well. In this section we examine whether the information in a document can be used to better determine whether a document contains a particular fact.

For this evaluation we create queries for each entity type that match each individual fact. Thus the country entity type has queries for ANTHEM, CAPITAL, CONTINENT, etc., where a document is relevant to the query if it contains that particular fact. For each query that the EntityScore systems are evaluated on, the query is constructed so that the particular fact that the query evaluates is included as required. Thus a query for documents that contain the fact, France's anthem, is constructed as: France +ANTHEM. In this way, the EntityScore based systems rank all the documents that have been evaluated to contain the fact first. This is not true of the qe system. Instead, we also include the fe system. With the fe system, the query is expanded in such a way that the terms that are related to the required fact are in turn required in the query expansion.

Given these query types and queries we compare how well the systems rank the documents when using all available information about the entity type versus ranking the documents using just the single feature. For example, when considering the query, France +ANTHEM, the WES system returns all documents that have the ANTHEM fact with the documents ordered by the number of facts the document contained about France. We compare that to how well the system ranks the documents when only having the ANTHEM fact available to rank the documents.

In Table 6-9 we find that in all cases except for the pb_es_10 and pb_es_50 systems, the ranking of facts improves when using additional features to rank the document. Overall, the counted system ranks the documents better than all other systems, both when ranking with the single fact and with all facts. The difference in performance between the qe and the fe is also interesting. When considering the single fact performance, these systems indicate how well we order documents using just the entity name and fact terms. In the case of the fe, all of the fact terms (in the case of non-list-based features) are required to appear in the document for the score to be non-zero. As expected, the fe performs better in all cases by approximately 10%. When ordering using all facts, the fe system still has a higher Bpref score, but the difference is less. This seems to indicate that the necessity of requiring terms lessens in the presence of other information.

6.6 Conclusion

In this chapter we have explored the performance of various systems based on the EntityScore model. We have found that all of the systems based on our model perform much better than a traditional IR system at ranking documents according to the number of facts contained in the document. The two systems that are based on the count of the number of facts, counted and WES, performed best. Following that, the qe and ES performed well in the general ranking task. Despite that, the qe system seems more susceptible to facts that include a large number of terms that do not specifically relate to the entity name. In particular, list-based facts affected the ranking in a way that we expect from query expansion. While the qe system performed well in the various tests, additional research is needed to determine how extended qrels, with more documents that are not related to the particular entity, perform. This could be accomplished by allowing the various systems to return documents from a larger corpus and judging the top-n documents from all systems to create a qrel after the manner of the TREC qrels [100].

In addition to the general ranking task, we have also evaluated how well the systems rank documents when the facts available do not match the facts on which the documents have been judged. The improvement in ranking of single facts when the documents are ranked by all facts shows that documents that contain more facts about an entity also contain additional information about the entity.
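The TREC-style extension suggested in the conclusion (judging the top-n documents pooled from all systems) reduces to a small set union over the systems' rankings; a sketch, with `pool_top_n` being our name for it:

```python
def pool_top_n(system_rankings, n):
    """Union of each system's top-n documents; the resulting pool
    is what assessors would then judge to form the qrel."""
    pool = set()
    for ranking in system_rankings:
        pool.update(ranking[:n])
    return pool
```

Documents outside the pool remain unjudged, which is exactly the situation Bpref is designed to tolerate.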
Table 6-1. Comparison of systems using all judged facts

  System     Object type   Bpref     RMSE      MAE
  counted    country       0.8948    3.6424    2.1668
  ES         country       0.8669    4.4843    2.8649
  lucene     country       0.6262    9.1108    6.7374
  pb_es_10   country       0.7873    6.1106    4.3127
  pb_es_50   country       0.8301    5.6774    3.8183
  qe         country       0.8545    4.7359    3.0992
  WES        country       0.8722    4.2219    2.7151
  counted    movie         0.9537    3.8342    2.1079
  ES         movie         0.8873    5.7391    3.8323
  lucene     movie         0.4136   13.5755   11.0108
  pb_es_10   movie         0.8386    8.0500    5.3338
  pb_es_50   movie         0.8775    6.7572    4.3597
  qe         movie         0.9558    4.2204    2.4554
  WES        movie         0.9378    4.5915    2.6821
  counted    person        0.9655    3.2502    1.5499
  ES         person        0.9678    4.4699    2.2881
  lucene     person        0.5151   11.4894    8.3931
  pb_es_10   person        0.8934    6.7321    4.1712
  pb_es_50   person        0.9474    5.4079    3.1499
  qe         person        0.9103    5.4705    3.2693
  WES        person        0.9667    3.7321    1.7185
  counted    average       0.9380    3.5756    1.9415
  ES         average       0.9079    4.8978    2.9951
  lucene     average       0.5183   11.3919    8.7138
  pb_es_10   average       0.8398    6.9642    4.6059
  pb_es_50   average       0.8850    5.9475    3.7760
  qe         average       0.9069    4.8089    2.9413
  WES        average       0.9256    4.1815    2.3719

Figure 6-1. Frequency of fact terms for Jane Austen. [The figure lists the expanded query terms with their frequencies, e.g. austen (26), jane (20), and (17), emma (6), novels (6), pride (6), persuasion (4), prejudice (4), sense (4), sensibility (4), alongside many unrelated or misspelled terms such as discman, sony, audiobooks, austenj, norhthanger, and manfield, each appearing once. The full multi-column term list is not reproduced here.]

Table 6-2. Effect of the book feature on Jane Austen query

  System     Test             Bpref     RMSE      MAE
  counted    with books       0.9677    2.9721    1.3889
             without books    0.8839    3.4359    1.7500
             average change   -8.66%   -13.50%   -20.63%
  ES         with books       0.9677    3.0596    1.4167
             without books    0.8774    3.5395    1.7500
             average change   -9.33%   -13.56%   -19.05%
  lucene     with books       0.2387   13.2340    9.2500
             without books    0.2387   13.2340    9.2500
             average change    0.00%     0.00%     0.00%
  pb_es_10   with books       0.9548    5.3333    3.2222
             without books    0.8645    5.2915    3.2778
             average change   -9.46%     0.79%    -1.70%
  pb_es_50   with books       0.9677    3.6209    2.0000
             without books    0.8581    4.5277    2.7222
             average change  -11.33%   -20.03%   -26.53%
  qe         with books       0.8387    6.5680    3.9722
             without books    0.5806    7.7496    4.4444
             average change  -30.77%   -17.99%   -11.89%
  WES        with books       0.9677    2.7386    1.1667
             without books    0.8839    3.2575    1.5556
             average change   -8.66%   -15.93%   -25.00%

Table 6-3.
Performance of systems on the Country query without list-based facts

  System     Test             Bpref     RMSE      MAE
  counted    with list        0.8948    3.6424    2.1668
             without list     0.8758    3.8757    2.2644
             average change   -2.12%    -6.02%    -4.31%
  ES         with list        0.8669    4.4843    2.8649
             without list     0.8534    4.8979    3.2157
             average change   -1.56%    -8.44%   -10.91%
  lucene     with list        0.6262    9.1108    6.7374
             without list     0.6262    9.1108    6.7374
             average change    0.00%     0.00%     0.00%
  pb_es_10   with list        0.7873    6.1106    4.3127
             without list     0.7762    6.4381    4.6503
             average change   -1.41%    -5.09%    -7.26%
  pb_es_50   with list        0.8301    5.6774    3.8183
             without list     0.8058    6.2340    4.3269
             average change   -2.93%    -8.93%   -11.75%
  qe         with list        0.8545    4.7359    3.0992
             without list     0.8382    4.9012    3.3467
             average change   -1.91%    -3.37%    -7.40%
  WES        with list        0.8722    4.2219    2.7151
             without list     0.8645    4.3858    2.8282
             average change   -0.88%    -3.74%    -4.00%

Table 6-4. Performance of systems on the Movie query without list-based facts

  System     Test             Bpref     RMSE      MAE
  counted    with list        0.9537    3.8342    2.1079
             without list     0.9338    4.7052    2.7527
             average change   -2.09%   -18.51%   -23.42%
  ES         with list        0.8891    5.7391    3.8323
             without list     0.8595    6.3346    4.0879
             average change   -3.33%    -9.40%    -6.25%
  lucene     with list        0.4136   13.5755   11.0108
             without list     0.4136   13.5755   11.0108
             average change    0.00%     0.00%     0.00%
  pb_es_10   with list        0.8386    8.0500    5.3338
             without list     0.8283    8.0004    5.1567
             average change   -1.23%     0.62%     3.43%
  pb_es_50   with list        0.8775    6.7572    4.3597
             without list     0.8643    6.6313    4.0685
             average change   -1.50%     1.90%     7.16%
  qe         with list        0.9558    4.2204    2.4554
             without list     0.8949    5.7191    3.7282
             average change   -6.37%   -26.21%   -34.14%
  WES        with list        0.9378    4.5915    2.6821
             without list     0.9092    5.1222    2.9951
             average change   -3.05%   -10.36%   -10.45%

Table 6-5.
Performance of systems on the Person query without list-based facts

  System     Test             Bpref     RMSE      MAE
  counted    with list        0.9655    3.2502    1.5499
             without list     0.8824    4.3688    2.5123
             average change   -8.61%   -25.60%   -38.31%
  ES         with list        0.9678    4.4699    2.2881
             without list     0.8766    5.2415    3.0694
             average change   -9.42%   -14.72%   -25.45%
  lucene     with list        0.5151   11.4894    8.3931
             without list     0.5151   11.4894    8.3931
             average change    0.00%     0.00%     0.00%
  pb_es_10   with list        0.8934    6.7321    4.1712
             without list     0.8076    7.4301    4.8285
             average change   -9.60%    -9.39%   -13.61%
  pb_es_50   with list        0.9474    5.4079    3.1499
             without list     0.8656    6.2414    3.9364
             average change   -8.63%   -13.35%   -19.98%
  qe         with list        0.8616    6.0531    3.9470
             without list     0.8101    5.9521    3.4152
             average change   -5.98%     1.70%    15.57%
  WES        with list        0.9667    3.7312    1.7185
             without list     0.8855    4.5246    2.5230
             average change   -8.40%   -17.54%   -31.89%

Table 6-6. Performance of systems on the Country query removing non-list-based facts

  System     Test             Bpref     RMSE      MAE
  counted    all facts        0.8948    3.6424    2.1668
             facts removed    0.8740    3.8591    2.4132
             average change   -2.32%    -5.62%   -10.21%
  ES         all facts        0.8669    4.4843    2.8649
             facts removed    0.8456    4.5908    3.0494
             average change   -2.46%    -2.32%    -6.05%
  lucene     all facts        0.6262    9.1108    6.7374
             facts removed    0.6262    9.1108    6.7374
             average change    0.00%     0.00%     0.00%
  pb_es_10   all facts        0.7873    6.1106    4.3127
             facts removed    0.7679    6.1411    4.2511
             average change   -2.46%    -0.50%     1.45%
  pb_es_50   all facts        0.8301    5.6774    3.8183
             facts removed    0.8380    5.2631    3.6033
             average change    0.95%     7.87%     5.97%
  qe         all facts        0.8545    4.7359    3.0992
             facts removed    0.8405    5.0918    3.4162
             average change   -1.64%    -6.99%    -9.28%
  WES        all facts        0.8722    4.2219    2.7151
             facts removed    0.8364    4.7554    3.1315
             average change   -4.10%   -11.22%   -13.30%

Table 6-7.
Performance of systems on the Movie query removing non-list-based facts

  System     Test             Bpref     RMSE      MAE
  counted    all facts        0.9537    3.8342    2.1079
             facts removed    0.9460    4.1423    2.3237
             average change   -0.81%    -7.44%    -9.29%
  ES         all facts        0.8891    5.7391    3.8323
             facts removed    0.9186    5.3392    3.5107
             average change    3.32%     7.49%     9.16%
  lucene     all facts        0.4136   13.5755   11.0108
             facts removed    0.4136   13.5755   11.0108
             average change    0.00%     0.00%     0.00%
  pb_es_10   all facts        0.8386    8.0500    5.3338
             facts removed    0.7574    9.4581    6.1827
             average change   -9.68%   -14.89%   -13.73%
  pb_es_50   all facts        0.8775    6.7572    4.3597
             facts removed    0.8719    7.2146    4.4860
             average change   -0.64%    -6.34%    -2.82%
  qe         all facts        0.9558    4.2204    2.4554
             facts removed    0.9533    4.5224    2.5506
             average change   -0.26%    -6.68%    -3.73%
  WES        all facts        0.9378    4.5915    2.6821
             facts removed    0.9220    4.8005    2.8612
             average change   -1.68%    -4.35%    -6.26%

Table 6-8. Performance of systems on the Person query removing non-list-based facts

  System     Test             Bpref     RMSE      MAE
  counted    all facts        0.9655    3.2502    1.5499
             facts removed    0.9321    5.3160    3.1066
             average change   -3.46%   -38.86%   -50.11%
  ES         all facts        0.9678    4.4699    2.2881
             facts removed    0.9388    5.6532    3.4020
             average change   -3.00%   -20.93%   -32.74%
  lucene     all facts        0.5151   11.4894    8.3931
             facts removed    0.5151   11.4894    8.3931
             average change    0.00%     0.00%     0.00%
  pb_es_10   all facts        0.8934    6.7321    4.1712
             facts removed    0.7940    8.5348    5.5383
             average change  -11.13%   -21.12%   -24.68%
  pb_es_50   all facts        0.9474    5.4079    3.1499
             facts removed    0.8758    7.1285    4.3926
             average change   -7.56%   -24.14%   -28.29%
  qe         all facts        0.8616    6.0531    3.9470
             facts removed    0.8532    6.5122    4.2825
             average change   -0.97%    -7.05%    -7.83%
  WES        all facts        0.9667    3.7312    1.7185
             facts removed    0.9309    5.6378    3.3542
             average change   -3.70%   -33.82%   -48.77%

Table 6-9.
Bpref scores for single facts

  Entity type  System     Bpref       Bpref        Percent
                          all facts   single fact  change
  country      counted    0.8304      0.7265       14.30%
               ES         0.7462      0.6915        7.92%
               fe         0.8007      0.6432       24.50%
               lucene     0.2994      0.2994        0.00%
               pb es 10   0.5718      0.5773       -0.96%
               pb es 50   0.6184      0.6551       -5.60%
               qe         0.7491      0.5919       26.55%
               WES        0.7970      0.6915       15.26%
  movie        counted    0.8250      0.7459       10.60%
               ES         0.7781      0.7393        5.25%
               fe         0.8245      0.6919       19.17%
               lucene     0.2555      0.2555        0.00%
               pb es 10   0.6555      0.5817       12.69%
               pb es 50   0.7215      0.6931        4.11%
               qe         0.8037      0.6032       33.23%
               WES        0.8209      0.7393       11.04%
  person       counted    0.8604      0.7956        8.14%
               ES         0.8271      0.7918        4.46%
               fe         0.7882      0.7801        1.04%
               lucene     0.4203      0.4203        0.00%
               pb es 10   0.7135      0.6687        6.69%
               pb es 50   0.7705      0.7646        0.77%
               qe         0.7333      0.7025        4.38%
               WES        0.8487      0.7918        7.19%

Figure 6-2. Bpref at fact levels for all facts (panels: A Author, B Country, C Movie, D Person, E President)

Figure 6-3. Bpref at fact levels without list-based facts (panels: A Author, B Country, C Movie, D Person, E President)

Figure 6-4. Bpref at fact levels when systems are missing facts (panels: A Author, B Country, C Movie, D Person, E President)
CHAPTER 7
CONCLUSION AND FUTURE WORK

In this study we have presented a new model of information retrieval (IR) based on the informativeness of a document about a particular named entity. While traditional IR models have been the basis of general search (that is, information retrieval for any information need that can be stated in a query), our model focuses on well-defined entities so that the system can define features that would be expected in a document for a particular entity type. We started with the idea of a document answering numerous questions about a named entity and proposed a model that would use facts retrieved from a knowledge base as the information to be judged in the document. By limiting ourselves to facts, we hoped to increase precision by minimizing the number of wrong answers. Another advantage of using facts is that we can focus on the actual terms of the answer and avoid additional terms that may or may not be part of the actual answer.

To evaluate systems based on our new model, we explored the creation of a new evaluation method that was not tied to a particular corpus. While research has relied on the evaluation of binary preference sets, we focused on the Bpref measure, which only considers evaluated documents. The underlying assumption that a document is relevant to a query regardless of its context was expanded to the associated idea that if a document meets an information need better than another document, it should be ranked higher, without need for a corpus. This insight, combined with a ranking of documents based on the number of questions each answers about a named entity, allowed us to create a system that could objectively compare the performance of two systems that did not have access to the same corpus. Further, this evaluation method allowed us to evaluate systems based on our new model by overall ranking.

Using this new method of evaluation, we built several systems based on our new model and compared them to a traditional vector-based IR system and a query
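The Bpref measure discussed above can be sketched as follows. This is a minimal implementation of the standard TREC definition, in which only judged documents contribute to the score; the document IDs and judgment sets are invented purely for illustration:

```python
def bpref(ranking, relevant, nonrelevant):
    """Bpref: each retrieved relevant document is penalized by the
    fraction of judged non-relevant documents ranked above it.
    Unjudged documents are ignored entirely."""
    R, N = len(relevant), len(nonrelevant)
    if R == 0 or N == 0:
        return 0.0  # degenerate judgment sets; handled conservatively here
    score, nonrel_above = 0.0, 0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_above += 1
        elif doc in relevant:
            score += 1.0 - min(nonrel_above, R) / min(R, N)
    return score / R

# Hypothetical judged ranking: d1, d3 relevant; d2, d4 non-relevant.
print(bpref(["d1", "d2", "d3", "d4"], {"d1", "d3"}, {"d2", "d4"}))  # 0.75
```

A ranking that places all judged relevant documents above all judged non-relevant ones scores 1.0, which is what makes the measure usable without exhaustive corpus-wide judgments.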
expansion system using ranked query relevance sets, where documents had been judged on the number of questions they contained. Using these query relevance sets we found that systems based on the new model far exceeded the traditional IR system in accuracy. Additionally, our systems that rank documents by the number of facts they contain, WeightedEntityScore and counted, worked better than the systems that were closer to traditional query expansion. This also held true when we evaluated how well the systems ranked documents when they were supplied with incomplete information.

The final evaluation we performed examined how well the systems ranked documents when considering individual facts. We first evaluated whether a document contained a particular fact and then ranked documents according to each system's score for the entire document. This improved the performance of the systems when compared to their ranking of the documents when only considering the single fact. These results lend weight to the intuition that documents with some information about a particular named entity will tend to have other information about the entity, which is the basis of both the clustering assumption and the probabilistic assumption of those respective IR models.

The results of our experiments lead us to believe that our new model can function as the basis of a practical system using information available today. Additionally, we have created an objective way to evaluate the ranking of IR systems that can be transferred without the need for large corpora. Finally, we have presented a new model that will be the foundation of additional research identifying information in documents and using those judgments as the basis of rankings.
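The core idea behind the counted system (rank a document by how many known facts about the entity it mentions) can be sketched as below. The fact list and documents are invented for illustration, and naive substring matching stands in for the far more robust fact detection a real system would need:

```python
def fact_count(text, facts):
    """Count how many of the entity's known facts appear in the text.
    Case-insensitive substring matching is a stand-in for real detection."""
    lowered = text.lower()
    return sum(1 for fact in facts if fact.lower() in lowered)

def rank_by_facts(documents, facts):
    """Order documents by the number of facts they contain, best first."""
    return sorted(documents, key=lambda doc: fact_count(doc, facts), reverse=True)

# Hypothetical facts about George Washington.
facts = ["February 22, 1732", "Mount Vernon", "Martha Washington"]
docs = [
    "A short note mentioning Mount Vernon.",
    "Born February 22, 1732; lived at Mount Vernon with Martha Washington.",
    "A page with no relevant information.",
]
print(rank_by_facts(docs, facts)[0])  # the document containing all three facts
```

The single-fact evaluation described above corresponds to first filtering documents by whether `fact_count` finds one particular fact, then ordering the survivors by their whole-document score.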
7.1 Future Work in Evaluation

To evaluate our IR model, we have created a new evaluation metric based on judging documents by the number of features about a particular named entity for which they contain information. While we have shown that it met our expectations by showing that the Bing search engine ranked documents better than the Lucene IR system, there is still work to be done evaluating other systems to determine whether these rankings can be transferred from system to system. One problem that we find in comparing two systems is that, without replicating these systems, it is hard to compare performance. With our evaluation method, this could be alleviated by making available the exact set of data that was originally used to evaluate the system. As we have shown, the qrels can be transferred along with the judgments of which features each document contains. If multiple systems published results against the same set of data, this would allow for system comparison without the need for large corpora such as the TREC corpus that we used in Chapter 4.

Alternatively, it would be possible to compare systems by querying a system, then judging the resulting set of documents and using a base system to rank the documents. This would be similar to how we used the Lucene system in Chapters 4 and 5. In Chapter 4, we retrieved documents from Bing, then judged them and compared that ranking to how Lucene would rank those documents. In Chapter 5, we compared how well the EntityScore systems ranked those same documents, comparing those results to Lucene as well. Additionally, we could have compared Lucene to another search engine such as Google by using that search engine to retrieve documents for an entity also retrieved by Bing and then comparing both to Lucene. Additional research is then necessary to see how well we could compare Bing to Google using two different queries via a third system, for cases where we cannot directly compare two systems.

Another area of research is how the size of the qrel affects the comparison of systems. In our current research we used qrel sizes of 40 documents for convenience and because of the time it takes to evaluate documents for multiple features. As additional documents are added, we would expect the error for the evaluated systems to increase: as the number of documents at each level increases, the error caused by misjudging the level at which a document belongs also increases. While that is expected, it is necessary to evaluate whether the relative performance of different systems remains the same as the number of documents increases.

Finally, we have focused on using qrels created from the results of a single system. In the TREC corpus the qrels are created from documents submitted by the multiple systems being judged. While that is not strictly necessary, what will be interesting is to use our method of creating ranked qrels to evaluate systems on specific problems. Documents from blogs, newspapers, books, or just normal web pages could be included. As long as the ranking remains stable, judged on the number of features that the document contains, the source of the document does not matter, but what information it contains does. Examples of problems that could be represented in this manner are grouping newspaper articles according to the particular story to which they relate, or judging how well the system can cope with terms that match features for a named entity but that do not actually contain the information that the feature represents.

The problem of newspaper articles is somewhat interesting. Articles about the government, e.g., about two different bills that share quotations by the same people on the same day, might contain many of the same features but differ on the key point. To create such a qrel, we would have to expand our judgments beyond simple factoid features. The problem of documents that contain misleading terms can be seen in the example of John Adams's birthplace, Quincy. To test how well a system copes with both John Adams's birthplace and his son, John Quincy Adams, we might choose documents about both people and then judge them accordingly.
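The error growth discussed above can be made concrete with the two error measures reported in Tables 6-5 through 6-8 (RMSE and MAE). This sketch assumes the reference qrel and the system assign rank positions to the same judged documents; the positions are invented:

```python
import math

def rank_errors(reference, system):
    """RMSE and MAE between a reference ranking and a system ranking,
    both given as dicts mapping document id -> rank position."""
    diffs = [reference[d] - system[d] for d in reference]
    mae = sum(abs(x) for x in diffs) / len(diffs)
    rmse = math.sqrt(sum(x * x for x in diffs) / len(diffs))
    return rmse, mae

# Hypothetical judged ranking vs. a system that swaps two documents.
reference = {"d1": 1, "d2": 2, "d3": 3, "d4": 4}
system = {"d1": 1, "d2": 4, "d3": 3, "d4": 2}
rmse, mae = rank_errors(reference, system)
print(round(rmse, 4), mae)  # 1.4142 1.0
```

Because RMSE squares each displacement, a single badly misplaced document moves it more than MAE, which is why the two measures diverge in the tables as qrels grow.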
7.2 Future Work for the EntityScore Model

There are many areas for advancing research on the EntityScore model which we have not addressed. The first and simplest area for expansion is to increase the number of entity types that the system can evaluate, as well as the number of features on which the system can evaluate documents. Increasing entity types is an exercise in identifying the type of entity that a name implies. Once a particular entity type has been identified, the ability to retrieve facts about that type of entity is necessary. Whether this IR model performs well on entities that have fewer fact types is a matter of additional research.

In addition to expanding the types of entities that can be indexed, additional work is necessary to increase the system's ability to retrieve particular answers for questions about an entity type. Much research has been done in the area of information extraction, which would expand the breadth of the system. Our original intention of using question answering to find answers for the different features of the entity type is another avenue that could be explored to identify information on pages, either in the area of reading comprehension or as a retrieval mechanism for the answers used to evaluate documents.

Focusing on question answering, there are two related issues that would benefit from further research. The first is paragraph extraction. In some question answering systems, after candidate documents are retrieved, individual paragraphs are extracted that are expected to contain the desired answer. In our model we could attempt something similar by judging individual paragraphs for information. Rather than retrieving entire documents, the system could present the paragraph that contains the most information about the entity. This could also be used as summarization, with the paragraphs containing the most information presented alongside the result links to give the user an idea of what information a document contained about the entity. This idea of paragraph evaluation is different from the systems we presented in Chapter 6, which attempted to evaluate whether the document contained facts in a phrase. In that case, each phrase was considered individually; here, each paragraph would be treated as a whole and the entire process of scoring could be applied.

The second area related to question answering is that of natural language processing. In this study we have avoided any type of natural language techniques, but when determining whether a document actually contains an answer it might be useful to use semantic or syntactic information. Our results demonstrated that the systems that count the number of facts in a document have less error in their rankings than the systems that merely apply a score based on the terms. We would expect that additional accuracy in determining whether a document contains a fact would decrease the error in ranking further.

Another area for additional work related to the EntityScore model is querying. While we have demonstrated a way to create queries by using the entity names, optionally requiring certain facts to be either included in or excluded from a document, we have not dealt with more general query types. More specifically, our model needs to be expanded to handle queries in which a named entity is part of a query. An example of this might be the query: George Washington Bunker Hill. This query could indicate several different information needs. The first is that the user desires documents about George Washington that mention Bunker Hill. In this case the system could treat the query similarly to how we treated single facts: the documents are first judged on whether they contain the query terms and then ranked according to how many facts they contain about George Washington. A second information need that this query might indicate is the opposite of the first: documents that contain information about Bunker Hill and that contain the terms George and Washington. Finally, the system could interpret this as a query with two named entities and try to return documents that contain the most information about both entities. To evaluate this type of information need, new ranked qrels could be created that are ranked based on the meaning of the information need.

The final area of work that we wish to consider is that of combining an EntityScore model with a graph-based IR system. A system based upon the PageRank algorithm ranks documents according to how likely it is that a user would view that page on the world wide web when randomly following links from page to page. This can also be viewed as a popularity or voting mechanism: documents that have more inward-bound links are more popular, and thus it is more likely a user might view them. This ranking mechanism could be applied to the different levels of documents in the counted system. For example, if multiple documents contained five facts, these documents could be sorted according to their PageRank. This combined system would have the effect of ranking documents such that the highest ranked document would be the most popular document with the most information about an entity.

APPENDIX A
BPREF FOR INDIVIDUAL NAMED ENTITIES USING ALL FACTS

Figure A-1. Bpref at different levels of facts for all persons (panels: A Agatha Christie, B Barack Obama, C David McCullough, D George Washington, E Herbert Hoover, F Jane Austen, G John Adams, H Stephen King)

Figure A-2. Bpref at different levels of facts for all movies (panels: A Citizen Kane, B Gone With the Wind, C Raiders of the Lost Ark, D Terminator: Salvation)

Figure A-3. Bpref at different levels of facts for all countries (panels: A Egypt, B France, C Mongolia, D United States)

APPENDIX B
ERROR FOR ENTITY TYPES

Figure B-1. MAE, all facts (panels: A Author, B Country, C Movie, D Person, E President)

Figure B-2. RMSE, all facts (panels: A Author, B Country, C Movie, D Person, E President)

Figure B-3. MAE when missing facts (panels: A Author, B Country, C Movie, D Person, E President)

Figure B-4. RMSE when missing facts (panels: A Author, B Country, C Movie, D Person, E President)

Figure B-5. MAE without list-based facts (panels: A Author, B Country, C Movie, D Person, E President)

Figure B-6. RMSE without list-based facts (panels: A Author, B Country, C Movie, D Person, E President)
BIOGRAPHICAL SKETCH

Charles Stuart Smith, son of Dr. and Mrs. Steve Smith, grew up in Wisconsin and moved to Florida in 1992, when his father, a pastor, started a church in Lakeland. Charles attended Eckerd College in St. Petersburg, Florida, where he majored in computer science and met his wife, Kristen. In 1999, he graduated with honors and accepted a position with Pathtech Software Solutions in Jacksonville, Florida. When Kristen decided to pursue a graduate degree in English at the University of Florida, Charles soon followed suit, leaving the working world behind and embracing the college lifestyle again. Charles and Kristen currently reside in Gainesville and have thoroughly enjoyed that lifestyle. Upon completion of the Ph.D. program, Charles will return to the world of software engineering, concentrating on research and practical business applications.