Citation
Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases

Material Information

Title:
Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases
Creator:
Zhiwei Chen
Zhe He
Xiuwen Liu
Jiang Bian
Publisher:
BMC (BMC Medical Informatics and Decision Making)
Publication Date:
2018
Language:
English

Subjects

Subjects / Keywords:
Word embedding ( fast )
Semantic relation ( fast )
UMLS ( fast )
WordNet ( fast )

Notes

Abstract:
Background: In the past few years, neural word embeddings have been widely used in text mining. However, the vector representations of word embeddings mostly act as a black box in downstream applications using them, thereby limiting their interpretability. Even though word embeddings are able to capture semantic regularities in free text documents, it is not clear how different kinds of semantic relations are represented by word embeddings and how semantically-related terms can be retrieved from word embeddings. Methods: To improve the transparency of word embeddings and the interpretability of the applications using them, in this study, we propose a novel approach for evaluating the semantic relations in word embeddings using external knowledge bases: Wikipedia, WordNet and the Unified Medical Language System (UMLS). We trained multiple word embeddings using health-related articles in Wikipedia and then evaluated their performance in the analogy and semantic relation term retrieval tasks. We also assessed if the evaluation results depend on the domain of the textual corpora by comparing the embeddings of health-related Wikipedia articles with those of general Wikipedia articles. Results: Regarding the retrieval of semantic relations, we were able to retrieve semantically related terms. Meanwhile, the two popular word embedding approaches, Word2vec and GloVe, obtained comparable results on both the analogy retrieval task and the semantic relation retrieval task, while dependency-based word embeddings had much worse performance in both tasks. We also found that the word embeddings trained with health-related Wikipedia articles obtained better performance in the health-related relation retrieval tasks than those trained with general Wikipedia articles. Conclusion: It is evident from this study that word embeddings can group terms with diverse semantic relations together. The domain of the training corpus does have an impact on the semantic relations represented by word embeddings. We thus recommend using a domain-specific corpus to train word embeddings for domain-specific text mining tasks. Keywords: Word embedding, Semantic relation, UMLS, WordNet
General Note:
Chen et al. BMC Medical Informatics and Decision Making 2018, 18(Suppl 2):65. https://doi.org/10.1186/s12911-018-0630-x; Pages 1-157

Record Information

Source Institution:
UF Special Collections
Holding Location:
UF Special Collections
Rights Management:
© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Full Text

Chen et al. BMC Medical Informatics and Decision Making 2018, 18(Suppl 2):65. https://doi.org/10.1186/s12911-018-0630-x

RESEARCH (Open Access)

Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases

Zhiwei Chen (1), Zhe He (2*), Xiuwen Liu (1) and Jiang Bian (3)

From The 2nd International Workshop on Semantics-Powered Data Analytics, Kansas City, MO, USA. 13 November 2017

Abstract

Background: In the past few years, neural word embeddings have been widely used in text mining. However, the vector representations of word embeddings mostly act as a black box in downstream applications using them, thereby limiting their interpretability. Even though word embeddings are able to capture semantic regularities in free text documents, it is not clear how different kinds of semantic relations are represented by word embeddings and how semantically-related terms can be retrieved from word embeddings.

Methods: To improve the transparency of word embeddings and the interpretability of the applications using them, in this study, we propose a novel approach for evaluating the semantic relations in word embeddings using external knowledge bases: Wikipedia, WordNet and the Unified Medical Language System (UMLS). We trained multiple word embeddings using health-related articles in Wikipedia and then evaluated their performance in the analogy and semantic relation term retrieval tasks. We also assessed if the evaluation results depend on the domain of the textual corpora by comparing the embeddings of health-related Wikipedia articles with those of general Wikipedia articles.

Results: Regarding the retrieval of semantic relations, we were able to retrieve semantically related terms. Meanwhile, the two popular word embedding approaches, Word2vec and GloVe, obtained comparable results on both the analogy retrieval task and the semantic relation retrieval task, while dependency-based word embeddings had much worse performance in both tasks. We also found that the word embeddings trained with health-related Wikipedia articles obtained better performance in the health-related relation retrieval tasks than those trained with general Wikipedia articles.

Conclusion: It is evident from this study that word embeddings can group terms with diverse semantic relations together. The domain of the training corpus does have an impact on the semantic relations represented by word embeddings. We thus recommend using a domain-specific corpus to train word embeddings for domain-specific text mining tasks.

Keywords: Word embedding, Semantic relation, UMLS, WordNet

*Correspondence: zhe.he@cci.fsu.edu. Zhiwei Chen and Zhe He contributed equally to this work. 2 School of Information, Florida State University, 142 Collegiate Loop, Tallahassee, FL 32306, USA. Full list of author information is available at the end of the article.

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


Background

Mining useful but hidden knowledge from unstructured data is a fundamental goal of text mining. Towards this goal, the field of natural language processing (NLP) has been rapidly advancing, especially in the area of word representations. In the past few years, neural-network-based distributional representations of words, such as word embeddings, have been shown effective in capturing fine-grained semantic relations and syntactic regularities in large text corpora [1-3]. Therefore, they have been widely used in deep learning models such as those for text classification and topic modeling [4, 5]. For example, convolutional neural networks (CNNs) have become increasingly popular for modeling sentences and documents [6]. Recurrent neural networks (RNNs), especially bidirectional long short-term memory (LSTM) models, have been used to train language models for tasks such as machine translation and information extraction [7]. Nevertheless, the vector representations of word embeddings, with poor interpretability, still act as a black box in downstream applications [8]. Improving the interpretability of word embeddings has been a challenge due to the high dimensionality of these models. Recent research is mostly focused on constructing neural networks, not on the interpretation of word embeddings [1, 3, 9, 10].

Semantic relations are the meaningful associations between words or phrases [11]. In a widely-used open domain lexicon, WordNet, there are nine semantic relations. Table 1 shows the definitions of these semantic relations. Figure 1 gives concrete examples of them. In the medical domain, medical relations are prevalent between medical terms. For example, nausea has a symptom relation with flu; breast cancer has a finding site breast. With the advancement of health IT, many medical concepts and their relations have been encoded in biomedical ontologies and controlled vocabularies, which have been used for electronic health records, semantic reasoning, information extraction, and clinical decision support [12].

Fig. 1 Examples of semantic relations

Table 1 Semantic relation definitions in WordNet

- Synonym: a term with exactly or nearly the same meaning as another term. For example, heart attack is a synonym of myocardial infarction.
- Antonym: a term with an opposite meaning to another term, e.g., big:small, long:short, and precede:follow.
- Hypernym: a term with a broad meaning under which more specific words fall. For example, bird is a hypernym of pigeon, crow, eagle, and seagull.
- Hyponym: a term of more specific meaning than a general or superordinate term applicable to it. It is the opposite of hypernym. For example, pigeon, crow, eagle, and seagull are hyponyms of bird.
- Holonym: a term that denotes a whole whose part is denoted by another term, e.g., body is a holonym of arm.
- Meronym: a term that denotes part of something but which is used to refer to the whole. It is the opposite of holonym.
- Sibling: the relationship denoting that terms have the same hypernym, e.g., son and daughter are sibling terms, since they have the same hypernym child.
- Derivationally related form: terms in different syntactic categories that have the same root form and are semantically related, e.g., childhood is a derivationally related form of child.
- Pertainym: adjectives that are usually defined by phrases such as "of or pertaining to" and do not have antonyms, e.g., American is a pertainym of America.
Identifying the semantic relations between words and phrases is the basis for understanding the meaning of text [13]. Therefore, investigating the semantic relations represented in word embeddings has the potential to improve the transparency of word embeddings and the interpretability of the downstream applications using them. Nevertheless, semantic relations in word embeddings have not been adequately studied, especially in the biomedical domain. In this study, we explore the semantic relations (i.e., semantically related terms) in neural word embeddings using external knowledge bases from Wikipedia, WordNet [13], and the Unified Medical Language System (UMLS) [14]. We formulated two research questions (RQs):

RQ1: How are the different types of semantic relations distributed in the word embeddings?

RQ2: Does the domain of the training corpus affect the distribution of the domain-specific relations in the word embeddings?

To investigate how different semantic relations are represented in word embeddings, we first need to obtain the semantic relations from the real world. To construct gold-standard semantic relation datasets, we used two lexical knowledge bases: WordNet and the Unified Medical Language System (UMLS). WordNet is a large lexical database of English. It groups nouns, verbs, adjectives, and adverbs into sets of cognitive synsets, each expressing a distinct concept [13]. The UMLS is a compendium of over 200 controlled vocabularies and ontologies in biomedicine [14]. It maps over 10 million terms into over three million medical concepts such that terms with the same meaning are assigned the same concept ID. The terms of the same concept are considered synonyms of each other.

We trained word embeddings with three popular neural word embedding tools (i.e., Word2vec [2], dependency-based word embeddings [10], and GloVe [3]) using over 300,000 health-related articles in Wikipedia. To answer RQ1, we evaluated these embeddings in two semantic relation retrieval tasks: 1) the analogy term retrieval task and 2) the relation term retrieval task.


To answer RQ2, we sampled the same number of general articles from Wikipedia, used them to train word embeddings with Word2vec, and evaluated the embeddings in the two semantic relation retrieval tasks with the same evaluation dataset.

In our recent conference paper [15], we explored this problem with only health-related Wikipedia articles as the training corpus and evaluated different word embeddings with 10 general semantic relations in the semantic relation retrieval tasks. In this extended paper, we significantly expanded our analysis in the following aspects. First, besides training word embeddings with the health-related Wikipedia articles, we also trained the embeddings with general Wikipedia articles to assess the impact of the domain of the corpus on the evaluation results. Second, we expanded our gold-standard evaluation dataset with 6 medical relations from the UMLS. With this, we can better understand how medical relations are represented in the word embeddings. Third, we added the analogy term retrieval task for both the general semantic relations and the medical relations. This allowed us to compare our results with previously published results that also evaluated word embeddings using analogy questions. Fourth, we visualized how different semantic relation terms are distributed in the word embedding space with respect to a particular term. This will help us better understand the intrinsic characteristics of the word embeddings. This study has the potential to inform the research community of the benefits and limitations of using word embeddings in their deep-learning-based text mining and NLP projects.

Related work

Neural word embeddings

In an early work, Lund et al. [16] introduced HAL (Hyperspace Analogue to Language), which uses a sliding window to capture co-occurrence information. By moving the ramped window through the corpus, a co-occurrence matrix is formed. The value of each cell of the matrix is the number of co-occurrences of the corresponding word pair in the text corpus. HAL is a robust unsupervised word embedding method that can represent certain kinds of semantic relations. However, it suffers from the sparseness of the matrix.

In 2003, Bengio et al. [17] proposed a neural probabilistic language model to learn distributed representations for words. In their model, a word is represented as a distributed feature vector. The joint probability function of a word sequence is a smooth function. As such, a small change in the vectors of the sequence of word vectors will induce a small change in probability. This implies that similar words would have similar feature vectors. For example, the sentence "A dog is walking in the bedroom" can be changed to "A cat is walking in the bedroom" by replacing dog with cat. This model outperforms the N-gram model in many text mining tasks by a big margin but suffers from high computational complexity.
Later, Mikolov et al. [18] proposed the Recurrent Neural Network Language Model (RNNLM). RNNLM is a powerful embedding model because the recurrent networks incorporate the entire input history using short-term memory recursively. It outperforms the traditional N-gram model. Nevertheless, one of the shortcomings of RNNLM is its computational complexity in the hidden layer of the network. In 2013, Mikolov et al. [1, 2] proposed a simplified RNNLM using multiple simple networks in Word2vec. It assumes that training a simple network with much more data can achieve similar performance as more complex networks such as RNNs. Word2vec can efficiently cluster similar words together and predict regularity relations, such as "man is to woman as king is to queen". The Word2vec method based on skip-gram with negative sampling [1, 2] is widely used mainly because of its accompanying software package, which enabled efficient training of dense word representations and a straightforward integration into downstream models. Word2vec uses techniques such as negative sampling and sub-sampling to reduce the computational complexity. To a certain extent, Word2vec successfully promoted word embeddings to be the de facto input in many recent text mining and NLP projects.


Pennington et al. [3] argued that Word2vec does not sufficiently utilize the global statistics of word co-occurrences. They proposed a new embedding model called GloVe that incorporates the statistics of the entire corpus explicitly, using all the word-word co-occurrence counts. The computational complexity is reduced by including only non-zero count entries. In [3], GloVe significantly outperforms Word2vec in the semantic analogy tasks.

Most of the embedding models use the words surrounding the target word as the context, based on the assumption that words with a similar context have similar meanings [19]. Levy and Goldberg [10] proposed dependency-based word embeddings with the argument that syntactic dependencies are more inclusive and focused. They assumed one can use syntactic dependency information to skip the words that are close but not related to the target word, while capturing the distantly related words that are outside the context window. Their results showed that dependency-based word embeddings captured less topical similarity but more functional similarity.

To improve the interpretability of Word2vec, Levy and Goldberg [20] illustrated that Word2vec implicitly factorizes a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant. Arora et al. [21] proposed a generative random walk model to provide theoretical justifications for nonlinear models like PMI, Word2vec, and GloVe, as well as for some hyper-parameter choices.

Evaluation of word embeddings

Lund and Burgess's experiments based on HAL [16] demonstrated that the nearest neighbors of a word have certain relations to the word. However, they did not investigate the specific types of relations that these nearest neighbors have with the word. Mikolov et al. [1] demonstrated that neural word embeddings could effectively capture analogy relations. They also released a widely used analogy and syntactic evaluation dataset. Finkelstein et al. [22] released another widely used dataset for word relation evaluation, WordSim-353, which provides obscure relations between words rather than specific relations.

Ono et al. [23] leveraged supervised synonym and antonym information from thesauri as well as the objectives in the Skip-Gram Negative Sampling (SGNS) model to detect antonyms from unlabeled text. They reached state-of-the-art accuracy on the GRE antonym questions task.

Schnabel et al. [24] presented a comprehensive evaluation method for word embedding models, which used both the widely-used evaluation datasets from Baroni et al. [2, 25] and a dataset manually labeled by themselves. They categorized the evaluation tasks into three classes: absolute intrinsic, coherence, and extrinsic. Their method involves extensive manual labeling of the correlation of words, for which they leveraged crowdsourcing on Amazon Mechanical Turk (MTurk). In our study, we investigated the relations among terms in an automated fashion.

Levy and Goldberg [20] showed that skip-gram with negative sampling implicitly factorizes a word-context matrix whose cells are the Pointwise Mutual Information (PMI) of the corresponding word and its context, shifted by a constant, i.e., M = PMI - log(k), where k is the number of negative samples. Later, they [26] systematically evaluated and compared four word embedding methods: the PPMI (Positive Pointwise Mutual Information) matrix, SVD (Singular Value Decomposition) factorization of the PPMI matrix, Skip-Gram Negative Sampling (SGNS), and GloVe, with nine hyperparameters. The results showed that none of these methods can always outperform the others with the same hyperparameters. They also found that tuning the hyperparameters had a higher impact on the performance than the algorithm chosen.
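To make the factorization view concrete, the following is a minimal sketch (ours, not from [20]) of building the shifted PMI matrix M = PMI - log(k) from a word-context co-occurrence count matrix. The dense NumPy construction is for illustration only and would not scale to a realistic vocabulary.

import numpy as np

def shifted_pmi(counts: np.ndarray, k: int = 5) -> np.ndarray:
    """Shifted PMI matrix M = PMI - log(k) from co-occurrence counts."""
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total   # P(w), word marginals
    p_c = counts.sum(axis=0, keepdims=True) / total   # P(c), context marginals
    p_wc = counts / total                             # P(w, c), joint
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[np.isneginf(pmi)] = 0.0                       # unseen pairs contribute 0
    return pmi - np.log(k)                            # shift by the negative-sampling constant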
Zhu et al. [27] recently examined Word2vec's ability to derive semantic relatedness and similarity between biomedical terms from large publication data. They preprocessed and grouped over 18 million PubMed abstracts and over 750k full-text articles from PubMed Central into subsets by recency, size, and section. Word2vec models were trained on these subsets. Cosine similarities between biomedical terms obtained from the Word2vec models were compared against the reference standards. They found that increasing the size of the dataset does not always enhance the performance. It can result in the identification of more relations of biomedical terms, but it does not guarantee better precision.

Visualization of word embeddings

Recently, Liu et al. [28] presented an embedding technique for visualizing semantic and syntactic analogies and performed tests to determine whether the resulting visualizations capture the salient structure of the word embeddings generated with Word2vec and GloVe. Principal Component Analysis projection, cosine distance histograms, and semantic axes were used as the visualization techniques. In our work, we also explored other types of relations that are related to medicine, e.g., morphology and finding site. Google released the Embedding Projector [29], which includes PCA [30] and t-SNE [31], as an embedding visualization tool in the TensorFlow framework [32].

Methods

To explore the semantic relations in word embeddings, we used three tools to generate the embeddings, namely, Word2vec [2], dependency-based word embeddings [10], and GloVe [3].


Meanwhile, we obtained two training corpora from Wikipedia: one was a health-related corpus, the other was a random sample of the entire Wikipedia. We then used WordNet [13] and the UMLS [14] to evaluate the performance of these word embeddings through 1) the analogy term retrieval task and 2) the relation term retrieval task.

To better explain our methods, we first list the key terminologies used in this section:

- Lemma: a lemma is the canonical form, dictionary form, or citation form of a set of words. For example, sing, sings, sang, and singing are forms of the same lexeme, with sing as the lemma.
- Relation term: a term (word or phrase) that is associated with another term through a semantic relation. For example, beautiful is an antonym of ugly; they are relation terms of each other.
- Evaluation term: a word or phrase that is used to retrieve its relation terms among its nearest neighbors in the vector space of the word embeddings.

Dataset collection and preprocessing

Training datasets

We first obtained the entire Wikipedia English dataset collected on 01/01/2017. To obtain the health-related corpus, we fetched all the subcategories of Health within a depth of 4 in the category hierarchy of Wikipedia, using the web tool PetScan [33]. Then we filtered out the articles that were not categorized into any of the subcategories. To obtain a comparable general topic corpus, we randomly sampled the same number of articles from the entire Wikipedia corpus.

Analogy evaluation datasets

We first constructed a general analogy evaluation dataset with over 9000 analogy questions using the analogy term gold standard from [1, 2, 9]. In addition, we constructed a medical-related analogy evaluation dataset with over 33,000 medical-related analogy questions using six relations in the UMLS.

UMLS relations

In this study, we used five popular and well-maintained source vocabularies, including SNOMED CT, NCIt (National Cancer Institute Thesaurus), RxNorm, ICD (International Classification of Diseases), and LOINC (Logical Observation Identifiers Names and Codes). The UMLS also contains medical relations between concepts. In this study, we studied six medical relations in the UMLS, i.e., may treat, has procedure site, has ingredient, has finding site, has causative agent, and has associated morphology. We fetched these relations from the 2015AA release of the UMLS. As we did not use phrases in this study, we filtered out the relations with more than one word in the concept names. We also removed all the punctuation and converted the concept names to lowercase.

WordNet relations

For the semantic relations in WordNet, we employed the Natural Language Toolkit (NLTK) [34] to extract semantic relations for the evaluation terms. The details of the specific procedures we used are listed separately [35]. For each evaluation term, we first obtained the synsets of the term and then searched for semantic relations based on the synsets.

Word embedding training configuration

Of the word embedding methods we investigated in this study, Word2vec and dependency-based word embeddings are based on the same neural network architecture, the skip-gram model. To reduce bias, we used the same skip-gram model configuration when constructing these word embeddings. We chose the commonly used parameters from previous publications [1, 3, 9, 10]. The dimension of the vector was set to 300. The size of the context window was set to 5 (i.e., five words before and after the target word). The number of negative samples was 5 and the sub-sampling threshold was 1e-4. The number of iterations was 15 and the number of threads was 20. For dependency-based word embeddings, we used the tool from the authors' website [36]. For Word2vec, we used the tool from its project page [37].
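As a concrete illustration of this skip-gram configuration, a minimal sketch using the gensim implementation of Word2vec follows. The paper used the original word2vec tool [37], so the choice of gensim and the corpus file name here are our assumptions.

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One preprocessed sentence per line (hypothetical file name).
sentences = LineSentence("health_wikipedia.txt")

model = Word2Vec(
    sentences,
    sg=1,             # skip-gram architecture
    vector_size=300,  # embedding dimension
    window=5,         # five words before and after the target word
    negative=5,       # number of negative samples
    sample=1e-4,      # sub-sampling threshold
    epochs=15,        # training iterations
    workers=20,       # threads
)
model.wv.save_word2vec_format("health_w2v.txt")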
For GloVe, although it employs the same idea of using the target word to predict its context, it uses a different training method called adaptive subgradient [38]. We therefore used the same configuration as the experiment in [3]. The dimension of the vector was set to 300. The window size was set to 5, the xmax to 100, and the maximum iteration to 50. We also used 20 threads to train the model with the tool downloaded from GloVe's official website [39].

Evaluation methods

We evaluated the performance of different word embeddings in retrieving relation terms for the evaluation terms. Our evaluation method consists of two tasks, corresponding to four evaluation datasets.

1. Analogy term retrieval task, with evaluation gold-standard datasets:
- General analogy questions previously used in evaluating word embeddings [1, 2, 9], which include six subtasks: capital-common, capital-world, currency, city-in-state, family, and airlines.


- Medical-related analogy questions, i.e., pairs of terms with the same UMLS relation. For example, the term pairs (atropine, uveitis) and (rifampin, tuberculosis) have the same UMLS relation may treat. We constructed an analogy question (atropine, uveitis :: rifampin, ?), for which the answer is tuberculosis.

2. Relation term retrieval task, with evaluation datasets:
- General semantic relations: 10 subtasks, 9 from the 9 semantic relations in WordNet and one from the synonyms in the UMLS.
- Medical relations, which include the aforementioned six medical relations in the UMLS.

Analogy term retrieval tasks

One of the most appealing characteristics of Word2vec is that it is capable of predicting analogy relations. A well-known example is "man is to woman as king is to queen". Given the embedding vectors of three words, king, queen, and man, we can use a simple algebraic equation, king - queen = man - woman, to predict the fourth word, woman. The relation between these two pairs of words is the analogy relation. We consider the analogy relation a special type of semantic relation. It is different from the other semantic relations (e.g., synonym, hypernym, hyponym) in WordNet and the UMLS defined in the Background section. To evaluate the analogy relation in different word embeddings, we collected general analogy questions from [1, 2, 9] (syntactical relations were not included). As shown in Table 2, the general analogy evaluation dataset includes a total of 9331 questions in six different groups of analogy relations. We also constructed medical-related analogy questions, as shown in Table 3.

Table 2 The general analogy dataset

Question group | Count | Percentage
Capital-common | 506 | 5.24%
Capital-world | 4524 | 48.48%
Currency | 866 | 9.28%
City-in-state | 2467 | 26.44%
Family | 506 | 5.42%
Airlines | 462 | 4.95%
Total | 9331 | 100.00%

Table 3 The medical-related evaluation dataset

Subtask | # of relations | # of analogy questions | # of relation questions
may-treat | 595 | 10,000 | 595
has-procedure-site | 43 | 903 | 43
has-ingredient | 57 | 1596 | 57
has-finding-site | 369 | 10,000 | 369
has-causative-agent | 47 | 1081 | 47
has-associated-morphology | 184 | 10,000 | 184
Total | 1295 | 33,585 | 1295

All the analogy questions are in the same format: given the first three words, predict the fourth word using vector arithmetic and the nearest neighbor algorithm. In other words, given the words a, b, c, and d, 1) use the vectors of the first three words in the embedding, a, b, and c, to compute the predicted vector \hat{d} using the formula

\hat{d} = c - a + b,  (1)

and 2) identify the nearest k neighbors of \hat{d} in the embedding vector space using cosine similarity, namely set(d_1, d_2, ..., d_k). If word d is in set(d_1, d_2, ..., d_k), the result of the question is considered a true positive case; otherwise it is a false positive case. We computed the accuracy of the questions in each group as well as the overall accuracy across all the groups.

Note that most of the previous evaluations of word embeddings using analogy questions focused on the top 1 nearest neighbor [1, 3, 9, 10, 26]. In this study, we expanded the analysis to the top 1, 5, 20, and 100 nearest neighbors. One reason is that some words have multiple synonyms; as such, the correct answer in the gold standard may not be retrieved as the top 1 nearest neighbor. Another reason is that the training dataset is a small part of the whole corpus in Wikipedia and the training process of word embeddings involves many rounds of random sampling. Therefore, the resulting embeddings are not deterministic.
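The analogy evaluation of Eq. (1) can be sketched as below; gensim's KeyedVectors, the file name, and the exclusion of the three question words are our assumptions rather than details stated in the paper.

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("health_w2v.txt")

def analogy_hit(a: str, b: str, c: str, d: str, k: int = 5) -> bool:
    """True if d is among the top-k cosine neighbors of d_hat = c - a + b."""
    pred = wv[c] - wv[a] + wv[b]                         # Eq. (1)
    neighbors = wv.similar_by_vector(pred, topn=k + 3)   # cosine similarity
    # Drop the question words themselves, as is conventional for analogy tasks.
    candidates = [w for w, _ in neighbors if w not in {a, b, c}][:k]
    return d in candidates

# e.g., analogy_hit("atropine", "uveitis", "rifampin", "tuberculosis", k=5)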
Relation term retrieval tasks

The relation term retrieval tasks consisted of general relation term retrieval tasks and medical relation term retrieval tasks.

1. General relation term retrieval tasks

The general semantic relation term retrieval task included ten subtasks: nine corresponding to the nine semantic relations in WordNet [13] and one corresponding to the synonyms in the UMLS [14] (i.e., the terms with the same Concept Unique Identifier (CUI) in the UMLS). While the relations from WordNet represent the general semantic relations, we also used the UMLS synonyms to investigate the performance of retrieving synonyms in the medical domain from word embeddings.
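For the WordNet side of this gold standard, a minimal NLTK sketch [34, 35] of collecting a few of the relation types is shown below; the exact traversal the authors used may differ (see [35]), and only four of the ten relations are illustrated.

from nltk.corpus import wordnet as wn

def wordnet_relations(term: str) -> dict:
    """Collect synonyms, antonyms, hypernyms, and hyponyms for a term."""
    rels = {"synonym": set(), "antonym": set(), "hypernym": set(), "hyponym": set()}
    for synset in wn.synsets(term):
        for lemma in synset.lemmas():
            rels["synonym"].add(lemma.name())
            rels["antonym"].update(a.name() for a in lemma.antonyms())
        rels["hypernym"].update(l.name() for h in synset.hypernyms() for l in h.lemmas())
        rels["hyponym"].update(l.name() for h in synset.hyponyms() for l in h.lemmas())
    rels["synonym"].discard(term)  # a term is not its own relation term
    return rels

# e.g., wordnet_relations("important")["antonym"] contains 'unimportant'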


Our semantic relation evaluation dataset focused on nouns and adjectives, based on the statistics of synsets in WordNet [40], where 69.79% (82,115/117,659) of the synsets are nouns and 15.43% (18,156/117,659) are adjectives. We constructed the evaluation dataset in four steps. 1) We employed a named entity recognition tool developed in our lab, simiTerm [41], to generate all the candidate evaluation terms based on the N-gram model. 2) We filtered out the noisy terms, including:

- terms with more than four words;
- terms with a frequency < 100 in the corpus;
- terms starting or ending with a stop word (we used the default stop word list in simiTerm);
- unigrams that are not a noun, adjective, or gerund;
- multi-grams not ending with a noun or a gerund.

We obtained 38,560 candidate evaluation terms after the second step. 3) We matched the candidate evaluation terms with terms in the UMLS and WordNet and kept those that could be mapped to at least one term in the UMLS or at least one synset in WordNet. After the third step, we retained 22,271 terms as the evaluation terms. 4) For every evaluation term, we identified its relation terms, i.e., UMLS synonyms, and the synonyms, antonyms, hypernyms, hyponyms, holonyms, meronyms, siblings, derivationally related forms, and pertainyms in WordNet.

At the end, we obtained a gold standard dataset for the semantic relations with 22,271 evaluation terms, each with at least one out of the ten relations. Table 4 shows the basic characteristics of these evaluation terms. The column "# of evaluation terms" shows that the numbers of terms for the different relations are unbalanced. Hypernyms and synonyms are the most frequent relation terms, whereas pertainyms and antonyms are the least frequent. The column "Average" is the total number of relation terms divided by the number of evaluation terms. We give examples of the evaluation terms and their relation terms in Table 5.

Table 4 The general semantic relation evaluation dataset

Subtask | # of evaluation terms | Percentage(1) | Relation terms # | Average(2)
UMLS synonym | 9,235 | 41.47% | 9,211 | 3.14
Synonym | 15,591 | 70.01% | 48,030 | 4.25
Antonym | 2,225 | 9.99% | 2,977 | 1.38
Hypernym | 16,400 | 73.64% | 58,154 | 4.85
Hyponym | 9,168 | 41.17% | 112,215 | 19.84
Holonym | 4,694 | 21.08% | 9,944 | 3.69
Meronym | 3,191 | 14.33% | 13,056 | 7.14
Sibling | 13,993 | 62.83% | 869,814 | 86.38
Derivation | 9,620 | 43.20% | 27,782 | 2.92
Pertainym | 926 | 4.16% | 987 | 1.20
Total | 22,271 | 38.19%* | 1,168,921 | 9.35

(1) Percentage of evaluation terms that have at least one term with the relation. (2) Average number of relation terms that this type of evaluation term has. * Average of all the subtasks.

Table 5 Examples from the semantic relation evaluation dataset (A: adjective; N: noun; AN: adjective + noun)

- native american (AN): UMLS synonym: first_nation; WordNet synonyms: amerindian, indian, amerind; WordNet hypernyms: someone, individual, mortal; WordNet hyponyms: carib, arawak, american_indian; WordNet siblings: gatekeeper, scratcher, bereaved; WordNet derivation: amerind; WordNet pertainym: american_indian.
- hand (N): UMLS synonym: hand_no; WordNet synonyms: hired_hand, deal, help, hooks, paw; WordNet antonym: right; WordNet hypernym: manual_laborer; WordNet hyponyms: crewman, ostler; WordNet holonyms: timepiece, timekeeper, human_being; WordNet meronyms: arteria_digitalis, metacarpus, thenar; WordNet siblings: day_labourer, botany, printing_process; WordNet derivations: handwrite, paw, scriptural.
- important (A): WordNet synonyms: authoritative, crucial, significant; WordNet antonyms: unimportant, noncrucial, insignificant; WordNet derivations: importance, cruciality, significance.

2. Medical relation term retrieval tasks


The medical relations were extracted from the relations table (MRREL) in the UMLS. We chose six medical relations in the medical domain: may treat, has procedure site, has causative agent, has finding site, has associated morphology, and has ingredient. The relations are represented as triplets, (source, relation, target), where source and target are the terms associated by the relation. For example, in the relation triplet (atropine, may treat, uveitis), atropine is a medication, while uveitis is a disease. The relation may treat represents that we may use atropine to treat uveitis. Table 6 gives examples for these six medical relations.

Table 6 Examples from the medical relation evaluation dataset (source -> target)

- Has associated morphology: Enteritis -> Inflammation; Keratosis -> Lesion; Asthma -> Obstruction
- Has causative agent: Coinfection -> Organism; Coccidiosis -> Protozoa; Asbestosis -> Asbestos
- May treat: Atropine -> Uveitis; Rifampin -> Tuberculosis; Naproxen -> Inflammation
- Has procedure site: Splenectomy -> Spleen; Keratoplasty -> Cornea; Bronchoscopy -> Bronchi
- Has ingredient: Dressing -> Foam; Beeswax -> Waxes; Cellulose -> Regenerated
- Has finding site: Rickets -> Cartilage; Pyometra -> Uterus; Overbite -> Cartilage

To evaluate the performance of the word embeddings, for each evaluation term t in the dataset, we first obtained its vector t in the embedding, found the top k nearest neighbors of t in the embedding using cosine similarity, and fetched the k corresponding terms to construct set(t_1, t_2, ..., t_k). Second, we computed the number of relation terms of the evaluation term t in set(t_1, t_2, ..., t_k). If the number is greater than or equal to one, the evaluation term is a true positive case for the corresponding semantic relation; otherwise it is a false positive case. Third, we computed the retrieved ratio of each semantic relation as well as the average retrieved ratio over all semantic relations. The key measure of the semantic relation evaluation performance is the retrieved ratio (RR), which is defined as

hit(e, rel) = \begin{cases} 1, & \text{if } N_e \cap e.rel \neq \emptyset \\ 0, & \text{otherwise} \end{cases}  (2)

RR(rel) = \frac{\sum_{e \in E} hit(e, rel)}{|\{e \in E : e.rel \neq \emptyset\}|}  (3)

where E is the set of evaluation terms, e is an evaluation term in E, N_e is the set of nearest neighbors of e in the word embedding, rel is a semantic relation, and e.rel is the set of relation terms of e w.r.t. rel. hit(e, rel) indicates whether the evaluation term has at least one relation term among its nearest neighbors. The denominator of the retrieved ratio is the number of evaluation terms with at least one relation term w.r.t. rel. The retrieved ratio indicates the probability of a relation term occurring in the nearest neighbors of an evaluation term.

Since the number of relation terms varies across semantic relations, Eq. 3 may be unfair to the semantic relations with much fewer relation terms than others. Therefore, we also use a weighted function to adjust the retrieved ratio:

weight(e, rel, n) = \frac{n}{\min(k, |e.rel|)}  (4)

WRR(rel) = \frac{\sum_{e \in E} weight(e, rel, hit(e, rel))}{|\{e \in E : e.rel \neq \emptyset\}|}  (5)

where k is the number of nearest neighbors and |e.rel| is the number of relation terms of e w.r.t. rel. weight(e, rel, n) is used to balance the effect of a large number of relation terms.
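A minimal sketch of Eqs. (2)-(5), as reconstructed above, for a single relation follows; here wv is a loaded embedding as in the earlier sketches, and gold, mapping each evaluation term to its set of relation terms, is an assumed data structure.

def retrieved_ratios(wv, gold: dict, k: int = 5):
    """Compute RR (Eqs. 2-3) and WRR (Eqs. 4-5) for one relation."""
    hits, weighted, total = 0, 0.0, 0
    for term, rel_terms in gold.items():
        if not rel_terms or term not in wv:
            continue  # the denominator counts only terms with >= 1 relation term
        total += 1
        neighbors = {w for w, _ in wv.most_similar(term, topn=k)}
        n = 1 if neighbors & rel_terms else 0        # hit(e, rel), Eq. (2)
        hits += n
        weighted += n / min(k, len(rel_terms))       # weight(e, rel, n), Eq. (4)
    if total == 0:
        return 0.0, 0.0
    return hits / total, weighted / total            # RR, WRR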

Results

In this section, we describe the evaluation results and our analyses. We first give the basic statistics of the corpus and the statistics of the dependency relations. Note that here we only show the statistics of the health-related Wikipedia corpus; the general Wikipedia corpus exhibits similar characteristics.

For RQ1, using the health-related corpus, we investigated the performance of three word embeddings, i.e., Word2vec, dependency-based word embeddings, and GloVe, on both the analogy term retrieval task and the relation term retrieval task. For RQ2, we compared the performance of Word2vec between the health-related Wikipedia corpus and the general Wikipedia corpus.

Basic information of the health-related Wikipedia corpus

General statistics

Our health-related text corpus contained 322,339 English health-related articles in Wikipedia, covering 36,511 subcategories of health, which constitutes about six percent of the entire English Wikipedia. It contained 282,323,236 words. On average, each article contained 876 words.


Dependency relation information

For the experiment with dependency-based word embeddings, we used the same corpus from Wikipedia. We obtained 466,096,785 word-context pairs as the training data. Table 7 shows the frequency of the relations. The relation nmod (nominal modifier) was the most frequent one. Other noun-related relations, such as compound and amod (adjective modifier), also occurred frequently in the corpus.

Table 7 Frequency of the dependency relations in the corpus

Relation name [42] | Frequency | Percentage
nmod | 31,740,495 | 14.44
case | 30,937,477 | 14.07
det | 22,721,509 | 10.34
compound | 21,640,690 | 9.84
amod | 16,400,313 | 7.46
nsubj | 14,417,633 | 6.56
conj | 10,616,184 | 4.83
dobj | 10,527,741 | 4.79
cc | 8,437,745 | 3.84
advmod | 7,251,191 | 3.30

Visualization of the embedding space

For the visualization, we tried both PCA and t-SNE. We found PCA more appropriate for our case for the following reasons:

1. In our visualization, we want to show the clusters as well as the linear arithmetic relations between words (e.g., the analogy relation). t-SNE is an effective method to show the clusters in the dataset but it does not keep the linear relations between words. PCA satisfies both requirements.
2. PCA is a deterministic method while t-SNE is not. We found that even using the same parameters, the result of t-SNE varied a lot.
3. Most papers we cited in this study used PCA as their visualization method.

Figure 2 shows the words faculty and loyalty and their top 10 nearest neighbors in the word embedding space reduced by Principal Component Analysis (PCA). In Fig. 2, the bigger the circle for a term, the closer it is to the target word in the original (unreduced) embedding space. It shows that although PCA keeps the nearest neighbors relatively close, it does not preserve the distance ranking in the reduced space. In Fig. 3, the same types of relation terms are labeled with the same symbol. The top 10 nearest neighbors of the evaluation term faculty included a meronym, professor, and a sibling, university. Note that 10 nearest neighbors constitute a very small sample of the words, considering the size of the vocabulary. This demonstrates that various kinds of relation terms can be retrieved among the nearest neighbors of a given term in the word embedding space.

Fig. 2 Words faculty and loyalty and their top 10 nearest neighbors in the reduced 2-D space of the embedding space
Fig. 3 Relation terms of faculty in the reduced 2-D space of the embedding space
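A minimal sketch of this kind of projection is given below; scikit-learn and matplotlib are assumed tooling on our part, as the paper only states that PCA was used.

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_neighbors(wv, target: str, topn: int = 10):
    """Project a target word and its top-n neighbors to 2-D with PCA."""
    words = [target] + [w for w, _ in wv.most_similar(target, topn=topn)]
    vectors = np.array([wv[w] for w in words])
    xy = PCA(n_components=2).fit_transform(vectors)  # deterministic, preserves linear structure
    plt.scatter(xy[:, 0], xy[:, 1])
    for (x, y), w in zip(xy, words):
        plt.annotate(w, (x, y))
    plt.show()

# e.g., plot_neighbors(wv, "faculty")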


Evaluation with the health-related Wikipedia corpus

Evaluation results of the analogy term retrieval task

Figures 4 and 5 show the evaluation results of the different types of word embeddings in the general and medical-related analogy term retrieval subtasks. We assessed the top 1, 5, 20, and 100 nearest neighbors. As the number of retrieved nearest neighbors k increases, the accuracy of all the subtasks increases, which is intuitive. However, the performance gain decreases as the number of nearest neighbors increases. This demonstrates that the closer a term is to the predicted vector, the more likely it is a correct answer.

Fig. 4 Evaluation results of the general analogy tasks for (a) Word2vec, (b) GloVe, and (c) dependency-based word embeddings

Figures 5a, b, and c show the evaluation results of the medical-related analogy term retrieval task. In Word2vec and dependency-based word embeddings, the analogy questions on the relation has procedure site had the highest accuracy. In GloVe-based embeddings, the analogy questions on the relations has finding site and has associated morphology had a higher accuracy than has procedure site. Similar to the general analogy term retrieval task, as the number of retrieved nearest neighbors k increases, the accuracy of all the subtasks increases. In addition, the performance gain also increases as the number of nearest neighbors increases.

Fig. 5 Evaluation results of the medical-related analogy tasks for (a) Word2vec, (b) GloVe, and (c) dependency-based word embeddings

Figure 6 shows the detailed analogy task results when k = 5 for the different embedding training methods. In both the general and the medical-related analogy term retrieval tasks, the dependency-based word embeddings performed much worse than the other methods. As Levy et al. [10] pointed out, dependency-based word embeddings capture less topic-related information than Word2vec. GloVe achieved a slightly higher overall accuracy than Word2vec, which weakens the conclusion of [3] that GloVe outperformed Word2vec by 75% on the analogy task. This discrepancy may be due to the smaller dataset we used in this study. According to Fig. 6a, the performance on the subtasks currency and airlines was much worse than on the other subtasks. This may be because health-related articles in Wikipedia may not contain rich information about currency and airlines. Figure 6b demonstrates that word embeddings are able to capture medical-related analogy relations, but the medical-related analogy subtasks obtained much worse performance than the general analogy subtasks.

Fig. 6 Results of (a) the general analogy term retrieval task and (b) the medical-related analogy term retrieval task when k = 5


Evaluation results of the semantic relation retrieval task

In this evaluation, we investigated the retrieved ratio for the top 1, 5, 20, and 100 nearest neighbors for each of the three kinds of word embeddings. As shown in Figs. 7 and 8, as the number of nearest neighbors k increases, the retrieved ratio changes much faster than in the analogy task. We speculate that the reason is that each evaluation term has more than one relation term, whereas in the analogy task there is only one correct answer.

Fig. 7 Results of the general relation term retrieval tasks for (a) Word2vec, (b) GloVe, and (c) dependency-based word embeddings

Figures 8a, b, and c give the evaluation results for the medical-related semantic relation retrieval tasks. Word2vec outperformed GloVe and dependency-based word embeddings. In Word2vec and GloVe, the semantic relations has associated morphology and has procedure site had a better retrieved ratio than the other semantic relations.

Fig. 8 Results of the medical relation term retrieval tasks for (a) Word2vec, (b) GloVe, and (c) dependency-based word embeddings


Figures 9 and 10 give a closer comparison among the methods when k = 5. The subtask sibling achieved the highest retrieved ratio, since there are many more sibling terms than terms of the other relations (see Table 4). The pertainym relation had the highest retrieved ratio even though there are fewer pertainyms than other types of relation terms. The weighted retrieved ratio (see Fig. 9b) further demonstrated that pertainyms have a higher probability of being retrieved in the nearest neighbors. Using the weighted retrieved ratio, we found that the WRR of antonyms is even higher than that of synonyms. The other semantic relations have roughly similar probabilities of being retrieved in the nearest neighbors. Figure 9a shows that dependency-based word embeddings had much worse performance than the other methods on the subtasks pertainym, derivation, meronym, and holonym. Meanwhile, they had a slightly higher performance on the subtasks sibling and synonym. GloVe obtained a slightly worse performance than Word2vec on almost all the subtasks. For the medical relation task, we found similar results in Fig. 10a: Word2vec outperformed GloVe in five out of six tasks, and dependency-based word embeddings obtained the worst performance. For the medical-related relation term retrieval, the WRR of all the relations is consistently low, which shows that terms with the medical-related semantic relations are usually not in the nearest neighbors of a given evaluation term.

Fig. 9 (a) The retrieved ratio and (b) the weighted retrieved ratio of the relation term retrieval task when k = 5
Fig. 10 (a) The retrieved ratio and (b) the weighted retrieved ratio of the medical relation term retrieval task when k = 5

Performance comparison between the health-related and general Wikipedia corpora

To investigate the impact of the training corpus on our analysis, we also trained Word2vec on general Wikipedia articles, and then compared the analogy and semantic relation tasks between the embeddings trained on the two corpora. Figure 11 shows that the health-related corpus and the general corpus had similar results in the analogy term retrieval task. Figure 11b shows that the health-related corpus obtained much higher accuracy on the medical analogy questions. Figure 12 shows that the health-related corpus obtained slightly better results on the general semantic relation retrieval task.


Fig. 11 Results of (a) the general analogy task and (b) the medical-related analogy task using the health-related corpus and the general corpus (k = 5)

For the medical-related semantic relation evaluation dataset, compared to the results of the general corpus, Fig. 12b shows that the health-related corpus obtained much better results on the relations has causative agent, has procedure site, and has ingredient, while obtaining similar results on the relation has finding site and worse results on the relation has associated morphology.

Fig. 12 Results of (a) the general semantic relation retrieval task and (b) the medical-related semantic relation retrieval task using the health-related corpus and the general corpus (k = 5)

Discussions

In this study, we used external knowledge bases from Wikipedia, WordNet, and the UMLS to evaluate the semantic relations in different neural word embeddings, namely, Word2vec, GloVe, and dependency-based word embeddings. According to the results, we found that terms with certain semantic relations, such as pertainyms, have a higher likelihood of occurring in the nearest neighbors of a given term. With respect to the different word embedding tools, we found that GloVe performed slightly better than Word2vec in some analogy term retrieval tasks (e.g., capital-world and city-in-state), but worse in other tasks (e.g., family, currency, and airlines), which is slightly different from [3], possibly due to the smaller training corpora we used. The performance of dependency-based word embeddings in the analogy term retrieval task was worse than Word2vec and GloVe in most of the subtasks except for airlines. Even though we used health-related Wikipedia data to train the word embeddings, the comparative results on the non-health-related subtask categories should still reflect their relative performance. In the medical-related analogy task, the analogy questions on has finding site and has associated morphology had a higher accuracy in GloVe-based embeddings than in Word2vec and dependency-based embeddings. The questions in the medical-related analogy task had a lower accuracy than the questions in the general analogy task. Medical-related analogy questions using dependency-based embeddings had the worst accuracy.


In the relation term retrieval task, we found that the retrieved term ratios of synonyms and antonyms were almost identical. Using the weighted retrieved ratio, we found that the WRR of antonyms was even higher than that of synonyms. The medical relation term retrieval tasks had a poor WRR, showing that medically-related terms are not among the nearest neighbors of a given evaluation term. The medical relation term retrieval task had a better WRR on has associated morphology, has procedure site, and has finding site. The dependency-based word embeddings had the worst performance among the three. This evaluation also showed that the performance of word embeddings may vary across different domains and different text corpora. Therefore, we suggest that researchers evaluate different word embeddings using standard evaluation methods, such as the ones we conducted in this work, before deciding on a particular one to use.

This study has some limitations. Even though we only employed cosine similarity as the distance measure to define the nearest neighbors of a word in the embeddings, the nearest neighbors can change (substantially) depending on the definition of distance between word vectors. Euclidean distance and cosine similarity are widely used, while another intuitive distance function is the shortest path, which considers the embedding as a weighted graph in which words are connected to each other. The nearest neighbors are then the words that can be reached from the evaluation terms with the shortest path. In the future, we will explore the semantic relation distribution in the nearest neighbors defined by other distance methods such as the shortest path. Another interesting direction is to investigate why synonyms and antonyms have similar occurrences in the nearest neighbors of a word. As co-occurring words in the corpora do not necessarily capture semantic relations, it is important to understand the limitations and remedy their impacts on the downstream applications. In this paper, we only used unigrams in the word embedding training and the gold standard datasets from WordNet and the UMLS. In the future, we will investigate how phrase composition may impact the retrieval of semantically related terms in the word embeddings.

Conclusions

In this study, we evaluated the semantic relations in different neural word embeddings using external knowledge bases from Wikipedia, WordNet, and the UMLS. We trained word embeddings using both health-related and general domain Wikipedia articles. We used the semantic relations in WordNet and the UMLS, which cover most of the commonly used semantic relations. We compared the distribution of relation terms in the nearest neighbors of a word in different word embeddings. It is evident that word embeddings can group terms with diverse semantic relations together. The domain of the training corpus has an impact on the semantic relations represented by word embeddings.

Abbreviations

CNN: convolutional neural network; CUI: concept unique identifier; HAL: hyperspace analogue to language; ICD: International Classification of Diseases; LSTM: long short-term memory; LOINC: Logical Observation Identifiers Names and Codes; NER: named entity recognition; NLP: natural language processing; PPMI: positive pointwise mutual information; PMI: pointwise mutual information; POS: part of speech; RNN: recurrent neural network; RNNLM: recurrent neural network language model; RR: retrieved ratio; SGNS: skip-gram negative sampling; SVD: singular value decomposition; UMLS: Unified Medical Language System; WRR: weighted retrieved ratio

Acknowledgements

We would like to thank Sihui Liu from the Department of Mathematics at Florida State University for her help in processing the medical-related semantic relations from the Unified Medical Language System.
Funding

This study was supported by the start-up fund of Zhe He provided by the College of Communication and Information at Florida State University. This work was also supported in part by National Science Foundation (NSF) award #1734134 (PI: Jiang Bian) and National Institutes of Health (NIH) grant UL1TR001427. The publication cost will be covered by the Open Access Fund of FSU Libraries.

Availability of data and materials

The source code and the trained word embeddings can be accessed at https://github.com/zwChan/Wordembedding-and-semantics.

About this supplement

This article has been published as part of BMC Medical Informatics and Decision Making Volume 18 Supplement 2, 2018: Selected extended articles from the 2nd International Workshop on Semantics-Powered Data Analytics. The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-18-supplement-2.

Authors' contributions

ZH supervised the entire project. ZC and ZH conceptualized and designed the study. ZC collected the data from WordNet and Wikipedia. ZH collected the relationship data from the Unified Medical Language System. ZC implemented the method. ZC and ZH performed the data analysis and drafted the initial version. ZH revised the manuscript iteratively for important intellectual content. All the authors contributed to the results interpretation. All authors edited the paper and gave final approval for the version to be published. ZH takes primary responsibility for the research reported here. All authors read and approved the final manuscript.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Author details

1 Department of Computer Science, Florida State University, Tallahassee, FL, USA. 2 School of Information, Florida State University, 142 Collegiate Loop, Tallahassee, FL 32306, USA. 3 Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, USA.

Published: 23 July 2018

References

1. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. 2013. https://arxiv.org/abs/1301.3781.


2. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2013. p. 3111.
3. Pennington J, Socher R, Manning CD. GloVe: Global vectors for word representation. In: EMNLP, vol 14. Association for Computational Linguistics; 2014. p. 1532.
4. Tang D, Wei F, Yang N, Zhou M, Liu T, Qin B. Learning sentiment-specific word embedding for twitter sentiment classification. In: ACL (1). Association for Computational Linguistics; 2014. p. 1555.
5. Li C, Wang H, Zhang Z, Sun A, Ma Z. Topic modeling for short texts with auxiliary word embeddings. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM; 2016. p. 165.
6. Kim Y. Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha; 2014. p. 1746.
7. Tang D, Qin B, Liu T. Document modeling with gated recurrent neural network for sentiment classification. In: EMNLP. Association for Computational Linguistics; 2015. p. 1422.
8. Sun F, Guo J, Lan Y, Xu J, Cheng X. Sparse word embeddings using l1 regularized online learning. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI. AAAI Press; 2016. p. 2915. http://dl.acm.org/citation.cfm?id=3060832.3061029.
9. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. 2016. https://arxiv.org/abs/1607.04606.
10. Levy O, Goldberg Y. Dependency-based word embeddings. In: ACL (2). Stroudsburg: Citeseer; 2014. p. 302.
11. Khoo CSG, Na J-C. Semantic relations in information science. Annu Rev Inf Sci Technol. 2006;40(1):157. https://doi.org/10.1002/aris.1440400112.
12. Bodenreider O. Biomedical ontologies in action: role in knowledge management, data integration and decision support. In: Geissbuhler A, Kulikowski C, editors. IMIA Yearbook of Medical Informatics. Methods Inf Med. 2008;47(Suppl 1):67.
13. Miller GA. WordNet: a lexical database for English. Commun ACM. 1995;38(11):39.
14. Lindberg DA, Humphreys BL, McCray AT, et al. The Unified Medical Language System. IMIA Yearbook. 1993. p. 41.
15. Chen Z, He Z, Liu X, Bian J. An exploration of semantic relations in neural word embeddings using extrinsic knowledge. In: Bioinformatics and Biomedicine (BIBM), 2017 IEEE International Conference on. Piscataway: IEEE; 2017. p. 1246.
16. Lund K, Burgess C. Producing high-dimensional semantic spaces from lexical co-occurrence. Behav Res Methods. 1996;28(2):203.
17. Bengio Y, Ducharme R, Vincent P, Jauvin C. A neural probabilistic language model. J Mach Learn Res. 2003;3(Feb):1137.
18. Mikolov T, Karafiát M, Burget L, Černocký J, Khudanpur S. Recurrent neural network based language model. In: Interspeech, vol 2. International Speech Communication Association; 2010. p. 3.
19. Harris ZS. Distributional structure. Word. 1954;10(2-3):146.
20. Levy O, Goldberg Y. Neural word embedding as implicit matrix factorization. In: Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2014. p. 2177.
21. Arora S, Li Y, Liang Y, Ma T, Risteski A. A latent variable model approach to PMI-based word embeddings. Trans Assoc Comput Linguist. 2016;4:385.
22. Finkelstein L, Gabrilovich E, Matias Y, Rivlin E, Solan Z, Wolfman G, Ruppin E. Placing search in context: the concept revisited. In: Proceedings of the 10th International Conference on World Wide Web. New York: ACM; 2001. p. 406.
23. Ono M, Miwa M, Sasaki Y. Word embedding-based antonym detection using thesauri and distributional information. In: HLT-NAACL. Association for Computational Linguistics; 2015. p. 984.
24. Schnabel T, Labutov I, Mimno DM, Joachims T. Evaluation methods for unsupervised word embeddings. In: EMNLP. Association for Computational Linguistics; 2015. p. 298.
25. Baroni M, Dinu G, Kruszewski G. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: ACL (1). Association for Computational Linguistics; 2014. p. 238.
26. Levy O, Goldberg Y, Dagan I. Improving distributional similarity with lessons learned from word embeddings. Trans Assoc Comput Linguist. 2015;3:211.
27. Zhu Y, Yan E, Wang F. Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec. BMC Med Inform Decis Making. 2017;17(1):95. https://doi.org/10.1186/s12911-017-0498-1.
28. Liu S, Bremer P-T, Thiagarajan JJ, Srikumar V, Wang B, Livnat Y, Pascucci V. Visual exploration of semantic relationships in neural word embeddings. IEEE Trans Vis Comput Graph. 2018;24(1):553.
29. Embedding Projector of TensorFlow. http://projector.tensorflow.org/. Accessed 1 June 2017.
30. Shlens J. A tutorial on principal component analysis. 2014. https://arxiv.org/abs/1404.1100.
31. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(Nov):2579.
32. TensorFlow. https://www.tensorflow.org/. Accessed 1 June 2017.
33. PetScan. https://petscan.wmflabs.org. Accessed 1 June 2017.
34. Loper E, Bird S. NLTK: the Natural Language Toolkit. In: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, vol 1. ETMTNLP. Stroudsburg: Association for Computational Linguistics; 2002. p. 63. https://doi.org/10.3115/1118108.1118117.
35. WordNet API. http://www.nltk.org/howto/wordnet.html. Accessed 1 June 2017.
36. Dependency-Based Word Embedding project. https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings. Accessed 1 June 2017.
37. Word2vec project. https://code.google.com/archive/p/word2vec/. Accessed 1 June 2017.
38. Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res. 2011;12(Jul):2121.
39. GloVe project. https://nlp.stanford.edu/projects/glove/. Accessed 1 June 2017.
40. Statistical information of WordNet. https://wordnet.princeton.edu/documentation/wnstats7wn. Accessed 1 June 2017.
41. He Z, Chen Z, Oh S, Hou J, Bian J. Enriching consumer health vocabulary through mining a social Q&A site: a similarity-based approach. J Biomed Inform. 2017;69:75.
42. Dependencies manual in Stanford NLP project. https://nlp.stanford.edu/software/dependencies_manual.pdf. Accessed 1 June 2017.

