
Data Transform Composition for Efficient Information Integration

Permanent Link: http://ufdc.ufl.edu/UFE0024907/00001

Material Information

Title: Data Transform Composition for Efficient Information Integration
Physical Description: 1 online resource (72 p.)
Language: English
Creator: Shin, Jungmin
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: composition, metamodel, similarity, transform
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Data transformation resolves heterogeneities between disparate schemas and is an indispensable process in many applications where data sharing and exchange happen. Creating a data transform that converts source data to target data is extremely time-consuming and labor-intensive. This dissertation addresses the data transform composition problem using a large repository of reusable transforms. Recent work on data transforms has focused on structural data mapping or on applying a restricted set of data transforms for composition. To perform semi-automatic data transform composition with existing transforms, we first design an RDF-based transform meta model that includes metadata on a data transform. Next, we model the data transform composition problem as a graph search problem and use the A* algorithm with distance measures based on our transform meta model. Our experiment shows that our meta model greatly speeds up the search for complete compositions and provides high precision. Our distance measures enable the search to progress correctly toward a complete composition and to reduce the exponential search space by pruning. Using our system, users can reduce their effort in the time-consuming and error-prone steps of the data transformation process, thereby reducing the effort of information integration that many applications require.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Jungmin Shin.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Hammer, Joachim.
Local: Co-adviser: Lam, Herman.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2010-02-28

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0024907:00001





Full Text

PAGE 4

Though only my name appears on the cover of this dissertation, the completion of my dissertation was possible with the help and efforts of many people. First, I express my appreciation to my advisor, Dr. Joachim Hammer. Throughout my graduate career at the University of Florida, his guidance, support, and patience helped me overcome many crisis situations and complete this dissertation. He often brought me to the threshold of knowledge and ignited the interest to cross the threshold. He also encouraged me to be an independent thinker with a high research standard. I deeply appreciate my co-advisor, Professor Herman Lam. He was kindly willing to spend a lot of time and effort in improving my work. Without him, I could not have accomplished my work. Thanks also go out to the members of the dissertation committee, Professors Abdelsalam Helal, Markus Schneider, and Paul Avery, for their valuable guidance. Professor Abdelsalam Helal tried to hear and understand my hardships, and his encouragement helped me keep my self-esteem. I am grateful to many people on the faculty and staff of the Department of Computer and Information Science and Engineering for all that they taught me and for supporting me in various ways.

Finally, and most importantly, I sincerely thank my family, who have been a constant source of help, support, and strength during my doctoral studies. None of my achievements would have been possible without their love. My very special thanks go to my husband for his support, upon which the path to completing my Ph.D. was built. I warmly appreciate my parents for their unwavering faith in me as well as their unending encouragement and support. I thank my sisters for their love and support. I appreciate my parents-in-law for their consistent encouragement and support. I truly thank God for giving me my daughter, Jimin, who was the source of energy to get through my journey. I love you, Jimin.

PAGE 5

                                                              page
ACKNOWLEDGMENTS .... 4
LIST OF TABLES .... 7
LIST OF FIGURES .... 8
ABSTRACT .... 9
CHAPTER
1 INTRODUCTION .... 10
  1.1 Motivating Example .... 12
  1.2 Challenges and Contribution .... 15
    1.2.1 Challenges .... 15
    1.2.2 Contributions .... 16
2 RELATED WORK .... 19
  2.1 Data Transform vs. Web Service and Workflow .... 19
  2.2 Data Mapping and Schema Matching .... 20
  2.3 Web Service Composition .... 21
  2.4 Calculating Similarity .... 22
  2.5 Morpheus Prototype .... 23
3 TRANSFORM META MODEL .... 27
  3.1 Definition .... 27
  3.2 Semantic Similarity of Two Transforms .... 30
4 TRANSFORM COMPOSITION .... 33
  4.1 Transform Composition Problem .... 33
  4.2 Algorithm .... 36
    4.2.1 Similarity Measures .... 38
    4.2.2 The function Merge .... 41
  4.3 Partial Composition .... 43
5 IMPLEMENTATION .... 46
  5.1 Architecture .... 46
  5.2 Semantic Annotation and Registration Tool .... 47
  5.3 Transform Composition Module .... 49
6 EVALUATION .... 51
  6.1 Experimental Environment .... 51

PAGE 6

  [...] .... 51
    6.2.1 Experiment with Example A: Find Complete Compositions .... 53
    6.2.2 Experimental Case 1: Model Verification .... 54
    6.2.3 Experimental Case 2: Efficiency of Our Algorithm .... 59
    6.2.4 Experiment Case 3: Partial Composition .... 62
7 CONCLUSION .... 64
  7.1 Summary and Contributions .... 64
  7.2 Future Work .... 65
REFERENCES .... 67
BIOGRAPHICAL SKETCH .... 72

PAGE 7

Table                                                         page
1-1 Sample transforms in the repository. .... 14
6-1 Sample transforms .... 52
6-2 Experimental result .... 53

PAGE 8

Figure                                                        page
1-1 The schemas of our example. .... 13
2-1 Conceptual architecture of the Morpheus system. .... 23
2-2 Data types and a transform in Morpheus. .... 24
3-1 A transform represented in graph. .... 29
3-2 CurrencyConversion in transform graph. .... 30
3-3 KRW2USD in transform graph. .... 31
4-1 A part of transform composition graph. .... 35
4-2 Single match between two transforms. .... 41
4-3 Multiple match between two transforms. .... 42
5-1 The architecture of our system. .... 46
5-2 Web service semantic annotation tool. .... 49
6-1 The schema of Example B. .... 53
6-2 The desired transform specification 1. .... 54
6-3 Efficiency by varying metadata utilization with 150 transforms. .... 57
6-4 Precision by varying metadata utilization with 150 transforms. .... 58
6-5 The number of expanded nodes so far when each complete composition appears (total 53 answers). We vary h(x) by changing the flag (e.g., I+O, I+O+OP, I+OP, I). We test with 1000 transforms in repository. The search is terminated when total 3788 nodes are expanded. .... 59
6-6 Efficiency by varying the size of repository. .... 60

PAGE 9

Data transformation resolves heterogeneities between disparate schemas and is an indispensable process in many applications where data sharing and exchange happen. Creating a data transform that converts source data to target data is extremely time-consuming and labor-intensive. This dissertation addresses the data transform composition problem using a large repository of reusable transforms. Recent work on data transforms has focused on structural data mapping or on applying a restricted set of data transforms for composition. To perform automatic data transform composition with existing transforms, we first design an RDF-based transform meta model that includes metadata on a data transform. Next, we model the data transform composition problem as a graph search problem and use the A* algorithm with distance measures based on our transform meta model. Our experiment shows that our meta model greatly speeds up the search for complete compositions and provides high precision. Our distance measures enable the search to progress correctly toward a complete composition and to reduce the exponential search space by pruning. Using our system, users can reduce their effort in the time-consuming and error-prone steps of the data transformation process, thereby reducing the effort of information integration that many applications require.

PAGE 10

[50, 11]. It resolves heterogeneities of data in disparate sources.

In data integration, when disparate data sources have to be integrated, heterogeneities among data in previously independent data sources must be resolved in order to provide a uniform interface to users. For example, when two sales data sets in different currencies are integrated, revenue in one currency (e.g., the Korean won) must be transformed to the other currency (e.g., the US dollar) in order to provide sales data in one currency. Through the data transformation process, data in Korean won are converted to data in US dollars.

Data cleaning deals with detecting and removing errors and inconsistencies in data to improve the quality of data. When multiple data sources are integrated, the need for data cleaning increases significantly since the sources often contain redundant data in different representations. In order to provide access to accurate and consistent data, consolidation of different data representations through data transformation and elimination of duplicate information become necessary [50].

Data transformation plays an important role in a data warehousing system [54]. In a data warehousing system, an extract-transform-load (ETL) process is required to integrate information. The ETL process consists of extracting data from multiple sources, transforming data, and loading data into the data warehouse. To create and manage transformations as part of the ETL process, typical data warehousing architectures require external tools. As a result, most existing ETL tools perform the necessary transformations outside the repository where data is stored.

Generally, the data transformation process undergoes several essential steps [55]. In the first step, called schema matching, we find semantic correspondences between

PAGE 11

[18]. Finally, the program logic is translated into an executable one and deployed to an execution environment. The program logic is a data transform that converts data in the source schema to data that fit in the target schema.

Users must find the schema matching and semantic mapping listed above and work out the actual program logic of a transform that resolves syntactic and semantic heterogeneities among data. Finding the program logic of a transform requires a lot of trial and error and is time-consuming [49, 48]. The schema matchings between elements in two schemas include 1-1 matches and complex matches. A complex match means that a combination of attributes in one schema corresponds to a combination in another [18]. Creating mappings for complex matches [7, 28] is more difficult than for 1-1 matches, where one attribute in a schema is matched to a single attribute in another schema.

In our work, we introduce a data transform composition problem that reuses existing transforms in a repository instead of creating a new transform from scratch, thus reducing development time and effort. First, we assume that there is a repository that has a large number of transforms. Those transforms may have been created by users previously or harvested from the Internet (e.g., Web services or Java functions). A transform in a repository was previously used for another purpose and generates output from inputs. There is a semantic mapping between the inputs and outputs of a transform, expressed by the internal program logic of the transform. Our data transform composition approach tries to use the semantic mapping of a transform to find schema matches and semantic mappings between two schemas. Using a previously available, meaningful single or composite transform, we can find semantic correspondences (i.e., schema matches) between two schemas automatically by reusing the input/output mappings of a transform, and at the same time semantic mappings can be generated using the internal program logic of transforms.

PAGE 12

Our goal is to find a complete composition that has the same semantic mapping (i.e., program logic) as a user's desired transform. However, we cannot guarantee that the transforms in a repository are sufficient to generate any new transform. We claim that we can guide a user by showing possible partial compositions that can be similar to the user's desired transform. As a result, finding schema matchings and semantic mappings with our data transform composition approach reduces time and labor in data transformation.

Recent work on data transforms has focused on finding a data transform as part of the data mapping or schema matching problem. However, those studies concentrated mostly on structural data mapping or on applying a restricted set of transforms in composition. There is much research on Web service composition, but we find that we cannot apply the solutions in Web service composition directly to our problem. As far as we investigated, there has been no effort to find 1-1 or complex matches with data transform composition. In our work, we provide a solution for constructing a user's desired transform by composing multiple transforms in a repository, thereby reducing users' overall effort to perform data transformation, such as analyzing syntactic and semantic differences among relevant data and creating the program logic of a transform that resolves the differences.

PAGE 13

Figure 1-1. The schemas of our example.

Starbucks headquarters in Seattle. The Korean Starbucks operation provides two relations, Sales-Coffee and Sales-Pastry in Fig. 1-1. Sales-Coffee has three attributes: date, revenue, and branchname. The date field is in the format "DD-MM-YYYY", revenue is a float value representing revenues from coffee sales in Korean won, and branchname is the name of a branch in Korea that is a source of the revenue (there can be multiple branches in a city). Sales-Pastry has three attributes that are the same as those of Sales-Coffee, except that revenue represents sales from muffins and other snacks.

The Starbucks headquarters in Seattle, however, uses only one relation of the form Sales-Food shown in Fig. 1-1, where date is in the format "MM/DD/YYYY", revenue represents the sum of coffee and pastry revenues in US dollars, and city represents the name of the city where a branch is located. The value of city can be retrieved by looking up a relation that stores the addresses of branches, using the branchname of Sales-Coffee. The IT department of Starbucks is tasked with providing a transform that converts data in the two sales relations (i.e., Sales-Coffee and Sales-Pastry) from the Korean franchises into data in Sales-Food that the US corporation can use.

We assume that the IT department has a repository of transforms that can be reused to compose a new data transform. Reusing existing transforms that have already been debugged simplifies the effort of creating a new transform. Table 1-1 shows a sample of available transforms in the repository. The inputs and outputs of the transforms in Table 1-1 are specified as composite data types. In our example, the IT department of Starbucks wants to create a new transform using one or more available

PAGE 14

transforms. However, even with reusable transforms, the user must be able to find the appropriate transforms and connect them by correctly matching the output of the current transform to the input of the next transform.

Table 1-1. Sample transforms in the repository.

Name of Transform      Input data type (field: type, ...)            Output data type (field: type)
CurrencyConversion     Koreanwon (Korean: float, date1: date)        Dollar (USD: float)
KRW2USD                Korean (KRW: float)                           Dollar (USD: float)
Conversion2USD         Currency (amount: float, country: string)     Dollar (USD: float)
AddKRW                 TwoKorean (KRW1: float, KRW2: float)          Korean (KRW: float)
PaymentConversion      Koreanwon (Korean: float, date1: date)        Dollar (USD: float)
DateFormat2MMDDYYYY    DATE (date1: date)                            DATE (date2: date)
DateFormat2YYYYMMDD    DATE (date1: date)                            DATE (date2: date)
getCity                BranchName (branch: string)                   CityName (city: string)

Browsing and finding relevant transforms in a repository is not a trivial task. A user must be able to connect contiguous transforms by correctly matching the output of the current transform to the input of the next transform. In Table 1-1, if the user wants to call the DateFormat2MMDDYYYY transform first and then call the CurrencyConversion transform, the user cannot be sure whether the output of DateFormat2MMDDYYYY exactly matches the input of the CurrencyConversion transform.

It is clear that an increase in available transforms entails more and more effort to understand existing transforms. As the number of transforms increases, finding appropriate reusable transforms becomes a time-consuming and laborious process. Also,

PAGE 15

We propose a data transformation composition algorithm that investigates all transforms in a repository on behalf of users and that guides users in creating a new transform from existing transforms. The algorithm provides users with a sequence of existing transforms that is identical to the users' expected transform. If no sequence of existing transforms can provide the users' expected transform, the algorithm produces sequences of transforms that are similar to it.

[5, 17, 21] centers on how to describe, manage, and store data transforms efficiently by providing a tool or language to specify the program logic of a transform and storing the entire executable procedural description. The research provides procedural information rather than structural information about data transform behavior. Procedural information is very helpful for understanding the behavior, but it is inappropriate for comparing two transforms, where it is necessary to calculate the similarities between two transforms in composition. Existing Web service models are not sufficient to characterize data transforms because they do not have specific data format information, and the behavior of an operation is described in one semantic word, which is hard to create when multiple transforms are connected.

In our problem, we need a structural meta model to characterize different data transforms using limited information. Since transforms in a repository can be Web

PAGE 16

Second, our problem is to find a data transform that can convert data in the source schema to data in the target schema as automatically as possible using reusable transforms in a repository. It can be solved by a single transform in a transform repository or by a composite transform that is composed of multiple single transforms. Since there can be a large number of transforms in a repository, the potential search space for the compositions is huge. Unless we have a schema matching between the source and target schemas, the search space grows even more because we need to find a set of attributes in the source schema that matches a specific attribute in the target schema.

In addition, we assume that transforms in a repository can be Web services harvested from the Internet (e.g., looking up the exchange rate between two different currencies). It is difficult to execute transforms and use their output to find compositions because it can take time to execute remote Web services.

Existing research on data transformation finds mappings by intensively using outputs generated by executing transforms. Existing research on Web service composition cannot be directly applied to our problem since the Web service models (e.g., WSDL or WSDL-S) [14, 1, 23] are not sufficient to characterize different transforms. We need a novel, sophisticated solution to data transform composition that can efficiently find a goal transform without depending mainly on the data generated by executing transforms.

Third, we cannot guarantee that we will always find a complete composition since our repository may be incomplete. Therefore, it is necessary to provide the user with partial compositions that are useful for constructing a goal transform.
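The growth of the search space can be illustrated with a small sketch. The Python below is only an illustration under stated assumptions, not the dissertation's algorithm: it reduces each transform of Table 1-1 to a bare signature (name, input type, output type), chains transforms purely by matching type names, and never executes a transform. The `Signature` class and the `chains` function are hypothetical names introduced here.

```python
from typing import NamedTuple


class Signature(NamedTuple):
    """Bare signature of a transform; the composite type names follow Table 1-1."""
    name: str
    input_type: str
    output_type: str


# Signatures copied from Table 1-1 (field-level details are ignored in this sketch).
REPOSITORY = [
    Signature("CurrencyConversion", "Koreanwon", "Dollar"),
    Signature("KRW2USD", "Korean", "Dollar"),
    Signature("Conversion2USD", "Currency", "Dollar"),
    Signature("AddKRW", "TwoKorean", "Korean"),
    Signature("PaymentConversion", "Koreanwon", "Dollar"),
    Signature("DateFormat2MMDDYYYY", "DATE", "DATE"),
    Signature("DateFormat2YYYYMMDD", "DATE", "DATE"),
    Signature("getCity", "BranchName", "CityName"),
]


def chains(repository, source, target, max_len):
    """List every chain of at most max_len transforms whose types line up
    from `source` to `target`. No transform is executed."""
    found = []

    def extend(path, current):
        # A non-empty path that currently yields the target type is a candidate.
        if path and current == target:
            found.append([t.name for t in path])
        if len(path) == max_len:
            return
        for t in repository:
            if t.input_type == current:
                extend(path + [t], t.output_type)

    extend([], source)
    return found


if __name__ == "__main__":
    print(chains(REPOSITORY, "TwoKorean", "Dollar", 3))
    # Two interchangeable DATE -> DATE transforms already make the
    # number of candidate chains grow quickly with the allowed length.
    for n in (1, 2, 3, 4):
        print(n, len(chains(REPOSITORY, "DATE", "DATE", n)))
```

With only two type-compatible DATE-to-DATE transforms, the number of candidate chains of length at most n is 2 + 4 + ... + 2^n, which suggests why exhaustive enumeration does not scale to a large repository.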

PAGE 17

[35]-based transform meta model to represent transforms. Metadata about transforms are represented semantically in RDF triples, and those RDF triples can construct structural RDF graphs. We can compare two transforms semantically using those RDF graphs. In addition, an RDF graph can easily be expanded by adding RDF triples when we add more metadata to accommodate new requirements for describing a data transform. Since RDF is widely accepted as a standard for the semantic representation of information on the Web and is represented in XML, data transforms in our meta model are interchangeable and can be readily compared with other resources on the Web.

Our model includes not only the primitive data type and semantic meaning of a parameter, as provided in existing Web service standards, but also the data format of a parameter. In addition, metadata on operations describe the relationships between the inputs and outputs of a transform. Our experiment shows that our meta model greatly speeds up the search for complete compositions and provides high precision.

Second, we model the data transform composition problem as a graph search problem and use the A* algorithm with distance measures based on our transform meta model. Our sophisticated distance measures enable our search to progress to a correct complete composition and to reduce the exponential search space by pruning. In composing transforms, our algorithm preserves the behavior of transforms as much as possible using our transform meta model, so we can find a goal transform correctly. When there is no complete composition, our algorithm provides partial compositions that are useful for constructing a goal transform.

Third, we design and implement our prototype system for semi-automatic data transform composition using transforms, described in our model, in a repository. Our system provides an automatic search of a large repository to find or compose a desired transform. Our semantic annotation tool is used to semi-automatically convert crawled Web services in WSDL to our model. Using our system, users can reduce their efforts in time-consuming

PAGE 18

This thesis is organized as follows: Chapter 2 reviews related work; Chapter 3 introduces our transform meta model, and Chapter 4 defines the data transform composition problem and shows our algorithm. Chapter 5 describes the implementation of our system and Chapter 6 presents our experiments. Chapter 7 provides conclusions.

PAGE 19

Research related to our work can be categorized into the following areas: Web service composition, data mapping and schema matching, similarity measures, and the Morpheus prototype. Before we investigate those areas, we first compare data transforms with Web services and workflows.

[62, 53] is a technology paradigm characterized by sharing and integrating software components over the Internet without considering platforms [3]. A Web service description in WSDL includes only signature information. Web service composition integrates individual Web services to create another new Web service. This is a new trend for creating a new software application using individual software components offered by different service providers.

Workflows [33] model and execute business processes. A workflow can be modeled in the XML process definition language (XPDL) [15] and executed in a workflow management system. The XPDL specification covers a broad range of elements that are required in usual business processes. Web services can be used as one implementation type of an activity in a workflow.

A data transform (especially a semantic transform) maps attributes in one schema to attributes in another. This mapping converts data from the source data format to the target data format. Mostly, data in source and target schemas have different formats but can have the same or similar semantic meanings. For example, in Fig. 1-1, revenue in Sales-Coffee is represented in Korean won, but revenue in Sales-Food is represented in US dollars. Both revenue attributes share the same semantic meaning.

Considering just the signatures and semantic meanings of arguments is not enough to differentiate transforms. Data in different schemas can be expressed in different formats,

PAGE 20

[18] system semi-automatically identifies both 1-1 and complex matches between database schemas. A complex match specifies that a combination of attributes in one schema corresponds to a combination in the other. The generation of complex matches is done by searching the space of possible matches. A set of search modules, called searchers, are employed and each considers a meaningful subset of the space. For example, a text searcher may consider only matches that are concatenations of text attributes, while a numeric searcher considers combining attributes with arithmetic expressions. Using a beam search, only a pre-specified number of highest-scoring match candidates are selected, and, among candidates, those that have a close semantic distance to the target attribute are selected. To elaborate the search process, domain knowledge and user interaction are used in the process of searching the mapping.

A data mapping problem [47, 64, 7] is about automatic discovery of effective mappings between structured data sources. Data mappings are fundamental in data cleaning, data integration, and semantic integration and include subproblems, such as schema matching and semantic schema mapping. Existing solutions typically have focused on discovering restricted mappings, such as only discovering one-to-one schema matching. There are also structure differences among relations and complex semantic mappings among attributes in different relations. In Tupelo [25], starting from user provided example instances of the source and target schemas, a mapping is semi-automatically discovered by searching within the transformation space based on a set

PAGE 21

The Piazza [27] project proposed the peer data management system (PDMS), where mapping composition is studied [40] and proposed to serve as one of its main optimization techniques for answering queries efficiently [59]. Yu and Popa [65] applied mapping composition to maintain mappings under some schema evolution scenarios.

[39, 45, 51, 9, 8, 4, 22, 38]. For example, the work in [43] adapts the A* algorithm using the input/output arguments of Web services. In our work, we also consider the semantic meaning of the arguments and the behavior of transforms. The authors in [39] use resource description framework (RDF) triples for representing pre/postconditions of a Web service. A semantic network can be found among a set of Web services using pre/postconditions.

In the semantic Web service area, a new composite Web service is created using simple Web services with the help of semantic information. Explicit semantics will enable automatic Web service composition without human intervention [46]. Currently, many approaches to semantic Web service composition concentrate on just the semantic matching of input/output arguments. However, considering the internal functionalities of services is important since Web services with the same input/output interfaces could have different functions [39].
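To make the idea of searching over input/output arguments concrete, here is a minimal A* sketch in Python. It is neither the algorithm of [43] nor the one developed in this dissertation: the repository is a hypothetical list of (name, input type, output type) tuples borrowed from Table 1-1, every transform is assumed to cost 1, and the heuristic simply returns 0 at the goal type and 1 elsewhere, a placeholder standing in for a real distance measure.

```python
import heapq
from itertools import count

# (name, input type, output type); names and types follow Table 1-1.
REPOSITORY = [
    ("CurrencyConversion", "Koreanwon", "Dollar"),
    ("KRW2USD", "Korean", "Dollar"),
    ("Conversion2USD", "Currency", "Dollar"),
    ("AddKRW", "TwoKorean", "Korean"),
    ("PaymentConversion", "Koreanwon", "Dollar"),
    ("DateFormat2MMDDYYYY", "DATE", "DATE"),
    ("DateFormat2YYYYMMDD", "DATE", "DATE"),
    ("getCity", "BranchName", "CityName"),
]


def a_star(repository, source, target):
    """Return a shortest list of transform names that turns `source` into
    `target` by matching type names only, or None if no chain exists."""

    def h(type_name):
        # Placeholder heuristic (an assumption of this sketch): at least one
        # more transform is needed unless we already hold the target type.
        return 0 if type_name == target else 1

    tie = count()  # tie-breaker so the heap never compares paths
    frontier = [(h(source), next(tie), 0, source, [])]
    best_g = {source: 0}  # cheapest known cost to reach each type

    while frontier:
        _, _, g, current, path = heapq.heappop(frontier)
        if current == target:
            return path
        for name, in_type, out_type in repository:
            if in_type != current:
                continue
            new_g = g + 1  # unit cost per transform
            if new_g < best_g.get(out_type, float("inf")):
                best_g[out_type] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + h(out_type), next(tie), new_g, out_type, path + [name]),
                )
    return None


if __name__ == "__main__":
    print(a_star(REPOSITORY, "TwoKorean", "Dollar"))
    print(a_star(REPOSITORY, "BranchName", "Dollar"))
```

Because this sketch looks only at type names, it cannot tell CurrencyConversion from PaymentConversion, which share the same input and output types in Table 1-1. That is the limitation of pure input/output matching noted above.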

PAGE 22

[63, 45], users specify a desired composite application by a first-order formula that represents the logic that must be satisfied by the application. With the assumption that all necessary simple Web services are available, this approach finds a combination of services where conjunctions of logics are equivalent to a formula given by a user. In [4], individual atomic Web services are represented in finite state automata (FSA). Given a set of descriptions of component Web services as an automaton, this approach finds a subset of the component services and a mediator with the input of a desired global behavior specified in an automaton.

In addition, there is a template-based approach [61, 8], but this approach requires technical knowledge and experience for describing desired transforms. Furthermore, we have not seen an approach that uses semantic behavior information (the inside of a function) for comparing two services when solving service composition problems.

[21] exploits the structure of the Web services. Woogle employs a novel clustering mechanism that groups parameter names into semantically meaningful concepts, and these concepts are leveraged to determine the similarity of inputs (or outputs) of Web-service operations. The algorithm depends only on the information provided in the WSDL file, without additional annotated information. This approach focuses more on searching for similar Web services than on calculating the similarity between two Web services.

Semantic Tools for Web Services, developed by IBM [32], has Web service matching, discovery, and composition features. The Web services are annotated in Web Services Semantics (WSDL-S) [1]. Using the Web Service Interface Matching feature, one can semi-automatically map the interfaces of two given Web services. Domain-independent and domain-specific ontologies are used to compute an overall semantic similarity score between ambiguous terms. This technology resolves semantic ambiguities in the

PAGE 23

descriptions of Web service interfaces by combining information retrieval and semantic Web techniques. Matches from the two approaches are combined to determine an overall similarity score to help assess the quality of a Web service match to a given request [58, 2]. In cases where single services do not match a given request, the system can compose multiple services by employing artificial intelligence (AI) planning algorithms in order to fulfill a given request.

Figure 2-1. Conceptual architecture of the Morpheus system.

[19, 20] provides an environment for creating, storing, searching, and then executing transforms in order to facilitate the data transformation process. Fig. 2-1 [20] shows the architecture of the Morpheus system. The Morpheus system consists of two parts: the transform construction toolkit (TCT) and the associative repository. The Morpheus system uses the Postgres DBMS [26] as a repository of transforms and at the same time as a platform to run the transforms.

Unlike the typical ETL tools that we mentioned in Chapter 1, the data transformation construction tool in the Morpheus system facilitates creating a new transform, and the created transforms are stored and executed inside a DBMS. Therefore, the Morpheus

PAGE 24

system takes advantage of amenities provided by a modern DBMS, such as efficient storage of data and support for transactions and recovery.

Figure 2-2. Data types and a transform in Morpheus.

The TCT facilitates the creation of a new transform. A user interacts with TCT, which has a browser and GUI for building transforms. Using TCT, the user can create a transform (which we call a Morpheus Transform) consisting of Morpheus primitives, namely Control, Wrapper, Computation, Lookup, and Java function. At the end of the creation, a new transform is first written in XML and translated into a program written in PLjava. The Java program is compiled and registered as a user defined function (UDF) in Postgres. A Wrapper primitive is created by wrapping Web services or Web forms available in the Internet. Web services in WSDL [14] are converted to Java functions, then registered as a UDF. A UDF can take simple types, composite types, or a combination of these as arguments. Postgres users can design a new composite type as a user defined data type (UDD) and UDDs are registered in the Postgres DBMS. Users can create two composite data types as UDDs, including fields in a source and target schema. In Postgres DBMS, data transformation is achieved by adding source data into the database and then running a query that invokes a transform stored as a UDF. The resulting target data are generated in the database after the execution of the query containing the transform.

Related to the example in Fig. 1-1 in Chapter 1, Fig. 2-2 shows the input and output UDDs (KoreanRevenue and SeattleRevenue) and one possible description of the StarbucksT transform with primitives.
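As a rough analogue of such a transform, the Python sketch below follows the schema description of the motivating example in Chapter 1: the date is reformatted from "DD-MM-YYYY" to "MM/DD/YYYY", the coffee and pastry revenues are converted from won to dollars and summed, and the city is looked up from the branch name. It is not the Morpheus implementation, which runs as a Java UDF inside Postgres. The function names, the field names of the input row, the fixed exchange rate, and the branch-to-city table are all assumptions made for illustration.

```python
from datetime import datetime

# Assumed values for illustration only; a real transform would look these up.
KRW_PER_USD = 1000.0
BRANCH_TO_CITY = {"Gangnam Station": "Seoul"}  # made-up sample entry


def convert_date(dd_mm_yyyy):
    """Reformat a date string from DD-MM-YYYY to MM/DD/YYYY."""
    return datetime.strptime(dd_mm_yyyy, "%d-%m-%Y").strftime("%m/%d/%Y")


def won_to_dollar(krw):
    """Convert Korean won to US dollars at the assumed fixed rate."""
    return krw / KRW_PER_USD


def get_city(branchname):
    """Look up the city in which a branch is located."""
    return BRANCH_TO_CITY[branchname]


def starbucks_t(row):
    """Map one source row (date, coffee and pastry revenue in won, branch name)
    to one target row (date, revenue in dollars, city)."""
    return {
        "date": convert_date(row["date"]),
        "revenue": won_to_dollar(row["coffee_revenue"])
        + won_to_dollar(row["pastry_revenue"]),
        "city": get_city(row["branchname"]),
    }


if __name__ == "__main__":
    source_row = {
        "date": "28-02-2009",
        "coffee_revenue": 1500000.0,
        "pastry_revenue": 500000.0,
        "branchname": "Gangnam Station",
    }
    print(starbucks_t(source_row))
```

Each helper corresponds to one attribute-level mapping, so the composite transform is simply the three mappings applied side by side to a single input row.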

PAGE 25

(i.e., date, coffee-revenue, pastry-revenue, branchname) in the source schema (i.e., the schema of Starbucks franchises in Korea in Fig. 1-1) and SeattleRevenue, which includes elements (i.e., date, revenue, city) in the target schema (i.e., the schema of the Starbucks headquarters in Seattle), are created. Then, a user creates a new transform StarbucksT, shown in Fig. 2-2, which maps the input data type KoreaRevenue to the output data type SeattleRevenue.

Fig. 2-2 shows a simplified description of StarbucksT. The date element in SeattleRevenue is converted from the format "DD-MM-YYYY" to "MM/DD/YYYY" by the DateConverter function. The revenue element of SeattleRevenue is generated by adding two values that are converted from the coffee-revenue and pastry-revenue elements of KoreaRevenue using the Won2Dollar function. The city element is derived from the branchname element using the GetCity function. StarbucksT is registered as a user-defined function (UDF) in the Postgres DBMS and is invoked in an SQL query for the actual transformation. For example, Fig. 2-2 shows an SQL query that executes the StarbucksT transform over the two relations (i.e., Sales-Coffee, Sales-Pastry) in the source schema. The result of the query execution is data that fit the target schema Sales-Food.

Morpheus transforms in the repository can be used for creating a new transform. A user can view a transform in the TCT and then edit it to create another transform. Currently, all the steps are performed manually in Morpheus; it has no automatic support for composing a new transform by reusing transforms in the repository.

In Morpheus, a created transform is compiled into a Java class file, registered in the Postgres DBMS, and invoked on the source data set to be transformed using an SQL query. Currently, the registered transform in a UDF is treated as a black box [13, 12, 31, 30, 60, 42] during query optimization, so there are limitations in

PAGE 26

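The StarbucksT mapping described in Chapter 2 (DateConverter, Won2Dollar, GetCity) can be sketched outside the DBMS as plain functions. This is only an illustrative sketch: the fixed exchange rate, the branch naming convention assumed by get_city, and all Python names are hypothetical and are not part of Morpheus.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical fixed rate (won per dollar); the real Won2Dollar source is not specified.
KRW_PER_USD = 1200.0


@dataclass
class KoreaRevenue:
    date: str              # "DD-MM-YYYY"
    coffee_revenue: float  # Korean won
    pastry_revenue: float  # Korean won
    branchname: str


@dataclass
class SeattleRevenue:
    date: str       # "MM/DD/YYYY"
    revenue: float  # US dollars
    city: str


def date_converter(d: str) -> str:
    """Convert "DD-MM-YYYY" to "MM/DD/YYYY"."""
    return datetime.strptime(d, "%d-%m-%Y").strftime("%m/%d/%Y")


def won2dollar(won: float) -> float:
    """Convert Korean won to US dollars with the assumed fixed rate."""
    return won / KRW_PER_USD


def get_city(branchname: str) -> str:
    """Assumes branch names look like "<City>-<Branch>" (an illustrative convention)."""
    return branchname.split("-", 1)[0]


def starbucks_t(src: KoreaRevenue) -> SeattleRevenue:
    """Map one KoreaRevenue record to one SeattleRevenue record."""
    return SeattleRevenue(
        date=date_converter(src.date),
        revenue=won2dollar(src.coffee_revenue) + won2dollar(src.pastry_revenue),
        city=get_city(src.branchname),
    )


example = starbucks_t(KoreaRevenue("25-12-2008", 600000.0, 600000.0, "Seoul-Gangnam"))
```

In Morpheus itself the equivalent logic is registered as a UDF and invoked from an SQL query, so the sketch only mirrors the data flow between the three helper functions.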

PAGE 27

Our transform repository can hold transforms created from scratch by users or harvested from the Internet using a crawler. We need an abstract model that can reflect various kinds of transforms and that carries information useful for finding a desired transform.

Existing standards, such as WSDL [14] and WSDL-S [1], are used to represent Web services. In WSDL-S, semantic words defined in the underlying ontology are used to refer to input/output, precondition/effect, and operation. The standard tries to stay simple and delegates detailed descriptions of semantic meanings to the ontology. Our intuition is that one semantic word for a parameter (i.e., an input or output parameter) is not enough to represent both the semantic meaning and the representation of the parameter. We could use the precondition/effect of WSDL-S, but one precondition per operation is not enough to represent information about multiple parameters. In addition, we claim that one semantic word per operation is not enough to reflect the relationships between inputs and outputs.

PAGE 28

[52]. The definition below formalizes our transform meta model.

[41] dictionary, D is a data type among float, string, int, and date, and R is a word in the format dictionary we make. Each operation $op \in OP$ is a tuple $(OP_I, OP_O)$, where $OP_I$ is a set of inputs and $OP_O$ is a set of outputs of op, with $OP_I \subseteq I$ and $OP_O \subseteq O$.

1-1 in Chap. 1 converts money in Korean won to money in US dollars with an exchange rate at a given date. The amount of money in Korean won and the date in "DD-MM-YYYY" are inputs, and the amount of money in US dollars is an output. The transform can be represented in our model as follows:

$I = \{m_1, d_1\}$, $O = \{m_2\}$, $OP = \{op_1\}$,
$op_1 = (OP_I, OP_O) = (\{m_1, d_1\}, \{m_2\})$,

where m1 is one of the input parameters of CurrencyConversion: its semantic meaning S is money, its data type D is float, and it is represented in Korean currency. d1 is another input parameter of CurrencyConversion: its semantic meaning S is date, its data type D is DATE, and it is represented in DD-MM-YYYY. m2 is an output parameter: its semantic meaning S is money, its data type D is float, and it is represented in US dollars. This transform has one operation op1, in which the input set $OP_I$ includes m1 and d1 and the output set $OP_O$ includes m2.
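The CurrencyConversion example above can be written down as a small data structure. The following is a minimal sketch, assuming Python dataclasses as the encoding; the class and field names are illustrative and do not come from the Morpheus-M implementation, which uses RDF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameter:
    name: str
    S: str  # semantic meaning
    D: str  # data type: float, string, int, or date
    R: str  # representation (data format)


@dataclass(frozen=True)
class Operation:
    OP_I: frozenset  # inputs used by the operation
    OP_O: frozenset  # outputs produced by the operation


@dataclass(frozen=True)
class Transform:
    name: str
    I: frozenset  # input parameters
    O: frozenset  # output parameters
    OP: tuple     # operations

    def is_well_formed(self) -> bool:
        """Check OP_I is a subset of I and OP_O is a subset of O for every operation."""
        return all(op.OP_I <= self.I and op.OP_O <= self.O for op in self.OP)


m1 = Parameter("m1", "money", "float", "Korean won")
d1 = Parameter("d1", "date", "date", "DD-MM-YYYY")
m2 = Parameter("m2", "money", "float", "US dollars")
op1 = Operation(frozenset({m1, d1}), frozenset({m2}))

currency_conversion = Transform(
    "CurrencyConversion", frozenset({m1, d1}), frozenset({m2}), (op1,)
)
```

Frozen dataclasses are used so that parameters can be placed in sets, which mirrors the set notation of the definition.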

PAGE 29

Figure 3-1. A transform represented in graph.

In addition, a transform can be represented as a graph, as shown in Fig. 3-1. We define a transform graph in Definition 2. We use the transform graph for calculating a similarity between two transforms.

3-1. The node set V is a union of nodes representing the following: (1) elements in sets I, O, and OP, (2) S, D, and R of each element in I and O, and (3) dummy nodes representing T, I, O, OP, $OP_I$, and $OP_O$. Each edge $e \in E$ between nodes in V is associated with a weight.

3-2 represents the CurrencyConversion transform. There are dummy nodes T, I, O, OP, $OP_I$, and $OP_O$, which stand for the sets in Definition 1. The input parameter set I has two child nodes p1 and p2, and O has one child node p3. p1 is the node for the input parameter m1; therefore p1 has three child nodes that represent S, D, and R of m1. The operation set OP has one child node that stands for the operation op1 of CurrencyConversion.

Our transform meta model can be represented using RDF triples. RDF [52] is widely accepted as a standard for semantic representation of information on the Web, and we can use RDF for representing a transform semantically. Metadata about transforms are

PAGE 30

Figure 3-2. CurrencyConversion in transform graph.

represented semantically in RDF triples, and those RDF triples form structural RDF graphs. We can compare two transforms semantically using RDF graphs. In addition, the RDF graph is easily extended by adding RDF triples when we add more metadata to accommodate new requirements for describing a data transform. Since RDF is widely accepted as a standard for semantic representation of information on the Web and can be represented in XML, data transforms in our meta model are interchangeable and can be compared easily with other resources on the Web.

4.2.1 1-1. We first show the detailed description of calculating how similar KRW2USD is to the CurrencyConversion transform. Fig. 3-3 shows a transform graph of the KRW2USD transform, where KRW2USD converts

PAGE 31

Figure 3-3. KRW2USD in transform graph.

the amount of money in Korean won to US dollars with the exchange rate at the time of execution.

We apply the similarity function introduced in [66] to our transform model. In [66], information available on the Internet is collected and represented in RDF graphs. In order to do a semantic information search, [66] introduces a formula that compares two RDF [35] graphs. Unlike other approaches in information retrieval that are based on term frequency analysis, matching RDF graphs considers the structural information represented in a graph. By using transform graphs, we calculate how similar transform B is to transform A (i.e., f(A, B)) in terms of how much information in A is covered by B. If B has the same or more information than A, B fully covers A and f(A, B) is 1. Therefore, we can say that A and B are exactly the same when both f(A, B) and f(B, A) are 1.

We calculate f(CurrencyConversion, KRW2USD), that is, how similar KRW2USD is to CurrencyConversion. First, we compare the child nodes of I in the transform graphs in Fig. 3-2 and Fig. 3-3. Each child node has three additional child nodes that hold the semantic meaning, data type, and representation (in other words, data format) of an input parameter (see Definition 1). Let f1, f2, and f3 be functions calculating similarities in terms of semantic meaning, data type, and representation, respectively (these three functions are explained in detail in Sec. 4.2.1). The similarity of (p2, p4) is calculated

PAGE 32

3-2, we have 0 * 1/2 + 1 * 1/2 = 1/2 as a result of applying weights from node I to its child nodes. Using the same method, node O has 1 and node OP has 3/4. Finally, at node T of Fig. 3-2, we get the result 1/2 * 1/3 + 1 * 1/3 + 3/4 * 1/3 = 3/4, which means the similarity of KRW2USD to CurrencyConversion is 3/4. In short, we compare leaf nodes under parameter nodes in both transform graphs, apply weight values on edges to get the value at their parent node, and calculate the final result at the root node T.
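The bottom-up weighting in this worked example can be checked with exact fractions. The sketch below only reproduces the arithmetic stated above; the helper name weighted is an assumption made for illustration, and the child similarities (0 and 1 under I, 1 for O, 3/4 for OP) are taken directly from the text.

```python
from fractions import Fraction as F


def weighted(children):
    """Combine (similarity, edge_weight) pairs of child nodes at their parent node."""
    return sum(sim * w for sim, w in children)


half, third = F(1, 2), F(1, 3)

# Node I has two parameter children with similarities 0 and 1, each weighted 1/2.
node_I = weighted([(F(0), half), (F(1), half)])

# Values for nodes O and OP as stated in the text.
node_O = F(1)
node_OP = F(3, 4)

# Root node T weights its three children I, O, and OP equally.
node_T = weighted([(node_I, third), (node_O, third), (node_OP, third)])
```

With equal edge weights the root value is simply the mean of its children, which is why the result lands at 3/4.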

PAGE 33

In this chapter, we introduce our data transform composition problem and solution approach. We use the A* algorithm with similarity measures we designed and a function Merge to combine transforms correctly while we search for a desired transform using transforms in a repository.

Hence, we model our transform composition problem as a graph search problem and use the A* search algorithm with our heuristic function for finding a user's desired transform. The resulting composite transform may not be exactly the same as the desired transform because we cannot guarantee that all transforms required for creating a desired transform exist in the repository. Therefore, we suggest partial compositions that are useful for constructing the desired transform.

For the rest of this chapter, we use ST and Td to denote the set of transforms in a repository and a desired transform (i.e., goal transform), respectively. Each $T_i \in ST$ and Td are represented in our transform meta model in Definition 1.

PAGE 34

(1) j elements in S1 and one element in S2 have S, D, and R, which are introduced in Definition 1.
(2) The j elements are described with a clause, which is defined in Definition 4.
(3) The set I of dk has j elements and the set O of dk has one element.
(4) OP has operations between elements in I and O.

Next, we give a definition of a clause that is used to specify I of dk.

1, we can describe j elements in S1 for a match to revenue in S2 as follows.

(1) {coffee_revenue, pastry_revenue, date}[]
(2) {coffee_revenue, pastry_revenue}[date]
(3) {}[coffee_revenue, pastry_revenue, date, branchname]

where coffee_revenue = S1.revenue and pastry_revenue is another S1.revenue in S1.

(1) means there is a schema match between three elements (i.e., S1.revenue, S1.revenue, S1.date) and S2.revenue. (2) means S1.revenue and S1.revenue are certainly required and S1.date may be necessary to match S2.revenue. (3) means it is not certain which elements are matched to S2.revenue; therefore all elements in S1 are considered. (1) is the case of finding a composite transform with a given schema match, and (3) is the case where we do not have a given schema match. With (3), our approach tries to find the schema matching and semantic mapping between two data sources.

We define the {A}[B] clause because we try to reduce the burden on users of finding the exact schema matchings. With the clause, our approach finds all possible compositions that have all elements in A as input and some of the elements in B as input. Therefore, our

PAGE 35

Figure 4-1. A part of transform composition graph.

approach reduces the burden of finding schema matching and semantic mapping in the data transformation process.

Next, we define the transform composition problem.

In order to solve our transform composition problem, we model it as a graph search problem. Definition 6 introduces the transform composition graph.

In G', the transform composition problem is to find paths from root to goal that generate a transform that is the same as Td within a given time t. The detailed algorithm is introduced in the next section.

PAGE 36

4-1 shows a part of a transform composition graph. Two transforms Tj and Ti are connected from root to x through y. There is a distance between a node and the connecting transform, such as dist(root, Tj) and dist(y, Ti). In addition, the distance from x to goal is dist(x, goal). The g of x means the distance from root to x, and it is the sum of the distances along the path from root to x, namely dist(root, Tj) and dist(y, Ti). The h of x means dist(x, goal). MT of x is generated by merging Tj and Ti. We use the A* algorithm to find a transform path from root to the goal node.

16]. As shown in Fig. 4-1, each node x in G' has g and h values. The g is the distance from root to x, and h is a heuristically estimated distance from x to goal. A* finds the least-cost path from a given initial node to one goal node out of one or more possible goals. A* uses f(x) = g(x) + h(x) for expanding nodes in G'. The intuition behind f(x) is that advancing from the node x that has the smallest f(x) value is the fastest way (the shortest in terms of distance) to find a path to the goal node. A* incrementally builds all paths leading from the root until it finds one that reaches the goal node, but it only builds paths that appear to lead toward the goal node.

Algorithm 1 shows how the A* algorithm finds a composite transform. At the current expanding node x, all transforms in a repository are connected, and a new state node for each connection is created with the information in Definition 6. The new state node is put in a priority queue, and then the state node that has the least f(x) is selected as the next expanding node. In the following sections, we explain our distance measures for calculating g and h of a state node and the Merge function for merging transforms while a path is lengthened.
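The expansion loop described above follows the standard A* pattern with a priority queue ordered by f(x) = g(x) + h(x). The sketch below is a generic A* over a toy graph, not Algorithm 1 itself: the node names, edge costs, and heuristic values are made up for illustration, and the Merge step and the transform repository are omitted.

```python
import heapq


def a_star(start, goal, neighbors, h):
    """Generic A*: neighbors(n) yields (next_node, edge_cost); h(n) estimates cost to goal."""
    frontier = [(h(start), 0.0, start, [start])]  # entries are (f, g, node, path)
    best_g = {start: 0.0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route to this node was found
        for nxt, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None, float("inf")


# Toy graph with hypothetical distances between connecting steps.
edges = {
    "root": [("y", 0.1), ("z", 0.5)],
    "y": [("x", 0.2)],
    "z": [("goal", 0.9)],
    "x": [("goal", 0.0)],
    "goal": [],
}
estimate = {"root": 0.3, "y": 0.2, "z": 0.4, "x": 0.0, "goal": 0.0}

path, cost = a_star("root", "goal", lambda n: edges[n], lambda n: estimate[n])
```

In the composition setting, each edge would correspond to connecting one more transform and each node to a state holding the merged transform MT, but those details depend on the distance measures defined in the following sections.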

PAGE 37

PAGE 38

4-1, the node x is created and connected to the node y with the edge associated with Ti. The f(x) is calculated using formula (4-1), where w1, w2, and w3 are weight values.

The g(x) is the sum of the g value of the previous node y and the distance between the node y and the subsequent transform Ti; the latter is calculated by the function dist1. When Ti is connected to y, for a parallel composition, the unmatched inputs of dk so far in the path from the root to y are also considered as part of the output of the transform at y in order to match the input of Ti (Sec. 4.2.2 includes a detailed explanation).

Therefore, $newO = y.MT.O \cup (T_d.I - (T_d.I \cap y.MT.I))$, which means a union of y.MT.O (the output of the transform MT at node y) and the unmatched inputs of dk at node y. We make all combinations between elements in newO and Ti.I and find the best combination generating the maximum sum of similarity values (for the connection, all parameters in Ti.I should be matched with some parameter in newO). For calculating the similarity between an element in newO and an element in Ti.I, we use formula (4-2), where $u_p \in newO$ and $v_q \in T_i.I$, and $sim_e$ is the sum of three similarities. As in Definition 1, each parameter in newO and Ti.I has S, D, and R. The $sim_S$ compares S of two parameters. We use the JWordNetSim library for the comparison (if a similarity value

PAGE 39

Let $z = |T_i.I|$, $u_p \in newO$, and $v_q \in T_i.I$; then dist1 is as follows:

The g(x) is the sum of g(y) and dist1(y, Ti). The h(x) is the heuristically estimated distance from the node x to the goal node. At first, we designed h(x) as the distance between the output of the node x and the output of a desired transform using dist1, but our search could not find the desired transform well, since output matching alone is not enough to find a desired transform.

Hence, we design sim2 to calculate the similarity between the transform at the node x (i.e., x.MT) and a desired transform dk. Both transforms are represented in our model and thus can be represented as RDF graphs. sim2(dk, x.MT, I+O+OP) means how much of the inputs, outputs, and operations of dk is covered by the inputs, outputs, and operations of x.MT. Once x.MT has the same inputs, outputs, and operations as dk, sim2(dk, x.MT, I+O+OP) becomes 1. However, x.MT can have more inputs, outputs, and operations that are not in dk. Thus, when we use 1 - sim2(dk, x.MT, I+O+OP) as our h(x), our answer can have many useless inputs, outputs, or operations besides the ones related to the desired transform. If both sim2(dk, x.MT, I+O+OP) and sim2(x.MT, dk, I+O+OP) are 1, dk and x.MT are exactly the same.

The inputs of sim2 are the two transforms to compare and flags (i.e., I, O, and OP), which mean input, output, and operation, respectively. Unlike sim1, not all the inputs of Ti need to be matched. $sim_e$ is used to compare inputs and outputs, and $sim_{op}$ is defined to compare operations of two transforms.
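The construction of newO and the search for the best matching between newO and Ti.I, both used by dist1, can be sketched as follows. This is a sketch under stated assumptions rather than the dissertation's formula: sim_e here is a stand-in that counts exact matches on S, D, and R (the actual measure uses JWordNetSim for S), parameters are modeled as (S, D, R) tuples, and each input of Ti is assumed to be matched to a distinct element of newO.

```python
from itertools import permutations


def new_outputs(y_mt_outputs, y_mt_inputs, td_inputs):
    """newO = y.MT.O united with the inputs of Td not yet consumed by y.MT."""
    td_inputs = set(td_inputs)
    return set(y_mt_outputs) | (td_inputs - (td_inputs & set(y_mt_inputs)))


def sim_e(u, v):
    """Assumed stand-in similarity: fraction of the (S, D, R) fields that match exactly."""
    return sum(a == b for a, b in zip(u, v)) / 3


def best_match(new_o, ti_inputs):
    """Match every input of Ti to a distinct element of newO, maximizing total similarity."""
    new_o, ti_inputs = list(new_o), list(ti_inputs)
    if len(new_o) < len(ti_inputs):
        return None, 0.0  # not every input of Ti can be matched
    best, best_score = None, -1.0
    for combo in permutations(new_o, len(ti_inputs)):
        score = sum(sim_e(u, v) for u, v in zip(combo, ti_inputs))
        if score > best_score:
            best, best_score = list(zip(combo, ti_inputs)), score
    return best, best_score


krw = ("money", "float", "Korean won")
usd = ("money", "float", "US dollars")
date = ("date", "date", "DD-MM-YYYY")

# y.MT consumed the won amount and produced dollars; the date input of Td is still unmatched.
new_o = new_outputs(y_mt_outputs={usd}, y_mt_inputs={krw}, td_inputs={krw, date})
matching, score = best_match(new_o, [date])
```

Keeping the unmatched inputs of the desired transform in newO is what allows a later transform to consume them directly, which corresponds to the parallel composition case.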

PAGE 40

$m = |e_q.OP_O|$, $g_q \in e_q.OP_I$, $h_p \in f_p.OP_I$, $k_q \in e_q.OP_O$, $l_p \in f_p.OP_O$,

$\ldots\ sim_{OP_I}(e_q, f_p) + sim_{OP_O}(e_q, f_p)\ \ldots\ = \max \sum_{q=1}^{l} sim_e(g_q, h_p)\ \ldots$

where w4, w5, and w6 are weights of I, O, and OP, respectively.

We can give different weight values to each matching, but here we give the same weights. Using sim2, we can calculate how similar a transform constructed from the root to the current node x is to dk in terms of inputs, outputs, or operations. The dist2 is calculated by subtracting sim2 from 1. The formula for calculating h(x) appears in (4-1).

In (4-1), the first part of h(x) measures how much of the inputs and operations of dk is present in x.MT, and the second part measures how many other outputs x.MT has besides the output of dk. Considering the second part keeps our search from generating unnecessary outputs.

Using f(x) in (4-1), our approach advances the search toward paths that have less distance between connecting transforms and that have inputs and operations more similar to dk but fewer outputs besides the one in dk.

PAGE 41

Figure 4-2. Single match between two transforms.

There can be a single match or a multiple match between the output of a previous transform (newO) and the input of a subsequent transform (Tb.I), as in Fig. 4-2 and

PAGE 42

Figure 4-3. Multiple match between two transforms.

Fig. 4-3. The figures include all possible merging scenarios. The function Merge keeps, as much as possible, the inputs, outputs, and operations that characterize the resulting transform and removes useless ones.

At root, dk.I and Tb.I are compared and matched (with a distance calculated using dist1), as in case (1) in Fig. 4-2. TM.I becomes the parameter in dk.I that is matched with a parameter in Tb.I, and TM.O becomes Tb.O. Finally, a new transform TM is created like the one on the right side of the arrow in (1). The unmatched parameters in dk.I remain in a gray box.

PAGE 43

4-2) or the unmatched parameter in dk.I (as in (3)). By keeping the unmatched input of dk, we allow not only serial but also parallel connections of transforms. As shown in (2) and (3), the unmatched Ta.O together with Tb.O are included in TM.O, and Ta.I and the matched dk.I become TM.I. When a parameter in Tb.I is matched with an element in dk.I, an additional operation is added to TM.OP, as in (3) in Fig. 4-2. The Merge function preserves as much as possible the semantics of the operations inside transforms.

In a multiple match, as shown in Fig. 4-3, elements in Tb.I can be matched with just the parameters in Ta.O (as in case (4)), with parameters in the unmatched dk.I (as in case (5)), or with parameters in both Ta.O and dk.I (as in case (6)). As shown in (6), two separate operations in Ta can be merged into one operation by a multiple match.

36, 56, 18], partial compositions can be useful because the user can get an idea about useful single transforms without manually searching the transform repository and can elaborate on partial compositions to make a complete composition.

As in Fig. 4-1, there are root, goal, and state nodes in our transform composition graph. A state node that is created during the search for a complete composition has information, as shown in Sec. 4.1, such as the path, which is the ordered list of transform ids connected from the root node to this state node; g, the distance from the root to this state node; and h, the distance from this state node to the goal node. We record the path, g, and h of each state node in a file. Therefore, we have information about the state nodes created during the search for a complete composition.

Those state nodes include pruned state nodes. When a state node s is expanded, transforms in a repository are considered for connection. If the distance (e.g., calculated

PAGE 44

4.2.1) between the output parameters of the transform in s and the input parameters of the transform t being considered for connection is zero, transform t is connected to s. Otherwise, the new state node created by connecting t to s is pruned, since the goal of our algorithm is to find a complete composition that has no distance between connecting transforms.

Figure 4-4. Part of a data transform composition graph.

The paths of all state nodes become candidates for partial compositions. Among the candidates, the partial compositions that we select are the union of the following:

(1) A set of paths that are bottom-k based on f(x) (i.e., g + h) in (4-1). Such a path is close to the goal node and has less distance between transforms. The measure we use is the same as the one we use in a search for a complete composition in a data transformation graph.

(2) A set of paths that are bottom-k based on h(x) (i.e., h) in (4-1). Such a path is close to the goal node. The intuition is that the last transform (e.g., T1 or T3 in Fig. 4-4) that is close to the goal node can be useful for constructing a complete composition even though the path may have a large gap between transforms.
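The selection of partial compositions described above, the union of the bottom-k paths by g + h and the bottom-k paths by h, can be sketched as follows. The record layout (dictionaries with path, g, and h keys) and the numeric values are assumptions made for illustration; the text only states that path, g, and h are recorded for each state node, and the weights of (4-1) are not modeled.

```python
import heapq


def suggest_partial(states, k):
    """Return the union of the k smallest paths by g + h and the k smallest by h."""
    by_f = heapq.nsmallest(k, states, key=lambda s: s["g"] + s["h"])
    by_h = heapq.nsmallest(k, states, key=lambda s: s["h"])
    seen, result = set(), []
    for s in by_f + by_h:
        if s["path"] not in seen:  # keep each path once, in first-seen order
            seen.add(s["path"])
            result.append(s["path"])
    return result


# Hypothetical recorded state nodes; paths are dash-separated transform ids.
recorded = [
    {"path": "23-6", "g": 0.0, "h": 0.2},
    {"path": "157-158", "g": 0.1, "h": 0.3},
    {"path": "157-116", "g": 0.9, "h": 0.1},  # large gap between transforms, but close to goal
    {"path": "165-6", "g": 0.5, "h": 0.6},
]

suggestions = suggest_partial(recorded, k=2)
```

The third record illustrates why the second criterion matters: it would never rank well on g + h, yet its last transform is the closest to the goal.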

PAGE 45

Figure 4-5. Backward search for suggesting partial compositions.

As shown in Fig. 4-5, we can try to fill the big gap between T2 and T1 using a backward search from the inputs of T1. The reason for the gap is that the input of T1 is not matched to the output of the transform at the previous state node. We then try to find any transform that can generate any input of T1. Using a backward search, we may find a path from the root to one of the inputs of T1.

PAGE 46

In this chapter, we introduce the overall architecture of the data transform composition system we have implemented and then explain its important components in detail.

Figure 5-1. The architecture of our system.

Fig. 5-1 shows the overall architecture of our solution approach to the data transform composition problem. Broadly, our architecture has two parts: one where complete compositions are generated using transforms in the repository, and another where the complete compositions are executed over source data. We use Morpheus (see Section 2.5) as the execution platform for the complete compositions.

In order to construct a transform repository, we need to harvest transforms and describe them in our transform meta model. As shown in Fig. 5-1, using our annotation tool, software components crawled from the Internet are represented in our transform meta model, Morpheus-M, and stored in the repository. At the same time, they are compiled into Java programs and registered in Morpheus for future execution.

PAGE 47

4, finds complete compositions that are the same as or similar to the desired transform using transforms in the repository. The transform composition module uses the JWordNetSim library to compare transforms in Morpheus-M. When we store and compose transforms, we use the WordNet dictionary and internal domain knowledge.

In fact, it can be challenging to discern the internal program logic of a transform that is crawled from Web services. Therefore, in order to decide the correct composition among the complete compositions our algorithm has found, we need to apply the complete compositions to actual source data to validate whether the solution generates the target data correctly.

After one complete composition is found, it is compiled into Java programs and registered in Morpheus. We execute an SQL query that includes the registered composite transform as a UDF over the source data in Morpheus. If the complete composition generates data that are the same as the sample target data, the complete composition is the transform that the user wants to create. In case our system cannot find a validated complete composition, we suggest the partial compositions to the user.

19]. Transforms in the repository must be represented in our transform model Morpheus-M.

We developed a semantic annotation tool in order to represent crawled transforms in our model. The crawler developed in the Morpheus project harvests URLs that have web

PAGE 48

From the above three software components, we use Web services in order to construct an experimental repository. We developed a tool that parses pages in WSDL. We parsed the WSDL files and translated them into our transform meta model. Currently, we do the annotation manually, but metadata annotated once remain accessible in the future and consistently enrich automatic data transform composition [44].

Besides Web services, Web forms could be another candidate for crawling. Unlike Web services, which are provided with WSDL files and text-based explanatory Web pages, a Web form has better extractable information on its related HTML pages. For example, a Web form has a page that includes a form and corresponding JavaScript that checks whether the form input given by a user is correct. For a text edit control in a form, we can extract the name and label of the control. The label of the control is provided to the users of the Web form to explain the text control; therefore it has a very precise meaning in itself. In addition, a form is usually accompanied by JavaScript functions that let a user give the input in the correct format. Therefore, we can extract more accurate semantic meaning and format of the inputs through Web forms.

PAGE 49

Figure 5-2. Web service semantic annotation tool.

Fig. 5-2 shows the GUI of our annotation tool. We parse a WSDL file and extract operations. For each operation, we extract information regarding the input and output messages that describe the input/output parameters of the operation. For each parameter, we extract the name and data type of the parameter. Using the name, we list words in the WordNet dictionary so that a user can select a word that has the same semantic meaning as the name. In addition, our tool lists possible formats for the word, and the user can select the representation of the parameter among them. In short, using our semantic annotation tool, a user can annotate each parameter with the data type, semantic meaning, and representation that are required for it to be represented in our transform meta model.

PAGE 50

In order to reduce the search space, we apply the following techniques:

Rather than trying to find a mapping between all attributes in the source schema and one candidate attribute in the target schema, a user can specify attributes of the source schema in two sets, A and B. A is a set of attributes in the source schema that are certain to be usable for generating a specific attribute in the target schema, and B is a set of attributes that might or might not be helpful for generating that attribute. We define a clause in Definition 4 in Sec. 4.1 for this purpose.

In the Web service composition area, six single Web services are usually enough to create a new complex Web service [37]. We limit the search depth to six based on this justification. In addition, we use a threshold on the distance between transforms to select connecting nodes from the current node. This can be justified because our approach tries to find a complete composition without distances between transforms.
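The effect of splitting the source attributes into a required set A and an optional set B can be illustrated by enumerating the input sets that a clause {A}[B] admits. This is a sketch of the idea only; the function name and the list-based encoding are assumptions, and the actual search does not necessarily enumerate these sets explicitly.

```python
from itertools import combinations


def candidate_input_sets(required, optional):
    """Yield every input set allowed by {A}[B]: all of A plus any subset of B."""
    required, optional = list(required), list(optional)
    for r in range(len(optional) + 1):
        for extra in combinations(optional, r):
            yield required + list(extra)


# Clause (2) from the Starbucks example: {coffee_revenue, pastry_revenue}[date]
clause_2 = list(candidate_input_sets(["coffee_revenue", "pastry_revenue"], ["date"]))

# With no required attributes, every subset of the four attributes is a candidate.
clause_3 = list(
    candidate_input_sets([], ["coffee_revenue", "pastry_revenue", "date", "branchname"])
)
```

A clause with |B| optional attributes admits 2^|B| input sets, so moving an attribute from B to A halves the number of candidates.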

PAGE 51

24] and gather 200 Web services on the Internet by using the Morpheus crawler [19]. In addition, we surf HTML pages that have Web forms (e.g.,

All transforms in the repository are represented in our transform meta model Morpheus-M. Using our semantic annotation tool introduced in Sec. 5.2, we convert Web services in WSDL to our transform meta model Morpheus-M. Table 6-1 shows part of our transform repository.

The machine we use has an Intel(R) Pentium Dual CPU at 2 GHz with 3 GB of RAM and runs Windows Vista. We use two examples in our experiments. Example A is the Starbucks Revenue Conversion example introduced in Chap. 1, and Example B is the Employee Payment Conversion example.

Example B has two relational schemas, S and T, shown in Fig. 6-1. In a global company, suppose employees working in a branch located in the US move to a branch in Korea and will be paid in Korean won instead of US dollars. The schema S is the relation for the branch in the US, and T is the one for the branch in Korea. We need to convert data in S to data that fit T. For the conversion, we need schema matchings and semantic mappings between the two schemas.

PAGE 52

Table 6-1. Sample transforms

ID  | Input                                                              | Output                                         | Description
5   | m1: money in Korean; m2: money in Korean                           | m3: money in Korean                            | Add two moneys
13  | m1: money in Korean                                                | m3: money in USD                               | Convert money in Korean to money in USD with an exchange rate at the execution time
28  | m1: money in USD; m2: money in USD                                 | m3: money in USD                               | Add two moneys in USD
34  | m1: money in Korean; m2: money in Korean                           | m3: money in USD                               | Add two moneys in Korean currency and then convert to USD with an exchange rate at the execution time
104 | d1: date in dd-mm-yyyy                                             | e1: exchange rate from Korean to USD           | Get exchange rate from Korean to USD at the given date
105 | m1: money in Korean; e1: exchange rate                             | m2: money in USD                               | Convert money in Korean currency to money in USD with a given exchange rate
122 | d1: date in dd-mm-yyyy; m1: money in Korean; m2: money in Korean   | m3: money in USD                               | Add two moneys in Korean currency and then convert to USD with an exchange rate of a given date
127 | m1: money in Korean; m2: money in Korean                           | m3: money in USD                               | Subtract m2 from m1 and assign to m3

PAGE 53

Figure 6-1. The schema of Example B.

Table 6-2. Experimental result

and show the contribution of our model. Third, we show the scalability of our approach, and finally, we show the case in which our approach finds partial compositions when there is no complete composition. We use Example A for the first three experiments and Example B for the last experiment.

6-2 shows the specification when we have given schema matches. The d2 means there is a match between three attributes (i.e., S1.date, S1.revenue, and S1.revenue) of S1 and one attribute (i.e., S2.revenue) of S2. Our composition algorithm finds all possible compositions that can generate S2.revenue from S1.date, S1.revenue, and S1.revenue with transforms in the repository.

Table 6-2 shows our experimental results. Our algorithm finds a set of single or composite transforms for d2. The unique numbers of the transforms are connected with a dash as a delimiter. For example, 5-104-105 is a connection of three transforms identified as 5, 104, and 105 (cf. Table 6-1). We cannot be sure that the result of our algorithm is the correct transform that converts source data to target data because we do not know the

PAGE 54

Figure 6-2. The desired transform specification 1.

exact internal program logic of each transform in the repository; the transforms are represented only abstractly in our model. We simply find possible answers using our model and distance measures. Therefore, as a next phase, we apply the transforms we found to the source data and check whether they can generate the target data correctly. As shown in Fig. 5-1, we execute the resulting single or composite transforms on Morpheus over existing source data with a query that invokes a transform as a UDF.

We execute the resulting composite transforms over source data. The compositions 5-104-105 (or 104-5-105), 122, and 5-1 generate correct target data.

In the search for a goal transform in a transform composition graph, we compare the output parameters of the previous transform to the input parameters of the next transform to see whether the two transforms can be connected. We use our metadata, namely primitive

PAGE 55

1 use data type only
2 use data type and semantics
3 use data type, semantics, and data format

Formally, the $sim_e$ in formula (4-2) (i.e., the formula calculating the distance between two transforms) changes as follows as we change the level of metadata utilization. Below, formula (6-1) corresponds to level 1, (6-2) to level 2, and (6-3) (the same as (4-2)) to level 3 (see Sec. 4.2.1 for a detailed explanation of these formulae).

where $u_p$ is an output parameter of the previous transform and $v_q$ is an input parameter of the next transform.

Additionally, we define the model-based answer and the validated answer. Our algorithm uses the metadata of a transform to find a complete composition rather than using the intermediate data generated by executing a transform. Therefore, the complete composition found by our algorithm, namely a model-based answer, may not be the right data transform, that is, the one that can convert the source data to target data correctly. Thus, as soon as we find a model-based answer, we apply it to the source data and check whether it generates the target data correctly. If a model-based answer generates the target data correctly, we call it a validated answer. We can apply a model-based

PAGE 56

Next, we define a participating transform. In a transform composition graph, all transforms in the repository are considered for connection to the current expanding node. The current expanding node has a transform that is constructed by merging the transforms in the path from the root node to the current expanding node. Therefore, the outputs of the transform in the current expanding node are compared to the inputs of the transform that is considered for connection. If there is no distance between output and input parameters, the two transforms can be connected, and the transform connected to the current expanding node becomes a participating transform. The execution time of our transform composition algorithm can be exponential in the number of participating transforms.

6-3 shows that as the metadata utilization is increased, the execution time is greatly decreased. The execution time is the time to find all model-based answers using our approach (i.e., the worst case is that a validated answer appears last among all model-based answers).

Our experiment shows that our metadata (specifically, the data format of a parameter) are useful for quickly finding the model-based answers, which are a set of candidate answers that includes the validated answers. The more metadata we use in a search, the better we can single out the transforms that are useful for composing a goal transform. Consequently, the number of participating transforms in a search decreases, which decreases the execution time exponentially. In short, using our metadata reduces the exponential search space. As in [44], once manually created, metadata can consistently improve the searching capability in our data transform composition.
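How the metadata utilization level changes the number of participating transforms can be illustrated with a small filter. The sketch assumes exact matching on data type (level 1), data type and semantics (level 2), or data type, semantics, and data format (level 3); the actual distance measures are graded rather than exact, and the repository entries other than KRW2USD are hypothetical.

```python
def matches(out_param, in_param, level):
    """Exact match on D (level 1), D and S (level 2), or D, S, and R (level 3)."""
    keys = ("D", "S", "R")[:level]
    return all(out_param[k] == in_param[k] for k in keys)


def participating(available_outputs, repository, level):
    """Names of transforms whose every input is matched by some available output."""
    return [
        name
        for name, inputs in repository.items()
        if all(any(matches(o, i, level) for o in available_outputs) for i in inputs)
    ]


# Output of the transform at the current expanding node.
available = [{"S": "money", "D": "float", "R": "Korean won"}]

# Toy repository mapping a transform name to its input parameters.
repository = {
    "KRW2USD": [{"S": "money", "D": "float", "R": "Korean won"}],
    "USD2EUR": [{"S": "money", "D": "float", "R": "US dollars"}],
    "Celsius2Fahrenheit": [{"S": "temperature", "D": "float", "R": "Celsius"}],
}

counts = [len(participating(available, repository, level)) for level in (1, 2, 3)]
```

In this toy repository the candidates shrink from three to two to one as more metadata is used, which is the pruning effect the experiment measures.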

PAGE 57

Figure 6-3. Efficiency by varying metadata utilization with 150 transforms.

6-4 shows how the precision of our algorithm changes as the metadata utilization level is increased. The precision and recall in our experiment are defined as follows:

Precision = retrieved validated answers / total retrieved model-based answers
Recall = retrieved validated answers / total validated answers

Since our approach finds all validated answers even as we vary the metadata utilization level, we can show how the precision changes. As in Fig. 6-4, metadata utilization level 3 has the highest precision. The more we use our metadata, the better our algorithm can single out useful transforms. Consequently, the number of model-based answers is reduced as we increase metadata utilization. Since the number of retrieved validated answers stays the same even as we increase the metadata utilization, the precision increases with the metadata utilization. Our goal is to find the first validated answer among the model-based answers. Our experiment shows how precisely we can find the first one we want, and our transform meta model ensures the high precision.

Other interesting experiments could be performed by applying distance measures used in other research (e.g., Web service composition) to our framework and showing the accuracy of the answers. However, we can demonstrate with our experimental

PAGE 58

Figure 6-4. Precision by varying metadata utilization with 150 transforms.

results that those will not narrow down well to the correct answers since they use less metadata than our approach. This will produce the same phenomenon as the second experiment in this section.

For the first part of h(x), we can change the flag (i.e., I+OP) of dist2. The flag I means inputs of a transform, O means outputs of a transform, and OP means operations of a transform. I+OP means that we consider inputs and operations when we calculate the distance and do not consider the outputs. We measure how many nodes in a transform composition graph are expanded to find each complete composition while varying the flag (i.e., I+OP, I+O+OP, I+O). The three experiments with different flags generate the same 53 complete compositions. Our experiment shows the quality of the distance measures. If we can estimate the heuristic distance better, the search reaches the answers faster with fewer expanded nodes. Fig. 6-5 shows that using the I+OP or I+O+OP flags is better than using the I+O flag. In other words, considering the operation defined in our transform meta model leads to the answers faster. This justifies the necessity of operation in our

PAGE 59

Figure 6-5. The number of expanded nodes so far when each complete composition appears (total 53 answers). We vary h(x) by changing the flag (e.g., I+O, I+O+OP, I+OP, I). We test with 1000 transforms in the repository. The search is terminated when a total of 3788 nodes are expanded.

transform meta model. The number of expanded nodes is also related to the execution time of the search.

PAGE 60

Figure 6-6. Efficiency by varying the size of repository.

repository. Next, we show how the worst/best case execution time changes as we vary the number of participating transforms.

6-6 shows that our algorithm is scalable in the size of the repository if the number of participating transforms is equal. This is because our algorithm prunes the search space with the metadata of our transform meta model. The number of participating transforms is related to the exponential search space, but our algorithm (specifically, matching parameters) is linear in the size of the repository. The algorithm tries to find all possible model-based answers in a transform composition graph that can lead to a goal node by matching parameters of consecutive transforms. Then we apply those model-based answers to the source data in order to find a validated answer.

Recent techniques in semantic mapping intensively use intermediate data generated by executing transforms (or operators, searchers in their context) [6, 18, 25] to find complex matchings between two schemas. Those techniques can reach the correct answer directly (i.e., they have one step to the final answer, unlike our two steps), but they are not able to easily prune a search space using intermediate data generated by executing a connecting transform, because it is difficult to figure out from intermediate data whether it is a useful

PAGE 61

Worstcase/bestcaseexecutiontimeasaresultofvaryingthenumberofparticipatingtransforms dataowornotinthesearchforapathtoagoal.Wecanclaimthatpruningwithourmatchingalgorithmkeepsthepossibledataowthatcangenerateagoaltransform,andreducethesearchspaceeectively.Asaresult,weseparatethetransformcompositionphasefromthetransformexecutionphase,whichmakesourapproachmorerealisticbecauseweassumethatourtransformscanberemoteWebservices. 6-7 showstheworst/bestcaseexecutiontimesforndingavalidatedanswer.Basically,weusetheA*algorithm,whichndsthebestonerst(i.e.,thebestpathintermsofourdistancemeasures).Therefore,therstanswerappearsquickly,butndingallanswerstakestimesimilartoabrute-forcesearch.Ifthereisnoanswer,wecansearchtheentirespacelikeabrute-forcesearch.Inouralgorithm,weterminateasearchwhenitreachestheexecutiontimeconstraint,andthenmoveontondpartialcompositions.However,ifthereisananswer,wecanstopthesearchassoonastherstvalidatedanswerappears.Inthecase 61


[...] 6-7, even as we increase the number of participating transforms.

[...] 6-1 that we do not have exact schema matchings between S and T. Our goal transform is d3, whose inputs are the US salary and zip code attributes in S and whose output is the wage attribute in T. Let us assume that the input attributes in S are described by a user as follows: [...]

As in Definition 4 in Sec. 4.1, a user specifies that US salary in S is required to find a semantic mapping for wage in T, and that zip code might or might not be used. With these inputs, our algorithm tries to find complete compositions that can generate the output wage attribute using US salary and zip code. With 160 transforms in the repository, our search terminates without a complete composition.

Among the paths of state nodes generated during the search for the complete composition, we extract the following sets. The set A contains the bottom-10 paths based on f(x) in (4-1), and the set B contains the bottom-10 paths based on h(x) in (4-1). The numbers below are transform identifiers.

A = {23-6, 157-158, 23-6-31, 165-6, 157-158-103, 23-6-103, 161-130-6, 161-130-6-9, 161-130-53-6, 161-130-6-148-27}

A ∪ B = {23-6, 157-158, 157-116, 157-158-103, 23-6-31, 157-158-103, 165-6, 165-123, 165-125, 23-6-103, 161-130-6, 161-130-6-9, 161-130-53-6, 161-130-6-3, 161-130-6-148-27}
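The selection of the sets A and A ∪ B can be sketched as follows. This is an illustration under stated assumptions, not the system's code: it assumes that (4-1) has the usual A* form f(x) = g(x) + h(x), that "bottom-k" means the k lowest-scoring paths, and it uses a hypothetical `ScoredPath` record with made-up path identifiers and scores.

```python
import heapq
from typing import Callable, Iterable, NamedTuple, Set, Tuple


class ScoredPath(NamedTuple):
    """Hypothetical record for one path of state nodes seen during the search."""
    path: str   # transform ids joined by '-', e.g. "1-2-3"
    g: float    # cost accumulated so far
    h: float    # heuristic distance to the goal

    @property
    def f(self) -> float:
        # Assumed A*-style total: f(x) = g(x) + h(x).
        return self.g + self.h


def bottom_k(paths: Iterable[ScoredPath],
             key: Callable[[ScoredPath], float], k: int) -> Set[str]:
    """Return the ids of the k paths with the smallest key value."""
    return {p.path for p in heapq.nsmallest(k, paths, key=key)}


def partial_candidates(paths: Iterable[ScoredPath], k: int) -> Tuple[Set[str], Set[str]]:
    """Return (A, A union B) for bottom-k by f(x) and bottom-k by h(x)."""
    paths = list(paths)
    a = bottom_k(paths, lambda p: p.f, k)
    b = bottom_k(paths, lambda p: p.h, k)
    return a, a | b


# Made-up scores, for illustration only.
seen = [
    ScoredPath("1-2", 0.2, 0.1),
    ScoredPath("1-2-3", 0.6, 0.15),
    ScoredPath("4-5", 0.1, 0.5),
    ScoredPath("4-5-6", 0.9, 0.2),
]
set_a, set_a_or_b = partial_candidates(seen, k=2)
print(sorted(set_a), sorted(set_a_or_b))
```

Ranking by h(x) alone can surface longer paths that end close to the goal even when their accumulated cost keeps them out of the f(x) ranking, which is why the union can be larger than A.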



In this chapter, we summarize our research and describe future work.

Unlike previous work, this dissertation focuses on semi-automatic data transform composition that reuses transforms in a repository to construct the transforms users want. Finding semantic mappings by composing transforms involves a massive search space, so we need a sophisticated solution. In addition, we cannot guarantee that the transforms in the repository are sufficient for composing any new transform. We face the following challenges: how to formally represent transforms, how to efficiently find or compose transforms, and how to provide partial solutions when there is no complete solution.

We create the transform meta model, and a transform in our model can be represented as a graph. We can compare two transforms semantically using these graphs. The metadata in our model can also be represented semantically as Resource Description Framework (RDF) [35] triples, and those RDF triples form structural RDF graphs. By using the RDF framework, our transform meta model gains two merits: new semantic metadata can be added easily as RDF triples when new requirements for describing a data transform arise, and data transforms in our meta model are interchangeable and can be compared with [...]


Our model includes not only the primitive data type and semantic meaning of a parameter, as in existing Web services standards, but also the data format of a parameter. In addition, metadata on operations describe the relationships between the inputs and outputs of a transform. Our experiment shows that our meta model greatly speeds up the search for complete compositions and provides high precision.

Based on our transform meta model, we model the data transform composition problem as a graph search problem and design sophisticated distance measures (i.e., similarity measures) to guide the search with the A* algorithm. These distance measures enable our search to progress correctly to a complete composition and to reduce the exponential search space by pruning. In composing transforms, our algorithm preserves the behavior of transforms as much as possible using our transform meta model, so we can find a goal transform correctly. When there is no complete composition, our algorithm provides partial compositions, drawn from the transforms in the repository, that are useful for constructing a goal transform.

We design and implement a prototype system for semi-automatic data transform composition using transforms, described in our model, in a repository. Our system provides an automatic search of a large repository to find or compose a desired transform. Our semantic annotation tool is used to convert crawled Web services in WSDL to our model semi-automatically. Using our system, users can reduce their effort in the time-consuming and error-prone semantic mapping steps of the data transformation process, thereby reducing the effort of information integration that is required in many applications.
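The graph search formulation summarized above can be sketched in a few lines. This is a simplified illustration, not the prototype: a state is the set of parameter labels available so far, a transform is applicable when its inputs are all available, each transform costs one step, and the heuristic is a trivial placeholder (0 at the goal, 1 otherwise) rather than the distance measures of the transform meta model. The `Transform` record, the function name `compose`, the `max_expansions` budget (standing in for the execution time constraint), and the example repository are all hypothetical.

```python
import heapq
from itertools import count
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Transform(NamedTuple):
    """Hypothetical transform description: name plus parameter labels."""
    name: str
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]


def compose(repository: List[Transform], source: Iterable[str], target: str,
            max_expansions: int = 1000) -> Optional[List[str]]:
    """A* over sets of available parameters; returns transform names or None."""

    def h(state: FrozenSet[str]) -> int:
        # Placeholder heuristic: at least one more transform is needed
        # unless the target is already available.
        return 0 if target in state else 1

    start = frozenset(source)
    tie = count()  # tie-breaker so the heap never compares states or paths
    frontier = [(h(start), next(tie), 0, start, [])]
    best_g = {start: 0}
    expanded = 0
    while frontier and expanded < max_expansions:
        _, _, g, state, path = heapq.heappop(frontier)
        if target in state:
            return path
        if g > best_g.get(state, float("inf")):
            continue  # stale queue entry
        expanded += 1
        for t in repository:
            # Prune: inputs must match available parameters, and the
            # transform must contribute at least one new parameter.
            if not t.inputs <= state or t.outputs <= state:
                continue
            nxt = state | t.outputs
            new_g = g + 1
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + h(nxt), next(tie), new_g, nxt, path + [t.name]),
                )
    return None  # no complete composition found within the budget


# Made-up repository for illustration.
repo = [
    Transform("usd_to_eur", frozenset({"us_salary"}), frozenset({"eur_salary"})),
    Transform("eur_salary_to_wage", frozenset({"eur_salary"}), frozenset({"wage"})),
    Transform("zip_to_city", frozenset({"zipcode"}), frozenset({"city"})),
]
print(compose(repo, {"us_salary", "zipcode"}, "wage"))
```

Because applicability is decided from parameter metadata alone, no transform is executed during the search; a composition found this way would still need to be validated against the source data in a separate step.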


[...] The Deep Web [29, 10, 34] refers to content hidden behind HTML forms. In order to get to such content, a user has to perform a form submission with valid input values. Search engines like Google try to retrieve the hidden content along with general HTML pages. Our framework for data transform composition can be applied to automatic Deep Web query processing.

For example, a user can frame a query like "How long does it take to go from Orlando International Airport to Disney World by car?" In order to answer the question, we need to extract Deep Web content step by step: getting the addresses of Orlando International Airport and Disney World, getting the distance between them, and calculating the time to drive that distance by car. The query can be answered using the Deep Web content of the Google Map site (e.g., [...]).

Second, we can extend our work to optimizing the execution of a composed data transform. The composed data transform is compiled into a Java class file, registered in the Postgres DBMS as a UDF, and invoked by an SQL query on the source dataset to perform the actual transformation. Currently, the transform registered as a UDF is treated as a black box during query optimization [13, 12], which limits the optimization of a query that invokes the composed data transform. It would be beneficial to find opportunities for further optimization by looking inside the description of the composed data transform.

Third, we have shown that our transform meta model contributes greatly to reducing the exponential search space of our data transform composition. We can work on gathering other valuable metadata as automatically as possible [57]. As the kinds of semantic metadata increase, designing effective distance measures can become challenging.


[1] R. Akkiraju, J. Farrell, J. Miller, M. Nagarajan, M.-T. Schmidt, A. Sheth, and K. Verma. Web service semantics WSDL-S.
[2] R. Akkiraju, A. Ivan, R. Goodwin, B. Srivastava, and T. Syeda-Mahmood. Semantic matching to achieve web service discovery and composition. In CEC-EEE '06: Proceedings of the 8th IEEE International Conference on E-Commerce Technology and the 3rd IEEE International Conference on Enterprise Computing, E-Commerce, and E-Services, page 70, Washington, DC, USA, 2006. IEEE Computer Society.
[3] G. Alonso, F. Casati, H. Kuno, and V. Machiraju. Web Services: Concepts, Architectures and Applications. Springer-Verlag, Berlin, Germany, 2003.
[4] D. Berardi, D. Calvanese, G. D. Giacomo, M. Lenzerini, and M. Mecella. Automatic composition of e-services that export their behavior. In Proc. of the 1st International Conference on Service Oriented Computing, 2003.
[5] P. A. Bernstein and T. Bergstraesser. Meta-data support for data transformations using Microsoft Repository. IEEE Data Eng. Bull., 22(1):9-14, 1999.
[6] P. A. Bernstein and S. Melnik. Model management 2.0: manipulating richer mappings. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1-12, New York, NY, USA, 2007. ACM.
[7] P. Carreira and H. Galhardas. Execution of data mappers. In IQIS '04: Proceedings of the 2004 international workshop on Information quality in information systems, pages 2-9, New York, NY, USA, 2004. ACM.
[8] F. Casati, S. Ilnicki, L. J. Jin, V. Krishnamoorthy, and M. Shan. Adaptive and dynamic service composition in eFlow. In Proc. of the International Conference on Adv. Info. Systems Engineering, 2000.
[9] G. Chafle, S. Chandra, V. Mann, and M. G. Nanda. Decentralized orchestration of composite web services. In WWW '04: Proceedings of the 13th International World Wide Web Conference. ACM, May 2004.
[10] K. C.-C. Chang and J. Cho. Accessing the web: from search to integration. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 804-805, New York, NY, USA, 2006. ACM.
[11] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Rec., 26(1):65-74, 1997.
[12] S. Chaudhuri and K. Shim. Query optimization in the presence of foreign functions. In VLDB '93: Proceedings of the 19th International Conference on Very Large Data Bases, pages 529-542, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc.


[13] S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. ACM Trans. Database Syst., 24(2):177-228, 1999.
[14] E. Christensen, F. Curbera, G. Meredith, and S. Weerawarana. Web services description language (WSDL) 1.1.
[15] W. M. Coalition. Process exchange specification language.
[16] B. Coppin. Artificial Intelligence Illuminated. Jones and Bartlett Publishers, Sudbury, Massachusetts, 2004.
[17] S. B. Davidson and A. Kosky. Specifying database transformations in WOL. IEEE Data Eng. Bull., 22(1):25-30, 1999.
[18] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos. iMap: discovering complex semantic matches between database schemas. In SIGMOD '04, pages 383-394, 2004.
[19] P. Dobbins, T. Dohzen, C. Grant, J. Hammer, M. Jones, D. Oliver, M. Pamuk, J. Shin, and M. Stonebraker. Morpheus 2.0: A data transformation management system. In InterDB, VLDB workshop, 2007.
[20] T. Dohzen, M. Pamuk, S.-W. Seong, J. Hammer, and M. Stonebraker. Data integration through transform reuse in the Morpheus project. In SIGMOD '06, pages 736-738, 2006.
[21] X. Dong, A. Halevy, J. Madhavan, E. Nemes, and J. Zhang. Similarity search for web services. In VLDB '04: Proceedings of the Thirtieth international conference on Very large data bases, pages 372-383. VLDB Endowment, 2004.
[22] B. A.-M. et al. Template-based semantic similarity for security applications. Technical Report, LSDIS Lab, Computer Science Department, University of Georgia, Jan. 2005.
[23] D. M. et al. OWL-S: Semantic markup for web services. W3C, 2004.
[24] J. Fan and S. Kambhampati. A snapshot of public web services. SIGMOD Rec., 34(1):24-32, 2005.
[25] G. H. Fletcher and C. M. Wyss. Data mapping as search. In EDBT 2006: Advances in Database Technology. Springer LNCS 3896, 2006.
[26] P. G. D. Group. PostgreSQL.
[27] A. Y. Halevy, Z. G. Ives, P. Mork, and I. Tatarinov. Piazza: data management infrastructure for semantic web applications. In WWW '03: Proceedings of the 12th international conference on World Wide Web, pages 556-567, New York, NY, USA, 2003. ACM.


[28] B. He, K. C.-C. Chang, and J. Han. Discovering complex matchings across web query interfaces: a correlation mining approach. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 148-157, New York, NY, USA, 2004. ACM.
[29] B. He, M. Patel, Z. Zhang, and K. C.-C. Chang. Accessing the deep web. Commun. ACM, 50(5):94-101, 2007.
[30] J. M. Hellerstein. Optimization techniques for queries with expensive methods. ACM Trans. Database Syst., 23(2):113-157, 1998.
[31] J. M. Hellerstein and M. Stonebraker. Predicate migration: optimizing queries with expensive predicates. In SIGMOD '93: Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pages 267-276, New York, NY, USA, 1993. ACM Press.
[32] IBM. Semantic tools for web services.
[33] IBM. Workflow management coalition.
[34] L. A. Jayant Madhavan, Loredana Afanasiev and A. Halevy. Harnessing the deep web: Present and future. In CIDR '09: 4th Biennial Conference on Innovative Data Systems Research, 2009.
[35] G. Klyne and J. J. Carroll. Resource Description Framework (RDF): Concepts and abstract syntax. W3C, 2004.
[36] P. Kungas and M. Matskin. Detection of missing web services: The partial deduction approach. In NWESP '05: Proceedings of the International Conference on Next Generation Web Services Practices, page 339, Washington, DC, USA, 2005. IEEE Computer Society.
[37] P. Kungas and M. Matskin. From web services annotation and composition to web services domain analysis. IJMSO, 2(3):157-178, 2007.
[38] D. Lin. An information-theoretic definition of similarity. In ICML 98: Proceedings of the Fifteenth International Conference on Machine Learning, pages 296-304. Morgan Kaufmann Publishers, 1998.
[39] L. Lin and I. B. Arpinar. Discovery of semantic relations between web services. In ICWS '06. International Conference on Web Services, pages 357-364, Sep. 2006.
[40] J. Madhavan and A. Y. Halevy. Composing mappings among data sources. In VLDB '2003: Proceedings of the 29th international conference on Very large data bases, pages 572-583. VLDB Endowment, 2003.
[41] G. A. Miller. WordNet: a lexical database for English. Commun. ACM, 38(11):39-41, 1995.


[42] J. Myerson. Work with Web services in enterprise-wide SOAs, part 5: Optimize web service applications with WebSphere business integration tools. IBM developerWorks, 2005.
[43] S.-C. Oh, B.-W. On, E. J. Larson, and D. Lee. BF*: Web services discovery and composition as graph search problem. In EEE '05: Proceedings of the 2005 IEEE International Conference on e-Technology, e-Commerce and e-Service.
[44] N. I. S. Organization. Understanding metadata. NISO Press, 2004.
[45] S. R. Ponnekanti and A. Fox. SWORD: A developer toolkit for web service composition. In Proc. of the 11th International Conference on WWW, 2002.
[46] M. P. Singh. The pragmatic web. In IEEE Internet Computing, pages 4-5, 2002.
[47] G. Qian and Y. Dong. A step towards incremental maintenance of the composed schema mapping. In CIKM '08: Proceeding of the 17th ACM conference on Information and knowledge management, pages 173-182, New York, NY, USA, 2008. ACM.
[48] E. Rahm and P. A. Bernstein. On matching schemas automatically. VLDB Journal, (4), 2001.
[49] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334-350, 2001.
[50] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3-13, 2000.
[51] E. A. S. Ghandeharizadeh and S. Manjunath. Proteus RTI: A framework for on-the-fly integration of biomedical web services. USC Database Laboratory Technical Report Number 2006-05, 2006.
[52] J. Shin, J. Hammer, and H. Lam. RDF-based approach to data transform composition. In 7th IEEE/ACIS International Conference on Computer and Information Science, IEEE/ACIS ICIS 2008, 14-16 May 2008, Portland, Oregon, USA, pages 645-648, 2008.
[53] J. Shin, J. Hammer, and W. J. O'Brien. Distributed process integration: Experiences and opportunities for future research. In IEEE International Workshop on Web and Mobile Information Systems (WAMIS), 2006.
[54] A. Simitsis, P. Vassiliadis, and T. Sellis. Optimizing ETL processes in data warehouses. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering, pages 564-575, Washington, DC, USA, 2005. IEEE Computer Society.
[55] A. Simitsis, P. Vassiliadis, and T. Sellis. Optimizing ETL processes in data warehouses. ICDE, pages 564-575, 2005.


[56] A. Sirbu and J. Hoffmann. Towards scalable web service composition with partial matches. In ICWS '08: Proceedings of the 2008 IEEE International Conference on Web Services, pages 29-36, Washington, DC, USA, 2008. IEEE Computer Society.
[57] R. Sumra and A. D. Quality of service for web services - demystification, limitations, and best practices. Developer.com, 1999.
[58] T. Syeda-Mahmood, G. Shah, R. Akkiraju, A.-A. Ivan, and R. Goodwin. Searching service repositories by combining semantic and ontological matching. In 2005 IEEE International Conference on Web Services, pages 13-20, 2005.
[59] I. Tatarinov and A. Halevy. Efficient query reformulation in peer data management systems. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 539-550, New York, NY, USA, 2004. ACM.
[60] P. S. M. Tsai and A. L. P. Chen. Optimizing queries with foreign functions in a distributed environment. IEEE Transactions on Knowledge and Data Engineering, 14(4):809-824, 2002.
[61] K. Verma. Configuration and adaptation of semantic web processes. Ph.D. Thesis, Computer Science, Univ. of Georgia, June 2006.
[62] W3C. Web Services Architecture Requirement. Technical report, W3C, 2002.
[63] D. Wu, B. Parsia, E. Sirin, J. Hendler, and D. Nau. Automating DAML-S web services composition using SHOP2. In Proc. of 2nd International Semantic Web Conference, Oct. 2003.
[64] L. Xu and D. W. Embley. Discovering direct and indirect matches for schema elements. In DASFAA, pages 39-46, 2003.
[65] C. Yu and L. Popa. Semantic adaptation of schema mappings when schemas evolve. In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 1006-1017. VLDB Endowment, 2005.
[66] H. Zhu, J. Zhong, J. Li, and Y. Yu. An approach for semantic search by matching RDF graphs. In Proceedings of the Fifteenth International Florida Artificial Intelligence Research Society Conference, pages 450-454. AAAI Press, 2002.


Jungmin Shin received her B.S. and M.S. from the Department of Computer Science and Engineering at Ewha Womans University in South Korea in 1993 and 1995, respectively. In 1995, she joined the Media Communications Lab of LG Electronics in Seoul, South Korea, and worked on intelligent user interfaces as a researcher. In 1997, she joined SK Telecom in Seoul, South Korea. She worked on large-scale database management systems such as the membership management system and bulletin board system of a commercial on-line portal service. She was also involved in developing a video on demand (VOD) service for a commercial wireless internet service. Since 2001, she has been working on process integration and data transform composition at the Database Systems Research and Development Center at the University of Florida. She received her Ph.D. from the University of Florida in the fall of 2009.