Citation
Data Transform Composition for Efficient Information Integration

Material Information

Title:
Data Transform Composition for Efficient Information Integration
Creator:
Shin, Jungmin
Place of Publication:
[Gainesville, Fla.]
Publisher:
University of Florida
Publication Date:
Language:
english
Physical Description:
1 online resource (72 p.)

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering
Computer and Information Science and Engineering
Committee Chair:
Hammer, Joachim
Committee Co-Chair:
Lam, Herman
Committee Members:
Schneider, Markus
Helal, Abdelsalam A.
Avery, Paul R.
Graduation Date:
8/8/2009

Subjects

Subjects / Keywords:
Data types ( jstor )
Information search ( jstor )
Information use ( jstor )
International conferences ( jstor )
Money ( jstor )
RDF ( jstor )
Semantic models ( jstor )
Semantics ( jstor )
Web services ( jstor )
Written composition ( jstor )
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
composition, metamodel, similarity, transform
Genre:
Electronic Thesis or Dissertation
born-digital ( sobekcm )
Computer Engineering thesis, Ph.D.

Notes

Abstract:
Data transformation resolves heterogeneities between disparate schemas and is an indispensable process in many applications where data sharing and exchange happen. Creating a data transform that converts source data to target data is extremely time-consuming and labor-intensive. This dissertation presents the data transform composition problem using a large repository of reusable transforms. Recent work on data transforms has focused on structural data mapping or on applying a restricted set of data transforms for composition. To perform semi-automatic data transform composition with existing transforms, we first design an RDF-based transform meta model that includes metadata on a data transform. Next, we model the data transform composition problem as a graph search problem and use the A* algorithm with sophisticated distance measures based on our transform meta model. Our experiment shows that our meta model greatly speeds up the search for complete compositions and provides high precision. Our distance measures enable the search to progress correctly to a complete composition and to reduce the exponential search space by pruning. Using our system, users can reduce their effort in the time-consuming and error-prone steps of the data transformation process, thereby reducing the effort of information integration that many applications require. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2009.
Local:
Adviser: Hammer, Joachim.
Local:
Co-adviser: Lam, Herman.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2010-02-28
Statement of Responsibility:
by Jungmin Shin.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Shin, Jungmin. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
2/28/2010
Resource Identifier:
489206505 ( OCLC )
Classification:
LD1780 2009 ( lcc )

Full Text

Though only my name appears on the cover of this dissertation, the completion of my dissertation was possible with the help and efforts of many people. First, I express my appreciation to my advisor, Dr. Joachim Hammer. Throughout my graduate career at the University of Florida, his guidance, support, and patience helped me overcome many crisis situations and complete this dissertation. He often brought me to the threshold of knowledge and ignited the interest to cross the threshold. He also encouraged me to be an independent thinker with a high research standard. I deeply appreciate my co-advisor, Professor Herman Lam. He was kindly willing to spend a lot of time and effort in improving my work. Without him, I could not have accomplished my work. Thanks also go out to the members of the dissertation committee, Professors Abdelsalam Helal, Markus Schneider, and Paul Avery, for their valuable guidance. Professor Abdelsalam Helal tried to hear and understand my hardships, and his encouragement helped me to keep my self-esteem. I am grateful to many people on the faculty and staff of the Department of Computer and Information Science and Engineering for all that they taught me and for supporting me in various ways.

Finally, and most importantly, I sincerely thank my family, who have been a constant source of help, support, and strength during my doctoral studies. None of my achievements would have been possible without their love. My very special thanks go to my husband for his support, upon which the path to completing my Ph.D. was built. I warmly appreciate my parents for their unwavering faith in me as well as their unending encouragement and support. I thank my sisters for their love and support. I appreciate my parents-in-law for their consistent encouragement and support. I truly thank God for giving me my daughter, Jimin, who was the source of energy to get through my journey. I love you, Jimin.

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER
1 INTRODUCTION
  1.1 Motivating Example
  1.2 Challenges and Contribution
    1.2.1 Challenges
    1.2.2 Contributions
2 RELATED WORK
  2.1 Data Transform vs. Web Service and Workflow
  2.2 Data Mapping and Schema Matching
  2.3 Web Service Composition
  2.4 Calculating Similarity
  2.5 Morpheus Prototype
3 TRANSFORM META MODEL
  3.1 Definition
  3.2 Semantic Similarity of Two Transforms
4 TRANSFORM COMPOSITION
  4.1 Transform Composition Problem
  4.2 Algorithm
    4.2.1 Similarity Measures
    4.2.2 The function Merge
  4.3 Partial Composition
5 IMPLEMENTATION
  5.1 Architecture
  5.2 Semantic Annotation and Registration Tool
  5.3 Transform Composition Module
6 EVALUATION
  6.1 Experimental Environment
  6.2 ...
    6.2.1 Experiment with Example A: Find Complete Compositions
    6.2.2 Experimental Case 1: Model Verification
    6.2.3 Experimental Case 2: Efficiency of Our Algorithm
    6.2.4 Experiment Case 3: Partial Composition
7 CONCLUSION
  7.1 Summary and Contributions
  7.2 Future Work
REFERENCES
BIOGRAPHICAL SKETCH

Table
1-1 Sample transforms in the repository.
6-1 Sample transforms
6-2 Experimental result

Figure
1-1 The schemas of our example.
2-1 Conceptual architecture of the Morpheus system.
2-2 Data types and a transform in Morpheus.
3-1 A transform represented in graph.
3-2 CurrencyConversion in transform graph.
3-3 KRW2USD in transform graph.
4-1 A part of transform composition graph.
4-2 Single match between two transforms.
4-3 Multiple match between two transforms.
5-1 The architecture of our system.
5-2 Web service semantic annotation tool.
6-1 The schema of Example B.
6-2 The desired transform specification 1.
6-3 Efficiency by varying metadata utilization with 150 transforms.
6-4 Precision by varying metadata utilization with 150 transforms.
6-5 The number of expanded nodes so far when each complete composition appears (total 53 answers). We vary h(x) by changing the flag (e.g., I+O, I+O+OP, I+OP, I). We test with 1000 transforms in repository. The search is terminated when total 3788 nodes are expanded.
6-6 Efficiency by varying the size of repository.


[50, 11]. It resolves heterogeneities of data in disparate sources.

In data integration, when disparate data sources have to be integrated, heterogeneities among data in previously independent data sources must be resolved in order to provide a uniform interface to users. For example, when two sales data sets in different currencies are integrated, revenue in one currency (e.g., the Korean won) must be transformed to the other currency (e.g., the US dollar) in order to provide sales data in one currency. Through the data transformation process, data in Korean won are converted to data in US dollars.

Data cleaning deals with detecting and removing errors and inconsistencies in data to improve the quality of data. When multiple data sources are integrated, the need for data cleaning increases significantly since the sources often contain redundant data in different representations. In order to provide access to accurate and consistent data, consolidation of different data representations through data transformation and elimination of duplicate information becomes necessary [50].

Data transformation plays an important role in a data warehousing system [54]. In a data warehousing system, an extract-transform-load (ETL) process is required to integrate information. The ETL process consists of extracting data from multiple sources, transforming data, and loading data into the data warehouse. As a part of the ETL process, typical data warehousing architectures require external tools to create and manage transformations. As a result, most existing ETL tools perform the necessary transformations outside the repository where data is stored.

Generally, the data transformation process undergoes several essential steps [55]. In the first step, called schema matching, we find semantic correspondences between

[18]. Finally, the program logic is translated into an executable one and deployed to an execution environment. The program logic is a data transform that converts data in the source schema to data that fit the target schema.

Users should find the schema matching and semantic mapping listed above and research the actual program logic of a transform that resolves syntactic and semantic heterogeneities among data. Finding the program logic of a transform requires a lot of trial and error and is time-consuming [49, 48]. The schema matchings between elements in two schemas include 1-1 matches and complex matches. A complex match means that a combination of attributes in one schema corresponds to a combination in another [18]. Creating mappings for complex matches [7, 28] is more difficult than for 1-1 matches, where one attribute in a schema is matched to a single attribute in another schema.

In our work, we introduce a data transform composition problem that reuses existing transforms in a repository instead of creating a new transform from scratch, thus reducing development time and effort. First, we assume that there is a repository that has a large number of transforms. Those transforms may have been created by users previously or harvested from the Internet (e.g., Web services or Java functions). A transform in a repository was previously used for another purpose and generates outputs from inputs. There is a semantic mapping between the inputs and outputs of a transform, given by the internal program logic of the transform. Our data transform composition approach tries to use the semantic mapping of a transform to find schema matches and semantic mappings between two schemas. Using a previously available meaningful single or composite transform, we can find semantic correspondences (i.e., schema matches) automatically between two schemas by reusing the input/output mappings of a transform, and at the same time semantic mappings can be generated using the internal program logic of transforms.

Our goal is to find a complete composition that has the same semantic mapping (i.e., program logic) as a user's desired transform. However, we cannot guarantee that the transforms in a repository are sufficient to generate any new transform. We claim that we can guide a user by showing possible partial compositions that can be similar to the user's desired transform. As a result, finding schema matchings and semantic mappings with our data transform composition approach reduces time and labor in data transformation.

Recent work on data transforms has focused on finding a data transform as part of the data mapping or schema matching problem. However, those studies concentrated mostly on structural data mapping or on applying a restricted set of transforms in composition. There is much research on Web service composition, but we find that we cannot apply the solutions in Web service composition directly to our problem. As far as we have investigated, there has been no effort to find 1-1 or complex matches with data transform composition. In our work, we provide a solution for constructing a user's desired transform by composing multiple transforms in a repository, thereby reducing users' overall effort to perform data transformation, such as analyzing syntactic and semantic differences among relevant data and creating the program logic of a transform that resolves the differences.

Figure 1-1. The schemas of our example.

Starbucks headquarters in Seattle. The Korean Starbucks operation provides two relations, Sales-Coffee and Sales-Pastry in Fig. 1-1. Sales-Coffee has three attributes: date, revenue, and branchname. The date field is in the format "DD-MM-YYYY", revenue is a float value representing revenues from coffee sales in Korean won, and branchname is the name of a branch in Korea that is a source of the revenue (there can be multiple branches in a city). Sales-Pastry has three attributes that are the same as those of Sales-Coffee, except that revenue represents sales from muffins and other snacks.

The Starbucks headquarters in Seattle, however, uses only one relation of the form Sales-Food shown in Fig. 1-1, where date is in the format "MM/DD/YYYY", revenue represents the sum of coffee and pastry revenues in US dollars, and city represents the name of the city where a branch is located. The city value can be retrieved by looking up a relation storing addresses of branches with the branchname of Sales-Coffee. The IT department of Starbucks is tasked with the job of providing a transform that converts data in the two sales relations (i.e., Sales-Coffee and Sales-Pastry) from Korean franchises into data in Sales-Food that the US corporation can use.

We assume that the IT department has a repository of transforms that can be reused to compose a new data transform. Reusing existing transforms that have already been debugged simplifies the effort of creating a new transform. Table 1-1 shows a sample of available transforms in the repository. The inputs and outputs of the transforms in Table 1-1 are specified as composite data types. In our example, the IT department of Starbucks wants to create a new transform using one or more available

transforms. However, even with reusable transforms, the user must be able to find the appropriate transforms and connect them by correctly matching the output of the current transform to the input of the next transform.

Table 1-1. Sample transforms in the repository.

Name of Transform     Input Data Type                              Output Data Type
                      datatype (field: type, ...)                  datatype (field: type)
CurrencyConversion    Koreanwon (Korean: float, date1: date)       Dollar (USD: float)
KRW2USD               Korean (KRW: float)                          Dollar (USD: float)
Conversion2USD        Currency (amount: float, country: string)    Dollar (USD: float)
AddKRW                TwoKorean (KRW1: float, KRW2: float)         Korean (KRW: float)
PaymentConversion     Koreanwon (Korean: float, date1: date)       Dollar (USD: float)
DateFormat2MMDDYYYY   DATE (date1: date)                           DATE (date2: date)
DateFormat2YYYYMMDD   DATE (date1: date)                           DATE (date2: date)
getCity               BranchName (branch: string)                  CityName (city: string)

Browsing and finding relevant transforms in a repository is not a trivial task. A user must be able to connect contiguous transforms by correctly matching the output of the current transform to the input of the next transform. In Table 1-1, if the user wants to call the DateFormat2MMDDYYYY transform first and then call the CurrencyConversion transform, the user cannot be sure whether the output of DateFormat2MMDDYYYY matches the input of the CurrencyConversion transform exactly.

It is clear that an increase in available transforms entails more and more effort to understand existing transforms. As the number of transforms increases, finding appropriate reusable transforms becomes a time-consuming and laborious process. Also,

We propose a data transformation composition algorithm that investigates all transforms in a repository on behalf of users and that guides users to create a new transform using existing transforms. The algorithm provides users with a sequence of existing transforms that is identical to the users' expected transform. If no sequence of existing transforms can provide the users' expected transform, the algorithm produces sequences of transforms that are similar to it.

[5, 17, 21] centers on how to describe, manage, and store data transforms efficiently by providing a tool or language to specify the program logic of a transform and storing the entire executable procedural description. The research provides procedural information rather than structural information about data transform behavior. Procedural information is very helpful for understanding the behavior, but it is inappropriate for comparing two transforms, where it is necessary to calculate the similarities between two transforms in composition. Existing Web service models are not sufficient to characterize data transforms because they do not have specific data format information and the behavior of an operation is described in one semantic word, which is hard to create when multiple transforms are connected.

In our problem, we need a structural meta model to characterize different data transforms using limited information. Since transforms in a repository can be Web

Second, our problem is to find a data transform that can convert data in the source schema to data in the target schema as automatically as possible using reusable transforms in a repository. It can be solved by a single transform in a transform repository or by a composite transform that is composed of multiple single transforms. Since there can be a large number of transforms in a repository, the potential search space for the compositions is huge. Unless we have a schema matching between source and target schemas, the search space grows even more because we need to find a set of attributes in the source schema that matches a specific attribute in the target schema.

In addition, we assume that transforms in a repository can be Web services harvested from the Internet (e.g., looking up the exchange rate between two different currencies). It is difficult to execute transforms and use their output to find compositions because it can take time to execute remote Web services.

Existing research on data transformation finds mappings by intensively using outputs generated by executing transforms. Existing research on Web service composition cannot be directly applied to our problem since the Web service models (e.g., WSDL or WSDL-S) [14, 1, 23] are not sufficient to characterize different transforms. We need a novel, sophisticated solution for data transform composition that can efficiently find a goal transform without depending mainly on the data generated by executing transforms.

Third, we cannot guarantee that we will always find a complete composition since our repository may be incomplete. Therefore, it is necessary to provide partial compositions to a user that are useful to construct a goal transform.
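The search problem described above can be pictured with a small sketch. The Python code below is only an illustration under stated assumptions, not the dissertation's algorithm or its distance measures: each data item is reduced to a hypothetical (semantic meaning, representation) pair, the transform names follow Table 1-1, and the heuristic simply counts goal items that are still missing. No transform is executed; only the declared inputs and outputs are compared.

```python
import heapq
from itertools import count

# Each item is a (semantic meaning, representation) pair. These labels are
# assumptions made for this sketch; Table 1-1 lists only names and field types.
TRANSFORMS = [
    # (name, required input items, produced output item)
    ("CurrencyConversion", [("money", "KRW"), ("date", "DD-MM-YYYY")], ("money", "USD")),
    ("KRW2USD", [("money", "KRW")], ("money", "USD")),
    ("Conversion2USD", [("money", "any"), ("country", "name")], ("money", "USD")),
    ("DateFormat2MMDDYYYY", [("date", "DD-MM-YYYY")], ("date", "MM/DD/YYYY")),
    ("DateFormat2YYYYMMDD", [("date", "DD-MM-YYYY")], ("date", "YYYYMMDD")),
    ("getCity", [("branch", "name")], ("city", "name")),
]


def compose(transforms, source, goal):
    """Return a shortest list of transform names that makes every goal item
    available from the source items, or None if no such list exists."""
    goal = frozenset(goal)
    start = frozenset(source)
    tie = count()  # unique tie-breaker so the heap never compares states
    # Heap entries: (g + h, g, tie, available items, plan so far)
    frontier = [(len(goal - start), 0, next(tie), start, [])]
    seen = {start}
    while frontier:
        _, g, _, state, plan = heapq.heappop(frontier)
        if goal <= state:
            return plan
        for name, inputs, output in transforms:
            # Skip transforms whose inputs are missing or that add nothing new.
            if not set(inputs) <= state or output in state:
                continue
            nxt = state | {output}
            if nxt in seen:
                continue
            seen.add(nxt)
            h = len(goal - nxt)  # heuristic: goal items still missing
            heapq.heappush(frontier, (g + 1 + h, g + 1, next(tie), nxt, plan + [name]))
    return None


source = {("date", "DD-MM-YYYY"), ("money", "KRW"), ("branch", "name")}
goal = {("date", "MM/DD/YYYY"), ("money", "USD"), ("city", "name")}
plan = compose(TRANSFORMS, source, goal)
print(plan)
```

Because every transform in this toy repository adds exactly one item, the missing-item count never overestimates the remaining steps. The sketch also shows a limitation: CurrencyConversion and KRW2USD lead to the same state here, so input/output labels alone cannot tell them apart.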

[35]-based transform meta model to represent transforms. Metadata about transforms are represented semantically in RDF triples, and those RDF triples can construct structural RDF graphs. We can compare two transforms semantically using those RDF graphs. In addition, an RDF graph can easily be expanded by adding RDF triples when we add more metadata to accommodate new requirements for describing a data transform. Since RDF is widely accepted as a standard for semantic representation of information on the Web and is represented in XML, data transforms in our meta model are interchangeable and can be readily compared with other resources on the Web.

Our model includes not only the primitive data type and semantic meaning of a parameter, as provided in existing Web service standards, but also the data format of a parameter. In addition, metadata on operations describe the relationships between inputs and outputs of a transform. Our experiment shows that our meta model greatly speeds up the search for complete compositions and provides high precision.

Second, we model the data transform composition problem as a graph search problem and use the A* algorithm with distance measures based on our transform meta model. Our sophisticated distance measures enable our search to progress to a correct complete composition and to reduce the exponential search space by pruning. In composing transforms, our algorithm preserves the behavior of transforms as much as possible using our transform meta model, so we can find a goal transform correctly. When there is no complete composition, our algorithm provides partial compositions that are useful to construct a goal transform.

Third, we design and implement our prototype system for semi-automatic data transform composition using transforms in our model in a repository. Our system provides an automatic search of a large repository to find or compose a desired transform. Our semantic annotation tool is used to semi-automatically convert crawled Web services in WSDL to our model. Using our system, users can reduce their efforts in time-consuming

This thesis is organized as follows: Chapter 2 reviews related work; Chapter 3 introduces our transform meta model; and Chapter 4 defines the data transform composition problem and shows our algorithm. Chapter 5 describes the implementation of our system and Chapter 6 presents our experiments. Chapter 7 provides conclusions.

Research related to our work can be categorized into the following areas: Web service composition, data mapping and schema matching, similarity measures, and the Morpheus prototype. Before we investigate those areas, we first compare data transforms with Web services and workflows.

[62, 53] is a technology paradigm characterized by sharing and integrating software components over the Internet without considering platforms [3]. A Web service description in WSDL includes only signature information. Web service composition integrates individual Web services to create another new Web service. This is a new trend for creating a new software application using individual software components offered by different service providers.

Workflows [33] model and execute business processes. A workflow can be modeled in XML process definition language (XPDL) [15] and executed in a workflow management system. The XPDL specification covers a broad range of elements that are required in usual business processes. Web services can be used as one implementation type of an activity in a workflow.

A data transform (especially a semantic transform) maps attributes in one schema to attributes in another. This mapping converts data from the source data format to the target data format. Mostly, data in source and target schemas have different formats but can have the same or similar semantic meanings. For example, in Fig. 1-1, revenue in Sales-Coffee is represented in Korean won, but revenue in Sales-Food is represented in US dollars. Both revenue attributes share the same semantic meaning.

Considering just the signatures and semantic meanings of arguments is not enough to differentiate transforms. Data in different schemas can be expressed in different formats,

[18] system semi-automatically identifies both 1-1 and complex matches between database schemas. A complex match specifies that a combination of attributes in one schema corresponds to a combination in the other. The generation of complex matches is done by searching the space of possible matches. A set of search modules, called searchers, are employed and each considers a meaningful subset of the space. For example, a text searcher may consider only matches that are concatenations of text attributes, while a numeric searcher considers combining attributes with arithmetic expressions. Using a beam search, only a pre-specified number of highest-scoring match candidates are selected, and, among candidates, those that have a close semantic distance to the target attribute are selected. To elaborate the search process, domain knowledge and user interaction are used in the process of searching the mapping.

A data mapping problem [47, 64, 7] is about automatic discovery of effective mappings between structured data sources. Data mappings are fundamental in data cleaning, data integration, and semantic integration and include subproblems, such as schema matching and semantic schema mapping. Existing solutions typically have focused on discovering restricted mappings, such as only discovering one-to-one schema matching. There are also structure differences among relations and complex semantic mappings among attributes in different relations. In Tupelo [25], starting from user provided example instances of the source and target schemas, a mapping is semi-automatically discovered by searching within the transformation space based on a set

The Piazza [27] project proposed the peer data management system (PDMS), where mapping composition is studied [40] and proposed to serve as one of its main optimization techniques for answering queries efficiently [59]. Yu and Popa [65] applied mapping composition to maintain mappings under some schema evolution scenarios.

[39, 45, 51, 9, 8, 4, 22, 38]. For example, the work in [43] adapts the A* algorithm using the input/output arguments of Web services. In our work, we also consider the semantic meaning of the arguments and the behavior of transforms. The authors in [39] use resource description framework (RDF) triples for representing pre/post conditions of a Web service. A semantic network can be found among a set of Web services using pre/post conditions.

In the semantic Web service area, a new composite Web service is created using simple Web services with the help of semantic information. Explicit semantics will enable automatic Web service composition without human intervention [46]. Currently, many approaches to semantic Web service composition concentrate on just semantic matching of input/output arguments. However, considering the internal functionalities of services is important since Web services with the same input/output interfaces could have different functions [39].

[63, 45], users specify a desired composite application by a first-order formula that represents the logic that must be satisfied by the application. With the assumption that all necessary simple Web services are available, this approach finds a combination of services where conjunctions of logics are equivalent to a formula given by a user. In [4], individual atomic Web services are represented in finite state automata (FSA). Given a set of descriptions of component Web services as an automaton, this approach finds a subset of the component services and a mediator with the input of a desired global behavior specified in an automaton.

In addition, there is a template-based approach [61, 8], but this approach requires technical knowledge and experience for describing desired transforms. Furthermore, we have not seen an approach that uses semantic behavior information (the inside of a function) to compare two services when solving service composition problems.

[21] exploits the structure of the Web services. Woogle employs a novel clustering mechanism that groups parameter names into semantically meaningful concepts, and these concepts are leveraged to determine similarity of inputs (or outputs) of Web-service operations. The algorithm depends only on the information provided in the WSDL file without additional annotated information. This approach focuses more on searching for similar Web services than on calculating the similarity between two Web services.

Semantic Tools for Web Services, developed by IBM [32], has Web service matching, discovery, and composition features. The Web services are annotated in Web Services Semantics (WSDL-S) [1]. Using the Web Service Interface Matching feature, one can semi-automatically map the interfaces of two given Web services. Domain-independent and domain-specific ontologies are used to compute an overall semantic similarity score between ambiguous terms. This technology resolves semantic ambiguities in the

descriptions of Web service interfaces by combining information retrieval and semantic Web techniques. Matches from the two approaches are combined to determine an overall similarity score to help assess the quality of a Web service match to a given request [58, 2]. In cases where single services do not match a given request, the system can compose multiple services by employing artificial intelligence (AI) planning algorithms in order to fulfill a given request.

Figure 2-1. Conceptual architecture of the Morpheus system.

[19, 20] provides an environment for creating, storing, and searching, then executing transforms in order to facilitate the data transformation process. Fig. 2-1 [20] shows the architecture of the Morpheus system. The Morpheus system consists of two parts: the transform construction toolkit (TCT) and the associative repository. The Morpheus system uses the Postgres DBMS [26] as a repository of transforms and at the same time as a platform to run the transforms.

Unlike the typical ETL tools that we mentioned in Chapter 1, the data transformation construction tool in the Morpheus system facilitates creating a new transform, and the created transforms are stored and executed inside a DBMS. Therefore, the Morpheus

system takes advantage of amenities provided by a modern DBMS, such as efficient storage of data and support for transactions and recovery.

Figure 2-2. Data types and a transform in Morpheus.

The TCT facilitates the creation of a new transform. A user interacts with TCT, which has a browser and GUI for building transforms. Using TCT, the user can create a transform (which we call a Morpheus Transform) consisting of Morpheus primitives, namely Control, Wrapper, Computation, Lookup, and Java function. At the end of the creation, a new transform is first written in XML and translated into a program written in PLjava. The Java program is compiled and registered as a user defined function (UDF) in Postgres. A Wrapper primitive is created by wrapping Web services or Web forms available in the Internet. Web services in WSDL [14] are converted to Java functions, then registered as a UDF. A UDF can take simple types, composite types, or a combination of these as arguments. Postgres users can design a new composite type as a user defined data type (UDD) and UDDs are registered in the Postgres DBMS. Users can create two composite data types as UDDs, including fields in a source and target schema. In Postgres DBMS, data transformation is achieved by adding source data into the database and then running a query that invokes a transform stored as a UDF. The resulting target data are generated in the database after the execution of the query containing the transform.

Related to the example in Fig. 1-1 in Chapter 1, Fig. 2-2 shows the input and output UDDs (KoreanRevenue and SeattleRevenue) and one possible description of the StarbucksT transform with primitives.
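To make the shape of such a transform concrete, the following Python sketch mimics what the motivating example asks of StarbucksT: reformat the date, convert both revenues from Korean won to US dollars and add them, and look up the city of a branch. It is an illustration only and not the Morpheus implementation, which the text describes as a Java program registered as a Postgres UDF. The helper names, the fixed exchange rate, the branch name, and the lookup table are all hypothetical.

```python
from datetime import datetime

# Hypothetical values for this sketch only; a real transform would look up
# the exchange rate and the branch addresses from external sources.
KRW_PER_USD = 1250.0
BRANCH_CITY = {"Gangnam-1": "Seoul"}


def date_converter(date_text):
    """Reformat a date from DD-MM-YYYY to MM/DD/YYYY."""
    return datetime.strptime(date_text, "%d-%m-%Y").strftime("%m/%d/%Y")


def won_to_dollar(amount_krw):
    """Convert an amount in Korean won to US dollars at the assumed fixed rate."""
    return amount_krw / KRW_PER_USD


def get_city(branch_name):
    """Look up the city in which a branch is located."""
    return BRANCH_CITY[branch_name]


def starbucks_t(row):
    """Map one source record (Korean schema) to one target record (Seattle schema)."""
    return {
        "date": date_converter(row["date"]),
        "revenue": won_to_dollar(row["coffee_revenue"]) + won_to_dollar(row["pastry_revenue"]),
        "city": get_city(row["branchname"]),
    }


korean_row = {
    "date": "25-12-2008",
    "coffee_revenue": 2500.0,
    "pastry_revenue": 1250.0,
    "branchname": "Gangnam-1",
}
seattle_row = starbucks_t(korean_row)
print(seattle_row)
```

Each helper corresponds to one reusable transform, which is why a repository of such pieces can, in principle, be searched and chained instead of writing the whole mapping by hand.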

(i.e., date, coffee-revenue, pastry-revenue, branchname) in the source schema (i.e., the schema of Starbucks franchises in Korea in Fig. 1-1) and SeattleRevenue, which includes elements (i.e., date, revenue, city) in the target schema (i.e., the schema of the Starbucks headquarters in Seattle) are created. Then, a user creates a new transform StarbucksT in Fig. 2-2 which maps the input data type KoreaRevenue to the output data type SeattleRevenue.

Fig. 2-2 shows a simplified description of StarbucksT. The date element in SeattleRevenue is converted from the format "DD-MM-YYYY" to "MM/DD/YYYY" through the DateConverter function. The revenue element of SeattleRevenue is generated by adding two values which are converted from the coffee-revenue and pastry-revenue elements of KoreaRevenue using the Won2Dollar function. The city element is derived from the branchname element using the GetCity function. StarbucksT is registered as a user-defined function (UDF) in Postgres DBMS and is invoked in an SQL query for actual transformation. For example, Fig. 2-2 shows an SQL query that executes the StarbucksT transform over the two relations (i.e., Sales-Coffee, Sales-Pastry) in the source schema. The result of the query execution is data that fit in the target schema Sales-Food.

Morpheus transforms in the repository could be used for creating a new transform. A user can view a transform in TCT, then edit the transform to create another transform. Currently, all the steps are performed manually in Morpheus. It has no automatic support for composing a new transform by reusing transforms in the repository.

In Morpheus, a created transform is compiled into a Java class file, registered in the Postgres DBMS, and invoked on the source data set to be transformed using an SQL query for actual transformation. Currently, the registered transform in UDF is treated as a black-box [13, 12, 31, 30, 60, 42] during query optimization, so there are limitations in

Our transform repository can have transforms created from scratch by users or harvested from the Internet using a crawler. We need an abstract model that can reflect various kinds of transforms and has information useful to find a desired transform.

Existing standards, such as WSDL [14] and WSDL-S [1], are used to represent Web services. In WSDL-S, semantic words defined in the underlying ontology are used to refer to input/output, precondition/effect, and operation. The standard tries to keep itself simple and to transfer detailed descriptions of semantic meanings to the ontology. Our intuition is that one semantic word for a parameter (i.e., input or output parameter) is not enough to represent the semantic meaning and representation of the parameter. We can use the precondition/effect of WSDL-S, but one precondition per operation is not enough to represent information about multiple parameters. In addition, we claim that one semantic word per operation is not enough to reflect the relationships of inputs and outputs.

52]. The definition below formalizes our transform meta model.

41] dictionary, D is a data type among float, string, int, and date, and R is a word in the format dictionary we make. Each operation $op \in OP$ is a tuple with $OP_I$ and $OP_O$. $OP_I$ is a set of inputs and $OP_O$ is a set of outputs of op, where $OP_I \subseteq I$ and $OP_O \subseteq O$.

1-1 in Chap. 1 converts money in Korean won to money in US dollars with an exchange rate at a given date. The amount of money in Korean won and the date in "DD-MM-YYYY" are inputs, and the amount of money in US dollars is an output. The transform can be represented in our model as follows:

$I = \{m_1, d_1\}$, $O = \{m_2\}$, $OP = \{op_1\}$,
$op_1 = (OP_I, OP_O) = (\{m_1, d_1\}, \{m_2\})$,

where m1 is one of the input parameters of CurrencyConversion; its semantic meaning S is money, its data type D is float, and it is represented in Korean currency. The d1 is another input parameter of CurrencyConversion; its semantic meaning S is date, its data type D is DATE, and it is represented in DD-MM-YYYY. The m2 is an output parameter; its semantic meaning S is money, its data type D is float, and it is represented in US dollars. This transform has an operation op1 in which the input set $OP_I$ includes m1 and d1 and the output set $OP_O$ includes m2.
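The CurrencyConversion example can be written down as a small data structure. The following is only a sketch under stated assumptions: the class names `Parameter`, `Operation`, and `Transform` and the method `is_well_formed` are hypothetical and are not part of Morpheus or of the model's formal definition; the field values come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameter:
    name: str
    semantic: str        # S: semantic meaning (a dictionary word)
    datatype: str        # D: one of float, string, int, date
    representation: str  # R: data format


@dataclass(frozen=True)
class Operation:
    inputs: frozenset    # OP_I, expected to be a subset of I
    outputs: frozenset   # OP_O, expected to be a subset of O


@dataclass(frozen=True)
class Transform:
    name: str
    inputs: frozenset    # I
    outputs: frozenset   # O
    operations: tuple    # OP

    def is_well_formed(self) -> bool:
        # Every operation may only mention parameters of the transform itself.
        return all(
            op.inputs <= self.inputs and op.outputs <= self.outputs
            for op in self.operations
        )


m1 = Parameter("m1", "money", "float", "Korean won")
d1 = Parameter("d1", "date", "date", "DD-MM-YYYY")
m2 = Parameter("m2", "money", "float", "US dollars")
op1 = Operation(frozenset({m1, d1}), frozenset({m2}))
currency_conversion = Transform(
    "CurrencyConversion", frozenset({m1, d1}), frozenset({m2}), (op1,)
)
```

Frozen dataclasses are used here so that parameters can be stored in sets, which mirrors the set-based definition.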

A transform represented in graph.

In addition, a transform can be represented as a graph, as shown in Fig. 3-1. We define a transform graph in Definition 2. We use the transform graph for calculating a similarity between two transforms.

3-1. The node set V is a union of nodes representing the following: (1) elements in sets I, O, and OP, (2) S, D, R of each element in I, O, and (3) dummy nodes representing T, I, O, OP, $OP_I$, and $OP_O$. Each edge $e \in E$ between nodes in V is associated with a weight.

3-2 represents the CurrencyConversion transform. There are dummy nodes T, I, O, OP, $OP_I$, and $OP_O$, which represent the sets in Definition 1. The input parameter set I has two child nodes p1 and p2, and O has one child node p3. p1 is a node for the input parameter m1; therefore p1 has three child nodes that represent S, D, and R of m1. An operation set OP has one child node that represents an operation op1 of CurrencyConversion.

Our transform meta model can be represented using RDF triples. RDF [52] is widely accepted as a standard for semantic representation of information on the Web, and we can use RDF for representing a transform semantically. Metadata about transforms are

CurrencyConversion in transform graph.

represented in RDF triples semantically and those RDF triples can construct structural RDF graphs. We can compare two transforms semantically using RDF graphs. In addition, the RDF graph is easily expandable by adding RDF triples when we add more metadata to accommodate new requirements to describe a data transform. Since RDF is widely accepted as a standard for semantic representation of information on the Web and it is represented in XML, data transforms in our meta model are interchangeable and can be compared with other resources on the Web easily.

4.2.1 1-1. We first show the detailed description about calculating how much KRW2USD is similar to the CurrencyConversion transform. Fig. 3-3 shows a transform graph of the KRW2USD transform, where KRW2USD converts

KRW2USD in transform graph.

the amount of money in Korean won to US dollars with the exchange rate at the time of execution.

We apply the similarity function introduced in [66] to our transform model. In [66], information available on the Internet is collected and represented in RDF graphs. In order to do a semantic information search, [66] introduces a formula that compares two RDF [35] graphs. Unlike other approaches in information retrieval that are based on term frequency analysis, matching RDF graphs considers the structural information represented in a graph. By using transform graphs, we calculate how much transform B is similar to transform A (i.e., f(A, B)) in terms of how much information in A is covered by B. If B has the same or more information than A, B covers A completely and f(A, B) is 1. Therefore, we can say that A and B are exactly the same when both f(A, B) and f(B, A) are 1.

We calculate f(CurrencyConversion, KRW2USD), how much KRW2USD is similar to CurrencyConversion. First, we compare the child nodes of I in the transform graphs in Fig. 3-2 and Fig. 3-3. Each child node has three additional child nodes that hold the semantic meaning, data type, and representation (in other words, data format) of an input parameter (see Definition 1). Let f1, f2, and f3 be functions calculating similarities in terms of semantic meaning, data type, and representation, respectively (these three functions will be explained in detail in Sec. 4.2.1). The similarity of (p2, p4) is calculated

3-2, we have 0*1/2 + 1*1/2 = 1/2 as a result of applying weights from node I to its child nodes. Using the same method, node O has 1 and node OP has 3/4. Finally, at node T of Fig. 3-2, we get the result 1/2*1/3 + 1*1/3 + 3/4*1/3 = 3/4, which means the similarity of KRW2USD to CurrencyConversion is 3/4. In short, we compare leaf nodes under parameter nodes in both transform graphs, apply weight values on edges to get the value at their parent node, and calculate the final result at root node T.
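The bottom-up arithmetic above can be checked with exact fractions. This is a minimal sketch: the helper name `weighted_node_value` is hypothetical, and the child values and equal edge weights are simply the numbers quoted in the text, not values recomputed from the transform graphs.

```python
from fractions import Fraction


def weighted_node_value(child_values, weights):
    """Value of a parent node: sum of child values times their edge weights."""
    return sum(Fraction(v) * w for v, w in zip(child_values, weights))


half = Fraction(1, 2)
third = Fraction(1, 3)

# Node I: two input-parameter children with similarities 0 and 1.
node_I = weighted_node_value([0, 1], [half, half])

# Nodes O and OP: values stated in the text.
node_O = Fraction(1)
node_OP = Fraction(3, 4)

# Root node T: equal weights of 1/3 on the edges to I, O, and OP.
similarity = weighted_node_value([node_I, node_O, node_OP], [third] * 3)
print(node_I, similarity)
```

Using `Fraction` avoids floating-point rounding, so the result can be compared directly with the 3/4 given in the text.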

In this chapter, we introduce our data transform composition problem and solution approach. We use the A* algorithm with similarity measures we designed and a function Merge to combine transforms correctly while we find a desired transform using transforms in a repository.

Hence, we model our transform composition problem as a graph search problem and use the A* search algorithm with our heuristic function for finding a user's desired transform. The resulting composite transform may not be exactly the same as the desired transform because we cannot guarantee that all required transforms for creating a desired transform exist in the repository. Therefore, we suggest partial compositions that are useful to construct the desired transform.

For the rest of this chapter, we use ST and Td to denote the set of transforms in a repository and a desired transform (i.e., goal transform), respectively. Each $T_i \in ST$ and Td are represented in our transform meta model in Definition 1.

(1) j elements in S1 and one element in S2 have S, D, and R, which are introduced in Definition 1.
(2) j elements are described with a clause, which is defined in Definition 4.
(3) A set I of dk has j elements and a set O of dk has one element.
(4) OP has operations between elements in I and O.

Next, we give a definition of a clause that is used to specify I of dk.

1, we can describe j elements in S1 for a match to revenue in S2 as follows.
(1) {coffee_revenue, pastry_revenue, date}[]
(2) {coffee_revenue, pastry_revenue}[date]
(3) {}[coffee_revenue, pastry_revenue, date, branchname]
where coffee_revenue = S1.revenue and pastry_revenue is another S1.revenue in S1.

(1) means there is a schema match between three elements (i.e., S1.revenue, S1.revenue, S1.date) and S2.revenue. (2) means S1.revenue and S1.revenue are certainly required and S1.date may be necessary to match to S2.revenue. (3) means it is not certain which one is matched to S2.revenue; therefore all elements in S1 are considered. (1) is the case that finds a composite transform with a given schema match, and (3) is the case where we do not have a given schema match. With (3), our approach tries to find schema matching and semantic mapping between two data sources.

We define the {A}[B] clause because we try to reduce the burden on users of finding the exact schema matchings. With the clause, our approach finds all possible compositions that have all elements in A as input and some of the elements in B as input. Therefore, our

A part of transform composition graph.

approach reduces the burden of finding schema matching and semantic mapping in the data transformation process.

Next, we define the transform composition problem.

In order to solve our transform composition problem, we model our problem as a graph search problem. Definition 6 introduces the transform composition graph.

In G', the transform composition problem is to find paths from root to goal that generate a transform that is the same as Td within a given time t. The detailed algorithm is introduced in the next section.
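One way to read the {A}[B] clause of Definition 4 is as a compact description of a family of candidate input sets: every element of A, plus any subset of B. The sketch below enumerates that family. The function name `candidate_input_sets` is hypothetical, and treating "some of the elements in B" as including the empty subset is an assumption based on the remark that an element of B may or may not be needed.

```python
from itertools import combinations


def candidate_input_sets(required, optional):
    """Yield every input list allowed by a clause {required}[optional]."""
    required = list(required)
    optional = list(optional)
    # Assumption: any subset of the optional elements, including none, is allowed.
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            yield required + list(extra)


# Clause (2) from the example: {coffee_revenue, pastry_revenue}[date]
clause_2 = list(candidate_input_sets(["coffee_revenue", "pastry_revenue"], ["date"]))

# Clause (3): nothing is certain, so all four source elements are optional.
clause_3 = list(
    candidate_input_sets([], ["coffee_revenue", "pastry_revenue", "date", "branchname"])
)
print(len(clause_2), len(clause_3))
```

The count for a clause grows as 2 to the power of |B|, which illustrates why moving elements from B into A narrows the search.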

4-1 shows a part of a transform composition graph. Two transforms Tj and Ti are connected from root to x through y. There is a distance between a node and the connecting transform, such as dist(root, Tj) and dist(y, Ti). In addition, the distance from x to goal is dist(x, goal). The g of x means the distance from root to x, and it is the sum of the distances along the path from root to x, namely dist(root, Tj) and dist(y, Ti). The h of x means dist(x, goal). MT of x is generated by merging Tj and Ti. We use the A* algorithm to find a transform path from root to goal node.

16]. As shown in Fig. 4-1, each node x in G' has g and h values. The g is the distance from root to x and h is a heuristically estimated distance from x to goal. The A* finds the least-cost path from a given initial node to one goal node out of one or more possible goals. The A* uses f(x) = g(x) + h(x) for expanding nodes in G'. The intuition of using f(x) is that advancing from the node x that has the smallest f(x) value is the fastest way (the shortest in terms of a distance) to find a path to the goal node. The A* incrementally builds all paths leading from the root until it finds one that reaches the goal node, but only builds paths that appear to lead toward the goal node.

Algorithm 1 shows how the A* algorithm finds a composite transform. At the current expanding node x, all transforms in a repository are connected and a new state node for each connection is created with information in Definition 6. The new state node is put in a priority queue and then a state node that has the least f(x) is selected as the next expanding node. In the following sections, we explain our distance measures for calculating g and h of a state node and the Merge function for merging transforms while a path is lengthened.
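The priority-queue loop described above follows the textbook A* pattern, which can be sketched as follows. This is not a reproduction of Algorithm 1: the function `a_star`, the toy repository, the set-of-available-data state, the unit step cost, and the 0/1 heuristic are all illustrative assumptions, whereas the dissertation computes g and h with its own distance measures and merges transforms along the path.

```python
import heapq
from itertools import count


def a_star(root, is_goal, expand, heuristic):
    """Generic A*: expand(state) yields (next_state, step_cost, label)."""
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(heuristic(root), next(tie), 0.0, root, [])]
    best_g = {root: 0.0}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, g
        if g > best_g.get(state, float("inf")):
            continue  # stale queue entry; a cheaper route was found later
        for nxt, cost, label in expand(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                f = new_g + heuristic(nxt)
                heapq.heappush(frontier, (f, next(tie), new_g, nxt, path + [label]))
    return None, float("inf")


# Toy repository (hypothetical): transform id -> (required inputs, produced output).
repo = {
    "104": (frozenset({"date"}), "exchange_rate"),
    "105": (frozenset({"won", "exchange_rate"}), "usd"),
}


def expand(state):
    for tid, (needs, out) in repo.items():
        if needs <= state and out not in state:
            yield state | {out}, 1.0, tid  # assumed unit cost per connection


path, cost = a_star(
    frozenset({"won", "date"}),
    is_goal=lambda s: "usd" in s,
    expand=expand,
    heuristic=lambda s: 0.0 if "usd" in s else 1.0,
)
print(path, cost)
```

In this toy setting the search first connects the transform that yields an exchange rate and then the one that converts won to dollars, mirroring a serial composition.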

4-1, the node x is created and connected to the node y with the edge associated with Ti. The f(x) is calculated using the following formula.

where w1, w2, and w3 are weight values.

The g(x) is the addition of the g value of the previous node y and the distance between the node y and the subsequent transform Ti, and the latter is calculated by the function dist1. When Ti is connected to y, for a parallel composition, the unmatched inputs of dk so far in the path from the root to y are also considered as a part of the output of a transform at y in order to match the input of Ti (Sec. 4.2.2 includes a detailed explanation).

Therefore, $newO = y.MT.O \cup (T_d.I - (T_d.I \cap y.MT.I))$, which means a union of y.MT.O (the output of the transform MT at node y) and the unmatched inputs of dk at node y. We make all combinations between elements in newO and Ti.I, and find the best combination generating the maximum sum of similarity values (for the connection, all parameters in Ti.I should be matched with any parameter in newO). For calculating the similarity between an element in newO and an element in Ti.I, we use the following formula:

where $u_p \in newO$ and $v_q \in T_i.I$ and sim_e is the sum of three similarities. As in Definition 1, each parameter in newO and Ti.I has S, D, and R. The simS compares S of two parameters. We use the JWordNetSim library for the comparison (if a similarity value

Let $z = |T_i.I|$, $u_p \in newO$, and $v_q \in T_i.I$, then dist1 is as follows:

The g(x) is the sum of g(y) and dist1(y, Ti). The h(x) is the heuristically estimated distance from the node x to the goal node. At first, we designed h(x) as the distance between the output of the node x and the output of a desired transform using dist1, but our search could not find the desired transform well, since output matching alone is not enough to find a desired transform.

Hence, we design sim2 to calculate the similarity between the transform at the node x (i.e., x.MT) and a desired transform dk. Both transforms are represented in our model and thus can be represented in RDF graphs. The sim2(dk, x.MT, I+O+OP) means how much of the inputs, outputs, and operations of dk are covered by the inputs, outputs, and operations of x.MT. Once x.MT has the same inputs, outputs, and operations as dk, sim2(dk, x.MT, I+O+OP) becomes 1. However, x.MT can have more inputs, outputs, and operations that are not in dk. Thus, when we use 1 - sim2(dk, x.MT, I+O+OP) as our h(x), our answer can have many useless inputs, outputs, or operations besides the ones related to a desired transform. If both sim2(dk, x.MT, I+O+OP) and sim2(x.MT, dk, I+O+OP) are 1, dk and x.MT are exactly the same.

The input of sim2 is two transforms to compare and flags (i.e., I, O, and OP), which mean input, output, and operation, respectively. Unlike sim1, not all the inputs of Ti need to be matched. The sim_e is used to compare inputs and outputs, and sim_op is defined to compare operations of two transforms.
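The step that feeds dist1, building newO from the merged transform at y and then choosing the best pairing between newO and Ti.I, can be sketched with plain sets. Everything below is an assumption made for illustration: parameters are reduced to strings, the pairing is taken to be one-to-one, and the toy `exact_match` function stands in for sim_e, whose actual formula is not reproduced here.

```python
from itertools import permutations


def new_outputs(mt_outputs, mt_inputs, desired_inputs):
    """newO: outputs of the merged transform plus still-unmatched desired inputs."""
    unmatched = desired_inputs - (desired_inputs & mt_inputs)
    return mt_outputs | unmatched


def best_assignment(candidates, needed, sim):
    """Brute-force the pairing of `needed` inputs to candidates with maximal total sim."""
    candidates, needed = sorted(candidates), sorted(needed)
    if len(candidates) < len(needed):
        return None, 0.0  # not every input of the next transform can be matched
    best, best_score = None, -1.0
    for perm in permutations(candidates, len(needed)):
        score = sum(sim(u, v) for u, v in zip(perm, needed))
        if score > best_score:
            best, best_score = list(zip(perm, needed)), score
    return best, best_score


def exact_match(u, v):
    # Hypothetical stand-in for sim_e: 1.0 for identical labels, else 0.0.
    return 1.0 if u == v else 0.0


new_o = new_outputs(
    mt_outputs={"exchange_rate"},
    mt_inputs={"date"},
    desired_inputs={"date", "won"},
)
pairs, score = best_assignment(new_o, {"won", "exchange_rate"}, exact_match)
print(sorted(new_o), pairs, score)
```

Here the unmatched desired input `won` is carried forward next to the produced `exchange_rate`, which is what allows a parallel connection to a transform that needs both.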

I|, $m = |e_q.OP_O|$, $g_q \in e_q.OP_I$, $h_p \in f_p.OP_I$, $k_q \in e_q.OP_O$, $l_p \in f_p.OP_O$, I$(e_q, f_p) + sim_{OP_O}(e_q, f_p) = \max \sum_{q=1}^{l} sim_e(g_q, h_p)$

where w4, w5, and w6 are weights of I, O, and OP, respectively.

We can give different weight values to each matching, but here we give the same weights. Using sim2, we can calculate how much a transform constructed from a root to a current node x is similar to dk in terms of inputs, outputs, or operations. The dist2 is calculated by subtracting sim2 from 1. The formula for calculating h(x) appears in (4-1).

In (4-1), the first part of h(x) measures how much of the inputs and operations of dk x.MT has, and the second part measures how many other outputs x.MT has besides the output of dk. Considering the second part steers our search away from generating unnecessary outputs.

Using f(x) in (4-1), our approach advances the search toward nodes that have less distance between connecting transforms and that have inputs and operations more similar to dk but fewer outputs besides the one in dk.

Single match between two transforms.

There can be a single match and multiple matches between the output of a previous transform (newO) and the input of a subsequent transform (Tb.I), as in Fig. 4-2 and

Multiple match between two transforms.

Fig. 4-3. The figures include all possible merging scenarios. The function Merge keeps inputs, outputs, and operations that characterize the resulting transform as much as possible and removes useless ones.

At root, dk.I and Tb.I are compared and matched (with a distance calculated using dist1), as in the case of (1) in Fig. 4-2. TM.I becomes the parameter in dk.I that is matched with a parameter in Tb.I, and TM.O becomes Tb.O. Finally, a new transform TM is created like the one on the right side of the arrow in (1). The unmatched parameters in dk.I remain in a gray box.

4-2) or the unmatched parameter in dk.I (as in (3)). By keeping the unmatched input of dk, we allow not only serial but also parallel connections of transforms. As shown in (2) and (3), the unmatched Ta.O together with Tb.O are included in TM.O, and Ta.I and the matched dk.I become TM.I. When a parameter in Tb.I is matched with an element in dk.I, an additional operation is added to TM.OP, as in (3) in Fig. 4-2. The Merge function preserves as much as possible the semantics of the internal operations of transforms.

In a multiple match, as shown in Fig. 4-3, elements in Tb.I can be matched with just the parameters in Ta.O (as in case (4)), with the unmatched dk.I (as in case (5)), or with parameters in both Ta.O and dk.I (as in case (6)). As shown in (6), two separate operations in Ta can be merged into one operation by a multiple match.

36, 56, 18], partial compositions can be useful because the user can get an idea about useful single transforms without manually searching the transform repository and can elaborate on partial compositions to make a complete composition.

As in Fig. 4-1, there are root, goal, and state nodes in our transform composition graph. A state node that is created during the search for a complete composition has information, as shown in Sec. 4.1, such as a path, which is the ordered list of transform ids connected from the root node to this state node; g, the distance from the root to this state node; and h, the distance from this state node to the goal node. We record the path, g, and h of each state node in a file. Therefore we have information about state nodes created during the search for a complete composition.

Those state nodes include pruned state nodes. When state node s is expanded, transforms in a repository are considered to be connected. If the distance (e.g., calculated

4.2.1) between the output parameters of the transform in s and the input parameters of the transform t being considered for connection is zero, transform t is connected to s. Otherwise, the new state node created by connecting t to s is pruned, since the goal of our algorithm is to find a complete composition that has no distance between connecting transforms.

Part of a data transform composition graph.

The paths of all state nodes become candidates for partial compositions. Among candidates, the partial compositions that we select are the union of the following:
(1) A set of paths that are bottom-k based on f(x) (i.e., g + h) in (4-1). It means the path is close to the goal node and has less distance between transforms. The measure we use is the same as what we use in a search for a complete composition in a data transformation graph.
(2) A set of paths that are bottom-k based on h(x) (i.e., h) in (4-1). It means the path is close to the goal node. The intuition is that the last transform (e.g., T1 or T3 in Fig. 4-4) that is close to the goal node can be useful to construct a complete composition even though the path may have a large gap between transforms.
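The selection rule in (1) and (2) amounts to taking the k smallest recorded state nodes under two different keys and merging the results. A minimal sketch follows; the function name `suggest_partial_compositions`, the dictionary layout of a recorded state node, and the g and h numbers are hypothetical and serve only to show the mechanics.

```python
import heapq


def suggest_partial_compositions(states, k):
    """Union of the bottom-k paths by f = g + h and the bottom-k paths by h."""
    by_f = heapq.nsmallest(k, states, key=lambda s: s["g"] + s["h"])
    by_h = heapq.nsmallest(k, states, key=lambda s: s["h"])
    seen, result = set(), []
    for s in by_f + by_h:
        if s["path"] not in seen:  # keep each path once, preserving rank order
            seen.add(s["path"])
            result.append(s["path"])
    return result


# Recorded state nodes: (path, g, h) as logged during the search (made-up values).
states = [
    {"path": ("23", "6"), "g": 0.2, "h": 0.5},
    {"path": ("157", "158"), "g": 0.9, "h": 0.1},
    {"path": ("165", "123"), "g": 0.6, "h": 0.6},
]
suggestions = suggest_partial_compositions(states, k=1)
print(suggestions)
```

With k = 1 the path with the smallest g + h and the path with the smallest h differ, so both are suggested, which is the situation rule (2) is meant to capture.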

Backward search for suggesting partial compositions.

As shown in Fig. 4-5, we can try to fill the big gap between T2 and T1 using the backward search from the inputs of T1. The reason for the gap is that the input of T1 is not matched to the output of the transform at the previous state node. We then try to find any transform that can generate any input of T1. Using a backward search, we may find a path from the root to one of the inputs of T1.

In this chapter, we introduce the overall architecture of the data transform composition system we have implemented, and then explain important components in detail.

The architecture of our system.

Fig. 5-1 shows the overall architecture of our solution approach to the data transform composition problem. Broadly, our architecture has two parts: one is where complete compositions are generated using transforms in the repository, and the other is where the complete compositions are executed over source data. We use Morpheus (see Section 2.5) as an execution platform for the complete compositions.

In order to construct a transform repository, we need to harvest transforms and describe them in our transform meta model. As shown in Fig. 5-1, using our annotation tool, software components crawled from the Internet are represented in our transform meta model, Morpheus-M, and stored in the repository. At the same time, they are compiled into Java programs and registered in Morpheus for future execution.

4, finds complete compositions that are the same as the desired transform or similar to the desired transform using transforms in the repository. The transform composition module uses the JWordNetSim library to compare transforms in Morpheus-M. When we store and compose transforms, we use the WordNet dictionary and internal domain knowledge.

In fact, it can be challenging to discern the internal program logic of a transform that is crawled from Web services. Therefore, in order to decide the correct composition among the complete compositions our algorithm has found, we need to apply the complete compositions to actual source data to validate whether the solution generates the target data correctly.

After one complete composition is found, the complete composition is compiled into Java programs and registered in Morpheus. We execute an SQL query including the registered composite transform as a UDF over the source data in Morpheus. If the complete composition generates data that are the same as the sample target data, the complete composition is the transform that a user wants to create. In case our system cannot find a validated complete composition, we suggest the partial compositions to a user.

19]. Transforms in the repository must be represented in our transform model Morpheus-M.

We developed a semantic annotation tool in order to represent crawled transforms in our model. The Crawler developed in the Morpheus project harvests URLs that have web

From the above three software components, we use Web services in order to construct an experimental repository. We developed a tool that parses pages in WSDL. We parsed the WSDL files and translated them into our transform meta model. Currently, we do the annotation manually, but metadata we annotate once remain accessible in the future and consistently enrich automatic data transform composition [44].

Besides web services, web forms could be another candidate for crawling. Unlike web services, which are provided with WSDL files and text-based explanatory web pages, a web form has better extractable information on related HTML pages. For example, a web form has a page that includes a form and corresponding JavaScript in order to check whether the form input given by a user is correct. For example, for a text edit control in a form, we can extract a name and label of the control. The label of the control is provided to the users of the web form to explain the text control; therefore it has a very precise meaning. In addition, a form is usually accompanied by JavaScript functions that let a user give the input in the correct format. Therefore, we can extract more correct semantic meaning and format of the inputs through the Web forms.

Web service semantic annotation tool.

Fig. 5-2 shows a GUI of our annotation tool. We parse a WSDL file and extract operations. For each operation, we extract information regarding input and output messages that are about input/output parameters of the operation. For each parameter, we extract the name and data type of the parameter. Using the name, we list words in the WordNet dictionary in order to let a user select a word that has the same semantic meaning as the name. In addition, our tool lists possible formats for the word and a user can select the representation of the parameter among them. In short, using our semantic annotation tool, a user can annotate each parameter with the data type, semantic meaning, and representation that are required in order to be represented in our transform meta model.

In order to reduce the search space, we apply the following techniques:

Rather than trying to find a mapping between all attributes in the source schema and one candidate attribute in the target schema, a user can specify attributes of the source schema in two sets, A and B. A is a set of attributes in the source schema that are certain to be usable for generating a specific attribute in the target schema, and B is a set of attributes that might or might not be helpful for generating schema mappings to generate the specific attribute. We define a clause in Definition 4 in Sec. 4.1 for this purpose.

In the Web service composition area, usually six single Web services are enough to create a new complex Web service [37]. We limit the search depth to six based on this justification. In addition, we use a threshold to select connecting nodes from the current node using the distance between transforms. This can be justified because our approach tries to find a complete composition without distances between transforms.

24] and gather 200 Web services on the Internet by using the Morpheus crawler [19]. In addition, we surf HTML pages that have Web forms (e.g.,

All transforms in the repository are represented in our transform meta model Morpheus-M. Using our semantic annotation tool introduced in Sec. 5.2, we convert Web services in WSDL to our transform meta model Morpheus-M. Table 6-1 shows part of our transform repository.

The machine we use has an Intel(R) Pentium Dual CPU at 2 GHz with 3 GB RAM and runs Windows Vista. We use two examples in our experiments. Example A is the Starbucks Revenue Conversion example introduced in Chap. 1, and Example B is the Employee Payment Conversion example.

Example B has two relational schemas, S and T, shown in Fig. 6-1. In a global company, suppose employees working in the branch located in the US move to a branch in Korea and will be paid in Korean won instead of US dollars. The schema S is the relation for the branch in the US and T is for the one in Korea. We need to convert data in S to data that fit in T. For the conversion, we need schema matchings and semantic mappings between the two schemas.

Table 6-1. Sample transforms

ID  | Input                                                              | Output                                | Description
5   | m1: money in Korean; m2: money in Korean                           | m3: money in Korean                   | Add two moneys
13  | m1: money in Korean                                                | m3: money in USD                      | Convert money in Korean to money in USD with an exchange rate at the execution time
28  | m1: money in USD; m2: money in USD                                 | m3: money in USD                      | Add two moneys in USD
34  | m1: money in Korean; m2: money in Korean                           | m3: money in USD                      | Add two moneys in Korean currency and then convert to USD with an exchange rate at the execution time
104 | d1: date in dd-mm-yyyy                                             | e1: exchange rate from Korean to USD  | Get exchange rate from Korean to USD at the given date
105 | m1: money in Korean; e1: exchange rate                             | m2: money in USD                      | Convert money in Korean currency to money in USD with a given exchange rate
122 | d1: date in dd-mm-yyyy; m1: money in Korean; m2: money in Korean   | m3: money in USD                      | Add two moneys in Korean currency and then convert to USD with an exchange rate of a given date
127 | m1: money in Korean; m2: money in Korean                           | m3: money in USD                      | Subtract m2 from m1 and assign to m3

The schema of Example B.

Table 6-2. Experimental result

and show the contribution of our model. Third, we show the scalability of our approach, and finally, we show the case when our approach finds partial compositions in case there is no complete composition. We use Example A for the first three experiments, and use Example B for the last experiment.

6-2 shows the specification when we have given schema matches. The d2 means there is a match between three attributes (i.e., S1.date, S1.revenue, and S1.revenue) of S1 and one attribute (i.e., S2.revenue) of S2. Our composition algorithm finds all possible compositions that can generate S2.revenue with S1.date, S1.revenue, and S1.revenue with transforms in the repository.

Table 6-2 shows our experimental results. Our algorithm finds a set of single or composite transforms for d2. The unique numbers of the transforms in a composition are connected with a dash as a delimiter. For example, 5-104-105 is a connection of three transforms identified as 5, 104, and 105 (cf. Table 6-1). We cannot be sure that the result of our algorithm is the correct transform that converts source data to target data because we do not know the

The desired transform specification 1.

exact internal program logic of each transform in the repository and they are abstractly represented in our model. We simply find possible answers using our model and distance measures. Therefore, as a next phase, we apply those transforms we found to the source data and check whether the transforms can generate target data correctly. As shown in Fig. 5-1, we execute the resulting single or composite transforms on Morpheus over existing source data with a query that has a transform as a UDF.

We execute the resulting composite transforms over source data. The compositions 5-104-105 (or 104-5-105), 122, and 5-1 generate correct target data.

In search of a goal transform in a transform composition graph, we compare output parameters of the previous transform to input parameters of the next transform to see whether two transforms can be connected or not. We use our metadata, namely primitive

1 use data type only
2 use data type and semantics
3 use data type, semantics, and data format

Formally, the sim_e in formula (4-2) (i.e., the formula calculating the distance between two transforms) is changed as follows as we change the level of the metadata utilization. Below, formula (6-1) means level 1, (6-2) means level 2, and (6-3) (the same as (4-2)) means level 3 (see Sec. 4.2.1 for a detailed explanation of these formulae).

where up is an output parameter of the previous transform and vq is an input parameter of the next transform.

Additionally, we define the model-based answer and validated answer. Our algorithm uses the metadata of a transform to find a complete composition rather than using the intermediate data generated by executing a transform. Therefore, the complete composition found by our algorithm, namely a model-based answer, may not be the right data transform, the one that can convert the source data to target data correctly. Thus, as soon as we find a model-based answer, we apply it to the source data and check whether the model-based answer generates target data correctly. If a model-based answer generates target data correctly, we call it a validated answer. We can apply a model-based

Next, we define a participating transform. In a transform composition graph, all transforms in the repository are considered to be connected to the current expanding node. The current expanding node has a transform that is constructed by merging transforms in the path from the root node to the current expanding node. Therefore, the outputs of a transform in the current expanding node are compared to the inputs of the transform that is considered to be connected. If there is no distance between output and input parameters, two transforms can be connected, and the transform connected to the current expanding node becomes a participating transform. The execution time of our transform composition algorithm can be exponential in the number of participating transforms.

6-3 shows that as the metadata utilization is increased, the execution time is greatly decreased. The execution time is the time to find all model-based answers using our approach (i.e., the worst case is that a validated answer appears last among all model-based answers).

Our experiment shows that our metadata (specifically, the data format of a parameter) are useful to quickly find the model-based answers, which are a set of candidate answers, including validated answers. The more we use metadata in a search, the better we can pick out useful transforms for composing a goal transform. Consequently, the number of participating transforms in a search is decreased, and it decreases the execution time exponentially. In short, using our metadata reduces the exponential search space. As in [44], once manually created, metadata can consistently improve the searching capability in our data transform composition.

Efficiency by varying metadata utilization with 150 transforms.

6-4 shows how the precision of our algorithm changes as the metadata utilization level is increased. The precision and recall in our experiment are defined as follows:

Precision = retrieved validated answers / total retrieved model-based answers
Recall = retrieved validated answers / total validated answers

Since our approach finds all validated answers even as we vary the metadata utilization level, we can show how the precision changes. As in Fig. 6-4, metadata utilization level 3 has the highest precision. The more we use our metadata, the better our algorithm can pick out useful transforms. Consequently, the number of model-based answers is reduced as we increase metadata utilization. Since the number of retrieved validated answers is the same even as we increase the metadata utilization, the precision increases as we increase the metadata utilization. Our goal is to find the first validated answer among model-based answers. Our experiment shows how precisely we can find the first one we want, and our transform meta model assures the high precision.

Other interesting possible experiments could be performed by applying distance measures used in other research (e.g., Web service composition) to our framework and showing the accuracy of answers. However, we can demonstrate with our experimental

Precision by varying metadata utilization with 150 transforms.

results that those will not narrow down well to the correct answers since they use less metadata than our approach. This will generate the same phenomenon as the second experiment in this section.

For the first part of h(x), we can change the flag (i.e., I+OP) of dist2. The flag I means inputs of a transform, O means outputs of a transform, and OP means operations of a transform. I+OP means that we consider inputs and operations when we calculate the distance and do not consider the outputs. We measure how many nodes in a transform composition graph are expanded to find each complete composition by varying the flag (i.e., I+OP, I+O+OP, I+O). The three experiments with different flags generate the same 53 complete compositions. Our experiment shows the quality of distance measures. If we can estimate the heuristic distance better, the search will reach the answers faster with fewer expanded nodes. Fig. 6-5 shows that using the I+OP or I+O+OP flag is better than using the I+O flag. In other words, considering the operation defined in our transform meta model lets the search proceed to the answers faster. This justifies the necessity of operation in our

The number of expanded nodes so far when each complete composition appears (total 53 answers). We vary h(x) by changing the flag (e.g., I+O, I+O+OP, I+OP, I). We test with 1000 transforms in the repository. The search is terminated when a total of 3788 nodes are expanded.

transform meta model. The number of expanded nodes is also related to the execution time of the search.

Efficiency by varying the size of repository.

repository. Next, we show how the worst/best case execution time changes as we vary the number of participating transforms.

6-6 shows that our algorithm is scalable to the size of the repository if the number of participating transforms is equal. This is because our algorithm prunes the search space with metadata of our transform meta model. The number of participating transforms is related to the exponential search space, but our algorithm (specifically, matching parameters) is linear in the size of the repository. The algorithm tries to find all possible model-based answers in a transform composition graph that can lead to a goal node by matching parameters of consecutive transforms. Then we apply those model-based answers to the source data in order to find a validated answer.

Recent techniques in semantic mapping use intermediate data generated by executing transforms (or operator, searcher in their context) intensively [6, 18, 25] to find complex matchings between two schemas. Those techniques can reach the correct answer directly (i.e., they have one step to the final answer, unlike our two steps), but they are not able to easily prune a search space using intermediate data generated by executing a connecting transform because it is difficult to figure out from intermediate data whether it is a useful

Worst case/best case execution time as a result of varying the number of participating transforms

data flow or not in the search for a path to a goal. We can claim that pruning with our matching algorithm keeps the possible data flow that can generate a goal transform and reduces the search space effectively. As a result, we separate the transform composition phase from the transform execution phase, which makes our approach more realistic because we assume that our transforms can be remote Web services.

6-7 shows the worst/best case execution times for finding a validated answer. Basically, we use the A* algorithm, which finds the best one first (i.e., the best path in terms of our distance measures). Therefore, the first answer appears quickly, but finding all answers takes time similar to a brute-force search. If there is no answer, we may search the entire space like a brute-force search. In our algorithm, we terminate a search when it reaches the execution time constraint, and then move on to find partial compositions. However, if there is an answer, we can stop the search as soon as the first validated answer appears. In the case

6-7 even as we increase the number of participating transforms.

6-1 that we do not have exact schema matchings between S and T. Our goal transform is d3, the inputs of which are the US salary and zip code attributes in S and the output of which is the wage attribute in T. Let us assume that the input attributes in S are described by a user as follows:

As in Definition 4 in Sec. 4.1, a user specifies that US salary in S is required to find a semantic mapping for wage in T and zip code might or might not be used. With the inputs, our algorithm tries to find complete compositions that can generate the output wage attribute using US salary and zip code. With 160 transforms in the repository, our search is terminated without a complete composition.

Among the paths of state nodes generated during the search for the complete composition, we select the following sets: the set A means the bottom-10 paths based on f(x) in (4-1) and the set B means the bottom-10 paths based on h(x) in (4-1). The numbers below are the identifiers of transforms.

A = {23-6, 157-158, 23-6-31, 165-6, 157-158-103, 23-6-103, 161-130-6, 161-130-6-9, 161-130-53-6, 161-130-6-148-27}

A ∪ B = {23-6, 157-158, 157-116, 157-158-103, 23-6-31, 157-158-103, 165-6, 165-123, 165-125, 23-6-103, 161-130-6, 161-130-6-9, 161-130-53-6, 161-130-6-3, 161-130-6-148-27}


In this chapter, we summarize our research and describe future work.

Unlike previous work, this dissertation focuses on semi-automatic data transform composition that reuses transforms in a repository to construct the transforms users want. Finding semantic mappings by composing transforms involves a massive search space, so a sophisticated solution is needed. In addition, we cannot guarantee that the transforms in the repository are sufficient for composing any new transform. We face the following challenges: how to formally represent transforms, how to efficiently find or compose transforms, and how to provide partial solutions when there is no complete solution.

We create the transform metamodel, in which a transform can be represented as a graph. We can compare two transforms semantically using graphs. The metadata in our model can also be represented semantically as Resource Description Framework (RDF) [35] triples, and those RDF triples can form structural RDF graphs. By using the RDF framework, our transform metamodel gains the following merits: new semantic metadata can easily be added as RDF triples when new requirements for describing a data transform arise, and data transforms in our metamodel are interchangeable and can be compared with


Our model includes not only the primitive data type and semantic meaning of a parameter, as in existing Web service standards, but also the data format of a parameter. In addition, metadata on operations describe the relationships between the inputs and outputs of a transform. Our experiment shows that our metamodel greatly speeds up searching for complete compositions and provides high precision.

Based on our transform metamodel, we model the data transform composition problem as a graph search problem and design distance measures (i.e., similarity measures) to guide a search using the A* algorithm. Our distance measures enable our search to progress to a complete composition correctly and to reduce the exponential search space by pruning. In composing transforms, our algorithm preserves the behavior of transforms as much as possible using our transform metamodel, so we can find a goal transform correctly. When there is no complete composition, our algorithm provides partial compositions, drawn from the transforms in the repository, that are useful for constructing a goal transform.

We design and implement a prototype system for semi-automatic data transform composition using transforms stored in a repository in our model. Our system provides an automatic search of a large repository to find or compose a desired transform. Our semantic annotation tool is used to convert crawled Web services in WSDL to our model semi-automatically. Using our system, users can reduce their efforts in the time-consuming and error-prone semantic mapping steps of the data transformation process, thereby reducing efforts in the information integration that many applications require.
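To make the summary above concrete, the following sketch shows one way parameter metadata (primitive data type, semantic meaning, data format) could be written as subject-predicate-object triples and used to decide whether one transform's output can feed another's input. It is an illustration only: the predicate names (`hasInput`, `dataType`, and so on), the `Param` record, the example values, and the field-mismatch count are assumptions made for this sketch and are not the dissertation's metamodel vocabulary or its actual distance measures.

```python
from typing import Iterable, NamedTuple, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class Param(NamedTuple):
    name: str
    data_type: str    # primitive data type, e.g. "double"
    semantics: str    # semantic meaning, e.g. "salary"
    data_format: str  # data format, e.g. a currency or unit label


def to_triples(transform_id: str,
               inputs: Iterable[Param],
               outputs: Iterable[Param]) -> Set[Triple]:
    """Describe a transform's parameters as RDF-style triples.

    The predicate names used here are hypothetical placeholders.
    """
    triples: Set[Triple] = set()
    for role, params in (("hasInput", inputs), ("hasOutput", outputs)):
        for p in params:
            node = f"{transform_id}/{p.name}"
            triples.add((transform_id, role, node))
            triples.add((node, "dataType", p.data_type))
            triples.add((node, "semantics", p.semantics))
            triples.add((node, "dataFormat", p.data_format))
    return triples


def mismatch(out_param: Param, in_param: Param) -> int:
    """Count how many of type, meaning, and format differ (0 to 3).

    A stand-in for a distance measure, not the one used in this work.
    """
    return sum(a != b for a, b in zip(out_param[1:], in_param[1:]))


def can_chain(out_param: Param, in_param: Param) -> bool:
    """An output can feed an input only when all three fields agree."""
    return mismatch(out_param, in_param) == 0
```

Because each new kind of metadata is just another triple on the parameter node, extending the description does not change the shape of existing data, which is the extensibility argument made for RDF above.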


The Deep Web [29, 10, 34] refers to content hidden behind HTML forms. In order to get to such content, a user has to perform a form submission with valid input values. Search engines like Google try to retrieve this hidden content along with general HTML pages. Our framework for data transform composition can be applied to automatic Deep Web query processing.

For example, a user can frame a query like "How long does it take to go from Orlando International Airport to Disney World by car?" In order to answer the question, we need to extract Deep Web content step by step: getting the addresses of Orlando International Airport and Disney World, getting the distance between them, and calculating the time needed to drive that distance by car. The query can be answered using the Deep Web content of the Google Map site (e.g., ...).

Second, we can extend our work on optimizing the execution of a composed data transform. The composed data transform is compiled into a Java class file, registered in the Postgres DBMS as a UDF, and invoked through an SQL query on the source data set to perform the actual transformation. Currently, a transform registered as a UDF is treated as a black box during query optimization [13, 12], which limits how far a query that invokes the composed data transform can be optimized. It would be beneficial to find opportunities for further optimization by looking inside the description of the composed data transform.

Third, we show that our transform metamodel contributes greatly to reducing the exponential search space of our data transform composition. We can work on gathering other valuable metadata as automatically as possible [57]. As the kinds of semantic metadata increase, designing effective distance measures can become challenging.
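The Deep Web example earlier in this chapter (look up two addresses, look up the distance between them, compute the driving time) is itself a three-step composition. The sketch below illustrates that chaining only. The lookups are passed in as functions because the real Deep Web sources are not specified here, and the names `compose_steps`, `make_pipeline`, `address_of`, `distance_km`, and the constant `speed_kmh` are hypothetical choices for this sketch.

```python
from functools import reduce


def compose_steps(*steps):
    """Chain steps left to right; each step consumes the previous result."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def make_pipeline(address_of, distance_km, speed_kmh):
    """Build a place-pair -> driving-hours transform from three steps.

    address_of  -- place name -> address (stands in for a form submission)
    distance_km -- (address, address) -> road distance in kilometres
    speed_kmh   -- assumed constant average driving speed
    """
    def lookup_addresses(places):
        origin, destination = places
        return address_of(origin), address_of(destination)

    def lookup_distance(addresses):
        return distance_km(*addresses)

    def driving_hours(km):
        return km / speed_kmh

    return compose_steps(lookup_addresses, lookup_distance, driving_hours)
```

A caller would supply wrappers around whatever sources are available for `address_of` and `distance_km`. A constant average speed is a deliberate simplification of the final step.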


[1] R. Akkiraju, J. Farrell, J. Miller, M. Nagarajan, M.-T. Schmidt, A. Sheth, and K. Verma. Web service semantics - WSDL-S.
[2] R. Akkiraju, A. Ivan, R. Goodwin, B. Srivastava, and T. Syeda-Mahmood. Semantic matching to achieve web service discovery and composition. In CEC-EEE '06: Proceedings of the 8th IEEE International Conference on E-Commerce Technology and the 3rd IEEE International Conference on Enterprise Computing, E-Commerce, and E-Services, page 70, Washington, DC, USA, 2006. IEEE Computer Society.
[3] G. Alonso, F. Casati, H. Kuno, and V. Machiraju. Web Services: Concepts, Architectures and Applications. Springer-Verlag, Berlin, Germany, 2003.
[4] D. Berardi, D. Calvanese, G. D. Giacomo, M. Lenzerini, and M. Mecella. Automatic composition of e-services that export their behavior. In Proc. of the 1st International Conference on Service Oriented Computing, 2003.
[5] P. A. Bernstein and T. Bergstraesser. Meta-data support for data transformations using Microsoft Repository. IEEE Data Eng. Bull., 22(1):9-14, 1999.
[6] P. A. Bernstein and S. Melnik. Model management 2.0: manipulating richer mappings. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1-12, New York, NY, USA, 2007. ACM.
[7] P. Carreira and H. Galhardas. Execution of data mappers. In IQIS '04: Proceedings of the 2004 international workshop on Information quality in information systems, pages 2-9, New York, NY, USA, 2004. ACM.
[8] F. Casati, S. Ilnicki, L. J. Jin, V. Krishnamoorthy, and M. Shan. Adaptive and dynamic service composition in eFlow. In Proc. of the International Conference on Adv. Info. Systems Engineering, 2000.
[9] G. Chafle, S. Chandra, V. Mann, and M. G. Nanda. Decentralized orchestration of composite web services. In WWW '04: Proceedings of the 13th International World Wide Web Conference. ACM, May 2004.
[10] K. C.-C. Chang and J. Cho. Accessing the web: from search to integration. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 804-805, New York, NY, USA, 2006. ACM.
[11] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Rec., 26(1):65-74, 1997.
[12] S. Chaudhuri and K. Shim. Query optimization in the presence of foreign functions. In VLDB '93: Proceedings of the 19th International Conference on Very Large Data Bases, pages 529-542, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc.


[13] S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. ACM Trans. Database Syst., 24(2):177-228, 1999.
[14] E. Christensen, F. Curbera, G. Meredith, and S. Weerawarana. Web services description language (WSDL) 1.1.
[15] W. M. Coalition. Process exchange specification language.
[16] B. Coppin. Artificial Intelligence Illuminated. Jones and Bartlett Publishers, Sudbury, Massachusetts, 2004.
[17] S. B. Davidson and A. Kosky. Specifying database transformations in WOL. IEEE Data Eng. Bull., 22(1):25-30, 1999.
[18] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos. iMap: discovering complex semantic matches between database schemas. In SIGMOD '04, pages 383-394, 2004.
[19] P. Dobbins, T. Dohzen, C. Grant, J. Hammer, M. Jones, D. Oliver, M. Pamuk, J. Shin, and M. Stonebraker. Morpheus 2.0: A data transformation management system. In InterDB, VLDB workshop, 2007.
[20] T. Dohzen, M. Pamuk, S.-W. Seong, J. Hammer, and M. Stonebraker. Data integration through transform reuse in the Morpheus project. In SIGMOD '06, pages 736-738, 2006.
[21] X. Dong, A. Halevy, J. Madhavan, E. Nemes, and J. Zhang. Similarity search for web services. In VLDB '04: Proceedings of the Thirtieth international conference on Very large data bases, pages 372-383. VLDB Endowment, 2004.
[22] B. A.-M. et al. Template-based semantic similarity for security applications. In Technical Report, LSDIS Lab, Computer Science Department, University of Georgia, Jan. 2005.
[23] D. M. et al. OWL-S: Semantic markup for web services. W3C, 2004.
[24] J. Fan and S. Kambhampati. A snapshot of public web services. SIGMOD Rec., 34(1):24-32, 2005.
[25] G. H. Fletcher and C. M. Wyss. Data mapping as search. In EDBT 2006: Advances in Database Technology. Springer LNCS 3896, 2006.
[26] P. G. D. Group. PostgreSQL.
[27] A. Y. Halevy, Z. G. Ives, P. Mork, and I. Tatarinov. Piazza: data management infrastructure for semantic web applications. In WWW '03: Proceedings of the 12th international conference on World Wide Web, pages 556-567, New York, NY, USA, 2003. ACM.


[28] B. He, K. C.-C. Chang, and J. Han. Discovering complex matchings across web query interfaces: a correlation mining approach. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 148-157, New York, NY, USA, 2004. ACM.
[29] B. He, M. Patel, Z. Zhang, and K. C.-C. Chang. Accessing the deep web. Commun. ACM, 50(5):94-101, 2007.
[30] J. M. Hellerstein. Optimization techniques for queries with expensive methods. ACM Trans. Database Syst., 23(2):113-157, 1998.
[31] J. M. Hellerstein and M. Stonebraker. Predicate migration: optimizing queries with expensive predicates. In SIGMOD '93: Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pages 267-276, New York, NY, USA, 1993. ACM Press.
[32] IBM. Semantic tools for web services.
[33] IBM. Workflow management coalition.
[34] L. A. Jayant Madhavan, Loredana Afanasiev and A. Halevy. Harnessing the deep web: Present and future. In CIDR '09: 4th Biennial Conference on Innovative Data Systems Research, 2009.
[35] G. Klyne and J. J. Carroll. Resource Description Framework (RDF): Concepts and abstract syntax. W3C, 2004.
[36] P. Kungas and M. Matskin. Detection of missing web services: The partial deduction approach. In NWESP '05: Proceedings of the International Conference on Next Generation Web Services Practices, page 339, Washington, DC, USA, 2005. IEEE Computer Society.
[37] P. Kungas and M. Matskin. From web services annotation and composition to web services domain analysis. IJMSO, 2(3):157-178, 2007.
[38] D. Lin. An information-theoretic definition of similarity. In ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning, pages 296-304. Morgan Kaufmann Publishers, 1998.
[39] L. Lin and I. B. Arpinar. Discovery of semantic relations between web services. In ICWS '06. International Conference on Web Services, pages 357-364, Sep. 2006.
[40] J. Madhavan and A. Y. Halevy. Composing mappings among data sources. In VLDB '2003: Proceedings of the 29th international conference on Very large data bases, pages 572-583. VLDB Endowment, 2003.
[41] G. A. Miller. WordNet: a lexical database for English. Commun. ACM, 38(11):39-41, 1995.


[42] J. Myerson. Work with Web services in enterprise-wide SOAs, part 5: Optimize web service applications with WebSphere business integration tools. IBM developerWorks, 2005.
[43] S.-C. Oh, B.-W. On, E. J. Larson, and D. Lee. BF*: Web services discovery and composition as graph search problem. In EEE '05: Proceedings of the 2005 IEEE International Conference on e-Technology, e-Commerce and e-Service.
[44] N. I. S. Organization. Understanding metadata. NISO Press, 2004.
[45] S. R. Ponnekanti and A. Fox. Sword: A developer toolkit for web service composition. In Proc. of the 11th International Conference on WWW, 2002.
[46] M. P. Singh. The pragmatic web. In IEEE Internet Computing, pages 4-5, 2002.
[47] G. Qian and Y. Dong. A step towards incremental maintenance of the composed schema mapping. In CIKM '08: Proceeding of the 17th ACM conference on Information and knowledge management, pages 173-182, New York, NY, USA, 2008. ACM.
[48] E. Rahm and P. A. Bernstein. On matching schemas automatically. VLDB Journal, (4), 2001.
[49] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334-350, 2001.
[50] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3-13, 2000.
[51] E. A. S. Ghandeharizadeh and S. Manjunath. Proteus RTI: A framework for on-the-fly integration of biomedical web services. In USC Database Laboratory Technical Report Number 2006-05, 2006.
[52] J. Shin, J. Hammer, and H. Lam. RDF-based approach to data transform composition. In 7th IEEE/ACIS International Conference on Computer and Information Science, IEEE/ACIS ICIS 2008, 14-16 May 2008, Portland, Oregon, USA, pages 645-648, 2008.
[53] J. Shin, J. Hammer, and W. J. O'Brien. Distributed process integration: Experiences and opportunities for future research. In IEEE International Workshop on Web and Mobile Information Systems (WAMIS), 2006.
[54] A. Simitsis, P. Vassiliadis, and T. Sellis. Optimizing ETL processes in data warehouses. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering, pages 564-575, Washington, DC, USA, 2005. IEEE Computer Society.
[55] A. Simitsis, P. Vassiliadis, and T. Sellis. Optimizing ETL processes in data warehouses. ICDE, pages 564-575, 2005.


[56] A. Sirbu and J. Hoffmann. Towards scalable web service composition with partial matches. In ICWS '08: Proceedings of the 2008 IEEE International Conference on Web Services, pages 29-36, Washington, DC, USA, 2008. IEEE Computer Society.
[57] R. Sumra and A. D. Quality of service for web services - demystification, limitations, and best practices. Developer.com, 1999.
[58] T. Syeda-Mahmood, G. Shah, R. Akkiraju, A.-A. Ivan, and R. Goodwin. Searching service repositories by combining semantic and ontological matching. In 2005 IEEE International Conference on Web Services, pages 13-20, 2005.
[59] I. Tatarinov and A. Halevy. Efficient query reformulation in peer data management systems. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 539-550, New York, NY, USA, 2004. ACM.
[60] P. S. M. Tsai and A. L. P. Chen. Optimizing queries with foreign functions in a distributed environment. IEEE Transactions on Knowledge and Data Engineering, 14(4):809-824, 2002.
[61] K. Verma. Configuration and adaptation of semantic web processes. In Ph.D. Thesis, Computer Science, Univ. of Georgia, June 2006.
[62] W3C. Web Services Architecture Requirement. Technical report, W3C, 2002.
[63] D. Wu, B. Parsia, E. Sirin, J. Hendler, and D. Nau. Automating DAML-S web services composition using SHOP2. In Proc. of 2nd International Semantic Web Conference, Oct. 2003.
[64] L. Xu and D. W. Embley. Discovering direct and indirect matches for schema elements. In DASFAA, pages 39-46, 2003.
[65] C. Yu and L. Popa. Semantic adaptation of schema mappings when schemas evolve. In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 1006-1017. VLDB Endowment, 2005.
[66] H. Zhu, J. Zhong, J. Li, and Y. Yu. An approach for semantic search by matching RDF graphs. In Proceedings of the Fifteenth International Florida Artificial Intelligence Research Society Conference, pages 450-454. AAAI Press, 2002.


Jungmin Shin received her B.S. and M.S. from the Department of Computer Science and Engineering at Ewha Womans University in South Korea in 1993 and 1995, respectively. In 1995, she joined the Media Communications Lab of LG Electronics in Seoul, South Korea, and worked on intelligent user interfaces as a researcher. In 1997, she joined SK Telecom in Seoul, South Korea. She worked on large-scale database management systems such as the membership management system and bulletin board system of a commercial on-line portal service. She was also involved in developing a video on demand (VOD) service in a commercial wireless internet service. Since 2001, she has been working on process integration and data transform composition at the Database Systems Research and Development Center at the University of Florida. She received her Ph.D. from the University of Florida in the fall of 2009.