Citation
Algorithms for reduced content document synchronization

Material Information

Title:
Algorithms for reduced content document synchronization
Creator:
Kang, Ajay I. S. ( Dissertant )
Helal, Abdelsalam A. ( Thesis advisor )
Hammer, Joachim ( Thesis advisor )
Place of Publication:
Gainesville, Fla.
Publisher:
University of Florida
Publication Date:
Copyright Date:
2002
Language:
English

Subjects

Subjects / Keywords:
Algorithms ( jstor )
Bandwidth ( jstor )
Change detection ( jstor )
Mobile devices ( jstor )
Run time ( jstor )
XML ( jstor )
Data transmission systems
Wireless communication systems
Computer and Information Science and Engineering thesis, M.S
Dissertations, Academic -- UF -- Computer and Information Science and Engineering

Notes

Abstract:
Existing research has focused on alleviating the problems associated with wireless connectivity, viz. lower bandwidth and higher error rates, by the use of proxies which transcode data to be sent to the mobile device depending on the state of the connection between the proxy and the mobile device. However, there is a need for a system to allow automatic reintegration of changes made to such reduced content information into the original content rich information. This also allows users the flexibility of using virtually any device to edit information by converting the information to a lesser content rich format suitable for editing on that device and then automatically reintegrating changes into the original. Hence, these algorithms allow the development of an application transparent solution to provide ubiquitous data access and enable reducing download time and cache usage on the mobile device for editable documents. They can also be applied to progressive synchronization of a full fidelity document with the master document on the server. To develop flexible, generic, synchronization algorithms, we need an application-independent specification of the hierarchical content of data, a generic method of specifying what information is shared between reduced content documents and the originals and a change detection algorithm. This algorithm should produce a script that can be applied to the original to bring it up to date and also allow for version control. Existing difference detection algorithms do not operate on reduced content documents and the approaches used need to be adapted to this domain. XML has been chosen to represent document content since it allows encoding of any hierarchical information and most applications are now moving towards native XML file formats. The change detection algorithm makes extensive use of the O(ND) difference algorithm developed by Eugene W. Myers for matching shared content and approximate sub string matching. Extensive performance analysis has been performed on the system by constructing a test engine to automatically generate modifications of varying degree on source documents and thus produce a set of modified reference documents. These reference documents are converted to text (hence stripping out all formatting content), and the algorithm is run to generate delta scripts that are applied to the original documents to generate the new versions. Comparison of the new versions to the references allows us to quantify error.
General Note:
Title from title page of source document.
General Note:
Includes vita.
Thesis:
Thesis (M.S.)--University of Florida, 2002.
Bibliography:
Includes bibliographical references.
General Note:
Text (Electronic thesis) in PDF format.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Kang, Ajay I. S. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
12/27/2005
Resource Identifier:
1105201850 ( OCLC )












ALGORITHMS FOR REDUCED CONTENT DOCUMENT SYNCHRONIZATION


By

AJAY I. S. KANG

















A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

UNIVERSITY OF FLORIDA


2002




























Copyright 2002

by

Ajay I. S. Kang




























I dedicate this work to my parents, Harjeet Singh and Kulwant Kaur Kang. Without their constant encouragement and support over the years, I would not be here documenting this thesis.
















ACKNOWLEDGMENTS

I cannot adequately express the gratitude and appreciation I owe to my family, friends and faculty advisors in this space.

My family has always been supportive of all my endeavors and prodded me to push farther and achieve all that I am capable of. However, they have never in that process made me feel that I let them down when I failed to achieve their expectations. I know that I will always have their unconditional love and support as I go through life.

I have many friends I would like to thank, especially the "fourth floor crowd" of the CISE department where I spent most of my time working on my research. Hai Nguyen, Andrew Lomonosov and Pompi Diplan would constantly stop by my office providing welcome interruptions from the monotony of work. I am indebted to Mike Lanham for his encouragement to pursue this area of research. A special mention must also be made for Starbucks Coffee, to which the aforementioned are addicted; they have consumed many gallons of the overpriced beverage in their time here in (according to them) an effort to bolster the national economy.

I thank my advisors, Dr. Abdelsalam "Sumi" Helal and Dr. Joachim Hammer, whose patience, encouragement, insight and guidance were invaluable in the successful completion of this work.

Finally, I would like to thank the National Science Foundation for funding the UbiData project, which resulted in my receiving a research assistantship this summer.














TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

    Use Cases
    Summary of the Advantages of Change Detection in Reduced Content Information
    Approach Used to Investigate the Design and Implementation of Reduced Content Change Detection and Reintegration

2 RELATED RESEARCH

    The Ubiquitous Data Access Project at the University of Florida
        System Goals
        System Architecture
            Mobile environment managers (M-MEMs)
            Mobile data service server (MDSS)
        The Role of Reduced Content Change Detection in this Architecture
    Previous Work on Change Detection for UbiData
    Research on Disconnected Operation Transparency: the CODA File System
    Research on Bandwidth Adaptation
        Odyssey
        Puppeteer
        Transcoding Proxies: Pythia, MOWSER and Mowgli
        Congestion Manager
        Intelligent Collaboration Transparency
    Research on Difference Algorithms
        LaDiff
        XyDiff
        Xmerge
        An O(ND) Difference Algorithm

3 PROBLEMS INVOLVED IN RECOVERABLE CONTENT ADAPTATION

    Developing a Generic Difference Detection and Synchronization Algorithm for Reduced Content Information
    Example Document Content Reduction Scenario
    Issues in Designing a Function to Construct a Modified Document Structure Isomorphic to the Source Document Structure

4 DESIGN AND IMPLEMENTATION OF RECOVERABLE CONTENT ADAPTATION

    Solution Outline for Delta Construction
    Representation of Shared Content
    Matching of Nodes in the Source and Modified Documents
        Construction of the List of Content Node Subtrees
        Determining the Node Matches between the v0 and v1(-) Trees
        Determining the Node Matches between the v0 Subtree List and v1(-) Tree List
    Adjusting the Structure of the Modified Document to make it Isomorphic to the Source Document
        Applying the Substring Matches
        Adjusting the Result Document Structure for Unshared Ancestors
        Adjusting the Result Document Structure for Unshared Nodes
        Applying Default Formatting for Inserts
        Post Processing
            Computing signatures and weights for the two documents
            Applying the node matches to the XyDiff structures
            Optimizing node matches
    Constructing the Delta Script
        Mark the Old Tree
        Mark the New Tree
        Save the XID Map
        Compute Weak Moves
        Detect Updates
        Construct the Delete Script
        Construct the Insert Script
        Save the Delta Document
        Clean Up

5 RELATED EXTENSIONS TO UBIDATA

    Development of a File Filter Driver for Windows CE to Support a UbiData Client
    Active Event Capture as a Means of UbiData Client-Server Transfer Reduction
    Development of an Xmerge Plug-in to Support OpenOffice Writer

6 PERFORMANCE EVALUATION APPROACH

    Test Platform
    Test Methodology

7 TEST RESULTS

    Correctness of the Change Detection System
    Performance of the Difference Detection Algorithm
        Insert Operations
        Delete Operations
        Update Operations
        Move Operations
        Conclusions on Per-Operation Run Time
        Run Time Behavior of Node Matches
        Run Time Behavior of Substring Matching
    Relative Sizes of Files Produced

8 CONCLUSIONS AND POTENTIAL FOR FUTURE WORK

    Conclusions
    Potential for Future Work
        Determination of Best Performing Values for Algorithm Parameters
        Exploration of Techniques to Prune the Search for the Best Subtree Permutation
        Post Processing to Ensure Validity of Reintegrated Document
        Further Testing and Generalization of the System
        Implementation of the UbiData Integration Architecture
        Extension to Other Application Domains

LIST OF REFERENCES

BIOGRAPHICAL SKETCH















LIST OF FIGURES

Figure

1  Architecture of the UbiData system
2  Proposed integration of change detection and synchronization with UbiData
3  Venus states
4  Sample edit graph; Path trace = (3,1), (4,3), (5,4), (7,5); Common subsequence = CABA; Edit script = 1D, 2D, 2IB, 6D, 7IC
5  Example Abiword document
6  XML structure imposed on modified text document
7  DOM tree for the original document, v0
8  Example of approximate substring matching
9  Windows CE.NET 4.0 storage architecture
10 Logical queue of memory mapped files
11 Test tools and procedures
12 Missing nodes by test cases in LES complexity order
13 Missing nodes by test cases in substring complexity order
14 Time taken for the various phases by test case
15 Time taken by test cases for Inserts with no file level changes
16 Time taken by test cases for Inserts with 30 percent file level changes
17 Time taken by test cases for Inserts with 50 percent file level changes
18 Time taken by test cases for Inserts with 10 percent paragraph level changes
19 Time taken by test cases for Deletes with 10 percent paragraph level changes
20 Time taken by test cases for Updates with 10 percent paragraph level changes
21 Time taken by test cases for Updates with no file level changes
22 Time taken by test cases for Moves with 10 percent paragraph level changes
23 Time taken by test cases for Moves with no file level changes
24 Per-operation relative time complexity
25 Behavior of the node match phase
26 Behavior of the substring match phase
27 Sizes of the files produced by the algorithm















Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science

ALGORITHMS FOR REDUCED CONTENT DOCUMENT SYNCHRONIZATION

By

Ajay I. S. Kang

December 2002


Chair: Abdelsalam "Sumi" Helal
Major Department: Computer and Information Science and Engineering

Existing research has focused on alleviating the problems associated with wireless connectivity, viz. lower bandwidth and higher error rates, by the use of proxies which transcode data to be sent to the mobile device depending on the state of the connection between the proxy and the mobile device. However, there is a need for a system to allow automatic reintegration of changes made to such reduced content information into the original content rich information. This also allows users the flexibility of using virtually any device to edit information by converting the information to a lesser content rich format suitable for editing on that device and then automatically reintegrating changes into the original.

Hence, these algorithms allow the development of an application transparent solution to provide ubiquitous data access and enable reducing download time and cache usage on the mobile device for editable documents. They can also be applied to









progressive synchronization of a full fidelity document with the master document on the server.

To develop flexible, generic synchronization algorithms, we need an application-independent specification of the hierarchical content of data, a generic method of specifying what information is shared between reduced content documents and the originals and a change detection algorithm. This algorithm should produce a script that can be applied to the original to bring it up to date and also allow for version control.

Existing difference detection algorithms do not operate on reduced content documents and the approaches used need to be adapted to this domain. XML has been chosen to represent document content since it allows encoding of any hierarchical information and most applications are now moving towards native XML file formats. The change detection algorithm makes extensive use of the O(ND) difference algorithm developed by Eugene W. Myers for matching shared content and approximate substring matching.

Extensive performance analysis has been performed on the system by constructing a test engine to automatically generate modifications of varying degree on source documents and thus produce a set of modified reference documents. These reference documents are converted to text (hence stripping out all formatting content), and the algorithm is run to generate delta scripts that are applied to the original documents to generate the new versions. Comparison of the new versions to the references allows us to quantify error.














CHAPTER 1
INTRODUCTION

With the emergence of smaller and ever more powerful computing and communication devices providing a wide range of connectivity options, computers are playing an ever-increasing role in people's lives. Portable computers such as laptops, tablet PCs, and handheld devices based on the Palm, Pocket PC, and embedded Linux platforms, and to an extent smart phones, coupled with the proliferation of wireless connectivity options, are bringing the vision of ubiquitous computing into reality.

However, the potential provided by this existing hardware has as yet largely gone untapped due to the lack of a software infrastructure to allow users to make effective use of it. Users instead have to deal with the added complexity of manually synchronizing their data across all the devices that they use and handle the issues of file format conversion and integration of their changes made to the same document in different formats. These differences exist due to the constraints on mobile devices of being lightweight, small and having a good battery life, which restricts the memory, processing power and display size available on such devices [1]. Hence, there will always be devices for which applications will not be available since the device's hardware or display capabilities fall short. These devices cannot be used as terminals or thin clients (which would alleviate the problem to an extent) since lower bandwidth, higher error rates and frequent disconnections characterize wireless connectivity.

Most research in the mobile computing area has thus been focused on the minimization of the impact of this form of connectivity on the user experience. The most









common approach is to use a proxy server to transcode the data [2-5] to be sent to the mobile device based on current connectivity conditions, thus ensuring that latency times in data reception as perceived by the user stay within certain limits at the cost of displaying data with reduced content. If the user were to edit such content, the burden of reintegrating the changes with the original copy on the server would fall on him/her. Hence, there is a definite need for a system to perform this change detection and reintegration automatically without user intervention. Solving this problem would also result in a solution to the first problem outlined, i.e., the unavailability of certain applications on a mobile device, by having a "smart proxy" deduce that the device could not support such content and reduce the complexity to allow it to be convertible or directly compatible with applications on the mobile device. Any changes made would subsequently be re-integrated using the change detection system.

In conjunction with existing research on systems that automatically hoard the files a user uses on the various devices he owns for disconnected operation support such as the Coda file system developed at Carnegie Mellon University [6-7], we now have the ability to provide the user access to his data at any time, anywhere and on almost any device.

To allow the development of a change detection system that works on hierarchical data in any format, we need an intermediate data representation form that can be used to represent hierarchical data in any format. The intermediate form is what the change detection system will use as its input and output formats which are by definition easily convertible to the required data representation. The Extensible Markup Language (XML) [8] was chosen as the intermediate data representation since XML is an open standard format designed for this purpose and has a wide variety of parsing tools available for it.









This work is being explored as a component to be used in the ubiquitous data access project [9] under development at the Computer and Information Science and Engineering Department of the University of Florida. The goals of the UbiData system are to provide any time, anywhere access to a user's data irrespective of the device being used by him/her.

To get a better feel of the potential of such a system we explore two use cases in the following paragraphs.

Use Cases

Use case 1: overcoming device/application limitations. Joe is working on a document that he intends to show his advisor at a meeting across campus but he is not quite done yet. He powers up his Palm PDA that connects to the local area network through a wireless interface card. The UbiData system recognizes this as one of Joe's devices and fetches the device's capability list from the UbiData server. The system then proceeds to hoard his data onto the device, including the Word document that he was just working on. However, a comparison with the capability list shows that the Palm does not support editing of MS Word documents. UbiData uses a converter plugin to convert the MS Word document to text/Aportisdoc format that is editable on the Palm and then transfers it to the Palm as he walks out of the door. By the time he reaches the campus lawn where wireless connectivity is unavailable, the document is sitting on his Palm. He makes edits to the document as he walks across campus. When he reaches his destination, he requests UbiData to synchronize his updates and the text file is compared to the original document and the changes are imported. By the time he reaches his advisor's door, the document is up to date.









Use case 2: overcoming bandwidth limitations. Jane has to go on a business trip to Paris and has a long flight ahead of her. She wants to use this time to edit some project proposals and cost estimates. She requests UbiData to hoard the files she needs to her laptop. At the airport she realizes that she needs a file that she has not accessed recently and hence has not been hoarded to her laptop. She uses her cell phone to connect to and authenticate with the corporate network. She requests the file download and the smart proxy on her corporate network transcodes the file data by removing the images from the file for fast transfer over the weak connection. She edits the file on the plane and uses the business lounge at her hotel in Paris to upload the changes. The UbiData system recognizes that this is a reduced content document (since it was notified by the proxy that Jane's copy was content reduced) and proceeds to detect and integrate the changes with the original. After her meeting is over, she decides to walk around town and, sitting at a café, has an idea to incorporate into a proposal. She takes out her PDA and uses her cell phone to connect to and authenticate with her corporate network. Since she is now making an international call, she requests a fast download and the proxy converts the document down to compressed text for minimum transfer size. When she returns to the hotel she reintegrates her changes with the content rich document on the server.

Summary of the Advantages of Change Detection in Reduced Content Information

*  A change detection system as outlined above would allow the development of an
application transparent system for ubiquitous data access with no need for any support software at the client end for change detection, as opposed to other solutions such as the CoFi architecture used in the Puppeteer system developed at Rice
University [10-12].

*  The change detection system could be used to hoard reduced content documents on
a mobile device instead of full fidelity documents. This would speed up download times and reduce cache space utilization, which could also allow hoarding of a larger working set. Transcoding proxies could also transcode editable content since
it would always be possible to reintegrate changes into the original document.









* On the other hand, if the mobile device has a full fidelity version of a document
that has been changed, and a high bandwidth connection is unavailable, the document could be "distilled" by removing content such as images. The changes could then be reintegrated with the document on the server allowing other users to see updates earlier. Once a high bandwidth connection is available, the entire document could then be uploaded for incorporation of changes to content that was
stripped out. Hence, progressive updates of documents are made possible.

Approach Used to Investigate the Design and Implementation of Reduced Content Change Detection and Reintegration

Since such a system is quite complex, it is easier to focus on a restricted case and apply the approaches developed to the general problem. However, this problem should be carefully chosen since the solution may not be applicable to the overall problem. Hence, the extreme case in document editing was chosen for solution, viz. conversion of a content rich document to text and detecting and reintegrating the changes. To provide flexibility, a diff/patch solution was chosen which would also allow only the differences to be sent to another client requesting the same file in the UbiData network. This would result in significant bandwidth savings. To enable version control, a complete diff format was chosen which has enough information available to revert to a previous version of the file. To ease development, an existing XML diff/patch tool was modified. The tool is part of the XyDiff program suite developed by the VERSO team, from INRIA, Rocquencourt, France, for their Xyleme project [13-14].

The following chapters describe the UbiData system, related work and the algorithms developed to solve the problem of change detection in reduced content XML documents, as well as test results obtained.














CHAPTER 2
RELATED RESEARCH

The Ubiquitous Data Access Project at the University of Florida System Goals

1. Anytime, anywhere access to data, regardless of whether the user is connected,
weakly connected via a high latency, low bandwidth network, or completely
disconnected;

2. Device-independent access to data, where the user is allowed to use and switch
among different portables and PDAs, even while mobile;

3. Support for mobile access to heterogeneous data sources such as files belonging to
different file systems and/or resource managers.

System Architecture

The architecture for the UbiData system [15] divides the data it manages into three tiers as follows:

1. The data sources, which could consist of file systems, database servers, workflow
engines, e-mail servers, web servers or other sources.

2. Data and meta-data collected for each user of the system. The data is the user's
working set consisting of files used by the user. Meta-data collected on a per user basis includes the configuration and capabilities of all the devices used by the
user.

3. The mobile caches that contain a subset of the user's working set to allow
disconnected operation on the various devices owned by the user.

The middle tier (storing the data and meta-data) acts as a mobility aware proxy, filling the client's caches when connected and updating the user's working set information. On reconnection after disconnection, it handles reintegration of updates made to documents on the mobile cache.





























Figure 1. Architecture of the UbiData system


The three tiers are easily identifiable in figure 1; the clients on the left contain the mobile caches on their local file systems. The clients also act as data sources since they can publish data to the system. The other data sources (labeled external data sources) are to the right of the figure. The lower middle part of the diagram depicts the data and metadata collected for each user. The system modules that implement this "three tier data architecture" [9] are detailed below:

Mobile environment managers (M-MEMs)

The functions of the M-MEMs are as follows:

1. To observe the user's file access patterns and discard uninteresting events, e.g.
operating system file accesses and filter the remaining to determine the user's working set. These are the files that the user needs on the devices he uses to
continue working in spite of disconnections.

2. Manage files hoarded to the device and track events such as updates to the file so it
can be marked as a candidate for synchronization on reconnection. Also, on reconnection, if the master copy in the middle tier is updated, the M-MEM discards the stale local copy and fetches the updated master copy from the middle tier server
(MDSS).











3. Bandwidth conservation by always shipping differences between the old version
and the updated version of a file instead of the new file whenever possible. Other activities such as queue optimization of the messages to be sent to the middle tier
server also help in this regard.

Mobile data service server (MDSS)

The MDSS consists of two components:

Fixed environment manager (F-MEM)

The F-MEM server side component is a stateless data and meta-data information service to mobile devices. The basic functionality of the F-MEM is to accept the registration of events, detect events, and to trigger the proper event-condition-action (ECA) rules. The main tasks of the F-MEM are to manage the working set and the metadata in conjunction with the M-MEM component. As a result, the F-MEM needs to perform additional services such as authentication of users and implement a communication protocol between the M-MEM and itself. This protocol should be very resilient to poor connectivity and frequent disconnections, which implies the use of a message queuing system.

Data and meta-data warehouse

The meta-data warehouse is an XML database that stores information related to each user such as the uniform resource identifiers (URIs) for all files in his/her working set and device types and capabilities for all devices used by the user. The data (working set) is stored on the local file system of the MDSS server and accessed through URIs as mentioned before.









The Role of Reduced Content Change Detection in this Architecture

The goal of providing "device-independent access to data" would be unattainable without the presence of such a system. This is due to the fact that mobile devices vary widely in terms of hardware capabilities, the operating systems they run and the applications they support. Hoarding files without considering the target environment would result in the user having a number of unusable files since there is no application available on his/her device to handle them. Hence the F-MEM needs to access the device profile and suitably transcode the user's working set before shipping it to the client device. If the transcoded content is editable, changes made to this content must be reintegrated into the original copy on the server. Simply replacing the old file by a newer copy, as done in existing systems of this type, e.g. the Coda file system developed at Carnegie Mellon University, would result in information loss since transcoding involves loss of information.

Now that it is established that this is a critical component of UbiData, the question that remains is how this system will be integrated into the overall UbiData architecture. Figure 2 shows how the system could be integrated into the UbiData architecture. The sequence of actions for three events has been shown and they are described below:

1. Mobile host publishing a document. The actions for this event are shown with
solid black arrows. The mobile host transfers the document to the F-MEM which, on decoding the message, recognizes it as a publish request. The F-MEM determines if the document is already in XML format. If not, the appropriate conversion plugin is loaded and the converted document is stored in the repository as part of the user's
working set.

2. Mobile host importing a document. The actions for this event are shown with
solid gray arrows. When an import request is received, the required document format is determined. The connection state is also sampled from the connection monitor. If the connection is weak, an appropriate transcoding scheme is selected.
Once the type of conversion is determined, the appropriate plugin is loaded and the conversion performed by loading the latest version of the document from the













































Figure 2. Proposed integration of change detection and synchronization with UbiData









repository and running the plugin to convert it. An entry is made in the meta-data database to mark the client as having a reduced content version of the document.
Finally, the document is packaged for transport and shipped to the mobile client.

3. Synchronization of the possibly reduced content document. The actions for this
event are shown with dashed arrows. Once the message is decoded and recognized as a synchronization request, the meta-data content is checked to determine if this was a reduced content document. If so, the appropriate XML conversion plugin is loaded if necessary, and the document is made available to the vdiff change detection module. The change detection module also uses the application specific shared tag information and the original document from the repository. This information is used to generate a delta script to update the document that is stored in the repository for versioning purposes. The XyApply module is then used to apply the script to the original document in the repository to generate the new version. This new version is stored in place of the original if it was not the first version of the document. Otherwise, the latest version is stored separately. Hence, the repository maintains the original version of the document and the latest version.
The stored deltas allow any version in between to be generated from the original
document. The latest version is kept for efficiency.

Previous Work on Change Detection for UbiData

Refer to [16] for previous work on change detection and reintegration. This work approaches the same problem in a slightly different way and also extends the previous effort. The previous system was unable to handle deeper and more complex document structures. Updates were often missed and treated as inserts. This has been addressed in this work, and the approach to testing the system has been improved, making the evaluation of document correctness more objective.

Research on Disconnected Operation Transparency: the CODA File System

Coda is a location transparent, distributed file system developed at Carnegie Mellon University, which uses server replication and disconnected operation to provide resiliency to server and network failures. Server replication involves storing copies of a file at multiple servers. The other mechanism, disconnected operation, is a mode of execution in which a caching site temporarily assumes the role of a replication site. Coda uses an application transparent approach to disconnected operation. Each client runs a process called Venus that handles remote file system requests using the local hard disk as a file cache. Once a file or directory is cached, the client receives a callback from the server. The callback is an agreement between the client and the server that the server will notify the client before any other client modifies the cached information. This allows cache consistency to be maintained. Coda uses optimistic replication where write conflicts can occur if different clients modify the same cached file. However, these operations are deemed to be rare and hence acceptable. Cache management by Venus is done according to the state diagram in Figure 3.



Figure 3. Venus states

Venus is normally in the hoarding state but on the alert for possible disconnection. During the hoarding state, Venus caches all objects required during disconnected operation. While disconnected, Venus services all file system requests from its local cache. All changes are written to a Change-Modify-Log (CML) implemented on top of a transactional facility called recoverable virtual memory (RVM). Upon reintegration, the system ensures detection of conflicting updates and provides mechanisms to help users recover from such situations. Coda resolves conflicts at the file level. However, if an application specific resolver (ASR) is available, it is invoked to rectify the problem. An ASR is a program that has the application specific knowledge to detect genuine









inconsistencies from reconcilable differences. In a weakly connected state, Venus propagates updates to the servers in the background.

The Coda file system has the drawback of requiring its own file system to be present on the clients and servers, which UbiData does not have. UbiData uses the native file system of its host. Hence, bandwidth adaptation and reduced content synchronization are being incorporated into UbiData, due to its greater flexibility in potentially supporting a much wider variety of devices on which such a system can be tested.

Research on Bandwidth Adaptation

Odyssey

Odyssey [17-19] uses the application aware approach to bandwidth adaptation. The Odyssey architecture is a generalization of the Coda architecture. Data is stored in a shared, hierarchical name space made up of typed volumes called Tomes. There are three Odyssey components at each client as follows:

1. The Odyssey API: The Odyssey API supports two interface classes, viz. resource
negotiation and change notification class and the dynamic sets class. The former allows an application to negotiate and register a window of tolerance with Odyssey for a particular resource. When the availability of the resource falls out of those bounds, Odyssey notifies the application. The application is then free to adapt itself
accordingly and possibly renegotiate a window of tolerance.

2. The Viceroy and the Wardens: These are a generalization of Coda's cache
manager, Venus. The Viceroy is responsible for monitoring external events and the levels of generic resources. It notifies clients who have placed requests on these resources. It is also responsible for type independent cache management functions.
The Wardens, on the other hand, manage type specific functionality. Each Warden manages one Tome type (called a Codex). Each Warden also implements caching
and naming support and manages resources specific to its Codex.

The Odyssey architecture, due to its application aware adaptation design, is not as universally usable as an application transparent approach and hence, unsuitable for ubiquitous data access. It must be noted, however, that applications written for such









architectures would have superior performance, as it is easier and faster for an application to adapt to a poor connection than a generic transcoder that would more likely have higher latency and would be far more complex.

Puppeteer

The Puppeteer architecture is based around the following philosophy:

1. Adaptation should be centralized.

2. Applications should not be written to perform adaptation. Instead, they should
make visible their Document Object Model (DOM): the hierarchy of documents,
containing pages, slides, images and text.

The CoFi architecture [11] standing for consistency and fidelity, implemented in Puppeteer supports document editing and collaborative work on bandwidth-limited devices. This is precisely the goal that the reduced content change detection and synchronization algorithms are being designed for. However, CoFi takes a different approach to solving the problem and this can actually be used to complement the algorithms being developed. CoFi supports editing of partially loaded documents by decomposing documents into their component structures and keeping track of the changes made to each component by the user and those resulting from adaptation. Hence, in cases where the target applications meet the puppeteer requirements of exposing a DOM interface, this approach would be more efficient since only the modified subset of components needs to be dealt with. However, when this is not the case, the hierarchical change detection and synchronization system can still be used. Another requirement for Puppeteer to be used is that a Puppeteer client needs to be present on the target device to perform the function of tracking changes. This may pose a significant overhead for some mobile devices (these devices tend to be resource poor in general), in which case this work can still be used.









Transcoding Proxies: Pythia, MOWSER and Mowgli

Each of these systems performs active transcoding of data on a proxy between the data source and the mobile device to adapt to the varying state of the connection between the mobile device and the proxy. The work done on these systems can be applied to the proposed UbiData integration architecture described earlier.

Congestion Manager

Congestion Manager (CM) is an operating system module developed at MIT [20] that provides integrated flow management and exports a convenient programming interface that allows applications to be notified of, and adapt to changing network conditions. This can be used to implement the connection monitor subsystem in the proposed UbiData integration architecture described earlier.

Intelligent Collaboration Transparency

Intelligent collaboration transparency is an approach developed at Texas A&M University [21] for sharing single user editors by capturing user input events. These events are then reduced to high-level descriptions, semantically translated into corresponding instructions of different editors, and then replayed at different sites. Although the research is not aimed at bandwidth adaptation, it can be adapted for this purpose by capturing all user input events on the mobile device and then converting them to higher level descriptions before optimizing this log for transmission to the server. This was explored as a means of reducing transmission overhead between the mobile device and the server in the UbiData architecture and the work done and results obtained are described in chapter 5.









Research on Difference Algorithms

LaDiff

LaDiff [22] uses an O(ne + e^2) algorithm, where n is the number of leaves and e is the weighted edit distance, for finding the minimum cost edit distance between ordered trees. However, there is no support for reduced content change detection. The algorithm functions in two phases, viz. finding a good matching between the nodes of the two trees and finding a minimum cost edit script for the two trees given a computed matching. The matching phase uses a compare function, which given nodes s1 and s2 as input, returns a number in the range [0,2] that represents the distance between the two nodes. This is similar to the technique used to generate matches in our reduced content change detection system. The edit script computation operations are similar to the technique used by XyDiff and that is the system that has been followed for edit script construction in the reduced content change detection algorithm.

XyDiff

XyDiff is a difference detection tool for XML documents designed for the Xyleme project at INRIA, Rocquencourt, France. The Xyleme project proposes the building of a worldwide XML warehouse capable of tracking changes made to all XML documents for query and notification. The XyDiff tool is being used to investigate reduced content change detection since it employs an efficient algorithm to track changes in XML documents and its source is available for modification. The companion tool to XyDiff, XyApply, applies the deltas generated by XyDiff to XML documents. This tool has been used as is for testing the algorithms developed since the XyDiff delta structure has not been modified in any way but the delta construction process has been completely rewritten. Xyleme deltas also have the advantage of being complete, i.e. they have









enough additional information to allow them to be reversed. Hence, they meet the goal of allowing version reconstruction. To assist in versioning, all XML nodes are assigned a persistent identifier, called an XID, which stands for Xyleme ID. XIDs are stored in an XID map that is a storage efficient way of attaching persistent IDs to every node in the tree. The basic algorithm used by XyDiff is as follows:

1. Parse the v0 (original) document and read in the XIDs if available.

2. Parse the v1 (modified) document.

3. Register the source document, i.e. assign IDs to each node in a postorder traversal
of the document. Construct a lookup table for ID attributes (unique identifiers assigned to nodes by the creating application for the purpose of easing change
detection), if present.

4. Construct a subtree lookup table for finding nodes at a certain level from any parent
node for the v0 document.

5. Register the v1 document.

6. Perform bottom up matching.

7. Perform top down matching.

8. Peephole optimization.

9. Mark old tree for deletes.

10. Mark new tree for inserts.

11. Save the v1 document's XID map.

12. Compute weak move operations, i.e. a node having the same parent in both
documents but different relative positions.

13. Detect updates.

14. Write the delta to file.

All steps before step 8 have been altered for supporting reduced content difference detection. The remaining steps concerned with the construction of the delta have been left relatively untouched.









Xmerge

Xmerge is an OpenOffice.org open source project [23] for document editing on

small devices. The goals of the document editing on small devices project are as follows: 1. To allow editing of rich format documents on small devices, using 3rd party
applications native to the device.

2. Support the most widely used small devices, namely: the Palm and PocketPC.

3. Provide the ability to merge edits made in the small device's lossy format back into
the original richer format document, maintaining its original style.

4. Take advantage of the open and well-defined OpenOffice.org XML document
format.

5. Provide a framework with the ability to have plugin-able convert, diff and merge
implementations. It should be possible to determine a converter's capabilities at
run-time.


The goals of the Xmerge group are virtually identical to what the reduced content difference detection system is trying to achieve. However, the approach used by the Xmerge group is to have format-specific diff and merge implementations, unlike the format independent approach selected for reduced content synchronization. This allows our reduced content synchronization system to be simpler and have smaller space requirements. This would enable it to be installed on a reasonably powerful small device. However, Xmerge's plugin management system can be utilized to implement the plugin manager subsystem for the proposed UbiData integration architecture described above.

An O(ND) Difference Algorithm

The approach used in this difference algorithm [24] is to equate the problems of finding the longest common subsequence of two sequences A and B, and a shortest edit script for transforming A into B, to finding a shortest path in an edit graph. Here, the N and D refer to the sum of the lengths of A and B and the size of the minimum edit script









for A and B respectively. Hence, this algorithm performs well when differences are small, and thus has been chosen as the approximate string-matching algorithm since the application domain for the system is for user edits on mobile devices. These edits are not likely to be very extensive and hence, for this application domain, the algorithm should perform well.

The algorithm uses the concept of an "edit graph", an example of which is shown in figure 4. Let A and B be two sequences of length N and M respectively. The edit graph for A and B has a vertex at each point in the grid (x,y), x ∈ [0,N] and y ∈ [0,M]. The points (x,y) at which Ax = By are called match points. Corresponding to every match point is a diagonal edge in the graph. The edit graph is a directed acyclic graph.

Figure 4. Sample edit graph; Path trace = (3,1), (4,3), (5,4), (7,5); Common subsequence = CABA; Edit script = 1D, 2D, 2IB, 6D, 7IC

A trace of length L is a sequence of L match points (x1, y1), (x2, y2), ..., (xL, yL) such that xi < xi+1 and yi < yi+1 for successive points (xi, yi). Each trace corresponds to a









common subsequence of the strings A and B. Each trace also corresponds to an edit script. This can be seen from figure 4; if we equate every horizontal move in the trace to a deletion from A and every vertical move to an insertion into B, we have an edit script that can transform A to B. Finding the longest common subsequence and finding the shortest edit script both correspond to finding a path from (0,0) to (N, M) with the minimum number of non-diagonal edges. Giving diagonal edges a weight of 0, and nondiagonal edges a weight of 1, the problem reduces to finding the minimum-cost path in a directed weighted graph which is a special instance of the single source shortest path problem.

Refer to Myers [24] for the algorithms to find the length of such a path and to reconstruct the path's trace. These algorithms have been used for approximate string matching and approximate substring matching respectively as part of the overall change detection algorithm.
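To make the edit graph formulation concrete, the following is a minimal C++ sketch of the forward pass of the greedy O(ND) algorithm; it computes only the length D of the shortest edit script for two strings, whereas the full algorithm in [24], as used in this work, also records the path so that the trace (and hence the matched content) can be recovered. The function name and container choices are illustrative and are not taken from the thesis implementation.

    #include <string>
    #include <vector>

    // Sketch of the forward pass of Myers' greedy algorithm: returns the length D of the
    // shortest edit script (number of insertions plus deletions) transforming a into b.
    int sesLength(const std::string& a, const std::string& b) {
        const int n = static_cast<int>(a.size());
        const int m = static_cast<int>(b.size());
        const int max = n + m;
        std::vector<int> v(2 * max + 2, 0);   // v[k + max] = furthest x reached on diagonal k
        for (int d = 0; d <= max; ++d) {
            for (int k = -d; k <= d; k += 2) {
                int x;
                if (k == -d || (k != d && v[k - 1 + max] < v[k + 1 + max]))
                    x = v[k + 1 + max];       // step down: an insertion
                else
                    x = v[k - 1 + max] + 1;   // step right: a deletion
                int y = x - k;
                while (x < n && y < m && a[x] == b[y]) { ++x; ++y; }   // follow the diagonal "snake"
                v[k + max] = x;
                if (x >= n && y >= m)
                    return d;                 // reached (N, M): D edit operations suffice
            }
        }
        return max;                           // at most N + M operations are ever needed
    }

Because the running time is O((N+M)D), the algorithm is fast precisely when the two sequences are similar, which matches the expected pattern of user edits on mobile devices described above.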














CHAPTER 3
PROBLEMS INVOLVED IN RECOVERABLE CONTENT ADAPTATION

Developing a Generic Difference Detection and Synchronization Algorithm for Reduced Content Information

Since document formats vary widely, and there are a vast number of them, the number of possible content reductions is enormous. To reduce the size of the problem and make it tractable, only the extreme case was considered for solution. The system that was developed can now be extended to suit other combinations. This extension is an ongoing task. Document editing has been chosen as a problem domain and to investigate recoverable content adaptation, conversion to text followed by reintegration of changes to the edited text has been chosen as the restricted problem to be solved. This problem is the extreme case of content adaptation in the domain of document editing. Abiword [25], an open source document editor, has been selected as the test editor as it uses XML as its native file format and is available on both Linux and Windows.

Example Document Content Reduction Scenario

Consider the document shown in figure 5 which is reduced to text and edited as shown in table 1. The notation followed in this sequence of changes to the document is to refer to the source document as v0, the content reduced (converted to text) document as v0(-), the content reduced document after edits as v1(-), and the final document with changes integrated as v1. In the general case, reduction will not be so extreme and the result document will still be in XML format. Hence, we need to impose an XML structure on the v1(-) document. However, since all formatting content has been lost, it









will only be possible to impose a very basic structure on the document. This basic structure will depend on the particular word processor in use and hence the module handling this will vary for every document format. Therefore, a plugin management system is required to load the correct converter depending on the particular format being handled. In the case of Abiword, imposition of this structure involves extracting text terminated by a new line and inserting it under a <p> tag, since in Abiword, paragraphs are terminated by a new line. Other structures that can be safely imposed are an <abiword> tag and a <section> tag since they are present in all Abiword documents. However, this is the maximum extent we can go to, given only the modified text document. The resulting XML DOM tree is shown in figure 6.
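As a rough illustration of this step, the sketch below (a hypothetical helper, not the thesis code) wraps each newline terminated line of the edited text document v1(-) in a <p> element under the <abiword> and <section> elements named above. A real converter would also escape XML special characters and would be selected by the plugin management system according to the document format.

    #include <sstream>
    #include <string>

    // Hypothetical sketch: impose the minimal Abiword-style XML structure on the edited
    // text document v1(-). Each newline terminated line becomes a <p> element; escaping
    // of '<', '>' and '&' is omitted for brevity.
    std::string imposeBasicStructure(const std::string& text) {
        std::ostringstream xml;
        xml << "<abiword>\n  <section>\n";
        std::istringstream in(text);
        std::string line;
        while (std::getline(in, line))
            xml << "    <p>" << line << "</p>\n";   // one paragraph per line of text
        xml << "  </section>\n</abiword>\n";
        return xml.str();
    }

Applied to the edited text in the right-hand column of table 1, this would produce a tree of the same basic shape as the one shown in figure 6.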



Figure 5. Example Abiword document









Table 1. Document in figure 5 converted to text and edited

Document text content (v0(-))                  Document edited text content (v1(-))
COP 5615                                       COP 5615
Evaluation of Current Prevention,              Evaluation of Current Prevention,
Detection and defeat Research                  Detection and defeat Research
List item 1                                    List item 1
List item 2                                    List item 2
List item 3                                    Updated #1## List item 3
End of document                                End of document


Figure 6. XML structure imposed on modified text document

For comparison, the original document structure is shown in figure 7. Clearly, the modified document has a far simpler structure. To enable integration of changes to the original document, we need to come up with a function which, when given the original document tree and the tree with XML structure imposed, produces a structure with the content in the modified document that is isomorphic to the source document.












































Figure 7. DOM tree for the original document, v0











Once we have this isomorphic modified tree, the difference script can be

constructed using standard techniques employed in other difference algorithms.

Issues in Designing a Function to Construct a Modified Document Structure Isomorphic to the Source Document Structure

1. A format independent method of specifying what content is shared between the
modified document and the source is needed to identify what content to consider deleted versus just stripped out. In our example above, the shared content is the
abiword, section, p and text nodes as shown in figure 6.

2. A technique for matching content nodes in the modified document to the nodes in
the source document is needed. However, this is complicated by the following
possibilities:

a. Node contents can be edited to varying degrees. Hence, an approximate
string matching technique is needed and a boundary value for a node to
be considered matched to another needs to be found.

b. An aggregation of nodes in the source document may be present as one
node in the result. As an example, consider the string "Evaluation of Current Prevention, Detection and defeat Research" in figure 5. The substring "Current Prevention, Detection" is underlined and hence the DOM subtree for this string in figure 7 has three nodes corresponding to the string. However, on conversion to text, this structure is lost and when a basic XML structure is applied, the three nodes are present as one node in the result tree in figure 6. In addition to possibility a. occurring, the result string may have the subsequences in a different order, which further complicates matching. Since the string may be edited in any way, it is not possible to match each substring in the source to the result split into the same sized substrings. Matching each substring in the source document to every prefix/suffix of the content node in the result tree is also incorrect, since the match may be any substring and not necessarily a prefix or suffix. Hence, structures such as suffix trees are unusable. Matching all possible substrings would be extremely expensive computationally, since the number of candidate substring combinations grows on the order of O(2^n), far too many to be processed. In the extreme case, when no XML structure can be imposed on the document except a document root tag, all content will be present in one node. This extreme case of the aforementioned problem underscores the need to develop an
efficient solution to it.








3. Once matches are found, the tree structure from the source needs to be used to
construct an isomorphic structure for the result that re-imports all stripped out
content.

4. A mechanism is needed to filter out certain content nodes from the match process
e.g. Abiword encodes images in base 64 text and thus this could be confused for a
content node and impose a heavy and unnecessary performance burden.

5. Post processing is necessary to ensure that the created isomorphic tree is valid e.g.
if the document has all content deleted but the formatting and style descriptions are imported as part of the process above, Abiword crashes on trying to open the document. A format specific plugin could be run at this point, to perform dependency checks such as this. Alternatively, a separate tool could be used to
perform this function after applying the delta to the source document.














CHAPTER 4
DESIGN AND IMPLEMENTATION OF RECOVERABLE CONTENT ADAPTATION

Solution Outline for Delta Construction

As mentioned in the previous chapter, the restricted problem being solved is that of converting a document to text, and incorporating the changes made to the textual version back into the original source document. There are three basic steps in the solution approach as follows:

1. If required, adjust the resulting document by imposing a basic XML structure on it.
Match the nodes in the source and modified document trees taking into account the
fact that an aggregate of source nodes may match a result node.

2. Adjust the structure of the modified document to make it isomorphic to the source
document. In this process all content that was stripped out should be re-imported
unless it was associated with deleted content.

3. Construct the delta script using the source document structure and the newly
created modified document structure.

Before describing the steps in detail, the question of how shared content and nodes to be excluded from the match procedure are to be represented needs to be answered.

Representation of Shared Content

There are two ways shared content between documents could be represented as follows:

1. The intersection of the set of tags in the source and result.

2. The complement of the set above, i.e., the set of tags which are not shared between
the two documents.

The former approach has been chosen since, in our case, the number of shared tags will be very small (since all documents are in XML format, we treat shared information









as shared tags). This information can also be represented in an XML format. Hence, there will exist one such map for each file format. The corresponding map will be read depending on the format being handled, and the tag names will be stored in a hash table (STL hash map) for O(1) time lookup during the matching process. The same idea is used to represent excluded tags (the information which is to be excluded from the matching process).
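A minimal sketch of this lookup structure is shown below, assuming the tag names have already been parsed out of the map file; std::unordered_set stands in for the STL hash map mentioned above, and the tag names are invented for the example.

```cpp
#include <iostream>
#include <string>
#include <unordered_set>

// Illustrative only: the implementation reads the tag names from the XML map
// file for the format being handled; here they are hard-coded for the example.
struct TagMaps {
    std::unordered_set<std::string> shared;    // tags common to source and result
    std::unordered_set<std::string> excluded;  // tags whose subtrees are skipped

    bool isShared(const std::string& tag) const   { return shared.count(tag) != 0; }
    bool isExcluded(const std::string& tag) const { return excluded.count(tag) != 0; }
};

int main() {
    TagMaps maps;
    maps.shared   = {"abiword", "section", "p"};   // hypothetical Abiword-like tags
    maps.excluded = {"image", "data"};             // e.g. base-64 encoded images

    for (const std::string tag : {"p", "image", "c"})
        std::cout << tag << ": shared=" << maps.isShared(tag)
                  << " excluded=" << maps.isExcluded(tag) << "\n";
}
```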

Matching of Nodes in the Source and Modified Documents

The algorithm has the following basic steps:

1. Recursively construct a list of content nodes (TEXT nodes) from the source (vo)
document excluding any subtrees under excluded tags.

2. Construct a list of content node subtrees from the vo document and remove the
corresponding nodes from the list constructed in step 1.

3. Recursively construct a list of content nodes (TEXT nodes) from the modified
document (vi(-)).

4. Find the least cost edit script matches between the nodes in the vo list and the nodes
in the vi(-) list.

5. Find the least cost edit script matches between the vo subtree list and the nodes in
the vi(-) list.

Construction of the List of Content Node Subtrees

The algorithm has the following basic steps:

1. For each node in the vo list, check if its parent node is a shared node and if the node
has a sibling. If so, find out the number of leaf nodes in this subtree by counting each TEXT sibling and recursively counting the TEXT nodes in every non-TEXT
node sibling.

2. Otherwise, if the parent is not a shared node, traverse the tree upwards until a
shared parent is found. Calculate the number of TEXT nodes in this subtree
recursively.

3. If the number of nodes found is greater than one, remove this range of nodes from
the vo list and insert the nodes into the vo subtree list.









Determining the Node Matches between the vo and vi(-) Trees

The algorithm has the following basic steps (a sketch of the matching loop follows the list):

1. For each node in the vi(-) node list, execute step 2.

2. minLES = ∞

a. For each node in the vo node list, perform the following steps:

i. Compute the least cost edit script length using the algorithm in [24]. If this value is less than minLES, set minLES = the value computed above.

ii. If the minLES value is below a threshold value, enter the vi(-) node as a key in a hash table with the value being another hash table containing the (key, value) pair of (vo node, the minLES value) inserted into it. Otherwise, if the minLES value exceeds the threshold value, enter the vi(-) node as a key in a hash table with the value being another hash table containing the (key, value) pair of (vo node, ∞) inserted into it.

iii. If the minLES value is 0, we have found a perfect match for the vi(-) node and there is no need to consider any other vo nodes. Continue execution from step 2.

b. If minLES < ∞ and if the vo node corresponding to the minLES value is
already assigned, find the better match by looking up the previous match value in the hash table and comparing it to minLES. If the current one is better, reset the previous match. If not, we need not continue further for
this vi(-) node. Continue execution from step 2.

c. Mark the vo and vi(-) nodes as matched/assigned to each other.

d. If minLES is 0, remove the corresponding vo node from the vo node list
since it does not need to be included in any further comparisons.
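A minimal sketch of this matching loop follows. A plain dynamic-programming edit distance stands in for the O(ND) algorithm of [24], flat vectors replace the nested hash tables, and all names, thresholds and sample strings are illustrative rather than taken from the implementation.

```cpp
#include <algorithm>
#include <climits>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for the O(ND) least cost edit script length of [24]: a plain
// dynamic-programming edit distance in which a mismatch costs 2 (a deletion
// plus an insertion), matching Myers' insert/delete-only model.
static int lesLength(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j)
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 2)});
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Greedy best-match assignment of vi(-) content nodes to vo content nodes,
// subject to a threshold; returns match[i] = index into v0, or -1 if unmatched.
std::vector<int> matchNodes(const std::vector<std::string>& v0,
                            const std::vector<std::string>& v1,
                            int threshold) {
    std::vector<int> match(v1.size(), -1);
    std::vector<int> bestCost(v0.size(), INT_MAX);  // best LES seen per vo node
    std::vector<int> owner(v0.size(), -1);          // vi node currently assigned
    for (std::size_t i = 0; i < v1.size(); ++i) {
        int minLES = INT_MAX;
        int best = -1;
        for (std::size_t j = 0; j < v0.size(); ++j) {
            int les = lesLength(v0[j], v1[i]);
            if (les <= threshold && les < minLES) { minLES = les; best = static_cast<int>(j); }
            if (minLES == 0) break;                 // perfect match, stop early
        }
        if (best < 0) continue;                     // no candidate under the threshold
        if (owner[best] != -1) {                    // vo node already assigned
            if (bestCost[best] <= minLES) continue; // keep the better previous match
            match[owner[best]] = -1;                // current match is better: reset old one
        }
        owner[best] = static_cast<int>(i);
        bestCost[best] = minLES;
        match[i] = best;
    }
    return match;
}

int main() {
    std::vector<std::string> v0 = {"Introduction", "Related work", "Conclusions"};
    std::vector<std::string> v1 = {"Related work (revised)", "Conclusions"};
    std::vector<int> m = matchNodes(v0, v1, 20);
    for (std::size_t i = 0; i < m.size(); ++i)
        std::cout << "vi[" << i << "] matches vo[" << m[i] << "]\n";
}
```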


Determining the Node Matches between the vo Subtree List and the vi(-) Node List

Before going into the steps involved, we explore the solution approach with an example. Consider the subtree consisting of the nodes AB, C, ABBA and the vi(-) node CBABAC. For now, let us assume that the relative ordering of the substrings in the vi(-) node is the same as that in vo, but they may have been edited in any way otherwise. We








need a way to decompose the vi(-) node content into substrings providing the best approximate matches to the subtree nodes. The approach used exploits the way the least cost edit script algorithm works. Consider figure 8, which shows the edit graph generated for the concatenation of the substrings (AB, C, ABBA) and the vi(-) node CBABAC.


[Figure 8 plots the edit graph with the concatenated substrings A B C A B B A along the x-axis (coordinates 0 to 7) and the vi(-) node content C B A B A C along the y-axis; the traced path through the graph marks the substring boundaries.]

Figure 8. Example of approximate sub string matching

We can find the matches for each substring by using the co-ordinates of the path found in the graph. If the end of a substring, e.g. AB, occurs at x-coordinate 2, the corresponding y value of the point on the path gives the termination of the matching substring (in this case 0, implying that the best match for the substring AB is the empty string). In this example the matches are AB = "" (the empty string), C = C, and ABBA = BABAC.
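The sketch below reproduces this decomposition for the example above. A plain dynamic-programming edit graph with a traceback stands in for the O(ND) algorithm, and the helper names are assumptions made for illustration.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// For the concatenation of the vo substrings (x-axis) and the vi(-) node
// content (y-axis), compute one least cost edit path through the edit graph
// and return, for every x, the y coordinate at which the path leaves column x.
// A plain DP edit distance with traceback stands in for the O(ND) algorithm.
static std::vector<int> pathYForX(const std::string& x, const std::string& y) {
    const std::size_t n = x.size(), m = y.size();
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1));
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j)
            d[i][j] = std::min({d[i - 1][j] + 1, d[i][j - 1] + 1,
                                d[i - 1][j - 1] + (x[i - 1] == y[j - 1] ? 0 : 2)});
    std::vector<int> yForX(n + 1, 0);
    std::size_t i = n, j = m;
    while (i > 0 || j > 0) {                       // walk the path back to (0,0)
        yForX[i] = std::max(yForX[i], static_cast<int>(j));
        if (i > 0 && j > 0 && x[i - 1] == y[j - 1] && d[i][j] == d[i - 1][j - 1]) { --i; --j; }
        else if (j > 0 && d[i][j] == d[i][j - 1] + 1) --j;   // insertion: move down
        else --i;                                            // deletion: move right
    }
    return yForX;
}

int main() {
    std::vector<std::string> subtree = {"AB", "C", "ABBA"};  // vo subtree content nodes
    std::string v1 = "CBABAC";                               // vi(-) node content
    std::string cat;
    for (const std::string& s : subtree) cat += s;

    std::vector<int> yForX = pathYForX(cat, v1);
    int xEnd = 0, yStart = 0;
    for (const std::string& s : subtree) {                   // split vi(-) at each substring end
        xEnd += static_cast<int>(s.size());
        int yEnd = yForX[xEnd];
        std::cout << s << " -> \"" << v1.substr(yStart, yEnd - yStart) << "\"\n";
        yStart = yEnd;
    }
}
```

Running the sketch prints AB -> "", C -> "C" and ABBA -> "BABAC", i.e. the decomposition described above.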

Now returning to the general problem where substring node orderings may not be the same, we can see that the solution is valid provided we find the permutation of the subtree that best matches the vi(-) node content. However, finding all permutations of a









set of substrings takes O(n!) time, which is intractable for large problems. Hence, a condition/heuristic is needed to prune the search. The approach used in the current implementation is to perform the full permutation search only when the number of nodes in the subtree is below a threshold value C (a small sketch of this pruning follows the algorithm below).
The basic steps involved in the algorithm are as follows:

1. For each node in the vo substring list, perform the following steps:

a. minLES = ∞

b. If the size of the current vo subtree is less than a threshold value C,
recursively compute the permutations for the nodes in the subtree and store them in a vector. Otherwise, concatenate the nodes in the subtree
and store the result in the vector.

c. For each node in the vl(-) node list, execute the following steps:

i. Find the minimum LES value of all the permutations in the vector with the current vi(-) node, subject to the threshold value. If the threshold value is exceeded, the current LES value is ∞.

ii. If this value is less than minLES, assign it to minLES.

iii. Enter the vi(-) node as a key in a hash table with the value being another hash table containing the (key, value) pair (first node in the vo subtree, current LES value) inserted into it.

iv. If minLES = 0, the perfect match has been found and no further comparisons are necessary. Restart execution from step a.

v. If minLES < ∞, perform the following steps:

1. If the subtree has already been matched, determine the better match and set the vi(-) node as the match for the v0 subtree corresponding to the better match.

2. Re-compute the best permutation for the matched vo subtree and compute its LES match, saving









information needed to reconstruct the trace through the graph.

3. Compute the trace through the edit graph.

4. Use the technique described above to traverse the trace and split the vi(-) node into substrings matching the vo subtree nodes. Enter the substrings with their matching counterparts into a hash table.

5. Return the hash table.

2. Otherwise, return an empty hash table.
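A minimal sketch of the pruning rule referenced above, assuming each candidate ordering is represented simply as the concatenation of the subtree's content nodes and that the cut-off is a straight comparison of the subtree size against C (the implementation may prune differently):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Enumerate the candidate orderings of a subtree's content nodes. Below the
// threshold C every permutation is produced (O(n!) of them); at or above the
// threshold the search is pruned to the single original ordering.
std::vector<std::string> candidateOrderings(std::vector<std::string> nodes,
                                            std::size_t C) {
    auto concat = [](const std::vector<std::string>& v) {
        std::string s;
        for (const std::string& part : v) s += part;
        return s;
    };
    std::vector<std::string> out;
    if (nodes.size() < C) {
        std::sort(nodes.begin(), nodes.end());  // canonical start for next_permutation
        do {
            out.push_back(concat(nodes));
        } while (std::next_permutation(nodes.begin(), nodes.end()));
    } else {
        out.push_back(concat(nodes));           // pruned: keep the original order only
    }
    return out;
}
```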

Adjusting the Structure of the Modified Document to make it Isomorphic to the Source Document

Applying the Substring Matches

The basic steps involved in the algorithm are as follows:

1. For each subtree in the vo document, perform the following steps:

a. If the v0 subtree has already been matched, execute the following steps:

i. Set vlParent = Parent node of the vi match node.

ii. For each node in the v0 subtree, execute the following steps:

1. Look up the matching substring in the hash table returned from the substring match function.

2. If the match exists, then execute the following steps:

a. Create a text node for the match and insert it before the matching vi node. If the vo node was previously matched, reset the match. Mark the vo and vi nodes as matched/assigned to each other.

b. Add the inserted node to the list of vi nodes.

iii. If even one node in the subtree was matched, delete the corresponding vi node (it has now been replaced by its components).


2. Return.









Adjusting the Result Document Structure for Unshared Ancestors

The basic steps involved in the algorithm are as follows:

1. For each node in the vi node list, perform the following steps:

a. Lookup the matching vo node and its parent.

b. If the vo Parent node is not null and is unshared, execute the following:

i. Push the parent node into a stack.

ii. Get the Parent's ancestor, and repeat step b.

c. If the shared vo parent and vi node parent match (have the same node
names), perform the following steps:

i. Match the parents if not done already.

ii. Propagate the matches upward till we reach a matched node or the root. For each match, import attributes of the vo node over to the vi node.

iii. Remove the vi node (to allow insertion of unshared ancestors).

iv. Set vlParentDom = parent of the vi node.

v. Get the vi node's previous sibling and check if its name is the
same as the ancestor we are trying to insert. If so, and the vo ancestor is assigned, check the remaining nodes in the stack for assignment and a match with the rightmost path down the sibling's tree. Remove each matching node from the stack. Set vlParentDom =last match from the stack. This takes into account the possibility that we are trying to insert ancestors common to siblings.

vi. For each node on the stack, execute the following steps:

1. Append the node as a child to vlParentDom.

2. Mark the appended node and the vo node as matched/assigned to each other.

3. Set vlParentDom = appended node.

vii. Append the vi node as a child of vlParentDom.


2. Return.










Adjusting the Result Document Structure for Unshared Nodes

The basic steps involved in the algorithm are as follows:

1. If the document roots are not matched and they have the same tag names, match
them. Otherwise, signal an error and halt.

2. Set vONode = first child of the vo root.

3. Set viNode = first child of the vi root.

4. Similarly, set vOParent and viParent to the vo and vi Roots respectively.

5. If vONode is null, return.

6. If the vONode is not assigned, and it is unshared with its parent matched to the
viNode parent or has an excluded parent, execute the following steps:

a. Add the vONode to the vi tree as a previous sibling of the viNode if the viNode is
non-null, or as the only child of the viParent otherwise.

b. Mark the nodes as matched/assigned to each other.

c. Recursively process all the children of the vONode and viNode.

7. Otherwise, if the nodes are already assigned, recursively process all the children of
the vONode and its vi match.

8. Recursively process all the children of the vONode's next sibling and the viNode.


Applying Default Formatting for Inserts

This phase has been added to the overall algorithm to provide the flexibility of assigning any formatting as the default for inserted content. This allows inserted content to be easily identifiable. However, to allow any formatting to be applied, we need the flexibility of inserting the new content as a child of a hierarchical structure. This structure is added as a child of a shared ancestor (which would have been the immediate parent of the inserted content). To define the structure, we use an XML document specifying the tree structure, where the root of the tree defines the shared tag to which this structure is to









be applied. The insertion point, which is a text string to be replaced by the inserted content, is also defined in this document. The structure is read in along with the other XML configuration files, and each time the structure has to be applied, the new content replaces the insertion point. The new content is then set as the new insertion point for the next cycle. When the structure is applied, the replacement is performed as above and the tree is inserted into the position previously occupied by the inserted content.

Post Processing

Before running the XyDiff delta construction phases, we need to set up some basic structures needed by the difference detection algorithm and apply the discovered matches to those structures. Once this is done, a XyDiff optimization phase is run to detect updates that may have been missed by the least cost edit script match phase due to the text differences being too extensive. The optimization function should actually be run before the application of the default formatting for inserts but since it will rarely find missed matches, this is the way it has been kept for now. Another justification for doing so is that the structures used by this function do not exist before the insert formatting is applied. Insert formatting cannot be applied later since the new nodes inserted will require extensive re-organization of the XyDiff structures. Hence, the best approach would be to perform tests on the effectiveness of the phase and if required, re-implement it to function with the structures used by the match phase. A brief description of these phases follows.

Computing signatures and weights for the two documents

The basic steps involved are as follows:

1. Set the first available XID for the source document based on its XID map.









2. Register the source and result document by traversing the documents and assigning
identifiers to each node based on a post-order traversal of the tree. Create a list of data nodes storing these IDs and hash values for the subtrees corresponding to the
tree nodes in question.

Applying the node matches to the XyDiff structures

Apply all node matches generated as part of the matching phases to the XyDiff structures. If the matched nodes have the same names and values, mark them as NO-OPs. Otherwise, if they have the same parents, they represent an update and are marked accordingly. Otherwise, a move must have occurred, which will be handled as part of the delta construction phase.

Optimizing node matches

In this phase, we basically look for matched nodes and check if any children need matching. If children are found satisfying the condition that the source and result tag names are the same, and that the result has only one child with that tag name for the current parent, they are matched.

Constructing the Delta Script

The steps involved in delta construction are as follows:

1. Mark the old tree.
2. Mark the new tree.
3. Save the XID map.
4. Compute weak moves.
5. Detect updates.
6. Construct the delete script.
7. Construct the insert script.
8. Save the delta document.
9. Clean up.

Mark the Old Tree

Conduct a post-order traversal of the source document and mark nodes as deleted or moved. A deleted node can be identified as a node that has not been assigned.









An assigned node that has different parents in the source and result documents is deemed a "strong move".

Mark the New Tree

Conduct a post-order traversal of the result document and mark the nodes as inserted or moved and adjust the XID map accordingly. Matched nodes are imported into the result document's XID map with the XID of the node in the source document. A moved node is identified in the same fashion as above and if the node is not assigned, it is deemed inserted and marked as such.

Save the XID Map

Write the result document's new XID map to file.

Compute Weak Moves

A weak move is one where assigned nodes have the same parents but are in different relative positions among the parent's children. The function of this phase is to determine the least cost edit script that can transform the old child ordering to the new one. This corresponds to finding the longest common subsequence between the two orderings.
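A sketch of this computation, assuming each child is identified by its persistent XID; children that do not lie on a longest common subsequence of the two orderings are the ones the delta must record as weak moves.

```cpp
#include <algorithm>
#include <vector>

// Length of the longest common subsequence between the old and new orderings
// of a parent's children, with each child identified by its persistent XID.
// Children that do not lie on such a subsequence are recorded as weak moves.
int lcsLength(const std::vector<int>& oldOrder, const std::vector<int>& newOrder) {
    std::vector<std::vector<int>> L(oldOrder.size() + 1,
                                    std::vector<int>(newOrder.size() + 1, 0));
    for (std::size_t i = 1; i <= oldOrder.size(); ++i)
        for (std::size_t j = 1; j <= newOrder.size(); ++j)
            L[i][j] = (oldOrder[i - 1] == newOrder[j - 1])
                          ? L[i - 1][j - 1] + 1
                          : std::max(L[i - 1][j], L[i][j - 1]);
    return L[oldOrder.size()][newOrder.size()];
}
```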

Detect Updates

If a node and its matched counterpart each have only one unmatched TEXT node as a child, match these children and mark them as updated.

Construct the Delete Script

Construct the delete script by traversing the nodes in post-order with a right to left enumeration of children. Based on the marked values, construct appropriate tags for the delete script, which specifies what operation is to be performed, at what location, and enumerates the affected XIDs. Add the created subtree to the delta script DOM tree. In









this process, situations where both a node and its parent are deleted need to be handled. The rule employed is that if the parent is to be deleted, only the parent's delete operation is written to the delta, and all deleted children are specified as part of the parent's delete operation.

Construct the Insert Script

The same procedure as above is also used to construct the operations for inserts. Note that moves and updates are also handled in these two phases.

Save the Delta Document

Write the resulting delta tree to file.

Clean Up

Close the input files and free allocated memory.














CHAPTER 5
RELATED EXTENSIONS TO UBIDATA

Development of a File Filter Driver for Windows CE to Support a UbiData Client

To allow integration of the change detection system with UbiData, and testing on various platforms, it is necessary to first have UbiData support a range of platforms. To allow porting of the UbiData system onto Windows CE, a file filter driver is necessary to capture file operations and ascertain when a file on the local file system has been accessed or modified. This filter driver communicates the information to the local UbiData client, the M-MEM, which uses the information to build the user's working set or to mark the locally modified file as requiring synchronization with the server. The Windows CE .NET 4.0 storage architecture [26] is as shown in figure 9.

[Figure 9 depicts the Storage Manager containing the FSD Manager and a chain of file system filters layered above the individual file systems (FATFS, NTFS, UDFS and others), which in turn sit above the partition drivers and block drivers.]
Figure 9. Windows CE .NET 4.0 Storage architecture [26].









As can be seen from figure 9, a file system filter is a dynamic-link library (DLL)

that sits on top of the file system and intercepts file system calls. Multiple filters can exist on any loaded file system performing any combination of file manipulations before the file system sees the call. For the purposes of UbiData, we need to trap the file system functions associated with create, open and write. For every open/create call on a file, the file name is stored in a hash table with the key being the file handle value. On a write, the hash table is looked up to get the corresponding file name. These file operations are encoded (using an application specific lookup table) and communicated to the M-MEM

application.
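The bookkeeping described above can be sketched as follows. Only the handle-to-name table and the notification step are shown; the hooking of the Windows CE file system entry points, the operation lookup table and the memory-mapped-file transport are omitted, and all names and values are invented for the example.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

// Illustrative bookkeeping only; the actual interception of create/open/write
// inside a Windows CE file system filter DLL is not shown.
enum class FileOp : std::uint8_t { Create = 1, Open = 2, Write = 3 };

class FilterState {
    std::unordered_map<std::uintptr_t, std::string> names_;  // handle -> file name
public:
    void onOpen(std::uintptr_t handle, const std::string& name, FileOp op) {
        names_[handle] = name;                                // remember the name for later writes
        notify(op, name);
    }
    void onWrite(std::uintptr_t handle) {
        auto it = names_.find(handle);                        // look the name up by handle
        if (it != names_.end()) notify(FileOp::Write, it->second);
    }
private:
    // Stand-in for the encoded message sent to the M-MEM client.
    void notify(FileOp op, const std::string& name) {
        std::cout << "op=" << static_cast<int>(op) << " file=" << name << "\n";
    }
};

int main() {
    FilterState f;
    f.onOpen(0x1234, "\\My Documents\\report.abw", FileOp::Open);
    f.onWrite(0x1234);
}
```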

An issue with the communication is that the driver and client are separate processes

where the client runs with user level privileges. An interprocess communication technique is required that is lightweight and fast. This method should not use an inordinate amount of memory since the mobile device will have limited memory. The solution to this problem was to use memory mapped files in a logical queue (see figure 10) ensuring an upper bound on memory utilization and yet being fast and lightweight

compared to pipes and sockets.



[Figure 10 depicts the filter driver writing into a chain of memory-mapped files (file 1, file 2, file 3) that are read by the M-MEM, with event objects used for synchronization.]


Figure 10. Logical queue of memory mapped files









To complete the port, the UbiData client code and libraries need to be ported to Windows CE .NET 4.0.

Active Event Capture as a Means of UbiData Client-Server Transfer Reduction

The idea behind active event capture is to reduce data transfer between a UbiData client (M-MEM) and the UbiData server (F-MEM) by capturing and logging events generated by an application as it is being used to edit a file. These events are then converted to higher-level descriptions and optimized. The resulting event log would in some cases be smaller than the delta computed by a difference algorithm. The delta can also be computed and compared to the event log to decide which one to ship based on size. Although a promising idea (and one that has been implemented in a desktop environment at Texas A&M University [21]), it was found that the burden of event capture and the resulting context switches (since the application logging events was different from the editor, each time a message was posted to the editor's queue, a context switch to the logging application would be made) was too heavy for a mobile device. However, as mobile devices get progressively more powerful, this may be a promising avenue of research for supporting more resource-rich (and perhaps less mobile) devices such as laptops.

Development of an Xmerge Plug-in to Support OpenOffice Writer

As explained in the introduction, a plugin architecture is the cleanest method for handling application/format specific code. Since this is the philosophy guiding the design of Xmerge [22], the Xmerge plugin management code can be used to implement the UbiData integration architecture proposed earlier. As a result, the OpenOffice [27] to text converter and the text to a basic OpenOffice document format converter have been implemented as an Xmerge plugin.














CHAPTER 6
PERFORMANCE EVALUATION APPROACH

Test Platform

Testing was performed on a machine with an AMD Duron 650 MHz processor, 128 MB of RAM and an IBM-DTLA-307030 hard drive running Mandrake Linux 8.2 (kernel 2.960.76mdk) with gcc version 2.96.

Test Methodology

The architecture of the test engine is as shown in figure 11. The figure also shows the sequence of steps involved in the testing process.


Figure 11. Test tools and procedures









The steps involved in the testing process are as follows:

1. Ten documents were selected as initial test cases. These documents included letters,
application forms, term papers, pages from an E-book and a resume.

2. A test case generator module was used to generate 960 modified documents from
these source documents by traversing the DOM trees and modifying the content nodes. Each test case had 0-50 percent file level changes and 10-40 percent paragraph level changes for four different edit operations, viz. insert, delete, update and move, applied to it to generate its share of 96 modified documents. Paragraph level operations refer to operations performed within a content node, since these nodes represent paragraphs or their components in Abiword. The number of content nodes they are applied to depends on the file change percent. The file level change percent also determines the number of content nodes added, deleted and moved for inserts, deletes and moves respectively. The test case generator also records source file statistics, viz. the source document size (in bytes), node count, text node count, number of document subtrees, number of content nodes in all subtrees, average number of children, average height, average content size (in characters) and the average number of attributes
(shown as a dotted line).

3. The 960 modified documents are converted to text using a text conversion module
which traverses the DOM trees and extracts the content in the TEXT nodes (shown
by the block arrows).

4. The modified text documents then have a basic XML structure imposed upon them
before being used as inputs to Vdiff, the change detection module described in this thesis, along with the original documents and the configuration files. During execution, Vdiff generates statistical data such as the time taken by each phase of
the algorithm, which is recorded (shown by the dotted arrow).

5. The resulting deltas generated by Vdiff are used to patch the original files to
generate the modified documents (shown by the black block arrows).

6. These modified documents are compared to the reference modified documents to
quantify reconstruction accuracy and the results are recorded (shown by the gray arrows). The comparator generates three data points for each comparison, viz. the number of missing leaves, missing non-leaves and the number of missing attributes.
The comparator determines these values by walking the input trees and noting each
discrepancy.

7. All data was generated in comma-separated value format, which can be directly
read by spreadsheet software and some databases. The software packages used to analyze the data were Microsoft Access and Microsoft Excel. Access was used to
run queries to produce data for specific graphs that were then generated in Excel.



































































CHAPTER 7
TEST RESULTS

Correctness of the Change Detection System

The change detection system, as shown in figures 12 and 13, has good accuracy, with a maximum of 14.8 percent total absolute missing nodes or 12 percent missing nodes, and a maximum of 6.6 percent missing leaves and 9.2 percent missing non-leaves.


The absolute missing leaves parameter refers to the difference between the number of leaves in the expected reference document and the number of leaves in the result document. The other missing node counts (termed "computed" in figures 12 and 13) are determined by the test engine comparator (refer to chapter 6). The computed measures are needed to determine whether the nodes are in their expected locations, which cannot be determined from the absolute measure.


Figure 12. Missing Nodes by test cases in LES complexity order (series plotted: source text node count, average absolute missing nodes and average computed missing nodes; x-axis: test cases; y-axis: node count)










The error rate seems to depend more on the LES complexity of the document than the substring complexity (refer to the sections on the run time behavior of the phases of the algorithm in the next section on the performance of the algorithm).

The error in the last document in figure 13 may be explained by the fact that its subtree sizes exceed the bound C on subtree re-ordering (refer to chapter 4).


Figure 13. Missing Nodes by test cases in substring complexity order (series plotted: source text node count, average absolute missing nodes and average computed missing nodes; x-axis: test cases ordered by average subtree size; y-axis: node count)

Performance of the Difference Detection Algorithm

To determine the performance of the difference algorithm and the primary phases on which it depends, we plot the time taken by the various phases of the algorithm for all the test cases in figure 14.

As can be seen from figure 14, the overall time taken (Vdiff2 Compute Delta time in the legend) is dependent almost entirely on only two phases, viz, least cost edit script matching of nodes (Vdiff2 LES Match Nodes) and matching of substrings of the source tree to the modified tree (Vdiff2 Subsequence Match). The substring matching process is clearly the dominant factor in the overall time taken.
















Figure 14. Time taken for the various phases by test case (series plotted: Vdiff2 Compute Delta time, Read Documents, Build Tree time, LES Match Nodes, Subsequence Match, Subsequence Apply, Adjust From V1 (two phases), Apply Default Formatting, Register Subtree, Optimise and Construct Delta; x-axis: sources in average text nodes per subtree order; y-axis: time in ms)









The test cases have been ordered according to the average number of TEXT nodes per subtree since this is the primary factor determining the sub string matching time. The drop in execution time seen for the cases with large subtrees can be attributed to their exceeding the bound C, which determines the limit beyond which no subtree reordering is done (refer to the subtree matching algorithm in chapter 4).

We now analyze the run time for each operation class, viz. insert, delete, update and move. From now on we shall consider the node matching and substring matching phases, since they are the primary factors determining run time.

Insert Operations

From figures 15 and 16, it is clear that the operations done at file level affect the run time more than the operations related to individual nodes.


Figure 15. Time taken by test cases for Inserts with no file level changes (series plotted: Vdiff2 Compute Delta time, LES Match Nodes and sub tree match; y-axis: time in ms)

This can be attributed to the fact that for each inserted node, it is compared with every node in the source subtree whereas in the case of paragraph level inserts, the number of comparisons remains roughly the same (The only difference being that some comparisons that may have been terminated early due to finding a perfect match now











need to be worked through. However, on average the match would be found halfway through a set of nodes so insertion of a new node would affect run time more).


Figure 16. Time taken by test cases for Inserts with 30 percent file level changes (series plotted: Vdiff2 Compute Delta time, LES Match Nodes and sub tree match; y-axis: time in ms)

Figure 17. Time taken by test cases for Inserts with 50 percent file level changes (series plotted: Vdiff2 Compute Delta time, LES Match Nodes and sub tree match; y-axis: time in ms)












Figure 18. Time taken by test cases for Insert with 10 percent paragraph level changes (series plotted: Vdiff2 Compute Delta time, LES Match Nodes and sub tree match; y-axis: time in ms)


The same trend holds for 50 percent inserts at file level as seen in figure 17.


Since we have determined that the operations that affect run time are those at file level, the graphs for the remaining operations will be shown in the format of figure 18.

Delete Operations

For deletes, we see that for the highest percentage of file level changes, the delta computation time goes down as expected.


Figure 19. Time taken by test cases for Delete with 10 percent paragraph level changes (series plotted: Vdiff2 Compute Delta time, LES Match Nodes and sub tree match; y-axis: time in ms)











However, for the initial deletes, the delta computation time actually rises. This is reflected in both the node matching phase and the substring matching phase.

Update Operations

In the case of updates, the percentage of file operations does not affect the run time since the number of nodes in the files does not change (see figure 20).



Figure 20. Time taken by test cases for Updates with 10 percent paragraph level changes (series plotted: Vdiff2 Compute Delta time, LES Match Nodes and sub tree match; y-axis: time in ms)

However, the same may not hold for paragraph level operations and hence we need to plot paragraph level changes for a fixed number of file level changes as shown in figure 21. As can be seen from figure 21, with paragraph level operations up to 30 percent, there is no significant change in run time.












Figure 21. Time taken by test cases for Updates with no file level changes (series plotted: Vdiff2 Compute Delta time, LES Match Nodes and sub tree match; y-axis: time in ms)

Move Operations


As can be seen from figure 22, the time taken for the node matching phase remains relatively constant or increases slightly with increasing percentages of moves.



Figure 22. Time taken by test cases for Moves with 10 percent paragraph level changes (series plotted: Vdiff2 Compute Delta time, LES Match Nodes and sub tree match; y-axis: time in ms)












This would be expected since the only effect that file level moves have is to affect when a perfect match is found. With increasing move percentages, the nodes are displaced further relative to their previous positions and hence a larger number of nodes need to be compared to find matches. The behavior of substring matching should be independent of the percentage of file level moves and hence no correlation should exist which is borne out by the figure. Hence, we need to investigate the effect of paragraph level changes on run time.


Figure 23. Time taken by test cases for Moves with no file level changes (series plotted: Vdiff2 Compute Delta time, LES Match Nodes and sub tree match; y-axis: time in ms)

For increasing paragraph level moves, the run time is not affected. We analyze the run time behavior of the algorithm's phases as a whole in subsequent sections.










Conclusions on Per-Operation Run Time

From the analysis above, we can conclude that the primary factor determining the run time performance of the system is the number of operations at file level. Also, as shown in figure 24, the time complexity is in the order delete, move, update and insert.



Figure 24. Per-operation relative time complexity (average time in ms for the Compute Delta, LES Match Nodes and Subsequence Match phases, broken down by Delete, Insert, Move and Update operations)

Run Time Behavior of Node Matches

To determine whether the run time behavior of the node matching phase is according to expectations, we need to define a complexity measure to reorder the source documents by and check if the average values for the time taken are monotonically increasing/decreasing. Since we know that the Node matching phase depends on the number of non-subtree TEXT nodes in the source document (since subtree TEXT nodes









are processed separately in the subtree match phase), and the number of nodes in the result document, we can define our initial complexity measure as follows:

LES Complexity = (non-subtree TEXT nodes) * (non-subtree TEXT nodes + number of subtrees)

since the number of nodes in the result document is equal to the number of non-subtree TEXT nodes plus the number of subtrees, each of which contributes one node to the result (since the result is a text document, all siblings are clobbered together for lack of a formatting structure). However, we have not considered the time taken by the least cost edit script matching process itself. Since it is O(ND), where N is the total size of the nodes being compared and D is the average edit distance for each pair, we can approximate it as 2 * (average size of a source text node) * D. Since we apply the same percentages of edits to all source document sets, a rough approximation is to represent D as (average size of a source text node) * (some constant). Hence, an approximate complexity measure works out to the following:

LES Complexity = ((non-subtree TEXT nodes)^2 + (non-subtree TEXT nodes) * (number of subtrees)) * (average size of a source text node)^2
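Expressed as code, the measure is simply the following (a sketch; the parameter names are illustrative and the inputs are the per-document statistics recorded by the test engine):

```cpp
#include <cstdint>

// LES complexity measure used to order the source documents, as defined above.
double lesComplexity(std::uint64_t nonSubtreeTextNodes,
                     std::uint64_t subtreeCount,
                     double avgTextNodeSize) {
    double n = static_cast<double>(nonSubtreeTextNodes);
    double s = static_cast<double>(subtreeCount);
    return (n * n + n * s) * avgTextNodeSize * avgTextNodeSize;
}
```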

Ordering the source files according to this complexity measure, as in figure 25, we note that we obtain an increasing trend in terms of average time taken, albeit with fluctuations. However, this should be considered in the light of there being relatively few data points, the documents not being large enough to render constant factors negligible, and the fact that the measure is itself an approximation.













Figure 25. Behavior of the node match phase (average time in ms taken by the node match phase for source documents in LES complexity order)

Run Time Behavior of Substring Matching

To analyze substring matching, graphs were plotted versus source document parameters and expected complexity measures. However, strong correlation was only shown for the ordering of the sources by the average number of children per TEXT subtree. This is due to the fact that the number of children per subtree determines the number of possible orderings for the nodes in that subtree. The fall in run time for the last source document set in figure 26, is due to the fact that most subtrees in the document have a greater number of children than the re-ordering threshold C. This causes a great drop in the time taken since re-ordering is completely bypassed.











Figure 26. Behavior of the sub string match phase (average time in ms taken by the sub string match phase for test documents ordered by the average number of TEXT nodes per subtree)


Relative Sizes of Files produced


From figure 27, we can see that delta sizes roughly increase with increasing sizes of the source documents.



Figure 27. Sizes of the files produced by the algorithm (series plotted: average expected modified document size, average modified text document size and average delta size, in bytes, for source documents of increasing size from 3082 to 19391 bytes)


The deltas are relatively large (still smaller than the sources however) due to their completeness requirement. If version control is not required, these sizes can be reduced further, yielding significant bandwidth savings.














CHAPTER 8
CONCLUSIONS AND POTENTIAL FOR FUTURE WORK

Conclusions

The results obtained in this exploration of a generic algorithm for synchronizing

reduced content documents show that this work can improve upon the state of the art for data access on mobile devices. However, to achieve the goal of ubiquitous data access, the algorithm needs to be tested with other document editing software and extended to handle other classes of applications. This would have to be done in conjunction with the integration of the system into a framework for disconnected operation support, e.g. the Coda File System or UbiData.

Potential for Future Work

Determination of Best Performing Values for Algorithm Parameters

The change detection algorithm explored in this thesis depends on two primary parameters: the threshold value for least cost edit script matching and the threshold for substring re-ordering before substring matching. Extensive testing is needed to determine the best performing values for the algorithm, since a fine balance needs to be struck. For the former, if the match threshold is set too low, heavier modifications will be missed; however, if set too high, the possibility of false matches will increase. In the case of the re-ordering threshold, a higher value will result in a significant performance impact, but a value that is too low may be completely ineffective.









Exploration of Techniques to Prune the Search for the Best Subtree Permutation

As can be seen from the test results, searching all permutations of subtrees is computationally expensive. Currently a naive heuristic has been used to prune the search. A better technique may deliver the same level of correctness with lower performance overhead.

Post Processing to Ensure Validity of Reintegrated Document

As stated in chapter 3, post processing is necessary to ensure application consistency requirements are met. This is primarily an issue only for very extensive deletes and rarely occurs in practice. However, it should be handled for a better user experience.

Further Testing and Generalization of the System

The algorithm needs to be tested with a wider range of editors. So far, all tests have been done with Abiword. Testing with other editors would require the appropriate converters to be written, appropriate configuration files generated, and integration of Xmerge with the test setup (to allow converters to be written as loadable plugins). As a result of such testing, it may be determined that the algorithm does not handle certain conditions arising from differing document structure. In this case, the code would need to be extended to handle all such conditions.

Implementation of the UbiData Integration Architecture

To validate the usefulness of this work, it is necessary to integrate it with the UbiData system. An overview of this has been described in chapter 2.

Extension to Other Application Domains

The current design and implementation has focused on the document-editing scenario. This caters to the primary usage of most mobile devices (editing documents and








e-mail). However, validating the ideas on a generalized content reduction and reintegration system requires extension of this approach to handle other application classes such as spreadsheets, calendars etc.















LIST OF REFERENCES


[1] Michael J. Franklin, Challenges in Ubiquitous Data Management. Informatics
10 Years Back. 10 Years Ahead. Lecture Notes in Computer Science, Vol. 2000,
Springer-Verlag, Berlin 2001: 24-33.

[2] Richard Han, Pravin Bhagwat, Richard LaMaire, Todd Mummert, Veronique
Perret, and Jim Rubas, Dynamic Adaptation in an Image Transcoding Proxy for
Mobile Web Browsing. IEEE Personal Communications, Vol. 5, Number 6,
December 1998: 8-17.

[3] Harini Bharadvaj, Anupam Joshi and Sansanee Auephanwiriyakul, An Active
Transcoding Proxy to Support Mobile Web Access. Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems, West Lafayette, IN, IEEE Computer
Society Press, October 1998: 118-126.

[4] Mika Liljeberg, Heikki Helin, Markku Kojo, and Kimmo Raatikainen, Mowgli
WWW Software: Improved Usability of WWW in Mobile WAN Environments.
Proceedings of the IEEE Global Internet 1996 Conference, London, England,
IEEE Communications Society Press, November 1996: 33-37.

[5] Armando Fox and Eric A. Brewer, Reducing WWW Latency and Bandwidth
Requirements by Real-Time Distillation. Computer Networks and ISDN Systems,
Vol. 28, Issues 7-11, Elsevier Science, May 1996: 1445-1456.

[6] M. Satyanarayanan, Coda: A Highly Available File System for a Distributed
Workstation Environment. IEEE Transactions on Computers, Vol. 39, Number 4,
1990: 447-459.

[7] M. Satyanarayanan, Mobile Information Access. IEEE Personal Communications,
Vol. 3, Number 1, February 1996: 26-33.

[8] World Wide Web Consortium. Extensible Markup Language (XML) 1.0 (Second
Edition), W3C Recommendation, 6 October 2000, available at
http://www.w3.org/TR/2000/REC-xml-20001006. August 2002.

[9] J. Zhang, Mobile Data Service: Architecture, Design, and Implementation.
Doctoral dissertation, University of Florida, 2002.









[10] Eyal de Lara, Dan S. Wallach and Willy Zwaenepoel, Position Summary:
Architectures for Adaptations Systems. Eighth IEEE Workshop on Hot Topics in Operating Systems (HotOS-VIII). Schloss Elmau, Germany, May 2001, available at http://www.cs.rice.edu/-delara/papers/hotos2001/hotos2001 .pdf. August 2002.

[11] Eyal de Lara, Rajnish Kumar, Dan S. Wallach and Willy Zwaenepoel,
Collaboration and Document Editing on Bandwidth-Limited Devices. Presented at the Workshop on Application Models and Programming Tools for Ubiquitous
Computing (UbiTools'01). Atlanta, Georgia. September, 2001, available at http://www.cs.rice.edu/-delara/papers/ubitools/ubitools.pdf August 2002.

[12] Eyal de Lara, Dan S. Wallach and Willy Zwaenepoel, Puppeteer: Component-based Adaptation for Mobile Computing. Presented at the Third USENIX
Symposium on Internet Technologies and Systems, San Francisco, California,
March 2001, available at
http://www.cs.rice.edu/~delara/papers/usits2001/usits2001.pdf. August 2002.

[13] G. Cobena, S. Abiteboul, A. Marian, Detecting Changes in XML Documents.
Presented at the International Conference on Data Engineering, San Jose,
California, 26 February-1 March 2002, available at http://www-rocq.inria.fr/~cobena/cdrom/www/xydiff/eng.htm. August 2002.

[14] Amelie Marian, Serge Abiteboul, Laurent Mignet, Change-centric Management of
Versions in an XML Warehouse. The VLDB Journal, 2001: 581-590.

[15] A. Helal, J. Hammer, A. Khushraj, and J. Zhang, A Three-tier Architecture for
Ubiquitous Data Access. Presented at the First ACS/IEEE International
Conference on Computer Systems and Applications. Beirut, Lebanon. 2001,
available at http://www.harris.cise.ufl.edu/publications/3tier.pdf August 2002.

[16] Michael J. Lanham, Change Detection in XML Documents of Differing Levels of
Structural Verbosity in Support of Ubiquitous Data Access. Master's Thesis,
University of Florida, 2002.

[17] M. Satyanarayanan, Brian Noble, Puneet Kumar, Morgan Price, Application-Aware Adaptation for Mobile Computing. Proceedings of the 6th ACM SIGOPS
European Workshop, ACM Press, New York, NY, September 1994: 1-4.

[18] B. Noble, M. Satyanarayanan, D. Narayanan, J. Tilton, J. Flinn and K. Walker,
Agile Application-Aware Adaptation for Mobility. Operating Systems Review
(ACM), Vol. 31, Number 5, December 1997: 276-287.

[19] Brian Noble, System Support for Mobile, Adaptive Applications. IEEE Personal
Communications Magazine, Vol. 7, Number 1, February 2000: 44-49.









[20] David Andersen, Deepak Bansal, Dorothy Curtis, Srinivasan Seshan, Hari
Balakrishnan, System Support for Bandwidth Management and Content Adaptation in Internet Applications. Proceedings of the Fourth USENIX
Symposium on Operating Systems Design and Implementation, USENIX Assoc.,
Berkeley, CA, October 2000: 213-226.

[21] D. Li, Sharing Single User Editors by Intelligent Collaboration Transparency.
Proceedings of the Third Annual Collaborative Editing Workshop, ACM Group.
Boulder, Colorado. 2002, available at
http://www.research.umbc.edu/-j campbel/Group01/Li iwces3.pdf August 2002.

[22] S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom, Change
Detection in Hierarchically Structured Information. Proceedings of the ACM
SIGMOD International Conference on Management of Data, ACM Press, New
York, NY, 1996: 493-504.

[23] OpenOffice.org. Xmerge Project. http://xml.openoffice.org/xmerge/. August
2002.

[24] E. Myers, An O(ND) Difference Algorithm and its Variations. Algorithmica, Vol.
1, 1986: 251-266.

[25] SourceGear Corporation. Abiword: Word Processing for Everyone. Software
available at http://www.abisource.com/. August 2002.

[26] Microsoft Corporation. Microsoft Windows CE .NET Storage Manager
Architecture. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/wcemain4/htm/cmconstoragemanagerarchitecture.asp. August 2002.

[27] OpenOffice.org. OpenOffice.org Source Project. http://www.openoffice.org/.
August 2002.
















BIOGRAPHICAL SKETCH

Ajay Kang was born in New Delhi, India. He received his bachelor's degree in computer engineering from Mumbai University, India. He subsequently worked for a year at Tata Infotech, a software and systems integration company based in India, as an associate systems engineer. During this period he worked on a product for financial institutions and programmed in C++ using the Microsoft Foundation Class (MFC) libraries and the Win32 SDK.

During his master's study he was a TA for CIS3020: Introduction to Computer Science and had his first teaching experience in the process.

His interests include mobile computing and algorithms as a result of two excellent classes he took on these subjects at the University of Florida.

On graduation he will join Microsoft Corp. as a Software Design Engineer in Test (SDET).




sobekcm:Name University of Florida
sobekcm:PlaceTerm Gainesville, Fla.
sobekcm:Source
sobekcm:statement UF University of Florida
sobekcm:SortDate 730850
METS:amdSec
METS:digiprovMD DIGIPROV1
DAITSS Archiving Information
daitss:daitss
daitss:AGREEMENT_INFO ACCOUNT PROJECT UFDC
METS:rightsMD RIGHTS1
RIGHTSMD Rights
rightsmd:accessCode public
rightsmd:embargoEnd 2005-12-27



PAGE 1

ALGORITHMS FOR REDUCED CONTENT DOCUMENT SYNCHRONIZATION By AJAY I. S. KANG A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE UNIVERSITY OF FLORIDA 2002

PAGE 2

Copyright 2002 by Ajay I. S. Kang

PAGE 3

I dedicate this work to my parents, Harjeet Singh and Kulwant Kaur Kang. Without their constant encouragement and support over the years, I would not be here documenting this thesis.

PAGE 4

ACKNOWLEDGMENTS I cannot adequately express the gratitude and appreciation I owe to my family, friends and faculty advisors in this space. My family has always been supportive of all my endeavors and prodded me to push farther and achieve all that I am capable of. However, they have never in that process made me feel that I let them down when I failed to achieve their expectations. I know that I will always have their unconditional love and support as I go through life. I have many friends I would like to thank, especially the fourth floor crowd of the CISE department where I spent most of my time working on my research. Hai Nguyen, Andrew Lomonosov and Pompi Diplan would constantly stop by my office providing welcome interruptions from the monotony of work. I am indebted to Mike Lanham for his encouragement to pursue this area of research. A special mention must also be made of Starbucks Coffee, to which the aforementioned are addicted; they have consumed many gallons of the overpriced beverage in their time here in (according to them) an effort to bolster the national economy. I thank my advisors, Dr. Abdelsalam Sumi Helal and Dr. Joachim Hammer, whose patience, encouragement, insight and guidance were invaluable in the successful completion of this work. Finally, I would like to thank the National Science Foundation for funding the UbiData project, which resulted in my receiving a research assistantship this summer. iv

PAGE 5

TABLE OF CONTENTS page ACKNOWLEDGMENTS.................................................................................................iv LIST OF FIGURES.........................................................................................................viii ABSTRACT.........................................................................................................................x CHAPTER 1 INTRODUCTION...........................................................................................................1 Use Cases........................................................................................................................3 Summary of the Advantages of Change Detection in Reduced Content Information....4 Approach Used to Investigate the Design and Implementation of Reduced Content Change Detection and Reintegration.......................................................................5 2 RELATED RESEARCH.................................................................................................6 The Ubiquitous Data Access Project at the University of Florida..................................6 System Goals............................................................................................................6 System Architecture.................................................................................................6 Mobile environment managers (M-MEMs).......................................................7 Mobile data service server (MDSS)...................................................................8 The Role of Reduced Content Change Detection in this Architecture....................9 Previous Work on Change Detection for UbiData.......................................................11 Research on Disconnected Operation Transparency: the CODA File System.............11 Research on Bandwidth Adaptation..............................................................................13 Odyssey..................................................................................................................13 Puppeteer................................................................................................................14 Transcoding Proxies: Pythia, MOWSER and Mowgli..........................................15 Congestion Manager..............................................................................................15 Intelligent Collaboration Transparency..................................................................15 Research on Difference Algorithms..............................................................................16 LaDiff.....................................................................................................................16 XyDiff....................................................................................................................16 Xmerge...................................................................................................................18 An O(ND) Difference Algorithm...........................................................................18 v

PAGE 6

3 PROBLEMS INVOLVED IN RECOVERABLE CONTENT ADAPTATION..........21 Developing a Generic Difference Detection and Synchronization Algorithm for Reduced Content Information................................................................................21 Example Document Content Reduction Scenario.........................................................21 Issues in Designing a Function to Construct a Modified Document Structure Isomorphic to the Source Document Structure......................................................25 4 DESIGN AND IMPLEMENTATION OF RECOVERABLE CONTENT ADAPTATION.............................................................................................................27 Solution Outline for Delta Construction.......................................................................27 Representation of Shared Content................................................................................27 Matching of Nodes in the Source and Modified Documents........................................28 Construction of the List of Content Node Subtrees...............................................28 Determining the Node Matches between the v0 and v1 (-) Trees............................29 Determining the Node Matches between the v0 Subtree List and v1 (-) Tree List.29 Adjusting the Structure of the Modified Document to make it Isomorphic to the Source Document...............................................................................................................32 Applying the Substring Matches............................................................................32 Adjusting the Result Document Structure for Unshared Ancestors.......................33 Adjusting the Result Document Structure for Unshared Nodes............................34 Applying Default Formatting for Inserts...............................................................34 Post Processing......................................................................................................35 Computing signatures and weights for the two documents.............................35 Applying the node matches to the XyDiff structures.......................................36 Optimizing node matches................................................................................36 Constructing the Delta Script........................................................................................36 Mark the Old Tree..................................................................................................36 Mark the New Tree................................................................................................37 Save the XID Map..................................................................................................37 Compute Weak Moves...........................................................................................37 Detect Updates.......................................................................................................37 Construct the Delete Script....................................................................................37 Construct the Insert Script......................................................................................38 Save the Delta Document.......................................................................................38 Clean Up................................................................................................................38 5 RELATED EXTENSIONS TO 
UBIDATA..................................................................39 Development of a File Filter Driver for Windows CE to Support a UbiData Client....39 Active Event Capture as a Means of UbiData Client-Server Transfer Reduction........41 Development of an Xmerge Plug-in to Support OpenOffice Writer............................41 vi

PAGE 7

6 PERFORMANCE EVALUATION APPROACH........................................................42 Test Platform.................................................................................................................42 Test Methodology.........................................................................................................42 7 TEST RESULTS...........................................................................................................44 Correctness of the Change Detection System...............................................................44 Performance of the Difference Detection Algorithm....................................................45 Insert Operations....................................................................................................47 Delete Operations...................................................................................................49 Update Operations..................................................................................................50 Move Operations....................................................................................................51 Conclusions on Per-Operation Run Time..............................................................53 Run Time Behavior of Node Matches...................................................................53 Run Time Behavior of Substring Matching...........................................................55 Relative Sizes of Files produced...................................................................................56 8 CONCLUSIONS AND POTENTIAL FOR FUTURE WORK....................................57 Conclusions...................................................................................................................57 Potential for Future Work.............................................................................................57 Determination of Best Performing Values for Algorithm Parameters...................57 Exploration of Techniques to Prune the Search for the Best Subtree Permutation58 Post Processing to Ensure Validity of Reintegrated Document.............................58 Further Testing and Generalization of the System.................................................58 Implementation of the UbiData Integration Architecture......................................58 Extension to Other Application Domains..............................................................58 LIST OF REFERENCES...................................................................................................60 BIOGRAPHICAL SKETCH.............................................................................................63 vii

PAGE 8

LIST OF FIGURES Figure page 1 Architecture of the UbiData system...........................................................................7 2 Proposed integration of change detection and synchronization with UbiData........10 3 Venus states..............................................................................................................12 4 Sample Edit graph; Path trace = (3,1), (4,3), (5,4), (7,5); Common subsequence = CABA; Edit script = 1D, 2D, 2IB, 6D, 7IC...........................................................19 5 Example Abiword document....................................................................................22 6 XML structured imposed on modified text document.............................................23 7 DOM tree for the original document, v0..................................................................24 8 Example of approximate substring matching...........................................................30 9 Windows CE .NET 4.0 Storage architecture............................................................39 10 Logical queue of memory mapped files...................................................................40 11 Test tools and procedures.........................................................................................42 12 Missing Nodes by test cases in LES complexity order............................................44 13 Missing Nodes by test cases in substring complexity order.....................................45 14 Time taken for the various phases by test case........................................................46 15 Time taken by test cases for Inserts with no file level changes...............................47 16 Time taken by test cases for Inserts with 30 percent file level changes...................48 17 Time taken by test cases for Inserts with 50 percent file level changes...................48 18 Time taken by test cases for Insert with 10 percent paragraph level changes..........49 19 Time taken by test cases for Delete with 10 percent paragraph level changes........49 viii

PAGE 9

20 Time taken by test cases for Updates with 10 percent paragraph level changes.....50 21 Time taken by test cases for Updates with no file level changes.............................51 22 Time taken by test cases for Moves with 10 percent paragraph level changes........51 23 Time taken by test cases for Moves with no file level changes...............................52 24 Per-operation relative time complexity....................................................................53 25 Behavior of the node match phase...........................................................................55 26 Behavior of substring match phase..........................................................................56 27 Sizes of the files produced by the algorithm............................................................56 ix

PAGE 10

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science ALGORITHMS FOR REDUCED CONTENT DOCUMENT SYNCHRONIZATION By Ajay I. S. Kang December 2002 Chair: Abdelsalam Sumi Helal Major Department: Computer and Information Science and Engineering Existing research has focused on alleviating the problems associated with wireless connectivity, viz. lower bandwidth and higher error rates, by the use of proxies which transcode data to be sent to the mobile device depending on the state of the connection between the proxy and the mobile device. However, there is a need for a system to allow automatic reintegration of changes made to such reduced content information into the original content rich information. This also allows users the flexibility of using virtually any device to edit information by converting the information to a lesser content rich format suitable for editing on that device and then automatically reintegrating changes into the original. Hence, these algorithms allow the development of an application transparent solution to provide ubiquitous data access and enable reducing download time and cache usage on the mobile device for editable documents. They can also be applied to x

PAGE 11

progressive synchronization of a full fidelity document with the master document on the server. To develop flexible, generic synchronization algorithms, we need an application-independent specification of the hierarchical content of data, a generic method of specifying what information is shared between reduced content documents and the originals and a change detection algorithm. This algorithm should produce a script that can be applied to the original to bring it up to date and also allow for version control. Existing difference detection algorithms do not operate on reduced content documents and the approaches used need to be adapted to this domain. XML has been chosen to represent document content since it allows encoding of any hierarchical information and most applications are now moving towards native XML file formats. The change detection algorithm makes extensive use of the O(ND) difference algorithm developed by Eugene W. Myers for matching shared content and approximate substring matching. Extensive performance analysis has been performed on the system by constructing a test engine to automatically generate modifications of varying degree on source documents and thus produce a set of modified reference documents. These reference documents are converted to text (hence stripping out all formatting content), and the algorithm is run to generate delta scripts that are applied to the original documents to generate the new versions. Comparison of the new versions to the references allows us to quantify error. xi

PAGE 12

CHAPTER 1 INTRODUCTION With the emergence of smaller and ever more powerful computing and communication devices providing a wide range of connectivity options, computers are playing an ever-increasing role in people's lives. Portable computers such as laptops, tablet PCs, handheld devices based on the Palm, Pocket PC and embedded Linux platforms and, to an extent, smart phones, coupled with the proliferation of wireless connectivity options, are bringing the vision of ubiquitous computing into reality. However, the potential provided by this existing hardware has as yet largely gone untapped due to the lack of a software infrastructure to allow users to make effective use of it. Users instead have to deal with the added complexity of manually synchronizing their data across all the devices that they use and handle the issues of file format conversion and integration of their changes made to the same document in different formats. These differences exist due to the constraints on mobile devices of being lightweight, small and having a good battery life, which restricts the memory, processing power and display size available on such devices [1]. Hence, there will always be devices for which applications will not be available since the device's hardware or display capabilities fall short. These devices cannot be used as terminals or thin clients (which would alleviate the problem to an extent) since lower bandwidth, higher error rates and frequent disconnections characterize wireless connectivity. Most research in the mobile computing area has thus been focused on the minimization of the impact of this form of connectivity on the user experience. The most 1

PAGE 13

2 common approach is to use a proxy server to transcode the data [2-5] to be sent to the mobile device based on current connectivity conditions, thus ensuring that latency times in data reception as perceived by the user stay within certain limits at the cost of displaying data with reduced content. If the user were to edit such content, the burden of reintegrating the changes with the original copy on the server would fall on him/her. Hence, there is a definite need for a system to perform this change detection and re-integration automatically without user intervention. Solving this problem would also address the first problem outlined, i.e., the unavailability of certain applications on a mobile device: a smart proxy could deduce that the device cannot support such content and reduce its complexity so that it becomes convertible to, or directly compatible with, applications on the mobile device. Any changes made would subsequently be re-integrated using the change detection system. In conjunction with existing research on systems that automatically hoard the files a user uses on the various devices he owns for disconnected operation support, such as the Coda file system developed at Carnegie Mellon University [6-7], we now have the ability to provide the user access to his data at any time, anywhere and on almost any device. To allow the development of a change detection system that works on hierarchical data in any format, we need an intermediate data representation that can express hierarchical data regardless of its original format. The intermediate form serves as the input and output format of the change detection system and is, by definition, easily convertible to the required application formats. The Extensible Markup Language (XML) [8] was chosen as the intermediate data representation since XML is an open standard format designed for this purpose and has a wide variety of parsing tools available for it.
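To make the intermediate representation concrete, the sketch below shows how a short word-processor document might be encoded as hierarchical XML before content reduction. The element and attribute names (doc, section, para, props, image) are illustrative assumptions rather than the schema of any particular application; a reduced-content version for a limited device might keep only the para text and drop the formatting attributes and the image element.

<doc title="Project Proposal">
  <section name="Overview">
    <para props="font:Times;size:12;style:bold">UbiData provides any time, anywhere access to data.</para>
    <image src="architecture.png" width="400" height="300"/>
  </section>
  <section name="Schedule">
    <para props="font:Times;size:12">Phase one completes in December.</para>
  </section>
</doc>

A converter plugin for each application format, together with a change detection algorithm that operates on this intermediate form, is what makes the application-transparent behavior described above possible.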

PAGE 14

3 This work is being explored as a component to be used in the ubiquitous data access project [9] under development at the Computer and Information Science and Engineering Department of the University of Florida. The goals of the UbiData system are to provide any time, anywhere access to a user's data irrespective of the device being used by him/her. To get a better feel for the potential of such a system, we explore two use cases in the following paragraphs. Use Cases Use case 1: overcoming device/application limitations. Joe is working on a document that he intends to show his advisor at a meeting across campus but he is not quite done yet. He powers up his Palm PDA that connects to the local area network through a wireless interface card. The UbiData system recognizes this as one of Joe's devices and fetches the device's capability list from the UbiData server. The system then proceeds to hoard his data onto the device, including the Word document that he was just working on. However, a comparison with the capability list shows that the Palm does not support editing of MS Word documents. UbiData uses a converter plugin to convert the MS Word document to text/Aportisdoc format that is editable on the Palm and then transfers it to the Palm as he walks out of the door. By the time he reaches the campus lawn where wireless connectivity is unavailable, the document is sitting on his Palm. He makes edits to the document as he walks across campus. When he reaches his destination, he requests UbiData to synchronize his updates and the text file is compared to the original document and the changes are imported. By the time he reaches his advisor's door, the document is up to date.

PAGE 15

4 Use case 2: overcoming bandwidth limitations. Jane has to go on a business trip to Paris and has a long flight ahead of her. She wants to use this time to edit some project proposals and cost estimates. She requests UbiData to hoard the files she needs to her laptop. At the airport she realizes that she needs a file that she has not accessed recently and that has hence not been hoarded to her laptop. She uses her cell phone to connect to and authenticate with the corporate network. She requests the file download and the smart proxy on her corporate network transcodes the file data by removing the images from the file for fast transfer over the weak connection. She edits the file on the plane and uses the business lounge at her hotel in Paris to upload the changes. The UbiData system recognizes that this is a reduced content document (since it was notified by the proxy that Jane's copy was content reduced) and proceeds to detect and integrate the changes with the original. After her meeting is over, she decides to walk around town and, sitting at a café, has an idea to incorporate into a proposal. She takes out her PDA and uses her cell phone to connect to and authenticate with her corporate network. Since she is now making an international call, she requests a fast download and the proxy converts the document down to compressed text for minimum transfer size. When she returns to the hotel, she reintegrates her changes with the content-rich document on the server. Summary of the Advantages of Change Detection in Reduced Content Information A change detection system as outlined above would allow the development of an application transparent system for ubiquitous data access with no need for any change detection support software at the client end, in contrast to other solutions such as the CoFi architecture used in the Puppeteer system developed at Rice University [10-12]. The change detection system could be used to hoard reduced content documents on a mobile device instead of full fidelity documents. This would speed up download times and reduce cache space utilization, which could also allow hoarding of a larger working set. Transcoding proxies could also transcode editable content since it would always be possible to reintegrate changes into the original document.

PAGE 16

5 On the other hand, if the mobile device has a full fidelity version of a document that has been changed, and a high bandwidth connection is unavailable, the document could be distilled by removing content such as images. The changes could then be reintegrated with the document on the server, allowing other users to see updates earlier. Once a high bandwidth connection is available, the entire document could then be uploaded for incorporation of changes to content that was stripped out. Hence, progressive updates of documents are made possible. Approach Used to Investigate the Design and Implementation of Reduced Content Change Detection and Reintegration Since such a system is quite complex, it is easier to focus on a restricted case and then apply the approaches developed to the general problem. However, this restricted case should be chosen carefully, since its solution may otherwise not be applicable to the overall problem. Hence, the extreme case in document editing was chosen, viz. conversion of a content-rich document to text followed by detection and reintegration of the changes. To provide flexibility, a diff/patch solution was chosen, which also allows only the differences to be sent to another client requesting the same file in the UbiData network. This would result in significant bandwidth savings. To enable version control, a complete diff format was chosen, which retains enough information to revert to a previous version of the file. To ease development, an existing XML diff/patch tool was modified. The tool is part of the XyDiff program suite developed by the VERSO team, from INRIA, Rocquencourt, France, for their Xyleme project [13-14]. The following chapters describe the UbiData system, related work and the algorithms developed to solve the problem of change detection in reduced content XML documents, as well as the test results obtained.
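As a rough illustration of the kind of output such a diff/patch approach produces, the fragment below sketches a delta script recording an update, a delete and an insert against numbered XML nodes. The element names and node addressing shown here are purely illustrative and are not the exact schema used by XyDiff.

<delta from="v0" to="v1">
  <update node="17">
    <old>Phase one completes in December.</old>
    <new>Phase one completes in January.</new>
  </update>
  <delete node="23">
    <para>This paragraph was removed from the document.</para>
  </delete>
  <insert parent="9" position="2">
    <para>New requirements were added after the design review.</para>
  </insert>
</delta>

Recording the old content of updated and deleted nodes is what makes the delta complete in the sense described above: the same script can be inverted to roll the document back to its previous version, which is the basis for version control.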

PAGE 17

CHAPTER 2 RELATED RESEARCH The Ubiquitous Data Access Project at the University of Florida System Goals 1. Anytime, anywhere access to data, regardless of whether the user is connected, weakly connected via a high latency, low bandwidth network, or completely disconnected; 2. Device-independent access to data, where the user is allowed to use and switch among different portables and PDAs, even while mobile; 3. Support for mobile access to heterogeneous data sources such as files belonging to different file systems and/or resource managers. System Architecture The architecture for the UbiData system [15] divides the data it manages into three tiers as follows: 1. The data sources, which could consist of file systems, database servers, workflow engines, e-mail servers, web servers or other sources. 2. Data and meta-data collected for each user of the system. The data is the user's working set, consisting of the files used by the user. Meta-data collected on a per-user basis includes the configuration and capabilities of all the devices used by the user. 3. The mobile caches that contain a subset of the user's working set to allow disconnected operation on the various devices owned by the user. The middle tier (storing the data and meta-data) acts as a mobility-aware proxy, filling the clients' caches when connected and updating the user's working set information. On reconnection after disconnection, it handles reintegration of updates made to documents on the mobile cache. 6


Figure 1. Architecture of the UbiData system

The three tiers are easily identifiable in figure 1; the clients on the left contain the mobile caches on their local file systems. The clients also act as data sources since they can publish data to the system. The other data sources (labeled external data sources) are to the right of the figure. The lower middle part of the diagram depicts the data and meta-data collected for each user. The system modules that implement this three tier data architecture [9] are detailed below:

Mobile environment managers (M-MEMs)

The functions of the M-MEMs are as follows:

1. To observe the user's file access patterns, discard uninteresting events (e.g. operating system file accesses) and filter the remaining events to determine the user's working set. These are the files that the user needs on the devices he uses to continue working in spite of disconnections.

2. Manage files hoarded to the device and track events such as updates to the file so it can be marked as a candidate for synchronization on reconnection. Also, on reconnection, if the master copy in the middle tier is updated, the M-MEM discards the stale local copy and fetches the updated master copy from the middle tier server (MDSS).


8 3. Bandwidth conservation by always shipping differences between the old version and the updated version of a file instead of the new file whenever possible. Other activities such as queue optimization of the messages to be sent to the middle tier server also help in this regard. Mobile data service server (MDSS) The MDSS consists of two components: Fixed environment manager (F-MEM) The F-MEM server side component is a stateless data and meta-data information service to mobile devices. The basic functionality of the F-MEM is to accept the registration of events, detect events, and to trigger the proper event-condition-action (ECA) rules. The main tasks of the F-MEM are to manage the working set and the meta-data in conjunction with the M-MEM component. As a result, the F-MEM needs to perform additional services such as authentication of users and implement a communication protocol between the M-MEM and itself. This protocol should be very resilient to poor connectivity and frequent disconnections, which implies the use of a message queuing system. Data and meta-data warehouse The meta-data warehouse is an XML database that stores information related to each user such as the uniform resource identifiers (URIs) for all files in his/her working set and device types and capabilities for all devices used by the user. The data (working set) is stored on the local file system of the MDSS server and accessed through URIs as mentioned before.
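To make the shape of this per-user meta-data concrete, the following is a purely illustrative sketch; the actual warehouse is an XML database, and the field names, the URI scheme and the capability fields shown here are assumptions, not its real schema.

    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical per-device capability record kept in the meta-data warehouse.
    struct DeviceProfile {
        std::string deviceId;
        std::string deviceType;                       // e.g. "laptop", "PDA"
        long cacheBytes;                              // cache space available for hoarded files
        std::vector<std::string> supportedFormats;    // formats the device can edit
    };

    // Hypothetical per-user entry: the working set is a list of URIs into the MDSS
    // file store, plus the profiles of every device the user has registered.
    struct UserMetaData {
        std::string userName;
        std::vector<std::string> workingSetUris;
        std::vector<DeviceProfile> devices;
    };

    int main() {
        UserMetaData jane{"jane",
                          {"mdss://store/proposals/q3.abw"},
                          {{"pda-01", "PDA", 16L * 1024 * 1024, {"text/plain"}}}};
        std::cout << jane.userName << " has " << jane.workingSetUris.size()
                  << " file(s) in her working set\n";
    }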


The Role of Reduced Content Change Detection in this Architecture

The goal of providing device-independent access to data would be unattainable without the presence of such a system. This is due to the fact that mobile devices vary widely in terms of hardware capabilities, the operating systems they run and the applications they support. Hoarding files without considering the target environment would result in the user having a number of unusable files, since there is no application available on his/her device to handle them. Hence the F-MEM needs to access the device profile and suitably transcode the user's working set before shipping it to the client device. If the transcoded content is editable, changes made to this content must be reintegrated into the original copy on the server. Simply replacing the old file by a newer copy, as done in existing systems of this type, e.g. the Coda file system developed at Carnegie Mellon University, would result in information loss, since transcoding is inherently lossy. Now that it is established that this is a critical component of UbiData, the question that remains is how this system will be integrated into the overall UbiData architecture. Figure 2 shows how the system could be integrated into the UbiData architecture. The sequence of actions for three events has been shown and they are described below:

1. Mobile host publishing a document. The actions for this event are shown with solid black arrows. The mobile host transfers the document to the F-MEM, which on decoding the message recognizes it as a publish request. The F-MEM determines if the document is already in XML format. If not, the appropriate conversion plugin is loaded and the converted document is stored in the repository as part of the user's working set.

2. Mobile host importing a document. The actions for this event are shown with solid gray arrows. When an import request is received, the required document format is determined. The connection state is also sampled from the connection monitor. If the connection is weak, an appropriate transcoding scheme is selected. Once the type of conversion is determined, the appropriate plugin is loaded and the conversion performed by loading the latest version of the document from the


Figure 2. Proposed integration of change detection and synchronization with UbiData


11 repository and running the plugin to convert it. An entry is made in the meta-data database to mark the client as having a reduced content version of the document. Finally, the document is packaged for transport and shipped to the mobile client. 3. Synchronization of the possibly reduced content document. The actions for this event are shown with dashed arrows. Once the message is decoded and recognized as a synchronization request, the meta-data content is checked to determine if this was a reduced content document. If so, the appropriate XML conversion plugin is loaded if necessary, and the document is made available to the vdiff change detection module. The change detection module also uses the application specific shared tag information and the original document from the repository. This information is used to generate a delta script to update the document that is stored in the repository for versioning purposes. The XyApply module is then used to apply the script to the original document in the repository to generate the new version. This new version is stored in place of the original if it was not the first version of the document. Otherwise, the latest version is stored separately. Hence, the repository maintains the original version of the document and the latest version. The stored deltas allow any version in between to be generated from the original document. The latest version is kept for efficiency. Previous Work on Change Detection for UbiData Refer to [16] for previous work on change detection and reintegration. This work approaches the same problem in a slightly different way and also extends the previous effort. The previous system was unable to handle deeper and more complex document structures. Updates were often missed and treated as inserts. This has been addressed in this work and the approach to testing the system has been improved making evaluating document correctness more objective. Research on Disconnected Operation Transparency: the CODA File System Coda is a location transparent, distributed file system developed at Carnegie Mellon University, which uses server replication and disconnected operation to provide resiliency to server and network failures. Server replication involves storing copies of a file at multiple servers. The other mechanism, disconnected operation, is a mode of execution in which a caching site temporarily assumes the role of a replication site. Coda uses an application transparent approach to disconnected operation. Each client runs a


process called Venus that handles remote file system requests using the local hard disk as a file cache. Once a file or directory is cached, the client receives a callback from the server. The callback is an agreement between the client and the server that the server will notify the client before any other client modifies the cached information. This allows cache consistency to be maintained. Coda uses optimistic replication where write conflicts can occur if different clients modify the same cached file. However, these operations are deemed to be rare and hence acceptable. Cache management by Venus is done according to the state diagram in Figure 3.

Figure 3. Venus states (Hoarding, Emulating and Write-disconnected)

Venus is normally in the hoarding state but on the alert for possible disconnection. During the hoarding state, Venus caches all objects required during disconnected operation. While disconnected, Venus services all file system requests from its local cache. All changes are written to a Change-Modify-Log (CML) implemented on top of a transactional facility called recoverable virtual memory (RVM). Upon reintegration, the system ensures detection of conflicting updates and provides mechanisms to help users recover from such situations. Coda resolves conflicts at the file level. However, if an application specific resolver (ASR) is available, it is invoked to rectify the problem. An ASR is a program that has the application specific knowledge to detect genuine


inconsistencies from reconcilable differences. In a weakly connected state, Venus propagates updates to the servers in the background. The Coda file system has the drawback of requiring its own file system to be present on the clients and servers, which UbiData does not have. UbiData uses the native file system of its host. Hence, bandwidth adaptation and reduced content synchronization are being incorporated into UbiData, due to its greater flexibility in potentially supporting a much wider variety of devices on which such a system can be tested.

Research on Bandwidth Adaptation

Odyssey

Odyssey [17-19] uses the application aware approach to bandwidth adaptation. The Odyssey architecture is a generalization of the Coda architecture. Data is stored in a shared, hierarchical name space made up of typed volumes called Tomes. There are three Odyssey components at each client as follows:

1. The Odyssey API: The Odyssey API supports two interface classes, viz. the resource negotiation and change notification class and the dynamic sets class. The former allows an application to negotiate and register a window of tolerance with Odyssey for a particular resource. When the availability of the resource falls out of those bounds, Odyssey notifies the application. The application is then free to adapt itself accordingly and possibly renegotiate a window of tolerance.

2. The Viceroy and the Wardens: These are a generalization of Coda's cache manager, Venus. The Viceroy is responsible for monitoring external events and the levels of generic resources. It notifies clients who have placed requests on these resources. It is also responsible for type independent cache management functions. The Wardens, on the other hand, manage type specific functionality. Each Warden manages one Tome type (called a Codex). Each Warden also implements caching and naming support and manages resources specific to its Codex.

The Odyssey architecture, due to its application aware adaptation design, is not as universally usable as an application transparent approach and hence is unsuitable for ubiquitous data access. It must be noted, however, that applications written for such


14 architectures would have superior performance, as it is easier and faster for an application to adapt to a poor connection than a generic transcoder that would more likely have higher latency and would be far more complex. Puppeteer The Puppeteer architecture is based around the following philosophy: 1. Adaptation should be centralized. 2. Applications should not be written to perform adaptation. Instead, they should make visible their Document Object Model (DOM): the hierarchy of documents, containing pages, slides, images and text. The CoFi architecture [11] standing for consistency and fidelity, implemented in Puppeteer supports document editing and collaborative work on bandwidth-limited devices. This is precisely the goal that the reduced content change detection and synchronization algorithms are being designed for. However, CoFi takes a different approach to solving the problem and this can actually be used to complement the algorithms being developed. CoFi supports editing of partially loaded documents by decomposing documents into their component structures and keeping track of the changes made to each component by the user and those resulting from adaptation. Hence, in cases where the target applications meet the puppeteer requirements of exposing a DOM interface, this approach would be more efficient since only the modified subset of components needs to be dealt with. However, when this is not the case, the hierarchical change detection and synchronization system can still be used. Another requirement for Puppeteer to be used is that a Puppeteer client needs to be present on the target device to perform the function of tracking changes. This may pose a significant overhead for some mobile devices (these devices tend to be resource poor in general), in which case this work can still be used.


15 Transcoding Proxies: Pythia, MOWSER and Mowgli Each of these systems performs active transcoding of data on a proxy between the data source and the mobile device to adapt to the varying state of the connection between the mobile device and the proxy. The work done on these systems can be applied to the proposed UbiData integration architecture described earlier. Congestion Manager Congestion Manager (CM) is an operating system module developed at MIT [20] that provides integrated flow management and exports a convenient programming interface that allows applications to be notified of, and adapt to changing network conditions. This can be used to implement the connection monitor subsystem in the proposed UbiData integration architecture described earlier. Intelligent Collaboration Transparency Intelligent collaboration transparency is an approach developed at Texas A&M University [21] for sharing single user editors by capturing user input events. These events are then reduced to high-level descriptions, semantically translated into corresponding instructions of different editors, and then replayed at different sites. Although the research is not aimed at bandwidth adaptation, it can be adapted for this purpose by capturing all user input events on the mobile device and then converting them to higher level descriptions before optimizing this log for transmission to the server. This was explored as a means of reducing transmission overhead between the mobile device and the server in the UbiData architecture and the work done and results obtained are described in chapter 5.


Research on Difference Algorithms

LaDiff

LaDiff [22] uses an O(ne + e^2) algorithm, where n is the number of leaves and e is the weighted edit distance, for finding the minimum cost edit distance between ordered trees. However, there is no support for reduced content change detection. The algorithm functions in two phases, viz. finding a good matching between the nodes of the two trees and finding a minimum cost edit script for the two trees given a computed matching. The matching phase uses a compare function, which, given nodes s1 and s2 as input, returns a number in the range [0,2] that represents the distance between the two nodes. This is similar to the technique used to generate matches in our reduced content change detection system. The edit script computation operations are similar to the technique used by XyDiff, and that is the system that has been followed for edit script construction in the reduced content change detection algorithm.

XyDiff

XyDiff is a difference detection tool for XML documents designed for the Xyleme project at INRIA, Rocquencourt, France. The Xyleme project proposes the building of a worldwide XML warehouse capable of tracking changes made to all XML documents for query and notification. The XyDiff tool is being used to investigate reduced content change detection since it employs an efficient algorithm to track changes in XML documents and its source is available for modification. The companion tool to XyDiff, XyApply, applies the deltas generated by XyDiff to XML documents. This tool has been used as is for testing the algorithms developed, since the XyDiff delta structure has not been modified in any way, but the delta construction process has been completely rewritten. Xyleme deltas also have the advantage of being complete, i.e. they have


enough additional information to allow them to be reversed. Hence, they meet the goal of allowing version reconstruction. To assist in versioning, all XML nodes are assigned a persistent identifier, called an XID, which stands for Xyleme ID. XIDs are stored in an XID map, which is a storage efficient way of attaching persistent IDs to every node in the tree. The basic algorithm used by XyDiff is as follows:

1. Parse the v0 (original) document and read in the XIDs if available.
2. Parse the v1 (modified) document.
3. Register the source document, i.e. assign IDs to each node in a postorder traversal of the document. Construct a lookup table for ID attributes (unique identifiers assigned to nodes by the creating application for the purpose of easing change detection), if present.
4. Construct a subtree lookup table for finding nodes at a certain level from any parent node for the v0 document.
5. Register the v1 document.
6. Perform bottom up matching.
7. Perform top down matching.
8. Peephole optimization.
9. Mark old tree for deletes.
10. Mark new tree for inserts.
11. Save the v1 document's XID map.
12. Compute weak move operations, i.e. a node having the same parent in both documents but different relative positions.
13. Detect updates.
14. Write the delta to file.

All steps before step 8 have been altered for supporting reduced content difference detection. The remaining steps, concerned with the construction of the delta, have been left relatively untouched.
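As a small illustration of the registration step (step 3 above), the sketch below assigns identifiers to a toy tree in post-order, so every child is numbered before its parent; the Node type is a stand-in for the DOM nodes and the XID map the real tool uses.

    #include <iostream>
    #include <string>
    #include <vector>

    // Toy tree node; the real algorithm registers DOM nodes and records XIDs in an XID map.
    struct Node {
        std::string name;
        std::vector<Node> children;
        int xid;
    };

    // Post-order registration: children receive identifiers before their parent,
    // so the root ends up with the largest XID.
    int registerSubtree(Node& node, int nextXid) {
        for (Node& child : node.children)
            nextXid = registerSubtree(child, nextXid);
        node.xid = nextXid;
        return nextXid + 1;
    }

    int main() {
        Node p{"p", {}, 0};
        Node section{"section", {p}, 0};
        Node root{"abiword", {section}, 0};
        registerSubtree(root, 1);
        std::cout << root.children[0].children[0].xid << " "   // 1: the p leaf
                  << root.children[0].xid << " "               // 2: the section node
                  << root.xid << "\n";                         // 3: the document root
    }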


Xmerge

Xmerge is an OpenOffice.org open source project [23] for document editing on small devices. The goals of the document editing on small devices project are as follows:

1. To allow editing of rich format documents on small devices, using 3rd party applications native to the device.
2. Support the most widely used small devices, namely the Palm and PocketPC.
3. Provide the ability to merge edits made in the small device's lossy format back into the original richer format document, maintaining its original style.
4. Take advantage of the open and well-defined OpenOffice.org XML document format.
5. Provide a framework with the ability to have plugin-able convert, diff and merge implementations. It should be possible to determine a converter's capabilities at run-time.

The goals of the Xmerge group are virtually identical to what the reduced content difference detection system is trying to achieve. However, the approach used by the Xmerge group is to have format-specific diff and merge implementations, unlike the format independent approach selected for reduced content synchronization. This allows our reduced content synchronization system to be simpler and have smaller space requirements. This would enable it to be installed on a reasonably powerful small device. However, Xmerge's plugin management system can be utilized to implement the plugin manager subsystem for the proposed UbiData integration architecture described above.

An O(ND) Difference Algorithm

The approach used in this difference algorithm [24] is to equate the problems of finding the longest common subsequence of two sequences A and B, and a shortest edit script for transforming A into B, to finding a shortest path in an edit graph. Here, N and D refer to the sum of the lengths of A and B and the size of the minimum edit script


for A and B respectively. Hence, this algorithm performs well when differences are small, and thus it has been chosen as the approximate string-matching algorithm, since the application domain for the system is user edits on mobile devices. These edits are not likely to be very extensive and hence, for this application domain, the algorithm should perform well. The algorithm uses the concept of an edit graph, an example of which is shown in figure 4. Let A and B be two sequences of length N and M respectively. The edit graph for A and B has a vertex at each point in the grid (x,y), x ∈ [0,N] and y ∈ [0,M]. The points at which A_x = B_y are called match points. Corresponding to every match point is a diagonal edge in the graph. The edit graph is a directed acyclic graph.

Figure 4. Sample edit graph for A = ABCABBA and B = CBABAC; Path trace = (3,1), (4,3), (5,4), (7,5); Common subsequence = CABA; Edit script = 1D, 2D, 2IB, 6D, 7IC

A trace of length L is a sequence of L match points (x_1, y_1), (x_2, y_2), ..., (x_L, y_L), such that x_i < x_(i+1) and y_i < y_(i+1) for successive points (x_i, y_i). Each trace corresponds to a


common subsequence of the strings A and B. Each trace also corresponds to an edit script. This can be seen from figure 4; if we equate every horizontal move in the trace to a deletion from A and every vertical move to an insertion into B, we have an edit script that can transform A to B. Finding the longest common subsequence and finding the shortest edit script both correspond to finding a path from (0,0) to (N,M) with the minimum number of non-diagonal edges. Giving diagonal edges a weight of 0, and non-diagonal edges a weight of 1, the problem reduces to finding the minimum-cost path in a directed weighted graph, which is a special instance of the single source shortest path problem. Refer to Myers [24] for the algorithms to find the length of such a path and to reconstruct the path's trace. These algorithms have been used for approximate string matching and approximate substring matching respectively as part of the overall change detection algorithm.
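For illustration only (this is not the vdiff source), the sketch below computes D, the length of the shortest edit script, using the greedy loop of the O(ND) algorithm; in the change detection system the same computation is applied to node content strings.

    #include <iostream>
    #include <string>
    #include <vector>

    // Length D of the shortest edit script turning a into b (Myers' greedy loop).
    int sesLength(const std::string& a, const std::string& b) {
        const int n = static_cast<int>(a.size());
        const int m = static_cast<int>(b.size());
        const int maxD = n + m;
        std::vector<int> v(2 * maxD + 2, 0);        // v[k + maxD]: furthest x reached on diagonal k
        for (int d = 0; d <= maxD; ++d) {
            for (int k = -d; k <= d; k += 2) {
                int x;
                if (k == -d || (k != d && v[k - 1 + maxD] < v[k + 1 + maxD]))
                    x = v[k + 1 + maxD];            // vertical edge: an insertion from B
                else
                    x = v[k - 1 + maxD] + 1;        // horizontal edge: a deletion from A
                int y = x - k;
                while (x < n && y < m && a[x] == b[y]) { ++x; ++y; }   // follow diagonal matches
                v[k + maxD] = x;
                if (x >= n && y >= m) return d;     // reached (N, M): d edit operations suffice
            }
        }
        return maxD;
    }

    int main() {
        // A = ABCABBA and B = CBABAC from figure 4: the shortest edit script has 5 operations.
        std::cout << sesLength("ABCABBA", "CBABAC") << "\n";
    }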


CHAPTER 3
PROBLEMS INVOLVED IN RECOVERABLE CONTENT ADAPTATION

Developing a Generic Difference Detection and Synchronization Algorithm for Reduced Content Information

Since document formats vary widely, and there are a vast number of them, the number of possible content reductions is enormous. To reduce the size of the problem and make it tractable, only the extreme case was considered for solution. The system that was developed can now be extended to suit other combinations. This extension is an ongoing task. Document editing has been chosen as the problem domain and, to investigate recoverable content adaptation, conversion to text followed by reintegration of changes to the edited text has been chosen as the restricted problem to be solved. This problem is the extreme case of content adaptation in the domain of document editing. Abiword [25], an open source document editor, has been selected as the test editor as it uses XML as its native file format and is available on both Linux and Windows.

Example Document Content Reduction Scenario

Consider the document shown in figure 5, which is reduced to text and edited as shown in table 1. The notation followed in this sequence of changes to the document is to refer to the source document as v0, the content reduced (converted to text) document as v0(-), the content reduced document after edits as v1(-), and the final document with changes integrated as v1. In the general case, reduction will not be so extreme and the result document will still be in XML format. Hence, we need to impose an XML structure on the v1(-) document. However, since all formatting content has been lost, it


will only be possible to impose a very basic structure on the document. This basic structure will depend on the particular word processor in use and hence the module handling this will vary for every document format. Therefore, a plugin management system is required to load the correct converter depending on the particular format being handled. In the case of Abiword, imposition of this structure involves extracting text terminated by a new line and inserting it under a <p> tag, since in Abiword, paragraphs are terminated by a new line. Other structures that can be safely imposed are an <abiword> tag and a <section> tag, since they are present in all Abiword documents. However, this is the maximum extent we can go to, given only the modified text document. The resulting XML DOM tree is shown in figure 6.

Figure 5. Example Abiword document
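The sketch below shows this structure-imposition step for the Abiword case described above; it is a simplified stand-in for the real converter plugin, which builds a DOM tree and handles XML escaping rather than emitting a string.

    #include <iostream>
    #include <sstream>
    #include <string>

    // Wrap every newline-terminated line of the edited text in a <p> element,
    // nested under <section> and <abiword>. Escaping of &, < and > is omitted.
    std::string imposeBasicStructure(const std::string& text) {
        std::ostringstream xml;
        xml << "<abiword>\n  <section>\n";
        std::istringstream lines(text);
        std::string line;
        while (std::getline(lines, line))
            xml << "    <p>" << line << "</p>\n";
        xml << "  </section>\n</abiword>\n";
        return xml.str();
    }

    int main() {
        std::cout << imposeBasicStructure("COP 5615\nList item 1\nEnd of document\n");
    }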


Table 1. Document in figure 5 converted to text and edited

Document text content (v0(-)):
COP 5615
Evaluation of Current Prevention, Detection and defeat Research
List item 1
List item 2
List item 3
End of document

Document edited text content (v1(-)):
COP 5615
Evaluation of Current Prevention, Detection and defeat Research
List item 1
List item 2
Updated #1##
List item 3
End of document

Figure 6. XML structure imposed on the modified text document

For comparison, the original document structure is shown in figure 7. Clearly, the modified document has a far simpler structure. To enable integration of changes to the original document, we need to come up with a function which, when given the original document tree and the tree with XML structure imposed, produces a structure with the content in the modified document that is isomorphic to the source document.


Figure 7. DOM tree for the original document, v0


Once we have this isomorphic modified tree, the difference script can be constructed using standard techniques employed in other difference algorithms.

Issues in Designing a Function to Construct a Modified Document Structure Isomorphic to the Source Document Structure

1. A format independent method of specifying what content is shared between the modified document and the source is needed, to identify what content to consider deleted versus just stripped out. In our example above, the shared content consists of the abiword, section, p and text nodes as shown in figure 6.

2. A technique for matching content nodes in the modified document to the nodes in the source document is needed. However, this is complicated by the following possibilities:

a. Node contents can be edited to varying degrees. Hence, an approximate string matching technique is needed and a boundary value for a node to be considered matched to another needs to be found.

b. An aggregation of nodes in the source document may be present as one node in the result. As an example, consider the string "Evaluation of Current Prevention, Detection and defeat Research" in figure 5. The substring "Current Prevention, Detection," is underlined and hence the DOM subtree for this string in figure 7 has three nodes corresponding to the string. However, on conversion to text, this structure is lost and when a basic XML structure is applied, the three nodes are present as one node in the result tree in figure 6. In addition to possibility a. occurring, the result string may have the subsequences in a different order, which further complicates matching. Since the string may be edited in any way, it is not possible to match each substring in the source to the result split into the same sized substrings. Matching each substring in the source document to every prefix/suffix of the content node in the result tree is also incorrect, since the match may be any substring and not necessarily a prefix or suffix. Hence, structures such as suffix trees are unusable. Matching all possible substrings would be extremely expensive computationally, since all possible substrings would result in O(2^n)/O(1 + 2! + 3! + ... + n!) combinations, which when rewritten as O(2^n)/O(n! + (n-1)! + ...) = O(2^n)/O((n-1)(n+1) + (n-3)(n-1) + ...) = O(2^n)/O(n^3) = O(2^n) combinations to be processed. In the extreme case, when no XML structure can be imposed on the document except a document root tag, all content will be present in one node. This extreme case of the aforementioned problem underscores the need to develop an efficient solution to it.


26 3. Once matches are found, the tree structure from the source needs to be used to construct an isomorphic structure for the result that re-imports all stripped out content. 4. A mechanism is needed to filter out certain content nodes from the match process e.g. Abiword encodes images in base 64 text and thus this could be confused for a content node and impose a heavy, and unnecessary performance burden. 5. Post processing is necessary to ensure that the created isomorphic tree is valid e.g. if the document has all content deleted but the formatting and style descriptions are imported as part of the process above, Abiword crashes on trying to open the document. A format specific plugin could be run at this point, to perform dependency checks such as this. Alternatively, a separate tool could be used to perform this function after applying the delta to the source document.


CHAPTER 4
DESIGN AND IMPLEMENTATION OF RECOVERABLE CONTENT ADAPTATION

Solution Outline for Delta Construction

As mentioned in the previous chapter, the restricted problem being solved is that of converting a document to text, and incorporating the changes made to the textual version back into the original source document. There are three basic steps in the solution approach as follows:

1. If required, adjust the resulting document by imposing a basic XML structure on it. Match the nodes in the source and modified document trees, taking into account the fact that an aggregate of source nodes may match a result node.

2. Adjust the structure of the modified document to make it isomorphic to the source document. In this process all content that was stripped out should be re-imported unless it was associated with deleted content.

3. Construct the delta script using the source document structure and the newly created modified document structure.

Before describing the steps in detail, the question of how shared content and nodes to be excluded from the match procedure are to be represented needs to be answered.

Representation of Shared Content

There are two ways shared content between documents could be represented, as follows:

1. The intersection of the set of tags in the source and result.

2. The complement of the set above, i.e. the set of tags which are not shared between the two documents.

The former approach has been chosen since, in our case, the number of shared tags will be very small (since all documents are in XML format we treat shared information


as shared tags). This information can also be represented in an XML format. Hence, there will exist one such map for each file format. The corresponding one will be read depending on the format being handled, and the tag names will be stored in a hash table (STL hash map) for O(1) time lookup during the matching process. The same idea is used to represent excluded tags (the information which is to be excluded from the matching process).

Matching of Nodes in the Source and Modified Documents

The algorithm has the following basic steps:

1. Recursively construct a list of content nodes (TEXT nodes) from the source (v0) document, excluding any subtrees under excluded tags.

2. Construct a list of content node subtrees from the v0 document and remove the corresponding nodes from the list constructed in step 1.

3. Recursively construct a list of content nodes (TEXT nodes) from the modified document (v1(-)).

4. Find the least cost edit script matches between the nodes in the v0 list and the nodes in the v1(-) list.

5. Find the least cost edit script matches between the v0 subtree list and the nodes in the v1(-) list.

Construction of the List of Content Node Subtrees

The algorithm has the following basic steps:

1. For each node in the v0 list, check if its parent node is a shared node and if the node has a sibling. If so, find out the number of leaf nodes in this subtree by counting each TEXT sibling and recursively counting the TEXT nodes in every non-TEXT node sibling.

2. Otherwise, if the parent is not a shared node, traverse the tree upwards until a shared parent is found. Calculate the number of TEXT nodes in this subtree recursively.

3. If the number of nodes found is greater than one, remove this range of nodes from the v0 list and insert the nodes into the v0 subtree list.
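A simplified sketch of step 4 of the node matching outline above follows. It uses a dynamic-programming insert/delete distance in place of the O(ND) routine (both yield the least cost edit script length) and, unlike the full procedure described in the next subsection, it does not handle re-assignment of an already matched v0 node; the threshold value and the strings are purely illustrative.

    #include <algorithm>
    #include <iostream>
    #include <limits>
    #include <string>
    #include <vector>

    // Insert/delete edit distance, |a| + |b| - 2 * LCS(a, b): the same value as the
    // least cost edit script length used for matching.
    int editDistance(const std::string& a, const std::string& b) {
        std::vector<std::vector<int>> lcs(a.size() + 1, std::vector<int>(b.size() + 1, 0));
        for (size_t i = 1; i <= a.size(); ++i)
            for (size_t j = 1; j <= b.size(); ++j)
                lcs[i][j] = (a[i - 1] == b[j - 1]) ? lcs[i - 1][j - 1] + 1
                                                   : std::max(lcs[i - 1][j], lcs[i][j - 1]);
        return static_cast<int>(a.size() + b.size()) - 2 * lcs[a.size()][b.size()];
    }

    struct Match { int v0Index; int cost; };

    // For each v1(-) content node, find the v0 content node with the smallest edit
    // script cost; accept the match only if the cost stays below the threshold.
    std::vector<Match> matchNodes(const std::vector<std::string>& v0,
                                  const std::vector<std::string>& v1, int threshold) {
        const Match unmatched{-1, std::numeric_limits<int>::max()};
        std::vector<Match> result(v1.size(), unmatched);
        for (size_t i = 0; i < v1.size(); ++i) {
            for (size_t j = 0; j < v0.size(); ++j) {
                const int d = editDistance(v0[j], v1[i]);
                if (d < result[i].cost) result[i] = Match{static_cast<int>(j), d};
                if (d == 0) break;                                 // perfect match: stop scanning
            }
            if (result[i].cost >= threshold) result[i] = unmatched; // too different to match
        }
        return result;
    }

    int main() {
        std::vector<std::string> v0 = {"List item 1", "List item 2", "End of document"};
        std::vector<std::string> v1 = {"List item 2 Updated", "End of document"};
        auto matches = matchNodes(v0, v1, 10);
        std::cout << matches[0].v0Index << " " << matches[1].v0Index << "\n";   // prints 1 2
    }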


Determining the Node Matches between the v0 and v1(-) Trees

The algorithm has the following basic steps:

1. For each node in the v1(-) node list, execute step 2.

2. minLES = ∞

   a. For each node in the v0 node list, perform the following steps:

      i. Compute the least cost edit script length using the algorithm in [24]. If this value is less than minLES, set minLES = the value computed above.

      ii. If the minLES value is below a threshold value, enter the v1(-) node as a key in a hash table, with the value being another hash table containing the (key, value) pair of (v0 node, the minLES value) inserted into it. Otherwise, if the minLES value exceeds the threshold value, enter the v1(-) node as a key in a hash table, with the value being another hash table containing the (key, value) pair of (v0 node, ∞) inserted into it.

      iii. If the minLES value is 0, we have found a perfect match for the v1(-) node and there is no need to consider any other v0 nodes. Continue execution from step 2.

   b. If minLES < ∞ and the v0 node corresponding to the minLES value is already assigned, find the better match by looking up the previous match value in the hash table and comparing it to minLES. If the current one is better, reset the previous match. If not, we need not continue further for this v1(-) node. Continue execution from step 2.

   c. Mark the v0 and v1(-) nodes as matched/assigned to each other.

   d. If minLES is 0, remove the corresponding v0 node from the v0 node list since it does not need to be included in any further comparisons.

Determining the Node Matches between the v0 Subtree List and v1(-) Tree List

Before going into the steps involved, we explore the solution approach with an example. Consider the subtree consisting of the nodes AB, C, ABBA and the v1(-) node CBABAC. For now let us assume that the relative ordering of the substrings in the v1(-) node is the same as that in v0, but they may have been edited in any way otherwise. We


need a way to decompose the v1(-) node content into substrings providing the best approximate matches to the subtree nodes. The approach used utilizes the way the least cost edit script algorithm works. Consider figure 8, which shows the edit graph generated for the concatenation of the substrings (AB, C, ABBA) and the v1(-) node CBABAC.

Figure 8. Example of approximate substring matching

We can find the matches for each substring by using the co-ordinates of the path found in the graph. If the end of a substring, e.g. AB, occurs at point x (in this case 2), the corresponding y value of the point on the path gives the termination of the matching substring (in this case 0, implying that the best match for the substring AB is an empty string). In this example the matches are AB = "" (the empty string), C = C, ABBA = BABAC. Now returning to the general problem, where substring node orderings may not be the same, we can see that the solution is valid provided we find the permutation of the subtree that best matches the v1(-) node content. However, finding all permutations of a


set of substrings takes O(n!) time, which is intractable for large problems. Hence, a condition/heuristic is needed to prune the search. The approach used in the current implementation is to perform the search for n < C, where C is a bound on the subtree size beyond which no re-ordering of the subtree nodes is attempted.

information needed to reconstruct the trace through the graph.

3. Compute the trace through the edit graph.

4. Use the technique described above to traverse the trace and split the v1(-) node into substrings matching the v0 node. Enter the substrings with their matching counterparts into a hash table.

5. Return the hash table.

2. Otherwise, return an empty hash table.

Adjusting the Structure of the Modified Document to make it Isomorphic to the Source Document

Applying the Substring Matches

The basic steps involved in the algorithm are as follows:

1. For each subtree in the v0 document, perform the following steps:

   a. If the v0 subtree has already been matched, execute the following steps:

      i. Set v1Parent = Parent node of the v1 match node.

      ii. For each node in the v0 subtree, execute the following steps:

         1. Look up the matching substring in the hash table returned from the substring match function.

         2. If the match exists, then execute the following steps:

            a. Create a text node for the match and insert it before the matching v1 node. If the v0 node was previously matched, reset the match. Mark the v0 and v1 nodes as matched/assigned to each other.

            b. Add the inserted node to the list of v1 nodes.

      iii. If even one node in the subtree was matched, delete the corresponding v1 node (it has now been replaced by its components).

2. Return.


33 Adjusting the Result Document Structure for Unshared Ancestors The basic steps involved in the algorithm are as follows: 1. For each node in the v1 node list, perform the following steps: a. Lookup the matching v0 node and its parent. b. If the v0 Parent node is not null and is unshared, execute the following: i. Push the parent node into a stack. ii. Get the Parents ancestor, and repeat step b. c. If the shared v0 parent and v1 node parent match (have the same node names), perform the following steps: i. Match the parents if not done already. ii. Propagate the matches upward till we reach a matched node or the root. For each match, import attributes of the v0 node over to the v1 node. iii. Remove the v1 node (to allow insertion of unshared ancestors). iv. Set v1ParentDom = Parent of the v1 node. v. Get the v1 nodes previous sibling and check if its name is the same as the ancestor we are trying to insert. If so, and the v0 ancestor is assigned, check the remaining nodes in the stack for assignment and a match with the rightmost path down the siblings tree. Remove each matching node from the stack. Set v1ParentDom = last match from the stack. This takes into account the possibility that we are trying to insert ancestors common to siblings. vi. For each node on the stack, execute the following steps: 1. Append the node as a child to v1ParentDom. 2. Mark the appended node and the v0 nodes as matched/assigned to each other. 3. Set v1ParentDom =appended node. vii. Append the v1 node as a child of v1ParentDom. 2. Return.


34 Adjusting the Result Document Structure for Unshared Nodes The basic steps involved in the algorithm are as follows: 1. If the document roots are not matched and they have the same tag names, match them. Otherwise, signal an error and halt. 2. Set v0Node = First child of the v0 Root. 3. Set v1Node = First child of the v1 Root. 4. Similarly, set v0Parent and v1Parent to the v0 and v1 Roots respectively. 5. If v0Node is null, return. 6. If the v0Node is not assigned, and it is unshared with its parent matched to the v1Node parent or has an excluded parent, execute the following steps: a. Add the v0Node to the v1 tree as a previous sibling of the v1Node if it is non-null or the only child of the v1Parent otherwise. b. Mark the nodes as matched/assigned to each other. c. Recursively process all the children of the v0Node and v1Node. 7. Otherwise, if the nodes are already assigned, recursively process all the children of the v0Node and its v1 match. 8. Recursively process all the children of the v0Nodes next sibling and the v1Node. Applying Default Formatting for Inserts This phase has been added to the overall algorithm to provide the flexibility of assigning any formatting as default for inserted content. This allows inserted content to be easily identifiable. However, to allow any formatting to be applied, we need the flexibility of inserting the new content as a child of a hierarchical structure. This structure is added as a child of a shared ancestor (which would have been the immediate parent of the inserted content). To define the structure, we use an XML document specifying the tree structure where the root of the tree defines the shared tag to which this structure is to


35 be applied. The insertion point, which is a text string to be replaced by the inserted content, is also defined in this document. The structure is read in along with the other XML configuration files and each time the structure has to be applied, the new content replaces the insertion point. The new content is then set as the new insertion point for the next cycle. When the structure is applied, the replacement is performed as above and the tree is inserted into the position previously occupied by the inserted content. Post Processing Before running the XyDiff delta construction phases, we need to set up some basic structures needed by the difference detection algorithm and apply the discovered matches to those structures. Once this is done, a XyDiff optimization phase is run to detect updates that may have been missed by the least cost edit script match phase due to the text differences being too extensive. The optimization function should actually be run before the application of the default formatting for inserts but since it will rarely find missed matches, this is the way it has been kept for now. Another justification for doing so is that the structures used by this function do not exist before the insert formatting is applied. Insert formatting cannot be applied later since the new nodes inserted will require extensive re-organization of the XyDiff structures. Hence, the best approach would be to perform tests on the effectiveness of the phase and if required, re-implement it to function with the structures used by the match phase. A brief description of these phases follows. Computing signatures and weights for the two documents The basic steps involved are as follows: 1. Set the first available XID for the source document based on its XID map.


36 2. Register the source and result document by traversing the documents and assigning identifiers to each node based on a post-order traversal of the tree. Create a list of data nodes storing these IDs and hash values for the subtrees corresponding to the tree nodes in question. Applying the node matches to the XyDiff structures Apply all node matches generated as part of the matching phases to the XyDiff structures. If the matched nodes have the same names and values, mark them as NO-OPS. Otherwise, if they have the same parents, they represent an update and hence mark them accordingly. Otherwise, it must be the case that a move has occurred, which will be handled as part of the delta construction phase. Optimizing node matches In this phase, we basically look for matched nodes and check if any children need matching. If children are found satisfying the condition that the source and result tag names are the same, and that the result has only one child with that tag name for the current parent, they are matched. Constructing the Delta Script The steps involved in delta construction are as follows: 1. Mark the old tree. 2. Mark the new tree. 3. Save the XID map. 4. Compute weak moves. 5. Detect updates. 6. Construct the delete script. 7. Construct the insert script. 8. Save the delta document. 9. Clean up. Mark the Old Tree Conduct a post-order traversal of the source document and mark the nodes as deleted and moved. A deleted node can be identified as a node that has not been assigned.


An assigned node that has different parents in the source and result documents is deemed a strong move.

Mark the New Tree

Conduct a post-order traversal of the result document and mark the nodes as inserted or moved, and adjust the XID map accordingly. Matched nodes are imported into the result document's XID map with the XID of the node in the source document. A moved node is identified in the same fashion as above and, if the node is not assigned, it is deemed inserted and marked as such.

Save the XID Map

Write the result document's new XID map to file.

Compute Weak Moves

A weak move is one where assigned nodes have the same parents but are in different relative positions among the parent's children. The function of this phase is to determine the least cost edit script that can transform the old child ordering to the new one. This corresponds to finding the longest common subsequence between the two orderings.

Detect Updates

If a node and its matched counterpart have only one unmatched TEXT node as children, match them and mark the children as updated.

Construct the Delete Script

Construct the delete script by traversing the nodes in post-order with a right to left enumeration of children. Based on the marked values, construct appropriate tags for the delete script, which specifies what operation is to be performed, at what location, and enumerates the affected XIDs. Add the created subtree to the delta script DOM tree. In


this process, situations where a node is deleted and its parent is deleted need to be handled. The rule employed is that if the parent is to be deleted, only the parent's delete operation is written to the delta and all deleted children are specified as part of the parent's delete operation.

Construct the Insert Script

The same procedure as above is also used to construct the operations for inserts. Note that moves and updates are also handled in these two phases.

Save the Delta Document

Write the resulting delta tree to file.

Clean Up

Close the input files and free allocated memory.
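As a small illustration of the delete-subsumption rule above (a sketch only, not the actual delta writer), the function below emits one delete operation per topmost deleted node, so deleted descendants are covered by their parent's operation:

    #include <iostream>
    #include <vector>

    struct DNode {
        bool deleted;                      // marked for deletion during "mark the old tree"
        std::vector<DNode*> children;
    };

    // Count one delete operation per topmost deleted subtree; descendants whose
    // parent is already deleted are subsumed by the parent's operation.
    void emitDeletes(const DNode* node, bool parentDeleted, int& opsEmitted) {
        if (node->deleted && !parentDeleted)
            ++opsEmitted;
        for (const DNode* child : node->children)
            emitDeletes(child, parentDeleted || node->deleted, opsEmitted);
    }

    int main() {
        DNode leaf1{true, {}}, leaf2{true, {}};
        DNode parent{true, {&leaf1, &leaf2}};
        DNode root{false, {&parent}};
        int ops = 0;
        emitDeletes(&root, false, ops);
        std::cout << ops << "\n";          // prints 1: the two leaves are subsumed by their parent
    }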


CHAPTER 5
RELATED EXTENSIONS TO UBIDATA

Development of a File Filter Driver for Windows CE to Support a UbiData Client

To allow integration of the change detection system with UbiData, and testing on various platforms, it is necessary to first have UbiData support a range of platforms. To allow porting of the UbiData system onto Windows CE, a file filter driver is necessary to capture file operations to ascertain when a file on the local file system has been accessed or modified. This filter driver communicates the information to the local UbiData client, the M-MEM, which uses the information to build the user's working set or to mark the locally modified file as requiring synchronization with the server. The Windows CE .NET 4.0 storage architecture [26] is as shown in figure 9.

Figure 9. Windows CE .NET 4.0 storage architecture [26].


As can be seen from figure 9, a file system filter is a dynamic-link library (DLL) that sits on top of the file system and intercepts file system calls. Multiple filters can exist on any loaded file system, performing any combination of file manipulations before the file system sees the call. For the purposes of UbiData, we need to trap the file system functions associated with create, open and write. For every open/create call on a file, the file name is stored in a hash table with the key being the file handle value. On a write, the hash table is looked up to get the corresponding file name. These file operations are encoded (using an application specific lookup table) and communicated to the M-MEM application. An issue with the communication is that the driver and client are separate processes, where the client runs with user level privileges. An interprocess communication technique is required that is lightweight and fast. This method should not use an inordinate amount of memory since the mobile device will have limited memory. The solution to this problem was to use memory mapped files in a logical queue (see figure 10), ensuring an upper bound on memory utilization and yet being fast and lightweight compared to pipes and sockets.

Figure 10. Logical queue of memory mapped files


To complete the port, the UbiData client code and libraries need to be ported to Windows CE .NET 4.0.

Active Event Capture as a Means of UbiData Client-Server Transfer Reduction

The idea behind active event capture is to reduce data transfer between a UbiData client (M-MEM) and the UbiData server (F-MEM) by capturing and logging events generated by an application as it is being used to edit a file. These events are then converted to higher-level descriptions and optimized. The resulting event log would in some cases be smaller than the delta computed by a difference algorithm. The delta can also be computed and compared to the event log to decide which one to ship based on size. Although a promising idea (and one that has been implemented in a desktop environment at Texas A&M University [21]), it was found that the burden of event capture and the resulting context switches (since the application logging events was different from the editor, each time a message was posted to the editor's queue, a context switch to the logging application would be made) was too heavy for a mobile device. However, as mobile devices get progressively more powerful, this may be a promising avenue of research for supporting more resource rich (and perhaps less mobile) devices such as laptops.

Development of an Xmerge Plug-in to Support OpenOffice Writer

As explained in the introduction, a plugin architecture is the cleanest method for handling application/format specific code. Since this is the philosophy guiding the design of Xmerge [23], the Xmerge plugin management code can be used to implement the UbiData integration architecture proposed earlier. As a result, the OpenOffice [27] to text converter and the text to a basic OpenOffice document format converter have been implemented as an Xmerge plugin.


CHAPTER 6
PERFORMANCE EVALUATION APPROACH

Test Platform

Testing was performed on a machine with an AMD Duron 650 MHz processor, 128 MB RAM and an IBM-DTLA-307030 hard drive running Mandrake Linux 8.2 (kernel 2.96-0.76mdk) with gcc version 2.96.

Test Methodology

The architecture of the test engine is as shown in figure 11. The figure also shows the sequence of steps involved in the testing process.

Figure 11. Test tools and procedures


The steps involved in the testing process are as follows:

1. Ten documents were selected as initial test cases. These documents included letters, application forms, term papers, pages from an E-book and a resume.

2. A test case generator module was used to generate 960 modified documents from these source documents by traversing the DOM trees and modifying the content nodes. Each test case had 0-50 percent file level changes and 10-40 percent paragraph level changes for four different edit operations, viz. insert, delete, update and move, applied to it to generate its share of 96 modified documents. Paragraph level operations refer to operations performed within a content node, since these nodes represent paragraphs or their components in Abiword. The number of content nodes they are applied to depends on the file change percent. The file level change percent also determines the number of content nodes added, deleted and moved for inserts, deletes and moves respectively. The test case generator also records information on source file statistics, viz. the source document size (in bytes), node count, text node count, number of document subtrees, number of content nodes in all subtrees, average number of children, average height, average content size (in characters) and the average number of attributes; this recording is shown as a dotted line in figure 11.

3. The 960 modified documents are converted to text using a text conversion module, which traverses the DOM trees and extracts the content in the TEXT nodes (shown by the block arrows).

4. The modified text documents then have a basic XML structure imposed upon them before being used as inputs to Vdiff, the change detection module described in this thesis, along with the original documents and the configuration files. During execution, Vdiff generates statistical data, such as the time taken by each phase of the algorithm, which is recorded (shown by the dotted arrow).

5. The resulting deltas generated by Vdiff are used to patch the original files to generate the modified documents (shown by the black block arrows).

6. These modified documents are compared to the reference modified documents to quantify reconstruction accuracy and the results are recorded (shown by the gray arrows). The comparator generates three data points for each comparison, viz. the number of missing leaves, missing non-leaves and the number of missing attributes. The comparator determines these values by walking the input trees and noting each discrepancy.

7. All data was generated in comma-separated values (CSV) format, which can be directly read by spreadsheet software and some databases. The software packages used to analyze the data were Microsoft Access and Microsoft Excel. Access was used to run queries to produce data for specific graphs that were then generated in Excel.
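To give a flavor of the generator's selection step, the toy sketch below picks the fraction of content nodes dictated by the file level change percentage; plain indices stand in for DOM content nodes, and the fixed seed is only for reproducibility (neither detail is taken from the actual test engine).

    #include <algorithm>
    #include <iostream>
    #include <random>
    #include <vector>

    // Randomly choose filePercent percent of the content nodes to receive an edit operation.
    std::vector<std::size_t> pickNodesToModify(std::size_t nodeCount, double filePercent,
                                               std::mt19937& rng) {
        std::vector<std::size_t> indices(nodeCount);
        for (std::size_t i = 0; i < nodeCount; ++i) indices[i] = i;
        std::shuffle(indices.begin(), indices.end(), rng);
        indices.resize(static_cast<std::size_t>(nodeCount * filePercent / 100.0));
        return indices;
    }

    int main() {
        std::mt19937 rng(42);
        auto picked = pickNodesToModify(20, 30.0, rng);     // 30 percent of 20 nodes
        std::cout << picked.size() << " content nodes selected for modification\n";   // prints 6
    }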


CHAPTER 7
TEST RESULTS

Correctness of the Change Detection System

The change detection system, as shown in figures 12 and 13, has good accuracy, with a maximum of 14.8 percent total absolute missing nodes, or 12 percent missing nodes with a maximum of 6.6 percent missing leaves and 9.2 percent missing non-leaves. The absolute missing leaves parameter refers to the difference between the number of leaves in the expected reference document and the number of leaves in the result document. The other missing node counts (termed computed in figures 12 and 13) are determined by the test engine comparator (refer to chapter 6). The computed measures are needed to determine whether the nodes are in their expected locations, which cannot be determined from the absolute measure.

Figure 12. Missing nodes by test cases in LES complexity order (series: source text node count, average absolute missing nodes, average computed missing nodes)


The error rate seems to depend more on the LES complexity of the document than the substring complexity (refer to the sections on the run time behavior of the phases of the algorithm in the next section on the performance of the algorithm). The error in the last document in figure 13 may be explained by the fact that its subtree sizes exceed the bound C on subtree re-ordering (refer to chapter 4).

Figure 13. Missing nodes by test cases in substring complexity order (series: source text node count, average absolute missing nodes, average computed missing nodes)

Performance of the Difference Detection Algorithm

To determine the performance of the difference algorithm and the primary phases on which it depends, we plot the time taken by the various phases of the algorithm for all the test cases in figure 14. As can be seen from figure 14, the overall time taken (Vdiff2 Compute Delta time in the legend) is dependent almost entirely on only two phases, viz. least cost edit script matching of nodes (Vdiff2 LES Match Nodes) and matching of substrings of the source tree to the modified tree (Vdiff2 Subsequence Match). The substring matching process is clearly the dominant factor in the overall time taken.


Figure 14. Time taken for the various phases by test case


The test cases have been ordered according to the average number of TEXT nodes per subtree, since this is the primary factor determining the substring matching time. The drop in execution time seen for the cases with large subtrees can be attributed to their exceeding the bound C, which determines the limit beyond which no subtree reordering is done (refer to the subtree matching algorithm in chapter 4). We now analyze the run time for each operation class, viz. insert, delete, update and move. We shall consider the node matching and substring matching phases from now on, since they are the primary factors determining run time.

Insert Operations

From figures 15 and 16, it is clear that the operations done at file level affect the run time more than the operations related to individual nodes.

Figure 15. Time taken by test cases for Inserts with no file level changes (series: Vdiff2 Compute Delta time, Vdiff2 LES Match Nodes, Vdiff2 sub tree match)

This can be attributed to the fact that for each inserted node, it is compared with every node in the source subtree, whereas in the case of paragraph level inserts, the number of comparisons remains roughly the same (the only difference being that some comparisons that may have been terminated early due to finding a perfect match now


Figure 16. Time taken by test cases for Inserts with 30 percent file level changes

Figure 17. Time taken by test cases for Inserts with 50 percent file level changes


Figure 18. Time taken by test cases for Insert with 10 percent paragraph level changes

The same trend holds for 50 percent inserts at file level, as seen in figure 17. Since we have determined that the operations that affect run time are those at file level, the graphs for the remaining operations will be shown in the format of figure 18.

Delete Operations

For deletes, we see that for the highest percentage of file level changes, the delta computation time goes down as expected.

Figure 19. Time taken by test cases for Delete with 10 percent paragraph level changes


However, for the initial deletes, the delta computation time actually rises. This is reflected in both the node matching phase and the substring matching phase.

Update Operations

In the case of updates, the percentage of file level operations does not affect the run time, since the number of nodes in the files does not change (see figure 20).

Figure 20. Time taken by test cases for Updates with 10 percent paragraph level changes

However, the same may not hold for paragraph level operations, and hence we need to plot paragraph level changes for a fixed number of file level changes, as shown in figure 21. As can be seen from figure 21, with paragraph level operations of up to 30 percent there is no significant change in run time.


Figure 21. Time taken by test cases for Updates with no file level changes

Move Operations

As can be seen from figure 22, the time taken by the node matching phase remains relatively constant or increases slightly with increasing percentages of moves.

Figure 22. Time taken by test cases for Moves with 10 percent paragraph level changes


This is to be expected, since the only effect of file level moves is to change when a perfect match is found. With increasing move percentages, the nodes are displaced further from their previous positions and hence a larger number of nodes must be compared to find matches. The behavior of substring matching should be independent of the percentage of file level moves and hence no correlation should exist, which is borne out by the figure. We therefore need to investigate the effect of paragraph level changes on run time.

Figure 23. Time taken by test cases for Moves with no file level changes

For increasing paragraph level moves, the run time is not affected. We analyze the run time behavior of the algorithm's phases as a whole in the subsequent sections.


Conclusions on Per-Operation Run Time

From the analysis above, we can conclude that the primary factor determining the run time performance of the system is the number of operations at file level. Also, as shown in figure 24, the time complexity is in the order delete, move, update and insert.

Figure 24. Per-operation relative time complexity (average time in ms for each algorithm phase, by operation: Delete, Insert, Move, Update)

Run Time Behavior of Node Matches

To determine whether the run time behavior of the node matching phase is according to expectations, we need to define a complexity measure by which to reorder the source documents, and check whether the average values of the time taken are monotonically increasing or decreasing. Since the node matching phase depends on the number of non-subtree TEXT nodes in the source document (subtree TEXT nodes are processed separately in the subtree match phase) and on the number of nodes in the result document, we can define our initial complexity measure as follows:


LES Complexity = (non-subtree TEXT nodes) * (non-subtree TEXT nodes + number of subtrees)

since the number of nodes in the result document equals the number of non-subtree TEXT nodes plus the number of subtrees, each of which contributes one node to the result (the result being a text document, all siblings are collapsed together for lack of a formatting structure). However, this does not yet account for the time taken by the least cost edit script matching process itself. Since it is O(ND), where N is the total size of the nodes being compared and D is the average edit distance for each pair, it can be approximated as 2 * (average size of a source text node) * D. Since we apply the same percentages of edits to all source document sets, a rough approximation is to represent D as (average size of a source text node) * (some constant). Hence, an approximate complexity measure works out to the following:

LES Complexity = ((non-subtree nodes)^2 + (non-subtree nodes) * (number of subtrees)) * (average size of a source text node)^2

Ordering the source files according to this complexity measure, as in figure 25, we obtain an increasing trend in average time taken, albeit with fluctuations. However, this should be considered in light of there being relatively few data points, of the documents not being large enough to make constant factors negligible, and of the fact that the measure is itself an approximation.
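The measure is straightforward to compute from the per-document statistics gathered by the test engine; a minimal sketch (Python, with illustrative variable names and example numbers that are not taken from the test data) is:

def les_complexity(non_subtree_text_nodes, num_subtrees, avg_text_node_size):
    # ((T^2 + T*S) * avg_size^2): T non-subtree TEXT nodes, S subtrees, and
    # avg_size the average source text node size. The average edit distance D
    # is folded into avg_size times a constant and therefore dropped.
    t, s = non_subtree_text_nodes, num_subtrees
    return (t * t + t * s) * avg_text_node_size ** 2

# Illustrative numbers only: 40 non-subtree TEXT nodes, 12 subtrees,
# 80-character average text node size.
print(les_complexity(40, 12, 80))  # 13312000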


Figure 25. Behavior of the node match phase (average time in ms for the node match phase vs. source documents in LES complexity order)

Run Time Behavior of Substring Matching

To analyze substring matching, graphs were plotted against source document parameters and expected complexity measures. However, a strong correlation appeared only when the sources were ordered by the average number of children per TEXT subtree. This is because the number of children per subtree determines the number of possible orderings for the nodes in that subtree. The fall in run time for the last source document set in figure 26 is due to the fact that most subtrees in that document have more children than the re-ordering threshold C, which causes a large drop in the time taken since re-ordering is completely bypassed.
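A sketch of the guard that produces this behavior is given below (Python; the names, the exhaustive enumeration via itertools.permutations and the scoring callback are simplifying assumptions for illustration, whereas the actual implementation prunes the search with a heuristic, as discussed in chapter 8). A subtree with n children admits n! orderings, so once n exceeds C the re-ordering step is skipped entirely:

from itertools import permutations

REORDER_THRESHOLD_C = 6  # illustrative value; the best setting is left open

def best_child_ordering(source_children, modified_text, score):
    # Search the orderings of a subtree's child nodes for the one that best
    # matches the modified text. Subtrees larger than the threshold C bypass
    # re-ordering entirely and keep their original order, which is why run
    # time drops sharply for documents with very large subtrees.
    if len(source_children) > REORDER_THRESHOLD_C:
        return list(source_children)  # bypass: no permutations considered
    best = list(source_children)
    best_score = score(best, modified_text)
    for perm in permutations(source_children):
        candidate = list(perm)
        s = score(candidate, modified_text)
        if s > best_score:
            best, best_score = candidate, s
    return best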


Figure 26. Behavior of the substring match phase (average time in ms taken by the substring match phase for test documents ordered by the average number of TEXT nodes per subtree)

Relative Sizes of Files Produced

From figure 27, we can see that delta sizes roughly increase with increasing size of the source documents.

Figure 27. Sizes of the files produced by the algorithm (in order of increasing source document size; series: expected modified document size, modified text document size, delta size)

The deltas are relatively large (though still smaller than the sources) because of their completeness requirement. If version control is not required, these sizes can be reduced further, yielding significant bandwidth savings.


CHAPTER 8
CONCLUSIONS AND POTENTIAL FOR FUTURE WORK

Conclusions

The results obtained in this exploration of a generic algorithm for synchronizing reduced content documents show that this work can improve upon the state of the art in data access on mobile devices. However, to achieve the goal of ubiquitous data access, the algorithm needs to be tested with other document editing software and extended to handle other classes of applications. This would have to be done in conjunction with the integration of the system into a framework for disconnected operation support, e.g. the Coda File System or UbiData.

Potential for Future Work

Determination of Best Performing Values for Algorithm Parameters

The change detection algorithm explored in this thesis depends on two primary parameters: the threshold value for the least cost edit script matching and the threshold for subtree re-ordering before substring matching. Extensive testing is needed to determine the best performing values, since a fine balance must be struck. For the former, if the match threshold is set too low, heavier modifications will be missed; however, if it is set too high, the possibility of false matches increases. In the case of the re-ordering threshold, a higher value results in a significant performance impact, but a value that is too low may be completely ineffective.
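To make the trade-off concrete, the following is a minimal sketch of how the two parameters might be applied (Python; the parameter names, values and the normalized-distance acceptance test are illustrative assumptions, not the implementation used in this thesis):

# Illustrative values only; determining the best settings is future work.
LES_MATCH_THRESHOLD = 0.4   # maximum normalized edit cost accepted as a match
REORDER_THRESHOLD_C = 6     # largest subtree for which re-ordering is attempted

def accept_match(edit_distance, source_len, modified_len):
    # Accept a node pair as matching when the O(ND) edit distance is a small
    # enough fraction of the combined node length. A lower threshold rejects
    # heavily modified nodes (missed matches); a higher one admits more
    # false matches.
    return edit_distance / max(1, source_len + modified_len) <= LES_MATCH_THRESHOLD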


Exploration of Techniques to Prune the Search for the Best Subtree Permutation

As can be seen from the test results, searching all permutations of subtrees is computationally expensive. Currently a naive heuristic is used to prune the search. A better technique may deliver the same level of correctness with lower performance overhead.

Post Processing to Ensure Validity of the Reintegrated Document

As stated in chapter 3, post processing is necessary to ensure that application consistency requirements are met. This is primarily an issue only for very extensive deletes and rarely occurs in practice. However, it should be handled for a better user experience.

Further Testing and Generalization of the System

The algorithm needs to be tested with a wider range of editors; so far, all tests have been done with Abiword. Testing with other editors would require the appropriate converters to be written, appropriate configuration files to be generated, and Xmerge to be integrated with the test setup (to allow converters to be written as loadable plugins). As a result of such testing, it may be determined that the algorithm does not handle certain conditions arising from differing document structure. In that case, the code would need to be extended to handle all such conditions.

Implementation of the UbiData Integration Architecture

To validate the usefulness of this work, it is necessary to integrate it with the UbiData system. An overview of this integration is given in chapter 2.

Extension to Other Application Domains

The current design and implementation has focused on the document-editing scenario. This caters to the primary usage of most mobile devices (editing documents and e-mail).


However, validating the ideas of a generalized content reduction and reintegration system requires extending this approach to handle other application classes such as spreadsheets, calendars, etc.


LIST OF REFERENCES

[1] Michael J. Franklin, Challenges in Ubiquitous Data Management. Informatics 10 Years Back, 10 Years Ahead. Lecture Notes in Computer Science, Vol. 2000, Springer-Verlag, Berlin, 2001: 24-33.

[2] Richard Han, Pravin Bhagwat, Richard LaMaire, Todd Mummert, Veronique Perret, and Jim Rubas, Dynamic Adaptation in an Image Transcoding Proxy for Mobile Web Browsing. IEEE Personal Communications, Vol. 5, Number 6, December 1998: 8-17.

[3] Harini Bharadvaj, Anupam Joshi and Sansanee Auephanwiriyakul, An Active Transcoding Proxy to Support Mobile Web Access. Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems, West Lafayette, IN, IEEE Computer Society Press, October 1998: 118-126.

[4] Mika Liljeberg, Heikki Helin, Markku Kojo, and Kimmo Raatikainen, Mowgli WWW Software: Improved Usability of WWW in Mobile WAN Environments. Proceedings of the IEEE Global Internet 1996 Conference, London, England, IEEE Communications Society Press, November 1996: 33-37.

[5] Armando Fox and Eric A. Brewer, Reducing WWW Latency and Bandwidth Requirements by Real-Time Distillation. Computer Networks and ISDN Systems, Vol. 28, Issues 7-11, Elsevier Science, May 1996: 1445-1456.

[6] M. Satyanarayanan, Coda: A Highly Available File System for a Distributed Workstation Environment. IEEE Transactions on Computers, Vol. 39, Number 4, 1990: 447-459.

[7] M. Satyanarayanan, Mobile Information Access. IEEE Personal Communications, Vol. 3, Number 1, February 1996: 26-33.

[8] World Wide Web Consortium, Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation, 6 October 2000, available at http://www.w3.org/TR/2000/REC-xml-20001006 (accessed August 2002).

[9] J. Zhang, Mobile Data Service: Architecture, Design, and Implementation. Doctoral dissertation, University of Florida, 2002.


[10] Eyal de Lara, Dan S. Wallach and Willy Zwaenepoel, Position Summary: Architectures for Adaptation Systems. Eighth IEEE Workshop on Hot Topics in Operating Systems (HotOS-VIII), Schloss Elmau, Germany, May 2001, available at http://www.cs.rice.edu/~delara/papers/hotos2001/hotos2001.pdf (accessed August 2002).

[11] Eyal de Lara, Rajnish Kumar, Dan S. Wallach and Willy Zwaenepoel, Collaboration and Document Editing on Bandwidth-Limited Devices. Presented at the Workshop on Application Models and Programming Tools for Ubiquitous Computing (UbiTools'01), Atlanta, Georgia, September 2001, available at http://www.cs.rice.edu/~delara/papers/ubitools/ubitools.pdf (accessed August 2002).

[12] Eyal de Lara, Dan S. Wallach and Willy Zwaenepoel, Puppeteer: Component-based Adaptation for Mobile Computing. Presented at the Third USENIX Symposium on Internet Technologies and Systems, San Francisco, California, March 2001, available at http://www.cs.rice.edu/~delara/papers/usits2001/usits2001.pdf (accessed August 2002).

[13] G. Cobena, S. Abiteboul, A. Marian, Detecting Changes in XML Documents. Presented at the International Conference on Data Engineering, San Jose, California, 26 February-1 March 2002, available at http://www-rocq.inria.fr/~cobena/cdrom/www/xydiff/eng.htm (accessed August 2002).

[14] Amélie Marian, Serge Abiteboul, Laurent Mignet, Change-centric Management of Versions in an XML Warehouse. The VLDB Journal, 2001: 581-590.

[15] A. Helal, J. Hammer, A. Khushraj, and J. Zhang, A Three-tier Architecture for Ubiquitous Data Access. Presented at the First ACS/IEEE International Conference on Computer Systems and Applications, Beirut, Lebanon, 2001, available at http://www.harris.cise.ufl.edu/publications/3tier.pdf (accessed August 2002).

[16] Michael J. Lanham, Change Detection in XML Documents of Differing Levels of Structural Verbosity in Support of Ubiquitous Data Access. Master's thesis, University of Florida, 2002.

[17] M. Satyanarayanan, Brian Noble, Puneet Kumar, Morgan Price, Application-Aware Adaptation for Mobile Computing. Proceedings of the 6th ACM SIGOPS European Workshop, ACM Press, New York, NY, September 1994: 1-4.

[18] B. Noble, M. Satyanarayanan, D. Narayanan, J. Tilton, J. Flinn and K. Walker, Agile Application-Aware Adaptation for Mobility. Operating Systems Review (ACM), Vol. 31, Number 5, December 1997: 276-287.

[19] Brian Noble, System Support for Mobile, Adaptive Applications. IEEE Personal Communications Magazine, Vol. 7, Number 1, February 2000: 44-49.


[20] David Andersen, Deepak Bansal, Dorothy Curtis, Srinivasan Seshan, Hari Balakrishnan, System Support for Bandwidth Management and Content Adaptation in Internet Applications. Proceedings of the Fourth USENIX Symposium on Operating Systems Design and Implementation, USENIX Association, Berkeley, CA, October 2000: 213-226.

[21] D. Li, Sharing Single User Editors by Intelligent Collaboration Transparency. Proceedings of the Third Annual Collaborative Editing Workshop, ACM GROUP, Boulder, Colorado, 2002, available at http://www.research.umbc.edu/~jcampbel/Group01/Li_iwces3.pdf (accessed August 2002).

[22] S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom, Change Detection in Hierarchically Structured Information. Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Press, New York, NY, 1996: 493-504.

[23] OpenOffice.org, Xmerge Project, http://xml.openoffice.org/xmerge/ (accessed August 2002).

[24] E. Myers, An O(ND) Difference Algorithm and Its Variations. Algorithmica, Vol. 1, 1986: 251-266.

[25] SourceGear Corporation, Abiword: Word Processing for Everyone. Software available at http://www.abisource.com/ (accessed August 2002).

[26] Microsoft Corporation, Microsoft Windows CE .NET Storage Manager Architecture, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/wcemain4/htm/cmconstoragemanagerarchitecture.asp (accessed August 2002).

[27] OpenOffice.org, OpenOffice.org Source Project, http://www.openoffice.org/ (accessed August 2002).


BIOGRAPHICAL SKETCH

Ajay Kang was born in New Delhi, India. He received his bachelor's degree in computer engineering from Mumbai University, India. He subsequently worked for a year at Tata Infotech, a software and systems integration company based in India, as an associate systems engineer. During this period he worked on a product for financial institutions and programmed in C++ using the Microsoft Foundation Class (MFC) libraries and the Win32 SDK. During his master's study he was a teaching assistant for CIS3020: Introduction to Computer Science, which gave him his first teaching experience. His interests include mobile computing and algorithms, the result of two excellent classes he took on these subjects at the University of Florida. On graduation he will join Microsoft Corp. as a Software Design Engineer in Test (SDET).