MODELS AND TECHNIQUES FOR THE VISUALIZATION OF LABELED DISCRETE OBJECTS

By
DINESH P. MEHTA

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
1992

ACKNOWLEDGEMENTS

I wish to express my gratitude to my advisor, Professor Sartaj Sahni, for his guidance, encouragement, and help over the last five years. This dissertation would not have been possible without his supervision. I am also grateful to the members of my dissertation committee at Florida, Professors Gerhard Ritter, John Staudhammer, Ted Johnson, and Haniph Latchman, for patiently reading this manuscript and providing insightful feedback. I thank Phil Barry of the Computer Science Department and Bill Fox of the Department of Psychology at the University of Minnesota for their support. I also thank the Computer Science Departments at the University of Florida and the University of Minnesota, and the National Science Foundation, for financial support. I would also like to thank my friends, Mario Lopez, Vijay Rajan, Rose Tsang, and Andrew Lim, for many fruitful discussions and also for other not-so-fruitful, but entertaining, times which made the process enjoyable. Finally, I wish to thank my family, and in particular my parents, Prabha and Prakash Mehta, for their love and support. This dissertation is dedicated to them.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT

CHAPTERS

1 INTRODUCTION

2 STRING VISUALIZATION
  2.1 Problem Specification
  2.2 Visualization Conflicts
  2.3 Refinements of Display Model
  2.4 String Visualization Queries
  2.5 Applications
    2.5.1 General Methods
    2.5.2 Numerical Data
    2.5.3 Molecular Biology
    2.5.4 Textual Data

3 STRING VISUALIZATION ALGORITHMS
  3.1 Definitions
    3.1.1 Compact Symmetric Directed Acyclic Word Graphs (csdawgs)
    3.1.2 Computing Occurrences of Displayable Entities in a String
    3.1.3 Prefix and Suffix Extension Trees
  3.2 Computing Conflicts
    3.2.1 Algorithm to Determine Whether a String is Conflict-Free
    3.2.2 Subword Conflicts
    3.2.3 Prefix-Suffix Conflicts
    3.2.4 Alternative Algorithms
  3.3 Size Restricted Queries
  3.4 Pattern Oriented Queries
  3.5 Statistical Queries
  3.6 Experimental Results
  3.7 Display Algorithms

4 CIRCULAR STRING VISUALIZATION
  4.1 Introduction
  4.2 Definitions
  4.3 Constructing the Csdawg for a Circular String
  4.4 Computing Conflicts Efficiently
  4.5 Applications

5 EXTENSION TO BINARY TREES AND SERIES-PARALLEL GRAPHS
  5.1 Tree Visualization
    5.1.1 Problem Definition
    5.1.2 Applications
    5.1.3 Algorithms
  5.2 Geometric Series-Parallel Graph Visualization
    5.2.1 Problem Definition
    5.2.2 Applications
    5.2.3 Algorithms

6 SYSTEM INTEGRATION
  6.1 Using the Object-Oriented Methodology
  6.2 Overview of a Visualization System

7 CONCLUSIONS

REFERENCES

BIOGRAPHICAL SKETCH
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

MODELS AND TECHNIQUES FOR THE VISUALIZATION OF LABELED DISCRETE OBJECTS

By Dinesh P. Mehta
August 1992
Chairman: Dr. Sartaj Sahni
Major Department: Computer and Information Sciences

The objective of visualization is to extract useful and relevant information from raw data and present it so that it can be easily understood and assimilated by humans. We propose a general model for the visualization of labeled discrete objects. (Examples of labeled discrete objects are strings, circular strings, trees, graphs, etc.) Our model is based on identifying similar subobjects in an object and coding them visually. This is demonstrated by applying the model to linear and circular strings, binary trees, and series-parallel graphs. We also describe and classify the problem of display conflicts that is associated with this visualization model and suggest methods to overcome it. We extend the csdawg data structure for linear strings to circular strings. Efficient, often optimal, algorithms to implement the visualization model for linear and circular strings, binary trees, and series-parallel graphs are also developed. These algorithms are specifically designed to utilize the inheritance and data abstraction features of the object-oriented paradigm. Several of these algorithms were implemented in C++ to evaluate their performance. We also propose a blueprint for a visualization system based on our visualization model. Applications of this visualization technique arise in areas such as molecular biology, computer vision, computer graphics, VLSI CAD, data compression, algorithm animation, and debuggers.

CHAPTER 1
INTRODUCTION

The objective of visualization is to extract useful and relevant information from data and represent it so that it can be easily understood and assimilated by humans.
This enables specialists in application areas to observe trends and patterns in the data. This could lead to a better understanding of various phenomena and provide insights resulting in theories or hypotheses, which can subsequently be proved (or disproved) by formal methods or by further experimentation.

A notion that is useful in the understanding of objects is that of similarity. Multiple occurrences of the same pattern in data which represents the outcome or result of some process indicate the presence of many instances of the same "effect." Analyzing the set of circumstances associated with these occurrences could yield a plausible "cause." For example, multiple occurrences of the same flaw in a paper roll might reveal the faulty component in the paper production process. Similarly, multiple occurrences of the same patterns in an object whose structure is being studied indicate the presence of many instances of the same "cause." Observing the phenomena which occur in the presence of these patterns could shed some light on the "effect" of the patterns. For example, multiple occurrences of patterns in the DNA strands of organisms could manifest themselves as common characteristics shared by the organisms. This linkage of cause and effect is a fundamental goal in many scientific disciplines.

Most work in visualization so far, which attempts to facilitate the scientific goals outlined above, consists of choosing methods to display individual units of data, so that patterns and trends become visually obvious when the data is seen in its entirety [1, 2]. However, the onus of detecting patterns and trends still lies on the user or the specialist. This becomes more crucial when the amount of data is large and the user's perceptual faculties are overburdened. Consequently, errors of omission (not seeing patterns which are actually there) become more likely.
Our work attempts to shift the responsibility of detecting patterns from the human to the computer. This is done by devising algorithms to detect patterns in the data and then making these patterns available to the user for further scrutiny. Multiple occurrences of the same pattern can be made visually explicit by color coding them with the same color. Other methods could also be used, such as "flashing" occurrences of the same pattern on the screen. If visual schemes are not appropriate, recurring patterns could simply be provided as a list of occurrences which the user goes through. In the context of a visualization environment, this technique could be used either in a standalone manner or as a supplement to other visualization methods.

For the technique outlined above to be useful, we propose the following principles on displaying patterns, which we shall call the Similarity Paradigm. The similarity paradigm forms the basis for the visualization techniques used in this dissertation.

Principle 1: Two patterns should be displayed to look similar iff they are similar.

Principle 2: The degree of similarity between the displays of two patterns should be proportional to the actual similarity of the patterns.

Chapter 2 discusses string visualization, while Chapter 3 provides algorithms which implement string visualization queries. Chapter 4 discusses the extension of our visualization techniques for linear strings to circular strings, and Chapter 5 deals with the visualization of binary trees and series-parallel graphs. Finally, Chapter 6 explains how the work of the first five chapters may be integrated to form a visualization system based on the similarity paradigm.

CHAPTER 2
STRING VISUALIZATION

2.1 Problem Specification

A complete specification of a visualization problem based on the similarity paradigm requires one to provide the following five items. These are illustrated using linear string visualization as an example.
Henceforth, the word "string" shall refer to linear strings.

(1) Structure of the Data to be Visualized: Does the data represent a string, a series-parallel graph, a binary tree, etc.? In the string visualization example, the data is a string S of length n whose characters are chosen from a fixed alphabet Σ of constant size.

(2) Structure of Patterns: This depends largely on the structure of the data. If the data represents a string, then the structure of the pattern could be a substring (a contiguous sequence of string elements), a subsequence (a noncontiguous sequence of string elements), etc. Patterns can have other constraints imposed upon them. For example, a pattern may be required to be of a minimum size. In the string visualization example, the pattern is a substring of S defined uniquely by its start and end positions.

(3) Maximality of Patterns: If a pattern is repeated in the data, then any subpattern of the pattern is also repeated. For example, if abc repeats in a string, so do ab, bc, a, b, and c. In particular, if all occurrences of ab occur in the context of abc, then attempting to distinguish between ab and abc does not serve any useful purpose. Defining maximality and restricting the user's attention to maximal patterns helps to simplify the display. In the string visualization example, a pattern is said to be maximal iff its occurrences are not all preceded by the same letter, nor all followed by the same letter. Consider the string S = abczdefydefxabc. Here, the empty string λ, abc, def, and S are the only maximal patterns. The occurrences of def are preceded by different letters (z and y) and followed by different letters (y and x). The occurrences of abc are not all preceded by the same letter (the first occurrence does not have a predecessor) nor all followeded by the same letter. However, de is not maximal because all its occurrences in S are followed by f.
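The maximality test above is local to each pattern's occurrence list, so it can be restated directly as code. The sketch below is not one of the dissertation's algorithms (those are csdawg-based and appear in Chapter 3); it is a deliberately brute-force, executable restatement of the definition for small test strings, and the function names are illustrative only.

```cpp
#include <set>
#include <string>
#include <vector>

// True iff every occurrence (start positions in occ) of a length-len
// pattern has a neighbour on the chosen side and all those neighbours
// are the same letter.  An occurrence at a string boundary has no
// neighbour, which by itself breaks the "all same" condition, matching
// the treatment of boundary occurrences in Section 2.1.
static bool allSameNeighbour(const std::string& s,
                             const std::vector<int>& occ,
                             int len, bool before) {
    char seen = 0;
    bool first = true;
    for (int j : occ) {
        int k = before ? j - 1 : j + len;
        if (k < 0 || k >= (int)s.size()) return false;  // boundary occurrence
        if (first) { seen = s[k]; first = false; }
        else if (s[k] != seen) return false;
    }
    return true;
}

// Enumerates all non-empty maximal patterns of s directly from the
// definition: a pattern is maximal iff its occurrences are not all
// preceded by the same letter, nor all followed by the same letter.
// O(n^4) time; a check of the definition, not an efficient algorithm.
std::set<std::string> maximalPatterns(const std::string& s) {
    std::set<std::string> result;
    const int n = (int)s.size();
    for (int i = 0; i < n; ++i) {
        for (int len = 1; i + len <= n; ++len) {
            std::string p = s.substr(i, len);
            if (result.count(p)) continue;
            std::vector<int> occ;  // start positions of occurrences of p
            for (int j = 0; j + len <= n; ++j)
                if (s.compare(j, len, p) == 0) occ.push_back(j);
            if (!allSameNeighbour(s, occ, len, true) &&
                !allSameNeighbour(s, occ, len, false))
                result.insert(p);
        }
    }
    return result;
}
```

On S = abczdefydefxabc this reports exactly abc, def, and S itself, matching the discussion above (the empty string, also maximal, is excluded by construction).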
(4) Measure of Similarity (MS): A measure of similarity would consist of attaching numerical values to pairs of maximal patterns which indicate the degree of similarity between the two patterns. In the string visualization example, the measure of similarity could be defined as follows: if two patterns are identical, then MS = 1; otherwise, MS = 0. In other words, two patterns are defined to be similar iff they are identical. There is no concept of "degree of similarity" in this definition.

(5) Display Models: This addresses the issues of which patterns are displayed and how they are displayed. While choosing display models, Principles 1 and 2 of Chapter 1 should be kept in mind. In the string visualization example, a pattern is said to be a displayable entity (or displayable) iff it is maximal, non-null, and occurs more than once in S (in this case, all maximal patterns are displayable entities with the exception of S, which occurs only once, and λ). All instances of the same displayable entity are highlighted in the same color. Instances of different displayable entities are highlighted in different colors (there is no relationship between the colors representing different displayable entities). In the example string S, abc and def are the only displayable entities. So, S would be displayed by highlighting abc in one color and def in another as shown in Figure 2.1.

Figure 2.1. Highlighting Displayable Entities

2.2 Visualization Conflicts

Consider the string S = abcicdefcdegabchabcde and its displayable entities, abc and cde (both are maximal and occur thrice). The displayable entities abc and cde must be highlighted in different colors. Notice, however, that abc and cde both occur in the string abcde, which is a suffix of S. Clearly, both displayable entities cannot be highlighted in different colors in abcde as required by the model. This is a consequence of the fact that the letter c occurs in both displayable entities.
This situation is known as a prefix-suffix conflict because a prefix of one displayable entity (here, cde) is a suffix of the other (here, abc). Note, also, that c is a displayable entity in S. Consequently, all occurrences of c must be highlighted in a color different from those used for abc and cde. But this is impossible as c is a subword of both abc and cde. This situation is referred to as a subword conflict. Formally,

(i) A subword conflict between two displayable entities D1 and D2 in S exists iff D1 is a substring of D2.

(ii) A prefix-suffix conflict between two displayable entities D1 and D2 in S exists iff there exist non-null strings Sp, Sm, and Ss such that SpSmSs occurs in S, SpSm = D1, and SmSs = D2. We say that a prefix-suffix conflict exists between D1 and D2 with respect to Sm. Sm is known as the intersection of the conflict.

Figure 2.2. Possible Configurations

2.3 Refinements of Display Model

When subword and prefix-suffix conflicts occur, we need some criteria to determine which of the information previously required to be displayed actually gets displayed. For instance, in the example string S = abcicdefcdegabchabcde from the previous section, three possible non-conflicting, displayable subsets are shown in Figure 2.2. In this section we present three refinements to the display model of Section 2.1 which attempt to overcome the display difficulties created by conflicts. They are:

(1) One-Copy, Maximum-Content, No-Overlap: In this model, exactly one copy of the string is displayed. Occurrences of displayable entities are selected so that there are no mutually conflicting occurrences. Given this restriction, the model requires occurrences to be selected so that the amount of information conveyed by the display is maximized. This goal may be achieved in three ways.
Interactive: The user selects occurrences interactively by using his/her judgement. Typically, this would be done by examining the occurrences which are involved in a conflict and choosing the one that is the most meaningful.

Automatic: A numeric weight is assigned to each occurrence. The higher the weight, the greater the desirability of displaying the corresponding occurrence. Criteria that could be used in assigning weights to occurrences include length, position, number of occurrences of the pattern, semantic value of the displayable entity, information on conflicts, etc. The information is then fed to a routine which selects a set of occurrences so that the sum of their weights is maximized. For example, consider the string S = abcicdefcdegabchabcde of Section 2.2. If the weight assigned to each occurrence of abc is 4, cde is 2, and c is 3, then Figure 2.3 shows the optimal display configuration. The total weight of the display is 18.

Figure 2.3. Optimal Configuration under Model 1

Semi-Automatic: In a practical environment, the most appropriate method would be a hybrid of the Interactive and Automatic approaches described above. The user could select some occurrences that he/she wants included in the final display. The selection of the remaining occurrences can then be performed by a routine which maximizes the display information.

(2) Multiple-Copy, No-Overlap: Multiple copies of the string may be displayed. Mutually disjoint sets of occurrences are associated with the copies (one set per copy), so that the occurrences corresponding to each copy are mutually non-conflicting. There are two approaches:

(a) A constant number (max) of copies of the string may be displayed. The total content of the display, summed over all max copies, is to be maximized. For example, consider the string S = abcicdefcdegabchabcde of Section 2.2. If the weights corresponding to abc, cde, and c are 4, 2, and 3, respectively, and max = 2, then Figure 2.4 shows the optimal display configuration. The total weight of the display is 31.

Figure 2.4. Optimal Configuration under Model 2(a)

(b) No limit is imposed on the number of copies of the string that may be displayed. Each occurrence is highlighted in exactly one copy of the string. The number of copies of the string used should be minimized. In the worst case, O(n^2) copies of the string may be required. The displayable entities in the string S = a^n are a^i, for 1 ≤ i < n. So, each proper substring of a^n is an occurrence of a displayable entity of S. Consider the character at position n/2. It occurs in approximately n^2/4 occurrences of displayable entities (since there are approximately n/2 positions to the left and to the right of it, and all but one combination of these represents an occurrence of a displayable entity). Since a different copy of the string is required to display each of these occurrences without overlap, the number of copies of the string is approximately n^2/4, or O(n^2). Figure 2.5 shows a configuration for the string S = abcicdefcdegabchabcde of Section 2.2.

Figure 2.5. Configuration under Model 2(b)

(3) Single-Copy, Maximum-Content, Subword-Overlap: Exactly one copy of the string may be displayed. Occurrences are selected so that no pair of occurrences has a prefix-suffix conflict. Subword conflicts are allowed. As in the One-Copy, Maximum-Content, No-Overlap model, the goal is to maximize the information conveyed. Again, there are three approaches for selecting occurrences: Automatic, Interactive, and Semi-Automatic. For example, consider the string S = abcicdefcdegabchabcde of Section 2.2.

Figure 2.6. Optimal Configuration under Model 3
If the weights corresponding to each occurrence of abc, cde, and c are 4, 2, and 3, respectively, then Figure 2.6 shows the optimal display configuration. The total value of the display is 31. Note that it is crucial to all methods to get information on prefix-suffix and subword conflicts.

2.4 String Visualization Queries

The following algorithmic problems arise as a result of the discussion in the previous section. Given a string S of length n whose elements are chosen from an alphabet Σ of fixed size:

P1. Obtain a list of all displayable entities and their occurrences.
P2. Obtain a list of all prefix-suffix conflicts.
P3. Obtain a list of all subword conflicts.

Given a list of occurrences of displayable entities and a weight associated with each occurrence:

P4. Obtain a set of mutually non-conflicting occurrences so that the sum of the weights associated with them is maximum (Model 1).
P5. Obtain max mutually disjoint sets of mutually non-conflicting occurrences so that the sum of the weights associated with them is maximum (Model 2(a)).
P6. Obtain the minimum number of mutually disjoint sets of mutually non-conflicting occurrences required to partition the set of all occurrences (Model 2(b)).
P7. Obtain a set of occurrences, such that no two occurrences in the set have a prefix-suffix conflict, so that the sum of the weights associated with them is maximized (Model 3).

Algorithms for P1-P4, P6, and P7 are presented in Chapter 3.

2.5 Applications

An abstract strategy has been outlined for the visualization of strings. This section discusses some general methods for applying this strategy to specific areas. Applications to molecular biology, text, and numerical sequences are also outlined.

2.5.1 General Methods

In order to apply this strategy successfully to actual data, it is important to first check that the data conforms to the definition of strings provided in Section 2.1.
If this is not the case, then it may be possible to transform the data so that it satisfies the definition without losing vital information in the process.

(1) If a string consists of characters which are chosen from a fixed alphabet of a small size (< 50, say), then it is already in the correct format. For example, DNA sequences are made up from an alphabet of 4 elements. English text is made up of an alphabet of 26 characters and some special symbols.

(2) Otherwise, if a function f can be defined for each element in the alphabet such that (a) the range of f can be determined and is of constant size, and (b) f(element1) = f(element2) iff element1 and element2 are similar, then the given string may be converted to another string which is obtained by applying f to each element of the original string. The resulting string can now be input to the visualization routines.

For example, consider a sequence of objects which are chosen from a large (possibly infinite) alphabet. Assume that a set of properties, P = {P1, P2, ..., Pm}, is associated with each object. Suppose that patterns of property Pi of the sequence of objects are interesting (where Pi may take on one of a constant, fixed number of values). Then, for the purposes of visualization, each object in the sequence is replaced by the corresponding Pi value. This approach can be used with the other properties as well. Some examples where this approach is useful are:

(1) Protein Sequences [2]: A protein sequence consists of a sequence of amino acids. While the number of amino acids that could form a sequence is large, it is possible to place amino acids in groups on the basis of physical properties such as hydrophobicity, acidity, polarity, etc. Amino acids in the sequence may then be replaced by a symbol representing the group to which they belong.
(2) Chronological Sequences of Multidimensional Data [1]: Here, a number of measurements relating to a particular scientific phenomenon are taken at regular intervals of time. The measurements for each variable are classified as low, medium, or high. Consequently, the sequence of multidimensional data may be replaced by a sequence of the symbols L, M, and H, representing low, medium, and high values, respectively, of the variable being studied.

Finally, many applications would benefit from comparisons between two or more different strings (as opposed to comparisons within the same string). This can also be supported by a simple extension of our techniques.

2.5.2 Numerical Data

An important category of data is numerical data. These arise whenever properties of objects are described by measurements. Numerical information, in general, is chosen from large alphabets whose sizes are determined by the accuracy of measurement required. Clearly, numerical sequences cannot be directly input to our visualization system. This is remedied by determining the range of values that a variable can take on. This range is then subdivided into a constant number of subranges (this is essentially the same strategy used by Beddow [1]). Each value in the sequence is then replaced by a symbol representing the subrange to which it belongs. The resulting sequence may then be input to our visualization system. Consider a sequence of numbers which lie in the range 1-200. Assume that subranges have been defined as 1-20, 21-40, ..., 181-200, which are represented by the symbols a, b, ..., j, respectively. So, the sequence 7 142 63 94 6 148 69 becomes ahdeahd. Sequences of numbers, such as financial data, are usually studied by using graphs. We expect that the techniques outlined here, if used in conjunction with graphs, could reduce the possibility of overlooking important patterns in the data.
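The subrange encoding just described is a one-line transform. A sketch follows, with the 1-200 range and width-20 subranges of the running example hard-coded as assumptions; real data would substitute its own range and resolution, and the function name is illustrative only.

```cpp
#include <string>
#include <vector>

// Maps a sequence of measurements in 1..200 onto the alphabet a..j,
// one letter per width-20 subrange (1-20 -> 'a', 21-40 -> 'b', ...,
// 181-200 -> 'j').  Bounds and width are taken from the running example.
std::string binSequence(const std::vector<int>& values) {
    std::string out;
    for (int v : values)
        out += (char)('a' + (v - 1) / 20);
    return out;
}
```

For instance, binSequence({7, 142, 63, 94, 6, 148, 69}) yields the string ahdeahd used above.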
This can be done either by appropriately coloring pieces of the graph or by coloring a string which is aligned with the graph. Often, comparisons are made not between the values of numbers in a sequence, but between the increases/decreases in consecutive values. For example, in [5 20 15 75 90 85], 5 20 15 is not obviously related to 75 90 85. However, the increases/decreases in values are identical: an increase of 15 followed by a decrease of 5. Information of this type may be obtained by transforming the string by taking the difference between successive values before inputting it to the visualization system; i.e., 15 -5 60 15 -5. Similar transformations may be used for percentage increases/decreases, second order differences, etc.

2.5.3 Molecular Biology

In molecular biology, RNA, DNA, and protein sequences are studied. Sequence comparison helps to answer questions about evolution, structure and function in organisms, and the structural configuration of individual RNA molecules [3]. Of particular importance are repeating patterns and their relative positions [4]. In [4], the csdawg data structure, which is described in the next chapter, is used to analyze sequences. Our work improves upon this by suggesting more effective display methods and by introducing more sophisticated analysis techniques involving prefix-suffix and subword conflicts. For example, let D1 and D2 be displayable entities and let D1 be a subword of D2. If the fraction (number of occurrences of D1 contained in D2)/(number of occurrences of D1) is close to 1, then we can infer that D1 usually occurs as a subword of D2, which could mean that D1 does not perform any significant function except as a subword of D2. Suppose patterns P1 and P2 perform functions F1 and F2 in an organism.
Then, if (number of prefix-suffix conflicts between P1 and P2)/(min{number of occurrences of P1, number of occurrences of P2}) is close to 1, we can infer that F1 and F2 are generally performed by the same region and are therefore related in some way.

2.5.4 Textual Data

Structural information about text may be obtained by studying prefix-suffix and subword conflicts. Information about the contexts in which certain phrases are used is provided by subword conflicts. Information on the combination of phrases is provided by prefix-suffix conflicts. This information can be used to identify anomalies in sentence structure and possibly to identify the author of a text by its structure. It can also be used to decipher text coded using sophisticated substitution ciphers where patterns are substituted by other patterns.

CHAPTER 3
STRING VISUALIZATION ALGORITHMS

3.1 Definitions

3.1.1 Compact Symmetric Directed Acyclic Word Graphs (csdawgs)

The csdawg data structure is used to represent a string or a set of strings. It evolved from other string data structures such as position trees [5], suffix trees [6], directed acyclic word graphs [7, 8], etc. A csdawg CSD(S) corresponding to a string S is a directed acyclic graph defined by a set of vertices V(S), a set R(S) of labeled directed edges called right extension edges (re-edges), and a set L(S) of labeled directed edges called left extension edges (le-edges). Each vertex of V(S) represents a maximal substring of S. Specifically, V(S) consists of a source (which represents the empty word, λ), a sink (which represents S), and a vertex corresponding to each displayable entity of S. Let str(v) denote the string represented by vertex v, for v ∈ V(S). Define the implication, imp(S, α), of a string α in S to be the smallest superword of α in {str(v) : v ∈ V(S)}, if such a superword exists. Otherwise, imp(S, α) does not exist.
Re-edges from a vertex v1 are obtained as follows: for each letter x in Σ, if imp(S, str(v1)x) exists and is equal to str(v2) = β str(v1) x γ, then there exists an re-edge from v1 to v2 with label xγ. If β is the empty string, then the edge is known as a prefix extension edge. Le-edges from a vertex v1 are obtained as follows: for each letter x in Σ, if imp(S, x str(v1)) exists and is equal to str(v2) = γ x str(v1) β, then there exists an le-edge from v1 to v2 with label γx. If β is the empty string, then the edge is known as a suffix extension edge.

Figure 3.1. Csdawg for S = cdefabcgabcde (L(S) not shown)

Figure 3.1 shows V(S) and R(S) corresponding to S = cdefabcgabcde. Here abc, cde, and c are the displayable entities of S. There are two outgoing re-edges from the vertex representing abc. These edges correspond to x = d and x = g. imp(S, abcd) = imp(S, abcg) = S. Consequently, both edges are incident on the sink. There are no edges corresponding to the other letters of the alphabet, as imp(S, abcx) does not exist for x ∈ {a, b, c, e, f}. The space required for CSD(S) is O(n) and the time needed to construct it is O(n) [7, 9]. While we have defined the csdawg data structure for a single string S, it can be extended to represent a set of strings.

Algorithm LinearOccurrences(S, v)
{ Find all occurrences of str(v) in S }
    Occurrences(S, v, 0)

Procedure Occurrences(S: string, u: vertex, i: integer)
1 begin
2   if str(u) is a suffix of S
3     then output(|S| - i);
4   for each re-edge e from u do
5   begin
6     Let w be the vertex on which e is incident;
7     Occurrences(S, w, |label(e)| + i);
8   end;
9 end;

Figure 3.2. Algorithm for obtaining all occurrences of a displayable entity

3.1.2 Computing Occurrences of Displayable Entities in a String

Figure 3.2 presents an algorithm for computing the end positions of all the occurrences of str(v) in S. This is based on the outline provided by Blumer et al. [9]. First, the algorithm determines whether str(v) is a suffix of S.
If so, the end position of S is reported. The remaining occurrences of str(v) are computed recursively by examining each node reached by an outgoing re-edge from v. The complexity of LinearOccurrences(S, v) is proportional to the number of occurrences of str(v) in S.

3.1.3 Prefix and Suffix Extension Trees

The prefix extension tree PET(S, v) at vertex v in V(S) is a subgraph of CSD(S) consisting of (i) the root v, (ii) PET(S, w), defined recursively, for each vertex w in V(S) such that there exists a prefix extension edge from v to w, and (iii) the prefix extension edges leaving v. The suffix extension tree SET(S, v) at v is defined analogously. In Figure 3.1, PET(S, v), where v is the vertex representing the substring c, consists of the vertices representing c and cde, and the sink. It also includes the prefix extension edges from c to cde and from cde to the sink. Similarly, SET(S, v), where v is the vertex representing c, consists of the vertices representing c and abc and the suffix extension edge from c to abc (not shown in the figure).

Lemma 3.1.1 PET(S, v) (SET(S, v)) contains a directed path from v to a vertex w in V(S) iff str(v) is a prefix (suffix) of str(w).

Proof: If there is a directed path in PET(S, v) from v to some vertex w, then from the definition of a prefix extension edge and the transitivity of the "prefix-of" relation, str(v) must be a prefix of str(w). If str(v) is a prefix of str(w), then there exists a series of re-edges from v to w such that str(v), when concatenated with the labels on these edges, yields str(w). But each of these re-edges must be a prefix extension edge. So a directed path from v to w exists in PET(S, v). The proof for SET(S, v) is analogous. ∎

3.2 Computing Conflicts

3.2.1 Algorithm to Determine Whether a String is Conflict-Free

Before describing our algorithm to determine if a string is free of conflicts, we establish some properties of conflict-free strings that will be used in this algorithm.
Lemma 3.2.1 If a prefix-suffix conflict occurs in a string S, then a subword conflict must occur in S.

Proof: If a prefix-suffix conflict occurs between two displayable entities W1 and W2 in S, then there exists an occurrence of W_p W_m W_s in S such that W_p W_m = W1 and W_m W_s = W2. Since W1 and W2 are maximal, W1 is not always followed by the same letter and W2 is not always preceded by the same letter; i.e., W_m is not always followed by the same letter and W_m is not always preceded by the same letter. So W_m is maximal. Moreover, W1 occurs at least twice in S (since W1 is a displayable entity), so W_m occurs at least twice (since W_m is a subword of W1) and is therefore a displayable entity of S. But W_m is a subword of W1, so a subword conflict occurs between W_m and W1 in S. □

Corollary 3.2.1 The intersection of a prefix-suffix conflict between two displayable entities is itself a displayable entity.

Corollary 3.2.2 If string S is free of subword conflicts, then it is free of conflicts.

Lemma 3.2.2 str(w) is a subword of str(v) in S iff there is a path comprising right extension and suffix extension edges from w to v.

Proof: From the definition of CSD(S), if there exists an re-edge from u to v, then str(u) is a subword of str(v). If there exists a suffix extension edge from u to v, then str(u) is a suffix (and therefore a subword) of str(v). So if there exists a path comprising right and suffix extension edges from w to v, then by transitivity str(w) is a subword of str(v). Conversely, if str(w) is a suffix of str(v), then there is a path of suffix extension edges from w to v (Lemma 3.1.1). If str(w) is a subword, but not a suffix, of str(v), then from the definition of a csdawg there is a path of re-edges from w to a vertex representing a suffix of str(v), from which a path of suffix extension edges leads to v. □

Let V_source be the set of all vertices in V(S) such that an re-edge or suffix extension edge exists from the source vertex of CSD(S) to each element of V_source.
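Corollary 3.2.1 can be checked empirically on small strings. The sketch below (a brute-force illustration with helper names of our own choosing; positions are 0-indexed internally) collects the overlap string of every prefix-suffix conflict, i.e., every pair of entity occurrences that properly overlap, and verifies that each overlap is itself a displayable entity.

```python
def displayable_entities(s):
    """Maximal repeated substrings of s (brute force, as in the definition)."""
    n, ents = len(s), set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = s[i:j]
            occ = [k for k in range(n) if s.startswith(w, k)]
            if len(occ) < 2:
                continue
            before = {s[k - 1] if k > 0 else None for k in occ}
            after = {s[k + len(w)] if k + len(w) < n else None for k in occ}
            if len(before) > 1 and len(after) > 1:
                ents.add(w)
    return ents

def psc_overlaps(s):
    """Overlap strings of all prefix-suffix conflicts: occurrence pairs
    (A, B) of displayable entities with A.start < B.start <= A.end < B.end."""
    ents = displayable_entities(s)
    occ = [(k, k + len(w) - 1) for w in ents
           for k in range(len(s)) if s.startswith(w, k)]
    return {s[b0:a1 + 1] for (a0, a1) in occ for (b0, b1) in occ
            if a0 < b0 <= a1 < b1}

# every prefix-suffix overlap is itself a displayable entity (Corollary 3.2.1)
s = "cdefabcgabcde"
assert psc_overlaps(s) <= displayable_entities(s)
```

On the running example the only overlap is "c", arising from the occurrence of abc ending at position 11 and the occurrence of cde starting there.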
Lemma 3.2.3 String S is conflict-free iff all right extension and suffix extension edges leaving vertices in V_source end at the sink vertex of CSD(S).

Proof: A string S is conflict-free iff there does not exist a right or suffix extension edge between two vertices, neither of which is the source or sink of CSD(S) (Corollary 3.2.2 and Lemma 3.2.2). Assume that S is conflict-free, and consider any vertex v in V_source. If v has a right or suffix extension out edge <v, w>, then v ≠ sink. If w ≠ sink, then str(v) is a subword of str(w) and the string is not conflict-free, contradicting the assumption on S. Next, assume that all right and suffix extension edges leaving vertices in V_source end at the sink vertex. Clearly, there cannot exist right or suffix extension edges between any two vertices v and w (v ≠ sink, w ≠ sink) in V_source. Further, there cannot exist a vertex x in V(S) (x ≠ source, x ≠ sink) such that x ∉ V_source: for such a vertex to exist, there would have to be a path consisting of right and suffix extension edges from a vertex in V_source to x, which is clearly not the case. So S is conflict-free. □

The preceding development leads to algorithm NoConflicts (Figure 3.3).

Theorem 3.2.1 Algorithm NoConflicts is both correct and optimal.

Algorithm NoConflicts(S)
1. Construct CSD(S).
2. Compute V_source.
3. Scan all right and suffix extension out edges from each element of V_source. If any edge points to a vertex other than the sink, then a conflict exists. Otherwise, S is conflict-free.

Figure 3.3. Algorithm to determine whether a string is conflict-free

Proof: Correctness is an immediate consequence of Lemma 3.2.3. Step 1 takes O(n) time [9]. Step 2 takes O(1) time, since |V_source| ≤ 2|Σ|. Step 3 takes O(1) time, since the number of out edges leaving vertices in V_source is less than 4|Σ|². So NoConflicts takes O(n) time, which is optimal. In fact, Steps 2 and 3 can be merged into Step 1, and the construction of CSD(S) aborted as soon as an edge that violates Lemma 3.2.3 is created.
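By Corollary 3.2.2 and Lemma 3.2.1, a string is conflict-free exactly when no displayable entity is a subword of another (a subword relation between entities always produces a conflict, and prefix-suffix conflicts imply subword conflicts). That characterization gives a brute-force reference check for NoConflicts; the sketch below is our own illustration, not the O(n) csdawg-based test.

```python
def displayable_entities(s):
    """Maximal repeated substrings of s (brute force, as in the definition)."""
    n, ents = len(s), set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = s[i:j]
            occ = [k for k in range(n) if s.startswith(w, k)]
            if len(occ) < 2:
                continue
            before = {s[k - 1] if k > 0 else None for k in occ}
            after = {s[k + len(w)] if k + len(w) < n else None for k in occ}
            if len(before) > 1 and len(after) > 1:
                ents.add(w)
    return ents

def conflict_free(s):
    # conflict-free iff no displayable entity is a proper subword of another
    ents = displayable_entities(s)
    return not any(a != b and a in b for a in ents for b in ents)

assert not conflict_free("cdefabcgabcde")  # c is a subword of abc (and of cde)
assert conflict_free("abab")               # sole displayable entity: ab
```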
□

3.2.2 Subword Conflicts

Consider the problem of finding all subword conflicts in string S. Let k_s be the number of subword conflicts in S. Any algorithm to solve this problem requires (i) O(n) time to read the input string and (ii) O(k_s) time to output all subword conflicts. So Ω(n + k_s) is a lower bound on the time complexity of this problem. For the string S = aⁿ, k_s = n⁴/24 + n³/4 − 13n²/24 − 3n/4 + 1 = Θ(n⁴). This is an upper bound on the number of conflicts, as the maximum number of substring occurrences is O(n²) and, in the worst case, all occurrences conflict with each other. In this section, a compact method for representing conflicts is presented. Let k_sc be the size of this representation. For aⁿ, k_sc = n³/6 + n²/2 − 5n/3, which is Θ(n³). Compaction never increases the size of the output and may yield up to a factor of n reduction, as in this example. The compaction method is described below. Consider S = abcdbcgabcdbchbc. The displayable entities are D1 = abcdbc and D2 = bc. The end positions of D1 are 6 and 13, while those of D2 are 3, 6, 10, 13, and 16. A list of the subword conflicts between D1 and D2 can be written as {(6,3), (6,6), (13,10), (13,13)}. The first element of each ordered pair is the last position of the instance of the superstring (here D1) involved in the conflict; the second element is the last position of the instance of the substring (here D2) involved in the conflict. The cardinality of this set is the number of subword conflicts between D1 and D2, which is given by frequency(D1) × (number of occurrences of D2 in D1). Since each conflict is represented by an ordered pair, the size of the output is 2 × frequency(D1) × (number of occurrences of D2 in D1). Observe that the occurrences of D2 in D1 are in the same relative positions in all instances of D1. It is therefore possible to write the list of subword conflicts between D1 and D2 as (6,13):(0,3).
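The closed form for k_s on S = aⁿ can be sanity-checked by brute force. The sketch below (helper names are ours) counts one subword conflict per pair (instance of the superword, occurrence of the subword inside it), exactly as in the frequency(D1) × occurrences(D2 in D1) accounting used above, and compares the result against the polynomial rewritten over the common denominator 24.

```python
def displayable_entities(s):
    """Maximal repeated substrings of s (brute force, as in the definition)."""
    n, ents = len(s), set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = s[i:j]
            occ = [k for k in range(n) if s.startswith(w, k)]
            if len(occ) < 2:
                continue
            before = {s[k - 1] if k > 0 else None for k in occ}
            after = {s[k + len(w)] if k + len(w) < n else None for k in occ}
            if len(before) > 1 and len(after) > 1:
                ents.add(w)
    return ents

def count_occ(t, w):
    # number of (possibly overlapping) occurrences of w in t
    return sum(t.startswith(w, k) for k in range(len(t)))

def num_subword_conflicts(s):
    # one conflict per (superword instance, subword occurrence inside it)
    ents = displayable_entities(s)
    return sum(count_occ(s, sup) * count_occ(sup, sub)
               for sup in ents for sub in ents if sub != sup and sub in sup)

# k_s = n^4/24 + n^3/4 - 13n^2/24 - 3n/4 + 1, over denominator 24
for n in range(3, 9):
    assert num_subword_conflicts("a" * n) == (n**4 + 6*n**3 - 13*n**2 - 18*n + 24) // 24
```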
The first list gives all the occurrences in S of the superstring (D1); the second gives the relative positions of all the occurrences of the substring (D2) in the superstring (D1), measured from the right end of D1. The size of the output is now frequency(D1) + (number of occurrences of D2 in D1), which is more economical than our earlier representation. In general, a substring Di of S will have subword conflicts with many instances of a number of displayable entities (say, Dj, Dk, ..., Dz) of which it (Di) is the superword. We would then write the conflicts of Di as

(l_1^i, ..., l_{m_i}^i) : (l_1^j, ..., l_{m_j}^j), (l_1^k, ..., l_{m_k}^k), ..., (l_1^z, ..., l_{m_z}^z).

Here, the l^i's represent all the occurrences of Di in S; the l^j's, l^k's, ..., l^z's represent the relative positions of all the occurrences of Dj, Dk, ..., Dz, respectively, in Di. One such list will be required for each displayable entity that contains other displayable entities as subwords.

Algorithm SubwordConflicts(S: string)
{Identify displayable entities that contain subword displayable entities}
1 begin
2   for each vertex v in CSD(S) do
3   begin
4     v.subword := false;
5     for all vertices u such that a right or suffix extension edge <u, v> is incident on v do
6       if u ≠ source then v.subword := true;
7   end
8   for each vertex v in CSD(S) such that v ≠ sink and v.subword is true do
9     GetSubwords(S, v);
10 end

Procedure GetSubwords(S, v)
{Compute subword displayable entities of str(v)}
1 begin
2   LinearOccurrences(S, v);
3   v.sublist := {0};
4   SetUp(v);
5   SetSuffixes(v);
6   for each vertex x (≠ source) in reverse topological order of SG(S, v) do
7   begin
8     if str(x) is a suffix of str(v) then x.sublist := {0} else x.sublist := {};
9     for each w in SG(S, v) on which an re-edge e from x is incident do
10    begin
11      for each element l in w.sublist do
12        x.sublist := x.sublist ∪ {l + |label(e)|};
13    end;
14    output(x.sublist);
15  end;
16 end

Figure 3.4. Optimal algorithm to compute all subword conflicts
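The compact representation for the worked example can be reproduced mechanically. The sketch below (variable names are ours) recovers the list (6,13):(0,3) for D1 = abcdbc and D2 = bc in S = abcdbcgabcdbchbc, and compares the sizes of the two representations.

```python
s, d1, d2 = "abcdbcgabcdbchbc", "abcdbc", "bc"

# 1-indexed end positions of D1's occurrences in S
ends = [k + len(d1) for k in range(len(s)) if s.startswith(d1, k)]

# positions of D2 inside D1, measured from the right end of D1 (0 = flush right)
rel = sorted(len(d1) - (k + len(d2))
             for k in range(len(d1)) if d1.startswith(d2, k))

compact_size = len(ends) + len(rel)       # frequency(D1) + occurrences of D2 in D1
original_size = 2 * len(ends) * len(rel)  # two numbers per individual conflict

assert ends == [6, 13] and rel == [0, 3]
assert (compact_size, original_size) == (4, 8)
```

Here the compact form halves the output; for S = aⁿ the saving grows to a factor of Θ(n).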
The following equalities are easily obtained:

Size of compact representation = Σ_{Di ∈ D} (f_i + Σ_{Dj ∈ D_i} r_ij).
Size of original representation = 2 Σ_{Di ∈ D} (f_i × Σ_{Dj ∈ D_i} r_ij).

Here, f_i is the frequency of Di (only Di's that have conflicts are considered), and r_ij is the frequency of Dj in one instance of Di. D represents the set of all displayable entities of S, while D_i represents the set of all displayable entities that are subwords of Di. SG(S, v), for v ∈ V(S), is defined as the subgraph of CSD(S) consisting of the set of vertices SV(S, v) ⊆ V(S) that represent displayable entities that are subwords of str(v), together with the set SE(S, v) of all re-edges and suffix extension edges that connect any pair of vertices in SV(S, v). Define SGR(S, v) as SG(S, v) with the directions of all the edges in SE(S, v) reversed.

Lemma 3.2.4 SG(S, v) consists of all vertices w such that a path comprising right or suffix extension edges joins w to v in CSD(S).

Proof: Follows from Lemma 3.2.2 and the definition of SG(S, v). □

Algorithm SubwordConflicts(S) of Figure 3.4 computes all subword conflicts of S. The subword conflicts are computed for precisely those displayable entities that have subword displayable entities. Lines 4 to 6 of SubwordConflicts determine whether str(v) has subword displayable entities. Each incoming right or suffix extension edge to v is checked to see whether it originates at the source. If any incoming edge originates at a vertex other than the source, then v.subword is set to true (Lemma 3.2.2); if all incoming edges originate at the source, v.subword is set to false. Procedure GetSubwords(S, v), which computes the subword conflicts of str(v), is invoked iff v.subword is true. Procedure LinearOccurrences(S, v) (line 2 of GetSubwords) reports the occurrences of str(v) in S. Procedure SetUp(v) in line 4 traverses SGR(S, v) and initializes fields in each vertex of SGR(S, v) so that a reverse topological traversal of SG(S, v) may subsequently be performed.
Procedure SetSuffixes(v) in line 5 marks the vertices whose displayable entities are suffixes of str(v). This is accomplished by following the chain of reverse suffix extension pointers starting at v and marking the vertices encountered as suffixes of v. A list of relative occurrences, sublist, is associated with each vertex x in SG(S, v): x.sublist represents the relative positions of str(x) in an occurrence of str(v). Each relative occurrence is denoted by its position relative to the last position of str(v), which is represented by 0. If str(x) is a suffix of str(v), then x.sublist is initialized with the element 0. The remaining elements of x.sublist are computed from the sublist fields of the vertices w in SG(S, v) such that a right extension edge goes from x to w. Consequently, w.sublist must be computed before x.sublist. This is achieved by traversing SG(S, v) in reverse topological order [10].

Lemma 3.2.5 On completion of GetSubwords(S, v), x.sublist for each vertex x in SG(S, v) contains all relative occurrences of str(x) in str(v).

Proof: The correctness of this lemma follows from the correctness of procedure LinearOccurrences(S, v) of Section 3.1.2 and the observation that lines 6 to 14 of procedure GetSubwords achieve the same effect as LinearOccurrences(S, v) in SG(S, v). □

Theorem 3.2.2 Algorithm SubwordConflicts takes O(n + k_sc) time and space and is therefore optimal.

Proof: Computing v.subword for each vertex v in V(S) takes O(n) time, as constant time is spent at each vertex and edge of CSD(S). Consider the complexity of GetSubwords(S, v). Line 2 takes O(|v.occurrences|) time. Let the number of vertices in SG(S, v) be m. Then the number of edges in SG(S, v) is O(m), since each vertex has at most 2|Σ| = O(1) edges leaving it. Line 4 traverses SG(S, v) and therefore consumes O(m) time. Line 5, in the worst case, could involve traversing SG(S, v), which also takes O(m) time.
Computing the relative occurrences of str(x) in str(v) (lines 8–14) takes O(|x.sublist|) time for each vertex x in SG(S, v). So the total complexity of GetSubwords(S, v) is O(|v.occurrences| + m + Σ_{x ∈ SV(S,v)} |x.sublist|). However, m is O(Σ_{x ∈ SV(S,v) − {v}} |x.sublist|), since |x.sublist| ≥ 1 for each x ∈ SG(S, v). But |v.occurrences| + Σ_{x ∈ SV(S,v) − {v}} |x.sublist| is the size of the output of GetSubwords(S, v). So the complexity of Algorithm SubwordConflicts(S) is O(n + Σ_{v ∈ V(S) − {sink}, v.subword = true} |output of GetSubwords(S, v)|) = O(n + k_sc). □

3.2.3 Prefix-Suffix Conflicts

As with subword conflicts, the lower bound for the problem of computing prefix-suffix conflicts is Ω(n + k_p), where k_p is the number of prefix-suffix conflicts in S. For S = aⁿ, k_p = n⁴/24 − n³/12 − 25n²/24 − 21n/12 + 1 = Θ(n⁴), which is also the upper bound on k_p. Unlike subword conflicts, it is not possible to compact the output representation. Let w and x, respectively, be vertices in SET(S, v) and PET(S, v). Let str(v) = W_m, str(w) = W_p W_m, and str(x) = W_m W_s. Define Pshadow(w, v, x) to be the vertex representing imp(S, W_p W_m W_s), if such a vertex exists; otherwise, Pshadow(w, v, x) = nil. We define Pimage(w, v, x) = Pshadow(w, v, x) iff str(Pshadow(w, v, x)) = imp(S, W_p W_m W_s) = W_a W_p W_m W_s for some (possibly empty) string W_a; otherwise, Pimage(w, v, x) = nil. For each vertex w in SET(S, v), the shadow prefix dag SPD(w, v) rooted at vertex w comprises the set of vertices {Pshadow(w, v, x) | x in PET(S, v), Pshadow(w, v, x) ≠ nil}.

Figure 3.5. Illustration of prefix and suffix extension trees and a shadow prefix dag

Figure 3.5 illustrates these concepts. Broken lines represent suffix extension edges, dotted lines represent right extension edges, and solid lines represent prefix extension edges. SET(S, v), PET(S, v), and SPD(w, v) have been enclosed by dashed, solid, and dotted lines, respectively. We have Pshadow(w, v, v) = Pimage(w, v, v) = w and Pshadow(w, v, z) = Pshadow(w, v, r) = c.
However, Pimage(w, v, z) = Pimage(w, v, r) = nil. Pshadow(w, v, x) = Pimage(w, v, x) = a. Pshadow(w, v, p) = b, but Pimage(w, v, p) = nil. Pshadow(w, v, q) = Pshadow(w, v, s) = Pimage(w, v, q) = Pimage(w, v, s) = nil.

Lemma 3.2.6 A prefix-suffix conflict occurs between two displayable entities W1 = str(w) and W2 = str(x) with respect to a third displayable entity W_m = str(v) iff (i) w occurs in SET(S, v) and x occurs in PET(S, v), and (ii) Pshadow(w, v, x) ≠ nil. The number of conflicts between str(w) and str(x) with respect to str(v) is equal to the number of occurrences of str(Pshadow(w, v, x)) in S.

Proof: By definition, a prefix-suffix conflict occurs between displayable entities W1 and W2 with respect to W_m iff W_p W_m W_s occurs in S, where W1 = W_p W_m and W2 = W_m W_s. Clearly, W_m is a suffix of W1 and a prefix of W2 iff w occurs in SET(S, v) and x occurs in PET(S, v). W_p W_m W_s occurs in S iff imp(S, W_p W_m W_s) exists, i.e., iff Pshadow(w, v, x) ≠ nil. The number of conflicts between str(w) and str(x) is equal to the number of occurrences of imp(S, W_p W_m W_s) = str(Pshadow(w, v, x)) in S. □

Lemma 3.2.7 If a prefix-suffix conflict does not occur between str(w) and str(x) with respect to str(v), where w occurs in SET(S, v) and x occurs in PET(S, v), then there are no prefix-suffix conflicts, with respect to str(v), between any displayable entity represented by a descendant of w in SET(S, v) and any displayable entity represented by a descendant of x in PET(S, v).

Proof: Since w is in SET(S, v) and x is in PET(S, v), we can write str(w) = W_p str(v) and str(x) = str(v) W_s. If no conflict occurs, then W_p str(v) W_s does not occur in S. The descendants of w in SET(S, v) represent displayable entities of the form W_a str(w) = W_a W_p str(v), while the descendants of x in PET(S, v) represent displayable entities of the form str(x) W_b = str(v) W_s W_b, where W_a and W_b are substrings of S.
For a prefix-suffix conflict to occur between W_a str(w) and str(x) W_b with respect to str(v), W_a W_p str(v) W_s W_b would have to occur in S. This is not possible, as W_p str(v) W_s does not occur in S, and the result follows. □

Lemma 3.2.8 In CSD(S), if (i) y = Pimage(w, v, x), (ii) there is a prefix extension edge e from x to z with label aα, and (iii) there is a right extension edge f from y to u with label aβ, then Pshadow(w, v, z) = u.

Proof: Let str(w) = W_p str(v) and str(x) = str(v) W_s. By definition, str(y) = W_a W_p str(v) W_s for some, possibly empty, string W_a. We have str(z) = str(x) aα = str(v) W_s aα and str(u) = W_b str(y) aβ = W_b W_a W_p str(v) W_s aβ for some string W_b. Pshadow(w, v, z) = imp(S, W_p str(v) W_s aα). To prove the lemma, we must show that Pshadow(w, v, z) = u, i.e., that (i) W_p str(v) W_s aα is a subword of str(u), and (ii) str(u) is the smallest superword of W_p str(v) W_s aα represented by a vertex in CSD(S).

(i) Assume that W_p str(v) W_s aα is not a subword of str(u) = W_b W_a W_p str(v) W_s aβ. Then α is not a prefix of β.

Figure 3.6. Illustration of conditions for Lemma 3.2.8 (legend: prefix extension, suffix extension, and right extension edges)

Case 1: β is a proper prefix of α. Since W_b W_a W_p str(v) W_s aβ is maximal, its occurrences are not all followed by the same letter. This statement is also true for any of its suffixes; in particular, all occurrences of str(v) W_s aβ cannot be followed by the same letter. Similarly, all occurrences of str(v) W_s aβ cannot be preceded by the same letter, as it is a prefix of str(v) W_s aα = str(z). So str(v) W_s aβ is a displayable entity of S. Consequently, the prefix extension edge from x corresponding to the letter a must be directed to the vertex representing str(v) W_s aβ. This is a contradiction.

Case 2: aβ matches aα in the first k characters, but not in the (k+1)-st character (1 ≤ k < 1 + min(|α|, |β|)). We have aβ = aγβ1 and aα = aγα1, where |γ| = k − 1. Clearly, the strings str(v) W_s aγα1 and W_b W_a W_p str(v) W_s aγβ1 occur in S.
So all occurrences of str(v) W_s aγ cannot be followed by the same letter. Further, all occurrences of str(v) W_s aγ cannot be preceded by the same letter, as it is a prefix of str(v) W_s aα = str(z). So it is a displayable entity of S. Consequently, the prefix extension edge from x corresponding to the letter a must be directed to the vertex representing str(v) W_s aγ. This results in a contradiction. Thus, α is a prefix of β.

(ii) From (i), α is a prefix of β. Assume that W_b W_a W_p str(v) W_s aβ is not the smallest superword of W_p str(v) W_s aα. Since str(y) = str(Pimage(w, v, x)) = W_a W_p str(v) W_s is the smallest superword of W_p str(v) W_s, the smallest superword of W_p str(v) W_s aα must be of the form W_b1 W_a W_p str(v) W_s aγ, where α is a prefix of γ, γ is a proper prefix of β, and/or W_b1 is a proper suffix of W_b. But the right out edge f from y points to the smallest superword of W_a W_p str(v) W_s a (from the definition of CSD(S)), which is W_b W_a W_p str(v) W_s aβ. So W_b1 = W_b and γ = β, which is a contradiction. □

Lemma 3.2.9 In CSD(S), if (i) y = Pimage(w, v, x), (ii) there is a path P of prefix extension edges from x to x1 (let the concatenation of their labels be aα), (iii) there is a prefix extension edge from x1 to z with label bγ, and (iv) there is a right extension edge f from y to u with label aαbβ, then u = Pshadow(w, v, z) ≠ nil.

Figure 3.7. Illustration of conditions for Lemmas 3.2.9 and 3.2.10

Proof: Similar to the proof of Lemma 3.2.8. □

Lemma 3.2.10 In Lemma 3.2.8 or Lemma 3.2.9, if |label(f)| ≤ the sum of the lengths of the labels of the edges on the prefix extension edge path P from x to z, then label(f) = the concatenation of the labels on P and u = Pimage(w, v, z).

Proof: From Lemma 3.2.9, the concatenation of the labels of the edges of P is a prefix of label(f). But |label(f)| ≤ the sum of the lengths of the labels of the edges on P.
Thus, label(f) = the concatenation of the labels of the edges on P, and u = Pimage(w, v, z) follows. □

Lemma 3.2.11 If Pshadow(w, v, x) = nil, then Pshadow(w, v, y) = nil for all descendants y of x in PET(S, v).

Proof: Follows from Lemmas 3.2.6 and 3.2.7. □

Algorithm PrefixSuffixConflicts(S) in Figure 3.8 computes all prefix-suffix conflicts in S. Line 1 constructs CSD(S). Lines 2 and 3 compute all prefix-suffix conflicts in S by separately computing, for each displayable entity str(v), all the prefix-suffix conflicts of which it is the intersection (Corollary 3.2.1).

Algorithm PrefixSuffixConflicts(S: string)
{Compute all prefix-suffix conflicts in S}
1 Construct CSD(S).
2 for each vertex v in CSD(S) do
3   NextSuffix(v, v); {compute all conflicts with respect to str(v)}

Procedure NextSuffix(current, v: vertex);
1 for each suffix extension edge <current, w> do
2 {there can be only one suffix extension edge from current to w}
3 begin
4   exist := false;
5   ShadowSearch(v, w, v, w); {compute SPD(w, v)}
6   if exist then NextSuffix(w, v);
7 end;

Figure 3.8. Optimal algorithm to compute all prefix-suffix conflicts

Procedure NextSuffix(current, v) computes all prefix-suffix conflicts between displayable entities represented by descendants of current in SET(S, v) and displayable entities represented by descendants of v in PET(S, v) with respect to str(v) (so the call NextSuffix(v, v) in line 3 of Algorithm PrefixSuffixConflicts(S) computes all prefix-suffix conflicts with respect to str(v)). It does so by identifying SPD(w, v) for each child w of current in SET(S, v). The call ShadowSearch(v, w, v, w) in line 5 identifies SPD(w, v) and computes all prefix-suffix conflicts between str(w) and displayable entities represented by descendants of v in PET(S, v) with respect to str(v). If ShadowSearch(v, w, v, w) does not report any prefix-suffix conflicts, then the global variable exist is unchanged by ShadowSearch(v, w, v, w) (so exist = false, from line 4); otherwise, it is set to true by ShadowSearch.
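As a ground-truth reference for what PrefixSuffixConflicts must ultimately report, prefix-suffix conflicts can also be enumerated by brute force as pairs of properly overlapping entity occurrences. The sketch below (helper names ours; positions 1-indexed in the output, as in the text) does this for the running example S = cdefabcgabcde, finding exactly one conflict: abc ending at position 11 overlapping cde starting at position 11, with intersection c.

```python
def displayable_entities(s):
    """Maximal repeated substrings of s (brute force, as in the definition)."""
    n, ents = len(s), set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = s[i:j]
            occ = [k for k in range(n) if s.startswith(w, k)]
            if len(occ) < 2:
                continue
            before = {s[k - 1] if k > 0 else None for k in occ}
            after = {s[k + len(w)] if k + len(w) < n else None for k in occ}
            if len(before) > 1 and len(after) > 1:
                ents.add(w)
    return ents

def prefix_suffix_conflicts(s):
    """Pairs of entity occurrences ((a0, a1), (b0, b1)), 1-indexed and
    inclusive, with a0 < b0 <= a1 < b1, i.e., a proper overlap."""
    ents = displayable_entities(s)
    occ = sorted((k + 1, k + len(w)) for w in ents
                 for k in range(len(s)) if s.startswith(w, k))
    return [(a, b) for a in occ for b in occ
            if a[0] < b[0] <= a[1] < b[1]]

assert prefix_suffix_conflicts("cdefabcgabcde") == [((9, 11), (11, 13))]
```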
Line 6 ensures that NextSuffix(w, v) is called only if ShadowSearch(v, w, v, w) detected prefix-suffix conflicts between str(w) and displayable entities represented by descendants of v in PET(S, v) with respect to str(v) (Lemma 3.2.7). For each descendant q of vertex x in PET(S, v), procedure ShadowSearch(v, w, x, y) computes all prefix-suffix conflicts between str(w) and str(q) with respect to str(v). Here y represents Pshadow(w, v, x). We will show that all calls to ShadowSearch maintain the invariant (referred to as the image invariant hereafter) that y = Pimage(w, v, x) ≠ nil. Notice that the invariant holds when ShadowSearch is called from NextSuffix, as w = Pimage(w, v, v). The for statement in line 1 examines each prefix extension edge leaving x. Lines 3 to 28 compute all prefix-suffix conflicts between str(w) and displayable entities represented by vertices in PET(S, z), where z is the vertex on which the prefix extension edge from x is incident. The truth of the condition in the for statement of line 1, line 4, and the truth of the condition inside the if statement of line 5 establish that the conditions of Lemma 3.2.8 are satisfied prior to the execution of lines 8 and 9. The truth of the comment in line 8

Procedure ShadowSearch(v, w, x, y);
1 for each prefix extension edge e = <x, z> do
2 {there can be only one prefix extension edge from x to z}
3 begin
4   fc := first character in label(e);
5   if there is a right extension edge f = <y, u> whose label starts with fc
6   then
7   begin
8     {u = Pshadow(w, v, z)}
9     ListConflicts(u, z, w);
10    distance := 0; done := false;
11    while (not done) and (|label(f)| > |label(e)| + distance) do
12    begin
13      distance := distance + |label(e)|;
14      nc := (distance + 1)-st character in label(f);
15    if there is a prefix extension edge <z, r> whose label starts with nc
16    then
17    begin
18      z := r;
19      {u = Pshadow(w, v, z)}
20      ListConflicts(u, z, w);
21    end
22    else
23      done := true;
24  end
25  if (not done) then
26    ShadowSearch(v, w, z, u);
27  exist := true;
28 end
29 end

Figure 3.9. Algorithm for shadow search

and the correctness of line 9 are established by Lemma 3.2.8. Procedure ListConflicts of line 9 lists all prefix-suffix conflicts between str(w) and str(z) with respect to str(v). Similarly, the truth of the condition inside the while statement of line 11, lines 13 and 14, and the truth of the condition inside the if statement of line 15 establish that the conditions of Lemma 3.2.9 are satisfied prior to the execution of lines 18–20. Again, the correctness of lines 18–20 is established by Lemma 3.2.9. If done remains false on exiting the while loop, the condition of the if statement of line 15 must have evaluated to true. Consequently, the conditions of Lemma 3.2.9 apply. Further, since the while loop of line 11 terminated, the additional condition of Lemma 3.2.10 is also satisfied. Hence, from Lemma 3.2.10, u = Pimage(w, v, z), and the image invariant for the recursive call ShadowSearch(v, w, z, u) is maintained. Line 27 sets the global variable exist to true, since the execution of the then clause of the if statement of line 5 ensures that at least one prefix-suffix conflict is reported by ShadowSearch(v, w, v, w) (Lemmas 3.2.6 and 3.2.8). exist remains false only if the then clause of the if statement (line 5) is never executed.

Theorem 3.2.3 Algorithm PrefixSuffixConflicts(S) computes all prefix-suffix conflicts of S in O(n + k_p) time and space, which is optimal.

Proof: Line 1 of Algorithm PrefixSuffixConflicts takes O(n) time [9]. The cost of lines 2 and 3, excluding the execution time of NextSuffix(v, v), is O(n).
Next, we show that NextSuffix(v, v) takes O(k_v) time, where k_v is the number of prefix-suffix conflicts with respect to v (so k_v represents the size of the output of NextSuffix(v, v)). Assume that NextSuffix is invoked p times in the computation. Let S_T be the set of invocations of NextSuffix which do not call NextSuffix recursively, and let p_T = |S_T|. Let S_F be the set of invocations of NextSuffix which do call NextSuffix recursively, and let p_F = |S_F|. Each element of S_F can directly call at most |Σ| elements of S_T, so p_T ≤ |Σ| p_F. From lines 4–6 of NextSuffix(current, v), each element of S_F yields at least one distinct conflict from its call to ShadowSearch. Thus p_F ≤ k_v, and so p = p_T + p_F ≤ (|Σ| + 1) k_v = O(k_v). The cost of executing NextSuffix, excluding the costs of recursive calls to NextSuffix and ShadowSearch, is O(|Σ|) = O(1), as there are at most |Σ| suffix extension edges leaving a vertex. So the total cost of all invocations of NextSuffix spawned by NextSuffix(v, v), excluding the cost of calls to ShadowSearch, is O(p |Σ|) = O(k_v). Next, we consider the calls to ShadowSearch spawned by NextSuffix(v, v). Let T_A be the set of invocations of ShadowSearch which do not call ShadowSearch recursively, and let q_A = |T_A|. Let T_B be the set of invocations of ShadowSearch which do call ShadowSearch recursively, and let q_B = |T_B|. Let q = q_A + q_B. We have q_A ≤ |Σ| q_B + |Σ| p, so q = q_A + q_B ≤ (|Σ| + 1) q_B + |Σ| p. From the algorithm, each element of T_B yields a distinct conflict, so q_B ≤ k_v. So q ≤ (|Σ| + 1) k_v + |Σ| p = O(k_v). The cost of executing a single call to ShadowSearch, excluding the cost of recursive calls to ShadowSearch, is O(1) + O(w) + O(complexity of ListConflicts of line 9) + O(Σ_{i=1}^{w} complexity of ListConflicts of line 20 in the i-th iteration of the while loop), where w denotes the number of iterations of the while loop. The complexity of ListConflicts is proportional to the number of conflicts it reports.
Since ListConflicts always yields at least one distinct conflict, the complexity of ShadowSearch is O(1 + |output|). Summing over all calls to ShadowSearch spawned by NextSuffix(v, v), we obtain O(q + k_v) = O(k_v). Thus, the total complexity of Algorithm PrefixSuffixConflicts(S) is O(n + k_p). □

3.2.4 Alternative Algorithms

In this section, an algorithm for computing all conflicts (i.e., both subword and prefix-suffix conflicts) is presented. This solution is relatively simple and has competitive run times. However, it lacks the flexibility required to efficiently solve many of the problems listed in Sections 3.3, 3.4, and 3.5. The algorithm (Algorithm AllConflicts(S)) is presented in Figure 3.10. Step 1 computes a list of all occurrences of all displayable entities in S. This list is obtained by first computing the lists of occurrences corresponding to each vertex of V(S) (except the source and the sink) and then concatenating these lists. Each occurrence is represented by its start and end positions. Step 2 sorts the list of occurrences obtained in Step 1 in increasing order of their start positions; occurrences with the same start position are sorted in decreasing order of their end positions. This is done using radix sort. Step 3 computes, for the i-th occurrence occ_i, all its prefix-suffix conflicts with occurrences whose start positions are greater than its own, and all its subword conflicts with its subwords. occ_i is checked against occ_{i+1}, occ_{i+2}, ..., occ_{i+c} for a conflict. Here, c is the smallest integer for which there is no conflict between occ_i and occ_{i+c}. The start position of occ_{i+c} is greater than the end position of occ_i. The start position of occ_j (j > i + c) will also be greater than the end position of occ_i, since the list of occurrences is sorted in increasing order of start positions. The start positions of occ_{i+1}, ..., occ_{i+c−1} are greater than or equal to the start position of occ_i but are less than or equal to its end position.
Those occurrences among {occ_{i+1}, ..., occ_{i+c−1}} whose start positions are equal to that of occ_i have smaller end positions (since occurrences with the same start position are sorted in decreasing order of their end positions). The remaining conflicts of occ_i (i.e., subword conflicts with its superwords and prefix-suffix conflicts with occurrences whose start positions are less than that of occ_i) have already been computed in earlier iterations of the for statement in Algorithm AllConflicts(S). For example, let the input to Step 3 be the following list of ordered pairs: ((1,6), (1,3), (1,1), (2,2), (3,8), (3,5), (4,6), (5,8), (6,10)), where the first element of each ordered pair denotes the start position and the second element the end position of the occurrence. Consider the occurrence (3,5). Its conflicts with (1,6), (1,3), and (3,8) are computed in iterations 1, 2, and 5 of the for loop. Its conflicts with (4,6) and (5,8) are computed in iteration 6 of the for loop.

Theorem 3.2.4 Algorithm AllConflicts(S) takes O(n + k) time, where k = k_s + k_p.

Proof: Step 1 takes O(n + o) time, where o is the number of occurrences of displayable entities of S. Step 2 also takes O(n + o) time, since o elements are to be sorted using radix sort with n buckets. Step 3 takes O(o + k) time (the for loop executes O(o) times; each iteration of the while loop yields a distinct conflict). So the total complexity is O(n + o + k). We now show that o = O(n + k). Let o1 be the number of occurrences not involved in any conflict; then o1 ≤ n. Let o2 be the number of occurrences involved in at least one conflict. A single conflict involves two occurrences, so 2k ≥ o2. Hence o = o1 + o2 ≤ n + 2k = O(n + k). □

Algorithm AllConflicts(S) can be modified so that the size of the output is k_p + k_sc. This may be achieved by checking, in the for loop of Step 3, whether an occurrence is the first representative of its pattern.
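Step 3 of AllConflicts can be sketched directly in code (using an ordinary sort in place of radix sort, which changes the running time but not what is reported; variable names are ours). On the example list above, the occurrence (3,5) participates in exactly five conflicts, as traced in the text.

```python
occs = [(1, 6), (1, 3), (1, 1), (2, 2), (3, 8), (3, 5), (4, 6), (5, 8), (6, 10)]
occs.sort(key=lambda p: (p[0], -p[1]))   # start ascending, end descending

conflicts = []
for i, a in enumerate(occs):
    for b in occs[i + 1:]:
        if b[0] > a[1]:                  # no overlap; later pairs cannot conflict
            break
        kind = "subword" if a[1] >= b[1] else "prefix-suffix"
        conflicts.append((a, b, kind))

# (3,5) conflicts with (1,6), (1,3), (3,8), (4,6), and (5,8)
assert sum((3, 5) in (a, b) for a, b, _ in conflicts) == 5
```

Note that an occurrence starting where an earlier one ends (e.g., (1,3) and (3,5)) overlaps in a single position and is reported as a prefix-suffix conflict.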
The subword conflicts are then reported only for the first occurrence of the pattern. However, the time complexity of Algorithm AllConflicts(S) remains O(n + k); in this sense, it is suboptimal.

Algorithm AllConflicts(S)
Step 1: Obtain a list of all occurrences of all displayable entities in S. This list is obtained by first computing the lists of occurrences corresponding to each vertex of the csdawg (except the source and the sink) and then concatenating these lists.
Step 2: Sort the list of occurrences using the start positions of the occurrences as the primary key (increasing order) and the end positions as the secondary key (decreasing order). This is done using radix sort.
Step 3: for i := 1 to (number of occurrences) do
        begin
          j := i + 1;
          while lastpos(occ_i) ≥ firstpos(occ_j) do
          begin
            if lastpos(occ_i) ≥ lastpos(occ_j)
            then occ_i is a superword of occ_j
            else (occ_i, occ_j) have a prefix-suffix conflict;
            j := j + 1;
          end;
        end;

Figure 3.10. A simple algorithm for computing conflicts

3.3 Size Restricted Queries

Experimental data show that random strings contain a large number of displayable entities of small length. In most applications, small displayable entities are less interesting than large ones. Hence, it is useful to list only those displayable entities whose lengths are greater than some integer k. Similarly, it is useful to report exactly those conflicts in which the conflicting displayable entities have length greater than k. This gives rise to the following problems:

P8: List all occurrences of displayable entities whose lengths are greater than k.
P9: Compute all prefix-suffix conflicts involving displayable entities of length greater than k.
P10: Compute all subword conflicts involving displayable entities of length greater than k.

The overlap of a conflict is defined as the string common to the conflicting displayable entities. The overlap of a subword conflict is the subword displayable entity; the overlap of a prefix-suffix conflict is its intersection.
The size of a conflict is the length of the overlap. An alternative formulation of the size-restricted problem, which also seeks to achieve the goal outlined above, is based on reporting only those conflicts whose size is greater than k. This formulation is particularly relevant when the conflicts are of more interest than the displayable entities. It also ensures that all conflicting displayable entities reported have length greater than k. We have the following problems:
P11: Obtain all prefix-suffix conflicts of size greater than some integer k.
P12: Obtain all subword conflicts of size greater than some integer k.
P8 is solved optimally by invoking LinearOccurrences(S, v) for each vertex v in V(S) with |str(v)| > k. A combined solution to P9 and P10 uses the approach of Section 3.2.4. The only modification to the algorithm of Figure 3.10 is in Step 1, which now becomes: obtain all occurrences of displayable entities whose lengths are greater than k. The resulting algorithm is optimal with respect to the expanded representation of subword conflicts. However, as with the general problem, it is not possible to obtain separate optimal solutions to P9 and P10 by using the techniques of Section 3.2.4. An optimal solution to P11 is obtained by executing line 3 of Algorithm PrefixSuffixConflicts(S) of Figure 3.8 for only those vertices v in V(S) which have |str(v)| > k. An optimal solution to P12 is obtained by the following modifications to Algorithm SubwordConflicts of Figure 3.4:
(i) Right extension or suffix extension edges <u, v>, where |str(u)| <= k and |str(v)| > k, are marked "disabled."
(ii) The definition of SG(S, v) is modified so that SG(S, v), for v ∈ V(S), is the subgraph of CSD(S) which consists of the set of vertices SV(S, v) ⊆ V(S) which represent displayable entities of length greater than k that are subwords of str(v), and the set of all re-edges and suffix extension edges that connect any pair of vertices in SV(S, v).
(iii) Algorithm SubwordConflicts(S) is modified. The modified algorithm is shown in Figure 3.11.
We note that P10 and P12 are identical, since the overlap of a subword conflict is the same as the subword displayable entity.

3.4 Pattern Oriented Queries

These queries are useful in applications where the fact that two patterns have a conflict is more important than the number and location of the conflicts. The following problems arise as a result:

Algorithm SubwordConflicts(S, k)
1 begin
2   for each vertex v in CSD(S) do
3     v.subword := false;
4   for each vertex v in CSD(S) such that |str(v)| > k do
5     for all vertices u such that a non-disabled right or suffix extension edge <u, v> exists do
6       v.subword := true;
7   for each vertex v in CSD(S) such that v ≠ sink and v.subword is true do
8     GetSubwords(S, v);
9 end
Figure 3.11. Modified version of algorithm SubwordConflicts

P13: List all pairs of displayable entities which have subword conflicts.
P14: List all triplets of displayable entities (D1, D2, Dm) such that there is a prefix-suffix conflict between D1 and D2 with respect to Dm.
P15: Same as P13, but size restricted as in P12.
P16: Same as P14, but size restricted as in P11.
P13 may be solved optimally by reporting, for each vertex v in V(S) that does not represent the sink of CSD(S), the subword displayable entities of str(v), if any. This is accomplished by reporting str(w) for each vertex w, w ≠ source, in SG(S, v). P14 may also be solved optimally by modifying procedure ListConflicts of Figure 3.9 so that it reports the conflicting displayable entities and their intersection. P15 and P16 may be solved by making similar modifications to the algorithms of the previous section.

3.5 Statistical Queries

These queries are useful when conclusions are to be drawn from the data based on statistical facts.
Let f(D) denote the frequency (number of occurrences) of displayable entity D in the string, and rf(D1, D2) the number of occurrences of displayable entity D1 in displayable entity D2. The following queries may then be defined.
P17: For each pair of displayable entities, D1 and D2, involved in a subword conflict (D1 is the subword of D2), obtain p(D1, D2) = (number of occurrences of D1 which occur as subwords of D2) / f(D1).
P18: For each pair of displayable entities, D1 and D2, involved in a prefix-suffix conflict, obtain q(D1, D2) = (number of occurrences of D1 which have prefix-suffix conflicts with D2) / f(D1).
If p(D1, D2) or q(D1, D2) is greater than a statistically determined threshold, then the following could be said with some confidence: presence of D1 implies presence of D2.
Let psf(D1, D2, Dm) denote the number of prefix-suffix conflicts between D1 and D2 with respect to Dm, and psf(D1, D2) the number of prefix-suffix conflicts between D1 and D2. We can approximate p(D1, D2) by rf(D1, D2) · f(D2) / f(D1). The two quantities are identical unless a single occurrence of D1 is a subword of two or more distinct occurrences of D2. Similarly, we can approximate q(D1, D2) by psf(D1, D2) / f(D1). The two quantities are identical unless a single occurrence of D1 has prefix-suffix conflicts with two or more distinct occurrences of D2. f(D1) can be computed for all displayable entities in CSD(S) in O(n) time by a single traversal of CSD(S) in reverse topological order.

Procedure GetSubwords(S, v)
1  begin
2    rf(str(v), str(v)) := 1;
3    SetUp(v);
4    SetSuffixes(v);
5    for each vertex x (≠ source), in reverse topological order of SG(S, v) do
6    begin
7      if str(x) is a suffix of str(v) then rf(str(x), str(v)) := 1
8      else rf(str(x), str(v)) := 0;
9      for each w in SG(S, v) on which an re-edge e from x is incident do
10       rf(str(x), str(v)) := rf(str(x), str(v)) + rf(str(w), str(v));
11     output(rf(str(x), str(v)));
12   end;
13 end
Figure 3.12.
Modification to GetSubwords(S, v) for computing relative frequencies

rf(D1, D2) may be computed optimally, for all D1 and D2, by modifying procedure GetSubwords(S, v) as shown in Figure 3.12. psf(D1, D2, Dm) is computed optimally, for all D1, D2, and Dm such that D1 has a prefix-suffix conflict with D2 with respect to Dm, by modifying ListConflicts(u, z, w) of Figure 3.9 so that it returns f(str(u)), since this is the number of conflicts between str(w) and str(z) with respect to str(v). psf(D1, D2) is calculated by summing psf(D1, D2, Dm) over all intersections Dm of prefix-suffix conflicts between D1 and D2. p(D1, D2) and q(D1, D2) may be computed by simple modifications to the algorithms used to compute rf(D1, D2) and psf(D1, D2). These problems may be solved under the size restrictions of P11 and P12 by modifications similar to those made in Section 3.3.

3.6 Experimental Results

Algorithms SubwordConflicts(S) and PrefixSuffixConflicts(S) were implemented on a SUN SPARCstation 1 in GNU C++. 50 randomly generated strings, of lengths ranging from 100 to 2000 and over alphabets whose size ranged from 2 to 50, were input to our algorithms. Statistical information, such as the number of vertices in the csdawg and the number of prefix-suffix and subword conflicts, was obtained. The run times of the algorithms were also recorded.
(i) Figure 3.13 shows the number of prefix-suffix conflicts vs. string size. There is one curve for each alphabet size. The plot illustrates that the number of prefix-suffix conflicts increases with string size and decreases with alphabet size. The graph for subword conflicts is similar.
(ii) Figure 3.14 shows the time per prefix-suffix conflict vs. string size. There is one curve for each alphabet size. It illustrates that the time per conflict generally decreases with increasing string size and increases with alphabet size. The graph for subword conflicts is similar.
(iii) The factor by which the compact representation of subword conflicts is smaller than the fully expanded representation varies from 2 to 9. It increases with string size and decreases with alphabet size.
(iv) Table 3.1 shows the size of the largest displayable entity for each combination of alphabet size and string size. It shows that only displayable entities of small lengths occur in random strings. (In practice, we would like to be able to distinguish between displayable entities that occur randomly and those that do not in a given string. This can be done by selecting those displayable entities whose frequency is large compared to other displayable entities of the same length in a random string.)

Figure 3.13. Graph of Number of Prefix-Suffix Conflicts vs String Size (one curve per alphabet size)

Table 3.1. Lengths of Largest Displayable Entity for Random Strings

  Size of            Size of String
  Alphabet    100   200   500   1000   2000
      2        11    12    15     18     19
      5         5     6     7      8      9
     10         3     4     5      6      6
     15         3     3     4      5      5
     20         3     3     4      4      5
     25         2     3     3      4      4
     50         2     2     3      3      4

Figure 3.14. Graph of Time per Prefix-Suffix Conflict vs String Size (one curve per alphabet size)

In another experiment, Algorithms SubwordConflicts(S) (Section 3.2.2), PrefixSuffixConflicts(S) (Section 3.2.3), and AllConflicts(S) (Section 3.2.4) were programmed in GNU C++ and run on a SUN SPARCstation 1. For test data we used 120 randomly generated strings. The alphabet size was chosen from {5, 15, 25, 35} and the string length was 500, 1000, or 2000. The test set consisted of 10 different strings for each of the 12 possible combinations of string length and alphabet size. For each of these combinations, the average run times over the 10 strings are given in Tables 3.2-3.5.
Table 3.2 gives the average times for computing all conflicts by combining algorithms PrefixSuffixConflicts(S) and SubwordConflicts(S). Table 3.3 gives the average times for computing all prefix-suffix conflicts using Algorithm PrefixSuffixConflicts(S). Table 3.4 gives the average times for computing all pattern-restricted prefix-suffix conflicts (problem P14 of Section 3.4) by modifying Algorithm PrefixSuffixConflicts(S) as described in Section 3.4. Table 3.5 gives the average times for Algorithm AllConflicts(S). Tables 3.2 to 3.4 represent the theoretically superior solutions to the corresponding problems, while Table 3.5 represents Algorithm AllConflicts(S), which provides a simpler, but suboptimal, solution to the three problems. In all cases, the times for constructing csdawgs and writing the results to a file were not included, as these steps are common to all the solutions.

The results show that the suboptimal Algorithm AllConflicts(S) is superior to the optimal solution for computing all conflicts or all prefix-suffix conflicts on a randomly generated string. This is due to the simplicity of Algorithm AllConflicts(S) and the fact that the number of conflicts in a randomly generated string is small. However, on a string such as a10 (all characters identical, which represents the worst case in terms of the number of conflicts reported), the following run times were obtained:

All conflicts, optimal algorithm: 14,190 ms
All prefix-suffix conflicts, optimal algorithm: 10,840 ms
All pattern restricted prefix-suffix conflicts, optimal algorithm: 5,000 ms
Algorithm AllConflicts(S): 26,942 ms

The experimental results using random strings also show that, as expected, the optimal algorithm fares better than Algorithm AllConflicts(S) for the more restricted problem of computing pattern oriented prefix-suffix conflicts.

Table 3.2. Time in ms for computing all conflicts using SubwordConflicts(S) and PrefixSuffixConflicts(S)
  Size of       Size of String
  Alphabet    500   1000   2000
      5       410    989   2722
     15       292    603   1300
     25       315    671   1485
     35       234    791   1740

Table 3.3. Time in ms for computing all prefix-suffix conflicts using PrefixSuffixConflicts(S)

  Size of       Size of String
  Alphabet    500   1000   2000
      5       247    730   1873
     15       219    454    989
     25       255    522   1179
     35       186    648   1370

Table 3.4. Time in ms for computing all pattern restricted prefix-suffix conflicts using the optimal algorithm

  Size of       Size of String
  Alphabet    500   1000   2000
      5       163    399   1058
     15       103    231    550
     25        91    267    628
     35        61    226    735

Table 3.5. Time in ms for Algorithm AllConflicts(S)

  Size of       Size of String
  Alphabet    500   1000   2000
      5       203    551   1367
     15       217    409    897
     25       227    400    887
     35       145    478    994

We conclude that Algorithm AllConflicts(S) should be used for the more general problems of computing conflicts, while the optimal solutions should be used for the restricted versions. Hence, Algorithm AllConflicts(S) should be used in an automatic environment, while the optimal solutions should be used in interactive or semi-automatic environments.

3.7 Display Algorithms

A list of all occurrences of a displayable entity may be obtained from the csdawg data structure described earlier. A list of all conflicts between displayable entities may be obtained by operations on the csdawg. This information is then used to assign numeric weights to each occurrence of each displayable entity. In this section we shall discuss algorithms to implement some of the refined display models of Section 2.3. Specifically, we shall consider models 1, 2(b), and 3 under the Automatic mode (i.e., problems P4, P6, and P7 of Section 2.4). Our algorithms also apply to the Semi-Automatic mode.

Problem P4 can be reduced to the single-pair longest path problem in a directed acyclic graph as follows: let vertex V_i, 1 <= i <= n-1, correspond to the position between the i'th and (i+1)'th characters in the string. V_0 corresponds to the position preceding the first character in the string.
V_n corresponds to the position following the last character in the string. For each occurrence S_{i,j} of each displayable entity, we create an edge from V_{i-1} to V_j. The weight associated with this edge is exactly the weight of the occurrence it represents. Finally, for each pair (V_i, V_{i+1}), 0 <= i < n, of vertices such that an edge from V_i to V_{i+1} does not already exist, we create an edge from V_i to V_{i+1} of weight 0. Figure 3.15 shows the directed acyclic graph corresponding to the string S = abcicdefcdegabchabcde of Figure 2.2, assuming that the weights corresponding to each occurrence of abc, cde, and c are 4, 3, and 2, respectively. The longest path from V_0 to V_n in the dag is V_0 -> V_3 -> V_4 -> V_7 -> V_8 -> V_11 -> V_12 -> V_15 -> V_16 -> V_19 -> V_20 -> V_21. All edges on this path with nonzero weight represent occurrences of displayable entities that are to be highlighted. Here V_0 -> V_3, V_12 -> V_15, and V_16 -> V_19 correspond to occurrences <1,3>, <13,15>, and <17,19> of abc, and V_4 -> V_7 and V_8 -> V_11 correspond to occurrences <5,7> and <9,11> of cde. The length of the longest path (here, 18) represents the total weight of the display.

Figure 3.15. Dag corresponding to abcicdefcdegabchabcde

Algorithm A4 of Figure 3.16 solves problem P4. A[0..n] is an array of integers, where A[i] represents the length of the longest path from V_0 to V_i detected up to that point. All elements of A are initialized to 0. The auxiliary array T[1..n] stores the occurrences that have been chosen for display. The array delist contains all the occurrences of all the displayable entities in the string. Each element of delist contains three fields: start, end, and weight, which represent the start position, the end position, and the numeric weight of the occurrence, respectively. It is assumed that delist is sorted in increasing order of end. The vertices are processed in topological order (here, V_0, V_1, ..., V_n). When a vertex V_j is being processed, each vertex V_i (i < j) preceding it has associated with it the cost of the longest path from V_0 to V_i.
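The dynamic program underlying Algorithm A4 can be sketched compactly as follows (an illustrative Python version using the example string and weights above; here delist is built by brute-force pattern scanning rather than from the csdawg, and the chosen-occurrence array T is omitted for brevity):

```python
def max_weight_highlight(n, delist):
    # delist: (start, end, weight) triples, positions 1-indexed and inclusive.
    # A[i] = weight of the longest path from V_0 to V_i in the dag.
    delist = sorted(delist, key=lambda occ: occ[1])   # sort by end position
    A = [0] * (n + 1)
    j = 0
    for i in range(1, n + 1):
        A[i] = A[i - 1]                               # zero-weight edge V_{i-1} -> V_i
        while j < len(delist) and delist[j][1] == i:  # occurrences ending at V_i
            s, e, w = delist[j]
            A[i] = max(A[i], A[s - 1] + w)
            j += 1
    return A[n]

S = "abcicdefcdegabchabcde"
weights = {"abc": 4, "cde": 3, "c": 2}
delist = [(i + 1, i + len(p), w) for p, w in weights.items()
          for i in range(len(S)) if S.startswith(p, i)]
print(max_weight_highlight(len(S), delist))  # 18, the total weight of the display
```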
The cost of the longest path from V_0 to V_j is then determined by examining each of the incoming edges to V_j. The complexity of Algorithm A4 is O(n + e), where e represents the size of delist. So, Algorithm A4 is optimal. Note that sorting delist on end position can also be accomplished in O(n + e) time using radix sort.

Algorithm A4
begin
  A[0] := 0; j := 1;
  for i := 1 to n do
  begin
    A[i] := A[i-1]; {zero-weight edge from V_{i-1}}
    while (delist[j].end = i) do {determine longest path ending at V_i}
    begin
      if A[i] < A[delist[j].start - 1] + delist[j].weight then
      begin
        A[i] := A[delist[j].start - 1] + delist[j].weight;
        T[i] := j;
      end;
      j := j + 1;
    end;
  end;
end.
Figure 3.16. Algorithm for P4

Problem P6 of Section 2.4 is solved by Algorithm A6 of Figure 3.17 using the greedy method. The array endpoints[1..2e] is a list of both the start positions and the end positions of all occurrences of all displayable entities in the string. Each element of endpoints contains three fields: position, type, and id. position contains the position of the particular endpoint in the string; type indicates whether the endpoint is a "start" position or an "end" position; id is an integer which uniquely identifies the occurrence corresponding to the endpoint. It is assumed that endpoints is sorted in increasing order on primary key position and secondary key type ("start" < "end"). The variable current keeps track of the number of copies of the string that are currently "active." The variable max keeps track of the maximum number of copies of the string required so far. CurrentLine is the particular copy of the string to which the new occurrence is assigned. LineStack is a stack that contains line numbers (or partition numbers) which are currently available. On completion of the algorithm, line[i] contains the line (the particular copy of the string) in which the occurrence with id i is to be highlighted, for 1 <= i <= e.
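The greedy sweep of Algorithm A6 can be sketched as follows (an illustrative Python version with our own names; the `current <= max` test of the pseudocode becomes "the stack of free lines is non-empty", and starts sort before ends at equal positions, matching the secondary key above):

```python
def assign_lines(intervals):
    # intervals: (start, end) pairs, inclusive positions (occurrences to highlight).
    # Returns (fmax, line): line[i] is the copy of the string on which interval i
    # is highlighted; fmax is the number of copies (partitions) required.
    events = sorted([(s, 0, i) for i, (s, e) in enumerate(intervals)] +
                    [(e, 1, i) for i, (s, e) in enumerate(intervals)])
    line = [0] * len(intervals)
    free = []                    # LineStack: line numbers currently available
    fmax = 0
    for pos, kind, i in events:  # kind 0 = "start" sorts before kind 1 = "end"
        if kind == 0:            # open a highlight: reuse a free line if any
            if free:
                line[i] = free.pop()
            else:
                fmax += 1
                line[i] = fmax
        else:                    # close a highlight: release its line
            free.append(line[i])
    return fmax, line
```

On the occurrence list used earlier, ((1,6), (1,3), (1,1), (2,2), (3,8), (3,5), (4,6), (5,8), (6,10)), five copies of the string suffice, and five are necessary because position 5 is covered by five occurrences.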
It can be seen from the algorithm that (1) at least one position in the string is covered by fmax occurrences; thus, fmax is the smallest number of lines required to highlight all the displayable entities; and (2) the final value of max represents the number of partitions of S required to highlight all displayable entities. Since the final value of max is fmax, Algorithm A6 is correct. Algorithm A6 consumes O(n + e) time, which is optimal. Note that sorting endpoints can also be accomplished in O(n + e) time, if radix sort is used.

Algorithm A6
begin
  max := 0; current := 0;
  for i := 1 to 2e do
  begin
    x := endpoints[i];
    if (x.type = "start") then
    begin {assign line numbers to occurrences}
      current := current + 1;
      if (current <= max) then {use available copy of string}
      begin
        CurrentLine := top(LineStack); pop(LineStack);
      end
      else
      begin {increase max}
        max := max + 1; CurrentLine := max;
      end;
      line[x.id] := CurrentLine;
    end
    else {x.type = "end"}
    begin
      current := current - 1;
      push(LineStack, line[x.id]);
    end;
  end;
  fmax := max;
end.
Figure 3.17. Algorithm for P6

We outline two solutions to P7. In the first, we assume that all occurrences of the same displayable entity are assigned the same numeric weight. In the second, we do not make this assumption. Algorithm A7(a) of Figure 3.18 solves the first version of P7. The second version of P7 may be solved by executing steps a-c for each occurrence of each displayable entity, as shown in Algorithm A7(b) of Figure 3.19. Both solutions involve a traversal of CSD(S) in topological order. For each occurrence, an optimal selection of its subwords is chosen. The weight of the occurrence is then obtained by adding the sum of the weights of the chosen subwords to its own weight. This is done because an occurrence is highlighted along with the chosen subwords. Algorithm A7(a) consumes O(n^3) time, while Algorithm A7(b) consumes O(n^4) time.

Algorithm A7(a)
begin
  for each vertex v of CSD(S) in topological order do
  begin
    Step a. Compute the relative positions of all subword displayable entities in a single instance of str(v) using Procedure GetSubwords(S, v).
    Step b. Choose a mutually non-overlapping subset of the set of subwords of str(v) (obtained in step (a) above) so that the sum of their weights is maximized. This is achieved by an algorithm similar to A4.
    Step c. Reset the numeric weight of str(v) by adding to it the total weight of the configuration obtained in step (b).
  end;
end.
Figure 3.18. Algorithm for P7, same weights

Algorithm A7(b)
begin
  for each vertex v of CSD(S) in topological order do
    for each occurrence <i, j> of str(v) do
    begin
      Step a. Compute the occurrences of all subword displayable entities in <i, j>.
      Step b. Choose a mutually non-overlapping subset of the set of occurrences (obtained in Step (a) above) so that the sum of their weights is maximized. This is achieved by an algorithm similar to A4.
      Step c. Reset the numeric weight of <i, j> by adding to it the total weight of the configuration obtained in Step (b).
    end;
end.
Figure 3.19. Algorithm for P7, different weights

CHAPTER 4
CIRCULAR STRING VISUALIZATION

4.1 Introduction

The circular string data type is used to represent a number of objects such as circular genomes, polygons, and closed curves. Research in molecular biology involves the identification of recurring patterns in data and hypothesizing about their causes and/or effects [4, 2]. Research in pattern recognition and computer vision involves detecting similarities within an object or between objects [11]. We have already listed in Chapter 2 a number of queries that our visualization model supports. In Chapter 3, we developed efficient (mostly optimal) algorithms for some of these queries for linear strings. These algorithms performed operations and traversals on csdawgs of the linear strings.
One approach for extending these techniques to circular strings is to arbitrarily break the circular string at some point so that it becomes a linear string. Techniques for linear strings may then be applied to it. However, this has the disadvantage that some significant patterns in the circular string may be lost, because those patterns were broken when linearizing the string. Indeed, this would defeat the purpose of representing objects by circular strings. This particular problem may be overcome by working with the csdawg corresponding to the concatenation of the linearized circular string with itself. However, the resulting data structure contains a number of extraneous nodes. The existence of these nodes increases the asymptotic complexity and actual running times of some of our algorithms, and increases the storage requirement of the data structure. Moreover, algorithms for linear strings need to be substantially modified for use with this data structure. A circular string data structure, the polygon structure graph, which is an extension of suffix trees to circular strings, already exists [11]. However, the suffix tree is not as powerful as the csdawg and cannot be used to solve some of the problems that the csdawg can solve. In particular, our queries required the csdawg for asymptotically and practically efficient algorithms. In this chapter, we propose a csdawg for circular strings which is obtained by making simple modifications to the csdawg for linear strings. This new data structure does not contain extraneous vertices and consequently avoids the disadvantages mentioned above. Algorithms which make use of the csdawg for linear strings can then be extended to circular strings with trivial modifications. The extended algorithms continue to have the same time and space complexities.
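To see concretely why doubling the linearized string recovers patterns broken at the break point, consider a small illustrative Python sketch (our own example, not from the dissertation):

```python
def circular_occurrences(S, pattern):
    # Start positions (0-based) of `pattern` in the circular string <S>,
    # found by scanning the doubled linear string T = SS.
    # Assumes len(pattern) <= len(S); illustrative only.
    T = S + S
    return [i for i in range(len(S)) if T.startswith(pattern, i)]

S = "abcd"                             # a linearization of the circular string <abcd>
print("dab" in S)                      # False: the occurrence is broken by linearization
print(circular_occurrences(S, "dab"))  # [3]: recovered in T = "abcdabcd"
```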
Moreover, the extensions take the form of postprocessing or preprocessing steps which are simple to add on to a system built for linear strings, particularly in an object-oriented language. In particular, algorithms NoConflicts(S), SubwordConflicts(S), PrefixSuffixConflicts(S), and AllConflicts(S), as well as the solutions outlined for problems P8 to P18 in Chapter 3 for linear strings, can be easily extended to circular strings. Section 4.2 contains definitions. Section 4.3 describes the construction of a circular csdawg and the computation of locations of occurrences of displayable entities, and Section 4.4 introduces the notion of display conflicts. Finally, Section 4.5 mentions some applications for the visualization and analysis of circular strings.

Figure 4.1. Circular string

4.2 Definitions

Let s denote a circular string of size n consisting of characters from a fixed alphabet Σ of constant size. Figure 4.1 shows an example circular string of size 8. We shall represent a circular string by a linear string enclosed in angle brackets "<>" (this distinguishes it from a linear string). The linear string is obtained by traversing the circular string in clockwise order and listing each element as it is traversed. The start point of the traversal is chosen arbitrarily. Consequently, there are up to n equivalent representations of s; in the example, s could be represented starting at any of its eight characters. We characterize the relationship between circular strings and linear strings by defining the functions linearize and circularize. linearize maps circular strings to linear strings. It is a one-many mapping, as a circular string can, in general, be mapped to more than one linear string. For example, linearize(<abcd>) ∈ {abcd, bcda, cdab, dabc}. We will assume that linearize arbitrarily chooses one of the linear strings; for convenience, we assume that it chooses the representation obtained by removing the angle brackets "<>". So, linearize(<abcd>) = abcd. circularize maps linear strings to circular strings.
It is a many-one function and represents the inverse of linearize. We use lower case letters to represent circular strings and upper case letters to represent linear strings. Further, if a lower case letter (say s) is used to represent a particular circular string, then the corresponding upper case letter (S) is assumed to be linearize(s). The definitions of maximal strings and displayable entities for circular strings are identical to those for linear strings. I.e., a nonnull pattern occurring in s is said to be maximal iff its occurrences are not all preceded by the same character, nor all followed by the same character. The empty string is always maximal. s may be thought of as a periodic, infinite string which is neither followed nor preceded by a letter, and is therefore maximal. A pattern is said to be a displayable entity of s iff it is nonnull, maximal, and occurs at least twice in s. A displayable entity of s always has length less than n. Thus, s and the empty string are maximal, but are not displayable entities. The definition of a csdawg for circular strings is also similar to that for linear strings. I.e., a csdawg CSD(s) = (V(s), R(s), L(s)) corresponding to s is a directed acyclic graph defined by a set of vertices V(s), a set R(s) of labeled directed edges called right extension edges (re-edges), and a set L(s) of labeled directed edges called left extension edges (le-edges). Each vertex of V(s) represents a substring of s. Specifically, V(s) consists of a vertex corresponding to each maximal pattern of s: a source, which represents the empty string λ; a sink, which represents s; and a vertex for each displayable entity of s. Let str(v) denote the string represented by vertex v, for v ∈ V(s). Define the implication imp(s, a) of a string a in s to be the smallest superstring of a in {str(v) | v ∈ V(s)}, if such a superstring exists. Otherwise, imp(s, a) does not exist.
Note that imp(s, a) is always unique (if there were two or more superstrings of a of the same length k in {str(v) | v ∈ V(s)}, it can be shown that there must exist a superstring of a in {str(v) | v ∈ V(s)} with length less than k). Re-edges from v1 (v1 ∈ V(s)) are obtained as follows: for each letter x in Σ, if imp(s, str(v1)x) exists and is equal to str(v2) = βstr(v1)xγ, then there is an re-edge from v1 to v2 with label x. If β is the empty string, then the edge is known as a prefix extension edge. Le-edges from v1 (v1 ∈ V(s)) are obtained as follows: for each letter x in Σ, if imp(s, xstr(v1)) exists and is equal to str(v2) = γxstr(v1)β, then there is an le-edge from v1 to v2 with label x. If β is the empty string, then the edge is known as a suffix extension edge. The sink represents the periodic infinite string denoted by the circular string s. The labels of edges incident on the sink are themselves infinite and periodic, and may be represented by their start positions in s. We also associate with CSD(s) the periodicity p of s. This is the value of |a| for the largest value of k such that S = a^k. We require the periodicity to answer queries about the locations of displayable entities if S is itself periodic. Figure 4.11 shows CSD(s) for an example circular string s whose periodicity is 7.

4.3 Constructing the Csdawg for a Circular String

The csdawg for circular string s is constructed by the algorithm of Figure 4.2. It is obtained by first constructing the csdawg for the linear string T = SS (recall that S = linearize(s)). A bit is associated with each re-edge in R(T) indicating whether it is a prefix extension edge or not. Similarly, a bit is associated with each le-edge in L(T) to identify suffix extension edges. Two pointers, a suffix pointer and a prefix pointer, are associated with each vertex v in V(T). The suffix (prefix) pointer points to a vertex w in V(T) such that str(w) is the largest proper suffix (prefix) of str(v) represented by any vertex in V(T).
Note that such a w always exists, since the empty string is a suffix (prefix) of all strings. Suffix (prefix) pointers are the reverse of suffix (prefix) extension edges and are derived from them.

Algorithm CircularCsdawg(s)
Step 1: Construct CSD(T) for T = SS (S = linearize(s)).
Step 2: {Determine periodicity of S using Lemma 4.3.1}
  vs := vertex representing S in CSD(T);
  es := any outgoing edge from vs;
  p := |es|  {the length of the label on es}
Step 3(a): {Identify suffix redundant vertices using Lemma 4.3.3}
  v := sink;
  while v ≠ source do
  begin
    v := v.suffix;
    if v has exactly one outgoing re-edge then mark v suffix redundant
    else exit Step 3(a);
  end;
Step 3(b): {Identify prefix redundant vertices} {Similar to Step 3(a)}
Step 4:
  v := sink;
  while (v ≠ source) do
  begin
    Modify the representation of edges from v that are incident on the sink.
    case v of
      suffix redundant but not prefix redundant: ProcessSuffixRedundant(v);
      prefix redundant but not suffix redundant: ProcessPrefixRedundant(v);
      suffix redundant and prefix redundant: ProcessBothRedundant(v);
      not redundant: {do nothing};
    endcase;
    v := NextVertexInReverseTopologicalOrder;
  end;
Figure 4.2. Algorithm for constructing the csdawg for a circular string

Figure 4.4 shows CSD(T) = CSD(SS) for S = cabcbab. The broken edge from vertex c to vertex abc is a suffix extension edge, while the solid edge from vertex ab to vertex abc is a prefix extension edge. In Step 2, we determine the periodicity of s, which is equal to the length of the label on any outgoing edge from the vertex representing S in CSD(T). This equality is derived in Lemma 4.3.1.

Lemma 4.3.1 Step 2 of Algorithm CircularCsdawg(s) correctly determines the periodicity of s.

Proof: Let a be the shortest substring of S such that S = a^m, for some integer m. Then, the occurrences of a in T have start positions 1, |a|+1, 2|a|+1, ..., (2m-1)|a|+1 (if this were not the case, then we could show that there exists a substring β of S such that S = β^k, k > m, resulting in a contradiction).
So, CSD(T) takes the form of Figure 4.3. Each vertex representing a^i, 1 <= i <= 2m-1, has exactly one le-edge and one re-edge leaving it, as shown. All remaining displayable entities of T are subwords of a^2 and are of size less than |a| (if this were not the case, then we could show that there exists a substring β of S such that S = β^k, k > m, resulting in a contradiction). The vertices representing these displayable entities are represented by the box in Figure 4.3. From the figure, it is now easy to see that Step 2 correctly determines the periodicity of s. ∎

Figure 4.3. CSD(a^2m)

Next, in Step 3, suffix and prefix redundant vertices of CSD(T) are identified. A suffix (prefix) redundant vertex is one that has exactly one re-edge (le-edge) leaving it. A vertex is said to be redundant if it is either prefix redundant or suffix redundant or both. We have, from Lemma 4.3.2, that redundant vertices in CSD(T) are the only vertices in CSD(T) (except the source and the sink) which do not represent displayable entities of s. Redundant vertices represent patterns that are maximal in T specifically because they occur at either end of T and are therefore not followed or preceded by a letter.

Lemma 4.3.2 A vertex v in V(T) (v ≠ source, v ≠ sink) is not redundant iff str(v) is a displayable entity of s.

Proof: A displayable entity of s must be a displayable entity of T since, by construction of T, any pattern of size less than n in s has at least one occurrence in T that is preceded (followed) by any letter that precedes (follows) an occurrence of the same pattern in s. Further, a displayable entity of s must have at least two occurrences which are preceded by different letters and at least two occurrences which are followed by different letters (note that a displayable entity of a circular string, unlike that of a linear string, is always preceded and followed by a letter).
So, from the definition of a csdawg, a vertex in V(T) corresponding to a displayable entity of s must have two reedges and two leedges, and is therefore not redundant. All displayable entities of T of size ≥ n are redundant (from Figure 4.3). All nonredundant vertices of T have at least two leedges and two reedges. Since they represent strings of size less than n, and since, by construction, any string of size less than n in T is a string in s, they must represent displayable entities of s. ∎

In Figure 4.4, vertex c is prefix redundant only, while vertex ab is suffix redundant only. The vertex representing S is both prefix and suffix redundant since it has one reedge and one leedge leaving it. The fact that Step 3 does, in fact, identify all redundant vertices is established by Lemma 4.3.3.

Lemma 4.3.3 (a) A vertex v in V(T) will have exactly one reedge (leedge) leaving it only if str(v) is a suffix (prefix) of T. (b) If a vertex v such that str(v) is a suffix (prefix) of T has more than one reedge (leedge) leaving it, then no vertex w such that str(w) is a suffix (prefix) of str(v) can be suffix (prefix) redundant.

Proof: (a) Suppose str(v) is not a suffix of T. Then v has at least two reedges leaving it; otherwise str(v) would not be maximal in T. But this is a contradiction, so str(v) must be a suffix of T. (b) Since str(w) is a suffix of str(v), a letter following str(v) must also follow str(w). So, w must have at least as many reedges leaving it as v. But v has at least two reedges leaving it, so w cannot be suffix redundant. ∎

Figure 4.4. CSD(T) for T = cabcbabcabcbab, with left and right extension edges as marked

Since it is sufficient to examine vertices corresponding to suffixes of T (Lemma 4.3.3(a)), Step 3(a) follows the chain of suffix pointers starting from the sink. If a vertex on this chain has one reedge leaving it, then it is marked suffix redundant.
The traversal of the chain terminates either when the source is reached or when a vertex with more than one reedge leaving it is encountered (Lemma 4.3.3(b)). Similarly, Step 3(b) identifies all prefix redundant vertices in V(T).

Vertices of CSD(T) are processed in reverse topological order in Step 4 and redundant vertices are eliminated. When a vertex is eliminated, the edges incident to/from it are redirected and relabeled as described in Figures 4.5 to 4.10. Procedure ProcessPrefixRedundant is symmetric to ProcessSuffixRedundant. The correctness of the relabeling and redirecting of edges in ProcessSuffixRedundant and ProcessBothRedundant follows from Lemmas 4.3.4 and 4.3.5. The resulting graph is CSD(s).

Lemma 4.3.4 In Procedure ProcessSuffixRedundant, str(v) is a prefix of str(w).

Proof: All occurrences of str(v) in s are followed by xy. So, str(w) must be of the form βstr(v)xy since w cannot be redundant (otherwise it would have been eliminated earlier in Step 4). But at least two occurrences of str(v) are preceded by different letters (since v is not prefix redundant). So, at least two occurrences of str(v)xy are preceded by different letters. So, β = nil, and str(w) = str(v)xy. ∎

Lemma 4.3.5 In Procedure ProcessBothRedundant, w1 = w2.

Proof: yx precedes all occurrences of str(v), and xy follows all occurrences of str(v) in s. Since w1 and w2 cannot be redundant (otherwise they would have been eliminated earlier in Step 4), str(w1) and str(w2) are both yx·str(v)·xy. ∎

Procedure ProcessSuffixRedundant(v)
1. Eliminate all left extension edges leaving v (there are at least two of these).
2. There is exactly one right extension edge e leaving v. Let the vertex that it leads to be w. Let the label on the right extension edge be xy. Delete the edge.
3. All right edges incident on v are updated so that they point to w. Their labels are modified so that they represent the concatenation of their original labels with xy.
4. All left edges incident on v are updated so that they point to w. Their labels are not modified. However, if any of these were suffix extension edges, the bit which indicates this should be reset, as these edges are no longer suffix extension edges.
5. Delete v.

Figure 4.5. Algorithm for processing a vertex which is suffix redundant

Figure 4.6. v is suffix redundant

Procedure ProcessPrefixRedundant(v)
1. Eliminate all right extension edges leaving v (there are at least two of these).
2. There is exactly one left extension edge e leaving v. Let the vertex that it leads to be w. Let the label on the left extension edge be yx. Delete the edge.
3. All left edges incident on v are updated so that they point to w. Their labels are modified so that they represent the concatenation of their original labels with yx.
4. All right edges incident on v are updated so that they point to w. Their labels are not modified. However, if any of these were prefix extension edges, the bit which indicates this should be reset, as these edges are no longer prefix extension edges.
5. Delete v.

Figure 4.7. Algorithm for processing a vertex which is prefix redundant

Figure 4.8. v is prefix redundant

Procedure ProcessBothRedundant(v)
1. There is exactly one right extension edge e1 leaving v. Let the vertex that it leads to be w1. Let the label on the edge be xy. Delete the edge.
2. There is exactly one left extension edge e2 leaving v. Let the vertex that it leads to be w2. Let the label on the edge be yx. Delete the edge. {Lemma 4.3.5 establishes that w1 and w2 are the same vertex.}
3. All right edges incident on v are updated so that they point to w1. Their labels are modified so that they represent the concatenation with xy. If any of these edges were prefix extension edges, the bit which indicates this should be reset.
4. Similarly, left edges incident on v are updated so that they point to w2. Their labels are modified so that they represent the concatenation with yx. If any of these edges were suffix extension edges, the bit which indicates this should be reset.
5. Delete v.

Figure 4.9. Algorithm for processing a vertex which is prefix and suffix redundant

Figure 4.10. v is suffix and prefix redundant

Theorem 4.3.1 Algorithm CircularCsdawg(s) correctly computes CSD(s) and has complexity O(n), which is optimal.

Proof: Since the algorithm eliminates redundant vertices, Lemma 4.3.2 ensures that the vertices in CSD(s) are correctly obtained. It remains to show that the edges of CSD(s) are correctly obtained. Given Lemmas 4.3.4 and 4.3.5, it is easy to verify that Procedures ProcessSuffixRedundant, ProcessPrefixRedundant, and ProcessBothRedundant correctly relabel and redirect edges in the csdawg.

Step 1 takes O(n) time [9]. Step 2 takes O(n) time to locate the vertex representing S in CSD(T). Step 3 will, in the worst case, traverse all the vertices in CSD(T), spending O(1) time at each vertex. The number of vertices is bounded by O(n) [9]. So, Step 3 takes O(n) time. Step 4 traverses CSD(T). Each vertex is processed once; each edge is processed at most twice (once as an incoming edge to the vertex currently being processed, and once as an outgoing edge from the vertex currently being processed). So, Step 4 takes O(n) time (note that CSD(T) has O(n) edges). ∎

Procedure CircOccurrences(s, v) of Figure 4.12 reports the end position of each occurrence of str(v), for v ∈ V(s), in the circular string s. The label corresponding to an edge terminating at a vertex other than the sink is denoted by label(e). The start position of an edge terminating at the sink vertex is denoted by pos(e).
It is similar to the algorithm for computing occurrences of displayable entities in linear strings (Chapter 3).

4.4 Computing Conflicts Efficiently

We have identified a number of problems relating to the computation of conflicts in a linear string and have presented efficient algorithms for most of these problems (Chapter 3). These algorithms typically involve sophisticated traversals or operations on the csdawg for linear strings. Our extension of csdawgs to circular strings makes it possible to use the same algorithms to solve the corresponding problems for circular strings, with some minor modifications caused by our representation of edges that are incident on the sink.

Figure 4.11. Csdawg for <cabcbab>, with left and right extension edges as marked

4.5 Applications

Circular strings may be used to represent circular genomes [4] such as G4 and φX174. The detection and analysis of patterns in genomes helps to provide insights into the evolution, structure, and function of organisms. In [4], G4 and φX174 are analyzed by linearizing them and then constructing their csdawg. We improve upon this by (i) analyzing circular strings without risking the "loss" of patterns, and (ii) extending the analysis and visualization techniques of Chapter 3 for linear strings to circular strings.

Procedure CircOccurrences(s: circular string, v: vertex)
{Obtain all occurrences of str(v), v ∈ V(s), in s}
  if v ≠ sink then Occurrences(s, v, 0);

Procedure Occurrences(s: circular string, v: vertex, i: integer)
begin
  for each reedge e from v in CSD(s) do
  begin
    let w be the vertex on which e is incident;
    if w ≠ sink
      then Occurrences(s, w, |label(e)| + i)
      else for k := 1 to (|s|/periodicity(s)) do
             output(pos(e) - i - 1 + (k-1)·periodicity(s))
  end;
end;

Figure 4.12. Obtaining all occurrences of a displayable entity in a circular string

Circular strings in the form of chain codes are also used to represent closed curves in computer vision [12].
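The periodicity value used by Procedure Occurrences of Figure 4.12 is the quantity characterized in Lemma 4.3.1. As a minimal illustrative sketch (a direct check with O(n²) worst-case cost, not the O(n) csdawg-based computation of Step 2; the function name is ours), it can be computed as:

```python
def periodicity(s: str) -> int:
    """Smallest p such that the string s is a repetition of its first p
    characters (Lemma 4.3.1).  Step 2 of Algorithm CircularCsdawg obtains
    the same value in O(n) time from a csdawg edge label."""
    n = len(s)
    for p in range(1, n + 1):
        # s is periodic with period p iff p divides n and s = (s[:p])^(n/p)
        if n % p == 0 and s[:p] * (n // p) == s:
            return p
    return n
```

For example, periodicity("abab") is 2, while the aperiodic string cabcbab of Figure 4.4 has periodicity equal to its length, 7.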
The objects of Figure 4.13(a) are represented in chain code as follows: (1) Arbitrarily choose a pixel through which the curve passes. In the diagram, the start pixels for the chain code representations of objects 1 and 2 are marked by arrows. (2) Traverse the curve in the clockwise direction. At each move from one pixel to the next, the direction of the move is recorded according to the convention shown in Figure 4.13(b). Objects 1 and 2 are represented by 1122102243244666666666 and 666666661122002242242446, respectively. The alphabet is {0, 1, 2, 3, 4, 5, 6, 7}, which is fixed and of constant size (8) and therefore satisfies the condition of Section 4.2.

We may now use our visualization techniques of Chapter 3 to compare the two objects. For example, our methods would show that objects 1 and 2 share the segments S1 and S2 (Figure 4.13(c)) corresponding to 0224 and 2446666666661122, respectively. Information on other common segments would also be available. The techniques of this paper make it possible to detect all patterns irrespective of the starting pixels chosen for the two objects.

Circular strings may also be used to represent polygons in computer graphics and computational geometry [11]. Figure 4.14 shows a polygon which is represented by the following alternating sequence of lines and angles: bpaaoeaeacfpcfefpaaeaeaccaccbacdaca, where α denotes a 90 degree angle and β a 270 degree angle. The techniques of this paper would point out all instances of self similarity in the polygon, such as aaeaeacflc. Note, however, that for the methods to work efficiently, the number of lines and angles that are used to represent the polygons must be small and fixed.
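The start-pixel independence claimed above can be illustrated with a brute-force sketch (the object strings are those of Figure 4.13; the csdawg machinery of Chapters 3 and 4 finds the same shared segments without enumerating substrings):

```python
def common_circular_segments(code1: str, code2: str, min_len: int = 4) -> set:
    """Segments shared by two closed curves given as circular chain codes.
    Doubling each code makes every circular substring (up to the full
    curve length) appear as an ordinary substring, so the comparison
    does not depend on which start pixel was chosen.  Brute force."""
    def circular_substrings(code):
        doubled, n = code + code, len(code)
        return {doubled[i:i + length]
                for length in range(min_len, n + 1)
                for i in range(n)}
    return circular_substrings(code1) & circular_substrings(code2)

shared = common_circular_segments("1122102243244666666666",
                                  "666666661122002242242446")
# Both segments named in the text, 0224 and 2446666666661122, are found.
```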
Figure 4.13. Representing closed curves by circular strings: (a) the two objects on the pixel grid, with the starting positions for objects 1 and 2 marked; (b) the chain code convention for the eight directions 0 through 7; (c) the shared segments S1 and S2

Figure 4.14. Representing polygons by circular strings

Closed curves defined using B-splines can be determined from their control polygons. If there exist two or more sufficiently long identical segments in the control polygon, then the curve fragments corresponding to those polygon segments would also be identical. Thus, similarity in closed curves represented by B-splines can also be detected by our techniques.

CHAPTER 5
EXTENSION TO BINARY TREES AND SERIES-PARALLEL GRAPHS

5.1 Tree Visualization

In this section, we consider the problem of tree visualization. Section 5.1.2 mentions some of its applications while Section 5.1.3 outlines the algorithms.

5.1.1 Problem Definition

We provide a specification for the problem of tree visualization:

1. Structure of Data to be Visualized: A binary tree BT of size n. Each node of the binary tree contains a key.

2. Structure of Patterns: A subtree of BT.

3. Maximality of Patterns: A subtree pattern is not maximal if its occurrences are all left or all right children of their parents and the subtrees rooted at their parents are all identical. Figure 5.1 shows an example tree BT of size 7. We will represent it using the parenthesized infix notation as (((b)a(c))e((b)a(c))) for the purposes of this paper. Here, the subtrees (b) and (c) are not maximal. However, ((b)a(c)) and (((b)a(c))e((b)a(c))) are maximal.

4. Measure of Similarity MS: If two patterns are identical, then MS = 1. Otherwise, MS = 0.

5.
Display Model: A maximal subtree of BT is called a displayable subtree if it occurs at least twice in BT. All instances of the same displayable subtree are shaded in the same color. Different displayable subtrees are shaded in different colors. In the example of Figure 5.1, ((b)a(c)) is the only displayable subtree. So, BT would be displayed as shown in Figure 5.2.

Figure 5.1. A labeled binary tree

We encounter the problem of subtree conflicts, which is analogous to subword conflicts in strings. Consider the binary tree of Figure 5.3. The two displayable patterns are (((b)a(c))d) and ((b)a(c)). Note that the latter is a subtree of the former. Consequently, the two subtrees cannot be shaded using different colors. Formally, a subtree conflict occurs between two displayable subtrees P1 and P2 iff one is a proper subtree of the other. The problem of subtree conflicts may be solved by using an approach similar to that of Model 3 of Section 2.3. The resulting display of (((((b)a(c))d)e(((b)a(c))d))f((b)a(c))) is shown in Figure 5.4.

Figure 5.2. Highlighting displayable subtrees

Figure 5.3. Subtree conflicts

Figure 5.4. Displaying a tree with subtree conflicts

5.1.2 Applications

Binary trees are chiefly used as data structures in computer science [10]. Sophisticated debuggers attempt to display data structures at different points in the execution of a program [14, 15, 16]. Systems for algorithm animation also require displays of data structures [17, 18, 19]. The techniques of this section may be used in conjunction with tree layout algorithms [20, 21, 22] to provide a meaningful display of binary trees.

While the emphasis of this paper is on exploiting similarity to display discrete objects, we note that similarity detection is also useful in other applications. For example, consider a forest of expression trees.
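Grouping identical subtrees can be sketched via the parenthesized infix notation of Section 5.1.1. This is a hash-based illustration, not the linear-time IdentifySubtrees algorithm of Section 5.1.3, and it reports all repeated subtrees without the maximality filtering described above:

```python
from collections import defaultdict

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def repeated_subtrees(root) -> set:
    """Return the parenthesized infix forms of all subtrees occurring at
    least twice; these are the candidates for displayable subtrees
    (non-maximal ones, like (b) and (c) in Figure 5.1, still appear)."""
    occurrences = defaultdict(list)
    def form(node):
        if node is None:
            return ""
        # (<left form><key><right form>) -- the notation of Section 5.1.1
        s = "(" + form(node.left) + node.key + form(node.right) + ")"
        occurrences[s].append(node)
        return s
    form(root)
    return {s for s, nodes in occurrences.items() if len(nodes) >= 2}
```

For the tree (((b)a(c))e((b)a(c))) of Figure 5.1, the repeated forms are ((b)a(c)) together with its non-maximal subtrees (b) and (c).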
Each expression tree corresponds to the computation of a variable. The leaf nodes of an expression tree correspond to operands while non-leaf nodes correspond to operators (Figure 5.5). The techniques of this paper identify all common subexpressions in the expression trees. Specifically, common subexpressions which are embedded inside other common subexpressions are also detected. Our techniques also provide information for the efficient storage of a forest of trees by converting it into a directed acyclic graph. For example, the tree of Figure 5.3 can be stored efficiently as in Figure 5.6.

Figure 5.5. An expression tree

Figure 5.6. Tree compression by representation as a dag

5.1.3 Algorithms

Each node of the tree contains the fields: lchild, rchild, key, id, leftid, rightid, and frequency. The id and frequency fields of each node are initialized to 0. The leftid (rightid) field of a node is initialized to 1 if its lchild (rchild) field is nil; otherwise, it is initialized to 0.

Algorithm IdentifySubtrees of Figure 5.7 computes all displayable subtree patterns of BT. At the end of execution of IdentifySubtrees, all occurrences of a displayable subtree are assigned the same unique integer (which is stored in the id field of the root of each occurrence of the displayable subtree). The variable NewId is used as a counter. StoreLeaves(BT, L) (line 1) traverses BT and stores pointers to its leaves in the linked list L. Each node in L contains two fields: treenode, which points to a node in BT, and next, which points to the next element of L. M is a linked list containing nodes that are identical to those of L. Initially, M is the empty list (line 5). Sort(L) (line 6) sorts L on (treenode↑.key, treenode↑.leftid, treenode↑.rightid). Groups(L) (line 7) identifies groups of nodes such that all nodes in a group contain identical values in treenode↑.key, treenode↑.leftid, and treenode↑.rightid.
Each group is a contiguous sequence in L because of line 6. Let there be g groups Lk, for 1 ≤ k ≤ g. Lk is a pointer to the first element in the k'th group. Let the number of nodes in Lk be sk, for 1 ≤ k ≤ g. Note that a group represents all occurrences of a subtree pattern. The equality of the leftid and rightid fields guarantees that the left subtrees are identical and the right subtrees are identical. The equality of the key fields ensures that the roots are identical. In the first iteration, a group represents identical leaves.

Lines 8 and 9 ensure that only groups with at least two elements are considered. So, a group corresponding to a maximal subtree pattern also corresponds to a displayable subtree pattern. All roots of the subtrees corresponding to a group are assigned a unique id (lines 11 and 22). Their frequency fields are also initialized (line 21) to the number of elements in the group, i.e., sk. These will be used later to determine whether the subtree pattern corresponding to the group is maximal. Lines 12 to 17 determine whether the subtree patterns rooted at either or both of the two children of Lk↑.treenode are displayable. If this is not the case, then the id fields of either the left children or the right children or both, as appropriate, are reset (lines 35 and 36). Lines 23 to 34 initialize either the leftid or the rightid field of the parent of the root of each subtree occurrence (depending on whether the particular root is a left child or a right child of its parent). If both the leftid and the rightid fields of the parent are nonzero, then a pointer to the parent is inserted into the linked list M (lines 27 and 33). Finally, the list L is deleted and the process is repeated with list M (lines 40 and 41).

Line 1 of IdentifySubtrees consumes O(n) time, since it involves a traversal of BT.
Let Li denote the list L in the i'th iteration of the while statement of line 3, and |Li| denote the number of nodes in Li (1 ≤ i ≤ w, where w represents the number of iterations of the while loop). Since each node in Li corresponds to a distinct node in BT, and since a node in BT is an element of at most one Li, Σ(i=1..w) |Li| ≤ n. We now show that the i'th iteration of the while loop consumes O(|Li|) time. It is assumed that all keys are in the range [1..n]. Then, line 6 can be executed in O(|Li|) time, if radix sort is used to sort the linked list. The keys are relabeled so that they are in the range [1..|Li|]. The relabeling can be done in O(|Li|) time as shown in Figures 5.8 and 5.9. Line 7 requires a pass over Li and also consumes O(|Li|) time. Lines 8 to 39 essentially involve a pass over Li, with O(1) time being spent at each node. Line 40 also involves a pass over Li and consumes O(|Li|) time. Thus, IdentifySubtrees consumes O(n) time. (If the keys are not in the range [1..n], then they can be relabeled to satisfy this condition. However, that relabeling step would take O(n log k) time, where k is the number of distinct keys in BT.)

Algorithm IdentifySubtrees
1   StoreLeaves(BT, L);
2   NewId := 0;
3   while L is not empty do
4   begin
5     M := nil;
6     Sort(L);
7     Groups(L);
8     for k := 1 to g do  {g is the number of groups obtained from line 7}
9       if sk > 1 then    {sk is the number of nodes in group k}
10      begin
11        NewId := NewId + 1;
12        ResetLeft := false;
13        ResetRight := false;
14        if Lk↑.treenode↑.lchild ≠ nil
15          then if Lk↑.treenode↑.lchild↑.frequency = sk then ResetLeft := true;
16        if Lk↑.treenode↑.rchild ≠ nil
17          then if Lk↑.treenode↑.rchild↑.frequency = sk then ResetRight := true;
18        for j := 1 to sk do
19        begin
20          current := Lk↑.treenode;
21          current↑.frequency := sk;
22          current↑.id := NewId;
23          if current↑.parent↑.lchild = current then
24          begin
25            current↑.parent↑.leftid := NewId;
26            if current↑.parent↑.rightid ≠ 0
27              then AddNode(M, current↑.parent);
28          end
29          else
30          begin
31            current↑.parent↑.rightid := NewId;
32            if current↑.parent↑.leftid ≠ 0
33              then AddNode(M, current↑.parent);
34          end;
35          if ResetLeft then current↑.lchild↑.id := 0;
36          if ResetRight then current↑.rchild↑.id := 0;
37          Lk := Lk↑.next;
38        end;
39      end;
40    DeleteList(L);
41    L := M;
42  end.

Figure 5.7. Algorithm for identifying displayable subtrees

{An auxiliary array A and stack S, each of size n, are used. All elements of A are initialized to 0 at the beginning of the program. The stack is initially empty. A key k (guaranteed to lie between 1 and n) is relabeled by first checking A[k] to determine whether it has already been assigned a label. If not, key k is assigned a label. Finally, the integer k is pushed onto S.}

Procedure Label(L)
begin
  count := 0;
  for each key k in L do
    if A[k] = 0 then
    begin
      count := count + 1;
      A[k] := count;
      Push(S, k);
    end;
end;

Figure 5.8. Procedure for relabeling keys

{The elements of array A are reset to 0 prior to the start of the next iteration of the while statement on line 3 of Algorithm IdentifySubtrees.}

Procedure ReInitialize;
begin
  while S is not empty do
  begin
    k := Pop(S);
    A[k] := 0;
  end
end;

Figure 5.9. Procedure for reinitializing array A

5.2 Geometric Series-Parallel Graph Visualization

In this section, we consider the problem of geometric series-parallel graph visualization. Section 5.2.2 mentions some of its applications while Section 5.2.3 outlines the algorithms.

5.2.1 Problem Definition

We provide a specification for the problem of series-parallel graph visualization:

1. Structure of Data to be Visualized: A geometric series-parallel graph SPG with n labeled vertices. Each vertex of SPG contains a character chosen from an alphabet Σ of constant size. "Fork" and "Join" vertices are not labeled. Figure 5.10 shows a geometric series-parallel graph with 13 labeled vertices, 2 fork vertices, and 2 join vertices. We represent the series-parallel graph as a string using the double parenthesis notation: ab[(cd)(ef)(b)]ab[(cd)(ef)].
A geometric series-parallel graph differs from a regular series-parallel graph in that it describes the layout of a regular series-parallel graph in the plane. So, the graph obtained by exchanging (cd) and (ef) in Figure 5.10 would be a different geometric series-parallel graph from the one in Figure 5.10, even though both represent the same series-parallel graph.

2. Structure of Patterns: A geometric series-parallel subgraph of SPG.

3. Maximality of Patterns: A series-parallel subgraph pattern P1 is not maximal iff all its instances are subgraphs of the same series-parallel subgraph P2 and these instances of P1 all occur in the same relative geometric position in P2.

4. Measure of Similarity MS: If two patterns are identical, then MS = 1. Otherwise, MS = 0.

5. Display Model: A maximal subpattern of SPG is called a displayable subgraph if it occurs at least twice in SPG. All instances of the same displayable subgraph are shaded in the same color. Different displayable subgraphs are shaded in different colors. The displayable subgraphs are ab, b, and (cd)(ef). So, the graph of Figure 5.10 may be displayed using a model similar to that of Model 3 of Section 2.3, as shown in Figure 5.11.

Figure 5.10. A geometric series-parallel graph

Figure 5.11. Highlighting displayable subgraphs (DE 1, DE 2, and DE 3)

Algorithm IdentifySubgraphs
1. Convert geometric series-parallel graph SPG into a string S in the double parenthesis notation.
2. Compute CSD(S).
3. Extract the displayable subgraphs of SPG from the displayable entities of CSD(S). Each displayable subgraph is associated with the smallest displayable entity from which it was extracted.

Figure 5.12. Computing displayable subgraphs

5.2.2 Applications

Series-parallel circuits are an important class of electronic circuits. These can be modeled by series-parallel graphs in which the vertices (other than fork and join vertices) represent active circuit elements.
These vertices are labeled by the name/id of the active component they represent. The different branches of a fork/join are ordered top to bottom in some canonical way. For example, the lexical ordering of the double parenthesis string representations of the branches could be used. Common subgraphs represent common subcircuits. Circuit visualization/display systems are required to display circuits in such a way that common circuit substructures are easily identified.

5.2.3 Algorithms

Figure 5.12 outlines an algorithm for determining the displayable subgraphs of a geometric series-parallel graph. First, the geometric series-parallel graph SPG is converted into a string S in the double parenthesis notation described earlier. Next, the csdawg CSD(S) corresponding to S is constructed. Figure 5.13 shows CSD(S) for S = ab[(cd)(ef)(b)]ab[(cd)(ef)] (only reedges are shown). Finally, the displayable subgraphs of SPG are obtained from the displayable entities of S. This is achieved by extracting the longest substrings with no unmatched parenthesis from each displayable entity. The displayable subgraphs corresponding to the displayable entity ab[(cd)(ef) are ab and (cd)(ef), corresponding to DE 1 and DE 3 in Figure 5.11. Each displayable subgraph is associated with the vertex in CSD(S) from which it was obtained, so that all the locations of its occurrences in SPG can be retrieved. If a displayable subgraph is obtained from two or more displayable entities, then it is associated with the smallest displayable entity. Note that a unique smallest displayable entity must exist from the definition of a csdawg.

Figure 5.13. Csdawg for the graph of Figure 5.10 (only reedges are shown)

Steps 1 and 2 of the algorithm each consume O(n) time, while Step 3 consumes O(n²) time.
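Step 3's extraction of the longest substrings with no unmatched parenthesis can be sketched as follows (a hypothetical helper of ours, not the dissertation's implementation; running it over every displayable entity fits within the O(n²) bound stated above):

```python
def balanced_pieces(entity: str) -> list:
    """Split a displayable entity into its maximal substrings containing
    no unmatched ( ) or [ ] delimiters (Step 3 of Algorithm
    IdentifySubgraphs).  Unmatched delimiters act as separators."""
    pairs = {')': '(', ']': '['}
    stack, unmatched = [], set()
    for i, c in enumerate(entity):
        if c in "([":
            stack.append((c, i))
        elif c in ")]":
            if stack and stack[-1][0] == pairs[c]:
                stack.pop()
            else:
                unmatched.add(i)          # close with no matching open
    unmatched.update(i for _, i in stack)  # opens that were never closed
    pieces, current = [], []
    for i, c in enumerate(entity):
        if i in unmatched:
            if current:
                pieces.append("".join(current))
            current = []
        else:
            current.append(c)
    if current:
        pieces.append("".join(current))
    return pieces
```

For the displayable entity ab[(cd)(ef) discussed above, this yields ab and (cd)(ef), i.e., DE 1 and DE 3 of Figure 5.11.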
CHAPTER 6
SYSTEM INTEGRATION

6.1 Using the Object-Oriented Methodology

In this section we demonstrate that our algorithms are well suited to implementation using the object-oriented paradigm. The data abstraction and inheritance features of the object-oriented paradigm are particularly applicable to our algorithms.

Most of our algorithms for the string, circular string, and series-parallel graph discrete objects are essentially traversals or operations on the corresponding csdawg. High-level operations on the csdawgs, such as creating the csdawg, computing displayable entities, computing their locations, and computing display conflicts, make use of lower-level operations on the csdawgs, such as finding the vertex in the csdawg corresponding to a substring (subgraph) or modifying the contents of a vertex. Thus, the concept of data abstraction, which views data as a black box whose contents can only be accessed through its operations, can be applied naturally to csdawgs.

Many operations are implemented identically for each of the three csdawgs. These are usually low-level operations, such as locating the vertex corresponding to a string or modifying the contents of a vertex. Other operations (usually higher-level operations), while not implemented identically for the three discrete objects, are similar. Typically, a high-level operation on a circular string or a series-parallel graph can be implemented by the corresponding operation on a linear string, preceded by a simple preprocessing step and followed by a simple postprocessing step.

We create a class for each csdawg corresponding to the three discrete objects: LinearCsdawg, CircularCsdawg, SeriesParallelCsdawg. LinearCsdawg is a base class,