







MODELS AND TECHNIQUES FOR THE VISUALIZATION OF LABELED
DISCRETE OBJECTS














By

DINESH P. MEHTA


A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
1992

















ACKNOWLEDGEMENTS


I wish to express my gratitude to my advisor, Professor Sartaj Sahni, for his guid-

ance, encouragement, and help over the last five years. This dissertation would not

have been possible without his supervision. I am also grateful to the members of my

dissertation committee at Florida, Professors Gerhard Ritter, John Staudhammer,

Ted Johnson, and Haniph Latchman for patiently reading this manuscript and pro-

viding insightful feedback. I thank Phil Barry of the Computer Science Department

and Bill Fox of the Department of Psychology at the University of Minnesota for

their support. I also thank the Computer Science Departments at the University of

Florida and the University of Minnesota, and the National Science Foundation for

financial support.

I would also like to thank my friends, Mario Lopez, Vijay Rajan, Rose Tsang,

and Andrew Lim for many fruitful discussions and also for other not-so-fruitful, but

entertaining, times which made the process enjoyable.

Finally, I wish to thank my family, and in particular, my parents, Prabha and

Prakash Mehta, for their love and support. This dissertation is dedicated to them.


















TABLE OF CONTENTS




ACKNOWLEDGEMENTS

ABSTRACT

CHAPTERS

1 INTRODUCTION

2 STRING VISUALIZATION
  2.1 Problem Specification
  2.2 Visualization Conflicts
  2.3 Refinements of Display Model
  2.4 String Visualization Queries
  2.5 Applications
      2.5.1 General Methods
      2.5.2 Numerical Data
      2.5.3 Molecular Biology
      2.5.4 Textual Data

3 STRING VISUALIZATION ALGORITHMS
  3.1 Definitions
      3.1.1 Compact Symmetric Directed Acyclic Word Graphs (csdawgs)
      3.1.2 Computing Occurrences of Displayable Entities in a string
      3.1.3 Prefix and Suffix Extension Trees
  3.2 Computing Conflicts
      3.2.1 Algorithm to determine whether a string is conflict-free
      3.2.2 Subword Conflicts
      3.2.3 Prefix-Suffix Conflicts
      3.2.4 Alternative Algorithms
  3.3 Size Restricted Queries
  3.4 Pattern Oriented Queries
  3.5 Statistical Queries
  3.6 Experimental Results
  3.7 Display Algorithms

4 CIRCULAR STRING VISUALIZATION
  4.1 Introduction
  4.2 Definitions
  4.3 Constructing the Csdawg for a Circular String
  4.4 Computing Conflicts Efficiently
  4.5 Applications

5 EXTENSION TO BINARY TREES AND SERIES-PARALLEL GRAPHS
  5.1 Tree Visualization
      5.1.1 Problem Definition
      5.1.2 Applications
      5.1.3 Algorithms
  5.2 Geometric Series-Parallel Graph Visualization
      5.2.1 Problem Definition
      5.2.2 Applications
      5.2.3 Algorithms

6 SYSTEM INTEGRATION
  6.1 Using the Object-Oriented Methodology
  6.2 Overview of a visualization system

7 CONCLUSIONS

REFERENCES

BIOGRAPHICAL SKETCH












Abstract of Dissertation
Presented to the Graduate School of the University of Florida
in Partial Fulfillment of the Requirements for the
Degree of Doctor of Philosophy


MODELS AND TECHNIQUES FOR THE VISUALIZATION OF LABELED
DISCRETE OBJECTS

By
Dinesh P. Mehta

August 1992


Chairman: Dr. Sartaj Sahni
Major Department: Computer and Information Sciences

The objective of visualization is to extract useful and relevant information from

raw data and present it so that it can be easily understood and assimilated by humans.

We propose a general model for the visualization of labeled discrete objects. (Ex-

amples of labeled discrete objects are strings, circular strings, trees, graphs, etc.)

Our model is based on identifying similar subobjects in an object and coding them

visually. This is demonstrated by applying the model to linear and circular strings,

binary trees, and series-parallel graphs. We also describe and classify the problem of

display conflicts that is associated with this visualization model and suggest methods

to overcome it. We extend the csdawg data structure for linear strings to circular

strings. Efficient, often optimal, algorithms to implement the visualization model for

linear and circular strings, binary trees, and series-parallel graphs are also developed.

These algorithms are specifically designed to utilize the inheritance and data ab-

straction features of the object-oriented paradigm. Several of these algorithms were

implemented in C++ to evaluate their performance. We also propose a blueprint for

a visualization system based on our visualization model.













Applications of this visualization technique arise in areas such as molecular biol-

ogy, computer vision, computer graphics, VLSI CAD, data compression, algorithm
animation, and debuggers.

















CHAPTER 1
INTRODUCTION

The objective of visualization is to extract useful and relevant information from

data and represent it so that it can be easily understood and assimilated by humans.

This enables specialists in application areas to observe trends and patterns in the

data. This could lead to better understanding of various phenomena and provide

insights resulting in theories or hypotheses, which can subsequently be proved (or

disproved) by formal methods or by further experimentation.

A notion that is useful in the understanding of objects is that of similarity. Multi-

ple occurrences of the same pattern in data which represents the outcome or result of

some process indicate the presence of many instances of the same "effect." Analyz-

ing the set of circumstances associated with these occurrences could yield a plausible

"cause." For example, multiple occurrences of the same flaw in a paper roll might

reveal the faulty component in the paper production process.

Similarly, multiple occurrences of the same patterns in an object whose struc-

ture is being studied indicate the presence of many instances of the same "cause."

Observing the phenomena which occur in the presence of these patterns could shed

some light on the "effect" of the patterns. For example, multiple occurrences of

patterns in the DNA strands of organisms could manifest themselves as common
characteristics shared by those organisms.

This linkage of cause and effect is a fundamental goal in many scientific disciplines.

Most work in visualization so far, which attempts to facilitate the scientific goals

outlined above, consists of choosing methods to display individual units of data, so












that patterns and trends become visually obvious when the data is seen in its entirety

[1, 2]. However, the onus of detecting patterns and trends still lies with the user or
the specialist. This problem becomes more acute when the amount of data is large and the

user's perceptual faculties are overburdened. Consequently, errors of omission (not

seeing patterns which are actually there) become more likely.

Our work attempts to shift the responsibility of detecting patterns from the human

to the computer. This is done by devising algorithms to detect patterns in the data

and then making these patterns available to the user for further scrutiny. Multiple

occurrences of the same pattern can be made visually explicit by color coding them

with the same color. Other methods could also be used, such as "flashing" occurrences

of the same pattern on the screen. If visual schemes are not appropriate, recurring

patterns could simply be provided as a list of occurrences which the user goes through.

In the context of a visualization environment, this technique could be used either in

a stand-alone manner or as a supplement to other visualization methods.

For the technique outlined above to be useful, we propose the following principles

on displaying patterns, which we shall call the Similarity Paradigm. The similarity

paradigm forms the basis for the visualization techniques used in this dissertation.

Principle 1: Two patterns should be displayed to look similar iff they are similar.

Principle 2: The degree of similarity between the displays of two patterns should

be proportional to the actual similarity of the patterns.

Chapter 2 discusses string visualization while Chapter 3 provides algorithms which

implement string visualization queries. Chapter 4 discusses the extension of our

visualization techniques for linear strings to circular strings and Chapter 5 deals with

the visualization of binary trees and series-parallel graphs. Finally, Chapter 6 explains

how the work of the first five chapters may be integrated to form a visualization

system based on the similarity paradigm.

















CHAPTER 2
STRING VISUALIZATION

2.1 Problem Specification

A complete specification of a visualization problem based on the similarity paradigm

requires one to provide the following five items. These are illustrated using linear

string visualization as an example. Henceforth, the word "string" shall refer to linear

strings.

(1) Structure of the Data to be Visualized: Does the data represent a string,

a series-parallel graph, a binary tree, etc?
In the string visualization example, the data is a string S of length n whose

characters are chosen from a fixed alphabet Σ of constant size.

(2) Structure of Patterns: This depends largely on the structure of the data.

If the data represents a string, then the structure of the pattern could be a substring

(contiguous sequence of string elements), a subsequence (a non-contiguous sequence
of string elements), etc. Patterns can have other constraints imposed upon them.

For example, a pattern may be required to be of a minimum size.

In the string visualization example, the pattern is a substring of S defined uniquely

by its start and end positions.

(3) Maximality of Patterns: If a pattern is repeated in the data, then any

subpattern of the pattern is also repeated. For example, if abc repeats in a string, so

do ab, bc, a, b, and c. In particular, if all occurrences of ab occur in the context of abc,

then attempting to distinguish between ab and abc does not serve any useful purpose.











Defining maximality and restricting the user's attention to maximal patterns helps

to simplify the display.
In the string visualization example, a pattern is said to be maximal iff its occur-

rences are not all preceded by the same letter, nor all followed by the same letter.

Consider the string S = abczdefydefxabc. Here, the empty string λ, abc, def, and S
are the only maximal patterns. The occurrences of def are preceded by different let-

ters (z and y) and followed by different letters (y and x). The occurrences of abc are

not preceded by the same letter (the first occurrence does not have a predecessor) nor

followed by the same letter. However, de is not maximal because all its occurrences

in S are followed by f.
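
For small strings, maximality can be checked directly from this definition. The following C++ fragment is a brute-force sketch of such a check (the function name maximalRepeats and the program itself are illustrative only, not code from the algorithms developed later); run on S = abczdefydefxabc it reports exactly abc and def, the repeated maximal patterns of the example.

#include <iostream>
#include <set>
#include <string>
#include <vector>
using namespace std;

// Brute-force sketch: report every non-null pattern that occurs more than once in S
// and is maximal in the sense defined above (its occurrences are not all preceded by
// the same letter, nor all followed by the same letter).
set<string> maximalRepeats(const string& S) {
    int n = S.size();
    set<string> result;
    for (int i = 0; i < n; i++)
        for (int len = 1; len <= n - i; len++) {
            string w = S.substr(i, len);
            vector<int> occ;                                  // start positions of w
            for (int j = 0; j + len <= n; j++)
                if (S.compare(j, len, w) == 0) occ.push_back(j);
            if (occ.size() < 2) continue;                     // must repeat
            set<char> pred, succ;
            bool atStart = false, atEnd = false;
            for (int j : occ) {
                if (j == 0) atStart = true; else pred.insert(S[j - 1]);
                if (j + len == n) atEnd = true; else succ.insert(S[j + len]);
            }
            if ((atStart || pred.size() > 1) && (atEnd || succ.size() > 1))
                result.insert(w);                             // maximal on both sides
        }
    return result;
}

int main() {
    for (const string& w : maximalRepeats("abczdefydefxabc"))
        cout << w << '\n';                                    // prints abc and def
}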

(4) Measure of Similarity (MS): A measure of similarity would consist of

attaching numerical values to pairs of maximal patterns which indicate the degree of

similarity between the two patterns.

In the string visualization example, the measure of similarity could be defined

as follows: If two patterns are identical, then MS = 1. Otherwise, MS = 0. In

other words, two patterns are defined to be similar iff they are identical. There is no

concept of "degree of similarity" in this definition.

(5) Display Models: This addresses the issues of which patterns are displayed

and how they are displayed. While choosing display models, Principles 1 and 2 of

Chapter 1 should be kept in mind.
In the string visualization example, a pattern is said to be a displayable entity (or

displayable) iff it is maximal, non-null, and occurs more than once in S (in this case,

all maximal patterns are displayable entities with the exception of S, which occurs

only once in itself, and the empty string λ). All instances of the same displayable entity are highlighted in

the same color. Instances of different displayable entities are highlighted in different
















Figure 2.1. Highlighting Displayable Entities



colors (there is no relationship between colors representing different displayable en-

tities). In the example string S, abc and def are the only displayable entities. So, S

would be displayed by highlighting abc in one color and def in another as shown in

Figure 2.1.

2.2 Visualization Conflicts

Consider the string S = abcicdefcdegabchabcde and its displayable entities, abc
and cde (both are maximal and occur thrice). The displayable entities abc and cde

must be highlighted in different colors. Notice, however, that abc and cde both

occur in the string abcde, which is a suffix of S. Clearly, both displayable entities

cannot be highlighted in different colors in abcde as required by the model. This is

a consequence of the fact that the letter c occurs in both displayable entities. This

situation is known as a prefix-suffix conflict because a prefix of one displayable entity

(here, cde) is a suffix of the other (here, abc). Note, also, that c is a displayable entity

in S. Consequently, all occurrences of c must be highlighted in a color different from

those used for abc and cde. But this is impossible as c is a subword of both abc and

cde. This situation is referred to as a subword conflict. Formally,

(i) A subword conflict between two displayable entities D1 and D2 in S exists iff D1
is a substring of D2.

(ii) A prefix-suffix conflict between two displayable entities D1 and D2 in S exists iff

there exist non-null substrings Sp, Sm, Ss in S such that SpSmSs occurs in S, SpSm
= D1, and SmSs = D2. We say that a prefix-suffix conflict exists between D1 and D2
with respect to Sm. Sm is known as the intersection of the conflict.


Figure 2.2. Possible Configurations
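
Viewed at the level of occurrences, a subword conflict corresponds to one occurrence interval containing another, and a prefix-suffix conflict to two occurrence intervals that properly overlap (the overlapping region is the intersection Sm). The following brute-force C++ sketch classifies the conflicting occurrence pairs for the example string; the struct Occ, the hard-coded entity list, and the program are illustrative only.

#include <iostream>
#include <string>
#include <vector>
using namespace std;

struct Occ { string entity; int s, e; };               // occurrence interval [s, e], 1-based

int main() {
    string S = "abcicdefcdegabchabcde";
    vector<string> entities = {"abc", "cde", "c"};     // displayable entities of S
    vector<Occ> occ;
    for (const string& w : entities)
        for (size_t p = S.find(w); p != string::npos; p = S.find(w, p + 1))
            occ.push_back({w, (int)p + 1, (int)(p + w.size())});
    for (size_t i = 0; i < occ.size(); i++)
        for (size_t j = 0; j < occ.size(); j++) {
            if (occ[i].entity == occ[j].entity) continue;
            const Occ &a = occ[i], &b = occ[j];
            if (a.s <= b.s && b.e <= a.e)              // b lies inside a: subword conflict
                cout << "subword conflict: " << b.entity << "[" << b.s << "," << b.e
                     << "] inside " << a.entity << "[" << a.s << "," << a.e << "]\n";
            else if (a.s < b.s && b.s <= a.e && a.e < b.e)   // proper overlap
                cout << "prefix-suffix conflict: " << a.entity << "[" << a.s << "," << a.e
                     << "] overlaps " << b.entity << "[" << b.s << "," << b.e << "]\n";
        }
}

For this string the program reports the occurrences of c contained in abc and cde, and the single prefix-suffix conflict between abc[17,19] and cde[19,21] in the suffix abcde.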

2.3 Refinements of Display Model

When subword and prefix-suffix conflicts occur, we need some criteria to determine
which of the information previously required to be displayed actually gets displayed.
For instance, in the example string S = abcicdefcdegabchabcde from the previous
section, three possible non-conflicting, displayable subsets are shown in Figure 2.2.
In this section we present three refinements to the display model from Section 2.1
which attempt to overcome the display difficulties created by conflicts. They are
(1) Single-Copy, Maximum-Content, No-Overlap: In this model, exactly one copy
of the string is displayed. Occurrences of displayable entities are selected so that
there are no mutually conflicting occurrences. Given this restriction, the model re-
quires occurrences to be selected so that the amount of information conveyed by the
display is maximized. This goal may be achieved in three ways.
Interactive : The user selects occurrences interactively by using his/her judgement.
Typically, this would be done by examining the occurrences which are involved in a
conflict and choosing one that is the most meaningful.














Figure 2.3. Optimal Configuration under Model 1



Automatic: A numeric weight is assigned to each occurrence. The higher the weight,
the greater the desirability of displaying the corresponding occurrence. Criteria that
could be used in assigning weights to occurrences include length, position, number of
occurrences of the pattern, semantic value of the displayable entity, information on
conflicts, etc. The information is then fed to a routine which selects a set of occur-
rences so that the sum of their weights is maximized. For example, consider string
S = abcicdefcdegabchabcde of Section 2.2. If the weight assigned to each occurrence
of abc is 4, cde is 2, c is 3, then Figure 2.3 shows the optimal display configuration.
The total weight of the display is 18.
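
Because conflicting occurrences are, in this model, exactly those whose intervals overlap in the string, the automatic selection problem can be viewed as choosing a maximum-weight set of non-overlapping intervals. The sketch below solves it with a standard weighted-interval-scheduling dynamic program; this is only one possible routine (the algorithms actually developed for these queries appear in Chapter 3), and the struct and occurrence list are hard-coded for the example above.

#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;

struct Occ { int s, e, w; };                            // interval [s, e], weight w

int maxWeight(vector<Occ> occ) {
    sort(occ.begin(), occ.end(), [](const Occ& a, const Occ& b) { return a.e < b.e; });
    int n = occ.size();
    vector<int> best(n + 1, 0);                         // best[k]: using the first k occurrences
    for (int k = 1; k <= n; k++) {
        int p = 0;                                      // latest earlier occurrence ending
        for (int j = k - 1; j >= 1; j--)                // before occ[k-1] starts
            if (occ[j - 1].e < occ[k - 1].s) { p = j; break; }
        best[k] = max(best[k - 1], best[p] + occ[k - 1].w);
    }
    return best[n];
}

int main() {
    // Occurrences of abc (weight 4), cde (2), and c (3) in S = abcicdefcdegabchabcde.
    vector<Occ> occ = {{1,3,4}, {13,15,4}, {17,19,4},
                       {5,7,2}, {9,11,2}, {19,21,2},
                       {3,3,3}, {5,5,3}, {9,9,3}, {15,15,3}, {19,19,3}};
    cout << maxWeight(occ) << '\n';                     // prints 18, the optimal total weight above
}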
Semi-Automatic: In a practical environment, the most appropriate method would
be a hybrid of the Interactive and Automatic approaches described above. The user
could select some occurrences that he/she wants included in the final display. The
selection of the remaining occurrences can then be performed by a routine which
maximizes the display information.
(2) Multiple-Copy, No-Overlap: Multiple copies of the string may be displayed.
Mutually disjoint sets of occurrences are associated with the copies (one set per copy),
so that the occurrences corresponding to each copy are mutually non-conflicting.
There are two approaches:
(a) A constant number (max) of copies of the string may be displayed. The total
content of the display, summed over all max copies, is to be maximized. For example,











Figure 2.4. Optimal Configuration under Model 2(a)




Figure 2.5. Configuration under Model 2(b)


consider string S = abcicdefcdegabchabcde of Section 2.2. If the weights corresponding
to abc, cde, and c are 4, 2, and 3, respectively, and max = 2, then Figure 2.4 shows
the optimal display configuration. The total weight of the display is 31.
(b) No limit is imposed on the number of copies of the string that may be displayed.
Each occurrence is highlighted in exactly one copy of the string. The number of
copies of the string used should be minimized. In the worst case, O(n^2) copies of
the string may be required. The displayable entities in the string S = a^n are a^i, for
1 <= i < n. So, each proper substring of a^n is an occurrence of a displayable entity
of S. Consider the character S_{n/2}. It occurs in approximately n^2/4 occurrences of
displayable entities (since there are approximately n/2 positions to the left and to
the right of S_{n/2}, and all but one combination of these represents an occurrence of a














Figure 2.6. Optimal Configuration under Model 3



displayable entity). Since a different copy of the string is required to display these
occurrences without overlap, the number of copies of the string is approximately n^2/4,
or O(n^2). Figure 2.5 shows a configuration for string S = abcicdefcdegabchabcde of
Section 2.2.
(3) Single-Copy, Maximum-Content, Subword-Overlap: Exactly one copy of the
string may be displayed. Occurrences are selected so that no pair of occurrences
has a prefix-suffix conflict. Subword conflicts are allowed. As in the Single-Copy,
Maximum-Content, No-Overlap model, the goal is to maximize the information con-
veyed. Again, there are three approaches for selecting occurrences: Automatic, Inter-
active and Semi-Automatic. For example, consider string S = abcicdefcdegabchabcde
of Section 2.2. If the weights corresponding to each occurrence of abc, cde, and c are
4, 3, and 2, respectively, then Figure 2.6 shows the optimal display configuration.
The total value of the display is 31.
Note that it is crucial to all methods to get information on prefix-suffix and
subword conflicts.

2.4 String Visualization Queries

The following algorithmic problems arise as a result of the discussion in the pre-
vious section:
Given a string S of length n whose elements are chosen from an alphabet Σ of
fixed size,











P1. Obtain a list of all displayable entities and their occurrences.

P2. Obtain a list of all prefix-suffix conflicts.

P3. Obtain a list of all subword conflicts.

Given a list of occurrences of displayable entities and a weight associated with

each occurrence,

P4. Obtain a set of mutually non-conflicting occurrences so that the sum of the

weights associated with them is maximum (Model 1).

P5. Obtain max mutually disjoint sets of mutually non-conflicting occurrences so

that the sum of the weights associated with them is maximum (Model 2(a)).

P6. Obtain a minimum number of mutually disjoint sets of mutually non-conflicting
occurrences required to partition the set of all occurrences (Model 2(b)).

P7. Obtain a set of occurrences, such that no two occurrences in the set have a prefix-

suffix conflict, so that the sum of the weights associated with them is maximized

(Model 3).

Algorithms for P1-P4, P6, and P7 are presented in Chapter 3.
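
Under the same interval view of conflicts (two occurrences conflict exactly when their intervals overlap), P6 amounts to partitioning a set of intervals into the minimum number of pairwise non-overlapping groups. The sketch below uses a simple greedy sweep over starting positions; it is only one possible routine, not necessarily the one developed in Chapter 3, and the occurrence list is hard-coded for the example string of Section 2.2, for which 3 copies are reported.

#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;

struct Occ { int s, e; };                     // occurrence interval [s, e], 1-based

int main() {
    // Occurrences of abc, cde, and c in S = abcicdefcdegabchabcde.
    vector<Occ> occ = {{1,3}, {13,15}, {17,19}, {5,7}, {9,11}, {19,21},
                       {3,3}, {5,5}, {9,9}, {15,15}, {19,19}};
    sort(occ.begin(), occ.end(), [](const Occ& a, const Occ& b) { return a.s < b.s; });
    vector<int> freeAt;                       // freeAt[c]: first position copy c can reuse
    for (const Occ& o : occ) {
        bool placed = false;
        for (int c = 0; c < (int)freeAt.size() && !placed; c++)
            if (freeAt[c] <= o.s) { freeAt[c] = o.e + 1; placed = true; }   // reuse copy c
        if (!placed) freeAt.push_back(o.e + 1);                             // open a new copy
    }
    cout << freeAt.size() << '\n';            // prints 3: the minimum number of copies
}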

2.5 Applications

An abstract strategy has been outlined for the visualization of strings. This

section discusses some general methods for applying this strategy to specific areas.

Applications to molecular biology, text and numerical sequences are also outlined.

2.5.1 General Methods

In order to apply this strategy successfully to actual data, it is important to first

check that the data conforms to the definition of strings provided in Section 2.1. If

this is not the case, then it may be possible to transform the data so that it satisfies

the definition without losing vital information in the process.











(1) If a string consists of characters which are chosen from a fixed alphabet of a

small size ( < 50, say), then it is already in the correct format. For example, DNA

sequences are made up from an alphabet of 4 elements. English text is made up of

an alphabet of 26 characters and some special symbols.

(2) Otherwise, if a function, f, can be defined for each element in the alphabet

such that:

a) The range of f can be determined and is of constant size and

b) f(element1) = f(element2) iff element1 and element2 are similar,
then the given string may be converted to another string which is obtained by apply-

ing f to each element of the original string. The resulting string can now be input
to the visualization routines.

For example, consider a sequence of objects which are chosen from a large (pos-

sibly infinite) alphabet. Assume that a set of properties, P = {P1,P2,...,Pm}, is

associated with each object. Suppose that patterns of property Pi of the sequence

of objects are interesting (where Pi may take on one of a constant, fixed number

of values). Then, for the purposes of visualization, each object in the sequence is

replaced by the corresponding Pi value. This approach can be used with the other

properties as well. Some examples where this approach is useful are

(1) Protein Sequences [2]: A protein sequence consists of a sequence of amino acids.

While the number of amino acids that could form a sequence is large, it is possible

to place amino acids in groups on the basis of physical properties such as hydropho-

bicity, acidity, polarity etc. Amino acids in the sequence may then be replaced by a

symbol representing the groups to which they belong.

(2) Chronological sequences of Multidimensional Data [1]: Here, a number of mea-

surements relating to a particular scientific phenomenon are taken at regular intervals

of time. The measurements for each variable are classified as low, medium, or high.











Consequently, the sequence of multidimensional data may be replaced by a sequence

of symbols L, M, H representing low, medium, and high values, respectively, of the

variable being studied.

Finally, many applications would benefit by comparisons between two or more

different strings (as opposed to comparisons within the same string). This can also

be supported by a simple extension of our techniques.

2.5.2 Numerical Data

An important category of data is numerical data. These arise whenever properties

of objects are described by measurements. Numerical information, in general, is cho-

sen from large alphabet sizes which are determined by the accuracy of measurement

required. Clearly, numerical sequences cannot be directly input to our visualization

system. This is remedied by determining the range of values that a variable can

take on. This range is then subdivided into a constant number of subranges (this is

essentially the same strategy used by Beddow [1]). Each value in the sequence is then

replaced by a symbol representing the subrange to which it belongs. The resulting

sequence may then be input to our visualization system. Consider a sequence of num-

bers which lie in the range 1-200. Assume that subranges have been defined as 1-20,

21-40, ..., 181-200, which are represented by the symbols a, b, ..., j, respectively. So,

the sequence 7 142 63 94 6 148 69 becomes ahdeahd.
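
A minimal C++ sketch of this subrange coding (the subranges and symbols are those of the example above; the program is illustrative only):

#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    vector<int> seq = {7, 142, 63, 94, 6, 148, 69};
    string coded;
    for (int x : seq)
        coded += char('a' + (x - 1) / 20);   // value in 1..200 -> symbol a..j
    cout << coded << '\n';                   // prints ahdeahd
}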

Sequences of numbers, such as financial data, are usually studied by using graphs.

We expect that the techniques outlined here if used in conjunction with graphs could

reduce the possibility of overlooking important patterns in the data. This can be done

by either appropriately coloring pieces of the graph or by coloring a string which is

aligned with the graph.












Often, comparisons are made not between the values of numbers in a sequence,

but between the increases/decreases in consecutive values. For example, in [5 20 15 75
90 85], 5 20 15 is not obviously related to 75 90 85. However, the increases/decreases
in the values are identical: an increase of 15 followed by a decrease of 5. Information
of this type may be obtained by transforming the string by taking the difference
between successive values before inputting it to the visualization system; i.e., 15 -5

60 15 -5. Similar transformations may be used for percentage increases/decreases,
second order differences, etc.
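
A minimal sketch of the first-order difference transform, applied to the example sequence above (illustrative only):

#include <iostream>
#include <vector>
using namespace std;

int main() {
    vector<int> seq = {5, 20, 15, 75, 90, 85};
    for (size_t i = 1; i < seq.size(); i++)
        cout << seq[i] - seq[i - 1] << ' ';   // prints 15 -5 60 15 -5
    cout << '\n';
}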

2.5.3 Molecular Biology

In molecular biology, RNA, DNA, and protein sequences are studied. Sequence

comparison helps to answer questions about evolution, structure and function in
organisms, and the structural configuration of individual RNA molecules [3]. Of
particular importance are repeating patterns and their relative positions [4]. Here [4],
the csdawg data structure, which is described in the next chapter, is used to analyze

sequences. Our work improves upon this by suggesting more effective display methods
and by introducing more sophisticated analysis techniques involving prefix-suffix and
subword conflicts.
For example, let D1 and D2 be displayable entities and D1 be a subword of D2. If
the fraction (number of occurrences of D1 contained in D2)/(number of occurrences

of D1) ≈ 1, then we can infer that D1 usually occurs as a subword of D2, which could
mean that D1 does not perform any significant function except as a subword of D2.

Suppose patterns P1 and P2 perform functions F1 and F2 in an organism. Then, if

(number of prefix-suffix conflicts between P1 and P2)/(min{number of occurrences of
P1, number of occurrences of P2}) ≈ 1, then we can infer that F1 and F2 are generally
performed by the same region and are therefore related in some way.
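
A minimal brute-force sketch of the first of these fractions, for a small example string (here D1 = bc and D2 = abcdbc in abcdbcgabcdbchbc, giving 4/5 = 0.8; the helper occurrences() and the program are illustrative only):

#include <iostream>
#include <string>
#include <vector>
using namespace std;

vector<pair<int,int>> occurrences(const string& s, const string& w) {   // [start, end) intervals
    vector<pair<int,int>> occ;
    for (size_t p = s.find(w); p != string::npos; p = s.find(w, p + 1))
        occ.push_back({(int)p, (int)(p + w.size())});
    return occ;
}

int main() {
    string S = "abcdbcgabcdbchbc", D1 = "bc", D2 = "abcdbc";
    auto o1 = occurrences(S, D1), o2 = occurrences(S, D2);
    int contained = 0;
    for (auto [s1, e1] : o1)                         // count D1 occurrences lying
        for (auto [s2, e2] : o2)                     // inside some D2 occurrence
            if (s2 <= s1 && e1 <= e2) { contained++; break; }
    // A fraction close to 1 suggests D1 rarely occurs outside D2 (here 0.8).
    cout << double(contained) / o1.size() << '\n';
}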











2.5.4 Textual Data

Structural information about text may be obtained by studying prefix-suffix and

subword conflicts. Information about the contexts in which certain phrases are used is

provided by subword conflicts. Information on the combination of phrases is provided

by prefix-suffix conflicts. This information can be used to identify anomalies in

sentence structure and possibly identify the author of a text by the structure. It can

also be used to decipher text coded using sophisticated substitution ciphers where

patterns are substituted by other patterns.
















CHAPTER 3
STRING VISUALIZATION ALGORITHMS

3.1 Definitions

3.1.1 Compact Symmetric Directed Acyclic Word Graphs (csdawgs)

The csdawg data structure is used to represent a string or a set of
strings. It evolved from other string data structures such as position trees [5], suffix

trees [6], directed acyclic word graphs [7, 8], etc. A csdawg CSD(S) corresponding to
a string S is a directed acyclic graph defined by a set of vertices V(S), a set R(S) of
labeled directed edges called right extension edges (re-edges), and a set L(S) of labeled
directed edges called left extension edges (le-edges). Each vertex of V(S) represents
a maximal substring of S. Specifically, V(S) consists of a source (which represents
the empty word, λ), a sink (which represents S), and a vertex corresponding to each
displayable entity of S.
Let str(v) denote the string represented by vertex v, for v ∈ V(S). Define the

implication, imp(S, α), of a string α in S to be the smallest superword of α in {str(v) :

v ∈ V(S)}, if such a superword exists. Otherwise, imp(S, α) does not exist.
Re-edges from a vertex v1 are obtained as follows: for each letter x in Σ, if

imp(S, str(v1)x) exists and is equal to str(v2) = βstr(v1)xγ, then there exists an
re-edge from v1 to v2 with label xγ. If β is the empty string, then the edge is known
as a prefix extension edge. Le-edges from a vertex v1 are obtained as follows: for each
letter x in Σ, if imp(S, xstr(v1)) exists and is equal to str(v2) = γxstr(v1)β, then






























Figure 3.1. Csdawg for S = cdefabcgabcde (L(S) not shown)



there exists an le-edge from v1 to v2 with label γx. If β is the empty string, then the

edge is known as a suffix extension edge.
Figure 3.1 shows V(S) and R(S) corresponding to S = cdefabcgabcde. Here abc,

cde, and c are the displayable entities of S. There are two outgoing re-edges from the

vertex representing abc. These edges correspond to x = d and x = g. imp(S, abcd)

= imp(S, abcg) = S. Consequently, both edges are incident on the sink. There are

no edges corresponding to the other letters of the alphabet as imp(S, abcx) does not

exist for x ∈ {a, b, c, e, f}.
The space required for CSD(S) is O(n) and the time needed to construct it is

O(n) [7, 9]. While we have defined the csdawg data structure for a single string S, it
can be extended to represent a set of strings.
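
The definitions of imp and of re-edges can be checked by brute force on small examples. The sketch below hard-codes the vertex set {λ, abc, cde, c, S} for S = cdefabcgabcde and recomputes the outgoing re-edges of the vertex for abc directly from the definition of imp; it reports edges only for the letters d and g, both ending at the sink, as described above for Figure 3.1. The helper imp() and the exhaustive search are illustrative only; the actual construction is the linear-time algorithm of [7, 9].

#include <iostream>
#include <optional>
#include <set>
#include <string>
using namespace std;

// imp(S, alpha): the smallest superword of alpha among the strings represented by
// the csdawg vertices (brute force over an explicitly given vertex set).
optional<string> imp(const string& alpha, const set<string>& vertexStrings) {
    optional<string> best;
    for (const string& s : vertexStrings)
        if (s.find(alpha) != string::npos && (!best || s.size() < best->size()))
            best = s;
    return best;
}

int main() {
    string S = "cdefabcgabcde";
    // V(S): the source (empty word, written ""), the sink (S), and abc, cde, c.
    set<string> V = {"", S, "abc", "cde", "c"};
    for (char x : string("abcdefg")) {                // the alphabet of S
        auto target = imp(string("abc") + x, V);      // re-edge of abc for letter x, if any
        if (target)
            cout << "re-edge of abc for '" << x << "' ends at "
                 << (*target == S ? string("the sink") : *target) << '\n';
    }
    // Prints edges only for x = d and x = g, both incident on the sink.
}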













Algorithm LinearOccurrences(S, v)
{ Find all occurrences of str(v) in S }
Occurrences(S, v, 0)

Procedure Occurrences(S:string, u:vertex, i:integer)
1 begin
2 if str(u) is a suffix of S
3 then output(|S| - i);
4 for each re-edge e from u do
5 begin
6 Let w be the vertex on which e is incident;
7 Occurrences(S, w, |label(e)| + i);
8 end;
9 end;



Figure 3.2. Algorithm for obtaining all occurrences of displayable entity



3.1.2 Computing Occurrences of Displayable Entities in a string

Figure 3.2 presents an algorithm for computing the end positions of all the oc-

currences of str(v) in S. This is based on the outline provided by Blumer et al. [9].

First, the algorithm determines whether str(v) is a suffix of S. If so, the end posi-

tion of S is reported. The remaining occurrences of str(v) are computed recursively

by examining each node reached by an outgoing re-edge from v. The complexity of

LinearOccurrences(S, v) is proportional to the number of occurrences of str(v) in S.
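
A minimal C++ rendering of this recursion, run on a hand-coded fragment of the csdawg of S = cdefabcgabcde (only the vertex for abc and the sink, with the two re-edge labels worked out by hand from the definitions of Section 3.1.1), prints the end positions 11 and 7 of abc. The data structures and names are illustrative only.

#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

struct Edge { string label; string target; };          // re-edge with its label
map<string, vector<Edge>> reEdges;                     // vertex name -> outgoing re-edges
map<string, string> strOf;                             // vertex name -> str(v)

void occurrences(const string& S, const string& u, int i) {
    const string& su = strOf[u];
    if (su.size() <= S.size() &&
        S.compare(S.size() - su.size(), su.size(), su) == 0)
        cout << S.size() - i << '\n';                  // str(u) is a suffix: report end position
    for (const Edge& e : reEdges[u])
        occurrences(S, e.target, i + (int)e.label.size());
}

int main() {
    string S = "cdefabcgabcde";
    strOf = {{"abc", "abc"}, {"sink", S}};
    reEdges["abc"] = {{"de", "sink"}, {"gabcde", "sink"}};  // both re-edges end at the sink
    occurrences(S, "abc", 0);                               // prints 11 and 7
}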

3.1.3 Prefix and Suffix Extension Trees

The prefix extension tree PET(S, v) at vertex v in V(S) is a subgraph of CSD(S)

consisting of

(i) the root v

(ii) PET(S, w) defined recursively for each vertex w in V(S) such that there exists












a prefix extension edge from v to w, and

(iii) the prefix extension edges leaving v.

The suffix extension tree SET(S, v) at v is defined analogously.

In Figure 3.1, PET(S,v), where v is the vertex representing the substring c,

consists of the vertices representing c and cde, and the sink. It also includes the

prefix extension edges from c to cde and from cde to the sink. Similarly, SET(S, v),

where v is the vertex representing c, consists of the vertices representing c and abc

and the suffix extension edge from c to abc (not shown in the figure).

Lemma 3.1.1 PET(S,v) (SET(S,v)) contains a directed path from v to a vertex w

in V(S) iff str(v) is a prefix (suffix) of str(w).

Proof: If there is a directed path in PET(S, v) from v to some vertex w, then

from the definition of a prefix extension edge and the transitivity of the "prefix-of"

relation, str(v) must be a prefix of str(w).
If str(v) is a prefix of str(w), then there exists a series of re-edges from v to w,

such that str(v), when concatenated with the labels on these edges yields str(w).

But, each of these re-edges must be a prefix extension edge. So a directed path from

v to w exists in the PET(S, v).

The proof for SET(S, v) is analogous. *



3.2 Computing Conflicts

3.2.1 Algorithm to determine whether a string is conflict-free

Before describing our algorithm to determine if a string is free of conflicts, we

establish some properties of conflict-free strings that will be used in this algorithm.












Lemma 3.2.1 If a prefix-suffix conflict occurs in a string S, then a subword conflict
must occur in S.

Proof: If a prefix-suffix conflict occurs between two displayable entities W1
and W2 in S, then there exists WpWmWs in S such that WpWm = W1 and WmWs =

W2. Since W1 and W2 are maximal, W1 isn't always followed by the same letter and
W2 isn't always preceded by the same letter; i.e., Wm isn't always followed by the
same letter and Wm isn't always preceded by the same letter. So, Wm is maximal.
But, W1 occurs at least twice in S (since W1 is a displayable entity). So Wm occurs
at least twice (since Wm is a subword of W1) and is a displayable entity of S. But,
Wm is a subword of W1. So a subword conflict occurs between Wm and W1 in S. *



Corollary 3.2.1 The intersection of a prefix-suffix conflict between two displayable
entities is itself a displayable entity.

Corollary 3.2.2 If string S is free of subword conflicts, then it is free of conflicts.

Lemma 3.2.2 str(w) is a subword of str(v) in S iff there is a path comprising right
extension and suffix extension edges from w to v.

Proof: From the definition of CSD(S), if there exists an re-edge from u to v,
then str(u) is a subword of str(v). If there exists a suffix extension edge from u to
v, then str(u) is a suffix (and therefore a subword) of str(v). If there exists a path
comprising right and suffix extension edges from w to v, then by transitivity, str(w)
is a subword of str(v).
If str(w) is a suffix of str(v), then there is a path (Lemma 3.1.1) of suffix exten-
sion edges from w to v. If str(w) is a subword, but not a suffix of str(v), then from












the definition of a csdawg, there is a path of re-edges from w to a vertex representing

a suffix of str(v). *


Let Vsource be the set of all vertices in V(S) such that an re-edge or suffix extension

edge exists between the source vertex of CSD(S) and each element of Vsource.

Lemma 3.2.3 String S is conflict-free iff all right extension or suffix extension edges

leaving vertices in Vsource end at the sink vertex of CSD(S).

Proof: A string S is conflict-free iff there does not exist a right or suffix

extension edge between two vertices, neither of which is the source or sink of CSD(S)
(Corollary 3.2.2 and Lemma 3.2.2).
Assume that S is conflict-free. Consider any vertex v in Vsource. If v has a right

or suffix extension out edge < v, w >, then v ≠ sink. If w ≠ sink, then str(v) is a

subword of str(w) and the string is not conflict-free. This contradicts the assumption

on S.

Next, assume that all right and suffix extension edges leaving vertices in Vsource

end at the sink vertex. Clearly, there cannot exist right or suffix extension edges

between any two vertices, v and w (v ≠ sink, w ≠ sink), in Vsource. Further, there

cannot exist a vertex x in V(S) (x ≠ source, x ≠ sink) such that x ∉ Vsource. For

such a vertex to exist, there must exist a path consisting of right and suffix extension

edges from a vertex in Vsource to x. Clearly, this is not true. So, S is conflict-free. *


The preceding development leads to algorithm NoConflicts (Figure 3.3).


Theorem 3.2.1 Algorithm NoConflicts is both correct and optimal.












Algorithm NoConflicts(S)
1. Construct CSD(S).
2. Compute Vsource.
3. Scan all right and suffix extension out edges from each element of Vsource. If any
edge points to a vertex other than the sink, then a conflict exists. Otherwise, S is
conflict-free.


Figure 3.3. Algorithm to determine whether a string is conflict-free



Proof: Correctness is an immediate consequence of Lemma 3.2.3. Step 1
takes O(n) time [9]. Step 2 takes O(1) time since |Vsource| <= 2|Σ|. Step 3 takes
O(1) time since the number of out edges leaving vertices in Vsource is less than 4|Σ|^2.
So, NoConflicts takes O(n) time, which is optimal. Actually, Steps 2 and 3 can be
merged into Step 1 and the construction of CSD(S) aborted as soon as an edge that
violates Lemma 3.2.3 is created. *



3.2.2 Subword Conflicts

Consider the problem of finding all subword conflicts in string S. Let ks be the

number of subword conflicts in S. Any algorithm to solve this problem requires (i)
O(n) time to read in the input string and (ii) O(ks) time to output all subword
conflicts. So, Ω(n + ks) is a lower bound on the time complexity for this problem.
For the string S = a^n, ks = n^4/24 + n^3/4 - 13n^2/24 - 3n/4 + 1 = O(n^4). This is
an upper bound on the number of conflicts as the maximum number of substring
occurrences is O(n^2) and, in the worst case, all occurrences conflict with each other.
In this section, a compact method for representing conflicts is presented. Let ksc be
the size of this representation. ksc is n^3/6 + n^2/2 - 5n/3 = O(n^3) for a^n. Compaction












never increases the size of the output and may yield up to a factor of n reduction, as

in the example. The compaction method is described below.
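
The counts quoted above for S = a^n can be reproduced by brute force from the definitions: the displayable entities of a^n are a^i for 1 <= i < n, and a superword a^j contributes frequency(a^j) times (occurrences of a^i in a^j) = (n-j+1)(j-i+1) subword conflicts with each a^i, i < j. The C++ sketch below tabulates ks and ksc this way and prints the closed forms alongside for comparison; it is an illustrative check only.

#include <iostream>
using namespace std;

int main() {
    for (long long n = 3; n <= 10; n++) {
        long long ks = 0, ksc = 0;
        for (long long j = 2; j < n; j++) {          // superword a^j
            long long inside = 0;
            for (long long i = 1; i < j; i++)        // subword a^i
                inside += j - i + 1;                 // occurrences of a^i in one a^j
            ks  += (n - j + 1) * inside;             // conflicts contributed by a^j
            ksc += (n - j + 1) + inside;             // compact list size for a^j
        }
        long long ksFormula  = (n*n*n*n + 6*n*n*n - 13*n*n - 18*n + 24) / 24;
        long long kscFormula = (n*n*n + 3*n*n - 10*n) / 6;
        cout << "n=" << n << "  ks=" << ks << " (" << ksFormula << ")"
             << "  ksc=" << ksc << " (" << kscFormula << ")\n";
    }
}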

Consider S= abcdbcgabcdbchbc. The displayable entities are D1 = abcdbc and D2

= bc. The end positions of D1 are 6 and 13 while those of D2 are 3, 6, 10, 13, and

16. A list of the subword conflicts between D1 and D2 can be written as {(6,3),

(6,6), (13,10), (13,13)}. The first element of each ordered pair is the last position of
the instance of the superstring (here D1) involved in the conflict; the second element

of each ordered pair is the last position of the instance of the substring (here D2)

involved in the conflict.

The cardinality of the set is the number of subword conflicts between D1 and D2.

This is given by frequency(D1)*number of occurrences of D2 in D1. Since each conflict

is represented by an ordered pair, the size of the output is 2(frequency(D1)*number

of occurrences of D2 in D1).

Observe that the occurrences of D2 in D1 are in the same relative positions in

all instances of D1. It is therefore possible to write the list of subword conflicts

between D1 and D2 as (6,13):(0,-3). The first list gives all the occurrences in S of the

superstring (D1), and the second gives the relative positions of all the occurrences

of the substring (D2) in the superstring (D1) from the right end of D1. The size of

the output is now frequency(D1)+ number of occurrences of D2 in D1. This is more

economical than our earlier representation.
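
A brute-force C++ sketch of this compact representation for the example string (it prints the end positions of D1 followed by the relative offsets of D2 from the right end of D1, i.e., 6 13 : -3 0, the same information as the list above):

#include <iostream>
#include <string>
#include <vector>
using namespace std;

// 1-based end positions of all occurrences of w in s.
vector<int> endPositions(const string& s, const string& w) {
    vector<int> ends;
    for (size_t p = s.find(w); p != string::npos; p = s.find(w, p + 1))
        ends.push_back((int)(p + w.size()));
    return ends;
}

int main() {
    string S = "abcdbcgabcdbchbc", D1 = "abcdbc", D2 = "bc";
    for (int e : endPositions(S, D1)) cout << e << ' ';             // occurrences of D1 in S
    cout << ": ";
    for (int e : endPositions(D1, D2)) cout << e - (int)D1.size() << ' ';   // D2 inside D1,
    cout << '\n';                                                   // relative to D1's right end
}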

In general, a substring Di of S will have subword conflicts with many instances of a

number of displayable entities (say, Dj, Dk, ..., Dz) of which it (Di) is the superword.

We would then write the conflicts of Di as
(li1, li2, ...) : (lj1, lj2, ...), (lk1, lk2, ...), ..., (lz1, lz2, ...).
Here, the li's represent all the occurrences of Di in S; the lj's, lk's, ..., lz's

represent the relative positions of all the occurrences of Dj, Dk, ..., Dz, respectively,















Algorithm SubwordConflicts(S:string)
{ Identify displayable entities that contain subword displayable entities }
1 begin
2 for each vertex v in CSD(S) do
3 begin
4 v.subword = false;
5 for all vertices u such that a right or suffix extension edge < u, v >
is incident on v do
6 if u ≠ source then v.subword = true;
7 end
8 for each vertex v in CSD(S) such that v ≠ sink, v.subword is true do
9 GetSubwords(S, v);
10 end

Procedure GetSubwords(S, v)
{ Compute subword displayable entities of str(v) }
1 begin
2 LinearOccurrences(S, v);
3 v.sublist = {0};
4 SetUp(v);
5 SetSuffixes(v);
6 for each vertex, x (≠ source), in reverse topological order of SG(S, v) do
7 begin
8 if str(x) is a suffix of str(v) then x.sublist = {0} else x.sublist = {};
9 for each w in SG(S, v) on which an re-edge e from x is incident do
10 begin
11 for each element l in w.sublist do
12 x.sublist = x.sublist ∪ {l - |label(e)|};
13 end;
14 output(x.sublist);
15 end;
16 end


Figure 3.4. Optimal algorithm to compute all subword conflicts












in Di. One such list will be required for each displayable entity that contains other
displayable entities as subwords. The following quantities are easily obtained:
Size of Compact Representation = Σ_{Di∈D} (fi + Σ_{Dj∈Dis} rij).
Size of Original Representation = 2 Σ_{Di∈D} (fi · Σ_{Dj∈Dis} rij).

Here, fi is the frequency of Di (only Di's that have conflicts are considered), and rij
is the frequency of Dj in one instance of Di. The letter D represents the set of all
displayable entities of S, while Dis represents the set of all displayable entities that
are subwords of Di.
SG(S, v), for v ∈ V(S), is defined as the subgraph of CSD(S) which consists of
the set of vertices SV(S, v) ⊆ V(S) which represent displayable entities that are
subwords of str(v) and the set SE(S, v) of all re-edges and suffix extension edges
that connect any pair of vertices in SV(S, v). Define SGR(S, v) as SG(S, v) with the
directions of all the edges in SE(S, v) reversed.

Lemma 3.2.4 SG(S, v) consists of all vertices w such that a path comprising right or

suffix extension edges joins w to v in CSD(S).

Proof: Follows from Lemma 3.2.2 and the definition of SG(S, v). *


Algorithm SubwordConflicts(S) of Figure 3.4 computes all subword conflicts
of S. The subword conflicts are computed for precisely those displayable entities
which have subword displayable entities. Lines 4 to 6 of SubwordConflicts deter-
mine whether str(v) has subword displayable entities. Each incoming right or suffix
extension edge to v is checked to see whether it originates at the source. If any in-
coming edge originates at a vertex other than source, then v.subword is set to true
(Lemma 3.2.2). If all incoming edges originate from source, then v.subword is set












to false. Procedure GetSubwords(S, v), which computes the subword conflicts of

str(v) is invoked iff v.subword is true.

Procedure LinearOccurrences(S, v) (line 2 of GetSubwords) reports the occur-

rences of str(v) in S. Procedure SetUp(v) in line 4 traverses SGR(S,v) and ini-

tializes fields in each vertex of SGR(S, v) so that a reverse topological traversal of

SG(S, v) may subsequently be performed. Procedure SetSuffixes(v) in line 5 marks
vertices whose displayable entities are suffixes of str(v). This is accomplished by

following the chain of reverse suffix extension pointers starting at v and marking the

vertices encountered as suffixes of v.

A list of relative occurrences sublist is associated with each vertex x in SG(S, v).

x.sublist represents the relative positions of str(x) in an occurrence of str(v). Each

relative occurrence is denoted by its position relative to the last position of str(v)

which is represented by 0. If str(x) is a suffix of str(v) then x.sublist is initialized

with the element 0. The remaining elements of x.sublist are computed from the

sublist fields of vertices w in SG(S, v) such that a right extension edge goes from x

to w. Consequently, w.sublist must be computed before x.sublist. This is achieved

by traversing SG(S, v) in reverse topological order [10].

Lemma 3.2.5 x.sublist for vertex x in SG(S, v) contains all relative occurrences of

str(x) in str(v) on completion of GetSubwords(S,v).

Proof: The correctness of this lemma follows from the correctness of pro-
cedure LinearOccurrences(S, v) of Section 3.1.2 and the observation that lines 6 to

14 of procedure GetSubwords achieve the same effect as LinearOccurrences(S, v) in

SG(S, v). *












Theorem 3.2.2 Algorithm SubwordConflicts takes O(n + ksc) time and space and is
therefore optimal.

Proof: Computing v.subword for each vertex v in V(S) takes O(n) time as
constant time is spent at each vertex and edge in CSD(S). Consider the complexity of

GetSubwords(S, v). Line 2 takes O(|v.occurrences|) time. Let the number of vertices

in SG(S, v) be m. Then the number of edges in SG(S, v) is O(m) since each vertex
has at most 2|Σ| = O(1) edges leaving it. Line 4 traverses SG(S, v) and therefore
consumes O(m) time. Line 5, in the worst case, could involve traversing SG(S, v),
which takes O(m) time. Computing the relative occurrences of str(x) in str(v) (lines

8-14) takes O(|x.sublist|) time for each vertex x in SG(S, v). So, the total complexity
of GetSubwords(S, v) is O(|v.occurrences| + m + Σ_{x∈SV(S,v)-{v}} |x.sublist|).
However, m is O(Σ_{x∈SV(S,v)-{v}} |x.sublist|), since |x.sublist| >= 1 for each x in

SG(S, v). But |v.occurrences| + Σ_{x∈SV(S,v)-{v}} |x.sublist| is the size of the output of
GetSubwords(S, v).
So, the complexity of Algorithm SubwordConflicts(S) is

O(n + Σ_{v∈V(S)-{sink}, v.subword=true} |output of GetSubwords(S, v)|) = O(n + ksc). *



3.2.3 Prefix-Suffix Conflicts

As with subword conflicts, the lower bound for the problem of computing prefix-
suffix conflicts is Ω(n + kp), where kp is the number of prefix-suffix conflicts in S.
For S = a^n, kp = n^4/24 + O(n^3) = Θ(n^4), which matches the O(n^4) upper bound
on kp. Unlike subword conflicts, it is not possible to compact the
output representation.
Let w and x, respectively, be vertices in SET(S, v) and PET(S, v). Let str(v) =
Wm, str(w) = WpWm, and str(x) = WmWx. Define Pshadow(w, v, x) to be the vertex















Figure 3.5. Illustration of prefix and suffix trees and a shadow prefix dag



representing imp(S, WpWmWx), if such a vertex exists. Otherwise, Pshadow(w, v, x)

= nil. We define Pimage(w, v, x) = Pshadow(w, v, x) iff Pshadow(w, v, x) = imp(S, WpWmWx)

= WaWpWmWx for some (possibly empty) string Wa. Otherwise, Pimage(w, v, x) =

nil. For each vertex w in SET(S, v), a shadow prefix dag SPD(w, v) rooted at vertex

w comprises the set of vertices {Pshadow(w, v, x) | x in PET(S, v), Pshadow(w, v, x)

≠ nil}.
Figure 3.5 illustrates these concepts. Broken lines represent suffix extension edges,

dotted lines represent right extension edges, and solid lines represent prefix exten-

sion edges. SET(S,v), PET(S,v), and SPD(w,v) have been enclosed by dashed,

solid, and dotted lines, respectively. We have Pshadow(w,v, v) = Pimage(w, v, v)












= w. Pshadow(w,v,z) = Pshadow(w,v,r) = c. However, Pimage(w,v,z) =
Pimage(w, v, r) = nil. Pshadow(w, v, x) = Pimage(w, v, x) = a. Pshadow(w, v,p) =
b, but Pimage(w, v, p) = nil. Pshadow(w, v, q) = Pshadow(w, v, s) = Pimage(w, v, q)
= Pimage(w,v,s) = nil.

Lemma 3.2.6 A prefix-suffix conflict occurs between two displayable entities, W1 =
str(w) and W2 = str(x), with respect to a third displayable entity Wm = str(v) iff

(i) w occurs in SET(S, v) and x occurs in PET(S, v), and (ii) Pshadow(w, v, x) ≠
nil. The number of conflicts between str(w) and str(x) with respect to str(v) is equal
to the number of occurrences of str(Pshadow(w,v,x)) in S.

Proof: By definition, a prefix-suffix conflict occurs between displayable en-
tities W1 and W2 with respect to Wm iff there exists WpWmWx in S, where W1 =
WpWm and W2 = WmWx.
Clearly, Wm is a suffix of W1 and Wm a prefix of W2 iff w occurs in SET(S, v)
and x occurs in PET(S, v). WpWmWx occurs in S iff imp(S, WpWmWx) exists, i.e., iff
Pshadow(w, v, x) ≠ nil. The number of conflicts between str(w) and str(x) is equal
to the number of occurrences of imp(S, WpWmWx) = str(Pshadow(w, v, x)) in S. *



Lemma 3.2.7 If a prefix-suffix conflict does not occur between str(w) and str(x) with
respect to str(v), where w occurs in SET(S, v) and x occurs in PET(S, v), then there
are no prefix-suffix conflicts between any displayable entity represented by a descen-
dant of w in SET(S, v) and any displayable entity represented by a descendant of
x in PET(S, v) with respect to str(v).

Proof: Since w is in SET(S, v) and x is in PET(S, v), we can represent str(w)
by Wpstr(v) and str(x) by str(v)Wx. If no conflicts occur, then Wpstr(v)Wx does
not occur in S. The descendants of w in SET(S, v) will represent displayable entities
of the form Wastr(w) = WaWpstr(v), while the descendants of x in PET(S, v) will
represent displayable entities of the form str(x)Wb = str(v)WxWb, where Wa and
Wb are substrings of S. For a prefix-suffix conflict to occur between Wastr(w) and
str(x)Wb with respect to str(v), WaWpstr(v)WxWb must occur in S. However, this is
not possible as Wpstr(v)Wx does not occur in S, and the result follows. *



Lemma 3.2.8 In CSD(S), if

(i) y = Pimage(w,v,x),
(ii) there is a prefix extension edge e from x to z with label aα, and
(iii) there is a right extension edge f from y to u with label aβ,
then Pshadow(w,v,z) = u.

Proof: Let str(w) = Wpstr(v) and str(x) = str(v)Wx. By definition, str(y)
= WaWpstr(v)Wx for some, possibly empty, string Wa. We have str(z) = str(x)aα
= str(v)Wxaα and str(u) = Wbstr(y)aβ = WbWaWpstr(v)Wxaβ for some string Wb.
Pshadow(w, v, z) = imp(S, Wpstr(v)Wxaα). To prove the lemma, we must show
that Pshadow(w, v, z) = u, or, that (i) Wpstr(v)Wxaα is a subword of str(u) and
(ii) str(u) is the smallest superword of Wpstr(v)Wxaα represented by a vertex in
CSD(S).
(i) Assume that Wpstr(v)Wxaα is not a subword of str(u) = WbWaWpstr(v)Wxaβ.
So, α is not a prefix of β.
Case 1: β is a proper prefix of α.
Since WbWaWpstr(v)Wxaβ is maximal, its occurrences are not all followed by the
same letter. This statement is also true for any of its suffixes. In particular, all



























Figure 3.6. Illustration of conditions for Lemma 3.2.8











occurrences of str(v)Wxaβ cannot be followed by the same letter. Similarly, all oc-
currences of str(v)Wxaβ cannot be preceded by the same letter, as it is a prefix of
str(v)Wxaα = str(z). So, str(v)Wxaβ is a displayable entity of S. Consequently, the

prefix extension edge from x corresponding to the letter a must be directed to the
vertex representing str(v)Wxaβ. This is a contradiction.
Case 2: aβ matches aα in the first k characters, but not in the (k + 1)th character
(1 <= k < 1 + min(|α|, |β|)).
We have aβ = aγβ1, aα = aγα1, where |γ| = k - 1. Clearly, the strings str(v)Wxaγα1
and WbWaWpstr(v)Wxaγβ1 occur in S. So, all occurrences of str(v)Wxaγ cannot

be followed by the same letter. Further, all occurrences of str(v)Wxaγ cannot be
preceded by the same letter as it is a prefix of str(v)Wxaα = str(z). So, it is a dis-

playable entity of S. Consequently, the prefix extension edge from x corresponding
to the letter a must be directed to the vertex representing str(v)Wxaγ. This results
in a contradiction. Thus, α is a prefix of β.


(ii) From (i), α is a prefix of β. Assume that WbWaWpstr(v)Wxaβ is not the smallest
superword of Wpstr(v)Wxaα. Since str(y) = Pimage(w, v, x) = WaWpstr(v)Wx is

the smallest superword of Wpstr(v)Wx, the smallest superword of Wpstr(v)Wxaα
must be of the form Wb1WaWpstr(v)Wxaγ, where α is a prefix of γ, which is a proper

prefix of β, and/or Wb1 is a proper suffix of Wb. But, the right out edge f from y
points to the smallest superword of WaWpstr(v)Wxa (from the definition of CSD(S)),
which is WbWaWpstr(v)Wxaβ. So, Wb1 = Wb and γ = β, which is a contradiction. *



Lemma 3.2.9 In CSD(S), if

(i) y = Pimage(w, v, x),





















Figure 3.7. Illustration of conditions for Lemmas 3.2.9 and 3.2.10












(ii) there is a path of prefix extension edges from x to x1 (let the concatenation of
their labels be aα),
(iii) there is a prefix extension edge from x1 to z with label bγ, and
(iv) there is a right extension edge f from y to u with label aαbβ,
then u = Pshadow(w, v, z) ≠ nil.

Proof: Similar to the proof of Lemma 3.2.8. *



Lemma 3.2.10 In Lemma 3.2.8 or Lemma 3.2.9, if |label(f)| <= sum of the lengths

of the labels of the edges on the prefix extension edge path P from x to z, then

label(f) = concatenation of the labels on P and u = Pimage(w, v, z).

Proof: From Lemma 3.2.9, the concatenation of the labels of the edges of P

is a prefix of label(f). But |label(f)| <= the sum of the lengths of the labels of the edges

on P. Thus, label(f) = the concatenation of the labels of the edges on P, and u

= Pimage(w, v, z) follows. *



Lemma 3.2.11 If Pshadow(w,v,x) = nil then Pshadow(w,v,y) = nil for all descen-

dants y of x in PET(S,v).

Proof: Follows from Lemmas 3.2.6 and 3.2.7. *



Algorithm PrefixSuffixConflicts(S) in Figure 3.8 computes all prefix-suffix con-

flicts in S. Line 1 constructs CSD(S). Lines 2 and 3 compute all prefix-suffix conflicts
in S by separately computing for each displayable entity str(v), all the prefix-suffix

conflicts of which it is the intersection (Corollary 3.2.1).
























Algorithm PrefixSuffixConflicts(S:string)
{Compute all prefix-suffix conflicts in S}
1 Construct CSD(S).
2 for each vertex v in CSD(S) do
3 NextSuffix(v,v); {compute all conflicts wrt str(v)}

Procedure NextSuffix(current,v: vertex);
1 for each suffix extension edge < current, w > do
2 {there can only be one suffix extension edge from current to w}
3 begin
4 exist = false
5 ShadowSearch(v, w, v, w);{compute SPD(w, v)}
6 if exist then NextSuffix(w,v);
7 end;


Figure 3.8. Optimal algorithm to compute all prefix-suffix conflicts












Procedure NextSuffix(current, v) computes all prefix-suffix conflicts between dis-

playable entities represented by descendants of current in SET(S, v) and displayable

entities represented by descendants of v in PET(S, v) with respect to str(v) (so the

call to NextSuffix(v, v) in line 3 of Algorithm PrefixSuffixConflicts(S) computes all

prefix-suffix conflicts with respect to str(v)). It does so by identifying SPD(w, v) for
each child w of current in SET(S, v). The call to ShadowSearch(v,w,v,w) in line 5

identifies SPD(w, v) and computes all prefix-suffix conflicts between str(w) and dis-

playable entities represented by descendants of v in PET(S, v) with respect to str(v).
If ShadowSearch(v,w,v,w) does not report any prefix-suffix conflicts then the global

variable exist is unchanged by ShadowSearch(v,w,v,w) (so, exist = false, from line 4).

Otherwise, it is set to true by ShadowSearch. Line 6 ensures that NextSuffix(w,v) is
called only if ShadowSearch(v,w,v,w) detected prefix-suffix conflicts between str(w)

and displayable entities represented by descendants of v in PET(S, v) with respect
to str(v) (Lemma 3.2.7).

For each descendant q of vertex x in PET(S, v), procedure ShadowSearch(v,w,x,y)

computes all prefix-suffix conflicts between str(w) and str(q) with respect to str(v).
y represents Pshadow(w, v, x). We will show that all calls to ShadowSearch main-

tain the invariant (which is referred to as the image invariant hereafter) that y =
Pimage(w, v, x) ≠ nil. Notice that the invariant holds when ShadowSearch is called

from NextSuffix as w = Pimage(w, v, v). The for statement in line 1 examines each
prefix extension edge leaving x. Lines 3 to 28 compute all prefix-suffix conflicts be-

tween str(w) and displayable entities represented by vertices in PET(S,z), where
z is the vertex on which the prefix extension edge from x is incident. The truth of
the condition in the for statement of line 1, line 4, and the truth of the condition
inside the if statement of line 5 establish that the conditions of Lemma 3.2.8 are
satisfied prior to the execution of lines 8 and 9. The truth of the comment in line 8


















Procedure ShadowSearch(v, w, x, y);
1 for each prefix extension edge e = < x, z > do
2 {There can only be one prefix extension edge from x to z}
3 begin
4 fc := first character in label(e);
5 if there is a right extension edge, f = < y, u >, whose label starts with fc
6 then
7 begin
8 {u = Pshadow(w, v, z)}
9 ListConflicts(u, z, w);
10 distance := 0; done := false;
11 while (not done) and (|label(f)| > |label(e)| + distance) do
12 begin
13 distance := distance + |label(e)|;
14 nc := (distance + 1)th character in label(f);
15 if there is a prefix extension edge < z, r > whose label starts with nc
16 then
17 begin
18 e := < z, r >; z := r;
19 {u = Pshadow(w, v, z)}
20 ListConflicts(u, z, w);
21 end
22 else
23 done := true;
24 end
25 if (not done) then
26 ShadowSearch(v, w, z, u);
27 exist := true;
28 end
29 end


Figure 3.9. Algorithm for shadow search












and the correctness of line 9 are established by Lemma 3.2.8. Procedure ListCon-
flicts of line 9 lists all prefix-suffix conflicts between str(w) and str(z) with respect
to str(v). Similarly, the truth of the condition inside the while statement of line

11, lines 13 and 14, and the truth of the condition inside the if statement of line 15
establish that the conditions of Lemma 3.2.9 are satisfied prior to the execution of
lines 18-20. Again, the correctness of lines 18-20 is established by Lemma 3.2.9.
If done remains false on exiting the while loop, the condition of the if statement of
line 15 must have evaluated to true. Consequently, the conditions of Lemma 3.2.9

apply. Further, since the while loop of line 11 terminated, the additional condition of
Lemma 3.2.10 is also satisfied. Hence, from Lemma 3.2.10, u = Pimage(w, v, z) and
the image invariant for the recursive call to ShadowSearch(v, w, z, u) is maintained.

Line 27 sets the global variable exist to true since the execution of the then clause
of the if statement of line 5 ensures that at least one prefix-suffix conflict is reported
by ShadowSearch(v, w, v, w) (Lemmas 3.2.6 and 3.2.8). exist remains false only if the
then clause of the if statement (line 5) is never executed.

Theorem 3.2.3 Algorithm PrefixSuffixConflicts(S) computes all prefix-suffix con-

flicts of S in O(n + k_p) space and time, which is optimal.

Proof: Line 1 of Algorithm PrefixSuffixConflicts takes O(n) time [9]. The cost of lines 2 and 3 without including the execution time of NextSuffix(v, v) is O(n). Next, we show that NextSuffix(v, v) takes O(k_v) time, where k_v is the number of prefix-suffix conflicts with respect to v (so, k_v represents the size of the output of NextSuffix(v, v)). Assume that NextSuffix is invoked p times in the computation. Let S_T be the set of invocations of NextSuffix which do not call NextSuffix recursively. Let p_T = |S_T|. Let S_F be the set of invocations of NextSuffix which do call NextSuffix recursively. Let p_F = |S_F|. Each element of S_F can directly call at most |Σ| elements of S_T. So, p_T ≤ |Σ|p_F. From lines 4-6 in NextSuffix(current, v), each element of S_F yields at least one distinct conflict from its call to ShadowSearch. Thus, p_F ≤ k_v. So, p = p_T + p_F ≤ (|Σ| + 1)k_v = O(k_v). The cost of execution of NextSuffix without including the costs of recursive calls to NextSuffix and ShadowSearch is O(|Σ|) (= O(1)) as there are at most |Σ| suffix extension edges leaving a vertex. So, the total cost of execution of all invocations of NextSuffix spawned by NextSuffix(v, v) without including the cost of recursive calls to ShadowSearch is O(p|Σ|) = O(k_v).

Next, we consider the calls to ShadowSearch that were spawned by NextSuffix(v, v). Let T_A be the set of invocations of ShadowSearch which do not call ShadowSearch recursively. Let q_A = |T_A|. Let T_B be the set of invocations of ShadowSearch which do call ShadowSearch recursively. Let q_B = |T_B|. Let q = q_A + q_B. We have q_A ≤ |Σ|q_B + |Σ|p. So, q = q_A + q_B ≤ (|Σ| + 1)q_B + |Σ|p. From the algorithm, each element of T_B yields a distinct conflict. So, q_B ≤ k_v. So, q ≤ (|Σ| + 1)q_B + |Σ|p = O(k_v). The cost of execution of a single call to ShadowSearch without including the cost of executing recursive calls to ShadowSearch is O(1) + O(w) + O(complexity of ListConflicts of line 9) + O(sum, over the w iterations of the while loop, of the complexity of ListConflicts of line 20), where w denotes the number of iterations of the while loop. The complexity of ListConflicts is proportional to the number of conflicts it reports. Since ListConflicts always yields at least one distinct conflict, the complexity of ShadowSearch is O(1 + |output|). Summing over all calls to ShadowSearch spawned by NextSuffix(v, v), we obtain O(q + k_v) = O(k_v). Thus, the total complexity of Algorithm PrefixSuffixConflicts(S) is O(n + k_p). ∎












3.2.4 Alternative Algorithms

In this section, an algorithm for computing all conflicts (i.e., both subword and

prefix-suffix conflicts) is presented. This solution is relatively simple and has com-

petitive run times. However, it lacks the flexibility required to efficiently solve many

of the problems listed in Sections 3.3, 3.4, and 3.5. The algorithm (Algorithm

AllConflicts(S)) is presented in Figure 3.10. Step 1 computes a list of all occurrences

of all displayable entities in S. This list is obtained by first computing the lists of

occurrences corresponding to each vertex of V(S) (except the source and the sink)

and then concatenating these lists. Each occurrence is represented by its start and

end positions. Step 2 sorts the list of occurrences obtained in Step 1 in increasing

order of their start positions. Occurrences with the same start positions are sorted

in decreasing order of their end positions. This is done using radix sort. Step 3

computes for the i'th occurrence occi all its prefix-suffix conflicts with occurrences whose start positions are greater than its own, and all its subword conflicts with its subwords. occi is checked against occi+1, occi+2, ..., occi+c for a conflict. Here, c is the smallest integer for which there is no conflict between occi and occi+c. The start position of occi+c is greater than the end position of occi. The start position of occj (j > i + c) will also be greater than the end position of occi, since the list of occurrences was sorted in increasing order of start positions. The start positions of occi+1, ..., occi+c-1 are greater than or equal to the start position of occi but are less than or equal to its end position. Those occurrences among {occi+1, ..., occi+c-1} whose start positions are equal to that of occi have end positions that are smaller (since occurrences with the same start position are sorted in decreasing order of their end positions). The remaining conflicts of occi (i.e., subword conflicts with its superwords, prefix-suffix conflicts with occurrences whose start positions are less than












that of occi) have already been computed in earlier iterations of the for statement in
Algorithm AllConflicts(S).
For example, let the input to Step 3 be the following list of ordered pairs: ((1,6), (1,3), (1,1), (2,2), (3,8), (3,5), (4,6), (5,8), (6,10)), where the first element of the ordered pair denotes the start position and the second element denotes the end position of the occurrence. Consider the occurrence (3,5). Its conflicts with (1,6), (1,3), and (3,8) are computed in iterations 1, 2, and 5 of the for loop. Its conflicts with (4,6) and (5,8) are computed in iteration 6 of the for loop.
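The sweep of Steps 2 and 3 can also be written out directly. The following Python fragment is a minimal sketch rather than the thesis implementation: it assumes that the occurrence list of Step 1 is already available as (start, end) pairs and it mirrors the classification used in Figure 3.10.

def sweep_conflicts(occurrences):
    # Step 2: sort by start position (increasing), ties broken by end position (decreasing).
    occ = sorted(occurrences, key=lambda p: (p[0], -p[1]))
    conflicts = []
    # Step 3: compare each occurrence with the following ones for as long as they overlap it.
    for i in range(len(occ)):
        start_i, end_i = occ[i]
        j = i + 1
        while j < len(occ) and end_i >= occ[j][0]:
            if end_i > occ[j][1]:
                conflicts.append(("subword", occ[i], occ[j]))        # occ[j] lies inside occ[i]
            else:
                conflicts.append(("prefix-suffix", occ[i], occ[j]))
            j += 1
    return conflicts

# The worked example above:
example = [(1, 6), (1, 3), (1, 1), (2, 2), (3, 8), (3, 5), (4, 6), (5, 8), (6, 10)]
print(sweep_conflicts(example))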

Theorem 3.2.4 Algorithm AllConflicts(S) takes O(n + k) time, where k = k_p + k_s.

Proof: Step 1 takes O(n + o) time, where o is the number of occurrences of displayable entities of S. Step 2 also takes O(n + o) time, since o elements are to be sorted using radix sort with n buckets. Step 3 takes O(o + k) time (the for loop executes O(o) times; each iteration of the while loop yields a distinct conflict). So, the total complexity is O(n + o + k).

We now show that o = O(n + k). Let o1 be the number of occurrences not involved in a conflict. Then o1 ≤ n. Let o2 be the number of occurrences involved in at least one conflict. A single conflict occurs between two occurrences. So 2k ≥ o2. So, o = o1 + o2 ≤ n + 2k = O(n + k). ∎


Algorithm AllConflicts(S) can be modified so that the size of the output is k_p plus the size of the compact representation of the subword conflicts. This may be achieved by checking whether an occurrence is the first representative of its pattern in the for loop of Step 3. The subword conflicts are then only reported for the first occurrence of the pattern. However, the time complexity of Algorithm AllConflicts(S) remains O(n + k). In this sense, it is suboptimal.























Algorithm AllConflicts(S)
Step 1: Obtain a list of all occurrences of all displayable entities in S. This list is obtained
by first computing the lists of occurrences corresponding to each vertex of the csdawg
(except the source and the sink) and then concatenating these lists.
Step 2: Sort the list of occurrences using the start positions of the occurrences as the
primary key (increasing order) and the end position as the secondary key (decreasing order).
This is done using radix sort.
Step 3:

for i := 1 to (number of occurrences) do
begin
j := i + 1;
while (lastpos(occi) ≥ firstpos(occj)) do
begin
if (lastpos(occi) > lastpos(occj))
then occi is a superword of occj
else (occi, occj) have a prefix-suffix conflict;
j := j + 1;
end;
end;


Figure 3.10. A simple algorithm for computing conflicts












3.3 Size Restricted Queries

Experimental data show that random strings contain a large number of displayable

entities of small length. In most applications, small displayable entities are less

interesting than large ones. Hence, it is useful to list only those displayable entities

whose lengths are greater than some integer k. Similarly, it is useful to report exactly

those conflicts in which the conflicting displayable entities have length greater than

k. This gives rise to the following problems:

P8: List all occurrences of displayable entities whose lengths are greater than k.

P9: Compute all prefix-suffix conflicts involving displayable entities of length greater

than k.

P10: Compute all subword conflicts involving displayable entities of length greater

than k.

The overlap of a conflict is defined as the string common to the conflicting dis-

playable entities. The overlap of a subword conflict is the subword displayable entity.

The overlap of a prefix-suffix conflict is its intersection. The size of a conflict is

the length of the overlap. An alternative formulation of the size restricted problem

which also seeks to achieve the goal outlined above is based on reporting only those

conflicts whose size is greater than k. This formulation of the problem is particularly

relevant when the conflicts are of more interest than the displayable entities. It also

establishes that all conflicting displayable entities reported have size greater than k.

We have the following problems:

P11: Obtain all prefix-suffix conflicts of size greater than some integer k.

P12: Obtain all subword conflicts of size greater than some integer k.
P8 is solved optimally by invoking LinearOccurrences(S, v) for each vertex v in

V(S), where |str(v)| > k. A combined solution to P9 and P10 uses the approach












of Section 3.2.4. The only modification to the algorithm of Figure 3.10 is in Step 1
which now becomes:

Obtain all occurrences of displayable entities whose lengths are greater than k.
The resulting algorithm is optimal with respect to the expanded representation of
subword conflicts. However, as with the general problem, it is not possible to obtain
separate optimal solutions to P9 and P10 by using the techniques of Section 3.2.4.

An optimal solution to P11 is obtained by executing line 3 of Algorithm PrefixSuffixConflicts(S) of Figure 3.8 for only those vertices v in V(S) which have |str(v)| > k. An optimal solution to P12 is obtained by the following modifications to Algorithm SubwordConflicts of Figure 3.4:
(i) Right extension or suffix extension edges < u, v >, where |str(u)| < k and |str(v)| > k, are marked "disabled."
(ii) The definition of SG(S, v) is modified so that SG(S, v), for v ∈ V(S), is defined as the subgraph of CSD(S) which consists of the set of vertices SV(S, v) ⊆ V(S) which represent displayable entities of length greater than k that are subwords of str(v) and the set of all re-edges and suffix extension edges that connect any pair of vertices in SV(S, v).
(iii) Algorithm SubwordConflicts(S) is modified. The modified algorithm is shown
in Figure 3.11.
We note that P10 and P12 are identical, since the overlap of a subword conflict is the same as the subword displayable entity.
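In the same spirit, the combined P9/P10 solution described above only changes which occurrences enter the sweep. A hypothetical filtering step (assuming that the occurrence list and the threshold k are given, and reusing the sweep_conflicts sketch shown earlier) might look as follows; the conflict detection itself is unchanged.

def size_restricted_conflicts(occurrences, k):
    # Modified Step 1: keep only occurrences of displayable entities whose length exceeds k,
    # then run the unmodified sweep of Figure 3.10.
    long_occ = [(s, e) for (s, e) in occurrences if e - s + 1 > k]
    return sweep_conflicts(long_occ)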

3.4 Pattern Oriented Queries

These queries are useful in applications where the fact that two patterns have a
conflict is more important than the number and location of conflicts. The following
problems arise as a result:













Algorithm SubwordConflicts(S,k)
1 begin
2 for each vertex v in CSD(S) do
3 v.subword := false;
4 for each vertex v in CSD(S) such that |str(v)| > k do
5 for all vertices u such that a non-disabled right or suffix extension edge
< u, v > exists do
6 v.subword := true;
7 for each vertex v in CSD(S) such that v ≠ sink and v.subword is true do
8 GetSubwords(S, v);
9 end




Figure 3.11. Modified version of algorithm SubwordConflicts



P13: List all pairs of displayable entities which have subword conflicts.

P14: List all triplets of displayable entities (D1, D2, Dm) such that there is a prefix-suffix conflict between D1 and D2 with respect to Dm.

P15: Same as P13, but size restricted as in P12.

P16: Same as P14, but size restricted as in P11.

P13 may be solved optimally by reporting for each vertex v in V(S), where v

does not represent the sink of CSD(S), the subword displayable entities of str(v),

if any. This is accomplished by reporting str(w), for each vertex w, w ≠ source, in

SG(S, v). P14 may also be solved optimally by modifying procedure ListConflicts of

Figure 3.9 so that it reports the conflicting displayable entities and their intersection.

P15 and P16 may also be solved by making similar modifications to the algorithms

of the previous section.












3.5 Statistical Queries

These queries are useful when conclusions are to be drawn from the data based

on statistical facts. Let f(D) denote the frequency (number of occurrences) of displayable entity D in the string and rf(D1, D2) the number of occurrences of displayable entity D1 in displayable entity D2. The following queries may then be defined.

P17: For each pair of displayable entities, D1 and D2, involved in a subword conflict (D1 is the subword of D2), obtain p(D1, D2) = (number of occurrences of D1 which occur as subwords of D2) / f(D1).

P18: For each pair of displayable entities, D1 and D2, involved in a prefix-suffix conflict, obtain q(D1, D2) = (number of occurrences of D1 which have prefix-suffix conflicts with D2) / f(D1).

If p(D1, D2) or q(D1, D2) is greater than a statistically determined threshold, then the following could be said with some confidence: Presence of D1 implies Presence of D2.

Let psf(D1, D2, Dm) denote the number of prefix-suffix conflicts between D1 and D2 with respect to Dm and psf(D1, D2) the number of prefix-suffix conflicts between D1 and D2.

We can approximate p(D1, D2) by rf(D1, D2) · f(D2)/f(D1). The two quantities are identical unless a single occurrence of D1 is a subword of two or more distinct occurrences of D2. Similarly, we can approximate q(D1, D2) by psf(D1, D2)/f(D1). The two quantities are identical unless a single occurrence of D1 has prefix-suffix conflicts with two or more distinct occurrences of D2. f(D1) can be computed for all displayable entities in CSD(S) in O(n) time by a single traversal of CSD(S) in













Procedure GetSubwords(S, v)
1 begin
2 rf(str(v), str(v)) := 1;
3 SetUp(v);
4 SetSuffixes(v);
5 for each vertex x (≠ source), in reverse topological order of SG(S, v) do
6 begin
7 if str(x) is a suffix of str(v) then rf(str(x), str(v)) := 1
8 else rf(str(x), str(v)) := 0;
9 for each w in SG(S, v) on which an re-edge e from x is incident do
10 rf(str(x), str(v)) := rf(str(x), str(v)) + rf(str(w), str(v));
11 output(rf(str(x), str(v)));
12 end;
13 end



Figure 3.12. Modification to GetSubwords(S, v) for computing relative frequencies



reverse topological order. rf(D1, D2) may be computed optimally for all D1, D2 by modifying procedure GetSubwords(S, v) as shown in Figure 3.12.

psf(D1, D2, Dm) is computed optimally, for all D1, D2, and Dm, where D1 has a prefix-suffix conflict with D2 with respect to Dm, by modifying ListConflicts(u, z, w) of Figure 3.9 so that it returns f(str(u)), since this is the number of conflicts between str(w) and str(z) with respect to str(v). psf(D1, D2) is calculated by summing psf(D1, D2, Dm) over all intersections Dm of prefix-suffix conflicts between D1 and D2. p(D1, D2) and q(D1, D2) may be computed by simple modifications to the algorithms used to compute rf(D1, D2) and psf(D1, D2). These problems may be solved under the size restrictions of P11 and P12 by modifications similar to those made in Section 3.3.
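By way of illustration, once the counts are available the ratios are simple to form. The Python sketch below assumes that f, rf, and psf are dictionaries filled in by the modified traversals described above, and the 0.9 threshold is an arbitrary placeholder rather than a statistically determined value.

def approx_p(rf, f, d1, d2):
    # p(D1, D2) is approximated by rf(D1, D2) * f(D2) / f(D1): the fraction of D1's
    # occurrences that lie inside some occurrence of D2.
    return rf[(d1, d2)] * f[d2] / f[d1]

def approx_q(psf, f, d1, d2):
    # q(D1, D2) is approximated by psf(D1, D2) / f(D1): the fraction of D1's
    # occurrences that have a prefix-suffix conflict with D2.
    return psf[(d1, d2)] / f[d1]

def suggests_implication(ratio, threshold=0.9):
    # "Presence of D1 implies presence of D2" once the ratio clears the chosen threshold.
    return ratio > threshold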












3.6 Experimental Results

Algorithms SubwordConflicts(S) and PrefixSuffixConflicts(S) were implemented

on a SUN SPARCstation 1 in GNU C++. 50 randomly generated strings of lengths

ranging from 100 to 2000 from alphabets whose size ranged from 2 to 50 were input to

our algorithms. Statistical information such as the number of vertices in the csdawg,

the number of prefix-suffix and subword conflicts, etc., was obtained. The run times of

the algorithms were also recorded.

(i) Figure 3.13 shows Number of prefix-suffix conflicts vs String size. There is one

curve for each alphabet size. The plot illustrates that the number of prefix-suffix

conflicts increases with string size and decreases with alphabet size. The graph for

subword conflicts is similar.

(ii) Figure 3.14 shows Time per prefix-suffix conflict (p/s) vs String size. There is

one curve for each alphabet size. It illustrates that the time per conflict generally

decreases with increasing string size and increases with alphabet size. The graph for

subword conflicts is similar.

(iii) The factor by which the compact representation of subword conflicts is smaller

than the fully expanded representation varies from 2 to 9. It increases with string

size and decreases with alphabet size.

(iv) Table 3.1 shows the size of the largest displayable entity for each combination of

alphabet size and string size. It shows that only displayable entities of small lengths

occur in random strings (in practice we would like to be able to distinguish between

displayable entities that occur randomly and those that do not in a given string. This

can be done by selecting those displayable entities whose frequency is large compared

to other displayable entities of the same length in a random string).














(plot: Number of Prefix-Suffix Conflicts vs. Size of String, 100 to 2000; one curve per alphabet size a)

Figure 3.13. Graph of Number of Prefix-Suffix Conflicts vs String Size







Table 3.1. Lengths of Largest Displayable Entity for Random Strings


Size of Size of String
Alphabet 100 200 500 1000 2000
2 11 12 15 18 19
5 5 6 7 8 9
10 3 4 5 6 6
15 3 3 4 5 5
20 3 3 4 4 5
25 2 3 3 4 4
50 2 2 3 3 4






















(plot: Time per Prefix-Suffix Conflict, 100 to 800, vs. Size of String, 100 to 2000; one curve per alphabet size a, with the curves for a = 25, 20, 5, and 2 labeled)
Figure 3.14. Graph of Time per Prefix-Suffix Conflict vs String Size












In another experiment, Algorithms SubwordConflicts(S) (Section 3.2.2), Pre-

fixSuffixConflicts(S) (Section 3.2.3), and AllConflicts(S) (Section 3.2.4) were pro-
grammed in GNU C++ and run on a SUN SPARCstation 1. For test data we used
120 randomly generated strings. The alphabet size was chosen to be one of {5, 15, 25,

35} and the string length was 500, 1000, or 2000. The test set of strings consisted of
10 different strings for each of the 12 possible combinations of input size and alphabet
size. For each of these combinations, the average run times for the 10 strings are given
in Tables 3.2-3.5.
Table 3.2 gives the average times for computing all conflicts by combining al-

gorithms PrefixSuffixConflicts(S) and SubwordConflicts(S). Table 3.3 gives the average times for computing all prefix-suffix conflicts using Algorithm PrefixSuffixConflicts(S). Table 3.4 gives the average times for computing all pattern-restricted prefix-suffix conflicts (problem P14 of Section 3.4) by modifying Algorithm PrefixSuffixConflicts(S) as described in Section 3.4. Table 3.5 represents the average times
for Algorithm AllConflicts(S).
Tables 3.2 to 3.4 represent the theoretically superior solutions to the corresponding
problems, while Table 3.5 represents Algorithm AllConflicts(S) which provides a

simpler, but suboptimal, solution to the three problems. In all cases the time for
constructing csdawgs and writing the results to a file were not included as these

steps are common to all the solutions.
The results show that the suboptimal Algorithm AllConflicts(S) is superior to
the optimal solution for computing all conflicts or all prefix-suffix conflicts for a ran-
domly generated string. This is due to the simplicity of Algorithm AllConflicts(S)

and the fact that the number of conflicts in a randomly generated string is small.
However, on a string such as a10 which represents the worst case scenario in terms

of the number of conflicts reported, the following run times were obtained:













Table 3.2. Time in ms for computing all conflicts using SubwordConflicts(S) and PrefixSuffixConflicts(S)


All conflicts, optimal algorithm: 14,190 ms

All prefix-suffix conflicts, optimal algorithm: 10,840 ms

All pattern restricted prefix-suffix conflicts, optimal algorithm: 5,000 ms

Algorithm AllConflicts(S): 26,942 ms


The experimental results using random strings also show that, as expected, the op-

timal algorithm fares better than Algorithm AllConflicts(S) for the more restricted
problem of computing pattern oriented prefix-suffix conflicts.

We conclude that Algorithm AllConflicts(S) should be used for the more gen-

eral problems of computing conflicts while the optimal solutions should be used for

the restricted versions. Hence, Algorithm AllConflicts(S) should be used in an au-

tomatic environment, while the optimal solutions should be used in interactive or

semi-automatic environments.

3.7 Display Algorithms

A list of all occurrences of a displayable entity may be obtained from the csdawg

data structure described earlier. A list of all conflicts between displayable entities

may be obtained by operations on the csdawg. This information is then used to

assign numeric weights to each occurrence of each displayable entity.


Size of Size of String
Alphabet 500 1000 2000
5 410 989 2722
15 292 603 1300
25 315 671 1485
35 234 791 1740














Table 3.3. Time in ms for computing all prefix-suffix conflicts using PrefixSuffixCon-
flicts(S)


Table 3.4. Time in ms for computing all pattern restricted prefix-suffix conflicts using
the optimal algorithm


Table 3.5. Time in ms for Algorithm AllConflicts(S)


Size of Size of String
Alphabet 500 1000 2000
5 247 730 1873
15 219 454 989
25 255 522 1179
35 186 648 1370


Size of Size of String
Alphabet 500 1000 2000
5 163 399 1058
15 103 231 550
25 91 267 628
35 61 226 735


Size of Size of String
Alphabet 500 1000 2000
5 203 551 1367
15 217 409 897
25 227 400 887
35 145 478 994












In this section we shall discuss algorithms to implement some of the refined display

models of Section 2.3. Specifically, we shall consider models 1, 2(b), and 3 under the

Automatic mode (i.e., problems P4, P6, and P7 of Section 2.4). Our algorithms also

apply to the Semi Automatic mode. Problem P4 can be reduced to the single-pair longest path problem in a directed acyclic graph as follows: let vertex Vi, 1 ≤ i < n, correspond to the position between the i'th and (i+1)'th characters in the string. V0 corresponds to the position preceding the first character in the string. Vn corresponds to the position following the last character in the string. For each occurrence Sij of each displayable entity, we create an edge from Vi-1 to Vj. The weight associated with this edge is exactly the weight of the occurrence it represents. Finally, for each pair (Vi, Vi+1), 0 ≤ i < n, of vertices such that an edge from Vi to Vi+1 does not already exist, we create an edge from Vi to Vi+1 of weight 0.

Figure 3.15 shows the directed acyclic graph corresponding to the string S = abcicdefcdegabchabcde of Figure 2.2, assuming that the weights corresponding to each occurrence of abc, cde, and c are 4, 3, and 2, respectively. The longest path from V0 to Vn in the dag is V0 → V3 → V4 → V7 → V8 → V11 → V12 → V15 → V16 → V19 → V20 → V21. All edges on this path with non-zero weight represent occurrences of displayable entities that are to be highlighted. Here V0 → V3, V12 → V15, and V16 → V19 correspond to occurrences < 1,3 >, < 13,15 >, and < 17,19 > of abc, and V4 → V7 and V8 → V11 correspond to occurrences < 5,7 > and < 9,11 > of cde. The length of the longest path (here, 18) represents the total weight of the display.

Algorithm A4 of Figure 3.16 solves problem P4. A[0..n] is an array of integers, where A[i] represents the longest path from V0 to Vi detected up to that point in time. All elements of A are initialized to 0. The auxiliary array T[1..n] stores the occurrences that have been chosen for display. The array delist contains all the occurrences of all the displayable entities in the string. Each element of delist


























Figure 3.15. Dag corresponding to abcicdefcdegabchabcde



contains three fields: start, end, and weight which represent the start position, the end

position, and the numeric weight of the occurrence, respectively. It is assumed that

delist is sorted in increasing order of end. The vertices are processed in topological

order (here, V0, V1, ..., Vn). When a vertex Vj is being processed, each vertex Vi (i < j) preceding it has associated with it the cost of the longest path from V0 to Vi. The cost of the longest path from V0 to Vj is then determined by examining each of the incoming edges to Vj.

The complexity of Algorithm A4 is O(n + e), where e represents the size of delist. So, Algorithm A4 is optimal. Note that sorting delist on end position can also be accomplished in O(n + e) time using radix sort.
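For concreteness, the same dynamic program can be written in a few lines of Python. This is a sketch of Algorithm A4 rather than a transcription of Figure 3.16; delist is assumed to be given as (start, end, weight) triples with 1-based positions, sorted by end position.

def best_highlighting(n, delist):
    A = [0] * (n + 1)           # A[i] = weight of the longest path from V0 to Vi
    chosen = [None] * (n + 1)   # occurrence chosen to end at position i, if any
    j = 0
    for i in range(1, n + 1):
        A[i] = A[i - 1]         # zero-weight edge from V(i-1) to Vi
        while j < len(delist) and delist[j][1] == i:
            start, end, weight = delist[j]
            if A[start - 1] + weight > A[i]:
                A[i] = A[start - 1] + weight
                chosen[i] = delist[j]
            j += 1
    return A[n], chosen

# Occurrences of abc (weight 4), cde (weight 3) and c (weight 2) in the example string:
occ = sorted([(1, 3, 4), (13, 15, 4), (17, 19, 4), (5, 7, 3), (9, 11, 3), (19, 21, 3),
              (3, 3, 2), (5, 5, 2), (9, 9, 2), (15, 15, 2), (19, 19, 2)],
             key=lambda t: t[1])
print(best_highlighting(21, occ)[0])   # 18, the total weight of the display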

Problem P6 of Section 2.4 is solved by Algorithm A6 of Figure 3.17 using the

greedy method. The array endpoints[1..2e] is a list of both the start positions and

end positions of all occurrences of all displayable entities in the string. Each element of

endpoints contains three fields: position, type, and id. position contains the position of

























Algorithm A4
begin
A[0] := 0; j := 1;
for i := 1 to n do
begin
A[i] := A[i-1]; {initialize A[i] via the zero-weight edge from V(i-1)}
while (delist[j].end = i) do {determine longest path ending at Vi}
begin
if A[i] < A[delist[j].start-1] + delist[j].weight
then
begin
A[i] := A[delist[j].start-1] + delist[j].weight;
T[i] := j;
end;
j := j + 1;
end;
end;
end.


Figure 3.16. Algorithm for P4












the particular endpoint in the string; type indicates whether the endpoint is a "start"

position or an "end" position. id is an integer which uniquely identifies the occurrence

corresponding to the endpoint. It is assumed that endpoints is sorted in increasing

order on primary key position and secondary key type ("start" < "end"). The variable

current keeps track of the number of copies of the string that are currently "active."

The variable max keeps track of the maximum number of copies of the string required

so far. CurrentLine is the particular copy of the string to which the new occurrence

is assigned. LineStack is a stack that contains line numbers (or partition numbers)

which are currently available. On completion of the algorithm, line[i] contains the

line or the particular copy of the string in which the occurrence with id i is to be

highlighted for 1 < i < e.

It can be seen from the algorithm that

(1) At least one position in the string is covered by fmax occurrences. Thus, fmax is the smallest number of lines required to highlight all the displayable entities.

(2) The final value of max represents the number of partitions of S required to highlight all displayable entities. Since the final value of max is fmax, Algorithm A6 is correct.

Algorithm A6 consumes O(n + e) time, which is optimal. Note that sorting

endpoints can also be accomplished in O(n + e) time, if radix sort is used.
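The same greedy scan can be written as a self-contained Python sketch; a list plays the role of LineStack, and occurrences are assumed to be given as (start, end) pairs. The current/max bookkeeping of Figure 3.17 is replaced by testing whether the stack of free lines is empty, which is equivalent.

def assign_lines(occurrences):
    # Build the endpoint list; at equal positions a "start" (type 0) sorts before an "end" (type 1).
    endpoints = []
    for oid, (s, e) in enumerate(occurrences):
        endpoints.append((s, 0, oid))
        endpoints.append((e, 1, oid))
    endpoints.sort()
    line = [None] * len(occurrences)
    free = []                     # stack of line (partition) numbers currently available
    line_count = 0
    for pos, typ, oid in endpoints:
        if typ == 0:              # a new occurrence becomes active
            if free:              # reuse an available copy of the string
                line[oid] = free.pop()
            else:                 # open a new copy
                line_count += 1
                line[oid] = line_count
        else:                     # the occurrence ends; release its line
            free.append(line[oid])
    return line, line_count       # line_count is fmax

print(assign_lines([(1, 6), (3, 8), (5, 8), (9, 11)]))   # ([1, 2, 3, 3], 3)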

We outline two solutions to P7. In the first, we assume that all occurrences of

the same displayable entity are assigned the same numeric weight. In the second, we

do not make this assumption.

Algorithm A7(a) of Figure 3.18 solves the first version of P7. The second

version of P7 may be solved by executing steps a-c for each occurrence of each

displayable entity as shown in Algorithm A7(b) of Figure 3.19. Both solutions

involve a traversal of CSD(S) in topological order. For each occurrence, an optimal



















Algorithm A6
begin
max := 0; current := 0;
for i := 1 to 2e do
begin
x := endpoints[i];
if (x.type = "start") then
begin {assign line numbers to occurrences}
current := current + 1;
if (current <= max) then {use available copy of string}
begin
CurrentLine := top(LineStack);
pop(LineStack);
end
else
begin {increase max}
max := max + 1;
CurrentLine := max;
end;
line[x.id] := CurrentLine;
end
else {x.type = "end"}
begin
current := current - 1;
push(LineStack, line[x.id]);
end;
end;
fmax := max;
end.


Figure 3.17. Algorithm for P6













Algorithm A7(a)
begin
for each vertex v of CSD(S) in topological order do
begin
Step a. Compute the relative positions of all subword displayable entities
in a single instance of str(v) using Procedure GetSubwords(S,v).
Step b. Choose a mutually non-overlapping subset from the set of subwords
of str(v) (obtained in step (a) above) so that the sum of their weights is
maximized. This is achieved by an algorithm similar to A4.
Step c. Reset the numeric weight of str(v) by adding to it the
total weight of the configuration obtained in step (b).
end;
end.



Figure 3.18. Algorithm for P7, same weights




selection of its subwords is chosen. The weight of the occurrence is then obtained by adding the sum of the weights of the chosen subwords to its weight. This is done because an occurrence is highlighted along with the chosen subwords. Algorithm A7(a) consumes O(n^3) time, while Algorithm A7(b) consumes O(n^4) time.



























Algorithm A7(b)
begin
for each vertex v of CSD(S) in topological order do
for each occurrence < i,j > of str(v) do
begin
Step a. Compute the occurrences of all subword displayable entities in < i,j >.
Step b. Choose a mutually non-overlapping subset from the set of occurrences
(obtained in Step (a) above) so that the sum of their weights is maximized.
This is achieved by an algorithm similar to A4.
Step c. Reset the numeric weight of < i,j > by adding to it the total weight
of the configuration obtained in Step (b).
end;
end.


Figure 3.19. Algorithm for P7, different weights

















CHAPTER 4
CIRCULAR STRING VISUALIZATION

4.1 Introduction

The circular string data type is used to represent a number of objects such as

circular genomes, polygons, and closed curves. Research in molecular biology involves

the identification of recurring patterns in data and hypothesizing about their causes
and/or effects [4, 2]. Research in pattern recognition and computer vision involves

detecting similarities within an object or between objects [11]. We have already listed

in Chapter 2 a number of queries that our visualization model supports. In Chapter 3,
we developed efficient (mostly optimal) algorithms for some of these queries for linear

strings. These algorithms performed operations and traversals on csdawgs of the

linear strings.

One approach for extending these techniques to circular strings is to arbitrarily

break the circular string at some point so that it becomes a linear string. Techniques

for linear strings may then be applied to it. However, this has the disadvantage

that some significant patterns in the circular string may be lost because the patterns

were broken when linearizing the string. Indeed, this would defeat the purpose of

representing objects by circular strings. This particular problem may be overcome
by working with the csdawg corresponding to the concatenation of the linearized cir-

cular string with itself. However, the resulting data structure contains a number of
extraneous nodes. The existence of these nodes increases the asymptotic complexity












and actual running times of some of our algorithms, and increases the storage re-

quirement of the data structure. Moreover, algorithms for linear strings need to be

substantially modified for use with this data structure.

A circular string data structure, the polygon structure graph, which is an exten-

sion of suffix trees to circular strings already exists [11]. However, the suffix tree is not

as powerful as the csdawg and cannot be used to solve some of the problems that the

csdawg can solve. In particular, our queries required the csdawg for asymptotically

and practically efficient algorithms.

In this chapter, we propose a csdawg for circular strings which is obtained by mak-

ing simple modifications to the csdawg for linear strings. This new data structure

does not contain extraneous vertices and consequently avoids the disadvantages men-

tioned above. Algorithms which make use of the csdawg for linear strings can then

be extended to circular strings with trivial modifications. The extended algorithms

continue to have the same time and space complexities. Moreover, the extensions

take the form of postprocessing or preprocessing steps which are simple to add on

to a system built for linear strings, particularly in an object-oriented language. In

particular, algorithms NoConflicts(S), SubwordConflicts(S), PrefixSuffixConflicts(S),

AllConflicts(S) and the solutions outlined for problems P8 to P18 in Chapter 3 for

linear strings can be easily extended to circular strings.

Section 4.2 contains definitions. Section 4.3 describes the construction of a circular

csdawg and the computation of locations of occurrences of displayable entities and

Section 4.4 introduces the notion of display conflicts. Finally, Section 4.5 mentions

some applications for the visualization and analysis of circular strings.

























Figure 4.1. Circular string


4.2 Definitions

Let s denote a circular string of size n consisting of characters from a fixed alpha-

bet E of constant size. Figure 4.1 shows an example circular string of size 8. We shall

represent a circular string by a linear string enclosed in angle brackets "<>" (this distinguishes it from a linear string). The linear string is obtained by traversing the

circular string in clockwise order and listing each element as it is traversed. The start

point of the traversal is chosen arbitrarily. Consequently, there are up to n equiv-

alent representations of s. In the example, s could be represented as ,

, etc.

We characterize the relationship between circular strings and linear strings by

defining the functions linearize and circularize. linearize maps circular strings to linear strings. It is a one-many mapping as a circular string can, in general, be mapped to more than one linear string. For example, linearize(<abcd>) = {abcd, bcda, cdab, dabc}. We will assume that linearize arbitrarily chooses one of the linear strings; for convenience we assume that it chooses the representation obtained by removing the angle brackets "<>." So, linearize(<abcd>) = abcd. circularize maps












linear strings to circular strings. It is a many-one function and represents the inverse

of linearize.
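As a small illustration of the one-many nature of linearize, consider the following throwaway Python sketch; it is not part of the thesis machinery.

def all_linearizations(S):
    # The up-to-n equivalent linear representations of the circular string <S>.
    return {S[i:] + S[:i] for i in range(len(S))}

def linearize(S):
    # The convention adopted in the text: choose the representation obtained by
    # simply removing the angle brackets, i.e. S itself.
    return S

print(sorted(all_linearizations("abcd")))   # ['abcd', 'bcda', 'cdab', 'dabc']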

We use lower case letters to represent circular strings and upper case letters to

represent linear strings. Further, if a lower case letter (say s) is used to represent a

particular circular string, then the corresponding upper case letter (S) is assumed to

be linearize(s).

The definitions of maximal strings and displayable entities for circular strings are

identical to those for linear strings. I.e., a non-null pattern occurring in s is said

to be maximal iff its occurrences are not all preceded by the same character, nor

all followed by the same character. The empty string is always maximal. s may be

thought of as a periodic, infinite string which is neither followed nor preceded by a

letter, and is therefore maximal. A pattern is said to be a displayable entity of s

iff it is non-null, maximal, and occurs at least twice in s. A displayable entity of s

always has length less than n. Thus, s and the empty string are maximal, but not

displayable entities.

The definition of a csdawg for circular strings is also similar to that for linear

strings. I.e., a csdawg, CSD(s) = (V(s), R(s), L(s)) corresponding to s is a directed

acyclic graph defined by a set of vertices V(s), a set R(s) of labeled directed edges

called right extension edges (re-edges), and a set of labeled directed edges L(s) called

left extension edges (le-edges). Each vertex of V(s) represents a substring of s.

Specifically, V(s) consists of a vertex corresponding to each maximal pattern of s.

This consists of a source which represents the empty string λ; a sink which represents

s; and a vertex for each displayable entity of s.

Let str(v) denote the string represented by vertex v for v ∈ V(s). Define the implication imp(s, a) of a string a in s to be the smallest superstring of a in {str(v) | v ∈ V(s)}, if such a superstring exists. Otherwise, imp(s, a) does not exist. Note that











imp(s, a) is always unique (if there are two or more superstrings of a of the same length k in {str(v) | v ∈ V(s)}, it can be shown that there must exist a superstring of a in {str(v) | v ∈ V(s)} with length less than k). Re-edges from v1 (v1 ∈ V(s)) are obtained as follows: for each letter x in Σ, if imp(s, str(v1)x) exists and is equal to str(v2) = βstr(v1)xγ, then there is an re-edge from v1 to v2 with label xγ. If β is the empty string, then the edge is known as a prefix extension edge. Le-edges from v1 (v1 ∈ V(s)) are obtained as follows: for each letter x in Σ, if imp(s, xstr(v1)) exists and is equal to str(v2) = γxstr(v1)β, then there is an le-edge from v1 to v2 with label γx. If β is the empty string, then the edge is known as a suffix extension edge.

The sink represents the periodic infinite string denoted by the circular string s.

The labels of edges incident on the sink are themselves infinite and periodic and may

be represented by their start positions in s. We also associate with CSD(s) the

periodicity p of s. This is the value of |α| for the largest value of k such that S =

α^k. We require the periodicity to answer queries about the locations of displayable

entities if S is itself periodic. Figure 4.11 shows CSD(s) for s = <cabcbab>. Its

periodicity is 7.
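The periodicity just defined can also be checked by brute force, which may help fix intuition; the following small Python sketch is not the csdawg-based computation used in Section 4.3.

def periodicity(S):
    # |alpha| for the largest k such that S = alpha^k (the smallest period length of S).
    n = len(S)
    for p in range(1, n + 1):
        if n % p == 0 and S[:p] * (n // p) == S:
            return p
    return n

print(periodicity("cabcbab"))    # 7: the string is not periodic
print(periodicity("abab" * 3))   # 2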

4.3 Constructing the Csdawg for a Circular String

The csdawg for circular string s is constructed by the algorithm of Figure 4.2. It

is obtained by first constructing the csdawg for the linear string T = SS (recall that

S = linearize(s)). A bit is associated with each re-edge in R(T) indicating whether
it is a prefix extension edge or not. Similarly, a bit is associated with each le-edge in
L(T) to identify suffix extension edges. Two pointers, a suffix pointer and a prefix

pointer are associated with each vertex v in V(T). The suffix (prefix) pointer points
to a vertex w in V(T) such that str(w) is the largest proper suffix (prefix) of str(v)

represented by any vertex in V(T). Note that such a w always exists since the empty
















Algorithm CircularCsdawg(s)
Step 1: Construct CSD(T) for T = SS (S = linearize(s)).

Step 2:
{Determine periodicity of s using Lemma 4.3.1}
vs := vertex representing S in CSD(T);
es := any outgoing edge from vs;
p := |label(es)|;

Step 3(a):
{Identify suffix redundant vertices using Lemma 4.3.3}
v := sink;
while v ≠ source do
begin
v := v.suffix;
if v has exactly one outgoing re-edge then mark v suffix redundant
else exit Step 3(a);
end;

Step 3(b):
{Identify prefix redundant vertices}
{Similar to Step 3(a)}

Step 4:
v := sink;
while (v ≠ source) do
begin
Modify representation of edges from v that are incident on the sink.
case v of
suffix redundant but not prefix redundant: ProcessSuffixRedundant(v);
prefix redundant but not suffix redundant: ProcessPrefixRedundant(v);
suffix redundant and prefix redundant: ProcessBothRedundant(v);
not redundant: {Do nothing};
endcase;
v := NextVertexInReverseTopologicalOrder;
end;




Figure 4.2. Algorithm for constructing the csdawg for a circular string












string is a suffix (prefix) of all strings. Suffix (prefix) pointers are the reverse of suffix

(prefix) extension edges and are derived from them. Figure 4.4 shows CSD(T) =

CSD(SS) for S = cabcbab. The broken edge from vertex c to vertex abc is a suffix

extension edge, while the solid edge from vertex ab to vertex abc is a prefix extension

edge.

In Step 2, we determine the periodicity of s which is equal to the length of the

label on any outgoing edge from the vertex representing S in CSD(T). This equality

is derived in Lemma 4.3.1.

Lemma 4.3.1 Step 2 of Algorithm CircularCsdawg(s) correctly determines the pe-

riodicity of s.

Proof: Let α be the shortest substring of S such that S = α^m, for some integer m. Then, the occurrences of α in T have start positions 1, |α| + 1, 2|α| + 1, ..., (2m - 1)|α| + 1 (if this is not the case, then we can show that there exists a substring β of S such that S = β^k, k > m, resulting in a contradiction). So, CSD(T) takes the form of Figure 4.3. Each vertex representing α^i, 1 ≤ i ≤ 2m - 1, has exactly one le-edge and one re-edge leaving it as shown. All remaining displayable entities of T are subwords of α^2 and are of size less than |α| (if this is not the case, then we can show that there exists a substring β of S such that S = β^k, k > m, resulting in a contradiction). The vertices representing these displayable entities are represented by the box in Figure 4.3. From the figure, it is now easy to see that Step 2 correctly determines the periodicity of s. ∎



Next, in Step 3, suffix and prefix redundant vertices of CSD(T) are identified. A

suffix (prefix) redundant vertex is one that has exactly one re-edge (le-edge) leaving

it. A vertex is said to be redundant if it is either prefix redundant or suffix redundant












(Figure 4.3: the vertices representing α, α^2, ..., α^(2m); each of α, ..., α^(2m-1) has exactly one outgoing left extension edge and one outgoing right extension edge, and a box collects the remaining vertices. Dashed arrows denote left extension edges; solid arrows denote right extension edges.)

Figure 4.3. CSD(α^(2m))



or both. We have, from Lemma 4.3.2, that redundant vertices in CSD(T) are the

only vertices in CSD(T) (except the source and the sink) which do not represent

displayable entities of s. Redundant vertices represent patterns that are maximal in

T specifically because they occur at either end of T and therefore are not followed or

preceded by a letter.

Lemma 4.3.2 A vertex v in V(T) (v ≠ source, v ≠ sink) is not redundant iff str(v)

is a displayable entity of s.
is a displayable entity of s.

Proof: A displayable entity of s must be a displayable entity of T since, by

construction of T, any pattern of size less than n in s has at least one occurrence

in T that is preceded (followed) by any letter that preceded (followed) an occurrence

of the same pattern in s. Further, a displayable entity of s must have at least two

occurrences which are preceded by different letters and at least two occurrences which

are followed by different letters (note that a displayable entity of a circular string,

unlike that of a linear string, is always preceded and followed by a letter). So, from











the definition of a csdawg, a vertex in V(T) corresponding to a displayable entity of

s must have two re-edges and two le-edges and is not redundant.

All displayable entities of T of size ≥ n are redundant (from Figure 4.3). All

non-redundant vertices of T have at least two le-edges and two re-edges. Since they

represent strings of size less than n, and since, by construction, any string of size less

than n in T is a string in s, they must represent displayable entities of s.




In Figure 4.4, vertex c is prefix redundant only, while vertex ab is suffix redundant

only. The vertex representing S is both prefix and suffix redundant since it has one

re-edge and one le-edge leaving it. The fact that Step 3 does, in fact, identify all

redundant vertices is established by Lemma 4.3.3.

Lemma 4.3.3 (a) A vertex v in V(T) will have exactly one re-edge (le-edge) leaving

it only if str(v) is a suffix (prefix) of T.

(b) If a vertex v such that str(v) is a suffix (prefix) of T has more than one re-edge

(le-edge) leaving it, then no vertex w such that str(w) is a suffix (prefix) of str(v)
can be suffix (prefix) redundant.

Proof: (a) Suppose str(v) is not a suffix of T. Then v has at least two re-edges
leaving it, otherwise str(v) would not be maximal in T. But this is a contradiction

and str(v) must be a suffix of T.

(b) Since str(w) is a suffix of str(v), a letter following str(v) must also follow str(w).
So, w must have at least as many re-edges leaving it as v. But v has at least two
re-edges leaving it, so w cannot be suffix redundant. ∎
































--------- = Left Extension Edges
= Right Extension Edges


Figure 4.4. CSD(T) for T=cabcbabcabcbab



Since it is sufficient to examine vertices corresponding to suffixes of T (Lemma 4.3.3(a)),

Step 3(a) follows the chain of suffix pointers starting from the sink. If a vertex on this

chain has one re-edge leaving it, then it is marked suffix redundant. The traversal of

the chain terminates either when the source is reached or when a vertex with more

than one re-edge leaving it is encountered (Lemma 4.3.3(b)). Similarly, Step 3(b)

identifies all prefix redundant vertices in V(T).

Vertices of CSD(T) are processed in reverse topological order in Step 4 and redun-
dant vertices are eliminated. When a vertex is eliminated, the edges incident to/from












it are redirected and relabeled as described in Figures 4.5 to 4.10. Procedure ProcessPrefixRedundant is symmetric to ProcessSuffixRedundant. The correctness of the relabeling and redirecting of edges in ProcessSuffixRedundant and ProcessBothRedundant follows from Lemmas 4.3.4 and 4.3.5. The resulting graph is CSD(s).


Lemma 4.3.4 In Procedure ProcessSuffixRedundant, str(v) is a prefix of str(w).

Proof: All occurrences of str(v) in s are followed by xγ. So, str(w) must be of the form βstr(v)xγ since w cannot be redundant (otherwise it would have been eliminated in Step 4). But, at least two occurrences of str(v) are preceded by different letters (since v is not prefix-redundant). So, at least two occurrences of str(v)xγ are preceded by different letters. So, β = nil, and str(w) = str(v)xγ. ∎




Lemma 4.3.5 In Procedure ProcessBothRedundant, w1 = w2.

Proof: βy precedes all occurrences of str(v) and xγ follows all occurrences of str(v) in s. Since w1 and w2 cannot be redundant (otherwise they would have been eliminated in Step 4), str(w1) and str(w2) are both βystr(v)xγ. ∎












Procedure ProcessSuffixRedundant(v)
1. Eliminate all left extension edges leaving v (there are at least two of these).
2. There is exactly one right extension edge e leaving v. Let the vertex that it leads to
be w. Let the label on the right extension edge be xγ. Delete the edge.
3. All right edges incident on v are updated so that they point to w. Their labels are
modified so that they represent the concatenation of their original labels with xγ.
4. All left edges incident on v are updated so that they point to w. Their labels are not
modified. However, if any of these were suffix extension edges, the bit which indicates
this should be reset as these edges are no longer suffix extension edges.
5. Delete v.


Figure 4.5. Algorithm for processing a vertex which is suffix redundant






--------- -> = Left Extension Edges
= Right Extension Edges


Figure 4.6. v is suffix redundant











Procedure ProcessPrefixRedundant(v)
1. Eliminate all right extension edges leaving v (there are at least two of these).
2. There is exactly one left extension edge e leaving v. Let the vertex that it leads to
be w. Let the label on the left extension edge be γx. Delete the edge.
3. All left edges incident on v are updated so that they point to w. Their labels are
modified so that they represent the concatenation of their original labels with γx.
4. All right edges incident on v are updated so that they point to w. Their labels are
not modified. However, if any of these were prefix extension edges, the bit which
indicates this should be reset as these edges are no longer prefix extension edges.
5. Delete v.


Figure 4.7. Algorithm for processing a vertex which is prefix redundant




= Left Extension Edges
= Right Extension Edges


Figure 4.8. v is prefix redundant












Procedure ProcessBothRedundant(v)

1. There is exactly one right extension edge e1 leaving v. Let the vertex that it leads to
be w1. Let the label on the edge be xγ. Delete the edge.

2. There is exactly one left extension edge e2 leaving v. Let the vertex that it leads to
be w2. Let the label on the edge be βy. Delete the edge.
{Lemma 4.3.5 establishes that w1 and w2 are the same vertex.}

3. All right edges incident on v are updated so that they point to w1. Their labels are
modified so that they represent the concatenation of their original labels with xγ. If
any of these edges were prefix extension edges, the bit which indicates this should be reset.

4. Similarly, left edges incident on v are updated so that they point to w2. Their labels
are modified so that they represent the concatenation of their original labels with βy.
If any of these edges were suffix extension edges, the bit which indicates this should be reset.

5. Delete v.


Figure 4.9. Algorithm for processing a vertex which is prefix and suffix redundant









----------- = Left Extension Edges
S= Right Extension Edges


Figure 4.10. v is suffix and prefix redundant











Theorem 4.3.1 Algorithm CircularCsdawg(s) correctly computes CSD(s) and has

complexity O(n), which is optimal.

Proof: Since the algorithm eliminates redundant vertices, Lemma 4.3.2 en-

sures that the vertices in CSD(s) are correctly obtained. It remains to show that

the edges of CSD(s) are correctly obtained. Given Lemmas 4.3.4 and 4.3.5, it is
easy to verify that Procedures ProcessSuffixRedundant, ProcessPrefixRedundant, and ProcessBothRedundant correctly relabel and redirect edges in the csdawg.

Step 1 takes O(n) time [9]. Step 2 takes O(n) time to locate the vertex representing S in CSD(T). Step 3 will, in the worst case, traverse all the vertices in CSD(T), spending O(1) time at each vertex. The number of vertices is bounded by O(n) [9]. So, Step 3 takes O(n) time. Step 4 traverses CSD(T). Each vertex is processed once; each edge is processed at most twice (once when it is an incoming edge to the vertex currently being processed, and once when it is the out edge from the vertex currently being processed). So, Step 4 takes O(n) time (note that CSD(T) has O(n) edges). ∎



Procedure CircOccurrences(s, v) of Figure 4.12 reports the end position of each

occurrence of str(v), for v ∈ V(s), in the circular string s. The label corresponding

to an edge terminating at a vertex other than the sink is denoted by label(e). The
start position of an edge terminating at the sink vertex is denoted by pos(e). It is

similar to the algorithm for computing occurrences of displayable entities in linear
strings (Chapter 3).

4.4 Computing Conflicts Efficiently

We have identified a number of problems relating to the computation of conflicts
in a linear string and have presented efficient algorithms for most of these problems














= Right Extension Edges
------------- = Left Extension Edges


Figure 4.11. Csdawg for < cabcbab >



(Chapter 3). These algorithms typically involved sophisticated traversals or opera-

tions on the csdawg for linear strings. Our extension of csdawgs to circular strings

makes it possible to use the same algorithms to solve the corresponding problems for

circular strings with some minor modifications caused by our representation of edges

that are incident on the sink.

4.5 Applications

Circular strings may be used to represent circular genomes [4] such as G4 and φX174. The detection and analysis of patterns in genomes helps to provide insights into the evolution, structure, and function of organisms. In [4], G4 and φX174 are analyzed by linearizing them and then constructing their csdawg. We improve upon this by (i) analyzing circular strings without risking the "loss" of patterns, and (ii)

























Procedure CircOccurrences(s:circular string, v:vertex)
{Obtain all occurrences of str(v), v ∈ V(s), in S}
if v ≠ sink then Occurrences(s, v, 0);


Procedure Occurrences(s:circular string, v:vertex, i:integer)
begin
for each re-edge e from v in CSD(s) do
begin
let w be the vertex on which e is incident;
if w ≠ sink then Occurrences(s, w, |label(e)| + i)
else
for k := 1 to (|s|/periodicity(s)) do
output(pos(e) - i - 1 + (k-1)*periodicity(s))
end;
end;


Figure 4.12. Obtaining all occurrences of a displayable entity in a circular string











extending the analysis and visualization techniques of Chapter 3 for linear strings to

circular strings.

Circular strings in the form of chain codes are also used to represent closed curves

in computer vision [12]. The objects of Figure 4.13(a) are represented in chain code

as follows:

(1) Arbitrarily choose a pixel through which the curve passes. In the diagram, the

start pixels for the chain code representation of objects 1 and 2 are marked by arrows.

(2) Traverse the curve in the clockwise direction. At each move from one pixel to

the next, the direction of the move is recorded according to the convention shown

in Figure 4.13(b). Objects 1 and 2 are represented by 1122102243244666666666

and 666666661122002242242446, respectively. The alphabet is {0, 1, 2, 3, 4, 5, 6,

7} which is fixed and of constant size (8) and therefore satisfies the condition of

Section 4.2. We may now use our visualization techniques of Chapter 3 to compare

the two objects. For example, our methods would show that objects 1 and 2 share the

segments S1 and S2 (Figure 4.13(c)) corresponding to 0224 and 2446666666661122,

respectively. Information on other common segments would also be available. The

techniques of this paper make it possible to detect all patterns irrespective of the

starting pixels chosen for the two objects.
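For illustration, a pixel path can be converted into such a chain code in a few lines of Python. This is a sketch only; the direction numbering used below is an assumption of the sketch and need not match the convention of Figure 4.13(b).

# Assumed numbering of the eight neighbour directions (dx, dy), with y growing downwards.
DIRS = {(0, -1): 0, (1, -1): 1, (1, 0): 2, (1, 1): 3,
        (0, 1): 4, (-1, 1): 5, (-1, 0): 6, (-1, -1): 7}

def chain_code(pixels):
    # pixels: the (x, y) centres visited while traversing the closed curve clockwise,
    # starting from an arbitrarily chosen pixel; the result is a circular string over {0..7}.
    code = []
    for (x1, y1), (x2, y2) in zip(pixels, pixels[1:] + pixels[:1]):
        code.append(DIRS[(x2 - x1, y2 - y1)])
    return "".join(map(str, code))

# A 2-by-2 square of pixels traversed clockwise:
print(chain_code([(0, 0), (1, 0), (1, 1), (0, 1)]))   # "2460" under the assumed numbering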

Circular strings may also be used to represent polygons in computer graphics and

computational geometry [11]. Figure 4.14 shows a polygon which is represented by the

following alternating sequence of lines and angles: bpaaoeaeacfpcfefpaaeaeaccaccbacdaca,

where a denotes a 90 degree angle and P, a 270 degree angle. The techniques of

this paper would point out all instances of self similarity in the polygon, such as

aaeaeacflc. Note, however, that for the methods to work efficiently, the number of

lines and angles that are used to represent the polygons must be small and fixed.















(a) The two closed curves drawn on a pixel grid; the starting positions for the chain codes of objects 1 and 2 are marked by arrows.

(b) Chain code representations of directions (the eight neighbour directions are numbered 0 through 7).

(c) The shared segments S1 and S2.


Figure 4.13. Representing closed curves by circular strings









Figure 4.14. Representing polygons by circular strings




Closed curves defined using B-splines can be determined from their control poly-

gons. If there exist two or more sufficiently long identical segments in the control

polygon, then the curve fragments corresponding to those polygon segments would

also be identical. Thus, similarity in closed curves represented by B-splines can also

be detected by our techniques.
















CHAPTER 5
EXTENSION TO BINARY TREES AND SERIES-PARALLEL GRAPHS

5.1 Tree Visualization

In this section, we consider the problem of tree visualization. Section 5.1.2 men-

tions some of its applications while Section 5.1.3 outlines the algorithms.

5.1.1 Problem Definition

We provide a specification for the problem of tree visualization:

1. Structure of Data to be Visualized: A binary tree BT of size n. Each node of

the binary tree contains a key.

2. Structure of Patterns: A subtree of BT.

3. Maximality of Patterns: A subtree pattern is not maximal if its occurrences

are all left or all right children of their parents and the subtrees rooted at their

parents are all identical. Figure 5.1 shows an example tree BT of size 7. We

will represent it using the parenthesized infix notation as (((b)a(c))e((b)a(c))) for the purposes of this paper (a sketch of this encoding follows the list). Here, the subtrees (b) and (c) are not maximal.

However, ((b)a(c)) and (((b)a(c))e((b)a(c))) are maximal.

4. Measure of Similarity MS: If two patterns are identical, then MS = 1. Other-

wise, MS = 0.

5. Display Model: A maximal subtree of BT is called a displayable subtree if it

occurs at least twice in BT. All instances of the same displayable subtree are

shaded in the same color. Different displayable subtrees are shaded in different






















Figure 5.1. A labeled binary tree



colors. In the example of Figure 5.1, ((b)a(c)) is the only displayable subtree.

So, BT would be displayed as shown in Figure 5.2.
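
The parenthesized infix notation used in item 3 can be produced by a simple recursive traversal. The sketch below is illustrative only; the Node fields shown are an assumption made for the example and are not the record layout used by the algorithms of Section 5.1.3.

class Node:
    # illustrative node type; the thesis's nodes also carry id, leftid, rightid,
    # and frequency fields used by Algorithm IdentifySubtrees
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def infix_parenthesized(node):
    # serialize a binary tree into the parenthesized infix notation
    if node is None:
        return ""
    return ("(" + infix_parenthesized(node.left) + node.key
            + infix_parenthesized(node.right) + ")")

# the example tree BT of Figure 5.1
bt = Node("e", Node("a", Node("b"), Node("c")), Node("a", Node("b"), Node("c")))
print(infix_parenthesized(bt))        # (((b)a(c))e((b)a(c)))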

We encounter the problem of subtree conflicts which is analogous to subword

conflicts in strings. Consider the binary tree of Figure 5.3. The two displayable

patterns are (((b)a(c))d) and ((b)a(c)). Note that the latter is a subtree of the

former. Consequently, the two subtrees cannot be shaded using different colors.

Formally, a subtree conflict occurs between two displayable subtrees P1 and P2

iff one is a proper subtree of the other.

The problem of subtree conflicts may be solved by using an approach similar

to that of Model 3 of Section 2.3. The resulting display of

(((((b)a(c))d)e(((b)a(c))d))f((b)a(c))) is shown in Figure 5.4.


5.1.2 Applications

Binary trees are chiefly used as data structures in computer science [10]. So-

phisticated debuggers attempt to display data structures at different points in the

execution of a program [14, 15, 16]. Systems for algorithm animation also require

displays of data structures [17, 18, 19]. The techniques of this section may be used in










Figure 5.2. Highlighting displayable subtrees


Figure 5.3. Subtree conflicts































Figure 5.4. Displaying a tree with subtree conflicts


conjunction with tree layout algorithms [20, 21, 22] to provide a meaningful display

of binary trees.

While the emphasis of this paper is on exploiting similarity to display discrete

objects, we note that similarity detection is also useful in other applications.

For example, consider a forest of expression trees. Each expression tree cor-

responds to the computation of a variable. The leaf nodes of an expression tree

correspond to operands while non leaf nodes correspond to operators (Figure 5.5).

The techniques of this paper identify all common subexpressions in the expression

trees. Specifically, common subexpressions which are embedded inside other common

subexpressions are also detected.

Our techniques also provide information for the efficient storage of a forest of trees

by converting it into a directed acyclic graph. For example, the tree of Figure 5.3

can be stored efficiently as in Figure 5.6.
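
A standard way to obtain such a dag is to hash subtrees bottom-up so that each distinct subtree is stored exactly once. The sketch below is a minimal illustration of this idea, not the thesis's construction; it reuses the illustrative Node type (fields key, left, right) from the sketch in Section 5.1.1.

def to_dag(root, table=None):
    # Share identical subtrees so the tree (or forest) is stored as a dag, as in
    # Figure 5.6.  `table` maps the signature of a distinct subtree -- its key
    # plus the identities of its (already shared) children -- to its single node.
    if table is None:
        table = {}
    if root is None:
        return None
    left, right = to_dag(root.left, table), to_dag(root.right, table)
    signature = (root.key, id(left), id(right))
    if signature not in table:
        root.left, root.right = left, right
        table[signature] = root
    return table[signature]

t = Node("e", Node("a", Node("b"), Node("c")), Node("a", Node("b"), Node("c")))
root = to_dag(t)
print(root.left is root.right)        # True: both copies of ((b)a(c)) are shared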



















Figure 5.5. An expression tree


Figure 5.6. Tree compression by representation as a dag











5.1.3 Algorithms

Each node of the tree contains the fields: lchild, rchild, key, id, leftid, rightid, and frequency. The id and frequency fields of each node are initialized to 0. The leftid (rightid) field of a node is initialized to -1 if its lchild (rchild) field is nil; otherwise,

it is initialized to 0.

Algorithm IdentifySubtrees of Figure 5.7 computes all displayable subtree pat-

terns of BT. At the end of execution of IdentifySubtrees, all occurrences of a dis-

playable subtree are assigned the same unique integer (which is stored in the id field

of the root of each occurrence of the displayable subtree). The variable NewId is used

as a counter.

StoreLeaves(BT,L) (line 1) traverses BT and stores pointers to its leaves in the

linked list L. Each node in L contains two fields: treenode, which points to a node

in BT and next, which points to the next element of L. M is a linked list containing

nodes that are identical to those of L. Initially, M is the empty list (line 5). Sort(L)

(line 6) sorts L on (treenode↑.key, treenode↑.leftid, treenode↑.rightid). Groups(L) (line 7) identifies groups of nodes such that all nodes in a group contain identical values in treenode↑.key, treenode↑.leftid, and treenode↑.rightid. Each group is a contiguous sequence in L because of line 6. Let there be g groups Lk, 1 ≤ k ≤ g. Lk is a pointer to the first element in the k'th group. Let the number of nodes in Lk be sk for 1 ≤ k ≤ g. Note that a group represents all occurrences of a subtree pattern. The

equality of the leftid and rightid fields guarantees that the left subtrees are identical

and the right subtrees are identical. The equality of the key fields ensures that the

roots are identical. In the first iteration, a group represents identical leaves.

Lines 8 and 9 ensure that all groups with at least two elements are considered. So,

a group corresponding to a maximal subtree pattern also corresponds to a displayable











subtree pattern. All roots of the subtrees corresponding to a group are assigned a unique id (lines 11 and 22). Their frequency fields are also initialized (line 21) to the number of elements in the group, i.e., sk. These will be used later to determine whether the subtree pattern corresponding to the group is maximal.
Lines 12 to 17 determine whether the subtree patterns rooted at either or both of the two children of Lk↑.treenode are displayable. If this is not the case, then the id fields of either the left children or right children or both, as appropriate, are reset (lines 35 and 36). Lines 23 to 34 initialize either the leftid or rightid field of the parent of the root of each subtree occurrence (depending on whether the particular root is a left child or a right child of its parent). If both the leftid and the rightid fields of the parent are non-zero, then a pointer to the parent is inserted into the linked list M (lines 27 and 33). Finally, the list L is deleted and the process is repeated with list M (lines 40 and 41).
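
For intuition, the following Python sketch computes the same ids with a dictionary in a single post-order pass, operating on the illustrative Node objects of the sketch in Section 5.1.1. It is a simplification, not the thesis's algorithm: hashing gives expected rather than worst-case linear time, and the resetting of the ids of non-maximal child patterns (lines 12 to 17 and 35 to 36 of Figure 5.7) is omitted.

def identify_subtrees(root):
    # Assign the same integer id to the roots of identical subtrees (stored in a
    # new `id` attribute) and return a dictionary mapping id -> number of
    # occurrences; ids with a count of at least two mark displayable subtrees.
    ids = {}                          # (key, left id, right id) -> id
    frequency = {}                    # id -> number of occurrences
    def visit(node):
        if node is None:
            return -1                 # mirrors the -1 used for missing children
        signature = (node.key, visit(node.left), visit(node.right))
        if signature not in ids:
            ids[signature] = len(ids) + 1
        node.id = ids[signature]
        frequency[node.id] = frequency.get(node.id, 0) + 1
        return node.id
    visit(root)
    return frequency
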
Line 1 of IdentifySubtrees consumes O(n) time, since it involves a traversal of BT. Let Li denote the list L in the i'th iteration of the while statement of line 3, and |Li| denote the number of nodes in Li (1 ≤ i ≤ w, where w represents the number of iterations of the while loop). Since each node in Li corresponds to a distinct node in BT and since a node in BT is an element of at most one Li, Σ_{i=1}^{w} |Li| ≤ n.
We now show that the i'th iteration of the while loop consumes O(|Li|) time. It is assumed that all keys are in the range [1..n]. Then, line 6 can be executed in O(|Li|) time, if radix sort is used to sort the linked list. The keys are relabeled so that they are in the range [1..|Li|]. The relabeling can be done in O(|Li|) time as shown in Figures 5.8 and 5.9. Line 7 requires a pass over Li and also consumes O(|Li|) time. Lines 8 to 39 essentially involve a pass over Li, with O(1) time being spent at each node. Line 40 also involves a pass over Li and consumes O(|Li|) time. Thus, IdentifySubtrees consumes O(n) time. (If the keys are not in the range [1..n],













Algorithm IdentifySubtrees
1  StoreLeaves(BT,L);
2  NewId := 0;
3  while L is not empty do
4  begin
5    M := nil;
6    Sort(L);
7    Groups(L);
8    for k := 1 to g do {g is the number of groups obtained from line 7}
9      if sk > 1 then {sk is the number of nodes in group k}
10     begin
11       NewId := NewId + 1;
12       ResetLeft := false;
13       ResetRight := false;
14       if Lk↑.treenode↑.lchild ≠ nil
15         then if Lk↑.treenode↑.lchild↑.frequency = sk then ResetLeft := true;
16       if Lk↑.treenode↑.rchild ≠ nil
17         then if Lk↑.treenode↑.rchild↑.frequency = sk then ResetRight := true;
18       for j := 1 to sk do
19       begin
20         current := Lk↑.treenode;
21         current↑.frequency := sk;
22         current↑.id := NewId;
23         if current↑.parent↑.lchild = current then
24         begin
25           current↑.parent↑.leftid := NewId;
26           if current↑.parent↑.rightid ≠ 0
27             then AddNode(M, current↑.parent);
28         end
29         else
30         begin
31           current↑.parent↑.rightid := NewId;
32           if current↑.parent↑.leftid ≠ 0
33             then AddNode(M, current↑.parent);
34         end;
35         if ResetLeft then current↑.lchild↑.id := 0;
36         if ResetRight then current↑.rchild↑.id := 0;
37         Lk := Lk↑.next;
38       end;
39     end;
40   DeleteList(L);
41   L := M;
42 end.


Figure 5.7. Algorithm for identifying displayable subtrees











{ An auxiliary array A and stack S, each of size n, are used. All elements of A are initialized
to 0 at the beginning of the program. The stack is initially empty. A key k (guaranteed to
lie between 1 and n) is relabeled by first checking A[k] to determine whether it has already
been assigned a label. If not, key k is assigned a label. Finally, the integer k is pushed onto
S.}

Procedure Label(L)
begin
count := 0;
for each key k in L do
if A[k] = 0 then
begin
count := count + 1;
A[k] := count;
Push(S,k);
end;
end;



Figure 5.8. Procedure for relabeling keys



then they can be relabeled to satisfy this condition. However, the relabeling step

would take O(n log k) time, where k is the number of distinct keys in BT.)

5.2 Geometric Series-Parallel Graph Visualization

In this section, we consider the problem of geometric series-parallel graph visual-

ization. Section 5.2.2 mentions some of its applications while Section 5.2.3 outlines

the algorithms.

5.2.1 Problem Definition

We provide a specification for the problem of series-parallel graph visualization:

1. Structure of Data to be Visualized: A geometric series-parallel graph SPG

with n labeled vertices. Each vertex of SPG contains a character chosen from

an alphabet Σ of constant size. "Fork" and "Join" vertices are not labeled.











{ The elements of array A are reset to 0 prior to the start of the next iteration of the while
statement on line 3 of Algorithm IdentifySubtrees. }

Procedure ReInitialize;
begin
while S is not empty do
begin
k := Pop(S);
A[k] := 0;
end
end



Figure 5.9. Procedure for re-initializing array A




Figure 5.10 shows a geometric series-parallel graph with 13 labeled vertices, 2

fork vertices, and 2 join vertices. We represent the series-parallel graph as a string using the double parenthesis notation: ab[(cd)(ef)(b)]ab[(cd)(ef)] (a sketch of this encoding follows the list).

A geometric series-parallel graph differs from a regular series-parallel graph in

that it describes the layout of a regular series-parallel graph in a plane. So, the

graph obtained by exchanging (cd) and (ef) in Figure 5.10 would be a different

geometric series-parallel graph from the one in Figure 5.10, even though both

represent the same series-parallel graph.

2. Structure of Patterns: A geometric series-parallel subgraph of SPG.

3. Maximality of Patterns: A series-parallel subgraph pattern P1 is not maximal

iff all its instances are subgraphs of the same series-parallel subgraph P2 and if

these instances of P1 all occur in the same relative geometric position in P2.

4. Measure of Similarity MS: If two patterns are identical, then MS = 1. Other-

wise, MS = 0.








Figure 5.10. A geometric series-parallel graph



Figure 5.11. Highlighting displayable subgraphs (the displayable entities DE 1, DE 2, and DE 3 are highlighted)



5. Display Model: A maximal subpattern of SPG is called a displayable subgraph if

it occurs at least twice in SPG. All instances of the same displayable subgraph

are shaded in the same color. Different displayable subgraphs are shaded in

different colors. The displayable subgraphs are ab, b, and (cd)(ef). So, the

graph of Figure 5.10 may be displayed using a model similar to that of Model

3 of Section 2.3, as shown in Figure 5.11.
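
The double parenthesis string can be produced from a nested-list description of the graph by a short recursive routine, sketched below. The list-of-lists input format is an assumption made for this sketch and is not the thesis's graph representation.

def to_double_paren(series):
    # `series` is a list whose elements are either labelled vertices (single
    # characters) or a parallel section between a fork and a join, given as a
    # list of branch series.
    out = []
    for item in series:
        if isinstance(item, str):
            out.append(item)          # labelled vertex
        else:                         # fork ... join section
            out.append("[" + "".join("(" + to_double_paren(branch) + ")"
                                     for branch in item) + "]")
    return "".join(out)

spg = ["a", "b", [["c", "d"], ["e", "f"], ["b"]],
       "a", "b", [["c", "d"], ["e", "f"]]]
print(to_double_paren(spg))           # ab[(cd)(ef)(b)]ab[(cd)(ef)]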











Algorithm IdentifySubgraphs
1. Convert geometric series-parallel graph SPG into a string S in the double parenthesis
notation.
2. Compute CSD(S)
3. Extract displayable subgraphs of SPG from the displayable entities of CSD(S). Each
displayable subgraph is associated with the smallest displayable entity from which it
was extracted.


Figure 5.12. Computing displayable subgraphs



5.2.2 Applications

Series-parallel circuits are an important class of electronic circuits. These can

be modeled by series-parallel graphs in which the vertices (other than fork and join

vertices) represent active circuit elements. These vertices are labeled by the name/id

of the active component they represent. The different branches of a fork/join are

ordered top to bottom in some canonical way. For example, the lexical ordering of

the double parenthesis string representation of the branches could be used. Common

subgraphs represent common subcircuits. Circuit visualization/display systems are

required to display circuits in such a way that common circuit substructures are easily

identified.

5.2.3 Algorithms

Figure 5.12 outlines an algorithm for determining the displayable subgraphs of

a geometric series-parallel graph. First, the geometric series parallel graph SPG

is converted into a string S in the double parenthesis notation described earlier.

Next, the csdawg CSD(S) corresponding to S is constructed. Figure 5.13 shows

CSD(S) for S = ab[(cd)(ef)(b)]ab[(cd)(ef)] (only re-edges are shown). Finally, the


















Figure 5.13. Csdawg for graph of Figure 5.10; only re-edges are shown











displayable subgraphs of SPG are obtained from the displayable entities of S. This
is achieved by extracting the longest substrings with no unmatched parenthesis from
each displayable entity. The displayable subgraphs corresponding to the displayable
entity ab[(cd)(ef) are ab and (cd)(ef) corresponding to DE 1 and DE 3 in Figure 5.11.
Each displayable subgraph is associated with the vertex in CSD(S) from which it
was obtained so that all the locations of its occurrences in SPG can be retrieved.
If a displayable subgraph is obtained from two or more displayable entities, then
it is associated with the smallest displayable entity. Note that a unique smallest
displayable entity must exist from the definition of a csdawg. Steps 1 and 2 of the algorithm each consume O(n) time, while Step 3 consumes O(n²) time.
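
A sketch of the extraction performed in Step 3 is given below: it splits a displayable entity at its unmatched parentheses and brackets, leaving the longest substrings in which every parenthesis is matched. Treating '[' and ']' the same way as '(' and ')' is an assumption of this illustration.

def balanced_chunks(entity):
    # positions of unmatched brackets are found with a stack, and the entity is
    # then split at those positions
    unmatched, stack = set(), []
    for i, ch in enumerate(entity):
        if ch in "([":
            stack.append(i)
        elif ch in ")]":
            if stack and entity[stack[-1]] + ch in ("()", "[]"):
                stack.pop()           # properly matched pair
            else:
                unmatched.add(i)      # unmatched closing bracket
    unmatched.update(stack)           # unmatched opening brackets
    chunks, start = [], 0
    for i in sorted(unmatched) + [len(entity)]:
        if i > start:
            chunks.append(entity[start:i])
        start = i + 1
    return chunks

print(balanced_chunks("ab[(cd)(ef)"))  # ['ab', '(cd)(ef)'], as in the text above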
















CHAPTER 6
SYSTEM INTEGRATION

6.1 Using the Object-Oriented Methodology

In this section we demonstrate that our algorithms are well-suited for imple-

mentation using the object-oriented paradigm. The data abstraction and inheritance

features of the object-oriented paradigm are particularly applicable to our algorithms.

Most of our algorithms for the string, circular string, and series parallel graph

discrete objects are essentially traversals or operations on the corresponding csdawg.

High level operations on the csdawgs such as creating the csdawg, computing dis-

playable entities, computing their locations, computing display conflicts, etc make

use of lower level operations on the csdawgs such as finding a vertex in the csdawg

corresponding to a substring (subgraph), modifying the contents of a vertex, etc.

Thus, the concept of data abstraction which views data as a black box whose con-

tents can only be accessed through its operations can be applied naturally to csdawgs.

Many operations are implemented identically for each of the three csdawgs. These

are usually low level operations such as locating the vertex corresponding to a string,

or modifying the contents of a vertex, etc. Other operations (usually higher level

operations), while not implemented identically for the three discrete objects, are

similar. Typically, a high level operation on a circular string or a series parallel graph

can be implemented by the corresponding operation on a linear string preceded by a

simple preprocessing step and followed by a simple postprocessing step.
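
Anticipating the class names introduced below, the following Python sketch illustrates this design. Everything in it is a stand-in: the brute-force _repeated_substrings method takes the place of csdawg construction and traversal, and the preprocess/postprocess bodies are simplified versions of the steps just described.

class LinearCsdawg:
    # Base class: the high-level query is written once in displayable_entities.
    def __init__(self, data):
        self.s = self.preprocess(data)

    def preprocess(self, data):
        return data                   # linear strings: nothing to do

    def postprocess(self, entities):
        return entities               # linear strings: nothing to do

    def displayable_entities(self):
        return self.postprocess(self._repeated_substrings())

    def _repeated_substrings(self):
        # stand-in for the csdawg: substrings (length >= 2) occurring >= twice
        s, found = self.s, set()
        for i in range(len(s)):
            for j in range(i + 2, len(s) + 1):
                if s.find(s[i:j], i + 1) != -1:
                    found.add(s[i:j])
        return found

class CircularCsdawg(LinearCsdawg):
    def preprocess(self, data):
        self.period = len(data)
        return data + data            # linearize the circular string by doubling

    def postprocess(self, entities):
        # keep entities no longer than the original string that start at two
        # distinct circular positions
        n = self.period
        return {e for e in entities if len(e) <= n and
                sum(self.s.startswith(e, i) for i in range(n)) >= 2}

class SeriesParallelCsdawg(LinearCsdawg):
    # `data` is assumed to already be the double parenthesis string of the graph
    def postprocess(self, entities):
        # crude balance filter, standing in for Step 3 of Figure 5.12
        return {e for e in entities
                if e.count("(") == e.count(")") and e.count("[") == e.count("]")}

print(sorted(CircularCsdawg("abab").displayable_entities()))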

We create a class for each csdawg corresponding to the three discrete objects:

LinearCsdawg, CircularCsdawg, SeriesParallelCsdawg. LinearCsdawg is a base class,
