Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: A Data structure for circular string analysis and visualization
Permanent Link: http://ufdc.ufl.edu/UF00095097/00001
 Material Information
Title: A Data structure for circular string analysis and visualization
Physical Description: Book
Language: English
Creator: Sahni, Sartaj
Mehta, Dinesh P.
Affiliation: University of Florida
University of Minnesota
Publisher: Department of Computer and Information Sciences, University of Florida
Place of Publication: Gainesville, Fla.
Copyright Date: 1991
 Record Information
Bibliographic ID: UF00095097
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.












A Data Structure for Circular String Analysis and


Visualization *


Dinesh P. Mehta †‡


Sartaj Sahni †


Technical Report 25


Abstract

Circular strings are used to represent circular genomes in molecular biology, poly-
gons in computer graphics and computational geometry, and closed curves in computer
vision. In this paper we extend techniques which have so far been successfully applied
to the analysis and visualization of linear strings to circular strings by defining a data
structure for circular strings. Efficient (often optimal) algorithms that support these
techniques are presented.


Keywords and Phrases:
Circular strings, visualization, analysis, directed acyclic word graphs.














*This research was supported in part by the National Science Foundation under grant MIP 86-17374.
†Dept. of Computer and Information Sciences, University of Florida, Gainesville, FL 32611
‡Dept. of Computer Science, University of Minnesota, Minneapolis, MN 55455










1 Introduction


The circular string data type is used to represent a number of objects such as circular genomes, polygons, and closed curves. Research in molecular biology involves the identification of recurring patterns in data and hypothesizing about their causes and/or effects [1, 2]. Research in pattern recognition and computer vision involves detecting similarities within an object or between objects [3].

Detecting patterns visually is tedious and prone to error. In [4], a model was proposed to alleviate this problem. The model consists of identifying all recurring patterns in a string and highlighting identical patterns in the same color.

[4] also listed a number of queries that the model would support. In [5], efficient (mostly optimal) algorithms were proposed for some of these queries for linear strings. These algorithms perform operations and traversals on the symmetric compact directed acyclic word graph (scdawg) [6] of the linear string. The scdawg, which is used to represent a string or a set of strings, evolved from other string data structures such as position trees, suffix trees, directed acyclic word graphs, etc. [7, 8, 9, 10].

One approach for extending these techniques to circular strings is to arbitrarily break the circular string at some point so that it becomes a linear string. Techniques for linear strings may then be applied to it. However, this has the disadvantage that some significant patterns in the circular string may be lost because the patterns were broken when linearizing the string. Indeed, this would defeat the purpose of representing objects by circular strings.

[3] defined a polygon structure graph, which is an extension of suffix trees to circular strings. However, the suffix tree is not as powerful as the scdawg and cannot be used to solve some of the problems that the scdawg can solve. In this paper, we define an scdawg for circular strings. Algorithms in [5] and [6] which make use of the scdawg for linear strings can then be extended to circular strings with minor modifications. The extended algorithms continue to have the same efficient time and space complexities. Further, the extensions take the form of postprocessing or preprocessing steps which are simple to add on to a system built for linear strings, particularly in an object-oriented language.
























Figure 1: Circular string


Section 2 contains definitions. Section 3 describes the scdawg for linear strings while Section 4 describes its extension to circular strings. Section 5 deals with the computation of occurrences of displayable entities. Section 6 introduces the notion of conflicts and Section 7 lists other queries that are to be implemented. Section 6 also explains how the algorithms implementing queries for linear strings can be modified so that they work with circular strings. Finally, Section 8 mentions some applications for the visualization and analysis of circular strings.



2 Definitions


Let s denote a circular string of size n consisting of characters from a fixed alphabet, $\Sigma$, of constant size. Figure 1 shows an example circular string of size 8. We shall represent a circular string by a linear string enclosed in angle brackets "<>" (this distinguishes it from a linear string). The linear string is obtained by traversing the circular string in clockwise order and listing each element as it is traversed. The starting point of the traversal is chosen arbitrarily. Consequently, there are up to n equivalent representations of s. In the example, s could be represented as < abcdabce >, < bcdabcea >, etc.

We characterize the relationship between circular strings and linear strings by defining the functions linearize and circularize. linearize maps circular strings to linear strings. It is a one-many mapping as a circular string can, in general, be mapped to more than one linear string. For example, linearize(< abcd >) = {abcd, bcda, cdab, dabc}. We will assume, for the purpose of this paper, that linearize arbitrarily chooses one of the linear strings; for convenience we assume that it chooses the representation obtained by removing the angle brackets "<>". So, linearize(< abcd >) = abcd. circularize maps linear strings to circular strings. It is a many-one function and represents the inverse of linearize.

We use lower case letters to represent circular strings and upper case letters to represent linear strings. Further, if a lower case letter (say, s) is used to represent a particular circular string, then the corresponding upper case letter (S) is assumed to be linearize(s). A single character in s or S occurring in the ith position is denoted by $s_i$ or $S_i$, respectively. A substring of S is denoted by $S_{i,j}$ where $i \le j$: $S_{i,j} = S_i S_{i+1} \ldots S_j$. A substring of s is denoted by $s_{i,j}$, where $s_{i,j} = S_{i,j}$ if $i \le j$ and $s_{i,j} = S_{i,n} S_{1,j}$ if $i > j$. For example, if s = < abcdabce >, then S = abcdabce, $S_5 = s_5$ = a, $S_{3,5} = s_{3,5}$ = cda, and $s_{7,2}$ = ceab. We use the symbol $\sigma$ to denote either a circular string or a linear string. In the example, $\sigma_{3,5} = S_{3,5}$ if $\sigma = S$, and $\sigma_{3,5} = s_{3,5}$ if $\sigma = s$.

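The predecessor, $\mathrm{pred}(\sigma,i,j)$, of a substring $\sigma_{i,j}$ of $\sigma$ is defined as

$$\mathrm{pred}(\sigma,i,j) = \begin{cases} \sigma_{i-1} & \text{if } 1 < i \le n \\ \infty & \text{if } i = 1 \text{ and } \sigma \text{ is linear} \\ \sigma_n & \text{if } i = 1 \text{ and } \sigma \text{ is circular} \end{cases}$$

The successor, $\mathrm{succ}(\sigma,i,j)$, of a substring $\sigma_{i,j}$ of $\sigma$ is defined as

$$\mathrm{succ}(\sigma,i,j) = \begin{cases} \sigma_{j+1} & \text{if } 1 \le j < n \\ \infty & \text{if } j = n \text{ and } \sigma \text{ is linear} \\ \sigma_1 & \text{if } j = n \text{ and } \sigma \text{ is circular} \end{cases}$$

The immediate context, $\mathrm{context}(\sigma,i,j)$, of a substring $\sigma_{i,j}$ of $\sigma$ is the ordered pair $(\mathrm{pred}(\sigma,i,j), \mathrm{succ}(\sigma,i,j))$.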

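The predecessor, $\mathrm{pred}(\sigma,\alpha)$, and successor, $\mathrm{succ}(\sigma,\alpha)$, sets of a pattern, $\alpha$, in a string $\sigma$ are defined as below:

$$\mathrm{pred}(\sigma,\alpha) = \{\mathrm{pred}(\sigma,i,j) \mid \sigma_{i,j} = \alpha\}, \qquad \mathrm{succ}(\sigma,\alpha) = \{\mathrm{succ}(\sigma,i,j) \mid \sigma_{i,j} = \alpha\}.$$

The immediate context set, $\mathrm{context}(\sigma,\alpha)$, of a pattern, $\alpha$, in $\sigma$ is the set

$$\mathrm{context}(\sigma,\alpha) = \{\mathrm{context}(\sigma,i,j) \mid \sigma_{i,j} = \alpha\}.$$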

In the example string of Figure 1, succ(s, abc) = succ(S, abc) = {d, e}. pred(s, abc) = {d, e}; pred(S, abc) = {$\infty$, d}. context(s, abc) = {(e, d), (d, e)}. context(S, abc) = {($\infty$, d), (d, e)}.
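
The definitions above can be checked mechanically on the example. The following is a minimal brute-force sketch, not an algorithm from this paper; the function names and the use of the string "∞" for the missing neighbour of a linear string are assumptions made for illustration.

```python
INF = "∞"  # assumed marker for "no predecessor/successor" in a linear string


def context_set(text, pattern, circular):
    """Return the set of (pred, succ) pairs over all occurrences of pattern."""
    n, k = len(text), len(pattern)
    starts = range(n) if circular else range(n - k + 1)
    result = set()
    for i in starts:
        # Compare character by character, wrapping around for circular strings.
        if any(text[(i + d) % n] != pattern[d] for d in range(k)):
            continue
        if circular:
            left, right = text[(i - 1) % n], text[(i + k) % n]
        else:
            left = text[i - 1] if i > 0 else INF
            right = text[i + k] if i + k < n else INF
        result.add((left, right))
    return result


def pred_set(text, pattern, circular):
    return {left for left, _ in context_set(text, pattern, circular)}


def succ_set(text, pattern, circular):
    return {right for _, right in context_set(text, pattern, circular)}
```

Calling these functions with the linearization abcdabce and the pattern abc reproduces the sets listed above for both s and S.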

A pattern occurring in $\sigma$ is said to be maximal iff its occurrences are not all preceded by the same character nor all followed by the same character. So, a pattern $\alpha$ of length < n in $\sigma$ is maximal iff $|\mathrm{pred}(\sigma,\alpha)| \ge 2$ and $|\mathrm{succ}(\sigma,\alpha)| \ge 2$. This is not necessarily true for patterns of length greater than or equal to n. For example, S is maximal in S (since it is neither preceded nor followed by a character), but $|\mathrm{pred}(S,S)| = |\mathrm{succ}(S,S)| = 1$.

A pattern is said to be a displayable entity (or displayable) of $\sigma$ iff it is maximal and occurs at least twice in $\sigma$. Note that if $\sigma$ represents a circular string, then a pattern can be arbitrarily long. In the rest of our discussion, we will assume that displayable entities of circular strings have length less than n.
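
As an illustration of these two definitions, the sketch below enumerates displayable entities of length less than n by brute force. It is only meant to make the definitions concrete; it is far slower than the scdawg-based approach described in the following sections, and all names in it are hypothetical.

```python
INF = "∞"  # assumed marker for a missing neighbour in a linear string


def neighbours(text, pattern, circular):
    """Return (pred set, succ set, number of occurrences) of pattern in text."""
    n, k = len(text), len(pattern)
    preds, succs, count = set(), set(), 0
    for i in (range(n) if circular else range(n - k + 1)):
        if any(text[(i + d) % n] != pattern[d] for d in range(k)):
            continue
        count += 1
        if circular:
            preds.add(text[(i - 1) % n])
            succs.add(text[(i + k) % n])
        else:
            preds.add(text[i - 1] if i > 0 else INF)
            succs.add(text[i + k] if i + k < n else INF)
    return preds, succs, count


def displayable_entities(text, circular):
    """Patterns of length 1..n-1 that are maximal and occur at least twice."""
    n = len(text)
    doubled = text + text if circular else text
    found = set()
    for k in range(1, n):
        for i in range(n if circular else n - k + 1):
            pattern = doubled[i:i + k]
            preds, succs, count = neighbours(text, pattern, circular)
            if count >= 2 and len(preds) >= 2 and len(succs) >= 2:
                found.add(pattern)
    return found
```

For the circular string of Figure 1 this yields the single displayable entity abc.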



3 Scdawgs For Linear Strings


An scdawg, SCD(S) = (V(S), R(S), L(S)), corresponding to a string S is a directed acyclic graph defined by a set of vertices, V(S), a set, R(S), of labeled directed edges called right extension (re) edges, and a set of labeled directed edges, L(S), called left extension (le) edges. Each vertex of V(S) represents a substring of S. Specifically, V(S) consists of a source (which represents the empty word, $\lambda$), a sink (which represents S), and a vertex corresponding to each displayable entity of S.

Let de(v) denote the string represented by vertex v, $v \in V(S)$. Define the implication, $\mathrm{imp}(S,\alpha)$, of a string $\alpha$ of S to be the smallest superword of $\alpha$ in $\{de(v) \mid v \in V(S)\}$, if such a superword exists. Otherwise, $\mathrm{imp}(S,\alpha)$ does not exist. Re edges from $v_1$ ($v_1 \in V(S)$) are obtained as follows: for each letter x in $\Sigma$, if $\mathrm{imp}(S, de(v_1)x)$ exists and is equal to $de(v_2) = \beta\, de(v_1)\, x \gamma$, then there is an re edge from $v_1$ to $v_2$ with label $x\gamma$. If $\beta$ is the empty string, then the edge is known as a prefix extension edge. Le edges from $v_1$ ($v_1 \in V(S)$) are obtained as follows: for each letter x in $\Sigma$, if $\mathrm{imp}(S, x\,de(v_1))$ exists and is equal to $de(v_2) = \gamma x\, de(v_1)\, \beta$, then there is an le edge from $v_1$ to $v_2$ with label $\gamma x$. If $\beta$ is the empty string, then the edge is known as a suffix extension edge.

Figure 2: SCDAWG for S = cdefabcgabcde, only re edges are shown

Figure 2 shows (V(S), R(S)) corresponding to S = cdefabcgabcde. abc, cde, and c are the displayable entities of S. There are two re edges from the vertex representing abc. These correspond to x = d and x = g. imp(S, abcd) = imp(S, abcg) = S. Consequently, both edges are incident on the sink. There are no edges corresponding to the other letters of the alphabet as imp(S, abcx) does not exist for $x \in \{a, b, c, e, f\}$.

Notice that the number of re edges from a vertex, v, equals $|\mathrm{succ}(S, de(v)) - \{\infty\}|$ and the number of le edges equals $|\mathrm{pred}(S, de(v)) - \{\infty\}|$. In the example, succ(S, cde) = {$\infty$, f}. So, the number of right edges leaving the vertex corresponding to it is 1.

The space required for SCD(S) is O(n) and the time needed to construct it is O(n) [7, 6].

While we have defined the scdawg data structure for a single string, it can be extended to

represent a set of strings [6].
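
The definition of re edges can be made concrete with a small brute-force sketch. It takes the vertex strings as given (for S = cdefabcgabcde these are the empty word, c, abc, cde, and S) and derives each edge from the implication. This is not the linear-time construction of [6, 7]; the function name is hypothetical, and locating de(v1)x by its first occurrence inside de(v2) is an assumption of this sketch.

```python
def re_edges(S, vertex_strings):
    """Map (de(v1), x) to (de(v2), label) following the definition of re edges."""
    edges = {}
    for v1 in vertex_strings:
        for x in sorted(set(S)):
            target = v1 + x
            # Candidates for imp(S, de(v1)x): vertex strings containing the target.
            supers = [w for w in vertex_strings if target in w]
            if not supers:
                continue  # imp(S, de(v1)x) does not exist, so there is no edge
            v2 = min(supers, key=len)  # smallest superword
            start = v2.find(target)    # assumption: use the first occurrence
            edges[(v1, x)] = (v2, v2[start + len(v1):])  # label is x followed by gamma
    return edges
```

On the example, the vertex abc receives exactly two re edges (for d and g), both incident on the sink, and the vertex cde receives exactly one, matching the counts given above.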










4 Extension to Circular Strings


In Section 4.1, we present a constructive definition of an scdawg for circular strings. Section 4.2 analyzes the complexity of the algorithm of Section 4.1 to construct the scdawg of a circular string and Section 4.3 identifies and proves some properties of this scdawg.


4.1 SCDAWGs For Circular Strings


The notion of an scdawg may be extended to circular strings. The scdawg for circular
strings is defined constructively by the algorithm of Figure 3. The scdawg for the circular
string s is obtained by first constructing the scdawg for the linear string T = SS (recall
that S = linearize(s)). A bit is associated with each re edge in R(T) indicating whether it
is a prefix extension edge or not. Similarly, a bit is associated with each le edge in L(T)
to identify suffix extension edges. Two pointers, a suffix pointer and a prefix pointer are
associated with each vertex, v in V(T). The suffix (prefix) pointer points to a vertex, w,
in V(T) such that de(w) is the largest suffix (prefix) of de(v) represented by any vertex
in V(T). Suffix (prefix) pointers are the reverse of suffix (prefix) extension edges and are
derived from them. Figure 4 shows SCD(T) = SCD(SS) for S = cabcbab. The broken
edge from vertex c to vertex abc is a suffix extension edge, while the solid edge from vertex
ab to vertex abc is a prefix extension edge.

Next, in step 2, suffix and prefix redundant vertices of SCD(T) are identified. A suffix
(prefix) redundant vertex is a vertex v that satisfies the following properties:

(a) v has exactly one outgoing re (le) edge.

(b) |de(v)| < n.
A vertex is said to be redundant if it is either prefix redundant or suffix redundant or both. In Figure 4, vertex c is prefix redundant only, while vertex ab is suffix redundant only. No other vertices in the figure are redundant (in particular, the vertex representing S is not redundant even though it has one re and one le out edge, as |S| = n). The fact that step 2 does, in fact, identify all redundant vertices is established later.

Vertices of SCD(T) are processed in reverse topological order in step 3 and redundant
















Algorithm A
Step 1: Construct SCD(T) for T = SS.

Step 2(a):
{Identify Suffix Redundant Vertices}
v := sink;
while v <> source do
begin
   v := v.suffix;
   if v has exactly one outgoing re edge
   then
      begin
         if (|de(v)| < n)
         then mark v suffix redundant;
      end
   else
      exit Step 2(a);
end;

Step 2(b):
{Identify Prefix Redundant Vertices}
{Similar to Step 2(a)}

Step 3:
v := sink;
while (v <> source) do
begin
   case v of
      suffix redundant but not prefix redundant: ProcessSuffixRedundant(v);
      prefix redundant but not suffix redundant: ProcessPrefixRedundant(v);
      suffix redundant and prefix redundant : ProcessBothRedundant(v);
      not redundant : {Do nothing};
   endcase;
   v := NextVertexInReverseTopologicalOrder;
end;


Figure 3: Algorithm for constructing the scdawg for a circular string
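
The outcome of Step 2 can be cross-checked on small inputs without building the scdawg. The sketch below is a brute-force substitute for the suffix-pointer and prefix-pointer traversals: it enumerates the displayable entities of T = SS directly and applies the two redundancy conditions, using the fact that the number of re (le) out edges of a vertex equals the size of its successor (predecessor) set without the end marker. The function names are hypothetical.

```python
INF = "∞"  # assumed marker for the missing neighbour at either end of T


def pred_succ(T, pattern):
    """Return (pred set, succ set, occurrence count) of pattern in linear T."""
    preds, succs, count = set(), set(), 0
    k = len(pattern)
    for i in range(len(T) - k + 1):
        if T.startswith(pattern, i):
            count += 1
            preds.add(T[i - 1] if i > 0 else INF)
            succs.add(T[i + k] if i + k < len(T) else INF)
    return preds, succs, count


def classify_vertices(S):
    """Map each displayable entity of T = SS to its set of redundancy flags."""
    n, T = len(S), S + S
    status = {}
    patterns = {T[i:j] for i in range(len(T)) for j in range(i + 1, len(T))}
    for p in patterns:
        preds, succs, count = pred_succ(T, p)
        if count < 2 or len(preds) < 2 or len(succs) < 2:
            continue  # not a displayable entity of T, so not a vertex
        flags = set()
        if len(p) < n and len(succs - {INF}) == 1:
            flags.add("suffix")  # exactly one re out edge
        if len(p) < n and len(preds - {INF}) == 1:
            flags.add("prefix")  # exactly one le out edge
        status[p] = flags
    return status
```

For S = cabcbab this marks c as prefix redundant only and ab as suffix redundant only, in agreement with the discussion of Figure 4.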











































Figure 4: SCD(T) for T = cabcbabcabcbab (broken lines denote left extension edges; solid lines denote right extension edges)










Procedure ProcessSuffixRedundant(v)

1. Eliminate all left extension edges leaving v (there are at least two of these).

2. There is exactly one right extension edge, e, leaving v. Let the vertex that it leads to be w. Let the label on the right extension edge be $x\gamma$. Delete the edge.

3. All right edges incident on v are updated so that they point to w. Their labels are modified so that they represent the concatenation of their original labels with $x\gamma$.

4. All left edges incident on v are updated so that they point to w. Their labels are not
modified. However, if any of these were suffix extension edges, the bit which indicates
this should be reset as these edges are no longer suffix extension edges.

5. Delete v.


Figure 5: Algorithm for processing a vertex which is suffix redundant
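
The five steps of Figure 5 can be sketched on a plain edge list. The representation below (an `Edge` record with a single `extension` flag that stands for the prefix extension bit on re edges and the suffix extension bit on le edges, and vertices named by the strings they represent) is an assumption made for illustration and is not the representation used by the authors.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    src: str         # de() of the tail vertex ("" for the source)
    dst: str         # de() of the head vertex
    label: str
    kind: str        # "re" or "le"
    extension: bool  # prefix extension bit (re) or suffix extension bit (le)


def process_suffix_redundant(edges, v):
    """Return a new edge list with the suffix redundant vertex v eliminated."""
    out_re = [e for e in edges if e.src == v and e.kind == "re"]
    assert len(out_re) == 1, "a suffix redundant vertex has exactly one re out edge"
    w, x_gamma = out_re[0].dst, out_re[0].label
    result = []
    for e in edges:
        if e.src == v:
            continue  # steps 1 and 2: drop every edge leaving v
        if e.dst == v and e.kind == "re":
            # step 3: redirect to w and append the label of the deleted re edge
            e = Edge(e.src, w, e.label + x_gamma, "re", e.extension)
        elif e.dst == v and e.kind == "le":
            # step 4: redirect to w, keep the label, reset the suffix extension bit
            e = Edge(e.src, w, e.label, "le", False)
        result.append(e)
    return result  # step 5: v no longer appears in any edge
```

Applied to a hand-built fragment around the vertex ab of SCD(T) for S = cabcbab, the three incoming edges of ab end up pointing to abc, which is the behaviour described for Figure 11.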


vertices are eliminated. When a vertex is eliminated, the edges incident to/from it are

redirected and relabeled as described in Figures 5 to 10. The resulting graph is CSCD(s).

The set of vertices of CSCD(s) is denoted by CV(s). The set of right (left) edges of

CSCD(s) is denoted by CR(s) (CL(s)). Figure 11 shows CSCD(s) for s = < cabcbab >.

Notice that vertices c and ab have been eliminated and that the two incoming edges to c

and the three incoming edges to ab of Figure 4 now point to abc.























Figure 6: v is suffix redundant











Procedure ProcessPrefixRedundant(v)


1. Eliminate all right extension edges leaving v (there are at least two of these).

2. There is exactly one left extension edge, e, leaving v. Let the vertex that it leads to be w. Let the label on the left extension edge be $\gamma x$. Delete the edge.

3. All left edges incident on v are updated so that they point to w. Their labels are modified so that they represent the concatenation of $\gamma x$ with their original labels.

4. All right edges incident on v are updated so that they point to w. Their labels are not
modified. However, if any of these were prefix extension edges, the bit which indicates
this should be reset as these edges are no longer prefix extension edges.

5. Delete v.


Figure 7: Algorithm for processing a vertex which is prefix redundant


Figure 8: v is prefix redundant










Procedure ProcessBothRedundant(v)


1. There is exactly one right extension edge, $e_1$, leaving v. Let the vertex that it leads to be $w_1$. Let the label on the edge be $x\gamma$. Delete the edge.

2. There is exactly one left extension edge, $e_2$, leaving v. Let the vertex that it leads to be $w_2$. Let the label on the edge be $\gamma x$. Delete the edge.
{We establish later that $w_1$ and $w_2$ are, in fact, the same vertex.}

3. All right edges incident on v are updated so that they point to $w_1$. Their labels are modified so that they represent the concatenation with $x\gamma$. If any of these edges were prefix extension edges, the bit which indicates this should be reset.

4. Similarly, left edges incident on v are updated so that they point to $w_2$. Their labels are modified so that they represent the concatenation with $\gamma x$. If any of these edges were suffix extension edges, the bit which indicates this should be reset.

5. Delete v.


Figure 9: Algorithm for processing a vertex which is prefix and suffix redundant


Figure 10: v is suffix and prefix redundant










Lemma 1 For every substring $s_{i,j}$ of length < n of s, there exists a substring, $T_{l,m}$ ($= s_{i,j}$), of T such that context(T, l, m) = context(s, i, j).


Proof
Case (i): $i > j$. Clearly, $i \ne 1$, $j \ne n$. By construction, $T_{i,j+n} = s_{i,j}$ and context(T, i, j+n) $= (T_{i-1}, T_{n+j+1}) = (S_{i-1}, S_{j+1}) =$ context(s, i, j).
Case (ii): $i \le j$. Now, $T_{i,j} = T_{i+n,j+n} = s_{i,j}$.
Subcase (a): $i = 1$, $j \ne n$. context(T, n+1, n+j) $= (T_n, T_{n+j+1}) = (S_n, S_{j+1}) =$ context(s, i, j).
Subcase (b): $i \ne 1$, $j = n$. context(T, i, n) $= (T_{i-1}, T_{n+1}) = (S_{i-1}, S_1) =$ context(s, i, n).
Subcase (c): $i \ne 1$, $j \ne n$. context(T, i, j) $= (T_{i-1}, T_{j+1}) = (S_{i-1}, S_{j+1}) =$ context(s, i, j).
Subcase (d): $i = 1$, $j = n$. Not possible since the length of $s_{i,j}$ < n. □


Corollary 1 For every pattern, $\alpha$, of length < n in s, context(s, $\alpha$) $\subseteq$ context(T, $\alpha$).


Corollary 2 For every pattern, $\alpha$, of length < n in s, pred(s, $\alpha$) $\subseteq$ pred(T, $\alpha$) and succ(s, $\alpha$) $\subseteq$ succ(T, $\alpha$).


Lemma 2 Let $T_{i,j}$ be a substring of T. If $i \ne 1$, then there is a substring $s_{l,m}$ ($= T_{i,j}$) of s such that pred(s, l, m) = pred(T, i, j). If $i = 1$, pred(T, i, j) $= \infty$.


Proof If $i = 1$, the result follows from the definition of pred(T, i, j). If $i \ne 1$, choose $s_{l,m}$ so that $l = i$ if $i \le n$; $l = i - n$ if $i > n$; $m = j$ if $j \le n$; $m = j - n$ if $j > n$ (if the length of $T_{i,j}$ is greater than n, $s_{l,m}$ is assumed to wrap around once). So, pred(s, l, m) $= S_{l-1} = T_{i-1} =$ pred(T, i, j). □

Corollary 3 For every pattern, $\alpha$, of length < n in T, pred(T, $\alpha$) $- \{\infty\} \subseteq$ pred(s, $\alpha$).


Theorem 1 For every pattern $\alpha$ of length less than n, pred(s, $\alpha$) = pred(T, $\alpha$) $- \{\infty\}$ and succ(s, $\alpha$) = succ(T, $\alpha$) $- \{\infty\}$.


Proof From Corollary 2 we have pred(s, $\alpha$) $\subseteq$ pred(T, $\alpha$). So, pred(s, $\alpha$) $- \{\infty\} \subseteq$ pred(T, $\alpha$) $- \{\infty\}$ and hence pred(s, $\alpha$) $\subseteq$ pred(T, $\alpha$) $- \{\infty\}$ (since pred(s, $\alpha$) does not contain $\infty$). From Corollary 3 we have pred(s, $\alpha$) $\supseteq$ pred(T, $\alpha$) $- \{\infty\}$. So, pred(s, $\alpha$) = pred(T, $\alpha$) $- \{\infty\}$. The proof that succ(s, $\alpha$) = succ(T, $\alpha$) $- \{\infty\}$ is similar. □


Figure 11: Scdawg for < cabcbab >
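
Theorem 1 lends itself to a quick numerical sanity check. The sketch below compares, for every pattern of length less than n, the predecessor and successor sets in the circular string s with those in the linear string T = SS after the end marker is removed. It is a brute-force illustration with hypothetical function names, not part of Algorithm A.

```python
INF = "∞"  # assumed marker for the missing neighbour at either end of T


def pred_succ_linear(T, p):
    preds, succs = set(), set()
    for i in range(len(T) - len(p) + 1):
        if T.startswith(p, i):
            preds.add(T[i - 1] if i > 0 else INF)
            succs.add(T[i + len(p)] if i + len(p) < len(T) else INF)
    return preds, succs


def pred_succ_circular(S, p):
    n = len(S)
    preds, succs = set(), set()
    for i in range(n):
        if all(S[(i + d) % n] == p[d] for d in range(len(p))):
            preds.add(S[(i - 1) % n])
            succs.add(S[(i + len(p)) % n])
    return preds, succs


def theorem1_holds(S):
    """Check pred(s, a) = pred(T, a) - {INF}, and likewise for succ, for |a| < n."""
    n, T = len(S), S + S
    for k in range(1, n):
        for i in range(n):
            p = T[i:i + k]  # every pattern of length k in s starts at some i < n
            ps, ss = pred_succ_circular(S, p)
            pt, st = pred_succ_linear(T, p)
            if ps != pt - {INF} or ss != st - {INF}:
                return False
    return True
```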


Theorem 2 A vertex, v, with |de(v)| < n in V(T) is non redundant iff de(v) is a displayable entity of s.


Proof Suppose $\alpha$ is a displayable entity of s. Then, we have $|\mathrm{pred}(s,\alpha)| \ge 2$ and $|\mathrm{succ}(s,\alpha)| \ge 2$. From Theorem 1 we have $|\mathrm{pred}(T,\alpha) - \{\infty\}| \ge 2$ and $|\mathrm{succ}(T,\alpha) - \{\infty\}| \ge 2$. So, $\alpha$ is a displayable entity in T and the corresponding vertex in V(T) has at least two le and two re edges leaving it. Hence, v is not redundant.

Next, suppose there is a non redundant vertex, v, in SCD(T) with |de(v)| < n. Let $\alpha$ = de(v). Since v is not redundant, $|\mathrm{pred}(T,\alpha) - \{\infty\}| \ge 2$ and $|\mathrm{succ}(T,\alpha) - \{\infty\}| \ge 2$. From Theorem 1 we have $|\mathrm{pred}(s,\alpha)| \ge 2$ and $|\mathrm{succ}(s,\alpha)| \ge 2$. So, $\alpha$ is a displayable entity of s. □


Corollary 4 A redundant vertex in V(T) does not represent a displayable entity of s.










Lemma 3 (a) A vertex, v, in V(T) will have exactly one re (le) out edge only if de(v) is a suffix (prefix) of T.

(b) If a vertex, v, such that de(v) (|de(v)| < n) is a suffix (prefix) of T has more than one re (le) out edge, then no vertex, w, such that de(w) is a suffix (prefix) of de(v) can be suffix (prefix) redundant.


Proof (a) Suppose de(v) is not a suffix of T. Then $\infty$ is not an element of succ(T, de(v)). So, $|\mathrm{succ}(T, de(v)) - \{\infty\}| = |\mathrm{succ}(T, de(v))| \ge 2$. So, v has at least two re out edges, which is a contradiction. Hence, de(v) must be a suffix of T.
(b) Since de(w) is a suffix of de(v), a successor of de(v) must also be a successor of de(w). So, $\mathrm{succ}(T, de(w)) - \{\infty\} \supseteq \mathrm{succ}(T, de(v)) - \{\infty\}$, or $|\mathrm{succ}(T, de(w)) - \{\infty\}| \ge |\mathrm{succ}(T, de(v)) - \{\infty\}| \ge 2$ (de(v) has at least two re out edges). So, w must have at least two re out edges and cannot be suffix redundant. □

We can now show that step 2(a) of Algorithm A identifies all suffix redundant vertices in
V(T). Since it is sufficient to examine vertices corresponding to suffixes of T (Lemma 3(a)),
step 2(a) follows the chain of suffix pointers starting from the sink. If a vertex on this chain
representing a displayable entity of length < n has one re out edge, then it is marked suffix
redundant. The traversal of the chain terminates either when the source is reached or a
vertex with more than one re out edge is encountered (Lemma 3(b)). Similarly, step 2(b)
identifies all prefix redundant vertices in V(T).


4.2 Complexity Analysis


Step 1 takes O(n) time [6]. Step 2 will in the worst case traverse all the vertices in SCD(T), spending O(1) time at each. The number of vertices is bounded by O(n) [6]. So, step 2 takes O(n) time. Step 3 traverses SCD(T). Each vertex is processed once; each edge is processed at most twice (once when it is an incoming edge to the vertex being currently processed, and once when it is the out edge from the vertex currently being processed). So, Step 3 takes O(n) time (note that SCD(T) has O(n) edges).










4.3 Properties of CSCD(s)


Define the implication, $\mathrm{imp}(s,\alpha)$, of a string, $\alpha$, with respect to CSCD(s) to be the smallest superword, $\beta\alpha\gamma$, of $\alpha$ represented by a vertex in CV(s), such that there does not exist a substring $\beta_1\alpha\gamma_1$ of T where the length of the largest common suffix, $\mathrm{lcs}(\beta,\beta_1)$, of $\beta$ and $\beta_1$ is less than $\min(|\beta|,|\beta_1|)$ or the length of the largest common prefix, $\mathrm{lcp}(\gamma,\gamma_1)$, of $\gamma$ and $\gamma_1$ is less than $\min(|\gamma|,|\gamma_1|)$, if such a superword exists. Otherwise, $\mathrm{imp}(s,\alpha)$ does not exist. The additional condition (which is referred to as the uniqueness condition) that is imposed on $\mathrm{imp}(s,\alpha)$ is guaranteed for $\mathrm{imp}(T,\alpha)$ by the definition of SCD(T).

For example, let R = {abcaaaa, babcaa, cabcaa} be the smallest set of superword displayable entities of abc in s such that any superword displayable entity of abc in s is a superword of an element of R. Then, imp(s, abc) must be one of the elements of R. We have $|\mathrm{lcs}(b, c)| = 0 < \min(|b|, |c|)$. So, imp(s, abc) is neither babcaa nor cabcaa. Further, since $|\mathrm{lcs}(aaaa, aa)| = \min(|aaaa|, |aa|)$, $|\mathrm{lcp}(b, \lambda)| = \min(|b|, |\lambda|)$, and $|\mathrm{lcp}(c, \lambda)| = \min(|c|, |\lambda|)$, imp(s, abc) = abcaaaa.


Lemma 4 Let v be a suffix and prefix redundant vertex in SCDINT(T), where SCDINT(T) represents an intermediate configuration between SCD(T) and CSCD(s) just after the while statement in Step 3 of Algorithm A. Let the re and le out edges be incident on $w_1$ and $w_2$ respectively, where $de(w_1) = \mathrm{imp}(s, de(v)x) = \beta_1\, de(v)\, x \gamma_1$ and $de(w_2) = \mathrm{imp}(s, y\,de(v)) = \beta_2\, y\, de(v)\, \gamma_2$. If $w_1$ and $w_2$ are not redundant, then $w_1 = w_2$.


Proof Case 1. $|de(w_1)| < n$, $|de(w_2)| < n$.
$\beta_1$ cannot be nil (if it is, then $w_1$ is prefix redundant since $|de(w_1)| < n$ and all occurrences of de(v) except the prefix of S are preceded by y). Similarly, $\gamma_2 \ne$ nil. So, $de(w_1)$ must be of the form $\beta_3\, y\, de(v)\, x \gamma_1$, since y is the only letter that precedes de(v). Similarly, $de(w_2)$ must be of the form $\beta_2\, y\, de(v)\, x \gamma_3$. We now show that $\beta_3 = \beta_2$ and $\gamma_1 = \gamma_3$. Assume that this is not the case. Since $|de(w_1)| < n$, $|de(w_2)| < n$ and $w_1$ and $w_2$ are not redundant, $|\mathrm{pred}(s, de(w_1))|$, $|\mathrm{succ}(s, de(w_1))|$, $|\mathrm{pred}(s, de(w_2))|$, and $|\mathrm{succ}(s, de(w_2))|$ are all at least 2. So, there must exist a displayable entity, $\beta_m\, y\, de(v)\, x \gamma_m$, of s where $\beta_m$ is the largest common suffix of $\beta_3$ and $\beta_2$ and $\gamma_m$ is the largest common prefix of $\gamma_1$ and $\gamma_3$. Further, $\beta_m\, y\, de(v)\, x \gamma_m = \mathrm{imp}(s, de(v)x) = \mathrm{imp}(s, y\,de(v))$, which contradicts statements made above.
Case 2. $|de(w_1)| \ge n$, $|de(w_2)| < n$.
$\gamma_2$ cannot be nil, otherwise $w_2$ is suffix redundant. So, $de(w_2) = \beta_2\, y\, de(v)\, x \gamma_3$ as x is the only letter that follows de(v). Arguments similar to those in Case 1 show that since $|de(w_2)| < n$ and $w_2$ is not redundant, $\gamma_1$ must be a prefix of $\gamma_3$ and $\beta_1$ a suffix of $\beta_2 y$. But, then $|de(w_2)| \ge |de(w_1)| \ge n$, which is a contradiction. Hence, Case 2 cannot exist.
Case 3. $|de(w_2)| \ge n$, $|de(w_1)| < n$.
Similar to Case 2.
Case 4. $|de(w_2)| \ge n$, $|de(w_1)| \ge n$.
Figure 12 shows that for this case to occur, $S = \alpha^m$, for some $\alpha$, where $|\alpha| \le |de(v)|$. Call this the prefix/suffix redundancy invariant. The figure assumes that $|de(w_1)| = n$, that de(v) is a prefix of $de(w_1)$, and that $|de(v)| < n/2$ and divides n. However, the prefix/suffix redundancy invariant can be shown to be true in all other cases. Two copies of T are shown in the figure. The first copy shades the occurrence $(n - |de(v)| + 1, n)$ of de(v) and its extension to $de(w_1)$. The second shades the occurrence $(n + 1, n + |de(v)|)$ and its extension to $de(w_1)$. Since the shaded regions in both strings represent $de(w_1)$, we have: box 1 = box 2; box 2 = box 3; ...; box 6 = box 7. Or, box 1 = box 2 = ... = box 7, and $S = (de(v))^6$.

Figure 12: Illustration of proof of prefix/suffix redundancy invariant

Figure 13: SCD($\alpha^{2m}$)

Next, we assume without loss of generality that there is no $\beta$ such that $S = \beta^k$, $k > m$. Call this the smallest repetition assumption.
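
For a concrete string, the decomposition assumed here can be found directly. The sketch below returns the shortest $\alpha$ and the corresponding largest m with $S = \alpha^m$; the function name is hypothetical and the quadratic scan is chosen for clarity rather than efficiency.

```python
def smallest_repetition(S):
    """Return (alpha, m) such that S == alpha * m and m is as large as possible."""
    n = len(S)
    for length in range(1, n + 1):
        # The first length that divides n and tiles S gives the largest m.
        if n % length == 0 and S[:length] * (n // length) == S:
            return S[:length], n // length
```

A string with no nontrivial repetition, such as cabcbab, is returned unchanged with m = 1.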

The only occurrences of $\alpha$ in T are at $((1, |\alpha|), (|\alpha| + 1, 2|\alpha|), \ldots, ((2m-1)|\alpha| + 1, 2n))$ (if not, an argument similar to the one of Figure 12 contradicts the smallest repetition assumption). So, SCD(T) takes the form of Figure 13. Each vertex representing $\alpha^i$, $1 \le i \le 2m - 1$, has exactly one le and one re out edge as shown.

All remaining displayable entities of T are subwords of $\alpha^2$ and are of size less than $|\alpha|$ (if not, an argument identical to the one in Figure 12 contradicts the smallest repetition assumption). The vertices representing these displayable entities are represented by the box in Figure 13.

None of the vertices in the box has out edges incident on vertices representing the displayable entities $\{\alpha^3, \alpha^4, \ldots, \alpha^{2m}\}$. In particular, no out edges from the vertices in the box are incident on vertices representing displayable entities of length greater than n. After SCD($\alpha^{2m}$) has been processed by Algorithm A, all incoming edges to vertices corresponding to $\alpha$ and $\alpha^2$ in SCD($\alpha^{2m}$) are incident on the vertex corresponding to $S = \alpha^m$ in CSCD($\alpha^{2m}$). It follows that any prefix and suffix redundant vertex in SCD($\alpha^{2m}$), when processed by Step 3 of Algorithm A, can have both edges incident on $w_1$ and $w_2$ such that $|de(w_1)|$ and $|de(w_2)|$ are at least n only if $|de(w_1)| = |de(w_2)| = n$. □

CSCD(s) satisfies properties P1, P2, and P3 stated below (Theorem 3). These properties
ensure that the algorithms of [5] can be extended to circular strings.
P1: CV(s) consists of a source and a sink. For each v of CV(s) that is not the source or sink, the following are true:

(a) |de(v)| < n iff de(v) is a displayable entity of s.
(b) if $|de(v)| \ge n$, then de(v) is a displayable entity of T.
P2: There exists an re out edge corresponding to letter x in $\Sigma$ from vertex $v_1$ in CV(s) to vertex $v_2$ in CV(s) iff $\mathrm{imp}(s, de(v_1)x)$ exists and is equal to $de(v_2)$. If $de(v_2) = \beta\, de(v_1)\, x \gamma$, then the label on the re edge is $x\gamma$. If $\beta$ = nil, then the edge is a prefix extension edge.
P3: Similar to P2 but for le edges.


Theorem 3 CSCD(s) satisfies P1, P2, and P3.


Proof Property P1 is established by the knowledge that SCD(T) contains all displayable
entities of T and that Algorithm A only eliminates those displayable entities of T of length
less than n, which are not displayable entities of s (Corollary 4).

P2 and P3 are proved by induction. The induction hypothesis is:
Let $U_s$ be the subset of $U_T$ that remains after the vertex set $U_T \subseteq V(T)$ has been processed by step 3 of Algorithm A.
(I) Let $R_{U_s}$ be the set of re edges which are incident on vertices in $U_s$. For any re edge $r \in R_{U_s}$ from vertex u to w with label $x\gamma$, $\mathrm{imp}(s, de(u)x) = de(w) = \beta\, de(u)\, x \gamma$. An analogous condition holds for le edges.
(II) For each vertex u in $U_s \cup (V(T) - U_T)$, there is an re out edge corresponding to each letter x in $\mathrm{succ}(T, de(u)) - \{\infty\}$ incident on a vertex in $U_s \cup (V(T) - U_T)$. An analogous condition holds for le edges.










When $U_T = V(T)$, we have $U_s = CV(s)$, by definition. So, $R_{CV(s)} = CR(s)$. (I) establishes that these edges are incident on the correct vertices and that their labels are correct. (II) establishes that CR(s) is complete. So P2 holds. Similarly, P3 holds.

Induction Base: $U_T = U_s = \{\}$. $R_{U_s}$ and $L_{U_s}$ are empty so (I) does not apply. (II) is established from the definition of SCD(T).

Induction Step: Consider vertex, v ($v \in V(T)$), which is about to be processed by step 3 of Algorithm A. Let $U_T'$ and $U_s'$ denote $U_T$ and $U_s$ respectively after v has been processed. We must show that (I) and (II) hold for $U_s'$ and $U_T'$. Since the vertices are processed in reverse topological order, all out edges from v are incident on vertices in $U_s$ and are therefore elements of $R_{U_s}$ or $L_{U_s}$. So, they must satisfy (I).
Case 1: v is not redundant. $U_T' = U_T \cup \{v\}$; $U_s' = U_s \cup \{v\}$ since v is not eliminated. We must show that (I) is true for incoming edges to v as these are the only additions to $R_{U_s}$ and $L_{U_s}$. I.e., $R_{U_s'} = R_{U_s}$ + {incoming right edges to v}, $L_{U_s'} = L_{U_s}$ + {incoming left edges to v}.

Let e be an re edge with label $z\gamma$ from u to v. From the definition of SCD(T), we have $de(v) = \mathrm{imp}(T, de(u)z) = \beta\, de(u)\, z \gamma$ for some $\beta$. $\mathrm{imp}(T, de(u)z)$ is the smallest superword of $de(u)z$ in $\{de(w) \mid w \in V(T)\}$. Since $CV(s) \subseteq V(T)$, $\{de(w) \mid w \in CV(s)\} \subseteq \{de(w) \mid w \in V(T)\}$ and $\mathrm{imp}(s, de(u)z) = \mathrm{imp}(T, de(u)z)$ iff $\mathrm{imp}(T, de(u)z) \in \{de(w) \mid w \in CV(s)\}$. But, this is true since $v \in CV(s)$. So, $de(v) = \mathrm{imp}(s, de(u)z) = \beta\, de(u)\, z \gamma$ and (I) is satisfied. A symmetric argument can be made for incoming le edges to v.

The letter of the alphabet to which an re (le) out edge corresponds is the first (last)
character in its label. Since no out edges are added, deleted, or redirected and the labels
of all out edges are unchanged, each vertex has an re/le out edge corresponding to the
same letter of the alphabet as it had prior to processing vertex v. So, (II) holds (induction
hypothesis).

Case 2: v is redundant. $U_T' = U_T \cup \{v\}$. $U_s' = U_s$, since v is eliminated.

Subcase (a): v is suffix redundant only. By definition, v has a single re out edge, e, to a vertex w in $U_s$. Let label(e) = $x\gamma$. From the induction hypothesis, $\mathrm{imp}(s, de(v)x) = de(w) = \beta\, de(v)\, x \gamma$. We first establish that (i) $de(w) = \mathrm{imp}(s, de(v))$ and (ii) de(v) is a prefix of de(w).

$imp(s, de(v)) \neq de(v)$ as v is redundant. So, imp(s, de(v)) must correspond to a vertex
on which one of the out edges from v is incident, since there is an out edge corresponding
to each element in $pred(s, de(v)) \cup succ(s, de(v))$ (from (II)). The single re edge is incident
on w, which represents imp(s, de(v)x). The left out edges from v are incident on vertices
which represent $imp(s, x_i\, de(v))$ for $1 \le i \le |pred(s, de(v))|$, where $|pred(s, de(v))| \ge 2$. From the definition of
imp(s, de(v)), none of these vertices can possibly represent imp(s, de(v)). For instance, if
$imp(s, x_i\, de(v))$ is imp(s, de(v)), then the string $imp(s, x_j\, de(v))$, $i \neq j$, would invalidate
the definition.

So, imp(s, de(v)) must be de(w). However, for this to be true, we must show that $\beta$ =
nil and therefore that de(v) is a prefix of de(w). All occurrences of de(v) in s are followed
by x. So, $|pred(T, de(v)x) - \{\infty\}| = |pred(s, de(v)x)| = |pred(s, de(v))| \ge 2$. An argument
similar to the one in the previous paragraph shows that for imp(s, de(v)x) to exist, $\beta$ = nil.

We have $R_{U'_s} = R_{U_s}$ - {single re out edge from v} + {incoming re edges to v} and
$L_{U'_s} = L_{U_s}$ - {le out edges from v} + {incoming le edges to v}. (I) and (II) do not apply to the
edges deleted from $R_{U_s}$ and $L_{U_s}$. So, we only need to prove (I) and (II) for incoming edges
to v.

Let $e_R$ be an re edge incident on v from vertex $u_R$ with label $y\gamma_1$ so that
$de(v) = imp(T, de(u_R)y) = \beta_1\, de(u_R)\, y\gamma_1$. $e_R$ must be redirected to $imp(s, de(u_R)y)$ for (I) to
hold. $imp(T, de(u_R)y) = de(v)$ is the smallest superword of $de(u_R)y$ in $\{de(a) \mid a \in V(T)\}$.
$imp(s, de(u_R)y)$ is the smallest superword of $de(u_R)y$ that satisfies the uniqueness condition
in $\{de(a) \mid a \in CV(s)\} \subseteq \{de(a) \mid a \in V(T)\}$. Since $v \notin CV(s)$, $imp(s, de(u_R)y)$ is the
smallest superword of de(v) that satisfies the uniqueness condition in $\{de(a) \mid a \in CV(s)\}$.
So $imp(s, de(u_R)y) = imp(s, de(v)) = de(w) = \beta_1\, de(u_R)\, y\gamma_1 x\gamma$. The updated re edge, $e_R$,
is incident on w and has label $y\gamma_1 x\gamma$, which was obtained in step 3 of Algorithm A by
concatenating label($e_R$) with label(e). If $\beta_1$ = nil, then $e_R$ continues to be a prefix extension
edge. $e_R$ satisfies (I).

Let $e_L$ be an le edge incident on v from $u_L$ so that $de(v) = imp(T, z\, de(u_L)) = \gamma_2 z\, de(u_L)\, \beta_2$
(label($e_L$) = $\gamma_2 z$). Using the same argument that was used for $e_R$, we have
$imp(s, z\, de(u_L)) = de(w) = \gamma_2 z\, de(u_L)\, \beta_2 x\gamma$. $e_L$ is redirected to w and its label remains unchanged. Clearly,
$e_L$ is no longer a suffix extension edge even if $\beta_2$ = nil, because $x\gamma \neq$ nil. So, $e_L$ satisfies (I).

Notice that (II) continues to be satisfied as each out edge corresponding to any vertex
in $U'_s \cup (V(T) - U_T)$ continues to be associated with the same character (in particular,
label($e_R$) continues to begin with y and label($e_L$) continues to end with z); and each out
edge continues to leave the same vertex (in particular, $e_R$ continues to leave $u_R$, $e_L$ continues
to leave $u_L$).

Subcase (b): v is prefix redundant only. Symmetric to subcase (a).

Subcase (c): v is prefix and suffix redundant. So, v has one re out edge, $e_1$, to
vertex $w_1$ in CV(s). Let label($e_1$) = $x\gamma_1$. Also, v has one le out edge, $e_2$, to vertex $w_2$ in
CV(s). Let label($e_2$) = $\gamma_2 y$.

From the induction hypothesis, $de(w_1) = imp(s, de(v)x) = \beta_1\, de(v)\, x\gamma_1$ and
$de(w_2) = imp(s, y\, de(v)) = \gamma_2 y\, de(v)\, \beta_2$.

The conditions for Lemma 4 are satisfied since $w_1$ and $w_2$ are not redundant (otherwise
they would have been eliminated). Thus, $de(w_1) = de(w_2) = de(w)$ (say). imp(s, de(v)) can
either be imp(s, de(v)x) or imp(s, y de(v)). But, both these expressions are equal to de(w).
So, imp(s, de(v)) = de(w).

The proof that (I) and (II) are satisfied is similar to that for subcase (a). Note, however,
that any incoming prefix/suffix extension edges to v will no longer remain prefix/suffix
extension edges as $x\gamma_1$ and $\gamma_2 y$ are not nil. □



5 Computing Occurrences of Displayable Entities


Procedure LinearOccurrences(S, v) of Figure 14, which is based on the outline in [6], reports
the end position of each occurrence of de(v), $v \in V(S)$, in the linear string S. However,
invoking LinearOccurrences(T, v), $v \in CV(s)$, does not immediately yield all occurrences
of de(v) in T. In Section 5.1 we present a modification which obtains all occurrences of
displayable entities of s. In Section 5.2 we show that this modification is correct and that
its time complexity is optimal.
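
The recursion of Figure 14 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Vertex` class, its `re_edges` field, and the hand-built two-vertex fragment for S = abcabc are hypothetical, and the scdawg is taken as given rather than constructed.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """Hypothetical scdawg vertex: its displayable entity and its re out edges."""
    de: str
    re_edges: list = field(default_factory=list)  # list of (label, target Vertex)


def linear_occurrences(S, v):
    """Return the 1-based end positions of the occurrences of v.de found in S."""
    ends = []

    def occurrences(w, i):
        # i is the total length of the re edge labels followed from v to w,
        # i.e. how far the end of de(w) lies to the right of the end of de(v).
        if S.endswith(w.de):
            ends.append(len(S) - i)
        for label, target in w.re_edges:
            occurrences(target, len(label) + i)

    occurrences(v, 0)
    return ends


# Hand-built fragment (an assumption) for S = "abcabc": the vertex for "abc"
# has one re out edge, labeled "abc", to the sink vertex representing S.
S = "abcabc"
sink = Vertex(S)
abc = Vertex("abc", [("abc", sink)])
print(sorted(linear_occurrences(S, abc)))  # [3, 6]
```

The offset i plays the role of the third parameter of Occurrences: each re edge followed pushes the end of the current displayable entity |label(e)| characters further right, so the end of the original pattern is recovered by subtracting i from |S|.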


5.1 Algorithm


An auxiliary boolean array, reported[1..n], is used in conjunction with CSCD(s). Initially,
all elements of this array are set to false. Procedure CircOccurrences(s, v) of Figure 15
computes the end positions of each de(v) ($v \in CV(s)$) in s. LinearOccurrences(T, v) of
line 1 will not necessarily compute all occurrences of de(v) in T, since it is being executed
on CSCD(s) and not on SCD(T). Note, also, that an occurrence of de(v) ending at
position i ($i \le n$) in T has an identical occurrence ending at position n + i in T (since
T = S.S). Both these occurrences correspond to the same occurrence of de(v) in s. So,
if LinearOccurrences(T, v) reports both occurrences, then only the single corresponding
occurrence of de(v) in s must eventually be reported.

Lines 4-7 transform the occurrence l, if necessary, so that it represents a value between
1 and n. If this occurrence has not already been listed, then it is added to the list of
occurrences and the corresponding element of reported is set to true. If the occurrence has
been listed then it is a duplicate (lines 8-12). After all occurrences have been computed,
all elements of reported are reset to false (lines 14, 15) so that reported can subsequently be
reused to compute the occurrences of some other displayable entity in s.

In the example of Figure 11, LinearOccurrences(T, v), where v represents abc, does report
the end positions of all occurrences of abc in T (i.e., 4, 8, and 11). Lines 2 to 12 transform
this into the list of end positions of abc in s (i.e., 1 and 4) corresponding to $s_{6,1}$ and $s_{2,4}$
respectively.
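
Lines 2 to 15 of Figure 15 can be sketched in Python as below. The function name `fold_occurrences` and the list-based `reported` array are assumptions made for illustration; the value n = 7 is inferred from the positions quoted above (8 maps to 1 and 11 maps to 4) rather than stated in this section.

```python
def fold_occurrences(linear_ends, n, reported):
    """Map end positions in T = S.S to end positions in s, dropping duplicates.

    reported is a list of n + 1 booleans (index 0 unused), all False on entry
    and all False again on exit, so it can be reused for the next query.
    """
    final = []
    for l in linear_ends:
        k = l - n if l > n else l   # lines 4-7: bring l into the range 1..n
        if not reported[k]:         # lines 8-12: keep only the first copy
            final.append(k)
            reported[k] = True
    for k in final:                 # lines 14-15: reset for later reuse
        reported[k] = False
    return final


n = 7
reported = [False] * (n + 1)
print(fold_occurrences([4, 8, 11], n, reported))  # [4, 1]
```

Resetting only the entries that were set, rather than the whole array, keeps the cost proportional to the number of reported occurrences instead of n.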

Figure 16 shows the occurrences of de(v), de(w), and de(x) in a hypothetical string T = SS.
Figure 17 shows some fragments of its scdawg. v is suffix redundant in SCD(T) and its
single re out edge is incident on w. There is an re edge from x to v and x is not redundant.
By construction, the re edge from x to v in SCD(T) becomes an re edge from x to w
in CSCD(s). Procedure LinearOccurrences(T, x), x in CSCD(s), will fail to yield the
rightmost occurrence of de(x) in T, since that occurrence is neither a subword of de(w)














Procedure LinearOccurrences(S:string, v:vertex)
{Obtain all occurrences of de(v), v ∈ V(S), in S}
   Occurrences(S, v, 0);


Procedure Occurrences(S:linear string, v:vertex, i:integer)
{i is the total length of the labels of the re edges followed so far}
begin
   if de(v) is a suffix of S
      then output(|S| - i);
   for each re out edge, e, from v in SCD(S) do
   begin
      let w be the vertex on which e is incident;
      Occurrences(S, w, |label(e)| + i);
   end;
end;



Figure 14: Obtaining all occurrences of a displayable entity in a linear string







Procedure CircOccurrences(s:circular string, v:vertex)
{v is a vertex in CSCD(s)}
1   LinearOccurrences(linearize(s).linearize(s), v);
2   for each reported occurrence l of de(v) do
3   begin
4      if (l > |s|) then
5         k := l - |s|
6      else
7         k := l;
8      if not reported[k] then
9      begin
10        add k to final list of occurrences;
11        reported[k] := true
12     end
13  end
14  for each occurrence, l, of de(v) in s do
15     reported[l] := false;



Figure 15: Obtaining all occurrences of a displayable entity in a circular string












Figure 16: Example string (diagram marking occurrences of de(v), de(x), and de(w) along T = SS)


nor a suffix of T. In the next section, we show that CircOccurrences(s, x) computes all
occurrences of de(x) in s in spite of the fact that LinearOccurrences(T, x) does not compute
all occurrences of de(x) in T.


5.2 Proof of Correctness


Let $T_{i,j} = de(v)$, $v \in V(T)$, be a substring of T. Assume that $T_{i,j}$ is not a suffix of T
(i.e., $j \neq 2n$). Let $y = imp(T, de(v)T_{j+1}) = \beta\, de(v)\, T_{j+1}\gamma$. Then, define the immediate
right extension $IRE(SCD(T), T_{i,j})$ of $T_{i,j}$ in SCD(T) to be the occurrence $T_{i-|\beta|,\, j+|\gamma|+1}$ of
displayable entity, y.

Let $T_{i,j} = de(v)$, $v \in CV(s)$, be a substring of T. Assume that $T_{i,j}$ is not a suffix of
T (i.e., $j \neq 2n$). Let $y = imp(s, de(v)T_{j+1}) = \beta\, de(v)\, T_{j+1}\gamma$. Then, define the immediate
right extension $IRE(CSCD(s), T_{i,j})$ of $T_{i,j}$ in CSCD(s) to be the occurrence $T_{i-|\beta|,\, j+|\gamma|+1}$
of displayable entity, y.

So, if in Figure 16, $de(v) = \gamma\, de(x)\, \alpha$ and $de(w) = de(v)\beta$, then
$IRE(SCD(T), T_{2n-|de(x)\alpha|+1,\, 2n-|\alpha|}) = T_{2n-|de(v)|+1,\, 2n}$, which is the occurrence of de(v) corresponding to the suffix of T. However,
$IRE(CSCD(s), T_{2n-|de(x)\alpha|+1,\, 2n-|\alpha|}) = T_{2n-|de(v)|+1,\, 2n+|\beta|}$, which does not represent a valid
substring of T.
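
The index arithmetic behind the immediate right extension can be sketched as follows. The helper names `ire` and `is_substring_of_T` are hypothetical, and the lengths of the left and right context of y are passed in directly instead of being read off an scdawg.

```python
def ire(i, j, beta_len, gamma_len):
    """Indices of the immediate right extension of T[i..j] (1-based, inclusive).

    beta_len and gamma_len are the lengths of the strings that precede de(v)
    and follow the character T[j+1] in the displayable entity y.
    """
    return i - beta_len, j + gamma_len + 1


def is_substring_of_T(i, j, n):
    """True iff (i, j) denotes a substring of T, where |T| = 2n."""
    return 1 <= i <= j <= 2 * n


n = 5
inside = ire(7, 8, beta_len=1, gamma_len=1)    # (6, 10): still inside T
outside = ire(7, 8, beta_len=1, gamma_len=2)   # (6, 11): runs past position 2n
print(inside, is_substring_of_T(*inside, n))
print(outside, is_substring_of_T(*outside, n))
```

The second call mirrors the situation in Figure 16, where the extension taken in CSCD(s) ends beyond position 2n and so does not denote a substring of T.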





















Figure 17: Fragments of scdawgs corresponding to Figure 16 (diagram with a legend distinguishing left extension edges from right extension edges)

Let DAWG represent either SCD(T) or CSCD(s). Then $IRE^k(DAWG, T_{i,j})$ denotes
$IRE(DAWG, IRE^{k-1}(DAWG, T_{i,j}))$ if $k \ge 1$, and $T_{i,j}$ if k = 0.

An occurrence $T_{i,j} = de(v)$, $v \in V(T)$, is said to be Right Retrievable (RR) in SCD(T)
iff one of the following is true:

(i) j = 2n.
(ii) $j \neq 2n$ and $IRE(SCD(T), T_{i,j})$ is RR in SCD(T).

Similarly, an occurrence $T_{i,j} = de(v)$, $v \in CV(s)$, is said to be Right Retrievable (RR) in
CSCD(s) iff one of the following is true:
(i) j = 2n.
(ii) $j \neq 2n$ and $IRE(CSCD(s), T_{i,j})$ is RR in CSCD(s).

$IRE(CSCD(s), T_{i,j})$ is defined for any occurrence, $T_{i,j} = de(v)$, $v \in CV(s)$, where
$j \neq 2n$. So, $T_{i,j}$ is not RR in CSCD(s) only if (i) $IRE(CSCD(s), T_{i,j})$ does not represent
a substring of T or (ii) $IRE(CSCD(s), T_{i,j})$ is a valid substring of T, but is not RR in
CSCD(s).

In the example of Figure 16, $T_{2n-|de(x)\alpha|+1,\, 2n-|\alpha|}$ is RR in SCD(T), but not RR in
CSCD(s).

Notice that $(i_p, j_p) = IRE(CSCD(s), T_{i,j})$ is not a substring of T iff $i_p < 1$ or $j_p > 2n$.


Lemma 5 For $k \ge 1$, if $IRE^{k-1}(CSCD(s), T_{i,j})$ and $IRE^{k-1}(CSCD(s), T_{i+n,j+n})$ represent
substrings of T and if $(i_p, j_p) = IRE^k(CSCD(s), T_{i,j})$ and $(i_q, j_q) = IRE^k(CSCD(s), T_{i+n,j+n})$,
then $i_p + n = i_q$ and $j_p + n = j_q$.


Proof Assume that there exists a pair of substrings $T_{i_1,j_1}$ and $T_{i_2,j_2}$ of T, such that $i_2 =
i_1 + n$ and $j_2 = j_1 + n$ and that $j_1 < n$ (i.e., we are assuming that their IRE's are defined).
By symmetry, both occurrences represent the same displayable entity (say, de(v)). Further,
$T_{j_1+1} = T_{j_2+1}$ (also by symmetry). Clearly, $imp(s, de(v)T_{j_1+1}) = imp(s, de(v)T_{j_2+1})$. If
$(i_3, j_3) = IRE(CSCD(s), T_{i_1,j_1})$ and $(i_4, j_4) = IRE(CSCD(s), T_{i_2,j_2})$, then from the
definition of IRE, we have $i_4 = i_3 + n$ and $j_4 = j_3 + n$. Applying this argument repeatedly
proves the lemma. □
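
The single step used in this proof can be written out explicitly. As a sketch, write the common displayable entity as y, with β denoting the part of y that precedes de(v) and γ the part that follows the next character, as in the definition of IRE; then:

```latex
\begin{align*}
(i_3, j_3) &= (i_1 - |\beta|,\; j_1 + |\gamma| + 1),\\
(i_4, j_4) &= (i_2 - |\beta|,\; j_2 + |\gamma| + 1)\\
           &= (i_1 + n - |\beta|,\; j_1 + n + |\gamma| + 1)\\
           &= (i_3 + n,\; j_3 + n).
\end{align*}
```

Both extensions use the same β and γ because both occurrences are extended to the same displayable entity, which is what makes the offset n carry through unchanged.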


Lemma 6 The RR occurrences of de(v), v in V(T) (CV(s)) in SCD(T) (CSCD(s)) are
exactly those occurrences of de(v) which are obtained by LinearOccurrences(T, v).


Proof Follows from the definition of RR occurrences. □


Corollary 5 All occurrences of a pattern de(v) ($v \in V(T)$) in T are obtained by LinearOccurrences(T, v).


Lemma 7 All occurrences of de(v) in T, $v \in CV(s)$, where $|de(v)| \ge n$ are obtained by
LinearOccurrences(T, v).


Proof This follows from Corollary 5 and the construction of CSCD(s) in which no right
out edges from vertices representing displayable entities of size $\ge n$ were modified. □


Lemma 8 All occurrences, $T_{i,j}$, of de(v), where $|de(v)| < n$, $v \in CV(s)$ with $i \le n$, $j > n$
are RR in CSCD(s).










Proof Assume that the lemma is false and that there exists an occurrence, $T_{i,j}$, of de(v)
with $i \le n$, $j > n$ which is not RR in CSCD(s).

Clearly, $j \neq 2n$, otherwise $T_{i,j}$ would be RR in CSCD(s). Let last denote the smallest
value of k for which $IRE^k(CSCD(s), T_{i,j})$ is not a substring of T. Such a last $\ge 1$ must
exist since $T_{i,j}$ is not RR. Let $(i_{last}, j_{last})$ denote $IRE^{last}(CSCD(s), T_{i,j})$. Let z be the
vertex in CV(s) to which $T_{i_{last}, j_{last}}$ corresponds.
Case 1. $i_{last} < 1$
Clearly, $n < j_{last} \le 2n$. Consider the string $T_{1, j_{last}}$ in T. Its length is greater than n. If there
were two occurrences of this string in T, then it would be a displayable entity of length > n
(because (i) $T_{1, j_{last}}$ does not have a predecessor and (ii) de(z) is maximal and its occurrences
are not all followed by the same letter). A vertex corresponding to this displayable entity
would not have been eliminated by Algorithm A since its length would be > n and $T_{1, j_{last}}$
would be RR in CSCD(s) (Lemma 7). So, there must exist only one occurrence of the
string represented by $T_{1, j_{last}}$. But, this string is a proper suffix of de(z) which means that
one of its occurrences is preceded by a character. So, there are two occurrences of this
string. This leads to a contradiction.
Case 2. $j_{last} > 2n$
The proof is similar to the one for Case 1. □


Lemma 9 At least one of the two occurrences, $T_{i,j}$ and $T_{i+n,j+n}$, of de(v), $|de(v)| < n$,
$v \in CV(s)$, with $i, j \le n$ is RR in CSCD(s).


Proof Assume that the lemma is false. Let last be the smallest value of k for which either
$IRE^k(CSCD(s), T_{i,j})$ or $IRE^k(CSCD(s), T_{i+n,j+n})$ is not a substring of T. Let $(i_p, j_p) =
IRE^{last}(CSCD(s), T_{i,j})$ and $(i_q, j_q) = IRE^{last}(CSCD(s), T_{i+n,j+n})$.
Case 1. $IRE^{last}(CSCD(s), T_{i,j})$ is not a substring of T; $IRE^{last}(CSCD(s), T_{i+n,j+n})$ is a
substring of T.
I.e., $i_p < 1$ and $j_q \le 2n$. So, $j_q > n$ and $i_q \le n$ (from Lemma 5). $(i_q, j_q)$ is RR in CSCD(s),
since $(i_q, j_q)$ satisfies the conditions of Lemma 8.
Case 2. $IRE^{last}(CSCD(s), T_{i,j})$ is a substring of T; $IRE^{last}(CSCD(s), T_{i+n,j+n})$ is not a
substring of T.
Symmetric to Case 1.
Case 3. $IRE^{last}(CSCD(s), T_{i,j})$ is not a substring of T; $IRE^{last}(CSCD(s), T_{i+n,j+n})$ is
not a substring of T.
I.e., $i_p < 1$ and $j_q > 2n$. So, $j_p > n$ and $i_q \le n$ (Lemma 5). This is shown to cause a
contradiction by an argument similar to the one in Lemma 8. □


Theorem 4 Procedure CircOccurrences(s, v) correctly obtains all occurrences of de(v) in s.


Proof Lemma 6 shows that LinearOccurrences(T, v) computes all RR occurrences of de(v)
in CSCD(s). Lemmas 8 and 9 show that each occurrence of de(v) in s has at least one
corresponding occurrence in T, which is RR in CSCD(s). CircOccurrences computes these
occurrences in T and transforms them so that they represent occurrences in s, removing
duplicates if any. So, the output is a list of all occurrences of de(v) in s. □


Theorem 5 Procedure CircOccurrences is optimal.


Proof Procedure CircOccurrences(s, v) takes O(|occ(T, v)|) time, where |occ(T, v)| is the
number of occurrences of de(v) in T. Each for loop takes O(|occ(T, v)|) time.

But, $|occ(T, v)| \le 2|occ(s, v)|$, where |occ(s, v)| is the number of occurrences of de(v) in
s. So, the complexity is O(|occ(s, v)|). |occ(s, v)| is the size of the output, so the algorithm
is optimal. □



6 Computing Conflicts Efficiently


[4] defines the concept of conflicts and explains its importance in the analysis and visual-
ization of strings. Formally,

(i) A subword conflict between two displayable entities, D1 and D2, in S exists iff D1 is a
substring of D2.
(ii) A prefix-suffix conflict between two displayable entities, D1 and D2, in S exists iff there
exist substrings, $S_p$, $S_m$, $S_s$ in S such that $S_p S_m S_s$ occurs in S and $S_p S_m$ = D1 and $S_m S_s$
= D2. The string, $S_m$, is known as the intersection of the conflict; the conflict is said to
occur between D1 and D2 with respect to $S_m$.
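
The two definitions can be checked directly on small inputs with a brute-force sketch such as the one below. It is only an illustration of the definitions, not the scdawg-based algorithms of [5]; the function names are hypothetical, the helpers do not verify that their arguments are displayable entities, and requiring the three pieces of a prefix-suffix conflict to be non-empty (and the subword to be proper) is an assumption.

```python
def has_subword_conflict(d1, d2):
    """True iff d1 is a substring of d2 (taken here to be a proper one)."""
    return d1 != d2 and d1 in d2


def prefix_suffix_intersections(S, d1, d2):
    """Return every intersection s_m such that d1 = s_p + s_m, d2 = s_m + s_s,
    and s_p + s_m + s_s occurs in S, with s_p, s_m, s_s all non-empty."""
    found = []
    for m in range(1, min(len(d1), len(d2))):
        s_m = d1[-m:]
        # d1 + d2[m:] is exactly s_p + s_m + s_s when d2 starts with s_m.
        if d2.startswith(s_m) and (d1 + d2[m:]) in S:
            found.append(s_m)
    return found


S = "abcdxabcyzbcd"
print(has_subword_conflict("bc", "abc"))              # True
print(prefix_suffix_intersections(S, "abc", "bcd"))   # ['bc']
```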

[4] also identified a number of problems relating to the computation of conflicts in a linear

string, while [5] presented efficient algorithms for most of these problems (some of which

are listed in the next section). These algorithms typically involve sophisticated traversals

or operations on the scdawg for linear strings. Our extension of scdawgs to circular strings

makes it possible to use the same algorithms to solve the corresponding problems for circular

strings with some minor modifications which are outlined below.

There are conceptually two kinds of traversals that the algorithms of [5] perform on an

scdawg corresponding to a linear string:

(i) Traversal of displayable entities of the string. In these traversals, a vertex is traversed

specifically because it represents a displayable entity of the string.

(ii) Incidental traversals. In these traversals, a vertex is not traversed because it is a

displayable entity, but because it performs some other function. For example, this includes

vertices traversed by LinearOccurrences(T, v).

Traversals of type (i) in CSCD(s) are not required to traverse vertices which represent

displayable entities of size greater than or equal to n. This may be achieved simply by

disabling edges in CSCD(s) which leave a vertex representing a displayable entity of size

less than n and are incident on a vertex representing a displayable entity of size greater

than or equal to n. Traversals of type (ii), however, may be required to traverse vertices

representing displayable entities of size greater than or equal to n. This is achieved by

associating a bit for each edge which is set to 1 if it represents an edge from a vertex whose

displayable entity is of size less than n to a vertex whose displayable entity is of size greater

than or equal to n. Otherwise, it is set to 0. Type (i) traversals check the bit, while type

(ii) traversals ignore it.
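
The per-edge bit can be sketched as follows. The `Vertex` and `Edge` classes, the `add_edge` helper, and the three-vertex chain are assumptions made for the example and are not taken from [5].

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str
    size: int                                       # length of de(v)
    out_edges: list = field(default_factory=list)   # list of Edge


@dataclass
class Edge:
    target: Vertex
    crosses_n: bool   # the bit: source size < n and target size >= n


def add_edge(u, w, n):
    u.out_edges.append(Edge(w, u.size < n and w.size >= n))


def reachable(v, honor_bit):
    """Names of the vertices visited from v; type (i) traversals honor the bit."""
    seen, stack = [], [v]
    while stack:
        u = stack.pop()
        if u.name in seen:
            continue
        seen.append(u.name)
        for e in u.out_edges:
            if honor_bit and e.crosses_n:
                continue   # treat the edge as disabled
            stack.append(e.target)
    return seen


n = 4
a, b, c = Vertex("a", 2), Vertex("b", 3), Vertex("c", 5)
add_edge(a, b, n)
add_edge(b, c, n)
print(reachable(a, honor_bit=True))    # ['a', 'b']
print(reachable(a, honor_bit=False))   # ['a', 'b', 'c']
```

Storing the bit on the edge lets both kinds of traversal share one graph: nothing is deleted, and the choice between them is a single test per edge.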

Finally, all calls to LinearOccurrences are replaced by calls to CircOccurrences.










7 Other Queries


In this section, we list queries that a system for the visualization and analysis of circular

strings would support. [5] contains algorithms for these same queries for linear strings. In

the previous section, we showed how these algorithms could be modified to support these

queries.


Size Restricted Queries: Experimental data show that random strings contain a large

number of displayable entities whose lengths are small. In most applications, small dis-

playable entities are uninteresting. Hence, it is useful to list only those displayable entities

whose lengths are greater than some integer, k. Similarly, it is useful to report exactly those

conflicts in which the conflicting displayable entities have length greater than k. This gives

rise to the following problems:

(1) List all occurrences of displayable entities whose length is greater than k.

(2) Compute all prefix-suffix conflicts involving displayable entities of length greater than

k.

(3) Compute all subword conflicts involving displayable entities of length greater than k.

An alternative formulation of the problem which also seeks to achieve the goal outlined

above is based on reporting only those conflicts whose size is greater than k. The size of a
conflict is defined below:

The overlap of a conflict is defined as the string common to the conflicting displayable

entities. The overlap of a subword conflict is the subword displayable entity. The overlap of

a prefix-suffix conflict is its intersection. The size of a conflict is the length of the overlap.

This formulation of the problem is particularly relevant when the conflicts are of more

interest than the displayable entities. It also ensures that all conflicting displayable entities

reported have size greater than k. We have the following problems:

(4) Obtain all prefix-suffix conflicts of size greater than some integer k.

(5) Obtain all subword conflicts of size greater than some integer k.
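
The size definition and the filtering in problems (4) and (5) can be sketched as below. The tuple layout (kind, D1, D2, intersection) and the function names are assumptions made for illustration; the conflicts themselves are taken as already computed.

```python
def conflict_size(kind, d1, d2, intersection=None):
    """Length of the overlap of a conflict."""
    if kind == "subword":
        return len(d1)             # d1 is the subword displayable entity
    if kind == "prefix-suffix":
        return len(intersection)   # the overlap is the intersection
    raise ValueError(kind)


def conflicts_larger_than(conflicts, k):
    """Keep only the conflicts whose size is greater than k."""
    return [c for c in conflicts if conflict_size(*c) > k]


conflicts = [
    ("subword", "bc", "abcd"),
    ("prefix-suffix", "abcde", "cdefg", "cde"),
]
print(conflicts_larger_than(conflicts, 2))
```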










Pattern Restricted Queries: These queries are useful in applications where the fact
that two patterns have a conflict is more important than the number or location of the
conflicts. The following problems arise as a result:
(6) List all pairs of displayable entities which have subword conflicts.
(7) List all triplets of displayable entities (D1, D2, Dm) such that there is a prefix-suffix
conflict between D1 and D2 with respect to Dm.
(8) Same as 6, but size restricted as in 5.
(9) Same as 7, but size restricted as in 4.


Statistical Queries: These queries are useful when conclusions are to be drawn from
the data based on statistical facts.
(10) For each pair of displayable entities, D1 and D2, involved in a subword conflict (D1
is the subword of D2), obtain p(D1, D2) = (number of occurrences of D1 which occur as
subwords of D2)/(number of occurrences of D1).
(11) For each pair of displayable entities, D1 and D2, involved in a prefix-suffix conflict,
obtain q(D1, D2) = (number of occurrences of D1 which have prefix-suffix conflicts with D2)
/(number of occurrences of D1).
If p(D1, D2) or q(D1, D2) is greater than a statistically determined threshold, then the
following could be said with some confidence: Presence of D1 implies presence of D2.
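
The ratio p(D1, D2) of query (10) can be illustrated by brute force on a linear string, as sketched below. The function names are hypothetical, D1 is assumed to occur at least once, and occurrences are found by direct scanning rather than with the scdawg; for a circular string the occurrence lists would come from CircOccurrences instead.

```python
from fractions import Fraction


def starts(S, p):
    """0-based start positions of all (possibly overlapping) occurrences of p."""
    return [i for i in range(len(S) - len(p) + 1) if S.startswith(p, i)]


def p_ratio(S, d1, d2):
    """Fraction of the occurrences of d1 that lie inside an occurrence of d2."""
    occ1, occ2 = starts(S, d1), starts(S, d2)
    inside = sum(
        any(b <= a and a + len(d1) <= b + len(d2) for b in occ2)
        for a in occ1
    )
    return Fraction(inside, len(occ1))


# "ab" occurs three times; only the first lies inside the occurrence of "abc".
print(p_ratio("abcabxab", "ab", "abc"))  # 1/3
```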





8 Applications


Circular strings may be used to represent circular genomes [1] such as G4 and φX174. The
detection and analysis of patterns in genomes helps to provide insights into the evolution,
structure, and function of organisms. [1] analyzes G4 and φX174 by linearizing and then
constructing their scdawg. Our work improves upon [1] by:

(i) analyzing circular strings without risking the "loss" of patterns.
(ii) extending the analysis and visualization techniques of [5] for linear strings to circular
strings.










Circular strings in the form of chain codes are also used to represent closed curves in

computer vision [11]. The objects of Figure 18(a) are represented in chain code as follows:
(1) Arbitrarily choose a pixel through which the curve passes. In the diagram, the starting

pixels for the chain code representation of objects 1 and 2 are marked by arrows.

(2) Traverse the curve in the clockwise direction. At each move from one pixel to the next,

the direction of the move is recorded according to the convention shown in Figure 18(b).

Objects 1 and 2 are represented by 1122102243244666666666 and 6666666611:.' .' 11:, .::-, ', ,'6

respectively. The alphabet is {0, 1, 2, 3, 4, 5, 6, 7} which is fixed and of constant size (8) and

therefore satisfies the condition of Section 2. We may now use the visualization techniques of

[5] to compare the two objects. For example, our methods would show that objects 1 and 2

share the segments S1 and S2 (Figure 18(c)) corresponding to 0224 and 244666666661122

respectively. Information on other common segments would also be available. The tech-
niques of this paper make it possible to detect all patterns irrespective of the starting pixels

chosen for the two objects.
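
The independence from the starting pixel can be illustrated with a brute-force search on the chain code of object 1 given above. This sketch scans a linearization extended by |p| - 1 characters rather than using CSCD(s); the function name `circular_occurrences` is hypothetical.

```python
def circular_occurrences(s, p):
    """1-based start positions of pattern p in the circular string s."""
    n = len(s)
    if not 0 < len(p) <= n:
        return []
    t = s + s[:len(p) - 1]   # just enough wrap-around to cover every start
    return [i + 1 for i in range(n) if t.startswith(p, i)]


obj1 = "1122102243244666666666"
print(circular_occurrences(obj1, "0224"))   # [6]
print(circular_occurrences(obj1, "6611"))   # [21]: wraps past the end of obj1

# Choosing a different starting pixel rotates the string but loses no pattern.
rotated = obj1[7:] + obj1[:7]
print(len(circular_occurrences(rotated, "0224")))  # 1
```

The pattern 6611 does not occur in the linear string obj1 at all; it is found only because the end of the string is joined to its beginning.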

Circular strings may also be used to represent polygons in computer graphics and com-
putational geometry [3]. Figure 19 shows a polygon which is represented by the following

alternating sequence of lines and angles: b/paaeaoeac/pcpeaaeaoeac/pcbacadaca, where α

denotes a 90 degree angle and β, a 270 degree angle.

The techniques of this paper would point out all instances of self similarity in the polygon,

such as aaeaeac/c. Note, however, that for the methods to work efficiently, the number of

lines and angles that are used to represent the polygons must be small and fixed.



9 Conclusions


In this paper, we have defined the scdawg for circular strings and shown how it can be used

to solve problems in the visualization and analysis of patterns in circular strings. We expect
that it can also be used for other string matching applications involving circular strings.

An important feature of the scdawg for circular strings is that it is easy to implement and

use when corresponding techniques for scdawgs for linear strings are already available.














Figure 18: Representing closed curves by circular strings. (a) Objects 1 and 2 on a pixel grid, with the starting positions for their chain codes marked; (b) chain code representations of directions; (c) shared segments S1 and S2.











Figure 19: Representing polygons by circular strings (diagram of a polygon with its sides labeled by letters)



Acknowledgement

We are grateful to Professor Gerhard Ritter for pointing out the application of circular

strings to the representation of closed curves.


References

[1] B. Clift, D. Haussler, T.D. Schneider, and G.D. Stormo, "Sequence Landscapes,"
Nucleic Acids Research, vol. 14, no. 1, pp. 141-158, 1986.

[2] G.M. Morris, "The Matching of Protein Sequences using Color Intrasequence Homology
Displays," J. Mol. Graphics, vol. 6, pp. 135-142, 1988.

[3] S.L. Tanimoto, "A method for detecting structure in polygons," Pattern Recognition,
vol. 13, no. 6, pp. 389-394, 1981.

[4] D. Mehta and S. Sahni, "String Visualization," In Preparation, 1991.

[5] D. Mehta and S. Sahni, "Computing Display Conflicts in String Visualization,"
Submitted for journal publication, 1991.

[6] A. Blumer, J. Blumer, D. Haussler, R. McConnell, and A. Ehrenfeucht, "Complete
Inverted Files for Efficient Text Retrieval and Analysis," J. ACM, vol. 34, no. 3, pp. 578-595, 1987.

[7] A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M.T. Chen, and J. Seiferas, "The
Smallest Automaton Recognizing the Subwords of a Text," Theoretical Computer Science,
no. 40, pp. 31-55, 1985.

[8] M. E. Majster and A. Reiser, "Efficient on-line construction and correction of position
trees," SIAM Journal on Computing, vol. 9, pp. 785-807, Nov. 1980.

[9] E. McCreight, "A space-economical suffix tree construction algorithm," Journal of the
ACM, vol. 23, pp. 262-272, Apr. 1976.

[10] M. T. Chen and Joel Seiferas, "Efficient and elegant subword tree construction," in
Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, eds.), NATO ASI
Series, Vol. F12, pp. 97-107, Berlin Heidelberg: Springer-Verlag, 1985.

[11] R. Gonzalez and P. Wintz, Digital Image Processing, 2nd Edition. Addison Wesley, 1987.



