
Full Citation 
Material Information 

Title: 
A data structure for circular string analysis and visualization

Physical Description: 
Book 

Language: 
English 

Creator: 
Sahni, Sartaj; Mehta, Dinesh P.

Affiliation: 
University of Florida; University of Minnesota

Publisher: 
Department of Computer and Information Sciences, University of Florida 

Place of Publication: 
Gainesville, Fla. 

Copyright Date: 
1991 
Record Information 

Bibliographic ID: 
UF00095097 

Volume ID: 
VID00001 

Source Institution: 
University of Florida 

Holding Location: 
University of Florida 

Rights Management: 
All rights reserved by the source institution and holding location. 

A Data Structure for Circular String Analysis and
Visualization *
Dinesh P. Mehta †‡
Sartaj Sahni †
Technical Report 25
Abstract
Circular strings are used to represent circular genomes in molecular biology, polygons
in computer graphics and computational geometry, and closed curves in computer
vision. In this paper we extend techniques which have so far been successfully applied
to the analysis and visualization of linear strings to circular strings by defining a data
structure for circular strings. Efficient (often optimal) algorithms that support these
techniques are presented.
Keywords and Phrases:
Circular strings, visualization, analysis, directed acyclic word graphs.
*This research was supported in part by the National Science Foundation under grant MIP 8617374.
†Dept. of Computer and Information Sciences, University of Florida, Gainesville, FL 32611
‡Dept. of Computer Science, University of Minnesota, Minneapolis, MN 55455
1 Introduction
The circular string data type is used to represent a number of objects such as circular
genomes, polygons, and closed curves. Research in molecular biology involves the identification
of recurring patterns in data and hypothesizing about their causes and/or effects
[1, 2]. Research in pattern recognition and computer vision involves detecting similarities
within an object or between objects [3].
Detecting patterns visually is tedious and prone to error. In [4], a model was proposed
to alleviate this problem. The model consists of identifying all recurring patterns in a string
and highlighting identical patterns in the same color.
[4] also listed a number of queries that the model would support. In [5], efficient (mostly
optimal) algorithms were proposed for some of these queries for linear strings. These algorithms
perform operations and traversals on the symmetric compact directed acyclic word
graph (scdawg) [6] of the linear string. The scdawg, which is used to represent a string or a
set of strings, evolved from other string data structures such as position trees, suffix trees,
directed acyclic word graphs, etc. [7, 8, 9, 10].
One approach for extending these techniques to circular strings is to arbitrarily break
the circular string at some point so that it becomes a linear string. Techniques for linear
strings may then be applied to it. However, this has the disadvantage that some significant
patterns in the circular string may be lost because the patterns were broken when linearizing
the string. Indeed, this would defeat the purpose of representing objects by circular strings.
[3] defined a polygon structure graph, which is an extension of suffix trees to circular
strings. However, the suffix tree is not as powerful as the scdawg and cannot be used to
solve some of the problems that the scdawg can solve. In this paper, we define an scdawg
for circular strings. Algorithms in [5] and [6] which make use of the scdawg for linear strings
can then be extended to circular strings with minor modifications. The extended algorithms
continue to have the same efficient time and space complexities. Further, the extensions
take the form of postprocessing or preprocessing steps which are simple to add on to a
system built for linear strings, particularly in an object oriented language.
Figure 1: Circular string
Section 2 contains definitions. Section 3 describes the scdawg for linear strings while
Section 4 describes its extension to circular strings. Section 5 deals with the computation
of occurrences of displayable entities. Section 6 introduces the notion of conflicts and
Section 7 lists other queries that are to be implemented. Section 6 also explains how the
algorithms implementing queries for linear strings can be modified so that they work with
circular strings. Finally, Section 8 mentions some applications for the visualization and
analysis of circular strings.
2 Definitions
Let s denote a circular string of size n consisting of characters from a fixed alphabet, $\Sigma$,
of constant size. Figure 1 shows an example circular string of size 8. We shall represent a
circular string by a linear string enclosed in angle brackets "<>" (this distinguishes it from
a linear string). The linear string is obtained by traversing the circular string in clockwise
order and listing each element as it is traversed. The starting point of the traversal is chosen
arbitrarily. Consequently, there are up to n equivalent representations of s. In the example,
s could be represented as < abcdabce >, < bcdabcea >, etc.
We characterize the relationship between circular strings and linear strings by defining
the functions linearize and circularize. linearize maps circular strings to linear strings. It
is a one-many mapping as a circular string can, in general, be mapped to more than one
linear string. For example, linearize(< abcd >) = {abcd, bcda, cdab, dabc}. We will assume,
for the purpose of this paper, that linearize arbitrarily chooses one of the linear strings; for
convenience we assume that it chooses the representation obtained by removing the angle
brackets "<>". So, linearize(< abcd >) = abcd. circularize maps linear strings to circular
strings. It is a many-one function and represents the inverse of linearize.
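The two mappings can be illustrated with a minimal Python sketch. It assumes a circular string is stored as an ordinary Python string holding its clockwise contents; the function names and the choice of the lexicographically least rotation as a canonical representative are illustrative assumptions, not part of the paper.

```python
def linearizations(circular):
    """All linear strings obtained by cutting the circular string at each position."""
    n = len(circular)
    return [circular[k:] + circular[:k] for k in range(n)]


def linearize(circular):
    """The paper's convention: drop the angle brackets, i.e. cut before position 1."""
    return circular


def circularize(linear):
    """Hypothetical canonical form: the least rotation stands for the whole class."""
    return min(linearizations(linear))


print(linearizations("abcd"))  # ['abcd', 'bcda', 'cdab', 'dabc']
```

Because every rotation of a linear string maps to the same canonical representative, circularize is many-one in this sketch, as described above.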
We use lower case letters to represent circular strings and upper case letters to represent
linear strings. Further, if a lower case letter (say, s) is used to represent a particular circular
string, then the corresponding upper case letter (S) is assumed to be linearize(s). A single
character in s or S occurring in the ith position is denoted by $s_i$ or $S_i$, respectively. A
substring of S is denoted by $S_{i,j}$, where $i \leq j$ and $S_{i,j} = S_i S_{i+1} \cdots S_j$. A substring of s is denoted
by $s_{i,j}$, where $s_{i,j} = S_{i,j}$ if $i \leq j$ and $s_{i,j} = S_{i,n} S_{1,j}$ if $i > j$. For example, if s = < abcdabce >,
then S = abcdabce, $S_5 = s_5$ = a, $S_{3,5} = s_{3,5}$ = cda, and $s_{7,2}$ = ceab. We use the symbol $\gamma$ to
denote either a circular string or a linear string. In the example, $\gamma_{3,5} = S_{3,5}$ if $\gamma = S$, and
$\gamma_{3,5} = s_{3,5}$ if $\gamma = s$.
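The substring notation translates directly into index arithmetic. The following sketch assumes 1-indexed, inclusive positions as in the text; the function names are hypothetical.

```python
def linear_substring(S, i, j):
    """S_{i,j} for 1 <= i <= j <= n (1-indexed, inclusive)."""
    return S[i - 1:j]


def circular_substring(S, i, j):
    """s_{i,j}, where S = linearize(s); wraps around once when i > j."""
    if i <= j:
        return linear_substring(S, i, j)
    return linear_substring(S, i, len(S)) + linear_substring(S, 1, j)


S = "abcdabce"
print(linear_substring(S, 5, 5))    # a
print(circular_substring(S, 3, 5))  # cda
print(circular_substring(S, 7, 2))  # ceab
```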
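The predecessor, pred($\gamma$, i, j), of a substring $\gamma_{i,j}$ of $\gamma$ is defined as
$$\mathrm{pred}(\gamma, i, j) = \begin{cases} \gamma_{i-1} & \text{if } 1 < i \leq n \\ \infty & \text{if } i = 1 \text{ and } \gamma \text{ is linear} \\ \gamma_n & \text{if } i = 1 \text{ and } \gamma \text{ is circular} \end{cases}$$
The successor, succ($\gamma$, i, j), of a substring $\gamma_{i,j}$ of $\gamma$ is defined as
$$\mathrm{succ}(\gamma, i, j) = \begin{cases} \gamma_{j+1} & \text{if } 1 \leq j < n \\ \infty & \text{if } j = n \text{ and } \gamma \text{ is linear} \\ \gamma_1 & \text{if } j = n \text{ and } \gamma \text{ is circular} \end{cases}$$
The immediate context, context($\gamma$, i, j), of a substring $\gamma_{i,j}$ of $\gamma$ is the ordered pair
(pred($\gamma$, i, j), succ($\gamma$, i, j)).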
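The predecessor, pred($\gamma$, $\alpha$), and successor, succ($\gamma$, $\alpha$), sets of a pattern, $\alpha$, in a string $\gamma$
are defined as below:
$$\mathrm{pred}(\gamma, \alpha) = \{\mathrm{pred}(\gamma, i, j) \mid \gamma_{i,j} = \alpha\}, \qquad \mathrm{succ}(\gamma, \alpha) = \{\mathrm{succ}(\gamma, i, j) \mid \gamma_{i,j} = \alpha\}.$$
The immediate context set, context($\gamma$, $\alpha$), of a pattern, $\alpha$, in $\gamma$ is the set
$$\{\mathrm{context}(\gamma, i, j) \mid \gamma_{i,j} = \alpha\}.$$
In the example string of Figure 1, succ(s, abc) = succ(S, abc) = {d, e}. pred(s, abc) =
{d, e}; pred(S, abc) = {$\infty$, d}. context(s, abc) = {(e, d), (d, e)}. context(S, abc) = {($\infty$, d), (d, e)}.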
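The context sets of the Figure 1 example can be reproduced by brute force. This is a minimal sketch that scans every starting position; it is not the efficient scdawg-based computation the paper develops. The marker string `INF` and the function name are assumptions made for illustration, and the pattern is assumed to satisfy 1 <= |pattern| < n.

```python
INF = "∞"  # stands for the "no character" marker at the ends of a linear string


def contexts(S, pattern, circular):
    """context(γ, α): the (predecessor, successor) pairs over all occurrences."""
    n, m = len(S), len(pattern)
    # For a circular string, unroll just enough to see occurrences that wrap.
    text = S + S[:m - 1] if circular else S
    starts = range(n) if circular else range(n - m + 1)
    result = set()
    for k in starts:
        if text[k:k + m] != pattern:
            continue
        if circular:
            before, after = S[(k - 1) % n], S[(k + m) % n]
        else:
            before = S[k - 1] if k > 0 else INF
            after = S[k + m] if k + m < n else INF
        result.add((before, after))
    return result


S = "abcdabce"
print(contexts(S, "abc", circular=True))   # the pairs (e, d) and (d, e)
print(contexts(S, "abc", circular=False))  # the pairs (∞, d) and (d, e)
```

The predecessor and successor sets are the projections of these pairs onto their first and second components.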
A pattern occurring in $\gamma$ is said to be maximal iff its occurrences are not all preceded
by the same character nor all followed by the same character. So, a pattern $\alpha$ of length <
n in $\gamma$ is maximal iff $|\mathrm{pred}(\gamma, \alpha)| \geq 2$ and $|\mathrm{succ}(\gamma, \alpha)| \geq 2$. This is not necessarily true for
patterns of length greater than or equal to n. For example, S is maximal in S (since it is
neither preceded nor followed by a character), but |pred(S, S)| = |succ(S, S)| = 1.
A pattern is said to be a displayable entity (or displayable) of $\gamma$ iff it is maximal and
occurs at least twice in $\gamma$. Note that if $\gamma$ represents a circular string, then a pattern can
be arbitrarily long. In the rest of our discussion, we will assume that displayable entities of
circular strings have length less than n.
3 Scdawgs For Linear Strings
An scdawg, SCD(S) = (V(S), R(S), L(S)) corresponding to a string S is a directed acyclic
graph defined by a set of vertices, V(S), a set, R(S), of labeled directed edges called right
extension (re) edges, and a set of labeled directed edges, L(S) called left extension (le)
edges. Each vertex of V(S) represents a substring of S. Specifically, V(S) consists of a
source (which represents the empty word, $\lambda$), a sink (which represents S), and a vertex
corresponding to each displayable entity of S.
Let de(v) denote the string represented by vertex v, $v \in V(S)$. Define the implication,
imp(S, $\alpha$), of a string, $\alpha$, of S to be the smallest superword of $\alpha$ in $\{de(v) \mid v \in V(S)\}$, if
such a superword exists. Otherwise, imp(S, $\alpha$) does not exist. Re edges from $v_1$ ($v_1 \in V(S)$)
are obtained as follows: for each letter, x, in $\Sigma$, if imp(S, de($v_1$)x) exists and is equal to
de($v_2$) = $\beta\,de(v_1)\,x\gamma$, then there is an re edge from $v_1$ to $v_2$ with label $x\gamma$. If $\beta$ is the empty
string, then the edge is known as a prefix extension edge. Le edges from $v_1$ ($v_1 \in V(S)$) are
obtained as follows: for each letter, x, in $\Sigma$, if imp(S, x de($v_1$)) exists and is equal to de($v_2$) =
$\gamma x\,de(v_1)\,\beta$, then there is an le edge from $v_1$ to $v_2$ with label $\gamma x$. If $\beta$ is the empty string, then
the edge is known as a suffix extension edge.
Figure 2: SCDAWG for S = cdefabcgabcde; only re edges are shown
Figure 2 shows (V(S), R(S)) corresponding
to S = cdefabcgabcde. abc, cde, and c are the displayable entities of S. There are two re
edges from the vertex representing abc. These correspond to x = d and x = g. imp(S, abcd)
= imp(S, abcg) = S. Consequently, both edges are incident on the sink. There are no edges
corresponding to the other letters of the alphabet as imp(S, abcx) does not exist for $x \in$
{a, b, c, e, f}.
Notice that the number of re edges from a vertex, v, equals $|\mathrm{succ}(S, de(v)) - \{\infty\}|$ and
the number of le edges equals $|\mathrm{pred}(S, de(v)) - \{\infty\}|$. In the example, succ(S, cde) = {$\infty$, f}.
So, the number of right edges leaving the vertex corresponding to it is 1.
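The vertex set and edge counts of the example can be cross-checked by brute force. The sketch below enumerates all substrings of a linear string and keeps those whose predecessor and successor sets each have at least two elements (the criterion stated in Section 2 for patterns shorter than n). It runs in polynomial rather than linear time and builds no graph; the names `INF`, `pred_succ`, and `displayable_entities` are illustrative assumptions.

```python
INF = "∞"  # "no character" marker at either end of a linear string


def pred_succ(S, pattern):
    """Predecessor and successor sets of pattern in the linear string S."""
    n, m = len(S), len(pattern)
    preds, succs = set(), set()
    for k in range(n - m + 1):
        if S[k:k + m] == pattern:
            preds.add(S[k - 1] if k > 0 else INF)
            succs.add(S[k + m] if k + m < n else INF)
    return preds, succs


def displayable_entities(S):
    """Map each displayable entity of S (length < n) to its (pred, succ) sets."""
    n = len(S)
    result = {}
    for m in range(1, n):
        for k in range(n - m + 1):
            pattern = S[k:k + m]
            if pattern in result:
                continue
            preds, succs = pred_succ(S, pattern)
            if len(preds) >= 2 and len(succs) >= 2:
                result[pattern] = (preds, succs)
    return result


entities = displayable_entities("cdefabcgabcde")
for word, (preds, succs) in entities.items():
    # Number of re and le out edges, as in the remark above.
    print(word, len(succs - {INF}), len(preds - {INF}))
```

For S = cdefabcgabcde this yields the three entities c, abc, and cde, with one re edge leaving cde and two leaving abc, in agreement with the discussion of Figure 2.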
The space required for SCD(S) is O(n) and the time needed to construct it is O(n) [7, 6].
While we have defined the scdawg data structure for a single string, it can be extended to
represent a set of strings [6].
4 Extension to Circular Strings
In Section 4.1, we present a constructive definition of an scdawg for circular strings. Section 4.2
analyzes the complexity of the algorithm of Section 4.1 to construct the scdawg of
a circular string and Section 4.3 identifies and proves some properties of this scdawg.
4.1 SCDAWGs For Circular Strings
The notion of an scdawg may be extended to circular strings. The scdawg for circular
strings is defined constructively by the algorithm of Figure 3. The scdawg for the circular
string s is obtained by first constructing the scdawg for the linear string T = SS (recall
that S = linearize(s)). A bit is associated with each re edge in R(T) indicating whether it
is a prefix extension edge or not. Similarly, a bit is associated with each le edge in L(T)
to identify suffix extension edges. Two pointers, a suffix pointer and a prefix pointer are
associated with each vertex, v in V(T). The suffix (prefix) pointer points to a vertex, w,
in V(T) such that de(w) is the largest suffix (prefix) of de(v) represented by any vertex
in V(T). Suffix (prefix) pointers are the reverse of suffix (prefix) extension edges and are
derived from them. Figure 4 shows SCD(T) = SCD(SS) for S = cabcbab. The broken
edge from vertex c to vertex abc is a suffix extension edge, while the solid edge from vertex
ab to vertex abc is a prefix extension edge.
Next, in step 2, suffix and prefix redundant vertices of SCD(T) are identified. A suffix
(prefix) redundant vertex is a vertex v that satisfies the following properties:
(a) v has exactly one outgoing re (le) edge.
(b) |de(v)| < n.
A vertex is said to be redundant if it is either prefix redundant or suffix redundant or both.
In Figure 4, vertex c is prefix redundant only, while vertex ab is suffix redundant only. No
other vertices in the figure are redundant (in particular, the vertex representing S is not
redundant even though it has one re and one le out edge, as |S| = n). The fact that step 2
does, in fact, identify all redundant vertices is established later.
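The redundancy conditions can be checked on the running example without building SCD(T). The sketch below is a brute-force stand-in, not step 2 of the algorithm: it enumerates the displayable entities of T = SS that are shorter than n and counts their re and le out edges using the edge-count remark at the end of Section 3. All names are hypothetical.

```python
INF = "∞"  # "no character" marker at either end of a linear string


def pred_succ(T, pattern):
    """Predecessor and successor sets of pattern in the linear string T."""
    preds, succs = set(), set()
    m = len(pattern)
    for k in range(len(T) - m + 1):
        if T[k:k + m] == pattern:
            preds.add(T[k - 1] if k > 0 else INF)
            succs.add(T[k + m] if k + m < len(T) else INF)
    return preds, succs


def classify_short_vertices(S):
    """Label every displayable entity of T = SS that is shorter than n = |S|."""
    n, T = len(S), S + S
    labels = {}
    for m in range(1, n):
        for k in range(len(T) - m + 1):
            pattern = T[k:k + m]
            if pattern in labels:
                continue
            preds, succs = pred_succ(T, pattern)
            if len(preds) < 2 or len(succs) < 2:
                continue  # not a displayable entity of T, so no vertex
            kinds = set()
            if len(succs - {INF}) == 1:   # exactly one re out edge
                kinds.add("suffix redundant")
            if len(preds - {INF}) == 1:   # exactly one le out edge
                kinds.add("prefix redundant")
            labels[pattern] = kinds
    return labels


print(classify_short_vertices("cabcbab"))
```

For S = cabcbab the sketch reports c as prefix redundant only and ab as suffix redundant only, while b and abc carry no label, which matches the description of Figure 4.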
Vertices of SCD(T) are processed in reverse topological order in step 3 and redundant
vertices are eliminated. When a vertex is eliminated, the edges incident to/from it are
redirected and relabeled as described in Figures 5 to 10. The resulting graph is CSCD(s).
The set of vertices of CSCD(s) is denoted by CV(s). The set of right (left) edges of
CSCD(s) is denoted by CR(s) (CL(s)). Figure 11 shows CSCD(s) for s = < cabcbab >.
Notice that vertices c and ab have been eliminated and that the two incoming edges to c
and the three incoming edges to ab of Figure 4 now point to abc.
Algorithm A
Step 1: Construct SCD(T) for T = SS.
Step 2(a):
{Identify Suffix Redundant Vertices}
v := sink;
while v <> source do
begin
  v := v.suffix;
  if v has exactly one outgoing re edge
  then
    begin
      if (|de(v)| < n)
      then mark v suffix redundant
    end
  else
    exit Step 2(a);
end;
Step 2(b):
{Identify Prefix Redundant Vertices}
{Similar to Step 2(a)}
Step 3:
v := sink;
while (v <> source) do
begin
  case v of
    suffix redundant but not prefix redundant: ProcessSuffixRedundant(v);
    prefix redundant but not suffix redundant: ProcessPrefixRedundant(v);
    suffix redundant and prefix redundant : ProcessBothRedundant(v);
    not redundant : {Do nothing};
  endcase;
  v := NextVertexInReverseTopologicalOrder;
end;
Figure 3: Algorithm for constructing the scdawg for a circular string
Figure 4: SCD(T) for T = cabcbabcabcbab (broken lines are left extension edges; solid lines are right extension edges)
Procedure ProcessSuffixRedundant(v)
1. Eliminate all left extension edges leaving v (there are at least two of these).
2. There is exactly one right extension edge, e, leaving v. Let the vertex that it leads to
be w. Let the label on the right extension edge be $x\gamma$. Delete the edge.
3. All right edges incident on v are updated so that they point to w. Their labels are
modified so that they represent the concatenation of their original labels with $x\gamma$.
4. All left edges incident on v are updated so that they point to w. Their labels are not
modified. However, if any of these were suffix extension edges, the bit which indicates
this should be reset as these edges are no longer suffix extension edges.
5. Delete v.
Figure 5: Algorithm for processing a vertex which is suffix redundant
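Step 2(a) of Algorithm A (Figure 3) is a walk along suffix pointers, which the following sketch mirrors. The `Vertex` class, its field names, and the hand-built chain for S = cabcbab are assumptions made for illustration only; a real implementation would obtain the vertices, edge counts, and suffix pointers from SCD(T). The sketch also adds one guard that is not in the pseudocode: it stops when the walk reaches the source, so the source itself is never marked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Vertex:
    word: str                            # de(v)
    re_out: int                          # number of outgoing right extension edges
    suffix: Optional["Vertex"] = None    # suffix pointer
    suffix_redundant: bool = False


def mark_suffix_redundant(sink, source, n):
    """Follow suffix pointers from the sink, marking suffix redundant vertices."""
    v = sink
    while v is not source:
        v = v.suffix
        if v is source:
            break            # assumed guard: do not examine the source
        if v.re_out != 1:
            break            # more than one re out edge: stop (Lemma 3(b))
        if len(v.word) < n:
            v.suffix_redundant = True


# Hand-built suffix chain of SCD(T) for S = cabcbab: sink -> S -> ab -> b -> source.
source = Vertex("", 3)
b = Vertex("b", 2, source)
ab = Vertex("ab", 1, b)
s_vertex = Vertex("cabcbab", 1, ab)
sink = Vertex("cabcbabcabcbab", 0, s_vertex)

mark_suffix_redundant(sink, source, n=7)
print([(v.word, v.suffix_redundant) for v in (s_vertex, ab, b)])
```

Only ab ends up marked: the vertex for S has a single re out edge but length n, and the walk stops at b, which has two re out edges.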
Figure 6: v is suffix redundant
Procedure ProcessPrefixRedundant(v)
1. Eliminate all right extension edges leaving v (there are at least two of these).
2. There is exactly one left extension edge, e, leaving v. Let the vertex that it leads to
be w. Let the label on the left extension edge be $\gamma x$. Delete the edge.
3. All left edges incident on v are updated so that they point to w. Their labels are
modified so that they represent the concatenation of $\gamma x$ with their original labels.
4. All right edges incident on v are updated so that they point to w. Their labels are not
modified. However, if any of these were prefix extension edges, the bit which indicates
this should be reset as these edges are no longer prefix extension edges.
5. Delete v.
Figure 7: Algorithm for processing a vertex which is prefix redundant
Figure 8: v is prefix redundant
Procedure ProcessBothRedundant(v)
1. There is exactly one right extension edge, $e_1$, leaving v. Let the vertex that it leads
to be $w_1$. Let the label on the edge be $x\gamma$. Delete the edge.
2. There is exactly one left extension edge, $e_2$, leaving v. Let the vertex that it leads to
be $w_2$. Let the label on the edge be $\gamma x$. Delete the edge.
{We establish later that $w_1$ and $w_2$ are, in fact, the same vertex.}
3. All right edges incident on v are updated so that they point to $w_1$. Their labels are
modified so that they represent the concatenation of their original labels with $x\gamma$. If any of
these edges were prefix extension edges, the bit which indicates this should be reset.
4. Similarly, left edges incident on v are updated so that they point to $w_2$. Their labels
are modified so that they represent the concatenation of $\gamma x$ with their original labels. If any
of these edges were suffix extension edges, the bit which indicates this should be reset.
5. Delete v.
Figure 9: Algorithm for processing a vertex which is prefix and suffix redundant
Figure 10: v is suffix and prefix redundant
Lemma 1 For every substring $s_{i,j}$ of length < n of s, there exists a substring, $T_{l,m}$ (= $s_{i,j}$),
of T such that context(T, l, m) = context(s, i, j).
Proof
Case (i): i > j. Clearly, $i \neq 1$, $j \neq n$. By construction, $T_{i,j+n} = s_{i,j}$ and
context(T, i, j+n) = ($T_{i-1}$, $T_{n+j+1}$) = ($S_{i-1}$, $S_{j+1}$) = context(s, i, j).
Case (ii): $i \leq j$. Now, $T_{i,j} = T_{i+n,j+n} = s_{i,j}$.
Subcase (a): i = 1, $j \neq n$. context(T, n+1, n+j) = ($T_n$, $T_{n+j+1}$) = ($S_n$, $S_{j+1}$) =
context(s, i, j).
Subcase (b): $i \neq 1$, j = n. context(T, i, n) = ($T_{i-1}$, $T_{n+1}$) = ($S_{i-1}$, $S_1$) =
context(s, i, n).
Subcase (c): $i \neq 1$, $j \neq n$. context(T, i, j) = ($T_{i-1}$, $T_{j+1}$) = ($S_{i-1}$, $S_{j+1}$) =
context(s, i, j).
Subcase (d): i = 1, j = n. Not possible since the length of $s_{i,j}$ < n. $\Box$
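The case analysis in the proof of Lemma 1 amounts to an explicit map from an occurrence (i, j) in s to an occurrence (l, m) in T = SS. The sketch below encodes that map and checks, for every substring of length less than n, that the substring and its immediate context agree. Positions are 1-indexed as in the text; the function names and the `INF` marker are illustrative assumptions.

```python
INF = "∞"  # "no character" marker at either end of a linear string


def circular_substring(S, i, j):
    """s_{i,j}, wrapping around once when i > j."""
    return S[i - 1:j] if i <= j else S[i - 1:] + S[:j]


def context_circular(S, i, j):
    """context(s, i, j) for the circular string s with S = linearize(s)."""
    n = len(S)
    return S[(i - 2) % n], S[j % n]


def context_linear(T, l, m):
    """context(T, l, m) for the linear string T."""
    before = T[l - 2] if l > 1 else INF
    after = T[m] if m < len(T) else INF
    return before, after


def occurrence_in_T(i, j, n):
    """The occurrence (l, m) of T chosen in the proof of Lemma 1."""
    if i > j:
        return i, j + n        # case (i)
    if i == 1:
        return n + 1, n + j    # case (ii), subcase (a)
    return i, j                # case (ii), subcases (b) and (c)


def check_lemma1(S):
    n, T = len(S), S + S
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            length = j - i + 1 if i <= j else n - i + 1 + j
            if length >= n:
                continue  # the lemma only covers substrings shorter than n
            l, m = occurrence_in_T(i, j, n)
            assert T[l - 1:m] == circular_substring(S, i, j)
            assert context_linear(T, l, m) == context_circular(S, i, j)
    return True


print(check_lemma1("abcdabce"))
```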
Corollary 1 For every pattern, $\alpha$, of length < n in s, context(s, $\alpha$) $\subseteq$ context(T, $\alpha$).
Corollary 2 For every pattern, $\alpha$, of length < n in s, pred(s, $\alpha$) $\subseteq$ pred(T, $\alpha$) and succ(s, $\alpha$)
$\subseteq$ succ(T, $\alpha$).
Lemma 2 Let $T_{i,j}$ be a substring of T. If $i \neq 1$, then there is a substring $s_{l,m}$ (= $T_{i,j}$) of s
such that pred(s, l, m) = pred(T, i, j). If i = 1, pred(T, i, j) = $\infty$.
Proof If i = 1, the result follows from the definition of pred(T, i, j). If $i \neq 1$, choose $s_{l,m}$
so that l = i if $i \leq n$ and l = i - n if i > n; m = j if $j \leq n$ and m = j - n if j > n (if
the length of $T_{i,j}$ is greater than n, $s_{l,m}$ is assumed to wrap around once). So, pred(s, l, m)
= $S_{l-1}$ = $T_{i-1}$ = pred(T, i, j), where $S_0$ is read as $S_n$. $\Box$
Corollary 3 For every pattern, $\alpha$, of length < n in T, pred(T, $\alpha$) $- \{\infty\}$ $\subseteq$ pred(s, $\alpha$).
Figure 11: Scdawg for < cabcbab >
Theorem 1 For every pattern $\alpha$ of length less than n, pred(s, $\alpha$) = pred(T, $\alpha$) $- \{\infty\}$ and
succ(s, $\alpha$) = succ(T, $\alpha$) $- \{\infty\}$.
Proof From Corollary 2 we have pred(s, $\alpha$) $\subseteq$ pred(T, $\alpha$). So, pred(s, $\alpha$) $- \{\infty\}$ $\subseteq$
pred(T, $\alpha$) $- \{\infty\}$ and hence pred(s, $\alpha$) $\subseteq$ pred(T, $\alpha$) $- \{\infty\}$ (since pred(s, $\alpha$) does not contain
$\infty$). From Corollary 3 we have pred(s, $\alpha$) $\supseteq$ pred(T, $\alpha$) $- \{\infty\}$. So, pred(s, $\alpha$) = pred(T, $\alpha$)
$- \{\infty\}$. The proof that succ(s, $\alpha$) = succ(T, $\alpha$) $- \{\infty\}$ is similar. $\Box$
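Theorem 1 can be spot-checked by brute force on small inputs. The sketch below computes the predecessor and successor sets of every pattern of length less than n directly in the circular string and in the linear string T = SS, then compares them after removing the end marker. It is a sanity check under the stated definitions, not a proof, and all names are hypothetical.

```python
INF = "∞"  # "no character" marker at either end of a linear string


def pred_succ_linear(T, pattern):
    preds, succs = set(), set()
    m = len(pattern)
    for k in range(len(T) - m + 1):
        if T[k:k + m] == pattern:
            preds.add(T[k - 1] if k > 0 else INF)
            succs.add(T[k + m] if k + m < len(T) else INF)
    return preds, succs


def pred_succ_circular(S, pattern):
    n, m = len(S), len(pattern)
    unrolled = S + S[:m - 1]  # enough to see occurrences that wrap around
    preds, succs = set(), set()
    for k in range(n):
        if unrolled[k:k + m] == pattern:
            preds.add(S[(k - 1) % n])
            succs.add(S[(k + m) % n])
    return preds, succs


def theorem1_holds(S):
    n, T = len(S), S + S
    # Every pattern of length < n that occurs in the circular string.
    patterns = {T[k:k + m] for m in range(1, n) for k in range(n)}
    for pattern in patterns:
        preds_t, succs_t = pred_succ_linear(T, pattern)
        preds_s, succs_s = pred_succ_circular(S, pattern)
        if preds_s != preds_t - {INF} or succs_s != succs_t - {INF}:
            return False
    return True


print(theorem1_holds("cabcbab"))
```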
Theorem 2 A vertex, v, with |de(v)| < n in V(T) is non redundant iff de(v) is a displayable
entity of s.
Proof Suppose $\alpha$ is a displayable entity of s. Then, we have $|\mathrm{pred}(s, \alpha)| \geq 2$ and $|\mathrm{succ}(s, \alpha)|
\geq 2$. From Theorem 1 we have $|\mathrm{pred}(T, \alpha) - \{\infty\}| \geq 2$ and $|\mathrm{succ}(T, \alpha) - \{\infty\}| \geq 2$. So, $\alpha$
is a displayable entity in T and the corresponding vertex in V(T) has at least two le and
two re edges leaving it. Hence, v is not redundant.
Next, suppose there is a non redundant vertex, v, in SCD(T) with |de(v)| < n. Let $\alpha$ =
de(v). Since v is not redundant, $|\mathrm{pred}(T, \alpha) - \{\infty\}| \geq 2$ and $|\mathrm{succ}(T, \alpha) - \{\infty\}| \geq 2$. From
Theorem 1 we have $|\mathrm{pred}(s, \alpha)| \geq 2$ and $|\mathrm{succ}(s, \alpha)| \geq 2$. So, $\alpha$ is a displayable entity of s. $\Box$
Corollary 4 A redundant vertex in V(T) is not a displayable entity of s.
Lemma 3 (a) A vertex, v, in V(T) will have exactly one re (le) out edge only if de(v) is
a suffix (prefix) of T.
(b) If a vertex, v, such that de(v) (|de(v)| < n) is a suffix (prefix) of T has more than one
re (le) out edge, then no vertex, w, such that de(w) is a suffix (prefix) of de(v) can be suffix
(prefix) redundant.
Proof (a) Suppose de(v) is not a suffix of T. Then $\infty$ is not an element of succ(T, de(v)).
So, $|\mathrm{succ}(T, de(v)) - \{\infty\}| = |\mathrm{succ}(T, de(v))| \geq 2$. So, v has at least two re out edges, which
is a contradiction. Hence, de(v) must be a suffix of T.
(b) Since de(w) is a suffix of de(v), a successor of de(v) must also be a successor of de(w). So,
$\mathrm{succ}(T, de(w)) - \{\infty\} \supseteq \mathrm{succ}(T, de(v)) - \{\infty\}$, or $|\mathrm{succ}(T, de(w)) - \{\infty\}| \geq |\mathrm{succ}(T, de(v))
- \{\infty\}| \geq 2$ (de(v) has at least two re out edges). So, w must have at least two re out edges
and cannot be suffix redundant. $\Box$
We can now show that step 2(a) of Algorithm A identifies all suffix redundant vertices in
V(T). Since it is sufficient to examine vertices corresponding to suffixes of T (Lemma 3(a)),
step 2(a) follows the chain of suffix pointers starting from the sink. If a vertex on this chain
representing a displayable entity of length < n has one re out edge, then it is marked suffix
redundant. The traversal of the chain terminates either when the source is reached or a
vertex with more than one re out edge is encountered (Lemma 3(b)). Similarly, step 2(b)
identifies all prefix redundant vertices in V(T).
4.2 Complexity Analysis
Step 1 takes O(n) time [6]. Step 2 will in the worst case traverse all the vertices in SCD(T)
spending O(1) time at each. The number of vertices is bounded by O(n) [6]. So, step 2
takes O(n) time. Step 3 traverses SCD(T). Each vertex is processed once; each edge is
processed at most twice (once when it is an incoming edge to the vertex being currently
processed, and once when it is the out edge from the vertex currently being processed). So,
Step 3 takes O(n) time (note that SCD(T) has O(n) edges).
4.3 Properties of CSCD(s)
Define the implication, imp(s, $\alpha$), of a string, $\alpha$, with respect to CSCD(s) to be the smallest
superword, $\beta\alpha\gamma$, of $\alpha$ represented by a vertex in CV(s), such that there does not exist a
substring $\beta_1\alpha\gamma_1$ of T where the length of the largest common suffix, $lcs(\beta, \beta_1)$, of $\beta$ and $\beta_1$
is less than $\min(|\beta|, |\beta_1|)$ or the length of the largest common prefix, $lcp(\gamma, \gamma_1)$, of $\gamma$ and $\gamma_1$
is less than $\min(|\gamma|, |\gamma_1|)$, if such a superword exists. Otherwise, imp(s, $\alpha$) does not exist.
The additional condition (which is referred to as the uniqueness condition) that is imposed
on imp(s, $\alpha$) is guaranteed for imp(T, $\alpha$) by the definition of SCD(T).
Let R = { abcaaaa, babcaa, cabcaa } be the smallest set of superword displayable entities
of abc in s such that any superword displayable entity of abc in s is a superword of an
element of R. Then, imp(s, abc) must be one of the elements of R. We have |lcs(b, c)| =
0 < min(|b|, |c|). So, imp(s, abc) is neither babcaa nor cabcaa. Further, since |lcp(aaaa, aa)|
= min(|aaaa|, |aa|), |lcs(b, $\lambda$)| = min(|b|, |$\lambda$|), and |lcs(c, $\lambda$)| = min(|c|, |$\lambda$|), imp(s, abc) =
abcaaaa.
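The uniqueness condition in the example above can be traced with a few lines of Python. The sketch represents each candidate superword of the pattern as a pair (left part, right part) and, as a simplifying assumption, compares candidates only against each other rather than against every substring of T as the definition requires. The function names are hypothetical.

```python
def lcp_len(a, b):
    """Length of the largest common prefix of a and b."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k


def lcs_len(a, b):
    """Length of the largest common suffix of a and b."""
    return lcp_len(a[::-1], b[::-1])


def compatible(c1, c2):
    """True if the two extensions of the pattern do not violate uniqueness."""
    (left1, right1), (left2, right2) = c1, c2
    return (lcs_len(left1, left2) == min(len(left1), len(left2))
            and lcp_len(right1, right2) == min(len(right1), len(right2)))


def pick_implication(candidates):
    """Smallest candidate that is compatible with every candidate, if any."""
    ok = [c for c in candidates if all(compatible(c, d) for d in candidates)]
    return min(ok, key=lambda c: len(c[0]) + len(c[1])) if ok else None


# abcaaaa, babcaa, cabcaa written as (left part, right part) around abc.
R = [("", "aaaa"), ("b", "aa"), ("c", "aa")]
print(pick_implication(R))  # ('', 'aaaa'), i.e. abcaaaa
```

The candidates babcaa and cabcaa rule each other out because their left parts b and c share no common suffix, which leaves abcaaaa as the only admissible choice.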
Lemma 4 Let v be a suffix and prefix redundant vertex in SCDINT(T), where SCDINT(T)
represents an intermediate configuration between SCD(T) and CSCD(s) just after the while
statement in Step 3 of Algorithm A. Let the re and le out edges be incident on $w_1$ and $w_2$
respectively, where de($w_1$) = imp(s, de(v)x) = $\beta_1\,de(v)\,x\gamma_1$ and de($w_2$) = imp(s, y de(v)) =
$\beta_2 y\,de(v)\,\gamma_2$. If $w_1$ and $w_2$ are not redundant, then $w_1 = w_2$.
Proof Case 1. |de($w_1$)| < n, |de($w_2$)| < n.
$\beta_1$ cannot be nil (if it is, then $w_1$ is prefix redundant since |de($w_1$)| < n and all occurrences
of de(v) except the prefix of S are preceded by y). Similarly, $\gamma_2 \neq$ nil. So, de($w_1$) must be
of the form $\beta_3 y\,de(v)\,x\gamma_1$, since y is the only letter that precedes de(v). Similarly, de($w_2$)
must be of the form $\beta_2 y\,de(v)\,x\gamma_3$. We now show that $\beta_3 = \beta_2$ and $\gamma_1 = \gamma_3$. Assume that
this is not the case. Since |de($w_1$)| < n, |de($w_2$)| < n and $w_1$ and $w_2$ are not redundant,
|pred(s, de($w_1$))|, |succ(s, de($w_1$))|, |pred(s, de($w_2$))|, and |succ(s, de($w_2$))| are all at least 2.
So, there must exist a displayable entity, $\beta_m y\,de(v)\,x\gamma_m$, of s where $\beta_m$ is the largest common
suffix of $\beta_3$ and $\beta_2$ and $\gamma_m$ is the largest common prefix of $\gamma_1$ and $\gamma_3$. Further, $\beta_m y\,de(v)\,x\gamma_m$
= imp(s, de(v)x) = imp(s, y de(v)), which contradicts statements made above.
Figure 12: Illustration of proof of prefix/suffix redundancy invariant
Case 2. |de($w_1$)| $\geq$ n, |de($w_2$)| < n.
$\gamma_2$ cannot be nil, otherwise $w_2$ is suffix redundant. So, de($w_2$) = $\beta_2 y\,de(v)\,x\gamma_3$ as x is the
only letter that follows de(v). Arguments similar to those in Case 1 show that since |de($w_2$)|
< n and $w_2$ is not redundant, $\gamma_1$ must be a prefix of $\gamma_3$ and $\beta_1$ a suffix of $\beta_2 y$. But, then
|de($w_2$)| $\geq$ |de($w_1$)| $\geq$ n, which is a contradiction. Hence, Case 2 cannot exist.
Case 3. |de($w_2$)| $\geq$ n, |de($w_1$)| < n.
Similar to Case 2.
Case 4. |de($w_2$)| $\geq$ n, |de($w_1$)| $\geq$ n.
Figure 12 shows that for this case to occur, S = $\alpha^m$, for some $\alpha$, where $|\alpha| \leq |de(v)|$. Call
this the prefix/suffix redundancy invariant. The figure assumes that |de($w_1$)| = n, that
de(v) is a prefix of de($w_1$), and that |de(v)| < n/2 and divides n. However, the prefix/suffix
redundancy invariant can be shown to be true in all other cases. Two copies of T are shown
in the figure. The first copy shades the occurrence (n - |de(v)| + 1, n) of de(v) and its
extension to de($w_1$). The second shades the occurrence (n + 1, n + |de(v)|) and its extension
to de($w_1$). Since the shaded regions in both strings represent de($w_1$), we have: box 1 = box
2; box 2 = box 3; ...; box 6 = box 7. Or, box 1 = box 2 = ... = box 7, and S = $(de(v))^6$.
Figure 13: SCD($\alpha^{2m}$)
Next, we assume without loss of generality that there is no $\beta$ such that S = $\beta^k$, k > m.
Call this the smallest repetition assumption.
The only occurrences of $\alpha$ in T are at $((1, |\alpha|), (|\alpha| + 1, 2|\alpha|), \ldots, ((2m-1)|\alpha| + 1, 2n))$ (if not,
an argument similar to the one of Figure 12 contradicts the smallest repetition assumption).
So, SCD(T) takes the form of Figure 13. Each vertex representing $\alpha^i$, $1 \leq i \leq 2m - 1$, has
exactly one le and one re out edge as shown.
All remaining displayable entities of T are subwords of $\alpha^2$ and are of size less than $|\alpha|$
(if not, an argument identical to the one in Figure 12 contradicts the smallest repetition
assumption). The vertices representing these displayable entities are represented by the box
in Figure 13.
None of the vertices in the box has out edges incident on vertices representing the displayable
entities $\{\alpha^3, \alpha^4, \ldots, \alpha^{2m}\}$. In particular, no out edges from the vertices in the box
are incident on vertices representing displayable entities of length greater than n. After
SCD($\alpha^{2m}$) has been processed by Algorithm A, all incoming edges to vertices corresponding
to $\alpha$ and $\alpha^2$ in SCD($\alpha^{2m}$) are incident on the vertex corresponding to S = $\alpha^m$ in
CSCD($\alpha^{2m}$). It follows that any prefix and suffix redundant vertex in SCD($\alpha^{2m}$), when
processed by Step 3 of Algorithm A, can have both edges incident on $w_1$ and $w_2$ such that
|de($w_1$)| and |de($w_2$)| are at least n only if |de($w_1$)| = |de($w_2$)| = n. $\Box$
CSCD(s) satisfies properties P1, P2, and P3 stated below (Theorem 3). These properties
ensure that the algorithms of [5] can be extended to circular strings.
P1: CV(s) consists of a source and a sink. For each v of CV(s) that is not the source or
sink, the following are true:
(a) |de(v)| < n iff de(v) is a displayable entity of s.
(b) if |de(v)| $\geq$ n, then de(v) is a displayable entity of T.
P2: There exists an re out edge corresponding to letter x in $\Sigma$ from vertex $v_1$ in CV(s) to
vertex $v_2$ in CV(s) iff imp(s, de($v_1$)x) exists and is equal to de($v_2$). If de($v_2$) = $\beta\,de(v_1)\,x\gamma$,
then the label on the re edge is $x\gamma$. If $\beta$ = nil, then the edge is a prefix extension edge.
P3: Similar to P2 but for le edges.
Theorem 3 CSCD(s) satisfies P1, P2, and P3.
Proof Property P1 is established by the knowledge that SCD(T) contains all displayable
entities of T and that Algorithm A only eliminates those displayable entities of T of length
less than n, which are not displayable entities of s (Corollary 4).
P2 and P3 are proved by induction. The induction hypothesis is:
Let $U_s$ be the subset of $U_T$ that remains after the vertex set $U_T \subseteq V(T)$ has been processed
by step 3 of Algorithm A.
(I) Let $R_{U_s}$ be the set of re edges which are incident on vertices in $U_s$. For any re edge $r \in
R_{U_s}$ from vertex u to w with label $x\gamma$, imp(s, de(u)x) = de(w) = $\beta\,de(u)\,x\gamma$. An analogous
condition holds for le edges.
(II) For each vertex u in $U_s \cup (V(T) - U_T)$, there is an re out edge corresponding to each
letter x in succ(T, de(u)) $- \{\infty\}$ incident on a vertex in $U_s \cup (V(T) - U_T)$. An analogous
condition holds for le edges.
When $U_T = V(T)$, we have $U_s = CV(s)$, by definition. So, $R_{CV(s)} = CR(s)$. (I)
establishes that these edges are incident on the correct vertices and that their labels are
correct. (II) establishes that CR(s) is complete. So P2 holds. Similarly, P3 holds.
Induction Base: $U_T = U_s = \{\}$. $R_{U_s}$ and $L_{U_s}$ are empty so (I) does not apply. (II) is
established from the definition of SCD(T).
Induction Step: Consider vertex, v ($v \in V(T)$), which is about to be processed by step 3
of Algorithm A. Let $U_T'$ and $U_s'$ denote $U_T$ and $U_s$ respectively after v has been processed.
We must show that (I) and (II) hold for $U_s'$ and $U_T'$. Since the vertices are processed in
reverse topological order, all out edges from v are incident on vertices in $U_s$ and are therefore
elements of $R_{U_s}$ or $L_{U_s}$. So, they must satisfy (I).
Case 1: v is not redundant. $U_T' = U_T \cup \{v\}$; $U_s' = U_s \cup \{v\}$ since v is not eliminated.
We must show that (I) is true for incoming edges to v as these are the only additions to
$R_{U_s}$ and $L_{U_s}$. I.e., $R_{U_s'} = R_{U_s}$ + {incoming right edges to v}, $L_{U_s'} = L_{U_s}$ + {incoming left
edges to v}.
Let e be an re edge with label $z\gamma$ from u to v. From the definition of SCD(T), we have
de(v) = imp(T, de(u)z) = $\beta\,de(u)\,z\gamma$ for some $\beta$. imp(T, de(u)z) is the smallest superword
of de(u)z in $\{de(w) \mid w \in V(T)\}$. Since $CV(s) \subseteq V(T)$, $\{de(w) \mid w \in CV(s)\} \subseteq \{de(w) \mid w \in
V(T)\}$ and imp(s, de(u)z) = imp(T, de(u)z) iff imp(T, de(u)z) $\in \{de(w) \mid w \in CV(s)\}$. But,
this is true since $v \in CV(s)$. So, de(v) = imp(s, de(u)z) = $\beta\,de(u)\,z\gamma$ and (I) is satisfied. A
symmetric argument can be made for incoming le edges to v.
The letter of the alphabet to which an re (le) out edge corresponds is the first (last)
character in its label. Since no out edges are added, deleted, or redirected and the labels
of all out edges are unchanged, each vertex has an re/le out edge corresponding to the
same letter of the alphabet as it had prior to processing vertex v. So, (II) holds (induction
hypothesis).
Case 2: v is redundant. $U_T' = U_T \cup \{v\}$. $U_s' = U_s$, since v is eliminated.
Subcase (a): v is suffix redundant only. By definition, v has a single re
out edge, e, to a vertex w in $U_s$. Let label(e) = $x\gamma$. From the induction hypothesis,
imp(s, de(v)x) = de(w) = $\beta\,de(v)\,x\gamma$. We first establish that (i) de(w) = imp(s, de(v)) and
(ii) de(v) is a prefix of de(w).
imp(s, de(v)) $\neq$ de(v) as v is redundant. So, imp(s, de(v)) must correspond to a vertex
on which one of the out edges from v is incident, since there is an out edge corresponding
to each element in pred(s, de(v)) $\cup$ succ(s, de(v)) (from (II)). The single re edge is incident
on w, which represents imp(s, de(v)x). The left out edges from v are incident on vertices
which represent imp(s, $x_i$ de(v)) for $1 \leq i \leq |\mathrm{pred}(s, de(v))|$, where $|\mathrm{pred}(s, de(v))| \geq 2$. From the definition of
imp(s, de(v)), none of these vertices can possibly represent imp(s, de(v)). For instance, if
imp(s, $x_i$ de(v)) is imp(s, de(v)), then the string, imp(s, $x_j$ de(v)), $i \neq j$, would invalidate
the definition.
So, imp(s, de(v)) must be de(w). However, for this to be true, we must show that $\beta$ =
nil and therefore that de(v) is a prefix of de(w). All occurrences of de(v) in s are followed
by x. So, $|\mathrm{pred}(T, de(v)x) - \{\infty\}| = |\mathrm{pred}(s, de(v)x)| = |\mathrm{pred}(s, de(v))| \geq 2$. An argument
similar to the one in the previous paragraph shows that for imp(s, de(v)x) to exist, $\beta$ = nil.
We have $R_{U_s'} = R_{U_s}$ - {single re out edge from v} + {incoming re edges to v} and $L_{U_s'}
= L_{U_s}$ - {le out edges from v} + {incoming le edges to v}. (I) and (II) do not apply to the
edges deleted from $R_{U_s}$ and $L_{U_s}$. So, we only need to prove (I) and (II) for incoming edges
to v.
Let $e_R$ be an re edge incident on v from vertex $u_R$ with label $y\gamma_1$ so that $de(v) =
imp(T, de(u_R)y) = \beta_1\,de(u_R)\,y\gamma_1$. $e_R$ must be redirected to $imp(s, de(u_R)y)$ for (I) to
hold. $imp(T, de(u_R)y) = de(v)$ is the smallest superword of $de(u_R)y$ in $\{de(a) \mid a \in V(T)\}$.
$imp(s, de(u_R)y)$ is the smallest superword of $de(u_R)y$ that satisfies the uniqueness
condition in $\{de(a) \mid a \in CV(s)\} \subseteq \{de(a) \mid a \in V(T)\}$. Since $v \notin CV(s)$, $imp(s, de(u_R)y)$ is the
smallest superword of de(v) that satisfies the uniqueness condition in $\{de(a) \mid a \in CV(s)\}$.
So $imp(s, de(u_R)y) = imp(s, de(v)) = de(w) = \beta_1\,de(u_R)\,y\gamma_1 x\gamma$. The updated re edge, $e_R$,
is incident on w and has label $y\gamma_1 x\gamma$, which was obtained in step 3 of Algorithm A by
concatenating label($e_R$) with label(e). If $\beta_1$ = nil, then $e_R$ continues to be a prefix extension
edge. $e_R$ satisfies (I).
Let $e_L$ be an le edge incident on v from $u_L$ so that $de(v) = imp(T, z\,de(u_L)) = \gamma_2 z\,de(u_L)\,\beta_2$
(label($e_L$) = $\gamma_2 z$). Using the same argument that was used for $e_R$, we have $imp(s, z\,de(u_L))
= de(w) = \gamma_2 z\,de(u_L)\,\beta_2 x\gamma$. $e_L$ is redirected to w and its label remains unchanged. Clearly,
$e_L$ is no longer a suffix extension edge even if $\beta_2$ = nil, because $x\gamma \neq$ nil. So, $e_L$ satisfies (I).
Notice that (II) continues to be satisfied as each out edge corresponding to any vertex
in $U' \cup (V(T) - U_T)$ continues to be associated with the same character (in particular,
label($e_R$) continues to begin with y and label($e_L$) continues to end with z); and each out
edge continues to leave the same vertex (in particular, $e_R$ continues to leave $u_R$, $e_L$ continues
to leave $u_L$).
Subcase (b): v is prefix redundant only. Symmetric to subcase (a).
Subcase (c): v is prefix and suffix redundant. So, v has one re out edge, $e_1$, to
vertex $w_1$ in CV(s). Let label($e_1$) = $x\gamma_1$. Also, v has one le out edge, $e_2$, to vertex $w_2$ in
CV(s). Let label($e_2$) = $\beta_2 y$.
From the induction hypothesis, $de(w_1) = imp(s, de(v)x) = \beta_1\,de(v)\,x\gamma_1$ and $de(w_2) =
imp(s, y\,de(v)) = \beta_2 y\,de(v)\,\gamma_2$.
The conditions for Lemma 4 are satisfied since $w_1$ and $w_2$ are not redundant (otherwise
they would have been eliminated). Thus, $de(w_1) = de(w_2) = de(w)$ (say). imp(s, de(v)) can
either be imp(s, de(v)x) or imp(s, y de(v)). But, both these expressions are equal to de(w).
So, imp(s, de(v)) = de(w).
The proof that (I) and (II) are satisfied is similar to that for subcase (a). Note, however,
that any incoming prefix/suffix extension edges to v will no longer remain prefix/suffix
extension edges as $x\gamma_1$ and $\beta_2 y$ are not nil. □
5 Computing Occurrences of Displayable Entities
Procedure LinearOccurrences(S, v) of Figure 14, which is based on the outline in [6], reports
the end position of each occurrence of de(v), $v \in V(S)$, in the linear string S. However,
invoking LinearOccurrences(T, v), $v \in CV(s)$, does not immediately yield all occurrences
of de(v) in T. In Section 5.1 we present a modification which obtains all occurrences of
displayable entities of s. In Section 5.2 we show that this modification is correct and that
its time complexity is optimal.
5.1 Algorithm
An auxiliary boolean array, reported[1..n], is used in conjunction with CSCD(s). Initially,
all elements of this array are set to false. Procedure CircOccurrences(s, v) of Figure 15
computes the end position of each occurrence of de(v) ($v \in CV(s)$) in s. LinearOccurrences(T, v) of
line 1 will not necessarily compute all occurrences of de(v) in T, since it is being executed
on CSCD(s) and not on SCD(T). Note, also, that an occurrence of de(v) ending at
position i ($i \le n$) in T has an identical occurrence ending at position n + i in T (since
T = S.S). Both these occurrences correspond to the same occurrence of de(v) in s. So,
if LinearOccurrences(T, v) reports both occurrences, then only the single corresponding
occurrence of de(v) in s must eventually be reported.
Lines 4-7 transform the occurrence l, if necessary, so that it represents a value between
1 and n. If this occurrence has not already been listed, then it is added to the list of
occurrences and the corresponding element of reported is set to true. If the occurrence has
been listed then it is a duplicate (lines 8-12). After all occurrences have been computed,
all elements of reported are reset to false (lines 14-15) so that reported can subsequently be
reused to compute the occurrences of some other displayable entity in s.
In the example of Figure 11, LinearOccurrences(T, v), where v represents abc, does report
the end positions of all occurrences of abc in T (i.e., 4, 8, and 11). Lines 2 to 12 transform
this into the list of end positions of abc in s (i.e., 1 and 4) corresponding to $s_{6,1}$ and $s_{2,4}$
respectively.
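The folding and duplicate-removal steps of Figure 15 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `fold_occurrences` is hypothetical, the end positions in T are taken as given (as if LinearOccurrences had already produced them), and n = 7 is inferred from the end positions quoted in the example above rather than stated in this section.

```python
def fold_occurrences(linear_ends, n):
    """Map end positions in T = S.S (1..2n) to end positions in s (1..n),
    dropping duplicates, in the manner of lines 2-15 of Figure 15."""
    reported = [False] * (n + 1)   # index 0 unused; positions are 1-based
    final = []
    for l in linear_ends:
        k = l - n if l > n else l  # lines 4-7: fold into the range 1..n
        if not reported[k]:        # lines 8-12: keep the first sighting only
            final.append(k)
            reported[k] = True
    for k in final:                # lines 14-15: reset the array for reuse
        reported[k] = False
    return final


# End positions of abc in T quoted in the text; n = 7 is an inferred value.
print(fold_occurrences([4, 8, 11], n=7))  # prints [4, 1]
```

Resetting only the entries that were set keeps the cleanup proportional to the number of occurrences rather than to n, which is what the optimality argument of Section 5.2 relies on.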
Figure 16 shows the de(v)'s, de(w)'s, and de(x)'s for a hypothetical string T = SS.
Figure 17 shows some fragments of its scdawg. v is suffix redundant in SCD(T) and its
single re out edge is incident on w. There is an re edge from x to v and x is not redundant.
By construction, the re edge from x to v in SCD(T) becomes an re edge from x to w
in CSCD(s). Procedure LinearOccurrences(T,x), x in CSCD(s) will fail to yield the
rightmost occurrence of de(x) in T, since that occurrence is neither a subword of de(w)
Procedure LinearOccurrences(S: string, v: vertex)
{Obtain all occurrences of de(v), v in V(S), in S}
Occurrences(S, v, 0);
Procedure Occurrences(S: linear string, v: vertex, i: integer)
{i is the total length of the re edge labels followed so far}
begin
  if de(v) is a suffix of S
    then output(|S| - i);
  for each re out edge, e, from v in SCD(S) do
  begin
    let w be the vertex on which e is incident;
    Occurrences(S, w, |label(e)| + i);
  end;
end;
Figure 14: Obtaining all occurrences of a displayable entity in a linear string
Procedure CircOccurrences(s: circular string, v: vertex)
{v is a vertex in CSCD(s)}
1  LinearOccurrences(linearize(s).linearize(s), v);
2  for each reported occurrence l of de(v) do
3  begin
4    if (l > |s|) then
5      k := l - |s|
6    else
7      k := l;
8    if not reported[k] then
9    begin
10     add k to final list of occurrences;
11     reported[k] := true
12   end
13 end;
14 for each occurrence, l, of de(v) in s do
15   reported[l] := false;
Figure 15: Obtaining all occurrences of a displayable entity in a circular string
[Figure 16: Example string. The diagram shows T = SS with the occurrences of de(x), de(v), and de(w) marked.]
nor a suffix of T. In the next section, we show that CircOccurrences(s, x) computes all
occurrences of de(x) in s in spite of the fact that LinearOccurrences(T, x) does not compute
all occurrences of de(x) in T.
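The recursion of Figure 14 can be tried out without building an scdawg. The Python sketch below is a brute-force stand-in under stated assumptions: instead of following stored re out edges, it recomputes each edge target imp(S, de(v)x) by scanning S, so it illustrates the logic of the traversal but not its efficiency. The helper names `starts`, `imp`, and `linear_occurrences` are hypothetical.

```python
def starts(S, pat):
    """0-based start positions of all (possibly overlapping) occurrences."""
    return [p for p in range(len(S) - len(pat) + 1) if S.startswith(pat, p)]


def imp(S, alpha):
    """Extend alpha right and left while every occurrence extends the same way.
    Returns (beta + alpha + gamma, len(gamma))."""
    occ = starts(S, alpha)
    left, right = 0, 0
    while True:  # grow gamma
        nxt = {S[p + len(alpha) + right] if p + len(alpha) + right < len(S) else None
               for p in occ}
        if len(nxt) != 1 or None in nxt:
            break
        right += 1
    while True:  # grow beta
        prv = {S[p - left - 1] if p - left - 1 >= 0 else None for p in occ}
        if len(prv) != 1 or None in prv:
            break
        left += 1
    p = occ[0]
    return S[p - left: p + len(alpha) + right], right


def linear_occurrences(S, v, i=0, out=None):
    """1-based end positions of v in S, following the recursion of Figure 14."""
    if out is None:
        out = []
    if S.endswith(v):
        out.append(len(S) - i)
    following = {S[p + len(v)] for p in starts(S, v) if p + len(v) < len(S)}
    for x in sorted(following):          # one "re out edge" per following letter
        w, gamma_len = imp(S, v + x)
        # label(e) = x + gamma, so |label(e)| = 1 + gamma_len
        linear_occurrences(S, w, i + 1 + gamma_len, out)
    return out


print(sorted(linear_occurrences("abcabcab", "ab")))  # prints [2, 5, 8]
```

Each occurrence is reported exactly once because distinct following letters lead to disjoint sets of occurrences, and the accumulated label length i records how far the end of de(v) lies to the left of the suffix that is eventually reached.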
5.2 Proof of Correctness
Let $T_{i,j} = de(v)$, $v \in V(T)$, be a substring of T. Assume that $T_{i,j}$ is not a suffix of T
(i.e., $j \neq 2n$). Let $y = imp(T, de(v)T_{j+1}) = \beta\,de(v)\,T_{j+1}\gamma$. Then, define the immediate
right extension $IRE(SCD(T), T_{i,j})$ of $T_{i,j}$ in SCD(T) to be the occurrence $T_{i-|\beta|,\,j+|\gamma|+1}$ of
displayable entity, y.
Let $T_{i,j} = de(v)$, $v \in CV(s)$, be a substring of T. Assume that $T_{i,j}$ is not a suffix of
T (i.e., $j \neq 2n$). Let $y = imp(s, de(v)T_{j+1}) = \beta\,de(v)\,T_{j+1}\gamma$. Then, define the immediate
right extension $IRE(CSCD(s), T_{i,j})$ of $T_{i,j}$ in CSCD(s) to be the occurrence $T_{i-|\beta|,\,j+|\gamma|+1}$
of displayable entity, y.
So, if in Figure 16, $de(v) = \gamma\,de(x)\,\alpha$ and $de(w) = de(v)\beta$, then
$IRE(SCD(T), T_{2n-|de(x)\alpha|+1,\,2n-|\alpha|}) = T_{2n-|de(v)|+1,\,2n}$, which is the occurrence of de(v)
corresponding to the suffix of T. However,
$IRE(CSCD(s), T_{2n-|de(x)\alpha|+1,\,2n-|\alpha|}) = T_{2n-|de(v)|+1,\,2n+|\beta|}$, which does not represent a valid
substring of T.
[Figure 17: Fragments of scdawgs corresponding to Figure 16. The legend distinguishes left extension edges from right extension edges.]
Let DAWG represent either SCD(T) or CSCD(s). Then $IRE^k(DAWG, T_{i,j})$ denotes
$IRE(DAWG, IRE^{k-1}(DAWG, T_{i,j}))$ if $k \ge 1$, and $T_{i,j}$ if k = 0.
An occurrence $T_{i,j} = de(v)$, $v \in V(T)$, is said to be Right Retrievable (RR) in SCD(T)
iff one of the following is true:
(i) j = 2n.
(ii) $j \neq 2n$ and $IRE(SCD(T), T_{i,j})$ is RR in SCD(T).
Similarly, an occurrence $T_{i,j} = de(v)$, $v \in CV(s)$, is said to be Right Retrievable (RR) in
CSCD(s) iff one of the following is true:
(i) j = 2n.
(ii) $j \neq 2n$ and $IRE(CSCD(s), T_{i,j})$ is RR in CSCD(s).
$IRE(CSCD(s), T_{i,j})$ is defined for any occurrence, $T_{i,j} = de(v)$, $v \in CV(s)$, where
$j \neq 2n$. So, $T_{i,j}$ is not RR in CSCD(s) only if (i) $IRE(CSCD(s), T_{i,j})$ does not represent
a substring of T or (ii) $IRE(CSCD(s), T_{i,j})$ is a valid substring of T, but is not RR in
CSCD(s).
In the example of Figure 16, $T_{2n-|de(x)\alpha|+1,\,2n-|\alpha|}$ is RR in SCD(T), but not RR in
CSCD(s).
Notice that $(i_p, j_p) = IRE(s, (i, j))$ is not a substring of T iff $i_p < 1$ or $j_p > 2n$.
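As a small worked illustration of this index arithmetic, the Python sketch below computes an immediate right extension from the lengths of β and γ and then applies the validity test just stated. It rests on assumptions: the names `ire` and `is_substring_of_T` are hypothetical, and |β| and |γ| are passed in directly rather than looked up in SCD(T) or CSCD(s).

```python
def ire(i, j, beta_len, gamma_len):
    """Immediate right extension of T[i..j] (1-based, inclusive).

    The extended word is beta + de(v) + T[j+1] + gamma, so the occurrence
    grows by |beta| on the left and by |gamma| + 1 on the right."""
    return i - beta_len, j + gamma_len + 1


def is_substring_of_T(i, j, n):
    """An index pair denotes a substring of T = S.S iff 1 <= i and j <= 2n."""
    return i >= 1 and j <= 2 * n


n = 4                                            # so |T| = 2n = 8
print(ire(3, 5, beta_len=1, gamma_len=2))        # prints (2, 8)
print(is_substring_of_T(*ire(3, 5, 1, 2), n))    # prints True
print(is_substring_of_T(*ire(3, 5, 1, 3), n))    # prints False
```

The last call mirrors the situation in Figure 16: an extension taken from CSCD(s) can run past position 2n, in which case it no longer denotes a substring of T.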
Lemma 5 For $k \ge 1$, if $IRE^{k-1}(CSCD(s), T_{i,j})$ and $IRE^{k-1}(CSCD(s), T_{i+n,j+n})$ represent
substrings of T and if $(i_p, j_p) = IRE^k(CSCD(s), T_{i,j})$ and $(i_q, j_q) = IRE^k(CSCD(s), T_{i+n,j+n})$,
then $i_p + n = i_q$ and $j_p + n = j_q$.
Proof Assume that there exists a pair of substrings $T_{i_1,j_1}$ and $T_{i_2,j_2}$ of T, such that $i_2 =
i_1 + n$ and $j_2 = j_1 + n$ and that $j_1 < n$ (i.e., we are assuming that their IRE's are defined).
By symmetry, both occurrences represent the same displayable entity (say, de(v)). Further,
$T_{j_1+1} = T_{j_2+1}$ (also by symmetry). Clearly, $imp(s, de(v).T_{j_1+1}) = imp(s, de(v).T_{j_2+1})$. If
$(i_3, j_3) = IRE(CSCD(s), T_{i_1,j_1})$ and $(i_4, j_4) = IRE(CSCD(s), T_{i_2,j_2})$, then from the
definition of IRE, we have $i_4 = i_3 + n$ and $j_4 = j_3 + n$. Applying this argument repeatedly
proves the lemma. □
Lemma 6 The RR occurrences of de(v), v in V(T) (CV(s)), in SCD(T) (CSCD(s)) are
exactly those occurrences of de(v) which are obtained by LinearOccurrences(T, v).
Proof Follows from the definition of RR occurrences. □
Corollary 5 All occurrences of a pattern de(v) ($v \in V(T)$) in T are obtained by LinearOccurrences(T, v).
Lemma 7 All occurrences of de(v) in T, $v \in CV(s)$, where $|de(v)| \ge n$, are obtained by
LinearOccurrences(T, v).
Proof This follows from Corollary 5 and the construction of CSCD(s), in which no right
out edges from vertices representing displayable entities of size $\ge n$ were modified. □
Lemma 8 All occurrences, $T_{i,j}$, of de(v), where $|de(v)| < n$, $v \in CV(s)$, with $i \le n$, $j > n$
are RR in CSCD(s).
Proof Assume that the lemma is false and that there exists an occurrence, $T_{i,j}$, of de(v)
with $i \le n$, $j > n$ which is not RR in CSCD(s).
Clearly, $j \neq 2n$, otherwise $T_{i,j}$ would be RR in CSCD(s). Let last denote the smallest
value of k for which $IRE^k(CSCD(s), T_{i,j})$ is not a substring of T. Such a last $\ge 1$ must
exist since $T_{i,j}$ is not RR. Let $(i_{last}, j_{last})$ denote $IRE^{last}(CSCD(s), T_{i,j})$. Let z be the
vertex in CV(s) to which $T_{i_{last},j_{last}}$ corresponds.
Case 1. $i_{last} < 1$
Clearly, $n < j_{last} \le 2n$. Consider the string $T_{1,j_{last}}$ in T. Its length is greater than n. If there
were two occurrences of this string in T, then it would be a displayable entity of length > n
(because (i) $T_{1,j_{last}}$ does not have a predecessor and (ii) de(z) is maximal and its occurrences
are not all followed by the same letter). A vertex corresponding to this displayable entity
would not have been eliminated by Algorithm A since its length would be > n and $T_{1,j_{last}}$
would be RR in CSCD(s) (Lemma 7). So, there must exist only one occurrence of the
string represented by $T_{1,j_{last}}$. But, this string is a proper suffix of de(z) which means that
one of its occurrences is preceded by a character. So, there are two occurrences of this
string. This leads to a contradiction.
Case 2. $j_{last} > 2n$
The proof is similar to the one for Case 1. □
Lemma 9 At least one of the two occurrences, $T_{i,j}$ and $T_{i+n,j+n}$, of de(v), $|de(v)| < n$,
$v \in CV(s)$, with $i, j \le n$ is RR in CSCD(s).
Proof Assume that the lemma is false. Let last be the smallest value of k for which either
$IRE^k(CSCD(s), T_{i,j})$ or $IRE^k(CSCD(s), T_{i+n,j+n})$ is not a substring of T. Let $(i_p, j_p) =
IRE^{last}(CSCD(s), T_{i,j})$ and $(i_q, j_q) = IRE^{last}(CSCD(s), T_{i+n,j+n})$.
Case 1. $IRE^{last}(CSCD(s), T_{i,j})$ is not a substring of T; $IRE^{last}(CSCD(s), T_{i+n,j+n})$ is a
substring of T.
I.e., $i_p < 1$ and $j_q \le 2n$. So, $j_p \le n$ and $i_q \le n$ (from Lemma 5). $(i_q, j_q)$ is RR in CSCD(s),
since $(i_q, j_q)$ satisfies the conditions of Lemma 8.
Case 2. $IRE^{last}(CSCD(s), T_{i,j})$ is a substring of T; $IRE^{last}(CSCD(s), T_{i+n,j+n})$ is not a
substring of T.
Symmetric to Case 1.
Case 3. $IRE^{last}(CSCD(s), T_{i,j})$ is not a substring of T; $IRE^{last}(CSCD(s), T_{i+n,j+n})$ is
not a substring of T.
I.e., $i_p < 1$ and $j_q > 2n$. So, $j_p > n$ and $i_q \le n$ (Lemma 5). This is shown to cause a
contradiction by an argument similar to the one in Lemma 8. □
Theorem 4 Procedure CircOccurrences(s, v) correctly obtains all occurrences of de(v) in s.
Proof Lemma 6 shows that LinearOccurrences(T, v) computes all RR occurrences of de(v)
in CSCD(s). Lemmas 8 and 9 show that each occurrence of de(v) in s has at least one
corresponding occurrence in T, which is RR in CSCD(s). CircOccurrences computes these
occurrences in T and transforms them so that they represent occurrences in s, removing
duplicates if any. So, the output is a list of all occurrences of de(v) in s. □
Theorem 5 Procedure CircOccurrences is optimal.
Proof Procedure CircOccurrences(s, v) takes $O(|occ(T, v)|)$ time, where $|occ(T, v)|$ is the
number of occurrences of de(v) in T. Each for loop takes $O(|occ(T, v)|)$ time.
But, $|occ(T, v)| \le 2|occ(s, v)|$, where $|occ(s, v)|$ is the number of occurrences of de(v) in
s. So, the complexity is $O(|occ(s, v)|)$. $|occ(s, v)|$ is the size of the output, so the algorithm
is optimal. □
6 Computing Conflicts Efficiently
[4] defines the concept of conflicts and explains its importance in the analysis and
visualization of strings. Formally,
(i) A subword conflict between two displayable entities, $D_1$ and $D_2$, in S exists iff $D_1$ is a
substring of $D_2$.
(ii) A prefix-suffix conflict between two displayable entities, $D_1$ and $D_2$, in S exists iff there
exist substrings $S_p$, $S_m$, $S_s$ in S such that $S_p S_m S_s$ occurs in S and $S_p S_m = D_1$ and $S_m S_s
= D_2$. The string $S_m$ is known as the intersection of the conflict; the conflict is said to
occur between $D_1$ and $D_2$ with respect to $S_m$.
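The two definitions can be checked directly on small strings. The Python sketch below is a brute-force illustration under stated assumptions, not one of the scdawg-based algorithms of [5]: the function names are hypothetical, the subword test assumes the two entities are distinct, the prefix-suffix test assumes that S_p, S_m, and S_s are all non-empty, and the example strings are arbitrary words rather than verified displayable entities.

```python
def has_subword_conflict(d1, d2):
    """True iff d1 is a substring of d2 (and the two are distinct)."""
    return d1 != d2 and d1 in d2


def prefix_suffix_intersections(S, d1, d2):
    """All intersections S_m such that S_p S_m S_s occurs in S with
    d1 = S_p S_m and d2 = S_m S_s (S_p, S_m, S_s assumed non-empty)."""
    found = []
    for m in range(1, min(len(d1), len(d2))):
        s_m = d2[:m]                       # candidate intersection
        if d1.endswith(s_m) and (d1 + d2[m:]) in S:
            found.append(s_m)
    return found


print(has_subword_conflict("bc", "abcd"))                   # prints True
print(prefix_suffix_intersections("xabcdy", "abc", "bcd"))  # prints ['bc']
```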
[4] also identified a number of problems relating to the computation of conflicts in a linear
string, while [5] presented efficient algorithms for most of these problems (some of which
are listed in the next section). These algorithms typically involve sophisticated traversals
or operations on the scdawg for linear strings. Our extension of scdawgs to circular strings
makes it possible to use the same algorithms to solve the corresponding problems for circular
strings with some minor modifications which are outlined below.
There are conceptually two kinds of traversals that the algorithms of [5] perform on an
scdawg corresponding to a linear string:
(i) Traversal of displayable entities of the string. In these traversals, a vertex is traversed
specifically because it represents a displayable entity of the string.
(ii) Incidental traversals. In these traversals, a vertex is not traversed because it is a
displayable entity, but because it performs some other function. For example, this includes
vertices traversed by LinearOccurrences(T, v).
Traversals of type (i) in CSCD(s) are not required to traverse vertices which represent
displayable entities of size greater than or equal to n. This may be achieved simply by
disabling edges in CSCD(s) which leave a vertex representing a displayable entity of size
less than n and are incident on a vertex representing a displayable entity of size greater
than or equal to n. Traversals of type (ii), however, may be required to traverse vertices
representing displayable entities of size greater than or equal to n. This is achieved by
associating a bit for each edge which is set to 1 if it represents an edge from a vertex whose
displayable entity is of size less than n to a vertex whose displayable entity is of size greater
than or equal to n. Otherwise, it is set to 0. Type (i) traversals check the bit, while type
(ii) traversals ignore it.
Finally, all calls to LinearOccurrences are replaced by calls to CircOccurrences.
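The edge bit described above can be sketched as follows. This is a minimal Python illustration under assumptions: the class names, the field name `crosses_n`, and the toy three-edge graph are hypothetical and do not come from a real CSCD(s); only the rule for setting and checking the bit follows the text.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    de: str                                   # displayable entity of the vertex
    out: list = field(default_factory=list)   # outgoing Edge objects


@dataclass
class Edge:
    target: Vertex
    crosses_n: bool   # 1 iff source has |de| < n and target has |de| >= n


def add_edge(u, w, n):
    u.out.append(Edge(w, len(u.de) < n <= len(w.de)))


def reachable(v, respect_bit):
    """Type (i) traversals pass respect_bit=True and skip flagged edges;
    type (ii) traversals pass respect_bit=False and follow every edge."""
    seen, stack = [], [v]
    while stack:
        u = stack.pop()
        if any(u is s for s in seen):
            continue
        seen.append(u)
        for e in u.out:
            if not (respect_bit and e.crosses_n):
                stack.append(e.target)
    return [u.de for u in seen]


n = 3
d, a, b, c = Vertex("b"), Vertex("ab"), Vertex("abc"), Vertex("abcab")
add_edge(d, a, n)
add_edge(a, b, n)   # flagged: |ab| < 3 <= |abc|
add_edge(b, c, n)
print(reachable(d, respect_bit=True))    # prints ['b', 'ab']
print(reachable(d, respect_bit=False))   # prints ['b', 'ab', 'abc', 'abcab']
```

Storing the bit on the edge means that neither kind of traversal has to compare lengths while it runs.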
7 Other Queries
In this section, we list queries that a system for the visualization and analysis of circular
strings would support. [5] contains algorithms for these same queries for linear strings. In
the previous section, we showed how these algorithms could be modified to support these
queries.
Size Restricted Queries: Experimental data show that random strings contain a large
number of displayable entities whose lengths are small. In most applications, small
displayable entities are uninteresting. Hence, it is useful to list only those displayable entities
whose lengths are greater than some integer, k. Similarly, it is useful to report exactly those
conflicts in which the conflicting displayable entities have length greater than k. This gives
rise to the following problems:
(1) List all occurrences of displayable entities whose length is greater than k.
(2) Compute all prefix-suffix conflicts involving displayable entities of length greater than
k.
(3) Compute all subword conflicts involving displayable entities of length greater than k.
An alternative formulation of the problem which also seeks to achieve the goal outlined
above is based on reporting only those conflicts whose size is greater than k. The size of a
conflict is defined below:
The overlap of a conflict is defined as the string common to the conflicting displayable
entities. The overlap of a subword conflict is the subword displayable entity. The overlap of
a prefix-suffix conflict is its intersection. The size of a conflict is the length of the overlap.
This formulation of the problem is particularly relevant when the conflicts are of more
interest than the displayable entities. It also ensures that all conflicting displayable entities
reported have size greater than k. We have the following problems:
(4) Obtain all prefix-suffix conflicts of size greater than some integer k.
(5) Obtain all subword conflicts of size greater than some integer k.
Pattern Restricted Queries: These queries are useful in applications where the fact
that two patterns have a conflict is more important than the number or location of the
conflicts. The following problems arise as a result:
(6) List all pairs of displayable entities which have subword conflicts.
(7) List all triplets of displayable entities ($D_1$, $D_2$, $D_m$) such that there is a prefix-suffix
conflict between $D_1$ and $D_2$ with respect to $D_m$.
(8) Same as 6, but size restricted as in 5.
(9) Same as 7, but size restricted as in 4.
Statistical Queries: These queries are useful when conclusions are to be drawn from
the data based on statistical facts.
(10) For each pair of displayable entities, $D_1$ and $D_2$, involved in a subword conflict ($D_1$
is the subword of $D_2$), obtain $p(D_1, D_2)$ = (number of occurrences of $D_1$ which occur as
subwords of $D_2$)/(number of occurrences of $D_1$).
(11) For each pair of displayable entities, $D_1$ and $D_2$, involved in a prefix-suffix conflict,
obtain $q(D_1, D_2)$ = (number of occurrences of $D_1$ which have prefix-suffix conflicts with $D_2$)
/(number of occurrences of $D_1$).
If $p(D_1, D_2)$ or $q(D_1, D_2)$ is greater than a statistically determined threshold, then the
following could be said with some confidence: Presence of $D_1$ implies presence of $D_2$.
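The ratio in query (10) can be made concrete with a brute-force Python sketch. It is an illustration under assumptions only: the names `starts` and `p_ratio` are hypothetical, the string is treated as linear, the example words are not verified displayable entities, and $D_1$ is assumed to occur at least once. An scdawg-based implementation would obtain the occurrence lists from CircOccurrences instead of scanning.

```python
from fractions import Fraction


def starts(S, pat):
    """0-based start positions of all occurrences of pat in S."""
    return [p for p in range(len(S) - len(pat) + 1) if S.startswith(pat, p)]


def p_ratio(S, d1, d2):
    """Fraction of the occurrences of d1 that lie inside an occurrence of d2."""
    occ1 = starts(S, d1)
    occ2 = starts(S, d2)
    inside = sum(
        1 for a in occ1
        if any(b <= a and a + len(d1) <= b + len(d2) for b in occ2)
    )
    return Fraction(inside, len(occ1))


# "ab" occurs three times; only the first occurrence lies inside "abc".
print(p_ratio("abcabxab", "ab", "abc"))  # prints 1/3
```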
8 Applications
Circular strings may be used to represent circular genomes [1] such as G4 and φX174. The
detection and analysis of patterns in genomes helps to provide insights into the evolution,
structure, and function of organisms. [1] analyzes G4 and φX174 by linearizing and then
constructing their scdawg. Our work improves upon [1] by:
(i) analyzing circular strings without risking the "loss" of patterns.
(ii) extending the analysis and visualization techniques of [5] for linear strings to circular
strings.
Circular strings in the form of chain codes are also used to represent closed curves in
computer vision [11]. The objects of Figure 18(a) are represented in chain code as follows:
(1) Arbitrarily choose a pixel through which the curve passes. In the diagram, the starting
pixels for the chain code representation of objects 1 and 2 are marked by arrows.
(2) Traverse the curve in the clockwise direction. At each move from one pixel to the next,
the direction of the move is recorded according to the convention shown in Figure 18(b).
Objects 1 and 2 are represented by 1122102243244666666666 and 6666666611:.' .' 11:, .::, ', ,'6
respectively. The alphabet is {0, 1, 2, 3, 4, 5, 6, 7} which is fixed and of constant size (8) and
therefore satisfies the condition of Section 2. We may now use the visualization techniques of
[5] to compare the two objects. For example, our methods would show that objects 1 and 2
share the segments S1 and S2 (Figure 18(c)) corresponding to 0224 and 244666666661122
respectively. Information on other common segments would also be available. The
techniques of this paper make it possible to detect all patterns irrespective of the starting pixels
chosen for the two objects.
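The independence from the starting pixel can be illustrated with the chain code of object 1 given above. The Python sketch below is a brute-force stand-in under assumptions: it uses the doubling idea T = S.S with plain substring search rather than the scdawg for circular strings, the function names are hypothetical, and the rotation offset 5 and the wrap-around pattern 66661122 are arbitrary choices made for the example.

```python
def occurs_linear(S, pat):
    """pat occurs in the linear string S."""
    return pat in S


def occurs_circular(S, pat):
    """pat occurs in the circular string whose linearization is S."""
    return len(pat) <= len(S) and pat in S + S


def is_rotation(a, b):
    """a and b linearize the same circular string (different start pixels)."""
    return len(a) == len(b) and b in a + a


obj1 = "1122102243244666666666"     # chain code of object 1, from the text
rotated = obj1[5:] + obj1[:5]       # same curve, different starting pixel

print(is_rotation(obj1, rotated))           # prints True
print(occurs_linear(obj1, "66661122"))      # prints False
print(occurs_circular(obj1, "66661122"))    # prints True
print(occurs_circular(rotated, "0224"))     # prints True
```

The second and third lines show the kind of pattern that is lost when a circular string is simply linearized: 66661122 straddles the chosen starting pixel, so it is found only when the string is treated as circular.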
Circular strings may also be used to represent polygons in computer graphics and
computational geometry [3]. Figure 19 shows a polygon which is represented by the following
alternating sequence of lines and angles: b/paaeaoeac/pcpeaaeaoeac/pcbacadaca, where α
denotes a 90 degree angle and β, a 270 degree angle.
The techniques of this paper would point out all instances of self similarity in the polygon,
such as aaeaeac/c. Note, however, that for the methods to work efficiently, the number of
lines and angles that are used to represent the polygons must be small and fixed.
9 Conclusions
In this paper, we have defined the scdawg for circular strings and shown how it can be used
to solve problems in the visualization and analysis of patterns in circular strings. We expect
that it can also be used for other string matching applications involving circular strings.
An important feature of the scdawg for circular strings is that it is easy to implement and
use when corresponding techniques for scdawgs for linear strings are already available.
[Figure 18: Representing closed curves by circular strings. (a) Objects 1 and 2 drawn on a pixel grid, with the starting position for each object marked. (b) Chain code representations of directions. (c) The shared segments S1 and S2.]
[Figure 19: Representing polygons by circular strings. The diagram shows a polygon whose sides are labeled with letters.]
Acknowledgement
We are grateful to Professor Gerhard Ritter for pointing out the application of circular
strings to the representation of closed curves.
References
[1] B. Clift, D. Haussler, T.D. Schneider, and G.D. Stormo, "Sequence Landscapes,"
Nucleic Acids Research, vol. 14, no. 1, pp. 141-158, 1986.
[2] G.M. Morris, "The Matching of Protein Sequences using Color Intrasequence
Homology Displays," J. Mol. Graphics, vol. 6, pp. 135-142, 1988.
[3] S.L. Tanimoto, "A method for detecting structure in polygons," Pattern Recognition,
vol. 13, no. 6, 1981.
[4] D. Mehta and S. Sahni, "String Visualization," In Preparation, 1991.
[5] D. Mehta and S. Sahni, "Computing Display Conflicts in String Visualization,"
Submitted for journal publication, 1991.
[6] A. Blumer, J. Blumer, D. Haussler, R. McConnell, and A. Ehrenfeucht, "Complete
Inverted Files for Efficient Text Retrieval and Analysis," J. ACM, vol. 34, no. 3, pp. 578-595,
1987.
[7] A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M.T. Chen, and J. Seiferas, "The
Smallest Automaton Recognizing the Subwords of a Text," Theoretical Computer
Science, no. 40, pp. 31-55, 1985.
[8] M. E. Majster and A. Reiser, "Efficient on-line construction and correction of position
trees," SIAM Journal on Computing, vol. 9, pp. 785-807, Nov. 1980.
[9] E. McCreight, "A space-economical suffix tree construction algorithm," Journal of the
ACM, vol. 23, pp. 262-272, Apr. 1976.
[10] M. T. Chen and Joel Seiferas, "Efficient and elegant subword tree construction," in
Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, eds.), NATO ASI
Series, Vol. F12, pp. 97-107, Berlin Heidelberg: Springer-Verlag, 1985.
[11] R. Gonzalez and P. Wintz, Digital Image Processing, Second Edition. Addison-Wesley, 1987.

