Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: Computing display conflicts in string visualization
Permanent Link: http://ufdc.ufl.edu/UF00095096/00001
 Material Information
Title: Computing display conflicts in string visualization
Alternate Title: Department of Computer and Information Science and Engineering Technical Report ; 24
Physical Description: Book
Language: English
Creator: Sahni, Sartaj
Mehta, Dinesh P.
Affiliation: University of Florida
University of Minnesota
Publisher: Department of Computer and Information Sciences, University of Florida
Place of Publication: Gainesville, Fla.
Copyright Date: 1991
 Record Information
Bibliographic ID: UF00095096
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.












Computing Display Conflicts in String Visualization *


Dinesh P. Mehta †‡


Sartaj Sahni †


Technical Report 24


Abstract

Strings are used to represent a variety of objects such as DNA sequences, text, and
numerical sequences. A model for a system for the visualization and analysis of strings
was proposed in [1]. In this paper, we present algorithms which implement some of the
queries supported by this model.


Keywords and Phrases:
Strings, visualization, analysis, directed acyclic word graphs.




















*This research was supported in part by the National Science Foundation under grant MIP 86-17374.
†Dept. of Computer and Information Sciences, University of Florida, Gainesville, FL 32611
‡Dept. of Computer Science, University of Minnesota, Minneapolis, MN 55455












[Figure: the string abczdefydefxabc with the occurrences of abc and def highlighted]

Figure 1: Highlighting displayable entities

1 Introduction


The string data type is used to represent a number of objects such as text strings, DNA
or protein sequences in molecular biology, numerical sequences, etc. Research in molecular
biology, text analysis, and interpretation of numerical data involves the identification of
recurring patterns in data and hypothesizing about their causes and/or effects [2, 3]. De-
tecting patterns visually in long strings is tedious and prone to error. In [1], a model was
proposed to alleviate this problem. The model consists of identifying all recurring patterns
in a string and highlighting identical patterns in the same color.

We first discuss the notion of maximal patterns. Let abc be a pattern occurring m times
in a string S. Let the only occurrences of ab be those which occur in abc. Then, the
pattern ab is not maximal in S as it is always followed by c. The notion of maximality is
motivated by the assumption that in most applications, longer patterns are more significant
than shorter ones. Maximal patterns that occur at least twice are known as displayable
entities.

The problem of identifying all displayable entities and their occurrences in S can be solved
from the results in [4]. Once all displayable entities and their occurrences are obtained, we
are confronted with the problem of color coding them. In the string, S = abczdefydefxabc,
abc and def are the only displayable entities. So, S would be displayed by highlighting abc
in one color and def in another as shown in Figure 1.

In most strings, we encounter the problem of conflicts: Consider the string S =
abcicdefcdegabchabcde and its displayable entities, abc and cde (both are maximal and occur
thrice). So, they must be highlighted in different colors. Notice, however, that abc and cde
both occur in the substring abcde, which occurs as a suffix of S. Clearly, both displayable
entities cannot be highlighted in different colors in abcde as required by the model. This is










[Figure: the example string S drawn under the alternative display model]

Figure 2: Alternative display model

a consequence of the fact that the letter c occurs in both displayable entities. This situation
is known as a prefix-suffix conflict (because a prefix of one displayable entity is a suffix of
the other). Note, also, that c is a displayable entity in S. Consequently, all occurrences
of c must be highlighted in a color different from those used for abc and cde. But this is
impossible as c is a subword of both abc and cde. This situation is referred to as a subword
conflict. The problem of subword conflicts may be partially alleviated by employing more
sophisticated display models as in Figure 2.

Irrespective of the display model used, it is usually not possible to display all occur-
rences of all displayable entities. We are therefore forced into having to choose which ones
to display. There are three ways of achieving this:
Interactive : The user selects occurrences interactively by using his/her judgement. Typi-
cally, this would be done by examining the occurrences which are involved in a conflict and
choosing one that is the most meaningful.
Automatic : A numeric weight is assigned to each occurrence. The higher the weight, the
greater the desirability of displaying the corresponding occurrence. Criteria that could be
used in assigning weights to occurrences include: length, position, number of occurrences
of the pattern, semantic value of the displayable entity, information on conflicts, etc. The
information is then fed to a routine which selects a set of occurrences so that the sum of
their weights is maximized (algorithms for these are discussed in [1]).
Semi-Automatic: In a practical environment, the most appropriate method would be a
hybrid of the interactive and automatic approaches described above. The user could se-
lect some occurrences that he/she wants included in the final display. The selection of the
remaining occurrences can then be performed by a routine which maximizes the display
information.

All the methods described above require knowledge about the conflicts, either to choose
which occurrences to display (interactive) or to assign weights to the occurrences
(automatic). Automatic methods would require a list of all the conflicts, while interactive
methods require information about conflicts local to a particular segment of the string.
Since prefix-suffix and subword conflicts are handled differently by different display models,
separate lists for each are required.

In this paper we identify a family of problems relating to the identification of conflicts
at various levels of detail. Problems relating to statistical information about conflicts are
also identified. Efficient algorithms for these problems are presented. All algorithms make
use of the symmetric compact directed acyclic word graph (scdawg) data structure [4] and
may be thought of as operations on or traversals of the scdawg. The scdawg, which is used
to represent strings and sets of strings, evolved from other string data structures such as
position trees, suffix trees, and directed acyclic word graphs [5, 6, 7, 8].

Section 2 contains preliminaries including definitions of displayable entities, conflicts,

and scdawgs. Section 3 presents optimal algorithms to determine whether a string has

conflicts and to compute subword and prefix suffix conflicts in a string. Sections 4, 5, and
6 discuss related size restricted, pattern restricted, and statistical problems and show how

to implement these by modifying the algorithms of Section 3. Finally, Section 7 presents

experimental data on the run times of some of these algorithms.



2 Preliminaries


2.1 Definitions


Let S represent a string of length n, whose characters are chosen from a fixed alphabet,
Σ, of constant size. A pattern in S is said to be maximal iff its occurrences are not all

preceded by the same letter, nor all followed by the same letter. Consider the string S =

abczdefydefxabc. Here, abc and def are the only maximal patterns. The occurrences of def
are preceded by different letters (z and y) and followed by different letters (y and x). The

occurrences of abc are not preceded by the same letter (the first occurrence does not have

a predecessor) nor followed by the same letter. However, de is not maximal because all its










occurrences in S are followed by f.


A pattern is said to be a displayable entity (or displayable) iff it is maximal and occurs
more than once in S (all maximal patterns are displayable entities with the exception of S,
which occurs once in itself).
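
The two definitions above can be checked by brute force on short strings. The following is a minimal sketch, not the scdawg-based method of [4]; the function names are illustrative, and a missing neighbour at either end of S is assumed to count as "not the same letter", which is how the abc example above is treated.

```python
def occurrences(s, p):
    """0-based start indices of every (possibly overlapping) occurrence of p in s."""
    return [i for i in range(len(s) - len(p) + 1) if s.startswith(p, i)]


def all_same_letter(neighbours):
    """True if every occurrence has a neighbour and all neighbours are one letter."""
    return None not in neighbours and len(set(neighbours)) == 1


def is_maximal(s, p):
    """Maximal: occurrences are neither all preceded nor all followed by one letter."""
    starts = occurrences(s, p)
    before = [s[i - 1] if i > 0 else None for i in starts]
    after = [s[i + len(p)] if i + len(p) < len(s) else None for i in starts]
    return not all_same_letter(before) and not all_same_letter(after)


def displayable_entities(s):
    """All maximal patterns that occur more than once in s (quadratically many candidates)."""
    candidates = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    return {p for p in candidates
            if len(occurrences(s, p)) > 1 and is_maximal(s, p)}
```

For S = abczdefydefxabc this yields abc and def, and S itself is reported as maximal but not displayable, matching the remark in parentheses above.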

(i) A subword conflict between two displayable entities, D1 and D2, in S exists iff D1 is a
substring of D2.
(ii) A prefix-suffix conflict between two displayable entities, D1 and D2, in S exists iff there
exist substrings, Sp, Sm, Ss in S such that SpSmSs occurs in S, SpSm = D1, and SmSs =
D2. The string, Sm, is known as the intersection of the conflict; the conflict is said to occur
between D1 and D2 with respect to Sm.
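
Both kinds of conflict can be tested directly from these definitions. The sketch below is a brute-force illustration with hypothetical function names; it assumes that Sp, Sm, and Ss are all nonempty and that D1 and D2 have already been identified as displayable entities of S.

```python
def subword_conflict(d1, d2):
    """Definition (i): d1 is a (proper) substring of d2."""
    return d1 != d2 and d1 in d2


def prefix_suffix_intersections(s, d1, d2):
    """Definition (ii): every Sm with Sp+Sm = d1, Sm+Ss = d2 and Sp+Sm+Ss in s.

    Assumption: Sp, Sm and Ss are nonempty, so the overlap length k satisfies
    1 <= k < min(len(d1), len(d2)).
    """
    found = []
    for k in range(1, min(len(d1), len(d2))):
        sm = d1[-k:]                      # candidate intersection: a suffix of d1
        if d2.startswith(sm) and (d1 + d2[k:]) in s:
            found.append(sm)
    return found
```

On the string S = abcicdefcdegabchabcde from the Introduction, abc and cde conflict with respect to the intersection c, and c has a subword conflict with each of them.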


2.2 Symmetric Compact Directed Acyclic Word Graphs (SCDAWGs)


An scdawg, SCD(S), corresponding to a string S is a directed acyclic graph defined by
a set of vertices, V(S), a set, R(S), of labeled directed edges called right extension (re)
edges, and a set, L(S), of labeled directed edges called left extension (le) edges. Each
vertex of V(S) represents a substring of S. Specifically, V(S) consists of a source (which
represents the empty word, λ), a sink (which represents S), and a vertex corresponding to
each displayable entity of S.

Let de(v) denote the string represented by vertex, v (v ∈ V(S)). Define the implication,
imp(S, α), of a string α in S to be the smallest superword of α in {de(v): v ∈ V(S)}, if such
a superword exists. Otherwise, imp(S, α) does not exist.

Re edges from a vertex, v1, are obtained as follows: for each letter, x, in Σ, if imp(S, de(v1)x)
exists and is equal to de(v2) = βde(v1)xγ, then there exists an re edge from v1 to v2 with
label xγ. If β is the empty string, then the edge is known as a prefix extension edge. Le
edges from a vertex, v1, are obtained as follows: for each letter, x, in Σ, if imp(S, xde(v1))
exists and is equal to de(v2) = γxde(v1)β, then there exists an le edge from v1 to v2 with
label γx. If β is the empty string, then the edge is known as a suffix extension edge.

Figure 3 shows V(S) and R(S) corresponding to S = cdefabcgabcde. Here abc, cde, and c are































Figure 3: Scdawg for S = cdefabcgabcde (L(S) not shown)


the displayable entities of S. There are two outgoing re edges from the vertex representing
abc. These edges correspond to x = d and x = g. imp(S, abcd) = imp(S, abcg) = S.
Consequently, both edges are incident on the sink. There are no edges corresponding to the
other letters of the alphabet as imp(S, abcx) does not exist for x ∈ {a, b, c, e, f}.
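
The re edges of this example can be derived mechanically from the definitions of Section 2.2. The sketch below is a brute-force illustration rather than the linear-time construction of [4, 5]: the vertex strings are listed by hand, the function names are hypothetical, and it assumes the smallest superword is unique and that de(v1)x sits at its first occurrence inside de(v2).

```python
S = "cdefabcgabcde"
# Strings represented by the vertices: source, displayable entities, sink.
VERTICES = ["", "c", "abc", "cde", S]


def imp(alpha, vertices=VERTICES):
    """Smallest vertex string that contains alpha, or None if there is none."""
    supers = [d for d in vertices if alpha in d]
    return min(supers, key=len) if supers else None


def re_edges(d1, alphabet, vertices=VERTICES):
    """Map letter x -> (target string, edge label, is_prefix_extension)."""
    edges = {}
    for x in sorted(alphabet):
        d2 = imp(d1 + x, vertices)
        if d2 is None:
            continue                      # no re edge for this letter
        beta_len = d2.find(d1 + x)        # |beta| in d2 = beta + d1 + x + gamma
        label = d2[beta_len + len(d1):]   # x followed by gamma
        edges[x] = (d2, label, beta_len == 0)
    return edges
```

Calling re_edges("abc", set(S)) returns edges only for d and g, both ending at the sink, as stated above; re_edges("c", set(S)) additionally shows the prefix extension edge from c to cde.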

The space required for SCD(S) is O(n) and the time needed to construct it is O(n)

[5, 4]. While we have defined the scdawg data structure for a single string, S, it can be
extended to represent a set of strings.


2.3 Computing Occurrences of Displayable Entities


Figure 4 presents an algorithm for computing the end positions of all the occurrences of
de(v) in S. This is based on the outline provided in [4]. The complexity of Occurrences(S, v, 0)
is proportional to the number of occurrences of de(v) in S.










Algorithm A
Occurrences(S, v, 0)

Procedure Occurrences(S: string, u: vertex, i: integer)
begin
  if de(u) is a suffix of S
    then output(|S| - i);
  for each right out edge, e, from u do
  begin
    Let w be the vertex on which e is incident;
    Occurrences(S, w, |label(e)| + i);
  end;
end;


Figure 4: Algorithm for obtaining occurrences of displayable entities
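
The procedure of Figure 4 can be exercised on the example of Figure 3. In the sketch below the re edges are written out by hand from the definitions in Section 2.2 (an assumption, since the figure itself is not reproduced here), vertices are identified by the strings they represent, and positions are 1-based end positions.

```python
S = "cdefabcgabcde"

# Right extension edges: vertex string -> list of (target string, edge label).
RE_EDGES = {
    "c": [("cde", "de"), (S, "gabcde")],
    "abc": [(S, "de"), (S, "gabcde")],
    "cde": [(S, "fabcgabcde")],
    S: [],
}


def occurrences(u, i=0, out=None):
    """Collect the end positions of de(u) in S, following Figure 4.

    i is the total label length accumulated on the path walked so far.
    """
    if out is None:
        out = []
    if S.endswith(u):                 # de(u) is a suffix of S
        out.append(len(S) - i)
    for w, label in RE_EDGES[u]:      # each right out edge from u
        occurrences(w, i + len(label), out)
    return out
```

Here sorted(occurrences("c")) gives the three end positions of c in S, one for each path from the vertex for c to a vertex whose string is a suffix of S.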

2.4 Prefix and Suffix Extension Trees


The prefix extension tree, PET(S, v), at vertex v in V(S) is a subgraph of SCD(S) con-
sisting of (i) the root, v, (ii) PET(S, w) defined recursively for each vertex w in V(S) such
that there exists a prefix extension edge from v to w, and (iii) the prefix extension edges
leaving v. The suffix extension tree, SET(S, v), at v is defined analogously.

In Figure 3, PET(S, v), de(v) = c, consists of the vertices representing c and cde, and
the sink. It also includes the prefix extension edges from c to cde and from cde to the sink.
Similarly, SET(S, v), de(v) = c, consists of the vertices representing c and abc and the suffix
extension edge from c to abc (not shown in the figure).


Lemma 1 PET(S, v) (SET(S, v)) contains a directed path from v to a vertex, w, in V(S)
iff de(v) is a prefix (suffix) of de(w).


Proof If there is a directed path in PET(S, v) from v to some vertex, w, then from the
definition of a prefix extension edge and the transitivity of the "prefix of" relation, de(v)
must be a prefix of de(w).

If de(v) is a prefix of de(w), then there exists a series of re edges from v to w, such that
de(v), when concatenated with the labels on these edges, yields de(w). But each of these
re edges must be a prefix extension edge. So a directed path from v to w exists in
PET(S, v).

The proof for SET(S, v) is analogous. □



3 Computing Conflicts


3.1 Algorithm to determine whether a string is conflict free


Before describing our algorithm to determine if a string is free of conflicts, we establish

some properties of conflict free strings that will be used in this algorithm.


Lemma 2 If a prefix-suffix conflict occurs in a string S, then a subword conflict must occur

in S.


Proof If a prefix-suffix conflict occurs between two displayable entities, W1 and W2, then
there exists WpWmWs in S such that WpWm = W1 and WmWs = W2. Since W1 and W2 are
maximal, W1 isn't always followed by the same letter and W2 isn't always preceded by the
same letter. I.e., Wm isn't always followed by the same letter and Wm isn't always preceded
by the same letter. So, Wm is maximal. But W1 occurs at least twice in S (since W1 is a
displayable entity). So Wm occurs at least twice (since Wm is a subword of W1) and is a
displayable entity. But Wm is a subword of W1. So a subword conflict occurs between Wm
and W1. □


Corollary 1 If string S is free of subword conflicts, then it is free of conflicts.


Lemma 3 de(w) is a subword of de(v) in S iff there is a path comprising right extension
and suffix extension edges from w to v.


Proof From the definition of SCD(S), if there exists an re edge from u to v, then de(u)
is a subword of de(v). If there exists a suffix extension edge from u to v, then de(u) is a

suffix (and therefore a subword) of de(v). If there exists a path comprising right and suffix

extension edges from w to v, then by transitivity, de(w) is a subword of de(v).










Algorithm NoConflicts(S)

1. Construct SCD(S).

2. Compute Vsource.

3. Scan all right and suffix extension out edges from each element of Vsource. If any edge points to
a vertex other than the sink, then a conflict exists. Otherwise, S is conflict free.

Figure 5: Algorithm to determine whether a string is Conflict Free


If de(w) is a suffix of de(v), then there is a path (Lemma 1) of suffix extension edges
from w to v. If de(w) is a subword, but not a suffix of de(v), then from the definition of an
scdawg, there is a path of re edges from w to a vertex representing a suffix of de(v). □

Let Vsource denote all vertices in V(S) such that an re or suffix extension edge exists
between the source vertex of SCD(S) and each element of Vsource.


Lemma 4 String S is conflict free iff all right extension or suffix extension edges leaving
vertices in Vsource end at the sink vertex of SCD(S).


Proof A string, S, is conflict free iff there does not exist a right or suffix extension edge
between two vertices, neither of which is the source or sink of SCD(S) (Corollary 1 and
Lemma 3).

Assume that S is conflict free. Consider a vertex, v, in Vsource. If v has a right or suffix
extension out edge <v, w>, then v ≠ sink. If w ≠ sink, then de(v) is a subword of de(w)
and the string is not conflict free. This contradicts the assumption on S.

Next, assume that all right and suffix extension edges leaving vertices in Vsource end at
the sink vertex. Clearly, there cannot exist right or suffix extension edges between any two
vertices, v and w (v ≠ sink, w ≠ sink) in Vsource. Further, there cannot exist a vertex, x,
in V(S) (x ≠ source, x ≠ sink) such that x ∉ Vsource. For such a vertex to exist, there
must exist a path consisting of right and suffix extension edges from a vertex in Vsource to
x. Clearly, this is not true. So, S is conflict free. □

The preceding development leads to algorithm NoConflicts (Figure 5).
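
For small inputs, the answer of NoConflicts can be cross-checked without building the scdawg. The sketch below is not the O(n) algorithm of Figure 5; it is a brute-force check that relies only on Corollary 1 (a string is conflict free iff no displayable entity is a proper subword of another), and its function names are illustrative.

```python
def displayable_entities(s):
    """Brute force: patterns occurring at least twice whose occurrences are
    neither all preceded nor all followed by the same letter."""
    def varied(neighbours):
        # A missing neighbour (None) or two distinct letters breaks "all the same".
        return None in neighbours or len(neighbours) > 1

    result = set()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            p = s[i:j]
            starts = [k for k in range(len(s) - len(p) + 1) if s.startswith(p, k)]
            if len(starts) < 2:
                continue
            before = {s[k - 1] if k > 0 else None for k in starts}
            after = {s[k + len(p)] if k + len(p) < len(s) else None for k in starts}
            if varied(before) and varied(after):
                result.add(p)
    return result


def conflict_free(s):
    """Corollary 1: no subword conflicts implies no conflicts at all."""
    entities = displayable_entities(s)
    return not any(a != b and a in b for a in entities for b in entities)
```

The string abczdefydefxabc of Figure 1 is conflict free under this check, while cdefabcgabcde of Figure 3 is not, because c is a subword of both abc and cde.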










Theorem 1 Algorithm NoConflicts is both correct and optimal.


Proof Correctness is an immediate consequence of Lemma 4. Step 1 takes O(n) time
[4]. Step 2 takes O(1) time since |Vsource| < 2|Σ|. Step 3 takes O(1) time since the number
of out edges leaving Vsource is less than 4|Σ|^2. So, NoConflicts takes O(n) time, which is
optimal. Actually, steps 2 and 3 can be merged into step 1 and the construction of SCD(S)
aborted as soon as an edge that violates Lemma 4 is created. □


3.2 Subword Conflicts


Consider the problem of finding all subword conflicts in string S. Let ks be the number
of subword conflicts in S. Any algorithm to solve this problem requires (i) O(n) time to
read in the input string and (ii) O(ks) time to output all subword conflicts. So, O(n + ks)
is a lower bound on the time complexity for this problem. For the string S = a^n, ks =
n^4/24 + n^3/4 - 13n^2/24 - 3n/4 + 1 = O(n^4). This is an upper bound on the number of
conflicts as the maximum number of substring occurrences is O(n^2) and in the worst case,
all occurrences conflict with each other. In this section, a compact method for representing
conflicts is presented. Let ksc be the size of this representation. For a^n, ksc is
n^3/6 + n^2/2 - 5n/3, or O(n^3). Compaction never increases the size of the output and may
yield up to a factor of n reduction, as in the example. The compaction method is described below.
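
The two counts for S = a^n can be cross-checked by direct enumeration. The sketch below rests on an observation derived from the definitions of Section 2.1 rather than stated in the text: the displayable entities of a^n are a^1, ..., a^(n-1), where a^j occurs n-j+1 times in S and a^i occurs j-i+1 times inside a^j. The closed forms are written with the signs that make them agree with this count; the function names are illustrative.

```python
from fractions import Fraction


def count_conflicts(n):
    """Return (ks, ksc) for S = a^n by summing over superwords a^j, 2 <= j <= n-1."""
    ks = ksc = 0
    for j in range(2, n):
        freq = n - j + 1                              # occurrences of a^j in S
        inner = sum(j - i + 1 for i in range(1, j))   # occurrences of a^i in a^j, i < j
        ks += freq * inner                            # one conflict per (outer, inner) pair
        ksc += freq + inner                           # one list of each kind per superword
    return ks, ksc


def closed_forms(n):
    """Polynomial expressions for ks and ksc, evaluated exactly."""
    n = Fraction(n)
    ks = n**4 / 24 + n**3 / 4 - 13 * n**2 / 24 - 3 * n / 4 + 1
    ksc = n**3 / 6 + n**2 / 2 - 5 * n / 3
    return ks, ksc
```

For n = 5 both functions return 41 conflicts and a compact size of 25, and the two agree for every n tried in a small range.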

Consider S= abcdbcgabcdbchbc. The displayable entities are D1 = abcdbc and D2 = bc.
The ending positions of D1 are 6 and 13 while those of D2 are 3, 6, 10, 13, and 16. A
list of the subword conflicts between D1 and D2 can be written as: {(6,3), (6,6), (13,10),
(13,13)}. The first element of each ordered pair is the last position of the instance of the
superstring (here, D1) involved in the conflict; the second element of each ordered pair is
the last position of the instance of the substring (here, D2) involved in the conflict.

The cardinality of the set is the number of subword conflicts between D1 and D2. This
is given by: frequency(D1) * number of occurrences of D2 in D1. Since each conflict is
represented by an ordered pair, the size of the output is 2(frequency(D1) * number of
occurrences of D2 in D1).










Observe that the occurrences of D2 in D1 are in the same relative positions in all instances
of D1. It is therefore possible to write the list of subword conflicts between D1 and D2 as:
(6,13):(0,-3). The first list gives all the occurrences in S of the superstring (D1), and the
second gives the relative positions of all the occurrences of the substring (D2) in the
superstring (D1) from the right end of D1. The size of the output is now: frequency(D1) +
number of occurrences of D2 in D1. This is more economical than our earlier representation.
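
The two representations of this example can be reproduced by brute force. The sketch below uses plain string scanning instead of the scdawg, with illustrative function names, and reports 1-based end positions as in the text.

```python
def end_positions(s, p):
    """1-based end positions of every occurrence of p in s."""
    return [i + len(p) for i in range(len(s) - len(p) + 1) if s.startswith(p, i)]


def conflict_pairs(s, d1, d2):
    """Original representation: (end of a d1 instance, end of a d2 instance inside it)."""
    return [(e1, e2)
            for e1 in end_positions(s, d1)
            for e2 in end_positions(s, d2)
            if e1 - len(d1) + len(d2) <= e2 <= e1]   # d2 lies wholly inside this d1


def compact_conflicts(s, d1, d2):
    """Compact representation: occurrences of d1 in s, then offsets of d2 within d1
    measured from the right end of d1 (0 means d2 is a suffix of d1)."""
    relative = sorted((e - len(d1) for e in end_positions(d1, d2)), reverse=True)
    return end_positions(s, d1), relative
```

With S = abcdbcgabcdbchbc, D1 = abcdbc, and D2 = bc, conflict_pairs lists four ordered pairs (eight numbers) while compact_conflicts needs only two lists of two numbers each.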

In general, a substring, Di, of S will have conflicts with many instances of a number of
displayable entities (say, Dj, Dk,..., Dz) of which it (Di) is the superword. We would then
write the conflicts of Di as:

$(l_1, l_2, \ldots, l_{m_i}) : (l_1^j, l_2^j, \ldots, l_{m_j}^j), (l_1^k, l_2^k, \ldots, l_{m_k}^k), \ldots, (l_1^z, l_2^z, \ldots, l_{m_z}^z)$.

Here, the $l_i$'s represent all the occurrences of Di in S; the $l^j$'s, $l^k$'s,..., $l^z$'s represent the
relative positions of all the occurrences of Dj, Dk,..., Dz in Di. One such list will be
required for each displayable entity that contains other displayable entities as subwords. The
following equalities are easily obtained:
Size of Compact Representation = $\sum_{D_i \in D} \left( f_i + \sum_{D_j \in D_i^s} r_{ij} \right)$.
Size of Original Representation = $2 \sum_{D_i \in D} \left( f_i \cdot \sum_{D_j \in D_i^s} r_{ij} \right)$.

$f_i$ is the frequency of Di (only Di's that have conflicts are considered). $r_{ij}$ is the frequency
of Dj in one instance of Di. D represents the set of all displayable entities of S. $D_i^s$
represents the set of all displayable entities that are subwords of Di.



SG(S, v), v ∈ V(S), is defined as the subgraph of SCD(S) which consists of the set of
vertices, SV(S, v) ⊆ V(S), which represents displayable entities that are subwords of de(v)
and the set SE(S, v) of all re and suffix extension edges that connect any pair of vertices
in SV(S, v). Define SGR(S, v) as SG(S, v) with the directions of all the edges in SE(S, v)
reversed.


Lemma 5 SG(S, v) consists of all vertices, w, such that a path comprising right or suffix
extension edges joins w to v in SCD(S).


Proof Follows from Lemma 3. □





















Algorithm B
1 begin
2   for each vertex, v, in SCD(S) do
3   begin
4     v.subword = false;
5     for all vertices, u, such that a right or suffix extension edge, <u, v>, is incident on v do
6       if u ≠ source then v.subword = true;
7   end
8   for each vertex, v, in SCD(S) such that v ≠ sink and v.subword is true do
9     GetSubwords(v);
10 end

Procedure GetSubwords(v)
1 begin
2   Occurrences(S, v, 0);
3   output(v.list);
4   v.sublist = {0};
5   SetUp(v);
6   SetSuffixes(v);
7   for each vertex, x (≠ source), in reverse topological order of SG(S, v) do
8   begin
9     if de(x) is a suffix of de(v) then x.sublist = {0} else x.sublist = {};
10    for each vertex, w, in SG(S, v) on which an re edge, e, from x is incident do
11    begin
12      for each element, l, in w.sublist do
13        x.sublist = x.sublist ∪ {l - |label(e)|};
14    end;
15    output(x.sublist);
16  end;
17 end



Figure 6: Optimal algorithm to compute all subword conflicts
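
The core of GetSubwords (lines 7 to 16) is the propagation of sublist values along re edges. The sketch below illustrates it on S = abcdbcgabcdbchbc with v the vertex for abcdbc. The single re edge inside SG(S, v) is derived by hand from the definitions of Section 2.2 and is therefore an assumption, vertices are identified by their strings, and memoised recursion stands in for the explicit reverse topological traversal because it also computes w.sublist before x.sublist.

```python
from functools import lru_cache

V = "abcdbc"   # de(v), the superword whose subword conflicts are wanted

# Right extension edges between vertices of SG(S, v): string -> [(target, label)].
RE_EDGES_IN_SG = {
    "bc": [("abcdbc", "dbc")],
    "abcdbc": [],
}


@lru_cache(maxsize=None)
def sublist(x):
    """Relative end positions of de(x) inside de(v); 0 is the last position of de(v)."""
    # Line 9 of GetSubwords: a suffix of de(v) ends at relative position 0.
    rel = {0} if V.endswith(x) else set()
    # Lines 10-14: shift every relative occurrence of w left by the edge label length.
    for w, label in RE_EDGES_IN_SG[x]:
        rel |= {l - len(label) for l in sublist(w)}
    return frozenset(rel)
```

Evaluating sublist("bc") gives the offsets 0 and -3, the second list of the compact representation (6,13):(0,-3) from Section 3.2.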










Algorithm B of Figure 6 computes the subword conflicts of S. The subword conflicts are

computed for precisely those displayable entities which have subword displayable entities.
Lines 4 to 6 of Algorithm B determine whether de(v) has subword displayable entities. Each

incoming right or suffix extension edge to v is checked to see whether it originates at the

source. If any incoming edge originates at a vertex other than source, then v.subword is set

to true (Lemma 3). If all incoming edges originate from source, then v.subword is set to
false. Procedure GetSubwords(v), which computes the subword conflicts of de(v), is invoked

if v.subword is true.

Procedure Occurrences(S, v, 0) (line 2 of GetSubwords) computes the occurrences of de(v)
in S and places them in v.list. Procedure SetUp in line 5 traverses SGR(S, v) and initializes

fields in each vertex of SGR(S, v) so that a reverse topological traversal of SG(S, v) may be

subsequently performed. Procedure SetSuffixes in line 6 marks vertices whose displayable
entities are suffixes of de(v). This is accomplished by following the chain of reverse suffix

extension pointers starting at v and marking the vertices encountered as suffixes of v.

A list of relative occurrences, sublist, is associated with each vertex, x, in SG(S, v).
x.sublist represents the relative positions of de(x) in an occurrence of de(v). Each relative
occurrence is denoted by its position relative to the last position of de(v), which is
represented by 0. If de(x) is a suffix of de(v) then x.sublist is initialized with the element, 0.
The remaining elements of x.sublist are computed from the sublist fields of vertices, w, in
SG(S, v) such that a right extension edge goes from x to w. Consequently, w.sublist must
be computed before x.sublist. This is achieved by traversing SG(S, v) in reverse topological
order [9].


Lemma 6 x.sublist for vertex, x, in SG(S, v) contains all relative occurrences of de(x) in

de(v) on completion of GetSubwords(v).


Proof The correctness of this lemma follows from the correctness of procedure Occurrences(S, v, 0)
of Section 2.3 and the observation that lines 7 to 15 of procedure GetSubwords achieve the
same effect as Occurrences(S, v, 0) in SG(S, v). □


Theorem 2 Algorithm B takes O(n + ksc) time and space and is therefore optimal.










Proof Computing v.subword for each vertex, v, in V(S) takes O(n) time as constant time
is spent at each vertex and edge in SCD(S). Consider the complexity of GetSubwords(v).
Lines 2 and 3 take O(|v.list|) time. Let the number of vertices in SG(S, v) be m. Then the
number of edges in SG(S, v) is O(m). Line 5 traverses SG(S, v) and therefore consumes
O(m) time. Line 6, in the worst case, could involve traversing SG(S, v), which takes O(m)
time. Computing the relative occurrences of de(x) in de(v) (lines 9-15) takes O(|x.sublist|)
time for each vertex, x, in SG(S, v). So, the total complexity of GetSubwords(v) is
$O(|v.list| + m + \sum_{x \in SV(S,v),\, x \neq v} |x.sublist|)$.

However, m is $O(\sum_{x \in SV(S,v),\, x \neq v} |x.sublist|)$, since |x.sublist| ≥ 1 for each x ∈ SG(S, v).
But $|v.list| + \sum_{x \in SV(S,v),\, x \neq v} |x.sublist|$ is the size of the output for GetSubwords(v).

So, the overall complexity of algorithm B is
$O(n + \sum_{v \in V(S) - \{sink\},\, v.subword = true} |\text{output for GetSubwords}(v)|)$ = O(n + ksc). □


3.3 Prefix Suffix Conflicts


As with subword conflicts, the lower bound for the problem of computing prefix-suffix
conflicts is O(n + kp), where kp is the number of prefix-suffix conflicts in S. For S = a^n, kp
is n4/24 n3/12 L.,2/24 21n/12 + 1 = O(n4), which is also the upper bound on kp.
Unlike subword conflicts, it is not possible to compact the output representation.

Let w and x, respectively, be vertices in SET(S, v) and PET(S, v). Let de(v) = Wv,
de(w) = WwWv, and de(x) = WvWx. Define Pshadow(w, v, x) to be the vertex representing
imp(S, WwWvWx), if such a vertex exists. Otherwise, Pshadow(w, v, x) = nil. We
define Pimage(w, v, x) = Pshadow(w, v, x) iff de(Pshadow(w, v, x)) = imp(S, WwWvWx) =
WaWwWvWx for some (possibly empty) string, Wa. Otherwise, Pimage(w, v, x) = nil.
For each vertex, w, in SET(S, v), a shadow prefix dag, SPD(w, v), rooted at vertex w
comprises the set of vertices {Pshadow(w, v, x) | x on PET(S, v), Pshadow(w, v, x) ≠
nil}.

Figure 7 illustrates these concepts. Broken lines represent suffix extension edges, dot-
ted lines represent right extension edges, and solid lines represent prefix extension edges.

























[Figure: a fragment of an scdawg with the regions SET(S, v), PET(S, v), and SPD(w, v) outlined]

Figure 7: Illustration of prefix and suffix trees and a shadow prefix dag










SET(S, v), PET(S, v), and SPD(w, v) have been enclosed by dashed, solid, and dotted
lines respectively. We have: Pshadow(w, v, v) = Pimage(w, v, v) = w. Pshadow(w, v,z) =
Pshadow(w, v, r) = c. However, Pimage(w, v, z)= Pimage(w, v, r) = nil. Pshadow(w, v, x)
= Pimage(w, v, x) = a. Pshadow(w, v,p)= b, but Pimage(w, v,p)= nil. Pshadow(w, v, q)
= Pshadow(w, v,s) = Pimage(w, v,q) = Pimage(w, v,s) = nil.


Lemma 7 A prefix-suffix conflict occurs between two displayable entities, W1 = de(w)
and W2 = de(x), with respect to a third displayable entity Wm = de(v) iff (i) w occurs
in SET(S, v) and x occurs in PET(S, v), and (ii) Pshadow(w, v, x) ≠ nil. The number of
conflicts between de(w) and de(x) with respect to de(v) is equal to the number of occurrences
of de(Pshadow(w, v, x)) in S.


Proof By definition, a prefix-suffix conflict occurs between displayable entities W1 and
W2 with respect to Wm iff there exists WpWmWs in S, where W1 = WpWm and W2 =
WmWs.

Clearly, Wm is a suffix of W1 and Wm a prefix of W2 iff w occurs in SET(S, v) and x
occurs in PET(S, v). WpWmWs occurs in S iff imp(S, WpWmWs) = de(Pshadow(w, v, x)), i.e.,
Pshadow(w, v, x) ≠ nil. The number of conflicts between de(w) and de(x) is equal to the
number of occurrences of imp(S, WpWmWs) = de(Pshadow(w, v, x)) in S. □


Lemma 8 If a prefix-suffix conflict does not occur between de(w) and de(x) with respect
to de(v), where w occurs in SET(S,v) and x occurs in PET(S,v), then there are no

prefix-suffix conflicts between any displayable entity which represents a descendant of w
in SET(S, v) and any displayable entity which represents a descendant of x in PET(S, v)
with respect to de(v).


Proof Since w is in SET(S, v) and x is in PET(S, v), we can represent de(w) by Wpde(v)
and de(x) by de(v)Ws. If no conflicts occur, then Wpde(v)Ws does not occur in S. The
descendants of w in SET(S, v) will represent displayable entities of the form Wade(w) =
WaWpde(v), while the descendants of x in PET(S, v) will represent displayable entities
of the form de(x)Wb = de(v)WsWb, where Wa and Wb are substrings of S. For a prefix-suffix











[Figure: vertices v, w, x, y, z, and u of SCD(S) with edges e and f; legend: prefix extension edge, suffix extension edge, right extension edge]

Figure 8: Illustration of conditions for Lemma 9


conflict to occur between Wade(w) and de(x)Wb with respect to de(v), WaWpde(v)WsWb
must exist in S. However, this is not possible as Wpde(v)Ws does not occur in S and the
result follows. □


Lemma 9 In SCD(S), if (i) y = Pimage(w, v, x), (ii) there is a prefix extension edge, e,
from x to z with label aα, and (iii) there is a right extension edge, f, from y to u with label aβ,
then Pshadow(w, v, z) = u.


Proof Let de(w) = Wwde(v), de(x) = de(v)Wx. By definition, de(y) = WaWwde(v)Wx
for some possibly empty string Wa. de(z) = de(x)aα = de(v)Wxaα. de(u) = Wbde(y)aβ =
WbWaWwde(v)Wxaβ for some string Wb.

Pshadow(w, v, z) is the vertex representing imp(S, Wwde(v)Wxaα). To prove the lemma, we
must show that Pshadow(w, v, z) = u. I.e., that (i) Wwde(v)Wxaα is a subword of de(u) and
(ii) de(u) is the smallest superword of Wwde(v)Wxaα represented by a vertex in SCD(S).










(i) Assume that Wwde(v)Wxaα is not a subword of de(u) = WbWaWwde(v)Wxaβ. I.e.,
α is not a prefix of β.
Case 1: β is a proper prefix of α.
Since WbWaWwde(v)Wxaβ is maximal, its occurrences are not all followed by the same
letter. This is true for any of its suffixes. In particular, all occurrences of de(v)Wxaβ cannot
be followed by the same letter. Similarly, all occurrences of de(v)Wxaβ cannot be preceded
by the same letter as it is a prefix of de(v)Wxaα = de(z). So, de(v)Wxaβ is a displayable
entity of S. Consequently, the prefix extension edge from x corresponding to the letter a
must be directed to the vertex representing de(v)Wxaβ. This is a contradiction.
Case 2: aβ matches aα in the first k characters, but not in the (k + 1)'th character
(1 ≤ k < 1 + min(|α|, |β|)).
We have aβ = aγβ1, aα = aγα1, where |γ| = k - 1. Clearly, the strings de(v)Wxaγα1 and
WbWaWwde(v)Wxaγβ1 occur in S. I.e., all occurrences of de(v)Wxaγ cannot be followed
by the same letter. Further, all occurrences of de(v)Wxaγ cannot be preceded by the same
letter as it is a prefix of de(v)Wxaα = de(z). So, it is a displayable entity of S. Consequently,
the prefix extension edge from x corresponding to the letter a must be directed to the vertex
representing de(v)Wxaγ. This results in a contradiction. Thus, α is a prefix of β.

(ii) From (i), α is a prefix of β. Assume that WbWaWwde(v)Wxaβ is not the smallest
superword of Wwde(v)Wxaα. Since de(y) = de(Pimage(w, v, x)) = WaWwde(v)Wx is the smallest
superword of Wwde(v)Wx, the smallest superword of Wwde(v)Wxaα must be of the form
Wb'WaWwde(v)Wxaγ, where α is a prefix of γ which is a proper prefix of β and/or Wb' is a
proper suffix of Wb. But the right out edge, f, from y points to the smallest superword of
WaWwde(v)Wxa (from the definition of SCD(S)), which is WbWaWwde(v)Wxaβ. So, Wb'
= Wb, γ = β, which is a contradiction. □


Lemma 10 In SCD(S), if (i) y = Pimage(w, v, x), (ii) there is a path of prefix extension
edges from x to x1 (let the concatenation of their labels be aα), (iii) there is a prefix extension
edge from x1 to z with label bγ, and (iv) there is a right extension edge, f, from y to u with
label aαbβ, then u = Pshadow(w, v, z) ≠ nil.


Proof Similar to proof of Lemma 9. □





















[Figure: vertices v, x, y, u, and z; a right extension edge from y to u labeled aαbβ; prefix extension edge path P; legend: prefix extension edge, suffix extension edge, right extension edge]

Figure 9: Illustration of conditions for Lemmas 10 and 11











Algorithm C
1 Construct SCD(S).
2 for each vertex, v, in SCD(S) do
3   NextSuffix(v, v);

Procedure NextSuffix(current, v);
1 for each suffix extension edge < current, w > do
2   {there can only be one suffix extension edge from current to w}
3   begin
4     exist := false;
5     ShadowSearch(v, w, v, w);
6     if exist then NextSuffix(w, v);
7   end;


Figure 10: Optimal algorithm to compute all prefix-suffix conflicts


Lemma 11 In Lemma 9 or Lemma 10, if |label(f)| ≤ sum of the lengths of the labels of the edges on the prefix extension edge path P from x to z, then label(f) = concatenation of the labels on P and u = Pimage(w, v, z).


Proof From Lemma 10, the concatenation of the labels of the edges of P is a prefix of label(f). But, |label(f)| ≤ sum of the lengths of the labels of the edges on P. I.e., label(f) = concatenation of the labels of the edges on P. de(u) = Pimage(w, v, z) follows. □


Lemma 12 If Pshadow(w, v, x) = nil then Pshadow(w, v, y) = nil for all descendants, y, of x in PET(S, v).


Proof Follows from Lemmas 7 and 8. □

Algorithm C in Figure 10 computes all prefix-suffix conflicts of S. Line 1 constructs SCD(S). Lines 2 and 3 compute all prefix-suffix conflicts in S by separately computing, for each displayable entity de(v), all the prefix-suffix conflicts of which it is the intersection. Procedure NextSuffix(current, v) computes all prefix-suffix conflicts between displayable entities represented by descendants of current in SET(S, v) and displayable entities represented by descendants of v in PET(S, v) with respect to de(v) (so the call to NextSuffix(v, v) in line 3 of Algorithm C computes all prefix-suffix conflicts with respect to de(v)). It does so by identifying SPD(w, v) for each child, w, of current in SET(S, v). The call to ShadowSearch(v, w, v, w) in line 5 identifies SPD(w, v) and computes all prefix-suffix conflicts between de(w) and displayable entities represented by descendants of v in PET(S, v) with respect to de(v). If ShadowSearch(v, w, v, w) does not report any prefix-suffix conflicts, then the global variable exist is unchanged by ShadowSearch(v, w, v, w) (i.e., exist = false, from line 4). Otherwise, it is set to true by ShadowSearch. Line 6 ensures that NextSuffix(w, v) is called only if ShadowSearch(v, w, v, w) detected prefix-suffix conflicts between de(w) and displayable entities represented by descendants of v in PET(S, v) with respect to de(v) (Lemma 8).
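The control flow just described can be sketched in Python. This is only a skeleton under stated assumptions: `suffix_children` (a mapping from a vertex to the heads of its suffix extension edges) and `shadow_search` (a callback that reports the conflicts for one child w and returns whether it found any) are hypothetical stand-ins introduced here. Neither the construction of SCD(S) nor ShadowSearch itself is implemented, and the callback's return value replaces the global variable exist.

```python
from typing import Callable, Dict, Hashable, Iterable, List

Vertex = Hashable


def next_suffix(current: Vertex, v: Vertex,
                suffix_children: Dict[Vertex, List[Vertex]],
                shadow_search: Callable[[Vertex, Vertex], bool]) -> None:
    """Pruned depth-first walk mirroring NextSuffix(current, v)."""
    for w in suffix_children.get(current, []):
        # shadow_search(v, w) stands in for ShadowSearch(v, w, v, w) and
        # returns True if at least one conflict was reported for de(w).
        exist = shadow_search(v, w)
        if exist:
            # Only descend below w when w itself produced conflicts.
            next_suffix(w, v, suffix_children, shadow_search)


def algorithm_c(vertices: Iterable[Vertex],
                suffix_children: Dict[Vertex, List[Vertex]],
                shadow_search: Callable[[Vertex, Vertex], bool]) -> None:
    """Lines 2 and 3 of Algorithm C: one pruned walk per vertex v."""
    for v in vertices:
        next_suffix(v, v, suffix_children, shadow_search)


# Toy run: a made-up suffix extension tree in which only "a" and "c"
# produce conflicts, so the subtree below "b" is never explored.
visited = []


def toy_search(v, w):
    visited.append(w)
    return w in {"a", "c"}


next_suffix("v", "v", {"v": ["a", "b"], "a": ["c"], "b": ["d"]}, toy_search)
```

In the toy run, the vertex "d" is never examined because its parent "b" reported no conflicts, which is the pruning that line 6 of NextSuffix provides.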

For each descendant, q, of vertex x in PET(S, v), procedure ShadowSearch(v, w, x, y) computes all prefix-suffix conflicts between de(w) and de(q) with respect to de(v). y represents Pshadow(w, v, x). We will show that all calls to ShadowSearch maintain the invariant (which is referred to as the image invariant hereafter) that y = Pimage(w, v, x) ≠ nil. Notice that the invariant holds when ShadowSearch is called from NextSuffix, as w = Pimage(w, v, v). The for statement in line 1 examines each prefix out edge from x. Lines 3 to 28 compute all prefix-suffix conflicts between de(w) and displayable entities represented by vertices in PET(S, z), where z is the vertex on which the prefix extension edge from x is incident. The truth of the condition in the for statement of line 1, line 4, and the truth of the condition inside the if statement of line 5 establish that the conditions of Lemma 9 are satisfied prior to the execution of lines 8 and 9. The truth of the comment in line 8 and the correctness of line 9 are established by Lemma 9. Procedure ListConflicts of line 9 lists all prefix-suffix conflicts between de(w) and de(z) with respect to de(v). Similarly, the truth of the condition inside the while statement of line 11, lines 13 and 14, and the truth of the condition inside the if statement of line 15 establish that the conditions of Lemma 10 are satisfied prior to the execution of lines 18-20. Again, the correctness of lines 18-20 is established by Lemma 10. If done remains false on exiting the while loop, the condition of the if statement of line 15 must have evaluated to true. Consequently, the conditions of Lemma 10 apply. Further, since the while loop of line 11 terminated, the additional condition of Lemma 11 is also satisfied. Hence, from Lemma 11, u = Pimage(w, v, z) and the





















Procedure ShadowSearch(v, w, x, y);
1 for each prefix extension edge e = < x, z > do
2   {There can only be one prefix extension edge from x to z}
3   begin
4     fc := first character in label(e);
5     if there is a right extension edge, f = < y, u >, whose label starts with fc
6     then
7       begin
8         {u = Pshadow(w, v, z)}
9         ListConflicts(u, z, w);
10        distance := 0; done := false;
11        while (not done) and (|label(f)| > |label(e)| + distance) do
12          begin
13            distance := distance + |label(e)|;
14            nc := (distance + 1)th character in label(f);
15            if there is a prefix extension edge < z, r > starting with nc
16            then
17              begin
18                z := r;
19                {u = Pshadow(w, v, z)}
20                ListConflicts(u, z, w);
21              end
22            else
23              done := true;
24          end;
25        if (not done) then
26          ShadowSearch(v, w, z, u);
27        exist := true;
28      end
29   end




Figure 11: Algorithm for shadow search










image invariant for the recursive call to ShadowSearch(v, w, z, u) is maintained. Line 27 sets the global variable exist to true since the execution of the then clause of the if statement of line 5 ensures that at least one prefix-suffix conflict is reported by ShadowSearch(v, w, v, w) (Lemmas 7 and 9). exist remains false only if the then clause of the if statement (line 5) is never executed.


Theorem 3 Algorithm C computes all prefix-suffix conflicts of S in O(n + k_p) space and time, which is optimal.


Proof Line 1 of Algorithm C takes O(n) time [4]. The cost of lines 2 and 3 without
including the execution time of NextSuffix(v, v) is O(n).

Next, we show that NextSuffix(v, v) takes O(k_v) time, where k_v is the number of prefix-suffix conflicts with respect to v (i.e., k_v represents the size of the output of NextSuffix(v, v)). Assume that NextSuffix is invoked p times in the computation. Let S_T be the set of invocations of NextSuffix which do not call NextSuffix recursively. Let p_T = |S_T|. Let S_F be the set of invocations of NextSuffix which do call NextSuffix recursively. Let p_F = |S_F|. Each element of S_F can directly call at most |Σ| elements of S_T. So, p_T/p_F ≤ |Σ|. From lines 4-6 in NextSuffix(current, v), each element of S_F yields at least one distinct conflict from its call to ShadowSearch. Thus, p_F ≤ k_v. So, p = p_T + p_F ≤ (|Σ| + 1)k_v = O(k_v). The cost of execution of NextSuffix without including the costs of recursive calls to NextSuffix and ShadowSearch is O(|Σ|) (= O(1)) as there are at most |Σ| suffix edges leaving a vertex. So, the total cost of execution of all invocations of NextSuffix spawned by NextSuffix(v, v) without including the cost of recursive calls to ShadowSearch is O(p|Σ|) = O(k_v).

Next, we consider the calls to ShadowSearch that were spawned by NextSuffix(v, v). Let T_A be the set of invocations of ShadowSearch which do not call ShadowSearch recursively. Let q_A = |T_A|. Let T_B be the set of invocations of ShadowSearch which do call ShadowSearch recursively. Let q_B = |T_B|. Let q = q_A + q_B. We have q_A ≤ |Σ|q_B + |Σ|. So, q = q_A + q_B ≤ (|Σ| + 1)q_B + |Σ|. From the algorithm, each element of T_B yields a distinct conflict. So, q_B ≤ k_v. So, q ≤ (|Σ| + 1)q_B + |Σ| = O(k_v). The cost of execution of a single call to ShadowSearch without including the cost of executing recursive calls to ShadowSearch is O(1) + O(w) + O(complexity of ListConflicts of line 9) + O(sum over i = 1, ..., w of (complexity of ListConflicts of line 20 in the i'th iteration of the while loop)), where w denotes the number of iterations of the while loop. The complexity of ListConflicts is proportional to the number of conflicts it reports. Since ListConflicts always yields at least one distinct conflict, the complexity of ShadowSearch is O(1 + |output|). Summing over all calls to ShadowSearch spawned by NextSuffix(v, v), we obtain O(q + k_v) = O(k_v). Thus, the total complexity of Algorithm C is O(n + k_p). □


3.4 Alternative Algorithms


In this section, an algorithm for computing all conflicts (i.e., both subword and prefix-suffix conflicts) is presented. This solution is relatively simple and has competitive run times. However, it lacks the flexibility required to efficiently solve many of the problems listed in Sections 4, 5, and 6. The algorithm (Algorithm D) is presented in Figure 12. Step 1 computes a list of all occurrences of all displayable entities in S. This list is obtained by first computing the lists of occurrences corresponding to each vertex of V(S) (except the source and the sink) and then concatenating these lists. Each occurrence is represented by its start and end positions. Step 2 sorts the list of occurrences obtained in step 1 in increasing order of their start positions. Occurrences with the same start positions are sorted in decreasing order of their end positions. This is done using radix sort. Step 3 computes for the i'th occurrence, occ_i, all its prefix-suffix conflicts with occurrences whose starting positions are greater than its own, and all its subword conflicts with its subwords. occ_i is checked against occ_{i+1}, occ_{i+2}, ..., occ_{i+c} for a conflict. Here, c is the smallest integer for which there is no conflict between occ_i and occ_{i+c}. The start position of occ_{i+c} is greater than the end position of occ_i. The start position of occ_j (j > i + c) will also be greater than the end position of occ_i, since the list of occurrences was sorted in increasing order of start positions. The start positions of occ_{i+1}, ..., occ_{i+c-1} are greater than or equal to the start position of occ_i but are less than or equal to its end position. Those occurrences among {occ_{i+1}, ..., occ_{i+c-1}} whose start positions are equal to that of occ_i have end positions that are smaller (since occurrences with the same start position are sorted in decreasing order of their end positions). The remaining conflicts of occ_i (i.e., subword conflicts with its superwords, prefix-suffix conflicts with occurrences whose start positions are less than that of occ_i) have already been computed in earlier iterations of the for statement in Algorithm D.

For example, let the input to step 3 be the following list of ordered pairs: ((1,6), (1,3), (1,1), (2,2), (3,8), (3,5), (4,6), (5,8), (6,10)), where the first element of the ordered pair denotes the start position and the second element denotes the end position of the occurrence. Consider the occurrence (3,5). Its conflicts with (1,6), (1,3), and (3,8) are computed in iterations 1, 2, and 5 of the for loop. Its conflicts with (4,6) and (5,8) are computed in iteration 6 of the for loop.
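The sweep of step 3 can be sketched in Python and checked against the worked example above. This is a minimal sketch, not the authors' GNU C++ code: the function names are introduced here, Python's built-in `sorted` stands in for the radix sort of step 2, and the comparisons are inclusive because the example requires it (for instance, (3,5) and (5,8) share position 5).

```python
from typing import List, Tuple

Occ = Tuple[int, int]  # (start, end), 1-based inclusive positions


def sort_occurrences(occs: List[Occ]) -> List[Occ]:
    """Step 2 ordering: start ascending, then end descending.
    sorted() is used here as a stand-in for the radix sort."""
    return sorted(occs, key=lambda o: (o[0], -o[1]))


def step3(occs: List[Occ]):
    """Sweep over occurrences already ordered as in step 2."""
    subword, prefix_suffix = [], []
    for i, (_, end_i) in enumerate(occs):
        j = i + 1
        # occ_j overlaps occ_i as long as it starts at or before end_i.
        while j < len(occs) and occs[j][0] <= end_i:
            if occs[j][1] <= end_i:
                subword.append((occs[i], occs[j]))        # occ_i is a superword of occ_j
            else:
                prefix_suffix.append((occs[i], occs[j]))  # partial overlap
            j += 1
    return subword, prefix_suffix


example = [(1, 6), (1, 3), (1, 1), (2, 2), (3, 8), (3, 5), (4, 6), (5, 8), (6, 10)]
subword, prefix_suffix = step3(sort_occurrences(example))

# All occurrences that conflict with (3,5), whichever iteration found them.
partners = {a if b == (3, 5) else b
            for a, b in subword + prefix_suffix if (3, 5) in (a, b)}
```

For the example list, `partners` contains exactly the five occurrences named in the text: (1,6), (1,3), (3,8), (4,6), and (5,8).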


Theorem 4 Algorithm D takes O(n + k) time, where k = k_p + k_s.


Proof Step 1 takes O(n + o) time, where o is the number of occurrences of displayable entities of S. Step 2 also takes O(n + o) time, since o elements are to be sorted using radix sort with n buckets. Step 3 takes O(o + k) time: the for loop executes O(o) times; each iteration of the while loop yields a distinct conflict. So, the total complexity is O(n + o + k). We now show that o = O(n + k). Let o_1 be the number of occurrences not involved in a conflict. Then o_1 ≤ n. Let o_2 be the number of occurrences involved in at least one conflict. A single conflict occurs between two occurrences. So 2k ≥ o_2. So, o = o_1 + o_2 ≤ n + 2k = O(n + k). □
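The O(n + o) bound for step 2 relies on a radix sort with n buckets. One way to realize it, sketched here under the assumption that positions are integers in 1..n, is two stable bucket passes: first on the secondary key (end position, decreasing), then on the primary key (start position, increasing). The function name is hypothetical.

```python
from typing import List, Tuple

Occ = Tuple[int, int]  # (start, end) with 1 <= start <= end <= n


def radix_sort_occurrences(occs: List[Occ], n: int) -> List[Occ]:
    """Order by start ascending and end descending in O(n + o) time."""
    # Pass 1: distribute by end position, then read buckets from n down to 1.
    buckets: List[List[Occ]] = [[] for _ in range(n + 1)]
    for occ in occs:
        buckets[occ[1]].append(occ)
    by_end_desc = [occ for end in range(n, 0, -1) for occ in buckets[end]]

    # Pass 2: distribute by start position. Appending preserves the order
    # from pass 1, so ties on start stay in decreasing order of end.
    buckets = [[] for _ in range(n + 1)]
    for occ in by_end_desc:
        buckets[occ[0]].append(occ)
    return [occ for start in range(1, n + 1) for occ in buckets[start]]


shuffled = [(5, 8), (1, 1), (3, 5), (6, 10), (1, 6), (4, 6), (2, 2), (3, 8), (1, 3)]
ordered = radix_sort_occurrences(shuffled, n=10)
```

Each pass touches every occurrence once and every bucket once, which gives the O(n + o) cost used in the proof.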

Algorithm D can be modified so that the size of the output is kp + ks,. This may be

achieved by checking whether an occurrence is the first representative of its pattern in the for loop of step 3. The subword conflicts are only reported for the first occurrence of the pattern. However, the time complexity of Algorithm D remains O(n + k). In this sense, it is suboptimal.
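A minimal sketch of this modification follows. It assumes that the pattern of an occurrence can be identified by slicing the string, which is a simplification introduced here (an implementation built on the scdawg would presumably compare vertices instead), and `sorted` again stands in for the radix sort.

```python
from typing import List, Tuple

Occ = Tuple[int, int]  # (start, end), 1-based inclusive positions


def sweep_first_representative(s: str, occs: List[Occ]):
    """Step 3 sweep that reports subword conflicts only for the first
    occurrence of each pattern; prefix-suffix conflicts are unchanged."""
    occs = sorted(occs, key=lambda o: (o[0], -o[1]))  # stand-in for radix sort
    seen = set()
    subword, prefix_suffix = [], []
    for i, (start_i, end_i) in enumerate(occs):
        pattern = s[start_i - 1:end_i]
        first = pattern not in seen   # sorted by start, so first seen = first occurrence
        seen.add(pattern)
        j = i + 1
        while j < len(occs) and occs[j][0] <= end_i:
            if occs[j][1] <= end_i:
                if first:
                    subword.append((occs[i], occs[j]))
                # otherwise the conflict is scanned but not reported
            else:
                prefix_suffix.append((occs[i], occs[j]))
            j += 1
    return subword, prefix_suffix


# Hypothetical occurrence list for the patterns "aba" and "a" in "ababa".
sub, ps = sweep_first_representative("ababa", [(1, 3), (3, 5), (1, 1), (3, 3), (5, 5)])
```

The while loop still visits the suppressed subword conflicts of the second "aba", which illustrates why the running time does not shrink along with the output.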










Algorithm D
Step 1: Obtain a list of all occurrences of all displayable entities in the string. This list is obtained
by first computing the lists of occurrences corresponding to each vertex of the scdawg (except the
source and the sink) and then concatenating these lists.
Step 2: Sort the list of occurrences using the start positions of the occurrences as the primary key
(increasing order) and the end position as the secondary key (decreasing order). This is done using
radix sort.
Step 3:

for i := 1 to (number of occurrences) do
begin
  j := i + 1;
  while (j ≤ number of occurrences) and (lastpos(occ_i) ≥ firstpos(occ_j)) do
  begin
    if (lastpos(occ_i) ≥ lastpos(occ_j))
      then occ_i is a superword of occ_j
      else (occ_i, occ_j) have a prefix-suffix conflict;
    j := j + 1;
  end;
end;




Figure 12: A simple algorithm for computing conflicts


4 Size Restricted Queries


Experimental data show that random strings contain a large number of displayable entities of small length. In most applications, small displayable entities are less interesting than large ones. Hence, it is useful to list only those displayable entities whose lengths are greater than some integer, k. Similarly, it is useful to report exactly those conflicts in which the conflicting displayable entities have length greater than k. This gives rise to the following problems:

P1: List all occurrences of displayable entities whose lengths are greater than k.

P2: Compute all prefix-suffix conflicts involving displayable entities of length greater than k.

P3: Compute all subword conflicts involving displayable entities of length greater than k.



The overlap of a conflict is defined as the string common to the conflicting displayable entities. The overlap of a subword conflict is the subword displayable entity. The overlap of a prefix-suffix conflict is its intersection. The size of a conflict is the length of the overlap. An alternative formulation of the size restricted problem which also seeks to achieve the goal outlined above is based on reporting only those conflicts whose size is greater than k. This formulation of the problem is particularly relevant when the conflicts are of more interest than the displayable entities. It also ensures that all conflicting displayable entities reported have size greater than k. We have the following problems:

P4: Obtain all prefix-suffix conflicts of size greater than some integer k.

P5: Obtain all subword conflicts of size greater than some integer k.
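When occurrences are given as (start, end) position pairs, the size of a conflict is the length of the interval the two occurrences share. The sketch below assumes that representation (1-based, inclusive) and filters an already computed list of conflicts, which is a brute-force illustration of P4 and P5 rather than the optimal scdawg-based method; both function names are introduced here.

```python
from typing import List, Tuple

Occ = Tuple[int, int]  # (start, end), 1-based inclusive positions


def conflict_size(a: Occ, b: Occ) -> int:
    """Length of the overlap of two occurrences (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def size_restricted(conflicts: List[Tuple[Occ, Occ]], k: int) -> List[Tuple[Occ, Occ]]:
    """Keep only the conflicts whose size is greater than k."""
    return [(a, b) for a, b in conflicts if conflict_size(a, b) > k]


conflicts = [((1, 6), (3, 5)),   # subword conflict: the overlap is the subword, size 3
             ((3, 5), (5, 8)),   # prefix-suffix conflict sharing one position
             ((1, 6), (3, 8))]   # prefix-suffix conflict sharing positions 3..6
large = size_restricted(conflicts, k=2)
```

For a subword conflict the computed size equals the length of the subword occurrence, which matches the definition of the overlap given above.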

P1 is solved optimally by invoking Occurrences(S, v, 0) for each vertex, v, in V(S), where |de(v)| > k. A combined solution to P2 and P3 uses the approach of Section 3.4. The only modification to the algorithm of Figure 12 is in step 1, which now becomes:

Obtain all occurrences of displayable entities whose lengths are greater than k.

The resulting algorithm is optimal with respect to the expanded representation of subword conflicts. However, as with the general problem, it is not possible to obtain separate optimal solutions to P2 and P3 by using the techniques of Section 3.4. An optimal solution to P4 is obtained by executing line 3 of Algorithm C of Figure 10 for only those vertices, v, in V(S) which have |de(v)| > k. An optimal solution to P5 is obtained by the following modification to Algorithm B of Figure 6:

(i) Right extension or suffix extension edges < u, v >, where |de(u)| ≤ k and |de(v)| > k, are marked "disabled".

(ii) The definition of SG(S, v) is modified so that SG(S, v), v ∈ V(S), is defined as the subgraph of SCD(S) which consists of the set of vertices, SV(S, v) ⊆ V(S), which represent displayable entities of length greater than k that are subwords of de(v), and the set of all re and suffix extension edges that connect any pair of vertices in SV(S, v).

(iii) Algorithm B is modified. The modified algorithm is shown in Figure 13.

We note that P3 and P5 are identical, since the overlap of a subword conflict is the same as the subword displayable entity.











Algorithm B
1 begin
2   for each vertex, v, in SCD(S) do
3     v.subword = false;
4   for each vertex, v, in SCD(S) such that |de(v)| > k do
5     for all vertices, u, such that a non-disabled right or suffix extension edge, < u, v >, exists do
6       if (u ≠ source) then v.subword = true;
7   for each vertex, v, in SCD(S) such that v ≠ sink and v.subword is true do
8     GetSubwords(v);
9 end



Figure 13: Modified version of algorithm B


5 Pattern Oriented Queries


These queries are useful in applications where the fact that two patterns have a conflict is more important than the number and location of conflicts. The following problems arise as a result:

P6: List all pairs of displayable entities which have subword conflicts.

P7: List all triplets of displayable entities (D1, D2, Dm) such that there is a prefix-suffix conflict between D1 and D2 with respect to Dm.

P8: Same as P6, but size restricted as in P5.

P9: Same as P7, but size restricted as in P4.

P6 may be solved optimally by reporting, for each vertex v in V(S), where v does not represent the sink of SCD(S), the subword displayable entities of de(v), if any. This is accomplished by reporting de(w) for each vertex w, w ≠ source, in SG(S, v). P7 may also be solved optimally by modifying procedure ListConflicts of Figure 11 so that it reports the conflicting displayable entities and their intersection. P8 and P9 may also be solved by making similar modifications to the algorithms of the previous section.
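To make the difference between occurrence-level and pattern-level output concrete, the following brute-force sketch collapses conflicts among a list of occurrences into the pairs of P6 and the triplets of P7. It compares every pair of occurrences, so it takes time quadratic in the number of occurrences and is not the optimal scdawg-based method described above; the function name and the occurrence list are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Occ = Tuple[int, int]  # (start, end), 1-based inclusive positions


def pattern_queries(s: str, occs: List[Occ]):
    """Return (P6 pairs, P7 triplets) for the given occurrences of s.
    P6 pairs are (subword, superword); P7 triplets are (D1, D2, Dm)."""
    def text(o: Occ) -> str:
        return s[o[0] - 1:o[1]]

    subword_pairs: Set[Tuple[str, str]] = set()
    ps_triplets: Set[Tuple[str, str, str]] = set()
    for a in occs:
        for b in occs:
            if a == b:
                continue
            if a[0] <= b[0] and b[1] <= a[1]:
                # b lies inside a: subword conflict between the two patterns.
                subword_pairs.add((text(b), text(a)))
            elif a[0] < b[0] <= a[1] < b[1]:
                # A suffix of a overlaps a prefix of b; Dm is the shared part.
                ps_triplets.add((text(a), text(b), s[b[0] - 1:a[1]]))
    return subword_pairs, ps_triplets


# Hypothetical occurrence list for the patterns "aba" and "a" in "ababa".
pairs, triplets = pattern_queries("ababa", [(1, 3), (3, 5), (1, 1), (3, 3), (5, 5)])
```

Although "a" occurs inside "aba" four times in this example, the pattern-level answer to P6 is the single pair ("a", "aba").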










6 Statistical Queries


These queries are useful when conclusions are to be drawn from the data based on statistical facts. Let f(D) denote the frequency (number of occurrences) of D in the string and rf(D1, D2) the number of occurrences of displayable entity D1 in displayable entity D2. The following queries may then be defined.

P10: For each pair of displayable entities, D1 and D2, involved in a subword conflict (D1 is the subword of D2), obtain p(D1, D2) = (number of occurrences of D1 which occur as subwords of D2) / f(D1).

P11: For each pair of displayable entities, D1 and D2, involved in a prefix-suffix conflict, obtain q(D1, D2) = (number of occurrences of D1 which have prefix-suffix conflicts with D2) / f(D1).

If p(D1, D2) or q(D1, D2) is greater than a statistically determined threshold, then the following could be said with some confidence: Presence of D1 implies Presence of D2. Let psf(D1, D2, Dm) denote the number of prefix-suffix conflicts between D1 and D2 with respect to Dm and psf(D1, D2) the number of prefix-suffix conflicts between D1 and D2.

We can approximate p(D1, D2) by rf(D1, D2) f(D2)/f(D1). The two quantities are identical unless a single occurrence of D1 is a subword of two or more distinct occurrences of D2. Similarly, we can approximate q(D1, D2) by psf(D1, D2)/f(D1). The two quantities are identical unless a single occurrence of D1 has prefix-suffix conflicts with two or more distinct occurrences of D2. f(D) can be computed for all displayable entities in SCD(S) in O(n) time by a single traversal of SCD(S) in reverse topological order. rf(D1, D2) may be computed optimally for all D1, D2, by modifying procedure GetSubwords(v) as shown in Figure 14.
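The gap between p(D1, D2) and its approximation can be seen on a tiny input. The sketch below computes both by brute-force string scanning, which is an assumption made for illustration and not the traversal of SCD(S) described above; the helper names are hypothetical and the patterns are simply chosen to trigger the overlap case.

```python
from fractions import Fraction
from typing import List, Tuple


def occurrences(pattern: str, s: str) -> List[Tuple[int, int]]:
    """All (start, end) occurrences of pattern in s, 1-based inclusive,
    with overlapping occurrences counted separately."""
    m = len(pattern)
    return [(i + 1, i + m) for i in range(len(s) - m + 1) if s[i:i + m] == pattern]


def p_exact(d1: str, d2: str, s: str) -> Fraction:
    """(occurrences of d1 lying inside some occurrence of d2) / f(d1)."""
    occ1, occ2 = occurrences(d1, s), occurrences(d2, s)
    inside = sum(1 for a in occ1
                 if any(b[0] <= a[0] and a[1] <= b[1] for b in occ2))
    return Fraction(inside, len(occ1))


def p_approx(d1: str, d2: str, s: str) -> Fraction:
    """rf(d1, d2) * f(d2) / f(d1)."""
    rf = len(occurrences(d1, d2))
    return Fraction(rf * len(occurrences(d2, s)), len(occurrences(d1, s)))


# In "aaa" the middle "a" lies inside both occurrences of "aa",
# so the approximation counts it twice.
exact = p_exact("a", "aa", "aaa")
approx = p_approx("a", "aa", "aaa")
```

Here the exact value is 1 while the approximation is 4/3. For "b" and "abc" in "abcxabc", where no occurrence of "b" lies in two occurrences of "abc", the two values coincide.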

psf(D1, D2, Dm) is computed optimally, for all D1, D2, and Dm, where D1 has a prefix-suffix conflict with D2 with respect to Dm, by modifying ListConflicts(u, z, w) of Figure 11 so that it returns f(de(u)), since this is the number of conflicts between de(w) and de(z) with respect to de(v). psf(D1, D2) is calculated by summing psf(D1, D2, Dm) over all intersections, Dm, of prefix-suffix conflicts between D1 and D2. p(D1, D2) and q(D1, D2) may be computed by simple modifications to the algorithms used to compute rf(D1, D2)










Procedure GetSubwords(v)
1 begin
2   rf(de(v), de(v)) = 1;
3   SetUp(v);
4   SetSuffixes(v);
5   for each vertex, x (≠ source), in reverse topological order of SG(S, v) do
6   begin
7     if de(x) is a suffix of de(v) then rf(de(x), de(v)) = 1 else rf(de(x), de(v)) = 0;
8     for each vertex, w, in SG(S, v) on which an re edge, e, from x is incident do
9       rf(de(x), de(v)) := rf(de(x), de(v)) + rf(de(w), de(v));
10    output(rf(de(x), de(v)));
11  end;
12 end


Figure 14: Modification to GetSubwords(v) for computing relative frequencies


and psf(D1, D2). These problems may be solved under the size restrictions of P4 and P5 by modifications similar to those made in Section 4.
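The recurrence used in Figure 14 can be sketched as a memoized traversal, which evaluates each vertex after its re successors and so has the same effect as processing SG(S, v) in reverse topological order. The graph literal below is an assumption made for illustration: it models de(v) = "aba" with a single subword vertex "a" that has one re edge to "aba", and it was written by hand rather than produced by an scdawg construction.

```python
from typing import Dict, List


def relative_frequencies(v: str,
                         re_edges: Dict[str, List[str]],
                         is_suffix_of_v: Dict[str, bool]) -> Dict[str, int]:
    """rf(x, v) = [de(x) is a suffix of de(v)] + sum of rf(w, v) over the
    re edges x -> w inside SG(S, v), with rf(v, v) = 1."""
    rf: Dict[str, int] = {v: 1}

    def visit(x: str) -> int:
        if x not in rf:
            base = 1 if is_suffix_of_v[x] else 0
            rf[x] = base + sum(visit(w) for w in re_edges.get(x, []))
        return rf[x]

    for x in is_suffix_of_v:
        visit(x)
    return rf


# Vertices are named by the strings they represent (hypothetical toy graph).
toy = relative_frequencies("aba", {"a": ["aba"]}, {"a": True, "aba": True})
```

On this toy graph the recurrence yields rf("a", "aba") = 2, which agrees with counting the occurrences of "a" in "aba" directly.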



7 Experimental Results


Algorithms B (Section 3.2), C (Section 3.3), and D (Section 3.4) were programmed in GNU C++ and run on a SUN SPARCstation 1. For test data we used 120 randomly generated strings. The alphabet size was chosen to be one of {5, 15, 25, 35} and the string length was 500, 1000, or 2000. The test set of strings consisted of 10 different strings for each of the 12 possible combinations of input size and alphabet size. For each of these combinations, the average run times for the 10 strings are given in Figures 15-18.
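A test set with the same shape can be generated as follows. This is a sketch under assumptions: the report does not say how its random strings were drawn, so the uniform choice of symbols, the seed, and the use of ASCII letters as the alphabet are all choices made here.

```python
import random
import string
from typing import Dict, List, Tuple


def make_test_set(seed: int = 0) -> Dict[Tuple[int, int], List[str]]:
    """10 random strings for each (alphabet size, length) combination."""
    rng = random.Random(seed)
    symbols = string.ascii_lowercase + string.ascii_uppercase  # 52 symbols, enough for size 35
    tests: Dict[Tuple[int, int], List[str]] = {}
    for sigma in (5, 15, 25, 35):
        alphabet = symbols[:sigma]
        for n in (500, 1000, 2000):
            tests[(sigma, n)] = ["".join(rng.choice(alphabet) for _ in range(n))
                                 for _ in range(10)]
    return tests


test_set = make_test_set()
```

The dictionary has 12 keys and 120 strings in total, matching the counts given above.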

Figure 15 gives the average times for computing all conflicts by combining Algorithms B and C. Figure 16 gives the average times for computing all prefix-suffix conflicts using Algorithm C. Figure 17 gives the average times for computing all the pattern restricted prefix-suffix conflicts (problem P7 of Section 5) by modifying Algorithm C as described in Section 5. Figure 18 represents the average times for Algorithm D.

Figures 15 to 17 represent the theoretically superior solutions to the corresponding problems, while Figure 18 represents Algorithm D, which provides a simpler, but suboptimal, solution to the three problems. In all cases, the times for constructing scdawgs and writing the results to a file were not included, as these steps are common to all the solutions.

The results show that the suboptimal Algorithm D is superior to the optimal solution for computing all conflicts or all prefix-suffix conflicts for a randomly generated string. This is due to the simplicity of Algorithm D and the fact that the number of conflicts in a randomly generated string is small. However, on a string such as a^100, which represents the worst case scenario in terms of the number of conflicts reported, the following run times were obtained:

All conflicts, optimal algorithm: 14,190 ms

All prefix-suffix conflicts, optimal algorithm: 10,840 ms

All pattern restricted prefix-suffix conflicts, optimal algorithm: 5,000 ms

Algorithm D: (illegible) ms

The experimental results using random strings also show that, as expected, the optimal algorithm fares better than Algorithm D for the more restricted problem of computing pattern oriented prefix-suffix conflicts.

We conclude that Algorithm D should be used for the more general problems of computing conflicts while the optimal solutions should be used for the restricted versions. Hence, Algorithm D should be used in an automatic environment, while the optimal solutions should be used in interactive or semi-automatic environments.


Size of     Size of String
Alphabet    500     1000    2000
5           410     989     2722
15          292     603     1300
25          315     671     (illegible)
35          234     791     1740

Figure 15: Time in ms for computing all conflicts using the optimal algorithm
























Size of     Size of String
Alphabet    500     1000    2000
5           247     730     1873
15          219     454     989
25          (illegible)  522  1179
35          186     648     1370

Figure 16: Time in ms for computing all prefix-suffix conflicts using the optimal algorithm


Size of     Size of String
Alphabet    500     1000    2000
5           163     399     1058
15          103     231     550
25          91      (illegible)  628
35          61      226     735

Figure 17: Time in ms for computing all pattern restricted prefix-suffix conflicts using the optimal algorithm


Size of     Size of String
Alphabet    500     1000    2000
5           203     551     1367
15          217     409     897
25          227     400     887
35          145     478     994

Figure 18: Time in ms for Algorithm D










8 Conclusions


In this paper, we have described efficient algorithms for the analysis and visualization of patterns in strings. We are currently extending these to other discrete objects such as circular strings and graphs. Extending these techniques to the domain of approximate string matching would be useful, but appears to be difficult.


References

[1] D. Mehta and S. Sahni, "String Visualization," In Preparation, 1991.

[2] B. Clift, D. Haussler, T.D. Schneider, and G.D. Stormo, "Sequence Landscapes," Nucleic Acids Research, vol. 14, no. 1, pp. 141-158, 1986.

[3] G.M. Morris, "The Matching of Protein Sequences using Color Intrasequence Homology Displays," J. Mol. Graphics, vol. 6, pp. 135-142, 1988.

[4] A. Blumer, J. Blumer, D. Haussler, R. McConnell, and A. Ehrenfeucht, "Complete Inverted Files for Efficient Text Retrieval and Analysis," J. ACM, vol. 34, no. 3, pp. 578-595, 1987.

[5] A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M.T. Chen, and J. Seiferas, "The Smallest Automaton Recognizing the Subwords of a Text," Theoretical Computer Science, no. 40, pp. 31-55, 1985.

[6] M. E. Majster and A. Reiser, "Efficient on-line construction and correction of position trees," SIAM Journal on Computing, vol. 9, pp. 785-807, Nov. 1980.

[7] E. McCreight, "A space-economical suffix tree construction algorithm," Journal of the ACM, vol. 23, pp. 262-272, Apr. 1976.

[8] M. T. Chen and Joel Seiferas, "Efficient and elegant subword tree construction," in Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, eds.), NATO ASI Series, Vol. F12, pp. 97-107, Berlin Heidelberg: Springer-Verlag, 1985.

[9] E. Horowitz and S. Sahni, Fundamentals of Data Structures in Pascal, 3rd Edition. Computer Science Press, 1990.



