Models and Techniques for the Visualization of Labeled Discrete
Objects *
Dinesh P. Mehta t+ Sartaj Sahnit
Abstract
A general technique for the visualization of discrete objects is presented. This technique,
which consists of identifying similar substructures within the structure and color coding them, is
demonstrated by applying it to linear strings, circular strings, binary trees, and series-parallel
graphs.
The problem of display conflicts, which is encountered while applying this technique, is
described and methods to deal with it are suggested. Some of these methods are interactive in
nature. Queries that would be supported in such an environment are described. Efficient
algorithms to implement some of the queries are developed. The performance of these algorithms
is studied, both theoretically and experimentally. We also demonstrate that our algorithms are
well suited for implementation using the object-oriented paradigm.
The application of our techniques to such diverse areas as molecular biology, text analysis,
analysis of numerical sequences, computer vision, computer graphics, computer-aided design,
data compression, algorithm animation, and debuggers is outlined.
*This research was supported in part by the National Science Foundation under grant MIP 8617374.
tDept. of Computer and Information Sciences, University of Florida, Gainesville, FL 32611
tDept. Computer Science, University of Minnesota, Minneapolis, MN 55455
1 Introduction
The objective of visualization is to extract useful and relevant information from data and represent
it so that it can be easily understood and assimilated by humans. This enables specialists in
application areas to observe trends and patterns in the data. This could lead to better understanding of
various phenomena and provide insights resulting in theories or hypotheses, which can subsequently
be proved (or disproved) by formal methods or by further experimentation.
A notion that is useful in the understanding of objects is that of similarity. Multiple occurrences
of the same pattern in data that represents the outcome or result of some process indicate the
presence of many instances of the same "effect". Analyzing the set of circumstances associated
with these occurrences could yield a plausible "cause". For example, multiple occurrences of the
same flaw in a paper roll might reveal the faulty component in the paper production process.
Similarly, multiple occurrences of the same pattern in an object whose structure is being studied
indicate the presence of many instances of the same "cause". Observing the phenomena which
occur in the presence of these patterns could shed some light on the "effect" of the patterns.
For example, multiple occurrences of patterns in DNA strands in organisms could
manifest themselves as common characteristics shared by the organisms.
This linkage of cause and effect is a fundamental goal in many scientific disciplines. Most work
in visualization so far, which attempts to facilitate the scientific goals outlined above, consists of
choosing methods to display individual units of data, so that patterns and trends become visually
obvious when the data is seen in its entirety [1, 2]. However, the onus of detecting patterns and
trends still lies on the user or the specialist. This becomes more crucial when the amount of data
is large and the user's perceptual faculties are overburdened. Consequently, errors of omission (not
seeing patterns which are actually there) become more likely.
Our work attempts to shift the responsibility of detecting patterns from the human to the
computer. This is done by devising algorithms to detect patterns in the data; and then making
these patterns available to the user for further scrutiny. Multiple occurrences of the same pattern
can be made visually explicit by color coding them with the same color. Other methods could also
be used, such as "flashing" occurrences of the same pattern on the screen. If visual schemes are
not appropriate, recurring patterns could simply be provided as a list of occurrences which the user
goes through. In the context of a visualization environment, this technique could be used either in
a standalone manner or as a supplement to other visualization methods.
For the technique outlined above to be useful, the following principles on displaying patterns
should be adhered to:
Principle 1: Two patterns should be displayed to look similar iff they are similar.
Principle 2: The degree of similarity between the displays of two patterns should be proportional
to the actual similarity of the patterns.
A complete specification of a visualization problem requires one to provide the following five
items. These are illustrated using linear string visualization as an example. Henceforth, the word
"string" shall refer to linear strings.
(1) Structure of the Data to be Visualized: Does the data represent a string, a series-parallel
graph, a binary tree, etc.?
In the string visualization example, the data is a string, S, of length n, whose characters are
chosen from a fixed alphabet, Σ, of constant size.
(2) Structure of Patterns: This depends largely on the structure of the data. If the data
represents a string, then the structure of the pattern could be a substring (contiguous sequence of
string elements), a subsequence (a noncontiguous sequence of string elements), etc. Patterns can
have other constraints imposed upon them. For example, a pattern may be required to be of a
minimum size.
In the string visualization example, the pattern is a substring of S, defined uniquely by its start
and end positions.
(3) Maximality of Patterns: If a pattern is repeated in the data, then any subpattern of
the pattern is also repeated. For example, if abc repeats in a string, so do ab, bc, a, b, and c.
In particular, if all occurrences of ab occur in the context of abc, then attempting to distinguish
between ab and abc does not serve any useful purpose. Defining maximality and restricting the
user's attention to maximal patterns helps to simplify the display.
In the string visualization example, a pattern is said to be maximal iff its occurrences are
not all preceded by the same letter, nor all followed by the same letter. Consider the string S =
abczdefydefxabc. Here, abc and def are the only maximal patterns. The occurrences of def are
preceded by different letters (z and y) and followed by different letters (y and x). The occurrences
of abc are not preceded by the same letter (the first occurrence does not have a predecessor) nor
followed by the same letter. However, de is not maximal because all its occurrences in S are followed
by f.
Figure 1: Highlighting Displayable Entities
(4) Measure of Similarity (MS): A measure of similarity would consist of attaching numerical
values to pairs of maximal patterns which indicate the degree of similarity between the two patterns.
In the string visualization example, the measure of similarity could be defined as follows: If two
patterns are identical, then MS = 1. Otherwise, MS = 0. In other words, two patterns are defined
to be similar iff they are identical. There is no concept of "degree of similarity" in this definition.
(5) Display Models: This addresses the issues of which patterns are displayed and how they
are displayed. While choosing display models, Principles 1 and 2 should be kept in mind.
In the string visualization example, a pattern is said to be a displayable entity (or displayable)
iff it is maximal and occurs more than once in S (in this case, all maximal patterns are displayable
entities with the exception of S, which occurs once in itself). All instances of the same displayable
entity are highlighted in the same color. Instances of different displayable entities are highlighted in
different colors (there is no relationship between colors representing different displayable entities).
In the example string, S, abc and def are the only displayable entities. So, S would be displayed
by highlighting abc in one color and def in another as shown in Figure 1.
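The definitions of maximality and displayability given above can be checked by brute force on short strings. The following Python sketch is an illustration only: it enumerates all O(n^2) substrings rather than using the scdawg-based algorithms of Section 6, and the function name displayable_entities is hypothetical. A missing predecessor or successor (an occurrence touching either end of S) is treated as distinct from every letter, as in the abc example.

```python
from collections import defaultdict


def displayable_entities(s):
    """Return {pattern: [0-based start positions]} for each displayable entity of s."""
    n = len(s)
    starts = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            starts[s[i:j]].append(i)

    def diverse(neighbours):
        # "Not all the same letter": some occurrence touches the boundary of s
        # (recorded as None), or at least two distinct letters appear.
        return None in neighbours or len(neighbours) > 1

    result = {}
    for pat, occ in starts.items():
        if len(occ) < 2:
            continue  # a displayable entity occurs more than once
        left = {s[i - 1] if i > 0 else None for i in occ}
        right = {s[i + len(pat)] if i + len(pat) < n else None for i in occ}
        if diverse(left) and diverse(right):  # maximal
            result[pat] = occ
    return result


print(displayable_entities("abczdefydefxabc"))
# {'abc': [0, 12], 'def': [4, 8]}
```

On the example string this reports abc and def only, in agreement with Figure 1.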
Visualization conflicts that arise from the above technique are described in Section 2, and
refinements to the display model that attempt to overcome these conflicts are provided in Section 3. In
Section 4 some of the queries supported by a string visualization system are stated. Applications of
string visualization are discussed in Section 5 and some of the issues that arise when one implements
a string visualization system are addressed in Section 6. Some empirical results are also provided
in this section. Sections 7, 8, and 9 discuss circular string, tree, and geometric series-parallel graph
visualization, respectively. Section 10 demonstrates that many of our algorithms can be effectively
coded using the object-oriented paradigm. Finally, in Section 11, we provide an outline of how a
visualization system based on our approach would function.
2 Visualization Conflicts
Consider the string S = abcicdefcdegabchabcde and its displayable entities, abc and cde (both are
maximal and occur thrice). So, they must be highlighted in different colors. Notice, however, that
abc and cde both occur in the string abcde, which occurs as a suffix of S. Clearly, both displayable
entities cannot be highlighted in different colors in abcde as required by the model. This is a
consequence of the fact that the letter c occurs in both displayable entities. This situation is known
as a prefix-suffix conflict (because a prefix of one displayable entity is a suffix of the other).
Note, also, that c is a displayable entity in S. Consequently, all occurrences of c must be
highlighted in a color different from those used for abc and cde. But this is impossible as c is a
subword of both abc and cde. This situation is referred to as a subword conflict. Formally,
(i) A subword conflict between two displayable entities, D1 and D2, in S exists iff D1 is a substring
of D2.
(ii) A prefix-suffix conflict between two displayable entities, D1 and D2, in S exists iff there exist
substrings Sp, Sm, and Ss in S such that SpSmSs occurs in S, SpSm = D1, and SmSs = D2.
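The two formal definitions translate directly into string tests. The sketch below is a hedged illustration, not the efficient method of Section 6; the function names are hypothetical, and it assumes that Sp, Sm, and Ss are all non-empty, which the definition does not state explicitly.

```python
def has_subword_conflict(d1, d2):
    """True iff d1 is a proper substring of d2 (definition (i))."""
    return d1 != d2 and d1 in d2


def prefix_suffix_overlaps(s, d1, d2):
    """Lengths of all overlaps Sm with d1 = Sp+Sm, d2 = Sm+Ss, and Sp+Sm+Ss in s.

    Assumption: Sp, Sm and Ss are non-empty, so the overlap length k ranges
    from 1 to min(len(d1), len(d2)) - 1.
    """
    found = []
    for k in range(1, min(len(d1), len(d2))):
        if d1[-k:] == d2[:k] and (d1 + d2[k:]) in s:
            found.append(k)
    return found


S = "abcicdefcdegabchabcde"
print(has_subword_conflict("c", "abc"))        # True
print(prefix_suffix_overlaps(S, "abc", "cde")) # [1]  (Sm = "c", witnessed by abcde)
print(prefix_suffix_overlaps(S, "cde", "abc")) # []
```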
3 Refinements of Display Model
When subword and prefix-suffix conflicts occur, we need some criteria to determine which of the
information previously required to be displayed actually gets displayed. For instance, in the
example string S = abcicdefcdegabchabcde from the previous section, three possible nonconflicting,
displayable subsets are shown in Figure 2. In this section we present three refinements to the
display model from Section 1 which attempt to overcome the display difficulties created by conflicts.
They are:
Figure 2: Possible Configurations
Figure 3: Optimal Configuration under Model 1
(1) Single-Copy, Maximum-Content, No-Overlap: In this model, exactly one copy of the string is
displayed. Occurrences of displayable entities are selected so that there are no mutually conflicting
occurrences. Given this restriction, the model requires occurrences to be selected so that the
amount of information conveyed by the display is maximized. This goal may be achieved in three
ways.
Interactive : The user selects occurrences interactively by using his/her judgement. Typically, this
would be done by examining the occurrences which are involved in a conflict and choosing one that
is the most meaningful.
Automatic : A numeric weight is assigned to each occurrence. The higher the weight, the greater
the desirability of displaying the corresponding occurrence. Criteria that could be used in assigning
weights to occurrences include: length, position, number of occurrences of the pattern, semantic
value of the displayable entity, information on conflicts, etc. The information is then fed to a
routine which selects a set of occurrences so that the sum of their weights is maximized. For
example, consider string S = abcicdefcdegabchabcde of Section 2. If the weight assigned to each
occurrence of abc is 4, cde is 2, c is 3, then Figure 3 shows the optimal display configuration. The
total weight of the display is 18.
Semi-Automatic: In a practical environment, the most appropriate method would be a hybrid of
the Interactive and Automatic approaches described above. The user could select some occurrences
that he/she wants included in the final display. The selection of the remaining occurrences can
then be performed by a routine which maximizes the display information.
Figure 4: Optimal Configuration under Model 2(a)
Figure 5: Configuration under Model 2(b)
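The Automatic approach for the single-copy, no-overlap model can be illustrated with a small dynamic program. The sketch below treats each occurrence as a closed interval of string positions and assumes that two occurrences conflict exactly when their intervals share a position (this covers both subword and prefix-suffix conflicts). It uses the standard weighted interval scheduling recurrence, which is not necessarily the algorithm the paper presents for this problem; the function names are hypothetical.

```python
from bisect import bisect_left


def occurrences(s, pattern):
    """All 0-based start positions of pattern in s (overlapping matches included)."""
    return [i for i in range(len(s) - len(pattern) + 1) if s.startswith(pattern, i)]


def max_weight_display(s, weights):
    """Maximum total weight of pairwise non-overlapping occurrences.

    weights maps each displayable entity to the weight of one of its occurrences.
    """
    # One (end, start, weight) triple per occurrence, sorted by end position.
    intervals = sorted(
        (i + len(p) - 1, i, w) for p, w in weights.items() for i in occurrences(s, p)
    )
    ends = [end for end, _, _ in intervals]
    best = [0] * (len(intervals) + 1)  # best[k]: optimum over the first k intervals
    for k, (end, start, w) in enumerate(intervals):
        # Number of earlier intervals that end strictly before this one starts.
        compatible = bisect_left(ends, start, 0, k)
        best[k + 1] = max(best[k], best[compatible] + w)
    return best[-1]


S = "abcicdefcdegabchabcde"
print(max_weight_display(S, {"abc": 4, "cde": 2, "c": 3}))  # 18
```

With the weights used in the example (abc = 4, cde = 2, c = 3) the sketch reproduces the total weight of 18 quoted for Figure 3.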
(2) Multiple-Copy, No-Overlap: Multiple copies of the string may be displayed. Mutually
disjoint sets of occurrences are associated with the copies (one set per copy), so that the occurrences
corresponding to each copy are mutually nonconflicting. There are two approaches:
(a) A constant number (max) of copies of the string may be displayed. The total content of
the display, summed over all max copies, is to be maximized. For example, consider string S =
abcicdefcdegabchabcde of Section 2. If the weights corresponding to abc, cde, and c are 4, 2, and
3, respectively, and max = 2, then Figure 4 shows the optimal display configuration. The total
weight of the display is 31.
(b) No limit is imposed on the number of copies of the string that may be displayed. However,
each occurrence is highlighted in only one copy of the string. The number of copies of the string
used should be minimized. It can be shown that, in the worst case, O(n^2) copies may be required.
Figure 5 shows a configuration for string S = abcicdefcdegabchabcde of Section 2.
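For approach (b), one way to obtain a small number of copies is greedy interval partitioning: process occurrences by start position and reuse any copy whose last highlighted position lies before the new occurrence. The sketch below assumes, as an illustration, that occurrences conflict exactly when they share a string position; under that assumption the greedy rule uses as many copies as the largest number of occurrences covering a single position. It is a standard technique and not necessarily the paper's own algorithm; all names are hypothetical.

```python
import heapq


def occurrences(s, pattern):
    """All 0-based start positions of pattern in s (overlapping matches included)."""
    return [i for i in range(len(s) - len(pattern) + 1) if s.startswith(pattern, i)]


def assign_copies(s, entities):
    """Partition all occurrences into copies with no overlap inside a copy.

    Returns a list of copies; each copy is a list of (start, end, pattern).
    """
    intervals = sorted(
        (i, i + len(p) - 1, p) for p in entities for i in occurrences(s, p)
    )
    copies = []   # copies[c]: occurrences highlighted in copy c
    busy = []     # min-heap of (last highlighted position, copy index)
    for start, end, p in intervals:
        if busy and busy[0][0] < start:
            _, c = heapq.heappop(busy)  # reuse a copy that is free at `start`
        else:
            c = len(copies)             # every copy is occupied: open a new one
            copies.append([])
        copies[c].append((start, end, p))
        heapq.heappush(busy, (end, c))
    return copies


S = "abcicdefcdegabchabcde"
for copy in assign_copies(S, ["abc", "cde", "c"]):
    print(copy)
```

For the example string this yields three copies, which is the minimum possible because abc, c, and cde all cover the same position inside the suffix abcde.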
(3) Single-Copy, Maximum-Content, Subword-Overlap: Exactly one copy of the string may be
displayed. Occurrences are selected so that no pair of occurrences has a prefix-suffix conflict.
Subword conflicts are allowed. As in the Single-Copy, Maximum-Content, No-Overlap model, the
goal is to maximize the information conveyed. Again, there are three approaches for selecting
occurrences: Automatic, Interactive, and Semi-Automatic. For example, consider string S =
abcicdefcdegabchabcde of Section 2. If the weights corresponding to each occurrence of abc, cde, and
c are 4, 3, and 2, respectively, then Figure 6 shows the optimal display configuration. The total
value of the display is 31.
Figure 6: Optimal Configuration under Model 3
Note that all of these methods depend on information about prefix-suffix and subword conflicts.
4 String Visualization Queries
The following algorithmic problems arise as a result of the discussion in the previous section:
Given a string, S, of length n whose elements are chosen from an alphabet Σ of fixed size,
P1. Obtain a list of all displayable entities and their occurrences.
P2. Obtain a list of all prefix-suffix conflicts.
P3. Obtain a list of all subword conflicts.
Given a list of occurrences of displayable entities and a weight associated with each occurrence,
P4. Obtain a set of mutually nonconflicting occurrences so that the sum of the weights associated
with them is maximum (Model 1).
P5. Obtain max mutually disjoint sets of mutually nonconflicting occurrences so that the sum of
the weights associated with them is maximum (Model 2(a)).
P6. Obtain a minimum number of mutually disjoint sets of mutually nonconflicting occurrences
required to partition the set of all occurrences (Model 2(b)).
P7. Obtain a set of occurrences such that no two have a prefix-suffix conflict, so that the sum of
the weights associated with them is maximized (Model 3).
In addition to the problems outlined above, restricted versions of these problems exist. These
problems represent typical queries that would be supported by an interactive visualization system
based on our approach. We list some of these below.
(i) Restricted Queries: The overlap of a conflict is defined as the string common to the conflicting
displayable entities. The overlap of a subword conflict is the subword displayable entity. The overlap
of a prefix-suffix conflict is the substring common to the conflicting strings. The size of a conflict is
the length of the overlap. It is useful to be able to list only those conflicts whose sizes are greater
than some specified length. This simplifies the display and eliminates uninteresting displayable
entities.
P8. Obtain all prefix-suffix conflicts of size greater than some integer min.
P9. Obtain all subword conflicts of size greater than some integer min.
(ii) Pattern-Oriented Queries: These queries are useful in applications where the fact that two
patterns have a conflict is more important than the number of conflicts or where in the string the
conflicts occur.
P10. List all pairs of displayable entities which have prefix-suffix or subword conflicts.
P11. List all pairs of displayable entities which have conflicts of size greater than some given
constant.
P12. List all displayable entities that are superwords of a given displayable entity.
(iii) Statistical Queries: These queries are useful when conclusions are to be drawn from the
data based on statistical facts.
P13. For each pair of displayable entities, D1 and D2, involved in a subword conflict (D1 is the
subword of D2), obtain p = (number of occurrences of D1 which occur as subwords of D2)/(number
of occurrences of D1).
P14. For each pair of displayable entities, D1 and D2, involved in a prefix-suffix conflict, obtain q
= (number of occurrences of D1 (D2) which have prefix-suffix conflicts with D2 (D1))/(number of
occurrences of D1 (D2)).
If p or q is greater than a statistically determined threshold, then the following could be said
with some confidence: Presence of D1 implies Presence of D2.
A detailed explanation of efficient algorithms for P1-P3 and P8-P14 is provided in [3]. Algorithms
for P1-P3 are briefly outlined in Section 6, while algorithms for P4, P6, and P7 are presented in
Section 6.6.
5 Applications
An abstract strategy has been outlined for the visualization of strings. This section discusses some
general methods for applying this strategy to specific areas. Applications to molecular biology, text
and numerical sequences are also outlined.
5.1 General Methods
In order to apply this strategy successfully to actual data, it is important to first check that the
data conforms to the definition of strings provided in Section 1. If this is not the case, then it may
be possible to transform the data so that it satisfies the definition without losing vital information
in the process.
(1) If a string consists of characters which are chosen from a fixed alphabet of a small size
(< 50, say), then it is already in the correct format. For example, DNA sequences are made up
from an alphabet of 4 elements. English text is made up of an alphabet of 26 characters and some
special symbols.
(2) Otherwise, if a function, f, can be defined for each element in the alphabet such that:
a) The range of f can be determined and is of constant size and
b) f(element1) = f(element2) iff element1 and element2 are similar,
then the given string may be converted to another string which is obtained by applying f to each
element of the original string. The resulting string can now be input to the visualization routines.
For example, consider a sequence of objects which are chosen from a large (possibly infinite)
alphabet. Assume that a set of properties, P = {P1, P2, ..., Pm}, is associated with each object.
Suppose that patterns of property, Pi, of the sequence of objects are interesting (where Pi may
take on one of a constant, fixed number of values). Then, for the purposes of visualization, each
object in the sequence is replaced by the corresponding Pi value. This approach can be used with
the other properties as well. Some examples where this approach is useful are:
(1) Protein Sequences [2]: A protein sequence consists of a sequence of amino acids. While the
number of amino acids that could form a sequence is large it is possible to place amino acids in
groups on the basis of physical properties such as hydrophobicity, acidity, polarity etc. Amino acids
in the sequence may then be replaced by a symbol representing the groups to which they belong.
(2) Chronological Sequences of Multidimensional Data [1]: Here, a number of measurements relating
to a particular scientific phenomenon are taken at regular intervals of time. The measurements for
each variable are classified as LOW, MEDIUM, or HIGH. Consequently, the sequence of
multidimensional data may be replaced by a sequence of symbols, L, M, and H, which represent the values of a
particular variable.
Many applications would benefit by comparisons between two or more different strings (as
opposed to comparisons within the same string). This can also be supported by a simple extension
of our techniques.
5.2 Numerical Data
An important category of data is numerical data. These arise whenever properties of objects are
described by measurements. Numerical information, in general, is chosen from large alphabet sizes
which are determined by the accuracy of measurement required. Clearly, numerical sequences
cannot be directly input to our visualization system. This is remedied by determining the range
of values that a variable can take on. This range is then subdivided into a constant number of
subranges (this is essentially the same strategy used in [1]). Each value in the sequence is then
replaced by a symbol representing the subrange to which it belongs. The resulting sequence may
then be input to our visualization system. Consider a sequence of numbers which lie in the range
1-200. Assume that subranges have been defined as 1-20, 21-40, ..., 181-200, which are respectively
represented by the symbols a, b, ..., j. So, the sequence 7 142 63 94 6 148 69 becomes ahdeahd.
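The subrange mapping is simple to express in code. The following Python sketch assumes equal-width subranges, at most 26 of them, and a range whose size is divisible by the number of subranges; the function name to_symbols and its parameters are hypothetical.

```python
import string


def to_symbols(values, low=1, high=200, bins=10):
    """Replace each value in [low, high] by the letter of its equal-width subrange."""
    width = (high - low + 1) // bins  # 20 for the range 1-200 with 10 subranges
    return "".join(string.ascii_lowercase[(v - low) // width] for v in values)


print(to_symbols([7, 142, 63, 94, 6, 148, 69]))  # ahdeahd
```

The resulting string contains the repeated pattern ahd, which the visualization routines can then detect.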
Sequences of numbers, such as financial data, are usually studied by using graphs. We expect
that the techniques outlined here if used in conjunction with graphs could reduce the possibility
of overlooking important patterns in the data. This can be done by either appropriately coloring
pieces of the graph or by coloring a string which is aligned with the graph.
Often, comparisons are made not between the values of numbers in a sequence, but between the
increase/decrease in consecutive values. For example, in [5 20 15 75 90 85], 5 20 15 is not obviously
related to 75 90 85. However, the increases/decreases in values are identical: an increase of 15
followed by a decrease of 5. Information of this type may be obtained by transforming the string
by taking the difference between successive values before inputting it to the visualization system,
i.e., 15 -5 60 15 -5. Similar transformations may be used for percentage increases/decreases,
second-order differences, etc.
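A minimal Python sketch of the difference transformation follows; the function name is hypothetical. Applying it twice gives second-order differences.

```python
def successive_differences(values):
    """Differences between consecutive values; the result is one element shorter."""
    return [b - a for a, b in zip(values, values[1:])]


first = successive_differences([5, 20, 15, 75, 90, 85])
print(first)                          # [15, -5, 60, 15, -5]
print(successive_differences(first))  # [-20, 65, -45, -20]
```

In the first-order output the pair (15, -5) occurs twice, which exposes the similarity between 5 20 15 and 75 90 85.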
5.3 Molecular Biology
In molecular biology, RNA, DNA, and protein sequences are studied. Sequence comparison helps to
answer questions about evolution, structure and function in organisms, and the structural
configuration of individual RNA molecules [4]. Of particular importance are repeating patterns and their
relative positions [5]. The work in [5] uses the scdawg data structure, which is described in the next
section, to analyze sequences. Our work improves upon [5] by suggesting more effective display methods as
well as by introducing more sophisticated analysis techniques involving prefix-suffix and subword
conflicts.
For example, let D1 and D2 be displayable entities and D1 be a subword of D2. If the fraction
(number of occurrences of D1 contained in D2)/(number of occurrences of D1) ≈ 1, then we can
infer that D1 usually occurs as a subword of D2, which could mean that D1 does not perform any
significant functions except as a subword of D2.
Suppose patterns P1 and P2 perform functions F1 and F2 in an organism. Then, if
(number of prefix-suffix conflicts between P1 and P2)/(min{number of occurrences of P1, number
of occurrences of P2}) ≈ 1, then we can infer that F1 and F2 are generally performed by the same
region and are therefore related in some way.
5.4 Textual Data
Structural information about text may be obtained by studying prefix-suffix and subword conflicts.
Information about the contexts in which certain phrases are used is provided by subword conflicts.
Information on the combination of phrases is provided by prefix-suffix conflicts. This information
can be used to identify anomalies in sentence structure and possibly identify the author of a text by
the structure. It can also be used to decipher text coded using sophisticated substitution ciphers
where patterns are substituted by other patterns.
6 Implementation Considerations
A string visualization system along the lines described above requires efficient algorithms for
problems P1-P14. These problems can be solved in optimal or near-optimal time [3] by using the
Symmetric Compact Directed Acyclic Word Graph (scdawg) data structure [6, 7].
6.1 Directed Acyclic Word Graphs
An scdawg, SCD(S), corresponding to a string S is a directed acyclic graph defined by a set of
vertices, V(S), a set, R(S), of labeled directed edges called right extension (re) edges, and a set,
L(S), of labeled directed edges called left extension (le) edges. Each vertex of V(S) represents a
substring of S. Specifically, V(S) consists of a source (which represents the empty word, λ), a sink
(which represents S), and a vertex corresponding to each displayable entity of S.
Let de(v) denote the string represented by vertex v (v ∈ V(S)). Define the implication,
imp(S, α), of a string α in S to be the smallest superword of α in {de(v): v ∈ V(S)}, if such
a superword exists. Otherwise, imp(S, α) does not exist.
Re edges from a vertex, v1, are obtained as follows: for each letter, x, in Σ, if imp(S, de(v1)x)
exists and is equal to de(v2) = β de(v1) x γ, then there exists an re edge from v1 to v2 with label xγ.
If β is the empty string, then the edge is known as a prefix extension edge. Le edges from a vertex,
v1, are obtained as follows: for each letter, x, in Σ, if imp(S, x de(v1)) exists and is equal to de(v2)
= γ x de(v1) β, then there exists an le edge from v1 to v2 with label γx. If β is the empty string,
then the edge is known as a suffix extension edge.
Figure 7 shows V(S) and R(S) corresponding to S = cdefabcgabcde. Here abc, cde, and c are the
displayable entities of S. There are two outgoing re edges from the vertex representing abc. These
edges correspond to x = d and x = g. imp(S, abcd) = imp(S, abcg) = S. Consequently, both edges
are incident on the sink. There are no edges corresponding to the other letters of the alphabet as
imp(S, abcx) does not exist for x ∈ {a, b, c, e, f}.
Figure 7: Scdawg for S = cdefabcgabcde (L(S) not shown)
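The implication function can be illustrated by a direct search over the strings represented by the vertices. The sketch below is a brute-force reading of the definition, not the way an scdawg is actually built; it takes the displayable entities as given, reads "smallest" as "shortest", and the function name imp mirrors the notation in the text but is otherwise hypothetical.

```python
def imp(s, alpha, entities):
    """Shortest superword of alpha among the vertex strings of s, or None.

    Vertex strings are the displayable entities plus s itself (the sink);
    the empty word (the source) can never be a superword of a non-empty alpha.
    """
    candidates = [d for d in list(entities) + [s] if alpha in d]
    return min(candidates, key=len) if candidates else None


S = "cdefabcgabcde"
ENTITIES = ["abc", "cde", "c"]

print(imp(S, "abcd", ENTITIES))  # cdefabcgabcde  (re edge labelled d... to the sink)
print(imp(S, "abcg", ENTITIES))  # cdefabcgabcde  (re edge labelled g... to the sink)
print(imp(S, "abca", ENTITIES))  # None           (no re edge for x = a)
print(imp(S, "ab", ENTITIES))    # abc
```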
The space required for SCD(S) is O(n) and the time needed to construct it is O(n) [6, 7]. While
we have defined the scdawg data structure for a single string, S, it can be extended to represent a
set of strings.
6.2 Computing Subword Conflicts Efficiently
6.2.1 Representing Subword Conflicts
Consider P3, the problem of finding all subword conflicts in string S. Let k_s be the number of
subword conflicts in S. Any algorithm to solve this problem requires (i) O(n) time to read in the
input string and (ii) O(k_s) time to output all subword conflicts. So, O(n + k_s) is a lower bound on
the time complexity for this problem. For the string S = a^n, k_s = n^4/24 + n^3/4 - 13n^2/24 - 3n/4 + 1
= O(n^4). This is an upper bound on the number of conflicts as the maximum number of substring
occurrences is O(n^2) and in the worst case, all occurrences conflict with each other. In this section,
a compact method for representing conflicts is presented. Let k_sc be the size of this representation.
k_sc is n^3/6 + n^2/2 - 5n/3, or O(n^3), for a^n. Compaction never increases the size of the output and
may yield up to a factor of n reduction, as in the example. The compaction method is described
below.
Consider S = abcdbcgabcdbchbc. The displayable entities are D1 = abcdbc and D2 = bc. The
ending positions of D1 are 6 and 13 while those of D2 are 3, 6, 10, 13, and 16. A list of the subword
conflicts between D1 and D2 can be written as: {(6,3), (6,6), (13,10), (13,13)}. The first element
of each ordered pair is the last position of the instance of the superstring (here, D1) involved in the
conflict; the second element of each ordered pair is the last position of the instance of the substring
(here, D2) involved in the conflict.
The cardinality of the set is the number of subword conflicts between D1 and D2. This is
given by frequency(D1) * (number of occurrences of D2 in D1), where frequency(D1) is the number of
occurrences of D1 in S. Since each conflict is represented by an ordered pair, the size of the output
is 2 * frequency(D1) * (number of occurrences of D2 in D1).
Observe that the occurrences of D2 in D1 are in the same relative positions in all instances of D1.
It is therefore possible to write the list of subword conflicts between D1 and D2 as: (6,13):(0,3).
The first list gives all the occurrences in S of the superstring (D1), and the second gives the relative
positions of all the occurrences of the substring (D2) in the superstring (D1) from the right end
of D1. The size of the output is now frequency(D1) + (number of occurrences of D2 in D1). This is
more economical than our earlier representation.
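The two representations can be compared on the example with a short Python sketch. It locates occurrences by direct string search rather than through the scdawg, uses 1-based ending positions as in the text, and all function names are hypothetical.

```python
def end_positions(s, pattern):
    """1-based positions of s at which an occurrence of pattern ends."""
    m = len(pattern)
    return [i + m for i in range(len(s) - m + 1) if s.startswith(pattern, i)]


def explicit_conflicts(s, sup, sub):
    """All pairs (end of a sup occurrence, end of a sub occurrence inside it)."""
    return [
        (e1, e2)
        for e1 in end_positions(s, sup)
        for e2 in end_positions(s, sub)
        if e1 - len(sup) + len(sub) <= e2 <= e1  # sub lies within this sup
    ]


def compact_conflicts(s, sup, sub):
    """(ends of sup in s, offsets of sub inside sup measured from sup's right end)."""
    offsets = sorted(len(sup) - e for e in end_positions(sup, sub))
    return end_positions(s, sup), offsets


S = "abcdbcgabcdbchbc"
print(explicit_conflicts(S, "abcdbc", "bc"))  # [(6, 3), (6, 6), (13, 10), (13, 13)]
print(compact_conflicts(S, "abcdbc", "bc"))   # ([6, 13], [0, 3])
```

The explicit list holds 2 * 2 * 2 = 8 numbers, while the compact form holds 2 + 2 = 4.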
In general, a substring, Di, of S will have conflicts with many instances of a number of displayable
entities (say, Dj, Dk, ..., Dz) of which it (Di) is the superword. We would then write the conflicts
of Di as:
(l_i^1, l_i^2, ..., l_i^{f_i}) : (l_j^1, ..., l_j^{r_ij}), (l_k^1, ..., l_k^{r_ik}), ..., (l_z^1, ..., l_z^{r_iz})
Here, the l_i's represent all the occurrences of Di in S; the l_j's, l_k's, ..., l_z's represent the relative
positions of all the occurrences of Dj, Dk, ..., Dz in Di. One such list will be required for each
displayable entity that contains other displayable entities as subwords. The following equalities are
easily obtained:
Size of Compact Representation = $\sum_{D_i \in D} \left( f_i + \sum_{D_j \in D_i^s} r_{ij} \right)$.
Size of Original Representation = $2 \sum_{D_i \in D} \left( f_i \cdot \sum_{D_j \in D_i^s} r_{ij} \right)$.
f_i is the frequency of Di (only Di's that have conflicts are considered). r_ij is the frequency of Dj
in one instance of Di. D represents the set of all displayable entities of S. D_i^s represents the set of
all displayable entities that are subwords of Di.
6.2.2 Computing Subword Conflicts
Algorithm A3 of Figure 8 computes the subword conflicts of S. These are represented using the
scheme described in Section 6.2.1.
SG(S, v), v ∈ V(S), is defined as the subgraph of SCD(S) which consists of the set of
vertices, SV(S, v) ⊆ V(S), which represent displayable entities that are subwords of de(v) and the set
SE(S, v) of all re and suffix extension edges that connect any pair of vertices in SV(S, v). Define
SGR(S, v) as SG(S, v) with the directions of all the edges in SE(S, v) reversed.
The subword conflicts are computed for precisely those displayable entities which have subword
displayable entities. Lines 4 to 6 of Algorithm A3 determine whether de(v) has subword displayable
entities. Procedure GetSubwords(v), which computes the subword conflicts of de(v), is invoked if
v.subword is true.
Procedure Occurrences(S, v, 0) (line 2 of GetSubwords) computes the occurrences of de(v) in S
and places them in v.list. Procedure SetUp in line 5 traverses SGR(S, v) and initializes fields in
each vertex of SGR(S, v) so that a reverse topological traversal of SG(S, v) may be subsequently
performed. Procedure SetSuffixes in line 6 marks vertices whose displayable entities are suffixes of
de(v). A list of relative occurrences, sublist, is associated with each vertex, x, in SG(S, v). x.sublist
represents the relative positions of de(x) in an occurrence of de(v). If de(x) is a suffix of de(v)
then x.sublist is initialized with the element, 0. The remaining elements of x.sublist are computed
from the sublist fields of vertices, w, in SG(S, v) such that a right extension edge goes from x to w.
Consequently, w.sublist must be computed before x.sublist. This is achieved by traversing SG(S, v)
in reverse topological order [8].
Theorem 1 Algorithm A3 takes $O(n + k_s)$ time and space and is therefore optimal [3].
Algorithm A3
1  begin
2    for each vertex, v, in SCD(S) do
3    begin
4      v.subword = false;
5      for all vertices, u, such that a right or suffix extension edge, <u, v>, is incident on v do
6        if u ≠ source then v.subword = true;
7    end
8    for each vertex, v, in SCD(S) such that v ≠ sink and v.subword is true do
9      GetSubwords(v);
10 end

Procedure GetSubwords(v)
1  begin
2    Occurrences(S, v, 0);
3    output(v.list);
4    v.sublist = {0};
5    SetUp(v);
6    SetSuffixes(v);
7    for each vertex, x (≠ source), in reverse topological order of SG(S, v) do
8    begin
9      if de(x) is a suffix of de(v) then x.sublist = {0} else x.sublist = {};
10     for each vertex, w, in SG(S, v) on which an re edge, e, from x is incident do
11     begin
12       for each element, l, in w.sublist do
13         x.sublist = x.sublist ∪ {l + |label(e)|};
14     end;
15     output(x.sublist);
16   end;
17 end
Figure 8: Optimal algorithm to compute all subword conflicts
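The sublist computation in lines 7 to 16 of GetSubwords can be sketched on a hand-built toy graph. This is only an illustration under stated assumptions: the graph below is not produced by a real scdawg construction, the names `relative_occurrences`, `re_edges`, and `suffixes` are hypothetical, and each offset is read as the number of characters between the end of de(x) and the end of de(v). A memoized depth-first visit is used in place of an explicit reverse topological ordering; it likewise guarantees that w.sublist is complete before x.sublist is built.

```python
def relative_occurrences(v, re_edges, suffixes):
    """Compute the sublist of every vertex in a toy subgraph SG(S, v).

    re_edges maps x to a list of (w, label_length) pairs, one per right
    extension edge x -> w inside the subgraph.
    suffixes is the set of vertices whose entity is a suffix of de(v).
    """
    sublist = {v: {0}}

    def visit(x):
        # w.sublist must be known before x.sublist; the recursion enforces this.
        if x in sublist:
            return sublist[x]
        result = {0} if x in suffixes else set()
        for w, label_length in re_edges.get(x, []):
            result |= {l + label_length for l in visit(w)}
        sublist[x] = result
        return result

    for x in re_edges:
        visit(x)
    return {x: sorted(offsets) for x, offsets in sublist.items()}


# Toy instance with de(v) = "abab": "ab" is a suffix of "abab" and extends
# to it by the label "ab"; "a" extends to "ab" by the label "b".
TOY = relative_occurrences(
    "abab",
    re_edges={"ab": [("abab", 2)], "a": [("ab", 1)]},
    suffixes={"ab"},
)
print(TOY)
```

In the toy instance, "ab" ends 0 and 2 characters before the end of "abab", and "a" ends 1 and 3 characters before it, which matches a direct inspection of the string.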
Algorithm A2A3
Step 1: Obtain a list of all occurrences of all displayable entities in the string. This list is obtained by first
computing the lists of occurrences corresponding to each vertex of the scdawg (except the source and the
sink) and then concatenating these lists.
Step 2: Sort the list of occurrences using the start positions of the occurrences as the primary key (increasing
order) and the end position as the secondary key (decreasing order). This is done using radix sort.
Step 3:
for i := 1 to (number of occurrences) do
begin
  j := i + 1;
  while (j ≤ (number of occurrences)) and (lastpos(occ_i) ≥ firstpos(occ_j)) do
  begin
    if (lastpos(occ_i) ≥ lastpos(occ_j))
      then occ_i is a superword of occ_j
      else (occ_i, occ_j) have a prefix-suffix conflict;
    j := j + 1;
  end;
end;
Figure 9: A simple algorithm for computing conflicts
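Steps 2 and 3 of Figure 9 can be sketched as follows. The sketch assumes that Step 1 has already produced the occurrences as (start, end) pairs with 1-based inclusive positions, that two occurrences conflict when they share at least one position, and it substitutes Python's built-in comparison sort for radix sort. The name `classify_conflicts` is hypothetical.

```python
def classify_conflicts(occurrences):
    """Split overlapping pairs into subword and prefix-suffix conflicts."""
    # Step 2: increasing start position, then decreasing end position.
    occs = sorted(occurrences, key=lambda occ: (occ[0], -occ[1]))
    subword, prefix_suffix = [], []
    # Step 3: scan forward from each occurrence while the overlap lasts.
    for i, (start_i, end_i) in enumerate(occs):
        for start_j, end_j in occs[i + 1:]:
            if end_i < start_j:
                break  # every later occurrence starts even further right
            if end_i >= end_j:
                subword.append(((start_i, end_i), (start_j, end_j)))
            else:
                prefix_suffix.append(((start_i, end_i), (start_j, end_j)))
    return subword, prefix_suffix


# Occurrences of abc, cde, and c in abcicdefcdegabchabcde.
OCCS = [(1, 3), (13, 15), (17, 19), (5, 7), (9, 11), (19, 21),
        (3, 3), (5, 5), (9, 9), (15, 15), (19, 19)]
SUBWORD, PREFIX_SUFFIX = classify_conflicts(OCCS)
print(len(SUBWORD), PREFIX_SUFFIX)
```

On this input the only prefix-suffix conflict is between the occurrence of abc at positions 17 to 19 and the occurrence of cde at positions 19 to 21; every other overlapping pair is a superword and subword pair involving c.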
6.3 Computing Prefix-Suffix Conflicts Efficiently
As with subword conflicts, the lower bound for the problem of computing prefix-suffix conflicts is
$\Omega(n + k_p)$, where $k_p$ is the number of prefix-suffix conflicts in S. For $S = a^n$, $k_p$ is a
polynomial of degree four in $n$ with leading term $n^4/24$, so $k_p = O(n^4)$, which is also the upper
bound on $k_p$. Unlike subword conflicts, it is not possible to compact the output representation.
Theorem 2 All prefix-suffix conflicts in S can be computed in $O(n + k_p)$ space and time, which is
optimal [3].
6.4 Alternative Algorithms
In this section, an alternative solution for computing subword and prefix-suffix conflicts is presented.
The solution is relatively simple and has competitive running times. However, it lacks the flexibility
required to solve many of the problems listed in Section 4. The algorithm is presented in Figure 9.
[Plot: number of conflicts against Size of String (100 to 2000); one curve per alphabet size]
Figure 10: Graph of Number of Prefix-Suffix Conflicts vs String Size
6.5 Implementation
The algorithms to compute all prefix-suffix and subword conflicts (i.e., algorithms from Sections 6.2,
6.3, and 6.4) were implemented on a SUN SPARCstation 1 in GNU C++. 50 randomly generated
strings of lengths ranging from 100 to 2000 from alphabets whose size ranged from 2 to 50 were
input to our algorithms. Statistical information such as the number of vertices in the scdawg and the
number of prefix-suffix and subword conflicts was obtained. The run times of the algorithms
were also recorded.
(i) Figure 10 shows the number of prefix-suffix conflicts vs string size. There is one curve for each
alphabet size. The plot illustrates that the number of prefix-suffix conflicts increases with string
size and decreases with alphabet size. The graph for subword conflicts is similar.
(ii) Figure 11 shows the time per prefix-suffix conflict (μs) vs string size. There is one curve for each
alphabet size. It illustrates that the time per conflict generally decreases with increasing string size
and increases with alphabet size. The graph for subword conflicts is similar.
(iii) The factor by which the compact representation of subword conflicts is smaller than the
fully expanded representation varies from 2 to 9. It increases with string size and decreases with
alphabet size.
(iv) Figure 12 shows the size of the largest displayable entity for each combination of alphabet size
and string size. It shows that only displayable entities of small lengths occur in random strings. (In
practice we would like to be able to distinguish between displayable entities that occur randomly
and those that do not in a given string. This can be done by selecting those displayable entities
whose frequency is large compared to other displayable entities of the same length in a random
string.)

[Plot: time per conflict against Size of String (100 to 2000); one curve per alphabet size]
Figure 11: Graph of Time per Prefix-Suffix Conflict vs String Size

Size of       Size of String
Alphabet      100    200    500    1000    2000
2              11     12     15      18      19
5               5      6      7       8       9
10              3      4      5       6       6
15              3      3      4       5       5
20              3      3      4       4       5
25              2      3      3       4       4
50              2      2      3       3       4
Figure 12: Lengths of Largest Displayable Entity for Random Strings

6.6 Display Algorithms
A list of all occurrences of a displayable entity may be obtained from the scdawg data structure
described earlier. A list of all conflicts between displayable entities may be obtained by operations
on the scdawg. This information is then used to assign numeric weights to each occurrence of each
displayable entity.
In this section we shall discuss algorithms to implement some of the refined display models of
Section 3. Specifically, we shall consider models 1, 2(b), and 3 under the Automatic mode (i.e.,
problems P4, P6, and P7 of Section 4). Our algorithms also apply to the Semi Automatic mode.
Problem P4 can be reduced to the single pair, longest path problem in a directed acyclic graph
as follows: let vertex $V_i$, $1 \le i < n$, correspond to the position between the $i$'th and $(i+1)$'th
characters in the string. $V_0$ corresponds to the position preceding the first character in the string,
and $V_n$ to the position following the last character.
For each occurrence $S_{ij}$ of each displayable entity, we create an edge from $V_{i-1}$ to $V_j$. The weight
associated with this edge is exactly the weight of the occurrence it represents. Finally, for each
pair $(V_i, V_{i+1})$, $0 \le i < n$, of vertices such that an edge from $V_i$ to $V_{i+1}$ does not already exist, we
create an edge from $V_i$ to $V_{i+1}$ of weight 0.
Figure 13 shows the directed acyclic graph corresponding to the string S = abcicdefcdegabchabcde
of Figure 2, assuming that the weights corresponding to each occurrence of abc, cde, and c are 4,
3, and 2 respectively. The longest path from $V_0$ to $V_n$ in the dag is $V_0 \to V_3 \to V_4 \to V_7 \to V_8 \to
V_{11} \to V_{12} \to V_{15} \to V_{16} \to V_{19} \to V_{20} \to V_{21}$. All edges on this path with nonzero weight represent
occurrences of displayable entities that are to be highlighted. Here $V_0 \to V_3$, $V_4 \to V_7$, $V_8 \to V_{11}$,
$V_{12} \to V_{15}$, and $V_{16} \to V_{19}$ correspond to occurrences < 1, 3 >, < 13, 15 >, and < 17, 19 > of abc, and
< 5, 7 > and < 9, 11 > of cde. The length of the longest path (here, 18) represents the total weight
of the display.
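The construction of the dag can be sketched for the example string as follows. This is a minimal illustration, assuming that the occurrences are found by plain substring search rather than from the scdawg; the names `find_occurrences` and `build_dag` are hypothetical.

```python
S = "abcicdefcdegabchabcde"
WEIGHTS = {"abc": 4, "cde": 3, "c": 2}


def find_occurrences(text, weights):
    """Return (start, end, weight) triples with 1-based inclusive positions."""
    occs = []
    for pattern, weight in weights.items():
        pos = text.find(pattern)
        while pos != -1:
            occs.append((pos + 1, pos + len(pattern), weight))
            pos = text.find(pattern, pos + 1)
    return occs


def build_dag(n, occurrences):
    """Return the edge weights {(u, v): weight} over vertices V_0 .. V_n."""
    edges = {}
    for start, end, weight in occurrences:
        edges[(start - 1, end)] = weight   # occurrence S_ij: edge V_{i-1} -> V_j
    for i in range(n):
        edges.setdefault((i, i + 1), 0)    # zero-weight edge only where none exists
    return edges


EDGES = build_dag(len(S), find_occurrences(S, WEIGHTS))
PATH = [0, 3, 4, 7, 8, 11, 12, 15, 16, 19, 20, 21]
PATH_WEIGHT = sum(EDGES[(u, v)] for u, v in zip(PATH, PATH[1:]))
print(len(EDGES), PATH_WEIGHT)
```

Summing the edge weights along the path listed in the text gives the total weight of the display for this example.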
Algorithm A4 of Figure 14 solves problem P4. A[0..n] is an array of integers, where A[i] represents
the length of the longest path from $V_0$ to $V_i$ found so far. All elements of A are
initialized to 0. The auxiliary array T[1..n] stores the occurrences that have been chosen for display.
The array delist contains all the occurrences of all the displayable entities in the string. Each
element of delist contains three fields: start, end, and weight, which respectively represent the start
position, the end position, and the numeric weight of the occurrence. It is assumed that delist is
sorted in increasing order of end. The vertices are processed in topological order (i.e., $V_0, V_1, \ldots,
V_n$). When a vertex $V_j$ is being processed, each vertex $V_i$ ($i < j$) preceding it has associated with
it the cost of the longest path from $V_0$ to $V_i$. The cost of the longest path from $V_0$ to $V_j$ is then
determined by examining each of the incoming edges to $V_j$.
Figure 13: Dag corresponding to abcicdefcdegabchabcde
The complexity of Algorithm A4 is O(n + e), where e represents the size of delist. So, Algorithm A4
is optimal. Note that sorting delist on end position can also be accomplished in O(n + e)
time using radix sort.
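The longest path computation of Algorithm A4 can be sketched in Python as follows. This is an illustrative version under stated assumptions: each element of delist is a (start, end, weight) triple with 1-based inclusive positions, a comparison sort stands in for radix sort, and the arrays `best` and `chosen` play the roles of A and T. The name `max_weight_display` is hypothetical.

```python
def max_weight_display(n, delist):
    """Return the total weight and the occurrences chosen for display."""
    delist = sorted(delist, key=lambda occ: occ[1])  # increasing end position
    best = [0] * (n + 1)        # best[i]: longest path weight from V_0 to V_i
    chosen = [None] * (n + 1)   # chosen[i]: occurrence ending at i on that path
    j = 0
    for i in range(1, n + 1):
        best[i] = best[i - 1]   # zero-weight edge V_{i-1} -> V_i
        while j < len(delist) and delist[j][1] == i:
            start, _, weight = delist[j]
            if best[start - 1] + weight > best[i]:
                best[i] = best[start - 1] + weight
                chosen[i] = delist[j]
            j += 1
    # Walk back from V_n to list the highlighted occurrences.
    selected, i = [], n
    while i > 0:
        if chosen[i] is None:
            i -= 1
        else:
            selected.append(chosen[i][:2])
            i = chosen[i][0] - 1
    return best[n], selected[::-1]


# Occurrences of abc (weight 4), cde (weight 3), and c (weight 2)
# in abcicdefcdegabchabcde.
EXAMPLE = [(1, 3, 4), (13, 15, 4), (17, 19, 4),
           (5, 7, 3), (9, 11, 3), (19, 21, 3),
           (3, 3, 2), (5, 5, 2), (9, 9, 2), (15, 15, 2), (19, 19, 2)]
RESULT = max_weight_display(21, EXAMPLE)
print(RESULT)
```

After the sort, each occurrence is examined once and each position is visited once, so the loop itself performs O(n + e) work.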
Algorithm A4
begin
  A[0] := 0; j := 1;
  for i := 1 to n do
  begin
    A[i] := A[i − 1]; T[i] := 0;
    while (j ≤ e) and (delist[j].end = i) do
    begin
      if A[i] < A[delist[j].start − 1] + delist[j].weight
      then
      begin
        A[i] := A[delist[j].start − 1] + delist[j].weight;
        T[i] := j;
      end;
      j := j + 1
    end;
  end;
end.
Figure 14: Algorithm for P4
Problem P6 of Section 4 is solved by Algorithm A6 of Figure 15 using the greedy method.
endpoints[1..2e] is a list of both the start positions and end positions of all occurrences of all
displayable entities in the string. Each element of endpoints contains three fields: position, type,
and id. position contains the position of the particular endpoint in the string; type determines
whether the endpoint is a "start" position or an "end" position. id uniquely identifies the occurrence
corresponding to the endpoint. It is assumed that endpoints is sorted in increasing order on primary
key, position, and secondary key, type ("start" < "end"). LineStack is a stack that contains line
numbers (or partition numbers) which are currently available. On completion of the algorithm,
line[i] contains the line or the particular copy of the string in which the i'th occurrence is to be
highlighted for $1 \le i \le e$.
Proof of Correctness: From the algorithm the following assertions can be made:
(1) The final value of max (fmax) represents the number of partitions of S.
(2) At least one position in the string is covered by fmax occurrences.
Statement (2) indicates that fmax is the smallest possible number of partitions that satisfy the
problem. Statement (1) indicates that a partition of that size is in fact obtained.
Algorithm A6 consumes O(n + e) time, which is optimal. Note that sorting endpoints can also
be accomplished in O(n + e) time, if radix sort is used.
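The greedy line assignment of Algorithm A6 can be sketched as follows. The sketch assumes occurrences given as (start, end) pairs with 1-based inclusive positions, encodes the endpoint types as 0 for "start" and 1 for "end" so that tuple ordering yields the required sort, and uses a comparison sort rather than radix sort. Testing whether the stack of free lines is non-empty takes the place of comparing current with max. The name `assign_lines` is hypothetical.

```python
def assign_lines(occurrences):
    """Assign each occurrence a line so that overlapping ones never share a line."""
    endpoints = []
    for occ_id, (start, end) in enumerate(occurrences):
        endpoints.append((start, 0, occ_id))   # 0 encodes "start"
        endpoints.append((end, 1, occ_id))     # 1 encodes "end"
    endpoints.sort()                           # position first, then start before end
    free_lines = []                            # plays the role of LineStack
    max_lines = 0
    line = [0] * len(occurrences)
    for _, kind, occ_id in endpoints:
        if kind == 0:
            if free_lines:
                line[occ_id] = free_lines.pop()
            else:
                max_lines += 1
                line[occ_id] = max_lines
        else:
            free_lines.append(line[occ_id])
    return line, max_lines


# Occurrences of abc, cde, and c in abcicdefcdegabchabcde.
OCCS = [(1, 3), (13, 15), (17, 19), (5, 7), (9, 11), (19, 21),
        (3, 3), (5, 5), (9, 9), (15, 15), (19, 19)]
LINES, FMAX = assign_lines(OCCS)
print(LINES, FMAX)
```

For the example string, position 19 is covered by occurrences of abc, cde, and c, so three lines are needed and three are used.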
We outline two solutions to problem P7. In the first, we assume that all occurrences of the
same displayable entity are assigned the same numeric weight. In the second, we do not make this
assumption.
Algorithm A7(a) of Figure 16 solves the first version of P7. The second version of P7 may be
solved by executing steps a to c for each occurrence of each displayable entity as shown in Algorithm
A7(b) of Figure 17. Both solutions involve a traversal of SCD(S) in topological order. For each
occurrence, an optimal selection of its subwords is chosen. The weight of the occurrence is then
obtained by adding the sum of the weights of the chosen subwords to its weight. This is done
because an occurrence is highlighted along with the chosen subwords. Algorithm A7(a) consumes
$O(n^3)$ time, while Algorithm A7(b) consumes $O(n^4)$ time.
Algorithm A6
begin
  max := 0; current := 0;
  for i := 1 to 2e do
  begin
    x := endpoints[i];
    if (x.type = "start") then
    begin
      current := current + 1;
      if (current ≤ max) then
      begin
        CurrentLine := top(LineStack);
        pop(LineStack);
      end
      else
      begin
        max := max + 1;
        CurrentLine := max;
      end;
      line[x.id] := CurrentLine;
    end
    else {x.type = "end"}
    begin
      current := current − 1;
      return line[x.id] to LineStack;
    end;
  end;
  fmax := max;
end.
Figure 15: Algorithm for P6
Algorithm A7(a)
begin
  for each vertex, v, of SCD(S), in topological order do
  begin
    Step a. Compute the relative positions of all subword displayable entities
      in a single instance of de(v) using Algorithm A3 of Section 6.2.
    Step b. Choose a mutually non-overlapping subset from the set of subwords
      of de(v) (obtained in step (a) above) so that the sum of their weights is
      maximized. This is achieved by an algorithm similar to A4.
    Step c. Reset the numeric weight of de(v) by adding to it the
      total weight of the configuration obtained in step (b).
  end;
end.
Figure 16: Algorithm for P7, same weights

Algorithm A7(b)
begin
  for each vertex, v, of SCD(S), in topological order do
  begin
    for each occurrence < i, j > of de(v) do
    begin
      Step a. Compute the occurrences of all subword displayable entities in < i, j >.
      Step b. Choose a mutually non-overlapping subset from the set of occurrences
        (obtained in step (a) above) so that the sum of their weights is maximized.
        This is achieved by an algorithm similar to A4.
      Step c. Reset the numeric weight of < i, j > by adding to it the total weight
        of the configuration obtained in step (b).
    end;
  end;
end.
Figure 17: Algorithm for P7, different weights
Figure 18: Circular string
7 Circular String Visualization
In this section, we consider the problem of circular string visualization. Section 7.2 mentions some
of its applications while Section 7.3 outlines the algorithms.
7.1 Problem Definition
As with the linear string, we provide a specification for the circular string visualization problem:
1. Structure of Data to be Visualized: A circular string of length n whose characters are chosen
from a fixed alphabet $\Sigma$ of constant size. Figure 18 shows an example circular string of size 8.
2. Structure of Patterns: A linear substring of the circular string of length less than n. For
example, abc and ceab are patterns in Figure 18.
3. Maximality of Patterns: The definition of maximality of a pattern in a circular string is
similar to that of a linear string. That is, a pattern is said to be maximal iff its occurrences in
the circular string are not all preceded by the same character nor all followed by the same
character.
4. Measure of Similarity (MS): If two patterns are identical, then MS = 1. Otherwise, MS = 0.
5. Display Model: A maximal pattern in a circular string is called a displayable entity if it occurs
at least twice in the circular string. In the example string of Figure 18, abc is a displayable
entity. All instances of the same displayable entity are highlighted in the same color. As
with linear strings, we encounter the problem of prefix-suffix and subword conflicts. Similar
techniques are used to overcome these.
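The definitions above can be checked by brute force on a small circular string. The sketch below is only an illustration: the string abcdabce is a hypothetical example of size 8 (not necessarily the string of Figure 18), the function names are assumptions, and occurrences are found by appending a prefix of the string to itself rather than by the data structures used in this paper.

```python
def circular_occurrences(circ, pattern):
    """0-based start indices of pattern in the circular string circ.

    Assumes 1 <= len(pattern) < len(circ).
    """
    n, m = len(circ), len(pattern)
    doubled = circ + circ[:m - 1]   # lets a match wrap past the last character
    return [i for i in range(n) if doubled.startswith(pattern, i)]


def is_displayable(circ, pattern):
    """True if pattern is maximal and occurs at least twice in circ."""
    n, m = len(circ), len(pattern)
    starts = circular_occurrences(circ, pattern)
    if len(starts) < 2:
        return False
    before = {circ[(i - 1) % n] for i in starts}   # characters preceding each occurrence
    after = {circ[(i + m) % n] for i in starts}    # characters following each occurrence
    return len(before) > 1 and len(after) > 1


CIRC = "abcdabce"   # hypothetical circular string of size 8
print(circular_occurrences(CIRC, "ceab"), is_displayable(CIRC, "abc"))
```

In this hypothetical string, abc occurs twice with different neighbors on both sides and is therefore a displayable entity, ab is not maximal because both occurrences are followed by c, and ceab is a pattern that wraps around the end of the string but occurs only once.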
7.2 Applications
Circular strings may be used to represent circular genomes [5] such as G4 and φX174. The detection
and analysis of patterns in genomes helps to provide insights into the evolution, structure, and
function of organisms. [5] analyzes G4 and φX174 by linearizing them and then constructing their
scdawg. Our work improves upon [5] by:
(i) analyzing circular strings without risking the "loss" of patterns.
(ii) extending the analysis and visualization techniques presented for linear strings to circular
strings.
Circular strings in the form of chain codes are also used to represent closed curves in computer
vision [9]. The objects of Figure 19(a) are represented in chain code as follows:
(1) Arbitrarily choose a pixel through which the curve passes. In the diagram, the starting pixels
for the chain code representation of objects 1 and 2 are marked by arrows.
(2) Traverse the curve in the clockwise direction. At each move from one pixel to the next, the
direction of the move is recorded according to the convention shown in Figure 19(b).
Objects 1 and 2 are represented by 1122102243244666666666 and 6666666611 .' .'." .', ,,'6,
respectively. The alphabet is {0, 1, 2, 3, 4, 5, 6, 7} which is fixed and of constant size (8) and
therefore satisfies the definitions of Section 7.1. We may now use our visualization techniques
to compare the two objects. For example, our methods would show that objects 1 and 2 share
the segments S1 and S2 (Figure 19(c)) corresponding to 0224 and 2446666666661122, respectively.
Information on other common segments would also be available. The techniques of this paper make
it possible to detect all patterns irrespective of the starting pixels chosen for the two objects.
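The independence from the starting pixel can be illustrated with the chain code of object 1. The sketch below is a brute-force check and not the algorithm of Section 7.3; the helper names `circular_occurrences` and `rotate` are assumptions. It locates the segments 0224 and 2446666666661122 in the circular chain code and confirms that both are still found when the code is written out from any other starting pixel.

```python
OBJECT_1 = "1122102243244666666666"   # chain code of object 1 from the text
S1 = "0224"
S2 = "2446666666661122"


def circular_occurrences(circ, pattern):
    """0-based start indices of pattern in the circular string circ."""
    n, m = len(circ), len(pattern)
    doubled = circ + circ[:m - 1]   # lets a match wrap past the last character
    return [i for i in range(n) if doubled.startswith(pattern, i)]


def rotate(code, k):
    """Chain code of the same curve when the traversal starts k moves later."""
    return code[k:] + code[:k]


S1_STARTS = circular_occurrences(OBJECT_1, S1)
S2_STARTS = circular_occurrences(OBJECT_1, S2)
FOUND_IN_ALL_ROTATIONS = all(
    len(circular_occurrences(rotate(OBJECT_1, k), segment)) == 1
    for k in range(len(OBJECT_1))
    for segment in (S1, S2)
)
print(S1_STARTS, S2_STARTS, FOUND_IN_ALL_ROTATIONS)
```

The second segment wraps around the end of the chain code as written, so a search over the linearized string alone would miss it.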
Circular strings may also be used to represent polygons in computer graphics and computational
geometry [10]. Figure 20 shows a polygon which is represented by the following alternating sequence
