Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: Models and techniques for the visualization of labeled discrete objects
Permanent Link: http://ufdc.ufl.edu/UF00095104/00001
 Material Information
Title: Models and techniques for the visualization of labeled discrete objects
Series Title: Department of Computer and Information Science and Engineering Technical Reports
Physical Description: Book
Language: English
Creator: Sahni, Sartaj
Mehta, Dinesh
Affiliation: University of Florida
University of Minnesota
Publisher: Department of Computer and Information Sciences, University of Florida
Place of Publication: Gainesville, Fla.
Copyright Date: 1991
 Record Information
Bibliographic ID: UF00095104
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.












Models and Techniques for the Visualization of Labeled Discrete Objects *

Dinesh P. Mehta †‡    Sartaj Sahni †





Abstract

A general technique for the visualization of discrete objects is presented. This technique,
which consists of identifying similar substructures within the structure and color coding them, is
demonstrated by applying it to linear strings, circular strings, binary trees, and series-parallel
graphs.
The problem of display conflicts, which is encountered while applying this technique, is
described, and methods to deal with it are suggested. Some of these methods are interactive in
nature. Queries that would be supported in such an environment are described. Efficient
algorithms to implement some of the queries are developed. The performance of these algorithms
is studied, both theoretically and experimentally. We also demonstrate that our algorithms are
well suited for implementation using the object-oriented paradigm.
The application of our techniques to such diverse areas as molecular biology, text analysis,
analysis of numerical sequences, computer vision, computer graphics, computer-aided design,
data compression, algorithm animation, and debuggers is outlined.











*This research was supported in part by the National Science Foundation under grant MIP 86-17374.
†Dept. of Computer and Information Sciences, University of Florida, Gainesville, FL 32611
‡Dept. of Computer Science, University of Minnesota, Minneapolis, MN 55455










1 Introduction


The objective of visualization is to extract useful and relevant information from data and represent
it so that it can be easily understood and assimilated by humans. This enables specialists in
application areas to observe trends and patterns in the data. This could lead to better understanding of
various phenomena and provide insights resulting in theories or hypotheses, which can subsequently
be proved (or disproved) by formal methods or by further experimentation.

A notion that is useful in the understanding of objects is that of similarity. Multiple occurrences
of the same pattern in data which represents the outcome or result of some process indicate the
presence of many instances of the same "effect". Analyzing the set of circumstances associated
with these occurrences could yield a plausible "cause". For example, multiple occurrences of the
same flaw in a paper roll might reveal the faulty component in the paper production process.
Similarly, multiple occurrences of the same patterns in an object whose structure is being studied
indicate the presence of many instances of the same "cause". Observing the phenomena which
occur in the presence of these patterns could shed some light on the "effect" of the patterns.
For example, the presence of multiple occurrences of patterns in DNA strands in organisms could
manifest itself as common characteristics shared by the organisms.

This linkage of cause and effect is a fundamental goal in many scientific disciplines. Most work

in visualization so far, which attempts to facilitate the scientific goals outlined above, consists of

choosing methods to display individual units of data, so that patterns and trends become visually

obvious when the data is seen in its entirety [1, 2]. However, the onus of detecting patterns and

trends still lies on the user or the specialist. This becomes more crucial when the amount of data

is large and the user's perceptual faculties are overburdened. Consequently, errors of omission (not

seeing patterns which are actually there) become more likely.

Our work attempts to shift the responsibility of detecting patterns from the human to the

computer. This is done by devising algorithms to detect patterns in the data; and then making

these patterns available to the user for further scrutiny. Multiple occurrences of the same pattern

can be made visually explicit by color coding them with the same color. Other methods could also

be used, such as "flashing" occurrences of the same pattern on the screen. If visual schemes are










not appropriate, recurring patterns could simply be provided as a list of occurrences which the user

goes through. In the context of a visualization environment, this technique could be used either in

a stand-alone manner or as a supplement to other visualization methods.

For the technique outlined above to be useful, the following principles on displaying patterns

should be adhered to:

Principle 1: Two patterns should be displayed to look similar iff they are similar.

Principle 2: The degree of similarity between the displays of two patterns should be proportional

to the actual similarity of the patterns.

A complete specification of a visualization problem requires one to provide the following five
items. These are illustrated using linear string visualization as an example. Henceforth, the word
"string" shall refer to linear strings.

(1) Structure of the Data to be Visualized: Does the data represent a string, a series-parallel
graph, a binary tree, etc.?

In the string visualization example, the data is a string, S, of length n, whose characters are
chosen from a fixed alphabet, Σ, of constant size.

(2) Structure of Patterns: This depends largely on the structure of the data. If the data

represents a string, then the structure of the pattern could be a substring (contiguous sequence of

string elements), a subsequence (a non-contiguous sequence of string elements), etc. Patterns can

have other constraints imposed upon them. For example, a pattern may be required to be of a

minimum size.

In the string visualization example, the pattern is a substring of S, defined uniquely by its start

and end positions.

(3) Maximality of Patterns: If a pattern is repeated in the data, then any subpattern of
the pattern is also repeated. For example, if abc repeats in a string, so do ab, bc, a, b, and c.
In particular, if all occurrences of ab occur in the context of abc, then attempting to distinguish
between ab and abc does not serve any useful purpose. Defining maximality and restricting the
user's attention to maximal patterns helps to simplify the display.

In the string visualization example, a pattern is said to be maximal iff its occurrences are
not all preceded by the same letter, nor all followed by the same letter. Consider the string S =
abczdefydefxabc. Here, abc and def are the only maximal patterns. The occurrences of def are
preceded by different letters (z and y) and followed by different letters (y and x). The occurrences
of abc are not preceded by the same letter (the first occurrence does not have a predecessor) nor
followed by the same letter. However, de is not maximal because all its occurrences in S are followed
by f.

[Figure 1: Highlighting Displayable Entities (the string abczdefydefxabc with abc and def highlighted)]
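
The maximality test for linear strings can be tried out directly. The following is a minimal brute-force sketch, not the scdawg-based method of Section 6; the function name `displayable_entities` is hypothetical, and a missing predecessor or successor at the ends of S is assumed to count as a letter different from all others, as in the abc example.

```python
from collections import defaultdict

def displayable_entities(s):
    """Return {pattern: [0-based start positions]} for patterns that occur
    at least twice and are maximal in the sense defined in the text."""
    n = len(s)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[s[i:j]].append(i)

    result = {}
    for pattern, starts in occurrences.items():
        if len(starts) < 2:
            continue
        # None stands for "no predecessor" / "no successor" at the string ends.
        before = {s[i - 1] if i > 0 else None for i in starts}
        after = {s[i + len(pattern)] if i + len(pattern) < n else None
                 for i in starts}
        # Maximal: not all preceded by the same letter, nor all followed by it.
        if len(before) > 1 and len(after) > 1:
            result[pattern] = starts
    return result

print(sorted(displayable_entities("abczdefydefxabc")))  # ['abc', 'def']
```

This enumerates every substring, so it is only suitable for short illustrative inputs.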

(4) Measure of Similarity (MS): A measure of similarity would consist of attaching numerical
values to pairs of maximal patterns which indicate the degree of similarity between the two patterns.

In the string visualization example, the measure of similarity could be defined as follows: If two
patterns are identical, then MS = 1. Otherwise, MS = 0. In other words, two patterns are defined
to be similar iff they are identical. There is no concept of "degree of similarity" in this definition.

(5) Display Models: This addresses the issues of which patterns are displayed and how they
are displayed. While choosing display models, Principles 1 and 2 should be kept in mind.

In the string visualization example, a pattern is said to be a displayable entity (or displayable)
iff it is maximal and occurs more than once in S (in this case, all maximal patterns are displayable
entities with the exception of S, which occurs once in itself). All instances of the same displayable
entity are highlighted in the same color. Instances of different displayable entities are highlighted in
different colors (there is no relationship between colors representing different displayable entities).
In the example string, S, abc and def are the only displayable entities. So, S would be displayed
by highlighting abc in one color and def in another as shown in Figure 1.

Visualization conflicts that arise from the above technique are described in Section 2, and
refinements to the display model that attempt to overcome these conflicts are provided in Section 3. In
Section 4, some of the queries supported by a string visualization system are stated. Applications of
string visualization are discussed in Section 5, and some of the issues that arise when one implements
a string visualization system are addressed in Section 6. Some empirical results are also provided
in this section. Sections 7, 8, and 9 discuss circular string, tree, and geometric series-parallel graph
visualization, respectively. Section 10 demonstrates that many of our algorithms can be effectively
coded using the object-oriented paradigm. Finally, in Section 11, we provide an outline of how a
visualization system based on our approach would function.



2 Visualization Conflicts


Consider the string S = abcicdefcdegabchabcde and its displayable entities, abc and cde (both are
maximal and occur thrice). So, they must be highlighted in different colors. Notice, however, that
abc and cde both occur in the string abcde, which occurs as a suffix of S. Clearly, both displayable
entities cannot be highlighted in different colors in abcde as required by the model. This is a
consequence of the fact that the letter c occurs in both displayable entities. This situation is known
as a prefix-suffix conflict (because a prefix of one displayable entity is a suffix of the other).

Note, also, that c is a displayable entity in S. Consequently, all occurrences of c must be
highlighted in a color different from those used for abc and cde. But this is impossible as c is a
subword of both abc and cde. This situation is referred to as a subword conflict. Formally,

(i) A subword conflict between two displayable entities, D1 and D2, in S exists iff D1 is a substring
of D2.
(ii) A prefix-suffix conflict between two displayable entities, D1 and D2, in S exists iff there exist
substrings Sp, Sm, Ss in S such that SpSmSs occurs in S, SpSm = D1, and SmSs = D2.
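
At the level of individual occurrences, the two definitions can be read as interval relations: a subword conflict is containment, and a prefix-suffix conflict is a partial overlap. The sketch below is a brute-force illustration under that reading; the names `find_occurrences` and `classify_conflicts` are hypothetical, occurrences are 0-based with an exclusive end, and the displayable entities are supplied by the caller.

```python
def find_occurrences(s, pattern):
    """All 0-based start positions of pattern in s (overlaps included)."""
    return [i for i in range(len(s) - len(pattern) + 1)
            if s.startswith(pattern, i)]

def classify_conflicts(s, entities):
    """Return (subword, prefix_suffix) lists of conflicting occurrence pairs.
    Each occurrence is a (pattern, start, end) triple, end exclusive."""
    occ = [(p, i, i + len(p)) for p in entities for i in find_occurrences(s, p)]
    subword, prefix_suffix = [], []
    for a in occ:
        for b in occ:
            if a == b:
                continue
            (_, a0, a1), (_, b0, b1) = a, b
            if a0 <= b0 and b1 <= a1:
                subword.append((a, b))        # b lies entirely inside a
            elif a0 < b0 < a1 < b1:
                prefix_suffix.append((a, b))  # a suffix of a is a prefix of b
    return subword, prefix_suffix

S = "abcicdefcdegabchabcde"
sub, ps = classify_conflicts(S, ["abc", "cde", "c"])
print(len(sub), ps)  # 6 [(('abc', 16, 19), ('cde', 18, 21))]
```

For the example string, the only prefix-suffix conflict is in the suffix abcde, and each of the six subword conflicts involves an occurrence of c.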



3 Refinements of Display Model


When subword and prefix-suffix conflicts occur, we need some criteria to determine which of the
information previously required to be displayed actually gets displayed. For instance, in the
example string S = abcicdefcdegabchabcde from the previous section, three possible non-conflicting,
displayable subsets are shown in Figure 2. In this section we present three refinements to the
display model from Section 1 which attempt to overcome the display difficulties created by conflicts.
They are:

[Figure 2: Possible Configurations]

[Figure 3: Optimal Configuration under Model 1]

(1) One-Copy, Maximum-Content, No-Overlap: In this model, exactly one copy of the string is
displayed. Occurrences of displayable entities are selected so that there are no mutually conflicting
occurrences. Given this restriction, the model requires occurrences to be selected so that the
amount of information conveyed by the display is maximized. This goal may be achieved in three
ways.
Interactive : The user selects occurrences interactively by using his/her judgement. Typically, this
would be done by examining the occurrences which are involved in a conflict and choosing one that
is the most meaningful.
Automatic : A numeric weight is assigned to each occurrence. The higher the weight, the greater
the desirability of displaying the corresponding occurrence. Criteria that could be used in assigning
weights to occurrences include: length, position, number of occurrences of the pattern, semantic
value of the displayable entity, information on conflicts, etc. The information is then fed to a
routine which selects a set of occurrences so that the sum of their weights is maximized. For
example, consider string S = abcicdefcdegabchabcde of Section 2. If the weight assigned to each
occurrence of abc is 4, cde is 2, c is 3, then Figure 3 shows the optimal display configuration. The
total weight of the display is 18.
Semi-Automatic: In a practical environment, the most appropriate method would be a hybrid of
the Interactive and Automatic approaches described above. The user could select some occurrences
that he/she wants included in the final display. The selection of the remaining occurrences can
then be performed by a routine which maximizes the display information.

[Figure 4: Optimal Configuration under Model 2(a)]

[Figure 5: Configuration under Model 2(b)]

(2) Multiple-Copy, No-Overlap: Multiple copies of the string may be displayed. Mutually
disjoint sets of occurrences are associated with the copies (one set per copy), so that the occurrences
corresponding to each copy are mutually non-conflicting. There are two approaches:
(a) A constant number (max) of copies of the string may be displayed. The total content of
the display, summed over all max copies, is to be maximized. For example, consider string S =
abcicdefcdegabchabcde of Section 2. If the weights corresponding to abc, cde, and c are 4, 2, and
3, respectively, and max = 2, then Figure 4 shows the optimal display configuration. The total
weight of the display is 31.
(b) No limit is imposed on the number of copies of the string that may be displayed. However,
each occurrence is highlighted in only one copy of the string. The number of copies of the string
used should be minimized. It can be shown that, in the worst case, O(n^2) copies may be required.
Figure 5 shows a configuration for string S = abcicdefcdegabchabcde of Section 2.


(3) Single-Copy, Maximum-Content, Subword-Overlap: Exactly one copy of the string may be
displayed. Occurrences are selected so that no pair of occurrences has a prefix-suffix conflict.
Subword conflicts are allowed. As in the Single-Copy, Maximum-Content, No-Overlap model, the
goal is to maximize the information conveyed. Again, there are three approaches for selecting
occurrences: Automatic, Interactive, and Semi-Automatic. For example, consider string S =
abcicdefcdegabchabcde of Section 2. If the weights corresponding to each occurrence of abc, cde, and
c are 4, 2, and 3, respectively, then Figure 6 shows the optimal display configuration. The total
value of the display is 31.

[Figure 6: Optimal Configuration under Model 3]

Note that it is crucial to all methods to get information on prefix-suffix and subword conflicts.
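
The Automatic approach under Model 1 amounts to choosing pairwise non-overlapping occurrences of maximum total weight. The following sketch uses the textbook weighted interval scheduling recurrence, which is not necessarily the algorithm the authors present in Section 6.6. It assumes positive weights and occurrences given as hypothetical (start, end, weight) triples with 0-based, end-exclusive positions; the name `max_weight_selection` is also an assumption.

```python
import bisect

def max_weight_selection(occurrences):
    """Return (total weight, chosen occurrences) such that no two chosen
    occurrences share a string position and the total weight is maximum."""
    occ = sorted(occurrences, key=lambda o: o[1])   # order by end position
    ends = [end for _, end, _ in occ]
    best = [0] * (len(occ) + 1)   # best[k]: optimum over the first k occurrences
    prev = [0] * len(occ)         # prev[k]: earlier occurrences ending by occ[k]'s start
    for k, (start, _, weight) in enumerate(occ):
        prev[k] = bisect.bisect_right(ends, start, 0, k)
        best[k + 1] = max(best[k], best[prev[k]] + weight)

    chosen, k = [], len(occ)
    while k > 0:                  # trace back one optimal selection
        if best[k] > best[k - 1]:
            chosen.append(occ[k - 1])
            k = prev[k - 1]
        else:
            k -= 1
    chosen.reverse()
    return best[-1], chosen

S = "abcicdefcdegabchabcde"
weights = {"abc": 4, "cde": 2, "c": 3}
occurrences = [(i, i + len(p), w) for p, w in weights.items()
               for i in range(len(S)) if S.startswith(p, i)]
total, chosen = max_weight_selection(occurrences)
print(total)  # 18
```

With the weights of the Model 1 example (abc = 4, cde = 2, c = 3), the sketch reproduces the stated total weight of 18.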



4 String Visualization Queries


The following algorithmic problems arise as a result of the discussion in the previous section:

Given a string, S, of length n whose elements are chosen from an alphabet Σ of fixed size,
P1. Obtain a list of all displayable entities and their occurrences.
P2. Obtain a list of all prefix-suffix conflicts.
P3. Obtain a list of all subword conflicts.

Given a list of occurrences of displayable entities and a weight associated with each occurrence,
P4. Obtain a set of mutually non-conflicting occurrences so that the sum of the weights associated
with them is maximum (Model 1).
P5. Obtain max mutually disjoint sets of mutually non-conflicting occurrences so that the sum of
the weights associated with them is maximum (Model 2(a)).
P6. Obtain a minimum number of mutually disjoint sets of mutually non-conflicting occurrences
required to partition the set of all occurrences (Model 2(b)).
P7. Obtain a set of occurrences such that no two have a prefix-suffix conflict, so that the sum of
the weights associated with them is maximized (Model 3).
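
Problem P6 can be illustrated with the standard greedy interval-partitioning scheme, offered here as a sketch rather than as the authors' algorithm. It assumes that two occurrences conflict exactly when they share at least one string position, and that occurrences are given as hypothetical (start, end) pairs with 0-based, end-exclusive positions. Under that assumption the greedy scheme uses as many copies as the largest number of occurrences covering a single position, which is the minimum possible.

```python
import heapq

def assign_copies(occurrences):
    """Return (copy index for each occurrence, number of copies used)."""
    order = sorted(range(len(occurrences)), key=lambda i: occurrences[i])
    free_at = []                      # heap of (end of last occurrence, copy index)
    copy_of = [None] * len(occurrences)
    copies = 0
    for i in order:
        start, end = occurrences[i]
        if free_at and free_at[0][0] <= start:
            _, c = heapq.heappop(free_at)    # reuse a copy that is free again
        else:
            c = copies                       # every copy is busy: open a new one
            copies += 1
        copy_of[i] = c
        heapq.heappush(free_at, (end, c))
    return copy_of, copies

S = "abcicdefcdegabchabcde"
occurrences = [(i, i + len(p)) for p in ("abc", "cde", "c")
               for i in range(len(S)) if S.startswith(p, i)]
copy_of, copies = assign_copies(occurrences)
print(copies)  # 3
```

Three copies are needed for the example string because abc, cde, and c all cover the position of c in the suffix abcde.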

In addition to the problems outlined above, restricted versions of these problems exist. These
problems represent typical queries that would be supported by an interactive visualization system
based on our approach. We list some of these below.

(i) Restricted Queries: The overlap of a conflict is defined as the string common to the conflicting

displayable entities. The overlap of a subword conflict is the subword displayable entity. The overlap

of a prefix-suffix conflict is the substring common to the conflicting strings. The size of a conflict is

the length of the overlap. It is useful to be able to list only those conflicts whose sizes are greater

than some specified length. This simplifies the display and eliminates uninteresting displayable

entities.

P8. Obtain all prefix-suffix conflicts of size greater than some integer min.

P9. Obtain all subword conflicts of size greater than some integer min.

(ii) Pattern-Oriented Queries: These queries are useful in applications where the fact that two

patterns have a conflict is more important than the number of conflicts or where in the string the

conflicts occur.

P10. List all pairs of displayable entities which have prefix-suffix or subword conflicts.

P11. List all pairs of displayable entities which have conflicts of size greater than some given

constant.

P12. List all displayable entities that are superwords of a given displayable entity.

(iii) Statistical Queries: These queries are useful when conclusions are to be drawn from the
data based on statistical facts.

P13. For each pair of displayable entities, D1 and D2, involved in a subword conflict (D1 is the
subword of D2), obtain p = (number of occurrences of D1 which occur as subwords of D2)/(number
of occurrences of D1).

P14. For each pair of displayable entities, D1 and D2, involved in a prefix-suffix conflict, obtain q
= (number of occurrences of D1 (D2) which have prefix-suffix conflicts with D2 (D1))/(number of
occurrences of D1 (D2)).

If p or q is greater than a statistically determined threshold, then the following could be said
with some confidence: Presence of D1 implies Presence of D2.
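
The ratio p of query P13 can be computed by brute force for a given pair of patterns. This is a sketch for illustration only, not the efficient algorithm of [3]; the function names are hypothetical, and the patterns are passed in as plain strings.

```python
def find_occurrences(s, pattern):
    """All 0-based start positions of pattern in s (overlaps included)."""
    return [i for i in range(len(s) - len(pattern) + 1)
            if s.startswith(pattern, i)]

def subword_fraction(s, d1, d2):
    """p = (occurrences of d1 lying inside some occurrence of d2)
           / (occurrences of d1)."""
    occ1 = find_occurrences(s, d1)
    occ2 = find_occurrences(s, d2)
    if not occ1:
        raise ValueError("d1 does not occur in s")
    inside = sum(
        any(j <= i and i + len(d1) <= j + len(d2) for j in occ2)
        for i in occ1
    )
    return inside / len(occ1)

print(subword_fraction("abcicdefcdegabchabcde", "c", "abc"))  # 0.6
```

In the example string, three of the five occurrences of c lie inside an occurrence of abc, so p = 3/5.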

A detailed explanation of efficient algorithms for P1-P3 and P8-P14 is provided in [3]. Algorithms
for P1-P3 are briefly outlined in Section 6, while algorithms for P4, P6, and P7 are presented in
Section 6.6.


5 Applications


An abstract strategy has been outlined for the visualization of strings. This section discusses some

general methods for applying this strategy to specific areas. Applications to molecular biology, text

and numerical sequences are also outlined.


5.1 General Methods


In order to apply this strategy successfully to actual data, it is important to first check that the
data conforms to the definition of strings provided in Section 1. If this is not the case, then it may

be possible to transform the data so that it satisfies the definition without losing vital information

in the process.

(1) If a string consists of characters which are chosen from a fixed alphabet of a small size
(< 50, say), then it is already in the correct format. For example, DNA sequences are made up
from an alphabet of 4 elements. English text is made up of an alphabet of 26 characters and some
special symbols.

(2) Otherwise, if a function, f, can be defined for each element in the alphabet such that:

a) The range of f can be determined and is of constant size and

b) f(element1) = f(element2) iff element1 and element2 are similar,
then the given string may be converted to another string which is obtained by applying f to each
element of the original string. The resulting string can now be input to the visualization routines.

For example, consider a sequence of objects which are chosen from a large (possibly infinite)
alphabet. Assume that a set of properties, P = {P1, P2, ..., Pm}, is associated with each object.
Suppose that patterns of property, Pi, of the sequence of objects are interesting (where Pi may
take on one of a constant, fixed number of values). Then, for the purposes of visualization, each
object in the sequence is replaced by the corresponding Pi value. This approach can be used with
the other properties as well. Some examples where this approach is useful are:










(1) Protein Sequences [2]: A protein sequence consists of a sequence of amino acids. While the
number of amino acids that could form a sequence is large, it is possible to place amino acids in
groups on the basis of physical properties such as hydrophobicity, acidity, polarity, etc. Amino acids
in the sequence may then be replaced by a symbol representing the groups to which they belong.

(2) Chronological sequences of Multidimensional Data [1]: Here, a number of measurements relating
to a particular scientific phenomenon are taken at regular intervals of time. The measurements for
each variable are classified as LOW, MEDIUM, or HIGH. Consequently, the sequence of
multidimensional data may be replaced by a sequence of symbols, L, M, H, which represent the values of a
particular variable.

Many applications would benefit by comparisons between two or more different strings (as
opposed to comparisons within the same string). This can also be supported by a simple extension
of our techniques.


5.2 Numerical Data


An important category of data is numerical data. These arise whenever properties of objects are
described by measurements. Numerical information, in general, is chosen from large alphabet sizes
which are determined by the accuracy of measurement required. Clearly, numerical sequences
cannot be directly input to our visualization system. This is remedied by determining the range
of values that a variable can take on. This range is then subdivided into a constant number of
subranges (this is essentially the same strategy used in [1]). Each value in the sequence is then
replaced by a symbol representing the subrange to which it belongs. The resulting sequence may
then be input to our visualization system. Consider a sequence of numbers which lie in the range
1-200. Assume that subranges have been defined as 1-20, 21-40, ..., 181-200, which are respectively
represented by the symbols a, b, ..., j. So, the sequence 7 142 63 94 6 148 69 becomes ahdeahd.
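
The subrange mapping can be written down in a few lines. The sketch below assumes integer values, equal-width subranges, and at most 26 subranges labelled with lowercase letters; the function name `bin_sequence` and its parameters are hypothetical.

```python
def bin_sequence(values, low=1, high=200, bins=10):
    """Replace each integer in [low, high] by the letter of its subrange."""
    span = high - low + 1
    if bins > 26 or span % bins != 0:
        raise ValueError("need at most 26 equal-width subranges")
    width = span // bins
    symbols = []
    for v in values:
        if not low <= v <= high:
            raise ValueError(f"{v} is outside {low}-{high}")
        symbols.append(chr(ord("a") + (v - low) // width))
    return "".join(symbols)

print(bin_sequence([7, 142, 63, 94, 6, 148, 69]))  # ahdeahd
```

The output string contains the repeated pattern ahd, which the string visualization routines would then highlight.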

Sequences of numbers, such as financial data, are usually studied by using graphs. We expect

that the techniques outlined here if used in conjunction with graphs could reduce the possibility

of overlooking important patterns in the data. This can be done by either appropriately coloring

pieces of the graph or by coloring a string which is aligned with the graph.

Often, comparisons are made not between the values of numbers in a sequence, but between the
increase/decrease in consecutive values. For example, in [5 20 15 75 90 85], 5 20 15 is not obviously
related to 75 90 85. However, the increases/decreases in values are identical: an increase of 15
followed by a decrease of 5. Information of this type may be obtained by transforming the string
by taking the difference between successive values before inputting it to the visualization system,
i.e., 15 -5 60 15 -5. Similar transformations may be used for percentage increases/decreases, second
order differences, etc.
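
As a small illustration of the difference transformation (the function name `first_differences` is hypothetical):

```python
def first_differences(values):
    """Differences between successive values of a numeric sequence."""
    return [b - a for a, b in zip(values, values[1:])]

print(first_differences([5, 20, 15, 75, 90, 85]))  # [15, -5, 60, 15, -5]
```

Applying the function to its own output gives second order differences. The resulting values would still need to be mapped onto a fixed alphabet, for instance by subranges as described earlier in this section, before being treated as a string.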


5.3 Molecular Biology


In molecular biology, RNA, DNA, and protein sequences are studied. Sequence comparison helps to
answer questions about evolution, structure and function in organisms, and the structural
configuration of individual RNA molecules [4]. Of particular importance are repeating patterns and their
relative positions [5]. [5] uses the scdawg data structure, which is described in the next section, to
analyze sequences. Our work improves upon [5] by suggesting more effective display methods as
well as by introducing more sophisticated analysis techniques involving prefix-suffix and subword
conflicts.

For example, let D1 and D2 be displayable entities and D1 be a subword of D2. If the fraction
(number of occurrences of D1 contained in D2)/(number of occurrences of D1) ≈ 1, then we can
infer that D1 usually occurs as a subword of D2, which could mean that D1 does not perform any
significant functions except as a subword of D2.

Suppose patterns P1 and P2 perform functions F1 and F2 in an organism. Then, if
(number of prefix-suffix conflicts between P1 and P2)/(min{number of occurrences of P1, number
of occurrences of P2}) ≈ 1, then we can infer that F1 and F2 are generally performed by the same
region and are therefore related in some way.


5.4 Textual Data


Structural information about text may be obtained by studying prefix-suffix and subword conflicts.
Information about the contexts in which certain phrases are used is provided by subword conflicts.

Information on the combination of phrases is provided by prefix-suffix conflicts. This information










can be used to identify anomalies in sentence structure and possibly identify the author of a text by
the structure. It can also be used to decipher text coded using sophisticated substitution ciphers
where patterns are substituted by other patterns.



6 Implementation Considerations


A string visualization system along the lines described above requires efficient algorithms for
problems P1-P14. These problems can be solved in optimal or near optimal time [3] by using the
Symmetric Compact Directed Acyclic Word Graph (scdawg) data structure [6, 7].


6.1 Directed Acyclic Word Graphs


An scdawg, SCD(S), corresponding to a string S is a directed acyclic graph defined by a set of
vertices, V(S), a set, R(S), of labeled directed edges called right extension (re) edges, and a set,
L(S), of labeled directed edges called left extension (le) edges. Each vertex of V(S) represents a
substring of S. Specifically, V(S) consists of a source (which represents the empty word, λ), a sink
(which represents S), and a vertex corresponding to each displayable entity of S.

Let de(v) denote the string represented by vertex v (v ∈ V(S)). Define the implication,
imp(S, α), of a string α in S to be the smallest superword of α in {de(v): v ∈ V(S)}, if such
a superword exists. Otherwise, imp(S, α) does not exist.

Re edges from a vertex, v1, are obtained as follows: for each letter, x, in Σ, if imp(S, de(v1)x)
exists and is equal to de(v2) = β de(v1) x γ, then there exists an re edge from v1 to v2 with label xγ.
If β is the empty string, then the edge is known as a prefix extension edge. Le edges from a vertex,
v1, are obtained as follows: for each letter, x, in Σ, if imp(S, x de(v1)) exists and is equal to de(v2)
= γ x de(v1) β, then there exists an le edge from v1 to v2 with label γx. If β is the empty string,
then the edge is known as a suffix extension edge.

Figure 7 shows V(S) and R(S) corresponding to S = cdefabcgabcde. abc, cde, and c are the
displayable entities of S. There are two outgoing re edges from the vertex representing abc. These
edges correspond to x = d and x = g. imp(S, abcd) = imp(S, abcg) = S. Consequently, both edges
are incident on the sink. There are no edges corresponding to the other letters of the alphabet as
imp(S, abcx) does not exist for x ∈ {a, b, c, e, f}.

[Figure 7: Scdawg for S = cdefabcgabcde (L(S) not shown)]


The space required for SCD(S) is O(n) and the time needed to construct it is O(n) [6, 7]. While
we have defined the scdawg data structure for a single string, S, it can be extended to represent a
set of strings.
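
The definitions of the implication and of re edges can be exercised with a brute-force sketch. This is not the linear-time construction of [6, 7]. The function names are hypothetical, the vertex strings are supplied by hand, ties in "smallest superword" are broken by list order, and the first occurrence of de(v1)x inside de(v2) is assumed to be the one that defines the edge label.

```python
def implication(vertex_strings, alpha):
    """imp(S, alpha): the shortest vertex string containing alpha, or None."""
    supers = [v for v in vertex_strings if alpha in v]
    return min(supers, key=len) if supers else None

def right_extension_edges(vertex_strings, alphabet, v1):
    """Map each letter x to (target string, edge label, is_prefix_extension)."""
    edges = {}
    for x in alphabet:
        v2 = implication(vertex_strings, v1 + x)
        if v2 is not None:
            k = v2.find(v1 + x)      # length of the part of v2 before v1
            edges[x] = (v2, v2[k + len(v1):], k == 0)
    return edges

S = "cdefabcgabcde"
V = ["", "c", "abc", "cde", S]       # source, displayable entities, sink
edges = right_extension_edges(V, sorted(set(S)), "abc")
print(sorted(edges))                 # ['d', 'g']
```

Both edges leaving abc lead to the sink S, in agreement with the discussion of Figure 7, and no edge exists for the letters a, b, c, e, and f.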


6.2 Computing Subword Conflicts Efficiently


6.2.1 Representing Subword Conflicts


Consider P3, the problem of finding all subword conflicts in string S. Let k_s be the number of
subword conflicts in S. Any algorithm to solve this problem requires (i) O(n) time to read in the
input string and (ii) O(k_s) time to output all subword conflicts. So, O(n + k_s) is a lower bound on
the time complexity for this problem. For the string S = a^n,
$k_s = n^4/24 + n^3/4 - 13n^2/24 - 3n/4 + 1 = O(n^4)$.
This is an upper bound on the number of conflicts as the maximum number of substring
occurrences is O(n^2) and in the worst case, all occurrences conflict with each other. In this section,
a compact method for representing conflicts is presented. Let k_sc be the size of this representation.
k_sc is $n^3/6 + n^2/2 - 5n/3$, or O(n^3), for a^n. Compaction never increases the size of the output and
may yield up to a factor of n reduction, as in the example. The compaction method is described
below.
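
The two closed forms for S = a^n, namely k_s = n^4/24 + n^3/4 - 13n^2/24 - 3n/4 + 1 and k_sc = n^3/6 + n^2/2 - 5n/3, can be checked by direct summation. The sketch below rests on one reading of the example, stated here as an assumption: the displayable entities of a^n are a, aa, ..., a^(n-1); the entity a^i occurs n - i + 1 times in S; and a^j occurs i - j + 1 times inside a^i.

```python
from fractions import Fraction

def conflict_counts(n):
    """(k_s, k_sc) for S = a^n, summed over every superword a^i, 2 <= i <= n-1."""
    k_s = k_sc = 0
    for i in range(2, n):
        f_i = n - i + 1                                # occurrences of a^i in S
        inner = sum(i - j + 1 for j in range(1, i))    # shorter entities inside a^i
        k_s += f_i * inner                             # one conflict per pair
        k_sc += f_i + inner                            # compact list sizes
    return k_s, k_sc

def closed_forms(n):
    """The closed-form expressions, evaluated exactly."""
    n = Fraction(n)
    k_s = n**4 / 24 + n**3 / 4 - 13 * n**2 / 24 - 3 * n / 4 + 1
    k_sc = n**3 / 6 + n**2 / 2 - 5 * n / 3
    return k_s, k_sc

print(conflict_counts(5))  # (41, 25)
```

Under these assumptions the direct sums and the closed forms agree, for example 41 conflicts against a compact size of 25 for n = 5.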

Consider S = abcdbcgabcdbchbc. The displayable entities are D1 = abcdbc and D2 = bc. The
ending positions of D1 are 6 and 13 while those of D2 are 3, 6, 10, 13, and 16. A list of the subword
conflicts between D1 and D2 can be written as: {(6,3), (6,6), (13,10), (13,13)}. The first element
of each ordered pair is the last position of the instance of the superstring (here, D1) involved in the
conflict; the second element of each ordered pair is the last position of the instance of the substring
(here, D2) involved in the conflict.

The cardinality of the set is the number of subword conflicts between D1 and D2. This is
given by: frequency(D1) * number of occurrences of D2 in D1, where frequency(D1) is the number of
occurrences of D1 in S. Since each conflict is represented by an ordered pair, the size of the output
is 2(frequency(D1) * number of occurrences of D2 in D1).

Observe that the occurrences of D2 in D1 are in the same relative positions in all instances of D1.
It is therefore possible to write the list of subword conflicts between D1 and D2 as: (6,13):(0,-3).
The first list gives all the occurrences in S of the superstring (D1), and the second gives the relative
positions of all the occurrences of the substring (D2) in the superstring (D1) from the right end
of D1. The size of the output is now: frequency(D1) + number of occurrences of D2 in D1. This is
more economical than our earlier representation.
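
Both representations can be reproduced for the example with a brute-force sketch. It is not Algorithm A3; the function names are hypothetical, and positions are 1-based end positions as in the text.

```python
def find_ends(s, pattern):
    """1-based end positions of every occurrence of pattern in s."""
    return [i + len(pattern) for i in range(len(s) - len(pattern) + 1)
            if s.startswith(pattern, i)]

def subword_conflicts(s, sup, sub):
    """Return (explicit pairs, compact form) for a superword and a subword."""
    sup_ends = find_ends(s, sup)
    sub_ends = find_ends(s, sub)
    # An occurrence of sub ending at e2 lies inside the occurrence of sup
    # ending at e1 iff it neither starts before sup nor ends after it.
    explicit = [(e1, e2) for e1 in sup_ends for e2 in sub_ends
                if e1 - len(sup) + len(sub) <= e2 <= e1]
    # Relative positions of sub inside sup, measured from the right end of sup.
    relative = sorted((e - len(sup) for e in find_ends(sup, sub)), reverse=True)
    return explicit, (sup_ends, relative)

explicit, compact = subword_conflicts("abcdbcgabcdbchbc", "abcdbc", "bc")
print(explicit)  # [(6, 3), (6, 6), (13, 10), (13, 13)]
print(compact)   # ([6, 13], [0, -3])
```

The explicit list holds 2 * (2 * 2) = 8 numbers, while the compact form holds 2 + 2 = 4.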

In general, a substring, Di, of S will have conflicts with many instances of a number of displayable
entities (say, Dj, Dk,..., Dz) of which it (Di) is the superword. We would then write the conflicts
of Di as:



Here, the /i's represent all the occurrences of Di in S; the I's, I's,..., I's represent the relative
positions of all the occurrences of Dj, Dk,..., D in Di. One such list will be required for each
displayable entity that contains other displayable entities as subwords. The following qualities are
easily obtained:
Size of Compact Representation = $\sum_{D_i \in D} \left( f_i + \sum_{D_j \in D_i^s} r_{ij} \right)$.

Size of Original Representation = $2 \sum_{D_i \in D} \left( f_i \cdot \sum_{D_j \in D_i^s} r_{ij} \right)$.

$f_i$ is the frequency of Di (only Di's that have conflicts are considered). $r_{ij}$ is the frequency of Dj
in one instance of Di. $D$ represents the set of all displayable entities of S. $D_i^s$ represents the set of
all displayable entities that are subwords of Di.
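As a quick numerical check of the two size formulas, the following Python sketch evaluates both for a list of superwords. The input layout (one pair of a frequency and a list of per-subword frequencies for each superword) and the function name are assumptions made for illustration only.

```python
def representation_sizes(entities):
    """Return (compact_size, original_size).

    entities: one (f_i, [r_ij, ...]) pair for each displayable entity D_i that
    has subword conflicts; f_i is its frequency in S and each r_ij is the
    frequency of a subword D_j inside a single instance of D_i.
    """
    compact = sum(f + sum(rs) for f, rs in entities)
    original = 2 * sum(f * sum(rs) for f, rs in entities)
    return compact, original


# One superword occurring twice, containing one subword twice.
print(representation_sizes([(2, [2])]))                  # (4, 8)
# Adding a second superword with frequency 10 and two subwords.
print(representation_sizes([(2, [2]), (10, [3, 4])]))    # (21, 148)
```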


6.2.2 Computing Subword Conflicts


Algorithm A3 of Figure 8 computes the subword conflicts of S. These are represented using the
scheme described in Section 6.2.1.

SG(S, v), v ∈ V(S), is defined as the subgraph of SCD(S) which consists of the set of vertices,
SV(S, v) ⊆ V(S), which represent displayable entities that are subwords of de(v), and the set
SE(S, v) of all right extension (re) and suffix extension edges that connect any pair of vertices in SV(S, v). Define
SGR(S, v) as SG(S, v) with the directions of all the edges in SE(S, v) reversed.

The subword conflicts are computed for precisely those displayable entities which have subword
displayable entities. Lines 4 to 6 of Algorithm A3 determine whether de(v) has subword displayable
entities. Procedure GetSubwords(v), which computes the subword conflicts of de(v), is invoked if
v.subword is true.

Procedure Occurrences(S, v, 0) (line 2 of GetSubwords) computes the occurrences of de(v) in S
and places them in v.list. Procedure SetUp in line 5 traverses SGR(S, v) and initializes fields in
each vertex of SGR(S, v) so that a reverse topological traversal of SG(S, v) may be subsequently
performed. Procedure SetSuffixes in line 6 marks vertices whose displayable entities are suffixes of
de(v). A list of relative occurrences, sublist, is associated with each vertex, x, in SG(S, v). x.sublist
represents the relative positions of de(x) in an occurrence of de(v). If de(x) is a suffix of de(v)
then x.sublist is initialized with the element, 0. The remaining elements of x.sublist are computed
from the sublist fields of vertices, w, in SG(S, v) such that a right extension edge goes from x to w.
Consequently, w.sublist must be computed before x.sublist. This is achieved by traversing SG(S, v)
in reverse topological order [8].
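The propagation of relative positions can be sketched in Python as follows. This is only an illustration under stated assumptions: the graph is a small hand-built stand-in for SG(S, v) rather than one derived from a real scdawg, vertices are identified by their strings, and the names `relative_positions`, `re_edges`, and `suffixes` are ours. The traversal order is obtained from the standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def relative_positions(re_edges, suffixes, root):
    """Compute x.sublist for every vertex x of a toy SG(S, v).

    re_edges: dict mapping x to a list of (w, label) right extension edges,
              meaning de(w) = de(x) + label.
    suffixes: vertices whose string is a suffix of de(root).
    root:     the vertex v whose subword conflicts are being computed.
    """
    # Treat each edge target w as a prerequisite of x, so that w.sublist is
    # complete before x.sublist is computed (reverse topological order).
    deps = {x: {w for w, _ in outs} for x, outs in re_edges.items()}
    deps.setdefault(root, set())
    sublist = {}
    for x in TopologicalSorter(deps).static_order():
        if x == root:
            sublist[x] = {0}
            continue
        positions = {0} if x in suffixes else set()
        for w, label in re_edges.get(x, []):
            positions |= {l - len(label) for l in sublist[w]}
        sublist[x] = positions
    return sublist


# Hypothetical example: de(v) = "abcdeab", with subwords "abcd" and "ab".
edges = {"ab": [("abcd", "cd")], "abcd": [("abcdeab", "eab")]}
print(relative_positions(edges, suffixes={"ab"}, root="abcdeab"))
```

In this example "ab" ends at relative positions 0 (as a suffix) and -5 (as a prefix of "abcd", which itself ends at -3).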


Theorem 1 Algorithm A3 takes $O(n + k_s)$ time and space and is therefore optimal [3].





















Algorithm A3
1  begin
2    for each vertex, v, in SCD(S) do
3    begin
4      v.subword = false;
5      for all vertices, u, such that a right or suffix extension edge, < u, v >, is incident on v do
6        if u ≠ source then v.subword = true;
7    end
8    for each vertex, v, in SCD(S) such that v ≠ sink and v.subword is true do
9      GetSubwords(v);
10 end

Procedure GetSubwords(v)
1  begin
2    Occurrences(S, v, 0);
3    output(v.list);
4    v.sublist = {0};
5    SetUp(v);
6    SetSuffixes(v);
7    for each vertex, x (≠ source), in reverse topological order of SG(S, v) do
8    begin
9      if de(x) is a suffix of de(v) then x.sublist = {0} else x.sublist = {};
10     for each vertex, w, in SG(S, v) on which an re edge, e, from x is incident do
11     begin
12       for each element, l, in w.sublist do
13         x.sublist = x.sublist ∪ {l - |label(e)|};
14     end;
15     output(x.sublist);
16   end;
17 end



Figure 8: Optimal algorithm to compute all subword conflicts










Algorithm A2A3
Step 1: Obtain a list of all occurrences of all displayable entities in the string. This list is obtained by first
computing the lists of occurrences corresponding to each vertex of the scdawg (except the source and the
sink) and then concatenating these lists.
Step 2: Sort the list of occurrences using the start positions of the occurrences as the primary key (increasing
order) and the end position as the secondary key (decreasing order). This is done using radix sort.
Step 3:

for i := 1 to (number of occurrences) do
begin
  j := i + 1;
  while (j ≤ number of occurrences) and (lastpos(occ_i) ≥ firstpos(occ_j)) do
  begin
    if (lastpos(occ_i) ≥ lastpos(occ_j))
      then occ_i is a superword of occ_j
      else (occ_i, occ_j) have a prefix-suffix conflict;
    j := j + 1;
  end;
end;



Figure 9: A simple algorithm for computing conflicts
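Step 2 of Figure 9 can be carried out in time linear in the string length plus the number of occurrences with two stable bucket passes, least significant key first. The Python sketch below is one way to do this; the representation of an occurrence as a `(start, end)` tuple of 1-based positions and the function name are assumptions made for illustration.

```python
def radix_sort_occurrences(occurrences, n):
    """Sort (start, end) pairs by start ascending, then end descending.

    n is the length of the string; positions are 1-based and inclusive.
    """
    # Pass 1: secondary key (end position), in decreasing order.
    buckets = [[] for _ in range(n + 1)]
    for occ in occurrences:
        buckets[occ[1]].append(occ)
    by_end_desc = [occ for end in range(n, 0, -1) for occ in buckets[end]]

    # Pass 2: primary key (start position), increasing. Appending to buckets
    # in the order produced by pass 1 keeps the pass stable.
    buckets = [[] for _ in range(n + 1)]
    for occ in by_end_desc:
        buckets[occ[0]].append(occ)
    return [occ for start in range(1, n + 1) for occ in buckets[start]]


print(radix_sort_occurrences([(4, 5), (1, 2), (7, 9), (1, 5), (4, 8)], 9))
# [(1, 5), (1, 2), (4, 8), (4, 5), (7, 9)]
```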


6.3 Computing Prefix Suffix Conflicts Efficiently


As with subword conflicts, the lower bound for the problem of computing prefix-suffix conflicts is
$\Omega(n + k_p)$, where $k_p$ is the number of prefix-suffix conflicts in S. For $S = a^n$, $k_p$ is a polynomial of
degree four in n with leading term $n^4/24$, so $k_p = O(n^4)$, which is also the upper bound on $k_p$. Unlike
subword conflicts, it is not possible to compact the output representation.


Theorem 2 All prefix-suffix conflicts in S can be computed in O(n + kp) space and time, which is

optimal [3].



6.4 Alternative Algorithms


In this section, an alternative solution for computing subword and prefix-suffix conflicts is presented.

The solution is relatively simple and has competitive running times. However, it lacks the flexibility

required to solve many of the problems listed in Section 4. The algorithm is presented in Figure 9.
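The scan of Figure 9 can be sketched in Python as below. The sketch assumes that each occurrence is a `(start, end)` pair of 1-based inclusive positions, treats two occurrences as overlapping when they share at least one position, and uses Python's built-in sort in place of the radix sort of Step 2; the function name is ours.

```python
def find_conflicts(occurrences):
    """Return (subword_conflicts, prefix_suffix_conflicts) as lists of pairs."""
    # Start ascending, end descending: a superword precedes the subwords that
    # begin at the same position.
    occs = sorted(occurrences, key=lambda o: (o[0], -o[1]))
    subword, prefix_suffix = [], []
    for i in range(len(occs)):
        end_i = occs[i][1]
        j = i + 1
        # Every occurrence that starts no later than end_i overlaps occs[i].
        while j < len(occs) and end_i >= occs[j][0]:
            if end_i >= occs[j][1]:
                subword.append((occs[i], occs[j]))        # occs[j] lies inside occs[i]
            else:
                prefix_suffix.append((occs[i], occs[j]))  # partial overlap
            j += 1
    return subword, prefix_suffix


sub, ps = find_conflicts([(4, 5), (1, 2), (7, 9), (1, 5), (4, 8)])
print(sub)  # [((1, 5), (1, 2)), ((1, 5), (4, 5)), ((4, 8), (4, 5))]
print(ps)   # [((1, 5), (4, 8)), ((4, 8), (7, 9))]
```

The inner loop runs once per reported conflict plus once per occurrence, which is why the running time after sorting is proportional to the output size.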










[Plot: number of prefix-suffix conflicts (logarithmic axis) vs. size of string (100 to 2000), one curve per alphabet size a.]

Figure 10: Graph of Number of Prefix Suffix Conflicts vs String Size


6.5 Implementation


The algorithms to compute all prefix-suffix and subword conflicts (i.e., algorithms from Sections 6.2,
6.3, and 6.4) were implemented on a SUN SPARCstation 1 in GNU C++. Fifty randomly generated
strings of lengths ranging from 100 to 2000, from alphabets whose size ranged from 2 to 50, were
input to our algorithms. Statistical information such as the number of vertices in the scdawg and the
number of prefix-suffix and subword conflicts was obtained. The run times of the algorithms
were also recorded.

(i) Figure 10 shows Number of prefix-suffix conflicts vs String size. There is one curve for each

alphabet size. The plot illustrates that the number of prefix-suffix conflicts increases with string

size and decreases with alphabet size. The graph for subword conflicts is similar.

(ii) Figure 11 shows Time per prefix-suffix conflict (μs) vs String size. There is one curve for each

alphabet size. It illustrates that the time per conflict generally decreases with increasing string size

and increases with alphabet size. The graph for subword conflicts is similar.

(iii) The factor by which the compact representation of subword conflicts is smaller than the

fully expanded representation varies from 2 to 9. It increases with string size and decreases with

alphabet size.

(iv) Figure 12 shows the size of the largest displayable entity for each combination of alphabet size
and string size. It shows that only displayable entities of small lengths occur in random strings. (In
practice, we would like to be able to distinguish between displayable entities that occur randomly
and those that do not in a given string. This can be done by selecting those displayable entities
whose frequency is large compared to other displayable entities of the same length in a random
string.)

[Plot: time per prefix-suffix conflict (0 to 800 μs) vs. size of string (100 to 2000), one curve per alphabet size a.]

Figure 11: Graph of Time per Prefix Suffix Conflict vs String Size

Size of      Size of String
Alphabet     100    200    500    1000    2000
    2         11     12     15      18      19
    5          5      6      7       8       9
   10          3      4      5       6       6
   15          3      3      4       5       5
   20          3      3      4       4       5
   25          2      3      3       4       4
   50          2      2      3       3       4

Figure 12: Lengths of Largest Displayable Entity for Random Strings


6.6 Display Algorithms


A list of all occurrences of a displayable entity may be obtained from the scdawg data structure
described earlier. A list of all conflicts between displayable entities may be obtained by operations
on the scdawg. This information is then used to assign numeric weights to each occurrence of each
displayable entity.

In this section we shall discuss algorithms to implement some of the refined display models of

Section 3. Specifically, we shall consider models 1, 2(b), and 3 under the Automatic mode (i.e.,

problems P4, P6, and P7 of Section 4). Our algorithms also apply to the Semi Automatic mode.

Problem P4 can be reduced to the single pair, longest path problem in a directed acyclic graph
as follows: let vertex V_i, 1 ≤ i ≤ n, correspond to the position between the i'th and (i + 1)'th
characters in the string. V_0 corresponds to the position preceding the first character in the string.
For each occurrence S_ij of each displayable entity, we create an edge from V_{i-1} to V_j. The weight
associated with this edge is exactly the weight of the occurrence it represents. Finally, for each
pair (V_i, V_{i+1}), 0 ≤ i < n, of vertices such that an edge from V_i to V_{i+1} does not already exist, we
create an edge from V_i to V_{i+1} of weight 0.

Figure 13 shows the directed acyclic graph corresponding to the string S = abcicdefcdegabchabcde
of Figure 2, assuming that the weights corresponding to each occurrence of abc, cde, and c are 4,
3, and 2 respectively. The longest path from V_0 to V_n in the dag is V_0 → V_3 → V_4 → V_7 → V_8 →
V_11 → V_12 → V_15 → V_16 → V_19 → V_20 → V_21. All edges on this path with non-zero weight represent
occurrences of displayable entities that are to be highlighted. Here V_0 → V_3, V_12 → V_15, and
V_16 → V_19 correspond to occurrences < 1,3 >, < 13,15 >, and < 17,19 > of abc, while V_4 → V_7 and
V_8 → V_11 correspond to occurrences < 5,7 > and < 9,11 > of cde. The length of the longest path
(here, 18) represents the total weight of the display.

Figure 13: Dag corresponding to abcicdefcdegabchabcde

Algorithm A4 of Figure 14 solves problem P4. A[0..n] is an array of integers, where A[i] represents
the length of the longest path from V_0 to V_i detected up to that point of time. All elements of A are
initialized to 0. The auxiliary array, T[1..n], stores the occurrences that have been chosen for display.
The array delist contains all the occurrences of all the displayable entities in the string. Each
element of delist contains three fields: start, end, and weight, which respectively represent the start
position, the end position, and the numeric weight of the occurrence. It is assumed that delist is
sorted in increasing order of end. The vertices are processed in topological order (i.e., V_0, V_1, ...,
V_n). When a vertex, V_j, is being processed, each vertex, V_i (i < j), preceding it has associated with
it the cost of the longest path from V_0 to V_i. The cost of the longest path from V_0 to V_j is then
determined by examining each of the incoming edges to V_j.

The complexity of Algorithm A4 is O(n + e), where e represents the size of delist. So, Algorithm A4
is optimal. Note that sorting delist on end position can also be accomplished in O(n + e)
time using radix sort.
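The longest-path computation can be sketched in Python as follows. This is an illustrative rendering rather than the original implementation: occurrences are assumed to be `(start, end, weight)` tuples with 1-based inclusive positions, Python's built-in sort stands in for the radix sort, and the helper `occurrences_of` as well as the traceback that recovers the chosen occurrences are our additions.

```python
def occurrences_of(s, pattern):
    """All (start, end) occurrences of pattern in s, 1-based and inclusive."""
    found, k = [], s.find(pattern)
    while k != -1:
        found.append((k + 1, k + len(pattern)))
        k = s.find(pattern, k + 1)
    return found


def max_weight_display(n, delist):
    """Return (total weight, chosen occurrences) for a string of length n."""
    by_end = sorted(delist, key=lambda occ: occ[1])
    A = [0] * (n + 1)        # A[i]: weight of the longest path from V_0 to V_i
    T = [None] * (n + 1)     # T[i]: occurrence ending at i on that path, if any
    j = 0
    for i in range(1, n + 1):
        A[i] = A[i - 1]      # zero-weight edge from V_{i-1} to V_i
        while j < len(by_end) and by_end[j][1] == i:
            start, _, weight = by_end[j]
            if A[i] < A[start - 1] + weight:
                A[i] = A[start - 1] + weight
                T[i] = by_end[j]
            j += 1
    chosen, i = [], n
    while i > 0:             # walk the longest path backwards
        if T[i] is None:
            i -= 1
        else:
            chosen.append(T[i][:2])
            i = T[i][0] - 1
    return A[n], chosen[::-1]


S = "abcicdefcdegabchabcde"
weights = {"abc": 4, "cde": 3, "c": 2}
delist = [(s, e, w) for p, w in weights.items() for s, e in occurrences_of(S, p)]
print(max_weight_display(len(S), delist))
# (18, [(1, 3), (5, 7), (9, 11), (13, 15), (17, 19)])
```

The result agrees with the longest path of weight 18 described for the dag of Figure 13.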

Problem P6 of Section 4 is solved by Algorithm A6 of Figure 15 using the greedy method.
endpoints[1..2e] is a list of both the start positions and the end positions of all occurrences of all
displayable entities in the string. Each element of endpoints contains three fields: position, type,
and id. position contains the position of the particular endpoint in the string; type determines
whether the endpoint is a "start" position or an "end" position; id uniquely identifies the occurrence
corresponding to the endpoint. It is assumed that endpoints is sorted in increasing order on primary
key, position, and secondary key, type ("start" < "end"). LineStack is a stack that contains line
numbers (or partition numbers) which are currently available. On completion of the algorithm,
line[i] contains the line or the particular copy of the string in which the i'th occurrence is to be
highlighted for 1 ≤ i ≤ e.

Proof of Correctness: From the algorithm the following assertions can be made:
(1) The final value of max (fmax) represents the number of partitions of S.
(2) At least one position in the string is covered by fmax occurrences.
Statement (2) indicates that fmax is the smallest possible number of partitions that satisfy the problem.
Statement (1) indicates that a partition of that size is in fact obtained.

Algorithm A4
begin
  A[0] := 0; j := 1;
  for i := 1 to n do
  begin
    A[i] := A[i-1]; {zero-weight edge from V_{i-1} to V_i}
    while (j ≤ e) and (delist[j].end = i) do
    begin
      if A[i] < A[delist[j].start-1] + delist[j].weight
      then
      begin
        A[i] := A[delist[j].start-1] + delist[j].weight;
        T[i] := j;
      end;
      j := j + 1
    end;
  end;
end.

Figure 14: Algorithm for P4


Algorithm A6 consumes O(n + e) time, which is optimal. Note that sorting endpoints can also

be accomplished in O(n + e) time, if radix sort is used.
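The greedy line assignment can be sketched in Python as follows. The sketch assumes occurrences are `(start, end)` pairs with inclusive positions, uses the built-in sort rather than radix sort, and encodes "start" as 0 and "end" as 1 so that a start sorts before an end at the same position; these encodings and the function name are ours.

```python
def assign_lines(occurrences):
    """Return (line, fmax): a 1-based line number per occurrence and the line count."""
    endpoints = []
    for occ_id, (start, end) in enumerate(occurrences):
        endpoints.append((start, 0, occ_id))   # 0 = "start"
        endpoints.append((end, 1, occ_id))     # 1 = "end"
    endpoints.sort()

    line = [0] * len(occurrences)
    line_stack = []            # line numbers that are currently available
    max_lines = current = 0
    for _, kind, occ_id in endpoints:
        if kind == 0:
            current += 1
            if current <= max_lines:
                line[occ_id] = line_stack.pop()   # reuse a freed line
            else:
                max_lines += 1                    # open a new line
                line[occ_id] = max_lines
        else:
            current -= 1
            line_stack.append(line[occ_id])       # the line becomes available again
    return line, max_lines


print(assign_lines([(1, 3), (3, 3), (5, 7), (3, 5)]))
# ([1, 2, 2, 3], 3)
```

Position 3 is covered by three of the four occurrences, so no assignment can use fewer than three lines.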

We outline two solutions to problem P7. In the first, we assume that all occurrences of the

same displayable entity are assigned the same numeric weight. In the second, we do not make this

assumption.

Algorithm A7(a) of Figure 16 solves the first version of P7. The second version of P7 may be
solved by executing steps a-c for each occurrence of each displayable entity as shown in Algorithm
A7(b) of Figure 17. Both solutions involve a traversal of SCD(S) in topological order. For each
occurrence, an optimal selection of its subwords is chosen. The weight of the occurrence is then
obtained by adding the sum of the weights of the chosen subwords to its weight. This is done
because an occurrence is highlighted along with the chosen subwords. Algorithm A7(a) consumes
O(n^3) time, while Algorithm A7(b) consumes O(n^4) time.
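Steps a to c of the first version can be illustrated with the Python sketch below. It is a simplification under several assumptions: displayable entities are given as plain strings with one weight each, subword occurrences are found by direct string search rather than by Algorithm A3, and processing entities in order of increasing length stands in for the topological traversal of SCD(S), since a proper subword is always shorter than its superword. All names are ours.

```python
def best_nonoverlapping(length, intervals):
    """Maximum total weight of mutually nonoverlapping (start, end, weight)
    intervals inside a string of the given length (1-based, inclusive)."""
    by_end = {}
    for start, end, weight in intervals:
        by_end.setdefault(end, []).append((start, weight))
    best = [0] * (length + 1)
    for i in range(1, length + 1):
        best[i] = best[i - 1]
        for start, weight in by_end.get(i, []):
            best[i] = max(best[i], best[start - 1] + weight)
    return best[length]


def nested_weights(weights):
    """Reset each entity's weight to include its best selection of subwords."""
    result = dict(weights)
    for v in sorted(weights, key=len):           # subwords before superwords
        intervals = []
        for u in weights:                        # step a: subwords inside v
            if len(u) < len(v):
                k = v.find(u)
                while k != -1:
                    intervals.append((k + 1, k + len(u), result[u]))
                    k = v.find(u, k + 1)
        # steps b and c: pick the best nonoverlapping subset and add its weight
        result[v] += best_nonoverlapping(len(v), intervals)
    return result


print(nested_weights({"abc": 4, "cde": 3, "c": 2}))
# {'abc': 6, 'cde': 5, 'c': 2}
```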
























Algorithm A6
begin
  max := 0; current := 0;
  for i := 1 to 2e do
  begin
    x := endpoints[i];
    if (x.type = "start") then
    begin
      current := current + 1;
      if (current ≤ max) then
      begin
        CurrentLine := top(LineStack);
        pop(LineStack);
      end
      else
      begin
        max := max + 1;
        CurrentLine := max;
      end;
      line[x.id] := CurrentLine;
    end
    else {x.type = "end"}
    begin
      current := current - 1;
      return line[x.id] to LineStack;
    end;
  end;
  fmax := max;
end.



Figure 15: Algorithm for P6
















Algorithm A7(a)
begin
  for each vertex, v, of SCD(S), in topological order do
  begin
    Step a. Compute the relative positions of all subword displayable entities
            in a single instance of de(v) using Algorithm A3 of Section 6.2.
    Step b. Choose a mutually nonoverlapping subset from the set of subwords
            of de(v) (obtained in step (a) above) so that the sum of their weights is
            maximized. This is achieved by an algorithm similar to A4.
    Step c. Reset the numeric weight of de(v) by adding to it the
            total weight of the configuration obtained in step (b).
  end;
end.



Figure 16: Algorithm for P7, same weights










Algorithm A7(b)
begin
  for each vertex, v, of SCD(S), in topological order do
  begin
    for each occurrence < i,j > of de(v) do
    begin
      Step a. Compute the occurrences of all subword displayable entities in < i,j >.
      Step b. Choose a mutually nonoverlapping subset from the set of occurrences
              (obtained in step (a) above) so that the sum of their weights is maximized.
              This is achieved by an algorithm similar to A4.
      Step c. Reset the numeric weight of < i,j > by adding to it the total weight
              of the configuration obtained in step (b).
    end;
  end;
end.



Figure 17: Algorithm for P7, different weights
























Figure 18: Circular string


7 Circular String Visualization


In this section, we consider the problem of circular string visualization. Section 7.2 mentions some

of its applications while Section 7.3 outlines the algorithms.


7.1 Problem Definition


As with the linear string, we provide a specification for the circular string visualization problem:


1. Structure of Data to be Visualized: A circular string of length n whose characters are chosen

from a fixed alphabet Σ of constant size. Figure 18 shows an example circular string of size

8.

2. Structure of Patterns: A linear substring of the circular string of length less than n. For

example, abc and ceab are patterns in Figure 18.

3. Maximality of Patterns: The definition of maximality of a pattern in a circular string is

similar to that of a linear string. I.e., a pattern is said to be maximal iff its occurrences in

the circular string are not all preceded by the same character nor all followed by the same

character.

4. Measure of Similarity (MS): If two patterns are identical, then MS = 1. Otherwise, MS =

0.










5. Display Model: A maximal pattern in a circular string is called a displayable entity if it occurs

at least twice in the circular string. In the example string of Figure 18, abc is a displayable

entity. All instances of the same displayable entity are highlighted in the same color. As

with linear strings, we encounter the problem of prefix-suffix and subword conflicts. Similar

techniques are used to overcome these.
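The definitions above can be checked on small inputs with a brute-force Python sketch. It is not the scdawg-based method of this paper; it simply enumerates every linear pattern of length less than n and applies the maximality and frequency conditions directly. The example string is a hypothetical one chosen for illustration, and the function name is ours.

```python
def circular_displayable_entities(s):
    """Map each displayable entity of the circular string s to its start indices.

    Brute force: enumerates all O(n^2) patterns, so suitable for small n only.
    """
    n = len(s)
    starts_of = {}
    for length in range(1, n):                 # patterns have length < n
        for start in range(n):
            pattern = "".join(s[(start + k) % n] for k in range(length))
            starts_of.setdefault(pattern, []).append(start)

    entities = {}
    for pattern, starts in starts_of.items():
        if len(starts) < 2:                    # must occur at least twice
            continue
        before = {s[(st - 1) % n] for st in starts}
        after = {s[(st + len(pattern)) % n] for st in starts}
        # Maximal: occurrences are neither all preceded by the same character
        # nor all followed by the same character.
        if len(before) > 1 and len(after) > 1:
            entities[pattern] = starts
    return entities


print(circular_displayable_entities("abcdabce"))   # {'abc': [0, 4]}
```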


7.2 Applications


Circular strings may be used to represent circular genomes [5] such as G4 and φX174. The detection
and analysis of patterns in genomes helps to provide insights into the evolution, structure, and
function of organisms. [5] analyzes G4 and φX174 by linearizing them and then constructing their
scdawg. Our work improves upon [5] by:

(i) analyzing circular strings without risking the "loss" of patterns.

(ii) extending the analysis and visualization techniques presented for linear strings to circular
strings.

Circular strings in the form of chain codes are also used to represent closed curves in computer

vision [9]. The objects of Figure 19(a) are represented in chain code as follows:

(1) Arbitrarily choose a pixel through which the curve passes. In the diagram, the starting pixels

for the chain code representation of objects 1 and 2 are marked by arrows.

(2) Traverse the curve in the clockwise direction. At each move from one pixel to the next, the

direction of the move is recorded according to the convention shown in Figure 19(b).

Objects 1 and 2 are represented by 1122102243244666666666 and 6666666611 .' -.'." .', ,,'6,

respectively. The alphabet is {0, 1, 2, 3, 4, 5, 6, 7} which is fixed and of constant size (8) and

therefore satisfies the definitions of Section 7.1. We may now use our visualization techniques

to compare the two objects. For example, our methods would show that objects 1 and 2 share

the segments S1 and S2 (Figure 19(c)) corresponding to 0224 and 2446666666661122, respectively.

Information on other common segments would also be available. The techniques of this paper make

it possible to detect all patterns irrespective of the starting pixels chosen for the two objects.
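The independence from the starting pixel can be illustrated with a simple Python check that looks for a pattern in a circular string by extending the string with its own prefix. This is only a naive search used for illustration, not the scdawg-based technique; the function name is ours, and only the chain code of object 1 and the segments S1 and S2 quoted above are used.

```python
def circular_occurrences(circular, pattern):
    """Start indices (0-based) at which pattern occurs in the circular string."""
    n = len(circular)
    if not 0 < len(pattern) < n:
        return []
    # Append the first len(pattern) - 1 characters so that occurrences which
    # wrap around the end of the linearized string are also found.
    extended = circular + circular[:len(pattern) - 1]
    return [i for i in range(n) if extended.startswith(pattern, i)]


object1 = "1122102243244666666666"
s1, s2 = "0224", "2446666666661122"
print(circular_occurrences(object1, s1))   # [5]
print(circular_occurrences(object1, s2))   # [10], wrapping past the end

# Choosing a different starting pixel rotates the chain code; the same
# segments are still found, at shifted indices.
rotated = object1[7:] + object1[:7]
print(circular_occurrences(rotated, s1))   # [20]
print(circular_occurrences(rotated, s2))   # [3]
```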

Circular strings may also be used to represent polygons in computer graphics and computational

geometry [10]. Figure 20 shows a polygon which is represented by the following alternating sequence


























