
Citation 
 Permanent Link:
 https://ufdc.ufl.edu/UFE0011379/00001
Material Information
 Title:
 Optimization and Information Retrieval Techniques for Complex Networks
 Copyright Date:
 2008
Subjects
 Subjects / Keywords:
 Algorithms ( jstor )
Connectivity ( jstor ) Datasets ( jstor ) Diameters ( jstor ) Electrodes ( jstor ) Electroencephalography ( jstor ) Graph theory ( jstor ) Mining ( jstor ) Stock markets ( jstor ) Vertices ( jstor )
Record Information
 Source Institution:
 University of Florida
 Holding Location:
 University of Florida
 Rights Management:
 All applicable rights reserved by the source institution and holding location.
 Embargo Date:
 7/30/2007

OPTIMIZATION AND
INFORMATION RETRIEVAL TECHNIQUES FOR
COMPLEX NETWORKS
By
VLADIMIR L. BOGINSKI
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2005
Copyright 2005
by
Vladimir L. Boginski
I dedicate this to my parents.
ACKNOWLEDGMENTS
I would like to thank my advisor Prof. Panos Pardalos for his support and
guidance that made my studies at the University of Florida enjoyable and
productive. His energy and enthusiasm inspired me during these four years, and I
believe that this was crucial for my success.
I also want to thank my committee members Prof. Stan Uryasev, Prof. Joseph
Geunes, and Prof. William Hager for their concern and encouragement. I am
grateful to all my collaborators, especially Sergiy Butenko and Oleg Prokopyev,
who were always a great pleasure to work with.
Finally, I would like to express my greatest appreciation to my family and
friends, who always believed in me and supported me in all circumstances.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER
1 INTRODUCTION
1.1 Basic Concepts from Graph Theory and Data Mining Interpretation
1.1.1 Connectivity and Degree Distribution
1.1.2 Cliques and Independent Sets
1.1.3 Clustering via Clique Partitioning
2 REVIEW OF NETWORK-BASED MODELING AND OPTIMIZATION TECHNIQUES IN MASSIVE DATA SETS
2.1 Modeling and Optimization in Massive Graphs
2.1.1 Examples of Massive Graphs
2.1.1.1 Call Graph
2.1.1.2 Internet and Web Graphs
2.1.2 External Memory Algorithms
2.1.3 Modeling Massive Graphs
2.1.3.1 Uniform Random Graphs
2.1.3.2 Potential Drawbacks of the Uniform Random Graph Model
2.1.3.3 Random Graphs with a Given Degree Sequence
2.1.3.4 Power-Law Random Graphs
2.1.4 Optimization in Random Massive Graphs
2.1.4.1 Clique Number
2.1.4.2 Chromatic Number
2.1.5 Remarks
3 NETWORK-BASED APPROACHES TO MINING STOCK MARKET DATA
3.1 Structure of the Market Graph
3.1.1 Constructing the Market Graph
3.1.2 Connectivity of the Market Graph
3.1.3 Degree Distribution of the Market Graph
3.1.4 Instruments Corresponding to High-Degree Vertices
3.1.5 Clustering Coefficients in the Market Graph
3.2 Analysis of Cliques and Independent Sets in the Market Graph
3.2.1 Cliques in the Market Graph
3.2.2 Independent Sets in the Market Graph
3.3 Data Mining Interpretation of the Market Graph Model
3.4 Evolution of the Market Graph
3.4.1 Dynamics of Global Characteristics of the Market Graph
3.4.2 Dynamics of the Size of Cliques and Independent Sets in the Market Graph
3.4.3 Minimum Clique Partition of the Market Graph
3.5 Concluding Remarks
4 NETWORK-BASED TECHNIQUES IN ELECTROENCEPHALOGRAPHIC (EEG) DATA ANALYSIS AND EPILEPTIC BRAIN MODELING
4.1 Statistical Preprocessing of EEG Data
4.1.1 Datasets
4.1.2 T-statistics and STLmax
4.2 Graph Structure of the Epileptic Brain
4.2.1 Key Idea of the Model
4.2.1.1 Interpretation of the Considered Graph Models
4.2.2 Properties of the Graphs
4.2.2.1 Edge Density
4.2.2.2 Connectivity
4.2.2.3 Minimum Spanning Tree
4.2.2.4 Degrees of the Vertices
4.2.2.5 Maximum Cliques
4.3 Graph as a Macroscopic Model of the Epileptic Brain
4.4 Concluding Remarks and Directions of Future Research
5 COLLABORATION NETWORKS IN SPORTS
5.1 Examples of Social Networks
5.1.1 Scientific Collaboration Graph and Erdős Number
5.1.2 Hollywood Graph and Bacon Number
5.1.3 Baseball Graph and Wynn Number
5.1.4 Diameter of Collaboration Networks
5.2 NBA Graph
5.2.1 General Properties of the NBA Graph
5.2.2 Diameter of the NBA Graph and Jordan Number
5.2.3 Degrees and "Connectedness" of the Vertices in the NBA Graph
5.3 Concluding Remarks
6 CONCLUSIONS AND DIRECTIONS FOR FUTURE RESEARCH
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
Table
3-1 Least-squares estimates of the parameter γ in the market graph
3-2 Top 25 instruments with highest degrees in the market graph
3-3 Clustering coefficients of the market graph
3-4 Sizes of the maximum cliques in the market graph
3-5 Sizes of independent sets in the complementary market graph
3-6 Dates and mean correlations corresponding to each considered 500-day shift
3-7 Number of vertices and number of edges in the market graph for different periods
3-8 Greedy clique size and the clique number for different time periods
3-9 Structure of maximum cliques in the market graph for different time periods
3-10 Size of independent sets in the market graph found using the greedy heuristic
3-11 The largest clique size and the number of cliques in computed clique partitions
5-1 Jordan numbers of some NBA stars (end of the 2002-2003 season)
5-2 Degrees of the vertices in the NBA graph
5-3 The most "connected" players in the NBA graph
LIST OF FIGURES
Figure
2-1 Frequencies of clique sizes in the call graph
2-2 Pattern of connections in the call graph
2-3 Number of Internet hosts for the period 01/1991-01/2002
2-4 Pattern of connections in the Web graph
2-5 Connectivity of the Web (Bow-Tie model)
3-1 Distribution of correlation coefficients in the stock market
3-2 Edge density of the market graph for different values of the correlation threshold
3-3 Plot of the size of the largest connected component in the market graph as a function of correlation threshold θ
3-4 Degree distribution of the market graph
3-5 Degree distribution of the complementary market graph
3-6 Frequency of the sizes of independent sets found in the market graph
3-7 Distribution of correlation coefficients in the US stock market for several overlapping 500-day periods during 2000-2002
3-8 Degree distribution of the market graph for different 500-day periods in 2000-2002
3-9 Dynamics of edge density and maximum clique size in the market graph
4-1 Electrode placement in the brain
4-2 Number of edges in GRAPH-II
4-3 The size of the largest connected component in GRAPH-II
4-4 Average value of T-index of the edges in Minimum Spanning Tree of GRAPH-I
4-5 Average degree of the vertices in GRAPH-II
5-1 Number of vertices in the Hollywood graph with different values of Bacon number
5-2 Number of vertices in the baseball graph with different values of Wynn number
5-3 General structure of the NBA graph and other collaboration networks
5-4 Number of vertices in the NBA graph with different values of Jordan number
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
OPTIMIZATION AND INFORMATION RETRIEVAL TECHNIQUES FOR
COMPLEX NETWORKS
By
Vladimir L. Boginski
August 2005
Chair: Panagote M. Pardalos
Major Department: Industrial and Systems Engineering
This study develops novel approaches to modeling real-world datasets arising
in diverse application areas as networks and information retrieval from these
datasets using network optimization techniques. Network-based models allow one
to extract information from datasets using various concepts from graph theory.
In many cases, one can investigate specific properties of a dataset by detecting
special formations in the corresponding graph (for instance, connected components,
spanning trees, cliques, and independent sets). This process often involves solving
computationally challenging combinatorial optimization problems on graphs
(maximum independent set, maximum clique, minimum clique partition, etc.).
These problems are especially difficult to solve for large graphs. However, in certain
cases, the exact solution of a hard optimization problem can be found using a
special structure of the considered graph.
A significant part of the dissertation focuses on developing network-based
models of real-world complex systems, including the stock market and the human
brain, which have always been of special interest to scientists. These systems
generate huge amounts of data and are especially hard to analyze. This dissertation
demonstrates that network-based models can be successfully applied to information
retrieval from datasets, providing new insight into the structural properties and
patterns underlying the corresponding complex systems.
The developed network representations of the considered datasets are in
many cases nontrivial and include certain statistical preprocessing techniques.
In particular, the U.S. stock market is represented as a network based on
cross-correlations of price fluctuations of the financial instruments, which are
calculated over a certain number of trading days. This model (market graph) allows one
to analyze the structure and dynamics of the stock market from an alternative
perspective and obtain useful information about the global structure of the market,
classes of similar stocks, and diversified portfolios.
Similarly, a macroscopic network model of the human brain is constructed
based on the statistical measures of entrainment between electroencephalographic
(EEG) signals recorded from different functional units of the brain. Studying the
evolution of the properties of these networks revealed some interesting facts about
brain disorders, such as epilepsy.
CHAPTER 1
INTRODUCTION
Nowadays, the process of studying real-life complex systems often deals with
large datasets arising in diverse applications including government and military
systems, telecommunications, biotechnology, medicine, finance, astrophysics,
ecology, geographical information systems, etc. [3, 25]. Understanding the structural
properties of a certain dataset is in many cases a task of crucial importance. To
get useful information from these data, one often needs to apply special techniques
of summarizing and visualizing the information contained in a dataset.
An appropriate mathematical model can simplify the analysis of a dataset and
even theoretically predict some of its properties. Thus, a fundamental problem that
arises here is modeling the datasets characterizing real-world complex systems.
In this dissertation, we concentrate on one aspect of this problem: network
representation of real-world datasets. According to this approach, a certain dataset
is represented as a graph (network) with certain attributes associated with its
vertices and edges.
Studying the structure of a graph representing a dataset is often important
for understanding the internal properties of the application it represents, as
well as for improving storage organization and information retrieval. One can
visualize a graph as a set of dots and links connecting them, which often makes this
representation convenient and easily understandable.
The main concepts of graph theory were founded several centuries ago, and
many network optimization algorithms have been developed since then. However,
graph models have been applied only recently to representing various real-life
massive datasets. Graph theory is quickly becoming a practical field of science.
Expansion of graph-theoretical approaches in various applications gave birth to the
terms "graph practice" and "graph engineering" [63].
Network-based models allow one to extract information from real-world
datasets using various standard concepts from graph theory. In many cases, one
can investigate specific properties of a dataset by detecting special formations in
the corresponding graph, for instance, connected components, spanning trees, cliques
and independent sets. In particular, cliques and independent sets can be used for
solving the important clustering problem arising in data mining, which essentially
represents partitioning the set of elements of a certain dataset into a number of
subsets (clusters) of objects according to some similarity (or dissimilarity) criterion.
These concepts are associated with a number of network optimization problems
discussed later.
Another aspect of investigating network models of real-world datasets is
studying the degree distribution of the constructed graphs. The degree distribution
is an important characteristic of a dataset represented by a graph. It represents
the large-scale pattern of connections in the graph, which reflects the global
properties of the dataset. One of the important results discovered during the last
several years is the observation that many graphs representing the datasets from
diverse areas (Internet, telecommunications, biology, sociology) obey the power-law
model [9]. The fact that graphs representing completely different datasets have a
similar well-defined power-law structure has been widely reflected in the literature
[10, 19, 20, 25, 63, 116, 117]. It indicates that global organization and evolution
of datasets arising in various spheres of life nowadays follow similar laws and
patterns. This fact served as a motivation to introduce a concept of "self-organized
networks."
Later we discuss in more detail various aspects of modeling real-world datasets
as networks, and retrieving useful information from these networks. The practical
importance of graph-theoretic techniques is shown by several examples of applying
these approaches associated with datasets arising in telecommunications, the Internet,
sociology, etc. The major part of the dissertation is devoted to novel network-based
techniques and models that allow one to obtain important nontrivial information
from datasets arising in finance and biomedicine.
1.1 Basic Concepts from Graph Theory and Data Mining Interpretation
To facilitate further discussion, we present several basic definitions and
notations from graph theory and discuss the interpretation of the introduced
concepts from the perspective of data mining and information retrieval.
Let G = (V, E) be an undirected graph with the set of n vertices V and the set
of edges $E = \{(i, j) : i, j \in V\}$. Directed graphs, where the head and tail of each
edge are specified, are considered in some applications. The concept of a multigraph
is also sometimes introduced. A multigraph is a graph where multiple edges
connecting a given pair of vertices may exist. One of the important characteristics
of a graph is its edge density: the ratio of the number of edges in the graph to the
maximum possible number of edges.1
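The edge density defined above can be illustrated with a minimal sketch. The function name `edge_density` and the representation of a graph as a vertex count plus a list of vertex pairs are assumptions made here for illustration; they do not come from the dissertation.

```python
def edge_density(n, edges):
    """Ratio of the number of edges to the maximum possible n(n-1)/2."""
    if n < 2:
        return 0.0
    # Count each undirected edge once, ignoring self-loops and duplicates
    # (the text considers simple undirected graphs).
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    return len(unique) / (n * (n - 1) / 2)

# A path 0-1-2-3 has 3 of the 6 possible edges on 4 vertices.
density = edge_density(4, [(0, 1), (1, 2), (2, 3)])
```

A density close to 1 indicates a nearly complete graph, while a density close to 0 indicates a sparse one.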
1.1.1 Connectivity and Degree Distribution
The graph G = (V, E) is connected if there is a path from any vertex to any
vertex in the set V. If the graph is disconnected, it can be decomposed into several
connected subgraphs, which are referred to as the connected components of G.
1 The maximum possible number of edges in a graph is equal to $n(n - 1)/2$ (n is
the number of vertices).
The degree of a vertex is the number of edges emanating from it. For every
integer k one can calculate the number of vertices n(k) with a degree equal to k,
and then get the probability that a vertex has the degree k as P(k) = n(k)/n,
where n is the total number of vertices. The function P(k) is referred to as the
degree distribution of the graph. In the case of a directed graph, the concept of
degree distribution is generalized: one can distinguish the distribution of in-degrees
and out-degrees, which deal with the number of edges ending at and starting from a
vertex, respectively.
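As a small illustration of the definition P(k) = n(k)/n, the following sketch computes the degree distribution of an undirected simple graph. The function name and the edge-list input format are assumptions chosen for this example.

```python
from collections import Counter

def degree_distribution(n, edges):
    """Return {k: P(k)} with P(k) = n(k)/n for an undirected simple graph."""
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    counts = Counter(degree)  # n(k): number of vertices with degree k
    return {k: counts[k] / n for k in sorted(counts)}

# Star with center 0 and leaves 1, 2, 3:
# one vertex of degree 3 and three vertices of degree 1.
dist = degree_distribution(4, [(0, 1), (0, 2), (0, 3)])
```

For a directed graph, the same counting could be done separately for the tail and the head of each edge to obtain the out-degree and in-degree distributions.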
Degree distribution is an important characteristic of a dataset represented
by a graph. It reflects the overall pattern of connections in the graph, which in
many cases reflects the global properties of the dataset this graph represents. As
mentioned above, many real-world graphs representing the datasets coming from
diverse areas (Internet, telecommunications, finance, biology, sociology) have degree
distributions that follow the power-law model, which states that the probability
that a vertex of a graph has a degree k (i.e., there are k edges emanating from it) is
$P(k) \propto k^{-\gamma}$. (1-1)
Equivalently, one can represent it as
$\log P \propto -\gamma \log k$, (1-2)
which demonstrates that this distribution forms a straight line in the logarithmic
scale, and the slope of this line equals $-\gamma$.
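Because the power-law relation is linear in logarithmic scale, the parameter γ can be estimated by fitting a straight line to the points (log k, log P(k)). The sketch below uses an ordinary least-squares fit; the function name `estimate_gamma` and the synthetic test distribution are assumptions made for illustration, and the fit is only meaningful when at least two distinct positive degrees are present.

```python
import math

def estimate_gamma(dist):
    """Least-squares slope of log P(k) versus log k; returns gamma = -slope."""
    pts = [(math.log(k), math.log(p)) for k, p in dist.items() if k > 0 and p > 0]
    m = len(pts)
    mean_x = sum(x for x, _ in pts) / m
    mean_y = sum(y for _, y in pts) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return -sxy / sxx

# Synthetic distribution with P(k) proportional to k^(-2.5) for k = 1..50.
raw = {k: k ** -2.5 for k in range(1, 51)}
total = sum(raw.values())
synthetic = {k: v / total for k, v in raw.items()}
gamma_hat = estimate_gamma(synthetic)
```

On exact power-law input the estimate recovers the exponent up to rounding error; on empirical data the tail of the distribution is noisy, so the fitted value should be read as an approximation.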
An important characteristic of the power-law model is its scale-free property.
This property implies that the power-law structure of a certain network should not
depend on the size of the network. Clearly, real-world networks dynamically grow
over time; therefore, the growth process of these networks should obey certain rules
in order to satisfy the scale-free property. The necessary properties of the evolution
of the real-world networks are growth and preferential attachment [20]. The first
property implies the obvious fact that the size of these networks grows continuously
(i.e., new vertices are added to a network, which means that new elements are
added to the corresponding dataset). The second property represents the idea that
new vertices are more likely to be connected to old vertices with high degrees. It is
intuitively clear that these principles characterize the evolution of many real-world
complex networks nowadays.
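The two principles, growth and preferential attachment, can be sketched as a simple generator. This is an illustrative construction under stated assumptions (a complete starting graph on m + 1 vertices, m new edges per added vertex, and the hypothetical function name `preferential_attachment`); it is not presented as the exact procedure of [20].

```python
import random

def preferential_attachment(n, m, seed=0):
    """Grow a graph to n vertices; each new vertex attaches to m existing ones."""
    rng = random.Random(seed)
    # Growth starts from a complete graph on m + 1 vertices.
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # Each vertex appears in `ends` once per incident edge, so a uniform draw
    # from `ends` selects a vertex with probability proportional to its degree.
    ends = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in sorted(targets):
            edges.append((t, new))
            ends.extend((t, new))
    return edges

edges = preferential_attachment(200, 2)
```

Vertices added early tend to accumulate many connections, which is the mechanism behind the heavy-tailed degree distributions discussed above.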
From another perspective, some properties of graphs that follow the power-law
model can be predicted theoretically. Aiello et al. [9] studied the properties of the
power-law graphs using the theoretical power-law random graph model representing
the class of random graphs obeying the power law (see Chapter 2). Among
their results, one can mention the existence of a giant connected component in
a power-law graph with $\gamma < \gamma_0 \approx 3.47875$, and the fact that a giant connected
component does not exist otherwise.2 Emergence of a giant connected component
at the point $\gamma_0 \approx 3.47875$ is often called phase transition.
The size of connected components of the graph may provide useful information
about the structure of the corresponding dataset, as the connected components
would normally represent groups of "similar" objects. In some applications,
decomposing the graph into a set of connected components can provide a reasonable
solution to the clustering problem (i.e., partitioning the graph into several
subgraphs, each of which corresponds to a certain cluster).
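Decomposing a graph into connected components can be done with a breadth-first search from every unvisited vertex. The following is a minimal sketch; the function name and the edge-list representation are assumptions made for this example.

```python
from collections import deque

def connected_components(n, edges):
    """Return the connected components as a list of sorted vertex lists."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue, comp = deque([start]), []
        while queue:  # breadth-first search from `start`
            v = queue.popleft()
            comp.append(v)
            for w in adj[v]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
        components.append(sorted(comp))
    return components

comps = connected_components(6, [(0, 1), (1, 2), (3, 4)])
```

Each returned list can be read as one candidate cluster, and the length of the longest list gives the size of the largest connected component.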
1.1.2 Cliques and Independent Sets
Given a subset $S \subseteq V$, we denote by G(S) the subgraph induced by S. A
subset $C \subseteq V$ is a clique if G(C) is a complete graph (i.e., it has all possible edges).
The maximum clique problem is to find the largest clique in a graph.
The following definitions generalize the concept of clique. Instead of cliques
one can consider dense subgraphs, or quasi-cliques. A $\gamma$-clique $C_\gamma$, also called a
quasi-clique, is a subset of V such that $G(C_\gamma)$ has at least $\lfloor \gamma q(q - 1)/2 \rfloor$ edges,
where q is the cardinality (i.e., number of vertices) of $C_\gamma$.
2 These results are valid asymptotically almost surely (a.a.s.), which means that
the probability that a given property takes place tends to 1 as the number of
vertices n goes to infinity.
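The quasi-clique condition can be checked directly by counting the edges of the induced subgraph and comparing the count with the threshold ⌊γq(q − 1)/2⌋. The sketch below assumes a simple undirected graph given as an edge list; the function name `is_gamma_clique` is hypothetical.

```python
import math
from itertools import combinations

def is_gamma_clique(subset, edges, gamma):
    """True if the subgraph induced by `subset` has at least
    floor(gamma * q * (q - 1) / 2) edges, where q = len(subset)."""
    edge_set = {frozenset(e) for e in edges}
    q = len(subset)
    induced = sum(1 for pair in combinations(subset, 2)
                  if frozenset(pair) in edge_set)
    return induced >= math.floor(gamma * q * (q - 1) / 2)

# A 4-cycle with one chord: 5 of the 6 possible edges on 4 vertices.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
dense_enough = is_gamma_clique([0, 1, 2, 3], edges, 0.8)
is_complete = is_gamma_clique([0, 1, 2, 3], edges, 1.0)
```

With γ = 1 the condition reduces to the ordinary clique definition, so the same check covers both cases.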
An independent set is a subset $I \subseteq V$ such that the subgraph G(I) has no
edges. The maximum independent set problem can be easily reformulated as the
maximum clique problem in the complementary graph $\bar{G} = (V, \bar{E})$, defined as follows.
If an edge $(i, j) \in E$, then $(i, j) \notin \bar{E}$; and if $(i, j) \notin E$, then $(i, j) \in \bar{E}$. Clearly, a
maximum clique in $\bar{G}$ is a maximum independent set in G, so the maximum clique
and maximum independent set problems can be easily reduced to each other.
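The reduction between the two problems can be made concrete with a short sketch that builds the complementary graph and checks that an independent set of G is a clique of the complement. All function names here are illustrative assumptions.

```python
from itertools import combinations

def complement_edges(n, edges):
    """Edges of the complementary graph: all vertex pairs that are not edges of G."""
    present = {frozenset(e) for e in edges}
    return [(i, j) for i, j in combinations(range(n), 2)
            if frozenset((i, j)) not in present]

def is_clique(subset, edges):
    present = {frozenset(e) for e in edges}
    return all(frozenset(p) in present for p in combinations(subset, 2))

def is_independent_set(subset, edges):
    present = {frozenset(e) for e in edges}
    return not any(frozenset(p) in present for p in combinations(subset, 2))

g = [(0, 1), (1, 2), (2, 3)]      # path 0-1-2-3
g_bar = complement_edges(4, g)    # pairs of vertices not joined in the path
```

Any algorithm for the maximum clique problem can therefore be applied to the complement to obtain a maximum independent set of the original graph, at the cost of constructing the complement.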
Locating cliques (quasi-cliques) and independent sets in a graph representing
a dataset provides important information about this dataset. Intuitively, edges
in such a graph would connect vertices corresponding to "similar" elements of
the dataset. Therefore, cliques (or quasi-cliques) would naturally represent dense
clusters of similar objects. On the contrary, independent sets can be treated as
groups of objects that differ from every other object in the group. This information
is also important in some applications. Clearly, it is useful to find a maximum
clique or independent set in the graph, since it would give the maximum possible
size of the groups of "similar" or "different" objects.
The maximum clique problem (as well as the maximum independent set
problem) is known to be NP-hard [59]. Moreover, it turns out that these problems are
difficult to approximate [18, 62]. This makes these problems especially challenging
in large graphs.
1.1.3 Clustering via Clique Partitioning
The problem of locating cliques and independent sets in a graph can be
naturally extended to finding an optimal partition of a graph into a minimum
number of distinct cliques or independent sets. These problems are referred to as
minimum clique partition and graph coloring, respectively. Pardalos et al. [102] give
various mathematical programming formulations of these problems. Clearly, as in
the case of maximum clique and maximum independent set problems, minimum
clique partition and graph coloring are reduced to each other by considering the
complementary graph, and both of these problems are NP-hard [59]. Solving these
problems for graphs representing real-life datasets is important from a data mining
perspective, especially for solving the clustering problem.
The essence of clustering is partitioning the elements in a certain dataset into
several distinct subsets (clusters) grouped according to an appropriate similarity
criterion [34]. Identifying the groups of objects that are "similar" to each other
but "different" from other objects in a given dataset is important in many practical
applications. The clustering problem is challenging because the number of clusters
and the similarity criterion are usually not known a priori.
If a dataset is represented as a graph, where each data element corresponds to
a vertex, the clustering problem essentially deals with decomposing this graph into
a set of subgraphs (subsets of vertices), so that each of these subgraphs corresponds
to a specific cluster.
Since the data elements assigned to the same cluster should be "similar" to
each other, the goal of clustering can be achieved by finding a clique partition
of the graph, and the number of clusters will equal the number of cliques in the
partition.
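Since minimum clique partition is NP-hard, practical clustering often relies on heuristics. The sketch below is one simple greedy heuristic, given only as an illustration of how a clique partition yields clusters; it is an assumption of this example rather than the method used in the dissertation, it assumes a graph without self-loops, and it does not guarantee a minimum number of cliques.

```python
def greedy_clique_partition(n, edges):
    """Partition vertices 0..n-1 into cliques with a simple greedy heuristic."""
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Seed a clique with the unassigned vertex of highest remaining degree.
        seed = max(sorted(unassigned), key=lambda v: len(adj[v] & unassigned))
        clique = [seed]
        candidates = adj[seed] & unassigned
        while candidates:
            v = max(sorted(candidates), key=lambda u: len(adj[u] & candidates))
            clique.append(v)
            candidates &= adj[v]  # keep only vertices adjacent to all members
        unassigned -= set(clique)
        clusters.append(sorted(clique))
    return clusters

# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
clusters = greedy_clique_partition(6, edges)
```

In this example the heuristic recovers the two triangles as two clusters; on harder instances the number of cliques it produces is only an upper bound on the optimum.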
Similar arguments hold for the case of the graph coloring problem, which
should be solved when a dataset needs to be decomposed into clusters of
"different" objects (i.e., each object in a cluster is different from all other objects in
the same cluster) that can be represented as independent sets in the corresponding
graph. The number of independent sets in the optimal partition is referred to as
the chromatic number of the graph.
Instead of cliques and independent sets one can consider quasi-cliques and
quasi-independent sets and partition the graph on this basis. As mentioned,
quasi-cliques are subgraphs that are dense enough (i.e., they have a high edge
density). Therefore, it is often reasonable to relate clusters to quasi-cliques, since
they represent sufficiently dense clusters of similar objects. Obviously, in the case
of partitioning a dataset into clusters of "different" objects, one can use
quasi-independent sets (i.e., subgraphs that are sparse enough) to define these clusters.
CHAPTER 2
REVIEW OF NETWORK-BASED MODELING AND OPTIMIZATION
TECHNIQUES IN MASSIVE DATA SETS
In this chapter, we review current developments in studying massive graphs
used as models of certain real-world datasets.1 Massive data sets arise in a broad
spectrum of scientific, engineering and commercial applications [3]. Some of the
wide range of problems associated with massive data sets are data warehousing,
compression and visualization, information retrieval, clustering and pattern
recognition, and nearest neighbor search. Handling these problems requires
special interdisciplinary efforts to develop novel sophisticated techniques. The
pervasiveness and complexity of the problems brought by massive data sets make it
one of the most challenging and exciting areas of research for years to come.
In many cases, a massive data set can be represented as a very large graph
with certain attributes associated with its vertices and edges. These attributes
may contain specific information characterizing the given application. Studying
the structure of this graph is important for understanding the structural properties
of the application it represents, as well as for improving storage organization and
information retrieval.
2.1 Modeling and Optimization in Massive Graphs
In this section we discuss recent advances in modeling and optimization for
massive graphs. As examples, Call, Internet, and Web graphs will be used.
1 This chapter is based on the joint publication with Butenko and Pardalos [25].
As before, by G = (V, E) we will denote a simple undirected graph with the set
of n vertices V and the set of edges E. A multigraph is an undirected graph with
multiple edges.
The distance between two vertices is the number of edges in the shortest
path between them (it is equal to infinity for vertices representing different
connected components). The diameter of a graph G is usually defined as the
maximal distance between pairs of vertices of G. In a disconnected graph, the usual
definition of the diameter would result in the infinite diameter, so the following
definition is in order. By the diameter of a disconnected graph we will mean the
maximum finite shortest path length in the graph (the same as the largest of the
diameters of the graph's connected components).
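The diameter convention for disconnected graphs adopted above (the maximum finite shortest-path length) can be computed by running a breadth-first search from every vertex and ignoring unreachable vertices. This is a minimal sketch suitable for small graphs only; the function name is an assumption, and for massive graphs this all-pairs approach would be far too expensive.

```python
from collections import deque

def diameter(n, edges):
    """Largest finite shortest-path length over all pairs of vertices."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    best = 0
    for source in range(n):
        dist = {source: 0}
        queue = deque([source])
        while queue:  # BFS reaches only the component containing `source`
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best

# Path 0-1-2-3 (diameter 3) plus a separate edge 4-5 (diameter 1).
d = diameter(6, [(0, 1), (1, 2), (2, 3), (4, 5)])
```

The result equals the largest of the diameters of the connected components, in agreement with the definition given in the text.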
2.1.1 Examples of Massive Graphs
2.1.1.1 Call Graph
Here we discuss an example of a massive graph representing
telecommunications traffic data presented by Abello, Pardalos and Resende [2]. In this call graph,
the vertices are telephone numbers, and two vertices are connected by an edge if a
call was made from one number to another.
Abello et al. [2] experimented with data from AT&T telephone billing records.
To give an idea of how large a call graph can be, we mention that a graph based
on one 20-day period had 290 million vertices and 4 billion edges. The analyzed
one-day call graph had 53,767,087 vertices and over 170 million edges. This graph
appeared to have 3,667,448 connected components, most of them tiny; only 302,468
(or about 8%) components had more than 3 vertices. A giant connected component
with 44,989,297 vertices was computed. It was observed that the existence of a
giant component resembles a behavior suggested by the random graphs theory of
Erdős and Rényi [47, 48], which will be mentioned below, but by the pattern of
connections the call graph obviously does not fit into this theory (Subsection 2.1.3).
The maximum clique problem and the problem of finding large quasi-cliques with
prespecified density were considered in this giant component. These problems were
attacked using a greedy randomized adaptive search procedure (GRASP) [51, 52].
In short, GRASP is an iterative method that at each iteration constructs, using a
greedy function, a randomized solution and then finds a locally optimal solution
by searching the neighborhood of the constructed solution. This is a heuristic
approach which gives no guarantee about quality of the solutions found, but proved
to be practically efficient for many combinatorial optimization problems. To make
application of optimization algorithms in the considered large component possible,
the authors use some suitable graph decomposition techniques employing external
memory algorithms (see Subsection 2.1.2).
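The two phases of a GRASP iteration described above, randomized greedy construction followed by local search, can be sketched for the maximum clique problem as follows. This is a small in-memory illustration and not the implementation of Abello et al. [2]; the parameter `alpha` (which controls how greedy the construction is), the restricted candidate list rule, and the one-vertex swap neighborhood are all assumptions made for this example.

```python
import random

def grasp_max_clique(n, edges, iterations=100, alpha=0.5, seed=0):
    """GRASP-style heuristic: randomized greedy construction plus local search."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)

    def construct():
        clique, cand = [], set(range(n))
        while cand:
            # Greedy score: number of neighbours among the remaining candidates.
            score = {v: len(adj[v] & cand) for v in cand}
            lo, hi = min(score.values()), max(score.values())
            cutoff = hi - alpha * (hi - lo)
            # Restricted candidate list: vertices with near-best scores.
            rcl = sorted(v for v in cand if score[v] >= cutoff)
            v = rng.choice(rcl)
            clique.append(v)
            cand &= adj[v]
        return clique

    def local_search(clique):
        # Drop one vertex if that allows at least two other vertices to join.
        improved = True
        while improved:
            improved = False
            for out in clique:
                rest = [v for v in clique if v != out]
                cand = set(range(n)) - set(rest) - {out}
                for v in rest:
                    cand &= adj[v]
                grown = list(rest)
                for v in sorted(cand):
                    if all(v in adj[u] for u in grown):
                        grown.append(v)
                if len(grown) > len(clique):
                    clique, improved = grown, True
                    break
        return clique

    best = []
    for _ in range(iterations):
        clique = local_search(construct())
        if len(clique) > len(best):
            best = clique
    return sorted(best)

# Complete graph on {0, 1, 2, 3} with a pendant path 3-4-5.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
best_clique = grasp_max_clique(6, edges, iterations=20)
```

As with any heuristic of this kind, the returned clique is only a lower bound on the clique number; running more iterations improves the chance of finding a larger clique but gives no optimality guarantee.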
[Figure: frequency (freq, logarithmic scale) versus clique size.]
Figure 2-1. Frequencies of clique sizes in the call graph found by Abello et al. [2].
Abello et al. [2] ran 100,000 GRASP iterations taking 10 parallel processors
about one and a half days to finish. Of the 100,000 cliques generated, 14,141
appeared to be distinct, although many of them had vertices in common. Abello
et al. suggested that the graph contains no clique of a size greater than 32.
Figure 2-1 shows the number of detected cliques of various sizes. Finally, large
quasi-cliques with density parameters $\gamma$ = 0.9, 0.8, 0.7, and 0.5 for the giant
connected component were computed. The sizes of the largest quasi-cliques found
were 44, 57, 65, and 98, respectively.
[Figure: three log-log plots, (a) out-degree, (b) in-degree, (c) component size.]
Figure 2-2. Pattern of connections in the call graph: number of vertices with
various out-degrees (a) and in-degrees (b); number of connected
components of various sizes (c) in the call graph [8].
Aiello et al. [8] used the same data as Abello et al. [2] to show that the
considered call graph fits their power-law random graph model (Section 2.1.3).
The plots in Figure 2-2 demonstrate some connectivity properties of the call graph.
Summarizing the results presented in this subsection, one can say that
graph-based techniques proved to be rather useful in analyzing the telecommunications
traffic dataset and revealing its global patterns. In the next subsection, we
will consider another example of a similar type of dataset associated with the
World Wide Web.
2.1.1.2 Internet and Web Graphs
The role of the Internet in the modern world is difficult to overestimate; its
invention changed the way people interact, learn, and communicate like nothing
before. Along with its increasing significance, the Internet itself continues to
grow at an overwhelming rate. Figure 2-3 shows the dynamics of growth of the
number of Internet hosts for the last 13 years. As of January 2002 this number
was estimated to be close to 150 million.2 The number of web pages indexed by
large search engines exceeds 2 billion, and the number of web sites is growing by
thousands daily.
[Figure 2-3 plot omitted: vertical axis from 0 to 160,000,000 hosts.]
Figure 2-3. Number of Internet hosts for the period 01/1991-01/2002. Data by
Internet Software Consortium.
2 According to Internet Software Consortium, http://www.isc.org/ds/hostcount
history.html
[Figure 2-4 plots omitted: two log-log panels with horizontal axes outdegree
(left) and size of component (right).]
Figure 2-4. Pattern of connections in the Web graph: number of vertices with
various outdegrees (left) and distribution of sizes of strongly connected
components (right) in the Web graph [37].
The highly dynamic and seemingly unpredictable structure of the World Wide
Web attracts more and more attention of scientists representing many diverse
disciplines, including graph theory. In a graph representation of the World Wide
Web, the vertices are documents and the edges are hyperlinks pointing from one
document to another. Similarly to the call graph, the Web is a directed multigraph,
although often it is treated as an undirected graph to simplify the analysis.
Another graph is associated with the physical network of the Internet, where the
vertices are routers navigating packets of data or groups of routers (domains). The
edges in this graph represent wires or cables in the physical network.
Graph theory has been applied to web search [36, 78], web mining [96, 97],
and other problems arising in the Internet and World Wide Web. Several
recent studies attempted to understand some structural properties of
the Web graph by investigating large Web crawls. Adamic and Huberman [6, 65]
used crawls which covered almost 260,000 pages in their studies. Barabási and
Albert [20] analyzed a subgraph of the Web graph with approximately 325,000 nodes
representing nd.edu pages. In another experiment, Kumar et al. [82] examined a
data set containing about 40 million pages. In a recent study, Broder et al. [37]
used two Altavista crawls, each with about 200 million pages and 1.5 billion links,
thus significantly exceeding the scale of the preceding experiments. This work
yielded several remarkable observations about local and global properties of the
Web graph. All of the properties observed in one of the two crawls were validated
for the other as well. Below, by the Web graph we will mean one of the crawls,
which has 203,549,046 nodes and 2130 million arcs.
The first observation made by Broder et al. confirms a property of the Web
graph suggested in earlier works [20, 82], namely that the distribution of degrees
follows a power law. Interestingly, the degree distribution of the Web graph
resembles the power-law relationship of the Internet graph topology, which was first
discovered by Faloutsos et al. [50]. Broder et al. [37] computed the indegree and
outdegree distributions for both considered crawls and showed that these distributions
agree with power laws. Moreover, they observed that in the case of indegrees
the constant $\gamma \approx 2.1$ is the same as the exponent of power laws discovered in
earlier studies [20, 82]. In another set of experiments conducted by Broder et al.,
directed and undirected connected components were investigated. It was noticed
that the distribution of sizes of these connected components also obeys a power
law. Figure 2-4 illustrates the experiments with distributions of outdegrees and
connected component sizes.
The last series of experiments discussed by Broder et al. [37] aimed to explore
the global connectivity structure of the Web. This led to the discovery of the
so-called Bow-Tie model of the Web [38]. Similarly to the call graph, the considered
Web graph appeared to have a giant connected component, containing 186,771,290
nodes, or over 90% of the total number of nodes. Taking into account the directed
nature of the edges, this connected component can be subdivided into four pieces:
the strongly connected component (SCC), the In and Out components, and the "Tendrils".
Overall, the Web graph in the Bow-Tie model is divided into the following pieces:
[Figure 2-5 diagram omitted. Component sizes shown: SCC, 56,463,993 nodes;
In, 43,343,168; Out, 43,166,185; Tendrils (including tubes), 43,797,944;
Disconnected, 16,777,756.]
Figure 2-5. Connectivity of the Web (Bow-Tie model) [37].
* Strongly connected component: the part of the giant connected component in
which all nodes are reachable from one another by a directed path.
* In component: nodes which can reach any node in the SCC but cannot be
reached from the SCC.
* Out component: contains the nodes that are reachable from the SCC, but
cannot access the SCC through directed links.
* Tendrils component: accumulates the remaining nodes of the giant connected
component, i.e., the nodes which are not connected with the SCC.
* Disconnected component: the part of the Web which is not connected with the
giant connected component.
Figure 2-5 shows the connectivity structure of the Web, as well as the sizes of the
considered components. As one can see from the figure, the sizes of the SCC, In, Out,
and Tendrils components are roughly equal, and the Disconnected component is
significantly smaller.
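The five pieces listed above can be obtained from plain reachability sets. The Python sketch below is an illustration only and is not the procedure used by Broder et al. [37]; the function names `reach` and `bow_tie` are hypothetical. It assumes that the giant connected component is the weakly connected component containing the largest SCC, and it finds that SCC by a quadratic search that is practical only for small graphs.

```python
from collections import deque


def reach(adj, sources):
    """Return the set of nodes reachable from `sources` by BFS over `adj`."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def bow_tie(nodes, edges):
    """Split a directed graph into SCC, In, Out, Tendrils and Disconnected parts."""
    nodes = list(nodes)
    fwd = {u: [] for u in nodes}   # successors
    bwd = {u: [] for u in nodes}   # predecessors
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)
    und = {u: fwd[u] + bwd[u] for u in nodes}  # edge directions ignored

    # The SCC of a node is (forward-reachable set) & (backward-reachable set).
    # Keep the largest one found. Quadratic in the worst case: a sketch only.
    scc, assigned = set(), set()
    for u in nodes:
        if u in assigned:
            continue
        comp = reach(fwd, [u]) & reach(bwd, [u])
        assigned |= comp
        if len(comp) > len(scc):
            scc = comp

    seed = next(iter(scc))
    out_part = reach(fwd, [seed]) - scc   # reachable from the SCC
    in_part = reach(bwd, [seed]) - scc    # can reach the SCC
    giant = reach(und, [seed])            # weak component containing the SCC
    return {
        "scc": scc,
        "in": in_part,
        "out": out_part,
        "tendrils": giant - scc - in_part - out_part,
        "disconnected": set(nodes) - giant,
    }
```

For a toy graph with a directed triangle 1, 2, 3, an extra node 4 pointing into it, a node 5 pointed to from it, a node 6 hanging off node 4, and a separate edge 7 to 8, the call `bow_tie(range(1, 9), [(1, 2), (2, 3), (3, 1), (4, 1), (3, 5), (4, 6), (7, 8)])` places node 4 in In, node 5 in Out, node 6 in Tendrils, and nodes 7 and 8 in Disconnected.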
Broder et al. [37] also computed the diameters of the SCC and of the
whole graph. It was shown that the diameter of the SCC is at least 28, and the
diameter of the whole graph is at least 503. The average connected distance is
defined as the pairwise distance averaged over those directed pairs (i, j) of nodes
for which there exists a path from i to j. The average connected distance of the
whole graph was estimated as 16.12 for in-links, 16.18 for out-links, and 6.83
for undirected links. Interestingly, it was also found that for a randomly chosen
directed pair of nodes, the chance that there is a directed path between them is
only about 24%.
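The definition of the average connected distance can be made concrete with a short Python sketch. It is an illustration under simplifying assumptions: the function name `average_connected_distance` is hypothetical, the graph is given as a dictionary that lists every node as a key, and a breadth-first search is run from every node, which would not be feasible for a graph of Web scale without sampling the start nodes.

```python
from collections import deque


def average_connected_distance(adj):
    """Return (average distance over connected ordered pairs, fraction of
    ordered pairs (i, j), i != j, that are joined by a directed path)."""
    nodes = list(adj)
    total = 0       # sum of shortest-path lengths over connected pairs
    connected = 0   # number of ordered pairs with a directed path
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        connected += len(dist) - 1   # exclude the pair (s, s)
    pairs = len(nodes) * (len(nodes) - 1)
    return total / connected, connected / pairs
```

For the directed path a, b, c, given as `{"a": ["b"], "b": ["c"], "c": []}`, three of the six ordered pairs are connected, with distances 1, 2, and 1, so the function returns an average of 4/3 and a connected fraction of 0.5.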
2.1.2 External Memory Algorithms
In many cases, the data associated with massive graphs is too large to fit
entirely inside the computer's fast internal memory, so a slower external
memory (for example, disks) needs to be used. The input/output communication
(I/O) between these memories can result in an algorithm's slow performance.
External memory (EM) algorithms and data structures are designed to
reduce the I/O cost by exploiting locality. Recently, external memory
algorithms have been successfully applied to batched problems involving
graphs, including connected components, topological sorting, and shortest paths.
The first EM graph algorithm was developed by Ullman and Yannakakis [112]
in 1991 and dealt with the problem of transitive closure. Many other researchers
have contributed to the progress in this area ever since [1, 15, 16, 39, 42, 83, 115].
Chiang et al. [42] proposed several new techniques for design and analysis of efficient
EM graph algorithms and discussed applications of these techniques to specific
problems, including minimum spanning tree verification, connected and biconnected
components, graph drawing, and visibility representation. Abello et al. [1] proposed
a functional approach for EM graph algorithms and used their methodology to
develop deterministic and randomized algorithms for computing connected
components, maximal independent sets, maximal matchings, and other structures in
the graph. In this approach each algorithm is defined as a sequence of functions,
and the computation proceeds as a series of scan operations over the data. If the
produced output data, once written, cannot be changed, then the function is said
to have no side effects. The lack of side effects enables the application of standard
checkpointing techniques, thus increasing the reliability. Abello et al. presented a
semi-external model for graph problems, which assumes that only the vertices fit
in the computer's internal memory. This is quite common in practice, and in fact
this was the case for the call graph described in Subsection 2.1.1, for which efficient
EM algorithms developed by Abello et al. [1] were used in order to compute its
connected components [2].
For more detail on external memory algorithms see the book [4] and the
extensive review by Vitter [115] of EM algorithms and data structures.
2.1.3 Modeling Massive Graphs
The size of real-life massive graphs, many of which cannot be held even
by a computer with several gigabytes of main memory, defeats
classical algorithms and makes one look for novel approaches. External memory
algorithms and data structures discussed in the previous subsection represent one
of the research directions aiming to overcome difficulties created by data sizes.
But in some cases not only is the amount of data huge, but the data itself is not
completely available. For instance, one can hardly expect to collect complete
information about the Web graph; in fact, the largest search engines are estimated
to cover only 3'. of the Web [84].
Therefore, to investigate real-life massive graphs, one needs to use the available
information in order to construct proper theoretical models of these graphs. One
of the earliest attempts to model real networks theoretically goes back to the late
1950s, when the foundations of random graph theory were developed. In this
subsection we will present some of the results produced by this and other (more
realistic) graph models.
2.1.3.1 Uniform Random Graphs
The classical theory of random graphs founded by Erdős and Rényi [47, 48]
deals with several standard models of the so-called uniform random graphs. Two
such models are $\mathcal{G}(n, m)$ and $\mathcal{G}(n, p)$ [30]. The first model assigns the same
probability to all graphs with n vertices and m edges, while in the second model
each pair of vertices is chosen to be linked by an edge randomly and independently
with probability p.
In most cases, for each natural n a probability space consisting of graphs
with exactly n vertices is considered, and the properties of this space as $n \to \infty$
are studied. It is said that a typical element of the space or almost every (a.e.)
graph has property Q when the probability that a random graph on n vertices has
this property tends to 1 as $n \to \infty$. We will also say that the property Q holds
asymptotically almost surely (a.a.s.). Erdős and Rényi discovered that in many
cases either almost every graph has property Q or almost every graph does not
have this property.
Many properties of uniform random graphs have been well studied [29, 30, 73,
80]. Below we will summarize some known results in this field.
Probably the simplest property to be considered in any graph is its
connectivity. It was shown that for a uniform random graph $G(n,p) \in \mathcal{G}(n,p)$ there is a
"threshold" value of p that determines whether a graph is almost surely connected
or not. More specifically, a graph G(n,p) is a.a.s. disconnected if $p < \frac{\log n}{n}$.
Furthermore, it turns out that if p is in the range $\frac{1}{n} < p < \frac{\log n}{n}$, the graph G(n,p)
a.a.s. has a unique giant connected component [30]. The emergence of a giant
connected component in a random graph is very often referred to as the "phase
transition".
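The thresholds in np can be observed in a small simulation. The following Python sketch is illustrative only: the function name `largest_component_fraction`, the graph size, and the random seed are assumptions, and the graph is sampled by testing every pair of vertices, which costs on the order of n^2 operations.

```python
import random


def largest_component_fraction(n, p, seed=0):
    """Sample G(n, p) and return the fraction of vertices in its largest
    connected component, using a union-find structure."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        # Find the representative of x, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:          # each edge appears independently
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv

    sizes = {}
    for u in range(n):
        r = find(u)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values()) / n
```

With n = 1000, one would expect a call with p = 0.5/n to return a small fraction (no giant component), a call with p = 3/n to return a fraction close to one (a giant component together with small components), and a call with p = 2 log n / n to return a value at or very near 1 (a connected graph).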
The next subject of our discussion is the diameter of a uniform random
graph G(n,p). Recall that the diameter of a disconnected graph is defined as the
maximum diameter of its connected components. When dealing with random
graphs, one usually speaks not about a certain diameter, but rather about the
distribution of the possible values of the diameter. Intuitively, one can say that this
distribution depends on the interrelationship of the parameters n and
p of the model. However, this dependency turns out to be rather complicated. It was
discussed in many papers, and the corresponding results are summarized below.
It was proved by Klee and Larman [77] that a random graph asymptotically
almost surely has the diameter d, where d is a certain integer value, if the following
conditions are satisfied:
$$\frac{(np)^{d-1}}{n} \to 0 \quad \text{and} \quad \frac{(np)^{d}}{n} \to \infty, \qquad n \to \infty.$$
Bollobás [30] proved that if $np - \log n \to \infty$ then the diameter of a random
graph is a.a.s. concentrated on no more than four values.
Łuczak [87] considered the case np < 1, when a uniform random graph a.a.s.
is disconnected and has no giant connected component. Let $\mathrm{diam}_T(G)$ denote the
maximum diameter of all connected components of G(n,p) which are trees. Then if
$(1 - np)\,n^{1/3} \to \infty$, the diameter of G(n,p) is a.a.s. equal to $\mathrm{diam}_T(G)$.
Chung and Lu [43] investigated another extreme case: $np \to \infty$. They showed
that in this case the diameter of a random graph G(n,p) is a.a.s. equal to
$$(1 + o(1))\,\frac{\log n}{\log(np)}.$$
Moreover, they considered the case when $np \ge c > 1$ for some constant c and obtained a
generalization of the above result:
$$(1 + o(1))\,\frac{\log n}{\log(np)} \le \mathrm{diam}(G(n,p)) \le \frac{\log n}{\log(np)} + 2\,\frac{10c/(\sqrt{c}-1)^2 + 1}{c - \log(2c)}\,\frac{\log n}{np} + 1.$$
Also, they explored the distribution of the diameter of a random graph with respect
to different ranges of the ratio $np/\log n$. They obtained the following results:
* For $np/\log n = c > 8$ the diameter of G(n,p) is a.a.s. concentrated on at most
two values around $\log n/\log(np)$.
* For $8 \ge np/\log n = c > 2$ the diameter of G(n,p) is a.a.s. concentrated on at
most three values around $\log n/\log(np)$.
* For $2 \ge np/\log n = c > 1$ the diameter of G(n,p) is a.a.s. concentrated on at
most four values around $\log n/\log(np)$.
* For $1 \ge np/\log n = c > c_0$ the diameter of G(n,p) is a.a.s. concentrated
on a finite number of values, and this number is at most $2\lfloor 1/c_0 \rfloor + 4$. More
specifically, in this case the following formula can be proved:
$$\left\lceil \frac{\log(c_0 n/11)}{\log(np)} \right\rceil \le \mathrm{diam}(G(n,p)) \le \left\lceil \frac{\log\left(\frac{33 c_0^2}{400}\, n \log n\right)}{\log(np)} \right\rceil + 2\left\lfloor \frac{1}{c_0} \right\rfloor + 2.$$
As pointed out above, a graph G(n,p) a.a.s. has a giant connected component
for $1 < np < \log n$. It is natural to assume that in this case the diameter of
G(n,p) is equal to the diameter of this giant connected component. However, it
was rigorously proved by Chung and Lu [43] that this is a.a.s. true only if np > 3.5128.
2.1.3.2 Potential Drawbacks of the Uniform Random Graph Model
There were some attempts to model real-life massive graphs by
uniform random graphs and to compare their behavior. However, the results of
these experiments demonstrated a significant discrepancy between the properties of
real graphs and corresponding uniform random graphs.
The further discussion analyzes the potential drawbacks of applying the
uniform random graph model to real-life massive graphs.
Though uniform random graphs demonstrate some properties similar to
those of real-life massive graphs, many problems arise when one tries to describe
real graphs using the uniform random graph model. As mentioned above,
a giant connected component a.a.s. emerges in a uniform random graph at a
certain threshold. This looks very similar to the properties of the real massive graphs
discussed in Subsection 2.1.3. However, on closer inspection, it can be seen that
the giant connected components in uniform random graphs and in real-life
massive graphs have different structures. The fundamental difference between
them is as follows: it was noticed that almost all real massive graphs exhibit the
property of so-called clustering [116, 117]. It means that the probability
that two given vertices are connected by an edge is higher if these
vertices have a common neighbor (i.e., a vertex which is connected by an edge
with both of these vertices). The probability that two neighbors of a given vertex
are connected by an edge is called the clustering coefficient. It can be easily seen
that in the case of uniform random graphs, the clustering coefficient is equal
to the parameter p, since the probability that each pair of vertices is connected
by an edge is independent of all other vertices. In real-life massive graphs, the
value of the clustering coefficient turns out to be much higher than the value of
the parameter p of the uniform random graphs with the same number of vertices
and edges. Adamic [5] found that the value of the clustering coefficient for some
part of the Web graph was approximately 0.1078, while the clustering coefficient for
the corresponding uniform random graph was 0.00023. Pastor-Satorras et al. [103]
obtained similar results for a part of the Internet graph: the values of the clustering
coefficients for the real graph and the corresponding uniform random graph were
0.24 and 0.0006, respectively.
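The claim that the clustering coefficient of a uniform random graph equals p can be checked empirically. The Python sketch below is illustrative: the function names `gnp` and `clustering_coefficient` and the parameter values are assumptions, and the coefficient is estimated as the fraction of pairs of neighbors of a common vertex that are themselves adjacent, taken over the whole graph.

```python
import random
from itertools import combinations


def gnp(n, p, seed=0):
    """Sample a uniform random graph G(n, p) as a list of neighbor sets."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def clustering_coefficient(adj):
    """Fraction of pairs of neighbors of a common vertex that are adjacent."""
    closed = wedges = 0
    for nbrs in adj:
        for a, b in combinations(sorted(nbrs), 2):
            wedges += 1            # a and b share this vertex as a neighbor
            if b in adj[a]:
                closed += 1        # and they are also directly connected
    return closed / wedges if wedges else 0.0
```

For example, `clustering_coefficient(gnp(400, 0.05))` should come out close to 0.05, whereas the real networks quoted above have coefficients that exceed the corresponding value of p by several orders of magnitude.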
Another significant problem arising in modeling massive graphs using the
uniform random graph model is the difference in degree distributions. It can
be shown that as the number of vertices in a uniform random graph increases,
the distribution of the degrees of the vertices tends to the well-known Poisson
distribution with the parameter np, which represents the average degree of a vertex.
However, as pointed out in Subsection 2.1.3, the experiments show that
in real massive graphs degree distributions obey a power law. These facts
demonstrate that some other models are needed to better describe the properties
of real massive graphs. Next, we discuss two such models: the random
graph model with a given degree sequence and its most important special case, the
power-law model.
2.1.3.3 Random Graphs with a Given Degree Sequence
Besides the uniform random graphs, there are more general ways of modeling
massive graphs. These models deal with random graphs with a given degree
sequence. The main idea of how to construct these graphs is as follows. For all the
vertices i = 1, ..., n the set of the degrees $\{k_i\}$ is specified. This set is chosen so that
the fraction of vertices that have degree k tends to the desired degree distribution
$p_k$ as n increases.
It turns out that some properties of the uniform random graphs can be
generalized for the model of a random graph with a given degree sequence.
Recall the notion of the so-called "phase transition" (i.e., the phenomenon when
at a certain point a giant connected component emerges in a random graph), which
happens in the uniform random graphs. It turns out that a similar thing takes
place in the case of a random graph with a given degree sequence. This result was
obtained by Molloy and Reed [98]. The essence of their findings is as follows.
Consider a sequence of nonnegative real numbers $p_0, p_1, \ldots$ such that
$\sum_k p_k = 1$. Assume that a graph G with n vertices has approximately $p_k n$ vertices
of degree k. If we define $Q = \sum_{k \ge 1} k(k-2)\,p_k$, then it can be proved that G a.a.s.
has a giant connected component if Q > 0 and there is a.a.s. no giant connected
component if Q < 0.
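The criterion of Molloy and Reed is easy to evaluate for a concrete degree distribution. The Python sketch below is illustrative; the function names `molloy_reed_q` and `poisson_pmf` and the truncation point `kmax` are assumptions. It uses the Poisson distribution with mean c, the limiting degree distribution of a uniform random graph with np = c, for which Q = c(c - 1), so the sign of Q changes exactly at c = 1.

```python
import math


def molloy_reed_q(p):
    """Q = sum over k of k (k - 2) p_k, for a distribution given as {k: p_k}."""
    return sum(k * (k - 2) * pk for k, pk in p.items())


def poisson_pmf(mean, kmax=100):
    """Poisson probabilities p_0, ..., p_kmax (the tail beyond kmax is dropped)."""
    pk = math.exp(-mean)
    out = {}
    for k in range(kmax + 1):
        out[k] = pk
        pk *= mean / (k + 1)
    return out
```

For instance, `molloy_reed_q(poisson_pmf(0.5))` is approximately -0.25 (no giant component), while `molloy_reed_q(poisson_pmf(2.0))` is approximately 2 (a giant component exists), in agreement with the threshold np = 1 for uniform random graphs.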
As a development of the analysis of random graphs with a given degree
sequence, the work of Cooper and Frieze [45] should be mentioned. They considered
a sparse directed random graph with a given degree sequence and analyzed its
strong connectivity. In the study, the size of the giant strongly connected
component, as well as the conditions of its existence, were discussed.
The results obtained for the model of random graphs with a given degree
sequence are especially useful because they can be applied to some important
special cases of this model. For instance, the classical results on the size of a
connected component in uniform random graphs follow from the aforementioned
fact presented by Molloy and Reed. Next, we present another example of applying
this general result to one of the most widely used random graph models, the
power-law model.
2.1.3.4 Power-Law Random Graphs
One of the most important special cases of the model of random graphs with
a given degree sequence is the power-law random graph model, which represents
the class of random graphs with a power-law degree sequence. This model
theoretically describes the properties of power-law graphs that were mentioned
above. Some important results for this model were obtained by Aiello, Chung and
Lu [8, 9].
The power-law random graph model (also referred to as $P(\alpha, \beta)$) assigns two
parameters characterizing a power-law random graph. If we define y to be the
number of nodes with degree x, then according to this model
$$y = e^{\alpha} / x^{\beta}. \tag{2-1}$$
Equivalently, we can write
$$\log y = \alpha - \beta \log x. \tag{2-2}$$
Similarly to formulas in Chapter 1, the relationship between y and x can be plotted
as a straight line on a log-log scale, so that $(-\beta)$ is the slope, and $\alpha$ is the intercept.
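The following properties of a graph described by the power-law random graph
model [8] are valid:
* The maximum degree of the graph is $e^{\alpha/\beta}$.
* The number of vertices is
$$n = \sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta}} \approx \begin{cases} \zeta(\beta)\,e^{\alpha}, & \beta > 1, \\ \alpha\,e^{\alpha}, & \beta = 1, \\ e^{\alpha/\beta}/(1-\beta), & 0 < \beta < 1, \end{cases} \tag{2-3}$$
where $\zeta(t)$ is the Riemann zeta function.
* The number of edges is
$$E = \frac{1}{2}\sum_{x=1}^{e^{\alpha/\beta}} x\,\frac{e^{\alpha}}{x^{\beta}} \approx \begin{cases} \frac{1}{2}\,\zeta(\beta-1)\,e^{\alpha}, & \beta > 2, \\ \frac{1}{4}\,\alpha\,e^{\alpha}, & \beta = 2, \\ \frac{1}{2}\,e^{2\alpha/\beta}/(2-\beta), & 0 < \beta < 2. \end{cases} \tag{2-4}$$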
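Since the power-law random graph model is a special case of the model of a
random graph with a given degree sequence, the results discussed above can be
applied to the power-law graphs. We need to find the threshold value of $\beta$ at which
the "phase transition" (i.e., the emergence of a giant connected component) occurs.
In this case $Q = \sum_{x \ge 1} x(x-2)\,p_x$ is defined as
$$Q = \sum_{x=1}^{e^{\alpha/\beta}} x(x-2) \left\lfloor \frac{e^{\alpha}}{x^{\beta}} \right\rfloor \approx \sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta-2}} - 2\sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta-1}} \approx \left[\zeta(\beta-2) - 2\zeta(\beta-1)\right] e^{\alpha} \quad \text{for } \beta > 3.$$
Hence, the threshold value $\beta_0$ can be found from the equation
$$\zeta(\beta-2) - 2\zeta(\beta-1) = 0,$$
which yields $\beta_0 \approx 3.47875$.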
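The threshold value can be reproduced numerically. The following Python sketch is an illustration and not the computation used in [8]: the function names `zeta`, `q_sign_function`, and `threshold_beta`, the number of summed terms, and the bisection bracket [3.1, 4.0] are assumptions. The zeta function is approximated by a partial sum plus the leading terms of the Euler-Maclaurin tail estimate, and the root of the equation is located by bisection.

```python
def zeta(s, terms=1000):
    """Riemann zeta function for real s > 1: the sum of k^(-s) for
    k < terms, plus an Euler-Maclaurin estimate of the remaining tail."""
    n = terms
    head = sum(k ** -s for k in range(1, n))
    tail = n ** (1 - s) / (s - 1) + 0.5 * n ** -s + s * n ** (-s - 1) / 12
    return head + tail


def q_sign_function(beta):
    """zeta(beta - 2) - 2 zeta(beta - 1); has the sign of Q for beta > 3."""
    return zeta(beta - 2) - 2 * zeta(beta - 1)


def threshold_beta(lo=3.1, hi=4.0, tol=1e-10):
    """Bisection for the root of q_sign_function between lo and hi.
    The function is positive at lo and negative at hi."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if q_sign_function(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Calling `threshold_beta()` should return a value that agrees with 3.47875 to the digits quoted above. The same `zeta` helper gives `zeta(2.1)` of about 1.56, the factor that appears when formula (2-3) is used to estimate the number of vertices for $\beta = 2.1$.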
The results on the size of the connected component of a power-law graph were
presented by Aiello et al. [8]. These results are summarized below.
* If $0 < \beta < 1$, then a power-law graph is a.a.s. connected (i.e., there is only one
connected component of size n).
* If $1 < \beta < 2$, then a power-law graph a.a.s. has a giant connected component
(the component size is $\Theta(n)$), and the second largest connected component
a.a.s. has a size $\Theta(1)$.
* If $2 < \beta < \beta_0 = 3.47875$, then a giant connected component a.a.s. exists, and
the size of the second largest component a.a.s. is $\Theta(\log n)$.
* $\beta = 2$ is a special case when there is a.a.s. a giant connected component, and
the size of the second largest connected component is $\Theta(\log n/\log\log n)$.
* If $\beta > \beta_0 = 3.47875$, then there is a.a.s. no giant connected component.
The power-law random graph model was developed for describing real-life
massive graphs. So the natural question is how well it reflects the properties of
these graphs.
Though this model certainly does not reflect all the properties of real massive
graphs, it turns out that massive graphs such as the call graph or the Internet
graph can be described fairly well by the power-law model. The following example
demonstrates this.
Aiello, Chung and Lu [8] investigated the same call graph that was analyzed
by Abello et al. [2]. This massive graph was already discussed in Subsection 2.1.3,
so it is interesting to compare the experimental results presented by Abello et
al. [2] with the theoretical results obtained in [8] using the power-law random graph
model.
Figure 2-2 shows the number of vertices in the call graph with certain
indegrees and outdegrees. Recall that according to the power-law model the
dependency between the number of vertices and the corresponding degrees can
be plotted as a straight line on a log-log scale, so one can approximate the real
data shown in Figure 2-2 by a straight line and evaluate the parameters $\alpha$ and $\beta$
using the values of the intercept and the slope of the line. The value of $\beta$ for the
indegree data was estimated to be approximately 2.1, and the value of $e^{\alpha}$ was
approximately $30 \times 10^6$. The total number of nodes can be estimated using formula
(2-3) as $\zeta(2.1) \times e^{\alpha} \approx 1.56 \times e^{\alpha} \approx 47 \times 10^6$ (compare with Subsection 2.1.3).
According to the results for the size of the largest connected component
presented above, a power-law graph with $1 < \beta < 3.47875$ a.a.s. has a giant
connected component. Since $\beta \approx 2.1$ falls in this range, this result agrees
with the real observations for the call graph (see Subsection 2.1.3).
Another aspect that is worth mentioning is how to generate power-law graphs.
The methodology for doing it was discussed in detail in the literature [9, 44]. These
papers use a similar approach, which is referred to as a random graph evolution
process. The main idea is to construct a power-law massive graph "step by step":
at each time step, a node and an edge are added to a graph in accordance with
certain rules in order to obtain a graph with a specified indegree and outdegree
power-law distribution. The indegree and outdegree parameters of the resulting
power-law graph are functions of the input parameters of the model. A simple
evolution model was presented by Kumar et al. [81]. Aiello, Chung and Lu [9]
developed four more advanced models for generating both directed and undirected
power-law graphs with different distributions of indegrees and outdegrees. As
an example, we will briefly describe one of their models. It was the basic model
developed in the paper, and the other three models were improvements
and generalizations of this model.
The main idea of the considered model is as follows. At the first time moment
a vertex is added to the graph, and it is assigned two parameters, the in-weight
and the out-weight, both equal to 1. Then at each time step t + 1 a new vertex
with in-weight 1 and out-weight 1 is added to the graph with probability $1 - \alpha$,
and a new directed edge is added to the graph with probability $\alpha$. The origin and
destination vertices are chosen according to the current values of the in-weights
and out-weights. More specifically, a vertex u is chosen as the origin of this edge
with the probability proportional to its current out-weight, which is defined as
$w^{out}_{u,t} = 1 + \delta^{out}_{u,t}$, where $\delta^{out}_{u,t}$ is the outdegree of the vertex u at time t. Similarly,
a vertex v is chosen as the destination with the probability proportional to its
current in-weight $w^{in}_{v,t} = 1 + \delta^{in}_{v,t}$, where $\delta^{in}_{v,t}$ is the indegree of v at time t. From
the above description it can be seen that at time t the total in-weight and the total
out-weight are both equal to t. So for each particular pair of vertices u and v, the
probability that an edge going from u to v is added to the graph at time t is equal
to
$$\alpha\,\frac{(1 + \delta^{out}_{u,t})(1 + \delta^{in}_{v,t})}{t^2}.$$
In the above notation, the parameter $\alpha$ is the input parameter of the model.
The output of this model is a power-law random graph with the parameter of
the degree distribution being a function of the input parameter. In the case of
the considered model, it was shown that it generates a power-law graph with the
distribution of indegrees and outdegrees having the parameter $1 + 1/\alpha$.
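The evolution process just described can be simulated in a few lines. The Python sketch below is an illustration of the basic model under stated assumptions and is not code from [9]: the function names `evolve` and `indegrees` are hypothetical, vertices are numbered from 0, and sampling proportional to weight is implemented by keeping two lists ("pools") in which every vertex appears once for its base weight and once more for each incident edge of the matching direction.

```python
import random
from collections import Counter


def evolve(steps, alpha, seed=0):
    """Run `steps` time steps of the basic evolution model.
    Returns (number of vertices, list of directed edges (u, v))."""
    rng = random.Random(seed)
    # A uniform draw from out_pool picks u with probability proportional
    # to 1 + outdegree(u); in_pool does the same for 1 + indegree(v).
    out_pool, in_pool = [0], [0]      # the initial vertex, both weights 1
    n_vertices, edges = 1, []
    for _ in range(steps):
        if rng.random() < 1 - alpha:
            # Add a new vertex with in-weight 1 and out-weight 1.
            out_pool.append(n_vertices)
            in_pool.append(n_vertices)
            n_vertices += 1
        else:
            # Add a directed edge; endpoints are chosen by current weights.
            u = rng.choice(out_pool)
            v = rng.choice(in_pool)
            edges.append((u, v))
            out_pool.append(u)        # out-weight of u grows by 1
            in_pool.append(v)         # in-weight of v grows by 1
    return n_vertices, edges


def indegrees(edges):
    """Map each vertex with at least one incoming edge to its indegree."""
    return Counter(v for _, v in edges)
```

Both pools grow by exactly one entry per time step, which mirrors the observation that the total in-weight and the total out-weight at time t are both equal to t. Plotting the number of vertices against indegree on a log-log scale for a long run, for example `evolve(200000, 0.5)`, is one way to compare the simulated slope with the predicted parameter $1 + 1/\alpha$.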
The notion of the so-called scale invariance [20, 21] must also be mentioned.
This concept arises from the following considerations. The evolution of massive
graphs can be treated as the process of growing the graph by one time unit at a time.
Now, if we replace all the nodes that were added to the graph during the same unit of
time by only one node, then we will get another graph of a smaller size. The bigger the
time unit is, the smaller the new graph size will be. The evolution model is called
scale-free (scale-invariant) if with high probability the new (scaled) graph has the
same power-law distribution of indegrees and outdegrees as the original graph, for
any choice of the time unit length. It turns out that most of the random evolution
models have this property. For instance, the models of Aiello et al. [9] were proved
to be scale-invariant.
2.1.4 Optimization in Random Massive Graphs
Recent random graph models of real-life massive networks, some of which
were mentioned in Subsection 2.1.3, increased interest in various properties of
random graphs and methods used to discover these properties. Indeed, numerical
characteristics of graphs, such as clique and chromatic numbers, could be used as
one of the steps in validation of the proposed models. In this regard, the expected
clique number of power-law random graphs is of special interest due to the results
by Abello et al. [2] and Aiello et al. [9] mentioned in Subsections 2.1.1 and 2.1.3.
If computed, it could be used as one of the points in verifying the validity of the
model for the call graph proposed by Aiello et al. [9].
In this subsection we present some well-known facts regarding the clique and
chromatic numbers in uniform random graphs.
2.1.4.1 Clique Number
The earliest results describing the properties of cliques in uniform random
graphs are due to Matula [93], who noticed that for a fixed p almost all graphs
$G \in \mathcal{G}(n,p)$ have about the same clique number, if n is sufficiently large. Bollobás
and Erdős [32] further developed these remarkable results by proving some more
specific facts about the clique number of a random graph. Let us discuss these
results in more detail by presenting not only the facts but also some reasoning
behind them. For more detail see the books by Bollobás [29, 30] and Janson et al. [73].
Assume that 0 < p < 1 is fixed. Then instead of the sequence of spaces
$\{\mathcal{G}(n,p),\ n \ge 1\}$ one can work with the single probability space $\mathcal{G}(\mathbb{N},p)$ containing
graphs on $\mathbb{N}$ with the edges chosen independently with probability p. In this
way, $\mathcal{G}(n,p)$ becomes an image of $\mathcal{G}(\mathbb{N},p)$, and the term "almost every" is used
in its usual measure-theoretic sense. For a graph $G \in \mathcal{G}(\mathbb{N},p)$ we denote by $G_n$ the
subgraph of G induced by the first n vertices {1, 2, ..., n}. Then the sequence
$\omega(G_n)$ appears to be almost completely determined for a.e. $G \in \mathcal{G}(\mathbb{N},p)$.
For a natural l, let us denote by $k_l(G_n)$ the number of cliques spanning l
vertices of $G_n$. Then, obviously,
$$\omega(G_n) = \max\{l : k_l(G_n) > 0\}.$$
When l is small, the random variable $k_l(G_n)$ has a large expectation and a rather
small variance. If l is increased, then for most values of n there exists some number
$l_0$ for which the expectation of $k_{l_0}(G_n)$ is fairly large (> 1) and that of $k_{l_0+1}(G_n)$ is much
smaller than 1. Therefore, if we find this value $l_0$, then $\omega(G_n) = l_0$ with a high
probability. The expectation of $k_l(G_n)$ can be calculated as
$$E(k_l(G_n)) = \binom{n}{l}\,p^{l(l-1)/2}.$$
Denoting $f(l) = E(k_l(G_n))$ and replacing $\binom{n}{l}$ by its Stirling approximation we
obtain
$$f(l) \approx \frac{n^{n+1/2}}{\sqrt{2\pi}\,(n-l)^{n-l+1/2}\,l^{l+1/2}}\;p^{l(l-1)/2}.$$
Solving the equation f(l) = 1 we get the following approximation $l_0$ of the root:
$$l_0 = 2\log_{1/p} n - 2\log_{1/p}\log_{1/p} n + 2\log_{1/p}(e/2) + 1 + o(1) = 2\log_{1/p} n + O(\log\log n). \tag{2-5}$$
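The first-moment argument can be checked numerically for moderate n. The Python sketch below is illustrative; the function names `expected_cliques`, `l0_exact`, and `l0_asymptotic` are assumptions. It compares the largest l whose expected number of cliques is at least 1 with the asymptotic expression for $l_0$, in which the o(1) term is simply dropped.

```python
import math


def expected_cliques(n, l, p):
    """E(k_l(G_n)) = C(n, l) * p^(l (l - 1) / 2)."""
    return math.comb(n, l) * p ** (l * (l - 1) // 2)


def l0_exact(n, p):
    """Largest l for which the expected number of l-cliques is at least 1."""
    l = 1
    while l < n and expected_cliques(n, l + 1, p) >= 1:
        l += 1
    return l


def l0_asymptotic(n, p):
    """Asymptotic approximation of l_0 with the o(1) term omitted."""
    def log_b(x):
        return math.log(x) / math.log(1 / p)   # logarithm to base 1/p
    return 2 * log_b(n) - 2 * log_b(log_b(n)) + 2 * log_b(math.e / 2) + 1
```

For n = 1000 and p = 0.5, `l0_exact` returns 15 (the expected number of 15-cliques is about 17, while the expected number of 16-cliques is about 0.03), and `l0_asymptotic` gives approximately 15.18, so the two agree after rounding down.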
Using this observation and the second moment method, Bollobás and
Erdős [32] proved that if p = p(n) satisfies $n^{-\epsilon} < p < c$ for every $\epsilon > 0$ and some
c < 1, then there exists a function $cl : \mathbb{N} \to \mathbb{N}$ such that a.a.s.
$$cl(n) \le \omega(G_n) \le cl(n) + 1,$$
i.e., the clique number is asymptotically distributed on at most two values. The
sequence cl(n) appears to be close to $l_0(n)$ computed in (2-5). Namely, it can be
shown that for a.e. $G \in \mathcal{G}(\mathbb{N},p)$, if n is large enough, then
$$\lfloor l_0(n) - 2\log\log n/\log n \rfloor \le \omega(G_n) \le \lfloor l_0(n) + 2\log\log n/\log n \rfloor$$
and
$$\left|\,\omega(G_n) - 2\log_{1/p} n + 2\log_{1/p}\log_{1/p} n - 2\log_{1/p}(e/2) - 1\,\right| < \frac{3}{2}.$$
Frieze [56] and Janson et al. [73] extended these results by showing that for every $\epsilon > 0$
there exists a constant $c_\epsilon$ such that for $c_\epsilon/n \le p(n) \le \log^{-2} n$ a.a.s.
$$\lfloor 2\log_{1/p} n - 2\log_{1/p}\log_{1/p} n + 2\log_{1/p}(e/2) + 1 - \epsilon/p \rfloor \le \omega(G_n) \le \lfloor 2\log_{1/p} n - 2\log_{1/p}\log_{1/p} n + 2\log_{1/p}(e/2) + 1 + \epsilon/p \rfloor.$$
2.1.4.2 Chromatic Number
Grimmett and McDiarmid [61] were the first to study the problem of coloring
random graphs. Many other researchers contributed to solving this problem [12,
31]. We will mention some facts that emerged from these studies.
Łuczak [85] improved the results about the concentration of $\chi(G(n,p))$
previously proved by Shamir and Spencer [110], proving that for every sequence
p = p(n) such that $p \le n^{-6/7}$ there is a function ch(n) such that a.a.s.
$$ch(n) \le \chi(G(n,p)) \le ch(n) + 1.$$
Alon and Krivelevich [12] proved that for any positive constant $\delta$ the chromatic
number of a uniform random graph G(n,p), where $p = n^{-1/2-\delta}$, is a.a.s. concentrated
in two consecutive values. Moreover, they proved that a proper choice of p(n) may
result in a one-point distribution. The function ch(n) is difficult to find, but in
some cases it can be characterized. For example, Janson et al. [73] proved that
there exists a constant $c_0$ such that for any p = p(n) satisfying $c_0/n \le p \le \log^{-7} n$,
a.a.s.
$$\frac{np}{2\log np - 2\log\log np + 1} \le \chi(G(n,p)) \le \frac{np}{2\log np - 40\log\log np}.$$
In the case when p is constant, Bollobás' method utilizing martingales [30] yields
the following estimate:
$$\chi(G(n,p)) = \frac{n}{2\log_b n - 2\log_b\log_b n + O(1)},$$
where $b = 1/(1-p)$.
2.1.5 Remarks
We discussed advances in several research directions dealing with massive
graphs, such as external memory algorithms and modeling of massive networks
as random graphs with power-law degree distributions. Despite the evidence that
uniform random graphs are hardly suitable for modeling the considered real-life
graphs, the classical random graph theory may still serve as a great source of
ideas in studying properties of massive graphs and their models. We recalled
some well-known results produced by the classical random graph theory. These
include results on the concentration of the clique number and the chromatic number
of random graphs, which would be interesting to extend to more complicated random
graph models (i.e., power-law graphs and graphs with arbitrary degree distributions).
External memory algorithms and numerical optimization techniques could be
applied to find an approximate value of the clique number (as discussed
in Subsection 2.1.1). On the other hand, probabilistic methods similar to those
discussed in Subsection 2.1.4 could be utilized in order to find the asymptotic
distribution of the clique number in the same network's random graph model, and
therefore verify this model.
CHAPTER 3
NETWORK-BASED APPROACHES TO MINING STOCK MARKET DATA
One of the most important problems in modern finance is finding efficient
ways of summarizing and visualizing the stock market data that would allow
one to obtain useful information about the behavior of the market. Nowadays, a
great number of stocks are traded in the US stock market; moreover, this number
steadily increases. The amount of data generated by the stock market every day is
enormous. This data is usually visualized by thousands of plots reflecting the price
of each stock over a certain period of time. The analysis of these plots becomes
more and more complicated as the number of stocks grows.
It turns out that the stock market data can be effectively represented as a
network, although this representation is not as obvious as in the case of telephone
traffic or internet data. We have developed a network-based model of the market
referred to as the market graph. This chapter is based on the results described in
[26, 27, 28].
A natural graph representation of the stock market is based on the
cross-correlations of price fluctuations. A market graph can be constructed as follows:
each financial instrument is represented by a vertex, and two vertices are connected
by an edge if the correlation coefficient of the corresponding pair of instruments
(calculated for a certain period of time) exceeds a specified threshold $\theta$, $-1 \le \theta \le 1$.
Nowadays, a great number of different instruments are traded in the US stock
market, so the market graph representing them is very large. The market graph
that we construct has 6546 vertices and several million edges.
In this chapter, we present a detailed study of the properties of this graph. It
turns out that the market graph can be rather accurately described by the
power-law model. We analyze the distribution of the degrees of the vertices in this graph,
the edge density of this graph with respect to the correlation threshold, as well as
its connectivity and the size of its connected components.
Furthermore, we look for maximum cliques and maximum independent sets in
this graph for different values of the correlation threshold. Analyzing cliques and
independent sets in the market graph gives us valuable knowledge about
the internal structure of the stock market. For instance, a clique in this graph
represents a set of financial instruments whose prices change similarly over time
(a change of the price of any instrument in a clique is likely to affect all other
instruments in this clique), and an independent set consists of instruments that are
negatively correlated with respect to each other; therefore, it can be treated as a
diversified portfolio. Based on the information obtained from this analysis, we will
be able to classify financial instruments into certain groups, which will give us a
deeper insight into the stock market structure.
3.1 Structure of the Market Graph
3.1.1 Constructing the Market Graph
The market graph that we study in this chapter represents the set of financial
instruments traded in the US stock markets. More specifically, we consider
6546 instruments and analyze daily changes of their prices over a period of 500
consecutive trading days in 2000-2002. Based on this information, we calculate the
cross-correlations between each pair of stocks using the following formula [92]:
\[ C_{ij} = \frac{\langle R_i R_j \rangle - \langle R_i \rangle \langle R_j \rangle}{\sqrt{\langle R_i^2 - \langle R_i \rangle^2 \rangle \langle R_j^2 - \langle R_j \rangle^2 \rangle}}, \]
where $R_i(t) = \ln \frac{P_i(t)}{P_i(t-1)}$ defines the return of the stock $i$ for day $t$, and $P_i(t)$ denotes the
price of the stock $i$ on day $t$.
Figure 3-1. Distribution of correlation coefficients in the stock market
The correlation coefficients $C_{ij}$ can vary from $-1$ to 1. Figure 3-1 shows
the distribution of the correlation coefficients based on the price data for the
years 2000-2002. It can be seen that this plot has a shape similar to the normal
distribution with the mean 0.05.
The main idea of constructing a market graph is as follows. Let the set of
financial instruments represent the set of vertices of the graph. Also, we specify a
certain threshold value $\theta$, $-1 \le \theta \le 1$, and add an undirected edge connecting the
vertices $i$ and $j$ if the corresponding correlation coefficient $C_{ij}$ is greater than or
equal to $\theta$. Obviously, different values of $\theta$ define the market graphs with the same
set of vertices, but different sets of edges.
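The construction just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used in this study: the function names, the dictionary-of-price-lists input, and the use of plain Python lists are all hypothetical choices, and the averages in the correlation formula are taken over the whole supplied period.

```python
import math

def log_returns(prices):
    """R(t) = ln(P(t) / P(t-1)) for one price series."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def correlation(x, y):
    """C = (<xy> - <x><y>) / sqrt(<x^2 - <x>^2> <y^2 - <y>^2>).

    Raises ZeroDivisionError if either series has zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mx * my
    vx = sum(a * a for a in x) / n - mx * mx
    vy = sum(b * b for b in y) / n - my * my
    return cov / math.sqrt(vx * vy)

def market_graph(price_series, theta):
    """price_series: dict symbol -> list of prices (equal lengths).

    Returns dict symbol -> set of neighbors; an edge (a, b) is added
    when the correlation of the two return series is >= theta.
    """
    returns = {s: log_returns(p) for s, p in price_series.items()}
    symbols = list(returns)
    adj = {s: set() for s in symbols}
    for i, a in enumerate(symbols):
        for b in symbols[i + 1:]:
            if correlation(returns[a], returns[b]) >= theta:
                adj[a].add(b)
                adj[b].add(a)
    return adj
```

The pairwise loop is quadratic in the number of instruments, which is acceptable for a sketch; for thousands of instruments a vectorized correlation matrix would be the more practical choice.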
It is easy to see that the number of edges in the market graph decreases as the
threshold value $\theta$ increases. In fact, our experiments show that the edge density
of the market graph decreases exponentially w.r.t. $\theta$. The corresponding graph is
presented in Figure 3-2.
Figure 3-2. Edge density of the market graph for different values of the correlation
threshold.
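As a small illustration of how the edge density depends on the threshold, the sketch below counts the fraction of pairs whose correlation reaches a given threshold. The function name and the pair-keyed dictionary of correlations are assumptions made for this example only.

```python
def edge_density(correlations, theta):
    """Fraction of instrument pairs connected at threshold theta.

    correlations: dict mapping each unordered pair (i, j) to C_ij,
    with exactly one entry per pair.
    """
    connected = sum(1 for c in correlations.values() if c >= theta)
    return connected / len(correlations)
```

Evaluating this function on an increasing sequence of thresholds gives a non-increasing sequence of densities, which is the monotonic behavior noted above; the exponential rate of decrease is an empirical property of the data and is not implied by the code.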
3.1.2 Connectivity of the Market Graph
In Subsection 2.1.3 we mentioned the connectivity thresholds in random
graphs. The main idea of this concept is finding a threshold value of the parameter
of the model that will define if the graph is connected or not.
A similar question arises for the market graph: what is its connectivity
threshold? Since the number of edges in the market graph depends on the chosen
correlation threshold $\theta$, we should find a value $\theta_0$ that determines the connectivity
of the graph. As it was mentioned above, the smaller value of $\theta$ we choose, the
more edges the market graph will have. So, if we decrease $\theta$, after a certain point,
the graph will become connected. We have conducted a series of computational
experiments for checking the connectivity of the market graph using the
breadth-first search technique, and we obtained a relatively accurate approximation of the
connectivity threshold: $\theta_0 \approx 0.14382$. Moreover, we investigated the dependency
of the size of the largest connected component in the market graph w.r.t. $\theta$. The
corresponding plot is shown in Figure 3-3.
Figure 3-3. Plot of the size of the largest connected component in the market
graph as a function of correlation threshold $\theta$.
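The connectivity check by breadth-first search can be sketched as follows. This is a generic textbook version, assuming the graph is stored as a dictionary of neighbor sets; the function names are hypothetical.

```python
from collections import deque

def largest_component_size(adj):
    """adj: dict vertex -> iterable of neighbors (undirected graph)."""
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search over the component containing `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best

def is_connected(adj):
    """A graph is connected when one component contains every vertex."""
    return largest_component_size(adj) == len(adj)
```

Running `is_connected` on market graphs built for a decreasing sequence of thresholds would locate the point at which the graph first becomes connected; a bisection over the threshold is one possible way to refine the estimate.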
3.1.3 Degree Distribution of the Market Graph
The next important subject of our interest is the distribution of the degrees
of the vertices in the market graph. We have conducted several computational
experiments with different values of the correlation threshold $\theta$, and these results
are presented below.
It turns out that if a small (in absolute value) correlation threshold $\theta$ is
specified, the distribution of the degrees of the vertices does not have any well-defined
structure. Note that for these values of $\theta$ the market graph has a relatively high
edge density (i.e., the ratio of the number of edges to the maximum possible
number of edges). However, as the correlation threshold is increased, the degree
distribution more and more resembles a power law. In fact, for $\theta \ge 0.2$ this
distribution is approximately a straight line in the logarithmic scale, which represents
the power-law distribution, as it was mentioned above. Figure 3-4 demonstrates
the degree distributions of the market graph for some positive values of the
correlation threshold, along with the corresponding linear approximations. The slopes of
the approximating lines were estimated using the least-squares method. Table 3-1
summarizes the estimates of the parameter $\gamma$ of the power-law distribution (i.e., the
slope of the line) for different values of $\theta$.
Table 3-1. Least-squares estimates of the parameter $\gamma$ in the market graph for
different values of correlation threshold (* complementary graph)
$\theta$      $\gamma$
-0.25*   1.2922
-0.2*    1.4088
-0.15*   1.4072
0.2      0.4931
0.25     0.5820
0.3      0.6793
0.35     0.7679
0.4      0.8269
0.45     0.8753
0.5      0.9054
0.55     0.9331
0.6      0.9743
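The least-squares estimate of the slope in the logarithmic scale can be illustrated as follows. The sketch assumes that the fitted quantity is the number of vertices of each degree and that $\gamma$ is reported as the negated slope; both are assumptions of this example, as is the function name.

```python
import math
from collections import Counter

def powerlaw_exponent(degrees):
    """Least-squares slope of log10(count) versus log10(degree), negated.

    Vertices of degree 0 are skipped because log10(0) is undefined.
    At least two distinct positive degrees are required.
    """
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c) for c in counts.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return -sxy / sxx
```

For a degree sequence in which the number of vertices of degree $k$ is exactly proportional to $k^{-2}$, the function returns 2 up to rounding error. On real data the estimate is sensitive to the sparse high-degree tail, so binning or restricting the fitted range may be advisable.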
From this table, it can be seen that the slope of the lines corresponding to
positive values of $\theta$ is rather small. According to the power-law model, in this
case a graph would have many vertices with high degrees; therefore, one can
intuitively expect to find large cliques in a power-law graph with a small value of
the parameter $\gamma$.
Figure 3-4. Degree distribution of the market graph for $\theta = 0.4$ (left); $\theta = 0.5$
(right) (logarithmic scale)
We also analyze the degree distribution of the complement of the market
graph, which is defined as follows: an edge connects instruments $i$ and $j$ if the
correlation coefficient between them satisfies $C_{ij} < \theta$. Studying this complementary graph is
important for the next subject of our consideration: finding maximum independent
sets in the market graph with negative values of the correlation threshold $\theta$.
Obviously, a maximum independent set in the initial graph is a maximum clique
in the complement, so the maximum independent set problem can be reduced
to the maximum clique problem in the complementary graph. Therefore, it is
useful to investigate the degree distributions of the complementary graphs for
different values of $\theta$. As can be seen from Figure 3-1, the distribution of the
correlation coefficients is nearly symmetric around $\theta = 0.05$, so for the values of
$\theta$ close to 0 the edge density of both the initial and the complementary graph is
rather high. For these values of $\theta$ the degree distribution of a complementary
graph also does not seem to have any well-defined structure, as in the case of the
corresponding initial graph. As $\theta$ decreases (i.e., increases in absolute value),
the degree distribution of a complementary graph starts to follow the power law.
Figure 3-5 shows the degree distributions of the complementary graph, along with
the least-squares linear regression lines. However, as one can see from Table 3-1,
the slopes of these lines are higher than in the case of the graphs with positive
values of $\theta$, which implies that there are fewer vertices with a high degree in these
graphs, so intuitively, the size of cliques in a complementary graph (i.e., the size
of independent sets in the original graph) should be significantly smaller than in
the case of the market graph with positive values of the correlation threshold (see
Section 3.2).
Figure 3-5. Degree distribution of the complementary market graph for $\theta = -0.15$
(left); $\theta = -0.2$ (right) (logarithmic scale)
3.1.4 Instruments Corresponding to High-Degree Vertices
Up to this point, we studied the properties of the market graph as one big
system, and did not consider the characteristics of every vertex in this graph.
However, an important practical issue is to look at the degree of each vertex in
the market graph and to find the vertices with high degrees, i.e. the stocks that
are highly correlated with many other instruments in the market. Clearly, this
information will help us to answer the question: which instruments most accurately
reflect the behavior of the market?
For this purpose, we chose the market graph with a high correlation threshold
($\theta = 0.6$), calculated the degree of each vertex in this graph and sorted the vertices
in the decreasing order of their degrees.
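The ranking step is straightforward; a minimal sketch is given below, assuming the graph is stored as a dictionary of neighbor sets (the function name and the default of 25 entries are choices made for this illustration).

```python
def top_degree_vertices(adj, k=25):
    """Return the k (vertex, degree) pairs of highest degree.

    adj: dict vertex -> set of neighbors. Ties keep the insertion
    order of `adj`, because Python's sort is stable.
    """
    ranked = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    return [(v, len(adj[v])) for v in ranked[:k]]
```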
Interestingly, even though the edge density of the considered graph is only
0.04% (only highly correlated instruments are connected by an edge), there are
many vertices with degrees greater than 100.
According to our calculations, the vertex with the highest degree in this
market graph corresponds to the NASDAQ 100 Index Tracking Stock. The degree
of this vertex is 216, which means that there are 216 instruments that are highly
correlated with it. An interesting observation is that the degree of this vertex is
more than twice the number of companies whose stock prices the NASDAQ index
reflects, which means that these 100 companies greatly influence the market.
In Table 3-2 we present the "top 25" instruments in the U.S. stock market,
according to their degrees in the considered market graph. The corresponding
symbol definitions can be found on several websites, for example
http://www.nasdaq.com. Note that most of them are indices that incorporate
a number of different stocks of the companies in different industries. Although
this result is not surprising from the financial point of view, it is important as a
practical justification of the market graph model.
Table 3-2. Top 25 instruments with highest degrees in the market graph ($\theta = 0.6$)
symbol   vertex degree
QQQ      216
IWF      193
IWO      193
IYW      193
XLK      181
IVV      175
MDY      171
SPY      162
IJH      159
IWV      158
IVW      156
IAH      155
IYY      154
IWB      153
IYV      150
BDH      144
MKH      143
IWM      142
IJR      134
SMH      130
STM      118
IIH      116
IVE      113
DIA      106
IWD      106
3.1.5 Clustering Coefficients in the Market Graph
Next, we calculate the clustering coefficients in the original and complementary
market graphs for different values of $\theta$. The clustering coefficient is defined
as the probability that for a given vertex its two neighbors are connected by an
edge. Interestingly, clustering coefficients in the original market graph are large
even for high correlation thresholds; however, in the complementary graphs with
a negative correlation threshold the values of the clustering coefficient turned out
to be very close to 0. These results are summarized in Table 3-3. For instance, as
one can see from this table, the market graph with $\theta = 0.6$ has almost the same
edge density as the complementary market graph with $\theta = -0.15$; however, their
clustering coefficients differ dramatically. This fact also intuitively explains the
results presented in the next section, which deals with cliques and independent sets
in the market graph.
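The clustering coefficient defined above can be computed as in the following sketch. It assumes that the single value reported for a graph is the average of the per-vertex coefficients, and that vertices with fewer than two neighbors contribute 0; the text does not state either convention, so both are assumptions of this example.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are themselves adjacent."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """adj: dict vertex -> set of neighbors (undirected, non-empty graph)."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)
```

Comparing this value with the edge density of the same graph shows whether neighbors of a common vertex are connected more often than an arbitrary pair of vertices, which is the comparison made in Table 3-3.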
3.2 Analysis of Cliques and Independent Sets in the Market Graph
In this section, we discuss the methods of finding maximum cliques and
maximum independent sets in the market graph and analyze the obtained results.
The maximum clique problem (as well as the maximum independent set
problem) is known to be NP-hard [59]. Moreover, it turns out that the maximum
clique is difficult to approximate [18, 62]. This makes these problems especially
challenging in large graphs. However, as we will see in the next subsection, even
though the maximum clique problem is generally very hard to solve in large graphs,
the special structure of the market graph allows us to find the exact solution
relatively easily.
Table 3-3. Clustering coefficients of the market graph (* complementary graph)
$\theta$      edge density   clustering coef.
-0.15*   0.0005         2.64 x 10^-5
-0.1*    0.0050         0.0012
0.3      0.0178         0.4885
0.4      0.0047         0.4458
0.5      0.0013         0.4522
0.6      0.0004         0.4872
0.7      0.0001         0.4886
3.2.1 Cliques in the Market Graph
In this subsection, we consider cliques in the market graph, which have a
clear interpretation in terms of finance. Since a clique is a set of completely
interconnected vertices, any stock that belongs to the clique is highly correlated
with all other stocks in this clique; therefore, a stock is assigned to a certain group
only if it demonstrates a behavior similar to all other stocks in this group. Clearly,
the size of the maximum clique is an important characteristic of the stock market,
since it represents the maximum possible group of similar objects (i.e., mutually
correlated stocks).
A standard integer programming formulation [33] was used to compute the
exact maximum clique in the market graph. However, before solving this problem,
we applied a greedy heuristic for finding a lower bound on the clique number, and
a special preprocessing technique which reduces the problem size. To find a large
clique, we apply the "best-in" greedy algorithm based on degrees of vertices. Let $C$
denote the clique. Starting with $C = \emptyset$, we recursively add to the clique a vertex
$v_{max}$ of largest degree and remove all vertices that are not adjacent to $v_{max}$ from
the graph. After running this algorithm, we applied the following preprocessing
procedure [2]. We recursively remove from the graph all of the vertices which are
not in $C$ and whose degree is less than $|C|$, where $C$ is the clique found by the
greedy algorithm.
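The greedy heuristic and the preprocessing step can be sketched as follows. This is an illustrative reading of the description above rather than the code used in this study: degrees are recomputed within the current set of candidate vertices, ties are broken by the smallest label, and the function names are hypothetical.

```python
def greedy_clique(adj):
    """'Best-in' greedy clique. adj: dict vertex -> set of neighbors."""
    candidates = set(adj)
    clique = []
    while candidates:
        # Vertex of largest degree within the remaining candidates;
        # sorting first makes tie-breaking deterministic.
        v = max(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        clique.append(v)
        # Keep only vertices adjacent to v (this also drops v itself).
        candidates &= adj[v]
    return clique

def peel(adj, clique):
    """Repeatedly drop vertices outside `clique` with degree < |clique|.

    A vertex belonging to a clique larger than `clique` must have at
    least |clique| neighbors, so no larger clique is destroyed.
    """
    keep = set(adj)
    protected = set(clique)
    size = len(clique)
    changed = True
    while changed:
        changed = False
        for v in list(keep):
            if v not in protected and len(adj[v] & keep) < size:
                keep.discard(v)
                changed = True
    return keep
```

The vertex set returned by `peel` induces the reduced graph on which the exact problem is then solved. The reduction is strongest when the greedy clique is already large, since the degree threshold grows with its size.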
Denote by $G' = (V', E')$ the graph induced by remaining vertices. Then
the maximum clique problem can be formulated and solved for $G'$. The following
integer programming formulation was used [33]:
\[ \begin{aligned} \text{maximize} \quad & \sum_{i=1}^{|V'|} x_i \\ \text{s.t.} \quad & x_i + x_j \le 1, \quad (i,j) \notin E', \\ & x_i \in \{0, 1\}. \end{aligned} \]
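To make the meaning of the constraints concrete, the sketch below evaluates the same 0/1 model by exhaustive enumeration: a binary vector is feasible when no two non-adjacent vertices are both selected, and the objective is the number of selected vertices. It is meant only for very small graphs and is not a substitute for an integer programming solver; the function name is hypothetical.

```python
from itertools import product

def max_clique_by_enumeration(vertices, edges):
    """vertices: list of labels; edges: iterable of 2-tuples."""
    edge_set = {frozenset(e) for e in edges}
    # One constraint x_i + x_j <= 1 per non-adjacent pair.
    non_edges = [(i, j)
                 for a, i in enumerate(vertices)
                 for j in vertices[a + 1:]
                 if frozenset((i, j)) not in edge_set]
    best = []
    for x in product((0, 1), repeat=len(vertices)):
        val = dict(zip(vertices, x))
        feasible = all(val[i] + val[j] <= 1 for i, j in non_edges)
        if feasible and sum(x) > len(best):
            best = [v for v in vertices if val[v]]
    return best
```

The enumeration visits $2^{|V'|}$ vectors, which is exactly why the preprocessing that shrinks $V'$ matters before an exact method is applied.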
It should be noted that in the case of market graph instances with a high
positive correlation threshold, the aforementioned preprocessing procedure is very
efficient and significantly reduces the number of vertices in a graph [26]. This can
be intuitively explained by the fact that these instances of the market graph are
clustered (i.e., two vertices in a graph are more likely to be connected if they have a
common neighbor), so the clustering coefficient, which is defined as the probability
that for a given vertex its two neighbors are connected by an edge, is much higher
than the edge density in these graphs (see Table 3-3). This characteristic is also
typical for other power-law graphs arising in different applications.
After reducing the size of the original graph, the resulting integer programming
problem for finding a maximum clique can be relatively easily solved using the
CPLEX integer programming solver [71].
Table 3-4 summarizes the exact sizes of the maximum cliques found in
the market graph for different values of $\theta$. It turns out that these cliques are
rather large, which agrees with the analysis of degree distributions and clustering
coefficients in the market graphs with positive values of $\theta$.
Table 3-4. Sizes of the maximum cliques in the market graph with positive values
of the correlation threshold (exact solutions)
$\theta$ edge density clique size
0.35 0.0090 193
0.4 0.0047 144
0.45 0.0024 109
0.5 0.0013 85
0.55 0.0007 63
0.6 0.0004 45
0.65 0.0002 27
0.7 0.0001 22
These results show that in the modern stock market there are large groups
of instruments whose price fluctuations behave similarly over time, which is not
surprising, since nowadays different branches of the economy highly affect each other.
3.2.2 Independent Sets in the Market Graph
Here we present the results of solving the maximum independent set problem
in the market graphs with nonpositive values of the correlation threshold $\theta$. As it
was pointed out above, this problem is equivalent to the maximum clique problem
in a complementary graph. However, the preprocessing procedure that was very
helpful for finding maximum cliques in the original graph could not eliminate
any vertices in the case of the complement, and we were not able to find the
exact solution of the maximum independent set problem in this case. Recall that
the clustering coefficients in the complementary graph were very small, which
intuitively explains the failure of the preprocessing procedure. Therefore, solving
the maximum independent set problem in the market graph is more challenging than
finding the maximum clique. Table 3-5 presents the sizes of the independent sets found
using the greedy heuristic that was described in the previous section.
Table 3-5. Sizes of independent sets in the complementary market graph found
using the greedy algorithm (lower bounds)
$\theta$     edge density   indep. set size
0.05    0.4794         45
0.0     0.2001         12
-0.05   0.0431         5
-0.1    0.005          3
-0.15   0.0005         2
This table demonstrates that the sizes of computed independent sets are rather
small, which is in agreement with the results of the previous section, where we
mentioned that in the complementary graph the values of the parameter of the
power-law distribution are rather high, and the clustering coefficients are very
small.
The small size of the computed independent sets means that finding a large
"completely diversified" portfolio (where all instruments are negatively correlated
to each other) is not an easy task in the modern stock market.
Moreover, it turns out that one can make a theoretical estimation of the
maximum size of a diversified portfolio, where all stocks are strictly negatively
correlated with each other. Intuitively, the lower (higher by the absolute value)
threshold 0 we set, the smaller diversified portfolio one would expect to find. These
considerations are confirmed by the following theorem.
Theorem 3.1. Consider a market graph with the correlation threshold $\theta < 0$.
Assume that each stock's return has a finite variance. Then there is no independent
set (diversified portfolio) of a size greater than $1 + 1/|\theta|$.
Proof. Let a random variable $X_i$ denote the return of stock $i$ at some time moment,
and let $\sigma_i^2 > 0$ denote the variance of $X_i$. Suppose that there are $m$ stocks
whose pairwise correlations satisfy $C_{ij} \le \theta < 0$ for all $i \ne j$, $i, j = 1, \ldots, m$.
Consider the standardized returns $Y_i = X_i / \sigma_i$, for which $\mathrm{Var}(Y_i) = 1$ and
$\mathrm{Cov}(Y_i, Y_j) = C_{ij}$. The variance of their sum satisfies
\[ \mathrm{Var}\Big(\sum_{i=1}^{m} Y_i\Big) = \sum_{i=1}^{m} \mathrm{Var}(Y_i) + \sum_{i \ne j} \mathrm{Cov}(Y_i, Y_j) \le m + m(m-1)\theta = m\,(1 + (m-1)\theta). \]
Note that if $\theta < 0$, then $m(1 + (m-1)\theta) < 0$ for $m > 1 + 1/|\theta|$. Consequently,
$\mathrm{Var}(\sum_{i} Y_i) < 0$ for $m > 1 + 1/|\theta|$, which is impossible since a variance is nonnegative.
Therefore, the number of stocks with pairwise correlations $C_{ij} \le \theta < 0$ cannot
be greater than $1 + 1/|\theta|$, which completes the proof.
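A short numerical illustration of the bound follows. It assumes standardized (unit-variance) returns, so that the variance of the sum of $m$ returns with all pairwise correlations at most $\theta$ is bounded above by $m(1 + (m-1)\theta)$; the function names are hypothetical.

```python
import math

def max_diversified_size(theta):
    """Largest integer m allowed by the bound m <= 1 + 1/|theta|."""
    if theta >= 0:
        raise ValueError("the bound applies only to theta < 0")
    return math.floor(1 + 1 / abs(theta))

def variance_bound(m, theta):
    """Upper bound m(1 + (m-1) theta) on the variance of the sum of
    m unit-variance returns with pairwise correlations <= theta."""
    return m * (1 + (m - 1) * theta)
```

For example, with $\theta = -0.15$ the bound allows at most 7 instruments: the expression $m(1 + (m-1)\theta)$ is still positive for $m = 7$ and becomes negative for $m = 8$. The independent set of size 2 reported for this threshold in Table 3-5 is consistent with the bound, which is only an upper limit.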
Another natural question now arises: how many completely diversified
portfolios can be found in the market? In order to find an answer, we have
calculated maximal independent sets starting from each vertex, by running 6546
iterations of the greedy algorithm mentioned above. That is, for each of the
considered 6546 financial instruments, we have found a completely diversified
portfolio that would contain this instrument. Interestingly enough, for every vertex
in the market graph, we were able to detect an independent set that contains this
vertex, and the sizes of these independent sets were rather close. Moreover, all
these independent sets were distinct. Figure 3-6 shows the frequency of the sizes
of the independent sets found in the market graphs corresponding to different
correlation thresholds.
These results demonstrate that it is always possible for an investor to find a
group of stocks that would form a completely diversified portfolio with any given
stock, and this can be efficiently done using the technique of finding independent
sets in the market graph.
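The experiment of growing an independent set from every vertex can be sketched as follows. The selection rule used here (always add the candidate with the fewest neighbors among the remaining candidates, which corresponds to the largest degree in the complementary graph) is an assumption about how the greedy algorithm is seeded, and the function names are hypothetical.

```python
def greedy_independent_set(adj, start):
    """Maximal independent set containing `start`.

    adj: dict vertex -> set of neighbors of the original graph.
    """
    chosen = [start]
    candidates = set(adj) - adj[start] - {start}
    while candidates:
        # Fewest neighbors among candidates; sorted() fixes tie-breaking.
        v = min(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        chosen.append(v)
        candidates -= adj[v] | {v}
    return chosen

def independent_sets_from_all(adj):
    """One greedy run per vertex, as in the experiment described above."""
    return {v: greedy_independent_set(adj, v) for v in adj}
```

Each run returns a maximal (not necessarily maximum) independent set, so the sizes obtained are lower bounds on the size of the largest diversified portfolio containing the given instrument.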
Figure 3-6. Frequency of the sizes of independent sets found in the market graph
with $\theta = 0.00$ (left), and $\theta = 0.05$ (right)
3.3 Data Mining Interpretation of the Market Graph Model
As we have seen, the analysis of the market graph provides a practically
useful methodology of extracting information from the stock market data. In this
subsection, we discuss the conceptual interpretation of this approach from the data
mining perspective. An important aspect of the proposed model is that
it allows one to reveal certain patterns underlying the financial data; therefore, it
represents a structured data mining approach.
Nontrivial information about the global properties of the stock market
is obtained from the analysis of the degree distribution of the market graph.
The highly specific structure of this distribution suggests that the stock market can
be analyzed using the power-law model, which can theoretically predict some
characteristics of the graph representing the market.
On the other hand, the analysis of cliques and independent sets in the market
graph is also useful from the data mining point of view. As it was pointed
out above, cliques and independent sets in the market graph represent groups of
"similar" and "different" financial instruments, respectively. Therefore, information
about the size of the maximum cliques and independent sets is also rather
important, since it gives one the idea about the trends that take place in the stock
market. Besides analyzing the maximum cliques and independent sets in the market
graph, one can also divide the market graph into the smallest possible set of
distinct cliques (or independent sets). Partitioning a dataset into sets (clusters) of
elements grouped according to a certain criterion is referred to as clustering, which
is one of the well-known data mining problems [34].
As discussed above, the main difficulty one encounters in solving the clustering
problem on a certain dataset is the fact that the number of desired clusters of
similar objects is usually not known a priori; moreover, an appropriate similarity
criterion should be chosen before partitioning a dataset into clusters.
Clearly, the methodology of finding cliques in the market graph provides an
efficient tool for performing clustering based on the stock market data. The choice
of the grouping criterion is clear and natural: "similar" financial instruments are
determined according to the correlation between their price fluctuations. Moreover,
the minimum number of clusters in the partition of the set of financial instruments
is equal to the minimum number of distinct cliques that the market graph can be
divided into (the minimum clique partition problem). A similar partition can be done
using independent sets instead of cliques, which would represent the partition of
the market into a set of distinct diversified portfolios. In this case the minimum
possible number of clusters is equal to the number of sets in a partition of the
vertices into a minimum number of distinct independent sets. This problem is called
the graph coloring problem, and the number of sets in the optimal partition is referred
to as the chromatic number of the graph.
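A partition into independent sets can be obtained heuristically by greedy coloring, as in the sketch below. This is a standard heuristic shown only for illustration: it yields some proper coloring, so the number of classes it returns is an upper bound on the chromatic number, not necessarily the minimum. Applying the same routine to the complementary graph gives a partition into cliques. The function name and the largest-degree-first ordering are choices made for this example.

```python
def greedy_coloring(adj):
    """Partition the vertices into independent sets (color classes).

    adj: dict vertex -> set of neighbors. Vertices are processed in
    order of decreasing degree and receive the smallest color not
    used by an already colored neighbor.
    """
    color = {}
    for v in sorted(adj, key=lambda u: -len(adj[u])):
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, set()).add(v)
    return list(classes.values())
```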
We should also mention another major type of data mining problems with
many applications in finance. They are referred to as classification problems.
Although the setup of this type of problems is similar to clustering, one should
clearly understand the difference between these two types of problems.
In classification, one deals with a predefined number of classes that the data
elements must be assigned to. Also, there is a so-called training dataset, i.e., the
set of data elements for which it is known a priori which class they belong to. It
means that in this setup one uses some initial information about the classification
of existing data elements. A certain classification model is constructed based on
this information, and the parameters of this model are "tuned" to classify new data
elements. This procedure is known as "training the classifier". An example of the
application of this approach to classifying financial instruments can be found in
[40].
The main difference between classification and clustering is that, unlike
classification, clustering does not use any initial information
about the class attributes of the existing data elements, but tries to determine a
classification using appropriate criteria. Therefore, the methodology of classifying
financial instruments using the market graph model is essentially different from
the approaches commonly considered in the literature in the sense that it does not
require any a priori information about the classes that certain stocks belong to, but
classifies them only based on the behavior of their prices over time.
3.4 Evolution of the Market Graph
In the previous sections, we have discussed the properties of the market graph
constructed for one 500-day period. We have revealed a number of important
properties of this model; however, another crucial question that needs to be
answered is how these characteristics change over time. This analysis would provide
more information about the patterns underlying the stock market dynamics. We
address these issues in this section.
In order to investigate the dynamics of the market graph structure, we chose
the period of 1000 trading days in 1998-2002 and considered eleven 500-day shifts
within this period. The starting points of every two consecutive shifts are separated
by the interval of 50 days. Therefore, every pair of consecutive shifts had 450
days in common and 50 days different. Dates corresponding to each shift and the
corresponding mean correlations are summarized in Table 3-6.
Table 3-6. Dates and mean correlations corresponding to each considered 500-day
shift
Period #   Starting date   Ending date   Mean correlation
1          09/24/1998      09/15/2000    0.0403
2          12/04/1998      11/27/2000    0.0373
3          02/18/1999      02/08/2001    0.0381
4          04/30/1999      04/23/2001    0.0426
5          07/13/1999      07/03/2001    0.0444
6          09/22/1999      09/19/2001    0.0465
7          12/02/1999      11/29/2001    0.0545
8          02/14/2000      02/12/2002    0.0561
9          04/26/2000      04/25/2002    0.0528
10         07/07/2000      07/08/2002    0.0570
11         09/18/2000      09/17/2002    0.0672
This procedure allows us to accurately reflect the structural changes of the
market graph using relatively small intervals between shifts, but at the same
time one can maintain sufficiently large sample sizes of the stock prices data for
calculating cross-correlations for each shift. We should note that in our analysis we
considered only stocks which were among those traded as of the last of the 1000
trading days, i.e., for practical reasons we did not take into account stocks which
had been withdrawn from the market.
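The layout of the overlapping periods can be reproduced with a few lines of code. The sketch below works with trading-day indices rather than calendar dates (mapping indices to the dates of Table 3-6 would require the actual trading calendar, which is not reproduced here); the function name and the half-open index convention are assumptions of this example.

```python
def shifted_windows(total_days=1000, window=500, step=50):
    """Half-open index ranges [start, end) of the overlapping periods."""
    return [(s, s + window)
            for s in range(0, total_days - window + 1, step)]
```

With the default values the function returns eleven windows, from (0, 500) to (500, 1000), and any two consecutive windows share 450 trading days.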
3.4.1 Dynamics of Global Characteristics of the Market Graph
In this subsection, we analyze the evolution of the basic characteristics of
the market graph model that were considered above for one trading period: the
distribution of the correlation coefficients in the market, the degree distribution,
and the edge density. As we will see, some properties of the market graph remain
stable; however, there are certain trends that can be observed in the stock market
development.
The first subject of our consideration is the distribution of correlation
coefficients between all pairs of stocks in the market. As it was mentioned above, this
distribution on $[-1, 1]$ had a shape similar to a part of normal distribution with
mean close to 0.05 for the sample data considered in [26, 27]. One of the
interpretations of this fact is that the correlation of most pairs of stocks is close to zero;
therefore, the structure of the stock market is substantially random, and one can
make a reasonable assumption that the prices of most stocks change independently.
As we consider the evolution of the correlation distribution over time, it turns out
that the shape of this distribution remains stable, which is illustrated by Figure
3-7.
Figure 3-7. Distribution of correlation coefficients in the US stock market for
several overlapping 500-day periods during 2000-2002 (period 1 is the
earliest, period 11 is the latest).
The stability of the distribution of correlation coefficients in the market
graph intuitively motivates the hypothesis that the degree distribution should
also remain stable for different values of the correlation threshold. To verify
this assumption, we have calculated the degree distribution of the graphs
constructed for all considered time periods. The correlation threshold θ = 0.5
was chosen to describe the structure of connections corresponding to
significantly high correlations. Our experiments show that the degree
distribution is similar for all time intervals, and in all cases it is well
described by a power law. Figure 3-8 shows the degree distributions (in the
logarithmic scale) for some instances of the market graph (with θ = 0.5)
corresponding to different intervals.
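The construction behind these degree distributions can be sketched as follows.
This is only an illustration under stated assumptions: the return series are
invented toy data, and the helper names (pearson, market_graph,
degree_distribution) are hypothetical, not taken from the original study. Two
stocks are joined by an edge when the correlation of their series is at least
the threshold θ.

```python
from collections import Counter
from math import sqrt

def pearson(x, y):
    """Sample correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def market_graph(returns, theta):
    """Adjacency sets: edge (u, v) iff correlation(u, v) >= theta."""
    names = sorted(returns)
    adj = {s: set() for s in names}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if pearson(returns[u], returns[v]) >= theta:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def degree_distribution(adj):
    """Map each degree value to the number of vertices having it."""
    return Counter(len(nbrs) for nbrs in adj.values())

# Toy data (hypothetical): four "stocks" observed over four days.
returns = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.5],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [1.0, -1.0, 1.0, -1.0],
}
adj = market_graph(returns, theta=0.5)
dist = degree_distribution(adj)
```

On real data one would plot the degree counts on a log-log scale and check
whether they fall close to a straight line, as a power law would predict.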
Figure 3-8. Degree distribution of the market graph for different 500-day
periods in 2000-2002 with θ = 0.5: (a) period 1, (b) period 4, (c) period 7,
(d) period 11. Both axes are logarithmic; the horizontal axis is the degree.
The cross-correlation distribution and the degree distribution of the market
graph represent the general characteristics of the market, and the aforementioned
results lead us to the conclusion that the global structure of the market is
stable over time. However, as we will see now, some global changes in the stock
market structure do take place. To demonstrate this, we look at another
characteristic of the market graph: its edge density.
In our analysis of the market graph dynamics, we chose a relatively high
correlation threshold θ = 0.5, which ensures that we consider only the edges
corresponding to pairs of stocks that are significantly correlated with each
other. In this case, the edge density of the market graph represents the
proportion of those pairs of stocks in the market whose price fluctuations are
similar and influence each other. We are interested in how this proportion
changes during the considered period of time. Table 3-7 summarizes the obtained
results. As can be seen from this table, both the number of vertices and the
number of edges in the market graph increase over time. The number of vertices
grows because new stocks appear in the market and we do not consider those
stocks that ceased to exist by the last of the 1000 trading days used in our
analysis, so the maximum possible number of edges in the graph increases as
well. However, it turns out that the number of edges grows faster; therefore, the
edge density of the market graph increases from period to period. As one can see
from Figure 3-9(a), the greatest increase of the edge density corresponds to the
last two periods. In fact, the edge density for the latest interval is
approximately 8.5 times higher than for the first interval. This dramatic jump
suggests that there is a trend toward the "globalization" of the modern stock
market, which means that nowadays more and more stocks significantly affect the
behavior of the others.
It should be noted that the increase of the edge density could be predicted from
the analysis of the distribution of the cross-correlations between all pairs of
stocks. From Figure 3-7, one can observe that even though the distributions
corresponding to different periods have a similar shape and the same mean, the
"tail" of the distribution corresponding to the latest period (period 11) is
somewhat "heavier" than for the earlier periods, which means that there are more
pairs of stocks with higher values of the correlation coefficient.
Table 3-7. Number of vertices and number of edges in the market graph for
different periods (θ = 0.5)
Period  Number of Vertices  Number of Edges  Edge density
1       5430                2258             0.015%
2       5507                2614             0.017%
3       5593                3772             0.024%
4       5666                5276             0.033%
5       5768                6841             0.041%
6       5866                7770             0.045%
7       6013                10428            0.058%
8       6104                12457            0.067%
9       6262                12911            0.066%
10      6399                19707            0.096%
11      6556                27885            0.130%
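As a quick check of the edge-density figures, the density can be recomputed from
the vertex and edge counts tabulated above as the number of edges divided by the
number of possible vertex pairs, n(n-1)/2. The sketch below assumes exactly this
definition; the function name edge_density is ours.

```python
def edge_density(n_vertices, n_edges):
    """Fraction of all possible vertex pairs that are joined by an edge."""
    return n_edges / (n_vertices * (n_vertices - 1) / 2)

# Vertex and edge counts for periods 1..11 (theta = 0.5), from the table above.
vertices = [5430, 5507, 5593, 5666, 5768, 5866, 6013, 6104, 6262, 6399, 6556]
edges = [2258, 2614, 3772, 5276, 6841, 7770, 10428, 12457, 12911, 19707, 27885]

densities = [edge_density(v, e) for v, e in zip(vertices, edges)]
ratio = densities[-1] / densities[0]  # last period relative to the first

for period, d in enumerate(densities, start=1):
    print(f"period {period:2d}: edge density = {100 * d:.3f}%")
print(f"density ratio, period 11 vs period 1: {ratio:.2f}")
```

The ratio comes out slightly below 8.5, in line with the "approximately 8.5
times" stated in the text.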
Figure 3-9. Dynamics of edge density and maximum clique size in the market
graph: evolution of the edge density (a) and maximum clique size (b) in the
market graph (θ = 0.5). The horizontal axis is the time period (1 to 11).
3.4.2 Dynamics of the Size of Cliques and Independent Sets in the
Market Graph
In this subsection we analyze the evolution of the size of the maximum clique
in the market graph over the considered period of time.
Table 3-8 presents the sizes of the maximum cliques found in the market graph
for different time periods. As in the previous subsection, we used a relatively
high correlation threshold θ = 0.5 to consider only significantly correlated
stocks. As one can see, the maximum clique size clearly increases over time,
which is consistent with the behavior of the edge density of the market graph
discussed above (see Figure 3-9(b)). This result provides another confirmation
of the globalization hypothesis.
Another related issue is how much the structure of the maximum cliques differs
across time periods. Table 3-9 presents the stocks included in the maximum
cliques for different time periods. It turns out that in most cases stocks that
appear in a clique in an earlier period also appear in the cliques in later
periods.
There are some other interesting observations about the structure of the
maximum cliques found for different time periods. All the cliques include a
significant number of stocks of companies representing the "high-tech" industry
sector. Examples include well-known companies such as Sun Microsystems, Inc.,
Cisco Systems, Inc., Intel Corporation, etc. Moreover, each clique contains
stocks of companies related to the semiconductor industry (e.g., Cypress
Semiconductor Corporation, Cree, Inc., Lattice Semiconductor Corporation, etc.),
and the number of these stocks in the cliques increases with time. These facts
suggest that the corresponding branches of industry expanded during the
considered period of time to form a major cluster of the market.
In addition, we observed that in the later periods (especially in the last two
periods) the maximum cliques contain a rather large number of exchange traded
funds, i.e., stocks that reflect the behavior of certain indices representing
various groups of companies. It should be mentioned that all maximum cliques
contain the Nasdaq 100 tracking stock (QQQ), which was also found to be the
vertex with the highest degree (i.e., correlated with the most stocks) in the
market graph [26].
Table 3-8. Greedy clique size and the clique number for different time periods
(θ = 0.5)
Period  |V|   Edge dens.  Clustering   |C1|  |V'|  Edge dens.  Clique
              in G        coefficient              in G'       number
1       5430  0.00015     0.505        15    76    0.286       18
2       5507  0.00017     0.504        18    43    0.731       19
3       5593  0.00024     0.499        26    49    0.817       27
4       5666  0.00033     0.517        34    70    0.774       34
5       5768  0.00041     0.550        42    82    0.787       42
6       5866  0.00045     0.558        45    86    0.804       45
7       6013  0.00058     0.553        51    110   0.769       51
8       6104  0.00067     0.566        60    114   0.819       60
9       6262  0.00066     0.553        62    107   0.869       62
10      6399  0.00096     0.486        77    134   0.841       77
11      6556  0.00130     0.452        84    146   0.844       85
Another natural question is how the size of independent sets (i.e., diversified
portfolios in the market) changes over time. As pointed out in [26, 27], finding
a maximum independent set in the market graph turns out to be a much more
complicated task than finding a maximum clique. In particular, when solving the
maximum independent set problem (or, equivalently, the maximum clique problem in
the complementary graph), the preprocessing procedure described above does not
reduce the size of the original graph. This can be explained by the fact that
the clustering coefficient in the complementary market graph with θ = 0 is much
smaller than in the original graph corresponding to θ = 0.5 (see Table 3-10).
Similarly to Section 3.2, we calculate maximal independent sets (a maximal
independent set is an independent set that is not a subset of another
independent set) in the market graph using the above greedy algorithm. As one
can see from Table 3-10, the sizes of independent sets found in the market graph
for θ = 0 are rather small, which is consistent with the results of Section 3.2.
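To make the relation between cliques and independent sets concrete, here is a
minimal sketch of one common greedy heuristic. It is an assumption on our part
that the greedy algorithm referred to above has this general shape; the function
names and the toy graph are hypothetical. A maximal independent set is obtained
by running the clique heuristic on the complementary graph.

```python
def greedy_clique(adj):
    """Grow a maximal clique: repeatedly pick the candidate with the most
    neighbours among the remaining candidates (ties broken by vertex order)."""
    candidates = set(adj)
    clique = set()
    while candidates:
        v = max(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        clique.add(v)
        candidates &= adj[v]  # keep only vertices adjacent to everything chosen
    return clique

def complement(adj):
    """Complementary graph: u ~ v iff they are distinct and not adjacent."""
    vertices = set(adj)
    return {v: vertices - adj[v] - {v} for v in adj}

def greedy_independent_set(adj):
    """A maximal independent set is a maximal clique of the complement."""
    return greedy_clique(complement(adj))

# Toy graph (hypothetical): a triangle 1-2-3 with a tail 3-4-5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
clique = greedy_clique(adj)
independent = greedy_independent_set(adj)
```

The result is maximal (it cannot be extended) but not guaranteed to be maximum,
which is why the sizes reported for such heuristics are only lower bounds.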
Table 3-9. Structure of maximum cliques in the market graph for different time
periods (θ = 0.5)
Period  Stocks included in the maximum clique
1 BK, EMC, FBF, HAL, HP, INTC, NCC, NOI, NOK, PDS, PMCS, QQQ, RF, SII, SLB,
SPY, TER, WM
2 ADI, ALTR, AMAT, AMCC, ATML, CSCO, KLAC, LLTC, LSCC, MDY, MXIM, NVLS,
PMCS, QQQ, SPY, SUNW, TXN, VTSS, XLNX
3 AMAT, AMCC, CREE, CSCO, EMC, JDSU, KLAC, LLTC, LSCC, MDY, MXIM,
NVLS, PHG, PMCS, QLGC, QQQ, SEBL, SPY, STM, SUNW, TQNT, TXCC, TXN,
VRTS, VTSS, XLK, XLNX
4 AMAT, AMCC, ASML, ATML, BRCM, CHKP, CIEN, CREE, CSCO, EMC, FLEX,
JDSU, KLAC, LSCC, MDY, MXIM, NTAP, NVLS, PMCS, QLGC, QQQ, RFMD,
SEBL, SPY, STM, SUNW, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX
5 ALTR, AMAT, AMCC, ASML, ATML, BRCM, CIEN, CREE, CSCO, EMC, FLEX,
IDTI, IRF, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, LSI, MDY, MXIM, NTAP,
NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SEBL, SPY, STM, SUNW, SWKS, TQNT,
TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX
6 ADI, ALTR, AMAT, AMCC, ASML, ATML, BEAS, BRCM, CIEN, CREE, CSCO, CY,
ELX, EMC, FLEX, IDTI, ITWO, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, LSI,
MDY, MXIM, NTAP, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SEBL, SPY, STM,
SUNW, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX
7 ALTR, AMAT, AMCC, ATML, BEAS, BRCD, BRCM, CHKP, CIEN, CNXT, CREE,
CSCO, CY, DIGL, EMC, FLEX, HHH, ITWO, JDSU, JNPR, KLAC, LLTC, LRCX,
LSCC, MDY, MERQ, MXIM, NEWP, NTAP, NVLS, ORCL, PMCS, QLGC, QQQ,
RBAK, RFMD, SCMR, SEBL, SPY, SSTI, STM, SUNW, SWKS, TQNT, TXCC, TXN,
VRSN, VRTS, VTSS, XLK, XLNX
8 ALTR, AMAT, AMCC, AMKR, ARMHY, ASML, ATML, AVNX, BEAS, BRCD,
BRCM, CHKP, CIEN, CMRC, CNXT, CREE, CSCO, CY, DIGL, ELX, EMC, EXTR,
FLEX, HHH, IDTI, ITWO, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, MDY, MERQ,
MRVC, MXIM, NEWP, NTAP, NVLS, ORCL, PMCS, QLGC, QQQ, RFMD, SCMR,
SEBL, SNDK, SPY, SSTI, STM, SUNW, SWKS, TQNT, TXCC, TXN, VRSN, VRTS,
VTSS, XLK, XLNX
9 ADI, ALTR, AMAT, AMCC, ARMHY, ASML, ATML, AVNX, BDH, BEAS, BHH,
BRCM, CHKP, CIEN, CLS, CREE, CSCO, CY, DELL, ELX, EMC, EXTR, FLEX,
HHH, IAH, IDTI, IIH, INTC, IRF, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, LSI,
MDY, MXIM, NEWP, NTAP, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SCMR,
SEBL, SNDK, SPY, SSTI, STM, SUNW, SWKS, TQNT, TXCC, TXN, VRSN, VRTS,
VTSS, XLK, XLNX
10 ADI, ALTR, AMAT, AMCC, AMD, ASML, ATML, BDH, BHH, BRCM, CIEN, CLS,
CREE, CSCO, CY, CYMI, DELL, EMC, FCS, FLEX, HHH, IAH, IDTI, IFX, IIH, IJH,
IJR, INTC, IRF, IVV, IVW, IWB, IWF, IWM, IWV, IYV, IYW, IYY, JBL, JDSU,
KLAC, KOPN, LLTC, LRCX, LSCC, LSI, LTXX, MCHP, MDY, MXIM, NEWP, NTAP,
NVDA, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SANM, SEBL, SMH, SMTC,
SNDK, SPY, SSTI, STM, SUNW, TER, TQNT, TXCC, TXN, VRTS, VSH, VTSS,
XLK, XLNX
11 ADI, ALA, ALTR, AMAT, AMCC, AMD, ASML, ATML, BDH, BEAS, BHH, BRCM,
CIEN, CLS, CNXT, CREE, CSCO, CY, CYMI, DELL, EMC, EXTR, FCS, FLEX,
HHH, IAH, IDTI, IIH, IJH, IJR, INTC, IRF, IVV, IVW, IWB, IWF, IWM, IWO, IWV,
IWZ, IYV, IYW, IYY, JBL, JDSU, JNPR, KLAC, KOPN, LLTC, LRCX, LSCC, LSI,
LTXX, MCRL, MDY, MKH, MRVC, MXIM, NEWP, NTAP, NVDA, NVLS, PHG,
PMCS, QLGC, QQQ, RFMD, SANM, SEBL, SMH, SMTC, SNDK, SPY, SSTI, STM,
SUNW, TER, TQNT, TXN, VRTS, VSH, VTSS, XLK, XLNX
Table 3-10. Size of independent sets in the market graph found using the greedy
heuristic (θ = 0.0). Edge density and clustering coefficient are given for the
complementary graph.
Period  Number of  Edge     Clustering   Independent
        vertices   density  coefficient  set size
1 5430 0.258 0.293 11
2 5507 0.275 0.307 11
3 5593 0.281 0.307 10
4 5666 0.265 0.297 11
5 5768 0.260 0.292 11
6 5866 0.254 0.288 11
7 6013 0.228 0.269 11
8 6104 0.227 0.268 10
9 6262 0.238 0.277 12
10 6399 0.228 0.269 12
11 6556 0.201 0.245 11
3.4.3 Minimum Clique Partition of the Market Graph
Besides analyzing the maximum cliques in the market graph, one can also divide
the market graph into the smallest possible set of distinct cliques. As pointed
out above, the partition of a dataset into sets (clusters) of elements grouped
according to a certain criterion is referred to as clustering.
For finding a clique partition, we choose the instance of the market graph with
a low correlation threshold θ = 0.05 (the mean of the distribution of
correlation coefficients shown in Figure 3-7), which ensures that the edge
density of the considered graph is high enough and that the number of isolated
vertices (which would obviously form distinct cliques) is small.
We use the standard greedy heuristic to compute a clique partition in the market
graph: repeatedly find a maximal clique and remove it from the graph until no
vertices remain. Cliques are computed using the previously described greedy
algorithm. The corresponding results for the market graph with threshold
θ = 0.05 are presented in Table 3-11. Note that the size of the largest clique
in the partition increases from one period to the next, with the largest clique
in the last period containing about three times as many vertices as the
corresponding clique in the first partition. At the same time, the number of
cliques in the partition is comparable for different periods, with a slight
overall downward trend, even though the number of vertices increases over time.
Table 3-11. The largest clique size and the number of cliques in computed clique
partitions (θ = 0.05)
Period  Number of  Edge     Largest clique    # of cliques in
        vertices   density  in the partition  the partition
1       5430       0.400    469               494
2       5507       0.377    552               517
3       5593       0.379    636               513
4       5666       0.405    743               503
5       5768       0.413    789               501
6       5866       0.425    824               496
7       6013       0.469    929               471
8       6104       0.475    983               470
9       6262       0.456    997               509
10      6399       0.474    1159              501
11      6556       0.521    1372              479
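The greedy partition procedure can be sketched as below. The inner clique
routine is one plausible greedy rule (highest remaining degree first), chosen
here as an assumption since the original algorithm is described elsewhere; the
function names and the toy graph are hypothetical.

```python
def greedy_clique(adj, vertices):
    """Maximal clique inside `vertices`, grown by highest remaining degree."""
    candidates = set(vertices)
    clique = []
    while candidates:
        v = max(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        clique.append(v)
        candidates &= adj[v]
    return clique

def greedy_clique_partition(adj):
    """Repeatedly extract a maximal clique until no vertices remain."""
    remaining = set(adj)
    partition = []
    while remaining:
        clique = greedy_clique(adj, remaining)
        partition.append(clique)
        remaining -= set(clique)
    return partition

# Toy graph (hypothetical): a triangle 1-2-3 with a tail 3-4-5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
parts = greedy_clique_partition(adj)
largest = max(len(p) for p in parts)
```

An isolated vertex forms a clique of size one, which is why a low threshold
(and hence few isolated vertices) keeps the number of parts small.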
3.5 Concluding Remarks
Graph representation of the stock market data and interpretation of the
properties of this graph give a new insight into the internal structure of the
stock market. In this chapter, we have studied different characteristics of the
market graph and their evolution over time and came to several interesting
conclusions based on our analysis. It turns out that the power-law structure of
the market graph is quite stable over the considered time intervals; therefore
one can say that the concept of self-organized networks, which was mentioned
above, is applicable in finance, and in this sense the stock market can be
considered as a "self-organized" system.
Another important result is the fact that the edge density of the market graph,
as well as the maximum clique size, has steadily increased over the last several
years, which supports the well-known idea of the globalization of the economy
that has been widely discussed recently.
We have also indicated a natural way of dividing the set of financial
instruments into groups of similar objects (clustering) by computing a clique
partition of the market graph. This methodology can be extended by considering
quasi-cliques in the partition, which may reduce the number of obtained
clusters. Moreover, finding independent sets in the market graph provides a new
approach to choosing diversified portfolios where all stocks are pairwise
uncorrelated, which is potentially useful in practice.
CHAPTER 4
NETWORK-BASED TECHNIQUES IN ELECTROENCEPHALOGRAPHIC
(EEG) DATA ANALYSIS AND EPILEPTIC BRAIN MODELING
The human brain is one of the most complex systems ever studied by scientists.
The enormous number of neurons and the dynamic nature of the connections between
them make the analysis of brain function especially challenging. One of the most
important directions in studying the brain is treating disorders of the central
nervous system. For instance, epilepsy is a common form of such disorders, which
affects approximately 1% of the human population. Essentially, epileptic
seizures represent excessive and hypersynchronous activity of the neurons in the
cerebral cortex.
During the last several years, significant progress in the field of epileptic
seizure prediction has been made. The advances are associated with the extensive
use of electroencephalograms (EEG), which can be treated as a quantitative
representation of the brain function. Rapid development of computational
equipment has made it possible to store and process huge amounts of EEG data
obtained from recording devices. The availability of these massive datasets
gives rise to another problem: utilizing mathematical tools and data mining
techniques for extracting useful information from EEG data. Is it possible to
construct a "simple" mathematical model based on EEG data that would reflect the
behavior of the epileptic brain?
In this chapter, we make an attempt to create such a model using a network-based
approach.
In the case of the human brain and EEG data, we apply a relatively simple
network-based approach. We represent the electrodes used for obtaining the EEG
readings, which are located in different parts of the brain, as the vertices of
the constructed graph. The data received from every single electrode is
essentially a time series reflecting the change of the EEG signal over time.
Later in the chapter we will discuss the quantitative measure characterizing
statistical relationships between the recordings of every pair of electrodes,
the so-called T-index. The values of the T-index Tij measured for all pairs of
electrodes i and j enable us to establish certain rules of placing edges
connecting different pairs of vertices i and j depending on the corresponding
values of Tij. Using this technique, we develop several graph-based mathematical
models and study the dynamics of the structural properties of these graphs. As
we will see, these models can provide useful information about the behavior of
the brain prior to, during, and after an epileptic seizure.
4.1 Statistical Preprocessing of EEG Data
4.1.1 Datasets.
The datasets consist of continuous long-term (3 to 12 days) multichannel
intracranial EEG recordings acquired from 4 patients with medically intractable
temporal lobe epilepsy. Each record included a total of 28 to 32 intracranial
electrodes (8 subdural and 6 hippocampal depth electrodes for each cerebral
hemisphere). A diagram of electrode locations is provided in Figure 4-1.
4.1.2 T-statistics and STLmax
In this subsection we give a brief introduction to the nonlinear measures and
statistics used to analyze EEG data (for more information see [67, 69, 101]).
Since the brain is a nonstationary system, algorithms used to estimate measures
of the brain dynamics should be capable of automatically identifying and
appropriately weighing existing transients in the data. In a chaotic system,
orbits originating from similar initial conditions (nearby points in the state
space) diverge exponentially (expansion process). The rate of divergence is an
important aspect of the system dynamics and is reflected in the value of the
Lyapunov exponents. The method used for estimation of the short-time largest
Lyapunov exponent STLmax, an estimate of Lmax for nonstationary data, is
explained in detail in [66, 68, 118].
By splitting the EEG time series recorded from each electrode into a sequence
of non-overlapping segments, each 10.24 sec in duration, and estimating STLmax
for each of these segments, profiles of STLmax over time are generated.
Figure 4-1. Electrode placement in the brain: (A) inferior transverse and (B)
lateral views of the brain, illustrating approximate depth and subdural
electrode placement for EEG recordings. Subdural electrode strips are placed
over the left orbitofrontal (AL), right orbitofrontal (AR), left subtemporal
(BL), and right subtemporal (BR) cortex. Depth electrodes are placed in the left
temporal depth (CL) and right temporal depth (CR) to record hippocampal
activity.
Once the STLmax temporal profiles at the individual cortical sites have been
estimated, the temporal evolution of the stability of each cortical site as the
brain proceeds towards the ictal state is quantified. The spatial dynamics of
this transition are captured by considering the relations of the STLmax between
different cortical sites. For example, if a similar transition occurs at
different cortical sites, the STLmax of the involved sites are expected to
converge to similar values prior to the transition. Such participating sites are
called "critical sites", and such a convergence is called "dynamical
entrainment". More specifically, in order for the dynamical entrainment to have
a statistical content, we allow a period over which the difference of the means
of the STLmax values at two sites is estimated. We use periods of 10 minutes
(i.e., moving windows including approximately 60 STLmax values over time at each
electrode site) to test the dynamical entrainment at the 0.01 statistical
significance level. We employ the T-index (from the well-known paired
T-statistics for comparisons of means) as a measure of distance between the mean
values of pairs of STLmax profiles over time. The T-index at time t between
electrode sites i and j is defined as:
\[ T_{i,j}(t) = \sqrt{N} \times \left| E\{STLmax_i - STLmax_j\} \right| / \sigma_{i,j}(t) \tag{4-1} \]
where E{·} is the sample average of the differences STLmax_i - STLmax_j
estimated over a moving window wt(λ) defined as:
\[ w_t(\lambda) = \begin{cases} 1 & \text{if } \lambda \in [t - N, t], \\ 0 & \text{if } \lambda \notin [t - N, t], \end{cases} \]
where N is the length of the moving window. Then, σi,j(t) is the sample standard
deviation of the STLmax differences between electrode sites i and j within the
moving window wt(λ). The T-index follows a t-distribution with N-1 degrees of
freedom. For the estimation of the Tij(t) indices in our data we used N = 60
(i.e., average of 60 differences of STLmax exponents between sites i and j per
moving window of approximately 10-minute duration). Therefore, a two-sided
t-test with N-1 (= 59) degrees of freedom, at a statistical significance level
α, should be used to test the null hypothesis H0: "brain sites i and j acquire
identical STLmax values at time t". In this experiment, we set the probability
of a type I error α = 0.01 (i.e., the probability of falsely rejecting H0 if H0
is true is 1%). For the T-index to pass this test, the Tij(t) value should be
within the interval [0, 2.662]. We will refer to the upper bound of this
interval as Tcritical.
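The T-index computation for a single pair of sites and a single window can be
sketched as follows. This assumes the usual paired t-statistic form (square root
of N times the absolute mean difference, divided by the sample standard
deviation of the differences); the STLmax values below are synthetic and the
function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

T_CRITICAL = 2.662  # two-sided t-test, alpha = 0.01, 59 degrees of freedom

def t_index(stl_i, stl_j):
    """Paired t-statistic between two STLmax profiles over one window.
    Note: stdev is zero (division error) if all differences are identical."""
    diffs = [a - b for a, b in zip(stl_i, stl_j)]
    n = len(diffs)
    return sqrt(n) * abs(mean(diffs)) / stdev(diffs)

def entrained(stl_i, stl_j, t_critical=T_CRITICAL):
    """True if the null hypothesis of equal mean STLmax is not rejected."""
    return t_index(stl_i, stl_j) <= t_critical

# Synthetic window of N = 60 values per site (hypothetical data).
site_a = [5.0 + 0.1 * ((7 * k) % 11 - 5) for k in range(60)]
site_b = [a + 0.05 * (-1) ** k for k, a in enumerate(site_a)]  # same mean
site_c = [b + 1.0 for b in site_b]                             # shifted mean

t_ab = t_index(site_a, site_b)
t_ac = t_index(site_a, site_c)
```

Here sites a and b differ only by a zero-mean fluctuation and are classified as
entrained, whereas the constant offset of site c drives the T-index far above
the critical value.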
4.2 Graph Structure of the Epileptic Brain
4.2.1 Key Idea of the Model
If we model the epileptic brain by a graph (where nodes are "functional units"
of the system and edges are connections between them), we need to answer the
following questions: what are the properties of this graph, and how do these
properties change prior to, during, and after epileptic seizures? We try to
answer these questions using the following idea: we study the system of
electrodes as a weighted graph where nodes are electrodes and the weights of the
edges between nodes are the values of the corresponding T-index. More
specifically, we consider three types of graphs constructed using this
principle:
* GRAPH-I is a complete graph, i.e., it has all possible edges,
* GRAPH-II is obtained from the complete graph by removing all the edges
(i,j) for which the corresponding value of Tij is greater than Tcritical,
* GRAPH-III is obtained from the complete graph by removing all the edges
(i,j) for which the corresponding value of Tij is less than Tcritical 10 minutes
after the seizure point or greater than Tcritical at the seizure point.
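The three definitions above translate into simple filters on T-index matrices.
The sketch below is illustrative only: the 4x4 matrices are invented, the
function names are hypothetical, and the treatment of values exactly equal to
Tcritical (kept here) is our assumption.

```python
T_CRITICAL = 2.662

def graph_ii(T, t_critical=T_CRITICAL):
    """Edges of GRAPH-II: drop every pair whose T-index exceeds Tcritical."""
    n = len(T)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if T[i][j] <= t_critical}

def graph_iii(T_seizure, T_after, t_critical=T_CRITICAL):
    """Edges of GRAPH-III: pairs entrained at the seizure point that are no
    longer entrained 10 minutes later."""
    n = len(T_seizure)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if T_seizure[i][j] <= t_critical and T_after[i][j] >= t_critical}

# Hypothetical symmetric T-index matrices for four electrodes.
T_seizure = [[0.0, 1.0, 2.0, 5.0],
             [1.0, 0.0, 1.5, 4.0],
             [2.0, 1.5, 0.0, 3.0],
             [5.0, 4.0, 3.0, 0.0]]
T_after = [[0.0, 3.0, 1.0, 6.0],
           [3.0, 0.0, 4.0, 5.0],
           [1.0, 4.0, 0.0, 2.0],
           [6.0, 5.0, 2.0, 0.0]]

edges_ii = graph_ii(T_seizure)
edges_iii = graph_iii(T_seizure, T_after)
```

GRAPH-I needs no filtering: it is the complete graph with T[i][j] as the weight
of edge (i, j).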
4.2.1.1 Interpretation of the Considered Graph Models
Before proceeding with the further discussion, we need to give a conceptual
interpretation of the ideas lying behind introducing the aforementioned graphs.
* GRAPH-I contains all the edges connecting the considered brain sites, and
it is considered in order to reflect the general distribution of the values of
T-indices between each pair of vertices (i.e., the weights of the corresponding
edges).
* GRAPH-II contains only the edges connecting the brain sites (electrodes) that
are statistically entrained at a certain time, which means that they exhibit
similar behavior. Recall that a pair of electrodes is considered to be entrained
if the value of the corresponding T-index between them is less than Tcritical,
which is why we remove all the edges with weights greater than Tcritical. The
main point of our interest is studying the evolution of the properties of this
graph over time. As we will see in the next subsections, this analysis can help
in revealing the dynamical patterns underlying the functioning of the brain
during preictal, ictal, postictal, and interictal states. Therefore, this graph
can be used as a basis for the mathematical model describing some
characteristics of the epileptic brain.
* GRAPH-III is constructed to reflect the connections only between those
electrodes that are entrained during the seizure, but are not entrained 10
minutes after the seizure. The motivation for introducing this graph is the
existence of "resetting" of the brain after the seizure [70, 108, 111], which
is essentially the divergence of the profiles of the STLmax time series. As
indicated above, this divergence is characterized by values of the T-index
greater than Tcritical.
4.2.2 Properties of the Graphs
In this subsection, we investigate the properties of the considered graph models
and give an intuitive explanation of the observed results. As we will see, there are
specific tendencies in the evolution of the properties of the considered graphs prior
to, during, and after epileptic seizures, which indicates that the proposed models
capture certain trends in the behavior of the epileptic brain.
4.2.2.1 Edge Density
Recall that GRAPH-II was introduced to reflect the connections between brain
sites that are statistically entrained at a certain time moment. Figure 4-2
illustrates the typical evolution of the number of edges in GRAPH-II over time.
As indicated above, the edge density of a graph is proportional to its number of
edges. It is easy to notice that the number of edges in GRAPH-II dramatically
increases at seizure points (represented by dashed vertical lines) and decreases
immediately after seizures. This means that the global structure of the graph
changes significantly during and after the seizure, i.e., the density of
GRAPH-II increases during the ictal state and decreases in the postictal state,
which supports the idea that the epileptic brain (and GRAPH-II as the model of
the brain) experiences a "phase transition" during the seizure.
Figure 4-2. Number of edges in GRAPH-II. The horizontal axis is time in minutes.
4.2.2.2 Connectivity
Another important property of GRAPH-II that we are interested in is its
connectivity. We need to check if this graph is connected prior to, during, and
after epileptic seizures, and if not, find the size of its largest connected
component. Clearly, this information will also be helpful in the analysis of the
structural properties of the brain. If GRAPH-II is connected (i.e., the size of
the largest connected component is equal to the number of vertices in the
graph), then all the functional units of the brain are "linked" with each other
by a path, and in this case the brain can be treated as an integrated system.
However, if the size of the largest connected component in GRAPH-II is
significantly smaller than the total number of vertices, it means that the
brain becomes "separated" into smaller disjoint subsystems.
The size of the largest connected component of GRAPH-II is presented in Figure
4-3. One can see that GRAPH-II is connected during the interictal period (i.e.,
the brain is a connected system); however, it becomes disconnected after the
seizure (during the postictal state): the size of the largest connected
component significantly decreases. This fact is not surprising and can be
intuitively explained, since after the seizure the brain needs some time to
"reset" [70, 108, 111] and restore the connections between the functional
units.
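The connectivity check described above amounts to finding the largest connected
component, for example with a breadth-first search. The following is a minimal
sketch; the edge lists are invented and the function name is ours.

```python
from collections import deque

def largest_component_size(n, edges):
    """Size of the largest connected component of a graph on vertices 0..n-1."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen = [False] * n
    best = 0
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search from `start`
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
        best = max(best, size)
    return best

# Hypothetical snapshots of GRAPH-II on six electrodes.
interictal_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
postictal_edges = [(0, 1), (1, 2), (3, 4)]

size_interictal = largest_component_size(6, interictal_edges)
size_postictal = largest_component_size(6, postictal_edges)
```

The graph is connected exactly when the returned size equals the number of
vertices.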
4.2.2.3 Minimum Spanning Tree
The next subject of our discussion is the analysis of minimum spanning trees
of GRAPH-I, which was defined as the graph with all possible edges, where each
edge (i, j) has the weight equal to the value of the T-index Tij corresponding
to brain sites i and j. The definition of Minimum Spanning Tree was given in
Section 2.
Studying minimum spanning trees in GRAPH-I is motivated by the hypothesis that
the seizure signal in the brain propagates to all functional units according to
the minimum spanning tree, i.e., along the edges with small values of Tij. This
hypothesis is partially supported by the behavior of the average T-index of the
edges corresponding to the Minimum Spanning Tree of GRAPH-I, which is shown in
Figure 4-4.
However, this hypothesis cannot be verified using the considered data, since
the values of average T-indices are calculated over a 10-minute interval,
whereas the seizure signal propagates in a fraction of a second. Therefore, in
order to check if the seizure signal actually spreads along the minimum spanning
tree, one needs to introduce other nonlinear measures to reflect the behavior of
the brain over short time intervals.
Figure 4-3. The size of the largest connected component in GRAPH-II. Number of
nodes in the graph is 30. The horizontal axis is time in minutes.
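The average T-index over the minimum spanning tree can be computed with any
standard MST algorithm. The sketch below uses Kruskal's algorithm with a simple
union-find structure, which is our choice and not necessarily what was used in
the original experiments; the matrix is invented.

```python
def mst_edges(T):
    """Kruskal's algorithm on the complete graph weighted by the matrix T."""
    n = len(T)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, i, j in sorted((T[i][j], i, j)
                          for i in range(n) for j in range(i + 1, n)):
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two different components
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

def average_mst_weight(T):
    """Mean T-index over the edges of the minimum spanning tree."""
    tree = mst_edges(T)
    return sum(w for _, _, w in tree) / len(tree)

# Hypothetical symmetric T-index matrix for four electrodes.
T = [[0.0, 1.0, 2.0, 5.0],
     [1.0, 0.0, 1.5, 4.0],
     [2.0, 1.5, 0.0, 3.0],
     [5.0, 4.0, 3.0, 0.0]]

tree = mst_edges(T)
avg = average_mst_weight(T)
```

For this toy matrix the tree uses the edges with weights 1.0, 1.5, and 3.0.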
Figure 4-4. Average value of T-index of the edges in Minimum Spanning Tree of
GRAPH-I. The horizontal axis is time in minutes.
Also, note that the average value of the T-index in the Minimum Spanning Tree is
less than Tcritical, which also supports the above statement about the
connectivity of the system.
4.2.2.4 Degrees of the Vertices
Another important issue that we analyze here is the degrees of the vertices in
GRAPH-II. Recall that the degree of a vertex is defined simply as the number of
edges emanating from it.
We look at the behavior of the average degree of the vertices in GRAPH-II over
time. Clearly, this plot is very similar to the behavior of the edge density of
GRAPH-II (see Figure 4-5).
Figure 4-5. Average degree of the vertices in GRAPH-II. The horizontal axis is
time in minutes.
We are also particularly interested in high-degree vertices, i.e., the
functional units of the brain that are connected (entrained) with many other
brain sites at a certain time moment. Interestingly enough, the vertex with the
maximum degree in GRAPH-II usually corresponds to an electrode located in RTD
(right temporal depth) or RST (right subtemporal cortex); in other words, the
vertex with the maximum degree is located near the epileptogenic focus.
4.2.2.5 Maximum Cliques
In previous work in the field of epileptic seizure prediction, a quadratic 0-1
programming approach based on EEG data was introduced [69]. In fact, this
approach utilizes the same preprocessing technique (i.e., calculating the values
of T-indices for all pairs of electrode sites) as we apply in this chapter. In
this subsection, we will briefly describe this quadratic programming technique
and relate it to the graph models introduced above.
The main idea of the considered quadratic programming approach is to construct a
model that would select a certain number of so-called "critical" electrode
sites, i.e., those that are the most entrained during the seizure. According to
Section 3, such a group of electrode sites should produce a minimal sum of
T-indices calculated for all pairs of electrodes within this group. If the
number of critical sites is set equal to k, and the total number of electrode
sites is n, then the problem of selecting the optimal group of sites can be
formulated as the following quadratic 0-1 problem [69]:
\[ \min \; x^T A x \tag{4-2} \]
\[ \text{s.t.} \quad \sum_{i=1}^{n} x_i = k, \tag{4-3} \]
\[ x_i \in \{0, 1\} \quad \forall i \in \{1, \ldots, n\}. \tag{4-4} \]
In this setup, the vector x = (x1, x2, ..., xn) consists of the components equal
to either 1 (if the corresponding site is included into the group of critical
sites) or 0 (otherwise), and the elements of the matrix A = [aij], i,j = 1,...,n,
are the values of Tij at the seizure point.
However, as was shown in previous studies, one can observe the
"resetting" of the brain after seizure onset [111, 70, 108], that is, the divergence of
STLmax profiles after a seizure. Therefore, to ensure that the optimal group of
critical sites shows this divergence, one can reformulate the optimization problem
by adding one more quadratic constraint:
$$x^{T} B x > T_{\mathrm{critical}}\, k (k - 1), \tag{4-5}$$
where the matrix $B = [b_{ij}]_{i,j=1,\ldots,n}$ is the T-index matrix of brain sites i and j
within 10-minute windows after the onset of a seizure.
This problem is then solved using standard techniques, and the group of k
critical sites is found. It should be pointed out that the number of critical sites k is
predetermined, i.e., it is defined empirically, based on practical observations. Also,
note that in terms of the GRAPH-I model, this problem amounts to finding a subgraph
of GRAPH-I of a fixed size that satisfies the properties specified above.
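To make the formulation with the objective (4-2), the cardinality constraint (4-3), and the divergence constraint (4-5) concrete, the following minimal sketch enumerates all groups of k sites for a tiny instance. It is only an illustration under stated assumptions: the function name `select_critical_sites`, the 4-site matrices, and the threshold value are hypothetical, and exhaustive enumeration stands in for the "standard techniques" mentioned above, which the text does not specify.

```python
from itertools import combinations

def select_critical_sites(A, B, k, t_critical):
    """Exhaustively solve the 0-1 quadratic problem for a small number of sites.

    A[i][j]: T-index of sites i and j at the seizure point (zero diagonal).
    B[i][j]: T-index of sites i and j in the window after seizure onset.
    Returns (objective value, group of site indices), or None if no group
    of size k satisfies the post-seizure constraint.
    """
    n = len(A)
    best = None
    for group in combinations(range(n), k):
        # For a 0-1 vector x, x^T M x is the sum of M over ordered pairs in the group.
        x_a_x = sum(A[i][j] for i in group for j in group)
        x_b_x = sum(B[i][j] for i in group for j in group)
        if not x_b_x > t_critical * k * (k - 1):
            continue  # the divergence constraint is violated
        if best is None or x_a_x < best[0]:
            best = (x_a_x, group)
    return best

# Hypothetical symmetric T-index matrices for 4 electrode sites.
A = [[0.0, 1.0, 5.0, 6.0],
     [1.0, 0.0, 4.0, 7.0],
     [5.0, 4.0, 0.0, 0.5],
     [6.0, 7.0, 0.5, 0.0]]
B = [[0.0, 9.0, 3.0, 3.0],
     [9.0, 0.0, 3.0, 3.0],
     [3.0, 3.0, 0.0, 1.0],
     [3.0, 3.0, 1.0, 0.0]]

print(select_critical_sites(A, B, k=2, t_critical=2.0))
```

In this toy instance the pair (2, 3) has the smallest objective value but is rejected because its T-index does not exceed the threshold after the seizure, so the pair (0, 1) is returned instead.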
Now, recall that we introduced GRAPH-III using the same principles as
in the formulation of the above optimization problem. That is, we considered
connections only between the pairs of sites i, j satisfying both of the following
conditions: $T_{ij} < T_{\mathrm{critical}}$ at the seizure point, and $T_{ij} > T_{\mathrm{critical}}$ 10 minutes after
the seizure point, which are exactly the conditions that the critical sites must
satisfy. A natural way of detecting such groups of sites is to find cliques in
GRAPH-III. Since a clique is a subgraph in which all vertices are interconnected,
all pairs of electrode sites in a clique satisfy the aforementioned
conditions. Therefore, the size of the maximum clique in GRAPH-III
represents an upper bound on the number of selected critical sites, i.e.,
the maximum value of the parameter k in the optimization problem described
above.
Computational results indicate that the maximum clique sizes for different
instances of GRAPH-III are close to the actual values of k empirically selected
in the quadratic programming model, which shows that these approaches are
consistent with each other.
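The link between GRAPH-III and the parameter k can be sketched in a few lines of Python. The sketch below builds the edge set from two T-index matrices using the two threshold conditions stated above and then finds a maximum clique by exhaustive search. The matrices, the threshold value 2.0, and the function names are hypothetical, and the exhaustive search is an assumption made for brevity; it is only practical for a few dozen vertices at most.

```python
from itertools import combinations

def build_graph_iii(T_seizure, T_after, t_critical):
    """Return the edge set {(i, j) with i < j} of GRAPH-III."""
    n = len(T_seizure)
    return {
        (i, j)
        for i, j in combinations(range(n), 2)
        if T_seizure[i][j] < t_critical and T_after[i][j] > t_critical
    }

def maximum_clique(n, edges):
    """Exhaustive search, largest size first; only practical for small n."""
    for size in range(n, 0, -1):
        for group in combinations(range(n), size):
            if all(pair in edges for pair in combinations(group, 2)):
                return group
    return ()

# Hypothetical T-index matrices for 5 electrode sites.
T_seizure = [[0.0, 1.0, 1.5, 3.0, 4.0],
             [1.0, 0.0, 1.2, 3.5, 4.2],
             [1.5, 1.2, 0.0, 1.8, 3.9],
             [3.0, 3.5, 1.8, 0.0, 2.5],
             [4.0, 4.2, 3.9, 2.5, 0.0]]
T_after = [[0.0, 5.0, 4.0, 1.0, 1.0],
           [5.0, 0.0, 6.0, 1.5, 1.2],
           [4.0, 6.0, 0.0, 1.9, 1.1],
           [1.0, 1.5, 1.9, 0.0, 3.0],
           [1.0, 1.2, 1.1, 3.0, 0.0]]

edges = build_graph_iii(T_seizure, T_after, t_critical=2.0)
clique = maximum_clique(5, edges)
print(sorted(edges), clique)
```

Here sites 0, 1, and 2 are pairwise entrained at the seizure point and pairwise divergent afterwards, so they form the maximum clique, and k = 3 is the largest group size for which every pair meets both conditions.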
4.3 Graph as a Macroscopic Model of the Epileptic Brain
Based on the results obtained in the sections above, we can now formulate a
graph model that describes the behavior of the epileptic brain at the macroscopic
level. The main idea of this model is to use the properties of GRAPH-I, GRAPH-II,
and GRAPH-III as a characterization of the behavior of the brain prior to,
during, and after epileptic seizures. According to this graph model, the graphs
reflecting the behavior of the epileptic brain demonstrate the following properties:
* The edge density and the average degree of the vertices increase during
the seizures and decrease after them;
* The graph is connected during the interictal state; however, it becomes
disconnected right after the seizures (during the postictal state);
* The vertex with the maximum degree corresponds to the epileptogenic focus.
Moreover, one of the advantages of the considered graph model is the possibility
of detecting special formations in these graphs, such as cliques and minimum
spanning trees, which can be used for further study of various properties of the
epileptic brain.
4.4 Concluding Remarks and Directions of Future Research
In this chapter, we have made an initial attempt to analyze EEG data and
model the epileptic brain using network-based approaches. Despite the fact that
the constructed graphs are rather small, we were able to determine
specific patterns in the behavior of the epileptic brain based on the information
obtained from statistical analysis of EEG data. Clearly, this model can be made
more accurate by considering more electrodes corresponding to smaller functional
units.
Among the directions of future research in this field, one can mention the
possibility of developing directed graph models based on the analysis of EEG data.
Such models would take into account the natural asymmetry of the brain, where
certain functional units control the other ones. Also, one could apply a similar
approach to studying the patterns underlying the brain function of patients
with other types of disorders, such as Parkinson's disease or sleep disorders.
Therefore, the methodology introduced in this chapter can be generalized and
applied in practice.
CHAPTER 5
COLLABORATION NETWORKS IN SPORTS
In this chapter, we discuss one of the most interesting real-life graph
applications: so-called "social networks," where the vertices are real people [63,
116]. The main idea of this approach is to consider the "acquaintanceship graph"
connecting the entire human population. In this graph, an edge connects two given
vertices if the corresponding two persons know each other.
Social networks are associated with the famous "small-world" hypothesis, which
claims that despite the large number of vertices, the distance between any two
vertices (or the diameter of the graph) is small. More specifically, the idea of
"six degrees of separation" has been introduced. It states that any two persons
in the world are linked with each other through a sequence of at most six people
[63, 116, 117].
Clearly, one cannot verify this hypothesis for the graph incorporating the more
than 6 billion people living on Earth; however, smaller subgraphs of the
acquaintanceship graph connecting certain groups of people can be investigated in
detail. One of the most well-known graphs of this type is the scientific collaboration
graph, reflecting the information about joint works between all scientists. Two
vertices are connected by an edge if the corresponding two scientists have a joint
research paper. Another graph of this type is known as the "Hollywood graph": it
links all movie actors, and an edge connects two actors if they ever appeared
in the same movie. Well-known concepts associated with these graphs are the
so-called "Erdős number" (in the scientific collaboration graph) and "Bacon number"
(in the Hollywood graph), which are assigned to every vertex and characterize
the distance from this vertex to the vertex denoting the "center" of the graph.
In the collaboration graph, the central vertex corresponds to the famous graph
theoretician Paul Erdős, whereas in the Hollywood graph the same position is
assigned to Kevin Bacon.
In this chapter, we discuss graphs of a similar type arising in sports, which
represent the players' "collaboration". In these graphs, the players are the vertices,
and an edge is added to the graph if the corresponding two players ever played
together in the same team. One example of this type of graph is the graph
representing baseball players. For any two baseball players who ever played in
Major League Baseball (MLB), a path connecting them can be found in this graph.
As another instance of social networks in sports, we study the "NBA graph,"
where the vertices represent all the basketball players who are currently playing
in the NBA. We apply standard graph-theoretical algorithms to investigate the
properties of this graph, such as its connectivity and diameter (i.e., the maximum
distance over all pairs of vertices in the graph). As we will see later in the
chapter, this study also confirms the small-world hypothesis. Moreover, we
introduce a distance measure in the NBA graph similar to the Erdős number and
the Bacon number. The central role in this graph is given to Michael Jordan, the
greatest basketball player of all time, and we refer to this measure as the Jordan
number.
5.1 Examples of Social Networks
In this section, we give a more detailed description of the examples of social
networks mentioned in the introduction: the scientific collaboration graph, the
Hollywood graph, and the baseball graph.
5.1.1 Scientific Collaboration Graph and Erdős Number
As mentioned above, the vertices of the scientific collaboration graph
are scientists, and the edges in this graph connect the scientists who have ever
collaborated with each other (i.e., had a joint paper). In order to measure
distances in this graph, a "central vertex" is introduced. This vertex corresponds
to Paul Erdős, the father of the theory of random graphs, and it is assigned
Erdős number 0. For all other vertices in the graph, the Erdős number
is defined as the distance (i.e., the shortest path length) from the central vertex.
For example, those scientists who had a joint paper with Erdős have Erdős
number 1, those who did not collaborate with Erdős but collaborated with Erdős'
collaborators have Erdős number 2, etc.
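The definition above is a shortest-path distance in an unweighted graph, so it can be computed with breadth-first search. The following is a minimal sketch on a made-up collaboration graph; the vertex labels "A" through "D" and the function name `distances_from` are illustrative assumptions, not data from this study.

```python
from collections import deque

def distances_from(graph, center):
    """Breadth-first search: shortest path length from `center` to each reachable vertex."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:  # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Hypothetical co-authorship lists; "D" never collaborated with anybody.
graph = {
    "Erdos": ["A", "B"],
    "A": ["Erdos", "C"],
    "B": ["Erdos"],
    "C": ["A"],
    "D": [],
}

erdos_numbers = distances_from(graph, "Erdos")
print(erdos_numbers)
```

Vertices that are not reachable from the center, such as "D" here, simply do not appear in the result, which corresponds to an undefined (infinite) Erdős number.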
Following this logic, one can construct the connected component of the
collaboration graph with "concentric circles", which would incorporate almost all
scientists in the world, except those who never collaborated with anybody. This
connected component is expected to have a relatively small diameter.
The idea of constructing collaboration graphs encompassing people in different
areas gave rise to several other applications. Next, we discuss the Hollywood
graph and the baseball graph, where the number of vertices is significantly smaller
than in the scientific collaboration graph, which allows one to study their structure
in more detail.
5.1.2 Hollywood Graph and Bacon Number
The Hollywood graph is constructed using the same principles as the scientific
collaboration graph; however, the number of Hollywood actors is much smaller than
the number of scientists, so one can investigate the characteristics of every
vertex in this graph. This information is maintained at the "Oracle of Bacon"
website.1 The most recent Hollywood graph contains 595,578 vertices (actors).
The central vertex in this graph represents the famous actor Kevin Bacon, and
this vertex obviously has Bacon number 0. Since the number of vertices in this
graph is small enough, one can explicitly calculate the Bacon number for every
actor. It turns out that most of the actors have Bacon numbers equal to 2 or 3,
and the maximum Bacon number is equal to 8, which is the case for only 3
vertices.
1 http://www.cs.virginia.edu/oracle/
Figure 5-1. Number of vertices in the Hollywood graph with different values of
Bacon number. Average Bacon number = 2.946. [Bar chart; bar labels for Bacon
numbers 0 through 8: 1, 1686, 133856, 364066, 88058, 6960, 854, 94, 3.]
The distribution of Bacon numbers in the Hollywood graph is shown in Figure
5-1. The average Bacon number (i.e., the average path length from a given actor
to Bacon) is equal to 2.946. As one can see, both the average and the maximum
Bacon numbers of the Hollywood graph are very small, which provides an argument
in favor of the small-world hypothesis mentioned above.
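The reported average can be recomputed from the distribution itself. The sketch below uses the bar labels of Figure 5-1; the assignment of each label to a Bacon number is an assumption made here (the chart is not reproduced), although it is consistent with the stated total of 595,578 vertices, with the 3 vertices at Bacon number 8, and with the stated average.

```python
# Bar labels of Figure 5-1, keyed by Bacon number (assignment assumed, see above).
bacon_counts = {
    0: 1,        # Kevin Bacon himself
    1: 1686,
    2: 133856,
    3: 364066,
    4: 88058,
    5: 6960,
    6: 854,
    7: 94,
    8: 3,
}

total_vertices = sum(bacon_counts.values())
# Weighted mean: each Bacon number weighted by the number of actors that have it.
average_bacon = sum(k * count for k, count in bacon_counts.items()) / total_vertices

print(total_vertices, round(average_bacon, 3))
```

Actors with Bacon number 2 or 3 account for 497,922 of the 595,578 vertices, which is why the average lies just below 3.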
5.1.3 Baseball Graph and Wynn Number
Collaboration networks similar to the ones mentioned above can also be
constructed in sports. One example of such a network is the "baseball graph"
representing all baseball players who ever played in the MLB. In this graph, two
players are connected if they were ever teammates. The most recent baseball graph
has 15817 vertices. Links between any pair of baseball players can be found at the
"Oracle of Baseball" website.2
One can assign the central role in this graph to Early Wynn, a member of the
Hall of Fame who spent 23 seasons in the MLB. Figure 5-2 shows the distribution
of Wynn numbers in the baseball graph. The maximum Wynn number is 6, which
is smaller than the maximum Bacon number since the total number of baseball players
is less than the number of Hollywood actors.
2 http://www.baseball-reference.com/oracle/
Figure 5-2. Number of vertices in the baseball graph with different values of Wynn
number. Average Wynn number = 2.901. [Bar chart; bar labels for Wynn numbers
1 through 6: 408, 5286, 6663, 2472, 899, 88.]
5.1.4 Diameter of Collaboration Networks
Another aspect that should be mentioned here is that the maximum distance from
the central vertex in a collaboration graph certainly depends on the choice
of this central vertex. The reason for choosing Kevin Bacon as the center of the
Hollywood graph, and Early Wynn as the center of the baseball graph, is
that it is reasonable to expect them to be connected to many vertices: Bacon
appeared in many movies, and Wynn played for several baseball teams and had a lot
of teammates during his long career. However, one can choose less "connected"
centers of these graphs, and in this case the maximum distance from the new center
of the graph may significantly increase. For example, if one chooses Barry Bonds as
the center of the baseball graph, the maximum Bonds number will be 9 instead of
6. Moreover, in the Hollywood graph, it is possible to choose the center so that the
maximum distance from it is equal to 14, and the average distance is greater than 6
(instead of 2.946). Therefore, in order to have more complete information about
the structure of these graphs, one should calculate the maximum distance
among all pairs of vertices in the graph. Recall that this quantity is referred to as
the diameter of the graph. Clearly, the diameter can be found by considering each
vertex as the center of the graph, calculating the corresponding maximal distances, and
then choosing the maximum among them.
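The procedure described above (take every vertex as the center, find its largest distance, and keep the maximum) can be sketched as follows. The five-vertex path graph is a made-up example chosen to show how strongly the maximum distance depends on the center; the function names are illustrative, and the sketch assumes a connected graph.

```python
from collections import deque

def eccentricity(graph, center):
    """Largest breadth-first-search distance from `center` (graph assumed connected)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(graph):
    """Maximum over all choices of the center of the largest distance from it."""
    return max(eccentricity(graph, v) for v in graph)

# Hypothetical path graph a - b - c - d - e.
path = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}

print(eccentricity(path, "c"), eccentricity(path, "a"), diameter(path))
```

With the middle vertex "c" as the center the maximum distance is 2, whereas with the end vertex "a" it is 4, which is also the diameter. One search is run per vertex, so the total work grows with the number of vertices times the number of edges.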
In the next section, we study the properties of the NBA graph incorporating
basketball players playing in the world's best basketball league. In a similar
fashion, we introduce the Jordan number, investigate its values corresponding to
different vertices, and calculate the diameter of this graph.
5.2 NBA Graph
The NBA graph considered in this section is constructed using the same
idea as the graphs described above. Here we provide a detailed description of
the structural properties of this graph. As we will see, its properties are rather
similar to the properties of other social networks, which confirms the small-world
hypothesis.
5.2.1 General Properties of the NBA Graph
The instance of the NBA graph that we consider in this section is relatively
small and contains only those players who are currently playing in the NBA (as
of the 2002-2003 season). However, this information is sufficient to reveal that
the NBA graph follows patterns similar to those of other social networks. As of May 2003,
the total number of players on the rosters of all the NBA teams is equal to 404
(players picked in the 2003 NBA draft and transfers that occurred after the end
of the 2002-2003 season are not taken into account). An edge connects two given
players if they ever played in the same team. Consequently, the constructed NBA
graph has 404 vertices and 5492 edges connecting them. Note that the maximum
possible number of edges is equal to 404 x (404 - 1)/2 = 81406; therefore, the
edge density of this graph (i.e., the ratio of the number of edges to the maximum
possible number of edges) is rather small: 5492/81406 = 6.7%.
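The arithmetic above can be checked directly. The short sketch below recomputes the maximum number of edges and the edge density from the vertex and edge counts given in the text; only the function name `edge_density` is introduced here for illustration.

```python
def edge_density(num_vertices, num_edges):
    """Ratio of the number of edges to the maximum possible number of edges."""
    max_edges = num_vertices * (num_vertices - 1) // 2
    return num_edges / max_edges

max_edges = 404 * (404 - 1) // 2        # 81406 possible edges
density = edge_density(404, 5492)       # about 0.0675

print(max_edges, f"{density:.1%}")
```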
As one can easily see, this graph has a highly specific structure: the players
of every team form a clique in the graph (i.e., a set of completely interconnected
vertices), because all the vertices corresponding to the players of the same team
must be interconnected. Since many players change teams during or between
seasons, there are edges connecting vertices from different cliques (teams). Note
that this type of structure is common for all "collaboration networks" (see Figure
5-3).
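The clique structure described above follows directly from the way such a graph is built from team rosters. The following minimal sketch, with made-up player labels "P1" to "P4" and a hypothetical function name, adds an edge for every pair of players who share a roster; a player who appears on two rosters links the two cliques.

```python
from itertools import combinations

def collaboration_graph(rosters):
    """Edge set of the collaboration graph: one edge per pair of teammates.

    rosters: iterable of player collections, one per team and season.
    """
    edges = set()
    for roster in rosters:
        # Every roster contributes a clique on its players.
        for a, b in combinations(sorted(set(roster)), 2):
            edges.add((a, b))
    return edges

# Hypothetical rosters; "P3" changed teams and therefore appears twice.
rosters = [["P1", "P2", "P3"], ["P3", "P4"]]
edges = collaboration_graph(rosters)
print(sorted(edges))
```

Players "P1" and "P4" were never teammates, so there is no edge between them, but they are joined by a path of length 2 through "P3".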
It should be pointed out that the number of players on a basketball team is
relatively small, and players' transfers between different teams occur rather
often. Therefore, it is logical to expect that the NBA graph is
connected, i.e., there is a path from every vertex to every other vertex, and that the
length of this path is small. As we will see below, calculations
confirm these assumptions.
Figure 5-3. General structure of the NBA graph and other collaboration networks
First, we used a standard breadth-first search technique to check the
connectivity of the considered graph. Starting from an arbitrary vertex, we were
able to reach all other vertices in the graph, which means that every vertex is
reachable from every other vertex; therefore, the graph is connected. In the next subsection,
we will also see that every pair of vertices in this graph is connected by a short
path, which is in agreement with the small-world hypothesis.
Figure 5-4. Number of vertices in the NBA graph with different values of Jordan
number. Average Jordan number = 2.270. [Bar chart; bar labels for Jordan
numbers 0 through 3: 1, 24, 244, 135.]
5.2.2 Diameter of the NBA Graph and Jordan Number
The next subject of our interest is verifying whether the NBA graph follows the
small-world hypothesis. We need to answer the question: what is the distance
between any two vertices in this graph?
Similarly to the social graphs mentioned above, we define the "central vertex"
in the NBA graph as the one corresponding to Michael Jordan, who played for the Washington
Wizards during his final NBA season. Obviously, all other players on the Wizards'
roster for 2002-2003, as well as all the players who played with Jordan
during at least one season in the past, have Jordan number 1. It should be noted
that Michael Jordan played for only two teams (the Chicago Bulls and the Washington
Wizards) throughout his entire career; therefore, one can expect the number of
players with Jordan number 1 to be rather small. In fact, only 24 players currently
playing in the NBA have Jordan number 1.
Table 5-1. Jordan numbers of some NBA stars (end of the 2002-2003 season).
Player               Team                      Jordan Number
Kobe Bryant          Los Angeles Lakers        2
Vince Carter         Toronto Raptors           2
Vlade Divac          Sacramento Kings          2
Tim Duncan           San Antonio Spurs         2
Michael Finley       Dallas Mavericks          2
Steve Francis        Houston Rockets           3
Kevin Garnett        Minnesota Timberwolves    3
Pau Gasol            Memphis Grizzlies         3
Richard Hamilton     Detroit Pistons           1
Allen Iverson        Philadelphia 76ers        2
Jason Kidd           New Jersey Nets           2
Toni Kukoc           Milwaukee Bucks           1
Karl Malone          Utah Jazz                 2
Stephon Marbury      Phoenix Suns              2
Shawn Marion         Phoenix Suns              2
Kenyon Martin        New Jersey Nets           3
Jamal Mashburn       New Orleans Hornets       2
Tracy McGrady        Orlando Magic             2
Reggie Miller        Indiana Pacers            3
Yao Ming             Houston Rockets           3
Dikembe Mutombo      New Jersey Nets           2
Steve Nash           Dallas Mavericks          2
Dirk Nowitzki        Dallas Mavericks          2
Jermaine O'Neal      Indiana Pacers            2
Shaquille O'Neal     Los Angeles Lakers        2
Gary Payton          Milwaukee Bucks           2
Paul Pierce          Boston Celtics            2
Scottie Pippen       Portland Trail Blazers    1
David Robinson       San Antonio Spurs         2
Arvydas Sabonis      Portland Trail Blazers    2
Jerry Stackhouse     Washington Wizards        1
Predrag Stojakovic   Sacramento Kings          2
Antoine Walker       Boston Celtics            2
Ben Wallace          Detroit Pistons           2
Chris Webber         Sacramento Kings          2
Following similar logic, the players who have played with Jordan's "collaborators"
have Jordan number 2, and so on. However, it turns out that the maximum
Jordan number in this instance of the NBA graph is only 3, i.e., all the players are
linked with Jordan through at most two vertices, which is certainly not surprising:
with 29 teams and only around 15 players on each team, the NBA is really a "small
world." Figure 5-4 shows the distribution of Jordan numbers in the NBA graph.
The average Jordan number is equal to 2.27, which is smaller than the average
Bacon number in the Hollywood graph and the average Wynn number in the
baseball graph, due to the smaller number of vertices.
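As a quick consistency check, the average can be recomputed from the bar labels of Figure 5-4, assuming that the labels 1, 24, 244, and 135 belong to Jordan numbers 0, 1, 2, and 3, respectively. The chart is not reproduced here, so this assignment is an assumption, although the value 24 for Jordan number 1 is stated in the text and the four labels add up to the 404 vertices of the graph:

```latex
\[
\frac{0\cdot 1 + 1\cdot 24 + 2\cdot 244 + 3\cdot 135}{404}
  = \frac{917}{404} \approx 2.270
\]
```

Under this assignment, the stated value 2.270 is reproduced when the sum is divided by all 404 vertices, Jordan's own vertex included.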
Table 5-1 presents Jordan numbers corresponding to some well-known NBA
players. Not surprisingly, most of them have Jordan number 2, except for several
players with Jordan number 3: those who joined the league recently, and therefore
did not have many teammates during their career, as well as Reggie Miller, who
spent 16 seasons with the same team (Indiana Pacers), and Kevin Garnett, who played
in Minnesota for 8 years. Scottie Pippen, Toni Kukoc, and Jerry Stackhouse were
Jordan's teammates at different times; therefore, they have Jordan number 1.
Furthermore, we calculated the diameter of the NBA graph, i.e., the maximum
distance between any two vertices in the graph. Since the maximum
Jordan number in the NBA graph is equal to 3, any two players are joined by a
path through Jordan of length at most 6, so the diameter lies between 3 and 6 and
is of the same order of magnitude. As mentioned in
the previous section, the diameter of the NBA graph can be found as follows: for
every given vertex, we calculate the distances between this vertex and all others.
In this approach, we need to repeat this procedure 404 times, and every time a
different vertex is considered to be the "center" of the graph. Our calculations
show that the diameter of the NBA graph (the maximum distance over all pairs
of vertices) is equal to 4. Therefore, one can claim that the NBA graph
follows the small-world hypothesis, since its diameter is small.
Table 5-2. Degrees of the Vertices in the NBA graph
Degree interval    Number of vertices
11-20              134
21-30              116
31-40              103
41-50              42
51-60              8
61+                2
5.2.3 Degrees and "Connectedness" of the Vertices in the NBA Graph
As pointed out above, the maximum and the average distance from
the center of the graph depend on the choice of this center. One can
easily guess that Michael Jordan is not the most "connected" central vertex of
the NBA graph, since he played for only two teams and the number of his former
teammates among currently active players is rather small. In fact, the degree of the
vertex (i.e., the number of edges incident to it, or the number of teammates)
corresponding to Jordan is only 24. Table 5-2 presents the number of vertices in
the NBA graph corresponding to different intervals of the degree values.
It would be reasonable to assume that if one picks a vertex with a high degree
as the center of the NBA graph, the average distance in the graph corresponding to
this vertex would be smaller than the average Jordan number. We have found the
most "connected" players in the NBA graph, i.e., those with the smallest corresponding
average distances. Table 5-3 presents five players who could be the most "connected"
centers of the NBA graph. As one can notice, all of them are "bench players" who
have changed teams many times during their careers; therefore, they have high degrees in
the NBA graph. Also, an interesting observation is that although the degree of Corie
Blount's vertex is smaller than that of Jim Jackson's, the average connectivity is higher for
Corie Blount (i.e., his average distance is smaller), which could be explained by the fact
that his teammates were highly "connected" themselves.

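The observation that a vertex of lower degree can still be a better center is easy to reproduce on a small example. In the sketch below, which uses a made-up graph and hypothetical vertex labels, the vertex "X" has only two neighbors, but both are hubs, so its average distance is smaller than that of the hub "H". The average is taken over all vertices including the center, which is an assumption about the convention used for the averages in this chapter.

```python
from collections import deque

def average_distance(graph, center):
    """Average breadth-first-search distance from `center` over all reachable vertices.

    The center itself (distance 0) is counted in the denominator.
    """
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(dist.values()) / len(dist)

# Hypothetical graph: two hubs "H" and "K" with four leaves each, joined through "X".
graph = {
    "H": ["h1", "h2", "h3", "h4", "X"],
    "K": ["k1", "k2", "k3", "k4", "X"],
    "X": ["H", "K"],
    "h1": ["H"], "h2": ["H"], "h3": ["H"], "h4": ["H"],
    "k1": ["K"], "k2": ["K"], "k3": ["K"], "k4": ["K"],
}

for v in ("X", "H"):
    print(v, len(graph[v]), round(average_distance(graph, v), 3))
```

Here "X" has degree 2 and average distance 18/11, while "H" has degree 5 and average distance 19/11, so the degree alone does not determine how central a vertex is.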
PAGE 11
Thisstudydevelopsnovelapproachestomodelingrealworlddatasetsarisingindiverseapplicationareasasnetworksandinformationretrievalfromthesedatasetsusingnetworkoptimizationtechniques.Networkbasedmodelsallowonetoextractinformationfromdatasetsusingvariousconceptsfromgraphtheory.Inmanycases,onecaninvestigatespecicpropertiesofadatasetbydetectingspecialformationsinthecorrespondinggraph(forinstance,connectedcomponents,spanningtrees,cliques,andindependentsets).Thisprocessofteninvolvessolvingcomputationallychallengingcombinatorialoptimizationproblemsongraphs(maximumindependentset,maximumclique,minimumcliquepartition,etc.).Theseproblemsareespeciallydiculttosolveforlargegraphs.However,incertaincases,theexactsolutionofahardoptimizationproblemcanbefoundusingaspecialstructureoftheconsideredgraph. Asignicantpartofthedissertationfocusesondevelopingnetworkbasedmodelsofrealworldcomplexsystems,includingthestockmarketandthehumanbrain,whichhavealwaysbeenofspecialinteresttoscientists.Thesesystemsgeneratehugeamountsofdataandareespeciallyhardtoanalyze.Thisdissertation xi
PAGE 12
Thedevelopednetworkrepresentationsoftheconsidereddatasetsareinmanycasesnontrivialandincludecertainstatisticalpreprocessingtechniques.Inparticular,theU.S.stockmarketisrepresentedasanetworkbasedoncrosscorrelationsofpriceuctuationsofthenancialinstruments,whicharecalculatedoveracertainnumberoftradingdays.Thismodel(marketgraph)allowsonetoanalyzethestructureanddynamicsofthestockmarketfromanalternativeperspectiveandobtainusefulinformationabouttheglobalstructureofthemarket,classesofsimilarstocks,anddiversiedportfolios. Similarly,amacroscopicnetworkmodelofthehumanbrainisconstructedbasedonthestatisticalmeasuresofentrainmentbetweenelectroencephalographic(EEG)signalsrecordedfromdierentfunctionalunitsofthebrain.Studyingtheevolutionofthepropertiesofthesenetworksrevealedsomeinterestingfactsaboutbraindisorders,suchasepilepsy. xii
PAGE 13
Nowadays,theprocessofstudyingreallifecomplexsystemsoftendealswithlargedatasetsarisingindiverseapplicationsincludinggovernmentandmilitarysystems,telecommunications,biotechnology,medicine,nance,astrophysics,ecology,geographicalinformationsystems,etc.[ 3 25 ].Understandingthestructuralpropertiesofacertaindatasetisinmanycasesthetaskofcrucialimportance.Togetusefulinformationfromthesedata,oneoftenneedstoapplyspecialtechniquesofsummarizingandvisualizingtheinformationcontainedinadataset. Anappropriatemathematicalmodelcansimplifytheanalysisofadatasetandeventheoreticallypredictsomeofitsproperties.Thus,afundamentalproblemthatariseshereismodelingthedatasetscharacterizingrealworldcomplexsystems. Inthisdissertation,weconcentrateononeaspectofthisproblem:networkrepresentationofrealworlddatasets.Accordingtothisapproach,acertaindatasetisrepresentedasagraph(network)withcertainattributesassociatedwithitsverticesandedges. Studyingthestructureofagraphrepresentingadatasetisoftenimportantforunderstandingtheinternalpropertiesoftheapplicationitrepresents,aswellasforimprovingstorageorganizationandinformationretrieval.Onecanvisualizeagraphasasetofdotsandlinksconnectingthem,whichoftenmakesthisrepresentationconvenientandeasilyunderstandable. Themainconceptsofgraphtheorywerefoundedseveralcenturiesago,andmanynetworkoptimizationalgorithmshavebeendevelopedsincethen.However,graphmodelshavebeenappliedonlyrecentlytorepresentingvariousreallifemassivedatasets.Graphtheoryisquicklybecomingapracticaleldofscience. 1
Expansion of graph-theoretical approaches in various applications gave birth to the terms "graph practice" and "graph engineering" [63].

Network-based models allow one to extract information from real-world datasets using various standard concepts from graph theory. In many cases, one can investigate specific properties of a dataset by detecting special formations in the corresponding graph, for instance, connected components, spanning trees, cliques and independent sets. In particular, cliques and independent sets can be used for solving the important clustering problem arising in data mining, which essentially represents partitioning the set of elements of a certain dataset into a number of subsets (clusters) of objects according to some similarity (or dissimilarity) criterion. These concepts are associated with a number of network optimization problems discussed later.

Another aspect of investigating network models of real-world datasets is studying the degree distribution of the constructed graphs. The degree distribution is an important characteristic of a dataset represented by a graph. It represents the large-scale pattern of connections in the graph, which reflects the global properties of the dataset. One of the important results discovered during the last several years is the observation that many graphs representing the datasets from diverse areas (Internet, telecommunications, biology, sociology) obey the power-law model [9]. The fact that graphs representing completely different datasets have a similar well-defined power-law structure has been widely reflected in the literature [10, 19, 20, 25, 63, 116, 117]. It indicates that global organization and evolution of datasets arising in various spheres of life nowadays follow similar laws and patterns. This fact served as a motivation to introduce a concept of "self-organized networks."

Later we discuss in more detail various aspects of modeling real-world datasets as networks, and retrieving useful information from these networks. The practical
importance of graph-theoretic techniques is shown by several examples of applying these approaches associated with datasets arising in telecommunications, internet, sociology, etc. The major part of the dissertation is devoted to novel network-based techniques and models that allow one to obtain important nontrivial information from datasets arising in finance and biomedicine.

Let $G = (V, E)$ be an undirected graph with the set of $n$ vertices $V$ and the set of edges $E = \{(i, j) : i, j \in V\}$. Directed graphs, where the head and tail of each edge are specified, are considered in some applications. The concept of a multigraph is also sometimes introduced. A multigraph is a graph where multiple edges connecting a given pair of vertices may exist. One of the important characteristics of a graph is its edge density: the ratio of the number of edges in the graph to the maximum possible number of edges.

The degree of a vertex is the number of edges emanating from it. For every integer $k$ one can calculate the number of vertices $n(k)$ with a degree equal to $k$, and then get the probability that a vertex has the degree $k$ as $P(k) = n(k)/n$, where $n$ is the total number of vertices. The function $P(k)$ is referred to as the degree distribution of the graph.
Degree distribution is an important characteristic of a dataset represented by a graph. It reflects the overall pattern of connections in the graph, which in many cases reflects the global properties of the dataset this graph represents. As mentioned above, many real-world graphs representing the datasets coming from diverse areas (Internet, telecommunications, finance, biology, sociology) have degree distributions that follow the power-law model, which states that the probability that a vertex of a graph has a degree $k$ (i.e., there are $k$ edges emanating from it) is

$$P(k) \propto k^{-\gamma}. \qquad (1\text{-}1)$$

Equivalently, one can represent it as

$$\log P \propto -\gamma \log k, \qquad (1\text{-}2)$$

which demonstrates that this distribution forms a straight line in the logarithmic scale, and the slope of this line is determined by the value of the parameter $\gamma$ (it equals $-\gamma$).

An important characteristic of the power-law model is its scale-free property. This property implies that the power-law structure of a certain network should not depend on the size of the network. Clearly, real-world networks dynamically grow over time; therefore, the growth process of these networks should obey certain rules in order to satisfy the scale-free property. The necessary properties of the evolution of real-world networks are growth and preferential attachment [20]. The first property implies the obvious fact that the size of these networks grows continuously (i.e., new vertices are added to a network, which means that new elements are added to the corresponding dataset). The second property represents the idea that
new vertices are more likely to be connected to old vertices with high degrees. It is intuitively clear that these principles characterize the evolution of many real-world complex networks nowadays.

From another perspective, some properties of graphs that follow the power-law model can be predicted theoretically. Aiello et al. [9] studied the properties of power-law graphs using the theoretical power-law random graph model representing the class of random graphs obeying the power law (see Chapter 2). Among their results, one can mention the existence of a giant connected component in a power-law graph with $\gamma < \gamma_0 \approx 3.47875$, and the fact that a giant connected component does not exist otherwise.

The size of connected components of the graph may provide useful information about the structure of the corresponding dataset, as the connected components would normally represent groups of "similar" objects. In some applications, decomposing the graph into a set of connected components can provide a reasonable solution to the clustering problem (i.e., partitioning the graph into several subgraphs, each of which corresponds to a certain cluster).

The following definitions generalize the concept of clique. Instead of cliques one can consider dense subgraphs, or quasi-cliques. A $\gamma$-clique $C_\gamma$, also called a
An independent set is a subset $I \subseteq V$ such that the subgraph $G(I)$ has no edges. The maximum independent set problem can be easily reformulated as the maximum clique problem in the complementary graph $\bar{G}(V, \bar{E})$, defined as follows. If an edge $(i, j) \in E$, then $(i, j) \notin \bar{E}$; and if $(i, j) \notin E$, then $(i, j) \in \bar{E}$. Clearly, a maximum clique in $\bar{G}$ is a maximum independent set in $G$, so the maximum clique and maximum independent set problems can be easily reduced to each other.

Locating cliques (quasi-cliques) and independent sets in a graph representing a dataset provides important information about this dataset. Intuitively, edges in such a graph would connect vertices corresponding to "similar" elements of the dataset. Therefore, cliques (or quasi-cliques) would naturally represent dense clusters of similar objects. On the contrary, independent sets can be treated as groups of objects in which each object differs from every other object in the group. This information is also important in some applications. Clearly, it is useful to find a maximum clique or independent set in the graph, since it would give the maximum possible size of the groups of "similar" or "different" objects.

The maximum clique problem (as well as the maximum independent set problem) is known to be NP-hard [59]. Moreover, it turns out that these problems are difficult to approximate [18, 62]. This makes these problems especially challenging in large graphs.

[102] give various mathematical programming formulations of these problems. Clearly, as in
the case of maximum clique and maximum independent set problems, minimum clique partition and graph coloring are reduced to each other by considering the complementary graph, and both of these problems are NP-hard [59]. Solving these problems for graphs representing real-life datasets is important from a data mining perspective, especially for solving the clustering problem.

The essence of clustering is partitioning the elements in a certain dataset into several distinct subsets (clusters) grouped according to an appropriate similarity criterion [34]. Identifying the groups of objects that are "similar" to each other but "different" from other objects in a given dataset is important in many practical applications. The clustering problem is challenging because the number of clusters and the similarity criterion are usually not known a priori.

If a dataset is represented as a graph, where each data element corresponds to a vertex, the clustering problem essentially deals with decomposing this graph into a set of subgraphs (subsets of vertices), so that each of these subgraphs corresponds to a specific cluster.

Since the data elements assigned to the same cluster should be "similar" to each other, the goal of clustering can be achieved by finding a clique partition of the graph, and the number of clusters will equal the number of cliques in the partition.

Similar arguments hold for the case of the graph coloring problem, which should be solved when a dataset needs to be decomposed into clusters of "different" objects (i.e., each object in a cluster is different from all other objects in the same cluster) that can be represented as independent sets in the corresponding graph. The number of independent sets in the optimal partition is referred to as the chromatic number of the graph.

Instead of cliques and independent sets one can consider quasi-cliques and quasi-independent sets, and partition the graph on this basis. As mentioned,
quasi-cliques are subgraphs that are dense enough (i.e., they have a high edge density). Therefore, it is often reasonable to relate clusters to quasi-cliques, since they represent sufficiently dense clusters of similar objects. Obviously, in the case of partitioning a dataset into clusters of "different" objects, one can use quasi-independent sets (i.e., subgraphs that are sparse enough) to define these clusters.
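The notions of edge density, quasi-clique, and complementary graph used in this chapter can be illustrated with a minimal Python sketch. The function names and the density threshold argument `gamma` are illustrative assumptions, and only the density condition is checked here (some definitions of a quasi-clique additionally require the induced subgraph to be connected).

```python
from itertools import combinations


def normalize(edges):
    """Store each undirected edge as a frozenset so (u, v) and (v, u) coincide."""
    return {frozenset(e) for e in edges}


def edge_density(subset, edges):
    """Edges present inside `subset` divided by the maximum possible number."""
    members = sorted(set(subset))
    q = len(members)
    if q < 2:
        return 1.0  # assumed convention: a single vertex is trivially dense
    present = sum(1 for pair in combinations(members, 2)
                  if frozenset(pair) in edges)
    return present / (q * (q - 1) / 2)


def is_quasi_clique(subset, edges, gamma):
    """True if the induced subgraph has edge density of at least `gamma`."""
    return edge_density(subset, edges) >= gamma


def complement(vertices, edges):
    """Edges of the complementary graph: exactly the pairs missing from `edges`."""
    return {frozenset(pair)
            for pair in combinations(sorted(set(vertices)), 2)
            if frozenset(pair) not in edges}


# Toy graph: a triangle {1, 2, 3} with a pendant vertex 4 attached to 3.
vertices = [1, 2, 3, 4]
edges = normalize([(1, 2), (1, 3), (2, 3), (3, 4)])
co_edges = complement(vertices, edges)
```

In this toy graph the triangle is a clique (density 1), the whole vertex set is a quasi-clique for thresholds up to 2/3, and the same triangle has no edges in the complementary graph, which is the clique/independent set correspondence described earlier.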
In this chapter, we review current developments in studying massive graphs used as models of certain real-world datasets.

[3]. Some of the wide range of problems associated with massive datasets are data warehousing, compression and visualization, information retrieval, clustering and pattern recognition, and nearest neighbor search. Handling these problems requires special interdisciplinary efforts to develop novel sophisticated techniques. The pervasiveness and complexity of the problems brought by massive datasets make it one of the most challenging and exciting areas of research for years to come.

In many cases, a massive dataset can be represented as a very large graph with certain attributes associated with its vertices and edges. These attributes may contain specific information characterizing the given application. Studying the structure of this graph is important for understanding the structural properties of the application it represents, as well as for improving storage organization and information retrieval.

[25].
As before, by $G = (V, E)$ we will denote a simple undirected graph with the set of $n$ vertices $V$ and the set of edges $E$. A multigraph is an undirected graph with multiple edges.

The distance between two vertices is the number of edges in the shortest path between them (it is equal to infinity for vertices belonging to different connected components). The diameter of a graph $G$ is usually defined as the maximal distance between pairs of vertices of $G$. In a disconnected graph, the usual definition of the diameter would result in an infinite diameter, so the following definition is in order. By the diameter of a disconnected graph we will mean the maximum finite shortest path length in the graph (the same as the largest of the diameters of the graph's connected components).

2.1.1.1 Call Graph

[2]. In this call graph the vertices are telephone numbers, and two vertices are connected by an edge if a call was made from one number to another.

Abello et al. [2] experimented with data from AT&T telephone billing records. To give an idea of how large a call graph can be, we mention that a graph based on one 20-day period had 290 million vertices and 4 billion edges. The analyzed one-day call graph had 53,767,087 vertices and over 170 million edges. This graph appeared to have 3,667,448 connected components, most of them tiny; only 302,468 (or 8%) components had more than 3 vertices. A giant connected component with 44,989,297 vertices was computed. It was observed that the existence of a giant component resembles a behavior suggested by the random graph theory of Erdős and Rényi [47, 48], which will be mentioned below, but by the pattern of connections the call graph obviously does not fit into this theory (Subsection 2.1.3).
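The definition of the diameter of a disconnected graph (the maximum finite shortest path length) and the decomposition into connected components can be sketched with breadth-first search. This is a small in-memory illustration under the assumption that the graph is given as an adjacency dictionary; it is not the external memory procedure used for the call graph, and the helper names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def connected_components(adj):
    """List of vertex sets, one per connected component."""
    seen, components = set(), []
    for v in adj:
        if v not in seen:
            component = set(bfs_distances(adj, v))
            seen |= component
            components.append(component)
    return components


def finite_diameter(adj):
    """Largest finite shortest path length; unreachable pairs are simply ignored."""
    return max((max(bfs_distances(adj, v).values()) for v in adj), default=0)


# Toy graph: a path 1-2-3-4, a separate edge 5-6, and an isolated vertex 7.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3], 5: [6], 6: [5], 7: []}
component_sizes = sorted(len(c) for c in connected_components(adj))
diameter = finite_diameter(adj)
```

Running one search per vertex costs on the order of $n$ times the graph size, so for graphs of the scale discussed here one would sample sources or rely on the decomposition techniques cited in the text.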
The maximum clique problem and the problem of finding large quasi-cliques with prespecified density were considered in this giant component. These problems were attacked using a greedy randomized adaptive search procedure (GRASP) [51, 52]. In short, GRASP is an iterative method that at each iteration constructs, using a greedy function, a randomized solution and then finds a locally optimal solution by searching the neighborhood of the constructed solution. This is a heuristic approach which gives no guarantee about the quality of the solutions found, but it has proved to be practically efficient for many combinatorial optimization problems. To make application of optimization algorithms in the considered large component possible, the authors used suitable graph decomposition techniques employing external memory algorithms (see Subsection 2.1.2).

Figure 2-1. Frequencies of clique sizes in the call graph found by Abello et al. [2].

Abello et al. [2] ran 100,000 GRASP iterations, taking 10 parallel processors about one and a half days to finish. Of the 100,000 cliques generated, 14,141 appeared to be distinct, although many of them had vertices in common. Abello et al. suggested that the graph contains no clique of a size greater than 32. Figure 2-1 shows the number of detected cliques of various sizes. Finally, large
quasi-cliques with density parameters $\gamma = 0.9$, $0.8$, $0.7$, and $0.5$ for the giant connected component were computed. The sizes of the largest quasi-cliques found were 44, 57, 65, and 98, respectively.

Figure 2-2. Pattern of connections in the call graph: number of vertices with various out-degrees (a) and in-degrees (b); number of connected components of various sizes (c) in the call graph [8].

Aiello et al. [8] used the same data as Abello et al. [2] to show that the considered call graph fits their power-law random graph model (Section 2.1.3). The plots in Figure 2-2 demonstrate some connectivity properties of the call graph.

Summarizing the results presented in this subsection, one can say that graph-based techniques proved to be rather useful in analyzing and revealing the global
patterns of the telecommunications traffic dataset. In the next subsection, we will consider another example of a similar type of dataset associated with the World Wide Web.

Figure 2-3 shows the dynamics of growth of the number of Internet hosts for the last 13 years. As of January 2002 this number was estimated to be close to 150 million.

Figure 2-3. Number of Internet hosts for the period 01/1991-01/2002. Data by Internet Software Consortium.
Figure 2-4. Pattern of connections in the Web graph: number of vertices with various out-degrees (left) and distribution of sizes of strongly connected components (right) in the Web graph [37].

The highly dynamic and seemingly unpredictable structure of the World Wide Web attracts more and more attention of scientists representing many diverse disciplines, including graph theory. In a graph representation of the World Wide Web, the vertices are documents and the edges are hyperlinks pointing from one document to another. Similarly to the call graph, the Web is a directed multigraph, although often it is treated as an undirected graph to simplify the analysis. Another graph is associated with the physical network of the Internet, where the vertices are routers navigating packets of data or groups of routers (domains). The edges in this graph represent wires or cables in the physical network.

Graph theory has been applied for web search [36, 78], web mining [96, 97] and other problems arising in the Internet and World Wide Web. In several recent studies, there were attempts to understand some structural properties of the Web graph by investigating large Web crawls. Adamic and Huberman [6, 65] used crawls which covered almost 260,000 pages in their studies. Barabási and Albert [20] analyzed a subgraph of the Web graph of approximately 325,000 nodes representing nd.edu pages. In another experiment, Kumar et al. [82] examined a dataset containing about 40 million pages. In a recent study, Broder et al. [37]
used two Altavista crawls, each with about 200 million pages and 1.5 billion links, thus significantly exceeding the scale of the preceding experiments. This work yielded several remarkable observations about local and global properties of the Web graph. All of the properties observed in one of the two crawls were validated for the other as well. Below, by the Web graph we will mean one of the crawls, which has 203,549,046 nodes and 2130 million arcs.

The first observation made by Broder et al. confirms a property of the Web graph suggested in earlier works [20, 82] claiming that the distribution of degrees follows a power law. Interestingly, the degree distribution of the Web graph resembles the power-law relationship of the Internet graph topology, which was first discovered by Faloutsos et al. [50]. Broder et al. [37] computed the in- and out-degree distributions for both considered crawls and showed that these distributions agree with power laws. Moreover, they observed that in the case of in-degrees the constant 2.1 is the same as the exponent of power laws discovered in earlier studies [20, 82]. In another set of experiments conducted by Broder et al., directed and undirected connected components were investigated. It was noticed that the distribution of sizes of these connected components also obeys a power law. Figure 2-4 illustrates the experiments with distributions of out-degrees and connected component sizes.

The last series of experiments discussed by Broder et al. [37] aimed to explore the global connectivity structure of the Web. This led to the discovery of the so-called Bow-Tie model of the Web [38]. Similarly to the call graph, the considered Web graph appeared to have a giant connected component, containing 186,771,290 nodes, or over 90% of the total number of nodes. Taking into account the directed nature of the edges, this connected component can be subdivided into four pieces: the strongly connected component (SCC), the In and Out components, and "Tendrils". Overall, the Web graph in the Bow-Tie model is divided into the following pieces:
Figure 2-5. Connectivity of the Web (Bow-Tie model) [37].

Figure 2-5 shows the connectivity structure of the Web, as well as sizes of the considered components. As one can see from the figure, the sizes of the SCC, In, Out and Tendrils components are roughly equal, and the Disconnected component is significantly smaller.
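A decomposition of a directed graph into Bow-Tie pieces can be sketched with three reachability searches started from a vertex known to lie in the strongly connected core. The sketch below is a simplified illustration and rests on stated assumptions: the function names and the `seed` argument are hypothetical, In is taken to be the set of vertices that can reach the core but are not reachable from it, Out is the reverse, and every remaining vertex of the weakly connected component is lumped into Tendrils.

```python
from collections import deque


def reachable(adj, start):
    """Set of vertices reachable from `start` by following `adj`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def bow_tie(nodes, arcs, seed):
    """Split `nodes` into SCC, In, Out, Tendrils and Disconnected relative to `seed`."""
    forward = {v: [] for v in nodes}
    backward = {v: [] for v in nodes}
    for u, v in arcs:
        forward[u].append(v)
        backward[v].append(u)
    downstream = reachable(forward, seed)   # vertices reachable from the seed
    upstream = reachable(backward, seed)    # vertices that can reach the seed
    scc = downstream & upstream
    undirected = {v: forward[v] + backward[v] for v in nodes}
    weak = reachable(undirected, seed)      # weakly connected component of the seed
    return {
        "SCC": scc,
        "In": upstream - scc,
        "Out": downstream - scc,
        "Tendrils": weak - downstream - upstream,
        "Disconnected": set(nodes) - weak,
    }


# Toy directed graph: core {2, 3}, In {1}, Out {4}, tendrils {5, 6}, and a stray arc 7 -> 8.
nodes = range(1, 9)
arcs = [(1, 2), (2, 3), (3, 2), (3, 4), (1, 5), (6, 4), (7, 8)]
pieces = bow_tie(nodes, arcs, seed=2)
```

The intersection of forward and backward reachability from one core vertex yields exactly the strongly connected component containing that vertex, which is why no general SCC algorithm is needed for this illustration.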
Broder et al. [37] have also computed the diameters of the SCC and of the whole graph. It was shown that the diameter of the SCC is at least 28, and the diameter of the whole graph is at least 503. The average connected distance is defined as the pairwise distance averaged over those directed pairs $(i, j)$ of nodes for which there exists a path from $i$ to $j$. The average connected distance of the whole graph was estimated as 16.12 for in-links, 16.18 for out-links, and 6.83 for undirected links. Interestingly, it was also found that for a randomly chosen directed pair of nodes, the chance that there is a directed path between them is only about 24%.

The first external memory (EM) graph algorithm was developed by Ullman and Yannakakis [112] in 1991 and dealt with the problem of transitive closure. Many other researchers have contributed to the progress in this area ever since [1, 15, 16, 39, 42, 83, 115]. Chiang et al. [42] proposed several new techniques for the design and analysis of efficient EM graph algorithms and discussed applications of these techniques to specific problems, including minimum spanning tree verification, connected and biconnected components, graph drawing, and visibility representation. Abello et al. [1] proposed a functional approach for EM graph algorithms and used their methodology to
develop deterministic and randomized algorithms for computing connected components, maximal independent sets, maximal matchings, and other structures in the graph. In this approach each algorithm is defined as a sequence of functions, and the computation continues in a series of scan operations over the data. If the produced output data, once written, cannot be changed, then the function is said to have no side effects. The lack of side effects enables the application of standard checkpointing techniques, thus increasing reliability. Abello et al. presented a semi-external model for graph problems, which assumes that only the vertices fit in the computer's internal memory. This is quite common in practice, and in fact this was the case for the call graph described in Subsection 2.1.1, for which efficient EM algorithms developed by Abello et al. [1] were used in order to compute its connected components [2].

For more detail on external memory algorithms see the book [4] and the extensive review by Vitter [115] of EM algorithms and data structures.

[84].

Therefore, to investigate real-life massive graphs, one needs to use the available information in order to construct proper theoretical models of these graphs. One of the earliest attempts to model real networks theoretically goes back to the late
1950's, when the foundations of random graph theory were developed. In this subsection we will present some of the results produced by this and other (more realistic) graph models.

[47, 48] deals with several standard models of the so-called uniform random graphs. Two such models are $\mathcal{G}(n, m)$ and $\mathcal{G}(n, p)$ [30]. The first model assigns the same probability to all graphs with $n$ vertices and $m$ edges, while in the second model each pair of vertices is chosen to be linked by an edge randomly and independently with probability $p$.

In most cases, for each natural $n$ a probability space consisting of graphs with exactly $n$ vertices is considered, and the properties of this space as $n \to \infty$ are studied. It is said that a typical element of the space, or almost every (a.e.) graph, has property $Q$ when the probability that a random graph on $n$ vertices has this property tends to 1 as $n \to \infty$. We will also say that the property $Q$ holds asymptotically almost surely (a.a.s.). Erdős and Rényi discovered that in many cases either almost every graph has property $Q$ or almost every graph does not have this property.

Many properties of uniform random graphs have been well studied [29, 30, 73, 80]. Below we will summarize some known results in this field.

Probably the simplest property to be considered in any graph is its connectivity. It was shown that for a uniform random graph $G(n, p) \in \mathcal{G}(n, p)$ there is a "threshold" value of $p$ that determines whether a graph is almost surely connected or not. More specifically, a graph $G(n, p)$ is a.a.s. disconnected if $p$
connected component in a random graph is very often referred to as the "phase transition".

The next subject of our discussion is the diameter of a uniform random graph $G(n, p)$. Recall that the diameter of a disconnected graph is defined as the maximum diameter of its connected components. When dealing with random graphs, one usually speaks not about a certain diameter, but rather about the distribution of the possible values of the diameter. Intuitively, one can say that this distribution depends on the interrelationship of the parameters of the model, $n$ and $p$. However, this dependency turns out to be rather complicated. It was discussed in many papers, and the corresponding results are summarized below.

It was proved by Klee and Larman [77] that a random graph asymptotically almost surely has the diameter $d$, where $d$ is a certain integer value, if the following conditions are satisfied

[30] proved that if $np - \log n \to \infty$ then the diameter of a random graph is a.a.s. concentrated on no more than four values.

Łuczak [87] considered the case $np < 1$, when a uniform random graph a.a.s. is disconnected and has no giant connected component. Let $\mathrm{diam}_T(G)$ denote the maximum diameter of all connected components of $G(n, p)$ which are trees. Then if $(1 - np) n^{1/3} \to \infty$ the diameter of $G(n, p)$ is a.a.s. equal to $\mathrm{diam}_T(G)$.

Chung and Lu [43] investigated another extreme case: $np \to \infty$. They showed that in this case the diameter of a random graph $G(n, p)$ is a.a.s. equal to

$$(1 + o(1)) \frac{\log n}{\log(np)}.$$
(1+o(1))logn np+1: log(np)diam(G(n;p))2666log33c20

[43] that it is a.a.s. true only if $np > 3.5128$.

The further discussion analyzes the potential drawbacks of applying the uniform random graph model to real-life massive graphs.
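The emergence of a giant connected component in the uniform random graph model $\mathcal{G}(n, p)$ can be observed empirically with a small simulation. The sketch below is only an illustration under stated assumptions: the function names, the graph size of 1000 vertices, the seed, and the two average degrees $np = 0.5$ and $np = 3$ are arbitrary choices made here, picked to lie well below and well above the point $np = 1$.

```python
import random
from collections import deque


def sample_gnp(n, p, rng):
    """Sample G(n, p): each of the n(n-1)/2 pairs becomes an edge with probability p."""
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def largest_component_fraction(adj):
    """Size of the largest connected component divided by the number of vertices."""
    n = len(adj)
    seen = [False] * n
    best = 0
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        size = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
        best = max(best, size)
    return best / n


rng = random.Random(0)  # fixed seed so the experiment is repeatable
n = 1000
sparse_fraction = largest_component_fraction(sample_gnp(n, 0.5 / n, rng))
dense_fraction = largest_component_fraction(sample_gnp(n, 3.0 / n, rng))
```

With these settings the largest component of the sparse sample covers only a small fraction of the vertices, while in the denser sample it covers most of them, which is the qualitative behavior referred to as the phase transition.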
Though uniform random graphs demonstrate some properties similar to real-life massive graphs, many problems arise when one tries to describe real graphs using the uniform random graph model. As mentioned above, a giant connected component a.a.s. emerges in a uniform random graph at a certain threshold. This looks very similar to the properties of the real massive graphs discussed in Subsection 2.1.3. However, on closer inspection, it can be seen that the giant connected components in uniform random graphs and in real-life massive graphs have different structures. The fundamental difference between them is as follows: it was noticed that in almost all real massive graphs the property of so-called clustering takes place [116, 117]. It means that the probability of the event that two given vertices are connected by an edge is higher if these vertices have a common neighbor (i.e., a vertex which is connected by an edge with both of these vertices). The probability that two neighbors of a given vertex are connected by an edge is called the clustering coefficient. It can be easily seen that in the case of uniform random graphs, the clustering coefficient is equal to the parameter $p$, since the probability that each pair of vertices is connected by an edge is independent of all other vertices. In real-life massive graphs, the value of the clustering coefficient turns out to be much higher than the value of the parameter $p$ of the uniform random graphs with the same number of vertices and edges. Adamic [5] found that the value of the clustering coefficient for some part of the Web graph was approximately 0.1078, while the clustering coefficient for the corresponding uniform random graph was 0.00023. Pastor-Satorras et al. [103] got similar results for the part of the Internet graph. The values of the clustering coefficients for the real graph and the corresponding uniform random graph were 0.24 and 0.0006, respectively.

Another significant problem arising in modeling massive graphs using the uniform random graph model is the difference in degree distributions. It can
be shown that as the number of vertices in a uniform random graph increases, the distribution of the degrees of the vertices tends to the well-known Poisson distribution with the parameter $np$, which represents the average degree of a vertex. However, as was pointed out in Subsection 2.1.3, the experiments show that in real massive graphs degree distributions obey a power law. These facts demonstrate that some other models are needed to better describe the properties of real massive graphs. Next, we discuss two such models; namely, the random graph model with a given degree sequence and its most important special case, the power-law model.

It turns out that some properties of uniform random graphs can be generalized for the model of a random graph with a given degree sequence.

Recall the notion of the so-called "phase transition" (i.e., the phenomenon when at a certain point a giant connected component emerges in a random graph), which happens in uniform random graphs. It turns out that a similar thing takes place in the case of a random graph with a given degree sequence. This result was obtained by Molloy and Reed [98]. The essence of their findings is as follows.

Consider a sequence of nonnegative real numbers $p_0, p_1, \ldots$, such that $\sum_k p_k = 1$. Assume that a graph $G$ with $n$ vertices has approximately $p_k n$ vertices of degree $k$. If we define $Q = \sum_{k \ge 1} k(k - 2) p_k$, then it can be proved that $G$ a.a.s.
has a giant connected component if $Q > 0$, and there is a.a.s. no giant connected component if $Q < 0$.

As a development of the analysis of random graphs with a given degree sequence, the work of Cooper and Frieze [45] should be mentioned. They considered a sparse directed random graph with a given degree sequence and analyzed its strong connectivity. In the study, the size of the giant strongly connected component, as well as the conditions of its existence, were discussed.

The results obtained for the model of random graphs with a given degree sequence are especially useful because they can be applied to some important special cases of this model. For instance, the classical results on the size of a connected component in uniform random graphs follow from the aforementioned fact presented by Molloy and Reed. Next, we present another example of applying this general result to one of the most practically used random graph models, the power-law model.

[8, 9].

The power-law random graph model (also referred to as $P(\alpha, \beta)$) assigns two parameters characterizing a power-law random graph. If we define $y$ to be the number of nodes with degree $x$, then according to this model

$$y = \frac{e^{\alpha}}{x^{\beta}}. \qquad (2\text{-}1)$$

Equivalently, we can write

$$\log y = \alpha - \beta \log x. \qquad (2\text{-}2)$$
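Because the power-law model states that $\log y = \alpha - \beta \log x$, where $y$ is the number of nodes with degree $x$, the two parameters can be estimated from observed degree counts by fitting a straight line on a log-log scale. The following sketch uses an ordinary least-squares fit; the function name and the synthetic data are assumptions made for illustration, and least squares on binned counts is only one of several possible estimation procedures.

```python
import math


def fit_power_law(degree_counts):
    """Least-squares line through (log x, log y); returns (alpha, beta)."""
    points = [(math.log(x), math.log(y))
              for x, y in degree_counts.items() if x > 0 and y > 0]
    m = len(points)
    mean_x = sum(px for px, _ in points) / m
    mean_y = sum(py for _, py in points) / m
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, -slope  # the intercept is alpha, the slope is -beta


# Synthetic counts generated exactly from y = e^alpha / x^beta (not real data).
true_alpha, true_beta = 10.0, 2.1
counts = {x: math.exp(true_alpha) / x ** true_beta for x in range(1, 50)}
alpha_hat, beta_hat = fit_power_law(counts)
```

On noise-free synthetic counts the fit recovers the generating parameters up to rounding error; on real degree data the tail is noisy, so the estimates depend on how the counts are binned and which degrees are included.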
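Similarly to the formulas in Chapter 1, the relationship between $y$ and $x$ can be plotted as a straight line on a log-log scale, so that $(-\beta)$ is the slope, and $\alpha$ is the intercept.

The following properties of a graph described by the power-law random graph model [8] are valid:

The maximum degree of the graph is $e^{\alpha/\beta}$.

The number of vertices is

$$n = \sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta}} \approx \begin{cases} \zeta(\beta)\, e^{\alpha}, & \beta > 1, \\ \alpha\, e^{\alpha}, & \beta = 1, \\ \dfrac{e^{\alpha/\beta}}{1 - \beta}, & 0 < \beta < 1, \end{cases} \qquad (2\text{-}3)$$

where $\zeta(t) = \sum_{n=1}^{\infty} \frac{1}{n^{t}}$ is the Riemann zeta function.

The number of edges is

$$E = \frac{1}{2} \sum_{x=1}^{e^{\alpha/\beta}} x\, \frac{e^{\alpha}}{x^{\beta}} \approx \begin{cases} \frac{1}{2} \zeta(\beta - 1)\, e^{\alpha}, & \beta > 2, \\ \frac{1}{4} \alpha\, e^{\alpha}, & \beta = 2, \\ \dfrac{1}{2} \dfrac{e^{2\alpha/\beta}}{2 - \beta}, & 0 < \beta < 2. \end{cases} \qquad (2\text{-}4)$$

Since the power-law random graph model is a special case of the model of a random graph with a given degree sequence, the results discussed above can be applied to power-law graphs. We need to find the threshold value of $\beta$ at which the "phase transition" (i.e., the emergence of a giant connected component) occurs. In this case $Q = \sum_{x \ge 1} x(x - 2) p_x$ is defined as

$$Q = \sum_{x=1}^{e^{\alpha/\beta}} x(x - 2) \frac{e^{\alpha}}{x^{\beta}} = \sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta - 2}} - 2 \sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta - 1}} \approx \left[ \zeta(\beta - 2) - 2 \zeta(\beta - 1) \right] e^{\alpha}, \quad \beta > 3.$$

Hence, the threshold value $\beta_0$ can be found from the equation

$$\zeta(\beta - 2) - 2 \zeta(\beta - 1) = 0,$$

which yields $\beta_0 \approx 3.47875$.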
The results on the size of the connected component of a power-law graph were presented by Aiello et al. [8]. These results are summarized below.

The power-law random graph model was developed for describing real-life massive graphs. So the natural question is how well it reflects the properties of these graphs.

Though this model certainly does not reflect all the properties of real massive graphs, it turns out that massive graphs such as the call graph or the Internet graph can be fairly well described by the power-law model. The following example demonstrates this.

Aiello, Chung and Lu [8] investigated the same call graph that was analyzed by Abello et al. [2]. This massive graph was already discussed in Subsection 2.1.3, so it is interesting to compare the experimental results presented by Abello et al. [2] with the theoretical results obtained in [8] using the power-law random graph model.

Figure 2-2 shows the number of vertices in the call graph with certain in-degrees and out-degrees. Recall that according to the power-law model the dependency between the number of vertices and the corresponding degrees can be plotted as a straight line on a log-log scale, so one can approximate the real
data shown in Figure 2-2 by a straight line and evaluate the parameters $\alpha$ and $\beta$ using the values of the intercept and the slope of the line. The value of $\beta$ for the in-degree data was estimated to be approximately 2.1, and the value of $e^{\alpha}$ was approximately $30 \times 10^{6}$. The total number of nodes can be estimated using formula (2-3) as $\zeta(2.1)\, e^{\alpha} = 1.56\, e^{\alpha} \approx 47 \times 10^{6}$ (compare with Subsection 2.1.3).

According to the results for the size of the largest connected component presented above, a power-law graph with $\beta$ in the range from 1 to 3.47875 a.a.s. has a giant connected component. Since $\beta \approx 2.1$ falls in this range, this result agrees with the real observations for the call graph (see Subsection 2.1.3).

Another aspect that is worth mentioning is how to generate power-law graphs. The methodology for doing so was discussed in detail in the literature [9, 44]. These papers use a similar approach, which is referred to as a random graph evolution process. The main idea is to construct a power-law massive graph "step by step": at each time step, a node and an edge are added to a graph in accordance with certain rules in order to obtain a graph with a specified in-degree and out-degree power-law distribution. The in-degree and out-degree parameters of the resulting power-law graph are functions of the input parameters of the model. A simple evolution model was presented by Kumar et al. [81]. Aiello, Chung and Lu [9] developed four more advanced models for generating both directed and undirected power-law graphs with different distributions of in-degrees and out-degrees. As an example, we will briefly describe one of their models. It was the basic model developed in the paper, and the other three models were improvements and generalizations of this model.

The main idea of the considered model is as follows. At the first time moment a vertex is added to the graph, and it is assigned two parameters, the in-weight and the out-weight, both equal to 1. Then at each time step $t + 1$ a new vertex with in-weight 1 and out-weight 1 is added to the graph with probability $1 - \alpha$,
and a new directed edge is added to the graph with probability $\alpha$. The origin and destination vertices are chosen according to the current values of the in-weights and out-weights. More specifically, a vertex $u$ is chosen as the origin of this edge with probability proportional to its current out-weight, which is defined as $w^{out}_{u,t} = 1 + \delta^{out}_{u,t}$, where $\delta^{out}_{u,t}$ is the out-degree of the vertex $u$ at time $t$. Similarly, a vertex $v$ is chosen as the destination with probability proportional to its current in-weight $w^{in}_{v,t} = 1 + \delta^{in}_{v,t}$, where $\delta^{in}_{v,t}$ is the in-degree of $v$ at time $t$. From the above description it can be seen that at time $t$ the total in-weight and the total out-weight are both equal to $t$. So for each particular pair of vertices $u$ and $v$, the probability that an edge going from $u$ to $v$ is added to the graph at time $t$ is equal to

$$\alpha\, \frac{(1 + \delta^{out}_{u,t})(1 + \delta^{in}_{v,t})}{t^{2}}.$$

The notion of so-called scale invariance [20, 21] must also be mentioned. This concept arises from the following considerations. The evolution of massive graphs can be treated as the process of growing the graph one time unit at a time. Now, if we replace all the nodes that were added to the graph during the same unit of time by only one node, then we will get another graph of a smaller size. The bigger the time unit is, the smaller the new graph size will be. The evolution model is called scale-free (scale-invariant) if with high probability the new (scaled) graph has the same power-law distribution of in-degrees and out-degrees as the original graph, for any choice of the time unit length. It turns out that most of the random evolution
models have this property. For instance, the models of Aiello et al. [9] were proved to be scale invariant.

2.1.3 increased interest in various properties of random graphs and methods used to discover these properties. Indeed, numerical characteristics of graphs, such as clique and chromatic numbers, could be used as one of the steps in validation of the proposed models. In this regard, the expected clique number of power-law random graphs is of special interest due to the results by Abello et al. [2] and Aiello et al. [9] mentioned in Subsections 2.1.1 and 2.1.3. If computed, it could be used as one of the points in verifying the validity of the model for the call graph proposed by Aiello et al. [9].

In this subsection we present some well-known facts regarding the clique and chromatic numbers in uniform random graphs.

[93], who noticed that for a fixed $p$ almost all graphs $G \in \mathcal{G}(n, p)$ have about the same clique number, if $n$ is sufficiently large. Bollobás and Erdős [32] further developed these remarkable results by proving some more specific facts about the clique number of a random graph. Let us discuss these results in more detail by presenting not only the facts but also some reasoning behind them. For more detail see books by Bollobás [29, 30] and Janson et al. [73].

Assume that 0
subgraph of $G$ induced by the first $n$ vertices $\{1, 2, \ldots, n\}$. Then the sequence $\omega(G_n)$ appears to be almost completely determined for a.e. $G \in \mathcal{G}(\mathbb{N}, p)$.

For a natural $l$, let us denote by $k_l(G_n)$ the number of cliques spanning $l$ vertices of $G_n$. Then, obviously,

$$\omega(G_n) = \max\{ l : k_l(G_n) > 0 \}.$$

Using this observation and the second moment method, Bollobás and Erdős [32] proved that if $p = p(n)$ satisfies $n$
shown that for a.e. $G \in \mathcal{G}(\mathbb{N}, p)$ if $n$ is large enough then

$$\lfloor l_0(n) - 2 \log\log n / \log n \rfloor \le \omega(G_n) \le \lfloor l_0(n) + 2 \log\log n / \log n \rfloor.$$

[56] and Janson et al. [73] extended these results by showing that for $\varepsilon > 0$ there exists a constant $c_\varepsilon$, such that for $c$

$$\lfloor 2 \log_{1/p} n - 2 \log_{1/p} \log_{1/p} n + 2 \log_{1/p}(e/2) + 1 + \varepsilon/p \rfloor.$$

[61] were the first to study the problem of coloring random graphs. Many other researchers contributed to solving this problem [12, 31]. We will mention some facts that emerged from these studies.

Łuczak [85] improved the results about the concentration of $\chi(G(n, p))$ previously proved by Shamir and Spencer [110], proving that for every sequence $p = p(n)$ such that $p \le n^{-6/7}$ there is a function $ch(n)$ such that a.a.s.

$$ch(n) \le \chi(G(n, p)) \le ch(n) + 1.$$

[12] proved that for any positive constant $\delta$ the chromatic number of a uniform random graph $G(n, p)$, where $p = n^{-1/2 - \delta}$, is a.a.s. concentrated in two consecutive values. Moreover, they proved that a proper choice of $p(n)$ may result in a one-point distribution. The function $ch(n)$ is difficult to find, but in some cases it can be characterized. For example, Janson et al. [73] proved that there exists a constant $c_0$ such that for any $p = p(n)$ satisfying $c_0$
[30] yields the following estimate: $\chi(G(n, p)) = n$

2.1.1). On the other hand, probabilistic methods similar to those discussed in Subsection 2.1.4 could be utilized in order to find the asymptotical distribution of the clique number in the same network's random graph model, and therefore verify this model.
One of the most important problems in modern finance is finding efficient ways of summarizing and visualizing stock market data that would allow one to obtain useful information about the behavior of the market. Nowadays, a great number of stocks are traded in the US stock market; moreover, this number steadily increases. The amount of data generated by the stock market every day is enormous. This data is usually visualized by thousands of plots reflecting the price of each stock over a certain period of time. The analysis of these plots becomes more and more complicated as the number of stocks grows.

It turns out that the stock market data can be effectively represented as a network, although this representation is not as obvious as in the case of telephone traffic or internet data. We have developed a network-based model of the market referred to as the market graph. This chapter is based on the results described in [26, 27, 28].

A natural graph representation of the stock market is based on the cross-correlations of price fluctuations. A market graph can be constructed as follows: each financial instrument is represented by a vertex, and two vertices are connected by an edge if the correlation coefficient of the corresponding pair of instruments (calculated for a certain period of time) exceeds a specified threshold $\theta$, $-1 \le \theta \le 1$.

Nowadays, a great number of different instruments are traded in the US stock market, so the market graph representing them is very large. The market graph that we construct has 6546 vertices and several million edges.
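The construction just described can be illustrated with a short sketch. It is not the code used for the experiments in this chapter: the input layout (one row of prices per instrument), the use of daily log-returns, and the function names `market_graph` and `edge_density` are assumptions made for illustration only.

```python
import numpy as np

def market_graph(prices, theta):
    """Boolean adjacency matrix of a market graph (illustrative sketch).

    prices: array of shape (n_instruments, n_days) with positive prices
            (hypothetical input layout).
    theta:  correlation threshold in [-1, 1].
    """
    # Assumption: correlations are computed on daily log-returns.
    returns = np.diff(np.log(prices), axis=1)
    corr = np.corrcoef(returns)      # pairwise correlation coefficients
    adj = corr >= theta              # edge if the correlation reaches theta
    np.fill_diagonal(adj, False)     # no self-loops
    return adj

def edge_density(adj):
    """Ratio of the number of edges to the maximum possible number of edges."""
    n = adj.shape[0]
    return adj.sum() / (n * (n - 1))  # adj counts every edge twice
```

For a threshold of 0.5, `market_graph(prices, 0.5)` connects only pairs whose return series are strongly positively correlated; lowering `theta` can only add edges.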
In this chapter, we present a detailed study of the properties of this graph. It turns out that the market graph can be rather accurately described by the power-law model. We analyze the distribution of the degrees of the vertices in this graph, the edge density of this graph with respect to the correlation threshold, as well as its connectivity and the size of its connected components.

Furthermore, we look for maximum cliques and maximum independent sets in this graph for different values of the correlation threshold. Analyzing cliques and independent sets in the market graph gives us very valuable knowledge about the internal structure of the stock market. For instance, a clique in this graph represents a set of financial instruments whose prices change similarly over time (a change of the price of any instrument in a clique is likely to affect all other instruments in this clique), and an independent set consists of instruments that are negatively correlated with respect to each other; therefore, it can be treated as a diversified portfolio. Based on the information obtained from this analysis, we will be able to classify financial instruments into certain groups, which will give us a deeper insight into the stock market structure.

3.1.1 Constructing the Market Graph

[92]:
Figure 3-1. Distribution of correlation coefficients in the stock market

where $R_i(t) = \ln P_i(t)$

The correlation coefficients $C_{ij}$ can vary from $-1$ to 1. Figure 3-1 shows the distribution of the correlation coefficients based on the price data for the years 2000-2002. It can be seen that this plot has a shape similar to the normal distribution with mean 0.05.

The main idea of constructing a market graph is as follows. Let the set of financial instruments represent the set of vertices of the graph. Also, we specify a certain threshold value $\theta$, $-1 \le \theta \le 1$, and add an undirected edge connecting the vertices $i$ and $j$ if the corresponding correlation coefficient $C_{ij}$ is greater than or equal to $\theta$. Obviously, different values of $\theta$ define market graphs with the same set of vertices but different sets of edges.

It is easy to see that the number of edges in the market graph decreases as the threshold value $\theta$ increases. In fact, our experiments show that the edge density
of the market graph decreases exponentially w.r.t. $\theta$. The corresponding graph is presented in Figure 3-2.

Figure 3-2. Edge density of the market graph for different values of the correlation threshold.

In Subsection 2.1.3 we mentioned the connectivity thresholds in random graphs. The main idea of this concept is finding a threshold value of the parameter of the model that determines whether the graph is connected or not.

A similar question arises for the market graph: what is its connectivity threshold? Since the number of edges in the market graph depends on the chosen correlation threshold $\theta$, we should find a value $\theta_0$ that determines the connectivity of the graph. As mentioned above, the smaller the value of $\theta$ we choose, the more edges the market graph will have. So, if we decrease $\theta$, after a certain point the graph will become connected. We have conducted a series of computational
experiments for checking the connectivity of the market graph using the breadth-first search technique, and we obtained a relatively accurate approximation of the connectivity threshold: $\theta_0 \simeq 0.14382$. Moreover, we investigated the dependency of the size of the largest connected component in the market graph w.r.t. $\theta$. The corresponding plot is shown in Figure 3-3.

Figure 3-3. Plot of the size of the largest connected component in the market graph as a function of correlation threshold $\theta$.

It turns out that if a small (in absolute value) correlation threshold $\theta$ is specified, the distribution of the degrees of the vertices does not have any well-defined structure. Note that for these values of $\theta$ the market graph has a relatively high edge density (i.e., the ratio of the number of edges to the maximum possible number of edges). However, as the correlation threshold is increased, the degree
distribution more and more resembles a power law. In fact, for $\theta \ge 0.2$ this distribution is approximately a straight line in the logarithmic scale, which represents the power-law distribution, as mentioned above. Figure 3-4 demonstrates the degree distributions of the market graph for some positive values of the correlation threshold, along with the corresponding linear approximations. The slopes of the approximating lines were estimated using the least-squares method. Table 3-1 summarizes the estimates of the parameter $\gamma$ of the power-law distribution (i.e., the slope of the line) for different values of $\theta$.

Table 3-1. Least-squares estimates of the parameter $\gamma$ in the market graph for different values of correlation threshold (* = complementary graph)

  theta      gamma
  -0.2*
  -0.15*
   0.2       0.4931
   0.25      0.5820
   0.3       0.6793
   0.35      0.7679
   0.4       0.8269
   0.45      0.8753
   0.5       0.9054
   0.55      0.9331
   0.6       0.9743

From this table, it can be seen that the slope of the lines corresponding to positive values of $\theta$ is rather small. According to the power-law model, in this case a graph would have many vertices with high degrees; therefore, one can intuitively expect to find large cliques in a power-law graph with a small value of the parameter $\gamma$.

We also analyze the degree distribution of the complement of the market graph, which is defined as follows: an edge connects instruments $i$ and $j$ if the correlation coefficient between them satisfies $C_{ij} \le \theta$. Studying this complementary graph is important for the next subject of our consideration: finding maximum independent
sets in the market graph with negative values of the correlation threshold $\theta$.

Figure 3-4. Degree distribution of the market graph for $\theta = 0.4$ (left); $\theta = 0.5$ (right) (logarithmic scale)

Obviously, a maximum independent set in the initial graph is a maximum clique in the complement, so the maximum independent set problem can be reduced to the maximum clique problem in the complementary graph. Therefore, it is useful to investigate the degree distributions of the complementary graphs for different values of $\theta$. As can be seen from Figure 3-1, the distribution of the correlation coefficients is nearly symmetric around $\theta = 0.05$, so for values of $\theta$ close to 0 the edge density of both the initial and the complementary graph is high enough. For these values of $\theta$ the degree distribution of a complementary graph also does not seem to have any well-defined structure, as in the case of the corresponding initial graph. As $\theta$ decreases (i.e., increases in absolute value), the degree distribution of a complementary graph starts to follow the power law. Figure 3-5 shows the degree distributions of the complementary graph, along with the least-squares linear regression lines. However, as one can see from Table 3-1, the slopes of these lines are higher than in the case of the graphs with positive values of $\theta$, which implies that there are fewer vertices with a high degree in these graphs, so intuitively, the size of cliques in a complementary graph (i.e., the size
of independent sets in the original graph) should be significantly smaller than in the case of the market graph with positive values of the correlation threshold (see Section 3.2).

Figure 3-5. Degree distribution of the complementary market graph for $\theta = -0.15$ (left); $\theta = -0.2$ (right) (logarithmic scale)

For this purpose, we chose the market graph with a high correlation threshold ($\theta = 0.6$), calculated the degrees of each vertex in this graph and sorted the vertices in the decreasing order of their degrees.
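The least-squares slope estimates discussed in this section can be sketched as follows. This is only an illustration under stated assumptions: the slope is fitted to the raw degree histogram on a log-log scale, zero-degree vertices are dropped, and the function name `power_law_slope` is hypothetical. The binning and fitting details actually used for Table 3-1 are not given in the text.

```python
import numpy as np

def power_law_slope(degrees):
    """Estimate gamma in count(k) ~ k**(-gamma) by a least-squares line fit
    on the log-log degree histogram (illustrative sketch)."""
    degrees = np.asarray(degrees)
    degrees = degrees[degrees > 0]        # log is undefined for degree 0
    values, counts = np.unique(degrees, return_counts=True)
    slope, _intercept = np.polyfit(np.log10(values), np.log10(counts), 1)
    return -slope                         # gamma is the negated slope
```

A small value returned by `power_law_slope` corresponds to a slowly decaying degree distribution, that is, to a graph with comparatively many high-degree vertices.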
Interestingly, even though the edge density of the considered graph is only 0.04% (only highly correlated instruments are connected by an edge), there are many vertices with degrees greater than 100.

According to our calculations, the vertex with the highest degree in this market graph corresponds to the NASDAQ 100 Index Tracking Stock. The degree of this vertex is 216, which means that there are 216 instruments that are highly correlated with it. An interesting observation is that the degree of this vertex is more than twice the number of companies whose stock prices the NASDAQ index reflects, which means that these 100 companies greatly influence the market.

In Table 3-2 we present the "top 25" instruments in the U.S. stock market, according to their degrees in the considered market graph. The corresponding symbol definitions can be found on several websites, for example http://www.nasdaq.com. Note that most of them are indices that incorporate a number of different stocks of companies in different industries. Although this result is not surprising from the financial point of view, it is important as a practical justification of the market graph model.

3-3. For instance, as one can see from this table, the market graph with $\theta = 0.6$ has almost the same edge density as the complementary market graph with $\theta = -0.15$; however, their clustering coefficients differ dramatically. This fact also intuitively explains the
results presented in the next section, which deals with cliques and independent sets in the market graph.

Table 3-2. Top 25 instruments with highest degrees in the market graph ($\theta = 0.6$)

  symbol   vertex degree
  QQQ      216
  IWF      193
  IWO      193
  IYW      193
  XLK      181
  IVV      175
  MDY      171
  SPY      162
  IJH      159
  IWV      158
  IVW      156
  IAH      155
  IYY      154
  IWB      153
  IYV      150
  BDH      144
  MKH      143
  IWM      142
  IJR      134
  SMH      130
  STM      118
  IIH      116
  IVE      113
  DIA      106
  IWD      106

The maximum clique problem (as well as the maximum independent set problem) is known to be NP-hard [59]. Moreover, it turns out that the maximum clique is difficult to approximate [18, 62]. This makes these problems especially challenging in large graphs. However, as we will see in the next subsection, even
though the maximum clique problem is generally very hard to solve in large graphs, the special structure of the market graph allows us to find the exact solution relatively easily.

Table 3-3. Clustering coefficients of the market graph (* = complementary graph)

  theta     edge density   clustering coef.
  -0.15*                   2.64 x 10^-5
                           0.0012
   0.3      0.0178         0.4885
   0.4      0.0047         0.4458
   0.5      0.0013         0.4522
   0.6      0.0004         0.4872
   0.7      0.0001         0.4886

A standard integer programming formulation [33] was used to compute the exact maximum clique in the market graph; however, before solving this problem, we applied a greedy heuristic for finding a lower bound of the clique number, and a special preprocessing technique which reduces the problem size. To find a large clique, we apply the "best in" greedy algorithm based on degrees of vertices. Let $C$ denote the clique. Starting with $C = \emptyset$, we recursively add to the clique a vertex $v_{max}$ of largest degree and remove all vertices that are not adjacent to $v_{max}$ from the graph. After running this algorithm, we applied the following preprocessing
procedure [2]. We recursively remove from the graph all of the vertices which are not in $C$ and whose degree is less than $|C|$, where $C$ is the clique found by the greedy algorithm.

Denote by $G' = (V', E')$ the graph induced by the remaining vertices. Then the maximum clique problem can be formulated and solved for $G'$. The following integer programming formulation was used [33]:

$$\text{maximize} \ \sum_{i=1}^{|V'|} x_i \quad \text{s.t.} \quad x_i + x_j \le 1, \ (i, j) \notin E', \quad x_i \in \{0, 1\}.$$

[26]. This can be intuitively explained by the fact that these instances of the market graph are clustered (i.e., two vertices in a graph are more likely to be connected if they have a common neighbor), so the clustering coefficient, which is defined as the probability that for a given vertex its two neighbors are connected by an edge, is much higher than the edge density in these graphs (see Table 3-3). This characteristic is also typical for other power-law graphs arising in different applications.

After reducing the size of the original graph, the resulting integer programming problem for finding a maximum clique can be relatively easily solved using the CPLEX integer programming solver [71].

Table 3-4 summarizes the exact sizes of the maximum cliques found in the market graph for different values of $\theta$. It turns out that these cliques are
rather large, which agrees with the analysis of degree distributions and clustering coefficients in the market graphs with positive values of $\theta$.

Table 3-4. Sizes of the maximum cliques in the market graph with positive values of the correlation threshold (exact solutions)

  theta   edge density   clique size
  0.35    0.0090         193
  0.4     0.0047         144
  0.45    0.0024         109
  0.5     0.0013          85
  0.55    0.0007          63
  0.6     0.0004          45
  0.65    0.0002          27
  0.7     0.0001          22

These results show that in the modern stock market there are large groups of instruments whose price fluctuations behave similarly over time, which is not surprising, since nowadays different branches of the economy highly affect each other.

Table 3-5 presents the sizes of the independent sets found using the greedy heuristic that was described in the previous section.
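The greedy heuristic and the preprocessing step described in this section can be sketched as follows. This is a minimal illustration, not the implementation used to produce Table 3-4: the graph is assumed to be given as a dictionary mapping each vertex to the set of its neighbors, ties are broken by vertex order (an arbitrary choice), and the names `greedy_clique` and `peel` are hypothetical. The exact integer programming step solved with CPLEX is not shown.

```python
def greedy_clique(adj):
    """'Best in' greedy clique: repeatedly add the candidate of largest
    degree (within the current candidates) and keep only its neighbors.

    adj: dict mapping vertex -> set of adjacent vertices (no self-loops).
    """
    candidates = set(adj)
    clique = []
    while candidates:
        # sorted() makes tie-breaking deterministic (an arbitrary choice)
        v = max(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        clique.append(v)
        candidates &= adj[v]   # drop v itself and everything not adjacent to v
    return clique

def peel(adj, clique):
    """Recursively remove vertices outside `clique` whose degree in the
    remaining graph is less than |clique|; return the surviving vertex set."""
    keep = set(adj)
    protected = set(clique)
    k = len(clique)
    changed = True
    while changed:
        changed = False
        for v in list(keep):
            if v not in protected and len(adj[v] & keep) < k:
                keep.remove(v)
                changed = True
    return keep
```

The size of the greedy clique is a lower bound on the clique number, and any clique larger than it must lie entirely inside the vertex set returned by `peel`, so the exact formulation only needs to be solved on that reduced graph.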
Table 3-5. Sizes of independent sets in the complementary market graph found using the greedy algorithm (lower bounds)

  theta    edge density   indep. set size
   0.05    0.4794         45
   0.0     0.2001         12
  -0.05    0.0431          5
  -0.1     0.005           3
  -0.15    0.0005          2

This table demonstrates that the sizes of computed independent sets are rather small, which is in agreement with the results of the previous section, where we mentioned that in the complementary graph the values of the parameter $\gamma$ of the power-law distribution are rather high, and the clustering coefficients are very small.

The small size of the computed independent sets means that finding a large "completely diversified" portfolio (where all instruments are negatively correlated to each other) is not an easy task in the modern stock market.

Moreover, it turns out that one can make a theoretical estimation of the maximum size of a diversified portfolio, where all stocks are strictly negatively correlated with each other. Intuitively, the lower (higher by the absolute value) threshold $\theta$ we set, the smaller diversified portfolio one would expect to find. These considerations are confirmed by the following theorem.

Proof.
maximum correlation is $\theta = \max_{i,j} C_{ij} < 0$. Consider the variance of the sum of these variables:

Note that if $\theta < 0$, $m \sigma^2_{max} (1 + \theta(m - 1)) < 0$ for $m > 1 + \frac{1}{|\theta|}$.

Therefore, the number of stocks with pairwise correlations $C_{ij} \le \theta < 0$ cannot be greater than $m = 1 + \frac{1}{|\theta|}$.

Another natural question now arises: how many completely diversified portfolios can be found in the market? In order to find an answer, we have calculated maximal independent sets starting from each vertex, by running 6546 iterations of the greedy algorithm mentioned above. That is, for each of the considered 6546 financial instruments, we have found a completely diversified portfolio that would contain this instrument. Interestingly enough, for every vertex in the market graph, we were able to detect an independent set that contains this vertex, and the sizes of these independent sets were rather close. Moreover, all these independent sets were distinct. Figure 3-6 shows the frequency of the sizes of the independent sets found in the market graphs corresponding to different correlation thresholds.

These results demonstrate that it is always possible for an investor to find a group of stocks that would form a completely diversified portfolio with any given stock, and this can be efficiently done using the technique of finding independent sets in the market graph.
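The bound on the portfolio size stated in the proof above can be re-derived in a few lines. The following is a sketch under one simplifying assumption that is not taken from the surviving text: each variable is rescaled to unit variance, $Y_i = X_i / \sigma_i$, which leaves the correlations $C_{ij}$ unchanged.

```latex
\begin{aligned}
0 \;\le\; \operatorname{Var}\Big(\sum_{i=1}^{m} Y_i\Big)
  &= \sum_{i=1}^{m} \operatorname{Var}(Y_i) + \sum_{i \ne j} C_{ij} \\
  &\le m + m(m-1)\,\theta \;=\; m\,\bigl(1 + \theta(m-1)\bigr),
  \qquad \text{since } C_{ij} \le \theta \text{ for all } i \ne j .
\end{aligned}
```

For $\theta < 0$ the right-hand side is nonnegative only if $m \le 1 - 1/\theta = 1 + 1/|\theta|$. As a worked instance, $\theta = -0.1$ allows at most 11 pairwise negatively correlated stocks, and $\theta = -0.5$ allows at most 3.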
Figure 3-6. Frequency of the sizes of independent sets found in the market graph with $\theta = 0.00$ (left), and $\theta = 0.05$ (right)

Nontrivial information about the global properties of the stock market is obtained from the analysis of the degree distribution of the market graph. The highly specific structure of this distribution suggests that the stock market can be analyzed using the power-law model, which can theoretically predict some characteristics of the graph representing the market.

On the other hand, the analysis of cliques and independent sets in the market graph is also useful from the data mining point of view. As pointed out above, cliques and independent sets in the market graph represent groups of "similar" and "different" financial instruments, respectively. Therefore, information about the size of the maximum cliques and independent sets is also rather important, since it gives one an idea about the trends that take place in the stock
market. Besides analyzing the maximum cliques and independent sets in the market graph, one can also divide the market graph into the smallest possible set of distinct cliques (or independent sets). Partitioning a dataset into sets (clusters) of elements grouped according to a certain criterion is referred to as clustering, which is one of the well-known data mining problems [34].

As discussed above, the main difficulty one encounters in solving the clustering problem on a certain dataset is the fact that the number of desired clusters of similar objects is usually not known a priori; moreover, an appropriate similarity criterion should be chosen before partitioning a dataset into clusters.

Clearly, the methodology of finding cliques in the market graph provides an efficient tool for performing clustering based on the stock market data. The choice of the grouping criterion is clear and natural: "similar" financial instruments are determined according to the correlation between their price fluctuations. Moreover, the minimum number of clusters in the partition of the set of financial instruments is equal to the minimum number of distinct cliques that the market graph can be divided into (the minimum clique partition problem). A similar partition can be done using independent sets instead of cliques, which would represent the partition of the market into a set of distinct diversified portfolios. In this case the minimum possible number of clusters is equal to the number of sets in a partition of vertices into a minimum number of distinct independent sets. This problem is called the graph coloring problem, and the number of sets in the optimal partition is referred to as the chromatic number of the graph.

We should also mention another major type of data mining problems with many applications in finance. They are referred to as classification problems. Although the setup of this type of problems is similar to clustering, one should clearly understand the difference between these two types of problems.
In classification, one deals with a predefined number of classes that the data elements must be assigned to. Also, there is a so-called training dataset, i.e., the set of data elements for which it is known a priori which class they belong to. It means that in this setup one uses some initial information about the classification of existing data elements. A certain classification model is constructed based on this information, and the parameters of this model are "tuned" to classify new data elements. This procedure is known as "training the classifier". An example of the application of this approach to classifying financial instruments can be found in [40].

The main difference between classification and clustering is the fact that, unlike classification, in the case of clustering one does not use any initial information about the class attributes of the existing data elements, but tries to determine a classification using appropriate criteria. Therefore, the methodology of classifying financial instruments using the market graph model is essentially different from the approaches commonly considered in the literature in the sense that it does not require any a priori information about the classes that certain stocks belong to, but classifies them only based on the behavior of their prices over time.

In order to investigate the dynamics of the market graph structure, we chose the period of 1000 trading days in 1998-2002 and considered eleven 500-day shifts within this period. The starting points of every two consecutive shifts are separated
by the interval of 50 days. Therefore, every pair of consecutive shifts had 450 days in common and 50 days different. Dates corresponding to each shift and the corresponding mean correlations are summarized in Table 3-6.

Table 3-6. Dates and mean correlations corresponding to each considered 500-day shift

  Period #   Starting date   Ending date   Mean correlation
   1         09/24/1998      09/15/2000    0.0403
   2         12/04/1998      11/27/2000    0.0373
   3         02/18/1999      02/08/2001    0.0381
   4         04/30/1999      04/23/2001    0.0426
   5         07/13/1999      07/03/2001    0.0444
   6         09/22/1999      09/19/2001    0.0465
   7         12/02/1999      11/29/2001    0.0545
   8         02/14/2000      02/12/2002    0.0561
   9         04/26/2000      04/25/2002    0.0528
  10         07/07/2000      07/08/2002    0.0570
  11         09/18/2000      09/17/2002    0.0672

This procedure allows us to accurately reflect the structural changes of the market graph using relatively small intervals between shifts, but at the same time one can maintain sufficiently large sample sizes of the stock price data for calculating cross-correlations for each shift. We should note that in our analysis we considered only stocks which were among those traded as of the last of the 1000 trading days, i.e., for practical reasons we did not take into account stocks which had been withdrawn from the market.
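The shifting scheme described above (1000 trading days, 500-day windows, starting points 50 days apart) can be written down directly. The sketch below uses 0-based, half-open day indices, and the function name `shifted_windows` is hypothetical.

```python
def shifted_windows(total_days=1000, window=500, step=50):
    """Return (start, end) index pairs of overlapping windows; `end` is
    exclusive. The defaults reproduce the eleven 500-day shifts above."""
    return [(s, s + window)
            for s in range(0, total_days - window + 1, step)]
```

With the default arguments this yields eleven windows, from `(0, 500)` to `(500, 1000)`, and each consecutive pair shares 450 days.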
The first subject of our consideration is the distribution of correlation coefficients between all pairs of stocks in the market. As mentioned above, this distribution on $[-1, 1]$ had a shape similar to a part of the normal distribution with mean close to 0.05 for the sample data considered in [26, 27]. One of the interpretations of this fact is that the correlation of most pairs of stocks is close to zero; therefore, the structure of the stock market is substantially random, and one can make a reasonable assumption that the prices of most stocks change independently. As we consider the evolution of the correlation distribution over time, it turns out that the shape of this distribution remains stable, which is illustrated by Figure 3-7.

Figure 3-7. Distribution of correlation coefficients in the US stock market for several overlapping 500-day periods during 2000-2002 (period 1 is the earliest, period 11 is the latest).

The stability of the correlation coefficients distribution of the market graph intuitively motivates the hypothesis that the degree distribution should also remain stable for different values of the correlation threshold. To verify this assumption,
we have calculated the degree distribution of the graphs constructed for all considered time periods. The correlation threshold $\theta = 0.5$ was chosen to describe the structure of connections corresponding to significantly high correlations. Our experiments show that the degree distribution is similar for all time intervals, and in all cases it is well described by a power law. Figure 3-8 shows the degree distributions (in the logarithmic scale) for some instances of the market graph (with $\theta = 0.5$) corresponding to different intervals.

Figure 3-8. Degree distribution of the market graph for different 500-day periods in 2000-2002 with $\theta = 0.5$: (a) period 1, (b) period 4, (c) period 7, (d) period 11.

The cross-correlation distribution and the degree distribution of the market graph represent the general characteristics of the market, and the aforementioned
results lead us to the conclusion that the global structure of the market is stable over time. However, as we will see now, some global changes in the stock market structure do take place. In order to demonstrate it, we look at another characteristic of the market graph: its edge density.

In our analysis of the market graph dynamics, we chose a relatively high correlation threshold $\theta = 0.5$ that would ensure that we consider only the edges corresponding to the pairs of stocks which are significantly correlated with each other. In this case, the edge density of the market graph would represent the proportion of those pairs of stocks in the market whose price fluctuations are similar and influence each other. The subject of our interest is to study how this proportion changes during the considered period of time. Table 3-7 summarizes the obtained results. As can be seen from this table, both the number of vertices and the number of edges in the market graph increase as time goes on. Obviously, the number of vertices grows since new stocks appear in the market, and we do not consider those stocks which ceased to exist by the last of the 1000 trading days used in our analysis, so the maximum possible number of edges in the graph increases as well. However, it turns out that the number of edges grows faster; therefore, the edge density of the market graph increases from period to period. As one can see from Figure 3-9(a), the greatest increase of the edge density corresponds to the last two periods. In fact, the edge density for the latest interval is approximately 8.5 times higher than for the first interval! This dramatic jump suggests that there is a trend to the "globalization" of the modern stock market, which means that nowadays more and more stocks significantly affect the behavior of the others.

It should be noted that the increase of the edge density could be predicted from the analysis of the distribution of the cross-correlations between all pairs of stocks. From Figure 3-7, one can observe that even though the distributions corresponding to different periods have a similar shape and the same mean,
the "tail" of the distribution corresponding to the latest period (period 11) is somewhat "heavier" than for the earlier periods, which means that there are more pairs of stocks with higher values of the correlation coefficient.

Table 3-7. Number of vertices and number of edges in the market graph for different periods ($\theta = 0.5$)

  Period   Number of Vertices   Number of Edges   Edge density
   1       5430                  2258             0.015%
   2       5507                  2614             0.017%
   3       5593                  3772             0.024%
   4       5666                  5276             0.033%
   5       5768                  6841             0.041%
   6       5866                  7770             0.045%
   7       6013                 10428             0.058%
   8       6104                 12457             0.067%
   9       6262                 12911             0.066%
  10       6399                 19707             0.096%
  11       6556                 27885             0.130%

Figure 3-9. Dynamics of edge density and maximum clique size in the market graph: evolution of the edge density (a) and maximum clique size (b) in the market graph ($\theta = 0.5$)
Table 3-8 presents the sizes of the maximum cliques found in the market graph for different time periods. As in the previous subsection, we used a relatively high correlation threshold $\theta = 0.5$ to consider only significantly correlated stocks. As one can see, there is a clear trend of the increase of the maximum clique size over time, which is consistent with the behavior of the edge density of the market graph discussed above (see Figure 3-9(b)). This result provides another confirmation of the globalization hypothesis discussed above.

Another related issue to consider is how much the structure of maximum cliques differs between the various time periods. Table 3-9 presents the stocks included in the maximum cliques for different time periods. It turns out that in most cases stocks that appear in a clique in an earlier period also appear in the cliques in later periods.

There are some other interesting observations about the structure of the maximum cliques found for different time periods. It can be seen that all the cliques include a significant number of stocks of the companies representing the "high-tech" industry sector. As examples, one can mention well-known companies such as Sun Microsystems, Inc., Cisco Systems, Inc., Intel Corporation, etc. Moreover, each clique contains stocks of the companies related to the semiconductor industry (e.g., Cypress Semiconductor Corporation, Cree, Inc., Lattice Semiconductor Corporation, etc.), and the number of these stocks in the cliques increases with time. These facts suggest that the corresponding branches of industry expanded during the considered period of time to form a major cluster of the market.

In addition, we observed that in the later periods (especially in the last two periods) the maximum cliques contain a rather large number of exchange traded funds, i.e., stocks that reflect the behavior of certain indices representing various groups of companies. It should be mentioned that all maximum cliques contain
Nasdaq 100 tracking stock (QQQ), which was also found to be the vertex with the highest degree (i.e., correlated with the most stocks) in the market graph [26].

Table 3-8. Greedy clique size and the clique number for different time periods ($\theta = 0.5$)

  Period   |V|    Edge Dens.   Clustering    |C|   |V'|   Edge Dens.   Clique
                  in G         Coefficient                in G'        Number
   1       5430   0.00015      0.505         15     76    0.286        18
   2       5507   0.00017      0.504         18     43    0.731        19
   3       5593   0.00024      0.499         26     49    0.817        27
   4       5666   0.00033      0.517         34     70    0.774        34
   5       5768   0.00041      0.550         42     82    0.787        42
   6       5866   0.00045      0.558         45     86    0.804        45
   7       6013   0.00058      0.553         51    110    0.769        51
   8       6104   0.00067      0.566         60    114    0.819        60
   9       6262   0.00066      0.553         62    107    0.869        62
  10       6399   0.00096      0.486         77    134    0.841        77
  11       6556   0.00130      0.452         84    146    0.844        85

Another natural question that one can pose is how the size of independent sets (i.e., diversified portfolios in the market) changes over time. As pointed out in [26, 27], finding a maximum independent set in the market graph turns out to be a much more complicated task than finding a maximum clique. In particular, in the case of solving the maximum independent set problem (or, equivalently, the maximum clique problem in the complementary graph), the preprocessing procedure described above does not reduce the size of the original graph. This can be explained by the fact that the clustering coefficient in the complementary market graph with $\theta = 0$ is much smaller than in the original graph corresponding to $\theta = 0.5$ (see Table 3-10).

Similarly to Section 3.2, we calculate maximal independent sets (a maximal independent set is an independent set that is not a subset of another independent set) in the market graph using the above greedy algorithm. As one can see from Table 3-10, the sizes of independent sets found in the market graph for $\theta = 0$ are rather small, which is consistent with the results of Section 3.2.
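A greedy computation of a maximal independent set, optionally forced to contain a given vertex, can be sketched as follows. It assumes the graph is a dictionary mapping each vertex to its set of neighbors; the function name `greedy_independent_set` and the tie-breaking by vertex order are illustrative choices. Picking the candidate of smallest degree is the same as picking the candidate of largest degree in the complementary graph, so this mirrors the greedy clique heuristic applied to the complement.

```python
def greedy_independent_set(adj, start=None):
    """Greedy maximal independent set (illustrative sketch).

    adj:   dict mapping vertex -> set of adjacent vertices (no self-loops).
    start: optional vertex that must belong to the returned set.
    """
    candidates = set(adj)
    chosen = []
    if start is not None:
        chosen.append(start)
        candidates -= adj[start] | {start}
    while candidates:
        # smallest degree among the remaining candidates; sorted() only
        # fixes the tie-breaking order
        v = min(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        chosen.append(v)
        candidates -= adj[v] | {v}   # v's neighbors can no longer be chosen
    return chosen
```

Calling the function once per vertex with `start` set to that vertex corresponds to the experiment of Section 3.2 in which an independent set containing each instrument was sought.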
Table 3-9. Structure of maximum cliques in the market graph for different time periods ($\theta = 0.5$)

Period: Stocks included in the maximum clique

1: BK, EMC, FBF, HAL, HP, INTC, NCC, NOI, NOK, PDS, PMCS, QQQ, RF, SII, SLB, SPY, TER, WM

2: ADI, ALTR, AMAT, AMCC, ATML, CSCO, KLAC, LLTC, LSCC, MDY, MXIM, NVLS, PMCS, QQQ, SPY, SUNW, TXN, VTSS, XLNX

3: AMAT, AMCC, CREE, CSCO, EMC, JDSU, KLAC, LLTC, LSCC, MDY, MXIM, NVLS, PHG, PMCS, QLGC, QQQ, SEBL, SPY, STM, SUNW, TQNT, TXCC, TXN, VRTS, VTSS, XLK, XLNX

4: AMAT, AMCC, ASML, ATML, BRCM, CHKP, CIEN, CREE, CSCO, EMC, FLEX, JDSU, KLAC, LSCC, MDY, MXIM, NTAP, NVLS, PMCS, QLGC, QQQ, RFMD, SEBL, SPY, STM, SUNW, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX

5: ALTR, AMAT, AMCC, ASML, ATML, BRCM, CIEN, CREE, CSCO, EMC, FLEX, IDTI, IRF, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, LSI, MDY, MXIM, NTAP, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SEBL, SPY, STM, SUNW, SWKS, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX

6: ADI, ALTR, AMAT, AMCC, ASML, ATML, BEAS, BRCM, CIEN, CREE, CSCO, CY, ELX, EMC, FLEX, IDTI, ITWO, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, LSI, MDY, MXIM, NTAP, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SEBL, SPY, STM, SUNW, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX

7: ALTR, AMAT, AMCC, ATML, BEAS, BRCD, BRCM, CHKP, CIEN, CNXT, CREE, CSCO, CY, DIGL, EMC, FLEX, HHH, ITWO, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, MDY, MERQ, MXIM, NEWP, NTAP, NVLS, ORCL, PMCS, QLGC, QQQ, RBAK, RFMD, SCMR, SEBL, SPY, SSTI, STM, SUNW, SWKS, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX

8: ALTR, AMAT, AMCC, AMKR, ARMHY, ASML, ATML, AVNX, BEAS, BRCD, BRCM, CHKP, CIEN, CMRC, CNXT, CREE, CSCO, CY, DIGL, ELX, EMC, EXTR, FLEX, HHH, IDTI, ITWO, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, MDY, MERQ, MRVC, MXIM, NEWP, NTAP, NVLS, ORCL, PMCS, QLGC, QQQ, RFMD, SCMR, SEBL, SNDK, SPY, SSTI, STM, SUNW, SWKS, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX

9: ADI, ALTR, AMAT, AMCC, ARMHY, ASML, ATML, AVNX, BDH, BEAS, BHH, BRCM, CHKP, CIEN, CLS, CREE, CSCO, CY, DELL, ELX, EMC, EXTR, FLEX, HHH, IAH, IDTI, IIH, INTC, IRF, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, LSI, MDY, MXIM, NEWP, NTAP, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SCMR, SEBL, SNDK, SPY, SSTI, STM, SUNW, SWKS, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX

10: ADI, ALTR, AMAT, AMCC, AMD, ASML, ATML, BDH, BHH, BRCM, CIEN, CLS, CREE, CSCO, CY, CYMI, DELL, EMC, FCS, FLEX, HHH, IAH, IDTI, IFX, IIH, IJH, IJR, INTC, IRF, IVV, IVW, IWB, IWF, IWM, IWV, IYV, IYW, IYY, JBL, JDSU, KLAC, KOPN, LLTC, LRCX, LSCC, LSI, LTXX, MCHP, MDY, MXIM, NEWP, NTAP, NVDA, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SANM, SEBL, SMH, SMTC, SNDK, SPY, SSTI, STM, SUNW, TER, TQNT, TXCC, TXN, VRTS, VSH, VTSS, XLK, XLNX

11: ADI, ALA, ALTR, AMAT, AMCC, AMD, ASML, ATML, BDH, BEAS, BHH, BRCM, CIEN, CLS, CNXT, CREE, CSCO, CY, CYMI, DELL, EMC, EXTR, FCS, FLEX, HHH, IAH, IDTI, IIH, IJH, IJR, INTC, IRF, IVV, IVW, IWB, IWF, IWM, IWO, IWV, IWZ, IYV, IYW, IYY, JBL, JDSU, JNPR, KLAC, KOPN, LLTC, LRCX, LSCC, LSI, LTXX, MCRL, MDY, MKH, MRVC, MXIM, NEWP, NTAP, NVDA, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SANM, SEBL, SMH, SMTC, SNDK, SPY, SSTI, STM, SUNW, TER, TQNT, TXN, VRTS, VSH, VTSS, XLK, XLNX
Table 3-10. Size of independent sets in the market graph found using the greedy heuristic ($\theta = 0.0$). Edge density and clustering coefficient are given for the complementary graph.

  Period   Number of   Edge      Clustering    Independent
           vertices    density   coefficient   set size
   1       5430        0.258     0.293         11
   2       5507        0.275     0.307         11
   3       5593        0.281     0.307         10
   4       5666        0.265     0.297         11
   5       5768        0.260     0.292         11
   6       5866        0.254     0.288         11
   7       6013        0.228     0.269         11
   8       6104        0.227     0.268         10
   9       6262        0.238     0.277         12
  10       6399        0.228     0.269         12
  11       6556        0.201     0.245         11

For finding a clique partition, we choose the instance of the market graph with a low correlation threshold $\theta = 0.05$ (the mean of the correlation coefficients distribution shown in Figure 3-7), which would ensure that the edge density of the considered graph is high enough and the number of isolated vertices (which would obviously form distinct cliques) is small.

We use the standard greedy heuristic to compute a clique partition in the market graph: recursively find a maximal clique and remove it from the graph, until no vertices remain. Cliques are computed using the previously described greedy algorithm. The corresponding results for the market graph with threshold $\theta = 0.05$ are presented in Table 3-11. Note that the size of the largest clique in the partition is increasing from one period to another, with the largest clique in the last period
containing about three times as many vertices as the corresponding clique in the first partition. At the same time, the number of cliques in the partition is comparable for different periods, with a slight overall trend towards decrease, whereas the number of vertices is increasing as time goes on.

Table 3-11. The largest clique size and the number of cliques in computed clique partitions ($\theta = 0.05$)

  Period   Number of   Edge      Largest clique     # of cliques in
           vertices    density   in the partition   the partition
   1       5430        0.400      469               494
   2       5507        0.377      552               517
   3       5593        0.379      636               513
   4       5666        0.405      743               503
   5       5768        0.413      789               501
   6       5866        0.425      824               496
   7       6013        0.469      929               471
   8       6104        0.475      983               470
   9       6262        0.456      997               509
  10       6399        0.474     1159               501
  11       6556        0.521     1372               479

Another important result is the fact that the edge density of the market graph, as well as the maximum clique size, steadily increased during the last several years, which supports the well-known idea about the globalization of the economy which has been widely discussed recently.
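The greedy clique partition procedure described above (find a maximal clique, remove it, repeat until no vertices remain) can be sketched as follows. As before, the graph is assumed to be a dictionary mapping each vertex to its set of neighbors, ties are broken by vertex order, and the function names are hypothetical; the sketch does not reproduce the data handling behind Table 3-11.

```python
def greedy_clique(adj, vertices):
    """Greedy maximal clique within `vertices`: repeatedly add the candidate
    with the most neighbors among the remaining candidates."""
    candidates = set(vertices)
    clique = []
    while candidates:
        v = max(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        clique.append(v)
        candidates &= adj[v]
    return clique

def greedy_clique_partition(adj):
    """Partition all vertices into cliques by repeatedly extracting a
    greedy maximal clique from the remaining graph."""
    remaining = set(adj)
    parts = []
    while remaining:
        clique = greedy_clique(adj, remaining)
        parts.append(clique)
        remaining -= set(clique)
    return parts
```

The number of parts returned is an upper bound on the minimum clique partition size; isolated vertices end up as cliques of size one, which is why a low threshold with few isolated vertices is used in the text.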
We have also indicated the natural way of dividing the set of financial instruments into groups of similar objects (clustering) by computing a clique partition of the market graph. This methodology can be extended by considering quasi-cliques in the partition, which may reduce the number of obtained clusters. Moreover, finding independent sets in the market graph provides a new approach to choosing diversified portfolios where all stocks are pairwise uncorrelated, which is potentially useful in practice.
The human brain is one of the most complex systems ever studied by scientists. The enormous number of neurons and the dynamic nature of connections between them make the analysis of brain function especially challenging. One of the most important directions in studying the brain is treating disorders of the central nervous system. For instance, epilepsy is a common form of such disorders, which affects approximately 1% of the human population. Essentially, epileptic seizures represent excessive and hypersynchronous activity of the neurons in the cerebral cortex.

During the last several years, significant progress in the field of epileptic seizure prediction has been made. The advances are associated with the extensive use of electroencephalograms (EEG), which can be treated as a quantitative representation of the brain function. Rapid development of computational equipment has made it possible to store and process huge amounts of EEG data obtained from recording devices. The availability of these massive datasets gives rise to another problem: utilizing mathematical tools and data mining techniques for extracting useful information from EEG data. Is it possible to construct a "simple" mathematical model based on EEG data that would reflect the behavior of the epileptic brain?

In this chapter, we make an attempt to create such a model using a network-based approach.

In the case of the human brain and EEG data, we apply a relatively simple network-based approach. We represent the electrodes used for obtaining the EEG
PAGE 75
readings, which are located in different parts of the brain, as the vertices of the constructed graph. The data received from every single electrode is essentially a time series reflecting the change of the EEG signal over time. Later in the chapter we will discuss the quantitative measure characterizing statistical relationships between the recordings of every pair of electrodes, the so-called T-index. The values of the T-index $T_{ij}$ measured for all pairs of electrodes i and j enable us to establish certain rules of placing edges connecting different pairs of vertices i and j depending on the corresponding values of $T_{ij}$. Using this technique, we develop several graph-based mathematical models and study the dynamics of the structural properties of these graphs. As we will see, these models can provide useful information about the behavior of the brain prior to, during, and after an epileptic seizure.

4.1.1 Datasets.

4–1 [67, 69, 101]).

Since the brain is a nonstationary system, algorithms used to estimate measures of the brain dynamics should be capable of automatically identifying and appropriately weighing existing transients in the data. In a chaotic system, orbits originating from similar initial conditions (nearby points in the state space) diverge exponentially (expansion process). The rate of divergence is an important aspect of the system dynamics and is reflected in the value of Lyapunov exponents. The
PAGE 76
Electrode placement in the brain: (A) inferior transverse and (B) lateral views of the brain, illustrating approximate depth and subdural electrode placement for EEG recordings, are depicted. Subdural electrode strips are placed over the left orbitofrontal (AL), right orbitofrontal (AR), left subtemporal (BL), and right subtemporal (BR) cortex. Depth electrodes are placed in the left temporal depth (CL) and right temporal depth (CR) to record hippocampal activity.
PAGE 77
method used for estimation of the short-time largest Lyapunov exponent $STL_{max}$, an estimate of $L_{max}$ for nonstationary data, is explained in detail in [66, 68, 118].

By splitting the EEG time series recorded from each electrode into a sequence of non-overlapping segments, each 10.24 sec in duration, and estimating $STL_{max}$ for each of these segments, profiles of $STL_{max}$ over time are generated.

Having estimated the $STL_{max}$ temporal profiles at an individual cortical site, and as the brain proceeds towards the ictal state, the temporal evolution of the stability of each cortical site is quantified. The spatial dynamics of this transition are captured by consideration of the relations of the $STL_{max}$ between different cortical sites. For example, if a similar transition occurs at different cortical sites, the $STL_{max}$ of the involved sites are expected to converge to similar values prior to the transition. Such participating sites are called "critical sites", and such a convergence "dynamical entrainment". More specifically, in order for the dynamical entrainment to have a statistical content, we allow a period over which the difference of the means of the $STL_{max}$ values at two sites is estimated. We use periods of 10 minutes (i.e., moving windows including approximately 60 $STL_{max}$ values over time at each electrode site) to test the dynamical entrainment at the 0.01 statistical significance level. We employ the T-index (from the well-known paired T-statistics for comparisons of means) as a measure of distance between the mean values of pairs of $STL_{max}$ profiles over time. The T-index at time t between electrode sites i and j is defined as:

$$T_{i,j}(t) = \sqrt{N} \times \frac{\left| E\{STL_{max,i} - STL_{max,j}\} \right|}{\sigma_{i,j}(t)},$$

where $E\{\cdot\}$ is the sample average difference for the $STL_{max,i} - STL_{max,j}$ estimated over a moving window $w_t(\lambda)$ defined as:

$$w_t(\lambda) = \begin{cases} 1 & \text{if } \lambda \in [t - N - 1,\, t] \\ 0 & \text{if } \lambda \notin [t - N - 1,\, t], \end{cases}$$
PAGE 78
where N is the length of the moving window. Then, $\sigma_{i,j}(t)$ is the sample standard deviation of the $STL_{max}$ differences between electrode sites i and j within the moving window $w_t(\lambda)$. The T-index follows a t-distribution with N−1 degrees of freedom. For the estimation of the $T_{i,j}(t)$ indices in our data we used N = 60 (i.e., average of 60 differences of $STL_{max}$ exponents between sites i and j per moving window of approximately 10 minute duration). Therefore, a two-sided t-test with N−1 (= 59) degrees of freedom, at a statistical significance level $\alpha$, should be used to test the null hypothesis, $H_0$: "brain sites i and j acquire identical $STL_{max}$ values at time t". In this experiment, we set the probability of a type I error $\alpha = 0.01$ (i.e., the probability of falsely rejecting $H_0$ if $H_0$ is true is 1%). For the T-index to pass this test, the $T_{i,j}(t)$ value should be within the interval [0, 2.662]. We will refer to the upper bound of this interval as $T_{critical}$.

4.2.1 Key Idea of the Model
PAGE 79
[70, 108, 111], which is essentially the divergence of the profiles of the $STL_{max}$ time series. As indicated above, this divergence is characterized by values of the T-index greater than $T_{critical}$.
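The entrainment test based on the T-index can be illustrated with a short sketch. It assumes the paired-T form described in the text (square root of the window length times the absolute mean difference, divided by the sample standard deviation of the differences) and uses the critical value 2.662 quoted for N = 60 and a 0.01 significance level. The function names `t_index` and `is_entrained` and the synthetic profiles are hypothetical and serve only as an illustration.

```python
import math
import statistics

# Critical value quoted in the text for a two-sided t-test with 59 degrees
# of freedom at the 0.01 significance level.
T_CRITICAL = 2.662


def t_index(stl_i, stl_j):
    """Paired T-index between two STLmax windows of equal length N.

    Assumes the differences are not all identical (nonzero sample
    standard deviation).
    """
    diffs = [a - b for a, b in zip(stl_i, stl_j)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return math.sqrt(n) * abs(mean_diff) / sd_diff


def is_entrained(stl_i, stl_j, t_critical=T_CRITICAL):
    """True if the two sites pass the test, i.e. T lies in [0, t_critical]."""
    return t_index(stl_i, stl_j) <= t_critical


# Synthetic 60-point windows (illustrative values, not EEG data).
site_a = [5.0 + 0.1 * (-1) ** k for k in range(60)]
site_b = [5.0 - 0.1 * (-1) ** k for k in range(60)]  # same mean as site_a
site_c = [3.0 - 0.1 * (-1) ** k for k in range(60)]  # clearly lower mean

print(is_entrained(site_a, site_b))  # similar means: entrained
print(is_entrained(site_a, site_c))  # diverged means: not entrained
```

A pair of sites whose T-index exceeds the critical value would be treated as diverged (disentrained) in the graph models of this chapter.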
PAGE 80
Figure 4–2. Number of edges in GRAPH-II
PAGE 81
The size of the largest connected component of GRAPH-II is presented in Figure 4–3. One can see that GRAPH-II is connected during the interictal period (i.e., the brain is a connected system); however, it becomes disconnected after the seizure (during the postictal state): the size of the largest connected component significantly decreases. This fact is not surprising and can be intuitively explained, since after the seizure the brain needs some time to "reset" [70, 108, 111] and restore the connections between the functional units.
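The connectivity analysis above can be sketched in a few lines. The edge rule used here, connecting sites i and j whenever their T-index is below the critical value, is an assumption made for illustration, since the formal definition of GRAPH-II does not appear in this part of the text. The function names and the toy T-index matrix are hypothetical.

```python
from collections import deque


def threshold_graph(t_matrix, t_critical=2.662):
    """Adjacency sets of a graph with an edge (i, j) when T_ij < t_critical.

    The edge rule is an assumed stand-in for the GRAPH-II definition.
    """
    n = len(t_matrix)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if t_matrix[i][j] < t_critical:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def largest_component_size(adj):
    """Size of the largest connected component, found by breadth-first search."""
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


# Toy symmetric T-index matrix for four sites (illustrative values only).
t_matrix = [
    [0.0, 1.0, 5.0, 9.0],
    [1.0, 0.0, 2.0, 9.0],
    [5.0, 2.0, 0.0, 9.0],
    [9.0, 9.0, 9.0, 0.0],
]
graph = threshold_graph(t_matrix)
print(largest_component_size(graph))
```

Tracking this quantity over successive 10-minute windows would give a curve of the kind shown in Figure 4–3.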
PAGE 82
Figure 4–3. The size of the largest connected component in GRAPH-II. Number of nodes in the graph is 30.

hypothesis is partially supported by the behavior of the average T-index of the edges corresponding to the Minimum Spanning Tree of GRAPH-I, which is shown in Figure 4–4.

However, this hypothesis cannot be verified using the considered data, since the values of average T-indices are calculated over a 10-minute interval, whereas the seizure signal propagates in a fraction of a second. Therefore, in order to check if the seizure signal actually spreads along the minimum spanning tree, one needs to introduce other nonlinear measures to reflect the behavior of the brain over short time intervals.
PAGE 83
Figure 4–4. Average value of T-index of the edges in Minimum Spanning Tree of GRAPH-I.

Also, note that the average value of the T-index in the Minimum Spanning Tree is less than $T_{critical}$, which also supports the above statement about the connectivity of the system.

We look at the behavior of the average degree of the vertices in GRAPH-II over time. Clearly, this plot is very similar to the behavior of the edge density of GRAPH-II (see Figure 4–5).
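The average T-index over the Minimum Spanning Tree of GRAPH-I, discussed above, could be computed along the following lines. The sketch assumes that GRAPH-I is a complete graph on the electrode sites whose edge weights are the T-index values, and it uses Kruskal's algorithm with a simple union-find structure; the function name and the toy matrix are hypothetical.

```python
def mst_average_weight(t_matrix):
    """Average edge weight of a minimum spanning tree of a complete graph.

    t_matrix is a symmetric matrix of edge weights (here, T-index values)
    for at least two vertices. Kruskal's algorithm is used.
    """
    n = len(t_matrix)
    edges = sorted(
        (t_matrix[i][j], i, j) for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        # Find the component representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total = 0.0
    used = 0
    for weight, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # the edge joins two different components
            parent[ri] = rj
            total += weight
            used += 1
            if used == n - 1:
                break
    return total / used


# Toy symmetric weight matrix for four sites (illustrative values only).
weights = [
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 2.0, 6.0],
    [5.0, 2.0, 0.0, 3.0],
    [4.0, 6.0, 3.0, 0.0],
]
print(mst_average_weight(weights))  # tree edges have weights 1, 2 and 3
```

Comparing the returned average with the critical value for each time window reproduces the kind of check described in the text.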
PAGE 84
Figure 4–5. Average degree of the vertices in GRAPH-II.

We are also particularly interested in high-degree vertices, i.e., the functional units of the brain that are connected (entrained) with many other brain sites at a given moment in time. Interestingly enough, the vertex with the maximum degree in GRAPH-II usually corresponds to an electrode located in RTD (right temporal depth) or RST (right subtemporal cortex); in other words, the vertex with the maximum degree is located near the epileptogenic focus.

[69]. In fact, this approach utilizes the same preprocessing technique (i.e., calculating the values of T-indices for all pairs of electrode sites) as we apply in this chapter. In this
PAGE 85
subsection, we will briefly describe this quadratic programming technique and relate it to the graph models introduced above.

The main idea of the considered quadratic programming approach is to construct a model that would select a certain number of so-called "critical" electrode sites, i.e., those that are the most entrained during the seizure. According to Section 3, such a group of electrode sites should produce a minimal sum of T-indices calculated for all pairs of electrodes within this group. If the number of critical sites is set equal to k, and the total number of electrode sites is n, then the problem of selecting the optimal group of sites can be formulated as the following quadratic 0-1 problem [69]:

$$\min\; x^T A x \quad \text{s.t.} \quad \sum_{i=1}^{n} x_i = k.$$

In this setup, the vector $x = (x_1, x_2, \ldots, x_n)$ consists of components equal to either 1 (if the corresponding site is included in the group of critical sites) or 0 (otherwise), and the elements of the matrix $A = [a_{ij}]_{i,j=1,\ldots,n}$ are the values of the $T_{ij}$'s at the seizure point.

However, as was shown in previous studies, one can observe the "resetting" of the brain after seizure onset [70, 108, 111], that is, the divergence of $STL_{max}$ profiles after a seizure. Therefore, to ensure that the optimal group of critical sites shows this divergence, one can reformulate this optimization problem by adding one more quadratic constraint:
PAGE 86
$$x^T B x \ge T_{critical}\, k(k-1),$$

where the matrix $B = [b_{ij}]_{i,j=1,\ldots,n}$ is the T-index matrix of brain sites i and j within 10-minute windows after the onset of a seizure.

This problem is then solved using standard techniques, and the group of k critical sites is found. It should be pointed out that the number of critical sites k is predetermined, i.e., it is defined empirically, based on practical observations. Also, note that in terms of the GRAPH-I model this problem represents finding a subgraph of GRAPH-I of a fixed size, satisfying the properties specified above.

Now, recall that we introduced GRAPH-III using the same principles as in the formulation of the above optimization problem, that is, we considered the connections only between the pairs of sites i, j satisfying both of the two conditions: $T_{ij} < T_{critical}$ at the seizure point, and $T_{ij} > T_{critical}$ 10 minutes after the seizure point, which are exactly the conditions that the critical sites must satisfy. A natural way of detecting such groups of sites is to find cliques in GRAPH-III. Since a clique is a subgraph where all vertices are interconnected, all pairs of electrode sites in a clique would satisfy the aforementioned conditions. Therefore, it is clear that the size of the maximum clique in GRAPH-III would represent the upper bound on the number of selected critical sites, i.e., the maximum value of the parameter k in the optimization problem described above.

Computational results indicate that the maximum clique sizes for different instances of GRAPH-III are close to the actual values of k empirically selected in the quadratic programming model, which shows that these approaches are consistent with each other.
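For very small instances, the quadratic 0-1 selection problem can be illustrated by brute-force enumeration. This is only a sketch: it assumes the additional constraint takes the form $x^T B x \ge T_{critical}\, k(k-1)$, it is not the solution technique used in [69], and the function name and toy matrices are hypothetical.

```python
from itertools import combinations


def select_critical_sites(A, B, k, t_critical=2.662):
    """Enumerate all k-subsets of sites and return the best feasible one.

    Minimizes x^T A x subject to sum(x) = k and the assumed constraint
    x^T B x >= t_critical * k * (k - 1). Exponential time: toy sizes only.
    """
    n = len(A)
    best_group, best_value = None, None
    for group in combinations(range(n), k):
        # For a 0-1 vector x, x^T M x is the sum of M[i][j] over all
        # ordered pairs (i, j) of selected sites.
        a_val = sum(A[i][j] for i in group for j in group)
        b_val = sum(B[i][j] for i in group for j in group)
        if b_val < t_critical * k * (k - 1):
            continue  # the group does not diverge enough after the seizure
        if best_value is None or a_val < best_value:
            best_group, best_value = group, a_val
    return best_group, best_value


def symmetric(n, default, entries):
    """Build a symmetric n x n matrix with zero diagonal (helper for the toy data)."""
    m = [[0.0 if i == j else default for j in range(n)] for i in range(n)]
    for (i, j), value in entries.items():
        m[i][j] = m[j][i] = value
    return m


# Illustrative T-index matrices at the seizure point (A) and after it (B).
A = symmetric(4, 3.0, {(0, 1): 0.5, (0, 2): 1.0, (2, 3): 0.2})
B = symmetric(4, 3.0, {(0, 1): 5.0, (0, 2): 4.0, (2, 3): 1.0})

print(select_critical_sites(A, B, k=2))
```

In this toy instance the pair (2, 3) has the smallest objective value but is rejected because it stays entrained after the seizure, so the pair (0, 1) is selected.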
PAGE 87
level. The main idea of this model is to use the properties of GRAPH-I, GRAPH-II, and GRAPH-III as a characterization of the behavior of the brain prior to, during, and after epileptic seizures. According to this graph model, the graphs reflecting the behavior of the epileptic brain demonstrate the following properties:

Moreover, one of the advantages of the considered graph model is the possibility to detect special formations in these graphs, such as cliques and minimum spanning trees, which can be used for further study of various properties of the epileptic brain.

Among the directions of future research in this field, one can mention the possibility of developing directed graph models based on the analysis of EEG data. Such models would take into account the natural "asymmetry" of the brain, where certain functional units control the other ones. Also, one could apply a similar approach to studying the patterns underlying the brain function of patients with other types of disorders, such as Parkinson's disease or sleep disorders.
PAGE 88
Therefore, the methodology introduced in this chapter can be generalized and applied in practice.
PAGE 89
In this chapter, we will discuss one of the most interesting real-life graph applications: so-called "social networks", where the vertices are real people [63, 116]. The main idea of this approach is to consider the "acquaintanceship graph" connecting the entire human population. In this graph, an edge connects two given vertices if the corresponding two persons know each other.

Social networks are associated with the famous "small-world" hypothesis, which claims that despite the large number of vertices, the distance between any two vertices (or, the diameter of the graph) is small. More specifically, the idea of "six degrees of separation" has been introduced. It states that any two persons in the world are linked with each other through a sequence of at most six people [63, 116, 117].

Clearly, one cannot verify this hypothesis for the graph incorporating more than 6 billion people living on the Earth; however, smaller subgraphs of the acquaintanceship graph connecting certain groups of people can be investigated in detail. One of the most well-known graphs of this type is the scientific collaboration graph reflecting the information about the joint works between all scientists. Two vertices are connected by an edge if the corresponding two scientists have a joint research paper. Another graph of this type is known as the "Hollywood graph": it links all the movie actors, and an edge connects two actors if they ever appeared in the same movie. Well-known concepts associated with these graphs are the so-called "Erdos number" (in the scientific collaboration graph) and "Bacon number" (in the Hollywood graph), which are assigned to every vertex and characterize the distance from this vertex to the vertex denoting the "center" of the graph.
PAGE 90
In the collaboration graph, the central vertex corresponds to the famous graph theoretician Paul Erdos, whereas in the Hollywood graph the same position is assigned to Kevin Bacon.

In this chapter, we discuss graphs of a similar type arising in sports, which represent the players' "collaboration". In these graphs, the players are the vertices, and an edge is added to the graph if the corresponding two players ever played together on the same team. One example of this type of graph is the graph representing baseball players. For any two baseball players who ever played in Major League Baseball (MLB), a path connecting them can be found in this graph.

As another instance of social networks in sports, we study the "NBA graph", where the vertices represent all the basketball players who are currently playing in the NBA. We apply standard graph-theoretical algorithms for investigating the properties of this graph, such as its connectivity and diameter (i.e., the maximum distance between all pairs of vertices in the graph). As we will see later in the chapter, this study also confirms the "small-world hypothesis". Moreover, we introduce a distance measure in the NBA graph similar to the Erdos number and the Bacon number. The central role in this graph is given to Michael Jordan, the greatest basketball player of all time, and we refer to this measure as the Jordan number.
PAGE 91
distances in this graph, the "central vertex" is introduced. This vertex corresponds to Paul Erdos, the father of the theory of random graphs. This vertex is assigned Erdos number equal to 0. For all other vertices in the graph, the Erdos number is defined as the distance (i.e., the shortest path length) from the central vertex. For example, those scientists who had a joint paper with Erdos have Erdos number 1, those who did not collaborate with Erdos but collaborated with Erdos' collaborators have Erdos number 2, etc.

Following this logic, one can construct the connected component of the collaboration graph with "concentric circles", which would incorporate almost all scientists in the world, except those who never collaborated with anybody. This connected component is expected to have a relatively small diameter.

The idea of constructing collaboration graphs encompassing people in different areas gave rise to several other applications. Next, we discuss the Hollywood graph and the baseball graph, where the number of vertices is significantly smaller than in the scientific collaboration graph, which allows one to study their structure in more detail.
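The definition above amounts to a breadth-first search from the central vertex. The following minimal sketch computes such numbers on a toy collaboration graph; the function name and the vertex labels are hypothetical and are not taken from any real dataset.

```python
from collections import Counter, deque


def collaboration_numbers(adj, center):
    """Shortest-path distance from `center` to every reachable vertex.

    adj maps each vertex to the set of its collaborators. Vertices that
    are not connected to the center are absent from the result.
    """
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Toy graph: "center" plays the role of the central vertex; "D" never
# collaborated with anybody and therefore receives no number.
toy = {
    "center": {"A", "B"},
    "A": {"center", "C"},
    "B": {"center"},
    "C": {"A"},
    "D": set(),
}
numbers = collaboration_numbers(toy, "center")
distribution = Counter(numbers.values())  # how many vertices have each number
print(numbers)
print(distribution)
```

The same routine yields Bacon, Wynn, or Jordan numbers when applied to the corresponding graph and center, and the `distribution` counter corresponds to histograms such as the one in Figure 5–1.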
PAGE 92
Figure 5–1. Number of vertices in the Hollywood graph with different values of Bacon number. Average Bacon number = 2.946.

actor. It turns out that most of the actors have Bacon numbers equal to 2 or 3, and the maximum possible Bacon number is equal to 8, which is the case only for 3 vertices.

The distribution of Bacon numbers in the Hollywood graph is shown in Figure 5–1. The average Bacon number (i.e., the average path length from a given actor to Bacon) is equal to 2.946. As one can see, both the average and the maximum Bacon numbers of the Hollywood graph are very small, which provides an argument in favor of the "small-world hypothesis" mentioned above.
PAGE 93
Figure 5–2. Number of vertices in the baseball graph with different values of Wynn number. Average Wynn number = 2.901

has 15817 vertices. Links between any pair of baseball players can be found at the "Oracle of Baseball" website.

Figure 5–2 shows the distribution of Wynn numbers in the baseball graph. The maximum Wynn number is 6, which is smaller than the maximum Bacon number since the total number of baseball players is less than the number of Hollywood actors.
PAGE 94
Hollywood graph, and Early Wynn as the center of the baseball graph, is the fact that it is reasonable to expect them to be connected to many vertices: Bacon appeared in many movies, and Wynn played for several baseball teams and had a lot of teammates during his long career. However, one can choose less "connected" centers of these graphs, and in this case the maximum distance from the new center of the graph may significantly increase. For example, if one chooses Barry Bonds as the center of the baseball graph, the maximum Bonds number will be 9 instead of 6. Moreover, in the Hollywood graph, it is possible to choose the center so that the maximum distance from it is equal to 14, and the average distance is greater than 6 (instead of 2.946). Therefore, in order to have more complete information about the structure of these graphs, one should calculate the maximum possible distance among all pairs of vertices in the graph. Recall that this quantity is referred to as the diameter of the graph. Clearly, the diameter can be found by considering each vertex as the center of the graph, calculating the corresponding maximal distances, and then choosing the maximum among them.

In the next section, we study the properties of the NBA graph incorporating basketball players playing in the world's best basketball league. In a similar fashion, we introduce the Jordan number, investigate its values corresponding to different vertices, and calculate the diameter of this graph.
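The diameter computation described above, treating each vertex in turn as the center and taking the largest of the resulting maximal distances, can be sketched as follows. The sketch assumes an unweighted, connected graph given as adjacency lists; the function names and the toy path graph are hypothetical.

```python
from collections import deque


def eccentricity(adj, source):
    """Maximum shortest-path distance from `source` to any other vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) != len(adj):
        raise ValueError("graph is not connected")
    return max(dist.values())


def diameter(adj):
    """Largest eccentricity over all vertices: one breadth-first search per vertex."""
    return max(eccentricity(adj, v) for v in adj)


# Toy path graph 0 - 1 - 2 - 3 - 4: the choice of center matters.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(eccentricity(path, 2))  # middle vertex as the center
print(eccentricity(path, 0))  # end vertex as the center
print(diameter(path))
```

As in the Bacon and Bonds examples, a poorly connected center (vertex 0 here) gives a larger maximum distance than a well-placed one (vertex 2), and the diameter equals the worst case over all centers.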
PAGE 95
As one can easily see, this graph has a highly specific structure: the players of every team form a clique in the graph (i.e., a set of completely interconnected vertices), because all the vertices corresponding to the players of the same team must be interconnected. Since many players change teams during or between the seasons, there are edges connecting the vertices from different cliques (teams). Note that this type of structure is common for all "collaboration networks" (see Figure 5–3).

It should be pointed out that the number of players in a basketball team is relatively small, and the players' transfers between different teams occur rather often; therefore, it would be logical to expect that the NBA graph should be connected, i.e., there is a path from every vertex to every other vertex, and moreover, the length of this path must be small enough. As we will see below, calculations confirm these assumptions.
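The clique structure described above follows directly from the way such a graph would be built. The sketch below constructs a collaboration graph from team rosters, assuming each roster lists the distinct players who were on one team during one season; the function name, team labels, and player labels are hypothetical placeholders, not real roster data.

```python
from itertools import combinations


def build_collaboration_graph(rosters):
    """Adjacency sets of the graph in which teammates are joined by an edge.

    rosters maps a (team, season) label to a list of distinct player names.
    Every roster contributes a clique on its players.
    """
    adj = {}
    for players in rosters.values():
        for p in players:
            adj.setdefault(p, set())
        for p, q in combinations(players, 2):
            adj[p].add(q)
            adj[q].add(p)
    return adj


# Placeholder rosters: player "p3" appears on both teams, so the two
# cliques are linked through that vertex.
rosters = {
    "Team A, season 1": ["p1", "p2", "p3"],
    "Team B, season 2": ["p3", "p4"],
}
graph = build_collaboration_graph(rosters)
edge_count = sum(len(neighbors) for neighbors in graph.values()) // 2
print(graph["p3"])
print(edge_count)
```

Players who change teams, such as "p3" in this toy input, are exactly the vertices that connect different cliques.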
PAGE 96
Figure 5–3. General structure of the NBA graph and other collaboration networks

First, we used a standard breadth-first search technique for checking the connectivity of the considered graph. Starting from an arbitrary vertex, we were able to locate all other vertices in the graph, which means that every vertex is reachable from any other; therefore, the graph is connected. In the next subsection, we will also see that every pair of vertices in this graph is connected by a short path, which is in agreement with the "small-world hypothesis".
PAGE 97
Figure 5–4. Number of vertices in the NBA graph with different values of Jordan number. Average Jordan number = 2.270

Similarly to the social graphs mentioned above, we define the "central vertex" in the NBA graph corresponding to Michael Jordan, who played for the Washington Wizards during his final NBA season. Obviously, all other players on the Wizards' roster for 2002–2003, as well as all the players who have played with Jordan during at least one season in the past, have Jordan number 1. It should be noted that Michael Jordan played for only two teams (Chicago Bulls and Washington Wizards) throughout his entire career; therefore, one can expect that the number of players with Jordan number 1 is rather small. In fact, only 24 players currently playing in the NBA have Jordan number 1.
PAGE 98
Table 5–1. Jordan numbers of some NBA stars (end of the 2002–2003 season).

Player               Team                      Jordan Number
Kobe Bryant          Los Angeles Lakers        2
Vince Carter         Toronto Raptors           2
Vlade Divac          Sacramento Kings          2
Tim Duncan           San Antonio Spurs         2
Michael Finley       Dallas Mavericks          2
Steve Francis        Houston Rockets           3
Kevin Garnett        Minnesota Timberwolves    3
Pau Gasol            Memphis Grizzlies         3
Richard Hamilton     Detroit Pistons           1
Allen Iverson        Philadelphia 76ers        2
Jason Kidd           New Jersey Nets           2
Toni Kukoc           Milwaukee Bucks           1
Karl Malone          Utah Jazz                 2
Stephon Marbury      Phoenix Suns              2
Shawn Marion         Phoenix Suns              2
Kenyon Martin        New Jersey Nets           3
Jamal Mashburn       New Orleans Hornets       2
Tracy McGrady        Orlando Magic             2
Reggie Miller        Indiana Pacers            3
Yao Ming             Houston Rockets           3
Dikembe Mutombo      New Jersey Nets           2
Steve Nash           Dallas Mavericks          2
Dirk Nowitzki        Dallas Mavericks          2
Jermaine O'Neal      Indiana Pacers            2
Shaquille O'Neal     Los Angeles Lakers        2
Gary Payton          Milwaukee Bucks           2
Paul Pierce          Boston Celtics            2
Scottie Pippen       Portland Trail Blazers    1
David Robinson       San Antonio Spurs         2
Arvydas Sabonis      Portland Trail Blazers    2
Jerry Stackhouse     Washington Wizards        1
Predrag Stojakovic   Sacramento Kings          2
Antoine Walker       Boston Celtics            2
Ben Wallace          Detroit Pistons           2
Chris Webber         Sacramento Kings          2
PAGE 99
Following similar logic, the players who have played with Jordan's "collaborators" have Jordan number 2, and so on. However, it turns out that the maximum Jordan number in this instance of the NBA graph is only 3, i.e., all the players are linked with Jordan through at most two vertices, which is certainly not surprising: with 29 teams and only around 15 players on each team, the NBA is really a "small world". Figure 5–4 shows the distribution of Jordan numbers in the NBA graph. The average Jordan number is equal to 2.27, which is smaller than the average Bacon number in the Hollywood graph and the average Wynn number in the baseball graph, due to the smaller number of vertices.

Table 5–1 presents Jordan numbers corresponding to some well-known NBA players. Not surprisingly, most of them have Jordan number 2, except for several players with Jordan number 3: those who joined the league recently, and therefore have not had many teammates during their careers, as well as Reggie Miller, who spent 16 seasons with the same team (Indiana Pacers), and Kevin Garnett, who played in Minnesota for 8 years. Scottie Pippen, Toni Kukoc, and Jerry Stackhouse were Jordan's teammates at different times; therefore, they have Jordan number 1.

Furthermore, we calculated the diameter of the NBA graph, i.e., the maximum possible distance between any two vertices in the graph. Since the maximum Jordan number in the NBA graph is equal to 3, one would expect the value of the diameter to be of the same order of magnitude. As mentioned in the previous section, the diameter of the NBA graph can be found as follows: for every given vertex, we calculate the distances between this vertex and all others. In this approach, we need to repeat this procedure 404 times, and every time a different vertex is considered to be the "center" of the graph. Our calculations show that the diameter of the NBA graph (the maximum distance between all pairs of vertices) is equal to 4. Therefore, one can claim that the NBA graph actually follows the small-world hypothesis, since its diameter is small enough.
PAGE 100
Table 5–2. Degrees of the Vertices in the NBA graph

degree interval   number of vertices
11–20             134
21–30             116
31–40             103
41–50             42
51–60             8
61+               2

Table 5–2 presents the number of vertices in the NBA graph corresponding to different intervals of the degree values.

It would be reasonable to assume that if one picks a vertex with a high degree as the center of the NBA graph, the average distance in the graph corresponding to this vertex would be smaller than the average Jordan number. We have found the most "connected" players in the NBA graph, i.e., those with the smallest corresponding average distances. Table 5–3 presents five players who could be the most "connected" centers of the NBA graph. As one can notice, all of them are "bench players" who have changed many teams during their careers; therefore, they have high degrees in the NBA graph. Also, an interesting observation is that although the degree of Corie Blount's vertex is smaller than that of Jim Jackson's, the average connectivity is higher for Corie Blount (his average distance is smaller), which could be explained by the fact that his teammates were highly "connected" themselves.
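The search for the most "connected" centers can be sketched as a ranking of vertices by their average distance to all other vertices. The sketch assumes a connected, unweighted graph and averages over the other vertices only (whether the center itself was counted in the averages reported in this chapter is not stated, so this is an assumption). The function names and the toy graph are hypothetical.

```python
from collections import deque


def average_distance(adj, center):
    """Mean shortest-path distance from `center` to every other vertex."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for v, d in dist.items() if v != center]
    return sum(others) / len(others)


def rank_centers(adj):
    """Vertices sorted by increasing average distance (ties: higher degree first)."""
    return sorted(adj, key=lambda v: (average_distance(adj, v), -len(adj[v])))


# Toy graph: hub "h" with three neighbors, one of which ("c") has a
# further neighbor "d".
toy = {
    "h": {"a", "b", "c"},
    "a": {"h"},
    "b": {"h"},
    "c": {"h", "d"},
    "d": {"c"},
}
ranking = rank_centers(toy)
print(ranking[:2])
print(average_distance(toy, "h"), average_distance(toy, "c"))
```

Because the ranking uses average distance rather than degree, a vertex with fewer neighbors can still rank higher when its neighbors are themselves well connected, which is the effect noted above for Corie Blount and Jim Jackson.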
PAGE 101
Table 5–3. The most "connected" players in the NBA graph

Player        Team                  Degree   Av. Distance
Corie Blount  Chicago Bulls         63       1.906
Jim Jackson   Sacramento Kings      68       1.923
Robert Pack   New Orleans Hornets   57       1.936
Grant Long    Boston Celtics        50       1.946
Bimbo Coles   Boston Celtics        54       1.958

Although the instance of the NBA graph considered in this chapter contains only currently active basketball players, it can be easily extended to reflect all players in the history of the NBA. Moreover, since a lot of foreign players from different countries and continents have come to the NBA in recent years, one would expect that the graph covering all basketball players playing in major foreign championships is also connected and has a small diameter.
PAGE 102
In this dissertation, we have addressed several issues regarding the use of network-based techniques for solving various problems arising in the broad area of the analysis of complex systems. We have demonstrated that applying these approaches is effective in many applications, including finance, biomedicine, telecommunications, sociology, etc. If a real-world massive dataset can be appropriately represented as a network structure, its analysis using graph-theoretical techniques often yields important practical results.

Clearly, the research in this area is far from complete. As technological progress continues, new types of datasets emerge in different practical fields, which leads to further research in the field of modeling and information retrieval from these datasets. Moreover, the approaches discussed in this dissertation can potentially be extended to obtain a more detailed picture of the structure of the considered datasets. In future work, the network models described above can be generalized to take into account the direction of links between vertices (directed graphs), which can help to understand the mechanisms of influence between different elements of the systems (e.g., stocks, brain units, etc.). In addition, some parameters can be assigned to vertices representing elements of the system (e.g., stocks can be characterized by their expected returns and liquidities). This leads to solving optimization problems on weighted graphs (e.g., maximum weighted clique/independent set), which may be more challenging to solve in practice for large graphs; however, this analysis may provide valuable information about the considered systems.
PAGE 103
[1] J. Abello, A. Buchsbaum and J. Westbrook, 2002. A functional approach to external graph algorithms. Algorithmica, 32(3):437–58.
[2] J. Abello, P. M. Pardalos, and M. G. C. Resende, 1999. On maximum clique problems in very large graphs, DIMACS Series, 50, American Mathematical Society, 119–130.
[3] J. Abello, P. M. Pardalos, and M. G. C. Resende (eds.), 2002. Handbook of Massive Data Sets, Kluwer Academic Publishers, Dordrecht, The Netherlands.
[4] J. Abello and J. S. Vitter (eds.), 1999. External Memory Algorithms. Vol. 50 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, Providence, RI.
[5] L. Adamic, 1999. The Small World Web. Proceedings of ECDL'99, Lecture Notes in Computer Science, 1696:443–452. Springer, Berlin.
[6] L. Adamic and B. Huberman, 2000. Power-law distribution of the World Wide Web. Science, 287:2115a.
[7] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, 1998. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, in Proceedings of ACM SIGMOD International Conference on Management of Data, ACM, New York, 94–105.
[8] W. Aiello, F. Chung, and L. Lu, 2001. A random graph model for power law graphs, Experimental Math. 10:53–66.
[9] W. Aiello, F. Chung and L. Lu, 2002. Random evolution in massive graphs. In J. Abello, P. Pardalos and M. Resende (eds.), Handbook on Massive Data Sets. Kluwer Academic Publishers, Dordrecht, The Netherlands.
[10] R. Albert and A.-L. Barabasi, 2002. Statistical mechanics of complex networks, Reviews of Modern Physics 74, 47–97.
[11] R. Albert, H. Jeong and A.-L. Barabasi, 1999. Diameter of the World-Wide Web. Nature, 401:130–131.
[12] N. Alon and M. Krivelevich, 1997. The concentration of the chromatic number of random graphs. Combinatorica, 17:303–313.
PAGE 104
[13] L. Amaral, A. Scala, M. Barthelemy, and H. Stanley, 2000. Classes of small-world networks. Proc. of National Academy of Sciences USA, 97:11149–11152.
[14] M. R. Anderberg, 1973. Cluster Analysis for Applications, Academic Press, New York, NY.
[15] L. Arge, 1995. The buffer tree: A new technique for optimal I/O algorithms. Proceedings of the Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science, 955:334–345, Springer-Verlag, Berlin.
[16] L. Arge, G. S. Brodal and L. Toma, 2000. On external memory MST, SSSP and multi-way planar graph separation. Proceedings of the Scandinavian Workshop on Algorithmic Theory, Lecture Notes in Computer Science, 1851. Springer-Verlag, Berlin.
[17] S. Arora, C. Lund, R. Motwani, and M. Szegedy, 1998. Proof verification and hardness of approximation problems. Journal of the ACM, 45:501–555.
[18] S. Arora and S. Safra, 1992. Approximating clique is NP-complete, Proceedings of the 33rd IEEE Symposium on Foundations on Computer Science, Oct. 24–27, 1992, Pittsburgh, PA, 2–13.
[19] A.-L. Barabasi, 2002. Linked, Perseus Publishing, New York.
[20] A.-L. Barabasi and R. Albert, 1999. Emergence of scaling in random networks. Science, 286:509–511.
[21] A.-L. Barabasi, R. Albert and H. Jeong, 2000. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A, 281:69–77.
[22] A.-L. Barabasi, R. Albert, H. Jeong, G. Bianconi, 2000. Power-law distribution of the World Wide Web. Science, 287:2115a.
[23] K. P. Bennett and O. L. Mangasarian, 1992. Neural Network Training via Linear Programming, in Advances in Optimization and Parallel Computing, P. M. Pardalos (ed.), North Holland, Amsterdam, 56–67.
[24] P. Berkhin, 2002. Survey of Clustering Data Mining Techniques. Technical Report, Accrue Software, San Jose, CA.
[25] V. Boginski, S. Butenko, and P. M. Pardalos, 2003. Modeling and Optimization in Massive Graphs. In: P. M. Pardalos and H. Wolkowicz, editors. Novel Approaches to Hard Discrete Optimization, American Mathematical Society, 17–39.
[26] V. Boginski, S. Butenko, and P. M. Pardalos, 2003. On Structural Properties of the Market Graph. In: A. Nagurney (editor), Innovations in Financial and Economic Networks, Edward Elgar Publishers, 28–45.
PAGE 105
[27] V. Boginski, S. Butenko, and P. M. Pardalos, 2005. Statistical analysis of financial networks. Computational Statistics and Data Analysis, 48(2):431–443.
[28] V. Boginski, S. Butenko, and P. M. Pardalos, 2005. Mining Market Data: A Network Approach. Computers and Operations Research, in press.
[29] B. Bollobas, 1978. Extremal Graph Theory. Academic Press, New York.
[30] B. Bollobas, 1985. Random Graphs. Academic Press, New York.
[31] B. Bollobas, 1988. The chromatic number of random graphs. Combinatorica, 8:49–56.
[32] B. Bollobas and P. Erdos, 1976. Cliques in random graphs. Math. Proc. Camb. Phil. Soc., 80:419–427.
[33] I. M. Bomze, M. Budinich, P. M. Pardalos, and M. Pelillo, 1999. The maximum clique problem. In: D.-Z. Du and P. M. Pardalos, editors, Handbook of Combinatorial Optimization, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1–74.
[34] P. S. Bradley, U. M. Fayyad, and O. L. Mangasarian, 1999. Mathematical Programming for Data Mining: Formulations and Challenges. INFORMS Journal on Computing, 11(3), 217–238.
[35] P. S. Bradley, O. L. Mangasarian, and W. N. Street, 1998. Feature Selection via Mathematical Programming, INFORMS Journal on Computing 10, 209–217.
[36] S. Brin and L. Page, 1998. The anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th World Wide Web Conference, 107–117.
[37] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, 2000. Graph structure in the Web. Computer Networks, 33:309–320.
[38] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, 2000. The Bow-Tie Web. Proceedings of the 9th International World Wide Web Conference, May 15–19, 2000, Amsterdam.
[39] A. L. Buchsbaum, M. Goldwasser, S. Venkatasubramanian, and J. R. Westbrook, 2000. On external memory graph traversal. Proceedings of the 11th ACM-SIAM Symposium on Discrete Algorithms, January 9–11, 2000, San Francisco, CA.
[40] V. Bugera, S. Uryasev, and G. Zrazhevsky, 2003. Classification Using Optimization: Application to Credit Ratings of Bonds. Univ. of Florida, ISE Dept., Research Report #2003-14.
PAGE 106
[41] G. Caldarelli, R. Marchetti, L. Pietronero, 2000. The Fractal Properties of Internet. Europhysics Letters, 52.
[42] Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter, 1995. External-memory graph algorithms. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 6:139–149, January 22–24, 1995, San Francisco, CA.
[43] F. Chung and L. Lu, 2001. The diameter of random sparse graphs. Advances in Applied Math., 26, 257–279.
[44] C. Cooper and A. Frieze, 2003. A general model of web graphs. Random Structures & Algorithms, 22(3):311–335.
[45] C. Cooper and A. Frieze, 2004. The size of the largest strongly connected component of a random graph with a given degree sequence. Combinatorics, Probability and Computing, 13(3):319–337.
[46] P. Dolan, 1992. Spanning trees in random graphs. In A. Frieze and T. Luczak, eds., Random Graphs, 2:47–58. John Wiley and Sons, New York.
[47] P. Erdos and A. Renyi, 1959. On random graphs. Publicationes Mathematicae, 6:290–297.
[48] P. Erdos and A. Renyi, 1960. On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci., 5:17–61.
[49] P. Erdos and A. Renyi, 1961. On the strength of connectedness of a random graph. Acta Math. Acad. Sci. Hungar., 12:261–267.
[50] M. Faloutsos, P. Faloutsos and C. Faloutsos, 1999. On power-law relationships of the Internet topology. In Proc. ACM SIGCOMM, Cambridge, MA, Sept. 1999, pp. 251–262.
[51] T. A. Feo and M. G. C. Resende, 1994. A greedy randomized adaptive search procedure for maximum independent set. Operations Research, 42:860–878.
[52] T. A. Feo and M. G. C. Resende, 1995. Greedy randomized adaptive search procedures. Journal of Global Optimization, 6:109–133.
[53] U. Feige, S. Goldwasser, L. Lovasz, S. Safra, and M. Szegedy, 1996. Interactive proofs and the hardness of approximating cliques. Journal of the ACM, 43:268–292.
[54] U. Feige and J. Kilian, 1998. Zero knowledge and the chromatic number. Journal of Computer and System Sciences, 57:187–199.
[55] D. J. Felleman and D. C. Van Essen, 1991. Distributed Hierarchical Processing in the Primate Cerebral Cortex. Cereb. Cortex, 1, 1–47.
PAGE 107
[56] A. Frieze, 1990. On the independence number of random graphs. Discrete Mathematics, 81:171–175.
[57] A. Frieze and C. McDiarmid, 1997. Algorithmic theory of random graphs. Random Structures and Algorithms, 10:5–42.
[58] M. R. Garey and D. S. Johnson, 1976. The complexity of near-optimal coloring. Journal of the ACM, 23:43–49.
[59] M. R. Garey and D. S. Johnson, 1979. Computers and Intractability: A Guide to the Theory of NP-completeness, Freeman, New York.
[60] R. Govindan and A. Reddy, 1997. An analysis of internet inter-domain topology and route stability. Proc. IEEE INFOCOM. Kobe, Japan.
[61] G. R. Grimmett and C. J. H. McDiarmid, 1975. On coloring random graphs. Mathematical Proceedings of Cambridge Phil. Society, 77:313–324.
[62] J. Hastad, 1999. Clique is hard to approximate within $n^{1-\varepsilon}$, Acta Mathematica 182:105–142.
[63] B. Hayes, 2000. Graph Theory in Practice. American Scientist, 88:9–13 (Part I), 104–109 (Part II).
[64] C. C. Hilgetag, R. Kotter, K. E. Stephan, O. Sporns, 2002. Computational Methods for the Analysis of Brain Connectivity, In: G. A. Ascoli, ed., Computational Neuroanatomy, Humana Press, Totowa, NJ.
[65] B. Huberman and L. Adamic, 1999. Growth dynamics of the World-Wide Web. Nature, 401:131.
[66] L. D. Iasemidis and J. C. Sackellares, 1991. The evolution with time of the spatial distribution of the largest Lyapunov exponent on the human epileptic cortex. In: Duke, D. W., Pritchard, W. S., eds., Measuring Chaos in the Human Brain, 49–82. World Scientific, Singapore.
[67] L. D. Iasemidis, J. C. Principe, J. M. Czaplewski, R. L. Gilmore, S. N. Roper, J. C. Sackellares, 1997. Spatiotemporal transition to epileptic seizures: a nonlinear dynamical analysis of scalp and intracranial EEG recordings. In: Silva, F. L., Principe, J. C., Almeida, L. B., eds., Spatiotemporal Models in Biological and Artificial Systems, 81–88. IOS Press, Amsterdam.
[68] L. D. Iasemidis, J. C. Principe, J. C. Sackellares, 2000. Measurement and quantification of spatiotemporal dynamics of human epileptic seizures. In: Akay, M., ed., Nonlinear biomedical signal processing. IEEE Press, vol. II, 294–318.
[69] L. D. Iasemidis, P. M. Pardalos, J. C. Sackellares, D.-S. Shiau, 2001. Quadratic binary programming and dynamical system approach to determine the
PAGE 108
predictabilityofepilepticseizures.JournalofCombinatorialOptimization,5:9{26. [70] L.D.Iasemidis,D.S.Shiau,J.C.Sackellares,P.M.Pardalos,A.Prasad,2004.Dynamicalresettingofthehumanbrainatepilepticseizures:applicationofnonlineardynamicsandglobaloptimizationtecniques.IEEETransactionsonBiomedicalEngineering,51(3):493{506. [71] ILOGCPLEX7.0ReferenceManual,2000. [72] A.K.JainandR.C.Dubes,1988.AlgorithmsforClusteringData,PrenticeHall,EnglewoodClis,NJ. [73] S.Janson,T.LuczakandA.Rucinski,2000.RandomGraphs.Wiley&Sons,NewYork. [74] H.Jeong,S.Mason,A.L.Barabasi,andZ.N.Oltvai,2001.Lethalityandcertaintyinproteinnetworks.Nature,411:4142. [75] H.Jeong,B.Tomber,R.Albert,Z.N.Oltvai,andA.L.Barabasi,2000.Thelargescaleorganizationofmetabolicnetworks,Nature,407:651654. [76] D.S.JohnsonandM.A.Trick(eds.),1996.Cliques,Coloring,andSatisability:SecondDIMACSImplementationChallenge,Vol.26ofDIMACSSeries,AmericanMathematicalSociety,Providence,RI. [77] V.KleeandD.Larman,1981.Diametersofrandomgraphs.CanadianJournalofMathematics,33:618640. [78] J.Kleinberg,1999.Authoritativesourcesinahyperlinkedenvironment.JournaloftheACM,46. [79] J.KleinbergandS.Lawrence,2001.TheStructureoftheWeb.Science,294:184950. [80] V.F.Kolchin,1999.RandomGraphs.CambridgeUniversityPress,Cambridge,UK. [81] R.Kumar,P.Raghavan,S.Rajagopalan,D.Sivakumar,A.Tomkins,andE.Upfal,2000.TheWebasagraph.In:Proceedingsofthe19thACMSIGMODSIGACTSIGARTsymposiumonPrinciplesofdatabasesystems,Dallas,TX,pp.1{10. [82] R.Kumar,P.Raghavan,S.Rajagopalan,andA.Tomkins,1999.TrawlingtheWebforcybercommunities.ComputerNetworks,31(1116):1481{1493. [83] V.KumarandE.Schwabe,1996.Improvedalgorithmsanddatastructuresforsolvinggraphproblemsinexternalmemory.In:ProceedingsoftheEighth
PAGE 109
IEEESymposiumonParallelandDistributedProcessing,NewOrleans,LA,pp.169176. [84] S.LawrenceandC.L.Giles,1999.AccessibilityofInformationontheWeb.Nature,400:107{109. [85] T.Luczak,1991.Anoteonthesharpconcentrationofthechromaticnumberofrandomgraphs.Combinatorica,11:295{297. [86] T.Luczak,1990.Componentsbehaviornearthecriticalpointoftherandomgraphprocess.RandomStructuresandAlgorithms,1:287310. [87] T.Luczak,1998.Randomtreesandrandomgraphs.RandomStructuresandAlgorithms,13:485500. [88] T.Luczak,B.PittelandJ.Wierman,1994.Thestructureofarandomgraphnearthepointofthephasetransition.TransactionsoftheAmericanMathematicalSociety,341:721748. [89] C.LundandM.Yannakakis,1994.Onthehardnessofapproximatingminimizationproblems.JournaloftheACM,41:960981. [90] O.L.Mangasarian,1993.MathematicalProgramminginNeuralNetworks,ORSAJournalonComputing,5:349360. [91] O.L.Mangasarian,W.N.Street,andW.H.Wolberg,1995.BreastCancerDiagnosisandPrognosisviaLinearProgramming,OperationsResearch43(4),570577. [92] R.N.MantegnaandH.E.Stanley,2000.AnIntroductiontoEconophysics:CorrelationsandComplexityinFinance,CambridgeUniversityPress,Cambridge,UK. [93] D.Matula,1970.Onthecompletesubgraphofarandomgraph.InR.BoseandT.Dowling,eds.,CombinatoryMathematicsanditsApplications,356369,ChapelHill,NC. [94] A.Medina,I.Matta,andJ.Byers,2000.OntheOriginofPowerlawsinInternetTopologies.ACMComputerCommunicationReview,30:160163. [95] B.MirkinandI.Muchnik,1998.CombinatoralOptimizationinClustering.In:HandbookofCombinatorialOptimization(D.Z.DuandP.M.Pardalos,eds.),Volume2,261{329.KluwerAcademicPublishers,Dordrecht,TheNetherlands. [96] A.Mendelzon,G.Mihaila,andT.Milo,1997.QueryingtheWorldWideWeb.JournalofDigitalLibraries,1:6888. [97] A.MendelzonandP.Wood,1995.Findingregularsimplepathsingraphdatabases.SIAMJ.Comp.,24:12351258.
PAGE 110
[98] M.MolloyandB.Reed,1995.Acriticalpointforrandomgraphswithagivendegreesequence.RandomStructuresandAlgorithms,6:161180. [99] M.MolloyandB.Reed,1998.Thesizeofthelargestcomponentofarandomgraphonaxeddegreesequence.Combinatorics,ProbabilityandComputing,7:295306. [100] J.M.MurreandD.P.Sturdy,1995.TheConnectivityoftheBrain:MultiLevelQuantitativeAnalysis.Biol.Cybern.,73,529{545. [101] P.M.Pardalos,W.Chaovalitwongse,L.D.Iasemidis,J.C.Sackellares,D.S.Shiau,P.R.Carney,O.A.Prokopyev,andV.A.Yatsenko,2004.SeizureWarningAlgorithmBasedonSpatiotemporalDynamicsofIntracranialEEG.MathematicalProgramming,101(2):365385. [102] P.M.Pardalos,T.Mavridou,andJ.Xue,1998.TheGraphColoringProblem:ABibliographicSurvey.In:HandbookofCombinatorialOptimization(D.Z.DuandP.M.Pardalos,eds.),Volume2,331{395.KluwerAcademicPublishers,Dordrecht,TheNetherlands. [103] R.PastorSatorras,A.Vazquez,andA.Vespignani,2001.DynamicalandcorrelationpropertiesoftheInternet.Phys.Rev.Lett.,87:258701. [104] R.PastorSatorrasandA.Vespignani,2001.Epidemicspreadinginscalefreenetworks.PhysicalReviewLetters,86:32003203. [105] G.PiatetskyShapiroandW.Frawley(eds.),1991.KnowledgeDiscoveryinDatabases,MITPress,Cambridge,MA. [106] O.A.Prokopyev,V.Boginski,W.Chaovalitwongse,P.M.Pardalos,J.C.Sackellares,andP.R.Carney,2005.NetworkBasedTechniquesinEEGDataAnalysisandEpilepticBrainModeling.In:DataMininginBiomedicine,P.M.Pardalosetal.(eds.),Springer,NewYork(toappear). [107] D.E.RumelhartandD.Zipser,1985.FeatureDiscoverybyCompetitiveLearning.CognitiveScience,9,75{112. [108] J.C.Sackellares,L.D.Iasemidis,R.L.Gilmore,S.N.Roper,1997.Epilepticseizuresasneuralresettingmechanisms.Epilepsia,38,S3,189. [109] E.ScheinermanandJ.Wierman,1989.Optimalandnearoptimalbroadcastinginrandomgraphs.DiscreteAppliedMathematics,25:289297. [110] E.ShamirandJ.Spencer,1987.SharpconcentrationofthechromaticnumberonrandomgraphsGn;p.Combinatorica,7:124{129. 
[111] D.S.Shiau,Q.Luo,S.L.Gilmore,S.N.Roper,P.M.Pardalos,J.C.Sackellares,L.D.Iasemidis,2000.Epilepticseizuresresettingrevisited.Epilepsia,41,S7,208209.
PAGE 111
[112] J.D.UllmanandM.Yannakakis,1991.Theinput/outputcomplexityoftransitiveclosure.AnnalsofMathematicsandArticialIntelligence,3:331360. [113] V.Vapnik,S.E.Golowich,andA.Smola,1997.SupportVectorMethodforFunctionApproximation,RegressionEstimation,andSignalProcessing,inAdvancesinNeuralInformationProcessingSystems9,M.C.Mozer,M.I.Jordan,andT.Petsche(eds.),MITPress,Cambridge,MA. [114] V.N.Vapnik,1995.TheNatureofStatisticalLearningTheory,Springer,NewYork. [115] J.S.Vitter,2001.ExternalMemoryAlgorithmsandDataStructures:DealingwithMASSIVEDATA.ACMComputingSurveys,33:209271. [116] D.Watts,1999.SmallWorlds:TheDynamicsofNetworksBetweenOrderandRandomness,PrincetonUniversityPress,Princeton,NJ. [117] D.WattsandS.Strogatz,1998.Collectivedynamicsof`smallworld'networks,Nature,393:440442. [118] A.Wolf,J.B.Swift,H.L.Swinney,J.A.Vastano,1985.DeterminingLyapunovexponentsfromatimeseries.PhysicaD,16:285317.
PAGE 112
Vladimir Boginski was born on September 23, 1980, in Bryansk, Russia. He received his bachelor's degree in Applied Mathematics from Moscow Institute of Physics and Technology (State University) in 2000. In 2001, he entered the graduate program in Industrial and Systems Engineering at the University of Florida. He received his M.S. and Ph.D. degrees in Industrial and Systems Engineering from the University of Florida in May 2003 and August 2005, respectively.

