Citation
Optimization and Information Retrieval Techniques for Complex Networks

Material Information

Title:
Optimization and Information Retrieval Techniques for Complex Networks
Copyright Date:
2008

Subjects

Subjects / Keywords:
Algorithms ( jstor )
Connectivity ( jstor )
Datasets ( jstor )
Diameters ( jstor )
Electrodes ( jstor )
Electroencephalography ( jstor )
Graph theory ( jstor )
Mining ( jstor )
Stock markets ( jstor )
Vertices ( jstor )

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Embargo Date:
7/30/2007



Full Text











OPTIMIZATION AND


INFORMATION RETRIEVAL TECHNIQUES FOR
COMPLEX NETWORKS


By

VLADIMIR L. BOGINSKI


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


2005

































Copyright 2005

by

Vladimir L. Boginski
















I dedicate this to my parents.















ACKNOWLEDGMENTS

I would like to thank my advisor Prof. Panos Pardalos for his support and

guidance that made my studies at the University of Florida enjoyable and

productive. His energy and enthusiasm inspired me during these four years, and I believe

that this was crucial for my success.

I also want to thank my committee members Prof. Stan Uryasev, Prof. Joseph

Geunes, and Prof. William Hager for their concern and encouragement. I am

grateful to all my collaborators, especially Sergiy Butenko and Oleg Prokopyev,

who were always a great pleasure to work with.

Finally, I would like to express my greatest appreciation to my family and

friends, who always believed in me and supported me in all circumstances.















TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

1.1 Basic Concepts from Graph Theory and Data Mining Interpretation
1.1.1 Connectivity and Degree Distribution
1.1.2 Cliques and Independent Sets
1.1.3 Clustering via Clique Partitioning

2 REVIEW OF NETWORK-BASED MODELING AND OPTIMIZATION TECHNIQUES IN MASSIVE DATA SETS

2.1 Modeling and Optimization in Massive Graphs
2.1.1 Examples of Massive Graphs
2.1.1.1 Call Graph
2.1.1.2 Internet and Web Graphs
2.1.2 External Memory Algorithms
2.1.3 Modeling Massive Graphs
2.1.3.1 Uniform Random Graphs
2.1.3.2 Potential Drawbacks of the Uniform Random Graph Model
2.1.3.3 Random Graphs with a Given Degree Sequence
2.1.3.4 Power-Law Random Graphs
2.1.4 Optimization in Random Massive Graphs
2.1.4.1 Clique Number
2.1.4.2 Chromatic Number
2.1.5 Remarks

3 NETWORK-BASED APPROACHES TO MINING STOCK MARKET DATA

3.1 Structure of the Market Graph
3.1.1 Constructing the Market Graph
3.1.2 Connectivity of the Market Graph
3.1.3 Degree Distribution of the Market Graph
3.1.4 Instruments Corresponding to High-Degree Vertices
3.1.5 Clustering Coefficients in the Market Graph
3.2 Analysis of Cliques and Independent Sets in the Market Graph
3.2.1 Cliques in the Market Graph
3.2.2 Independent Sets in the Market Graph
3.3 Data Mining Interpretation of the Market Graph Model
3.4 Evolution of the Market Graph
3.4.1 Dynamics of Global Characteristics of the Market Graph
3.4.2 Dynamics of the Size of Cliques and Independent Sets in the Market Graph
3.4.3 Minimum Clique Partition of the Market Graph
3.5 Concluding Remarks

4 NETWORK-BASED TECHNIQUES IN ELECTROENCEPHALOGRAPHIC (EEG) DATA ANALYSIS AND EPILEPTIC BRAIN MODELING

4.1 Statistical Preprocessing of EEG Data
4.1.1 Datasets
4.1.2 T-statistics and STLmax
4.2 Graph Structure of the Epileptic Brain
4.2.1 Key Idea of the Model
4.2.1.1 Interpretation of the Considered Graph Models
4.2.2 Properties of the Graphs
4.2.2.1 Edge Density
4.2.2.2 Connectivity
4.2.2.3 Minimum Spanning Tree
4.2.2.4 Degrees of the Vertices
4.2.2.5 Maximum Cliques
4.3 Graph as a Macroscopic Model of the Epileptic Brain
4.4 Concluding Remarks and Directions of Future Research

5 COLLABORATION NETWORKS IN SPORTS

5.1 Examples of Social Networks
5.1.1 Scientific Collaboration Graph and Erdős Number
5.1.2 Hollywood Graph and Bacon Number
5.1.3 Baseball Graph and Wynn Number
5.1.4 Diameter of Collaboration Networks
5.2 NBA Graph
5.2.1 General Properties of the NBA Graph
5.2.2 Diameter of the NBA Graph and Jordan Number
5.2.3 Degrees and "Connectedness" of the Vertices in the NBA Graph
5.3 Concluding Remarks

6 CONCLUSIONS AND DIRECTIONS FOR FUTURE RESEARCH

REFERENCES

BIOGRAPHICAL SKETCH















LIST OF TABLES
Table

3-1 Least-squares estimates of the parameter $\gamma$ in the market graph

3-2 Top 25 instruments with highest degrees in the market graph

3-3 Clustering coefficients of the market graph

3-4 Sizes of the maximum cliques in the market graph

3-5 Sizes of independent sets in the complementary market graph

3-6 Dates and mean correlations corresponding to each considered 500-day shift

3-7 Number of vertices and number of edges in the market graph for different periods

3-8 Greedy clique size and the clique number for different time periods

3-9 Structure of maximum cliques in the market graph for different time periods

3-10 Size of independent sets in the market graph found using the greedy heuristic

3-11 The largest clique size and the number of cliques in computed clique partitions

5-1 Jordan numbers of some NBA stars (end of the 2002-2003 season)

5-2 Degrees of the vertices in the NBA graph

5-3 The most "connected" players in the NBA graph















LIST OF FIGURES
Figure

2-1 Frequencies of clique sizes in the call graph

2-2 Pattern of connections in the call graph

2-3 Number of Internet hosts for the period 01/1991-01/2002

2-4 Pattern of connections in the Web graph

2-5 Connectivity of the Web (Bow-Tie model)

3-1 Distribution of correlation coefficients in the stock market

3-2 Edge density of the market graph for different values of the correlation threshold

3-3 Plot of the size of the largest connected component in the market graph as a function of correlation threshold $\theta$

3-4 Degree distribution of the market graph

3-5 Degree distribution of the complementary market graph

3-6 Frequency of the sizes of independent sets found in the market graph

3-7 Distribution of correlation coefficients in the US stock market for several overlapping 500-day periods during 2000-2002

3-8 Degree distribution of the market graph for different 500-day periods in 2000-2002

3-9 Dynamics of edge density and maximum clique size in the market graph

4-1 Electrode placement in the brain

4-2 Number of edges in GRAPH-II

4-3 The size of the largest connected component in GRAPH-II

4-4 Average value of T-index of the edges in Minimum Spanning Tree of GRAPH-I

4-5 Average degree of the vertices in GRAPH-II

5-1 Number of vertices in the Hollywood graph with different values of Bacon number

5-2 Number of vertices in the baseball graph with different values of Wynn number

5-3 General structure of the NBA graph and other collaboration networks

5-4 Number of vertices in the NBA graph with different values of Jordan number















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

OPTIMIZATION AND INFORMATION RETRIEVAL TECHNIQUES FOR
COMPLEX NETWORKS

By

Vladimir L. Boginski

August 2005

Chair: Panagote M. Pardalos
Major Department: Industrial and Systems Engineering

This study develops novel approaches to modeling real-world datasets arising

in diverse application areas as networks and information retrieval from these

datasets using network optimization techniques. Network-based models allow one

to extract information from datasets using various concepts from graph theory.

In many cases, one can investigate specific properties of a dataset by detecting

special formations in the corresponding graph (for instance, connected components,

spanning trees, cliques, and independent sets). This process often involves solving

computationally challenging combinatorial optimization problems on graphs

(maximum independent set, maximum clique, minimum clique partition, etc.).

These problems are especially difficult to solve for large graphs. However, in certain

cases, the exact solution of a hard optimization problem can be found using a

special structure of the considered graph.

A significant part of the dissertation focuses on developing network-based

models of real-world complex systems, including the stock market and the human

brain, which have always been of special interest to scientists. These systems

generate huge amounts of data and are especially hard to analyze. This dissertation









demonstrates that network-based models can be successfully applied to information

retrieval from datasets, providing new insight into the structural properties and

patterns underlying the corresponding complex systems.

The developed network representations of the considered datasets are in

many cases non-trivial and include certain statistical preprocessing techniques.

In particular, the U.S. stock market is represented as a network based on cross-

correlations of price fluctuations of the financial instruments, which are calculated

over a certain number of trading days. This model (market graph) allows one

to analyze the structure and dynamics of the stock market from an alternative

perspective and obtain useful information about the global structure of the market,

classes of similar stocks, and diversified portfolios.

Similarly, a macroscopic network model of the human brain is constructed

based on the statistical measures of entrainment between electroencephalographic

(EEG) signals recorded from different functional units of the brain. Studying the

evolution of the properties of these networks revealed some interesting facts about

brain disorders, such as epilepsy.















CHAPTER 1
INTRODUCTION

Nowadays, the process of studying real-life complex systems often deals with

large datasets arising in diverse applications including government and military

systems, telecommunications, biotechnology, medicine, finance, astrophysics, ecol-

ogy, geographical information systems, etc. [3, 25]. Understanding the structural

properties of a certain dataset is in many cases a task of crucial importance. To

get useful information from these data, one often needs to apply special techniques

of summarizing and visualizing the information contained in a dataset.

An appropriate mathematical model can simplify the analysis of a dataset and

even theoretically predict some of its properties. Thus, a fundamental problem that

arises here is modeling the datasets characterizing real-world complex systems.

In this dissertation, we concentrate on one aspect of this problem: network

representation of real-world datasets. According to this approach, a certain dataset

is represented as a graph (network) with certain attributes associated with its

vertices and edges.

Studying the structure of a graph representing a dataset is often important

for understanding the internal properties of the application it represents, as

well as for improving storage organization and information retrieval. One can

visualize a graph as a set of dots and links connecting them, which often makes this

representation convenient and easily understandable.

The main concepts of graph theory were founded several centuries ago, and

many network optimization algorithms have been developed since then. However,

graph models have been applied only recently to representing various real-life

massive datasets. Graph theory is quickly becoming a practical field of science.









Expansion of graph-theoretical approaches in various applications gave birth to the

terms "graph practice" and "graph engineering" [63].

Network-based models allow one to extract information from real-world

datasets using various standard concepts from graph theory. In many cases, one

can investigate specific properties of a dataset by detecting special formations in

the corresponding graph, for instance, connected components, spanning trees, cliques

and independent sets. In particular, cliques and independent sets can be used for

solving the important clustering problem arising in data mining, which essentially

represents partitioning the set of elements of a certain dataset into a number of

subsets (clusters) of objects according to some similarity (or dissimilarity) criterion.

These concepts are associated with a number of network optimization problems

discussed later.

Another aspect of investigating network models of real-world datasets is

studying the degree distribution of the constructed graphs. The degree distribution

is an important characteristic of a dataset represented by a graph. It represents

the large-scale pattern of connections in the graph, which reflects the global

properties of the dataset. One of the important results discovered during the last

several years is the observation that many graphs representing the datasets from

diverse areas (Internet, telecommunications, biology, sociology) obey the power-law

model [9]. The fact that graphs representing completely different datasets have a

similar well-defined power-law structure has been widely reflected in the literature

[10, 19, 20, 25, 63, 116, 117]. It indicates that global organization and evolution

of datasets arising in various spheres of life nowadays follow similar laws and

patterns. This fact served as a motivation to introduce a concept of "self-organized

networks."

Later we discuss in more detail various aspects of modeling real-world datasets

as networks, and retrieving useful information from these networks. The practical









importance of graph-theoretic techniques is shown by several examples of applying

these approaches to datasets arising in telecommunications, the Internet,

sociology, etc. The main part of the dissertation is devoted to novel network-based

techniques and models that allow one to obtain important non-trivial information

from datasets arising in finance and biomedicine.

1.1 Basic Concepts from Graph Theory and Data Mining Interpretation

To facilitate further discussion, we present several basic definitions and

notations from graph theory and discuss the interpretation of the introduced

concepts from the perspective of data mining and information retrieval.

Let G = (V, E) be an undirected graph with the set of n vertices V and the set

of edges $E = \{(i,j) : i,j \in V\}$. Directed graphs, where the head and tail of each

edge are specified, are considered in some applications. The concept of a multigraph

is also sometimes introduced. A multigraph is a graph where multiple edges

connecting a given pair of vertices may exist. One of the important characteristics

of a graph is its edge density: the ratio of the number of edges in the graph to the

maximum possible number of edges.1

1.1.1 Connectivity and Degree Distribution

The graph G = (V, E) is connected if there is a path from any vertex to any

vertex in the set V. If the graph is disconnected, it can be decomposed into several

connected subgraphs, which are referred to as the connected components of G.

The degree of a vertex is the number of edges emanating from it. For every

integer k one can calculate the number of vertices n(k) with a degree equal to k,

and then get the probability that a vertex has the degree k as P(k) = n(k)/n,

where n is the total number of vertices. The function P(k) is referred to as the



1 The maximum possible number of edges in a graph is equal to $n(n-1)/2$ (n is
the number of vertices).









degree distribution of the graph. In the case of a directed graph, the concept of

degree distribution is generalized: one can distinguish the distribution of in-degrees

and out-degrees, which deal with the number of edges ending at and starting from a

vertex, respectively.
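The two quantities defined so far, edge density and the degree distribution P(k) = n(k)/n, can be computed directly from an edge list. The following is a minimal sketch for a simple undirected graph; the function names and the representation of vertices as integers 0, ..., n-1 are assumptions made for illustration only.

```python
from collections import Counter

def edge_density(n, edges):
    """Ratio of the number of edges to the maximum possible n(n-1)/2."""
    if n < 2:
        return 0.0
    return len(edges) / (n * (n - 1) / 2)

def degree_distribution(n, edges):
    """Return {k: P(k)} with P(k) = n(k)/n for a simple undirected graph."""
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    counts = Counter(degree)  # n(k): number of vertices with degree k
    return {k: counts[k] / n for k in sorted(counts)}

# Illustrative graph: the path 0 - 1 - 2 - 3
path_edges = [(0, 1), (1, 2), (2, 3)]
print(edge_density(4, path_edges))
print(degree_distribution(4, path_edges))
```

For a directed graph, the same counting would be done separately for edge heads and edge tails to obtain the in-degree and out-degree distributions.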

Degree distribution is an important characteristic of a dataset represented

by a graph. It reflects the overall pattern of connections in the graph, which in

many cases reflects the global properties of the dataset this graph represents. As

mentioned above, many real-world graphs representing the datasets coming from

diverse areas (Internet, telecommunications, finance, biology, sociology) have degree

distributions that follow the power-law model, which states that the probability

that a vertex of a graph has a degree k (i.e., there are k edges emanating from it) is



$$P(k) \propto k^{-\gamma}. \tag{1-1}$$

Equivalently, one can represent it as

$$\log P \propto -\gamma \log k, \tag{1-2}$$

which demonstrates that this distribution forms a straight line in the logarithmic

scale, and the slope of this line is determined by the parameter $\gamma$ (it equals $-\gamma$).
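Because a power-law distribution is a straight line on the log-log scale, the exponent of the power law can be estimated by an ordinary least-squares fit of log P(k) against log k. The sketch below illustrates the idea on synthetic data; the helper name `estimate_gamma` and the test distribution are hypothetical, and this is not necessarily the estimation procedure used later in the dissertation.

```python
import math

def estimate_gamma(dist):
    """Least-squares slope of log P(k) against log k; returns gamma = -slope."""
    pts = [(math.log(k), math.log(p)) for k, p in dist.items() if k > 0 and p > 0]
    m = len(pts)
    mean_x = sum(x for x, _ in pts) / m
    mean_y = sum(y for _, y in pts) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    return -sxy / sxx

# Synthetic distribution proportional to k^(-2.5), used only as an illustration.
raw = {k: k ** -2.5 for k in range(1, 101)}
total = sum(raw.values())
synthetic = {k: v / total for k, v in raw.items()}
gamma_hat = estimate_gamma(synthetic)
print(gamma_hat)
```

Vertices of degree zero are skipped because log k is undefined there; on real data the tail of the distribution is noisy, so the fitted value should be read as a rough estimate.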

An important characteristic of the power-law model is its scale-free property.

This property implies that the power-law structure of a certain network should not

depend on the size of the network. Clearly, real-world networks dynamically grow

over time, therefore, the growth process of these networks should obey certain rules

in order to satisfy the scale-free property. The necessary properties of the evolution

of the real-world networks are growth and preferential attachment [20]. The first

property implies the obvious fact that the size of these networks grows continuously

(i.e., new vertices are added to a network, which means that new elements are

added to the corresponding dataset). The second property represents the idea that








new vertices are more likely to be connected to old vertices with high degrees. It is

intuitively clear that these principles characterize the evolution of many real-world

complex networks nowadays.
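The two principles, growth and preferential attachment, can be illustrated with a small simulation. The sketch below is an assumption-laden toy model rather than the construction from [20]: each new vertex brings exactly one edge, and its endpoint is drawn with probability proportional to the current degree of the existing vertices.

```python
import random

def grow_network(num_vertices, seed=0):
    """Grow a graph one vertex at a time with degree-proportional attachment."""
    rng = random.Random(seed)
    edges = [(0, 1)]       # start from a single edge
    endpoints = [0, 1]     # every vertex appears once per incident edge
    for v in range(2, num_vertices):
        # Choosing uniformly from the endpoint list selects an old vertex
        # with probability proportional to its degree.
        target = rng.choice(endpoints)
        edges.append((v, target))
        endpoints += [v, target]
    return edges

grown = grow_network(50)
print(len(grown))
```

Older vertices accumulate more entries in the endpoint list and therefore attract new edges more often, which is the "rich get richer" effect the second property describes.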

From another perspective, some properties of graphs that follow the power-law

model can be predicted theoretically. Aiello et al. [9] studied the properties of the

power-law graphs using the theoretical power-law random ','rll, model representing

the the class of random graphs obeying the power law (see C'! plter 2). Among

their results, one can mention the existence of a giant connected component in

a power-law graph with 7 < 7o a 3.47875, and the fact that a giant connected

component does not exist otherwise.2 Emergence of a giant connected component

at the point 7o m 3.47875 is often called phase transition.

The size of connected components of the graph may provide useful information

about the structure of the corresponding dataset, as the connected components

would normally represent groups of "similar" objects. In some applications,

decomposing the graph into a set of connected components can provide a reason-

able solution to the clustering problem (i.e., partitioning the graph into several

subgraphs, each of which corresponds to a certain cluster).
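Decomposing a graph into connected components is a standard breadth-first search exercise. The following minimal sketch assumes vertices labeled 0, ..., n-1 and an in-memory edge list; the function name is illustrative.

```python
from collections import deque

def connected_components(n, edges):
    """Return the connected components of an undirected graph as sorted lists."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        comp = []
        while queue:
            u = queue.popleft()
            comp.append(u)
            for w in adj[u]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
        components.append(sorted(comp))
    return components

# Three clusters: {0, 1, 2}, {3, 4}, and the isolated vertex 5
clusters = connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(clusters)
```

Each returned list can be read as one cluster of the dataset. The search visits every vertex and edge once, so the running time is linear in the size of the graph.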

1.1.2 Cliques and Independent Sets

Given a subset $S \subseteq V$, we denote by G(S) the subgraph induced by S. A

subset $C \subseteq V$ is a clique if G(C) is a complete graph (i.e., it has all possible edges).

The maximum clique problem is to find the largest clique in a graph.

The following definitions generalize the concept of clique. Instead of cliques

one can consider dense subgraphs, or quasi-cliques. A $\gamma$-clique $C_\gamma$, also called a



2 These results are valid asymptotically almost surely (a.a.s.), which means that
the probability that a given property takes place tends to 1 as the number of ver-
tices n goes to infinity.









quasi-clique, is a subset of V such that $G(C_\gamma)$ has at least $\lfloor \gamma q(q-1)/2 \rfloor$ edges,

where q is the cardinality (i.e., number of vertices) of $C_\gamma$.
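The quasi-clique condition is easy to verify for a given vertex subset. The sketch below reads the threshold in the definition as the floor of γq(q-1)/2, where γ is the density parameter and q the number of vertices in the subset; the function name and the example graph are assumptions for illustration.

```python
import math
from itertools import combinations

def is_quasi_clique(subset, edges, gamma):
    """Check that the induced subgraph has at least floor(gamma*q*(q-1)/2) edges."""
    edge_set = {frozenset(e) for e in edges}
    q = len(subset)
    induced = sum(1 for pair in combinations(subset, 2) if frozenset(pair) in edge_set)
    return induced >= math.floor(gamma * q * (q - 1) / 2)

# A 4-cycle with one diagonal: 5 of the 6 possible edges are present.
dense_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(is_quasi_clique([0, 1, 2, 3], dense_edges, 0.8))
print(is_quasi_clique([0, 1, 2, 3], dense_edges, 1.0))
```

With the density parameter equal to 1 the test reduces to the ordinary clique condition, so the second call checks whether the four vertices form a clique.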

An independent set is a subset $I \subseteq V$ such that the subgraph G(I) has no

edges. The maximum independent set problem can be easily reformulated as the

maximum clique problem in the complementary graph $\bar{G}(V, \bar{E})$, defined as follows.

If an edge $(i,j) \in E$, then $(i,j) \notin \bar{E}$; and if $(i,j) \notin E$, then $(i,j) \in \bar{E}$. Clearly, a

maximum clique in $\bar{G}$ is a maximum independent set in $G$, so the maximum clique

and maximum independent set problems can be easily reduced to each other.
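The reduction between the two problems can be checked on a small example: a vertex subset is an independent set in a graph exactly when it is a clique in the complementary graph. The following sketch uses hypothetical helper names and a brute-force check that is only suitable for tiny graphs.

```python
from itertools import combinations

def complement(n, edges):
    """Edges of the complementary graph on vertices 0..n-1."""
    present = {frozenset(e) for e in edges}
    return [(i, j) for i, j in combinations(range(n), 2)
            if frozenset((i, j)) not in present]

def is_clique(subset, edges):
    present = {frozenset(e) for e in edges}
    return all(frozenset(p) in present for p in combinations(subset, 2))

def is_independent_set(subset, edges):
    present = {frozenset(e) for e in edges}
    return not any(frozenset(p) in present for p in combinations(subset, 2))

# Path 0 - 1 - 2 - 3 and its complement
g_edges = [(0, 1), (1, 2), (2, 3)]
co_edges = complement(4, g_edges)
print(co_edges)
print(is_independent_set([0, 2], g_edges), is_clique([0, 2], co_edges))
```

Any maximum clique routine can therefore be reused for the maximum independent set problem by running it on the complement, at the cost of building a graph that may be much denser than the original.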

Locating cliques (quasi-cliques) and independent sets in a graph representing

a dataset provides important information about this dataset. Intuitively, edges

in such a graph would connect vertices corresponding to "similar" elements of

the dataset. Therefore, cliques (or quasi-cliques) would naturally represent dense

clusters of similar objects. On the contrary, independent sets can be treated as

groups of objects that differ from every other object in the group. This information

is also important in some applications. Clearly, it is useful to find a maximum

clique or independent set in the graph, since it would give the maximum possible

size of the groups of "similar" or "different" objects.

The maximum clique problem (as well as the maximum independent set prob-

lem) is known to be NP-hard [59]. Moreover, it turns out that these problems are

difficult to approximate [18, 62]. This makes these problems especially challenging

in large graphs.

1.1.3 Clustering via Clique Partitioning

The problem of locating cliques and independent sets in a graph can be

naturally extended to finding an optimal partition of a graph into a minimum

number of distinct cliques or independent sets. These problems are referred to as

minimum clique partition and graph coloring, respectively. Pardalos et al. [102] give

various mathematical programming formulations of these problems. Clearly, as in









the case of maximum clique and maximum independent set problems, minimum

clique partition and graph coloring are reduced to each other by considering the

complementary graph, and both of these problems are NP-hard [59]. Solving these

problems for graphs representing real-life datasets is important from a data mining

perspective, especially for solving the clustering problem.

The essence of clustering is partitioning the elements in a certain dataset into

several distinct subsets (clusters) grouped according to an appropriate similarity

criterion [34]. Identifying the groups of objects that are "similar" to each other

but "different" from other objects in a given dataset is important in many practical

applications. The clustering problem is challenging because the number of clusters

and the similarity criterion are usually not known a priori.

If a dataset is represented as a graph, where each data element corresponds to

a vertex, the clustering problem essentially deals with decomposing this graph into

a set of subgraphs (subsets of vertices), so that each of these subgraphs corresponds

to a specific cluster.

Since the data elements assigned to the same cluster should be "similar" to

each other, the goal of clustering can be achieved by finding a clique partition

of the graph, and the number of clusters will equal the number of cliques in the

partition.

Similar arguments hold for the case of the graph coloring problem which

should be solved when a dataset needs to be decomposed into the clusters of

"different" objects (i.e., each object in a cluster is different from all other objects in

the same cluster), that can be represented as independent sets in the corresponding

graph. The number of independent sets in the optimal partition is referred to as

the chromatic number of the graph.
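Since both problems are NP-hard, practical clustering usually relies on heuristics. The sketch below shows one simple possibility, offered as an assumption rather than as the method used in this dissertation: color the complementary graph greedily (largest degree first) and read each color class as a clique of the original graph. The result is a valid clique partition, but not necessarily a minimum one.

```python
def greedy_coloring(n, edges):
    """Assign each vertex the smallest color not used by its colored neighbours."""
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    color = {}
    for v in sorted(range(n), key=lambda u: -len(adj[u])):
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def greedy_clique_partition(n, edges):
    """Partition the vertices into cliques by coloring the complementary graph."""
    present = {frozenset(e) for e in edges}
    co_edges = [(i, j) for i in range(n) for j in range(i + 1, n)
                if frozenset((i, j)) not in present]
    color = greedy_coloring(n, co_edges)
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, []).append(v)
    return [sorted(members) for members in classes.values()]

# Two triangles joined by the edge (2, 3)
tri_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
parts = greedy_clique_partition(6, tri_edges)
print(parts)
```

The number of color classes found this way is an upper bound on the minimum number of cliques needed; applying `greedy_coloring` to the graph itself gives the analogous upper bound on its chromatic number.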

Instead of cliques and independent sets one can consider quasi-cliques, and

quasi-independent sets and partition the graph on this basis. As mentioned,








quasi-cliques are subgraphs that are dense enough (i.e., they have a high edge

density). Therefore, it is often reasonable to relate clusters to quasi-cliques, since

they represent sufficiently dense clusters of similar objects. Obviously, in the case

of partitioning a dataset into clusters of "different" objects, one can use

quasi-independent sets (i.e., subgraphs that are sparse enough) to define these clusters.















CHAPTER 2
REVIEW OF NETWORK-BASED MODELING AND OPTIMIZATION
TECHNIQUES IN MASSIVE DATA SETS

In this chapter, we review current developments in studying massive graphs

used as models of certain real-world datasets.1 Massive data sets arise in a broad

spectrum of scientific, engineering and commercial applications [3]. Some of the

wide range of problems associated with massive data sets are data warehousing,

compression and visualization, information retrieval, clustering and pattern

recognition, and nearest neighbor search. Handling these problems requires

special interdisciplinary efforts to develop novel sophisticated techniques. The

pervasiveness and complexity of the problems brought by massive data sets make it

one of the most challenging and exciting areas of research for years to come.

In many cases, a massive data set can be represented as a very large graph

with certain attributes associated with its vertices and edges. These attributes

may contain specific information characterizing the given application. Studying

the structure of this graph is important for understanding the structural properties

of the application it represents, as well as for improving storage organization and

information retrieval.

2.1 Modeling and Optimization in Massive Graphs

In this section we discuss recent advances in modeling and optimization for

massive graphs. As examples, Call, Internet, and Web graphs will be used.



1 This chapter is based on the joint publication with Butenko and Pardalos [25].









As before, by G = (V, E) we will denote a simple undirected graph with the set

of n vertices V and the set of edges E. A multi-graph is an undirected graph with

multiple edges.

The distance between two vertices is the number of edges in the shortest

path between them (it is equal to infinity for vertices representing different

connected components). The diameter of a graph G is usually defined as the

maximal distance between pairs of vertices of G. In a disconnected graph, the usual

definition of the diameter would result in the infinite diameter, so the following

definition is in order. By the diameter of a disconnected graph we will mean the

maximum finite shortest path length in the graph (the same as the largest of the

diameters of the graph's connected components).
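The convention for disconnected graphs can be made concrete with a breadth-first search from every vertex, keeping only the finite distances. This is a minimal in-memory sketch with an illustrative function name; its running time is proportional to the number of vertices times the number of edges, so it is not meant for the massive graphs discussed in this chapter.

```python
from collections import deque

def diameter(n, edges):
    """Largest finite shortest-path length over all pairs of vertices."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    best = 0
    for source in range(n):
        dist = {source: 0}   # only vertices reachable from source get a distance
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best

# Components: the path 0 - 1 - 2 - 3, the edge 4 - 5, and the isolated vertex 6
example_edges = [(0, 1), (1, 2), (2, 3), (4, 5)]
print(diameter(7, example_edges))
```

Unreachable pairs never enter `dist`, so infinite distances are ignored automatically and the result equals the largest diameter among the connected components.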

2.1.1 Examples of Massive Graphs

2.1.1.1 Call Graph

Here we discuss an example of a massive graph representing telecommunications

traffic data presented by Abello, Pardalos and Resende [2]. In this call graph,

the vertices are telephone numbers, and two vertices are connected by an edge if a

call was made from one number to another.

Abello et al. [2] experimented with data from AT&T telephone billing records.

To give an idea of how large a call graph can be we mention that a graph based

on one 20-day period had 290 million vertices and 4 billion edges. The analyzed

one-day call graph had 53,767,087 vertices and over 170 million edges. This graph

appeared to have 3,667,448 connected components, most of them tiny; only 302,468

(or 8%) components had more than 3 vertices. A giant connected component

with 44,989,297 vertices was computed. It was observed that the existence of a

giant component resembles a behavior suggested by the random graphs theory of

Erdős and Rényi [47, 48], which will be mentioned below, but by the pattern of

connections the call graph obviously does not fit into this theory (Subsection 2.1.3).










The maximum clique problem and problem of finding large quasi-cliques with

prespecified density were considered in this giant component. These problems were

attacked using a greedy randomized adaptive search procedure (GRASP) [51, 52].

In short, GRASP is an iterative method that at each iteration constructs, using a

greedy function, a randomized solution and then finds a locally optimal solution

by searching the neighborhood of the constructed solution. This is a heuristic

approach which gives no guarantee about quality of the solutions found, but proved

to be practically efficient for many combinatorial optimization problems. To make

application of optimization algorithms in the considered large component possible,

the authors use some suitable graph decomposition techniques employing external

memory algorithms (see Subsection 2.1.2).
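The two phases of a GRASP iteration, greedy randomized construction followed by local search, can be sketched for the maximum clique problem as follows. This is a simplified in-memory illustration and not the implementation of Abello et al. [2]: the greedy function (degree within the candidate set), the restricted candidate list controlled by the hypothetical parameter `alpha`, and the remove-one-and-extend local search are all assumptions chosen for brevity.

```python
import random

def extend(members, adj, n):
    """Greedily add common neighbours until the clique is maximal."""
    clique = list(members)
    candidates = set(range(n))
    for v in clique:
        candidates &= adj[v]
    while candidates:
        # Greedy choice: the candidate with most neighbours among the candidates.
        v = max(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        clique.append(v)
        candidates &= adj[v]
    return clique

def grasp_max_clique(n, edges, iterations=50, alpha=0.5, seed=0):
    """Return the largest clique found over the given number of GRASP iterations."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    best = []
    for _ in range(iterations):
        # Construction phase: pick randomly among the best-scoring candidates.
        clique, candidates = [], set(range(n))
        while candidates:
            score = {v: len(adj[v] & candidates) for v in candidates}
            hi, lo = max(score.values()), min(score.values())
            rcl = sorted(v for v in candidates if score[v] >= hi - alpha * (hi - lo))
            v = rng.choice(rcl)
            clique.append(v)
            candidates &= adj[v]
        # Local search phase: drop one vertex and try to rebuild a larger clique.
        improved = True
        while improved:
            improved = False
            for u in clique:
                trial = extend([v for v in clique if v != u], adj, n)
                if len(trial) > len(clique):
                    clique, improved = trial, True
                    break
        if len(clique) > len(best):
            best = clique
    return sorted(best)

# A 4-clique {0, 1, 2, 3}, a triangle {4, 5, 6}, and the connecting edge (3, 4)
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
            (4, 5), (4, 6), (5, 6), (3, 4)]
found = grasp_max_clique(7, k4_edges)
print(found)
```

As with any GRASP variant, the returned clique carries no optimality guarantee; running more iterations only increases the chance that a good construction is sampled.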




Figure 2-1. Frequencies of clique sizes in the call graph found by Abello et al. [2]. [Plot omitted: frequency (logarithmic scale) versus clique size.]


Abello et al. [2] ran 100,000 GRASP iterations taking 10 parallel processors

about one and a half days to finish. Of the 100,000 cliques generated, 14,141

appeared to be distinct, although many of them had vertices in common. Abello

et al. suggested that the graph contains no clique of a size greater than 32.

Figure 2-1 shows the number of detected cliques of various sizes. Finally, large












quasi-cliques with density parameters $\gamma$ = 0.9, 0.8, 0.7, and 0.5 for the giant


connected component were computed. The sizes of the largest quasi-cliques found


were 44, 57, 65, and 98, respectively.


Figure 2-2. Pattern of connections in the call graph: number of vertices with various out-degrees (a) and in-degrees (b); number of connected components of various sizes (c) in the call graph [8]. [Plots omitted: log-log panels with horizontal axes Outdegree (a), Indegree (b), and Component size (c).]




Aiello et al. [8] used the same data as Abello et al. [2] to show that the


considered call graph fits their power-law random graph model (Section 2.1.3).


The plots in Figure 2-2 demonstrate some connectivity properties of the call graph.


Summarizing the results presented in this subsection, one can say that graph-based

techniques proved to be rather useful in analyzing and revealing the global

patterns of the telecommunications traffic dataset. In the next subsection, we

will consider another example of a similar type of dataset associated with the

World-Wide Web.

2.1.1.2 Internet and Web Graphs

The role of the Internet in the modern world is difficult to overestimate; its

invention changed the way people interact, learn, and communicate like nothing

before. Along with its increasing significance, the Internet itself continues to

grow at an overwhelming rate. Figure 2-3 shows the dynamics of growth of the

number of Internet hosts for the last 13 years. As of January 2002 this number

was estimated to be close to 150 million.2 The number of web pages indexed by

large search engines exceeds 2 billion, and the number of web sites is growing by

thousands daily.


[Figure 2-3 plot: number of Internet hosts over time, vertical axis from 0 to 160,000,000.]

Figure 2-3. Number of Internet hosts for the period 01/1991-01/2002. Data by
Internet Software Consortium.



2 According to Internet Software Consortium, http://www.isc.org/ds/host-count-history.html










[Figure 2-4 panels, log-log scale: number of vertices vs. out-degree (left); number of components vs. size of component (right).]

Figure 2-4. Pattern of connections in the Web graph: number of vertices with
various out-degrees (left) and distribution of sizes of strongly connected
components (right) in Web graph [37].


The highly dynamic and seemingly unpredictable structure of the World Wide

Web attracts more and more attention of scientists representing many diverse

disciplines, including graph theory. In a graph representation of the World Wide

Web, the vertices are documents and the edges are hyperlinks pointing from one

document to another. Similarly to the call graph, the Web is a directed multigraph,

although often it is treated as an undirected graph to simplify the analysis.

Another graph is associated with the physical network of the Internet, where the

vertices are routers navigating packets of data or groups of routers (domains). The

edges in this graph represent wires or cables in the physical network.

Graph theory has been applied for web search [36, 78], web mining [96, 97]

and other problems arising in the Internet and World Wide Web. In several

recent studies, there were attempts to understand some structural properties of

the Web graph by investigating large Web crawls. Adamic and Huberman [6, 65]

used crawls which covered almost 260,000 pages in their studies. Barabási and

Albert [20] analyzed a subgraph of the Web graph with approximately 325,000 nodes

representing nd.edu pages. In another experiment, Kumar et al. [82] examined a

data set containing about 40 million pages. In a recent study, Broder et al. [37]









used two Altavista crawls, each with about 200 million pages and 1.5 billion links,

thus significantly exceeding the scale of the preceding experiments. This work

yielded several remarkable observations about local and global properties of the

Web graph. All of the properties observed in one of the two crawls were validated

for the other as well. Below, by the Web graph we will mean one of the crawls,

which has 203,549,046 nodes and 2130 million arcs.

The first observation made by Broder et al. confirms a property of the Web

graph suggested in earlier works [20, 82] claiming that the distribution of degrees

follows a power law. Interestingly, the degree distribution of the Web graph

resembles the power-law relationship of the Internet graph topology, which was first

discovered by Faloutsos et al. [50]. Broder et al. [37] computed the in- and out-

degree distributions for both considered crawls and showed that these distributions

agree with power laws. Moreover, they observed that in the case of in-degrees

the constant $\gamma \approx 2.1$ is the same as the exponent of power laws discovered in

earlier studies [20, 82]. In another set of experiments conducted by Broder et al.,

directed and undirected connected components were investigated. It was noticed

that the distribution of sizes of these connected components also obeys a power

law. Figure 2-4 illustrates the experiments with distributions of out-degrees and

connected component sizes.

The last series of experiments discussed by Broder et al. [37] aimed to explore

the global connectivity structure of the Web. This led to the discovery of the so-

called Bow-Tie model of the Web [38]. Similarly to the call graph, the considered

Web graph appeared to have a giant connected component, containing 186,771,290

nodes, or over 90% of the total number of nodes. Taking into account the directed

nature of the edges, this connected component can be subdivided into four pieces:

strongly connected component (SCC), In and Out components, and "Tendrils".

Overall, the Web graph in the Bow-Tie model is divided into the following pieces:

















[Figure 2-5 diagram: SCC (56,463,993 nodes) in the center, In (43,343,168) and Out (43,166,185) on either side, Tendrils and tubes (43,797,944), and Disconnected components (16,777,756).]

Figure 2-5. Connectivity of the Web (Bow-Tie model) [37].

* Strongly connected component: the part of the giant connected component in
which all nodes are reachable from one another by a directed path.

* In component: nodes which can reach any node in the SCC but cannot be
reached from the SCC.

* Out component: contains the nodes that are reachable from the SCC, but
cannot access the SCC through directed links.

* Tendrils component: accumulates the remaining nodes of the giant connected
component, i.e., the nodes which are not connected with the SCC.

* Disconnected component: the part of the Web which is not connected with the
giant connected component.

Figure 2-5 shows the connectivity structure of the Web, as well as sizes of the

considered components. As one can see from the figure, the sizes of SCC, In, Out
and Tendrils components are roughly equal, and the Disconnected component is


significantly smaller.
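The partition into Bow-Tie pieces can be reproduced on a small directed graph with a few breadth-first searches. This is a sketch under stated assumptions, not the procedure of Broder et al.: the names `reach` and `bow_tie` are hypothetical, the giant component is taken to be the weakly connected component containing the largest SCC, and the quadratic SCC search is only suitable for toy inputs.

```python
from collections import deque

def reach(start, nbrs):
    """Set of nodes reachable from `start` following the adjacency map `nbrs`."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in nbrs.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(nodes, arcs):
    succ, pred, und = {}, {}, {}
    for u, v in arcs:
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)
        und.setdefault(u, set()).add(v)
        und.setdefault(v, set()).add(u)
    # Largest SCC: the SCC of s is (forward-reachable from s) & (backward-reachable from s).
    remaining, scc = set(nodes), set()
    while remaining:
        s = next(iter(remaining))
        comp = reach(s, succ) & reach(s, pred)
        remaining -= comp
        if len(comp) > len(scc):
            scc = comp
    seed = next(iter(scc))
    fwd, bwd, weak = reach(seed, succ), reach(seed, pred), reach(seed, und)
    return {
        "SCC": scc,
        "In": bwd - scc,              # can reach the SCC, not reachable from it
        "Out": fwd - scc,             # reachable from the SCC, cannot reach it
        "Tendrils": weak - fwd - bwd, # rest of the giant (weak) component
        "Disconnected": set(nodes) - weak,
    }

# Toy graph: cycle 1->2->3->1 is the SCC; 0 is In, 4 is Out, 5 is a tendril, 6->7 is disconnected.
parts = bow_tie(range(8), [(1, 2), (2, 3), (3, 1), (0, 1), (3, 4), (0, 5), (6, 7)])
```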











Broder et al. [37] have also computed the diameters of the SCC and of the

whole graph. It was shown that the diameter of the SCC is at least 28, and the

diameter of the whole graph is at least 503. The average connected distance is

defined as the pairwise distance averaged over those directed pairs (i,j) of nodes

for which there exists a path from i to j. The average connected distance of the

whole graph was estimated as 16.12 for in-links, 16.18 for out-links, and 6.83

for undirected links. Interestingly, it was also found that for a randomly chosen

directed pair of nodes, the chance that there is a directed path between them is

only about 24%.
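The definition of the average connected distance can be made concrete with an exhaustive breadth-first search from every node. This is a minimal sketch for small graphs only; the function name `connected_distance_stats` is hypothetical, and for a graph of Web scale one would presumably have to sample start nodes rather than enumerate them all.

```python
from collections import deque

def connected_distance_stats(nodes, arcs):
    """Return (average connected distance, fraction of ordered pairs (i, j),
    i != j, for which a directed path from i to j exists)."""
    succ = {u: [] for u in nodes}
    for u, v in arcs:
        succ[u].append(v)
    total_distance, connected_pairs = 0, 0
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in succ[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total_distance += sum(dist.values())
        connected_pairs += len(dist) - 1  # exclude the pair (s, s)
    n = len(nodes)
    average = total_distance / connected_pairs if connected_pairs else float("nan")
    return average, connected_pairs / (n * (n - 1))

# Directed path 0 -> 1 -> 2: three connected pairs out of six, distances 1, 2, 1.
avg, frac = connected_distance_stats([0, 1, 2], [(0, 1), (1, 2)])
```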

2.1.2 External Memory Algorithms

In many cases, the data associated with massive graphs is too large to fit

entirely inside the computer's fast internal memory, so a slower external

memory (for example, disks) needs to be used. The input/output communication

(I/O) between these memories can result in an algorithm's slow performance.

External memory (EM) algorithms and data structures are designed with the aim

of reducing the I/O cost by exploiting locality. Recently, external memory

algorithms have been successfully applied for solving batched problems involving

graphs, including connected components, topological sorting, and shortest paths.

The first EM graph algorithm was developed by Ullman and Yannakakis [112]

in 1991 and dealt with the problem of transitive closure. Many other researchers

have contributed to the progress in this area ever since [1, 15, 16, 39, 42, 83, 115]. Chiang

et al. [42] proposed several new techniques for design and analysis of efficient

EM graph algorithms and discussed applications of these techniques to specific

problems, including minimum spanning tree verification, connected and biconnected

components, graph drawing, and visibility representation. Abello et al. [1] proposed

a functional approach for EM graph algorithms and used their methodology to









develop deterministic and randomized algorithms for computing connected

components, maximal independent sets, maximal matching, and other structures in

the graph. In this approach each algorithm is defined as a sequence of functions,

and the computation continues in a series of scan operations over the data. If the

produced output data, once written, cannot be changed, then the function is said

to have no side effects. The lack of side effects enables the application of standard

checkpointing techniques, thus increasing the reliability. Abello et al. presented a

semi-external model for graph problems, which assumes that only the vertices fit

in the computer's internal memory. This is quite common in practice, and in fact

this was the case for the call graph described in Subsection 2.1.1, for which efficient

EM algorithms developed by Abello et al. [1] were used in order to compute its

connected components [2].
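The semi-external setting (vertex data in memory, edges streamed from disk in a single scan) can be sketched with a union-find structure. Note that union-find is a technique swapped in here for illustration; it is not the functional algorithm of Abello et al. [1]. The function name and the "u v" line format of the edge stream are assumptions.

```python
def connected_components_semi_external(num_vertices, edge_stream):
    """One sequential scan over the edges; only O(num_vertices) state is kept
    in memory. `edge_stream` is any iterable of text lines "u v" (for example
    an open file), with vertices numbered 0 .. num_vertices - 1.
    Returns a list giving a component label for every vertex."""
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for line in edge_stream:
        if not line.strip():
            continue  # skip blank lines
        u, v = map(int, line.split())
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
    return [find(x) for x in range(num_vertices)]

# The stream could equally be a file object: open("edges.txt") (hypothetical path).
labels = connected_components_semi_external(5, ["0 1", "1 2", "3 4"])
```

Because the edges are read once and never modified, the scan has no side effects on the input, which is in the spirit of the checkpointing argument above.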

For more detail on external memory algorithms see the book [4] and the

extensive review by Vitter [115] of EM algorithms and data structures.

2.1.3 Modeling Massive Graphs

The size of real-life massive graphs, many of which cannot be held even

by a computer with several gigabytes of main memory, undermines the power of

classical algorithms and makes one look for novel approaches. External memory

algorithms and data structures discussed in the previous subsection represent one

of the research directions aiming to overcome difficulties created by data sizes.

But in some cases not only is the amount of data huge, but the data itself is not

completely available. For instance, one can hardly expect to collect complete

information about the Web graph; in fact, the largest search engines are estimated

to cover only 38% of the Web [84].

Therefore, to investigate real-life massive graphs, one needs to use the available

information in order to construct proper theoretical models of these graphs. One

of the earliest attempts to model real networks theoretically goes back to the late









1950s, when the foundations of random graph theory were developed. In this

subsection we will present some of the results produced by this and other (more

realistic) graph models.

2.1.3.1 Uniform Random Graphs

The classical theory of random graphs founded by Erdős and Rényi [47, 48]

deals with several standard models of the so-called uniform random graphs. Two

of such models are $\mathcal{G}(n, m)$ and $\mathcal{G}(n, p)$ [30]. The first model assigns the same

probability to all graphs with n vertices and m edges, while in the second model

each pair of vertices is chosen to be linked by an edge randomly and independently

with probability p.

In most cases for each natural n a probability space consisting of graphs

with exactly n vertices is considered, and the properties of this space as $n \to \infty$

are studied. It is said that a typical element of the space or almost every (a.e.)

graph has property Q when the probability that a random graph on n vertices has

this property tends to 1 as $n \to \infty$. We will also say that the property Q holds

asymptotically almost surely (a.a.s.). Erdős and Rényi discovered that in many

cases either almost every graph has property Q or almost every graph does not

have this property.

Many properties of uniform random graphs have been well studied [29, 30, 73,

80]. Below we will summarize some known results in this field.

Probably the simplest property to be considered in any graph is its connectivity.

It was shown that for a uniform random graph $G(n,p) \in \mathcal{G}(n,p)$ there is a

"threshold" value of p that determines whether a graph is almost surely connected

or not. More specifically, a graph G(n,p) is a.a.s. disconnected if $p < \frac{\log n}{n}$. Furthermore,

it turns out that if p is in the range $\frac{1}{n} < p < \frac{\log n}{n}$, the graph G(n,p)

a.a.s. has a unique giant connected component [30]. The emergence of a giant









connected component in a random graph is very often referred to as the "phase

transition".
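The emergence of the giant component can be observed empirically by sampling G(n,p) for different values of np and measuring the relative size of the largest connected component. The sketch below is an illustration only; the function name is hypothetical, and the quadratic loop over vertex pairs limits it to small n.

```python
import random

def largest_component_fraction(n, p, seed=0):
    """Sample a uniform random graph G(n, p) and return the size of its
    largest connected component divided by n."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Every pair of vertices is linked independently with probability p.
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                parent[find(u)] = find(v)

    sizes = {}
    for x in range(n):
        root = find(x)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n

# np = 0.3 lies below the threshold np = 1; np = 30 is far above log n for n = 300.
sparse = largest_component_fraction(300, 0.001)
dense = largest_component_fraction(300, 0.1)
```

Sweeping np from below 1 to above log n with this function should show the fraction moving from near zero toward one, which is the behavior described in the text.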

The next subject of our discussion is the diameter of a uniform random

graph G(n,p). Recall that the diameter of a disconnected graph is defined as the

maximum diameter of its connected components. When dealing with random

graphs, one usually speaks not about a certain diameter, but rather about the

distribution of the possible values of the diameter. Intuitively, one can say that this

distribution depends on the interrelationship of the parameters of the model n and

p. However, this dependency turns out to be rather complicated. It was discussed

in many papers, and the corresponding results are summarized below.

It was proved by Klee and Larman [77] that a random graph asymptotically

almost surely has the diameter d, where d is a certain integer value, if the following

conditions are satisfied:

\[ \frac{(np)^{d-1}}{n} \to 0 \quad \text{and} \quad \frac{(np)^{d}}{n} \to \infty, \qquad n \to \infty. \]

Bollobás [30] proved that if $np - \log n \to \infty$ then the diameter of a random

graph is a.a.s. concentrated on no more than four values.

Łuczak [87] considered the case np < 1, when a uniform random graph a.a.s.

is disconnected and has no giant connected component. Let $\mathrm{diam}_T(G)$ denote the

maximum diameter of all connected components of G(n,p) which are trees. Then if

$(1 - np)n^{1/3} \to \infty$ the diameter of G(n,p) is a.a.s. equal to $\mathrm{diam}_T(G)$.

Chung and Lu [43] investigated another extreme case: $np \to \infty$. They showed

that in this case the diameter of a random graph G(n,p) is a.a.s. equal to

\[ (1 + o(1)) \frac{\log n}{\log(np)}. \]
Moreover, they considered the case when np > c > 1 for some constant c and got a

generalization of the above result:










\[ (1 + o(1)) \frac{\log n}{\log(np)} \le \mathrm{diam}(G(n,p)) \le \frac{\log n}{\log(np)} + 2\,\frac{\frac{10c}{(\sqrt{c}-1)^2} + 1}{c - \log(2c)} \cdot \frac{\log n}{np} + 1. \]

Also, they explored the distribution of the diameter of a random graph with respect
to different ranges of the ratio np/log n. They obtained the following results:

* For $np/\log n = c > 8$ the diameter of G(n,p) is a.a.s. concentrated on at most
two values at $\log n/\log(np)$.

* For $8 \ge np/\log n = c > 2$ the diameter of G(n,p) is a.a.s. concentrated on at
most three values at $\log n/\log(np)$.

* For $2 \ge np/\log n = c > 1$ the diameter of G(n,p) is a.a.s. concentrated on at
most four values at $\log n/\log(np)$.

* For $1 \ge np/\log n = c > c_0$ the diameter of G(n,p) is a.a.s. concentrated
on a finite number of values, and this number is at most $2\lfloor 1/c_0 \rfloor + 4$. More
specifically, in this case the following formula can be proved:

\[ \left\lceil \frac{\log(c_0 n/11)}{\log(np)} \right\rceil \le \mathrm{diam}(G(n,p)) \le \left\lceil \frac{\log\left(\frac{33 c_0^2}{400}\, n \log n\right)}{\log(np)} \right\rceil + 2\left\lfloor \frac{1}{c_0} \right\rfloor + 2. \]

As pointed out above, a graph G(n,p) a.a.s. has a giant connected component
for 1 < np < log n. It is natural to assume that in this case the diameter of

G(n,p) is equal to the diameter of this giant connected component. However, it
was rigorously proved by Chung and Lu [43] that this is a.a.s. true only if np > 3.5128.

2.1.3.2 Potential Drawbacks of the Uniform Random Graph Model

There were some attempts to model the real-life massive graphs by the
uniform random graphs and to compare their behavior. However, the results of

these experiments demonstrated a significant discrepancy between the properties of
real graphs and corresponding uniform random graphs.
The further discussion analyzes the potential drawbacks of applying the
uniform random graph model to the real-life massive graphs.









Though the uniform random graphs demonstrate some properties similar to

the real-life massive graphs, many problems arise when one tries to describe the

real graphs using the uniform random graph model. As it was mentioned above,

a giant connected component a.a.s. emerges in a uniform random graph at a

certain threshold. It looks very similar to the properties of the real massive graphs

discussed in Subsection 2.1.3. However, after deeper insight, it can be seen that

the giant connected components in the uniform random graphs and the real-life

massive graphs have different structures. The fundamental difference between

them is as follows: it was noticed that in almost all the real massive graphs the

property of so-called clustering takes place [116, 117]. It means that the probability

of the event that two given vertices are connected by an edge is higher if these

vertices have a common neighbor (i.e., a vertex which is connected by an edge

with both of these vertices). The probability that two neighbors of a given vertex

are connected by an edge is called the clustering coefficient. It can be easily seen

that in the case of the uniform random graphs, the clustering coefficient is equal

to the parameter p, since the probability that each pair of vertices is connected

by an edge is independent of all other vertices. In real-life massive graphs, the

value of the clustering coefficient turns out to be much higher than the value of

the parameter p of the uniform random graphs with the same number of vertices

and edges. Adamic [5] found that the value of the clustering coefficient for some

part of the Web graph was approximately 0.1078, while the clustering coefficient for

the corresponding uniform random graph was 0.00023. Pastor-Satorras et al. [103]

got similar results for a part of the Internet graph. The values of the clustering

coefficients for the real graph and the corresponding uniform random graph were

0.24 and 0.0006 respectively.
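The comparison made above can be reproduced for any small undirected graph: compute the probability that two neighbors of a vertex are adjacent, and compare it with the parameter p of the uniform random graph having the same number of vertices and edges. The function names below are hypothetical, and the pooled (global) form of the coefficient is one possible reading of the definition in the text.

```python
from itertools import combinations

def clustering_coefficient(adj):
    """Fraction of pairs of neighbours of a common vertex that are themselves
    joined by an edge. adj: dict vertex -> set of neighbours (undirected)."""
    closed = total = 0
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            total += 1
            if b in adj[a]:
                closed += 1
    return closed / total if total else 0.0

def equivalent_p(adj):
    """Edge probability p of a uniform random graph with the same number of
    vertices and (in expectation) the same number of edges."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return m / (n * (n - 1) / 2)

# A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c = clustering_coefficient(g)  # 3 closed pairs out of 5
p = equivalent_p(g)            # 4 edges out of 6 possible
```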

Another significant problem arising in modeling massive graphs using the

uniform random graph model is the difference in degree distributions. It can









be shown that as the number of vertices in a uniform random graph increases,

the distribution of the degrees of the vertices tends to the well-known Poisson

distribution with the parameter np which represents the average degree of a vertex.

However, as it was pointed out in Subsection 2.1.3, the experiments show that

in the real massive graphs degree distributions obey a power law. These facts

demonstrate that some other models are needed to better describe the properties

of real massive graphs. Next, we discuss two such models, namely, the random

graph model with a given degree sequence and its most important special case, the

power-law model.

2.1.3.3 Random Graphs with a Given Degree Sequence

Besides the uniform random graphs, there are more general ways of modeling

massive graphs. These models deal with random graphs with a given degree

sequence. The main idea of how to construct these graphs is as follows. For all the

vertices $i = 1, \ldots, n$ the set of the degrees $\{k_i\}$ is specified. This set is chosen so that

the fraction of vertices that have degree k tends to the desired degree distribution

$p_k$ as n increases.

It turns out that some properties of the uniform random graphs can be

generalized for the model of a random graph with a given degree sequence.

Recall the notion of the so-called "phase transition" (i.e., the phenomenon when

at a certain point a giant connected component emerges in a random graph) which

happens in the uniform random graphs. It turns out that a similar thing takes

place in the case of a random graph with a given degree sequence. This result was

obtained by Molloy and Reed [98]. The essence of their findings is as follows.

Consider a sequence of non-negative real numbers $p_0, p_1, \ldots$ such that

$\sum_k p_k = 1$. Assume that a graph G with n vertices has approximately $p_k n$ vertices

of degree k. If we define $Q = \sum_{k \ge 1} k(k-2)p_k$ then it can be proved that G a.a.s.









has a giant connected component if Q > 0 and there is a.a.s. no giant connected

component if Q < 0.
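The criterion is easy to evaluate numerically. As a sanity check, for a Poisson degree distribution with mean c (the limiting degree distribution of a uniform random graph with np = c) one gets Q = E[k^2] - 2E[k] = c(c - 1), so Q changes sign exactly at c = 1, in agreement with the classical threshold. The function names and the truncation point `kmax` below are assumptions made for this sketch.

```python
import math

def molloy_reed_q(p):
    """Q = sum over k of k(k - 2) p_k for a degree distribution given as a
    dict mapping degree k to the fraction p_k of vertices with that degree."""
    return sum(k * (k - 2) * pk for k, pk in p.items())

def poisson_degree_distribution(c, kmax=100):
    """Poisson distribution with mean c, truncated at degree kmax."""
    return {k: math.exp(-c) * c ** k / math.factorial(k) for k in range(kmax + 1)}

q_above = molloy_reed_q(poisson_degree_distribution(2.0))  # c(c - 1) = 2
q_below = molloy_reed_q(poisson_degree_distribution(0.5))  # c(c - 1) = -0.25
```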

As a development of the analysis of random graphs with a given degree

sequence, the work of Cooper and Frieze [45] should be mentioned. They considered

a sparse directed random graph with a given degree sequence and analyzed its

strong connectivity. In the study, the size of the giant strongly connected

component, as well as the conditions of its existence, were discussed.

The results obtained for the model of random graphs with a given degree

sequence are especially useful because they can be applied to some important

special cases of this model. For instance, the classical results on the size of a

connected component in uniform random graphs follow from the aforementioned

fact presented by Molloy and Reed. Next, we present another example of applying

this general result to one of the most practically used random graph models, the

power-law model.

2.1.3.4 Power-Law Random Graphs

One of the most important special cases of the model of random graphs with

a given degree sequence is the power-law random graph model, which represents

the class of random graphs with a power-law degree sequence. This model

theoretically describes the properties of power-law graphs that were mentioned

above. Some important results for this model were obtained by Aiello, Chung and

Lu [8, 9].

The power-law random graph model (also referred to as $P(\alpha, \beta)$) assigns two

parameters characterizing a power-law random graph. If we define y to be the

number of nodes with degree x, then according to this model

\[ y = e^{\alpha}/x^{\beta}. \tag{2-1} \]

Equivalently, we can write

\[ \log y = \alpha - \beta \log x. \tag{2-2} \]









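Similarly to formulas in Chapter 1, the relationship between y and x can be plotted
as a straight line on a log-log scale, so that $-\beta$ is the slope, and $\alpha$ is the intercept.
The following properties of a graph described by the power-law random graph
model [8] are valid:

* The maximum degree of the graph is $e^{\alpha/\beta}$.

* The number of vertices is

\[ n = \sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta}} \approx \begin{cases} \zeta(\beta)e^{\alpha}, & \beta > 1, \\ \alpha e^{\alpha}, & \beta = 1, \\ e^{\alpha/\beta}/(1-\beta), & 0 < \beta < 1, \end{cases} \tag{2-3} \]

where $\zeta(t) = \sum_{n=1}^{\infty} \frac{1}{n^{t}}$ is the Riemann zeta function.

* The number of edges is

\[ E = \frac{1}{2}\sum_{x=1}^{e^{\alpha/\beta}} x\,\frac{e^{\alpha}}{x^{\beta}} \approx \begin{cases} \frac{1}{2}\zeta(\beta-1)e^{\alpha}, & \beta > 2, \\ \frac{1}{4}\alpha e^{\alpha}, & \beta = 2, \\ \frac{1}{2}e^{2\alpha/\beta}/(2-\beta), & 0 < \beta < 2. \end{cases} \tag{2-4} \]
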
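Since the power-law random graph model is a special case of the model of a
random graph with a given degree sequence, the results discussed above can be
applied to the power-law graphs. We need to find the threshold value of $\beta$ at which
the "phase transition" (i.e., the emergence of a giant connected component) occurs.
In this case $Q = \sum_{x \ge 1} x(x-2)p_x$ is defined as

\[ Q = \sum_{x=1}^{e^{\alpha/\beta}} x(x-2)\left\lfloor \frac{e^{\alpha}}{x^{\beta}} \right\rfloor \approx \sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta-2}} - 2\sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta-1}} \approx [\zeta(\beta-2) - 2\zeta(\beta-1)]e^{\alpha} \quad \text{for } \beta > 3. \]

Hence, the threshold value $\beta_0$ can be found from the equation

\[ \zeta(\beta-2) - 2\zeta(\beta-1) = 0, \]

which yields $\beta_0 \approx 3.47875$.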
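The threshold can be checked numerically by locating the sign change of the difference between the zeta function evaluated at the exponent minus 2 and twice the zeta function evaluated at the exponent minus 1. The sketch below uses only the standard library: the helper `zeta` (a partial sum with an Euler-Maclaurin tail correction) and the bisection bracket [3.1, 4.0] are assumptions of this example, not part of the cited work.

```python
def zeta(s, terms=1000):
    """Riemann zeta function for real s > 1: partial sum over k < terms plus
    the leading Euler-Maclaurin terms for the remaining tail."""
    n = terms
    partial = sum(k ** -s for k in range(1, n))
    tail = n ** (1 - s) / (s - 1) + 0.5 * n ** -s + s * n ** (-s - 1) / 12
    return partial + tail

def threshold_beta(lo=3.1, hi=4.0, tol=1e-10):
    """Bisection for the root of zeta(b - 2) - 2 * zeta(b - 1) on [lo, hi].
    The expression is positive at lo and negative at hi."""
    def g(b):
        return zeta(b - 2) - 2 * zeta(b - 1)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

beta_0 = threshold_beta()
```

The bracket starts above 3 because the zeta function is only evaluated for arguments greater than 1 in this sketch.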









The results on the size of the connected component of a power-law graph were

presented by Aiello et al. [8]. These results are summarized below.

* If $0 < \beta < 1$, then a power-law graph is a.a.s. connected (i.e., there is only one
connected component of size n).

* If $1 < \beta < 2$, then a power-law graph a.a.s. has a giant connected component
(the component size is $O(n)$), and the second largest connected component
a.a.s. has a size $O(1)$.

* If $2 < \beta < \beta_0 = 3.47875$, then a giant connected component a.a.s. exists, and
the size of the second largest component a.a.s. is $O(\log n)$.

* $\beta = 2$ is a special case when there is a.a.s. a giant connected component, and
the size of the second largest connected component is $O(\log n/\log\log n)$.

* If $\beta > \beta_0 = 3.47875$, then there is a.a.s. no giant connected component.

The power-law random graph model was developed for describing real-life

massive graphs. So the natural question is how well it reflects the properties of

these graphs.

Though this model certainly does not reflect all the properties of real massive

graphs, it turns out that the massive graphs such as the call graph or the Internet

graph can be fairly well described by the power-law model. The following example

demonstrates it.

Aiello, Chung and Lu [8] investigated the same call graph that was analyzed

by Abello et al. [2]. This massive graph was already discussed in Subsection 2.1.3,

so it is interesting to compare the experimental results presented by Abello et

al. [2] with the theoretical results obtained in [8] using the power-law random graph

model.

Figure 2-2 shows the number of vertices in the call graph with certain in-

degrees and out-degrees. Recall that according to the power-law model the

dependency between the number of vertices and the corresponding degrees can

be plotted as a straight line on a log-log scale, so one can approximate the real









data shown in Figure 2-2 by a straight line and evaluate the parameters $\alpha$ and $\beta$

using the values of the intercept and the slope of the line. The value of $\beta$ for the

in-degree data was estimated to be approximately 2.1, and the value of $e^{\alpha}$ was

approximately $30 \times 10^6$. The total number of nodes can be estimated using formula

(2-3) as $\zeta(2.1) \times e^{\alpha} = 1.56 \times e^{\alpha} \approx 47 \times 10^6$ (compare with Subsection 2.1.3).
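The estimation step can be sketched as a straight-line fit on the log-log scale. Ordinary least squares is used here as one possible way to fit the line; the text does not say which fitting method was actually applied, and the function name `fit_power_law` and the synthetic data are assumptions of this example.

```python
import math

def fit_power_law(degree_counts):
    """Least-squares line through the points (log x, log y), where
    degree_counts maps a degree x to the number y of vertices of that degree.
    Returns (alpha, beta) such that log y is approximately alpha - beta * log x."""
    xs = [math.log(x) for x in degree_counts]
    ys = [math.log(y) for y in degree_counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
             / sum((a - mean_x) ** 2 for a in xs))
    intercept = mean_y - slope * mean_x
    return intercept, -slope

# Synthetic data that follows y = e^alpha / x^beta exactly, with alpha = 10, beta = 2.1.
counts = {x: math.exp(10.0) / x ** 2.1 for x in range(1, 51)}
alpha, beta = fit_power_law(counts)

# Node count from formula (2-3), using the values quoted in the text:
# zeta(2.1) is about 1.56 and e^alpha is about 30 million.
estimated_nodes = 1.56 * 30e6  # about 47 million
```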

According to the results for the size of the largest connected component

presented above, a power-law graph with $1 < \beta < 3.47875$ a.a.s. has a giant

connected component. Since $\beta \approx 2.1$ falls in this range, this result exactly coincides

with the real observations for the call graph (see Subsection 2.1.3).

Another aspect that is worth mentioning is how to generate power-law graphs.

The methodology for doing it was discussed in detail in the literature [9, 44]. These

papers use a similar approach, which is referred to as a random graph evolution

process. The main idea is to construct a power-law massive graph "step-by-step":

at each time step, a node and an edge are added to a graph in accordance with

certain rules in order to obtain a graph with a specified in-degree and out-degree

power-law distribution. The in-degree and out-degree parameters of the resulting

power-law graph are functions of the input parameters of the model. A simple

evolution model was presented by Kumar et al. [81]. Aiello, Chung and Lu [9]

developed four more advanced models for generating both directed and undirected

power-law graphs with different distributions of in-degrees and out-degrees. As

an example, we will briefly describe one of their models. It was the basic model

developed in the paper, and the other three models actually were improvements

and generalizations of this model.

The main idea of the considered model is as follows. At the first time moment

a vertex is added to the graph, and it is assigned two parameters, the in-weight

and the out-weight, both equal to 1. Then at each time step t + 1 a new vertex

with in-weight 1 and out-weight 1 is added to the graph with probability $1 - \alpha$,









and a new directed edge is added to the graph with probability $\alpha$. The origin and

destination vertices are chosen according to the current values of the in-weights

and out-weights. More specifically, a vertex u is chosen as the origin of this edge

with the probability proportional to its current out-weight, which is defined as

$w^{out}_{u,t} = 1 + \delta^{out}_{u,t}$, where $\delta^{out}_{u,t}$ is the out-degree of the vertex u at time t. Similarly,

a vertex v is chosen as the destination with the probability proportional to its

current in-weight $w^{in}_{v,t} = 1 + \delta^{in}_{v,t}$, where $\delta^{in}_{v,t}$ is the in-degree of v at time t. From

the above description it can be seen that at time t the total in-weight and the total

out-weight are both equal to t. So for each particular pair of vertices u and v, the

probability that an edge going from u to v is added to the graph at time t is equal

to

\[ \alpha\,\frac{(1 + \delta^{out}_{u,t})(1 + \delta^{in}_{v,t})}{t^2}. \]

In the above notations, the parameter $\alpha$ is the input parameter of the model.

The output of this model is a power-law random graph with the parameter of

the degree distribution being a function of the input parameter. In the case of

the considered model, it was shown that it generates a power-law graph with the

distribution of in-degrees and out-degrees having the parameter $1 + 1/\alpha$.
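The evolution process described above is straightforward to simulate. The sketch below follows the description in the text rather than any published code: the function name `evolve` and the urn representation (a vertex appears in an urn once per unit of weight, so a uniform draw is a weight-proportional choice) are assumptions, and self-loops are not excluded.

```python
import random

def evolve(alpha, steps, seed=0):
    """Simulate the basic evolution model for `steps` time steps after the
    initial vertex. Returns (out_degrees, in_degrees, arcs)."""
    rng = random.Random(seed)
    out_deg, in_deg = [0], [0]   # the first vertex, with in- and out-weight 1
    out_urn, in_urn = [0], [0]   # vertex u occurs (1 + degree) times in its urn
    arcs = []
    for _ in range(steps):
        if rng.random() < alpha:
            # Add a directed edge; endpoints are chosen proportionally to weights.
            u = rng.choice(out_urn)
            v = rng.choice(in_urn)
            arcs.append((u, v))
            out_deg[u] += 1
            in_deg[v] += 1
            out_urn.append(u)
            in_urn.append(v)
        else:
            # Add a new vertex with in-weight 1 and out-weight 1.
            w = len(out_deg)
            out_deg.append(0)
            in_deg.append(0)
            out_urn.append(w)
            in_urn.append(w)
    return out_deg, in_deg, arcs

out_deg, in_deg, arcs = evolve(alpha=0.7, steps=1000)
```

Every step adds exactly one unit to the total out-weight and one unit to the total in-weight, which is the invariant used in the text to derive the edge probability.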

The notion of the so-called scale invariance [20, 21] must also be mentioned.

This concept arises from the following considerations. The evolution of massive

graphs can be treated as the process of growing the graph at a time unit. Now,

if we replace all the nodes that were added to the graph at the same unit of time

by only one node, then we will get another graph of a smaller size. The bigger the

time unit is, the smaller the new graph size will be. The evolution model is called

scale-free (scale-invariant) if with high probability the new (scaled) graph has the

same power-law distribution of in-degrees and out-degrees as the original graph, for

any choice of the time unit length. It turns out that most of the random evolution









models have this property. For instance, the models of Aiello et al. [9] were proved

to be scale-invariant.

2.1.4 Optimization in Random Massive Graphs

Recent random graph models of real-life massive networks, some of which

were mentioned in Subsection 2.1.3, increased interest in various properties of

random graphs and methods used to discover these properties. Indeed, numerical

characteristics of graphs, such as clique and chromatic numbers, could be used as

one of the steps in validation of the proposed models. In this regard, the expected

clique number of power-law random graphs is of special interest due to the results

by Abello et al. [2] and Aiello et al. [9] mentioned in Subsections 2.1.1 and 2.1.3.

If computed, it could be used as one of the points in verifying the validity of the

model for the call graph proposed by Aiello et al. [9].

In this subsection we present some well-known facts regarding the clique and

chromatic numbers in uniform random graphs.

2.1.4.1 Clique Number

The earliest results describing the properties of cliques in uniform random

graphs are due to Matula [93], who noticed that for a fixed p almost all graphs

$G \in \mathcal{G}(n,p)$ have about the same clique number, if n is sufficiently large. Bollobás

and Erdős [32] further developed these remarkable results by proving some more

specific facts about the clique number of a random graph. Let us discuss these

results in more detail by presenting not only the facts but also some reasoning

behind them. For more detail see books by Bollobás [29, 30] and Janson et al. [73].

Assume that 0 < p < 1 is fixed. Then instead of the sequence of spaces

$\{\mathcal{G}(n,p), n \ge 1\}$ one can work with the single probability space $\mathcal{G}(\mathbb{N},p)$ containing

graphs on $\mathbb{N}$ with the edges chosen independently with probability p. In this

way, $\mathcal{G}(n,p)$ becomes an image of $\mathcal{G}(\mathbb{N},p)$, and the term "almost every" is used

in its usual measure-theory sense. For a graph $G \in \mathcal{G}(\mathbb{N},p)$ we denote by $G_n$ the









subgraph of G induced by the first n vertices {1, 2, ..., n}. Then the sequence

$\omega(G_n)$ appears to be almost completely determined for a.e. $G \in \mathcal{G}(\mathbb{N},p)$.

For a natural l, let us denote by $k_l(G_n)$ the number of cliques spanning l

vertices of $G_n$. Then, obviously,

\[ \omega(G_n) = \max\{l : k_l(G_n) > 0\}. \]

When l is small, the random variable $k_l(G_n)$ has a large expectation and a rather

small variance. If l is increased, then for most values of n there exists some number

$l_0$ for which the expectation of $k_{l_0}(G_n)$ is fairly large (> 1) and that of $k_{l_0+1}(G_n)$ is much

smaller than 1. Therefore, if we find this value $l_0$ then $\omega(G_n) = l_0$ with a high

probability. The expectation of $k_l(G_n)$ can be calculated as

\[ E(k_l(G_n)) = \binom{n}{l} p^{\binom{l}{2}}. \]

Denoting $f(l) = E(k_l(G_n))$ and replacing $\binom{n}{l}$ by its Stirling approximation we

obtain

\[ f(l) \approx \frac{n^{n+1/2}}{\sqrt{2\pi}\,(n-l)^{n-l+1/2}\, l^{l+1/2}}\; p^{l(l-1)/2}. \]

Solving the equation $f(l) = 1$ we get the following approximation $l_0$ of the root:

\[ l_0 = 2\log_{1/p} n - 2\log_{1/p}\log_{1/p} n + 2\log_{1/p}(e/2) + 1 + o(1) = 2\log_{1/p} n + O(\log\log n). \tag{2-5} \]
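The approximation can be compared with the exact expectation for concrete values of n and p. In the sketch below the function names are hypothetical and the o(1) term is simply dropped; the last helper finds the largest clique size whose expected count is still at least 1.

```python
import math

def expected_cliques(n, l, p):
    """Exact expectation of the number of cliques on l vertices in G(n, p)."""
    return math.comb(n, l) * p ** (l * (l - 1) // 2)

def l0_estimate(n, p):
    """Closed-form approximation of the root of f(l) = 1, without the o(1) term."""
    base = 1 / p

    def log_b(x):
        return math.log(x) / math.log(base)

    return 2 * log_b(n) - 2 * log_b(log_b(n)) + 2 * log_b(math.e / 2) + 1

def largest_size_with_expectation_at_least_one(n, p):
    l = 1
    while expected_cliques(n, l + 1, p) >= 1:
        l += 1
    return l

estimate = l0_estimate(1000, 0.5)
exact = largest_size_with_expectation_at_least_one(1000, 0.5)
```

For n = 1000 and p = 0.5 the estimate is about 15.18, and the expected number of cliques drops below 1 between sizes 15 and 16, so the two computations agree after rounding down.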

Using this observation and the second moment method, Bollobás and

Erdős [32] proved that if p = p(n) satisfies $n^{-\varepsilon} < p < c$ for every $\varepsilon > 0$ and some

c < 1, then there exists a function $cl : \mathbb{N} \to \mathbb{N}$ such that a.a.s.

\[ cl(n) \le \omega(G_n) \le cl(n) + 1, \]

i.e., the clique number is asymptotically distributed on at most two values. The

sequence cl(n) appears to be close to $l_0(n)$ computed in (2-5). Namely, it can be









shown that for a.e. G E g(N,p) if n is large enough then


[lo(n) 2 log log nlog n < (G,) < Llo(n) + 2 log log n/ log n]

and

w(G,) 21og/ n + 21og/p logpn n 21og/p(e/2) 1 < .

Frieze [56] and Janson et al. [73] extended these results by showing that for c > 0

there exists a constant ce, such that for c < pp(n) < log-2 n a.a.s.


[21og /n 2log,/p log,/ n + 21og/ (e/2) + 1 c/p] < w(G,) <



L2 log/p n 2 log/p log/p n +21log,/p(e/2)+1+ c/p].

2.1.4.2 Chromatic Number

Grimmett and McDiarmid [61] were the first to study the problem of coloring

random graphs. Many other researchers contributed to solving this problem [12,

31]. We mention some facts that emerged from these studies.

Łuczak [85] improved the results about the concentration of $\chi(G(n,p))$
previously proved by Shamir and Spencer [110], proving that for every sequence
$p = p(n)$ such that $p \le n^{-6/7}$ there is a function $ch(n)$ such that a.a.s.

\[ ch(n) \le \chi(G(n,p)) \le ch(n) + 1. \]

Alon and Krivelevich [12] proved that for any positive constant $\delta$ the chromatic
number of a uniform random graph $G(n,p)$, where $p = n^{-1/2-\delta}$, is a.a.s.
concentrated in two consecutive values. Moreover, they proved that a proper choice of
$p(n)$ may result in a one-point distribution. The function $ch(n)$ is difficult to
find, but in some cases it can be characterized. For example, Janson et al. [73] proved
that there exists a constant $c_0$ such that for any $p = p(n)$ satisfying
$c_0 / n \le p \le \log^{-7} n$ a.a.s.

\[
\frac{np}{2\log np - 2\log\log np + 1} \le \chi(G(n,p)) \le \frac{np}{2\log np - 40\log\log np}.
\]

In the case when $p$ is constant, Bollobás' method utilizing martingales [30] yields
the following estimate:

\[ \chi(G(n,p)) = \frac{n}{2\log_b n - 2\log_b\log_b n + O(1)}, \]

where $b = 1/(1-p)$.

2.1.5 Remarks

We discussed advances in several research directions dealing with massive

graphs, such as external memory algorithms and modeling of massive networks

as random graphs with power-law degree distributions. Despite the evidence that

uniform random graphs are hardly suitable for modeling the considered real-life

graphs, the classical random graphs theory still may serve as a great source of

ideas in studying properties of massive graphs and their models. We recalled

some well-known results produced by the classical random graphs theory. These

include results for concentration of clique number and chromatic number of random

graphs, which would be interesting to extend to more complicated random graph

models (e.g., power-law graphs and graphs with arbitrary degree distributions).

External memory algorithms and numerical optimization techniques could be

applied to find an approximate value of the clique number (as discussed
in Subsection 2.1.1). On the other hand, probabilistic methods similar to those
discussed in Subsection 2.1.4 could be utilized in order to find the asymptotic
distribution of the clique number in the same network's random graph model, and

therefore verify this model.















CHAPTER 3
NETWORK-BASED APPROACHES TO MINING STOCK MARKET DATA

One of the most important problems in modern finance is finding efficient
ways of summarizing and visualizing the stock market data that would allow
one to obtain useful information about the behavior of the market. Nowadays, a
great number of stocks are traded in the US stock market; moreover, this number
steadily increases. The amount of data generated by the stock market every day is

enormous. This data is usually visualized by thousands of plots reflecting the price

of each stock over a certain period of time. The analysis of these plots becomes

more and more complicated as the number of stocks grows.

It turns out that the stock market data can be effectively represented as a
network, although this representation is not as obvious as in the case of telephone
traffic or internet data. We have developed a network-based model of the market
referred to as the market graph. This chapter is based on the results described in
[26, 27, 28].

A natural graph representation of the stock market is based on the cross

correlations of price fluctuations. A market graph can be constructed as follows:

each financial instrument is represented by a vertex, and two vertices are connected

by an edge if the correlation coefficient of the corresponding pair of instruments

(calculated for a certain period of time) exceeds a specified threshold $\theta$, $-1 \le \theta \le 1$.
Nowadays, a great number of different instruments are traded in the US stock

market, so the market graph representing them is very large. The market graph

that we construct has 6546 vertices and several million edges.









In this chapter, we present a detailed study of the properties of this graph. It

turns out that the market graph can be rather accurately described by the power-law
model. We analyze the distribution of the degrees of the vertices in this graph,

the edge density of this graph with respect to the correlation threshold, as well as

its connectivity and the size of its connected components.

Furthermore, we look for maximum cliques and maximum independent sets in

this graph for different values of the correlation threshold. Analyzing cliques and

independent sets in the market graph gives us valuable knowledge about

the internal structure of the stock market. For instance, a clique in this graph

represents a set of financial instruments whose prices change similarly over time

(a change of the price of any instrument in a clique is likely to affect all other

instruments in this clique), and an independent set consists of instruments that are

negatively correlated with respect to each other; therefore, it can be treated as a

diversified portfolio. Based on the information obtained from this analysis, we will

be able to classify financial instruments into certain groups, which will give us a

deeper insight into the stock market structure.

3.1 Structure of the Market Graph

3.1.1 Constructing the Market Graph

The market graph that we study in this chapter represents the set of financial

instruments traded in the US stock markets. More specifically, we consider

6546 instruments and analyze daily changes of their prices over a period of 500

consecutive trading days in 2000-2002. Based on this information, we calculate the

cross-correlations between each pair of stocks using the following formula [92]:


\[
C_{ij} = \frac{\langle R_i R_j \rangle - \langle R_i \rangle \langle R_j \rangle}
{\sqrt{\langle R_i^2 - \langle R_i \rangle^2 \rangle \, \langle R_j^2 - \langle R_j \rangle^2 \rangle}},
\]

where $R_i(t) = \ln \frac{P_i(t)}{P_i(t-1)}$ defines the return of the stock $i$ for day $t$,
and $P_i(t)$ denotes the price of the stock $i$ on day $t$.

Figure 3-1. Distribution of correlation coefficients in the stock market

The correlation coefficients $C_{ij}$ can vary from -1 to 1. Figure 3-1 shows
the distribution of the correlation coefficients based on the price data for the
years 2000-2002. It can be seen that this plot has a shape similar to the normal
distribution with the mean 0.05.

The main idea of constructing a market graph is as follows. Let the set of
financial instruments represent the set of vertices of the graph. Also, we specify a
certain threshold value $\theta$, $-1 \le \theta \le 1$, and add an undirected edge
connecting the vertices $i$ and $j$ if the corresponding correlation coefficient
$C_{ij}$ is greater than or equal to $\theta$. Obviously, different values of $\theta$
define market graphs with the same set of vertices, but different sets of edges.
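The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: prices are given as a dictionary of equally long daily price lists, the averages in the correlation formula are taken over the whole period, every return series has nonzero variance, and the names `log_returns`, `correlation`, `market_graph` and `edge_density` are hypothetical helpers rather than anything used in the original experiments.

```python
import math


def log_returns(prices):
    """R(t) = ln(P(t) / P(t-1)) for consecutive trading days."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def correlation(r_i, r_j):
    """C_ij with <.> taken as the average over the period (nonzero variance assumed)."""
    n = len(r_i)
    mean_i = sum(r_i) / n
    mean_j = sum(r_j) / n
    cov = sum(a * b for a, b in zip(r_i, r_j)) / n - mean_i * mean_j
    var_i = sum(a * a for a in r_i) / n - mean_i ** 2
    var_j = sum(b * b for b in r_j) / n - mean_j ** 2
    return cov / math.sqrt(var_i * var_j)


def market_graph(price_series, theta):
    """Adjacency sets of the market graph: i and j are adjacent iff C_ij >= theta."""
    names = list(price_series)
    returns = {s: log_returns(price_series[s]) for s in names}
    adj = {s: set() for s in names}
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            u, v = names[a], names[b]
            if correlation(returns[u], returns[v]) >= theta:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def edge_density(adj):
    """Ratio of the number of edges to the maximum possible number of edges."""
    n = len(adj)
    edges = sum(len(nb) for nb in adj.values()) / 2
    return edges / (n * (n - 1) / 2)


# Toy data (made up for illustration): B moves with A, C moves against A.
prices = {
    "A": [100, 102, 101, 105, 107],
    "B": [50, 51, 50.5, 52.5, 53.5],
    "C": [20, 19, 21, 18, 22],
}
graph = market_graph(prices, theta=0.5)
print(graph, edge_density(graph))
```

The quadratic loop over pairs is adequate for a sketch; for several thousand instruments a vectorized correlation matrix would be the more practical choice.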

It is easy to see that the number of edges in the market graph decreases as the
threshold value $\theta$ increases. In fact, our experiments show that the edge density
of the market graph decreases exponentially w.r.t. $\theta$. The corresponding plot is
presented in Figure 3-2.

Figure 3-2. Edge density of the market graph for different values of the correlation
threshold.

3.1.2 Connectivity of the Market Graph

In Subsection 2.1.3 we mentioned the connectivity thresholds in random
graphs. The main idea of this concept is finding a threshold value of the parameter
of the model that determines whether the graph is connected.

A similar question arises for the market graph: what is its connectivity
threshold? Since the number of edges in the market graph depends on the chosen
correlation threshold $\theta$, we should find a value $\theta_0$ that determines the
connectivity of the graph. As mentioned above, the smaller the value of $\theta$ we
choose, the more edges the market graph will have. So, if we decrease $\theta$, after a
certain point the graph will become connected. We have conducted a series of
computational experiments for checking the connectivity of the market graph using the
breadth-first search technique, and we obtained a relatively accurate approximation of
the connectivity threshold: $\theta_0 \approx 0.14382$. Moreover, we investigated the
dependency of the size of the largest connected component in the market graph on
$\theta$. The corresponding plot is shown in Figure 3-3.

Figure 3-3. Plot of the size of the largest connected component in the market
graph as a function of correlation threshold $\theta$.
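A breadth-first search check of this kind can be sketched as follows. The code is an illustration only: the dictionary `corr` of pairwise correlations, the candidate list and the function names are assumptions, and the scan over candidate thresholds relies on the fact that lowering $\theta$ can only add edges.

```python
from collections import deque


def largest_component_size(adj):
    """Size of the largest connected component, found by breadth-first search."""
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def connectivity_threshold(corr, n, candidates):
    """Largest candidate theta for which the graph on vertices 0..n-1 is connected.

    Returns None if the graph is disconnected for every candidate.
    """
    for theta in sorted(candidates, reverse=True):
        adj = {i: set() for i in range(n)}
        for (i, j), c in corr.items():
            if c >= theta:
                adj[i].add(j)
                adj[j].add(i)
        if largest_component_size(adj) == n:
            return theta
    return None


# Made-up correlations for four instruments.
corr = {(0, 1): 0.9, (1, 2): 0.3, (0, 2): 0.1,
        (2, 3): -0.2, (0, 3): -0.5, (1, 3): -0.4}
print(connectivity_threshold(corr, 4, [0.5, 0.2, 0.0, -0.3]))
```

A finer approximation of the threshold could be obtained by bisection over $\theta$ instead of a fixed candidate list.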

3.1.3 Degree Distribution of the Market Graph

The next important subject of our interest is the distribution of the degrees

of the vertices in the market graph. We have conducted several computational

experiments with different values of the correlation threshold $\theta$, and these results

are presented below.

It turns out that if a small (in absolute value) correlation threshold $\theta$ is
specified, the distribution of the degrees of the vertices does not have any
well-defined structure. Note that for these values of $\theta$ the market graph has a
relatively high edge density (i.e., the ratio of the number of edges to the maximum
possible number of edges). However, as the correlation threshold is increased, the
degree distribution more and more resembles a power law. In fact, for $\theta \ge 0.2$
this distribution is approximately a straight line in the logarithmic scale, which
represents the power-law distribution, as it was mentioned above. Figure 3-4
demonstrates the degree distributions of the market graph for some positive values of
the correlation threshold, along with the corresponding linear approximations. The
slopes of the approximating lines were estimated using the least-squares method.
Table 3-1 summarizes the estimates of the parameter $\gamma$ of the power-law
distribution (i.e., the slope of the line) for different values of $\theta$.

Table 3-1. Least-squares estimates of the parameter $\gamma$ in the market graph for
different values of correlation threshold (* complementary graph)

$\theta$     $\gamma$
-0.25*       1.2922
-0.2*        1.4088
-0.15*       1.4072
0.2          0.4931
0.25         0.5820
0.3          0.6793
0.35         0.7679
0.4          0.8269
0.45         0.8753
0.5          0.9054
0.55         0.9331
0.6          0.9743

From this table, it can be seen that the slope of the lines corresponding to
positive values of $\theta$ is rather small. According to the power-law model, in this
case a graph would have many vertices with high degrees; therefore, one can
intuitively expect to find large cliques in a power-law graph with a small value of
the parameter $\gamma$.
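The least-squares estimation of the slope can be illustrated with a short sketch. It assumes that the line is fitted to the raw points (log of degree, log of number of vertices with that degree); the binning actually used for Table 3-1 is not stated in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def degree_histogram(adj):
    """Map each degree to the number of vertices having that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())


def fit_power_law_exponent(hist):
    """Least-squares slope of log10(count) against log10(degree).

    Returns gamma = -slope. At least two distinct positive degrees are needed.
    """
    points = [(math.log10(k), math.log10(c))
              for k, c in hist.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


# Synthetic histogram following count = 10000 * degree^(-2) exactly.
hist = {1: 10000, 2: 2500, 4: 625, 5: 400, 10: 100}
print(fit_power_law_exponent(hist))
```

On real data the tail of the histogram is noisy, so the estimate depends on which degrees are included in the fit.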

We also analyze the degree distribution of the complement of the market
graph, which is defined as follows: an edge connects instruments $i$ and $j$ if the
correlation coefficient between them satisfies $C_{ij} < \theta$. Studying this
complementary graph is important for the next subject of our consideration: finding
maximum independent sets in the market graph with negative values of the correlation
threshold $\theta$. Obviously, a maximum independent set in the initial graph is a
maximum clique in the complement, so the maximum independent set problem can be reduced
to the maximum clique problem in the complementary graph. Therefore, it is
useful to investigate the degree distributions of the complementary graphs for
different values of $\theta$. As can be seen from Figure 3-1, the distribution of the
correlation coefficients is nearly symmetric around $\theta = 0.05$, so for the values
of $\theta$ close to 0 the edge density of both the initial and the complementary graph
is high enough. For these values of $\theta$ the degree distribution of a complementary
graph also does not seem to have any well-defined structure, as in the case of the
corresponding initial graph. As $\theta$ decreases (i.e., increases in absolute value),
the degree distribution of a complementary graph starts to follow the power law.
Figure 3-5 shows the degree distributions of the complementary graph, along with
the least-squares linear regression lines. However, as one can see from Table 3-1,
the slopes of these lines are higher than in the case of the graphs with positive
values of $\theta$, which implies that there are fewer vertices with a high degree in
these graphs, so intuitively, the size of cliques in a complementary graph (i.e., the
size of independent sets in the original graph) should be significantly smaller than in
the case of the market graph with positive values of the correlation threshold (see
Section 3.2).

Figure 3-4. Degree distribution of the market graph for $\theta = 0.4$ (left);
$\theta = 0.5$ (right) (logarithmic scale)

Figure 3-5. Degree distribution of the complementary market graph for $\theta = -0.15$
(left); $\theta = -0.2$ (right) (logarithmic scale)

3.1.4 Instruments Corresponding to High-Degree Vertices

Up to this point, we studied the properties of the market graph as one big

system, and did not consider the characteristics of every vertex in this graph.

However, an important practical issue is to look at the degree of each vertex in

the market graph and to find the vertices with high degrees, i.e. the stocks that

are highly correlated with many other instruments in the market. Clearly, this

information will help us to answer the question: which instruments most accurately

reflect the behavior of the market?

For this purpose, we chose the market graph with a high correlation threshold
($\theta = 0.6$), calculated the degree of each vertex in this graph and sorted the
vertices in decreasing order of their degrees.











Interestingly, even though the edge density of the considered graph is only
0.04% (only highly correlated instruments are connected by an edge), there are
many vertices with degrees greater than 100.

According to our calculations, the vertex with the highest degree in this

market graph corresponds to the NASDAQ 100 Index Tracking Stock. The degree

of this vertex is 216, which means that there are 216 instruments that are highly

correlated with it. An interesting observation is that the degree of this vertex is
more than twice the number of companies whose stock prices the NASDAQ index
reflects, which means that these 100 companies greatly influence the market.

In Table 3-2 we present the "top 25" instruments in the U.S. stock market,
according to their degrees in the considered market graph. The corresponding
symbol definitions can be found on several websites, for example
http://www.nasdaq.com. Note that most of them are indices that incorporate

a number of different stocks of the companies in different industries. Although

this result is not surprising from the financial point of view, it is important as a

practical justification of the market graph model.

3.1.5 Clustering Coefficients in the Market Graph

Next, we calculate the clustering coefficients in the original and complementary
market graphs for different values of $\theta$. The clustering coefficient is defined
as the probability that for a given vertex its two neighbors are connected by an
edge. Interestingly, clustering coefficients in the original market graph are large
even for high correlation thresholds; however, in the complementary graphs with
a negative correlation threshold the values of the clustering coefficient turned out
to be very close to 0. These results are summarized in Table 3-3. For instance, as
one can see from this table, the market graph with $\theta = 0.6$ has almost the same
edge density as the complementary market graph with $\theta = -0.15$; however, their
clustering coefficients differ dramatically. This fact also intuitively explains the
results presented in the next section, which deals with cliques and independent sets
in the market graph.

Table 3-2. Top 25 instruments with highest degrees in the market graph ($\theta = 0.6$)

symbol    vertex degree
QQQ       216
IWF       193
IWO       193
IYW       193
XLK       181
IVV       175
MDY       171
SPY       162
IJH       159
IWV       158
IVW       156
IAH       155
IYY       154
IWB       153
IYV       150
BDH       144
MKH       143
IWM       142
IJR       134
SMH       130
STM       118
IIH       116
IVE       113
DIA       106
IWD       106
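The clustering coefficient defined in Subsection 3.1.5 can be computed as in the following sketch. One assumption is made explicit here: the graph-level value is taken as the average of the per-vertex fractions over vertices with at least two neighbors, which is one common convention; the text does not say which averaging was used for Table 3-3. The function names are illustrative.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are themselves adjacent.

    Returns None when v has fewer than two neighbors (no pairs to test).
    """
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return None
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return linked / (k * (k - 1) / 2)


def average_clustering(adj):
    """Average local clustering over vertices with degree >= 2 (an assumed convention)."""
    values = [c for v in adj if (c := local_clustering(adj, v)) is not None]
    return sum(values) / len(values) if values else 0.0


# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_clustering(adj))
```

Comparing this value with the edge density of the same graph gives the kind of contrast reported in Table 3-3.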

3.2 Analysis of Cliques and Independent Sets in the Market Graph

In this section, we discuss the methods of finding maximum cliques and

maximum independent sets in the market graph and analyze the obtained results.

The maximum clique problem (as well as the maximum independent set
problem) is known to be NP-hard [59]. Moreover, it turns out that the maximum
clique is difficult to approximate [18, 62]. This makes these problems especially
challenging in large graphs. However, as we will see in the next subsection, even
though the maximum clique problem is generally very hard to solve in large graphs,
the special structure of the market graph allows us to find the exact solution
relatively easily.

Table 3-3. Clustering coefficients of the market graph (* complementary graph)

$\theta$    edge density    clustering coef.
-0.15*      0.0005          2.64 x 10^-5
-0.1*       0.0050          0.0012
0.3         0.0178          0.4885
0.4         0.0047          0.4458
0.5         0.0013          0.4522
0.6         0.0004          0.4872
0.7         0.0001          0.4886

3.2.1 Cliques in the Market Graph

In this subsection, we consider cliques in the market graph, which have a

clear interpretation in terms of finance. Since a clique is a set of completely

interconnected vertices, any stock that belongs to the clique is highly correlated

with all other stocks in this clique; therefore, a stock is assigned to a certain group

only if it demonstrates a behavior similar to all other stocks in this group. Clearly,

the size of the maximum clique is an important characteristic of the stock market,

since it represents the maximum possible group of similar objects (i.e., mutually

correlated stocks).

A standard integer programming formulation [33] was used to compute the
exact maximum clique in the market graph; however, before solving this problem,
we applied a greedy heuristic for finding a lower bound of the clique number, and
a special preprocessing technique which reduces the problem size. To find a large
clique, we apply the "best-in" greedy algorithm based on degrees of vertices. Let $C$
denote the clique. Starting with $C = \emptyset$, we recursively add to the clique a
vertex $v_{max}$ of largest degree and remove all vertices that are not adjacent to
$v_{max}$ from the graph. After running this algorithm, we applied the following
preprocessing procedure [2]. We recursively remove from the graph all of the vertices
which are not in $C$ and whose degree is less than $|C|$, where $C$ is the clique found
by the greedy algorithm.
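The greedy heuristic and the preprocessing step can be sketched as follows. This is an illustrative reading of the description above, not the original implementation: degrees are recomputed within the remaining candidate set, ties are broken by vertex order, and the names `greedy_clique` and `preprocess` are hypothetical. The preprocessing is safe because a vertex of a clique larger than $|C|$ must have at least $|C|$ neighbors.

```python
def greedy_clique(adj):
    """'Best-in' greedy: repeatedly take a vertex of largest remaining degree
    and discard every candidate that is not adjacent to it."""
    candidates = set(adj)
    clique = []
    while candidates:
        v = max(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        clique.append(v)
        candidates &= adj[v]  # v itself drops out, since v is not in adj[v]
    return clique


def preprocess(adj, clique):
    """Recursively delete vertices outside the clique whose degree is below |C|."""
    size = len(clique)
    protected = set(clique)
    keep = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(keep):
            if v not in protected and len(adj[v] & keep) < size:
                keep.discard(v)
                changed = True
    return {v: adj[v] & keep for v in keep}


# Toy graph: a 4-clique {0, 1, 2, 3} with a path 3-4-5-6 hanging off it.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5}, 5: {4, 6}, 6: {5}}
clique = greedy_clique(adj)
reduced = preprocess(adj, clique)
print(sorted(clique), sorted(reduced))
```

In the toy graph the whole path is removed, so the exact formulation would only have to be solved on the four clique vertices.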

Denote by $G' = (V', E')$ the graph induced by remaining vertices. Then
the maximum clique problem can be formulated and solved for $G'$. The following
integer programming formulation was used [33]:

\[
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{|V'|} x_i \\
\text{s.t.} \quad & x_i + x_j \le 1, \quad (i,j) \notin E', \\
& x_i \in \{0, 1\}.
\end{aligned}
\]

It should be noted that in the case of market graph instances with a high

positive correlation threshold, the aforementioned preprocessing procedure is very

efficient and significantly reduces the number of vertices in a graph [26]. This can

be intuitively explained by the fact that these instances of the market graph are

clustered (i.e. two vertices in a graph are more likely to be connected if they have a

common neighbor), so the clustering coefficient, which is defined as the probability

that for a given vertex its two neighbors are connected by an edge, is much higher

than the edge density in these graphs (see Table 3-3). This characteristic is also

typical for other power-law graphs arising in different applications.

After reducing the size of the original graph, the resulting integer programming

problem for finding a maximum clique can be relatively easily solved using the

CPLEX integer programming solver [71].

Table 3-4 summarizes the exact sizes of the maximum cliques found in
the market graph for different values of $\theta$. It turns out that these cliques are
rather large, which agrees with the analysis of degree distributions and clustering
coefficients in the market graphs with positive values of $\theta$.

Table 3-4. Sizes of the maximum cliques in the market graph with positive values
of the correlation threshold (exact solutions)

$\theta$    edge density    clique size
0.35        0.0090          193
0.4         0.0047          144
0.45        0.0024          109
0.5         0.0013          85
0.55        0.0007          63
0.6         0.0004          45
0.65        0.0002          27
0.7         0.0001          22


These results show that in the modern stock market there are large groups
of instruments whose price fluctuations behave similarly over time, which is not
surprising, since nowadays different branches of economy highly affect each other.

3.2.2 Independent Sets in the Market Graph

Here we present the results of solving the maximum independent set problem
in the market graphs with nonpositive values of the correlation threshold $\theta$. As
it was pointed out above, this problem is equivalent to the maximum clique problem
in a complementary graph. However, the preprocessing procedure that was very
helpful for finding maximum cliques in the original graph could not eliminate
any vertices in the case of the complement, and we were not able to find the
exact solution of the maximum independent set problem in this case. Recall that
the clustering coefficients in the complementary graph were very small, which
intuitively explains the failure of the preprocessing procedure. Therefore, solving
the maximum independent set problem in the market graph is more challenging than
finding the maximum clique. Table 3-5 presents the sizes of the independent sets found
using the greedy heuristic that was described in the previous subsection.









Table 3-5. Sizes of independent sets in the complementary market graph found
using the greedy algorithm (lower bounds)

$\theta$    edge density    indep. set size
0.05        0.4794          45
0.0         0.2001          12
-0.05       0.0431          5
-0.1        0.005           3
-0.15       0.0005          2


This table demonstrates that the sizes of computed independent sets are rather

small, which is in agreement with the results of the previous section, where we

mentioned that in the complementary graph the values of the parameter of the

power-law distribution are rather high, and the clustering coefficients are very

small.

The small size of the computed independent sets means that finding a large

"completely diversified" portfolio (where all instruments are negatively correlated

to each other) is not an easy task in the modern stock market.

Moreover, it turns out that one can make a theoretical estimation of the
maximum size of a diversified portfolio, where all stocks are strictly negatively
correlated with each other. Intuitively, the lower (higher in absolute value)
threshold $\theta$ we set, the smaller diversified portfolio one would expect to find.
These considerations are confirmed by the following theorem.

Theorem 3.1. Consider a market graph with the correlation threshold $\theta < 0$.
Assume that each stock's return has a finite variance. Then there is no independent
set (diversified portfolio) of a size greater than $1 + \frac{1}{|\theta|}$.

Proof. Let a random variable $X_i$ denote the return of stock $i$ at some time moment,
$\sigma_i^2$ denote the variance of $X_i$, and $\sigma_{max} = \max_i \sigma_i$. Suppose
that there are $m$ stocks, which are pairwise negatively correlated, i.e.,
$C_{ij} < 0$, $\forall i, j = 1, \ldots, m$, $i \ne j$, and the maximum correlation is
$\theta = \max_{i \ne j} C_{ij} < 0$. Consider the variance of the sum of these
variables:

\[
\mathrm{Var}\Big(\sum_{i=1}^{m} X_i\Big)
= \sum_{i=1}^{m} \mathrm{Var}(X_i) + \sum_{i \ne j} \mathrm{Cov}(X_i, X_j)
\le \sum_{i=1}^{m} \sigma_{max}^2 + \sum_{i \ne j} \theta \sigma_{max}^2
= m \sigma_{max}^2 \big(1 + (m-1)\theta\big).
\]

Note that if $\theta < 0$, $m \sigma_{max}^2 (1 + (m-1)\theta) < 0$ for
$m > 1 + \frac{1}{|\theta|}$. Consequently,
$\mathrm{Var}\big(\sum_{i=1}^{m} X_i\big) < 0$ for $m > 1 + \frac{1}{|\theta|}$, which
is impossible because a variance is nonnegative.

Therefore, the number of stocks with pairwise correlations $C_{ij} \le \theta < 0$
cannot be greater than $m = 1 + \frac{1}{|\theta|}$, which completes the proof.
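The same bound can also be reached by working with standardized returns, which avoids comparing the individual variances with the largest one. The following is a sketch under the additional assumption that every $\sigma_i > 0$; the variables $Y_i$ are notation introduced here for illustration only.

```latex
\[
Y_i = \frac{X_i}{\sigma_i}, \qquad
\mathrm{Var}(Y_i) = 1, \qquad
\mathrm{Cov}(Y_i, Y_j) = C_{ij} \le \theta \quad (i \ne j),
\]
\[
0 \le \mathrm{Var}\Big(\sum_{i=1}^{m} Y_i\Big)
= m + \sum_{i \ne j} C_{ij}
\le m + m(m-1)\theta
= m\big(1 + (m-1)\theta\big),
\]
\[
\text{hence } 1 + (m-1)\theta \ge 0, \quad \text{i.e.,} \quad m \le 1 + \frac{1}{|\theta|}.
\]
```

For example, $\theta = -0.1$ gives $m \le 11$ and $\theta = -0.05$ gives $m \le 21$, so the sizes 3 and 5 reported for these thresholds in Table 3-5 lie well within the bound.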



Another natural question now arises: how many completely diversified

portfolios can be found in the market? In order to find an answer, we have

calculated maximal independent sets starting from each vertex, by running 6546

iterations of the greedy algorithm mentioned above. That is, for each of the

considered 6546 financial instruments, we have found a completely diversified

portfolio that would contain this instrument. Interestingly enough, for every vertex

in the market graph, we were able to detect an independent set that contains this

vertex, and the sizes of these independent sets were rather close. Moreover, all

these independent sets were distinct. Figure 3-6 shows the frequency of the sizes

of the independent sets found in the market graphs corresponding to different

correlation thresholds.

These results demonstrate that it is always possible for an investor to find a

group of stocks that would form a completely diversified portfolio with any given

stock, and this can be efficiently done using the technique of finding independent

sets in the market graph.
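Growing an independent set around a prescribed instrument can be sketched as follows. The selection rule used here (take the candidate with the fewest neighbors among the remaining candidates, which corresponds to the largest degree in the complementary graph) is an assumed reading of the greedy algorithm, and `greedy_independent_set` is a hypothetical name.

```python
def greedy_independent_set(adj, start):
    """Maximal independent set containing `start`, grown greedily.

    At each step the candidate with the fewest neighbors among the remaining
    candidates is added, and its neighbors are discarded.
    """
    chosen = [start]
    candidates = set(adj) - adj[start] - {start}
    while candidates:
        v = min(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        chosen.append(v)
        candidates -= adj[v] | {v}
    return chosen


# Toy graph: the path 0-1-2-3-4.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
for s in adj:
    print(s, greedy_independent_set(adj, s))
```

Because every vertex is either chosen or removed as a neighbor of a chosen vertex, the returned set is always maximal, although not necessarily maximum.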










Figure 3-6. Frequency of the sizes of independent sets found in the market graph
with $\theta = 0.00$ (left), and $\theta = 0.05$ (right)


3.3 Data Mining Interpretation of the Market Graph Model

As we have seen, the analysis of the market graph provides a practically
useful methodology of extracting information from the stock market data. In this
section, we discuss the conceptual interpretation of this approach from the data
mining perspective. An important aspect of the proposed model is the fact that
it allows one to reveal certain patterns underlying the financial data; therefore, it
represents a structured data mining approach.

Non-trivial information about the global properties of the stock market

is obtained from the analysis of the degree distribution of the market graph.

The highly specific structure of this distribution suggests that the stock market can

be analyzed using the power-law model, which can theoretically predict some

characteristics of the graph representing the market.

On the other hand, the analysis of cliques and independent sets in the market
graph is also useful from the data mining point of view. As it was pointed
out above, cliques and independent sets in the market graph represent groups of
"similar" and "different" financial instruments, respectively. Therefore, information
about the size of the maximum cliques and independent sets is also rather
important, since it gives one an idea about the trends that take place in the stock









market. Besides analyzing the maximum cliques and independent sets in the mar-

ket graph, one can also divide the market graph into the smallest possible set of

distinct cliques (or independent sets). Partitioning a dataset into sets (clusters) of

elements grouped according to a certain criterion is referred to as clustering, which

is one of the well-known data mining problems [34].

As discussed above, the main difficulty one encounters in solving the clustering
problem on a certain dataset is the fact that the number of desired clusters of
similar objects is usually not known a priori; moreover, an appropriate similarity
criterion should be chosen before partitioning a dataset into clusters.

Clearly, the methodology of finding cliques in the market graph provides an
efficient tool for performing clustering based on the stock market data. The choice
of the grouping criterion is clear and natural: "similar" financial instruments are
determined according to the correlation between their price fluctuations. Moreover,
the minimum number of clusters in the partition of the set of financial instruments
is equal to the minimum number of distinct cliques that the market graph can be
divided into (the minimum clique partition problem). A similar partition can be done
using independent sets instead of cliques, which would represent the partition of
the market into a set of distinct diversified portfolios. In this case the minimum
possible number of clusters is equal to the minimum number of distinct independent
sets into which the vertices can be partitioned. This problem is called the graph
coloring problem, and the number of sets in the optimal partition is referred to as the
chromatic number of the graph.
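A partition of the vertices into independent sets can be illustrated with a simple greedy coloring. This heuristic is swapped in purely for illustration: it produces a valid partition and therefore an upper bound on the chromatic number, not the optimal partition discussed above, and the function names are hypothetical.

```python
def greedy_coloring(adj):
    """Assign each vertex the smallest color not used by its colored neighbors.

    Vertices are processed in order of decreasing degree (ties by vertex order).
    """
    color = {}
    for v in sorted(adj, key=lambda u: (-len(adj[u]), u)):
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def color_classes(color):
    """Group vertices by color; each group is an independent set."""
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, []).append(v)
    return classes


# Toy graph: a cycle on five vertices, whose chromatic number is 3.
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
classes = color_classes(greedy_coloring(adj))
print(classes)
```

Applying the same routine to the complementary graph yields a partition into cliques, that is, a feasible (not necessarily minimum) clique partition.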

We should also mention another major type of data mining problems with
many applications in finance. They are referred to as classification problems.

Although the setup of this type of problems is similar to clustering, one should

clearly understand the difference between these two types of problems.









In classification, one deals with a pre-defined number of classes that the data

elements must be assigned to. Also, there is a so-called training dataset, i.e., the

set of data elements for which it is known a priori which class they belong to. It

means that in this setup one uses some initial information about the classification

of existing data elements. A certain classification model is constructed based on

this information, and the parameters of this model are tuned to classify new data
elements. This procedure is known as "training the classifier". An example of the

application of this approach to classifying financial instruments can be found in

[40].

The main difference between classification and clustering is the fact that unlike

classification, in the case of clustering, one does not use any initial information

about the class attributes of the existing data elements, but tries to determine a

classification using appropriate criteria. Therefore, the methodology of classifying

financial instruments using the market graph model is essentially different from

the approaches commonly considered in the literature in the sense that it does not

require any a-priori information about the classes that certain stocks belong to, but

classifies them only based on the behavior of their prices over time.

3.4 Evolution of the Market Graph

In the previous sections, we have discussed the properties of the market graph

constructed for one 500-day period. We have revealed a number of important

properties of this model; however, another crucial question that needs to be

answered is how these characteristics change over time. This analysis would provide

more information about the patterns underlying the stock market dynamics. We

address these issues in this section.

In order to investigate the dynamics of the market graph structure, we chose

the period of 1000 trading d4,va in 1998-2002 and considered eleven 500-d4iv shifts

within this period. The starting points of every two consecutive shifts are separated









Table 3-6. Dates and mean correlations corresponding to each considered 500-d-iv
shift

Period # Starting date Ending date Mean correlation
1 09/24/1998 09/15/2000 0.0403
2 12/04/1998 11/27/2000 0.0373
3 02/18/1999 02/08/2001 0.0381
4 04/30/1999 04/23/2001 0.0426
5 07/13/1999 07/03/2001 0.0444
6 09/22/1999 09/19/2001 0.0465
7 12/02/1999 11/29/2001 0.0545
8 02/14/2000 02/12/2002 0.0561
9 04/26/2000 04/25/2002 0.0528
10 07/07/2000 07/08/2002 0.0570
11 09/18/2000 09/17/2002 0.0672



by the interval of 50 days. Therefore, every pair of consecutive shifts had 450

days in common and 50 days different. Dates corresponding to each shift and the

corresponding mean correlations are summarized in Table 3-6.
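
The shift construction described above can be sketched in a few lines. This is only an illustration: the function name `sliding_windows` and the use of zero-based day indices (rather than calendar dates) are assumptions made here, not part of the original study.

```python
def sliding_windows(total_days=1000, window=500, step=50):
    """Return (start, end) day-index pairs (end exclusive) for each shift."""
    last_start = total_days - window
    return [(start, start + window) for start in range(0, last_start + 1, step)]


windows = sliding_windows()

# Number of trading days shared by each pair of consecutive shifts.
overlaps = [prev_end - next_start
            for (_, prev_end), (next_start, _) in zip(windows, windows[1:])]

print(len(windows))   # 11
print(set(overlaps))  # {450}
```

With 1000 days, a 500-day window, and a 50-day step, the sketch yields the eleven shifts of Table 3-6, each sharing 450 days with its successor.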

This procedure allows us to accurately reflect the structural changes of the

market graph using relatively small intervals between shifts, but at the same

time one can maintain sufficiently large sample sizes of the stock prices data for

calculating cross-correlations for each shift. We should note that in our analysis we

considered only stocks which were among those traded as of the last of the 1000

trading days, i.e., for practical reasons we did not take into account stocks which

had been withdrawn from the market.

3.4.1 Dynamics of Global Characteristics of the Market Graph

In this subsection, we analyze the evolution of the basic characteristics of

the market graph model that were considered above for one trading period: the

distribution of the correlation coefficients in the market, the degree distribution,

and the edge density. As we will see, some properties of the market graph remain

stable; however, there are certain trends that can be observed in the stock market

development.










The first subject of our consideration is the distribution of correlation coefficients

between all pairs of stocks in the market. As mentioned above, this

distribution on [-1, 1] had a shape similar to a part of a normal distribution with

mean close to 0.05 for the sample data considered in [26, 27]. One interpretation

of this fact is that the correlation of most pairs of stocks is close to zero;

therefore, the structure of the stock market is substantially random, and one can

make a reasonable assumption that the prices of most stocks change independently.

As we consider the evolution of the correlation distribution over time, it turns out

that the shape of this distribution remains stable, which is illustrated by Figure

3-7.


[Figure 3-7: plot of the distribution of correlation coefficients; curves are shown for periods 1, 3, 5, 7, 9, and 11.]

Figure 3-7. Distribution of correlation coefficients in the US stock market for several
overlapping 500-day periods during 2000-2002 (period 1 is the
earliest, period 11 is the latest).


The stability of the correlation coefficients distribution of the market graph

intuitively motivates the hypothesis that the degree distribution should also remain

stable for different values of the correlation threshold. To verify this assumption,










we have calculated the degree distribution of the graphs constructed for all

considered time periods. The correlation threshold θ = 0.5 was chosen to describe

the structure of connections corresponding to significantly high correlations. Our

experiments show that the degree distribution is similar for all time intervals,

and in all cases it is well described by a power law. Figure 3-8 shows the degree

distributions (in the logarithmic scale) for some instances of the market graph

(with θ = 0.5) corresponding to different intervals.
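
As a rough illustration of the procedure, the sketch below builds a thresholded graph from return series and tabulates its degree distribution. The helper names (`pearson`, `market_graph`, `degree_distribution`), the toy data, and the convention that an edge is placed when the correlation is at least the threshold are assumptions made for this example only.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Sample correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def market_graph(returns, theta):
    """returns: dict ticker -> list of returns.
    Assumed rule: connect two tickers when their correlation is >= theta."""
    tickers = sorted(returns)
    adj = {t: set() for t in tickers}
    for idx, a in enumerate(tickers):
        for b in tickers[idx + 1:]:
            if pearson(returns[a], returns[b]) >= theta:
                adj[a].add(b)
                adj[b].add(a)
    return adj


def degree_distribution(adj):
    """Map each degree value to the number of vertices having that degree."""
    return Counter(len(neighbours) for neighbours in adj.values())


# Hypothetical toy data: A and B move together, C does not.
toy_returns = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [1.0, -1.0, 1.0, -1.0],
}
adj = market_graph(toy_returns, theta=0.5)
dist = degree_distribution(adj)
```

On real data one would plot `dist` on a log-log scale; an approximately straight line is what a power-law degree distribution looks like in that representation.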


[Figure 3-8: four log-log plots of the degree distribution (horizontal axis: degree); panels (a) period 1, (b) period 4, (c) period 7, (d) period 11.]

Figure 3-8. Degree distribution of the market graph for different 500-day periods
in 2000-2002 with θ = 0.5: (a) period 1, (b) period 4, (c) period 7, (d)
period 11.


The cross-correlation distribution and the degree distribution of the market

graph represent the general characteristics of the market, and the aforementioned











results lead us to the conclusion that the global structure of the market is stable

over time. However, as we will see now, some global changes in the stock market

structure do take place. In order to demonstrate it, we look at another characteristic

of the market graph: its edge density.

In our analysis of the market graph dynamics, we chose a relatively high

correlation threshold θ = 0.5 that would ensure that we consider only the edges

corresponding to the pairs of stocks that are significantly correlated with each

other. In this case, the edge density of the market graph would represent the

proportion of those pairs of stocks in the market, whose price fluctuations are

similar and influence each other. The subject of our interest is to study how this

proportion changes during the considered period of time. Table 3-7 summarizes

the obtained results. As can be seen from this table, both the number of vertices

and the number of edges in the market graph increase over time. Obviously, the

number of vertices grows since new stocks appear in the market, and we do not

consider those stocks which ceased to exist by the last of the 1000 trading days used

in our analysis, so the maximum possible number of edges in the graph increases

as well. However, it turns out that the number of edges grows faster; therefore, the

edge density of the market graph increases from period to period. As one can see

from Figure 3-9(a), the greatest increase of the edge density corresponds to the

last two periods. In fact, the edge density for the latest interval is approximately

8.5 times higher than for the first interval! This dramatic jump suggests that there

is a trend toward the "globalization" of the modern stock market, which means that

nowadays more and more stocks significantly affect the behavior of the others.
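
The edge density figures can be checked directly from the vertex and edge counts reported in Table 3-7. The short sketch below does this for the first and last periods; the function name `edge_density` is introduced here for illustration.

```python
def edge_density(n_vertices, n_edges):
    """Fraction of all possible vertex pairs that are joined by an edge."""
    max_edges = n_vertices * (n_vertices - 1) / 2
    return n_edges / max_edges


first = edge_density(5430, 2258)    # period 1
last = edge_density(6556, 27885)    # period 11
ratio = last / first

print(f"{first:.5f} {last:.5f} {ratio:.2f}")
```

The resulting ratio is close to the factor of approximately 8.5 quoted in the text.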

It should be noted that the increase of the edge density could be predicted

from the analysis of the distribution of the cross-correlations between all pairs

of stocks. From Figure 3-7, one can observe that even though the distributions

corresponding to different periods have a similar shape and the same mean,










Table 3-7. Number of vertices and number of edges in the market graph for different
periods (θ = 0.5)

Period  Number of Vertices  Number of Edges  Edge density
1       5430                2258             0.015%
2       5507                2614             0.017%
3       5593                3772             0.024%
4       5666                5276             0.033%
5       5768                6841             0.041%
6       5866                7770             0.045%
7       6013                10428            0.058%
8       6104                12457            0.067%
9       6262                12911            0.066%
10      6399                19707            0.096%
11      6556                27885            0.130%

the "tail" of the distribution corresponding to the latest period (period 11) is

somewhat "heavier" than for the earlier periods, which means that there are more

pairs of stocks with higher values of the correlation coefficient.


[Figure 3-9: two plots against time period (1 to 11): (a) edge density, (b) maximum clique size.]

Figure 3-9. Dynamics of edge density and maximum clique size in the market
graph: Evolution of the edge density (a) and maximum clique size (b)
in the market graph (θ = 0.5)



3.4.2 Dynamics of the Size of Cliques and Independent Sets in the
Market Graph

In this subsection we analyze the evolution of the size of the maximum clique

in the market graph over the considered period of time.











Table 3-8 presents the sizes of the maximum cliques found in the market graph

for different time periods. As in the previous subsection, we used a relatively high

correlation threshold θ = 0.5 to consider only significantly correlated stocks. As

one can see, there is a clear trend of the increase of the maximum clique size over

time, which is consistent with the behavior of the edge density of the market graph

discussed above (see Figure 3-9(b)). This result provides another confirmation of

the globalization hypothesis discussed above.

Another related issue to consider is how much the structure of maximum

cliques is different for the various time periods. Table 3-9 presents the stocks

included into the maximum cliques for different time periods. It turns out that in

most cases stocks that appear in a clique in an earlier period also appear in the

cliques in later periods.

There are some other interesting observations about the structure of the

maximum cliques found for different time periods. It can be seen that all the

cliques include a significant number of stocks of the companies representing the

"high-tech" industry sector. As examples, one can mention well-known companies

such as Sun Microsystems, Inc., Cisco Systems, Inc., Intel Corporation,

etc. Moreover, each clique contains stocks of the companies related to the semi-

conductor industry (e.g., Cypress Semiconductor Corporation, Cree, Inc., Lattice

Semiconductor Corporation, etc.), and the number of these stocks in the cliques

increases with time. These facts suggest that the corresponding branches of

industry expanded during the considered period of time to form a major cluster of

the market.

In addition, we observed that in the later periods (especially in the last two

periods) the maximum cliques contain a rather large number of exchange traded

funds, i.e., stocks that reflect the behavior of certain indices representing various

groups of companies. It should be mentioned that all maximum cliques contain









Table 3-8. Greedy clique size and the clique number for different time periods (θ =
0.5)

Period  |V|    Edge Dens.  Clustering   |C|   |V'|   Edge Dens.  Clique
               in G        Coefficient                in G'       Number
1       5430   0.00015     0.505        15    76     0.286       18
2       5507   0.00017     0.504        18    43     0.731       19
3       5593   0.00024     0.499        26    49     0.817       27
4       5666   0.00033     0.517        34    70     0.774       34
5       5768   0.00041     0.550        42    82     0.787       42
6       5866   0.00045     0.558        45    86     0.804       45
7       6013   0.00058     0.553        51    110    0.769       51
8       6104   0.00067     0.566        60    114    0.819       60
9       6262   0.00066     0.553        62    107    0.869       62
10      6399   0.00096     0.486        77    134    0.841       77
11      6556   0.00130     0.452        84    146    0.844       85


Nasdaq 100 tracking stock (QQQ), which was also found to be the vertex with the

highest degree (i.e., correlated with the most stocks) in the market graph [26].

Another natural question that one can pose is how the size of independent sets

(i.e., diversified portfolios in the market) changes over time. As it was pointed out

in [26, 27], finding a maximum independent set in the market graph turns out to

be a much more complicated task than finding a maximum clique. In particular,

in the case of solving the maximum independent set problem (or, equivalently,

the maximum clique problem in the complementary graph), the preprocessing

procedure described above does not reduce the size of the original graph. This

can be explained by the fact that the clustering coefficient in the complementary

market graph with θ = 0 is much smaller than in the original graph corresponding

to θ = 0.5 (see Table 3-10).

Similarly to Section 3.2, we calculate maximal independent sets (a maximal

independent set is an independent set that is not a subset of another independent

set) in the market graph using the above greedy algorithm. As one can see from

Table 3-10, the sizes of independent sets found in the market graph for θ = 0 are

rather small, which is consistent with the results of Section 3.2.
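
For readers unfamiliar with greedy constructions of maximal independent sets, a minimal sketch follows. It uses a common minimum-degree rule (pick a vertex of smallest remaining degree, then discard it and its neighbours); this rule and the function name are assumptions made for illustration and are not necessarily the exact algorithm of Section 3.2.

```python
def greedy_independent_set(adj):
    """adj: dict vertex -> set of neighbours (no self-loops).
    Returns a maximal independent set as a list, in selection order."""
    remaining = {v: set(neighbours) for v, neighbours in adj.items()}
    chosen = []
    while remaining:
        # Smallest remaining degree first; ties broken by vertex label.
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        chosen.append(v)
        removed = remaining[v] | {v}
        for u in removed:
            del remaining[u]
        for neighbours in remaining.values():
            neighbours -= removed
    return chosen


# Hypothetical example: a path on five vertices, 1-2-3-4-5.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
independent = greedy_independent_set(path)
print(independent)  # [1, 3, 5]
```

The returned set is maximal (no further vertex can be added) but, as with any greedy heuristic, it is not guaranteed to be a maximum independent set.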



















Table 3-9. Structure of maximum cliques in the market graph for different time
periods (θ = 0.5)


Period Stocks included into maximum clique
1 BK, EMC, FBF, HAL, HP, INTC, NCC, NOI, NOK, PDS, PMCS, QQQ, RF, SII, SLB,
SPY, TER, WM
2 ADI, ALTR, AMAT, AMCC, ATML, CSCO, KLAC, LLTC, LSCC, MDY, MXIM, NVLS,
PMCS, QQQ, SPY, SUNW, TXN, VTSS, XLNX
3 AMAT, AMCC, CREE, CSCO, EMC, JDSU, KLAC, LLTC, LSCC, MDY, MXIM,
NVLS, PHG, PMCS, QLGC, QQQ, SEBL, SPY, STM, SUNW, TQNT, TXCC, TXN,
VRTS, VTSS, XLK, XLNX
4 AMAT, AMCC, ASML, ATML, BRCM, CHKP, CIEN, CREE, CSCO, EMC, FLEX,
JDSU, KLAC, LSCC, MDY, MXIM, NTAP, NVLS, PMCS, QLGC, QQQ, RFMD,
SEBL, SPY, STM, SUNW, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX
5 ALTR, AMAT, AMCC, ASML, ATML, BRCM, CIEN, CREE, CSCO, EMC, FLEX,
IDTI, IRF, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, LSI, MDY, MXIM, NTAP,
NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SEBL, SPY, STM, SUNW, SWKS, TQNT,
TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX
6 ADI, ALTR, AMAT, AMCC, ASML, ATML, BEAS, BRCM, CIEN, CREE, CSCO, CY,
ELX, EMC, FLEX, IDTI, ITWO, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, LSI,
MDY, MXIM, NTAP, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SEBL, SPY, STM,
SUNW, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX
7 ALTR, AMAT, AMCC, ATML, BEAS, BRCD, BRCM, CHKP, CIEN, CNXT, CREE,
CSCO, CY, DIGL, EMC, FLEX, HHH, ITWO, JDSU, JNPR, KLAC, LLTC, LRCX,
LSCC, MDY, MERQ, MXIM, NEWP, NTAP, NVLS, ORCL, PMCS, QLGC, QQQ,
RBAK, RFMD, SCMR, SEBL, SPY, SSTI, STM, SUNW, SWKS, TQNT, TXCC, TXN,
VRSN, VRTS, VTSS, XLK, XLNX
8 ALTR, AMAT, AMCC, AMKR, ARMHY, ASML, ATML, AVNX, BEAS, BRCD,
BRCM, CHKP, CIEN, CMRC, CNXT, CREE, CSCO, CY, DIGL, ELX, EMC, EXTR,
FLEX, HHH, IDTI, ITWO, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, MDY, MERQ,
MRVC, MXIM, NEWP, NTAP, NVLS, ORCL, PMCS, QLGC, QQQ, RFMD, SCMR,
SEBL, SNDK, SPY, SSTI, STM, SUNW, SWKS, TQNT, TXCC, TXN, VRSN, VRTS,
VTSS, XLK, XLNX
9 ADI, ALTR, AMAT, AMCC, ARMHY, ASML, ATML, AVNX, BDH, BEAS, BHH,
BRCM, CHKP, CIEN, CLS, CREE, CSCO, CY, DELL, ELX, EMC, EXTR, FLEX,
HHH, IAH, IDTI, IIH, INTC, IRF, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, LSI,
MDY, MXIM, NEWP, NTAP, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SCMR,
SEBL, SNDK, SPY, SSTI, STM, SUNW, SWKS, TQNT, TXCC, TXN, VRSN, VRTS,
VTSS, XLK, XLNX
10 ADI, ALTR, AMAT, AMCC, AMD, ASML, ATML, BDH, BHH, BRCM, CIEN, CLS,
CREE, CSCO, CY, CYMI, DELL, EMC, FCS, FLEX, HHH, IAH, IDTI, IFX, IIH, IJH,
IJR, INTC, IRF, IVV, IVW, IWB, IWF, IWM, IWV, IYV, IYW, IYY, JBL, JDSU,
KLAC, KOPN, LLTC, LRCX, LSCC, LSI, LTXX, MCHP, MDY, MXIM, NEWP, NTAP,
NVDA, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SANM, SEBL, SMH, SMTC,
SNDK, SPY, SSTI, STM, SUNW, TER, TQNT, TXCC, TXN, VRTS, VSH, VTSS,
XLK, XLNX
11 ADI, ALA, ALTR, AMAT, AMCC, AMD, ASML, ATML, BDH, BEAS, BHH, BRCM,
CIEN, CLS, CNXT, CREE, CSCO, CY, CYMI, DELL, EMC, EXTR, FCS, FLEX,
HHH, IAH, IDTI, IIH, IJH, IJR, INTC, IRF, IVV, IVW, IWB, IWF, IWM, IWO, IWV,
IWZ, IYV, IYW, IYY, JBL, JDSU, JNPR, KLAC, KOPN, LLTC, LRCX, LSCC, LSI,
LTXX, MCRL, MDY, MKH, MRVC, MXIM, NEWP, NTAP, NVDA, NVLS, PHG,
PMCS, QLGC, QQQ, RFMD, SANM, SEBL, SMH, SMTC, SNDK, SPY, SSTI, STM,
SUNW, TER, TQNT, TXN, VRTS, VSH, VTSS, XLK, XLNX









Table 3-10. Size of independent sets in the market graph found using the greedy
heuristic (θ = 0.0). Edge density and clustering coefficient are given
for the complementary graph.

Period Number of Edge Clustering Independent
vertices density coefficient set size
1 5430 0.258 0.293 11
2 5507 0.275 0.307 11
3 5593 0.281 0.307 10
4 5666 0.265 0.297 11
5 5768 0.260 0.292 11
6 5866 0.254 0.288 11
7 6013 0.228 0.269 11
8 6104 0.227 0.268 10
9 6262 0.238 0.277 12
10 6399 0.228 0.269 12
11 6556 0.201 0.245 11


3.4.3 Minimum Clique Partition of the Market Graph

Besides analyzing the maximum cliques in the market graph, one can also

divide the market graph into the smallest possible set of distinct cliques. As it

was pointed out above, the partition of a dataset into sets (clusters) of elements

grouped according to a certain criterion is referred to as clustering.

For finding a clique partition, we choose the instance of the market graph

with a low correlation threshold θ = 0.05 (the mean of the correlation coefficients

distribution shown in Figure 3-7), which would ensure that the edge density of the

considered graph is high enough and the number of isolated vertices (which would

obviously form distinct cliques) is small.

We use the standard greedy heuristic to compute a clique partition in the

market graph: recursively find a maximal clique and remove it from the graph,

until no vertices remain. Cliques are computed using the previously described greedy

algorithm. The corresponding results for the market graph with threshold θ = 0.05

are presented in Table 3-11. Note that the size of the largest clique in the partition

is increasing from one period to another, with the largest clique in the last period









Table 3-11. The largest clique size and the number of cliques in computed clique
partitions (θ = 0.05)

Period  Number of  Edge     Largest clique    # of cliques in
        vertices   density  in the partition  the partition
1       5430       0.400    469               494
2       5507       0.377    552               517
3       5593       0.379    636               513
4       5666       0.405    743               503
5       5768       0.413    789               501
6       5866       0.425    824               496
7       6013       0.469    929               471
8       6104       0.475    983               470
9       6262       0.456    997               509
10      6399       0.474    1159              501
11      6556       0.521    1372              479


containing about three times as many vertices as the corresponding clique in

the first partition. At the same time, the number of cliques in the partition is

comparable for different periods, with a slight overall trend towards decrease,

whereas the number of vertices is increasing over time.
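
The partition heuristic described above (repeatedly find a maximal clique and remove it) can be sketched as follows. The clique-growing rule used here, which always adds the candidate with the most neighbours among the remaining candidates, is an assumed stand-in for the "previously described greedy algorithm"; the function names are likewise illustrative.

```python
def greedy_clique(adj, vertices):
    """Grow a maximal clique inside `vertices`, always adding the candidate
    with the most neighbours among the remaining candidates."""
    candidates = set(vertices)
    clique = []
    while candidates:
        v = max(candidates, key=lambda u: (len(adj[u] & candidates), u))
        clique.append(v)
        candidates &= adj[v]  # keep only vertices adjacent to every member
    return clique


def greedy_clique_partition(adj):
    """Repeatedly extract a maximal clique until no vertices remain."""
    remaining = set(adj)
    partition = []
    while remaining:
        clique = greedy_clique(adj, remaining)
        partition.append(clique)
        remaining -= set(clique)
    return partition


# Hypothetical example: triangle a-b-c, pendant vertex d, isolated vertex e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
partition = greedy_clique_partition(graph)
```

In this toy graph the triangle is extracted first, after which `d` and `e` form singleton cliques, which mirrors the remark that isolated vertices form distinct cliques.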

3.5 Concluding Remarks

A graph representation of the stock market data and the interpretation of the

properties of this graph give a new insight into the internal structure of the stock

market. In this chapter, we have studied different characteristics of the market graph

and their evolution over time and came to several interesting conclusions based on

our analysis. It turns out that the power-law structure of the market graph is quite

stable over the considered time intervals; therefore one can say that the concept of

self-organized networks, which was mentioned above, is applicable in finance, and in

this sense the stock market can be considered as a "self-organized" system.

Another important result is the fact that the edge density of the market graph,

as well as the maximum clique size, has steadily increased during the last several years,

which supports the well-known idea about the globalization of economy which has

been widely discussed recently.








We have also indicated the natural way of dividing the set of financial instru-

ments into groups of similar objects (clustering) by computing a clique partition of

the market graph. This methodology can be extended by considering quasi-cliques

in the partition, which may reduce the number of obtained clusters. Moreover,

finding independent sets in the market graph provides a new approach to choosing

diversified portfolios where all stocks are pairwise uncorrelated, which is potentially

useful in practice.















CHAPTER 4
NETWORK-BASED TECHNIQUES IN ELECTROENCEPHALOGRAPHIC
(EEG) DATA ANALYSIS AND EPILEPTIC BRAIN MODELING

The human brain is one of the most complex systems ever studied by scientists.

The enormous number of neurons and the dynamic nature of connections between

them make the analysis of brain function especially challenging. One of the most

important directions in studying the brain is treating disorders of the central

nervous system. For instance, epilepsy is a common form of such disorders, which

affects approximately 1% of the human population. Essentially, epileptic seizures

represent excessive and hypersynchronous activity of the neurons in the cerebral

cortex.

During the last several years, significant progress in the field of epileptic

seizures prediction has been made. The advances are associated with the extensive

use of electroencephalograms (EEG), which can be treated as a quantitative representation

of the brain function. Rapid development of computational equipment

has made it possible to store and process huge amounts of EEG data obtained from

recording devices. The availability of these massive datasets gives rise to another

problem: utilizing mathematical tools and data mining techniques for extracting

useful information from EEG data. Is it possible to construct a "simple" mathematical

model based on EEG data that would reflect the behavior of the epileptic

brain?

In this chapter, we make an attempt to create such a model using a network-

based approach.

In the case of the human brain and EEG data, we apply a relatively simple

network-based approach. We represent the electrodes used for obtaining the EEG









readings, which are located in different parts of the brain, as the vertices of the

constructed graph. The data received from every single electrode is essentially a

time series reflecting the change of the EEG signal over time. Later in the chapter

we will discuss the quantitative measure characterizing statistical relationships

between the recordings of every pair of electrodes, the so-called T-index. The values

of the T-index Tij measured for all pairs of electrodes i and j enable us to establish

certain rules of placing edges connecting different pairs of vertices i and j depending

on the corresponding values of Tij. Using this technique, we develop several

graph-based mathematical models and study the dynamics of the structural prop-

erties of these graphs. As we will see, these models can provide useful information

about the behavior of the brain prior to, during, and after an epileptic seizure.

4.1 Statistical Preprocessing of EEG Data

4.1.1 Datasets.

The datasets consist of continuous long-term (3 to 12 days) multichannel

intracranial EEG recordings that had been acquired from 4 patients with medically

intractable temporal lobe epilepsy. Each record included a total of 28 to 32

intracranial electrodes (8 subdural and 6 hippocampal depth electrodes for each

cerebral hemisphere). A diagram of electrode locations is provided in Figure 4-1.

4.1.2 T-statistics and STLmax

In this subsection we give a brief introduction to nonlinear measures and

statistics used to analyze EEG data (for more information see [67, 69, 101]).

Since the brain is a nonstationary system, algorithms used to estimate

measures of the brain dynamics should be capable of automatically identifying and

appropriately weighing existing transients in the data. In a chaotic system, orbits

originating from similar initial conditions (nearby points in the state space) diverge

exponentially (expansion process). The rate of divergence is an important aspect

of the system dynamics and is reflected in the value of Lyapunov exponents. The















[Figure 4-1: two brain diagrams, (A) and (B), showing the electrode strips labeled AR, AL, BR, BL, CR, and CL.]

Figure 4-1. Electrode placement in the brain: (A) inferior transverse and (B)
lateral views of the brain, illustrating approximate depth and subdural
electrode placement for EEG recordings. Subdural
electrode strips are placed over the left orbitofrontal (AL), right orbitofrontal
(AR), left subtemporal (BL), and right subtemporal (BR)
cortex. Depth electrodes are placed in the left temporal depth (CL)
and right temporal depth (CR) to record hippocampal activity.









method used for estimation of the short time largest Lyapunov exponent STLmax,

an estimate of Lmax for nonstationary data, is explained in detail in [66, 68, 118].

By splitting the EEG time series recorded from each electrode into a sequence

of non-overlapping segments, each 10.24 sec in duration, and estimating STLmax

for each of these segments, profiles of STLmax over time are generated.

Having estimated the STLmax temporal profiles at an individual cortical site,

and as the brain proceeds towards the ictal state, the temporal evolution of the

stability of each cortical site is quantified. The spatial dynamics of this transition

are captured by consideration of the relations of the STLmax between different

cortical sites. For example, if a similar transition occurs at different cortical

sites, the STLmax of the involved sites are expected to converge to similar values

prior to the transition. Such participating sites are called "critical sites", and

such a convergence "dynamical entrainment". More specifically, in order for the

dynamical entrainment to have a statistical content, we allow a period over which

the difference of the means of the STLmax values at two sites is estimated. We use

periods of 10 minutes (i.e., moving windows including approximately 60 STLmax

values over time at each electrode site) to test the dynamical entrainment at the

0.01 statistical significance level. We employ the T-index (from the well-known

paired T-statistics for comparisons of means) as a measure of distance between the

mean values of pairs of STLmax profiles over time. The T-index at time t between

electrode sites i and j is defined as:


\[
T_{i,j}(t) = \sqrt{N} \times \frac{\left| E\{ STL_{\max,i} - STL_{\max,j} \} \right|}{\sigma_{i,j}(t)} \qquad (4\text{-}1)
\]

where E{·} is the sample average of the differences STLmax,i - STLmax,j

estimated over a moving window w_t(λ) defined as:

\[
w_t(\lambda) =
\begin{cases}
1 & \text{if } \lambda \in [t - N - 1,\, t], \\
0 & \text{if } \lambda \notin [t - N - 1,\, t],
\end{cases}
\]

where N is the length of the moving window. Then, σi,j(t) is the sample standard

deviation of the STLmax differences between electrode sites i and j within the

moving window w_t(λ). The T-index follows a t-distribution with N - 1 degrees of

freedom. For the estimation of the Tij(t) indices in our data we used N = 60 (i.e.,

average of 60 differences of STLmax exponents between sites i and j per moving

window of approximately 10 minute duration). Therefore, a two-sided t-test with

N - 1 (= 59) degrees of freedom, at a statistical significance level α, should be

used to test the null hypothesis, H0: "brain sites i and j acquire identical STLmax

values at time t". In this experiment, we set the probability of a type I error

α = 0.01 (i.e., the probability of falsely rejecting H0 if H0 is true is 1%). For the

T-index to pass this test, the Tij(t) value should be within the interval [0, 2.662].

We will refer to the upper bound of this interval as Tcritical.
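
To make the definition concrete, the following sketch computes a paired T-index for one window of STLmax values at two sites and compares it with the critical value 2.662 quoted above. The function names and the synthetic input series are assumptions introduced for illustration; real STLmax profiles would come from the estimation procedure cited in the text.

```python
from math import sqrt
from statistics import mean, stdev

T_CRITICAL = 2.662  # value quoted in the text (alpha = 0.01, 59 degrees of freedom)


def t_index(stl_i, stl_j):
    """Paired T-index for two equally long windows of STLmax values.
    Raises ZeroDivisionError if all differences are identical."""
    diffs = [a - b for a, b in zip(stl_i, stl_j)]
    n = len(diffs)
    return sqrt(n) * abs(mean(diffs)) / stdev(diffs)  # stdev = sample std. dev.


def entrained(stl_i, stl_j, t_critical=T_CRITICAL):
    """True if the pair passes the test, i.e. the T-index is within [0, t_critical]."""
    return t_index(stl_i, stl_j) <= t_critical


# Synthetic 60-point windows (hypothetical values, not EEG data).
site_a = [5.0 + 0.1 * (k % 3) for k in range(60)]
site_b = [5.0 + 0.1 * ((k + 1) % 3) for k in range(60)]  # same mean as site_a
site_c = [v + 1.0 for v in site_b]                        # mean shifted by 1.0

print(entrained(site_a, site_b))  # True
print(entrained(site_a, site_c))  # False
```

Sites with nearly equal window means give a T-index near zero and count as entrained, while a systematic offset between the two profiles drives the index far above the critical value.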

4.2 Graph Structure of the Epileptic Brain

4.2.1 Key Idea of the Model

If we model the brain (with epilepsy) by a graph (where nodes are "functional

units" of the system and edges are connections between them) we need to answer

the following questions: what properties the model has, i.e., what the properties of

this graph are, and how the properties of the graph change prior to, during, and after

epileptic seizures. We try to answer these questions using the following idea: we

study the system of the electrodes as a weighted graph where nodes are electrodes

and weights of the edges between nodes are values of the corresponding T-index.

More specifically, we consider three types of graphs constructed using this principle:

* GRAPH-I is a complete graph, i.e., it has all possible edges,

* GRAPH-II is obtained from the complete graph by removing all the edges
(i,j) for which the corresponding value of Tij is greater than Tcritical,

* GRAPH-III is obtained from the complete graph by removing all the edges
(i,j) for which the corresponding value of Tij is less than Tcritical 10 minutes
after the seizure point or greater than Tcritical at the seizure point.
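
The three constructions can be sketched with edge sets keyed by electrode pairs. This is an illustrative reading of the definitions: the dictionary layout, the function names, and the treatment of values exactly equal to Tcritical are assumptions, and the toy T-index values are invented for the example.

```python
T_CRITICAL = 2.662  # upper bound of the acceptance interval given in Section 4.1.2


def graph_ii(t_index, t_critical=T_CRITICAL):
    """t_index: dict {(i, j): T value}, i < j. Keep only entrained pairs."""
    return {edge for edge, t in t_index.items() if t <= t_critical}


def graph_iii(t_at_seizure, t_after, t_critical=T_CRITICAL):
    """Keep pairs entrained at the seizure point but no longer entrained
    10 minutes later (both dicts share the same keys)."""
    return {edge for edge in t_at_seizure
            if t_at_seizure[edge] <= t_critical and t_after[edge] >= t_critical}


# Hypothetical T-index values for three electrodes 0, 1, 2.
t_seizure = {(0, 1): 1.0, (0, 2): 3.5, (1, 2): 2.0}
t_later = {(0, 1): 4.0, (0, 2): 5.0, (1, 2): 1.5}

edges_ii = graph_ii(t_seizure)              # entrained at the seizure point
edges_iii = graph_iii(t_seizure, t_later)   # entrained, then disentrained
```

GRAPH-I needs no filtering, since it is the complete graph whose edge weights are simply the values stored in the dictionary.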









4.2.1.1 Interpretation of the Considered Graph Models

Before proceeding with the further discussion, we need to give a conceptual

interpretation of the ideas lying behind introducing the aforementioned graphs.

* GRAPH-I contains all the edges connecting the considered brain sites, and
it is considered in order to reflect the general distribution of the values of
T-indices between each pair of vertices (i.e., the weights of the corresponding
edges).

* GRAPH-II contains only the edges connecting the brain sites (electrodes) that
are statistically entrained at a certain time, which means that they exhibit a
similar behavior. Recall that a pair of electrodes is considered to be entrained
if the value of the corresponding T-index between them is less than Tcritical;
that is why we remove all the edges with the weights greater than Tcritical. The
main point of our interest is studying the evolution of the properties of this
graph over time. As we will see in the next subsections, this analysis can help
in revealing the dynamical patterns underlying the functioning of the brain
during preictal, ictal, postictal, and interictal states. Therefore, this graph can
be used as a basis for the mathematical model describing some characteristics
of the epileptic brain.

* GRAPH-III is constructed to reflect the connections only between those
electrodes that are entrained during the seizure, but are not entrained 10
minutes after the seizure. The motivation for introducing this graph is the
existence of "resetting" of the brain after the seizure [70, 108, 111], which
is essentially the divergence of the profiles of the STLmax time series. As it
was indicated above, this divergence is characterized by the values of T-index
greater than Tcritical.

4.2.2 Properties of the Graphs

In this subsection, we investigate the properties of the considered graph models

and give an intuitive explanation of the observed results. As we will see, there are

specific tendencies in the evolution of the properties of the considered graphs prior

to, during, and after epileptic seizures, which indicates that the proposed models

capture certain trends in the behavior of the epileptic brain.

4.2.2.1 Edge Density

Recall that GRAPH-II was introduced to reflect the connections between

brain sites that are statistically entrained at a certain time moment. Figure 4-2




















[Figure 4-2: plot of the number of edges versus time in minutes.]

Figure 4-2. Number of edges in GRAPH-II


illustrates the typical evolution of the number of edges in GRAPH-II over time.

As it was indicated above, edge density of the graph is proportional to the number

of edges in a graph. It is easy to notice that the number of edges in GRAPH-II

dramatically increases at seizure points (represented by dashed vertical lines), and

it decreases immediately after seizures. It means that the global structure of the

graph significantly changes during the seizure and after the seizure, i.e., the density

of GRAPH-II increases during the ictal state and decreases in the postictal state, which supports the

idea that the epileptic brain (and GRAPH-II as the model of the brain) experiences

a "phase transition" during the seizure.









4.2.2.2 Connectivity

Another important property of GRAPH-II that we are interested in is its

connectivity. We need to check if this graph is connected prior to, during, and

after epileptic seizures, and if not, find the size of its largest connected component.

Clearly, this information will also be helpful in the analysis of the structural

properties of the brain. If GRAPH-II is connected (i.e., the size of the largest

connected component is equal to the number of vertices in the graph), then all the

functional units of the brain are "linked" with each other by a path, and in this

case the brain can be treated as an "integrated" system; however, if the size of the

largest connected component in GRAPH-II is significantly smaller than the total

number of the vertices, it means that the brain becomes "separated" into smaller

disjoint subsystems.

The size of the largest connected component of GRAPH-II is presented in

Figure 4-3. One can see that GRAPH-II is connected during the interictal period

(i.e., the brain is a connected system); however, it becomes disconnected after the

seizure (during the postictal state): the size of the largest connected component

significantly decreases. This fact is not surprising and can be intuitively explained,

since after the seizure the brain needs some time to "reset" [70, 108, 111] and

restore the connections between the functional units.
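
The connectivity check itself is a standard graph traversal. The sketch below, with an assumed function name and a small made-up edge list, finds the size of the largest connected component by breadth-first search; the graph is connected exactly when this size equals the number of vertices.

```python
from collections import deque


def largest_component_size(n, edges):
    """n: number of vertices labeled 0..n-1; edges: iterable of (i, j) pairs."""
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)

    seen = set()
    best = 0
    for start in range(n):
        if start in seen:
            continue
        # Breadth-first search over the component containing `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            v = queue.popleft()
            size += 1
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        best = max(best, size)
    return best


# Hypothetical six-electrode example with components {0,1,2}, {3,4}, {5}.
size = largest_component_size(6, [(0, 1), (1, 2), (3, 4)])
print(size)  # 3
```

Applied to GRAPH-II at successive time points, this quantity gives the kind of curve shown in Figure 4-3.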

4.2.2.3 Minimum Spanning Tree

The next subject of our discussion is the analysis of minimum spanning trees

of GRAPH-I, which was defined as the graph with all possible edges, where each

edge (i, j) has the weight equal to the value of T-index Tij corresponding to brain

sites i and j. The definition of Minimum Spanning Tree was given in Section 2.

Studying minimum spanning trees in GRAPH-I is motivated by the hypothesis

that the seizure signal in the brain propagates to all functional units according

to the minimum spanning tree, i.e., along the edges with small values of Tij. This







[Figure 4-3: plot of the size of the largest connected component versus time in minutes.]

Figure 4-3. The size of the largest connected component in GRAPH-II. Number of
nodes in the graph is 30.


hypothesis is partially supported by the behavior of the average T-index of the

edges corresponding to the Minimum Spanning Tree of GRAPH-I, which is shown

in Figure 4-4.

However, this hypothesis cannot be verified using the considered data, since

the values of average T-indices are calculated over a 10-minute interval, whereas

the seizure signal propagates in a fraction of a second. Therefore, in order to

check if the seizure signal actually spreads along the minimum spanning tree, one

needs to introduce other nonlinear measures to reflect the behavior of the brain

over short time intervals.











Figure 4-4. Average value of the T-index of the edges in the Minimum Spanning Tree of
GRAPH-I over time (in minutes).


Also, note that the average value of the T-index in the Minimum Spanning

Tree is less than T_critical, which also supports the above statement about the

connectivity of the system.
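
As an illustration of the quantity plotted in Figure 4-4, the sketch below builds a minimum spanning tree of a complete weighted graph and averages its edge weights. It uses Prim's algorithm, which is one standard choice and not necessarily the one used in the study; the function names and the small weight matrix are hypothetical.

```python
def minimum_spanning_tree(W):
    """Prim's algorithm on a symmetric weight matrix W; returns the tree edges (i, j)."""
    n = len(W)
    in_tree = [False] * n
    best_cost = [float("inf")] * n   # cheapest known edge linking each vertex to the tree
    best_from = [-1] * n             # tree endpoint of that cheapest edge
    best_cost[0] = 0.0
    tree = []
    for _ in range(n):
        # Pick the vertex outside the tree with the cheapest connecting edge.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best_cost[v])
        in_tree[u] = True
        if best_from[u] >= 0:
            tree.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v] and W[u][v] < best_cost[v]:
                best_cost[v] = W[u][v]
                best_from[v] = u
    return tree

def average_tree_weight(W, tree):
    """Average weight (here: average T-index) over the edges of the tree."""
    return sum(W[i][j] for i, j in tree) / len(tree)

# Toy T-index matrix for three sites (made-up values).
W = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
tree = minimum_spanning_tree(W)
print(tree, average_tree_weight(W, tree))
```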

4.2.2.4 Degrees of the Vertices

Another important issue that we analyze here is the degrees of the vertices in

GRAPH-II. Recall that the degree of a vertex is defined simply as the number of

edges emanating from it.

We look at the behavior of the average degree of the vertices in GRAPH-II

over time. Clearly, this plot is very similar to the behavior of the edge density of

GRAPH-II (see Figure 4-5).
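
The similarity between the two plots follows from a standard identity. Writing n for the number of vertices, |E| for the number of edges, and \rho for the edge density (the notation is introduced here for illustration only), the average degree is a constant multiple of the density:

```latex
\bar{d} \;=\; \frac{1}{n}\sum_{v \in V} \deg(v) \;=\; \frac{2|E|}{n},
\qquad
\rho \;=\; \frac{|E|}{n(n-1)/2}
\quad\Longrightarrow\quad
\bar{d} \;=\; (n-1)\,\rho .
```

For a graph with n = 30 vertices, as stated in the caption of Figure 4-3, this gives \bar{d} = 29\rho.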







Figure 4-5. Average degree of the vertices in GRAPH-II over time (in minutes).


We are also particularly interested in high-degree vertices, i.e., the functional

units of the brain that are at a certain time moment connected (entrained) with

many other brain sites. Interestingly enough, the vertex with the maximum degree

in GRAPH-II usually corresponds to an electrode located in the RTD (right

temporal depth) or RST (right subtemporal cortex) region; in other words, the vertex

with the maximum degree is located near the epileptogenic focus.

4.2.2.5 Maximum Cliques

In the previous works in the field of epileptic seizure prediction, a quadratic

0-1 programming approach based on EEG data was introduced [69]. In fact, this

approach utilizes the same preprocessing technique (i.e., calculating the values

of T-indices for all pairs of electrode sites) as we apply in this chapter. In this









subsection, we will briefly describe this quadratic programming technique and

relate it to the graph models introduced above.

The main idea of the considered quadratic programming approach is to

construct a model that would select a certain number of so-called "critical"

electrode sites, i.e., those that are the most entrained during the seizure. According

to Section 3, such a group of electrode sites should produce a minimal sum of

T-indices calculated for all pairs of electrodes within this group. If the number of

critical sites is set equal to k, and the total number of electrode sites is n, then the

problem of selecting the optimal group of sites can be formulated as the following

quadratic 0-1 problem [69]:



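\begin{align}
\min_{x}\quad & x^{T} A x \tag{4-2}\\
\text{s.t.}\quad & \sum_{i=1}^{n} x_i = k, \tag{4-3}\\
& x_i \in \{0,1\} \quad \forall i \in \{1,\dots,n\}. \tag{4-4}
\end{align}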

In this setup, the vector x = (x_1, x_2, ..., x_n) consists of the components equal

to either 1 (if the corresponding site is included into the group of critical sites) or 0

(otherwise), and the elements of the matrix A = [a_ij], i, j = 1, ..., n, are the values of T_ij

at the seizure point.

However, as was shown in previous studies, one can observe the

"resetting" of the brain after seizure onset [111, 70, 108], that is, the divergence of

STLmax profiles after a seizure. Therefore, to ensure that the optimal group of

critical sites shows this divergence, one can reformulate this optimization problem

by adding one more quadratic constraint:


\begin{equation}
x^{T} B x > T_{\mathrm{critical}}\, k (k - 1), \tag{4-5}
\end{equation}









where the matrix B = [b_ij], i, j = 1, ..., n, is the T-index matrix of brain sites i and j

within 10 minute windows after the onset of a seizure.

This problem is then solved using standard techniques, and the group of k

critical sites is found. It should be pointed out that the number of critical sites k is

predetermined, i.e., it is defined empirically, based on practical observations. Also,

note that in terms of GRAPH-I model this problem represents finding a subgraph

of GRAPH-I of a fixed size, satisfying the properties specified above.
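
For small instances, the selection problem above can be illustrated by plain enumeration of all groups of k sites. This is only a sketch: exhaustive search is a stand-in for the "standard techniques" mentioned in the text, which are not specified here, and the function name and the toy matrices are hypothetical.

```python
from itertools import combinations

def select_critical_sites(A, k, B=None, t_critical=None):
    """Return (group, value) minimizing x^T A x over 0-1 vectors with exactly k ones.

    If B and t_critical are given, only groups with x^T B x > t_critical * k * (k - 1)
    are considered, mirroring the additional quadratic constraint.
    """
    n = len(A)
    best_group, best_value = None, None
    for group in combinations(range(n), k):
        # For a 0-1 vector x, x^T M x is the sum of M[i][j] over selected i and j.
        if B is not None:
            after = sum(B[i][j] for i in group for j in group)
            if not after > t_critical * k * (k - 1):
                continue
        value = sum(A[i][j] for i in group for j in group)
        if best_value is None or value < best_value:
            best_group, best_value = group, value
    return best_group, best_value

# Made-up T-index matrices for four sites, at the seizure point (A) and afterwards (B).
A = [[0, 1, 5, 5], [1, 0, 5, 5], [5, 5, 0, 2], [5, 5, 2, 0]]
B = [[0, 1, 9, 9], [1, 0, 9, 9], [9, 9, 0, 9], [9, 9, 9, 0]]
print(select_critical_sites(A, 2))
print(select_critical_sites(A, 2, B=B, t_critical=3))
```

In the toy data, sites 0 and 1 are the most entrained pair at the seizure point but do not diverge afterwards, so adding the constraint changes the selected group.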

Now, recall that we introduced GRAPH-III using the same principles as

in the formulation of the above optimization problem, that is, we considered

the connections only between the pairs of sites i, j satisfying both of the two

conditions: T_ij < T_critical at the seizure point, and T_ij > T_critical 10 minutes after

the seizure point, which are exactly the conditions that the critical sites must

satisfy. A natural way of detecting such groups of sites is to find cliques in

GRAPH-III. Since a clique is a subgraph where all vertices are interconnected, it

means that all pairs of electrode sites in a clique would satisfy the aforementioned

conditions. Therefore, it is clear that the size of the maximum clique in GRAPH-

III would represent the upper bound on the number of selected critical sites, i.e.,

the maximum value of the parameter k in the optimization problem described

above.
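
The construction of GRAPH-III and the search for a maximum clique can be sketched as follows. The clique search uses the basic Bron-Kerbosch enumeration, chosen here for brevity and practical only for small graphs such as these; it is not claimed to be the algorithm used for the reported results, and the names and toy matrices are hypothetical.

```python
def build_graph_iii(T_seizure, T_after, t_critical):
    """Adjacency sets: i and j are linked if T_ij < t_critical at the seizure point
    and T_ij > t_critical after the seizure."""
    n = len(T_seizure)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if T_seizure[i][j] < t_critical and T_after[i][j] > t_critical:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def maximum_clique(adj):
    """Largest clique found by enumerating all maximal cliques (Bron-Kerbosch)."""
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            # `clique` cannot be extended, so it is maximal.
            if len(clique) > len(best):
                best = list(clique)
            return
        for v in list(candidates):
            expand(clique + [v], candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand([], set(adj), set())
    return sorted(best)

# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(maximum_clique(toy))
```

The size of the returned clique is the upper bound on k discussed above.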

Computational results indicate that the maximum clique sizes for different

instances of GRAPH-III are close to the actual values of k empirically selected

in the quadratic programming model, which shows that these approaches are

consistent with each other.

4.3 Graph as a Macroscopic Model of the Epileptic Brain

Based on the results obtained in the sections above, we now can formulate the

graph model which describes the behavior of the epileptic brain at the macroscopic









level. The main idea of this model is to use the properties of GRAPH-I, GRAPH-

II, and GRAPH-III as a characterization of the behavior of the brain prior to,

during, and after epileptic seizures. According to this graph model, the graphs

reflecting the behavior of the epileptic brain demonstrate the following properties:

* Increase and decrease of the edge density and the average degree of the
vertices during and after the seizures respectively;

* The graph is connected during the interictal state, however, it becomes
disconnected right after the seizures (during the postictal state);

* The vertex with the maximum degree corresponds to the epileptogenic focus.

Moreover, one of the advantages of the considered graph model is the possi-

bility to detect special formations in these graphs, such as cliques and minimum

spanning trees, which can be used for further studying of various properties of the

epileptic brain.

4.4 Concluding Remarks and Directions of Future Research

In this chapter, we have made the initial attempt to analyze EEG data and

model the epileptic brain using network-based approaches. Despite the fact that

the size of the constructed graphs is rather small, we were able to determine

specific patterns in the behavior of the epileptic brain based on the information

obtained from statistical analysis of EEG data. Clearly, this model can be made

more accurate by considering more electrodes corresponding to smaller functional

units.

Among the directions of future research in this field, one can mention the

possibility of developing directed graph models based on the analysis of EEG data.

Such models would take into account the natural asymmetry of the brain, where

certain functional units control the other ones. Also, one could apply a similar

approach to studying the patterns underlying the brain function of the patients

with other types of disorders, such as Parkinson's disease, or sleep disorder.








Therefore, the methodology introduced in this chapter can be generalized and

applied in practice.















CHAPTER 5
COLLABORATION NETWORKS IN SPORTS

In this chapter, we will discuss one of the most interesting real-life graph

applications, the so-called "social networks," where the vertices are real people [63,

116]. The main idea of this approach is to consider the "acquaintanceship graph"

connecting the entire human population. In this graph, an edge connects two given

vertices if the corresponding two persons know each other.

Social networks are associated with the famous "small-world" hypothesis, which

claims that despite the large number of vertices, the distance between any two

vertices (or, the diameter of the graph) is small. More specifically, the idea of

"six degrees of separation" has been introduced. It states that any two persons

in the world are linked with each other through a sequence of at most six people

[63, 116, 117].

Clearly, one cannot verify this hypothesis for the graph incorporating more

than 6 billion people living on the Earth, however, smaller subgraphs of the

acquaintanceship graph connecting certain groups of people can be investigated in

detail. One of the most well-known graphs of this type is the scientific collaboration

graph, reflecting the information about the joint works between all scientists. Two

vertices are connected by an edge if the corresponding two scientists have a joint

research paper. Another graph of this type is known as the "Hollywood graph": it

links all the movie actors, and an edge connects two actors if they ever appeared

in the same movie. Well-known concepts associated with these graphs are the

so-called "Erdős number" (in the scientific collaboration graph) and "Bacon number"

(in the Hollywood graph), which are assigned to every vertex and characterize

the distance from this vertex to the vertex denoting the "center" of the graph.









In the collaboration graph, the central vertex corresponds to the famous graph

theoretician Paul Erdős, whereas in the Hollywood graph the same position is

assigned to Kevin Bacon.

In this chapter, we discuss graphs of a similar type arising in sports, which

represent the players' "collaboration". In these graphs, the players are the vertices,

and an edge is added to the graph if the corresponding two players ever played

together on the same team. One example of this type of graph is the graph

representing baseball players. For any two baseball players who ever played in

Major League Baseball (MLB), a path connecting them can be found in this graph.

As another instance of social networks in sports, we study the "NBA graph"

where the vertices represent all the basketball players who are currently playing

in the NBA. We apply standard graph-theoretical algorithms for investigating the

properties of this graph, such as its connectivity and diameter (i.e., the maximum

distance between all pairs of vertices in the graph). As we will see later in the

chapter, this study also confirms the "small-world hypothesis." Moreover, we

introduce a distance measure in the NBA graph similar to the Erdős number and

the Bacon number. The central role in this graph is given to Michael Jordan, the

greatest basketball player of all time, and we refer to this measure as the Jordan

number.

5.1 Examples of Social Networks

In this section, we give a more detailed description of the examples of social

networks mentioned in the introduction: the scientific collaboration graph, the

Hollywood graph, and the baseball graph.

5.1.1 Scientific Collaboration Graph and Erdős Number

As it was mentioned above, the vertices of the scientific collaboration graph

are scientists, and the edges in this graph connect the scientists who have ever

collaborated with each other (i.e., had a joint paper). In order to measure the









distances in this graph, the "central vertex" is introduced. This vertex corresponds

to Paul Erdős, the father of the theory of random graphs. This vertex is assigned

an Erdős number equal to 0. For all other vertices in the graph, the Erdős number

is defined as the distance (i.e., the shortest path length) from the central vertex.

For example, those scientists who had a joint paper with Erdős have Erdős

number 1, those who did not collaborate with Erdős but collaborated with Erdős'

collaborators have Erdős number 2, etc.
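
The definition above is a shortest-path distance in an unweighted graph, so it can be computed by breadth-first search from the central vertex. The following is a minimal sketch; the function name and the five-vertex toy graph (with placeholder labels A to D) are hypothetical.

```python
from collections import deque

def collaboration_numbers(adj, center):
    """Shortest-path distance from `center` to every vertex reachable from it."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1   # one more collaboration step than u
                queue.append(w)
    return dist

# Toy collaboration graph: D has no joint papers and therefore gets no number.
adj = {
    "Erdos": ["A", "B"],
    "A": ["Erdos", "C"],
    "B": ["Erdos"],
    "C": ["A"],
    "D": [],
}
numbers = collaboration_numbers(adj, "Erdos")
print(numbers)
```

Vertices missing from the result lie outside the connected component of the central vertex.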

Following this logic, one can construct the connected component of the

collaboration graph with "concentric circles", which would incorporate almost all

scientists in the world, except those who never collaborated with anybody. This

connected component is expected to have a relatively small diameter.

The idea of constructing collaboration graphs encompassing people in different

areas gave rise to several other applications. Next, we discuss the Hollywood

graph and the baseball graph, where the number of vertices is significantly smaller

than in the scientific collaboration graph, which allows one to study their structure

in more detail.

5.1.2 Hollywood Graph and Bacon Number

The Hollywood graph is constructed using the same principles as the scientific

collaboration graph, however, the number of Hollywood actors is much smaller than

the number of scientists, therefore, one can investigate the characteristics of every

vertex in this graph. This information is maintained at the "Oracle of Bacon"

website.1 The most recent Hollywood graph contains 595,578 vertices (actors).

The central vertex in this graph represents the famous actor Kevin Bacon, and

this vertex obviously has Bacon number 0. Since the number of vertices in this

graph is small enough, one can explicitly calculate the Bacon number for every


1 http://www.cs.virginia.edu/oracle/











Figure 5-1. Number of vertices in the Hollywood graph with different values of
Bacon number. Average Bacon number = 2.946. Bar labels recovered from the plot
(Bacon number: number of vertices; assignment to bars inferred from the reported
total and average): 1: 1686; 2: 133856; 3: 364066; 4: 88058; 5: 6960; 6: 854;
7: 94; 8: 3.


actor. It turns out that most of the actors have Bacon numbers equal to 2 or 3,

and the maximum possible Bacon number is equal to 8, which is the case only for 3

vertices.

The distribution of Bacon numbers in the Hollywood graph is shown in Figure

5-1. The average Bacon number (i.e., the average path length from a given actor

to Bacon) is equal to 2.946. As one can see, both the average and the maximum

Bacon numbers of the Hollywood graph are very small, which provides an argument

in favor of the "small-world hypothesis" mentioned above.
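
The reported average can be checked from the distribution itself. The counts below are the bar labels of Figure 5-1; which label belongs to which Bacon number is an assumption made here, as is the single vertex with Bacon number 0 (Bacon himself). Under that assignment the counts add up to the stated 595,578 vertices.

```python
# Bacon number -> number of actors (assignment of labels to bars is assumed).
bacon_counts = {
    0: 1,
    1: 1686,
    2: 133856,
    3: 364066,
    4: 88058,
    5: 6960,
    6: 854,
    7: 94,
    8: 3,
}

total_actors = sum(bacon_counts.values())
# Weighted mean of the Bacon numbers over all vertices, center included.
average_bacon = sum(k * c for k, c in bacon_counts.items()) / total_actors

print(total_actors, round(average_bacon, 3))
```

The same weighted-mean computation applies to the Wynn and Jordan numbers discussed below.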

5.1.3 Baseball Graph and Wynn Number

Collaboration networks similar to the ones mentioned above can also be

constructed in sports. One example of such a network is the "baseball graph"

representing all baseball players who ever played in the MLB. In this graph, two

players are connected if they ever were teammates. The most recent baseball graph










Figure 5-2. Number of vertices in the baseball graph with different values of Wynn
number. Average Wynn number = 2.901. Bar labels recovered from the plot (Wynn
number: number of vertices; assignment to bars inferred from the reported total
and average): 1: 408; 2: 5286; 3: 6663; 4: 2472; 5: 899; 6: 88.


has 15817 vertices. Links between any pair of baseball players can be found at the

"Oracle of Baseball" website.2

One can assign the central role in this graph to Early Wynn, a member of the

Hall of Fame who spent 23 seasons in the MLB. Figure 5-2 shows the distribution

of Wynn numbers in the baseball graph. The maximum Wynn number is 6, which

is smaller than the maximum Bacon number since the total number of baseball players

is less than the number of Hollywood actors.

5.1.4 Diameter of Collaboration Networks

Another aspect that should be mentioned here is that the maximum distance from

the central vertex in the collaboration graphs certainly depends on the choice

of this central vertex. The reason for choosing Kevin Bacon as the center of the


2 http://www.baseball-reference.com/oracle/









Hollywood graph, and Early Wynn as the center of the baseball graph is the fact

that it is reasonable to expect them to be connected to many vertices: Bacon

appeared in many movies, and Wynn played for several baseball teams and had a lot

of teammates during his long career. However, one can choose less "connected"

centers of these graphs, and in this case the maximum distance from the new center

of the graph may significantly increase. For example, if one chooses Barry Bonds as

the center of the baseball graph, the maximum Bonds number will be 9 instead of

6. Moreover, in the Hollywood graph, it is possible to choose the center so that the

maximum distance from it is equal to 14, and the average distance is greater than 6

(instead of 2.946). Therefore, in order to have more complete information about

the structure of these graphs, one should calculate the maximum possible distance

among all pairs of vertices in the graph. Recall that this quantity is referred to as

the diameter of the graph. Clearly, the diameter can be found by considering each

vertex as the center of the graph, calculating corresponding maximal distances, and

then choosing the maximum among them.
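
The procedure just described, with every vertex taken in turn as the center, can be sketched as below. The function names and the path-graph example are hypothetical, and the average is taken over all vertices including the center, which is an assumption about the convention.

```python
from collections import deque

def distances_from(adj, center):
    """Breadth-first search distances from `center`."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def center_statistics(adj, center):
    """(maximum distance, average distance) from `center` in a connected graph."""
    dist = distances_from(adj, center)
    if len(dist) != len(adj):
        raise ValueError("graph is not connected")
    return max(dist.values()), sum(dist.values()) / len(adj)

def diameter(adj):
    """Largest of the maximum distances over all choices of the center."""
    return max(center_statistics(adj, v)[0] for v in adj)

# Path graph 0 - 1 - 2 - 3: an end vertex is a poor center, an inner vertex a better one.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(center_statistics(path, 0), center_statistics(path, 1), diameter(path))
```

Each search takes time proportional to the number of vertices plus edges, so the whole computation is one breadth-first search per vertex.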

In the next section, we study the properties of the NBA graph incorporating

basketball players playing in the world's best basketball league. In a similar

fashion, we introduce the Jordan number, investigate its values corresponding to

different vertices, and calculate the diameter of this graph.

5.2 NBA Graph

The NBA graph considered in this section is constructed using the same

idea as the graphs described above. Here we provide a detailed description of

the structural properties of this graph. As we will see, its properties are rather

similar to the properties of other social networks, which confirms the small-world

hypothesis.









5.2.1 General Properties of the NBA Graph

The instance of the NBA graph that we consider in this section is relatively

small and contains only those players who are currently playing in the NBA (as

of the season of 2002-2003). However, this information is sufficient to reveal that

the NBA graph follows similar patterns as other social networks. As of May 2003,

the total number of players in the rosters of all the NBA teams is equal to 404

(players picked in the 2003 NBA draft and transfers that occurred after the end

of the 2002-2003 season are not taken into account). An edge connects two given

players if they ever played on the same team. Consequently, the constructed NBA

graph has 404 vertices and 5492 edges connecting them. Note that the maximum

possible number of edges is equal to 404 × (404 − 1)/2 = 81406; therefore, the

edge density of this graph (i.e., the ratio of the number of edges to the maximum

possible number of edges) is rather small: 5492/81406 = 6.75%.

As one can easily see, this graph has a highly specific structure: the players

of every team form a clique in the graph (i.e., the set of completely interconnected

vertices), because all the vertices corresponding to the players of the same team

must be interconnected. Since many players change teams during or between the

seasons, there are edges connecting the vertices from different cliques (teams). Note

that this type of structure is common for all "collaboration networks" (see Figure

5-3).
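
The clique structure described above suggests a direct way to build such a graph from roster data: every roster contributes all pairs of its members. The sketch below assumes the input is a list of rosters (one per team and season); the function names and the toy rosters are hypothetical, and the final line reuses the vertex and edge counts quoted in the text.

```python
from itertools import combinations

def build_collaboration_graph(rosters):
    """Return (players, edges); each roster contributes a clique on its members."""
    players = set()
    edges = set()
    for roster in rosters:
        players.update(roster)
        # Sorting makes (a, b) and (b, a) the same key, so repeated pairs are stored once.
        for a, b in combinations(sorted(set(roster)), 2):
            edges.add((a, b))
    return players, edges

def edge_density(num_vertices, num_edges):
    """Ratio of the number of edges to the maximum possible number of edges."""
    return num_edges / (num_vertices * (num_vertices - 1) / 2)

# Two toy teams sharing player "c", who links the two cliques.
players, edges = build_collaboration_graph([["a", "b", "c"], ["c", "d"]])
print(len(players), len(edges), edge_density(len(players), len(edges)))

# Density of the NBA graph from the counts given in the text, in percent.
print(round(100 * edge_density(404, 5492), 2))
```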

It should be pointed out that the number of players in a basketball team is

relatively small, and the players' transfers between different teams occur rather

often; therefore, it would be logical to expect that the NBA graph should be

connected, i.e., there is a path from every vertex to every other vertex. Moreover, the

length of this path should be small. As we will see below, calculations

confirm these assumptions.















































Figure 5-3. General structure of the NBA graph and other collaboration networks


First, we used a standard breadth-first search technique for checking the

connectivity of the considered graph. Starting from an arbitrary vertex, we were

able to locate all other vertices in the graph, which means that every vertex is

reachable from another, therefore, the graph is connected. In the next subsection,

we will also see that every pair of vertices in this graph are connected by a short

path, which is in agreement with the "small-world hypothesis."
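
The connectivity test described above amounts to a single breadth-first search: the graph is connected exactly when the search started at an arbitrary vertex reaches all vertices. A minimal sketch, with a hypothetical function name and toy graphs:

```python
from collections import deque

def is_connected(adj):
    """True if every vertex is reachable from an arbitrarily chosen start vertex."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(adj)

print(is_connected({0: [1], 1: [0, 2], 2: [1]}))   # a path on three vertices
print(is_connected({0: [1], 1: [0], 2: []}))       # vertex 2 is isolated
```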











Figure 5-4. Number of vertices in the NBA graph with different values of Jordan
number. Average Jordan number = 2.270. Bar labels recovered from the plot (Jordan
number: number of vertices): 0: 1; 1: 24; 2: 244; 3: 135.


5.2.2 Diameter of the NBA Graph and Jordan Number

The next subject of our interest is verifying if the NBA graph follows the

small-world hypothesis. We need to answer the question, what is the distance

between any two vertices in this graph?

Similarly to the social graphs mentioned above, we define the "central vertex"

in the NBA graph corresponding to Michael Jordan, who played for the Washington

Wizards during his final NBA season. Obviously, all other players in the Wizards'

roster for 2002-2003, as well as all the players who have played with Jordan

during at least one season in the past, have Jordan number 1. It should be noted

that Michael Jordan played for only two teams (Chicago Bulls and Washington

Wizards) through his entire career; therefore, one can expect that the number of

players with Jordan number 1 is rather small. In fact, only 24 players currently

playing in the NBA have Jordan number 1.













Table 5-1. Jordan numbers of some NBA stars (end of the 2002-2003 season).

Player               Team                     Jordan Number
Kobe Bryant          Los Angeles Lakers       2
Vince Carter         Toronto Raptors          2
Vlade Divac          Sacramento Kings         2
Tim Duncan           San Antonio Spurs        2
Michael Finley       Dallas Mavericks         2
Steve Francis        Houston Rockets          3
Kevin Garnett        Minnesota Timberwolves   3
Pau Gasol            Memphis Grizzlies        3
Richard Hamilton     Detroit Pistons          1
Allen Iverson        Philadelphia 76ers       2
Jason Kidd           New Jersey Nets          2
Toni Kukoc           Milwaukee Bucks          1
Karl Malone          Utah Jazz                2
Stephon Marbury      Phoenix Suns             2
Shawn Marion         Phoenix Suns             2
Kenyon Martin        New Jersey Nets          3
Jamal Mashburn       New Orleans Hornets      2
Tracy McGrady        Orlando Magic            2
Reggie Miller        Indiana Pacers           3
Yao Ming             Houston Rockets          3
Dikembe Mutombo      New Jersey Nets          2
Steve Nash           Dallas Mavericks         2
Dirk Nowitzki        Dallas Mavericks         2
Jermaine O'Neal      Indiana Pacers           2
Shaquille O'Neal     Los Angeles Lakers       2
Gary Payton          Milwaukee Bucks          2
Paul Pierce          Boston Celtics           2
Scottie Pippen       Portland Trail Blazers   1
David Robinson       San Antonio Spurs        2
Arvydas Sabonis      Portland Trail Blazers   2
Jerry Stackhouse     Washington Wizards       1
Predrag Stojakovic   Sacramento Kings         2
Antoine Walker       Boston Celtics           2
Ben Wallace          Detroit Pistons          2
Chris Webber         Sacramento Kings         2









Following similar logic, the players who have played with Jordan's "collaborators"

have Jordan number 2, and so on. However, it turns out that the maximum

Jordan number in this instance of the NBA graph is only 3, i.e., all the players are

linked with Jordan through at most two vertices, which is certainly not surprising:

with 29 teams and only around 15 players in each team, the NBA is really a "small

world." Figure 5-4 shows the distribution of Jordan numbers in the NBA graph.

The average Jordan number is equal to 2.27, which is smaller than the average

Bacon number in the Hollywood graph, and the average Wynn number in the

baseball graph, due to smaller number of vertices.
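
The average can be reproduced from the distribution in Figure 5-4. The count of 24 players with Jordan number 1 is stated in the text; the assignment of the remaining bar labels (244 players with Jordan number 2 and 135 with Jordan number 3) is an assumption that agrees with the total of 404 vertices, and the symbol \bar{J} is introduced here for illustration:

```latex
\bar{J} \;=\; \frac{0\cdot 1 \;+\; 1\cdot 24 \;+\; 2\cdot 244 \;+\; 3\cdot 135}{404}
\;=\; \frac{917}{404} \;\approx\; 2.270 .
```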

Table 5-1 presents Jordan numbers corresponding to some well-known NBA

players. Not surprisingly, most of them have Jordan number 2, except for several

players with Jordan number 3: those who joined the league recently, and therefore

did not have many teammates through their career, as well as Reggie Miller, who

spent 16 seasons on the same team (Indiana Pacers), and Kevin Garnett, who played

in Minnesota for 8 years. Scottie Pippen, Toni Kukoc, and Jerry Stackhouse were

Jordan's teammates at different times, therefore, they have Jordan number 1.

Furthermore, we calculated the diameter of the NBA graph, i.e., the maximum

possible distance between any two vertices in the graph. Since the maximum

Jordan number in the NBA graph is equal to 3, one would expect the value

of the diameter to be of the same order of magnitude. As it was mentioned in

the previous section, the diameter of the NBA graph can be found as follows: for

every given vertex, we calculate the distances between this vertex and all others.

In this approach, we need to repeat this procedure 404 times, and every time a

different vertex is considered to be the "center" of the graph. Our calculations

show that the diameter of the NBA graph (the maximum distance between all pairs

of vertices) is equal to 4. Therefore, one can claim that the NBA graph actually

follows the small-world hypothesis, since its diameter is small enough.









Table 5-2. Degrees of the Vertices in the NBA graph

degree interval number of vertices
11-20 134
21-30 116
31-40 103
41-50 42
51-60 8
61+ 2


5.2.3 Degrees and "Connectedness" of the Vertices in the NBA Graph

As it was pointed out above, the maximum and the average distance from

the center of the graph actually depend on the choice of this center. One can

easily guess that Michael Jordan is not the most "connected" central vertex of

the NBA graph, since he played for only two teams and the number of his former

teammates among currently active players is rather small. In fact, the degree of the

vertex (i.e., the number of edges starting from it, or the number of teammates)

corresponding to Jordan is only 24. Table 5-2 presents the number of vertices in

the NBA graph corresponding to different intervals of the degree values.
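
A tabulation of this kind can be sketched as follows, assuming the graph is stored as adjacency lists. The function names and the star-graph example are hypothetical; unlike Table 5-2, this sketch keeps every interval at the same width and does not merge the top intervals into an open-ended one.

```python
def degrees(adj):
    """Degree of every vertex, i.e., the number of its neighbors."""
    return {v: len(neighbors) for v, neighbors in adj.items()}

def degree_histogram(adj, width=10):
    """Number of vertices per degree interval: 1-10, 11-20, 21-30, and so on."""
    counts = {}
    for d in degrees(adj).values():
        if d == 0:
            label = "0"
        else:
            low = ((d - 1) // width) * width + 1
            label = f"{low}-{low + width - 1}"
        counts[label] = counts.get(label, 0) + 1
    return counts

# Star graph: one hub with 12 neighbors, each of which has degree 1.
star = {0: list(range(1, 13))}
for leaf in range(1, 13):
    star[leaf] = [0]
print(degree_histogram(star))
```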

It would be reasonable to assume that if one picks a vertex with a high degree

as the center of the NBA graph, the average distance in the graph corresponding to

this vertex would be smaller than the average Jordan number. We have found the

most "connected" players in the NBA graph with the smallest corresponding average

distances. Table 5-3 presents five players who could be the most "connected"

centers of the NBA graph. As one can notice, all of them are "bench players" who

have changed many teams during their career; therefore, they have high degrees in

the NBA graph. Also, an interesting observation is that although the degree of Corie

Blount's vertex is smaller than Jim Jackson's, the average connectivity is higher for

Corie Blount, which could be explained by the fact that his teammates were highly

"connected" themselves.




Full Text

PAGE 4

IwouldliketothankmyadvisorProf.PanosPardalosforhissupportandguidancethatmademystudiesintheUniversityofFloridaenjoyableandproduc-tive.Hisenergyandenthusiasminspiredmeduringthesefouryears,andIbelievethatthiswascrucialformysuccess.IalsowanttothankmycommitteemembersProf.StanUryasev,Prof.JosephGeunes,andProf.WilliamHagerfortheirconcernandencouragement.Iamgratefultoallmycollaborators,especiallySergiyButenkoandOlegProkopyev,whowerealwaysagreatpleasuretoworkwith.Finally,Iwouldliketoexpressmygreatestappreciationtomyfamilyandfriends,whoalwaysbelievedinmeandsupportedmeinallcircumstances. iv

PAGE 5

page ACKNOWLEDGMENTS ............................. iv LISTOFTABLES ................................. viii LISTOFFIGURES ................................ ix ABSTRACT .................................... xi CHAPTER 1INTRODUCTION .............................. 1 1.1BasicConceptsfromGraphTheoryandDataMiningInterpretation 3 1.1.1ConnectivityandDegreeDistribution ............. 3 1.1.2CliquesandIndependentSets ................. 5 1.1.3ClusteringviaCliquePartitioning ............... 6 2REVIEWOFNETWORK-BASEDMODELINGANDOPTIMIZATIONTECHNIQUESINMASSIVEDATASETS ................ 9 2.1ModelingandOptimizationinMassiveGraphs ............ 9 2.1.1ExamplesofMassiveGraphs .................. 10 2.1.1.1CallGraph ...................... 10 2.1.1.2InternetandWebGraphs .............. 13 2.1.2ExternalMemoryAlgorithms ................. 17 2.1.3ModelingMassiveGraphs ................... 18 2.1.3.1UniformRandomGraphs .............. 19 2.1.3.2PotentialDrawbacksoftheUniformRandomGraphModel ......................... 21 2.1.3.3RandomGraphswithaGivenDegreeSequence .. 23 2.1.3.4Power-LawRandomGraphs ............. 24 2.1.4OptimizationinRandomMassiveGraphs ........... 29 2.1.4.1CliqueNumber .................... 29 2.1.4.2ChromaticNumber .................. 31 2.1.5Remarks ............................. 32 3NETWORK-BASEDAPPROACHESTOMININGSTOCKMARKETDATA ..................................... 33 3.1StructureoftheMarketGraph .................... 34 3.1.1ConstructingtheMarketGraph ................ 34 v

PAGE 6

............... 36 3.1.3DegreeDistributionoftheMarketGraph ........... 37 3.1.4InstrumentsCorrespondingtoHigh-DegreeVertices ..... 40 3.1.5ClusteringCoecientsintheMarketGraph ......... 41 3.2AnalysisofCliquesandIndependentSetsintheMarketGraph ... 42 3.2.1CliquesintheMarketGraph .................. 43 3.2.2IndependentSetsintheMarketGraph ............ 45 3.3DataMiningInterpretationoftheMarketGraphModel ....... 48 3.4EvolutionoftheMarketGraph .................... 50 3.4.1DynamicsofGlobalCharacteristicsoftheMarketGraph .. 51 3.4.2DynamicsoftheSizeofCliquesandIndependentSetsintheMarketGraph .......................... 55 3.4.3MinimumCliquePartitionoftheMarketGraph ....... 59 3.5ConcludingRemarks .......................... 60 4NETWORK-BASEDTECHNIQUESINELECTROENCEPHALOGRAPHIC(EEG)DATAANALYSISANDEPILEPTICBRAINMODELING ... 62 4.1StatisticalPreprocessingofEEGData ................ 63 4.1.1Datasets. ............................. 63 4.1.2T-statisticsandSTLmax 63 4.2GraphStructureoftheEpilepticBrain ................ 66 4.2.1KeyIdeaoftheModel ..................... 66 4.2.1.1InterpretationoftheConsideredGraphModels .. 67 4.2.2PropertiesoftheGraphs .................... 67 4.2.2.1EdgeDensity ..................... 67 4.2.2.2Connectivity ..................... 69 4.2.2.3MinimumSpanningTree ............... 69 4.2.2.4DegreesoftheVertices ................ 71 4.2.2.5MaximumCliques .................. 72 4.3GraphasaMacroscopicModeloftheEpilepticBrain ........ 74 4.4ConcludingRemarksandDirectionsofFutureResearch ....... 75 5COLLABORATIONNETWORKSINSPORTS .............. 77 5.1ExamplesofSocialNetworks ...................... 78 5.1.1ScienticCollaborationGraphandErdosNumber ...... 78 5.1.2HollywoodGraphandBaconNumber ............. 79 5.1.3BaseballGraphandWynnNumber .............. 80 5.1.4DiameterofCollaborationNetworks .............. 81 5.2NBAGraph ............................... 82 5.2.1GeneralPropertiesoftheNBAGraph ............. 83 5.2.2DiameteroftheNBAGraphandJordanNumber ...... 
85 5.2.3Degreesand\Connectedness"oftheVerticesintheNBAGraph .............................. 88 5.3ConcludingRemarks .......................... 89 vi

PAGE 7

... 90 REFERENCES ................................... 91 BIOGRAPHICALSKETCH ............................ 100 vii

PAGE 8

Table page 3{1Least-squaresestimatesoftheparameterinthemarketgraph ..... 38 3{2Top25instrumentswithhighestdegreesinthemarketgraph ....... 42 3{3Clusteringcoecientsofthemarketgraph ................. 43 3{4Sizesofthemaximumcliquesinthemarketgraph ............. 45 3{5Sizesofindependentsetsinthecomplementarymarketgraph ...... 46 3{6Datesandmeancorrelationscorrespondingtoeachconsidered500-dayshift ...................................... 51 3{7Numberofverticesandnumberofedgesinthemarketgraphfordier-entperiods .................................. 55 3{8Greedycliquesizeandthecliquenumberfordierenttimeperiods ... 57 3{9Structureofmaximumcliquesinthemarketgraphfordierenttimepe-riods ...................................... 58 3{10Sizeofindependentsetsinthemarketgraphfoundusingthegreedyheuris-tic ....................................... 59 3{11Thelargestcliquesizeandthenumberofcliquesincomputedcliquepar-titions ..................................... 60 5{1JordannumbersofsomeNBAstars(endofthe2002-2003season). .... 86 5{2DegreesoftheVerticesintheNBAgraph ................. 88 5{3Themost\connected"playersintheNBAgraph ............. 89 viii

Figure
2-1 Frequencies of clique sizes in the call graph
2-2 Pattern of connections in the call graph
2-3 Number of Internet hosts for the period 01/1991-01/2002
2-4 Pattern of connections in the Web graph
2-5 Connectivity of the Web (Bow-Tie model)
3-1 Distribution of correlation coefficients in the stock market
3-2 Edge density of the market graph for different values of the correlation threshold
3-3 Plot of the size of the largest connected component in the market graph as a function of correlation threshold
3-4 Degree distribution of the market graph
3-5 Degree distribution of the complementary market graph
3-6 Frequency of the sizes of independent sets found in the market graph
3-7 Distribution of correlation coefficients in the US stock market for several overlapping 500-day periods during 2000-2002
3-8 Degree distribution of the market graph for different 500-day periods in 2000-2002
3-9 Dynamics of edge density and maximum clique size in the market graph
4-1 Electrode placement in the brain
4-2 Number of edges in GRAPH-II
4-3 The size of the largest connected component in GRAPH-II
4-4 Average value of T-index of the edges in Minimum Spanning Tree of GRAPH-I
4-5 Average degree of the vertices in GRAPH-II

5-2 Number of vertices in the baseball graph with different values of Wynn number
5-3 General structure of the NBA graph and other collaboration networks
5-4 Number of vertices in the NBA graph with different values of Jordan number

This study develops novel approaches to modeling real-world datasets arising in diverse application areas as networks and information retrieval from these datasets using network optimization techniques. Network-based models allow one to extract information from datasets using various concepts from graph theory. In many cases, one can investigate specific properties of a dataset by detecting special formations in the corresponding graph (for instance, connected components, spanning trees, cliques, and independent sets). This process often involves solving computationally challenging combinatorial optimization problems on graphs (maximum independent set, maximum clique, minimum clique partition, etc.). These problems are especially difficult to solve for large graphs. However, in certain cases, the exact solution of a hard optimization problem can be found using a special structure of the considered graph.

A significant part of the dissertation focuses on developing network-based models of real-world complex systems, including the stock market and the human brain, which have always been of special interest to scientists. These systems generate huge amounts of data and are especially hard to analyze. This dissertation

The developed network representations of the considered datasets are in many cases non-trivial and include certain statistical preprocessing techniques. In particular, the U.S. stock market is represented as a network based on cross-correlations of price fluctuations of the financial instruments, which are calculated over a certain number of trading days. This model (market graph) allows one to analyze the structure and dynamics of the stock market from an alternative perspective and obtain useful information about the global structure of the market, classes of similar stocks, and diversified portfolios.

Similarly, a macroscopic network model of the human brain is constructed based on the statistical measures of entrainment between electroencephalographic (EEG) signals recorded from different functional units of the brain. Studying the evolution of the properties of these networks revealed some interesting facts about brain disorders, such as epilepsy.

Nowadays, the process of studying real-life complex systems often deals with large datasets arising in diverse applications including government and military systems, telecommunications, biotechnology, medicine, finance, astrophysics, ecology, geographical information systems, etc. [3, 25]. Understanding the structural properties of a certain dataset is in many cases the task of crucial importance. To get useful information from these data, one often needs to apply special techniques of summarizing and visualizing the information contained in a dataset.

An appropriate mathematical model can simplify the analysis of a dataset and even theoretically predict some of its properties. Thus, a fundamental problem that arises here is modeling the datasets characterizing real-world complex systems.

In this dissertation, we concentrate on one aspect of this problem: network representation of real-world datasets. According to this approach, a certain dataset is represented as a graph (network) with certain attributes associated with its vertices and edges.

Studying the structure of a graph representing a dataset is often important for understanding the internal properties of the application it represents, as well as for improving storage organization and information retrieval. One can visualize a graph as a set of dots and links connecting them, which often makes this representation convenient and easily understandable.

The main concepts of graph theory were founded several centuries ago, and many network optimization algorithms have been developed since then. However, graph models have been applied only recently to representing various real-life massive datasets. Graph theory is quickly becoming a practical field of science.

Expansion of graph-theoretical approaches in various applications gave birth to the terms "graph practice" and "graph engineering" [63].

Network-based models allow one to extract information from real-world datasets using various standard concepts from graph theory. In many cases, one can investigate specific properties of a dataset by detecting special formations in the corresponding graph, for instance, connected components, spanning trees, cliques and independent sets. In particular, cliques and independent sets can be used for solving the important clustering problem arising in data mining, which essentially represents partitioning the set of elements of a certain dataset into a number of subsets (clusters) of objects according to some similarity (or dissimilarity) criterion. These concepts are associated with a number of network optimization problems discussed later.

Another aspect of investigating network models of real-world datasets is studying the degree distribution of the constructed graphs. The degree distribution is an important characteristic of a dataset represented by a graph. It represents the large-scale pattern of connections in the graph, which reflects the global properties of the dataset. One of the important results discovered during the last several years is the observation that many graphs representing the datasets from diverse areas (Internet, telecommunications, biology, sociology) obey the power-law model [9]. The fact that graphs representing completely different datasets have a similar well-defined power-law structure has been widely reflected in the literature [10, 19, 20, 25, 63, 116, 117]. It indicates that global organization and evolution of datasets arising in various spheres of life nowadays follow similar laws and patterns. This fact served as a motivation to introduce a concept of "self-organized networks."

Later we discuss in more detail various aspects of modeling real-world datasets as networks, and retrieving useful information from these networks. The practical

importance of graph-theoretic techniques is shown by several examples of applying these approaches associated with datasets arising in telecommunications, the Internet, sociology, etc. The major part of the dissertation is devoted to novel network-based techniques and models that allow one to obtain important non-trivial information from datasets arising in finance and biomedicine.

Let $G = (V, E)$ be an undirected graph with the set of $n$ vertices $V$ and the set of edges $E = \{(i, j) : i, j \in V\}$. Directed graphs, where the head and tail of each edge are specified, are considered in some applications. The concept of a multigraph is also sometimes introduced. A multigraph is a graph where multiple edges connecting a given pair of vertices may exist. One of the important characteristics of a graph is its edge density: the ratio of the number of edges in the graph to the maximum possible number of edges.

The degree of a vertex is the number of edges emanating from it. For every integer $k$ one can calculate the number of vertices $n(k)$ with a degree equal to $k$, and then get the probability that a vertex has the degree $k$ as $P(k) = n(k)/n$, where $n$ is the total number of vertices. The function $P(k)$ is referred to as the degree distribution of the graph.

Degree distribution is an important characteristic of a dataset represented by a graph. It reflects the overall pattern of connections in the graph, which in many cases reflects the global properties of the dataset this graph represents. As mentioned above, many real-world graphs representing the datasets coming from diverse areas (Internet, telecommunications, finance, biology, sociology) have degree distributions that follow the power-law model, which states that the probability that a vertex of a graph has a degree $k$ (i.e., there are $k$ edges emanating from it) is

$P(k) \propto k^{-\gamma}$.   (1-1)

Equivalently, one can represent it as

$\log P \propto -\gamma \log k$,   (1-2)

which demonstrates that this distribution forms a straight line in the logarithmic scale, and the slope of this line equals the value of the parameter $\gamma$.

An important characteristic of the power-law model is its scale-free property. This property implies that the power-law structure of a certain network should not depend on the size of the network. Clearly, real-world networks dynamically grow over time, therefore, the growth process of these networks should obey certain rules in order to satisfy the scale-free property. The necessary properties of the evolution of the real-world networks are growth and preferential attachment [20]. The first property implies the obvious fact that the size of these networks grows continuously (i.e., new vertices are added to a network, which means that new elements are added to the corresponding dataset). The second property represents the idea that

new vertices are more likely to be connected to old vertices with high degrees. It is intuitively clear that these principles characterize the evolution of many real-world complex networks nowadays.

From another perspective, some properties of graphs that follow the power-law model can be predicted theoretically. Aiello et al. [9] studied the properties of the power-law graphs using the theoretical power-law random graph model representing the class of random graphs obeying the power law (see Chapter 2). Among their results, one can mention the existence of a giant connected component in a power-law graph with $\gamma < \gamma_0 \approx 3.47875$, and the fact that a giant connected component does not exist otherwise.

The size of connected components of the graph may provide useful information about the structure of the corresponding dataset, as the connected components would normally represent groups of "similar" objects. In some applications, decomposing the graph into a set of connected components can provide a reasonable solution to the clustering problem (i.e., partitioning the graph into several subgraphs, each of which corresponds to a certain cluster).

The following definitions generalize the concept of clique. Instead of cliques one can consider dense subgraphs, or quasi-cliques. A $\gamma$-clique $C$, also called a

An independent set is a subset $I \subseteq V$ such that the subgraph $G(I)$ has no edges. The maximum independent set problem can be easily reformulated as the maximum clique problem in the complementary graph $\bar{G}(V, \bar{E})$, defined as follows. If an edge $(i, j) \in E$, then $(i, j) \notin \bar{E}$; and if $(i, j) \notin E$, then $(i, j) \in \bar{E}$. Clearly, a maximum clique in $\bar{G}$ is a maximum independent set in $G$, so the maximum clique and maximum independent set problems can be easily reduced to each other.

Locating cliques (quasi-cliques) and independent sets in a graph representing a dataset provides important information about this dataset. Intuitively, edges in such a graph would connect vertices corresponding to "similar" elements of the dataset. Therefore, cliques (or quasi-cliques) would naturally represent dense clusters of similar objects. On the contrary, independent sets can be treated as groups of objects that differ from every other object in the group. This information is also important in some applications. Clearly, it is useful to find a maximum clique or independent set in the graph, since it would give the maximum possible size of the groups of "similar" or "different" objects.

The maximum clique problem (as well as the maximum independent set problem) is known to be NP-hard [59]. Moreover, it turns out that these problems are difficult to approximate [18, 62]. This makes these problems especially challenging in large graphs.

... [102] give various mathematical programming formulations of these problems. Clearly, as in

the case of maximum clique and maximum independent set problems, minimum clique partition and graph coloring are reduced to each other by considering the complementary graph, and both of these problems are NP-hard [59]. Solving these problems for graphs representing real-life datasets is important from a data mining perspective, especially for solving the clustering problem.

The essence of clustering is partitioning the elements in a certain dataset into several distinct subsets (clusters) grouped according to an appropriate similarity criterion [34]. Identifying the groups of objects that are "similar" to each other but "different" from other objects in a given dataset is important in many practical applications. The clustering problem is challenging because the number of clusters and the similarity criterion are usually not known a priori.

If a dataset is represented as a graph, where each data element corresponds to a vertex, the clustering problem essentially deals with decomposing this graph into a set of subgraphs (subsets of vertices), so that each of these subgraphs corresponds to a specific cluster.

Since the data elements assigned to the same cluster should be "similar" to each other, the goal of clustering can be achieved by finding a clique partition of the graph, and the number of clusters will equal the number of cliques in the partition.

Similar arguments hold for the case of the graph coloring problem, which should be solved when a dataset needs to be decomposed into clusters of "different" objects (i.e., each object in a cluster is different from all other objects in the same cluster) that can be represented as independent sets in the corresponding graph. The number of independent sets in the optimal partition is referred to as the chromatic number of the graph.

Instead of cliques and independent sets one can consider quasi-cliques and quasi-independent sets, and partition the graph on this basis. As mentioned,

quasi-cliques are subgraphs that are dense enough (i.e., they have a high edge density). Therefore, it is often reasonable to relate clusters to quasi-cliques, since they represent sufficiently dense clusters of similar objects. Obviously, in the case of partitioning a dataset into clusters of "different" objects, one can use quasi-independent sets (i.e., subgraphs that are sparse enough) to define these clusters.
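The notions used in this chapter (edge density, cliques, independent sets viewed as cliques of the complementary graph, and quasi-cliques as sufficiently dense subgraphs) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the implementation used in this work: the function names, the adjacency representation (a dictionary mapping each vertex to the set of its neighbors), and the density threshold `gamma` are all introduced here for the example.

```python
from itertools import combinations


def edge_density(adj, vertices=None):
    """Ratio of present edges to the maximum possible number of edges
    in the subgraph induced by `vertices` (the whole graph if None)."""
    vs = set(adj) if vertices is None else set(vertices)
    n = len(vs)
    if n < 2:
        return 0.0  # density is undefined here; 0.0 is an arbitrary choice
    edges = sum(1 for u, v in combinations(vs, 2) if v in adj[u])
    return edges / (n * (n - 1) / 2)


def complement(adj):
    """Complementary graph: (i, j) is an edge iff it is not an edge of G."""
    vs = set(adj)
    return {u: vs - adj[u] - {u} for u in adj}


def is_clique(adj, vertices):
    """True if every pair of the given vertices is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))


def is_independent_set(adj, vertices):
    """An independent set of G is exactly a clique of the complement of G."""
    return is_clique(complement(adj), vertices)


def is_quasi_clique(adj, vertices, gamma):
    """Assumed density test: the induced subgraph has density >= gamma."""
    return edge_density(adj, vertices) >= gamma


# Small example: a triangle {1, 2, 3} with a pendant vertex 4 attached to 3.
G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(is_clique(G, [1, 2, 3]), is_independent_set(G, [1, 4]))
```

All checks here enumerate vertex pairs, so the sketch is meant for small examples only; it says nothing about how to find a maximum clique or quasi-clique, which is the hard part discussed above.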

In this chapter, we review current developments in studying massive graphs used as models of certain real-world datasets.

... [3]. Some of the wide range of problems associated with massive datasets are data warehousing, compression and visualization, information retrieval, clustering and pattern recognition, and nearest neighbor search. Handling these problems requires special interdisciplinary efforts to develop novel sophisticated techniques. The pervasiveness and complexity of the problems brought by massive datasets make it one of the most challenging and exciting areas of research for years to come.

In many cases, a massive dataset can be represented as a very large graph with certain attributes associated with its vertices and edges. These attributes may contain specific information characterizing the given application. Studying the structure of this graph is important for understanding the structural properties of the application it represents, as well as for improving storage organization and information retrieval.

... [25].

As before, by $G = (V, E)$ we will denote a simple undirected graph with the set of $n$ vertices $V$ and the set of edges $E$. A multi-graph is an undirected graph with multiple edges.

The distance between two vertices is the number of edges in the shortest path between them (it is equal to infinity for vertices representing different connected components). The diameter of a graph $G$ is usually defined as the maximal distance between pairs of vertices of $G$. In a disconnected graph, the usual definition of the diameter would result in the infinite diameter, so the following definition is in order. By the diameter of a disconnected graph we will mean the maximum finite shortest path length in the graph (the same as the largest of the diameters of the graph's connected components).

2.1.1.1 Call Graph

... [2]. In this call graph the vertices are telephone numbers, and two vertices are connected by an edge if a call was made from one number to another.

Abello et al. [2] experimented with data from AT&T telephone billing records. To give an idea of how large a call graph can be we mention that a graph based on one 20-day period had 290 million vertices and 4 billion edges. The analyzed one-day call graph had 53,767,087 vertices and over 170 million edges. This graph appeared to have 3,667,448 connected components, most of them tiny; only 302,468 (or 8%) components had more than 3 vertices. A giant connected component with 44,989,297 vertices was computed. It was observed that the existence of a giant component resembles a behavior suggested by the random graphs theory of Erdős and Rényi [47, 48], which will be mentioned below, but by the pattern of connections the call graph obviously does not fit into this theory (Subsection 2.1.3).
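The convention for the diameter of a disconnected graph introduced at the beginning of this subsection (the maximum finite shortest path length) can be sketched with breadth-first search. The code below is a minimal illustration for a small in-memory graph; the function names and the adjacency-dictionary representation are assumptions made for this example, and the graph is assumed to be non-empty.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from `source` to every vertex
    reachable from it; unreachable vertices are simply absent."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(adj):
    """Maximum finite shortest-path length over all pairs of vertices.

    Pairs in different components never appear in a BFS result, so for a
    disconnected graph this equals the largest diameter among its
    connected components."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)


# Two components: the path 0-1-2-3 (diameter 3) and the single edge 4-5.
G = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
print(diameter(G))
```

Running one BFS per vertex costs on the order of n(n + m) operations, which is fine for a toy example but not for graphs of the size of the call graph described above.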

The maximum clique problem and problem of finding large quasi-cliques with prespecified density were considered in this giant component. These problems were attacked using a greedy randomized adaptive search procedure (GRASP) [51, 52]. In short, GRASP is an iterative method that at each iteration constructs, using a greedy function, a randomized solution and then finds a locally optimal solution by searching the neighborhood of the constructed solution. This is a heuristic approach which gives no guarantee about quality of the solutions found, but proved to be practically efficient for many combinatorial optimization problems. To make application of optimization algorithms in the considered large component possible, the authors use some suitable graph decomposition techniques employing external memory algorithms (see Subsection 2.1.2).

Figure 2-1. Frequencies of clique sizes in the call graph found by Abello et al. [2].

Abello et al. [2] ran 100,000 GRASP iterations taking 10 parallel processors about one and a half days to finish. Of the 100,000 cliques generated, 14,141 appeared to be distinct, although many of them had vertices in common. Abello et al. suggested that the graph contains no clique of a size greater than 32. Figure 2-1 shows the number of detected cliques of various sizes. Finally, large

quasi-cliques with density parameters $\gamma = 0.9$, $0.8$, $0.7$, and $0.5$ for the giant connected component were computed. The sizes of the largest quasi-cliques found were 44, 57, 65, and 98, respectively.

Figure 2-2. Pattern of connections in the call graph: number of vertices with various out-degrees (a) and in-degrees (b); number of connected components of various sizes (c) in the call graph [8].

Aiello et al. [8] used the same data as Abello et al. [2] to show that the considered call graph fits to their power-law random graph model (Section 2.1.3). The plots in Figure 2-2 demonstrate some connectivity properties of the call graph.

Summarizing the results presented in this subsection, one can say that graph-based techniques proved to be rather useful in the analysis and revealing the global

patterns of the telecommunications traffic dataset. In the next subsection, we will consider another example of a similar type of dataset associated with the World-Wide Web.

Figure 2-3 shows the dynamics of growth of the number of Internet hosts for the last 13 years. As of January 2002 this number was estimated to be close to 150 million.

Figure 2-3. Number of Internet hosts for the period 01/1991-01/2002. Data by Internet Software Consortium.

Figure 2-4. Pattern of connections in the Web graph: number of vertices with various out-degrees (left) and distribution of sizes of strongly connected components (right) in the Web graph [37].

The highly dynamic and seemingly unpredictable structure of the World Wide Web attracts more and more attention of scientists representing many diverse disciplines, including graph theory. In a graph representation of the World Wide Web, the vertices are documents and the edges are hyperlinks pointing from one document to another. Similarly to the call graph, the Web is a directed multigraph, although often it is treated as an undirected graph to simplify the analysis. Another graph is associated with the physical network of the Internet, where the vertices are routers navigating packets of data or groups of routers (domains). The edges in this graph represent wires or cables in the physical network.

Graph theory has been applied for web search [36, 78], web mining [96, 97] and other problems arising in the Internet and World Wide Web. In several recent studies, there were attempts to understand some structural properties of the Web graph by investigating large Web crawls. Adamic and Huberman [6, 65] used crawls which covered almost 260,000 pages in their studies. Barabási and Albert [20] analyzed a subgraph of the Web graph with approximately 325,000 nodes representing nd.edu pages. In another experiment, Kumar et al. [82] examined a dataset containing about 40 million pages. In a recent study, Broder et al. [37]

used two Altavista crawls, each with about 200 million pages and 1.5 billion links, thus significantly exceeding the scale of the preceding experiments. This work yielded several remarkable observations about local and global properties of the Web graph. All of the properties observed in one of the two crawls were validated for the other as well. Below, by the Web graph we will mean one of the crawls, which has 203,549,046 nodes and 2130 million arcs.

The first observation made by Broder et al. confirms a property of the Web graph suggested in earlier works [20, 82] claiming that the distribution of degrees follows a power law. Interestingly, the degree distribution of the Web graph resembles the power-law relationship of the Internet graph topology, which was first discovered by Faloutsos et al. [50]. Broder et al. [37] computed the in- and out-degree distributions for both considered crawls and showed that these distributions agree with power laws. Moreover, they observed that in the case of in-degrees the constant, approximately 2.1, is the same as the exponent of power laws discovered in earlier studies [20, 82]. In another set of experiments conducted by Broder et al., directed and undirected connected components were investigated. It was noticed that the distribution of sizes of these connected components also obeys a power law. Figure 2-4 illustrates the experiments with distributions of out-degrees and connected component sizes.

The last series of experiments discussed by Broder et al. [37] aimed to explore the global connectivity structure of the Web. This led to the discovery of the so-called Bow-Tie model of the Web [38]. Similarly to the call graph, the considered Web graph appeared to have a giant connected component, containing 186,771,290 nodes, or over 90% of the total number of nodes. Taking into account the directed nature of the edges, this connected component can be subdivided into four pieces: strongly connected component (SCC), In and Out components, and "Tendrils". Overall, the Web graph in the Bow-Tie model is divided into the following pieces:

Figure 2-5. Connectivity of the Web (Bow-Tie model) [37].

Figure 2-5 shows the connectivity structure of the Web, as well as sizes of the considered components. As one can see from the figure, the sizes of SCC, In, Out and Tendrils components are roughly equal, and the Disconnected component is significantly smaller.
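A decomposition in the spirit of the Bow-Tie picture can be sketched with two reachability searches from a node assumed to lie in the SCC. The definitions used below are the common reading of the model and are assumptions of this sketch: SCC is the set of nodes mutually reachable with the seed, In contains nodes that can reach the SCC but cannot be reached from it, Out contains nodes reachable from the SCC that cannot return to it, and everything else (tendrils and disconnected nodes) is lumped into a single remainder. The function names and the seed parameter are hypothetical.

```python
from collections import deque


def reachable(adj, source):
    """Set of nodes reachable from `source` along directed edges."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def reverse(adj):
    """Graph with every directed edge reversed."""
    rev = {u: set() for u in adj}
    for u, successors in adj.items():
        for v in successors:
            rev.setdefault(v, set()).add(u)
    return rev


def bow_tie(adj, seed):
    """Split nodes into (SCC, In, Out, rest) relative to the strongly
    connected component that contains `seed`."""
    forward = reachable(adj, seed)             # nodes the seed can reach
    backward = reachable(reverse(adj), seed)   # nodes that can reach the seed
    scc = forward & backward
    out_part = forward - scc
    in_part = backward - scc
    nodes = set(adj) | {v for succ in adj.values() for v in succ}
    rest = nodes - scc - in_part - out_part    # tendrils and disconnected nodes
    return scc, in_part, out_part, rest


# Toy directed graph: a -> b <-> c -> d, plus an isolated node e.
web = {"a": {"b"}, "b": {"c"}, "c": {"b", "d"}, "d": set(), "e": set()}
print(bow_tie(web, "b"))
```

For a crawl with hundreds of millions of nodes this in-memory approach would not be practical; it only illustrates how the pieces relate to forward and backward reachability.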

Broder et al. [37] have also computed the diameters of the SCC and of the whole graph. It was shown that the diameter of the SCC is at least 28, and the diameter of the whole graph is at least 503. The average connected distance is defined as the pairwise distance averaged over those directed pairs $(i, j)$ of nodes for which there exists a path from $i$ to $j$. The average connected distance of the whole graph was estimated as 16.12 for in-links, 16.18 for out-links, and 6.83 for undirected links. Interestingly, it was also found that for a randomly chosen directed pair of nodes, the chance that there is a directed path between them is only about 24%.

The first external memory (EM) graph algorithm was developed by Ullman and Yannakakis [112] in 1991 and dealt with the problem of transitive closure. Many other researchers have contributed to the progress in this area ever since [1, 15, 16, 39, 42, 83, 115]. Chiang et al. [42] proposed several new techniques for design and analysis of efficient EM graph algorithms and discussed applications of these techniques to specific problems, including minimum spanning tree verification, connected and biconnected components, graph drawing, and visibility representation. Abello et al. [1] proposed a functional approach for EM graph algorithms and used their methodology to

develop deterministic and randomized algorithms for computing connected components, maximal independent sets, maximal matchings, and other structures in the graph. In this approach each algorithm is defined as a sequence of functions, and the computation continues in a series of scan operations over the data. If the produced output data, once written, cannot be changed, then the function is said to have no side effects. The lack of side effects enables the application of standard checkpointing techniques, thus increasing the reliability. Abello et al. presented a semi-external model for graph problems, which assumes that only the vertices fit in the computer's internal memory. This is quite common in practice, and in fact this was the case for the call graph described in Subsection 2.1.1, for which efficient EM algorithms developed by Abello et al. [1] were used in order to compute its connected components [2].

For more detail on external memory algorithms see the book [4] and the extensive review by Vitter [115] of EM algorithms and data structures.

... [84].

Therefore, to investigate real-life massive graphs, one needs to use the available information in order to construct proper theoretical models of these graphs. One of the earliest attempts to model real networks theoretically goes back to the late

1950's, when the foundations of random graph theory had been developed. In this subsection we will present some of the results produced by this and other (more realistic) graph models.

... [47, 48] deals with several standard models of the so-called uniform random graphs. Two of such models are $\mathcal{G}(n, m)$ and $\mathcal{G}(n, p)$ [30]. The first model assigns the same probability to all graphs with $n$ vertices and $m$ edges, while in the second model each pair of vertices is chosen to be linked by an edge randomly and independently with probability $p$.

In most cases for each natural $n$ a probability space consisting of graphs with exactly $n$ vertices is considered, and the properties of this space as $n \to \infty$ are studied. It is said that a typical element of the space or almost every (a.e.) graph has property $Q$ when the probability that a random graph on $n$ vertices has this property tends to 1 as $n \to \infty$. We will also say that the property $Q$ holds asymptotically almost surely (a.a.s.). Erdős and Rényi discovered that in many cases either almost every graph has property $Q$ or almost every graph does not have this property.

Many properties of uniform random graphs have been well studied [29, 30, 73, 80]. Below we will summarize some known results in this field.

Probably the simplest property to be considered in any graph is its connectivity. It was shown that for a uniform random graph $G(n, p) \in \mathcal{G}(n, p)$ there is a "threshold" value of $p$ that determines whether a graph is almost surely connected or not. More specifically, a graph $G(n, p)$ is a.a.s. disconnected if $p$

connected component in a random graph is very often referred to as the "phase transition".

The next subject of our discussion is the diameter of a uniform random graph $G(n, p)$. Recall that the diameter of a disconnected graph is defined as the maximum diameter of its connected components. When dealing with random graphs, one usually speaks not about a certain diameter, but rather about the distribution of the possible values of the diameter. Intuitively, one can say that this distribution depends on the interrelationship of the parameters of the model $n$ and $p$. However, this dependency turns out to be rather complicated. It was discussed in many papers, and the corresponding results are summarized below.

It was proved by Klee and Larman [77] that a random graph asymptotically almost surely has the diameter $d$, where $d$ is a certain integer value, if the following conditions are satisfied

... [30] proved that if $np - \log n \to \infty$ then the diameter of a random graph is a.a.s. concentrated on no more than four values.

Łuczak [87] considered the case $np < 1$, when a uniform random graph a.a.s. is disconnected and has no giant connected component. Let $\mathrm{diam}_T(G)$ denote the maximum diameter of all connected components of $G(n, p)$ which are trees. Then if $(1 - np)\, n^{1/3} \to \infty$ the diameter of $G(n, p)$ is a.a.s. equal to $\mathrm{diam}_T(G)$.

Chung and Lu [43] investigated another extreme case: $np \to \infty$. They showed that in this case the diameter of a random graph $G(n, p)$ is a.a.s. equal to

$(1 + o(1)) \dfrac{\log n}{\log(np)}$.

[Equation garbled in extraction: bounds on $\mathrm{diam}(G(n, p))$ expressed through $\log n / \log(np)$.]

... [43] that it is a.a.s. true only if $np > 3.5128$.

The further discussion analyzes the potential drawbacks of applying the uniform random graph model to the real-life massive graphs.
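The behavior of the uniform random graph model can be explored empirically with a small simulation. The sketch below samples a graph in which each pair of vertices is linked independently with probability $p$ and reports the fraction of vertices in the largest connected component for a few values of $np$. The function names, the graph size, and the chosen values of $np$ are assumptions made for this illustration, and a single random sample is of course not a proof of any asymptotic statement.

```python
import random
from itertools import combinations


def sample_gnp(n, p, seed=None):
    """Sample a graph on vertices 0..n-1 where each pair is joined
    independently with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def component_sizes(adj):
    """Sizes of all connected components, largest first."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)
    return sorted(sizes, reverse=True)


n = 1000
for c in (0.5, 1.5, 4.0):  # c = n * p
    sizes = component_sizes(sample_gnp(n, c / n, seed=1))
    print(f"np = {c}: largest component holds {sizes[0] / n:.2%} of vertices")
```

Generating all pairs explicitly takes on the order of $n^2$ steps, which is acceptable for a thousand vertices but is not how one would sample sparse random graphs of massive size.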

Though the uniform random graphs demonstrate some properties similar to the real-life massive graphs, many problems arise when one tries to describe the real graphs using the uniform random graph model. As it was mentioned above, a giant connected component a.a.s. emerges in a uniform random graph at a certain threshold. It looks very similar to the properties of the real massive graphs discussed in Subsection 2.1.3. However, after deeper insight, it can be seen that the giant connected components in the uniform random graphs and the real-life massive graphs have different structures. The fundamental difference between them is as follows: it was noticed that in almost all the real massive graphs the property of so-called clustering takes place [116, 117]. It means that the probability of the event that two given vertices are connected by an edge is higher if these vertices have a common neighbor (i.e., a vertex which is connected by an edge with both of these vertices). The probability that two neighbors of a given vertex are connected by an edge is called the clustering coefficient. It can be easily seen that in the case of the uniform random graphs, the clustering coefficient is equal to the parameter $p$, since the probability that each pair of vertices is connected by an edge is independent of all other vertices. In real-life massive graphs, the value of the clustering coefficient turns out to be much higher than the value of the parameter $p$ of the uniform random graphs with the same number of vertices and edges. Adamic [5] found that the value of the clustering coefficient for some part of the Web graph was approximately 0.1078, while the clustering coefficient for the corresponding uniform random graph was 0.00023. Pastor-Satorras et al. [103] got similar results for the part of the Internet graph. The values of the clustering coefficients for the real graph and the corresponding uniform random graph were 0.24 and 0.0006 respectively.

Another significant problem arising in modeling massive graphs using the uniform random graph model is the difference in degree distributions. It can

be shown that as the number of vertices in a uniform random graph increases, the distribution of the degrees of the vertices tends to the well-known Poisson distribution with the parameter $np$ which represents the average degree of a vertex. However, as it was pointed out in Subsection 2.1.3, the experiments show that in the real massive graphs degree distributions obey a power law. These facts demonstrate that some other models are needed to better describe the properties of real massive graphs. Next, we discuss two of such models, namely, the random graph model with a given degree sequence and its most important special case, the power-law model.

It turns out that some properties of the uniform random graphs can be generalized for the model of a random graph with a given degree sequence.

Recall the notion of the so-called "phase transition" (i.e., the phenomenon when at a certain point a giant connected component emerges in a random graph) which happens in the uniform random graphs. It turns out that a similar thing takes place in the case of a random graph with a given degree sequence. This result was obtained by Molloy and Reed [98]. The essence of their findings is as follows.

Consider a sequence of non-negative real numbers $p_0, p_1, \ldots$, such that $\sum_k p_k = 1$. Assume that a graph $G$ with $n$ vertices has approximately $p_k n$ vertices of degree $k$. If we define $Q = \sum_{k \ge 1} k(k-2)\, p_k$ then it can be proved that $G$ a.a.s.

has a giant connected component if $Q > 0$ and there is a.a.s. no giant connected component if $Q < 0$.

As a development of the analysis of random graphs with a given degree sequence, the work of Cooper and Frieze [45] should be mentioned. They considered a sparse directed random graph with a given degree sequence and analyzed its strong connectivity. In the study, the size of the giant strongly connected component, as well as the conditions of its existence, were discussed.

The results obtained for the model of random graphs with a given degree sequence are especially useful because they can be implemented for some important special cases of this model. For instance, the classical results on the size of a connected component in uniform random graphs follow from the aforementioned fact presented by Molloy and Reed. Next, we present another example of applying this general result to one of the most practically used random graph models, the power-law model.

... [8, 9].

The power-law random graph model (also referred to as $P(\alpha, \beta)$) assigns two parameters characterizing a power-law random graph. If we define $y$ to be the number of nodes with degree $x$, then according to this model

$y = e^{\alpha} / x^{\beta}$.   (2-1)

Equivalently, we can write

$\log y = \alpha - \beta \log x$.   (2-2)
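Relation (2-2), $\log y = \alpha - \beta \log x$, suggests one simple way to estimate the two parameters from observed degree counts: fit an ordinary least-squares line through the points $(\log x, \log y)$. The sketch below does this in plain Python. It is an illustration under assumptions, not the estimation procedure of the cited papers: the function name `fit_power_law` and the input format (a dictionary mapping a degree $x$ to the number of nodes $y$ with that degree) are introduced here, and the data are synthetic values generated exactly from (2-1).

```python
import math


def fit_power_law(degree_counts):
    """Least-squares line through (log x, log y).

    Returns (alpha, beta) for the model log y = alpha - beta * log x.
    Entries with x <= 0 or y <= 0 are skipped because their logarithm
    is undefined."""
    points = [(math.log(x), math.log(y))
              for x, y in degree_counts.items() if x > 0 and y > 0]
    m = len(points)
    mean_x = sum(px for px, _ in points) / m
    mean_y = sum(py for _, py in points) / m
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, -slope  # alpha is the intercept, beta is minus the slope


# Synthetic counts generated exactly from y = e^alpha / x^beta.
true_alpha, true_beta = 10.0, 2.1
counts = {x: math.exp(true_alpha) / x ** true_beta for x in range(1, 50)}
alpha_hat, beta_hat = fit_power_law(counts)
print(alpha_hat, beta_hat)
```

On real degree data the points scatter around the line, especially in the sparse tail of high degrees, so the estimates depend on which range of degrees is included in the fit.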

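Similarly to formulas in Chapter 1, the relationship between $y$ and $x$ can be plotted as a straight line on a log-log scale, so that $(-\beta)$ is the slope, and $\alpha$ is the intercept.

The following properties of a graph described by the power-law random graph model [8] are valid:

The maximum degree of the graph is $e^{\alpha/\beta}$.

The number of vertices is

$$n = \sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta}} \approx \begin{cases} \zeta(\beta)\, e^{\alpha}, & \beta > 1, \\ \alpha\, e^{\alpha}, & \beta = 1, \\ \dfrac{e^{\alpha/\beta}}{1 - \beta}, & 0 < \beta < 1, \end{cases} \qquad (2\text{-}3)$$

where $\zeta(t) = \sum_{n=1}^{\infty} \frac{1}{n^{t}}$ is the Riemann zeta function.

The number of edges is

$$E = \frac{1}{2} \sum_{x=1}^{e^{\alpha/\beta}} x\, \frac{e^{\alpha}}{x^{\beta}} \approx \begin{cases} \frac{1}{2}\, \zeta(\beta - 1)\, e^{\alpha}, & \beta > 2, \\ \frac{1}{4}\, \alpha\, e^{\alpha}, & \beta = 2, \\ \dfrac{1}{2}\, \dfrac{e^{2\alpha/\beta}}{2 - \beta}, & 0 < \beta < 2. \end{cases} \qquad (2\text{-}4)$$

Since the power-law random graph model is a special case of the model of a random graph with a given degree sequence, the results discussed above can be applied to the power-law graphs. We need to find the threshold value of $\beta$ in which the "phase transition" (i.e., the emergence of a giant connected component) occurs. In this case $Q = \sum_{x \ge 1} x(x-2)\, p_x$ is defined as

$$Q = \sum_{x=1}^{e^{\alpha/\beta}} x(x-2)\, \frac{e^{\alpha}}{x^{\beta}} \approx \sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta-2}} - 2 \sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta-1}} \approx \left[\zeta(\beta-2) - 2\zeta(\beta-1)\right] e^{\alpha}.$$

Hence, the threshold value $\beta_0$ can be found from the equation

$$\zeta(\beta_0 - 2) - 2\,\zeta(\beta_0 - 1) = 0,$$

which yields $\beta_0 \approx 3.47875$.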

The results on the size of the connected component of a power-law graph were presented by Aiello et al. [8]. These results are summarized below.

The power-law random graph model was developed for describing real-life massive graphs. So the natural question is how well it reflects the properties of these graphs.

Though this model certainly does not reflect all the properties of real massive graphs, it turns out that the massive graphs such as the call graph or the Internet graph can be fairly well described by the power-law model. The following example demonstrates it.

Aiello, Chung and Lu [8] investigated the same call graph that was analyzed by Abello et al. [2]. This massive graph was already discussed in Subsection 2.1.3, so it is interesting to compare the experimental results presented by Abello et al. [2] with the theoretical results obtained in [8] using the power-law random graph model.

Figure 2-2 shows the number of vertices in the call graph with certain in-degrees and out-degrees. Recall that according to the power-law model the dependency between the number of vertices and the corresponding degrees can be plotted as a straight line on a log-log scale, so one can approximate the real


data shown in Figure 2-2 by a straight line and evaluate the parameters $\alpha$ and $\beta$ using the values of the intercept and the slope of the line. The value of $\beta$ for the in-degree data was estimated to be approximately 2.1, and the value of $e^{\alpha}$ was approximately $30 \times 10^6$. The total number of nodes can be estimated using formula (2-3) as $\zeta(2.1)e^{\alpha} = 1.56e^{\alpha} \approx 47 \times 10^6$ (compare with Subsection 2.1.3).

According to the results for the size of the largest connected component presented above, a power-law graph with $1 < \beta < 3.47875$ a.a.s. has a giant connected component. Since $\beta \approx 2.1$ falls in this range, this result agrees with the real observations for the call graph (see Subsection 2.1.3).

Another aspect worth mentioning is how to generate power-law graphs. The methodology for doing so was discussed in detail in the literature [9, 44]. These papers use a similar approach, which is referred to as a random graph evolution process. The main idea is to construct a power-law massive graph "step by step": at each time step, a node and an edge are added to the graph in accordance with certain rules in order to obtain a graph with a specified in-degree and out-degree power-law distribution. The in-degree and out-degree parameters of the resulting power-law graph are functions of the input parameters of the model. A simple evolution model was presented by Kumar et al. [81]. Aiello, Chung and Lu [9] developed four more advanced models for generating both directed and undirected power-law graphs with different distributions of in-degrees and out-degrees. As an example, we will briefly describe one of their models. It was the basic model developed in the paper, and the other three models were improvements and generalizations of this model.

The main idea of the considered model is as follows. At the first time moment a vertex is added to the graph, and it is assigned two parameters, the in-weight and the out-weight, both equal to 1. Then at each time step $t+1$ a new vertex with in-weight 1 and out-weight 1 is added to the graph with probability $1 - \alpha$,


and a new directed edge is added to the graph with probability $\alpha$. The origin and destination vertices are chosen according to the current values of the in-weights and out-weights. More specifically, a vertex $u$ is chosen as the origin of this edge with probability proportional to its current out-weight, which is defined as $w^{out}_{u,t} = 1 + \delta^{out}_{u,t}$, where $\delta^{out}_{u,t}$ is the out-degree of the vertex $u$ at time $t$. Similarly, a vertex $v$ is chosen as the destination with probability proportional to its current in-weight $w^{in}_{v,t} = 1 + \delta^{in}_{v,t}$, where $\delta^{in}_{v,t}$ is the in-degree of $v$ at time $t$. From the above description it can be seen that at time $t$ the total in-weight and the total out-weight are both equal to $t$. So for each particular pair of vertices $u$ and $v$, the probability that an edge going from $u$ to $v$ is added to the graph at time $t$ is equal to

$$\alpha\,\frac{(1 + \delta^{out}_{u,t})(1 + \delta^{in}_{v,t})}{t^2}.$$

The notion of so-called scale invariance [20, 21] must also be mentioned. This concept arises from the following considerations. The evolution of massive graphs can be treated as the process of growing the graph at a time unit. Now, if we replace all the nodes that were added to the graph at the same unit of time by only one node, then we will get another graph of a smaller size. The bigger the time unit is, the smaller the new graph size will be. The evolution model is called scale-free (scale-invariant) if with high probability the new (scaled) graph has the same power-law distribution of in-degrees and out-degrees as the original graph, for any choice of the time unit length. It turns out that most of the random evolution


models have this property. For instance, the models of Aiello et al. [9] were proved to be scale-invariant.

The studies discussed in Subsection 2.1.3 increased interest in various properties of random graphs and the methods used to discover these properties. Indeed, numerical characteristics of graphs, such as clique and chromatic numbers, could be used as one of the steps in validating the proposed models. In this regard, the expected clique number of power-law random graphs is of special interest due to the results by Abello et al. [2] and Aiello et al. [9] mentioned in Subsections 2.1.1 and 2.1.3. If computed, it could be used as one of the points in verifying the validity of the model for the call graph proposed by Aiello et al. [9].

In this subsection we present some well-known facts regarding the clique and chromatic numbers in uniform random graphs.

Early results on the clique number are due to [93], where it was noticed that for a fixed $p$ almost all graphs $G \in \mathcal{G}(n, p)$ have about the same clique number if $n$ is sufficiently large. Bollobás and Erdős [32] further developed these remarkable results by proving some more specific facts about the clique number of a random graph. Let us discuss these results in more detail by presenting not only the facts but also some of the reasoning behind them. For more detail see the books by Bollobás [29, 30] and Janson et al. [73].

Assume that $0 < p < 1$ [...]

subgraph of $G$ induced by the first $n$ vertices $\{1, 2, \ldots, n\}$. Then the sequence $\omega(G_n)$ appears to be almost completely determined for a.e. $G \in \mathcal{G}(\mathbb{N}, p)$.

For a natural $l$, let us denote by $k_l(G_n)$ the number of cliques spanning $l$ vertices of $G_n$. Then, obviously,

$$\omega(G_n) = \max\{l : k_l(G_n) > 0\}.$$

Using this observation and the second moment method, Bollobás and Erdős [32] proved that if $p = p(n)$ satisfies $n$ [...]

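shown that for a.e. $G \in \mathcal{G}(\mathbb{N}, p)$, if $n$ is large enough, then

$$\lfloor l_0(n) - 2\log\log n/\log n \rfloor \le \omega(G_n) \le \lfloor l_0(n) + 2\log\log n/\log n \rfloor.$$

The work in [56] and Janson et al. [73] extended these results by showing that for $\varepsilon > 0$ there exists a constant $c_\varepsilon$ such that for $c_\varepsilon$ [...] $\lfloor 2\log_{1/p} n - 2\log_{1/p}\log_{1/p} n + 2\log_{1/p}(e/2) + 1 + \varepsilon/p \rfloor$.

The authors of [61] were the first to study the problem of coloring random graphs. Many other researchers contributed to solving this problem [12, 31]. We will mention some facts that emerged from these studies.

Łuczak [85] improved the results about the concentration of $\chi(G(n, p))$ previously proved by Shamir and Spencer [110], proving that for every sequence $p = p(n)$ such that $p \le n^{-6/7}$ there is a function $ch(n)$ such that a.a.s.

$$ch(n) \le \chi(G(n, p)) \le ch(n) + 1.$$

It was proved in [12] that for any positive constant $\delta$ the chromatic number of a uniform random graph $G(n, p)$, where $p = n^{-1/2-\delta}$, is a.a.s. concentrated in two consecutive values. Moreover, it was shown there that a proper choice of $p(n)$ may result in a one-point distribution. The function $ch(n)$ is difficult to find, but in some cases it can be characterized. For example, Janson et al. [73] proved that there exists a constant $c_0$ such that for any $p = p(n)$ satisfying $c_0$ [...]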


[...] [30] yields the following estimate: $\chi(G(n, p)) = n$ [...] 2.1.1). On the other hand, probabilistic methods similar to those discussed in Subsection 2.1.4 could be utilized in order to find the asymptotic distribution of the clique number in the same network's random graph model, and therefore verify this model.


One of the most important problems in modern finance is finding efficient ways of summarizing and visualizing stock market data that would allow one to obtain useful information about the behavior of the market. Nowadays, a great number of stocks are traded in the US stock market; moreover, this number steadily increases. The amount of data generated by the stock market every day is enormous. This data is usually visualized by thousands of plots reflecting the price of each stock over a certain period of time. The analysis of these plots becomes more and more complicated as the number of stocks grows.

It turns out that the stock market data can be effectively represented as a network, although this representation is not as obvious as in the case of telephone traffic or internet data. We have developed the network-based model of the market referred to as the market graph. This chapter is based on the results described in [26, 27, 28].

A natural graph representation of the stock market is based on the cross-correlations of price fluctuations. A market graph can be constructed as follows: each financial instrument is represented by a vertex, and two vertices are connected by an edge if the correlation coefficient of the corresponding pair of instruments (calculated for a certain period of time) exceeds a specified threshold $\theta$, $-1 \le \theta \le 1$.

Nowadays, a great number of different instruments are traded in the US stock market, so the market graph representing them is very large. The market graph that we construct has 6546 vertices and several million edges.
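The construction just described can be sketched in a few lines. This is an illustrative sketch only: it assumes daily log-returns as the measure of price fluctuation and the "greater than or equal to" form of the threshold rule used in Subsection 3.1.1, and the function names and toy price series are hypothetical.

```python
import math


def log_returns(prices):
    """Daily log-returns ln(P(t) / P(t - 1)) of a price series."""
    return [math.log(prices[t] / prices[t - 1]) for t in range(1, len(prices))]


def correlation(a, b):
    """Sample correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def market_graph(price_series, theta):
    """Adjacency sets of the market graph for the correlation threshold theta.

    price_series maps an instrument symbol to its list of daily prices.
    """
    returns = {s: log_returns(p) for s, p in price_series.items()}
    symbols = sorted(returns)
    adjacency = {s: set() for s in symbols}
    for i, u in enumerate(symbols):
        for v in symbols[i + 1:]:
            if correlation(returns[u], returns[v]) >= theta:
                adjacency[u].add(v)
                adjacency[v].add(u)
    return adjacency


# Toy (made-up) prices: A and B move together, C moves against them.
toy_prices = {
    "A": [10.0, 11.0, 12.1, 11.5, 12.0],
    "B": [20.0, 22.0, 24.3, 23.0, 24.1],
    "C": [30.0, 28.0, 26.5, 27.5, 26.0],
}
graph = market_graph(toy_prices, theta=0.5)
```

The pairwise loop is quadratic in the number of instruments, which is acceptable for a sketch but would call for a vectorized computation on a graph with thousands of vertices.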


In this chapter, we present a detailed study of the properties of this graph. It turns out that the market graph can be rather accurately described by the power-law model. We analyze the distribution of the degrees of the vertices in this graph, the edge density of this graph with respect to the correlation threshold, as well as its connectivity and the size of its connected components.

Furthermore, we look for maximum cliques and maximum independent sets in this graph for different values of the correlation threshold. Analyzing cliques and independent sets in the market graph gives us valuable knowledge about the internal structure of the stock market. For instance, a clique in this graph represents a set of financial instruments whose prices change similarly over time (a change of the price of any instrument in a clique is likely to affect all other instruments in this clique), and an independent set consists of instruments that are negatively correlated with respect to each other; therefore, it can be treated as a diversified portfolio. Based on the information obtained from this analysis, we will be able to classify financial instruments into certain groups, which will give us a deeper insight into the stock market structure.

3.1.1 Constructing the Market Graph

The cross-correlations between the returns of each pair of instruments $i$ and $j$ are calculated using the following formula [92]:

$$C_{ij} = \frac{E(R_i R_j) - E(R_i)E(R_j)}{\sqrt{Var(R_i)\,Var(R_j)}},$$


where $R_i(t) = \ln \frac{P_i(t)}{P_i(t-1)}$ is the return of instrument $i$ for day $t$, and $P_i(t)$ is the price of instrument $i$ on day $t$.

The correlation coefficients $C_{ij}$ can vary from -1 to 1. Figure 3-1 shows the distribution of the correlation coefficients based on the price data for the years 2000-2002. It can be seen that this plot has a shape similar to the normal distribution with mean 0.05.

Figure 3-1. Distribution of correlation coefficients in the stock market

The main idea of constructing a market graph is as follows. Let the set of financial instruments represent the set of vertices of the graph. Also, we specify a certain threshold value $\theta$, $-1 \le \theta \le 1$, and add an undirected edge connecting the vertices $i$ and $j$ if the corresponding correlation coefficient $C_{ij}$ is greater than or equal to $\theta$. Obviously, different values of $\theta$ define market graphs with the same set of vertices but different sets of edges.

It is easy to see that the number of edges in the market graph decreases as the threshold value $\theta$ increases. In fact, our experiments show that the edge density


of the market graph decreases exponentially w.r.t. $\theta$. The corresponding graph is presented in Figure 3-2.

Figure 3-2. Edge density of the market graph for different values of the correlation threshold.

In Subsection 2.1.3 we mentioned the connectivity thresholds in random graphs. The main idea of this concept is finding a threshold value of the parameter of the model that determines whether the graph is connected or not.

A similar question arises for the market graph: what is its connectivity threshold? Since the number of edges in the market graph depends on the chosen correlation threshold $\theta$, we should find a value $\theta_0$ that determines the connectivity of the graph. As mentioned above, the smaller the value of $\theta$ we choose, the more edges the market graph will have. So, if we decrease $\theta$, after a certain point the graph will become connected. We have conducted a series of computational


experiments for checking the connectivity of the market graph using the breadth-first search technique, and we obtained a relatively accurate approximation of the connectivity threshold: $\theta_0 \simeq 0.14382$. Moreover, we investigated the dependency of the size of the largest connected component in the market graph w.r.t. $\theta$. The corresponding plot is shown in Figure 3-3.

Figure 3-3. Plot of the size of the largest connected component in the market graph as a function of the correlation threshold.

It turns out that if a small (in absolute value) correlation threshold $\theta$ is specified, the distribution of the degrees of the vertices does not have any well-defined structure. Note that for these values of $\theta$ the market graph has a relatively high edge density (i.e., the ratio of the number of edges to the maximum possible number of edges). However, as the correlation threshold is increased, the degree


distribution more and more resembles a power law. In fact, for $\theta \ge 0.2$ this distribution is approximately a straight line in the logarithmic scale, which represents the power-law distribution, as mentioned above. Figure 3-4 demonstrates the degree distributions of the market graph for some positive values of the correlation threshold, along with the corresponding linear approximations. The slopes of the approximating lines were estimated using the least-squares method. Table 3-1 summarizes the estimates of the parameter $\beta$ of the power-law distribution (i.e., the slope of the line) for different values of $\theta$.

Table 3-1. Least-squares estimates of the parameter $\beta$ in the market graph for different values of the correlation threshold (* - complementary graph)

$\theta$    $\beta$
-0.2*
-0.15*
0.2         0.4931
0.25        0.5820
0.3         0.6793
0.35        0.7679
0.4         0.8269
0.45        0.8753
0.5         0.9054
0.55        0.9331
0.6         0.9743

From this table, it can be seen that the slope of the lines corresponding to positive values of $\theta$ is rather small. According to the power-law model, in this case a graph would have many vertices with high degrees; therefore, one can intuitively expect to find large cliques in a power-law graph with a small value of the parameter $\beta$.

We also analyze the degree distribution of the complement of the market graph, which is defined as follows: an edge connects instruments $i$ and $j$ if the correlation coefficient between them satisfies $C_{ij} \le \theta$. Studying this complementary graph is important for the next subject of our consideration, finding maximum independent


sets in the market graph with negative values of the correlation threshold $\theta$. Obviously, a maximum independent set in the initial graph is a maximum clique in the complement, so the maximum independent set problem can be reduced to the maximum clique problem in the complementary graph. Therefore, it is useful to investigate the degree distributions of the complementary graphs for different values of $\theta$. As can be seen from Figure 3-1, the distribution of the correlation coefficients is nearly symmetric around $\theta = 0.05$, so for the values of $\theta$ close to 0 the edge density of both the initial and the complementary graph is high enough. For these values of $\theta$ the degree distribution of a complementary graph also does not seem to have any well-defined structure, as in the case of the corresponding initial graph. As $\theta$ decreases (i.e., increases in absolute value), the degree distribution of a complementary graph starts to follow the power law. Figure 3-5 shows the degree distributions of the complementary graph, along with the least-squares linear regression lines.

Figure 3-4. Degree distribution of the market graph for $\theta = 0.4$ (left) and $\theta = 0.5$ (right) (logarithmic scale)

However, as one can see from Table 3-1, the slopes of these lines are higher than in the case of the graphs with positive values of $\theta$, which implies that there are fewer vertices with a high degree in these graphs, so intuitively, the size of cliques in a complementary graph (i.e., the size


of independent sets in the original graph) should be significantly smaller than in the case of the market graph with positive values of the correlation threshold (see Section 3.2).

Figure 3-5. Degree distribution of the complementary market graph for $\theta = -0.15$ (left) and $\theta = -0.2$ (right) (logarithmic scale)

For this purpose, we chose the market graph with a high correlation threshold ($\theta = 0.6$), calculated the degree of each vertex in this graph, and sorted the vertices in decreasing order of their degrees.


Interestingly, even though the edge density of the considered graph is only 0.04% (only highly correlated instruments are connected by an edge), there are many vertices with degrees greater than 100.

According to our calculations, the vertex with the highest degree in this market graph corresponds to the NASDAQ 100 Index Tracking Stock. The degree of this vertex is 216, which means that there are 216 instruments that are highly correlated with it. An interesting observation is that the degree of this vertex is more than twice the number of companies whose stock prices the NASDAQ index reflects, which means that these 100 companies greatly influence the market.

In Table 3-2 we present the "top 25" instruments in the U.S. stock market, according to their degrees in the considered market graph. The corresponding symbol definitions can be found on several websites, for example http://www.nasdaq.com. Note that most of them are indices that incorporate a number of different stocks of companies in different industries. Although this result is not surprising from the financial point of view, it is important as a practical justification of the market graph model.

The clustering coefficients of the market graph for different values of the correlation threshold are presented in Table 3-3. For instance, as one can see from this table, the market graph with $\theta = 0.6$ has almost the same edge density as the complementary market graph with $\theta = -0.15$; however, their clustering coefficients differ dramatically. This fact also intuitively explains the


results presented in the next section, which deals with cliques and independent sets in the market graph.

Table 3-2. Top 25 instruments with highest degrees in the market graph ($\theta = 0.6$)

symbol    vertex degree
QQQ       216
IWF       193
IWO       193
IYW       193
XLK       181
IVV       175
MDY       171
SPY       162
IJH       159
IWV       158
IVW       156
IAH       155
IYY       154
IWB       153
IYV       150
BDH       144
MKH       143
IWM       142
IJR       134
SMH       130
STM       118
IIH       116
IVE       113
DIA       106
IWD       106

The maximum clique problem (as well as the maximum independent set problem) is known to be NP-hard [59]. Moreover, it turns out that the maximum clique is difficult to approximate [18, 62]. This makes these problems especially challenging in large graphs. However, as we will see in the next subsection, even


though the maximum clique problem is generally very hard to solve in large graphs, the special structure of the market graph allows us to find the exact solution relatively easily.

Table 3-3. Clustering coefficients of the market graph (* - complementary graph)

$\theta$    edge density    clustering coef.
-0.15*                      $2.64 \times 10^{-5}$
                            0.0012
0.3         0.0178          0.4885
0.4         0.0047          0.4458
0.5         0.0013          0.4522
0.6         0.0004          0.4872
0.7         0.0001          0.4886

A standard integer programming formulation [33] was used to compute the exact maximum clique in the market graph; however, before solving this problem, we applied a greedy heuristic for finding a lower bound of the clique number, and a special preprocessing technique which reduces the problem size. To find a large clique, we apply the "best-in" greedy algorithm based on the degrees of vertices. Let $C$ denote the clique. Starting with $C = \emptyset$, we recursively add to the clique a vertex $v_{max}$ of largest degree and remove all vertices that are not adjacent to $v_{max}$ from the graph. After running this algorithm, we applied the following preprocessing


procedure [2]. We recursively remove from the graph all of the vertices which are not in $C$ and whose degree is less than $|C|$, where $C$ is the clique found by the greedy algorithm.

Denote by $G' = (V', E')$ the graph induced by the remaining vertices. Then the maximum clique problem can be formulated and solved for $G'$. The following integer programming formulation was used [33]:

$$\begin{aligned} \text{maximize} \quad & \sum_{i=1}^{|V'|} x_i \\ \text{s.t.} \quad & x_i + x_j \le 1, \quad (i, j) \notin E', \\ & x_i \in \{0, 1\}. \end{aligned}$$

For the instances of the market graph with positive correlation thresholds, this preprocessing significantly reduces the size of the graph [26]. This can be intuitively explained by the fact that these instances of the market graph are clustered (i.e., two vertices in a graph are more likely to be connected if they have a common neighbor), so the clustering coefficient, which is defined as the probability that for a given vertex its two neighbors are connected by an edge, is much higher than the edge density in these graphs (see Table 3-8). This characteristic is also typical for other power-law graphs arising in different applications.

After reducing the size of the original graph, the resulting integer programming problem for finding a maximum clique can be relatively easily solved using the CPLEX integer programming solver [71].

Table 3-4 summarizes the exact sizes of the maximum cliques found in the market graph for different values of $\theta$. It turns out that these cliques are


rather large, which agrees with the analysis of degree distributions and clustering coefficients in the market graphs with positive values of $\theta$.

Table 3-4. Sizes of the maximum cliques in the market graph with positive values of the correlation threshold (exact solutions)

$\theta$    edge density    clique size
0.35        0.0090          193
0.4         0.0047          144
0.45        0.0024          109
0.5         0.0013          85
0.55        0.0007          63
0.6         0.0004          45
0.65        0.0002          27
0.7         0.0001          22

These results show that in the modern stock market there are large groups of instruments whose price fluctuations behave similarly over time, which is not surprising, since nowadays different branches of the economy strongly affect each other.

Table 3-5 presents the sizes of the independent sets found using the greedy heuristic that was described in the previous section.
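The "best-in" greedy heuristic and the degree-based preprocessing step described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used for the reported experiments: the adjacency-set representation, the function names `greedy_clique` and `preprocess`, the tie-breaking rule, and the small example graph are all hypothetical.

```python
def greedy_clique(adjacency):
    """'Best-in' greedy heuristic for a large clique.

    Repeatedly picks a vertex of largest degree in the remaining graph and
    keeps only its neighbours. adjacency maps each vertex to a set of
    neighbours (no self-loops).
    """
    candidates = set(adjacency)
    clique = []
    while candidates:
        # Degree is measured inside the remaining graph; ties broken by name.
        v = max(candidates,
                key=lambda u: (len(adjacency[u] & candidates), str(u)))
        clique.append(v)
        candidates &= adjacency[v]
    return clique


def preprocess(adjacency, clique):
    """Recursively drop vertices outside the clique with degree below |C|."""
    keep = set(adjacency)
    protected = set(clique)
    size = len(clique)
    changed = True
    while changed:
        changed = False
        for v in list(keep):
            if v not in protected and len(adjacency[v] & keep) < size:
                keep.remove(v)
                changed = True
    return {v: adjacency[v] & keep for v in keep}


# Small made-up graph: a 4-clique {a, b, c, d} with a path a - e - f attached.
example = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "f"},
    "f": {"e"},
}
clique = greedy_clique(example)
reduced = preprocess(example, clique)
```

Any vertex that belongs to a clique larger than the greedy one must have degree at least $|C|$, so the reduction cannot discard a better solution; the reduced graph can then be passed to an exact solver.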


Table 3-5. Sizes of independent sets in the complementary market graph found using the greedy algorithm (lower bounds)

$\theta$    edge density    indep. set size
0.05        0.4794          45
0.0         0.2001          12
-0.05       0.0431          5
-0.1        0.005           3
-0.15       0.0005          2

This table demonstrates that the sizes of the computed independent sets are rather small, which is in agreement with the results of the previous section, where we mentioned that in the complementary graph the values of the parameter of the power-law distribution are rather high, and the clustering coefficients are very small.

The small size of the computed independent sets means that finding a large "completely diversified" portfolio (where all instruments are negatively correlated with each other) is not an easy task in the modern stock market.

Moreover, it turns out that one can make a theoretical estimate of the maximum size of a diversified portfolio, where all stocks are strictly negatively correlated with each other. Intuitively, the lower (higher in absolute value) the threshold $\theta$ we set, the smaller the diversified portfolio one would expect to find. These considerations are confirmed by the following theorem.

[...]

Proof.


maximum correlation is $\theta = \max_{i,j} C_{ij} < 0$. Consider the variance of the sum of these variables: [...]

Note that if $\theta < 0$, then $m\sigma^2_{max}(1 + (m-1)\theta) < 0$ for $m > 1 + \frac{1}{|\theta|}$.

Therefore, the number of stocks with pairwise correlations $C_{ij} \le \theta < 0$ cannot be greater than $m = 1 + \frac{1}{|\theta|}$.

Another natural question now arises: how many completely diversified portfolios can be found in the market? In order to find an answer, we have calculated maximal independent sets starting from each vertex, by running 6546 iterations of the greedy algorithm mentioned above. That is, for each of the considered 6546 financial instruments, we have found a completely diversified portfolio that would contain this instrument. Interestingly enough, for every vertex in the market graph, we were able to detect an independent set that contains this vertex, and the sizes of these independent sets were rather close. Moreover, all these independent sets were distinct. Figure 3-6 shows the frequency of the sizes of the independent sets found in the market graphs corresponding to different correlation thresholds.

These results demonstrate that it is always possible for an investor to find a group of stocks that would form a completely diversified portfolio with any given stock, and this can be done efficiently using the technique of finding independent sets in the market graph.
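The bound from the proof is easy to evaluate: a variance cannot be negative, so the factor $1 + (m-1)\theta$ must stay nonnegative, which gives $m \le 1 + 1/|\theta|$. The helper below is a minimal sketch of that arithmetic; the function name `max_diversified_size` is an assumption introduced for illustration.

```python
import math


def max_diversified_size(theta):
    """Largest integer m with 1 + (m - 1) * theta >= 0, for theta < 0."""
    if theta >= 0:
        raise ValueError("the bound applies only to negative thresholds")
    return math.floor(1 + 1 / abs(theta))


bound_half = max_diversified_size(-0.5)       # 1 + 1/0.5  = 3
bound_quarter = max_diversified_size(-0.25)   # 1 + 1/0.25 = 5
bound_015 = max_diversified_size(-0.15)       # floor(1 + 6.67) = 7
```

For $\theta = -0.15$ the bound is 7, which is consistent with the independent set of size 2 reported for this threshold in Table 3-5; the greedy result is only a lower bound, so the two figures bracket the true maximum.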


Figure 3-6. Frequency of the sizes of independent sets found in the market graph with $\theta = 0.00$ (left) and $\theta = 0.05$ (right)

Non-trivial information about the global properties of the stock market is obtained from the analysis of the degree distribution of the market graph. The highly specific structure of this distribution suggests that the stock market can be analyzed using the power-law model, which can theoretically predict some characteristics of the graph representing the market.

On the other hand, the analysis of cliques and independent sets in the market graph is also useful from the data mining point of view. As pointed out above, cliques and independent sets in the market graph represent groups of "similar" and "different" financial instruments, respectively. Therefore, information about the size of the maximum cliques and independent sets is also rather important, since it gives one an idea of the trends that take place in the stock


market. Besides analyzing the maximum cliques and independent sets in the market graph, one can also divide the market graph into the smallest possible set of distinct cliques (or independent sets). Partitioning a dataset into sets (clusters) of elements grouped according to a certain criterion is referred to as clustering, which is one of the well-known data mining problems [34].

As discussed above, the main difficulty one encounters in solving the clustering problem on a certain dataset is the fact that the number of desired clusters of similar objects is usually not known a priori; moreover, an appropriate similarity criterion should be chosen before partitioning a dataset into clusters.

Clearly, the methodology of finding cliques in the market graph provides an efficient tool for performing clustering based on the stock market data. The choice of the grouping criterion is clear and natural: "similar" financial instruments are determined according to the correlation between their price fluctuations. Moreover, the minimum number of clusters in the partition of the set of financial instruments is equal to the minimum number of distinct cliques that the market graph can be divided into (the minimum clique partition problem). A similar partition can be done using independent sets instead of cliques, which would represent the partition of the market into a set of distinct diversified portfolios. In this case the minimum possible number of clusters is equal to the number of sets in a partition of the vertices into a minimum number of distinct independent sets. This problem is called the graph coloring problem, and the number of sets in the optimal partition is referred to as the chromatic number of the graph.

We should also mention another major type of data mining problem with many applications in finance, referred to as classification problems. Although the setup of this type of problem is similar to clustering, one should clearly understand the difference between these two types of problems.


In classification, one deals with a pre-defined number of classes that the data elements must be assigned to. Also, there is a so-called training dataset, i.e., the set of data elements for which it is known a priori which class they belong to. It means that in this setup one uses some initial information about the classification of existing data elements. A certain classification model is constructed based on this information, and the parameters of this model are "tuned" to classify new data elements. This procedure is known as "training the classifier". An example of the application of this approach to classifying financial instruments can be found in [40].

The main difference between classification and clustering is the fact that, unlike classification, in the case of clustering one does not use any initial information about the class attributes of the existing data elements, but tries to determine a classification using appropriate criteria. Therefore, the methodology of classifying financial instruments using the market graph model is essentially different from the approaches commonly considered in the literature in the sense that it does not require any a priori information about the classes that certain stocks belong to, but classifies them only based on the behavior of their prices over time.

In order to investigate the dynamics of the market graph structure, we chose the period of 1000 trading days in 1998-2002 and considered eleven 500-day shifts within this period. The starting points of every two consecutive shifts are separated


by an interval of 50 days. Therefore, every pair of consecutive shifts had 450 days in common and 50 days different. The dates corresponding to each shift and the corresponding mean correlations are summarized in Table 3-6.

Table 3-6. Dates and mean correlations corresponding to each considered 500-day shift

Period #    Starting date    Ending date    Mean correlation
1           09/24/1998       09/15/2000     0.0403
2           12/04/1998       11/27/2000     0.0373
3           02/18/1999       02/08/2001     0.0381
4           04/30/1999       04/23/2001     0.0426
5           07/13/1999       07/03/2001     0.0444
6           09/22/1999       09/19/2001     0.0465
7           12/02/1999       11/29/2001     0.0545
8           02/14/2000       02/12/2002     0.0561
9           04/26/2000       04/25/2002     0.0528
10          07/07/2000       07/08/2002     0.0570
11          09/18/2000       09/17/2002     0.0672

This procedure allows us to accurately reflect the structural changes of the market graph using relatively small intervals between shifts, while at the same time maintaining sufficiently large sample sizes of the stock price data for calculating cross-correlations for each shift. We should note that in our analysis we considered only stocks which were among those traded as of the last of the 1000 trading days, i.e., for practical reasons we did not take into account stocks which had been withdrawn from the market.
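The shifting scheme can be checked with a few lines of arithmetic: windows of 500 trading days whose starting points are 50 days apart, inside a span of 1000 days, give exactly eleven periods. The sketch below uses 0-based day indices and a hypothetical helper name `shifted_windows`; it does not reproduce the calendar dates in Table 3-6.

```python
def shifted_windows(total_days=1000, window=500, step=50):
    """Half-open index ranges [start, end) of the overlapping periods."""
    return [(start, start + window)
            for start in range(0, total_days - window + 1, step)]


windows = shifted_windows()
# Number of days shared by each pair of consecutive windows.
overlaps = [windows[k][1] - windows[k + 1][0] for k in range(len(windows) - 1)]
```

With the default parameters the first window covers days 0 to 499 and the last covers days 500 to 999, and every consecutive pair shares 450 days.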


The first subject of our consideration is the distribution of correlation coefficients between all pairs of stocks in the market. As mentioned above, this distribution on $[-1, 1]$ had a shape similar to a part of the normal distribution with mean close to 0.05 for the sample data considered in [26, 27]. One of the interpretations of this fact is that the correlation of most pairs of stocks is close to zero; therefore, the structure of the stock market is substantially random, and one can make a reasonable assumption that the prices of most stocks change independently. As we consider the evolution of the correlation distribution over time, it turns out that the shape of this distribution remains stable, which is illustrated by Figure 3-7.

Figure 3-7. Distribution of correlation coefficients in the US stock market for several overlapping 500-day periods during 2000-2002 (period 1 is the earliest, period 11 is the latest).

The stability of the correlation coefficient distribution of the market graph intuitively motivates the hypothesis that the degree distribution should also remain stable for different values of the correlation threshold. To verify this assumption,


we have calculated the degree distribution of the graphs constructed for all considered time periods. The correlation threshold $\theta = 0.5$ was chosen to describe the structure of connections corresponding to significantly high correlations. Our experiments show that the degree distribution is similar for all time intervals, and in all cases it is well described by a power law. Figure 3-8 shows the degree distributions (in the logarithmic scale) for some instances of the market graph (with $\theta = 0.5$) corresponding to different intervals.

Figure 3-8. Degree distribution of the market graph for different 500-day periods in 2000-2002 with $\theta = 0.5$: (a) period 1, (b) period 4, (c) period 7, (d) period 11.

The cross-correlation distribution and the degree distribution of the market graph represent the general characteristics of the market, and the aforementioned


results lead us to the conclusion that the global structure of the market is stable over time. However, as we will see now, some global changes in the stock market structure do take place. In order to demonstrate this, we look at another characteristic of the market graph: its edge density.

In our analysis of the market graph dynamics, we chose a relatively high correlation threshold $\theta = 0.5$ that would ensure that we consider only the edges corresponding to the pairs of stocks which are significantly correlated with each other. In this case, the edge density of the market graph would represent the proportion of those pairs of stocks in the market whose price fluctuations are similar and influence each other. The subject of our interest is to study how this proportion changes during the considered period of time. Table 3-7 summarizes the obtained results. As can be seen from this table, both the number of vertices and the number of edges in the market graph increase as time goes on. Obviously, the number of vertices grows since new stocks appear in the market, and we do not consider those stocks which ceased to exist by the last of the 1000 trading days used in our analysis, so the maximum possible number of edges in the graph increases as well. However, it turns out that the number of edges grows faster; therefore, the edge density of the market graph increases from period to period. As one can see from Figure 3-9(a), the greatest increase of the edge density corresponds to the last two periods. In fact, the edge density for the latest interval is approximately 8.5 times higher than for the first interval. This dramatic jump suggests that there is a trend toward the "globalization" of the modern stock market, which means that nowadays more and more stocks significantly affect the behavior of the others.

It should be noted that the increase of the edge density could be predicted from the analysis of the distribution of the cross-correlations between all pairs of stocks. From Figure 3-7, one can observe that even though the distributions corresponding to different periods have a similar shape and the same mean,


the "tail" of the distribution corresponding to the latest period (period 11) is somewhat "heavier" than for the earlier periods, which means that there are more pairs of stocks with higher values of the correlation coefficient.

Table 3-7. Number of vertices and number of edges in the market graph for different periods ($\theta = 0.5$)

Period    Number of vertices    Number of edges    Edge density
1         5430                  2258               0.015%
2         5507                  2614               0.017%
3         5593                  3772               0.024%
4         5666                  5276               0.033%
5         5768                  6841               0.041%
6         5866                  7770               0.045%
7         6013                  10428              0.058%
8         6104                  12457              0.067%
9         6262                  12911              0.066%
10        6399                  19707              0.096%
11        6556                  27885              0.130%

Figure 3-9. Dynamics of edge density and maximum clique size in the market graph: evolution of the edge density (a) and maximum clique size (b) in the market graph ($\theta = 0.5$)
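The edge densities in Table 3-7 follow directly from the vertex and edge counts, since the density is the number of edges divided by the maximum possible number of edges $|V|(|V|-1)/2$. The short sketch below recomputes the first and last rows; the function name `edge_density` is an illustrative assumption.

```python
def edge_density(num_vertices, num_edges):
    """Ratio of the number of edges to the maximum possible number of edges."""
    return num_edges / (num_vertices * (num_vertices - 1) / 2)


first = edge_density(5430, 2258)     # period 1 in Table 3-7
last = edge_density(6556, 27885)     # period 11 in Table 3-7
growth = last / first                # relative increase between the two periods
```

Expressed as percentages, the two values round to the tabulated 0.015% and 0.130%, and their ratio is close to the factor of about 8.5 quoted in the text.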


Table 3-8 presents the sizes of the maximum cliques found in the market graph for different time periods. As in the previous subsection, we used a relatively high correlation threshold $\theta = 0.5$ to consider only significantly correlated stocks. As one can see, there is a clear trend of the maximum clique size increasing over time, which is consistent with the behavior of the edge density of the market graph discussed above (see Figure 3-9(b)). This result provides another confirmation of the globalization hypothesis discussed above.

Another related issue to consider is how much the structure of the maximum cliques differs for the various time periods. Table 3-9 presents the stocks included in the maximum cliques for different time periods. It turns out that in most cases stocks that appear in a clique in an earlier period also appear in the cliques in later periods.

There are some other interesting observations about the structure of the maximum cliques found for different time periods. It can be seen that all the cliques include a significant number of stocks of companies representing the "high-tech" industry sector. As examples, one can mention well-known companies such as Sun Microsystems, Inc., Cisco Systems, Inc., Intel Corporation, etc. Moreover, each clique contains stocks of companies related to the semiconductor industry (e.g., Cypress Semiconductor Corporation, Cree, Inc., Lattice Semiconductor Corporation, etc.), and the number of these stocks in the cliques increases with time. These facts suggest that the corresponding branches of industry expanded during the considered period of time to form a major cluster of the market.

In addition, we observed that in the later periods (especially in the last two periods) the maximum cliques contain a rather large number of exchange-traded funds, i.e., stocks that reflect the behavior of certain indices representing various groups of companies. It should be mentioned that all maximum cliques contain


the Nasdaq 100 tracking stock (QQQ), which was also found to be the vertex with the highest degree (i.e., correlated with the most stocks) in the market graph [26].

Table 3-8. Greedy clique size and the clique number for different time periods ($\theta = 0.5$)

Period    |V|     Edge dens. in G    Clustering coefficient    |C|    |V'|    Edge dens. in G'    Clique number
1         5430    0.00015            0.505                     15     76      0.286               18
2         5507    0.00017            0.504                     18     43      0.731               19
3         5593    0.00024            0.499                     26     49      0.817               27
4         5666    0.00033            0.517                     34     70      0.774               34
5         5768    0.00041            0.550                     42     82      0.787               42
6         5866    0.00045            0.558                     45     86      0.804               45
7         6013    0.00058            0.553                     51     110     0.769               51
8         6104    0.00067            0.566                     60     114     0.819               60
9         6262    0.00066            0.553                     62     107     0.869               62
10        6399    0.00096            0.486                     77     134     0.841               77
11        6556    0.00130            0.452                     84     146     0.844               85

Another natural question that one can pose is how the size of independent sets (i.e., diversified portfolios in the market) changes over time. As pointed out in [26, 27], finding a maximum independent set in the market graph turns out to be a much more complicated task than finding a maximum clique. In particular, in the case of solving the maximum independent set problem (or, equivalently, the maximum clique problem in the complementary graph), the preprocessing procedure described above does not reduce the size of the original graph. This can be explained by the fact that the clustering coefficient in the complementary market graph with $\theta = 0$ is much smaller than in the original graph corresponding to $\theta = 0.5$ (see Table 3-10).

Similarly to Section 3.2, we calculate maximal independent sets (a maximal independent set is an independent set that is not a subset of another independent set) in the market graph using the above greedy algorithm. As one can see from Table 3-10, the sizes of independent sets found in the market graph for $\theta = 0$ are rather small, which is consistent with the results of Section 3.2.

Table 3-9. Structure of maximum cliques in the market graph for different time periods ($\theta = 0.5$)

Period  Stocks included in maximum clique
1       BK, EMC, FBF, HAL, HP, INTC, NCC, NOI, NOK, PDS, PMCS, QQQ, RF, SII, SLB, SPY, TER, WM
2       ADI, ALTR, AMAT, AMCC, ATML, CSCO, KLAC, LLTC, LSCC, MDY, MXIM, NVLS, PMCS, QQQ, SPY, SUNW, TXN, VTSS, XLNX
3       AMAT, AMCC, CREE, CSCO, EMC, JDSU, KLAC, LLTC, LSCC, MDY, MXIM, NVLS, PHG, PMCS, QLGC, QQQ, SEBL, SPY, STM, SUNW, TQNT, TXCC, TXN, VRTS, VTSS, XLK, XLNX
4       AMAT, AMCC, ASML, ATML, BRCM, CHKP, CIEN, CREE, CSCO, EMC, FLEX, JDSU, KLAC, LSCC, MDY, MXIM, NTAP, NVLS, PMCS, QLGC, QQQ, RFMD, SEBL, SPY, STM, SUNW, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX
5       ALTR, AMAT, AMCC, ASML, ATML, BRCM, CIEN, CREE, CSCO, EMC, FLEX, IDTI, IRF, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, LSI, MDY, MXIM, NTAP, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SEBL, SPY, STM, SUNW, SWKS, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX
6       ADI, ALTR, AMAT, AMCC, ASML, ATML, BEAS, BRCM, CIEN, CREE, CSCO, CY, ELX, EMC, FLEX, IDTI, ITWO, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, LSI, MDY, MXIM, NTAP, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SEBL, SPY, STM, SUNW, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX
7       ALTR, AMAT, AMCC, ATML, BEAS, BRCD, BRCM, CHKP, CIEN, CNXT, CREE, CSCO, CY, DIGL, EMC, FLEX, HHH, ITWO, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, MDY, MERQ, MXIM, NEWP, NTAP, NVLS, ORCL, PMCS, QLGC, QQQ, RBAK, RFMD, SCMR, SEBL, SPY, SSTI, STM, SUNW, SWKS, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX
8       ALTR, AMAT, AMCC, AMKR, ARMHY, ASML, ATML, AVNX, BEAS, BRCD, BRCM, CHKP, CIEN, CMRC, CNXT, CREE, CSCO, CY, DIGL, ELX, EMC, EXTR, FLEX, HHH, IDTI, ITWO, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, MDY, MERQ, MRVC, MXIM, NEWP, NTAP, NVLS, ORCL, PMCS, QLGC, QQQ, RFMD, SCMR, SEBL, SNDK, SPY, SSTI, STM, SUNW, SWKS, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX
9       ADI, ALTR, AMAT, AMCC, ARMHY, ASML, ATML, AVNX, BDH, BEAS, BHH, BRCM, CHKP, CIEN, CLS, CREE, CSCO, CY, DELL, ELX, EMC, EXTR, FLEX, HHH, IAH, IDTI, IIH, INTC, IRF, JDSU, JNPR, KLAC, LLTC, LRCX, LSCC, LSI, MDY, MXIM, NEWP, NTAP, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SCMR, SEBL, SNDK, SPY, SSTI, STM, SUNW, SWKS, TQNT, TXCC, TXN, VRSN, VRTS, VTSS, XLK, XLNX
10      ADI, ALTR, AMAT, AMCC, AMD, ASML, ATML, BDH, BHH, BRCM, CIEN, CLS, CREE, CSCO, CY, CYMI, DELL, EMC, FCS, FLEX, HHH, IAH, IDTI, IFX, IIH, IJH, IJR, INTC, IRF, IVV, IVW, IWB, IWF, IWM, IWV, IYV, IYW, IYY, JBL, JDSU, KLAC, KOPN, LLTC, LRCX, LSCC, LSI, LTXX, MCHP, MDY, MXIM, NEWP, NTAP, NVDA, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SANM, SEBL, SMH, SMTC, SNDK, SPY, SSTI, STM, SUNW, TER, TQNT, TXCC, TXN, VRTS, VSH, VTSS, XLK, XLNX
11      ADI, ALA, ALTR, AMAT, AMCC, AMD, ASML, ATML, BDH, BEAS, BHH, BRCM, CIEN, CLS, CNXT, CREE, CSCO, CY, CYMI, DELL, EMC, EXTR, FCS, FLEX, HHH, IAH, IDTI, IIH, IJH, IJR, INTC, IRF, IVV, IVW, IWB, IWF, IWM, IWO, IWV, IWZ, IYV, IYW, IYY, JBL, JDSU, JNPR, KLAC, KOPN, LLTC, LRCX, LSCC, LSI, LTXX, MCRL, MDY, MKH, MRVC, MXIM, NEWP, NTAP, NVDA, NVLS, PHG, PMCS, QLGC, QQQ, RFMD, SANM, SEBL, SMH, SMTC, SNDK, SPY, SSTI, STM, SUNW, TER, TQNT, TXN, VRTS, VSH, VTSS, XLK, XLNX

Table 3-10. Size of independent sets in the market graph found using the greedy heuristic ($\theta = 0.0$). Edge density and clustering coefficient are given for the complementary graph.

Period  Number of  Edge     Clustering   Independent
        vertices   density  coefficient  set size
1       5430       0.258    0.293        11
2       5507       0.275    0.307        11
3       5593       0.281    0.307        10
4       5666       0.265    0.297        11
5       5768       0.260    0.292        11
6       5866       0.254    0.288        11
7       6013       0.228    0.269        11
8       6104       0.227    0.268        10
9       6262       0.238    0.277        12
10      6399       0.228    0.269        12
11      6556       0.201    0.245        11

For finding a clique partition, we choose the instance of the market graph with a low correlation threshold $\theta = 0.05$ (the mean of the correlation coefficients distribution shown in Figure 3-7), which would ensure that the edge density of the considered graph is high enough and the number of isolated vertices (which would obviously form distinct cliques) is small.

We use the standard greedy heuristic to compute a clique partition in the market graph: recursively find a maximal clique and remove it from the graph, until no vertex remains. Cliques are computed using the previously described greedy algorithm. The corresponding results for the market graph with threshold $\theta = 0.05$ are presented in Table 3-11. Note that the size of the largest clique in the partition is increasing from one period to another, with the largest clique in the last period containing about three times as many vertices as the corresponding clique in the first partition. At the same time, the number of cliques in the partition is comparable for different periods, with a slight overall downward trend, whereas the number of vertices is increasing as time goes on.

Table 3-11. The largest clique size and the number of cliques in computed clique partitions ($\theta = 0.05$)

Period  Number of  Edge     Largest clique    # of cliques in
        vertices   density  in the partition  the partition
1       5430       0.400    469               494
2       5507       0.377    552               517
3       5593       0.379    636               513
4       5666       0.405    743               503
5       5768       0.413    789               501
6       5866       0.425    824               496
7       6013       0.469    929               471
8       6104       0.475    983               470
9       6262       0.456    997               509
10      6399       0.474    1159              501
11      6556       0.521    1372              479

Another important result is the fact that the edge density of the market graph, as well as the maximum clique size, steadily increased during the last several years, which supports the well-known idea about the globalization of the economy, which has been widely discussed recently.
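The greedy clique-partition heuristic described above (find a maximal clique, remove it, repeat) can be sketched as follows. The function names, the "most neighbours among the candidates first" rule inside `greedy_maximal_clique`, and the toy graph are assumptions made for illustration; the greedy clique routine referred to in the text may differ in its selection rule.

```python
def greedy_maximal_clique(adj, vertices):
    """Grow a clique inside `vertices`, always adding the candidate with the most candidate neighbours."""
    candidates = set(vertices)
    clique = []
    while candidates:
        # Ties are broken by the smaller vertex index (integer labels assumed).
        v = max(candidates, key=lambda u: (len(adj[u] & candidates), -u))
        clique.append(v)
        # Keep only vertices adjacent to every vertex chosen so far.
        candidates &= adj[v]
    return clique


def greedy_clique_partition(adj):
    """Peel off greedy maximal cliques until no vertex remains."""
    remaining = set(adj)
    partition = []
    while remaining:
        clique = greedy_maximal_clique(adj, remaining)
        partition.append(sorted(clique))
        remaining -= set(clique)
    return partition


# Toy graph: a triangle 0-1-2, a path 2-3-4, and an isolated vertex 5.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}, 5: set()}
partition = greedy_clique_partition(adj)
```

As the text notes for the market graph, an isolated vertex ends up as a clique of its own, which is why a low threshold is chosen before partitioning.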

We have also indicated the natural way of dividing the set of financial instruments into groups of similar objects (clustering) by computing a clique partition of the market graph. This methodology can be extended by considering quasi-cliques in the partition, which may reduce the number of obtained clusters. Moreover, finding independent sets in the market graph provides a new approach to choosing diversified portfolios where all stocks are pairwise uncorrelated, which is potentially useful in practice.

The human brain is one of the most complex systems ever studied by scientists. The enormous number of neurons and the dynamic nature of the connections between them make the analysis of brain function especially challenging. One of the most important directions in studying the brain is treating disorders of the central nervous system. For instance, epilepsy is a common form of such disorders, which affects approximately 1% of the human population. Essentially, epileptic seizures represent excessive and hypersynchronous activity of the neurons in the cerebral cortex.

During the last several years, significant progress in the field of epileptic seizure prediction has been made. The advances are associated with the extensive use of electroencephalograms (EEG), which can be treated as a quantitative representation of the brain function. Rapid development of computational equipment has made it possible to store and process huge amounts of EEG data obtained from recording devices. The availability of these massive datasets gives rise to another problem: utilizing mathematical tools and data mining techniques for extracting useful information from EEG data. Is it possible to construct a "simple" mathematical model based on EEG data that would reflect the behavior of the epileptic brain?

In this chapter, we make an attempt to create such a model using a network-based approach.

In the case of the human brain and EEG data, we apply a relatively simple network-based approach. We represent the electrodes used for obtaining the EEG readings, which are located in different parts of the brain, as the vertices of the constructed graph. The data received from every single electrode is essentially a time series reflecting the change of the EEG signal over time. Later in the chapter we will discuss the quantitative measure characterizing statistical relationships between the recordings of every pair of electrodes, the so-called T-index. The values of the T-index $T_{ij}$ measured for all pairs of electrodes i and j enable us to establish certain rules of placing edges connecting different pairs of vertices i and j depending on the corresponding values of $T_{ij}$. Using this technique, we develop several graph-based mathematical models and study the dynamics of the structural properties of these graphs. As we will see, these models can provide useful information about the behavior of the brain prior to, during, and after an epileptic seizure.

4.1.1 Datasets

[...] 4-1 [...] [67, 69, 101]).

Since the brain is a nonstationary system, algorithms used to estimate measures of the brain dynamics should be capable of automatically identifying and appropriately weighing existing transients in the data. In a chaotic system, orbits originating from similar initial conditions (nearby points in the state space) diverge exponentially (expansion process). The rate of divergence is an important aspect of the system dynamics and is reflected in the value of Lyapunov exponents. The method used for estimation of the short time largest Lyapunov exponent STLmax, an estimate of Lmax for nonstationary data, is explained in detail in [66, 68, 118].

Figure 4-1. Electrode placement in the brain: (A) inferior transverse and (B) lateral views of the brain, illustrating approximate depth and subdural electrode placement for EEG recordings. Subdural electrode strips are placed over the left orbitofrontal (AL), right orbitofrontal (AR), left subtemporal (BL), and right subtemporal (BR) cortex. Depth electrodes are placed in the left temporal depth (CL) and right temporal depth (CR) to record hippocampal activity.

By splitting the EEG time series recorded from each electrode into a sequence of non-overlapping segments, each 10.24 sec in duration, and estimating STLmax for each of these segments, profiles of STLmax over time are generated.

Having estimated the STLmax temporal profiles at an individual cortical site, and as the brain proceeds towards the ictal state, the temporal evolution of the stability of each cortical site is quantified. The spatial dynamics of this transition are captured by consideration of the relations of the STLmax between different cortical sites. For example, if a similar transition occurs at different cortical sites, the STLmax of the involved sites are expected to converge to similar values prior to the transition. Such participating sites are called "critical sites", and such a convergence "dynamical entrainment". More specifically, in order for the dynamical entrainment to have a statistical content, we allow a period over which the difference of the means of the STLmax values at two sites is estimated. We use periods of 10 minutes (i.e., moving windows including approximately 60 STLmax values over time at each electrode site) to test the dynamical entrainment at the 0.01 statistical significance level. We employ the T-index (from the well-known paired T-statistics for comparisons of means) as a measure of distance between the mean values of pairs of STLmax profiles over time. The T-index at time t between electrode sites i and j is defined as:

$$T_{i,j}(t) = \sqrt{N}\,\frac{\left|E\{STL_{\max,i} - STL_{\max,j}\}\right|}{\sigma_{i,j}(t)},$$

where $E\{\cdot\}$ is the sample average difference for the $STL_{\max,i} - STL_{\max,j}$ estimated over a moving window $w_t(\lambda)$ defined as:

$$w_t(\lambda) = \begin{cases} 1 & \text{if } \lambda \in [t-N-1,\; t], \\ 0 & \text{if } \lambda \notin [t-N-1,\; t], \end{cases}$$

where N is the length of the moving window. Then, $\sigma_{i,j}(t)$ is the sample standard deviation of the STLmax differences between electrode sites i and j within the moving window $w_t(\lambda)$. The T-index follows a t-distribution with N-1 degrees of freedom. For the estimation of the $T_{i,j}(t)$ indices in our data we used N = 60 (i.e., average of 60 differences of STLmax exponents between sites i and j per moving window of approximately 10 minute duration). Therefore, a two-sided t-test with N-1 (= 59) degrees of freedom, at a statistical significance level $\alpha$, should be used to test the null hypothesis, $H_0$: "brain sites i and j acquire identical STLmax values at time t". In this experiment, we set the probability of a type I error $\alpha = 0.01$ (i.e., the probability of falsely rejecting $H_0$ if $H_0$ is true, is 1%). For the T-index to pass this test, the $T_{i,j}(t)$ value should be within the interval [0, 2.662]. We will refer to the upper bound of this interval as $T_{critical}$.

4.2.1 Key Idea of the Model

[...] [70, 108, 111], which is essentially the divergence of the profiles of the STLmax time series. As was indicated above, this divergence is characterized by values of the T-index greater than $T_{critical}$.

[...] 4-2
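The entrainment test based on the T-index can be sketched in a few lines. This is a minimal illustration, assuming the paired form of the statistic (square root of N times the absolute mean difference, divided by the sample standard deviation of the differences) and the critical value 2.662 quoted in the text; the function names and the synthetic STLmax windows are hypothetical.

```python
import math

T_CRITICAL = 2.662  # two-sided t-test, 59 degrees of freedom, alpha = 0.01 (value quoted in the text)


def t_index(stl_i, stl_j):
    """Paired T-index of two equally long windows of STLmax values."""
    n = len(stl_i)
    diffs = [a - b for a, b in zip(stl_i, stl_j)]
    mean = sum(diffs) / n
    # Sample standard deviation of the differences (n - 1 in the denominator).
    # Note: a window with constant differences has zero deviation and is not handled here.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return math.sqrt(n) * abs(mean) / math.sqrt(var)


def is_entrained(stl_i, stl_j, t_critical=T_CRITICAL):
    """True when the null hypothesis of equal mean STLmax is not rejected."""
    return t_index(stl_i, stl_j) <= t_critical


# Synthetic windows of N = 60 values (made-up numbers, for illustration only).
site_b = [5.0] * 60
site_a_close = [5.0 + (1.0 if k % 2 == 0 else -1.0) for k in range(60)]  # differences average to 0
site_a_far = [5.0 + (1.0 if k % 2 == 0 else 3.0) for k in range(60)]     # differences average to 2
```

With these windows, `site_a_close` is entrained with `site_b` (T-index 0), while `site_a_far` is not, since its T-index is far above the critical value.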

Figure 4-2. Number of edges in GRAPH-II

The size of the largest connected component of GRAPH-II is presented in Figure 4-3. One can see that GRAPH-II is connected during the interictal period (i.e., the brain is a connected system); however, it becomes disconnected after the seizure (during the postictal state): the size of the largest connected component significantly decreases. This fact is not surprising and can be intuitively explained, since after the seizure the brain needs some time to "reset" [70, 108, 111] and restore the connections between the functional units.
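Tracking the largest connected component over time only requires a threshold graph and a breadth-first search per time window. The sketch below assumes that GRAPH-II joins sites i and j whenever $T_{ij} < T_{critical}$ (the definition does not appear in this part of the text, so this is an assumption), and uses a made-up 4-by-4 T-index matrix; the function names are hypothetical.

```python
from collections import deque


def entrainment_graph(T, t_critical=2.662):
    """Adjacency sets: sites i and j are linked when T[i][j] < t_critical (assumed rule)."""
    n = len(T)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if T[i][j] < t_critical:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def largest_component_size(adj):
    """Size of the largest connected component, found by breadth-first search."""
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            v = queue.popleft()
            size += 1
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        best = max(best, size)
    return best


# Made-up symmetric T-index matrix for four electrode sites.
T = [
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 2.0, 7.0],
    [5.0, 2.0, 0.0, 9.0],
    [6.0, 7.0, 9.0, 0.0],
]
graph = entrainment_graph(T)
size = largest_component_size(graph)
```

Here sites 0, 1 and 2 form one component and site 3 is isolated, so the graph is disconnected in the sense used above.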

Figure 4-3. The size of the largest connected component in GRAPH-II. Number of nodes in the graph is 30.

[...] hypothesis is partially supported by the behavior of the average T-index of the edges corresponding to the Minimum Spanning Tree of GRAPH-I, which is shown in Figure 4-4.

However, this hypothesis cannot be verified using the considered data, since the values of average T-indices are calculated over a 10-minute interval, whereas the seizure signal propagates in a fraction of a second. Therefore, in order to check if the seizure signal actually spreads along the minimum spanning tree, one needs to introduce other nonlinear measures to reflect the behavior of the brain over short time intervals.

Figure 4-4. Average value of T-index of the edges in Minimum Spanning Tree of GRAPH-I.

Also, note that the average value of the T-index in the Minimum Spanning Tree is less than $T_{critical}$, which also supports the above statement about the connectivity of the system.

We look at the behavior of the average degree of the vertices in GRAPH-II over time. Clearly, this plot is very similar to the behavior of the edge density of GRAPH-II (see Figure 4-5).
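The average T-index along the minimum spanning tree can be computed with Prim's algorithm on the complete weighted graph. The sketch below assumes that GRAPH-I is the complete graph on the electrode sites with edge weights $T_{ij}$; the function name and the toy matrix are hypothetical.

```python
def mst_average_weight(T):
    """Average edge weight of a minimum spanning tree of the complete graph with weights T[i][j].

    Uses the O(n^2) array form of Prim's algorithm, which suits dense graphs.
    Requires at least two vertices.
    """
    n = len(T)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge linking each vertex to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        v = min((u for u in range(n) if not in_tree[u]), key=lambda u: best[u])
        in_tree[v] = True
        total += best[v]
        for u in range(n):
            if not in_tree[u] and T[v][u] < best[u]:
                best[u] = T[v][u]
    return total / (n - 1)


# Made-up symmetric T-index matrix for four electrode sites.
T = [
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 2.0, 7.0],
    [5.0, 2.0, 0.0, 9.0],
    [6.0, 7.0, 9.0, 0.0],
]
average = mst_average_weight(T)
```

For this matrix the tree uses the edges with weights 1.0, 2.0 and 6.0, so the average can be compared directly with the critical value, as is done for Figure 4-4.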

Figure 4-5. Average degree of the vertices in GRAPH-II.

We are also particularly interested in high-degree vertices, i.e., the functional units of the brain that are at a certain time moment connected (entrained) with many other brain sites. Interestingly enough, the vertex with a maximum degree in GRAPH-II usually corresponds to the electrode which is located in RTD (right temporal depth) or RST (right subtemporal cortex); in other words, the vertex with the maximum degree is located near the epileptogenic focus.

[...] [69]. In fact, this approach utilizes the same preprocessing technique (i.e., calculating the values of T-indices for all pairs of electrode sites) as we apply in this chapter. In this subsection, we will briefly describe this quadratic programming technique and relate it to the graph models introduced above.

The main idea of the considered quadratic programming approach is to construct a model that would select a certain number of so-called "critical" electrode sites, i.e., those that are the most entrained during the seizure. According to Section 3, such a group of electrode sites should produce a minimal sum of T-indices calculated for all pairs of electrodes within this group. If the number of critical sites is set equal to k, and the total number of electrode sites is n, then the problem of selecting the optimal group of sites can be formulated as the following quadratic 0-1 problem [69]:

$$\min\; x^T A x \quad \text{s.t.} \quad \sum_{i=1}^{n} x_i = k.$$

In this setup, the vector $x = (x_1, x_2, \ldots, x_n)$ consists of components equal to either 1 (if the corresponding site is included in the group of critical sites) or 0 (otherwise), and the elements of the matrix $A = [a_{ij}]_{i,j=1,\ldots,n}$ are the values of the $T_{ij}$'s at the seizure point.

However, as was shown in previous studies, one can observe the "resetting" of the brain after seizure onset [111, 70, 108], that is, the divergence of STLmax profiles after a seizure. Therefore, to ensure that the optimal group of critical sites shows this divergence, one can reformulate this optimization problem by adding one more quadratic constraint:

where the matrix $B = [b_{ij}]_{i,j=1,\ldots,n}$ is the T-index matrix of brain sites i and j within 10 minute windows after the onset of a seizure.

This problem is then solved using standard techniques, and the group of k critical sites is found. It should be pointed out that the number of critical sites k is predetermined, i.e., it is defined empirically, based on practical observations. Also, note that in terms of the GRAPH-I model this problem represents finding a subgraph of GRAPH-I of a fixed size, satisfying the properties specified above.

Now, recall that we introduced GRAPH-III using the same principles as in the formulation of the above optimization problem, that is, we considered the connections only between the pairs of sites i, j satisfying both of the two conditions: $T_{ij} < T_{critical}$ at the seizure point, and $T_{ij} > T_{critical}$ 10 minutes after the seizure point, which are exactly the conditions that the critical sites must satisfy. A natural way of detecting such groups of sites is to find cliques in GRAPH-III. Since a clique is a subgraph where all vertices are interconnected, it means that all pairs of electrode sites in a clique would satisfy the aforementioned conditions. Therefore, it is clear that the size of the maximum clique in GRAPH-III would represent the upper bound on the number of selected critical sites, i.e., the maximum value of the parameter k in the optimization problem described above.

Computational results indicate that the maximum clique sizes for different instances of GRAPH-III are close to the actual values of k empirically selected in the quadratic programming model, which shows that these approaches are consistent with each other.
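For a small number of electrode sites, the selection of k critical sites can be illustrated by plain enumeration. The sketch below is not the solution method of [69]: it is a brute-force search, the function names are hypothetical, and the post-seizure divergence requirement is passed in as a generic lower bound `min_after` on $x^T B x$, because the exact right-hand side of that constraint is not given in this part of the text.

```python
from itertools import combinations


def quadratic_form(M, group):
    """Value of x^T M x for the 0-1 vector x that selects the sites in `group`."""
    return sum(M[i][j] for i in group for j in group)


def select_critical_sites(A, k, B=None, min_after=None):
    """Enumerate all groups of k sites and keep the one minimising x^T A x.

    If B is given, groups with x^T B x < min_after are skipped
    (assumed form of the divergence requirement).
    """
    n = len(A)
    best_group, best_value = None, float("inf")
    for group in combinations(range(n), k):
        if B is not None and quadratic_form(B, group) < min_after:
            continue
        value = quadratic_form(A, group)
        if value < best_value:
            best_group, best_value = group, value
    return best_group, best_value


# Made-up T-index matrices at the seizure point (A) and 10 minutes after onset (B).
A = [
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 2.0, 7.0],
    [5.0, 2.0, 0.0, 9.0],
    [6.0, 7.0, 9.0, 0.0],
]
B = [
    [0.0, 1.0, 8.0, 8.0],
    [1.0, 0.0, 8.0, 8.0],
    [8.0, 8.0, 0.0, 8.0],
    [8.0, 8.0, 8.0, 0.0],
]
unconstrained = select_critical_sites(A, k=2)
constrained = select_critical_sites(A, k=2, B=B, min_after=2 * 2.662)
```

Enumeration grows combinatorially with n and k, so it only serves to make the objective and the role of the extra constraint concrete; in the example, the pair (0, 1) is the most entrained at the seizure point but is rejected once divergence after the seizure is required.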

[...] level. The main idea of this model is to use the properties of GRAPH-I, GRAPH-II, and GRAPH-III as a characterization of the behavior of the brain prior to, during, and after epileptic seizures. According to this graph model, the graphs reflecting the behavior of the epileptic brain demonstrate the following properties: [...]

Moreover, one of the advantages of the considered graph model is the possibility to detect special formations in these graphs, such as cliques and minimum spanning trees, which can be used for further study of various properties of the epileptic brain.

Among the directions of future research in this field, one can mention the possibility of developing directed graph models based on the analysis of EEG data. Such models would take into account the natural "asymmetry" of the brain, where certain functional units control the other ones. Also, one could apply a similar approach to studying the patterns underlying the brain function of patients with other types of disorders, such as Parkinson's disease or sleep disorders.

Therefore, the methodology introduced in this chapter can be generalized and applied in practice.

In this chapter, we will discuss one of the most interesting real-life graph applications: so-called "social networks", where the vertices are real people [63, 116]. The main idea of this approach is to consider the "acquaintanceship graph" connecting the entire human population. In this graph, an edge connects two given vertices if the corresponding two persons know each other.

Social networks are associated with the famous "small-world" hypothesis, which claims that despite the large number of vertices, the distance between any two vertices (or, the diameter of the graph) is small. More specifically, the idea of "six degrees of separation" has been introduced. It states that any two persons in the world are linked with each other through a sequence of at most six people [63, 116, 117].

Clearly, one cannot verify this hypothesis for the graph incorporating more than 6 billion people living on the Earth; however, smaller subgraphs of the acquaintanceship graph connecting certain groups of people can be investigated in detail. One of the most well-known graphs of this type is the scientific collaboration graph reflecting the information about the joint works between all scientists. Two vertices are connected by an edge if the corresponding two scientists have a joint research paper. Another graph of this type is known as the "Hollywood graph": it links all the movie actors, and an edge connects two actors if they ever appeared in the same movie. Well-known concepts associated with these graphs are the so-called "Erdős number" (in the scientific collaboration graph) and "Bacon number" (in the Hollywood graph), which are assigned to every vertex and characterize the distance from this vertex to the vertex denoting the "center" of the graph.

In the collaboration graph, the central vertex corresponds to the famous graph theoretician Paul Erdős, whereas in the Hollywood graph the same position is assigned to Kevin Bacon.

In this chapter, we discuss graphs of a similar type arising in sports, which represent the players' "collaboration". In these graphs, the players are the vertices, and an edge is added to the graph if the corresponding two players ever played together on the same team. One example of this type of graph is the graph representing baseball players. For any two baseball players who ever played in Major League Baseball (MLB), a path connecting them can be found in this graph.

As another instance of social networks in sports, we study the "NBA graph", where the vertices represent all the basketball players who are currently playing in the NBA. We apply standard graph-theoretical algorithms for investigating the properties of this graph, such as its connectivity and diameter (i.e., the maximum distance between all pairs of vertices in the graph). As we will see later in the chapter, this study also confirms the "small-world hypothesis". Moreover, we introduce a distance measure in the NBA graph similar to the Erdős number and the Bacon number. The central role in this graph is given to Michael Jordan, the greatest basketball player of all time, and we refer to this measure as the Jordan number.

[...] distances in this graph, the "central vertex" is introduced. This vertex corresponds to Paul Erdős, the father of the theory of random graphs. This vertex is assigned Erdős number equal to 0. For all other vertices in the graph, the Erdős number is defined as the distance (i.e., the shortest path length) from the central vertex. For example, those scientists who had a joint paper with Erdős have Erdős number 1, those who did not collaborate with Erdős, but collaborated with Erdős' collaborators have Erdős number 2, etc.

Following this logic, one can construct the connected component of the collaboration graph with "concentric circles", which would incorporate almost all scientists in the world, except those who never collaborated with anybody. This connected component is expected to have a relatively small diameter.

The idea of constructing collaboration graphs encompassing people in different areas gave rise to several other applications. Next, we discuss the Hollywood graph and the baseball graph, where the number of vertices is significantly smaller than in the scientific collaboration graph, which allows one to study their structure in more detail.
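Since the Erdős number is a shortest-path distance from a fixed vertex in an unweighted graph, a single breadth-first search yields it for every reachable vertex. The sketch below uses a tiny made-up co-authorship graph; the function name `distance_numbers` and the vertex labels are hypothetical.

```python
from collections import Counter, deque


def distance_numbers(adj, center):
    """Breadth-first search distances from `center`; unreachable vertices get no number."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


# Made-up co-authorship graph, for illustration only.
coauthors = {
    "Erdos": {"A", "B"},
    "A": {"Erdos", "C"},
    "B": {"Erdos"},
    "C": {"A", "D"},
    "D": {"C"},
    "E": set(),  # never collaborated with anybody, so it lies outside the component
}
numbers = distance_numbers(coauthors, "Erdos")
histogram = Counter(numbers.values())                     # how many vertices have each number
average = sum(numbers.values()) / (len(numbers) - 1)      # mean over vertices other than the center
```

The `histogram` corresponds to the "concentric circles" picture: one entry per circle, with the count of vertices on it.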

Figure 5-1. Number of vertices in the Hollywood graph with different values of Bacon number. Average Bacon number = 2.946.

[...] actor. It turns out that most of the actors have Bacon numbers equal to 2 or 3, and the maximum possible Bacon number is equal to 8, which is the case only for 3 vertices.

The distribution of Bacon numbers in the Hollywood graph is shown in Figure 5-1. The average Bacon number (i.e., the average path length from a given actor to Bacon) is equal to 2.946. As one can see, both the average and the maximum Bacon numbers of the Hollywood graph are very small, which provides an argument in favor of the "small world hypothesis" mentioned above.

Figure 5-2. Number of vertices in the baseball graph with different values of Wynn number. Average Wynn number = 2.901

[...] has 15817 vertices. Links between any pair of baseball players can be found at the "Oracle of Baseball" website.

Figure 5-2 shows the distribution of Wynn numbers in the baseball graph. The maximum Wynn number is 6, which is smaller than the maximum Bacon number, since the total number of baseball players is less than the number of Hollywood actors.

[...] Hollywood graph, and Early Wynn as the center of the baseball graph is the fact that it is reasonable to expect them to be connected to many vertices: Bacon appeared in many movies, and Wynn played on several baseball teams and had a lot of teammates during his long career. However, one can choose less "connected" centers of these graphs, and in this case the maximum distance from the new center of the graph may significantly increase. For example, if one chooses Barry Bonds as the center of the baseball graph, the maximum Bonds number will be 9 instead of 6. Moreover, in the Hollywood graph, it is possible to choose the center so that the maximum distance from it is equal to 14, and the average distance is greater than 6 (instead of 2.946). Therefore, in order to have more complete information about the structure of these graphs, one should calculate the maximum possible distance among all pairs of vertices in the graph. Recall that this quantity is referred to as the diameter of the graph. Clearly, the diameter can be found by considering each vertex as the center of the graph, calculating the corresponding maximal distances, and then choosing the maximum among them.

In the next section, we study the properties of the NBA graph incorporating basketball players playing in the world's best basketball league. In a similar fashion, we introduce the Jordan number, investigate its values corresponding to different vertices, and calculate the diameter of this graph.
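The procedure just described (treat each vertex in turn as the center, record its maximal distance, take the largest) can be sketched as follows. The code assumes a connected, unweighted graph stored as adjacency sets; the function names and the five-vertex path used as an example are hypothetical.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest breadth-first search distance from `source` (graph assumed connected)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return max(dist.values())


def diameter(adj):
    """Maximum eccentricity over all vertices, i.e. one search per candidate center."""
    return max(eccentricity(adj, v) for v in adj)


# A path 0-1-2-3-4: the middle vertex is a good center, the end vertices are poor ones.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
from_middle = eccentricity(path, 2)
from_end = eccentricity(path, 0)
path_diameter = diameter(path)
```

The example mirrors the remark about Bonds versus Wynn: the maximal distance depends strongly on the chosen center, whereas the diameter is a property of the graph itself.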

As one can easily see, this graph has a highly specific structure: the players of every team form a clique in the graph (i.e., a set of completely interconnected vertices), because all the vertices corresponding to the players of the same team must be interconnected. Since many players change teams during or between the seasons, there are edges connecting the vertices from different cliques (teams). Note that this type of structure is common for all "collaboration networks" (see Figure 5-3).

It should be pointed out that the number of players on a basketball team is relatively small, and players' transfers between different teams occur rather often; therefore, it would be logical to expect that the NBA graph is connected, i.e., there is a path from every vertex to every other vertex, and moreover that the length of this path is small. As we will see below, calculations confirm these assumptions.
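The clique structure described above follows directly from how such a graph is built: every roster contributes all pairs of its players as edges. The sketch below illustrates this with made-up rosters; the function name `collaboration_graph`, the team names and the player labels are hypothetical.

```python
from itertools import combinations


def collaboration_graph(rosters):
    """Link every pair of players who appear together on at least one roster."""
    adj = {}
    for players in rosters.values():
        for p in players:
            adj.setdefault(p, set())
        # Each roster becomes a clique: all pairs of its members are connected.
        for p, q in combinations(players, 2):
            adj[p].add(q)
            adj[q].add(p)
    return adj


# Made-up rosters keyed by (team, season); player "c" changes teams between seasons.
rosters = {
    ("Team X", 2001): ["a", "b", "c"],
    ("Team Y", 2001): ["d", "e"],
    ("Team Y", 2002): ["c", "d", "e"],
}
graph = collaboration_graph(rosters)
```

The transfer of player "c" is what creates edges between the two team cliques, which is the mechanism that makes such graphs connected.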

Figure 5-3. General structure of the NBA graph and other collaboration networks

First, we used a standard breadth-first search technique for checking the connectivity of the considered graph. Starting from an arbitrary vertex, we were able to locate all other vertices in the graph, which means that every vertex is reachable from any other; therefore, the graph is connected. In the next subsection, we will also see that every pair of vertices in this graph is connected by a short path, which is in agreement with the "small-world hypothesis".

Figure 5-4. Number of vertices in the NBA graph with different values of Jordan number. Average Jordan number = 2.270

Similarly to the social graphs mentioned above, we define the "central vertex" in the NBA graph corresponding to Michael Jordan, who played for the Washington Wizards during his final NBA season. Obviously, all other players on the Wizards' roster for 2002-2003, as well as all the players who have played with Jordan during at least one season in the past, have Jordan number 1. It should be noted that Michael Jordan played for only two teams (Chicago Bulls and Washington Wizards) throughout his entire career; therefore, one can expect that the number of players with Jordan number 1 is rather small. In fact, only 24 players currently playing in the NBA have Jordan number 1.

Table 5-1. Jordan numbers of some NBA stars (end of the 2002-2003 season).

Player              Team                    Jordan Number
Kobe Bryant         Los Angeles Lakers      2
Vince Carter        Toronto Raptors         2
Vlade Divac         Sacramento Kings        2
Tim Duncan          San Antonio Spurs       2
Michael Finley      Dallas Mavericks        2
Steve Francis       Houston Rockets         3
Kevin Garnett       Minnesota Timberwolves  3
Pau Gasol           Memphis Grizzlies       3
Richard Hamilton    Detroit Pistons         1
Allen Iverson       Philadelphia 76ers      2
Jason Kidd          New Jersey Nets         2
Toni Kukoc          Milwaukee Bucks         1
Karl Malone         Utah Jazz               2
Stephon Marbury     Phoenix Suns            2
Shawn Marion        Phoenix Suns            2
Kenyon Martin       New Jersey Nets         3
Jamal Mashburn      New Orleans Hornets     2
Tracy McGrady       Orlando Magic           2
Reggie Miller       Indiana Pacers          3
Yao Ming            Houston Rockets         3
Dikembe Mutombo     New Jersey Nets         2
Steve Nash          Dallas Mavericks        2
Dirk Nowitzki       Dallas Mavericks        2
Jermaine O'Neal     Indiana Pacers          2
Shaquille O'Neal    Los Angeles Lakers      2
Gary Payton         Milwaukee Bucks         2
Paul Pierce         Boston Celtics          2
Scottie Pippen      Portland Trail Blazers  1
David Robinson      San Antonio Spurs       2
Arvydas Sabonis     Portland Trail Blazers  2
Jerry Stackhouse    Washington Wizards      1
Predrag Stojakovic  Sacramento Kings        2
Antoine Walker      Boston Celtics          2
Ben Wallace         Detroit Pistons         2
Chris Webber        Sacramento Kings        2

Following similar logic, the players who have played with Jordan's "collaborators" have Jordan number 2, and so on. However, it turns out that the maximum Jordan number in this instance of the NBA graph is only 3, i.e., all the players are linked with Jordan through at most two vertices, which is certainly not surprising: with 29 teams and only around 15 players on each team, the NBA is really a "small world". Figure 5-4 shows the distribution of Jordan numbers in the NBA graph. The average Jordan number is equal to 2.27, which is smaller than the average Bacon number in the Hollywood graph and the average Wynn number in the baseball graph, due to the smaller number of vertices.

Table 5-1 presents Jordan numbers corresponding to some well-known NBA players. Not surprisingly, most of them have Jordan number 2, except for several players with Jordan number 3: those who joined the league recently, and therefore did not have many teammates throughout their career, as well as Reggie Miller, who spent 16 seasons on the same team (Indiana Pacers), and Kevin Garnett, who played in Minnesota for 8 years. Scottie Pippen, Toni Kukoc, and Jerry Stackhouse were Jordan's teammates at different times; therefore, they have Jordan number 1.

Furthermore, we calculated the diameter of the NBA graph, i.e., the maximum possible distance between any two vertices in the graph. Since the maximum Jordan number in the NBA graph is equal to 3, one would expect the value of the diameter to be of the same order of magnitude. As was mentioned in the previous section, the diameter of the NBA graph can be found as follows: for every given vertex, we calculate the distances between this vertex and all others. In this approach, we need to repeat this procedure 404 times, and every time a different vertex is considered to be the "center" of the graph. Our calculations show that the diameter of the NBA graph (the maximum distance between all pairs of vertices) is equal to 4. Therefore, one can claim that the NBA graph actually follows the small-world hypothesis, since its diameter is small enough.

Table 5-2. Degrees of the vertices in the NBA graph

Degree interval  Number of vertices
11-20            134
21-30            116
31-40            103
41-50            42
51-60            8
61+              2

Table 5-2 presents the number of vertices in the NBA graph corresponding to different intervals of the degree values.

It would be reasonable to assume that if one picks a vertex with a high degree as the center of the NBA graph, the average distance in the graph corresponding to this vertex would be smaller than the average Jordan number. We have found the most "connected" players in the NBA graph with the smallest corresponding average distances. Table 5-3 presents five players who could be the most "connected" centers of the NBA graph. As one can notice, all of them are "bench players" who have changed teams many times during their career; therefore, they have high degrees in the NBA graph. Also, an interesting observation is that although the degree of Corie Blount's vertex is smaller than Jim Jackson's, the average connectivity is higher for Corie Blount, which could be explained by the fact that his teammates were highly "connected" themselves.
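Ranking candidate centers by average distance, as is done for Table 5-3, amounts to one breadth-first search per vertex followed by a sort. The sketch below assumes a connected graph; the function names and the small example graph are hypothetical and unrelated to the actual NBA data.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def rank_centers(adj):
    """Rows (vertex, degree, average distance to all other vertices), best center first."""
    rows = []
    for v in adj:
        dist = bfs_distances(adj, v)
        average = sum(dist.values()) / (len(adj) - 1)
        rows.append((v, len(adj[v]), average))
    return sorted(rows, key=lambda row: row[2])


# Made-up graph: "c" is adjacent to "a", "b", "d"; "e" hangs off "d".
adj = {
    "a": {"c"},
    "b": {"c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
ranking = rank_centers(adj)
```

Keeping the degree next to the average distance in each row makes it easy to spot cases like the one noted above, where a vertex of lower degree still has the smaller average distance.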

Table 5-3. The most "connected" players in the NBA graph

Player        Team                 Degree  Av. Distance
Corie Blount  Chicago Bulls        63      1.906
Jim Jackson   Sacramento Kings     68      1.923
Robert Pack   New Orleans Hornets  57      1.936
Grant Long    Boston Celtics       50      1.946
Bimbo Coles   Boston Celtics       54      1.958

Although the instance of the NBA graph considered in this chapter contains only currently active basketball players, it can be easily extended to reflect all players in the history of the NBA. Moreover, since a lot of foreign players from different countries and continents have come to the NBA in recent years, one would expect that the graph covering all basketball players playing in major foreign championships is also connected and has a small diameter.

In this dissertation, we have addressed several issues regarding the use of network-based techniques for solving various problems arising in the broad area of the analysis of complex systems. We have demonstrated that applying these approaches is effective in many applications, including finance, biomedicine, telecommunications, sociology, etc. If a real-world massive dataset can be appropriately represented as a network structure, its analysis using graph-theoretical techniques often yields important practical results.

Clearly, the research in this area is far from complete. As technological progress continues, new types of datasets emerge in different practical fields, which leads to further research in the field of modeling and information retrieval from these datasets. Moreover, the approaches discussed in this dissertation can potentially be extended to obtain a more detailed picture of the structure of the considered datasets. In future work, the network models described above can be generalized to take into account the direction of links between vertices (directed graphs), which can help to understand the mechanisms of influence between different elements of the systems (e.g., stocks, brain units, etc.). In addition, some parameters can be assigned to vertices representing elements of the system (e.g., stocks can be characterized by their expected returns and liquidities). This leads to solving optimization problems on weighted graphs (e.g., maximum weighted clique/independent set), which may be more challenging to solve in practice for large graphs; however, this analysis may provide valuable information about the considered systems.

[1] J. Abello, A. Buchsbaum and J. Westbrook, 2002. A functional approach to external graph algorithms. Algorithmica, 32(3):437-58.
[2] J. Abello, P.M. Pardalos, and M.G.C. Resende, 1999. On maximum clique problems in very large graphs, DIMACS Series, 50, American Mathematical Society, 119-130.
[3] J. Abello, P.M. Pardalos, and M.G.C. Resende (eds.), 2002. Handbook of Massive Data Sets, Kluwer Academic Publishers, Dordrecht, The Netherlands.
[4] J. Abello and J.S. Vitter (eds.), 1999. External Memory Algorithms. Vol. 50 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, Providence, RI.
[5] L. Adamic, 1999. The Small World Web. Proceedings of ECDL'99, Lecture Notes in Computer Science, 1696:443-452. Springer, Berlin.
[6] L. Adamic and B. Huberman, 2000. Power-law distribution of the World Wide Web. Science, 287:2115a.
[7] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, 1998. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, in Proceedings of ACM SIGMOD International Conference on Management of Data, ACM, New York, 94-105.
[8] W. Aiello, F. Chung, and L. Lu, 2001. A random graph model for power law graphs, Experimental Math. 10:53-66.
[9] W. Aiello, F. Chung and L. Lu, 2002. Random evolution in massive graphs. In J. Abello, P. Pardalos and M. Resende (eds.), Handbook on Massive Data Sets. Kluwer Academic Publishers, Dordrecht, The Netherlands.
[10] R. Albert and A.-L. Barabási, 2002. Statistical mechanics of complex networks, Reviews of Modern Physics 74, 47-97.
[11] R. Albert, H. Jeong and A.-L. Barabási, 1999. Diameter of the World-Wide Web. Nature, 401:130-131.
[12] N. Alon and M. Krivelevich, 1997. The concentration of the chromatic number of random graphs. Combinatorica, 17:303-313.

[13] L. Amaral, A. Scala, M. Barthelemy, and H. Stanley, 2000. Classes of small-world networks. Proc. of National Academy of Sciences USA, 97:11149-11152.
[14] M.R. Anderberg, 1973. Cluster Analysis for Applications, Academic Press, New York, NY.
[15] L. Arge, 1995. The buffer tree: A new technique for optimal I/O algorithms. Proceedings of the Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science, 955:334-345, Springer-Verlag, Berlin.
[16] L. Arge, G.S. Brodal and L. Toma, 2000. On external memory MST, SSSP and multi-way planar graph separation. Proceedings of the Scandinavian Workshop on Algorithmic Theory, Lecture Notes in Computer Science, 1851. Springer-Verlag, Berlin.
[17] S. Arora, C. Lund, R. Motwani, and M. Szegedy, 1998. Proof verification and hardness of approximation problems. Journal of the ACM, 45:501-555.
[18] S. Arora and S. Safra, 1992. Approximating clique is NP-complete, Proceedings of the 33rd IEEE Symposium on Foundations on Computer Science, Oct. 24-27, 1992, Pittsburgh, PA, 2-13.
[19] A.-L. Barabási, 2002. Linked, Perseus Publishing, New York.
[20] A.-L. Barabási and R. Albert, 1999. Emergence of scaling in random networks. Science, 286:509-511.
[21] A.-L. Barabási, R. Albert and H. Jeong, 2000. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A, 281:69-77.
[22] A.-L. Barabási, R. Albert, H. Jeong, G. Bianconi, 2000. Power-law distribution of the World Wide Web. Science, 287:2115a.
[23] K.P. Bennett and O.L. Mangasarian, 1992. Neural Network Training via Linear Programming, in Advances in Optimization and Parallel Computing, P.M. Pardalos (ed.), North Holland, Amsterdam, 56-67.
[24] P. Berkhin, 2002. Survey of Clustering Data Mining Techniques. Technical Report, Accrue Software, San Jose, CA.
[25] V. Boginski, S. Butenko, and P.M. Pardalos, 2003. Modeling and Optimization in Massive Graphs. In: P.M. Pardalos and H. Wolkowicz, editors. Novel Approaches to Hard Discrete Optimization, American Mathematical Society, 17-39.
[26] V. Boginski, S. Butenko, and P.M. Pardalos, 2003. On Structural Properties of the Market Graph. In: A. Nagurney (editor), Innovations in Financial and Economic Networks, Edward Elgar Publishers, 28-45.

PAGE 105

[27] V.Boginski,S.Butenko,andP.M.Pardalos,2005.Statisticalanalysisofnancialnetworks.ComputationalStatisticsandDataAnalysis,48(2):431443. [28] V.Boginski,S.Butenko,andP.M.Pardalos,2005.MiningMarketData:ANetworkApproach.ComputersandOperationsResearch,inpress. [29] B.Bollobas,1978.ExtremalGraphTheory.AcademicPress,NewYork. [30] B.Bollobas,1985.RandomGraphs.AcademicPress,NewYork. [31] B.Bollobas,1988.Thechromaticnumberofrandomgraphs.Combinatorica,8:49-56. [32] B.BollobasandP.Erdos,1976.Cliquesinrandomgraphs.Math.Proc.Camb.Phil.Soc.,80:419-427. [33] I.M.Bomze,M.Budinich,P.M.Pardalos,andM.Pelillo,1999.Themaximumcliqueproblem.In:D.-Z.DuandP.M.Pardalos,editors,HandbookofCombinatorialOptimization,KluwerAcademicPublishers,Dordrecht,TheNetherlands,1-74. [34] P.S.Bradley,U.M.Fayyad,andO.L.Mangasarian,1999.MathematicalProgrammingforDataMining:FormulationsandChallenges.INFORMSJournalonComputing,11(3),217{238. [35] P.S.Bradley,O.L.Mangasarian,andW.N.Street,1998.FeatureSelectionviaMathematicalProgramming,INFORMSJournalonComputing10,209217. [36] S.BrinandL.Page,1998.Theanatomyofalargescalehypertextualwebsearchengine.Proceedingsofthe7thWorldWideWebConference,107{117. [37] A.Broder,R.Kumar,F.Maghoul,P.Raghavan,S.Rajagopalan,R.Stata,A.Tomkins,andJ.Wiener,2000.GraphstructureintheWeb.ComputerNetworks,33:309{320. [38] A.Broder,R.Kumar,F.Maghoul,P.Raghavan,S.Rajagopalan,R.Stata,A.Tompkins,andJ.Wiener,2000.TheBow-TieWeb.Proceedingsofthe9thInternationalWorldWideWebConference,May15-19,2000,Amsterdam. [39] A.L.Buchsbaum,M.Goldwasser,S.Venkatasubramanian,andJ.R.West-brook,2000.Onexternalmemorygraphtraversal.Proceedingsofthe11thACM-SIAMSymposiumonDiscreteAlgorithms,January9-11,2000,SanFrancisco,CA. [40] V.Bugera,S.Uryasev,andG.Zrazhevsky,2003.ClassicationUsingOpti-mization:ApplicationtoCreditRatingsofBonds.Univ.ofFlorida,ISEDept.,ResearchReport#2003-14.

PAGE 106

[41] G.Caldarelli,R.Marchetti,L.Pietronero,2000.TheFractalPropertiesofInternet.EurophysicsLetters,52. [42] Y.-J.Chiang,M.T.Goodrich,E.F.Grove,R.Tamassia,D.E.Vengro,andJ.S.Vitter,1995.External-memorygraphalgorithms.ProceedingsoftheACM-SIAMSymposiumonDiscreteAlgorithms,6:139-149,January22-24,1995,SanFrancisco,CA. [43] F.ChungandL.Lu,2001.Thediameterofrandomsparsegraphs.AdvancesinAppliedMath.,26,257-279. [44] C.CooperandA.Frieze,2003.Ageneralmodelofwebgraphs.RandomStructures&Algorithms,22(3):311{335. [45] C.CooperandA.Frieze,2004.Thesizeofthelargeststronglyconnectedcomponentofarandomgraphwithagivendegreesequence.Combinatorics,ProbabilityandComputing,13(3):319-337. [46] P.Dolan,1992.Spanningtreesinrandomgraphs.InA.FriezeandT.Luczak,eds.,RandomGraphs,2:47-58.JohnWileyandSons,NewYork. [47] P.ErdosandA.Renyi,1959.Onrandomgraphs.PublicationesMathematicae,6:290-297. [48] P.ErdosandA.Renyi,1960.Ontheevolutionofrandomgraphs.Publ.Math.Inst.Hungar.Acad.Sci.,5:17-61. [49] P.ErdosandA.Renyi,1961.Onthestrengthofconnectednessofarandomgraph.ActaMath.Acad.Sci.Hungar.,12:261-267. [50] M.Faloutsos,P.FaloutsosandC.Faloutsos,1999.Onpower-lawrelationshipsoftheInternettopology.InProc.ACMSIGCOMM,Cambridge,MA,Sept.1999,pp.251{262. [51] T.A.FeoandM.G.C.Resende,1994.Agreedyrandomizedadaptivesearchprocedureformaximumindependentset.OperationsResearch,42:860-878. [52] T.A.FeoandM.G.C.Resende,1995.Greedyrandomizedadaptivesearchprocedures.JournalofGlobalOptimization,6:109-133. [53] U.Feige,S.Goldwasser,L.Lovasz,SSafra,andM.Szegedy,1996.Interactiveproofsandthehardnessofapproximatingcliques.JournaloftheACM,43:268-292. [54] U.FeigeandJ.Kilian,1998.Zeroknowledgeandthechromaticnumber.JournalofComputerandSystemSciences,57:187-199. [55] D.J.FellemanandD.C.VanEssen,1991.DistributedHierarchicalProcessinginthePrimateCerebralCortex.Cereb.Cortex,1,1{47.

PAGE 107

[56] A.Frieze.Ontheindependencenumberofrandomgraphs,1990.DisctereMathematics,81:171-175. [57] A.FriezeandC.McDiarmid,1997.Algorithmictheoryofrandomgraphs.RandomStructuresandAlgorithms,10:5-42. [58] M.R.GareyandD.S.Johnson,1976.Thecomplexityofnear-optimalcoloring.JournaloftheACM,23:43-49. [59] M.R.GareyandD.S.Johnson,1979.ComputersandIntractability:AGuidetotheTheoryofNP-completeness,Freeman,NewYork. [60] R.GovindanandA.Reddy,1997.Ananalysisofinternetinterdomaintopol-ogyandroutestability.Proc.IEEEINFOCOM.Kobe,Japan. [61] G.R.GrimmettandC.J.H.McDiarmid,1975.Oncoloringrandomgraphs.MathematicalProceedingsofCambridgePhil.Society,77:313-324. [62] J.Hastad,1999.Cliqueishardtoapproximatewithinn1,ActaMathematica182105-142. [63] B.Hayes,2000.GraphTheoryinPractice.AmericanScientist,88:9-13(PartI),104-109(PartII). [64] C.C.Hilgetag,R.Kotter,K.E.Stephen,O.Sporns,2002.ComputationalMethodsfortheAnalysisofBrainConnectivity,In:G.A.Ascoli,ed.,Compu-tationalNeuroanatomy,HumanaPress,Totowa,NJ. [65] B.HubermanandL.Adamic,1999.GrowthdynamicsoftheWorld-WideWeb.Nature,401:131. [66] L.D.IasemidisandJ.C.Sackellares,1991.TheevolutionwithtimeofthespatialdistributionofthelargestLyapunovexponentonthehumanepilepticcortex.In:Duke,D.W.,Pritchard,W.S.,eds.,MeasuringChaosintheHumanBrain,49-82.WorldScientic,Singapore. [67] L.D.Iasemidis,J.C.Principe,J.M.Czaplewski,R.L.Gilmore,S.N.Roper,J.C.Sackellares,1997.Spatiotemporaltransitiontoepilepticseizures:anonlineardynamicalanalysisofscalpandintracranialEEGrecordings.In:Silva,F.L.,Principe,J.C.,Almeida,L.B.,eds.,SpatiotemporalModelsinBiologicalandArticialSystems,81-88.IOSPress,Amsterdam. [68] L.D.Iasemidis,J.C.Principe,J.C.Sackellares,2000.Measurementandquanticationofspatiotemporaldynamicsofhumanepilepticseizures.In:Akay,M.,ed.,Nonlinearbiomedicalsignalprocessing.IEEEPress,vol.II,294-318. [69] L.D.Iasemidis,P.M.Pardalos,J.C.Sackellares,D-S.Shiau,2001.Quadraticbinaryprogramminganddynamicalsystemapproachtodeterminethe

PAGE 108

predictabilityofepilepticseizures.JournalofCombinatorialOptimization,5:9{26. [70] L.D.Iasemidis,D.S.Shiau,J.C.Sackellares,P.M.Pardalos,A.Prasad,2004.Dynamicalresettingofthehumanbrainatepilepticseizures:applicationofnonlineardynamicsandglobaloptimizationtecniques.IEEETransactionsonBiomedicalEngineering,51(3):493{506. [71] ILOGCPLEX7.0ReferenceManual,2000. [72] A.K.JainandR.C.Dubes,1988.AlgorithmsforClusteringData,Prentice-Hall,EnglewoodClis,NJ. [73] S.Janson,T.LuczakandA.Rucinski,2000.RandomGraphs.Wiley&Sons,NewYork. [74] H.Jeong,S.Mason,A.-L.Barabasi,andZ.N.Oltvai,2001.Lethalityandcertaintyinproteinnetworks.Nature,411:41-42. [75] H.Jeong,B.Tomber,R.Albert,Z.N.Oltvai,andA.-L.Barabasi,2000.Thelarge-scaleorganizationofmetabolicnetworks,Nature,407:651-654. [76] D.S.JohnsonandM.A.Trick(eds.),1996.Cliques,Coloring,andSatisabil-ity:SecondDIMACSImplementationChallenge,Vol.26ofDIMACSSeries,AmericanMathematicalSociety,Providence,RI. [77] V.KleeandD.Larman,1981.Diametersofrandomgraphs.CanadianJournalofMathematics,33:618-640. [78] J.Kleinberg,1999.Authoritativesourcesinahyperlinkedenvironment.JournaloftheACM,46. [79] J.KleinbergandS.Lawrence,2001.TheStructureoftheWeb.Science,294:1849-50. [80] V.F.Kolchin,1999.RandomGraphs.CambridgeUniversityPress,Cambridge,UK. [81] R.Kumar,P.Raghavan,S.Rajagopalan,D.Sivakumar,A.Tomkins,andE.Upfal,2000.TheWebasagraph.In:Proceedingsofthe19thACMSIGMOD-SIGACT-SIGARTsymposiumonPrinciplesofdatabasesystems,Dallas,TX,pp.1{10. [82] R.Kumar,P.Raghavan,S.Rajagopalan,andA.Tomkins,1999.TrawlingtheWebforcybercommunities.ComputerNetworks,31(11-16):1481{1493. [83] V.KumarandE.Schwabe,1996.Improvedalgorithmsanddatastructuresforsolvinggraphproblemsinexternalmemory.In:ProceedingsoftheEighth

PAGE 109

IEEESymposiumonParallelandDistributedProcessing,NewOrleans,LA,pp.169-176. [84] S.LawrenceandC.L.Giles,1999.AccessibilityofInformationontheWeb.Nature,400:107{109. [85] T.Luczak,1991.Anoteonthesharpconcentrationofthechromaticnumberofrandomgraphs.Combinatorica,11:295{297. [86] T.Luczak,1990.Componentsbehaviornearthecriticalpointoftherandomgraphprocess.RandomStructuresandAlgorithms,1:287-310. [87] T.Luczak,1998.Randomtreesandrandomgraphs.RandomStructuresandAlgorithms,13:485-500. [88] T.Luczak,B.PittelandJ.Wierman,1994.Thestructureofarandomgraphnearthepointofthephasetransition.TransactionsoftheAmericanMathematicalSociety,341:721-748. [89] C.LundandM.Yannakakis,1994.Onthehardnessofapproximatingmini-mizationproblems.JournaloftheACM,41:960-981. [90] O.L.Mangasarian,1993.MathematicalProgramminginNeuralNetworks,ORSAJournalonComputing,5:349-360. [91] O.L.Mangasarian,W.N.Street,andW.H.Wolberg,1995.BreastCancerDiagnosisandPrognosisviaLinearProgramming,OperationsResearch43(4),570-577. [92] R.N.MantegnaandH.E.Stanley,2000.AnIntroductiontoEconophysics:CorrelationsandComplexityinFinance,CambridgeUniversityPress,Cam-bridge,UK. [93] D.Matula,1970.Onthecompletesubgraphofarandomgraph.InR.BoseandT.Dowling,eds.,CombinatoryMathematicsanditsApplications,356-369,ChapelHill,NC. [94] A.Medina,I.Matta,andJ.Byers,2000.OntheOriginofPower-lawsinInternetTopologies.ACMComputerCommunicationReview,30:160-163. [95] B.MirkinandI.Muchnik,1998.CombinatoralOptimizationinClustering.In:HandbookofCombinatorialOptimization(D.-Z.DuandP.M.Pardalos,eds.),Volume2,261{329.KluwerAcademicPublishers,Dordrecht,TheNetherlands. [96] A.Mendelzon,G.Mihaila,andT.Milo,1997.QueryingtheWorldWideWeb.JournalofDigitalLibraries,1:68-88. [97] A.MendelzonandP.Wood,1995.Findingregularsimplepathsingraphdatabases.SIAMJ.Comp.,24:1235-1258.

PAGE 110

[98] M.MolloyandB.Reed,1995.Acriticalpointforrandomgraphswithagivendegreesequence.RandomStructuresandAlgorithms,6:161-180. [99] M.MolloyandB.Reed,1998.Thesizeofthelargestcomponentofarandomgraphonaxeddegreesequence.Combinatorics,ProbabilityandComputing,7:295-306. [100] J.M.MurreandD.P.Sturdy,1995.TheConnectivityoftheBrain:Multi-LevelQuantitativeAnalysis.Biol.Cybern.,73,529{545. [101] P.M.Pardalos,W.Chaovalitwongse,L.D.Iasemidis,J.C.Sackellares,D.-S.Shiau,P.R.Carney,O.A.Prokopyev,andV.A.Yatsenko,2004.SeizureWarningAlgorithmBasedonSpatiotemporalDynamicsofIntracranialEEG.MathematicalProgramming,101(2):365-385. [102] P.M.Pardalos,T.Mavridou,andJ.Xue,1998.TheGraphColoringProblem:ABibliographicSurvey.In:HandbookofCombinatorialOptimization(D.-Z.DuandP.M.Pardalos,eds.),Volume2,331{395.KluwerAcademicPublishers,Dordrecht,TheNetherlands. [103] R.Pastor-Satorras,A.Vazquez,andA.Vespignani,2001.DynamicalandcorrelationpropertiesoftheInternet.Phys.Rev.Lett.,87:258701. [104] R.Pastor-SatorrasandA.Vespignani,2001.Epidemicspreadinginscale-freenetworks.PhysicalReviewLetters,86:3200-3203. [105] G.Piatetsky-ShapiroandW.Frawley(eds.),1991.KnowledgeDiscoveryinDatabases,MITPress,Cambridge,MA. [106] O.A.Prokopyev,V.Boginski,W.Chaovalitwongse,P.M.Pardalos,J.C.Sackellares,andP.R.Carney,2005.Network-BasedTechniquesinEEGDataAnalysisandEpilepticBrainModeling.In:DataMininginBiomedicine,P.M.Pardalosetal.(eds.),Springer,NewYork(toappear). [107] D.E.RumelhartandD.Zipser,1985.FeatureDiscoverybyCompetitiveLearning.CognitiveScience,9,75{112. [108] J.C.Sackellares,L.D.Iasemidis,R.L.Gilmore,S.N.Roper,1997.Epilepticseizuresasneuralresettingmechanisms.Epilepsia,38,S3,189. [109] E.ScheinermanandJ.Wierman,1989.Optimalandnear-optimalbroadcast-inginrandomgraphs.DiscreteAppliedMathematics,25:289-297. [110] E.ShamirandJ.Spencer,1987.SharpconcentrationofthechromaticnumberonrandomgraphsGn;p.Combinatorica,7:124{129. 
[111] D.S.Shiau,Q.Luo,S.L.Gilmore,S.N.Roper,P.M.Pardalos,J.C.Sackel-lares,L.D.Iasemidis,2000.Epilepticseizuresresettingrevisited.Epilepsia,41,S7,208-209.

PAGE 111

[112] J.D.UllmanandM.Yannakakis,1991.Theinput/outputcomplexityoftransitiveclosure.AnnalsofMathematicsandArticialIntelligence,3:331-360. [113] V.Vapnik,S.E.Golowich,andA.Smola,1997.SupportVectorMethodforFunctionApproximation,RegressionEstimation,andSignalProcessing,inAdvancesinNeuralInformationProcessingSystems9,M.C.Mozer,M.I.Jordan,andT.Petsche(eds.),MITPress,Cambridge,MA. [114] V.N.Vapnik,1995.TheNatureofStatisticalLearningTheory,Springer,NewYork. [115] J.S.Vitter,2001.ExternalMemoryAlgorithmsandDataStructures:DealingwithMASSIVEDATA.ACMComputingSurveys,33:209-271. [116] D.Watts,1999.SmallWorlds:TheDynamicsofNetworksBetweenOrderandRandomness,PrincetonUniversityPress,Princeton,NJ. [117] D.WattsandS.Strogatz,1998.Collectivedynamicsof`small-world'net-works,Nature,393:440-442. [118] A.Wolf,J.B.Swift,H.L.Swinney,J.A.Vastano,1985.DeterminingLya-punovexponentsfromatimeseries.PhysicaD,16:285-317.

PAGE 112

Vladimir Boginski was born on September 23, 1980, in Bryansk, Russia. He received his bachelor's degree in Applied Mathematics from Moscow Institute of Physics and Technology (State University) in 2000. In 2001, he entered the graduate program in Industrial and Systems Engineering at the University of Florida. He received his M.S. and Ph.D. degrees in Industrial and Systems Engineering from the University of Florida in May 2003 and August 2005, respectively.