Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: On the natural growth of random forests
Permanent Link: http://ufdc.ufl.edu/UF00095133/00001
 Material Information
Title: On the natural growth of random forests
Alternate Title: Department of Computer and Information Science and Engineering Technical Report
Physical Description: Book
Language: English
Creator: Johnson, Theodore
Publisher: Department of Computer and Information Science, University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: July 29, 1992
Copyright Date: 1992
 Record Information
Bibliographic ID: UF00095133
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.

On the Natural Growth of Random Forests


Theodore Johnson, University of Florida
Harold S. Stone, IBM T. J. Watson Research Center

July 29, 1992


Abstract
We explore several aspects of the natural growth of random forests. We show that if initially there
are k single-node forests and N nodes to add to the forest, and every forest node is equally likely to
be connected to the next node added, then the tree that receives the last node will have an expected
2N/k nodes. We next explore the growth of a minimum weight forest. We show that the probability
that a node in the forest is connected to the next node added to the forest depends primarily upon
how recently the forest node was added to the forest, and that this distribution approaches a limit
distribution. We obtain the surprising result that when a node is added to the forest, it is adjacent
to the most recently added node 50% of the time.


1 Introduction


In this paper, we examine several distributions associated with randomly grown forests. We assume that

there are k initial trees in the forest, each of which consists of a single node, and N nodes to be added

to the forest. The forest grows by repeatedly adding a node not in the forest to a tree. The trees are

never merged.

Randomly grown and minimum weight forests are used in many algorithms, and our results provide
useful insight into their expected size and performance. This work was motivated by the desire to analyze
the expected time complexity of a 2-matching algorithm [13]. Other works that use minimum weight
spanning trees or forests include Held and Karp's Traveling Salesman Problem algorithms [4, 5], and the
matching algorithm of Desler and Hakimi [2].

Our first growth model (the uniform growth model) assumes that when a node is added to the forest,

the edge that connects it to the forest is equally likely to be adjacent to every node already in the forest.

We show that this growth assumption leads to an easily solved system of equations that describe the

distribution of tree sizes.

In our second growth model (the minimum weight model) we assume that all nodes in the graph are

connected by weighted edges. We then seek to characterize the growth of a minimum weight forest. We

initially propose the uniform growth model as a tractable approximation to the minimum weight model.

We show that the uniform growth model is a poor approximation, however, because the uniform growth

model predicts short, bushy trees, while the minimum weight model predicts long, thin trees.

Previous work in characterizing the growth of forests has concentrated on forests in random graphs

(see [1] for a survey), or on the natural growth of search trees (see [8] for a survey). The question of

the weight of the minimum spanning tree of a randomly weighted graph has been addressed by Frieze

[3] and Steele [12]. In this work, however, we are interested in the growth of random forests, and are not

concerned about their weight.


2 Uniform Growth Model


In this section, we analyze the growth of random forests that have a tractable growth model. When we

build the forest, we add to the forest an edge that connects a forest node to a nonforest node. We say

that the forest node adjacent to the new edge receives the edge, and that edge and the nonforest node

are added to the tree. In the uniform growth model, every forest node is equally likely to receive the next

edge added to the forest.

We calculate the distribution of the tree sizes, and the size of the tree that receives the jth node

added to the forest.

Theorem 1  The expected number of nodes in the tree that receives the jth node added to the forest is

    (k + 2j + 1)/(k + 1)

Proof: We define T_i(j) to be the expected number of trees with i nodes after j nodes have been added
to the initial forest. The forest starts with k trees, each of which contains a single node, so the initial
conditions are:

    T_1(0) = k
    T_i(0) = 0  for i > 1

Since all trees start with 1 node, after j - 1 nodes have been added, no tree has more than j nodes.
This produces the additional constraint:

    T_i(j-1) = 0  for i > j
Each time a node is added, it is added to some tree with i nodes, so the number of trees with i nodes
decreases by 1, and the number of trees with i + 1 nodes increases by 1. The probability of selecting a
tree with i nodes at the jth node selection is proportional to the total number of nodes in trees with i
nodes. Before the jth selection there are k + j - 1 nodes in all of the trees, so the probability of
selecting a tree with i nodes is given by

    Pr[Selecting a node in a tree with i nodes at step j] = i T_i(j-1) / (k + j - 1)

This distribution yields the following equations for the tree distribution after the jth selection in
terms of the trees that exist before the jth selection:

    T_1(j) = T_1(j-1) - T_1(j-1)/(k + j - 1)

and, for i > 1,

    T_i(j) = T_i(j-1) - i T_i(j-1)/(k + j - 1) + (i-1) T_{i-1}(j-1)/(k + j - 1)

The solution to these equations with these initial conditions is:

    T_i(j) = k(k-1) j! (k + j - i - 1)! / ((j + 1 - i)! (k + j - 1)!)

which can be shown by induction.
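As a numerical check on the induction (ours, not part of the original report), the following Python sketch iterates the recurrence and compares it with the closed form, as well as with the second-moment identity used in the next step:

from math import factorial

def T_closed(i, j, k):
    # Closed form: T_i(j) = k(k-1) j! (k+j-i-1)! / ((j+1-i)! (k+j-1)!)
    if i < 1 or i > j + 1:
        return 0.0
    return (k * (k - 1) * factorial(j) * factorial(k + j - i - 1)
            / (factorial(j + 1 - i) * factorial(k + j - 1)))

def T_recurrence(j, k):
    # T[i] = expected number of trees with i nodes; iterate the update rule.
    T = [0.0] * (j + 2)
    T[1] = float(k)
    for step in range(1, j + 1):
        nodes = k + step - 1            # forest nodes before this addition
        new = [0.0] * (j + 2)
        for i in range(1, step + 2):
            new[i] = T[i] - i * T[i] / nodes
            if i > 1:
                new[i] += (i - 1) * T[i - 1] / nodes
        T = new
    return T

k, j = 4, 12
T = T_recurrence(j, k)
assert all(abs(T[i] - T_closed(i, j, k)) < 1e-9 for i in range(1, j + 2))
assert abs(sum(i * i * T[i] for i in range(1, j + 2))
           - (k + j) * (k + 2*j + 1) / (k + 1)) < 1e-9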

The next step is to compute the expected size of the tree that receives the jth node. The probability
of adding the jth node to a tree of size i is proportional to the number of nodes in trees of size i, and
therefore is i T_i(j)/(k + j). As a result, the expected size of the tree that receives the jth node is:

    (1/(k + j)) sum_{i=1}^{j+1} i^2 T_i(j)

We can show that the general form of the expected value is

    (k + 2j + 1)/(k + 1)

by using induction to show that:

    sum_{i=1}^{j+1} i^2 T_i(j) = (k + j)(k + 2j + 1)/(k + 1)

If k is large and j >> k, then the size of this tree is approximately 2j/k, twice as large as the average
tree size.
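The claim can also be checked by simulation. The sketch below (our illustration; all names are ours) grows a forest under the uniform growth model and measures the size of the tree holding a uniformly chosen node, which is exactly the size-biased quantity the theorem computes:

import random

def size_biased_tree_size(k, N, trials=20000):
    # Grow k single-node trees by N uniform attachments, then report the
    # average size of the tree holding a uniformly chosen forest node --
    # the expected size of the tree that would receive the next node.
    total = 0
    for _ in range(trials):
        tree_of = list(range(k))        # tree_of[v] = index of v's tree
        size = [1] * k
        for _ in range(N):
            t = tree_of[random.randrange(len(tree_of))]
            size[t] += 1
            tree_of.append(t)
        total += size[tree_of[random.randrange(len(tree_of))]]
    return total / trials

k, N = 5, 50
print(size_biased_tree_size(k, N))      # about (k + 2N + 1)/(k + 1) = 17.67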











3 Minimum Weight Model


The uniform growth assumption is convenient for analytical tractability and is realistic for some situa-
tions. We also want to analyze minimum weight forests, and we would like to approximate the minimum
weight model by the uniform growth model. Unfortunately, the approximation is inaccurate, and the
actual distribution is quite complex. Our objective in this section is to show what relation exists between
the two growth models.
In the minimum weight model, our starting assumptions again are that we are growing a forest,
starting with k initial (isolated) nodes in the forest and N nodes not in the forest. We denote the set
of forest nodes by F, and the set of nonforest nodes by V - F. There are M = binom(N, 2) + Nk edges
connecting all pairs of nodes in the graph, except for pairs where both nodes are in F. There is a
function defined on each edge, w(e), the weight of edge e. At each step, we add a new node to the forest
by finding the lowest weight edge that connects a node in the forest to a node not in the forest, then
adding the edge and the attached node (i.e., apply Dijkstra's algorithm [9]). We assume that all edge
weights are independently and identically chosen from a continuous distribution over the real numbers,
F(x). That is,

    Pr[w(e) <= x | w(e_1) = x_1, w(e_2) = x_2, ..., w(e_l) = x_l] = Pr[w(e) <= x] = F(x)

See Figure 1.

[Figure: the initial graph on nodes A, B, C, D, E, with edge weights w(a) = 1, w(i) = 2, w(f) = 3,
w(e) = 4, w(g) = 5, w(b) = 6, w(d) = 7, w(h) = 8, w(c) = 9.]

Figure 1: Initial graph (A and B are the roots of the forest)

We define a function of the nodes, l(n), the label of the node. The original k forest nodes are labeled
1 through k, and the ith node added to the forest is labeled k + i, i = 1, ..., N. We also label the
edges: an edge e = (n_1, n_2) is labeled l(e) = min(l(n_1), l(n_2)).

We can classify the edges in the graph at any point in the forest growth as follows: Edges that connect
two nodes in F (two forest nodes) are in EF, edges that connect two nodes in V - F (not in the forest)
are in EV-F, and edges that connect a node in F to a node in V - F are in Ecross (cross edges). Edges
in EF can be subclassified as being either tree edges or nontree edges depending on whether they are
part of the minimum weight forest.

To add the next node to the forest, the minimum weight edge in Ecross is chosen and made into a

tree edge. At this point, we can label the new tree node and the unlabeled edges incident to the new

tree node using the following algorithm:


min_weight_forest(V, F, E)
    Label the vertices in F by 1 through |F|.
    Ecross = the edges in E incident to a vertex in F.
    Label each edge in Ecross by the label of its endpoint in F.
    EF = {}.
    EV-F = E - Ecross.
    i = 0.
    While V - F is not empty do
        i = i + 1.
        Let emin be the minimum weight edge in Ecross.
        Let u be the node incident to emin that is in V - F.
        Remove from Ecross all edges incident to u, and put them in EF.
        Label u by k + i.
        Remove from EV-F all edges incident to u, label them with k + i,
            and put them in Ecross.
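For concreteness, here is a minimal executable transcription of min_weight_forest in Python (a sketch of ours, assuming i.i.d. uniform edge weights; the function and variable names are not from the report). It finds the cheapest cross edge by brute force, which is adequate for small instances:

import random

def min_weight_forest(k, N, seed=None):
    # Grow a minimum weight forest from k roots over N additional nodes and
    # return, for each step, (label of receiving forest node, label of new node).
    rng = random.Random(seed)
    nodes = list(range(k + N))
    w = {}                                  # i.i.d. weights; no root-root edges
    for u in nodes:
        for v in nodes:
            if u < v and v >= k:
                w[(u, v)] = rng.random()
    label = {u: u + 1 for u in range(k)}    # roots are labeled 1..k
    forest, outside = set(range(k)), set(range(k, k + N))
    order = []
    for i in range(1, N + 1):
        u, v = min(((a, b) for a in forest for b in outside),
                   key=lambda e: w[(min(e), max(e))])
        label[v] = k + i                    # the ith added node is labeled k+i
        order.append((label[u], label[v]))
        forest.add(v)
        outside.remove(v)
    return order

print(min_weight_forest(2, 3, seed=1))

Running it with the parameters of the running example (k = 2, N = 3) produces a forest grown in the same manner as Figures 1 through 6, though of course with different random weights.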




The min_weight_forest algorithm is illustrated in Figures 1 through 6. Figure 1 shows the initial
forest. Figure 2 shows the eligible edges (the cross edges) when the first edge is selected. The edges
incident on the root nodes are labeled with the node's label. Figure 3 shows the cross edges before the
second edge has been selected. Node C was added to the forest, and was labeled with 3. Edges leading
from C to nodes not yet in the forest are also labeled with 3. Figure 4 shows the graph before the last edge













[Figure: the cross edges at the first selection. Each edge incident on a root is labeled with that root's
label: l(a) = l(b) = l(c) = 1 and l(d) = l(e) = l(f) = 2.]

Figure 2: Selecting the first edge


has been selected, and Figure 5 shows the final forest.

To aid in calculating the distribution of the probability that a given node in the forest receives the
next edge, we list the edges in the graph sorted by the edge weights. If the min_weight_forest algorithm
chooses an edge labeled i as the next edge to add to the forest, then the node labeled i receives the edge.
Therefore, we identify the edges in the list by their labels only. Figure 6 shows the list of edges with
their corresponding edge labels, sorted by edge weight, that corresponds to our running example. We
next prove that this list of edges has a tractable distribution.

Definition: An (N1, N2, ..., Nj)-permutation is an ordered list of N1 items labeled 1, N2 items
labeled 2, and so on.

Lemma 1  Label the edges using the edge labeling algorithm, then sort the edges by their weights. The
resulting (N, N, ..., N, N-1, N-2, ..., 2, 1)-permutation L (with N repeated k times) is uniformly
randomly chosen from the set of all (N, N, ..., N, N-1, N-2, ..., 2, 1)-permutations.

Proof: First, L is indeed an (N, N, ..., N, N-1, N-2, ..., 2, 1)-permutation, since the original k
nodes each label N edges, and the node added at the ith labeling step labels N - i edges.

We consider the algorithm for labeling edges. The set of edges, E, is partitioned into EF, EV-F, and
Ecross. At each edge selection step, the algorithm searches Ecross and picks the edge with the lowest
weight, emin. Based on emin, the algorithm removes some edges from Ecross, puts them in EF, and also

















[Figure: node C has joined the forest with l(C) = 3; edges g and i leading from C are labeled 3.]

Figure 3: Selecting the second edge








[Figure: node E has joined the forest with l(E) = 4; edge h is labeled 4.]

Figure 4: Selecting the third edge



















[Figure: the final forest; node D has joined with l(D) = 5. The tree edges are a, i, and e.]

Figure 5: Final forest









[Figure: the final forest, together with the list of all edges sorted by weight:]

    edge    a  i  f  e  g  b  d  h  c
    weight  1  2  3  4  5  6  7  8  9
    label   1  3  2  2  3  1  2  4  1

Figure 6: Final forest and corresponding list of labeled edges













removes some edges from EV-F and puts them in Ecross. At this point, the new edges that are placed
into Ecross from EV-F are labeled. This step repeats until Ecross is empty.

An edge that is removed from EV-F is never returned to EV-F. The edges are labeled only when
they are removed from EV-F and placed into Ecross. The weights of these edges are examined by the
algorithm for the first time when the next edge is selected. Since the edge weights are independent
and the weight of an edge is first read by the algorithm after the edge is labeled, the edge weight is
independent of the edge label.

All edge weights have the same distribution, regardless of their labels. The permutation, L, is sorted
by edge weight, so every edge is equally likely to be in every position. The resulting permutation of
the edges is equally likely to be chosen from the set of all possible permutations, so L is a uniformly
randomly chosen (N, N, ..., N, N-1, N-2, ..., 2, 1)-permutation. *


The underlying combinatorial object is a matrix of independent and identically distributed edge
weights. So far, we have made a transformation from the matrix of edge weights to a list of edge labels,
and have shown that the list is uniformly distributed. At any step in the forest growing algorithm, only
a subset of the edges is examined in order to find a least weight edge - those in Ecross. To find the
edge that the min_weight_forest algorithm adds, we take the list that corresponds to E and remove
the edges in EF and EV-F while preserving the order of the remaining edges. We then select the first
edge on the list. We next prove some properties of the lists with removed edges:

Lemma 2  For every k in {1, ..., M}, where M = N1 + ... + Nj, the probability that the kth entry in a
uniformly selected (N1, N2, ..., Nj)-permutation is labeled r, r in {1, ..., j}, is Nr/M.

Proof: If we pick edge e in E as the edge in the kth entry of the permutation, there are (M - 1)! ways
to choose the remainder of the permutation. There are Nr ways to choose an edge e with label r, so
there are Nr(M - 1)! permutations with an edge labeled r in the kth position. There are M! possible
permutations of the edges, so the probability that the edge in the kth position is labeled r is Nr/M. *

Lemma 3  Let k, l in {1, ..., M}, k != l, M > 1, where M = N1 + ... + Nj. If the kth entry of a
uniformly selected (N1, ..., Nj)-permutation is identified as being labeled r, r in {1, ..., j}, then the
probability that the lth entry is labeled s, s in {1, ..., j}, is

    Ns/(M - 1)        if r != s
    (Ns - 1)/(M - 1)  if r = s

Proof: We fix the edge in position k, then apply Lemma 2 to the remaining positions. *


Corollary 1  Let P be a uniformly chosen (N1, ..., Nj)-permutation, and identify the rth entry as being
labeled l, r in {1, ..., N1 + ... + Nj}, l in {1, ..., j}. Create P' by taking the first r - 1 and the last
N1 + ... + Nj - r entries of P and concatenating them together. Then P' is equally likely to be any one
of the (N1, ..., Nl - 1, ..., Nj)-permutations.

Proof: Apply Lemma 3. *

Suppose that we have run the algorithm once, labeled the edges, and created the list L. Suppose
that we rerun the min_weight_forest algorithm on the same graph that resulted in L. By examining
L, we want to determine what choices the algorithm makes. We then use the uniform randomness of L
to make an expected case analysis of the forest growth.

At any point in the min_weight_forest algorithm, the algorithm examines a subset of the edges
(those in Ecross), and chooses the lowest-weight edge to add to the forest. The list L is sorted by edge
weight, so if we rule out the ineligible edges (those in EV-F or in EF), the first edge in L is the edge
that the algorithm picks. This edge's label is the label of the forest node it is incident on, so we can
keep track of how the forest grows.

At each edge selection step, we choose an edge that connects a node in the forest to a node not in
the forest (i.e., an edge in Ecross). If we select the ith edge, then nodes numbered 1, ..., k + i - 1 are
in the forest. Nodes numbered k + i, ..., k + N are not. The fact that we are selecting the ith edge
immediately gives us some information: edges labeled k + i, ..., k + N are in EV-F, and so are not
eligible. Thus, we can simplify L by removing edges that will not be chosen.

Definition: L_i is L with edges labeled k + i, ..., k + N removed. The list L_i contains exactly the
edges in Ecross and EF when the ith edge is chosen.

Corollary 2  L_i is equally likely to be any one of the (N, ..., N, N-1, ..., N-i+1)-permutations (with
N repeated k times).

Proof: We generate L_i from L by removing the edges labeled k + i or larger, then apply Corollary 1
repeatedly. *


Even though L_i is a refinement of L, L_i still contains many edges that the min_weight_forest
algorithm does not examine: the edges in EF. We have not kept track of the edge destinations, so
we have no information yet about which edges are in EF and which are in Ecross. Fortunately, the
information necessary for an expected case analysis is implicitly stored in L_i.

To extract this information, we look at the first entry in L_i, e_1. If this edge is labeled k + i - 1
(that is, it leads from the most recently added node), then it has never been considered before. Therefore,
e_1 must be in Ecross, so the min_weight_forest algorithm will select it as the ith edge in the forest.
If the label of the edge, l(e_1), is less than k + i - 1, then e_1 was also the first edge in L_{i-1}, since
edges with labels greater than or equal to k + i - 1 are removed from L to create L_{i-1}. Thus, at some
previous selection j, j < i, edge e_1 was the lowest weight edge in Ecross, so the min_weight_forest
algorithm chose it as the jth edge to add to the forest. Therefore, e_1 is in EF (in fact, e_1 is a tree
edge), so e_1 is ineligible.

If the first edge on L_i is in EF, we need to continue scanning L_i for the first eligible edge. The
discovery that e_1 is not an eligible edge tells us that some other edges in L_i are ineligible also: the
edges that lead to the same destination as e_1 does. We are not keeping any destination information in
L, but we assume for now that we have some way of determining destinations (we address this
assumption later). In the analysis that follows, we will want to scan over tree edges only, so we remove
the edges from L_i = L_{i,1} that have the same destination as e_1 to form L_{i,2}.

Suppose that e_2 is the first edge in L_{i,2} (i.e., the second eligible-or-tree edge in L_i). If e_2 is
labeled k + i - 1, the min_weight_forest algorithm has never examined e_2 before, so e_2 is chosen by
the algorithm. Suppose that e_2 is labeled k + i - 2. If e_1 is also labeled k + i - 2, then e_1 must have
been chosen when the node labeled k + i - 1 was added to the forest. An edge labeled k + i - 2 can be
selected as the next forest edge only when nodes labeled k + i - 1 or higher are added to the forest. Since
we know that e_1 added the node labeled k + i - 1 to the forest, e_2 must add the node labeled k + i,
and so the min_weight_forest algorithm chooses edge e_2 on the ith edge selection step. If e_2 is
labeled k + i - 3 or lower, then e_2 was chosen as a forest edge on a previous iteration regardless of the
label of e_1. In general,

Theorem 2  Let e_j be the first edge on L_{i,j}, and suppose that we have not found in {e_1, ..., e_{j-1}}
the edge that adds the node labeled k + i to the forest.

If l(e_j) = k + i - s and s < i, then e_j adds the node labeled k + i to the forest if and only if we have
already found in {e_1, ..., e_{j-1}} the edges that added the nodes labeled k + i - s + 1 through k + i - 1
to the forest.

If s >= i, then e_j adds the node labeled k + i to the forest if and only if we have already found in
{e_1, ..., e_{j-1}} the edges that added the nodes labeled k + 1 through k + i - 1 to the forest.

Proof: Assume that s < i. Suppose that we have found the edges that added the nodes labeled
k + i - s + 1 through k + i - 1 to the forest. Since edge e_j can only add nodes labeled k + i - s + 1 and
higher to the forest, and all of the previous possibilities have been covered already, e_j cannot be a tree
edge. Since we throw away nontree edges in EF, e_j must be a cross edge. Since it is the first cross edge
we have found, it must have the lowest weight among all cross edges. Therefore, the tree growing
algorithm picks it to add the node labeled k + i to the forest.

Suppose that we have not yet found all of the edges that add the nodes labeled k + i - s + 1 through
k + i - 1 to the forest. Let the lowest labeled node that is unaccounted for be labeled k + r.

Claim: Edge e_j added the node labeled k + r to the forest.

Proof: Consider what occurs when the node labeled k + r is added to the forest. We can retrace the
forest generating algorithm's decision by examining the list L_r. Edge e_j in L_i was considered as some
edge e_{j'} in L_r, because we had not accounted for the edge that added node k + r in L_i before we
came to e_j. When edge e_{j'} in L_r is considered, all nodes with labels between k + i - s + 1 and
k + r - 1 are accounted for. As we have shown, e_{j'} must be the lowest weight cross edge, so the forest
growing algorithm picks it to add the node labeled k + r to the forest. *

Since e_j adds the node labeled k + r to the forest, it cannot add node k + i to the forest.

If s >= i, we only need to account for the fact that any edge labeled 1 through k can add the node
labeled k + 1 to the forest. *


Suppose that we have an algorithm for determining which node a tree edge adds to the forest. Then
the algorithm for using L to calculate which node received the ith edge is:

procedure retrace(L)
    for i = 1 to N do
        L_i = L with edges labeled k + i or greater removed.
        j = 1.
        L_{i,j} = L_i.
        while the edge that adds node k + i to the forest is not found do
            e_j = the first edge in L_{i,j}.
            if all nodes between k + l(e_j) + 1 and k + i - 1 are accounted for
                output (i, l(e_j)).
            else
                account for the node that e_j adds.
                L_{i,j+1} = L_{i,j} with the edges that lead to the same destination
                    as e_j removed.
                increment j.

Theorem 2 gives us an algorithm for determining which node a tree edge adds to the forest. When
edge e_j is considered, the algorithm looks to see if node k + l(e_j) + 1 is accounted for. If that node is
accounted for, the algorithm looks at node k + l(e_j) + 2, and so on, until an unaccounted-for node is
found. Edge e_j adds that node (possibly node k + i), so the algorithm accounts for it.

We use a 'marked boxes' technique to account for the nodes. When sublist L_i is examined, the
algorithm initializes i boxes, numbered 1 through i. If the edge e_j is labeled k + i - s, the algorithm
looks for an unmarked box starting at box s and working towards box 1. The algorithm marks the first
unmarked box that is found. If box 1 is marked, the algorithm concludes that edge e_j adds node k + i.

procedure markbox(L_i)
    initialize box[1..i] as unmarked.
    j = 1.
    while box[1] is unmarked do
        get e_j.
        s = k + i - l(e_j).
        if s > i then s = i.
        while box[s] is marked
            s = s - 1.
        mark box[s].
        j = j + 1.
    return (j - 1).
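The following Python sketch of markbox (ours) reproduces the behavior just described; the assertions replay the first-edge labels from the running example worked through in Table 1 below:

def markbox(labels, i, k):
    # Replay the marked-boxes procedure on the labels of e_1, e_2, ...
    # (the first edge of each successive sublist L_{i,j}); return the
    # examination step on which the minimum weight cross edge is found.
    box = [False] * (i + 1)             # box[1..i]; box[0] is unused
    for j, lab in enumerate(labels, start=1):
        s = min(k + i - lab, i)         # labels <= k map to box i
        while box[s]:                   # scan downward over marked boxes
            s -= 1
        box[s] = True
        if s == 1:                      # box 1 marked: e_j adds node k+i
            return j
    raise ValueError("no cross edge found")

# First-edge labels from the running example (k = 2), per Table 1 below:
assert markbox([1], i=1, k=2) == 1          # edge a adds node 3
assert markbox([1, 3], i=2, k=2) == 2       # edge i adds node 4
assert markbox([1, 3, 2], i=3, k=2) == 3    # edge e adds node 5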



Executions of the retrace and markbox procedures are illustrated in Table 1. The sorted list of edges
L is the same as that in Figure 6; a dot marks an edge that has been removed from a sublist, and the
right-hand column shows the boxes that are marked while the sublist is scanned.

    edge        a  i  f  e  g  b  d  h  c    boxes marked
    L        =  1  3  2  2  3  1  2  4  1
    L_{1,1}  =  1  .  2  2  .  1  2  .  1    1
    L_{2,1}  =  1  3  2  2  3  1  2  .  1    2
    L_{2,2}  =  .  3  2  2  3  1  .  .  1    1 2
    L_{3,1}  =  1  3  2  2  3  1  2  4  1    3
    L_{3,2}  =  .  3  2  2  3  1  .  4  1    2 3
    L_{3,3}  =  .  .  .  2  3  1  .  4  .    1 2 3

    Table 1: The sorted list of edges, and finding the selected edge.

The list L_{1,1} is the list of edge choices when node 3 is selected (all edges labeled 3 and larger are
deleted). The first edge on the list is labeled 1 = 3 - 2; since 2 > 1, s is set to 1, so the algorithm marks
box 1 and selects that edge. The list L_{2,1} is the initial list of edges when node 4 is selected. The first
edge on L_{2,1} is labeled 1 = 4 - 3; since 3 > 2, s is set to 2 and the algorithm marks box 2. It
generates L_{2,2} by removing the initial edge on L_{2,1} and all edges that lead to the same node. The
first edge on L_{2,2} is labeled 3 = 4 - 1, so the algorithm marks box 1 and selects that edge. Finally,
list L_{3,1} has all edges on it, for there are no edges in EV-F. The first edge on L_{3,1} is labeled
1 = 5 - 4, so the algorithm marks box 3. The first edge on L_{3,2} is labeled 3 = 5 - 2, so the algorithm
marks box 2. The first edge on L_{3,3} is labeled 2 = 5 - 3, so it tries to mark box 3. Box 3 is already
marked (since a different edge chose the third node), so it tries to mark box 2. Box 2 is also marked, so
it marks box 1 and selects that edge. Note that this procedure selects the same edges (a, i, e) that the
simulation of Dijkstra's algorithm does.

Before we continue, we need to determine the underlying distributions of the sublists. Lemma 1 and
Corollary 2 show that L and L_i are uniformly randomly chosen permutations. The algorithm to recreate
the min_weight_forest algorithm's choices from L requires the use of the L_{i,j} sublists. The L_{i,j}
sublists are created by determining edge destinations and removing some edges. We do not explicitly
keep edge destination information in L, but we show that this is not a problem for the analysis.

Corollary 3  The lists L_{i,j} are uniformly randomly distributed permutations.

Proof: L_{i,1} = L_i, so L_{i,1} is a uniformly randomly chosen permutation. Since L_{i,1} is a uniformly
randomly chosen permutation, we see that L_{i,2} is a uniformly randomly chosen permutation by
repeated applications of Corollary 1. Proceeding inductively, we see that every L_{i,j} is a uniformly
randomly chosen permutation. *












3.1 Relation to Uniform Growth Model


The algorithm for determining the min_weight_forest algorithm's choices from L shows that edges
with different labels have different probabilities of being chosen to add the ith node. An edge labeled
k + i - 1 can be selected from every L_{i,j}, while an edge labeled 1 through k can only be selected from
L_{i,i}. We will return to the issue of the edge label of the edge that is selected, but here we want to
determine the relation of the minimum weight model to the uniform growth model.

We observe that if the edge that adds the node labeled k + i to the forest is not found in L_{i,1}
through L_{i,i-1}, then the edge will be found in L_{i,i}, since every edge in L_{i,i} is a cross edge.
Every forest node is incident to the same number of cross edges (N - i + 1), so by Lemma 3, every forest
node is equally likely to receive the next edge. Thus, the minimum weight model is equivalent to the
uniform growth model on the occasions when L_{i,i} must be searched to find the minimum weight cross
edge. The following theorem states the probability of this event.


Theorem 3  Suppose that there are k trees, and the min_weight_forest algorithm is adding the node
labeled k + i to the forest. If i << N, then the probability that the first edge in L_{i,j}, e_j, is the
minimum weight cross edge is approximately 1/(k + i - 1) if j < i, and k/(k + i - 1) if j = i.


Proof: In this proof, we use the marked boxes mechanism described in the previous section. The
input to the marked boxes algorithm is the sequence (l(e_1), l(e_2), ..., l(e_n)). We transform this
sequence into the more convenient sequence (x_1, x_2, ..., x_n), where each x_j is in {1, ..., i} and
1 <= n <= i, by the following function:

    x_j = k + i - l(e_j)   if l(e_j) > k
    x_j = i                if l(e_j) <= k

If x_j = r with r < i, then x_j represents an edge labeled k + i - r and refers to box r in the markbox
procedure. If x_j = i, then x_j represents an edge labeled 1 through k and refers to box i.

In L_i, there are N edges with each of the labels 1 through k, and N - s edges labeled k + s, for
s = 1, ..., i - 1. Since N >> i, there are about the same number of edges labeled k + i - 1 in L_i as
there are edges labeled 1. Therefore, we will assume that each x_j in a sequence is independent, and
has value r with probability 1/(i + k - 1) if r < i, and value i with probability k/(k + i - 1). We want
to determine the probability that the length of the minimum length sequence which marks box 1 is n.













[Figure: the last entry of a safe trigger of length r marks box t + 2; the boxes to its right correspond
to a safe trigger in st(t), and the boxes to its left to a (translated) safe trigger in st(s), s = r - t - 1.]

Figure 7: Last entry in st_r

We define a safe trigger of length r to be a sequence st_r which is of length r, does not mark box 1,
but such that (st_r, j) marks box 1 for every j, 1 <= j <= r + 1. For example, (2 3) is a safe trigger of
length 2. We define st(r) to be the set of all safe triggers of length r, and ST(r) = |st(r)|. For example,
st(2) = {(2 3), (3 2), (3 3)}, and ST(2) = 3. We define ST(0) = 1.

Lemma 4  ST(r) = (r + 1)^{r-1}

Proof: We first derive a recurrence for ST(r), then solve the recurrence. For the starting values of
the recurrence, we observe that st(0) = {()} and st(1) = {(2)}, so ST(0) = ST(1) = 1.

Consider the last entry added to a sequence in st(r). The last entry marks a box, so we assume
that it marks the box numbered t + 2 (see Figure 7). To the right of the last marked box is a sequence
in st(t), and to the left is a sequence that corresponds to a sequence in st(s), s = r - t - 1, by
translation. There are binom(r-1, s) ways to combine an s length sequence with a t length sequence.
The last entry in the safe trigger of length r can be any number between t + 2 and r + 1 inclusive, so
there are s + 1 possibilities. Therefore:

    ST(r) = sum_{s=0}^{r-1} (s + 1) binom(r-1, s) ST(s) ST(r - s - 1)                (1)

The form of the recurrence is similar to the convolution of an exponential generating function, so we
let Y = Y(z) = sum_{r>=0} ST(r) z^r / r! be the exponential generating function of ST(r). Then Y
satisfies the differential equation

    (1 - Yz) Y' = Y^2                                                                (2)

To solve this equation, we look for a solution of the form:

    Y = e^{z a(Y)}

Then

    Y' = Y (Y' a'(Y) z + a(Y))
    Y' (1 - Y a'(Y) z) = Y a(Y)

Setting a(Y) = Y satisfies the differential equation. The generating function Y = e^{Yz} is (see [7],
page 392)

    Y(z) = sum_{r>=0} (r + 1)^{r-1} z^r / r!

so that

    ST(r) = sum_{s=0}^{r-1} (s + 1) binom(r-1, s) ST(s) ST(r - s - 1) = (r + 1)^{r-1}   *
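A short computational check of Lemma 4 (ours): evaluate recurrence (1) directly and compare it with the closed form:

from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def ST(r):
    # Recurrence (1): ST(r) = sum_{s=0}^{r-1} (s+1) C(r-1,s) ST(s) ST(r-1-s)
    if r == 0:
        return 1
    return sum((s + 1) * comb(r - 1, s) * ST(s) * ST(r - 1 - s)
               for s in range(r))

for r in range(1, 10):
    assert ST(r) == (r + 1) ** (r - 1)      # Lemma 4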

Let us count the number of ways that edge e_j marks box 1 (i.e., adds the node labeled k + i to the
forest). Suppose that after processing x_1, ..., x_{j-1}, box 1 is unmarked and the t subsequent boxes
are marked; that is, boxes 2 through t + 1 are marked and box t + 2 is unmarked. Of the remaining
i - t - 2 boxes, numbered t + 3 through i, exactly j - t - 1 are marked (see Figure 8). In order to make
the analysis simpler, we aggregate all of the x_l values that cannot mark box 1 on the jth step (those
values between j + 1 and i) and renumber them by j + 1. That is,

    x'_l = x_l      if x_l <= j
    x'_l = j + 1    if x_l > j

In the transformed sequence, all of the boxes are marked except for boxes 1 and t + 2. When x'_l is
chosen to augment this sequence, there are i + k - j - 1 ways to choose it to be j + 1, and 1 way to
choose it to be n, for each n = 1, ..., j. With this configuration, any x' numbered between 1 and t + 1
will mark box 1. Let R(s) be the number of ways to mark the last s = j - t - 1 boxes. The number of
ways that an edge will be selected on the jth examination step is therefore

    Stop(j) = sum_{t=0}^{j-1} (t + 1) binom(j-1, t) ST(t) R(j - 1 - t)               (3)

In order to solve equation (3), we need to determine the value of R(s). We can quickly derive a
recursive formula that is similar to equation (1). In Figure 7, the number of sequences that can fill the
right hand side is R(s) instead of ST(s). The sequence will be completed if the last entry is in
{t + 2, ..., r + 1}. The last entry











[Figure: the box configuration before the jth examination step: box 1 unmarked, boxes 2 through t + 1
marked, box t + 2 unmarked, and j - t - 1 of the remaining boxes marked; values that cannot mark
box 1 are aggregated and renumbered j + 1. The two halves contribute ST(t) and R(j - t - 1) sequences.]

Figure 8: Transformation for the ith selection

can be chosen to have the aggregated value in X = i + k - j - 1 ways; all other numbers can only be
chosen in one way. Thus, the number of ways to complete the sequence is

    R(r) = sum_{s=0}^{r-1} (X + s) binom(r-1, s) ST(r - 1 - s) R(s)
         = sum_{s=0}^{r-1} (X + s) binom(r-1, s) (r - s)^{r-s-2} R(s)

Lemma 5  R(r) = X (X + r)^{r-1}

We prove the lemma by induction. It is easy to see that R(0) = 1 and R(1) = X. For the inductive
step, we need to show that

    X (X + r)^{r-1} = sum_{s=0}^{r-1} (X + s) binom(r-1, s) (r - s)^{r-s-2} X (X + s)^{s-1}

or that

    (X + r)^{r-1} = sum_{s=0}^{r-1} binom(r-1, s) (r - s)^{r-s-2} (X + s)^s

Let

    T(r) = R(r + 1)/X = sum_{s=0}^{r} binom(r, s) (r + 1 - s)^{r-s-1} (X + s)^s













Then, from Riordan [11], page 18, we see that T(r) is an instance of the generalization of Abel's binomial
formula, T(r) = A_r(X, 1, 0, -1) = (X + 1 + r)^r, which is exactly what we need to show. *

Lemma 6  Stop(j) = (X + j)^{j-1} = (i + k - 1)^{j-1}

To prove this lemma, we apply Lemma 5 to equation (3):

    Stop(j + 1) = sum_{s=0}^{j} (j + 1 - s) binom(j, s) ST(j - s) R(s)
                = X sum_{s=0}^{j} binom(j, s) (j + 1 - s)^{j-s} (X + s)^{s-1}
                = X A_j(X, 1, -1, 0)
                = X (1/X) (X + j + 1)^j
                = (X + j + 1)^j

where the third step is another application of the generalization of Abel's binomial formula. *

To prove the theorem, we note that on the jth examination step there are (i + k - 1)^j possible
sequences. The min_weight_forest algorithm chooses an edge on the jth examination step on exactly
(i + k - 1)^{j-1} of these sequences, so the probability that a selection is made on the jth examination
step is 1/(i + k - 1) if j < i. A selection will be made by the ith step, so the probability of selecting an
edge on the ith step is the remaining probability, proving the theorem. *
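Lemmas 5 and 6 can be spot-checked numerically, treating X as a free parameter (in the proof, X = i + k - j - 1). A sketch of ours:

from math import comb

def ST(m):
    return (m + 1) ** (m - 1) if m > 0 else 1       # Lemma 4

def R(r, X):
    # R(r) = sum_{s=0}^{r-1} (X+s) binom(r-1,s) ST(r-1-s) R(s)
    if r == 0:
        return 1
    return sum((X + s) * comb(r - 1, s) * ST(r - 1 - s) * R(s, X)
               for s in range(r))

def Stop(j, X):
    # Equation (3): Stop(j) = sum_{t=0}^{j-1} (t+1) binom(j-1,t) ST(t) R(j-1-t)
    return sum((t + 1) * comb(j - 1, t) * ST(t) * R(j - 1 - t, X)
               for t in range(j))

for X in (3, 7):
    assert all(R(r, X) == X * (X + r) ** (r - 1) for r in range(1, 8))   # Lemma 5
    assert all(Stop(j, X) == (X + j) ** (j - 1) for j in range(1, 8))    # Lemma 6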

As a corollary to Theorem 3, we find the relationship between the uniform growth model and the
minimum weight model.

Corollary 4  Let k be the number of trees in the forest, and N the number of nodes to be added to the
forest. Consider the act of adding the ith edge to the minimum weight forest. If i << N, then

    Pr[every node in the forest is equally likely to receive the ith edge] = k/(i + k - 1)

Proof: Every forest node is equally represented in L_{i,i}. *

3.1.1 A Limiting Distribution

The results we used to prove Corollary 4 let us calculate, for the minimum weight model, the probability
that a forest node receives the next edge. In Table 2, we list the possibilities by which an edge at the
head of L_{i,j} can be chosen.

    examination                         size of safe trigger
    step             0                           1                          2
    1     ST(0)binom(0,0)R^1(0)/Z
          = 1/Z
    2     ST(0)binom(1,0)R^2(1)/Z^2   ST(1)binom(1,1)R^2(0)/Z^2
          = (Z-2)/Z^2                 = 1/Z^2
    3     ST(0)binom(2,0)R^3(2)/Z^3   ST(1)binom(2,1)R^3(1)/Z^3   ST(2)binom(2,2)R^3(0)/Z^3
          = (Z-3)(Z-1)/Z^3            = 2(Z-3)/Z^3                = 3/Z^3
    4     ST(0)binom(3,0)R^4(3)/Z^4   ST(1)binom(3,1)R^4(2)/Z^4   ST(2)binom(3,2)R^4(1)/Z^4   ...
          = (Z-4)(Z-1)^2/Z^4          = 3(Z-4)(Z-2)/Z^4           = 9(Z-4)/Z^4
    5     ST(0)binom(4,0)R^5(4)/Z^5   ST(1)binom(4,1)R^5(3)/Z^5   ST(2)binom(4,2)R^5(2)/Z^5   ...
          = (Z-5)(Z-1)^3/Z^5          = 4(Z-5)(Z-2)^2/Z^5         = 18(Z-5)(Z-3)/Z^5

    Table 2: Probability that an edge is selected, given an examination step and a safe trigger size

In Table 2, the number of nodes in the forest is Z = k + i - 1. The number of identifiable elements
grows as the list is searched further: on the jth selection step, only edges labeled (k + i) - 1 through
(k + i) - j can be cross edges. Since the number of unselectable edges changes with the selection step,
we must modify our definition of R(n), which represents the edges that will not help cause a selection
on step j. We replace X in R(n) by Z - j, and write the new function as

    R^j(n) = (Z - j)(Z - j + n)^{n-1}

In the previous section, we proved Theorem 3 by weighting the entries in a column by the number of
edge labels that can cause a selection and summing across the rows of the table. In order to calculate
the probability distribution that a node with a given label receives the edge, we instead sum down the
columns. Each entry in the table represents a different possibility by which an edge can be chosen as
the next edge added to the forest: the column for safe triggers of size s holds the probability that an
s-safe trigger causes the selection of an edge with a particular eligible label. The jth most recently
added node can receive the next edge only through a safe trigger of size j - 1 or larger. Let us define
A_i(j) to be the probability that the forest node labeled k + i - j receives the ith edge added to the
forest, and let C(t) be the sum of the column for safe triggers of size t - 1; then the tth column sum
C(t) is the difference between the probability that the tth and the (t + 1)st most recently added nodes
receive the next edge.

Theorem 4  If the ith edge is being added to the forest, and i is large compared to the number of trees
but small compared to the number of nodes in the graph (k << i << N), then the probability that a node
receives the edge approaches the following limiting distribution:

    A(j) = sum_{t=j}^{infinity} t^{t-2} e^{-t} / t!                                  (4)












Proof: Let C(t) be the column sum defined above. Then, ignoring the possibility that L_{i,i} must be
examined in order to make a selection,

    A_i(j) = sum_{t>=j} C(t)

Lemma 7

    C(t + 1) ~ (t + 1)^{t-1} e^{-(t+1)} / (t + 1)!

It is easily seen from the preceding arguments that the column sum has the following value:

    C(t + 1) = sum_{j>=0} ST(t) binom(j+t, t) R^{j+t+1}(j) / Z^{j+t+1}
             = [ST(t) / (Z^{t+1} t!)] sum_{j>=0} [(j+t)^(t) - (j+t)^(t+1)/(Z - (t+1))] A^j

where A = (Z - (t+1))/Z and x^(n) = x(x-1)(x-2)...(x-n+1) is the falling factorial function. In order
to evaluate this sum, we apply Theorem 1.6 from Mickens (page 36) [10], which gives the indefinite
sum of A^j P(j), for a polynomial P, as

    sum^{-1} [A^j P(j)] = [A^j / (A - 1)] [P(j) + (A/(1-A)) DP(j) + (A/(1-A))^2 D^2 P(j) + ...]

where D is the forward difference operator. In our equation, A = (Z - (t+1))/Z, so that 1/(A - 1) =
-Z/(t+1) and A/(1 - A) = (Z - (t+1))/(t+1). The differences of the falling factorials are

    D^r (j+t)^(t) = t^(r) (j+t)^(t-r)
    D^r (j+t)^(t+1) = (t+1)^(r) (j+t)^(t+1-r)

so the series terminates. Evaluating the definite sum between j = 0 and j = Z - t + 1 and collecting
terms, everything cancels except a single boundary term:

    C(t + 1) = [ST(t) / (Z^{t+1} t!)] (Z/(t+1)) ((Z - (t+1))/Z)^{Z-t+1} (Z-1)^(t+1) / (Z - (t+1))

If i is large then Z is large, so that ((Z - (t+1))/Z)^{Z-t+1} -> e^{-(t+1)} and
(Z-1)^(t+1)/(Z - (t+1)) -> Z^t; with ST(t) = (t+1)^{t-1}, this gives

    C(t + 1) ~ (t + 1)^{t-1} e^{-(t+1)} / (t + 1)!

To finish the proof of the theorem, we note that the value of C(t) rapidly decreases to zero, so that

    A(j) ~ sum_{t=j}^{infinity} t^{t-2} e^{-t} / t!   *

For large j, we can find an approximate value of the sum. If we use Stirling's approximation for the
factorial, we find that

    A(j) ~ sum_{t=j}^{infinity} 1 / (sqrt(2 pi) t^{5/2})

We can approximate the sum with an integral using Euler's summation formula [6] to get:

    A(j) ~ (2 / (3 sqrt(2 pi))) [ j^{-3/2} + (3/4) j^{-5/2} (1 + O(j^{-1})) ]        (5)

Table 3 lists the first few values of the distribution A(j). We summed the first 400 values of C(t)
and then added the approximation (5) of A(401) to calculate A(1). We calculated the remaining entries
in Table 3 by subtracting the appropriate values of C(t). Note that A(j) depends on j but not on i, so
the probability that a forest node receives the next edge depends primarily on how recently it was added
to the tree. The analysis shows that the most recently added forest node receives the next edge 50% of
the time, and the ten most recently added forest nodes receive the next edge 83% of the time.
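This computation can be reproduced with a few lines of Python (a sketch of ours):

from math import exp, factorial, pi, sqrt

def C(t):
    # Limiting column sums (Lemma 7): C(t) = t^(t-2) e^(-t) / t!
    return t ** (t - 2) * exp(-t) / factorial(t)

def A(j, cutoff=400):
    # Equation (4), summed out to `cutoff`, plus the tail estimate (5).
    tail = (2 / (3 * sqrt(2 * pi))) * (cutoff + 1) ** -1.5
    return sum(C(t) for t in range(j, cutoff + 1)) + tail

print([round(A(j), 4) for j in range(1, 11)])
# matches the A(j) row of Table 3 to about four decimal places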

The analysis that leads to the distribution A depends on the assumption that k << i << N. In order
to determine how strongly the analysis depends on this assumption, we wrote a simulator that generated
minimum weight random forests using the min_weight_forest algorithm. We generated a minimum
weight random forest with the parameters k = 5 and N = 94 10,000 times. For each edge selection step
i, we recorded the number of times that an edge labeled k + i - j was selected. We used this information
to estimate A_i(j), which we list in Table 3 for i = 10, 20, 40, and 60. For small i (i = 10), there is still














    j        1      2      3      4      5      6      7      8      9      10
    A(j)     .5000  .1321  .0645  .0396  .0273  .0203  .0159  .0128  .0106  .0090
    A_10(j)  .4808  .1227  .0642  .0428  .0373  .0320  --     --     .0273  .0271
    A_20(j)  .4811  .1209  .0627  .0442  .0318  .0254  .0233  .0172  .0157  .0142
    A_40(j)  .4527  .1122  .0597  .0302  .0234  .0198  .0160  .0126  .0140  .0080
    A_60(j)  .3993  .0914  .0425  .0252  .0142  .0142  .0122  .0114  .0089  .0082

    Table 3: Probability that the jth most recently added forest node receives the next edge.



a large chance (5/14) that every node in the forest will receive the next edge, so the tail of A_10 is not
well approximated by A. For large i (i = 60), there are significantly fewer edges labeled k + i - 1 than
labeled 1 through k. As a result, A_i(1) becomes smaller than A(1). For moderate i (i = 20 and i = 40),
A is a good approximation to A_i.
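A compact version of such a simulator is sketched below (ours, with parameter names of our choosing). It draws the i.i.d. uniform edge weights lazily, which is equivalent to fixing them in advance because the Prim-style growth examines each edge weight at most once:

import random
from collections import Counter

def estimate_A(i, k=5, N=94, trials=10000, seed=7):
    # Estimate A_i(j): the probability that the ith added node attaches to
    # the jth most recently added forest node (the node labeled k+i-j).
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        key, lab = {}, {}                # cheapest known cross edge per outside node
        for v in range(N):
            draws = [rng.random() for _ in range(k)]
            key[v] = min(draws)
            lab[v] = 1 + draws.index(key[v])        # label of the best root
        for step in range(1, i + 1):
            v = min(key, key=key.get)               # minimum weight cross edge
            if step == i:
                counts[k + i - lab[v]] += 1         # receiver's recency j
                break
            del key[v]
            for u in key:                           # lazily draw edges from the new node
                t = rng.random()
                if t < key[u]:
                    key[u], lab[u] = t, k + step
    return {j: counts[j] / trials for j in sorted(counts)}

print(estimate_A(20))       # compare with the A_20(j) row of Table 3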


4 Conclusions


We have examined two models of naturally growing forests: one in which every node in the forest is
equally likely to be adjacent to the next node added to the forest (the uniform growth model), and a
model in which the forest is constructed in order to create a minimum weight forest (the minimum
weight model).

We found that the uniform growth model is analytically tractable. We calculated the distribution of
tree sizes, and found that the tree that receives the last node will contain about 2N/k nodes, where k is
the number of trees in the forest and N nodes are added to the forest.

We found the minimum weight model to be far more difficult. We examined the relationship between
the uniform growth model and the minimum weight model, and found that the uniform growth model
holds in the minimum weight model with probability k/(k + i - 1) when adding the ith node to the
forest. The probability that a forest node receives the next edge added to the forest depends primarily
on how recently that node was added to the forest. The most recently added node receives the next edge
about 50% of the time, and one of the ten most recently added nodes receives the next edge about 83%
of the time.













5 Acknowledgements


We'd like to thank Kevin Donovan and Yogin Campbell for their help and advice on solving the problems

in this paper.


References


[1] B. Bollobas. Random Graphs. Academic Press, 1985.

[2] J.F. Desler and S.L. Hakimi. A graph-theoretic approach to a class of integer-programming problems. Operations Research, 17:1017-1033, 1969.

[3] A.M. Frieze. On the value of a random minimum spanning tree problem. Discrete Applied Mathematics, 10:47-56, 1985.

[4] M. Held and R.M. Karp. The traveling salesman problem and minimum spanning trees. Operations Research, 18:1138-1162, 1970.

[5] M. Held and R.M. Karp. The traveling salesman problem and minimum spanning trees: Part II. Mathematical Programming, 1:6-25, 1971.

[6] M. Hofri. Probabilistic Analysis of Algorithms. Springer-Verlag, 1987.

[7] D.E. Knuth. The Art of Computer Programming, volume 1. Addison-Wesley, 1968.

[8] H.M. Mahmoud. Evolution of Random Search Trees. Wiley-Interscience, 1992.

[9] U. Manber. Introduction to Algorithms. Addison-Wesley, 1989.

[10] R.E. Mickens. Difference Equations. Van Nostrand Reinhold, New York, 1987.

[11] J. Riordan. Combinatorial Identities. Robert E. Krieger, Huntington, NY, 1979.

[12] J.M. Steele. On Frieze's zeta(3) limit for lengths of minimal spanning trees. Discrete Applied Mathematics, 18:99-103, 1987.

[13] H.S. Stone. An algorithm for finding a minimum weighted 2-matching. Technical Report RC 15014, IBM T.J. Watson Research Center, 1989.



