On the Natural Growth of Random Forests
Theodore Johnson, University of Florida
Harold S. Stone, IBM T. J. Watson Center
July 29, 1992
Abstract
We explore several aspects of the natural growth of random forests. We show that if initially there
are k single-node trees and N nodes to add to the forest, and every forest node is equally likely to
be connected to the next node added, then the tree that receives the last node will have an expected
2N/k nodes. We next explore the growth of a minimum weight forest. We show that the probability
that a node in the forest is connected to the next node added to the forest depends primarily upon
how recently the forest node was added to the forest, and that this distribution approaches a limit
distribution. We obtain the surprising result that when a node is added to the forest, it is adjacent
to the most recently added node 50% of the time.
1 Introduction
In this paper, we examine several distributions associated with randomly grown forests. We assume that
there are k initial trees in the forest, each of which consists of a single node, and N nodes to be added
to the forest. The forest grows by repeatedly adding a node not in the forest to a tree. The trees are
never merged.
Randomly grown and minimum weight forests are used in many algorithms. Our results provide useful
insight into their expected size and performance. This work was motivated by the desire to analyze
the expected time complexity of a 2-matching algorithm [13]. Other works that use minimum weight
spanning trees or forests include Held and Karp's Traveling Salesman Problem algorithms [4, 5], and the
matching algorithm of Desler and Hakimi [2].
Our first growth model (the uniform growth model) assumes that when a node is added to the forest,
the edge that connects it to the forest is equally likely to be adjacent to every node already in the forest.
We show that this growth assumption leads to an easily solved system of equations that describe the
distribution of tree sizes.
In our second growth model (the minimum weight model) we assume that all nodes in the graph are
connected by weighted edges. We then seek to characterize the growth of a minimum weight forest. We
initially propose the uniform growth model as a tractable approximation to the minimum weight model.
We show that the uniform growth model is a poor approximation, however, because the uniform growth
model predicts short, bushy trees, while the minimum weight model predicts long, thin trees.
Previous work in characterizing the growth of forests has concentrated on forests in random graphs
(see [1] for a survey), or on the natural growth of search trees (see [8] for a survey). The question of
the weight of the minimum spanning tree of a randomly weighted graph has been addressed by Frieze
[3] and Steele [12]. In this work, however, we are interested in the growth of random forests, and are not
concerned about their weight.
2 Uniform Growth Model
In this section, we analyze the growth of random forests that have a tractable growth model. When we
build the forest, we add to the forest an edge that connects a forest node to a non-forest node. We say
that the forest node adjacent to the new edge receives the edge, and that the edge and the non-forest node
are added to the tree. In the uniform growth model, every forest node is equally likely to receive the next
edge added to the forest.
We calculate the distribution of the tree sizes, and the size of the tree that receives the jth node
added to the forest.
Theorem 1 The expected number of nodes in the tree that receives the jth node added to the forest is
(k + 2j + 1)/(k + 1)
Proof: We define Ti(j) to be the expected number of trees with i nodes after j nodes have been added
to the initial forest. The forest starts with k trees, each of which contains a single node, so the initial
conditions are:
T1(0) = k
Ti(0) = 0 for i > 1
Since all trees start with 1 node, after j − 1 nodes have been added, no tree has more than j nodes.
This produces the additional constraint:
Ti(j − 1) = 0 for i > j
Each time a node is added, it is added to some tree with i nodes, so the number of trees with i nodes
decreases by 1, and the number of trees with i + 1 nodes increases by 1. The probability of selecting a
tree with i nodes at the jth node selection is proportional to the total number of nodes in trees with i
nodes. Before the jth selection there are k + j − 1 nodes in all of the trees, so the probability of
selecting a tree with i nodes is given by
Pr[Selecting a node in a tree with i nodes at step j] = i Ti(j − 1)/(k + j − 1)
This distribution yields the following equations for the tree distribution after the jth selection in
terms of trees that exist before the jth selection:
T1(j) = T1(j − 1) (1 − 1/(k + j − 1))
and, for i ≥ 2,
Ti(j) = Ti(j − 1) (1 − i/(k + j − 1)) + Ti−1(j − 1) ((i − 1)/(k + j − 1))
The solution to these equations with these initial conditions is:
Ti(j) = k(k − 1) j! (k + j − i − 1)! / ((j + 1 − i)! (k + j − 1)!)
which can be shown by induction.
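As a quick numerical check, the recurrence and initial conditions above can be iterated exactly and compared against this closed form (a sketch using exact rational arithmetic; the function names are ours, not the paper's):

```python
from fractions import Fraction
from math import factorial

def tree_counts(k, n):
    """Iterate T_i(j) = T_i(j-1)(1 - i/(k+j-1)) + T_{i-1}(j-1)(i-1)/(k+j-1)."""
    T = {(1, 0): Fraction(k)}
    for j in range(1, n + 1):
        total = k + j - 1              # forest nodes before the jth addition
        for i in range(1, j + 2):
            stay = T.get((i, j - 1), Fraction(0)) * (total - i) / total
            grow = T.get((i - 1, j - 1), Fraction(0)) * Fraction(i - 1, total)
            T[(i, j)] = stay + grow
    return T

def closed_form(k, i, j):
    """T_i(j) = k(k-1) j! (k+j-i-1)! / ((j+1-i)! (k+j-1)!), for k >= 2, i <= j+1."""
    return Fraction(k * (k - 1) * factorial(j) * factorial(k + j - i - 1),
                    factorial(j + 1 - i) * factorial(k + j - 1))
```

The two agree for every i and j in range (the factorial form requires k ≥ 2).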
The next step is to compute the expected size of the tree that receives the jth node. The probability
of adding the jth node to a tree of size i is proportional to the number of nodes in trees of size i, and
therefore is i Ti(j)/(k + j). As a result, the expected size of the tree that receives the jth node is:
(1/(k + j)) Σ_{i=1}^{j+1} i^2 Ti(j)
We can show that the general form of the expected value is
(k + 2j + 1)/(k + 1)
by using induction to show that:
Σ_{i=1}^{j+1} i^2 Ti(j) = (k + j)(k + 2j + 1)/(k + 1)
If k is large and j > k, then the size of this tree is approximately 2j/k, twice as large as the average
tree size.
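The expectation can also be checked by directly simulating the uniform growth model; the sketch below (our own helper, not the authors' code) grows a forest by uniform attachment and averages the size of the tree that receives the last node:

```python
import random

def receiving_tree_size(k, n, rng):
    """Grow k single-node trees by n uniform attachments; return the size
    (after the addition) of the tree that receives the nth node."""
    tree_of = list(range(k))      # tree id of each forest node
    size = [1] * k                # current tree sizes
    t = 0
    for _ in range(n):
        t = tree_of[rng.randrange(len(tree_of))]   # uniform over forest nodes
        tree_of.append(t)
        size[t] += 1
    return size[t]

rng = random.Random(1)
k, n, trials = 5, 40, 20000
mean = sum(receiving_tree_size(k, n, rng) for _ in range(trials)) / trials
print(mean, (k + 2 * n + 1) / (k + 1))   # the two values should be close
```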
3 Minimum Weight Model
The uniform growth assumption is convenient for analytical tractability and is realistic for some situations.
We also want to analyze minimum weight forests, and we would like to approximate the minimum
weight model by the uniform growth model. Unfortunately, the approximation is inaccurate, and the
actual distribution is quite complex. Our objective in this section is to show what relation exists between
the two growth models.
In the minimum weight model, our starting assumptions again are that we are growing a forest,
starting with k initial (isolated) nodes in the forest and N nodes not in the forest. We denote the set of
forest nodes by F, and the set of non-forest nodes by V − F. There are M = N(N − 1)/2 + Nk edges connecting
all pairs of nodes in the graph, except for pairs where both nodes are in F. There is a function defined
on each edge, w(e), the weight of edge e. At each step, we add a new node to the forest by finding
the lowest weight edge that connects a node in the forest to a node not in the forest, then adding the
edge and the attached node (i.e., apply Dijkstra's algorithm [9]). We assume that all edge weights are
independently and identically chosen from a continuous distribution over the real numbers, F(x). That
is, Pr[w(e) ≤ x | w(e1) = x1, w(e2) = x2, ..., w(el) = xl] = Pr[w(e) ≤ x] = F(x). See Figure 1.
[Figure 1 appears here: the initial graph, with roots A and B and outside nodes C, D, E; edge weights
w(a)=1, w(b)=6, w(c)=9, w(d)=7, w(e)=4, w(f)=3, w(g)=5, w(h)=8, w(i)=2.]
Figure 1: Initial graph (A and B are the roots of the forest)
We define a function of the nodes, l(n), the label of the node. The original k forest nodes are labeled
1 through k, and the ith node added to the forest is labeled k + i, i = 1, ..., N. We also label the edges
e = (n1, n2) with l(e) = min(l(n1), l(n2)).
We can classify the edges in the graph at any point in the forest growth as follows: edges that connect
two nodes in F (two forest nodes) are in E_F, edges that connect two nodes in V − F (not in the forest)
are in E_VF, and edges that connect a node in F to a node in V − F are in E_cross (cross edges). Edges
in E_F can be subclassified as being either tree edges or non-tree edges depending on whether they are
part of the minimum weight forest.
To add the next node to the forest, the minimum weight edge in E_cross is chosen and made into a
tree edge. At this point, we can label the new tree node and the unlabeled edges incident to the new
tree node using the following algorithm:
min_weight_forest(V, F, E)
    Label the vertices in F by 1 through |F|.
    E_cross = edges in E incident to a vertex in F.
    Label the edges in E_cross by the label of the node in F.
    E_F = empty set.
    E_VF = E − E_cross.
    i = 1.
    While V − F is not empty do
        Let e_min be the minimum weight edge in E_cross.
        Let u be the node incident to e_min that is in V − F.
        Remove from E_cross all edges incident to u, put them in E_F.
        Label u by k + i.
        Remove from E_VF all edges incident to u, label them with k + i,
        and put them in E_cross.
        i = i + 1.
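In Python, the procedure above can be sketched as follows (a straightforward translation rather than the authors' code; the graph representation, a weight map over unordered node pairs, is our choice):

```python
def min_weight_forest(roots, others, weight):
    """Repeatedly pick the minimum weight cross edge and absorb its
    outside endpoint, as in the pseudocode above."""
    forest = list(roots)
    outside = set(others)
    added = []                    # (edge, node) pairs in selection order
    while outside:
        # scan every cross edge for the minimum weight one
        f, o = min(((f, o) for f in forest for o in outside),
                   key=lambda p: weight[frozenset(p)])
        forest.append(o)
        outside.remove(o)
        added.append((frozenset((f, o)), o))
    return added
```

Run on the weights of Figure 1 (with the edge endpoints as we read them from Figures 1 through 6), it adds C, E, and D in that order, via edges a, i, and e.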
The min_weight_forest algorithm is illustrated in Figures 1 through 6. Figure 1 shows the initial
forest. Figure 2 shows the eligible edges (the cross edges) when the first edge is selected. The edges
incident on the root nodes are labeled with the node's label. Figure 3 shows the cross edges before the
second edge has been selected. Node C was added to the forest, and was labeled with 3. Edges leading
from C to nodes not yet in the forest are also labeled with 3. Figure 4 shows the graph before the last edge
[Figure 2 appears here: the graph after the roots are labeled, with l(A)=1, l(B)=2 and the cross edges
labeled by their forest endpoints.]
Figure 2: Selecting the first edge
has been selected, and Figure 5 shows the final forest.
To aid in calculating the probability that a given node in the forest receives the
next edge, we list the edges in the graph sorted by the edge weights. If the min_weight_forest algorithm
chooses an edge labeled i as the next edge to add to the forest, then the node labeled i receives the edge.
Therefore, we identify the edges in the list by their labels only. Figure 6 shows the list of edges with
their corresponding edge labels, sorted by edge weight, that corresponds to our running example. We
next prove this list of edges has a tractable distribution.
Definition: A (N1, N2, ..., Nj)-permutation is an ordered list of N1 items labeled 1, N2 items
labeled 2, etc.
Lemma 1 Label the edges using the edge labeling algorithm, then sort the edges by their weights. The
resulting (N, N, ..., N, N − 1, N − 2, ..., 2, 1)-permutation (with N repeated k times), L, is uniformly
randomly chosen from the set of all (N, N, ..., N, N − 1, N − 2, ..., 2, 1)-permutations.
Proof: First, L is indeed a (N, N, ..., N, N − 1, N − 2, ..., 2, 1)-permutation, since the original k
nodes each label N edges, and the node added at the ith labeling step labels N − i edges.
We consider the algorithm for labeling edges. The set of edges, E, is partitioned into E_F, E_VF, and
E_cross. At each edge selection step, the algorithm searches E_cross and picks the edge with the lowest
weight, e_min. Based on e_min, the algorithm removes some edges from E_cross, puts them in E_F, and also
[Figure 3 appears here: node C has joined the forest with l(C)=3, and the edges g and i leading from C
are labeled 3.]
Figure 3: Selecting the second edge
[Figure 4 appears here: node E has joined the forest with l(E)=4, and the edge h leading from E is
labeled 4.]
Figure 4: Selecting the third edge
[Figure 5 appears here: the final forest, with l(C)=3, l(E)=4, l(D)=5.]
Figure 5: Final forest
[Figure 6 appears here: the final forest with all node and edge labels, together with the following list.]
edge:    a  i  f  e  g  b  d  h  c
weight:  1  2  3  4  5  6  7  8  9
label:   1  3  2  2  3  1  2  4  1
Figure 6: Final forest and corresponding list of labeled edges
removes some edges from E_VF and puts them in E_cross. At this point, the new edges that are placed
into E_cross from E_VF are labeled. This step repeats until E_cross is empty.
An edge that is removed from E_VF is never returned to E_VF. The edges are labeled only when
they are removed from E_VF and placed into E_cross. The weights of these edges are examined by the
algorithm for the first time when the next edge is selected. Since the edge weights are independent
and the weight of an edge is first read by the algorithm after the edge is labeled, the edge weight is
independent of the edge label.
All edge weights have the same distribution, regardless of their labels. The permutation, L, is sorted
by edge weight, so every edge is equally likely to be in every position. The resulting permutation of
the edges is equally likely to be chosen from the set of all possible permutations, so L is a uniformly
randomly chosen (N, N, ..., N, N − 1, N − 2, ..., 2, 1)-permutation. □
The underlying combinatorial object is a matrix of independent and identically distributed edge
weights. So far, we have made a transformation from the matrix of edge weights to a list of edge labels,
and have shown that the list is uniformly distributed. At any step in the forest growing algorithm, only
a subset of the edges are examined in order to find a least weight edge: those in E_cross. To find the
edge that the min_weight_forest algorithm adds, we take the list that corresponds to E and remove
the edges in E_F and E_VF while preserving the order of the remaining edges. We then select the first
edge on the list. We next prove some properties of the lists with removed edges:
Lemma 2 Let M = N1 + ... + Nj. For every k ∈ 1, ..., M, the probability that the kth entry in a uniformly
selected (N1, N2, ..., Nj)-permutation is labeled r, r ∈ 1, ..., j, is Nr/M.
Proof: If we pick edge e ∈ E as the edge in the kth entry of the permutation, there are (M − 1)! ways
to choose the remainder of the permutation. There are Nr ways to choose an edge e with label r, so
there are Nr(M − 1)! permutations with an edge labeled r in the kth position. There are M! possible
permutations of the edges, so the probability that the edge in the kth position is labeled r is Nr/M. □
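Lemma 2 can be verified by brute-force enumeration for small permutations (an illustrative sketch with our own function name; `counts[l-1]` plays the role of N_l):

```python
from fractions import Fraction
from itertools import permutations

def position_label_prob(counts, pos, r):
    """Pr[the entry at index `pos` of a uniformly random permutation of
    M = sum(counts) distinct items is labeled r], where counts[l-1] items
    carry label l."""
    items = [(lab, idx) for lab, c in enumerate(counts, start=1)
             for idx in range(c)]
    hits = total = 0
    for p in permutations(items):
        total += 1
        hits += (p[pos][0] == r)
    return Fraction(hits, total)
```

For example, with counts (2, 1, 1), so M = 4, the probability that any fixed position carries label 1 is 2/4, as the lemma predicts.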
Lemma 3 Let M = N1 + ... + Nj, and let k, l ∈ 1, ..., M, k ≠ l, M > 1. If the kth entry of a uniformly selected
(N1, ..., Nj)-permutation is identified as being labeled r, r ∈ 1, ..., j, then the probability that the lth
entry is labeled s, s ∈ 1, ..., j, is
Ns/(M − 1) if r ≠ s
(Ns − 1)/(M − 1) if r = s
Proof: We fix the identified edge in the kth position, then apply Lemma 2 to the remaining M − 1 entries. □
Corollary 1 Let P be a uniformly chosen (N1, ..., Nj)-permutation, and identify the rth entry as being
labeled l, r ∈ 1, ..., N1 + ... + Nj, l ∈ 1, ..., j. Create P' by taking the first r − 1 and the last N1 +
... + Nj − r entries of P and concatenating them together. Then P' is equally likely to be any one of the
(N1, ..., Nl − 1, ..., Nj)-permutations.
Proof: Apply Lemma 3. □
Suppose that we have run the algorithm once, labeled the edges, and created the list L. Suppose
that we rerun the min_weight_forest algorithm on the same graph that resulted in L. By examining
L, we want to determine what choices the algorithm makes. We then use the uniform randomness of L
to make an expected case analysis of the forest growth.
At any point in the min_weight_forest algorithm, the algorithm examines a subset of the edges
(those in E_cross), and chooses the lowest weight edge to add to the forest. The list L is sorted by edge
weight, so if we rule out the ineligible edges (those in E_VF or in E_F), the first edge in L is the edge
that the algorithm picks. This edge's label is the label of the forest node it is incident on, so we can keep
track of how the forest grows.
At each edge selection step, we choose an edge that connects a node in the forest to a node not in the
forest (i.e., an edge in E_cross). If we select the ith edge, then nodes numbered 1, ..., k + i − 1 are in the
forest. Nodes numbered k + i, ..., k + N are not. The fact that we are selecting the ith edge immediately
gives us some information: edges labeled k + i, ..., k + N are in E_VF, and so are not eligible. Thus, we can
simplify L by removing edges that will not be chosen.
Definition: Li is L with edges labeled k + i, ..., k + N removed. The list Li contains exactly the
edges in E_cross and E_F when the ith edge is chosen.
Corollary 2 Li is equally likely to be any one of the (N, ..., N, N − 1, ..., N − i + 1)-permutations
(with N repeated k times).
Proof: We generate Li from L by removing edges labeled k + i or larger, then apply Corollary 1 repeatedly. □
Even though Li is a refinement of L, Li still contains many edges that the min_weight_forest
algorithm does not examine: the edges in E_F. We have not kept track of the edge destinations, so
we have no information yet about which edges are in E_F and which are in E_cross. Fortunately, the
information necessary for an expected case analysis is implicitly stored in Li.
To extract this information, we look at the first entry in Li, e1. If this edge is labeled k + i − 1 (that
is, it leads from the most recently added node), then it has never been considered before. Therefore, e1
must be in E_cross, so the min_weight_forest algorithm will select it as the ith edge in the forest. If the
label of the edge, l(e1), is less than k + i − 1, then e1 was also the first edge in Li−1, since edges with labels
greater than or equal to k + i − 1 are removed from L to create Li−1. Thus, at some previous selection
j < i, edge e1 was the lowest weight edge in E_cross, so the min_weight_forest algorithm chose
it as the jth edge to add to the forest. Therefore, e1 ∈ E_F (in fact, e1 is a tree edge), so e1 is ineligible.
If the first edge on Li is in E_F, we need to continue scanning Li for the first eligible edge. The
discovery that e1 is not an eligible edge tells us that some other edges in Li are ineligible also: the edges
that lead to the same destination as e1 does. We are not keeping any destination information in L,
but we assume for now that we have some way of determining destinations (we address this assumption
later). In the analysis that follows, we will want to scan over tree edges only, so we remove edges from
Li,1 = Li that have the same destination as e1 to form Li,2.
Suppose that e2 is the first edge in Li,2 (i.e., the second eligible or tree edge in Li). If e2 is labeled
k + i − 1, the min_weight_forest algorithm has never examined e2 before, so e2 is chosen by the algorithm.
Suppose that e2 is labeled k + i − 2. If e1 is labeled k + i − 2, then e1 must have been chosen when the
node labeled k + i − 1 was added to the forest. An edge labeled k + i − 2 can be selected as the next forest edge
only when nodes labeled k + i − 1 or higher are added to the forest. Since we know that e1 added the
node labeled k + i − 1 to the forest, e2 must add the node labeled k + i, and so the min_weight_forest
algorithm chooses edge e2 on the ith edge selection step. If e2 is labeled k + i − 3 or lower, then e2 was
chosen as a forest edge on a previous iteration regardless of the label of e1. In general,
Theorem 2 Let ej be the first edge on Li,j, and suppose that we have not found in {e1, ..., ej−1} the
edge that adds the node labeled k + i to the forest.
If l(ej) = k + i − s and s < i, then ej adds the node labeled k + i to the forest if and only if we have
already found in {e1, ..., ej−1} the edges that added the nodes labeled k + i − s + 1 through k + i − 1 to
the forest.
If s ≥ i, then ej adds the node labeled k + i to the forest if and only if we have already found in
{e1, ..., ej−1} the edges that added the nodes labeled k + 1 through k + i − 1 to the forest.
Proof: Assume that s < i. Suppose that we have found the edges that added the nodes labeled k + i − s + 1
through k + i − 1 to the forest. Since edge ej can only add nodes labeled k + i − s + 1 and higher to the
forest and all of the previous possibilities have been covered already, ej cannot be a tree edge. Since we
throw away non-tree edges in E_F, ej must be a cross edge. Since it is the first cross edge we have found,
it must have the lowest weight among all cross edges. Therefore, the tree growing algorithm picks it to
add the node labeled k + i to the forest.
Suppose that we have not yet found all of the edges that added the nodes labeled k + i − s + 1 through
k + i − 1 to the forest. Let the lowest labeled node that is unaccounted for be labeled k + r.
Claim: Edge ej added the node labeled k + r to the forest.
Proof: Consider what occurs when the node labeled k + r is added to the forest. We can retrace the
forest generating algorithm's decision by examining the list Lr. Edge ej in Li was considered as some
edge ej' in Lr, because we had not accounted for the edge that added node k + r in Li before we came
to ej. When edge ej' in Lr is considered, all nodes with labels between k + i − s + 1 and k + r − 1
are accounted for. As we have shown, ej' must be the lowest weight cross edge, so the forest growing
algorithm picks it to add the node labeled k + r to the forest. □
Since ej adds the node labeled k + r to the forest, it cannot add node k + i to the forest.
If s ≥ i, we only need to account for the fact that any edge labeled 1 through k can add the node
labeled k + 1 to the forest. □
Suppose that we have an algorithm for determining which node a tree edge adds to the forest. Then
the algorithm for using L to calculate which node received the ith edge is:
procedure retrace(L)
    for i = 1 to N do
        Li = L with edges labeled k + i or greater removed.
        j = 1.
        Li,j = Li.
        while the edge that adds node k + i to the forest is not found
            ej = the first edge in Li,j.
            if all nodes labeled between l(ej) + 1 and k + i − 1 are accounted for
                output (i, l(ej)).
            else
                account for the node that ej adds.
                Li,j+1 = Li,j with edges that lead to the same destination as ej removed.
                increment j.
Theorem 2 gives us an algorithm for determining which node a tree edge adds to the forest. When
edge ej is considered, the algorithm looks to see if node l(ej) + 1 is accounted for. If the node is
accounted for, the algorithm looks at node l(ej) + 2, and so on until an unaccounted for node is
found. Edge ej adds that node (possibly node k + i), so the algorithm accounts for it.
We use a 'marked boxes' technique to account for the nodes. When sublist Li is examined, the
algorithm initializes i boxes, numbered 1 through i. If the edge ej is labeled k + i − s, the algorithm
looks for an unmarked box starting at box s and working towards box 1. The algorithm marks the first
unmarked box that is found. If box 1 is marked, the algorithm concludes that edge ej adds node k + i.
procedure markbox(Li)
    initialize box[1..i] as unmarked.
    j = 1.
    while box[1] is unmarked
        get ej.
        s = k + i − l(ej).
        if s > i then s = i.
        while box[s] is marked
            s = s − 1.
        mark box[s].
        j = j + 1.
    return(j − 1).
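A direct transcription of markbox (with our own variable names; the labels argument is the sequence l(e_1), l(e_2), ... of first eligible edges of the sublists):

```python
def markbox(labels, i, k):
    """Return the index j of the edge e_j that adds node k+i to the forest,
    following the marked-boxes procedure above."""
    marked = [False] * (i + 1)            # boxes 1..i; index 0 unused
    for j, lab in enumerate(labels, start=1):
        s = min(k + i - lab, i)           # edges labeled 1..k go to box i
        while marked[s]:
            s -= 1                        # work toward box 1
        marked[s] = True
        if s == 1:
            return j
    raise ValueError("ran out of edges before box 1 was marked")
```

On the running example below (k = 2), the first-edge label sequences (1), (1, 3), and (1, 3, 2) select the 1st, 2nd, and 3rd scanned edge respectively.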
Executions of the retrace and markbox procedures are illustrated in Table 1. The sorted list of edges L
is the same as that in Figure 6. The list L1,1 is the list of edge choices when node 3 is selected (all edges
labeled 3 and larger are deleted). The first edge on the list is labeled 1, so s = 3 − 1 = 2; since s = 2 > i = 1,
s is set to 1 and the algorithm marks box 1 and selects that edge. The list L2,1 is the initial list of edges
when node 4 is selected. The first edge on L2,1 is labeled 1, so s = 4 − 1 = 3; since 3 > 2, the algorithm
marks box 2. It generates L2,2 by removing the initial edge on L2,1 and all edges that lead to the same
node. The first edge on L2,2 is labeled 3, so s = 4 − 3 = 1, and the algorithm marks box 1 and selects
that edge. Finally, list L3,1 has all edges on it, for there are no edges in E_VF. The first edge on L3,1 is
labeled 1, so s = 5 − 1 = 4; since 4 > 3, the algorithm marks box 3. The first edge on L3,2 is labeled 3,
so s = 5 − 3 = 2, and the algorithm marks box 2. The first edge on L3,3 is labeled 2, so s = 5 − 2 = 3,
and it tries to mark box 3. Box 3 is already marked (since a different edge chose the third node), so it
tries to mark box 2. Box 2 is also marked, so it marks box 1, and selects that edge. Note that this
procedure selects the same edges (a, i, e) that the simulation of Dijkstra's algorithm does.

          a  i  f  e  g  b  d  h  c     boxes: 1  2  3
L     =   1  3  2  2  3  1  2  4  1
L1,1  =   1     2  2     1  2     1            X
L2,1  =   1  3  2  2  3  1  2     1               X
L2,2  =      3  2  2  3  1        1            X  X
L3,1  =   1  3  2  2  3  1  2  4  1                  X
L3,2  =      3  2  2  3  1     4  1               X  X
L3,3  =         2     3  1     4               X  X  X

Table 1: The sorted list of edges, and finding the selected edge.
Before we continue, we need to determine the underlying distributions of the sublists. Lemma 1 and
Corollary 2 show that L and Li are uniformly randomly chosen permutations. The algorithm to recreate
the min_weight_forest algorithm's choices from L requires the use of the Li,j sublists. The Li,j sublists
are created by determining edge destinations and removing some edges. We do not explicitly keep edge
destination information in L, but we show that this is not a problem for the analysis.
Corollary 3 The lists Li,j are uniformly randomly distributed permutations.
Proof: Li,1 = Li, so Li,1 is a uniformly randomly chosen permutation. Since Li,1 is a uniformly randomly
chosen permutation, we see that Li,2 is a uniformly randomly chosen permutation by repeated
applications of Corollary 1. Proceeding inductively, we see that every Li,j is a uniformly randomly chosen
permutation. □
3.1 Relation to Uniform Growth Model
The algorithm for determining the min_weight_forest algorithm's choices from L shows that edges with
different labels have different probabilities of being chosen to add the ith node. An edge labeled k + i − 1
can be selected from every Li,j, while an edge labeled 1 through k can only be selected from Li,i. We
will return to the issue of the edge label of the edge that is selected, but here we want to determine the
relation of the minimum weight model to the uniform growth model.
We observe that if the edge that adds the node labeled k + i to the forest is not found in Li,1 through
Li,i−1, then the edge will be found in Li,i, since every edge in Li,i is a cross edge. Every forest node is
incident to the same number of cross edges (N − i + 1), so by Lemma 3, every forest node is equally likely
to receive the next edge. Thus, the minimum weight model is equivalent to the uniform growth model on
the occasions when Li,i must be searched to find the minimum weight cross edge. The following theorem
states the probability of this event.
Theorem 3 Suppose that there are k trees, and the min_weight_forest algorithm is adding the node
labeled k + i to the forest. If i ≪ N, then the probability that the first edge in Li,j, ej, is the minimum
weight cross edge is approximately 1/(k + i − 1) if j < i and k/(k + i − 1) if j = i.
Proof: In this proof, we use the marked boxes mechanism described in the previous section. The
input to the marked boxes algorithm is the sequence (l(e1), l(e2), ..., l(en)). We transform this sequence
into the more convenient sequence (x1, x2, ..., xn), where each xj ∈ {1, ..., i} and 1 ≤ n ≤ i, by the
following function:
xj = k + i − l(ej) if l(ej) > k
xj = i if l(ej) ≤ k
If xj = r with r < i, then xj represents an edge labeled k + i − r and refers to box r in the markbox procedure.
If xj = i, then xj represents an edge labeled 1 through k and refers to box i.
In Li, there are N edges with each of the labels 1 through k, and N − s edges labeled k + s for s = 1, ..., i − 1. Since
N ≫ i, there are about the same number of edges labeled k + i − 1 in Li as there are edges labeled 1.
Therefore, we will assume that each xj in a sequence is independent, and has value r with probability
1/(i + k − 1) if r < i, and value i with probability k/(k + i − 1). We want to determine the probability
that the length of the minimum length sequence which marks box 1 is n.
[Figure 7 appears here: the last entry of a safe trigger of length r marks box t + 2, with a sequence
from st(t) to its right and a translated sequence from st(s) to its left.]
Figure 7: Last entry in st_r
We define a safe trigger of length r to be a sequence st_r which is of length r, does not mark box 1,
but (st_r, j) marks box 1 if 1 ≤ j ≤ r + 1. For example, (2 3) is a safe trigger of length 2. We define st(r) to
be the set of all safe triggers of length r, and ST(r) = |st(r)|. For example, st(2) = {(2 3), (3 2), (3 3)},
and ST(2) = 3. We define ST(0) = 1.
Lemma 4 ST(r) = (r + 1)^(r−1)
Proof: We first derive a recurrence for ST(r), then solve the recurrence. For the starting values of
the recurrence, we observe that st(0) = {()} and st(1) = {(2)}, so ST(0) = ST(1) = 1.
Consider the last entry added to a sequence in st(r). The last entry marks a box, so we assume
that it marks the box numbered t + 2 (see Figure 7). To the right of the last marked box is a sequence
in st(t) and to the left is a sequence that corresponds to a sequence in st(s) by translation, where
s = r − t − 1. There are C(s + t, t) = C(r − 1, t) ways to combine an s length sequence with a t length
sequence. The last entry in the safe trigger of length r can be any number between t + 2 and r + 1
inclusive, so there are s + 1 possibilities. Therefore:
ST(r) = Σ_{s=0}^{r−1} (s + 1) C(r − 1, s) ST(s) ST(r − s − 1)    (1)
The form of the recurrence is similar to the convolution of an exponential generating function, so we
let Y = Y(z) = Σ_{r≥0} ST(r) z^r / r! be the exponential generating function of ST(r). Then Y satisfies the
differential equation
(1 − Yz)Y′ = Y^2    (2)
To solve this equation, we look for a solution of the form:
Y = e^(z a(Y))
Then
Y′ = Y(Y′ a′(Y) z + a(Y))
Y′(1 − Y a′(Y) z) = Y a(Y)
Setting a(Y) = Y satisfies the differential equation. The generating function of Y = e^(Yz) is (see [7], page
392)
Y(z) = Σ_{r≥0} (r + 1)^(r−1) z^r / r!
so that
ST(r) = Σ_{s=0}^{r−1} (s + 1) C(r − 1, s) ST(s) ST(r − s − 1) = (r + 1)^(r−1)  □
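Lemma 4 can be cross-checked by exhaustively enumerating the definition of safe triggers; entries larger than r + 1 cannot occur in one, so the search space is finite (a small sketch; the function names are ours):

```python
from itertools import product

def marks_box1(seq):
    """Simulate the marking procedure; True if box 1 ever gets marked."""
    marked = set()
    for x in seq:
        s = x
        while s in marked:
            s -= 1                 # cascade toward box 1
        if s == 1:
            return True
        marked.add(s)
    return False

def count_safe_triggers(r):
    """Count length-r sequences over {2, ..., r+1} that never mark box 1;
    these are exactly the safe triggers in st(r)."""
    return sum(not marks_box1(seq)
               for seq in product(range(2, r + 2), repeat=r))
```

The counts 1, 1, 3, 16, 125 for r = 0, ..., 4 agree with the formula (r + 1)^(r−1).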
Let us count the number of ways that edge ej marks box 1 (i.e., adds the node labeled k + i to the
forest). Suppose that after processing x1, ..., xj−1, box 1 is unmarked, and the t subsequent boxes are marked. If we
examine which boxes are marked after j − 1 steps, box 1 is unmarked, boxes 2 through t + 1 are marked,
and box t + 2 is unmarked. Of the remaining i − t − 2 boxes, numbered t + 3 through i, exactly j − t − 1
are marked (see Figure 8). In order to make the analysis simpler, we aggregate all of the xl values that
cannot mark box 1 on the jth step (those values between j + 1 and i) and renumber them by j + 1.
That is,
x′l = xl if xl ≤ j
x′l = j + 1 if xl > j
In the transformed sequence, all the boxes are marked except for boxes 1 and t + 2. When x′j is chosen
to augment this sequence, there are i + k − j − 1 ways to choose it to be j + 1, and 1 way to choose it to
be n, n = 1, ..., j. With this configuration, any x′j numbered between 1 and t + 1 will mark box 1. Let
R(s) be the number of ways to mark the last s = j − t − 1 boxes. The number of ways that an edge will
be selected on the jth examination step is therefore
Stop(j) = Σ_{t=0}^{j−1} (t + 1) C(j − 1, t) ST(t) R(j − 1 − t)    (3)
In order to solve equation (3), we need to determine the value of R(s). We can quickly derive a
recursive formula that is similar to equation (1). In Figure 7, the number of sequences that can fill the
right-hand side is R(s) instead of ST(s). The sequence will be completed if x ∈ {t + 2, ..., r + 1}. The x
[Figure 8 appears here: the marked-box configuration before the jth step and its transformed,
renumbered version, with the ST(t) and R(j − t − 1) portions indicated.]
Figure 8: Transformation for the ith selection
can be chosen to have the value r in X = i + k − j − 1 ways; all other numbers can only be chosen one
way. Thus, the number of ways to complete the sequence is
R(r) = Σ_{s=0}^{r−1} (X + s) C(r − 1, s) ST(r − 1 − s) R(s)
     = Σ_{s=0}^{r−1} (X + s) C(r − 1, s) (r − s)^(r−s−2) R(s)
Lemma 5 R(r) = X(X + r)^(r−1)
We prove the lemma by induction. It is easy to see that R(0) = 1 and R(1) = X. For the inductive step,
we need to show that
X(X + r)^(r−1) = Σ_{s=0}^{r−1} (X + s) C(r − 1, s) (r − s)^(r−s−2) X(X + s)^(s−1)
or that
(X + r)^(r−1) = Σ_{s=0}^{r−1} C(r − 1, s) (r − s)^(r−s−2) (X + s)^s
Let
T(r) = R(r + 1)/X = Σ_{s=0}^{r} C(r, s) (r + 1 − s)^(r−s−1) (X + s)^s
Then, from Riordan [11], page 18, we see that T(r) is an instance of the generalization of Abel's binomial
formula, T(r) = A_r(X, 1; 0, −1) = (X + 1 + r)^r, which is exactly what we need to show. □
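Lemma 5 and the recurrence for R(r) can be checked numerically with exact integer arithmetic (a sketch; ST(m) = (m + 1)^(m−1) comes from Lemma 4, and the function names are ours):

```python
from math import comb

def ST(m):
    """ST(m) = (m+1)^(m-1), with ST(0) = 1."""
    return (m + 1) ** (m - 1) if m >= 1 else 1

def R_rec(r, X, memo=None):
    """R(r) from the recurrence R(r) = sum_s (X+s) C(r-1,s) ST(r-1-s) R(s)."""
    if memo is None:
        memo = {0: 1}
    if r not in memo:
        memo[r] = sum((X + s) * comb(r - 1, s) * ST(r - 1 - s) * R_rec(s, X, memo)
                      for s in range(r))
    return memo[r]
```

For a range of X and r, the recurrence reproduces the closed form X(X + r)^(r−1).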
Lemma 6 Stop(j) = (X + j)^(j−1) = (i + k − 1)^(j−1)
To prove this lemma, we apply Lemma 5 to equation (3):
Stop(j + 1) = Σ_{s=0}^{j} (j + 1 − s) C(j, s) ST(j − s) R(s)
            = X Σ_{s=0}^{j} C(j, s) (j + 1 − s)^(j−s) (X + s)^(s−1)
            = X A_j(X, 1; −1, 0)
            = X (X + j + 1)^j / X
            = (X + j + 1)^j
where the third step is another application of the generalization of Abel's binomial formula. □
To prove the theorem, we note that on the jth examination step there are (i + k − 1)^j possible
sequences. The min_weight_forest algorithm chooses an edge on the jth examination step on exactly
(i + k − 1)^(j−1) of these sequences, so the probability that a selection is made on the jth examination step
is 1/(i + k − 1) if j < i. A selection will be made by the ith step, so the probability of selecting an edge
on the ith step is the remaining probability, proving the theorem. □
As a corollary to Theorem 3, we find the relationship between the uniform growth model and the
minimum weight model.

Corollary 4 Let k be the number of trees in the forest, and N the number of nodes to be added to the
forest. Consider the act of adding the ith edge to the minimum weight forest. If i < N, then

    Pr[every node in the forest is equally likely to receive the ith edge] = k/(i + k - 1)

Proof: Every forest node is equally represented in L_{i,1}. ∎
3.1.1 A Limiting Distribution
The results we used to prove Corollary 4 let us calculate for the minimum weight model the probability
that a forest node receives the next edge. In Table 2, we list the possibilities by which an edge at the
head of L_{i,j} can be chosen.
examination                            size of safe trigger
step              0                                1                                2
1      ST(0)\binom{0}{0}R^1(0)/Z
       = 1/Z
2      ST(0)\binom{1}{0}R^2(1)/Z^2     ST(1)\binom{1}{1}R^2(0)/Z^2
       = (Z-2)/Z^2                     = 1/Z^2
3      ST(0)\binom{2}{0}R^3(2)/Z^3     ST(1)\binom{2}{1}R^3(1)/Z^3     ST(2)\binom{2}{2}R^3(0)/Z^3
       = (Z-3)(Z-1)/Z^3                = 2(Z-3)/Z^3                    = 3/Z^3
4      ST(0)\binom{3}{0}R^4(3)/Z^4     ST(1)\binom{3}{1}R^4(2)/Z^4     ST(2)\binom{3}{2}R^4(1)/Z^4     ...
       = (Z-4)(Z-1)^2/Z^4              = 3(Z-4)(Z-2)/Z^4               = 9(Z-4)/Z^4
5      ST(0)\binom{4}{0}R^5(4)/Z^5     ST(1)\binom{4}{1}R^5(3)/Z^5     ST(2)\binom{4}{2}R^5(2)/Z^5     ...
       = (Z-5)(Z-1)^3/Z^5              = 4(Z-5)(Z-2)^2/Z^5             = 18(Z-5)(Z-3)/Z^5

Table 2: Probability that an edge is selected, given an examination step and a safe trigger size
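The closed forms in Table 2 all follow from the general entry ST(s) \binom{j-1}{s} R^j(j-1-s)/Z^j, where Z = k + i - 1 is the number of forest nodes. The sketch below is our own cross-check, not part of the paper; it evaluates a few entries exactly with rational arithmetic for an arbitrary numeric Z.

```python
from fractions import Fraction
from math import comb

def ST(t):
    # safe-trigger count: (t+1)^(t-1)
    return (t + 1) ** (t - 1) if t > 0 else 1

def Rj(Z, j, n):
    # step-dependent variant: R^j(n) = (Z - j)(Z - j + n)^(n-1), R^j(0) = 1
    return (Z - j) * (Z - j + n) ** (n - 1) if n > 0 else 1

def entry(Z, j, s):
    # Table 2 entry at examination step j, safe trigger size s
    return Fraction(ST(s) * comb(j - 1, s) * Rj(Z, j, j - 1 - s), Z ** j)

Z = 20
assert entry(Z, 2, 0) == Fraction(Z - 2, Z ** 2)
assert entry(Z, 3, 1) == Fraction(2 * (Z - 3), Z ** 3)
assert entry(Z, 4, 1) == Fraction(3 * (Z - 4) * (Z - 2), Z ** 4)
assert entry(Z, 5, 2) == Fraction(18 * (Z - 5) * (Z - 3), Z ** 5)
print("Table 2 closed forms check out for Z = 20")
```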
In Table 2, the number of nodes in the forest is Z = k + i - 1. The number of identifiable elements
grows as the list is searched further. On the jth selection step, only edges labeled (k + i) - 1 through
(k + i) - j can be cross edges. Since the number of unselectable edges changes with the selection step,
we must modify our definition of R(n), which represents the edges that will not help cause a selection on
step j. We replace X in R(n) by Z - j, and write the new function as R^j(n) = (Z - j)(Z - j + n)^{n-1}.
In the previous section, we proved Theorem 3 by weighting the entries in a column by the number of
edge labels that can cause a selection and summing across the rows of the table. In order to calculate
the probability distribution that a node with a given label receives the edge, we sum down the columns.
Each entry in the table represents a different possibility by which an edge can be chosen as the next edge
added to the forest. The sth column is the probability that an s-safe trigger causes the selection. The
rth most recently added node will receive the next edge only through a safe trigger of size r or larger. So,
the sth column sum is the difference between the probability that the sth and the (s+1)st most recently
added node receives the next edge. Let us define A_i(j) to be the probability that the forest node labeled
k + i - j receives the ith edge added to the forest.
Theorem 4 If the ith edge is being added to the forest and i is large compared to the number of trees
but small compared to the number of nodes in the graph, then the probability that a node receives an
edge approaches the following limiting distribution:

    A(j) = \sum_{t=j}^{\infty} \frac{t^{t-2} e^{-t}}{t!}                              (4)
Proof: Let C(t) be the sum of the tth column of Table 2. Then, ignoring the possibility that L_{i,i}
must be examined in order to make a selection,

    A_i(j) = \sum_{t \geq j} C(t)
Lemma 7

    C(t+1) \approx \frac{(t+1)^{t-1} e^{-(t+1)}}{(t+1)!}

It is easily seen from the preceding arguments that the column sum has the following value:

    C(t+1) = \frac{ST(t)}{Z^{t+1}} \sum_{j=0}^{Z-(t+1)} \binom{j+t}{t} \frac{R^{j+t+1}(j)}{Z^j}
           = \frac{ST(t)}{Z^{t+1}} \sum_{j=0}^{Z-(t+1)} \binom{j+t}{t} (Z-(t+1)-j)(Z-(t+1))^{j-1} / Z^j
           = \frac{ST(t)}{Z^{t+1} t!} \sum_{j=0}^{Z-(t+1)} \left[ (j+t)^{(t)} - \frac{(j+t)^{(t+1)}}{Z-(t+1)} \right] \left( \frac{Z-(t+1)}{Z} \right)^j
Here, x^{(n)} = x(x-1)(x-2) \cdots (x-n+1) is the falling factorial function. In order to solve this
sum, we apply Theorem 1.6 from Mickens (page 36) [10], which states

    \Delta^{-1}\left[ A^j P(j) \right] = \frac{A^j}{A-1} \left[ P(j) + \frac{A}{1-A} \Delta P(j) + \left( \frac{A}{1-A} \right)^2 \Delta^2 P(j) + \cdots \right]

where \Delta^{-1} is indefinite summation and \Delta is the forward difference.

In our equation, A = (Z-(t+1))/Z, so that 1/(A-1) = -Z/(t+1) and A/(1-A) = (Z-(t+1))/(t+1).
For C(t+1), P(j) = P_t(j) = (j+t)^{(t)} - (j+t)^{(t+1)}/(Z-(t+1)). Therefore,

    \Delta^r P_t(j) = t^{(r)} (j+t)^{(t-r)} - \frac{(t+1)^{(r)} (j+t)^{(t+1-r)}}{Z-(t+1)},   0 \leq r \leq t+1
Putting these into our formula for C(t + 1), we get:
Evaluating the indefinite sum between the limits j = 0 and j = Z - t, almost all of the terms either
vanish at the limits or cancel in pairs, and we are left with the exact value

    C(t+1) = \frac{ST(t)}{Z^{t+1} t!} \cdot \frac{Z}{t+1} \cdot \left( \frac{Z-(t+1)}{Z} \right)^{Z-(t+1)} \cdot \frac{(Z-1)^{(t+1)}}{Z-(t+1)}
If i is large then Z is large, so that (1 - (t+1)/Z)^{Z-(t+1)} \approx e^{-(t+1)} while the remaining
factors in Z approach 1; since ST(t) = (t+1)^{t-1}, we get

    C(t+1) \approx \frac{(t+1)^{t-1} e^{-(t+1)}}{(t+1)!}
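The exact expression for C(t+1) can be cross-checked numerically. The sketch below is our verification, not part of the paper: it sums the safe-trigger-size-t column of Table 2 directly with exact rational arithmetic, compares it to the closed form, and checks the Lemma 7 limit for large Z.

```python
from fractions import Fraction
from math import comb, exp, factorial

def ST(t):
    return (t + 1) ** (t - 1) if t > 0 else 1

def Rj(Z, j, n):
    # R^j(n) = (Z - j)(Z - j + n)^(n-1), with R^j(0) = 1
    return (Z - j) * (Z - j + n) ** (n - 1) if n > 0 else 1

def column_sum(Z, t):
    # direct sum of the size-t column over examination steps m = t+1 .. Z
    return sum(Fraction(ST(t) * comb(m - 1, t) * Rj(Z, m, m - 1 - t), Z ** m)
               for m in range(t + 1, Z + 1))

def falling(x, n):
    # falling factorial x^(n) = x(x-1)...(x-n+1)
    p = 1
    for r in range(n):
        p *= x - r
    return p

def closed_form(Z, t):
    # exact value of C(t+1) from the evaluated sum
    return (Fraction(ST(t), Z ** (t + 1) * factorial(t))
            * Fraction(Z, t + 1)
            * Fraction(Z - t - 1, Z) ** (Z - t - 1)
            * Fraction(falling(Z - 1, t + 1), Z - t - 1))

for Z in (10, 25):
    for t in range(4):
        assert column_sum(Z, t) == closed_form(Z, t)

# Lemma 7: the column sum approaches (t+1)^(t-1) e^-(t+1) / (t+1)! as Z grows
t = 1
limit = ST(t) * exp(-(t + 1)) / factorial(t + 1)
assert abs(float(closed_form(500, t)) - limit) < 0.01 * limit
print("column sums match the closed form; Z = 500 is within 1% of the limit")
```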
To finish the proof of the theorem, we note that the value of C(t) rapidly decreases to zero, so
that

    A_i(j) \approx \sum_{t=j}^{\infty} \frac{t^{t-2} e^{-t}}{t!}
For large j, we can find an approximate value of the sum. If we use Stirling's approximation for the
factorial, we find that

    A(j) \approx \sum_{t=j}^{\infty} \frac{1}{\sqrt{2\pi}\, t^{5/2}}
We can approximate the sum with an integral using Euler's summation formula [6] to get:

    A(j) \approx \frac{2}{3\sqrt{2\pi}} j^{-3/2} \left( 1 + \frac{3}{4} j^{-1} + O(j^{-2}) \right)      (5)
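Equation (4) and approximation (5) are easy to evaluate numerically. The sketch below is our illustration: it reproduces A(1) = .5000 and A(2) = .1321 from Table 3 and checks the accuracy of (5) at j = 50.

```python
import math

def C(t):
    # limiting column sum: t^(t-2) e^-t / t!, via logs to avoid overflow
    return math.exp((t - 2) * math.log(t) - t - math.lgamma(t + 1))

def A(j, terms=5000):
    # equation (4), truncated; the tail beyond 5000 terms is below 1e-6
    return sum(C(t) for t in range(j, j + terms))

assert abs(A(1) - 0.5000) < 1e-4   # most recently added node: 50% of the time
assert abs(A(2) - 0.1321) < 1e-4

def A_approx(j):
    # approximation (5), keeping the first correction term
    return 2.0 / (3.0 * math.sqrt(2.0 * math.pi)) * j ** -1.5 * (1 + 0.75 / j)

# at j = 50 the asymptotic form is already within a couple of percent
assert abs(A(50) - A_approx(50)) < 0.02 * A(50)
print(A(1), A(2))
```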
Table 3 lists the first few values of the distribution A(j). We summed the first 400 values of C(t)
and then added the approximation of A(401) to calculate A(1). We calculated the remaining entries in
Table 3 by subtracting the appropriate values of C(t). Note that A_i(j) depends on j but not i, so the
probability that a forest node receives the next edge depends primarily on how recently it was added to
the tree. The analysis shows that the most recently added forest node receives the next edge 50% of the
time, and the ten most recently added forest nodes receive the next edge 83% of the time.
The analysis that leads to the distribution A depends on the assumption that k ≪ i ≪ N. In order to
determine how strongly the analysis depends on the assumptions, we wrote a simulator that generated
minimum weight random forests using the min_weight_forest algorithm. We generated a minimum
weight random forest using the parameters k = 5 and N = 94 10,000 times. For each edge selection step
i, we recorded the number of times that an edge labeled k + i - s was selected. We used this information
to estimate A_i(j), which we list in Table 3 for i = 10, 20, 40, and 60. For small i (i = 10), there is still
j          1      2      3      4      5      6      7      8      9      10
A(j)     .5000  .1321  .0645  .0396  .0273  .0203  .0159  .0128  .0106  .0090
A_10(j)  .4808  .1227  .0642  .0428  .0373  .0320    --   .0273  .0271    --
A_20(j)  .4811  .1209  .0627  .0442  .0318  .0254  .0233  .0172  .0157  .0142
A_40(j)  .4527  .1122  .0597  .0302  .0234  .0198  .0160  .0126  .0140  .0080
A_60(j)  .3993  .0914  .0425  .0252  .0142  .0142  .0122  .0114  .0089  .0082
Table 3: Probability that the jth most recently added forest node receives the next edge.
a large chance (5/14) that every node in the forest is equally likely to receive the next edge, so the tail
of A_10 is not well approximated by A. For large i (i = 60), there are significantly fewer edges labeled
k + i - 1 than labeled 1 through k. As a result, A_i(1) becomes smaller than A(1). For moderate i
(i = 20 and i = 40), A is a good approximation to A_i.
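The simulation can be reproduced with a short Monte Carlo sketch. The version below is our own reconstruction, not the authors' simulator: it assumes the natural model in which every forest-to-outside pair carries an independent uniform weight, and the forest repeatedly absorbs the cheapest edge joining a forest node to an outside node (trees are never merged, so forest-internal edges never matter). With k = 5 and N = 94 it estimates A_20(1), which should land near the .4811 reported in Table 3.

```python
import random

def estimate_A_i_1(k, N, i, trials, seed=1):
    # fraction of trials in which the ith edge attaches to the most
    # recently added forest node (the one labeled k + i - 2 here,
    # since initial trees get labels 0..k-1 and step s adds label k+s-1)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        outside = list(range(k, k + N))
        # best[v] = (weight, forest endpoint) of v's cheapest edge into the forest
        best = {v: min((rng.random(), u) for u in range(k)) for v in outside}
        for step in range(1, i + 1):
            v = min(outside, key=lambda x: best[x])    # cheapest forest-outside edge
            if step == i and best[v][1] == k + i - 2:  # endpoint is most recent node
                hits += 1
            outside.remove(v)
            new_label = k + step - 1                   # v joins with this label
            for w in outside:                          # fresh i.i.d. edges to v
                cand = (rng.random(), new_label)
                if cand < best[w]:
                    best[w] = cand
    return hits / trials

est = estimate_A_i_1(k=5, N=94, i=20, trials=2000)
print(est)   # Table 3 reports A_20(1) = .4811
```

Maintaining, for each outside node, only its cheapest edge into the forest is the usual Prim-style bookkeeping; it keeps each trial at O(N) work per step.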
4 Conclusions
We have examined two models of naturally growing forests, one in which every node in the forest is
equally likely to be adjacent to the next node added to the forest (uniform growth model), and a
model in which the forest is constructed in order to create a minimum weight forest (minimum weight
model).
We found that the uniform growth model is analytically tractable. We calculated the distribution of
tree sizes, and found that the tree that receives the last node will contain about 2N/k nodes, where k is
the number of trees in the forest and N nodes are added to the forest.
We found the minimum weight model to be far more difficult. We examined the relationship between
the uniform growth model and the minimum weight model and found that the uniform growth model
holds in the minimum weight model with probability k/(k + i - 1) when adding the ith node to the forest.
The probability that a forest node receives the next edge added to the forest depends primarily on how
recently that node was added to the forest. The most recently added node receives the next edge about
50% of the time, and one of the ten most recently added nodes receives the next edge about 83% of the
time.
5 Acknowledgements
We'd like to thank Kevin Donovan and Yogin Campbell for their help and advice on solving the problems
in this paper.
References
[1] B. Bollobás. Random Graphs. Academic Press, 1985.
[2] J.F. Desler and S.L. Hakimi. A graph-theoretic approach to a class of integer-programming
problems. Operations Research, 17:1017-1033, 1969.
[3] A.M. Frieze. On the value of a random minimum spanning tree problem. Discrete Applied Mathe-
matics, 10:47-56, 1985.
[4] M. Held and R.M. Karp. The traveling salesman problem and minimum spanning trees. Operations
Research, 18:1138-1162, 1970.
[5] M. Held and R.M. Karp. The traveling salesman problem and minimum spanning trees: Part II.
Mathematical Programming, 1:6-25, 1971.
[6] Micha Hofri. Probabilistic Analysis of Algorithms. Springer-Verlag, 1987.
[7] D. Knuth. The Art of Computer Programming, volume 1. Addison-Wesley, 1968.
[8] H.M. Mahmoud. Evolution of Random Search Trees. Wiley-Interscience, 1992.
[9] Udi Manber. Introduction to Algorithms. Addison-Wesley, 1989.
[10] R. Mickens. Difference Equations. Van Nostrand Reinhold, New York, 1987.
[11] J. Riordan. Combinatorial Identities. Robert E. Krieger, Huntington, NY, 1979.
[12] J.M. Steele. On Frieze's ζ(3) limit for lengths of minimal spanning trees. Discrete Applied Mathe-
matics, 18:99-103, 1987.
[13] H.S. Stone. An algorithm for finding a minimum weighted 2-matching. Technical Report RC 15014,
IBM T.J. Watson Research Center, 1989.