Probabilistic Characterization of Decision Trees
Amit Dhurandhar ASD@CISE.UFL.EDU
Computer and Information Science and Engineering
University of Florida
Gainesville, FL 32611, USA
Alin Dobra ADOBRA@CISE.UFL.EDU
Computer and Information Science and Engineering
University of Florida
Gainesville, FL 32611, USA
Editor: Leslie Pack Kaelbling
Abstract
In this paper we use the methodology introduced in Dhurandhar and Dobra (2006) for analyzing
the error of classifiers and the model selection measures, to analyze decision tree algorithms. The
methodology consists of obtaining parametric expressions for the moments of the Generalization
error (GE) for the classification model of interest, followed by plotting these expressions for
interpretability. The major challenge in applying the methodology to decision trees, the main theme
of this work, is customizing the generic expressions for the moments of GE to this particular
classification algorithm. The specific contributions we make in this paper are: (a) we completely
characterize a subclass of decision trees, namely Random decision trees, (b) we discuss how the
analysis extends to other decision tree algorithms, and (c) in order to extend the analysis to
certain model selection measures, we generalize the relationships between the moments of GE and
moments of the model selection measures given in Dhurandhar and Dobra (2006) to randomized
classification algorithms. An extensive empirical comparison between the proposed method and
Monte Carlo demonstrates the advantages of the method in terms of running time and accuracy. It also
showcases the use of the method as an exploratory tool to study learning algorithms.
1. Introduction
Model selection for classification is one of the major challenges in Machine Learning and Datamining.
Given an independent and identically distributed (i.i.d.) sample from the underlying probability
distribution, the classification model selection problem consists in building a classifier by selecting
among competing models. Ideally the model selected minimizes the Generalization error (GE), the
expected error over the entire input. Since GE cannot be directly computed, part of the sample is
used to estimate GE through measures such as cross-validation, hold-out set, leave-one-out, etc.
Though certain rules of thumb are followed by practitioners w.r.t. training size and other parameters
specific to the validation measures in evaluating models through empirical studies Kohavi (1995);
Blum et al. (1999) and certain asymptotic results exist Vapnik (1998); Shao (1993), the fact remains
that most of these models and model selection measures are not well understood in real life
(non-asymptotic) scenarios (e.g. what fraction should be test and training, what should be the value
of k in k-fold cross-validation, etc.). This lack of deep understanding limits our ability of utilizing
the models most effectively and, maybe more importantly, trusting the models to perform well in a
particular application; this is the single most important complaint from users of Machine Learning
and Datamining techniques.
Recently, a novel methodology was proposed in Dhurandhar and Dobra (2006) to study the
behavior of models and model selection measures. Since the methodology is at the core of the
current work, we briefly describe it together with the motivation for using this type of analysis for
classification in general and decision trees in particular.
1.1 What is the methodology?
The methodology for studying classification models consists in studying the behavior of the first two
central moments of the GE of the classification algorithm studied. The moments are taken over the
space of all possible classifiers produced by the classification algorithm, by training it over all possible
datasets sampled i.i.d. from some distribution. The first two moments give enough information
about the statistical behavior of the classification algorithm to allow interesting observations about
the behavior/trends of the classification algorithm w.r.t. any chosen data distribution.
1.2 Why have such a methodology?
The answers to the following questions shed light on why the methodology is necessary if tight
statistical characterization is to be provided for classification algorithms.
1. Why study GE? The biggest danger of learning is overfitting the training data. The main
idea in using GE as a measure of success of learning, instead of the empirical error on a given
dataset, is to provide a mechanism to avoid this pitfall. Implicitly, by analyzing GE all the
input is considered.
2. Why study the moments instead of the distribution of GE? Ideally, we would study the
distribution of GE instead of moments in order to get a complete picture of its behavior.
Studying the distribution of discrete random variables, except for very simple cases, turns out
to be very hard. The difficulty comes from the fact that even computing the pdf in a single
point is intractable since all combinations of random choices that result in the same value for
GE have to be enumerated. On the other hand, the first two central moments coupled with
distribution independent bounds such as Chebyshev and Chernoff give guarantees about the
worst possible behavior that are not too far from the actual behavior (small constant factor).
Interestingly, it is possible to compute the moments of a random variable like GE without ever
explicitly writing or making use of the formula for the pdf. What makes such an endeavor
possible is extensive use of the linearity of expectation as explained in Dhurandhar and Dobra
(2006).
3. Why characterize a class of classifiers instead of a single classifier? While the use of GE as the
success measure is standard practice in Machine Learning, characterizing classes of classifiers
instead of the particular classifier produced on a given dataset is not. From the point of
view of the analysis, without large testing datasets it is not possible to evaluate directly GE
for a particular classifier. By considering classes of classifiers to which a classifier belongs,
an indirect characterization is obtained for the particular classifier. This is precisely what
Statistical Learning Theory (SLT) does; there the class of classifiers consists in all classifiers
with the same VC dimension. The main problem with SLT results is that classes based on VC
dimension are too large, thus results tend to be pessimistic. In the methodology in Dhurandhar
and Dobra (2006), the class of classifiers consists only of the classifiers that are produced by
the given classification algorithm from datasets of fixed size from the underlying distribution.
This is probabilistically the smallest class in which the particular classifier produced on a given
dataset can be placed.
1.3 How do we implement the methodology?
One way of approximately estimating the moments of GE over all possible classifiers for a particular
classification algorithm is by directly using Monte Carlo. If we use Monte Carlo directly, we first
need to produce a classifier on a sampled dataset then test on a number of test sets sampled from the
same distribution, acquiring an estimate of the GE of this classifier. Repeating this entire procedure
a number of times, we would acquire estimates of GE for different classifiers. Then by averaging the
error of these multiple classifiers we would get an estimate of the first moment of GE. The variance
of GE can also be similarly estimated.
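The direct Monte Carlo procedure described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the toy joint distribution, the trivial majority-vote "tree" over a single attribute, and the parameter values (N, number of runs) are all assumptions chosen for the sketch.

```python
# Hypothetical sketch of direct Monte Carlo estimation of the first two
# moments of GE: train a classifier on each sampled dataset, evaluate its GE,
# and average over the resulting classifiers.
import random
from statistics import mean, variance

random.seed(0)

# A toy joint distribution over one binary attribute X and two classes:
# P[X = x] and P[Y = c | X = x] are chosen arbitrarily for illustration.
P_X = {0: 0.6, 1: 0.4}
P_Y_given_X = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}

def sample_point():
    x = 0 if random.random() < P_X[0] else 1
    y = 0 if random.random() < P_Y_given_X[x][0] else 1
    return x, y

def train(sample):
    """A trivial 'decision tree': predict the majority class in each cell."""
    counts = {0: {0: 0, 1: 0}, 1: {0: 0, 1: 0}}
    for x, y in sample:
        counts[x][y] += 1
    return {x: max(counts[x], key=counts[x].get) for x in counts}

def true_ge(clf):
    # GE = sum_x P[X=x] * P[Y != clf(x) | X=x]; computable exactly here
    # because the toy distribution is known.
    return sum(P_X[x] * (1.0 - P_Y_given_X[x][clf[x]]) for x in P_X)

N, runs = 50, 2000
ges = []
for _ in range(runs):
    clf = train([sample_point() for _ in range(N)])
    ges.append(true_ge(clf))

first_moment = mean(ges)        # estimate of E_{Z(N)}[GE]
second_central = variance(ges)  # estimate of Var_{Z(N)}[GE]
```

In practice the GE of each trained classifier would itself be estimated on sampled test sets rather than computed exactly, which adds a second layer of Monte Carlo error.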
Another way of estimating the moments of GE, is by obtaining parametric expressions for them. If
this can be accomplished the moments can be computed exactly. Moreover, by dexterously observing
the manner in which expressions are derived for a particular classification algorithm, insights can
be gained into analyzing other algorithms of interest. Though deriving the expressions may be a
tedious task, using them we obtain highly accurate estimates of the moments. In this paper, we
propose this second alternative for analyzing a subclass of decision trees. The key to the analysis
is focusing on the learning phase of the algorithm. In cases where the parametric expressions are
computationally intensive to compute directly, we show that by approximating individual terms using
Monte Carlo we obtain more accurate estimates of the moments than by directly using Monte
Carlo (first alternative) for the same computational cost.
If the moments are to be studied on synthetic data then the distribution is anyway assumed and
the parametric expressions can be directly used. If we have real data an empirical distribution can
be built on the dataset and then the parametric expressions can be used.
1.4 Applications of the methodology
It is important to note that the methodology is not aimed towards providing a way of estimating
bounds for GE of a classifier on a given dataset. The primary goal is creating an avenue in which
learning algorithms can be studied precisely i.e. studying the statistical behavior of a particular
algorithm w.r.t. a chosen/built distribution. Below, we discuss the two most important perspectives
in which the methodology can be applied.
1.4.1 ALGORITHMIC PERSPECTIVE
If a researcher/practitioner designs a new classification algorithm, he/she needs to validate it. Standard
practice is to validate the algorithm on a relatively small (5-20) number of datasets and to
report the performance. By observing the behavior of only a few instances of the algorithm the
designer infers its quality. Moreover, if the algorithm underperforms on some datasets, it can be
sometimes difficult to pinpoint the precise reason for its failure. If instead he/she is able to derive
parametric expressions for the moments of GE, the test results would be more relevant to the
particular classification algorithm, since the moments are over all possible datasets of a particular size
drawn i.i.d. from some chosen/built distribution. Testing individually on all these datasets is an
impossible task. Thus, by computing the moments using the parametric expressions the algorithm
would be tested on a plethora of datasets with the results being highly accurate. Moreover, since
the testing is done in a controlled environment i.e. all the parameters are known to the designer
while testing, he/she can precisely pinpoint the conditions under which the algorithm performs well
and the conditions under which it underperforms.
1.4.2 DATASET PERSPECTIVE
If an algorithm designer validates his/her algorithm by computing moments as mentioned earlier, it
can instill greater confidence in the practitioner searching for an appropriate algorithm for his/her
dataset. The reason is that if the practitioner has a dataset with a similar structure, or from a similar
source, as the test dataset on which an empirical distribution was built and favourable results were
reported by the designer, then this would mean that the results apply not only to
that particular test dataset, but to other datasets of a similar type, and since the practitioner's dataset
belongs to this similar collection, the results would also apply to it. Note that a distribution is just
a weighting of different datasets and this perspective is used in the above exposition.
1.5 Specific Contributions
In this paper we develop a characterization for a subclass of decision trees. In particular, we
characterize Random decision trees, an interesting variant, with respect to three popular stopping
criteria, namely fixed height, purity and scarcity (i.e. fewer than some threshold number of points
in a portion of the tree). The analysis directly applies to categorical as well as continuous attributes
with split points predetermined for each attribute. Moreover, the analysis in Section 2.3 is
applicable even to other deterministic attribute selection methods based on information gain, Gini gain,
etc. These and other extensions of the analysis to continuous attributes with dynamically chosen
split points are discussed in Section 4. In the experiments that follow the theory, we compare the
accuracy of the derived expressions with direct Monte Carlo on synthetic distributions as well as on
distributions built on real data. Notice that using the expressions the moments can be computed
without explicitly building the tree. We also extend the relationships between the moments of GE
and the moments of cross-validation error (CE), leave-one-out error (LE) and hold-out-set error (HE)
given in Dhurandhar and Dobra (2006), which were applicable only to deterministic classification
algorithms, to be applicable to randomized classification algorithms.
2. Computing Moments
In this section we first provide the necessary technical groundwork, followed by customization of
the expressions for decision trees. We now introduce some notation that is used primarily in this
section. X is a random vector modeling the input, whose domain is denoted by 𝒳. Y is a random
variable modeling the output, whose domain is denoted by 𝒴 (the set of class labels). Y(x) is a
random variable modeling the output for input x. ζ represents a particular classifier, with its GE
denoted by GE(ζ). Z(N) denotes a set of classifiers obtained by application of a classification
algorithm to different samples of size N.
2.1 Technical Framework
The basic idea in the generic characterization of the moments of GE as given in Dhurandhar and
Dobra (2006), is to define a class of classifiers induced by a classification algorithm and an i.i.d.
sample of a particular size from an underlying distribution. Each classifier in this class and its GE
act as random variables, since the process of obtaining the sample is randomized. Since GE(ζ) is a
random variable, it has a distribution. Quite often though, characterizing a finite subset of moments
turns out to be a more viable option than characterizing the entire distribution. Based on these
facts, we revisit the expressions for the first two moments around zero of the GE of a classifier,
E_{Z(N)}[GE(ζ)] = Σ_{x∈𝒳} P[X = x] Σ_{y∈𝒴} P_{Z(N)}[ζ(x) = y] P[Y(x) ≠ y]    (1)

E_{Z(N)×Z(N)}[GE(ζ)GE(ζ′)] =
Σ_{x∈𝒳} Σ_{x′∈𝒳} P[X = x] P[X = x′]
Σ_{y∈𝒴} Σ_{y′∈𝒴} P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ′(x′) = y′] P[Y(x) ≠ y] P[Y(x′) ≠ y′]    (2)
Figure 1: Contingency table with 2 attributes, each having 2 values, and 2 classes.

Figure 2: The all attribute tree with 3 attributes A1, A2, A3, each having 2 values.
From the above equations we observe that for the first moment we have to characterize the behavior
of the classifier on each input separately while for the second moment we need to observe its behavior
on pairs of inputs. In particular, to derive expressions for the moments of any classification algorithm
we need to characterize P_{Z(N)}[ζ(x) = y] for the first moment and P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ′(x′) = y′]
for the second moment. The values for the other terms denote the error of the classifier for the
first moment and errors of two classifiers for the second moment which are obtained directly from
the underlying joint distribution. For example, suppose we have data with a class prior p for class 1
and 1 − p for class 2. Then the error of a classifier classifying all data into class 1 is 1 − p and the
error of a classifier classifying all data into class 2 is p. We now focus our attention on relating the
above two probabilities, to probabilities that can be computed using the joint distribution and the
classification model viz. Decision Trees.
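Once P_{Z(N)}[ζ(x) = y] is available for every input and label, Equation (1) reduces to a finite sum over the joint distribution. The following is a minimal sketch of that sum; all probability tables are invented for illustration and do not come from any particular classification algorithm.

```python
# Sketch of Equation (1): E_{Z(N)}[GE] as a finite sum of
# P[X=x] * P_{Z(N)}[zeta(x)=y] * P[Y(x) != y] over inputs x and labels y.

# P[X = x] over a two-cell input space, and the conditional label
# distribution P[Y(x) = y] (both invented).
p_x = {"cell0": 0.5, "cell1": 0.5}
p_y_given_x = {"cell0": {0: 0.9, 1: 0.1}, "cell1": {0: 0.2, 1: 0.8}}

# Assumed classifier-ensemble probabilities P_{Z(N)}[zeta(x) = y].
p_pred = {"cell0": {0: 0.85, 1: 0.15}, "cell1": {0: 0.25, 1: 0.75}}

first_moment = sum(
    p_x[x] * p_pred[x][y] * (1.0 - p_y_given_x[x][y])
    for x in p_x
    for y in (0, 1)
)
```

The second moment has exactly the same shape, with a double sum over pairs (x, x′) and the joint probability P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ′(x′) = y′] replacing P_{Z(N)}[ζ(x) = y].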
In the subsections that follow we assume the following setup. We consider the dimensionality
of the input space to be d. A1, A2, ..., Ad are the corresponding discrete attributes or continuous
attributes with predetermined split points. a1, a2, ..., ad are the number of attribute values/the
number of splits of the attributes A1, A2, ..., Ad respectively. m_{ij} is the i-th attribute value/split
of the j-th attribute, where i ≤ a_j and j ≤ d. Let C1, C2, ..., Ck be the class labels representing k
classes and N the sample size.
2.2 All Attribute Decision Trees (ATT)
Let us consider a decision tree algorithm whose only stopping criterion is that no attributes remain
when building any part of the tree. In other words, every path in the tree from root to leaf has all
the attributes. An example of such a tree is shown in Figure 2. It can be seen that irrespective of
the split attribute selection method (e.g. information gain, gini gain, randomised selection, etc.) the
above stopping criteria yields trees with the same leaf nodes. Thus although a particular path in one
tree has an ordering of attributes that might be different from a corresponding path in other trees,
the leaf nodes will represent the same region in space or the same set of datapoints. This is seen in
Figure 3. Moreover, since predictions are made using data in the leaf nodes, any deterministic way
of prediction would lead to these trees resulting in the same classifier for a given sample and thus
having the same GE. Usually, prediction in the leaves is performed by choosing the most numerous
class as the class label for the corresponding datapoint. With this we arrive at the expressions for
Figure 3: Given 3 attributes A1, A2, A3, the path m_{11}m_{12}m_{13} is formed irrespective of the
ordering of the attributes. Three such permutations (a, b, c) are shown in the above figure.
computing the aforementioned probabilities,
P_{Z(N)}[ζ(x) = Ci] =
P_{Z(N)}[ct(m_{p1}m_{q2}...m_{rd}Ci) > ct(m_{p1}m_{q2}...m_{rd}Cj),
∀j ≠ i, i, j ∈ [1, ..., k]]

where x = m_{p1}m_{q2}...m_{rd} represents a datapoint, which is also a path from root to leaf in the tree.
ct(m_{p1}m_{q2}...m_{rd}Ci) is the count of the datapoints in the cell m_{p1}m_{q2}...m_{rd}Ci. For example,
in Figure 1 x1y1C1 represents a cell. Henceforth, when using the word "path" we will strictly imply a
path from root to leaf. By computing the above probability ∀i and ∀x we can compute the first
moment of the GE for this classification algorithm.
Similarly, for the second moment we compute cumulative joint probabilities of the following form:
P_{Z(N)×Z(N)}[ζ(x) = Ci ∧ ζ′(x′) = Cv] =
P_{Z(N)×Z(N)}[ct(m_{p1}...m_{rd}Ci) > ct(m_{p1}...m_{rd}Cj),
ct(m_{f1}...m_{hd}Cv) > ct(m_{f1}...m_{hd}Cw),
∀j ≠ i, ∀w ≠ v, i, j, v, w ∈ [1, ..., k]]

where the terms have similar connotation as before. These probabilities can be computed exactly
or by using the fast approximation techniques proposed in Dhurandhar and Dobra (2006).
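A probability of the above form can also be approximated by sampling, as the following sketch illustrates for two classes. The cell probability masses, sample size, and trial count are assumptions, and ties (equal counts) are counted against class C1, an arbitrary choice.

```python
# Sketch: estimate P_{Z(N)}[ct(cell C1) > ct(cell C2)] by repeatedly sampling
# N i.i.d. points and counting how often class C1 dominates C2 in the cell.
import random

random.seed(1)

# Probability mass of the cell under each class; the rest of the mass falls
# outside the cell (all values invented).
p_c1, p_c2 = 0.05, 0.03
N, trials = 200, 5000

wins = 0
for _ in range(trials):
    ct1 = ct2 = 0
    for _ in range(N):
        u = random.random()
        if u < p_c1:
            ct1 += 1
        elif u < p_c1 + p_c2:
            ct2 += 1
    if ct1 > ct2:   # strict inequality, matching the expression above
        wins += 1

prob_estimate = wins / trials
```

Since the cell has more mass under C1 than C2 here, the estimate comes out well above one half; the exact value could instead be computed from the multinomial distribution directly.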
2.3 Decision Trees with Nontrivial Stopping Criteria
We just considered decision trees which are grown until all attributes are exhausted. In real life
though we seldom build such trees. The main reasons for this could be any of the following: we wish
to build small decision trees to save space; certain path counts (i.e. number of datapoints in the
leaves) are extremely low and hence we want to avoid splitting further, as the predictions can get
arbitrarily bad; we have split on a certain subset of attributes and all the datapoints in that path
belong to the same class (purity based criteria); we want to grow trees to a fixed height (or depth).
These stopping measures would lead to paths in the tree that contain a subset of the entire set of
attributes. Thus from a classification point of view we cannot simply compare the counts in two
cells as we did previously. The reason is that the corresponding path may not be present
in the tree. Hence, we need to check that the path exists and then compare cell counts. Given the
classification algorithm, since P_{Z(N)}[ζ(x) = Ci] is the probability of all possible ways in which
an input x can be classified into class Ci, for a decision tree it equates to finding the following kind
of probability for the first moment,
P_{Z(N)}[ζ(x) = Ci] =
Σ_p P_{Z(N)}[ct(path_p Ci) > ct(path_p Cj), path_p exists,    (3)
∀j ≠ i, i, j ∈ [1, ..., k]]

where p indexes all paths allowed by the tree algorithm in classifying input x. Each term in the
summation on the right hand side is the probability that the cell path_p Ci has the greatest count,
with the path path_p being present in the tree. This will become clearer when we discuss different
stopping criteria. Notice that the characterization for the ATT is just a special case of this more
generic characterization.
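The index p in the summation above can be made concrete for the fixed-height criterion discussed later: each allowed path corresponds to a choice of h of the d attributes, with the input x supplying the value taken on each chosen attribute. A small sketch, with d, h and x invented for illustration:

```python
# Sketch: enumerate the candidate paths "path_p" that can classify a given
# input x in a fixed-height tree; each path is one h-subset of the attributes
# together with x's values on them.
from itertools import combinations

d, h = 4, 2
x = (1, 0, 1, 1)  # invented input: x's value on each of the d attributes

# Each path is the tuple of (attribute index, attribute value) pairs along it.
paths = [tuple((a, x[a]) for a in attrs) for attrs in combinations(range(d), h)]
n_paths = len(paths)  # C(4, 2) = 6 candidate paths for this x
```

For stopping criteria such as purity or scarcity, the same enumeration runs over all subset sizes from 1 to d, and each candidate additionally carries the "path exists" condition from Equation (3).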
The probability that we need to find for the second moment is,
P_{Z(N)×Z(N)}[ζ(x) = Ci ∧ ζ′(x′) = Cv] =
Σ_{p,q} P_{Z(N)×Z(N)}[ct(path_p Ci) > ct(path_p Cj), path_p exists,    (4)
ct(path_q Cv) > ct(path_q Cw), path_q exists,
∀j ≠ i, ∀w ≠ v, i, j, v, w ∈ [1, ..., k]]
where p and q index all allowed paths by the tree algorithm in classifying input x and x' respectively.
The above two equations are generic in analyzing any decision tree algorithm which classifies inputs
into the most numerous class in the corresponding leaf. It is not difficult to generalize them further
when the decision in the leaves is based on some measure other than majority. In that case we would
just include that measure in the probability in place of the inequality.
2.3.1 CHARACTERIZING path exists FOR THREE STOPPING CRITERIA
It follows from above that to compute the moments of the GE for a decision tree algorithm we need to
characterize conditions under which particular paths are present. This characterization depends on
the stopping criteria and split attribute selection method in a decision tree algorithm. We now look
at three popular stopping criteria, namely a) Fixed height based, b) Purity (i.e. entropy = 0 or Gini
index = 0, etc.) based and c) Scarcity (i.e. too few datapoints) based. We consider conditions under
which certain paths are present for each stopping criteria. Similar conditions can be enumerated for
any reasonable stopping criteria. We then choose a split attribute selection method, thereby fully
characterizing the above two probabilities and hence the moments.
1. Fixed Height: This stopping criterion requires that every path in the tree be of length
exactly h, where h ∈ [1, ..., d]. If h = 1 we classify based on just one attribute. If h = d
then we have the all attribute tree.
In general, a path m_{i1}m_{j2}...m_{lh} is present in the tree iff the attributes A1, A2, ..., Ah are
chosen in some order to form the path during the split attribute selection phase of tree
construction. Thus, a path of length h is present iff the corresponding attributes are chosen.
2. Purity: This stopping criterion implies that we stop growing the tree from a particular split
of a particular attribute if all datapoints lying in that split belong to the same class. We call
such a path pure, else we call it impure. In this scenario, we could have paths of length 1 to d
depending on when we encounter purity (assuming all datapoints don't lie in one class). Thus,
we have the following two separate checks for paths of length d and less than d respectively.
a) Path m_{i1}m_{j2}...m_{ld} is present iff the path m_{i1}m_{j2}...m_{s(d−1)} is impure and attributes
A1, A2, ..., A_{d−1} are chosen above A_d, or m_{i1}m_{j2}...m_{s(d−2)}m_{ld} is impure and attributes
A1, A2, ..., A_{d−2}, A_d are chosen above A_{d−1}, or ... or m_{j2}...m_{ld} is impure and attributes
A2, ..., Ad are chosen above A1.
This means that if a certain set of d − 1 attributes is present in a path in the tree, then we
split on the d-th attribute iff the current path is not pure, finally resulting in a path of length
d.
b) Path m_{i1}...m_{lh} is present, where h < d, iff the path m_{i1}...m_{lh} is pure and attributes
A1, A2, ..., A_{h−1} are chosen above A_h and m_{i1}...m_{l(h−1)} is impure, or the path m_{i1}...m_{lh}
is pure and attributes A1, A2, ..., A_{h−2}, A_h are chosen above A_{h−1} and m_{i1}m_{j2}...m_{l(h−2)}m_{lh}
is impure, or ... or the path m_{i1}...m_{lh} is pure and attributes A2, ..., Ah are chosen above A1
and m_{j2}...m_{lh} is impure.
This means that if a certain set of h − 1 attributes is present in a path in the tree, then we
split on some h-th attribute iff the current path is not pure and the resulting path is pure.
The above conditions suffice for "path exists" since the purity property is anti-monotone and
the impurity property is monotone.
3. Scarcity: This stopping criterion implies that we stop growing the tree from a particular split
of a certain attribute if its count is less than or equal to some prespecified pruning bound.
Let us denote this number by pb. As before, we have the following two separate checks for
paths of length d and less than d respectively.
a) Path m_{i1}m_{j2}...m_{ld} is present iff the attributes A1, ..., A_{d−1} are chosen above A_d and
ct(m_{i1}m_{j2}...m_{l(d−1)}) > pb, or the attributes A1, ..., A_{d−2}, A_d are chosen above A_{d−1}
and ct(m_{i1}...m_{s(d−2)}m_{ld}) > pb, or ... or the attributes A2, ..., Ad are chosen above A1 and
ct(m_{j2}...m_{ld}) > pb.
b) Path m_{i1}...m_{lh} is present, where h < d, iff the attributes A1, ..., A_{h−1} are chosen above
A_h and ct(m_{i1}...m_{l(h−1)}) > pb and ct(m_{i1}...m_{lh}) ≤ pb, or the attributes A1, ..., A_{h−2}, A_h
are chosen above A_{h−1} and ct(m_{i1}m_{j2}...m_{l(h−2)}m_{lh}) > pb and ct(m_{i1}...m_{lh}) ≤ pb, or
... or the attributes A2, ..., Ah are chosen above A1 and ct(m_{j2}...m_{lh}) > pb and
ct(m_{i1}...m_{lh}) ≤ pb.
This means that we stop growing the tree under a node once we find that the next chosen
attribute produces a path with occupancy ≤ pb.
The above conditions suffice for "path exists" since the occupancy property is monotone.
We observe from the above checks that there are two types of conditions that need to be evaluated
for a path to be present, namely: i) those that depend on the sample, viz. m_{i1}m_{j2}...m_{l(d−1)} is
impure or ct(m_{i1}...m_{lh}) > pb, and ii) those that depend on the split attribute selection method,
viz. A1, A2, ..., Ah are chosen. The former depend on the data distribution, which we have specified
to be a multinomial. The latter we discuss in the next subsection. Note that checks for a combination
of the above stopping criteria can be obtained by appropriately combining the individual checks.
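The two sample-dependent checks above, purity of a path and its occupancy relative to the pruning bound pb, can be sketched as simple predicates over an i.i.d. sample. The dataset encoding below (a list of (attribute-value tuple, class) pairs) and the example values are assumptions made for illustration.

```python
# Sketch of the sample-dependent stopping-criteria conditions: purity of a
# (partial) path and its occupancy compared against the pruning bound pb.

def path_points(sample, path):
    """Datapoints consistent with a partial path {attribute_index: value}."""
    return [(x, y) for x, y in sample
            if all(x[a] == v for a, v in path.items())]

def is_pure(sample, path):
    # A path is pure when all datapoints falling in it share one class.
    classes = {y for _, y in path_points(sample, path)}
    return len(classes) <= 1

def occupancy_ok(sample, path, pb):
    # The scarcity check: strictly more than pb datapoints occupy the path.
    return len(path_points(sample, path)) > pb

# An invented 2-attribute sample with class labels C1/C2.
sample = [((0, 0), "C1"), ((0, 1), "C1"), ((1, 0), "C2"), ((1, 1), "C1")]
path = {0: 0}   # split on attribute A1 with value 0

pure = is_pure(sample, path)            # both points in this split are C1
enough = occupancy_ok(sample, path, 1)  # 2 points > pb = 1
```

In the moment computations these predicates are not evaluated on a single sample; rather, the probability that they hold is taken over the multinomial distribution of cell counts.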
2.4 Split Attribute Selection
In decision tree construction algorithms, at each iteration we have to decide the attribute variable on
which the data should be split. Numerous measures have been developed Hall and Holmes (2003).
Some of the most popular ones aim to increase the purity of a set of datapoints that lie in the
region formed by that split. The purer the region, the better the prediction and lower the error
of the classifier. Measures such as i) Information Gain (IG) Quinlan (1986), ii) Gini Gain (GG)
Breiman et al. (1984), iii) Gain Ratio (GR) Quinlan (1986), and iv) the Chi-square test (CS) Shao (2003)
aim at realising this intuition. Other measures using Principal Component Analysis Smith
(2002) and correlation-based measures Hall (1998) have also been developed. Another interesting yet
non-intuitive measure, in terms of its utility, is the Random attribute selection measure. According
to this measure we randomly choose the split attribute from the available set. The decision tree that
this algorithm produces is called a Random decision tree (RDT). Surprisingly enough, a collection of
RDTs quite often outperforms its seemingly more powerful counterparts Liu et al. (2005). In this
paper we study this interesting variant. We do this by first presenting a probabilistic characterization
in selecting a particular attribute/set of attributes, followed by simulation studies. Characterizations
for the other measures can be developed in similar vein by focusing on the working of each measure.
As an example, for the deterministic purity based measures mentioned above the split attribute
selection is just a function of the sample and thus by appropriately conditioning on the sample we
can find the relevant probabilities and hence the moments.
Before presenting the expression for the probability of selecting a split attribute/attributes in
constructing an RDT, we extend the results in Dhurandhar and Dobra (2006), where relationships
were drawn between the moments of HE, CE, LE (just a special case of cross-validation) and GE,
to be applicable to randomized classification algorithms. The random process is assumed to be
independent of the sampling process. This result is required since the results in Dhurandhar and
Dobra (2006) are applicable to deterministic classification algorithms and we will be analyzing
RDTs. With this we have the following lemma.
Lemma 1 Let D and T be independent discrete random variables, with some distribution defined
on each of them. Let 𝒟 and 𝒯 denote the domains of the random variables. Let f(d,t) and g(d,t)
be two functions such that ∀t ∈ 𝒯, E_D[f(d,t)] = E_D[g(d,t)], where d ∈ 𝒟. Then, E_{T×D}[f(d,t)] =
E_{T×D}[g(d,t)].
Proof
E_{T×D}[f(d,t)] = Σ_{t∈𝒯} Σ_{d∈𝒟} f(d,t) P[T = t, D = d]
= Σ_{t∈𝒯} Σ_{d∈𝒟} f(d,t) P[D = d] P[T = t]
= Σ_{t∈𝒯} E_D[f(d,t)] P[T = t]
= Σ_{t∈𝒯} E_D[g(d,t)] P[T = t]
= E_{T×D}[g(d,t)]
The result is valid even when D and T are continuous, but considering the scope of this paper
we are mainly interested in the discrete case. This result implies that all the relationships and
expressions in Dhurandhar and Dobra (2006) hold, with an extra expectation over the t's, for
randomized classification algorithms where the random process is independent of the sampling process.
In equations 1 and 2 the expectations w.r.t. Z(N) become expectations w.r.t. Z(N, t).
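Lemma 1 can be checked numerically on small finite domains. In the sketch below (all distributions and function values invented), f and g differ pointwise, yet their expectations conditioned on each t agree, so their expectations over the product space must coincide.

```python
# Numeric check of Lemma 1: T models the independent random process, D the
# sample. If E_D[f(d,t)] = E_D[g(d,t)] for every fixed t, then the
# expectations over T x D agree.
from fractions import Fraction as F

P_T = {"t1": F(1, 3), "t2": F(2, 3)}   # distribution of the random process
P_D = {"d1": F(1, 2), "d2": F(1, 2)}   # distribution of the sample

f = {("d1", "t1"): F(1), ("d2", "t1"): F(3),
     ("d1", "t2"): F(0), ("d2", "t2"): F(4)}
# g differs pointwise from f but has the same E_D[.] for each t (= 2).
g = {("d1", "t1"): F(2), ("d2", "t1"): F(2),
     ("d1", "t2"): F(1), ("d2", "t2"): F(3)}

def exp_over_product(h):
    # E_{T x D}[h] using independence: P[T=t, D=d] = P[T=t] P[D=d].
    return sum(P_T[t] * P_D[d] * h[(d, t)] for t in P_T for d in P_D)

lhs = exp_over_product(f)
rhs = exp_over_product(g)
```

Exact rational arithmetic is used so that the equality is checked exactly rather than up to floating-point error.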
2.5 Random Decision Trees
In this subsection we explain the randomized process used for split attribute selection and provide
the expression for the probability of choosing an attribute/a set of attributes. The attribute selection
method we use is as follows. We assume a uniform probability distribution in selecting the attribute
variables i.e. attributes which have already not been chosen in a particular branch, have an equal
chance of being chosen for the next level. The random process involved in attribute selection is
independent of the sample and hence Lemma 1 applies. We now give the expression for the
probability of selecting a subset of attributes from the given set for a path. This expression is
required in the computation of the above mentioned probabilities used in computing the moments.
For the first moment we need to find the following probability. Given d attributes A1, A2, ..., Ad,
the probability of choosing a set of h attributes, where h ∈ {1, 2, ..., d}, is

P[h attributes chosen] = 1/dCh

where dCh = d!/(h!(d − h)!), since choosing without replacement is equivalent to simultaneously
choosing a subset of attributes from the given set.
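The uniform selection probability 1/dCh can be confirmed by a quick simulation of the sequential without-replacement draws; the values of d, h, the trial count, and the deviation tolerance are arbitrary choices for the sketch.

```python
# Simulation check: drawing h of d attributes one at a time without
# replacement hits each h-subset with probability 1/C(d, h).
import random
from itertools import combinations
from math import comb

random.seed(2)
d, h, trials = 5, 2, 20000

counts = {frozenset(c): 0 for c in combinations(range(d), h)}
for _ in range(trials):
    chosen = frozenset(random.sample(range(d), h))  # sequential uniform draws
    counts[chosen] += 1

target = 1.0 / comb(d, h)  # = 1/10 for d = 5, h = 2
max_dev = max(abs(c / trials - target) for c in counts.values())
```

With 20000 trials the empirical frequency of every one of the 10 subsets lands close to 0.1, illustrating the equivalence between sequential selection and choosing the subset simultaneously.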
For the second moment when the trees are different (required in finding the variance of CE, since
the training sets in the various runs of cross-validation are different, i.e. for finding
E_{Z(N)×Z(N)}[GE(ζ)GE(ζ′)]), the probability of choosing l1 attributes for a path in one tree and l2
attributes for a path in another tree, where l1, l2 ≤ d, is given by

P[l1 attribute path in tree 1, l2 attribute path in tree 2] = 1/(dCl1 · dCl2)

since the process of choosing one set of attributes for a path in one tree is independent of the process
of choosing another set of attributes for a path in a different tree.
For the second moment when the tree is the same (required in finding the variance of GE and
HE, i.e. for finding E_{Z(N)×Z(N)}[GE(ζ)²]), the probability of choosing two sets of attributes such
that the two distinct paths resulting from them coexist in a single tree is given by the following.
Assume we have d attributes A1, A2, ..., Ad. Let the lengths of the two paths (or the cardinalities of
the two sets) be l1 and l2 respectively, where l1, l2 ≤ d. Without loss of generality assume l1 ≤ l2. Let
p be the number of attributes common to both paths. Notice that p ≥ 1 is one of the necessary
conditions for the two paths to coexist. Let v ≤ p be the number of those attributes among the total
p that have the same values in both paths. Thus p − v attributes are common to both paths but have
different values. At one of these attributes in a given tree the two paths will bifurcate. The probability
that the two paths coexist given our randomized attribute selection method is computed by finding out
all possible ways in which the two paths can coexist in a tree and then multiplying the number of
each kind of way by the probability of having that way. A detailed proof is given in the appendix.
The expression for the probability based on the attribute selection method is,
P[11 and 12 length paths co exist] =
vPr (l i 1)!(12 i 1)!(p v)probi
i=0
where vPri = !, denotes permutation and probi = d(d1)...(d i) )2(d i1) 1.2 (d l1)(d 121)
is the probability of the ith possible way. For fixed height trees of height h, (11 i 1)!(12 i 1)!
becomes (h i 1)!2 and prob, d1) (di(d 1)2 i (d
d(d 1)...( d i)2(d 1)2...(dh
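To make the sum concrete, it can be evaluated exactly with rational arithmetic. The helper below is our illustration (the names and the tiny hand-checkable cases are ours): it mirrors the expression term by term, with the ith term corresponding to a bifurcation at depth i + 1.

```python
from fractions import Fraction
from math import factorial, perm

def coexist_prob(d, l1, l2, p, v):
    """P[an l1-path and an l2-path coexist in one random tree], with l1 <= l2,
    p common attributes, and v common attributes sharing values."""
    total = Fraction(0)
    for i in range(v + 1):
        # number of tree shapes whose bifurcation is at depth i + 1
        ways = perm(v, i) * factorial(l1 - i - 1) * factorial(l2 - i - 1) * (p - v)
        denom = 1
        for depth in range(i + 1):        # shared prefix: 1/d ... 1/(d - i)
            denom *= d - depth
        for depth in range(i + 1, l1):    # both paths in parallel: squared factors
            denom *= (d - depth) ** 2
        for depth in range(l1, l2):       # tail of the longer path
            denom *= d - depth
        total += Fraction(ways, denom)
    return total

# two length-1 paths on the same attribute with different values coexist
# iff that attribute is the root, i.e. with probability 1/d
assert coexist_prob(2, 1, 1, 1, 0) == Fraction(1, 2)
# with d = 2, two length-2 paths sharing one attribute value always coexist
assert coexist_prob(2, 2, 2, 2, 1) == 1
```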
2.6 Putting things together
We now have all the ingredients that are required for the computation of the moments of GE. In this subsection we combine the results derived in the previous subsections to obtain expressions for P_Z(N)[ζ(x) = C_i] and P_Z(N)×Z(N)[ζ(x) = C_i ∧ ζ′(x′) = C_v], which are vital in the computation of the moments.
Let s.c.c.s. be an abbreviation for stopping criteria conditions that are sample dependent. Similarly, let s.c.c.i. be an abbreviation for stopping criteria conditions that are sample independent, that is, conditions that depend only on the attribute selection method. We now provide expressions for the above probabilities categorized by the 3 stopping criteria.
2.6.1 FIXED HEIGHT
The conditions for path_p exists for fixed height trees depend only on the attribute selection method, as seen in subsection 2.3.1. Hence the probability used in finding the first moment is given by,

P_Z(N)[ζ(x) = C_i]
= Σ_p P_Z(N)[ct(path_p C_i) > ct(path_p C_j), path_p exists, ∀j ≠ i, i, j ∈ [1, ..., k]]
= Σ_p P_Z(N)[ct(path_p C_i) > ct(path_p C_j), s.c.c.i., ∀j ≠ i, i, j ∈ [1, ..., k]]
= Σ_p P_Z(N)[ct(path_p C_i) > ct(path_p C_j), ∀j ≠ i, i, j ∈ [1, ..., k]] P_Z(N)[s.c.c.i.]
= Σ_p P_Z(N)[ct(path_p C_i) > ct(path_p C_j), ∀j ≠ i, i, j ∈ [1, ..., k]] / dC_h        (5)

where dC_h = d!/(h!(d − h)!) and h is the length of the paths, i.e. the height of the tree. The probability in the last step of the above derivation can be computed from the underlying joint distribution. The probability for the second moment when the trees are different is given by,
Pz(N)xz(N) [((x)= Ci A ('(x') =C,]
= Pz(N)xZ(N) [ct(pathpCi) > ct(pathpCj),pathpexists, ct(pathqC,) > ct(pathqC, ),pathqexists,
p,q
Vj i, Vw v, i,j,v, we [1,..., k]]
= Pz(N)xZ(N)[ct(pathpC,) > ct(pathpCj),ct(pathqCv) > ct(pathC,), Vj 1 i, Vw I v, i,j, v, w E [1, ..., k]]
P,q
Pz(N)xZ(N) [s.c.c.i.]
SPz(N)xZ(N)[ct(pathpCi) > ct(pathpCj),ct(pathqC,) > ct(pathC,), Vj 1 i, Vw v i, ij,v,w [1, ...,k]]
p,q
(6)
where h is the length of the paths. The probability for the second moment when the trees are
identical is given by,
P_Z(N)×Z(N)[ζ(x) = C_i ∧ ζ(x′) = C_v]
= Σ_{p,q} P_Z(N)×Z(N)[ct(path_p C_i) > ct(path_p C_j), path_p exists, ct(path_q C_v) > ct(path_q C_w), path_q exists, ∀j ≠ i, ∀w ≠ v, i, j, v, w ∈ [1, ..., k]]
= Σ_{p,q} P_Z(N)×Z(N)[ct(path_p C_i) > ct(path_p C_j), ct(path_q C_v) > ct(path_q C_w), ∀j ≠ i, ∀w ≠ v, i, j, v, w ∈ [1, ..., k]] P_Z(N)×Z(N)[s.c.c.i.]
= Σ_{p,q} Σ_{t=0}^{b} bPr_t ((h − t − 1)!)² (r − b) prob_t P_Z(N)×Z(N)[ct(path_p C_i) > ct(path_p C_j), ct(path_q C_v) > ct(path_q C_w), ∀j ≠ i, ∀w ≠ v, i, j, v, w ∈ [1, ..., k]]        (7)

where r is the number of attributes that are common to the 2 paths, b is the number of attributes that have the same value in the 2 paths, h is the length of the paths and prob_t = 1/[d(d−1)⋯(d−t)(d−t−1)²⋯(d−h+1)²]. As before, the probability comparing counts can be computed from the underlying joint distribution.
2.6.2 PURITY AND SCARCITY
The conditions for path_p exists in the case of purity and scarcity depend on both the sample and the attribute selection method, as can be seen in 2.3.1. The probability used in finding the first moment is given by,

P_Z(N)[ζ(x) = C_i]
= Σ_p P_Z(N)[ct(path_p C_i) > ct(path_p C_j), path_p exists, ∀j ≠ i, i, j ∈ [1, ..., k]]
= Σ_p P_Z(N)[ct(path_p C_i) > ct(path_p C_j), s.c.c.i., s.c.c.s., ∀j ≠ i, i, j ∈ [1, ..., k]]
= Σ_p P_Z(N)[ct(path_p C_i) > ct(path_p C_j), s.c.c.s., ∀j ≠ i, i, j ∈ [1, ..., k]] P_Z(N)[s.c.c.i.]
= Σ_p P_Z(N)[ct(path_p C_i) > ct(path_p C_j), s.c.c.s., ∀j ≠ i, i, j ∈ [1, ..., k]] / (dC_{h_p} (d − h_p + 1))        (8)

where h_p is the length of the path indexed by p. The joint probability of comparing counts and s.c.c.s. can be computed from the underlying joint distribution. The probability for the second moment when the trees are different is given by,
P_Z(N)×Z(N)[ζ(x) = C_i ∧ ζ′(x′) = C_v]
= Σ_{p,q} P_Z(N)×Z(N)[ct(path_p C_i) > ct(path_p C_j), path_p exists, ct(path_q C_v) > ct(path_q C_w), path_q exists, ∀j ≠ i, ∀w ≠ v, i, j, v, w ∈ [1, ..., k]]
= Σ_{p,q} P_Z(N)×Z(N)[ct(path_p C_i) > ct(path_p C_j), ct(path_q C_v) > ct(path_q C_w), s.c.c.s., ∀j ≠ i, ∀w ≠ v, i, j, v, w ∈ [1, ..., k]] P_Z(N)×Z(N)[s.c.c.i.]
= Σ_{p,q} P_Z(N)×Z(N)[ct(path_p C_i) > ct(path_p C_j), ct(path_q C_v) > ct(path_q C_w), s.c.c.s., ∀j ≠ i, ∀w ≠ v, i, j, v, w ∈ [1, ..., k]] / (dC_{h_p} dC_{h_q} (d − h_p + 1)(d − h_q + 1))        (9)

where h_p and h_q are the lengths of the paths indexed by p and q. The probability for the second moment when the trees are identical is given by,
P_Z(N)×Z(N)[ζ(x) = C_i ∧ ζ(x′) = C_v]
= Σ_{p,q} P_Z(N)×Z(N)[ct(path_p C_i) > ct(path_p C_j), path_p exists, ct(path_q C_v) > ct(path_q C_w), path_q exists, ∀j ≠ i, ∀w ≠ v, i, j, v, w ∈ [1, ..., k]]
= Σ_{p,q} P_Z(N)×Z(N)[ct(path_p C_i) > ct(path_p C_j), ct(path_q C_v) > ct(path_q C_w), s.c.c.s., ∀j ≠ i, ∀w ≠ v, i, j, v, w ∈ [1, ..., k]] P_Z(N)×Z(N)[s.c.c.i.]
= Σ_{p,q} Σ_{t=0}^{b} [bPr_t (h_p − t − 2)! (h_q − t − 2)! (r − b) prob_t / ((d − h_p + 1)(d − h_q + 1))] P_Z(N)×Z(N)[ct(path_p C_i) > ct(path_p C_j), ct(path_q C_v) > ct(path_q C_w), s.c.c.s., ∀j ≠ i, ∀w ≠ v, i, j, v, w ∈ [1, ..., k]]        (10)

where r is the number of attributes that are common to the 2 paths excluding the attributes chosen as leaves, b is the number of attributes that have the same value, h_p and h_q are the lengths of the 2 paths and, without loss of generality assuming h_p ≤ h_q, prob_t = 1/[d(d−1)⋯(d−t)(d−t−1)²⋯(d−h_p+2)²(d−h_p+1)⋯(d−h_q+2)]. As before, the probability of comparing counts and s.c.c.s. can be computed from the underlying joint distribution.
Using the expressions for the above probabilities, the moments of GE can be computed. In the next section we perform experiments on synthetic data as well as on distributions built on real data to portray the efficacy of the derived expressions.
3. Experiments
To exactly compute the probabilities for each path, the time complexity for fixed height trees is O(N²) and for purity and scarcity based trees is O(N³). Hence, computing exactly the probabilities
and consequently the moments is practical for small values of N. For larger values of N, we propose
computing the individual probabilities using Monte Carlo (MC). In the empirical studies we report,
we show that the accuracy in estimating the error (i.e. the moments of GE) by using our expressions
with MC is always greater than by directly using MC for the same computational cost. In fact, the
accuracy of using the expressions is never worse than MC even when MC is executed for 10 times
the number of iterations as those of the expressions. The true error or the golden standard against
which we compare the accuracy of these estimators is obtained by running MC for a week, which is
around 200 times the number of iterations as those of the expressions.
Notation: In the experiments, AF refers to the estimates obtained by using the expressions in
conjunction with Monte Carlo. MCi refers to simple Monte Carlo being executed for i times the
number of iterations as those of the expressions. The term True Error or TE refers to the golden
standard against which we compare AF and MCi.
General Setup: We perform empirical studies on synthetic as well as real data. The experimen
tal setup for synthetic data is as follows: We fix N to 10000. The number of classes is fixed to two.
We observe the behavior of the error for the three kinds of trees with the number of attributes fixed
to d = 5 and each attribute having 2 attribute values. We then increase the number of attribute
values to 3, to observe the effect that increasing the number of split points has on the performance of
the estimators. We also increase the number of attributes to d = 8 to study the effect that increasing
the number of attributes has on the performance. With this we have a d + 1 dimensional contingency table whose d dimensions are the attributes and the (d + 1)th dimension represents the class labels. When each attribute has two values the total number of cells in the table is c = 2^(d+1) and with three values the total number of cells is c = 3^d × 2. If we fix the probability of observing a datapoint in
Figure 4: Fixed Height trees with d = 5, h = 3 and attributes with binary splits.
Figure 5: Fixed Height trees with d = 5, h = 3 and attributes with ternary splits.
Figure 6: Fixed Height trees with d = 8, h = 3 and attributes with binary splits.
Figure 7: Purity based trees with d = 5 and attributes with binary splits.
Figure 8: Purity based trees with d = 5 and attributes with ternary splits.
Figure 9: Purity based trees with d = 8 and attributes with binary splits.
Figure 10: Scarcity based trees with d = 5, pb = and attributes with binary splits.
Figure 11: Scarcity based trees with d = 5, pb = and attributes with ternary splits.
Figure 12: Scarcity based trees with d = 8, pb = and attributes with binary splits.
(Panels, left to right: Pima Indians, Balloon, Shuttle Landing Control datasets.)
Figure 13: Comparison between AF and MC on three UCI datasets for trees pruned based on fixed height (h = 3), purity and scarcity (pb = ).
cell i to be p_i such that Σ_i p_i = 1 and the sample size to N, the distribution that perfectly models this scenario is a multinomial distribution with parameters N and the set {p1, p2, ..., pc}. In fact, irrespective of the value of d and the number of attribute values for each attribute, the scenario can be modelled by a multinomial distribution. In the studies that follow, the p_i's are varied and the amount of dependence between the attributes and the class labels is computed for each set of p_i's using the Chi-square test (Connor-Linton, 2003). More precisely, we sum over all i the squares of the difference of each p_i with the product of its corresponding marginals, with each squared difference being divided by this product, i.e. correlation = Σ_i (p_i − p_im)²/p_im, where p_im is the product of the marginals for the ith cell. The behavior of the error for trees with the three aforementioned stopping criteria is seen for different correlation values and for a class prior of 0.5.
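The computation just described can be sketched as follows (our illustration; `correlation` is our name, and the table is assumed to hold joint cell probabilities with one axis per attribute plus a class axis):

```python
import numpy as np

def correlation(p):
    """Chi-square-style dependence of a joint cell-probability table p:
    sum_i (p_i - p_im)^2 / p_im, with p_im the product of 1-D marginals."""
    p = np.asarray(p, dtype=float)
    p_ind = np.ones_like(p)
    for axis in range(p.ndim):
        # marginal distribution along `axis`, broadcast back over the table
        marginal = p.sum(axis=tuple(a for a in range(p.ndim) if a != axis))
        shape = [1] * p.ndim
        shape[axis] = marginal.size
        p_ind = p_ind * marginal.reshape(shape)
    return float(((p - p_ind) ** 2 / p_ind).sum())

# a fully independent table (2 binary attributes plus a binary class)
# has correlation 0, while a class fully determined by an attribute does not
assert abs(correlation(np.full((2, 2, 2), 1 / 8))) < 1e-12
xor_like = np.zeros((2, 2))
xor_like[0, 0] = xor_like[1, 1] = 0.5
assert abs(correlation(xor_like) - 1.0) < 1e-12
```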
In case of real data, we perform experiments on distributions built on three UCI datasets. We
split the continuous attributes at the mean of the given data. We thus can form a contingency
table representing each of the datasets. The counts in the individual cells divided by the dataset
size provide us with empirical estimates for the individual cell probabilities (pi's). Thus, with the
knowledge of N (dataset size) and the individual pi's we have a multinomial distribution. Using
this distribution we observe the behavior of the error for the three kinds of trees with results being
applicable to other datasets that are similar to the original.
Observations: Figures 4, 5 and 6 depict the error of fixed height trees with the number of
attributes being 5 for the first two figures and 8 for the third figure. The number of attribute
values increases from 2 to 3 in figures 4 and 5 respectively. We observe in these figures that AF is
significantly more accurate than both MC1 and MC10. In fact the performance of the 3 estimators
namely, AF, MC1 and MC10 remains more or less unaltered even with changes in the number of
attributes and in the number of splits per attribute. A similar trend is seen for both purity based trees (figures 7, 8 and 9) as well as scarcity based trees (figures 10, 11 and 12), though in the case of purity based trees the performance of both MC1 and MC10 is much superior compared with their performance on the other two kinds of trees, especially at low correlations. The reason for this is that, at low correlations, the probability in each cell of the multinomial is non-negligible and with N = 10000 the event that every cell contains at least a single datapoint is highly likely. Hence, the trees we obtain with high probability using the purity based stopping criterion are all ATT's.
Since in an ATT all the leaves are identical irrespective of the ordering of the attributes in any path,
the randomness in the classifiers produced, is only due to the randomness in the data generation
process and not because of the random attribute selection method. Thus, the space of classifiers
over which the error is computed reduces and MC performs well even for a relatively fewer number
of iterations. At higher correlations and for the other two kinds of trees the probability of smaller
trees is reasonable and hence MC has to account for a larger space of classifiers induced by not only
the randomness in the data but also by the randomness in the attribute selection method.
In the case of real data too (figure 13), the performance of the expressions is significantly superior compared with MC1 and MC10. The performance of MC1 and MC10 for the purity based trees is not as impressive here since the dataset sizes are much smaller (in the tens or hundreds) compared to 10000, and hence the probability of having an empty cell is not particularly low. Moreover, the correlations are reasonably high (above 0.6).
Reasons for superior performance of expressions: With simple MC, trees have to be built while performing the experiments. Since the expectations are over all possible classifiers, i.e. over all possible datasets and all possible randomizations in the attribute selection phase, the exhaustive space over which direct MC has to run is huge. No tree has to be explicitly built when using the expressions. Moreover, the probabilities for each path can be computed in parallel. Another reason why calculating the moments using expressions works better is that the portion of the probabilities for each path that depends on the attribute selection method is computed exactly (i.e. with no error) by the given expressions, and the inaccuracies in the estimates occur only due to the sample dependent portion of the probabilities.
4. Discussion
In the previous sections we derived the analytical expressions for the moments of the GE of decision
trees and depicted interesting behavior of RDT's built under the 3 stopping criteria. It is clear that
using the expressions we obtain highly accurate estimates of the moments of errors for situations
of interest. In this section we discuss issues related to the extension of the analysis to other attribute selection methods and issues related to the computational complexity of the algorithm.
4.1 Extension
The conditions presented for the 3 stopping criteria namely, fixed height, purity and scarcity are
applicable irrespective of the attribute selection method. Commonly used deterministic attribute
selection methods include those based on Information Gain (IG), Gini Gain (GG), Gain ratio (GR)
etc. Given a sample, the above metrics can be computed for each attribute. Hence, the above metrics can be implemented as corresponding functions of the sample. For example, in the case of IG we compute the loss in entropy (−Σ q log q, where the q's are computed from the sample) resulting from the addition of an attribute as we build the tree. We then compare the loss in entropy of all attributes not already chosen in the path and choose the attribute for which the loss in entropy is maximum. Following this procedure we build the path and hence the tree. To compute the probability of path exists, we add these sample
dependent conditions in the corresponding probabilities. These conditions account for a particular
set of attributes being chosen, in the 3 stopping criteria. In other words, these conditions quantify
the conditions in the 3 stopping criteria that are attribute selection method dependent. Similar
conditions can be derived for the other attribute selection methods (attribute with maximum gini
gain for GG, attribute with maximum gain ratio for GR) from which the relevant probabilities and
hence the moments can be computed. Thus, while computing the probabilities given in equations
3 and 4 the conditions for path exists for these attribute selection methods depend totally on the
sample. This is unlike what we observed for the randomized attribute selection criterion, where the conditions for path exists that depend on this randomized criterion were sample independent, while the other conditions in purity and scarcity were sample dependent. Characterizing these probabilities
enables us to compute the moments of GE for these other attribute selection methods.
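As a sketch of how such a sample-dependent selection function might look (ours, not the paper's implementation; the helper names are assumptions), an IG-based chooser computes the empirical entropy loss for each candidate attribute and takes the argmax:

```python
import numpy as np

def entropy(counts):
    """Empirical entropy -sum q log2 q of a vector of class counts."""
    counts = np.asarray(counts, dtype=float)
    q = counts[counts > 0] / counts.sum()
    return float(-(q * np.log2(q)).sum())

def best_attribute(X, y, candidates):
    """Among `candidates` (attributes not already on the path), return the one
    whose split yields the maximum loss in entropy (information gain)."""
    base = entropy(np.bincount(y))
    best, best_gain = None, -np.inf
    for a in candidates:
        cond = 0.0
        for val in np.unique(X[:, a]):
            mask = X[:, a] == val
            cond += mask.mean() * entropy(np.bincount(y[mask]))
        gain = base - cond
        if gain > best_gain:
            best, best_gain = a, gain
    return best

# attribute 0 determines the class, attribute 1 is uninformative
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
assert best_attribute(X, y, [0, 1]) == 0
```

Plugging such a chooser in place of the uniform random draw is what makes the path exists conditions fully sample dependent.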
In the analysis that we presented, we assumed that the split points for continuous attributes were determined prior to tree construction. If the split point selection algorithm is dynamic, i.e. the
split points are selected while building the tree, then in the path exists conditions of the 3 stopping
criteria we would have to append an extra condition namely, the split occurs at "this" particular
attribute value. In reality, the value of "this" is determined by the values that the samples attain
for the specific attribute in the particular dataset, which is finite.1 Hence, while analyzing we can
choose a set of allowed values for "this" for each continuous attribute. Using this updated set of conditions for the 3 stopping criteria, the moments of GE can be computed.
4.2 Scalability
The time complexity of implementing the analysis is proportional to the product of the size of
the input/output space 2 and the number of paths that are possible in the tree while classifying a
particular input. To this end, it should be noted that if a stopping criterion is not carefully chosen
and applied, then the number of possible trees and hence the number of allowed paths can become
exponential in the dimensionality. In such scenarios, studying small or at best medium size trees is
feasible. For studying larger trees, the practitioner should combine stopping criteria (e.g. pruning bound and fixed height, or scarcity and fixed height), i.e. combine the conditions given for each individual stopping criterion, or choose a stopping criterion that limits the number of paths (e.g. fixed height). Keeping these simple facts in mind and on appropriate usage, the expressions can assist in
delving into the statistical behavior of the errors for decision tree classifiers.
5. Conclusion
In this paper we have developed a general characterization for computing the moments of the GE
for decision trees. In particular we have specifically characterized RDT's for three stopping criteria
namely, fixed height, purity and scarcity. Being able to compute moments of GE allows us to compute the moments of the various validation measures and observe their relative behavior. Using
the general characterization, characterizations for specific attribute selection measures (e.g. IG, GG
etc.) other than randomized can be developed as described before. As a technical result, we have
extended the theory in Dhurandhar and Dobra (2006) to be applicable to randomized classification algorithms; this is necessary if the theory is to be applied to random decision trees as we did in this paper. The experiments reported in section 3 had two purposes: (a) portray the manner in which
the expressions can be utilized as an exploratory tool to gain a better understanding of decision
tree classifiers, and (b) show conclusively that the methodology in Dhurandhar and Dobra (2006)
together with the developments in this paper provide a superior analysis tool when compared with
simple Monte Carlo.
More work needs to be done to explore the possibilities and test the limits of the kind of analysis
that we have performed. However, if learning algorithms are analyzed in the manner that we
have shown, it would aid us in studying them more precisely, leading to better understanding and
improved decision-making in the practice of model selection.
6. Appendix
The probability that two paths of lengths l1 and l2 (l2 ≥ l1) coexist in a tree based on the randomized attribute selection method is given by,

P[l1 and l2 length paths coexist] = Σ_{i=0}^{v} vPr_i (l1 − i − 1)! (l2 − i − 1)! (r − v) prob_i

1. Since the dataset is finite.
2. In the case of continuous attributes, the size of the input/output space is the size after discretization.
Figure 14: Instances of possible arrangements.
where r is the number of attributes common to the two paths, v is the number of attributes with the same values in the two paths, vPr_i = v!/(v − i)! denotes permutation and

prob_i = 1/[d(d−1)⋯(d−i)(d−i−1)²⋯(d−l1+1)²(d−l1)⋯(d−l2+1)].
We prove the above result with the help of an example, through which the derivation will become clearer. Consider the total number of attributes to be d as usual. Let A1, A2 and A3 be three attributes that are common to both paths and also have the same attribute values. Let A4 and A5 be common to both paths but have different attribute values in each of them. Let A6 belong to only the first path and A7, A8 to only the second path. Thus, in our example l1 = 6, l2 = 7, r = 5 and v = 3. For the two paths to coexist, notice that at least one of A4 or A5 has to be at a lower depth than the non-common attributes A6, A7, A8. This has to be true since, if a non-common attribute, say A6, is higher than A4 and A5 in a path of the tree, then the other path cannot exist. Hence, in all the possible ways that the two paths can coexist, one of the attributes A4 or A5 has to occur at a maximum depth of v + 1, i.e. 4 in this example. Figure 14a depicts this case. In the successive tree structures, i.e. Figure 14b and Figure 14c, the common attribute with distinct attribute values (A4) rises higher up in the tree (to lower depths) until in Figure 14d it becomes the root. To find the probability that the two paths coexist, we sum up the probabilities of such arrangements/tree structures. The probability of the subtree shown in Figure 14a is 1/[d(d−1)(d−2)(d−3)(d−4)²(d−5)²(d−6)], considering that we choose attributes without replacement for a particular path. Thus the probability of choosing the root is 1/d, the next attribute 1/(d−1), and so on till the subtree splits into two paths at depth 5. After the split at depth 5, the probability of choosing the respective attributes for the two paths is 1/(d−4)², since repetitions are allowed in two separate paths. Finally, the first path ends at depth 6 and only one attribute has to be chosen at depth 7 for the second path, which is chosen with a probability of 1/(d−6). We now find the total number of subtrees with such an arrangement, where the highest common attribute with different values is at a depth of 4. We observe that A1, A2 and A3 can be permuted in any way without altering the tree structure. The total number of ways of doing this is 3!, i.e. 3Pr3. The attributes below A4 can also be permuted in 2!3! ways without changing the tree structure. Moreover, A4 can be replaced by A5. Thus, the total number of ways the two paths can coexist with this arrangement is 3Pr3 2! 3! 2. The probability of the arrangement is hence given by 3Pr3 2! 3! 2 / [d(d−1)(d−2)(d−3)(d−4)²(d−5)²(d−6)]. Similarly, we find the probability of the arrangement in Figure 14b, where the common attribute with different values is at depth 3, then at depth 2, and finally at the root. The probabilities for the successive arrangements are 3Pr2 3! 4! 2 / [d(d−1)(d−2)(d−3)²(d−4)²(d−5)²(d−6)], 3Pr1 4! 5! 2 / [d(d−1)(d−2)²(d−3)²(d−4)²(d−5)²(d−6)] and 3Pr0 5! 6! 2 / [d(d−1)²(d−2)²(d−3)²(d−4)²(d−5)²(d−6)] respectively. The total probability for the paths to coexist is given by the sum of the probabilities of these individual arrangements.
In the general case, where we have v attributes with the same values, the number of arrangements possible is v + 1. This is because the depth at which the two paths separate out lowers from v + 1 to 1. When the bifurcation occurs at depth v + 1, the total number of subtrees with this arrangement is vPr_v (l1 − v − 1)! (l2 − v − 1)! (r − v). Here vPr_v is the number of permutations of the common attributes with the same values, (l1 − v − 1)! and (l2 − v − 1)! are the total permutations of the attributes in paths 1 and 2 respectively after the split, and r − v is the number of choices for the split attribute. The probability of any one of these subtrees is 1/[d(d−1)⋯(d−v)(d−v−1)²⋯(d−l1+1)²(d−l1)⋯(d−l2+1)], since until a depth of v + 1 the two paths are the same and from v + 2 onwards the two paths separate out. The probability of the first arrangement is thus vPr_v (l1 − v − 1)! (l2 − v − 1)! (r − v) / [d(d−1)⋯(d−v)(d−v−1)²⋯(d−l1+1)²(d−l1)⋯(d−l2+1)]. For the second arrangement, with the bifurcation occurring at a depth of v, the number of subtrees is vPr_{v−1} (l1 − v)! (l2 − v)! (r − v) and the probability of any one of them is 1/[d(d−1)⋯(d−v+1)(d−v)²⋯(d−l1+1)²(d−l1)⋯(d−l2+1)]. The probability of this arrangement is thus vPr_{v−1} (l1 − v)! (l2 − v)! (r − v) / [d(d−1)⋯(d−v+1)(d−v)²⋯(d−l1+1)²(d−l1)⋯(d−l2+1)]. Similarly, the probabilities of the other arrangements can be derived. Hence the total probability for the two paths to coexist, which is the sum of the probabilities of the individual arrangements, is given by,

P[l1 and l2 length paths coexist] = Σ_{i=0}^{v} vPr_i (l1 − i − 1)! (l2 − i − 1)! (r − v) / [d(d−1)⋯(d−i)(d−i−1)²⋯(d−l1+1)²(d−l1)⋯(d−l2+1)]
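The closed form can also be cross-checked by simulation; the sketch below (ours, not from the paper) grows only the branches of a random tree that matter for the two paths. For d = 4 and two length-2 paths that share attribute A0's value but differ on A1 (so r = 2, v = 1), the formula gives 1/36 + 1/12 = 1/9, and the empirical frequency should be close to it:

```python
import random

def branch_realizes(path, used, d, rng):
    """Follow one branch down; True iff every remaining attribute of `path`
    is drawn (without replacement) before any attribute outside the path."""
    path, used = dict(path), set(used)
    while path:
        attr = rng.choice([a for a in range(d) if a not in used])
        if attr not in path:
            return False          # a foreign attribute broke the branch
        used.add(attr)
        del path[attr]
    return True

def both_paths_exist(path1, path2, d, rng):
    """Grow one random tree lazily; paths are dicts {attribute: value}."""
    used, p1, p2 = set(), dict(path1), dict(path2)
    while p1 and p2:              # the two paths share a branch so far
        attr = rng.choice([a for a in range(d) if a not in used])
        if attr not in p1 or attr not in p2:
            return False          # the shared branch misses one of the paths
        used.add(attr)
        v1, v2 = p1.pop(attr), p2.pop(attr)
        if v1 != v2:              # bifurcation: continue independently
            return (branch_realizes(p1, used, d, rng)
                    and branch_realizes(p2, used, d, rng))
    return (branch_realizes(p1, used, d, rng)
            and branch_realizes(p2, used, d, rng))

rng = random.Random(0)
trials = 20000
hits = sum(both_paths_exist({0: 0, 1: 0}, {0: 0, 1: 1}, 4, rng)
           for _ in range(trials))
assert abs(hits / trials - 1 / 9) < 0.015   # closed form: 1/36 + 1/12 = 1/9
```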
References
Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Computational Learning Theory, 1999. URL citeseer.ist.psu.edu/blum99beating.html.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.
Jeff Connor-Linton. Chi square tutorial. http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html, 2003.
Amit Dhurandhar and Alin Dobra. Semi-analytical method for analyzing models and model selection measures based on moment analysis. www.cise.ufl.edu/submit/ext_ops.php?op list&type report&by_tag REP2007296&display_level full, 2006.
M. Hall. Correlation-based feature selection for machine learning, 1998. URL citeseer.ist.psu.edu/hall99correlationbased.html.
Mark A. Hall and Geoffrey Holmes. Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering, 2003.
R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth IJCAI, 1995. URL overcite.lcs.mit.edu/kohavi95study.html.
Fei Tony Liu, Kai Ming Ting, and Wei Fan. Maximizing tree diversity by building complete-random decision trees. In PAKDD, pages 605-610, 2005.
J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.
Jun Shao. Linear model selection by cross validation. JASA, 88, 1993.
Jun Shao. Mathematical statistics. Springer-Verlag, 2003.
Lindsay I. Smith. A tutorial on principal components analysis. www.csnet.otago.ac.nz/cosc453/studenttutorials/principal_components.pdf, 2002.
V. Vapnik. Statistical Learning Theory. Wiley & Sons, 1998.
