Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: Semi-analytical method for analyzing models and model selection measures based on moment analysis
 Material Information
Physical Description: Book
Language: English
Creator: Dhurandhar, Amit
Dobra, Alin
Publisher: Department of Computer and Information Science and Engineering, University of Florida
Place of Publication: Gainesville, Fla.
Copyright Date: 2007
 Record Information
Bibliographic ID: UF00095647
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.



Semi-analytical Method for Analyzing Models and Model
Selection Measures based on Moment Analysis

Amit Dhurandhar ASD@CISE.UFL.EDU
Computer and Information Science and Engineering
University of Florida
Gainesville, FL 32611, USA
Alin Dobra
Computer and Information Science and Engineering
University of Florida
Gainesville, FL 32611, USA


In this paper we propose a moment based semi-analytical method for studying models and
model selection measures. By focusing on the probabilistic space of classifiers induced by
the classification algorithm rather than on that of datasets, we obtain efficient characterizations
for computing the moments, which is followed by visualization of the resulting formulae
that are too complicated for direct interpretation. By assuming the data to be drawn i.i.d.
from the underlying probability distribution, and by going over the space of all possible
datasets, we establish general relationships between the Generalization error, Hold-out-set
error, Cross-validation error and Leave-one-out error. We later exemplify the method and
the results by studying the behavior of the errors for the Naive Bayes Classifier.
Keywords: Hold out set, Cross validation, Generalization Error, Naive Bayes Classifier

1. Introduction

Most of the work in machine learning is dedicated to designing new learning methods or
to better understanding, at a macroscopic level (i.e. performance over various datasets), the
known learning methods. The body of work that tries to understand the microscopic behavior
of either models or methods to evaluate models, which we think is crucial for deepening the
understanding of machine learning techniques and results, and to establish solid connections
with Statistics, is rather small. The two prevalent approaches to establish such results are
based on either theory or empirical studies but usually not both, unless empirical studies
are used to validate the theory. While both methods are powerful in themselves, each suffers
from at least a major deficiency.
The theoretical method depends on nice, closed-form formulae, which usually restricts the
types of results that can be obtained to asymptotic results or statistical learning theory (SLT)
type results (Vapnik, 1998). Should formulae become large and tedious to manipulate,
theoretical results are hard to obtain and use/interpret.
The empirical method is well suited for validating intuitions but is significantly less
useful for finding novel, interesting things, since a large number of experiments has to be
conducted in order to reduce the error to a reasonable level. This is particularly difficult
when small probabilities are involved, making the empirical evaluation impractical in such
cases.
An ideal scenario, from the point of view of producing interesting results, would be to
use theory to make as much progress as possible but potentially obtaining uninterpretable
formulae, followed by visualization to understand and find consequences of such formulae.
This would avoid the limitation of theory to use only nice formulae and the limitation of
empirical studies to perform large experiments. The role of the theory would be to significantly
reduce the amount of computation required and the role of visualization to understand
paper, a new hybrid method to characterize and understand models and model selection
measures. What makes such an endeavor possible is the fact that, mostly due to the lin-
earity of expectation, moments of complicated random variables can be computed exactly
with efficient formulae, even though deriving the exact distribution in the form of small
close formulae is a daunting task.
The specific contributions we make in this paper are:

1. We propose a new methodology to analyze, at a microscopic level, the statistical behavior
of models and model evaluation measures. The methodology is based on defining
random variables for quantities of interest, exactly computing their moments and
then understanding their behavior by visualization.

2. For the concrete example of model selection measures, we focus on hold-out-set error,
cross validation error and leave-one-out error. We establish connections between
the moments of these three error measures and the generalization error indicating
efficient strategies for computing these moments. By introducing a probability space
over the classifiers and computing the moments of the generalization error, we have
the following two advantages over the theoretical results given by Statistical Learn-
ing Theory: (a) we obtain qualitative results about classifiers based on the actual
classification algorithms, not the expressiveness of the class of functions to which the
classifier belongs to, and (b) the results are not as pessimistic.

3. We exemplify the general theory by comparing the behavior of the three model se-
lection measures for the Naive Bayes Classifier(NBC). The data generation model we
consider for the discrete case is truly the most generic, in that it can be used to rep-
resent any discrete data distribution over the input output space. It is important to
note that in the simulations we present there is no explicit data being generated but
instead the formulations are used to obtain the plots.

4. The method proposed can be used as an exploratory tool to observe the quantitative
behavior of the errors for a particular model under desired conditions, shying away
from generalized statements which are more prone to error. The method encapsulates
the generality of theory and the specificity of experiments.

The rest of the paper is organized as follows. Section 2 is a brief survey of the related
work. In Section 3 we mention the necessity to put a classifier into a class and characterize


this class. In Section 4 we derive relationships between the moments of the Generalization
error, Hold out set error, Multifold cross validation error and Leave-one-out error which
are independent of any classification algorithm. Section 5 discusses the aspects critical to
the methodology we propose. Section 6 is an illustration of the actual methodology applied
to a Naive Bayes Classifier. It discusses the application of Theorem 1 in the paper and
reports simulation studies for the NBC with one input. Section 7 and Section 8 discuss
the problems involved in directly extending the method to NBC with multiple inputs. We
also discuss strategies to solve these problems and maintain scalability of the proposed
method. In Section 9 we report simulation studies for the multi-input NBC and discuss
their implications. Finally, we discuss promising lines for future research and summarize
the major developments in the paper in Section 10.

2. Related Work

There is a large body of both experimental and theoretical work that addresses the problem
of understanding various model selection measures. Shao (Shao, 1993) showed that asymptotically
leave-one-out (LOO) chooses the best but not the simplest model. Devroye, Györfi,
and Lugosi (Devroye et al., 1996) derived distribution-free bounds for cross validation.
The bounds they found were for the nearest neighbour model. Breiman (Breiman,
1996) showed that cross validation gives an unbiased estimate of the first moment of the
Generalization error. Though cross validation has desirable characteristics when estimating
the first moment, Breiman stated that its variance can be significant. Theoretical bounds
on the LOO error under certain algorithmic stability assumptions were given by Kearns and
Ron (Kearns and Ron, 1997). They showed that the worst case error of the LOO estimate
is not much worse than the training error estimate. Elisseeff and Pontil (Elisseeff and Pontil, 2003)
introduced the notion of training stability. They showed that even with this weaker no-
tion of stability good bounds could be obtained on the generalization error. Blum, Kalai
and Langford (Blum et al., 1999) showed that v-fold cross validation is at least as good as
N/v hold-out-set estimation in expectation. Kohavi (Kohavi, 1995) conducted experiments
on Naive Bayes and C4.5 using cross-validation. Through his experiments he concluded
that 10-fold stratified cross validation should be used for model selection. Moore and Lee
proposed heuristics to speed up cross-validation (Moore and Lee, 1994). Plutowski's (Plutowski,
1996) survey included papers with theoretical results, heuristics and experiments on
cross-validation. His survey was especially geared towards the behavior of cross-validation
on neural networks. He inferred from the previously published results that cross-validation
is robust. More recently, Bengio and Grandvalet (Bengio and Grandvalet, 2003) proved
that there is no universally unbiased estimator of the variance of cross-validation. Zhu
and Rohwer (Zhu and Rohwer, 1996) proposed a simple setting in which cross-validation
performs poorly. Goutte (Goutte, 1997) refuted this proposed setting and claimed that a
realistic scenario in which cross-validation fails is still an open question.

The work we present here covers the middle ground between these theoretical and
empirical results by allowing classifier-specific results based on moment analysis. Such an endeavor
is important since the gap between theoretical and empirical results is significant (Langford, 2005).

3. Statistical Characterization of Classifiers

Classifiers produced by learning algorithms need to perform well with respect to the underlying
probability distribution for the problem at hand rather than merely produce good results on
the datasets available during learning. This immediately leads to the notion of generalization
error, and to empirical errors such as hold-out-set, cross validation and leave-one-out. When the
amount of testing data is very large, the generalization error of any given classifier is well
approximated by any empirical error. In such a case the performance of any classifier can
be estimated accurately by these empirical errors. Unfortunately, the amount of testing
data can be small for mainly two reasons: the amount of available data is small, and a large
part of the data needs to be used for training. With these observations it can be seen
that a given classifier cannot be evaluated accurately enough with respect to the underlying
distribution, since a large amount of testing data is not available and the distribution is
unknown. This remains true even if the learning process is disregarded, since the total amount
of available data is usually not large enough to warrant the use of asymptotic results.
The main difficulty with direct characterization of a given classifier without extra
information is the fact that the behavior of the empirical errors cannot be determined
independent of the underlying probability distribution, thus no statistical statements can be made
about the characterization produced. To avoid this difficulty, an indirect characterization
of the classifier can be produced by characterizing a class of classifiers to which this
classifier belongs. Identifying such a class of classifiers and characterizing it seems the
only practical method to produce a statistical characterization for a given classifier.
Statistical Learning Theory (SLT) categorizes classification algorithms (actually the more
general learning algorithms) into different classes called Concept Classes. The concept
class of a classification algorithm is determined by its Vapnik-Chervonenkis (VC) dimension,
which is related to the shattering capability of the algorithm. Distribution-free bounds on the
generalization error of a classifier built using a particular classification algorithm belonging
to a concept class are derived in SLT. The bounds are functions of the VC dimension and
the sample size. The strength of this technique is that by finding the VC dimension of an
algorithm we can derive error bounds for the classifiers built using this algorithm without
ever referring to the underlying distribution. A fallout of this very general characterization
is that the bounds are usually loose (Boucheron et al., 2005; Williamson, 2001), which in
turn makes statements about any particular classifier weak.
The idea we pursue in this paper is to define the class of classifiers as the classifiers
induced by a given learning algorithm and i.i.d. data of a given size. Any member of this
class can be viewed as a sample hypothesis, and the characterization of the class is strongly
connected to the behavior of the classifier. This class of classifiers is much smaller than the
classes considered in SLT. With this, in the next section we establish relationships between
the statistical behavior of the generalization error and the empirical errors for the members
of this class of classifiers, relationships that have the character of general results (they
hold irrespective of the classification algorithm that induces the class). The downside of our
method is the fact that we lose the strength to make generalized statements to the extent
that SLT makes them. In order to obtain an actual characterization, involved computations for
the specific classifier are required. However, we exemplify our methodology in this paper
for the Naive Bayes Classifier. While this process might be tedious, we believe that it
leads to a better understanding of learning algorithms.

4. Moment Analysis

Probability distributions completely characterize the behavior of a random variable. Moments
of a random variable give us information about its probability distribution. Thus,
if we have knowledge of the moments of a random variable we can make statements about
its behavior. In some cases characterizing a finite subset of moments may prove to be
a more desirable alternative than characterizing the entire distribution, which can be unwieldy
and computationally expensive to compute. This is precisely what we do when we study
the behavior of the generalization error of a classifier and the error estimation methods,
viz. hold-out-set error, leave-one-out error and cross-validation error. Characterizing the
distribution, though possible, can turn out to be a tedious task, and studying the moments
instead is a more viable option. As a result, we employ moment analysis and use linearity
of expectation to explore the relationship between various estimates for the error of classifiers:
generalization error (GE), hold-out-set error (HE), and cross validation error (CE);
leave-one-out error is just a particular case of CE and we do not analyze it independently.
The relationships are drawn by going over the space of all possible datasets. The actual
computation of moments, though, is conducted by going over the space of classifiers induced
by a particular classification algorithm and i.i.d. data, as can be seen in Section 6. This
is done since it leads to computational efficiency. We interchangeably go over the space of
datasets and the space of classifiers as deemed appropriate, since the classification algorithm is
assumed to be deterministic. That is, we have

E_{D(N)}[F(ζ[D(N)])] = E_{Z(N)}[F(ζ)] = E_{X_1,...,X_N}[F(ζ((X_1, X_2, ..., X_N)))]

where F(·) is some function that operates on a classifier. We also consider the learning
algorithms to be symmetric (the algorithm is oblivious to random permutations of the
samples in the training dataset).
Throughout this section and in the rest of the paper we use the notation in Table 1
unless stated otherwise.

4.1 Generalization Error

The notion of generalization error is defined with respect to an underlying probability
distribution defined over the input-output space and a loss function (error metric). We
model this probability space with the random vector X for the input and the random variable Y
for the output. When the input is fixed, Y(x) is the random variable that models the output.
We assume in this paper that the domain 𝒳 of X is discrete; all the theory can be extended
to the continuous case essentially by replacing the counting measure with the Lebesgue measure and
sums with integrals. Whenever the probability and expectation are with respect to this
probabilistic space (i.e. (X, Y)) that models the problem, we will not use any index. For

1. By modeling the output for a given input as a random variable, we allow the output to be randomized,
as it might be in most real circumstances.


Symbol Meaning
X Random vector modeling input
𝒳 Domain of the random vector X (input space)
Y Random variable modeling output
Y(x) Random variable modeling output for input x
𝒴 Set of class labels (output space)
D Dataset
(x, y) Data-point from dataset D
Dt Training dataset
Ds Testing dataset
Di The i-th part/fold of D (for cross validation)
N Size of dataset
Nt Size of training dataset
Ns Size of testing dataset
v Number of folds of cross validation
ζ Classifier
ζ[D] Classifier built from dataset D
GE(ζ) Generalization error of classifier ζ
HE(ζ) Hold-out-set error of classifier ζ
CE(ζ) Cross validation error of classifier ζ
Z(S) The set of classifiers obtained by application of the
classification algorithm to an i.i.d. set of size S
D(S) Dataset of size S
E_{Z(S)}[·] Expectation w.r.t. the space of classifiers built on a sample of size S

Table 1: Notation used in the paper.


other probabilistic spaces, we will specify by an index the probability space we refer
to. We denote the error metric by A(a, b); in this paper we will use only the 0-1 metric, which
takes value 1 if a ≠ b and 0 otherwise. With this, the generalization error of a classifier ζ is:

GE(ζ) = E[A(ζ(X), Y)]
      = P[ζ(X) ≠ Y]                                   (1)
      = Σ_{x∈𝒳} P[X = x] P[ζ(x) ≠ Y(x)]

where we used the fact that, for 0-1 loss function the expectation is the probability that
the prediction is erroneous. Notice that the notation using Y(x) is really a conditional on
X = x. We use this notation since it is intuitive and more compact. The last equation
for the generalization error is the most useful in this paper since it decomposes a global
measure, generalization error, defined over the entire space into micro measures, one for
each input.
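As a sanity check, the per-input decomposition above can be computed directly for a toy discrete distribution; the dictionaries below (input marginals, conditional label probabilities, and a fixed classifier) are hypothetical numbers of our own, not from the paper:

```python
# Hypothetical toy example of the per-input decomposition
# GE(zeta) = sum_x P[X = x] * P[zeta(x) != Y(x)].
def generalization_error(p_x, p_y_given_x, classifier):
    """p_x: dict x -> P[X = x]; p_y_given_x: dict (x, y) -> P[Y = y | X = x];
    classifier: dict x -> predicted label."""
    ge = 0.0
    for x, px in p_x.items():
        pred = classifier[x]
        # misclassification probability at x: P[Y(x) != zeta(x)]
        ge += px * (1.0 - p_y_given_x.get((x, pred), 0.0))
    return ge

# made-up two-input, two-label distribution
p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {(0, 'a'): 0.9, (0, 'b'): 0.1, (1, 'a'): 0.3, (1, 'b'): 0.7}
zeta = {0: 'a', 1: 'b'}
print(generalization_error(p_x, p_y_given_x, zeta))  # 0.5*0.1 + 0.5*0.3 = 0.2
```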
By carefully selecting the class of classifiers for which the moment analysis of the
generalization error is performed, meaningful and relevant probabilistic statements can be made
about the generalization error of a particular classifier from this class. The probability
distribution over the classifiers will be based on the randomness of the data used to produce
the classifier. To formalize this, let Z(N) be the class of classifiers built over a dataset of
size N with a probability space defined over it. With this, the k-th moment around 0 of
the generalization error is:

E_{Z(N)}[GE(ζ)^k] = Σ_{ζ∈Z(N)} P_{Z(N)}[ζ] GE(ζ)^k

The problem with this definition is that it talks about global characterization of classi-
fiers which can be hard to capture. We rewrite the formulae for the first and second moment
in terms of fine granularity structure of the classifiers.
While deriving these moments, we have to consider double expectations of the form
E_{Z(N)}[E[F(x, ζ)]], with F(x, ζ) a function that depends both on the input x and the
classifier. With this we arrive at the following result:

E_{Z(N)}[E[F(x, ζ)]] = Σ_{ζ∈Z(N)} P_{Z(N)}[ζ] Σ_{x∈𝒳} P[X = x] F(x, ζ)
                     = Σ_{x∈𝒳} P[X = x] Σ_{ζ∈Z(N)} P_{Z(N)}[ζ] F(x, ζ)      (2)
                     = E[E_{Z(N)}[F(x, ζ)]]

which uses the fact that P[X = x] does not depend on a particular ζ and P_{Z(N)}[ζ] does not
depend on a particular x, even though both quantities depend on the underlying probability
distribution.
Using the definition of the moments above, Equation 1 and Equation 2 we have the
following theorem.


Theorem 1 The first and second moment of GE are given by,

E_{Z(N)}[GE(ζ)] = Σ_{x∈𝒳} P[X = x] Σ_{y∈𝒴} P_{Z(N)}[ζ(x) = y] P[Y(x) ≠ y]

E_{Z(N)×Z(N)}[GE(ζ)GE(ζ')] = Σ_{x∈𝒳} Σ_{x'∈𝒳} P[X = x] P[X = x'] ·
    Σ_{y∈𝒴} Σ_{y'∈𝒴} P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ'(x') = y'] P[Y(x) ≠ y] P[Y(x') ≠ y']

Proof

E_{Z(N)}[GE(ζ)] = E_{Z(N)}[E[A(ζ(X), Y)]]
= E[E_{Z(N)}[A(ζ(X), Y)]]
= Σ_{x∈𝒳} P[X = x] Σ_{ζ∈Z(N)} P_{Z(N)}[ζ] P[ζ(x) ≠ Y(x)]
= Σ_{x∈𝒳} P[X = x] Σ_{y∈𝒴} Σ_{ζ∈Z(N): ζ(x)=y} P_{Z(N)}[ζ] P[Y(x) ≠ y]
= Σ_{x∈𝒳} P[X = x] Σ_{y∈𝒴} P_{Z(N)}[ζ(x) = y] P[Y(x) ≠ y]

E_{Z(N)×Z(N)}[GE(ζ)GE(ζ')]
= E_{Z(N)×Z(N)}[E[A(ζ(X), Y)] E[A(ζ'(X), Y)]]
= Σ_{(ζ,ζ')∈Z(N)×Z(N)} P_{Z(N)×Z(N)}[ζ, ζ'] Σ_{x∈𝒳} Σ_{x'∈𝒳} P[X = x] P[X = x'] ·
    P[ζ(x) ≠ Y(x)] P[ζ'(x') ≠ Y(x')]
= Σ_{x∈𝒳} Σ_{x'∈𝒳} P[X = x] P[X = x'] ·
    Σ_{y∈𝒴} Σ_{y'∈𝒴} P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ'(x') = y'] P[Y(x) ≠ y] P[Y(x') ≠ y'] □

In both series of equations we made the transition from a summation over the class
of classifiers to a summation over the possible outputs since the focus changed from the
classifier to the prediction of the classifier for a specific input (x is fixed inside the first
summation). What this effectively does is it allows the computation of moments using
only local information (behavior on particular inputs), not global information (behavior on
all inputs). This results in speeding up the process of computing the moments, as can be seen
in Section 6.
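To illustrate the first-moment formula of Theorem 1, here is a hypothetical sketch; the table `p_pred` of local prediction probabilities P_{Z(N)}[ζ(x) = y] would in practice come from analyzing the classification algorithm, and the numbers below are made up:

```python
# Hypothetical sketch of the first-moment formula in Theorem 1:
# E_{Z(N)}[GE(zeta)] = sum_x P[X=x] sum_y P_{Z(N)}[zeta(x)=y] * P[Y(x) != y].
def expected_ge(p_x, p_y_given_x, p_pred, labels):
    """p_pred: dict (x, y) -> P_{Z(N)}[zeta(x) = y], the local probability
    that a classifier built on N i.i.d. samples predicts y at x."""
    total = 0.0
    for x, px in p_x.items():
        for y in labels:
            # local contribution: prediction probability times error probability
            total += px * p_pred[(x, y)] * (1.0 - p_y_given_x[(x, y)])
    return total

# made-up numbers for a single input with two labels
p_x = {0: 1.0}
p_y_given_x = {(0, 'a'): 0.9, (0, 'b'): 0.1}
p_pred = {(0, 'a'): 0.8, (0, 'b'): 0.2}
print(expected_ge(p_x, p_y_given_x, p_pred, ['a', 'b']))  # 0.8*0.1 + 0.2*0.9 = 0.26
```

Only the local quantities P_{Z(N)}[ζ(x) = y] are needed, which is exactly the point of the theorem.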

4.2 Alternative Methods for computing the moments of GE
The method we introduced above for computing the moments of the generalization error
is based on decomposing the moment into contributions of individual input-output pairs.
With such a decomposition, not only does the analysis become simpler, but the complexity of the
algorithm required is reduced. In particular, the complexity of computing the first moment
is proportional to the size of the input-output space and the complexity of estimating
probabilities of the form P_Z[ζ(x) = y]. The complexity of the second moment is quadratic
in the size of the input-output space and proportional to the complexity of estimating
P_Z[ζ(x) = y ∧ ζ(x') = y']. To see the advantage of this method, we compare it with the
other two alternatives for computing the moments: definition based computation and Monte
Carlo simulation.
Definition based computation uses the definition of expectation. It consists in summing
over all possible datasets and multiplying the generalization error of the classifier built from
the dataset with the probability to obtain the dataset as an i.i.d. sample from the underlying
probability distribution. Formally,

E_{D(N)}[GE(ζ)] = Σ_{D∈D(N)} P[D] GE(ζ[D])    (3)

where D(N) is the set of all possible datasets of size N. The number of possible datasets
is exponential in N with the base of the exponent proportional to the size of the input-
output space (the product of the sizes of the domains of inputs and outputs). Evaluating
the moments in this manner is impractical for all but very small spaces and dataset sizes.
Monte Carlo simulation is a simple way to estimate moments that consists in performing
experiments to produce samples that determine the value of the generalization error. In
this case, to estimate E_{D(N)}[GE(ζ)], datasets of size N have to be generated, one for each
sample desired. For each of these datasets a classifier has to be constructed according to the
classifier construction algorithm. For the classifier produced, samples from the underlying
probability distribution have to be generated in order to estimate the generalization error of
this classifier. Especially for second moments, the number of samples required will be large
in order to obtain reasonable accuracy for the moments. If a study has to be conducted in
order to determine the influence of various parameters of the data generation model, the
overall number of experiments that have to be performed becomes infeasible.
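For intuition, the following toy comparison (our own construction, not from the paper) contrasts the definition-based computation of E[GE] with a Monte Carlo estimate, for a single-input problem where the learner predicts the majority of N i.i.d. Bernoulli(p) labels:

```python
import random
from math import comb

# Toy comparison of definition-based computation of E[GE] with Monte Carlo,
# for a learner that predicts the majority of N i.i.d. Bernoulli(p) labels.
p, N = 0.7, 5  # hypothetical label probability and dataset size (N odd: no ties)

# Definition-based: enumerate datasets by their number k of 1-labels;
# the induced classifier predicts 1 iff k > N - k, with GE (1 - p) or p.
exact = sum(comb(N, k) * p**k * (1 - p)**(N - k)
            * ((1 - p) if k > N - k else p)
            for k in range(N + 1))

# Monte Carlo: sample datasets, build the classifier, use its known GE.
random.seed(0)
trials = 20000
acc = 0.0
for _ in range(trials):
    k = sum(random.random() < p for _ in range(N))
    acc += (1 - p) if k > N - k else p
estimate = acc / trials
print(round(exact, 6), round(estimate, 6))
```

Even in this tiny setting the Monte Carlo estimate needs tens of thousands of samples to agree with the exact value to two decimal places, which is the deficiency discussed above.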
In summary, the advantages of the method we propose for estimating the moments are:
(a) it is exact, (b) it needs only local behavior of the classifier, (c) the time complexity
is reduced and (d) it does not depend on the fact that some of the probabilities are small.
We will use this method to compute moments of the generalization error for the NBC in
Section 6.

4.3 Error Estimation Methods
The exact computation of the generalization error depends on the actual underlying
probability distribution, which is unknown, and hence other estimates for the generalization
error have been introduced: hold-out set (HOS), leave-one-out (LOO), and v-fold cross
validation (CV). In this subsection we establish relationships between moments of these
error metrics and the moments of the generalization error with respect to some distribution
over the classifiers. The general setup for the analysis of all these metrics is the following.
A dataset D of size N is provided, containing i.i.d. samples coming from the underlying
distribution over the inputs and outputs. The set is further divided and used both to build
a classifier and to estimate the generalization error; the particular way this is achieved is
slightly different for each error metric. The important question we will ask is how the values
of the error metrics relate to the generalization error. In all the developments that follow
we will assume that ζ[D] is the classifier built deterministically from the dataset D.

The HOS error involves randomly partitioning the dataset D into two parts: Dt, the training
dataset of fixed size Nt, and Ds, the test dataset of fixed size Ns. A classifier is built over
the training dataset and the generalization error is estimated as the average error over the
test dataset. Formally, denoting the random variable that gives the HOS error by HE, we have:

HE = (1/N_s) Σ_{(x,y)∈D_s} A(ζ[D_t](x), y)    (4)

where y is the actual label for the input x.
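In code, Equation 4 is just an average of 0-1 losses over the held-out points; a minimal sketch, with a hypothetical classifier passed as a function:

```python
# Minimal sketch of Equation 4: HE is the average 0-1 loss of the classifier
# built on the training set, evaluated on the held-out test set.
def holdout_error(classifier, test_set):
    """classifier: callable x -> label; test_set: list of (x, y) pairs."""
    return sum(classifier(x) != y for x, y in test_set) / len(test_set)

# hypothetical classifier and held-out points
clf = lambda x: x % 2
test_set = [(1, 1), (2, 0), (3, 0), (4, 0)]
print(holdout_error(clf, test_set))  # one error out of four -> 0.25
```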
Proposition 1 The expected value of HE is given by,

E_{D_t(N_t)×D_s(N_s)}[HE] = E_{D_t(N_t)}[GE(ζ[D_t])]
Proof Using the notation in Table 1 and realizing that all the data-points are i.i.d., we
derive the above result.

E_{D_t(N_t)×D_s(N_s)}[HE] = E_{D_t(N_t)}[E_{D_s(N_s)}[(1/N_s) Σ_{(x,y)∈D_s} A(ζ[D_t](x), y)]]
= E_{D_t(N_t)}[E_{D_s(N_s)}[P[ζ[D_t](x) ≠ y | D_s]]]
= E_{D_t(N_t)}[Σ_{D_s} P[ζ[D_t](x) ≠ y | D_s] P[D_s]]
= E_{D_t(N_t)}[Σ_{D_s} P[ζ[D_t](x) ≠ y, D_s]]
= E_{D_t(N_t)}[P[ζ[D_t](x) ≠ y]]
= E_{D_t(N_t)}[GE(ζ[D_t])]

where we used the fact that by going over all values of one random variable we get the
probability of the other. □
We observe from the above result that the expected value of HE depends only on the
size of the training set D_t. This result is intuitive since only N_t data-points are used for
building the classifier.
Lemma 1 The second moment of HE is given by,

E_{D_t(N_t)×D_s(N_s)}[HE²] = (1/N_s) E_{D_t(N_t)}[GE(ζ[D_t])] + ((N_s−1)/N_s) E_{D_t(N_t)}[GE(ζ[D_t])²].

Proof To compute the second moment of HE, from the definition in Equation 4 we have:

E_{D_t(N_t)×D_s(N_s)}[HE²]
= E_{D_t(N_t)×D_s(N_s)}[(1/N_s²) Σ_{(x,y)∈D_s} Σ_{(x',y')∈D_s} A(ζ[D_t](x), y) A(ζ[D_t](x'), y')]

The expression under the double sum depends on whether (x, y) and (x', y') are the same
point or not. When they are the same, we are precisely in the case we derived for
E_{D_t(N_t)×D_s(N_s)}[HE] above, except that we have N_s² in the denominator; the N_s such
terms together give (1/N_s) E_{D_t(N_t)}[GE(ζ[D_t])]. When they are different, i.e.
(x, y) ≠ (x', y'), we get:

(1/N_s²) E_{D_t(N_t)×D_s(N_s)}[Σ_{(x,y)∈D_s} A(ζ[D_t](x), y) Σ_{(x',y')∈D_s\(x,y)} A(ζ[D_t](x'), y')]
= ((N_s−1)/N_s) E_{D_t(N_t)×D_s(N_s)}[P[ζ[D_t](x) ≠ y | (x,y) ∈ D_s] P[ζ[D_t](x') ≠ y' | (x',y') ∈ D_s \ (x,y)]]
= ((N_s−1)/N_s) E_{D_t(N_t)}[E_{D_s}[P[ζ[D_t](x) ≠ y]] E_{D_s}[P[ζ[D_t](x') ≠ y']]]
= ((N_s−1)/N_s) E_{D_t(N_t)}[GE(ζ[D_t])²]

where we used the primary fact that, since the samples are i.i.d., any function applied
on two distinct inputs is also independent. This is the reason why the E_{D_s}[·] factorizes.
Putting everything together, and observing that the terms inside the summations are constants,
we have:

E_{D_t(N_t)×D_s(N_s)}[HE²] = (1/N_s) E_{D_t(N_t)}[GE(ζ[D_t])] + ((N_s−1)/N_s) E_{D_t(N_t)}[GE(ζ[D_t])²]  □

Theorem 2 The variance of HE is given by,

Var_{D_t(N_t)×D_s(N_s)}(HE) = (1/N_s) E_{D_t(N_t)}[GE(ζ[D_t])] + ((N_s−1)/N_s) E_{D_t(N_t)}[GE(ζ[D_t])²]
− E_{D_t(N_t)}[GE(ζ[D_t])]².


Proof The proof of the theorem immediately follows from Proposition 1 and Lemma 1,
using the formula for the variance of a random variable, in this case HE:
Var(HE) = E[HE²] − E[HE]². □

Unlike the first moment, the variance depends on the sizes of both the training set and
the test set.
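A quick sanity check of Theorem 2 with made-up numbers: when the classifier is fixed (training plays no role), GE is a constant g, so E[GE] = g and E[GE²] = g², and the variance formula should collapse to the binomial variance g(1−g)/Ns of an average of Ns i.i.d. 0-1 losses:

```python
# Sanity check of Theorem 2 (hypothetical numbers): for a fixed classifier,
# GE is a constant g, so E[GE] = g and E[GE^2] = g^2, and Var(HE) reduces
# to the binomial variance g*(1 - g)/Ns.
g, Ns = 0.2, 50
e_ge, e_ge2 = g, g**2
var_he = (1 / Ns) * e_ge + ((Ns - 1) / Ns) * e_ge2 - e_ge**2
print(var_he, g * (1 - g) / Ns)  # both equal 0.0032
```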

v-fold cross validation consists in randomly partitioning the available data into v equi-sized
parts, training v classifiers, each using all the data but one chunk, and then testing the
performance of each classifier on the chunk it did not see. The estimate of the generalization
error of the classifier built from the entire data is the average error over the chunks. Using
the notation in this paper and denoting by Di the i-th chunk of the dataset D, the Cross
Validation Error (CE) is:

CE = (1/v) Σ_{i=1}^{v} HE_i

where HE_i is the hold-out-set error with D \ Di as the training set and Di as the test set.
Notice that we expressed CE in terms of HE, the HOS error. By substituting the formula
for HE in Equation 4 into the above equation, a direct definition for CE is obtained.
In this case we have a classifier for each chunk, not a single classifier for the entire
data. We model the selection of N i.i.d. samples that constitute the dataset D and the
partitioning into v chunks. With this we have:

Proposition 2 The expected value of CE is given by,

E_{D_t((v−1)N/v)×D_i(N/v)}[CE] = E_{D_t((v−1)N/v)}[GE(ζ[D_t])]

Proof Using the above equation and Proposition 1 we get the above result.

E_{D_t((v−1)N/v)×D_i(N/v)}[CE] = (1/v) Σ_{i=1}^{v} E_{D_t((v−1)N/v)×D_i(N/v)}[HE_i]
= (1/v) Σ_{i=1}^{v} E_{D_t((v−1)N/v)}[GE(ζ[D_t])]
= E_{D_t((v−1)N/v)}[GE(ζ[D_t])]  □

This result follows intuition since it states that the expected error is the generalization
error of a classifier trained on (v−1)N/v data-points. Thus, at least in expectation, cross
validation behaves exactly like the HOS estimate trained over v − 1 chunks.
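Operationally, the v-fold procedure described above can be sketched as follows; `learn` stands in for any deterministic, symmetric classification algorithm, and the toy "learner" in the example is hypothetical:

```python
# Minimal sketch of v-fold cross validation.
def cross_validation_error(data, learn, v):
    folds = [data[i::v] for i in range(v)]  # v equi-sized chunks
    total = 0.0
    for i in range(v):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        clf = learn(train)  # classifier built on D \ Di
        # hold-out error HE_i on the held-out chunk Di
        total += sum(clf(x) != y for x, y in folds[i]) / len(folds[i])
    return total / v  # CE = (1/v) * sum_i HE_i

# toy run: a "learner" that ignores the data and always predicts 1
data = [(i, i % 2) for i in range(10)]
always_one = lambda train: (lambda x: 1)
print(cross_validation_error(data, always_one, 5))  # half the labels are 0 -> 0.5
```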
For CE we could compute the second moment around zero using the strategy previously
shown and then compute the variance. Here, we compute the variance using the relation-
ship between the variance of the sum and the variances and covariances of individual terms.


In this way we can decompose the overall variance of CE into the sum of variances of
individual estimators and the sum of covariances of pairs of such estimators; this decompo-
sition significantly enhances the understanding of the behavior of CE, as we will see in the
example in Section 6.

Var(CE) = (1/v²) [Σ_{i=1}^{v} Var(HE_i) + Σ_{i≠j} Cov(HE_i, HE_j)]    (5)
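Since the HE_i are identically distributed and the pairwise covariances are all equal (as shown below), Equation 5 reduces to Var(CE) = Var(HE_i)/v + ((v−1)/v) Cov(HE_i, HE_j); a small helper, with hypothetical numbers in the example, makes this explicit:

```python
# Equation 5 with identically distributed HE_i and equal pairwise covariances:
# Var(CE) = Var(HE_i)/v + ((v - 1)/v) * Cov(HE_i, HE_j).
def var_ce(var_he, cov_he, v):
    return var_he / v + (v - 1) / v * cov_he

print(var_ce(0.1, 0.02, 5))  # 0.02 + 0.8*0.02 = 0.036
```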

The quantity Var(HE_i), the variance of the HE on training data of size (v−1)N/v and test
data of size N/v, is computed using the formulae in the previous section. The only things
that remain to be computed are the covariances. Since we already computed the expectation
of HE, to compute the covariance it is enough to compute the quantity Q = E[HE_i HE_j]
(since for any two random variables X, Y we have Cov(X, Y) = E[XY] − E[X]E[Y]). Let
D̄_i denote D \ D_i and let N_s = N/v. With this we have the following lemma,

Lemma 2 The expected value of the product HE_i HE_j is given by,

E[HE_i HE_j] = E[GE(ζ[D \ D_i]) GE(ζ[D \ D_j])]

where the expectations are taken over the random draw of the dataset D and its partitioning
into the v chunks.

Proof

E[HE_i HE_j] = E[HE_i · (1/N_s) Σ_{(x_j,y_j)∈D_j} A(ζ[D \ D_j](x_j), y_j)]
= E[HE_i E_{D_j(N_s)}[(1/N_s) Σ_{(x_j,y_j)∈D_j} A(ζ[D \ D_j](x_j), y_j)]]
= E[GE(ζ[D \ D_j]) E_{D_i(N_s)}[HE_i]]
= E[GE(ζ[D \ D_j]) GE(ζ[D \ D_i])]

where we used the fact that the datasets D_i and D_j are disjoint and drawn i.i.d. It is
important to observe that, due to the fact that D \ D_i and D \ D_j intersect (the intersection
is D \ (D_i ∪ D_j)), the two classifiers will be neither independent nor identical. As was the
case for the first and second moments of GE, this moment will depend only on the size of
the intersection and the sizes of the two sets, since all points are i.i.d. This means that the
expression has the same value for any pair i, j, i ≠ j. □

Theorem 3 The variance of CE is given by,

    Var(CE) = (1/N) E_{D_t((v-1)N/v)}[GE(ζ[D_t])] + ((N-v)/(Nv)) E_{D_t((v-1)N/v)}[GE(ζ[D_t])²]
              - E_{D_t((v-1)N/v)}[GE(ζ[D_t])]²
              + ((v-1)/v) E_{D((v-2)N/v) x D_i(N/v) x D_j(N/v)}[GE(ζ[D \ D_i]) GE(ζ[D \ D_j])]

Proof The expression for the covariance is immediate from the above result, and so using
Equation 5 we derive the variance of CE. ■


It is worth mentioning that leave-one-out (LOO) is just a special case of v-fold cross
validation (v = N for leave-one-out). The formulae above apply to LOO as well, thus no
separate analysis is necessary.
With this we have related the first two moments of HE and CE to those of GE. In the
next section we shall leverage this fact and utilize it to study the model selection measures.

5. Methodology

In the previous section we derived relationships between the moments of hold-out-set error,
cross validation error, leave-one-out error and the moments of the generalization error.
Thus, if we know the moments of GE we can compute the moments of the other error
measures and vice-versa. Notice from the formulations that to compute these moments we
need to know the underlying distribution and the classification algorithm. We now discuss
these two requirements in detail.

5.1 Data Generation Model

The generalization error depends both on the classifier and on the joint underlying probability
distribution of the predictor attributes and the class label. The joint probability distribution
of the inputs and output is completely specified by the parameters:

    p_{i,k} = P[X = x_i ∧ Y = y_k]    (6)

the probability to see each of the combinations of values for the inputs and outputs. Here,
p_{i,k} is the probability of being in class y_k for the input vector x_i. This is indeed the most
general data generation model for the discrete case when only the random vector X is
considered as input, since any extra information would be irrelevant from the point of view
of the classification process. Under this data generation model, the distribution of the random
variables N_{ik} is multinomial with the total number of items N and the probabilities that
determine the multinomial, p_{ik}. In other words, we can model our generic data generation
determine the multinomial, pik. In other words, we can model our generic data generation
process by a multinomial distribution. The term data generation is somewhat of a misnomer
here, since it is important to note that in all the simulations we report there is no explicit
data generation. The probabilistic formulations that we soon propose are used for the
simulations. Thus, the formulations act as parametric equations whose parameters we vary
to produce the plots. This aspect is critical since it means that just by altering the values
of the parameters (individual cell probabilities of the multinomial, dataset size, etc.) we can
observe the behavior of the errors under various circumstances in a short amount of time.
Moreover, it gives us control over what we wish to observe.
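A minimal sketch of this data generation model (cell probabilities and sizes below are hypothetical, not taken from the paper's experiments): the counts N_ik of a dataset are a single multinomial draw, and varying p or N plays the role of varying the parametric equations:

```python
import numpy as np

# Sketch (assumption: illustrative numbers only) of the Section 5.1 model.
# The joint distribution over (input value, class label) is given by the
# multinomial cell probabilities p_ik; the counts N_ik for a dataset of
# size N are then one multinomial draw.
rng = np.random.default_rng(0)

# n = 3 input values, 2 class labels -> 6 cells; probabilities sum to 1.
p = np.array([[0.20, 0.10],
              [0.15, 0.25],
              [0.05, 0.25]])
N = 1000

counts = rng.multinomial(N, p.ravel()).reshape(p.shape)  # the N_ik table

# Varying p (or N) is what "altering the parameters" means in the text:
# no explicit per-point data generation is needed to study the errors.
print(counts)
```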

5.2 Classification Algorithm

The results (i.e. expressions and relationships) we derived in the previous section were
generic, applicable to any deterministic classification algorithm. We can thus use those
results to study the behavior of the errors for a classification algorithm of our choice.


Table 2: Contingency table of input X

  X      y1     y2
  x1     N11    N12
  x2     N21    N22
  ...    ...    ...
  xn     Nn1    Nn2
         N.1    N.2    N

The classification algorithm we consider in this paper is Naive Bayes. We first study the
Naive Bayes Classifier for a single input attribute (i.e. for one dimension) and later the
generalized version, maintaining scalability.

5.3 Roadmap

With this, we deploy the proposed methodology in the remaining part of the paper, which
involves:

1. Deriving expressions for the first two moments of GE for the NBC, which in turn also
provide expressions for the first two moments of HE and CE (LOO is just a special case
of cross-validation).

2. Using the expressions in conjunction with the data generation model as an exploratory
tool to observe the behavior of the errors under different circumstances, leading to
interesting explanations for certain observed facts. Realize that the goal is not to
make generalized statements but to use the method to observe specifics. However, we
still discuss the qualitative extensibility of certain results.

6. Example: Naive Bayes Classifier

In this section we explicitly compute these moments for the NBC with a single input.
As we will see, these moments are, as mathematical formulae, too complicated to interpret.
We will plot these moments to gain an understanding of the behavior of the errors under
different conditions, thus portraying the usefulness of the proposed method.

6.1 Naive Bayes Classifier Model

In order to compute the moments of the generalization error, moments that we linked in
Section 4 with the moments of the hold-out set error and cross validation, we first have to
select a classifier and specify the construction method. We selected the single-input naive
Bayes classifier since the analysis is not too complicated but highlights both the method
and the difficulties that have to be overcome. As we will see, even this simplified version
exhibits interesting behavior. We fix the number of class labels to 2 as well. In the next
section, we discuss how the analysis we present here extends to the general NBC.


Given values for any of the inputs, the NBC computes the probability to see any of the
class labels as output under the assumption that the inputs influence the output indepen-
dently. The prediction is the class label that has the largest such estimated probability. For
the version of the naive Bayes classifier we consider here (i.e. a single input), the prediction
given input x_i is:

    ζ(x_i) = argmax_{k in {1,2}} P[Y = y_k] P[X = x_i | Y = y_k]

The probabilities that appear in the formula are estimated using the counts in the contin-
gency table in Table 2. Using the fact that P[Y = y_k] P[X = x_i | Y = y_k] = P[X = x_i ∧ Y = y_k]
and the fact that the estimate of P[X = x_i ∧ Y = y_k] is N_ik/N, the prediction of the classifier is:

    ζ(x_i) = y1 if N_i1 > N_i2
             y2 if N_i1 < N_i2
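The prediction rule above can be sketched directly from a contingency table; the function name and the random tie-breaking (ties broken with probability 1/2, as discussed later in this section) are our own illustrative choices:

```python
import random

# Sketch of the single-input naive Bayes prediction rule: predict y1 for
# input x_i when N_i1 > N_i2, y2 when N_i1 < N_i2, and break ties at
# random with probability 1/2. Names and counts here are illustrative.
def nbc_predict(counts, i, rng=random.Random(0)):
    """counts[i] = (N_i1, N_i2) is row i of the contingency table."""
    n1, n2 = counts[i]
    if n1 > n2:
        return 1
    if n1 < n2:
        return 2
    return rng.choice([1, 2])   # tie broken with probability 1/2

table = [(30, 12), (7, 21)]     # hypothetical counts for x1 and x2
print(nbc_predict(table, 0), nbc_predict(table, 1))
```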

6.2 Computation of the Moments of GE
Under the already stated data generation model, the moments of the generalization error for
the NBC can be computed. We now present three approaches for computing the moments
and show that the approach using theorem 1 is by far the most practical.
Going over datasets: If we calculate the moments by going over all possible datasets,
the number of terms in the formulation for the moments is exponential in the number of
attribute values, with the base of the exponential being the size of the dataset (i.e. O(N^n)
terms). This is because each of the cells in Table 2 can take O(N) values. The formulation
of the first moment would be as follows.

    E_{D(N)}[GE(ζ(D(N)))] = Σ_{N11=0}^{N} Σ_{N12=0}^{N-N11} ... Σ_{Nn1=0}^{N-(N11+...+N(n-1)2)} e · P[N11, ..., Nn2]    (7)
where e is the corresponding error of the classifier. We see that this formulation can be
tedious to deal with. So can we do better? Yes, we definitely can, and this stems from the
following observation. For the NBC built on Table 2, all we care about in the classification
process is the relative counts in each of the rows. Thus, if we had to classify a datapoint
with attribute value x_i we would classify it into class y1 if N_i1 > N_i2 and vice-versa. What
this means is that irrespective of the actual counts of N_i1 and N_i2, as long as N_i1 > N_i2 the
classification algorithm makes the same prediction, i.e. we have the same clas-
sifier. We can hence switch from going over the space of all possible datasets to going over
the space of all possible classifiers, with the advantage of reducing the number of terms.
Going over classifiers: If we find the moments by going over the space of possible classi-
fiers, we reduce the number of terms from O(N^n) to O(2^n). This is because there are only
two possible relations between the counts in any row (> or <). The formulation for the first
moment would then be as follows.

    E_{D(N)}[GE(ζ)] = e_1 P[N11 > N12, ..., Nn1 > Nn2] + e_2 P[N11 < N12, ..., Nn1 > Nn2] + ...
                      + e_{2^n} P[N11 < N12, ..., Nn1 < Nn2]


where e_1, e_2, ..., e_{2^n} are the corresponding errors. Though this formulation reduces the
complexity significantly, since N >> 2 for any practical scenario, the number
of terms is still exponential in n. Can we still do better? The answer is yes again. Here is
where Theorem 1 gains prominence. To restate, Theorem 1 says that while calculating the
first moment we just need to look at particular rows of the table, and to calculate the second
moment just pairs of rows. This reduces the complexity significantly without compromising
accuracy, as we will see.
Going over classifiers using Theorem 1: If we use Theorem 1, the number of terms
reduces from exponential in n to a small polynomial in n. Thus the number of terms
in finding the first moment is just O(n) and that for the second moment is O(n^2). The
formulation for the first moment would then be as follows.

    E_{D(N)}[GE(ζ)] = e_1 P[N11 > N12] + e_2 P[N11 < N12] + ... + e_{2n} P[Nn1 < Nn2]

where e_1, e_2, ..., e_{2n} are the corresponding errors. They are basically the respective cell
probabilities of the multinomial (i.e. e_1 is the probability of a datapoint belonging to the cell
x1C2, e_2 is the probability of belonging to x1C1, and so on). For the second moment we would
have joint probabilities, the expression being the following,

    E_{D(N)}[GE(ζ)²] = (e_1 + e_3) P[N11 > N12, N21 > N22] + (e_2 + e_3) P[N11 < N12, N21 > N22] + ...
                       + (e_{2n-2} + e_{2n}) P[N(n-1)1 < N(n-1)2, Nn1 < Nn2]

We have thus reduced the number of terms from O(N^n) to O(n^k), where k is small and
depends on the order of the moment we are interested in. This formulation has another
advantage. The complexity of calculating the individual probabilities is also significantly
reduced. The probabilities for the first moment can be computed in O(N²) time and those
for the second in O(N⁴) time, rather than O(N^{n-1}) and O(N^{2n-2}) time respectively.
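The O(n N²) computation of the first moment can be sketched as follows; this is our reconstruction, not the paper's code, and it includes the probability-1/2 tie-breaking term alongside the e_i P[...] terms above. The brute-force enumeration over all datasets is included only as a correctness check for tiny N:

```python
from math import comb
from itertools import product

# Sketch (our reconstruction): for each row i of Table 2 only the pair
# (N_i1, N_i2) matters, and it follows a trinomial distribution with
# parameters (N, p_i1, p_i2). Ties are broken with probability 1/2.
def expected_ge(p, N):
    """p[i] = (p_i1, p_i2): cell probabilities of the multinomial."""
    total = 0.0
    for p1, p2 in p:
        q = 1.0 - p1 - p2                      # mass of all other rows
        for a in range(N + 1):                 # a = N_i1
            for b in range(N + 1 - a):         # b = N_i2
                pr = comb(N, a) * comb(N - a, b) * p1**a * p2**b * q**(N - a - b)
                if a > b:
                    total += pr * p2           # predict y1 -> err on cell (i, 2)
                elif a < b:
                    total += pr * p1           # predict y2 -> err on cell (i, 1)
                else:
                    total += pr * (p1 + p2) / 2.0
    return total

def expected_ge_bruteforce(p, N):
    """Exact check: enumerate every cell-count assignment (all datasets)."""
    cells = [pi for row in p for pi in row]
    total = 0.0
    for counts in product(range(N + 1), repeat=len(cells)):
        if sum(counts) != N:
            continue
        pr, left = 1.0, N                      # multinomial pmf of `counts`
        for c, pi in zip(counts, cells):
            pr *= comb(left, c) * pi**c
            left -= c
        err = 0.0
        for i, (p1, p2) in enumerate(p):
            a, b = counts[2 * i], counts[2 * i + 1]
            err += p2 if a > b else p1 if a < b else (p1 + p2) / 2.0
        total += pr * err
    return total

p = [(0.3, 0.1), (0.2, 0.4)]                   # hypothetical 2-row table
g_fast = expected_ge(p, 5)
g_exact = expected_ge_bruteforce(p, 5)
print(g_fast, g_exact)
```

Both computations are exact, so they agree to machine precision; only the fast one remains feasible for realistic N.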
Further optimizations can be done by identifying independence between random vari-
ables, expressing them as binomial cdf's, and using the regularized incomplete beta
function to calculate these cdf's in essentially constant time. In fact, in later sections we
discuss the general NBC model, for which the cdf's (probabilities) cannot be computed di-
rectly as it turns out to be too expensive. There we propose strategies to efficiently
compute these probabilities. The same strategies can be used here to make the computation
more scalable.
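A sketch of the constant-time cdf evaluation mentioned above, assuming SciPy's regularized incomplete beta function betainc and the standard identity P[Bin(n, p) <= k] = I_{1-p}(n - k, k + 1); the parameter values are arbitrary:

```python
from math import comb
from scipy.special import betainc   # regularized incomplete beta I_x(a, b)

# Sketch: a binomial cdf evaluated via the regularized incomplete beta
# function (essentially constant time) versus the direct O(k) summation.
def binom_cdf_fast(k, n, p):
    # P[Bin(n, p) <= k] = I_{1-p}(n - k, k + 1), valid for 0 <= k < n.
    return betainc(n - k, k + 1, 1.0 - p)

def binom_cdf_slow(k, n, p):
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

n, p = 50, 0.3
print(binom_cdf_fast(20, n, p), binom_cdf_slow(20, n, p))
```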
The situation when ζ and ζ' are the classifiers constructed for two different folds in cross
validation requires special treatment. Without loss of generality, assume that the classifiers
are built for folds 1 and 2. If we let D1, ..., Dv be the partitioning of the dataset into v
parts, ζ is constructed using D2 U D3 U ... U Dv and ζ' is constructed using D1 U D3 U ... U Dv;
thus the training data D3 U ... U Dv is common to both. If we denote by Njk the number of
data-points with X = xj and Y = yk in this common part, and by N^(1)_jk and N^(2)_jk the number
of such data-points in D1 and D2 respectively, then we have to compute probabilities of
the form,

    P[ (Nj1 + N^(1)_j1 > Nj2 + N^(1)_j2) ∧ (Nj1 + N^(2)_j1 > Nj2 + N^(2)_j2) ]


The estimation of this probability using the above method requires fixing the values of 6
random variables, thus giving an O(N^6) algorithm. Again, further optimizations can be
carried out using the strategies in Sections 7 and 8.
Using the moments of GE, the moments of HE and CE are found using relationships
already derived.

6.3 Moment Comparison of Test Metrics

In the previous subsections we pointed out how the moments of the generalization error
can be computed. In Section 4 we established connections between the moments of the
generalization error (GE) and the moments of the hold-out-set error (HE) and cross valida-
tion error (CE). None of these relationships between the moments of these errors
and the moments of the generalization error gives a direct indication of the behavior of the
different types of errors in specific circumstances. In this section we provide a graphical,
simple to interpret representation of these moments for specific cases. While these visual
representations do not replace general facts, they greatly improve our understanding, al-
lowing rediscovery of empirically discovered properties of the error metrics and portraying
the flexibility of the method to model different scenarios.


In our first study we want to establish the prediction behavior of the naive Bayes classifier
as a function of the probability of seeing the first class label, which we call the prior and denote
by p, and the amount of correlation between the input and the output. To this end, we
consider the situation when the domain of the input variable consists of {x1, x2}, the size of
the dataset is N = 1000, and the underlying probability distribution is given by the following
probabilities:

    P[X = x1 ∧ Y = y1] = (p/2)(1 + c)
    P[X = x2 ∧ Y = y1] = (p/2)(1 - c)

    P[X = x2 ∧ Y = y2] = ((1 - p)/2)(1 + c)
    P[X = x1 ∧ Y = y2] = ((1 - p)/2)(1 - c)

where c is the correlation coefficient, which we vary between 0 and 1 (the behavior is sym-
metric when c is varied between -1 and 0). The behavior of the expectation of the gen-
eralization error is depicted in Figure 1. For the most part, the behavior is expected and
reaffirms the fact that the naive Bayes classifier is optimal (it has been shown to be optimal
by Domingos and Pazzani (1997) if the independence between inputs holds; this is the case
here since there is only one input). The bumps in the graphs (they appear more clearly at a
higher resolution) are due to the way ties are broken, i.e. when the counts are equal,
each of the class labels is predicted with probability 1/2. They appear only in the
region of the graph in which the two counts compete; had the ties been broken with
probabilities p and 1 - p, the graph would have been smooth. The ability to make such


predictions shows the power of the moment analysis method we introduced here; it was
much easier to observe such behavior visually than from formulae.

Our second study involves the dependency of the hold-out-set error on the splitting of the
data into testing (the hold-out set) and training. We consider datasets of size N = 100 and
we set the underlying probability distribution such that the output is independent of the
input and the prior probability of the first class label is p = 0.4. To get insight into the
behavior of HE, we plotted the expectation in Figure 2, the variance in Figure 3 and the sum
of the expectation and one standard deviation in Figure 4. As expected, the expectation of
HE grows as the size of the training dataset is reduced. On the other hand, the variance
decreases until the size of the test data reaches 50%, then it increases slightly. The general
downward trend is predictable using an intuitive understanding of the naive Bayes classifier,
but the upward trend that follows is not. We believe that the behavior in the second
part of the graph is due to the fact that the classifier becomes unstable as
the size of the training dataset is reduced, and this competes with the reduction due to the
increase in the size of the testing data. Our methodology established this fact exactly, without
the doubts associated with intuitively determining which of the two factors is dominant.
From the plot of the sum of the expectation and the standard deviation of HE in
Figure 4, which indicates the pessimistic expected behavior, a good choice for the size of
the test set is 40-50% for this particular instance. This best split depends on the size of the
dataset and is hard to select based on intuition alone.

In our third study we used the same setup described above for cross validation and de-
picted the results in Figures 5-9. As the number of folds increases, the following trends
are observed: (a) the expectation of CE decreases (Figure 5) since the size of the training data
increases, (b) the variance of the classifier for each of the folds increases (Figure 6) since the
size of the test data decreases, (c) the covariance between the estimates of different folds de-
creases first and then increases again (Figure 7), a behavior we explain below, and the same
trend is observed for the total variance of CE (Figure 8) and the sum of the expectation
and the standard deviation of CE (Figure 9). Observe that the minimum of the sum of the
expectation and the standard deviation (which indicates the pessimistic expected behavior)
is around 10-20 folds, which coincides with the number of folds usually recommended.
A possible explanation for the behavior of the covariance between the estimates of
different folds is based on the following two observations. First, when the number of folds is
small, the errors of the estimates have large correlations despite the fact that the classifiers
are negatively correlated; this happens because almost the entire training dataset of one
classifier is the test set for the other (2-fold cross validation in the extreme). Due to this,
though the classifiers built may be similar or different, their errors are strongly positively
correlated. Second, for a large number of folds (the leave-one-out situation in the extreme), there
is a huge overlap between the training sets, thus the classifiers built are almost the same
and so the corresponding errors they make are again highly correlated. These two opposing

2. The plateau at 0.4 is E[GE(ζ)].

Figure 1: Dependency of the expectation of the generalization error on the correlation
coefficient for various priors: (a) Prior = 0.1, (b) Prior = 0.3, (c) Prior = 0.5
(x-axis: correlation coefficient).

Figure 2: Hold-out-set expectation (x-axis: size of HOS (%)).

Figure 3: Hold-out-set variance (x-axis: size of HOS (%)).

Figure 4: E[·] + sqrt(Var(·)) for HOS (x-axis: size of HOS (%)).

Figure 5: Expectation of CE (x-axis: number of folds).

Figure 6: Individual run variance of CE (x-axis: number of folds).

Figure 7: Pairwise covariances of CV (x-axis: number of folds).

Figure 8: Total variance of cross validation (x-axis: number of folds).

Figure 9: E[·] + sqrt(Var(·)) of CV (x-axis: number of folds).

Figure 10: Convergence (x-axis: size of dataset).

Table 3: Notation

  Symbol     Semantics
  p_C1       prior of class C1
  p_C2       prior of class C2
  p_i1^x     joint probability of being in x_i, C1
  p_i2^x     joint probability of being in x_i, C2
  p_i1^y     joint probability of being in y_i, C1
  p_i2^y     joint probability of being in y_i, C2
  N1         r.v. denoting the number of datapoints in class C1
  N2         r.v. denoting the number of datapoints in class C2
  N_i1^x     r.v. denoting the number of datapoints in x_i, C1
  N_i2^x     r.v. denoting the number of datapoints in x_i, C2
  N_i1^y     r.v. denoting the number of datapoints in y_i, C1
  N_i2^y     r.v. denoting the number of datapoints in y_i, C2
  N_ijk      r.v. denoting the number of datapoints in x_i, y_j, C_k
  N          size of the dataset

trends produce the U-shaped curve of the covariance. This has a significant effect on the
overall variance, and so the variance also has a similar form with the minimum around 10
folds. Predicting this behavior using only intuition, a reasonable number of experiments, or
theory alone is unlikely, since it is not clear what the interaction between the two trends is.
In the studies for both hold-out-set error and cross validation error, we observed
similar qualitative results for different priors and for various degrees of correlation between
the inputs and outputs.


The purpose of the last study we report was to determine the dependency of the three errors
on the size of the dataset, which indicates the convergence behavior and relative merits of
hold-out-set and cross validation. In Figure 10 we plotted the expected value of GE, HE
and CE for the same underlying distribution as in the CE study; the size of the hold-out set
for HE was set to 11 (which we found to be the best in experiments not reported for lack of
space) and 20 folds were used for CE. As can be observed from the figure, the error of hold-out-set
is significantly larger for small datasets. The expectation of cross validation is almost on
par with the expectation of the generalization error. This property of cross validation, to
reliably estimate the generalization error, is known from empirical studies. But the method
can be used to estimate how quickly (at what dataset size) HE and CE converge to GE.

6.4 Extension to Higher Dimensions

All that we have seen so far was for the 1-dimensional case. What happens when the
dimensionality increases? Can we still compute the moments efficiently? These
questions are addressed in the next couple of sections.




Figure 11: The figure depicts the scenario when we have two attributes, each having two
values, with 2 class labels.

7. Calculation of basic probabilities
Having come up with the probabilistic formulation for discerning the moments of the gen-
eralization error, we are now faced with the daunting task of efficiently calculating the
probabilities involved for the NBC when the number of dimensions is more than 1. In this
section we mainly discuss single probabilities; the extension to joint probabilities is
in the next section. Let us now briefly preview the kind of probabilities we need to decipher.
With reference to Figure 11, considering the cell x1y1 (w.l.o.g.) and by the naive Bayes
independence assumption, we need to find the probability of the following condi-
tion being true for the 2-dimensional case,

    p_C1 (p_11^x / p_C1)(p_11^y / p_C1) > p_C2 (p_12^x / p_C2)(p_12^y / p_C2)
    i.e.  p_11^x p_11^y / p_C1 > p_12^x p_12^y / p_C2
    i.e.  N2 N_11^x N_11^y > N1 N_12^x N_12^y

(the last step substitutes the count-based estimates of the probabilities and cross-multiplies
by N1 N2).

In general, for the d-dimensional (d >= 2) case we have to find the following probability,

    P[ N2^{d-1} N_11^{x1} N_11^{x2} ... N_11^{xd} > N1^{d-1} N_12^{x1} N_12^{x2} ... N_12^{xd} ]

7.1 Direct Calculation
We can find the probability P[N2^{d-1} N_11^{x1} ... N_11^{xd} > N1^{d-1} N_12^{x1} ... N_12^{xd}] by summing
over all possible assignments of the multinomial random variables involved. For the 2-di-
mensional case shown in Figure 11 we have,

    P[N2 N_11^x N_11^y > N1 N_12^x N_12^y] =
        Σ_{N111} Σ_{N121} Σ_{N211} Σ_{N112} Σ_{N122} Σ_{N212} Σ_{N222}
            P[N111, N121, N211, N221, N112, N122, N212, N222] ·
            I[N2 N_11^x N_11^y > N1 N_12^x N_12^y]

where N2 = N112 + N122 + N212 + N222, N_11^x = N111 + N121, N_11^y = N111 + N211,
N1 = N - N2, N_12^x = N112 + N122, N_12^y = N112 + N212, and I[condition] = 1 if the
condition is true, else I[condition] = 0.
Each of the summations takes O(N) values, and so the worst case time complexity is O(N^7).
We thus observe that for the simple scenario depicted, the time to compute the probabilities
is unreasonable even for small datasets (N = 100, say). The number of summations
increases linearly with the dimensionality of the space; hence, the time complexity is
exponential in the dimensionality. We thus need to resort to approximations to speed up
the process.
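The direct calculation can be sketched for a tiny N (the cell probabilities below are hypothetical); even at N = 4 the eight-fold enumeration is already noticeably slow, which illustrates the O(N^7) blow-up:

```python
from math import comb
from itertools import product

# Sketch of the direct calculation: enumerate all assignments of the 8
# multinomial cells N_ijk (i, j, k in {1, 2}) and sum the probability
# mass of those satisfying N2 * N11x * N11y > N1 * N12x * N12y.
def direct_probability(p, N):
    """p: probabilities of the 8 cells, ordered (N111, N121, N211, N221,
    N112, N122, N212, N222)."""
    prob = 0.0
    for c in product(range(N + 1), repeat=8):
        if sum(c) != N:
            continue
        pr, left = 1.0, N                        # multinomial pmf of c
        for ci, pi in zip(c, p):
            pr *= comb(left, ci) * pi**ci
            left -= ci
        n111, n121, n211, n221, n112, n122, n212, n222 = c
        n2 = n112 + n122 + n212 + n222           # datapoints in class C2
        n1 = N - n2
        n11x, n11y = n111 + n121, n111 + n211    # (x1, C1) and (y1, C1)
        n12x, n12y = n112 + n122, n112 + n212    # (x1, C2) and (y1, C2)
        if n2 * n11x * n11y > n1 * n12x * n12y:
            prob += pr
    return prob

p = (0.25, 0.10, 0.10, 0.05, 0.05, 0.10, 0.10, 0.25)  # hypothetical cells
prob = direct_probability(p, 4)
print(prob)
```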

7.2 Approximation Techniques
If all the moments of a random variable are known, then we know the moment generating
function (MGF) of the random variable and, as a consequence, the probability generating
function and hence the precise cdf for any value in the domain of the random variable. If
only a subset of the moments is known, then we can at best approximate the MGF and thus
the cdf.
We need to compute probabilities (cdf's) of the form P[X > 0], where X =
N2^{d-1} N_11^{x1} ... N_11^{xd} - N1^{d-1} N_12^{x1} ... N_12^{xd}. Most of the alternative approximation tech-
niques we propose in the subsections that follow, to efficiently compute the above proba-
bilities (cdf's), are based on the fact that we have knowledge of some finite subset of the
moments of the random variable X. We now elucidate a method to obtain these moments.

Derivation of Moments: As previously mentioned, the data generative model we use
is a multinomial distribution, and we know its moment generating function. A moment
generating function generates all the moments of a random variable, uniquely defining its
distribution. The MGF of a multivariate distribution is defined as follows,

    M_R(t) = E(e^{R' t})    (9)

where R is a q-dimensional random vector, R' is the transpose of R and t ∈ R^q. In our case q
is the number of cells in the multinomial.
Taking different order partial derivatives of the moment generating function w.r.t. the
elements of t and setting those elements to zero gives us moments of products of the
random variables in the multinomial raised to those orders. Formally,

    ∂^{v1+v2+...+vq} M_R(t) / (∂t1^{v1} ∂t2^{v2} ... ∂tq^{vq}) |_{t1=t2=...=tq=0} = E(R1^{v1} R2^{v2} ... Rq^{vq})    (10)

where R' = (R1, R2, ..., Rq), t = (t1, t2, ..., tq) and v1, v2, ..., vq are the orders of the partial
derivatives w.r.t. t1, t2, ..., tq respectively.
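For the multinomial, the derivatives in (10) yield standard closed-form mixed moments, e.g. E[N_i N_j] = N(N-1) p_i p_j for i ≠ j and E[N_i²] = N p_i + N(N-1) p_i²; a sketch (standard identities, not the paper's code) verifying them by direct enumeration:

```python
from math import comb
from itertools import product

# Sketch: verify two standard multinomial mixed-moment identities by
# exhaustively enumerating all outcomes of a small multinomial.
def enumerated_moment(p, N, powers):
    """E[prod_i N_i^powers[i]] by summing over all multinomial outcomes."""
    total = 0.0
    for c in product(range(N + 1), repeat=len(p)):
        if sum(c) != N:
            continue
        pr, left = 1.0, N                       # multinomial pmf of c
        for ci, pi in zip(c, p):
            pr *= comb(left, ci) * pi**ci
            left -= ci
        term = 1.0
        for ci, v in zip(c, powers):
            term *= ci**v
        total += pr * term
    return total

p, N = (0.5, 0.3, 0.2), 6                       # hypothetical cells
print(enumerated_moment(p, N, (1, 1, 0)), N * (N - 1) * p[0] * p[1])
print(enumerated_moment(p, N, (2, 0, 0)), N * p[0] + N * (N - 1) * p[0]**2)
```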


The expressions for these derivatives can be precomputed or computed at run time
using tools such as Mathematica (Wolfram Research). But how does all of what we have
just discussed relate to our problem? Consider the 2-dimensional case given in Figure
11. We need to find the probability P[Z > 0], where Z = N2 N_11^x N_11^y - N1 N_12^x N_12^y. The
individual terms in each product can be expressed as sums of certain random variables
of the multinomial. Thus Z can be written as a sum of products of the
multinomial random variables. Consider the first term in Z,

    N2 N_11^x N_11^y = (N112 + N122 + N212 + N222)(N111 + N121)(N111 + N211)
                     = N112 N111 N111 + ... + N222 N121 N211

The second term can also be expressed in this form.
    E(Z) = E(N2 N_11^x N_11^y - N1 N_12^x N_12^y)
         = E(N2 N_11^x N_11^y) - E(N1 N_12^x N_12^y)
         = E(N112 N111 N111 + ... + N222 N121 N211) -
           E(N111 N112 N112 + ... + N221 N122 N212)
         = E(N112 N111 N111) + ... + E(N222 N121 N211) -
           E(N111 N112 N112) - ... - E(N221 N122 N212)

These expectations can be computed using the technique in the discussion before. Higher
moments can also be found in the same vein since we would only need to find expectations
of higher degree polynomials in the random variables of the multinomial. Similarly, the ex-
pressions for the moments in higher dimensions will also include higher degree polynomials.

The Edgeworth or the Gram-Charlier A series (Hall, 1992) are used to approximate distribu-
tions of random variables whose moments, or more specifically cumulants, are known. These
expansions consist of writing the characteristic function of the unknown distribution, whose
probability density is to be approximated, in terms of the characteristic function of another
known distribution (usually normal). The density to be found is then recovered by taking
the inverse Fourier transform.
Let φ_ud(t), p_ud(x) and κ_i be the characteristic function, probability density function and
the ith cumulant of the unknown distribution respectively, and let φ_kd(t), p_kd(x) and γ_i be
the characteristic function, probability density function and the ith cumulant of the known
distribution respectively. Hence,

    φ_ud(t) = exp[ Σ_i (κ_i - γ_i)(it)^i / i! ] φ_kd(t)
    p_ud(x) = exp[ Σ_i (κ_i - γ_i)(-D)^i / i! ] p_kd(x)

where D is the differential operator. If p_kd(x) is a normal density then we arrive at the
following expansion,

    p_ud(x) = (1/sqrt(2π κ2)) e^{-(x - κ1)²/(2κ2)} [ 1 + (κ3/κ2^{3/2}) H3((x - κ1)/sqrt(κ2)) + ... ]    (11)

where H3(x) = (x³ - 3x)/3! is the 3rd Hermite polynomial. This method works reasonably
well in practice, as can be seen in (Levin, 1981) and (Butler and Sutton, 1998). The major
challenge, though, lies in choosing a distribution that will approximate the unknown distri-
bution well, as the accuracy of the cdf estimate depends on this. The performance of the
method may vary significantly with the choice of this distribution, since choosing the normal
distribution may not always give satisfactory results. This task of choosing an appropriate
distribution is non-trivial.
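A sketch of Equation (11) with the normal distribution as the known reference, using only the first three cumulants; the function and cumulant values are our own illustration:

```python
import math

# Sketch of the Gram-Charlier expansion (11): approximate the unknown
# density from its first three cumulants k1, k2, k3, with H3 defined as
# (x^3 - 3x)/3! as in the text. Cumulant values below are hypothetical.
def gram_charlier_density(x, k1, k2, k3):
    z = (x - k1) / math.sqrt(k2)
    h3 = (z**3 - 3.0 * z) / math.factorial(3)          # H3(z)
    normal = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi * k2)
    return normal * (1.0 + k3 / k2**1.5 * h3)

# With k3 = 0 the correction vanishes and we recover the normal density.
print(gram_charlier_density(0.7, 0.0, 1.0, 0.0))
```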

We have just seen a method of approximating the cdf using series expansions. Interestingly,
this problem can also be framed as an optimization problem, wherein we find upper and
lower bounds on the possible values of the cdf by optimizing over the set of all possible
distributions having these moments. Since our unknown distribution is an element of this
set, its cdf will lie within the bounds computed. This problem is called the classical moment
problem and has been studied in the literature (Isii, 1960, 1963; Karlin and Shapely, 1953).
In fact, when up to 3 moments are known, there are closed-form solutions for the bounds (Prekopa,
1989). In the material that follows, we present the optimization problem in its primal and
dual forms. We then explore strategies for solving it, given that the most obvious
ones can prove to be computationally expensive.
Assume that we know m moments of the discrete random variable X, denoted by
μ1, ..., μm, where μj is the jth moment. The domain of X is given by U = {x0, x1, ..., xn} and
P[X = x_r] = p_r, where r ∈ {0, 1, ..., n} and Σ_r p_r = 1. We only discuss the maximization
version of the problem (i.e. finding the upper bound) since the minimization version (i.e.
finding the lower bound) has an analogous description. Thus, in the primal space we have
the following formulation,

    max  P[X <= x_r] = Σ_{i=0}^{r} p_i,  r <= n

    subject to:  Σ_{i=0}^{n} p_i = 1
                 Σ_{i=0}^{n} x_i p_i = μ1
                 ...
                 Σ_{i=0}^{n} x_i^m p_i = μm
                 p_i >= 0, ∀i
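This primal LP can be handed to an off-the-shelf solver; the sketch below (our illustration, with a hypothetical distribution on {0, ..., 10} and m = 2 known moments) uses scipy.optimize.linprog and checks that the bound dominates the true cdf:

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: upper-bound P[X <= x_r] given the first two moments of X.
# The "true" distribution is hidden from the LP; only its moments enter.
xs = np.arange(11)                                  # domain U = {0, ..., 10}
true_p = np.full(11, 1.0 / 11.0)                    # hypothetical distribution
mu1 = float(true_p @ xs)                            # known moment mu_1
mu2 = float(true_p @ xs**2)                         # known moment mu_2

r = 4                                               # bound the cdf at x_r = 4
c = -(xs <= xs[r]).astype(float)                    # maximize sum_{i<=r} p_i

# Equality constraints: normalization plus the two moment constraints.
A_eq = np.vstack([np.ones_like(xs, dtype=float), xs.astype(float), xs**2.0])
b_eq = np.array([1.0, mu1, mu2])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
upper_bound = -res.fun
print(upper_bound)
```

The true distribution is feasible for the LP, so the optimum is guaranteed to be at least the true value of P[X <= x_r].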
Solving the above optimization problem gives us an upper bound on P[X <= x_r]. On
inspecting the formulation of the objective and the constraints, we notice that it is a Linear
Programming (LP) problem with m + 1 equality constraints and one set of inequality constraints. The
number of optimization variables (i.e. the p_i's) is equal to the size of the domain of X, which
can be large. For example, in the 2-dimensional case when X = N2 N_11^x N_11^y - N1 N_12^x N_12^y, X
takes O(N²) values (as N1 constrains N2 given N, and N1 also constrains the maximum
values of N_11^x, N_11^y, N_12^x and N_12^y). Thus with N as small as 100 we already have around
10000 (ten thousand) variables. In an attempt to control the explosion in computation,
we derive the dual formulation of our problem. A dual is a complementary problem, and
solving the dual is usually easier than solving the primal, since the dual is always convex
for primal maximization problems (concave for minimization) irrespective of the form of the
primal. For most convex optimization problems (which include LP) the optimal solution
of the dual is the optimal solution of the primal; technically speaking, only if Slater's
conditions are satisfied (i.e. the duality gap is zero). Maximization (minimization) problems in
the primal space map to minimization (maximization) problems in the dual space. For a
maximization standardized LP problem the primal and dual have the following forms,

    max  c^T x
    subject to:  A x <= b,  x >= 0

    min  b^T y
    subject to:  A^T y >= c,  y >= 0

where y is the dual variable.

For our LP problem the dual is the following,

    min  Σ_{k=0}^{m} μ_k y_k

    subject to:  Σ_{k=0}^{m} x^k y_k - 1 >= 0,  ∀ x ∈ W
                 Σ_{k=0}^{m} x^k y_k >= 0,  ∀ x ∈ U

where μ0 = 1, the y_k represent the dual variables and W represents the subset of U over which the cdf is
computed. We observe that the number of variables is reduced to just m + 1 in the dual
formulation, but the number of constraints has increased to the size of the domain of X. We
now propose some strategies to solve this optimization problem, discuss their shortcomings,
and eventually suggest the strategy we prefer.

Using Standard Linear Programming Solvers: We have a linear programming prob-
lem whose domain is discrete and finite. On careful inspection of our problem, we observe
that the number of variables in the primal formulation and the number of constraints in the
dual increase exponentially with the dimensionality of the space (i.e. the domain of the r.v.
X). Though the current state-of-the-art LP solvers (using interior point methods) can rapidly solve
linear optimization problems of the order of thousands of variables and constraints,
our problem can exceed these counts by a significant margin even for moderate dataset
sizes and reasonable dimension, thus becoming computationally intractable. Since standard
methods for solving this LP can prove to be inefficient, we investigate other possibilities.
In the next three approaches we extend the domain of the random variable X to include
all the integers between the extremeties of the original domain. The current domain is thus


a superset of the original domain and so are the possible distributions. Another way of
looking at it is, in the space of y's we have a superset of the original set of constraints.
Thus the upper bound calculated in this scenario will be greater than or equal to the upper
bound of the original problem. This extension is done to enhance the performance of the
next two approaches, since it sidesteps the problem of explicitly enumerating the domain
of X, and is a requirement for the third, as we will soon see.

Gradient descent with binary search(GD): We use gradient descent on the dual to
find new values of the vector y = [y_0, ..., y_m], with a reasonably large step length: since the
objective to be minimized is an affine function, a large step does not cause the overshooting
problems it would when optimizing general convex functions. Fixing y and assuming x to be continuous, the
two equations representing the inequalities above can be viewed as polynomials in x. The
basic intuition in the method we propose is that if the polynomials are always non-negative
within the domain of X then all the constraints are satisfied, else some of the constraints
are violated. To check if the polynomials are always non-negative we find their roots and
perform the following checks. The polynomials will change sign only at their roots, and
hence we need to carefully examine their behavior at these points. Here are the details of
the algorithm.
We check if the roots of the polynomial lie within the extremities of the domain of X.

1. If they don't, then we check whether the value of the polynomial at any point within this
range satisfies the inequalities. If the inequalities are satisfied, we jump to the
next y using gradient descent, storing the current value of y in place of the previously
stored one (if it exists). If the inequalities are violated, we reject the value of y
and perform a binary search between this value and the previous legal value of y
along the gradient, until we reach the value that minimizes the objective while satisfying the constraints.

2. If they do, then we check the value of the constraints at the two extremities. If satisfied,
and if there exists only one root in the range, we store this value of y and go on to the
next. If there are multiple roots, then we check to see if consecutive roots have any
integral values between them. If not, we again store this value of y and move to the
next. Else we verify, for a point between the roots, that the constraints are satisfied,
based on which we either store or reject the value. On rejecting, we perform the same
binary-search procedure mentioned above.
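The two checks above can be sketched as follows, assuming the extended integer domain. The helper name, tolerances, and representation are our own; `numpy.roots` finds the roots via companion-matrix eigenvalues:

```python
import numpy as np

def constraints_satisfied(y, lo, hi):
    """Check p(x) = sum_k y[k] x^k >= 0 at every integer x in [lo, hi]
    without enumerating the domain: the polynomial can change sign only
    at its real roots, so it suffices to test the endpoints and one
    integer inside each interval between consecutive roots."""
    p = np.polynomial.Polynomial(y)
    roots = np.roots(y[::-1])  # numpy.roots wants highest-degree-first
    real = sorted(r.real for r in roots
                  if abs(r.imag) < 1e-9 and lo < r.real < hi)
    points = [lo, hi]
    cuts = [lo] + real + [hi]
    for left, right in zip(cuts, cuts[1:]):
        k = int(np.floor(left)) + 1  # smallest integer strictly above `left`
        if k < right:
            points.append(k)
    return all(p(t) >= -1e-9 for t in points)
```

In the algorithm above, this check decides whether the current y is kept or whether the binary search back toward the last feasible y is triggered.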

Checking if consecutive roots of the polynomial have values in the domain of X is where
the extension of the domain to include all integers between the extremities helps in enhancing
performance. In the absence of this extension we would need to find if a particular set of
integers lies in the domain of X. This operation is expensive for large domains. But with the
extension all the above operations can be performed efficiently. Finding roots of polynomi-
als can be done extremely efficiently even for high degree polynomials by various methods,
such as computing eigenvalues of the companion matrix (Edelman and Murakami, 1995), as
is implemented in Matlab. Since the number of roots is just the degree of the polynomial,
which is the number of moments, the above mentioned checks can be done quickly. The
binary search takes log(t) steps, where t is the step length. Thus the entire optimization


can be done efficiently. Nonetheless, the method suffers from the following pitfall. The final
bound is sensitive to the initial value of y. Depending on the initial y we might stop at dif-
ferent values of the objective on hitting some constraint. We could thus have a suboptimal
value as our solution as we only descend on the negative of the gradient. We can somewhat
overcome this drawback by making multiple random restarts.

Gradient descent with local topology search(GDTS): Perform gradient descent as
mentioned before. Choose a random set of points around the current best solution. Again
perform gradient descent on the feasible subset of the chosen points. Choose the best
solution and repeat until some reasonable stopping criterion is met. This works well sometimes in
practice, but not always.

Prekopa's Algorithm(PA): Prekopa (Prekopa, 1989) gave an algorithm for the discrete
moment problem. In his algorithm we maintain an (m + 1) × (m + 1) matrix called the basis
matrix B, which needs to have a particular structure to be dual feasible. We iteratively
update the columns of this matrix until it becomes primal feasible, resulting in the optimal
solution to the optimization problem². The issue with this algorithm is that there is no
guarantee w.r.t. the time required for the algorithm to find this primal feasible basis.
In the remaining approaches we further extend the domain of the random variable X
to be continuous within the given range. Again, for the same reason described before, the
resulting bound is an upper bound on that of the original problem. It is also worthwhile noting that the feasibility region of the op-
timization problem is convex, since the objective and the constraints are convex (actually
affine). Standard convex optimization strategies can't be used, since the equation of the
boundary is unknown and the length of the description of the problem is large.

Sequential Quadratic Programming(SQP): Sequential Quadratic Programming is a
method for non-linear optimization. It is known to have local convergence for non-linear
non-convex problems and will thus globally converge in the case of convex optimization.
The idea behind SQP is the following. We start with an initial feasible point, say y_0.
The original objective function is then approximated by a quadratic function around y_0,
which then is the objective for that particular iteration. The constraints are approximated
by linear constraints around the same point. The solution of the quadratic program is a
direction vector along which the next feasible point should be chosen. The step length can
be found using standard line search procedures or more sophisticated merit functions. On
deriving the new feasible point, the procedure is repeated until a suitable stopping criterion is met.
Thus at every iteration a quadratic programming problem is solved.
Let f(y), ceqj(y) and cieqj(y) be the objective function, the jth equality constraint
and the jth inequality constraint respectively. For the current iterate y_k we have the following
quadratic optimization problem,

min_{d_k} Θ_k(d_k) = f(y_k) + ∇f(y_k)^T d_k + (1/2) d_k^T ∇²L(y_k, λ_k) d_k

2. for explanation of the algorithm read (Prekopa, 1989)


Figure 12: The current iterate y_k just satisfies the constraint c_l and easily satisfies the
other constraints. Suppose c_l is Σ_{j=0}^{m} y_j x_l^j, where x_l is a value of X; then in the
diagram on the left we observe that for the kth iteration y = y_k the polynomial
Σ_{j=0}^{m} y_j x^j has a minimum at X = x_l, with the value of the polynomial being
α. This is also the value of c_l evaluated at y = y_k.

subject to: ceq_i(y_k) + ∇ceq_i(y_k)^T d_k = 0; i ∈ E
cieq_i(y_k) + ∇cieq_i(y_k)^T d_k ≥ 0; i ∈ I

where Θ_k(d_k) is the quadratic approximation of the objective function around y_k. The term
f(y_k) is generally dropped from the above objective since it is a constant at any particular
iteration and has no bearing on the solution. ∇²L(·) is the Hessian of the Lagrangian w.r.t.
y, E and I are the sets of indices for the equality and inequality constraints respectively, and
d_k is the direction vector which is the solution of the above optimization problem. The next
iterate y_{k+1} is given by y_{k+1} = y_k + α_k d_k, where α_k is the step length.
For our specific problem the objective function is affine, thus a quadratic approximation
of it yields the original objective function. We have no equality constraints. For the in-
equality constraints, we use the following idea. The two equations representing the infinite
number of linear constraints given in the dual formulation can be perceived as being poly-
nomials in x with coefficients y. For a particular iteration with iterate(y) known, we find



the lowest value that the polynomials take. This value is the value of the most violated (if
some constraints are violated) or just satisfied (if no constraint is violated) linear constraint.
This is shown in Figure 12, where the constraint c_l = Σ_{j=0}^{m} y_j x_l^j is just satisfied. With this in
view we arrive at the following formulation of our optimization problem at the kth iteration:

min μ^T d_k

subject to: Σ_{r=0}^{m} y_r^k x̂^r + Σ_{r=0}^{m} x̂^r d_{k,r} ≥ 0, where y_k = [y_0^k, ..., y_m^k]³ and x̂ is the value of x at which the polynomial attains its lowest value.

This technique gives a sense of the non-linear boundary traced out by the constraints.
The above mentioned values can be deduced by finding the roots of the derivatives of the 2
polynomials w.r.t. x and then finding the minimum of the polynomials evaluated at the real
roots of their derivatives. The number of such roots is bounded by the number of moments; in fact it
is equal to m − 1. Since this approach does not require the enumeration of each of the linear
constraints and operations described are fast with results being accurate, this turns out to
be a good option for solving this optimization problem. We carried out the optimization
using the Matlab function fmincon and the procedure just illustrated.
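The same idea can be sketched with SciPy's SLSQP in place of Matlab's fmincon. The moment values, the interval, and the event set W below are hypothetical stand-ins; the infinite family of linear constraints is collapsed into a single minimum-of-the-polynomial constraint computed from the roots of the derivative:

```python
import numpy as np
from scipy.optimize import minimize

def poly_min(y, lo, hi):
    """Lowest value of p(x) = sum_k y[k] x^k on [lo, hi]: attained at an
    endpoint or at a real root of p'(x)."""
    p = np.polynomial.Polynomial(y)
    crit = [r.real for r in p.deriv().roots()
            if abs(r.imag) < 1e-9 and lo <= r.real <= hi]
    return min(p(t) for t in [lo, hi] + crit)

# Hypothetical instance: bound P[X >= 0] for a r.v. on [-5, 5] with
# assumed moments E[X^0], E[X^1], E[X^2] (made-up values).
mu = np.array([1.0, -1.0, 2.0])
cons = [
    {"type": "ineq", "fun": lambda y: poly_min(y, 0.0, 5.0) - 1.0},  # p >= 1 on W
    {"type": "ineq", "fun": lambda y: poly_min(y, -5.0, 5.0)},       # p >= 0 on U
]
# y0 = [1, 0, 0] gives p(x) = 1, which is feasible, so the bound starts at 1
res = minimize(lambda y: mu @ y, x0=[1.0, 0.0, 0.0], constraints=cons,
               method="SLSQP")
print(res.fun)  # an upper bound on P[X >= 0]
```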

Semi-definite Programming(SDP): A semi-definite programming problem has a linear
objective, linear equality constraints and linear matrix inequality(LMI) constraints. Here
is an example formulation,

min c^T q

subject to: q_1 F_1 + ... + q_n F_n + H ⪯ 0
Aq = b

where H, F_1, ..., F_n are symmetric matrices, q ∈ R^n, b ∈ R^p and A ∈ R^{p×n}.
SDP's can be efficiently solved by interior point methods. As it turns out, we can express
our semi-infinite LP as a SDP.
Consider the constraint c₁(x) = Σ_{r=0}^{m} y_r x^r. The constraint c₁(x) satisfies c₁(x) ≥ 0
∀x ∈ [a, b] iff there exists an (m + 1) × (m + 1) positive semidefinite matrix S such that,

Σ_{i+j=2l−1} S(i, j) = 0;  l = 1, ..., m

Σ_{k=0}^{l} Σ_{r=k}^{m} y_r C(r, k) C(m − r, l − k) a^{r−k} b^k = Σ_{i+j=2l} S(i, j);  l = 0, ..., m

where C(n, k) denotes the binomial coefficient, and S ⪰ 0 means S is positive semidefinite.

The proof of this result is given in (Bertsimas and Popescu, 1998).
We derive the equivalent semidefinite formulation for the second constraint, c₂(x) =
Σ_{r=0}^{m} y_r x^r − 1, to be greater than or equal to zero. To accomplish this, we replace y₀ by
y₀ − 1 in the above set of equalities, since c₂(x) = c₁(x) − 1. Thus ∀x ∈ [a, b] we have the
following semidefinite formulation for the second constraint,

3. y_r^k is the value of y_r at the kth iteration

Σ_{i+j=2l−1} S(i, j) = 0;  l = 1, ..., m

Σ_{k=1}^{l} Σ_{r=k}^{m} y_r C(r, k) C(m − r, l − k) a^{r−k} b^k + Σ_{r=1}^{m} y_r C(m − r, l) a^r + (y₀ − 1) C(m, l) = Σ_{i+j=2l} S(i, j);  l = 1, ..., m

Σ_{r=1}^{m} y_r a^r + y₀ − 1 = S(0, 0)


Combining the above 2 results we have the following semidefinite program with O(m²) variables,

min Σ_{r=0}^{m} μ_r y_r

subject to: Σ_{i+j=2l−1} G(i, j) = 0;  l = 1, ..., m

Σ_{k=1}^{l} Σ_{r=k}^{m} y_r C(r, k) C(m − r, l − k) a^{r−k} b^k + Σ_{r=1}^{m} y_r C(m − r, l) a^r + (y₀ − 1) C(m, l) = Σ_{i+j=2l} G(i, j);  l = 1, ..., m

Σ_{r=1}^{m} y_r a^r + y₀ − 1 = G(0, 0)

Σ_{i+j=2l−1} Z(i, j) = 0;  l = 1, ..., m

Σ_{k=0}^{l} Σ_{r=k}^{m} y_r C(r, k) C(m − r, l − k) b^{r−k} c^k = Σ_{i+j=2l} Z(i, j);  l = 0, ..., m

G ⪰ 0, Z ⪰ 0

G and Z are (m + 1) × (m + 1) positive semidefinite matrices. The domain of the random variable
is [a, c]. Solving this semidefinite program yields an upper bound on the cdf P[X ≤ b],
where a < b < c. We used a free online SDP solver (Wu and Boyd, 1996) to solve the above
semidefinite program. Through the empirical studies that follow, we found this approach to be
the best at solving the optimization problem in terms of a balance between speed, reliability
and accuracy.


Random Sampling(RS): In sampling we select a subset of observations from the universe consisting of all possible
observations. Using this subset we calculate a function whose value we consider to be
equal (or at least close enough) to the value of the same function applied to the entire
observation set. Sampling is an important process, since in many cases we do not have
access to this entire observation set (many times it is infinite). Numerous studies (Hall,
1992; Bartlett J. E. and C., 2001; L and J, 1977) have been conducted to analyze different


kinds of sampling procedures. The sampling procedure that is relevant to our problem is
Random Sampling and hence we restrict our discussion only to it.
Random sampling is a sampling technique in which we select a sample from a larger
population wherein each individual is chosen entirely by chance and each member of the
population has possibly an unequal chance of being included in the sample. Random sam-
pling reduces the likelihood of bias. It is known that asymptotically the estimates found
using random sampling converge to their true values.
For our problem the cdf's can be computed using this sampling procedure. We sample
data from the multinomial distribution (our data generative model) and count the number of
times the condition whose cdf is to be computed is true. This number, when divided by the
total number of samples, gives an estimate of the cdf. By finding the mean and standard
deviation of these estimates we can derive confidence bounds on the cdf's using Chebyshev's
inequality. The width of these confidence bounds depends on the standard deviation of the
estimates, which in turn depends on the number of samples used to compute the estimates.
As the number of samples increases the bounds become tighter. We will observe this in the
experiments that follow. In fact, all the estimates for the cdf's necessary for computing the
moments of the generalization error using the mathematical formulations given by us can
be computed in parallel. The moments are thus functions of the samples, for which confidence
bounds can be derived. Realize that only sampling in conjunction with our formulations
for the moments makes for an efficient method. If we directly used sampling without the
formulations, we would first need to sample for building a set of classifiers, and then
for each classifier built we would need to sample test sets from the multinomial, the reason
being that the expectation in the moments is w.r.t. the D × X space. This process can
prove to be computationally intensive for acquiring accurate estimates of the moments, as
the required sample sizes can potentially be large. Thus we are still consistent with our
methodology of going as far as possible in theory to reduce the computation for conducting
experiments.
7.3 Empirical Comparison of the cdf computing methods

Consider the 2 dimensional case in Figure 11. We instantiated all the cell probabilities to
be equal. We found the probability P[N₂N₁₁¹N₂₁¹ > N₁N₁₁²N₂₁²] by the methods suggested,
varying the dataset size from 10 to 1000 in multiples of 10 and having knowledge of the first
six moments of the random variable X = N₂N₁₁¹N₂₁¹ − N₁N₁₁²N₂₁². The actual probability in
all the three cases is around 0.5 (actually just less than 0.5). The execution speeds for the
various methods are given in Table 4⁴. From the table we see that the SDP and gradient
descent methods are lightning fast. The SQP and gradient descent with topology search
methods take a couple of seconds to execute. The thing to notice here is that SDP, SQP,
the two gradient descent methods and the series approximation method are oblivious to the
size of the dataset with regards to execution time. In terms of accuracy, the gradient descent
method is sensitive to initialization, and the series approximation method to the choice of
distribution, as previously stated. A normal distribution gives an estimate of 0.5, which
is good in this case since the original distribution is symmetric about the origin. But for
finding cdf's near the extremities of the domain of X the error can be considerable. Since

4. arnd means around


Method Dataset Size 10 Dataset Size 100 Dataset Size 1000
Direct arnd 25 hrs arnd 200 centuries arnd 200 billion yrs
SA Instantaneous Instantaneous Instantaneous
LP arnd 3.5 sec arnd 2 min arnd 2:30 hrs
GD arnd 0.13 sec arnd 0.13 sec arnd 0.13 sec
PA arnd 1 sec arnd 25 sec arnd 5 min
GDTS arnd 3.5 sec arnd 3.5 sec arnd 3.5 sec
SQP arnd 3.5 sec arnd 3.5 sec arnd 3.5 sec
SDP arnd 0.1 sec arnd 0.1 sec arnd 0.1 sec
RS_100 arnd 0.08 sec arnd 0.08 sec arnd 0.1 sec
RS_1000 arnd 0.65 sec arnd 0.66 sec arnd 0.98 sec
RS_10000 arnd 6.3 sec arnd 6.5 sec arnd 9.6 sec

Table 4: Empirical comparison of the cdf computing methods in terms of execution time.
RS_n denotes the Random Sampling procedure using n samples to estimate the cdf.

Samples Dataset Size 10 Dataset Size 100 Dataset Size 1000
100 0.7-0.23 0.72-0.26 0.69-0.31
1000 0.54-0.4 0.56-0.42 0.57-0.42
10000 0.5-0.44 0.51-0.47 0.52-0.48

Table 5: Confidence bounds for Random Sampling.


Method Accuracy Speed
Direct Exact solution Low
Series Approximation Variable High
Standard LP solvers High Low
Gradient descent Low High
Prekopa's Algorithm High Moderate
Gradient descent (topology search) Moderate Moderate
Sequential Quadratic Programming High Moderate
Semi-definite Programming High High
Random Sampling Very High Moderate

Table 6: Comparison of methods for computing the cdf

the domain of X is finite, variants of the beta distribution with a change of variable(i.e.
shifting and scaling the distribution) can provide better approximation capabilities. The
SQP and SDP methods are robust and insensitive to initialization(as long as the initial
point is feasible). The bound found by SQP is 0.64 to 0.34 and that found by SDP is 0.62
to 0.33. The LP solver also finds a similar bound of 0.62 to 0.34 but the execution time
scales quadratically with the size of the input. For RS we observe from Table 4 and Table
5 that the method does not scale much in time with the size of the dataset but produces
extremely good confidence bounds as the number of samples increases. With 1000 samples
we already have pretty tight bounds with time required being just over half a second. Also
as previously stated the cdf's can be calculated together rather than independently and
hence using 10000 samples for estimation can prove to be more than acceptable.

8. Calculation of Cumulative Joint probabilities

Cumulative Joint probabilities need to be calculated for computation of higher moments.
In the Random Sampling method, these probabilities can be computed in similar fashion as
the single probabilities shown above. But for the other methods knowledge of the moments
is required. Cumulative Joint probabilities are defined over multiple random variables
wherein each random variable satisfies some inequality or equality. In our case, for the
second moment we need to find the following kind of cumulative joint probabilities P[X >
0, Y > 0], where X and Y are random variables (overriding their definition in Table 1). Since
the probability is of an event over two distinct random variables the previous method of
computing moments cannot be directly applied. An important question is whether we can,
through certain transformations, reuse the previous method. Fortunately the answer is
random variable Z = f(X, Y)(polynomial in X and Y) such that Z > 0 iff X > 0 and Y > 0.
Since the two events are equivalent their probabilities are also equal. By taking derivatives
of the MGF of the multinomial we get expressions for the moments of polynomials of the
multinomial random variables. Thus, f(X, Y) is required to be a polynomial in X and Y.
We now discuss the challenges in finding such a function and eventually suggest a solution.






Figure 13: The plot is of the polynomial z = (x + 10)⁴x²y + (y + 10)⁴y²x. We see that
it is positive in the first quadrant and non-positive in the remaining three.


Geometrically, we can consider the random variables X, Y and Z to denote the three co-
ordinate axes. Then the function f(X, Y) should have a positive value in the first quadrant
and negative in the remaining three. If the domains of X and Y were infinite and continuous
then this problem is potentially intractable since the polynomial needs to have a discrete
jump along the X and Y axis. Such behavior can be emulated at best approximately by
polynomials. In our case though, the domains of the random variables are finite, discrete and
symmetric about the origin. Therefore, what we care about is that the function behaves as
desired only at these finite number of discrete points. One simple solution is to have a circle
covering the relevant points in the first quadrant and with appropriate sign the function
would be positive for all the points encompassed by it. This works for small domains of X
and Y. As the domain size increases, the circle intrudes into the other quadrants and no
longer satisfies the conditions. Other simple functions such as XY or X + Y or a product
of the two also do not work. We now give a function that does work and discuss the basic
intuition in constructing it. Consider the domain of X and Y to be integers in the interval
[-a,a].5 Then the polynomial is given by,

Z = (X + a)^r X²Y + (Y + a)^r Y²X    (12)

where r = max_b ⌈ln b / ln((a + 1)/(a − b))⌉ + 1, 1 < b < a and b ∈ N.⁶ The value of r can be found
numerically by finding the corresponding value of b which maximizes that function. For
5 < a < 10 the value of b which does this is 5. For larger values 10 < a < 10⁶ the value of b
is 4. Figure 13 depicts the polynomial for a = 10, where r = 4. The polynomial resembles a
bird with its neck in the first quadrant, wings in the 2nd and 4th quadrants and its posterior
in the third. The general shape remains the same for higher values of a.
The first requirement for the polynomial was that it must be symmetric. Secondly,
we wanted to penalize negative terms, and so we have X + a (and Y + a) raised to some
power, which will always be non-negative but will have lower values for smaller X (and Y). The
X²Y (and Y²X) makes the first (second) term zero if either of X and Y is zero. Moreover,
it imparts sign to the corresponding term. If the absolute value function (|·|) could be used we
would replace the X² (Y²) by |X| (|Y|) and set r = 1. But since we cannot, in the resultant
function r is a reciprocal of a logarithmic function of a. For a fixed r, with increasing value of
a the polynomial starts violating the biconditional by becoming positive in the 2nd and 4th
quadrants (i.e. the wings rise). The polynomial is always valid in the 1st and 3rd quadrants.
With an increase in the degree (r) of the polynomial, its wings begin flattening out, thus satisfying
the biconditional for a certain a.
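The biconditional can be checked exhaustively for a small domain. The sketch below verifies it on the full integer grid for a = 10 and r = 4, the values used in Figure 13:

```python
a, r = 10, 4  # for a = 10 the construction gives r = 4

def Z(x, y):
    return (x + a) ** r * x ** 2 * y + (y + a) ** r * y ** 2 * x

# Z must be positive exactly when both x and y are positive
for x in range(-a, a + 1):
    for y in range(-a, a + 1):
        assert (Z(x, y) > 0) == (x > 0 and y > 0), (x, y)
print("biconditional holds on the full integer grid")
```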
With this we have all the ingredients necessary to perform experiments reasonably
quickly. This is exactly what we report in the next section.

9. Experiments

Having come up with strategies to compute the probabilities and hence the moments of
the errors efficiently, we now report some of the empirical studies we conducted. These

5. in our problem X and Y have the same domain.
6. proof in appendix


Figure 14: HE expectation; Figure 15: HE variance; Figure 16: HE E[·] + Std(·) (x-axis: Size of HOS(%)).
Figure 17: HE expectation; Figure 18: HE variance; Figure 19: HE E[·] + Std(·) (x-axis: Size of HOS(%)).
Figure 20: CE expectation; Figure 21: Individual run variance of CE; Figure 22: Pairwise covariances of CV;
Figure 23: Total variance of cross validation; Figure 24: E[·] + Var(·) of CV (x-axis: Number of Folds).
Figure 25: Convergence behavior (x-axis: Dataset Size).

empirical studies were carried out on the 2 × 2 × 2 cube in Figure 11 (i.e. 2 attributes, each
having two values, and 2 classes).

9.1 HOS
In our first study we set all the joint cell probabilities to be equal. For N = 100 we wanted to
observe the behavior of the moments of HE under varying HOS size. We see from Figure 14
that E[HE] is 0.5 throughout. This is the case since each class is equally likely and thus
the training set carries little information that can assist the NBC in classification. This
also affects the variance, which reduces as the test size increases (Figure 15). The reason for
this is that the stability of the training set is a non-issue here, and so the larger the test set,
the better the estimate one gets of the error. Combining these two results, we see from Figure 16
that the estimate is more accurate for larger HOS sizes.
In our next study we set the joint probabilities in class 1 to 0.1 and those in class 2
to 0.15. In this case we observe qualitatively similar results as those obtained for the 1
dimensional scenario. The E[HE] increases as the training set size reduces (Figure 17). The
best performance is observed for a HOS size of 50% (Figure 19).

9.2 Cross Validation
In our third study we observed the behavior of CV with varying number of folds. We set the
joint probabilities the same as our previous study. Here too we observe the same qualitative
results we saw for the 1 dimensional case. The variance of the individual runs increases
with the number of folds (Figure 21). The covariances between errors of two runs are high
initially, later drop, and then increase again. The reasoning behind this is the same as
described before. The best performance is observed when the number of folds is 10.

9.3 Comparison of GE, HE, and CE
In our final study we observe the behavior of the three aforementioned errors with increase
in dataset size. We again set the joint probabilities the same as our last two studies. We
set the HOS size to 50% and the number of folds in CV to 10, as these proved to be the best
in our previous studies. From Figure 25 we observe that 10-fold CV is consistently better
than HOS, though as the dataset size increases this difference in performance reduces.
This type of study can be used to observe the non-asymptotic convergence behavior of
the errors.
10. Conclusion

In this section we summarize the major developments in this work and mention a future
line of research.
The major contributions in this paper are: (a) we developed general relationships be-
tween the moments of GE, HE, CE, (b) we reduced the granularity by shifting the focus
from looking at the entire input at one time to looking at only particular inputs, (c) we
developed efficient formulations for the moments of the errors for the NBC using the fact
in (b), (d) we proposed a plethora of strategies to efficiently compute cdf's required for the
moment computation in higher dimensions for the NBC, thus making our method scalable,


and (e) we plotted our formulations to be able to interpret what they mean in different circum-
stances. The major challenge in future work is to extend the analysis to more sophisticated
classification models such as decision trees, neural nets, SVM's etc. We have however devel-
oped a characterization for the NBC which is important in its own right. It is a model which
is extensively used in industry, due to its robustness outperforming its more sophisticated
counterparts in some real world applications (e.g. spam filtering in Mozilla Thunderbird and
Microsoft Outlook, bioinformatics, etc.).
In conclusion, we proposed a methodology to observe and understand non-asymptotic
behavior of errors for the classification model at hand by making as much progress as
possible in theory in view of reducing the computation required for experiments, and then
performed experiments to study specific situations that we are interested in. The extent to
which this methodology can be used for studying learning methods is a part of our current
investigation, however we feel the method has promise.

11. Appendix

Proposition 3 The polynomial (x + a)^r x²y + (y + a)^r y²x > 0 iff x > 0 and y > 0, where
x, y ∈ {−a, −a + 1, ..., a}, r = max_b ⌈ln b / ln((a + 1)/(a − b))⌉ + 1, a ∈ N, 1 < b < a and b ∈ N.
Proof One direction is trivial. If x > 0 and y > 0 then the polynomial is certainly greater
than zero for any value of r. Now let us prove the other direction, i.e. if (x + a)^r x²y + (y +
a)^r y²x > 0 then x > 0 and y > 0, where x, y ∈ {−a, ..., a} and r is as above.
In other words, if x ≤ 0 or y ≤ 0 then (x + a)^r x²y + (y + a)^r y²x ≤ 0. We prove
this result by forming cases.

Case 1: Both x and y are zero.
The value of the polynomial is zero.
Case 2: One of x or y is zero.
The value of the polynomial again is zero, since each of the two terms separated by a sum
has xy as a factor.
Case 3: Both x and y are less than zero.
Consider the first term (x + a)^r x²y. This term is non-positive since x + a is always non-
negative (since x ∈ {−a, ..., a}) and x² is always positive, but y is negative. An analogous
argument holds for the second term, and so it too is non-positive. Thus their sum is non-positive.
Case 4: One of x or y is negative and the other is positive. Assume w.l.o.g. that x is
positive and y is negative.

(x + a)^r x²y + (y + a)^r y²x ≤ 0
if (x + a)^r x + (y + a)^r y ≥ 0,
i.e., if r ln((x + a)/(y + a)) ≥ ln(−y/x)

On fixing the value of y, the value of x at which the right-hand side of the above achieves its
maximum is 1 (since the lower the value of x the higher the right-hand side, but x is positive by our
assumption and x ∈ {−a, ..., a}). Thus we have the above inequality true only if,

r ≥ ln(−y) / ln((a + 1)/(y + a))

Let b = −y; then 1 ≤ b ≤ a since y is negative. Hence, only if r satisfies the inequality for
all possible allowed values of b will it imply that the polynomial is less than or
equal to zero in the specified range. For r to satisfy the inequality for all allowed values of
b, it must satisfy the inequality at the value of b where the function is maximum. Also, for
b = 1 and b = a the right-hand side is zero. So the range of b over which we want to find the
maximum is 1 < b < a. With this, the minimum value of r that satisfies the above inequality,
in order for the polynomial to be less than or equal to zero, is

r = max_b ⌈ln b / ln((a + 1)/(a − b))⌉ + 1

The 4 cases cover all the possibilities, and thus we have shown that if x ≤ 0 or y ≤ 0
then (x + a)^r x²y + (y + a)^r y²x ≤ 0. Having also shown the other direction, we have proved
the proposition. ∎


Elisseeff A. and Pontil M. Learning Theory and Practice, chapter Leave-one-out error and
stability of learning algorithms with applications. IOS Press, 2003.

Bartlett J. E., Kotrlik J. W., and Higgins C. Organizational research: Determining appropri-
ate sample size for survey research. Information Technology, Learning, and Performance
Journal, 19(1):43-50, 2001.

Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold cross valida-
tion. Journal of Machine Learning Research, 2003.

D. Bertsimas and I. Popescu. Optimal inequalities in probability theory: a convex opti-
mization approach. Technical report, Dept. Math. O.R., Cambridge, Mass 02139, 1998.

Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: Bounds for k-fold
and progressive cross-validation. In Computational Learning Theory, pages 203-208, 1999.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge, 2004.

L. Breiman. Heuristics of instability and stabilization in model selection. The Annals of
Statistics, 1996.

Ronald W. Butler and Richard K. Sutton. Saddlepoint approximation for multivariate
cumulative distribution functions and probability computations in sampling theory and
outlier testing. Journal of the American Statistical Association, 93(442):596-604, 1998.


Pedro Domingos and Michael J. Pazzani. On the optimality of the simple Bayesian classifier
under zero-one loss. Machine Learning, 29(2-3):103-130, 1997.

Alan Edelman and H. Murakami. Polynomial roots from companion matrix eigenvalues.
Mathematics of Computation, 64(210):763-776, 1995.

Cyril Goutte. Note on free lunches and cross-validation. Neural Computation, 9(6):1245-
1249, 1997.

Peter Hall. The Bootstrap and Edgeworth Expansion. Springer-Verlag, 1992.

K. Isii. The extrema of probability determined by generalized moments (i): bounded random
variables. Ann. Inst. Stat. Math, 12:119-133, 1960.

K. Isii. On the sharpness of chebyshev-type inequalities. Ann. Inst. Stat. Math, 14:185-197.

S. Karlin and L.S. Shapley. Geometry of moment spaces. Memoirs Amer. Math. Soc., 12.

Michael J. Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-
one-out cross-validation. In Computational Learning Theory, pages 152-162, 1997.

R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model
selection. In Proceedings of the Fourteenth International Joint Conference on Artificial
Intelligence, pages 1137-1143. Morgan Kaufmann, San Mateo, CA, 1995.

Samuel Kutin and Partha Niyogi. Almost-everywhere algorithmic stability and generaliza-
tion error. In Proceedings of the 18th Annual Conference on Uncertainty in Artificial
Intelligence (UAI-02), pages 275-282, San Francisco, CA, 2002. Morgan Kaufmann Publishers.
R. L. Chambers and C. J. Skinner. Analysis of Survey Data. Wiley, 1977.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-
Verlag, 1996.

John Langford. Filed under: Prediction theory, problems. 29, 2005.

Pat Langley and Stephanie Sage. Tractable average-case analysis of naive Bayesian
classifiers. In Proc. 16th International Conf. on Machine Learning, pages 220-228.
Morgan Kaufmann, San Francisco, CA, 1999.
Bruce Levin. A representation for multinomial cumulative distribution functions. The
Annals of Statistics, 9(5):1123-1126, 1981.


Andrew W. Moore and Mary S. Lee. Efficient algorithms for minimizing cross validation
error. In International Conference on Machine Learning, pages 190-198, 1994.

Olivier Bousquet, Stephane Boucheron, and Gabor Lugosi. Introduction to statistical learning
theory, 2005.

M. E. Plutowski. Survey: Cross-validation in theory and in practice, 1996.

Andras Prekopa. The discrete moment problem and linear programming. RUTCOR Re-
search Report, 1989.

Jun Shao. Linear model selection by cross validation. Journal of the American Statistical
Association, 88, 1993.

V. Vapnik. Statistical Learning Theory. Wiley & Sons, 1998.

Robert C. Williamson. SRM and VC theory (statistical learning theory). williams/papers/P151.pdf, 2001.

Wolfram Research. Mathematica.

Shao-Po Wu and Stephen Boyd. SDPSOL: a parser/solver for SDP and MAXDET problems with
matrix structure. boyd/SDPSOL.html, 1996.

Huaiyu Zhu and Richard Rohwer. No free lunch for cross validation. Neural Computation,
8(7):1421-1426, 1996.
