Probabilistic Characterization of Nearest Neighbor
Classifier
AMIT DHURANDHAR
University of Florida
and
ALIN DOBRA
University of Florida
The k-Nearest Neighbor classification algorithm (kNN) is one of the simplest yet most effective
classification algorithms in use. It finds major applications in text categorization, outlier detection,
handwritten character recognition, fraud detection and other related areas. Though sound
theoretical results exist [Stone 1977] regarding convergence of the Generalization Error (GE)
of this algorithm to Bayes error, these results are asymptotic in nature. The understanding of
the behavior of the kNN algorithm in real-world scenarios is limited. In this paper, assuming
categorical attributes, we provide a structured way of statistically analyzing the kNN algorithm,
by developing analytical expressions for the moments of the GE of this algorithm. The expressions
are functions of the sample, and hence can be computed given any joint probability distribution
defined over the input-output space. These expressions can be used as a tool that aids in unveiling
the statistical behavior of the algorithm in settings of interest, viz. an acceptable value of k for a
given sample size and distribution. This work employs the semi-analytical methodology that was
proposed in [Dhurandhar and Dobra 2006] to better understand the non-asymptotic behavior of
learning algorithms.
Categories and Subject Descriptors: H.2.8 [Data Management]: Data Mining
General Terms: Theory
Additional Key Words and Phrases: kNN, GE, model selection
1. INTRODUCTION
A major portion of the work in Data Mining is devoted to building new and
improved classification algorithms. Empirical studies [Kohavi 1995; Moore and
Lee 1994] and theoretical results, mainly asymptotic in nature [Vapnik 1998; Shao 2003],
assist in realizing this endeavor. Though both methods are powerful in their
own right, results from empirical studies depend heavily on the available datasets,
and results from theoretical studies depend on a large and mostly unknown
dataset size beyond which the asymptotic results reasonably apply. Keeping these
factors in mind, a semi-analytical methodology was proposed in [Dhurandhar and
A. Dhurandhar, University of Florida, Gainesville, FL 32611, USA.
A. Dobra, University of Florida, Gainesville, FL 32611, USA.
Permission to make digital/hard copy of all or part of this material without fee for personal
or classroom use provided that the copies are not made or distributed for profit or commercial
advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and
notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish,
to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 20YY ACM 1529-3785/20YY/0700-0001 $5.00
ACM Transactions on Computational Logic, Vol. V, No. N, Month 20YY, Pages 10??.
Dobra 2006] to study the non-asymptotic behavior of classification algorithms. In
the next few subsections, we briefly describe this methodology. We then make
a short segue to discuss properties of the kNN algorithm, followed by an
enumeration of the specific contributions that we make in this paper.
1.1 What is the methodology?
The methodology for studying classification models consists in studying the
behavior of the first two central moments of the GE of the classification algorithm
studied. The moments are taken over the space of all possible classifiers produced
by the classification algorithm, by training it over all possible datasets sampled i.i.d.
from some distribution. The first two moments give enough information about the
statistical behavior of the classification algorithm to allow interesting observations
about its behavior and trends w.r.t. any chosen data distribution.
1.2 Why have such a methodology?
The answers to the following questions shed light on why the methodology is
necessary if tight statistical characterization is to be provided for classification
algorithms.
(1) Why study GE? The biggest danger of learning is overfitting the training
data. The main idea in using GE, the expected error over the entire input, as
a measure of success of learning, instead of empirical error on a given dataset,
is to provide a mechanism to avoid this pitfall.
(2) Why study the moments instead of the distribution of GE? Ideally, we would
study the distribution of GE instead of moments in order to get a complete
picture of its behavior. Studying the distribution of discrete random
variables, except for very simple cases, turns out to be very hard. The difficulty
comes from the fact that even computing the pdf at a single point is intractable,
since all combinations of random choices that result in the same value for
GE have to be enumerated. On the other hand, the first two central moments
coupled with distribution independent bounds such as Chebyshev and Chernoff
give guarantees about the worst possible behavior that are not too far from the
actual behavior (small constant factor). Interestingly, it is possible to compute
the moments of a random variable like GE without ever explicitly writing or
making use of the formula for the pdf. What makes such an endeavor possible
is extensive use of the linearity of expectation, as explained in [Dhurandhar
and Dobra 2006].
(3) Why characterize a class of classifiers instead of a single classifier? While the
use of GE as the success measure is standard practice in Machine Learning,
characterizing classes of classifiers instead of the particular classifier produced
on a given dataset is not. From the point of view of the analysis, without
large testing datasets it is not possible to evaluate GE directly for a particular
classifier. By considering classes of classifiers to which a classifier belongs, an
indirect characterization is obtained for the particular classifier. This is
precisely what Statistical Learning Theory (SLT) does; there the class of classifiers
consists of all classifiers with the same VC dimension. The main problem with
SLT results is that classes based on VC dimension are too large, thus results
tend to be pessimistic. In the methodology in [Dhurandhar and Dobra 2006],
the class of classifiers consists only of the classifiers that are produced by the
given classification algorithm from datasets of fixed size from the underlying
distribution. This is probabilistically the smallest class in which the particular
classifier produced on a given dataset can be placed.
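
As a hedged illustration of item (2), the following minimal sketch shows how a mean and a
second moment of GE translate into a distribution independent guarantee through Chebyshev's
inequality, P[|GE - E[GE]| >= t] <= Var[GE] / t^2. The function name and the numerical values
are assumptions chosen for illustration; they are not results from this paper.

```python
def chebyshev_tail_bound(variance: float, t: float) -> float:
    """Upper bound on P[|GE - E[GE]| >= t] given Var[GE] (Chebyshev)."""
    if t <= 0:
        raise ValueError("t must be positive")
    # A probability bound above 1 carries no information, so clip it.
    return min(1.0, variance / (t * t))


# Hypothetical moments of GE (made-up numbers, for illustration only).
mean_ge = 0.20                            # E[GE]
second_moment = 0.0425                    # E[GE^2]
variance = second_moment - mean_ge ** 2   # Var[GE] = E[GE^2] - E[GE]^2

# Bound on the probability that GE deviates from its mean by 0.1 or more.
bound = chebyshev_tail_bound(variance, t=0.1)
```

With these assumed values the variance is 0.0025, so the bound is 0.25: whatever the shape of
the distribution of GE, at most a quarter of the classifiers in the class can have a GE outside
[0.1, 0.3].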
1.3 How do we implement the methodology?
We estimate the moments of GE, by obtaining parametric expressions for them.
If this can be accomplished the moments can be computed exactly. Moreover, by
dexterously observing the manner in which expressions are derived for a particular
classification algorithm, insights can be gained into analyzing other algorithms.
Though deriving the expressions may be a tedious task, using them we obtain
highly accurate estimates of the moments. In this paper, we analyze the kNN
algorithm applied to categorical data. The key to the analysis is focusing on how
the algorithm builds its final inference. In cases where the parametric expressions
are computationally intensive to compute exactly, the approximations proposed in
Section 5 can be used to obtain estimates with small error.
If the moments are to be studied on synthetic data, the distribution is assumed
by construction and the parametric expressions can be used directly. If we have real data,
an empirical distribution can be built on the dataset and then the parametric
expressions can be used.
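
Building an empirical distribution on real categorical data can be sketched as follows. This
is a minimal illustration under the assumption that each record is a pair (tuple of attribute
values, class label); the function name `empirical_joint` and the toy records are hypothetical.

```python
from collections import Counter


def empirical_joint(records):
    """Return {(input_tuple, label): relative frequency} for a dataset."""
    counts = Counter((tuple(x), y) for x, y in records)
    n = sum(counts.values())
    return {cell: c / n for cell, c in counts.items()}


# Toy dataset with two categorical attributes and two classes.
data = [
    (("a1", "b1"), "C1"),
    (("a1", "b1"), "C2"),
    (("a1", "b2"), "C1"),
    (("a1", "b1"), "C1"),
]
p = empirical_joint(data)
```

The resulting dictionary plays the role of the cell probabilities of a multinomial over the
input-output space, which is the form the parametric expressions take as input.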
1.4 Applications of the methodology
It is important to note that the methodology is not aimed towards providing a
way of estimating bounds for GE of a classifier on a given dataset. The primary
goal is creating an avenue in which learning algorithms can be studied precisely
i.e. studying the statistical behavior of a particular algorithm w.r.t. a chosen/built
distribution. Below, we discuss the two most important perspectives of applying
the methodology.
1.4.1 Algorithmic Perspective. If a researcher/practitioner designs a new
classification algorithm, he/she needs to validate it. Standard practice is to validate
the algorithm on a relatively small number (5-20) of datasets and to report the
performance. By observing the behavior of only a few instances of the algorithm,
the designer infers its quality. Moreover, if the algorithm underperforms on some
datasets, it can sometimes be difficult to pinpoint the precise reason for its failure.
If instead he/she is able to derive parametric expressions for the moments of GE,
the test results would be more relevant to the particular classification algorithm,
since the moments are over all possible datasets of a particular size drawn i.i.d.
from some chosen/built distribution. Testing individually on all these datasets is
an impossible task. Thus, by computing the moments using the parametric
expressions, the algorithm would be tested on a plethora of datasets with the results being
highly accurate. Moreover, since the testing is done in a controlled environment,
i.e. all the parameters are known to the designer while testing, he/she can precisely
pinpoint the conditions under which the algorithm performs well and the conditions
under which the algorithm underperforms.
1.4.2 Dataset Perspective. If an algorithm designer validates his/her algorithm
by computing moments as mentioned earlier, it can instill greater confidence in the
practitioner searching for an appropriate algorithm for his/her dataset. The reason
is as follows. Suppose the practitioner has a dataset that has a similar structure to, or is
from a similar source as, the test dataset on which the designer built an empirical
distribution and reported favorable results. The results then apply not only to that
particular test dataset but to other datasets of a similar type, and since the
practitioner's dataset belongs to this collection, the results also apply to it.
Note that a distribution is just a weighting of different datasets, and this
perspective is used in the above exposition.
1.5 Properties of the kNN algorithm
The kNN algorithm is a simple yet effective, and hence commonly used, classification
algorithm in industry and research. It is known to be a consistent estimator [Stone
1977], i.e. it asymptotically achieves Bayes error within a constant factor. None
of the more sophisticated classification algorithms, e.g. SVM, Neural Networks
etc., are known to outperform it consistently [Stanfill and Waltz 1986]. However,
the algorithm is susceptible to noise, and choosing an appropriate value of k is more
of an art than a science.
1.6 Specific Contributions
In this paper, we develop expressions for the first two moments of GE for the k-Nearest
Neighbor classification algorithm built on categorical data. We accomplish this by
expressing the moments as functions of the sample produced by the underlying joint
distribution. In particular, we develop efficient characterizations for the moments
when the distance metric used in the kNN algorithm is independent of the sample.
We also discuss issues related to the scalability of the algorithm. We use the
derived expressions to study the classification algorithm in settings of interest
(for example, different values of k) by visualization. The joint distribution we use in
the empirical studies that follow the theory is a multinomial, the most generic
data generation model for the discrete case.
The paper is organized as follows: In Section 2 we provide the basic technical
background. In Section 3 we briefly discuss the kNN algorithm and the different
distance metrics that are used when dealing with categorical attributes. In Section
4 we characterize the kNN algorithm. In Section 5 we address issues related to
scalability. In Section 6 we report empirical studies, which illustrate the usage of
the expressions, as a tool to delve into the statistical behavior of the kNN algorithm.
In Section 7, we discuss implications of the experiments. In Section 8, we look at
possible extensions to the current work. Finally, we conclude in Section 9.
2. TECHNICAL FRAMEWORK
In this section we present the generic expressions for the moments of GE that were
given in [Dhurandhar and Dobra 2006]. The moments of the GE of a classifier
built over an independent and identically distributed (i.i.d.) random sample drawn
from a joint distribution, are taken over the space of all possible classifiers that can
be built, given the classification algorithm and the joint distribution. Though the
classification algorithm may be deterministic, the classifiers act as random variables
Symbol                          Meaning
$X$                             Random vector modeling input
$\mathcal{X}$                   Domain of random vector $X$ (input space)
$Y$                             Random variable modeling output
$Y(x)$                          Random variable modeling output for input $x$
$\mathcal{Y}$                   Set of class labels (output space)
$\zeta$                         Classifier
$\mathcal{Z}(N)$                The class of classifiers obtained by application of the
                                classification algorithm to an i.i.d. set of size $N$
$E_{\mathcal{Z}(N)}[\cdot]$     Expectation w.r.t. the space of classifiers built
                                on a sample of size $N$
Table I. Notation used in the paper.
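since the sample that they are built on is random. The GE of a classifier, being a
function of the classifier, also acts as a random variable. Due to this fact, the GE of a
classifier $\zeta$, denoted by $GE(\zeta)$, has a distribution and consequently we can talk about
its moments. The generic expressions for the first two moments of GE taken over
the space of possible classifiers resulting from samples of size N from some joint
distribution are as follows:
\[
E_{\mathcal{Z}(N)}[GE(\zeta)] = \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}}
P_{\mathcal{Z}(N)}[\zeta(x) = y] \, P[Y(x) \neq y] \tag{1}
\]
\[
\begin{aligned}
E_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[GE(\zeta)GE(\zeta')] ={}& \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x] \, P[X = x'] \\
& \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x) = y \wedge \zeta'(x') = y'] \\
& P[Y(x) \neq y] \, P[Y(x') \neq y']
\end{aligned} \tag{2}
\]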
Equation 1 is the expression for the first moment of $GE(\zeta)$. Notice that
inside the first sum $\sum_{x}$ the input x is fixed, and inside the second sum
$\sum_{y}$ the output y is fixed; thus $P_{\mathcal{Z}(N)}[\zeta(x) = y]$ is the probability
of all possible ways in which an input x is classified into class y. This probability depends on the
joint distribution and the classification algorithm. The other two probabilities
are directly derived from the distribution. Thus, customizing the expression for
$E_{\mathcal{Z}(N)}[GE(\zeta)]$ effectively means deciphering a way of computing
$P_{\mathcal{Z}(N)}[\zeta(x) = y]$. Similarly, customizing the expression for
$E_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[GE(\zeta)GE(\zeta')]$ means finding a way of computing
$P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x) = y \wedge \zeta'(x') = y']$ given any joint distribution.
In Section 4 we derive expressions for these two probabilities, which depend only
on the underlying joint probability distribution, thus providing a way of computing
them analytically.
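
Once the classification probabilities are available, evaluating the first moment is a plain
double sum over inputs and class labels. The sketch below assumes that all three ingredients
are supplied as dictionaries; the function name, argument names and toy numbers are
hypothetical, and the probabilities that the classifier assigns each label are simply given
here rather than derived.

```python
def first_moment(p_x, p_assign, p_label):
    """E[GE] = sum_x P[X=x] * sum_y P_Z[zeta(x)=y] * P[Y(x) != y].

    p_x[x]          : P[X = x]
    p_assign[x][y]  : probability, over classifiers, that x is assigned class y
    p_label[x][y]   : P[Y(x) = y], the true conditional class distribution
    """
    total = 0.0
    for x, px in p_x.items():
        for y, pa in p_assign[x].items():
            p_wrong = 1.0 - p_label[x].get(y, 0.0)   # P[Y(x) != y]
            total += px * pa * p_wrong
    return total


# Toy example with two inputs and two classes (made-up numbers).
p_x = {"x1": 0.5, "x2": 0.5}
p_assign = {"x1": {"C1": 0.9, "C2": 0.1}, "x2": {"C1": 0.3, "C2": 0.7}}
p_label = {"x1": {"C1": 0.8, "C2": 0.2}, "x2": {"C1": 0.4, "C2": 0.6}}
expected_ge = first_moment(p_x, p_assign, p_label)
```

For these assumed inputs the inner sums are 0.26 for x1 and 0.46 for x2, giving an expected
GE of 0.36. The second moment has the same structure with a quadruple sum and the joint
classification probability of two inputs.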
X        C_1      C_2      ...    C_v
x_1      N_11     N_12     ...    N_1v
x_2      N_21     N_22     ...    N_2v
...
x_M      N_M1     N_M2     ...    N_Mv
Table II. Contingency table with v classes, M input vectors and total sample size
$N = \sum_{i=1}^{M} \sum_{j=1}^{v} N_{ij}$.
3. K-NEAREST NEIGHBOR ALGORITHM
The k-nearest neighbor (kNN) classification algorithm classifies an input based on
the class labels of the closest k points in the training dataset. The class label
assigned to an input is usually the most numerous class among these k closest points.
The underlying intuition that is the basis of this classification model is that nearby
points will tend to have higher "similarity", viz. the same class, than points that are
far apart.
The notion of closeness between points is determined by the distance metric used.
When the attributes are continuous, the most popular metric is the $l_2$ norm or the
Euclidean distance. Figure 1 shows points in $R^2$ space. The points b, c and d are
the 3-nearest neighbors (k = 3) of the point a. When the attributes are categorical,
the most popular metric used is the Hamming distance [Liu and White 1997]. The
Hamming distance between two points/inputs is the number of attributes that
have distinct values for the two inputs. This metric is sample independent, i.e.
the Hamming distance between two inputs remains unchanged, irrespective of the
sample counts produced in the corresponding contingency table. For example, Table
II represents a contingency table. The Hamming distance between $x_1$ and $x_2$ is the
same irrespective of the values of $N_{ij}$ where $i \in \{1, 2, ..., M\}$ and $j \in \{1, 2, ..., v\}$.
Other metrics such as the Value Difference Metric (VDM) [Stanfill and Waltz 1986],
Chi-square [Connor-Linton 2003] etc. exist that depend on the sample. We now
provide a global characterization for calculating the aforementioned probabilities
for both kinds of metrics. This is followed by an efficient characterization for the
sample independent metrics, which includes the traditionally used and most popular
Hamming distance metric.
4. COMPUTATION OF MOMENTS
In this section we characterize the probabilities $P_{\mathcal{Z}(N)}[\zeta(x) = y]$ and
$P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x) = y \wedge \zeta'(x') = y']$ required for the
computation of the first two moments. In the case that the number of nearest neighbors at a
particular distance d is more than k for an input, and at any lesser value of distance the number
of NN's is less than k, we classify the input based on all the NN's up to the distance d.
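
The tie convention just described can be sketched in code. The following is a minimal
illustration, assuming Hamming distance and a training sample stored as a contingency table
of the form {input: {class: count}}; the function names and the toy table are hypothetical.
Returning `None` when no class has a strict majority is also an assumption of this sketch.

```python
from collections import Counter
from itertools import groupby


def hamming(u, v):
    """Number of attributes on which two categorical inputs differ."""
    return sum(a != b for a, b in zip(u, v))


def knn_classify(query, table, k):
    """Classify `query` from a contingency table {input: {class: count}}.

    Inputs are visited in rings of equal distance; a whole ring is added
    at a time until at least k training points are covered, so all
    neighbours up to the critical distance d take part in the vote.
    """
    dist = lambda x: hamming(query, x)
    votes = Counter()
    for _, ring in groupby(sorted(table, key=dist), key=dist):
        for x in ring:
            votes.update(table[x])          # add the class counts of x
        if sum(votes.values()) >= k:
            break
    ranked = votes.most_common(2)
    if not ranked or ranked[0][1] == 0:
        return None                         # empty sample
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None                         # no strict majority
    return ranked[0][0]


table = {
    ("a1", "b1"): {"C1": 1, "C2": 0},
    ("a1", "b2"): {"C1": 0, "C2": 2},
    ("a2", "b1"): {"C1": 0, "C2": 1},
    ("a2", "b2"): {"C1": 3, "C2": 0},
}
label_k1 = knn_classify(("a1", "b1"), table, k=1)
label_k2 = knn_classify(("a1", "b1"), table, k=2)
```

In this toy table the query has a single copy at distance 0, so for k = 2 the whole ring at
distance 1 (three points) joins the vote and the label changes from C1 to C2.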
4.1 General Characterization
We provide a global characterization for the above mentioned probabilities without
any assumptions on the distance metric in this subsection.
The scenario wherein $x_i$ is classified into class $C_j$, given $i \in \{1, 2, ..., M\}$ and
$j \in \{1, 2, ..., v\}$, depends on two factors: 1) the kNN's of $x_i$ and 2) the class label of
Fig. 1. b, c and d are the 3 nearest neighbours of a.
Fig. 2. The figure shows the extent to which a point $x_i$ is near to $x_1$. The radius of the
smallest encompassing circle for a point $x_i$ is proportional to its distance from $x_1$.
$x_1$ is the closest point and $x_M$ is the farthest.
the majority of these kNN's. The first factor is determined by the distance metric
used, which may be dependent or independent of the sample as previously discussed.
The second factor is always determined by the sample. $P_{\mathcal{Z}(N)}[\zeta(x_i) = C_j]$ is
the probability of all possible ways that input $x_i$ can be classified into class $C_j$,
given the joint distribution over the input-output space. This probability for $x_i$ is
calculated by summing, over all possible sets of kNN's that the input can have, the joint
probability of having a particular set of kNN's and of the majority of this set having
class label $C_j$. Formally,
\[
P_{\mathcal{Z}(N)}[\zeta(x_i) = C_j] = \sum_{q \in Q} P_{\mathcal{Z}(N)}[q, \; c(q, j) > c(q, t), \;
\forall t \in \{1, 2, ..., v\}, \; t \neq j] \tag{3}
\]
where q is a set of kNN's of the given input and Q is the set containing all possible
q. $c(q, b)$ is a function which calculates the number of kNN's in q that lie in class $C_b$.
For example, from Table II, if $x_1$ and $x_2$ are the kNN's of some input, then $q = \{1, 2\}$
and $c(q, b) = N_{1b} + N_{2b}$. Notice that, since $x_1$ and $x_2$ are the kNN's of some input,
$\sum_{i=1}^{2} \sum_{j=1}^{v} N_{ij} \geq k$. Moreover, if the kNN's comprise the entire input sample,
then the resultant classification is equivalent to classification performed using class
priors determined by the sample. The probability
$P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x) = y \wedge \zeta'(x') = y']$ used in
the computation of the second moment is calculated by going over kNN's of two
inputs rather than one. The expression for this probability is given by,
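\[
\begin{aligned}
P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x_i) = C_j \wedge \zeta'(x_l) = C_w] ={}&
\sum_{q \in Q} \sum_{r \in R} P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[q, \; c(q, j) > c(q, t), \; r, \; c(r, w) > c(r, s), \\
& \forall s, t \in \{1, 2, ..., v\}, \; t \neq j, \; s \neq w]
\end{aligned} \tag{4}
\]
where q and r are sets of kNN's of $x_i$ and $x_l$ respectively. Q and R are sets
containing all possible q and r respectively. $c(\cdot, \cdot)$ has the same connotation as
before.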
As mentioned before the probability of a particular q (or the probability of the
joint q, r), depends on the distance metric used. The inputs (e.g. x1, x2, ...) that
are the k nearest neighbors to some given input depend on the sample irrespective
of the distance metric i.e. the kNN's of an input depend on the sample even if the
distance metric is sample independent. We illustrate this fact by an example.
EXAMPLE 1. Say $x_1$ and $x_2$ are the two closest inputs to $x_i$, where $x_1$ is closer
than $x_2$, based on some sample independent distance metric. $x_1$ and $x_2$ are both
the kNN's of $x_i$ if and only if $\sum_{a=1}^{2} \sum_{b=1}^{v} c(\{a\}, b) \geq k$,
$\sum_{b=1}^{v} c(\{1\}, b) < k$ and $\sum_{b=1}^{v} c(\{2\}, b) < k$. The first inequality states
that the total number of copies of $x_1$ and $x_2$, given by $\sum_{j=1}^{v} N_{1j}$ and
$\sum_{j=1}^{v} N_{2j}$ respectively, in the contingency Table II is
greater than or equal to k. If this inequality is true, then the class label
of input $x_i$ is definitely determined by the copies of $x_1$ or $x_2$ or both. No input besides
these two is involved in the classification of $x_i$. The second and third inequalities state
that the number of copies of $x_1$ and of $x_2$, respectively, is less than k. This forces both
$x_1$ and $x_2$ to be used in the classification of $x_i$. If the first inequality were untrue,
then farther away inputs would also play a part in the classification of $x_i$. Thus the
kNN's of an input depend on the sample irrespective of the distance metric used.
The above example also illustrates the manner in which the set q (or r) can be
characterized as a function of the sample, enabling us to compute the two probabilities
required for the computation of the moments from any given joint distribution
over the data, for sample independent metrics. Without loss of generality (w.l.o.g.)
assume $x_1, x_2, ..., x_M$ are inputs in nondecreasing order (from left to right) of their
distance from a given input $x_i$, based on some sample independent distance metric.
Then this input having the kNN given by the set $q = \{x_{a_1}, x_{a_2}, ..., x_{a_z}\}$, where
$a_1, a_2, ..., a_z \in \{1, 2, ..., M\}$ and $a_d < a_f$ if $d < f$, is equivalent to the following
conditions on the sample being true:
(1) $\sum_{l=1}^{z} \sum_{j=1}^{v} c(\{x_{a_l}\}, j) \geq k$;
(2) $\forall l \in \{1, 2, ..., z\}$, $\sum_{j=1}^{v} c(\{x_{a_l}\}, j) > 0$;
(3) $\forall l \in 2^q$ with $|l| = z - 1$, where $2^q$ is the power set of q,
$\sum_{j=1}^{v} c(l, j) < k$;
(4) $\forall x_h \in \{x_1, x_2, ..., x_M\} - q$ where $h < a_z$, $\sum_{j=1}^{v} c(\{x_h\}, j) = 0$.
The conditions imply that for the elements of
q to be the kNN's of some given input, the sum of their counts should be greater
than or equal to k, the sum of any subset of the counts (check only subsets of q of
cardinality $|q| - 1$) should be less than k, the count of each element of q is nonzero,
and the other inputs that are not in q, but are no farther from the given input than
the farthest input in q, should have counts zero. Notice that all of these conditions
are functions of the sample.
The other condition in the probabilities is that a particular class label is the
most numerous among the kNN's, which is also a function of the sample. In the case of
sample dependent metrics, the conditions that are equivalent to having a particular
set q as kNN are totally dependent on the specific distance metric used. Since these
distance metrics are sample dependent, we can certainly write these conditions as
the corresponding functions of the sample. Since all the conditions involved in the
above probabilities can be expressed as functions of the sample, we can compute
them over any joint distribution defined over the data.
4.2 Efficient Characterization for Sample Independent Distance Metrics
In the previous subsection we observed the global characterization for the kNN
algorithm. Though this characterization provides insight into the relationship
between the moments of GE, the underlying distribution and the kNN classification
algorithm, it is inefficient to compute in practical scenarios. This is due to the fact
that any given input can have itself and/or any of the other inputs as kNN. Hence,
the total number of terms in finding the probabilities in equations 3 and 4 turns
out to be exponential in the input size M. Considering these limitations, we
provide alternative expressions for computing these probabilities efficiently for sample
independent distance metrics, viz. Manhattan distance [Krause 1987], Chebyshev
distance [Abello et al. 2002], Hamming distance. The number of terms in the new
characterization we propose is linear in M for $P_{\mathcal{Z}(N)}[\zeta(x) = y]$ and quadratic in M
for $P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x) = y \wedge \zeta'(x') = y']$.
The characterization we just presented computes the probability of classifying an
input into a particular class for each possible set of kNN's separately. What if we
combine disjoint sets of these probabilities into groups, and compute
a single probability for each group? This would reduce the number of terms to
be computed, thus speeding up the computation of the moments. To
accomplish this, we use the fact that the distance between inputs is independent
of the sample. A consequence of this independence is that all pairwise distances
between the inputs are known prior to the computation of the probabilities. This
assists in obtaining a sorted ordering of inputs from the closest to the farthest for
any given input. For example, if we have inputs a1b1, a1b2, a2b1 and a2b2, then
given input a1b1, we know that a2b2 is the farthest from the given input, followed
by a1b2 and a2b1, which are equidistant, and a1b1 is the closest in terms of Hamming
distance.
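
Because a sample independent metric fixes all pairwise distances in advance, the ordering of
inputs around a reference input can be computed once, before any sample is drawn. The sketch
below reproduces the a1b1 example with Hamming distance; the function names are hypothetical
and the attribute values are those of the example.

```python
from itertools import groupby, product


def hamming(u, v):
    """Number of attributes on which two categorical inputs differ."""
    return sum(a != b for a, b in zip(u, v))


def distance_rings(reference, inputs):
    """Group inputs by Hamming distance from `reference`, nearest ring first."""
    dist = lambda x: hamming(reference, x)
    ordered = sorted(inputs, key=dist)
    return [list(ring) for _, ring in groupby(ordered, key=dist)]


# All inputs over two binary categorical attributes.
inputs = list(product(("a1", "a2"), ("b1", "b2")))
rings = distance_rings(("a1", "b1"), inputs)
```

Here `rings` holds three lists: the reference input itself, then the two equidistant inputs
a1b2 and a2b1, then a2b2. No sample counts are consulted at any point, which is the property
the grouping scheme relies on.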
Before presenting a full-fledged characterization for computing the two
probabilities, we explain the basic grouping scheme that we employ with the help of an
example.
EXAMPLE 2. W.l.o.g. let $x_1$ be the given input, for which we want to find
$P_{\mathcal{Z}(N)}[\zeta(x_1) = C_1]$. Let $x_1, x_2, ..., x_M$ be inputs arranged in increasing order of
distance from left to right. This is shown in Figure 2. In this case, the number
of terms we need to compute $P_{\mathcal{Z}(N)}[\zeta(x_1) = C_1]$ is M. The first term calculates
the probability of classifying $x_1$ into $C_1$ when the kNN are multiple instances of $x_1$
(i.e. $\sum_{j=1}^{v} N_{1j} \geq k$). Thus, the first group contains only the set $\{x_1\}$. The second
term calculates the probability of classifying $x_1$ into $C_1$ when the kNN are multiple
instances of $x_2$ or of $x_1, x_2$. The second group thus contains the sets $\{x_2\}$ and $\{x_1, x_2\}$
as the possible kNN's of $x_1$. If we proceed in this manner, eventually we have M
terms and consequently M groups. The M-th group will contain sets in which $x_M$
is an element of every set and the other elements in different sets are all possible
combinations of the remaining $M - 1$ inputs. Notice that this grouping scheme
covers all possible kNN's as in the general case stated previously, i.e.
$\bigcup_{i=1}^{M} g_i = 2^S - \{\emptyset\}$, where $g_i$ denotes the i-th group,
$S = \{x_1, x_2, ..., x_M\}$, $\emptyset$ is the empty set, and any two
groups are disjoint, i.e. $\forall i, j \in \{1, 2, ..., M\}$, $i \neq j \Rightarrow g_i \cap g_j = \emptyset$,
preventing multiple computations of the same probability. The r-th ($r > 1$) term in the
expression for $P_{\mathcal{Z}(N)}[\zeta(x_1) = C_1]$ given the contingency Table II is,
\[
\begin{aligned}
& P_{\mathcal{Z}(N)}[\zeta(x_1) = C_1, \text{ sets in } g_r \in kNN] \\
& \quad = P_{\mathcal{Z}(N)}\Big[\sum_{i=1}^{r} N_{i1} > \sum_{i=1}^{r} N_{ij}, \; \forall j \in \{2, 3, ..., v\}, \\
& \qquad\qquad \sum_{i=1}^{r} \sum_{l=1}^{v} N_{il} \geq k, \; \sum_{i=1}^{r-1} \sum_{l=1}^{v} N_{il} < k\Big]
\end{aligned} \tag{5}
\]
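where the last two conditions force only sets in $g_r$ to be among the kNN's. The
first condition ensures that $C_1$ is the most numerous class among the given kNN's.
For $r = 1$ the last condition becomes invalid and unnecessary, hence it is
removed. The probability for the second moment is the sum of probabilities which
are calculated for two inputs rather than one and over two groups, one for each
input. W.l.o.g. assume that $x_1$ and $x_2$ are 2 inputs with $x_1$ being the closest
input of $x_2$, and sets in $g_r \in kNN(1)$, i.e. kNN's for input $x_1$, and sets in
$g_s \in kNN(2)$, i.e. kNN's for input $x_2$, where $r, s \in \{2, 3, ..., M\}$. Then the rs-th term
in $P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x_1) = C_1, \zeta'(x_2) = C_2]$ is,
\[
\begin{aligned}
& P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x_1) = C_1, \text{ sets in } g_r \in kNN(1), \;
\zeta'(x_2) = C_2, \text{ sets in } g_s \in kNN(2)] \\
& \quad = P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}\Big[\sum_{i=1}^{r} N_{i1} > \sum_{i=1}^{r} N_{ij}, \; \forall j \in \{2, 3, ..., v\}, \\
& \qquad\qquad \sum_{i=1}^{r} \sum_{l=1}^{v} N_{il} \geq k, \; \sum_{i=1}^{r-1} \sum_{l=1}^{v} N_{il} < k, \\
& \qquad\qquad \sum_{i=1}^{s} N_{i2} > \sum_{i=1}^{s} N_{ij}, \; \forall j \in \{1, 3, ..., v\}, \\
& \qquad\qquad \sum_{i=1}^{s} \sum_{l=1}^{v} N_{il} \geq k, \; \sum_{i=1}^{s-1} \sum_{l=1}^{v} N_{il} < k\Big]
\end{aligned} \tag{6}
\]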
In this case, when $r = 1$ remove the condition $\sum_{i=1}^{r-1} \sum_{l=1}^{v} N_{il} < k$ from the above
probability. If $s = 1$ remove the condition $\sum_{i=1}^{s-1} \sum_{l=1}^{v} N_{il} < k$ from the above
probability.
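
For a single input, the two count conditions in the example (the closest r inputs hold at
least k points, the closest r - 1 inputs hold fewer than k) pick out exactly one group index r
for any sample with at least k points, which is why the groups never overlap. A minimal
sketch, assuming distinct inputs are ordered nearest first with no ties in distance and that
only their row totals are needed; the function name is hypothetical.

```python
from itertools import accumulate


def active_group(row_totals, k):
    """Return the 1-based group index r selected by a sample.

    row_totals[i] is the number of sample points equal to the (i+1)-th
    nearest input, i.e. the sum of row i+1 of the contingency table.
    r is the unique index whose running total first reaches k; None is
    returned when the whole sample holds fewer than k points.
    """
    for r, running in enumerate(accumulate(row_totals), start=1):
        if running >= k:
            return r
    return None


group_a = active_group([1, 0, 2, 5], k=3)   # running totals 1, 1, 3, 8
group_b = active_group([4, 1], k=3)         # the nearest input alone suffices
```

In the first call the third input is the one that brings the running total to k, so the
sample falls in the third group even though the second input has no copies in the sample.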
In the general case, there may be multiple inputs that lie at a particular distance
from any given input; i.e. the concentric circles in Figure 2 may contain more than
one input. To accommodate this case, we extend the grouping scheme previously
outlined. Previously, the group $g_r$ contained all possible sets formed by the $r - 1$
distinct closest inputs to a given input, with the r-th closest input being present
in every set. Realize that the r-th closest input is not necessarily the
r-th NN, since there may be multiple copies of any of the $r - 1$ closest inputs.
In our modified definition, the group $g_r$ contains all possible sets formed by the
$r - 1$ closest inputs, with at least one of the r-th closest inputs being present in
every set. We illustrate this with an example. Say we have inputs a1b1, a1b2,
a2b1 and a2b2; then given input a1b1, we know that a2b2 is the farthest from the
given input, followed by a1b2 and a2b1, which are equidistant, and a1b1 is the closest
in terms of Hamming distance. The group $g_1$ contains only a1b1 as before. The
group $g_2$ in this case contains the sets {a1b2}, {a2b1}, {a1b2, a2b1}, {a1b2, a1b1} and
{a2b1, a1b1}. Observe that each set has at least one of the 2 inputs a1b2, a2b1. We
now characterize the probabilities in equations 5 and 6 for this general case. Let $q_r$
denote the set containing inputs from the closest to the r-th closest, to some input
$x_i$. The function $c(\cdot, \cdot)$ has the same connotation as before. With this, the r-th term
in $P_{\mathcal{Z}(N)}[\zeta(x_i) = C_j]$, where $r \in \{2, 3, ..., G\}$ and $G \leq M$ is the number of groups, is,
\[
\begin{aligned}
& P_{\mathcal{Z}(N)}[\zeta(x_i) = C_j, \text{ sets in } g_r \in kNN] \\
& \quad = P_{\mathcal{Z}(N)}\Big[c(q_r, j) > c(q_r, l), \; \forall l \in \{1, 2, ..., v\}, \; l \neq j, \\
& \qquad\qquad \sum_{t=1}^{v} c(q_r, t) \geq k, \; \sum_{t=1}^{v} c(q_{r-1}, t) < k\Big]
\end{aligned} \tag{7}
\]
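where the last condition is removed for $r = 1$. Similarly, the rs-th term in
$P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x_i) = C_j, \zeta'(x_p) = C_w]$, where $r, s \in \{2, 3, ..., G\}$, is,
\[
\begin{aligned}
& P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x_i) = C_j, \text{ sets in } g_r \in kNN(i), \;
\zeta'(x_p) = C_w, \text{ sets in } g_s \in kNN(p)] \\
& \quad = P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}\Big[c(q_r, j) > c(q_r, l), \; \forall l \in \{1, 2, ..., v\}, \; l \neq j, \\
& \qquad\qquad \sum_{t=1}^{v} c(q_r, t) \geq k, \; \sum_{t=1}^{v} c(q_{r-1}, t) < k, \\
& \qquad\qquad c(q_s, w) > c(q_s, l), \; \forall l \in \{1, 2, ..., v\}, \; l \neq w, \\
& \qquad\qquad \sum_{t=1}^{v} c(q_s, t) \geq k, \; \sum_{t=1}^{v} c(q_{s-1}, t) < k\Big]
\end{aligned} \tag{8}
\]
where $\sum_{t=1}^{v} c(q_{r-1}, t) < k$ and $\sum_{t=1}^{v} c(q_{s-1}, t) < k$ are removed when $r = 1$ and
$s = 1$ respectively. From equation 7, $P_{\mathcal{Z}(N)}[\zeta(x_i) = C_j]$ is given by,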
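\[
P_{\mathcal{Z}(N)}[\zeta(x_i) = C_j] = \sum_{r=1}^{G} T_r \tag{9}
\]
where $T_r$ is the r-th term in $P_{\mathcal{Z}(N)}[\zeta(x_i) = C_j]$. From equation 8,
$P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x_i) = C_j, \zeta'(x_p) = C_w]$ is given by,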
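\[
P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x_i) = C_j, \zeta'(x_p) = C_w] = \sum_{r,s=1}^{G} T_{rs} \tag{10}
\]
where $T_{rs}$ is the rs-th term in $P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x_i) = C_j, \zeta'(x_p) = C_w]$.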
With this grouping scheme we have been able to reduce the number of terms in
the calculation of $P_{\mathcal{Z}(N)}[\zeta(x_i) = C_j]$ and
$P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x_i) = C_j, \zeta'(x_p) = C_w]$ from
exponential in M (the number of distinct inputs) to manageable proportions of
$O(M)$ terms for the first probability and $O(M^2)$ terms for the second probability.
Moreover, we have accomplished this without compromising on the accuracy.
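
To make the grouped sum concrete, the following sketch computes every term of the
single-input probability exactly for a tiny problem by enumerating all contingency tables
under a multinomial joint distribution. It assumes that inputs are listed nearest first with
no ties in distance. Brute-force enumeration is used only to keep the example short and
checkable; it is not the computation scheme of this paper and is feasible only for very small
N. All function names are hypothetical.

```python
from math import factorial, prod


def compositions(n, parts):
    """Yield all tuples of `parts` nonnegative integers summing to n."""
    if parts == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, parts - 1):
            yield (first,) + rest


def multinomial_pmf(counts, probs):
    """Probability of observing `counts` under a multinomial with `probs`."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    return coef * prod(p ** c for p, c in zip(probs, counts))


def group_terms(p, n, k, target):
    """Terms T_1..T_M for classifying the reference input into class `target`.

    p[i][j] is the cell probability of the (i+1)-th nearest input and class j.
    """
    m, v = len(p), len(p[0])
    flat = [pij for row in p for pij in row]
    terms = [0.0] * m
    for cells in compositions(n, m * v):
        table = [cells[i * v:(i + 1) * v] for i in range(m)]
        running = 0
        for r in range(m):
            before, running = running, running + sum(table[r])
            if running >= k and before < k:          # this group is active
                votes = [sum(table[i][j] for i in range(r + 1)) for j in range(v)]
                if all(votes[target] > votes[j] for j in range(v) if j != target):
                    terms[r] += multinomial_pmf(cells, flat)
                break
    return terms


# Two inputs, two classes, uniform cell probabilities, N = 3 and k = 2.
p = [[0.25, 0.25], [0.25, 0.25]]
t_c1 = group_terms(p, n=3, k=2, target=0)
t_c2 = group_terms(p, n=3, k=2, target=1)
prob_c1 = sum(t_c1)
```

For this symmetric toy setting the two terms are 5/32 and 1/4, so the probability of assigning
the first class is 13/32, and the second class has the same probability by symmetry. The two
probabilities do not sum to one because samples without a strict majority contribute to
neither class.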
5. SCALABILITY ISSUES
In the previous section we provided the generic characterization and the time
efficient characterization for sample independent distance metrics, relating the two
probabilities required for the computation of the first and second moments to
probabilities that can be computed using the joint distribution. In this section we
discuss approximation schemes that may be carried out to further speed up the
computation. There are two factors on which the time complexity of calculating
$P_{\mathcal{Z}(N)}[\zeta(x_i) = C_j]$ and
$P_{\mathcal{Z}(N) \times \mathcal{Z}(N)}[\zeta(x_i) = C_j, \zeta'(x_p) = C_w]$ depends:
(1) the number of terms (or smaller probabilities) that sum up to the above
probabilities,
(2) the time complexity of each term.
Reduction in number of terms: In the previous section we reduced the number
of terms to a small polynomial in M for a class of distance metrics. The
enhancement we propose here further reduces the number of terms and works even for
the general case, at the expense of accuracy, which we can control. The rth term
in the characterizations has the condition that the number of copies of the closest
$r - 1$ distinct inputs in the sample is less than k. The probability of this condition
being true decreases monotonically with increasing r. After a point, this probability
may become small enough that the total contribution of the remaining terms in the sum
is not worth computing, given the additional computational cost. We can set
a threshold such that, once the probability of this condition falls below it, we avoid
computing the terms that follow.
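The thresholding idea can be sketched in Python as follows. The sketch assumes a multinomial joint distribution, so that the number of sample points falling in the closest $r - 1$ groups is binomially distributed with success probability equal to the total mass of those groups; it also assumes that the gating condition refers to the union of the closest $r - 1$ groups. The names `binom_cdf`, `terms_to_keep` and `eps` are hypothetical, and the group masses in the usage lines are illustrative.

```python
import math


def binom_cdf(m, n, p):
    """P[X <= m] for X ~ Binomial(n, p), with terms evaluated in log space."""
    if m < 0:
        return 0.0
    if m >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0  # all n points fall in the region, and m < n
    total = 0.0
    for x in range(m + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
                    + x * math.log(p) + (n - x) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def terms_to_keep(group_masses, n, k, eps):
    """Count the leading terms whose gating probability is at least eps.

    group_masses[r - 1] is the probability mass of the r-th closest group.
    The gate of term r is P[fewer than k sample points in the closest r - 1 groups].
    """
    kept = 0
    cumulative = 0.0  # mass of the closest r - 1 groups
    for mass in group_masses:
        gate = binom_cdf(k - 1, n, cumulative)
        if gate < eps:
            break  # gates only shrink with r, so later terms are skipped too
        kept += 1
        cumulative += mass
    return kept


# Illustrative masses (an assumption): uniform inputs over 8 binary attributes,
# grouped by Hamming distance from a query input.
group_masses = [math.comb(8, d) / 256 for d in range(9)]
kept = terms_to_keep(group_masses, n=1000, k=10, eps=1e-3)
print(kept)
```

The threshold `eps` bounds the gating probability of the first skipped term, not the total discarded mass, so in practice it would be chosen with the number of remaining terms in mind.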
Reduction in term computation: Each of the terms can be computed directly
from the underlying joint distribution. Different tricks can be employed to speed
up the computation, such as collapsing cells of the table, but even then the
complexity is still a small polynomial in N. For example, using a multinomial joint
distribution, the time complexity of calculating a term for the probability of the
first moment is quartic in N and for the probability of the second moment it is
octic in N. This problem can be addressed by using the approximation techniques
proposed in [Dhurandhar and Dobra 2006]. Using techniques such as optimization,
we can find tight lower and upper bounds for the terms in essentially constant time.
Parallel computation: Note that each of the terms is self-contained and not
dependent on the others. This fact can be used to compute these terms in parallel,
eventually merging them to produce the result. This will further reduce the time
of computation.
Fig. 3. Behavior of the GE for different values of k with sample size N = 1000 and
the correlation between the attributes and class labels being 1 in (a), 0.5 in (b) and
0 in (c). Std() denotes standard deviation.
With this we have not only proposed analytical expressions for the moments of
GE for the kNN classification model applied to categorical attributes, but have also
suggested efficient methods of computing them.
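Because the terms are independent of one another, their evaluation maps directly onto a worker pool. The Python sketch below shows the pattern with the standard `concurrent.futures` module. The function `toy_term` is a placeholder standing in for the computation of a term $T_r$ and is not the expression from equation 7; the pool size is likewise an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_terms_in_parallel(term_fn, indices, max_workers=4):
    """Evaluate term_fn on every index concurrently and merge by summation."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(term_fn, indices))


def toy_term(r):
    """Placeholder for the r-th term T_r (hypothetical, for illustration only)."""
    return 0.5 ** r


total = sum_terms_in_parallel(toy_term, range(1, 11))
print(total)
```

Threads are used here only to keep the sketch simple. For CPU-bound term computations in CPython, `ProcessPoolExecutor` from the same module offers the same `map` interface and would be the more natural choice.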
6. EXPERIMENTS
In this section we portray the manner in which the characterizations can be used
to study the kNN algorithm in conjunction with model selection measures
(viz. cross-validation). Generic relationships between the moments of GE and the
moments of CE (cross-validation error) that are not algorithm specific are given
in [Dhurandhar and Dobra 2006]. We use the expressions provided in this paper
and these relationships to conduct the experiments described below. The main
objective of the experiments we report is to provide a flavor of the utility of the
expressions as a tool to study this learning method.
6.1 General Setup
We mainly conduct 4 studies. The first three studies are on synthetic data and the
fourth on 2 real UCI datasets. In our first study, we observe the performance of
the kNN algorithm for different values of k. In the second study, we observe the
convergence behavior of the algorithm with increasing sample size. In our third
study, we observe the relative performance of cross-validation in estimating the GE
for different values of k. In the three studies we vary the correlation (measured
using Chi-square [Connor-Linton 2003]) between the attributes and the class labels
to see the effect it has on the performance of the algorithm. In our fourth study,
we choose 2 UCI datasets and compare the estimates of cross-validation with the
true error estimates. We also explain how a multinomial distribution can be built
over these datasets. The same idea can be used to build a multinomial over any
discrete dataset to represent it precisely.
Fig. 4. Convergence of the GE for different values of k when the sample size (N)
increases from 1000 to 100000 and the correlation between the attributes and class
labels is 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation. In (b)
and (c), after about N = 1500, large, midrange and small values of k give the same
error, depicted by the dashed line.
Setup for studies 1-3: We set the dimensionality of the space to be 8. The
number of classes is fixed to two, with each attribute taking two values. This gives
rise to a multinomial with $2^9 = 512$ cells. If we fix the probability of observing a
datapoint in cell i to be $p_i$ such that $\sum_{i=1}^{512} p_i = 1$ and the sample size to N, we then
have a completely specified multinomial distribution with parameters N and the
set of cell probabilities $\{p_1, p_2, \ldots, p_{512}\}$. The distance metric we use is Hamming
distance and the class prior is 0.5.
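A minimal Python sketch of this setup is given below. It enumerates the 512 cells as (input, class) pairs, assigns cell probabilities that sum to one, and draws a sample of size N. The cell probabilities here are random and therefore do not enforce the class prior of 0.5 or any particular attribute/class correlation; that simplification, along with all function names, is an assumption of the sketch.

```python
import random
from itertools import product


def make_cells(dim=8, classes=(0, 1)):
    """Enumerate every (input, class) cell of the contingency table."""
    return [(x, y) for x in product((0, 1), repeat=dim) for y in classes]


def random_cell_probs(n_cells, rng):
    """Arbitrary positive cell probabilities normalized to sum to one."""
    weights = [rng.random() for _ in range(n_cells)]
    total = sum(weights)
    return [w / total for w in weights]


def draw_sample(cells, probs, n, rng):
    """Draw n i.i.d. datapoints, i.e. one realization of the multinomial."""
    return rng.choices(cells, weights=probs, k=n)


def hamming(a, b):
    """Hamming distance between two inputs of equal length."""
    return sum(u != v for u, v in zip(a, b))


rng = random.Random(0)
cells = make_cells()
probs = random_cell_probs(len(cells), rng)
sample = draw_sample(cells, probs, 1000, rng)
print(len(cells), len(sample))
```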
Setup for study 4: In case of real data we choose 2 UCI datasets whose attributes
are not limited to having binary splits. The datasets can be represented in the form
of a contingency table where each cell in the table contains the count of the number
of copies of the corresponding input belonging to a particular class. These counts in
the individual cells divided by the dataset size provide us with empirical estimates
for the individual cell probabilities ($p_i$'s). Thus, with the knowledge of N (dataset
size) and the individual $p_i$'s we have a multinomial distribution whose representative
sample is the particular dataset. Using this distribution we observe the estimates
of the true error (i.e. moments of GE) and estimates given by cross-validation for
different values of k. Notice that these estimates are also applicable (with high
probability) to other datasets that are similar to the original.
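The construction of the empirical multinomial can be sketched in a few lines of Python. The toy rows below are hypothetical and are not taken from either UCI dataset; the function name is likewise an assumption.

```python
from collections import Counter


def empirical_multinomial(rows):
    """Return (N, {cell: p_i}) where each cell is an (input, class) pair."""
    n = len(rows)
    counts = Counter(rows)  # contingency table: copies of each (input, class)
    return n, {cell: count / n for cell, count in counts.items()}


# Hypothetical toy dataset with two categorical attributes and a class label.
rows = [
    (("yellow", "small"), "T"),
    (("yellow", "small"), "T"),
    (("yellow", "large"), "F"),
    (("purple", "small"), "F"),
    (("purple", "small"), "T"),
]
n, cell_probs = empirical_multinomial(rows)
print(n, cell_probs)
```

Cells that never occur in the dataset are simply absent from the dictionary, which corresponds to an empirical probability of zero.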
A detailed explanation of these 4 studies is given below. The expressions in
equations 9, 10 are used to produce the plots.
6.2 Study 1: Performance of the kNN algorithm for different values of k.
In the first study we observe the behavior of the GE of the kNN algorithm for
different values of k and for a sample size of 1000.
In Figure 3a the attributes and the class labels are totally correlated (i.e. correlation
= 1). We observe that for a large range of values of k (from small to large)
the error is zero. This is expected since any input lies only in a single class with
the probability of lying in the other class being zero.
Fig. 5. Comparison between the GE and 10-fold cross-validation error (CE) estimate
for different values of k when the sample size (N) is 1000 and the correlation
between the attributes and class labels is 1 in (a), 0.5 in (b) and 0 in (c). Std()
denotes standard deviation.
In Figure 3b we reduce the correlation between the attributes and class labels
from being totally correlated to a correlation of 0.5. We observe that for low values
of k the error is high; it then plummets to about 0.14 and increases again for large
values of k. The high error for low values of k is because the variance of GE is large
for these low values. The reason for the variance being large is that the number of
points used to classify a given input is relatively small. As the value of k increases
this effect reduces up to a stage and then remains constant. This produces the
middle portion of the graph where the GE is the smallest. In the right portion of
the graph, i.e. at very high values of k, almost the entire sample is used to classify
any given input. This procedure is effectively equivalent to classifying inputs based
on class priors. In the general setup we mentioned that we set the priors to 0.5,
which results in the high errors.
In Figure 3c we reduce the correlation still further down to 0, i.e. the attributes
and the class labels are uncorrelated. Here we observe that the error is initially
high, then reduces and remains unchanged. As before, the initially high error is due to
the fact that the variance for low values of k is high, which later settles down.
From the three figures, Figure 3a, Figure 3b and Figure 3c, we observe a gradual
increase in GE as the correlation reduces. The values of k that give low error for
the three values of correlation and a sample size of 1000 can be read off from the
corresponding figures. In Figure 3a, we notice that small, midrange and large
values of k are all acceptable. In Figure 3b we find that midrange values (200 to
500) of k are desirable. In the third figure, i.e. Figure 3c, we discover that midrange
and large values of k produce low error.
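The qualitative behavior described in this study can also be probed by simulation. The Python sketch below is a Monte Carlo counterpart, not the analytical expressions of equations 9 and 10: it repeatedly draws a sample, trains kNN with Hamming distance, and evaluates the GE of the resulting classifier exactly against the known joint distribution. Several choices are assumptions of the sketch: a tiny space of 3 binary attributes, a toy distribution in which the class equals the first attribute, inclusion of all points tied at the boundary distance (as the conditions in equation 7 suggest), and breaking class ties in favor of the smaller label.

```python
import random
import statistics
from collections import Counter
from itertools import product


def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))


def knn_predict(x, sample, k):
    """Majority class over the closest distance groups holding at least k points."""
    votes, taken, last = Counter(), 0, None
    for features, label in sorted(sample, key=lambda pt: hamming(x, pt[0])):
        d = hamming(x, features)
        if taken >= k and d != last:
            break  # k points gathered and the current distance group is complete
        votes[label] += 1
        taken += 1
        last = d
    # Highest vote count wins; class ties go to the smaller label (assumption).
    return min(votes, key=lambda c: (-votes[c], c))


def generalization_error(sample, k, cells, probs, inputs):
    """Exact misclassification probability of the trained classifier."""
    pred = {x: knn_predict(x, sample, k) for x in inputs}
    return sum(p for (x, y), p in zip(cells, probs) if pred[x] != y)


def ge_moments(cells, probs, inputs, n, k, trials, rng):
    """Monte Carlo estimates of E[GE] and Std(GE) over random samples of size n."""
    errors = []
    for _ in range(trials):
        sample = rng.choices(cells, weights=probs, k=n)
        errors.append(generalization_error(sample, k, cells, probs, inputs))
    return statistics.mean(errors), statistics.pstdev(errors)


rng = random.Random(0)
inputs = list(product((0, 1), repeat=3))
cells = [(x, y) for x in inputs for y in (0, 1)]
# Toy joint distribution (an assumption): uniform inputs, class = first attribute.
probs = [1 / len(inputs) if y == x[0] else 0.0 for x, y in cells]
n = 40
for k in (1, 5, n):
    mean, std = ge_moments(cells, probs, inputs, n, k, trials=30, rng=rng)
    print(k, round(mean, 3), round(std, 3))
```

With k equal to the sample size every input receives the sample's majority class, so the error equals the mass of the other class, which is 0.5 for this toy distribution. This mirrors the prior-based behavior at very large k discussed above.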
Fig. 6. Comparison between the GE and 10-fold cross-validation error (CE) estimate
for different values of k when the sample size (N) is 10000 and the correlation
between the attributes and class labels is 1 in (a), 0.5 in (b) and 0 in (c). Std()
denotes standard deviation.
6.3 Study 2: Convergence of the kNN algorithm with increasing sample size.
In the second study we observe the convergence characteristics of the GE of the
kNN algorithm for different values of k, and with increasing sample size going from
1000 to 100000.
In Figure 4a the attributes and class labels are completely correlated. The error
remains zero for small, medium and large values of k irrespective of the sample size.
In this case any value of k is suitable.
In Figure 4b the correlation between the attributes and the class labels is 0.5.
For small sample sizes (less than or close to 1000), large and small values of k result in
high error while moderate values of k have low error throughout. The initial high
error for low values of k is because the variance of the estimates is high. The reason
for high error at large values of k is that it is equivalent to classifying inputs
based on priors and the prior is 0.5. At moderate values of k both these effects are
diminished and hence the error is low. From the figure we see
that after a sample size of around 1500 the errors of the low and high k converge to the error of
moderate k's. Thus here a k within the range 200 to 0.5N would be appropriate.
In Figure 4c the attributes and the class labels are uncorrelated. The initial high
error for low k's is again because of the high variance. Since the attributes and
class labels are uncorrelated with a given prior, the error is 0.5 for moderate as
well as high values of k. Here large values of k do not have higher error than the
midrange values since the prior is 0.5. The error for low values of k converges to the
errors of the comparatively larger values at around a sample size of 1500.
Here too, from the three figures, Figure 4a, Figure 4b and Figure 4c, we observe a
gradual increase in GE as the correlation reduces. At sample sizes greater than
about 1500, large, medium and small values of k all perform equally well.
6.4 Study 3: Relative performance of 10-fold cross-validation on synthetic data.
In the third and final study on synthetic data we observe the performance of 10-fold
cross-validation in estimating the GE for different values of k and sample
sizes of 1000 and 10000. The plots for the moments of cross-validation error (CE)
are produced using the expressions we derived and the relationships between the
moments of GE and the moments of CE for deterministic classification algorithms
given in [Dhurandhar and Dobra 2006].
In Figure 5a the correlation is 1 and the sample size is 1000. Cross-validation
exactly estimates the GE, which is zero irrespective of the value of k. When we
increase the sample size to 10000, as shown in Figure 6a, cross-validation still
estimates the actual error (i.e. GE) of kNN well.
In Figure 5b the correlation is set to 0.5 and the sample size is 1000. We observe
that cross-validation initially, i.e. for low values of k, underestimates the actual error,
performs well for moderate values of k and grossly overestimates the actual error
for large values of k. At low values of k the actual error is high because of the high
variance, which we have previously discussed. Hence, even though the expected
values of GE and CE are close, the variances are far apart, since the variance
of CE is low. This leads to the optimistic estimate made by cross-validation. At
moderate values of k the variance of GE is reduced and hence cross-validation
produces an accurate estimate. When k takes large values most of the sample
is used to classify an input, which is equivalent to classification based on priors.
The effect of this is more pronounced in the case of CE than GE, since a higher
percentage of the training sample ($\frac{9}{10}N$) is used for classification of an input for a
fixed k than when computing GE. Due to this, CE rises more steeply than GE.
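To make this comparison concrete, the following worked fraction assumes that each training split in 10-fold cross-validation contains $\frac{9}{10}N$ points, which is the usual construction; the specific values of N and k are illustrative only.

$$
\underbrace{\frac{k}{N}}_{\text{GE}} \qquad \text{versus} \qquad \underbrace{\frac{k}{\frac{9}{10}N} = \frac{10}{9} \cdot \frac{k}{N}}_{\text{CE}}
$$

For N = 1000 and k = 800, the neighbors make up 80% of the sample when computing GE but about 88.9% of the 900-point training split when computing CE, so classification within the cross-validation runs is already closer to classification based on priors.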
When we increase the sample size to 10000, as is depicted in Figure 6b, the poor
estimate at low values of k that we saw for a smaller sample size of 1000 vanishes.
The reason for this is that the variance of GE reduces with the increase in sample
size. Even for moderate values of k the performance of cross-validation improves,
though the difference in accuracy of estimation is not as vivid as in the previous
case. For large values of k, though the error in estimation is somewhat reduced,
it is still noticeable. In the scenario presented it is advisable to use
moderate values of k, ranging from about 200 to 0.5N, to achieve a reasonable amount
of accuracy in the prediction made by cross-validation.
In Figure 5c the attributes are uncorrelated with the class labels and the sample
size is 1000. For low values of k the variance of GE is high while the variance
of CE is low and hence the estimate of cross-validation is off. For medium and
large values of k, cross-validation estimates the GE accurately, for the same
reason mentioned above. On increasing the sample size to 10000, shown in Figure
6c, the variance of GE for low values of k reduces and cross-validation estimates the
GE with high precision. In general, the GE for any value of k will be estimated
accurately by cross-validation in this case, but for lower sample sizes (below and
around 1000) the estimates are accurate only for moderate and large values of k.
6.5 Study 4: Relative performance of 10-fold cross-validation on real datasets.
In our fourth and final study we observe the behavior of the true error (E[GE] +
Std(GE)) and the error estimated by cross-validation on 2 UCI datasets. On the
Balloon dataset in Figure 7 we observe that cross-validation estimates the true error
accurately for a k value of 2. On increasing k to 5 the cross-validation estimate
becomes pessimistic. This is because of the increase in variance of CE. We also
observe that the true error is lower for k equal to 2. The reason for this is the
fact that the expected error is much lower for this case than that for k equal to 5,
even though the variance for k equal to 2 is comparatively higher. For the dataset
on the right in Figure 7, cross-validation does a good job for both the small value
of k and the larger value of k. The true error in this case is lower for the higher k
since the expectations for both the k's are roughly the same but the variance for the
smaller k is larger. This is mainly due to the high covariance between the successive
runs of cross-validation.
Fig. 7. Comparison between true error (TE) and CE on 2 UCI datasets (panels: Balloon,
Shuttle Landing Control).
7. DISCUSSION
From the previous section we see that the expressions for the moments assist in
providing highly detailed explanations of the observed behavior. Midrange values
of k were the best in studies 1 and 2 for small sample sizes. The reason for this is
that at small values of k the prediction was based on individual cells and,
with a small sample size, the estimates were unstable, producing a large variance.
For high values of k the classification was essentially based on class priors and
hence the expected error was high, even though the variance in this case was low.
In the case of midrange values of k, the pitfalls of the extreme values of k were
circumvented (since k was large enough to reduce variance but small enough to
prevent classification based on priors) and hence the performance was superior. 10-fold
cross-validation, which is considered to be the "holy grail" in error estimation,
is not always ideal, as we have seen in the experiments. The most common reason
why cross-validation underperformed in certain specific cases was that its variance
was high, which in turn was because the covariance between successive runs of
cross-validation was high. The ability to make such subtle observations and provide
meticulous explanations for them is the key strength of the deployed methodology of
developing and using the expressions.
Another important aspect is that, in the experiments, we built a single distribution
on each test dataset to observe the best value of k. Considering the fact that
data can be noisy, we can build multiple distributions with small perturbations in
parameters (depending on the level of noise) and observe the performance of the
algorithm for different values of k using the expressions. Then we can choose a
robust value of k for which the estimates of the error are acceptable on most (or
all) built distributions. Notice that this value of k may not be the best choice on
the distribution built without perturbations. We can thus use the expressions to
make these types of informed decisions.
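One way such a robustness check could be organized is sketched below in Python. The multiplicative noise model, the acceptance rule (error within a tolerance on at least a given fraction of the perturbed distributions) and every name in the sketch are assumptions. In particular, `error_fn` is a hypothetical callable standing in for an error estimate such as E[GE] + Std(GE) computed from the expressions, and the toy error used in the usage lines has no connection to kNN.

```python
import random


def perturb(probs, noise, rng):
    """Jitter each cell probability by up to +/- noise (relative) and renormalize.

    Assumes 0 <= noise < 1 so that the perturbed total stays positive.
    """
    jittered = [p * (1.0 + rng.uniform(-noise, noise)) for p in probs]
    total = sum(jittered)
    return [p / total for p in jittered]


def robust_k(probs, candidate_ks, error_fn, tolerance, noise,
             n_perturbations, rng, min_fraction=0.9):
    """Pick the k whose worst-case error is smallest among acceptable values.

    A value of k is acceptable if its error is within tolerance on at least
    min_fraction of the perturbed distributions. Returns None if none qualifies.
    """
    dists = [perturb(probs, noise, rng) for _ in range(n_perturbations)]
    best_k, best_worst = None, None
    for k in candidate_ks:
        errors = [error_fn(d, k) for d in dists]
        fraction_ok = sum(e <= tolerance for e in errors) / len(errors)
        worst = max(errors)
        if fraction_ok >= min_fraction and (best_worst is None or worst < best_worst):
            best_k, best_worst = k, worst
    return best_k


def toy_error(dist, k):
    """Hypothetical stand-in for an error estimate; not derived from the paper."""
    return abs(k - 5) / 10 + dist[0]


chosen = robust_k([0.25] * 4, [1, 3, 4, 5, 7, 9], toy_error, tolerance=0.4,
                  noise=0.1, n_perturbations=20, rng=random.Random(1))
print(chosen)
```

Selecting by worst-case error is only one possible criterion; averaging the error over the perturbed distributions would be an equally reasonable alternative.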
As we can see, by building expressions for the moments of GE in the manner
portrayed, classification models in conjunction with popular model selection
measures can be studied in detail. The expressions can act as a guiding tool in making
the appropriate choice of model and model selection measure in desired scenarios.
For example, in the experiments we observed that 10-fold cross-validation did not
perform well in certain cases. In these cases we can use the expressions to study
cross-validation with different numbers of folds and attempt to find the ideal
number of folds for our specific situation. Moreover, such characterizations can aid
in finding answers to, or challenging the appropriateness of, questions such as: what
value of v in v-fold cross-validation gives the best bias/variance tradeoff? The
appropriateness of some queries has to be challenged at times, since it may very
well be the case that no single value of v is truly optimal. In fact, depending on
the situation, different values of v or maybe even other model selection measures
(viz. hold-out set etc.) may be optimal. Analyzing such situations and finding the
appropriate values of the parameters (i.e. v for cross-validation, or the fraction f
held out for hold-out set validation) can be accomplished using the methodology we
have deployed in the paper. Sometimes it is intuitive to anticipate the behavior of
a learning algorithm in extreme cases, but the behavior in the non-extreme cases is
not as intuitive. Moreover, determining the precise point at which the behavior of an
algorithm starts to emulate a particular extreme case is a nontrivial task. The
methodology can be used to study such cases and potentially a wide range of other relevant
questions. Essentially, studies 1 and 2 in the experimental section are examples
of such studies. In those experiments, at extreme correlations the behavior is more
or less predictable but at intermediate correlations it is not.
What the studies in the experimental section and the discussion above suggest
is that the method in [Dhurandhar and Dobra 2006] and developments such as
the ones introduced in this paper open new avenues in studying learning methods,
allowing them to be assessed for their robustness and appropriateness for a specific
task, with clear explanations being given for their behavior. These studies do not
replace but complement purely theoretical and empirical studies usually carried out
when evaluating learning methods.
8. POSSIBLE EXTENSIONS
We discussed the importance of the methodology in the previous section. Below, we
touch upon ways of extending the analysis provided in this paper. An interesting
line of future research would be to efficiently characterize the sample-dependent
distance metrics. Another interesting line would be to extend the analysis to the
continuous kNN classification algorithm. A possible way of doing this would be
to consider a set of k points that would be kNN to a given input (recollect that
to characterize the moments, we only need to characterize the behavior of the
algorithm on individual inputs, e.g. $P_{Z(N)}[\zeta(x_i) = C_j]$) and consider the remaining
$N - k$ points to lie outside the smallest ball encompassing the kNN. Under these
conditions we would integrate the density defined on the input/output space over all
possible such N (i.e. k which are kNN and the remaining $N - k$) with the appropriate
condition for class majority (i.e. to classify an input in $C_j$, we would have the
condition that at least $\lfloor k/2 \rfloor + 1$ points that are kNN lie in class $C_j$). A rigorous
analysis using ideas from this paper would have to be performed and the complexity
discussed for the continuous kNN. We plan to address these issues in the future.
9. CONCLUSION
In this paper, we provided a general characterization for the moments of GE of
the kNN algorithm applied to categorical data. In particular, we developed an
efficient characterization for the moments when the distance metric was sample
independent. We discussed issues related to scalability in using the expressions
and suggested optimizations to speed up the computation. We later portrayed the
usage of the expressions, and hence the methodology, with the help of empirical
studies. It remains to be seen how extensible such an analysis is to other learning
algorithms. However, if such an analysis is in fact possible, it can be deployed as
a tool to better understand the statistical behavior of learning models in the
non-asymptotic regime.
REFERENCES
ABELLO, J., PARDALOS, P. M., AND RESENDE, M. G. C., Eds. 2002. Handbook of massive data
sets. Kluwer Academic Publishers, Norwell, MA, USA.
BLUM, A., KALAI, A., AND LANGFORD, J. 1999. Beating the holdout: Bounds for k-fold and
progressive cross-validation. In Computational Learning Theory.
CONNOR-LINTON, J. 2003. Chi square tutorial. http://www.georgetown.edu/ I .. ..II /ballc/webtools/
web_chi_tut.html.
DHURANDHAR, A. AND DOBRA, A. 2006. Semi-analytical method for analyzing
models and model selection measures based on moment analysis.
www.cise.ufl.edu/submit/ext_ops.php?op list&type report&by_tag REP2007
296&displaylevel full.
HALL, M. A. AND HOLMES, G. 2003. Benchmarking attribute selection techniques for discrete
class data mining. IEEE Transactions on KDE.
KOHAVI, R. 1995. A study of cross-validation and bootstrap for accuracy estimation and model
selection. In Proceedings of the Fourteenth IJCAI.
KRAUSE, E. F. 1987. Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Dover.
LIU, W. AND WHITE, A. 1997. Metrics for nearest neighbour discrimination with categorical
attributes. In Research and Development in Expert Systems XIV: Proceedings of the 17th
Annual Technical Conference of the BCES Specialist Group. 51-59.
MOORE, A. W. AND LEE, M. S. 1994. Efficient algorithms for minimizing cross validation error.
In International Conference on Machine Learning. 190-198.
SHAO, J. 2003. Mathematical statistics. Springer-Verlag.
STANFILL, C. AND WALTZ, D. 1986. Toward memory-based reasoning. Commun. ACM 29, 12,
1213-1228.
STONE, C. 1977. Consistent nonparametric regression. The Annals of Statistics 5, 4, 595-645.
VAPNIK, V. 1998. Statistical Learning Theory. Wiley & Sons.
