FUSING PROBABILITY DISTRIBUTIONS WITH INFORMATION
THEORETIC CENTERS AND ITS APPLICATION TO DATA RETRIEVAL
By
ERIC SPELLMAN
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2005
Copyright 2005
by
Eric Spellman
I dedicate this work to my dearest Kayla with whom I have already learned
how to be a Doctor of Philosophy.
ACKNOWLEDGMENTS
For his supportive guidance during my graduate career, I thank Dr. Baba C.
Vemuri, my doctoral advisor. He taught me the field, offered me an interesting
problem to explore, pushed me to publish (in spite of my terminal procrastination),
and tried his best to instill in me the intangible secrets to a productive academic
career.
Also the other members of my committee have helped me greatly in my career
at the University of Florida, and I thank them all: Dr. Brett Presnell delivered the
first lecture I attended as an undergraduate; and I am glad he could attend the last
lecture I gave as a doctoral candidate. I have also benefitted from and appreciated
Dr. Anand Rangarajan's lectures, professional advice, and philosophical discussions.
While I did not have the pleasure of attending Dr. Arunava Banerjee's or Dr. Jeff
Ho's classes, I have appreciated their insights and their examples as successful early
researchers. I would also like to thank Dr. Murali Rao for stimulating debates,
Dr. Shun-ichi Amari for proposing the idea of the e-center and proofs of the related
theorems, and numerous anonymous reviewers.
My professional debts extend beyond the faculty however to my fellow
comrades-in-research. With them, I have muttered all manner of things about
the aforementioned group in the surest confidence that my mutterings would not be
betrayed. Dr. Jundong Liu, Dr. Zhizhou Wang, Tim McGraw, Fei Wang, Santosh
Kodipaka, Nick Lord, Bing Jian, Vinh Nghiem, and Evren Ozarslan all deserve
thanks.
Also deserving are the Department staff members whose hard work keeps
this place afloat and Ron Smith for designing a word-processing template without
which the process of writing a dissertation might itself require a Ph.D. For the
permission to reproduce copyrighted material within this dissertation, I thank the
IEEE (Chapter 3) and Springer-Verlag (Chapter 4). For data I thank the people
behind the Yale Face Database, images from which I used in Fig. 2-5 and Fig. 2-8,
and Michael Black for tracking data. And for the financial support which made
this work possible, I acknowledge the University of Florida's Stephen C. O'Connell
Presidential Fellowship, NIH grant RO1 NS42075, and travel grants from the
Computer and Information Science and Engineering department, the Graduate
Student Council, and the IEEE.
And finally, most importantly, I thank my family. I thank my mother-in-law
Donna Lea for all of her help these past few weeks and Neil, Abra, and Peter for
letting us take her away for that time. I thank my mother and father for everything
and my brother, too. And I of course thank my dearest Kayla and my loudest,
most obstinate, and sweetest Sophia.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER
1 INTRODUCTION
  1.1 Information Theory
  1.2 Alternative Representatives
  1.3 Minimax Approaches
  1.4 Outline of Remainder
2 THEORETICAL FOUNDATION
  2.1 Preliminary-Euclidean Information Center
  2.2 Mixture Information Center of Probability Distributions
    2.2.1 Global Optimality
  2.3 Exponential Center of Probability Distributions
  2.4 Illustrations and Intuition
    2.4.1 Gaussians
    2.4.2 Normalized Gray-Level Histograms
    2.4.3 The e-Center of Gaussians
    2.4.4 e-ITC of Histograms
  2.5 Previous Work
    2.5.1 Jensen-Shannon Divergence
    2.5.2 Mutual Information
    2.5.3 Channel Capacity
    2.5.4 Boosting
3 RETRIEVAL WITH m-CENTER
  3.1 Introduction
  3.2 Experimental Design
    3.2.1 Review of the Texture Retrieval System
    3.2.2 Improving Efficiency
  3.3 Results and Discussion
    3.3.1 Comparison to Pre-existing Efficiency Scheme
  3.4 Conclusion
4 RETRIEVAL WITH e-CENTER
  4.1 Introduction
  4.2 Lower Bound
  4.3 Shape Retrieval Experiment
  4.4 Retrieval with JS-divergence
  4.5 Comparing the m- and e-ITCs
5 TRACKING
  5.1 Introduction
  5.2 Background-Particle Filters
  5.3 Problem Statement
  5.4 Motivation
    5.4.1 Binary State Space
    5.4.2 Self-information Loss
  5.5 Experiment-One Tracker to Rule Them All?
    5.5.1 Preliminaries
    5.5.2 m-ITC
  5.6 Conclusion
6 CONCLUSION
  Limitations
  Summary
  Future Work
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
Table
5-1 Average time-till-failure of the tracker based on the arithmetic mean (AM) and on the m-ITC (800 particles)
5-2 Average time-till-failure of the tracker based on the arithmetic mean (AM) and on the m-ITC (400 particles) with second set of models
LIST OF FIGURES
Figure
2-1 Center is denoted by o and supports are denoted by x.
2-2 The ensemble of 50 Gaussians with means evenly spaced on the interval [-30, 30] and sigma = 5
2-3 The components from Fig. 2-2 scaled by their weights in the m-ITC.
2-4 The m-ITC (solid) and arithmetic mean (AM, dashed) of the ensemble of Gaussians shown in Fig. 2-2
2-5 Seven faces of one person under different expressions with an eighth face from someone else. Above each image is the weight pi which the ITC assigns to the distribution arising from that image.
2-6 The normalized gray-level histograms of the faces from Fig. 2-5. Above each distribution is the KL-divergence from that distribution to the m-ITC. Parentheses indicate that the value is equal to the KL-radius of the set. Note that as predicted by theory, the distributions which have maximum KL-divergence are the very ones which received non-zero weights in the m-ITC.
2-7 In the left column, we can see that the arithmetic mean (solid, lower left) resembles the distribution arising from the first face more closely than the m-ITC (solid, upper left) does. In the right column, we see the opposite: The m-ITC (upper right) more closely resembles the eighth distribution than does the arithmetic mean (lower right).
2-8 Eight images of faces which yield normalized gray level histograms. We choose an extraordinary distribution for number eight to contrast how the representative captures variation within a class. The number above each face weighs the corresponding distribution in the e-ITC.
2-9 KL(C, Pi) for each distribution, for C equal to the e-ITC and geometric mean, respectively. The horizontal bar represents the value of the e-radius.
3-1 Using the triangle inequality to prune
3-2 Example images from the CUReT database
3-3 On average for probes from each texture class, the speed-up relative to an exhaustive search achieved by the metric tree with the m-ITC as the representative
3-4 The excess comparisons performed by the arithmetic mean relative to the m-ITC within each texture class as a proportion of the total database
3-5 The excess comparisons performed by the best medoid relative to the m-ITC within each texture class as a proportion of the total database
4-1 Intuitive proof of the lower bound in equation 4.7 (see text). The KL-divergence acts like squared Euclidean distance, and the Pythagorean Theorem holds under special circumstances. Q is the query, P is a distribution in the database, and C is the e-ITC of the set containing P. P* is the I-projection of Q onto the set containing P. On the right, D(C, P) <= Re, where Re is the e-radius, by the minimax definition of C.
4-2 The speed-up factor versus an exhaustive search when using the e-ITC as a function of each class in the shape database.
4-3 The relative percent of additional prunings which the e-ITC achieves beyond the geometric center, again for each class number.
4-4 Speedup factor for each class resulting from using e-ITC over an exhaustive search
4-5 Excess searches performed using geometric mean relative to e-ITC as proportion of total database.
4-6 Excess searches performed using e-ITC relative to m-ITC as proportion of total database.
4-7 The ratio of the maximal X2 distance from each center to all of the elements in a class
5-1 Frame from test sequence
5-2 Average time till failure as a function of angle disparity between the true motion and the tracker's motion model
5-3 Average time till failure as a function of the weight on the correct motion model; for 10 and 20 particles
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
FUSING PROBABILITY DISTRIBUTIONS WITH INFORMATION
THEORETIC CENTERS AND ITS APPLICATION TO DATA RETRIEVAL
By
Eric Spellman
August 2005
Chair: Baba C. Vemuri
Major Department: Computer and Information Science and Engineering
This work presents two representations for a collection of probability distribu-
tions or densities dubbed information theoretic centers (ITCs). Like the common
arithmetic mean, the first new center is a convex combination of its constituent
densities in the mixture family. Analogously, the second ITC is a weighted geomet-
ric mean of densities in the exponential family. In both cases, the weights in the
combinations vary as one changes the distributions. These centers minimize the
maximum Kullback-Leibler divergence from each distribution in their collections
to themselves and exhibit an equi-divergence property, lying equally far from most
elements of their collections. The properties of these centers have been established
in information theory through the study of channel-capacity and universal codes;
drawing on this rich theoretical basis, this work applies them to the problems of
indexing for content-based retrieval and to tracking.
Many existing techniques in image retrieval cast the problem in terms of
probability distributions. That is, these techniques represent each image in the
database-as well as incoming query images-as probability distributions, thus
reducing the retrieval problem to one of finding a nearest probability distribution
under some dissimilarity measure. This work presents an indexing scheme for such
techniques wherein an ITC stands in for a subset of distributions in the database.
If the search finds that a query lies sufficiently far from such an ITC, the search
can safely disregard (i.e., without fear of reducing accuracy) the associated subset
of probability distributions without further consideration, thus speeding search.
Often in tracking, one represents knowledge about the expected motion of
the object of interest by a probability distribution on its next position. This work
considers the case in which one must specify a tracker capable of tracking any
one of several objects, each with different probability distributions governing its
motion. Related is the case in which one object undergoes different phases of
motion, each of which can be modeled independently (e.g., a car driving straight
vs. turning). In this case, an ITC can fuse these different distributions into one,
creating one motion model to handle any of the several objects.
CHAPTER 1
INTRODUCTION
Given a set of objects, one commonly wishes to represent the entire set with
one object. In this work, we address this concern for the case in which the objects
are probability distributions. Particularly, we present a novel representative
for a set of distributions whose behavior we can describe in the language of
information theory. For contrast, let us first consider the familiar arithmetic
mean: The arithmetic mean is a uniformly weighted convex combination of a set
of objects (e.g., distributions) which minimizes the sum of squared Euclidean
distances from each object to itself. However, later sections contain examples
of applications in which such a representative is not ideal. As an alternative
we describe a representative which minimizes the maximal Kullback-Leibler
divergence from itself to the set of objects. The central idea of this work is that
such a minimax representation is better than more commonly used representatives
(e.g., the arithmetic mean) in some computer vision problems. In exploring this
thesis, we present the theoretical properties of this representative and results of
experiments using it.
We examine two applications in particular, comparing the minimax represen-
tation to the more common arithmetic mean (and other representatives). These
applications are indexing collections of images (or textures or shapes) and choosing
a motion prior under uncertainty for tracking. In the first of these applications,
we follow a promising avenue of work in using a probability distribution as the
signature of a given object to be indexed. Then using an established data struc-
ture, the representative can fuse several signatures into one, thus making searches
more efficient. In the tracking application, we consider the case in which one has
some uncertainty as to the motion model that governs the object to be tracked.
Specifically, the governing model will be drawn from a known family of models. We
suggest using a minimax representative to construct a single prior distribution to
describe the expected motion of an object in a way that fuses all of the models in
the family and yields one model with the best worst-case performance.
1.1 Information Theory
As discussed in Chapter 2, the minimax representative can be characterized
in terms of the Kullback-Leibler divergence and the Jensen-Shannon divergence.
Hence, a brief review of these concepts is in order.
The Kullback-Leibler (KL) divergence [1] (also known as the relative entropy)
between two distributions p and q is defined as
KL(p, q) = \sum_i p_i \log \frac{p_i}{q_i}. (1.1)
It is convex in p, non-negative (though not necessarily finite), and equals zero if
and only if p = q. In information theory it has an interpretation in terms of the
length of encoded messages from a source which emits symbols according to a
probability distribution. While the familiar Shannon entropy gives a lower bound
on the average length per symbol a code can achieve, the KL-divergence between
p and q gives the penalty (in length per symbol) incurred by encoding a source
with distribution p under the assumption it really has distribution q; this penalty is
commonly called redundancy.
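As a concrete illustration (not part of the original dissertation), the discrete form of equation 1.1 can be computed in a few lines; the function name and the NumPy-based implementation below are illustrative choices.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL-divergence KL(p, q) = sum_i p_i log(p_i / q_i).

    p and q are nonnegative arrays summing to one.  Terms with p_i = 0
    contribute zero; the divergence is infinite if q_i = 0 while p_i > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

For example, kl_divergence([0.5, 0.5], [0.9, 0.1]) is roughly 0.51 nats, the per-symbol penalty for coding a fair coin with a code designed for a heavily biased one.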
To illustrate this, consider the Morse code, designed to send messages in
English. The Morse code encodes the letter "E" with a single dot and the letter
"Q" with a sequence of four dots and dashes. Because "E" is used frequently in
English and "Q" seldom, this makes for efficient transmission. However if one
wanted to use the Morse code to send messages in Chinese pinyin, which might
use "Q" more frequently, he would find the code less efficient. If we assume
contrafactually that the Morse code is optimal for English, this difference in
efficiency is the redundancy.
Also playing a role in this work is the Jensen-Shannon (JS) divergence. It is
defined between two distributions p and q as
JS(p, q) = \alpha KL(p, \alpha p + (1-\alpha) q) + (1-\alpha) KL(q, \alpha p + (1-\alpha) q), (1.2)
where \alpha \in (0, 1) is a fixed parameter [2]; we will also consider its straightforward
generalization to n distributions.
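The n-distribution generalization, with a weight vector in place of the single parameter alpha, can be sketched as follows; this is an illustrative implementation matching the form that reappears in equation 2.16, not code from the dissertation.

```python
import numpy as np

def js_divergence(dists, weights):
    """Generalized Jensen-Shannon divergence of n discrete distributions:
    JS = H(sum_i w_i p_i) - sum_i w_i H(p_i), with H the Shannon entropy."""
    def entropy(p):
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))
    dists = np.asarray(dists, dtype=float)      # rows are the distributions
    weights = np.asarray(weights, dtype=float)  # convex combination weights
    mixture = weights @ dists
    return entropy(mixture) - sum(w * entropy(p) for w, p in zip(weights, dists))
```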
1.2 Alternative Representatives
Also of interest is the body of work which computes averages of sets of objects
using non-Euclidean distances since the representative presented in this work
plays a similar role. One example of this appears in computing averages on a
manifold of shapes [3, 4] by generalizing the minimization characterization of the
arithmetic mean away from the squared Euclidean distance and to the geodesic
distance. Linking manifolds on one hand and distributions on the other is the field
of information geometry [5]. Using notions from information geometry one can
find the mean on a manifold of parameterized distributions by using the geodesic
distances derived from the Fisher information metric.
In this work, the representative does not minimize a function of this geodesic
distance, but rather the maximum KL-divergence. Furthermore, the representa-
tives here are restricted to simple manifolds of distributions-namely the family of
weighted arithmetic means (i.e., convex combinations) and normalized, weighted
geometric means (sometimes referred to as the "exponential family") of the con-
stituent distributions. These are simple yet interesting families of distributions and
can accommodate non-parametric representations. These two manifolds are dual in
the sense of information geometry [6], and so as one might expect, the represen-
tatives have a similar dual relationship. Pelletier forms a barrycenter based in the
KL-divergence on each of these manifolds [7]. That barrycenter, in the spirit of the
arithmetic mean and in contrast to the representative in this work, minimizes a
sum of KL-divergences. Another "mean-like" representative on the family of Gaus-
sian distributions seeks to minimize the sum of squared J-divergences (also known
as symmetrized KL) [8]. The key difference between most of these approaches and
this work is that this work seeks to present a representative which minimizes the
maximum of KL-divergences to the objects in a set.
1.3 Minimax Approaches
Ingrained in the computer vision culture is a heightened awareness of noise
in data. This is a defensive trait which has evolved over the millennia to allow
researchers to survive in a harsh environment full of imperfect sensors. With this
justified concern in mind, one naturally asks if minimizing a maximum distance
will be doomed by over-sensitivity to noise. As with many such concerns, this
depends on the application, and we argue that an approach which minimizes a
max-based function is well-suited to the applications in this work. But to show
that such a set of appropriate applications is non-empty, consider the successful
work of Huttenlocher et al. as a proof of concept [9]: They seek to minimize the
Hausdorff distance (H) between two point sets (A, B). As one can see, minimizing
the Hausdorff distance,
H(A, B) = \max(h(A, B), h(B, A)), \qquad h(A, B) = \max_{a \in A} \min_{b \in B} \mathrm{dist}(a, b),
minimizes a max-based function. Another example appears in the work of Ho et
al. for tracking. In Ho et al. [10] at each frame they find a linear subspace which
minimizes the maximal distance to a set of previous observations. Additionally,
Hartley and Schaffalitzky minimize a max-based cost function to do geometric
reconstruction [11]. Cognizant that the method is sensitive to outliers, they
recommend using it on data from which outliers have been removed. These examples
demonstrate that one cannot dismiss a priori a method as overly-sensitive to noise
just because it minimizes a max-based measure.
Closer in spirit to the approach in this work is work using the minimax en-
tropy principle. Zhu et al. [12] and Liu et al. [13] use this principle to learn models
of textures and faces. The first part of this principle (the well-known maximum
entropy or minimum prejudice principle) seeks to learn a distribution which, given
the constraint that some of its statistics fit sample statistics, maximizes entropy.
Csiszar has shown that this process is related to an entropy projection [14]. The
rationale for selecting a distribution with maximal entropy is that it is random
or simple except for the parts explained by the data. Coupled with this is the
minimum entropy principle which further seeks to impose fidelity on the model.
To satisfy this principle, Zhu et al. choose a set of statistics (constraints) to which
a maximal entropy model must adhere such that its entropy is minimized. By
minimizing the entropy, they show that this minimizes the KL-divergence between
the true distribution and the learned model. For a set S of constraints on statistics,
they summarize the approach as
S^* = \arg\min_S \max_{p \in \Omega_S} \mathrm{entropy}(p), (1.3)
where \Omega_S is the set of all probability distributions which satisfy the constraints
in S. Then with the optimal S^*, one need only find the p \in \Omega_{S^*} with maximal
entropy.
1.4 Outline of Remainder
In the next chapters, we rigorously define the minimax representatives and
present theoretical results on their properties. We also connect one of these to
its alternative and better-known identity in the information theory results for
channel capacity. Thereafter follow two chapters on using the representatives to
index databases for the sake of making retrieval more efficient. In Chapter 3 we
present a texture retrieval experiment; in this experiment the accuracy of the
retrieval is determined by the Jensen-Shannon divergence, the square root of which
is a true metric. Later in Chapter 4 we present an experiment in shape retrieval
where the dissimilarity measure is the KL-divergence. We propose a search method
which happens to work in the particular case of this data set, but in general has
no guarantee that it will not degrade in accuracy. The second area in which we
demonstrate the utility of a minimax representative is tracking. In Chapter 5 we
present experiments in which the representative stands in for an unknown motion
model. Lastly we end with some concluding points and thoughts for future work.
CHAPTER 2
THEORETICAL FOUNDATION
In this chapter we define the minimax representatives and enumerate their
properties. First, we present a casual, motivating treatment in Euclidean space to
give intuition to the idea. Then come the section's central results-the ITC for the
mixture family and the ITC on the dual manifold, the exponential family; after
defining them, we present several illustrations to lend intuition to their behavior;
and then finally we show their well-established interpretation in terms of channel
capacity.
2.1 Preliminary-Euclidean Information Center
Let S = \{f_1, \ldots, f_n\} be a set of n points in the Euclidean space \mathbb{R}^m.
Throughout our development we will consider the maximum dispersion of the
members of a set S about a point f,
D(f, S) = \max_i \| f - f_i \|^2. (2.1)
We will look for the point f_c that minimizes D(f, S) and call it the center.
Definition 1 The center of S is defined by
f_c(S) = \arg\min_f D(f, S). (2.2)
The following properties are easily proved.
Theorem 1 The center of S is unique and is given by a convex combination of the
elements of S.
f_c = \sum_{i=1}^{n} p_i f_i, (2.3)
where 0 \le p_i \le 1 and \sum_i p_i = 1. We call a point f_i for which p_i > 0 a support.
Theorem 2 Let f_c be the center of S. Then, for F = \{i : p_i > 0\}, the set of indices
called the supports, we have an equi-distance property:
\| f_i - f_c \|^2 = r^2, if i \in F, (2.4)
\| f_i - f_c \|^2 < r^2, otherwise, (2.5)
where r2 is the square of the radius of the sphere coinciding with the supports and
centered at the center.
Now that we have characterized the center first by its minimax property
(Definition 1) and then its equi-distance property (Theorem 2), we can give yet
another characterization, this one useful computationally. Define the simplex \Delta of
probability distributions on n symbols,
\Delta = \{ p = (p_1, \ldots, p_n) : 0 \le p_i, \; \sum_i p_i = 1 \}.
We define a function of p on \Delta by
D_E(p, S) = \sum_i p_i \| f_i \|^2 - \Big\| \sum_i p_i f_i \Big\|^2, (2.6)
and now use it to find the center.
Theorem 3 DE(p, S) is strictly concave in p, and the center of S is given by
f_c = \sum_i p_i f_i, (2.7)
where
p = \arg\max_{p \in \Delta} D_E(p, S) (2.8)
is the unique maximal point of DE(p, S).
An example of the center is given in Fig. 2-1. In general, the support set
consists of a relatively sparse subset of points. The points that are most extraordinary
are given high weights.
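For intuition, the Euclidean minimax center can be approximated by a simple iteration that repeatedly steps toward the currently farthest point (a Badoiu-Clarkson style update). This sketch only illustrates Definition 1; the dissertation itself does not prescribe this particular algorithm.

```python
import numpy as np

def euclidean_minimax_center(points, iters=2000):
    """Approximate arg min_f max_i ||f - f_i||^2 for a finite point set."""
    pts = np.asarray(points, dtype=float)
    c = pts.mean(axis=0)                          # start at the arithmetic mean
    for k in range(1, iters + 1):
        far = pts[np.argmax(np.sum((pts - c) ** 2, axis=1))]
        c += (far - c) / (k + 1)                  # shrinking step toward the farthest point
    return c
```

On a set containing one isolated outlier, the returned center drifts toward that outlier, mirroring the high weight Fig. 2-1 assigns to extraordinary points.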
Figure 2-1: Center is denoted by o and supports are denoted by x.
2.2 Mixture Information Center of Probability Distributions
We now move to the heart of the results. While the following derivations
use densities, they could just as easily work in the discrete case. Let f(x) be a
probability density. Given a set S of n linearly independent densities f_1, \ldots, f_n, we
consider the space M consisting of their mixtures,
M = \Big\{ f : f = \sum_i p_i f_i, \; p_i \ge 0, \; \sum_i p_i = 1 \Big\}. (2.9)
The dimension m of M satisfies m \le n - 1.
M is a dually flat manifold equipped with a Riemannian metric, and a pair of
dual affine connections [5].
The KL divergence from f_1(x) to f_2(x) is defined by
KL(f_1, f_2) = \int f_1(x) \log \frac{f_1(x)}{f_2(x)} \, dx, (2.10)
which plays the role of the squared distance in Euclidean space. The KL-
divergence from S to f is defined by
KL(S, f) = \max_i KL(f_i, f). (2.11)
Let us define the mixture information center (m-ITC) of S.
Definition 2 The m-center of S is defined by
f_m(S) = \arg\min_f KL(S, f). (2.12)
In order to analyze properties of the m-center, we define the m-sphere of
radius r centered at f by the set
S_m(f, r) = \{ f' : KL(f', f) \le r^2 \}. (2.13)
In order to obtain an analytical solution of the m-center, we remark that the
negative entropy
\varphi(f) = \int f(x) \log f(x) \, dx (2.14)
is a strictly convex function of f \in M. This is the dual potential function in M
whose second derivatives give the Fisher information matrix [5].
Given a point f in M,
f = \sum_i p_i f_i, \quad \sum_i p_i = 1, \; p_i \ge 0, (2.15)
and using \varphi above, we write the Jensen-Shannon (JS) divergence [2] of f with
respect to S as
D_m(f, S) = -\varphi(f) + \sum_i p_i \varphi(f_i). (2.16)
When we regard D_m(f, S) as a function of p = (p_1, \ldots, p_n)^T \in \Delta, we denote it by
D_m(p, S). It is easy to see
D_m(f, S) is strictly concave in p \in \Delta, (2.17)
D_m(f, S) is invariant under permutation of \{f_i\}, (2.18)
D_m(f, S) = 0 if p is an extreme point of the simplex \Delta. (2.19)
Hence from equations 2.17 and 2.19, Dm(f, S) has a unique maximum in A. In
terms of the KL-divergence, we have
D_m(p, S) = \sum_i p_i KL(f_i, f). (2.20)
Theorem 4 The m-center of S is given by the unique maximizer of D_m(p, S), that
is,
f_m = \sum_i p_i f_i, (2.21)
p = \arg\max_{p \in \Delta} D_m(p, S). (2.22)
Moreover, for supports, for which p_i > 0, KL(f_i, f_m) = r^2, and for non-supports, for
which p_i = 0, KL(f_i, f_m) \le r^2, where r^2 = \max_i KL(f_i, f_m).
By using the Lagrange multiplier \lambda for the constraint \sum_i p_i = 1, we calculate
\frac{\partial}{\partial p_i} \Big\{ D_m(p, S) - \lambda \Big( \sum_i p_i - 1 \Big) \Big\} = 0. (2.23)
From the definition of D_m and because of
\frac{\partial \varphi(f)}{\partial p_i} = \frac{\partial}{\partial p_i} \int \Big( \sum_k p_k f_k \Big) \log \Big( \sum_k p_k f_k \Big) dx = \int f_i \log f \, dx + 1, (2.24)
equation 2.23 is rewritten as
-\int f_i \log f \, dx - 1 + \varphi(f_i) - \lambda = 0 (2.25)
when p_i > 0, and then as
KL(f_i, f) = \lambda + 1. (2.26)
However, because of the constraints p_j \ge 0, this is not necessarily satisfied for some
f_j for which p_j = 0. Hence, at the extreme point f_m(S), we have
KL(f_i, f_m) = r^2 (2.27)
for supports f_i (p_i > 0), but for non-supports (p_j = 0),
KL(f_j, f_m) \le r^2. (2.28)
This distinction occurs because p maximizes \sum_i p_i KL(f_i, f), and if we assume for a
non-support density f_j that KL(f_j, f) > r^2, we arrive at a contradiction.
We remark that numerical optimization of D_m can be accomplished efficiently
(e.g., with a Newton scheme [15]), because it is concave in p.
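Because D_m(p, S) coincides with the mutual-information objective of channel-capacity problems (see Section 2.5.3), its maximizer can also be found with the classical Blahut-Arimoto multiplicative update. The sketch below, for discrete distributions stored as rows of a matrix and with names of my own choosing, is one such implementation; it is an alternative to the Newton scheme cited above, not the dissertation's code.

```python
import numpy as np

def m_itc(dists, iters=1000, tol=1e-12):
    """Maximize D_m(p, S) over the simplex with Blahut-Arimoto updates.

    dists: (n, m) array whose rows are discrete distributions f_i.
    Returns the weights p and the m-ITC mixture sum_i p_i f_i.
    """
    F = np.asarray(dists, dtype=float)
    p = np.full(F.shape[0], 1.0 / F.shape[0])
    for _ in range(iters):
        mix = p @ F                                   # current mixture f
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(F > 0, np.log(F / mix), 0.0)
        d = np.sum(F * log_ratio, axis=1)             # d_i = KL(f_i, f)
        new_p = p * np.exp(d)
        new_p /= new_p.sum()
        if np.max(np.abs(new_p - p)) < tol:
            return new_p, new_p @ F
        p = new_p
    return p, p @ F
```

At convergence the supports (p_i > 0) should all attain the same KL-divergence to the mixture, which is exactly the equi-divergence property of Theorem 4.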
2.2.1 Global Optimality
Lastly, it should be noted that while it appears that the results so far suggest
that the m-ITC is the minimax representative over merely the simplex of mix-
tures, a result by Merhav and Feder [16] (based on work by Csiszar [17]) shows
that the m-ITC is indeed the global minimax representative over all probability
distributions.
To argue so, consider a distribution g which a skeptic puts forward as the
minimizer of equation 2.12. Next, we consider the I-projection of g onto the set of
mixtures M, defined as
\hat{g} = \arg\min_{f \in M} KL(f, g). (2.29)
And from the properties of I-projections [17], we know for all f \in M that
KL(f, g) \ge KL(f, \hat{g}) + KL(\hat{g}, g) \ge KL(f, \hat{g}), (2.30)
since the term KL(\hat{g}, g) is non-negative. This tells us that the mixture
\hat{g} performs at least as well as g; so we may restrict our minimization to the set of
mixtures, knowing we will find the global minimizer.
2.3 Exponential Center of Probability Distributions
Given n densities f_1(x), \ldots, f_n(x) > 0, we define the exponential family of densities
E,
E = \Big\{ f : f(x) = \exp\Big\{ \sum_i p_i \log f_i(x) - \psi(p) \Big\}, \; 0 \le p_i, \; \sum_i p_i = 1 \Big\}, (2.31)
instead of the mixture family M. The dimension m of E satisfies m \le n - 1. E
is also a dually flat Riemannian manifold. E and M are "dual" [5], and we can
establish similar structures in E.
The potential function \psi(p) is convex, and is given by
\psi(p) = \log \int \exp\Big\{ \sum_i p_i \log f_i(x) \Big\} dx. (2.32)
The potential \psi(p) is the cumulant generating function, and is connected with the
negative entropy \varphi by the Legendre transformation. It is called the free energy in
statistical physics. Its second derivatives give the Fisher information matrix in this
coordinate system.
An e-sphere (exponential sphere) in E centered at f is the set of points
S_e(f, r) = \{ f' \in E : KL(f', f) \le r^2 \}, (2.33)
where r is the e-radius.
Definition 3 The e-center of S is defined by
f_e(S) = \arg\min_f KL(f, S), (2.34)
where
KL(f, S) = \max_i KL(f, f_i). (2.35)
We further define the JS-like divergence in E,
D_e(p, S) = -\psi(p) + \sum_i p_i \psi(f_i). (2.36)
Because of
\psi(f_i) = \log \int \exp(\log f_i(x)) \, dx = 0, (2.37)
we have
D_e(p, S) = -\psi(p). (2.38)
The function D_e = -\psi is strictly concave, and has a unique maximum. It is an
interesting exercise to show for f = \exp\{ \sum_i p_i \log f_i - \psi(p) \},
-\psi(p) = \sum_i p_i KL(f, f_i). (2.39)
Analogous to the case of M, we can prove the following.
Theorem 5 The e-center of S is unique and given by the maximizer of D_e(p, S),
f_e(S) = \exp\Big\{ \sum_i p_i \log f_i - \psi(p) \Big\}, (2.40)
p = \arg\max_p D_e(p, S). (2.41)
Moreover,
KL(f_e, f_i) = r^2, for supporting f_i (p_i \ne 0), (2.42)
and
KL(f_e, f_i) \le r^2, for non-supporting f_i (p_i = 0), (2.43)
where
r^2 = \max_i KL(f_e, f_i). (2.44)
We calculate the derivative of -\psi(p) - \lambda (\sum_i p_i - 1), where \lambda is the Lagrange multiplier, and put
\frac{\partial}{\partial p_i} \Big\{ -\psi(p) - \lambda \Big( \sum_i p_i - 1 \Big) \Big\} = 0. (2.45)
Since
f = \exp\Big\{ \sum_i p_i \log f_i(x) - \psi(p) \Big\}, (2.46)
we have
\frac{\partial \psi(p)}{\partial p_i} = \frac{\partial}{\partial p_i} \log \int \exp\Big\{ \sum_j p_j \log f_j(x) \Big\} dx = \int f(x) \log f_i(x) \, dx. (2.47)
Hence, (2.45) becomes
\int f(x) \log f_i(x) \, dx = -\lambda = \mathrm{const}, (2.48)
and hence
KL(f, f_i) = \mathrm{const} = r^2 (2.49)
for f_i with p_i > 0. For p_i = 0, KL(f_e, f_i) is not larger than r^2, because of (2.39).

Figure 2-2: The ensemble of 50 Gaussians with means evenly spaced on the interval [-30, 30] and sigma = 5
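To make Section 2.3 concrete: for a given weight vector p, the e-center is just the normalized weighted geometric mean of the densities, and the weights themselves maximize the concave function -psi(p). The sketch below handles discrete, strictly positive distributions and uses a generic optimizer over a softmax reparametrization of the simplex; it is an illustrative approach under those assumptions, not the numerical method used in the dissertation, and it assumes SciPy is available.

```python
import numpy as np
from scipy.optimize import minimize

def e_center(dists, weights):
    """Normalized weighted geometric mean exp{sum_i w_i log f_i - psi(w)}."""
    logF = np.log(np.asarray(dists, dtype=float))   # assumes every f_i(x) > 0
    g = np.exp(weights @ logF)
    return g / g.sum()

def e_itc(dists):
    """Find the weights maximizing -psi(p), i.e., minimizing the cumulant psi."""
    logF = np.log(np.asarray(dists, dtype=float))
    def psi(z):                                     # z are unconstrained logits
        w = np.exp(z - z.max()); w /= w.sum()
        return np.log(np.sum(np.exp(w @ logF)))
    res = minimize(psi, np.zeros(logF.shape[0]), method="Nelder-Mead")
    w = np.exp(res.x - res.x.max()); w /= w.sum()
    return w, e_center(dists, w)
```

Applied to the eight face histograms of Section 2.4.4 (smoothed so that no bin is zero), the recovered weights should concentrate on a few support distributions, as Fig. 2-8 illustrates.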
2.4 Illustrations and Intuition
In this section we convey some intuition regarding the behavior of the ITCs.
Particularly, we show examples in which the ITCs choose a relatively sparse subset
of densities to be support densities (assigning zero as weights to the other densities
in the set); also we find that the ITC tends to assign disproportionately high
weights to members of the set that are most :;:i .,i hi, iry" (i.e., those that are
most distinct from the rest of the set).
Figure 2-3: The components from Fig. 2-2 scaled by their weights in the m-ITC.
Figure 2-4: The m-ITC (solid) and arithmetic mean (AM, dashed) of the ensemble
of Gaussians shown in Fig. 2-2
2.4.1 Gaussians
First, we examine a synthetic example in which we analytically specify a set
of densities and then numerically compute the m-ITC of that set. We construct a
set of 50 one-dimensional Gaussian distributions with means evenly spaced on the
interval [-30, 30] and with sigma = 5. Fig. 2-2 shows all of the densities in this set.
When we compute the m-ITC, we see both of the properties mentioned above: Out
of the 50 densities specified, only eight become support densities (sparsity). And
additionally, the densities which receive the largest weight in the m-ITC are outer-
most densities, with means at -30 and 30 (highlighting "extraordinary" elements).
Fig. 2-3 shows the eight support densities scaled according to the weight which
the m-ITC assigns them. For the sake of comparison, Fig. 2-4 shows the m-ITC
compared with the arithmetic mean.
2.4.2 Normalized Gray-Level Histograms
Next we consider an example with distributions arising from image data.
Fig. 2-5 shows eight images from the Yale Face Database. The first seven images
are of the same person under different facial expressions while the last (eighth)
image is of a different person. We consider each image's normalized histogram of
gray-level intensities as a distribution and show them in Fig. 2-6.
When we take the m-ITC of these distributions, we again notice the sparsity
and the favoritism toward the boundary elements. The numbers above each face in
Fig. 2-5 are the weights that the distribution arising from that face received in the
m-ITC. Again, we find that only three out of the eight distributions are support
distributions. In the previous example, we could concretely see how the m-ITC
favored densities on the "boundary" of the set because the densities had a
clear geometric relationship among themselves, with their means arranged on a line.
In this example the notion of a boundary is not quite so obvious; yet, we think that
Figure 2-5: Seven faces of one person under different expressions with an eighth
face from someone else. Above each image is the weight pi which the ITC assigns
to the distribution arising from that image.
Figure 2-6: The normalized gray-level histograms of the faces from Fig. 2-5. Above
each distribution is the KL-divergence from that distribution to the m-ITC. Paren-
theses indicate that the value is equal to the KL-radius of the set. Note that as
predicted by theory, the distributions which have maximum KL-divergence are
the very ones which received non-zero weights in the m-ITC.
Figure 2-7: In the left column, we can see that the arithmetic mean (solid, lower
left) resembles the distribution arising from the first face more closely than the
m-ITC (solid, upper left) does. In the right column, we see the opposite: The m-
ITC (upper right) more closely resembles the eighth distribution than does the
arithmetic mean (lower right).
if one examines the eighth image and the eighth distribution, one can qualitatively
agree that it is the most "extraordinary."
Returning briefly to Fig. 2-6 we examine the KL-divergences between each
distribution and the m-ITC. We report these values above each distribution and
indicate with parentheses the maximal values. Since the three distributions with
maximal KL-divergence to the ITC are exactly the three support distributions, this
example unsurprisingly complies with Theorem 4.
Finally, we again compare the m-ITC and the arithmetic mean in Fig. 2-7,
but this time we overlay in turn the first and eighth distributions. Examining the
left column of the figure, we see that the arithmetic mean more closely resembles
the first distribution than does the m-ITC. The KL-divergences bear this out with
the KL-divergence from the first distribution to the arithmetic mean being 0.0461
which compares to 0.1827 to the m-ITC. Conversely, when we do a similar com-
parison in the right column of Fig. 2-7 for the eighth (extraordinary) distribution,
we find that the m-ITC most resembles it. Again, the KL-divergences quantify this
observation: Whereas the KL-divergence from the eighth distribution to the m-ITC
is (again) 0.1827, we find the KL-divergence to the arithmetic mean to be 0.5524.
This result falls in line with what we would expect from Theorem 4 and suggests a
more refined bit of intuition: The m-ITC better represents extraordinary distribu-
tions, but sometimes at the expense of the more common-looking distributions. Yet
overall, that trade-off yields a minimized maximum KL-divergence.
2.4.3 The e-Center of Gaussians
We consider a very simple example consisting of multivariate Gaussian
distributions with unit covariance matrix,
f_i(x) = c \exp\Big\{ -\tfrac{1}{2} \| x - \mu_i \|^2 \Big\}, (2.50)
where x, \mu_i \in \mathbb{R}^m. We have
\sum_i p_i \log f_i(x) = -\tfrac{1}{2} \Big\| x - \sum_i p_i \mu_i \Big\|^2 - \tfrac{1}{2} \Big( \sum_i p_i \| \mu_i \|^2 - \Big\| \sum_i p_i \mu_i \Big\|^2 \Big) + \log c. (2.51)
Hence, E too consists of Gaussian distributions.
This case is special because E is not only dually flat but also Euclidean where
the Fisher information matrix is the identity matrix. The KL-divergence is
KL(f_\mu, f_i) = \tfrac{1}{2} \| \mu - \mu_i \|^2, (2.53)
given by a half of the squared Euclidean distance. Hence, the e-center in E of \{f_i\}
is the same as the center of the points \{\mu_i\} in the Euclidean space.
When m = 1, that is, x is univariate, the \mu_i are a set of points on the real line.
When \mu_1 < \mu_2 < \cdots < \mu_n, it is easy to see that the center is the average of the
two extremes, \mu_1 and \mu_n. This is different from the m-center of \{f_i\}, which is not a
single Gaussian but rather a mixture.
2.4.4 e-ITC of Histograms
We return again to the histograms of Section 2.4.2. As with the m-ITC, this
example illustrates how the e-ITC gives up some representative power for the
elements with small variability for the sake of better representing the elements
with large variability. Fig. 2-8 is an extreme case of this, chosen to make this
effect starkly clear: Here again we have the same seven images of the same person
under slight variations along with an eighth image of a different person. After
representing each image by its global gray-level histogram, we compute the
uniformly weighted, normalized geometric mean and the e-ITC.
It is also worth noting that the e-ITC only selects three support distributions
out of a possible eight, exemplifying the sparsity tendency mentioned in the
Figure 2-8: Eight images of faces which yield normalized gray level histograms. We
choose an extraordinary distribution for number eight to contrast how the repre-
sentative captures variation within a class. The number above each face weighs the
corresponding distribution in the e-ITC.
Figure 2-9: KL(C, P_i) for each distribution, for C equal to the e-ITC and geometric mean, respectively. The horizontal bar represents the value of the e-radius.
previous section. By now examining Fig. 2-9, we can see that KL(C, Pi) is equal
to the e-radius (indicated by the horizontal bar) for the three support distributions
(i = 1, 7, 8) and is less for the others. This illustrates the equi-divergence property
stated previously.
In Figs. 2-8 and 2-9, the worst-case KL-divergence from the geometric mean
is 2.5 times larger than the worst-case from the e-ITC. Of course, this better worst-
case performance comes at the price of the e-ITC's larger distance to the other
seven distributions; but it is our thesis that in some applications we are eager to
make this trade.
2.5 Previous Work
2.5.1 Jensen-Shannon Divergence
First introduced by Lin [2], the Jensen-Shannon divergence has appeared in
several contexts. Sibson referred to it as the information radius [18]. While his
moniker is tantalizingly similar to the previously mentioned notions of m-/e-radii,
he strictly uses it to refer to the divergence, not the optimal value. Jardine and he
later used it as a discriminatory measure for classification [19].
Others have also used the JS-divergence and its variations as a dissimilarity
measure-for image registration and retrieval applications [20, 21], and in the
retrieval experiment of C'! lpter 3, this work will follow suit. That experiment also
makes use of the important fact that the square root of the JS-divergence (in the
case when its parameter is fixed to 1/2) is a metric [22].
Topsoe provides another take on the JS-divergence, calling a scaled, special
case of it the capacitory discrimination [23]. This name hints at the next, perhaps
most important interpretation of the JS-divergence, namely that as a measurement
of mutual information. This alternative identity is widely known. (Cf. Burbea and
Rao [24] as just one example.) And this understanding can help illuminate the
context of the m-center within information theory.
2.5.2 Mutual Information
But the obvious question arises-the mutual information between what? First,
the mutual information between two random variables X and Y with distributions
Px and Py is defined as
MI(X; Y) = H(Y) - H(Y|X), (2.54)
where H(Y) = -\sum_y P_Y(y) \log P_Y(y) is Shannon's entropy and the conditional entropy is
H(Y|X) = -\sum_x \sum_y P_{XY}(x, y) \log P_{Y|X}(y|x). (2.55)
To see the connection to the m-center and JS-divergence, let us first identify
the random variable X above as a random index into a set of random variables
\{Y_1, \ldots, Y_n\} with probability distributions \{P_1, \ldots, P_n\}. Note that X merely
takes on a number from one to n. If we now consider Y as a random variable
whose value y results from first sampling X = i and then sampling Y_i = y, we
can certainly begin to appreciate that these concocted random variables X and Y
are dependent. That is, learning the value that X takes will let you guess more
accurately what value Y will take. Conversely, learning the value of Y hints at
which distribution Y that value was drawn from and hence the value that X took.
Returning to equation 2.54 and paying particular attention to the definition
of the conditional entropy term, we can use our definitions of X and Y to get an
expression for their joint distribution,
P_{XY}(i, y) = P_X(i) P(y|i). (2.56)
Then by observing that P(y|i) = P_i(y), plugging this into equation 2.54, and
pulling P_X(i) outside of the summations with respect to y, we have
MI(X; Y) = H(Y) - \sum_i P_X(i) H(P_i). (2.57)
And this is precisely the Jensen-Shannon divergence JS(P_1, \ldots, P_n) with co-
efficients (P_X(1), \ldots, P_X(n)). So when we evaluate the JS-divergence with a
particular choice of those coefficients, we directly specify a distribution for both
the random variable X and indirectly specify a distribution (a mixture) for the
random variable Y and evaluate the mutual information between them. And when
we maximize the JS-divergence with respect to those same coefficients, we specify
the distributions X and Y which have maximum mutual information.
There are several equivalent definitions for mutual information, and by
considering a different one we can gain insight into the selection of support
distributions in the m-center. The mirror image of equation 2.54 gives us
MI(X; Y) = H(X) - H(X|Y). (2.58)
Starting from here and for convenience letting f_y(i) = P(i|y), we can derive that
MI(X; Y) = H(X) - E_y[H(f_y(i))]. (2.59)
fy(i) describes the contributions at a value y of each of the distributions which
make up the mixture. By maximizing MI(X; Y) we minimize the expected value
(over y) of the entropy of this distribution. This means that on the average
we encourage as few as possible contributors of probability to a location; this
is particularly the case at locations y with high marginal probability. Acting
as a regularization term of sorts, to prevent the process from assigning all the
probability to one component (thus driving the right-hand term to zero), is the first
term which encourages uniform weights. Maximizing the whole expression means
that we balance the impulses for few contributors and for uniform weights.
2.5.3 Channel Capacity
Of interest to information theory from its inception has been the question of
how much information can one reliably transmit over a channel.
In this case, one interprets the ensemble of distributions \{P_1, \ldots, P_n\} making
up the conditional probability P(y|i) = P_i(y) as the channel. That is, given an in-
put symbol i, the channel will transmit an output symbol y with probability Pi(y).
Given such a channel (discrete and memoryless since each symbol is independent
of the symbols before and after), one will achieve different rates of transmission de-
pending on the distribution over the source symbols Px. These rates are precisely
the mutual information between X and Y, and to find the maximum capacity of
the channel, one picks the Px yielding a maximum value. In this context, many
of the results from this section have been developed and presented quite clearly in
texts by Gallager [25, Section 4.2] and Csiszar [26, Section 2.3].
Related also is the field of universal coding. As reviewed earlier, choosing
an appropriate code depends upon the characteristics of the source which will be
encoded (e.g., Morse code for English). Universal coding concerns itself with the
problem of selecting a code suitable for any source out of a family-i.e., a code
which will have certain universal performance characteristics across the family.
Although this work initially developed in ignorance of that field, the key result in
this field which this work touches upon is the Redundancy-Capacity Theorem. This
theorem-independently proven in several places [27, 28] and later strengthened
[16]-states that the best code for which one can hope will have redundancy
equal to the capacity (or maximum transmission rate [23]) of a certain channel
whose inputs are the parameters of the family and whose outputs are the symbols
to be encoded.
2.5.4 Boosting
Surprisingly, we also find a connection in the online learning literature wherein
the AdaBoost [29] learning algorithm is recast as entropy projection [30]. This
angle, along with its extensions to general Bregman divergences, is an interesting
avenue for future work.
CHAPTER 3
RETRIEVAL WITH m-CENTER
3.1 Introduction
There are two key components in most retrieval systems-the signature
stored for an object and the (dis)similarity measure used to find a closest match.
These two choices completely determine accuracy in a nearest neighbor-type system
if one is willing to endure a potentially exhaustive search of the database. However,
because databases can be quite large, comparing each query image against every
element in the database is obviously undesirable.
In this chapter and the next we focus on speeding up retrieval without com-
promising accuracy in the case in which the object and query signatures are
probability distributions. In a variety of domains, researchers have achieved impres-
sive results utilizing such signatures in conjunction with a variety of appropriate
dissimilarity measures. This includes work in shape-based retrieval [31], image
retrieval [32, 33], and texture retrieval [34, 35, 36, 37]. (The last of which we review
in great detail in Section 3.2.) For any such retrieval system which uses a distri-
bution and a metric, we present a way to speed up queries while guaranteeing no
drop in accuracy by representing a cluster of distributions with an optimally close
representative which we call the m-ITC.
Tomes of research have concentrated on speeding up nearest neighbor searches
in non-Euclidean metric spaces. We build on this work by refining it to better suit
the case in which the metric is on probability distributions. In low-dimensional
Euclidean spaces, the familiar k-d tree and R*-tree can index point sets handily.
But in spaces with a non-Euclidean metric, one must resort to other techniques.
These include ball trees [38], vantage point trees [39], and metric trees [40, 41].
Our system utilizes a metric tree, but our main contribution is picking a single
object to represent a set. Picking a single object to describe a set of objects is one
of the most common ways to condense a large amount of data. The most obvious
way to accomplish this (when possible) is to compute the arithmetic mean or the
centroid. To contrast with the properties of our choice (the m-ITC) we point out
that the arithmetic mean minimizes the sum of squared distances from it to the
elements of its set. In the case in which the data are known (or forced) to lie on a
manifold, it is useful to pick an intrinsic mean which has the mean's minimization
property but which also lies on a manifold containing the data. This has been
explored for manifolds of shape [4, 3] and of parameterized probability densities [5].
Metric trees. To see how our representative fits into the metric tree framework,
a brief review of metric trees helps. Given a set of points and a metric (which by
definition satisfies positive-definiteness, symmetry, and the triangle inequality), a
leaf of a metric tree indexes some subset of points \{p_i\} and contains two fields, a
center c and a radius r, which satisfy d(c, p_i) \le r for all p_i, where d is the metric.
Proceeding hierarchically, an interior node indexes all of the points indexed by its
children and also ensures that its fields satisfy the constraint above. Hence using
the triangle inequality, for any subtree, one can find a lower bound on the distance
from a query point q to the entire set of points contained in that subtree
d(q, p_i) \ge d(q, c) - r, (3.1)
as illustrated in Fig. 3-1. And during a nearest neighbor search, one can recursively
use this lower bound (costing only one comparison) to prune out subtrees which
contain points too distant from the query.
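The pruning rule of equation 3.1 is easy to state in code. The sketch below searches a single level of clusters (center, radius, members) exactly, skipping any cluster whose lower bound already exceeds the best distance found so far; the data layout and names are illustrative rather than taken from the dissertation's implementation.

```python
import numpy as np

def nearest_with_pruning(query, clusters, dist):
    """Exact nearest neighbor over clusters, each a dict with keys
    'center', 'radius', and 'members', using the triangle-inequality bound
    d(q, p) >= d(q, center) - radius to skip whole clusters."""
    best, best_d = None, np.inf
    lower = [(dist(query, c["center"]) - c["radius"], c) for c in clusters]
    for lb, c in sorted(lower, key=lambda t: t[0]):
        if lb >= best_d:
            break                         # all remaining bounds are at least as large
        for p in c["members"]:
            d = dist(query, p)
            if d < best_d:
                best, best_d = p, d
    return best, best_d
```

Here dist must be a true metric (such as the square-root chi-square or square-root JS metrics of Section 3.2.1); otherwise the bound, and hence exactness, is lost.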
Importance of picking a center. Returning to the choice of the center, we see
that if the radius r in equation 3.1 is large, then the lower bound is not very tight
and subsequently, pruning will not be very efficient. On the other hand, a center
Figure 3-1: Using the triangle inequality to prune
which yields a small radius will likely lead to more pruning and efficient searches. If
we reexamine Fig. 3-1, we see that we can decrease r by moving the center toward
the point pi in the figure. This is precisely how the m-ITC behaves. We claim that
the m-ITC yields tighter clusters than the commonly used centroid and the best
medoid, allowing more pruning and better efficiency because it respects the natural
metrics used for probability distributions. We show theoretically that under the
KL-divergence, the m-ITC uniquely yields the smallest radius for a set of points.
And we demonstrate that it also performs well under other metrics.
In the following sections we first define our m-ITC and enumerate its prop-
erties. Then in Section 3.2 we review a pre-existing texture classification system
[37] which utilizes probability distributions as signatures and discuss how we build
our efficiency experiment atop this system. Then in Section 3.3 we present results
showing that the m-ITC, when incorporated in a metric tree, can improve query
efficiency; also we compare how much improvement the m-ITC achieves relative to
other representatives when each is placed in a metric tree. Lastly, we summarize
our contributions.
3.2 Experimental Design
The central claim of this chapter is that the m-ITC tightly represents a set of
distributions under typical metrics; and thus, that this better representation allows
for more efficient retrieval. With that goal in mind, we compare the m-ITC against
the centroid and the best medoid-two commonly used representatives-in order to
find which scheme retrieves the nearest neighbor with the fewest comparisons. The
centroid is the common arithmetic mean, and the best medoid of a set {pi} in this
case is the distribution \hat{p} satisfying
\hat{p} = \arg\min_{p' \in \{p_i\}} \max_j KL(p_j, p').
Notice that we restrict ourselves to the pre-existing set of distributions instead of
finding the best convex combination which is the m-ITC.
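For comparison, the best medoid used in the experiments can be computed directly from pairwise divergences; the helper below is a small illustrative sketch, assuming a kl function such as the one sketched in Chapter 1.

```python
import numpy as np

def best_medoid(dists, kl):
    """Return the member minimizing its worst-case KL-divergence from the set,
    i.e., arg min over p' in S of max_j KL(p_j, p')."""
    n = len(dists)
    worst = [max(kl(dists[j], dists[i]) for j in range(n)) for i in range(n)]
    return dists[int(np.argmin(worst))]
```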
In the experiment, each representative will serve as the center of the nodes
of a metric tree on the database. We control for the topology of the metric tree,
keeping it unchanged for each representative. Since the experiment will examine
efficiency of retrieval, not accuracy, we will choose the same signature as Varma
and Zisserman [37], and this along with the metric will completely determine
accuracy, allowing us to exclusively examine efficiency.
3.2.1 Review of the Texture Retrieval System
The texture retrieval system of Varma and Zisserman [37] [42] builds on an
earlier system [43]. It uses a probability distribution as a signature to represent
each image. A query image is matched to a database image by nearest neighbor
search based on the X2 dissimilarity between distributions. Although the system
contains a method for increasing efficiency, we do not implement it and delay
discussion of it to the next section.
Figure 3-2: Example images from the CUReT database
Texton dictionary. Before computing any image's signature, this system
requires that we first construct a texton dictionary. To construct this dictionary, we
first extract from each pixel in each training image a feature describing the texture
in its neighborhood. This feature can be a vector of filter responses at that pixel
[42] or simply a vector of intensities in the neighborhood about that pixel [37]. (We
choose the latter approach.) After clustering this ensemble of features, we take a
small set of cluster centers for the dictionary, 610 centers in total.
Computing a signature. To compute the signature of an image, one first finds
for each pixel the closest texton in the dictionary and then retains the label of
that texton. At this stage, one can imagine an image transformed from having
intensities at each pixel to having indices into the texton dictionary at each pixel,
as would result from vector quantization. In the next step one simply histograms
these labels in the same way that one would form a global gray-level histogram.
This normalized histogram on texton labels is the signature of an image.
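A compact sketch of the signature computation just described: vector-quantize each pixel's neighborhood feature against the texton dictionary and histogram the labels. The array shapes and names are assumptions made for illustration; the dissertation's own implementation details (e.g., neighborhood size) are not reproduced here.

```python
import numpy as np

def texton_signature(features, dictionary):
    """Normalized histogram of nearest-texton labels for one image.

    features:   (num_pixels, d) per-pixel neighborhood intensity vectors
    dictionary: (num_textons, d) texton cluster centers (610 in the experiments)
    """
    d2 = ((features[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    labels = np.argmin(d2, axis=1)               # nearest texton per pixel
    hist = np.bincount(labels, minlength=len(dictionary)).astype(float)
    return hist / hist.sum()
```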
Data. Again as in [37], we use the Columbia-Utrecht Reflectance and Texture
(CUReT) Database [44] for texture data. This database contains 61 varieties of
texture with each texture imaged under various (sometimes extreme) illumination
and viewing angles. Fig. 3-2 illustrates the extreme intra-class variability and the
inter-class similarity which make this database challenging for texture retrieval
systems. In this figure each row contains realizations from the same texture class
with each column corresponding to different viewing and illumination angles.
Selecting 92 images from each of the 61 texture classes, we randomly partition each
of the classes into 46 training images, which make up the database, and 46 test
images, which make up the queries. Preprocessing consists of conversion to gray
scale and mean and variance normalization.
Dissimilarity measures. Varma and Zisserman [37] measured dissimilarity
between a query q and a database element p (both distributions) using the X2
significance test,
\chi^2(p, q) = \sum_i \frac{(p_i - q_i)^2}{p_i + q_i}, (3.2)
and returned the nearest neighbor under this dissimilarity.
In our work, we require a metric, so we take the square root of equation 3.2
[23]. Note that since the square root is a monotonic function, this does not alter
the choice of nearest neighbor and therefore maintains the accuracy of the system.
Additionally we use another metric [22], the square root of the Jensen-Shannon
divergence between two distributions:
JS_{1/2}(p, q) = H\Big(\frac{p+q}{2}\Big) - \frac{1}{2} H(p) - \frac{1}{2} H(q)
             = \frac{1}{2} KL\Big(p, \frac{p+q}{2}\Big) + \frac{1}{2} KL\Big(q, \frac{p+q}{2}\Big).
Note that in contrast to equation 2.16, we now take only two distributions and fix
the mixture parameter at 1/2. Unlike the first metric, this change affects accuracy,
but in our experiments, the change in retrieval accuracy from the original system
did not exceed one percentage point. This is not surprising, since the JS-divergence has served
well in registration and retrieval applications in the past [20].
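Both dissimilarities used in the experiment, in the square-root forms that make them metrics, can be written as follows. These are illustrative implementations (a constant factor in the chi-square statistic would not change the nearest-neighbor ranking).

```python
import numpy as np

def chi2_metric(p, q):
    """Square root of the chi-square statistic of equation 3.2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    den = p + q
    terms = np.where(den > 0, (p - q) ** 2 / np.where(den > 0, den, 1.0), 0.0)
    return float(np.sqrt(np.sum(terms)))

def js_metric(p, q):
    """Square root of the Jensen-Shannon divergence with mixture parameter 1/2."""
    def entropy(x):
        x = x[x > 0]
        return float(-np.sum(x * np.log(x)))
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return float(np.sqrt(entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)))
```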
3.2.2 Improving Efficiency
After selecting the signature and dissimilarity measure (in our a case metric),
we have fixed the accuracy of the nearest neighbor search; so we can now turn our
attention to improving the efficiency of the search.
We construct a metric tree on the elements of the database to improve
efficiency. And in our experiment we will vary the representative used in the nodes
of the metric tree, finding which representative prunes most. We hold constant the
topology of the metric tree, determining each element's membership in a tree node
based on the texture variety from which it came. Specifically, each of the 61 texture
varieties has a corresponding node, and each of these nodes contains the 46 elements
arising from the realizations of that texture. Given this fixed node membership, we
can construct the appropriate representative for each node.
Then, given each representative's version of the metric tree, we perform
searches and count the number of comparisons required to find the nearest neigh-
bor.
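The pruning rule itself is the standard triangle-inequality bound: for any member x of a node with representative c and radius ρ = max d(c, x), we have d(q, x) ≥ d(q, c) − ρ, so the node can be skipped whenever d(q, c) − ρ is no better than the best distance found so far. A sketch of this single-level search, with hypothetical node fields, is:

    import numpy as np

    def nearest_neighbor(query, nodes, metric):
        """Nodes are dicts with 'rep', 'radius', and 'members'; returns the
        nearest member and the number of metric evaluations performed."""
        best, best_d, comparisons = None, np.inf, 0
        # Compare the query to every representative first.
        rep_dists = [(metric(query, n["rep"]), n) for n in nodes]
        comparisons += len(nodes)
        # Visit closer nodes first so that best_d tightens quickly.
        for d_rep, node in sorted(rep_dists, key=lambda t: t[0]):
            if d_rep - node["radius"] >= best_d:
                continue                      # triangle inequality: node cannot help
            for member in node["members"]:
                comparisons += 1
                d = metric(query, member)
                if d < best_d:
                    best, best_d = member, d
        return best, comparisons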
3.3 Results and Discussion
The results with each dissimilarity measure were very similar, and theoretical
results [22] bear this observed similarity out by showing that the JS-divergence and
χ² are asymptotically related. Below we report the results for the JS-divergence.
Fig. 3-3 shows the speed-ups which the metric tree with the m-ITC achieves for
each texture class. These data relate the total number of comparisons required
for an average query using the metric tree to the number of comparisons in an
exhaustive search. On average, the m-ITC discards 68.9% of the database, yielding
a factor of 3.2 improvement in efficiency.
It should not be surprising that the indexing out-performs an exhaustive search,
so we now consider what happens when we vary the representative in the metric
tree. On average, the arithmetic mean discards 47.1% of the database, resulting
Figure 3-3: On average for probes from each texture class, the speed-up relative to
an exhaustive search achieved by the metric tree with the m-ITC as the representa-
tive
Figure 3-4: The excess comparisons performed by the arithmetic mean relative to
the m-ITC within each texture class as a proportion of the total database
Figure 3-5: The excess comparisons performed by the best medoid relative to the
m-ITC within each texture class as a proportion of the total database
in a speed-up factor of 1.9. Fig. 3-4 plots the excess proportion of the database
which the arithmetic mean searches relative to the m-ITC for each probe. The
box and whiskers respectively plot the median, quartiles, and range of the data
for all the probes within a class. On average, the metric tree with the arithmetic
mean explores an additional 21.8% of the total database relative to the metric tree
with the m-ITC. In only 2.0% of the queries did the m-ITC fail to out-perform the
arithmetic mean. Since the proportion of excess comparisons is a positive value for
98.0% of the probes, we can conclude that the m-ITC almost always offers some
improvement and occasionally avoids searching more than a third of the database.
Fig. 3-5 shows a similar plot for the best medoid. Again, the box and whiskers
respectively plot the median, quartiles, and range of the data for all the probes
within a class. On average, the metric tree with the best medoid explores an
additional 22.1% of the total database relative to the metric tree with the m-ITC.
Here the m-ITC improves even more than it did over the arithmetic mean; in only
0.1% of the queries did the m-ITC fail to out-perform the best medoid-never once
doing more poorly.
3.3.1 Comparison to Pre-existing Efficiency Scheme
Varma and Zisserman propose a different approach to increase the efficiency of
queries [37]. They decrease the size of the database in a greedy fashion: Initially,
each texture class in the database has 46 models (one arising from each training
image). Then for each class, they discard the model whose absence impairs retrieval
performance the least on a subset of the training images. This is repeated until
the number of models is suitably small and the estimated accuracy is suitably high.
While this method performed well in practice (achieving comparable accuracy
to an exhaustive search and reducing the average number of models per class from
46 to eight or nine), it has several potential shortcomings. The first is this method's
computational expense: Although the model selection process occurs off-line and
time is not critical, the number of times one must validate models against the
entire training subset scales quadratically in the number of models. Additionally
and more importantly, the model selection procedure depends upon the subset of
the training data used to validate it. It offers no guarantee of its accuracy relative
to an exhaustive search which utilizes all the known data.
In contrast, our method can compute the m-ITC efficiently, and more impor-
tantly we guarantee that the accuracy of the more efficient search is identical to the
accuracy that an exhaustive search would achieve. Although it must be noted that
the method of Varma and Zisserman performed fewer comparisons, we believe that
building a multi-level metric tree will bridge this gap.
Lastly, the two methods can co-exist simultaneously: Since the pre-existing
approach focuses on reducing the size of the database while ours indexes the
database (solving an orthogonal problem), nothing stops us from taking the smaller
database resulting from their method and performing our indexing atop it for
further improvement.
3.4 Conclusion
Our goal was to select the best single representative for a class of probability
distributions. We chose the m-ITC which minimizes the maximum KL-divergence
from each distribution in the class to it; and when we placed it in the nodes
of a metric tree, it allowed us to prune more efficiently. Experimentally, we
demonstrated significant speed-ups over exhaustive search in a state-of-the-art
texture retrieval system on the CUReT database. The metric tree approach to
nearest neighbor searches guarantees accuracy identical to an exhaustive search of
the database. Additionally, we showed that the m-ITC outperforms the arithmetic
mean and the best medoid when these other representatives are used analogously.
Probability distributions are a popular choice for retrieval in many domains,
and as the retrieval databases grow large, there will be a need to condense many
distributions into one representative. We have shown that the m-ITC is a useful
choice for such a representative with well-behaved theoretical properties and
empirically superior results.
CHAPTER 4
RETRIEVAL WITH e-CENTER
4.1 Introduction
In the course of designing a retrieval system, one must usually consider at least
three broad elements:
1. a signature that will represent each element, allowing for compact storage and
fast comparisons,
2. a (dis)similarity measure that will discriminate between a pair of signatures
that are close and a pair that are far from each other, and
3. an indexing structure or search strategy that will allow for efficient, non-
exhaustive queries.
The first two of these elements mostly determine the accuracy of a system's
retrieval results. The focus of this chapter, like the last, is on the third point.
A great deal of work has been done on retrieval systems that utilize a prob-
ability distribution as a signature. This work has covered a variety of domains
including shape [31], texture [34], [45], [35], [36], [37], and general images [32], [33].
Of these, some have used the Kullback-Leibler (KL) divergence [1] as a dissimilarity
measure [35], [36], [32].
The KL-divergence has many nice theoretical properties, particularly its
relationship to maximum-likelihood estimation [46]. However, in spite of this,
it is not a metric, which makes it challenging to construct an indexing structure
that respects the divergence. Many basic methods exist to speed up search
in Euclidean space including k-d trees and R*-trees. And there are even some
methods for general metric spaces such as ball trees [38], vantage point trees
[39], and metric trees [40]. Yet little work has been done on efficiently finding
exact nearest neighbors under KL-divergence. In this chapter, we present a novel
means of speeding nearest neighbor search (and hence retrieval) in a database of
probability distributions when the nearest neighbor is defined as the element that
minimizes the KL-divergence to the query. This approach does have a significant
drawback which does not impair it on this particular dataset, but in general it
cannot guarantee retrieval accuracy equal to that of an exhaustive search.
The basic idea is a common one in computer science and reminiscent of the
last chapter: We represent a set of elements by one representative. During a search,
we compare the query object against the representative, and if the representative
is sufficiently far from the query, we discard the entire set that corresponds to it
without further comparisons. Our contribution lies in selecting this representative
in an optimal fashion; ideally we would like to determine the circumstances under
which we may discard the set without fear of accidentally discarding the nearest
neighbor, but this cannot always be guaranteed. For this application, we will utilize
the exponential information theoretic center (e-ITC).
In the remaining sections we first derive the expression upon which the system
makes pruning decisions. Thereafter we present the experiment showing increased
efficiency in retrieval over an exhaustive search and over the uniformly weighted
geometric mean-a reasonable alternate representative. Finally, we return to
the texture retrieval example of the previous chapter to compare the m- and e-
centers, determining which forms the tighter clusters as measured by the χ² metric
(equation 3.2).
4.2 Lower Bound
In this section we attempt to derive a lower bound on the KL-divergence from
a database element to a query which only depends upon the element through
its e-ITC. This lower bound guides the pruning and results in the subsequent
increased search efficiency which we describe in Section 4.3.
Figure 4-1: Intuitive proof of the lower bound in equation 4.7 (see text). The
KL-divergence acts like squared Euclidean distance, and the Pythagorean Theo-
rem holds under special circumstances. Q is the query, P is a distribution in the
database, and C is the e-ITC of the set containing P. P* is the I-projection of Q
onto the set containing P. On the right, $D(C\|P) \le R_e$, where $R_e$ is the e-radius,
by the minimax definition of C.
In order to search for the nearest element to a query efficiently, we need to
bound the KL-divergence to a set of elements from beneath by a quantity which
only depends upon the e-ITC of that set. That way, we can use the knowledge
gleaned from a single comparison to avoid individual comparisons to each member
of the set.
We approach such a lower bound by examining the left side of Fig. 4-1. Here
we consider a query distribution Q and an arbitrary distribution P in a set which
has C as its e-ITC. As a stepping stone to the lower bound, we briefly define the
I-projection of a distribution Q onto a space E as
$$P^* = \arg\min_{P \in E} D(P\|Q). \qquad (4.1)$$
It is well known that one can use intuition about the squared Euclidean distance
to appreciate the properties of the KL-divergence; and in fact, in the case of the
I-projection P* of Q onto E, we have in some cases a version of the familiar
Pythagorean Theorem [26]. In this case we have for all $P \in E$ that

$$D(P\|Q) \ge D(P^*\|Q) + D(P\|P^*). \qquad (4.2)$$
And in the case that

$$P^* = \alpha P_1 + (1 - \alpha) P_2 \qquad (4.3)$$

for distributions $P_1, P_2 \in E$ with $0 < \alpha < 1$, we call $P^*$ an algebraic inner
point and we have equality in equation 4.2.
Unfortunately it is this condition which we cannot verify easily as it depends
upon both the query (which we will label Q) and the structure of E which is
determined by the database. Interestingly we do get the equality version when
we take E as a linear family, like the families with which the m-ITC is concerned.
Regardless, we will continue with the derivation and demonstrate its use in this
application.
Assuming equality in equation 4.2 and applying it twice yields

$$D(P\|Q) = D(P^*\|Q) + D(P\|P^*), \qquad (4.4)$$

$$D(C\|Q) = D(P^*\|Q) + D(C\|P^*), \qquad (4.5)$$

where we are free to select $P \in E$ as an arbitrary database element and C as the
e-ITC. Equation 4.4 corresponds to $\triangle QPP^*$ while equation 4.5 corresponds to
$\triangle QCP^*$ in Fig. 4-1.
If we subtract the two equations above and re-arrange, we find

$$D(P\|Q) = D(C\|Q) + D(P\|P^*) - D(C\|P^*). \qquad (4.6)$$
But since the KL-divergence is non-negative, and since the e-radius $R_e$ is a uniform
upper bound on the KL-divergence from the e-ITC to any $P \in E$, we have

$$D(P\|Q) \ge D(C\|Q) - R_e. \qquad (4.7)$$
Here we see that the m-ITC would do little better in guaranteeing this particular
bound. While it would insure that we had equality in equation 4.2, it could not
bound the last term in equation 4.6 because the order of the arguments is reversed.
We can get an intuitive, nonrigorous view of the same lower bound by again
borrowing notions from squared Euclidean distance. This pictorial reprise of
equation 4.7 can lend valuable insight into the tightness of the bound and its
dependence on each of the two terms. For this discussion we refer to the right side
of Fig. 4-1.
The minimax definition tells us that $D(C\|P) \le R_e$. We consider the case in
which this is equality and sweep out an arc centered at C with radius Re from the
base of the triangle counter-clockwise. We take the point where a line segment from
Q is tangent to this arc as a vertex of a right triangle with hypotenuse of length
$D(C\|Q)$. The leg which is normal to the arc has length $R_e$ by construction, and by
the Pythagorean Theorem the other leg of this triangle, which originates from Q,
has length $D(C\|Q) - R_e$. We can use the length of this leg to visualize the lower
bound, and by inspection we see that it will always be exceeded by the length of
the line segment originating from Q and terminating further along the arc at P.
This segment has length $D(P\|Q)$ and is indeed the quantity we seek to bound
from below.
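Operationally, equation 4.7 is used exactly like the triangle-inequality bound of the previous chapter: a cluster whose center C satisfies D(C‖Q) − R_e ≥ (best divergence found so far) cannot contain the nearest neighbor, subject to the caveat about equality in equation 4.2 noted above. A sketch with illustrative names only:

    import numpy as np

    def kl(p, q, eps=1e-12):
        """D(p || q) for normalized histograms, with clipping to avoid log(0)."""
        p = np.clip(p, eps, None)
        q = np.clip(q, eps, None)
        return float(np.sum(p * np.log(p / q)))

    def nearest_by_kl(query, clusters):
        """Clusters are dicts with 'center' (the e-ITC), 'radius' (R_e), 'members'."""
        best, best_d = None, np.inf
        for cl in clusters:
            if kl(cl["center"], query) - cl["radius"] >= best_d:
                continue                      # equation 4.7 rules out every member
            for p in cl["members"]:
                d = kl(p, query)
                if d < best_d:
                    best, best_d = p, d
        return best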
4.3 Shape Retrieval Experiment
In this section we apply the e-ITC and the lower bound in equation 4.7 to
represent distributions arising from shapes. Since the lower bound guarantees that
we only discard elements that cannot be nearest neighbors, the accuracy of retrieval
is as good as an exhaustive search.
While we know from the theory that the e-ITC yields a smaller worst-case KL-
divergence, we now present an experiment to test if this translates into a tighter
bound and more efficient queries. We tackle a shape retrieval problem, using shape
distributions [31] as our signature. To form a shape distribution from a 3D shape,
we uniformly sample pairs of points from the surface of the shape and compute
the distance between these random points, building a histogram of these random
distances. To account for changes in scale, we independently scale each histogram
so that the maximum distance is always the same. For our dissimilarity measure,
we use KL-divergence, so the nearest neighbor P to a query distribution Q is
$$P = \arg\min_{P'} D(P'\|Q). \qquad (4.8)$$
For data, we use the Princeton Shape Database [47] which consists of over 1800
triangulated 3D models from over 160 classes including people, animals, buildings,
and vehicles.
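A sketch (not the implementation used here) of the shape-distribution signature described above, assuming the mesh is given as vertex and face arrays; the number of sampled pairs and histogram bins are illustrative parameters.

    import numpy as np

    def sample_surface_points(vertices, faces, n, rng):
        """Sample n points uniformly (by area) from a triangulated surface."""
        tri = vertices[faces]                                    # (F, 3, 3)
        areas = 0.5 * np.linalg.norm(
            np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
        idx = rng.choice(len(faces), size=n, p=areas / areas.sum())
        r1, r2 = rng.random(n), rng.random(n)
        s = np.sqrt(r1)                                          # uniform barycentric sampling
        a, b, c = tri[idx, 0], tri[idx, 1], tri[idx, 2]
        return (1 - s)[:, None] * a + (s * (1 - r2))[:, None] * b + (s * r2)[:, None] * c

    def shape_distribution(vertices, faces, n_pairs=100000, n_bins=64, seed=0):
        """Histogram of pairwise distances, rescaled so the maximum distance is 1."""
        rng = np.random.default_rng(seed)
        d = np.linalg.norm(sample_surface_points(vertices, faces, n_pairs, rng)
                           - sample_surface_points(vertices, faces, n_pairs, rng), axis=1)
        hist, _ = np.histogram(d / d.max(), bins=n_bins, range=(0.0, 1.0))
        return hist.astype(float) / hist.sum()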
To test the efficiency, we again compare the e-ITC to the uniformly weighted,
normalized geometric mean. Using the convexity of E we can generalize the lower
bound in equation 4.7 to work for the geometric mean by replacing the e-radius
with $\max_i D(C\|P_i)$ for our different C.
We take the base classification accompanying the database to define our
clusters, and then compute the e-ITC and geometric means of each cluster.
When we consider a novel query model (on a leave-one-out basis), we search
for the nearest neighbor utilizing the lower bound and disregarding unnecessary
comparisons. For each query, we measure the number of comparisons required to
find the nearest neighbor.
Figure 4-2: The speed-up factor versus an exhaustive search when using the e-ITC
as a function of each class in the shape database.
Figure 4-3: The relative percent of additional prunings which the e-ITC achieves
beyond the geometric center, again for each class number.
Fig. 4-2 and Fig. 4-3 show the results of our experiment. In Fig. 4-2, we
see the speed-up factor that the e-ITC achieves over an exhaustive search. Aver-
aged over all probes in all classes, this speed-up factor is approximately 2.6; the
geometric mean achieved an average speed-up of about 1.9.
And in Fig. 4-3, we compare the e-ITC to the geometric mean and see that
for some classes, the e-ITC allows us to discard nearly twice as many unworthy
candidates as the geometric mean. For no class of probes did the geometric mean
prune more than the e-ITC, and when averaged over all probes in all classes, the
e-ITC discarded over 30% more elements than did the geometric mean.
4.4 Retrieval with JS-divergence
In this section we mimic the experiments of the previous chapter. Instead of
using the KL-divergence to determine nearest neighbors, we return to the square-
root of the JS-divergence, and we again use the triangle inequality to guarantee no
decrease in accuracy.
Using metric trees with the e-center, geometric mean, and m-center, we can
compare the efficiencies of each representative and the overall speedup. In Fig. 4-4,
we see the speedup relative to an exhaustive search when using the e-ITC; on
average, the speedup factor is 1.53. In Fig. 4-5 we compare the e-ITC to the
geometric mean, much as we compared the m-ITC to the arithmetic mean in the
last chapter. We claim that this is the natural comparison because the exponential
family consists of weighted geometric means. Here the geometric mean searches on
average 7.2% more of the database than the e-ITC. Lastly in Fig. 4-6 we compare
the two centers against each other. Here the e-ITC comes up short, searching on
average an additional 14% of the database. In the next section we try to explain
this result.
Figure 4-4: Speedup factor for each class resulting from using the e-ITC over an ex-
haustive search
Figure 4-5: Excess searches performed using geometric mean relative to e-ITC as
proportion of total database.
Figure 4-6: Excess searches performed using e-ITC relative to m-ITC as proportion
of total database.
4.5 Comparing the m- and e-ITCs
Since both centers have found successful application to retrieval, it is reason-
able to explore their relationship. Since the arguments in their respective minimax
criteria are reversed, it is not immediately clear that a meaningful comparison
could be made with KL-divergence alone. Hence, we resort to the χ² distance from
the previous chapter as an arbiter (though we could just as well use JS-divergence).
The comparison we make next is simple. Returning to the texture retrieval
dataset from the previous chapter, we use the same set memberships and calculate
an m-ITC and an e-ITC for each set. Then for each representative and each set we
determine what the maximum χ² distance between an element of that set and the
representative is.
The results appear in Fig. 4-7 as the ratio between this "χ² radius" of the
e-ITC and the m-ITC. Since the numbers are greater than one with the exception
of only two out of 61 classes, it is safe to conclude in this setting that the m-ITC
forms tighter clusters. This result helps explain the superior performance of the
m-center in the previous section.
In retrospect, one could attribute this to the m-ITC's global optimality
property (cf. Section 2.2.1) which the e-ITC may not share.
Figure 4-7: The ratio of the maximal χ² distance from each center to all of the
elements in a class
CHAPTER 5
TRACKING
5.1 Introduction
In the previous chapters, we considered the problem of efficient retrieval. In
the case of retrieval, where a uniform upper bound is important, one measures how
well a representative does by focusing on how it handles the most distant members.
This property is why the minimax representatives are well-suited to retrieval. In
this chapter, we explore the question of whether the same can be said for tracking.
We first present several encouraging signs that in fact it may be, and then we go
on to consider an experiment to test the performance of a tracker built around the
m-ITC. But first we set the context of the tracking problem in probabilistic terms.
5.2 Background-Particle Filters
The tracking problem consists of estimating and maintaining a hidden variable
(usually position, sometimes pose or a more complicated state) from a sequence
of observations. Commonly, instead of simply keeping one guess as to the present
state and updating that guess at each new observation, one stores and updates
an entire probability distribution on the state space; then, when pressed to give a
single, concrete estimate of the state, one uses some statistic (e.g., mean or mode)
of that distribution.
This approach is embodied famously in the Kalman filter, where the probabil-
ity distribution on the state space is restricted to be a Gaussian, and the trajectory
of the state (or simply the motion of the object) is assumed to follow linear dynam-
ics so that the Gaussian at one time step may propagate to another Gaussian at
the next.
When this assumption is too limiting-possibly because background clutter
creates multiple modes in the distribution on the state space or because of complex
dynamics or both-researchers often turn to a particle filter to track an object [48].
Unlike the Kalman filter, particle filters do not require that the probability dis-
tribution on the state space be a Gaussian. To gain this additional representative
power, a particle filter stores a set of samples from the state space with a weight for
each sample describing its probability. The goal then is to update these sets when
new observations arrive. One can update from a time step t − 1 to a time step t in
three steps [48] (a code sketch follows the list):
1. Given a sample set $\{s_{t-1}^{(1)}, \ldots, s_{t-1}^{(N)}\}$ and its associated weights
$\{\pi_{t-1}^{(1)}, \ldots, \pi_{t-1}^{(N)}\}$, randomly sample (with replacement) a new set
$\{\tilde{s}_{t-1}^{(1)}, \ldots, \tilde{s}_{t-1}^{(N)}\}$.
2. To arrive at the new $s_t^{(i)}$, randomly propagate each $\tilde{s}_{t-1}^{(i)}$ by assigning $s_t^{(i)}$
the value of x drawn according to a probability distribution (i.e., the motion model)
$p_m(x \mid \tilde{s}_{t-1}^{(i)})$.
3. Adjust the weights according to the likelihood of observing the data $z^{(t)}$ given
that the true state is $s_t^{(i)}$,

$$\pi_t^{(i)} = \frac{1}{Z}\, p_d\!\left(z^{(t)} \mid s_t^{(i)}\right), \qquad (5.1)$$

where Z is a normalization factor.
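A minimal sketch of these three steps for the two-dimensional, Gaussian, shift-invariant setting used below; the likelihood function and variable names are placeholders rather than the actual tracker code.

    import numpy as np

    def particle_filter_step(samples, weights, z, motion_mean, motion_cov,
                             likelihood, rng):
        """One CONDENSATION-style update of an (N, 2) sample set."""
        n = len(samples)
        # 1. Resample with replacement according to the current weights.
        samples = samples[rng.choice(n, size=n, p=weights)]
        # 2. Propagate each sample through the shift-invariant Gaussian motion model.
        samples = samples + rng.multivariate_normal(motion_mean, motion_cov, size=n)
        # 3. Reweight by the likelihood of the new observation and normalize.
        weights = np.array([likelihood(z, s) for s in samples], dtype=float)
        weights /= weights.sum()
        return samples, weights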
In this work we restrict our attention to a two dimensional state space
consisting only of an object's position. Also we exclusively consider motion models
pm in which
$$p_m(x \mid y) = p_m(x + x_0 \mid y + y_0) = G(x - y), \qquad (5.2)$$

where G is a Gaussian. That is, $p_m$ is a "shift-invariant" model in which the
probability of the displacement from a position in one time step to a position in the
next is independent of that starting position. Furthermore, we take that probability
of a displacement as determined by a Gaussian.
5.3 Problem Statement
In this chapter we consider the following scenario: We have several objects,
and for each object we know its distinct probabilistic motion model. We want to
build a single particle filter with one motion model which can track any of the
objects. This is reminiscent of the problem of designing a universal code that can
efficiently encode any of a number of distinct sources, and that similarity suggests
that this is a promising application for an information theoretic center.
Related to this problem is the case in which one single object undergoes
distinct "phases" of motion, each of which has a distinct motion model. An
example of this is a car that moves in one fashion when it drives straight and in a
completely different fashion when turning. This work does not explore such multi-
phase tracking. For this related problem of multi-phase motion, there are certainly
more complicated motion models suited to the problem [49]. And for the problem
we focus on in this chapter, one could also imagine refining the motion model as
observations arrive, until, once in possession of a preponderance of evidence, one
finally settles on the most likely component. But all of these require some sort of
on-line learning, while in contrast the approach we present offers a single, simple,
fixed prior which can be directly incorporated into the basic particle filter.
5.4 Motivation
5.4.1 Binary State Space
To motivate the use of an ITC, we begin with a toy example in which we
"track" the value of a binary variable. Our caricature of a tracker is based on
a particle filter with N particles. In this example, we make a further, highly
simplifying assumption: The observation model is perfect and clutter-free. That
is, the likelihood $p_d$ in equation 5.1 has perfect information. This means that any
particles which propagate to the incorrect state receive a weight of zero, and the
particles (if any) which propagate to the correct state share the entire probability
mass among themselves.
Under these assumptions, the event that the tracker fails (and henceforth never
recovers) at each time step is an independent, identically distributed Bernoulli
random variable with probability
$$P_{\text{fail}} = p(1-q)^N + (1-p)q^N, \qquad (5.3)$$
where p is the probability the tracker takes on state 0 and q is the probability
that a particle evolves under its motion model to state 0. Similarly, 1 − p is the
probability the tracker takes on state 1 and 1 − q is the probability that a particle
evolves to state 1. What the equation above says is that our tracker will fail if and
only if all N particles choose wrongly.
Now the interesting thing about this example is that a motion model in
which q ≠ p can outperform one in which q = p. Specifically, by differentiating
equation 5.3 with respect to q, we find that $P_{\text{fail}}$ takes on a minimum when
$$q = \frac{1}{1 + \left(\frac{1-p}{p}\right)^{1/(N-1)}}. \qquad (5.4)$$
As a concrete example, we take the case with p = .1 and N = 10; here the optimal
value for q is .4393. When we find the expected number of trials until the tracker
fails in each case (simply $1/P_{\text{fail}}$), we find that if we take a motion model with q = p,
the tracker goes for an average of 29 steps, but if we take the optimal value of q,
the tracker continues for 1825 steps.
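These figures are easy to reproduce; the following few lines (an illustrative check, not part of the original experiments) evaluate equations 5.3 and 5.4 for p = .1 and N = 10.

    p, N = 0.1, 10

    def p_fail(q):
        """Equation 5.3: probability that all N particles choose wrongly."""
        return p * (1 - q) ** N + (1 - p) * q ** N

    q_opt = 1.0 / (1.0 + ((1 - p) / p) ** (1.0 / (N - 1)))   # equation 5.4
    print(round(q_opt, 4))            # 0.4393
    print(round(1 / p_fail(p)))       # about 29 steps with q = p
    print(round(1 / p_fail(q_opt)))   # roughly 1800 steps with the optimal q (cf. 1825 in the text)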
Here we see reminders of how the ITCs give more weight to the extraordinary
situations than other representatives. In this case it is justified to under-weight the
most likely case because even having a single particle arrive at the correct location
is as good as having all N particles arrive.
5.4.2 Self-information Loss
For more evidence suggesting that the ITC might be well-suited, we consider
the following analysis [50]. Suppose we try to predict a value x with distribution
p(x). But instead of picking a single value, we specify another distribution q(x)
which defines our confidence for each value of x.
Now depending on the value x that occurs, we pay a penalty based on how
much confidence we had in that value; if we had a great deal of confidence (q(x)
close to one), we pay a small penalty, and if we did not give much credence to the
value, we pay a larger penalty. The self-information loss function is a common
choice for this penalty. According to this function, if the value x occurs, we would
incur a penalty of $-\log q(x)$. If we examine the expected value of our loss, we find

$$E_p\left[-\log q(x)\right] = KL(p, q) + H(p). \qquad (5.5)$$
Returning to our problem statement, if we are faced with a host of distribu-
tions and want to find a single q that minimizes the worst-case expected loss in
equation 5.5 over the set of distributions, we begin to approach something like the m-ITC.
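The identity in equation 5.5 is purely algebraic; the following small numeric check (with an arbitrary p and q chosen purely for illustration) confirms it.

    import numpy as np

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])

    expected_loss = -np.sum(p * np.log(q))                       # E_p[-log q(x)]
    kl_term = np.sum(p * np.log(p / q))                          # KL(p, q)
    entropy_term = -np.sum(p * np.log(p))                        # H(p)
    print(np.isclose(expected_loss, kl_term + entropy_term))     # True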
5.5 Experiment-One Tracker to Rule Them All?
5.5.1 Preliminaries
To test how well the m-ITC incorporates a set of motion models into one
tracker, we designed the following experiment. Given a sequence of images, we first
estimated the true motion by hand, measuring the position of the object of interest
at key frames. We then fit a Gaussian to the set of displacements, taking the mean
of the Gaussian as the average velocity.
Data. A single frame from the 74-frame sequence appears in Fig. 5-1. In this
sequence we track the head of the walker which has nearly constant velocity in
the x-direction and slight periodic motion in the y-direction (as she steps). The
mean velocity in the x-direction was -3.7280 pixels/frame with a marginal standard
deviation of 1.15; in the y-direction the average velocity was -0.1862 with
standard deviation 2.41.
Figure 5-1: Frame from test sequence
For our observation model, we simply use a template of the head from the
initial frame and compare it to a given region, finding the mean squared error
(MSE) of the gray levels. The likelihood is then just $\exp(-\text{MSE})$. We initialize all
trackers to the true state at the initial frame.
Motion models. From this single image sequence, we can hallucinate numerous
image sequences which consist of rotated versions of the original sequence. We
know the true motion models for all of these novel sequences since they are just the
original motion model rotated by the same amount.
To examine the performance of one motion model applied to a different image
sequence, one need only consider the angle disparity between the true underlying
motion model of the image sequence and the motion model utilized by the tracker.
In Fig. 5-2 we report the performance in average time-till-failure as a function of
angle disparity. (Here the number of particles is fixed to 10.)
Since in this experiment (and subsequent ones in this chapter), we define
a tracker as having irrevocably failed when all of its particles are at a distance
greater than 40 pixels from the true location, we see that even the most hopeless of
trackers will succeed in "tracking" the subject for six frames-just long enough
for all of its particles to flee from the correct state at an average relative velocity
of 7.5 pixels/frame. This observation lets us calculate a lower bound on the
performance of a tracker with a given angle disparity from the true motion:
Pessimistically assuming a completely uninformative observation model, one can
calculate the time required for the centroid of the particles to exceed a distance of
D = 40 pixels from the true state as
$$t = \frac{D}{2 r \sin\frac{\theta}{2}}, \qquad (5.6)$$
Figure 5-2: Average time till failure as a function of angle disparity between the
true motion and the tracker's motion model
where r = 3.7326 pixels/frame is the speed at which the centroid and the true state each move
and θ is the angle disparity. The dashed line in Fig. 5-2 represents this curve, capped at the
maximum number of frames, 74.
One should note that since all of the motion models are rotated versions
of each other, the H(P) term in equation 5.5 is a constant. Hence the q which
minimizes the maximum expected self-information loss over a set of p's is in fact
the m-ITC.
Performance of mixtures. Because we will take the m-ITC of several of these
motion models, it is also of interest how mixtures perform. We consider a mixture
of the correct motion model and a motion model rotated π radians in the opposite
direction, which essentially contributes nothing to the tracking. In Fig. 5-3 we
again plot the time-till-failure for trackers with 10 and 20 particles as a function of
the weight in the mixture of the correct model.
To derive a lower bound, this time we represent the proportion of the proba-
bility near the true state at time t as $r_t$ and the remainder as $w_t$. Further we say
that at each time step, $a\,r_t$ of the probability, where a is the weight on the correct
motion model, moves to (or remains in) the correct state (as a result of those
particles being driven by the correct motion model) and the rest moves away from
the correct state. Next we model the step in the particle
filter algorithm where we adjust the weights. We assume that a particle away from
the true state receives a weight that is c < 1 times the weight received by a particle
near the true state. If there were no clutter in the scene, this number would be zero
and we would return to the assumption in Section 5.4.1. Now, by pessimistically
assuming that particles which move away from the correct state have exceedingly
small chance (i.e., zero) of rejoining the true state randomly, we can derive the
Figure 5-3: Average time till failure as a function of the weight on the correct
motion model; for 10 and 20 particles
following, from an initial $r_0 = 1$,

$$r_t = \frac{a^t}{Z}, \qquad (5.7)$$

$$w_t = \frac{(1-a)\sum_{i=0}^{t-1} c^{\,t-i}\, a^{i}}{Z}, \qquad (5.8)$$

where Z is a normalizing constant. And by taking a lower bound of $w_t \ge (1-a)c^t/Z$,
we can derive that $r_t \le \frac{a^t}{a^t + (1-a)c^t}$. Finally, by taking this upper bound on $r_t$ as a
Bernoulli random variable as we did in Section 5.4.1, we can get a lower bound
on the expected time-till-failure as $1/P_{\text{fail}}$ plus an adjustment for the six free frames
required to drift out of the 40 pixel range. This is the lower bound shown in Fig. 5-3.
To calculate c, we randomly sampled the background at a distance greater than
40 pixels from the true state, and averaged the response of the observation model,
yielding $c \approx 0.1886$.
5.5.2 m-ITC
Again we compare the performance of the m-ITC to the arithmetic mean. This
time we form mixtures of several motion models of varying angles, taking weights
either as uniform (for the arithmetic mean) or as determined by the m-ITC. In the
first case we take a total of 12 motion models with angle disparities of
$\{\pm 5\pi/20, \pm 4\pi/20, \pm 3\pi/20, \pm 2\pi/20, \pm \pi/20, 0, \pi\}$.
On this mixture, the m-ITC assigns weights of .31 to each of the motion models at
$\pm\pi/4$ and the remaining .37 to the motion model at $\pi$.
We tested these trackers when the true motion model has orientation zero,
$\pi/4$, and $\pi$, respectively, and reported their average times-till-failure in Table 5-1.
Indeed as we might expect from a minimax representative, the m-ITC registers the
best worst-case performance; but nevertheless it is not an impressive performance.
The second set of motion models was slightly less extreme in variation,
$\{0, \pm\pi/64, \pm\pi/32, \pm\pi/4\}$. To these components, the m-ITC split its probability
evenly between the motion models at disparities $\pm\pi/4$. In this case, the m-ITC had
a better showing overall, and still had the best worst case performance. The results
are shown in Table 5-2.
Table 5-1: Average time-till-failure of the tracker based on the arithmetic mean
(AM) and on the m-ITC (800 particles)

    True angle    m-ITC    AM
    0               43     74
    π/4             23     46
    π               16      7
Table 5-2: Average time-till-failure of the tracker based on the arithmetic mean
(AM) and on the m-ITC (400 particles) with second set of models

    True angle    m-ITC    AM
    0               72     74
    π/4             44     37
5.6 Conclusion
While there is reason to believe that a minimax representative would serve
well in combining several motion models in tracking, the precise circumstances
when this might be beneficial are difficult to determine. Examining Fig. 5-2 and
Fig. 5-3 it seems that if there is too little weight or too great a disparity, a tracker
is doomed from the beginning. So while the m-ITC will not perform ideally under
all situations, it still retains its expected best worst-case performance.
CHAPTER 6
CONCLUSION
6.1 Limitations
The central thrust of this work has been the claim that despite many computer
vision researchers' instinctive suspicion of minimax methods, given the right
application, they can be useful. However those skeptical researchers' instincts
are often well-founded: The main issue one must be aware of regarding the
representatives presented in this work is their sensitivity to outliers. One must
carefully consider his data, particularly the extreme elements, because those are
precisely the elements with the most influence on these representatives.
In addition to data, one must also consider how one's application defines
successful results. If one's application can tolerate small deviations in the "normal
cases," and successful behavior is defined by good results in extreme cases, then
a minimax representative might be appropriate. This is precisely the case in the
retrieval problem where a uniform upper bound on the dispersion of a subset is
the criterion on which successful indexing is judged. Such a criterion disregards
whether the innermost members are especially close to the representative or not.
Despite some initial suggestions, this did not turn out to be the case in the tracking
domain (in the presence of clutter). There it seems the deviations in the "normal
cases" did have a significant effect on performance.
6.2 Summary
After characterizing two minimax representatives, with firm groundings
in information theory, we have shown how they can be utilized to speed the
retrieval of textures, images, shapes, or any object so represented by a probability
distribution. Their power in this application comes from the fact that they form
tight clusters, allowing for more precise localization and efficient pruning than other
common representatives. While the tracking results did not bear as much fruit, we
still believe that it is a promising avenue for such a representative if the problem is
properly formulated.
6.3 Future Work
This topic touches upon a myriad of areas including information theory,
information geometry, and learning. Csiszar and others have characterized the
expectation-maximization algorithm in terms of divergence minimization; and
we believe that incorporating the ITCs into some EM-style algorithm would be
very interesting. Also of interest are its connections to AdaBoost and other online
learning algorithms. But for all of these avenues, the main challenge remains of
verifying that the data and measurement of success are appropriate fits to these
representatives.
REFERENCES
[1] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley &
Sons, New York, NY, 1991.
[2] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Trans.
Inform. Theory, vol. 37, no. 1, pp. 145-151, Mar. 1991.
[3] P. T. Fletcher, C. Lu, and S. Joshi, "Statistics of shape via principal geodesic
analysis on lie groups," IEEE Trans. Med. Imag., vol. 23, no. 8, pp. 995-1005,
Aug. 2004.
[4] E. Klassen, A. Srivastava, W. Mio, and S. H. Joshi, "Analysis of planar shapes
using geodesic paths on shape spaces," IEEE Trans. Pattern Anal. Machine
Intell., vol. 26, no. 3, pp. 372-383, Mar. 2004.
[5] S.-I. Amari, Methods of Information Geometry, American Mathematical
Society, Providence, RI, 2000.
[6] S.-I. Amari, "Information geometry on hierarchy of probability distributions,"
IEEE Trans. Inform. Theory, vol. 47, no. 5, pp. 1707-1711, Jul. 2001.
[7] B. Pelletier, "Informative barycentres in statistics," Annals of the Institute of
Statistical Mathematics, to appear.
[8] Z. Wang and B. C. Vemuri, "An affine invariant tensor dissimilarity measure
and its applications to tensor-valued image segmentation," in Proc. IEEE
Conf. Computer Vision and Pattern Recognition, Washington, DC, Jun./Jul.
2004, vol. 1, pp. 228-233.
[9] D. P. Huttenlocher, G. A. Klanderman, and W. A. Rucklidge, "Comparing
images using the hausdorff distance," IEEE Trans. Pattern Anal. Machine
Intell., vol. 15, no. 9, pp. 850-863, Sep. 1993.
[10] J. Ho, K.-C. Lee, M.-H. Yang, and D. Kriegman, "Visual tracking using
learned linear subspaces," in Proc. IEEE Conf. Computer Vision and Pattern
Recognition, Washington, DC, Jun./Jul. 2004, pp. 228-233.
[11] R. I. Hartley and F. Schaffalitzky, "L∞ minimization in geometric recon-
struction problems," in Proc. IEEE Conf. Computer Vision and Pattern
Recognition, Washington, DC, Jun./Jul. 2004, pp. 504-509.
[12] S. C. Zhu, Y. N. Wu, and D. Mumford, "Minimax entropy principle and
its application to texture modeling," Neural Computation, vol. 9, no. 8, pp.
1627-1660, Nov. 1997.
[13] C. Liu, S. C. Zhu, and H.-Y. Shum, "Learning inhomogeneous Gibbs model of
faces by minimax entropy," in Proc. Int'l Conf. Computer Vision, Vancouver,
Canada, Jul. 2001, pp. 281-287.
[14] I. Csiszar, "Why least squares and maximum entropy? An axiomatic approach
to inference for linear inverse problems," Annals of Statistics, vol. 19, no. 4,
pp. 2032-2066, Dec. 1991.
[15] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, MA,
1999.
[16] N. Merhav and M. Feder, "A strong version of the redundancy-capacity
theorem of universal coding," IEEE Trans. Inform. Theory, vol. 41, no. 3, pp.
714-722, May 1995.
[17] I. Csiszar, "I-divergence geometry of probability distributions and mini-
mization problems," Annals of Probability, vol. 3, no. 1, pp. 146-158, Jan.
1975.
[18] R. Sibson, "Information radius," Z. Wahrscheinlichkeitstheorie verw. Geb.,
vol. 14, no. 1, pp. 149-160, Jan. 1969.
[19] N. Jardine and R. Sibson, Mathematical Taxonomy, John Wiley & Sons,
London, UK, 1971.
[20] A. O. Hero, B. Ma, O. Michel, and J. Gorman, "Applications of entropic
spanning graphs," IEEE Signal Processing Mag., vol. 19, no. 5, pp. 85-95, Sep.
2002.
[21] Y. He, A. B. Hamza, and H. Krim, "A generalized divergence measure for
robust image registration," IEEE Trans. Signal Processing, vol. 51, no. 5, pp.
1211-1220, May 2003.
[22] D. M. Endres and J. E. Schindelin, "A new metric for probability distri-
butions," IEEE Trans. Inform. Theory, vol. 49, no. 7, pp. 1858-1860, Jul.
2003.
[23] F. Topsøe, "Some inequalities for information divergence and related measures
of discrimination," IEEE Trans. Inform. Theory, vol. 46, no. 4, pp. 1602-1609,
Jan. 2000.
[24] J. Burbea and C. R. Rao, "On the convexity of some divergence measures
based on entropy functions," IEEE Trans. Inform. Theory, vol. 28, no. 3, pp.
489-495, May 1982.
[25] R. G. Gallager, Information Theory and Reliable Communication, John Wiley
& Sons, New York, NY, 1968.
[26] I. Csiszar and J. G. Körner, Information Theory: Coding Theorems for
Discrete Memoryless Systems, Academic Press, Inc., New York, NY, 1981.
[27] L. D. Davisson and A. Leon-Garcia, "A source matching approach to finding
minimax codes," IEEE Trans. Inform. Theory, vol. 26, no. 2, pp. 166-174,
Mar. 1980.
[28] B. Y. Ryabko, "Comments on 'A source matching approach to finding
minimax codes'," IEEE Trans. Inform. Theory, vol. 27, no. 6, pp. 780-781,
Nov. 1981.
[29] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line
learning and an application to boosting," J. Computer and System Sciences,
vol. 55, no. 1, pp. 119-139, Aug. 1997.
[30] J. Kivinen and M. K. Warmuth, "Boosting as entropy projection," in
Proceedings of the Twelfth Annual Conference on Computational Learning
Theory, Santa Cruz, CA, Jul. 1999, pp. 134-144.
[31] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, "Shape distributions,"
ACM Trans. Graphics, vol. 21, no. 4, pp. 807-832, Oct. 2002.
[32] S. Gordon, J. Goldberger, and H. Greenspan, "Applying the information
bottleneck principle to unsupervised clustering of discrete and continuous
image representations," in Proc. Int'l Conf. Computer Vision, Nice, France,
Oct. 2003, pp. 370-396.
[33] C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Blobworld: image
segmentation using expectation-maximization and its application to image
querying," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 8, pp.
1026-1038, Aug. 2002.
[34] Y. Rubner, C. Tomasi, and L. Guibas, "A metric for distributions with
applications to image databases," in Proc. Int'l Conf. Computer Vision,
Bombay, India, Jan. 1998, pp. 59-66.
[35] J. Puzicha, J. M. Buhmann, Y. Rubner, and C. Tomasi, "Empirical evaluation
of dissimilarity measures for color and texture," in Proc. Int'l Conf. Computer
Vision, Kerkyra, Greece, Sep. 1999, pp. 1165-1172.
[36] M. N. Do and M. Vetterli, "Wavelet-based texture retrieval using generalized
Gaussian density and Kullback-Leibler distance," IEEE Trans. Image
Processing, vol. 11, no. 2, pp. 146-158, Feb. 2002.
[37] M. Varma and A. Zisserman, "Texture classification: Are filter banks
necessary?," in Proc. IEEE Conf. Computer Vision and Pattern Recognition,
Madison, WI, Jun. 2003, pp. 691-698.
[38] S. M. Omohundro, "Bumptrees for efficient function, constraint, and classifica-
tion learning," in Advances in Neural Information Processing Systems, Denver,
CO, Nov. 1990, vol. 3, pp. 693-699.
[39] P. N. Yianilos, "Data structures and algorithms for nearest neighbor search in
general metric spaces," in Proc. ACM-SIAM Symp. on Discrete Algorithms,
Austin, TX, Jan. 1993, pp. 311-321.
[40] J. K. Uhlmann, "Satisfying general proximity/similarity queries with metric
trees," Information Processing Letters, vol. 40, no. 4, pp. 175-179, Nov. 1991.
[41] A. Moore, "The anchors hierarchy: Using the triangle inequality to survive
high-dimensional data," in Proc. on Uncertainty in Artificial Intelligence,
Stanford, CA, Jun./Jul. 2000, pp. 397-405.
[42] M. Varma and A. Zisserman, "Classifying images of materials: Achieving
viewpoint and illumination independence," in Proc. European Conf. Computer
Vision, Copenhagen, Denmark, May/Jun. 2002, pp. 255-271.
[43] T. Leung and J. Malik, "Recognizing surfaces using three-dimensional
textons," in Proc. Int'l Conf. Computer Vision, Kerkyra, Greece, Sep. 1999,
pp. 1010-1017.
[44] K. J. Dana, B. van Ginneken, S. K. Nayar, and J. J. Koenderink, "Reflectance
and texture of real-world surfaces," ACM Trans. Graphics, vol. 18, no. 1, pp.
1-34, Jan. 1999.
[45] E. Levina and P. Bickel, "The earth mover's distance is the Mallows distance:
some insights from statistics," in Proc. Int'l Conf. Computer Vision, Vancouver,
Canada, Jul. 2001, pp. 251-256.
[46] N. Vasconcelos, "On the complexity of probabilistic image retrieval," in Proc.
Int'l Conf. Computer Vision, Vancouver, Canada, Jul. 2001, pp. 400-407.
[47] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser, "The Princeton shape
benchmark," in Shape Modeling International, Genova, Italy, Jun. 2004, pp.
167-178.
[48] M. Isard and A. Blake, "Condensation-conditional density propagation for
visual tracking," Int'l J. of Computer Vision, vol. 29, no. 1, pp. 5-28, Jan.
1998.
[49] M. Isard and A. Blake, "A mixed-state condensation tracker with automatic
model-switching," in Proc. Int'l Conf. Computer Vision, Bombay, India, Jan.
1998, pp. 107-112.
[50] N. Merhav and M. Feder, "Universal prediction," IEEE Trans. Inform.
Theory, vol. 44, no. 6, pp. 2124-2147, Oct. 1998.
BIOGRAPHICAL SKETCH
A native Floridian, Eric Spellman grew up on Florida's Space Coast, gradu-
ating from Satellite High School in 1998. Thereafter he attended the University of
Florida, receiving his Bachelor of Science in mathematics in 2000, his Master of En-
gineering in computer information science and engineering in 2001, and, under the
supervision of Baba C. Vemuri, his Doctor of Philosophy in the same in 2005. After
graduating he will return to the Space Coast with his wife Kayla and daughter
Sophia to work for Harris Corporation.