
FUSING PROBABILITY DISTRIBUTIONS WITH INFORMATION
THEORETIC CENTERS AND ITS APPLICATION TO DATA RETRIEVAL
















By

ERIC SPELLMAN


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


2005

































Copyright 2005

by

Eric Spellman















I dedicate this work to my dearest Kayla with whom I have already learned

how to be a Doctor of Philosophy.















ACKNOWLEDGMENTS

For his supportive guidance during my graduate career, I thank Dr. Baba C.

Vemuri, my doctoral advisor. He taught me the field, offered me an interesting

problem to explore, pushed me to publish (in spite of my terminal procrastination),

and tried his best to instill in me the intangible secrets to a productive academic

career.

Also the other members of my committee have helped me greatly in my career

at the University of Florida, and I thank them all: Dr. Brett Presnell delivered the

first lecture I attended as an undergraduate; and I am glad he could attend the last

lecture I gave as a doctoral candidate. I have also benefitted from and appreciated

Dr. Anand Rangarajan's lectures, professional advice, and philosophical discussions.

While I did not have the pleasure of attending Dr. Arunava Banerjee's or Dr. Jeff

Ho's classes, I have appreciated their insights and their examples as successful early

researchers. I would also like to thank Dr. Murali Rao for stimulating debates,

Dr. Shun-Ichi Amari for proposing the idea of the e-center and proofs of the related

theorems, and numerous anonymous reviewers.

My professional debts extend beyond the faculty, however, to my fellow

comrades-in-research. With them, I have muttered all manner of things about

the aforementioned group in the surest confidence that my mutterings would not be

betrayed: Dr. Jundong Liu, Dr. Zhizhou Wang, Tim McGraw, Fei Wang, Santosh

Kodipaka, Nick Lord, Bing Jian, Vinh Nghiem, and Evren Ozarslan all deserve

thanks.

Also deserving are the Department staff members whose hard work keeps

this place afloat and Ron Smith for designing a word-processing template without









which the process of writing a dissertation might itself require a Ph.D. For the

permission to reproduce copyrighted material within this dissertation, I thank the

IEEE (Chapter 3) and Springer-Verlag (Chapter 4). For data I thank the people

behind the Yale Face Database, images from which I used in Fig. 2-5 and Fig. 2-8

and Michael Black for tracking data. And for the financial support which made

this work possible, I acknowledge the University of Florida's Stephen C. O'Connell

Presidential Fellowship, NIH grant R01 NS42075, and travel grants from the

Computer and Information Science and Engineering department, the Graduate

Student Council, and the IEEE.

And finally, most importantly, I thank my family. I thank my mother-in-law

Donna Lea for all of her help these past few weeks and Neil, Abra, and Peter for

letting us take her away for that time. I thank my mother and father for everything

and my brother, too. And I of course thank my dearest Kayla and my loudest,

most obstinate, and sweetest Sophia.















TABLE OF CONTENTS
ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

    1.1 Information Theory
    1.2 Alternative Representatives
    1.3 Minimax Approaches
    1.4 Outline of Remainder

2 THEORETICAL FOUNDATION

    2.1 Preliminary-Euclidean Information Center
    2.2 Mixture Information Center of Probability Distributions
        2.2.1 Global Optimality
    2.3 Exponential Center of Probability Distributions
    2.4 Illustrations and Intuition
        2.4.1 Gaussians
        2.4.2 Normalized Gray-Level Histograms
        2.4.3 The e-Center of Gaussians
        2.4.4 e-ITC of Histograms
    2.5 Previous Work
        2.5.1 Jensen-Shannon Divergence
        2.5.2 Mutual Information
        2.5.3 Channel Capacity
        2.5.4 Boosting

3 RETRIEVAL WITH m-CENTER

    3.1 Introduction
    3.2 Experimental Design
        3.2.1 Review of the Texture Retrieval System
        3.2.2 Improving Efficiency
    3.3 Results and Discussion
        3.3.1 Comparison to Pre-existing Efficiency Scheme
    3.4 Conclusion

4 RETRIEVAL WITH e-CENTER

    4.1 Introduction
    4.2 Lower Bound
    4.3 Shape Retrieval Experiment
    4.4 Retrieval with JS-divergence
    4.5 Comparing the m- and e-ITCs

5 TRACKING

    5.1 Introduction
    5.2 Background-Particle Filters
    5.3 Problem Statement
    5.4 Motivation
        5.4.1 Binary State Space
        5.4.2 Self-information Loss
    5.5 Experiment-One Tracker to Rule Them All?
        5.5.1 Preliminaries
        5.5.2 m-ITC
    5.6 Conclusion

6 CONCLUSION

    6.1 Limitations
    6.2 Summary
    6.3 Future Work

REFERENCES

BIOGRAPHICAL SKETCH















LIST OF TABLES
Table

5-1 Average time-till-failure of the tracker based on the arithmetic mean (AM) and
    on the m-ITC (800 particles)

5-2 Average time-till-failure of the tracker based on the arithmetic mean (AM) and
    on the m-ITC (400 particles) with second set of models















LIST OF FIGURES
Figure

2-1 Center is denoted by o and supports are denoted by x.

2-2 The ensemble of 50 Gaussians with means evenly spaced on the interval [-30, 30]
    and σ = 5

2-3 The components from Fig. 2-2 scaled by their weights in the m-ITC.

2-4 The m-ITC (solid) and arithmetic mean (AM, dashed) of the ensemble of
    Gaussians shown in Fig. 2-2

2-5 Seven faces of one person under different expressions with an eighth face from
    someone else. Above each image is the weight p_i which the ITC assigns to the
    distribution arising from that image.

2-6 The normalized gray-level histograms of the faces from Fig. 2-5. Above each
    distribution is the KL-divergence from that distribution to the m-ITC.
    Parentheses indicate that the value is equal to the KL-radius of the set. Note
    that as predicted by theory, the distributions which have maximum KL-divergence
    are the very ones which received non-zero weights in the m-ITC.

2-7 In the left column, we can see that the arithmetic mean (solid, lower left)
    resembles the distribution arising from the first face more closely than the
    m-ITC (solid, upper left) does. In the right column, we see the opposite: The
    m-ITC (upper right) more closely resembles the eighth distribution than does the
    arithmetic mean (lower right).

2-8 Eight images of faces which yield normalized gray-level histograms. We choose an
    extraordinary distribution for number eight to contrast how the representative
    captures variation within a class. The number above each face weighs the
    corresponding distribution in the e-ITC.

2-9 KL(C, P_i) for each distribution, for C equal to the e-ITC and geometric mean,
    respectively. The horizontal bar represents the value of the e-radius.

3-1 Using the triangle inequality to prune

3-2 Example images from the CUReT database

3-3 On average for probes from each texture class, the speed-up relative to an
    exhaustive search achieved by the metric tree with the m-ITC as the representative

3-4 The excess comparisons performed by the arithmetic mean relative to the m-ITC
    within each texture class as a proportion of the total database.

3-5 The excess comparisons performed by the best medoid relative to the m-ITC within
    each texture class as a proportion of the total database

4-1 Intuitive proof of the lower bound in equation 4.7 (see text). The KL-divergence
    acts like squared Euclidean distance, and the Pythagorean Theorem holds under
    special circumstances. Q is the query, P is a distribution in the database, and C
    is the e-ITC of the set containing P. P* is the I-projection of Q onto the set
    containing P. On the right, D(C||P) ≤ R_e, where R_e is the e-radius, by the
    minimax definition of C.

4-2 The speed-up factor versus an exhaustive search when using the e-ITC as a
    function of each class in the shape database.

4-3 The relative percent of additional prunings which the e-ITC achieves beyond the
    geometric center, again for each class number.

4-4 Speedup factor for each class resulting from using e-ITC over an exhaustive search

4-5 Excess searches performed using geometric mean relative to e-ITC as proportion of
    total database.

4-6 Excess searches performed using e-ITC relative to m-ITC as proportion of total
    database.

4-7 The ratio of the maximal χ² distance from each center to all of the elements in
    a class

5-1 Frame from test sequence

5-2 Average time till failure as a function of angle disparity between the true motion
    and the tracker's motion model

5-3 Average time till failure as a function of the weight on the correct motion model;
    for 10 and 20 particles















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

FUSING PROBABILITY DISTRIBUTIONS WITH INFORMATION
THEORETIC CENTERS AND ITS APPLICATION TO DATA RETRIEVAL

By

Eric Spellman

August 2005

Chair: Baba C. Vemuri
Major Department: Computer and Information Science and Engineering

This work presents two representations for a collection of probability distribu-

tions or densities dubbed information theoretic centers (ITCs). Like the common

arithmetic mean, the first new center is a convex combination of its constituent

densities in the mixture family. Analogously, the second ITC is a weighted geomet-

ric mean of densities in the exponential family. In both cases, the weights in the

combinations vary as one changes the distributions. These centers minimize the

maximum Kullback-Leibler divergence from each distribution in their collections

to themselves and exhibit an equi-divergence property, lying equally far from most

elements of their collections. The properties of these centers have been established

in information theory through the study of channel-capacity and universal codes;

drawing on this rich theoretical basis, this work applies them to the problems of

indexing for content-based retrieval and to tracking.

Many existing techniques in image retrieval cast the problem in terms of

probability distributions. That is, these techniques represent each image in the

database-as well as incoming query images-as probability distributions, thus

reducing the retrieval problem to one of finding a nearest probability distribution









under some dissimilarity measure. This work presents an indexing scheme for such

techniques wherein an ITC stands in for a subset of distributions in the database.

If the search finds that a query lies sufficiently far from such an ITC, the search

can safely disregard (i.e., without fear of reducing accuracy) the associated subset

of probability distributions without further consideration, thus speeding search.

Often in tracking, one represents knowledge about the expected motion of

the object of interest by a probability distribution on its next position. This work

considers the case in which one must specify a tracker capable of tracking any

one of several objects, each with different probability distributions governing its

motion. Related is the case in which one object undergoes different phases of

motion, each of which can be modeled independently (e.g., a car driving straight

vs. turning). In this case, an ITC can fuse these different distributions into one,

creating one motion model to handle any of the several objects.















CHAPTER 1
INTRODUCTION

Given a set of objects, one commonly wishes to represent the entire set with

one object. In this work, we address this concern for the case in which the objects

are probability distributions. Particularly, we present a novel representative

for a set of distributions whose behavior we can describe in the language of

information theory. For contrast, let us first consider the familiar arithmetic

mean: The arithmetic mean is a uniformly weighted convex combination of a set

of objects (e.g., distributions) which minimizes the sum of squared Euclidean

distances from objects in turn to itself. However, later sections contain examples

of applications in which such a representative is not ideal. As an alternative

we describe a representative which minimizes the maximal Kullback-Leibler

divergence from itself to the set of objects. The central idea of this work is that

such a minimax representation is better than more commonly used representatives

(e.g., the arithmetic mean) in some computer vision problems. In exploring this

thesis, we present the theoretical properties of this representative and results of

experiments using it.

We examine two applications in particular, comparing the minimax represen-

tation to the more common arithmetic mean (and other representatives). These

applications are indexing collections of images (or textures or shapes) and choosing

a motion prior under uncertainty for tracking. In the first of these applications,

we follow a promising avenue of work in using a probability distribution as the

signature of a given object to be indexed. Then using an established data struc-

ture, the representative can fuse several signatures into one, thus making searches

more efficient. In the tracking application, we consider the case in which one has









some uncertainty as to the motion model that governs the object to be tracked.

Specifically, the governing model will be drawn from a known family of models. We

suggest using a minimax representative to construct a single prior distribution to

describe the expected motion of an object in a way that fuses all of the models in

the family and yields one model with the best worst-case performance.

1.1 Information Theory

As discussed in Chapter 2, the minimax representative can be characterized

in terms of the Kullback-Leibler divergence and the Jensen-Shannon divergence.

Hence, a brief review of these concepts is in order.

The Kullback-Leibler (KL) divergence [1] (also known as the relative entropy)

between two distributions p and q is defined as


KL(p, q) = Σ_i p_i log ( p_i / q_i ).   (1.1)

It is convex in p, non-negative (though not necessarily finite), and equals zero if

and only if p = q. In information theory it has an interpretation in terms of the

length of encoded messages from a source which emits symbols according to a

probability distribution. While the familiar Shannon entropy gives a lower bound

on the average length per symbol a code can achieve, the KL-divergence between

p and q gives the penalty (in length per symbol) incurred by encoding a source

with distribution p under the assumption it really has distribution q; this penalty is

commonly called redundancy.
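For concreteness, a minimal sketch of equation 1.1, assuming discrete distributions
stored as NumPy arrays, is:

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        """KL(p, q) = sum_i p_i log(p_i / q_i) for discrete distributions.
        Terms with p_i = 0 contribute nothing; eps guards against log(0)."""
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))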

To illustrate this, consider the Morse code, designed to send messages in

English. The Morse code encodes the letter "E" with a single dot and the letter

"Q" with a sequence of four dots and dashes. Because "E" is used frequently in

English and "Q" seldom, this makes for efficient transmission. However if one

wanted to use the Morse code to send messages in Chinese pinyin, which might

use "Q" more frequently, he would find the code less efficient. If we assume









contrafactually that the Morse code is optimal for English, this difference in

efficiency is the redundancy.

Also playing a role in this work is the Jensen-Shannon (JS) divergence. It is

defined between two distributions p and q as


JS(p, q) = α KL(p, αp + (1 − α)q) + (1 − α) KL(q, αp + (1 − α)q),   (1.2)

where α ∈ (0, 1) is a fixed parameter [2]; we will also consider its straightforward

generalization to n distributions.
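A corresponding sketch of the generalized JS-divergence for n discrete distributions
(equation 1.2 is the two-distribution case), reusing the kl_divergence sketch above,
could read:

    import numpy as np

    def js_divergence(dists, weights):
        """Generalized Jensen-Shannon divergence of the rows of dists with
        convex coefficients weights: sum_i w_i KL(f_i, sum_j w_j f_j)."""
        dists = np.asarray(dists, dtype=float)
        weights = np.asarray(weights, dtype=float)
        mixture = weights @ dists
        return float(sum(w * kl_divergence(f, mixture)
                         for w, f in zip(weights, dists)))

    # Equation 1.2 with two distributions p and q and parameter alpha:
    # js_divergence([p, q], [alpha, 1 - alpha])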

1.2 Alternative Representatives

Also of interest is the body of work which computes averages of sets of objects

using non-Euclidean distances since the representative presented in this work

plays a similar role. One example of this appears in computing averages on a

manifold of shapes [3, 4] by generalizing the minimization characterization of the

arithmetic mean away from the squared Euclidean distance and to the geodesic

distance. Linking manifolds on one hand and distributions on the other is the field

of information geometry [5]. Using notions from information geometry one can

find the mean on a manifold of parameterized distributions by using the geodesic

distances derived from the Fisher information metric.

In this work, the representative does not minimize a function of this geodesic

distance, but rather the maximum KL-divergence. Furthermore, the representa-

tives here are restricted to simple manifolds of distributions-namely the family of

weighted arithmetic means (i.e., convex combinations) and normalized, weighted

geometric means (sometimes referred to as the "exponential family") of the con-

stituent distributions. These are simple yet interesting families of distributions and

can accommodate non-parametric representations. These two manifolds are dual in

the sense of information geometry [6], and so as one might expect, the represen-

tatives have a similar dual relationship. Pelletier forms a barrycenter based in the









KL-divergence on each of these manifolds [7]. That barycenter, in the spirit of the

arithmetic mean and in contrast to the representative in this work, minimizes a

sum of KL-divergences. Another "mean-like" representative on the family of Gaus-

sian distributions seeks to minimize the sum of squared J-divergences (also known

as symmetrized KL) [8]. The key difference between most of these approaches and

this work is that this work seeks to present a representative which minimizes the

maximum of KL-divergences to the objects in a set.

1.3 Minimax Approaches

Ingrained in the computer vision culture is a heightened awareness of noise

in data. This is a defensive trait which has evolved over the millennia to allow

researchers to survive in a harsh environment full of imperfect sensors. With this

justified concern in mind, one naturally asks if minimizing a maximum distance

will be doomed by over-sensitivity to noise. As with many such concerns, this

depends on the application, and we argue that an approach which minimizes a

max-based function is well-suited to the applications in this work. But to show

that such a set of appropriate applications is non-empty, consider the successful

work of Huttenlocher et al. as a proof of concept [9]: They seek to minimize the

Hausdorff distance (H) between two point sets (A, B). As one can see, minimizing

the Hausdorff distance,


H(A, B) = max( h(A, B), h(B, A) ),

h(A, B) = max_{a∈A} min_{b∈B} dist(a, b),

minimizes a max-based function. Another example appears in the work of Ho et

al. for tracking. In Ho et al. [10] at each frame they find a linear subspace which

minimizes the maximal distance to a set of previous observations. Additionally,

Hartley and Schaffalitzky minimize a max-based cost function to do geometric

reconstruction [11]. Cognizant that the method is sensitive to outliers, they









recommend using it on data from which the outliers have been removed. These examples

demonstrate that one cannot dismiss a priori a method as overly-sensitive to noise

just because it minimizes a max-based measure.

Closer in spirit to the approach in this work is work using the minimax en-

tropy principle. Zhu et al. [12] and Liu et al. [13] use this principle to learn models

of textures and faces. The first part of this principle (the well-known maximum

entropy or minimum prejudice principle) seeks to learn a distribution which, given

the constraint that some of its statistics fit sample statistics, maximizes entropy.

Csiszar has shown that this process is related to an entropy projection [14]. The

rationale for selecting a distribution with maximal entropy is that it is random

or simple except for the parts explained by the data. Coupled with this is the

minimum entropy principle which further seeks to impose fidelity on the model.

To satisfy this principle, Zhu et al. choose a set of statistics (constraints) to which

a maximal entropy model must adhere such that its entropy is minimized. By

minimizing the entropy, they show that this minimizes the KL-divergence between

the true distribution and the learned model. For a set S of constraints on statistics,

they summarize the approach as


S* = arg min_S max_{p ∈ Ω_S} entropy(p),   (1.3)

where Ω_S is the set of all probability distributions which satisfy the constraints

in S. Then with the optimal S*, one need only find the p ∈ Ω_{S*} with maximal

entropy.

1.4 Outline of Remainder

In the next chapters, we rigorously define the minimax representatives and

present theoretical results on their properties. We also connect one of these to

its alternative and better-known identity in the information theory results for

channel capacity. Thereafter follow two chapters on using the representatives to









index databases for the sake of making retrieval more efficient. In Chapter 3 we

present a texture retrieval experiment; in this experiment the accuracy of the

retrieval is determined by the Jensen-Shannon divergence, the square root of which

is a true metric. Later in Chapter 4 we present an experiment in shape retrieval

where the dissimilarity measure is the KL-divergence. We propose a search method

which happens to work in the particular case of this data set, but in general has

no guarantee that it will not degrade in accuracy. The second area in which we

demonstrate the utility of a minimax representative is tracking. In Chapter 5 we

present experiments in which the representative stands in for an unknown motion

model. Lastly we end with some concluding points and thoughts for future work.














CHAPTER 2
THEORETICAL FOUNDATION

In this chapter we define the minimax representatives and enumerate their

properties. First, we present a casual, motivating treatment in Euclidean space to

give intuition to the idea. Then come the section's central results-the ITC for the

mixture family and the ITC on the dual manifold, the exponential family; after

defining them, we present several illustrations to lend intuition to their behavior;

and then finally we show their well-established interpretation in terms of channel

capacity.

2.1 Preliminary-Euclidean Information Center

Let S = {f_1, ..., f_n} be a set of n points in the Euclidean space R^m.

Throughout our development we will consider the maximum dispersion of the

members of a set S about a point f,

D(f, S) = max_i ||f − f_i||².   (2.1)

We will look for the point f_c that minimizes D(f, S) and call it the center.

Definition 1 The center of S is defined by

f_c(S) = arg min_f D(f, S).   (2.2)
The following properties are easily proved.

Theorem 1 The center of S is unique and is given by a convex combination of the

elements of S.

f_c = Σ_{i=1}^n p_i f_i,   (2.3)

where 0 ≤ p_i ≤ 1 and Σ_i p_i = 1. We call a point f_i for which p_i > 0 a support.









Theorem 2 Let f_c be the center of S. Then, for F = {i : p_i > 0}, the set of indices
called the supports, we have an equi-distance property:

||f_i − f_c||² = r²,  if i ∈ F,   (2.4)

||f_i − f_c||² ≤ r²,  otherwise,   (2.5)

where r² is the square of the radius of the sphere coinciding with the supports and
centered at the center.

Now that we have characterized the center first by its minimax property
(Definition 1) and then its equi-distance property (Theorem 2), we can give yet
another characterization, this one useful computationally. Define the simplex Δ of
probability distributions on n symbols,

Δ = { p = (p_1, ..., p_n) : 0 ≤ p_i, Σ_i p_i = 1 }.

We define a function of p on Δ by

D_E(p, S) = Σ_i p_i ||f_i||² − ||Σ_i p_i f_i||²,   (2.6)

and now use it to find the center.

Theorem 3 D_E(p, S) is strictly concave in p, and the center of S is given by

f_c = Σ_i p_i* f_i,   (2.7)

where

p* = arg max_p D_E(p, S)   (2.8)

is the unique maximal point of D_E(p, S).
An example of the center is given in Fig. 2-1. In general, the support set

consists of a relatively sparse subset of points. The points that are most "extraordinary"

are given high weights.
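To make Theorem 3 concrete, a minimal sketch that maximizes the concave function
D_E(p, S) over the simplex with a generic constrained optimizer (assuming SciPy is
available and the points are the rows of a NumPy array) is:

    import numpy as np
    from scipy.optimize import minimize

    def euclidean_center(F):
        """Minimax center of the rows of F via Theorem 3: maximize
        D_E(p) = sum_i p_i ||f_i||^2 - ||sum_i p_i f_i||^2 over the simplex."""
        n = F.shape[0]
        sq_norms = np.sum(F ** 2, axis=1)

        def neg_de(p):
            mean = p @ F
            return -(p @ sq_norms - mean @ mean)

        res = minimize(neg_de, np.full(n, 1.0 / n), method='SLSQP',
                       bounds=[(0.0, 1.0)] * n,
                       constraints=[{'type': 'eq', 'fun': lambda p: p.sum() - 1.0}])
        return res.x @ F, res.x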


















































Figure 2-1: Center is denoted by o and supports are denoted by x.









2.2 Mixture Information Center of Probability Distributions

We now move to the heart of the results. While the following derivations

use densities, they could just as easily work in the discrete case. Let f(x) be a

probability density. Given a set S of n linearly independent densities f_1, ..., f_n, we

consider the space M consisting of their mixtures,

M = { f : f = Σ_i p_i f_i,  p_i ≥ 0,  Σ_i p_i = 1 }.   (2.9)

The dimension m of M satisfies m ≤ n − 1.

M is a dually flat manifold equipped with a Riemannian metric, and a pair of

dual affine connections [5].

The KL divergence from f_1(x) to f_2(x) is defined by

KL(f_1, f_2) = ∫ f_1(x) log ( f_1(x) / f_2(x) ) dx,   (2.10)

which plays the role of the squared distance in Euclidean space. The KL-

divergence from S to f is defined by

KL(S, f) = max_i KL(f_i, f).   (2.11)

Let us define the mixture information center (m-ITC) of S.

Definition 2 The m-center of S is defined by

f_m(S) = arg min_f KL(S, f).   (2.12)

In order to analyze properties of the m-center, we define the m-sphere of

radius r centered at f by the set

S_m(f, r) = { f' : KL(f', f) ≤ r² }.   (2.13)









In order to obtain an analytical solution of the m-center, we remark that the

negative entropy

φ(f) = ∫ f(x) log f(x) dx   (2.14)

is a strictly convex function of f ∈ M. This is the dual potential function in M
whose second derivatives give the Fisher information matrix [5].
Given a point f in M,


f = Σ_i p_i f_i,  p_i ≥ 0,  Σ_i p_i = 1,   (2.15)

and using φ above, we write the Jensen-Shannon (JS) divergence [2] of f with

respect to S as

D_m(f, S) = −φ(f) + Σ_i p_i φ(f_i).   (2.16)

When we regard D_m(f, S) as a function of p = (p_1, ..., p_n)^T ∈ Δ, we denote it by

D_m(p, S). It is easy to see that

D_m(f, S) is strictly concave in p ∈ Δ,   (2.17)

D_m(f, S) is invariant under permutation of {f_i},   (2.18)

D_m(f, S) = 0 if p is an extreme point of the simplex Δ.   (2.19)

Hence from equations 2.17 and 2.19, D_m(f, S) has a unique maximum in Δ. In

terms of the KL-divergence, we have

D_m(p, S) = Σ_i p_i KL(f_i, f).   (2.20)

Theorem 4 The m-center of S is given by the unique maximizer of D_m(f, S), that
is,

f_m = Σ_i p_i* f_i,   (2.21)

p* = arg max_p D_m(p, S).   (2.22)

Moreover, for supports, for which p_i > 0, KL(f_i, f_m) = r², and for non-supports, for
which p_i = 0, KL(f_i, f_m) ≤ r², where r² = max_i KL(f_i, f_m).
By using the Lagrange multiplier λ for the constraint Σ_i p_i = 1, we calculate

∂/∂p_i [ D_m(p, S) − λ(Σ_j p_j − 1) ] = 0.   (2.23)

From the definition of D_m and because of

∂φ(f)/∂p_i = ∂/∂p_i ∫ (Σ_k p_k f_k) log(Σ_k p_k f_k) dx = ∫ f_i log f dx + 1,   (2.24)

equation 2.23 is rewritten as

−∫ f_i log f dx − 1 + φ(f_i) − λ = 0   (2.25)

when p_i > 0, and then as

KL(f_i, f) = λ + 1.   (2.26)

However, because of the constraints p_j ≥ 0, this is not necessarily satisfied for some
f_j for which p_j = 0. Hence, at the extreme point f_m(S), we have

KL(f_i, f) = r²   (2.27)

for supports f_i (p_i > 0), but for non-supports (p_i = 0),

KL(f_i, f) ≤ r².   (2.28)

This distinction occurs because p maximizes Σ_i p_i KL(f_i, f), and if we assume for a
non-support density f_i that KL(f_i, f) > r², we arrive at a contradiction.
We remark that numerical optimization of D_m can be accomplished efficiently

(e.g., with a Newton scheme [15]), because it is concave in p.
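The channel-capacity interpretation reviewed later in Section 2.5.3 also suggests a
simple alternative to a Newton scheme: the Blahut-Arimoto fixed-point update. A
minimal sketch for discrete distributions stored as the rows of a NumPy array is:

    import numpy as np

    def m_itc(F, n_iter=1000, tol=1e-10):
        """m-ITC of the rows of F (each row a discrete distribution).
        Maximizes D_m(p) = sum_i p_i KL(f_i, sum_j p_j f_j) with the
        Blahut-Arimoto update p_i <- p_i exp(KL(f_i, f)) / Z."""
        n = F.shape[0]
        p = np.full(n, 1.0 / n)
        eps = 1e-300
        for _ in range(n_iter):
            f = p @ F                                               # current mixture
            kl = np.sum(F * np.log((F + eps) / (f + eps)), axis=1)  # KL(f_i, f)
            new_p = p * np.exp(kl)
            new_p /= new_p.sum()
            if np.max(np.abs(new_p - p)) < tol:
                p = new_p
                break
            p = new_p
        return p, p @ F

Because D_m is concave in p, the iteration converges to the global maximizer, and the
resulting weights are typically sparse, as the examples of Section 2.4 illustrate.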









2.2.1 Global Optimality

Lastly, it should be noted that while it appears that the results so far suggest

that the m-ITC is the minimax representative over merely the simplex of mix-

tures, a result by Merhav and Feder [16] (based on work by Csiszar [17]) shows

that the m-ITC is indeed the global minimax representative over all probability

distributions.

To argue so, consider a distribution g which a skeptic puts forward as the

minimizer of equation 2.12. Next, we consider the I-projection of g onto the set of

mixtures M defined as

ĝ = arg min_{f∈M} KL(f, g).   (2.29)

And from the properties of I-projections [17], we know for all f ∈ M that

KL(f, g) ≥ KL(f, ĝ) + KL(ĝ, g) ≥ KL(f, ĝ),   (2.30)

since the last term on the right-hand side of the first inequality, KL(ĝ, g), is non-negative. This tells us that the mixture

ĝ performs at least as well as g; so we may restrict our minimization to the set of

mixtures, knowing we will find the global minimizer.

2.3 Exponential Center of Probability Distributions

Given n densities f_1(x), ..., f_n(x) > 0, we define the exponential family of densities
E,

E = { f : f(x) = exp{ Σ_i p_i log f_i(x) − ψ(p) },  0 ≤ p_i,  Σ_i p_i = 1 },   (2.31)

instead of the mixture family M. The dimension m of E satisfies m ≤ n − 1. E

is also a dually flat Riemannian manifold. E and M are "dual" [5], and we can

establish similar structures in E.

The potential function ψ(p) is convex, and is given by

ψ(p) = log ∫ exp{ Σ_i p_i log f_i(x) } dx.   (2.32)









The potential ψ(p) is the cumulant generating function, and is connected with the

negative entropy φ by the Legendre transformation. It is called the free energy in

statistical physics. Its second derivatives give the Fisher information matrix in this

coordinate system.

An e-sphere (exponential sphere) in E centered at f is the set of points


S_e(f, r) = { f' ∈ E : KL(f', f) ≤ r² },   (2.33)

where r is the e-radius.

Definition 3 The e-center of S is defined by

f_e(S) = arg min_f KL(f, S),   (2.34)

where

KL(f, S) = max_i KL(f, f_i).   (2.35)

We further define the JS-like divergence in E,

D_e(p, S) = −ψ(p) + Σ_i p_i ψ(f_i).   (2.36)

Because of

ψ(f_i) = log ∫ exp( log f_i(x) ) dx = 0,   (2.37)

we have

D_e(p, S) = −ψ(p).   (2.38)

The function D_e = −ψ is strictly concave, and has a unique maximum. It is an

interesting exercise to show that, for f = exp{ Σ_i p_i log f_i − ψ(p) },

−ψ(p) = Σ_i p_i KL(f, f_i).   (2.39)


Analogous to the case of M, we can prove the following.








Theorem 5 The e-center of S is unique and given by the maximizer of D_e(p, S),

f_e(S) = exp{ Σ_i p_i* log f_i − ψ(p*) },   (2.40)

p* = arg max_p D_e(p, S).   (2.41)

Moreover,

KL(f_e, f_i) = r²,  for supporting f_i (p_i ≠ 0),   (2.42)

KL(f_e, f_i) ≤ r²,  for non-supporting f_i (p_i = 0),   (2.43)

where

r² = max_i KL(f_e, f_i).   (2.44)

We calculate the derivative of −ψ(p) − λ(Σ_i p_i − 1), and put

∂/∂p_i [ −ψ(p) − λ(Σ_j p_j − 1) ] = 0.   (2.45)

For

f = exp{ Σ_i p_i log f_i(x) − ψ(p) },   (2.46)

we have

∂ψ(p)/∂p_i = ∂/∂p_i log ∫ exp{ Σ_j p_j log f_j(x) } dx = ∫ f(x) log f_i(x) dx.   (2.47)

Hence, (2.45) becomes

∫ f(x) log f_i(x) dx = −λ = const,   (2.48)

and hence

KL(f, f_i) = const = r²   (2.49)

for f_i with p_i > 0. For p_i = 0, KL(f, f_i) is not larger than r², because of (2.39).

Figure 2-2: The ensemble of 50 Gaussians with means evenly spaced on the interval [-30, 30] and σ = 5.
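Before turning to illustrations, a minimal numerical sketch of Theorem 5 for
discrete, strictly positive distributions (rows of a NumPy array), which simply
maximizes D_e(p) = -psi(p) over the simplex with a generic SciPy optimizer, is:

    import numpy as np
    from scipy.optimize import minimize

    def e_itc(F):
        """e-ITC of the rows of F (strictly positive discrete distributions).
        Minimizes psi(p) = log sum_x exp(sum_i p_i log f_i(x)) over the simplex."""
        n = F.shape[0]
        logF = np.log(F)

        def psi(p):
            return np.log(np.sum(np.exp(p @ logF)))

        res = minimize(psi, np.full(n, 1.0 / n), method='SLSQP',
                       bounds=[(0.0, 1.0)] * n,
                       constraints=[{'type': 'eq', 'fun': lambda p: p.sum() - 1.0}])
        p = res.x
        center = np.exp(p @ logF - psi(p))   # normalized weighted geometric mean
        return p, center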

2.4 Illustrations and Intuition

In this section we convey some intuition regarding the behavior of the ITCs.

Particularly, we show examples in which the ITCs choose a relatively sparse subset

of densities to be support densities (assigning zero as weights to the other densities

in the set); also we find that the ITC tends to assign disproportionately high

weights to members of the set that are most "extraordinary" (i.e., those that are

most distinct from the rest of the set).


























Figure 2-3: The components from Fig. 2-2 scaled by their weights in the m-ITC.

























Figure 2-4: The m-ITC (solid) and arithmetic mean (AM, dashed) of the ensemble of Gaussians shown in Fig. 2-2.









2.4.1 Gaussians

First, we examine a synthetic example in which we analytically specify a set

of densities and then numerically compute the m-ITC of that set. We construct a

set of 50 one-dimensional Gaussian distributions with means evenly spaced on the

interval [-30, 30] and with σ = 5. Fig. 2-2 shows all of the densities in this set.

When we compute the m-ITC, we see both of the properties mentioned above: Out

of the 50 densities specified, only eight become support densities (sparsity). And

additionally, the densities which receive the largest weight in the m-ITC are outer-

most densities with means at -30 and 30 (highlighting "boundary" elements).

Fig. 2-3 shows the eight support densities scaled according to the weight which

the m-ITC assigns them. For the sake of comparison, Fig. 2-4 shows the m-ITC

compared with the arithmetic mean.

2.4.2 Normalized Gray-Level Histograms

Next we consider an example with distributions arising from image data.

Fig. 2-5 shows eight images from the Yale Face Database. The first seven images

are of the same person under different facial expressions while the last (eighth)

image is of a different person. We consider each image's normalized histogram of

gray-level intensities as a distribution and show them in Fig. 2-6.

When we take the m-ITC of these distributions, we again notice the sparsity

and the favoritism of the boundary elements. The numbers above each face in

Fig. 2-5 are the weights that the distribution arising from that face received in the

m-ITC. Again, we find that only three out of the eight distributions are support

distributions. In the previous example, we could concretely see how the m-ITC

favored densities on the "on the bouiwl iy of the set because the densities had a

clear geometric relationship among themselves with their means ai i i' on a line.

In this example the notion of a boundary is not quite so obvious; yet, we think that





















Figure 2-5: Seven faces of one person under different expressions with an eighth face from someone else. Above each image is the weight p_i which the ITC assigns to the distribution arising from that image. (The non-zero weights are 0.32519, 0.2267, and 0.4481.)























Figure 2-6: The normalized gray-level histograms of the faces from Fig. 2-5. Above each distribution is the KL-divergence from that distribution to the m-ITC. Parentheses indicate that the value is equal to the KL-radius of the set. Note that as predicted by theory, the distributions which have maximum KL-divergence are the very ones which received non-zero weights in the m-ITC.


Figure 2-7: In the left column, we can see that the arithmetic mean (solid, lower left) resembles the distribution arising from the first face more closely than the m-ITC (solid, upper left) does. In the right column, we see the opposite: The m-ITC (upper right) more closely resembles the eighth distribution than does the arithmetic mean (lower right).









if one examines the eighth image and the eighth distribution, one can qualitatively

agree that it is the most "extraordinary."

Returning briefly to Fig. 2-6 we examine the KL-divergences between each

distribution and the m-ITC. We report these values above each distribution and

indicate with parentheses the maximal values. Since the three distributions with

maximal KL-divergence to the ITC are exactly the three support distributions, this

example unsurprisingly complies with Theorem 4.

Finally, we again compare the m-ITC and the arithmetic mean in Fig. 2-7,

but this time we overlay in turn the first and eighth distributions. Examining the

left column of the figure, we see that the arithmetic mean more closely resembles

the first distribution than does the m-ITC. The KL-divergences bear this out with

the KL-divergence from the first distribution to the arithmetic mean being 0.0461

which compares to 0.1827 to the m-ITC. Conversely, when we do a similar com-

parison in the right column of Fig. 2-7 for the eighth (extraordinary) distribution,

we find that the m-ITC most resembles it. Again, the KL-divergences quantify this

observation: Whereas the KL-divergence from the eighth distribution to the m-ITC

is (again) 0.1827, we find the KL-divergence to the arithmetic mean to be 0.5524.

This result falls in line with what we would expect from Theorem 4 and suggests a

more refined bit of intuition: The m-ITC better represents extraordinary distribu-

tions, but sometimes at the expense of the more common-looking distributions. Yet

overall, that trade-off yields a minimized maximum KL-divergence.

2.4.3 The e-Center of Gaussians

We consider a very simple example consisting of multivariate Gaussian

distributions with unit covariance matrix,


f_i(x) = c exp{ −(1/2) ||x − μ_i||² },   (2.50)









where x, μ_i ∈ R^m. We have

Σ_i p_i log f_i(x) = −(1/2) ||x − Σ_i p_i μ_i||² + const,   (2.51)

so that the corresponding member of E is

f(x) = exp{ Σ_i p_i log f_i(x) − ψ(p) } = c exp{ −(1/2) ||x − Σ_i p_i μ_i||² }.   (2.52)

Hence, E too consists of Gaussian distributions.

This case is special because E is not only dually flat but also Euclidean where

the Fisher information matrix is the identity matrix. The KL-divergence is

KL(f_j, f_i) = (1/2) ||μ_j − μ_i||²,   (2.53)

given by a half of the squared Euclidean distance. Hence, the e-center in E of {f_i}

is the same as the center of the points {μ_i} in the Euclidean space.

When m = 1, that is, when x is univariate, {μ_i} is a set of points on the real line.

When μ_1 < μ_2 < ... < μ_n, it is easy to see that the center is the average of the

two extremes, μ_1 and μ_n. This is different from the m-center of {f_i}, which is not a

single Gaussian but rather a mixture.

2.4.4 e-ITC of Histograms

We return again to the histograms of Section 2.4.2. As for the m-ITC, this
example illustrates how the e-ITC gives up some representative power for the

elements with small variability for the sake of better representing the elements

with large variability. Fig. 2-8 is an extreme case of this, chosen to make this

effect starkly clear: Here again we have the same seven images of the same person

under slight variations along with an eighth image of a different person. After

representing each image by its global gray-level histogram, we compute the

uniformly weighted, normalized geometric mean and the e-ITC.

It is also worth noting that the e-ITC only selects three support distributions

out of a possible eight, exemplifying the sparsity tendency mentioned in the


















Figure 2-8: Eight images of faces which yield normalized gray-level histograms. We choose an extraordinary distribution for number eight to contrast how the representative captures variation within a class. The number above each face weighs the corresponding distribution in the e-ITC. (The non-zero weights are 0.21224, 0.18607, and 0.60169.)
























Figure 2-9: KL(C, P_i) for each distribution, for C equal to the e-ITC and geometric mean, respectively. The horizontal bar represents the value of the e-radius.









previous section. By now examining Fig. 2-9, we can see that KL(C, Pi) is equal

to the e-radius (indicated by the horizontal bar) for the three support distributions

(i = 1, 7, 8) and is less for the others. This illustrates the equi-divergence property

stated previously.

In Figs. 2-8 and 2-9, the worst-case KL-divergence from the geometric mean

is 2.5 times larger than the worst-case from the e-ITC. Of course, this better worst-

case performance comes at the price of the e-ITC's larger distance to the other

seven distributions; but it is our thesis that in some applications we are eager to

make this trade.

2.5 Previous Work

2.5.1 Jensen-Shannon Divergence

First introduced by Lin [2], the Jensen-Shannon divergence has appeared in

several contexts. Sibson referred to it as the information radius [18]. While his

moniker is tantalizingly similar to the previously mentioned notions of m-/e-radii,

he strictly uses it to refer to the divergence, not the optimal value. Jardine and he

later used it as a discriminatory measure for classification [19].

Others have also used the JS-divergence and its variations as a dissimilarity

measure-for image registration and retrieval applications [20, 21], and in the

retrieval experiment of Chapter 3, this work will follow suit. That experiment also

makes use of the important fact that the square root of the JS-divergence (in the

case when its parameter is fixed to 1/2) is a metric [22].

Topsoe provides another take on the JS-divergence, calling a scaled, special

case of it the capacitory discrimination [23]. This name hints at the next, perhaps

most important interpretation of the JS-divergence, namely that as a measurement

of mutual information. This alternative identity is widely known. (Cf. Burbea and

Rao [24] as just one example.) And this understanding can help illuminate the

context of the m-center within information theory.









2.5.2 Mutual Information

But the obvious question arises-the mutual information between what? First,

the mutual information between two random variables X and Y with distributions

Px and Py is defined as

MI(X; Y) = H(Y) − H(Y|X),   (2.54)

where H(Y) = −Σ_y P_Y(y) log P_Y(y) is Shannon's entropy and the conditional entropy is

H(Y|X) = −Σ_x Σ_y P_XY(x, y) log P_{Y|X}(y|x).   (2.55)

To see the connection to the m-center and JS-divergence, let us first identify

the random variable X above as a random index into a set of random variables

{Y_1, ..., Y_n} with probability distributions {P_1, ..., P_n}. Note that X merely

takes on a number from one to n. If we now consider Y as a random variable

whose value y results from first sampling X = i and then sampling Y_i = y, we

can certainly begin to appreciate that these concocted random variables X and Y

are dependent. That is, learning the value that X takes will let you guess more

accurately what value Y will take. Conversely, learning the value of Y hints at

which distribution Y that value was drawn from and hence the value that X took.

Returning to equation 2.54 and paying particular attention to the definition

of the conditional entropy term, we can use our definitions of X and Y to get an

expression for their joint distribution,

P_XY(i, y) = P_X(i) P(y|i).   (2.56)

Then by observing that P(y|i) = P_i(y), plugging this into equation 2.54, and

pulling P_X(i) outside of the summations with respect to y, we have

MI(X; Y) = H(Y) − Σ_i P_X(i) H(P_i).   (2.57)









And this is precisely the Jensen-Shannon divergence JS(P_1, ..., P_n) with co-

efficients (P_X(1), ..., P_X(n)). So when we evaluate the JS-divergence with a

particular choice of those coefficients, we directly specify a distribution for both

the random variable X and indirectly specify a distribution (a mixture) for the

random variable Y and evaluate the mutual information between them. And when

we maximize the JS-divergence with respect to those same coefficients, we specify

the distributions X and Y which have maximum mutual information.
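A small numerical check of this identity (equation 2.57), assuming the P_i are stored
as the rows of a NumPy array, can be sketched as:

    import numpy as np

    def generalized_js(P, w, eps=1e-300):
        """Generalized JS-divergence of the rows of P with coefficients w."""
        mix = w @ P
        H = lambda d: -np.sum(d * np.log(d + eps))
        return H(mix) - np.sum(w * np.array([H(row) for row in P]))

    def mutual_information(P, w, eps=1e-300):
        """MI(X; Y), where X ~ w indexes a row of P and Y ~ P[X]."""
        joint = w[:, None] * P                 # P_XY(i, y) = P_X(i) P_i(y)
        px = joint.sum(axis=1, keepdims=True)
        py = joint.sum(axis=0, keepdims=True)
        return np.sum(joint * (np.log(joint + eps) - np.log(px @ py + eps)))

    P = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
    w = np.array([0.3, 0.7])
    assert abs(generalized_js(P, w) - mutual_information(P, w)) < 1e-9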

There are several equivalent definitions for mutual information, and by

considering a different one we can gain insight into the selection of support

distributions in the m-center. The mirror image of equation 2.54 gives us


MI(X; Y) = H(X) − H(X|Y).   (2.58)


Starting from here and, for convenience, letting f_y(i) = P(i|y), we can derive that

MI(X; Y) = H(X) − E_y[ H(f_y) ].   (2.59)


f_y(i) describes the contributions at a value y of each of the distributions which

make up the mixture. By maximizing MI(X; Y) we minimize the expected value

(over y) of the entropy of this distribution. This means that on the average

we encourage as few as possible contributors of probability to a location; this

is particularly the case at locations y with high marginal probability. Acting

as a regularization term of sorts, to prevent the process from assigning all the

probability to one component (thus driving the right-hand term to zero), is the first

term which encourages uniform weights. Maximizing the whole expression means

that we balance the impulses for few contributors and for uniform weights.

2.5.3 Channel Capacity

Of interest to information theory from its inception has been the question of

how much information can one reliably transmit over a channel.









In this case, one interprets the ensemble of distributions {P_1, ..., P_n} making

up the conditional probability P(y|i) = P_i(y) as the channel. That is, given an in-

put symbol i, the channel will transmit an output symbol y with probability Pi(y).

Given such a channel (discrete and memoryless since each symbol is independent

of the symbols before and after), one will achieve different rates of transmission de-

pending on the distribution over the source symbols Px. These rates are precisely

the mutual information between X and Y, and to find the maximum capacity of

the channel, one picks the Px yielding a maximum value. In this context, many

of the results from this section have been developed and presented quite clearly in

texts by Gallager [25, Section 4.2] and Csiszar [26, Section 2.3].

Related also is the field of universal coding. As reviewed earlier, choosing

an appropriate code depends upon the characteristics of the source which will be

encoded (e.g., Morse code for English). Universal coding concerns itself with the

problem of selecting a code suitable for any source out of a family-i.e., a code

which will have certain universal performance characteristics across the family.

Although this work initially developed in ignorance of that field, the key result in

this field which this work touches upon is the Redundancy-Capacity Theorem. This

theorem-independently proven in several places [27, 28] and later strengthened

[16]-states that the best code for which one can hope will have redundancy

equal to the capacity (or maximum transmission rate [23]) of a certain channel

which takes input from the parameters of the family and output as symbols to be

encoded.

2.5.4 Boosting

Surprisingly, we also find a connection in the online learning literature wherein

the AdaBoost [29] learning algorithm is recast as entropy projection [30]. This

angle, along with its extensions to general Bregman divergences, is an interesting

avenue for future work.















CHAPTER 3
RETRIEVAL WITH m-CENTER

3.1 Introduction

There are two key components in most retrieval systems-the signature

stored for an object and the (dis)similarity measure used to find a closest match.

These two choices completely determine accuracy in a nearest neighbor-type system

if one is willing to endure a potentially exhaustive search of the database. However,

because databases can be quite large, comparing each query image against every

element in the database is obviously undesirable.

In this chapter and the next we focus on speeding up retrieval without com-

promising accuracy in the case in which the object and query signatures are

probability distributions. In a variety of domains, researchers have achieved impres-

sive results utilizing such signatures in conjunction with a variety of appropriate

dissimilarity measures. This includes work in shape-based retrieval [31], image

retrieval [32, 33], and texture retrieval [34, 35, 36, 37]. (The last of which we review

in great detail in Section 3.2.) For any such retrieval system which uses a distri-

bution and a metric, we present a way to speed up queries while guaranteeing no

drop in accuracy by representing a cluster of distributions with an optimally close

representative which we call the m-ITC.

Tomes of research have concentrated on speeding up nearest neighbor searches

in non-Euclidean metric spaces. We build on this work by refining it to better suit

the case in which the metric is on probability distributions. In low-dimensional

Euclidean spaces, the familiar k-d tree and R*-tree can index point sets handily.

But in spaces with a non-Euclidean metric, one must resort to other techniques.

These include ball trees [38], vantage point trees [39], and metric trees [40, 41].









Our system utilizes a metric tree, but our main contribution is picking a single

object to represent a set. Picking a single object to describe a set of objects is one

of the most common ways to condense a large amount of data. The most obvious

way to accomplish this (when possible) is to compute the arithmetic mean or the

centroid. To contrast with the properties of our choice (the m-ITC) we point out

that the arithmetic mean minimizes the sum of squared distances from it to the

elements of its set. In the case in which the data are known (or forced) to lie on a

manifold, it is useful to pick an intrinsic mean which has the mean's minimization

property but which also lies on a manifold containing the data. This has been

explored for manifolds of shape [4, 3] and of parameterized probability densities [5].

Metric trees. To see how our representative fits into the metric tree framework,

a brief review of metric trees helps. Given a set of points and a metric (which by

definition satisfies positive-definiteness, symmetry, and the triangle inequality), a

leaf of a metric tree indexes some subset of points {p_i} and contains two fields, a

center c and a radius r, which satisfy d(c, p_i) ≤ r for all p_i, where d is the metric.

Proceeding hierarchically, an interior node indexes all of the points indexed by its

children and also ensures that its fields satisfy the constraint above. Hence using

the triangle inequality, for any subtree, one can find a lower bound on the distance

from a query point q to the entire set of points contained in that subtree


d(q, p_i) ≥ d(q, c) − r,   (3.1)

as illustrated in Fig. 3-1. And during a nearest neighbor search, one can recursively

use this lower bound (costing only one comparison) to prune out subtrees which

contain points too distant from the query.
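A sketch of this pruning rule, assuming a hypothetical node layout with fields
center, radius, children (empty at a leaf), and points (at a leaf), is:

    def nn_search(node, query, dist, best=(float('inf'), None)):
        """Nearest-neighbor search in a metric tree, pruning with equation 3.1."""
        # Lower bound on the distance from the query to anything under this node.
        if dist(query, node.center) - node.radius >= best[0]:
            return best                       # prune the entire subtree
        if node.children:                     # interior node: recurse into children
            for child in node.children:
                best = nn_search(child, query, dist, best)
            return best
        for p in node.points:                 # leaf: compare against its points
            d = dist(query, p)
            if d < best[0]:
                best = (d, p)
        return best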

Importance of picking a center. Returning to the choice of the center, we see

that if the radius r in equation 3.1 is large, then the lower bound is not very tight

and subsequently, pruning will not be very efficient. On the other hand, a center










Figure 3-1: Using the triangle inequality to prune

which yields a small radius will likely lead to more pruning and efficient searches. If

we reexamine Fig. 3-1, we see that we can decrease r by moving the center toward

the point pi in the figure. This is precisely how the m-ITC behaves. We claim that

the m-ITC yields tighter clusters than the commonly used centroid and the best

medoid, allowing more pruning and better efficiency because it respects the natural
metrics used for probability distributions. We show theoretically that under the

KL-divergence, the m-ITC uniquely yields the smallest radius for a set of points.

And we demonstrate that it also performs well under other metrics.

In the following sections we first define our m-ITC and enumerate its prop-

erties. Then in Section 3.2 we review a pre-existing texture classification system

[37] which utilizes probability distributions as signatures and discuss how we build
our efficiency experiment atop this system. Then in Section 3.3 we present results

showing that the m-ITC, when incorporated in a metric tree, can improve query
efficiency; also we compare how much improvement the m-ITC achieves relative to









other representatives when each is placed in a metric tree. Lastly, we summarize

our contributions.

3.2 Experimental Design

The central claim of this chapter is that the m-ITC tightly represents a set of

distributions under typical metrics; and thus, that this better representation allows

for more efficient retrieval. With that goal in mind, we compare the m-ITC against

the centroid and the best medoid-two commonly used representatives-in order to

find which scheme retrieves the nearest neighbor with the fewest comparisons. The

centroid is the common arithmetic mean, and the best medoid of a set {p_i} in this

case is the distribution p satisfying

p = arg min_{p'∈{p_i}} max_j KL(p_j, p').

Notice that we restrict ourselves to the pre-existing set of distributions instead of

finding the best convex combination which is the m-ITC.

In the experiment, each representative will serve as the center of the nodes

of a metric tree on the database. We control for the topology of the metric tree,

keeping it unchanged for each representative. Since the experiment will examine

efficiency of retrieval, not accuracy, we will choose the same signature as Varma

and Zisserman [37], and this along with the metric will completely determine

accuracy, allowing us to exclusively examine efficiency.

3.2.1 Review of the Texture Retrieval System

The texture retrieval system of Varma and Zisserman [37] [42] builds on an

earlier system [43]. It uses a probability distribution as a signature to represent

each image. A query image is matched to a database image by nearest neighbor

search based on the X2 dissimilarity between distributions. Although the system

contains a method for increasing efficiency, we do not implement it and defer

discussion of it to the next section.













Figure 3-2: Example images from the CUReT database


Texton dictionary. Before computing any image's signature, this system

requires that we first construct a texton dictionary. To construct this dictionary, we

first extract from each pixel in each training image a feature describing the texture

in its neighborhood. This feature can be a vector of filter responses at that pixel

[42] or simply a vector of intensities in the neighborhood about that pixel [37]. (We

choose the latter approach.) After clustering this ensemble of features, we take a

small set of cluster centers for the dictionary, 610 centers in total.

Computing a signature. To compute the signature of an image, one first finds

for each pixel the closest texton in the dictionary and then retains the label of

that texton. At this stage, one can imagine an image transformed from having

intensities at each pixel to having indices into the texton dictionary at each pixel,

as would result from vector quantization. In the next step one simply histograms

these labels in the same way that one would form a global gray-level histogram.

This normalized histogram on texton labels is the signature of an image.
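The following sketch shows one way this signature computation could look, assuming a precomputed texton dictionary and square pixel neighborhoods. The function name, the patch size, and the brute-force nearest-texton search are our illustrative choices rather than details taken from [37].

import numpy as np

def texton_signature(image, textons, patch=3):
    """Label each pixel with its nearest texton and histogram the labels.

    image   : 2D array of (normalized) gray levels
    textons : (K, patch*patch) array of cluster centers (the dictionary)
    """
    h, w = image.shape
    r = patch // 2
    labels = []
    for y in range(r, h - r):
        for x in range(r, w - r):
            feat = image[y - r:y + r + 1, x - r:x + r + 1].ravel()
            labels.append(int(np.argmin(np.sum((textons - feat) ** 2, axis=1))))
    hist = np.bincount(labels, minlength=len(textons)).astype(float)
    return hist / hist.sum()   # normalized texton histogram = the signature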

Data. Again as in [37], we use the Columbia-Utrecht Reflectance and Texture

(CUReT) Database [44] for texture data. This database contains 61 varieties of









texture with each texture imaged under various (sometimes extreme) illumination

and viewing angles. Fig. 3-2 illustrates the extreme intra-class variability and the

inter-class similarity which make this database challenging for texture retrieval

systems. In this figure each row contains realizations from the same texture class

with each column corresponding to different viewing and illumination angles.

Selecting 92 images from each of the 61 texture classes, we randomly partition each

of the classes into 46 training images, which make up the database, and 46 test

images, which make up the queries. Preprocessing consists of conversion to gray

scale and mean and variance normalization.

Dissimilarity measures. Varma and Zisserman [37] measured dissimilarity

between a query q and a database element p (both distributions) using the chi-squared

significance test,

\[
\chi^2(p, q) = \sum_i \frac{(p_i - q_i)^2}{p_i + q_i}, \qquad (3.2)
\]
and returned the nearest neighbor under this dissimilarity.

In our work, we require a metric, so we take the square root of equation 3.2

[23]. Note that since the square root is a monotonic function, this does not alter

the choice of nearest neighbor and therefore maintains the accuracy of the system.

Additionally we use another metric [22], the square root of the Jensen-Shannon

divergence between two distributions:

\[
JS_{\frac{1}{2}}(p, q) = H\!\left(\frac{p+q}{2}\right) - \frac{1}{2}H(p) - \frac{1}{2}H(q)
= \frac{1}{2}KL\!\left(p, \frac{p+q}{2}\right) + \frac{1}{2}KL\!\left(q, \frac{p+q}{2}\right).
\]

Note that in contrast to equation 2.16, we now take only two distributions and fix

the mixture parameter at 1/2. Unlike the first metric, this change affects accuracy,

but in our experiments the change in retrieval accuracy from the original system

did not exceed one percentage point. This is not surprising since the JS-divergence has served

well in registration and retrieval applications in the past [20].
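For concreteness, here is a minimal sketch of the two metrics as we use them: the square root of equation 3.2 and the square root of the JS-divergence. The smoothing constant and the guard against empty bins are implementation details of ours, not part of the definitions.

import numpy as np

def chi2_metric(p, q):
    """Square root of the chi-squared dissimilarity of equation 3.2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    num, den = (p - q) ** 2, p + q
    return float(np.sqrt(np.sum(num[den > 0] / den[den > 0])))

def entropy(p, eps=1e-12):
    p = np.asarray(p, float) + eps
    return float(-np.sum(p * np.log(p)))

def js_metric(p, q):
    """Square root of the Jensen-Shannon divergence, which is a metric [22]."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    js = entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)
    return float(np.sqrt(max(js, 0.0)))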









3.2.2 Improving Efficiency

After selecting the signature and dissimilarity measure (in our case a metric),

we have fixed the accuracy of the nearest neighbor search; so we can now turn our

attention to improving the efficiency of the search.

We construct a metric tree on the elements of the database to improve

efficiency. And in our experiment we will vary the representative used in the nodes

of the metric tree, finding which representative prunes most. We hold constant the

topology of the metric tree, determining each element's membership in a tree node

based on the texture variety from which it came. Specifically, each of the 61 texture

varieties has a corresponding node and each of these nodes contains the 46 elements

arising from the realizations of that texture. Given this fixed node membership, we

can construct the appropriate representative for each node.

Then, given each representative's version of the metric tree, we perform

searches and count the number of comparisons required to find the nearest neigh-

bor.
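A minimal sketch of such a search appears below. It assumes a one-level metric tree whose nodes store a representative and the maximum distance (radius) from that representative to the node's members; the data structure and names are ours, chosen only to illustrate the triangle-inequality pruning and the comparison counting used in the experiment.

import numpy as np

def query_with_pruning(q, nodes, metric):
    """Nearest-neighbor search over one-level metric-tree nodes.

    nodes : list of (representative, radius, members), where radius is the
            maximum metric distance from the representative to its members.
    Returns the nearest member and the number of metric evaluations used.
    """
    best, best_d, comparisons = None, np.inf, 0
    for rep, radius, members in nodes:
        d_rep = metric(q, rep)
        comparisons += 1
        # triangle inequality: every member m satisfies metric(q, m) >= d_rep - radius
        if d_rep - radius >= best_d:
            continue                      # the whole node can be discarded
        for m in members:
            d = metric(q, m)
            comparisons += 1
            if d < best_d:
                best, best_d = m, d
    return best, comparisons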

3.3 Results and Discussion

The results with each dissimilarity measure were very similar, and theoretical

results [22] bear this observed similarity out by showing that the JS-divergence and

chi-squared are asymptotically related. Below we report the results for the JS-divergence.

Fig. 3-3 shows the speed-ups which the metric tree with m-ITC achieve for

each texture class. These data relate the total number of comparisons required

for an average query using the metric tree to the number of comparisons in an

exhaustive search. On average, the m-ITC discards 68.9% of the database, yielding

a factor of 3.2 improvement in efficiency.

It should come as no surprise that the indexing out-performs an exhaustive search,

so we now consider what happens when we vary the representative in the metric

tree. On average, the arithmetic mean discards 47.1% of the database, resulting
























Figure 3-3: On average for probes from each texture class, the speed-up relative to
an exhaustive search achieved by the metric tree with the m-ITC as the representa-
tive


























Figure 3-4: The excess comparisons performed by the arithmetic mean relative to
the m-ITC within each texture class as a proportion of the total database
















Figure 3-5: The excess comparisons performed by the best medoid relative to the
m-ITC within each texture class as a proportion of the total database


in a speed-up factor of 1.9. Fig. 3-4 plots the excess proportion of the database

which the arithmetic mean searches relative to the m-ITC for each probe. The

box and whiskers respectively plot the median, quartiles, and range of the data

for all the probes within a class. On average, the metric tree with the arithmetic

mean explores an additional 21.3% of the total database relative to the metric tree

with the m-ITC. In only 2.0% of the queries did the m-ITC fail to out-perform the

arithmetic mean. Since the proportion of excess comparisons is a positive value for

98.0% of the probes, we can conclude that the m-ITC almost always offers some

improvement and occasionally avoids searching more than a third of the database.









Fig. 3-5 shows a similar plot for the best medoid. Again, the box and whiskers

respectively plot the median, quartiles, and range of the data for all the probes

within a class. On average, the metric tree with the best medoid explores an

additional 22.1% of the total database relative to the metric tree with the m-ITC.

Here the m-ITC improves even more than it did over the arithmetic mean; in only

0.1% of the queries did the m-ITC fail to out-perform the best medoid, never once

doing more poorly.

3.3.1 Comparison to Pre-existing Efficiency Scheme

Varma and Zisserman propose a different approach to increase the efficiency of

queries [37]. They decrease the size of the database in a greedy fashion: Initially,

each texture class in the database has 46 models (one arising from each training

image). Then for each class, they discard the model whose absence impairs retrieval

performance the least on a subset of the training images. And this is repeated till

the number of models is suitably small and the estimated accuracy is suitably high.

While this method performed well in practice, achieving comparable accuracy

to an exhaustive search and reducing the average number of models per class from

46 to eight or nine, it has several potential shortcomings. The first is this method's

computational expense: Although the model selection process occurs off-line and

time is not critical, the number of times one must validate models against the

entire training subset scales quadratically in the number of models. Additionally

and more importantly, the model selection procedure depends upon the subset of

the training data used to validate it. It offers no guarantee of its accuracy relative

to an exhaustive search which utilizes all the known data.

In contrast, our method can compute the m-ITC efficiently, and more impor-

tantly we guarantee that the accuracy of the more efficient search is identical to the

accuracy that an exhaustive search would achieve. Although it must be noted that









the method of Varma and Zisserman performed fewer comparisons, we believe that

building a multi-level metric tree will bridge this gap.

Lastly, the two methods can co-exist simultaneously: Since the pre-existing

approach focuses on reducing the size of the database while ours indexes the

database (solving an orthogonal problem), nothing stops us from taking the smaller

database resulting from their method and performing our indexing atop it for

further improvement.

3.4 Conclusion

Our goal was to select the best single representative for a class of probability

distributions. We chose the m-ITC which minimizes the maximum KL-divergence

from each distribution in the class to it; and when we placed it in the nodes

of a metric tree, it allowed us to prune more efficiently. Experimentally, we

demonstrated significant speed-ups over exhaustive search in a state-of-the-art

texture retrieval system on the CUReT database. The metric tree approach to

nearest neighbor searches guarantees accuracy identical to an exhaustive search of

the database. Additionally, we showed that the m-ITC outperforms the arithmetic

mean and the best medoid when these other representatives are used analogously.

Probability distributions are a popular choice for retrieval in many domains,

and as the retrieval databases grow large, there will be a need to condense many

distributions into one representative. We have shown that the m-ITC is a useful

choice for such a representative with well-behaved theoretical properties and

empirically superior results.














CHAPTER 4
RETRIEVAL WITH e-CENTER

4.1 Introduction

In the course of designing a retrieval system, one must usually consider at least

three broad elements

1. a signature that will represent each element, allowing for compact storage and

fast comparisons,

2. a (dis)similarity measure that will discriminate between a pair of signatures

that are close and a pair that are far from each other, and

3. an indexing structure or search strategy that will allow for efficient, non-

exhaustive queries.

The first two of these elements mostly determine the accuracy of a system's

retrieval results. The focus of this chapter, like the last, is on the third point.

A great deal of work has been done on retrieval systems that utilize a prob-

ability distribution as a signature. This work has covered a variety of domains

including shape [31], texture [34], [45], [35], [36], [37], and general images [32], [33].

Of these, some have used the Kullback-Leibler (KL) divergence [1] as a dissimilarity

measure [35], [36], [32].

The KL-divergence has many nice theoretical properties, particularly its

relationship to maximum-likelihood estimation [46]. However, in spite of this,

it is not a metric. This makes it challenging to construct an indexing structure

which respects the divergence. Many basic methods exist to speed up search

in Euclidean space including k-d trees and R*-trees. And there are even some

methods for general metric spaces such as ball trees [38], vantage point trees

[39], and metric trees [40]. Yet little work has been done on efficiently finding









exact nearest neighbors under KL-divergence. In this chapter, we present a novel

means of speeding nearest neighbor search (and hence retrieval) in a database of

probability distributions when the nearest neighbor is defined as the element that

minimizes the KL-divergence to the query. This approach does have a significant

drawback which does not impair it on this particular dataset, but in general it

cannot guarantee retrieval accuracy equal to that of an exhaustive search.

The basic idea is a common one in computer science and reminiscent of the

last chapter: We represent a set of elements by one representative. During a search,

we compare the query object against the representative, and if the representative

is sufficiently far from the query, we discard the entire set that corresponds to it

without further comparisons. Our contribution lies in selecting this representative

in an optimal fashion; ideally we would like to determine the circumstances under

which we may discard the set without fear of accidentally discarding the nearest

neighbor, but this cannot always be guaranteed. For this application, we will utilize

the exponential information theoretic center (e-ITC).

In the remaining sections we first derive the expression upon which the system

makes pruning decisions. Thereafter we present the experiment showing increased

efficiency in retrieval over an exhaustive search and over the uniformly weighted

geometric mean-a reasonable alternate representative. Finally, we return to

the texture retrieval example of the previous chapter to compare the m- and e-

centers, evaluating which forms the tightest clusters as measured by the chi-squared metric

(equation 3.2).

4.2 Lower Bound

In this section we attempt to derive a lower bound on the KL-divergence from

a database element to a query which only depends upon the element through

its e-ITC. This lower bound guides the pruning and results in the subsequent

increased search efficiency which we describe in Section 4.3.









Figure 4-1: Intuitive proof of the lower bound in equation 4.7 (see text). The
KL-divergence acts like squared Euclidean distance, and the Pythagorean Theorem
holds under special circumstances. Q is the query, P is a distribution in the
database, and C is the e-ITC of the set containing P. P* is the I-projection of Q
onto the set containing P. On the right, D(C||P) <= Re, where Re is the e-radius,
by the minimax definition of C.

In order to search for the nearest element to a query efficiently, we need to

bound the KL-divergence to a set of elements from beneath by a quantity which

only depends upon the e-ITC of that set. That way, we can use the knowledge

gleaned from a single comparison to avoid individual comparisons to each member

of the set.

We approach such a lower bound by examining the left side of Fig. 4-1. Here

we consider a query distribution Q and an arbitrary distribution P in a set which

has C as its e-ITC. As a stepping stone to the lower bound, we briefly define the
I-projection of a distribution Q onto a space E as

\[
P^* = \arg\min_{P \in E} D(P \| Q). \qquad (4.1)
\]

It is well known that one can use intuition about the squared Euclidean distance

to appreciate the properties of the KL-divergence; and in fact, in the case of the









I-projection P* of Q onto E, we have in some cases a version of the familiar

Pythagorean Theorem [26]. In this case we have for all P in E that

\[
D(P \| Q) \ge D(P^* \| Q) + D(P \| P^*). \qquad (4.2)
\]

And in the case that

\[
P^* = \alpha P_1 + (1 - \alpha) P_2 \qquad (4.3)
\]

for distributions P_1, P_2 in E with 0 < alpha < 1, then we call P* an algebraic inner

point and we have equality in equation 4.2.

Unfortunately it is this condition which we cannot verify easily as it depends

upon both the query (which we will label Q) and the structure of E which is

determined by the database. Interestingly we do get the equality version when

we take E as a linear family, like the families with which the m-ITC is concerned.

Regardless, we will continue with the derivation and demonstrate its use in this

application.

Assuming equality in equation 4.2 and applying it twice yields,


\[
D(P \| Q) = D(P^* \| Q) + D(P \| P^*) \qquad (4.4)
\]

\[
D(C \| Q) = D(P^* \| Q) + D(C \| P^*), \qquad (4.5)
\]

where we are free to select P in E as an arbitrary database element and C as the

e-ITC. Equation 4.4 corresponds to the triangle QPP* while equation 4.5 corresponds to

the triangle QCP* in Fig. 4-1.

If we subtract the two equations above and re-arrange, we find,


\[
D(P \| Q) = D(C \| Q) + D(P \| P^*) - D(C \| P^*). \qquad (4.6)
\]









But since the KL-divergence is non-negative, and since the e-radius Re is a uniform

upper bound on the KL-divergence from the e-ITC to any P in E, we have

\[
D(P \| Q) \ge D(C \| Q) - R_e. \qquad (4.7)
\]

Here we see that the m-ITC would do little better in guaranteeing this particular

bound. While it would ensure that we had equality in equation 4.2, it could not

bound the last term in equation 4.6 because the order of the arguments is reversed.
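A sketch of how this bound could drive pruning follows. It assumes each cluster stores its e-ITC and e-radius, and, as noted above, the bound is applied heuristically when the algebraic-inner-point condition cannot be verified. The function and variable names are our own illustrative choices.

import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

def nn_with_kl_bound(query, clusters):
    """Nearest neighbor under D(P || Q) using the bound D(P||Q) >= D(C||Q) - Re.

    clusters : list of (center, e_radius, members), where center is the e-ITC.
    """
    best, best_d = None, np.inf
    for center, e_radius, members in clusters:
        # equation 4.7: if even the bound cannot beat the current best, skip the cluster
        if kl(center, query) - e_radius >= best_d:
            continue
        for p in members:
            d = kl(p, query)
            if d < best_d:
                best, best_d = p, d
    return best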

We can get an intuitive, nonrigorous view of the same lower bound by again

borrowing notions from squared Euclidean distance. This pictorial reprise of

equation 4.7 can lend valuable insight to the tightness of the bound and its

dependence on each of the two terms. For this discussion we refer to the right side

of Fig. 4-1.

The minimax definition tells us that D(C||P) <= Re. We consider the case in

which this is an equality and sweep out an arc centered at C with radius Re from the

base of the triangle counter-clockwise. We take the point where a line segment from

Q is tangent to this arc as a vertex of a right triangle with hypotenuse of length

D(C||Q). The leg which is normal to the arc has length Re by construction, and by

the Pythagorean Theorem the other leg of this triangle, which originates from Q,

has length D(C||Q) - Re. We can use the length of this leg to visualize the lower

bound, and by inspection we see that it will always be exceeded by the length of

the line segment originating from Q and terminating further along the arc at P.

This segment has length D(P||Q) and is indeed the quantity we seek to bound

from below.

4.3 Shape Retrieval Experiment

In this section we apply the e-ITC and the lower bound in equation 4.7 to

represent distributions arising from shapes. Since the lower bound guarantees that









we only discard elements that cannot be nearest neighbors, the accuracy of retrieval

is as good as an exhaustive search.

While we know from the theory that the e-ITC yields a smaller worst-case KL-

divergence, we now present an experiment to test if this translates into a tighter

bound and more efficient queries. We tackle a shape retrieval problem, using shape

distributions [31] as our signature. To form a shape distribution from a 3D shape,

we uniformly sample pairs of points from the surface of the shape and compute

the distance between these random points, building a histogram of these random

distances. To account for changes in scale, we independently scale each histogram

so that the maximum distance is always the same. For our dissimilarity measure,

we use KL-divergence, so the nearest neighbor P to a query distribution Q is


\[
\hat{P} = \arg\min_{P'} D(P' \| Q). \qquad (4.8)
\]
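A rough sketch of this signature construction is given below. For brevity it samples mesh vertices rather than points distributed uniformly over the triangles as in [31], and the bin count and sample size are arbitrary choices of ours.

import numpy as np

def shape_distribution(vertices, n_pairs=100000, n_bins=64, rng=None):
    """Approximate D2 shape distribution: a histogram of random pairwise distances.

    vertices : (V, 3) array of surface points (here, mesh vertices for brevity).
    """
    rng = np.random.default_rng(rng)
    i = rng.integers(0, len(vertices), size=n_pairs)
    j = rng.integers(0, len(vertices), size=n_pairs)
    d = np.linalg.norm(vertices[i] - vertices[j], axis=1)
    d = d / d.max()                      # scale so the maximum distance is always the same
    hist, _ = np.histogram(d, bins=n_bins, range=(0.0, 1.0))
    return hist / hist.sum()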


For data, we use the Princeton Shape Database [47] which consists of over 1800

triangulated 3D models from over 160 classes including people, animals, buildings,

and vehicles.

To test the efficiency, we again compare the e-ITC to the uniformly weighted,

normalized geometric mean. Using the convexity of E we can generalize the lower

bound in equation 4.7 to work for the geometric mean by replacing the e-radius

with \(\max_i D(C \| P_i)\) for our different C.

We take the base classification accompanying the database to define our

clusters, and then compute the e-ITC and geometric means of each cluster.

When we consider a novel query model (on a leave-one-out basis), we search

for the nearest neighbor utilizing the lower bound and disregarding unnecessary

comparisons. For each query, we measure the number of comparisons required to

find the nearest neighbor.
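For reference, here is a minimal sketch of the uniformly weighted, normalized geometric mean and of the generalized radius max_i D(C||P_i) that replaces the e-radius for it; the smoothing constant is our own implementation detail.

import numpy as np

def normalized_geometric_mean(dists, eps=1e-12):
    """Uniformly weighted, normalized geometric mean of a set of distributions."""
    g = np.exp(np.mean(np.log(np.asarray(dists, float) + eps), axis=0))
    return g / g.sum()

def generalized_radius(center, members, eps=1e-12):
    """max_i D(center || P_i), the radius used in place of Re for other centers."""
    c = np.asarray(center, float) + eps
    return max(float(np.sum(c * np.log(c / (np.asarray(p, float) + eps))))
               for p in members)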




























Figure 4-2: The speed-up factor versus an exhaustive search when using the e-ITC
as a function of each class in the shape database.




















































Figure 4-3: The relative percent of additional prunings which the e-ITC achieves
beyond the geometric center, again for each class number.









Fig. 4-2 and Fig. 4-3 show the results of our experiment. In Fig. 4-2, we

see the speed-up factor that the e-ITC achieves over an exhaustive search. Aver-

aged over all probes in all classes, this speed-up factor is approximately 2.6; the

geometric mean achieved an average speed-up of about 1.9.

And in Fig. 4-3, we compare the e-ITC to the geometric mean and see that

for some classes, the e-ITC allows us to discard nearly twice as many unworthy

candidates as the geometric mean. For no class of probes did the geometric mean

prune more than the e-ITC, and when averaged over all probes in all classes, the

e-ITC discarded over 30% more elements than did the geometric mean.

4.4 Retrieval with JS-divergence

In this section we mimic the experiments of the previous chapter. Instead of

using the KL-divergence to determine nearest neighbors, we return to the square-

root of the JS-divergence, and we again use the triangle inequality to guarantee no

decrease in accuracy.

Using metric trees with the e-center, geometric mean, and m-center, we can

compare the efficiencies of each representative and the overall speedup. In Fig. 4-4,

we see the speedup relative to an exhaustive search when using the e-ITC; on

average, the speedup factor is 1.53. In Fig. 4-5 we compare the e-ITC to the

geometric mean, much as we compared the m-ITC to the arithmetic mean in the

last chapter. We claim that this is the natural comparison because the exponential

family consists of weighted geometric means. Here the geometric mean searches on

average 7.2% more of the database than the e-ITC. Lastly in Fig. 4-6 we compare

the two centers against each other. Here the e-ITC comes up short, searching on

average an additional 14% of the database. In the next section we try to explain

this result.



















































Figure 4-4: Speedup factor for each class resulting from using e-ITC over an ex-
haustive search















































































Figure 4-5: Excess searches performed using geometric mean relative to e-ITC as

proportion of total database.
























Figure 4-6: Excess searches performed using e-ITC relative to m-ITC as proportion

of total database.









4.5 Comparing the m- and e-ITCs

Since both centers have found successful application to retrieval, it is reason-

able to explore their relationship. Since the arguments in their respective minimax

criteria are reversed, it is not immediately clear that a meaningful comparison

could be made with KL-divergence alone. Hence, we resort to the X2 distance from

the previous chapter as an arbiter (though we could just as well use JS-divergence).

The comparison we make next is simple. Returning to the texture retrieval

dataset from the previous chapter, we use the same set memberships and calculate

an m-ITC and an e-ITC for each set. Then for each representative and each set we

determine what the maximum X2 distance between an element of that set and the

representative is.
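A sketch of this per-class measurement is given below, under the assumption that the two centers have already been computed for the class; the function names are ours.

import numpy as np

def chi2(p, q):
    num, den = (np.asarray(p, float) - np.asarray(q, float)) ** 2, np.asarray(p, float) + np.asarray(q, float)
    return float(np.sum(num[den > 0] / den[den > 0]))

def chi2_radius(center, members):
    """Maximum chi-squared distance from the center to any member of the class."""
    return max(chi2(m, center) for m in members)

def radius_ratio(e_center, m_center, members):
    """Ratio of the e-ITC's chi-squared radius to the m-ITC's for one class."""
    return chi2_radius(e_center, members) / chi2_radius(m_center, members)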

The results appear in Fig. 4-7 as the ratio between this "chi-squared radius" of the

e-ITC and the m-ITC. Since the numbers are greater than one with the exception

of only two out of 61 classes, it is safe to conclude in this setting that the m-ITC

forms tighter clusters. This result helps explain the superior performance of the

m-center in the previous section.

In retrospect, one could attribute this to the m-ITC's global optimality

property (cf. Section 2.2.1) which the e-ITC may not share.




















































Figure 4-7: The ratio of the maximal X2 distance from each center to all of the
elements in a class

















CHAPTER 5
TRACKING

5.1 Introduction

In the previous chapters, we considered the problem of efficient retrieval. In

the case of retrieval, where a uniform upper bound is important, one measures how

well a representative does by focusing on how it handles the most distant members.

This property is why the minimax representatives are well-suited to retrieval. In

this chapter, we explore the question of whether the same can be said for tracking.

We first present several encouraging signs that in fact it may be, and then we go

on to consider an experiment to test the performance of a tracker built around the

m-ITC. But first we set the context of the tracking problem in probabilistic terms.

5.2 Background-Particle Filters

The tracking problem consists of estimating and maintaining a hidden variable

(usually position, sometimes pose or a more complicated state) from a sequence

of observations. Commonly, instead of simply keeping one guess as to the present

state and updating that guess at each new observation, one stores and updates

an entire probability distribution on the state space; then, when pressed to give a

single, concrete estimate of the state, one uses some statistic (e.g., mean or mode)

of that distribution.

This approach is embodied famously in the Kalman filter, where the probabil-

ity distribution on the state space is restricted to be a Gaussian, and the trajectory

of the state (or simply the motion of the object) is assumed to follow linear dynam-

ics so that the Gaussian at one time step may propagate to another Gaussian at

the next.









When this assumption is too limiting-possibly because background clutter

creates multiple modes in the distribution on the state space or because of complex

dynamics or both-researchers often turn to a particle filter to track an object [48].

Unlike the Kalman filter, particle filters do not require that the probability dis-

tribution on the state space be a Gaussian. To gain this additional representative

power, a particle filter stores a set of samples from the state space with a weight for

each sample describing its probability. The goal then is to update these sets when

new observations arrive. One can update from a time step t - 1 to a time step t in

three steps [48]:

1. Given a sample set \(\{s_1^{(t-1)}, \ldots, s_N^{(t-1)}\}\) and its associated weights
\(\{\pi_1^{(t-1)}, \ldots, \pi_N^{(t-1)}\}\), randomly sample (with replacement) a new set
\(\{s_1'^{(t)}, \ldots, s_N'^{(t)}\}\).

2. To arrive at the new \(s_i^{(t)}\), randomly propagate each \(s_i'^{(t)}\) by assigning \(s_i^{(t)}\)
the value of x drawn according to a probability distribution (i.e., the motion model)
\(p_m(x \mid s_i'^{(t)})\).

3. Adjust the weights according to the likelihood of observing the data \(z^{(t)}\) given
that the true state is \(s_i^{(t)}\):

\[
\pi_i^{(t)} = \frac{1}{Z}\, p_d\!\left(z^{(t)} \mid s_i^{(t)}\right), \qquad (5.1)
\]

where Z is a normalization factor.
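A compact sketch of one such update appears below for the two-dimensional, shift-invariant Gaussian motion model used later in this chapter; the function signature and the likelihood argument are our own illustrative choices, not a prescribed interface.

import numpy as np

def particle_filter_step(samples, weights, z, motion_mean, motion_cov, likelihood,
                         rng=None):
    """One update of a basic particle filter: resample, propagate, reweight.

    samples     : (N, 2) particle positions at time t-1
    weights     : (N,) normalized weights at time t-1
    z           : observation at time t (whatever `likelihood` expects)
    motion_mean : mean displacement of the shift-invariant Gaussian motion model
    motion_cov  : 2x2 covariance of that Gaussian
    likelihood  : function(z, state) -> p_d(z | state), up to a constant
    """
    rng = np.random.default_rng(rng)
    n = len(samples)
    # 1. resample with replacement according to the old weights
    resampled = samples[rng.choice(n, size=n, p=weights)]
    # 2. propagate each particle through the motion model p_m
    propagated = resampled + rng.multivariate_normal(motion_mean, motion_cov, size=n)
    # 3. reweight by the observation likelihood and normalize (equation 5.1)
    new_weights = np.array([likelihood(z, s) for s in propagated])
    new_weights /= new_weights.sum()
    return propagated, new_weights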

In this work we restrict our attention to a two dimensional state space

consisting only of an object's position. Also we exclusively consider motion models

pm in which

\[
p_m(x \mid y) = p_m(x + x_0 \mid y + y_0) = G(x - y), \qquad (5.2)
\]

where G is a Gaussian. That is, p_m is a "shift-invariant" model in which the

probability of the displacement from a position in one time step to a position in the









next is independent of that starting position. Furthermore, we take that probability

of a displacement as determined by a Gaussian.

5.3 Problem Statement

In this chapter we consider the following scenario: We have several objects,

and for each object we know its distinct probabilistic motion model. We want to

build a single particle filter with one motion model which can track any of the

objects. This is reminiscent of the problem of designing a universal code that can

efficiently encode any of a number of distinct sources, and that similarity -ii--. -1 -

that this is a promising application for an information theoretic center.

Related to this problem is the case in which one single object undergoes

distinct "phases" of motion, each of which has a distinct motion model. An

example of this is a car that moves in one fashion when it drives straight and in a

completely different fashion when turning. This work does not explore such multi-

phase tracking. For this related problem of multi-phase motion, there are certainly

more complicated motion models suited to the problem [49]. And for the problem

we focus on in this chapter, one could also imagine refining the motion model as

observations arrive, until, once in possession of a preponderance of evidence, one

finally settles on the most likely component. But all of these require some sort of

on-line learning while, in contrast, the approach we present offers a single, simple,

fixed prior which can be directly incorporated into the basic particle filter.

5.4 Motivation

5.4.1 Binary State Space

To motivate the use of an ITC, we begin with a toy example in which we

"track" the value of a binary variable. Our caricature of a tracker is based on

a particle filter with N particles. In this example, we make a further, highly

simplifying assumption: The observation model is perfect and clutter-free. That

is the likelihood Pd in equation 5.1 has perfect information. This means that any









particles which propagate to the incorrect state receive a weight of zero, and the

particles (if any) which propagate to the correct state share the entire probability

mass among themselves.

Under these assumptions, the event that the tracker fails (and henceforth never

recovers) at each time step is an independent, identically distributed Bernoulli

random variable with probability


\[
P_{\mathrm{fail}} = p(1-q)^N + (1-p)q^N, \qquad (5.3)
\]

where p is the probability the tracker takes on state 0 and q is the probability

that a particle evolves under its motion model to state 0. Similarly, 1 - p is the

probability the tracker takes on state 1 and 1 - q is the probability that a particle

evolves to state 1. What the equation above says is that our tracker will fail if and

only if all N particles choose wrongly.

Now the interesting thing about this example is that a motion model in

which q != p can outperform one in which q = p. Specifically, by differentiating

equation 5.3 with respect to q, we find that P_fail takes on a minimum when

\[
q = \left(1 + \left(\frac{1-p}{p}\right)^{\frac{1}{N-1}}\right)^{-1}. \qquad (5.4)
\]

As a concrete example, we take the case with p = .1 and N = 10; here the optimal

value for q is .4393. When we find the expected number of trials till the tracker

fails in each case (simply \(1/P_{\mathrm{fail}}\)), we find that if we take a motion model with q = p,

the tracker goes for an average of 29 steps, but if we take the optimal value of q,

the tracker continues for 1825 steps.
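The following few lines reproduce this toy calculation numerically; the function names are ours.

def p_fail(p, q, N):
    """Probability that all N particles choose the wrong state (equation 5.3)."""
    return p * (1 - q) ** N + (1 - p) * q ** N

def optimal_q(p, N):
    """Minimizer of p_fail, equation 5.4."""
    return 1.0 / (1.0 + ((1 - p) / p) ** (1.0 / (N - 1)))

p, N = 0.1, 10
q_opt = optimal_q(p, N)                  # approximately 0.4393
print(1.0 / p_fail(p, p, N))             # roughly 29 expected steps with q = p
print(1.0 / p_fail(p, q_opt, N))         # roughly 1800 expected steps with the optimal q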

Here we see reminders of how the ITCs give more weight to the extraordinary

situations than other representatives. In this case it is justified to under-weight the

most likely case because even having a single particle arrive at the correct location

is as good as having all N particles arrive.









5.4.2 Self-information Loss

For more evidence suggesting that the ITC might be well-suited, we consider

the following analysis [50]. Suppose we try to predict a value x with distribution

p(x). But instead of picking a single value, we specify another distribution q(x)

which defines our confidence for each value of x.

Now depending on the value x that occurs, we pay a penalty based on how

much confidence we had in that value; if we had a great deal of confidence (q(x)

close to one), we pay a small penalty, and if we did not give much credence to the

value, we pay a larger penalty. The self-information loss function is a common

choice for this penalty. According to this function, if the value x occurs, we would

incur a penalty of -log q(x). If we examine the expected value of our loss, we find

\[
E_p\left[-\log q(x)\right] = KL(p, q) + H(p). \qquad (5.5)
\]


Returning to our problem statement, if we are faced with a host of distribu-

tions and want to find a single q to minimize the expected loss in equation 5.5 over

the set of distributions, we begin to approach something like the m-ITC.

5.5 Experiment-One Tracker to Rule Them All?

5.5.1 Preliminaries

To test how well the m-ITC incorporates a set of motion models into one

tracker, we designed the following experiment. Given a sequence of images, we first

estimated the true motion by hand, measuring the position of the object of interest

at key frames. We then fit a Gaussian to the set of displacements, taking the mean

of the Gaussian as the average velocity.

Data. A single frame from the 74-frame sequence appears in Fig. 5-1. In this

sequence we track the head of the walker which has nearly constant velocity in

the x-direction and slight periodic motion in the y-direction (as she steps). The

mean velocity in the x-direction was -3.7280 pixels/frame with a marginal standard


















































Figure 5-1: Frame from test sequence









deviation of 1.15; in the y-direction the average velocity was -0.1862 with

standard deviation 2.41.

For our observation model, we simply use a template of the head from the

initial frame and compare it to a given region, finding the mean squared error

(MSE) of the gray levels. The likelihood is then just exp(-MSE). We initialize all

trackers to the true state at the initial frame.

Motion models. From this single image sequence, we can hallucinate numerous

image sequences which consist of rotated versions of the original sequence. We

know the true motion models for all of these novel sequences since they are just the

original motion model rotated by the same amount.

To examine the performance of one motion model applied to a different image

sequence, one need only consider the angle disparity between the true underlying

motion model of the image sequence and the motion model utilized by the tracker.

In Fig. 5-2 we report the performance in average time-till-failure as a function of

angle disparity. (Here the number of particles is fixed to 10.)

Since in this experiment (and subsequent ones in this chapter), we define

a tracker as having irrevocably failed when all of its particles are at a distance

greater than 40 pixels from the location, we see that even the most hopeless of

trackers will succeed in I Ii .il:, the subject for six frames-just long enough

for all of its particles to flee from the correct state at an average relative velocity

of 7.5 pixels/frame. This observation lets us calculate a lower bound on the

performance of a tracker with a given angle disparity from the true motion:

Pessimistically assuming a completely uninformative observation model, one can

calculate the time required for the centroid of the particles to exceed a distance of

D = 40 pixels from the true state as


\[
t = \frac{D}{2r\sin\frac{\theta}{2}}, \qquad (5.6)
\]









































Figure 5-2: Average time till failure as a function of angle disparity between the

true motion and the tracker's motion model











where r = 3.7326 is the speed at which the centroid and the true state each move and theta is

the angle disparity. The dashed line in Fig. 5-2 represents this curve, capped at the

maximum number of frames 74.

One should note that since all of the motion models are rotated versions

of each other, the H(P) term in equation 5.5 is a constant. Hence the q which

minimizes the maximum expected self-information loss over a set of p's is in fact

the m-ITC.

Performance of mixtures. Because we will take the m-ITC of several of these

motion models, it is also of interest how mixtures perform. We consider a mixture

of the correct motion model and a motion model pi radians rotated in the opposite

direction, which essentially contributes nothing to the tracking. In Fig. 5-3 we

again plot the time-till-failure for trackers with 10 and 20 particles as a function of

the weight in the mixture of the correct model.

To derive a lower bound, this time we represent the proportion of the proba-

bility near the true state at time t as r_t and the remainder as w_t. Further we say

that at each time step, a r_t of the probability moves to (or remains in) the correct

state (as a result of those particles being driven by the correct motion model) and

the rest moves away from the correct state. Next we model the step in the particle

filter algorithm where we adjust the weights. We assume that a particle away from

the true state receives a weight that is c < 1 times the weight received by a particle

near the true state. If there were no clutter in the scene, this number would be zero

and we would return to the assumption in Section 5.4.1. Now, by pessimistically

assuming that particles which move away from the correct state have exceedingly

small chance (i.e., zero) of rejoining the true state randomly, we can derive the







































Figure 5-3: Average time till failure as a function of the weight on the correct

motion model; for 10 and 20 particles









following from an initial r_0 = 1,

\[
r_t = \frac{a^t}{Z}, \qquad (5.7)
\]

\[
w_t = \frac{(1-a)\sum_{i=0}^{t-1} c^{\,t-i} a^{i}}{Z}, \qquad (5.8)
\]

where Z is a normalizing constant. And by taking the lower bound \(w_t \ge (1-a)c\,a^{t-1}/Z\),

we can derive that \(r_t \le a/(a + (1-a)c)\). Finally, by taking this upper bound on r_t as a

Bernoulli random variable as we did in Section 5.4.1, we can get a lower bound

on the expected time-till-failure as \(1/(1-r_t)\) plus an adjustment for the six free frames

required to drift out of the 40-pixel range. This is the lower bound shown in Fig. 5-3.

To calculate c, we randomly sampled the background at a distance greater than

40 pixels from the true state, and averaged the response of the observation model,

yielding c = 0.1886.

5.5.2 m-ITC

Again we compare the performance of the m-ITC to the arithmetic mean. This

time we form mixtures of several motion models of varying angles, taking weights

either as uniform (for the arithmetic mean) or as determined by the m-ITC. In the

first case we take a total of 12 motion models with angle disparities of


{±5π/20, ±4π/20, ±3π/20, ±2π/20, ±π/20, 0, π}.


On this mixture, the m-ITC assigns weights of .31 to each of the motion models at

±π/4 and the remaining .37 to the motion model at π.

We tested these trackers when the true motion model has orientation zero,

π/4, and π, respectively, and reported their average times-till-failure in Table 5-1.

Indeed as we might expect from a minimax representative, the m-ITC registers the

best worst-case performance; but nevertheless it is not an impressive performance.

The second set of motion models was slightly less extreme in variation,

{0, ±π/64, ±π/32, ±π/4}. To these components, the m-ITC split its probability









evenly between the motion models at disparities ±π/4. In this case, the m-ITC had

a better showing overall, and still had the best worst case performance. The results

are shown in Table 5-2.

Table 5-1: Average time-till-failure of the tracker based on the arithmetic mean
(AM) and on the m-ITC (800 particles)

True angle    m-ITC    AM
0             43       74
π/4           23       46
π             16       7



Table 5-2: Average time-till-failure of the tracker based on the arithmetic mean
(AM) and on the m-ITC (400 particles) with second set of models

True angle    m-ITC    AM
0             72       74
π/4           44       37



5.6 Conclusion

While there is reason to believe that a minimax representative would serve

well in combining several motion models in tracking, the precise circumstances

when this might be beneficial are difficult to determine. Examining Fig. 5-2 and

Fig. 5-3 it seems that if there is too little weight or too great a disparity, a tracker

is doomed from the beginning. So while the m-ITC will not perform ideally under

all situations, it still retains its expected best worst-case performance.















CHAPTER 6
CONCLUSION

6.1 Limitations

The central thrust of this work has been the claim that despite many computer

vision researchers' instinctive suspicion of minimax methods, given the right

application, they can be useful. However those skeptical researchers' instincts

are often well-founded: The main issue one must be aware of regarding the

representatives presented in this work is their sensitivity to outliers. One must

carefully consider his data, particularly the extreme elements, because those are

precisely the elements with the most influence on these representatives.

In addition to data, one must also consider how one's application defines

successful results. If one's application can tolerate small deviations in the "normal

cases," and successful behavior is defined by good results in extreme cases, then

a minimax representative might be appropriate. This is precisely the case in the

retrieval problem where a uniform upper bound on the dispersion of a subset is

the criterion on which successful indexing is judged. Such a criterion disregards

whether the innermost members are especially close to the representative or not.

Despite some initial suggestions, this did not turn out to be the case in the tracking

domain (in the presence of clutter). There it seems the deviations in the "normal

cases" did have a significant effect on performance.

6.2 Summary

After characterizing two minimax representatives, with firm groundings

in information theory, we have shown how they can be utilized to speed the

retrieval of textures, images, shapes, or any object so represented by a probability

distribution. Their power in this application comes from the fact that they form









tight clusters, allowing for more precise localization and efficient pruning than other

common representatives. While the tracking results did not bear as much fruit, we

still believe that is a promising avenue for such a representative if the problem is

properly formulated.

6.3 Future Work

This topic touches upon a myriad of areas including information theory,

information geometry, and learning. Csiszar and others have characterized the

expectation-maximization algorithm in terms of divergence minimization; and

we believe that incorporating the ITCs into some EM-style algorithm would be

very interesting. Also of interest are its connections to AdaBoost and other online

learning algorithms. But for all of these avenues, the main challenge remains of

verifying that the data and measurement of success are appropriate fits to these

representatives.















REFERENCES


[1] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley &
Sons, New York, NY, 1991.

[2] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Trans.
Inform. Theory, vol. 37, no. 1, pp. 145-151, Mar. 1991.

[3] P. T. Fletcher, C. Lu, and S. Joshi, "Statistics of shape via principal geodesic
analysis on lie groups," IEEE Trans. Med. Imag., vol. 23, no. 8, pp. 995-1005,
Aug. 2004.

[4] E. Klassen, A. Srivastava, W. Mio, and S. H. Joshi, "Analysis of planar shapes
using geodesic paths on shape spaces," IEEE Trans. Pattern Anal. Machine
Intell., vol. 26, no. 3, pp. 372-383, Mar. 2004.

[5] S.-I. Amari, Methods of Information Geometry, American Mathematical
Society, Providence, RI, 2000.

[6] S.-I. Amari, "Information geometry on hierarchy of probability distributions,"
IEEE Trans. Inform. Theory, vol. 47, no. 5, pp. 1707-1711, Jul. 2001.

[7] B. Pelletier, "Informative barycentres in statistics," Annals of the Institute of
Statistical Mathematics, to appear.

[8] Z. Wang and B. C. Vemuri, "An affine invariant tensor dissimilarity measure
and its applications to tensor-valued image segmentation," in Proc. IEEE
Conf. Computer Vision and Pattern Recognition, Washington, DC, Jun./Jul.
2004, vol. 1, pp. 228-233.

[9] D. P. Huttenlocher, G. A. Klanderman, and W. A. Rucklidge, "Comparing
images using the hausdorff distance," IEEE Trans. Pattern Anal. Machine
Intell., vol. 15, no. 9, pp. 850-863, Sep. 1993.

[10] J. Ho, K.-C. Lee, M.-H. Yang, and D. Kriegman, "Visual tracking using
learned linear subspaces," in Proc. IEEE Conf. Computer Vision and Pattern
Recognition, Washington, DC, Jun./Jul. 2004, pp. 228-233.

[11] R. I. Hartley and F. Schaffalitzky, "L-oo minimization in geometric recon-
struction problems," in Proc. IEEE Conf. Computer Vision and Pattern
Recognition, Washington, DC, Jun./Jul. 2004, pp. 504-509.









[12] S. C. Zhu, Y. N. Wu, and D. Mumford, "Minimax entropy principle and
its application to texture modeling," Neural Computation, vol. 9, no. 8, pp.
1627-1660, Nov. 1997.

[13] C. Liu, S. C. Zhu, and H.-Y. Shum, "Learning inhomogeneous Gibbs model of
faces by minimax entropy," in Proc. Int'l Conf. Computer Vision, Vancouver,
Canada, Jul. 2001, pp. 281-287.

[14] I. Csiszar, "Why least squares and maximum entropy? An axiomatic approach
to inference for linear inverse problems," Annals of Statistics, vol. 19, no. 4,
pp. 2032-2066, Dec. 1991.

[15] D. P. Bertsekas, Nonlinear P,. ,j,.iiin,:,j Athena Scientific, Belmont, MA,
1999.

[16] N. Merhav and M. Feder, "A strong version of the redundancy-capacity
theorem of universal coding," IEEE Trans. Inform. Theory, vol. 41, no. 3, pp.
714-722, May 1995.

[17] I. Csiszar, "I-divergence geometry of probability distributions and mini-
mization problems," Annals of Probability, vol. 3, no. 1, pp. 146-158, Jan.
1975.

[18] R. Sibson, "Information radius," Z. Wahrscheinlichkeitstheorie verw. Geb.,
vol. 14, no. 1, pp. 149-160, Jan. 1969.

[19] N. Jardine and R. Sibson, Mathematical Taxonomy, John Wiley & Sons,
London, UK, 1971.

[20] A. O. Hero, B. Ma, O. Michel, and J. Gorman, "Applications of entropic
spanning graphs," IEEE Signal Processing Mag., vol. 19, no. 5, pp. 85-95, Sep.
2002.

[21] Y. He, A. B. Hamza, and H. Krim, "A generalized divergence measure for
robust image registration," IEEE Trans. Signal Processing, vol. 51, no. 5, pp.
1211-1220, May 2003.

[22] D. M. Endres and J. E. Schindelin, "A new metric for probability distri-
butions," IEEE Trans. Inform. Theory, vol. 49, no. 7, pp. 1858-1860, Jul.
2003.

[23] F. Topsøe, "Some inequalities for information divergence and related measures
of discrimination," IEEE Trans. Inform. Theory, vol. 46, no. 4, pp. 1602-1609,
Jan. 2000.

[24] J. Burbea and C. R. Rao, "On the convexity of some divergence measures
based on entropy functions," IEEE Trans. Inform. Theory, vol. 28, no. 3, pp.
489-495, May 1982.









[25] R. G. Gallager, Information Theory and Reliable Communication, John Wiley
& Sons, New York, NY, 1968.

[26] I. Csiszar and J. Körner, Information Theory: Coding Theorems for
Discrete Memoryless Systems, Academic Press, Inc., New York, NY, 1981.

[27] L. D. Davisson and A. Leon-Garcia, "A source matching approach to finding
minimax codes," IEEE Trans. Inform. Theory, vol. 26, no. 2, pp. 166-174,
Mar. 1980.

[28] B. Y. Ryabko, "Comments on 'A source matching approach to finding
minimax codes'," IEEE Trans. Inform. Theory, vol. 27, no. 6, pp. 780-781,
Nov. 1981.

[29] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line
learning and an application to boosting," J. Computer and System Sciences,
vol. 55, no. 1, pp. 119-139, Aug. 1997.

[30] J. Kivinen and M. K. Warmuth, "Boosting as entropy projection," in
Proceedings of the Twelfth Annual Conference on Computational Learning
Theory, Santa Cruz, CA, Jul. 1999, pp. 134-144.

[31] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, "Shape distributions,"
ACM Trans. Graphics, vol. 21, no. 4, pp. 807-832, Oct. 2002.

[32] S. Gordon, J. Goldberger, and H. Greenspan, "Applying the information
bottleneck principle to unsupervised clustering of discrete and continuous
image representations," in Proc. Int'l Conf. Computer Vision, Nice, France,
Oct. 2003, pp. 370-396.

[33] C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Blobworld: image
segmentation using expectation-maximization and its application to image
querying," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 8, pp.
1026-1038, Aug. 2002.

[34] Y. Rubner, C. Tomasi, and L. Guibas, "A metric for distributions with
applications to image databases," in Proc. Int'l Conf. Computer Vision,
Bombay, India, Jan. 1998, pp. 59-66.

[35] J. Puzicha, J. M. Buhmann, Y. Rubner, and C. Tomasi, "Empirical evaluation
of dissimilarity measures for color and texture," in Proc. Int'l Conf. Computer
Vision, Kerkyra, Greece, Sep. 1999, pp. 1165-1172.

[36] M. N. Do and M. Vetterli, "Wavelet-based texture retrieval using generalized
Gaussian density and Kullback-Leibler distance," IEEE Trans. Image
Processing, vol. 11, no. 2, pp. 146-158, Feb. 2002.









[37] M. Varma and A. Zisserman, "Texture classification: Are filter banks
necessary?," in Proc. IEEE Conf. Computer Vision and Pattern Recognition,
Madison, WI, Jun. 2003, pp. 691-698.

[38] S. M. Omohundro, "Bumptrees for efficient function, constraint, and classifica-
tion learning," in Advances in Neural Information Processing Systems, Denver,
CO, Nov. 1990, vol. 3, pp. 693-699.

[39] P. N. Yianilos, "Data structures and algorithms for nearest neighbor search in
general metric spaces," in Proc. ACM-SIAM Symp. on Discrete Algorithms,
Austin, TX, Jan. 1993, pp. 311-321.

[40] J. K. Uhlmann, "Satisfying general proximity/similarity queries with metric
trees," Information Processing Letters, vol. 40, no. 4, pp. 175-179, Nov. 1991.

[41] A. Moore, "The anchors hierarchy: Using the triangle inequality to survive
high-dimensional data," in Proc. Conf. on Uncertainty in Artificial Intelligence,
Stanford, CA, Jun./Jul. 2000, pp. 397-405.

[42] M. Varma and A. Zisserman, "Classifying images of materials: Achieving
viewpoint and illumination independence," in Proc. European Conf. Computer
Vision, Copenhagen, Denmark, May/Jun. 2002, pp. 255-271.

[43] T. Leung and J. Malik, "Recognizing surfaces using three-dimensional
textons," in Proc. Int'l Conf. Computer Vision, Kerkyra, Greece, Sep. 1999,
pp. 1010-1017.

[44] K. J. Dana, B. van Ginneken, S. K. Nayar, and J. J. Koenderink, "Reflectance
and texture of real-world surfaces," ACM Trans. Graphics, vol. 18, no. 1, pp.
1-34, Jan. 1999.

[45] E. Levina and P. Bickel, "The earth mover's distance is the Mallows distance:
some insights from statistics," in Proc. Int'l Conf. Computer Vision, Vancouver,
Canada, Jul. 2001, pp. 251-256.

[46] N. Vasconcelos, "On the complexity of probabilistic image retrieval," in Proc.
Int'l Conf. Computer Vision, Vancouver, Canada, Jul. 2001, pp. 400-407.

[47] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser, "The Princeton shape
benchmark," in Shape Modeling International, Genova, Italy, Jun. 2004, pp.
167-178.

[48] M. Isard and A. Blake, "Condensation-conditional density propagation for
visual tracking," Int'l J. of Computer Vision, vol. 29, no. 1, pp. 5-28, Jan.
1998.

[49] M. Isard and A. Blake, "A mixed-state condensation tracker with automatic
model-switching," in Proc. Int'l Conf. Computer Vision, Bombay, India, Jan.
1998, pp. 107-112.








[50] N. Merhav and M. Feder, "Universal prediction," IEEE Trans. Inform.
Theory, vol. 44, no. 6, pp. 2124-2147, Oct. 1998.















BIOGRAPHICAL SKETCH

A native Floridian, Eric Spellman grew up on Florida's Space Coast, gradu-

ating from Satellite High School in 1998. Thereafter he attended the University of

Florida, receiving his Bachelor of Science in mathematics in 2000, his Master of En-

gineering in computer information science and engineering in 2001, and, under the

supervision of Baba C. Vemuri, his Doctor of Philosophy in the same in 2005. After

graduating he will return to the Space Coast with his wife Kayla and daughter

Sophia to work for Harris Corporation.




Full Text

PAGE 1

FUSINGPROBABILITYDISTRIBUTIONSWITHINFORMATIONTHEORETICCENTERSANDITSAPPLICATIONTODATARETRIEVALByERICSPELLMANADISSERTATIONPRESENTEDTOTHEGRADUATESCHOOLOFTHEUNIVERSITYOFFLORIDAINPARTIALFULFILLMENTOFTHEREQUIREMENTSFORTHEDEGREEOFDOCTOROFPHILOSOPHYUNIVERSITYOFFLORIDA2005

PAGE 2

Copyright2005byEricSpellman

PAGE 3

IdedicatethisworktomydearestKaylawithwhomIhavealreadylearnedhowtobeaDoctorofPhilosophy.

PAGE 4

ACKNOWLEDGMENTSForhissupportiveguidanceduringmygraduatecareer,IthankDr.BabaC.Vemuri,mydoctoraladvisor.Hetaughtmetheeld,oeredmeaninterestingproblemtoexplore,pushedmetopublish|inspiteofmyterminalprocrastination{andtriedhisbesttoinstillinmetheintangiblesecretstoaproductiveacademiccareer.AlsotheothermembersofmycommitteehavehelpedmegreatlyinmycareerattheUniversityofFlorida,andIthankthemall:Dr.BrettPresnelldeliveredtherstlectureIattendedasanundergraduate;andIamgladhecouldattendthelastlectureIgaveasadoctoralcandidate.IhavealsobenettedfromandappreciatedDr.AnandRangarajan'slectures,professionaladvice,andphilisophicaldiscussions.WhileIdidnothavethepleasureofattendingDr.ArunavaBanerjee'sorDr.JeHo'sclasses,Ihaveappreciatedtheirinsightsandtheirexamplesassuccessfulearlyresearchers.IwouldalsoliketothankDr.MuraliRaoforstimulatingdebates,Dr.Sun-IchiAmariforproposingtheideaofthee-centerandproofsoftherelatedtheorems,andnumerousanonymousreviewers.Myprofessionaldebtsextendbeyondthefacultyhowevertomyfellowcomrades-in-research.Withthem,Ihavemutteredallmannerofthingsabouttheaforementionedgroupinthesurestcondencethatmymutteringswouldnotbebetrayed.Dr.JundongLiu,Dr.ZhizhouWang,TimMcGraw,FeiWang,SantoshKodipaka,NickLord,BingJian,VinhNghiem,andEvrenOzarslanalldeservethanks.AlsodeservingaretheDepartmentstamemberswhosehardworkkeepsthisplaceaoatandRonSmithfordesigningaword-processingtemplatewithout iv

PAGE 5

whichtheprocessofwritingadissertationmightitselfrequireaPh.D.Forthepermissiontoreproducecopyrightedmaterialwithinthisdissertation,IthanktheIEEEChapter 3 andSpringer-VerlagChapter 4 .FordataIthankthepeoplebehindtheYaleFaceDatabaseimagesfromwhichIusedinFig. 2{5 andFig. 2{8 andMichaelBlackfortrackingdata.Andforthenancialsupportwhichmadethisworkpossible,IacknowledgetheUniversityofFlorida'sStephenC.O'ConnellPresidentialFellowship,NIHgrantRO1NS42075,andtravelgrantsfromtheComputerandInformationScienceandEngineeringdepartment,theGraduateStudentCouncil,andtheIEEE.Andnally,mostimportantly,Ithankmyfamily.Ithankmymother-in-lawDonnaLeaforallofherhelpthesepastfewweeksandNeil,Abra,andPeterforlettingustakeherawayforthattime.Ithankmymotherandfatherforeverythingandmybrother,too.AndIofcoursethankmydearestKaylaandmyloudest,mostobstinate,andsweetestSophia. v

PAGE 6

TABLEOFCONTENTS page ACKNOWLEDGMENTS ............................. iv LISTOFTABLES ................................. viii LISTOFFIGURES ................................ ix ABSTRACT .................................... xi CHAPTER 1INTRODUCTION .............................. 1 1.1InformationTheory .......................... 2 1.2AlternativeRepresentatives ..................... 3 1.3MinimaxApproaches ......................... 4 1.4OutlineofRemainder ......................... 5 2THEORETICALFOUNDATION ...................... 7 2.1Preliminary|EuclideanInformationCenter ............ 7 2.2MixtureInformationCenterofProbabilityDistributions ..... 10 2.2.1GlobalOptimality ....................... 13 2.3ExponentialCenterofProbabilityDistributions .......... 13 2.4IllustrationsandIntuition ...................... 16 2.4.1Gaussians ........................... 19 2.4.2NormalizedGray-levelHistograms .............. 19 2.4.3Thee-CenterofGaussians .................. 23 2.4.4e-ITCofHistograms ...................... 24 2.5PreviousWork ............................. 27 2.5.1Jensen-ShannonDivergence .................. 27 2.5.2MutualInformation ...................... 28 2.5.3ChannelCapacity ....................... 29 2.5.4Boosting ............................ 30 3RETRIEVALWITHm-CENTER ...................... 31 3.1Introduction .............................. 31 3.2ExperimentalDesign ......................... 34 3.2.1ReviewoftheTextureRetrievalSystem ........... 34 3.2.2ImprovingEciency ...................... 37 3.3ResultsandDiscussion ........................ 37 vi

PAGE 7

3.3.1ComparisontoPre-existingEciencyScheme ....... 41 3.4Conclusion ............................... 42 4RETRIEVALWITHe-CENTER ...................... 43 4.1Introduction .............................. 43 4.2LowerBound ............................. 44 4.3ShapeRetrievalExperiment ..................... 47 4.4RetrievalwithJS-divergence ..................... 51 4.5Comparingthem-ande-ITCs .................... 55 5TRACKING .................................. 57 5.1Introduction .............................. 57 5.2Background|ParticleFilters ..................... 57 5.3ProblemStatement .......................... 59 5.4Motivation ............................... 59 5.4.1BinaryStateSpace ...................... 59 5.4.2Self-informationLoss ..................... 61 5.5Experiment|OneTrackertoRuleThemAll? ........... 61 5.5.1Preliminaries .......................... 61 5.5.2m-ITC ............................. 67 5.6Conclusion ............................... 68 6CONCLUSION ................................ 69 6.1Limitations .............................. 69 6.2Summary ............................... 69 6.3FutureWork .............................. 70 REFERENCES ................................... 71 BIOGRAPHICALSKETCH ............................ 76 vii


LIST OF TABLES

5-1 Average time-till-failure of the tracker based on the arithmetic mean (AM) and on the m-ITC
5-2 Average time-till-failure of the tracker based on the arithmetic mean (AM) and on the m-ITC, with second set of models


LIST OF FIGURES

2-1 Center is denoted by and supports are denoted by.
2-2 The ensemble of 50 Gaussians with means evenly spaced on the interval [-30, 30] and σ = 5
2-3 The components from Fig. 2-2 scaled by their weights in the m-ITC.
2-4 The m-ITC (solid) and arithmetic mean (AM, dashed) of the ensemble of Gaussians shown in Fig. 2-2
2-5 Seven faces of one person under different expressions with an eighth face from someone else. Above each image is the weight p_i which the ITC assigns to the distribution arising from that image.
2-6 The normalized gray-level histograms of the faces from Fig. 2-5. Above each distribution is the KL-divergence from that distribution to the m-ITC. Parentheses indicate that the value is equal to the KL-radius of the set. Note that as predicted by theory, the distributions which have maximum KL-divergence are the very ones which received non-zero weights in the m-ITC.
2-7 In the left column, we can see that the arithmetic mean (solid, lower left) resembles the distribution arising from the first face more closely than the m-ITC (solid, upper left) does. In the right column, we see the opposite: The m-ITC (upper right) more closely resembles the eighth distribution than does the arithmetic mean (lower right).
2-8 Eight images of faces which yield normalized gray level histograms. We choose an extraordinary distribution for number eight to contrast how the representative captures variation within a class. The number above each face weighs the corresponding distribution in the e-ITC.
2-9 KL(C; P_i) for each distribution, for C equal to the e-ITC and geometric mean, respectively. The horizontal bar represents the value of the e-radius.
3-1 Using the triangle inequality to prune
3-2 Example images from the CUReT database


3-3 On average for probes from each texture class, the speed-up relative to an exhaustive search achieved by the metric tree with the m-ITC as the representative
3-4 The excess comparisons performed by the arithmetic mean relative to the m-ITC within each texture class as a proportion of the total database
3-5 The excess comparisons performed by the best medoid relative to the m-ITC within each texture class as a proportion of the total database
4-1 Intuitive proof of the lower bound in equation 4.7 (see text). The KL-divergence acts like squared Euclidean distance, and the Pythagorean Theorem holds under special circumstances. Q is the query, P is a distribution in the database, and C is the e-ITC of the set containing P. P* is the I-projection of Q onto the set containing P. On the right, D(C||P) ≤ R_e, where R_e is the e-radius, by the minimax definition of C.
4-2 The speed-up factor versus an exhaustive search when using the e-ITC as a function of each class in the shape database.
4-3 The relative percent of additional prunings which the e-ITC achieves beyond the geometric center, again for each class number.
4-4 Speedup factor for each class resulting from using e-ITC over an exhaustive search
4-5 Excess searches performed using geometric mean relative to e-ITC as proportion of total database.
4-6 Excess searches performed using e-ITC relative to m-ITC as proportion of total database.
4-7 The ratio of the maximal χ² distance from each center to all of the elements in a class
5-1 Frame from test sequence
5-2 Average time till failure as a function of angle disparity between the true motion and the tracker's motion model
5-3 Average time till failure as a function of the weight on the correct motion model; for 10 and 20 particles


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

FUSING PROBABILITY DISTRIBUTIONS WITH INFORMATION THEORETIC CENTERS AND ITS APPLICATION TO DATA RETRIEVAL

By

Eric Spellman

August 2005

Chair: Baba C. Vemuri
Major Department: Computer and Information Science and Engineering

This work presents two representations for a collection of probability distributions (or densities) dubbed information theoretic centers (ITCs). Like the common arithmetic mean, the first new center is a convex combination of its constituent densities in the mixture family. Analogously, the second ITC is a weighted geometric mean of densities in the exponential family. In both cases, the weights in the combinations vary as one changes the distributions. These centers minimize the maximum Kullback-Leibler divergence from each distribution in their collections to themselves and exhibit an equi-divergence property, lying equally far from most elements of their collections. The properties of these centers have been established in information theory through the study of channel capacity and universal codes; drawing on this rich theoretical basis, this work applies them to the problems of indexing for content-based retrieval and to tracking.

Many existing techniques in image retrieval cast the problem in terms of probability distributions. That is, these techniques represent each image in the database, as well as incoming query images, as probability distributions, thus reducing the retrieval problem to one of finding a nearest probability distribution


under some dissimilarity measure. This work presents an indexing scheme for such techniques wherein an ITC stands in for a subset of distributions in the database. If the search finds that a query lies sufficiently far from such an ITC, the search can safely disregard (i.e., without fear of reducing accuracy) the associated subset of probability distributions without further consideration, thus speeding search.

Often in tracking, one represents knowledge about the expected motion of the object of interest by a probability distribution on its next position. This work considers the case in which one must specify a tracker capable of tracking any one of several objects, each with different probability distributions governing its motion. Related is the case in which one object undergoes different "phases" of motion, each of which can be modeled independently (e.g., a car driving straight vs. turning). In this case, an ITC can fuse these different distributions into one, creating one motion model to handle any of the several objects.


CHAPTER 1
INTRODUCTION

Given a set of objects, one commonly wishes to represent the entire set with one object. In this work, we address this concern for the case in which the objects are probability distributions. Particularly, we present a novel representative for a set of distributions whose behavior we can describe in the language of information theory.

For contrast, let us first consider the familiar arithmetic mean: The arithmetic mean is a uniformly weighted convex combination of a set of objects (e.g., distributions) which minimizes the sum of squared Euclidean distances from objects in turn to itself. However, later sections contain examples of applications in which such a representative is not ideal. As an alternative we describe a representative which minimizes the maximal Kullback-Leibler divergence from itself to the set of objects. The central idea of this work is that such a minimax representation is better than more commonly used representatives (e.g., the arithmetic mean) in some computer vision problems. In exploring this thesis, we present the theoretical properties of this representative and results of experiments using it.

We examine two applications in particular, comparing the minimax representation to the more common arithmetic mean and other representatives. These applications are indexing collections of images (or textures or shapes) and choosing a motion prior under uncertainty for tracking. In the first of these applications, we follow a promising avenue of work in using a probability distribution as the signature of a given object to be indexed. Then using an established data structure, the representative can fuse several signatures into one, thus making searches more efficient. In the tracking application, we consider the case in which one has


some uncertainty as to the motion model that governs the object to be tracked. Specifically, the governing model will be drawn from a known family of models. We suggest using a minimax representative to construct a single prior distribution to describe the expected motion of an object in a way that fuses all of the models in the family and yields one model with best worst-case performance.

1.1 Information Theory

As discussed in Chapter 2, the minimax representative can be characterized in terms of the Kullback-Leibler divergence and the Jensen-Shannon divergence. Hence, a brief review of these concepts is in order.

The Kullback-Leibler (KL) divergence [1] (also known as the relative entropy) between two distributions p and q is defined as

    KL(p, q) = \sum_i p_i \log (p_i / q_i).   (1.1)

It is convex in p, non-negative (though not necessarily finite), and equals zero if and only if p = q. In information theory it has an interpretation in terms of the length of encoded messages from a source which emits symbols according to a probability distribution. While the familiar Shannon entropy gives a lower bound on the average length per symbol a code can achieve, the KL-divergence between p and q gives the penalty in length per symbol incurred by encoding a source with distribution p under the assumption it really has distribution q; this penalty is commonly called redundancy. To illustrate this, consider the Morse code, designed to send messages in English. The Morse code encodes the letter "E" with a single dot and the letter "Q" with a sequence of four dots and dashes. Because "E" is used frequently in English and "Q" seldom, this makes for efficient transmission. However if one wanted to use the Morse code to send messages in Chinese pinyin, which might use "Q" more frequently, he would find the code less efficient. If we assume


contrafactually that the Morse code is optimal for English, this difference in efficiency is the redundancy.

Also playing a role in this work is the Jensen-Shannon (JS) divergence. It is defined between two distributions p and q as

    JS_\lambda(p, q) = \lambda KL(p, \lambda p + (1 - \lambda) q) + (1 - \lambda) KL(q, \lambda p + (1 - \lambda) q),   (1.2)

where \lambda \in (0, 1) is a fixed parameter [2]; we will also consider its straightforward generalization to n distributions.

1.2 Alternative Representatives

Also of interest is the body of work which computes averages of sets of objects using non-Euclidean distances since the representative presented in this work plays a similar role. One example of this appears in computing averages on a manifold of shapes [3, 4] by generalizing the minimization characterization of the arithmetic mean away from the squared Euclidean distance and to the geodesic distance. Linking manifolds on one hand and distributions on the other is the field of information geometry [5]. Using notions from information geometry one can find the mean on a manifold of parameterized distributions by using the geodesic distances derived from the Fisher information metric.

In this work, the representative does not minimize a function of this geodesic distance, but rather the "extrinsic" KL-divergence. Furthermore, the representatives here are restricted to simple manifolds of distributions, namely the family of weighted arithmetic means (i.e., convex combinations) and normalized, weighted geometric means (sometimes referred to as the "exponential family") of the constituent distributions. These are simple yet interesting families of distributions and can accommodate non-parametric representations. These two manifolds are dual in the sense of information geometry [6], and so as one might expect, the representatives have a similar dual relationship. Pelletier forms a barycenter based in the


4 KL-divergenceoneachofthesemanifolds[ 7 ].Thatbarrycenter,inthespiritofthearithmeticmeanandincontrasttotherepresentativeinthiswork,minimizesasumofKL-divergences.Anothermean-like"representativeonthefamilyofGaus-siandistributionsseekstominimizethesumofsquaredJ-divergencesalsoknownassymmetrizedKL[ 8 ].ThekeydierencebetweenmostoftheseapproachesandthisworkisthatthisworkseekstopresentarepresentativewhichminimizesthemaximumofKL-divergencestotheobjectsinaset. 1.3 MinimaxApproachesIngrainedinthecomputervisioncultureisaheightenedawarenessofnoiseindata.Thisisadefensivetraitwhichhasevolvedoverthemillenniatoallowresearcherstosurviveinaharshenvironmentfullofimperfectsensors.Withthisjustiedconcerninmind,onenaturallyasksifminimizingamaximumdistancewillbedoomedbyover-sensitivitytonoise.Aswithmanysuchconcerns,thisdependsontheapplication,andwearguethatanapproachwhichminimizesamax-basedfunctioniswell-suitedtotheapplicationsinthiswork.Buttoshowthatsuchasetofappropriateapplicationsisnon-empty,considerthesuccessfulworkofHuttenlocheretal.asaproofofconcept[ 9 ]:TheyseektominimizetheHausdordistanceHbetweentwopointsetsA;B.Asonecansee,minimizingtheHausdordistance,HA;B=maxhA;B;hB;AhA;B=maxa2Aminb2Bdista;b;minimizesamax-basedfunction.AnotherexampleappearsintheworkofHoetal.fortracking.InHoetal.[ 10 ]ateachframetheyndalinearsubspacewhichminimizesthemaximaldistancetoasetofpreviousobservations.Additionally,HartleyandSchaalitzkyminimizeamax-basedcostfunctiontodogeometricreconstruction[ 11 ].Cognizantthatthemethodissensitivetooutliers,they


5 recommendusingitondatafromwhichtheyhavebeenremoved.Theseexamplesdemonstratethatonecannotdismissaprioriamethodasoverly-sensitivetonoisejustbecauseitminimizesamax-basedmeasure.Closerinspirittotheapproachinthisworkisworkusingtheminimaxen-tropyprinciple.Zhuetal.[ 12 ]andLiuetal.[ 13 ]usethisprincipletolearnmodelsoftexturesandfaces.Therstpartofthisprinciplethewell-knownmaximumentropyorminimumprejudiceprincipleseekstolearnadistributionwhich,giventheconstraintthatsomeofitsstatisticstsamplestatistics,maximizesentropy.Csiszarhasshownthatthisprocessisrelatedtoanentropyprojection[ 14 ].Therationaleforselectingadistributionwithmaximalentropyisthatitisrandomorsimpleexceptforthepartsexplainedbythedata.Coupledwiththisistheminimumentropyprinciplewhichfurtherseekstoimposedelityonthemodel.Tosatisfythisprinciple,Zhuetal.chooseasetofstatisticsconstraintstowhichamaximalentropymodelmustadheresuchthatitsentropyisminimized.Byminimizingtheentropy,theyshowthatheminimizestheKL-divergencebetweenthetruedistributionandthelearnedmodel.ForasetSofconstraintsonstatistics,theysummarizetheapproachasS=argminSmaxp2Sentropyp;.3whereSisthesetofallprobabilitydistributionswhichsatisfytheconstraintsinS.ThenwiththeoptimalS,oneneedonlyndthep2Swithmaximalentropy. 1.4 OutlineofRemainderInthenextchapters,werigorouslydenetheminimaxrepresentativesandpresenttheoreticalresultsontheirproperties.Wealsoconnectoneofthesetoitsalternativeandbetter-knownidentityintheinformationtheoryresultsforchannelcapacity.Thereafterfollowtwochaptersonusingtherepresentativesto


index databases for the sake of making retrieval more efficient. In Chapter 3 we present a texture retrieval experiment; in this experiment the accuracy of the retrieval is determined by the Jensen-Shannon divergence, the square root of which is a true metric. Later in Chapter 4 we present an experiment in shape retrieval where the dissimilarity measure is the KL-divergence. We propose a search method which happens to work in the particular case of this dataset, but in general has no guarantee that it will not degrade in accuracy. The second area in which we demonstrate the utility of a minimax representative is tracking. In Chapter 5 we present experiments in which the representative stands in for an unknown motion model. Lastly we end with some concluding points and thoughts for future work.


CHAPTER 2
THEORETICAL FOUNDATION

In this chapter we define the minimax representatives and enumerate their properties. First, we present a casual, motivating treatment in Euclidean space to give intuition to the idea. Then come the section's central results, the ITC for the mixture family and the ITC on the dual manifold, the exponential family; after defining them, we present several illustrations to lend intuition to their behavior; and then finally we show their well-established interpretation in terms of channel capacity.

2.1 Preliminary - Euclidean Information Center

Let S = {f_1, ..., f_n} be a set of n points in the Euclidean space R^m. Throughout our development we will consider the maximum dispersion of the members of a set S about a point f,

    D(f; S) = \max_i |f - f_i|^2.   (2.1)

We will look for the point f_c that minimizes D(f; S) and call it the center.

Definition 1. The center of S is defined by

    f_c(S) = \arg\min_f D(f; S).   (2.2)

The following properties are easily proved.

Theorem 1. The center of S is unique and is given by a convex combination of the elements of S,

    f_c = \sum_{i=1}^n p_i f_i,   (2.3)

where 0 ≤ p_i ≤ 1 and \sum p_i = 1. We call a point f_i for which p_i > 0 a support.


Theorem 2. Let f_c be the center of S. Then, for F = {i; p_i > 0}, the set of indices called the supports, we have an equi-distance property,

    |f_i - f_c|^2 = r^2, if i \in F,   (2.4)
    |f_i - f_c|^2 ≤ r^2, otherwise,   (2.5)

where r^2 is the square of the radius of the sphere coinciding with the supports and centered at the center.

Now that we have characterized the center first by its minimax property (Definition 1) and then its equi-distance property (Theorem 2), we can give yet another characterization, this one useful computationally. Define the simplex of probability distributions on n symbols,

    \Delta = { p = (p_1, ..., p_n); 0 ≤ p_i, \sum p_i = 1 }.

We define a function of p on \Delta by

    D_E(p; S) = -\frac{1}{2} \| \sum p_i f_i \|^2 + \frac{1}{2} \sum p_i \| f_i \|^2,   (2.6)

and now use it to find the center.

Theorem 3. D_E(p; S) is strictly concave in p, and the center of S is given by

    f_c = \sum \tilde{p}_i f_i,   (2.7)

where

    \tilde{p} = \arg\max_p D_E(p; S)   (2.8)

is the unique maximal point of D_E(p; S).

An example of the center is given in Fig. 2-1. In general, the support set consists of a relatively sparse subset of points. The points that are most "extraordinary" are given high weights.
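To make Theorem 3 concrete, the following sketch (ours, not part of the original text; the name euclidean_center is illustrative) approximates the Euclidean minimax center with the classical farthest-point iteration for the minimum enclosing ball, repeatedly stepping toward the point that attains the maximum in equation 2.1.

    import numpy as np

    def euclidean_center(points, iters=2000):
        """Approximate the minimax center of a finite point set in R^m.

        Each step moves the current estimate a shrinking fraction of the way
        toward the farthest point, i.e., the point attaining the maximum in
        D(f; S) = max_i |f - f_i|^2.
        """
        f = points.mean(axis=0)                  # any starting guess
        for t in range(1, iters + 1):
            far = points[np.argmax(np.linalg.norm(points - f, axis=1))]
            f = f + (far - f) / (t + 1)          # 1/(t+1) step size
        return f

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        pts = rng.normal(size=(50, 2))
        c = euclidean_center(pts)
        print("radius r =", np.linalg.norm(pts - c, axis=1).max())

At convergence, the points whose distance to the returned center is numerically equal to the printed radius are the supports of Theorem 2.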


Figure 2-1: Center is denoted by and supports are denoted by.


2.2 Mixture Information Center of Probability Distributions

We now move to the heart of the results. While the following derivations use densities, they could just as easily work in the discrete case. Let f(x) be a probability density. Given a set S of n linearly independent densities f_1, ..., f_n, we consider the space M consisting of their mixtures,

    M = { \sum p_i f_i ; 0 ≤ p_i, \sum p_i = 1 }.   (2.9)

The dimension m of M satisfies m ≤ n - 1. M is a dually flat manifold equipped with a Riemannian metric and a pair of dual affine connections [5]. The KL divergence from f_1(x) to f_2(x) is defined by

    KL(f_1, f_2) = \int f_1(x) \log \frac{f_1(x)}{f_2(x)} dx,   (2.10)

which plays the role of the square of the distance in Euclidean space. The KL-divergence from S to f is defined by

    KL(S, f) = \max_i KL(f_i, f).   (2.11)

Let us define the mixture information center (m-ITC) of S.

Definition 2. The m-center of S is defined by

    f_m(S) = \arg\min_f KL(S, f).   (2.12)

In order to analyze properties of the m-center, we define the m-sphere of radius r centered at f by the set

    S_m(f, r) = { f' ; KL(f', f) ≤ r^2, f' \in M }.   (2.13)


In order to obtain an analytical solution of the m-center, we remark that the negative entropy

    \varphi(f) = \int f(x) \log f(x) dx   (2.14)

is a strictly convex function of f \in M. This is the dual potential function in M whose second derivatives give the Fisher information matrix [5]. Given a point f in M,

    f = \sum p_i f_i,  \sum p_i = 1,  p_i ≥ 0,   (2.15)

and using \varphi above, we write the Jensen-Shannon (JS) divergence [2] of f with respect to S as

    D_m(f; S) = -\varphi(f) + \sum p_i \varphi(f_i).   (2.16)

When we regard D_m(f; S) as a function of p = (p_1, ..., p_n)^T \in \Delta, we denote it by D_m(p; S). It is easy to see that

    D_m(f; S) is strictly concave in p \in \Delta,   (2.17)
    D_m(f; S) is invariant under permutation of {f_i},   (2.18)
    D_m(f; S) = 0 if p is an extreme point of the simplex \Delta.   (2.19)

Hence from equations 2.17 and 2.19, D_m(f; S) has a unique maximum in \Delta. In terms of the KL-divergence, we have

    D_m(p; S) = \sum p_i KL(f_i, f).   (2.20)

Theorem 4. The m-center of S is given by the unique maximizer of D_m(f; S), that is,

    f_m = \sum \tilde{p}_i f_i,   (2.21)
    \tilde{p} = \arg\max_p D_m(p; S).   (2.22)


Moreover, for supports for which \tilde{p}_i > 0, KL(f_i, f_m) = r^2, and for non-supports for which \tilde{p}_i = 0, KL(f_i, f_m) ≤ r^2, where r^2 = \max_i KL(f_i, f_m).

By using the Lagrange multiplier \lambda for the constraint \sum p_i = 1, we calculate

    \frac{\partial}{\partial p_i} { D_m(p; S) - \lambda (\sum p_i - 1) } = 0.   (2.23)

From the definition of D_m and because of

    \frac{\partial}{\partial p_i} \varphi(f) = \frac{\partial}{\partial p_i} \int (\sum p_j f_j) \log (\sum p_k f_k) dx = \int f_i (\log f + 1),   (2.24)

equation 2.23 is rewritten as

    -( \int f_i \log f + 1 ) + \varphi(f_i) - \lambda = 0   (2.25)

when p_i > 0, and then as

    KL(f_i, f) = \lambda + 1.   (2.26)

However, because of the constraints p_j ≥ 0, this is not necessarily satisfied for some f_j for which p_j = 0. Hence, at the extreme point f_m(S), we have

    KL(f_i, f) = r^2   (2.27)

for supports f_i (p_i > 0), but for non-supports (p_i = 0),

    KL(f_i, f) ≤ r^2.   (2.28)

This distinction occurs because \tilde{p} maximizes \sum p_i KL(f_i, f), and if we assume for a non-support density f_i that KL(f_i, f) > r^2, we arrive at a contradiction. We remark that numerical optimization of D_m can be accomplished efficiently (e.g., with a Newton scheme [15]), because it is concave in p.
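Because D_m is exactly the mutual-information objective that reappears in Section 2.5.3, its maximization over the simplex can also be carried out with the classical Blahut-Arimoto fixed-point iteration rather than the Newton scheme cited above. The sketch below is our own minimal illustration of that alternative for discrete distributions with strictly positive entries; the helper name m_itc_weights is illustrative.

    import numpy as np

    def m_itc_weights(F, iters=500):
        """Weights p of the m-ITC of the rows of F (discrete densities).

        Blahut-Arimoto update: multiply each weight by exp(KL(f_i || mixture))
        and renormalize.  At the fixed point the supports satisfy the
        equi-divergence property KL(f_i, f_m) = r^2 of Theorem 4.
        """
        n = F.shape[0]
        p = np.full(n, 1.0 / n)
        for _ in range(iters):
            mix = p @ F                                  # candidate center
            kl = np.sum(F * np.log(F / mix), axis=1)     # KL(f_i || mix)
            p *= np.exp(kl)
            p /= p.sum()
        return p, p @ F

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        F = rng.dirichlet(np.ones(8), size=5)            # five random histograms
        p, center = m_itc_weights(F)
        print("weights:", np.round(p, 3))
        print("KL-radius r^2 ~", np.max(np.sum(F * np.log(F / center), axis=1)))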


2.2.1 Global Optimality

Lastly, it should be noted that while it appears that the results so far suggest that the m-ITC is the minimax representative over merely the simplex of mixtures, a result by Merhav and Feder [16] based on work by Csiszar [17] shows that the m-ITC is indeed the global minimax representative over all probability distributions. To argue so, consider a distribution g which a skeptic puts forward as the minimizer of equation 2.12. Next, we consider the I-projection of g onto the set of mixtures M defined as

    \tilde{g} = \arg\min_{f \in M} KL(f, g).   (2.29)

And from the properties of I-projections [17], we know for all f \in M that

    KL(f, g) ≥ KL(f, \tilde{g}) + KL(\tilde{g}, g) ≥ KL(f, \tilde{g})   (2.30)

since the last term on the right-hand side is positive. This tells us that the mixture \tilde{g} performs at least as well as g; so we may restrict our minimization to the set of mixtures, knowing we will find the global minimizer.

2.3 Exponential Center of Probability Distributions

Given n densities f_i(x), f_i(x) > 0, we define the exponential family of densities E,

    E = { f ; f(x) = \exp( \sum p_i \log f_i(x) - \psi(p) ), 0 ≤ p_i, \sum p_i = 1 },   (2.31)

instead of the mixture family M. The dimension m of E satisfies m ≤ n - 1. E is also a dually flat Riemannian manifold. E and M are "dual" [5], and we can establish similar structures in E. The potential function \psi(p) is convex, and is given by

    \psi(p) = \log \int \exp( \sum p_i \log f_i(x) ) dx.   (2.32)
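For discrete distributions a member of E and the potential of equation 2.32 can be formed directly; the small sketch below (ours, with illustrative names) exponentiates the weighted log-densities and normalizes.

    import numpy as np

    def exp_family_member(F, p):
        """Normalized weighted geometric mean of the rows of F (equation 2.31)."""
        log_g = p @ np.log(F)                  # sum_i p_i log f_i(x), per bin x
        psi = np.log(np.sum(np.exp(log_g)))    # psi(p), equation 2.32 (discrete case)
        return np.exp(log_g - psi), psi

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        F = rng.dirichlet(np.ones(6), size=3)
        f, psi = exp_family_member(F, np.array([0.5, 0.3, 0.2]))
        print("sums to one:", np.isclose(f.sum(), 1.0), " psi(p) =", psi)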


14 Thepotentialpisthecumulantgeneratingfunction,andisconnectedwiththenegativeentropy'bytheLegendretransformation.Itiscalledthefreeenergyinstatisticalphysics.ItssecondderivativesgivetheFisherinformationmatrixinthiscoordinatesystem.Ane-sphereexponentialsphereinEcenteredatfisthesetofpointsSef;r=f0KLf0;fr2;f02E;.33whereristhee-radius. Denition3 Thee-centerofSisdenedbyfeS=argminfKLf;S;.34whereKLf;S=maxiKLf;fi:.35WefurtherdenetheJS-likedivergenceinE,Dep;S=)]TJ/F22 11.955 Tf 9.298 0 Td[(p+Xpifi:.36Becauseoffi=logZexplogfidx=0;.37wehaveDep;S=)]TJ/F22 11.955 Tf 9.298 0 Td[(p:.38ThefunctionDe=)]TJ/F22 11.955 Tf 9.298 0 Td[(isstrictlyconcave,andhasauniquemaximum.Itisaninterestingexercisetoshowforf=expfPpilogfi)]TJ/F22 11.955 Tf 11.955 0 Td[(pg,)]TJ/F22 11.955 Tf 9.298 0 Td[(p=XpiKLf;fi:.39AnalogoustothecaseofM,wecanprovethefollowing.


15 Theorem5 Thee-centerofSisuniqueandgivenbythemaximizerofDep;ffig,asfeS=expnX~pilogfi)]TJ/F22 11.955 Tf 11.956 0 Td[(~po; .40 ~p=argmaxDep;S: .41 Moreover,KLfe;fi=r2;forsupportingfipi6=0;.42andKLfe;fir2;fornon-supportingfipi=0;.43wherer2=maxiKLfe;fi:.44Wecalculatethederivativeof)]TJ/F22 11.955 Tf 9.299 0 Td[(p)]TJ/F22 11.955 Tf 11.955 0 Td[(Ppi)]TJ/F15 11.955 Tf 11.955 0 Td[(1,andput@ @pin)]TJ/F22 11.955 Tf 9.299 0 Td[(p)]TJ/F22 11.955 Tf 11.956 0 Td[(Xpi)]TJ/F15 11.955 Tf 11.955 0 Td[(1o=0:.45Forf=expnXpilogfix)]TJ/F22 11.955 Tf 11.956 0 Td[(po;.46wehave@ @pip=@ @pilogZexpnXpilogfixodx=Zfxlogfixdx: .47 Hence, 2.45 becomesZfxlogfixdx==const2.48andhenceKLf;fi=const2.49


Figure 2-2: The ensemble of 50 Gaussians with means evenly spaced on the interval [-30, 30] and σ = 5

for f_i with p_i > 0. For p_i = 0, KL(f, f_i) is not larger than r^2, because of (2.39).

2.4 Illustrations and Intuition

In this section we convey some intuition regarding the behavior of the ITCs. Particularly, we show examples in which the ITCs choose a relatively sparse subset of densities to be support densities (assigning zero as weights to the other densities in the set); also we find that the ITC tends to assign disproportionately high weights to members of the set that are most "extraordinary" (i.e., those that are most distinct from the rest of the set).
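As a usage illustration (not code from the dissertation), the snippet below discretizes the ensemble of Figure 2-2 on a grid and feeds it to the m_itc_weights helper sketched in Section 2.2; the grid resolution and the reuse of that helper are our own assumptions.

    import numpy as np

    # 50 Gaussians with means evenly spaced on [-30, 30] and sigma = 5,
    # discretized so that each row becomes a histogram-like density.
    x = np.linspace(-60.0, 60.0, 1201)
    means = np.linspace(-30.0, 30.0, 50)
    F = np.exp(-(x[None, :] - means[:, None]) ** 2 / (2.0 * 5.0 ** 2))
    F /= F.sum(axis=1, keepdims=True)

    p, center = m_itc_weights(F)                   # helper from the earlier sketch
    print("supports:", np.flatnonzero(p > 1e-3))   # expect a sparse, boundary-heavy set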


Figure 2-3: The components from Fig. 2-2 scaled by their weights in the m-ITC.


Figure 2-4: The m-ITC (solid) and arithmetic mean (AM, dashed) of the ensemble of Gaussians shown in Fig. 2-2


19 2.4.1 GaussiansFirst,weexamineasyntheticexampleinwhichweanalyticallyspecifyasetofdensitiesandthennumericallycomputethem-ITCofthatset.Weconstructasetof50one-dimensionalGaussiandistributionswithmeansevenlyspacedontheinterval[)]TJ/F15 11.955 Tf 9.298 0 Td[(30;30]andwith=5.Fig. 2{2 showsallofthedensitiesinthisset.Whenwecomputethem-ITC,weseebothofthepropertiesmentionedabove:Outofthe50densitiesspecied,onlyeightbecomesupportdensitiessparsity.Andadditionally,thedensitieswhichreceivethelargestweightinthem-ITCareouter-mostdensitieswithmeansat-30and30highlightingextraordinary"elements.Fig. 2{3 showstheeightsupportdensitiesscaledaccordingtotheweightwhichthem-ITCassignsthem.Forthesakeofcomparison,Fig. 2{4 showsthem-ITCcomparedwiththearithmeticmean. 2.4.2 NormalizedGray-levelHistogramsNextweconsideranexamplewithdistributionsarisingfromimagedata.Fig. 2{5 showseightimagesfromtheYaleFaceDatabase.Therstsevenimagesareofthesamepersonunderdierentfacialexpressionswhilethelasteighthimageisofadierentperson.Weconsidereachimage'snormalizedhistogramofgray-levelintensitiesasadistributionandshowtheminFig. 2{6 .Whenwetakethem-ITCofthesedistributions,weagainnoticethesparsityandthefavortismoftheboundaryelements.ThenumbersaboveeachfaceinFig. 2{5 aretheweightsthatthedistributionarisingfromthatfacereceivedinthem-ITC.Again,wendthatonlythreeoutoftheeightdistributionsaresupportdistributions.Inthepreviousexample,wecouldconcretelyseehowthem-ITCfavoreddensitiesontheontheboundary"ofthesetbecausethedensitieshadacleargeometricrelationshipamongthemselveswiththeirmeansarrayedonaline.Inthisexamplethenotionofaboundaryisnotquitesoobvious;yet,wethinkthat


Figure 2-5: Seven faces of one person under different expressions with an eighth face from someone else. Above each image is the weight p_i which the ITC assigns to the distribution arising from that image.


Figure 2-6: The normalized gray-level histograms of the faces from Fig. 2-5. Above each distribution is the KL-divergence from that distribution to the m-ITC. Parentheses indicate that the value is equal to the KL-radius of the set. Note that as predicted by theory, the distributions which have maximum KL-divergence are the very ones which received non-zero weights in the m-ITC.


Figure 2-7: In the left column, we can see that the arithmetic mean (solid, lower left) resembles the distribution arising from the first face more closely than the m-ITC (solid, upper left) does. In the right column, we see the opposite: The m-ITC (upper right) more closely resembles the eighth distribution than does the arithmetic mean (lower right).


23 ifoneexaminestheeighthimageandtheeighthdistribution,onecanqualitativelyagreethatitisthemostextraordinary."ReturningbrieytoFig. 2{6 weexaminetheKL-divergencesbetweeneachdistributionandthem-ITC.Wereportthesevaluesaboveeachdistributionandindicatewithparenthesesthemaximalvalues.SincethethreedistributionswithmaximalKL-divergencetotheITCareexactlythethreesupportdistributions,thisexampleunsurprisinglycomplieswithTheorem 4 .Finally,weagaincomparethem-ITCandthearithmeticmeaninFig. 2{7 ,butthistimeweoverlayinturntherstandeighthdistributions.Examiningtheleftcolumnofthegure,weseethatthearithmeticmeanmorecloselyresemblestherstdistributionthandoesthem-ITC.TheKL-divergencesbearthisoutwiththeKL-divergencefromtherstdistributiontothearithmeticmeanbeing0.0461whichcomparesto0.1827tothem-ITC.Conversely,whenwedoasimilarcom-parisonintherightcolumnofFig. 2{7 fortheeighthextraordinarydistribution,wendthattheITMmostresemblesit.Again,theKL-divergencesquantifythisobservation:WhereastheKL-divergencefromtheeighthdistributiontothem-ITCisagain0.1827,wendtheKL-divergencetothearithmeticmeantobe0.5524.ThisresultfallsinlinewithwhatwewouldexpectfromTheorem 4 andsuggestsamorerenedbitofintuition:Them-ITCbetterrepresentsextraordinarydistribu-tions,butsometimesattheexpenseofthemorecommon-lookingdistributions.Yetoverall,thattrade-oyieldsaminimizedmaximumKL-divergence. 2.4.3 Thee-CenterofGaussiansWeconsideraverysimpleexampleconsistingofmultivariateGaussiandistributionswithunitcovariancematrix,fix=cexp)]TJ/F15 11.955 Tf 10.494 8.088 Td[(1 2jx)]TJ/F36 11.955 Tf 11.955 0 Td[(ij2;.50


24 wherex,i2Rm.WehaveXpilogfi=)]TJ/F15 11.955 Tf 10.494 8.088 Td[(1 2x)]TJ/F28 11.955 Tf 11.955 11.357 Td[(Xpii2+1 2Xpii2)]TJ/F15 11.955 Tf 10.494 8.088 Td[(1 2Xpikik2+logc .51 p=)]TJ/F15 11.955 Tf 10.494 8.088 Td[(1 2Xpikik2)]TJ/F28 11.955 Tf 11.955 13.749 Td[(Xpii2: .52 Hence,EtooconsistsofGaussiandistributions.ThiscaseisspecialbecauseEisnotonlyduallyatbutalsoEuclideanwheretheFisherinformationmatrixistheidentitymatrix.TheKL-divergenceisKL[fj;fi]=1 2j)]TJ/F36 11.955 Tf 11.955 0 Td[(i2;.53givenbyahalfofthesquaredEuclideandistance.Hence,thee-centerinEofffigisthesameasthecenterofthepointsfigintheEuclideanspace.Whenm=1,thatisxisunivariate,iisasetofpointsontherealline.When1<2<

Figure 2-8: Eight images of faces which yield normalized gray level histograms. We choose an extraordinary distribution for number eight to contrast how the representative captures variation within a class. The number above each face weighs the corresponding distribution in the e-ITC.


Figure 2-9: KL(C; P_i) for each distribution, for C equal to the e-ITC and geometric mean, respectively. The horizontal bar represents the value of the e-radius.


27 previoussection.BynowexaminingFig. 2{9 ,wecanseethatKLC;Piisequaltothee-radiusindicatedbythehorizontalbarforthethreesupportdistributionsi=1;7;8andislessfortheothers.Thisillustratestheequi-divergencepropertystatedpreviously.InFigs. 2{8 and 2{9 ,theworst-caseKL-divergencefromthegeometricmeanis2.5timeslargerthantheworst-casefromthee-ITC.Ofcourse,thisbetterworst-caseperformancecomesatthepriceofthee-ITC'slargerdistancetotheothersevendistributions;butitisourthesisthatinsomeapplicationsweareeagertomakethistrade. 2.5 PreviousWork 2.5.1 Jensen-ShannonDivergenceFirstintroducedbyLin[ 2 ],theJensen-Shannondivergencehasappearedinseveralcontexts.Sibsonreferredtoitastheinformationradius[ 18 ].Whilehismonikeristantalizinglysimilartothepreviouslymentionednotionsofm-/e-radii,hestrictlyusesittorefertothedivergence,nottheoptimalvalue.Jardinandhelateruseditasadiscriminatorymeasureforclassication[ 19 ].OthershavealsousedtheJS-divergenceanditsvariationsasadissimilaritymeasure|forimageregistrationandretrievalapplications[ 20 21 ],andintheretrievalexperimentofChapter 3 ,thisworkwillfollowsuit.ThatexperimentalsomakesuseoftheimportantfactthatthesquarerootoftheJS-divergenceinthecasewhenitsparameterisxedto1 2isametric[ 22 ].TopseprovidesanothertakeontheJS-divergence,callingascaled,specialcaseofitthecapacitorydiscrimination[ 23 ].Thisnamehintsatthenext,perhapsmostimportantinterpretationoftheJS-divergence,namelythatasameasurementofmutualinformation.Thisalternativeidentityiswidelyknown.Cf.BurbeaandRao[ 24 ]asjustoneexemple.Andthisunderstandingcanhelpilluminatethecontextofthem-centerwithininformationtheory.


28 2.5.2 MutualInformationButtheobviousquestionarises|themutualinformationbetweenwhat?First,themutualinformationbetweentworandomvariablesXandYwithdistributionsPXandPYisdenedasMIX;Y=HY)]TJ/F22 11.955 Tf 11.956 0 Td[(HYjX;.54whereHY=)]TJ/F28 11.955 Tf 11.291 8.966 Td[(PPYlogPYisShannon'sentropyandtheconditionalentropyisHYjX=)]TJ/F28 11.955 Tf 11.291 11.357 Td[(XxXyPXYx;ylogPYjXyjx:.55Toseetheconnectiontothem-centerandJS-divergence,LetusrstidentifytherandomvariableXaboveasarandomindexintoasetofrandomvariablesfY1;;YngwithprobabilitydistributionsfP1;;Png.NotethatXmerelytakesonanumberfromoneton.IfwenowconsiderYasarandomvariablewhosevalueyresultsfromrstsamplingX=iandthensamplingYi=y,wecancertainlybegintoappreciatethattheseconcoctedrandomvariablesXandYaredependent.Thatis,learningthevaluethatXtakeswillletyouguessmoreaccuratelywhatvalueYwilltake.Conversely,learningthevalueofYhintsatwhichdistributionYithatvaluewasdrawnfromandhencethevaluethatXtook.Returningtoequation 2.54 andpayingparticularattentiontothedenitionoftheconditionalentropyterm,wecanuseourdenitionsofXandYtogetanexpressionfortheirjointdistribution,PXYi;y=PXiPyji:.56ThenbyobservingthatPyji=Piy,pluggingthisintoequation 2.54 ,andpullingPXioutsideofthesummationswithrespecttoy,wehaveMIX;Y=HY)]TJ/F28 11.955 Tf 11.955 11.358 Td[(XiPXiHPi:.57


And this is precisely the Jensen-Shannon divergence JS(P_1, ..., P_n) with coefficients P_X(1), ..., P_X(n). So when we evaluate the JS-divergence with a particular choice of those coefficients, we directly specify a distribution for the random variable X, indirectly specify a distribution (a mixture) for the random variable Y, and evaluate the mutual information between them. And when we maximize the JS-divergence with respect to those same coefficients, we specify the distributions X and Y which have maximum mutual information.

There are several equivalent definitions for mutual information, and by considering a different one we can gain insight into the selection of support distributions in the m-center. The mirror image of equation 2.54 gives us

    MI(X; Y) = H(X) - H(X | Y).   (2.58)

Starting from here and for convenience letting f_y(i) = P(i | y), we can derive that

    MI(X; Y) = H(X) - E_Y[ H(f_y(i)) ].   (2.59)

f_y(i) describes the contributions at a value y of each of the distributions which make up the mixture. By maximizing MI(X; Y) we minimize the expected value over y of the entropy of this distribution. This means that on the average we encourage as few as possible contributors of probability to a location; this is particularly the case at locations y with high marginal probability. Acting as a regularization term of sorts, to prevent the process from assigning all the probability to one component (thus driving the right-hand term to zero), is the first term, which encourages uniform weights. Maximizing the whole expression means that we balance the impulses for few contributors and for uniform weights.
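The identity described above is easy to confirm numerically; the following check (ours) computes the JS-divergence directly and the mutual information from the joint distribution P_XY(i, y) = P_X(i) P_i(y), and verifies that the two agree.

    import numpy as np

    def entropy(d):
        d = d[d > 0]
        return -np.sum(d * np.log(d))

    rng = np.random.default_rng(3)
    P = rng.dirichlet(np.ones(4), size=3)    # three distributions P_i over 4 symbols
    pX = np.array([0.2, 0.5, 0.3])           # coefficients, i.e., P_X

    js = entropy(pX @ P) - np.sum(pX * np.array([entropy(row) for row in P]))

    joint = pX[:, None] * P                  # P_XY(i, y)
    pY = joint.sum(axis=0)
    mi = np.sum(joint * np.log(joint / (pX[:, None] * pY[None, :])))

    print(np.isclose(js, mi))                # True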


30 Inthiscase,oneinterpretstheensembleofdistributionsfP1;;PngmakinguptheconditionalprobabilityPyji=Piyasthechannel.Thatis,givenanin-putsymboli,thechannelwilltransmitanoutputsymbolywithprobabilityPiy.Givensuchachanneldiscreteandmemorylesssinceeachsymbolisindependentofthesymbolsbeforeandafter,onewillachievedierentratesoftransmissionde-pendingonthedistributionoverthesourcesymbolsPX.TheseratesarepreciselythemutualinformationbetweenXandY,andtondthemaximumcapacityofthechannel,onepicksthePXyieldingamaximumvalue.Inthiscontext,manyoftheresultsfromthissectionhavebeendevelopedandpresentedquiteclearlyintextsbyGallager[ 25 ,Section4.2]andCsiszar[ 26 ,Section2.3].Relatedalsoistheeldofuniversalcoding.Asreviewedearlier,choosinganappropriatecodedependsuponthecharacteristicsofthesourcewhichwillbeencodede.g.,MorsecodeforEnglish.Universalcodingconcernsitselfwiththeproblemofselectingacodesuitableforanysourceoutofafamily|i.e.,acodewhichwillhavecertainuniversalperformancecharacteristicsacrossthefamily.Althoughthisworkinitiallydevelopedinignoranceofthateld,thekeyresultinthiseldwhichthisworktouchesuponistheRedundancy-CapacityTheorem.Thistheorem|independentlyproveninseveralplaces[ 27 28 ]andlaterstrengthened[ 16 ]|statesthatthebestcodeforwhichonecanhopewillhaveredundancyequaltothecapacityormaximumtransmissionrate[ 23 ]ofacertainchannelwhichtakesinputfromtheparametersofthefamilyandoutputassymbolstobeencoded. 2.5.4 BoostingSurprisingly,wealsondaconnectionintheonlinelearningliteraturewhereintheAdaBoost[ 29 ]learningalgorithmisrecastasentropyprojection[ 30 ].Thisangle,alongwithitsextentionstogeneralBregmandivergencesisaninterestingavenueforfutuework.


CHAPTER3RETRIEVALWITHm-CENTER 3.1 IntroductionTherearetwokeycomponenetsinmostretrievalsystems|thesignaturestoredforanobjectandthedissimilaritymeasureusedtondaclosestmatch.Thesetwochoicescompletelydetermineaccuracyinanearestneighbor-typesystemifoneiswillingtoendureapotentiallyexhaustivesearchofthedatabase.However,becausedatabasescanbequitelarge,comparingeachqueryimageagainsteveryelementinthedatabaseisobviouslyundesirable.Inthischapterandthenextwefocusonspeedingupretrievalwithoutcom-promisingaccuracyinthecaseinwhichtheobjectandquerysignaturesareprobabilitydistributions.Inavarietyofdomains,researchershaveachievedimpres-siveresultsutilizingsuchsignaturesinconjunctionwithavarietyofappropriatedissimilaritymeasures.Thisincludesworkinshape-basedretrieval[ 31 ],imageretrieval[ 32 33 ],andtextureretrieval[ 34 35 36 37 ].ThelastofwhichwereviewingreatdetailinSection 3.2 .Foranysuchretrievalsystemwhichusesadistri-butionandametric,wepresentawaytospeedupquerieswhileguaranteeingnodropinaccuracybyrepresentingaclusterofdistributionswithanoptimallycloserepresentativewhichwecallthem-ITC.Tomesofresearchhaveconcentratedonspeedingupnearestneighborsearchesinnon-Euclideanmetricspaces.Webuildonthisworkbyreningittobettersuitthecaseinwhichthemetricisonprobabilitydistributions.Inlow-dimensionalEuclideanspaces,thefamiliark-dtreeandR*-treecanindexpointsetshandily.Butinspaceswithanon-Euclideanmetric,onemustresorttoothertechniques.Theseincludeballtrees[ 38 ],vantagepointtrees[ 39 ],andmetrictrees[ 40 41 ]. 31


32 Oursystemutilizesametrictree,butourmaincontributionispickingasingleobjecttorepresentaset.Pickingasingleobjecttodescribeasetofobjectsisoneofthemostcommonwaystocondensealargeamountofdata.Themostobviouswaytoaccomplishthiswhenpossibleistocomputethearithmeticmeanorthecentroid.Tocontrastwiththepropertiesofourchoicethem-ITCwepointoutthatthearithmeticmeanminimizesthesumofsquareddistancesfromittotheelementsofitsset.Inthecaseinwhichthedataareknownorforcedtolieonamanifold,itisusefultopickanintrinsicmeanwhichhasthemean'sminimizationpropertybutwhichalsoliesonamanifoldcontainingthedata.Thishasbeenexploredformanifoldsofshape[ 4 3 ]andofparameterizedprobabilitydensities[ 5 ]. Metrictrees.Toseehowourrepresentativetsintothemetrictreeframework,abriefreviewofmetrictreeshelps.Givenasetofpointsandametricwhichbydenitionsatisespositive-deniteness,symmetry,andthetriangleinequality,aleafofametrictreeindexessomesubsetofpointsfpigandcontaintwoelds|acentercandaradiusr|whichsatisfydc;pir,forallpiwheredisthemetric.Proceedinghierarchically,aninteriornodeindexesallofthepointsindexedbyitschildrenandalsoensuresthatitseldssatisfytheconstraintabove.Henceusingthetriangleinequality,foranysubtree,onecanndalowerboundonthedistancefromaquerypointqtotheentiresetofpointscontainedinthatsubtree|dq;pidq;c)]TJ/F22 11.955 Tf 11.956 0 Td[(r;.1asillustratedinFig. 3{1 .Andduringanearestneighborsearch,onecanrecursivelyusethislowerboundcostingonlyonecomparisontopruneoutsubtreeswhichcontainpointstoodistantfromthequery. Importanceofpickingacenter.Returningtothechoiceofthecenter,weseethatiftheradiusrinequation 3.1 islarge,thenthelowerboundisnotverytightandsubsequently,pruningwillnotbeveryecient.Ontheotherhand,acenter


33 Figure3{1:Usingthetriangleinequalitytoprune whichyieldsasmallradiuswilllikelyleadtomorepruningandecientsearches.IfwereexamineFig. 3{1 ,weseethatwecandecreaserbymovingthecentertowardthepointpiinthegure.Thisispreciselyhowthem-ITCbehaves.Weclaimthatthem-ITCyieldstighterclustersthanthecommonlyusedcentroidandthebestmedoid,allowingmorepruningandbettereciencybecauseitrespectsthenaturalmetricsusedforprobabilitydistributions.WeshowtheoreticallythatundertheKL-divergence,them-ITCuniquelyyieldsthesmallestradiusforasetofpoints.Andwedemonstratethatitalsoperformswellunderothermetrics.Inthefollowingsectionswerstdeneourm-ITCandenumerateitsprop-erties.TheninSection 3.2 wereviewapre-existingtextureclassicationsystem[ 37 ]whichutilizesprobabilitydistributionsassignaturesanddiscusshowwebuildoureciencyexperimentatopthissystem.TheninSection 3.3 wepresentresultsshowingthatthem-ITC,whenincorporatedinametrictree,canimprovequeryeciency;alsowecomparehowmuchimprovementthem-ITCachievesrelativeto


34 otherrepresentativeswheneachisplacedinametrictree.Lastly,wesummarizeourcontributions. 3.2 ExperimentalDesignThecentralclaimofthischapteristhatthem-ITCtightlyrepresentsasetofdistributionsundertypicalmetrics;andthus,thatthisbetterrepresentationallowsformoreecientretrieval.Withthatgoalinmind,wecomparethem-ITCagainstthecentroidandthebestmedoid|twocommonlyusedrepresentatives|inordertondwhichschemeretrievesthenearestneighborwiththefewestcomparisons.Thecentroidisthecommonarithmeticmean,andthebestmedoidofasetfpiginthiscaseisthethedistributionpsatisyingp=argminp02fpigmaxjKLpj;p0:Noticethatwerestrictourselvestothepre-existingsetofdistributionsinsteadofndingthebestconvexcombinationwhichisthem-ITC.Intheexperiment,eachrepresentativewillserveasthecenterofthenodesofametrictreeonthedatabase.Wecontrolforthetopologyofthemetrictree,keepingitunchangedforeachrepresentative.Sincetheexperimentwillexamineeciencyofretrieval,notaccuracy,wewillchoosethesamesignatureasVarmaandZisserman[ 37 ],andthisalongwiththemetricwillcompletelydetermineaccuracy,allowingustoexclusivelyexamineeciency. 3.2.1 ReviewoftheTextureRetrievalSystemThetextureretrievalsystemofVarmaandZisserman[ 37 ][ 42 ]buildsonanearliersystem[ 43 ].Itusesaprobabilitydistributionasasignaturetorepresenteachimage.Aqueryimageismatchedtoadatabaseimagebynearestneighborsearchbasedonthe2dissimilaritybetweendistributions.Althoughthesystemcontainsamethodforincreasingeciency,wedonotimplementitanddelaydiscussionofittothenextsection.


35 Figure3{2:ExamplesimagesfromtheCUReTdatabase Textondictionary.Beforecomputinganyimage'ssignature,thissystemrequiresthatwerstconstructatextondictionary.Toconstructthisdictionary,werstextractfromeachpixelineachtrainingimageafeaturedescribingthetextureinitsneighborhood.Thisfeaturecanbeavectoroflterresponsesatthatpixel[ 42 ]orsimplyavectorofintensitiesintheneighborhoodaboutthatpixel[ 37 ].Wechoosethelaterapproach.Afterclusteringthisensembleoffeatures,wetakeasmallsetofclustercentersforthedictionary,610centersintotal. Computingasignature.Tocomputethesignatureofanimage,onerstndsforeachpixeltheclosesttextoninthedictionaryandthenretainsthelabelofthattexton.Atthisstage,onecanimagineanimagetransformedfromhavingintensitiesateachpixeltohavingindicesintothetextondictionaryateachpixel,aswouldresultfromvectorquantization.Inthenextsteponesimplyhistogramstheselabelsinthesamewaythatonewouldformaglobalgray-levelhistogram.Thisnormalizedhistogramontextonlabelsisthesignatureofanimage. Data.Againasin[ 37 ],weusetheColumbia-UtrechtReectanceandTextureCUReTDatabase[ 44 ]fortexturedata.Thisdatabasecontains61varietiesof


texture with each texture imaged under various (sometimes extreme) illumination and viewing angles. Fig. 3-2 illustrates the extreme intra-class variability and the inter-class similarity which make this database challenging for texture retrieval systems. In this figure each row contains realizations from the same texture class with each column corresponding to different viewing and illumination angles. Selecting 92 images from each of the 61 texture classes, we randomly partition each of the classes into 46 training images, which make up the database, and 46 test images, which make up the queries. Preprocessing consists of conversion to grayscale and mean and variance normalization.

Dissimilarity measures. Varma and Zisserman [37] measured dissimilarity between a query q and a database element p (both distributions) using the χ² significance test,

    \chi^2(p, q) = \sum_i \frac{(p_i - q_i)^2}{p_i + q_i},   (3.2)

and returned the nearest neighbor under this dissimilarity. In our work, we require a metric, so we take the square root of equation 3.2 [23]. Note that since the square root is a monotonic function, this does not alter the choice of nearest neighbor and therefore maintains the accuracy of the system. Additionally we use another metric [22], the square root of the Jensen-Shannon divergence between two distributions:

    JS_{1/2}(p, q) = H\left(\frac{p + q}{2}\right) - \frac{1}{2} H(p) - \frac{1}{2} H(q) = \frac{1}{2} KL\left(p, \frac{p + q}{2}\right) + \frac{1}{2} KL\left(q, \frac{p + q}{2}\right).

Note that in contrast to equation 2.16, we now take only two distributions and fix the mixture parameter as 1/2. Unlike the first metric, this change affects accuracy, but in our experiments, the change in retrieval accuracy from the original system did not exceed one percentage point. This is not surprising since the JS has served well in registration and retrieval applications in the past [20].
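The sketch below (ours; the helper names are illustrative) shows the two metrics used in this experiment together with the triangle-inequality lower bound of equation 3.1 that the metric tree evaluates with them; a subtree is skipped whenever this lower bound exceeds the best distance found so far.

    import numpy as np

    def chi2_metric(p, q):
        """Square root of the chi-squared statistic of equation 3.2."""
        s = p + q
        m = s > 0
        return np.sqrt(np.sum((p[m] - q[m]) ** 2 / s[m]))

    def js_metric(p, q):
        """Square root of JS_{1/2}, which is a true metric."""
        def H(d):
            d = d[d > 0]
            return -np.sum(d * np.log(d))
        mid = 0.5 * (p + q)
        return np.sqrt(H(mid) - 0.5 * H(p) - 0.5 * H(q))

    def subtree_lower_bound(query, center, radius, metric):
        """Equation 3.1: a lower bound on the distance from the query to any
        element stored under a node with the given center and radius."""
        return metric(query, center) - radius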


37 3.2.2 ImprovingEciencyAfterselectingthesignatureanddissimilaritymeasureinouracasemetric,wehavexedtheaccuracyofthenearestneighborsearch;sowecannowturnourattentiontoimprovingtheeciencyofthesearch.Weconstructametrictreeontheelementsofthedatabasetoimproveeciency.Andinourexperimentwewillvarytherepresentativeusedinthenodesofthemetrictree,ndingwhichrepresentativeprunesmost.Weholdconstantthetopologyofthemetrictree,determiningeachelement'smembershipinatreenodebasedonthetexturevarietyfromwhichitcame.Specically,eachofthe61texturevarietieshasacorrespondingnodeandeachofthesenodescontainthe46elementsarisingfromtherealizationsofthattexture.Giventhisxednodemembership,wecanconstructtheappropriaterepresentativeforeachnode.Then,giveneachrepresentative'sversionofthemetrictree,weperformsearchesandcountthenumberofcomparisonsrequiredtondthenearestneigh-bor. 3.3 ResultsandDiscussionTheresultswitheachdissimilaritymeasurewereverysimilar,andtheoreticalresults[ 22 ]bearthisobservedsimilarityoutbyshowingthattheJS-divergenceand2areasymptoticallyrelated.BelowwereporttheresultsfortheJS-divergence.Fig. 3{3 showsthespeed-upswhichthemetrictreewithm-ITCachieveforeachtextureclass.Thesedatarelatethetotalnumberofcomparisonsrequiredforanaveragequeryusingthemetrictreetothenumberofcomparisonsinanexhaustivesearch.Onaverage,them-ITCdiscards68.9%ofthedatabase,yieldingafactorof3.2improvementineciency.Itshouldnotsurprisethattheindexingout-performsanexhaustivesearch,sowenowconsiderwhathappenswhenwevarytherepresentativeinthemetrictree.Onaverage,thearithmeticmeandiscards47.6%ofthedatabase,resulting


Figure 3-3: On average for probes from each texture class, the speed-up relative to an exhaustive search achieved by the metric tree with the m-ITC as the representative


Figure 3-4: The excess comparisons performed by the arithmetic mean relative to the m-ITC within each texture class as a proportion of the total database


40 Figure3{5:Theexcesscomparisonsperformedbythebestmedoidrelativetothem-ITCwithineachtextureclassasaproportionofthetotaldatabase inaspeed-upfactorof1.9.Fig. 3{4 plotstheexcessproportionofthedatabasewhichthearithmeticmeansearchesrelativetothem-ITCforeachprobe.Theboxandwhiskersrespectivelyplotthemedian,quartiles,andrangeofthedataforalltheprobeswithinaclass.Onaverage,themetrictreewiththearithmeticmeanexploresanadditional21.3%ofthetotaldatabaserelativetothemetrictreewiththem-ITC.Inonly2.0%ofthequeriesdidthem-ITCfailtoout-performthearithmeticmean.Sincetheproportionofexcesscomparisonsisapositivevaluefor98.0%oftheprobes,wecanconcludethatthem-ITCalmostalwaysoerssomeimprovementandoccasionallyavoidssearchingmorethanathirdofthedatabase.


41 Fig. 3{5 showsasimilarplotforthebestmedoid.Again,theboxandwhiskersrespectivelyplotthemedian,quartiles,andrangeofthedataforalltheprobeswithinaclass.Onaverage,themetrictreewiththebestmedoidexploresanadditional22.1%ofthetotaldatabaserelativetothemetrictreewiththem-ITC.Herethem-ITCimprovesevenmorethanitdidoverthearithmeticmean;inonly0.1%ofthequeriesdidthem-ITCfailtoout-performthebestmedoid|neveroncedoingmorepoorly. 3.3.1 ComparisontoPre-existingEciencySchemeVarmaandZissermanproposeadierentapproachtoincreasetheeciencyofqueries[ 37 ].Theydecreasethesizeofthedatabaseinagreedyfashion:Initially,eachtextureclassinthedatabasehas46modelsonearisingfromeachtrainingimage.Thenforeachclass,theydiscardthemodelwhoseabsenceimpairsretrievalperformancetheleastonasubsetofthetrainingimages.Andthisisrepeatedtillthenumberofmodelsissuitablysmallandtheestimatedaccuracyissuitablyhigh.Whilethismethodperformedwellinpractice|achievingcomparableaccuracytoanexhaustivesearchandreducingtheaveragenumberofmodelsperclassfrom46toeightornine,ithasseveralpotentialshortcomings.Therstisthismethod'scomputationalexpense:Althoughthemodelselectionprocessoccurso-lineandtimeisnotcritical,thenumberoftimesonemustvalidatemodelsagainsttheentiretrainingsubsetscalesquadraticallyinthenumberofmodels.Additionallyandmoreimportantly,themodelselectionproceduredependsuponthesubsetofthetrainingdatausedtovalidateit.Itoersnoguaranteesofitsaccuracyrelativetoanexhaustivesearchwhichutilizesalltheknowndata.Incontrast,ourmethodcancomputethem-ITCeciently,andmoreimpor-tantlyweguaranteethattheaccuracyofthemoreecientsearchisidenticaltotheaccuracythatanexhaustivesearchwouldachieve.Althoughitmustbenotedthat


42 themethodofVarmaandZissermanperformedfewercomparisons,webelievethatbuildingamulti-layeredmetrictreewillbridgethisgap.Lastly,thetwomethodscanco-existsimultaneously:Sincethepre-existingapproachfocusesonreducingthesizeofthedatabasewhileoursindexesthedatabasesolvinganorthogonalproblem,nothingstopsusfromtakingthesmallerdatabaseresultingfromtheirmethodandperformingourindexingatopitforfurtherimprovement. 3.4 ConclusionOurgoalwastoselectthebestsinglerepresentativeforaclassofprobabilitydistributions.Wechosethem-ITCwhichminimizesthemaximimKL-divergencefromeachdistributionintheclasstoit;andwhenweplaceditinthenodesofametrictree,itallowedustoprunemoreeciently.Experimentally,wedemonstratedsignicantspeed-upsoverexhaustivesearchinastate-of-the-arttextureretrievalsystemontheCUReTdatabse.Themetrictreeapproachtonearestneighborsearchesguaranteesaccuracyidenticaltoanexhaustivesearchofthedatabase.Additionally,weshowedthatthem-ITCoutperformsthearithmeticmeanandthebestmedoidwhentheseotherrepresentativesareusedanalogously.Probabilitydistributionsareapopularchoiceforretrievalinmanydomains,andastheretrievaldatabasesgrowlarge,therewillbeaneedtocondencemanydistributionsintoonerepresentative.Wehaveshownthatthem-ITCisausefulchoiceforsucharepresentativewithwell-behavedtheoreticalpropertiesandempiricallysuperiorresults.


CHAPTER4RETRIEVALWITHe-CENTER 4.1 IntroductionInthecourseofdesigningaretrievalsystem,onemustusuallyconsideratleastthreebroadelements| 1. asignaturethatwillrepresenteachelement,allowingforcompactstorageandfastcomparisons, 2. adissimilaritymeasurethatwilldiscriminatebetweenapairofsignaturesthatarecloseandapairthatarefarfromeachother,and 3. anindexingstructureorsearchstrategythatwillallowforecient,non-exhaustivequeries.Therstofthesetwoelementsmostlydeterminetheaccuracyofasystem'sretrievalresults.Thefocusofthischapter,likethelast,isonthethirdpoint.Agreatdealofworkhasbeendoneonretrievalsystemsthatutilizeaprob-abilitydistributionasasignature.Thisworkhascoveredavarietyofdomainsincludingshape[ 31 ],texture[ 34 ],[ 45 ],[ 35 ],[ 36 ],[ 37 ],andgeneralimages[ 32 ],[ 33 ].Ofthese,somehaveusedtheKullback-LeiblerKLdivergence[ 1 ]asadissimilaritymeasure[ 35 ],[ 36 ],[ 32 ].TheKL-divergencehasmanynicetheoreticalproperties.Particularly,itsrelationshiptomaximum-likelihoodestimation[ 46 ].However,inspiteofthis,itisnotametric.Thismakesitchallengingtoconstructanindexingstructurewhichrespectsthedivergence.ManybasicmethodsexisttospeedupsearchinEuclideanspaceincludingk-dtreesandR*-trees.Andthereareevensomemethodsforgeneralmetricspacessuchasballtrees[ 38 ],vantagepointtrees[ 39 ],andmetrictrees[ 40 ].Yetlittleworkhasbeendoneonecientlynding 43


44 exactnearestneighborsunderKL-divergence.Inthischapter,wepresentanovelmeansofspeedingnearestneighborsearchandhenceretrievalinadatabaseofprobabilitydistributionswhenthenearestneighborisdenedastheelementthatminimizestheKL-divergencetothequery.Thisapproachdoeshaveasignicantdrawbackwhichdoesnotimpairitonthisparticulardataset,butingeneralitcannotguaranteeretrievalaccuracyequaltothatofanexhaustivesearch.Thebasicideaisacommononeincomputerscienceandreminiscentofthelastchapter:Werepresentasetofelementsbyonerepresenative.Duringasearch,wecomparethequeryobjectagainsttherepresentative,andiftherepresentativeissucientlyfarfromthequery,wediscardtheentiresetthatcorrespondstoitwithoutfurthercomparisons.Ourcontributionliesinselectingthisrepresentativeinanoptimalfashion;ideallywewouldliketodeterminethecircumstancesunderwhichwemaydiscardthesetwithoutfearofaccidentallydiscardingthenearestneighbor,butthiscannotalwaysbeguaranteedForthisapplication,wewillutilizethetheexponentialinformationtheoreticcentere-ITC.Intheremainingsectionswerstderivetheexpressionuponwhichthesystemmakespruningdecisions.Thereafterwepresenttheexperimentshowingincreasedeciencyinretrievaloveranexhaustivesearchandovertheuniformlyweightedgeometricmean|areasonablealternaterepresentative.Finally,wereturntothetextureretrievalexampleofthepreviouschaptertocomparethem-ande-centers,evaluatingwhichformsthetightestclustersasevaluatedbythe2metricequation 3.2 4.2 LowerBoundInthissectionweattempttoderivealowerboundontheKL-divergencefromadatabaseelementtoaquerywhichonlydependsuponthetheelementthroughitse-ITC.ThislowerboundguidesthepruningandresultsinthesubsequentincreasedsearcheciencywhichwedescribeinSection 4.3


45 Figure4{1:Intuitiveproofofthelowerboundinequation 4.7 seetext.TheKL-divergenceactslikesquaredEuclideandistance,andthePythagoreanTheo-remholdsunderspecialcircumstances.Qisthequery,Pisadistributioninthedatabase,andCisthee-ITCofthesetcontainingP.PistheI-projectionofQontothesetcontainingP.Ontheright,DCjjPRe,whereReisthee-radius,bytheminimaxdenitionofC. Inordertosearchforthenearestelementtoaqueryeciently,weneedtoboundtheKL-divergencetoasetofelementsfrombeneathbyaquantitywhichonlydependsuponthee-ITCofthatset.Thatway,wecanusetheknowledgegleanedfromasinglecomparisontoavoidindividualcomparisonstoeachmemberoftheset.WeapproachsuchalowerboundbyexaminingtheleftsideofFig. 4{1 .HereweconsideraquerydistributionQandanarbitrarydistributionPinasetwhichhasCasitse-ITC.Asasteppingstonetothelowerbound,webrieydenetheI-projectionofadistributionQontoaspaceEasP=argminP2EDPjjQ:.1ItiswellknownthatonecanuseintuitionaboutthesquaredEuclideandistancetoappreciatethepropertiesoftheKL-divergence;andinfact,inthecaseofthe


46 I-projectionPofQontoE,wehaveinsomecasesaversionofthefamiliarPythagoreanTheorem[ 26 ].InthiscasewehaveforallP2EthatDPjjQDPjjQ+DPjjP:.2AndinthecasethatP=P1+)]TJ/F22 11.955 Tf 11.955 0 Td[(P2.3fordistributionsP1;P22Ewith0<<1,thenwecallPanalgebraicinnerpointandwehaveequalityinequation 4.2 UnfortunatelyitisthisconditionwhichwecannotverifyeasilyasitdependsuponboththequerywhichwewilllabelQandthestructureofEwhichisdeterminedbythedatabase.InterestinglywedogettheequalityversionwhenwetakeEasalinearfamily,likethefamilieswithwhichthem-ITCisconcerned.Regardless,wewillcontinuewiththederivationanddemonstrateitsuseinthisapplication.Assumingequalityinequation 4.2 andapplyingittwiceyields,DPjjQ=DPjjQ+DPjjP .4 DCjjQ=DPjjQ+DCjjP; .5 wherewearefreetoselectP2EasanarbitrarydatabaseelementandCasthee-ITC.Equation 4.4 correspondsto4QPPwhileequation 4.5 correspondsto4QCPinFig. 4{1 .Ifwesubtractthetwoequationsaboveandre-arrange,wend,DPjjQ=DCjjQ+DPjjP)]TJ/F22 11.955 Tf 11.955 0 Td[(DCjjP:.6


47 ButsincetheKL-divergenceisnon-negative,andsincethee-radiusReisauniformupperboundontheKL-divergencefromthee-ITCtoanyP2E,wehaveDPjjQDCjjQ)]TJ/F22 11.955 Tf 11.955 0 Td[(Re:.7Hereweseethatthem-ITCwoulddolittlebetteringuaranteeingthisparticularbound.Whileitwouldinsurethatwehadequalityinequation 4.2 ,itcouldnotboundthelastterminequation 4.6 becausetheorderoftheargumentsisreversed.Wecangetanintuitive,nonrigorousviewofthesamelowerboundbyagainborrowingnotionsfromsquaredEuclideandistance.Thispictoralrepriseofequation 4.7 canlendvaluableinsighttothetightnessoftheboundanditsdependenceoneachofthetwoterms.ForthisdiscussionwerefertotherightsideofFig. 4{1 .TheminimaxdenitiontellsusthatDCjjPRe.WeconsiderthecaseinwhichthisisequalityandsweepoutanarccenteredatCwithradiusRefromthebaseofthetrianglecounter-clockwise.WetakethepointwherealinesegmentfromQistangenttothisarcasavertexofarighttrianglewithhypotenuseoflengthDCjjQ.ThelegwhichisnormaltothearchaslengthRebyconstruction,andbythePythagoreanTheoremtheotherlegofthistriangle,whichoriginatesfromQ,haslengthDCjjQ)]TJ/F22 11.955 Tf 12.319 0 Td[(Re.Wecanusethelengthofthislegtovisualizethelowerbound,andbyinspectionweseethatitwillalwaysbeexceededbythelengthofthelinesegmentoriginatingfromQandterminatingfurtheralongthearcatP.ThissegmenthaslengthDPjjQandisindeedthequantityweseektoboundfrombelow. 4.3 ShapeRetrievalExperimentInthissectionweapplythee-ITCandthelowerboundinequation 4.7 torepresentdistributionsarisingfromshapes.Sincethelowerboundguaranteesthat


we only discard elements that cannot be nearest neighbors, the accuracy of retrieval is as good as an exhaustive search. While we know from the theory that the e-ITC yields a smaller worst-case KL-divergence, we now present an experiment to test if this translates into a tighter bound and more efficient queries.

We tackle a shape retrieval problem, using shape distributions [31] as our signature. To form a shape distribution from a 3D shape, we uniformly sample pairs of points from the surface of the shape and compute the distance between these random points, building a histogram of these random distances. To account for changes in scale, we independently scale each histogram so that the maximum distance is always the same. For our dissimilarity measure, we use KL-divergence, so the nearest neighbor P to a query distribution Q is

    P = \arg\min_{P'} D(P' || Q).   (4.8)

For data, we use the Princeton Shape Database [47] which consists of over 1800 triangulated 3D models from over 160 classes including people, animals, buildings, and vehicles.

To test the efficiency, we again compare the e-ITC to the uniformly weighted, normalized geometric mean. Using the convexity of E we can generalize the lower bound in equation 4.7 to work for the geometric mean by replacing the e-radius with \max_i D(C || P_i) for our different C. We take the base classification accompanying the database to define our clusters, and then compute the e-ITC and geometric means of each cluster. When we consider a novel query model on a leave-one-out basis, we search for the nearest neighbor utilizing the lower bound and disregarding unnecessary comparisons. For each query, we measure the number of comparisons required to find the nearest neighbor.
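The sketch below (ours, simplified) illustrates the two ingredients of this experiment: a shape distribution built by histogramming random pairwise distances, here sampled from mesh vertices rather than uniformly over the triangles, which is an approximation, and the pruning test based on equation 4.7 that compares a cluster's e-ITC and e-radius against the query.

    import numpy as np

    def shape_distribution(vertices, n_pairs=10000, bins=64, rng=None):
        """Histogram of pairwise distances between randomly chosen vertices,
        rescaled so the maximum distance is one."""
        rng = np.random.default_rng() if rng is None else rng
        i = rng.integers(0, len(vertices), size=n_pairs)
        j = rng.integers(0, len(vertices), size=n_pairs)
        d = np.linalg.norm(vertices[i] - vertices[j], axis=1)
        hist, _ = np.histogram(d / d.max(), bins=bins, range=(0.0, 1.0))
        hist = hist.astype(float) + 1e-9           # avoid empty bins under KL
        return hist / hist.sum()

    def cluster_lower_bound(query, e_center, e_radius):
        """Equation 4.7: D(C||Q) - R_e; if this exceeds the best divergence
        found so far, every member of the cluster can be skipped."""
        return np.sum(e_center * np.log(e_center / query)) - e_radius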


Figure 4-2: The speed-up factor versus an exhaustive search when using the e-ITC as a function of each class in the shape database.


Figure 4-3: The relative percent of additional prunings which the e-ITC achieves beyond the geometric center, again for each class number.


51 Fig. 4{2 andFig. 4{3 showtheresultsofourexperiment.InFig. 4{2 ,weseethespeed-upfactorthatthee-ITCachievesoveranexhaustivesearch.Aver-agedoverallprobesinallclasses,thisspeed-upfactorisapproximately2.6;thegeometricmeanachievedanaveragespeed-upofabout1.9.AndinFig. 4{3 ,wecomparethee-ITCtothegeometricmeanandseethatforsomeclasses,thee-ITCallowsustodiscardnearlytwiceasmanyunworthycandidatesasthegeometricmean.Fornoclassofprobesdidthegeometricmeanprunemorethanthee-ITC,andwhenaveragedoverallprobesinallclasses,thee-ITCdiscardedover30%moreelementsthandidthegeometricmean. 4.4 RetrievalwithJS-divergenceInthissectionwemimicktheexperimentsofthepreviouschapter.InsteadofusingtheKL-divergencetodeterminenearestneighbors,wereturntothesquare-rootoftheJS-divergence,andweagainusethetriangleinequalitytoguaranteenodecreaseinaccuracy.Usingmetrictreeswiththee-center,geometricmean,andm-center,wecancomparetheecienciesofeachrepresentativeandtheoverallspeedup.InFig. 4{4 ,weseethespeeduprelativetoanexhaustivesearchwhenusingthee-ITC;onaverage,thespeedupfactoris1.53.InFig. 4{5 wecomparethee-ITCtothegeometricmean,muchaswecomparedthem-ITCtothearithmeticmeaninthelastchapter.Weclaimthatthisisthenaturalcomparisonbecausetheexponentialfamilyconsistsofweightedgeometricmeans.Herethegeometricmeansearchesonaverage7.24%moreofthedatabasethanthee-ITC.LastlyinFig. 4{6 wecomparethetwocentersagainsteachother.Herethee-ITCcomesupshort,searchingonaverageanadditional14.89%ofthedatabase.Inthenextsectionwetrytoexplainthisresult.


Figure 4-4: Speedup factor for each class resulting from using the e-ITC over an exhaustive search.


Figure 4-5: Excess searches performed using the geometric mean relative to the e-ITC, as a proportion of the total database.


Figure 4-6: Excess searches performed using the e-ITC relative to the m-ITC, as a proportion of the total database.


4.5 Comparing the m- and e-ITCs

Since both centers have found successful application to retrieval, it is reasonable to explore their relationship. Since the arguments in their respective minimax criteria are reversed, it is not immediately clear that a meaningful comparison could be made with KL-divergence alone. Hence, we resort to the χ² distance from the previous chapter as an arbiter (though we could just as well use JS-divergence).

The comparison we make next is simple. Returning to the texture retrieval dataset from the previous chapter, we use the same set memberships and calculate an m-ITC and an e-ITC for each set. Then for each representative and each set we determine the maximum χ² distance between an element of that set and the representative. The results appear in Fig. 4-7 as the ratio between this χ² "radius" of the e-ITC and that of the m-ITC. Since the numbers are greater than one (with the exception of only two out of 61 classes), it is safe to conclude in this setting that the m-ITC forms tighter clusters. This result helps explain the superior performance of the m-center in the previous section. In retrospect, one could attribute this to the m-ITC's global optimality property (cf. Section 2.2.1), which the e-ITC may not share.
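The comparison itself is a one-line computation per class; the sketch below assumes one common form of the χ² distance between histograms (the exact definition used in the earlier chapter may differ) and uses illustrative helper names.

import numpy as np

def chi2_distance(p, q, eps=1e-12):
    # A symmetric chi-squared distance between two normalized histograms.
    return float(np.sum((p - q) ** 2 / (p + q + eps)))

def chi2_radius(center, members):
    # Worst-case chi-squared distance from a representative to its cluster.
    return max(chi2_distance(center, p) for p in members)

def radius_ratio(e_center, m_center, members):
    # Quantity plotted per class: a value greater than one means the m-ITC
    # forms the tighter cluster under the chi-squared distance.
    return chi2_radius(e_center, members) / chi2_radius(m_center, members)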


Figure 4-7: The ratio of the maximal χ² distance from each center to all of the elements in a class.


CHAPTER 5
TRACKING

5.1 Introduction

In the previous chapters, we considered the problem of efficient retrieval. In the case of retrieval, where a uniform upper bound is important, one measures how well a representative does by focusing on how it handles the most distant members. This property is why the minimax representatives are well-suited to retrieval. In this chapter, we explore the question of whether the same can be said for tracking. We first present several encouraging signs that in fact it may be, and then we go on to consider an experiment to test the performance of a tracker built around the m-ITC. But first we set the context of the tracking problem in probabilistic terms.

5.2 Background: Particle Filters

The tracking problem consists of estimating and maintaining a hidden variable (usually position, sometimes pose or a more complicated state) from a sequence of observations. Commonly, instead of simply keeping one guess as to the present state and updating that guess at each new observation, one stores and updates an entire probability distribution on the state space; then, when pressed to give a single, concrete estimate of the state, one uses some statistic (e.g., mean or mode) of that distribution.

This approach is embodied famously in the Kalman filter, where the probability distribution on the state space is restricted to be a Gaussian, and the trajectory of the state (or simply the motion of the object) is assumed to follow linear dynamics so that the Gaussian at one time step may propagate to another Gaussian at the next.


When this assumption is too limiting (possibly because background clutter creates multiple modes in the distribution on the state space, or because of complex dynamics, or both), researchers often turn to a particle filter to track an object [48]. Unlike the Kalman filter, particle filters do not require that the probability distribution on the state space be a Gaussian. To gain this additional representative power, a particle filter stores a set of samples from the state space with a weight for each sample describing its probability. The goal then is to update these sets when new observations arrive. One can update from a time step t−1 to a time step t in three steps [48]:

1. Given a sample set {s_1^(t−1), ..., s_n^(t−1)} and its associated weights {π_1^(t−1), ..., π_n^(t−1)}, randomly sample with replacement a new set {s̃_1^t, ..., s̃_n^t}.

2. To arrive at the new s_i^t, randomly propagate each s̃_i^t by assigning s_i^t to the value of x according to a probability distribution (i.e., the motion model) p_m(x | s̃_i^t).

3. Adjust the weights according to the likelihood of observing the data z_t given that the true state is s_i^t,

   π_i^t = (1/Z) p_d(z_t | s_i^t),    (5.1)

   where Z is a normalization factor.

In this work we restrict our attention to a two-dimensional state space consisting only of an object's position. Also we exclusively consider motion models p_m in which

p_m(x | y) = p_m(x + x_0 | y + y_0) = G(x − y),    (5.2)

where G is a Gaussian. That is, p_m is a "shift-invariant" model in which the probability of the displacement from a position in one time step to a position in the next is independent of that starting position. Furthermore, we take that probability of a displacement as determined by a Gaussian.
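The following is a minimal Python sketch of the three-step update above, specialized to a two-dimensional position state and the shift-invariant Gaussian motion model of equation 5.2. The observation likelihood p_d is left abstract, and the names and defaults are illustrative assumptions rather than the dissertation's implementation.

import numpy as np

def particle_filter_step(particles, weights, z_t, p_d, motion_mean, motion_cov, rng=None):
    # One update of a basic particle filter: resample, propagate, reweight.
    #   particles : (n, 2) array of positions s_i^(t-1)
    #   weights   : (n,) array of weights pi_i^(t-1), summing to one
    #   z_t       : the current observation (e.g., the image frame)
    #   p_d       : callable p_d(z_t, state) returning the observation likelihood
    #   motion_mean, motion_cov : parameters of the Gaussian displacement model G
    rng = np.random.default_rng(rng)
    n = len(particles)

    # Step 1: resample with replacement according to the old weights.
    idx = rng.choice(n, size=n, replace=True, p=weights)
    resampled = particles[idx]

    # Step 2: propagate each particle by a random displacement drawn from G
    # (the shift-invariant motion model of equation 5.2).
    displacements = rng.multivariate_normal(motion_mean, motion_cov, size=n)
    propagated = resampled + displacements

    # Step 3: reweight by the observation likelihood and normalize (equation 5.1).
    new_weights = np.array([p_d(z_t, s) for s in propagated])
    new_weights = new_weights / new_weights.sum()
    return propagated, new_weights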


5.3 Problem Statement

In this chapter we consider the following scenario: we have several objects, and for each object we know its distinct probabilistic motion model. We want to build a single particle filter with one motion model which can track any of the objects. This is reminiscent of the problem of designing a universal code that can efficiently encode any of a number of distinct sources, and that similarity suggests that this is a promising application for an information theoretic center.

Related to this problem is the case in which one single object undergoes distinct "phases" of motion, each of which has a distinct motion model. An example of this is a car that moves in one fashion when it drives straight and in a completely different fashion when turning. This work does not explore such multi-phase tracking. For this related problem of multi-phase motion, there are certainly more complicated motion models suited to the problem [49]. And for the problem we focus on in this chapter, one could also imagine refining the motion model as observations arrive until, once in possession of a preponderance of evidence, one finally settles on the most likely component. But all of these require some sort of on-line learning; in contrast, the approach we present offers a single, simple, fixed prior which can be directly incorporated into the basic particle filter.

5.4 Motivation

5.4.1 Binary State Space

To motivate the use of an ITC, we begin with a toy example in which we "track" the value of a binary variable. Our caricature of a tracker is based on a particle filter with N particles. In this example, we make a further, highly simplifying assumption: the observation model is perfect and clutter-free. That is, the likelihood p_d in equation 5.1 has perfect information.


This means that any particles which propagate to the incorrect state receive a weight of zero, and the particles (if any) which propagate to the correct state share the entire probability mass among themselves. Under these assumptions, the event that the tracker fails (and henceforth never recovers) at each time step is an independent, identically distributed Bernoulli random variable with probability

p_fail = p(1 − q)^N + (1 − p)q^N,    (5.3)

where p is the probability the tracker takes on state 0 and q is the probability that a particle evolves under its motion model to state 0. Similarly, 1 − p is the probability the tracker takes on state 1 and 1 − q is the probability that a particle evolves to state 1. What the equation above says is that our tracker will fail if and only if all N particles choose wrongly.

Now the interesting thing about this example is that a motion model in which q ≠ p can outperform one in which q = p. Specifically, by differentiating equation 5.3 with respect to q, we find that p_fail takes on a minimum when

q = [((1 − p)/p)^(1/(N−1)) + 1]^(−1).    (5.4)

As a concrete example, we take the case with p = .1 and N = 10; here the optimal value for q is .4393. When we find the expected number of trials till the tracker fails in each case (simply p_fail^(−1)), we find that if we take a motion model with q = p, the tracker goes for an average of 29 steps, but if we take the optimal value of q, the tracker continues for 1825 steps. Here we see reminders of how the ITCs give more weight to the extraordinary situations than other representatives do. In this case it is justified to under-weight the most likely case because even having a single particle arrive at the correct location is as good as having all N particles arrive.
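A few lines of Python reproduce the arithmetic of this toy example; the script is only an illustration, and its output should land near the numbers quoted above.

def p_fail(p, q, N):
    # Probability that all N particles choose the wrong state (equation 5.3).
    return p * (1 - q) ** N + (1 - p) * q ** N

def optimal_q(p, N):
    # Minimizer of p_fail with respect to q (equation 5.4).
    return 1.0 / (((1 - p) / p) ** (1.0 / (N - 1)) + 1.0)

p, N = 0.1, 10
q_star = optimal_q(p, N)                    # approximately 0.4393
steps_naive = 1.0 / p_fail(p, p, N)         # roughly 29 expected steps with q = p
steps_optimal = 1.0 / p_fail(p, q_star, N)  # roughly 1825 expected steps with q = q_star
print(q_star, steps_naive, steps_optimal)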


5.4.2 Self-information Loss

For more evidence suggesting that the ITC might be well-suited, we consider the following analysis [50]. Suppose we try to predict a value x with distribution p(x). But instead of picking a single value, we specify another distribution q(x) which defines our confidence for each value of x. Now depending on the value x that occurs, we pay a penalty based on how much confidence we had in that value; if we had a great deal of confidence (q(x) close to one), we pay a small penalty, and if we did not give much credence to the value, we pay a larger penalty. The self-information loss function is a common choice for this penalty. According to this function, if the value x occurs, we would incur a penalty of −log q(x). If we examine the expected value of our loss, we find

E_x[−log q(x)] = KL(p, q) + H(p).    (5.5)

Returning to our problem statement, if we are faced with a host of distributions and want to find a single q to minimize the expected loss in equation 5.5 over the set of distributions, we begin to approach something like the m-ITC.

5.5 Experiment: One Tracker to Rule Them All?

5.5.1 Preliminaries

To test how well the m-ITC incorporates a set of motion models into one tracker, we designed the following experiment. Given a sequence of images, we first estimated the true motion by hand, measuring the position of the object of interest at key frames. We then fit a Gaussian to the set of displacements, taking the mean of the Gaussian as the average velocity.

Data. A single frame from the 74-frame sequence appears in Fig. 5-1. In this sequence we track the head of the walker, which has nearly constant velocity in the x-direction and slight periodic motion in the y-direction as she steps. The mean velocity in the x-direction was -3.7280 pixels/frame with a marginal standard deviation of 1.15; in the y-direction the average velocity was -0.1862 with standard deviation 2.41.
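As a small illustration of this fitting step, one can difference the hand-measured keyframe positions and take the sample mean and per-axis standard deviation of the displacements. The positions below are hypothetical placeholders, not the measured data from the sequence.

import numpy as np

# Hypothetical hand-measured (x, y) positions of the head at consecutive key frames.
keyframe_positions = np.array([[310.0, 112.0],
                               [306.2, 111.8],
                               [302.5, 112.3],
                               [298.9, 111.6]])

displacements = np.diff(keyframe_positions, axis=0)  # per-frame displacement estimates
mean_velocity = displacements.mean(axis=0)           # mean of the fitted Gaussian
std_velocity = displacements.std(axis=0, ddof=1)     # marginal standard deviations
print(mean_velocity, std_velocity)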


Figure 5-1: Frame from test sequence.


For our observation model, we simply use a template of the head from the initial frame and compare it to a given region, finding the mean squared error (MSE) of the gray-levels. The likelihood is then just exp(−MSE/(2σ²)). We initialize all trackers to the true state at the initial frame.

Motion models. From this single image sequence, we can hallucinate numerous image sequences which consist of rotated versions of the original sequence. We know the true motion models for all of these novel sequences since they are just the original motion model rotated by the same amount. To examine the performance of one motion model applied to a different image sequence, one need only consider the angle disparity between the true underlying motion model of the image sequence and the motion model utilized by the tracker.

In Fig. 5-2 we report the performance in average time-till-failure as a function of angle disparity. Here the number of particles is fixed to 10. Since in this experiment (and subsequent ones in this chapter) we define a tracker as having irrevocably failed when all of its particles are at a distance greater than 40 pixels from the true location, we see that even the most hopeless of trackers will succeed in "tracking" the subject for six frames, just long enough for all of its particles to flee from the correct state at an average relative velocity of 7.5 pixels/frame. This observation lets us calculate a lower bound on the performance of a tracker with a given angle disparity from the true motion: pessimistically assuming a completely uninformative observation model, one can calculate the time required for the centroid of the particles to exceed a distance of D = 40 pixels from the true state as

t = D / (2 r sin(θ/2)),    (5.6)

where r = 3.7326 is the speed at which the centroid and the true state each move and θ is the angle disparity.
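The bound in equation 5.6 can be evaluated directly; the short script below (illustrative only) computes the curve with the values quoted in the text, capping it at the 74-frame sequence length.

import numpy as np

def time_till_failure_bound(theta, D=40.0, r=3.7326, max_frames=74):
    # Equation 5.6: frames needed for a particle centroid moving at speed r with
    # angle disparity theta (radians) to drift D pixels from the true state,
    # capped at the length of the sequence.
    t = D / (2.0 * r * np.sin(theta / 2.0))
    return np.minimum(t, max_frames)

angles = np.linspace(0.05, np.pi, 50)  # avoid theta = 0, where the bound diverges
bound = time_till_failure_bound(angles)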


Figure 5-2: Average time till failure as a function of the angle disparity between the true motion and the tracker's motion model.


The dashed line in Fig. 5-2 represents this curve, capped at the maximum number of frames (74). One should note that since all of the motion models are rotated versions of each other, the H(p) term in equation 5.5 is a constant. Hence the q which minimizes the maximum expected self-information loss over a set of p's is in fact the m-ITC.

Performance of mixtures. Because we will take the m-ITC of several of these motion models, it is also of interest how mixtures perform. We consider a mixture of the correct motion model and a motion model rotated π radians (in the opposite direction), which essentially contributes nothing to the tracking. In Fig. 5-3 we again plot the time-till-failure for trackers with 10 and 20 particles as a function of the weight α given to the correct model in the mixture.

To derive a lower bound, this time we represent the proportion of the probability near the true state at time t as r_t and the remainder as w_t. Further, we say that at each time step a fraction α of the probability near the true state moves to (or remains in) the correct state as a result of those particles being driven by the correct motion model, and the rest moves away from the correct state. Next we model the step in the particle filter algorithm where we adjust the weights. We assume that a particle away from the true state receives a weight that is c < 1 times the weight received by a particle near the true state. If there were no clutter in the scene, this number would be zero and we would return to the assumption in Section 5.4.1.


Figure 5-3: Average time till failure as a function of the weight on the correct motion model, for 10 and 20 particles.


Now, by pessimistically assuming that particles which move away from the correct state have an exceedingly small chance (i.e., zero) of rejoining the true state randomly, we can derive the following from an initial r_0 = 1:

r_t = α^t / Z,    (5.7)

w_t = (1 − α) Σ_{i=0}^{t−1} c^(t−i) α^i / Z,    (5.8)

where Z is a normalizing constant. And by taking a lower bound of w_t > (1 − α) c α^(t−1) / Z, we can derive that r_t < α / (α + (1 − α)c). Finally, by taking this upper bound on r_t as the parameter of a Bernoulli random variable as we did in Section 5.4.1, we can get a lower bound on the expected time-till-failure as 1/(1 − r_t), plus an adjustment for the six "free" frames required to drift out of the 40-pixel range. This is the lower bound shown in Fig. 5-3. To calculate c, we randomly sampled the background at a distance greater than 40 pixels from the true state and averaged the response of the observation model, yielding c = 0.1886.

5.5.2 m-ITC

Again we compare the performance of the m-ITC to the arithmetic mean. This time we form mixtures of several motion models of varying angles, taking weights either as uniform (for the arithmetic mean) or as determined by the m-ITC. In the first case we take a total of 12 motion models with angle disparities of

{±5π/20, ±4π/20, ±3π/20, ±2π/20, ±π/20, 0, π}.

On this mixture, the m-ITC assigns weights of .31 to each of the motion models at ±π/4 and the remaining .37 to the motion model at π. We tested these trackers when the true motion model has orientation zero, π/4, and π, respectively, and report their average times-till-failure in Table 5-1. Indeed, as we might expect from a minimax representative, the m-ITC registers the best worst-case performance; but nevertheless it is not an impressive performance.

The second set of motion models was slightly less extreme in variation, {0, ±π/64, ±π/32, ±π/4}. To these components, the m-ITC split its probability evenly between the motion models at disparities ±π/4.


In this case, the m-ITC had a better showing overall, and still had the best worst-case performance. The results are shown in Table 5-2.

Table 5-1: Average time-till-failure of the tracker based on the arithmetic mean (AM) and on the m-ITC (10 particles)

  True angle   m-ITC   AM
  0            43      74
  π/4          23      46
  π            16      7

Table 5-2: Average time-till-failure of the tracker based on the arithmetic mean (AM) and on the m-ITC (10 particles), with the second set of models

  True angle   m-ITC   AM
  0            72      74
  π/4          44      37

5.6 Conclusion

While there is reason to believe that a minimax representative would serve well in combining several motion models in tracking, the precise circumstances when this might be beneficial are difficult to determine. Examining Fig. 5-2 and Fig. 5-3, it seems that if there is too little weight or too great a disparity, a tracker is doomed from the beginning. So while the m-ITC will not perform ideally under all situations, it still retains its expected best worst-case performance.


CHAPTER 6
CONCLUSION

6.1 Limitations

The central thrust of this work has been the claim that despite many computer vision researchers' instinctive suspicion of minimax methods, given the right application, they can be useful. However, those skeptical researchers' instincts are often well-founded: the main issue one must be aware of regarding the representatives presented in this work is their sensitivity to outliers. One must carefully consider his data, particularly the extreme elements, because those are precisely the elements with the most influence on these representatives.

In addition to data, one must also consider how one's application defines successful results. If one's application can tolerate small deviations in the "normal cases," and successful behavior is defined by good results in extreme cases, then a minimax representative might be appropriate. This is precisely the case in the retrieval problem, where a uniform upper bound on the dispersion of a subset is the criterion on which successful indexing is judged. Such a criterion disregards whether the innermost members are especially close to the representative or not. Despite some initial suggestions, this did not turn out to be the case in the tracking domain in the presence of clutter. There it seems the deviations in the "normal cases" did have a significant effect on performance.

6.2 Summary

After characterizing two minimax representatives, with firm groundings in information theory, we have shown how they can be utilized to speed the retrieval of textures, images, shapes, or any objects so represented by a probability distribution.


Their power in this application comes from the fact that they form tight clusters, allowing for more precise localization and efficient pruning than other common representatives. While the tracking results did not bear as much fruit, we still believe that it is a promising avenue for such a representative if the problem is properly formulated.

6.3 Future Work

This topic touches upon a myriad of areas including information theory, information geometry, and learning. Csiszar and others have characterized the expectation-maximization algorithm in terms of divergence minimization, and we believe that incorporating the ITCs into some EM-style algorithm would be very interesting. Also of interest are its connections to AdaBoost and other online learning algorithms. But for all of these avenues, the main challenge remains of verifying that the data and the measurement of success are appropriate fits to these representatives.


REFERENCES

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley & Sons, New York, NY, 1991.
[2] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Trans. Inform. Theory, vol. 37, no. 1, pp. 145-151, Mar. 1991.
[3] P. T. Fletcher, C. Lu, and S. Joshi, "Statistics of shape via principal geodesic analysis on Lie groups," IEEE Trans. Med. Imag., vol. 23, no. 8, pp. 995-1005, Aug. 2004.
[4] E. Klassen, A. Srivastava, W. Mio, and S. H. Joshi, "Analysis of planar shapes using geodesic paths on shape spaces," IEEE Trans. Pattern Anal. Machine Intell., vol. 26, no. 3, pp. 372-383, Mar. 2004.
[5] S.-I. Amari, Methods of Information Geometry, American Mathematical Society, Providence, RI, 2000.
[6] S.-I. Amari, "Information geometry on hierarchy of probability distributions," IEEE Trans. Inform. Theory, vol. 47, no. 5, pp. 1707-1711, Jul. 2001.
[7] B. Pelletier, "Informative barycentres in statistics," Annals of the Institute of Statistical Mathematics, to appear.
[8] Z. Wang and B. C. Vemuri, "An affine invariant tensor dissimilarity measure and its applications to tensor-valued image segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Washington, DC, Jun./Jul. 2004, vol. 1, pp. 228-233.
[9] D. P. Huttenlocher, G. A. Klanderman, and W. A. Rucklidge, "Comparing images using the Hausdorff distance," IEEE Trans. Pattern Anal. Machine Intell., vol. 15, no. 9, pp. 850-863, Sep. 1993.
[10] J. Ho, K.-C. Lee, M.-H. Yang, and D. Kriegman, "Visual tracking using learned linear subspaces," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Washington, DC, Jun./Jul. 2004, pp. 228-233.
[11] R. I. Hartley and F. Schaffalitzky, "L-1 minimization in geometric reconstruction problems," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Washington, DC, Jun./Jul. 2004, pp. 504-509.


[12] S. C. Zhu, Y. N. Wu, and D. Mumford, "Minimax entropy principle and its application to texture modeling," Neural Computation, vol. 9, no. 8, pp. 1627-1660, Nov. 1997.
[13] C. Liu, S. C. Zhu, and H.-Y. Shum, "Learning inhomogeneous Gibbs model of faces by minimax entropy," in Proc. Int'l Conf. Computer Vision, Vancouver, Canada, Jul. 2001, pp. 281-287.
[14] I. Csiszar, "Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems," Annals of Statistics, vol. 19, no. 4, pp. 2032-2066, Dec. 1991.
[15] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, MA, 1999.
[16] N. Merhav and M. Feder, "A strong version of the redundancy-capacity theorem of universal coding," IEEE Trans. Inform. Theory, vol. 41, no. 3, pp. 714-722, May 1995.
[17] I. Csiszar, "I-divergence geometry of probability distributions and minimization problems," Annals of Probability, vol. 3, no. 1, pp. 146-158, Jan. 1975.
[18] R. Sibson, "Information radius," Z. Wahrscheinlichkeitstheorie verw. Geb., vol. 14, no. 1, pp. 149-160, Jan. 1969.
[19] N. Jardine and R. Sibson, Mathematical Taxonomy, John Wiley & Sons, London, UK, 1971.
[20] A. O. Hero, B. Ma, O. Michel, and J. Gorman, "Applications of entropic spanning graphs," IEEE Signal Processing Mag., vol. 19, no. 5, pp. 85-95, Sep. 2002.
[21] Y. He, A. B. Hamza, and H. Krim, "A generalized divergence measure for robust image registration," IEEE Trans. Signal Processing, vol. 51, no. 5, pp. 1211-1220, May 2003.
[22] D. M. Endres and J. E. Schindelin, "A new metric for probability distributions," IEEE Trans. Inform. Theory, vol. 49, no. 7, pp. 1858-1860, Jul. 2003.
[23] F. Topsøe, "Some inequalities for information divergence and related measures of discrimination," IEEE Trans. Inform. Theory, vol. 46, no. 4, pp. 1602-1609, Jan. 2000.
[24] J. Burbea and C. R. Rao, "On the convexity of some divergence measures based on entropy functions," IEEE Trans. Inform. Theory, vol. 28, no. 3, pp. 489-495, May 1982.


[25] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, 1968.
[26] I. Csiszar and J. G. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, Inc., New York, NY, 1981.
[27] L. D. Davisson and A. Leon-Garcia, "A source matching approach to finding minimax codes," IEEE Trans. Inform. Theory, vol. 26, no. 2, pp. 166-174, Mar. 1980.
[28] B. Y. Ryabko, "Comments on 'A source matching approach to finding minimax codes'," IEEE Trans. Inform. Theory, vol. 27, no. 6, pp. 780-781, Nov. 1981.
[29] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Computer and System Sciences, vol. 55, no. 1, pp. 119-139, Aug. 1997.
[30] J. Kivinen and M. K. Warmuth, "Boosting as entropy projection," in Proceedings of the Twelfth Annual Conference on Computational Learning Theory, Santa Cruz, CA, Jul. 1999, pp. 134-144.
[31] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, "Shape distributions," ACM Trans. Graphics, vol. 21, no. 4, pp. 807-832, Oct. 2002.
[32] S. Gordon, J. Goldberger, and H. Greenspan, "Applying the information bottleneck principle to unsupervised clustering of discrete and continuous image representations," in Proc. Int'l Conf. Computer Vision, Nice, France, Oct. 2003, pp. 370-396.
[33] C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Blobworld: image segmentation using expectation-maximization and its application to image querying," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 8, pp. 1026-1038, Aug. 2002.
[34] Y. Rubner, C. Tomasi, and L. Guibas, "A metric for distributions with applications to image databases," in Proc. Int'l Conf. Computer Vision, Bombay, India, Jan. 1998, pp. 59-66.
[35] J. Puzicha, J. M. Buhmann, Y. Rubner, and C. Tomasi, "Empirical evaluation of dissimilarity measures for color and texture," in Proc. Int'l Conf. Computer Vision, Kerkyra, Greece, Sep. 1999, pp. 1165-1172.
[36] M. N. Do and M. Vetterli, "Wavelet-based texture retrieval using generalized Gaussian density and Kullback-Leibler distance," IEEE Trans. Image Processing, vol. 11, no. 2, pp. 146-158, Feb. 2002.


[37] M. Varma and A. Zisserman, "Texture classification: Are filter banks necessary?," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Madison, WI, Jun. 2003, pp. 691-698.
[38] S. M. Omohundro, "Bumptrees for efficient function, constraint, and classification learning," in Advances in Neural Information Processing Systems, Denver, CO, Nov. 1990, vol. 3, pp. 693-699.
[39] P. N. Yianilos, "Data structures and algorithms for nearest neighbor search in general metric spaces," in Proc. ACM-SIAM Symp. on Discrete Algorithms, Austin, TX, Jan. 1993, pp. 311-321.
[40] J. K. Uhlmann, "Satisfying general proximity/similarity queries with metric trees," Information Processing Letters, vol. 40, no. 4, pp. 175-179, Nov. 1991.
[41] A. Moore, "The anchors hierarchy: Using the triangle inequality to survive high-dimensional data," in Proc. on Uncertainty in Artificial Intelligence, Stanford, CA, Jun./Jul. 2000, pp. 397-405.
[42] M. Varma and A. Zisserman, "Classifying images of materials: Achieving viewpoint and illumination independence," in Proc. European Conf. Computer Vision, Copenhagen, Denmark, May/Jun. 2002, pp. 255-271.
[43] T. Leung and J. Malik, "Recognizing surfaces using three-dimensional textons," in Proc. Int'l Conf. Computer Vision, Kerkyra, Greece, Sep. 1999, pp. 1010-1017.
[44] K. J. Dana, B. van Ginneken, S. K. Nayar, and J. J. Koenderink, "Reflectance and texture of real-world surfaces," ACM Trans. Graphics, vol. 18, no. 1, pp. 1-34, Jan. 1999.
[45] E. Levina and P. Bickel, "The earth mover's distance is the Mallows distance: some insights from statistics," in Proc. Int'l Conf. Computer Vision, Vancouver, Canada, Jul. 2001, pp. 251-256.
[46] N. Vasconcelos, "On the complexity of probabilistic image retrieval," in Proc. Int'l Conf. Computer Vision, Vancouver, Canada, Jul. 2001, pp. 400-407.
[47] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser, "The Princeton shape benchmark," in Shape Modeling International, Genova, Italy, Jun. 2004, pp. 167-178.
[48] M. Isard and A. Blake, "Condensation - conditional density propagation for visual tracking," Int'l J. of Computer Vision, vol. 29, no. 1, pp. 5-28, Jan. 1998.
[49] M. Isard and A. Blake, "A mixed-state condensation tracker with automatic model-switching," in Proc. Int'l Conf. Computer Vision, Bombay, India, Jan. 1998, pp. 107-112.


[50] N. Merhav and M. Feder, "Universal prediction," IEEE Trans. Inform. Theory, vol. 44, no. 6, pp. 2124-2147, Oct. 1998.


BIOGRAPHICAL SKETCH

A native Floridian, Eric Spellman grew up on Florida's Space Coast, graduating from Satellite High School in 1998. Thereafter he attended the University of Florida, receiving his Bachelor of Science in mathematics in 2000, his Master of Engineering in computer information science and engineering in 2001, and, under the supervision of Baba C. Vemuri, his Doctor of Philosophy in the same in 2005. After graduating he will return to the Space Coast with his wife Kayla and daughter Sophia to work for Harris Corporation.