Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: Learning with Iwasawa Coordinates
Material Information
Alternate Title: Department of Computer and Information Science and Engineering Technical Report
Physical Description: Book
Language: English
Creators: Jian, Bing; Vemuri, Baba C.
Publisher: Department of Computer and Information Science and Engineering, University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: August 29, 2006
Copyright Date: 2006
Record Information
Bibliographic ID: UF00095640
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.



Learning with Iwasawa Coordinates

Bing Jian    Baba C. Vemuri†
Department of Computer and Information Science and Engineering
University of Florida
Gainesville, FL 32611

August 29, 2006

Finding a good metric over the input space plays a fundamental role in machine
learning. Most existing techniques assume the Mahalanobis metric without incor-
porating the geometry of P_n, the space of n × n symmetric positive-definite (SPD)
matrices, which leads to difficulties in the optimization procedure used to learn the
metric. In this paper, we introduce a novel algorithm to learn the Mahalanobis
metric using a natural parametrization of Pn. The data are then transformed by the
learned metric into another linear space. This linear space however does not have
the required structure needed to significantly improve the classification of data that
are not linearly separable. Therefore, we develop an efficient algorithm to map
this transformed input data space onto the known curved space of positive definite
matrices, P_n, and empirically show that this mapping yields superior clustering
results (in comparison to the state-of-the-art) on several well-known published data sets. A key
advantage of mapping the data to Pn as opposed to an infinite dimensional space
using a kernel (as in SVMs) is that P_n is finite-dimensional, its geometry is fully
known and therefore one can incorporate its Riemannian structure into nonlinear
learning tasks.

1 Introduction

In many machine learning and data mining problems, the underlying distance mea-
sures (or metrics) over the input space play a fundamental role. For example, the per-
formance of the nearest neighbor algorithm, multi-dimensional scaling, and clustering
algorithms such as K-means all depend critically on whether the metric used truly re-
flects the underlying relationships between the input instances. The problem of finding
a good metric over the input space has attracted extensive attention recently. Several
recent papers have focused on the problem of automatically learning a distance func-
tion from examples ([1, 13, 16, 20]). Most existing works assume the metrics to be
*This paper has been submitted to NIPS 2006 for review.
†This research was in part supported by the grants NIH R01 NS046812 and NIH R01 EB007082.

quadratic forms parameterized by symmetric positive definite (SPD) matrices, which
leads to a constrained optimization problem. Various techniques have been used in previ-
ous work towards learning an SPD matrix. For example, Xing et al. [20] solved this
problem by forcing the negative eigenvalues in the learned symmetric matrix to zero.
Bar-Hillel et al. [1] proposed a Relevant Component Analysis (RCA) algorithm where
the covariance matrix of the centered data points in small subsets of points with known
relevant information is used as the inverse matrix in the Mahalanobis metric. Tsuda
et al. [16] introduced the matrix exponential gradient update which preserves symme-
try and positive definiteness due to the fact that the matrix exponential of a symmetric
matrix is always an SPD matrix. More recently, Shalev-Shwartz et al. [13] proposed
an on-line algorithm for learning a kernel matrix when only some of class labels of
the examples were provided. The metric is trained by assigning an upper bound for
intra-class distances and a lower bound for inter-class distances and then inducing a
margin by a hinge loss function. Globerson and Roweis [7] proposed an interesting
algorithm called Maximally Collapsing Metric Learning (MCML) which tries to map
all points in the same class to a single location in the feature space via a stochastic se-
lection rule. Weinberger et al. [19] described a novel model which incorporates
nearest-neighbor constraints to learn a Mahalanobis distance for use in classification
tasks. Though different numerical optimization techniques including standard methods
such as semi-definite programming, have been utilized, most existing works resort to
an update-and-correction approach which projects back onto the positive (semi)definite
cone at each update.
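To make the projection step concrete, here is a minimal sketch (our illustration, not any particular author's implementation) of the eigenvalue-clipping projection onto the PSD cone used by such update-and-correct schemes:

```python
import numpy as np

def project_to_psd(M):
    """Project a square matrix onto the positive semidefinite cone
    by symmetrizing and clipping negative eigenvalues to zero."""
    S = (M + M.T) / 2.0            # nearest symmetric matrix
    w, V = np.linalg.eigh(S)       # eigendecomposition (O(n^3) per update)
    w = np.clip(w, 0.0, None)      # force negative eigenvalues to 0
    return V @ np.diag(w) @ V.T

A = np.array([[2.0, 0.0], [0.0, -1.0]])  # indefinite matrix
P = project_to_psd(A)
print(np.linalg.eigvalsh(P))             # all eigenvalues now >= 0
```

The O(n^3) eigendecomposition at every update is exactly the cost that motivates the Iwasawa parameterization introduced below.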
Moreover, in none of the above works is the geometry of P_n, the space of n × n SPD
matrices, exploited, and this in turn leads to difficulties in the optimization procedure.
Furthermore, the use of the Mahalanobis metric is equivalent to performing a linear trans-
form on the input space and hence fails to achieve good data separability in many
situations. In this paper, we introduce the Iwasawa coordinates as a natural parame-
terization of P_n. We show that the original complicated constrained optimization problem
can be transformed into a simpler constrained problem. Furthermore, we propose an
incremental learning algorithm, which maps the input Euclidean metric space to Pn,
which is a curved Riemannian manifold with its own intrinsic metric. The justification
for mapping the data onto P_n lies in the fact that the geometry of P_n is fully known
and its curved structure adds value to clustering and classification algorithms in giving
them the power to separate out data that are not linearly separable in the input space.

2 Learning a Mahalanobis metric using Iwasawa coordinates

Let Ω ⊂ R^n be the input feature space. The problem of learning a metric (or semi-
metric) over Ω is to find a nonnegative real-valued function d : Ω × Ω → R+ which
satisfies the specified requirements. Mathematically speaking, d is called a distance on
Ω if d is symmetric, i.e. d(i, j) = d(j, i) for all i, j ∈ Ω, and if d(i, i) = 0 holds for
all i ∈ Ω. Then, (Ω, d) is called a distance space. If d satisfies, in addition, the triangle
inequality, i.e. d(i, j) ≤ d(i, k) + d(j, k) for all i, j, k ∈ Ω, then d is called a semimet-

ric on Ω. Moreover, if d(i, j) = 0 holds only for i = j, then d is called a metric on Ω.
In general, finding a universally "good" metric suitable for different tasks and datasets
can be difficult if not impossible. Usually, in order to learn a data-dependent or context
dependent metric, some auxiliary data, or side-information, which is in addition to the
input data set, must be made available. This side-information may be represented either
as (i) a quantitative distance given by numerical values or (ii) qualitative evaluations
where distances between some pairs are known to be relatively smaller than others. In
this work, we focus on the latter case. Two general approaches have been studied by
several authors in the problem of semi-supervised clustering using equivalence con-
straints as side-information. The first class is called constraint-based approach where
the user-provided labels or pairwise constraints are used to guide the algorithm towards
a more appropriate data partitioning. For example, the COP-KMeans algorithm [18]
enforces the must-link and cannot-link constraints during the clustering process; in [2]
the clustering is initialized and constrained based on labelled examples. Equiva-
lence constraints have also been considered in the estimation of a Gaussian mixture model
using the Expectation Maximization (EM) algorithm [14]. A second theme of research,
to which our current work belongs, focuses on learning a "good" metric using equiva-
lence constraints. Examples of previous work in this area include [1, 3, 9, 12, 13, 20].
More specifically, the prior knowledge on (dis)similarity from small groups of data is
assumed to take the form of (i, j, v) ∈ Ω × Ω × {+1, -1}. Each example is com-
posed of an instance pair (i, j) and an equivalence flag v, which equals +1 if i and j are
considered similar and -1 otherwise. Note that a pair with flag +1 only implies that the
two objects associated with this pair are known to originate from the same class (or
with large probability), although their own labels are still unknown, as in a clustering
or classification problem. Now the goal is to learn a distance (semi)metric d(i, j) over
Ω which respects the given side-information. Most existing works assume the met-
ric to be in the form of the Mahalanobis distance, i.e., the square root of a quadratic
form: d_A(x, y) = ||x - y||_A = sqrt((x - y)^T A (x - y)), where A is a symmetric positive
(semi)definite matrix.
Let S denote the set of similar pairs and D the set of dissimilar pairs. A natural
way of defining a criterion for the desired metric is to demand that pairs in S, have,
say a small distance between them, while pairs of D have distance as large as possible.
For example, [20] defines the criterion to be the sum of squared distances between the
similar pairs:

    f(A) = Σ_{(x_i, x_j) ∈ S} ||x_i - x_j||_A²    (1)

where A is a symmetric positive definite matrix. To prevent A from shrinking to 0, an
inequality constraint of the form Σ_{(x_i, x_j) ∈ D} ||x_i - x_j||_A ≥ c can be added. Here
c is an arbitrarily chosen positive number, since most applications of metric learning
algorithms do not depend on the scale of the metric tensor.
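To make the notation concrete, the Mahalanobis distance and the cost in (1) can be sketched as follows; the toy pairs in S and D are hypothetical illustration data:

```python
import numpy as np

def mahalanobis(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)) for a PSD matrix A."""
    d = x - y
    return float(np.sqrt(d @ A @ d))

def cost(A, S):
    """Sum of squared distances over the similar pairs, as in Eq. (1)."""
    return sum(mahalanobis(x, y, A) ** 2 for x, y in S)

A = np.eye(2)                                        # identity: plain Euclidean
S = [(np.array([0.0, 0.0]), np.array([1.0, 0.0]))]   # one similar pair
D = [(np.array([0.0, 0.0]), np.array([0.0, 3.0]))]   # one dissimilar pair
print(cost(A, S))                                    # 1.0
# constraint from the text: sum of d_A over dissimilar pairs >= c
print(sum(mahalanobis(x, y, A) for x, y in D))       # 3.0
```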
As discussed in the introduction section, various techniques have been developed
to enforce the positivity constraint during the learning of the Mahalanobis metric but
without integrating the geometry of the underlying space, which in this case is P_n,
the space of SPD matrices. In [20], the domain of optimization for the above function
is Euclidean and the update is performed directly on the matrix entries, while the

positive definiteness needs to be restored at each step by forcing the negative eigenvalues to 0.
This requires a diagonalization operation at each step, which leads to a computationally
expensive approach when the dimension of the target problem is high.
Here we propose a novel method which enables the optimization of the same cost
function to be carried directly on the curved space P, as opposed to the Euclidean
space. A iginilik.iili feature of our method is the use of Iwasawa coordinates, a natural
parameterization of P,. By choosing this Iwasawa coordinates, the constructed matri-
ces will always stay on P, and hence there is no need to further enforce the positive
definite constraint during the optimization via projections as in [20].
As an analogue of the rectangular coordinates of Euclidean space, the so-called
Iwasawa coordinates [15] are defined as follows for Y ∈ P_n:

    Y = ( V   0 ) [ ( I_p   X  ) ]
        ( 0   W ) [ ( 0    I_q ) ]    (2)

where V ∈ P_p, W ∈ P_q, X ∈ R^{p×q}, and Y[g] denotes g^T Y g. Note that the above de-
composition can always be solved uniquely for V, W and X once p, q and Y ∈ P_n are
given. Hence, for any matrix Y = V_n in P_n with n > 1, by representing V_n as a tuple
(V_{n-1}, x_{n-1}, w_{n-1}) and repeating the following partial Iwasawa decomposition:

    V_{n+1} = ( V_n    0  ) [ ( I_n   x_n ) ]  =  (     V_n           V_n x_n        )
              (  0    w_n ) [ (  0     1  ) ]     ( x_n^T V_n   x_n^T V_n x_n + w_n )    (3)

where V_n ∈ P_n, w_n > 0 and x_n ∈ R^n, we finally get the following vectorized form:

    iwasawa(V_n) = (((w_0, x_1, w_1), x_2, w_2), ..., x_{n-1}, w_{n-1})    (4)

which we term the full Iwasawa coordinates and adopt in the rest of this paper. Note that
the components w_i at positions 1, 3, 6, ..., i(i + 1)/2, ... are called diagonal
elements and have to be positive, while the off-diagonal elements x_i can be any real num-
bers. Let vec(A) be the column vector created from a matrix A by stacking its column
vectors and vech(A) be the compact form with the upper portion excluded when A
is symmetric. Then the Jacobian of the one-to-one transformation in (4) can be easily
derived from (3) in a recursive fashion:

    J_{n+1} = ∂ vech(V_{n+1}) / ∂ iwasawa(V_{n+1})

            = (          J_n               0          0 )
              ( (x_n^T ⊗ I_n) S_n J_n     V_n         0 )    (5)
              ( vech(x_n x_n^T)^T J_n   2 x_n^T V_n   1 )

where S_n is the n² × n(n + 1)/2 matrix of 0s and 1s such that vec(·) = S_n vech(·) and
⊗ denotes the Kronecker product. With this Iwasawa coordinate system, the gradient-
based techniques can be used to optimize cost functions like (1) or other forms. For
the reader's convenience, the proposed learning algorithm is outlined in Algorithm 1.
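The recursion (3) is easy to implement directly; the following sketch (our illustration, not the authors' code) builds an SPD matrix from a set of Iwasawa coordinates and confirms that any real off-diagonal values, combined with positive diagonal coordinates, yield an SPD result with no projection step:

```python
import numpy as np

def from_iwasawa(w, xs):
    """Build an SPD matrix from Iwasawa coordinates via the recursion (3):
    w  : list of n positive diagonal coordinates [w_0, ..., w_{n-1}]
    xs : list of off-diagonal coordinate vectors; xs[k] has length k + 1."""
    V = np.array([[w[0]]])
    for k, wk in enumerate(w[1:]):
        x = np.asarray(xs[k]).reshape(-1, 1)       # x_k in R^{k+1}
        Vx = V @ x
        V = np.block([[V, Vx],
                      [Vx.T, x.T @ V @ x + wk]])   # Eq. (3)
    return V

# arbitrary real x's with positive w's -> always SPD
Y = from_iwasawa([2.0, 1.0, 0.5], [[3.0], [-1.0, 4.0]])
print(np.all(np.linalg.eigvalsh(Y) > 0))           # True
```

A useful sanity check: by the Schur-complement identity, det(V_{n+1}) = det(V_n) w_n, so the determinant of the result is simply the product of the diagonal coordinates.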
To prevent A from shrinking to 0, we simply put constraints on the diagonal
elements in the Iwasawa coordinates, e.g. w_i ≥ c, where c can be an arbitrarily chosen
positive number. Note that this constraint is equivalent to putting a lower bound on the L2
norm of the matrix. We point out that the Iwasawa coordinate system, as a natural

Algorithm 1: Gradient-based algorithm for minimizing the cost function f(A)
input : S and D (optional): sets of pairwise (dis)similarity constraints or
        labeled examples
output: A ∈ P_n to be used in the Mahalanobis metric
1 begin
2     Heuristically initialize A to some SPD matrix
3     repeat
4         iwasawa(A) ← iwasawa(A) - ε (∇_{iwasawa(A)} f)
5     until convergence
6     A ← iwasawa^{-1}(iwasawa(A))
7 end

parametrization of P_n, can be used in many other applications where an SPD matrix
needs to be learned.
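As an illustration of this style of update in a tiny case (n = 2), the following sketch minimizes an Eq.-(1)-style cost over the coordinates (w_0, x_1, w_1) with a numerically estimated gradient; the data, step size, and clamping threshold are arbitrary choices for the demo, not the authors' settings:

```python
import numpy as np

def spd_from_coords(c):
    """2x2 SPD matrix from Iwasawa coordinates c = (w0, x1, w1), per Eq. (3)."""
    w0, x1, w1 = c
    return np.array([[w0, w0 * x1],
                     [w0 * x1, w0 * x1**2 + w1]])

def f(c, S):
    """Eq.-(1)-style cost: squared Mahalanobis distance over similar pairs."""
    A = spd_from_coords(c)
    return sum(float((x - y) @ A @ (x - y)) for x, y in S)

S = [(np.array([0.0, 0.0]), np.array([1.0, 1.0]))]   # one similar pair
c = np.array([2.0, 0.5, 2.0])                        # initial coords, w's > 0
eps, step = 1e-6, 0.05
for _ in range(200):
    g = np.array([(f(c + eps * e, S) - f(c - eps * e, S)) / (2 * eps)
                  for e in np.eye(3)])               # central-difference gradient
    c -= step * g
    c[0] = max(c[0], 1e-3); c[2] = max(c[2], 1e-3)   # keep w_i >= c > 0
print(np.linalg.eigvalsh(spd_from_coords(c)))        # still SPD after descent
```

Because positivity is maintained on the diagonal coordinates alone, no eigendecomposition or projection is ever needed during the descent.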
Several recent studies [1, 7, 13, 19] on Mahalanobis metric learning have also used
the trick of changing the parameters representing the SPD matrices. The most widely
used representation is the factorization A = L^T L, which interprets the Mahalanobis
distance metric as performing a linear transformation on the input. Note that the L
satisfying A = L^T L is not unique¹, while the Iwasawa decomposition is.
Though, as pointed out in [7], the factorization A = L^T L turns a convex problem (in
A) into a non-convex problem (in L), and the Iwasawa parameterization shares this prob-
lem of not preserving convexity, the induced domain is still a convex body².
However, the one-to-one property of the Iwasawa coordinates and the existence of an
analytical Jacobian enable the use of fast gradient-based optimization techniques, and
at each update there is no need for an eigendecomposition, which can be quite expensive.
More importantly, one key advantage of the Iwasawa coordinates over other parameter-
izations is that they are closely related to the intrinsic geometry of the manifold of SPD
matrices, which we will exploit in the next section.

3 Learning in the Curved Space P_n

Clearly, the fact that (x - y)^T A (x - y) = ||Lx - Ly||² where A = L^T L implies that
there exists a linear transformation providing an isometric embedding from the metric
space (R^n, d_A) to a standard Euclidean space (R^n, d_I). From this point of view, the
use of a Mahalanobis metric still results in a flat feature space and may not be able to
achieve good data separability in many situations. To illustrate this problem, consider
two pairs of points (x_1, y_1) and (x_2, y_2) such that (x_1 - y_1) = (x_2 - y_2). Clearly, no
matter what the matrix A is, d_A(x_1, y_1) will always equal d_A(x_2, y_2). But it is
possible to have constraints which require these two distance values to be different.
¹Consider the square root and the Cholesky factorization of an SPD matrix.
²Actually, the convexity of this problem can be preserved using the Iwasawa decomposition, according to:
R. Vanderbei and H. Benson. On Formulating Semidefinite Programming Problems as Smooth Convex
Nonlinear Optimization Problems. TR-ORFE-99-01, Princeton University.

A common preprocessing strategy is to use a non-linear mapping function φ : Ω →
F that maps the data into some high-dimensional feature space F and then to perform the
learning in F [17]. The feature space usually is an infinite-dimensional inner product
space. In the following, we propose a novel learning technique which incrementally
maps R^n, the input Euclidean metric space, to the space of SPD matrices, P_n, a curved
Riemannian manifold with its own intrinsic metric and a well-studied geometry. Note
that P_n, although a subset of a vector space, is not a vector space itself.

3.1 An Incremental Learning Algorithm
It is well known that P_n is a complete Riemannian manifold of nonpositive sectional
curvature and that there is precisely one geodesic connecting any two points in P_n (see
[8, pg. 203]). Hence it is natural to measure intrinsic distance in P_n using geodesic
length. Moreover, there exists an invariant metric on P_n which is given in terms of the arc
length element ds² = trace((Y^{-1} dY)²) for any Y ∈ P_n. It has been shown in [15]
that the arc length element can be expressed in the partial Iwasawa coordinates given in (2):

    ds² = ds_V² + ds_W² + 2 trace(V^{-1} W[(dX)^T])    (6)

where ds_V² is the element of arc length in P_p, ds_W² is the analogue of arc length in
P_q, and dX is the matrix of differentials dx_ij if X = (x_ij) ∈ R^{p×q}. Furthermore, the
geodesic distance between two arbitrary points A and B in P_n can be derived via the group
action and the invariance property. Due to lack of space, we omit the detailed derivation
and simply give the distance as d²(A, B) = Σ_{i=1}^n log² λ_i, where the λ_i are solutions to
the generalized eigenvalue problem det(λA - B) = 0.
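This distance is straightforward to compute from the generalized eigenvalues; a sketch using scipy.linalg.eigh (our illustration):

```python
import numpy as np
from scipy.linalg import eigh

def geodesic_dist(A, B):
    """d(A, B) = sqrt(sum_i log^2(lambda_i)) where det(lambda*A - B) = 0,
    i.e. the lambda_i are generalized eigenvalues of the pair (B, A)."""
    lam = eigh(B, A, eigvals_only=True)   # solves B v = lambda A v
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

A = np.eye(2)
B = np.diag([np.e ** 2, 1.0])             # eigenvalues e^2 and 1 w.r.t. A
print(geodesic_dist(A, B))                # 2.0
print(geodesic_dist(A, A))                # 0.0 (distance to itself)
```

Since both arguments are SPD, the generalized eigenvalues are real and positive, so the logarithms are always well defined.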
The algorithm we present here is based on the key assumption that the input in-
stances are sampled from the high-dimensional curved space, in this work, P_n. First,
it is intuitively clear, by looking at (6), that any space consisting of matrices with fixed off-diagonal x-
coordinates in (4) is a flat subspace of P_n. Hence there exists a map-
ping φ (the isometric embedding) from R^n to P_n such that d_E(x, y) = d(φ(x), φ(y))
for all x, y ∈ R^n. Based on this fact, we further assume that, for each instance, only its
projection onto a flat subspace is observed. Now the problem is to recover the under-
lying curved structure with the help of the side-information. We shall rewrite Eqn. (1) as

    f(φ) = Σ_{(x_i, x_j) ∈ S} d_{P_n}(φ(x_i), φ(x_j))²    (7)

where φ is the map from R^n to P_n and d_{P_n} is the intrinsic geodesic distance on P_n.
To simplify the problem, we further assume there exists a linear map ψ from the
observed {w_i} coordinates to the hidden {x_i} coordinates, with in total m = n(n - 1)/2
elements, i.e. ψ can be represented by an m × n matrix. Obviously, if ψ turns out to be
the zero map, as a very special case, then we end up with a Euclidean flat space. Instead
of learning an m × n matrix in one shot, we choose an incremental approach which
incrementally increases the "curvedness" of the input space. More specifically, the
desired map ψ from R^n to P_n is determined from a set of linear maps {ψ_i}_{i=1}^l, where
each ψ_i maps R^n to R^i and l ≤ n is the level of this "curvedness"-increasing procedure.

At each step i, the off-diagonal coordinates {x_i} of each instance are recovered by
performing a linear transform ψ_i on {w_i}. Algorithm 2 summarizes the proposed
incremental method, which minimizes the cost function f, generally of the form
of (7), with optional constraints from dissimilarity information.

Algorithm 2: Incremental procedure for recovering the off-diagonal coordinates
input : S: side-information; l: level of the incremental algorithm;
        f: the cost function which respects the side-information using the metric on P_n
output: {ψ_i}_{i=1}^l
1 begin
2     for i = 1 to l do
3         ψ_i = arg min_ψ f(ψ)    (Note: ∇f can be numerically estimated from (6))
4 end

3.2 Clustering on P_n
Once the underlying embedding from R^n to P_n is estimated, we obtain a data set lying on
the curved space P_n. Many classical learning algorithms for the Euclidean space can be
extended to this new setting. In this work, we will focus on the clustering problem as
an application to show the potential of our proposed method.
As one of the most popular clustering techniques, K-Means partitions a dataset into
K clusters by iteratively minimizing the sum, over all clusters, of the within-cluster
sums of point-to-cluster-centroid distances. Usually the squared Euclidean distances
are used in K-means in R". In order to perform K-Means type clustering, one has
to develop algorithms fully respecting the geometry of the underlying space. More
specifically, the cluster means and the distances should be computed in terms of the
intrinsic Riemannian geodesics.
Given a set of N observations x_1, ..., x_N from an unknown distribution on P_n, one
can calculate the sample mean by the usual linear average: x̄ = (1/N) Σ_{i=1}^N x_i. Since P_n
is a convex set, x̄ still lies within P_n. However, linear averages do not fit
the natural nonlinear geometry of P_n well. For example, the convex combination of
any two matrices in P_n with the same determinant always gives a matrix with a larger
determinant. It is well known that, in a normed vector space V, the average of a set
{x_1, ..., x_N} can be uniquely characterized as x̄ := (1/N) Σ_{i=1}^N x_i = arg min_{x ∈ V} Σ_i ||x -
x_i||². Note that this least-squares property still makes sense if the linear structure of V is
replaced by any metric space. Based on this, the so-called Fréchet mean of a finite
subset Q of a metric space (Ω, d) can be defined as the element of Ω at which the
function p ↦ Σ_{q ∈ Q} d²(p, q) is minimized. A similar definition has been generalized in [8]
to differentiable manifolds:

Definition 1 Let M be a complete, simply connected, non-positively curved Rieman-
nian manifold endowed with a metric d(·,·) and let μ be a probability measure on M,

i.e. μ(M) = ∫ dμ = 1. A point q ∈ M is called a center of mass for μ if

    ∫_M d²(q, y) dμ(y) = inf_{p ∈ M} ∫_M d²(p, y) dμ(y) < ∞    (8)

[8] further shows that the function f(p) = ∫ d²(p, y) dμ(y) is a strictly convex function
on P_n, which is non-positively curved, and thus the center of mass is uniquely defined
for a probability measure on P_n. In practical problems, datasets are usually given in
a discrete form, hence the probability measure can be assumed to be a mixture of Dirac
measures. Under this assumption, the intrinsic mean of a set of points x_1, ..., x_N in
P_n can be defined as the unique minimizer of the sum of squared geodesic distances:
x̄ = arg min_{x ∈ M} Σ_{i=1}^N d²(x, x_i), which is clearly a limiting case of Eqn. (8). The
function f(p) is differentiable, with gradient ∇f(p) = -∫ log_p(q) dμ(q),
where log_p : M → T_p M is the inverse of the exponential map exp_p. Thus, the center
of mass of μ, where ∇f(p) reaches zero, can be computed iteratively as follows:

    p_0 = p and p_{k+1} = exp_{p_k}(-ε_k ∇f(p_k))

A proof of convergence of the sequence {p_k : k ≥ 0} can be found in [10].
Moreover, equipped with the geodesic distance and the intrinsic mean, the classical K-
means algorithm and its variants can be naturally extended to P_n. In our experiments, we simply
modified the K-means implementation provided in MATLAB by supplying the distance function and
the mean computation discussed above. Many existing techniques which accelerate K-
means by taking advantage of the triangle inequality, for instance [6], can still be applied,
since the distance function here is indeed a metric.
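The intrinsic-mean iteration above can be sketched with the standard exp/log maps on P_n built from matrix exponentials and logarithms (an illustrative implementation; the fixed-point form, step size, and iteration count are our choices):

```python
import numpy as np
from scipy.linalg import expm, logm, sqrtm

def log_map(p, q):
    """log_p(q): inverse of the exponential map at p on P_n."""
    s = np.real(sqrtm(p)); si = np.linalg.inv(s)
    return s @ np.real(logm(si @ q @ si)) @ s

def exp_map(p, X):
    """exp_p(X): exponential map at p on P_n."""
    s = np.real(sqrtm(p)); si = np.linalg.inv(s)
    return s @ np.real(expm(si @ X @ si)) @ s

def karcher_mean(mats, iters=50, step=1.0):
    """Fixed-point iteration: p_{k+1} = exp_{p_k}(step * mean_i log_{p_k}(x_i)),
    i.e. a gradient step on the sum-of-squared-geodesic-distances cost."""
    p = mats[0]
    for _ in range(iters):
        g = sum(log_map(p, x) for x in mats) / len(mats)
        p = exp_map(p, step * g)
    return p

mats = [np.diag([1.0, 1.0]), np.diag([4.0, 1.0])]
m = karcher_mean(mats)
print(m)   # geometric mean: diag(2, 1)
```

For commuting matrices, as in this example, the intrinsic mean reduces to the matrix geometric mean, which is why diag(2, 1) rather than the linear average diag(2.5, 1) is returned.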

4 Experiments: Application to Clustering

As mentioned in the introduction, the main goal of our research presented here is to uti-
lize the side information in the form of pairwise equivalence constraints to improve the
performance of unsupervised learning techniques. In order to test our proposed method
and to compare it with the previous work, we conduct experiments on six data sets from
UC Irvine repository [5], which were used in [1, 4, 20]. First, for each dataset and each
run of the algorithm, we randomly selected a set of pairwise equivalence constraints as
input to the algorithms. After the metric was learned, the K-means clustering algo-
rithm was run on the linearly transformed dataset. To evaluate the performance gain
from using the learned metrics, we compare against K-means with the Eu-
clidean distance on the original dataset, i.e. using no side-information. For the purpose of
comparison, we have also run the RCA method [1], to learn the metric from the same
side-information and then transform the dataset accordingly before feeding it as input
to the K-means clustering algorithm. Finally, we applied our method, which maps R^n
to P_n and performs the modified K-means clustering scheme on the curved space P_n.
The input to our algorithm can be either the given feature space or a linearly trans-
formed feature space, where the linear transformation for example is learned from the
input data using either our method or the RCA technique [1].
As in [1, 20], all of our experiments used K-means with multiple restarts. To eval-
uate the accuracy of the clustering results, we have used the variation of information
(VI) to compare the results with the ground truth.³ Variation of information is a
semi-metric for measuring the amount of information that is lost or gained in changing
from one random variable to another. Recent work [11] suggested that
variation of information can be used as a good criterion for comparing clusterings and
showed nice properties of VI within this context. By associating a clustering result with
K clusters to a discrete random variable taking K values, one can define the variation
of information between two clusterings C and C' as H(C) + H(C') - 2I(C, C'), where
H(C) denotes the entropy associated with a clustering C and I(C, C') is the mutual
information between the two clusterings.
Figure 1 shows the results of all the clustering schemes described above using 15%
of side-information. In this figure, the heights of the bars display the values of the
variation of information between clustering results and ground truth classification. The
smaller these heights, the better is the performance. The results were averaged over
20 random selections of side-information. The fraction of the dataset used to extract the
side-information in this experiment is chosen to be 15% of the whole dataset.
We also show how the quality of clustering algorithms improves with the amount of
side-information through one typical example.
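The variation-of-information score used above can be computed directly from two label assignments (a minimal sketch; the example labelings are hypothetical):

```python
import numpy as np

def variation_of_information(c1, c2):
    """VI(C, C') = H(C) + H(C') - 2 I(C, C') from two label arrays."""
    c1, c2 = np.asarray(c1), np.asarray(c2)
    n = len(c1)

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / n
        return -np.sum(p * np.log(p))

    # joint entropy over the contingency of (c1, c2) label pairs
    pairs = np.array([c1, c2]).T
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    pj = counts / n
    h_joint = -np.sum(pj * np.log(pj))

    mi = entropy(c1) + entropy(c2) - h_joint         # I(C, C')
    return entropy(c1) + entropy(c2) - 2 * mi        # = 2*H_joint - H(C) - H(C')

print(variation_of_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0 (identical)
print(variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]))  # 0.0 (relabeling)
print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 2 ln 2 (independent)
```

Note that VI is invariant to relabeling of the clusters, which is exactly the property needed when comparing a clustering result against a ground-truth classification.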
Figure 2 depicts plots of accuracy of clustering vs. amount of side information
for the various methods described in the figure caption. As expected, for our method and
the others, the accuracy increases with increasing side-information. However, as is evident
from the plots, the increase is more significant for our technique compared to
the others.
5 Conclusions

In this paper, we presented a novel algorithm to learn a Mahalanobis metric that is
tailored to the input data. The learning algorithm makes use of the geometry of the
space of SPD matrices, P_n, that contains this metric as an element. We also presented
an efficient algorithm to map the input data onto the curved space P_n so as to facili-
tate clustering of data that are nonlinearly separable. Unlike the popular kernel-based
methods, which map the data into an infinite-dimensional space and do not make use
of the geometry of the target space in the classifier, our algorithm maps the data onto a
finite-dimensional target manifold P_n and uses its known Riemannian structure in the
clustering technique to achieve superior results compared to some of the existing
methods. The mapping of the data onto P_n in our algorithm requires side-information and
involves incrementally updating the "curvedness" of the embedding space. We used
a natural parameterization of P_n, called the Iwasawa coordinates, to achieve this updating.
Following this embedding process, we used the geodesic distance on the embedding
space P_n to perform K-means clustering, where the mean is defined intrinsically. Fi-
nally, we tested this algorithm on publicly available data sets from the UCI repository
and presented comparisons to several existing methods depicting superior performance
of our algorithm.
³Note that in all experiments we only use the (dis)similarity pairwise constraints but no labeled training
examples; hence no classification error rates are reported here.

[Figure 1 (bar plots): soybean N=47 D=35 K=4; protein N=116 D=20 K=6; diabetes N=768 D=8 K=2; balance N=625 D=4 K=3; each panel groups three "Flat" bars and three "Curved" bars; y-axis: variation of information.]
Figure 1: Clustering accuracy on six data sets. In each plot, the three bars on the left correspond
to a learning experiment with the flat space, and the three bars on the right correspond to the curved
space P_n. From left to right, the six bars are respectively: (a) K-means over the original space
(without using any side-information); (b) K-means over the feature space created by RCA; (c)
K-means over the feature space created by the method in section 3.2; (d) Modified K-means over the
P_n space created by the method in section 3.3 based on the results from (a); (e) Modified K-means
over the P_n space created by the method in section 3.3 based on the results from (b); (f) Modified
K-means over the P_n space created by the method in section 3.3 based on the results from (c). Also
shown are N, the number of points, K, the number of classes, and D, the dimension of the feature space.

[Remaining Figure 1 panels: iris N=150 D=4 K=3; wine N=178 D=12 K=3.]

[Figure 2 (line plot): variation of information vs. fraction of points used for side-information (0 to 0.4), with curves labeled (b), (c), and (f).]

Figure 2: Plot of accuracy vs. amount of side-information. The x-axis gives the fraction of
the points used to generate the constraints. The y-axis gives the accuracy measure in terms of
variation of information. The labels (a)-(f) denote the different schemes in the same order as in
Figure 1.


References

[1] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from
equivalence constraints. Journal of Machine Learning Research, 6:937-965, Jun 2005.

[2] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. In ICML,
pages 27-34, 2002.

[3] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity
measures. In KDD, pages 39-48, 2003.

[4] M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in
semi-supervised clustering. In ICML, 2004.

[5] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning
databases, 1998.

[6] C. Elkan. Using the triangle inequality to accelerate k-means. In ICML, pages 147-153,
2003.
[7] A. Globerson and S. Roweis. Metric learning by collapsing classes. In NIPS, 2005.

[8] J. Jost. Riemannian geometry and geometric analysis. Springer-Verlag, 2001.

[9] D. Klein, S. D. Kamvar, and C. D. Manning. From instance-level constraints to space-
level constraints: Making the most of prior knowledge in data clustering. In ICML, pages
307-314, 2002.

[10] H. Le. Estimation of Riemannian barycentres. LMS J. Comput. Math., 7:193-200, 2004.

[11] M. Meila. Comparing clusterings by the variation of information. In COLT, pages 173-
187, 2003.

[12] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In
NIPS, 2003.

[13] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and batch learning of pseudo-metrics.
In ICML, 2004.

[14] N. Shental, A. Bar-Hillel, T. Hertz, and D. Weinshall. Computing Gaussian mixture models
with EM using equivalence constraints. In S. Thrun, L. Saul, and B. Schölkopf, editors,
Advances in Neural Information Processing Systems 16, Cambridge, MA, 2003. MIT Press.

[15] A. Terras. Harmonic analysis on symmetric spaces and applications, Vol. 2. Springer, 1985.

[16] K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line
learning and Bregman projection. Journal of Machine Learning Research, 6:995-1018, Jun 2005.

[17] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

[18] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with
background knowledge. In ICML, pages 577-584, 2001.

[19] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest
neighbor classification. In NIPS, 2005.

[20] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell. Distance metric learning with appli-
cation to clustering with side-information. In NIPS, pages 505-512, 2002.
