Learning with Iwasawa Coordinates
Bing Jian, Baba C. Vemuri†
Department of Computer and Information Science and Engineering
University of Florida
Gainesville, FL 32611
{bjian,vemuri}@cise.ufl.edu
August 29, 2006
Abstract
Finding a good metric over the input space plays a fundamental role in machine
learning. Most existing techniques assume the Mahalanobis metric without incorporating
the geometry of P_n, the space of n × n symmetric positive-definite (SPD)
matrices, which leads to difficulties in the optimization procedure used to learn the
metric. In this paper, we introduce a novel algorithm to learn the Mahalanobis
metric using a natural parametrization of P_n. The data are then transformed by the
learned metric into another linear space. This linear space, however, does not have
the required structure needed to significantly improve the classification of data that
are not linearly separable. Therefore, we develop an efficient algorithm to map
this transformed input data space onto the known curved space of positive-definite
matrices, P_n, and empirically show that this mapping yields superior clustering
results (in comparison to the state of the art) on several well-published data sets. A key
advantage of mapping the data to P_n, as opposed to an infinite-dimensional space
using a kernel (as in SVMs), is that P_n is finite-dimensional, its geometry is fully
known, and therefore one can incorporate its Riemannian structure into nonlinear
learning tasks.
1 Introduction
In many machine learning and data mining problems, the underlying distance measures
(or metrics) over the input space play a fundamental role. For example, the performance
of the nearest-neighbor algorithm, multidimensional scaling, and clustering
algorithms such as K-means all depend critically on whether the metric used truly
reflects the underlying relationships between the input instances. The problem of finding
a good metric over the input space has attracted extensive attention recently. Several
recent papers have focused on the problem of automatically learning a distance function
from examples ([1, 13, 16, 20]). Most existing works assume the metrics to be
*This paper has been submitted to NIPS 2006 for review.
†This research was in part supported by the grants: NIH R01 NS046812 and NIH R01 EB007082.
quadratic forms parameterized by symmetric positive-definite (SPD) matrices, which
leads to a constrained optimization problem. Various techniques have been used in previous
work to learn an SPD matrix. For example, Xing et al. [20] solved this
problem by forcing the negative eigenvalues in the learned symmetric matrix to zero.
Bar-Hillel et al. [1] proposed a Relevant Component Analysis (RCA) algorithm in which
the covariance matrix of the centered data points in small subsets of points with known
relevance information is used as the inverse matrix in the Mahalanobis metric. Tsuda
et al. [16] introduced the matrix exponentiated gradient update, which preserves symmetry
and positive definiteness owing to the fact that the matrix exponential of a symmetric
matrix is always an SPD matrix. More recently, Shalev-Shwartz et al. [13] proposed
an online algorithm for learning a kernel matrix when only some of the class labels of
the examples are provided. The metric is trained by assigning an upper bound for
intra-class distances and a lower bound for inter-class distances and then inducing a
margin via a hinge loss function. Globerson and Roweis [7] proposed an interesting
algorithm called Maximally Collapsing Metric Learning (MCML), which tries to map
all points in the same class to a single location in the feature space via a stochastic
selection rule. Weinberger et al. [19] described a novel model that incorporates
nearest-neighbor constraints to learn a Mahalanobis distance for use in classification
tasks. Though different numerical optimization techniques, including standard methods
such as semidefinite programming, have been utilized, most existing works resort to
an update-and-correction approach which projects back onto the positive (semi)definite
cone at each update.
Moreover, in none of the above works is the geometry of P_n, the space of n × n SPD
matrices, exploited, and this in turn leads to difficulties in the optimization procedure.
In addition, the use of the Mahalanobis metric is equivalent to performing a linear transform
on the input space and hence fails to achieve quality data separability in many
situations. In this paper, we introduce the Iwasawa coordinates as the natural parameterization
of P_n. We show that the original complicated constrained optimization problem
can be transformed into a simplified constrained problem. Furthermore, we propose an
incremental learning algorithm, which maps the input Euclidean metric space to P_n,
a curved Riemannian manifold with its own intrinsic metric. The justification
for mapping the data onto P_n lies in the fact that the geometry of P_n is fully known
and its curved structure adds value to clustering and classification algorithms by giving
them the power to separate out data that are not linearly separable in the input space.
2 Learning a Mahalanobis metric using Iwasawa coordinates
Let Ω ⊂ R^n be the input feature space. The problem of learning a metric (or semi-metric)
over Ω is to find a nonnegative real-valued function d : Ω × Ω → R_+ which
satisfies the specified requirements. Mathematically speaking, d is called a distance on
Ω if d is symmetric, i.e. d(i, j) = d(j, i) for all i, j ∈ Ω, and if d(i, i) = 0 holds for
all i ∈ Ω. Then (Ω, d) is called a distance space. If d satisfies, in addition, the triangle
inequality, i.e. d(i, j) ≤ d(i, k) + d(j, k) for all i, j, k ∈ Ω, then d is called a semi-metric
on Ω. Moreover, if d(i, j) = 0 holds only for i = j, then d is called a metric on Ω.
In general, finding a universally "good" metric suitable for different tasks and datasets
can be difficult if not impossible. Usually, in order to learn a data-dependent or context-dependent
metric, some auxiliary data, or side-information, in addition to the
input data set, must be made available. This side-information may be represented either
as (i) a quantitative distance given by numerical values or (ii) qualitative evaluations
where distances between some pairs are known to be relatively smaller than others. In
this work, we focus on the latter case. Two general approaches have been studied by
several authors for the problem of semi-supervised clustering using equivalence constraints
as side-information. The first class is the constraint-based approach, where
the user-provided labels or pairwise constraints are used to guide the algorithm towards
a more appropriate data partitioning. For example, the COP-KMeans algorithm [18]
enforces the must-link and cannot-link constraints during the clustering process; in [2]
the clustering is initialized and constrained based on labelled examples. The equivalence
constraints have also been considered in the estimation of a Gaussian mixture model
using the Expectation Maximization (EM) algorithm [14]. A second theme of research,
to which our current work belongs, focuses on learning a "good" metric using equivalence
constraints. Examples of previous work in this area include [1, 3, 9, 12, 13, 20].
More specifically, the prior knowledge on (dis)similarity from small groups of data is
assumed to take the form of (i, j, v) ∈ Ω × Ω × {+1, −1}. Each example is composed
of an instance pair (i, j) and an equivalence flag v, which equals +1 if i and j are
considered similar and −1 otherwise. Note that a pair with flag +1 only implies that the
two objects associated with this pair are known to originate from the same class (or
with large probability), although their own labels are still unknown, as in a clustering
or classification problem. Now the goal is to learn a distance (semi-)metric d(i, j) over
Ω which respects the given side-information. Most existing works assume the metric
to be of the form of a Mahalanobis distance, i.e., the square root of a quadratic
form, d_A(x, y) = ||x − y||_A = ((x − y)^T A (x − y))^{1/2}, where A is a symmetric positive
(semi-)definite matrix.
Let S denote the set of similar pairs and D the set of dissimilar pairs. A natural
way of defining a criterion for the desired metric is to demand that pairs in S have,
say, a small distance between them, while pairs in D have distances as large as possible.
For example, [20] defines the criterion to be the sum of squared distances between the
similar pairs:

    f(A) = Σ_{(x_i, x_j) ∈ S} ||x_i − x_j||²_A    (1)
where A is a symmetric positive-definite matrix. To prevent A from shrinking to 0, an
inequality constraint of the form Σ_{(x_i, x_j) ∈ D} ||x_i − x_j||²_A ≥ c can be added. Here
c is an arbitrarily chosen positive number, since most applications of metric learning
algorithms do not depend on the scale of the metric tensor.
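For concreteness, the cost (1) and the scale constraint can be sketched in a few lines (a Python illustration with made-up toy pairs; the function names and data are ours, not the paper's):

```python
import numpy as np

def mahalanobis_sq(x, y, A):
    """Squared Mahalanobis distance (x - y)^T A (x - y)."""
    d = x - y
    return float(d @ A @ d)

def cost(A, similar_pairs):
    """Cost (1): sum of squared distances over the similar pairs S."""
    return sum(mahalanobis_sq(x, y, A) for x, y in similar_pairs)

def scale_constraint(A, dissim_pairs, c=1.0):
    """Constraint sum over D of ||x - y||_A^2 >= c, preventing A -> 0."""
    return sum(mahalanobis_sq(x, y, A) for x, y in dissim_pairs) >= c

# Toy data: two similar pairs and one dissimilar pair in R^2.
S = [(np.array([0., 0.]), np.array([1., 0.])),
     (np.array([0., 1.]), np.array([0., 2.]))]
D = [(np.array([0., 0.]), np.array([0., 5.]))]

A = np.eye(2)                      # identity metric = Euclidean distance
print(cost(A, S))                  # 1.0 + 1.0 = 2.0
print(scale_constraint(A, D))      # 25.0 >= 1.0 -> True
```

With A = I the cost reduces to ordinary squared Euclidean distances, which makes the toy values easy to check by hand.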
As discussed in the introduction, various techniques have been developed
to enforce the positivity constraint during the learning of the Mahalanobis metric, but
without integrating the geometry of the underlying space, which in this case is P_n,
the space of SPD matrices. In [20], the domain of optimization for the above function
is Euclidean and the update is directly performed on the matrix entries, while
positive definiteness must be checked at each step by forcing the negative eigenvalues to 0.
This requires a diagonalization operation at each step, which leads to a computationally
expensive approach when the dimension of the target problem is high.
Here we propose a novel method which enables the optimization of the same cost
function to be carried out directly on the curved space P_n, as opposed to the Euclidean
space. A significant feature of our method is the use of Iwasawa coordinates, a natural
parameterization of P_n. By choosing the Iwasawa coordinates, the constructed matrices
always stay on P_n and hence there is no need to further enforce the positive-definiteness
constraint during the optimization via projections as in [20].
As an analogue of the rectangular coordinates of Euclidean space, the so-called
Iwasawa coordinates [15] are defined as follows for Y ∈ P_n:

    Y = ( V  0 ) [( I_p  X )]    (2)
        ( 0  W )  ( 0   I_q )

where V ∈ P_p, W ∈ P_q, X ∈ R^{p×q} (p + q = n) and Y[g] denotes g^T Y g. Note that the above
decomposition can always be solved uniquely for V, W, X once p, q and Y ∈ P_n are
given. Hence, for any matrix V_n ∈ P_n with n > 1, by representing V_n as a tuple
(V_{n−1}, x_{n−1}, w_{n−1}) and repeating the following partial Iwasawa decomposition

    V_{n+1} = ( V_n  0  ) [( I_n  x_n )] = (    V_n          V_n x_n         )    (3)
              ( 0   w_n )  ( 0     1  )    ( x_n^T V_n   x_n^T V_n x_n + w_n )

where V_n ∈ P_n, w_n > 0 and x_n ∈ R^n, we finally get the following vectorized
expression:

    iwasawa(V_n) = (((w_0, x_1, w_1), x_2, w_2), ..., x_{n−1}, w_{n−1})    (4)
which we term the full Iwasawa coordinates and adopt in the rest of this paper. Note
that the components w_i at positions (1, 3, 6, ..., i(i + 1)/2, ...) are called diagonal
elements and have to be positive, while the off-diagonal elements x_i can be any real
numbers. Let vec(A) be the column vector created from a matrix A by stacking its column
vectors, and let vech(A) be the compact form with the upper portion excluded when A
is symmetric. Then the Jacobian of the one-to-one transformation in (4) can be easily
derived from (3) in a recursive fashion:
    J_{n+1} = ∂ vech(V_{n+1}) / ∂ iwasawa(V_{n+1})

              (           J_n               0           0 )
            = (  (x_n^T ⊗ I) S_n J_n       V_n          0 )    (5)
              ( vec(x_n x_n^T)^T S_n J_n   2 x_n^T V_n  1 )

where S_n is the n² × n(n+1)/2 matrix of 0s and 1s such that vec(·) = S_n vech(·) and
⊗ denotes the Kronecker product. With this Iwasawa coordinate system, gradient-based
techniques can be used to optimize cost functions like (1) or other forms. For the
reader's convenience, the proposed learning algorithm is outlined in Algorithm 1.
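The matrix S_n used here is the standard duplication matrix for symmetric matrices. A small Python sketch (ours, not the authors' code) constructs it and checks the defining identity vec(A) = S_n vech(A):

```python
import numpy as np

def vech(A):
    """Stack the lower-triangular part of symmetric A column by column."""
    n = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(n)])

def duplication_matrix(n):
    """The n^2 x n(n+1)/2 matrix S_n of 0s and 1s with vec(A) = S_n vech(A)
    for every symmetric n x n matrix A."""
    m = n * (n + 1) // 2
    S = np.zeros((n * n, m))
    for k in range(m):
        # Place the k-th unit vech vector back into a symmetric matrix,
        # then read off its column-major vec as the k-th column of S_n.
        A = np.zeros((n, n))
        idx = 0
        for j in range(n):
            for i in range(j, n):
                if idx == k:
                    A[i, j] = A[j, i] = 1.0
                idx += 1
        S[:, k] = A.flatten(order='F')        # column-major vec
    return S

A = np.array([[1., 2.], [2., 3.]])
S2 = duplication_matrix(2)
print(np.allclose(S2 @ vech(A), A.flatten(order='F')))  # True
```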
To prevent A from shrinking to 0, we simply put constraints on the diagonal
elements in the Iwasawa coordinates, e.g. w_i ≥ c, where c can be an arbitrarily chosen
positive number. Note that this constraint is equivalent to putting a lower bound on the L2
norm of the matrix. We point out that the Iwasawa coordinate system, as a natural
Algorithm 1: Gradient-based algorithm for minimizing the cost function f(A)
input : S and D (optional): sets of pairwise (dis)similarity constraints or
        labeled examples
output: A ∈ P_n to be used in the Mahalanobis metric
1 begin
2     Heuristically initialize A to some SPD matrix
3     repeat
4         iwasawa(A) ← iwasawa(A) − ε (∇_{iwasawa(A)} f)
5     until convergence
6     A ← iwasawa^{−1}(iwasawa(A))
7 end
parametrization of P_n, can be used in many other applications where an SPD matrix
needs to be learned.
Several recent studies [1, 7, 13, 19] on Mahalanobis metric learning have also used
the trick of changing the parameters representing the SPD matrices. The most widely
used representation is the factorization A = L^T L, which interprets the Mahalanobis
distance metric as performing a linear transformation on the input. Note that there is
more than one L satisfying A = L^T L,¹ while the Iwasawa decomposition is unique.
As pointed out in [7], the factorization A = L^T L turns a convex problem (in
A) into a non-convex problem (in L); the Iwasawa parameterization has the same problem
of not preserving convexity, though the induced domain is still a convex
body.² However, the one-to-one property of the Iwasawa coordinates and the existence of an
analytical Jacobian enable the use of fast gradient-based optimization techniques, and
at each update there is no need for an eigendecomposition, which can be quite expensive.
More importantly, one key advantage of the Iwasawa coordinates over other parameterizations
is that they are closely related to the intrinsic geometry of the manifold of SPD
matrices, which we will exploit in the next section.
3 Learning in the Curved Space P_n
Clearly, the fact that (x − y)^T A (x − y) = ||Lx − Ly||², where A = L^T L, implies that
there exists a linear transformation providing an isometric embedding from the metric
space (R^n, d_A) to a standard Euclidean space (R^n, d_I). From this point of view, the
use of a Mahalanobis metric still results in a flat feature space and may not be able to
achieve quality data separability in many situations. To illustrate this problem, consider
two pairs of points (x_1, y_1) and (x_2, y_2) such that (x_1 − y_1) = (x_2 − y_2). Clearly, no
matter what the matrix A is, d_A(x_1, y_1) will always stay equal to d_A(x_2, y_2). But it is
possible to have constraints which require these two distance values to be significantly
¹Consider the square root and Cholesky factorization of an SPD matrix.
²Actually, the convexity of this problem can be preserved using the Iwasawa decomposition according to:
R. Vanderbei and H. Benson. On Formulating Semidefinite Programming Problems as Smooth Convex
Nonlinear Optimization Problems. TR ORFE-99-01, Princeton University.
different.
A common preprocessing strategy is to use a nonlinear mapping function φ : Ω →
F that maps the data into some high-dimensional feature space F and then perform the
learning in F [17]. The feature space F usually is an infinite-dimensional inner-product
space. In the following we propose a novel learning technique which incrementally
maps R^n, the input Euclidean metric space, to the space of SPD matrices, P_n, a curved
Riemannian manifold with its own intrinsic metric and a well-studied geometry. Note
that P_n, although a subset of a vector space, is not a vector space itself.
3.1 An Incremental Learning Algorithm
It is well known that P_n is a complete Riemannian manifold of non-positive sectional
curvature, and there is precisely one geodesic connecting two arbitrary points in P_n (see
[8, pg. 203]). Hence it is natural to measure intrinsic distance in P_n using geodesic
length. Moreover, there exists an invariant metric on P_n which is given in terms of the arc-length
element ds² = trace((Y^{−1} dY)²) for any Y ∈ P_n. It has been shown in [15]
that the arc-length element can be expressed in the partial Iwasawa coordinates given in
(2):

    ds² = ds²_V + ds²_W + 2 trace(V dX W^{−1} dX^T)    (6)

where ds²_V is the element of arc length in P_p, ds²_W is the analogue of arc length in
P_q, and dX is the matrix of differentials dx_{ij} if X = (x_{ij}) ∈ R^{p×q}. Furthermore, the
geodesic distance between two arbitrary points A and B in P_n can be derived via the group
action and the invariance property. Due to lack of space, we omit the detailed derivation
and simply give that distance as d²(A, B) = Σ_{i=1}^{n} log²(λ_i), where the λ_i are the solutions to
the generalized eigenvalue problem det(λA − B) = 0.
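This distance is straightforward to compute, since the λ_i solving det(λA − B) = 0 are generalized eigenvalues of the pair (B, A). A minimal Python sketch (ours, assuming SciPy's symmetric-definite eigensolver):

```python
import numpy as np
from scipy.linalg import eigh

def geodesic_distance(A, B):
    """Geodesic distance on P_n: d(A, B) = sqrt(sum_i log^2(lambda_i)),
    where the lambda_i solve det(lambda*A - B) = 0."""
    lam = eigh(B, A, eigvals_only=True)   # generalized eigenvalues: B v = lambda A v
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

A = np.eye(2)
B = np.diag([np.e ** 2, np.e ** -2])      # eigenvalues e^2 and e^-2 relative to A
print(geodesic_distance(A, B))            # sqrt(2^2 + 2^2) = 2*sqrt(2) ~ 2.828
print(geodesic_distance(A, A))            # 0.0: distance of a point to itself
```

Note that for A = I the formula reduces to the norm of the log-eigenvalues of B, which makes the toy values checkable by hand.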
The algorithm we present here is based on the key assumption that the input instances
are sampled from a high-dimensional curved space, in this work, P_n. First,
it is intuitively clear from (6) that any space consisting of matrices with fixed off-diagonal x
coordinates in (4) is a flat subspace of P_n. Hence there exists a mapping
φ (the isometric embedding) from R^n to P_n such that d_{R^n}(x, y) = d(φ(x), φ(y))
for all x, y ∈ R^n. Based on this fact, we further assume that, for each instance, only its
projection onto a flat subspace is observed. Now the problem is to recover the underlying
curved structure with the help of the side-information. We shall rewrite Eqn.
(1) as

    f(φ) = Σ_{(x_i, x_j) ∈ S} d²_{P_n}(φ(x_i), φ(x_j))    (7)

where φ is the map from R^n to P_n and d_{P_n} is the intrinsic geodesic distance on P_n.
To simplify the problem, we further assume there exists a linear map φ from the
observed {w_i} coordinates to the hidden {x_i} coordinates with a total of m = n(n − 1)/2
elements, i.e. φ can be represented by an m × n matrix. Obviously, if φ turns out to be
a zero map, as in a very special case, then we end up with a Euclidean flat space. Instead
of learning an m × n matrix in one shot, we choose an incremental approach which
incrementally increases the "curvedness" of the input space. More specifically, the
desired map φ from R^n to P_n is determined from a set of linear maps {φ_i}_{i=1}^{l}, where
each φ_i maps R^n to R^m and l ≤ n is the level of this "curvedness"-increasing procedure.
At each step i, the off-diagonal coordinates {x_j} of each instance are recovered by
performing the linear transform φ_i on {w_j}. Algorithm 2 summarizes the proposed
incremental method, which minimizes the cost function f, generally of the form
of (7), with optional constraints from the dissimilarity information.
Algorithm 2: Incremental procedure for recovering the off-diagonal coordinates
input : S: side-information, l: level of the incremental algorithm,
        f: the cost function which respects the side-information using the metric on P_n
output: φ = {φ_i}_{i=1}^{l}
1 begin
2     for i = 1 to l do
3         φ_i = arg min_φ f(φ)   (Note: ∇f can be numerically estimated from (6))
4 end
3.2 Clustering on P_n
Once the underlying embedding from R^n to P_n is estimated, we obtain a data set lying on
the curved space P_n. Many classical learning algorithms for the Euclidean space can be
extended to this new setting. In this work, we focus on the clustering problem as
an application to show the potential of our proposed method.
As one of the most popular clustering techniques, K-means partitions a dataset into
K clusters by iteratively minimizing the sum, over all clusters, of the within-cluster
sums of point-to-cluster-centroid distances. Usually the squared Euclidean distances
are used in K-means in R^n. In order to perform K-means-type clustering on P_n, one has
to develop algorithms fully respecting the geometry of the underlying space. More
specifically, the cluster means and the distances should be computed in terms of the
intrinsic Riemannian geodesics.
Given a set of N observations x_1, ..., x_N from an unknown distribution on P_n, one
can calculate the sample mean by the usual linear average: x̄ = (1/N) Σ_{i=1}^{N} x_i. Since P_n
is a convex subset of R^{n×n}, x̄ still lies within P_n. However, linear averages do not fit
the natural nonlinear geometry of P_n well. For example, the convex combination of
any two matrices in P_n of the same determinant always gives a matrix with a larger
determinant. It is well known that, in a normed vector space V, the average of a set
{x_1, ..., x_N} can be uniquely characterized as x̄ = (1/N) Σ_{i=1}^{N} x_i = arg min_{x ∈ V} Σ_{i=1}^{N} ||x −
x_i||². Note that this least-squares property still makes sense if the linear structure of V is
replaced by any metric space. Based on this, the so-called Fréchet mean of a finite
subset Q of a metric space (Ω, d) can be defined as the element of Ω at which the
function p ↦ Σ_{q ∈ Q} d²(p, q) is minimized. A similar definition has been generalized in [8]
to differentiable manifolds:
Definition 1 Let M be a complete, simply connected, non-positively curved Riemannian
manifold endowed with a metric d(·,·) and let μ be a probability measure on M,
i.e. μ(M) = ∫_M dμ = 1. A point q ∈ M is called a center of mass for μ if

    ∫_M d²(q, y) dμ(y) = inf_{p ∈ M} ∫_M d²(p, y) dμ(y) < ∞    (8)

[8] further shows that the function f(p) = ∫_M d²(p, y) dμ(y) is a strictly convex function
on P_n, which is non-positively curved, and thus the center of mass is uniquely defined
for a probability measure on P_n. In practical problems, datasets are usually given in
a discrete form, hence the probability measure can be assumed to be a mixture of Dirac
measures. Under this assumption, the intrinsic mean of a set of points x_1, ..., x_N in
P_n can be defined as the unique minimizer of the sum of squared geodesic distances,
μ̂ = arg min_{x ∈ M} (1/N) Σ_{i=1}^{N} d²(x, x_i), which is clearly a limiting case of Eqn. (8). The
function f(p) is differentiable with gradient ∇f(p) = −2 ∫_M log_p(q) dμ(q),
where log_p : M → T_p M is the inverse of the exponential map exp_p. Thus, the center
of mass of μ, where ∇f(p) reaches zero, can be computed iteratively as follows:

    p_0 = x̄  and  p_{k+1} = exp_{p_k}(−λ_k ∇f(p_k))

where λ_k > 0 is a step size. A proof of convergence of the sequence {p_k : k ≥ 0} can be found in [10].
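The fixed-point iteration above can be sketched as follows (a Python illustration, not the authors' code; it uses the standard closed forms exp_P(S) = P^{1/2} exp(P^{−1/2} S P^{−1/2}) P^{1/2} for the affine-invariant geometry on P_n and its inverse for log_P, and the step size and iteration count are arbitrary choices of ours):

```python
import numpy as np
from scipy.linalg import expm, logm, sqrtm, inv

def exp_map(P, S):
    """Riemannian exponential map on P_n at base point P (S symmetric)."""
    Ph = sqrtm(P)
    Phi = inv(Ph)
    return Ph @ expm(Phi @ S @ Phi) @ Ph

def log_map(P, Q):
    """Inverse of exp_map: the tangent vector at P pointing towards Q."""
    Ph = sqrtm(P)
    Phi = inv(Ph)
    return Ph @ logm(Phi @ Q @ Phi) @ Ph

def intrinsic_mean(mats, n_iter=50, step=1.0):
    """Karcher/Frechet mean on P_n via the fixed-point iteration
    p_{k+1} = exp_{p_k}( step * mean_i log_{p_k}(x_i) )."""
    p = sum(mats) / len(mats)          # initialize at the linear average
    for _ in range(n_iter):
        g = sum(log_map(p, X) for X in mats) / len(mats)
        p = exp_map(p, step * g)       # move along the mean tangent direction
    return np.real(p)

# For commuting matrices the intrinsic mean is the elementwise geometric mean:
A = np.diag([1.0, 4.0])
B = np.diag([4.0, 1.0])
M = intrinsic_mean([A, B])
print(np.round(M, 6))   # diag(2, 2): sqrt(1*4) on each axis
```

The diagonal example is a useful sanity check: the intrinsic mean (geometric mean of the eigenvalues) differs from the linear average diag(2.5, 2.5), illustrating the determinant-inflation effect of linear averaging noted above.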
Moreover, equipped with the geodesic distance and the intrinsic mean, the classical K-means
and its variants can be naturally extended to P_n. In the experiments, we simply
modified the K-means implementation provided in MATLAB by supplying the distance function and
the mean computation discussed above. Many existing techniques which accelerate K-means
by taking advantage of the triangle inequality, for instance [6], can still be applied
since the distance function here is indeed a metric.
4 Experiments: Application to Clustering
As mentioned in the introduction, the main goal of our research presented here is to utilize
the side-information in the form of pairwise equivalence constraints to improve the
performance of unsupervised learning techniques. In order to test our proposed method
and to compare it with previous work, we conduct experiments on six data sets from the
UC Irvine repository [5], which were used in [1, 4, 20]. First, for each dataset and each
run of the algorithm, we randomly selected a set of pairwise equivalence constraints as
input to the algorithms. After the metric was learned, the K-means clustering algorithm
was run on the linearly transformed dataset. To evaluate the performance gain
in using the learned metrics, we compare against K-means with the Euclidean
distance on the original dataset, i.e., with no side-information. For the purpose of
comparison, we have also run the RCA method [1] to learn the metric from the same
side-information and then transform the dataset accordingly before feeding it as input
to the K-means clustering algorithm. Finally, we applied our method, which maps R^n
to P_n and performs the modified K-means clustering scheme on the curved space P_n.
The input to our algorithm can be either the given feature space or a linearly transformed
feature space, where the linear transformation, for example, is learned from the
input data using either our method or the RCA technique [1].
As in [1, 20], all of our experiments used K-means with multiple restarts. To evaluate
the accuracy of the clustering results, we have used the variation of information
(VI) to compare the results with the ground truth.³ Variation of information is a
semi-metric for measuring the amount of information that is lost or gained in changing
from one random variable to another. Recent work [11] suggested that
variation of information can be used as a good criterion for comparing clusterings and
showed nice properties of VI within this context. By associating a clustering result with
K clusters to a discrete random variable taking K values, one can define the variation
of information between two clusterings C and C' as H(C) + H(C') − 2I(C, C'), where
H(C) denotes the entropy associated with a clustering C and I(C, C') is the mutual
information.
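A minimal sketch of computing VI from two label assignments (a Python illustration of the formula above; entropies are in nats and the toy labels are made up):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy H(C) of a clustering given as one label per point (in nats)."""
    n = len(labels)
    return -sum((c / n) * np.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information I(C, C') between two clusterings of the same points."""
    n = len(a)
    joint = Counter(zip(a, b))
    pa, pb = Counter(a), Counter(b)
    return sum((c / n) * np.log(c * n / (pa[i] * pb[j]))
               for (i, j), c in joint.items())

def variation_of_information(a, b):
    """VI(C, C') = H(C) + H(C') - 2 I(C, C'); zero iff the partitions agree."""
    return entropy(a) + entropy(b) - 2.0 * mutual_information(a, b)

c1 = [0, 0, 1, 1]
c2 = [1, 1, 0, 0]          # the same partition, with relabeled clusters
print(round(abs(variation_of_information(c1, c1)), 6))   # 0.0
print(round(abs(variation_of_information(c1, c2)), 6))   # 0.0: VI ignores label names
```

Because VI compares partitions rather than label names, it is well suited to comparing clustering output against a ground-truth classification.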
Figure 1 shows the results of all the clustering schemes described above using 15%
of the side-information. In this figure, the heights of the bars display the values of the
variation of information between the clustering results and the ground-truth classification. The
smaller these heights, the better the performance. The results were averaged over
20 random selections of side-information. The fraction of the dataset used to extract the
side-information in this experiment is chosen to be 15% of the whole dataset.
We also show how the quality of the clustering algorithms improves with the amount of
side-information through one typical example.
Figure 2 depicts plots of the accuracy of clustering vs. the amount of side-information
for the various methods described in the figure caption. As expected, for our method and
the others, the accuracy increases with increasing side-information. However, as evident
from the plots, the accuracy increase is more significant in our technique compared to
the others.
5 Conclusions
In this paper, we presented a novel algorithm to learn a Mahalanobis metric that is
tailored to the input data. The learning algorithm makes use of the geometry of the
space of SPD matrices, P_n, that contains this metric as an element. We also presented
an efficient algorithm to map the input data onto the curved space P_n so as to facilitate
clustering of data that are nonlinearly separable. Unlike the popular kernel-based
methods, which map the data into an infinite-dimensional space and do not make use
of the geometry of the target space in the classifier, our algorithm maps the data onto a
finite-dimensional target manifold P_n and uses its known Riemannian structure in the
clustering technique to achieve superior results compared to some of the existing
methods.
The mapping of the data onto P_n in our algorithm requires side-information and
involves incrementally updating the "curvedness" of the embedding space. We used the
natural parameterization of P_n, namely the Iwasawa coordinates, to achieve this updating.
Following this embedding process, we used the geodesic distance on the embedding
space P_n to perform K-means clustering where the mean is defined intrinsically. Finally,
we tested this algorithm on publicly available data sets from the UCI repository
and presented comparisons to several existing methods depicting the superior performance
of our algorithm.
³Note that in all experiments we only use the (dis)similarity pairwise constraints but no labeled training
examples, hence no classification error rates are reported here.
[Figure 1: bar charts on six data sets, each panel comparing "Flat" and "Curved" settings; panel titles recoverable from the residue: soybean (N=47, D=35, K=4), protein (N=116, D=20, K=6), diabetes (N=768, D=8, K=2), balance (N=625, D=4, K=3); the remaining axis ticks are not recoverable from the text.]
Figure 1: Clustering accuracy on six data sets. In each plot, the three bars on the left correspond
to a learning experiment with the flat space, and the three bars on the right correspond to the curved
space P_n. From left to right, the six bars are respectively: (a) K-means over the original space
(without using any side-information); (b) K-means over the feature space created by RCA; (c)
K-means over the feature space created by the method in Section 2; (d) modified K-means over the
P_n space created by the method in Section 3.1 based on the results from (a); (e) modified K-means
over the P_n space created by the method in Section 3.1 based on the results from (b); (f) modified
K-means over the P_n space created by the method in Section 3.1 based on the results from (c). Also
shown are N, the number of points, K, the number of classes, and D, the dimension of the feature
space.
[Figure 2: line plot with legend entries (a)-(f); panel titles recoverable from the residue: iris (N=150, D=4, K=3) and wine (N=178, D=12, K=3); axes: variation of information vs. amount of side-information (0 to 0.4).]
Figure 2: Plot of accuracy vs. amount of side-information. The x-axis gives the fraction of
the points used to generate the constraints. The y-axis gives the accuracy measure in terms of the
variation of information. The labels (a)-(f) denote the different schemes, in the same order as in
Figure 1.
References
[1] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from
equivalence constraints. Journal of Machine Learning Research, 6:937-965, Jun 2005.
[2] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. In ICML,
pages 27-34, 2002.
[3] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity
measures. In KDD, pages 39-48, 2003.
[4] M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in
semi-supervised clustering. In ICML, 2004.
[5] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases,
1998. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
[6] C. Elkan. Using the triangle inequality to accelerate k-means. In ICML, pages 147-153,
2003.
[7] A. Globerson and S. Roweis. Metric learning by collapsing classes. In NIPS, 2005.
[8] J. Jost. Riemannian geometry and geometric analysis. Springer-Verlag, 2001.
[9] D. Klein, S. D. Kamvar, and C. D. Manning. From instance-level constraints to space-level
constraints: Making the most of prior knowledge in data clustering. In ICML, pages
307-314, 2002.
[10] H. Le. Estimation of Riemannian barycentres. LMS J. Comput. Math., 7:193-200, 2004.
[11] M. Meila. Comparing clusterings by the variation of information. In COLT, pages 173-187,
2003.
[12] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In
NIPS, 2003.
[13] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and batch learning of pseudo-metrics.
In ICML, 2004.
[14] N. Shental, A. Bar-Hillel, T. Hertz, and D. Weinshall. Computing Gaussian mixture models
with EM using equivalence constraints. In S. Thrun, L. Saul, and B. Schölkopf, editors,
Advances in Neural Information Processing Systems 16, Cambridge, MA, 2003. MIT Press.
[15] A. Terras. Harmonic analysis on symmetric spaces and applications, Vol. II. Springer, 1985.
[16] K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line
learning and Bregman projection. Journal of Machine Learning Research, 6:995-1018, Jun
2005.
[17] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
[18] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with
background knowledge. In ICML, pages 577-584, 2001.
[19] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest
neighbor classification. In NIPS, 2005.
[20] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell. Distance metric learning with application
to clustering with side-information. In NIPS, pages 505-512, 2002.
