TECHNICAL REPORT, CVGMI LAB, CISE DEPARTMENT, UNIVERSITY OF FLORIDA
July 21, 2009
DRAFT
A Method for Compact Image Representation
using Sparse Matrix and Tensor Projections
onto Exemplar Orthonormal Bases
Karthik S. Gurumoorthy, Ajit Rajwade, Arunava Banerjee and Anand Rangarajan
Abstract
We present a new method for compact representation of large image datasets. Our method is
based on treating small patches from a 2D image as matrices as opposed to the conventional vectorial
representation, and encoding these patches as sparse projections onto a set of exemplar orthonormal bases,
which are learned a priori from a training set. The end result is a low-error, highly compact image/patch
representation that has significant theoretical merits and compares favorably with existing techniques
(including JPEG) on experiments involving the compression of ORL and Yale face databases, as well as
two databases of miscellaneous natural images. In the context of learning multiple orthonormal bases,
we show the easy tunability of our method to efficiently represent patches of different complexities.
Furthermore, we show that our method is extensible in a theoretically sound manner to higher-order
matrices ('tensors'). We demonstrate applications of this theory to the compression of well-known color
image datasets such as the GaTech and CMU-PIE face databases and show performance competitive
with JPEG. Lastly, we also analyze the effect of image noise on the performance of our compression
schemes.
Index Terms
compression, compact representation, sparse projections, singular value decomposition (SVD), higher-order
singular value decomposition (HOSVD), greedy algorithm, tensor decompositions
The authors are with the Department of Computer and Information Science and Engineering, University of Florida, Gainesville,
FL 32611, USA. email: {ksg,avr,arunava,anand}@cise.ufl.edu.
I. INTRODUCTION
Most conventional techniques of image analysis treat images as elements of a vector space. Lately,
there has been a steady growth of literature which regards images as matrices, e.g. [1], [2], [3], [4].
As compared to a vectorial method, the matrix-based representation helps to better exploit the spatial
relationships between image pixels, as an image is essentially a two-dimensional function. In the image
processing community, this notion of an image was considered by Andrews and Patterson in [5],
in the context of image coding using singular value decomposition (SVD). Given any matrix, say
$J \in \mathbb{R}^{M_1 \times M_2}$, the SVD of the matrix expresses it in the form $J = USV^T$, where $S \in \mathbb{R}^{M_1 \times M_2}$ is a
diagonal matrix of singular values, and $U \in \mathbb{R}^{M_1 \times M_1}$ and $V \in \mathbb{R}^{M_2 \times M_2}$ are two orthonormal matrices.
This decomposition is unique up to a sign factor on the vectors of $U$ and $V$ if the singular values are
all distinct [6]. For the specific case of natural images, it has been observed that the values along the
diagonal of $S$ decrease rapidly¹, which allows for the low-error reconstruction of the image by
low-rank approximations. This property of SVD, coupled with the fact that it provides the best possible
lower-rank approximation to a matrix, has motivated its application to image compression in the work
of Andrews and Patterson [5], and of Yang and Lu [8]. In [8], the SVD technique is further combined with
vector quantization for compression applications.
To the best of our knowledge, the earliest technique for obtaining a common matrix-based representation
for a set of images was developed by Rangarajan in [1]. In [1], a single pair of orthonormal bases $(U, V)$
is learned from a set of some $N$ images, and each image $I_j$, $1 \le j \le N$, is represented by means of a
projection $S_j$ onto these bases, i.e. in the form $I_j = U S_j V^T$. Similar ideas are developed by Ye in [2] in
terms of optimal lower-rank image approximations. In [3], Yang et al. develop a technique named 2D-PCA,
which computes principal components of a column-column covariance matrix of a set of images.
In [9], Ding et al. present a generalization of 2D-PCA, termed 2D-SVD, and investigate some of its
optimality properties. The work in [9] also unifies the approaches in [3] and [2], and provides a
non-iterative approximate solution with bounded error to the problem of finding the single pair of common
bases. In [4], He et al. develop a clustering application, in which a single set of orthonormal bases is
learned in such a way that projections of a set of images onto these bases are neighborhood-preserving.
It is quite intuitive to observe that as compared to an entire image, a small image patch is a simpler,
more local entity, and hence can be more accurately represented by means of a smaller number of bases.
¹A related empirical observation is the rapid decrease in the power spectra of natural images with increasing spatial frequency
[7].
Following this intuition, we choose to regard an image as a set of matrices (one per image patch) instead
of using a single matrix for the entire image as in [1], [2]. Furthermore, there usually exists a great deal
of similarity between a large number of patches in one image or across several images of a similar kind.
We exploit this fact to learn a small number of full-sized orthonormal bases (as opposed to a single set
of low-rank bases learned in [1], [2] and [9], or a single set of low-rank projection vectors learned in
[3]) to reconstruct a set of patches from a training set by means of sparse projections with the least possible
error. As we shall demonstrate later, this change of focus from learning lower-rank bases to learning
full-rank bases with sparse projection matrices brings with it several significant theoretical and practical
advantages. This is because it is much easier to adjust the sparsity of a projection matrix in order to meet
a certain reconstruction error threshold than to adjust its rank [see Section II-A and Section III].
There exists a large body of work in the field of sparse image representation. Coding methods which
express images as sparse combinations of bases using techniques such as the discrete cosine transform
(DCT) or Gabor-filter bases have been popular for a long time. The DCT is also an essential
ingredient of the widely used JPEG image compression standard [10]. These developments have also
spurred interest in methods that learn a set of bases from a set of natural images as well as a sparse
set of coefficients to represent images in terms of those bases. Pioneering work in this area has been
done by Olshausen and Field [11], with a special emphasis on learning an overcomplete set of bases
and their sparse combinations, and also by Lewicki et al. [12]. Some recent noteworthy contributions in
the field of overcomplete representations include the highly interesting work in [13] by Aharon et al.,
which encodes image patches as a sparse linear combination of a set of dictionary vectors (learned from a
training set). There also exist other techniques such as sparsified nonnegative matrix factorization (NMF)
[14] by Hoyer, which represent image datasets as a product of two large sparse nonnegative matrices.
An important feature of all such learning-based approaches (as opposed to those that use a fixed set of
bases such as the DCT) is their tunability to datasets containing a specific type of images. In this paper, we
develop such a learning technique, but with the key difference that our technique is matrix-based, unlike
the aforementioned vector-based learning algorithms. The main advantage of our technique is as follows:
we learn a small number of pairs of orthonormal bases to represent ensembles of image patches. Given
any such pair, the computation of a sparse projection for any image patch (matrix) with least possible
reconstruction error, can be accomplished by means of a very simple and optimal greedy algorithm. On
the other hand, the computation of the optimal sparse linear combination of an overcomplete set of
basis vectors to represent another vector is a well-known NP-hard problem [15]. We demonstrate the
applicability of our algorithm to the compression of databases of face images, with favorable results in
comparison to existing approaches. Our main algorithm for ensembles of 2D images or image patches
was presented earlier by us in [16]. In this paper, we present more detailed comparisons (including with
JPEG), study the effect of various parameters on our method, and show more detailed derivations of
the theory.
In the current work, we present another significant extension of our previous algorithm, namely its
elegant applicability to higher-order matrices (commonly, and usually mistakenly, termed tensors), with
sound theoretical foundations. In this paper, we represent patches from color images as tensors
(third-order matrices). Next, we learn a small number of orthonormal matrices and represent these patches as
sparse tensor projections onto these orthonormal matrices. The tensor representation for image datasets,
as such, is not new. Shashua and Levin [17] regard an ensemble of grayscale (face) images, or a
video sequence, as a single third-order tensor, and achieve compact representation by means of lower-rank
approximations of the tensor. However, quite unlike the computation of the rank of a matrix, the
computation of the rank of a tensor² is known to be an NP-complete problem [19]. Moreover, while
the SVD is known to yield the best lower-rank approximation of a matrix in the 2D case, its
higher-dimensional analog (known as higher-order SVD, or HOSVD) does not necessarily produce the best
lower-rank approximation of a tensor. The work of Lathauwer in [20] and Lathauwer et al. in [18]
presents an extensive development of the theory of HOSVD and several of its properties. Nevertheless,
for the sake of simplicity, HOSVD has been used to produce lower-rank approximations of datasets.
Though this approximation is theoretically suboptimal [18], it is seen to work well in applications such
as face recognition under changes in pose, illumination and facial expression, as in [21] by Vasilescu et al.
(though in [21], each image is still represented as a vector), and dynamic texture synthesis from grayscale
and color videos in [22] by Costantini et al. In these applications, the authors demonstrate the usefulness
of the multilinear representation over the vector-based representation for expressing the variability over
different modes (such as pose, illumination and expression in [21], or space, chromaticity and time in
[22]).
An iterative scheme for a lower-rank tensor approximation is developed in [20], but the corresponding
energy function is susceptible to the problem of local minima. In [17], two new lower-rank approximation
schemes are designed: a closed-form solution under the assumption that the actual tensor rank equals
the number of images (which usually may not be true in practice), or an iterative approximation in other

²The rank of a tensor is defined as the smallest number of rank-1 tensors whose linear combination gives back the tensor,
a tensor being of rank 1 if it is equal to the outer product of several vectors [18].
cases. Although the latter iterative algorithm is proved to be convergent, it is not guaranteed to yield a
global minimum. Wang and Ahuja [23] also develop a new iterative rank-R approximation scheme using
an alternating least-squares formulation, and also present another iterative method that is specially tuned
to the case of third-order tensors. Very recently, Ding et al. [24] have derived upper and lower bounds
for the error due to low-rank truncation of the core tensor obtained in HOSVD (which is a closed-form
decomposition), and use this theory to find a single common triple of orthonormal bases to represent a
database of color images (represented as 3D arrays) with minimal error in the L2 sense. Nevertheless,
there is still no method of directly obtaining the optimal lower-rank tensor approximation that is
non-iterative in nature. Furthermore, the error bounds in [24] are applicable only when the entire set of images
is coded using a single common triple of orthonormal bases. Likewise, the algorithms presented in [23]
and [17] also seek to find a single common basis.
The algorithm we present here differs from the aforementioned ones in the following ways: (1) All
the aforementioned methods learn a common basis, which may not be sufficient to account for the
variability in the images. We do not learn a single common basis-tuple, but a set of $K$ orthonormal bases
to represent $N$ patches in the database of images, with each image and each patch being represented
as a higher-order matrix. Note that $K$ is much less than $N$. (2) We do not seek to obtain lower-rank
approximations to a tensor. Rather, we represent the tensor as a sparse projection onto a chosen tuple of
orthonormal bases. This sparse projection, as we show later, turns out to be optimal and can be obtained
by a very simple greedy algorithm. We use our extension for the purpose of compression of a database
of color images, with promising results. Note that this sparsity-based approach has advantages in terms
of coding efficiency as compared to methods that look for lower-rank approximations, just as in the 2D
case mentioned before.
Our paper is organized as follows. We describe the theory and the main algorithm for 2D datasets in
Section II. Section III presents experimental results and comparisons with existing techniques. Section
IV presents our extension to higher-order matrices, with experimental results and comparisons in Section
V. We conclude in Section VI.
II. THEORY: 2D IMAGES
Consider a set of digital images, each of size $M_1 \times M_2$. We divide each image into non-overlapping
patches of size $m_1 \times m_2$, $m_1 < M_1$, $m_2 < M_2$, and treat each patch as a separate matrix. Exploiting the
similarity inherent in these patches, we effectively represent them by means of sparse projections onto
(appropriately created) orthonormal bases, which we term 'exemplar bases'. We learn these exemplars a
priori from a set of training image patches. Before describing the learning procedure, we first explain
the mathematical structure of the exemplars.
A. Exemplar Bases and Sparse Projections
Let $P \in \mathbb{R}^{m_1 \times m_2}$ be an image patch. Using singular value decomposition (SVD), we can represent
$P$ as a combination of orthonormal bases $U \in \mathbb{R}^{m_1 \times m_1}$ and $V \in \mathbb{R}^{m_2 \times m_2}$ in the form $P = USV^T$,
where $S \in \mathbb{R}^{m_1 \times m_2}$ is a diagonal matrix of singular values. However, $P$ can also be represented using
any set of orthonormal bases $U$ and $V$, different from those obtained from the SVD of $P$. In this
case, we have $P = USV^T$ where $S$ turns out to be a non-diagonal matrix.³ Contemporary SVD-based
compression methods leverage the fact that the SVD provides the best low-rank approximation to a matrix
[8], [25]. We choose to depart from this notion, and instead answer the following question: What sparse
matrix $W \in \mathbb{R}^{m_1 \times m_2}$ will reconstruct $P$ from a pair of orthonormal bases $U$ and $V$ with the least error
$\|P - UWV^T\|^2$? Sparsity is quantified by an upper bound $T$ on the $L_0$ norm of $W$, i.e. on the number of
non-zero elements in $W$ (denoted as $\|W\|_0$).⁴ We prove that the optimal $W$ with this sparsity constraint
is obtained by nullifying the least (in absolute value) $m_1 m_2 - T$ elements of the estimated projection
matrix $S = U^T P V$. Due to the orthonormality of $U$ and $V$, this simple greedy algorithm turns out to
be optimal, as we prove below.
Theorem 1: Given a pair of orthonormal bases $(U, V)$, the optimal sparse projection matrix $W$ with
$\|W\|_0 = T$ is obtained by setting to zero the $m_1 m_2 - T$ elements of the matrix $S = U^T P V$ having least
absolute value.

Proof: We have $P = USV^T$. The error in reconstructing a patch $P$ using some other matrix $W$ is
$e = \|U(S - W)V^T\|^2 = \|S - W\|^2$, as $U$ and $V$ are orthonormal. Let $I_1 = \{(i,j) \mid W_{ij} = 0\}$ and
$I_2 = \{(i,j) \mid W_{ij} \neq 0\}$. Then $e = \sum_{(i,j) \in I_1} S_{ij}^2 + \sum_{(i,j) \in I_2} (S_{ij} - W_{ij})^2$. This error will be minimized
when $S_{ij} = W_{ij}$ at all locations where $W_{ij} \neq 0$, and $W_{ij} = 0$ at those indices where the corresponding
values in $S$ are as small as possible in absolute value. Thus if we want $\|W\|_0 = T$, then $W$ is the matrix obtained by
nullifying the $m_1 m_2 - T$ entries of $S$ that have the least absolute value and leaving the remaining elements
intact. □
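Theorem 1 translates directly into code. The following sketch is our illustration (the function name and NumPy usage are our own, not from the paper): it computes $S = U^T P V$ and keeps only the $T$ largest-magnitude entries.

```python
import numpy as np

def sparse_projection(P, U, V, T):
    """Optimal T-sparse projection of patch P onto orthonormal bases (U, V).

    Following Theorem 1: compute S = U^T P V, then nullify the
    m1*m2 - T entries of smallest absolute value.
    """
    S = U.T @ P @ V
    order = np.argsort(np.abs(S).ravel())   # smallest magnitude first
    W = S.copy().ravel()
    W[order[: W.size - T]] = 0.0            # keep only the T largest entries
    return W.reshape(S.shape)
```

The patch is then reconstructed as `U @ W @ V.T`; by the orthonormality argument in the proof, the squared reconstruction error equals exactly the energy of the discarded entries of $S$.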
Hence, the problem of finding an optimal sparse projection of a matrix (image patch) onto a pair of
³The decomposition $P = USV^T$ exists for any $P$ even if $U$ and $V$ are not orthonormal. We still follow orthonormality
constraints to facilitate optimization and coding. See Sections II-C and III-B.
⁴See Section III for the merits of our sparsity-based approach over the low-rank approach.
orthonormal bases, is solvable in $O(m_1 m_2 (m_1 + m_2))$ time, as it requires just two matrix multiplications
($S = U^T P V$). On the other hand, the problem of finding the optimal sparse linear combination of an
overcomplete set of basis vectors in order to represent another vector (i.e. the vectorized form of an image
patch) is a well-known NP-hard problem [15]. In actual applications [13], approximate solutions to these
problems are sought by means of pursuit algorithms such as OMP [26]. Unfortunately, the quality of the
approximation in OMP is dependent on $T$, with an upper bound on the error that is directly proportional
to $\sqrt{1 + 6T}$ [see [27], Theorem (C)] under certain conditions. Similar problems exist with other pursuit
approximation algorithms such as Basis Pursuit (BP) as well [27]. For large values of $T$, there can be
difficulties related to convergence when pursuit algorithms are put inside an iterative optimization loop.
Our technique avoids any such dependencies, as we are specifically dealing with complete basis pairs.
In any event, it is not clear whether overcomplete representations are indeed beneficial for the specific
application (image compression) being considered in this paper.
B. Learning the Bases
The essence of this paper lies in a learning method to produce $K$ exemplar orthonormal bases
$\{(U_a, V_a)\}$, $1 \le a \le K$, to encode a training set of $N$ image patches $P_i \in \mathbb{R}^{m_1 \times m_2}$ ($1 \le i \le N$) with
the least possible error (in the sense of the $L_2$ norm of the difference between the original and reconstructed
patches). Note that $K \ll N$. In addition, we impose a sparsity constraint that every $S_{ia}$ (the matrix used
to reconstruct $P_i$ from $(U_a, V_a)$) has at most $T$ non-zero elements. The main objective function to be
minimized is:

$$E(\{U_a, V_a, S_{ia}, M_{ia}\}) = \sum_{i=1}^{N} \sum_{a=1}^{K} M_{ia} \|P_i - U_a S_{ia} V_a^T\|^2 \quad (1)$$

subject to the following constraints:

$$(1)\; U_a^T U_a = V_a^T V_a = I, \forall a. \quad (2)\; \|S_{ia}\|_0 \le T, \forall (i,a). \quad (3)\; \sum_a M_{ia} = 1, \forall i \text{ and } M_{ia} \in \{0,1\}, \forall (i,a). \quad (2)$$

Here $M$ is a binary matrix of size $N \times K$, whose entry $M_{ia}$ indicates whether the $i$th patch belongs to the space
defined by $(U_a, V_a)$. Note that we are not trying to use a mixture of orthonormal bases for the projection.
Rather, we simply project each patch onto a single (suitably chosen) basis pair. The optimization of the
above energy function is difficult, as $M_{ia}$ is binary. Since an algorithm using K-means will lead to local
minima, we choose to relax the binary membership constraint so that now $M_{ia} \in (0, 1)$, $\forall (i,a)$, subject to
$\sum_{a=1}^{K} M_{ia} = 1$, $\forall i$. The naive mean field theory line of development starting with [28] and culminating
in [29] (with much of the theory developed previously in [30]) leads to the following deterministic
annealing energy function:

$$E(\{U_a, V_a, S_{ia}, M_{ia}\}) = \sum_{i=1}^{N} \sum_{a=1}^{K} M_{ia} \|P_i - U_a S_{ia} V_a^T\|^2 + \frac{1}{\beta} \sum_{ia} M_{ia} \log M_{ia} + \sum_i \mu_i \Big(\sum_a M_{ia} - 1\Big) + \sum_a \mathrm{trace}[\Lambda_{1a}(U_a^T U_a - I)] + \sum_a \mathrm{trace}[\Lambda_{2a}(V_a^T V_a - I)]. \quad (3)$$

Note that in the above equation, $\{\mu_i\}$ are the Lagrange parameters, $\{\Lambda_{1a}, \Lambda_{2a}\}$ are symmetric Lagrange
matrices, and $\beta$ is a temperature parameter.
We first initialize $\{U_a\}$ and $\{V_a\}$ to random orthonormal matrices, $\forall a$, and $M_{ia} = \frac{1}{K}$, $\forall (i,a)$. As $\{U_a\}$
and $\{V_a\}$ are orthonormal, the projection matrix $S_{ia}$ is computed by the following rule:

$$S_{ia} = U_a^T P_i V_a. \quad (4)$$

Note that this is the solution to the least-squares energy function $\|P_i - U_a S_{ia} V_a^T\|^2$. Subsequently, the $m_1 m_2 -
T$ elements in $S_{ia}$ with least absolute value are nullified to get the best sparse projection matrix.
updates to Ua and Va are obtained as follows. Denoting the sum of the terms of the cost function in Eqn.
(1) that are independent of Ua as C, we rewrite the cost function as follows:
$$E(\{U_a, V_a, S_{ia}, M_{ia}\}) = \sum_{i=1}^{N} \sum_{a=1}^{K} M_{ia} \|P_i - U_a S_{ia} V_a^T\|^2 + \sum_a \mathrm{trace}[\Lambda_{1a}(U_a^T U_a - I)] + C$$
$$= \sum_{i=1}^{N} \sum_{a=1}^{K} M_{ia}\, \mathrm{trace}[(P_i - U_a S_{ia} V_a^T)^T (P_i - U_a S_{ia} V_a^T)] + \sum_a \mathrm{trace}[\Lambda_{1a}(U_a^T U_a - I)] + C$$
$$= \sum_{i=1}^{N} \sum_{a=1}^{K} M_{ia} \big[\mathrm{trace}(P_i^T P_i) - 2\,\mathrm{trace}(P_i^T U_a S_{ia} V_a^T) + \mathrm{trace}(S_{ia}^T S_{ia})\big] + \sum_a \mathrm{trace}[\Lambda_{1a}(U_a^T U_a - I)] + C. \quad (5)$$
Now taking derivatives w.r.t. $U_a$, we get:

$$\frac{\partial E}{\partial U_a} = -2 \sum_{i=1}^{N} M_{ia} (P_i V_a S_{ia}^T) + 2 U_a \Lambda_{1a} = 0. \quad (6)$$
Rearranging the terms and eliminating the Lagrange matrices, we obtain the following update rule for
$U_a$:

$$Z_{1a} = \sum_{i=1}^{N} M_{ia} P_i V_a S_{ia}^T;\qquad U_a = Z_{1a}(Z_{1a}^T Z_{1a})^{-\frac{1}{2}}. \quad (7)$$

An SVD of $Z_{1a}$ will give us $Z_{1a} = \Gamma_{1a} \Delta_{1a} \Psi_{1a}^T$, where $\Gamma_{1a}$ and $\Psi_{1a}$ are orthonormal matrices and $\Delta_{1a}$ is a
diagonal matrix. Using this, we can rewrite $U_a$ as follows:

$$U_a = (\Gamma_{1a} \Delta_{1a} \Psi_{1a}^T)\big[(\Gamma_{1a} \Delta_{1a} \Psi_{1a}^T)^T (\Gamma_{1a} \Delta_{1a} \Psi_{1a}^T)\big]^{-\frac{1}{2}} = (\Gamma_{1a} \Delta_{1a} \Psi_{1a}^T)(\Psi_{1a} \Delta_{1a}^2 \Psi_{1a}^T)^{-\frac{1}{2}} = \Gamma_{1a} \Psi_{1a}^T. \quad (8)$$

Note that the steps above follow from the fact that $U_a$ (and hence $\Gamma_{1a}$ and $\Psi_{1a}$) are full-rank matrices.
This gives us an update rule for $U_a$. A very similar update rule for $V_a$ follows along the same lines:
$$Z_{2a} = \sum_{i=1}^{N} M_{ia} P_i^T U_a S_{ia};\qquad V_a = Z_{2a}(Z_{2a}^T Z_{2a})^{-\frac{1}{2}} = \Gamma_{2a} \Psi_{2a}^T, \quad (9)$$

where $\Gamma_{2a}$ and $\Psi_{2a}$ are obtained from the SVD of $Z_{2a}$. The membership values are obtained from the
following update:

$$M_{ia} = \frac{e^{-\beta \|P_i - U_a S_{ia} V_a^T\|^2}}{\sum_{b=1}^{K} e^{-\beta \|P_i - U_b S_{ib} V_b^T\|^2}}. \quad (10)$$
The matrices $\{S_{ia}, U_a, V_a\}$ and $M$ are then updated sequentially for a fixed $\beta$ value, until convergence.
The value of $\beta$ is then increased and the sequential updates are repeated. The entire process is repeated
until an integrality condition is met.
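To make the update sequence concrete, here is a minimal single-temperature sketch of the alternating updates in Eqns. (4) and (7)-(10). This is our illustrative reconstruction, not the authors' implementation: the annealing schedule over $\beta$, the convergence checks and the integrality test are all omitted, and the function and variable names are our own.

```python
import numpy as np

def learn_bases(patches, K, T, beta=1.0, n_iter=50):
    """Single-temperature sketch of the alternating updates (Eqns. 4, 7-10).

    patches: array of shape (N, m1, m2) with the training patches.
    Returns the lists {U_a}, {V_a} and the soft membership matrix M.
    """
    N, m1, m2 = patches.shape
    rng = np.random.default_rng(0)
    # random orthonormal initializations for {U_a} and {V_a}
    U = [np.linalg.qr(rng.standard_normal((m1, m1)))[0] for _ in range(K)]
    V = [np.linalg.qr(rng.standard_normal((m2, m2)))[0] for _ in range(K)]
    M = np.full((N, K), 1.0 / K)                     # M_ia = 1/K initially

    for _ in range(n_iter):
        S = np.empty((N, K, m1, m2))
        err = np.empty((N, K))
        for a in range(K):
            for i, P in enumerate(patches):
                Sia = U[a].T @ P @ V[a]              # Eqn. (4)
                flat = np.abs(Sia).ravel()
                # nullify the m1*m2 - T smallest-magnitude entries
                Sia.ravel()[np.argsort(flat)[: m1 * m2 - T]] = 0.0
                S[i, a] = Sia
                err[i, a] = np.sum((P - U[a] @ Sia @ V[a].T) ** 2)
        # membership update, Eqn. (10): softmax over the K bases
        M = np.exp(-beta * (err - err.min(axis=1, keepdims=True)))
        M /= M.sum(axis=1, keepdims=True)
        for a in range(K):
            Z1 = sum(M[i, a] * patches[i] @ V[a] @ S[i, a].T for i in range(N))
            G, _, Pt = np.linalg.svd(Z1)             # Eqns. (7)-(8)
            U[a] = G @ Pt
            Z2 = sum(M[i, a] * patches[i].T @ U[a] @ S[i, a] for i in range(N))
            G, _, Pt = np.linalg.svd(Z2)             # Eqn. (9)
            V[a] = G @ Pt
    return U, V, M
```

Note how the SVD-based update $U_a = \Gamma_{1a}\Psi_{1a}^T$ keeps every $U_a$ and $V_a$ exactly orthonormal after each pass, so no explicit re-orthogonalization is needed.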
Our algorithm could be modified to learn a set of orthonormal bases (single matrices rather than a pair)
for the sparse representation of vectorized images as well. Note that even in this scenario, our technique
should not be confused with the 'mixtures of PCA' approach [31]. The emphasis of the latter is again
on low-rank approximations and not on sparsity. Furthermore, in our algorithm we do not compute an
actual probabilistic mixture (also see Sections II-C and III-E). However, we have not considered this
vector-based variant in our experiments, because it does not exploit the fact that image patches are
two-dimensional entities.
C. Application to Compact Image Representation
Our framework is geared towards compact but low-error patch reconstruction. We are not concerned
with the discriminative assignment of a specific kind of patch to a specific exemplar, quite unlike in
a clustering or classification application. In our method, after the optimization, each training patch $P_i$
($1 \le i \le N$) gets represented as a projection onto the particular pair of exemplar orthonormal bases
(out of the $K$ pairs) that produces the least reconstruction error. In other words, the $k$th exemplar
is chosen if $\|P_i - U_k S_{ik} V_k^T\|^2 \le \|P_i - U_a S_{ia} V_a^T\|^2, \forall a \in \{1, 2, \ldots, K\}$, $1 \le k \le K$. For patch $P_i$, we
denote the corresponding 'optimal' projection matrix as $S_i^* = S_{ik}$, and the corresponding exemplar as
$(U_i^*, V_i^*) = (U_k, V_k)$. Thus the entire training set is approximated by (1) the common set of basis-pairs
$\{(U_a, V_a)\}$, $1 \le a \le K$ ($K \ll N$), and (2) the optimal sparse projection matrices $\{S_i^*\}$ for each patch,
with at most $T$ non-zero elements each. The overall storage per image is thus greatly reduced (see also
Section III-B). Furthermore, these bases $\{(U_a, V_a)\}$ can now be used to encode patches from a new set of
images that are somewhat similar to the ones existing in the training set. However, a practical application
demands that the reconstruction meet a specific error threshold on unseen patches, and hence the $L_0$ norm
of the projection matrix of the patch is adjusted dynamically in order to meet the error. Experimental
results using such a scheme are described in the next section.
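A testing-time encoder matching this description might look as follows. This is our sketch under stated assumptions (names are our own): for each learned basis pair, coefficients of $S = U^T P V$ are dropped smallest-first for as long as the per-pixel error budget $\delta$ holds, and the pair giving the sparsest admissible projection wins.

```python
import numpy as np

def encode_patch(P, bases, delta):
    """Encode patch P with the fewest coefficients meeting a per-pixel
    squared-error budget delta (Section II-C).

    bases: list of (U, V) orthonormal pairs learned during training.
    Returns the chosen exemplar index and the sparse projection matrix.
    By orthonormality, the reconstruction error equals the energy of
    the dropped coefficients, so the search is a simple cumulative sum.
    """
    m1, m2 = P.shape
    best = None
    for k, (U, V) in enumerate(bases):
        S = U.T @ P @ V
        order = np.argsort(np.abs(S).ravel())     # smallest magnitude first
        cum = np.cumsum(S.ravel()[order] ** 2)    # error if we drop them
        # how many coefficients can be dropped within the budget
        n_drop = int(np.searchsorted(cum, delta * m1 * m2, side="right"))
        nnz = S.size - n_drop
        if best is None or nnz < best[2]:
            W = S.copy().ravel()
            W[order[:n_drop]] = 0.0
            best = (k, W.reshape(S.shape), nnz)
    k, W, _ = best
    return k, W
```

The reconstruction `U @ W @ V.T` for the returned pair then has average per-pixel squared error of at most $\delta$.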
III. EXPERIMENTS: 2D IMAGES
In this section, we first describe the overall methodology for the training and testing phases of our
algorithm based on our earlier work in [16]. As we are working on a compression application, we then
describe the details of our image coding and quantization scheme. This is followed by a discussion
of the comparisons between our technique and other competitive techniques (including JPEG), and an
enumeration of the experimental results.
A. Training and Testing Phases
We tested our algorithm on the compression of the entire ORL database [32] and the entire Yale
database [33]. We divided the images in each database into patches of fixed size (12 x 12), and these
sets of patches were segregated into training and test sets. The test set for both databases consisted of
many more images than the training set. For the purpose of training, we learned a total of $K = 50$
orthonormal bases using a fixed value of $T$ to control the sparsity. For testing, we projected each patch
$P_i$ onto that exemplar $(U_i^*, V_i^*)$ which produced the sparsest projection matrix $S_i^*$ yielding an average
per-pixel reconstruction error $\frac{\|P_i - U_i^* S_i^* V_i^{*T}\|^2}{m_1 m_2}$ of no more than some chosen $\delta$. Note that different test
patches required different $T$ values, depending upon their inherent 'complexity'. Hence, we varied the
sparsity of the projection matrix (keeping its size fixed at 12 x 12) by greedily nullifying the smallest
elements in the matrix, without letting the reconstruction error exceed $\delta$. This gave us the flexibility
to adjust to patches of different complexities, without altering the rank of the exemplar bases $(U_i^*, V_i^*)$.
As any patch $P_i$ is projected onto exemplar orthonormal bases which are different from those produced
by its own SVD, the projection matrices turn out to be non-diagonal. Hence, there is no such thing as a
hierarchy of 'singular values' as in ordinary SVD. As a result, we cannot resort to restricting the rank
of the projection matrix (and thereby the rank of $(U_i^*, V_i^*)$) to adjust for patches of different complexity
(unless we learn a separate set of $K$ bases for each different rank $r$, where $1 \le r \le \min(m_1, m_2)$,
which would make the training and even the image coding [see Section III-B] very cumbersome). This
highlights an advantage of our approach over that of algorithms that adjust the rank of the projection
matrices.
B. Details of Image Coding and Quantization
We obtain $S_i^*$ by sparsifying $U_i^{*T} P_i V_i^*$. As $U_i^*$ and $V_i^*$ are orthonormal, we can show that the values
in $S_i^*$ will always lie in the range $[-m, m]$ for $m \times m$ patches, if the values of $P_i$ lie in $[0, 1]$. This can
be proved as follows. We know that $S_i^* = U_i^{*T} P_i V_i^*$. Hence we write the element of $S_i^*$ in the $a$th row
and $b$th column as follows:

$$S^*_{iab} = \sum_{kl} U^*_{iak} P_{ikl} V^*_{ilb} \le \frac{1}{\sqrt{m}} \frac{1}{\sqrt{m}} \sum_{kl} P_{ikl} \le \frac{1}{m}\, m^2 = m. \quad (11)$$

The first step follows because the maximum value of $S^*_{iab}$ is obtained when all the values of $U^*_{iak}$ and $V^*_{ilb}$ are
equal to $1/\sqrt{m}$. The second step follows because $S^*_{iab}$ meets its upper bound when all the $m^2$ values in the
patch are equal to 1. We can similarly prove that the lower bound on $S^*_{iab}$ is $-m$. In our case, since
$m = 12$, the upper and lower bounds are $+12$ and $-12$ respectively.
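The bound in Eqn. (11) is easy to check numerically. The sketch below is our own illustration (the function name is an assumption): it draws random orthonormal bases and random patches with intensities in $[0, 1]$ and records the largest coefficient magnitude seen.

```python
import numpy as np

def coeff_range(m, trials=200, seed=0):
    """Randomized sanity check of the coefficient bound in Eqn. (11):
    for m x m patches with intensities in [0, 1] and orthonormal U, V,
    every entry of S = U^T P V lies in [-m, m]."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        U, _ = np.linalg.qr(rng.standard_normal((m, m)))
        V, _ = np.linalg.qr(rng.standard_normal((m, m)))
        P = rng.random((m, m))                 # intensities in [0, 1]
        S = U.T @ P @ V
        worst = max(worst, np.abs(S).max())
    return worst
```

For $m = 12$, the value returned never exceeds 12, consistent with the derivation above.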
We Huffman-encoded [34] the integer parts of the values in the $\{S_i^*\}$ matrices over the whole image
(giving us an average of some $Q_1$ bits per entry) and quantized the fractional parts with $Q_2$ bits per
entry. Thus, we needed to store the following information per test patch to create the compressed image:
(1) the index of the best exemplar, using $a_1$ bits; (2) the location and value of each non-zero element of
its $S_i^*$ matrix, using $a_2$ bits per location and $Q_1 + Q_2$ bits for the value; and (3) the number of non-zero
elements per patch, encoded using $a_3$ bits. Hence the total number of bits per pixel for the whole image
is given by:

$$RPP = \frac{N(a_1 + a_3) + T_{whole}(a_2 + Q_1 + Q_2)}{M_1 M_2} \quad (12)$$

where $T_{whole} = \sum_{i=1}^{N} \|S_i^*\|_0$. The values of $a_1$, $a_2$ and $a_3$ were obtained by Huffman encoding [34].
After performing quantization on the values of the coefficients in $S_i^*$, we obtained new quantized projection
matrices, denoted as $\hat{S}_i^*$. Following this, the PSNR for each image was measured as $10 \log_{10} \frac{N m_1 m_2}{\sum_{i=1}^{N} \|P_i - U_i^* \hat{S}_i^* V_i^{*T}\|^2}$
and then averaged over the entire test set. The average number of bits per pixel (RPP) was calculated as
in Eqn. (12) for each image, and then averaged over the whole test set. We repeated this procedure for
different $\delta$ values from $8 \times 10^{-5}$ to $8 \times 10^{-3}$ (the range of image intensity values was $[0, 1]$) and plotted an
ROC curve of average PSNR vs. average RPP.
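The two per-image quantities can be computed as in the sketch below (our illustration; the bit costs $a_1$, $a_2$, $a_3$, $Q_1$, $Q_2$ are taken as given inputs rather than derived from an actual Huffman coder, and the PSNR follows the per-image definition above with intensities in $[0, 1]$).

```python
import numpy as np

def psnr_and_rpp(patches, recon, nnz_per_patch,
                 a1, a2, a3, Q1, Q2, M1, M2):
    """Rate/quality measures per image: Eqn. (12) and the PSNR definition.

    patches / recon: lists of original and reconstructed m1 x m2 patches.
    nnz_per_patch: number of non-zero projection coefficients per patch.
    a1..a3, Q1, Q2: per-item bit costs from the (Huffman) coding step.
    """
    N = len(patches)
    T_whole = int(np.sum(nnz_per_patch))          # sum_i ||S_i*||_0
    rpp = (N * (a1 + a3) + T_whole * (a2 + Q1 + Q2)) / (M1 * M2)
    m1, m2 = patches[0].shape
    sq_err = sum(np.sum((p - r) ** 2) for p, r in zip(patches, recon))
    psnr = 10 * np.log10(N * m1 * m2 / sq_err)
    return psnr, rpp
```

Averaging these two values over the test set, for a sweep of $\delta$, yields one point of the ROC curve per error threshold.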
C. Comparison with other techniques:
We pitted our method against four existing approaches, each of which is competitive and recent:
1) The K-SVD algorithm from [13], for which we used 441 unit-norm dictionary vectors of size
144. In this method, each patch is vectorized and represented as a sparse linear combination of the
dictionary vectors. During training, the value of $T$ is kept fixed, and during testing it is dynamically
adjusted depending upon the patch so as to meet the appropriate set of error thresholds represented
by $\delta \in [8 \times 10^{-5}, 8 \times 10^{-3}]$. For the purpose of fair comparison between our method and K-SVD,
we used the exact same values of $T$ and $\delta$ in both K-SVD as well as in our method. The method
used to find the sparse projection onto the dictionary was the orthogonal matching pursuit (OMP)
algorithm [27].
2) The overcomplete DCT dictionary with 441 unit-norm vectors of size 144, created by sampling
cosine waves of various frequencies, again with the same $T$ value, and using the OMP technique
for sparse projection.
3) The SSVD method from [25], which is not a learning-based technique. It is based on the creation
of subsampled versions of the image by traversing the image in a manner different from the usual
raster scanning.
4) The JPEG standard (for which we used the implementation provided in MATLAB), for which
we measured the RPP from the number of bytes for storage of the file.
For K-SVD and overcomplete DCT, there exist bounds on the values of the coefficients that are very
similar to those for our technique, as presented in Eqn. (11). For both these methods, the RPP value
was computed using the formula in [13], Eqn. (27), with the modification that the integer parts of the
coefficients of the linear combination were Huffman-encoded and the fractional parts separately quantized
(as this gave better ROC curves for these methods). For the SSVD method, the RPP was calculated as
in [25], Section (5).
D. Results
For the ORL database, we created a training set of patches of size 12 x 12 from images of 10 different
people, with 10 images per person. Patches from images of the remaining 30 people (10 images per
person) were treated as the test set. From the training set, a total of 50 pairs of orthonormal bases were
[Figure 1: panels (a) 'ORL: PSNR vs RPP; T = 10' and (b) 'Yale: PSNR vs RPP; T = 3' show ROC curves for the five methods; panel (c) shows variance in RPP vs. error for the ORL database; panels (d)-(h) show sample images.]
Fig. 1. ROC curves on (a) ORL and (b) Yale databases. Legend: Red (1): Our Method, Blue (2): K-SVD, Green (3): Overcomplete
DCT, Black (4): SSVD, and Magenta (5): JPEG. (c) Variance in RPP versus pre-quantization error for ORL. (d)
Original image from the ORL database. Sample reconstructions of (d) with $\delta = 3 \times 10^{-4}$ by using (e) Our Method [RPP: 1.785,
PSNR: 35.44], (f) K-SVD [RPP: 2.112, PSNR: 35.37], (g) Overcomplete DCT [RPP: 2.929, PSNR: 35.256], (h) SSVD [RPP:
2.69, PSNR: 34.578]. These plots are best viewed in color.
learned using the algorithm described in Section II-B. The T value for the sparsity of the projection matrices
was set to 10 during training. As shown in Figure 1(c), we also computed the variance in the RPP
values for each pre-quantization error value, over the images of the ORL database. Note that the
variance in RPP decreases as the specified error increases. This is because at very low errors, different
images require different numbers of coefficients in order to meet the error threshold.
The same experiment was run on the Yale database with a value of T = 3 on patches of size 12 x 12.
The training set consisted of a total of 64 images of a single person under different illumination
conditions, whereas the test set consisted of 65 images each of the remaining 38 people from the
database (i.e. 2470 images in all), under different illumination conditions. The ROC curves for our
method were superior to those of the other methods over a significant range of δ, for the ORL as well as
the Yale database, as seen in Figures 1(a) and 1(b). Sample reconstructions of an image from the ORL
database are shown in Figures 1(e), 1(f), 1(g) and 1(h) for δ = 3 x 10^-4. For this image, our method
produced a better PSNR-to-RPP ratio than the others. Figure 2 also shows reconstructed versions of another
image from the ORL database using our method with different values of δ. For experiments on the
Fig. 2. Reconstructed versions of an image from the ORL database using 5 different error values (0.003, 0.001, 0.0005,
0.0003 and 0.0001). The original image is at the top left.
ORL database, the number of bits used to code the fractional parts of the coefficients of the projection
matrices [i.e. Q2 in Eqn. (12)] was set to 5. For the Yale database, we often obtained pre-quantization
errors significantly below the chosen δ, and hence using a value of Q2 of less than 5 bits often did not
raise the post-quantization error above δ. Keeping this in mind, we varied the value of Q2 dynamically for
each pre-quantization error value. The same variation was applied to the KSVD and overcomplete DCT
techniques as well. Furthermore, the dictionary sizes used by all the methods compared in
these experiments are shown in Table I.
E. Effect of Different Parameters on the Performance of our Method
The different parameters in our method are: (1) the size of the patches used for training and testing,
(2) the number of pairs of orthonormal bases, i.e. K, and (3) the sparsity of each patch during training,
i.e. T. In the following, we describe the effect of varying each of these parameters:
1) Patch size: If the patch size is too large, it becomes increasingly unlikely (due to the
curse of dimensionality) that a fixed number of orthonormal matrices will serve as adequate bases
for these patches in terms of a low-error sparse representation. Furthermore, if the patch size is
Method           | Dictionary Size (number of scalars)
Our Method       | 50 x 2 x 12 x 12 = 14400
KSVD             | 441 x 12 x 12 = 63504
Overcomplete DCT | 441 x 12 x 12 = 63504
SSVD             | none (not a learning-based method)
JPEG             | none (fixed transform)
TABLE I
COMPARISON OF DICTIONARY SIZE FOR VARIOUS METHODS FOR EXPERIMENTS ON THE ORL AND YALE DATABASES.
[Figure 3 plots: (a) ORL PSNR vs. RPP for patch sizes 8 x 8, 12 x 12, 15 x 15 and 20 x 20; (b) ORL PSNR vs. RPP for different values of T.]
Fig. 3. ROC curves for the ORL database using our technique with (a) different patch sizes and (b) different values of T for a
fixed patch size of 12 x 12. These plots are best viewed in color.
too large, any algorithm will lose the advantages of locality and simplicity. However, if the
patch size is too small (say 2 x 2), there is very little a compression algorithm can do to lower
the number of bits required to store the patches. We stress that the patch size is a
free parameter common to all the algorithms against which we compare our technique
(including JPEG). Also, the choice of this parameter is mainly empirical. We tested the performance
of our algorithm with K = 50 pairs of orthonormal bases on the ORL database, using patches of
size 8 x 8, 12 x 12, 15 x 15 and 20 x 20, with appropriately chosen (different) values of T. As
shown in Figure 3(a), the patch size of 12 x 12 with T = 10 yielded the best performance, though
the other patch sizes also performed quite well.
2) The number of orthonormal bases K: The choice of this number is not critical, and can be set to
as high a value as desired without affecting the accuracy of the results, though it will increase the
number of bits to store the basis index (see Eqn. 12). This parameter should not be confused with
the number of mixture components in standard mixture-model based density estimation, because
in our method each patch is projected onto only a single set of orthonormal bases, i.e. we do
not compute a combination of projections onto all K different orthonormal basis pairs. The
only downside of a higher value of K is the added computational cost during training. Again, note
that this parameter will be part of any learning-based compression algorithm that uses a set of
bases to express image patches.
3) The value of T during training: The value of T is fixed only during training, and is varied for each
patch during the testing phase so as to meet the required error threshold. A very high value of
T during training can cause the orthonormal basis pairs to overfit to the training data (variance),
whereas a very low value can cause a large bias error. This is an instance of the well-known
bias-variance tradeoff common to machine learning algorithms [35]. Our choice of T was empirical,
though the value T = 10 performed very well on nearly all the chosen datasets. The effect of
different values of T on the compression performance for the ORL database is shown in Figure
3(b). We again emphasize that the issue of choosing an 'optimal' T is not an artifact of
our algorithm per se. For instance, this issue is also encountered in pursuit algorithms that
find an approximately optimal linear combination of unit vectors from a dictionary. In the latter
case, however, there is also the problem of poorer approximation errors as T increases, under
certain conditions on the dictionary of overcomplete basis vectors [27].
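The per-patch sparsity adjustment at test time can be computed directly. Because the bases are orthonormal, the squared reconstruction error incurred by keeping only the largest-magnitude projection coefficients equals the sum of squares of the dropped coefficients, so the minimal T meeting a per-pixel error threshold can be read off the sorted coefficients. The following is our own NumPy sketch; the function name and the normalization by the number of patch entries are our assumptions, not details from the paper:

```python
import numpy as np

def minimal_sparsity(S, delta):
    """Smallest number of retained coefficients T such that the per-entry
    error of zeroing the remaining coefficients stays within delta.
    S: full projection-coefficient array for one patch (orthonormal bases).
    delta: allowed per-entry squared error."""
    sq = np.sort(np.abs(S).ravel()) ** 2      # squared magnitudes, ascending
    tail = np.cumsum(sq)                      # tail[k-1]: error if the k smallest are dropped
    budget = delta * S.size                   # total error budget for the patch
    dropped = np.searchsorted(tail, budget, side='right')
    return S.size - dropped                   # number of retained coefficients T
```

With delta = 0, every non-zero coefficient is retained; a very large delta drives T to zero.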
F. Performance on Random Collections of Images
We performed additional experiments to study the behavior of our algorithm when the training and test
sets were very different from one another. To this end, we used the database of 155 natural images (in
uncompressed format) from the Computer Vision Group at the University of Granada [36]. The database
consists of images of faces, houses, natural scenery, inanimate objects and wildlife. We converted all
the given images to grayscale. Our training set consisted of patches (of size 12 x 12) from 11 images.
Patches from the remaining images formed the test set. Though the images in the training and test
sets were very different, our algorithm produced excellent results, superior to JPEG up to an RPP value of
3.1 bits, as shown in Figure 4(a). Three training images and reconstructed versions of three different test
images, for error values of 0.0002, 0.0006 and 0.002 respectively, are shown in Figure 5. To test the
effect of minute noise on the performance of our algorithm, a similar experiment was run on the UCID
database [37] with a training set of 10 images and the remaining 234 images of the database as the test
set. The test images were significantly different from the ones used for training. The images were scaled down
[Figure 4 plots: (a) CVG Granada PSNR vs. RPP with T = 10; (b) UCID PSNR vs. RPP. Each panel compares our method with JPEG.]
Fig. 4. PSNR vs. RPP curve for our method and JPEG on (a) the CVG Granada database and (b) the UCID database.
by a factor of two (i.e. to 320 x 240 instead of 640 x 480) for faster training, and subjected to zero-mean
Gaussian noise of variance 0.001 (on a scale of 0 to 1). Our algorithm was again competitive with
JPEG, as shown in Figure 4(b).
IV. THEORY: 3D (OR HIGHER-D) IMAGES
We now consider a set of images represented as third-order matrices (say RGB face images), each of
size M1 x M2 x M3. We divide each image into non-overlapping patches of size m1 x m2 x m3, m1 ≤
M1, m2 ≤ M2, m3 ≤ M3, and treat each patch as a separate tensor. Just as before, we start by exploiting
the similarity inherent in these patches, and represent them by means of sparse projections onto a triple
of exemplar orthonormal bases. Again, we learn these exemplars a priori from a set of training image
patches. We would like to point out that our extension to higher dimensions is non-trivial, especially
considering that a key property of the SVD in 2D (namely, that the SVD provides the optimal lower-rank
reconstruction by simple nullification of the lowest singular values) does not extend to higher-dimensional
analogs such as HOSVD [18]. In the following, we describe the mathematical structure
of the exemplars. We would like to emphasize that though the derivations presented in this paper are for
3D matrices, a very similar treatment is applicable to learning n-tuples of exemplar orthonormal bases
to represent patches from nD matrices (where n ≥ 4).
A. Exemplar Bases and Sparse Projections
Let P ∈ R^(m1 x m2 x m3) be an image patch. Using HOSVD, we can represent P as a combination of
orthonormal bases U ∈ R^(m1 x m1), V ∈ R^(m2 x m2) and W ∈ R^(m3 x m3), in the form P = S x1 U x2 V x3 W,
Fig. 5. (a) to (c): Training images from the CVG Granada database. (d) to (l): Reconstructed versions of three test images for
error values of 0.0002, 0.0006 and 0.002 respectively.
where S ∈ R^(m1 x m2 x m3) is termed the core tensor. The operators xn refer to tensor-matrix multiplication
over different axes. The core tensor has special properties such as all-orthogonality and ordering. For
further details of the core tensor and other aspects of multilinear algebra, the reader is referred to [18].
Now, P can also be represented as a combination of any set of orthonormal bases U, V and W different
from those obtained from the HOSVD of P. In this case, we have P = S x1 U x2 V x3 W, where S is
not guaranteed to be an all-orthogonal tensor, nor is it guaranteed to obey the ordering property.
As the HOSVD does not necessarily provide the best low-rank approximation to a tensor [18], we
choose to depart from this notion (instead of settling for a suboptimal approximation), and instead
answer the following question: What sparse tensor Q ∈ R^(m1 x m2 x m3) will reconstruct P from a triple of
orthonormal bases (U, V, W) with the least error ||P - Q x1 U x2 V x3 W||^2? Sparsity is again quantified
by an upper bound T on the L0 norm of Q (denoted ||Q||_0). We now prove that the optimal Q with
this sparsity constraint is obtained by nullifying the m1m2m3 - T least (in absolute value) elements of
the estimated projection tensor S = P x1 U^T x2 V^T x3 W^T. Due to the orthonormality of U, V and
W, this simple greedy algorithm turns out to be optimal (see Theorem 2).
Theorem 2: Given a triple of orthonormal bases (U, V, W), the optimal sparse projection tensor Q with
||Q||_0 = T is obtained by setting to zero those m1m2m3 - T elements of the tensor S = P x1 U^T x2 V^T x3 W^T
having least absolute value.
Proof: We have P = S x1 U x2 V x3 W. The error in reconstructing the patch P using some other tensor Q
is e = ||(S - Q) x1 U x2 V x3 W||^2. For any tensor X, we have ||X||^2 = ||X(n)||^2 (i.e. the Frobenius norm
of the tensor and of its nth unfolding are the same [18]). Also, by the matrix representation of HOSVD, we
have X(1) = U S(1) (V ⊗ W)^T, where ⊗ denotes the Kronecker product (see the footnote below). Hence, it follows that e = ||U (S - Q)(1) (V ⊗ W)^T||^2. This gives us
e = ||S - Q||^2. The last step follows because U, V and W, and hence V ⊗ W, are orthonormal matrices.
Let I1 = {(i, j, k) : Q_ijk = 0} and I2 = {(i, j, k) : Q_ijk ≠ 0}. Then e = Σ_{(i,j,k) ∈ I1} S_ijk^2 + Σ_{(i,j,k) ∈ I2} (S_ijk -
Q_ijk)^2. This error is minimized when Q_ijk = S_ijk at all locations where Q_ijk ≠ 0, and Q_ijk = 0 at
those indices where the corresponding values in S are as small as possible in absolute value. Thus, if we want ||Q||_0 = T,
then Q is the tensor obtained by nullifying the m1m2m3 - T entries of S that have the least absolute
value and leaving the remaining elements intact. □
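The greedy projection of Theorem 2 can be sketched in NumPy as follows. This is our own illustrative code, not the authors' implementation, and the function names are ours:

```python
import numpy as np

def mode_product(X, M, axis):
    # mode-n product: multiply tensor X by matrix M along the given axis
    return np.moveaxis(np.tensordot(M, X, axes=(1, axis)), 0, axis)

def greedy_sparse_projection(P, U, V, W, T):
    # full projection tensor S = P x1 U^T x2 V^T x3 W^T
    S = mode_product(mode_product(mode_product(P, U.T, 0), V.T, 1), W.T, 2)
    # Theorem 2: zero all but the T largest-magnitude entries of S
    Q = S.copy()
    drop = np.argsort(np.abs(S).ravel())[:S.size - T]
    Q.ravel()[drop] = 0.0
    return S, Q

def reconstruct(Q, U, V, W):
    # Q x1 U x2 V x3 W
    return mode_product(mode_product(mode_product(Q, U, 0), V, 1), W, 2)
```

When all coefficients are retained (T = m1m2m3) the reconstruction is exact; for smaller T, the squared error equals the energy of the nullified coefficients, exactly as the proof shows.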
We wish to re-emphasize that a key feature of our approach is that the same technique used for
2D images scales to higher dimensions. Tensor decompositions such as HOSVD do not share this feature,
because the optimal low-rank reconstruction property of SVD does not extend to HOSVD. Furthermore,
though the upper and lower error bounds for core-tensor truncation in HOSVD derived in [24] are very
interesting, they are applicable only when the entire set of images has a common basis (i.e. common U,
V and W matrices), which may not be sufficient to compactly account for the large variability in real-world
datasets.
B. Learning the Bases
We now describe a method to learn K exemplar orthonormal bases {(Ua, Va, Wa)}, 1 ≤ a ≤ K, to
encode a training set of N image patches Pi ∈ R^(m1 x m2 x m3) (1 ≤ i ≤ N) with the least possible error (in
the sense of the L2 norm of the difference between the original and reconstructed patches). Note that
K ≪ N. In addition, we impose a sparsity constraint that every Sia (the tensor used to reconstruct Pi
5. Here A ⊗ B refers to the Kronecker product of matrices A ∈ R^(E1 x E2) and B ∈ R^(F1 x F2), which is given as A ⊗ B =
(A_{e1e2} B).
from (Ua, Va, Wa)) has at most T non-zero elements. The main objective function to be minimized is:

E({Ua, Va, Wa, Sia, Mia}) = Σ_{i=1}^{N} Σ_{a=1}^{K} Mia ||Pi - Sia x1 Ua x2 Va x3 Wa||^2    (13)

subject to the following constraints:

(1) Ua^T Ua = Va^T Va = Wa^T Wa = I, ∀a.  (2) ||Sia||_0 ≤ T, ∀(i, a).
(3) Σ_a Mia = 1, ∀i, and Mia ∈ {0, 1}, ∀(i, a).    (14)
Here M is a binary matrix of size N x K whose entry Mia indicates whether the ith patch belongs to the space
defined by (Ua, Va, Wa). Just as in the 2D case, we relax the binary membership constraint so that
now Mia ∈ (0, 1), ∀(i, a), subject to Σ_{a=1}^{K} Mia = 1, ∀i. Using Lagrange parameters {μi}, symmetric
Lagrange matrices {Λ1a}, {Λ2a} and {Λ3a}, and a temperature parameter β, we obtain the following
deterministic annealing objective function:

E({Ua, Va, Wa, Sia, Mia}) = Σ_{i=1}^{N} Σ_{a=1}^{K} Mia ||Pi - Sia x1 Ua x2 Va x3 Wa||^2 +
(1/β) Σ_{i,a} Mia log Mia + Σ_i μi (Σ_a Mia - 1) +
Σ_a trace[Λ1a (Ua^T Ua - I)] + Σ_a trace[Λ2a (Va^T Va - I)] + Σ_a trace[Λ3a (Wa^T Wa - I)].    (15)
We first initialize {Ua}, {Va} and {Wa} to random orthonormal matrices, ∀a, and set Mia = 1/K, ∀(i, a).
Secondly, using the fact that {Ua}, {Va} and {Wa} are orthonormal, the projection tensor Sia is computed
by the rule:

Sia = Pi x1 Ua^T x2 Va^T x3 Wa^T, ∀(i, a).    (16)

This is the minimum of the energy function ||Pi - Sia x1 Ua x2 Va x3 Wa||^2. Then the m1m2m3 - T elements
of Sia with least absolute value are nullified. Thereafter, Ua, Va and Wa are updated as follows. Let us
denote the sum of the terms in the above energy function that are independent of Ua as C. Then we
can write the energy function as:

E({Ua, Va, Wa, Sia, Mia}) = Σ_{i,a} Mia ||Pi - Sia x1 Ua x2 Va x3 Wa||^2 +
Σ_a trace[Λ1a (Ua^T Ua - I)] + C
= Σ_{i,a} Mia ||Pi(1) - (Sia x1 Ua x2 Va x3 Wa)(1)||^2 +
Σ_a trace[Λ1a (Ua^T Ua - I)] + C.    (17)
Note that here Pi(n) refers to the nth unfolding of the tensor Pi (refer to [18] for exact details). Also note
that the above step uses the fact that for any tensor X, we have ||X||^2 = ||X(n)||^2. Now, the objective
function can be further expressed as follows:

E({Ua, Va, Wa, Sia, Mia}) = Σ_{i,a} Mia ||Pi(1) - Ua Sia(1) (Va ⊗ Wa)^T||^2 +
Σ_a trace[Λ1a (Ua^T Ua - I)] + C.    (18)

Further simplification gives:

E({Ua, Va, Wa, Sia, Mia}) =
Σ_{i,a} Mia trace[(Pi(1) - Ua Sia(1) (Va ⊗ Wa)^T)^T (Pi(1) - Ua Sia(1) (Va ⊗ Wa)^T)] +
Σ_a trace[Λ1a (Ua^T Ua - I)] + C
= Σ_{i,a} Mia [trace[Pi(1)^T Pi(1)] - 2 trace[Pi(1)^T Ua Sia(1) (Va ⊗ Wa)^T] +
trace[Sia(1)^T Sia(1)]] + Σ_a trace[Λ1a (Ua^T Ua - I)] + C.    (19)
Taking the partial derivative of E w.r.t. Ua and setting it to zero, we have:

∂E/∂Ua = -2 Σ_i Mia Pi(1) (Va ⊗ Wa) Sia(1)^T + 2 Ua Λ1a = 0.    (20)

After a series of manipulations to eliminate the Lagrange matrices, we arrive at the following update rule
for Ua:

ZUa = Σ_i Mia Pi(1) (Va ⊗ Wa) Sia(1)^T;  Ua = ZUa (ZUa^T ZUa)^(-1/2) = Γ1a Ψ1a^T.    (21)

Here Γ1a and Ψ1a are the orthonormal matrices obtained from the SVD of ZUa. The updates for Va and Wa
are obtained similarly and are given below:

ZVa = Σ_i Mia Pi(2) (Wa ⊗ Ua) Sia(2)^T;  Va = ZVa (ZVa^T ZVa)^(-1/2) = Γ2a Ψ2a^T.    (22)

ZWa = Σ_i Mia Pi(3) (Ua ⊗ Va) Sia(3)^T;  Wa = ZWa (ZWa^T ZWa)^(-1/2) = Γ3a Ψ3a^T.    (23)

Here again, Γ2a and Ψ2a refer to the orthonormal matrices obtained from the SVD of ZVa, and Γ3a
and Ψ3a to those obtained from the SVD of ZWa. The membership values are
obtained by the following update:

Mia = exp(-β ||Pi - Sia x1 Ua x2 Va x3 Wa||^2) / Σ_{b=1}^{K} exp(-β ||Pi - Sib x1 Ub x2 Vb x3 Wb||^2).    (24)
The core tensors {Sia}, the matrices {(Ua, Va, Wa)} and the membership matrix M are then updated sequentially for a fixed β
value until convergence. The value of β is then increased, and the sequential updates are repeated. The
entire process is repeated until an integrality condition is met.
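Two of the numerical steps above are easy to isolate. The sketch below (our own, not the paper's code) implements the orthonormal update Ua = ZUa (ZUa^T ZUa)^(-1/2) = Γ1a Ψ1a^T via the SVD of ZUa, as in Eqns. (21)-(23), and the softmax-style membership update of Eqn. (24); subtracting the row minimum before exponentiating is our addition for numerical stability and leaves the result unchanged:

```python
import numpy as np

def orthonormal_update(Z):
    # Z (Z^T Z)^(-1/2) equals Gamma @ Psi^T, where Z = Gamma Lambda Psi^T is the SVD of Z
    Gamma, _, PsiT = np.linalg.svd(Z, full_matrices=False)
    return Gamma @ PsiT

def membership_update(errors, beta):
    # errors[i, a]: reconstruction error of patch i under basis triple a (Eqn. 24)
    w = np.exp(-beta * (errors - errors.min(axis=1, keepdims=True)))
    return w / w.sum(axis=1, keepdims=True)
```

As beta grows, the memberships approach the hard binary assignments required by the integrality condition.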
C. Application to Compact Image Representation
Quite similar to the 2D case, after the optimization in the training phase, each training patch Pi
(1 ≤ i ≤ N) is represented as a projection onto the one of the K exemplar orthonormal bases that
produces the least reconstruction error, i.e. the kth exemplar is chosen if ||Pi - Sik x1 Uk x2 Vk x3
Wk||^2 ≤ ||Pi - Sia x1 Ua x2 Va x3 Wa||^2, ∀a ∈ {1, 2, ..., K}, 1 ≤ k ≤ K. For training patch Pi, we
denote the corresponding 'optimal' projection tensor as Si* = Sik, and the corresponding exemplar as
(Ui*, Vi*, Wi*) = (Uk, Vk, Wk). Thus the entire set of patches can be closely approximated by (1) the
common set of basis triples {(Ua, Va, Wa)}, 1 ≤ a ≤ K (K ≪ N), and (2) the optimal sparse projection
tensors {Si*} for each patch, with at most T non-zero elements each. The overall storage per image is
thus greatly reduced. Furthermore, these bases {(Ua, Va, Wa)} can now be used to encode patches from
a new set of images that are similar to the ones in the training set, though, just as in the 2D
case, the sparsity of each patch is adjusted dynamically in order to meet a given error threshold.
Experimental results for this are provided in the next section.
V. EXPERIMENTS: COLOR IMAGES
In this section, we describe the experiments we performed on color images represented in the RGB
color scheme. Each image of size M1 x M2 x 3 was treated as a thirdorder matrix. Our compression
algorithm was tested on the GaTech Face Database [38], which consists of RGB color images with 15
images each of 50 different people. The images in the database are already cropped to include just the
face, but some of them contain small portions of a distinctly cluttered background. The average image
size is 150 x 150 pixels, and all the images are in the JPEG format. For our experiments, we divided this
database into a training set of one image each of 40 different people, and a test set of the remaining 14
images each of these 40 people, and all 15 images each of the remaining 10 people. Thus, the size of the
test set was 710 images. The patchsize we chose for the training and test sets was 12 x 12 x 3 and we
experimented with K = 100 different orthonormal bases learned during training. The value of T during
training was set to 10. At the time of testing, for our method, each patch was projected onto that triple
of orthonormal bases (Ui, V*, W*) which gave the sparsest projection tensor S7 such that the perpixel
reconstruction error IISixlx2V* x3W* 12 was no greater than a chosen 6. Note that, in calculating the
rIme2
per-pixel reconstruction error, we did not divide by the number of channels (i.e. 3), because at each pixel
three values are defined. We experimented with different reconstruction error values δ ranging from
8 x 10^-5 to 8 x 10^-3. Following the reconstruction, the PSNR for the entire image was measured and
averaged over the entire test set. The total number of bits per pixel, i.e. RPP, was also calculated for
each image and averaged over the test set. The details of the training and testing methodology, as well as
the actual quantization and coding steps, are the same as presented previously in Sections III-A and III-B.
A. Comparisons with Other Methods
The results obtained by our method were compared to those obtained by the following techniques:
1) KSVD, for which we used patches of size 12 x 12 x 3 reshaped to give vectors of size 432, and
used these to train a dictionary of 1340 vectors using a value of T = 30.
2) Our algorithm for 2D images from Section II with an independent (separate) encoding of each of
the three channels. As an independent coding of the R, G and B slices would fail to account for
the inherent correlation between the channels (and hence give inferior compression performance),
we used principal components analysis (PCA) to find the three principal components of the R, G,
B values of each pixel from the training set. The R, G, B pixel values from the test images were
then projected onto these principal components to give a transformed image in which the values
in each of the different channels are decorrelated. A similar approach has been taken earlier in
[39] for compression of color images of faces using vector quantization, where the PCA method is
empirically shown to produce channels that are even more decorrelated than those of the Y-Cb-Cr
color model. The orthonormal bases were learned on each of the three (decorrelated) channels
of the PCA image. This was followed by a quantization and coding step similar to that described
in Section III-B. However, in the color image case, the Huffman encoding step [34] for finding the
optimal values of a1, a2 and a3 in Eqn. (12) was performed using projection matrices from all
three channels together. This was done to improve the coding efficiency.
3) The JPEG standard (its MATLAB implementation), for which we calculated the RPP from the number
of bytes of file storage on disk. See Section V-B for more details.
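The channel-decorrelation step in item 2 can be sketched as follows. This is our own illustrative NumPy code (the function names are assumptions, not the authors'): it fits the three principal components of the training-set RGB values and projects the pixels of an image onto them.

```python
import numpy as np

def fit_rgb_pca(train_pixels):
    """train_pixels: (n, 3) array of RGB values gathered from the training images.
    Returns the channel mean and the 3 principal directions (as columns)."""
    mean = train_pixels.mean(axis=0)
    cov = np.cov(train_pixels - mean, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    return mean, evecs[:, ::-1]             # largest-variance direction first

def decorrelate_image(image, mean, pcs):
    """image: (H, W, 3). Projects each pixel's RGB value onto the principal
    components, yielding three (approximately) decorrelated channels."""
    flat = image.reshape(-1, 3) - mean
    return (flat @ pcs).reshape(image.shape)
```

By construction, the covariance of the transformed training pixels is diagonal, which is the decorrelation property exploited above.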
We would like to mention here that we did not compare our technique with [39]. The latter technique
uses patches from color face images and encodes them using vector quantization. The patches from
more complex regions of the face (such as the eyes, nose and mouth) are encoded using a separate vector
quantization step for better reconstruction. The method in [39] requires prior demarcation of such regions,
which in itself is a highly difficult task to automate, especially under varying illumination and pose. Our
method does not require any such prior segmentation, as we tune the number of coefficients to
meet the pre-specified reconstruction error.
B. Results
As can be seen from Figure 6(a), all three methods perform well, though the higher-order method
produced the best results beyond a bit rate of around 2 bits per pixel. For the GaTech database, we did not
compare our algorithm directly with JPEG, because the images in the database are already in the JPEG
format. However, to facilitate comparison with JPEG, we used a subset of 54 images from the CMU-PIE
database [40]. The CMU-PIE database contains images of several people against cluttered backgrounds,
with a large variation in pose, illumination, facial expression and occlusions created by spectacles. All
the images are available in an uncompressed (.ppm) format, and their size is 631 x 467 pixels. We
chose 54 images belonging to a single person, and used exactly one image for training and all
the remaining images for testing. The results produced by our method using higher-order matrices, our method
on separate channels, and also KSVD were competitive with those produced by JPEG.
For bit rates greater than 1.5 per pixel, our methods produced performance superior to that of
JPEG in terms of PSNR for a given RPP, as seen in Figure 8. The parameters for this experiment
were K = 100 and T = 10 for training. Sample reconstructions of an image from the CMU-PIE database
using our higher-order method for different error values are shown in Figure 7. This is quite interesting,
since there is considerable variation between the training image and the test images, as is clear from
Figure 7. We would like to emphasize that the experiments were carried out on uncropped images of
full size with the complete cluttered background. Also, the dictionary sizes for the experiments on
each of these databases are summarized in Table II. For color-image patches of size m1 x m2 x 3 using
K sets of bases, our method in 3D requires a dictionary of size 3Km1m2, whereas our method in 2D
requires a dictionary of size 2Km1m2 per channel, i.e. 6Km1m2 in total.
C. Comparisons with JPEG on Noisy Datasets
As mentioned before, we did not directly compare our results to the JPEG technique for the GaTech
Face Database, because the images in the database are already in the JPEG format. Instead, we added
zeromean Gaussian noise of variance 8 x 104 (on a color scale of [0, 1]) to the images of the GaTech
database and converted them to a raw format. Following this, we converted these raw images back to JPEG
(using MATLAB) and measured the RPP and PSNR. These figures were pitted against those obtained
by our higherorder method, our method on separate channels, as well as KSVD, as shown in Figures
[Figure 6 plots: (a) PSNR vs. RPP, training and testing on clean images; (b) PSNR vs. RPP, clean training and noisy testing; (c) PSNR vs. RPP, noisy training and noisy testing. Each panel compares our method (3D), our method (2D), KSVD and JPEG.]
Fig. 6. ROC curves for the GaTech database: (a) when training and testing were done on the clean original images, (b) when
training was done on clean images but testing was done with additive zero-mean Gaussian noise of variance 8 x 10^-4 added to
the test images, and (c) when training and testing were both done with zero-mean Gaussian noise of variance 8 x 10^-4 added
to the respective images. The methods tested were our higher-order method, our method in 2D, KSVD and JPEG. Parameters
for our methods: K = 100 and T = 10 during training. These plots are best viewed in color.
[Figure 7 panels: the training image, the original test image, and reconstructions for error values 0.001, 0.0005, 0.0003 and 0.0001.]
Fig. 7. Sample reconstructions of an image from the CMU-PIE database with different error values using our higher-order
method. The original image is at the top left. These images are best viewed in color.
[Figure 8 plot: PSNR vs. RPP on CMU-PIE for our method (3D), our method (2D), KSVD and JPEG.]
Fig. 8. ROC curves for the CMU-PIE database using our method in 3D, our method in 2D, KSVD and JPEG. Parameters for
our methods: K = 100 and T = 10 during training. These plots are best viewed in color.
Method          | Dictionary Size (number of scalars)
Our Method (3D) | 100 x 3 x 12 x 12 = 43200
Our Method (2D) | 100 x 2 x 3 x 12 x 12 = 86400
KSVD            | 1340 x 12 x 12 x 3 = 578880
JPEG            | none (fixed transform)
TABLE II
COMPARISON OF DICTIONARY SIZE FOR VARIOUS METHODS FOR EXPERIMENTS ON THE GATECH AND CMU-PIE DATABASES.
6(b) and 6(c). The performance of JPEG was distinctly inferior, despite the fact that the added noise
did not have a very high variance. The reason is that JPEG exploits the fact that,
in the representation of most natural images, the lower frequencies strongly dominate. This assumption
is invalid in the case of sensor noise. Hence the DCT coefficients of noisy images have prominently
higher values, as a result of which JPEG produces higher bit rates for the same PSNR. For the purposes
of comparison, we ran two experiments using our higher-order method, our separate-channel method and
KSVD as well. In the first experiment, noise was added only to the test set, while the orthonormal
bases or the dictionary were learned on a clean training set (i.e. without noise added to the training
images). In the second experiment, noise was added to every image in the training set, and all methods
were trained on these noisy images. The testing was performed on (noisy) images from the test set, and
the ROC curves were plotted as usual. As can be seen from Figures 6(b) and 6(c), the PSNR values for
JPEG plateau rather quickly. Note that in all these experiments, the PSNR is calculated from
the squared difference between the reconstructed image and the noisy input image. This is because the
'original clean' image would usually be unavailable in a practical application requiring compression
of noisy data.
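The claim that sensor noise inflates the high-frequency DCT coefficients can be illustrated with a small experiment of our own (not from the paper): build an orthonormal 2D DCT, transform a smooth patch before and after adding mild Gaussian noise, and compare the energy outside the low-frequency corner.

```python
import numpy as np

def dct_matrix(n):
    # orthonormal DCT-II basis matrix (rows are basis vectors)
    i = np.arange(n)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i[None, :] + 1) * i[:, None] / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)
    return C

def highfreq_energy(patch, keep=4):
    # energy of the 2D DCT coefficients outside the keep x keep low-frequency corner
    C = dct_matrix(patch.shape[0])
    coeffs = C @ patch @ C.T
    return np.sum(coeffs ** 2) - np.sum(coeffs[:keep, :keep] ** 2)

rng = np.random.default_rng(0)
smooth = np.outer(np.linspace(0.2, 0.8, 8), np.linspace(0.3, 0.7, 8))  # smooth ramp patch
noisy = smooth + rng.normal(0.0, 0.03, smooth.shape)                   # mild sensor noise
print(highfreq_energy(smooth), highfreq_energy(noisy))
```

The noisy patch carries visibly more high-frequency energy, which is exactly why JPEG's entropy coding spends more bits on it.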
VI. CONCLUSION
We have presented a new technique for sparse representation of image patches, based on the
notion of treating an image as a 2D entity. We go beyond the usual singular value decomposition of
individual images or image patches to learn a set of common, full-sized orthonormal bases for sparse
representation, as opposed to low-rank representation. For the projection of a patch onto these bases, we
have provided a very simple and provably optimal greedy algorithm, unlike the approximation algorithms
that are part and parcel of projection-pursuit techniques, and which may require restrictive assumptions
on the nature or properties of the dictionary [27]. Based on the developed theory, we have demonstrated a
successful application of our technique to the compression of two well-known face databases.
Our compression method is able to handle the varying complexity of different patches by dynamically
altering the number of coefficients (and hence the bit rate for storage) in the projection matrix of the
patch. Furthermore, this paper also presents a clean and elegant extension to higher-order matrices, for
which we have presented applications to color-image compression. Unlike decompositions such as HOSVD,
which do not retain the optimal low-rank reconstruction property of SVD, our method scales cleanly
to higher dimensions. The experimental results show that our technique compares very favorably with
other existing approaches from the recent literature, including the JPEG standard. We have also empirically
examined the performance of our algorithm on noisy color image datasets.
Directions for future work include: (1) investigation of alternative matrix representations tuned for
specific applications (as opposed to using the default rows and columns of the image), (2) application of
our technique to the compression of grayscale and color video (which can be represented as 3D and 4D
matrices respectively), (3) application of our technique to denoising or classification, (4) a theoretical
study of the method in the context of natural image statistics, and (5) a study on utilizing a perceptually
driven quality metric [41] as opposed to a simple L2 norm.
REFERENCES
[1] A. Rangarajan, "Learning matrix space image representations", in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), vol. 2134 of LNCS, pp. 153–168, Springer-Verlag, 2001.
[2] J. Ye, "Generalized low rank approximations of matrices", Machine Learning, vol. 61, pp. 167–191, 2005.
[3] J. Yang, D. Zhang, A. F. Frangi, and J.-y. Yang, "Two-dimensional PCA: A new approach to appearance-based face representation and recognition", IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 1, 2004.
[4] X. He, D. Cai, and P. Niyogi, "Tensor subspace analysis", in Neural Information Processing Systems, 2005, pp. 499–506.
[5] H. C. Andrews and C. L. Patterson, "Singular value decomposition (SVD) image coding", IEEE Transactions on Communications, pp. 425–432, 1976.
[6] G. Golub and C. van Loan, Matrix Computations, The Johns Hopkins University Press, October 1996.
[7] A. van der Schaaf and J. H. van Hateren, "Modelling the power spectra of natural images: statistics and information", Vision Research, vol. 36, pp. 2759–2770, 1996.
[8] J.-F. Yang and C.-L. Lu, "Combined techniques of singular value decomposition and vector quantization for image coding", IEEE Transactions on Image Processing, vol. 4, no. 8, pp. 1141–1146, 1995.
[9] C. Ding and J. Ye, "Two-dimensional singular value decomposition (2DSVD) for 2D maps and images", in SIAM Int'l Conf. Data Mining, 2005, pp. 32–43.
[10] G. Wallace, "The JPEG still picture compression standard", IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. 18–34, 1992.
[11] B. Olshausen and D. Field, "Natural image statistics and efficient coding", Network, vol. 7, pp. 333–339, 1996.
[12] M. Lewicki and T. Sejnowski, "Learning overcomplete representations", Neural Computation, vol. 12, pp. 337–365, 2000.
[13] M. Aharon, M. Elad, and A. Bruckstein, "The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representation", IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, November 2006.
[14] P. Hoyer, "Non-negative matrix factorization with sparseness constraints", Journal of Machine Learning Research, pp. 1457–1469, 2004.
[15] S. Mallat and M. Avellaneda, "Adaptive greedy approximations", Journal of Constructive Approximations, vol. 13, pp. 57–98, 1997.
[16] K. Gurumoorthy, A. Rajwade, A. Banerjee, and A. Rangarajan, "Beyond SVD: sparse projections onto exemplar orthonormal bases for compact image representation", in International Conference on Pattern Recognition (ICPR), 2008.
[17] A. Shashua and A. Levin, "Linear image coding for regression and classification using the tensor-rank principle", in CVPR (1), 2001, pp. 42–49.
[18] L. de Lathauwer, B. de Moor, and J. Vandewalle, "A multilinear singular value decomposition", SIAM J. Matrix Anal. Appl., vol. 21, no. 4, pp. 1253–1278, 2000.
[19] J. Håstad, "Tensor rank is NP-complete", in ICALP '89: Proceedings of the 16th International Colloquium on Automata, Languages and Programming, 1989, pp. 451–460.
[20] L. de Lathauwer, Signal Processing Based on Multilinear Algebra, PhD thesis, Katholieke Universiteit Leuven, Belgium,
1997.
[21] M. A. O. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: TensorFaces", in International Conference on Pattern Recognition (ICPR), 2002, pp. 511–514.
[22] R. Costantini, L. Sbaiz, and S. Süsstrunk, "Higher order SVD analysis for dynamic texture synthesis", IEEE Transactions on Image Processing, vol. 17, no. 1, pp. 42–52, Jan. 2008.
[23] H. Wang and N. Ahuja, "Rank-R approximation of tensors: Using image-as-matrix representation", in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, vol. 2, pp. 346–353.
[24] C. Ding, H. Huang, and D. Luo, "Tensor reduction error analysis: applications to video compression and classification", in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[25] A. Ranade, S. Mahabalarao, and S. Kale, "A variation on SVD based image compression", Image and Vision Computing, vol. 25, no. 6, pp. 771–777, 2007.
[26] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition", in 27th Asilomar Conference on Signals, Systems and Computers, 1993, vol. 1, pp. 40–44.
[27] J. Tropp, "Greed is good: algorithmic results for sparse approximation", IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.
[28] C. Peterson and B. Söderberg, "A new method for mapping optimization problems onto neural networks", International Journal of Neural Systems, vol. 1, pp. 3–22, 1989.
[29] A. Yuille and J. Kosowsky, "Statistical physics algorithms that converge", Neural Computation, vol. 6, no. 3, pp. 341–356, 1994.
[30] R. Hathaway, "Another interpretation of the EM algorithm for mixture distributions", Statistics and Probability Letters, vol. 4, pp. 53–56, 1986.
[31] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analysers", Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.
[32] "The ORL Database of Faces", http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
[33] "The Yale Face Database", http://cvc.yale.edu/projects/yalefaces/yalefaces.html.
[34] D. Huffman, "A method for the construction of minimum-redundancy codes", in Proceedings of the I.R.E., 1952, pp. 1098–1102.
[35] S. Geman, E. Bienenstock, and R. Doursat, "Neural networks and the bias/variance dilemma", Neural Computation, vol. 4, no. 1, pp. 1–58, 1992.
[36] "The CVG Granada Database", http://decsai.ugr.es/cvg/dbimagenes/index.php.
[37] "The Uncompressed Colour Image Database", http://vision.cs.aston.ac.uk/datasets/UCID/ucid.html.
[38] "The Georgia Tech Face Database", http://www.anefian.com/face_reco.htm.
[39] J. Huang and Y. Wang, "Compression of color facial images using feature correction two-stage vector quantization", IEEE Transactions on Image Processing, vol. 8, pp. 102–109, 1999.
[40] "The CMU Pose, Illumination and Expression (PIE) Database", http://www.ri.cmu.edu/projects/project_418.html.
[41] M. Eckert and A. Bradley, "Perceptual quality metrics applied to still image compression", Signal Processing, vol. 70, no. 3, pp. 177–200, 1998.
