Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: A Method for compact image representation using sparse matrix and tensor projections onto exemplar orthonormal bases
Permanent Link: http://ufdc.ufl.edu/UF00095739/00001
 Material Information
Title: A Method for compact image representation using sparse matrix and tensor projections onto exemplar orthonormal bases
Alternate Title: Department of Computer and Information Science and Engineering Technical Report
Physical Description: Book
Language: English
Creator: Gurumoorthy, Karthik S.
Rajwade, Ajit
Banerjee, Arunava
Rangarajan, Anand
Publisher: Department of Computer and Information Science and Engineering, University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: July 21, 2009
Copyright Date: 2009
Edition: DRAFT
 Record Information
Bibliographic ID: UF00095739
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.


Full Text












TECHNICAL REPORT, CVGMI LAB, CISE DEPARTMENT, UNIVERSITY OF FLORIDA


July 21, 2009


DRAFT













A Method for Compact Image Representation


using Sparse Matrix and Tensor Projections


onto Exemplar Orthonormal Bases

Karthik S. Gurumoorthy, Ajit Rajwade, Arunava Banerjee and Anand Rangarajan





Abstract

We present a new method for compact representation of large image datasets. Our method is
based on treating small patches from a 2D image as matrices as opposed to the conventional vectorial
representation, and encoding these patches as sparse projections onto a set of exemplar orthonormal bases,
which are learned a priori from a training set. The end result is a low-error, highly compact image/patch
representation that has significant theoretical merits and compares favorably with existing techniques
(including JPEG) on experiments involving the compression of ORL and Yale face databases, as well as
two databases of miscellaneous natural images. In the context of learning multiple orthonormal bases,
we show the easy tunability of our method to efficiently represent patches of different complexities.
Furthermore, we show that our method is extensible in a theoretically sound manner to higher-order
matrices ('tensors'). We demonstrate applications of this theory to the compression of well-known color
image datasets such as the GaTech and CMU-PIE face databases and show performance competitive
with JPEG. Lastly, we also analyze the effect of image noise on the performance of our compression
schemes.


Index Terms

compression, compact representation, sparse projections, singular value decomposition (SVD), higher-
order singular value decomposition (HOSVD), greedy algorithm, tensor decompositions


The authors are with the Department of Computer and Information Science and Engineering, University of Florida, Gainesville,
FL 32611, USA. e-mail: {ksg,avr,arunava,anand}@cise.ufl.edu.
















I. INTRODUCTION

Most conventional techniques of image analysis treat images as elements of a vector space. Lately,

there has been a steady growth of literature which regards images as matrices, e.g. [1], [2], [3], [4].

As compared to a vectorial method, the matrix-based representation helps to better exploit the spatial

relationships between image pixels, as an image is essentially a two-dimensional function. In the image

processing community, this notion of an image has been considered by Andrews and Patterson in [5],

in the context of image coding by using singular value decomposition (SVD). Given any matrix, say

J ∈ R^(M1×M2), the SVD of the matrix expresses it in the form J = USV^T, where S ∈ R^(M1×M2) is a diagonal matrix of singular values, and U ∈ R^(M1×M1) and V ∈ R^(M2×M2) are two orthonormal matrices. This decomposition is unique up to a sign factor on the vectors of U and V if the singular values are all distinct [6]. For the specific case of natural images, it has been observed that the values along the diagonal of S decrease rapidly¹, which allows for the low-error reconstruction of the image by

low-rank approximations. This property of SVD, coupled with the fact that it provides the best possible

lower-rank approximation to a matrix, has motivated its application to image compression in the work

of Andrews and Patterson [5], and Yang and Lu [8]. In [8], the SVD technique is further combined with

vector quantization for compression applications.
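This low-rank behaviour is easy to verify numerically. The following sketch (using NumPy, with a small synthetic matrix standing in for a real image) computes the SVD and checks the Eckart-Young property that the best rank-r truncation has error equal to the energy of the discarded singular values:

```python
import numpy as np

# Toy stand-in for an image: a smooth ramp plus mild noise, so its
# singular values decay quickly, as observed for natural images.
rng = np.random.default_rng(0)
J = np.outer(np.linspace(0, 1, 8), np.linspace(0, 1, 8)) \
    + 0.01 * rng.standard_normal((8, 8))

U, s, Vt = np.linalg.svd(J, full_matrices=False)

def low_rank(U, s, Vt, r):
    """Best rank-r approximation in the Frobenius norm (Eckart-Young)."""
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

err_r1 = np.linalg.norm(J - low_rank(U, s, Vt, 1))
err_r3 = np.linalg.norm(J - low_rank(U, s, Vt, 3))

# The truncation error equals the norm of the discarded singular values,
# and shrinks as the retained rank grows.
assert np.isclose(err_r1, np.sqrt(np.sum(s[1:] ** 2)))
assert err_r3 <= err_r1
```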

To the best of our knowledge, the earliest technique for obtaining a common matrix-based representation

for a set of images was developed by Rangarajan in [1]. In [1], a single pair of orthonormal bases (U, V)

is learned from a set of some N images, and each image I_j, 1 ≤ j ≤ N, is represented by means of a projection S_j onto these bases, i.e. in the form I_j = U S_j V^T. Similar ideas are developed by Ye in [2] in

terms of optimal lower-rank image approximations. In [3], Yang et al. develop a technique named 2D-

PCA, which computes principal components of a column-column covariance matrix of a set of images.

In [9], Ding et al. present a generalization of 2D-PCA, termed as 2D-SVD, and investigate some of its

optimality properties. The work in [9] also unifies the approaches in [3] and [2], and provides a non-

iterative approximate solution with bounded error to the problem of finding the single pair of common

bases. In [4], He et al. develop a clustering application, in which a single set of orthonormal bases is

learned in such a way that projections of a set of images onto these bases are neighborhood-preserving.

It is quite intuitive to observe that as compared to an entire image, a small image patch is a simpler,

more local entity, and hence can be more accurately represented by means of a smaller number of bases.


¹A related empirical observation is the rapid decrease in the power spectra of natural images with increasing spatial frequency [7].
















Following this intuition, we choose to regard an image as a set of matrices (one per image patch) instead

of using a single matrix for the entire image as in [1], [2]. Furthermore, there usually exists a great deal
of similarity between a large number of patches in one image or across several images of a similar kind.

We exploit this fact to learn a small number of full-sized orthonormal bases (as opposed to a single set

of low-rank bases learned in [1], [2] and [9], or a single set of low-rank projection vectors learned in

[3]) to reconstruct a set of patches from a training set by means of sparse projections with least possible
error. As we shall demonstrate later, this change of focus from learning lower-rank bases to learning full-

rank bases but with sparse projection matrices, brings with it several significant theoretical and practical

advantages. This is because it is much easier to adjust the sparsity of a projection matrix in order to meet

a certain reconstruction error threshold than to adjust its rank [see Section II-A and Section III].

There exists a large body of work in the field of sparse image representation. Coding methods which

express images as sparse combinations of bases using techniques such as the discrete cosine transform
(DCT) or Gabor-filter bases have been quite popular for a long period. The DCT is also an essential

ingredient of the widely used JPEG image compression standard [10]. These developments have also

spurred interest in methods that learn a set of bases from a set of natural images as well as a sparse
set of coefficients to represent images in terms of those bases. Pioneering work in this area has been

done by Olshausen and Field [11], with a special emphasis on learning an over-complete set of bases

and their sparse combinations, and also by Lewicki et al [12]. Some recent noteworthy contributions in
the field of over-complete representations include the highly interesting work in [13] by Aharon et al.,

which encodes image patches as a sparse linear combination of a set of dictionary vectors (learned from a

training set). There also exist other techniques such as sparsified non-negative matrix factorization (NMF)

[14] by Hoyer, which represent image datasets as a product of two large sparse non-negative matrices.

An important feature of all such learning-based approaches (as opposed to those that use a fixed set of

bases such as DCT) is their tunability to datasets containing a specific type of images. In this paper, we

develop such a learning technique, but with the key difference that our technique is matrix-based, unlike

the aforementioned vector-based learning algorithms. The main advantage of our technique is as follows:

we learn a small number of pairs of orthonormal bases to represent ensembles of image patches. Given

any such pair, the computation of a sparse projection for any image patch (matrix) with least possible

reconstruction error, can be accomplished by means of a very simple and optimal greedy algorithm. On

the other hand, the computation of the optimal sparse linear combination of an over-complete set of

basis vectors to represent another vector is a well-known NP-hard problem [15]. We demonstrate the
applicability of our algorithm to the compression of databases of face images, with favorable results in
















comparison to existing approaches. Our main algorithm on ensembles of 2D images or image patches,

was presented earlier by us in [16]. In this paper, we present more detailed comparisons, including with

JPEG, and also study the effect of various parameters on our method, besides showing more detailed

derivations of the theory.

In the current work, we bring out another significant extension of our previous algorithm, namely its

elegant applicability to higher-order matrices (commonly and usually mistakenly termed tensors), with

sound theoretical foundations. In this paper, we represent patches from color images as tensors (third-

order matrices). Next, we learn a small number of orthonormal matrices and represent these patches as

sparse tensor projections onto these orthonormal matrices. The tensor-representation for image datasets,

as such, is not new. Shashua and Levin [17] regard an ensemble of gray-scale (face) images, or a

video sequence, as a single third-order tensor, and achieve compact representation by means of lower-

rank approximations of the tensor. However, quite unlike the computation of the rank of a matrix, the

computation of the rank of a tensor2 is known to be an NP-complete problem [19]. Moreover, while

the SVD is known to yield the best lower-rank approximation of a matrix in the 2D case, its higher-

dimensional analog (known as higher-order SVD or HOSVD) does not necessarily produce the best

lower-rank approximation of a tensor. The work of Lathauwer in [20] and Lathauwer et al. in [18]

presents an extensive development of the theory of HOSVD and several of its properties. Nevertheless,

for the sake of simplicity, HOSVD has been used to produce a lower-rank approximation of datasets.

Though this approximation is theoretically sub-optimal [18], it is seen to work well in applications such as face recognition under changes in pose, illumination and facial expression, as in [21] by Vasilescu et al.

(though in [21], each image is still represented as a vector) and dynamic texture synthesis from gray-scale

and color videos in [22] by Costantini et al. In these applications, the authors demonstrate the usefulness

of the multilinear representation over the vector-based representation for expressing the variability over

different modes (such as pose, illumination and expression in [21], or space, chromaticity and time in

[22]).

An iterative scheme for a lower-rank tensor approximation is developed in [20], but the corresponding

energy function is susceptible to the problem of local minima. In [17], two new lower-rank approximation

schemes are designed: a closed-form solution under the assumption that the actual tensor rank equals

the number of images (which usually may not be true in practice), or an iterative approximation in other


²The rank of a tensor is defined as the smallest number of rank-1 tensors whose linear combination gives back the tensor, a tensor being of rank 1 if it equals the outer product of several vectors [18].
















cases. Although the latter iterative algorithm is proved to be convergent, it is not guaranteed to yield a

global minimum. Wang and Ahuja [23] also develop a new iterative rank-R approximation scheme using

an alternating least-squares formulation, and also present another iterative method that is specially tuned

to the case of third-order tensors. Very recently, Ding et al. [24] have derived upper and lower bounds

for the error due to low-rank truncation of the core tensor obtained in HOSVD (which is a closed-form

decomposition), and use this theory to find a single common triple of orthonormal bases to represent a

database of color images (represented as 3D arrays) with minimal error in the L2 sense. Nevertheless,

there is still no method of directly obtaining the optimal lower-rank tensor approximation which is non-

iterative in nature. Furthermore, the error bounds in [24] are applicable only when the entire set of images

is coded using a single common triple of orthonormal bases. Likewise, the algorithms presented in [23]

and [17] also seek to find a single common basis.

The algorithm we present here differs from the aforementioned ones in the following ways: (1) All

the aforementioned methods learn a common basis, which may not be sufficient to account for the

variability in the images. We do not learn a single common basis-tuple, but a set of K orthonormal bases

to represent N patches in the database of images, with each image and each patch being represented

as a higher-order matrix. Note that K is much less than N. (2) We do not seek to obtain lower-rank

approximations to a tensor. Rather, we represent the tensor as a sparse projection onto a chosen tuple of

orthonormal bases. This sparse projection, as we show later, turns out to be optimal and can be obtained

by a very simple greedy algorithm. We use our extension for the purpose of compression of a database

of color images, with promising results. Note that this sparsity-based approach has advantages in terms

of coding efficiency as compared to methods that look for lower-rank approximations, just as in the 2D

case mentioned before.

Our paper is organized as follows. We describe the theory and the main algorithm for 2D datasets in

Section II. Section III presents experimental results and comparisons with existing techniques. Section

IV presents our extension to higher-order matrices, with experimental results and comparisons in Section

V. We conclude in Section VI.


II. THEORY: 2D IMAGES

Consider a set of digital images, each of size M1 × M2. We divide each image into non-overlapping patches of size m1 × m2 (m1 < M1, m2 < M2), and treat each patch as a separate matrix. Exploiting the

similarity inherent in these patches, we effectively represent them by means of sparse projections onto

(appropriately created) orthonormal bases, which we term 'exemplar bases'. We learn these exemplars a
















priori from a set of training image patches. Before describing the learning procedure, we first explain

the mathematical structure of the exemplars.


A. Exemplar Bases and Sparse Projections

Let P ∈ R^(m1×m2) be an image patch. Using singular value decomposition (SVD), we can represent P as a combination of orthonormal bases U ∈ R^(m1×m1) and V ∈ R^(m2×m2) in the form P = USV^T, where S ∈ R^(m1×m2) is a diagonal matrix of singular values. However, P can also be represented using any pair of orthonormal bases U and V different from those obtained from the SVD of P. In this case, we have P = USV^T, where S turns out to be a non-diagonal matrix.³ Contemporary SVD-based compression methods leverage the fact that the SVD provides the best low-rank approximation to a matrix [8], [25]. We choose to depart from this notion, and instead answer the following question: What sparse matrix W ∈ R^(m1×m2) will reconstruct P from a pair of orthonormal bases U and V with the least error ||P − UWV^T||²? Sparsity is quantified by an upper bound T on the L0 norm of W, i.e. on the number of non-zero elements in W (denoted as ||W||_0).⁴ We prove that the optimal W under this sparsity constraint is obtained by nullifying the m1m2 − T least (in absolute value) elements of the estimated projection matrix S = U^T P V. Due to the orthonormality of U and V, this simple greedy algorithm turns out to be optimal, as we prove below.

Theorem 1: Given a pair of orthonormal bases (U, V), the optimal sparse projection matrix W with ||W||_0 = T is obtained by setting to zero the m1m2 − T elements of the matrix S = U^T P V having least absolute value.

Proof: We have P = USV^T. The error in reconstructing the patch P using some other matrix W is e = ||U(S − W)V^T||² = ||S − W||², as U and V are orthonormal. Let I1 = {(i, j) | W_ij = 0} and I2 = {(i, j) | W_ij ≠ 0}. Then e = Σ_{(i,j)∈I1} S_ij² + Σ_{(i,j)∈I2} (S_ij − W_ij)². This error is minimized when W_ij = S_ij at all locations where W_ij ≠ 0, and W_ij = 0 at those indices where the corresponding values of S are as small in absolute value as possible. Thus if we want ||W||_0 = T, then W is the matrix obtained by nullifying the m1m2 − T entries of S that have the least absolute value and leaving the remaining elements intact. □

Hence, the problem of finding an optimal sparse projection of a matrix (image patch) onto a pair of


³The decomposition P = USV^T exists for any P even if U and V are not orthonormal. We still impose orthonormality constraints to facilitate optimization and coding; see Sections II-C and III-B.
⁴See Section III for the merits of our sparsity-based approach over the low-rank approach.
















orthonormal bases, is solvable in O(m1m2(m1 + m2)) time, as it requires just two matrix multiplications (S = U^T P V). On the other hand, the problem of finding the optimal sparse linear combination of an
over-complete set of basis vectors in order to represent another vector (i.e. the vectorized form of an image

patch) is a well-known NP-hard problem [15]. In actual applications [13], approximate solutions to these

problems are sought, by means of pursuit algorithms such as OMP [26]. Unfortunately, the quality of the

approximation in OMP is dependent on T, with an upper bound on the error that is directly proportional to √(1 + 6T) [see [27], Theorem (C)] under certain conditions. Similar problems exist with other pursuit

approximation algorithms such as Basis Pursuit (BP) as well [27]. For large values of T, there can be

difficulties related to convergence, when pursuit algorithms are put inside an iterative optimization loop.

Our technique avoids any such dependencies as we are specifically dealing with complete basis pairs.

In any event, it is not clear whether overcomplete representations are indeed beneficial for the specific

application (image compression) being considered in this paper.
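The greedy projection of Section II-A is straightforward to implement. The sketch below (NumPy; the random bases and patch are stand-ins for learned exemplars and real data) computes S = U^T P V, retains only the T largest-magnitude entries, and verifies that the reconstruction error equals the energy of the discarded entries, as in the proof of Theorem 1:

```python
import numpy as np

def sparse_projection(P, U, V, T):
    """Optimal T-sparse projection of patch P onto orthonormal (U, V):
    compute S = U^T P V and zero all but the T largest-magnitude entries."""
    S = U.T @ P @ V
    W = np.zeros_like(S)
    keep = np.argsort(np.abs(S).ravel())[-T:]   # indices of T largest |S_ij|
    W.ravel()[keep] = S.ravel()[keep]
    return W

rng = np.random.default_rng(1)
P = rng.random((12, 12))                        # intensities in [0, 1]
# Random orthonormal bases via QR of Gaussian matrices (stand-ins for
# learned exemplar bases).
U, _ = np.linalg.qr(rng.standard_normal((12, 12)))
V, _ = np.linalg.qr(rng.standard_normal((12, 12)))

T = 10
W = sparse_projection(P, U, V, T)
err = np.linalg.norm(P - U @ W @ V.T) ** 2

# Orthonormality: the error is exactly the energy of the zeroed entries.
S = U.T @ P @ V
assert np.isclose(err, np.sum(np.sort((S ** 2).ravel())[:-T]))
assert np.count_nonzero(W) <= T
```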


B. Learning the Bases

The essence of this paper lies in a learning method to produce K exemplar orthonormal bases {(U_a, V_a)}, 1 ≤ a ≤ K, to encode a training set of N image patches P_i ∈ R^(m1×m2) (1 ≤ i ≤ N) with least possible error (in the sense of the L2 norm of the difference between the original and reconstructed patches). Note that K ≪ N. In addition, we impose a sparsity constraint: every S_ia (the matrix used to reconstruct P_i from (U_a, V_a)) has at most T non-zero elements. The main objective function to be

minimized is:
E({U_a, V_a, S_ia, M_ia}) = Σ_{i=1}^N Σ_{a=1}^K M_ia ||P_i − U_a S_ia V_a^T||²    (1)
subject to the following constraints:

(1) U_a^T U_a = V_a^T V_a = I, ∀a;  (2) ||S_ia||_0 ≤ T, ∀(i, a);  (3) Σ_a M_ia = 1, ∀i, and M_ia ∈ {0, 1}, ∀(i, a).    (2)
Here M is a binary matrix of size N × K whose entry M_ia indicates whether the i-th patch belongs to the space defined by (U_a, V_a). Note that we are not trying to use a mixture of orthonormal bases for the projection.

Rather we simply project each patch onto a single (suitably chosen) basis pair. The optimization of the

above energy function is difficult, as M_ia is binary. Since an algorithm based on K-means will lead to local minima, we choose to relax the binary membership constraint so that now M_ia ∈ (0, 1), ∀(i, a), subject to Σ_{a=1}^K M_ia = 1, ∀i. The naive mean-field theory line of development starting with [28] and culminating in [29] (with much of the theory developed previously in [30]) leads to the following deterministic
















annealing energy function:


i=l a=l
N K

SE({UVS, M ia Mia + lPi Mia 1)2 +

ia i a
trace[AIa(UjUa I)] + trace[A2a(V Va- I)]. (3)
a a
Note that in the above equation, {pi} are the Lagrange parameters, {Aia, A2a} are symmetric Lagrange
matrices, and /3 a temperature parameter.
We first initialize {U_a} and {V_a} to random orthonormal matrices, ∀a, and set M_ia = 1/K, ∀(i, a). As {U_a} and {V_a} are orthonormal, the projection matrix S_ia is computed by the following rule:

S_ia = U_a^T P_i V_a.    (4)

Note that this is the solution to the least-squares energy function ||P_i − U_a S_ia V_a^T||². Subsequently, the m1m2 − T elements of S_ia with least absolute value are nullified to get the best sparse projection matrix. The updates to U_a and V_a are obtained as follows. Denoting the sum of the terms of the cost function in Eqn. (1) that are independent of U_a as C, we rewrite the cost function as follows:
E({U_a, V_a, S_ia, M_ia}) = Σ_{i=1}^N Σ_{a=1}^K M_ia ||P_i − U_a S_ia V_a^T||² + Σ_a trace[Λ_1a (U_a^T U_a − I)] + C
= Σ_{i=1}^N Σ_{a=1}^K M_ia trace[(P_i − U_a S_ia V_a^T)^T (P_i − U_a S_ia V_a^T)] + Σ_a trace[Λ_1a (U_a^T U_a − I)] + C
= Σ_{i=1}^N Σ_{a=1}^K M_ia [trace(P_i^T P_i) − 2 trace(P_i^T U_a S_ia V_a^T) + trace(S_ia^T S_ia)] + Σ_a trace[Λ_1a (U_a^T U_a − I)] + C.    (5)
Now taking derivatives w.r.t. U_a, we get:

∂E/∂U_a = −2 Σ_{i=1}^N M_ia (P_i V_a S_ia^T) + 2 U_a Λ_1a = 0.    (6)
Re-arranging the terms and eliminating the Lagrange matrices, we obtain the following update rule for

U_a:

Z_1a = Σ_{i=1}^N M_ia P_i V_a S_ia^T;  U_a = Z_1a (Z_1a^T Z_1a)^(−1/2).    (7)

An SVD of Z_1a will give us Z_1a = Γ_1a Δ_1a Ψ_1a^T, where Γ_1a and Ψ_1a are orthonormal matrices and Δ_1a is a
















diagonal matrix. Using this, we can re-write U_a as follows:

U_a = (Γ_1a Δ_1a Ψ_1a^T) [(Γ_1a Δ_1a Ψ_1a^T)^T (Γ_1a Δ_1a Ψ_1a^T)]^(−1/2)
    = (Γ_1a Δ_1a Ψ_1a^T) (Ψ_1a Δ_1a² Ψ_1a^T)^(−1/2)
    = (Γ_1a Δ_1a Ψ_1a^T) (Ψ_1a Δ_1a^(−1) Ψ_1a^T)
    = Γ_1a Ψ_1a^T.    (8)

Note that the steps above follow from the fact that U_a (and hence Γ_1a and Ψ_1a) are full-rank matrices.
This gives us an update rule for U_a. A very similar update rule for V_a follows along the same lines:

Z_2a = Σ_{i=1}^N M_ia P_i^T U_a S_ia;  V_a = Z_2a (Z_2a^T Z_2a)^(−1/2) = Γ_2a Ψ_2a^T,    (9)

where Γ_2a and Ψ_2a are obtained from the SVD of Z_2a. The membership values are obtained from the
following update:

M_ia = exp(−β ||P_i − U_a S_ia V_a^T||²) / Σ_{b=1}^K exp(−β ||P_i − U_b S_ib V_b^T||²).    (10)
The matrices {S_ia, U_a, V_a} and M are then updated sequentially for a fixed β value, until convergence. The value of β is then increased and the sequential updates are repeated. The entire process is repeated
until an integrality condition is met.
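One full pass of these updates can be sketched as follows (NumPy; the dimensions, random data, and the softmax stabilisation by subtracting the row minimum are our own illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, m, T, beta = 40, 4, 6, 8, 50.0
patches = rng.random((N, m, m))

# Random orthonormal initialisation; memberships start uniform at 1/K.
Us = [np.linalg.qr(rng.standard_normal((m, m)))[0] for _ in range(K)]
Vs = [np.linalg.qr(rng.standard_normal((m, m)))[0] for _ in range(K)]
M = np.full((N, K), 1.0 / K)

def sparsify(S, T):
    W = np.zeros_like(S)
    keep = np.argsort(np.abs(S).ravel())[-T:]
    W.ravel()[keep] = S.ravel()[keep]
    return W

# Eqn (4): T-sparse projections for every (patch, basis-pair) combination.
S = np.array([[sparsify(Us[a].T @ P @ Vs[a], T) for a in range(K)]
              for P in patches])

for a in range(K):
    # Eqn (7): U_a <- Z (Z^T Z)^(-1/2), realised via the SVD of Z (Eqn (8)).
    Z1 = sum(M[i, a] * patches[i] @ Vs[a] @ S[i, a].T for i in range(N))
    G, _, Ht = np.linalg.svd(Z1)
    Us[a] = G @ Ht
    # Eqn (9): the analogous update for V_a.
    Z2 = sum(M[i, a] * patches[i].T @ Us[a] @ S[i, a] for i in range(N))
    G, _, Ht = np.linalg.svd(Z2)
    Vs[a] = G @ Ht

# Eqn (10): softmax memberships from reconstruction errors (shifting by the
# row minimum for numerical stability; the ratio is unchanged).
E = np.array([[np.linalg.norm(patches[i] - Us[a] @ S[i, a] @ Vs[a].T) ** 2
               for a in range(K)] for i in range(N)])
M = np.exp(-beta * (E - E.min(axis=1, keepdims=True)))
M /= M.sum(axis=1, keepdims=True)

assert np.allclose(M.sum(axis=1), 1.0)                      # valid memberships
assert np.allclose(Us[0].T @ Us[0], np.eye(m), atol=1e-8)   # still orthonormal
```

In the full algorithm these updates are iterated at each β, and β is raised on an annealing schedule until the memberships harden.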

Our algorithm could be modified to learn a set of orthonormal bases (single matrices and not a pair)
for sparse representation of vectorized images, as well. Note that even in this scenario, our technique

should not be confused with the 'mixtures of PCA' approach [31]. The emphasis of the latter is again
on low-rank approximations and not on sparsity. Furthermore, in our algorithm we do not compute an
actual probabilistic mixture (also see Sections II-C and III-E). However, we have not considered this

vector-based variant in our experiments, because it does not exploit the fact that image patches are
two-dimensional entities.


C. Application to Compact Image Representation

Our framework is geared towards compact but low-error patch reconstruction. We are not concerned with the discriminating assignment of a specific kind of patch to a specific exemplar, quite unlike in

a clustering or classification application. In our method, after the optimization, each training patch Pi

(1 ≤ i ≤ N) gets represented as a projection onto the particular pair of exemplar orthonormal bases (out of the K pairs) which produces the least reconstruction error. In other words, the k-th exemplar is chosen if ||P_i − U_k S_ik V_k^T||² ≤ ||P_i − U_a S_ia V_a^T||², ∀a ∈ {1, 2, ..., K}, 1 ≤ k ≤ K. For patch P_i, we denote the corresponding 'optimal' projection matrix as S_i* = S_ik, and the corresponding exemplar as
















(U_i*, V_i*) = (U_k, V_k). Thus the entire training set is approximated by (1) the common set of basis-pairs {(U_a, V_a)}, 1 ≤ a ≤ K (K ≪ N), and (2) the optimal sparse projection matrices {S_i*} for each patch,
with at most T non-zero elements each. The overall storage per image is thus greatly reduced (see also

section III-B). Furthermore, these bases {(Ua, Va)} can now be used to encode patches from a new set of

images that are somewhat similar to the ones existing in the training set. However, a practical application

demands that the reconstruction meet a specific error threshold on unseen patches, and hence the L0 norm

of the projection matrix of the patch is adjusted dynamically in order to meet the error. Experimental

results using such a scheme are described in the next section.
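As an illustration, the selection of the best exemplar for a patch might look like the following sketch (NumPy; `sparsify`, `best_exemplar` and the random bases are our own stand-in constructions, not the paper's trained exemplars):

```python
import numpy as np

def sparsify(S, T):
    W = np.zeros_like(S)
    keep = np.argsort(np.abs(S).ravel())[-T:]
    W.ravel()[keep] = S.ravel()[keep]
    return W

def best_exemplar(P, bases, T):
    """Return (k, W): the index of the basis pair with least reconstruction
    error, together with the corresponding T-sparse projection matrix."""
    best = None
    for k, (U, V) in enumerate(bases):
        W = sparsify(U.T @ P @ V, T)
        err = np.linalg.norm(P - U @ W @ V.T) ** 2
        if best is None or err < best[0]:
            best = (err, k, W)
    return best[1], best[2]

rng = np.random.default_rng(4)
bases = [tuple(np.linalg.qr(rng.standard_normal((12, 12)))[0]
               for _ in range(2)) for _ in range(5)]   # K = 5 random pairs
P = rng.random((12, 12))

k, W = best_exemplar(P, bases, T=10)
assert 0 <= k < 5 and np.count_nonzero(W) <= 10
```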


III. EXPERIMENTS: 2D IMAGES

In this section, we first describe the overall methodology for the training and testing phases of our

algorithm based on our earlier work in [16]. As we are working on a compression application, we then

describe the details of our image coding and quantization scheme. This is followed by a discussion

of the comparisons between our technique and other competitive techniques (including JPEG), and an

enumeration of the experimental results.


A. Training and Testing Phases

We tested our algorithm on the compression of the entire ORL database [32] and the entire Yale

database [33]. We divided the images in each database into patches of fixed size (12 x 12), and these

sets of patches were segregated into training and test sets. The test set for both databases consisted of

many more images than the training set. For the purpose of training, we learned a total of K = 50

orthonormal bases, using a fixed value of T to control the sparsity. For testing, we projected each patch P_i onto that exemplar (U_i*, V_i*) which produced the sparsest projection matrix S_i* yielding an average per-pixel reconstruction error ||P_i − U_i* S_i* V_i*^T||² / (m1 m2) of no more than some chosen δ. Note that different test patches required different T values, depending upon their inherent 'complexity'. Hence, we varied the sparsity of the projection matrix (keeping its size fixed at 12 × 12) by greedily nullifying the smallest elements in the matrix, without letting the reconstruction error rise above δ. This gave us the flexibility to adjust to patches of different complexities, without altering the rank of the exemplar bases (U_i*, V_i*).
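The per-patch adjustment of T can be done in one pass over the sorted coefficient magnitudes, since zeroing entries adds their squared values to the error. A sketch (NumPy; `smallest_T` is our illustrative name, and the random bases stand in for learned exemplars):

```python
import numpy as np

def smallest_T(P, U, V, delta):
    """Sparsest projection whose mean per-pixel squared error stays below
    delta: zero coefficients in increasing magnitude while the cumulative
    energy of the zeroed entries remains under delta * m1 * m2."""
    S = U.T @ P @ V
    order = np.argsort(np.abs(S).ravel())            # ascending magnitude
    energy = np.cumsum(S.ravel()[order] ** 2)        # error if zeroed so far
    m1, m2 = P.shape
    zeroed = int(np.searchsorted(energy, delta * m1 * m2, side='right'))
    W = np.zeros_like(S)
    keep = order[zeroed:]
    W.ravel()[keep] = S.ravel()[keep]
    return S.size - zeroed, W

rng = np.random.default_rng(5)
U, _ = np.linalg.qr(rng.standard_normal((12, 12)))
V, _ = np.linalg.qr(rng.standard_normal((12, 12)))
P = rng.random((12, 12))

T, W = smallest_T(P, U, V, delta=1e-3)
err = np.linalg.norm(P - U @ W @ V.T) ** 2 / P.size
assert err <= 1e-3 + 1e-12 and np.count_nonzero(W) <= T
```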

As any patch Pi is projected onto exemplar orthonormal bases which are different from those produced

by its own SVD, the projection matrices turn out to be non-diagonal. Hence, there is no such thing as a
hierarchy of 'singular values' as in ordinary SVD. As a result, we cannot resort to restricting the rank

of the projection matrix (and thereby the rank of (U_i*, V_i*)) to adjust for patches of different complexity
















(unless we learn a separate set of K bases for each different rank r, where 1 ≤ r ≤ min(m1, m2), which would make the training and even the image coding [see Section III-B] very cumbersome). This highlights an advantage of our approach over algorithms that adjust the rank of the projection matrices.


B. Details of Image Coding and Quantization

We obtain S_i* by sparsifying U_i*^T P_i V_i*. As U_i* and V_i* are orthonormal, we can show that the values in S_i* will always lie in the range [−m, m] for m × m patches, if the values of P_i lie in [0, 1]. This can be proved as follows. We know that S_i* = U_i*^T P_i V_i*. Hence we write the element of S_i* in the a-th row and b-th column as:

S*_iab = Σ_{k,l} U*_iak P_ikl V*_ilb ≤ (1/√m)(1/√m) Σ_{k,l} P_ikl = (1/m) Σ_{k,l} P_ikl ≤ m.    (11)

The first inequality follows because the maximum value of S*_iab is obtained when all the values of U*_iak and V*_ilb are equal to 1/√m. The second follows because S*_iab meets its upper bound when all the m² values in the patch are equal to 1. We can similarly prove that the lower bound on S*_iab is −m. In our case, since m = 12, the upper and lower bounds are +12 and −12 respectively.
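The bound in Eqn. (11) is easy to check empirically; the sketch below draws random orthonormal bases and random patches with intensities in [0, 1] and confirms that no coefficient leaves [−12, 12]:

```python
import numpy as np

rng = np.random.default_rng(6)
m = 12
for _ in range(100):
    U, _ = np.linalg.qr(rng.standard_normal((m, m)))
    V, _ = np.linalg.qr(rng.standard_normal((m, m)))
    P = rng.random((m, m))            # intensities in [0, 1]
    S = U.T @ P @ V
    # |S_ab| = |u_a^T P v_b| <= ||P||_F <= m, so every entry lies in [-m, m].
    assert np.all(np.abs(S) <= m)
```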

We Huffman-encoded [34] the integer parts of the values in the {S_i*} matrices over the whole image (giving us an average of some Q1 bits per entry) and quantized the fractional parts with Q2 bits per entry. Thus, we needed to store the following information per test patch to create the compressed image: (1) the index of the best exemplar, using a1 bits; (2) the location and value of each non-zero element of its S_i* matrix, using a2 bits per location and Q1 + Q2 bits per value; and (3) the number of non-zero elements per patch, encoded using a3 bits. Hence the total number of bits per pixel for the whole image is given by:

RPP = [N(a1 + a3) + T_whole(a2 + Q1 + Q2)] / (M1 M2),    (12)

where T_whole = Σ_{i=1}^N ||S_i*||_0. The values of a1, a2 and a3 were obtained by Huffman encoding [34].
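Eqn. (12) is simple bookkeeping; the sketch below evaluates it for illustrative (hypothetical, not measured) bit counts:

```python
def rpp(num_patches, t_whole, a1, a2, a3, q1, q2, M1, M2):
    """Eqn (12): total bits per pixel for one coded image."""
    return (num_patches * (a1 + a3) + t_whole * (a2 + q1 + q2)) / (M1 * M2)

# Hypothetical bit counts for a 96x120 image cut into 12x12 patches:
# 80 patches, 700 non-zero coefficients in total across all patches.
bits = rpp(num_patches=80, t_whole=700, a1=6, a2=8, a3=4, q1=4, q2=5,
           M1=96, M2=120)
assert abs(bits - (80 * 10 + 700 * 17) / 11520) < 1e-12
```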

After performing quantization of the values of the coefficients in S_i*, we obtained new quantized projection matrices, denoted as S̃_i*. Following this, the PSNR for each image was measured as 10 log_10 (N m1 m2 / Σ_{i=1}^N ||P_i − U_i* S̃_i* V_i*^T||²) and then averaged over the entire test set. The average number of bits per pixel (RPP) was calculated as in Eqn. (12) for each image, and then averaged over the whole test set. We repeated this procedure for different δ values from 8 × 10^−5 to 8 × 10^−3 (the range of image intensity values being [0, 1]) and plotted an
















ROC curve of average PSNR vs. average RPP.




C. Comparison with other techniques:

We pitted our method against four existing approaches, each of which is both competitive and recent:

1) The KSVD algorithm from [13], for which we used 441 unit norm dictionary vectors of size

144. In this method, each patch is vectorized and represented as a sparse linear combination of the

dictionary vectors. During training, the value of T is kept fixed, and during testing it is dynamically adjusted depending upon the patch so as to meet the appropriate set of error thresholds represented by δ ∈ [8 × 10^−5, 8 × 10^−3]. For the purpose of a fair comparison between our method and KSVD, we used the exact same values of T and δ in both KSVD and in our method. The method

used to find the sparse projection onto the dictionary was the orthogonal matching pursuit (OMP)

algorithm [27].

2) The over-complete DCT dictionary with 441 unit-norm vectors of size 144, created by sampling
cosine waves of various frequencies, again with the same T value and using the OMP technique
for sparse projection.

3) The SSVD method from [25], which is not a learning-based technique. It is based on the creation
of sub-sampled versions of the image by traversing the image in a manner different from the usual
raster scan.

4) The JPEG standard (for which we used the implementation provided in MATLAB), for which
we measured the RPP from the number of bytes of file storage.

For KSVD and over-complete DCT, there exist bounds on the values of the coefficients that are very
similar to those for our technique, as presented in Eqn. (11). For both these methods, the RPP value
was computed using the formula in [13], Eqn. (27), with the modification that the integer parts of the
coefficients of the linear combination were Huffman-encoded and the fractional parts quantized separately
(as this gave better ROC curves for these methods). For the SSVD method, the RPP was calculated as
in [25], Section (5).


D. Results

For the ORL database, we created a training set of patches of size 12 x 12 from images of 10 different

people, with 10 images per person. Patches from images of the remaining 30 people (10 images per

person) were treated as the test set. From the training set, a total of 50 pairs of orthonormal bases were




[Figure 1 panels: (a) ORL: PSNR vs RPP, T = 10; (b) Yale: PSNR vs RPP, T = 3; (c) Variance in RPP vs error, ORL database; (d)-(h) sample face images.]

Fig. 1. ROC curves on (a) ORL and (b) Yale databases. Legend: Red (1): Our Method, Blue (2): KSVD, Green (3): Over-complete
DCT, Black (4): SSVD, Magenta (5): JPEG. (c) Variance in RPP versus pre-quantization error for ORL. (d)
Original image from ORL database. Sample reconstructions of (d) with δ = 3 × 10^-4 by using (e) Our Method [RPP: 1.785,
PSNR: 35.44], (f) KSVD [RPP: 2.112, PSNR: 35.37], (g) Over-complete DCT [RPP: 2.929, PSNR: 35.256], (h) SSVD [RPP:
2.69, PSNR: 34.578]. These plots are best viewed in color.




learned using the algorithm described in Section II-B. The T value for sparsity of the projection matrices
was set to 10 during training. As shown in Figure 1(c), we also computed the variance in the RPP
values for each pre-quantization error value, over the images of the ORL database. Note that the
variance in RPP decreases as the specified error increases. This is because at very low errors, different
images require different numbers of coefficients in order to meet the error.

The same experiment was run on the Yale database with a value of T = 3 on patches of size 12 x 12.
The training set consisted of a total of 64 images of one and the same person under different illumination
conditions, whereas the testing set consisted of 65 images each of the remaining 38 people from the
database (i.e. 2470 images in all), under different illumination conditions. The ROC curves for our
method were superior to those of the other methods over a significant range of δ, for the ORL as well as
the Yale database, as seen in Figures 1(a) and 1(b). Sample reconstructions for an image from the ORL
database are shown in Figures 1(d), 1(f), 1(g) and 1(h) for δ = 3 × 10^-4. For this image, our method
produced a better PSNR to RPP ratio than the others. Figure 2 shows reconstructed versions of another
image from the ORL database using our method with different values of δ. For experiments on the




[Figure 2 panels: Original; Error 0.003; Error 0.001; Error 0.0005; Error 0.0003; Error 0.0001.]


Fig. 2. Reconstructed versions of an image from the ORL database using 5 different error values. The original image is at
the top left.



ORL database, the number of bits used to code the fractional parts of the coefficients of the projection
matrices [i.e. Q2 in Eqn. (12)] was set to 5. For the Yale database, we often obtained pre-quantization
errors significantly less than the chosen δ, and hence using a value of Q2 less than 5 bits often did not
raise the post-quantization error above δ. Keeping this in mind, we varied the value of Q2 dynamically for
each pre-quantization error value. The same variation was applied to the KSVD and over-complete DCT
techniques as well. Furthermore, the dictionary sizes used by all the methods compared in these
experiments are shown in Table I.


E. Effect of Different Parameters on the Performance of our Method:

The different parameters in our method include: (1) the size of the patches for training and testing,

(2) the number of pairs of orthonormal bases, i.e. K, and (3) the sparsity of each patch during training,

i.e. T. In the following, we describe the effect of varying these parameters:

1) Patch Size: If the size of the patch is too large, it becomes more and more unlikely (due to the

curse of dimensionality) that a fixed number of orthonormal matrices will serve as adequate bases

for these patches in terms of a low-error sparse representation. Furthermore, if the patch size is




Method              Dictionary Size (number of scalars)
Our Method          50 x 2 x 12 x 12 = 14400
KSVD                441 x 12 x 12 = 63504
Overcomplete DCT    441 x 12 x 12 = 63504
SSVD                —
JPEG                —

TABLE I
COMPARISON OF DICTIONARY SIZE FOR VARIOUS METHODS FOR EXPERIMENTS ON THE ORL AND YALE DATABASES.


[Figure 3 panels: (a) ORL: PSNR vs RPP for different patch sizes (8x8, 12x12, 15x15, 20x20); (b) ORL: PSNR vs RPP for different values of T.]


Fig. 3. ROC curves for the ORL database using our technique with (a) different patch sizes and (b) different values of T for a
fixed patch size of 12 x 12. These plots are best viewed in color.




too large, any algorithm will lose the advantages of locality and simplicity. However, if the
patch size is too small (say 2 x 2), there is very little a compression algorithm can do in terms
of lowering the number of bits required to store these patches. We stress that the patch size as a
free parameter is common to all the algorithms to which we compare our technique
(including JPEG). Also, the choice of this parameter is mainly empirical. We tested the performance
of our algorithm with K = 50 pairs of orthonormal bases on the ORL database, using patches of
size 8 x 8, 12 x 12, 15 x 15 and 20 x 20, with appropriately chosen (different) values of T. As
shown in Figure 3(a), the patch size of 12 x 12 with T = 10 yielded the best performance, though
other patch sizes also performed quite well.

2) The number of orthonormal bases K: The choice of this number is not critical, and it can be set to
as high a value as desired without affecting the accuracy of the results, though higher values increase the
number of bits needed to store the basis index (see Eqn. 12). This parameter should not be confused with




the number of mixture components in standard mixture-model based density estimation, because
in our method each patch gets projected onto only a single set of orthonormal bases, i.e. we do
not compute a combination of projections onto all K different orthonormal basis pairs. The
only downside of a higher value of K is the added computational cost during training. Again, note
that this parameter will be part of any learning-based compression algorithm that uses a set of
bases to express image patches.

3) The value of T during training: The value of T is fixed only during training, and is varied for each
patch during the testing phase so as to meet the required error threshold. A very high value of
T during training can cause the orthonormal basis pairs to overfit to the training data (variance),
whereas a very low value can cause a large bias error. This is an instance of the well-known
bias-variance tradeoff common in machine learning algorithms [35]. Our choice of T was empirical,
though the value T = 10 performed very well on nearly all the chosen datasets. The effect of
different values of T on the compression performance for the ORL database is shown in Figure
3(b). We again emphasize that the issue of choosing an 'optimal' T is not an artifact of
our algorithm per se. For instance, this issue is also encountered in pursuit algorithms that
find an approximately optimal linear combination of unit vectors from a dictionary. In the latter
case, however, there is also the problem of poorer approximation errors as T increases, under
certain conditions on the dictionary of overcomplete basis vectors [27].


F. Performance on Random Collections of Images

We performed additional experiments to study the behavior of our algorithm when the training and test
sets were very different from one another. To this end, we used the database of 155 natural images (in
uncompressed format) from the Computer Vision Group at the University of Granada [36]. The database
consists of images of faces, houses, natural scenery, inanimate objects and wildlife. We converted all
the given images to grayscale. Our training set consisted of patches (of size 12 x 12) from 11 images.
Patches from the remaining images formed the test set. Though the images in the training and test
sets were very different, our algorithm produced excellent results, superior to JPEG up to an RPP value of
3.1 bits, as shown in Figure 4(a). Three training images and reconstructed versions of three different test
images for error values of 0.0002, 0.0006 and 0.002 respectively, are shown in Figure 5. To test the
effect of minute noise on the performance of our algorithm, a similar experiment was run on the UCID
database [37] with a training set of 10 images and the remaining 234 images from the database as the
test set. The test images were significantly different from the ones used for training. The images were scaled down




[Figure 4 panels: (a) CVG: PSNR vs RPP, T = 10; (b) UCID: PSNR vs RPP, T = 5. Legends: Our Method, JPEG.]

Fig. 4. PSNR vs. RPP curve for our method and JPEG on (a) the CVG Granada database and (b) the UCID database.




by a factor of two (i.e. to 320 x 240 instead of 640 x 480) for faster training, and subjected to zero-mean
Gaussian noise of variance 0.001 (on a scale from 0 to 1). Our algorithm was yet again competitive with
JPEG, as shown in Figure 4(b).


IV. THEORY: 3D (OR HIGHER-D) IMAGES

We now consider a set of images represented as third-order matrices (say RGB face images), each of
size M1 x M2 x M3. We divide each image into non-overlapping patches of size m1 x m2 x m3, m1 ≤
M1, m2 ≤ M2, m3 ≤ M3, and treat each patch as a separate tensor. Just as before, we start by exploiting
the similarity inherent in these patches, and represent them by means of sparse projections onto a triple
of exemplar orthonormal bases. Again, we learn these exemplars a priori from a set of training image
patches. We would like to point out that our extension to higher dimensions is non-trivial, especially
considering that a key property of the SVD in 2D (namely that the SVD provides the optimal lower-rank
reconstruction by simple nullification of the lowest singular values) does not extend to higher-dimensional
analogs such as HOSVD [18]. In the following, we describe the mathematical structure
of the exemplars. We would like to emphasize that though the derivations presented in this paper are for
3D matrices, a very similar treatment is applicable to learning n-tuples of exemplar orthonormal bases
to represent patches from n-D matrices (n ≥ 4).



A. Exemplar Bases and Sparse Projections

Let P ∈ R^{m1 × m2 × m3} be an image patch. Using HOSVD, we can represent P as a combination of
orthonormal bases U ∈ R^{m1 × m1}, V ∈ R^{m2 × m2} and W ∈ R^{m3 × m3} in the form P = S ×1 U ×2 V ×3 W,



[Figure 5 panels (a)-(l).]

Fig. 5. (a) to (c): Training images from the CVG Granada database. (d) to (l): Reconstructed versions of three test images for
error values of 0.0002, 0.0006 and 0.002 respectively.


where S ∈ R^{m1 × m2 × m3} is termed the core tensor. The operators ×n refer to tensor-matrix multiplication
over the different axes. The core tensor has special properties such as all-orthogonality and ordering. For
further details of the core tensor and other aspects of multilinear algebra, the reader is referred to [18].
Now, P can also be represented as a combination of any set of orthonormal bases U, V and W different
from those obtained from the HOSVD of P. In this case, we have P = S ×1 U ×2 V ×3 W, where S is
not guaranteed to be an all-orthogonal tensor, nor is it guaranteed to obey the ordering property.

As the HOSVD does not necessarily provide the best low-rank approximation to a tensor [18], we
choose to depart from this notion (instead of settling for a sub-optimal approximation), and instead
answer the following question: What sparse tensor Q ∈ R^{m1 × m2 × m3} will reconstruct P from a triple of
orthonormal bases (U, V, W) with the least error ||P − Q ×1 U ×2 V ×3 W||²? Sparsity is again quantified
by an upper bound T on the L0 norm of Q (denoted as ||Q||_0). We now prove that the optimal Q with


July 21, 2009


DRAFT











TECHNICAL REPORT, CVGMI LAB, CISE DEPARTMENT, UNIVERSITY OF FLORIDA


this sparsity constraint is obtained by nullifying the least (in absolute value) m1m2m3 − T elements of
the estimated projection tensor S = P ×1 U^T ×2 V^T ×3 W^T. Due to the orthonormality of U, V and
W, this simple greedy algorithm turns out to be optimal (see Theorem 2).

Theorem 2: Given a triple of orthonormal bases (U, V, W), the optimal sparse projection tensor Q with
||Q||_0 = T is obtained by setting to zero the m1m2m3 − T elements of the tensor S = P ×1 U^T ×2 V^T ×3 W^T
having least absolute value.

Proof: We have P = S ×1 U ×2 V ×3 W. The error in reconstructing the patch P using some other tensor Q
is e = ||(S − Q) ×1 U ×2 V ×3 W||². For any tensor X, we have ||X||² = ||X_(n)||² (i.e. the Frobenius norm
of the tensor and its nth unfolding are the same [18]). Also, by the matrix representation of HOSVD, we
have X_(1) = U S_(1) (V ⊗ W)^T. Hence, it follows that e = ||U (S − Q)_(1) (V ⊗ W)^T||². This gives us
e = ||S − Q||². The last step follows because U, V and W, and hence V ⊗ W, are orthonormal matrices.
Let I1 = {(i, j, k) | Q_ijk = 0} and I2 = {(i, j, k) | Q_ijk ≠ 0}. Then e = Σ_{(i,j,k) ∈ I1} S_ijk² + Σ_{(i,j,k) ∈ I2} (S_ijk −
Q_ijk)². This error is minimized when Q_ijk = S_ijk at all locations where Q_ijk ≠ 0, and Q_ijk = 0 at
those indices where the corresponding values in S are as small as possible in absolute value. Thus, if we want ||Q||_0 = T,
then Q is the tensor obtained by nullifying the m1m2m3 − T entries of S that have the least absolute
value and leaving the remaining elements intact. □
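Theorem 2 suggests a direct numerical recipe, which we sketch below with NumPy (function names are ours, not the paper's; `einsum` stands in for the mode-n products). The usage check also verifies the identity e = ||S − Q||² used in the proof:

```python
import numpy as np

def sparse_core(P, U, V, W, T):
    """Optimal T-sparse core tensor (Theorem 2): project P onto the
    orthonormal triple, then zero all but the T largest-magnitude entries."""
    # S = P x1 U^T x2 V^T x3 W^T, written as one einsum contraction
    S = np.einsum('ai,bj,ck,abc->ijk', U, V, W, P)
    Q = S.copy()
    drop = np.argsort(np.abs(Q).ravel())[:Q.size - T]  # smallest magnitudes
    Q.ravel()[drop] = 0.0
    return S, Q

def reconstruct(Q, U, V, W):
    """P_hat = Q x1 U x2 V x3 W."""
    return np.einsum('ia,jb,kc,abc->ijk', U, V, W, Q)
```

By the orthonormality argument in the proof, the image-space reconstruction error equals the energy of the dropped entries of S, so no search over supports is needed.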
We wish to re-emphasize that a key feature of our approach is that the same technique used for
2D images scales to higher dimensions. Tensor decompositions such as HOSVD do not share this feature,
because the optimal low-rank reconstruction property of SVD does not extend to HOSVD. Furthermore,
though the upper and lower error bounds for core-tensor truncation in HOSVD derived in [24] are very
interesting, they are applicable only when the entire set of images has a common basis (i.e. a common U,
V and W matrix), which may not be sufficient to compactly account for the large variability in real-world
datasets.


B. Learning the Bases

We now describe a method to learn K exemplar orthonormal bases {(Ua, Va, Wa)}, 1 ≤ a ≤ K, to
encode a training set of N image patches Pi ∈ R^{m1 × m2 × m3} (1 ≤ i ≤ N) with the least possible error (in
the sense of the L2 norm of the difference between the original and reconstructed patches). Note that
K << N. In addition, we impose a sparsity constraint that every Sia (the tensor used to reconstruct Pi

Footnote 5: Here A ⊗ B refers to the Kronecker product of the matrices A ∈ R^{E1 × E2} and B ∈ R^{F1 × F2}, given as the
block matrix A ⊗ B = (A_{e1e2} B), 1 ≤ e1 ≤ E1, 1 ≤ e2 ≤ E2.



from (Ua, Va, Wa)) has at most T non-zero elements. The main objective function to be minimized is:

E({Ua, Va, Wa, Sia, Mia}) = Σ_{i=1}^{N} Σ_{a=1}^{K} Mia ||Pi − Sia ×1 Ua ×2 Va ×3 Wa||²    (13)

subject to the following constraints:

(1) Ua^T Ua = Va^T Va = Wa^T Wa = I, ∀a. (2) ||Sia||_0 ≤ T, ∀(i, a).
(3) Σ_a Mia = 1, ∀i, and Mia ∈ {0, 1}, ∀(i, a).    (14)

Here Mia is a binary matrix of size N × K which indicates whether the ith patch belongs to the space
defined by (Ua, Va, Wa). Just as for the 2D case, we relax the binary membership constraint so that
now Mia ∈ (0, 1), ∀(i, a), subject to Σ_a Mia = 1, ∀i. Using Lagrange parameters {μi}, symmetric
Lagrange matrices {Λ1a}, {Λ2a} and {Λ3a}, and a temperature parameter β, we obtain the following
deterministic annealing objective function:

E({Ua, Va, Wa, Sia, Mia}) = Σ_{i=1}^{N} Σ_{a=1}^{K} Mia ||Pi − Sia ×1 Ua ×2 Va ×3 Wa||² +
(1/β) Σ_{ia} Mia log Mia + Σ_i μi (Σ_a Mia − 1) +
Σ_a trace[Λ1a(Ua^T Ua − I)] + Σ_a trace[Λ2a(Va^T Va − I)] + Σ_a trace[Λ3a(Wa^T Wa − I)].    (15)

We first initialize {Ua}, {Va} and {Wa} to random orthonormal matrices, ∀a, and set Mia = 1/K, ∀(i, a).
Secondly, using the fact that {Ua}, {Va} and {Wa} are orthonormal, the projection tensor Sia is computed
by the rule:

Sia = Pi ×1 Ua^T ×2 Va^T ×3 Wa^T, ∀(i, a).    (16)

This is the minimum of the energy function ||Pi − Sia ×1 Ua ×2 Va ×3 Wa||². Then the m1m2m3 − T elements
of Sia with least absolute value are nullified. Thereafter, Ua, Va and Wa are updated as follows. Let us
denote the sum of the terms in the previous energy function that are independent of Ua as C. Then we
can write the energy function as:

E({Ua, Va, Wa, Sia, Mia}) = Σ_{ia} Mia ||Pi − (Sia ×1 Ua ×2 Va ×3 Wa)||² +
Σ_a trace[Λ1a(Ua^T Ua − I)] + C
= Σ_{ia} Mia ||Pi(1) − (Sia ×1 Ua ×2 Va ×3 Wa)(1)||² +
Σ_a trace[Λ1a(Ua^T Ua − I)] + C.    (17)




Note that here Pi(n) refers to the nth unfolding of the tensor Pi (refer to [18] for exact details). Also note
that the above step uses the fact that for any tensor X, we have ||X||² = ||X_(n)||². Now, the objective
function can be further expressed as follows:

E({Ua, Va, Wa, Sia, Mia}) = Σ_{ia} Mia ||Pi(1) − Ua Sia(1) (Va ⊗ Wa)^T||² +
Σ_a trace[Λ1a(Ua^T Ua − I)] + C.    (18)

Further simplification gives:

E({Ua, Va, Wa, Sia, Mia})
= Σ_{ia} Mia trace[(Pi(1) − Ua Sia(1) (Va ⊗ Wa)^T)^T (Pi(1) − Ua Sia(1) (Va ⊗ Wa)^T)] +
Σ_a trace[Λ1a(Ua^T Ua − I)] + C
= Σ_{ia} Mia [trace[Pi(1)^T Pi(1)] − 2 trace[Pi(1)^T Ua Sia(1) (Va ⊗ Wa)^T] +
trace[Sia(1)^T Sia(1)]] + Σ_a trace[Λ1a(Ua^T Ua − I)] + C.    (19)

Taking the partial derivative of E w.r.t. Ua and setting it to zero, we have:

∂E/∂Ua = Σ_i −2 Mia Pi(1) (Va ⊗ Wa) Sia(1)^T + 2 Ua Λ1a = 0.    (20)

After a series of manipulations to eliminate the Lagrange matrices, we arrive at the following update rule
for Ua:

ZUa = Σ_i Mia Pi(1) (Va ⊗ Wa) Sia(1)^T;  Ua = ZUa (ZUa^T ZUa)^{-1/2} = Γ1a Ψ1a^T.    (21)

Here Γ1a and Ψ1a are the orthonormal matrices obtained from the SVD of ZUa. The updates for Va and Wa
are obtained similarly and are mentioned below:

ZVa = Σ_i Mia Pi(2) (Wa ⊗ Ua) Sia(2)^T;  Va = ZVa (ZVa^T ZVa)^{-1/2} = Γ2a Ψ2a^T.    (22)

ZWa = Σ_i Mia Pi(3) (Ua ⊗ Va) Sia(3)^T;  Wa = ZWa (ZWa^T ZWa)^{-1/2} = Γ3a Ψ3a^T.    (23)

Here again, Γ2a and Ψ2a refer to the orthonormal matrices obtained from the SVD of ZVa, and Γ3a
and Ψ3a refer to the orthonormal matrices obtained from the SVD of ZWa. The membership values are
obtained by the following update:

Mia = exp(−β ||Pi − Sia ×1 Ua ×2 Va ×3 Wa||²) / Σ_{b=1}^{K} exp(−β ||Pi − Sib ×1 Ub ×2 Vb ×3 Wb||²).    (24)




The core tensors {Sia}, the matrices {Ua, Va, Wa} and the memberships {Mia} are then updated sequentially for a fixed β
value, until convergence. The value of β is then increased and the sequential updates are repeated. The
entire process is repeated until an integrality condition on the memberships is met.
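Under our reading of the updates above, one annealing pass for a single exemplar triple can be sketched as follows (names are ours; `einsum` contractions stand in for the unfolding/Kronecker-product notation, and the normalization of Eqn. (24) across all K triples is only indicated):

```python
import numpy as np

def truncate(S, T):
    """Greedy sparsification: keep the T largest-magnitude entries."""
    Q = S.copy()
    Q.ravel()[np.argsort(np.abs(Q).ravel())[:Q.size - T]] = 0.0
    return Q

def polar(Z):
    """Gamma Psi^T from the SVD of Z (the orthonormal polar factor)."""
    G, _, Pt = np.linalg.svd(Z)
    return G @ Pt

def anneal_step(patches, U, V, W, M, T, beta):
    """One sequential update pass (Eqns. 16 and 21-24) for one exemplar
    triple (U, V, W); M holds this triple's membership weights."""
    # Sparse core tensors: Eqn. (16) followed by greedy truncation
    S = [truncate(np.einsum('ai,bj,ck,abc->ijk', U, V, W, P), T) for P in patches]
    # Basis updates, Eqns. (21)-(23): orthonormal polar factor of a
    # membership-weighted correlation matrix
    U = polar(sum(m * np.einsum('pjk,jb,kc,qbc->pq', P, V, W, s)
                  for m, P, s in zip(M, patches, S)))
    V = polar(sum(m * np.einsum('ipk,ia,kc,aqc->pq', P, U, W, s)
                  for m, P, s in zip(M, patches, S)))
    W = polar(sum(m * np.einsum('ijp,ia,jb,abq->pq', P, U, V, s)
                  for m, P, s in zip(M, patches, S)))
    # Membership numerators, Eqn. (24); divide by the sum over all K
    # triples to obtain the actual memberships
    errs = np.array([np.sum((P - np.einsum('ia,jb,kc,abc->ijk', U, V, W, s)) ** 2)
                     for P, s in zip(patches, S)])
    return U, V, W, np.exp(-beta * errs)
```

The polar-factor step is the standard orthogonal-Procrustes solution, so the updated U, V and W remain exactly orthonormal at every iteration.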


C. Application to Compact Image Representation

Quite similar to the 2D case, after the optimization during the training phase, each training patch Pi
(1 ≤ i ≤ N) gets represented as a projection onto the one of the K exemplar orthonormal bases that
produces the least reconstruction error, i.e. the kth exemplar is chosen if ||Pi − Sik ×1 Uk ×2 Vk ×3
Wk||² ≤ ||Pi − Sia ×1 Ua ×2 Va ×3 Wa||², ∀a ∈ {1, 2, ..., K}, 1 ≤ k ≤ K. For training patch Pi, we
denote the corresponding 'optimal' projection tensor as Si* = Sik, and the corresponding exemplar as
(Ui*, Vi*, Wi*) = (Uk, Vk, Wk). Thus the entire set of patches can be closely approximated by (1) the
common set of basis triples {(Ua, Va, Wa)}, 1 ≤ a ≤ K (K << N), and (2) the optimal sparse projection
tensors {Si*} for each patch, with at most T non-zero elements each. The overall storage per image is
thus greatly reduced. Furthermore, these bases {(Ua, Va, Wa)} can now be used to encode patches from
a new set of images that are similar to the ones in the training set, though just as in the 2D
case, the sparsity of each patch is adjusted dynamically in order to meet a given error threshold.
Experimental results for this are provided in the next section.
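A sketch of this test-time encoding is given below (names are ours). It uses the fact, from Theorem 2, that the reconstruction error equals the energy of the dropped coefficients, so the smallest admissible sparsity can be read off a cumulative sum; the per-pixel threshold divides by m1·m2 only, as in the color experiments that follow:

```python
import numpy as np

def encode_patch(P, bases, delta):
    """Encode one test patch: for each exemplar triple, find the smallest
    sparsity whose per-pixel error is <= delta; keep the sparsest overall."""
    m1, m2, _ = P.shape
    budget = delta * m1 * m2          # allowed total squared error
    best = None
    for a, (U, V, W) in enumerate(bases):
        S = np.einsum('ai,bj,ck,abc->ijk', U, V, W, P)
        sq = np.sort(S.ravel() ** 2)  # squared magnitudes, ascending
        # dropped[d] = error incurred by zeroing the d smallest entries
        dropped = np.concatenate(([0.0], np.cumsum(sq)))
        d = np.searchsorted(dropped, budget, side='right') - 1
        t = S.size - d                # required sparsity level
        if best is None or t < best[0]:
            Q = S.copy()
            Q.ravel()[np.argsort(np.abs(S).ravel())[:d]] = 0.0
            best = (t, a, Q)
    return best                       # (sparsity, exemplar index, core tensor)
```

Because the error of any truncation is available in closed form, no trial reconstructions are needed while searching for the sparsity level.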


V. EXPERIMENTS: COLOR IMAGES

In this section, we describe the experiments we performed on color images represented in the RGB
color scheme. Each image of size M1 x M2 x 3 was treated as a third-order matrix. Our compression
algorithm was tested on the GaTech Face Database [38], which consists of RGB color images, with 15
images each of 50 different people. The images in the database are already cropped to include just the
face, but some of them contain small portions of a distinctly cluttered background. The average image
size is 150 x 150 pixels, and all the images are in the JPEG format. For our experiments, we divided this
database into a training set of one image each of 40 different people, and a test set of the remaining 14
images each of these 40 people, plus all 15 images each of the remaining 10 people. Thus, the size of the
test set was 710 images. The patch size we chose for the training and test sets was 12 x 12 x 3, and we
experimented with K = 100 different orthonormal bases learned during training. The value of T during
training was set to 10. At the time of testing, for our method, each patch was projected onto the triple
of orthonormal bases (Ui*, Vi*, Wi*) which gave the sparsest projection tensor Si* such that the per-pixel
reconstruction error ||Pi − Si* ×1 Ui* ×2 Vi* ×3 Wi*||² / (m1 m2) was no greater than a chosen δ. Note that, in calculating the




per-pixel reconstruction error, we did not divide by the number of channels (i.e. 3), because at each pixel
three values are defined. We experimented with different reconstruction error values δ ranging from
8 × 10^-5 to 8 × 10^-3. Following the reconstruction, the PSNR for the entire image was measured, and
averaged over the entire test set. The total number of bits per pixel, i.e. RPP, was also calculated for
each image and averaged over the test set. The details of the training and testing methodology, and also
the actual quantization and coding step, are the same as presented previously in Sections III-A and III-B.


A. Comparisons with Other Methods

The results obtained by our method were compared to those obtained by the following techniques:

1) KSVD, for which we used patches of size 12 x 12 x 3 reshaped to give vectors of size 432, and
used these to train a dictionary of 1340 vectors using a value of T = 30.

2) Our algorithm for 2D images from Section II, with an independent (separate) encoding of each of
the three channels. As an independent coding of the R, G and B slices would fail to account for
the inherent correlation between the channels (and hence give inferior compression performance),
we used principal components analysis (PCA) to find the three principal components of the R, G,
B values of the pixels from the training set. The R, G, B pixel values from the test images were
then projected onto these principal components to give a transformed image in which the values
in the different channels are decorrelated. A similar approach was taken earlier in
[39] for compression of color face images using vector quantization, where the PCA method is
empirically shown to produce channels that are even more decorrelated than those of the Y-Cb-Cr
color model. The orthonormal bases were learned on each of the three (decorrelated) channels
of the PCA image. This was followed by a quantization and coding step similar to that described
in Section III-B. However, in the color image case, the Huffman encoding step [34] for finding the
optimal values of a1, a2 and a3 in Eqn. (12) was performed using projection matrices from all
three channels together. This was done to improve the coding efficiency.

3) The JPEG standard (its MATLAB implementation), for which we calculated RPP from the number
of bytes of file storage on disk. See Section V-B for more details.
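The PCA decorrelation step of method 2 can be sketched as follows (a hypothetical helper of our own, not the paper's code); since the learned `axes` matrix is orthonormal, the transform is exactly invertible:

```python
import numpy as np

def rgb_pca(train_pixels):
    """Learn a 3x3 PCA transform from an (N, 3) array of training RGB pixels."""
    mean = train_pixels.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov((train_pixels - mean).T))
    return mean, vecs[:, ::-1]        # columns sorted by decreasing variance

def decorrelate(image, mean, axes):
    """Map an (H, W, 3) image to (H*W, 3) decorrelated PCA channels."""
    return (image.reshape(-1, 3) - mean) @ axes

def recompose(channels, mean, axes, shape):
    """Invert the transform (axes has orthonormal columns)."""
    return (channels @ axes.T + mean).reshape(shape)
```

Each column of the decorrelated output can then be reshaped into a single-channel image and fed to the 2D patch-encoding pipeline.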

We would like to mention here that we did not compare our technique with [39]. The latter technique
uses patches from color face images and encodes them using vector quantization. The patches from
more complex regions of the face (such as the eyes, nose and mouth) are encoded using a separate vector
quantization step for better reconstruction. The method in [39] requires prior demarcation of such regions,
which is in itself a highly difficult task to automate, especially under varying illumination and pose. Our




method does not require any such prior segmentation, as we dynamically tune the number of coefficients
to meet the pre-specified reconstruction error.


B. Results

As can be seen from Figure 6(a), all three methods perform well, though the higher-order method
produced the best results beyond a bit rate of around 2 per pixel. For the GaTech database, we did not
compare our algorithm directly with JPEG because the images in the database are already in the JPEG
format. However, to facilitate comparison with JPEG, we used a subset of 54 images from the CMU-PIE
database [40]. The CMU-PIE database contains images of several people against cluttered backgrounds
with a large variation in pose, illumination, facial expression and occlusions created by spectacles. All
the images are available in an uncompressed (.ppm) format, and their size is 631 x 467 pixels. We
chose 54 images belonging to one and the same person, and used exactly one image for training and all
the remaining images for testing. The results produced by our method using higher-order matrices, our method
on separate channels, and also KSVD were competitive with those produced by JPEG.
For bit rates greater than 1.5 per pixel, our methods produced performance superior to that of
JPEG in terms of the PSNR for a given RPP, as seen in Figure 8. The parameters for this experiment
were K = 100 and T = 10 for training. Sample reconstructions of an image from the CMU-PIE database
using our higher-order method for different error values are shown in Figure 7. This is quite interesting,
since there is considerable variation between the training image and the test images, as is clear from
Figure 7. We would like to emphasize that the experiments were carried out on uncropped images of
the full size with the complete cluttered background. Also, the dictionary sizes for the experiments on
each of these databases are summarized in Table II. For color-image patches of size m1 x m2 x 3 using
K sets of bases, our method in 3D requires a dictionary of size 3Km1m2, whereas our method in 2D
requires a dictionary of size 2Km1m2 per channel, which is 6Km1m2 in total.


C. Comparisons with JPEG on Noisy Datasets

As mentioned before, we did not directly compare our results to the JPEG technique on the GaTech
Face Database, because the images in the database are already in the JPEG format. Instead, we added
zero-mean Gaussian noise of variance 8 × 10^-4 (on a color scale of [0, 1]) to the images of the GaTech
database and converted them to a raw format. Following this, we converted these raw images back to JPEG
(using MATLAB) and measured the RPP and PSNR. These figures were pitted against those obtained
by our higher-order method, our method on separate channels, as well as KSVD, as shown in Figures




[Figure 6 panels: (a) PSNR vs RPP, training and testing on clean images; (b) PSNR vs RPP, clean training, noisy testing; (c) PSNR vs RPP, noisy training, noisy testing. Legends: Our Method (3D), Our Method (2D), KSVD, JPEG.]


Fig. 6. ROC curves for the GaTech database: (a) when training and testing were done on the clean original images, (b) when
training was done on clean images but testing was done with additive zero-mean Gaussian noise of variance 8 × 10^-4 added to
the test images, and (c) when training and testing were both done with zero-mean Gaussian noise of variance 8 × 10^-4 added
to the respective images. The methods tested were our higher-order method, our method in 2D, KSVD and JPEG. Parameters
for our methods: K = 100 and T = 10 during training. These plots are best viewed in color.


[Figure 7 panels: Training Image; Original Test Image; reconstructions for Error 0.001, Error 0.0005, Error 0.0003, Error 0.0001.]


Fig. 7. Sample reconstructions of an image from the CMU-PIE database with different error values using our higher-order
method. The original image is at the top left. These images are best viewed in color.




[Figure 8: PSNR vs RPP for CMU-PIE. Legend: Our Method (3D), Our Method (2D), KSVD, JPEG.]


Fig. 8. ROC curves for the CMU-PIE database using our method in 3D, our method in 2D, KSVD and JPEG. Parameters for
our methods: K = 100 and T = 10 during training. These plots are best viewed in color.


Method              Dictionary Size (number of scalars)
Our Method (3D)     100 x 3 x 12 x 12 = 43200
Our Method (2D)     100 x 2 x 3 x 12 x 12 = 86400
KSVD                1340 x 12 x 12 x 3 = 578880
JPEG                —

TABLE II
COMPARISON OF DICTIONARY SIZE FOR VARIOUS METHODS FOR EXPERIMENTS ON THE GATECH AND CMU-PIE
DATABASES.







6(b) and 6(c). The performance of JPEG was distinctly inferior, even though the added noise did not have a very high variance. The reason is that the algorithms underlying JPEG exploit the fact that, in most natural images, the lower frequencies strongly dominate; this assumption does not hold for sensor noise. The DCT coefficients of noisy images therefore take on prominently higher values at high frequencies, as a result of which JPEG produces higher bit rates for the same PSNR. For the purposes of comparison, we ran two experiments using our higher-order method, our separate-channel method and KSVD as well. In the first experiment, noise was added only to the test set, while the orthonormal bases or the dictionary were learned on a clean training set (i.e., without noise added to the training images). In the second experiment, noise was added to every image in the training set, and all methods were trained on these noisy images. The testing was performed on (noisy) images from the test set, and the ROC curves were plotted as usual. As can be seen from Figures 6(b) and 6(c), the PSNR values for JPEG plateau rather quickly. Note that in all these experiments, the PSNR is calculated from the squared difference between the reconstructed image and the corresponding noisy image. This is because the
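The effect of noise on the DCT can be illustrated with a small numerical sketch (a hypothetical demonstration, not part of our experiments): a smooth patch concentrates its DCT energy in the low frequencies, while additive noise spreads energy into the high-frequency coefficients that JPEG must then spend bits encoding. The DCT matrix here is built by hand to keep the sketch self-contained.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (the transform family used by JPEG)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] = 1.0 / np.sqrt(n)
    return C

def high_freq_energy_fraction(patch):
    """Fraction of DCT energy outside the low-frequency 4x4 corner."""
    n = patch.shape[0]
    C = dct_matrix(n)
    D = C @ patch @ C.T            # 2D DCT of the patch
    total = np.sum(D ** 2)
    low = np.sum(D[:4, :4] ** 2)   # energy in the low-frequency block
    return (total - low) / total

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
patch = np.outer(x, x)                             # smooth 8x8 gradient patch
noisy = patch + 0.1 * rng.standard_normal((8, 8))  # add sensor-like noise

clean_frac = high_freq_energy_fraction(patch)
noisy_frac = high_freq_energy_fraction(noisy)
# the noisy patch has a markedly larger high-frequency energy fraction
```

The flat spectrum of the noise violates the low-frequency-dominance assumption, which is exactly why JPEG's bit rate rises without a corresponding PSNR gain.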




'original clean' image would usually be unavailable in a practical application that required compression of noisy data.
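For concreteness, the PSNR computation used throughout can be sketched as below (a generic implementation; the 8-bit peak value of 255 is an assumption, and, as noted above, the reference image is the noisy image itself):

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio (dB) from the mean squared difference."""
    mse = np.mean((np.asarray(reference, dtype=np.float64) -
                   np.asarray(reconstruction, dtype=np.float64)) ** 2)
    if mse == 0:
        return np.inf                      # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
# a uniform error of 1 gray level gives MSE = 1, i.e. about 48.13 dB
value = psnr(img, img + 1.0)
```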


VI. CONCLUSION

We have presented a new technique for sparse representation of image patches, based on the notion of treating an image as a 2D entity. We go beyond the usual singular value decomposition of individual images or image patches to learn a set of common, full-sized orthonormal bases for sparse representation, as opposed to low-rank representation. For the projection of a patch onto these bases, we have provided a very simple and provably optimal greedy algorithm, unlike the approximation algorithms that are part and parcel of projection-pursuit techniques, which may require restrictive assumptions on the nature or properties of the dictionary [27]. Based on the developed theory, we have demonstrated a successful application of our technique to the compression of two well-known face databases. Our compression method handles the varying complexity of different patches by dynamically altering the number of coefficients (and hence the bit rate for storage) in the projection matrix of the patch. Furthermore, this paper also presents a clean and elegant extension to higher-order matrices, with applications to color-image compression. Unlike decompositions such as HOSVD, which do not retain the optimal low-rank reconstruction property of SVD, our method scales cleanly into higher dimensions. The experimental results show that our technique compares very favorably to other existing approaches from the recent literature, including the JPEG standard. We have also empirically examined the performance of our algorithm on noisy color image datasets.
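The patch-level projection step can be sketched as follows. This is a minimal illustration assuming a single exemplar basis pair (U, V); in the full method each patch is assigned to whichever learned basis pair yields the lowest sparse-reconstruction error, and the bases shown here are random stand-ins rather than learned exemplars.

```python
import numpy as np

def sparse_project(X, U, V, T):
    """Project patch X onto the orthonormal basis pair (U, V) and keep
    only the T largest-magnitude coefficients of S = U^T X V.  For
    orthonormal bases this greedy choice minimizes the Frobenius-norm
    reconstruction error over all T-sparse coefficient matrices."""
    S = U.T @ X @ V                        # full coefficient matrix
    flat = S.ravel()
    keep = np.argsort(np.abs(flat))[-T:]   # indices of the T largest entries
    S_sparse = np.zeros_like(flat)
    S_sparse[keep] = flat[keep]
    S_sparse = S_sparse.reshape(S.shape)
    return U @ S_sparse @ V.T, S_sparse    # reconstruction and sparse code

# toy usage: random orthonormal bases obtained from a QR factorization
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((12, 12)))
V, _ = np.linalg.qr(rng.standard_normal((12, 12)))
X = rng.standard_normal((12, 12))
X_hat, S = sparse_project(X, U, V, T=10)   # 10 of 144 coefficients kept
```

Storing only the T retained coefficients and their positions is what lets the bit rate adapt per patch.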

Directions for future work include: (1) investigation of alternative matrix representations tuned for specific applications (as opposed to using the default rows and columns of the image); (2) application of our technique to the compression of gray-scale and color video (which can be represented as 3D and 4D matrices, respectively); (3) application of our technique to denoising or classification; (4) a theoretical study of the method in the context of natural image statistics; and (5) a study of utilizing a perceptually driven quality metric [41] as opposed to a simple L2 norm.
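For the higher-order extension mentioned above, a color patch is treated as a 3D tensor and the two-sided matrix projection becomes mode-wise products with three orthonormal bases. The sketch below is our illustration using numpy.einsum, with random stand-in bases and the 12 x 12 x 3 patch size assumed from the experiments.

```python
import numpy as np

def sparse_project_3d(X, U1, U2, U3, T):
    """Project a 3D patch X via mode-n products with the transposed
    orthonormal bases U1, U2, U3, keep the T largest-magnitude
    coefficients, and reconstruct."""
    # S[i,j,k] = sum_{a,b,c} U1[a,i] * U2[b,j] * U3[c,k] * X[a,b,c]
    S = np.einsum('ai,bj,ck,abc->ijk', U1, U2, U3, X)
    flat = S.ravel()
    keep = np.argsort(np.abs(flat))[-T:]
    S_sparse = np.zeros_like(flat)
    S_sparse[keep] = flat[keep]
    S_sparse = S_sparse.reshape(S.shape)
    # orthonormality makes the inverse projection exact
    X_hat = np.einsum('ai,bj,ck,ijk->abc', U1, U2, U3, S_sparse)
    return X_hat, S_sparse

rng = np.random.default_rng(1)
U1, _ = np.linalg.qr(rng.standard_normal((12, 12)))
U2, _ = np.linalg.qr(rng.standard_normal((12, 12)))
U3, _ = np.linalg.qr(rng.standard_normal((3, 3)))
X = rng.standard_normal((12, 12, 3))               # a 12 x 12 color patch
X_hat, S = sparse_project_3d(X, U1, U2, U3, T=30)  # 30 of 432 coefficients
```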


REFERENCES

[1] A. Rangarajan, "Learning matrix space image representations", in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), 2001, vol. 2134 of LNCS, pp. 153-168, Springer-Verlag.
[2] J. Ye, "Generalized low rank approximations of matrices", Machine Learning, vol. 61, pp. 167-191, 2005.
[3] J. Yang, D. Zhang, A. F. Frangi, and J.-Y. Yang, "Two-dimensional PCA: A new approach to appearance-based face representation and recognition", IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 1, 2004.




[4] X. He, D. Cai, and P. Niyogi, "Tensor subspace analysis", in Neural Information Processing Systems, 2005, pp. 499-506.
[5] H. C. Andrews and C. L. Patterson, "Singular value decomposition (SVD) image coding", IEEE Transactions on Communications, pp. 425-432, 1976.
[6] G. Golub and C. van Loan, Matrix Computations, The Johns Hopkins University Press, October 1996.
[7] A. van der Schaaf and J. H. van Hateren, "Modelling the power spectra of natural images: statistics and information", Vision Research, vol. 36, pp. 2759-2770, 1996.
[8] J.-F. Yang and C.-L. Lu, "Combined techniques of singular value decomposition and vector quantization for image coding", IEEE Transactions on Image Processing, vol. 4, no. 8, pp. 1141-1146, 1995.
[9] C. Ding and J. Ye, "Two-dimensional singular value decomposition (2DSVD) for 2D maps and images", in SIAM Int'l Conf. Data Mining, 2005, pp. 32-43.
[10] G. Wallace, "The JPEG still picture compression standard", IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. 18-34, 1992.
[11] B. Olshausen and D. Field, "Natural image statistics and efficient coding", Network, vol. 7, pp. 333-339, 1996.
[12] M. Lewicki, T. Sejnowski, and H. Hughes, "Learning overcomplete representations", Neural Computation, vol. 12, pp. 337-365, 2000.
[13] M. Aharon, M. Elad, and A. Bruckstein, "The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representation", IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311-4322, November 2006.
[14] P. Hoyer, "Non-negative matrix factorization with sparseness constraints", Journal of Machine Learning Research, pp. 1457-1469, 2004.
[15] G. Davis, S. Mallat, and M. Avellaneda, "Adaptive greedy approximations", Constructive Approximation, vol. 13, pp. 57-98, 1997.
[16] K. Gurumoorthy, A. Rajwade, A. Banerjee, and A. Rangarajan, "Beyond SVD: sparse projections onto exemplar orthonormal bases for compact image representation", in International Conference on Pattern Recognition (ICPR), 2008.
[17] A. Shashua and A. Levin, "Linear image coding for regression and classification using the tensor-rank principle", in CVPR (1), 2001, pp. 42-49.
[18] L. de Lathauwer, B. de Moor, and J. Vandewalle, "A multilinear singular value decomposition", SIAM J. Matrix Anal. Appl., vol. 21, no. 4, pp. 1253-1278, 2000.
[19] J. Håstad, "Tensor rank is NP-complete", in ICALP '89: Proceedings of the 16th International Colloquium on Automata, Languages and Programming, 1989, pp. 451-460.
[20] L. de Lathauwer, Signal Processing Based on Multilinear Algebra, PhD thesis, Katholieke Universiteit Leuven, Belgium, 1997.
[21] M. A. O. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: TensorFaces", in International Conference on Pattern Recognition (ICPR), 2002, pp. 511-514.
[22] R. Costantini, L. Sbaiz, and S. Süsstrunk, "Higher order SVD analysis for dynamic texture synthesis", IEEE Transactions on Image Processing, vol. 17, no. 1, pp. 42-52, Jan. 2008.
[23] H. Wang and N. Ahuja, "Rank-R approximation of tensors: Using image-as-matrix representation", in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, vol. 2, pp. 346-353.
[24] C. Ding, H. Huang, and D. Luo, "Tensor reduction error analysis - applications to video compression and classification", in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.




[25] A. Ranade, S. Mahabalarao, and S. Kale, "A variation on SVD based image compression", Image and Vision Computing, vol. 25, no. 6, pp. 771-777, 2007.
[26] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition", in 27th Asilomar Conference on Signals, Systems and Computers, 1993, vol. 1, pp. 40-44.
[27] J. Tropp, "Greed is good: algorithmic results for sparse approximation", IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231-2242, 2004.
[28] C. Peterson and B. Söderberg, "A new method for mapping optimization problems onto neural networks", International Journal of Neural Systems, vol. 1, pp. 3-22, 1989.
[29] A. Yuille and J. Kosowsky, "Statistical physics algorithms that converge", Neural Computation, vol. 6, no. 3, pp. 341-356, 1994.
[30] R. Hathaway, "Another interpretation of the EM algorithm for mixture distributions", Statistics and Probability Letters, vol. 4, pp. 53-56, 1986.
[31] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analysers", Neural Computation, vol. 11, no. 2, pp. 443-482, 1999.
[32] "The ORL Database of Faces", http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
[33] "The Yale Face Database", http://cvc.yale.edu/projects/yalefaces/yalefaces.html.
[34] D. Huffman, "A method for the construction of minimum-redundancy codes", in Proceedings of the I.R.E., 1952, pp. 1098-1102.
[35] S. Geman, E. Bienenstock, and R. Doursat, "Neural networks and the bias/variance dilemma", Neural Computation, vol. 4, no. 1, pp. 1-58, 1992.
[36] "The CVG Granada Database", http://decsai.ugr.es/cvg/dbimagenes/index.php.
[37] "The Uncompressed Colour Image Database", http://vision.cs.aston.ac.uk/datasets/UCID/ucid.html.
[38] "The Georgia Tech Face Database", http://www.anefian.com/face_reco.htm.
[39] J. Huang and Y. Wang, "Compression of color facial images using feature correction two-stage vector quantization", IEEE Transactions on Image Processing, vol. 8, pp. 102-109, 1999.
[40] "The CMU Pose, Illumination and Expression (PIE) Database", http://www.ri.cmu.edu/projects/project_418.html.
[41] M. Eckert and A. Bradley, "Perceptual quality metrics applied to still image compression", Signal Processing, vol. 70, no. 3, pp. 177-200, 1998.

