A NEW CLUSTERING ALGORITHM FOR SEGMENTATION OF MAGNETIC
RESONANCE IMAGES
By
ERHAN GOKCAY
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2000
Copyright 2000
by
Erhan Gokcay
ACKNOWLEDGEMENTS
First and foremost I wish to thank my advisor, Dr. Jose Principe. He allowed me the
freedom to explore, while at the same time provided invaluable insight without which this
dissertation would not have been possible.
I also wish to thank the members of my committee, Dr. John Harris, Dr. Christiana
Leonard, Dr. Joseph Wilson, and Dr. William Edmonson, for their insightful comments
which improved the quality of this dissertation.
I also wish to thank my wife Didem and my son Tugra for their patience and support
during the long nights I spent working.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT

CHAPTERS

1 INTRODUCTION
   1.1 Magnetic Resonance Image Segmentation
   1.2 Image Formation in MRI
   1.3 Characteristics of Medical Imagery
   1.4 Segmentation of MR Images
      1.4.1 Feature Extraction
      1.4.2 Gray Scale Single Image Segmentation
      1.4.3 Multispectral Segmentation
   1.5 Validation
      1.5.1 MRI Contrast Methods
      1.5.2 Validation Using Phantoms
      1.5.3 Validation Using MRI Simulations
      1.5.4 Manual Labeling of MR Images
      1.5.5 Brain Development During Childhood
   1.6 Motivation
   1.7 Outline

2 UNSUPERVISED LEARNING AND CLUSTERING
   2.1 Classical Methods
      2.1.1 Discussion
      2.1.2 Clustering Criterion
      2.1.3 Similarity Measures
   2.2 Criterion Functions
      2.2.1 The Sum-of-Squared-Error Criterion
      2.2.2 The Scatter Matrices
   2.3 Clustering Algorithms
      2.3.1 Iterative Optimization
      2.3.2 Merging and Splitting
      2.3.3 Neighborhood Dependent Methods
      2.3.4 Hierarchical Clustering
      2.3.5 Nonparametric Clustering
   2.4 Mixture Models
      2.4.1 Maximum Likelihood Estimation
      2.4.2 EM Algorithm
   2.5 Competitive Networks
   2.6 ART Networks
   2.7 Conclusion

3 ENTROPY AND INFORMATION THEORY
   3.1 Introduction
   3.2 Maximum Entropy Principle
      3.2.1 Mutual Information
   3.3 Divergence Measures
      3.3.1 The Relationship to Maximum Entropy Measure
      3.3.2 Other Entropy Measures
      3.3.3 Other Divergence Measures
   3.4 Conclusion

4 CLUSTERING EVALUATION FUNCTIONS
   4.1 Introduction
   4.2 Density Estimation
      4.2.1 Clustering Evaluation Function
      4.2.2 CEF as a Weighted Average Distance
      4.2.3 CEF as a Bhattacharya Related Distance
      4.2.4 Properties as a Distance
      4.2.5 Summary
   4.3 Multiple Clusters
   4.4 Results
      4.4.1 A Parameter to Control Clustering
      4.4.2 Effect of the Variance to the Pdf Function
      4.4.3 Performance Surface
   4.5 Comparison of Distance Measures in Clustering
      4.5.1 CEF as a Distance Measure
      4.5.2 Sensitivity Analysis
      4.5.3 Normalization

5 OPTIMIZATION ALGORITHM
   5.1 Introduction
   5.2 Combinatorial Optimization Problems
      5.2.1 Local Minima
      5.2.2 Simulated Annealing
   5.3 The Algorithm
      5.3.1 A New Neighborhood Structure
      5.3.2 Grouping Algorithm
      5.3.3 Optimization Algorithm
      5.3.4 Convergence
   5.4 Preliminary Result
   5.5 Comparison

6 APPLICATIONS
   6.1 Implementation of the IMAGETOOL Program
      6.1.1 PVWAVE Implementation
      6.1.2 Tools Provided with the System
   6.2 Testing on MR Images
      6.2.1 Feature Extraction
      6.2.2 Test Image
   6.3 Validation
      6.3.1 Brain Surface Extraction
      6.3.2 Segmentation
      6.3.3 Segmentation Results
   6.4 Results

7 CONCLUSIONS AND FUTURE WORK

REFERENCES

BIOGRAPHICAL SKETCH
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
A NEW CLUSTERING ALGORITHM FOR SEGMENTATION OF MAGNETIC
RESONANCE IMAGES
By
Erhan Gokcay
August 2000
Chairman: Dr. Jose C. Principe
Major Department: Computer and Information Science and Engineering
The major goal of this dissertation is to present a new clustering algorithm using infor-
mation theoretic measures and apply the algorithm to segment Magnetic Resonance (MR)
Images. Since MR images are highly variable from subject to subject, data driven segmen-
tation methods seem appropriate. We developed a new clustering evaluation function
based on information theory that outperforms previous clustering algorithms, and the new
cost function works as a valley seeking algorithm. Since optimization of the clustering
evaluation function is difficult because of its stepwise nature and existence of local min-
ima, we developed an improvement on the K-change algorithm used commonly in cluster-
ing problems. When applied to nonlinearly separable data, the algorithm performed with
very good results, and was able to find the nonlinear boundaries between clusters without
supervision.
The clustering algorithm is applied to segment brain MR images with successful
results. A feature set is created from MR images using entropy measures of small blocks
from the input image. Clustering the whole brain image is computationally intensive.
Therefore, a small section of the brain is first used to train the clustering algorithm.
Afterwards, the rest of the brain is clustered with the proposed distance measure, using
the results obtained from the training image.
The algorithm is easy to apply and the calculations are simplified by choosing a proper
distance measure which does not require numerical integration.
CHAPTER 1
INTRODUCTION
1.1 Magnetic Resonance Image Segmentation
Segmentation of medical imagery is a challenging task due to the complexity of the
images, as well as to the absence of models of the anatomy that fully capture the possible
deformations in each structure. Brain tissue is a particularly complex structure, and its seg-
mentation is an important step for derivation of computerized anatomical atlases, as well
as pre- and intra-operative guidance for therapeutic intervention.
MRI segmentation has been proposed for a number of clinical investigations of vary-
ing complexity. Measurements of tumor volume and its response to therapy have used
image gray scale methods as applied to X-ray Computerized Tomography (CT) or simple
MRI datasets [Cli87]. However, the differentiation of tissues within tumors that have sim-
ilar MRI characteristics, such as edema, necrotic tissue, or scar tissue, has proven to be impor-
tant in the evaluation of response to therapy, and hence, multispectral methods have been
proposed [Van91a] [Cla93]. Recently, multimodality approaches, such as positron emis-
sion tomography (PET) and functional magnetic resonance imaging (fMRI) studies using
radiotracers [Tju94], or contrast materials [Tju94] [Buc91] have been suggested to provide
better tumor tissue specification and to identify active tumor tissue. Hence, segmentation
methods need to include these additional image data sets. In the same context, a similar
progression of segmentation methods is evolving for the planning of surgical procedures
primarily in neurological investigations [Hil93] [Zha90] [Cli91], surgery simulations
[Hu90] [Kam93], or the actual implementation of surgery in the operating suite where
both normal tissues and the localization of the lesion or mass need to be accurately iden-
tified. The methods proposed include gray scale image segmentation and multispectral
segmentation for anatomical images with additional recent efforts directed toward the
mapping of functional metrics (fMRI, EEG, etc) to provide locations of important func-
tional regions of the brain as required for optimal surgical planning.
Other applications of MRI segmentation include the diagnosis of brain trauma where
white matter lesions, a signature of traumatic brain injury, may potentially be identified in
moderate and possibly mild cases. These methods, in turn, may require correlation of ana-
tomical images with functional metrics to provide sensitive measurements of brain trauma.
MRI segmentation methods have also been useful in the diagnostic imaging of multiple
sclerosis [Wal92], including the detection of lesions [Raf90], and the quantitation of lesion
volume using multispectral methods[Jac93].
In order to understand the issues in medical image segmentation, in contrast with seg-
mentation of, say, images of indoor environments, which are the kind of images with
which general purpose visual segmentation systems deal, we need an understanding of the
salient characteristics of medical imagery.
One application of our clustering algorithm is to map and identify important brain
structures, which may be important in brain surgery.
1.2 Image Formation in MRI
MRI exploits the inherent magnetic moment of certain atomic nuclei. The nucleus of
the hydrogen atom (proton) is used in biologic tissue imaging due to its abundance in the
human body and its large magnetic moment. When the subject is positioned in the core of
the imaging magnet, protons in the tissues experience a strong static magnetic field and
precess at a characteristic frequency that is a function solely of the magnetic field strength,
and does not depend, for instance, on the tissue to which the proton belongs. An excitation
magnetic field is applied at this characteristic frequency to alter the orientation of preces-
sion of the protons. The protons relax to their steady state after the excitation field is
stopped. The reason MRI is useful is because protons in different tissues relax to their
steady state at different rates. MRI essentially measures the components of the magnitude
vector of the precession orientation at different times and thus differentiates tissues. These
measures are encoded in 3D using methods for slice selection, frequency encoding and
phase encoding. Slice selection is performed by exciting thin cross-sections of tissue one
at a time. Frequency encoding is achieved by varying the returned frequency of the mea-
sured signal, and phase encoding is done by spatially varying the returned phase of the
measured signal.
1.3 Characteristics of Medical Imagery
While the nature of medical imagery allows a segmentation system to ignore issues
such as illumination and pose determination that would be important to a more general
purpose segmentation system, there are other issues, which will be briefly discussed
below. The objects to be segmented from medical imagery are actual anatomical struc-
tures, which are often non rigid and complex in shape, and exhibit considerable variability
from person to person. This combined with the absence of explicit shape models that cap-
ture the deformations in anatomy, makes the segmentation task challenging. Magnetic res-
onance images are further complicated by the limitations of the imaging equipment: inhomogeneities
in the receiver or transmitter coils lead to a non-linear gain artifact in the images, and
large differences in the magnetic susceptibilities of adjacent tissues lead to
distortion of the gradient magnetic field, and hence a spatial susceptibility artifact in the
images. In addition, the signal is degraded by the motion artifacts that may appear in the
images due to movement of the subject during the scan.
1.4 Segmentation of MR images
MR segmentation can be roughly divided into two categories: a single image segmen-
tation, where a single 2D or 3D gray scale image is used, and multi-spectral image seg-
mentation where multiple MR images with different gray scale contrasts are available.
1.4.1 Feature Extraction
Segmentation of MR images is based on sets of features that can be extracted from the
images, such as pixel intensities, which in turn can be used to calculate other features such
as edges and texture. Rather than using all the information in the images at once, feature
extraction and selection breaks down the problem of segmentation to the grouping of fea-
ture vectors [Jai89] [Zad92] [Zad96]. Selection of good features is the key to successful
segmentation [Pal93]. The focus of the thesis is not feature extraction. Therefore we will
use a simple but effective feature extraction method using entropy measures, and we will
not investigate the feature extraction further.
1.4.2 Gray Scale Single Image Segmentation
The most intuitive approach to segmentation is global thresholding. One common dif-
ficulty with this approach is determining the value of the thresholds. Knowledge guided
thresholding methods, where global thresholds are determined based on a "goodness func-
tion" describing the separation of background, skull and brain have been reported [Van88]
[Cla93] [Jac93] [Hal92] [Lia93] [Ger92b]. The method is limited, and successful applica-
tion for clinical use hindered by the variability of anatomy and MR data.
Edge detection [Pel82] [Mar80] [Bru93] schemes suffer from incorrect detection of
edges due to noise, over- and under-segmentation, and variability in threshold selection in
the edge image [Del91a] [Sah88]. Combination with morphological [Rit96] [Rit87a]
[Rit87b] filtering is also reported [Bom90]. Another method is boundary tracing [Ash90],
where the operator clicks a pixel in a region to be outlined and the method then finds the
boundary starting from that point. It is usually restricted to the segmentation of large,
well-defined structures and cannot distinguish tissue types.
Seed growing methods are also reported [Rob94], where the segmentation requires an
operator to empirically select seeds and thresholds. Pixels around the seed are examined,
and included in the region if they are within the thresholds. Each added pixel then
becomes a new seed.
Random field methods have been successfully applied, where an energy function is
required, which is often very difficult to define, that describes the problem [Bou94]
[Kar90].
1.4.3 Multispectral Segmentation
Supervised methods require a user supplied training set, usually found by drawing
regions of interest on the images. Using maximum likelihood (ML) methods where multi-
variate Gaussian distributions are assumed [Cla93] [Hal92] [Lia94] [Ger92b] [Kar90], sta-
tistics are calculated like mean and covariance matrices. The remaining pixels are then
classified by calculating the likelihood of each tissue class, and picking the tissue type
with the highest probability. Parametric methods are useful when the feature distributions
for different classes are well known, which is not necessarily the case for MR images
[Cla93]. The k-nearest neighbor (kNN) method has given superior results in terms of both accuracy
and reproducibility compared to parametric methods [Cla93]. Artificial neural
networks are also commonly used [Cla93] [Hal92] [Daw91] [Hay94] [Has95] [Hec89]
[Ozk93] [Wan98].
All supervised methods are operator dependent. Inter- and intra-operator variability
has been measured and shown to be relatively large [Cla93] [Ger92b] [Dav96]. For this
reason, unsupervised methods may be preferred from the viewpoint of reproducibility.
Unsupervised techniques, usually called clustering, automatically find the
structure in the data. A cluster is an area of high density in feature space. Clustering
methods include k-means [Bez93] [Ger92b] [Van88], and its fuzzy equivalent, fuzzy c-
means [Bez93] [Hal92] [Phi95]. These methods and their variants are basically limited to
finding linearly separable clusters. Another promising development is the use of semi-supervised
methods [Ben94b].
The expectation-maximization (EM) algorithm is also used in clustering of MR
images [Wel96] [Dem77] [Gui97], where the knowledge of tissue class is considered as
the missing information. As usual the method assumes a normal distribution, and it incorporates
an explicit model of the bias field that frequently arises in MR images.
Model based approaches including deformable models [Dav96] [Coh91], also known
as active contour models, provide a method for minimizing an objective function to obtain
a contour of interest, especially if an approximate location of the contour is available. A
deformable contour is a planar curve which has an initial position and an objective func-
tion associated with it. A special class of deformable contours called snakes was intro-
duced by Witkin [Wit88], in which the initial position is specified interactively by the user
and the objective function is referred to as the energy of the snake. The snake tries to min-
imize its energy over time, similar to physical systems. This energy of the snake is
expressed as a sum of two components: the internal energy of the snake and the external
energy of the snake, which is given as

E_snake = E_internal + E_external    (1.1)
The internal energy term imposes a piecewise smoothness constraint on the snake, and the
external energy term is responsible for attracting the snake to interesting features in the
image. The balloon model for deformable contours is an extension of the snake model. It
modifies the snake energy to include a "balloon" force, which can be either an inflation
force, or a deflation force. All these methods require an operator input to place the snake
close to a boundary.
1.5 Validation
MRI segmentation has been proposed as a method for determining the volume of
tissues and their 3D spatial distributions in applications involving diagnostic, therapeutic, or
surgical simulation protocols. Some form of quantitative measure of the accuracy and/or
reproducibility for the proposed segmentation method is clearly required. Since a direct
measure of ground truth is not logistically feasible, or even possible with pathologic corre-
lation, several alternative procedures have been used.
1.5.1 MRI Contrast Methods
The use of MR contrast agents in neuroinvestigations of the brain provides information
about whether or not a breakdown of the blood-brain barrier (BBB) has occurred and on the
integrity of the tissue vascularity both of which are often tumor type- and stage-dependent
[Run89] [Bra93] [Hen93] [Bro90]. However, MR contrast may not be optimum for the
quantitative differentiation of active tumor tissue, scar tissue, or recurrent tumors. Many
segmentation methods, in particular gray scale methods and multispectral methods, use
MR contrast information with T1-weighted images for tumor volume or size estimations
despite the limitations of these methods in the absence of ground truth determina-
tions[Mcc94] [Gal93]. Recently the use of multi-modality imaging methods, such as the
correlation with the PET studies, have been proposed to identify active tissues [Tju94].
Alternatively, the use of fMRI measurement of contrast dynamics has been suggested to
provide better differentiation of active tumor tissue in neurological investigations and
these functional images could be potentially included in segmentation methods[4,5].
1.5.2 Validation Using Phantoms
The use of phantoms constructed with compartments containing known volumes is
widely reported [Cli91] [Jac93] [Koh91] [Ger92b] [Mit94] [Bra94] [Pec92] [Jac90]. The
typical phantom represents a very idealized case consisting of two or three highly con-
trasting classes in a homogenous background [Koh91] [Ash90] [Jac90]. Phantoms con-
taining paramagnetic agents have been introduced to mimic MRI parameters of the tissues
being modelled [Cli91] [Jac93] [Ger92b]. However, phantoms have not evolved to encom-
pass all the desired features which allow a realistic segmentation validation, namely: a
high level of geometric complexity in three dimensions, multiple classes (e.g. representa-
tive of white matter, gray matter, cerebrospinal fluid, tumor, background, etc.), and more
importantly, RF coil loading similar to humans and MRI parameter distributions similar to
those of human tissue.
The reported accuracy obtained using phantoms is very high for large volumes
[Koh91] [Bra94] [Ash90], but decreases as the volume is smaller [Koh91] [Ger92b]. For a
true indication of the maximum obtainable accuracy of the segmentation methods, the
phantom volumes should be comparable to the anatomical or pathological structures of
interest. In summary, phantoms do not fully exhibit the characteristics that make segmen-
tation of human tissues so difficult. The distributions of MRI parameters for a given tissue
class are not necessarily Gaussian or unimodal, and will often overlap for different tissues.
The complex spatial distribution of the tissue regions, in turn, may cause the MR image
intensity in a given pixel to represent signal from a mix of tissues, commonly referred to
as the partial volume artefact. Although phantom images provide an excellent means for
daily quality control of the MRI scanner, they can only provide a limited degree of confi-
dence in the reliability of the segmentation methods.
We believe that using phantoms would limit the accuracy of the algorithm because of
their limited modelling capabilities. Using MR images of the brain itself is more suitable
for our purpose.
1.5.3 Validation Using MRI Simulations
Because of the increase in computer speeds, studying MR imaging via computer simu-
lations is very attractive. Several MR signal simulations and the resulting image construction
methods can be found in the literature [Bit84] [Odo85] [Sum96] [Bea92] [Hei93] [Pet93]
[Sch94]. These simulation methods have so far not been used for the evaluation of segmentation
methods, but have been used to investigate a wide variety of MR processes, including
optimization of RF pulse techniques [Bit94] [Sum96] [Pet93], the merit of spin warp
imaging in the presence of field inhomogeneities, gradient field phase-encoding with
wavelet encoding [Bea92], and noise filtering [Hei93].
In summary, simulation methods can be extended to include MRI segmentation analy-
sis. The robustness of the segmentation process may be probed by corrupting the simu-
lated signal with noise, nonlinear field gradients, or more importantly, nonuniform RF
excitation. In this fashion, one source of signal uncertainty can be introduced at a time,
and the resulting segmentation uncertainty can be related to the signal source uncertainty
in a quantifiable manner.
1.5.4 Manual Labeling of MR Images
Some validation methods have let experts manually trace the boundaries of the differ-
ent tissue regions [Van91a] [Del91b] [Zij93] [Vel94]. The major advantage of the manual
labeling technique is that it truly mimics the radiologist's interpretation, which realisti-
cally is the only "valid truth" available for in vivo imaging. However, there is considerable
variation among operators [Ger92b] [Eil90], limiting "ground truth" determinations. Further-
more, manual labeling is labor intensive and currently cannot be feasibly performed for
large numbers of image data sets. Improvements in the area of manual labeling may be
found by interfacing locally operating segmentation techniques with manual improve-
ments. As manual labeling allows an evaluation most closely related to the radiologists'
opinion, these improvements deserve further investigation.
1.5.5 Brain Development During Childhood
Normal brain development during childhood is a complex and dynamic process for
which detailed scientific information is lacking. Several studies have investigated
volumetric analysis of the brain during childhood [Van91b] [Puj93] [Gie99] [Rei96]
[Ben94a] [Bro87]. Prominent, age-related changes in gray matter, white matter and CSF
volumes are evident during childhood and appear to reflect ongoing maturation and
remodelling of the central nervous system. There is little change in cerebral volume after
the age of 5 years in either male or female subjects [Rei96] [Gie99]. After removing the
effect of total cerebral volume, age was found to predict a significant proportion of the
variance in cortical gray matter, and cerebral white matter volume, such that increasing
age was associated with decreasing gray matter volume and increasing white matter vol-
ume in children. The change in CSF with increasing age was found to be very small, as
shown in Figure 1-1 [Rei96], so that brain volume is not a determining factor in the
volumes of gray and white matter. The change in gray matter is about -1% per year
for boys and -0.92% per year for girls. The change in white matter is about +0.093% for
boys and +0.072% for girls [Rei96]. We will use this approach to quantify our clustering
method, since detecting a 1% change is a good indicator for evaluating a segmentation method,
and we expect to find a similar percentage using the proposed clustering algorithm. Although
this method will not verify the segmentation of individual structures, it is still a good
validation method, because the change is very small and difficult to detect.
[Figure omitted: scatter plot of total cerebral volume (ml) versus age (years).]
Figure 1-1. Total cerebral volume in children aged 5 to 17 years (reprinted with the permission of Oxford University Press).
1.6 Motivation
Segmentation of MR images is considered to be difficult because of the non-rigid and
complex shapes of anatomical structures. The high variability among patients, and
even within the same scan, makes it difficult for model-based approaches to segment MR
images. We believe data-driven approaches are more appropriate for MR images because of
the complexity of anatomical structures.
Many clustering algorithms are proposed to solve segmentation problems in MR
images, which are reviewed in Chapter 2. A common problem with these segmentation
algorithms is that they depend on the Euclidean distance measure to separate the clusters.
Deformable models are excluded from this reasoning, but they must be placed near
the boundary, and we will consider only data-driven methods because of the complexity of
brain images. The Euclidean distance has limited capacity to separate nonlinearly
separable clusters. To be able to distinguish such clusters, more information
about the structure of the data must be obtained. On the other hand, because of the missing
label assignments, clustering is a more computationally intensive operation than classification.
More information about the structure should therefore be collected without introducing
complicated calculations.
The motivation of this dissertation is to develop a new cost function for clustering
which can be used for nonlinearly separable clusters. Such a method should be computationally
feasible and simple to calculate. The proposed method does not require any numer-
ical integration methods, and uses information theory to collect more information about
the data. The stepwise nature of the cost function required us to develop an improved ver-
sion of the K-change algorithm [Dud73].
1.7 Outline
In Chapter 2, the basic clustering algorithms are reviewed. There are many variations
to these algorithms but the basic principles stay the same. Many of them cannot be used on
nonlinearly separable clusters, and the ones that can, like the valley seeking algorithm
[Fug90], suffer from generating more clusters than actually exist when the distributions
are not unimodal.
Chapter 3 covers the basics of information theory and entropy based distance mea-
sures. Many of these calculations require numerical methods which increase the already
high computational cost of clustering algorithms. Therefore, we propose a different dis-
tance measure, which does not require a numerical integration and is simple to calculate.
Chapter 4 covers the new clustering evaluation function proposed and gives some
initial results demonstrating the power of the cost function, which is capable of separating
nonlinearly combined clusters.
Chapter 5 focuses on the optimization algorithm that minimizes the cost function
developed in Chapter 4. We propose an improvement to the K-change algorithm by intro-
ducing a changing group size scheme.
In Chapter 6, applications of MR image segmentation are tested and discussed, and
Chapter 7 includes the conclusion and the discussion for future research.
CHAPTER 2
UNSUPERVISED LEARNING AND CLUSTERING
2.1 Classical Methods
2.1.1 Discussion
There are many important applications of pattern recognition, which include a wide
range of information processing problems of great practical significance, from speech rec-
ognition and the classification of handwritten characters, to fault detection and medical
diagnosis. The discussion here presents the basic elements of clustering; since there are
many variations on these ideas in the literature, we concentrate on the basic algorithms.
Clustering [Har85] is an unsupervised way of data grouping using a given measure of
similarity. Clustering algorithms attempt to organize unlabeled feature vectors into clus-
ters or "natural groups" such that samples within a cluster are more similar to each other
than to samples belonging to different clusters. Since there is no information given about
the underlying data structure or the number of clusters, there is no single solution to
clustering, nor is there a single similarity measure that differentiates all clusters. For this
reason there is no theory that describes clustering uniquely.
Pattern classification can be divided into two areas depending on the external knowl-
edge about the input data. If we know the labels of our input data, the pattern recognition
problem is considered supervised. Otherwise the problem is called unsupervised. Here we
will only cover statistical pattern recognition. There are several ways of handling the prob-
lem of pattern recognition if the labels are given a priori. Since we know the labels, the
problem reduces to finding features of the data set with the known labels, and to build a
classifier using these features. Bayes' rule shows how to calculate the posterior probability
from the a priori probabilities. Assume that we know the a priori probabilities P(c_i)
and the conditional densities p(x|c_i). When we measure x, we can calculate the posterior
probability P(c_i|x) as shown in (2.1).
P(c_i|x) = p(x|c_i) P(c_i) / p(x)    (2.1)

where

p(x) = Σ_{i=1}^{N} p(x|c_i) P(c_i)    (2.2)
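As a concrete illustration of (2.1) and (2.2), the following minimal Python sketch computes the posterior probabilities for a scalar measurement; the two-class setup, the Gaussian class-conditional densities, and the use of NumPy/SciPy are assumptions made only for this example.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class example with known priors P(c_i) and
# Gaussian class-conditional densities p(x|c_i).
priors = np.array([0.4, 0.6])
means, stds = np.array([2.0, 5.0]), np.array([1.0, 1.5])

def posterior(x):
    likelihoods = norm.pdf(x, loc=means, scale=stds) * priors  # p(x|c_i) P(c_i)
    return likelihoods / likelihoods.sum()                     # divide by p(x), eq. (2.2)

print(posterior(3.0))   # P(c_i|x) for each class; the entries sum to 1
```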
In the case of unsupervised classification, or clustering, we do not have the labels,
which makes the problem harder. The clustering problem is not well defined unless the result-
ing clusters are required to have certain properties. The fundamental problem in clustering
is how to choose these properties. Once we have a suitable definition of a cluster, it is pos-
sible to evaluate the validity of the resulting clustering using standard statistical validation
procedures.
There are two basic approaches to clustering, which we call parametric and nonpara-
metric approaches. If the purpose of unsupervised learning is data description then we can
assume a predefined distribution function for the data set, and calculate the sufficient sta-
tistics which will describe the data set in a compact way. For example, if we assume that
the data set comes from a normal distribution N(M, Σ), which is defined as

N_X(M, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp( -(1/2) (X - M)^T Σ^{-1} (X - M) )    (2.3)

the sufficient statistics are the sample mean M = E{X} and the sample covariance
matrix Σ = E{(X - M)(X - M)^T}, which describe the distribution perfectly.
the data set is not distributed according to our choice, then the statistics can be very mis-
leading. Another approach uses a mixture of distributions to describe the data [Mcl88]
[Mcl96] [Dem77]. We can approximate virtually any density function in this way, but esti-
mating the parameters of a mixture is not a trivial operation. And the question of how to
separate the data set into different clusters is still unanswered, since estimating the distri-
bution does not tell us how to divide the data set into clusters. If we are using the first
approach, namely fitting one distribution function to each cluster, then the clustering can
be done by trying to estimate the parameters of the distributions. If we are using the mix-
ture of distributions approach, then clustering is very loosely defined. Assume that we
have more mixtures than the number of clusters. The model does not tell us how to com-
bine the mixtures to obtain the desired clustering.
Another approach to clustering is to group the data set into groups of points which
possess strong internal similarities [Dud73] [Fug70]. To measure the similarities we use a
criterion function and seek the grouping that finds the extreme point of the criterion func-
tion. For this kind of algorithm we need a cost function to evaluate how well the clustering
fits the data, and an algorithm to minimize the cost function. For a given clustering
problem, the input data X is fixed. The clustering algorithm varies only by the sample
assignment C, which means that the minimization algorithm will change only C. Because
of the discrete and unordered nature of C, classical steepest descent search algorithms can
not be applied easily.
2.1.2 Clustering Criterion
We define the clustering problem as follows. Assume that we have N samples,
x_1, ..., x_N. At this point we treat the samples as fixed rather than random variables,
since once the samples are given to the clustering algorithm they are no longer random.
The problem is to place each sample into one of L clusters, ω_1, ..., ω_L, where L is
assumed to be given. The cluster to which the ith sample is assigned is denoted by
ω_{k(i)}, where k(i) is an integer between 1 and L, and i = 1, ..., N.

A clustering C is a vector made up of the ω_{k(i)}, and X is a vector made up of the x_i, that is,

C = [ω_{k(1)}, ..., ω_{k(N)}]^T    (2.4)

and

X = [x_1, ..., x_N]^T    (2.5)

The clustering criterion J is a function of C and X and can be written as

J(C, X) = J(ω_{k(1)}, ..., ω_{k(N)}; x_1, ..., x_N)    (2.6)

The best clustering C_0 should satisfy

J(C_0, X) = min_C J(C, X)  or  max_C J(C, X)    (2.7)
depending on the criterion. Only minimization will be considered, since maximization can
always be converted to minimization.
2.1.3 Similarity Measures
In order to apply the clustering algorithm we have to define how we will measure sim-
ilarity between samples. The most obvious measure of the similarity between two samples
is the distance between them. The L_p norm is the generalized distance measure, where
p = 2 corresponds to the Euclidean distance [Ben66] [Gre74]. The L_p norm between
two vectors of size N is given as

L_p(x_1, x_2) = ( Σ_{i=1}^{N} |x_1(i) - x_2(i)|^p )^{1/p}    (2.8)
If this distance is a good measure of similarity, then we would expect the distance between
samples in the same cluster to be significantly less than the distance between samples in
different clusters.
Another way to measure the similarity between two vectors is the normalized inner
product which is given as
S(x_1, x_2) = x_1^T x_2 / (||x_1|| ||x_2||)    (2.9)

This measure is basically the cosine of the angle between the two vectors.
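The two similarity measures can be written compactly as follows; this is a minimal illustrative sketch, assuming NumPy and arbitrary test vectors that are not part of the text above.

```python
import numpy as np

def lp_distance(x1, x2, p=2):
    """L_p norm of (2.8); p = 2 gives the Euclidean distance."""
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

def cosine_similarity(x1, x2):
    """Normalized inner product of (2.9): the cosine of the angle."""
    return x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))

a, b = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])
print(lp_distance(a, b), lp_distance(a, b, p=1), cosine_similarity(a, b))
```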
2.2 Criterion Functions
2.2.1 The Sum-of-Squared-Error Criterion
One of the simplest and most widely used error criteria is the sum-of-squared-error
criterion. Let n_i be the number of samples in X_i and let m_i be the mean of those samples,

m_i = (1/n_i) Σ_{x ∈ X_i} x    (2.10)

Then the criterion can be defined as

J = Σ_{i=1}^{L} Σ_{x ∈ X_i} ||x - m_i||^2    (2.11)
which means that the mean vector m_i is the best representation of the samples in X_i, and
the error incurred is measured by the squared error vectors x - m_i. The error function
J measures the total squared error when the N samples are represented by the L cluster centers
m_1, ..., m_L. The value of J depends on how the samples are distributed among the cluster
centers. This kind of clustering is often called a minimum-variance partition [Dud73]. It
works well when the clusters are compact regions that are well separated
from each other, but it gives unexpected results when the distance between clusters is
comparable to the size of the clusters. An equivalent expression can be obtained by eliminating
the mean vectors from the expression as in
J = Σ_{i=1}^{L} n_i s_i    (2.12)

where

s_i = (1/(2 n_i^2)) Σ_{x_1 ∈ X_i} Σ_{x_2 ∈ X_i} ||x_1 - x_2||^2    (2.13)
The above expression shows that the sum-of-squared-error criterion uses the Euclidean
distance to measure similarity [Dud73]. We can derive different criterion functions by
redefining s_i in terms of other similarity functions S(x_1, x_2).
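For illustration, the criterion of (2.10)-(2.11) can be evaluated for a given partition as in the following sketch; the toy data, the labels, and the function name are hypothetical and serve only to show how the quantity behaves for good and bad partitions.

```python
import numpy as np

def sse_criterion(X, labels):
    """Sum-of-squared-error J of (2.11): squared distance of every
    sample to the mean (2.10) of its assigned cluster."""
    J = 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]
        J += np.sum((cluster - cluster.mean(axis=0)) ** 2)
    return J

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
print(sse_criterion(X, np.array([0, 0, 1, 1])))   # compact clusters -> small J
print(sse_criterion(X, np.array([0, 1, 0, 1])))   # mixed clusters   -> large J
```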
2.2.2 The Scatter Matrices
In discriminant analysis, within-class, between-class and mixture scatter matrices are
used to measure and formulate class separability. These matrices can be combined in dif-
ferent ways to be used as a criterion function. Let's make the following definitions:
Mean vector for the ith cluster:

m_i = (1/n_i) Σ_{x ∈ X_i} x    (2.14)

Total mean vector:

m = (1/N) Σ_{x ∈ X} x = (1/N) Σ_{i=1}^{L} n_i m_i    (2.15)

Scatter matrix for the ith cluster:

S_i = Σ_{x ∈ X_i} (x - m_i)(x - m_i)^T    (2.16)

Within-cluster scatter matrix:

S_W = Σ_{i=1}^{L} S_i    (2.17)

Between-cluster scatter matrix:

S_B = Σ_{i=1}^{L} n_i (m_i - m)(m_i - m)^T    (2.18)

Total scatter matrix:

S_T = S_W + S_B    (2.19)

The following criterion functions can be defined [Fug90] [Dud73]:

J_1 = tr(S_2^{-1} S_1)    (2.20)

J_2 = ln |S_2^{-1} S_1|    (2.21)

J_3 = tr(S_1) - μ (tr(S_2) - c)    (2.22)

J_4 = tr(S_1) / tr(S_2)    (2.23)
where S_1 and S_2 are each one of S_W, S_B, or S_T. Some combinations are invariant under
any nonsingular linear transformation and some are not. These functions are not universally
applicable, and this is a major flaw in these criterion functions: once the function is
determined, the clustering behavior it provides is fixed.
We are assuming we can reach the global extreme point, which may not always be the
case. If the function does not provide good results with a certain data set, no parameter is
readily available to change the behavior of the clustering output. The only parameter we
can change is the function itself, which may be difficult to set. Another limitation of the
criterion functions mentioned above is the fact that they are basically second order statis-
tics. In the coming chapters we will provide a new clustering function using information
theoretic measures and a practical way to calculate and optimize the resulting function,
which gives us a way to control the clustering behavior of the criterion. There are other
measures using entropy and information theory to measure cluster separability, which will
be covered in the next chapter. These methods will be the basis for our clustering function.
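A minimal sketch of how the scatter matrices of (2.14)-(2.20) can be computed is given below; the synthetic two-cluster data and the choice S_1 = S_B, S_2 = S_W in the last line are assumptions made only for this illustration.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-cluster S_W (2.17) and between-cluster S_B (2.18) scatter matrices."""
    m = X.mean(axis=0)                          # total mean, (2.15)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xi = X[labels == c]
        mi = Xi.mean(axis=0)                    # cluster mean, (2.14)
        Sw += (Xi - mi).T @ (Xi - mi)           # adds S_i of (2.16)
        Sb += len(Xi) * np.outer(mi - m, mi - m)
    return Sw, Sb

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
labels = np.repeat([0, 1], 50)
Sw, Sb = scatter_matrices(X, labels)
print(np.trace(np.linalg.inv(Sw) @ Sb))         # J_1 of (2.20) with S_1 = S_B, S_2 = S_W
```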
2.3 Clustering Algorithms
2.3.1 Iterative Optimization
The input data are finite, therefore there are only a finite number of possible partitions.
In theory, the clustering criterion can always be solved by exhaustive enumeration. How-
ever, in practice such an approach is not feasible, since the number of partitions to examine grows
exponentially with the number of clusters and the sample size; the number of different
partitions of N samples into L clusters is given approximately by L^N / L!.
The basic idea in iterative-optimization is to find an initial partition and to move sam-
ples from one group to another if such a move will improve the value of the criterion func-
tion. In general this procedure guarantees only a local optimum, and different initial points
will give different results. The simplicity of the method usually outweighs this limitation
in most problems. In the following chapters we will improve this optimization method to
obtain better results and use it in our algorithm.
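The following sketch illustrates the iterative-optimization idea: one sample is moved at a time to whichever cluster most improves the criterion, and the procedure stops when no move helps. The criterion used here (the sum-of-squared-error of Section 2.2.1), the random initialization, and the synthetic data are assumptions of this illustration.

```python
import numpy as np

def sse(X, labels):
    # sum-of-squared-error criterion J of (2.11), used here as the cost
    return sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
               for c in np.unique(labels))

def iterative_improvement(X, labels, criterion=sse, max_passes=50):
    """Greedy iterative optimization: only a local optimum is guaranteed,
    and the result depends on the initial partition."""
    clusters = np.unique(labels)
    for _ in range(max_passes):
        changed = False
        for i in range(len(X)):
            best_c, best_J = labels[i], criterion(X, labels)
            for c in clusters:
                trial = labels.copy()
                trial[i] = c
                J = criterion(X, trial)
                if J < best_J:
                    best_c, best_J = c, J
            if best_c != labels[i]:
                labels[i], changed = best_c, True
        if not changed:
            break
    return labels

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])
print(iterative_improvement(X, np.random.randint(0, 2, len(X))))
```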
2.3.2 Merging and Splitting
After a number of clusters are obtained, it is possible to merge certain clusters or split
certain clusters. Merging may be required if two clusters are very similar. Of course we
should define a similarity measure for this operation as we did for clustering. Several mea-
sures given in (2.20), (2.21), (2.22) and (2.23), can be used for this purpose [Dud73].
Merging is sometimes desired when the cluster size is very small. The criterion for appro-
priate splitting is far more difficult to define [Dud73]. Multimodal and nonsymmetric dis-
tributions as well as distributions with large variances along one direction can be split.
We can start by partitioning the input data of size N into N clusters containing one sample
each. The next step is to partition the current clustering into N-1 clusters. We can continue
doing this until we reach the desired clustering. At any level some samples from different
clusters may be combined together to form a single cluster. The merging and splitting are
heuristic operations with no guarantee of reaching the desired clustering, but still useful
[Dud73].
2.3.3 Neighborhood Dependent methods
Once we choose a measure to describe similarity between clusters, we can use the following
algorithm [Dud73]. After the initial clustering, each sample is reassigned to the
nearest cluster center and the cluster means are recalculated, repeating until there is
no change in the clustering. This algorithm is called the nearest mean reclassification
algorithm [Dud73].
2.3.4 Hierarchical Clustering
Let's consider a sequence of partitions of the N samples into C clusters. First partition
into N clusters, where each cluster contains exactly one sample. The next iteration is a par-
tition into N-1 clusters, until all samples form one cluster. If the sequence has the property
that whenever two samples are in the same cluster at some level, they remain together at
all higher levels, then the sequence is called a hierarchical clustering. It should be noted
that the clustering can be done in reverse order, that is, first all samples form a single clus-
ter, and at each iteration more clusters are generated.
In order to combine or divide the clusters, we need a way to measure the similarity in
the clusters and dissimilarity between clusters. Commonly used distance measures
[Dud73] are as follows:
D_min(C_1, C_2) = min_{x_1 ∈ C_1, x_2 ∈ C_2} ||x_1 - x_2||    (2.24)

D_max(C_1, C_2) = max_{x_1 ∈ C_1, x_2 ∈ C_2} ||x_1 - x_2||    (2.25)

D_avg(C_1, C_2) = (1/(n_1 n_2)) Σ_{x_1 ∈ C_1} Σ_{x_2 ∈ C_2} ||x_1 - x_2||    (2.26)

D_mean(C_1, C_2) = ||m_1 - m_2||    (2.27)
All of these measures have a minimum-variance flavor, and they usually give the same
results if the clusters are compact and well separated. However, if the clusters are close to
each other, and/or the shapes are not basically hyperspherical, very different results may
be obtained. Although never tested, it is possible to use our clustering evaluation function
to measure the distance between clusters in hierarchical clustering to improve the results.
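The four inter-cluster distances of (2.24)-(2.27) are easy to compute from the pairwise distance matrix, as in the sketch below; NumPy/SciPy and the random test clusters are assumptions of this illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_distances(C1, C2):
    """Inter-cluster distances of (2.24)-(2.27)."""
    D = cdist(C1, C2)                        # all pairwise Euclidean distances
    return {"D_min":  D.min(),               # single linkage, (2.24)
            "D_max":  D.max(),               # complete linkage, (2.25)
            "D_avg":  D.mean(),              # average linkage, (2.26)
            "D_mean": np.linalg.norm(C1.mean(axis=0) - C2.mean(axis=0))}  # (2.27)

C1 = np.random.randn(30, 2)
C2 = np.random.randn(40, 2) + 3
print(cluster_distances(C1, C2))
```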
2.3.5 Nonparametric Clustering
When a mixture density function has peaks and valleys, it is most natural to divide the
samples into clusters according to the valley. The valley may not have a parametric struc-
ture, which creates difficulties with parametric assumptions. One way to discover the val-
ley is to use estimation of local density gradients at each sample point and move the
samples in the direction of the gradient. By repeating this we will move the samples away
from the valley, and the samples form compact clusters. We call this procedure the valley-
seeking procedure [Fuk90].
Figure 2-1. Valley seeking algorithm.

The local gradient can be estimated by the local mean vector around the sample, as shown
in Figure 2-1. The direction of the local gradient is given by

∇p(x) / p(x) ∝ M(x)    (2.28)

where M(x) is the local mean vector computed in a small region around x.
The local gradient near the decision surface will be proportional to the difference of the
means, that is, M_1(x) - M_2(x), where M_1(x) is the local mean of one cluster, and
M_2(x) is the local mean of the other cluster inside the same local region.
The method seems very promising but it may result in too many clusters, if there is a
slight nonuniformity in one of the clusters. The performance and the number of clusters
depends on the local region used to calculate the gradient vector. If the local region is too
small, there will be many clusters; on the other hand, if the region is too large, then all
the points form one cluster. So the size of the local region is directly related to the number of
clusters but only loosely related to the quality of the clustering. The advantage of this method is that
the number of clusters does not need to be specified in advance, but in some cases this may
be a disadvantage. Assume we want to improve the clustering without increasing the num-
ber of clusters. A change in local region will change the number of clusters and will not
help to improve the clustering. Of course small changes will change the clustering without
increasing the cluster number, but determining the range of the parameter may be a prob-
lem.
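The valley-seeking idea can be sketched as follows: each sample is repeatedly moved to the local mean of its neighbourhood, which approximates a step in the direction of the density gradient of (2.28). The fixed-radius neighbourhood, the synthetic data, and the function names are assumptions of this illustration; they also make the dependence on the size of the local region easy to observe.

```python
import numpy as np

def valley_seeking_step(X, radius):
    """Shift every sample toward the local mean M(x) of its neighbourhood,
    i.e. in the estimated direction of the local density gradient."""
    shifted = X.copy()
    for i, x in enumerate(X):
        neighbours = X[np.linalg.norm(X - x, axis=1) <= radius]
        shifted[i] = neighbours.mean(axis=0)
    return shifted

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])
for _ in range(10):
    X = valley_seeking_step(X, radius=2.0)  # too small -> many clusters, too large -> one
# samples now concentrate near the density modes; clusters can be read off by proximity
```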
2.4 Mixture Models
The mixture model is a semi-parametric way of estimating the underlying density
function [Dud73] [Par62] [Chr81]. In the non-parametric kernel-based approach to density
estimation, the density function was represented as a linear superposition of kernel func-
tions, with one kernel centered on each data point. In the mixture model the density func-
tion is again formed by a linear combination of basis functions, but the number of basis
functions is treated as a parameter of the model and it is much less than the number N of
data points. We write the density estimator as a linear combination of component densities
p(x|i) in the form

p(x) = Σ_{i=1}^{L} p(x|i) P(i)    (2.29)

Such a representation is called a mixture distribution and the coefficients P(i) are called
the mixing parameters.
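For illustration, the mixture density of (2.29) with Gaussian component densities can be evaluated as below; the component parameters shown are arbitrary and serve only the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, weights, means, covs):
    """p(x) = sum_i p(x|i) P(i) of (2.29), with Gaussian components."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

weights = [0.3, 0.7]                                  # mixing parameters P(i)
means = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), 2 * np.eye(2)]
print(mixture_density(np.array([1.0, 1.0]), weights, means, covs))
```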
2.4.1 Maximum Likelihood Estimation
The maximum likelihood estimate may be obtained by maximizing the likelihood
F = Π_{j=1}^{N} p(x_j) with respect to P_i, M_i and Σ_i under the constraint

Σ_{i=1}^{L} P_i = 1    (2.30)

The negative log-likelihood is given by

E = -ln(F) = -Σ_{i=1}^{N} ln(p(x_i))    (2.31)

which can be regarded as an error function [Fug90]. Maximizing the likelihood F is then
equivalent to minimizing E. One way to solve the maximum likelihood problem is the EM
algorithm, which is explained next.
2.4.2 EM Algorithm
Usually no theoretical solution exists for the likelihood equations, and it is necessary
to use numerical methods. Direct maximization of the likelihood function using Newton-
Raphson or gradient methods is possible but it may need analytical work to obtain the gra-
dient and possibly the Hessian. The EM algorithm [Dem77] is a general method for com-
puting maximum-likelihood (ML) estimates for "incomplete data" problems. In each
iteration of the EM algorithm there are two steps, called the expectation step or the E-step
and the maximization step or M-step, thus the name EM algorithm, given by [Dem77] in
their fundamental paper. The EM algorithm can be applied in situations described as
incomplete-data problems, where ML estimation is made difficult by the absence of some
part of the data in a more familiar and simpler data structure. The parameters are estimated
after filling in initial values for the missing data. The latter are then updated by their pre-
dicted values using these parameter estimates. The parameters are then reestimated itera-
tively until convergence.
The term "incomplete data" implies in general to the existence of two sample spaces X
and Y and a many-to-one mapping H from Xto Y, where x e X and y e Y are elements
of the sample spaces and y = H(x) The corresponding x in Xis not observed directly,
but only indirectly through y. Let f (x 6) be the parametric distribution of x, where 0 is
a vector of parameters taking values in 0. The distribution of y, denoted by g(x 16), is
also parametrized by 0, since the complete-data specification f(... ...) is related to the
incomplete-data specification g(... ...) by
g(y|0) = f f(xlO)dx (2.32)
H(x) = y
The EM algorithm tries to find a value of θ which maximizes g(y|θ) given an
observed y, and it uses the associated family f(x|θ). It should be noted that there are
many possible complete-data specifications f(x|θ) that will generate g(y|θ).

The maximum-likelihood estimator θ̂ maximizes the log-likelihood function

L(θ) = ln(g(y|θ))    (2.33)

over θ,

θ̂ = argmax_θ L(θ)    (2.34)
The main idea behind the EM algorithm is that there are cases in which the estimation
of θ would be easy if the complete data x were available, and is only difficult for the
incomplete data y. In other words, the maximization of ln(g(y|θ)) is complicated,
whereas the maximization of ln(f(x|θ)) is easy. Since only the incomplete data y are
available in practice, it is not possible to directly perform the optimization of the complete-data
likelihood ln(f(x|θ)). Instead it is easier to estimate ln(f(x|θ)) from y
and use this estimate to find θ̂. Since estimating the complete-data likelihood requires θ,
we need an iterative approach: first, using an estimate of θ, the complete-data likelihood
function is estimated; then this likelihood function is maximized over θ, and so
on, until satisfactory convergence is obtained. Given the current value θ' of the parameters
and y, we can estimate ln(f(x|θ)) using

P(θ, θ') = E[ln(p(x|θ)) | y, θ']    (2.35)
The EM algorithm can be expressed as

E-step:

P(θ, θ^p) = E[ln(p(x|θ)) | y, θ^p]    (2.36)

M-step:

θ^{p+1} = argmax_θ P(θ, θ^p)    (2.37)

where θ^p is the value of θ at the pth iteration. For the problem of density estimation using a
mixture model we do not have the corresponding cluster labels. The missing labels can be
considered as the incomplete data and estimated together with the other parameters of the mixture
using the EM algorithm described above. Unfortunately, estimating the mixture model will
not answer the question of how to cluster the data set using the mixture model.
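A minimal sketch of EM for a Gaussian mixture is given below: the E-step of (2.36) fills in the missing cluster labels as posterior responsibilities, and the M-step of (2.37) re-estimates the mixture parameters from them. The initialization, the fixed number of iterations, and the function name are assumptions of this illustration, and, as noted above, the fitted mixture by itself does not decide how to cluster the data.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, L, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    P = np.full(L, 1.0 / L)                        # mixing parameters P(i)
    M = X[rng.choice(N, L, replace=False)]         # initial means
    S = np.array([np.cov(X.T) for _ in range(L)])  # initial covariances
    for _ in range(n_iter):
        # E-step: responsibilities r[n, i] ~ P(i) p(x_n | i), normalized over i
        r = np.column_stack([P[i] * multivariate_normal.pdf(X, M[i], S[i])
                             for i in range(L)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate P, M, S from the expected (soft) labels
        Nk = r.sum(axis=0)
        P = Nk / N
        M = (r.T @ X) / Nk[:, None]
        for i in range(L):
            diff = X - M[i]
            S[i] = (r[:, i, None] * diff).T @ diff / Nk[i]
    return P, M, S
```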
2.5 Competitive Networks
Unsupervised learning can be accomplished by appropriately designed neural net-
works. The original unsupervised learning rule was proposed by Hebb [Heb49] and is
inspired by biological synaptic changes. In this rule, changes in weight depend
on the correlation of the pre- and post-synaptic signals x and y, respectively, which may be
formulated as

w_{k+1} = w_k + μ y_k x_k    (2.38)

where μ > 0, x_k is the unit's input signal, and y_k = x_k^T w_k is the unit's output.
Analysis shows that the rule is unstable in this form and drives the weights to infinity
in magnitude. One way to prevent divergence is to normalize the weight vector after each
iteration.
Another update rule which prevents the divergence, proposed by Oja [Oja82]
[Oja85], adds a weight decay proportional to y^2 and results in the following update rule:

w_{k+1} = w_k + μ (x_k - y_k w_k) y_k    (2.39)
There are many variations to this update rule, and the update rule is used very frequently
in principal component analysis (PCA) [Ama77] [Lin88] [San89] [Oja83] [Rub89]
[Pri90]. There are linear and nonlinear versions [Oja91] [Tay93] [Kar94] [Xu94] [Pri90].
Competitive networks [Gro76a] [Gro76b] [Lip89] [Rum85] [Mal73] [Gro69] [Pri90]
can be used in clustering procedures [Dud73] [Har75]. Since in clustering there is no signal
to indicate the cluster labels, competitive networks use a competition procedure to
find the output node to be updated according to a particular weight update rule. The unit
with the largest activation is usually chosen as the winner, whose weight vector is updated
according to the rule

Δw_i = μ (x_k - w_i)    (2.40)

where w_i is the weight vector of the winning node. The weight vectors of the other nodes are
not updated. The net effect of the rule is to move the weight vector of each node towards
the center-of-mass of the nearest dense cluster of data points. This means that the number
of output nodes determines the number of clusters.
One application of competitive learning is adaptive vector quantization [Ger82]
[Ger92a]. Vector quantization is a technique where the input space is divided into a num-
ber of distinct regions, and for each region a "template" or reconstruction vector is
defined. When presented with a new input vector x, a vector quantizer first determines the
region in which the vector lies. Then the quantizer outputs an encoded version of the
reconstruction vector wi representing that particular region containing x. The set of all
possible reconstruction vectors is usually called the codebook of the quantizer [Lin80]
[Gra84]. When the Euclidean distance is used to measure the similarity of x to the regions,
the quantizer is called a Voronoi quantizer [Gra84].
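A minimal sketch of such a competitive network is given below: the closest weight vector wins and is moved toward the input by the rule of (2.40), so the weight vectors end up acting as a vector-quantization codebook. The initialization from random samples, the fixed learning rate, and the synthetic data are assumptions of this illustration.

```python
import numpy as np

def competitive_learning(X, n_nodes, mu=0.05, n_epochs=20, seed=0):
    """Winner-take-all competitive learning; only the winning unit's
    weight vector is updated, by delta_w = mu * (x - w)."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), n_nodes, replace=False)].copy()
    for _ in range(n_epochs):
        for x in rng.permutation(X):
            winner = np.argmin(np.linalg.norm(W - x, axis=1))
            W[winner] += mu * (x - W[winner])
    return W                    # codebook: one reconstruction vector per node

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 6])
print(competitive_learning(X, n_nodes=2))
```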
2.6 ART Networks
Adaptive resonance architectures are artificial neural networks that are capable of sta-
ble categorization of an arbitrary sequence of unlabeled input patterns in real time. These
architectures are capable of continuous training with nonstationary inputs. They also solve
the stability-plasticity dilemma. In other words, they let the network adapt yet prevent cur-
rent inputs from destroying past training. The basic principles of the underlying theory of
these networks, known as adaptive resonance theory (ART), were introduced by Gross-
berg in 1976 [Gro76a] [Gro76b].
A class of ART architectures, called ART1 [Car87a] [Car88], is characterized by a
system of ordinary differential equations, with associated theorems. A number of interpre-
tations and simplifications of the ART1 net have been reported in the literature [Lip87]
[Pao89] [Moo89].
The basic architecture of the ART1 net consists of a layer of linear units representing
prototype vectors whose outputs are acted on by a winner-take-all network. This architec-
ture is identical to the simple competitive network with one major difference. The linear
prototype units are allocated dynamically, as needed, in response to novel input vectors.
Once a prototype unit is allocated, appropriate lateral-inhibitory and self-excitatory con-
nections are introduced so that the allocated unit may compete with preexisting prototype
units. Alternatively, one may assume a prewired network with a large number of inactive
(zero weights) units. A unit becomes active if the training algorithm decides to assign it as
a cluster prototype unit, and its weights are adapted accordingly.
The general idea behind ART1 training is as follows. Every training iteration consists
of taking a training example x_k and examining existing prototypes (weight vectors w_i)
that are sufficiently similar to x_k. If a prototype w_i is found to match x_k (according to a
similarity test based on a preset matching threshold), sample x_k is added to the cluster
represented by w_i, and w_i is modified to make it better match x_k. If no prototype
matches x_k, then x_k becomes the prototype for a new cluster.
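The following is a minimal sketch of this training idea only (a greedy matching threshold with dynamic prototype allocation), not of the full differential-equation ART1 model; the function name, the AND-based match ratio and the vigilance value are illustrative assumptions.

import numpy as np

def art1_like_clustering(patterns, vigilance=0.7):
    # Assign each binary pattern to the first sufficiently similar prototype
    # (match ratio >= vigilance); otherwise allocate a new prototype unit.
    prototypes, labels = [], []
    for x in patterns:
        assigned = False
        for k, w in enumerate(prototypes):
            match = np.sum(np.logical_and(x, w)) / max(np.sum(x), 1)
            if match >= vigilance:                      # similarity test against the preset threshold
                prototypes[k] = np.logical_and(w, x)    # refine the prototype toward the sample
                labels.append(k)
                assigned = True
                break
        if not assigned:                                # novel input: allocate a new prototype
            prototypes.append(x.astype(bool))
            labels.append(len(prototypes) - 1)
    return labels, prototypes

patterns = np.array([[1,1,0,0], [1,1,1,0], [0,0,1,1], [0,1,1,1]], dtype=bool)
print(art1_like_clustering(patterns, vigilance=0.6)[0])   # -> [0, 0, 1, 1]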
The family of ART networks also includes more complex models such as ART2
[Car87b] and ART3 [Car90]. These ART models are capable of clustering binary and ana-
log input patterns. A simplified model of ART2, ART2-A [Car9la], has been proposed
that is two to three orders of magnitude faster than ART2. Also, a supervised real-time
learning ART model called ARTMAP has been proposed [Car91b].
2.7 Conclusion
In this chapter we summarized basic algorithms used in clustering. There are many
variations to these algorithms, but the basic principle stays the same. A fundamental problem
is apparent immediately: most of these methods use the Euclidean distance
as a measure of cluster separability, while others rely on the mean and variance to differentiate
the clusters. When there are nonlinear structures in the data, Euclidean
distances and differences in the mean and variance are inadequate measures
of cluster separability. The valley seeking algorithm tries to solve the problem by
moving the samples along the gradients, and the algorithm will behave as a classifier if the
clusters are well separated and unimodal. But when the clusters are multi-modal and over-
lapping, then the valley seeking algorithm may create more cluster centers than there are
clusters in the data. The question of how to combine these cluster centers is not answered,
even when we know the exact number of clusters. Defining the number of clusters before-
hand can be an advantage depending on the problem. For example in MRI segmentation it
is to our advantage to fix the number of clusters, since we know this number a priori. Con-
sider an MRI brain image where the basic structures are CSF, white matter and gray mat-
ter. Failure to fix the number of clusters in this problem beforehand will raise the question
of how to combine the excess cluster centers later.
When we consider the fact that the tissue boundaries in an MRI brain image are not
sharply defined, it is obvious that the Euclidean distance measures and mean and variance
differences are not enough to differentiate the clusters in a brain MRI. The variability of
brain structures among persons and within the same scan makes it difficult to use model
based approaches, since a model that fits a particular part of the brain may not fit
the rest. This encourages us to use data-driven algorithms, where there is no pre-
defined structure imposed on the data. But the limitations of the Euclidean distance mea-
sures force us to seek other measures for cluster separability.
CHAPTER 3
ENTROPY AND INFORMATION THEORY
3.1 Introduction
Entropy [Sha48] [Sha62] [Kaz90] was introduced into information theory by Shannon
(1948). The entropy of a random variable is a measure of the average amount of informa-
tion contained. In other words, entropy measures the uncertainty of the random variable.
Consider a random variable X which can take values x_1, ..., x_N with probabilities
p(x_k), k = 1, ..., N. If we know that the event x_k occurs with probability p_k = 1,
which requires that p_i = 0 for i \neq k, there is no surprise and therefore there is no informa-
tion contained in X, since we know the outcome exactly. If we want to send the value of X
to a receiver, then the amount of information is given as I(x_k) = -\ln(p(x_k)), if the
variable takes the value x_k. Thus, the expected information needed to transmit the value
of X is given by

E(I(x_k)) = H_S(X) = -\sum_k p(x_k) \ln(p(x_k))    (3.1)
which is called the entropy of the random variable X.
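For a discrete variable, (3.1) can be computed directly from the probabilities; the small sketch below (our own illustration) confirms the two limiting cases mentioned above.

import numpy as np

def shannon_entropy(p):
    # H_S(X) = -sum_k p_k ln p_k, in nats; zero-probability outcomes contribute nothing
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(shannon_entropy([1.0, 0.0, 0.0]))          # 0.0: the outcome is certain, no information
print(shannon_entropy([0.25, 0.25, 0.25, 0.25])) # ln 4 ~ 1.386: maximal for four outcomes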
Shannon's measure of entropy was developed essentially for the discrete values case.
Moving to the continuous random variable case where summations are usually replaced
with integrals, is not so trivial, because a continuous random variable takes values from
-\infty to +\infty, which makes the information content infinite. In order to avoid this problem,
the continuous entropy calculation is considered as differential entropy, instead of absolute
entropy as in the case of discrete random variables [Pap65] [Hog65]. If we let the interval
between discrete random variables be

\Delta x_k = x_k - x_{k-1}    (3.2)

If the continuous density function is given as f(x), then p_k can be approximated by
f(x_k) \Delta x_k, so that

H = -\sum_k p(x_k) \ln(p(x_k)) = -\sum_k f(x_k) \Delta x_k \ln(f(x_k) \Delta x_k)    (3.3)

After some manipulations and replacing the summation by integrals we will obtain

H = -\int_X f(x) \ln(f(x)) dx - \ln(\Delta x_k)    (3.4)
In this equation -\ln(\Delta x_k) \to \infty as \Delta x_k \to 0, which suggests that the entropy of a con-
tinuous variable is infinite. If the equation is used in making comparisons between differ-
ent density functions, the last term cancels out. We can drop the last term and use the
equation as a measure of entropy by assuming that the measure is the differential entropy
with the reference term -\ln(\Delta x_k). If all measurements are done relative to the same ref-
erence point, dropping the last term from (3.4) is justified and we have

h(X) = -\int_X f(x) \ln(f(x)) dx    (3.5)
3.2 Maximum Entropy Principle
The Maximum Entropy Principle (MaxEnt), or the principle of maximum uncertainty,
was proposed independently by Jaynes, Ingarden and Kullback [Jay57]
[Kul59] [Kap92]. Given just some mean values, there are usually an infinity of compatible
distributions. MaxEnt encourages us to select the distribution that maximizes the Shannon
entropy measure while being consistent with the given constraints. In other words, out of
all distributions consistent with the constraints, we should choose the distribution that has
maximum uncertainty, or choose the distribution that is most random. Mathematically, this
principle states that we should maximize
-\sum_{i=1}^{N} p_i \ln p_i    (3.6)

subject to

\sum_{i=1}^{N} p_i = 1    (3.7)

\sum_{i=1}^{N} p_i g_r(x_i) = a_r,   r = 1, ..., M    (3.8)

and

p_i \geq 0,   i = 1, ..., N    (3.9)
The maximization can be done using Lagrangian multipliers.
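As a small numerical illustration of this maximization (not part of the dissertation's algorithms), the classic loaded-die example below recovers the MaxEnt distribution over the outcomes 1..6 subject to (3.7) and a single mean constraint of the form (3.8); the variable names, the target mean of 4.5 and the use of scipy's SLSQP solver are our own choices.

import numpy as np
from scipy.optimize import minimize

x = np.arange(1, 7)          # outcomes of a die
target_mean = 4.5            # the only given constraint of type (3.8), with g(x) = x

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)      # keep the logarithm finite near the boundary
    return np.sum(p * np.log(p))     # minimizing -H is maximizing H

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},              # (3.7)
    {"type": "eq", "fun": lambda p: np.sum(p * x) - target_mean},  # (3.8)
]
res = minimize(neg_entropy, np.full(6, 1 / 6), bounds=[(0, 1)] * 6,
               constraints=constraints, method="SLSQP")
print(res.x)   # an exponentially tilted distribution, the MaxEnt solution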
3.2.1 Mutual Information
Let's assume that H(X) represents the uncertainty about a system before observing
the system output, and the conditional entropy H(X| Y) represents the uncertainty about
the system after observing the system output. The difference H(X) - H(X|Y) must
represent the uncertainty about the system input after observing the system output. This quan-
tity is called the mutual information [Cov91] [Gra90] between the random variables X and
Y, which is given by

I(X;Y) = H(X) - H(X|Y)    (3.10)
Entropy is a special case of mutual information, where
H(X) = I(X;X) (3.11)
There are some important properties of the mutual information measure. These proper-
ties can be summarized as follows [Kap92].
1. The mutual information is symmetric, that is I(X;Y) = I(Y;X).
2. The mutual information is always nonnegative, that is I(X;Y) \geq 0.
The mutual information can also be regarded as the Kullback-Leibler divergence
[Kul59] between the joint pdf f_{X_1 X_2}(x_1, x_2) and the factorized marginal pdf
f_{X_1}(x_1) f_{X_2}(x_2). The Kullback-Leibler divergence is defined in (3.12) and (3.13).
The importance of mutual information is that it provides more information about the
structure of two pdf functions than second order measures. Basically it gives us informa-
tion about how different the two pdf's are, which is very important in clustering. The same
information can not be obtained by using second order measures.
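A minimal sketch of this computation for discrete variables is given below (our own illustration); it evaluates I(X;Y) directly from a joint probability table as the Kullback-Leibler divergence between the joint and the product of the marginals.

import numpy as np

def mutual_information(p_xy):
    # I(X;Y) = sum_{x,y} p(x,y) ln[ p(x,y) / (p(x) p(y)) ]
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask]))

print(mutual_information(np.outer([0.5, 0.5], [0.3, 0.7])))    # independent variables -> 0
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))  # identical variables -> ln 2 = H(X)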
3.3 Divergence Measures
In Chapter 2, the clustering problem was formulated as a distance between two distri-
butions, but all the proposed measures are limited to second order statistics (i.e. variance).
Another useful entropy measure is the minimum cross-entropy measure which gives the
separation between two distributions [Kul59]. This is also called directed divergence,
since most of the measures are not symmetrical, although they can be made symmetrical.
Assume D(p, q) is a measure for the distance between p and q distributions. If
D(p, q) is not symmetric, then it can be made symmetric by introducing
D'(p, q) = D(p, q) + D(q, p). Under certain conditions the minimization of
directed divergence measure is equivalent to the maximization of the entropy [Kap92].
The first divergence we will introduce is the Kullback-Leibler's cross-entropy measure
which is defined as [Kul59]
D_{KL}(p, q) = \sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{q_i}\right)    (3.12)

where p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n) are two probability distri-
butions. The following are some important properties of the measure
* D_{KL}(p, q) is a continuous function of p and q.
* D_{KL}(p, q) is permutationally symmetric.
* D_{KL}(p, q) \geq 0, and it vanishes iff p = q.
* D_{KL}(p, q) \neq D_{KL}(q, p)
The measure can also be formulated for continuous variate density functions

D_{KL}(f, g) = \int f(x) \ln\left(\frac{f(x)}{g(x)}\right) dx    (3.13)

where it vanishes iff f(x) = g(x).
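A direct sample computation of (3.12) and of its symmetrized form D'(p, q) is sketched below for two small discrete distributions; the function name and the example values are illustrative only.

import numpy as np

def kl_divergence(p, q):
    # D_KL(p, q) = sum_i p_i ln(p_i / q_i); requires q_i > 0 wherever p_i > 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
print(kl_divergence(p, q), kl_divergence(q, p))   # the measure is not symmetric
print(kl_divergence(p, q) + kl_divergence(q, p))  # symmetrized D'(p, q)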
3.3.1 The Relationship to Maximum Entropy Measure
The Kullback-Leibler divergence measure is used to measure the distance between two
distributions. When the second distribution is not given, it is natural to choose the distribu-
tion that has maximum entropy. When there are no constraints we compare to the uniform
distribution u, and we minimize the measure

D_{KL}(p, u) = \sum_{i=1}^{n} p_i \ln(n p_i)    (3.14)
In other words, we maximize
-\sum_{i=1}^{n} p_i \ln p_i    (3.15)
Thus, minimizing cross-entropy is equivalent to maximization of entropy when the distri-
bution we are comparing to is a uniform distribution [Kap92]. Even though maximization
of entropy can be thought of as a special case of the minimum cross-entropy principle, there is a
conceptual difference between the two measures [Kap92]. The maximum entropy princi-
ple maximizes uncertainty, while the minimum cross-entropy principle minimizes a prob-
abilistic distance between two distributions.
3.3.2 Other Entropy Measures
We are not restricted to Shannon's entropy definition. There are other entropy mea-
sures which are quite useful. One of the measures is the Renyi's entropy measure [Ren60]
which is given as

H_R(X) = \frac{1}{1-\alpha} \ln \sum_{k=1}^{n} p_k^{\alpha},   \alpha > 0, \alpha \neq 1    (3.16)

The Havrda-Charvat entropy is given as

H_{HC}(X) = \frac{1}{1-\alpha} \left( \sum_{k=1}^{n} p_k^{\alpha} - 1 \right),   \alpha > 0, \alpha \neq 1    (3.17)

and Shannon's entropy is recovered as the limit

H_S(X) = \lim_{\alpha \to 1} H_R(X)    (3.18)
We will use the Renyi's entropy measure for our derivation in the next chapter due to its
better implementation properties. We can compare the three types of entropy in Table 3-1.
Table 3-1. The comparison of properties of three entropies

Property                    Shannon's   Renyi's   H-C's
Continuous function         yes         yes       yes
Permutationally symmetric   yes         yes       yes
Monotonically increasing    yes         yes       yes
Recursivity                 yes         no        yes
Additivity                  yes         yes       no
3.3.3 Other Divergence Measures
Another important measure of divergence is given by Bhattacharyya [Bha43]. The dis-
tance D_B(f, g) is defined by

b(f, g) = \int \sqrt{f(x) g(x)} dx    (3.19)

and

D_B(f, g) = -\ln b(f, g)    (3.20)
DB(f, g) vanishes iff f = g almost everywhere. There is a non-symmetric measure,
the so-called generalized Bhattacharya distance, or Chernoff distance [Che52], which is
defined by
D_C(f, g) = -\ln \int [f(x)]^{1-s} [g(x)]^{s} dx,   0 \leq s \leq 1    (3.21)
Another important divergence measure is Renyi's measure [Ren60] which is given as

D_\alpha(f, g) = \frac{1}{\alpha - 1} \ln \int [f(x)]^{\alpha} [g(x)]^{1-\alpha} dx    (3.22)

Note that the Bhattacharyya distance corresponds to s = 1/2 and the generalized Bhatta-
charyya distance corresponds to \alpha = 1 - s. The Bhattacharyya distance will form the
starting point for our clustering function.
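For densities known on a grid, (3.19)-(3.21) can be evaluated by ordinary numerical integration, which is exactly the cost that our clustering function will avoid; the short sketch below (our own illustration, using the trapezoidal rule and two unit-variance Gaussians) shows the Bhattacharyya and Chernoff distances computed this way.

import numpy as np

x = np.linspace(-10.0, 10.0, 4001)
gauss = lambda x, m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
f, g = gauss(x, 0.0, 1.0), gauss(x, 2.0, 1.0)

b = np.trapz(np.sqrt(f * g), x)                    # Bhattacharyya coefficient (3.19)
D_B = -np.log(b)                                   # Bhattacharyya distance (3.20)
s = 0.3
D_C = -np.log(np.trapz(f**(1 - s) * g**s, x))      # Chernoff distance (3.21)
print(D_B, D_C)   # for unit-variance Gaussians D_B = (m1 - m2)^2 / 8 = 0.5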
3.4 Conclusion
If the clustering problem is formulated as the distance between two pdf's, we need a
way to measure and maximize this distance. Second order methods are limited by the fact
that they can not exploit all the underlying structure of a pdf, hence their usage in cluster-
ing will be limited to linearly separable data. In other words, we assume that the distri-
butions are Gaussian, where second order statistics are enough to differentiate them. If we want
to cluster nonlinearly separable data, or if the Gaussian assumption is not valid, we need
more information about the structures in the data. This information can be obtained by
mutual information, which forms the basis of our clustering algorithm. The second prob-
lem we have to solve is how to obtain mutual information from the data. Most of the mea-
sures in Chapter 3 require a numerical integration which is very time consuming. Our
choice as our clustering evaluation function (CEF) depends on the better implementation
properties of Renyi's entropy, and a different form of Bhattacharya distance where no
numerical integration is required.
CHAPTER 4
CLUSTERING EVALUATION FUNCTIONS
4.1 Introduction
The measure for evaluating a clustering operation should be both easy to calculate and
effective. We introduced many measures in the previous chapters. Some of them are easy
to calculate but not effective, and there are others which are effective but not easy to calcu-
late. We will derive our clustering function in several different ways and provide a discus-
sion about the relation between them.
4.2 Density Estimation
As we know from the previous chapter, there are several ways to measure entropy. One
of the measures was the Renyi's entropy measure. In order to use this measure in the cal-
culations, we need a way to estimate the probability density function. Unsupervised esti-
mation of the probability density function is a difficult problem, especially if the
dimension is high. One of the methods is the Parzen Window Method, which is also called
a kernel estimation method [Par62]. The kernel used here will be a Gaussian kernel for
simplicity
G_{Gauss}(x) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} x^T \Sigma^{-1} x\right)    (4.1)

where \Sigma is the covariance matrix. For simplicity we can assume that \Sigma = \sigma^2 I. For a
data set X = \{x_1, ..., x_N\} the density function can be esti-
mated as

f(x) = \frac{1}{N} \sum_{i=1}^{N} G(x - x_i, \sigma^2)    (4.2)
Now consider Renyi's quadratic entropy calculation for continuous case which is
H_R(X) = -\ln \int f(x)^2 dx,   \alpha = 2    (4.3)
Easy integration of the Gaussian is one advantage of this method. The reason for using
\alpha = 2 is the exact calculation of the integral. When we replace the probability density
function with its Parzen window estimation we obtain the following formulation
H_R(X) = -\ln \int \left(\frac{1}{N} \sum_{i=1}^{N} G(x - x_i, \sigma^2)\right) \left(\frac{1}{N} \sum_{j=1}^{N} G(x - x_j, \sigma^2)\right) dx
        = -\ln \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G(x_i - x_j, 2\sigma^2)    (4.4)
Because of the Gaussian kernels and the quadratic form of Renyi's formulation, the result
does not need any numerical evaluation of integrals [Xu98] [PriOO]. This is not true for the
Shannon's entropy measure [Sha48] [Sha62]. The combination of the Parzen window
method with the quadratic entropy calculation gives us an entropy estimator that computes
the interactions among pairs of samples. We call the quantity G(x_i - x_j, 2\sigma^2) the
potential energy between samples x_i and x_j, because the quantity is a decreasing func-
tion of the distance between them, similarly to the potential energy between physical par-
ticles. Since this potential energy is related to information, we will call it the Information
Potential. There are many applications of information potential as described in [Xu98].
We want to use this expression from another perspective. We know that the double sum-
mation of the potential energy gives us the entropy of the distribution. We can easily start
thinking about what will happen if the potential energy is calculated between samples of
different distributions. This will form the basis for our clustering algorithm.
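The estimator of (4.4) reduces to a double sum of pairwise Gaussian interactions, which the sketch below computes directly; the function name and the test values are illustrative, and no numerical integration is needed.

import numpy as np

def renyi_quadratic_entropy(X, sigma):
    # H_R (alpha = 2) estimated via (4.4): the mean information potential over all sample pairs
    X = np.atleast_2d(X)
    N, d = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # squared pairwise distances
    var = 2.0 * sigma ** 2                                       # two kernels convolve into one of twice the variance
    G = np.exp(-sq / (2.0 * var)) / ((2.0 * np.pi * var) ** (d / 2.0))
    return -np.log(np.sum(G) / N ** 2)

rng = np.random.default_rng(0)
print(renyi_quadratic_entropy(rng.normal(0, 1, (200, 1)), sigma=0.3))
print(renyi_quadratic_entropy(rng.normal(0, 5, (200, 1)), sigma=0.3))  # wider data, higher entropy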
4.2.1 Clustering Evaluation Function
Let's assume that we have two distributions p(x) and q(x). We would like to evaluate the
interaction between samples drawn from these distributions and how it changes when the distributions change.
Let's consider the following clustering evaluation function (CEF)
CEF(p, q) = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} G(x_i - x_j, 2\sigma^2)    (4.5)

and

x_i \in p(x),   x_j \in q(x)    (4.6)
where the interaction between samples from different distributions are calculated. In order
to be more specific about the distribution functions, we include a membership function
which shows the distribution for each sample. The membership function M(xi) will be
defined for the first distribution as M_1(x_i) = 1 iff x_i \in p(x), and M_1(x_i) = 0
otherwise. Likewise for the second distribution, a similar function can be defined as
M_2(x_i) = 1 iff x_i \in q(x). Now we can redefine the CEF function as
CEF(p, q) = \frac{1}{N_1 N_2} \sum_{i=1}^{N} \sum_{j=1}^{N} M(x_i, x_j) G(x_i - x_j, 2\sigma^2)    (4.7)

where

M(x_i, x_j) = M_1(x_i) \times M_2(x_j)    (4.8)
The value of the function M(\cdot) is equal to 1 when the two samples are from different distri-
butions, and the value is 0 when both samples come from the same distribution. For two
distributions, the function M can be rewritten as M(a_i, a_j) = (M_1(a_i) - M_1(a_j))^2.
The summation is calculated only when the samples belong to different distributions and
the calculation is not done when the samples come from the same distribution. By chang-
ing M(\cdot) we are changing the distribution functions of the clusters.
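A direct sample-based evaluation of (4.7) is sketched below for two labeled clusters; the function name, kernel size and toy data are our own choices, but the computation is the cross-cluster double sum defined above.

import numpy as np

def cef(X, labels, sigma):
    # CEF of (4.7): Gaussian interactions are summed only over pairs of samples
    # that carry different cluster labels (membership product M(x_i, x_j) = 1)
    X = np.atleast_2d(X)
    A, B = X[labels == 0], X[labels == 1]
    d = X.shape[1]
    var = 2.0 * sigma ** 2
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    G = np.exp(-sq / (2.0 * var)) / ((2.0 * np.pi * var) ** (d / 2.0))
    return np.sum(G) / (len(A) * len(B))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.2, (30, 2)), rng.normal(+1, 0.2, (30, 2))])
good = np.repeat([0, 1], 30)    # labels matching the true structure
bad = np.tile([0, 1], 30)       # labels cutting across both blobs
print(cef(X, good, 0.3), cef(X, bad, 0.3))   # the correct split gives the smaller CEF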
4.2.2 CEF as a Weighted Average Distance
One way to measure the distance between two clusters is the average distance between
pairs of samples as stated in (2.26) and copied here for convenience
D_{avg}(X_1, X_2) = \frac{1}{n_1 n_2} \sum_{a_i \in X_1} \sum_{a_j \in X_2} \|a_i - a_j\|    (4.9)
This measure works well when the clusters are well separated and compact, but it will fail
if the clusters are close to each other and have nonlinear boundaries. A better way of mea-
suring the distance between clusters is to weight the distance between samples nonlin-
early. The clustering problem is to find regions such that each region contains samples that
are more "similar" among themselves than samples in different regions. We assume here
that samples that are close to each other will probably have more similarities than samples
that are further away. However, when two samples are far away from each other, it does
not mean that they are not related. So, samples that are far away should not be discarded
completely. A simple and easy way to nonlinearly weight the distance between samples is
to use a kernel function evaluated for each difference between samples. For simplicity let's
use the Gaussian kernel as our weighting function. The kernel will give more emphasis to
the samples close to each other than the Euclidean distance, and will give much less
emphasis to samples far away from each other when compared with the Euclidean dis-
tance. When we use the kernel function the average distance function becomes
D_{NEWavg}(X_1, X_2) = \frac{1}{n_1 n_2} \sum_{a_i \in X_1} \sum_{a_j \in X_2} G(a_i - a_j, 2\sigma^2)    (4.10)

which is exactly the CEF(\cdot) function that we derived before. The only difference is that
(4.10) should be minimized where (4.9) should be maximized. We have an additional
parameter \sigma, which controls the variance of the kernel. By changing \sigma, we can control
the output of the clustering algorithm to some extent. On the other hand, D_{avg}(\cdot) does not have
any parameters to control, which means if it is not working properly for a certain data set,
there is no recourse, other than changing the function itself. However, creating a princi-
pled approach for selecting the kernel variance is a difficult problem for which no solution
yet exists.
4.2.3 CEF as a Bhattacharya Related Distance
The derivation of CEF progressed by applying the Information Potential concept to
pairs of samples from different clusters. Since the CEF function measures the relative
entropy between two clusters using Renyi's entropy, we investigated how the CEF is
related to other divergence measures between pdf's. The CEF in (4.5) can be generally rewrit-
ten as a divergence measure between two pdf's p(x) and q(x) as

D(p, q) = \int p(x) q(x) dx    (4.11)
Since we are interested in a nonparametric approach (sample by sample estimate) the
pdf's are estimated by the Parzen-Window [Par62] method using a Gaussian kernel in
(4.1) where p(x) and q(x) are given as
p(x) = \frac{1}{N_p} \sum_{i=1}^{N_p} G_{Gauss}(x - x_i, \sigma^2)    (4.12)

q(x) = \frac{1}{N_q} \sum_{j=1}^{N_q} G_{Gauss}(x - x_j, \sigma^2)    (4.13)
By substituting the definition of p and q into equation (4.11), we obtain
CEF(p, q) = \frac{1}{N_p N_q} \sum_{i=1}^{N_p} \sum_{j=1}^{N_q} G_{Gauss}(x_i - x_j, 2\sigma^2)    (4.14)
This is exactly the definition of CEF in (4.5). The difference in the summation is just a
different way of counting the samples. Notice also the similarity of the CEF in (4.11) and
the Bhattacharyya [Bha43] distance in (3.19), which differ simply by a square root. We
interpret this difference as a metric choice [PriOO]. A nice feature of equation (4.11) is that
the integral can be calculated exactly, while it would be very difficult to compute (3.19)
nonparametrically.
4.2.4 Properties as a Distance
To be able to use CEF(p, q) as a distance, we need to show certain properties. Let's
define the following distance measure to be consistent with the other divergence measures.
We usually drop the log term, since it is not required during maximization or minimization
of the distance function. But without using the log function, the comparison will not be
complete.
D_{CEF}(p, q) = -\log CEF(p, q)    (4.15)
The first property we have to show is positiveness of DCEF(p, q). We have to show that
the distance measure CEF(p, q) is always less than or equal to 1. Since
\int p(x) dx = 1 and \int q(x) dx = 1, the product can never be greater than one, which
means that D_{CEF}(p, q) \geq 0. The second property which should be satisfied is that when
the distributions are equal, the measure should be zero, i.e. D_{CEF}(p, q) = 0 whenever
p(x) = q(x). But since \int p(x) dx = 1 does not imply that \int p^2(x) dx = 1,
we can not say that the distance measure is always zero whenever two pdf's are equal, but
the distance measure reaches a minimum. The third property is the symmetry condition. It
is obvious that DCEF(p, q) = DCEF(q, p). To satisfy the second property, we have
to normalize the DCEF(p, q), which can be done as follows:
D_{CEFnorm}(p, q) = -\ln\left(\frac{\int p(x) q(x) dx}{\sqrt{\int p^2(x) dx \int q^2(x) dx}}\right)    (4.16)
It is easy to observe that we obtain the Cauchy-Schwarz distance measure, if we take the
square of this equation. Although it is not a standard measure, we would like to show that
DCEF(p, q) can be used as a distance measure. Since we are trying to maximize the
distance measure between p(x) and q(x), we may omit the fact that the distance is not zero,
whenever the pdf's are equal. We compared several distance measures experimentally,
changing a single parameter of one of the distributions. The comparison is done between
two Gaussian distributions, where the mean of one of the Gaussian distributions is
changed and the behavior of the distance measures is observed in Figure 4-1. CEF,
CEFnorm and Bhattacharyya are calculated with a kernel variance of 0.3. Chernoff is calcu-
lated with the parameter value 0.2, and Renyi's divergence is calculated with the param-
eter 1.1. Although the minimum is not zero, when two pdf functions are equal, the
behavior of CEF is consistent with the other measures, hence it can be used in a maximi-
zation or minimization procedure.
4.2.5 Summary
The most important property of the distance function CEF, is the fact that it does not
require any numerical integration, which would increase the calculation time and numeri-
cal errors. Although it does not satisfy an important property as a distance function, CEF
can be used as a distance as long as it is only used in maximization or minimization proce-
dures. It can not be compared with the other distance measures if maximization or minimi-
zation is not used. The CEF also has other properties that make the distance measure a
good choice for clustering algorithms, as we will see later.
Figure 4-1. Distance w.r.t. mean
4.3 Multiple Clusters
The previous definition was given for two clusters. In this section we will generalize
the CEF function to more than two clusters. Since we want to measure the divergence
between different clusters, the measure should include the divergence from one cluster to
all the others. One way of achieving this is to estimate pairs of divergence measures
between each possible pair of clusters. Let's assume that we have C clusters; if the
divergence measure is symmetric we need C \times (C-1)/2 pairs, and if the measure is not
symmetric, we need C^2 pairs. The following extension of CEF will be used for more than
two clusters:
CEF(p_1, p_2, ..., p_C) = \sum_{i=1}^{C} \sum_{j=i+1}^{C} CEF(p_i, p_j)    (4.17)
where CEF(p_i, p_j) is given by (4.14). Replacing the densities with Parzen window
we will obtain
CEF(p_1, p_2, ..., p_C) = \frac{1}{N_1 N_2} \sum_{i_1=1}^{N_1} \sum_{i_2=1}^{N_2} G(x_{i_1} - x_{i_2}, 2\sigma^2) + ...    (4.18)

where

N = N_1 + N_2 + ... + N_C    (4.19)
The equation (4.18) can be written in a more compact way by introducing a scoring
function s(\cdot) for C clusters, where s(\cdot) is a C-bit long function defined as

s_k(x_i) = 1   iff   x_i \in C_k    (4.20)

and s_k(\cdot) is the k'th bit of s(\cdot).
Using the scoring function, the extended CEF of (4.18) can be written in a compact
form as follows

CEF(x, s) = \frac{1}{N_1 N_2 ... N_C} \sum_{i} \sum_{j} M(x_i, x_j, s) \, G(x_i - x_j, 2\sigma^2)    (4.21)

where

M(x_i, x_j, s) = 1 if s(x_i) \neq s(x_j), and 0 if s(x_i) = s(x_j)    (4.22)
If both samples are in the same cluster, then M(.) is 0, while if both samples are in dif-
ferent clusters, M(.) is 1. Since the label information is in different bits, this equation is
valid for any number of clusters. The only requirement is there should be enough bits for
all clusters.
This function evaluates the distance between clusters. The only variable we can
change in the function is M(\cdot), since the input data X is fixed. We can use this function to
cluster the input data by minimizing the function with respect to M(\cdot). The stepwise nature
of the function CEF w.r.t. M(\cdot) makes it difficult to use gradient based methods.
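As an illustration of the extension to several clusters, the sketch below evaluates the pairwise sums of (4.17) from a plain label vector, which plays the role of the bit-string scoring function s(.); the function name and the per-pair normalization by N_i N_j (as in (4.14)) are our reading of the definitions above.

import numpy as np

def cef_multi(X, labels, sigma):
    # sum CEF(p_i, p_j) over every pair of distinct clusters, as in (4.17);
    # only cross-cluster pairs of samples contribute
    X = np.atleast_2d(X)
    d = X.shape[1]
    var = 2.0 * sigma ** 2
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    G = np.exp(-sq / (2.0 * var)) / ((2.0 * np.pi * var) ** (d / 2.0))
    total = 0.0
    ids = np.unique(labels)
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            ia, ib = labels == ids[a], labels == ids[b]
            total += np.sum(G[np.ix_(ia, ib)]) / (ia.sum() * ib.sum())
    return total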
4.4 Results
Let us see how the function behaves and performs in the clustering of several different
synthetic data sets [GokOO]. The data sets are chosen in such a way that they represent sev-
eral different cases with a small number of samples, so that evaluation and verification of
the results will be easy. Figure 4-2 shows five of the data sets used. We designed data sets
for two different clusters that would require MLP or RBF classifiers for separation. For
the data sets shown in Figure 4-2 the minimum of the CEF provided perfect clustering in all
cases (according to the labels we used to create the data). The different clusters are shown
using the symbols "square" and "star", although our clustering algorithm, of course, did
not have access to the data labels. We can say that in all these cases there is a natural "val-
ley" between the data clusters, but the valley can be of arbitrary shape. We have to note
that the data sets are small and the clusters are easily separable by eye-balling the data.
However, we should remember that none of the conventional algorithms for clustering
would cluster the data sets in the way the CEF did. The main difference is that conven-
tional clustering uses a minimum distance metric which provides, from a discrimination
point of view, a Voronoi tessellation (i.e. a division of the space into convex regions). Cluster-
ing using the CEF is not limited to distance to means, and seems therefore more appropri-
ate to realistic data structures. We implemented the CEF by enumeration (i.e. over all possible
combinations) and the cluster labeling was attributed to its minimum. We will
investigate below other methods to optimize the search for the CEF minimum. We used a
variance ranging from 0.01 to 0.05, and the data is normalized between -1 and +1.
Figure 4-2. Input/output data
4.4.1 A Parameter to Control Clustering
The variance of the Gaussian function in Eq. (4.7) determines the level of interaction
between samples. When the variance gets smaller, the importance of samples that are far
away also gets smaller. Since the relative distance between samples is important, a large
variance will make samples close to each other less important.
Figure 4-3. Label assignments (panels for L = 1, 5, 10, 15, 20, 25, 27, 29)
But we want to give more emphasis to the samples that are close. At the same time we
don't want to lose the interaction between the samples that are far away. Therefore the
variance of the Gaussian determines the results of the clustering with the CEF. This should
be no surprise if we recall the role of the variance in nonparametric density estimation
[Dud73].
We experimentally evaluated the importance of the Gaussian kernel variance for the
two triangles data set. The clustering assignment is changed starting with the data in the
top line, continuing until the bottom, so the clustering that provides the two lines is
obtained when the label assignment is L=15. The change of the labels can be seen in Fig-
ure 4-3. Every cluster is shown with a different symbol. Only 8 different assignments out
of 28 are shown in Figure 4-3. 28 comes from the fact that we omit the assignments where
the number of the elements in any cluster is zero, such that we always have clusters with
nonzero elements.

Figure 4-4. Change of CEF w.r.t. the variance of the Gaussian kernel

Of course this is only one way to change the membership function but it
will give us an idea of how variance affects the results. Figure 4-4 shows that as variance
decreases, the global optimum becomes our minimum point and many local minima disap-
pear. Notice also that in this case the global minimum for the largest variance (\sigma = 0.1)
occurs at L=3 and does not coincide with the label assignments. Hence, not only is the
CEF difficult to search, but also it provides a global minimum that does not correspond in
this case to the label assignments, that is, this variance setting will miss the fact that there
is a "valley" between the two lines. In general, we do not have a label assignment, but we
can expect that different cluster assignments will be obtained depending upon the variance
used in the Gaussian kernel.
All in all, we were very pleased to observe this behavior in the CEF, because it tells us
that the evaluation function is incorporating and using global information about the data
set structure, unlike traditional clustering methods.
4.4.2 Effect of the Variance to the Pdf Function
The variance is the most important parameter in the pdf calculation, since it deter-
mines the size of the kernel, hence the shape of the final pdf function. If the variance is too
big, then the pdf is smoothed such that, all the details of the data are lost. If the variance is
too small, then the pdf estimation will be discontinuous and full of spikes. In both extreme
cases, it is not possible to get a meaningful result from the clustering algorithm. The pdf
estimation using Parzen window of the above data is shown with different values of the
variance in Figure 4-5, Figure 4-6 and Figure 4-7.
Figure 4-5. Pdf of the data given in Figure 4-3, var=0.2
Figure 4-6. Pdf of the data given in Figure 4-3, var=0.1
Figure 4-7. Pdf of the data given in Figure 4-3, var=0.08
As is evident, the pdf estimation is too smooth when the variance is 0.2. It gets better
around the value of 0.1, which can be seen in Figure 4-6. When the variance is 0.08, we
get the pdf estimation given in Figure 4-7, which will give us the correct clustering, as
shown in Figure 4-3 where the minimum point of the CEF function corresponds to the cor-
rect clustering.
4.4.3 Performance Surface
Because of the nature of clustering, the CEF function is a step-wise function, which
makes optimization difficult. To see how the CEF changes w.r.t. the membership function,
let's define the following parametric membership function without losing the functionality
of our definition. Instead of assigning the membership values directly, define G() as a
threshold mapping function where the weights define the mapping. G() is defined as
G(x, w) = TRESH(F(x, w)) where F(x, w) is any linear/nonlinear mapping
function from x > F(x, w), and TRESH() is defined as follows:
TRESH(x) = { 1.0, -1.0 }, {x > 0, x < 0 } respectively.
We can write the CEF in terms of w. Let's define as a linear function, where
F(x, w) = x0Wo + x1 w1 Figure 4-8 depicts the plot of the CEF function w.r.t. one of
the weights w0 for the two lines data set. As one can immediately observe the evaluation
function is a staircase function, because until there is a change in clustering, the value of
CEF remains constant. It is clear that the usual gradient descent method for adaptation
[Hay94] will not work here, since at most of the points in the weight space, there is no gra-
dient information to determine the next change in weights. Since the traditional gradient
method fails to solve the problem, we have to find a more suitable method to minimize the
clustering function.

Figure 4-8. Change of CEF w.r.t. weight w_0
4.5 Comparison of Distance Measures in Clustering
Several distance measures have been proposed by various authors, and all of them have simi-
lar and distinct characteristics. Some of them are stronger distance measures than the oth-
ers. We adopted an experimental comparison of several distance measures (i.e. CEF,
CEFnorm, Renyi, Kullback-Leibler and Bhattacharyya) in our clustering algorithm, and
we showed that some of them can not be used in the clustering procedure. We select a sim-
ple data set from Figure 4-2, and we test the distance measures by using four different con-
figurations of the data set. The four configurations can be seen in Figure 4-9.

Figure 4-9. Different test configurations
For comparison with the other measures, CEF and CEFnorm functions are used as
given in (4.15) and (4.16). The log function is usually not necessary when we optimize the
function, since it is a monotonically increasing function. Since the Renyi's divergence
measure and the Kullback-Leibler divergence measure are not symmetric, we calculated
the sum of the distance from p(x) to q(x) and the distance from q(x) to p(x), which makes
both distance measures symmetric. Renyi's divergence is calculated using \alpha = 1.1. The
variance of the Parzen window estimator is selected as 0.15 for all 5 measures. The configura-
tion given in Figure 4-9 (c) is the desired clustering, because there is a natural valley
between the two triangle shaped data samples. Since we are trying to maximize the dis-
tance between two pdf distributions, we expect the distance measures to be maximum for
the configuration (c). As can be seen from Table 4-1, the measures that incorporate
in their calculation the ratio of the two pdf distributions fail to find the correct clustering.
The maximum of each function is shown as a shaded area.
Table 4-1. Comparison of different distance measures

        CEF        CEFnorm    Renyi's    KL         Bhat
a)      3.12594    0.10834    0.54023    0.47845    0.05639
b)      4.42961    1.91489    17.5679    13.2470    1.14886
c)      11.6223    10.2988
d)      3.89062    1.80538                          1.12171
The Renyi's divergence measure and the Kullback-Leibler divergence measure require
both pdf distributions to be strictly positive over the whole range. Because of this restric-
tion, these two measures can not be used in our clustering algorithm. Although Parzen
window estimation of pdf's is continuous in theory, the pdf becomes practically zero away
from the center. Increasing the variance to overcome this problem results in an over-
smoothed pdf function, as seen in Figure 4-5. The CEF, CEFnorm and Bhattacharyya dis-
tances perform similarly, finding the correct clustering. The results of the comparison
justify our choice of CEF as a distance measure for clustering.
4.5.1 CEF as a Distance Measure
As seen before, the CEF function does not disappear when two pdf distributions are
equal. Although this result does not affect our clustering algorithm, it would be nice to
have this property, so that CEF will be comparable to the other divergence measures. As
seen in (4.16), CEF can be normalized to have a 0 value, when two distributions are equal.
Although this is a desired property for CEF to be a comparable distance measure, it may
not be so good for the purpose of clustering. The terms V_1 = \int p^2(x) dx and
V_2 = \int q^2(x) dx can be considered as Renyi's entropy with \alpha = 2, without the logarithm
term. When we use the normalized CEF in clustering, we have to be careful. Consider
the case in Figure 4-9 (d), where one cluster consists of only one point. Entropy of a small
cluster will be large, and entropy of a big cluster with many points will be small, whereas
for the case in Figure 4-9 (c), entropies of both clusters will be close to each other and will
be neither large nor small. There can be occasions where VI *V2 for the case in Figure 4-9
(d) will be larger than for the case in Figure 4-9 (c). This may result in the CEFnorm
reaching a maximum in the case in Figure 4-9 (d), instead of in the case in Figure 4-9 (c).
Because of this reason we will not use CEFnorm in our clustering algorithm.
4.5.2 Sensitivity Analysis
The sensitivity of the algorithm is much reduced by choosing a proper distance func-
tion, but it is not removed completely. Since this is basically a valley-seeking algorithm, it
tries to find a valley where the area under the separating line, which is the multiplication of
the areas from each side of the line, is minimum. We can see how it works in Figure 4-10.
Gaussian kernels are placed on each sample to estimate pdf's of each cluster. As the clus-
ter assignments are changed, the algorithm tries to separate the samples such that the area
under the pdf's of the clusters is minimum. When the sample P1 is away from the distribu-
tion, it is very likely that the separating line L2 will go over an area which is less than the
area that L1 covers. The value of the CEF for L1 is the area, which is found by multiplying
the Gaussian kernels in the upper part of L1 with the Gaussian kernels in the lower part of
L1. So, if P1 is away from the other sample points, L2 will be a better choice to minimize
CEF, although the correct clusters do not result.
Figure 4-10. Sensitivity of the algorithm
4.5.3 Normalization
This sensitivity can be removed using different normalization factors with different
results. Every normalization will have different effects on the clustering. We would like to
give an outline for the procedures for normalizing CEF. The normalization will be done to
remove the sensitivity of the CEF function to small clusters, not to make it zero when two
pdf's are equal. It should be noted that this kind of normalization is effective only when
the clusters are almost of equal size. When the number of points in a cluster increase, the
maximum value of the pdf decreases, assuming the points don't repeat themselves. Let's
use the following normalization term max(p(x)) x max(q(x)). This term will nor-
malize clusters with different size, but will have the effect of forcing the cluster to be of
similar size. The term can be approximated by finding the maximum value of the function
at the sample points as given in (4.23).
max(p(x)) \approx \max_i p(x_i)    (4.23)
Whenever a normalization similar to (4.23) is incorporated into the cost function,
where smaller clusters are discouraged, the effect will be a loss of the valley seeking prop-
erty of the cost function with clusters of different sizes. The term is effective if the clusters
are of similar sizes. Once the term is included in the cost function, it is difficult to control,
as it is difficult to choose to accept certain cluster sizes, and not to accept the small cluster
sizes. It would be easier to create a rejection class, where during the iterations certain con-
figurations are not accepted, depending on an external criterion, such as rejecting clusters
of a certain size, or rejecting clusters having a maximum/minimum value greater than a
prespecified value. Actually, a rejection class is already embedded in the algorithm, such
that clusters of size zero are not accepted. This raises the question of how to set the thresh-
old. Although more work should be done on this topic, we suggest selecting a value that is
less than at least half of the maximum expected cluster size.
CHAPTER 5
OPTIMIZATION ALGORITHM
5.1 Introduction
As we have seen previously the performance surface contains local minima and is dis-
crete in nature, which makes the optimization process difficult. Gradient optimization
methods will not work, and we have to find other methods to optimize the CEF perfor-
mance function. One of the methods which can be used to overcome the local minima
problem is simulated annealing [Aar90]. It is used with success in many combinatorial
optimization problems, such as the travelling salesman problem, which is an NP-complete
problem.
5.2 Combinatorial Optimization Problems
A combinatorial optimization problem is either a minimization or a maximization
problem and is specified by a set of problem instances. An instance of a combinatorial
optimization problem can be defined as a pair (S, f), where the solution space S denotes
the finite set of all possible solutions and the cost function f is a mapping defined as

f: S \to \mathbb{R}    (5.1)

In the case of minimization, the problem is to find a solution i_{opt} \in S which satisfies

f(i_{opt}) \leq f(i)   for all   i \in S    (5.2)

Such a solution i_{opt} is called a globally-optimal solution.
Local search algorithms are based on stepwise improvements on the value of the cost
function by exploring neighborhoods. Let (S, f) be an instance of a combinatorial opti-
mization problem. Then a neighborhood structure is a mapping

N: S \to 2^S    (5.3)

which defines for each solution i \in S a set S_i \subseteq S of solutions that are 'close' to i in
some sense. The set S_i is called the neighborhood of solution i, and each j \in S_i is called
a neighbor of i.
In the present class assignment problem, a neighborhood structure N_k, called the k-
change, defines for each solution i a neighborhood S_i consisting of the set of solutions
that can be obtained from the given solution i by changing k labels of the solution i.
So the 2-change N_2(p, q) changes a solution i into a solution j by changing the labels of
the samples p and q [Dud73]. We will use a similar but more complex neighborhood struc-
ture. Define a generation mechanism as a means of selecting a solution j from the neigh-
borhood S_i of a solution i.
Given an instance of a combinatorial optimization problem and a neighborhood
structure, a local search algorithm iterates on a number of solutions by starting off with a
given initial solution, which is often randomly chosen. Next, by applying a generation
mechanism, it continuously tries to find a better solution by searching the neighborhood of
the current solution for a solution with lower cost. If such a solution is found, the current
solution is replaced by the lower cost solution. Otherwise the algorithm continues with the
current solution. The algorithm terminates when no further improvement can be obtained.
This algorithm will converge to a local minimum. There is usually no guarantee how far
this local minimum is from the global minimum, unless the neighborhood structure is
exact, which results in complete enumeration of the solutions and is impractical in many
cases. Let's summarize the local search algorithm with a block diagram.
INITIALIZE i_start
i := i_start
repeat
    GENERATE (j from S_i)
    if f(j) < f(i) then i := j
    StopCriterion := ( f(j) >= f(i) for all j in S_i )
until StopCriterion
Figure 5-1. Local search
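A minimal sketch of this local search for the label assignment problem is given below; it uses a 1-change neighborhood (flip one label at a time) rather than the 2-change or group moves discussed here, and the cost argument can be any clustering evaluation function such as the CEF of Chapter 4.

import numpy as np

def local_search(labels, cost, n_clusters, rng=np.random.default_rng(0)):
    # Figure 5-1 with a 1-change neighborhood: accept a label flip only when it lowers
    # the cost, and stop when no single flip improves the current solution.
    labels = labels.copy()
    best = cost(labels)
    improved = True
    while improved:
        improved = False
        for i in rng.permutation(len(labels)):
            for c in range(n_clusters):
                if c == labels[i]:
                    continue
                trial = labels.copy()
                trial[i] = c
                f = cost(trial)
                if f < best:
                    labels, best = trial, f
                    improved = True
    return labels, best

The result is a local minimum with respect to this neighborhood, in the sense defined below.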
Let (S, f) be an instance of a combinatorial optimization problem and let N be a
neighborhood structure, then i e S is called a locally optimal solution or simply a local
minimum with respect to N if i is better than, or equal to, all its neighborhood solutions
with regard to their cost. More specifically, i is called a local minimum if

f(i) \leq f(j),   for all   j \in S_i    (5.4)
The problem we have is not as difficult as the travelling salesman problem, because we
have some a-priori information about the problem. We know that the global solution will
assign the same label to samples that are close to each other. Of course just this statement
alone will be an oversimplification of the problem. In nonlinear structured data, the clus-
ters won't be compact, which means that there are samples that are far away from each
other with the same label, but the statement remains true even in this case. This will sim-
plify our problem a little bit. First we will introduce an algorithm using a local search
algorithm, then we will adapt a different simulated annealing algorithm on top of it.
Before doing this it is necessary to introduce how simulated annealing works.
5.2.1 Local Minima
The neighborhood structure is not helpful when there are many local minima. We need
another mechanism to escape from the local minima. Exact neighborhood structure is one
solution but it results in complete enumeration in a local search algorithm. The quality of
the local optimum obtained by a local search algorithm usually strongly depends on the
initial solution, and for most problems there are no guidelines available. There are some ways to
overcome this difficulty while maintaining the basic principle of local search algorithm.
The first one is to execute the local search algorithm for a large number of initial solu-
tions. If the number is large enough such that all solutions have been used as initial solu-
tion, such an algorithm can find the global solution, but the running time will be
impractical.
The second solution is to introduce a more complex neighborhood structure, in order
to be able to search a larger part of the solution space. This may need a-priori information
about the problem and is often difficult. We will try to use a-priori information to con-
struct a better neighborhood structure.
The third solution is to accept, in a limited way, transitions corresponding to an
increase in the value of the cost function. The standard local search algorithm accepts
a transition only if there is a decrease in the cost. This will be the basis of the simulated
annealing algorithm.
5.2.2 Simulated Annealing
In condensed matter physics, annealing is known as a thermal process for obtaining
low energy states of a solid in a heat bath. The first step is to increase the temperature of
the bath to a maximum value at which the solid melts. The second step is to decrease care-
fully the temperature of the bath until the particles arrange themselves in the ground state
of the solid. In the liquid phase all particles of the solid arrange themselves randomly. In
the ground state the particles are arranged in a highly structured lattice and the energy of
the system is minimal. The ground state of the solid is obtained only if the maximum tem-
perature is sufficiently high and the cooling is sufficiently slow.
The physical annealing process can be modelled using a computer simulation [Kir83].
The algorithm generates a sequence of states as follows. Given the current state i of the
solid with energy E_i, a subsequent state j is generated by applying a perturbation
mechanism which transforms the current state into the next state by a small distortion. The
energy of the next state is E_j. If the energy difference, E_j - E_i, is less than or equal to 0,
the state j is accepted as the current state. If the energy difference is greater than 0, the
state j is accepted with a certain probability which is given by
\exp\left(-\frac{E_j - E_i}{k_B T}\right)    (5.5)
where T denotes the temperature of the heat bath and kB is a physical constant known as
the Boltzmann constant. If lowering of the temperature proceeds sufficiently slowly, the
solid can reach thermal equilibrium at each temperature. This can be achieved by generat-
ing a large number of transitions at a given temperature value. Thermal equilibrium is
characterized by the Boltzmann distribution. This distribution gives the probability of the
solid being in state i with energy E_i at temperature T, and is given by

P_T\{X = i\} = \frac{1}{Z(T)} \exp\left(-\frac{E_i}{k_B T}\right)    (5.6)

where X is a stochastic variable denoting the current state of the solid, and

Z(T) = \sum_j \exp\left(-\frac{E_j}{k_B T}\right)    (5.7)
where the summation extends over all possible states.
5.3 The Algorithm
We can apply the previous algorithm to solve our optimization problem. We will
replace the states of the particles with the solutions of the problem, where the cost of the
solution is equivalent to the energy of the state. Instead of the temperature, we introduce a
parameter, called the control parameter. Let's assume that (S, f) is an instance of the prob-
lem, and i and j are two solutions with cost f(i) and f(j), respectively. Then the acceptance cri-
terion determines whether j is accepted from i by applying the following acceptance
probability

P_c\{accept\ j\} = \exp\left(\frac{f(i) - f(j)}{c}\right)   if   f(j) > f(i)    (5.8)

and

P_c\{accept\ j\} = 1   if   f(j) \leq f(i)    (5.9)
where c denotes the control parameter. The algorithm can be seen in Figure 5-2.
INITIALIZE (i_start, c_0, L_0)
i := i_start
k := 0
repeat
    for l = 1 to L_k do
        GENERATE (j from S_i)
        if f(j) < f(i) then i := j
        else
            if exp((f(i) - f(j)) / c_k) > random[0, 1) then i := j
    endfor
    k := k + 1
    CALCULATE (L_k)
    CALCULATE (c_k)
until StopCriterion
Figure 5-2. Simulated annealing algorithm
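A compact sketch of this procedure for the label assignment problem is shown below; the move (a random single-label change), the geometric cooling schedule and the parameter values are illustrative choices, while the acceptance test follows (5.8) and (5.9).

import numpy as np

def simulated_annealing(labels, cost, n_clusters, c0=1.0, alpha=0.95,
                        L=100, n_stages=50, rng=np.random.default_rng(0)):
    i_sol, f_i = labels.copy(), cost(labels)
    best, f_best = i_sol.copy(), f_i
    c = c0
    for _ in range(n_stages):
        for _ in range(L):                                 # L_k transitions at the current value of c
            j_sol = i_sol.copy()
            j_sol[rng.integers(len(j_sol))] = rng.integers(n_clusters)
            f_j = cost(j_sol)
            # accept improvements always, deteriorations with probability exp((f(i) - f(j)) / c)
            if f_j <= f_i or np.exp((f_i - f_j) / c) > rng.random():
                i_sol, f_i = j_sol, f_j
                if f_i < f_best:
                    best, f_best = i_sol.copy(), f_i
        c *= alpha                                         # cooling: decrease the control parameter
    return best, f_best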
A typical feature of the algorithm is that, besides accepting improvements in cost, it
also to a limited extent accepts deteriorations in cost. Initially, at large values of c, large
deteriorations will be accepted and finally, as the value of c approaches 0, no deteriora-
tions will be accepted at all. L_k should be sufficiently large so that the system reaches
thermal equilibrium.
5.3.1 A New Neighborhood Structure
We have some a-priori information about the optimization problem. The problem is
basically a clustering algorithm that tries to group pixels into regions. The pixels that
belong to a class should be close to each other using some distance metric. This informa-
tion will help us create another neighborhood structure and will eliminate many local min-
ima. On top of this algorithm we will put a modified version of a simulated annealing
algorithm, which gives us a chance to escape from local minima, if any. We start by look-
ing at the problem closely. Assume that at an intermediate step the optimization reached
the following label assignment, which is shown in Figure 5-3.

Figure 5-3. An instance

The ideal case is to label the
group of points marked in g2 as "Class2", so that the upper triangle will be labeled as
"Class2", and the lower triangle will be labeled as "Classl". The variance of the Gaussian
kernel is selected as \sigma = 0.08, and the value of the CEF function is given as
0.059448040 for the given configuration. When some of the labels in the group g2 are
changed, the value of CEF function will increase to 0.060655852, and when all the labels
in the group g2 are changed to "Class2", then the value of CEF will drop sharply to
0.049192752, which means that the 2-change N_2(p, q) neighborhood structure
explained before will fail to label the samples in the group g2 correctly. This behavior and
the local minima can also be seen in Figure 4-3. Of course this behavior disappears at
a certain value of the variance, but there is no guarantee that lowering the value of the vari-
ance will always work for every data set. A better approach is to find a better neighbor-
hood structure to escape from the local minima.
In the previous example, one can make the following observation very easily. If we
change the labels of the group of pixels g2 at the same time, then we can skip the local
minima. This brings two questions: How to choose the groups, and how big they should
be?
5.3.2 Grouping Algorithm
We know the clustering algorithm should give the same label to samples that are close to
each other in some distance metric. Instead of taking a fixed region around a sample, we
used the following grouping algorithm. Let's assume that the group size is defined as KM.
We will group samples from the same cluster starting from a large KM, which will be
decreased during the iterations according to a predefined formula. This can be exponential
or linear. The grouping algorithm is explained in Figure 5-4. Assume that the starting
sample is x_1, and the subgroup size is 4. The first sample that is closest to x_1 is x_2, so
x_2 is included in the subgroup. The next sample that is closest to the initial sample x_1 is
x_3, but we are looking for a sample that is closest to any sample in the subgroup, which in
this case is x_4. The next sample closest to the subgroup is still not x_3, but x_5, which is
close to x_4. The resulting subgroup {x_1, x_2, x_4, x_5} is a more connected group than
just selecting the 4 pixels {x_1, x_2, x_3, x_4} that are closest to the initial sample x_1.
The grouping algorithm is more calculation intensive than taking the samples around the
Figure 5-4. Grouping
pixel, but it will help the algorithm to escape from the local minima. The initial cluster
labels come from the random initialization at the beginning of the algorithm. This group-
ing is done starting with every pixel, and the groups that are equal are eliminated. The
grouping is done among the samples that have the same cluster label, and done for each
cluster label independently. When the grouping is finished for a certain group size, for all
cluster labels, they are joined in a single big group, and they will be used in the optimiza-
tion process instead of the original data set. We will obtain the following set of groups as
seen in Figure 5-9. The difference of this grouping algorithm can be seen clearly using the
data set in Figure 5-5. The group size is initialized as KM = N/C, where N is the total
number of data samples, and C is the number of clusters.

Figure 5-9. Final group structure
In Figure 5-6, the group is selected starting from the sample x_1, and the closest N
samples are selected. The group shown in Figure 5-6 is selected using N=10. This group is
actually the kNN of point x_1. The proposed grouping algorithm, on the other hand, cre-
Figure 5-5. Data set for grouping algorithm
Figure 5-6. Grouping using 10 nearest samples (kNN)
ates the group shown in Figure 5-7. As can be seen from both examples, the proposed
grouping is more sensitive to the structure of the data, whereas the kNN method, a very
common method, is less sensitive to the structure of the data. Several grouping examples
can be seen in Figure 5-8. Using the proposed method will create a better grouping struc-
ture for our clustering algorithm.
Figure 5-7. Grouping using the proposed algorithm (group size is 10)
Figure 5-8. Examples using proposed grouping algorithm
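The grouping rule itself is easy to state in code; the sketch below (our own illustration, with a toy chain of points) grows a subgroup by repeatedly adding the sample closest to any current member, and contrasts it with the plain kNN of the starting sample.

import numpy as np

def grow_group(X, start, size):
    # add, at each step, the sample closest to ANY member of the subgroup,
    # not merely the sample closest to the starting point
    D = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    group = [start]
    while len(group) < size:
        dist_to_group = D[group].min(axis=0)   # distance of every sample to the subgroup
        dist_to_group[group] = np.inf          # members cannot be added twice
        group.append(int(np.argmin(dist_to_group)))
    return group

# a bent chain of points: the grown group follows the chain
X = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [3, 1], [3, 2], [0, 2]], dtype=float)
print(grow_group(X, start=0, size=4))                    # [0, 1, 2, 3]
print(np.argsort(np.sum((X - X[0]) ** 2, axis=1))[:4])   # the 4 nearest neighbors of x_1 include the off-chain point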
5.3.3 Optimization Algorithm
First we consider the case where the groups overlap, and propose an optimization
algorithm. The optimization starts by choosing an initial group size KM, which is smaller
than the total number of points in the data set. The second step is to randomly initialize the
cluster labels of the given data set. After selecting the group size and the initial cluster
labels, the groups are formed as explained above, and a new data set is created which con-
sists of groups instead of individual data points. We adapt a 2-change algorithm similar to
the one described before in this chapter. We select one group and check whether changing
the label of this group reduces the cost function. If the change reduces the cost function we
will record this change as our new minimum cluster assignment. If the change does not
reduce the value of the cost function, we choose a second group from the list, and change
the cluster labels of that group in addition to the change of labels of the first group. We
record this assignment if it reduces the minimum value of the cost function. In this algo-
rithm we allow a cluster assignment to be used even if it increases the cost function. This is
like a deterministic annealing, since it is done only for one pair, and it is done at every step
INITIALIZE (KM, var, ClusterLabels)
ClusterLabels = random()
WHILE (KM >= 1) {
    GRP = CREATE_GROUPS (InputData, KM, ClusterLabels)
    REPEAT UNTIL NO CHANGE {
        FOR i = 0 to SIZE(GRP) {
            Change the clustering labels of GRP(i) and record the
            improvement if any
            FOR j = i+1 to SIZE(GRP) {
                Change the clustering labels of GRP(i) and GRP(j)
                and record the improvement if any
            }
        }
    }
    ; Decrease the group size. This can be linear or exponential
    KM = GENERATE_NEWGROUPSIZE(KM)
}
Figure 5-10. Optimization algorithm
Of course we could do this for every pair, but that would amount
to a complete enumeration of the possible solutions, which is not computationally
feasible. We repeat the calculation of new cluster labels until no further improvement is possible
with the given group size KM. The next step is to reduce the group size KM and repeat
the whole process until the group size is 1, which is equivalent to the 2-change algorithm
with the N2(p, q) neighborhood structure explained before. When the algorithm
stops, it is a good idea to run one more pass by resetting KM to its initial value and
running the whole process again on the same data, with the previously obtained clustering
as the initial condition; this can be repeated until no change is recorded. It is
also possible to increase the group size, starting from one, up to a final value. Repeating the
algorithm in this way helps us escape from local minima, unlike previous attempts.
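A minimal Python rendering of the pseudocode in Figure 5-10 is given below for two clusters, where changing the labels of a group means flipping them between 0 and 1. The routines cost (the CEF to be minimized) and create_groups (the grouping step described earlier) are placeholders supplied by the caller, not the exact implementations used in the experiments.

    import numpy as np

    def two_change(X, cost, create_groups, km_init, seed=0):
        """Sketch of the Figure 5-10 optimization for two clusters.
        cost(X, labels) returns the CEF value (lower is better);
        create_groups(X, km, labels) returns a list of index arrays."""
        rng = np.random.default_rng(seed)
        labels = rng.integers(2, size=len(X))      # random initial cluster labels
        best = cost(X, labels)
        km = km_init
        while km >= 1:
            groups = create_groups(X, km, labels)
            improved = True
            while improved:                        # repeat until no change
                improved = False
                for i, gi in enumerate(groups):
                    one = labels.copy()
                    one[gi] = 1 - one[gi]          # flip the labels of group i
                    c1 = cost(X, one)
                    if c1 < best:
                        labels, best, improved = one, c1, True
                        continue
                    for gj in groups[i + 1:]:      # otherwise try the pair (i, j)
                        pair = one.copy()
                        pair[gj] = 1 - pair[gj]    # flip group j on top of group i
                        c2 = cost(X, pair)
                        if c2 < best:
                            labels, best, improved = pair, c2, True
                            break
            km -= 1                                # linear annealing of the group size
        return labels, best

When KM reaches 1 every group is a single sample, so the search reduces to the plain 2-change over individual points.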
Several optimization algorithms can be derived from the given one. Suppose that we are
increasing the group size KM and that, at a certain group size, an improvement is made.
Instead of increasing the group size at the next step, we can decrease it until no further
improvements occur in the cost function; when the improvements stop, we continue to increase the
group size. This variation is sketched in Figure 5-11. It should be noted that none of these
algorithms is guaranteed to reach the global optimum of the cost function.
Figure 5-11. A variation of the algorithm
INITIALIZE (KM, var, ClusterLabels)
ClusterLabels = random()
WHILE (KM >= 1) {
    CHECK FOR IMPROVEMENT
    IF NO IMPROVEMENT: KM = KM - 1
    IF IMPROVEMENT:    KM = KM + 1
}
It is possible to record only the change that gives the biggest improvement, applying it at the
end of the outer loop after all possible changes have been evaluated, instead of applying a
change immediately when an improvement is found. This may result in a slightly different solution.
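A sketch of this "best improvement" variant is shown below; candidate_moves is a hypothetical generator of complete candidate label vectors (for example, all single- and pair-group flips), and the remaining names are ours.

    def best_improvement_pass(X, labels, candidate_moves, cost):
        """Scan every candidate label change and apply only the single best
        one at the end of the pass."""
        base = cost(X, labels)
        best_labels, best_cost = labels, base
        for move in candidate_moves(labels):
            c = cost(X, move)
            if c < best_cost:
                best_labels, best_cost = move, c
        return best_labels, best_cost < base       # True if something improved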
5.3.4 Convergence
Since there is a finite number of combinations tested, and since the algorithm continues
to work only when there is an improvement in the cost function, the algorithm is guaranteed
to stop in a finite number of iterations. There is only one condition under which the
algorithm may not stop. The integral is always greater than or equal to zero; if, because of
underflow errors, the value of the cost function becomes exactly zero, the algorithm will run
forever, since no improvement beyond zero can be made. This is a computational problem,
not an algorithmic one, so whenever the cost function becomes zero the algorithm should be
forced to stop.
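In practice this guard can be implemented with a small tolerance on the CEF value, as in the sketch below; the threshold is arbitrary and only illustrative.

    EPS = 1e-12   # arbitrary tolerance; any CEF value at or below it is treated as zero

    def should_stop(cef_value, improved):
        """Stop when a pass produces no improvement or the CEF has underflowed to ~0."""
        return (not improved) or (cef_value <= EPS)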
5.4 Preliminary Results
We tested our algorithm using multiple clusters and the results are given in Figure 5-12.
Three partial sine waves were created, with Gaussian noise of different variances added and
150 samples per cluster. The algorithm was run with KM starting at 150 and linearly
annealed, one by one, until the group size reached one. The kernel size used in this
experiment is 0.05 and the minimum value of the CEF function is 3.25E-05. Each symbol
represents a different cluster. As we can see, the algorithm was able to find the natural
boundaries of the clusters, although they are highly nonlinear.
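The test data can be reproduced in spirit with the sketch below; the exact amplitudes, phase ranges and noise variances used in the experiment are not recorded here, so the numbers are only illustrative.

    import numpy as np

    def three_sinewave_clusters(n=150, seed=0):
        """Three partial sine-wave clusters with Gaussian noise of different
        variances (illustrative parameters, not the exact experimental values)."""
        rng = np.random.default_rng(seed)
        data, labels = [], []
        for k, (offset, sigma) in enumerate([(-0.6, 0.02), (0.0, 0.04), (0.6, 0.06)]):
            t = rng.uniform(0.0, np.pi, n)          # a partial period of the sine
            x = t / np.pi - 0.5
            y = 0.4 * np.sin(3.0 * t) + offset
            pts = np.column_stack([x, y]) + rng.normal(scale=sigma, size=(n, 2))
            data.append(pts)
            labels.append(np.full(n, k))
        return np.vstack(data), np.concatenate(labels)

    X, y = three_sinewave_clusters()
    print(X.shape, np.bincount(y))                  # (450, 2) [150 150 150]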
We tested the algorithm with more overlapping clusters. The input data is given in Figure
5-13. We obtained meaningful results when we set the variance of the kernel such that the
pdf estimation shows the structure of the data properly.
Figure 5-12. Output with three clusters
When we say meaningful, we
mean meaningful to the person who created the data, which may not always be what is
given in the data. Another important property of all these results is that all of them represent
a local optimum point of the cost function; it is always possible that another solution
with a lower value of the cost function exists. This time we tested the algorithm with
different variance values. Results are given in Figure 5-14. When the variance is high,
single-point clusters are formed. As we dropped the variance, the results improved, and we
got the result that mimics the data set generation. Points that are isolated may not represent
the structure of the data as well as the other points, so it is possible that those points
will form single-point clusters. This is a basic problem with the cost function proposed. It
is possible to eliminate these points and run the algorithm again.
Figure 5-13. Overlapping boundaries
Another test set, in which the clusters do not have the same number of points, is used to see the
behavior of the algorithm. The input data can be seen in Figure 5-15. The output of the
algorithm can be seen in Figure 5-16. To get more information about the convergence of the
algorithm, we collected some statistics each time an improvement occurred in the algorithm
given in Figure 5-10. The statistics collected are the entropy of each class and the
value of the CEF function. The value of the CEF function can be seen in Figure 5-17. The
plot is the output of the algorithm during the minimization of the data given in Figure 5-15.
The horizontal scale is compressed time and does not reflect the real time passed. At
the beginning of the algorithm the intervals between improvements were short, whereas as
the calculations continued, the time intervals between improvements became longer. The minimum
value is not zero, but a very small value. The entropy of each class is a more interesting
statistic and is given in Figure 5-18. Renyi's entropy is used to calculate the entropy of
each cluster generated after each improvement.
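For reference, a Parzen-window estimate of Renyi's quadratic entropy of a cluster can be computed as H2 = -log( (1/N^2) sum_i sum_j G(x_i - x_j; 2*sigma^2*I) ). The sketch below is our own minimal implementation of this standard estimator; the kernel size and any details used for Figure 5-18 follow the estimator developed earlier in this work and may differ.

    import numpy as np

    def renyi_quadratic_entropy(X, sigma):
        """Parzen-window estimate of Renyi's quadratic entropy,
        H2 = -log( (1/N^2) sum_ij G(x_i - x_j; 2*sigma^2*I) )."""
        X = np.asarray(X, dtype=float)
        n, d = X.shape
        diff = X[:, None, :] - X[None, :, :]                   # pairwise differences
        sq = np.sum(diff ** 2, axis=2)
        norm = (4.0 * np.pi * sigma ** 2) ** (-d / 2.0)        # Gaussian normalizer for 2*sigma^2*I
        potential = norm * np.exp(-sq / (4.0 * sigma ** 2))
        return -np.log(potential.sum() / n ** 2)

    # entropy of each cluster after a label update (illustrative usage)
    # for k in np.unique(labels):
    #     print(k, renyi_quadratic_entropy(X[labels == k], sigma=0.05))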
Figure 5-14. Output with a variance of 0.026
Figure 5-15. Non-symmetric clusters
An interesting plot occurs when we plot the inverse of the entropies, which can be seen
in Figure 5-19. Although the clusters are not symmetrical, the inverse entropies are almost
a mirror reflection of each other.
Figure 5-16. Output with variance 0.021
Figure 5-17. CEF function
Figure 5-18. Entropies of each cluster
Figure 5-19. Inverse entropies of each cluster
When we add them up we get the plot in Figure 5-20. The
interesting thing about this figure is that the sum of the inverse entropies decreases in
parallel with the CEF function.
Figure 5-20. Sum of inverse entropies
5.5 Comparison
We would like to collect the performance measures of the algorithm in several tables,
where the algorithm is compared to a k-means clustering algorithm and to a neural net-
work based classification algorithm. It should be mentioned that it is meaningful to com-
pare our algorithm with a neural network if the data set contain valleys) between clusters.
Otherwise a neural network with enough neurons is capable of separating the samples
even when there is no valley between them because of the supervised learning scheme. To
be able to use a supervised classification algorithm, we assume that the data is generated
with known labels. We used the data in Figure 5-12, Figure 5-13 and Figure 5-15. We used
one hidden layer neural network topology with 10 hidden nodes and two output nodes.
The result shows that the proposed algorithm is superior to the k-means algorithm and
although it is an unsupervised method, performed equally with a supervised classification
algorithm.
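When an unsupervised clustering is scored against known labels, as in the tables that follow, the arbitrary cluster indices must first be matched to the class labels. The helper below is one plain way to do this, matching by exhaustive permutation; it is our own illustrative code, not necessarily the procedure used to produce the tables.

    import numpy as np
    from itertools import permutations

    def confusion_matrix(true_labels, cluster_labels, n_classes):
        """Confusion matrix with cluster indices permuted to best match the
        true classes (exhaustive over permutations; fine for 2-3 classes)."""
        true_labels = np.asarray(true_labels)
        cluster_labels = np.asarray(cluster_labels)
        best_cm, best_correct = None, -1
        for perm in permutations(range(n_classes)):
            mapped = np.array(perm)[cluster_labels]
            cm = np.zeros((n_classes, n_classes), dtype=int)
            for t, m in zip(true_labels, mapped):
                cm[t, m] += 1
            if np.trace(cm) > best_correct:
                best_cm, best_correct = cm, np.trace(cm)
        return best_cm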
Table 5-1. Results for k-means algorithm
DATA SET1   Class 1   Class 2   Class 3
Class 1         110        40         0
Class 2          25        90        35
Class 3           2        36       112
Table 5-2. Results for k-means algorithm
DATA SET2   Class 1   Class 2
Class 1         350         0
Class 2          69       221
Table 5-3. Results for k-means algorithm
DATA SET3   Class 1   Class 2
Class 1         130        20
Class 2          31       319
Table 5-4. Results for supervised classification algorithm
DATA SET1   Class 1   Class 2   Class 3
Class 1         150         0         0
Class 2           0       150         0
Class 3           0         0       150
Table 5-5. Results for supervised classification algorithm
DATA SET2   Class 1   Class 2
Class 1         342         8
Class 2           9       341
Table 5-6. Results for supervised classification algorithm
DATA SET3   Class 1   Class 2
Class 1         144         6
Class 2           9       341
Table 5-7. Results for the proposed algorithm
DATA SET1   Class 1   Class 2   Class 3
Class 1         150         0         0
Class 2           0       150         0
Class 3           0         0       150
Table 5-8. Results for the proposed algorithm
DATA SET2   Class 1   Class 2
Class 1         340        10
Class 2          12       338
Table 5-9. Results for the proposed algorithm
DATA SET3   Class 1   Class 2
Class 1         145         5
Class 2           9       341