Group Title: new clustering algorithm for segmentation of magnetic resonance images
 Material Information
Title: A new clustering algorithm for segmentation of magnetic resonance images
Physical Description: Book
Language: English
Creator: Gokcay, Erhan, 1963-
Publisher: State University System of Florida
Place of Publication: Florida
Publication Date: 2000
Copyright Date: 2000
Subject: Magnetic resonance imaging   ( lcsh )
Imaging systems in medicine   ( lcsh )
Cluster analysis -- Data processing   ( lcsh )
Computer and Information Science and Engineering thesis, Ph. D   ( lcsh )
Dissertations, Academic -- Computer and Information Science and Engineering -- UF   ( lcsh )
Genre: government publication (state, provincial, territorial, dependent)   ( marcgt )
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )
Summary: ABSTRACT: The major goal of this dissertation is to present a new clustering algorithm using information theoretic measures and apply the algorithm to segment Magnetic Resonance (MR) Images. Since MR images are highly variable from subject to subject, data driven segmentation methods seem appropriate. We developed a new clustering evaluation function based on information theory that outperforms previous clustering algorithms, and the new cost function works as a valley seeking algorithm. Since optimization of the clustering evaluation function is difficult because of its stepwise nature and existence of local minima, we developed an improvement on the K-change algorithm used commonly in clustering problems. When applied to nonlinearly separable data, the algorithm performed with very good results, and was able to find the nonlinear boundaries between clusters without supervision. The clustering algorithm is applied to segment brain MR images with successful results. A feature set is created from MR images using entropy measures of small blocks from the input image. Clustering the whole brain image is computationally intensive. Therefore, a small section of the brain is first used to train the clustering algorithm. Afterwards, the rest of the brain is clustered using the results obtained from the training image by using the distance measure proposed. The algorithm is easy to apply and the calculations are simplified by choosing a proper distance measure which does not require numerical integration.
Summary: KEYWORDS: clustering, segmentation, MRI, unsupervised, entropy, information theory, distance measure
Thesis: Thesis (Ph. D.)--University of Florida, 2000.
Bibliography: Includes bibliographical references (p. 119-132).
System Details: System requirements: World Wide Web browser and PDF reader.
System Details: Mode of access: World Wide Web.
Statement of Responsibility: by Erhan Gokcay.
General Note: Title from first page of PDF file.
General Note: Document formatted into pages; contains viii, 133 p.; also contains graphics.
General Note: Vita.
 Record Information
Bibliographic ID: UF00100688
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: oclc - 45825736
alephbibnum - 002639399
notis - ANA6225



Full Text







Copyright 2000


Erhan Gokcay


First and foremost I wish to thank my advisor, Dr. Jose Principe. He allowed me the freedom to explore, while at the same time providing invaluable insight without which this dissertation would not have been possible.

I also wish to thank the members of my committee, Dr. John Harris, Dr. Christiana Leonard, Dr. Joseph Wilson, and Dr. William Edmonson, for their insightful comments, which improved the quality of this dissertation.

I also wish to thank my wife Didem and my son Tugra for their patience and support during the long nights I have been working.



ACKNOWLEDGEMENTS ........ iii

ABSTRACT ........ vii

1 INTRODUCTION ........ 1

1.1 Magnetic Resonance Image Segmentation ........ 1
1.2 Image Formation in MRI ........ 2
1.3 Characteristics of Medical Imagery ........ 3
1.4 Segmentation of MR Images ........ 4
1.4.1 Feature Extraction ........ 4
1.4.2 Gray Scale Single Image Segmentation ........ 4
1.4.3 Multispectral Segmentation ........ 5
1.5 Validation ........ 7
1.5.1 MRI Contrast Methods ........ 8
1.5.2 Validation Using Phantoms ........ 8
1.5.3 Validation Using MRI Simulations ........ 9
1.5.4 Manual Labeling of MR Images ........ 10
1.5.5 Brain Development During Childhood ........ 11
1.6 Motivation ........ 12
1.7 Outline ........ 13

2.1 Classical Methods ........ 15
2.1.1 Discussion ........ 15
2.1.2 Clustering Criterion ........ 18
2.1.3 Similarity Measures ........ 19
2.2 Criterion Functions ........ 20
2.2.1 The Sum-of-Squared-Error Criterion ........ 20
2.2.2 The Scatter Matrices ........ 21
2.3 Clustering Algorithms ........ 23
2.3.1 Iterative Optimization ........ 23
2.3.2 Merging and Splitting ........ 24
2.3.3 Neighborhood Dependent Methods ........ 24
2.3.4 Hierarchical Clustering ........ 25
2.3.5 Nonparametric Clustering ........ 26
2.4 Mixture Models ........ 27
2.4.1 Maximum Likelihood Estimation ........ 28
2.4.2 EM Algorithm ........ 29
2.5 Competitive Networks ........ 31
2.6 ART Networks ........ 33
2.7 Conclusion ........ 34

3 ENTROPY AND INFORMATION THEORY ........ 36

3.1 Introduction ........ 36
3.2 Maximum Entropy Principle ........ 37
3.2.1 Mutual Information ........ 38
3.3 Divergence Measures ........ 39
3.3.1 The Relationship to Maximum Entropy Measure ........ 41
3.3.2 Other Entropy Measures ........ 41
3.3.3 Other Divergence Measures ........ 42
3.4 Conclusion ........ 43

4 CLUSTERING EVALUATION FUNCTIONS ........ 45

4.1 Introduction ........ 45
4.2 Density Estimation ........ 45
4.2.1 Clustering Evaluation Function ........ 47
4.2.2 CEF as a Weighted Average Distance ........ 48
4.2.3 CEF as a Bhattacharya Related Distance ........ 49
4.2.4 Properties as a Distance ........ 51
4.2.5 Summary ........ 52
4.3 Multiple Clusters ........ 53
4.4 Results ........ 55
4.4.1 A Parameter to Control Clustering ........ 57
4.4.2 Effect of the Variance to the Pdf Function ........ 59
4.4.3 Performance Surface ........ 61
4.5 Comparison of Distance Measures in Clustering ........ 62
4.5.1 CEF as a Distance Measure ........ 64
4.5.2 Sensitivity Analysis ........ 65
4.5.3 Normalization ........ 66

5 OPTIMIZATION ALGORITHM ........ 68

5.1 Introduction ........ 68
5.2 Combinatorial Optimization Problems ........ 68
5.2.1 Local Minima ........ 71
5.2.2 Simulated Annealing ........ 72
5.3 The Algorithm ........ 73
5.3.1 A New Neighborhood Structure ........ 74
5.3.2 Grouping Algorithm ........ 76
5.3.3 Optimization Algorithm ........ 79
5.3.4 Convergence ........ 83
5.4 Preliminary Result ........ 83
5.5 Comparison ........ 89

6 APPLICATIONS ........ 95

6.1 Implementation of the IMAGETOOL Program ........ 95
6.1.1 PVWAVE Implementation ........ 95
6.1.2 Tools Provided With the System ........ 96
6.2 Testing on MR Images ........ 99
6.2.1 Feature Extraction ........ 99
6.2.2 Test Image ........ 100
6.3 Validation ........ 104
6.3.1 Brain Surface Extraction ........ 105
6.3.2 Segmentation ........ 106
6.3.3 Segmentation Results ........ 108
6.4 Results ........ 114

7 CONCLUSIONS AND FUTURE WORK ........ 117

REFERENCES ........ 119

BIOGRAPHICAL SKETCH ........ 133

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Erhan Gokcay

August 2000

Chairman: Dr. Jose C. Principe
Major Department: Computer and Information Science and Engineering

The major goal of this dissertation is to present a new clustering algorithm using information theoretic measures and apply the algorithm to segment Magnetic Resonance (MR) images. Since MR images are highly variable from subject to subject, data driven segmentation methods seem appropriate. We developed a new clustering evaluation function based on information theory that outperforms previous clustering algorithms, and the new cost function works as a valley seeking algorithm. Since optimization of the clustering evaluation function is difficult because of its stepwise nature and the existence of local minima, we developed an improvement on the K-change algorithm used commonly in clustering problems. When applied to nonlinearly separable data, the algorithm performed with very good results, and was able to find the nonlinear boundaries between clusters without supervision.

The clustering algorithm is applied to segment brain MR images with successful results. A feature set is created from MR images using entropy measures of small blocks from the input image. Clustering the whole brain image is computationally intensive. Therefore, a small section of the brain is first used to train the clustering algorithm. Afterwards, the rest of the brain is clustered using the results obtained from the training image by using the distance measure proposed.

The algorithm is easy to apply, and the calculations are simplified by choosing a proper distance measure which does not require numerical integration.


1.1 Magnetic Resonance Image Segmentation

Segmentation of medical imagery is a challenging task due to the complexity of the images, as well as to the absence of models of the anatomy that fully capture the possible deformations in each structure. Brain tissue is a particularly complex structure, and its segmentation is an important step for the derivation of computerized anatomical atlases, as well as for pre- and intra-operative guidance for therapeutic intervention.

MRI segmentation has been proposed for a number of clinical investigations of varying complexity. Measurements of tumor volume and its response to therapy have used image gray scale methods as applied to X-ray Computerized Tomography (CT) or simple MRI datasets [Cli87]. However, the differentiation of tissues within tumors that have similar MRI characteristics, such as edema, necrotic, or scar tissue, has proven to be important in the evaluation of response to therapy, and hence, multispectral methods have been proposed [Van91a] [Cla93]. Recently, multimodality approaches, such as positron emission tomography (PET) and functional magnetic resonance imaging (fMRI) studies using radiotracers [Tju94] or contrast materials [Tju94] [Buc91], have been suggested to provide better tumor tissue specification and to identify active tumor tissue. Hence, segmentation methods need to include these additional image data sets. In the same context, a similar progression of segmentation methods is evolving for the planning of surgical procedures, primarily in neurological investigations [Hil93] [Zha90] [Cli91], surgery simulations [Hu90] [Kam93], or the actual implementation of surgery in the operating suite, where both normal tissues and the localization of the lesion or mass need to be accurately identified. The methods proposed include gray scale image segmentation and multispectral segmentation for anatomical images, with additional recent efforts directed toward the mapping of functional metrics (fMRI, EEG, etc.) to provide locations of important functional regions of the brain as required for optimal surgical planning.

Other applications of MRI segmentation include the diagnosis of brain trauma, where white matter lesions, a signature of traumatic brain injury, may potentially be identified in moderate and possibly mild cases. These methods, in turn, may require correlation of anatomical images with functional metrics to provide sensitive measurements of brain trauma. MRI segmentation methods have also been useful in the diagnostic imaging of multiple sclerosis [Wal92], including the detection of lesions [Raf90] and the quantitation of lesion volume using multispectral methods [Jac93].

In order to understand the issues in medical image segmentation, in contrast with segmentation of, say, images of indoor environments, which are the kind of images with which general purpose visual segmentation systems deal, we need an understanding of the salient characteristics of medical imagery.

One application of our clustering algorithm is to map and identify important brain structures, which may be useful in brain surgery.

1.2 Image Formation in MRI

MRI exploits the inherent magnetic moment of certain atomic nuclei. The nucleus of the hydrogen atom (the proton) is used in biologic tissue imaging due to its abundance in the human body and its large magnetic moment. When the subject is positioned in the core of the imaging magnet, protons in the tissues experience a strong static magnetic field and precess at a characteristic frequency that is a function solely of the magnetic field strength, and does not depend, for instance, on the tissue to which the proton belongs. An excitation magnetic field is applied at this characteristic frequency to alter the orientation of precession of the protons. The protons relax to their steady state after the excitation field is stopped. MRI is useful because protons in different tissues relax to their steady state at different rates. MRI essentially measures the components of the magnitude vector of the precession orientation at different times and thus differentiates tissues. These measures are encoded in 3D using methods for slice selection, frequency encoding, and phase encoding. Slice selection is performed by exciting thin cross-sections of tissue one at a time. Frequency encoding is achieved by varying the returned frequency of the measured signal, and phase encoding is done by spatially varying the returned phase of the measured signal.
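The characteristic precession frequency mentioned above is the Larmor frequency, which is simply proportional to the static field strength. A minimal sketch (the constant for hydrogen is the standard gyromagnetic ratio; the helper name is ours):

```python
# Larmor relation: protons precess at f = gamma * B0, where gamma is the
# gyromagnetic ratio of the nucleus (about 42.58 MHz/T for hydrogen) and
# B0 is the static field strength. The frequency depends only on B0,
# not on the tissue the proton belongs to.
GAMMA_HYDROGEN_MHZ_PER_T = 42.58

def larmor_frequency_mhz(b0_tesla: float) -> float:
    """Precession frequency in MHz for protons in a field of b0_tesla."""
    return GAMMA_HYDROGEN_MHZ_PER_T * b0_tesla

# At a common clinical field strength of 1.5 T, protons precess near 63.9 MHz.
```

The excitation field is applied at exactly this frequency, which is why it can selectively tip the protons' precession.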

1.3 Characteristics of Medical Imagery

While the nature of medical imagery allows a segmentation system to ignore issues such as illumination and pose determination that would be important to a more general purpose segmentation system, there are other issues, which will be briefly discussed below. The objects to be segmented from medical imagery are actual anatomical structures, which are often non rigid and complex in shape, and exhibit considerable variability from person to person. This, combined with the absence of explicit shape models that capture the deformations in anatomy, makes the segmentation task challenging. Magnetic resonance images are further complicated by limitations in the imaging equipment: inhomogeneities in the receiver or transmitter coils lead to a non-linear gain artifact in the images, and large differences in the magnetic susceptibilities of adjacent tissues lead to distortion of the gradient magnetic field, and hence a spatial susceptibility artifact in the images. In addition, the signal is degraded by motion artifacts that may appear in the images due to movement of the subject during the scan.

1.4 Segmentation of MR images

MR segmentation can be roughly divided into two categories: single image segmentation, where a single 2D or 3D gray scale image is used, and multi-spectral image segmentation, where multiple MR images with different gray scale contrasts are available.

1.4.1 Feature Extraction

Segmentation of MR images is based on sets of features that can be extracted from the images, such as pixel intensities, which in turn can be used to calculate other features such as edges and texture. Rather than using all the information in the images at once, feature extraction and selection breaks down the problem of segmentation to the grouping of feature vectors [Jai89] [Zad92] [Zad96]. Selection of good features is the key to successful segmentation [Pal93]. The focus of this thesis is not feature extraction. Therefore, we will use a simple but effective feature extraction method using entropy measures, and we will not investigate feature extraction further.
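The entropy-of-small-blocks idea can be sketched as follows. This is an illustrative implementation, not the thesis's exact procedure: the block size, histogram binning, and the assumption that intensities are normalized to [0, 1] are all ours.

```python
import numpy as np

def block_entropy_features(image: np.ndarray, block: int = 8, bins: int = 32) -> np.ndarray:
    """Shannon entropy of the gray-level histogram of each non-overlapping
    block of the image; one entropy value per block becomes a feature.
    Assumes intensities normalized to [0, 1]."""
    h, w = image.shape
    feats = []
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            patch = image[i:i + block, j:j + block]
            hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
            p = hist / hist.sum()
            p = p[p > 0]                      # drop empty bins (0 log 0 = 0)
            feats.append(-np.sum(p * np.log2(p)))
    return np.array(feats)
```

A perfectly uniform block has zero entropy, while textured or noisy blocks score higher, so the feature distinguishes homogeneous tissue regions from boundaries and texture.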

1.4.2 Gray Scale Single Image Segmentation

The most intuitive approach to segmentation is global thresholding. One common difficulty with this approach is determining the value of the thresholds. Knowledge guided thresholding methods, where global thresholds are determined based on a "goodness function" describing the separation of background, skull, and brain, have been reported [Van88] [Cla93] [Jac93] [Hal92] [Lia93] [Ger92b]. The method is limited, and its successful application for clinical use is hindered by the variability of anatomy and MR data.
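Global thresholding itself is a one-line operation; the difficulty the text points to is entirely in choosing the thresholds. A minimal sketch (the function name and interval form are ours):

```python
import numpy as np

def global_threshold(image: np.ndarray, t_low: float, t_high: float) -> np.ndarray:
    """Binary segmentation: mark pixels whose intensity lies in [t_low, t_high].
    The hard part in practice is picking t_low and t_high, which is what
    knowledge-guided 'goodness function' approaches attempt to automate."""
    return (image >= t_low) & (image <= t_high)
```

Everything else in a thresholding pipeline (per-tissue intervals, goodness functions over candidate thresholds) is layered on top of this predicate.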

Edge detection [Pel82] [Mar80] [Bru93] schemes suffer from incorrect detection of edges due to noise, over- and under-segmentation, and variability in threshold selection in the edge image [Del91a] [Sah88]. Combination with morphological filtering [Rit96] [Rit87a] [Rit87b] is also reported [Bom90]. Another method is boundary tracing [Ash90], where the operator clicks a pixel in a region to be outlined and the method then finds the boundary starting from that point. It is usually restricted to segmentation of large, well defined structures, and cannot distinguish tissue types.

Seed growing methods are also reported [Rob94], where the segmentation requires an operator to empirically select seeds and thresholds. Pixels around the seed are examined, and included in the region if they are within the thresholds. Each added pixel then becomes a new seed.
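The seed-growing loop just described can be sketched directly; this is an illustrative 4-neighbour version under our own naming, not a reproduction of [Rob94]:

```python
from collections import deque
import numpy as np

def region_grow(image: np.ndarray, seed: tuple, t_low: float, t_high: float) -> np.ndarray:
    """Grow a region from an operator-chosen seed: examine the neighbours of
    each accepted pixel and include those within the intensity thresholds;
    each accepted pixel then acts as a new seed."""
    h, w = image.shape
    region = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    while queue:
        i, j = queue.popleft()
        if not (0 <= i < h and 0 <= j < w) or region[i, j]:
            continue
        if not (t_low <= image[i, j] <= t_high):
            continue
        region[i, j] = True
        queue.extend([(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)])
    return region
```

Note that the result is a connected component around the seed, so disconnected structures with the same intensity are not captured, which is one reason the method needs operator interaction.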

Random field methods have been successfully applied; they require an energy function describing the problem, which is often very difficult to define [Bou94].


1.4.3 Multispectral Segmentation

Supervised methods require a user supplied training set, usually found by drawing regions of interest on the images. In maximum likelihood (ML) methods, where multivariate Gaussian distributions are assumed [Cla93] [Hal92] [Lia94] [Ger92b] [Kar90], statistics such as the mean and covariance matrices are calculated. The remaining pixels are then classified by calculating the likelihood of each tissue class, and picking the tissue type with the highest probability. Parametric methods are useful when the feature distributions for different classes are well known, which is not necessarily the case for MR images [Cla93]. The k nearest neighbor (kNN) method has given superior results both in terms of accuracy and reproducibility compared to parametric methods [Cla93]. Artificial neural networks are also commonly used [Cla93] [Hal92] [Daw91] [Hay94] [Has95] [Hec89] [Ozk93] [Wan98].
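The ML pipeline just described (fit a Gaussian per tissue class from the training regions, then label each remaining pixel by the most likely class) can be sketched as below. The function names are ours, and a real system would use multispectral feature vectors per pixel:

```python
import numpy as np

def fit_gaussian(samples: np.ndarray):
    """Per-class statistics from training pixels: mean vector and covariance."""
    mu = samples.mean(axis=0)
    cov = np.atleast_2d(np.cov(samples, rowvar=False))
    return mu, cov

def log_likelihood(x: np.ndarray, mu: np.ndarray, cov: np.ndarray) -> float:
    """Log of the multivariate Gaussian density at x (constants included)."""
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.inv(cov) @ d + logdet + len(mu) * np.log(2 * np.pi))

def classify(x: np.ndarray, class_stats: dict) -> str:
    """Pick the tissue class with the highest Gaussian likelihood."""
    return max(class_stats, key=lambda c: log_likelihood(x, *class_stats[c]))
```

The Gaussian assumption is exactly the weakness the text notes: when the true feature distributions for MR tissues are not well modeled by Gaussians, nonparametric methods such as kNN tend to do better.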

All supervised methods are operator dependent. Inter- and intra-operator variability has been measured and shown to be relatively large [Cla93] [Ger92b] [Dav96]. For this reason, unsupervised methods may be preferred from the viewpoint of reproducibility.

Unsupervised techniques, usually called clustering, automatically find the structure in the data. A cluster is an area in feature space with a high density. Clustering methods include k-means [Bez93] [Ger92b] [Van88] and its fuzzy equivalent, fuzzy c-means [Bez93] [Hal92] [Phi95]. These methods and their variants are basically limited to finding linearly separable clusters. Another promising development is the use of semi-supervised methods [Ben94b].
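The linear-separability limitation follows from how k-means works: each point goes to the nearest centroid under the Euclidean distance, so the decision boundary between any two clusters is a hyperplane. A minimal sketch (our own plain implementation, for illustration only):

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, iters: int = 50, seed: int = 0):
    """Plain k-means: assign each point to its nearest centroid (Euclidean),
    then recompute each centroid as the mean of its points. Because
    assignment is nearest-centroid, boundaries between clusters are
    hyperplanes, hence the restriction to linearly separable clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for c in range(k):
            if np.any(labels == c):           # skip empty clusters
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```

Two interlocking spirals or concentric rings, for example, cannot be split by such hyperplane boundaries, which motivates the information theoretic cost function developed in this dissertation.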

The expectation-maximization (EM) algorithm is also used in clustering of MR images [Wel96] [Dem77] [Gui97], where the knowledge of tissue class is considered as the missing information. As usual, the method assumes a normal distribution, and it incorporates an explicit bias field, which frequently arises in MR images.

Model based approaches, including deformable models [Dav96] [Coh91], also known as active contour models, provide a method for minimizing an objective function to obtain a contour of interest, especially if an approximate location of the contour is available. A deformable contour is a planar curve which has an initial position and an objective function associated with it. A special class of deformable contours called snakes was introduced by Witkin [Wit88], in which the initial position is specified interactively by the user and the objective function is referred to as the energy of the snake. The snake tries to minimize its energy over time, similar to physical systems. The energy of the snake is expressed as a sum of two components, the internal energy of the snake and the external energy of the snake, which is given as

E_snake = E_internal + E_external   (1.1)

The internal energy term imposes a piecewise smoothness constraint on the snake, and the external energy term is responsible for attracting the snake to interesting features in the image. The balloon model for deformable contours is an extension of the snake model. It modifies the snake energy to include a "balloon" force, which can be either an inflation force or a deflation force. All these methods require operator input to place the snake close to a boundary.
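Written out in the common formulation for a parameterized curve v(s) (this particular functional is the standard one from the snakes literature, not shown in the source; the weighting functions α, β and the image potential E_image are the usual assumed ingredients), the two energy terms of Eq. (1.1) are:

```latex
E_{\mathrm{snake}}
  = E_{\mathrm{internal}} + E_{\mathrm{external}}
  = \int_{0}^{1} \tfrac{1}{2}\!\left(\alpha(s)\,\lvert v'(s)\rvert^{2}
      + \beta(s)\,\lvert v''(s)\rvert^{2}\right) ds
  \;+\; \int_{0}^{1} E_{\mathrm{image}}\!\bigl(v(s)\bigr)\, ds
```

The first integral penalizes stretching (first derivative, weighted by α) and bending (second derivative, weighted by β), matching the piecewise smoothness constraint described above; the second integral pulls the curve toward image features such as edges.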

1.5 Validation

MRI segmentation has been proposed as a method for determining the volume of tissues and their 3D spatial distributions in applications involving diagnostic, therapeutic, or surgical simulation protocols. Some form of quantitative measure of the accuracy and/or reproducibility of the proposed segmentation method is clearly required. Since a direct measure of ground truth is not logistically feasible, or even possible with pathologic correlation, several alternative procedures have been used.

1.5.1 MRI Contrast Methods

The use of MR contrast agents in neuroinvestigations of the brain provides information about whether or not a breakdown of the blood-brain barrier (BBB) has occurred, and on the integrity of the tissue vascularity, both of which are often tumor type- and stage-dependent [Run89] [Bra93] [Hen93] [Bro90]. However, MR contrast may not be optimum for the quantitative differentiation of active tumor tissue, scar tissue, or recurrent tumors. Many segmentation methods, in particular gray scale methods and multispectral methods, use MR contrast information with T1-weighted images for tumor volume or size estimations despite the limitations of these methods in the absence of ground truth determinations [Mcc94] [Gal93]. Recently, the use of multi-modality imaging methods, such as correlation with PET studies, has been proposed to identify active tissues [Tju94]. Alternatively, the use of fMRI measurement of contrast dynamics has been suggested to provide better differentiation of active tumor tissue in neurological investigations, and these functional images could potentially be included in segmentation methods [4,5].

1.5.2 Validation Using Phantoms

The use of phantoms constructed with compartments containing known volumes is widely reported [Cli91] [Jac93] [Koh91] [Ger92b] [Mit94] [Bra94] [Pec92] [Jac90]. The typical phantom represents a very idealized case consisting of two or three highly contrasting classes in a homogeneous background [Koh91] [Ash90] [Jac90]. Phantoms containing paramagnetic agents have been introduced to mimic the MRI parameters of the tissues being modelled [Cli91] [Jac93] [Ger92b]. However, phantoms have not evolved to encompass all the desired features which allow a realistic segmentation validation, namely: a high level of geometric complexity in three dimensions, multiple classes (e.g. representative of white matter, gray matter, cerebrospinal fluid, tumor, background, etc.), and, more importantly, RF coil loading similar to humans and MRI parameter distributions similar to those of human tissue.

The reported accuracy obtained using phantoms is very high for large volumes [Koh91] [Bra94] [Ash90], but decreases as the volume gets smaller [Koh91] [Ger92b]. For a true indication of the maximum obtainable accuracy of the segmentation methods, the phantom volumes should be comparable to the anatomical or pathological structures of interest. In summary, phantoms do not fully exhibit the characteristics that make segmentation of human tissues so difficult. The distributions of MRI parameters for a given tissue class are not necessarily Gaussian or unimodal, and will often overlap for different tissues. The complex spatial distribution of the tissue regions, in turn, may cause the MR image intensity in a given pixel to represent signal from a mix of tissues, commonly referred to as the partial volume artefact. Although phantom images provide an excellent means for daily quality control of the MRI scanner, they can only provide a limited degree of confidence in the reliability of the segmentation methods.

We believe using phantoms would limit the accuracy of the evaluation, because of the limited modelling capabilities of phantoms. Using MR images of the brain itself is more suitable for our purpose.

1.5.3 Validation Using MRI Simulations

Because of the increase in computer speeds, studying MR imaging via computer simulations is very attractive. Several MR signal simulations and the resulting image construction methods can be found in the literature [Bit84] [Odo85] [Sum96] [Bea92] [Hei93] [Pet93] [Sch94]. These simulation methods have so far not been used for the evaluation of segmentation methods, but were used to investigate a wide variety of MR processes, including: optimization of RF pulse techniques [Bit94] [Sum96] [Pet93], the merit of spin warp imaging in the presence of field inhomogeneities and gradient field phase-encoding with wavelet encoding [Bea92], and noise filtering [Hei93].

In summary, simulation methods can be extended to include MRI segmentation analysis. The robustness of the segmentation process may be probed by corrupting the simulated signal with noise, nonlinear field gradients, or, more importantly, nonuniform RF excitation. In this fashion, one source of signal uncertainty can be introduced at a time, and the resulting segmentation uncertainty can be related to the signal source uncertainty in a quantifiable manner.

1.5.4 Manual Labeling of MR Images

Some validation methods have let experts manually trace the boundaries of the different tissue regions [Van91a] [Del91b] [Zij93] [Vel94]. The major advantage of the manual labeling technique is that it truly mimics the radiologist's interpretation, which realistically is the only "valid truth" available for in vivo imaging. However, there is considerable variation among operators [Ger92b] [Eil90], limiting "ground truth" determinations. Furthermore, manual labeling is labor intensive and currently cannot be feasibly performed for large numbers of image data sets. Improvements in the area of manual labeling may be found by interfacing locally operating segmentation techniques with manual improvements. As manual labeling allows an evaluation most closely related to the radiologists' opinion, these improvements deserve further investigation.

1.5.5 Brain Development During Childhood

Normal brain development during childhood is a complex and dynamic process for which detailed scientific information is lacking. Several studies have been done to investigate the volumetric analysis of the brain during childhood [Van91b] [Puj93] [Gie99] [Rei96] [Ben94a] [Bro87]. Prominent, age-related changes in gray matter, white matter, and CSF volumes are evident during childhood and appear to reflect ongoing maturation and remodelling of the central nervous system. There is little change in cerebral volume after the age of 5 years in either male or female subjects [Rei96] [Gie99]. After removing the effect of total cerebral volume, age was found to predict a significant proportion of the variance in cortical gray matter and cerebral white matter volume, such that increasing age was associated with decreasing gray matter volume and increasing white matter volume in children. The change in CSF was found to be very small with increasing age, as shown in Figure 1-1 [Rei96], so that the brain volume is not a determining factor in the volumes of gray and white matter. The change in gray matter is about -1% per year for boys and -0.92% per year for girls. The change in white matter is about +0.093% for boys and +0.072% for girls [Rei96]. We will use this method to quantify our clustering method, since detecting a 1% change is a good indicator to evaluate a segmentation method. We expect to find a comparable percentage using the proposed clustering algorithm. Although this method will not verify the segmentation of individual structures, it still is a good method, because the change is very small and difficult to show.

1. Figure 1-1 is reprinted with the permission of Oxford University Press.

[Plot not reproduced; axes: age in years (4-18) versus total cerebral volume.]

Figure 1-1. Total cerebral volume in children in age from 5 to 17 years

1.6 Motivation

Segmentation of MR images is considered to be difficult because of the non-rigid and complex shapes of anatomical structures. The high variability among patients, and even within the same scan, makes it difficult for model based approaches to segment MR images. We believe data based approaches are more appropriate for MR images because of the complexity of anatomical structures.

Many clustering algorithms have been proposed to solve segmentation problems in MR images, and they are reviewed in Chapter 2. A common problem with these segmentation algorithms is that they depend on the Euclidean distance measure to separate the clusters. Deformable models are excluded from this reasoning, but they must be placed near the boundary, and we will consider only data-driven methods because of the complexity of brain images. The Euclidean distance has limited capacity to separate nonlinearly separable clusters. To distinguish nonlinearly separable clusters, more information about the structure of the data must be obtained. On the other hand, because of the missing label assignments, clustering is a more computationally intensive operation than classification. More information about the structure should therefore be collected without introducing complicated calculations.

The motivation of this dissertation is to develop a new cost function for clustering that can be used with nonlinearly separable clusters. Such a method should be computationally feasible and simple to calculate. The proposed method does not require any numerical integration, and uses information theory to collect more information about the data. The stepwise nature of the cost function required us to develop an improved version of the K-change algorithm [Dud73].

1.7 Outline

In Chapter 2, the basic clustering algorithms are reviewed. There are many variations of these algorithms, but the basic principles stay the same. Many of them cannot be used with nonlinearly separable clusters, and the ones that can, like the valley seeking algorithm [Fug90], suffer from generating more clusters than actually exist when the distributions are not unimodal.

Chapter 3 covers the basics of information theory and entropy based distance mea-

sures. Many of these calculations require numerical methods which increase the already

high computational cost of clustering algorithms. Therefore, we propose a different distance measure, which does not require numerical integration and is simple to calculate.

Chapter 4 covers the proposed clustering evaluation function and gives some initial results demonstrating the power of the cost function, which is capable of separating nonlinearly combined clusters.

Chapter 5 focuses on the optimization algorithm that minimizes the cost function

developed in Chapter 4. We propose an improvement to the K-change algorithm by intro-

ducing a changing group size scheme.

In Chapter 6, applications of MR image segmentation are tested and discussed, and

Chapter 7 includes the conclusion and the discussion for future research.


2.1 Classical Methods

2.1.1 Discussion

There are many important applications of pattern recognition, which include a wide

range of information processing problems of great practical significance, from speech rec-

ognition and the classification of handwritten characters, to fault detection and medical

diagnosis. The discussion here provides the basic elements of clustering; there are many variations of these ideas in the literature, so we try to investigate the basic algorithms.


Clustering [Har85] is an unsupervised way of data grouping using a given measure of

similarity. Clustering algorithms attempt to organize unlabeled feature vectors into clus-

ters or "natural groups" such that samples within a cluster are more similar to each other

than to samples belonging to different clusters. Since there is no information given about

the underlying data structure or the number of clusters, there is no single solution to clustering, nor is there a single similarity measure to differentiate all clusters. For this reason there is no theory that describes clustering uniquely.

Pattern classification can be divided into two areas depending on the external knowl-

edge about the input data. If we know the labels of our input data, the pattern recognition

problem is considered supervised. Otherwise the problem is called unsupervised. Here we

will only cover statistical pattern recognition. There are several ways of handling the prob-

lem of pattern recognition if the labels are given a priori. Since we know the labels, the

problem reduces to finding features of the data set with the known labels, and to build a

classifier using these features. Bayes' rule shows how to calculate the posterior probability from the a priori probability. Assume that we know the a priori probabilities P(c_i) and the conditional densities p(x|c_i). When we measure x, we can calculate the posterior probability P(c_i|x) as shown in (2.1).

P(c_i|x) = p(x|c_i) P(c_i) / p(x)    (2.1)

where the evidence p(x) is

p(x) = Σ_{i=1}^{L} p(x|c_i) P(c_i)    (2.2)
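As a concrete illustration, the posterior computation of (2.1)-(2.2) can be sketched in a few lines of Python; the priors and likelihood values below are hypothetical numbers chosen for the example, not taken from this work.

```python
import numpy as np

def posterior(priors, likelihoods):
    """Bayes' rule, eq. (2.1): P(c_i|x) = p(x|c_i)P(c_i) / p(x)."""
    joint = np.asarray(priors, float) * np.asarray(likelihoods, float)
    return joint / joint.sum()  # the denominator is the evidence p(x), eq. (2.2)

# Hypothetical two-class example: priors P(c_i) and the densities p(x|c_i)
# evaluated at a single measurement x.
post = posterior([0.3, 0.7], [2.0, 0.5])
```

The posteriors always sum to one because the evidence p(x) of (2.2) normalizes the joint terms.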

In the case of unsupervised classification, or clustering, we do not have the labels, which complicates the problem. The clustering problem is not well defined unless the resulting clusters are required to have certain properties. The fundamental problem in clustering is how to choose these properties. Once we have a suitable definition of a cluster, it is possible to evaluate the validity of the resulting clustering using standard statistical validation techniques.

There are two basic approaches to clustering, which we call parametric and nonparametric. If the purpose of unsupervised learning is data description, then we can assume a predefined distribution function for the data set and calculate the sufficient statistics which will describe the data set in a compact way. For example, if we assume that the data set comes from a normal distribution N(M, Σ), which is defined as

N_X(M, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(-(1/2)(X - M)^T Σ^{-1} (X - M))    (2.3)

the sufficient statistics are the sample mean M = E{X} and the sample covariance matrix Σ = E{(X - M)(X - M)^T}, which will describe the distribution perfectly. Unfortunately, if

the data set is not distributed according to our choice, then the statistics can be very mis-

leading. Another approach uses a mixture of distributions to describe the data [Mcl88]

[Mcl96] [Dem77]. We can approximate virtually any density function in this way, but esti-

mating the parameters of a mixture is not a trivial operation. And the question of how to

separate the data set into different clusters is still unanswered, since estimating the distri-

bution does not tell us how to divide the data set into clusters. If we are using the first

approach, namely fitting one distribution function to each cluster, then the clustering can

be done by trying to estimate the parameters of the distributions. If we are using the mix-

ture of distributions approach, then clustering is very loosely defined. Assume that we

have more mixtures than the number of clusters. The model does not tell us how to com-

bine the mixtures to obtain the desired clustering.
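For the single-Gaussian description above, the sufficient statistics of (2.3) are easy to estimate; a minimal sketch in Python with NumPy, using synthetic data with hypothetical parameters rather than any data from this work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data drawn from a single Gaussian cluster (hypothetical parameters).
X = rng.normal(loc=[1.0, -2.0], scale=0.5, size=(1000, 2))

# Sufficient statistics of the normal model N(M, Sigma) of eq. (2.3):
M = X.mean(axis=0)               # sample mean vector
Sigma = np.cov(X, rowvar=False)  # sample covariance matrix
```

If the data are not in fact Gaussian, these two statistics can be badly misleading, which is exactly the weakness noted above.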

Another approach to clustering is to group the data set into groups of points which possess strong internal similarities [Dud73] [Fug70]. To measure the similarities we use a criterion function and seek the grouping that finds the extreme point of the criterion function. For this kind of algorithm we need a cost function to evaluate how well the clustering fits the data, and an algorithm to minimize the cost function. For a given clustering problem, the input data X is fixed. The clustering algorithm varies only the sample assignment C, which means that the minimization algorithm will change only C. Because of the discrete and unordered nature of C, classical steepest descent search algorithms cannot be applied easily.

2.1.2 Clustering Criterion

We will define the clustering problem as follows. Assume that we have N samples, x_1, ..., x_N. At this moment we assume that the samples are not random variables, since once the samples are fixed by the clustering algorithm, they are not random variables anymore. The problem is to place each sample into one of L clusters, w_1, ..., w_L, where L is assumed to be given. The cluster to which the ith sample is assigned is denoted by w_k(i), where k(i) is an integer in 1...L and i = 1...N. A clustering C is a vector made up of the w_k(i), and X is a vector made up of the x_i, that is,

C = [w_k(1), ..., w_k(N)]    (2.4)

X = [x_1, ..., x_N]    (2.5)

The clustering criterion J is a function of C and X, and can be written as

J(C, X) = J(w_k(1), ..., w_k(N); x_1, ..., x_N)    (2.6)

The best clustering C_0 should satisfy

J(C_0, X) = min_C J(C, X)  or  max_C J(C, X)    (2.7)

depending on the criterion. Only minimization will be considered, since maximization can always be converted to minimization.

2.1.3 Similarity Measures

In order to apply a clustering algorithm we have to define how to measure similarity between samples. The most obvious measure of the similarity between two samples is the distance between them. The L_p norm is the generalized distance measure, where p = 2 corresponds to the Euclidean distance [Ben66] [Gre74]. The L_p norm between two vectors of size N is given as

L_p(x_1, x_2) = (Σ_{i=1}^{N} |x_1(i) - x_2(i)|^p)^{1/p}    (2.8)

If this distance is a good measure of similarity, then we would expect the distance between samples in the same cluster to be significantly less than the distance between samples in different clusters.

Another way to measure the similarity between two vectors is the normalized inner product, which is given as

S(x_1, x_2) = x_1^T x_2 / (||x_1|| ||x_2||)    (2.9)

This measure is the cosine of the angle between the two vectors.
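Both similarity measures are straightforward to compute; a short sketch in Python (the function names are ours):

```python
import numpy as np

def lp_distance(x1, x2, p=2):
    """L_p norm of eq. (2.8); p = 2 gives the Euclidean distance."""
    diff = np.abs(np.asarray(x1, float) - np.asarray(x2, float))
    return np.sum(diff ** p) ** (1.0 / p)

def cosine_similarity(x1, x2):
    """Normalized inner product of eq. (2.9): the cosine of the angle."""
    x1 = np.asarray(x1, float)
    x2 = np.asarray(x2, float)
    return (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
```

For example, lp_distance([0, 0], [3, 4]) gives the Euclidean distance 5, while cosine_similarity returns 1 for parallel vectors and 0 for orthogonal ones.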

2.2 Criterion Functions

2.2.1 The Sum-of-Squared-Error Criterion

One of the simplest and most widely used error criteria is the sum-of-squared-error criterion. Let N_i be the number of samples in X_i and let m_i be the mean of those samples,

m_i = (1/N_i) Σ_{x ∈ X_i} x    (2.10)

Then the criterion can be defined as

J = Σ_{i=1}^{L} Σ_{x ∈ X_i} ||x - m_i||^2    (2.11)

which means that the mean vector m_i is the best representation of the samples in X_i, and the clustering is achieved by minimizing the squared error vectors x - m_i. The error function J measures the total squared error when N samples are represented by L cluster centers m_1, ..., m_L. The value of J depends on how the samples are distributed among the cluster centers. This kind of clustering is often called minimum-variance partitions [Dud73]. It works well when the clusters are compact regions that are well separated from each other, but it gives unexpected results when the distance between clusters is comparable to the size of the clusters. An equivalent expression can be obtained by eliminating the mean vectors, as in

J = Σ_{i=1}^{L} n_i s̄_i    (2.12)

where

s̄_i = (1 / (2 n_i^2)) Σ_{x_1 ∈ X_i} Σ_{x_2 ∈ X_i} ||x_1 - x_2||^2    (2.13)

The above expression shows that the sum-of-squared-error criterion uses the Euclidean distance to measure similarity [Dud73]. We can derive different criterion functions by replacing s̄_i with other similarity functions S(x_1, x_2).
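The equivalence of (2.11) and the pairwise form (2.12)-(2.13) is easy to check numerically; a sketch assuming each cluster is given as a NumPy array of samples:

```python
import numpy as np

def sse_criterion(clusters):
    """Sum-of-squared-error J of eq. (2.11)."""
    return sum(((Xi - Xi.mean(axis=0)) ** 2).sum() for Xi in clusters)

def sse_pairwise(clusters):
    """Mean-free form of eqs. (2.12)-(2.13), using pairwise distances only."""
    J = 0.0
    for Xi in clusters:
        ni = len(Xi)
        diff = Xi[:, None, :] - Xi[None, :, :]        # all pairwise differences
        s_bar = (diff ** 2).sum() / (2.0 * ni ** 2)   # eq. (2.13)
        J += ni * s_bar                               # eq. (2.12)
    return J
```

Because (2.13) needs only pairwise distances, the same structure allows other similarity functions S(x_1, x_2) to be substituted, as noted above.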

2.2.2 The Scatter Matrices

In discriminant analysis, within-class, between-class and mixture scatter matrices are used to measure and formulate class separability. These matrices can be combined in different ways to be used as a criterion function. Let's make the following definitions:

Mean vector for the ith cluster

m_i = (1/n_i) Σ_{x ∈ X_i} x    (2.14)

Total mean vector

m = (1/N) Σ_{x ∈ X} x = (1/N) Σ_{i=1}^{L} n_i m_i    (2.15)

Scatter matrix for the ith cluster

S_i = Σ_{x ∈ X_i} (x - m_i)(x - m_i)^T    (2.16)

Within-cluster scatter matrix

S_W = Σ_{i=1}^{L} S_i    (2.17)

Between-cluster scatter matrix

S_B = Σ_{i=1}^{L} n_i (m_i - m)(m_i - m)^T    (2.18)

Total scatter matrix

S_T = S_W + S_B    (2.19)

The following criterion functions can be defined [Fug90][Dud73]

J_1 = tr(S_2^{-1} S_1)    (2.20)

J_2 = ln |S_2^{-1} S_1|    (2.21)

J_3 = tr(S_1) - μ (tr(S_2) - c)    (2.22)

J_4 = tr(S_1) / tr(S_2)    (2.23)

where S_1 and S_2 are one of S_W, S_B, or S_T. Some combinations are invariant under any nonsingular linear transformation and some are not. These functions are not universally applicable, and this is a major flaw of these criterion functions. Once the function is determined, the clustering that the function will provide is fixed in terms of parameters. We are assuming we can reach the global extreme point, which may not always be the case. If the function does not provide good results with a certain data set, no parameter is readily available to change the behavior of the clustering output. The only parameter we can change is the function itself, which may be difficult to set. Another limitation of the criterion functions mentioned above is the fact that they are basically second order statistics. In the coming chapters we will provide a new clustering function using information theoretic measures and a practical way to calculate and optimize the resulting function, which gives us a way to control the clustering behavior of the criterion. There are other measures using entropy and information theory to measure cluster separability, which will be covered in the next chapter. These methods will be the basis for our clustering function.
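The scatter matrices of (2.14)-(2.19) and a simple criterion such as J_4 of (2.23) can be sketched as follows (the helper names are ours):

```python
import numpy as np

def scatter_matrices(clusters):
    """Within-, between- and total scatter matrices, eqs. (2.14)-(2.19)."""
    X = np.vstack(clusters)
    m = X.mean(axis=0)                               # total mean, eq. (2.15)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for Xi in clusters:
        mi = Xi.mean(axis=0)                         # cluster mean, eq. (2.14)
        D = Xi - mi
        Sw += D.T @ D                                # eqs. (2.16)-(2.17)
        Sb += len(Xi) * np.outer(mi - m, mi - m)     # eq. (2.18)
    return Sw, Sb, Sw + Sb                           # S_T = S_W + S_B, eq. (2.19)

def j4(S1, S2):
    """Criterion J_4 = tr(S_1) / tr(S_2), eq. (2.23)."""
    return np.trace(S1) / np.trace(S2)
```

For well-separated compact clusters, j4(S_B, S_W) is large; as the clusters overlap it shrinks, which is the second-order-statistics behavior criticized above.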

2.3 Clustering Algorithms

2.3.1 Iterative Optimization

The input data are finite; therefore there are only a finite number of possible partitions. In theory, the clustering criterion can always be solved by exhaustive enumeration. However, in practice such an approach is not feasible, since the number of iterations grows exponentially with the number of clusters and the sample size; the number of different solutions is given approximately by L^N / L!.

The basic idea in iterative optimization is to find an initial partition and to move samples from one group to another if such a move improves the value of the criterion function. In general this procedure guarantees only local optimization, and different initial points will give different results. The simplicity of the method usually overcomes these limitations in most problems. In the following chapters we will improve this optimization method to obtain better results and use it in our algorithm.
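A minimal sketch of this iterative-optimization scheme, in the nearest-mean form discussed later in this chapter; the initial means are supplied explicitly to keep the example deterministic, and all names are ours:

```python
import numpy as np

def iterative_optimization(X, init_means, n_iter=100):
    """Reassign each sample to its nearest cluster mean, recompute the means,
    and repeat until no sample moves: a local minimizer of eq. (2.11)."""
    means = np.asarray(init_means, float).copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                    # no move improves the criterion: stop
        labels = new_labels
        for k in range(len(means)):
            if np.any(labels == k):  # leave the mean of an empty cluster alone
                means[k] = X[labels == k].mean(axis=0)
    return labels, means
```

Different initial partitions generally lead to different local optima, which is exactly the limitation noted above.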

2.3.2 Merging and Splitting

After a number of clusters are obtained, it is possible to merge certain clusters or split

certain clusters. Merging may be required if two clusters are very similar. Of course we

should define a similarity measure for this operation as we did for clustering. Several mea-

sures given in (2.20), (2.21), (2.22) and (2.23), can be used for this purpose [Dud73].

Merging is sometimes desired when the cluster size is very small. The criterion for appro-

priate splitting is far more difficult to define [Dud73]. Multimodal and nonsymmetric dis-

tributions as well as distributions with large variances along one direction can be split.

We can start partitioning the input data of size N to N clusters containing one sample

each. The next step is to partition the current clustering into N-1 clusters. We can continue

doing this until we reach the desired clustering. At any level some samples from different

clusters may be combined together to form a single cluster. Merging and splitting are heuristic operations with no guarantee of reaching the desired clustering, but they are still useful in practice.

2.3.3 Neighborhood Dependent Methods

Once we choose a measure to describe similarity between clusters, we use the following algorithm [Dud73]. After the initial clustering, we recluster each sample according to the nearest cluster center, and calculate the means of the clusters again, until there is no change in the clustering. This algorithm is called the nearest mean reclassification algorithm.

2.3.4 Hierarchical Clustering

Let's consider a sequence of partitions of the N samples into C clusters. The first partition is into N clusters, where each cluster contains exactly one sample. The next iteration is a partition into N-1 clusters, and so on, until all samples form one cluster. If the sequence has the property

that whenever two samples are in the same cluster at some level, they remain together at

all higher levels, then the sequence is called a hierarchical clustering. It should be noted

that the clustering can be done in reverse order, that is, first all samples form a single clus-

ter, and at each iteration more clusters are generated.

In order to combine or divide the clusters, we need a way to measure the similarity in

the clusters and dissimilarity between clusters. Commonly used distance measures

[Dud73] are as follows:

D_min(C_1, C_2) = min_{x_1 ∈ C_1, x_2 ∈ C_2} ||x_1 - x_2||    (2.24)

D_max(C_1, C_2) = max_{x_1 ∈ C_1, x_2 ∈ C_2} ||x_1 - x_2||    (2.25)

D_avg(C_1, C_2) = (1 / (n_1 n_2)) Σ_{x_1 ∈ C_1} Σ_{x_2 ∈ C_2} ||x_1 - x_2||    (2.26)

D_mean(C_1, C_2) = ||m_1 - m_2||    (2.27)

All of these measures have a minimum-variance flavor, and they usually give the same

results if the clusters are compact and well separated. However, if the clusters are close to

each other, and/or the shapes are not basically hyperspherical, very different results may

be obtained. Although never tested, it is possible to use our clustering evaluation function

to measure the distance between clusters in hierarchical clustering to improve the results.
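The four cluster-to-cluster distances of (2.24)-(2.27) can be computed directly; a small sketch (the function name is ours):

```python
import numpy as np

def linkage_distances(C1, C2):
    """Cluster-to-cluster distances of eqs. (2.24)-(2.27)."""
    d = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)
    return {
        "min": d.min(),    # eq. (2.24): closest pair
        "max": d.max(),    # eq. (2.25): farthest pair
        "avg": d.mean(),   # eq. (2.26): average over all pairs
        "mean": np.linalg.norm(C1.mean(axis=0) - C2.mean(axis=0)),  # eq. (2.27)
    }
```

For compact, well-separated clusters the four values are close, reflecting their common minimum-variance flavor; elongated or nearby clusters pull them apart.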

2.3.5 Nonparametric Clustering

When a mixture density function has peaks and valleys, it is most natural to divide the

samples into clusters according to the valley. The valley may not have a parametric struc-

ture, which creates difficulties with parametric assumptions. One way to discover the val-

ley is to use estimation of local density gradients at each sample point and move the

samples in the direction of the gradient. By repeating this we will move the samples away

from the valley, and the samples form compact clusters. We call this procedure the valley-

seeking procedure [Fuk90].

Figure 2-1. Valley seeking algorithm

The local gradient can be estimated by the local mean vector around the sample, as illustrated in Figure 2-1, and its direction satisfies

∇p(x) / p(x) ∝ M(x)    (2.28)

where M(x) is the local mean vector computed in a small region around x. The local gradient near the decision surface will be proportional to the difference of the local means, that is, M_1(x) - M_2(x), where M_1(x) is the local mean of one cluster and M_2(x) is the local mean of the other cluster inside the same local region.

The method seems very promising, but it may result in too many clusters if there is a slight nonuniformity in one of the clusters. The performance and the number of clusters depend on the local region used to calculate the gradient vector. If the local region is too small, there will be many clusters; on the other hand, if the region is too large, all the points form one cluster. So the size of the local region is directly related to the number of clusters but only loosely related to the quality of the clustering. The advantage of this method is that the number of clusters does not need to be specified in advance, but in some cases this may be a disadvantage. Assume we want to improve the clustering without increasing the number of clusters. A change in the local region will change the number of clusters and will not help to improve the clustering. Of course small changes will change the clustering without increasing the cluster number, but determining the range of the parameter may be a problem.

2.4 Mixture Models

The mixture model is a semi-parametric way of estimating the underlying density function [Dud73] [Par62] [Chr81]. In the non-parametric kernel-based approach to density estimation, the density function is represented as a linear superposition of kernel functions, with one kernel centered on each data point. In the mixture model the density function is again formed by a linear combination of basis functions, but the number M of basis functions is treated as a parameter of the model and is much less than the number N of data points. We write the density estimator as a linear combination of component densities p(x|i) in the form

p(x) = Σ_{i=1}^{M} p(x|i) P(i)    (2.29)

Such a representation is called a mixture distribution, and the coefficients P(i) are called the mixing parameters.

2.4.1 Maximum Likelihood Estimation

The maximum likelihood estimate may be obtained by maximizing the likelihood F = Π_{j=1}^{N} p(x_j) with respect to P_i, M_i and Σ_i under the constraint

Σ_{i=1}^{M} P_i = 1    (2.30)

The negative log-likelihood is given by

E = -ln(F) = -Σ_{j=1}^{N} ln(p(x_j))    (2.31)

which can be regarded as an error function [Fug90]. Maximizing the likelihood F is then equivalent to minimizing E. One way to solve the maximum likelihood problem is the EM algorithm, which is explained next.

2.4.2 EM Algorithm

Usually no theoretical solution exists for the likelihood equations, and it is necessary

to use numerical methods. Direct maximization of the likelihood function using Newton-

Raphson or gradient methods is possible but it may need analytical work to obtain the gra-

dient and possibly the Hessian. The EM algorithm [Dem77] is a general method for com-

puting maximum-likelihood (ML) estimates for "incomplete data" problems. In each

iteration of the EM algorithm there are two steps, called the expectation step or the E-step

and the maximization step or M-step, thus the name EM algorithm, given by [Dem77] in

their fundamental paper. The EM algorithm can be applied in situations described as

incomplete-data problems, where ML estimation is made difficult by the absence of some

part of the data in a more familiar and simpler data structure. The parameters are estimated

after filling in initial values for the missing data. The latter are then updated by their pre-

dicted values using these parameter estimates. The parameters are then reestimated itera-

tively until convergence.

The term "incomplete data" refers in general to the existence of two sample spaces X and Y and a many-to-one mapping H from X to Y, where x ∈ X and y ∈ Y are elements of the sample spaces and y = H(x). The corresponding x in X is not observed directly, but only indirectly through y. Let f(x|θ) be the parametric distribution of x, where θ is a vector of parameters taking values in Θ. The distribution of y, denoted by g(y|θ), is also parametrized by θ, since the complete-data specification f(·|·) is related to the incomplete-data specification g(·|·) by

g(y|θ) = ∫_{H(x) = y} f(x|θ) dx    (2.32)

The EM algorithm tries to find a value of θ which maximizes g(y|θ) given an observed y, and it uses the associated family f(x|θ). It should be noted that there are many possible complete-data specifications f(x|θ) that will generate g(y|θ).

The maximum-likelihood estimator θ̂ maximizes the log-likelihood function

L(θ) = ln(g(y|θ))    (2.33)

over θ,

θ̂ = argmax_θ L(θ)    (2.34)

The main idea behind the EM algorithm is that there are cases in which the estimation of θ would be easy if the complete data x were available, and is only difficult for the incomplete data y. In other words, the maximization of ln(g(y|θ)) is complicated, whereas the maximization of ln(f(x|θ)) is easy. Since only the incomplete data y is available in practice, it is not possible to directly perform the optimization of the complete-data likelihood ln(f(x|θ)). Instead it is easier to estimate ln(f(x|θ)) from y and use this estimate to find θ̂. Since estimating the complete-data likelihood requires θ, we need an iterative approach. First, using an estimate of θ, the complete likelihood function is estimated; then this likelihood function is maximized over θ, and so on, until satisfactory convergence is obtained. Given the current value θ' of the parameters and y, we can estimate ln(f(x|θ)) using

P(θ, θ') = E[ln(f(x|θ)) | y, θ']    (2.35)

The EM algorithm can then be expressed as

E-step:  P(θ, θ_p) = E[ln(f(x|θ)) | y, θ_p]    (2.36)

M-step:  θ_{p+1} = argmax_θ P(θ, θ_p)    (2.37)

where θ_p is the value of θ at the pth iteration. For the problem of density estimation using a mixture model we do not have the corresponding cluster labels. The missing labels can be considered as incomplete data and can be estimated together with the other parameters of the mixture using the EM algorithm described above. Unfortunately, estimating the mixture model will not answer the question of how to cluster the data set using the mixture model.
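As an illustration of the E- and M-steps, a minimal EM sketch for a two-component one-dimensional Gaussian mixture; this is a simplified special case with our own initialization choices, not the algorithm used later in this work:

```python
import numpy as np

def em_gmm_1d(x, n_iter=100):
    """EM for a 2-component 1-D Gaussian mixture: the E-step fills in the
    missing labels as responsibilities, the M-step re-estimates P_i, M_i,
    and the variances V_i from them."""
    P = np.array([0.5, 0.5])              # mixing parameters, eq. (2.30)
    M = np.array([x.min(), x.max()])      # crude initial means
    V = np.array([x.var(), x.var()])      # crude initial variances
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample
        g = P * np.exp(-0.5 * (x[:, None] - M) ** 2 / V) / np.sqrt(2 * np.pi * V)
        r = g / g.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted maximum-likelihood re-estimates
        n = r.sum(axis=0)
        P = n / len(x)
        M = (r * x[:, None]).sum(axis=0) / n
        V = (r * (x[:, None] - M) ** 2).sum(axis=0) / n
    return P, M, V
```

Even when the fit is good, the estimated mixture by itself does not say how to group the samples into clusters, which is the point made above.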

2.5 Competitive Networks

Unsupervised learning can be accomplished by appropriately designed neural networks. The original unsupervised learning rule was proposed by Hebb [Heb49], and is inspired by biological synaptic signal changes. In this rule, the change in weight depends on the correlation of the pre- and post-synaptic signals x and y, respectively, which may be formulated as

w_{k+1} = w_k + μ y_k x_k    (2.38)

where μ > 0, x_k is the unit's input signal, and y_k = x_k^T w_k is the unit's output. Analysis shows that the rule is unstable in this form: it drives the weights to infinity in magnitude. One way to prevent divergence is to normalize the weight vector after each update.


Another update rule which prevents the divergence was proposed by Oja [Oja82] [Oja85]; it adds a weight decay proportional to y^2 and results in the following update rule:

w_{k+1} = w_k + μ (x_k - y_k w_k) y_k    (2.39)

There are many variations to this update rule, and the update rule is used very frequently

in principal component analysis (PCA) [Ama77] [Lin88] [San89] [Oja83] [Rub89]

[Pri90]. There are linear and nonlinear versions [Oja91] [Tay93] [Kar94] [Xu94] [Pri90].

Competitive networks [Gro76a] [Gro76b] [Lip89] [Rum85] [Mal73] [Gro69] [Pri90] can be used in clustering procedures [Dud73] [Har75]. Since in clustering there is no signal to show the cluster labels, competitive networks use a competition procedure to find the output node to be updated according to a particular weight update rule. The unit with the largest activation is usually chosen as the winner, and its weight vector is updated according to the rule

Δw_i = μ (x_k - w_i)    (2.40)

where w_i is the weight vector of the winning node. The weight vectors of the other nodes are not updated. The net effect of the rule is to move the weight vector of each node towards the center-of-mass of the nearest dense cluster of data points. This means that the number of output nodes determines the number of clusters.
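A winner-take-all sketch of the competitive update (2.40); the initial weight vectors are passed in explicitly here, and all names are ours:

```python
import numpy as np

def competitive_learning(X, W0, mu=0.1, n_epochs=30, seed=0):
    """Only the winning node (nearest weight vector) is updated, eq. (2.40);
    the number of weight vectors fixes the number of clusters."""
    rng = np.random.default_rng(seed)
    W = np.asarray(W0, float).copy()
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:              # random presentation order
            i = np.argmin(np.linalg.norm(W - x, axis=1))  # competition step
            W[i] += mu * (x - W[i])                       # move only the winner
    return W
```

Each weight vector ends up tracking an exponentially weighted average of the samples it wins, i.e. it drifts toward the center-of-mass of its nearest dense cluster.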

One application of competitive learning is adaptive vector quantization [Ger82]

[Ger92a]. Vector quantization is a technique where the input space is divided into a num-

ber of distinct regions, and for each region a "template" or reconstruction vector is

defined. When presented with a new input vector x, a vector quantizer first determines the

region in which the vector lies. Then the quantizer outputs an encoded version of the

reconstruction vector wi representing that particular region containing x. The set of all

possible reconstruction vectors is usually called the codebook of the quantizer [Lin80]

[Gra84]. When the Euclidean distance is used to measure the similarity of x to the regions,

the quantizer is called a Voronoi quantizer [Gra84].

2.6 ART Networks

Adaptive resonance architectures are artificial neural networks that are capable of sta-

ble categorization of an arbitrary sequence of unlabeled input patterns in real time. These

architectures are capable of continuous training with nonstationary inputs. They also solve

the stability-plasticity dilemma. In other words, they let the network adapt yet prevent cur-

rent inputs from destroying past training. The basic principles of the underlying theory of

these networks, known as adaptive resonance theory (ART), were introduced by Gross-

berg 1976 [Gro76a] [Gro76b].

A class of ART architectures, called ART1 [Car87a] [Car88], is characterized by a

system of ordinary differential equations, with associated theorems. A number of interpre-

tations and simplifications of the ART1 net have been reported in the literature [Lip87]

[Pao89] [Moo89].

The basic architecture of the ART1 net consists of a layer of linear units representing

prototype vectors whose outputs are acted on by a winner-take-all network. This architec-

ture is identical to the simple competitive network with one major difference. The linear

prototype units are allocated dynamically, as needed, in response to novel input vectors.

Once a prototype unit is allocated, appropriate lateral-inhibitory and self-excitatory con-

nections are introduced so that the allocated unit may compete with preexisting prototype

units. Alternatively, one may assume a prewired network with a large number of inactive

(zero weights) units. A unit becomes active if the training algorithm decides to assign it as

a cluster prototype unit, and its weights are adapted accordingly.

The general idea behind ART1 training is as follows. Every training iteration consists of taking a training example x_k and examining the existing prototypes (weight vectors w_i) that are sufficiently similar to x_k. If a prototype w_i is found to match x_k (according to a similarity test based on a preset matching threshold), sample x_k is added to the cluster represented by w_i, and w_i is modified to make it better match x_k. If no prototype matches x_k, then x_k becomes the prototype for a new cluster.
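The training loop just described can be sketched in a simplified Euclidean form. Note that real ART1 operates on binary patterns with a vigilance-ratio test; the version below only mirrors the match-or-allocate logic, and the names, the distance threshold, and the update rate are our assumptions:

```python
import numpy as np

def art_style_clustering(X, threshold, mu=0.5):
    """Match each sample against existing prototypes; adapt the best match
    if it is close enough, otherwise allocate a new prototype unit."""
    prototypes = []
    labels = []
    for x in X:
        if prototypes:
            d = [np.linalg.norm(x - w) for w in prototypes]
            i = int(np.argmin(d))
            if d[i] <= threshold:                  # similarity test passes
                prototypes[i] = prototypes[i] + mu * (x - prototypes[i])
                labels.append(i)
                continue
        prototypes.append(x.astype(float))         # novel input: new cluster unit
        labels.append(len(prototypes) - 1)
    return prototypes, labels
```

The dynamic allocation of prototype units is what distinguishes this scheme from the fixed-size competitive network of the previous section.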

The family of ART networks also includes more complex models such as ART2

[Car87b] and ART3 [Car90]. These ART models are capable of clustering binary and analog input patterns. A simplified model of ART2, ART2-A [Car91a], has been proposed that is two to three orders of magnitude faster than ART2. Also, a supervised real-time learning ART model called ARTMAP has been proposed [Car91b].

2.7 Conclusion

In this chapter we summarized the basic algorithms used in clustering. There are many variations of these algorithms, but the basic principle stays the same. A fundamental problem can be seen immediately, which can be summarized as the use of the Euclidean distance as a measure of cluster separability. Other methods use the mean and variance to differentiate the clusters. When there are nonlinear structures in the data, it is obvious that Euclidean distance measures and differences in the mean and variance are inadequate measures of cluster separability. The valley seeking algorithm tries to solve the problem by moving the samples along the gradients, and the algorithm will behave as a classifier if the clusters are well separated and unimodal. But when the clusters are multi-modal and overlapping, the valley seeking algorithm may create more cluster centers than there are clusters in the data. The question of how to combine these cluster centers is not answered, even when we know the exact number of clusters. Defining the number of clusters beforehand can be an advantage depending on the problem. For example, in MRI segmentation it is to our advantage to fix the number of clusters, since we know this number a priori. Consider an MRI brain image where the basic structures are CSF, white matter and gray matter. Failure to fix the number of clusters in this problem beforehand will raise the question of how to combine the excess cluster centers later.

When we consider the fact that the tissue boundaries in an MRI brain image are not sharply defined, it is obvious that Euclidean distance measures and mean and variance differences are not enough to differentiate the clusters in a brain MRI. The variability of brain structures among persons and within the same scan makes it difficult to use model-based approaches, since a model that fits a particular part of the brain may not fit the rest. This encourages us to use data-driven algorithms, where there is no predefined structure imposed on the data. But the limitations of Euclidean distance measures force us to seek other measures of cluster separability.


3.1 Introduction

Entropy [Sha48] [Sha62] [Kaz90] was introduced into information theory by Shannon

(1948). The entropy of a random variable is a measure of the average amount of informa-

tion contained. In another words entropy measures the uncertainty of the random variable.

Consider a random variable X which can take values x1, ..., xN with probabilities p(xk), k = 1, ..., N. If we know that the event xk occurs with probability pk = 1, which requires that pi = 0 for i ≠ k, there is no surprise and therefore there is no information contained in X, since we know the outcome exactly. If we want to send the value of X to a receiver, then the amount of information is given as I(xk) = -ln(p(xk)), if the variable takes the value xk. Thus, the expected information needed to transmit the value of X is given by

E(I(xk)) = HS(X) = -Σk=1..N p(xk) ln(p(xk))    (3.1)

which is called the entropy of the random variable X.
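Equation (3.1) is easy to verify numerically; the following sketch (the helper name is ours, not from the text) computes HS for a discrete distribution:

```python
import math

def shannon_entropy(p):
    """H_S(X) = -sum_k p(x_k) ln(p(x_k)); terms with p_k = 0 contribute nothing."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

# A certain outcome (p_k = 1) carries no information: H_S = 0.
h_certain = shannon_entropy([1.0, 0.0, 0.0])
# A uniform distribution over N outcomes maximizes entropy: H_S = ln N.
h_uniform = shannon_entropy([0.25] * 4)
```

For the certain event h_certain is 0, matching the "no surprise" argument above.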

Shannon's measure of entropy was developed essentially for the discrete values case.

Moving to the continuous random variable case where summations are usually replaced

with integrals, is not so trivial, because a continuous random variable takes values from

-∞ to ∞, which makes the information content infinite. In order to avoid this problem,

the continuous entropy calculation is considered as differential entropy, instead of absolute

entropy as in the case of discrete random variables [Pap65] [Hog65]. If we let the interval between discrete values be

Δxk = xk - xk-1    (3.2)

and the continuous density function is given as f(x), then pk can be approximated by f(xk)Δxk, so that

H = -Σk p(xk) ln(p(xk)) = -Σk f(xk)Δxk ln(f(xk)Δxk)    (3.3)

After some manipulations and replacing the summation by integrals we obtain

H = -∫ f(x) ln(f(x)) dx - ln(Δxk)    (3.4)

In this equation -ln(Δxk) → ∞ as Δxk → 0, which suggests that the entropy of a continuous variable is infinite. If the equation is used in making comparisons between different density functions, the last term cancels out. We can drop the last term and use the equation as a measure of entropy by assuming that the measure is the differential entropy with the reference term -ln(Δxk). If all measurements are done relative to the same reference point, dropping the last term from (3.4) is justified and we have

h(X) = -∫ f(x) ln(f(x)) dx    (3.5)
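As a sanity check on (3.5), the differential entropy of a Gaussian has the closed form h = (1/2) ln(2πeσ²); a midpoint Riemann sum of -∫ f ln f dx reproduces it (a sketch with our own function names):

```python
import math

def gaussian_pdf(x, sigma):
    return math.exp(-x * x / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def differential_entropy(f, lo, hi, n=200000):
    """Midpoint Riemann-sum approximation of h(X) = -integral of f(x) ln f(x) dx."""
    dx = (hi - lo) / n
    h = 0.0
    for i in range(n):
        fx = f(lo + (i + 0.5) * dx)
        if fx > 0:
            h -= fx * math.log(fx) * dx
    return h

sigma = 1.5
h_numeric = differential_entropy(lambda x: gaussian_pdf(x, sigma), -15.0, 15.0)
h_closed = 0.5 * math.log(2 * math.pi * math.e * sigma**2)
```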

3.2 Maximum Entropy Principle

The Maximum Entropy Principle (MaxEnt), or the principle of maximum uncertainty, was proposed independently by Jaynes, Ingarden and Kullback [Jay57]

[Kul59] [Kap92]. Given just some mean values, there are usually an infinity of compatible

distributions. MaxEnt encourages us to select the distribution that maximizes the Shannon

entropy measure while being consistent with the given constraints. In other words, out of

all distributions consistent with the constraints, we should choose the distribution that has

maximum uncertainty, or choose the distribution that is most random. Mathematically, this

principle states that we should maximize

-Σi=1..N pi ln(pi)    (3.6)

subject to

Σi=1..N pi = 1    (3.7)

Σi=1..N pi gr(xi) = ar,  r = 1, ..., M    (3.8)

pi ≥ 0,  i = 1, ..., N    (3.9)

The maximization can be done using Lagrangian multipliers.
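For a single moment constraint (gr(x) = x in (3.8)), the Lagrangian solution has the exponential form pi ∝ exp(-λ xi), and λ can be found by bisection. A sketch (the solver below is ours, not from the text):

```python
import math

def maxent_with_mean(values, target_mean, iters=200):
    """Maximize -sum p_i ln p_i subject to sum p_i = 1 and sum p_i x_i = target_mean.
    The Lagrangian solution is p_i proportional to exp(-lam * x_i); bisect on lam."""
    def mean_for(lam):
        w = [math.exp(-lam * x) for x in values]
        return sum(x * wi for x, wi in zip(values, w)) / sum(w)
    lo, hi = -50.0, 50.0  # mean_for is decreasing in lam on this range
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) > target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = [math.exp(-lam * x) for x in values]
    z = sum(w)
    return [wi / z for wi in w]

p = maxent_with_mean([1.0, 2.0, 3.0], 2.5)
```

With no moment constraint (λ = 0) the solution reduces to the uniform distribution.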

3.2.1 Mutual Information

Let's assume that H(X) represents the uncertainty about a system before observing

the system output, and the conditional entropy H(X| Y) represents the uncertainty about

the system after observing the system output. The difference H(X) - H(X|Y) must represent the uncertainty about the system input after observing the system output. This quantity is called the mutual information [Cov91] [Gra90] between the random variables X and Y, which is given by

I(X;Y) = H(X) - H(X|Y)    (3.10)

Entropy is a special case of mutual information, where

H(X) = I(X;X) (3.11)

There are some important properties of the mutual information measure. These proper-

ties can be summarized as follows [Kap92].

1. The mutual information is symmetric, that is I(X;Y) = I(Y;X).

2. The mutual information is always nonnegative, that is I(X;Y) ≥ 0.
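Both properties, together with the identity I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y), can be checked on a small joint pmf (a sketch; function names are ours):

```python
import math

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to H(X) - H(X|Y)."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    hxy = entropy([v for row in joint for v in row])
    return entropy(px) + entropy(py) - hxy

independent = [[0.25, 0.25], [0.25, 0.25]]   # I = 0
dependent = [[0.5, 0.0], [0.0, 0.5]]         # I = H(X) = ln 2
skewed = [[0.4, 0.1], [0.2, 0.3]]
```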

The mutual information can also be regarded as the Kullback-Leibler divergence

[Kul59] between the joint pdf fX1X2(x1, x2) and the factorized marginal pdf fX1(x1)fX2(x2). The Kullback-Leibler divergence is defined in (3.12) and (3.13).
The importance of mutual information is that it provides more information about the

structure of two pdf functions than second order measures. Basically it gives us informa-

tion about how different the two pdf's are, which is very important in clustering. The same

information can not be obtained by using second order measures.

3.3 Divergence Measures

In Chapter 2, the clustering problem was formulated as a distance between two distri-

butions, but all the proposed measures are limited to second order statistics (i.e. variance).

Another useful entropy measure is the minimum cross-entropy measure which gives the

separation between two distributions [Kul59]. This is also called directed divergence,

since most of the measures are not symmetrical, although they can be made symmetrical.

Assume D(p, q) is a measure for the distance between p and q distributions. If

D(p, q) is not symmetric, then it can be made symmetric by introducing

D'(p, q) = D(p, q) + D(q, p). Under certain conditions the minimization of the directed divergence measure is equivalent to the maximization of entropy [Kap92].

The first divergence we will introduce is the Kullback-Leibler's cross-entropy measure

which is defined as [Kul59]

DKL(p, q) = Σi=1..n pi ln(pi/qi)    (3.12)

where p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two probability distributions. The following are some important properties of the measure:

* DKL(p, q) is a continuous function of p and q.
* DKL(p, q) is permutationally symmetric.
* DKL(p, q) ≥ 0, and it vanishes iff p = q.
* DKL(p, q) ≠ DKL(q, p) in general.

The measure can also be formulated for continuous density functions:

DKL(f, g) = ∫ f(x) ln(f(x)/g(x)) dx    (3.13)

where it vanishes iff f(x) = g(x).
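The asymmetry of (3.12) and the symmetrized form D'(p, q) are easy to demonstrate numerically (a sketch; function names are ours):

```python
import math

def kl(p, q):
    """D_KL(p, q) = sum_i p_i ln(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

d_pq = kl(p, q)        # directed divergence from p to q
d_qp = kl(q, p)        # differs from d_pq in general
d_sym = d_pq + d_qp    # symmetrized divergence D'(p, q)
```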


3.3.1 The Relationship to Maximum Entropy Measure

The Kullback-Leibler divergence measure is used to measure the distance between two

distributions. Where the second distribution is not given, it is natural to choose the distribution that has maximum entropy. When there are no constraints we compare to the uniform distribution u, and the measure to minimize becomes

DKL(p, u) = Σi=1..n pi ln(pi/(1/n)) = ln(n) + Σi=1..n pi ln(pi)    (3.14)

In other words, we maximize

-Σi=1..n pi ln(pi)    (3.15)

Thus, minimizing cross-entropy is equivalent to maximization of entropy when the distri-

bution we are comparing to is a uniform distribution [Kap92]. Even though maximization

of entropy can be thought as a special case of minimum cross-entropy principle, there is a

conceptual difference between the two measures [Kap92]. The maximum entropy princi-

ple maximizes uncertainty, while the minimum cross-entropy principle minimizes a prob-

abilistic distance between two distributions.

3.3.2 Other Entropy Measures

We are not restricted to Shannon's entropy definition. There are other entropy mea-

sures which are quite useful. One of the measures is the Renyi's entropy measure [Ren60]

which is given as


HR(X) = (1/(1-α)) ln( Σk=1..N pk^α ),  α > 0, α ≠ 1    (3.16)

The Havrda-Charvat entropy is given as

HHC(X) = (1/(1-α)) ( Σk=1..N pk^α - 1 ),  α > 0, α ≠ 1    (3.17)

Shannon's entropy is recovered in the limit

HS(X) = lim(α→1) HR(X)    (3.18)

We will use the Renyi's entropy measure for our derivation in the next chapter due to its

better implementation properties. We can compare the three types of entropy in Table 3-1.

Table 3-1. The comparison of properties of three entropies

Properties                 Shannon's   Renyi's   H-C's
Continuous function        yes         yes       yes
Monotonically increasing   yes         yes       yes
Recursivity                yes         no        yes
Additivity                 yes         yes       no
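The limit in (3.18) can be observed numerically: Renyi's entropy evaluated near α = 1 approaches Shannon's entropy (a sketch; helper names are ours):

```python
import math

def renyi_entropy(p, alpha):
    """H_R(X) = (1/(1 - alpha)) ln( sum_k p_k**alpha ), alpha > 0, alpha != 1."""
    return math.log(sum(pk ** alpha for pk in p if pk > 0)) / (1.0 - alpha)

def shannon_entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

p = [0.5, 0.3, 0.2]
h_quadratic = renyi_entropy(p, 2.0)      # the alpha = 2 case used in Chapter 4
h_near_one = renyi_entropy(p, 1.0001)    # approaches Shannon's entropy
```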

3.3.3 Other Divergence Measures

Another important measure of divergence is given by Bhattacharya [Bha43]. The distance DB(f, g) is defined by




b(f, g) = ∫ √( f(x) g(x) ) dx    (3.19)

DB(f, g) = -ln b(f, g)    (3.20)

DB(f, g) vanishes iff f = g almost everywhere. There is a non-symmetric measure,

the so-called generalized Bhattacharya distance, or Chernoff distance [Che52], which is

defined by

DC(f, g) = -ln ∫ [f(x)]^(1-s) [g(x)]^s dx,  0 < s < 1    (3.21)
Another important divergence measure is Renyi's measure [Ren60] which is given as

DR(f, g) = (1/(α-1)) ln ∫ [f(x)]^α [g(x)]^(1-α) dx,  α ≠ 1    (3.22)

Note that the Bhattacharya distance corresponds to s = 1/2, and the generalized Bhattacharya distance corresponds to α = 1 - s. The Bhattacharya distance will form the

starting point for our clustering function.
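For two equal-variance 1-D Gaussians, the Bhattacharyya distance (3.19)-(3.20) has the closed form DB = (μ1 - μ2)²/(8σ²), which a brute-force integration of ∫√(fg)dx confirms (a sketch; names are ours):

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def bhattacharyya(mu1, mu2, sigma, lo=-20.0, hi=20.0, n=100000):
    """D_B(f, g) = -ln integral of sqrt(f(x) g(x)) dx, via a midpoint Riemann sum."""
    dx = (hi - lo) / n
    b = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        b += math.sqrt(gauss(x, mu1, sigma) * gauss(x, mu2, sigma)) * dx
    return -math.log(b)

d_numeric = bhattacharyya(0.0, 2.0, 1.0)
d_closed = (0.0 - 2.0) ** 2 / (8 * 1.0 ** 2)   # equal-variance closed form
```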

3.4 Conclusion

If the clustering problem is formulated as the distance between two pdf's, we need a

way to measure and maximize this distance. Second order methods are limited by the fact

that they can not exploit all the underlying structure of a pdf, hence their usage in cluster-

ing will be limited to linearly separable data. In other words, we assume that the distri-

butions are Gaussian, where second order statistics are enough to differentiate. If we want

to cluster nonlinearly separable data, or if the Gaussian assumption is not valid, we need

more information about the structures in the data. This information can be obtained by


mutual information, which forms the basis of our clustering algorithm. The second prob-

lem we have to solve is how to obtain mutual information from the data. Most of the mea-

sures in Chapter 3 require a numerical integration which is very time consuming. Our

choice of clustering evaluation function (CEF) depends on the better implementation

properties of Renyi's entropy, and a different form of Bhattacharya distance where no

numerical integration is required.


4.1 Introduction

The measure for evaluating a clustering operation should be both easy to calculate and

effective. We introduced many measures in the previous chapters. Some of them are easy

to calculate but not effective, and there are others which are effective but not easy to calcu-

late. We will derive our clustering function in several different ways and provide a discus-

sion about the relation between them.

4.2 Density Estimation

As we know from the previous chapter, there are several ways to measure entropy. One

of the measures was the Renyi's entropy measure. In order to use this measure in the cal-

culations, we need a way to estimate the probability density function. Unsupervised esti-

mation of the probability density function is a difficult problem, especially if the

dimension is high. One of the methods is the Parzen Window Method, which is also called

a kernel estimation method [Par62]. The kernel used here will be the Gaussian kernel

GGauss(x) = (1/((2π)^(k/2) |Σ|^(1/2))) exp( -(1/2) xᵀ Σ⁻¹ x )    (4.1)

where Σ is the covariance matrix. For simplicity we can assume that Σ = σ²I. For a data set X = {x1, ..., xN} the density function can be estimated as

f(x) = (1/N) Σi=1..N G(x - xi, σ²)    (4.2)

Now consider Renyi's quadratic entropy for the continuous case, which is

HR(X) = -ln ∫ f(x)² dx,  α = 2    (4.3)

Easy integration of the Gaussian is one advantage of this method. The reason for using α = 2 is the exact calculation of the integral. When we replace the probability density function with its Parzen window estimate we obtain

HR(X) = -ln ∫ ( (1/N) Σi G(x - xi, σ²) ) ( (1/N) Σj G(x - xj, σ²) ) dx
      = -ln (1/N²) Σi=1..N Σj=1..N G(xi - xj, 2σ²)    (4.4)

Because of the Gaussian kernels and the quadratic form of Renyi's formulation, the result

does not need any numerical evaluation of integrals [Xu98] [PriOO]. This is not true for the

Shannon's entropy measure [Sha48] [Sha62]. The combination of the Parzen window

method with the quadratic entropy calculation gives us an entropy estimator that computes
the interactions among pairs of samples. We call the quantity G(xi - xj, 2σ²) the potential energy between samples xi and xj, because the quantity is a decreasing function of the distance between them, similarly to the potential energy between physical particles. Since this potential energy is related to information, we will call it the Information

Potential. There are many applications of information potential as described in [Xu98].

We want to use this expression from another perspective. We know that the double sum-

mation of the potential energy gives us the entropy of the distribution. We can easily start

thinking about what will happen if the potential energy is calculated between samples of

different distributions. This will form the basis for our clustering algorithm.

4.2.1 Clustering Evaluation Function

Let's assume that we have two distributions p(x) and q(x). We would like to evaluate the

interaction between these particles and how it changes when the distributions change.

Let's consider the following clustering evaluation function (CEF)

CEF(p, q) = (1/(N1 N2)) Σi=1..N1 Σj=1..N2 G(xi - xj, 2σ²)    (4.5)

xi ∈ p(x),  xj ∈ q(x)    (4.6)

where the interaction between samples from different distributions are calculated. In order

to be more specific about the distribution functions, we include a membership function

which shows the distribution for each sample. The membership function M1(xi) will be defined for the first distribution as M1(xi) = 1 iff xi ∈ p(x), and M1(xi) = 0 otherwise. Likewise for the second distribution, a similar function can be defined as M2(xi) = 1 iff xi ∈ q(x). Now we can redefine the CEF function as

CEF(p, q) = (1/(N1 N2)) Σi=1..N1 Σj=1..N2 M(xi, xj) G(xi - xj, 2σ²)    (4.7)

M(xi, xj) = M1(xi) × M2(xj)    (4.8)

The value of the function M() is equal to 1, when both samples are from different distri-

butions and the value is 0, when both samples come from the same distribution. For two

distributions, the function M can be rewritten as M(ai, aj) = (M1(ai) - M1(aj))². The summation is calculated only when the samples belong to different distributions, and the calculation is not done when the samples come from the same distribution. By changing M(·) we are changing the distribution functions of the clusters.
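In code, the membership function amounts to summing the kernel only over cross-cluster pairs; holding the two clusters as separate lists makes M implicit (a 1-D sketch; names are ours):

```python
import math

def g(x, var):
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def cef(cluster_p, cluster_q, sigma2):
    """CEF(p, q) = (1/(N1 N2)) sum_i sum_j G(x_i - x_j, 2*sigma2), taken over
    cross-cluster pairs only; the membership function is implicit in the split."""
    n1, n2 = len(cluster_p), len(cluster_q)
    total = sum(g(xi - xj, 2 * sigma2) for xi in cluster_p for xj in cluster_q)
    return total / (n1 * n2)

samples = [-1.1, -1.0, -0.9, 0.9, 1.0, 1.1]
valley_split = cef(samples[:3], samples[3:], 0.05)       # split at the natural valley
mixed_split = cef([samples[0], samples[3], samples[4]],
                  [samples[1], samples[2], samples[5]], 0.05)
```

The valley split yields a much smaller CEF value, which is why the clustering algorithm minimizes it.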

4.2.2 CEF as a Weighted Average Distance

One way to measure the distance between two clusters is the average distance between

pairs of samples as stated in (2.26) and copied here for convenience

Davg(X1, X2) = (1/(N1 N2)) Σ(ai ∈ X1) Σ(aj ∈ X2) ||ai - aj||    (4.9)

This measure works well when the clusters are well separated and compact, but it will fail

if the clusters are close to each other and have nonlinear boundaries. A better way of mea-

suring the distance between clusters is to weight the distance between samples nonlin-

early. The clustering problem is to find regions such that each region contains samples that

are more "similar" among themselves than samples in different regions. We assume here

that samples that are close to each other will probably have more similarities than samples

that are further away. However, when two samples are far away from each other, it does

not mean that they are not related. So, samples that are far away should not be discarded

completely. A simple and easy way to nonlinearly weight the distance between samples is

to use a kernel function evaluated for each difference between samples. For simplicity let's

use the Gaussian kernel as our weighting function. The kernel will give more emphasis to

the samples close to each other than the Euclidean distance, and will give much less

emphasis to samples far away from each other when compared with the Euclidean dis-

tance. When we use the kernel function the average distance function becomes

DNEWavg(X1, X2) = (1/(N1 N2)) Σ(ai ∈ X1) Σ(aj ∈ X2) G(ai - aj, 2σ²)    (4.10)

which is exactly the CEF(·) function that we derived before. The only difference is that (4.10) should be minimized where (4.9) should be maximized. We have an additional parameter σ, which controls the variance of the kernel. By changing σ, we can control the output of the clustering algorithm to some extent. On the other hand, D(·) does not have

any parameters to control, which means if it is not working properly for a certain data set,

there is no recourse, other than changing the function itself. However, creating a princi-

pled approach for selecting the kernel variance is a difficult problem for which no solution

yet exists.

4.2.3 CEF as a Bhattacharya Related Distance

The derivation of CEF progressed by applying the Information Potential concept to

pairs of samples from different clusters. Since the CEF function measures the relative

entropy between two clusters using Renyi's entropy, we investigated how the CEF is

related to other divergence measures between pdf's. CEF in (4.5) can be generally rewritten as a divergence measure between two pdf's p(x) and q(x) as

D(p, q) = ∫ p(x) q(x) dx    (4.11)

Since we are interested in a nonparametric approach (sample by sample estimate) the

pdf's are estimated by the Parzen-Window [Par62] method using a Gaussian kernel in

(4.1) where p(x) and q(x) are given as


p(x) = (1/Np) Σi=1..Np Gauss(x - xi, σ²)    (4.12)

q(x) = (1/Nq) Σj=1..Nq Gauss(x - xj, σ²)    (4.13)

By substituting the definitions of p and q into equation (4.11), we obtain

CEF(p, q) = (1/(Np Nq)) Σi=1..Np Σj=1..Nq Gauss(xi - xj, 2σ²)    (4.14)

This is exactly the definition of CEF in (4.5). The difference in the summation is just a

different way of counting the samples. Notice also the similarity of the CEF in (4.11) and

the Bhattacharyya [Bha43] distance in (3.19), which differ simply by a square root. We

interpret this difference as a metric choice [PriOO]. A nice feature of equation (4.11) is that

the integral can be calculated exactly, while it would be very difficult to compute (3.19).

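The exactness claim rests on the Gaussian convolution identity ∫ G(x - a, σ²) G(x - b, σ²) dx = G(a - b, 2σ²), which a brute-force integration confirms (a 1-D sketch; names are ours):

```python
import math

def g(x, var):
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def product_integral(a, b, var, lo=-30.0, hi=30.0, n=200000):
    """Midpoint Riemann sum of the integral of G(x - a, var) * G(x - b, var)."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        total += g(x - a, var) * g(x - b, var) * dx
    return total

a, b, var = 0.7, -1.2, 0.5
lhs = product_integral(a, b, var)   # brute-force integral
rhs = g(a - b, 2 * var)             # closed form: a Gaussian with doubled variance
```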

4.2.4 Properties as a Distance

To be able to use CEF(p, q) as a distance, we need to show certain properties. Let's

define the following distance measure to be consistent with the other divergence measures.

We usually drop the log term, since it is not required during maximization or minimization of the distance function, but without the log function the comparison with the other divergence measures would not be consistent.

DCEF(p, q) = -log CEF(p, q)    (4.15)

The first property we have to show is positiveness of DCEF(p, q). We have to show that

the distance measure CEF(p, q) is always less than or equal to 1. Since

∫p(x)dx = 1 and ∫q(x)dx = 1, the product can never be greater than one, which means that DCEF(p, q) ≥ 0. The second property which should be satisfied is that when the distances are equal, the measure should be zero, i.e. DCEF(p, q) = 0 whenever p(x) = q(x). But since ∫p(x)dx = 1 does not imply that ∫p²(x)dx = 1 holds,

we can not say that the distance measure is always zero whenever two pdf's are equal, but

the distance measure reaches a minimum. The third property is the symmetry condition. It

is obvious that DCEF(p, q) = DCEF(q, p). To satisfy the second property, we have

to normalize the DCEF(p, q), which can be done as follows:

DCEFnorm(p, q) = -log( ∫p(x)q(x)dx / √( ∫p²(x)dx · ∫q²(x)dx ) )    (4.16)

It is easy to observe that we obtain the Cauchy-Schwartz distance measure, if we take the

square of this equation. Although it is not a standard measure, we would like to show that

DCEF(p, q) can be used as a distance measure. Since we are trying to maximize the

distance measure between p(x) and q(x), we may omit the fact that the distance is not zero,

whenever the pdf's are equal. We compared several distance measures experimentally,

changing a single parameter of one of the distributions. The comparison is done between

two Gaussian distributions, where the mean of one of the Gaussian distributions is

changed and the behavior of the distance measures are observed in Figure 4-1. CEF,

CEFnorm and Bhattacharya are calculated with a kernel variance of 0.3. Chernoff is calcu-

lated with the parameter of value 0.2, and Renyi's divergence is calculated with the param-

eter 1.1. Although the minimum is not zero, when two pdf functions are equal, the

behavior of CEF is consistent with the other measures, hence it can be used in a maximi-

zation or minimization procedure.

4.2.5 Summary

The most important property of the distance function CEF, is the fact that it does not

require any numerical integration, which would increase the calculation time and numeri-

cal errors. Although it does not satisfy an important property as a distance function, CEF

can be used as a distance as long as it is only used in maximization or minimization proce-

dures. It can not be compared with the other distance measures if maximization or minimi-

zation is not used. The CEF has also other properties that makes the distance measure a

good choice for clustering algorithms, as we will see later.

Figure 4-1. Distance w.r.t. mean

4.3 Multiple Clusters

The previous definition was given for two clusters. In this section we will generalize

the CEF function to more than two clusters. Since we want to measure the divergence

between different clusters, the measure should include the divergence from one cluster to

all the others. One way of achieving this is to estimate pairs of divergence measures

between each possible pair of clusters. Let's assume that we have C clusters; if the divergence measure is symmetric we need C(C-1)/2 pairs, and if the measure is not symmetric, we need C(C-1) pairs. The following extension of CEF will be used for more than

two clusters:

CEF(p1, p2, ..., pC) = Σi=1..C-1 Σj=i+1..C CEF(pi, pj)    (4.17)

where CEF(pi, p .) is given by (4.14). Replacing the densities with Parzen window

we will obtain

CEF(p1, p2, ..., pC) = (1/(N1 N2)) Σi1=1..N1 Σi2=1..N2 Gauss(xi1 - xi2, 2σ²) + ...    (4.18)

where N1 + N2 + ... + NC = N.

The equation (4.18) can be written in a more compact way by introducing a scoring

function s(·) for C clusters, where s(·) is a C-bit long function defined as

sk(xi) = 1 iff xi ∈ Ck, 0 otherwise    (4.19)

and sk(·) is the k'th bit of s(·).

Using the scoring function, the extended CEF of (4.18) can be written in a compact form as follows

CEF(x, s) = (1/(N1 N2 ... NC)) Σi Σj M(xi, xj, s) Gauss(xi - xj, 2σ²)    (4.21)

M(xi, xj, s) = 1 iff s(xi) ≠ s(xj), 0 otherwise    (4.22)


If both samples are in the same cluster, then M(.) is 0, while if both samples are in dif-

ferent clusters, M(.) is 1. Since the label information is in different bits, this equation is

valid for any number of clusters. The only requirement is there should be enough bits for

all clusters.
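This compact form (a label-difference test times the Gaussian kernel) translates directly to code; here integer cluster labels play the role of the bit strings s(·), and we normalize by the number of cross-cluster pairs rather than by N1 N2 ... NC, a simplification of our own (a 1-D sketch):

```python
import math

def g(x, var):
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def cef_multi(samples, labels, sigma2):
    """Extended CEF: accumulate G(x_i - x_j, 2*sigma2) over every pair whose
    labels differ (M = 1); same-cluster pairs are skipped (M = 0)."""
    total, pairs = 0.0, 0
    n = len(samples)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:      # the membership function M(.)
                total += g(samples[i] - samples[j], 2 * sigma2)
                pairs += 1
    return total / pairs if pairs else 0.0

samples = [-2.0, -1.9, 0.0, 0.1, 2.0, 2.1]
natural = cef_multi(samples, [0, 0, 1, 1, 2, 2], 0.05)   # three separated groups
shuffled = cef_multi(samples, [0, 1, 2, 0, 1, 2], 0.05)  # labels cut across the groups
```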

This function evaluates the distance between clusters. The only variable we can

change in the function is M(·), since the input data X is fixed. We can use this function to

cluster the input data by minimizing the function with respect to M(.). The stepwise nature

of the function CEF w.r.t. M(.), makes it difficult to use gradient based methods.

4.4 Results

Let us see how the function behaves and performs in the clustering of several different

synthetic data sets [GokOO]. The data sets are chosen in such a way that they represent sev-

eral different cases with a small number of samples, so that evaluation and verification of

the results will be easy. Figure 4-2 shows five of the data sets used. We designed data sets

for two different clusters that would require MLPs or RBFs classifiers for separation. For

the data sets shown in Figure 4-2 the minimum of the CEF provided perfect clustering in all

cases (according to the labels we used to create the data). The different clusters are shown

using the symbols "square" and "star", although our clustering algorithm, of course, did

not have access to the data labels. We can say that in all these cases there is a natural "val-

ley" between the data clusters, but the valley can be of arbitrary shape. We have to note

that the data sets are small and the clusters are easily separable by eye-balling the data.

However, we should remember that none of the conventional algorithms for clustering

would cluster the data sets in the way the CEF did. The main difference is that conven-

tional clustering uses a minimum distance metric, which from a discrimination point of view produces a Voronoi tessellation (i.e. a division of the space into convex regions). Cluster-

ing using the CEF is not limited to distance to means, and seems therefore more appropri-

ate to realistic data structures. We implemented by enumeration (i.e. all possible

combinations) the CEF and the cluster labeling was attributed to its minimum. We will

investigate below other methods to optimize the search for the CEF minimum. We used a

variance ranging from 0.01 to 0.05, and the data is normalized between -1 and +1.
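The enumeration procedure mentioned above can be sketched for a tiny two-cluster case: evaluate the CEF for every labeling with nonempty clusters and keep the minimum (names are ours; the real data sets are larger):

```python
import math
from itertools import product

def g(x, var):
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def cef(samples, labels, sigma2):
    """Unnormalized CEF: sum of kernel interactions over cross-cluster pairs."""
    total = 0.0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if labels[i] != labels[j]:
                total += g(samples[i] - samples[j], 2 * sigma2)
    return total

def cluster_by_enumeration(samples, sigma2):
    """Try every 2-cluster assignment with both clusters nonempty and
    return the labeling that minimizes the CEF."""
    best_value, best_labels = float("inf"), None
    for labels in product([0, 1], repeat=len(samples)):
        if len(set(labels)) < 2:
            continue
        value = cef(samples, labels, sigma2)
        if value < best_value:
            best_value, best_labels = value, labels
    return best_labels

samples = [-1.2, -1.0, -0.8, 0.8, 1.0, 1.2]
labels = cluster_by_enumeration(samples, 0.02)
```

The minimum falls on the labeling that splits the samples at the natural valley around zero.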


Figure 4-2. Input/output data

4.4.1 A Parameter to Control Clustering

The variance of the Gaussian function in Eq. (4.7) determines the level of interaction

between samples. When the variance gets smaller, the importance of samples that are far

away also gets smaller. Since the relative distance between samples is important, a large

variance will make samples close to each other less important.

L=1  L=5  L=10  L=15  L=20  L=25  L=27  L=29

Figure 4-3. Label assignments

But we want to give more emphasis to the samples that are close. At the same time we

don't want to lose the interaction between the samples that are far away. Therefore the

variance of the Gaussian determines the results of the clustering with the CEF. This should

be no surprise if we recall the role of the variance in nonparametric density estimation.


We experimentally evaluated the importance of the Gaussian kernel variance for the

two triangles data set. The clustering assignment is changed starting with the data in the

top line, continuing until the bottom, so the clustering that provides the two lines is

obtained when the label assignment is L=15. The change of the labels can be seen in Fig-

ure 4-3. Every cluster is shown with a different symbol. Only 8 different assignments out of 28 are shown in Figure 4-3.

Figure 4-4. Change of CEF w.r.t. the variance of the Gaussian kernel

The number 28 comes from the fact that we omit the assignments where

the number of the elements in any cluster is zero, such that we always have clusters with

nonzero elements. Of course this is only one way to change the membership function but it

will give us an idea of how variance affects the results. Figure 4-4 shows that as variance

decreases, the global optimum becomes our minimum point and many local minima disap-

pear. Notice also that in this case the global minimum for the largest variance (σ=0.1)

occurs at L=3 and does not coincide with the label assignments. Hence, not only is the

CEF difficult to search, but also it provides a global minimum that does not correspond in

this case to the label assignments, that is, this variance setting will miss the fact that there

is a "valley" between the two lines. In general, we do not have a label assignment, but we

can expect that different cluster assignments will be obtained depending upon the variance

used in the Gaussian kernel.


All in all, we were very pleased to observe this behavior in the CEF, because it tells us

that the evaluation function is incorporating and using global information about the data

set structure, unlike traditional clustering methods.

4.4.2 Effect of the Variance to the Pdf Function

The variance is the most important parameter in the pdf calculation, since it deter-

mines the size of the kernel, hence the shape of the final pdf function. If the variance is too

big, then the pdf is smoothed such that, all the details of the data are lost. If the variance is

too small, then the pdf estimation will be discontinuous and full of spikes. In both extreme

cases, it is not possible to get a meaningful result from the clustering algorithm. The pdf

estimation using Parzen window of the above data is shown with different values of the

variance in Figure 4-5, Figure 4-6 and Figure 4-7.

Figure 4-5. Pdf of the data given in Figure 4-3, var=0.2


Figure 4-6. Pdf of the data given in Figure 4-3, var=0.1

Figure 4-7. Pdf of the data given in Figure 4-3, var=0.08

As is evident, the pdf estimation is too smooth when the variance is 0.2. It gets better

around the value of 0.1, which can be seen in Figure 4-6. When the variance is 0.08, we

get the pdf estimation given in Figure 4-7, which will give us the correct clustering, as

shown in Figure 4-3 where the minimum point of the CEF function corresponds to the cor-

rect clustering.

4.4.3 Performance Surface

Because of the nature of clustering, the CEF function is a step-wise function, which

makes optimization difficult. To see how the CEF changes w.r.t. the membership function,

let's define the following parametric membership function without losing the functionality

of our definition. Instead of assigning the membership values directly, define G() as a

threshold mapping function where the weights define the mapping. G() is defined as

G(x, w) = TRESH(F(x, w)) where F(x, w) is any linear/nonlinear mapping

function from x to F(x, w), and TRESH() is defined as follows:

TRESH(x) = 1.0 if x ≥ 0, and -1.0 if x < 0.

We can write the CEF in terms of w. Let's define F as a linear function, where F(x, w) = x0·w0 + x1·w1. Figure 4-8 depicts the plot of the CEF function w.r.t. one of

the weights w0 for the two lines data set. As one can immediately observe the evaluation

function is a staircase function, because until there is a change in clustering, the value of

CEF remains constant. It is clear that the usual gradient descent method for adaptation

[Hay94] will not work here, since at most of the points in the weight space, there is no gra-

dient information to determine the next change in weights. Since the traditional gradient


Figure 4-8. Change of CEF w.r.t. weight wO

method fails to solve the problem, we have to find a more suitable method to minimize the

clustering function.

4.5 Comparison of Distance Measures in Clustering

Several distance measures have been proposed by different authors, and all of them have simi-

lar and distinct characteristics. Some of them are stronger distance measures than the oth-

ers. We adopted an experimental comparison of several distance measures (i.e. CEF,

CEFnorm, Renyi, Kullback-Leibler and Bhattacharyya) in our clustering algorithm, and

we showed that some of them can not be used in the clustering procedure. We select a sim-

ple data set from Figure 4-2, and we test the distance measures by using four different con-

figurations of the data set. The four configurations can be seen in the Figure 4-9.

For comparison with the other measures, CEF and CEFnorm functions are used as

given in (4.15) and (4.16). The log function is usually not necessary when we optimize the






Figure 4-9. Different test configurations

function, since it is a monotonically increasing function. Since the Renyi's divergence

measure and the Kullback-Leibler divergence measure are not symmetric, we calculated

the sum of the distance from p(x) to q(x) and the distance from q(x) to p(x), which makes

both distance measures symmetric. Renyi's entropy is calculated using α = 1.1. The

variance of Parzen window estimator is selected as 0.15 for all 5 measures. The configura-

tion given in Figure 4-9 (c) is the desired clustering, because there is a natural valley

between the two triangle shaped data samples. Since we are trying to maximize the dis-

tance between two pdf distributions, we expect the distance measures to be maximum for

the configuration (c). As it can be seen from the Table 4-1, the measures that incorporate

in their calculation the ratio of the two pdf distributions fail to find the correct clustering.

The maximum of each function is shown as a shaded area.

Table 4-1. Comparison of different distance measures

        CEF        CEFnorm    Renyi's    KL         Bhat
a)      3.12594    0.10834    0.54023    0.47845    0.05639
b)      4.42961    1.91489    17.5679    13.2470    1.14886
c)      11.6223    10.2988    ...        ...        ...
d)      3.89062    1.80538    ...        ...        1.12171

The Renyi's divergence measure and Kullback-Leibler divergence measure requires

both pdf distributions to be strictly positive over the whole range. Because of this restric-

tion, these two measures can not be used in our clustering algorithm. Although Parzen

window estimation of pdf's is continuous in theory, the pdf becomes practically zero away

from the center. Increasing the variance to overcome this problem results in an over-

smoothed pdf function, as seen in Figure 4-5. The CEF, CEFnorm and Bhattacharyya dis-

tances perform similarly, finding the correct clustering. The results of the comparison

justify our choice of CEF as a distance measure for clustering.

4.5.1 CEF as a Distance Measure

As seen before, the CEF function does not disappear when two pdf distributions are

equal. Although this result does not affect our clustering algorithm, it would be nice to

have this property, so that CEF will be comparable to the other divergence measures. As

seen in (4.16), CEF can be normalized to have a 0 value, when two distributions are equal.

Although this is a desired property for CEF to be a comparable distance measure, it may

not be so good for the purpose of clustering. The terms V1 = ∫p²(x)dx and V2 = ∫q²(x)dx can be considered as Renyi's entropy without the logarithm term, where α = 2. When we use the normalized CEF in clustering, we have to be careful. Consider

the case in Figure 4-9 (d), where one cluster consists of only one point. Entropy of a small

cluster will be large, and entropy of a big cluster with many points will be small, whereas

for the case in Figure 4-9 (c), entropies of both clusters will be close to each other and will

be neither large nor small. There can be occasions where V1·V2 for the case in Figure 4-9

(d) will be larger than for the case in Figure 4-9 (c). This may result in the CEFnorm

reaching a maximum in the case in Figure 4-9 (d), instead of in the case in Figure 4-9 (c).

Because of this reason we will not use CEFnorm in our clustering algorithm.

4.5.2 Sensitivity Analysis

The sensitivity of the algorithm is much reduced by choosing a proper distance func-

tion, but it is not removed completely. Since this is basically a valley-seeking algorithm, it

tries to find a valley where the area under the separating line, which is the multiplication of

the areas from each side of the line, is minimum. We can see how it works in Figure 4-10.

Gaussian kernels are placed on each sample to estimate pdf's of each cluster. As the clus-

ter assignments are changed, the algorithm tries to separate the samples such that the area

under the pdf's of the clusters is minimum. When the sample P1 is away from the distribution, it is very likely that the separating line L2 will go over an area which is smaller than the area that L1 covers. The value of the CEF for L1 is the area found by multiplying the Gaussian kernels in the upper part of L1 with the Gaussian kernels in the lower part of L1. So, if P1 is away from the other sample points, L2 will be a better choice to minimize CEF, although the correct clusters do not result.


Figure 4-10. Sensitivity of the algorithm

4.5.3 Normalization

This sensitivity can be removed using different normalization factors with different

results. Every normalization will have different effects on the clustering. We would like to

give an outline for the procedures for normalizing CEF. The normalization will be done to

remove the sensitivity of the CEF function to small clusters, not to make it zero when two

pdf's are equal. It should be noted that this kind of normalization is effective only when

the clusters are almost of equal size. When the number of points in a cluster increases, the

maximum value of the pdf decreases, assuming the points don't repeat themselves. Let's

use the following normalization term, max(p(x)) × max(q(x)). This term will normalize clusters of different sizes, but will have the effect of forcing the clusters to be of

similar size. The term can be approximated by finding the maximum value of the function

at the sample points as given in (4.23).


max(p(x)) = max_i p(x_i)    (4.23)

Whenever a normalization similar to (4.23) is incorporated into the cost function,

where smaller clusters are discouraged, the effect will be losing the valley seeking prop-

erty of the cost function with clusters of different sizes. The term is effective if the clusters

are of similar sizes. Once the term is included in the cost function, it is difficult to control,

as it is difficult to choose to accept certain cluster sizes, and not to accept the small cluster

sizes. It would be easier to create a rejection class, where during the iterations certain con-

figurations are not accepted, depending on an external criterion, such as rejecting clusters

of a certain size, or rejecting clusters having a maximum/minimum value greater than a

prespecified value. Actually, a rejection class is already embedded in the algorithm, such

that clusters of size zero are not accepted. This raises the question of how to set the threshold. Although more work should be done on this topic, we suggest selecting a value that is

less than at least half of the maximum expected cluster size.
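A minimal sketch of this normalization (hypothetical names; a 1-D Gaussian Parzen estimate is assumed): each cluster's pdf maximum is approximated over its own samples, as in (4.23), and the product of the two maxima is the proposed term:

```python
import numpy as np

def parzen_at(points, samples, var=0.15):
    """Parzen window pdf estimate built from `samples`, evaluated at `points`."""
    d2 = (np.asarray(points, float)[:, None]
          - np.asarray(samples, float)[None, :]) ** 2
    return (np.exp(-d2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)).mean(axis=1)

def normalization_term(p_samples, q_samples, var=0.15):
    """max(p(x)) * max(q(x)), each maximum approximated over the cluster's
    own sample points as in (4.23)."""
    return (parzen_at(p_samples, p_samples, var).max()
            * parzen_at(q_samples, q_samples, var).max())
```

A single-point cluster has a high pdf peak, so dividing the cost by this term penalizes exactly the degenerate configurations discussed above; a spread-out cluster of the same kernel width has a much lower peak.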


5.1 Introduction

As we have seen previously the performance surface contains local minima and is dis-

crete in nature, which makes the optimization process difficult. Gradient optimization

methods will not work, and we have to find other methods to optimize the CEF perfor-

mance function. One of the methods which can be used to overcome the local minima

problem is simulated annealing [Aar90]. It is used with success in many combinatorial

optimization problems, such as the travelling salesman problem, which is an NP-complete problem.


5.2 Combinatorial Optimization Problems

A combinatorial optimization problem is either a minimization or a maximization

problem and is specified by a set of problem instances. An instance of a combinatorial

optimization problem can be defined as a pair (S, f), where the solution space S denotes

the finite set of all possible solutions and the cost function f is a mapping defined as

f: S → ℝ    (5.1)

In the case of minimization, the problem is to find a solution i_opt ∈ S which satisfies

f(i_opt) ≤ f(i) for all i ∈ S    (5.2)

Such a solution i_opt is called a globally optimal solution.

Local search algorithms are based on stepwise improvements on the value of the cost

function by exploring neighborhoods. Let (S, f) be an instance of a combinatorial optimization problem. Then a neighborhood structure is a mapping

N: S → 2^S    (5.3)

which defines for each solution i ∈ S a set S_i ⊆ S of solutions that are 'close' to i in some sense. The set S_i is called the neighborhood of solution i, and each j ∈ S_i is called a neighbor of i.

In the present class assignment problem, a neighborhood structure Nk, called the k-

change, defines for each solution i a neighborhood Si consisting of the set of solutions

that can be obtained from the given solution i, by changing k-labels from the solution i.

So the 2-change N2(p, q) changes a solution i into a new solution by changing the labels of

the samples p and q [Dud73]. We will use a similar but more complex neighborhood struc-

ture. Define a generation mechanism as a means of selecting a solution from the neighborhood S_i of a solution i.
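As an illustrative sketch (hypothetical names; the two-cluster case gives a deterministic flip), a generation mechanism for the 2-change neighborhood simply reassigns the labels of two chosen samples:

```python
import random

def two_change(labels, p, q, n_clusters=2):
    """Return a 2-change neighbor of `labels` by reassigning samples p and q.

    With two clusters this flips both labels; with more clusters a random
    different label is drawn for each of the two samples."""
    neighbor = list(labels)
    for idx in (p, q):
        choices = [c for c in range(n_clusters) if c != neighbor[idx]]
        neighbor[idx] = random.choice(choices)
    return neighbor

def generate(labels, n_clusters=2):
    """Generation mechanism: draw a random solution from the 2-change
    neighborhood of `labels`."""
    p, q = random.sample(range(len(labels)), 2)
    return two_change(labels, p, q, n_clusters)
```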

Given an instance of a combinatorial optimization problem and a neighborhood

structure, a local search algorithm iterates on a number of solutions by starting off with a

given initial solution, which is often randomly chosen. Next, by applying a generation

mechanism, it continuously tries to find a better solution by searching the neighborhood of

the current solution for a solution with lower cost. If such a solution is found, the current

solution is replaced by the lower cost solution. Otherwise the algorithm continues with the

current solution. The algorithm terminates when no further improvement can be obtained.

This algorithm will converge to a local minimum. There is usually no guarantee how far

this local minimum is from the global minimum, unless the neighborhood structure is

exact, which results in complete enumeration of the solutions and is impractical in many

cases. Let's summarize the local search algorithm with a block diagram.


i = i_start
repeat
    GENERATE (j from S_i)
    if f(j) < f(i) then i = j
    StopCriterion = ( f(j) ≥ f(i) for all j ∈ S_i )
until StopCriterion

Figure 5-1. Local search

Let (S, f) be an instance of a combinatorial optimization problem and let N be a

neighborhood structure, then i e S is called a locally optimal solution or simply a local

minimum with respect to N if i is better than, or equal to, all its neighborhood solutions

with regard to their cost. More specifically, i is called the local minimum if

f(i) ≤ f(j) for all j ∈ S_i    (5.4)

The problem we have is not as difficult as the travelling salesman problem, because we

have some a-priori information about the problem. We know that the global solution will

assign the same label to samples that are close to each other. Of course just this statement

alone will be an oversimplification of the problem. In nonlinear structured data, the clus-

ters won't be compact, which means that there are samples that are far away from each

other with the same label, but the statement remains true even in this case. This will sim-

plify our problem a little bit. First we will introduce an algorithm based on local search; then we will adapt a modified simulated annealing algorithm on top of it.

Before doing this it is necessary to introduce how simulated annealing works.

5.2.1 Local Minima

The neighborhood structure is not helpful when there are many local minima. We need

another mechanism to escape from the local minima. Exact neighborhood structure is one

solution but it results in complete enumeration in a local search algorithm. The quality of

the local optimum obtained by a local search algorithm usually strongly depends on the

initial solution and for most there are no guidelines available. There are some ways to

overcome this difficulty while maintaining the basic principle of local search algorithm.

The first one is to execute the local search algorithm for a large number of initial solu-

tions. If the number is large enough such that all solutions have been used as initial solu-

tion, such an algorithm can find the global solution, but the running time will be prohibitively long.


The second solution is to introduce a more complex neighborhood structure, in order

to be able to search a larger part of the solution space. This may need a-priori information

about the problem and is often difficult. We will try to use a-priori information to construct a better neighborhood structure.

The third solution is to accept, in a limited way, transitions corresponding to an

increase in the value of the cost function. The standard local search algorithm accepts a transition only if there is a decrease in the cost. This will be the basis of the simulated

annealing algorithm.

5.2.2 Simulated Annealing

In condensed matter physics, annealing is known as a thermal process for obtaining

low energy states of a solid in a heat bath. The first step is to increase the temperature of

the bath to a maximum value at which the solid melts. The second step is to decrease care-

fully the temperature of the bath until the particles arrange themselves in the ground state

of the solid. In the liquid phase all particles of the solid arrange themselves randomly. In

the ground state the particles are arranged in a highly structured lattice and the energy of

the system is minimal. The ground state of the solid is obtained only if the maximum tem-

perature is sufficiently high and the cooling is sufficiently slow.

The physical annealing process can be modelled using a computer simulation [Kir83].

The algorithm generates a sequence of states as follows. Given the current state i of the solid with energy E_i, a subsequent state j is generated by applying a perturbation mechanism which transforms the current state into the next state by a small distortion. The energy of the next state is E_j. If the energy difference, E_j − E_i, is less than or equal to 0, the state j is accepted as the current state. If the energy difference is greater than 0, the state j is accepted with a certain probability given by

exp( −(E_j − E_i) / (k_B T) )    (5.5)

where T denotes the temperature of the heat bath and k_B is a physical constant known as

the Boltzmann constant. If lowering of the temperature proceeds sufficiently slowly, the

solid can reach thermal equilibrium at each temperature. This can be achieved by generat-

ing a large number of transitions at a given temperature value. Thermal equilibrium is

characterized by the Boltzmann distribution. This distribution gives the probability of the

solid being in a state i with energy E_i at temperature T, and is given by

P_T{X = i} = (1 / Z(T)) exp( −E_i / (k_B T) )    (5.6)

where X is a stochastic variable denoting the current state of the solid, and

Z(T) = Σ_j exp( −E_j / (k_B T) )    (5.7)

where the summation extends over all possible states.

5.3 The Algorithm

We can apply the previous algorithm to solve our optimization problem. We will

replace the states of the particles with the solutions of the problem, where the cost of the

solution is equivalent to the energy of the state. Instead of the temperature, we introduce a

parameter, called the control parameter. Let us assume that (S, f) is an instance of the problem, and that i and j are two solutions with costs f(i) and f(j), respectively. Then the acceptance criterion determines whether j is accepted from i by applying the following acceptance probabilities:

P_c{accept j} = exp( (f(i) − f(j)) / c )   if f(j) > f(i)    (5.8)

P_c{accept j} = 1                          if f(j) ≤ f(i)    (5.9)

where c denotes the control parameter. The algorithm can be seen in Figure 5-2.


i = i_start
k = 0
repeat
    for l = 1 to L_k do
        GENERATE (j from S_i)
        if f(j) ≤ f(i) then i = j
        else if exp( (f(i) − f(j)) / c_k ) > random[0, 1) then i = j
    k = k + 1
until StopCriterion

Figure 5-2. Simulated annealing algorithm

A typical feature of the algorithm is that, besides accepting improvements in cost, it

also to a limited extent accepts deteriorations in cost. Initially, at large values of c, large

deteriorations will be accepted and finally, as the value of c approaches 0, no deteriora-

tions will be accepted at all. Lk should be sufficiently long so that the system reaches

thermal equilibrium.
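A compact sketch of the scheme in Figure 5-2 (hypothetical names; a geometric cooling schedule for the control parameter is assumed, which the figure leaves unspecified):

```python
import math
import random

def simulated_annealing(start, cost, neighbor, c=1.0, alpha=0.9,
                        chain_len=100, c_min=1e-3):
    """Simulated annealing with the standard Metropolis acceptance rule:
    improvements are always accepted; deteriorations with probability
    exp((f(i) - f(j)) / c), which shrinks as c is lowered toward 0."""
    current = start
    best = start
    while c > c_min:
        for _ in range(chain_len):  # L_k transitions at this value of c
            cand = neighbor(current)
            delta = cost(cand) - cost(current)
            if delta <= 0 or math.exp(-delta / c) > random.random():
                current = cand
                if cost(current) < cost(best):
                    best = current
        c *= alpha  # geometric cooling of the control parameter
    return best
```

Minimizing a simple convex cost from a distant start illustrates the behavior: early, high-c iterations accept occasional deteriorations, while late iterations are effectively greedy.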

5.3.1 A New Neighborhood Structure

We have some a-priori information about the optimization problem. The problem is

basically a clustering algorithm that tries to group pixels into regions. The pixels that

belong to a class should be close to each other using some distance metric. This informa-

tion will help us create another neighborhood structure and will eliminate many local min-

ima. On top of this algorithm we will put a modified version of a simulated annealing

algorithm, which gives us a chance to escape from local minima, if any. We start by look-

ing at the problem closely. Assume that at an intermediate step the optimization has reached the label assignment shown in Figure 5-3.

Figure 5-3. An instance

The ideal case is to label the

group of points marked in g2 as "Class2", so that the upper triangle will be labeled as

"Class2", and the lower triangle will be labeled as "Class1". The variance of the Gaussian kernel is selected as σ = 0.08, and the value of the CEF function is given as

0.059448040 for the given configuration. When some of the labels in the group g2 are

changed, the value of CEF function will increase to 0.060655852, and when all the labels


in the group g2 are changed to "Class2", then the value of CEF will drop sharply to

0.049192752. This means that the 2-change N2(p, q) neighborhood structure

explained before will fail to label the samples in the group g2 correctly. This behavior and

the local minima can also be seen in Figure 4-3. Of course this behavior disappears at

a certain value of the variance, but there is no guarantee that lowering the value of the vari-

ance will always work for every data set. A better approach is to find a better neighbor-

hood structure to escape from the local minima.

In the previous example, one can make the following observation very easily. If we

change the labels of the group of pixels g2 at the same time, then we can skip the local

minima. This brings up two questions: how to choose the groups, and how big they should be.


5.3.2 Grouping Algorithm

We know the clustering algorithm should give the same label to samples that are close to

each other in some distance metric. Instead of taking a fixed region around a sample, we

used the following grouping algorithm. Let's assume that the group size is defined as KM.

We will group samples from the same cluster starting from a large KM, which will be

decreased during the iterations according to a predefined formula. This can be exponential

or linear. The grouping algorithm is illustrated in Figure 5-4. Assume that the starting sample is x1, and the subgroup size is 4. The first sample that is closest to x1 is x2, so

x2 is included in the subgroup. The next sample that is closest to the initial sample x 1 is

x3, but we are looking for a sample that is closest to any sample in the subgroup, which in

this case is x4. The next sample closest to the subgroup is still not x3, but x5, which is

close to x4. The resulting subgroup {x1, x2, x4, x5} is more connected than the group obtained by simply selecting the 4 pixels {x1, x2, x3, x4} that are closest to the initial sample x1.

The grouping algorithm is more calculation intensive than taking the samples around the


Figure 5-4. Grouping

pixel, but it will help the algorithm to escape from the local minima. The initial cluster

labels come from the random initialization at the beginning of the algorithm. This group-

ing is done starting with every pixel, and the groups that are equal are eliminated. The

grouping is done among the samples that have the same cluster label, and done for each

cluster label independently. When the grouping is finished for a certain group size, for all

cluster labels, they are joined in a single big group, and they will be used in the optimiza-

tion process instead of the original data set. We will obtain the following set of groups as

seen in Figure 5-9. The difference of this grouping algorithm can be seen clearly using the data set in Figure 5-5. The group size is initialized as KM = N/C, where N is the total number of data samples, and C is the number of clusters.

In Figure 5-6, the group is selected starting from the sample x1, and the closest N samples are selected. The group shown in Figure 5-6 is selected using N = 10. This group is actually the kNN of point x1. The proposed grouping algorithm, on the other hand, creates the group shown in Figure 5-7.

Figure 5-5. Data set for grouping algorithm

Figure 5-6. Grouping using 10 nearest samples (kNN)

As can be seen from both examples, the proposed

grouping is more sensitive to the structure of the data, whereas the kNN method, a very

common method, is less sensitive to the structure of the data. Several grouping examples

can be seen in Figure 5-8. Using the proposed method will create a better grouping struc-

ture for our clustering algorithm.
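The grouping step can be sketched as follows (hypothetical names; Euclidean distance assumed): the sample added at each step is the one closest to any current member of the subgroup, not merely to the starting sample, so the group follows the local structure of the data.

```python
import numpy as np

def grow_group(X, start, size):
    """Grow a subgroup of `size` samples from index `start`, always adding
    the outside sample nearest to the current subgroup as a whole."""
    X = np.asarray(X, dtype=float)
    group = [start]
    outside = set(range(len(X))) - {start}
    while len(group) < size and outside:
        nxt = min(outside,
                  key=lambda i: min(np.linalg.norm(X[i] - X[g]) for g in group))
        group.append(nxt)
        outside.remove(nxt)
    return group
```

On a chain of points with a nearby off-structure sample, this follows the chain, whereas the plain kNN of the start sample would absorb the off-structure point first.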



Figure 5-7. Grouping using the proposed algorithm (group size is 10)

Figure 5-8. Examples using proposed grouping algorithm

5.3.3 Optimization Algorithm

First we consider the case where the groups overlap, and propose an optimization

algorithm. The optimization starts by choosing an initial group size KM, which is smaller than the total number of points in the data set.

Figure 5-9. Final group structure

The second step is to randomly initialize the

cluster labels of the given data set. After selecting the group size and the initial cluster

labels, the groups are formed as explained above, and a new data set is created which con-

sists of groups instead of individual data points. We adapt a 2-change algorithm similar to

the one described before in this chapter. We select one group and check whether changing

the label of this group reduces the cost function. If the change reduces the cost function we

will record this change as our new minimum cluster assignment. If the change does not

reduce the value of the cost function, we choose a second group from the list, and change

the cluster labels of that group in addition to the change of labels of the first group. We

record this assignment if it reduces the minimum value of the cost function. In this algo-

rithm we allow a cluster assignment to be used even it increases the cost function. This is

like a deterministic annealing, since it is done only for one pair, and it is done at every step

INITIALIZE (KM, var, ClusterLabels)
ClusterLabels = random()
WHILE (KM >= 1) {
    GRP = CREATE_GROUPS (InputData, KM, ClusterLabels)
    FOR i = 0 to SIZE(GRP) {
        Change the cluster labels of GRP(i) and record the improvement, if any
        FOR j = i+1 to SIZE(GRP) {
            Change the cluster labels of GRP(i) and GRP(j)
            and record the improvement, if any
        }
    }
    Decrease the group size KM; this can be linear or exponential
}

Figure 5-10. Optimization algorithm

without any random decision. Of course we can do this for every pair, but this will end up

doing a complete enumeration among the possible solutions, which is not computationally

feasible. We repeat calculating a new cluster label, until there is no improvement possible

with the given group size KM. The next step is to reduce the group size KM, and repeat

the whole process, until the group size is 1, which will be equivalent to the 2-change algo-

rithm with the N2(p, q) neighborhood structure as explained before. When the algo-

rithm stops, it is a good idea to run one more pass by setting KM back to the initial value and running the whole process again on the same data, with the previously obtained clustering as the initial condition; this can be repeated until no change is recorded. It is

possible to increase the group size starting from one, to a final value. Repeating the algo-

rithm will help us to escape from local minima, unlike previous attempts.
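The procedure of Figure 5-10 can be sketched as follows (hypothetical names; two clusters and a caller-supplied cost and grouping routine are assumed; the sketch flips single groups and pairs of groups, recording any improvement):

```python
def optimize(labels, make_groups, cost, km_start):
    """Group-based 2-change descent (after Figure 5-10): flip the labels of
    one group, or of a pair of groups, keep any labeling that lowers the
    cost, and anneal the group size KM linearly down to 1."""
    def flip(lab, *groups):
        out = list(lab)
        for g in groups:
            for i in g:
                out[i] = 1 - out[i]  # two clusters assumed in this sketch
        return out

    best = list(labels)
    best_cost = cost(best)
    for km in range(km_start, 0, -1):  # linear annealing of the group size
        grp = make_groups(best, km)
        for i in range(len(grp)):
            # try flipping group i alone, then together with each later group
            trials = [flip(best, grp[i])] + [
                flip(best, grp[i], grp[j]) for j in range(i + 1, len(grp))]
            for cand in trials:
                c = cost(cand)
                if c < best_cost:
                    best, best_cost = cand, c
    return best
```

Here `cost` can be any labeling cost; in the thesis it would be the CEF evaluated on the Parzen estimates of the two clusters. Repeating the whole pass from the obtained clustering, as suggested above, is a straightforward extension.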

It is possible that several optimization algorithms can be derived from the given algo-

rithm. Let's assume that we are increasing the value of group size KM, and at a certain

group size, we made an improvement. It is possible that instead of increasing the group

size at the next step, we can decrease the group size, until no further improvements occur

in the cost function. When the improvements stop, then we can continue to increase the

group size. This variation is illustrated in Figure 5-11. It should be

noted that none of these algorithms is guaranteed to reach the global optimum of the cost function.


Figure 5-11. A variation of the algorithm


It is possible to record the change that makes the biggest improvement at the end of the

outside loop, after calculating all possible improvements, instead of changing it immedi-

ately when an improvement is recorded. This may result in a slightly different solution.

5.3.4 Convergence

Since there is a finite number of combinations tested, and since the algorithm continues to work only when there is an improvement in the cost function, the algorithm is guaranteed to stop in a finite number of iterations. There is only one condition under which the

algorithm may not stop. Because of some underflow errors, if the value of the function

becomes zero, the algorithm will run forever, since no improvement can be done beyond

zero. The integral is always greater than or equal to zero. This is an computational prob-

lem, not an algorithmic problem. So whenever the cost function becomes zero, the algo-

rithm should be forced to stop.

5.4 Preliminary Result

We tested our algorithm using multiple clusters and the results are given in Figure 5-

12. Three partial sinewaves were created, with Gaussian noise of different variances added and 150 samples per cluster. The algorithm was run with KM starting at 150, annealed linearly one by one until the group size reached one. The kernel size used in this

experiment is 0.05 and the minimum value of the CEF function is 3.25E-05. Each symbol

represents a different cluster. As we can see the algorithm was able to find the natural

boundaries for the clusters, although they are highly nonlinear.

We tested the algorithm with more overlapping clusters. The input data is given in Fig-

ure 5-13. We obtained meaningful results when we set the variance of the kernel such that



Figure 5-12. Output with three clusters

the pdf estimation shows the structure of the data properly. When we say meaningful, we mean meaningful to the person that created the data, which may not always be what is given in the data. Another important property of all these results is that all of them represent a local optimum point of the cost function; it is always possible that another solution with a lower value of the cost function exists. This time we tested the algorithm with dif-

ferent variance values. Results are given in Figure 5-14. When the variance is high, single

point clusters are formed. As we dropped the variance, the results are improved, and we

got the result that mimics the data set generation. Points that are isolated may not repre-

sent the structure of the data as well as the other points. So it is possible that those points

will form single point clusters. This is a basic problem with the cost function proposed. It

is possible to eliminate these points and run the algorithm again.

Figure 5-13. Overlapping boundaries

Another test set, where clusters do not have same number of points, is used to see the

behavior of the algorithm. The input data can be seen in Figure 5-15. The output of the

algorithm can be seen in Figure 5-16. To get more information about the convergence of the

algorithm, we collected some statistics each time an improvement occurred in the algo-

rithm given in Figure 5-10. The statistics collected are the entropy of each class, and the

value of the CEF function. The value of the CEF function can be seen in Figure 5-17. The

plot is the output of the algorithm during the minimization of the data given in Figure 5-

15. The horizontal scale is compressed time and does not reflect the real time passed. In

the beginning of the algorithm the interval between improvements was short, whereas as the calculations continued, the time interval between improvements got bigger. The mini-

mum value is not zero, but a very small value. The entropy of each class is a more interest-

ing statistic and is given in Figure 5-18. Renyi's entropy is used to calculate the entropy of

each cluster generated after each improvement.


Figure 5-14. Output with a variance of 0.026



Figure 5-15. Non-symmetric clusters

An interesting plot occurs when we plot the inverse of the entropies, which can be seen

in Figure 5-19. Although the clusters are not symmetrical, the inverse entropies are almost




Figure 5-16. Output with variance 0.021

Figure 5-17. CEF function


Figure 5-18. Entropies of each cluster



Figure 5-19. Inverse entropies of each cluster

a mirror reflection of each other. When we add them up we get the plot in Figure 5-20. The

interesting thing about this figure is that the function decreases in parallel with the CEF function.

Figure 5-20. Sum of inverse entropies

5.5 Comparison

We would like to collect the performance measures of the algorithm in several tables,

where the algorithm is compared to a k-means clustering algorithm and to a neural net-

work based classification algorithm. It should be mentioned that it is meaningful to compare our algorithm with a neural network only if the data set contains valleys between clusters.

Otherwise a neural network with enough neurons is capable of separating the samples

even when there is no valley between them because of the supervised learning scheme. To

be able to use a supervised classification algorithm, we assume that the data is generated

with known labels. We used the data in Figure 5-12, Figure 5-13 and Figure 5-15. We used

a one-hidden-layer neural network topology with 10 hidden nodes and two output nodes.

The result shows that the proposed algorithm is superior to the k-means algorithm and

although it is an unsupervised method, it performed on par with a supervised classification algorithm.


Table 5-1. Results for k-means algorithm

DATA SET1   Class 1   Class 2   Class 3
Class 1     110       40        0
Class 2     25        90        35
Class 3     2         36        112

Table 5-2. Results for k-means algorithm

DATA SET2   Class 1   Class 2
Class 1     350       0
Class 2     69        221

Table 5-3. Results for k-means algorithm

DATA SET3   Class 1   Class 2
Class 1     130       20
Class 2     31        319

Table 5-4. Results for supervised classification algorithm

DATA SET1   Class 1   Class 2   Class 3
Class 1     150       0         0
Class 2     0         150       0
Class 3     0         0         150

Table 5-5. Results for supervised classification algorithm

DATA SET2   Class 1   Class 2
Class 1     342       8
Class 2     9         341

Table 5-6. Results for supervised classification algorithm

DATA SET3   Class 1   Class 2
Class 1     144       6
Class 2     9         341

Table 5-7. Results for the proposed algorithm

DATA SET1   Class 1   Class 2   Class 3
Class 1     150       0         0
Class 2     0         150       0
Class 3     0         0         150

Table 5-8. Results for the proposed algorithm

DATA SET2   Class 1   Class 2
Class 1     340       10
Class 2     12        338

Table 5-9. Results for the proposed algorithm

DATA SET3   Class 1   Class 2
Class 1     145       5
Class 2     9         341
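For reference, the overall accuracies implied by these confusion matrices (rows are true classes, columns are assigned classes) can be computed directly; the two matrices below transcribe the data-set-1 results for k-means and for the proposed algorithm:

```python
def accuracy(confusion):
    """Fraction of correctly assigned samples: diagonal sum over total."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Data set 1 confusion matrices, transcribed from Tables 5-1 and 5-7
kmeans_set1 = [[110, 40, 0], [25, 90, 35], [2, 36, 112]]
proposed_set1 = [[150, 0, 0], [0, 150, 0], [0, 0, 150]]
```

For data set 1 this gives roughly 69% for k-means and 100% for the proposed algorithm.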
