Entropy-Based Techniques with Applications in Data Mining

Subjects / Keywords:
Cost functions
Datasets
Dimensionality reduction
Entropy
Mining
Probability distributions
Random variables
Simulations
Standard deviation
War theaters

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.

Copyright 2005


Anthony Okafor

This work is dedicated to my family.


ACKNOWLEDGMENTS

I want to thank Professor Panos M. Pardalos for his help and patience in guiding
me through the preparation and completion of my Ph.D. I also want to thank Drs.
Joseph P. Geunes, Stanislav Uryasev and William Hager for their insightful comments,
valuable suggestions, constant encouragement and for serving on my supervisory
committee.

I also would like to thank my colleagues in the graduate school of the Industrial and
Systems Engineering Department, especially Don Grundel.

Finally, I am especially grateful to my wife, my parents, and sister for their
support and encouragement as I complete the Ph.D. program.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

1 INTRODUCTION
  1.1 Data Mining
    1.1.1 Classification
    1.1.2 Clustering
    1.1.3 Estimation
    1.1.4 Prediction
    1.1.5 Description

2 ENTROPY OPTIMIZATION
  2.1 Introduction
  2.2 A Background on Entropy Optimization
    2.2.1 Definition of Entropy
    2.2.2 Choosing A Probability Distribution
    2.2.3 Prior Information
    2.2.4 Minimum Cross Entropy Principle
  2.3 Applications of Entropy Optimization

3 DATA MINING USING ENTROPY
  3.1 Introduction
  3.2 K-Means Clustering
  3.3 An Overview of Entropy Optimization
    3.3.1 Minimum Entropy and Its Properties
    3.3.2 The Entropy Decomposition Theorem
  3.4 The K-Means via Entropy Model
    3.4.1 Entropy as a Prior Via Bayesian Inference
    3.4.2 Defining the Prior Probability
    3.4.3 Determining Number of Clusters
  3.5 Graph Matching
  3.6 Results
    3.6.1 Image Clustering
    3.6.2 Iris Data
  3.7 Conclusion

4 DIMENSION REDUCTION
  4.1 Introduction
    4.1.1 Entropy Dimension Reduction
    4.1.2 Entropy Criteria For Dimension Reduction
    4.1.3 Entropy Calculations
    4.1.4 Entropy and the Clustering Criteria
    4.1.5 Algorithm
  4.2 Results
  4.3 Conclusion


  5.1 Introduction
    5.1.1 Problem Parameters
    5.1.2 Entropy Solution
  5.2 Mode 1
  5.3 Mode 2
  5.4 Mode 3
  5.5 Maximizing the Probability of Detecting a Target
    5.5.1 Cost Function. Alternative 1
    5.5.2 Generalization
    5.5.3 Cost Function. Alternative 2 and Markov Chain Model
    5.5.4 The Second Order Estimated Cost Function with Markov Chain
    5.5.5 Connection of Multistage Graphs and the Problem
  5.6 More General Model
    5.6.1 The Agent is Faster Than Target
    5.6.2 Obstacles in Path
    5.6.3 Target Direction
  5.7 Conclusion and Future Direction

6 BEST TARGET SELECTION
  6.1 Introduction
  6.2 Maximize Probability of Attacking the Most Valuable Target
    6.2.1 Best Target Strategy
      Probability of Attacking the Most Valuable Target
      Mean Rank of the Attacked Target
      Mean Number of Examined Targets
    6.2.2 Results of Best Target Strategy
    6.2.3 Best Target Strategy with Threshold
  6.3 Maximize Mean Rank of the Attacked Target
    6.3.1 Mean Value Strategy
    6.3.2 Results of Mean Value Strategy
  6.4 Number of Targets is a Random Variable
  6.5 Target Strategy with Sampling-The Learning Agent
  6.6 Multiple Agents
    6.6.1 Agents as a Pack
    6.6.2 Separate Agents
      Separate Agents without Communication
      Separate Agents with Communication
      Separate Agents Comparison
    6.6.3 Multiple Agent Strategy-Discussion
  6.7 Dynamic Threshold
  6.8 Conclusion


  7.1 Summary
  7.2 Future Research

REFERENCES

BIOGRAPHICAL SKETCH



LIST OF TABLES

3-1 The number of clusters for different values of β

3-2 The number of clusters as a function of β for the iris data

3-3 Percentage of correct classification of iris data

3-4 The average number of clusters for various k using a fixed β = 2.5 for the
iris data

3-5 The average number of clusters for various k using a fixed β = 5.0 for the
iris data

3-6 The average number of clusters for various k using a fixed β = 10.5 for
the Iris Data

6-1 Significant results of the basic best target strategy

6-2 Empirical results of the best target strategy with
1-percent tail

6-3 Empirical results of the best target strategy with
2.5-percent tail

6-4 Empirical results of the best target strategy with
30-percent tail

6-5 k values to minimize mean rank of attacked targets

6-6 Simulation results of the best target strategy

6-7 Simulation results of the mean target strategy

6-8 Simulation results with number of targets Poisson distributed, mean n

6-9 Simulation results with number of targets normally distributed, mean n
and standard deviation 0.2n

6-10 Simulation results with number of targets uniformly distributed in
[0.5n, 1.5n]

6-11 Performance summary of best target strategy and mean value strategy
when n varies; values are in percentage drop compared to when n is fixed

6-12 Simulation results with expected number of targets, n, updated 90-percent
into the mission

6-13 Simulation results with expected number of targets, n, updated near the
end of the mission. n may be updated downward, but not upward.

6-14 Simulation results of the target strategy with sampling

6-15 Simulation results of the target strategy with m agents in a pack. Target
values are uniform on [0,1000].

6-16 Simulation results of the target strategy with m agents on separate
missions with no communication between them. Target values are uniform
on [0,1000].

6-17 Simulation results of the target strategy with m agents on separate
missions with communication between them. Target values are uniform
on [0,1000].

6-18 Simulation results of the target strategy with m agents on separate
missions with communication between them. Uncommitted agents are
allowed to evaluate targets in other unsearched partitions.

6-19 Simulation results using the dynamic threshold strategy

LIST OF FIGURES

3-1 K-Means algorithm

3-2 Entropy K-means algorithm

3-3 Generic MST algorithm

3-4 Kruskal MST algorithm

3-5 Graph clustering algorithm

4-1 Algorithm for dimension reduction

5-1 Target and agent boundaries

5-2 Region

5-3 Multistage graph representation 1

5-4 Multistage graph representation 2

5-5 Region with obstacles

5-6 Using the 8 azimuths

6-1 Racetrack search of battlespace with n targets

6-2 Plots of probabilities of attacking the jth best target for two proposed
strategies, n=100

6-3 m agents performing racetrack search of battlespace with n targets

6-4 m agents on separate missions of an equally partitioned battlespace
performing racetrack search for n targets

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Anthony Okafor

December 2005

Chair: Panos M. Pardalos
Major Department: Industrial and Systems Engineering

Many real-world problems in engineering, mathematics and other areas are often
solved on the basis of measured data, given certain conditions and assumptions.
In solving these problems, we are concerned with solution properties like existence,
uniqueness, and stability. A problem for which any one of these three conditions
is not met is called an ill-posed problem. Ill-posedness is typically caused by
incomplete and/or noisy data, where noise refers to any discrepancy between the
measured and true data. Equally important in data analysis is the effective
interpretation of the results. Data sets to be analyzed usually have several
attributes, and the domain of each attribute can be very large. Results obtained in
these high dimensions are therefore very difficult to interpret.

Several solution methods exist to handle this problem. One of these methods is
the maximum entropy method. In this dissertation we present entropy optimization
methods and give applications in modelling real-life problems, specifically in mining
numerical data. Best target selection and the application of entropy in modelling
path planning problems are also presented.


CHAPTER 1
INTRODUCTION

Many real-world problems are solved on the basis of measured data, given
certain conditions and assumptions. Areas where data analysis is involved include
government and military systems, medicine, sports, finance, and geographical
information systems [1, 35, 31]. Solutions to these problems in most cases require
understanding the structural properties of the data set (pattern discovery). In
pattern discovery, we look for a model that reflects the structure of the data, which
we hope will in turn reflect the structure of the generating process. Thus, given a
data set, we want to extract as much essential structure as possible without modelling
any of its accidental structure (e.g., noise and sampling artifacts). We want to
maximize the information content of all parameters. One method for achieving these
objectives is entropy optimization [8], in which entropy minimization maximizes
the amount of evidence supporting each parameter while minimizing the uncertainty
in the sufficient statistic and the cross entropy between the model and the data.

1.1 Data Mining

Data mining or knowledge discovery in databases (KDD) is a non-trivial process
that seeks to identify valid, useful and ultimately understandable patterns in data
[19]. KDD consists of several steps, including preparation of data, pattern
search, knowledge evaluation and refinement. Data mining is a very important step
in the KDD process, since this is where specific algorithms are employed for extracting
patterns from the data. Data mining can therefore be considered a set of processes
for extracting knowledge from the data contained in a database [9].

Data mining techniques include a variety of methods. These methods generally
fall into one of two groups: predictive methods and descriptive methods. The
predictive methods use some variables to predict unknown or future
values of other variables. They are usually referred to as classification. The
descriptive methods seek human-interpretable patterns that describe the data and are
referred to as clustering. However, some authors, including [6], have phrased these
methods in terms of six tasks: classification, estimation, prediction, clustering,
market basket analysis and description. These different tasks of data mining are
described below.

1.1.1 Classification

Classification is the process of assigning categories to an unknown data set.
The data set to be classified is divided into two parts, a training set and a testing set.
A characteristic of classification is that there is a well-defined set of categories or
classes. Classification is sometimes referred to as supervised learning. The machine
learning algorithm to be applied is trained using the training set (pre-classified
examples) until the error is decreased to some threshold. This process is done
iteratively, and repeated many times with different parameter values and with the
input presented in randomized order. Once an optimal process design is obtained,
the testing data, which are unknown to the algorithm, are used to evaluate it.

1.1.2 Clustering

Clustering is concerned with the grouping of unlabelled feature vectors into
clusters such that samples within a cluster are more similar to each other than to
samples belonging to different clusters. The clustering problem can be stated as
follows: given a set of n data points $(x_1, \ldots, x_n)$ in d-dimensional space
$R^d$ and an integer k, partition the set of data into k disjoint clusters so as to
minimize some loss function.

1.1.3 Estimation

Estimation is a task of data mining that is generally applied to continuous

data. Similar to classification, estimation has the feature that the data record is

rank ordered, thereby making it easy to work with a part of the data that has the

desired attributes. Neural network methods are well suited for estimation.

1.1.4 Prediction

This data mining task, sometimes grouped with classification, allows objects to
be classified based on predicted future behavior or values. In prediction, historical
data are often used to build the model that predicts future behavior.

1.1.5 Description

Description is the data mining task concerned with understanding complicated
databases. Data description gives insight into the different attributes of the data.

The remainder of the thesis proceeds as follows. Chapter 2 discusses entropy
optimization. A definition of entropy is given, and the rationale for its
use in the area of data mining is discussed. Chapter 3 develops entropy tools for data
mining and applies them to data clustering. Problems of high dimensionality in data
mining are the focus of chapter 4, where an entropy dimension reduction method is
given. In chapters 5 and 6, we apply the predictive ability of our entropy
optimization methods to model a path planning problem and a best target selection
problem, respectively. Chapter 7 summarizes the work and proposes extensions.


CHAPTER 2
ENTROPY OPTIMIZATION

2.1 Introduction

Many real-world problems are solved on the basis of measured data, given
certain conditions and assumptions. Areas where data analysis is involved include
government and military systems, medicine, sports, finance, and geographical
information systems [1, 35, 31]. Solutions to these problems in most cases require
understanding the structural properties of the data set (pattern discovery). In
pattern discovery, we look for a model that reflects the structure of the data, which
we hope will in turn reflect the structure of the generating process. Thus, given a
data set, we want to extract as much essential structure as possible without modelling
any of its accidental structure (e.g., noise and sampling artifacts). We want to
maximize the information content of all parameters. One method for achieving these
objectives is entropy optimization [8], in which entropy minimization maximizes
the amount of evidence supporting each parameter while minimizing the uncertainty
in the sufficient statistic and the cross entropy between the model and the data.

This chapter is organized as follows. In the next section, we provide some
background on entropy and some rationale for its use in the area of data mining.

2.2 A Background on Entropy Optimization

The concept of entropy was originally developed by the physicist Rudolf Clausius
around 1865 as a measure of the amount of energy in a thermodynamic system, as
cited in Fang et al. [15]. This concept was later extended through the development
of statistical mechanics. It was first introduced into information theory in 1948 by
Claude Shannon, as cited in Shore et al. [45].


2.2.1 Definition of Entropy

Entropy can be defined as a measure of the expected information content or
uncertainty of a probability distribution. It is also defined as the degree of disorder
in a system or the uncertainty about a partition [45, 29].

Let $E_i$ stand for an event and $p_i$ for the probability that event $E_i$ occurs.
Let there be n such events $E_1, \ldots, E_n$ with probabilities $p_1, \ldots, p_n$
adding up to 1. Because the occurrence of an event with smaller probability is less
expected and therefore yields more information, a measure of information h should
be a decreasing function of $p_i$. Claude Shannon proposed a log function $h(p_i)$
to express information. This function is given as

$$h(p_i) = \log_2 \frac{1}{p_i} \tag{2-1}$$

which decreases from infinity to 0 as $p_i$ ranges from 0 to 1. This function reflects
the idea that the lower the probability of an event, the higher the amount
of information in the message stating that the event occurred.

From these n information values $h(p_i)$, the expected information content H,
called entropy, is derived by weighting the information values by their respective
probabilities:

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i \tag{2-2}$$

Since $-p_i \log_2 p_i \ge 0$ for $0 \le p_i \le 1$, it follows from (2-2) that
$H \ge 0$, where $H = 0$ iff one of the $p_i$ equals 1; all others are then equal to
zero. Hence the convention $0 \ln 0 = 0$.
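As an illustrative sketch (not part of the original text), equations (2-1) and (2-2) can be evaluated directly. The function names `self_information` and `shannon_entropy` and the `base` parameter are assumptions introduced only for this example.

```python
import math


def self_information(p, base=2.0):
    """Information h(p) = log(1/p) of an event with probability p, as in (2-1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log(p) / math.log(base)


def shannon_entropy(probs, base=2.0):
    """Expected information H = -sum p_i log p_i, as in (2-2).

    Zero-probability outcomes are skipped, which implements 0 log 0 = 0.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * self_information(p, base) for p in probs if p > 0.0)


fair_coin = shannon_entropy([0.5, 0.5])   # one bit of uncertainty
certain = shannon_entropy([1.0, 0.0])     # no uncertainty at all
```

A fair coin carries one bit of entropy, while a certain outcome carries none, in line with the remark that H = 0 iff one probability equals 1.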

Definition 2.1. Given a discrete random variable X taking on values in the finite
set $\{x_1, \ldots, x_n\}$ with probabilities $p = (p_1, \ldots, p_n)$, we define the
Shannon entropy to be

$$H(X) = H(p) = -k \sum_{i=1}^{n} p_i \ln p_i \tag{2-3}$$

where k depends on the unit used and is usually set to unity. The convention
$0 \ln 0 = 0$ applies also.

The Shannon entropy has the following desirable properties [15]:

1. The Shannon measure is nonnegative and concave in $p_1, \ldots, p_n$.
2. The measure does not change with the inclusion of a zero-probability outcome.
3. The entropy of a probability distribution representing a completely certain
outcome is 0, and the entropy of any probability distribution representing an
uncertain outcome is positive.
4. Given a fixed number of outcomes, the maximum possible entropy is that of
the uniform distribution.
5. The entropy of the joint distribution of two independent distributions is the
sum of the individual entropies.
6. The entropy of the joint distribution of two dependent distributions is no greater
than the sum of the two individual entropies.
7. Since entropy depends only on the unordered probabilities and not on X, it is
invariant to both shift and scale, i.e., $H(aX + b) = H(X)$ for $a \neq 0$ and for all b.
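Several of the properties listed above can be checked numerically. The following is a small sketch under stated assumptions: the helper `entropy` and the example distributions `uniform`, `skewed`, `p`, `q` and `joint` are hypothetical and chosen only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in nats (k = 1), with 0 ln 0 taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Property 4: for a fixed number of outcomes the uniform distribution is maximal.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
uniform_is_larger = entropy(uniform) > entropy(skewed)

# Property 2: adding a zero-probability outcome leaves the entropy unchanged.
padded = skewed + [0.0]

# Property 5: the joint distribution of two independent variables has
# entropy equal to the sum of the individual entropies.
p = [0.5, 0.5]
q = [0.2, 0.8]
joint = [pi * qj for pi in p for qj in q]
additivity_gap = abs(entropy(joint) - (entropy(p) + entropy(q)))
```

The check is only a spot test on particular distributions; it does not replace the proofs cited in [15].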

Definition 2.2. The differential entropy of a continuous random variable X with
probability density function p(x) is

$$H(X) = -\int p(x) \ln p(x)\,dx \tag{2-4}$$

Again, $0 \ln 0$ is taken to be 0.

The differential entropy does not retain all of the useful properties of the discrete
entropy. It is not invariant under transformations of X such as scaling, and its value
can be negative.
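A short worked example, added here as an illustration and not taken from the original text, makes both remarks concrete. It assumes X is uniform on the interval [0, a] with a > 0 and considers the rescaled variable cX for a constant c ≠ 0; the symbols a and c are introduced only for this example.

```latex
% Assumption: X is uniform on [0, a], so p(x) = 1/a on that interval.
\begin{align*}
H(X)  &= -\int_0^a \frac{1}{a}\,\ln\frac{1}{a}\,dx = \ln a, \\
H(X)  &< 0 \quad \text{whenever } 0 < a < 1, \\
H(cX) &= \ln\bigl(|c|\,a\bigr) = H(X) + \ln|c| \quad \text{for } c \neq 0 .
\end{align*}
```

Rescaling X therefore shifts the differential entropy by ln|c|, whereas the discrete entropy is unaffected by scaling (property 7 above).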

2.2.2 Choosing A Probability Distribution

E. T. Jaynes introduced the principle of maximum entropy in 1957 [18]. The
Maximum Entropy Principle (MaxEnt) is stated as follows:

Out of all possible distributions that are consistent with available information
(constraints), choose the one that has maximum entropy.

Using this principle, we give an entropy formulation for an associated problem. Let
X denote a random variable with n possible outcomes $x_1, \ldots, x_n$, and let
$p = (p_1, \ldots, p_n)$ denote their respective probabilities.
Let $r_1(X), \ldots, r_m(X)$ be m functions of X with known expected values
$E(r_1(X)) = a_1, \ldots, E(r_m(X)) = a_m$. The MaxEnt formulation is as follows:

$$\max \; H(X) = -\sum_{i=1}^{n} p_i \ln p_i$$

$$\text{s.t.} \quad \sum_{i=1}^{n} p_i\, r_j(x_i) = a_j, \quad j = 1, \ldots, m$$

$$\sum_{i=1}^{n} p_i = 1$$

$$p_i \ge 0, \quad i = 1, \ldots, n$$

This is a concave maximization problem with linear constraints. The solution to
this optimization problem is obtained by applying the method of Lagrange multipliers.
The form of the solution is exponential. The Lagrangian of the optimization problem
is

$$L(\lambda, p) = -\sum_{x \in X} p(x) \ln p(x) + \lambda_0 \Bigl(\sum_{x \in X} p(x) - 1\Bigr) + \sum_{i=1}^{m} \lambda_i \Bigl(\sum_{x \in X} p(x)\, r_i(x) - a_i\Bigr) \tag{2-5}$$

Taking the derivative with respect to p(x) we get

$$\frac{\partial L(\lambda, p)}{\partial p(x)} = -\ln p(x) - 1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x) \tag{2-6}$$

Setting this derivative to zero gives

$$p(x) = e^{-1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x)}, \quad \text{for all } x \in X$$

where $\lambda_0, \lambda_1, \ldots, \lambda_m$ are chosen so that the constraints are
satisfied. In the absence of the moment constraints, that is, for the problem

$$\max \; H(X) = -\sum_{i=1}^{n} p_i \ln p_i$$

$$\text{s.t.} \quad \sum_{i=1}^{n} p_i = 1$$

$$p_i \ge 0, \quad i = 1, \ldots, n$$

the distribution with the maximum entropy is the uniform distribution with $p_i = 1/n$.

Suppose you are given data on 3 routes from A to B that you usually take to work.
The costs of the routes in dollars are 1, 2, and 3. The average cost is $1.75. What is the
maximum entropy distribution describing your choice of route for a particular day?

The solution to the above example can be formulated and solved as follows.

$$\max \; -(p_1 \ln p_1 + p_2 \ln p_2 + p_3 \ln p_3)$$

$$\text{s.t.} \quad 1 p_1 + 2 p_2 + 3 p_3 = 1.75$$

$$p_1 + p_2 + p_3 = 1$$

$$p_1 \ge 0, \; p_2 \ge 0, \; p_3 \ge 0$$

Eliminating variables from the two equality constraints gives $p_2 = 0.75 - 2p_3$ and
$p_1 = 0.25 + p_3$, so the feasible ranges of the $p_i$ are

$$0 \le p_2 \le 0.75$$

$$0 \le p_3 \le 0.375$$

$$0.25 \le p_1 \le 0.625$$

The maximum entropy solution is $p_1 = 0.466$, $p_2 = 0.318$, $p_3 = 0.216$.
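The route example can be reproduced numerically. The sketch below is an illustration rather than the method used in the original work: it relies on the exponential form of the MaxEnt solution, p_i proportional to exp(λ c_i), and finds the single multiplier λ by bisection. The function name `maxent_given_mean`, the search bracket [-50, 50] and the tolerance are assumptions made for this example.

```python
import math


def maxent_given_mean(values, target_mean, tol=1e-12):
    """Maximum entropy distribution on `values` with a prescribed mean.

    Uses the exponential form p_i ~ exp(lam * values[i]) and finds lam by
    bisection; the bracket [-50, 50] is an assumption suited to small examples.
    """
    if not min(values) < target_mean < max(values):
        raise ValueError("target mean must lie strictly inside the value range")

    def distribution(lam):
        shift = max(lam * v for v in values)        # avoids overflow in exp
        weights = [math.exp(lam * v - shift) for v in values]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(p * v for p, v in zip(distribution(lam), values))

    lo, hi = -50.0, 50.0                            # mean(lam) increases with lam
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return distribution(0.5 * (lo + hi))


route_probs = maxent_given_mean([1.0, 2.0, 3.0], 1.75)
# route_probs is approximately [0.466, 0.318, 0.216]
```

For three equally spaced costs the multiplier can also be found in closed form: with x = exp(λ), the mean constraint reduces to 5x² + x − 3 = 0, whose positive root x ≈ 0.681 gives the same probabilities.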

2.2.3 Prior Information

Suppose that in addition to the moment constraints, we have an a priori probability
distribution $p^0$ that we think our probability distribution p should be close to.
How close should p be to $p^0$? A measure of this closeness or deviation is the
Kullback-Leibler distance, or the measure of relative (cross) entropy. This distance
measure was introduced in 1951 by S. Kullback and R. A. Leibler [29].

With relative entropy as a measure of deviation, the Kullback-Leibler minimum
entropy principle, or MinEnt, is stated as follows:

Out of all possible distributions that are consistent with available information
(constraints), choose the one that minimizes the cross-entropy with respect to the
a priori distribution.

The MinEnt formulation is as follows:

$$\min \; D(p \,\|\, p^0) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{p_i^0}$$

$$\text{s.t.} \quad \sum_{i=1}^{n} p_i\, r_j(x_i) = a_j, \quad j = 1, \ldots, m$$

$$\sum_{i=1}^{n} p_i = 1$$

$$p_i \ge 0, \quad i = 1, \ldots, n$$

If no a priori distribution is given, then we can use the maximum entropy distribution.
This leads to the uniform distribution u as the a priori distribution. We then obtain:

$$D(p \,\|\, u) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{1/n} = \ln n + \sum_{i=1}^{n} p_i \ln p_i$$

Since $\ln n$ is a constant, minimizing the cross-entropy with respect to the uniform
distribution is equivalent to maximizing entropy. MaxEnt is thus a special case of
MinEnt.
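The identity D(p‖u) = ln n − H(p) can be verified numerically. The sketch below is illustrative only; the helper names `kl_divergence` and `entropy` and the sample distribution `p` are assumptions introduced for this example.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D(p||q) = sum p_i ln(p_i / q_i), with 0 ln 0 = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                raise ValueError("q must be positive wherever p is positive")
            total += pi * math.log(pi / qi)
    return total


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


p = [0.466, 0.318, 0.216]          # hypothetical distribution over n = 3 outcomes
n = len(p)
u = [1.0 / n] * n                  # uniform a priori distribution

lhs = kl_divergence(p, u)
rhs = math.log(n) - entropy(p)     # ln n + sum p_i ln p_i
```

Because ln n does not depend on p, any p that minimizes `lhs` over a feasible set also maximizes `entropy(p)` over that set.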

2.2.4 Minimum Cross Entropy Principle

Several authors [49, 51] have explored the use of cross-entropy and have shown
rigorously that Jaynes' principle of maximum entropy and Kullback's principle of
minimum cross-entropy provide a correct method of inductive inference when new
information is given in the form of expected values. Given a distribution $p^0$ and
some new information in the form of constraints

$$\int p(x)\, c_k(x)\,dx \ge 0, \quad k = 1, 2, \ldots, m \tag{2-7}$$

the new distribution p(x), which incorporates this information in the least
biased way and which is arrived at in a way that does not lead to any
inconsistencies or contradictions, is the one obtained from minimizing

$$\int p(x) \ln \frac{p(x)}{p^0(x)}\,dx \tag{2-8}$$

This is the minimum cross-entropy principle [49, 50, 51]. These authors also showed
that the maximum entropy principle is a special case of minimum cross-entropy,
as outlined below. Suppose that we are trying to estimate the probability of finding
a system in state x. If we know that only n discrete states are possible, then we
already know some information about the system. This information is expressed
by $p_i^0 = 1/n$ for all i. If we obtain more information in the form of the inequality
given in (2-7), then the correct estimate of the probability of the system being in
state i is given by minimizing

$$D(p \,\|\, p^0) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{1/n} = \sum_{i=1}^{n} p_i \ln p_i + \ln n$$

which is equivalent to maximizing the entropy

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$

This principle is used in developing models in chapters five and six.

2.3 Applications of Entropy Optimization

Entropy optimization has been applied successfully to many scientific and
engineering problems. Example applications include transportation planning
(Fang and Tsao, 1993) [16], regional planning (Wilson, 1970) [55], investment
portfolio optimization (Kapur et al., 1989) [29], image reconstruction
(Burch et al., 1984) [21], and pattern recognition (Tou and Gonzalez, 1974) [53].

Rationale for Using Entropy Optimization. Data to be clustered are usually
incomplete. A solution based on the data should incorporate and be consistent with
all relevant data and be maximally noncommittal with regard to unavailable data. The
solution may be viewed as a procedure for extracting information from data. The
information comes from two sources: the measured data and the assumptions made
about the data that are unavailable because of incompleteness. Making an assumption
means artificially adding information, which may be true or false. Maximum entropy
implies that the added information is minimal. A maximum entropy solution makes
the fewest assumptions and is maximally noncommittal.

In the next chapter we develop an entropy minimization method and apply it to
data clustering.


CHAPTER 3
DATA MINING USING ENTROPY

3.1 Introduction

Data clustering and classification analysis is an important tool in statistical
analysis. Clustering techniques find applications in many areas including pattern
recognition and pattern classification, data mining and knowledge discovery, data
compression and vector quantization. Data clustering is a difficult problem that often
requires the unsupervised partitioning of the data set into clusters. In the absence of
prior knowledge about the shape of the clusters, similarity measures for a clustering
technique are hard to specify. The quality of a good cluster is application dependent,
since there are many methods for finding clusters subject to various criteria, which
are both ad hoc and systematic [28].

Another difficulty in using unsupervised methods is the need for input
parameters. Many algorithms, especially the K-means and other hierarchical methods
[26], require that the initial number of clusters be specified. Several authors have
proposed methods that automatically determine the number of clusters in the data
[22, 29, 25]. These methods use some form of cluster validity measure, such as
variance, a priori probabilities, or the difference of cluster centers. The results
obtained are not always as expected and are data dependent [54]. Some criteria from
information theory have also been proposed. The Minimum Description Length
(MDL) criterion evaluates the compromise between the likelihood of the classification
and the complexity of the model [48].

In this chapter, we develop a framework for clustering by learning from the
structure of the data. Learning is accomplished by randomly applying the K-means
algorithm via entropy minimization (KMEM) multiple times on the data. KMEM
enables us to overcome the problem of knowing the number of clusters a
priori. Multiple applications of KMEM allow us to maintain a similarity
matrix between pairs of input patterns. An entry $a_{ij}$ in the similarity matrix gives
the proportion of times input patterns i and j are co-located in a cluster among N
clusterings using KMEM. Using this similarity matrix, the final data clustering is
obtained by clustering a sparse graph of this matrix.
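The similarity matrix described above can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes each of the N clusterings is available as a list of cluster labels, one per input pattern, and the function name `coassociation_matrix` and the sample `runs` are hypothetical.

```python
def coassociation_matrix(labelings):
    """Proportion of clusterings in which patterns i and j share a cluster.

    `labelings` is a list of N label lists, each of length n (one label per
    input pattern). Returns an n x n matrix a with a[i][j] in [0, 1].
    """
    n_runs = len(labelings)
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        if len(labels) != n:
            raise ValueError("every clustering must label all n patterns")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_runs for c in row] for row in counts]


# Three hypothetical clusterings of four patterns.
runs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
a = coassociation_matrix(runs)
```

Only co-membership matters, so the arbitrary numbering of clusters in each run (for instance, the swapped labels in the third run) does not affect the matrix.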

The contribution of this work is the incorporation of entropy minimization to
estimate an approximate number of clusters in a data set based on some threshold,
and the use of graph clustering to recover the expected number of clusters.

This chapter is organized as follows. In the next section, we provide some
background on the K-means algorithm. A brief discussion of entropy that will be
necessary in developing our model is presented in section 3.3. The proposed K-means
via entropy minimization is outlined in section 3.4. The graph clustering approach is
presented in section 3.5. The results of our algorithms are discussed in section 3.6. We
conclude briefly in section 3.7.

3.2 K-Means Clustering

The K-means clustering [33] is a method commonly used to partition a data
set into k groups. In the K-means clustering, we are given a set of n data points
(patterns) $(x_1, \ldots, x_n)$ in d-dimensional space $R^d$ and an integer k, and the
problem is to determine a set of k points (centers) in $R^d$ so as to minimize the
square of the distance from each data point to its nearest center. That is, find k
centers $(c_1, \ldots, c_k)$ which minimize

$$\sum_{j=1}^{k} \sum_{x \in C_j} d^2(x, c_j) \tag{3-1}$$

where the clusters $C_j$ are disjoint and their union covers the data set. The K-means
consists primarily of two steps:

1) The assignment step, where based on initial k cluster centers of classes, instances
are assigned to the closest class.

2) The re-estimation step, where the class centers are recalculated from the instances
assigned to that class.

These steps are repeated until convergence occurs, that is, when the re-estimation
step leads to minimal change in the class centers. The algorithm is outlined in figure
3-1.

The K-means Algorithm
P = {p_1, ..., p_n} (points to be clustered)
k (number of clusters)
C = {c_1, ..., c_k} (cluster centers)
m : P -> {1, ..., k} (cluster membership)
Procedure K-means
1. Initialize C (random selection from P).
2. For each p_i in P, m(p_i) = argmin over j in {1, ..., k} of distance(p_i, c_j).
3. If m has not changed, stop, else proceed.
4. For each j in {1, ..., k}, recompute c_j as the center of {p | m(p) = j}.
5. Go to step 2.

Figure 3-1: K-Means algorithm

Several distance metrics, like the Manhattan or the Euclidean, are commonly used.
In this research, we consider the Euclidean distance metric. Issues that arise in using
the K-means include the shape of the clusters, choosing the number of clusters, the
selection of initial cluster centers, which can affect the final results, and degeneracy.

There are several ways to select the initial cluster centers. Given the number of
clusters k, one can randomly select k points from the data set (this approach was used
in our analysis). One can also generate k seeds as the initial cluster centers, or
manually specify the initial cluster centers. Degeneracy arises when the algorithm is
trapped in a local minimum, thereby resulting in some empty clusters. In this work
we intend to handle the last three problems via entropy optimization.
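The procedure of figure 3-1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this research: it uses squared Euclidean distance, initializes the centers by random selection from the data, and leaves the center of an empty cluster unchanged (one simple way of coping with degeneracy). The names `kmeans`, `squared_distance`, `max_iter` and `seed` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means following steps 1-5 of figure 3-1.

    Returns (centers, membership), where membership[i] is the index of the
    center nearest to points[i].
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]   # step 1
    membership = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        new_membership = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 3: stop once the assignment no longer changes.
        if new_membership == membership:
            break
        membership = new_membership
        # Step 4: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, m in zip(points, membership) if m == j]
            if members:   # assumption: an empty cluster keeps its old center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, membership


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, membership = kmeans(data, k=2)
```

On this toy data set the two well-separated groups are recovered; on real data the result depends on the initial centers, which is one of the issues discussed above.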

3.3 An Overview of Entropy Optimization

The concept of entropy was originally developed by the physicist Rudolf Clausius

around 1865 as a measure of the amount of energy in a thermodynamic system [15].

This concept was later extended through the development of statistical mechanics.

It was first introduced into information theory in 1948 by Claude Shannon [45].

Entropy can be understood as the degree of disorder of a system. It is also a measure

of uncertainty about a partition [45, 29].

The philosophy of entropy minimization in the pattern recognition field can be

applied to classification, data analysis, and data mining where one of the tasks

is to discover patterns or regularities in a large data set. The regularities of

the data structure are characterized by small entropy values, while randomness is

characterized by large entropy values [29]. In the data mining field, the best-known

application of entropy is the information gain of decision trees. Entropy-based

discretization recursively partitions the values of a numeric attribute into a hierarchical

discretization. Using entropy as an information measure, one can then evaluate an

attribute's importance by examining the information theoretic measures [29].

Using entropy as an information measure of the distribution of data in the clusters,

we can determine the number of clusters. This is because we can represent data

belonging to a cluster as one bin. Thus a histogram of these bins represents cluster

distribution of data. From entropy theory, a histogram of cluster labels with low

entropy shows a classification with high confidence, while a histogram with high

entropy shows a classification with low confidence.

3.3.1 Minimum Entropy and Its Properties

Recall that the Shannon entropy is defined as

$$H(X) = -\sum_{i=1}^{n} p_i \ln p_i \tag{3-2}$$

where $X$ is a random variable with outcomes $1, 2, \ldots, n$ and associated probabilities
$p_1, p_2, \ldots, p_n$.

Since $-p_i \ln p_i \ge 0$ for $0 < p_i \le 1$, it follows from (3-2) that $H(X) \ge 0$, where
$H(X) = 0$ iff one of the $p_i$ equals 1; all others are then equal to zero. Hence the
convention $0 \ln 0 = 0$. For a continuous random variable with probability density function
$p(x)$, entropy is defined as

$$H(X) = -\int p(x) \ln p(x)\,dx \tag{3-3}$$

This entropy measure tells us whether one probability distribution is more informative

than the other. The minimum entropy provides us with minimum uncertainty, which

is the limit of the knowledge we have about a system and its structure [45]. In data

classification, for example the quest is to find minimum entropy [45]. The problem

of evaluating a minimal entropy probability distribution is the global minimization

of the Shannon entropy measure subject to the given constraints. This problem is

known to be NP-hard [45].

Two properties of minimal entropy which will be fundamental in the development

of the KMEM model are concentration and grouping [45]. Grouping implies moving all

the probability mass from one state to another, that is, reduce the number of states.

This reduction can decrease entropy.

Proposition 3.1. Given a partition $B = [B_a, B_b, A_2, A_3, \ldots, A_N]$, we form the partition
$A = [A_1, A_2, A_3, \ldots, A_N]$ obtained by merging $B_a$ and $B_b$ into $A_1$, where $p_a = P(B_a)$,
$p_b = P(B_b)$ and $p_1 = P(A_1) = p_a + p_b$. We maintain that

$$H(A) < H(B) \tag{3-4}$$

Proof. The function $\varphi(p) = -p \ln p$ is concave. Therefore for $\Delta > 0$ and
$p_1 - \Delta < p_1 < p_2 < p_2 + \Delta$ we have

$$\varphi(p_1 + p_2) < \varphi(p_1 - \Delta) + \varphi(p_2 + \Delta) < \varphi(p_1) + \varphi(p_2) \tag{3-5}$$

Clearly,

$$H(B) - \varphi(p_a) - \varphi(p_b) = H(A) - \varphi(p_a + p_b)$$

because each side equals the contribution to $H(B)$ and $H(A)$, respectively, due to the
common elements of $B$ and $A$.

Hence, (3-4) follows from (3-5).

Concentration implies moving probability mass from a state with low probability to

a state with high probability. Whenever this move occurs, the system becomes less

uniform and thus entropy decreases.

Proposition 3.2. Given two partitions $B = [b_1, b_2, A_3, A_4, \ldots, A_N]$ and
$A = [A_1, A_2, A_3, \ldots, A_N]$ that have the same elements except the first two,
we maintain that if $p_1 = P(A_1)$, $p_2 = P(A_2)$ with $p_1 < p_2$ and
$P(b_1) = p_1 - \Delta < p_2 + \Delta = P(b_2)$, then

$$H(B) < H(A) \tag{3-6}$$

Proof. Clearly,

$$H(A) - \varphi(p_1) - \varphi(p_2) = H(B) - \varphi(p_1 - \Delta) - \varphi(p_2 + \Delta)$$

where $\varphi(p) = -p \ln p$, because each side equals the contribution to $H(A)$ and $H(B)$,
respectively, due to the common elements of $A$ and $B$.

Hence, (3-6) follows from (3-5).
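The grouping and concentration properties can be checked numerically on a small example. The sketch below is illustrative only: the distribution `base` and the amount of mass moved (0.1) are arbitrary values chosen for the example.

```python
import math

def shannon_entropy(probs):
    """H = -sum p ln p, using the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

base = [0.2, 0.3, 0.5]

# Grouping: merge the first two states into one.
grouped = [0.2 + 0.3, 0.5]

# Concentration: move mass 0.1 from the less likely state to the more likely one.
concentrated = [0.2 - 0.1, 0.3 + 0.1, 0.5]

h_base = shannon_entropy(base)
h_grouped = shannon_entropy(grouped)
h_concentrated = shannon_entropy(concentrated)

print(h_grouped < h_base, h_concentrated < h_base)
```

Both comparisons hold for this example, in line with Propositions 3.1 and 3.2.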

3.3.2 The Entropy Decomposition Theorem

Another attractive property of entropy is the way in which aggregation and
disaggregation are handled [18]. This is because of the property of additivity of
entropy. Suppose we have $n$ outcomes denoted by $X = \{x_1, \ldots, x_n\}$, with probabilities
$p_1, \ldots, p_n$. Assume that these outcomes can be aggregated into a smaller number of sets
$C_1, \ldots, C_K$ in such a way that each outcome is in only one set $C_k$, where $k = 1, \ldots, K$.
The probability that outcomes are in set $C_k$ is

$$p_k = \sum_{i \in C_k} p_i \tag{3-7}$$

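The entropy decomposition theorem gives the relationship between the entropy
$H(X)$ at the level of the outcomes as given in (3-2) and the entropy $H_0(X)$ at the level
of sets. $H_0(X)$ is the between-group entropy and is given by:

$$H_0(X) = -\sum_{k=1}^{K} p_k \ln p_k \tag{3-8}$$

Shannon entropy (3-2) can then be written as

$$
\begin{aligned}
H(X) &= -\sum_{i=1}^{n} p_i \ln p_i \\
&= -\sum_{k=1}^{K} \sum_{i \in C_k} p_i \ln p_i \\
&= -\sum_{k=1}^{K} p_k \sum_{i \in C_k} \frac{p_i}{p_k} \left( \ln \frac{p_i}{p_k} + \ln p_k \right) \\
&= H_0(X) + \sum_{k=1}^{K} p_k H_k(X)
\end{aligned}
\tag{3-9}
$$

where

$$H_k(X) = -\sum_{i \in C_k} \frac{p_i}{p_k} \ln \frac{p_i}{p_k} \tag{3-10}$$

A property of this relationship is that $H(X) \ge H_0(X)$ because $p_k$ and $H_k(X)$ are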

nonnegative. This means that after data grouping, there cannot be more uncertainty

(entropy) than there was before grouping.
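The decomposition can be verified numerically. In the sketch below, the helper name `decompose` and the example probabilities and grouping are assumptions made for illustration; the function returns the between-group entropy $H_0(X)$ and the weighted within-group term $\sum_k p_k H_k(X)$.

```python
import math

def shannon_entropy(probs):
    """H = -sum p ln p, using the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decompose(probs, groups):
    """groups: list of lists of outcome indices that partition range(len(probs))."""
    between = 0.0
    within = 0.0
    for g in groups:
        pk = sum(probs[i] for i in g)          # probability of the set C_k
        if pk == 0:
            continue
        between -= pk * math.log(pk)           # contribution to H_0(X)
        within += pk * shannon_entropy([probs[i] / pk for i in g])  # p_k * H_k(X)
    return between, within

probs = [0.1, 0.2, 0.3, 0.4]
groups = [[0, 1], [2, 3]]
h0, weighted_within = decompose(probs, groups)
total = shannon_entropy(probs)
print(abs(total - (h0 + weighted_within)) < 1e-12)
```

Because the within-group term is nonnegative, `total` is never smaller than `h0`.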
3.4 The K-Means via Entropy Model

In this section we outline the K-means via entropy minimization. The method

of this section enables us to perform learning on the data set, in order to obtain the

similarity matrix and to estimate a value for the expected number of clusters based

on the clustering requirements or some threshold.

3.4.1 Entropy as a Prior Via Bayesian Inference

Given a data set represented as $X = \{x_1, \ldots, x_n\}$, a clustering is the partitioning of

the data set to get the clusters $\{C_j, j = 1, \ldots, K\}$, where $K$ is usually less than $n$. Since

entropy measures the amount of disorder of the system, each cluster should have a

low entropy because instances in a particular cluster should be similar. Therefore our

clustering objective function must include some form of entropy. A good minimum

entropy clustering criterion has to reflect some relationship between data points and

clusters. Such relationship information will help us to identify the meaning of data,

i.e. the category of data. Also, it will help to reveal the components, i.e. clusters

and components of mixed clusters. Since the concept of entropy measure is identical

to that of probabilistic dependence, an entropy criterion measured on a posteriori

probability would suffice. Bayesian inference is therefore very suitable in the
development of the entropy criterion.

Suppose that after clustering the data set $X$, we obtain the clusters $\{C_j, j = 1, \ldots, K\}$.
By Bayes' rule, the posterior probability $P(C_j \mid X)$ is given as

$$P(C_j \mid X) = \frac{P(X \mid C_j) P(C_j)}{P(X)} \propto P(X \mid C_j) P(C_j) \tag{3-11}$$

where $P(X \mid C_j)$, given in (3-12), is the likelihood and measures the accuracy in
clustering the data, and the prior $P(C_j)$ measures consistency with our background
knowledge.

$$P(X \mid C_j) = \prod_{x_i \in C_j} p(x_i \mid C_j) = \exp\Big( \sum_{x_i \in C_j} \ln p(x_i \mid C_j) \Big) \tag{3-12}$$

By the Bayesian approach, a classified data set is obtained by maximizing the posterior
probability (3-11). In addition to the three K-means problems that we would like to
address (determining the number of clusters, selecting initial cluster centers, and
degeneracy), a fourth problem is the choice of the prior distribution to use in (3-11).
We address these issues below.

3.4.2 Defining the Prior Probability

Generally speaking, the choice of the prior probability is quite arbitrary [56]. This
is a problem facing everyone and no universal solution has been found. For our
application, we will define the prior as an exponential distribution of the form

$$P(C_j) \propto \exp\Big( \beta \sum_{i=1}^{k} p_i \ln p_i \Big) \tag{3-13}$$

where $p_j = |C_j|/n$ is the prior probability of cluster $j$, and $\beta \ge 0$ refers to a weighting
of the a priori knowledge. Henceforth, we call $\beta$ the entropy constant.

3.4.3 Determining Number of Clusters

Let $k^*$ be the final unknown number of clusters in our K-means algorithm
(KMEM). After clustering, the entropy

$$H(X) = -\sum_{i=1}^{k^*} p_i \ln p_i$$

will be minimum based on the clustering requirement. From previous discussions, we
know that entropy decreases as clusters are merged. Therefore if we start with some
large number of clusters $K > k^*$, our clustering algorithm will reduce $K$ to $k^*$ because
clusters with probability zero will vanish. Note that convergence to $k^*$ is guaranteed
because the entropy of the partitions is bounded below by 0. A rule of thumb on the
value of the initial number of clusters is $K = \sqrt{n}$ [17].

The KMEM Model. The K-means algorithm works well on a data set that has
spherical clusters. Since our model (KMEM) is based on the K-means, we make
the assumption that each cluster has a Gaussian distribution with mean values
$c_j$, $j = 1, \ldots, k$, and constant cluster variance $\sigma^2$. Thus for any given cluster $C_j$,

$$p(x_i \mid C_j) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big( -\frac{(x_i - c_j)^2}{2\sigma^2} \Big) \tag{3-14}$$

Taking the natural log and omitting constants, we have

$$\ln p(x_i \mid C_j) = -\frac{(x_i - c_j)^2}{2\sigma^2} \tag{3-15}$$

Using equations (3-12) and (3-13), the posterior probability (3-11) now becomes

$$P(C_j \mid X) \propto \exp\Big( \sum_{x_i \in C_j} \ln p(x_i \mid C_j) \Big) \exp\Big( \beta \sum_{i=1}^{k^*} p_i \ln p_i \Big) \propto \exp(-E) \tag{3-16}$$

where $E$ is written as follows:

$$E = -\sum_{x_i \in C_j} \ln p(x_i \mid C_j) - \beta \sum_{i=1}^{k^*} p_i \ln p_i \tag{3-17}$$

If we now use equation (3-14) and sum over all clusters, equation (3-17) becomes

$$E = \sum_{j=1}^{k^*} \sum_{x_i \in C_j} \frac{(x_i - c_j)^2}{2\sigma^2} - \beta \sum_{i=1}^{k^*} p_i \ln p_i \tag{3-18}$$

$$E = \sum_{j=1}^{k^*} \sum_{x_i \in C_j} \frac{(x_i - c_j)^2}{2\sigma^2} + \beta H(X) \tag{3-19}$$

Maximizing the posterior probability is equivalent to minimizing (3-19). Also, notice
that since the entropy term in (3-19) is nonnegative, equation (3-19) is minimized if
entropy is minimized. Therefore (3-19) is the required clustering criterion.

We note that when $\beta = 0$, $E$ is identical to the cost function of the K-means
clustering algorithm.

The Entropy K-means algorithm (KMEM) is given in figure 3-2. Multiple runs of

KMEM are used to generate the similarity matrix. Once this matrix is generated,

the learning phase is complete.
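The iteration of figure 3-2 can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function name `kmem`, uniform initial priors, initial centers drawn from the data, the `max_iter` cap, and the explicit removal of clusters that receive no points are all choices made for this sketch.

```python
import math
import random

def kmem(points, k, beta, eps=1e-6, max_iter=200, seed=0):
    """Entropy-penalized K-means sketch; returns (centers, priors, labels)."""
    rng = random.Random(seed)
    n = len(points)
    centers = [tuple(p) for p in rng.sample(points, k)]  # assumption: seeds are data points
    priors = [1.0 / k] * k                               # assumption: uniform initial priors
    labels = [0] * n
    for _ in range(max_iter):
        # Classification: squared distance minus (beta / n) * ln(prior).
        for idx, x in enumerate(points):
            labels[idx] = min(
                range(len(centers)),
                key=lambda r: sum((a - b) ** 2 for a, b in zip(x, centers[r]))
                - (beta / n) * math.log(priors[r]),
            )
        # Update centers and priors; clusters that received no points vanish.
        new_centers, new_priors, remap = [], [], {}
        shift = 0.0
        for r, old in enumerate(centers):
            members = [x for x, lab in zip(points, labels) if lab == r]
            if not members:
                continue
            mean = tuple(sum(c) / len(members) for c in zip(*members))
            shift = max(shift, max(abs(a - b) for a, b in zip(mean, old)))
            remap[r] = len(new_centers)
            new_centers.append(mean)
            new_priors.append(len(members) / n)
        labels = [remap[lab] for lab in labels]
        dropped = len(new_centers) < len(centers)
        centers, priors = new_centers, new_priors
        if shift < eps and not dropped:                  # convergence check
            break
    return centers, priors, labels
```

With `beta=0` the penalty term disappears and the assignment rule is that of plain K-means; larger values of `beta` favor clusters that already hold more of the data, so small clusters tend to empty out and be removed.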

Entropy K-means Algorithm
1. Select the initial number of clusters $k$ and a value for the stopping criterion $\varepsilon$.
2. Randomly initialize the cluster centers $\theta_i(t)$, the a priori probabilities $p_i$, $i = 1, 2, \ldots, k$, $\beta$, and the counter $t = 0$.
3. Classify each input vector $x_j$, $j = 1, 2, \ldots, n$, to get the partition $C$ such that for each $x_j \in C_r$, $r = 1, 2, \ldots, k$,
   $[x_j - \theta_r(t)]^2 - \frac{\beta}{n} \ln(p_r) \le [x_j - \theta_i(t)]^2 - \frac{\beta}{n} \ln(p_i)$
4. Update the cluster centers
   $\theta_i(t+1) = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$
   and the a priori probabilities of clusters
   $p_i(t+1) = \frac{|C_i|}{n}$
5. Check for convergence; that is, see if
   $\max_i |\theta_i(t+1) - \theta_i(t)| < \varepsilon$;
   if it is not, update $t = t + 1$ and go to step 3.

Figure 3-2: Entropy K-means algorithm

This algorithm iteratively reduces the number of clusters as some empty clusters
will vanish.

3.5 Graph Matching

The rationale behind our approach for structure learning is that any pair of
patterns that should be co-located in a cluster after clustering must appear together
in the same cluster a majority of the time after N applications of KMEM.

Let G(V, E) be the graph of the similarity matrix where each input pattern
is a vertex of G and V is the set of vertices of G. An edge between a pair of
patterns (i, j) exists if the entry (i, j) in the similarity matrix is non-zero. E is a
collection of all the edges of G. Graph matching is next applied on the maximum
spanning tree of the sparse graph $G'(V, E) \subset G(V, E)$. The sparse graph is obtained
by eliminating inconsistent edges. An inconsistent edge is an edge whose weight is

less than some threshold $\tau$. Thus a pattern pair whose edge is considered inconsistent
is unlikely to be co-located in a cluster. To understand the idea behind the maximum
spanning tree, we can consider the minimum spanning tree, which can be found in
many texts, for example [3] pages 278 and 520. The minimum spanning tree (MST) is
a graph theoretic method, which determines the dominant skeletal pattern of points
by mapping the shortest path of nearest neighbor connections [40]. Thus given a set
of input patterns $X = \{x_1, \ldots, x_n\}$, each with edge weight $d_{ij}$, the minimum spanning
tree is an acyclic connected graph that passes through all input patterns of $X$ with
a minimum total edge weight (see section 3.5). The maximum spanning tree, on the
other hand, is a spanning tree with a maximum total weight. Since all of the edge weights
in the similarity matrix are nonnegative, we can negate these values and then apply
the minimum spanning tree algorithm given in figure 3-4.

Minimum Spanning Tree. Minimum spanning trees are used in solving many
real world problems. For example, consider a case of a network with V nodes with
E undirected connections between nodes. This can be represented as a connected,
undirected graph $G = (V, E)$ containing V vertices and E edges. Now suppose that
all the edges are weighted, i.e., for each edge $(u, v) \in E$ we have an associated weight
$w(u, v)$. A weight can be used to represent real world quantities such as the cost of a
wire, distance, etc. between two nodes in a network. A spanning tree is defined as an
acyclic graph that connects all the vertices. A minimum spanning tree is a spanning
tree with the minimum weight. Suppose we represent the spanning tree as $T \subseteq E$,
which connects all the vertices, and whose total length is $w(T)$; then the minimum
spanning tree is defined as

$$\min_{T} w(T) = \min_{T} \sum_{(u,v) \in T} w(u, v) \tag{3-20}$$

Algorithm 1
1. $S \leftarrow \emptyset$
2. while $S$ does not form a spanning tree
3.     do find a safe edge $(u, v)$ which can be added to $S$
4.         $S \leftarrow S \cup \{(u, v)\}$
5. return $S$
Figure 3-3: Generic MST algorithm.

Generic MST algorithm. The book by Cormen et al. [11] gives a supported

analysis of minimum spanning tree algorithms. The MST algorithm falls in the

category of greedy algorithms. Greedy Algorithms are algorithms that make the best

choice at each decision making step. In other words, at every step, greedy algorithms

make the locally optimum choice and hope that it leads to a globally optimum

solution. The greedy MST algorithm builds the tree step-by-step, incorporating the

edge that causes minimum increase in the total weight at each step, without adding

any cycles to the tree. Suppose there is a connected, undirected graph G = (V; E)

with the weight function w. While finding the minimum spanning tree for graph G,

the algorithm manages at each step an edge-set S which is some subset of the MST.

At each step, edge $(u, v)$ is added to subset $S$ such that it does not violate the MST

property of $S$. This makes $S \cup \{(u, v)\}$ a subset of the minimum spanning tree. The

edge which is added at each step is termed a "safe edge". The generic algorithm is

given in figure 3-3.

There are two popular algorithms for computing the minimum spanning tree,

Prim's algorithm and Kruskal's algorithm (see [11]). We used Kruskal's

algorithm in our analysis. Its description follows.

Kruskal's algorithm for MST. Kruskal's algorithm is an extension of the generic MST
algorithm described in the preceding subsection. In Kruskal's algorithm
the set $S$, which is a subset of the minimum spanning tree, is a forest. At each step,
Kruskal's algorithm finds the safe edge to be added as the edge with the minimum
weight that connects two trees of the forest. Initially, the edges are sorted in nondecreasing
order of their weights. At each step, one finds the minimum edge in the graph not already
present in the minimum spanning tree that connects two trees together. This process
is repeated until all the vertices are included in the tree. The algorithm is given in
figure 3-4. In steps 1-3, the subset $S$ is initialized to null and $|V|$ trees,
each with a single vertex, are created. Step 4 sorts the edge set $E$ in nondecreasing
order of weight. In steps 5-8, an edge $(u, v)$ is found such that the endpoint $u$ belongs
to one tree and the endpoint $v$ belongs to another tree. This edge is incorporated in the
subset $S$. The algorithm stops when all vertices are included in the tree.

Algorithm 2
1. $S \leftarrow \emptyset$
2. for each vertex $v \in V[G]$
3.     do MAKE-SET($v$)
4. sort the edges $E$ by nondecreasing weight $w$
5. for each edge $(u, v) \in E$, in order of nondecreasing weight
6.     do if FIND-SET($u$) $\ne$ FIND-SET($v$)
7.         then $S \leftarrow S \cup \{(u, v)\}$
8.             UNION($u$, $v$)
9. return $S$
Figure 3-4: Kruskal MST algorithm.

Algorithm 3
1. input: $n$ $d$-dimensional patterns, initial number of clusters $k$, the number of
   clusterings $N$, the threshold $\tau$ and the entropy constant $\beta$
   output: clustered patterns
2. initialize the similarity matrix $M$ to the null $n \times n$ matrix and the number of
   iterations iter = 0
3. apply the KMEM algorithm to produce the partition $C$
4. update $M$: for each pair of input patterns $(i, j)$ placed in the same cluster of $C$, set
   $a(i,j) = a(i,j) + 1/N$; then set iter = iter + 1
5. if iter < $N$ go to step 3
6. obtain the final clustering by applying the MST and removing inconsistent edges
   ($a(i,j) < \tau$)
Figure 3-5: Graph clustering algorithm.
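Kruskal's algorithm as described above can be sketched in Python with a simple union-find structure. The function name `kruskal` and the `(weight, u, v)` edge format are assumptions of this sketch. A maximum spanning tree is obtained by negating the weights before the call.

```python
def kruskal(num_vertices, edges):
    """edges: iterable of (weight, u, v) with vertices numbered 0..num_vertices-1.
    Returns the edges of a minimum spanning forest."""
    parent = list(range(num_vertices))         # MAKE-SET for every vertex

    def find(x):                               # FIND-SET with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):              # nondecreasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                           # safe edge: joins two different trees
            tree.append((w, u, v))
            parent[ru] = rv                    # UNION
    return tree

edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (5, 2, 3), (3, 1, 3)]
mst = kruskal(4, edges)
max_st = kruskal(4, [(-w, u, v) for w, u, v in edges])  # maximum spanning tree via negation
```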

The algorithm given in figure 3-5 is used to generate the final clustered data.
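The graph clustering step can be illustrated as follows. To keep the sketch self-contained, it takes the label vectors of the $N$ clustering runs as input instead of running KMEM itself; the function name `consensus_clusters` is hypothetical. It relies on the observation that building the maximum spanning tree with Kruskal's algorithm in descending order of similarity and then deleting edges below $\tau$ gives the same components as stopping at the first edge below $\tau$.

```python
from itertools import combinations

def consensus_clusters(labelings, tau):
    """labelings: N label lists (one per clustering run), each of length n."""
    n, runs = len(labelings[0]), len(labelings)
    # Similarity a(i, j): fraction of runs in which patterns i and j share a cluster.
    sim = {(i, j): sum(lab[i] == lab[j] for lab in labelings) / runs
           for i, j in combinations(range(n), 2)}

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal on descending similarity; edges below tau are inconsistent and never added.
    for (i, j), a in sorted(sim.items(), key=lambda kv: -kv[1]):
        if a < tau:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    roots = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

runs = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
final = consensus_clusters(runs, tau=0.5)
```

In this example patterns 0 and 1 are co-located in two of three runs and patterns 2 and 3 in all three, so with $\tau = 0.5$ they form two clusters.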

3.6 Results

The KMEM and the graph matching algorithms were tested on some synthetic
images and data from the UCI data repository [7]. The data include the Iris data,
wine data and heart disease data. The results for the synthetic images and iris data
are given in sections 3.6.1 and 3.6.2. The KMEM algorithm was run 200 times in order to obtain
the similarity matrix and the average number of clusters $k_{ave}$.

Table 3-1: The number of clusters for different values of $\beta$

$\beta$    test1   test2   test3
1.0        10      10      13
1.5        6       8       9
3.5        5       5       6
5.5        4       4       5

3.6.1 Image Clustering

For the synthetic images, the objective is to reduce the complexity of the grey

levels. Our algorithm was implemented with synthetic images for which the ideal

clustering is known. Matlab and Paint Shop Pro were used for the image processing

in order to obtain an image data matrix. A total of three test images were used with

varying numbers of clusters. The first two images, test1 and test2, have four clusters.

Three of the clusters had uniformly distributed values with a range of 255, and the

other had a constant value. Test1 had clusters of varying size while test2 had equal

sized clusters. The third synthetic image, test3, has nine clusters each of the same

size and each having values uniformly distributed with a range of 255. We initialized

the algorithm with the number of clusters equal to the number of grey levels, and the

value of cluster centers equal to the grey values. The initial probabilities (pi) were

computed from the image histogram. The algorithm was able to correctly detect

the number of clusters. Different clustering results were obtained as the value of

the entropy constant was changed, as is shown in Table 3-1. For the image test3,

the correct number of clusters was obtained using a $\beta$ of 1.5. For the images test1

and test2, a $\beta$ value of 5.5 yielded the correct number of clusters. In Table 3-1, the

optimum number of clusters for each synthetic image is bolded.

3.6.2 Iris Data

Next we tested the algorithm on the different data obtained from the UCI

repository and got satisfactory results. The results presented in this section are

on the Iris data. The Iris data are well known [12, 27] and serve as a benchmark for
supervised learning techniques. The data set consists of three types of Iris plants: Iris Versicolor,
Iris Virginica, and Iris Setosa, with 50 instances per class. Each datum is four
dimensional and describes a plant's morphology, namely sepal width, sepal length,
petal width, and petal length. One class, Iris Setosa, is well separated from the other
two. Our algorithm was able to obtain the three-cluster solution when using
entropy constants $\beta$ of 10.5 and 11.0. Two-cluster solutions were also obtained using
entropy constants of 14.5, 15.0, 15.5 and 16.0. Table 3-2 shows these results.

To evaluate the performance of our algorithm, we determined the percentage
of data that were correctly classified for the three-cluster solution. We compared it to
the results of direct K-means. Our algorithm achieved 91 percent correct classification, while
the direct K-means achieved a lower percentage of correct classification; see Table 3-3.

Another measure of correct classification is entropy. The entropy of each cluster is
calculated as follows

$$H(C_j) = -\sum_{i=1}^{k} \frac{n_j^i}{n_j} \ln \frac{n_j^i}{n_j} \tag{3-21}$$

where $n_j$ is the size of cluster $j$ and $n_j^i$ is the number of patterns from cluster $i$
that were assigned to cluster $j$. The overall entropy of the clustering is the sum of
the weighted entropy of each cluster and is given by

$$H(C) = \sum_{j=1}^{k} \frac{n_j}{n} H(C_j) \tag{3-22}$$

where $n$ is the number of input patterns. The entropy is given in table 3-3. The
lower the entropy the higher the cluster quality.
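The clustering entropy measure can be computed as in the following sketch. The function name `clustering_entropy` and the toy labels are illustrative assumptions; the true class of each pattern plays the role of the reference cluster $i$ and the assigned label that of cluster $j$.

```python
import math
from collections import Counter

def clustering_entropy(true_classes, cluster_labels):
    """Weighted entropy of a clustering measured against known class labels."""
    n = len(true_classes)
    total = 0.0
    for j in set(cluster_labels):
        members = [c for c, lab in zip(true_classes, cluster_labels) if lab == j]
        n_j = len(members)
        # Entropy of the class mix inside cluster j.
        h_j = -sum((cnt / n_j) * math.log(cnt / n_j)
                   for cnt in Counter(members).values())
        total += (n_j / n) * h_j               # weight by relative cluster size
    return total

pure = clustering_entropy(["a", "a", "b", "b"], [0, 0, 1, 1])
mixed = clustering_entropy(["a", "a", "b", "b"], [0, 0, 0, 0])
```

A clustering that separates the classes perfectly has entropy zero, while lumping two equally sized classes into one cluster gives $\ln 2$.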

Table 3-2: The number of clusters as a function of $\beta$ for the iris data

$\beta$   10.5   11.0   14.5   15.0   15.5   16
k         3      3      2      2      2      2

Table 3-3: Percentage of correct classification of iris data

k          3.0    3.0    2.0    2.0    2.0    2.0
%          90     91     69     68     68     68
Entropy    0.31   0.27   1.33   1.30   1.28   1.31

We also determined the effect of $\beta$ and of different initial numbers of clusters $k$ on the average

value of $k$ obtained. The results are given in tables 3-4, 3-5 and 3-6. The tables

show that for a given $\beta$ and different initial values of $k$, the average number of clusters converges.

Table 3-4: The average number of clusters for various $k$ using a fixed $\beta = 2.5$ for the
iris data

$k$         10     15      20      30      50
$k_{ave}$   9.7    14.24   18.73   27.14   42.28

Table 3-5: The average number of clusters for various $k$ using a fixed $\beta = 5.0$ for the
iris data

$k$         10     15     20     30     50
$k_{ave}$   7.08   7.10   7.92   9.16   10.81

Table 3-6: The average number of clusters for various $k$ using a fixed $\beta = 10.5$ for
the iris data

$k$         10     15     20     30     50
$k_{ave}$   3.25   3.34   3.36   3.34   3.29

3.7 Conclusion

The KMEM provided good estimates for the unknown number of clusters. We

should point out that whenever the clusters are well separated, the KMEM algorithm

is sufficient. Whenever that was not the case, further processing by the graph

clustering produced the required results. Varying the entropy constant $\beta$ allows us to

vary the final number of clusters in KMEM. However, we had to empirically obtain

values for $\beta$. Further research is necessary in order to find a way of estimating

the value of $\beta$ based on some properties of the data set. Our approach worked

on the data that we tested, producing the required number of clusters. While our

results are satisfactory, we observed that our graph clustering approach sometimes

matched weakly linked nodes, thus combining clusters. Therefore, further work will

be required to reduce this problem. Such a result would be very useful in image

processing and other applications.


4.1 Introduction

Data mining often requires the unsupervised partitioning of the data set into

clusters. It also places some special requirements on the clustering algorithms

including: data scalability, no presumption of any canonical data
distribution, and insensitivity to the order of the input records [4]. Equally important
in data mining is the effective interpretability of the results. Data sets to be clustered
usually have several attributes and the domain of each attribute can be very large.
Therefore results obtained in these high dimensions are very difficult to interpret.
High dimensionality poses two challenges for unsupervised learning algorithms. First,
the presence of irrelevant and noisy features can mislead the clustering algorithm.
Second, in high dimensions data may be sparse (the curse of dimensionality), making
it difficult for an algorithm to find any structure in the data. To ameliorate these
problems, two basic approaches to reducing the dimensionality have been investigated:
feature subset selection (Agrawal et al., 1998; Dy and Brodley, 2000) and feature
transformations, which project high dimensional data onto "interesting" subspaces
(Fukunaga, 1990; Chakrabarti et al., 2002). For example, principal component
analysis (PCA) chooses the projection that best preserves the variance of the data.

It therefore is important to have clusters represented in lower dimensions in order

to allow effective use of visual techniques and better result interpretation. In this

chapter, we address dimensionality reduction using entropy minimization.

4.1.1 Entropy Dimension Reduction

In the previous two chapters and in [41], we have shown that entropy is a good

measure of the quality of clustering. We therefore propose an entropy method to

handle the problem of dimension reduction. As with any clustering algorithm, certain

requirements such as sensitivity to outliers, shape of the cluster, efficiency, etc. play

vital roles in how well the algorithm performs. In the next section, we outline the different

criteria necessary to handle these problems via entropy.

4.1.2 Entropy Criteria For Dimension Reduction

Given two clusterings of different data sets, how do we determine which clustering
is better? Since these clusters may be in different dimensions, we need criteria
that are robust. We propose the following measures. Good data span or coverage:
a dimensional space that has well-defined clusters will tend to have better data span
than one that is close to random. High density: although two distributions can
have the same data span, one may be more dense and therefore qualify as a
cluster. Given these criteria, we need some metric that is capable of measuring them
simultaneously. A reduced dimension with good clustering should score well
on this metric relative to some threshold. This metric is entropy and we outline the
approach in the following sections.

4.1.3 Entropy Calculations

Each dimension is divided into intervals of equal length, thus partitioning the
high-dimensional space to form a grid. The density of each cell can be found by counting
the number of points in the cell. If we denote the set of all cells by $X$ and $d(x)$ the
density of cell $x$, we can define the entropy of the data set as:

$$H(X) = -\sum_{x \in X} d(x) \ln d(x) \tag{4-1}$$

When the data points are uniformly distributed, we are most uncertain where a
particular point would lie. In this case entropy is highest. When the data points are
closely packed in a small cluster, we know that a particular point is likely to fall within a
small area of the cluster, and so the entropy will be low. The interval size used
to divide each dimension should be carefully selected. If the interval is too small, there
will be many cells, making the average number of points in each cell very small; similarly, if
the interval size is too large, it may be difficult to capture the differences in density
in different regions of the space. Having at least 30 points in each cell is recommended.
We follow the approach outlined by Cheng et al. [10].
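The grid-based entropy calculation can be sketched as follows. The function name `grid_entropy`, the single shared interval length for all dimensions, and the normalization of cell counts by the total number of points (so that the densities sum to one) are assumptions of this sketch.

```python
import math
from collections import Counter

def grid_entropy(points, dims, interval):
    """Entropy of the cell densities of `points` projected onto the dimensions `dims`."""
    # Map each point to the grid cell that contains it.
    cells = Counter(
        tuple(math.floor(p[d] / interval) for d in dims) for p in points
    )
    n = len(points)
    # Density of a cell = fraction of all points that fall in it (assumption).
    return -sum((c / n) * math.log(c / n) for c in cells.values())

packed = [(0.1,), (0.2,), (0.3,), (0.4,)]      # all points in one cell
spread = [(0.5,), (1.5,), (2.5,), (3.5,)]      # one point per cell
h_packed = grid_entropy(packed, dims=[0], interval=1.0)
h_spread = grid_entropy(spread, dims=[0], interval=1.0)
```

The tightly packed points give zero entropy, while the evenly spread points give the maximum, $\ln 4$, for four occupied cells.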

4.1.4 Entropy and the Clustering Criteria

Entropy is used to relate the different criteria outlined for clustering. As the
density of dense units increases, the entropy decreases. Hence entropy can be
used to relate the measurement of density in clustering. The problem of correlated
variables can also be handled by entropy. Entropy can easily detect independence among
variables through the following relationship: $H(X_1, \ldots, X_n) = H(X_1) + \cdots + H(X_n)$ if
$X_1, \ldots, X_n$ are independent. This property will be necessary in our algorithm.

4.1.5 Algorithm

The algorithm for dimension reduction consists of two main steps:

1. Find the reduced dimensions with good clustering by the entropy method.

2. Identify clusters in the dimensions found.

To identify good clustering, we set a threshold $\tau$. A reduced dimension has good
clustering if its entropy is below the threshold. The proposed approach is
bottom-up. It starts by finding one-dimensional spaces with good
clustering; these are then used to generate candidate 2-dimensional spaces, which are
checked against the data set to determine if they have good clustering. The process
is repeated with increasing dimensionality until no more spaces with good clustering
are found. The algorithm is given in figure 4-1.

4.2 Results

We evaluated the algorithm using both synthetic and real data. For synthetic
data, we generated data of fixed dimensions and also fixed the dimensions that contained
clusters. The algorithm was able to identify the lower dimensional spaces that had
clusters. We next used the algorithm on the breast cancer data, which can be found in
[7], in order to reduce the 38-feature space. The algorithm performed well when using
only the numerical features.

Algorithm 1
1. $k = 1$
2. Let $C_k$ be the one-dimensional spaces
3. For each space $c \in C_k$ do
4.     $f_c(\cdot) = \mathrm{density}(c)$
5.     $H(c) = \mathrm{entropy}(f_c(\cdot))$
6.     if $H(c) < \tau$ then
7.         $S_k = S_k \cup c$
8.     else
9.         $NS_k = NS_k \cup c$
10. End For
11. $C_{k+1} = \mathrm{cand}(NS_k)$
12. If $C_{k+1} = \emptyset$, go to step 15
13. $k = k + 1$
14. go to step 3
15. Result $= \bigcup_{\forall k} S_k$
Figure 4-1: Algorithm for dimension reduction.

4.3 Conclusion

In this chapter, we provided an entropy method that can be used for dimension

reduction of high dimensional data. This method uses data coverage, density and

correlation to determine the reduced dimensions that have good clustering. While

this method does not cluster the data, it provides a subspace that has good clustering

and whose results are easy to interpret.


5.1 Introduction

Path planning is concerned with creating an optimal path from point A to

point B while satisfying constraints imposed by the path like obstacles, cost, etc.

In our path planning problem, we are concerned with planning a path for an agent

such that the likelihood of the agent being co-located with the target at some time

in its trajectory is maximized. An assumption here is that the agent operates in

receding-horizon optimization framework where the optimization considers the likely

position of the target up to some time in the future and is repeated at every time

step. The information about the target is contained in a stochastic process, and we

assume that the distribution is known at every time in the future.

We consider the problem of path planning for a single agent. The basic formulation
of the problem is to have the agent move from an initial dynamic state to a moving
target. In the case of a stationary target, several methods have been proposed by other
researchers [46, 52]. Here, we assume that the velocity of the vehicle is fixed and
that it is higher than the maximum velocity of the target. In each fixed-length time
interval (even if the length is not constant, the proposed methods still work) the agent has
information about the positions of the targets at that time. There are several
modes depending on the prediction of the targets' move:

1. the prediction of the targets' move is unknown.

2. the prediction of the targets' move is given by a probability distribution.

We also consider the case of planning a path for an agent such that its likelihood of
being co-located with a target at some time in its trajectory is maximized. We assume
that the agent operates in a receding-horizon optimization framework, with some
fixed planning horizon and a reasonable rate of re-planning. When the future location of
the target is expressed stochastically, we derive conditions under which the planning
horizon of the agent can be bounded from above without sacrificing performance.

We assume that the target employs a receding horizon approach, where the

optimization considers the likely position of the target up to some fixed time in the

future and where the optimization is repeated at every time step.

5.1.1 Problem Parameters

We define a discrete two-dimensional state-space $X \subseteq \mathbb{Z} \times \mathbb{Z}$, where $\mathbb{Z}$ denotes
the set of integers. We also denote time by the discrete index $t \in \mathbb{Z}$. An agent's position
at time $t = t_0$ is denoted by $x(t_0) \in X$, and a trajectory of an agent can be defined
as follows:

Definition 5.1. A state $x_i \in X$ is said to be adjacent to the state $x_j \in X$ if

$$\|x_i - x_j\| \le 1$$

Definition 5.2. The path $p$ for an agent is a sequence of states $x(t) \in X$:

$$p = \{x(t_0), x(t_0 + 1), \ldots, x(t_0 + T)\},$$

such that $x(t)$ is adjacent to $x(t + 1)$ for all $t \in [t_0, t_0 + T - 1]$. We say that such a
path has length $T$.

The agent is assumed to have information regarding the future position of the
target, and this information is contained in a stochastic process, $M(t) \in \mathbb{R}^{N \times N}$. Thus
$M(\cdot)$ is a sequence of $N \times N$ matrices whose elements at a particular time constitute
a probability mass function

$$\sum_{i} \sum_{j} M_t(i, j) \le 1,$$

where $M_t(i, j)$ denotes the $(i, j)$th element of the matrix $M(t)$. We assume that
there exists a stationary mapping of the elements of $M(t)$ to the state space $X$ for all
$t$, and we will use the notation that the probability that the target is at state $y \in X$
at time $t = t_0$ is $M_{t_0}(y)$.

The receding-horizon optimization problem is to find a path $p$ of fixed length
$T$ such that the likelihood of being co-located with the target in at least one state
on the path is maximized. One way of estimating this is the following cost function:

$$J(p) = \sum_{t = t_0}^{t_0 + T} M_t(x(t)) \tag{5-1}$$

In the following sections, we examine the different solution methods we propose
to solve this problem.

5.1.2 Entropy Solution

In constructing the optimal path, we make use of a concept from information theory, specifically information gain. Entropy in information theory measures the uncertainty of a distribution, that is, how unpredictable it is. Information gain is the change in the entropy of a system when new information relating to the system is obtained. Suppose that $\{X_n, Y_n\}$ is a random process taking values in a discrete set. Shannon introduced the notion of mutual information between the two processes:

$I(X, Y) = H(X) + H(Y) - H(X, Y)$,

the sum of the two entropies minus the entropy of the pair. Average mutual information can also be defined in terms of conditional entropy,

$H(X \mid Y) = H(X, Y) - H(Y)$,

and hence

$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$. (5-2)

In this form the mutual information can be interpreted as the information contained in one process minus the information contained in that process when the other process is known [20]. Recall that the entropy of a discrete distribution over some set $X$ is defined as

$H(X) = -\sum_{x \in X} p(x) \ln p(x)$. (5-3)

Given the value $v$ of a certain variable $V$, the entropy of a system $S$ defined on $X$ is given by

$H(S \mid V = v) = -\sum_{x \in X} p(x \mid v) \ln p(x \mid v)$. (5-4)

Now suppose that we gain new information $V$ about the system in the form of a distribution over all possible values of $V$. We can then define the conditional entropy of the system $S$ as

$H(S \mid V) = -\sum_{v} p(v) \sum_{x \in X} p(x \mid v) \ln p(x \mid v)$. (5-5)

Thus, for some system and some new knowledge $V$, the information gain according to equation (5-2) is

$I(S, V) = H(S) - H(S \mid V)$. (5-6)
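The quantities in (5-3) through (5-6) are easy to evaluate numerically. The following Python sketch is only an illustration: the function names, the four-cell system and the binary observation $V$ are hypothetical and do not come from the simulator described in this chapter.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a probability vector p, as in (5-3)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def conditional_entropy(p_v, p_x_given_v):
    """H(S|V) as in (5-5): entropy of each row p(x|v), weighted by p(v)."""
    return sum(pv * entropy(row) for pv, row in zip(p_v, p_x_given_v))

def information_gain(p_v, p_x_given_v):
    """I(S,V) = H(S) - H(S|V) as in (5-6); p(x) is obtained by marginalizing over v."""
    n_states = len(p_x_given_v[0])
    p_x = [sum(pv * row[x] for pv, row in zip(p_v, p_x_given_v))
           for x in range(n_states)]
    return entropy(p_x) - conditional_entropy(p_v, p_x_given_v)

# Hypothetical example: the target is in one of four cells, and a binary
# observation V tells us whether it is in the first or the second pair of cells.
p_v = [0.5, 0.5]
p_x_given_v = [[0.5, 0.5, 0.0, 0.0],
               [0.0, 0.0, 0.5, 0.5]]
gain = information_gain(p_v, p_x_given_v)
print(gain)  # H(S) = ln 4 and H(S|V) = ln 2, so the gain is ln 2
```

In this example the observation halves the number of candidate cells, so the entropy drops from $\ln 4$ to $\ln 2$.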

When incorporating information gain into our path planning, a strategy that leads to the maximum decrease in conditional entropy, or equivalently the maximum information gain, is desirable. This strategy moves us to a state of low entropy as quickly as possible, given our current knowledge and representation of the system. We can therefore use an entropy measure in which a state with a low entropy measure corresponds to the solution of the path. Our proposed entropy method is used to build an initial path, and a local search method is then used to improve the path. The results generated by our simulator are then used on the different modes outlined below and on the cost function given in (5-14). Several other methods and cost functions are also discussed.

5.2 Mode 1

We propose the following method, which can be applied in both modes. We describe the method for a single vehicle and a single target. The vehicle moves along the shortest path between its current position and the target's current position until it receives updated information. We can state the following proposition for this mode.

Proposition 5.1. Let $S_t$ be the length of the shortest path between the vehicle and the target at time $t$. Then there exists a positive integer $n_0$ such that

$S_t < v_m t_0$ for all times $t > n_0 t_0$,

where $t_0$ is the length of the time interval after which the vehicle receives the target's position and $v_m$ is the maximum velocity of the target.

Proof. Consider $S_{k t_0}$, the length of the shortest path between the target and the vehicle at time $k t_0$. We can write down the following inequality for $S_{k t_0}$ and $S_{(k+1) t_0}$:

$S_{k t_0} - S_{(k+1) t_0} \ge (u - v_m) t_0$. (5-7)

(explanation and a figure)

Indeed, the length of the shortest path between the current position of the vehicle and the next position of the target (after time $t_0$) is at most the sum of $S_{k t_0}$ and $L_k$, where $L_k$ is the length of the trajectory the target travels in the time interval $[k t_0, (k + 1) t_0]$. Moreover, the following inequality holds:

$L_k \le v_m t_0$.

Along the shortest path, the vehicle moves a distance of $u t_0$ in the time interval $[k t_0, (k + 1) t_0]$.


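Since in each time interval the length of the shortest path between the vehicle and the target is reduced by at least the fixed amount $(u - v_m) t_0$, the claim holds after time $n_0 t_0$, where

$n_0 \le \left\lceil \frac{S_0}{(u - v_m) t_0} \right\rceil$ (5-8)

and $S_0$ is the length of the shortest path at time 0, which completes the proof.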

Definition 5.3. $D_T = D_T(t_1, x) \subset \mathbb{R}^2$, the set the target could reach in time $t_1$ from the current point $x$, is called the reachable set of the target at time $t_1$.

Definition 5.4. $D_A = D_A(t_1, x_0) \subset \mathbb{R}^2$, the set the agent could reach in time $t_1$ from the current point $x_0$, is called the reachable set of the agent at time $t_1$.

Since the agent cannot predict the move of the target, we assume that the direction of the target's move is uniformly distributed over $[-\pi, \pi]$ at every time $t \in [0, t_1]$, since for each move of the target there exists a corresponding move in exactly the opposite direction. Under this assumption, the reachable set is a disk (a circle together with its interior). Moreover, the density function takes the same value at all points of any circle that has the same center as the reachable set. At the next time, the distance between the target and the agent is a random variable. Our goal is to find a trajectory that minimizes the expectation of the distance between the agent and the target. The following proposition gives the answer for a single target and a single agent.

Proposition 5.2. The optimal trajectory is the line segment which connects the current position of the agent and the current position of the target.

Proof. Let $A_k$ be the current position of the agent and $B_k$ be the current position of the target at time $k t_0$. Consider the circle $w_1$ with center $B_k$ and radius $v_m t_0$, and the circle $w_2$ with center $A_k$ and radius $u t_0$. These circles represent the boundaries of the reachable sets of the target and the agent. We prove that the optimal trajectory for the agent is $A_k A_{k+1}$ (the optimal position for the agent at time $(k + 1) t_0$ is $A_{k+1}$). Without loss of generality, the optimal position for the agent is $A_{k+1}$, which is a point inside $w_2$ at time $(k + 1) t_0$. Consider a polar coordinate system $(\rho, \alpha)$. Let $h(\rho, \alpha)$ be the distance between the point $(\rho, \alpha)$ and $A_{k+1}$. Then the expectation of the distance between $A_{k+1}$ and the next position of the target at time $(k + 1) t_0$ is

$\int_0^{v_m t_0} \int_0^{2\pi} h(\rho, \alpha) f(\rho, \alpha) \, d\alpha \, d\rho$, (5-9)

where $f(\rho, \alpha)$ is the density function of the next position of the target at time $(k + 1) t_0$.

Figure 5-1: Target and agent boundaries (labeled points $B_k$, $A_{k+1}$, $A_k$)

Consider a new polar coordinate system $(\rho', \alpha')$ rotated by the angle $\theta$:

$\alpha' = \alpha - \theta$,

$\rho' = \rho$.

Consider the expectation of the distance between $A_{k+1}$ and the next position of the target at time $(k + 1) t_0$. If we denote by $h'(\rho', \alpha')$ the distance between the point $(\rho', \alpha')$ and $A'_k$, the expectation is

$\int_0^{v_m t_0} \int_0^{2\pi} h'(\rho', \alpha') f(\rho', \alpha') \, d\alpha' \, d\rho'$, (5-10)

since the function $f(\rho, \alpha)$ has rotational symmetry. If we take into account the fact that

$h'(\rho', \alpha') \ge h(\rho, \alpha)$ when $\rho' = \rho$ and $\alpha' = \alpha$,

then

$\int_0^{v_m t_0} \int_0^{2\pi} h'(\rho', \alpha') f(\rho', \alpha') \, d\alpha' \, d\rho' \ge \int_0^{v_m t_0} \int_0^{2\pi} h(\rho, \alpha) f(\rho, \alpha) \, d\alpha \, d\rho$.

Since the rotation preserves distances, our proposition is proved.

5.3 Mode 2

Suppose that the agent is given the target's move as a probability distribution. Moreover, we assume that we know the density function, $z = p(x, y)$, of the next position of the target in the reachable set. Our goal is to find the trajectory that minimizes the expected distance between the agent and the target at time $t_1$. Let $D_A$ be the reachable set of the agent at time $t_1$, taken to be a finite set of points.

Let $\mathbf{x}_0 = (x_0, y_0)$ be the current position of the agent and $D_T = \{\mathbf{x}_i = (x_i, y_i),\ i = 1, 2, \ldots, l\}$ be the reachable set of the target at time $t_1$, where the target is at $\mathbf{x}_i$ with probability $p_i$. The problem is

$\min \sum_{i=1}^{l} p_i \|\mathbf{x} - \mathbf{x}_i\|$ (5-11)

s.t. $\mathbf{x} \in D_A$.

Without obstacles, the reachable set $D_A$ of the agent is a disk (a circle together with its interior).
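When $D_A$ is a finite set, Problem (5-11) can be solved by direct enumeration. The following Python sketch illustrates this under assumptions that are not part of the original model: $D_A$ is taken to be the integer grid points within distance $u t_1$ of the agent, and the function names, positions and probabilities are hypothetical.

```python
import math

def expected_distance(x, targets, probs):
    """Objective of (5-11): sum_i p_i * ||x - x_i||."""
    return sum(p * math.dist(x, xi) for xi, p in zip(targets, probs))

def best_agent_position(agent, u, t1, targets, probs):
    """Enumerate a finite reachable set D_A and return its minimizer of (5-11).

    Assumption: D_A consists of the integer grid points within Euclidean
    distance u * t1 of the agent's current position (no obstacles).
    """
    radius = u * t1
    r = int(math.floor(radius))
    candidates = [(agent[0] + dx, agent[1] + dy)
                  for dx in range(-r, r + 1)
                  for dy in range(-r, r + 1)
                  if math.hypot(dx, dy) <= radius]
    return min(candidates, key=lambda x: expected_distance(x, targets, probs))

# Hypothetical data: three possible target positions with their probabilities.
targets = [(5, 0), (5, 1), (4, 0)]
probs = [0.5, 0.25, 0.25]
best = best_agent_position(agent=(0, 0), u=1, t1=2, targets=targets, probs=probs)
print(best)
```

Enumeration is only practical for small reachable sets; for a continuous disk the objective is convex in $\mathbf{x}$, so a convex solver could be used instead.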

5.4 Mode 3

Another interesting question is how to find the optimal path if we do not know the exact time at which the agent will receive the next information. In other words, suppose that the time at which the agent receives the next information about the position of the target is distributed according to some probability distribution $g(t)$, $t \in \{0 = t_0 < t_1 < t_2 < \cdots < t_T\}$. Without loss of generality, let $t = 0, 1, \ldots, T$ and $g(t) = g_t$, the probability that the next information is received by the agent at time $t$. Let $D_T(t) = \{\mathbf{x}_i^t = (x_i^t, y_i^t),\ i = 1, 2, \ldots, l_t\}$, $t = 0, 1, 2, \ldots, T$, be the reachable sets of the target. We can propose the following model:

$\min \sum_{t=0}^{T} \left( \sum_{i=1}^{l_t} p_i^t \|\mathbf{x}_t - \mathbf{x}_i^t\| - h^*(t) \right) g_t$ (5-12)

s.t. $\|\mathbf{x}_{t+1} - \mathbf{x}_t\| \le u$, $t = 0, 1, \ldots, T - 1$,

where $\mathbf{x}_t = (x_t, y_t)$, $t = 0, 1, 2, \ldots, T$, constitute the agent's path, $u$ is the velocity of the agent, and $h^*(t)$ is the optimal value of Problem (5-11) at time $t$. Consider the following quantity:

$\sum_{i=1}^{l_t} p_i^t \|\mathbf{x}_t - \mathbf{x}_i^t\| - h^*(t)$.

This gives us the error when the next information is received at time $t$. The objective function of Problem (5-12) is then the expected error of the agent with respect to the time at which the information is received.

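Proposition 5.3. The constraint set of Problem (5-12) is convex.

Proof. Consider the function $f(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|$. This is a convex function. Hence, for feasible points $(\mathbf{x}_1, \mathbf{x}_2)$ and $(\mathbf{y}_1, \mathbf{y}_2)$ and any $\lambda \in [0, 1]$,

$f(\lambda \mathbf{x}_1 + (1 - \lambda) \mathbf{y}_1, \lambda \mathbf{x}_2 + (1 - \lambda) \mathbf{y}_2) \le \lambda f(\mathbf{x}_1, \mathbf{x}_2) + (1 - \lambda) f(\mathbf{y}_1, \mathbf{y}_2) \le \lambda u + (1 - \lambda) u = u$.

Since the problem is convex, it can be solved using one of the existing gradient methods.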


5.5 Maximizing the Probability of Detecting a Target

The modes discussed in the previous sections work when the distance between the agent and the target is large enough and the time interval between two consecutive information updates is short. For long time intervals and short distances, those modes are not very efficient. From now on, we discuss the problem of finding an optimal path $p$ of fixed length $T$ for the agent such that the likelihood of being co-located with the target in at least one point on the path is maximized. In continuous time the problem is very expensive to solve. Thus, we make the following assumptions. The vehicle and the target move among a finite set of cells in discrete time. At the beginning of each time period, the agent and the target can move only to adjacent cells or can stay in the same cell they occupied in the previous time period. Moreover, when the agent and the target are in the same cell, the agent detects the target with probability 1. We are looking for a path such that the probability of detecting the target in a fixed number of time periods, say $T$, is maximized.

(i,NW) (i,N) (i,NE)

(i,W) i (i,E)

(i,SW) (i,S) (i,SE)

Figure 5-2: Region

5.5.1 Cost Function. Alternative 1

We find the probability of being co-located with the target in at least one point during a fixed number of time periods for a path $x$. Let us define the indicator random variables $I_i^x$, $i = 0, 1, \ldots, T$, for path $x$ by

$I_i^x = 1$ if $x(t_0 + i)$ is a co-located point, and $I_i^x = 0$ otherwise. (5-13)

For brevity we write $I_i$ for $I_i^x$. The probability that the agent and the target are co-located in at least one point on the agent's $T$-length path is

$J(x) = 1 - P\left(\bigcap_{i=0}^{T} \{I_i = 0\}\right)$, (5-14)

where $P\left(\bigcap_{i=0}^{T} \{I_i = 0\}\right)$ can be written as follows:

$P\left(\bigcap_{t=0}^{T} \{I_t = 0\}\right) = \prod_{t=0}^{T} P\left(I_t = 0 \,\Big|\, \bigcap_{j=0}^{t-1} \{I_j = 0\}\right)$.

Our optimization problem becomes

$\max J(x)$, (5-15)

or equivalently

$\min \prod_{t=0}^{T} P\left(I_t = 0 \,\Big|\, \bigcap_{j=0}^{t-1} \{I_j = 0\}\right)$.

Taking the natural logarithm, the objective function takes the form

$\min \sum_{t=0}^{T} \ln P\left(I_t = 0 \,\Big|\, \bigcap_{j=0}^{t-1} \{I_j = 0\}\right)$.

Since the random variables $I_t$, $t = 0, \ldots, T$, depend on each other, we need extensive and complete information regarding the target's motion. The task is made more difficult because the conditioning events are $I_j = 0$ rather than $I_j = 1$. One way to handle this problem is an approximation method. If we assume that the dependence between $I_k = 0$ and $I_j = 0$, $j = 0, 1, \ldots, k - 1$, $k = 1, 2, \ldots, T$, is weak, then one can take

$\min \sum_{t=0}^{T} \ln P(I_t = 0)$,

that is,

$\min \sum_{t=0}^{T} \ln(1 - P(I_t = 1))$. (5-16)

However, the error resulting from this assumption depends on the model of the target's motion. In some cases it can give the optimal path. The last optimization problem is much easier than (5-15). We will discuss this problem in a later section.
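The approximate objective (5-16) needs only the marginal co-location probabilities $P(I_t = 1)$ along a candidate path. The following Python sketch evaluates it; the function names and the probability values are hypothetical, and each probability is assumed to be strictly less than 1 so that the logarithm is defined.

```python
import math

def approx_log_cost(p_colocated):
    """Objective of (5-16): sum over t of ln(1 - P(I_t = 1)).

    p_colocated[t] is the marginal probability that the target occupies
    the agent's cell at step t (assumed strictly less than 1).
    """
    return sum(math.log(1.0 - p) for p in p_colocated)

def approx_detection_probability(p_colocated):
    """Detection probability implied by the weak-dependence assumption:
    1 - prod_t (1 - P(I_t = 1))."""
    return 1.0 - math.exp(approx_log_cost(p_colocated))

# Hypothetical marginal probabilities along a three-step path.
p_path = [0.0, 0.1, 0.2]
print(approx_log_cost(p_path))
print(approx_detection_probability(p_path))  # 1 - 0.9 * 0.8 = 0.28
```

Minimizing the summed logarithms is equivalent to maximizing the approximate detection probability, which is what makes the objective additive over the steps of the path.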

5.5.2 Generalization

For greater realism, we should accept that an agent's ability to detect a target in the same cell may be imperfect. If the agent and the target are in cell $j$, $j = 1, \ldots, N$, at the beginning of a time period, then the agent detects the target with probability $q_j$. If they are in different cells, the agent cannot detect the target during the current time period.

Let us introduce the following indicator random variables $D_i^x$, $i = 0, 1, \ldots, T$, for path $x$:

$D_i^x = 1$ if the agent detects the target at $x(t_0 + i)$, and $D_i^x = 0$ otherwise. (5-17)

For brevity we write $D_i$ for $D_i^x$. The probability of detecting the target in a fixed time $T$ for a given agent's $T$-length path $x$ is

$J(x) = 1 - P\left(\bigcap_{t=0}^{T} \{D_t = 0\}\right)$,

where $P\left(\bigcap_{t=0}^{T} \{D_t = 0\}\right)$ can be written as follows:

$P\left(\bigcap_{t=0}^{T} \{D_t = 0\}\right) = \prod_{t=0}^{T} P\left(D_t = 0 \,\Big|\, \bigcap_{j=0}^{t-1} \{D_j = 0\}\right)$.

After taking the natural logarithm as before, the problem becomes

$\min \sum_{t=0}^{T} \ln P\left(D_t = 0 \,\Big|\, \bigcap_{j=0}^{t-1} \{D_j = 0\}\right)$.

We propose an approximate method for the problem, changing the cost function to

$\min \sum_{t=0}^{T} \ln P(D_t = 0)$, with $P(D_t = 0) = 1 - P(D_t = 1)$,

where $P(D_t = 1)$ can be expressed as follows:

$P(D_t = 1) = P(D_t = 1 \mid I_t = 1) P(I_t = 1) + P(D_t = 1 \mid I_t = 0) P(I_t = 0)$

$= P(D_t = 1 \mid I_t = 1) P(I_t = 1) = q_j P(I_t = 1)$.

Here $j = x(t)$.

5.5.3 Cost function. Alternative 2 and Markov Chain Model

In this subsection we discuss a special case of the so-called path constrained search problem, which James N. Eagle introduced in [13, 14]. There it is assumed that the target moves according to a Markov chain model.

Definition 5.5. (Short definition of Markov chain.)

More precisely, we assume that the target in cell $j$ moves to cell $k$ with probability $p_{jk}$ in one time period. The transition matrix $P = (p_{jk})$ is known to the agent. Under this assumption, we find the exact probability of detecting the target for a given agent's $T$-length path $x$.

The cost function (5-14) can be written in another form as follows:

$J(x) = P\left(\bigcup_{i=0}^{T} \{I_i = 1\}\right) = \sum_{i=0}^{T} P(I_i = 1) - \sum_{i_1 < i_2} P(I_{i_1} = 1, I_{i_2} = 1) + \sum_{i_1 < i_2 < i_3} P(I_{i_1} = 1, I_{i_2} = 1, I_{i_3} = 1) - \cdots + (-1)^{T} P(I_0 = 1, I_1 = 1, \ldots, I_T = 1)$.

The right-hand side of the last equation is obtained using the ....'s identity.

We show that we are able to simplify each component of the summation.

Definition 5.6. $P^{(n)} = P \cdot P \cdots P = P^n$ is called the matrix of $n$-step transition probabilities.

Note that, for $i_1 < i_2 < \cdots < i_k$,

$P(I_{i_1} = 1, I_{i_2} = 1, \ldots, I_{i_k} = 1) = \prod_{j=1}^{k} P\left(I_{i_j} = 1 \,\Big|\, \bigcap_{l=1}^{j-1} \{I_{i_l} = 1\}\right)$.

Using the Markov property, the above can be simplified as follows:

$P(I_{i_1} = 1, I_{i_2} = 1, \ldots, I_{i_k} = 1) = P(I_{i_1} = 1) \prod_{j=2}^{k} P(I_{i_j} = 1 \mid I_{i_{j-1}} = 1)$,

where $P(I_{i_j} = 1 \mid I_{i_{j-1}} = 1) = p^{(i_j - i_{j-1})}_{rs}$ with $s = x(i_j)$ and $r = x(i_{j-1})$, and $P(I_{i_1} = 1) = p^{(i_1)}_{s_0 p}$ with $p = x(i_1)$. Here $s_0$ is the initial cell, in which the target was located at the beginning of time period 0.
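For a Markov target, the exact detection probability of a given path can also be computed without expanding the inclusion-exclusion sum. The Python sketch below uses a forward recursion on the undetected probability mass. This is a substitute technique chosen for illustration, not necessarily the method coded by the author, and the function name, the three-cell chain and the detection probabilities are hypothetical.

```python
def detection_probability(P, s0, path, q=None):
    """Exact probability that the agent detects a Markov target along `path`.

    P    : row-stochastic transition matrix (list of lists), P[j][k] = p_jk
    s0   : initial cell of the target at time period 0
    path : agent cells x(0), x(1), ..., x(T)
    q    : per-cell detection probabilities q_j (defaults to perfect detection)
    """
    n = len(P)
    if q is None:
        q = [1.0] * n
    # undetected[j] = P(target is in cell j now and has not been detected so far)
    undetected = [0.0] * n
    undetected[s0] = 1.0
    for t, cell in enumerate(path):
        if t > 0:
            # The target moves one step according to the transition matrix.
            undetected = [sum(undetected[j] * P[j][k] for j in range(n))
                          for k in range(n)]
        # The agent searches its current cell; mass there survives with prob. 1 - q.
        undetected[cell] *= 1.0 - q[cell]
    return 1.0 - sum(undetected)

# Hypothetical three-cell corridor; the target starts in cell 2, the agent in cell 0.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
print(detection_probability(P, s0=2, path=[0, 1, 1]))                     # 0.75
print(detection_probability(P, s0=2, path=[0, 1, 1], q=[1.0, 0.5, 1.0]))  # 0.4375
```

The recursion costs on the order of $T N^2$ operations for a path of length $T$ over $N$ cells, whereas the inclusion-exclusion sum has a number of terms that grows exponentially in $T$.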
5.5.4 The Second Order Estimated Cost Function with Markov Chain

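Using the cost function (5-16) could give an undesirably large error. For this reason we propose another cost function,

$J(x) = 1 - \prod_{i=1}^{T} P(I_i = 0 \mid I_{i-1} = 0)$. (5-20)

Of course, we should assume that $P(I_0 = 0) = 1$, that is, that the initial states of the agent and the target are not the same; otherwise there is nothing to solve. Using the fact that

$P(\bar{A} \mid \bar{B}) = 1 - P(A \mid \bar{B})$

$= 1 - \frac{P(A, \bar{B})}{P(\bar{B})}$

$= 1 - \frac{P(\bar{B} \mid A) P(A)}{1 - P(B)}$

$= 1 - \frac{(1 - P(B \mid A)) P(A)}{1 - P(B)}$

$= 1 - \frac{P(A) - P(A, B)}{1 - P(B)}$

$= 1 - \frac{P(A) - P(A \mid B) P(B)}{1 - P(B)}$ (5-21)

(assuming $P(B) \ne 1$), the problem becomes

$\max J(x) = 1 - \prod_{i=1}^{T} \left(1 - \frac{P(I_i = 1) - P(I_i = 1 \mid I_{i-1} = 1) P(I_{i-1} = 1)}{1 - P(I_{i-1} = 1)}\right)$

or, after taking the natural logarithm of the product,

$\min \sum_{i=1}^{T} \ln\left(1 - \frac{P(I_i = 1) - P(I_i = 1 \mid I_{i-1} = 1) P(I_{i-1} = 1)}{1 - P(I_{i-1} = 1)}\right)$. (5-22)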
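For a Markov target, every term of the second order estimate can be read off the transition matrix: $P(I_i = 1)$ is the $i$-step probability of moving from the initial target cell to the agent's cell $x(i)$, and $P(I_i = 1 \mid I_{i-1} = 1)$ is the one-step probability of moving from $x(i-1)$ to $x(i)$. The Python sketch below evaluates the estimate for a given path. The function name and the three-cell example are hypothetical, and the code assumes $P(I_{i-1} = 1) < 1$ at every step.

```python
def second_order_estimate(P, s0, path):
    """Second order estimate of the detection probability for a Markov target.

    P    : row-stochastic transition matrix (list of lists)
    s0   : initial cell of the target
    path : agent cells x(0), ..., x(T), with x(0) != s0
    """
    n = len(P)
    # marginals[i][j] = P(target in cell j at time i), i.e. row s0 of P^i
    dist = [0.0] * n
    dist[s0] = 1.0
    marginals = [dist]
    for _ in range(len(path) - 1):
        dist = [sum(dist[j] * P[j][k] for j in range(n)) for k in range(n)]
        marginals.append(dist)

    product = 1.0
    for i in range(1, len(path)):
        p, q = path[i - 1], path[i]
        a = marginals[i][q]        # P(I_i = 1)
        b = marginals[i - 1][p]    # P(I_{i-1} = 1), assumed < 1
        # Factor P(I_i = 0 | I_{i-1} = 0) rewritten with the identity (5-21)
        product *= 1.0 - (a - P[p][q] * b) / (1.0 - b)
    return 1.0 - product

# Hypothetical three-cell corridor; the target starts in cell 2, the agent in cell 0.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
print(second_order_estimate(P, s0=2, path=[0, 1, 1]))  # 0.75
```

For this two-step path the estimate coincides with the exact detection probability, because conditioning on the certain event $I_0 = 0$ adds no information; for longer paths the two values generally differ.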

5.5.5 Connection of Multistage Graphs and the Problem

Definition 5.7. A multistage graph $G = (V, E)$ is a directed graph in which the vertices are partitioned into $k + 1$ disjoint sets $V_i$, $0 \le i \le k$, with $k \ge 1$. In addition, if $(u, v)$ is an edge in $E$, then $u \in V_i$ and $v \in V_{i+1}$ for some $i$, $0 \le i < k$. The sets $V_0$ and $V_k$ are such that $|V_0| = |V_k| = 1$. The vertices in $V_0$ and $V_k$ are called the source and the sink nodes, respectively.

Let $c(i, j)$ be the cost of edge $(i, j) \in E$. The cost of a path from the source $s$ to the sink $t$ is the sum of the costs of the edges on the path. The multistage graph problem is to find a minimum-cost path from the source node to the sink node.

A dynamic programming formulation for a $k$-stage graph problem is obtained by first noticing that every path from the source node to the sink node is the result of a sequence of $k - 1$ decisions. The $i$th decision involves determining which vertex in $V_{i+1}$, $0 \le i \le k - 2$, is to be on the path. Let $p(i, j)$ be a minimum-cost path from the source node to a vertex $j$ in $V_i$, and let $\mathrm{cost}(i, j)$ be the cost of $p(i, j)$. Then the following holds:

$\mathrm{cost}(i, j) = \min_{l \in V_{i-1},\ (l, j) \in E} \{\mathrm{cost}(i - 1, l) + c(l, j)\}$.

Next, we explain how our problem with cost functions (5-14), (5-22) and (??) can be expressed as multistage graph problems.
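The recurrence for $\mathrm{cost}(i, j)$ translates directly into a forward dynamic program over the stages. The Python sketch below is a generic illustration: the function name, the vertex labels and the edge costs are hypothetical, and vertex labels are assumed to be unique across stages.

```python
def multistage_shortest_path(stages, cost):
    """Minimum-cost source-to-sink path in a multistage graph.

    stages : list of vertex lists; stages[0] = [source], stages[-1] = [sink]
    cost   : dict mapping an edge (u, v) to its cost c(u, v), where u and v
             lie in consecutive stages
    Returns (total cost, list of vertices on the path).
    """
    source = stages[0][0]
    sink = stages[-1][0]
    best = {source: 0.0}   # best[v] = cost(i, v), the cheapest cost from the source to v
    parent = {}            # parent[v] = predecessor of v on that cheapest path
    for prev_stage, stage in zip(stages, stages[1:]):
        for v in stage:
            for u in prev_stage:
                if u in best and (u, v) in cost:
                    candidate = best[u] + cost[(u, v)]
                    if v not in best or candidate < best[v]:
                        best[v] = candidate
                        parent[v] = u
    if sink not in best:
        raise ValueError("sink is not reachable from the source")
    # Walk the parent pointers back from the sink to recover the path.
    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return best[sink], path

# Hypothetical four-stage example.
stages = [["s"], ["a", "b"], ["c", "d"], ["t"]]
cost = {("s", "a"): 1, ("s", "b"): 2,
        ("a", "c"): 5, ("a", "d"): 1,
        ("b", "c"): 1, ("b", "d"): 4,
        ("c", "t"): 1, ("d", "t"): 3}
total, path = multistage_shortest_path(stages, cost)
print(total, path)  # 4.0 ['s', 'b', 'c', 't']
```

Each edge is examined once, so the running time is linear in the number of edges. The same routine handles negative edge costs such as logarithms of probabilities, because a multistage graph has no cycles.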

Let $\{1, 2, \ldots, N\}$ be the cells that represent the map of the region. $V_i$, $i = 1, 2, \ldots, T$, consists of $N$ nodes that represent the $N$ possible cells of the agent (or target) at time $i$. $V_0$, the source node, represents the agent's initial position, which is a cell. We have one additional node, the sink node, in $V_{T+1}$. We connect nodes in $V_i$ to those nodes in $V_{i+1}$ to which the agent (or the target) can move in one time step (accessible or adjacent cells). This cost function works not only for the Markov chain model but also for the general case. In other words, all edges entering a given node in $V_i$, $i = 1, 2, \ldots, T$, have the same length. For instance, let $j \in V_i$. Then all edges from $V_{i-1}$ to $j \in V_i$ have length $\ln(1 - P_j)$. Here $P_j$ is the probability that the target is in cell $j$ at time $i$. The sink node must be connected to all nodes in $V_T$. We assign a cost of 0 to those edges. For instance, let us consider a $3 \times 3$ map. Let us assume that the agent can move from one node only to its adjacent nodes (diagonally adjacent nodes are not included) in one time period. Then the multistage graph representation can be shown as in Figure 5-3.

Figure 5-3: Multistage graph representation 1 (stages 1, 2, ..., T-1, T)

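Here the construction of the multistage graph is the same as for (5-16). The only difference is the length of the edges. We assign a cost of

$\ln\left(1 - \frac{P^{i}_{s_0 q} - P_{pq} P^{i-1}_{s_0 p}}{1 - P^{i-1}_{s_0 p}}\right)$,

where $s_0$ is the initial position (cell) of the target and $P^{i}_{s_0 q}$ denotes the $i$-step transition probability from $s_0$ to $q$, to edge $(p, q)$, where $p \in V_{i-1}$ and $q \in V_i$, $i = 1, 2, \ldots, T$.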

We now explain how to apply the multistage graph problem to this problem.

$V_0$ consists of one node, which represents the initial state of the target.

$V_1$ consists of all possible cells of the map, but we add only admissible edges and assign costs as we did for the second order estimated cost function.

$V_k$, $k = 2, 3, \ldots, T$, consists of a number of nodes, each of which represents a pair of states $(p, q)$. Moreover, edge $(p, q)$ must be an admissible edge in the sense of a real path. We connect node $(p, q)$ in $V_2$ to node $p$ in $V_1$ by an edge and assign to the edge a cost of

$\ln\left(1 - \frac{P^{2}_{s_0 q} - P_{pq} P_{s_0 p}}{1 - P_{s_0 p}}\right)$.

We connect node $(p, q)$ in $V_k$ to every node in $V_{k+1}$ whose first state is $q$. For example, node $(p, q)$ in $V_k$ and node $(q, r)$ in $V_{k+1}$ must be connected by an edge. We assign a cost of

$\ln\left(1 - \frac{P^{k+1}_{s_0 r} - P_{qr} P^{k}_{s_0 q} - P^{2}_{pr} P^{k-1}_{s_0 p} + P_{qr} P_{pq} P^{k-1}_{s_0 p}}{1 - P^{k}_{s_0 q} - P^{k-1}_{s_0 p} + P_{pq} P^{k-1}_{s_0 p}}\right)$

to the edge. Consider node $(4, 5)$ in a $3 \times 3$ region. The agent can move from one node only to its adjacent nodes (diagonally adjacent nodes are not included) in one time period. Then $(4, 5)$ in $V_k$ can be connected only to nodes $(5, 2)$, $(5, 4)$, $(5, 5)$, $(5, 6)$ and $(5, 8)$ (see Figure 5-4).

Figure 5-4: Multistage graph representation 2 (stages K and K+1; nodes (4,5) and (5,5) shown)

5.6 More General Model

5.6.1 The Agent is Faster Than Target

This case can be handled by changing the transition probability matrix, i.e., by increasing the probability of staying in the same state.

5.6.2 Obstacles in Path

Obstacles can be handled by assigning a cost of $-M$ (or $M$, depending on the problem), where $M$ is a large positive number.

Figure 5-5: Region with obstacles

5.6.3 Target Direction

We can also take the current direction of the target into account. This can be handled by extending the Markov chain, that is, the transition matrix. If we use the 8 azimuths, we have an $8N \times 8N$ transition probability matrix instead of an $N \times N$ matrix. For each cell, there are 8 different states, which are shown in Figure 5-6.

(i,NW) (i,N) (i,NE)

(i,W) i (i,E)

(i,SW) (i,S) (i,SE)

Figure 5-6: Using the 8 azimuths

If the target is in cell $i$ and heads North East, then the probability that it will be in cell $j$ and will head South East is $P_{(i,NE),(j,SE)}$.
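One way to store the extended chain is to map each (cell, heading) pair to a single row or column index of the $8N \times 8N$ matrix. The Python sketch below shows such a mapping. The ordering of the headings, the zero-based cell numbering and the function names are assumptions made for illustration only.

```python
# Assumed ordering of the 8 azimuths; any fixed ordering works.
HEADINGS = ("N", "NE", "E", "SE", "S", "SW", "W", "NW")

def state_index(cell, heading):
    """Index of the state (cell, heading); cells are numbered 0, ..., N-1 here."""
    return cell * len(HEADINGS) + HEADINGS.index(heading)

def state_from_index(index):
    """Inverse mapping: recover (cell, heading) from a state index."""
    cell, h = divmod(index, len(HEADINGS))
    return cell, HEADINGS[h]

# Hypothetical 3 x 3 map: N = 9 cells give 72 states.
n_cells = 9
size = n_cells * len(HEADINGS)
P_ext = [[0.0] * size for _ in range(size)]

# The entry P_(i,NE),(j,SE) of the text is stored at this row and column.
i, j = 4, 5
row, col = state_index(i, "NE"), state_index(j, "SE")
print(size, row, col)  # 72 33 43
```

With this layout the exact and approximate cost functions of the previous sections can be applied unchanged, provided the probability that the target is in cell $j$ is obtained by summing over the 8 heading states of that cell.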

5.7 Conclusion and Future Direction

Given a path, we have developed and coded an exact method of calculating the probability of intercept. We have also built a simulator to test our path planning algorithm. The simulator builds agent and target paths and determines the number of times the target and agent paths intersect. The different models proposed in this chapter were tested against our coded simulation. Future directions for this research are to compare our model against competing path planning algorithms and to consider the impact of the path planning horizon.


6.1 Introduction

Weaponized unmanned systems can be expected to search for and attack targets

in a battlespace where there is a variety of targets with different values. An obvious

goal of war fighters is to attack the most valued targets. This goal can be challenging

in an uncertain battlespace where the number of detectable targets and their values

are most likely not known with certainty at the time of launch. However, a best target strategy is suggested to maximize the probability of a weaponized unmanned system attacking the most valued target within the battlespace. The strategy fits well with autonomous systems making decisions of action. The results of calculations and simulations show that the strategy is robust to variabilities and uncertainties. The strategy is extended to multiple unmanned systems searching for targets in the same battlespace. Information sharing between unmanned systems is considered as a way of improving results. This can be achieved through mutual information or cross entropy.

The future battlespace will consist of autonomous agents either working alone or cooperatively in an effort to locate and attack enemy targets. An obvious goal is that agents should attack the most valuable targets. Although value is subjective and can change as the battlespace changes, one may assume that the value of targets can be assessed given a time and situation. Some basic strategies are presented that maximize the probability of attacking the most valuable targets. Performance measures of these strategies are:

1. Probability of attacking the most valuable target.

2. Probability of attacking the $j$th most valuable target. In particular, probability of attacking the worst target.

3. Mean rank of the attacked target. Rank of 1 is best of all $n$ targets.

4. Average number of targets examined before a decision to attack is made. A smaller number means the agent is exposed to a hostile environment for less time.


This chapter is organized into eight sections. Section 6.2 discusses a strategy to maximize the probability of attacking the most valuable target. Section 6.3 discusses a strategy that increases the mean value of attacked targets after many missions. The impact of a variable number of targets is discussed in Section 6.4. A novel approach based on learning agents is presented in Section 6.5. This strategy provides some exciting results that encourage further research in which entropy or mutual information could be applied. Multiple agents are discussed in Section 6.6. Multiple agents can be used as a pack or separately against a battlespace. Lastly, a strategy based on a dynamic threshold is presented in Section 6.7.

6.2 Maximize Probability of Attacking the Most Valuable Target

Consider a situation as represented in Figure 6-1 where an agent is to find and

attack 1 of n distinct targets that are uniformly distributed across a battlespace.

The agent carries a single weapon. In random order the agent detects and classifies

each target one at a time and the agent must decide to attack or move on. If the

agent moves on, then it cannot return to attack (assume the target conceals itself

or moves upon detection). In this situation assume the locations and values of the

targets within the battlespace are unknown ahead of time. As mentioned above, a

goal is to attack the most valuable target. If the agent makes a decision to attack a

particular target soon into the mission, then it may not see a more valuable target

later. If the agent attacks late in the mission, the agent may have passed over more

valuable targets.

Figure 6-1: Racetrack search of battlespace with n targets

6.2.1 Best Target Strategy

To maximize the probability of attacking the most valuable target, a decision strategy to consider is that described by [58], in which the agent examines the first $k$ targets, $1 \le k < n$, and then attacks the first target after $k$ that is more valuable than any of the first $k$. The probability of attacking the best target is

$P_k(\text{best}) = \frac{k}{n} \sum_{i=k+1}^{n} \frac{1}{i - 1}$.

$k$ is selected such that the agent maximizes the probability of attacking the best target. The value of $k$ that maximizes this probability is $k = n/e$ rounded to the nearest integer. When $n$ is large this probability is $1/e \approx 0.368$. Many may find this a surprisingly high probability of attacking the most valued target among a large number of targets, given that there is only one most valued target and that its value and the values of all other targets are unknown beforehand.

As an example of this strategy, consider 5 distinct targets with distinct values. Let $k = 2$. Assume the targets are presented in the following order of their ranked value (1 being best): 4, 3, 1, 5, and 2. The agent passes the first two targets and records their values (the best of the first $k$ has rank 3). After the first two targets pass, the agent is ready to attack the first target with rank better than 3, which is the target of rank 1 in this case. In the situation of $n = 5$ and $k = 2$, the probability of attacking the most valuable target using this strategy is 0.433.
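The example can be checked mechanically. The Python sketch below applies the strategy to a sequence of ranks and then enumerates all $5! = 120$ orderings to count how often the best target is attacked. The function name is hypothetical.

```python
from itertools import permutations

def best_target_strategy(ranks, k):
    """Return the rank of the attacked target (1 = most valuable).

    ranks : ranks of the targets in their order of appearance
    k     : number of targets examined and passed before attacking
    """
    benchmark = min(ranks[:k])       # best (lowest) rank among the first k
    for r in ranks[k:]:
        if r < benchmark:            # first later target that beats the benchmark
            return r
    return ranks[-1]                 # otherwise the agent attacks the last target

# The ordering used in the text: 4, 3, 1, 5, 2 with k = 2.
print(best_target_strategy([4, 3, 1, 5, 2], k=2))  # 1

# Enumerate all orderings of n = 5 targets to obtain the exact probability.
orderings = list(permutations(range(1, 6)))
hits = sum(best_target_strategy(list(p), 2) == 1 for p in orderings)
print(hits, len(orderings), hits / len(orderings))  # 52 120 0.4333...
```

The enumeration gives 52 of 120 orderings, that is $13/30 \approx 0.433$, in agreement with the value quoted above.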

If the most valuable target is among the first $k$ targets, then the strategy results in attacking the $n$th target, regardless of value. Even if the most valuable target is not among the first $k$ targets, the strategy may result in attacking a target other than the most valuable one. For example, if the third most valuable target is the most valuable among the first $k$ targets and if, after $k$, the second most valuable target is encountered before the most valuable target, then the second most valuable target will be attacked. This brings to mind the questions in the introduction.

Probability of Attacking jth Most Valuable Target

To develop the relation for the probability of attacking the $j$th most valuable target, we begin by examining the probability of attacking the second most valuable target, then the third, and finally the worst. This development leads to a generalized closed-form expression for $P_k(j\text{th best})$.

Using conditional probability, where $X$ is the position of the second most valuable target, and ensuring $0 < k \le n - 2$, an expression is developed as follows:

$P_k(\text{2nd best}) = \sum_{i=1}^{n} P_k(\text{2nd best} \mid X = i) P(X = i) = \frac{1}{n} \sum_{i=1}^{n} P_k(\text{2nd best} \mid X = i)$. (6-1)

To complete (6-1), expressions are needed for $P_k(\text{2nd best} \mid X = i)$, which follow:

$P_k(\text{2nd best} \mid X = i) = 0$ if $1 \le i \le k$, (6-2a)

$P_k(\text{2nd best} \mid X = i) = \frac{k}{i - 1} \cdot \frac{n - i}{n - 1}$ if $k < i < n$, (6-2b)

$P_k(\text{2nd best} \mid X = i) = \frac{k}{n - 1}$ if $i = n$. (6-2c)

If the second best target is among the first $k$, then there is no chance of attacking it; therefore, (6-2a) holds. Equation (6-2b) holds because it is the probability that the best of the first $i - 1$ targets is among the first $k$ and the overall best target comes after position $i$. Finally, (6-2c) follows because it is the probability that the overall best target is encountered among the first $k$.

Note that if $k = n - 1$, the only way the second best target is attacked is if the overall best target is among the first $k$ and the second best target is located at the $n$th position. The probability of this occurring is

$P_k(\text{2nd best}) = \frac{1}{n} \cdot \frac{k}{n - 1}$ if $k = n - 1$. (6-3)

Inserting (6-2) back into (6-1) and incorporating (6-3) results in

$P_k(\text{2nd best}) = \frac{1}{n} \cdot \frac{k}{n - 1} + \frac{1}{n} \sum_{i=k+1}^{n-1} \frac{k}{i - 1} \cdot \frac{n - i}{n - 1}$ if $1 \le k \le n - 2$,

$P_k(\text{2nd best}) = \frac{1}{n} \cdot \frac{k}{n - 1}$ if $k = n - 1$. (6-4)

Similarly, if we ensure $0 < k \le n - 3$, an expression for the probability of attacking the third best target can be developed:

$P_k(\text{3rd best}) = \frac{1}{n} \cdot \frac{k}{n - 1} + \frac{1}{n} \sum_{i=k+1}^{n-2} \frac{k}{i - 1} \cdot \frac{n - i}{n - 1} \cdot \frac{n - i - 1}{n - 2}$. (6-5)

The event of attacking the worst target occurs if the worst target is located at the $n$th position and the overall best target is located within the first $k$ positions. The probability of this occurring is

$P_k(\text{worst}) = \frac{1}{n} \cdot \frac{k}{n - 1}$. (6-6)

A generalized relationship for the probability of attacking the $j$th best target, $j > 1$, may be developed from the above relationships and is as follows:

$P_k(j\text{th best}) = \frac{1}{n} \cdot \frac{k}{n - 1} + \frac{1}{n} \sum_{i=k+1}^{n-j+1} \frac{k}{i - 1} \cdot \frac{n - i}{n - 1} \cdot \frac{n - i - 1}{n - 2} \cdots \frac{n - i - j + 2}{n - j + 1}$ if $1 \le k \le n - j$,

$P_k(j\text{th best}) = \frac{1}{n} \cdot \frac{k}{n - 1}$ if $n - j < k \le n - 1$.

Mean Rank of the Attacked Target

Let the rank of the $j$th best target be $j$, $1 \le j \le n$, where $j = 1$ is the highest value and $j = n$ is the worst value. The mean rank of the attacked target is

$E[j] = \sum_{j=1}^{n} j \, P_k(j\text{th best})$. (6-7)

As a general rule a higher mean rank, that is, a numerically smaller value of $E[j]$, is desirable.

Mean Number of Examined Targets

The mean number of targets examined before an attack may be developed by considering the probability of attacking the $i$th examined target, $1 \le i \le n$. The probability of attacking the $i$th target seen, where $0 < i \le k$, is, of course, 0. The probability of attacking the $i$th target, where $k < i < n$, is the probability that the best of the first $i - 1$ targets is within the first $k$ and the best of the first $i$ values is in position $i$. The result is

$P(\text{attacking } i\text{th examined target}) = \frac{k}{i - 1} \cdot \frac{1}{i}$ if $k < i < n$.

The probability of attacking the very last target, $i = n$, is the probability that the overall best is found within the first $k$, or that the best of the first $n - 1$ is within the first $k$ and the best of the $n$ is in position $n$. This results in

$P(\text{attacking last target}) = \frac{k}{n} + \frac{k}{n - 1} \cdot \frac{1}{n}$ if $i = n$.

Using the rules of expectation, the mean number of examined targets simplifies to

$E(\text{number of examined targets}) = k \left(1 + \sum_{i=k+1}^{n} \frac{1}{i - 1}\right)$.
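The closed-form expressions of this section can be evaluated directly. The Python sketch below is an illustration with hypothetical function names; it assumes $1 \le k < n$ and writes each formula out in the comments so that the code can be read on its own.

```python
def p_best(n, k):
    """P_k(best) = (k/n) * sum_{i=k+1}^{n} 1/(i-1)."""
    return (k / n) * sum(1.0 / (i - 1) for i in range(k + 1, n + 1))

def p_jth_best(n, k, j):
    """Probability of attacking the j-th best target (j = 1 is the best)."""
    if j == 1:
        return p_best(n, k)
    base = k / (n * (n - 1))          # (1/n) * k/(n-1): j-th best is last, best is in first k
    if k > n - j:
        return base
    total = 0.0
    for i in range(k + 1, n - j + 2):  # i = k+1, ..., n-j+1
        term = k / (i - 1)             # best of the first i-1 lies within the first k
        for m in range(1, j):          # the j-1 better targets all come after position i
            term *= (n - i - m + 1) / (n - m)
        total += term
    return base + total / n

def mean_rank(n, k):
    """E[j] = sum_j j * P_k(j-th best), as in (6-7)."""
    return sum(j * p_jth_best(n, k, j) for j in range(1, n + 1))

def mean_examined(n, k):
    """E(number examined) = k * (1 + sum_{i=k+1}^{n} 1/(i-1))."""
    return k * (1.0 + sum(1.0 / (i - 1) for i in range(k + 1, n + 1)))

n, k = 10, 4
print(round(p_best(n, k), 3))          # 0.398
print(round(p_jth_best(n, k, n), 3))   # 0.044, the probability of attacking the worst target
print(round(mean_rank(n, k), 2))       # 3.3
print(round(mean_examined(n, k), 2))   # 7.98
```

For $n = 10$ and $k = 4$ the probabilities over all ranks sum to 1, which is a convenient check that the two cases of the general formula fit together.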

6.2.2 Results of Best Target Strategy

Table 6-1 shows the results of the Best Target Strategy against 10, 100, and

1000 targets where the values of the targets and even the distribution of the values

of the targets are unknown ahead of time. The k-value for each is optimum for

the Best Target Strategy in that it maximizes Pk(best). The results in Table 6-1

were verified empirically by simulations with target values from either uniform or

normal distributions. Note that even for large $n$, with target values that are completely unknown beforehand, the probability of attacking the most valuable target remains very favorable. Mean rank appears acceptable. For example, for $n = 100$ the mean rank indicates that on average missions result in attacking targets within the top 21 in value. The number of targets examined appears relatively large, with about 75 percent of all targets examined before a decision to attack is made.

Table 6-1: Significant results of the basic best target strategy

n      k     Pk(best)   Pk(worst)   Mean Rank   Mean Examinations
10     4     0.398      0.044       3.3         7.98
100    37    0.371      0.004       20.01       74.10
1000   368   0.368      0.0003      185.54      736.20

6.2.3 Best Target Strategy with Threshold

Next consider a situation similar to that just studied; however, instead of skipping

the first k targets, the agent will attack a target that meets or exceeds some threshold.

For example, a target of significant value may appear within the first k. If so, then

attack the target immediately. If no target meets or exceeds the threshold value within

the first k, then follow the basic Best Target Strategy as outlined above which is to

attack the first target after k whose value exceeds the maximum value found within

the first k. Obviously, the higher the threshold value the smaller the probability that

any targets) in the battle space exceeds the threshold. A threshold set too high

simply defaults the strategy to the Best Target Strategy. Too small of a threshold

may result in attacking relatively lower valued targets compared to those available in

the battlespace thereby lowering the probability of attacking the most valuable target.

Setting a threshold relies on some knowledge of the target values which violates the

assumption of no knowledge of target values or their distributions. However, use

of a relatively high threshold can improve results by increasing the probability of

attacking the most valuable target and decreasing the number of targets examined.

Tables 6-2, 6-3 and 6-4 show, respectively, the results of simulations with

threshold values set at the upper 1-percent, 2.5-percent and 30-percent tail of uniform

distributions of target values. Other simulations using normal distributions had

almost identical results. These results indicate a high threshold can increase the

probability of attacking the best target and decrease the number of examined targets.

However, as shown in Table 6-4 the probability of attacking the best target can

decrease substantially with a threshold set too low.

Table 6-2: Empirical results of the best target strategy with threshold at upper
1-percent tail

n     k    Pk(best)   Pk(worst)   Mean Rank   Mean Examinations
10    4    0.435      0.0403      3.11        7.69
100   37   0.515      0.0014      8.41        51.11

Table 6-3: Empirical results of the best target strategy with threshold at upper
2.5-percent tail

n     k    Pk(best)   Pk(worst)   Mean Rank   Mean Examinations
10    4    0.474      0.0355      2.89        7.33
100   37   0.433      0.0004      3.83        34.53

Table 6-4: Empirical results of the best target strategy with threshold at upper
30-percent tail

n     k    Pk(best)   Pk(worst)   Mean Rank   Mean Examinations
10    4    0.387      0.0014      2.15        3.11
100   37   0.036      0           15.26       3.38
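
The threshold variant can be sketched in the same way. The Python code below is a minimal illustration with hypothetical function names. It assumes target values uniform on [0, 1), so that the upper `tail` fraction of the distribution begins at 1 - tail, and it again assumes the agent defaults to the last target if nothing qualifies.

```python
import random

def threshold_trial(values, k, threshold):
    """Within the first k targets, attack immediately if a value meets the
    threshold; otherwise apply the basic rule after k.
    Returns (index attacked, number of targets examined)."""
    for i in range(k):
        if values[i] >= threshold:
            return i, i + 1
    benchmark = max(values[:k])
    for i in range(k, len(values)):
        if values[i] > benchmark:
            return i, i + 1
    return len(values) - 1, len(values)  # assumption: default to the last target

def simulate_threshold(n, k, tail, runs=20000, seed=0):
    """Estimate Pk(best) and mean examinations for a threshold at the upper tail."""
    rng = random.Random(seed)
    threshold = 1.0 - tail
    best = exam_sum = 0
    for _ in range(runs):
        values = [rng.random() for _ in range(n)]
        idx, exams = threshold_trial(values, k, threshold)
        best += values[idx] == max(values)
        exam_sum += exams
    return best / runs, exam_sum / runs

if __name__ == "__main__":
    print(simulate_threshold(10, 4, 0.01))    # high threshold
    print(simulate_threshold(100, 37, 0.30))  # threshold set too low
```

Running both calls side by side illustrates the trade-off described above: a high threshold nudges Pk(best) upward, while a low threshold ends the search after only a few examinations and rarely finds the best target.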

6.3 Maximize Mean Rank of the Attacked Target

In the preceding section a strategy is presented that maximizes the probability

of attacking the most valuable target. In the long run, however, one may wish to

minimize the mean rank (or maximize mean value) of the attacked targets over many

missions. This section introduces a Mean Value Strategy which is superior to the

Best Target Strategy in several ways.

6.3.1 Mean Value Strategy

Consider the same battlespace situation as described above. Again, the agent

will examine the first k, 1 < k < n, targets then attack the first target after k that

is more valuable than any of the first k. Equation (6-7) still applies for mean rank

and rank 1 is the most valued target. Instead of fixing k at a value that maximizes

Pk(best) consider fixing k that minimizes mean rank (or maximizes mean value). The

values of k that minimize the mean rank for different values of n were determined

empirically and are listed in Table 6-5. To interpret the results in Table 6-5 consider

n = 100. The mean rank of 9.6 means that on average, the Mean Value Strategy

will attack the 10th best target out of 100 targets. For large n, mean rank is minimized at

approximately k = √n. These results might be surprising in the sense that for large n,

say n = 3000, this strategy will result, on average, in attacking at least the 55th best target.


Table 6-5: k values to minimize mean rank of attacked targets

n      k    Mean Rank
10     2    2.9
20     4    4.2
50     6    6.7
100    9    9.6
200    13   13.7
500    21   21.9
1000   31   31.2
3000   54   54.3
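
The entries of Table 6-5 can also be checked without simulation. The Python sketch below computes the expected rank directly and searches for the minimizing k. The expression is derived here under the assumption that the agent attacks the last target by default; it is not copied from Equation (6-7), which may be written differently, and the function names are hypothetical.

```python
def mean_rank(n, k):
    """Expected rank (1 = best) of the attacked target when the first k targets
    are skipped. Assumption: the agent attacks the last target by default."""
    # The overall best lies in the first k with probability k/n. Then no later
    # target qualifies, and the last target has a rank uniform on 2..n.
    total = (k / n) * (n + 2) / 2
    for i in range(k + 1, n + 1):
        # Target i is the first to beat the first k with probability k / (i (i - 1)).
        # As the best of the first i, its expected overall rank is (n + 1) / (i + 1).
        total += k / (i * (i - 1)) * (n + 1) / (i + 1)
    return total

def best_k_for_mean_rank(n):
    """Return the k in 1..n-1 that minimizes the expected rank."""
    return min(range(1, n), key=lambda k: mean_rank(n, k))

if __name__ == "__main__":
    for n in (10, 20, 50, 100):
        k = best_k_for_mean_rank(n)
        print(n, k, round(mean_rank(n, k), 1))
```

An exact calculation of this kind is a useful cross-check on the empirical search, since neighboring values of k give mean ranks that differ only in the second decimal place.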

Of course, changing k from the value used in the Best Target Strategy reduces the

probability of attacking the best target. As Figure 6-2 shows, however, the Best

Target Strategy results in Pk(jth best) dropping off much faster than for the Mean

Value Strategy as j goes to n. The result is that the mean value is typically higher

using the Mean Value Strategy.

[Figure 6-2 plot. Curves: Best Target Strategy, Mean Value Strategy. Horizontal axis: jth best target (log scale).]

Figure 6-2: Plots of probabilities of attacking the jth best target for two proposed
strategies, n=100

6.3.2 Results of Mean Value Strategy

Tables 6-6 and 6-7 are the results of simulations (10^6 runs per experiment) using

the Best Target Strategy and Mean Value Strategy against targets with values that

are uniform in [1,1000] and normal (μ = 1000, σ = 250). Note the improvement in

mean target values and reduction of mean number of target examinations when using

the Mean Value Strategy.

Table 6-6: Simulation results of the best target strategy

n     k    Pk(best)   Pk(worst)   Mean Value   Mean Examinations
Uniform Target Values, [1,1000]
10    4    0.397      0.045       697.4        7.99
100   37   0.357      0.004       795.1        74.6
Normal Target Values, μ = 1000, σ = 250
10    4    0.399      0.045       1178.2       7.98
100   37   0.370      0.004       1358.4       74.1

Table 6-7: Simulation results of the mean value strategy

n     k    Pk(best)   Pk(worst)   Mean Value   Mean Examinations
Uniform Target Values, [1,1000]
10    2    0.364      0.022       730.4        5.66
100   9    0.214      0.001       902.5        31.4
Normal Target Values, μ = 1000, σ = 250
10    2    0.365      0.022       1202.2       5.66
100   9    0.221      0.001       1420.8       31.2

6.4 Number of Targets is a Random Variable

Both the Best Target Strategy and the Mean Value Strategy described in the

earlier sections rely on knowing n. Of course, n may be a random variable.

Consider the case where an agent has a fixed flight time, t. Assume for example, the

agent runs out of fuel after t time into the mission. If we make a further assumption

that the time between targets is exponentially distributed with some mean, λ, then the

number of targets becomes a Poisson random variable with mean t/λ. Of course other

distributions for the number of targets may be appropriate. This section discusses

the impacts of n being some random variable.

Consider three possible outcomes for n. Let n_expected be the expected number

of targets and let n_realized be the actual number of targets in the battlespace. If

n_realized = n_expected then there is no impact on strategy. If n_realized > n_expected then the

agent would default to attacking target n_expected if no acceptable target is found before

then. If n_realized < n_expected then it is possible no target is attacked.
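
The effect of a random number of targets can be illustrated with a short simulation. The Python sketch below uses hypothetical function names and makes two assumptions: the number of targets is generated by counting exponentially distributed gaps within the flight time (which yields a Poisson count), and target values are uniform on [0, 1). It estimates how often the agent ends the mission without attacking.

```python
import random

def count_targets(flight_time, mean_gap, rng):
    """Count targets met before time runs out, with exponential gaps between targets."""
    elapsed, count = rng.expovariate(1.0 / mean_gap), 0
    while elapsed <= flight_time:
        count += 1
        elapsed += rng.expovariate(1.0 / mean_gap)
    return count

def attack_index(values, k, n_expected):
    """Best Target rule planned for n_expected targets. Returns the index
    attacked, or None when the targets run out before any attack."""
    if len(values) <= k:
        return None
    benchmark = max(values[:k])
    for i in range(k, len(values)):
        # Attack a target that beats the first k, or target n_expected by default.
        if values[i] > benchmark or i + 1 == n_expected:
            return i
    return None

def simulate_random_n(n_expected, k, runs=20000, seed=0):
    """Return (estimated probability of no attack, mean realized number of targets)."""
    rng = random.Random(seed)
    no_attack = total_targets = 0
    for _ in range(runs):
        # flight_time = n_expected and mean_gap = 1 give a Poisson mean of n_expected.
        n_realized = count_targets(float(n_expected), 1.0, rng)
        total_targets += n_realized
        values = [rng.random() for _ in range(n_realized)]
        if attack_index(values, k, n_expected) is None:
            no_attack += 1
    return no_attack / runs, total_targets / runs

if __name__ == "__main__":
    print(simulate_random_n(10, 2))
```

The no-attack probability produced this way is the quantity reported as Pk(no target) in the tables that follow.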

Tables 6-8, 6-9 and 6-10 are the results of simulations (10^6 runs per experiment)

using the Best Target Strategy and Mean Value Strategy against targets with values

that are uniform in [1,1000] and normal (μ = 1000, σ = 250). The number of

targets in Table 6-8 is Poisson distributed with mean n. The number of targets

in Table 6-9 is approximately normally distributed (non-negative and rounded to the

nearest integer) with mean n and standard deviation 0.2n. Finally, the number

of targets in Table 6-10 is uniformly distributed in [0.5n, 1.5n]. In these results, the

worst case is not attacking any target, and the value of not attacking a target is zero.

A review of this data reveals the Mean Value Strategy is more robust to a variable n.

A summary of performance of the two strategies for uniform target values (results are

similar for normally distributed target values) is given in Table 6-11. The summary

shows how much average performance changes when n varies compared to a fixed

n. Table 6-11 shows the average percentage drop in probability of attacking the

most valuable target, the mean value of attacked target and the mean number of

examinations before an attack is made. The results also indicate that both strategies

tend to be less impacted by a varying n as n gets large.

The preceding results can be improved if during the mission the agent updates its

expected value of n. As an example of this, we assume we know n; however, due to

atmospheric conditions, sensor performance may be either degraded or enhanced.

This results in the agent detecting fewer or more targets than expected. The

Table 6-8: Simulation results with number of targets Poisson distributed, mean n

n     k    Pk(best)   Pk(no target)   Mean Value   Mean Examinations
Uniform Target Values, [1,1000]
10    2    0.348      0.135           651.1        5.33
10    4    0.328      0.265           542.0        7.42
100   9    0.212      0.050           875.6        31.1
100   37   0.343      0.202           687.8        72.9
Normal Target Values, μ = 1000, σ = 250
10    2    0.348      0.135           1055.4       5.33
10    4    0.330      0.264           892.9        7.42
100   9    0.218      0.048           1370.2       30.8
100   37   0.356      0.197           1150.8       72.6

Table 6-9: Simulation results with number of targets normally distributed, mean n
and standard deviation 0.2n

n     k    Pk(best)   Pk(no target)   Mean Value   Mean Examinations
Uniform Target Values, [1,1000]
10    2    0.363      0.132           657.2        5.40
10    4    0.361      0.261           552.2        7.47
100   9    0.213      0.057           868.7        30.6
100   37   0.327      0.232           660.5        71.0
Normal Target Values, μ = 1000, σ = 250
10    2    0.365      0.131           1064.8       5.39
10    4    0.362      0.262           902.4        7.47
100   9    0.220      0.055           1359.2       30.3
100   37   0.338      0.227           1105.3       70.6

knowledge of number of targets encountered by some time into the mission can be used

to update the expected value of n. This should result in fewer non-attacks caused by

overestimating the number of targets. It may also increase the mean value of the attacked

target because the agent may examine more targets if they exist. Indeed, Table

6-12 shows the simulation results if, at about 90-percent into the mission time, the

agent updates the expected number of targets. This simulation assumes exponentially

distributed interarrival times with some mean, λ, for encountering targets. The results

show that the mean value of the attacked targets increases by a notable amount.

Table 6-10: Simulation results with number of targets uniformly distributed in
[0.5n, 1.5n]

n     k    Pk(best)   Pk(no target)   Mean Value   Mean Examinations
Uniform Target Values, [1,1000]
10    2    0.349      0.136           650.1        5.29
10    4    0.323      0.271           535.9        7.25
100   9    0.214      0.064           862.0        30.0
100   37   0.306      0.261           631.2        68.7
Normal Target Values, μ = 1000, σ = 250
10    2    0.348      0.136           1054.1       5.30
10    4    0.24       0.271           882.4        7.25
100   9    0.221      0.062           1347.8       29.8
100   37   0.317      0.256           1059.0       68.3

Table 6-11: Performance summary of best target strategy and mean value strategy
when n varies; values are in percentage drop compared to when n is fixed

n     Pk(best)   Mean Value   Mean Examinations
Best Target Strategy
10    15.0       22.1         7.6
100   8.9        17.0         5.0
Mean Value Strategy
10    2.9        10.6         5.7
100   0.5        3.7          2.7

If the probability of attacking no target can be reduced, this should increase both

probability of attacking the best target and the mean value of the attacked target.

One way to do this is to update n only when n is overestimated. Simply put, allow

n_expected to be reduced but not increased. Table 6-13 shows the desired results are

obtained when expected number of targets is allowed to be lowered but not increased.

The optimum results are obtained when the update is made at about the 95-percent

point into the mission time (in other words near the end of the mission).

6.5 Target Strategy with Sampling-The Learning Agent

Obviously, the Best Target Strategy and Mean Value Strategy use a very simple

piece of information available to make a decision. That is, they use only the maximum

Table 6-12: Simulation results with expected number of targets, n, updated
90-percent into the mission

n     k    Pk(best)   Pk(no target)   Mean Value   Mean Examinations
Uniform Target Values, [1,1000]
10    2    0.333      0.094           678.0        5.19
10    4    0.334      0.185           595.3        7.43
100   9    0.210      0.031           884.9        29.1
100   37   0.344      0.125           726.8        70.0
Normal Target Values, μ = 1000, σ = 250
10    2    0.334      0.095           1116.8       5.19
10    4    0.334      0.183           1012.9       7.33
100   9    0.218      0.030           1388.6       28.9
100   37   0.356      0.122           1229.0       69.6

Table 6-13: Simulation results with expected number of targets, n, updated near the
end of the mission. n may be updated downward, but not upward.

n     k    Pk(best)   Pk(no target)   Mean Value   Mean Examinations
Uniform Target Values, [1,1000]
10    2    0.332      0.087           680.7        5.18
10    4    0.335      0.166           603.8        7.24
100   9    0.210      0.018           891.1        30.0
100   37   0.340      0.072           751.9        71.4
Normal Target Values, μ = 1000, σ = 250
10    2    0.343      0.085           1132.7       5.10
10    4    0.342      0.157           1048.8       7.05
100   9    0.217      0.016           1403.0       29.9
100   37   0.352      0.063           1287.5       71.2

value of the targets within k. A strategy modification to consider is to use more

information about the values of the targets encountered within k. For example,

although the values (or their distributions) of the targets are unknown beforehand,

the agent learns more about the target-value distribution as examinations are made.

Information on mean and standard deviation of the sample may be useful in the

decision making process. Suppose that instead of using the maximum value within

k, the agent uses mean of the sample plus some factor of the standard deviation

as a decision threshold. In addition, suppose that after k, as potential targets are

encountered but not attacked, their values are used to update the sample mean and

sample standard deviation. In turn, the sample mean and standard deviation can

be used to further refine the threshold used in the decision making process. This

threshold will be developed shortly based on order statistics.

Using the idea of order statistics and given n variables drawn from a uniform distribution

in [0,1], it can be shown that the expected value of the maximum value is

E[maximum value] = n/(n+1).

For example, given 10 variables drawn from a uniform distribution in [0,1], the

expected value of the highest value is 10/11.
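
For reference, the expected maximum of n independent uniform variables follows from the distribution of the largest order statistic. The short derivation below is a standard result, written with M (a symbol introduced here) for the maximum.

```latex
% Distribution and density of the maximum M of n independent Uniform[0,1] variables
F_M(x) = P(M \le x) = x^{n}, \qquad f_M(x) = n x^{n-1}, \qquad 0 \le x \le 1,
% Expected value of the maximum
E[M] = \int_0^1 x \, n x^{n-1} \, dx = \frac{n}{n+1},
\qquad \text{e.g. } n = 10 \;\Rightarrow\; E[M] = \tfrac{10}{11} \approx 0.909 .
```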

A potential decision strategy for attacking a target is to set the decision threshold

at the value of the expected highest value of the remaining targets. In the example

of n targets whose values are uniformly distributed in [0,1], an agent will decide to

attack the ith target if

\[
\text{$i$th Target Value} >
\begin{cases}
\dfrac{n+1-i}{n+2-i}, & \text{if } 1 \le i < n, \\[4pt]
0, & \text{if } i = n.
\end{cases}
\tag{6-8}
\]

The noticeable flaw with the above strategy is that the distribution of the target values is

unknown. We can use entropy to overcome this flaw. We extend the result

from (6-8) to a threshold. We propose a strategy where the agent will examine

the first k, 1 < k < n, targets then attack the first target after k that is equal to or

more valuable than threshold, T, such that

\[
T =
\begin{cases}
H(p)\left(\dfrac{n+1-i}{n+2-i}\right)(\max - \min) + \min, & \text{if } k < i \le n-1, \\[4pt]
0, & \text{if } i = n,
\end{cases}
\]

where H(p) is a scaling factor and max and min are calculated as follows

\[ \max = \bar{x} + \gamma\sigma, \]

\[ \min = \bar{x} - \gamma\sigma, \]

where x̄ is the sample mean, σ is the sample standard deviation, and γ is some scaling factor

for the sample standard deviation. Through simulations, applicable values for p and γ

were found to be 0.95 and 2.0, respectively. The sample mean and standard deviation

are determined from the values of targets up to but not including the ith target.

The results of simulations (10^6 runs per experiment) of this strategy are

summarized in Table 6-14. We found the best value for k is the mean of the optimum

values of k for the Best Target Strategy and Mean Value Strategy. Referring back

to Tables 6-6 and 6-7, one may note much improvement for the mean value of the

attacked targets (over 5-percent improvement in the case of n = 100 and normally

distributed target values). A drawback of this strategy compared to the Mean Value

Strategy is the number of target examinations is higher.

Table 6-14: Simulation results of the target strategy with sampling

n     k    Pk(best)   Pk(worst)   Mean Value   Mean Examinations
Uniform Target Values, [1,1000]
10    3    0.350      0.023       752.7        7.08
100   23   0.196      0.001       921.1        63.9
Normal Target Values, μ = 1000, σ = 250
10    3    0.381      0.023       1226.1       7.08
100   23   0.210      0           1497.5       42.7

This strategy may be interpreted as a "learning-agent" strategy since, as the

agent examines targets it learns more about the target-value distributions. Because

complete knowledge about the target distributions should result in the best decisions,

we believe more research is warranted for this strategy.

6.6 Multiple Agents

Consider the situation in Figure 6-3 where multiple agents, each carrying

one weapon, are to search a battlespace and attack separate targets (although

not examined in this paper, more than one agent attacking the same target may

be appropriate where given agent y attacks target z, the probability of agent y

disabling target z, pyz, is less than 1). Again, the goal is to attack the most

valuable targets. Two search and attack strategies will be considered in the

following subsections. The first strategy is that of agents searching together as a

pack. A pack may be advantageous because as a pack, multiple looks at a single

target (coordinated sensing) can result in a higher probability of a correct target

classification [36]. The second strategy is that of agents searching separately with

and without communication. Agents searching separately obviously results in more

ground being covered in a shorter amount of time.

Figure 6-3: m agents performing racetrack search of battlespace with n targets

6.6.1 Agents as a Pack

As depicted in the upper left of Figure 6-3 a strategy proposed here is m agents,

m < n, examine the first k, 1 < k < n, targets then the first agent attacks the first

target after k that is more valuable than any of the first k and the next agent attacks

the next target that is more valuable than any of the first k, and so on. k should

be set equal to a value that maximizes Pk(best) or maximizes the mean value of the

attacked targets. If the most valuable target is within k then the pack of m agents

will attack the last m targets regardless of value. We expect the following outcomes

when using multiple agents as a pack:

• Pk(best)m should be higher

• The mean value of all attacked targets will go down because additional agents
  will obviously attack targets of lesser value than the best

• The mean value of the best target attacked will be higher because there is a
  higher chance of attacking the best target when using more than one agent

• The number of target examinations should go up since it will take longer for all
  agents to find an appropriate target

where Pk(best)m is the probability of attacking the best target with m agents. To

develop the relation for the probability of attacking the most valuable target when

using multiple agents as a pack we begin by examining the probability of attacking

the most valuable target when using just two agents, then three and finally m. This

development will lead to a generalized closed form expression for Pk(best)m.

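By using conditional probability where X is the position of the most valuable

target and ensuring 0 < k < n − 2, an expression for two agents searching as a pack

is developed as follows

\[
P_k(\text{best})_2 = \sum_{i=1}^{n} P_k(\text{best} \mid X = i)_2 \, P(X = i)
= \frac{1}{n} \sum_{i=1}^{n} P_k(\text{best} \mid X = i)_2. \tag{6-9}
\]

To complete (6-9), expressions are needed for Pk(best | X = i)2, which follow

\[
P_k(\text{best} \mid X = i)_2 =
\begin{cases}
0, & \text{if } 1 \le i \le k, \qquad \text{(6-10a)} \\[4pt]
\dfrac{k}{i-1} + \dfrac{k}{i-1} \cdot \dfrac{i-1-k}{i-2}, & \text{if } k < i \le n. \qquad \text{(6-10b)}
\end{cases}
\]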

If the best target is among the first k, then there is no chance of attacking it;

therefore, (6-10a) holds. Equation (6-10b) holds because it is the probability that either 1)

the best of the first i − 1 is among the first k, in which case the first agent strikes, or

2) the third best of the first i is among the first k and the second best is among the targets from

k + 1 to i − 1, in which case the second agent strikes.

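Inserting (6-10) back into (6-9) gives

\[
P_k(\text{best})_2
= \frac{1}{n} \sum_{i=k+1}^{n} \left( \frac{k}{i-1} + \frac{k}{i-1} \cdot \frac{i-1-k}{i-2} \right)
= \frac{k}{n} \sum_{i=k}^{n-1} \left( \frac{1}{i} + \frac{i-k}{i(i-1)} \right),
\quad \text{if } 0 < k < n-2. \tag{6-11}
\]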

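Similarly, if we ensure 0 < k < n − 3, an expression for Pk(best)3 can be developed

\[
P_k(\text{best})_3 = \frac{k}{n} \sum_{i=k}^{n-1} \left( \frac{1}{i} + \frac{i-k}{i(i-1)}
+ \frac{(i-k)(i-k-1)}{i(i-1)(i-2)} \right),
\quad \text{if } 0 < k < n-3. \tag{6-12}
\]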

A generalized relationship for Pk(best)m, the probability of attacking the best

target, may be developed from the above relationships and is as follows

\[
P_k(\text{best})_m = \frac{k}{n} \sum_{i=k}^{n-1} \left( \frac{1}{i} + \frac{i-k}{i(i-1)}
+ \frac{(i-k)(i-k-1)}{i(i-1)(i-2)} + \cdots
+ \frac{(i-k)(i-k-1)\cdots(i-k-m+2)}{i(i-1)(i-2)\cdots(i-m+1)} \right),
\quad \text{if } 0 < k < n-m.
\]
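
The generalized sum is straightforward to evaluate numerically. The Python sketch below uses hypothetical function names and assumes the sum runs over i = k, ..., n − 1 with one term per agent, as in (6-11) and (6-12). It includes a small Monte Carlo check that relies on the observation that the best target is attacked exactly when it lies after the first k and fewer than m earlier targets (after k) beat the first k.

```python
import random

def pack_best_probability(n, k, m):
    """Evaluate the closed-form sum for a pack of m agents, n targets, k >= 1."""
    total = 0.0
    for i in range(k, n):
        term = 1.0 / i          # first term of the inner sum (m = 1 case)
        inner = term
        for j in range(1, m):   # term for the (j + 1)th agent
            numerator = i - k - (j - 1)
            if numerator <= 0:
                break           # this and all later terms are zero
            term *= numerator / (i - j)
            inner += term
        total += inner
    return k / n * total

def simulate_pack(n, k, m, runs=20000, seed=0):
    """Monte Carlo estimate of the same probability with uniform target values."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        values = [rng.random() for _ in range(n)]
        best_idx = values.index(max(values))
        if best_idx < k:
            continue            # best target was only examined, never attacked
        benchmark = max(values[:k])
        earlier = sum(v > benchmark for v in values[k:best_idx])
        hits += earlier < m     # an agent is still free when the best appears
    return hits / runs

if __name__ == "__main__":
    print(pack_best_probability(10, 3, 2), simulate_pack(10, 3, 2))
```

With m = 1 the function reduces to the single-agent Best Target expression, which gives a quick sanity check on the implementation.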

For large n, the following formula may be derived that provides the value of k

that maximizes Pk(best)m given pack of m agents searching n targets

k 1+1 I
C -+2

The results of simulations (10^6 runs per experiment) of this strategy against

targets with uniformly distributed target values are summarized in Table 6-15.

Simulations against normally distributed target values are not included for

compactness; however, the results are similar. Table 6-15 includes statistics

concerning the value of attacked targets. The mean value is the overall mean of

the targets attacked by all agents. The mean best statistic is the mean of the value

of the best target attacked by the pack of agents. Mean examinations is the average

number of target examinations made by the pack of agents. The values for k were

selected to optimize either Pk(best)m or mean best. An asterisk placed next to a

value in the table indicates it was the parameter optimized by the selection of k.

As expected, the optimal value of k for Pk(best)m decreased as the number of agents

increased; however, the k that maximized mean best remained at or near the value

that is optimal for the Mean Value Strategy for a single agent.

Table 6-15: Simulation results of the target strategy with m agents in a pack. Target
values are uniform on [0,1000].

m   n     k    Pk(best)   Pk(worst)   Mean Value    Mean Best   Mean Examinations
2   10    2    0.552      0.067       [illegible]   808.6*      5.66
2   10    3    0.562*     0.067       663.6         798.0       7.54
3   10    2    0.661*     0.134       645.4         843.8       8.64
4   10    2    0.727*     0.223       608.3         862.4       9.30
2   100   9    0.345      0.002       881.8         934.3*      45.2
2   100   30   0.495*     0.009       775.9         883.6       82.0
3   100   9    0.440      0.005       862.1         950.2*      55.3
3   100   26   0.573*     0.014       758.0         920.3       86.3
4   100   9    0.513      0.009       843.4         959.5*      62.9
4   100   23   0.626*     0.020       744.7         940.2       88.8
5   100   8    0.550      0.012       832.8         965.9*      65.7
5   100   21   0.664*     0.025       739.2         953.5       89.8

6.6.2 Separate Agents

The next strategy for multiple agents is that of separate missions as depicted

in Figure 6-4 where each agent is assigned to equal partitions of the battlespace.

Figure 6-4: m agents on separate missions of a equally partitioned battlespace
performing racetrack search for n targets

Two substrategies are considered: 1) without communications, and 2) with

communications.

Separate Agents without Communication

Consider the situation where agents independently search their respective

partitions without any interaction. For example, if n = 100 and m = 2, the battlespace

is partitioned into two equal parts, each with n/m = 50 targets. Each agent then performs

its mission within its partition. That is, each agent will examine the first k,

1 < k < n/m, targets then attack the first target after k that is more valuable

than any of the first k. We expect similar impacts to the pertinent performance

measures as with agents in a pack. That is Pk(best)m and average value of the best

target attacked should go up as more agents are added. In this model Pk(best)m is

calculated similarly to a single agent attacking a battlespace using the Best Target

Strategy. This is

\[
P_k(\text{best})_m = \frac{k}{n/m} \sum_{i=k}^{n/m-1} \frac{1}{i}.
\]

k is selected such that the agent maximizes the probability of attacking the best

target. The value of k that maximizes this probability is k = n/(me), rounded to the nearest

integer. Table 6-16 shows the results of simulations (10^6 runs per experiment) of

this strategy against targets with uniformly distributed target values. Note the mean

examinations is based on the total number of targets examined by m agents.
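
The choice of k for separate agents can be sketched as follows. The Python code is illustrative, with hypothetical function names, and assumes that each agent applies the single-agent Best Target rule to its own partition of n/m targets, that n divides evenly among the agents, and that k is taken as n/(me) rounded to the nearest integer.

```python
import math

def partition_k(n, m):
    """k close to n / (m e): the single-agent rule applied to n/m targets."""
    return max(1, round(n / (m * math.e)))

def separate_best_probability(n, m, k):
    """Probability that the overall best target is attacked when each of m
    agents works alone in its own partition of n/m targets."""
    size = n // m  # assumption: n divides evenly among the agents
    return k / size * sum(1.0 / i for i in range(k, size))

if __name__ == "__main__":
    for n, m in ((10, 2), (100, 2), (100, 4), (100, 5)):
        k = partition_k(n, m)
        print(m, n, k, round(separate_best_probability(n, m, k), 3))
```

The values of k produced this way can be compared with the rows of Table 6-16 in which Pk(best) carries an asterisk.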

Table 6-16: Simulation results of the target strategy with m agents on separate
missions with no communication between them. Target values are uniform on [0,1000].

m   n     k    Pk(best)   Pk(worst)   Mean Value   Mean Best   Mean Examinations
2   10    1    0.416      0.051       647.5        792.9*      3.74
2   10    2    0.431*     0.100       630.6        786.6       4.64
2   100   9    0.304      0.004       857.2        952.3*      33.0
2   100   18   0.358*     0.008       781.6        925.7       44.3
4   100   7    0.355      0.013       795.0        968.6*      23.2
4   100   9    0.364*     0.016       767.3        966.9       24.1
5   100   6    0.362      0.017       775.7        970.6*      19.6
5   100   7    0.367*     0.019       759.9        970.3       19.8

Separate Agents with Communication

We now consider the same situation as just described, but this time the agents

communicate by informing others of their maximum within k. Table 6-17 shows the

results of simulations (10^6 runs per experiment) of this strategy against targets with

uniformly distributed target values.

Table 6-17: Simulation results of the target strategy with m agents on separate
missions with communication between them. Target values are uniform on [0,1000].

m   n     k    Pk(best)   Pk(worst)   Mean Value   Mean Best   Mean Examinations
2   10    1    0.511      0.081       663.8        803.8       4.05
2   100   4    0.301      0.003       867.7        933.7       24.9
2   100   15   0.450      0.010       748.4        883.0       43.6
4   100   2    0.429      0.011       812.7        958.3       18.9
4   100   6    0.544      0.024       696.4        937.4       24.1
5   100   2    0.508      0.019       773.4        963.8       17.5
5   100   4    0.573      0.030       694.9        952.2       19.4

In the situation just described, agents investigate their independent partitions

of the battlespace. Consider a case where, say, agent A in one partition attacks early.

This leaves much of A's partition unsearched. If another agent, B, is still active,

it may be advantageous to allow B to investigate the remaining part of A's partition.

B will search until it finds an appropriate target or until it reaches the end of A's

partition. This can only occur if A and B communicate by telling each other when

and where they made their attack. Table 6-18 shows the results of simulations against

targets with uniform target values. The first part is where agents do not share the

maximum target value found within k and the second part is where agents have full

communication, that is, both the maximum target value found within k and when and

where they attack.

Table 6-18: Simulation results of the target strategy with m agents on separate
missions with communication between them. Uncommitted agents are allowed to
evaluate targets in other unsearched partitions.

m   n     k    Pk(best)   Pk(worst)   Mean Value   Mean Best   Mean Examinations
Not sharing information on maximum target value within k
2   10    1    0.481      0.036       682.2        810.4       6.73
2   10    2    0.494      0.086       655.6        799.8       8.77
2   100   9    0.372      0.003       884.7        957.7       57.0
2   100   17   0.434      0.006       830.9        937.7       79.5
Sharing information on maximum target value within k
2   10    1    0.551      0.067       686.3        808.5       7.45
2   10    2    0.528      0.123       630.9        777.4       9.17
2   100   4    0.326      0.003       883.3        934.1       42.2
2   100   14   0.494      0.008       786.9        889.7       79.8

Separate Agents Comparison

In comparing the performance of the two strategies (separate agents without

communications and separate agents with communications) we observe the best form

of communications is when attacking agents inform other agents when and where their

attack is made and then allowing uncommitted agents to evaluate other unsearched

partitions. There appears to be no advantage for agents communicating the maximum

target value found within their first respective k evaluations.

When the ratio of agents to targets is higher, communication is beneficial for

both Pk(best)m and mean best; however, when the ratio is lower, it appears there is

no advantage to having communications.

6.6.3 Multiple Agent Strategy-Discussion

While operating under the given assumptions and comparing the two strategies

of either a pack of agents or separate agents, we observe that when there are few

agents and many targets, then a separate-agent strategy with communication is the

best strategy. However, when the ratio of agents to targets increases, then the best

strategy is agents hunting as a pack.

6.7 Dynamic Threshold

In all preceding sections we assume the target values are unknown a priori.

However, in actuality, the war fighter may have knowledge of the maximum and

minimum target values that are possible in the battlespace. This knowledge can

come from experience, intelligence or previous missions and can be very useful in

increasing both Pk(best) and mean value of attacked target. A strategy to consider is

that of a dynamic decision threshold [42]. Consider as before, the number of targets,

n, is known. Unlike earlier assumptions, we now assume we know the maximum and

minimum values of the target-value distribution and we assume the target values are

uniformly distributed. A strategy derived by induction that maximizes Pk(best) is

to set a threshold T such that the agent attacks a target if its value exceeds T. T is

determined as follows

\[
T =
\begin{cases}
0.5^{\frac{1}{n-i}} (\max - \min) + \min, & \text{if } 1 \le i < n, \\[4pt]
0, & \text{if } i = n,
\end{cases}
\tag{6-13}
\]

where max and min are maximum and minimum target values, respectively, that

are possible in the battlespace. To summarize the dynamic threshold strategy, an

agent will find and examine targets in succession. The agent will attack the first

target whose value exceeds T as defined in (6-13). Although (6-13) is derived based

on an assumption of uniform target values, we found that it is very effective with

other distributions such as the normal distribution where one assumes max and

min are, respectively, two standard deviations above and below the mean. The

results of simulations (10^6 runs per experiment) of the dynamic threshold strategy

are summarized in Table 6-19. These results indicate, as expected, that the more an

agent knows about the target-value distribution, the better the results.

Table 6-19: Simulation results using the dynamic threshold strategy

n     Pk(best)   Mean Value
Uniform Target Values, [1,1000]
10    0.574      836.5
100   0.528      962.5
Normal Target Values, μ = 1000, σ = 250
10    0.516      1278.5
100   0.463      1556.1
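
The dynamic threshold strategy can be illustrated with a few lines of Python. The sketch below uses hypothetical function names and assumes one reading of (6-13), namely a threshold of 0.5^(1/(n−i)) (max − min) + min for the ith of n targets with i < n, and 0 for the last target. Target values are drawn uniformly between the known minimum and maximum.

```python
import random

def dynamic_threshold(i, n, lo, hi):
    """Assumed threshold for the ith target (1-based) out of n; the last
    target is always attacked because the threshold drops to zero."""
    if i == n:
        return 0.0
    return 0.5 ** (1.0 / (n - i)) * (hi - lo) + lo

def simulate_dynamic_threshold(n, lo=1.0, hi=1000.0, runs=20000, seed=0):
    """Return (estimated probability of attacking the best target,
    mean value of the attacked target)."""
    rng = random.Random(seed)
    best = 0
    value_sum = 0.0
    for _ in range(runs):
        values = [rng.uniform(lo, hi) for _ in range(n)]
        for i, v in enumerate(values, start=1):
            if v > dynamic_threshold(i, n, lo, hi):
                best += v == max(values)
                value_sum += v
                break
    return best / runs, value_sum / runs

if __name__ == "__main__":
    print(simulate_dynamic_threshold(10))
```

The threshold starts high and relaxes as fewer targets remain, which is why the agent can afford to be selective early in the mission.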

6.8 Conclusion

We have shown that an agent will find and examine targets in succession. The

agent will attack the first target whose value exceeds T as defined in (6-13). Although

(6-13) is derived based on an assumption of uniform target values, we found that it

is very effective with other distributions such as the normal distribution where one

assumes max and min are, respectively, two standard deviations above and below the

mean. The results of simulations (10^6 runs per experiment) of the dynamic threshold

strategy are summarized in Table 6-19. These results indicate, as expected, that the more

an agent knows about the target-value distribution, the better the results.


7.1 Summary

The goal of this research is to employ entropy optimization in data mining and

feature extraction in data. In this dissertation, we have developed minimum and

maximum entropy based approaches that allow us to extract essential information

from the data in order to effectively cluster and classify data. Most data when the

data set

Several methods for mining data exist; however, most of these methods suffer

from several limitations especially that of specifying the data distribution which is

usually unavailable. Equally important in data mining is the effective interpretability

of the results. Data to be clustered often exist in very high dimensions which most of

the existing methods do not handle but instead rely on some data preprocessing. The

entropy methods developed in this thesis are well suited for handling these problems.

Our methods eliminate the need for distributional assumptions, which may or may not

hold. Dimension reduction for better comprehension of results is also achieved by

the entropy methods developed in this dissertation. We have also successfully applied

these entropy methods in best target selection and path planning which are areas of

research interest.

7.2 Future Research

There are several important directions to extend the work presented in this

dissertation. One such area is finance. In financial analysis, an investor is always

interested in a model's performance. He therefore must evaluate models based on the

performance of the strategies that the model suggests. This performance measure

of the models can be evaluated using the principles of relative entropy. In macro


econometric modelling and policy analysis, the empirical models that forecast well are

typically nonstructural, yet making the kinds of theoretically coherent forecasts policy

makers wish to see requires imposing structure that may be difficult to implement and

that in turn often makes the model empirically irrelevant. A cross-entropy procedure

can be used to produce forecasts that are consistent with a set of moment restrictions

without imposing them directly on the model.


[1] J. Abello, P.M. Pardalos, and M.G.C. Resende (eds.), Handbook of Massive Data
Sets, Kluwer Academic Publishers, Norwell, 2002.

[2] S.M. Andrijich and L. Caccetta, Solving the multisensor data association problem,
Nonlinear Analysis, 47: 5525-5536, 2001.

[3] R.K. Ahuja, T.L. Magnanti, and J.B. Orlin, Network Flows: Theory, Algorithms,
and Applications, Prentice Hall, Englewood Cliffs, 1993.

[4] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic subspace
clustering of high dimensional data for data mining. ACM SIGMOD Record, 27(2):
94-105, 1998.

[5] J. Bellingham, A. Richards, and J.P. How, Receding Horizon Control of
Autonomous Aerial Vehicles. In Proceedings of the American Control Conference,
Anchorage, AK, 8-10, 2002.

[6] M.J.A. Berry and G. Linoff, Data Mining Techniques for Marketing, Sales and
Customer Support, Wiley, New York, 1997.

[7] C.L. Blake and C.J. Merz, (1998). UCI Repository of Machine Learning Databases Oct. 24, 2004.

[8] M. Brand, Pattern Discovery Via Entropy Minimization, Uncertainty 99: International
Workshop on Artificial Intelligence and Statistics (AISTATS), TR98-21,

[9] W. Chaovalitwongse, Optimization and Dynamical Approaches in Nonlinear Time
Series Analysis with Applications in Bioengineering, Ph.D. Thesis, University of
Florida, 2003.

[10] C. Cheng, A.W. Fu, and Y. Zhang, Entropy-based subspace clustering for mining
numerical data, In Proceedings of International Conference on Knowledge Discovery
and Data Mining, 84-93, 1999.

[11] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms, MIT Press,
Cambridge, 2001.

[12] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis,
Wiley-Interscience, New York, 1974.

[13] J.N. Eagle, The Optimal Search for a Moving Target When the Search Path is
Constrained, Operations Research, 32(5), 1984.

[14] J.N. Eagle and J.R. Yee, An Optimal Branch and Bound Procedure for the
Constrained Path, Moving Target Search Problem, Operations Research, 38(1),

[15] S.-C. Fang, J.R. Rajasekera, and H.-S.J. Tsao, Entropy Optimization and
Mathematical Programming, Kluwer Academic Publishers, Norwell, 1997.

[16] S.-C. Fang and H.-S.J. Tsao, Linear Constrained Entropy Maximization Problem
with Quadratic Cost and its Application to Transportation Planning Problems,
Transportation Science, 29:353-365, 1993.

[17] M. Figueiredo and A.K. Jain, Unsupervised Learning of Finite Mixture Models,
IEEE Trans. Pattern Analysis and Machine Intelligence, 24(3):381-396, 2002.

[18] K. Frenken, Entropy Statistics and Information Theory, The Elgar Companion to
Neo-Schumpeterian Economics, Cheltenham, UK and Northampton MA: Edward
Elgar Publishing (In Press).

[19] W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, Knowledge Discovery in
Databases: An Overview, In G. Piatetsky-Shapiro and W. Frawley (eds.), Knowledge
Discovery in Databases, 1-27, AAAI/MIT Press, 1991.

[20] R.M. Gray, Entropy Theory and Information Theory, Springer-Verlag, New York,

[21] S.F. Gull and J. Skilling, The Entropy of an Image, In J.A. Roberts (ed.), Indirect
Imaging, Cambridge University Press, UK, 267-279, 1984.

[22] D. Hall and G. Ball, ISODATA: A Novel Method of Data Analysis and Pattern
Classification, Tech Report, Stanford Research Institute, Menlo Park, 1965.

[23] E. Horowitz, S. Sahni, and S. Rajasekaran, Computer Algorithms, W.H.
Freeman, New York, 1998.

[24] V. Isler, S. Kannan, and S. Khanna, Locating and Capturing an Evader in
a Polygonal Environment, Technical Report MS-CIS-03-33, Dept. of Computer
Science, Univ. of Pennsylvania, 2003.

[25] G. Iyengar and A. Lippman, Clustering Images Using Relative Entropy for
Efficient Retrieval, IEEE Computer Magazine, 28(9):23-32, 1995.

[26] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, New
Jersey, 1988.

[27] M. James, Classification Algorithms, John Wiley, New York, 1985.

[28] T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman, and A.Y.
Wu, An Efficient K-Means Clustering Algorithm: Analysis and Implementation,
IEEE Trans. Pattern Analysis and Machine Intelligence, 24(7):881-892, 2002.

[29] J.N. Kapur and H.K. Kesavan, Entropy Optimization Principles with Applications,
Academic Press, San Diego, 1992.

[30] T. Kirubarajan, Y. Bar-Shalom, and K.R. Pattipati, Multiassignment for
Tracking a Large Number of Overlapping Objects, IEEE Trans. Aerospace and
Electronic Systems, 37(1):2-21, 2001.

[31] W. Klosgen and J.M. Zytkow (eds.), Handbook of Data Mining and Knowledge
Discovery, Oxford University Press, New York, 2002.

[32] Y.W. Lim and S.U. Lee, On the Color Image Segmentation Algorithm based on
Thresholding and Fuzzy C-means Techniques, Pattern Recognition, 23:935-952,

[33] J.B. MacQueen, Some Methods for Classification and Analysis of Multivariate
Observations, In Proceedings of the Fifth Symposium on Math, Statistics, and
Probability, 281-297, University of California Press, Berkeley, 1967.

[34] D. Miller, A. Rao, K. Rose, and A. Gersho, An Information Theoretic Framework
for Optimization with Application to Supervised Learning, IEEE International
Symposium on Information Theory, Whistler, B.C., Canada, 1995.

[35] B. Mirkin, Nonconvex Optimization and its Applications: Mathematical
Classification and Clustering, Kluwer Academic Publishers, Dordrecht, 1996.

[36] R. Murphey and P. Pardalos (eds.), Cooperative Control and Optimization,
Kluwer Academic Publishers, Dordrecht, 2002.

[37] R. Murrieta-Cid, A. Sarmiento and S. Hutchinson, On the Existence of a Strategy
to Maintain a Moving Target within the Sensing Range of an Observer Reacting
with Delay, In Proc. IEEE Int. Conf. on Intelligent Robots and Optimal Systems,

[38] R. Murrieta-Cid, H. González-Baños and B. Tovar, A Reactive Motion Planner
to Maintain Visibility of Unpredictable Targets, In Proceedings of the IEEE
International Conference on Robotics and Automation, 4242-4247, 2002.

[39] R. Murrieta-Cid, A. Sarmiento, S. Bhattacharya and S. Hutchinson, Maintaining
Visibility of a Moving Target at a Fixed Distance: The Case of Observer Bounded
July 15, 2004.

[40] H. Neemuchawala, A. Hero and P. Carson, Image Registration Using
Entropic Graph-Matching Criteria,
asilomar2002.pdf, Sept. 12, 2004

[41] A. Okafor, M. Ragle and P.M. Pardalos, Data Mining Via Entropy and Graph
Clustering, Data Mining in Biomedicine, Springer, New York (In Press).

[42] H. Pfister, Unpublished work, Air Force Research Laboratory, Munitions
Directorate, 2003.

[43] A.B. Poore, N. Rijavec, M. Liggins, and V. Vannicola, Data Association Problems
Posed as Multidimensional Assignment Problems: Problem Formulation, Signal
and Data Processing of Small Targets, 552-561, SPIE, Bellingham, WA, 1993.

[44] J. Pusztaszeri, P.E. Rensing, and T.M. Liebling, Tracking Elementary Particles
Near their Vertex: A Combinatorial Approach, Journal of Global Optimization,
9(1):41-64, 1996.

[45] D. Ren, An Adaptive Nearest Neighbor Classification Algorithm, http://www., Feb. 21, 2004.

[46] A. Richards and J.P. How, Aircraft Trajectory Planning With Collision
Avoidance Using Mixed Integer Linear Programming, In Proceedings of the American
Control Conference, ACC02-AIAA1057, 2002.

[47] A. Richards, J.P. How, T. Schouwenaars, and E. Feron, Plume Avoidance
Maneuver Planning Using Mixed Integer Linear Programming, In Proceedings
of the AIAA Guidance, Navigation, and Control Conference, AIAA:2001-4091,

[48] J. Rissanen, A Universal Prior for Integers and Estimation by Minimum
Description Length, Annals of Statistics, 11(2):416-431, 1983.

[49] J. Shore and R. Johnson, Axiomatic Derivation of the Principle of Maximum
Entropy and the Principle of Minimum Cross-Entropy, IEEE Transactions on
Information Theory, IT-26(1):26-37, 1980.

[50] J. Shore and R. Johnson, Properties of Cross-Entropy Minimization, IEEE
Transactions on Information Theory, IT-27(4):472-482, 1981.

[51] J. Shore, Cross-Entropy Minimization Given Fully Decomposable Subset
and Aggregate Constraints, IEEE Transactions on Information Theory,
IT-28(6):956-961, 1982.

[52] T. Schouwenaars, B. De Moor, E. Feron and J. How, Mixed Integer Programming
for Multi-Vehicle Path Planning, In Proceedings of the European Control
Conference, 2603-2613, 2001.

[53] J.T. Tou and R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley,
Reading, 1974.

[54] M.M. Trivedi and J.C. Bezdek, Low-Level Segmentation of Aerial Images with Fuzzy
Clustering, IEEE Trans. Syst. Man, Cybern., SMC-16:589-598, 1986.


[55] A.G. Wilson, Entropy in Urban and Regional Planning, Pion, London, 1970.

[56] N. Wu, The Maximum Entropy Method, Springer, New York, 1997.

[57] S.M. Ross, Stochastic Processes, 2nd ed., Wiley, New Jersey, 1996.

[58] S.M. Ross, Introduction to Probability Models, 7th ed., Academic Press, San
Diego, 2000.


BIOGRAPHICAL SKETCH

Anthony Okafor was born in Awgu, Nigeria. He received a bachelor's degree in
mechanical engineering from the University of Nigeria, Nsukka. He later immigrated
to the United States, where he earned a master's degree in mathematical sciences from
the University of West Florida, Pensacola, and a master's degree in industrial and
systems engineering (ISE) from the University of Florida. He is a Ph.D. student in the
ISE department, University of Florida. His research interests include mathematical
programming, entropy optimization and operations research.

As for hobbies, Anthony loves table tennis, badminton and tennis. He is also
actively involved with home automation and enjoys mentoring.

Manyrealwordproblemsinengineering,mathematicsandotherareasareoftensolvedonthebasisofmeasureddata,givencertainconditionsandassumptions.Insolvingtheseproblems,weareconcernedwithsolutionpropertieslikeexistence,uniqueness,andstability.Aproblemforwhichanyoneoftheabovethreeconditionsisnotmetiscalledanill-posedproblem.Thisproblemiscausedbyincompleteand/ornoisydata,wherenoisecanbereferredtoasanydiscrepancybetweenthemeasuredandtruedata.Equallyimportantinthedataanalysisistheeectiveinterpretationoftheresults.Datadatasetstobeanalyzedusuallyhaveseveralattributesandthedomainofeachattributecanbeverylarge.Thereforeresultsobtainedinthesehighdimensionsareverydiculttointerpret.Severalsolutionmethodsexisttohandlethisproblem.Oneofthesemethodsisthemaximumentropymethod.Wepresentinthisdissertationentropyoptimizationmethodsandgiveapplicationsinmodellingreallifeproblems,specicallyinminingnumericaldata.Besttargetselectionandtheapplicationofentropyinmodellingpathplanningproblemsarealsopresentedinthisresearch. xi


Manyrealwordproblemsareoftensolvedonthebasisofmeasureddata,givencertainconditionsandassumptions.Severaldiverseareaswheredataanalysisisinvolvedincludethegovernmentandmilitarysystems,medicine,sports,nance,geographicalinformationsystems,etc.[ 1 35 31 ].Solutionstotheseproblemsinvolveinmostcasestheunderstandingofthestructuralproperties(patterndiscovery)ofthedataset.Inpatterndiscovery,welookforamodelthatreectsthestructureofthedatawhichwehopewillreectthestructureofthegeneratingprocess.Thusgivenadataset,wewanttoextractasmuchessentialstructureaspossiblewithoutmodellinganyofitsaccidentalstructures(e.g.,noiseandsamplingartifacts).Wewanttomaximizetheinformationcontentofallparameters.Amethodforachievingtheobjectivesaboveisentropyoptimization[ 8 ]inwhichentropyminimizationmaximizestheamountofevidencesupportingeachparameter,whileminimizingtheuncertaintyinthesucientstatisticandthecrossentropybetweenthemodelandthedata. 19 ].KDDconsistsofseveralsteps.Thesestepsincludepreparationofdata,patternsearch,knowledgeevaluationandrenement.DataminingisaveryimportantprocessintheKDDprocesssincethisiswherespecicalgorithmsareemployedforextractingpatternsfromthedata.Dataminingcanthereforebeconsideredasasetofextractionprocessesofknowledgestartingfromdatacontainedinabaseofdata[ 9 ]. Dataminingtechniquesincludeavarietyofmethods.Thesemethodsgenerallyfallintooneoftwogroups:predictivemethodsanddescriptivemethods.The 1


predictivemethodsinvolvetheuseofsomevariablestopredictunknownorfuturevaluesofothervariables.Theyareusuallyreferredtoasclassication.Thedescriptivemethodsseekhuman-interpretablepatternsthatdescribethedataandarereferredtoasclustering.However,authorsincluding[ 6 ],havephrasedthesemethodsintermsofsixtasks:classication,estimation,prediction,clustering,marketbasketanalysisanddescription.Thesedierenttasksofdataminingaredescribedbelow.




1 35 31 ].Solutionstotheseproblemsinvolveinmostcasestheunderstandingofthestructuralproperties(patterndiscovery)ofthedataset.Inpatterndiscovery,welookforamodelthatreectsthestructureofthedatawhichwehopewillreectthestructureofthegeneratingprocess.Thusgivenadataset,wewanttoextractasmuchessentialstructureaspossiblewithoutmodellinganyofitsaccidentalstructures(e.g.,noiseandsamplingartifacts).Wewanttomaximizetheinformationcontentofallparameters.Amethodforachievingtheobjectivesaboveisentropyoptimization[ 8 ]inwhichentropyminimizationmaximizestheamountofevidencesupportingeachparameter,whileminimizingtheuncertaintyinthesucientstatisticandthecrossentropybetweenthemodelandthedata. Thischapterisorganizedasfollows.Inthenextsection,weprovidesomebackgroundonentropyandprovidesomerationaleforitsuseintheareaofdatamining. 15 ].Thisconceptwaslaterextendedthroughthedevelopmentofstatisticalmechanics.Itwasrstintroducedintoinformationtheoryin1948byClaudeShannonascitedinshoreetal.[ 45 ]. 4


45 29 ]. LetEistandforaneventandpitheprobabilitythateventEioccurs.LettherebensucheventsE1;:::;Enwithprobabilitiesp1;:::;pnaddingupto1.Sincetheoccurrenceofeventswithsmallerprobabilityyieldsmoreinformationsincetheyareleastexpected,ameasureofinformationhshouldbeadecreasingfunctionofpi.ClaudeShannonproposedalogfunctionh(pi)toexpressinformation.Thisfunctionisgivenas whichdecreasesfrominnityto0,forpirangingfrom0to1.Thisfunctionreectstheideathatthelowertheprobabilityofaneventtooccur,thehighertheamountofinformationinthemessagestatingthattheeventoccurred.Fromtheseninformationvaluesh(pi),theexpectedinformationcontentHcalledentropyisderivedbyweightingtheinformationvaluesbytheirrespectiveprobabilities. Sincepilog2pi0for0pi1itfollowsfrom( 2{2 )thatH0,whereH=0ioneofthepiequals1;allothersarethenequaltozero.Hencethenotation0ln0=0.


15 ]: 1. Shannonmeasureisnonnegativeandconcaveinp1;:::;pn. 2. Themeasuredoesnotchangewithinclusionofazero-probabilityoutcome. 3. Theentropyofaprobabilitydistributionrepresentingacompletelycertainoutcomeis0andtheentropyofanyprobabilitydistributionrepresentinguncertainoutcomeispositive. 4. Givenaxednumberofoutcomes,themaximumpossibleentropyisthatoftheuniformdistribution. 5. Theentropyofthejointdistributionoftwoindependentdistributionsisthesumoftheindividualentropies. 6. Theentropyofthejointdistributionoftwodependentdistributionsisnogreaterthanthesumofthetwoindividualentropies. 7. SinceentropyonlydependontheunorderedprobabilitiesandnotonX,itisinvarianttobothshiftandscalei.e.H(aX+b)=H(X)fora6=0andforallb. 18 ]introducedtheprincipleofmaximumentropy.TheMaximumEntropyPrinciple(MaxEnt)isstatedasfollows: Usingthisprinciple,wegiveanentropyformulationforanassociatedproblem.LetXdenotearandomvariablewithnpossibleoutcomesx1;:::;xn.Letp=(p1;:::;pn)


denotetheirrespectiveprobabilities,respectively.Letr1(X);::::;rm(X)bemfunctionsofXwithknownexpectedvaluesE(r1(X))=a1,...,E(rm(X))=am.TheMaxEntformulationisasfollows:maxH(X)=nXi=1pilnpis.t.nXi=1(pi)rj(xi)=aj;j=1;:::;mnXi=1pi=1pi0;i=1;:::;n (2{5) Takingthegradientwithrespecttop(x)weget (2{6)


Thedistributionwiththemaximumentropyistheuniformdistributionwithpi=1=n. Example:Supposeyouaregivendataon3routesfromAtoBthatyouusuallytaketowork.Thecostofeachrouteindollarsis1;2;and3.Theaveragecostis$1:75.Whatisthemaximumentropydistributiondescribingyourchoiceofrouteforaparticularday?Thesolutiontotheaboveexamplecanbeformulatedandsolvedasfollows.max(p1lnp1+p2lnp2+p3lnp3)s:t:1lnp1+2lnp2+3lnp3=1:75p1+p2+p3=1p10;p20;p30 Therangeofvaluesforpiis0p20:750p30:3750:25p10:625 Themaximumentropysolutionisp1=0:466;p2=0:318;p3=0:216. 29 ]. Withrelativeentropyasameasureofdeviation,theKullback-Leiblerminimumentropyprinciple,orMinEntisstatesdasfollows:


49 51 ]haveexploredtheuseofcross-entropyandhaveshownrigorouslythattheJaynesprincipleofmaximumentropyandKullback'sprincipleofminimumcross-entropyprovidesacorrectmethodofinductiveinferencewhennewinformationisgivenintheformofexpectedvalue.Givenadistributionp0andsomenewinformationintheformofconstraints: .thenthenewdistributionp(x),whichincorporatesthisinformationintheleastbiasedwayoneandwhichisarrivedatinawaythatdoesnotnotleadtoany


inconsistenciesorcontradictions,istheoneobtainedfromminimizing .Thisistheminimumcross-entropyprinciple[ 49 50 51 ].Theseauthorsalsoshowedthatmaximumentropyprincipleisaspecialcaseofminimumcross-entropybasedasoutlinedbelow.Supposethatwearetryingtoestimatetheprobabilityofndingasysteminstatex.Ifweknowthatonlyndiscretestatesarepossible,thenwealreadyknowthesomeinformationaboutthesystem.Thisinformationisexpressedbyp0i=1=n8i.Ifweobtainmoreinformationintheformoftheinequalitygivenin 2{7 ,thenthecorrectestimateoftheprobabilityofthesystembeinginstateiisgivenbyminimizing:D(pjjp0)=pilnpi 16 ],regionalplanning(Wilson,1970)[ 55 ],investmentportfoliooptimization(Kapuretal.,1989)[ 29 ],imagereconstruction(Burchetal.,1984)[ 21 ],andpatternrecognition(TouandGonzalez,1974)[ 53 ] RationaleforUsingEntropyOptimization .Datatobeclusteredareusuallyincomplete.Solutionusingthedatashouldincorporateandbeconsistentwithall


relevantdataandmaximallynoncommittalwithregardtounavailabledata.Thesolutionmaybeviewedasaprocedureforextractinginformationfromdata.Theinformationcomesfromtwosources:themeasureddataandtheassumptionabouttheunavailableonesbecauseofdataincompleteness.Makinganassumptionmeansarticiallyaddinginformationwhichmaybetrueorfalse.Maximumentropyimpliesthattheaddedinformationisminimal.Amaximumentropysolutionhastheleastassumptionandismaximallynoncommittal. Inthenextchapterwedevelopanentropyminimizationmethodandapplyittodataclustering.


28 ]. Anotherdicultyinusingunsupervisedmethodsistheneedforinputparameters.Manyalgorithms,especiallytheK-meansandotherhierarchicalmethods[ 26 ]requirethattheinitialnumberofclustersbespecied.Severalauthorshaveproposedmethodsthatautomaticallydeterminethenumberofclustersinthedata[ 22 29 25 ].Thesemethodsusesomeformofclustervaliditymeasureslikevariance,aprioriprobabilitiesandthedierenceofclustercenters.Theobtainedresultsarenotalwaysasexpectedandaredatadependent[ 54 ].Somecriteriafrominformationtheoryhavealsobeenproposed.TheMinimumDescriptiveLength(MDL)criteriaevaluatesthecompromisebetweenthelikelihoodoftheclassicationandthecomplexityofthemodel[ 48 ]. Inthischapter,wedevelopaframeworkforclusteringbylearningfromthestructureofthedata.LearningisaccomplishedbyrandomlyapplyingtheK-meansalgorithmviaentropyminimization(KMEM)multipletimesonthedata.The 12


(KMEM)enablesustoovercometheproblemofknowingthenumberofclustersapriori.MultipleapplicationsoftheKMEMallowustomaintainasimilaritymeasurematrixbetweenpairsofinputpatterns.Anentryaijinthesimilaritymatrixgivestheproportionoftimesinputpatternsiandjareco-locatedinaclusteramongNclusteringsusingKMEM.Usingthissimilaritymatrix,thenaldataclusteringisobtainedbyclusteringasparsegraphofthismatrix. Thecontributionofthisworkistheincorporationofentropyminimizationtoestimateanapproximatenumberofclustersinadatasetbasedonsomethresholdandtheuseofgraphclusteringtorecovertheexpectednumberofclusters. Thischapterisorganizedasfollows:Inthenextsection,weprovidesomebackgroundontheK-meansalgorithm.Abriefdiscussionofentropythatwillbenecessaryindevelopingourmodelispresentedinsection 3.3 .TheproposedK-Meansviaentropyminimizationisoutlinedinsection4.Thegraphclusteringapproachispresentedinsection5.Theresultsofouralgorithmsarediscussedinsection 3.6 .Weconcludebrieyinsection 3.7 33 ]isamethodcommonlyusedtopartitionadatasetintokgroups.IntheK-meansclustering,wearegivenasetofndatapoints(patterns)(x1;:::;xk)inddimensionalspaceRdandanintegerkandtheproblemistodetermineasetofpoints(centers)inRdsoastominimizethesquareofthedistancefromeachdatapointtoitsnearestcenter.Thatisndkcenters(c1;:::;ck)whichminimize: wheretheC0saredisjointandtheirunioncoversthedataset.TheK-meansconsistsofprimarilytwosteps:1)Theassignmentstepwherebasedoninitialkclustercentersofclasses,instances


areassignedtotheclosestclass.2)There-estimationstepwheretheclasscentersarerecalculatedfromtheinstancesassignedtothatclass.Thesestepsarerepeateduntilconvergenceoccurs;thatiswhenthere-estimationstepleadstominimalchangeintheclasscenters.Thealgorithmisoutlinedingure 3{1 K-Meansalgorithm SeveraldistancemetricsliketheManhattanortheEuclideanarecommonlyused.Inthisresearch,weconsidertheEuclideandistancemetric.IssuesthatariseinusingtheK-meansinclude:shapeoftheclusters,choosing,thenumberofclusters,theselectionofinitialclustercenterswhichcouldaectthenalresultsanddegeneracy.Thereareseveralwaystoselecttheinitialclustercenters.Giventhenumberofclustersk,yourandomlyselectkvaluesfromthedataset.(Thisapproachwasusedinouranalysis).Youcouldalsogeneratekseedsastheinitialclustercenters,ormanuallyspecifytheinitialclustercenters.Degeneracyariseswhenthealgorithmis


trappedinalocalminimumtherebyresultinginsomeemptyclusters.Inthispaperweintendtohandlethelastthreesproblemviaentropyoptimization. 15 ].Thisconceptwaslaterextendedthroughthedevelopmentofstatisticalmechanics.Itwasrstintroducedintoinformationtheoryin1948byClaudeShannon[ 45 ].Entropycanbeunderstoodasthedegreeofdisorderofasystem.Itisalsoameasureofuncertaintyaboutapartition[ 45 29 ]. Thephilosophyofentropyminimizationinthepatternrecognitioneldcanbeappliedtoclassication,dataanalysis,anddataminingwhereoneofthetasksistodiscoverpatternsorregularitiesinalargedataset.Theregularitiesofthedatastructurearecharacterizedbysmallentropyvalues,whilerandomnessischaracterizedbylargeentropyvalues[ 29 ].Inthedataminingeld,themostwellknownapplicationofentropyisinformationgainofdecisiontrees.Entropybaseddiscretizationrecursivelypartitionsthevaluesofanumericattributetoahierarchydiscretization.Usingentropyasaninformationmeasure,onecanthenevaluateanattribute'simportancebyexaminingtheinformationtheoreticmeasures[ 29 ]. Usingentropyasaninformationmeasureofthedistributiondataintheclusters,wecandeterminethenumberofclusters.Thisisbecausewecanrepresentdatabelongingtoaclusterasonebin.Thusahistogramofthesebinsrepresentsclusterdistributionofdata.Fromentropytheory,ahistogramofclusterlabelswithlowentropyshowsaclassicationwithhighcondence,whileahistogramwithhighentropyshowsaclassicationwithlowcondence.


(3{2) whereXisarandomvariablewithoutcomes1;2;:::;nandassociatedprobabilitiesp1;p2;:::;pn. Sincepilnpi0for0pi1itfollowsfrom( 5{6 )thatH(X)0,whereH(X)=0ioneofthepiequals1;allothersarethenequaltozero.Hencethenotation0ln0=0.Forcontinuousrandomvariablewithprobabilitydensityfunctionp(x),entropyisdenedas Thisentropymeasuretellsuswhetheroneprobabilitydistributionismoreinformativethantheother.Theminimumentropyprovidesuswithminimumuncertainty,whichisthelimitoftheknowledgewehaveaboutasystemanditsstructure[ 45 ].Indataclassication,forexamplethequestistondminimumentropy[ 45 ].TheproblemofevaluatingaminimalentropyprobabilitydistributionistheglobalminimizationoftheShannonentropymeasuresubjecttothegivenconstraints.ThisproblemisknowntobeNP-hard[ 45 ]. TwopropertiesofminimalentropywhichwillbefundamentalinthedevelopmentofKMEMmodelareconcentrationandgrouping[ 45 ].Groupingimpliesmovingalltheprobabilitymassfromonestatetoanother,thatis,reducethenumberofstates.Thisreductioncandecreaseentropy. (3{4)


(3{5) Clearly, becauseeachsideequalsthecontributiontoH()andH(A)respectivelyduethetocommonelementsofandA.Hence,( 3{4 )followsfrom( 3{5 ).Concentrationimpliesmovingprobabilitymassfromastatewithlowprobabilitytoastatewithhighprobability.Wheneverthismoveoccurs,thesystembecomeslessuniformandthusentropydecreases. (3{6) becauseeachsideequalsthecontributiontoH()andH(A)respectivelyduetothecommonelementsofAandHence,( 3{6 )followsfrom( 3{5 ).


18 ].Thisisbecauseofthepropertyofadditivityofentropy.SupposewehavenoutcomesdenotedbyX=fx1,...,xng,withprobabilityp1;:::;pn.AssumethattheseoutcomescanbeaggregatedintoasmallernumberofsetsC1;:::;CKinsuchawaythateachoutcomeisinonlyonesetCk,wherek=1;:::K.TheprobabilitythatoutcomesareinsetCkis TheentropydecompositiontheoremgivestherelationshipbetweentheentropyH(X)atleveloftheoutcomesasgivenin( 5{6 )andtheentropyH0(X)atthelevelofsets.H0(X)isthebetweengroupentropyandisgivenby: (3{8) Shannonentropy( 5{6 )canthenbewrittenas (3{9) where


A property of this relationship is that $H(X) \ge H_0(X)$ because $p_k$ and $H_k(X)$ are nonnegative. This means that after data grouping, there cannot be more uncertainty (entropy) than there was before grouping.

Suppose that after clustering the data set $X$, we obtain the clusters $\{C_j, j = 1, \ldots, K\}$. By Bayes rule, the posterior probability $P(C_j \mid X)$ is given as

$$P(C_j \mid X) = \frac{P(X \mid C_j)\, P(C_j)}{P(X)} \qquad (3-11)$$

where $P(X \mid C_j)$, given in (3-12), is the likelihood and measures the accuracy in clustering the data, and the prior $P(C_j)$ measures consistency with our background knowledge.

By the Bayes approach, a classified data set is obtained by maximizing the posterior probability (3-11). In addition to three of the problems presented by the K-means which we would like to address (determining the number of clusters, selecting initial cluster centers, and degeneracy), a fourth problem is the choice of the prior distribution to use in (3-11). We address these issues below.

[56]. This is a problem facing everyone and no universal solution has been found. For our application, we define the prior as an exponential distribution in which $p_j = |C_j|/n$ is the prior probability of cluster $j$ and a nonnegative constant weights the a priori knowledge. Henceforth, we call this constant the entropy constant.

[17].
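The entropy decomposition relation and the inequality $H(X) \ge H_0(X)$ discussed earlier in this section can be verified on a small example. This is a hedged sketch: the helper names `entropy` and `decompose`, the probabilities, and the grouping are all illustrative assumptions rather than anything taken from the original work.

```python
import math

def entropy(probs):
    """Shannon entropy with natural logarithm and 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def decompose(probs, groups):
    """Return (between-group entropy, weighted within-group entropy).

    groups is a list of lists of outcome indices; every outcome is
    assumed to belong to exactly one group.
    """
    group_p = [sum(probs[i] for i in g) for g in groups]
    between = entropy(group_p)
    within = sum(
        pk * entropy([probs[i] / pk for i in g])
        for g, pk in zip(groups, group_p)
        if pk > 0.0
    )
    return between, within

p = [0.10, 0.20, 0.30, 0.15, 0.25]   # hypothetical outcome probabilities
groups = [[0, 1], [2], [3, 4]]       # hypothetical aggregation into K = 3 sets

between, within = decompose(p, groups)
total = entropy(p)
print(total, between, within)
```

The total entropy equals the between-group entropy plus the weighted within-group term, so grouping can never increase the entropy.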


The KMEM Model

The K-Means algorithm works well on a data set that has spherical clusters. Since our model (KMEM) is based on the K-means, we make the assumption that each cluster has a Gaussian distribution with mean values $c_j$, $j = 1, \ldots, k$, and constant cluster variance. Thus for any given cluster $C_j$,

$$P(x_i \mid C_j) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - c_j)^2}{2\sigma^2}\right)$$

Taking the natural log and omitting constants, we have

$$\ln P(x_i \mid C_j) = -\frac{(x_i - c_j)^2}{2\sigma^2}$$

Using equations (3-12) and (3-13), the posterior probability (3-11) now becomes equation (3-16), where $E$ is given by equation (3-17). If we now use equation (3-14), equation (3-17) becomes equation (3-19).

Maximizing the posterior probability is equivalent to minimizing (5-20). Also, notice that since the entropy term in (5-20) is nonnegative, equation (5-20) is minimized if entropy is minimized. Therefore (5-20) is the required clustering criterion.

We note that when the entropy constant equals 0, $E$ is identical to the cost function of the K-Means clustering algorithm.
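To make the structure of the criterion concrete, the sketch below evaluates a cost of the general shape described in this section: a K-means squared-error term plus a nonnegative entropy term weighted by the entropy constant. The exact form of $E$ did not survive in the text, so the expression used here (squared error plus `beta` times the entropy of the cluster proportions $p_j = |C_j|/n$) is an assumption, as are the names `kmem_cost` and `beta`.

```python
import math
from collections import Counter

def kmem_cost(points, labels, centers, beta):
    """Assumed KMEM-style cost: K-means squared error + beta * cluster entropy.

    points  : list of coordinate tuples
    labels  : cluster index of each point
    centers : dict mapping cluster index -> center tuple
    beta    : entropy constant (hypothetical name), beta >= 0
    """
    sq_err = sum(
        sum((a - b) ** 2 for a, b in zip(x, centers[l]))
        for x, l in zip(points, labels)
    )
    n = len(points)
    counts = Counter(labels)
    # Entropy of the cluster proportions p_j = |C_j| / n (nonnegative).
    ent = -sum((c / n) * math.log(c / n) for c in counts.values())
    return sq_err + beta * ent

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centers = {0: (0.0, 1.0), 1: (10.0, 1.0)}

cost_kmeans = kmem_cost(points, labels, centers, beta=0.0)  # plain K-means cost
cost_kmem = kmem_cost(points, labels, centers, beta=2.0)
print(cost_kmeans, cost_kmem)
```

With `beta = 0` the value reduces to the ordinary K-means cost, which matches the remark above; a larger `beta` penalizes partitions that spread the data over many clusters.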


The Entropy K-means algorithm (KMEM) is given in Figure 3-2. Multiple runs of KMEM are used to generate the similarity matrix. Once this matrix is generated, the learning phase is complete.

Figure 3-2: Entropy K-means algorithm

This algorithm iteratively reduces the number of clusters, as some empty clusters will vanish.
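The way multiple clustering runs accumulate into a similarity matrix can be sketched as follows. This is an illustration only: it assumes each run is available as a list of cluster labels, and that each pair of patterns placed in the same cluster by a run receives an increment of $1/N$, where $N$ is the number of runs. The function name `similarity_matrix` and the label lists are hypothetical.

```python
def similarity_matrix(partitions):
    """Build a co-association matrix from N clustering runs.

    partitions : list of N label lists, each of length n.
    Entry (i, j) is the fraction of runs in which patterns i and j
    were assigned to the same cluster.
    """
    n_runs = len(partitions)
    n = len(partitions[0])
    m = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0 / n_runs
    return m

# Three hypothetical runs over four patterns.
runs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
M = similarity_matrix(runs)
for row in M:
    print([round(v, 2) for v in row])
```

Pairs that are always co-located get a value near 1, while pairs that never share a cluster stay at 0.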


by eliminating inconsistent edges. An inconsistent edge is an edge whose weight is less than some threshold. Thus a pattern pair whose edge is considered inconsistent is unlikely to be co-located in a cluster. To understand the idea behind the maximum spanning tree, we can consider the minimum spanning tree, which can be found in many texts, for example [3], pages 278 and 520. The minimum spanning tree (MST) is a graph theoretic method which determines the dominant skeletal pattern of points by mapping the shortest path of nearest neighbor connections [40]. Thus, given a set of input patterns $X = x_1, \ldots, x_n$, each with edge weight $d_{i,j}$, the minimum spanning tree is an acyclic connected graph that passes through all input patterns of $X$ with a minimum total edge weight. See section 3.5. The maximum spanning tree, on the other hand, is a spanning tree with a maximum total weight. Since all of the edge weights in the similarity matrix are nonnegative, we can negate these values and then apply the minimum spanning tree algorithm given in Figure 3-4.

Minimum Spanning Tree

Minimum spanning trees are used in solving many real world problems. For example, consider the case of a network with $V$ nodes and $E$ undirected connections between nodes. This can be represented as a connected, undirected graph $G = (V, E)$ containing $V$ vertices and $E$ edges. Now suppose that all the edges are weighted, i.e., for each edge $(u, v) \in E$ we have an associated weight $w(u, v)$. A weight can be used to represent real world quantities such as the cost of a wire or the distance between two nodes in a network. A spanning tree is defined as an acyclic graph that connects all the vertices. A minimum spanning tree is a spanning tree with the minimum weight. Suppose we represent the spanning tree as $T \subseteq E$, which connects all the vertices and whose total length is $w(T)$; then the minimum spanning tree is defined as

$$\min_{T} w(T) = \min_{T} \sum_{(u,v) \in T} w(u, v) \qquad (3-20)$$
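The idea of negating similarity weights and then applying a minimum spanning tree algorithm can be sketched as below. This is an illustrative implementation using a Kruskal-style procedure with a simple union-find structure; the function names, the edge list, and the threshold value are assumptions made for the example and are not taken from the original work.

```python
def kruskal_mst(n, edges):
    """Minimum spanning forest over vertices 0..n-1.

    edges : list of (weight, u, v) tuples.
    Returns the list of chosen edges.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):          # non-decreasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                       # edge joins two different trees
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

def maximum_spanning_tree(n, edges):
    """Negate the weights, run the MST routine, and restore the signs."""
    negated = [(-w, u, v) for w, u, v in edges]
    return [(-w, u, v) for w, u, v in kruskal_mst(n, negated)]

def cut_inconsistent(tree, threshold):
    """Drop edges whose similarity weight is below the threshold."""
    return [(w, u, v) for w, u, v in tree if w >= threshold]

# Hypothetical similarity edges between five patterns.
edges = [(0.9, 0, 1), (0.8, 1, 2), (0.1, 2, 3),
         (0.7, 3, 4), (0.2, 0, 2), (0.05, 1, 4)]
tree = maximum_spanning_tree(5, edges)
kept = cut_inconsistent(tree, threshold=0.5)
print(tree)
print(kept)
```

In this example, removing the single weak edge splits the tree into two groups of patterns, {0, 1, 2} and {3, 4}.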


Figure 3-3: Generic MST algorithm.

Generic MST algorithm. The book by Cormen et al. [11] gives a supported analysis of minimum spanning tree algorithms. The MST algorithm falls in the category of greedy algorithms. Greedy algorithms are algorithms that make the best choice at each decision making step. In other words, at every step, greedy algorithms make the locally optimum choice and hope that it leads to a globally optimum solution. The greedy MST algorithm builds the tree step by step, incorporating the edge that causes the minimum increase in the total weight at each step, without adding any cycles to the tree. Suppose there is a connected, undirected graph $G = (V, E)$ with the weight function $w$. While finding the minimum spanning tree for graph $G$, the algorithm manages at each step an edge set $S$ which is some subset of the MST. At each step, an edge $(u, v)$ is added to subset $S$ such that it does not violate the MST property of $S$. This makes $S \cup (u, v)$ a subset of the minimum spanning tree. The edge which is added at each step is termed a "safe edge". The generic algorithm is given in Figure 3-3.

There are two popular algorithms for computing the minimum spanning tree, Prim's algorithm and Kruskal's algorithm (refer to [11]). We used Kruskal's algorithm in our analysis. Its description follows.

Kruskal's algorithm for MST. Kruskal's algorithm is an extension of the generic MST algorithm described in the preceding sub-section. In Kruskal's algorithm the set $S$, which is a subset of the minimum spanning tree, is a forest. At each step, Kruskal's algorithm finds the safe edge to be added as the edge with the minimum weight that connects two forests. Initially, the edges are sorted in the decreasing order


Kruskal MST algorithm (surviving steps of the figure):
4. sort the edges E by non-decreasing weight w
8. UNION(u, v)

input: n d-dimensional patterns, initial number of clusters k, the number of clusterings N, the threshold and the
2. initialize the similarity matrix M to the null n × n matrix and the number of iterations iter = 0
3. apply the KMEM algorithm to produce the partition C
update M: for each input pattern pair (i, j) in C set a(i, j) = a(i, j) + 1/N
if iter

Table 3-1: The number of clusters for different values of the entropy constant
test2 test3 1.0 10 10 13 1.5 6 8 5 5 6 5.5 4

wine data and heart disease data. The results for the synthetic images and iris data are given in 6.1 and 6.2. The KMEM algorithm was run 200 times in order to obtain the similarity matrix and the average number of clusters $k_{ave}$.

3-1. For the image test3, the correct number of clusters was obtained using an entropy constant of 1.5. For the images test1 and test2, a value of 5.5 yielded the correct number of clusters. In Table 3-1, the optimum number of clusters for each synthetic image is shown in bold.


[12, 27] and serves as a benchmark for supervised learning techniques. It consists of three types of Iris plants: Iris Versicolor, Iris Virginica, and Iris Setosa, with 50 instances per class. Each datum is four dimensional and consists of a plant's morphology, namely sepal width, sepal length, petal width, and petal length. One class, Iris Setosa, is well separated from the other two. Our algorithm was able to obtain the three-cluster solution when using entropy constants of 10.5 and 11.0. Two-cluster solutions were also obtained using entropy constants of 14.5, 15.0, 15.5 and 16.0. Table 3-2 shows the results of the clustering.

To evaluate the performance of our algorithm, we determined the percentage of data that were correctly classified for the three-cluster solution. We compared it to the results of direct K-means. Our algorithm had a 91% correct classification while the direct K-means achieved only 68% correct classification; see Table 3-3. Another measure of correct classification is entropy. The entropy of each cluster is calculated from $n_j$, the size of cluster $j$, and $n_{ij}$, the number of patterns from cluster $i$ that were assigned to cluster $j$. The overall entropy of the clustering is the sum of the weighted entropy of each cluster and is given by (3-22), where $n$ is the number of input patterns. The entropy is given in Table 3-3. The lower the entropy, the higher the cluster quality.
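The entropy-based quality measure can be illustrated with a short sketch. Since the original formulas are not reproduced here, the code assumes the commonly used form: the entropy of cluster $j$ is $-\sum_i (n_{ij}/n_j) \ln(n_{ij}/n_j)$, and the overall entropy is the sum of the cluster entropies weighted by $n_j/n$. The function name `clustering_entropy`, the logarithm base, and the label lists are assumptions for illustration only.

```python
import math
from collections import Counter

def clustering_entropy(true_labels, cluster_labels):
    """Weighted sum of per-cluster class entropies (assumed form, natural log)."""
    n = len(true_labels)
    total = 0.0
    for j in set(cluster_labels):
        # Class labels of the patterns that were assigned to cluster j.
        members = [t for t, c in zip(true_labels, cluster_labels) if c == j]
        n_j = len(members)
        e_j = -sum(
            (n_ij / n_j) * math.log(n_ij / n_j)
            for n_ij in Counter(members).values()
        )
        total += (n_j / n) * e_j
    return total

truth = ["a", "a", "b", "b"]
perfect = clustering_entropy(truth, [0, 0, 1, 1])   # each cluster is pure
merged = clustering_entropy(truth, [0, 0, 0, 0])    # both classes in one cluster
print(perfect, merged)
```

A clustering whose clusters each contain a single class has entropy 0, and mixing classes within a cluster raises the value, which is why lower entropy indicates higher cluster quality.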


Table 3-2: The number of clusters as a function of the entropy constant for the iris data

entropy constant     11.0   14.5   15.0   15.5   16
number of clusters   3      2      2      2      2

Table 3-3: Percentage of correct classification of iris data

3.0 2.0 2.0 2.0 2.0
% 90 91 69 68 68 68
0.27 1.33 1.30 1.28 1.31

We also determined the effect of the entropy constant and the different cluster sizes on the average value of k obtained. The results are given in Tables 3-4, 3-5 and 3-6. The tables show that for a given entropy constant and different k values the average number of clusters converges.

Table 3-4: The average number of clusters for various k using a fixed entropy constant of 2.5 for the iris data

k        15      20      30      50
k_ave    14.24   18.73   27.14   42.28


Table 3-5: The average number of clusters for various k using a fixed entropy constant of 5.0 for the iris data

k        15     20     30     50
k_ave    7.10   7.92   9.16   10.81

Table 3-6: The average number of clusters for various k using a fixed entropy constant of 10.5 for the iris data

k        15     20     30     50
k_ave    3.34   3.36   3.34   3.29


[4]. Equally important in data mining is the effective interpretability of the results. Data sets to be clustered usually have several attributes and the domain of each attribute can be very large. Therefore results obtained in these high dimensions are very difficult to interpret. High dimensionality poses two challenges for unsupervised learning algorithms. First, the presence of irrelevant and noisy features can mislead the clustering algorithm. Second, in high dimensions data may be sparse (the curse of dimensionality), making it difficult for an algorithm to find any structure in the data. To ameliorate these problems, two basic approaches to reducing the dimensionality have been investigated: feature subset selection (Agrawal et al., 1998; Dy and Brodley, 2000) and feature transformations, which project high dimensional data onto "interesting" subspaces (Fukunaga, 1990; Chakrabarti et al., 2002). For example, principal component analysis (PCA) chooses the projection that best preserves the variance of the data. It therefore is important to have clusters represented in lower dimensions in order to allow effective use of visual techniques and better result interpretation. In this chapter, we address dimensionality reduction using entropy minimization.

[41], we have shown that entropy is a good measure of the quality of clustering. We therefore propose an entropy method to handle the problem of dimension reduction. As with any clustering algorithm, certain requirements such as sensitivity to outliers, shape of the cluster, and efficiency play vital roles in how well the algorithm performs. In the following, we outline the different criteria necessary to handle these problems via entropy.

(4-1)

When the data points are uniformly distributed, we are most uncertain where a particular point would lie. In this case entropy is highest. When the data points are closely packed in a small cluster, we know that a particular point is likely to fall within the small area of the cluster, and so the entropy will be low. The size of the partition when each dimension is divided should be carefully selected. If the interval is too small, there will be many cells, making the average number of points in each very small; similarly, if the interval size is too large, it may be difficult to capture the differences in density in different regions of the space. Selecting at least 30 points in each cell is recommended. We follow the approach outlined by Chen et al. [10].

1. Find the reduced dimensions with good clustering by the entropy method.
2. Identify clusters in the dimensions found.

To identify good clustering, we set a threshold. A reduced dimension has good clustering if its entropy is below the threshold. The proposed approach is bottom-up. It starts by finding large one-dimensional spaces with good clustering; these are then used to generate candidate 2-dimensional spaces, which are checked against the data set to determine if they have good clustering. The process is repeated with increasing dimensionality until no more spaces with good clustering are found. The algorithm is given in Figure 4-1.
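The grid-based entropy test for a candidate subspace can be sketched as follows. This is a hedged illustration: it assumes the entropy of a subspace is computed from the fraction of points falling in each grid cell, and the function name `grid_entropy`, the interval width, and the synthetic points are assumptions chosen only to show the contrast between a clustered and a uniform dimension.

```python
import math
from collections import Counter

def grid_entropy(points, dims, interval):
    """Entropy of the cell-occupancy distribution in the subspace `dims`.

    Each selected dimension is cut into intervals of width `interval`;
    a cell is identified by the tuple of interval indices.
    """
    cells = Counter(
        tuple(int(p[d] // interval) for d in dims) for p in points
    )
    n = len(points)
    return -sum((c / n) * math.log(c / n) for c in cells.values())

# Synthetic data: dimension 0 is tightly clustered around two values,
# dimension 1 is spread evenly over [0, 1).
points = [(0.05 if i % 2 else 0.95, (i + 0.5) / 100) for i in range(100)]

h_dim0 = grid_entropy(points, dims=(0,), interval=0.1)
h_dim1 = grid_entropy(points, dims=(1,), interval=0.1)
h_both = grid_entropy(points, dims=(0, 1), interval=0.1)
print(h_dim0, h_dim1, h_both)
```

With a threshold between the two one-dimensional values, only dimension 0 would be kept as a space with good clustering and used to generate higher-dimensional candidates.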


2. Let $C_k$ be the one-dimensional spaces
3. For each space $c \in C_k$ do
6. if H(c)

We consider the problem of path planning for a single agent. The basic formulation of the problem is to have the agent move from an initial dynamic state to a moving target. In the case of a stationary target, several methods have been proposed by other researchers [46, 52]. Here, we assume that the velocity of the vehicle is fixed and that it is higher than the maximum velocity of the target. In each fixed-length time interval (even if it is not constant, the proposed methods can work), the agents can obtain information about the positions of the targets at that time. There will be several modes depending on the prediction of the targets' move.

1. The prediction of the targets' move is unknown.
2. The prediction of the targets' move is given by a probability distribution.

We also consider the case of planning a path for an agent such that its likelihood of being co-located with a target at some time in its trajectory is maximized. We assume that the agent operates in a receding-horizon optimization framework, with some fixed planning horizon and reasonable re-planning. When the future location of the target is expressed stochastically, we derive a condition under which the planning horizon of the agent can be bounded from above without sacrificing performance.

We assume that the agent employs a receding horizon approach, where the optimization considers the likely position of the target up to some fixed time in the future and where the optimization is repeated at every time step.


where $M_t(i, j)$ denotes the $(i, j)$th element of the matrix $M(t)$. We assume that there exists a stationary mapping of the elements of $M(t)$ to the state space $X$ for all $t$, and we will use the notation that the probability that the target is at state $y \in X$ at time $t = t_0$ is $M_{t_0}(y)$. The receding horizon optimization problem is to find a path $p$ of fixed length $T$, such that the likelihood of being co-located with the target in at least one state on the path is maximized. One way of estimating this is the following cost function:

(5-1)

In the following sections, we examine the different solution methods we propose to solve this problem.


In this form the mutual information can be interpreted as the information contained in one process minus the information contained in the process when the other process is known [20]. Recall that the entropy of a discrete distribution over some set $X$ is defined as

$$H(S) = -\sum_{x \in X} P(x) \log P(x) \qquad (5-3)$$

Given the value $v$ of a certain variable $V$, the entropy of a system $S$ defined on $X$ is given by

$$H(S \mid V = v) = -\sum_{x \in X} P(x \mid v) \log P(x \mid v) \qquad (5-4)$$

Now suppose that we gain new information $V$ about the system in the form of a distribution over all possible values of $V$. We can define the conditional entropy of the system $S$ as

$$H(S \mid V) = \sum_{v} P(v)\, H(S \mid V = v) \qquad (5-5)$$

Thus, for some system and some new knowledge $V$, the information gain according to equation (5-2) is

$$I(S; V) = H(S) - H(S \mid V) \qquad (5-6)$$

Incorporating information gain in our path planning, a strategy that leads to the maximum decrease in conditional entropy, or maximum information gain, is desirable. This strategy will move us to a state of low entropy as quickly as possible, given our current knowledge and representation of the system. We can therefore use an entropy measure in which a state with low entropy corresponds to the solution of the path. Our proposed entropy method is used to build an initial path. A local search method is then used to improve the path. The results generated by our simulator are then used on the different modes outlined below and on the cost function given in (5-14). Several other methods and cost functions are also discussed.

(explanation and a figure) Indeed, the length of the shortest path between the current position of the vehicle and the next position (in time $t_0$) of the target is less than the sum of $S_{k t_0}$ and $L_k$, which is the length of the trajectory the target moves in the time interval $[k t_0, (k+1) t_0]$. Moreover, the following inequality holds: $L_k \le v_m t_0$.


Along the shortest path, the vehicle moves a distance of $u t_0$ in the time interval $[k t_0, (k+1) t_0]$.

Since in each time interval the length of the shortest path between the vehicle and the target is reduced by at least a fixed amount of $(u - v_m) t_0$, after time $n_0 t_0$, where

Which completes the proof.


thattheoptimaltrajectoryfortheagentisAkAk+1(optimalpositionfortheagentattime(k+1)t0isAk+1).Withoutlossofgenerality,theoptimalpositionfortheagentisA0k+1whichisapointinsideofw2attime(k+1)t0.Considerapolarcoordinatesystem(;).Leth(;)bethedistancebetweenpoint(;)andAk+1.ThentheexpectationofthedistancebetweenAk+1andthenextpositionofthetargetattime(k+1)t0is wheref(;)isthedensityfunctionofthenextpositionofthetargetattime(k+1)t0. Targetandagentboundaries Consideranewpolarcoordinatesystem(0;0)fortheangle,0=0=:


andA0k+1,theexpectationis sincethefunctionf(;)hasrotationsymmetry.Ifwecountthefactthath0(0;0)h(;)when0=and0=; minlXi=1pikxxik


minTXt=0ltXi=1kxtxitkpith(t)!gt wherext=(xt;yt),t=0;1;2;:::;T,constitutetheagent'spath,anduisthevelocityoftheagentandh(t)isasolutiontoProblem( 5{11 ).Considerthefollowingexpression.ltXi=1kxtxitkpith(t) Thisgiveustheerroramountwhenthenextinformationisreceivedatt.ThenobjectivefunctionofProblem( 5{12 )istheexpectederroroftheagentregardingthetimewheninformationisreceived. 5{12 )isconvex. Proof.Considerthefunctionf(x1;x2)=kxyk.Thisisaconvexfunction.Thusf(x1+(1)y1;x2+(1)y2)f(x1;x2)+(1)f(y1;y2)u4+(1)u4=uu4Sincetheproblemisconvex,itcanbesolvedusingoneoftheexistinggradientmethods.


consecutive two information is short. For long time and short distance, those modes are not very efficient. From now on, we discuss the problem of finding an optimal path $p$, of fixed length $T$, for the agent such that the likelihood of being co-located with the target in at least one point on the path is maximized. For continuous time the problem is very expensive to solve. Thus, we make the following assumption. The vehicle and the target move among a finite set of cells in discrete time. At the beginning of each time period the agent and the target can move only to adjacent cells or can stay in the same cells they occupied in the previous time period. Moreover, when the agent and the target are in the same cell, the agent can detect the target with probability 1. We are looking for a path such that the probability of detecting the target in a fixed number of time periods, say $T$, is maximized.

Figure 5-2: Region

The probability that the agent and the target are co-located in at least one point on the agent's $T$-length path is

$$J(x) = 1 - P\left(\bigcap_{t=0}^{T} \{I_t^x = 0\}\right)$$


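where $P\left(\bigcap_{t=0}^{T} \{I_t^x = 0\}\right)$ can be written as follows:

$$P\left(\bigcap_{t=0}^{T} \{I_t^x = 0\}\right) = \prod_{t=0}^{T} P\left(I_t^x = 0 \,\middle|\, \bigcap_{j=0}^{t-1} \{I_j^x = 0\}\right).$$

$$\max_x J(x) \qquad (5-15)$$

or

$$\min_x \prod_{t=0}^{T} P\left(I_t^x = 0 \,\middle|\, \bigcap_{j=0}^{t-1} \{I_j^x = 0\}\right).$$

$$\min_x \sum_{t=0}^{T} \ln\left(1 - P(I_t^x = 1)\right). \qquad (5-16)$$

However, the consequent error of using the assumption depends on the model of the target's motion. In some cases, it can give us the optimal path. The last optimization problem is much easier than (5-15). We will discuss this problem in a later section.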


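the beginning of a time period, then the agent can detect the target with probability $q_j$. If they are in different cells, the agent cannot detect the target during the current time period.

Let us introduce the following indicator random variables $D_i^x$, $i = 0, 1, \ldots, T$, for path $x$ by

The probability of detecting the target in a fixed time $T$ for a given agent's $T$-length path $x$ is

$$J(x) = 1 - P\left(\bigcap_{t=0}^{T} \{D_t^x = 0\}\right),$$

$$\min_x \sum_{t=0}^{T} \ln\left(1 - P(D_t^x = 1)\right), \qquad (5-18)$$

where $P(D_t^x = 1)$ can be expressed as follows:

$$P(D_t^x = 1) = P(D_t^x = 1 \mid I_t^x = 1) P(I_t^x = 1) + P(D_t^x = 1 \mid I_t^x = 0) P(I_t^x = 0) = P(D_t^x = 1 \mid I_t^x = 1) P(I_t^x = 1) = q_j P(I_t^x = 1).$$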


Here $j = x(t)$.

[13, 14]. They assumed that the target moves according to a Markov chain model.

(5-14) can be written in another form as follows:

The right-hand side of the last equation is extracted using the ....'s identity.

$$P\left(\bigcup_{i=0}^{T} \{I_i^x = 1\}\right) = \sum_{i=0}^{T} P(I_i^x = 1) - \cdots$$

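Using the Markovian property, the above can be simplified as follows:

$$P(I_{i_1}^x = 1, I_{i_2}^x = 1, \ldots, I_{i_k}^x = 1) = \prod_{j=1}^{k} P\left(I_{i_j}^x = 1 \,\middle|\, I_{i_{j-1}}^x = 1\right),$$

(5-16) could give an undesirably large error. For this we can propose another cost function.

Of course, we should assume that $P(I_0^x = 0) = 1$, or that the initial states of the agent and the target are not the same; otherwise there is nothing to solve. Using the fact that

$$P(\bar{A} \mid \bar{B}) = 1 - \frac{(1 - P(B \mid A))\, P(A)}{1 - P(B)} = 1 - \frac{P(A) - P(AB)}{1 - P(B)} = 1 - \frac{P(A) - P(A \mid B) P(B)}{1 - P(B)} \qquad (5-21)$$

(assuming $P(B) \ne 1$), the problem becomes

$$\min_x J(x) = \prod_{i=1}^{T} \left(1 - \frac{P(I_i^x = 1) - P(I_i^x = 1 \mid I_{i-1}^x = 1)\, P(I_{i-1}^x = 1)}{1 - P(I_{i-1}^x = 1)}\right)$$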


or

$$\min_x J(x) = \sum_{i=1}^{T} \ln\left(1 - \frac{P(I_i^x = 1) - P(I_i^x = 1 \mid I_{i-1}^x = 1)\, P(I_{i-1}^x = 1)}{1 - P(I_{i-1}^x = 1)}\right). \qquad (5-22)$$

Definition 5.7. A dynamic programming formulation for a $k$-stage graph problem is obtained by first noticing that every path from the source node to the sink node is the result of a sequence of $k-1$ decisions. The $i$th decision involves determining which vertex in $V_{i+1}$, $0 \le i \le k-2$, is to be on the path. Let $p(i, j)$ be a minimum-cost path from the source node to a vertex $j$ in $V_i$. Let $cost(i, j)$ be the cost of $p(i, j)$. Then the following is true.

$$cost(i, j) = \min_{l \in V_{i-1},\ (l, j) \in E} \left\{ cost(i-1, l) + c(l, j) \right\}$$

(5-14), (5-22) and (??) can be expressed as multistage graph problems.

Let $\{1, 2, \ldots, N\}$ be the cells that represent the map of the region. $V_i$, $i = 1, 2, \ldots, T$, consists of $N$ nodes that represent the $N$ cells of the agent (or target) at time $i$. $V_0$, or the source node, represents the agent's initial position, which is a cell. We have one additional node, the sink node, in $V_{T+1}$. We connect nodes in $V_i$ to the nodes in $V_{i+1}$ to which the agent (or the target) can move in one time step (accessible or adjacent cells). This cost function works not only for a Markov chain but also for the general case. In other words, each edge which connects to a node in $V_i$, $i = 1, 2, \ldots, T$, has the same length. For instance, let $j \in V_i$. Then all of the edges from $V_{i-1}$ to $j \in V_i$ have length $\ln(1 - P_{ij})$. Here $P_{ij}$ is the probability that the target would be in cell $j$ at time $i$. The sink node must be connected to all nodes in $V_T$. We assign a cost of 0 to those edges. For instance, let us consider a 3 × 3 map. Let us assume that the agent can move from one node to only its adjacent nodes (diagonally adjacent nodes are not included) in one time period. Then the multistage graph representation can be shown as in Figure 2.

Figure 5-3: Multistage graph representation 1
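The multistage graph formulation lends itself to a short dynamic programming sketch. The code below assumes a rectangular grid of cells numbered row by row, allows the agent to stay or move to a horizontally or vertically adjacent cell, and assigns every edge entering cell $j$ at stage $i$ the length $\ln(1 - P_{ij})$. The function names, the grid, and the probabilities are illustrative assumptions, and each $P_{ij}$ is assumed to be strictly less than 1 so that the logarithm is defined.

```python
import math

def neighbors(cell, rows, cols):
    """Cells reachable in one time period: stay, or move up/down/left/right."""
    r, c = divmod(cell, cols)
    yield cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield nr * cols + nc

def best_path(start, target_prob, rows, cols):
    """Minimize sum_i ln(1 - P_ij) over admissible agent paths.

    target_prob[i-1][j] is the assumed probability that the target is in
    cell j at time i. Returns (path including the start cell, detection
    probability 1 - exp(total cost) under the independence approximation).
    """
    cost = {start: 0.0}
    back = []
    for probs in target_prob:
        new_cost, pred = {}, {}
        for l, c_l in cost.items():
            for j in neighbors(l, rows, cols):
                c = c_l + math.log(1.0 - probs[j])
                if j not in new_cost or c < new_cost[j]:
                    new_cost[j], pred[j] = c, l
        back.append(pred)
        cost = new_cost
    end = min(cost, key=cost.get)       # cheapest node of the last stage
    path = [end]
    for pred in reversed(back):         # follow predecessors back to the start
        path.append(pred[path[-1]])
    path.reverse()
    return path, 1.0 - math.exp(cost[end])

# Hypothetical 3 x 3 map, agent starts in cell 0, horizon T = 2.
stage1 = [0.0, 0.5, 0.0, 0.2, 0.3, 0.0, 0.0, 0.0, 0.0]
stage2 = [0.0, 0.1, 0.4, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0]
path, p_detect = best_path(0, [stage1, stage2], rows=3, cols=3)
print(path, p_detect)
```

Because every edge into a node has the same length, the recursion only needs the best predecessor cost for each cell at each stage, which keeps the work proportional to the number of cells times the horizon.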


Heretheconstructionofmultistagegraphisthesameas( 5{16 ).Theonlydierenceisthelengthoftheedges.Weshouldassigncostofln1Pis0qPpqPi1s0p Wegiveanexplanationhowtousethemultistagegraphproblemtothisproblem.V0consistsofonenodewhichrepresentstheinitialstateofthetarget.V1consistsofallpossiblecellsofthemapbutweaddonlyadmissibleedgesandassigncostssameaswedidinthesecondorderestimatedcostfunction.Vk,k=2;3;:::;T,consistsofanumberofnodes.Eachofthemrepresentsacoupleofstates,say(p;q).Moreoveredge(p;q)mustbeanadmissibleedgeinsenseofrealpath.Weconnectnode(p;q)inV2tonodepinV1byanedgeandassigntotheedgecostof1P2s0qPpqPs0p


Figure 5-4: Multistage graph representation 2

5.6.1 The Agent is Faster Than Target

Figure 5-5: Region with obstacles


Figure 5-6: Using the 8 azimuths

If the target is in cell $i$ and heads to the North East, then the probability that it will be in cell $j$ and will head to the South East is $P_{(i,NE),(j,SE)}$.


The future battle space will consist of autonomous agents either working alone or cooperatively in an effort to locate and attack enemy targets. An obvious goal is that agents should attack the most valuable targets. Although value is subjective and can change as the battle space changes, one may assume the value of targets can be assessed given a time and situation. Some basic strategies are presented that maximize the probability of attacking the most valuable targets. Performance measures of these strategies are:

1. Probability of attacking the most valuable target.
2. Probability of attacking the jth most valuable target. In particular, probability of attacking the worst target.
3. Mean rank of the attacked target. A rank of 1 is the best of all n targets.
4. Average number of targets examined before a decision to attack is made. A smaller number means the agent is exposed to a hostile environment for less time.

This chapter is organized into eight sections. Section 6.2 discusses a strategy to maximize the probability of attacking the most valuable target. Section 6.3 discusses a strategy that increases the mean value of attacked targets after many missions. The impact of a variable number of targets is discussed in Section 6.4. A novel approach based on learning agents is presented in Section 6.5. This strategy provides some exciting results that encourage further research where entropy or mutual information could be applied. Multiple agents are discussed in Section 6.6. Multiple agents can be used as a pack or separately against a battle space. Lastly, a strategy based on a dynamic threshold is presented in Section 6.7.

6-1, where an agent is to find and attack 1 of n distinct targets that are uniformly distributed across a battle space. The agent carries a single weapon. In random order the agent detects and classifies each target one at a time, and the agent must decide to attack or move on. If the agent moves on, then it cannot return to attack (assume the target conceals itself or moves upon detection). In this situation assume the locations and values of the targets within the battle space are unknown ahead of time. As mentioned above, a goal is to attack the most valuable target. If the agent makes a decision to attack a particular target soon into the mission, then it may not see a more valuable target later. If the agent attacks late in the mission, the agent may have passed over more valuable targets.


Figure 6-1: Racetrack search of battle space with n targets

[58] where the agent will examine the first k, $1 \le k$

records the values (best of first k is 3). After the first two targets pass, the agent is ready to attack the first target with value better than 3, which is 1 in this case. In the situation of n = 5 and k = 2, the probability of attacking the most valuable target using this strategy is 0.433.

If the most valuable target is among the first k targets, then the strategy would result in attacking the nth target, regardless of value. Even if the most valuable target is not among the first k targets, the strategy may result in attacking a target other than the most valuable one. For example, if the third most valuable target is the most valuable among the first k targets and if, after k, the second most valuable target is encountered before the most valuable target, then the second most valuable target will be attacked. Therefore, this brings to mind the questions in the introduction.

By using conditional probability, where X is the position of the second most valuable target, and ensuring 0

i1ni n1ifk

probabilityofthisoccurringis n1: Ageneralizedrelationshipfortheprobabilityofattackingthejthbesttarget,j>1,maybedevelopedfromtheaboverelationshipsandisasfollowsPk(jthbest)=8>><>>:1 n1+1 i1ni n1ni1 n1ifnj

n+k i11 n1+n1Xi=k+11

6-1 shows the results of the Best Target Strategy against 10, 100, and 1000 targets where the values of the targets and even the distribution of the values of the targets are unknown ahead of time. The k-value for each is optimum for the Best Target Strategy in that it maximizes $P_k(\text{best})$. The results in Table 6-1 were verified empirically by simulations with target values from either uniform or normal distributions. Note that even for a large number n of targets with values that are completely unknown beforehand, the probability of attacking the most valuable target remains very favorable. Mean rank appears acceptable. For example, for n = 100 the mean rank indicates that on average missions result in attacking targets within the top 21 in value. The number of targets examined appears relatively large, with about 75 percent of all targets examined before a decision to attack is made.

Table 6-1: Significant results of the basic best target strategy

n      k     P_k(best)   P_k(worst)   Mean Rank   Mean Examinations
10     4     0.398       0.044        3.3         7.98
100    37    0.371       0.004        20.01       74.10
1000   368   0.368       0.0003       185.54      736.20
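The probabilities quoted for the Best Target Strategy can be reproduced with a small sketch. It assumes the classical closed form $P_k(\text{best}) = \frac{k}{n} \sum_{i=k}^{n-1} \frac{1}{i}$ for the rule "observe the first k targets, then attack the first one that beats them, defaulting to the last target", and checks it against exhaustive enumeration for a small n. The function names are illustrative, and the closed form is stated here as an assumption since the original derivation is not legible in this section.

```python
from fractions import Fraction
from itertools import permutations

def p_best_formula(n, k):
    """Assumed closed form: (k / n) * sum_{i=k}^{n-1} 1 / i, for 1 <= k < n."""
    return Fraction(k, n) * sum(Fraction(1, i) for i in range(k, n))

def attacked_rank(order, k):
    """Rank of the attacked target; order lists ranks in encounter order (1 = best)."""
    best_seen = min(order[:k])
    for r in order[k:]:
        if r < best_seen:       # first target better than all of the first k
            return r
    return order[-1]            # nothing better appeared: attack the last target

def p_best_enumerated(n, k):
    """Exact probability of attacking rank 1, by enumerating all orderings."""
    perms = list(permutations(range(1, n + 1)))
    hits = sum(1 for p in perms if attacked_rank(p, k) == 1)
    return Fraction(hits, len(perms))

print(float(p_best_formula(5, 2)))      # n = 5, k = 2
print(float(p_best_enumerated(5, 2)))   # same value by enumeration
print(float(p_best_formula(10, 4)))     # n = 10, k = 4
```

The formula and the enumeration agree for n = 5 and k = 2, and the values for n = 10 and n = 100 round to the entries reported in Table 6-1.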


For example, a target of significant value may appear within the first k. If so, then attack the target immediately. If no target meets or exceeds the threshold value within the first k, then follow the basic Best Target Strategy as outlined above, which is to attack the first target after k whose value exceeds the maximum value found within the first k. Obviously, the higher the threshold value, the smaller the probability that any target in the battle space exceeds the threshold. A threshold set too high simply defaults the strategy to the Best Target Strategy. Too small a threshold may result in attacking relatively lower valued targets compared to those available in the battle space, thereby lowering the probability of attacking the most valuable target. Setting a threshold relies on some knowledge of the target values, which violates the assumption of no knowledge of target values or their distributions. However, use of a relatively high threshold can improve results by increasing the probability of attacking the most valuable target and decreasing the number of targets examined.

Tables 6-2, 6-3 and 6-4 show, respectively, the results of simulations with threshold values set at the upper 1-percent, 2.5-percent and 30-percent tail of uniform distributions of target values. Other simulations using normal distributions had almost identical results. These results indicate a high threshold can increase the probability of attacking the best target and decrease the number of examined targets. However, as shown in Table 6-4, the probability of attacking the best target can decrease considerably with a threshold set too low.

Table 6-2: Empirical results of the best target strategy with threshold at upper 1-percent tail

n     k    P_k(best)   P_k(worst)   Mean Rank   Mean Examinations
10    4    0.435       0.0403       3.11        7.69
100   37   0.515       0.0014       8.41        51.11


Table 6-3: Empirical results of the best target strategy with threshold at upper 2.5-percent tail

n     k    P_k(best)   P_k(worst)   Mean Rank   Mean Examinations
10    4    0.474       0.0355       2.89        7.33
100   37   0.433       0.0004       3.83        34.53

Table 6-4: Empirical results of the best target strategy with threshold at upper 30-percent tail

n     k    P_k(best)   P_k(worst)   Mean Rank   Mean Examinations
10    4    0.387       0.0014       2.15        3.11
100   37   0.036       0            15.26       3.38

(6-7) still applies for mean rank, and rank = 1 is the most valued target. Instead of fixing k at a value that maximizes $P_k(\text{best})$, consider fixing k at a value that minimizes mean rank (or maximizes mean value). The values of k that minimize the mean rank for different values of n were determined empirically and are listed in Table 6-5. To interpret the results in Table 6-5 consider n = 100. The mean rank of 9.6 means that on average, the Mean Value Strategy will attack the 10th best target out of 100 targets. For large n mean rank is minimum at approximately p


say n = 3000, this strategy will result, on average, in attacking at least the 55th best target.

Table 6-5:

n      k    Mean Rank
10     2    2.9
20     4    4.2
50     6    6.7
100    9    9.6
200    13   13.7
500    21   21.9
1000   31   31.2
3000   54   54.3

Of course, a change in k from that used in the Best Target Strategy reduces the probability of attacking the best target. As Figure 6-2 shows, however, the Best Target Strategy results in $P_k(j\text{th best})$ dropping off much faster than for the Mean Value Strategy as j goes to n. The result is that the mean value is typically higher using the Mean Value Strategy.

Figure 6-2: Plots of probabilities of attacking the jth best target for two proposed strategies, n = 100


6-6 and 6-7 are the results of simulations ($10^6$ runs per experiment) using the Best Target Strategy and Mean Value Strategy against targets with values that are uniform in [1, 1000] and normal ($\mu = 1000$, $\sigma = 250$). Note the improvement in mean target values and the reduction of the mean number of target examinations when using the Mean Value Strategy.

Table 6-6: Simulation results of the best target strategy

n     k    P_k(best)   P_k(worst)   Mean Value   Mean Examinations
Uniform Target Values, [1, 1000]
10    4    0.397       0.045        697.4        7.99
100   37   0.357       0.004        795.1        74.6
Normal Target Values, μ = 1000, σ = 250
10    4    0.399       0.045        1178.2       7.98
100   37   0.370       0.004        1358.4       74.1

Table 6-7: Simulation results of the mean target strategy

n     k    P_k(best)   P_k(worst)   Mean Value   Mean Examinations
Uniform Target Values, [1, 1000]
10    2    0.364       0.022        730.4        5.66
100   9    0.214       0.001        902.5        31.4
Normal Target Values, μ = 1000, σ = 250
10    2    0.365       0.022        1202.2       5.66
100   9    0.221       0.001        1420.8       31.2

. Of course other distributions for the number of targets may be appropriate. This section discusses the impacts of n being some random variable.

Consider three possible outcomes for n. Let $n_{expected}$ be the expected number of targets and let $n_{realized}$ be the actual number of targets in the battle space. If $n_{realized} = n_{expected}$ then there is no impact on strategy. If $n_{realized} > n_{expected}$ then the agent would default to attacking target $n_{expected}$ if no acceptable target is found before then. If $n_{realized}$

Table 6-8: Simulation results with number of targets Poisson distributed, mean n

n     k    P_k(best)   P_k(no target)   Mean Value   Mean Examinations
Uniform Target Values, [1, 1000]
10    2    0.348       0.135            651.1        5.33
10    4    0.328       0.265            542.0        7.42
100   9    0.212       .050             875.6        31.1
100   37   0.343       0.202            687.8        72.9
Normal Target Values, μ = 1000, σ = 250
10    2    0.348       0.135            1055.4       5.33
10    4    0.330       0.264            892.9        7.42
100   9    0.218       0.048            1370.2       30.8
100   37   0.356       0.197            1150.8       72.6

Table 6-9: Simulation results with number of targets normally distributed, mean n and standard deviation 0.2n

n     k    P_k(best)   P_k(no target)   Mean Value   Mean Examinations
Uniform Target Values, [1, 1000]
10    2    0.363       0.132            657.2        5.40
10    4    0.361       0.261            552.2        7.47
100   9    0.213       0.057            868.7        30.6
100   37   0.327       0.232            660.5        71.0
Normal Target Values, μ = 1000, σ = 250
10    2    0.365       0.131            1064.8       5.39
10    4    0.362       0.262            902.4        7.47
100   9    0.220       0.055            1359.2       30.3
100   37   0.338       0.227            1105.3       70.6

knowledge of the number of targets encountered by some time into the mission can be used to update the expected value of n. This should result in fewer non-attacks because the number of targets was overestimated. It may increase the mean value of the attacked target because the agent may examine more targets if they exist. Indeed, Table 6-12 shows the simulation results if, at about 90 percent into the mission time, the agent updates the expected number of targets. This simulation assumes exponentially distributed interarrival times, with some mean, for encountering targets. The results show that the mean value of the attacked targets increases by a notable amount.


Table 6-10: Simulation results with number of targets uniformly distributed in [0.5n, 1.5n]

n     k    P_k(best)   P_k(no target)   Mean Value   Mean Examinations
Uniform Target Values, [1, 1000]
10    2    0.349       0.136            650.1        5.29
10    4    0.323       0.271            535.9        7.25
100   9    0.214       0.064            862.0        30.0
100   37   0.306       0.261            631.2        68.7
Normal Target Values, μ = 1000, σ = 250
10    2    0.348       0.136            1054.1       5.30
10    4    0.24        0.271            882.4        7.25
100   9    0.221       0.062            1347.8       29.8
100   37   0.317       0.256            1059.0       68.3

Table 6-11: Performance summary of best target strategy and mean value strategy when n varies; values are in percentage drop compared to when n is fixed

n     P_k(best)   Mean Value   Mean Examinations
Best Target Strategy
10    15.0        22.1         7.6
100   8.9         17.0         5.0
Mean Value Strategy
10    2.9         10.6         5.7
100   0.5         3.7          2.7

If the probability of attacking no target can be reduced, this should increase both the probability of attacking the best target and the mean value of the attacked target. One way to do this is to update n only when n is overestimated. Simply put, allow $n_{expected}$ to be reduced and not increased. Table 6-13 shows that the desired results are obtained when the expected number of targets is allowed to be lowered but not increased. The optimum results are obtained when the update is made at about the 95-percent point into the mission time (in other words, near the end of the mission).


Table 6-12: Simulation results with expected number of targets, n, updated 90 percent into the mission

n     k    P_k(best)   P_k(no target)   Mean Value   Mean Examinations
Uniform Target Values, [1, 1000]
10    2    0.333       0.094            678.0        5.19
10    4    0.334       0.185            595.3        7.43
100   9    0.21        0.031            884.9        29.1
100   37   0.344       0.125            726.8        70.0
Normal Target Values, μ = 1000, σ = 250
10    2    0.334       0.095            1116.8       5.19
10    4    0.334       0.183            1012.9       7.33
100   9    0.218       0.030            1388.6       28.9
100   37   0.356       0.122            1229.0       69.6

Table 6-13: Simulation results with expected number of targets, n, updated near the end of the mission. n may be updated downward, but not upward.

n     k    P_k(best)   P_k(no target)   Mean Value   Mean Examinations
Uniform Target Values, [1, 1000]
10    2    0.332       0.087            680.7        5.18
10    4    0.335       0.166            603.8        7.24
100   9    0.21        0.018            891.1        30.0
100   37   0.340       0.072            751.9        71.4
Normal Target Values, μ = 1000, σ = 250
10    2    0.343       0.085            1132.7       5.10
10    4    0.342       0.157            1048.8       7.05
100   9    0.217       0.016            1403.0       29.9
100   37   0.352       0.063            1287.5       71.2

value of the targets within k. A strategy modification to consider is to use more information about the values of the targets encountered within k. For example, although the values (or their distributions) of the targets are unknown beforehand, the agent learns more about the target-value distribution as examinations are made. Information on the mean and standard deviation of the sample may be useful in the decision making process. Suppose that instead of using the maximum value within k, the agent uses the mean of the sample plus some factor of the standard deviation as a decision threshold. In addition, suppose that after k, as potential targets are


encountered but not attacked, their values are used to update the sample mean and sample standard deviation. In turn, the sample mean and standard deviation can be used to further refine the threshold used in the decision making process. This threshold will be developed shortly based on order statistics.

Using the idea of order statistics and given a uniform distribution in [0, 1] of n variables, it can be shown that the expected value of the maximum value is

$$E[\text{maximum value}] = \frac{n}{n+1}.$$

A potential decision strategy for attacking a target is to set the decision threshold at the value of the expected highest value of the remaining targets. In the example of n targets whose values are uniformly distributed in [0, 1], an agent will decide to attack the ith target if its value reaches the threshold

$$\begin{cases} \dfrac{n+1-i}{n+2-i}, & \text{if } 1 \le i \le n-1, \\ 0, & \text{if } i = n. \end{cases} \qquad (6-8)$$

The noticeable flaw with the above strategy is that the distribution of the targets is unknown. We can use entropy to overcome this flaw. We develop, using the result from (6-8), some threshold. We propose a strategy where the agent will examine the first k, $1 \le k$

$H(p)\,\dfrac{n+1-i}{n+2-i}\,(\max - \min) + \min$, if k

whereH(p)isascalingfactorandmaxandminarecalculatedasfollowsmax=+&;min=&; Theresultsofsimulations(106runsperexperiment)ofthisstrategyaresummarizedinTable 6{14 .WefoundthebestvalueforkisthemeanoftheoptimumvaluesofkfortheBestTargetStrategyandMeanValueStrategy.ReferringbacktoTables 6{6 and 6{7 ,onemaynotemuchimprovementforthemeanvalueoftheattackedtargets(over5-percentimprovementinthecaseofn=100andnormallydistributedtargetvalues).AdrawbackofthisstrategycomparedtotheMeanValueStrategyisthenumberoftargetexaminationsishigher. Table6{14: Simulationresultsofthetargetstrategywithsampling ValueExaminations UniformTargetValues,[1,1000] 1030.3500.023752.77.08 100230.1960.001921.163.9 NormalTargetValues,=1000;=250 1030.3810.0231226.17.08 100230.21001497.542.7 Thisstrategymaybeinterpretedasa\learning-agent"strategysinceastheagentexaminestargetsitlearnsmoreaboutthetarget-valuedistributions.Becausecompleteknowledgeaboutthetargetdistributionsshouldresultinthebestdecisions,webelievemoreresearchiswarrantedforthisstrategy.
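The order-statistic result behind (6-8) can be checked numerically. The sketch below is only an illustration, not the simulation code used for the tables: it assumes the threshold in (6-8) reads $(n+1-i)/(n+2-i)$ for $i < n$ and 0 for $i = n$, and the function names `threshold` and `mean_max_uniform` are hypothetical.

```python
import random


def threshold(i, n):
    """Decision threshold for the i-th of n targets (1-indexed), values in [0, 1].

    Assumed reading of (6-8): (n + 1 - i) / (n + 2 - i) for i < n, and 0 for i = n.
    """
    if i == n:
        return 0.0
    return (n + 1 - i) / (n + 2 - i)


def mean_max_uniform(n, runs=20000, seed=0):
    """Monte Carlo estimate of E[max] for n independent Uniform[0, 1] values."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        total += max(rng.random() for _ in range(n))
    return total / runs


n = 10
estimate = mean_max_uniform(n)
exact = n / (n + 1)  # closed-form expected maximum of n uniform values
print(f"simulated E[max] = {estimate:.4f}, n/(n+1) = {exact:.4f}")
print([round(threshold(i, n), 3) for i in range(1, n + 1)])
```

Under this reading the threshold for the first target equals $n/(n+1)$, the expected maximum of all n targets, and it falls as fewer targets remain, reaching 0 at the last target so that the agent always attacks.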


[...] Figure 6-3, where multiple agents, each carrying one weapon, are to search a battlespace and attack separate targets (although not examined in this paper, more than one agent attacking the same target may be appropriate where, given that agent y attacks target z, the probability of agent y disabling target z, $p_{yz}$, is less than 1). Again, the goal is to attack the most valuable targets. Two search and attack strategies will be considered in the following subsections. The first strategy is that of agents searching together as a pack. A pack may be advantageous because multiple looks at a single target (coordinated sensing) can result in a higher probability of a correct target classification [36]. The second strategy is that of agents searching separately, with and without communication. Agents searching separately obviously results in more ground being covered in a shorter amount of time.

Figure 6-3: [...]

[...] Figure 6-3, a strategy proposed here is that m agents, $m \le n$, examine the first k, $1 \le k < n$, [...] target after k that is more valuable than any of the first k, and the next agent attacks the next target that is more valuable than any of the first k, and so on. k should be set equal to a value that maximizes $P_k(\text{best})$ or maximizes the mean value of the attacked targets. If the most valuable target is within k, then the pack of m agents will attack the last m targets regardless of value. We expect the following outcomes when using multiple agents as a pack

[...]

where $P_k(\text{best})_m$ is the probability of attacking the best target with m agents. To develop the relation for the probability of attacking the most valuable target when using multiple agents as a pack, we begin by examining the probability of attacking the most valuable target when using just two agents, then three, and finally m. This development will lead to a generalized closed-form expression for $P_k(\text{best})_m$.

By using conditional probability, where X is the position of the most valuable target, and ensuring 0 [...]

To complete (6-9), expressions are needed for $P_k(\text{best} \mid X = i)_2$, which follow:

$$\frac{k}{i-1} + \frac{k}{i-1}\cdot\frac{i-1-k}{i-2}, \quad \text{if } k \; [...]$$

The results of simulations ($10^6$ runs per experiment) of this strategy against targets with uniformly distributed target values are summarized in Table 6-15. Simulations against normally distributed target values are not included for compactness; however, the results are similar. Table 6-15 includes statistics concerning the value of attacked targets. The mean value is the overall mean of the targets attacked by all agents. The mean best statistic is the mean of the value of the best target attacked by the pack of agents. Mean examinations is the average number of target examinations made by the pack of agents. The values for k were selected to optimize either $P_k(\text{best})_m$ or mean best. An asterisk next to a value in the table indicates it was the parameter optimized by the selection of k. As expected, the optimal value of k for $P_k(\text{best})_m$ decreased as the number of agents increased; however, the k that maximized mean best remained at or near the value that is optimal for the Mean Value Strategy for a single agent.

Table 6-15: Simulation results of the target strategy with m agents in a pack. Target values are uniform on [0, 1000].

  m     n     k    P_k(best)_m   P_k(worst)_m   Mean Value   Mean Best   Mean Examinations
  2    10     2    0.552         0.067            686.5       808.6*       5.66
  2    10     3    0.562*        0.067            663.6       798.0        7.54
  3    10     2    0.661*        0.134            645.4       843.8        8.64
  4    10     2    0.727*        0.223            608.3       862.4        9.30
  2   100     9    0.345         0.002            881.8       934.3*      45.2
  2   100    30    0.495*        0.009            775.9       883.6       82.0
  3   100     9    0.440         0.005            862.1       950.2*      55.3
  3   100    26    0.573*        0.014            758.0       920.3       86.3
  4   100     9    0.513         0.009            843.4       959.5*      62.9
  4   100    23    0.626*        0.020            744.7       940.2       88.8
  5   100     8    0.550         0.012            832.8       965.9*      65.7
  5   100    21    0.664*        0.025            739.2       953.5       89.8

[...] Figure 6-4, where each agent is assigned to equal partitions of the battlespace.
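The two-agent pack case can be illustrated with a short sketch. It is not the author's simulation code. It assumes the conditional probability for a best target at position $i > k$ is $k/(i-1) + \frac{k}{i-1}\cdot\frac{i-1-k}{i-2}$, that the best target is equally likely to be at any position, and that agents with unused weapons are forced to attack the final targets; the function names are hypothetical.

```python
import random


def closed_form_two_agents(n, k):
    """P_k(best)_2 under the assumed expression, averaged over the position i
    of the best target (positions i <= k contribute zero)."""
    total = 0.0
    for i in range(k + 1, n + 1):
        p = k / (i - 1)  # no earlier target beats the first k
        if i > 2:
            # exactly one earlier target beats the first k
            p += (k / (i - 1)) * ((i - 1 - k) / (i - 2))
        total += p
    return total / n


def simulate_pack(n, k, m=2, runs=100000, seed=1):
    """Monte Carlo estimate of the chance that a pack of m agents attacks the best target."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        values = [rng.random() for _ in range(n)]
        benchmark = max(values[:k])  # best value seen while only examining
        attacked = []
        for pos in range(k, n):
            agents_left = m - len(attacked)
            if agents_left == 0:
                break
            targets_left = n - pos
            # Assumption: attack if better than the benchmark, or if the
            # remaining targets are needed to use up the remaining weapons.
            if values[pos] > benchmark or targets_left <= agents_left:
                attacked.append(values[pos])
        if max(values) in attacked:
            hits += 1
    return hits / runs


exact = closed_form_two_agents(10, 2)
estimate = simulate_pack(10, 2)
print(f"closed form = {exact:.4f}, simulated = {estimate:.4f}")
```

For n = 10 and k = 2 the closed form evaluates to about 0.554, close to the 0.552 reported for m = 2 in Table 6-15.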


Figure 6-4: [...]

Two substrategies are considered: 1) without communications, and 2) with communications.

[...]

$$\frac{k}{n/m}\sum_{i=k}^{n/m-1}\frac{1}{i} \; [...]$$

[...] e rounded to the nearest integer. Table 6-16 shows the results of simulations ($10^6$ runs per experiment) of this strategy against targets with uniformly distributed target values. Note that the mean examinations is based on the total number of targets examined by the m agents.

Table 6-16: Simulation results of the target strategy with m agents on separate missions with no communication between them. Target values are uniform on [0, 1000].

  m     n     k    P_k(best)_m   P_k(worst)_m   Mean Value   Mean Best   Mean Examinations
  2    10     1    0.416         0.051            647.5       792.9*       3.74
  2    10     2    0.431*        0.100            630.6       786.6        4.64
  2   100     9    0.304         0.004            857.2       952.3*      33.0
  2   100    18    0.358*        0.008            781.6       925.7       44.3
  4   100     7    0.355         0.013            795.0       968.6*      23.2
  4   100     9    0.364*        0.016            767.3       966.9       24.1
  5   100     6    0.362         0.017            775.7       970.6*      19.6
  5   100     7    0.367*        0.019            759.9       970.3       19.8

Table 6-17 shows the results of simulations ($10^6$ runs per experiment) of this strategy against targets with uniformly distributed target values.

Table 6-17: Simulation results of the target strategy with m agents on separate missions with communication between them. Target values are uniform on [0, 1000].

  m     n     k    P_k(best)_m   P_k(worst)_m   Mean Value   Mean Best   Mean Examinations
  2    10     1    0.511         0.081            663.8       803.8        4.05
  2   100     4    0.301         0.003            867.7       933.7       24.9
  2   100    15    0.450         0.010            748.4       883.0       43.6
  4   100     2    0.429         0.011            812.7       958.3       18.9
  4   100     6    0.544         0.024            696.4       937.4       24.1
  5   100     2    0.508         0.019            773.4       963.8       17.5
  5   100     4    0.573         0.030            694.9       952.2       19.4
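For the no-communication case of Table 6-16, a rough sketch of the probability of attacking the best target is given below. It rests on assumptions not confirmed by the surviving text: that each agent applies the single-agent strategy to its own partition of n/m targets, that the governing expression is $\frac{k}{n/m}\sum_{i=k}^{n/m-1}\frac{1}{i}$, and that the rounded value of k is $(n/m)/e$. The function names are hypothetical.

```python
import math


def p_best_separate(n, m, k):
    """Assumed probability that the overall best target is attacked when each
    of m agents searches its own partition of n/m targets with no communication."""
    size = n // m  # targets per partition (n assumed divisible by m)
    return (k / size) * sum(1.0 / i for i in range(k, size))


def k_rule(n, m):
    """Assumed rule of thumb: partition size divided by e, rounded to the nearest integer."""
    return max(1, round((n / m) / math.e))


p_small = p_best_separate(10, 2, 1)
print(f"n=10, m=2, k=1: {p_small:.4f}")
print([k_rule(100, m) for m in (2, 4, 5)])
```

For n = 10, m = 2 and k = 1 this gives about 0.417, close to the 0.416 in Table 6-16. The values of k produced by the rounding rule also agree with the asterisked $P_k(\text{best})_m$ rows of that table.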


In the situation just described, agents investigate their independent partitions of the battlespace. Consider a case where, say, agent A in one partition attacks early. This leaves much of A's partition unsearched. If another agent, B, is still active, it may be advantageous to allow B to investigate the remaining part of A's partition. B will search until it finds an appropriate target or until it reaches the end of A's partition. This can only occur if A and B communicate by telling each other when and where they made their attack. Table 6-18 shows the results of simulations against targets with uniform target values. The first part is where agents do not share the maximum target value found within k, and the second part is where agents have full communication, that is, they share both the maximum target value found within k and when and where they attack.

Table 6-18: Simulation results of the target strategy with m agents on separate missions with communication between them. Uncommitted agents are allowed to evaluate targets in other unsearched partitions.

  m     n     k    P_k(best)_m   P_k(worst)_m   Mean Value   Mean Best   Mean Examinations
Not sharing information on maximum target value within k
  2    10     2    0.494         0.086            655.6       799.8        8.77
  2   100     9    0.372         0.003            884.7       957.7       57.0
  2   100    17    0.434         0.006            830.9       937.7       79.5
Sharing information on maximum target value within k
  2    10     2    0.528         0.123            630.9       777.4        9.17
  2   100     4    0.326         0.003            883.3       934.1       42.2
  2   100    14    0.494         0.008            786.9       889.7       79.8


[...] partitions. There appears to be no advantage for agents communicating the maximum target value found within their first respective k evaluations.

When the ratio of agents to targets is higher, communication is beneficial for both $P_k(\text{best})_m$ and mean best; however, when the ratio is lower, there appears to be no advantage in having communications.

[...] [42]. Consider, as before, that the number of targets, n, is known. Unlike earlier assumptions, we now assume we know the maximum and minimum values of the target-value distribution, and we assume the target values are uniformly distributed. A strategy derived by induction that maximizes $P_k(\text{best})$ is to set a threshold T such that the agent attacks a target if its value exceeds T. T is determined as follows

[...]

where max and min are the maximum and minimum target values, respectively, that are possible in the battlespace. To summarize the dynamic threshold strategy, an agent will find and examine targets in succession. The agent will attack the first target whose value exceeds T as defined in (6-13). Although (6-13) is derived based on an assumption of uniform target values, we found that it is very effective with other distributions, such as the normal distribution, where one assumes max and min are, respectively, two standard deviations above and below the mean. The results of simulations ($10^6$ runs per experiment) of the dynamic threshold strategy are summarized in Table 6-19. These results indicate, as expected, that the more an agent knows about the target-value distribution, the better the results.

Table 6-19: Simulation results using the dynamic threshold strategy

   n    P_k(best)   Mean Value
Uniform Target Values, [1, 1000]
  10    0.574         836.5
 100    0.528         962.5
Normal Target Values, $\mu = 1000$; $\sigma = 250$
  10    0.516        1278.5
 100    0.463        1556.1


Several methods for mining data exist; however, most of these methods suffer from several limitations, especially that of specifying the data distribution, which is usually unavailable. Equally important in data mining is the effective interpretability of the results. Data to be clustered often exist in very high dimensions, which most of the existing methods do not handle but instead rely on some data preprocessing. The entropy methods developed in this thesis are well suited for handling these problems. Our methods eliminate the need to make distributional assumptions, which may or may not hold. Dimension reduction for better comprehension of the results is also achieved by the entropy methods developed in this dissertation. We have also successfully applied these entropy methods in best target selection and path planning, which are areas of research interest.




[1] J. Abello, P. M. Pardalos, and M. G. C. Resende (eds.), Handbook of Massive Data Sets, Kluwer Academic Publishers, Norwell, 2002.
[2] S. M. Andrijich and L. Caccetta, Solving the multisensor data association problem, Nonlinear Analysis, 47:5525-5536, 2001.
[3] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows: Theory, Algorithm, and Applications, Prentice Hall, Englewood Cliffs, 1993.
[4] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining, ACM SIGMOD Record, 27(2):94-105, 1998.
[5] J. Bellingham, A. Richards, and J. P. How, Receding Horizon Control of Autonomous Aerial Vehicles, In Proceedings of the American Control Conference, Anchorage, AK, 8-10, 2002.
[6] M. J. A. Berry and G. Linoff, Data Mining Techniques for Marketing, Sales and Customer Support, Wiley, New York, 1997.
[7] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, 1998.
[8] M. Brand, Pattern Discovery Via Entropy Minimization, Uncertainty 99: International Workshop on Artificial Intelligence and Statistics, (AISTAT) TR98-21, 1998.
[9] W. Chaovalitwongse, Optimization and Dynamical Approaches in Nonlinear Time Series Analysis with Applications in Bioengineering, Ph.D. Thesis, University of Florida, 2003.
[10] C. Cheng, A. W. Fu, and Y. Zhang, Entropy-based subspace clustering for mining numerical data, In Proceedings of International Conference on Knowledge Discovery and Data Mining, 84-93, 1999.
[11] T. Cormen, C. Leiserson, and L. Rivest, Introduction to Algorithms, MIT Press, Cambridge, 2001.
[12] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley-Interscience, New York, 1974.


[13] J. N. Eagle, The Optimal Search for a Moving Target When the Search Path is Constrained, Operations Research, 32(5), 1984.
[14] J. N. Eagle and J. R. Yee, An Optimal Branch and Bound Procedure for the Constrained Path, Moving Target Search Problem, Operations Research, 38(1), 1990.
[15] S.-C. Fang, J. R. Rajasekera, and H.-S. J. Tsao, Entropy Optimization and Mathematical Programming, Kluwer Academic Publishers, Norwell, 1997.
[16] S.-C. Fang and H.-S. J. Tsao, Linear Constrained Entropy Maximization Problem with Quadratic and its Application to Transportation Planning Problems, Transportation Science, 29:353-365, 1993.
[17] M. Figueiredo and A. K. Jain, Unsupervised Learning of Finite Mixture Models, IEEE Trans. Pattern Analysis and Machine Intelligence, 24(3):381-396, 2002.
[18] K. Frenken, Entropy Statistics and Information Theory, The Elgar Companion to Neo-Schumpeterian Economics, Cheltenham, UK and Northampton MA: Edward Elgar Publishing (In Press).
[19] W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, Knowledge Discovery in Databases: An overview, in G. Piatetsky-Shapiro and W. Frawley (eds.), Knowledge Discovery in Databases, 1-27, AAAI/MIT Press, 1991.
[20] R. M. Gray, Entropy Theory and Information Theory, Springer-Verlag, New York, 1990.
[21] S. F. Gull, J. Skilling, and J. A. Roberts (ed.), The Entropy of an Image: Indirect Imaging, Cambridge University Press, UK, 267-279, 1984.
[22] D. Hall and G. Ball, ISODATA: A Novel Method of Data Analysis and Pattern Classification, Tech Report, Stanford Research Institute, Menlo Park, 1965.
[23] E. Horowitz, S. Sahni, and S. Rajasekaran, Computer Algorithms, W. H. Freeman, New York, 1998.
[24] V. Isler, S. Kannan, and S. Khanna, Locating and Capturing an Evader in a Polygonal Environment, Technical Report MS-CIS-03-33, Dept. of Computer Science, Univ. of Pennsylvania, 2003.
[25] G. Iyengar and A. Lippman, Clustering Images Using Relative Entropy for Efficient Retrieval, IEEE Computer Magazine, 28(9):23-32, 1995.
[26] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, New Jersey, 1988.
[27] M. James, Classification Algorithms, John Wiley, New York, 1985.


[28] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, An Efficient K-Means Clustering Algorithm: Analysis and Implementation, IEEE Trans. Pattern Analysis and Machine Intelligence, 24(7):881-892, 2002.
[29] J. N. Kapur and H. K. Kesavan, Entropy Optimization Principles with Applications, Academic Press, San Diego, 1992.
[30] T. Kirubarajan, Y. Bar-Shalom, and K. R. Pattipati, Multiassignment for Tracking a Large Number of Overlapping Objects, IEEE Trans. Aerospace and Electronic Systems, 37(1):2-21, 2001.
[31] W. Klosgen and J. M. Zytkow (eds.), Handbook of Data Mining and Knowledge Discovery, Oxford University Press, New York, 2002.
[32] Y. W. Lim and S. U. Lee, On the Color Image Segmentation Algorithm based on Thresholding and Fuzzy C-means Techniques, Pattern Recognition, 23:935-952, 1990.
[33] J. B. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, In Proceedings of the Fifth Symposium on Math, Statistics, and Probability, 281-297, University of California Press, Berkeley, 1967.
[34] D. Miller, A. Rao, K. Rose, and A. Gersho, An Information Theoretic Framework for Optimization with Application to Supervised Learning, IEEE International Symposium on Information Theory, Whistler B.C., Canada, 1995.
[35] B. Mirkin, Nonconvex Optimization and its Applications: Mathematical Classification and Clustering, Kluwer Academic Publishers, Dordrecht, 1996.
[36] R. Murphey and P. Pardalos (eds.), Cooperative Control and Optimization, Kluwer Academic Publishers, Dordrecht, 2002.
[37] R. Murrieta-Cid, A. Sarmiento and S. Hutchinson, On the Existence of a Strategy to Maintain a Moving Target within the Sensing Range of an Observer Reacting with Delay, In Proc IEEE Int. Conf. on Intelligent Robots and Optimal Systems, 2003.
[38] R. Murrieta-Cid, H. Gonzalez-Baños and B. Tovar, A Reactive Motion Planner to Maintain Visibility of Unpredictable Targets, In Proceedings of the IEEE International Conference on Robotics and Automation, 4242-4247, 2002.
[39] R. Murrieta-Cid, A. Sarmiento, S. Bhattacharya and S. Hutchinson, Maintaining Visibility of a Moving Target at a Fixed Distance: The Case of Observer Bounded Speed,
[40] H. Neemuchawala, A. Hero and P. Carson, Image Registration Using Entropic Graph-Matching Criteria,


[41] A. Okafor, M. Ragle and P. M. Pardalos, Data Mining Via Entropy and Graph Clustering, Data Mining in Biomedicine, Springer, New York, (In Press).
[42] H. Pfister, Unpublished work, Air Force Research Laboratory, Munitions Directorate, 2003.
[43] A. B. Poore, N. Rijavec, M. Liggins, and V. Vannicola, Data Association Problem Posed as Multidimensional Assignment Problems: Problem Formulation, Signal and Data Processing of Small Targets, 552-561, SPIE, Bellingham, WA, 1993.
[44] J. Pusztaszeri, P. E. Rensing, and T. M. Liebling, Tracking Elementary Particles Near their Vertex: A Combinatorial Approach, Journal of Global Optimization, 9(1):41-64, 1996.
[45] D. Ren, An Adaptive Nearest Neighbor Classification Algorithm,
[46] A. Richards and J. P. How, Aircraft Trajectory Planning With Collision Avoidance Using Mixed Integer Linear Programming, In Proceedings of the American Control Conference, ACC02-AIAA1057, 2002.
[47] A. Richards, J. P. How, T. Schouwenaars, and E. Feron, Plume Avoidance Maneuver Planning Using Mixed Integer Linear Programming, In Proceedings of the AIAA Guidance, Navigation, and Control Conference, AIAA:2001-4091, 2001.
[48] J. Rissanen, A Universal Prior for Integers and Estimation by Minimum Description Length, Annals of Statistics, 11(2):416-431, 1983.
[49] J. Shore and R. Johnson, Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy, IEEE Transactions on Information Theory, IT-26(1):26-37, 1980.
[50] J. Shore and R. Johnson, Properties of Cross-Entropy Minimization, IEEE Transactions on Information Theory, IT-27(4):472-482, 1981.
[51] J. Shore, Cross-Entropy Minimization Given Fully Decomposable Subset of Aggregate Constraints, IEEE Transactions on Information Theory, IT-28(6):956-961, 1982.
[52] T. Schouwenaars, B. De Moor, E. Feron and J. How, Mixed Integer Programming for Multi-Vehicle Path Planning, In Proceedings of the European Control Conference, 2603-2613, 2001.
[53] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, 1974.
[54] M. M. Trivedi and J. C. Bezdek, Low-Level Segmentation of Aerial Images with Fuzzy Clustering, IEEE Trans. Syst. Man, Cybern., SMC-16:589-598, 1986.


[55] A. G. Wilson, Entropy in Urban and Regional Planning, Pion, London, 1970.
[56] N. Wu, The Maximum Entropy Method, Springer, New York, 1997.
[57] S. M. Ross, Stochastic Processes, 2nd ed., Wiley, New Jersey, 1996.
[58] S. M. Ross, Introduction to Probability Models, 7th ed., Academic Press, San Diego, 2000.


Anthony Okafor was born in Awgu, Nigeria. He received a bachelor's degree in mechanical engineering from the University of Nigeria, Nsukka. He later immigrated to the United States, where he earned a master's degree in mathematical sciences from the University of West Florida, Pensacola, and a master's degree in industrial and systems engineering (ISE) from the University of Florida. He is a Ph.D. student in the ISE department, University of Florida. His research interests include mathematical programming, entropy optimization and operations research. As for hobbies, Anthony loves table tennis, badminton and tennis. He is also actively involved with home automation and enjoys mentoring.