Smoothing functional data for cluster analysis

Material Information

Title:
Smoothing functional data for cluster analysis
Creator:
Hitchcock, David B
Publication Date:
Language:
English
Physical Description:
x, 106 leaves : ill. ; 29 cm.

Subjects

Subjects / Keywords:
Cluster analysis ( jstor )
Data smoothing ( jstor )
Datasets ( jstor )
Estimators ( jstor )
Information retrieval noise ( jstor )
Libraries ( jstor )
Objective functions ( jstor )
Signals ( jstor )
Simulations ( jstor )
Statistics ( jstor )
Dissertations, Academic -- Statistics -- UF
Statistics thesis, Ph. D
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 2004.
Bibliography:
Includes bibliographical references.
General Note:
Printout.
General Note:
Vita.
Statement of Responsibility:
by David B. Hitchcock.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright David B. Hitchcock. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
022823428 ( ALEPH )
886931224 ( OCLC )









SMOOTHING FUNCTIONAL DATA
FOR CLUSTER ANALYSIS














By
DAVID B. HITCHCOCK














A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2004






























Copyright 2004

by

David B. Hitchcock

































To Cassandra













ACKNOWLEDGMENTS

It has been a long but rewarding journey during my time as a student, and I

have many people I need to thank.

First of all, I thank God for my having more blessings in my life than I

possibly deserve. Next, I thank my family: the support and love of my parents and

sister, and the pride they have shown in me, have truly inspired me and kept me

determined to succeed in my studies. And I especially thank my wife Sandi, who

has shown so much confidence in me and has given me great encouragement with

her love and patience.

I owe a great deal of thanks to my Ph.D. advisors, George Casella and Jim

Booth. In addition to being great people, they have, through their fine examples,

taught me much about the research process: the importance of knowing the

literature well, of looking at simple cases to understand difficult concepts, and of

writing results carefully. I also thank my other committee members, Jim Hobert,

Brett Presnell, and John Henretta, accomplished professors who have selflessly

given their time for me. All the teachers I have had at Florida deserve my thanks,

especially Alan Agresti, Andy Rosalsky, Denny Wackerly, and Malay Ghosh.

I could not have persevered without the support of my fellow students here

at Florida. I especially thank Bernhard and Karabi, who entered the graduate

program with me and who have shared my progress. I also thank Carsten, Terry,

Jeff, Siuli, Sounak, Jamie, Dobrin, Samiran, Damaris, Christian, Ludwig, Brian,

Keith, and many others for all their help and friendship.



















TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTERS

1 INTRODUCTION TO CLUSTER ANALYSIS
1.1 The Objective Function
1.2 Measures of Dissimilarity
1.3 Hierarchical Methods
1.4 Partitioning Methods
1.4.1 K-means Clustering
1.4.2 K-medoids and Robust Clustering
1.5 Stochastic Methods
1.6 Role of the Dissimilarity Matrix

2 INTRODUCTION TO FUNCTIONAL DATA AND SMOOTHING
2.1 Functional Data
2.2 Introduction to Smoothing
2.3 Dissimilarities Between Curves
2.4 Previous Work
2.5 Summary

3 CASE I: DATA FOLLOWING THE DISCRETE NOISE MODEL
3.1 Comparing MSEs of Dissimilarity Estimators when $\theta$ is in the Linear Subspace Defined by $S$
3.2 A James-Stein Shrinkage Adjustment to the Smoother
3.3 Extension to the Case of Unknown $\sigma^2$
3.4 A Bayes Result: $d_{ij}^{(\text{smooth})}$ and a Limit of Bayes Estimators

4 CASE II: DATA FOLLOWING THE FUNCTIONAL NOISE MODEL
4.1 Comparing MSEs of Dissimilarity Estimators when $\theta$ is in the Linear Subspace Defined by $S$
4.2 James-Stein Shrinkage Estimation in the Functional Noise Model
4.3 Extension to the Case when $\sigma^2$ is Unknown
4.4 A Pure Functional Analytic Approach to Smoothing

5 A PREDICTIVE LIKELIHOOD OBJECTIVE FUNCTION FOR CLUSTERING FUNCTIONAL DATA

6 SIMULATIONS
6.1 Setup of Simulation Study
6.2 Smoothing the Data
6.3 Simulation Results
6.4 Additional Simulation Results

7 ANALYSIS OF REAL FUNCTIONAL DATA
7.1 Analysis of Expression Ratios of Yeast Genes
7.2 Analysis of Research Libraries

8 CONCLUSIONS AND FUTURE RESEARCH

APPENDIX

A DERIVATIONS AND PROOFS
A.1 Proof of Conditions for Smoothed-data Estimator Superiority
A.2 Extension of Stochastic Domination Result
A.3 Definition and Derivation of ... and ...

B ADDITIONAL FORMULAS AND CONDITIONS
B.1 Formulas for Roots of $\Delta_U(a) = 0$ for General $n$ and $k$
B.2 Regularity Condition for Positive Semidefinite $G$ in Laplace Approximation

REFERENCES

BIOGRAPHICAL SKETCH













LIST OF TABLES

1-1 Agricultural data for European countries.
3-1 Table of choices of a for various n and k.
6-1 Clustering the observed data and clustering the smoothed data (independent error structure, n = 200).
6-2 Clustering the observed data and clustering the smoothed data (O-U error structure, n = 200).
6-3 Clustering the observed data and clustering the smoothed data (independent error structure, n = 30).
6-4 Clustering the observed data and clustering the smoothed data (O-U error structure, n = 30).
7-1 The classification of the 78 yeast genes into clusters, for both observed data and smoothed data.
7-2 A 4-cluster K-medoids clustering of the smooths for the library data.
7-3 A 4-cluster K-medoids clustering of the observed library data.













LIST OF FIGURES

1-1 A scatter plot of the agricultural data.
1-2 Proportion of pairs of objects correctly grouped vs. MSE of dissimilarities.
3-1 Plot of $\Delta_U$ against a for varying n and for k = 5.
3-2 Plot of simulated $\Delta$ and $\Delta_U$ against a for n = 20, k = 5.
4-1 Plot of asymptotic upper bound and simulated $\Delta$'s, for Ornstein-Uhlenbeck-type data.
6-1 Plot of signal curves chosen for simulations.
6-2 Proportion of pairs of objects correctly matched, plotted against a (n = 200).
6-3 Proportion of pairs of objects correctly matched, plotted against a (n = 30).
6-4 Proportion of pairs of objects correctly matched, plotted against a, when the number of clusters is misspecified.
6-5 Proportion of pairs of objects correctly matched, plotted against a (n = 30).
7-1 Plots of clusters of genes.
7-2 Edited plots of clusters of genes which were classified differently as observed curves and smoothed curves.
7-3 Plots of clusters of libraries.
7-4 Mean curves for the four library clusters given on the same plot.
7-5 Measurements and B-spline smooth, University of Arizona library.












Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

SMOOTHING FUNCTIONAL DATA
FOR CLUSTER ANALYSIS

By

David B. Hitchcock

August 2004

Chair: George Casella
Cochair: James G. Booth
Major Department: Statistics

Cluster analysis, which attempts to place objects into reasonable groups on

the basis of statistical data measured on them, is an important exploratory tool

for many scientific studies. In particular, we explore the problem of clustering

functional data, which arise as curves, characteristically observed as part of

a continuous process. In recent years, methods for smoothing and clustering

functional data have appeared in the statistical literature, but little work has

appeared specifically addressing the effect of smoothing on the cluster analysis.

We discuss the purpose of cluster analysis and review some common clustering

methods, with attention given to both deterministic and stochastic methods.

We address functional data and the related field of smoothing, and a measure of

dissimilarity for functional data is suggested.

We examine the effect of smoothing functional data on estimating the dis-

similarities among objects and on clustering those objects. We prove that a

shrinkage method of smoothing results in a better estimator of the dissimilarities

among a set of noisy curves. For a model having independent noise structure,









the smoothed-data dissimilarity estimator dominates the observed-data estima-

tor. For a dependent-error model, an asymptotic domination result is given for

the smoothed-data estimator. We propose an objective function to measure the

goodness of a clustering for smoothed functional data.

Simulations give strong empirical evidence that smoothing functional data

before clustering results in a more accurate grouping than clustering the observed

data without smoothing. Two examples, involving functional data on yeast gene

expression levels and research library "growth curves," illustrate the technique.


















































CHAPTER 1
INTRODUCTION TO CLUSTER ANALYSIS
The goal of cluster analysis is to find groups, or clusters, in data. The objects

in a data set (often univariate or multivariate observations) should be grouped so

that objects in the same cluster are similar and objects in different clusters are

dissimilar (Kaufman and Rousseeuw, 1990, p. 1). How to measure the similarity

of objects is something that depends on the application, yet is a fundamental issue

in cluster analysis. Sometimes in a multivariate data set it is not the observations

that are clustered, but rather the variables (according to some similarity measure

on the variables) and this case is dealt with slightly differently (Johnson and

Wichern, 1998, p. 735). More often, though, it is the objects that are clustered

according to their observed values of one or more variables, and this introduction

will chiefly focus on this situation.

The general clustering setup for multivariate data is as follows: In a data set there are $N$ objects on which are measured $p$ variables. Hence we represent this by $N$ vectors $y_1, \ldots, y_N$ in $\mathbb{R}^p$. We wish to group the $N$ objects into $K$ clusters, $1 \le K \le N$. Denote the possible clusterings of $N$ objects into nonempty groups as $\mathcal{C} = \{c_1, \ldots, c_{B(N)}\}$. The number of possible clusterings $B(N)$ depends on the number of objects $N$ and is known as the Bell number (Sloane and Plouffe, 1995, entry M4981).

As a simple example, consider the data in Table 1-1. We wish to group the

12 objects (the European countries) based on the values of two variables, gross

national product (gnp) and percent of gnp due to agriculture (agric). When the

data are univariate or two-dimensional and N is not too large, it is often easy to




Table 1-1: Agricultural data for European countries.

country agric gnp
Belgium 2.7 16.8
Denmark 5.7 21.3
Germany 3.5 18.7
Greece 22.2 5.9
Spain 10.9 11.4
France 6.0 17.8
Ireland 14.0 10.9
Italy 8.5 16.6
Luxembourg 3.5 21.0
Netherlands 4.3 16.4
Portugal 17.4 7.8
United Kingdom 2.3 14.0


construct a scatter plot and determine the clusters by eye (see Figure 1-1). For

higher dimensions, however, automated clustering methods become necessary.

A statistical field closely related to cluster analysis is discriminant analysis,

which also attempts to classify objects into groups. The main difference is that

in discriminant analysis there exists a training sample of objects whose group

memberships are known, and the goal is to use characteristics of the training

sample to devise a rule which classifies future objects into the prespecified groups.

In cluster analysis, however, the clusters are unknown, in form and often in

number. Thus cluster analysis is more exploratory in nature, whereas discriminant

analysis allows more precise statements about the probability of making inferential

errors (Gnanadesikan et al., 1989).

In contrast with discriminant analysis, where the number of groups and the

groups' definitions are known, cluster analysis presents two separate questions:

How many groups are there? And which objects should be allocated to which

groups, i.e., how should the objects be partitioned into groups? It is partially

because of the added difficulty of answering both of these questions that the field







Figure 1-1: A scatter plot of the agricultural data. (Horizontal axis: % of gnp due to agriculture.)






of cluster analysis has not built such an extensive and thorough theory as has

discriminant analysis (Gnanadesikan et al., 1989).

1.1 The Objective Function

Naturally, it is desirable that a clustering algorithm have some optimal

property. We would like a mathematical criterion to measure how well-grouped the

data are at any point in the algorithm. A convenient way to define such a criterion is via an objective function, a real-valued function of the possible partitions of the objects. Mathematically, if $\mathcal{C} = \{c_1, \ldots, c_{B(N)}\}$ represents the space of all possible partitions of $N$ objects, a typical objective function is a mapping $g : \mathcal{C} \to \mathbb{R}^+$.

Ideally, a good objective function $g$ will increase (or decrease, depending on the formulation of $g$) monotonically as the partitions group more similar objects in the same cluster and more dissimilar objects in different clusters. Given a good objective function $g$, the ideal algorithm would optimize $g$, resulting in the best possible partition.

When $N$ is very small, we might enumerate the possible partitions $c_1, \ldots, c_{B(N)}$, calculate the objective function for each, and choose the $c_i$ with the optimal $g(c_i)$. However, $B(N)$ grows rapidly with $N$. For example, for the European agriculture data, B(12) = 4,213,597, while B(19) = 5,832,742,205,057 (Sloane and Plouffe, 1995, entry M1484). For moderate to large $N$, this enumeration is infeasible (Johnson and Wichern, 1998, p. 727).

Since full enumeration is usually impossible, clustering methods tend to be

algorithms systematically designed to search for good partitions. But such deter-

ministic algorithms cannot guarantee the discovery of the best overall partition.

1.2 Measures of Dissimilarity

A fundamental question for most deterministic algorithms is which measure of

dissimilarity (distance) to use. A popular choice is the Euclidean distance between





5

object $i$ and object $j$:

$d_E(i,j) = \sqrt{(y_{i1} - y_{j1})^2 + (y_{i2} - y_{j2})^2 + \cdots + (y_{ip} - y_{jp})^2}.$

The Manhattan (city-block) distance is

$d_M(i,j) = |y_{i1} - y_{j1}| + |y_{i2} - y_{j2}| + \cdots + |y_{ip} - y_{jp}|.$

Certain types of data require specialized dissimilarity measures. The Canberra metric and Czekanowski coefficient (see Johnson and Wichern, 1998, p. 729) are two dissimilarity measures for nonnegative variables, while Johnson and Wichern (1998, p. 733) give several dissimilarity measures for binary variables.

Having chosen a dissimilarity measure, one can construct an $N \times N$ (symmetric) dissimilarity matrix (also called the distance matrix) $D$ whose rows and columns represent the objects in the data set, such that $D_{ij} = D_{ji} = d(i,j)$.
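As a concrete illustration, the following is a minimal R sketch (R being the open-source descendant of the S-plus environment used throughout this thesis) that builds both dissimilarity matrices for the Table 1-1 data; the object name y and the layout are ours.

# Table 1-1 data: rows follow the order Belgium, Denmark, ..., United Kingdom
y <- cbind(agric = c(2.7, 5.7, 3.5, 22.2, 10.9, 6.0, 14.0, 8.5, 3.5, 4.3, 17.4, 2.3),
           gnp   = c(16.8, 21.3, 18.7, 5.9, 11.4, 17.8, 10.9, 16.6, 21.0, 16.4, 7.8, 14.0))
D.euclid    <- dist(y, method = "euclidean")   # pairwise d_E(i, j)
D.manhattan <- dist(y, method = "manhattan")   # pairwise d_M(i, j)
D.full      <- as.matrix(D.euclid)             # full symmetric N x N matrix with D[i, j] = D[j, i]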

In the following sections, some common methods of cluster analysis are

presented and categorized by type.

1.3 Hierarchical Methods

Hierarchical methods can be either agglomerative or divisive. Kaufman and

Rousseeuw (1990) compiled a set of methods which were adopted into the cluster

library of the S-plus computing package, and often the methods are referred to by

the names Kaufman and Rousseeuw gave them.

Agglomerative methods begin with N clusters; that is, each observation forms
its own cluster. The algorithm successively joins clusters, yielding N 1 clusters,

then N 2 clusters, and so on until there remains only one cluster containing all
N objects. The S-plus functions agnes and hclust perform agglomerative clustering.

Common agglomerative methods include linkage methods and Ward's method.

All agglomerative methods, at each step, join the two clusters which are
considered "closest." The difference among the methods is how each defines





6

"closeness." Each method, however, defines the distance between two clusters using

some function of the dissimilarities among individual objects in those clusters.

Divisive methods begin with all objects in one cluster and successively split clusters, resulting in partitions of 1, 2, 3, ... and finally N clusters. The S-plus function diana performs divisive analysis.
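For reference, here is a brief R sketch of these functions applied to the Table 1-1 data matrix y from the earlier sketch (R's cluster package supplies agnes and diana; the choice of average linkage and of K = 3 is purely illustrative).

library(cluster)                          # provides agnes, diana, pam
D    <- dist(y, method = "euclidean")     # dissimilarity matrix for the 12 countries
agg  <- agnes(D, method = "average")      # agglomerative: N clusters merged down to 1
div  <- diana(D)                          # divisive: 1 cluster split up to N
base <- hclust(D, method = "average")     # base-R agglomerative clustering
plot(agg, which.plots = 2)                # dendrogram of the agglomerative hierarchy
cutree(base, k = 3)                       # cut the hierarchy to obtain K = 3 clusters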

1.4 Partitioning Methods

While hierarchical methods seek good partitions for all K = 1,..., N,

partitioning methods fix the number of clusters and seek a good partition for

that specific K. Although the hierarchical methods may seem to be more flexible,

they have an important disadvantage. Once two clusters have been joined in an

agglomerative method (or split in a divisive method), this move can never be

undone, although later in the algorithm undoing the move might improve the

clustering criterion (Kaufman and Rousseeuw, 1990, p. 44). Hence hierarchical

methods severely limit how much of the partition space C can be explored. While

this phenomenon results in higher computational speed for hierarchical algorithms,

its clear disadvantage often necessitates the use of the less rigid partitioning

methods (Kaufman and Rousseeuw, 1990, p. 44).

In practice, Johnson and Wichern (1998, p. 760) recommend running a
partitioning method for several reasonable choices of K and subjectively examining

the resulting clusterings. Finding an objective, data-dependent way to specify K

is an open question that has spurred recent research. Rousseeuw (1987) proposes

to select $K$ to maximize the average silhouette width $S(K)$. For each object $i$, the silhouette value is

$s(i) = \dfrac{b(i) - a(i)}{\max\{a(i), b(i)\}},$

where $a(i)$ = average dissimilarity of $i$ to all other objects in its cluster (say, cluster $A$); $b(i) = \min_{R \neq A} d(i, R)$; and $d(i, R)$ = average dissimilarity of $i$ to all objects in cluster $R$. Then $S(K)$ is the average $s(i)$ over all objects $i$ in the data set.
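A minimal R sketch of this criterion follows (using the dissimilarity matrix for the Table 1-1 data; the candidate range K = 2, ..., 6 and the object names are ours): the K-medoids function pam reports the average silhouette width for each K, and the maximizing K is selected.

library(cluster)
D <- dist(y)
avg.sil <- sapply(2:6, function(K) pam(D, k = K)$silinfo$avg.width)
names(avg.sil) <- 2:6
avg.sil                                              # average silhouette width S(K), K = 2, ..., 6
K.best <- as.numeric(names(which.max(avg.sil)))      # K maximizing the average silhouette width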







Tibshirani et al. (2001a) propose a "Gap" statistic to choose K. In a separate paper, Tibshirani et al. (2001b) suggest treating the problem as in model selection,

and choosing K via a "prediction strength" measure. Sugar and James (2003)

suggest a nonparametric approach to determining K based on the "distortion,"

a measure of within-cluster variability. Fraley and Raftery (1998) use the Bayes

Information Criterion to select the number of clusters. Milligan and Cooper (1985)

give a survey of earlier methods of choosing K.

Model-based clustering takes a different perspective on the problem. It as-

sumes the data follow a mixture of K underlying probability distributions. The

mixture likelihood is then maximized, and the maximum likelihood estimate of

the mixture parameter vector determines which objects belong to which subpop-

ulations. Fraley and Raftery (2002) provide an extensive survey of model-based

clustering methods.

1.4.1 K-means Clustering

Among the oldest and most well-known partitioning methods is K-means

clustering, due to MacQueen (1967). Note that the centroid of a cluster is the

p-dimensional mean of the objects in that cluster. After the choice of K, the

K-means algorithm initially arbitrarily partitions the objects into K clusters.

(Alternatively, one can choose K centroids as an initial step.) One at a time, each

object is moved to the cluster whose centroid is closest (usually Euclidean distance

is used to determine this). When an object is moved, centroids are immediately

recalculated for the cluster gaining the object and the cluster losing it. The method

repeatedly cycles through the list of objects until no reassignments of objects take

place (Johnson and Wichern, 1998, p. 755).

A characteristic of the K-means method is that the final clustering depends

in part on the initial configuration of the objects (or initial specification of the

centroids). Hence in practice, one typically reruns the algorithm from various






starting points to monitor the stability of the clustering (Johnson and Wichern,

1998, p. 755).

Selim and Ismail (1984) show that K-means clustering does not globally minimize the criterion

$\sum_{i=1}^{N} \sum_{j=1}^{K} d_{ij}^2\, I[i \in j], \qquad (1.1)$

where $d_{ij}$ denotes the Euclidean distance between object $i$ and the centroid of cluster $j$, i.e., $d_{ij}^2 = (y_i - \bar{y}^{(j)})'(y_i - \bar{y}^{(j)})$. This criterion is in essence an objective function $g(c)$. The K-means solution may not even locally minimize this objective function; conditions for which it is locally optimal are given by Selim and Ismail (1984).
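In R, the kmeans function implements this type of algorithm; the short sketch below (using the hypothetical Table 1-1 data matrix y from above, with K = 3 chosen only for illustration) restarts the algorithm from several random initial configurations, as recommended, and reports the total within-cluster sum of squares, which is the value of criterion (1.1) at the returned partition.

set.seed(1)
km <- kmeans(y, centers = 3, nstart = 25)   # 25 random starts; the best solution is kept
km$cluster                                  # cluster membership for each object
km$tot.withinss                             # sum of squared distances to centroids, i.e., criterion (1.1)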

1.4.2 K-medoids and Robust Clustering

Because K-means uses means (centroids) and a least squares technique in

calculating distances, it is not robust with respect to outlying observations. K-

medoids, which is used in the S-plus function pam (partitioning around medoids), due to Kaufman and Rousseeuw (1987), has gained support as a robust alternative

to K-means. Instead of minimizing a sum of squared Euclidean distances, K-

medoids minimizes a sum of dissimilarities. Philosophically, K-medoids is to

K-means as least-absolute-residuals regression is to least-squares regression.

Consider the objective function

$\sum_{i=1}^{N} \sum_{j=1}^{K} d(i, m_j)\, I[i \in j]. \qquad (1.2)$

The algorithm begins (in the so-called build-step) by selecting K representative objects, called medoids, based on an objective function involving a sum of dissimilarities (see Kaufman and Rousseeuw, 1990, p. 102). It proceeds by assigning each object $i$ to the cluster $j$ with the closest medoid $m_j$, i.e., such that $d(i, m_j) \le d(i, m_w)$ for all $w = 1, \ldots, K$. Next, in the swap-step, if swapping any unselected object with a medoid results in a decrease of the value of (1.2),






the swap is made. The algorithm stops when no swap can decrease (1.2). Like K-means, K-medoids does not in general globally optimize its objective function (Kaufman and Rousseeuw, 1990, p. 110).
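A minimal R sketch of K-medoids via the pam function (from the cluster package compiled by Kaufman and Rousseeuw) follows; note that pam accepts either a data matrix or, as here, a precomputed dissimilarity matrix, the feature exploited in Section 1.6. The data matrix y and the choice K = 3 are illustrative.

library(cluster)
D  <- dist(y)                 # any dissimilarity matrix may be supplied
pm <- pam(D, k = 3)           # build-step, then swap-steps, reducing the sum of dissimilarities to medoids
pm$medoids                    # the K representative objects
pm$clustering                 # cluster membership for each object
pm$objective                  # objective value after the build and swap phases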

Cuesta-Albertos et al. (1997) propose another robust alternative to K-means

known as "trimmed K-means," which chooses centroids by minimizing an objective

function which is based only on an optimally chosen subset of the data. Robustness

properties of the trimmed K-means method are given by Garcia-Escudero and

Gordaliza (1999).

1.5 Stochastic Methods

In general, the objective function in a cluster analysis can yield optima

which are not global (Selim and Alsultan, 1991). This leads to a weakness of the

traditional deterministic methods. The deterministic algorithms are designed

to severely limit the number of partitions searched (in the hope that "good"

clusterings will quickly be found). If g is especially ill-behaved, however, finding the

best clusterings may require a more wide-ranging search of C. Stochastic methods

are ideal tools for such a search (Robert and Casella, 1999, Section 1.4).

The major advantage of a stochastic method is that it can explore more of the

space of partitions. A stochastic method can be designed so that it need not always

improve the objective function value at each step. Thus at some steps the method

may move to a partition with a poorer value of g, with the benefits of exploring

parts of C that a deterministic method would ignore. Unlike "greedy" deterministic

algorithms, stochastic methods sacrifice immediate gain for greater flexibility and

the promise of a potentially better objective function value in another area of C.

The major disadvantage of stochastic cluster analysis is that it takes more

time than the deterministic methods, which can usually be run in a matter of

seconds. At the time most of the traditional methods were proposed, this was an

insurmountable obstacle, but the growth in computing power has narrowed the gap






in recent years. Stochastic methods that can run in a reasonable amount of time

can be valuable additions to the repertoire of the practitioner of cluster analysis.

Since cluster analysis often involves optimizing an objective function g, the

Monte Carlo optimization method of simulated annealing is a natural tool to use

in clustering. In the context of optimization over a large, finite set, simulated

annealing dates to Metropolis et al. (1953), while its modern incarnation was

introduced by Kirkpatrick et al. (1983).

An example of a stochastic cluster analysis algorithm is the simulated anneal-

ing algorithm of Selim and Alsultan (1991), which seeks to minimize the K-means

objective function (1.1). Celeux and Govaert (1992) propose two stochastic cluster-

ing methods based on the EM algorithm.
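To make the idea of a stochastic search concrete, here is a small, self-contained R sketch of a simulated-annealing search over partitions. It is not the Selim and Alsultan (1991) algorithm itself, only an illustrative Metropolis-style scheme that minimizes a K-medoids-type objective (the sum of dissimilarities of objects to their cluster medoids) while occasionally accepting moves that worsen the objective; the function and parameter names are ours.

g <- function(cl, Dm, K) {                      # objective: total dissimilarity to each cluster's medoid
  sum(sapply(1:K, function(j) {
    idx <- which(cl == j)
    if (length(idx) == 0) return(0)
    min(colSums(Dm[idx, idx, drop = FALSE]))    # cost of the best medoid within cluster j
  }))
}
anneal <- function(Dm, K, n.iter = 5000, temp = 1, cool = 0.999) {
  N   <- nrow(Dm)
  cl  <- sample(1:K, N, replace = TRUE)         # random initial partition
  val <- g(cl, Dm, K)
  for (it in 1:n.iter) {
    cand <- cl
    cand[sample(N, 1)] <- sample(1:K, 1)        # propose moving one object to a random cluster
    new.val <- g(cand, Dm, K)
    if (new.val < val || runif(1) < exp((val - new.val) / temp)) {   # accept worse moves with positive probability
      cl <- cand; val <- new.val
    }
    temp <- temp * cool                         # gradually lower the temperature
  }
  list(clustering = cl, objective = val)
}
# Usage (hypothetical): anneal(as.matrix(dist(y)), K = 3)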

1.6 Role of the Dissimilarity Matrix

For many standard cluster analysis methods, the resulting clustering structure

is determined by the dissimilarity matrix (or distance matrix) D (containing

elements which we henceforth denote $\delta_{ij}$, $i = 1, \ldots, N$; $j = 1, \ldots, N$) for the objects.

With hierarchical methods, this is explicit: If we input a certain dissimilarity

matrix into a clustering algorithm, we will get one and only one resulting grouping.

With partitioning methods, the fact is less explicit since the result depends partly

on the initial partition, the starting point of the algorithm, but this is an artifact

of the imperfect search algorithm (which can only assure a "locally optimal"

partition), not of the clustering structure itself. An ideal search algorithm which

could examine every possible partition would always map an inputted dissimilarity

matrix to a unique final clustering.

Consider the two most common criterion-based partitioning methods, K-

medoids (the Splus function pam) and K-means. For both, the objective function

is a function of the pairwise dissimilarities among the objects. The K-medoids

objective function simply involves a sum of elements of D. With K-means, the






connection involves a complicated recursive formula, as indicated by Gordon (1981, p. 42). (Because the connection is so complicated, in practice, K-means algorithms accept the data matrix as input, but theoretically they could accept the dissimilarity matrix; it would just slow down the computation severely.) The result, then, is that for these methods, a specified dissimilarity matrix yields a unique final clustering, meaning that for the purpose of cluster analysis, knowing D is as good as knowing the complete data.

If the observed data have random variation, and hence the measurements on

the objects contain error, then the distances between pairs of objects will have

error. If we want our algorithm to produce a clustering result that is close to the

"true" clustering structure, it seems desirable that the dissimilarity matrix we use

reflect as closely as possible the (unknown) pairwise dissimilarities between the

underlying systematic components of the data.

It is intuitive that if the dissimilarities in the observed distance matrix are

near the "truth," then the resulting clustering structure should be near the true structure, and a small computer example helps to show this. We generate a sample of 60 3-dimensional normal random variables (with covariance matrix $I$) such that 15 observations have mean vector $(1, 3, 1)'$, 15 have mean $(10, 6, 4)'$, 15 have mean $(1, 10, 2)'$, and 15 have mean $(5, 1, 10)'$. These means are well-separated enough that the data naturally form four clusters, and the true clustering is obvious. Then for 100 iterations we perturb the data with random $N(0, \sigma^2)$ noise having varying values of $\sigma$. For each iteration, we compute the dissimilarities and input the dissimilarity matrix of the perturbed data into the K-medoids algorithm and obtain a resulting clustering.
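A compact R sketch of this experiment follows. The means and group sizes are those given above; the helper function, the grid of noise levels, and all object names are our own illustrative choices. It perturbs the data, feeds the perturbed dissimilarity matrix to pam, and records both the mean squared discrepancy of the dissimilarities and the proportion of pairs of objects correctly matched.

library(cluster)
set.seed(1)
means   <- rbind(c(1, 3, 1), c(10, 6, 4), c(1, 10, 2), c(5, 1, 10))
X       <- means[rep(1:4, each = 15), ] + matrix(rnorm(60 * 3), 60, 3)   # 60 observations, 4 true clusters
true.cl <- rep(1:4, each = 15)
D.true  <- as.matrix(dist(X))

pair.match <- function(cl1, cl2) {              # proportion of pairs grouped consistently in both clusterings
  s1 <- outer(cl1, cl1, "=="); s2 <- outer(cl2, cl2, "==")
  ut <- upper.tri(s1)
  mean(s1[ut] == s2[ut])
}

sigma <- runif(100, 0.5, 5)                     # arbitrary range of noise levels
res <- t(sapply(sigma, function(s) {
  Xp <- X + matrix(rnorm(60 * 3, sd = s), 60, 3)       # perturb the data
  Dp <- as.matrix(dist(Xp))
  cl <- pam(as.dist(Dp), k = 4)$clustering             # cluster the perturbed dissimilarities
  c(mse = mean((Dp - D.true)^2), prop = pair.match(cl, true.cl))
}))
plot(res[, "mse"], res[, "prop"], xlab = "MSE of dissimilarities",
     ylab = "Proportion of pairs correctly grouped")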

Figure 1-2 plots, for each perturbed data set, the mean (across elements)

squared discrepancy from the true dissimilarity matrix against the proportion of

all possible pairs of objects which are correctly matched in the clustering resulting






Figure 1-2: Proportion of pairs of objects correctly grouped vs. MSE of dissimilarities.


from that perturbed matrix. (A correct match for two objects means correctly putting the two objects in the same cluster or correctly putting the two objects in different clusters, depending on the "truth.") This proportion serves as a

measure of concordance between the clustering of the perturbed data set and the

underlying clustering structure. We expect that as mean squared discrepancy

among dissimilarities increases, the proportion of pairs correctly clustered will

decrease, and the plot indicates this negative association. This indicates that a

better estimate of the pairwise dissimilarities among the data tends to yield a

better estimate of the true clustering structure.

While this thesis focuses on cluster analysis, several other statistical methods

are typically based on pairwise dissimilarities among data. Examples include

multidimensional scaling (Young and Hamer, 1987) and statistical matching

(Rodgers, 1988). An improved estimate of pairwise dissimilarities would likely

benefit the results of these methods as well.













CHAPTER 2
INTRODUCTION TO FUNCTIONAL DATA AND SMOOTHING

2.1 Functional Data

Frequently, the measurements on each observation are connected by being part

of a single underlying continuous process (often. but not always, a time process).

One example of such data are the growth records of Swiss boys (Falkner. 1960),

discussed by Ramsay and Silverman (1997, p. 2) in which the measurements are the

heights of the boys at 29 different ages. Ramsay and Silverman (1997) generally

label such data as functional data, since the underlying data are thought to be intrinsically smooth, continuous curves having domain $\mathcal{T}$, which without loss of generality we take to be $[0, T]$. The observed data vector $y$ is merely a discretized

representation of the functional observation y(t).

Functional data are related to longitudinal data, data measured across time

which appear often in biostatistical applications. Typically, however, in functional

data analysis, the primary goal is to discover something about the smooth curves

which underlie the functional observations, and to analyze the entire set of func-

tional data (consisting of many curves). The term "functional data analysis" is

attributed to Ramsay and Dalzell (1991), although methods of analysis existed

before the term was coined.

2.2 Introduction to Smoothing

When scientists observe data containing random noise, they typically desire

to remove the random variation to better understand the underlying process of

interest behind the data. A common method used to capture the underlying signal

process is smoothing.




Scatterplot smoothing, or nonparametric regression, may be used generally for

paired data (ti, yi) for which some underlying regression function E[yi] = f(ti) is

assumed. But smoothing is particularly appropriate for functional data, for which

that functional relationship y(t) between the response and the process on T is

inherent in the data.

One option, upon observing a functional measurement y, is to imagine the

unknown underlying curve as an interpolant y(t). This results in a curve that is

visually no smoother than the observed data, however. Typically, when functional

data are analyzed, the vector of measurements is converted to a curve via a

smoothing procedure which reduces the random variation in the function. If we

wish to cluster functional data, it may be advantageous to smooth the observed

vector for each object and perform the cluster analysis on the smooth curves rather

than on the observed data. (Clearly, this option is inappropriate for cross-sectional

data such as the European agricultural data of Chapter 1.)

Smoothing data results in an unavoidable tradeoff between bias and variance

(Simonoff, 1996, p. 15). The greater the amount of smoothing of a functional

measurement, the more its variance will decrease, but the more biased it will

become (Simonoff, 1996, p. 42). In cluster analysis, we hope that clustering the

smoothed data (which contains reduced noise) will lead to smaller within-cluster

variability, since functional data which truly belong to the same cluster should

appear more similar when represented as smooth curves. This would help make

the clustering structure of the data more apparent. Using smoothed data may

introduce a bias, however, and the bias-variance tradeoff could be quantified with a

mean squared error-type criterion.

We denote the "observed" noisy curves to be $y_1(t), \ldots, y_N(t)$. The underlying signal curves for this data set are $\mu_1(t), \ldots, \mu_N(t)$. In reality we observe these curves at a fine grid of $n$ points, $t_1, \ldots, t_n$, so that we observe $N$ independent vectors, each $n \times 1$: $y_1, \ldots, y_N$.

A possible model for our noisy data is the discrete noise model:

$y_{ij} = \mu_i(t_j) + \epsilon_{ij}, \quad i = 1, \ldots, N,\; j = 1, \ldots, n. \qquad (2.1)$

Here, for each $i = 1, \ldots, N$, the $\epsilon_{ij}$ may be considered independent for different measurement points, having mean zero and constant variance $\sigma_i^2$.

Another possible model for our noisy curves is the functional noise model:

$y_i(t_j) = \mu_i(t_j) + \epsilon_i(t_j), \quad i = 1, \ldots, N,\; j = 1, \ldots, n, \qquad (2.2)$

where $\epsilon_i(t)$ is, for example, a stationary Ornstein-Uhlenbeck process with "pull" parameter $\beta > 0$ and variability parameter $\sigma_i^2$. This choice of model implies that the errors for the $i$th discretized curve have variance-covariance matrix $\Sigma_i = \sigma_i^2 \Omega$, where $\Omega_{lm} = (2\beta)^{-1} \exp(-\beta |t_l - t_m|)$ (Taylor et al., 1994). Note that in this case, the noise process is functional (specifically Ornstein-Uhlenbeck), but we still assume the response data collected are discretized, and thus constitute a vector at the level of analysis. Conceptually, however, the noise process is smooth and continuous in (2.2), as is the signal process in either model (2.1) or (2.2).

Depending on the data and sampling scheme, either (2.1) or (2.2) may be an

appropriate model. If the randomness in the data arises from measurement error

which is independent from one measurement to the next, (2.1) is more appropriate.

Ramsay and Silverman (1997, p. 42) suggest a discrete noise model for the Swiss

growth data, in which heights of boys are measured at 29 separate ages, and in

which some small measuring error (independent across measurements) is likely to

be present in the recorded data.

In the case that the variation of the observed data from the underlying curve is

due to an essentially continuous random process, model (2.2) may be appropriate.






Data which are measured frequently and almost continuously (for example, via sophisticated monitoring equipment) may be more likely to follow model (2.2),

since data measured closely across time (or another domain) may more likely be

correlated. We will examine both situations.

We may apply a linear smoother to obtain the smoothed curves $\hat{\mu}_1(t), \ldots, \hat{\mu}_N(t)$. In practice, we apply a smoothing matrix $S$ to the observed noisy data to obtain a smooth, called linear because the smooth $\hat{\mu}_i$ can be written as

$\hat{\mu}_i = S y_i, \quad i = 1, \ldots, N,$

where $S$ does not depend on $y_i$ (Buja et al., 1989), and we define $\hat{\mu}_i = (\hat{\mu}_i(t_1), \ldots, \hat{\mu}_i(t_n))'$. Note that as $n \to \infty$, the vector $\hat{\mu}_i$ begins to closely resemble the curve $\hat{\mu}_i(t)$ on $[0, T]$.

(Here and subsequently, when writing "limit as $n \to \infty$," we assume $t_1, \ldots, t_n \in [0, T]$; that is, the collection of points is becoming denser within $[0, T]$, with the maximum gap between any pair of adjacent points $t_{i-1}, t_i$, $i = 2, \ldots, n$, tending to 0. Stein (1995) calls this method of taking the limit "fixed-domain asymptotics," while Cressie (1993) calls it "infill asymptotics.")

Many popular smoothing methods (kernel smoothers, local polynomial

regression, smoothing splines) are linear. Note that if a bandwidth or smoothing

parameter for these methods is chosen via a data-driven method, then technically,

these smoothers become nonlinear (Buja et al., 1989).

We will focus primarily on basis function smoothing methods, in which the

smoothing matrix S is an orthogonal projection (i.e., symmetric and idempotent).

These methods seek to express the signal curve as a linear combination of k (< n)

specified basis functions, in which case the rank of S is k. Examples of such

methods are regression splines, Fourier series, polynomial regression, and some

types of wavelet smoothers (Ramsay and Silverman, 1997, pp. 44-50).
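As a concrete illustration, the following R sketch builds such an orthogonal-projection smoother from a B-spline basis (via the standard splines package) and applies it to one simulated noisy curve under the discrete noise model; the signal curve, grid, and value of k are illustrative choices of ours, not those used later in the thesis.

library(splines)
n    <- 100; Tmax <- 1
tt   <- seq(0, Tmax, length = n)                  # fine grid t_1, ..., t_n on [0, T]
k    <- 8                                         # dimension of the basis (rank of S)
B    <- bs(tt, df = k, intercept = TRUE)          # n x k B-spline basis matrix
S    <- B %*% solve(crossprod(B)) %*% t(B)        # projection matrix: symmetric, idempotent, rank k
mu   <- sin(2 * pi * tt)                          # an illustrative signal curve mu(t)
y.obs  <- mu + rnorm(n, sd = 0.3)                 # observed noisy curve
mu.hat <- S %*% y.obs                             # linear smooth: mu.hat = S y
# all.equal(S, t(S)); all.equal(S, S %*% S)       # checks that S is symmetric and idempotent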






2.3 Dissimilarities Between Curves
If we choose squared L2 distance as our dissimilarity metric, then denote
the dissimilarities between the true. observed, and smoothed curves i and j,
respectively, as follows:


fj = f[p(t) g3(t)]2dt (2.3)

dij = f i[t y(t)] dt (2.4)

moo =t [(t) (t)]2 dt. (2.5)

Define


Oij = i Pj

where gi = (pi(tl), .,/ji(t,))',

j = Yi yj,

(smooth) __
i(smooth) = Pi p= Sij.

If the data follow the discrete noise model, Oij ~ N(Oij, afI) where f =
(ao + j2). If the data follow the functional noise model, 6i ( ~ N(Oij, Eiy) where
Eij = aijf and a?. = (af + a).
If we observe the response at points tl,..., t, in [0, T], then we may approxi-
mate (2.3)-(2.5) by


dij = jOij



d(amooth) = TOs 'S,.
V n





18

Note that di -* 6ij where the limit is taken as n -+ oot, t1,.. t, E [0, TI.

Hence the question of whether smoothing aids clustering is, in a sense, closely
related to the question of: When, for large n. is d~j0th) a better estimator of dij

than is dij?

(Note: In the following chapters, since the pair of curves i and j is arbitrary,

we shall suppress the ij subscript on 0ij, 0,. or?, and Eij, writing instead 0,

0. a2, and E, understanding that we are concerned with any particular pair

i.j E {1,..., N}, i j.)
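For reference, a short R sketch of these two approximations (continuing the notation above; y.i and y.j are the observed n x 1 vectors for two curves, S is a projection smoother as constructed in Section 2.2, and the function names are ours):

dissim.obs <- function(y.i, y.j, Tmax) {          # d_ij approx (T/n) * theta.hat' theta.hat
  theta.hat <- y.i - y.j
  (Tmax / length(theta.hat)) * sum(theta.hat^2)
}
dissim.smooth <- function(y.i, y.j, S, Tmax) {    # d_ij^(smooth) approx (T/n) * theta.hat' S'S theta.hat
  theta.smooth <- S %*% (y.i - y.j)
  (Tmax / length(theta.smooth)) * sum(theta.smooth^2)
}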

2.4 Previous Work

Some methods for clustering functional data recently have been presented in

the statistical literature. James and Sugar (2003) introduce a model-based cluster-

ing method that is especially useful when the points measured along the sample

curves are sparse and irregular. Tarpey and Kinateder (2003) discuss represent-

ing each curve with basis functions and clustering via K-means, to estimate the
"principal points" (underlying cluster means) of the data's distribution. Abraham

et al. (2003) propose representing the curves via a B-spline basis and clustering

the estimated coefficients with a K-means algorithm, and they derive consistency

results about the convergence of the algorithm. Tarpey et al. (2003) apply methods

for clustering functional data to the analysis of a pharmaceutical study. Hastie et

al. (1995) and James and Hastie (2001) discuss discriminant analysis for functional

data.

2.5 Summary
It seems intuitive that in the analysis of functional data, some form of smooth-

ing the observed data is appropriate, and some previous methods (James and

Sugar, 2003; Tarpey and Kinateder, 2003; Abraham et al., 2003) do involve

smoothing. Tarpey and Kinateder (2003, p. 113) propose the question of the effect

of smoothing on the clustering of functional data, citing the need "to study the






sensitivity of clustering methods on the degree of smoothing used to estimate the

functions."

In this thesis, we will provide some rigorous theoretical justification that

smoothing the data before clustering will improve the cluster analysis. Much of

the theory will focus on the estimation of the underlying dissimilarities among

the curves, and we will show that a shrinkage-type smoothing method leads to an

improved risk in estimating the dissimilarities. A simulation study will demonstrate

that this risk improvement is accompanied by a clearly improved performance in

correctly grouping objects into their proper clusters.












CHAPTER 3
CASE I: DATA FOLLOWING THE DISCRETE NOISE MODEL
First we will consider functional data following model (2.1). Recall that we
assume the response is measured at n discrete points in [0, T].
3.1 Comparing MSEs of Dissimilarity Estimators when $\theta$ is in the Linear Subspace Defined by $S$

We assume our linear smoothing matrix $S$ is symmetric and idempotent. For a linear basis function smoother which is fitted via least squares, $S$ will be symmetric and idempotent as long as the $n$ points at which $\hat{\mu}_i$ is evaluated are identical to the points at which $y_i$ is observed (Ramsay and Silverman, 1997, p. 44). Examples of such smoothers are regression splines (in particular, B-splines), wavelet bases, and Fourier series bases (Ramsay and Silverman, 1997). Regression splines and B-spline bases are discussed in detail by de Boor (1978) and Eubank (1988, Chapter 7).

We also assume that $S$ projects the observed data onto a lower-dimensional space (of dimension $k < n$), and thus $r(S) = \mathrm{tr}(S) = k$. Note that $S$ is a shrinking smoother, since all its singular values are $\le 1$ (Buja et al., 1989). Recall that according to the discrete noise model for the data, $\hat{\theta} \sim N(\theta, \sigma^2 I)$. Without loss of generality, let $\sigma^2 I = I$. (Otherwise, we can let, for example, $\hat{\eta} = \sigma^{-1}\hat{\theta}$ and $\eta = \sigma^{-1}\theta$ and work with $\hat{\eta}$ and $\eta$ instead.)

Note that $\frac{T}{n}\hat{\theta}'\hat{\theta}$ represents the approximate $L^2$ distance between the observed curves $y_i(t)$ and $y_j(t)$, and $\frac{T}{n}\hat{\theta}'S'S\hat{\theta} = \frac{T}{n}\hat{\theta}'S\hat{\theta}$ represents the approximate $L^2$ distance between the smoothed curves $\hat{\mu}_i(t)$ and $\hat{\mu}_j(t)$.

We wish to see when the "smoothed-data dissimilarity" better estimates the true dissimilarity $\delta_{ij}$ between curves $\mu_i(t)$ and $\mu_j(t)$ than does the observed-data dissimilarity.




The risk of an estimator $\hat{\tau}$ of $\tau$ is given by $R(\tau, \hat{\tau}) = E[L(\tau, \hat{\tau})]$, where $L(\cdot, \cdot)$ is a loss function (see Lehmann and Casella, 1998, pp. 4-5). For the familiar case of squared error loss $L(\tau, \hat{\tau}) = (\tau - \hat{\tau})^2$, the risk is simply the mean squared error (MSE) of the estimator. Hence we may compare the MSEs of two competing estimators and choose the one with the smaller MSE. To this end, let us examine $\mathrm{MSE}(\frac{T}{n}\hat{\theta}'S\hat{\theta})$ and $\mathrm{MSE}(\frac{T}{n}\hat{\theta}'\hat{\theta})$ in estimating $\frac{T}{n}\theta'\theta$ (which approaches $\delta_{ij}$ as $n \to \infty$).

In this section, we consider the case when $\theta$ lies in the linear subspace that $S$ projects onto, i.e., $S\theta = \theta$. Note that if two arbitrary (discretized) signal curves $\mu_i$ and $\mu_j$ are in this linear subspace, then the corresponding $\theta$ is also in the subspace, since in this case

$\theta = \mu_i - \mu_j = S\mu_i - S\mu_j = S(\mu_i - \mu_j) = S\theta.$

In this idealized situation, a straightforward comparison of MSEs shows that the smoothed-data estimator improves on the observed-data estimator.

Theorem 3.1 Suppose the observed $\hat{\theta} \sim N(\theta, \sigma^2 I)$. Let $S$ be a symmetric and idempotent linear smoothing matrix of rank $k$. If $\theta$ lies in the linear subspace defined by $S$, then the dissimilarity estimator $\frac{T}{n}\hat{\theta}'S\hat{\theta}$ has smaller mean squared error than does $\frac{T}{n}\hat{\theta}'\hat{\theta}$ in estimating $\frac{T}{n}\theta'\theta$. That is, the dissimilarity estimator based on the smooth is better than the one based on the observed data in estimating the dissimilarities between the underlying signal curves.

Proof of Theorem 3.1: Let $\sigma^2 = 1$ without loss of generality.

Now, $E[\frac{T}{n}\hat{\theta}'\hat{\theta}] = \frac{T}{n}E[\hat{\theta}'\hat{\theta}] = \frac{T}{n}[n + \theta'\theta] = T + \frac{T}{n}\theta'\theta.$

Similarly, $E[\frac{T}{n}\hat{\theta}'S\hat{\theta}] = \frac{T}{n}E[\hat{\theta}'S\hat{\theta}] = \frac{T}{n}[k + \theta'S\theta] = \frac{Tk}{n} + \frac{T}{n}\theta'S\theta.$

Hence

$\mathrm{MSE}\left(\frac{T}{n}\hat{\theta}'\hat{\theta}\right) = \mathrm{var}\left(\frac{T}{n}\hat{\theta}'\hat{\theta}\right) + \left\{E\left[\frac{T}{n}\hat{\theta}'\hat{\theta}\right] - \frac{T}{n}\theta'\theta\right\}^2 = \frac{T^2}{n^2}(2n + 4\theta'\theta) + T^2 = T^2\left(\frac{2}{n} + \frac{4\theta'\theta}{n^2} + 1\right), \qquad (3.1)$

and

$\mathrm{MSE}\left(\frac{T}{n}\hat{\theta}'S\hat{\theta}\right) = \mathrm{var}\left(\frac{T}{n}\hat{\theta}'S\hat{\theta}\right) + \left\{E\left[\frac{T}{n}\hat{\theta}'S\hat{\theta}\right] - \frac{T}{n}\theta'\theta\right\}^2 = \frac{T^2}{n^2}(2k + 4\theta'S\theta) + \frac{T^2}{n^2}\left(k + \|S\theta\|^2 - \|\theta\|^2\right)^2. \qquad (3.2)$

Comparing (3.1) with (3.2), we see that if $\theta$ lies in the subspace that $S$ projects onto, implying that $S\theta = \theta$, then $\mathrm{MSE}(\frac{T}{n}\hat{\theta}'S\hat{\theta}) < \mathrm{MSE}(\frac{T}{n}\hat{\theta}'\hat{\theta})$ since $k < n$, and thus smoothing leads to a better estimator of each pairwise dissimilarity. In this case, smoothing reduces the variance of the dissimilarity estimate without adversely increasing the bias, since $S\theta = \theta$. $\Box$
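The MSE comparison in Theorem 3.1 is easy to check by simulation; the R sketch below is our own illustration (with $\sigma^2 = 1$ and a signal $\theta$ lying in the column space of an illustrative B-spline projection $S$), estimating both MSEs by Monte Carlo.

set.seed(2)
n <- 100; k <- 8; Tmax <- 1
B <- splines::bs(seq(0, Tmax, length = n), df = k, intercept = TRUE)
S <- B %*% solve(crossprod(B)) %*% t(B)
theta  <- as.vector(S %*% sin(2 * pi * seq(0, Tmax, length = n)))   # theta in the subspace: S theta = theta
target <- (Tmax / n) * sum(theta^2)
est <- replicate(5000, {
  theta.hat <- theta + rnorm(n)                                     # theta.hat ~ N(theta, I)
  c(obs    = (Tmax / n) * sum(theta.hat^2),
    smooth = (Tmax / n) * sum((S %*% theta.hat)^2))
})
rowMeans((est - target)^2)        # Monte Carlo MSEs: the "smooth" entry should be the smaller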






On the other hand, suppose the smooth does not reproduce $\theta$ perfectly, so that $S\theta \neq \theta$. Then it can be shown (see Appendix A.1) that the smoothed-data estimator is better when:

$\|S\theta\|^2 - \|\theta\|^2 > -2 - k - \sqrt{4 + 2n + n^2 + 2k}. \qquad (3.3)$

Now, since $S$ is a shrinking smoother, $\|Sy\| \le \|y\|$ for all $y$, and hence $\|Sy\|^2 \le \|y\|^2$ for all $y$. Therefore $\|S\theta\|^2 < \|\theta\|^2$, so the left-hand side of (3.3) is negative, and so is the right-hand side. If $\theta$ is such that $S\theta \neq \theta$ and $\|S\theta\|^2 - \|\theta\|^2$ is near 0, then (3.3) will be satisfied. If, however, $\|S\theta\|^2 - \|\theta\|^2$ falls below the right-hand side of (3.3), then (3.3) will not be satisfied and smoothing will not help.

In other words, some shrinkage smoothing of the observed curves makes the dissimilarity estimator better, but too much shrinkage leads to a forfeiture of that advantage. The disadvantage of the linear smoother is that it cannot "learn" from the data how much to shrink $\hat{\theta}$. To improve the smoother, we can employ a James-Stein-type adjustment to $S$, so that the data can determine the amount of shrinkage.
3.2 A James-Stein Shrinkage Adjustment to the Smoother

What is now known as "shrinkage estimation" or "Stein estimation" originated
with the work of Stein in the context of estimating a multivariate normal mean.
Stein (1956) showed that the usual estimator (e.g., the sample mean vector) was
inadmissible when the data were of dimension 3 or greater. James and Stein (1961)
showed that a particular "shrinkage estimator" (so named because it shrunk the
usual estimate toward the origin or some appropriate point) dominated the usual
estimator. In subsequent years, many results have been derived about shrinkage
estimation in a variety of contexts. A detailed discussion can be found in Lehmann
and Casella (1998, Chapter 5).






Lehmann and Casella (1998, p. 367) discuss shrinking an estimator toward a linear subspace of the parameter space. In our case, we believe that $\theta$ is near $S\theta$. Let $\mathcal{L}_S = \{\theta : S\theta = \theta\}$, where $S$ is symmetric and idempotent of rank $k$. Hence we may shrink $\hat{\theta}$ toward $S\hat{\theta}$, the MLE of $\theta \in \mathcal{L}_S$.

In this case, a James-Stein estimator of $\theta$ (see Lehmann and Casella, 1998, p. 367) is

$\hat{\theta}^{(JS)} = S\hat{\theta} + \left(1 - \frac{a}{\|\hat{\theta} - S\hat{\theta}\|^2}\right)(\hat{\theta} - S\hat{\theta}),$

where $a$ is a constant and $\|\cdot\|$ is the usual Euclidean norm.

In practice, to avoid the problem of the shrinkage factor possibly being negative for small $\|\hat{\theta} - S\hat{\theta}\|^2$, we will use the positive-part James-Stein estimator

$\hat{\theta}^{(JS)} = S\hat{\theta} + \left(1 - \frac{a}{\|\hat{\theta} - S\hat{\theta}\|^2}\right)_+ (\hat{\theta} - S\hat{\theta}), \qquad (3.4)$

where $x_+ = x\, I(x \ge 0)$.

Casella and Hwang (1987) propose similar shrinkage estimators in the context of confidence sets for a multivariate normal mean. Green and Strawderman (1991), also in the context of estimating a multivariate mean, discuss how shrinking an unbiased estimator toward a possibly biased estimator using a James-Stein form can result in a risk improvement.

The shrinkage estimator involves the data by giving more weight to $\hat{\theta}$ when $\|\hat{\theta} - S\hat{\theta}\|^2$ is large and more weight to $S\hat{\theta}$ when $\|\hat{\theta} - S\hat{\theta}\|^2$ is small. In fact, if the smoother is at all well-chosen, $S\hat{\theta}$ is often close enough to $\hat{\theta}$ that the shrinkage factor in (3.4) is very often zero. The shrinkage factor is actually merely a safeguard against oversmoothing, in case $S$ smooths the curves beyond what, in reality, it should.






So an appropriate shrinkage estimator of the dissimilarity $\frac{T}{n}\theta'\theta$ is

$d_{ij}^{(JS)} = \frac{T}{n}\hat{\theta}^{(JS)\prime}\hat{\theta}^{(JS)} \qquad (3.5)$

$= \frac{T}{n}\left[S\hat{\theta} + \left(1 - \frac{a}{\|\hat{\theta} - S\hat{\theta}\|^2}\right)_+ (\hat{\theta} - S\hat{\theta})\right]'\left[S\hat{\theta} + \left(1 - \frac{a}{\|\hat{\theta} - S\hat{\theta}\|^2}\right)_+ (\hat{\theta} - S\hat{\theta})\right]. \qquad (3.6)$
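A direct R transcription of the positive-part estimator (3.4) and the dissimilarity estimator (3.5) follows (with $\sigma^2 = 1$ as in this section; the function names are ours):

theta.js <- function(theta.hat, S, a) {           # positive-part James-Stein smooth, equation (3.4)
  resid  <- theta.hat - S %*% theta.hat           # (I - S) theta.hat
  shrink <- max(0, 1 - a / sum(resid^2))          # (1 - a / ||theta.hat - S theta.hat||^2)_+
  as.vector(S %*% theta.hat + shrink * resid)
}
d.js <- function(theta.hat, S, a, Tmax) {         # dissimilarity estimator (3.5)
  tjs <- theta.js(theta.hat, S, a)
  (Tmax / length(tjs)) * sum(tjs^2)
}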

Now, the risk difference (difference in MSEs) between the James-Stein smoothed dissimilarity $d_{ij}^{(JS)}$ and the observed dissimilarity $d_{ij}$ is

$\Delta = E\left[\left(\frac{T}{n}\hat{\theta}^{(JS)\prime}\hat{\theta}^{(JS)} - \frac{T}{n}\theta'\theta\right)^2\right] - E\left[\left(\frac{T}{n}\hat{\theta}'\hat{\theta} - \frac{T}{n}\theta'\theta\right)^2\right]. \qquad (3.7)$

Since we are interested in when the risk difference is negative, we can ignore the positive constant multiplier $T^2/n^2$. From (3.7),

$\Delta = E\left[(\hat{\theta}^{(JS)\prime}\hat{\theta}^{(JS)})^2 - 2\theta'\theta\,\hat{\theta}^{(JS)\prime}\hat{\theta}^{(JS)} - (\hat{\theta}'\hat{\theta})^2 + 2\theta'\theta\,\hat{\theta}'\hat{\theta}\right]$
$= E\left[(\hat{\theta}^{(JS)\prime}\hat{\theta}^{(JS)})^2 - (\hat{\theta}'\hat{\theta})^2 - 2\theta'\theta\,(\hat{\theta}^{(JS)\prime}\hat{\theta}^{(JS)} - \hat{\theta}'\hat{\theta})\right].$

Write $\hat{\theta}^{(JS)} = \hat{\theta} - \varphi(\hat{\theta})\hat{\theta}$, where

$\varphi(\hat{\theta}) = \frac{a}{\|\hat{\theta} - S\hat{\theta}\|^2}(I - S).$

Then

$\hat{\theta}^{(JS)\prime}\hat{\theta}^{(JS)} - \hat{\theta}'\hat{\theta} = -2\hat{\theta}'\varphi(\hat{\theta})\hat{\theta} + \hat{\theta}'\varphi(\hat{\theta})'\varphi(\hat{\theta})\hat{\theta}$
$= -\frac{2a\,\hat{\theta}'(I - S)\hat{\theta}}{\|\hat{\theta} - S\hat{\theta}\|^2} + \frac{a^2\,\hat{\theta}'(I - S)'(I - S)\hat{\theta}}{\|\hat{\theta} - S\hat{\theta}\|^4}$
$= -2a + \frac{a^2}{\|\hat{\theta} - S\hat{\theta}\|^2}.$

Hence the (scaled) risk difference is

$\Delta = E\left[(\hat{\theta}^{(JS)\prime}\hat{\theta}^{(JS)})^2 - (\hat{\theta}'\hat{\theta})^2 - 2\theta'\theta\left(-2a + \frac{a^2}{\|\hat{\theta} - S\hat{\theta}\|^2}\right)\right]$
$= E\left[\left(\hat{\theta}'\hat{\theta} - 2a + \frac{a^2}{\|\hat{\theta} - S\hat{\theta}\|^2}\right)^2 - (\hat{\theta}'\hat{\theta})^2 - 2\theta'\theta\left(-2a + \frac{a^2}{\|\hat{\theta} - S\hat{\theta}\|^2}\right)\right].$

Note that

$\hat{\theta}'\hat{\theta} = \hat{\theta}'(I - S)\hat{\theta} + \hat{\theta}'S\hat{\theta} \quad \text{and} \quad \theta'\theta = \theta'(I - S)\theta + \theta'S\theta.$

Define the following quadratic forms:

$\hat{q}_1 = \hat{\theta}'(I - S)\hat{\theta}, \quad \hat{q}_2 = \hat{\theta}'S\hat{\theta}, \quad q_1 = \theta'(I - S)\theta, \quad q_2 = \theta'S\theta.$

Note that $\hat{q}_1 \sim \chi^2_{n-k}(q_1)$, $\hat{q}_2 \sim \chi^2_k(q_2)$, and they are independent ($S$ idempotent implies $(I - S)S = 0$, and $\hat{\theta}$ is normally distributed). Now write $\Delta$ as:

$\Delta = E\left[2[\hat{q}_1 + \hat{q}_2]\left(-2a + \frac{a^2}{\hat{q}_1}\right) + \left(-2a + \frac{a^2}{\hat{q}_1}\right)^2 - 2(q_1 + q_2)\left(-2a + \frac{a^2}{\hat{q}_1}\right)\right]$
$= E\left[-4a[\hat{q}_1 + \hat{q}_2] + \frac{2a^2}{\hat{q}_1}[\hat{q}_1 + \hat{q}_2] + 4a^2 - \frac{4a^3}{\hat{q}_1} + \frac{a^4}{\hat{q}_1^2} + 4a(q_1 + q_2) - \frac{2a^2}{\hat{q}_1}(q_1 + q_2)\right]$
$= E\left[-4a[\hat{q}_1 + \hat{q}_2] + 4a^2 + \frac{a^4}{\hat{q}_1^2} - \frac{4a^3}{\hat{q}_1} + 4a(q_1 + q_2) + \frac{2a^2}{\hat{q}_1}(\hat{q}_1 + \hat{q}_2 - q_1 - q_2)\right].$

Since $E[\hat{q}_1 + \hat{q}_2] = n + q_1 + q_2$,

$\Delta = -4an + 4a^2 + E\left[\frac{a^4}{\hat{q}_1^2} - \frac{4a^3}{\hat{q}_1} + \frac{2a^2}{\hat{q}_1}(\hat{q}_1 + \hat{q}_2 - q_1 - q_2)\right].$

Note that since $\hat{q}_1$ and $\hat{q}_2$ are independent, we can take the expectation of $\hat{q}_2$ in the last term. Since $E[\hat{q}_2] = k + q_2$,

$\Delta = 4a^2 - 4an + E\left[\frac{a^4}{\hat{q}_1^2} + \frac{2a^2 k}{\hat{q}_1} + 2a^2\left(1 - \frac{2a + q_1}{\hat{q}_1}\right)\right].$

By Jensen's Inequality, $E(1/\hat{q}_1) \ge 1/E(\hat{q}_1)$, and since $E(\hat{q}_1) = n - k + q_1$, we can bound the last term above:

$E\left[2a^2\left(1 - \frac{2a + q_1}{\hat{q}_1}\right)\right] \le 2a^2\left(1 - \frac{2a + q_1}{n - k + q_1}\right) = 2a^2\left(\frac{n - k - 2a}{n - k + q_1}\right) \le 2a^2\left(\frac{n - k - 2a}{n - k}\right) = 2a^2\left(1 - \frac{2a}{n - k}\right).$

Hence

$\Delta \le 4a^2 - 4an + E\left[\frac{a^4}{\hat{q}_1^2}\right] + E\left[\frac{2a^2 k}{\hat{q}_1}\right] + 2a^2\left(1 - \frac{2a}{n - k}\right). \qquad (3.8)$

Note that the numerators of the terms in the expected values in (3.8) are positive. The random variable $\hat{q}_1$ is a noncentral $\chi^2_{n-k}$ with noncentrality parameter $q_1$, and this distribution is stochastically increasing in $q_1$. This implies that for $m > 0$, $E[(\hat{q}_1)^{-m}]$ is decreasing in $q_1$. So

$E[(\hat{q}_1)^{-m}] \le E[(\chi^2_{n-k})^{-m}],$

where $\chi^2_{n-k}$ is a central $\chi^2$ with $n - k$ degrees of freedom. So by replacing $\hat{q}_1$ with $\chi^2_{n-k}$ in (3.8), we obtain an upper bound $\Delta_U$ for $\Delta$:

$\Delta_U = 4a^2 - 4an + E\left[\frac{a^4}{(\chi^2_{n-k})^2}\right] + E\left[\frac{2a^2 k}{\chi^2_{n-k}}\right] + 2a^2\left(1 - \frac{2a}{n - k}\right). \qquad (3.9)$

Note that $E[(\chi^2_{n-k})^{-1}] = 1/(n - k - 2)$ and $E[(\chi^2_{n-k})^{-2}] = 1/[(n - k - 2)(n - k - 4)]$. So taking expectations, we have:

$\Delta_U = \frac{a^4}{(n - k - 2)(n - k - 4)} - \frac{4a^3}{n - k} + \left(6 + \frac{2k}{n - k - 2}\right)a^2 - 4na.$

We are looking for values of $a$ which make $\Delta_U$ negative. The fourth-degree equation $\Delta_U(a) = 0$ can be solved analytically. (Two of the roots are imaginary, as noted in Appendix B.1.) One real root is clearly 0; call the second real root $r$.

Theorem 3.2 If $n - k > 4$, then the nontrivial real root $r$ of $\Delta_U(a) = 0$ is positive. Furthermore, for any choice $a \in (0, r)$, the upper bound $\Delta_U < 0$, implying $\Delta < 0$. That is, for $0 < a < r$, the risk difference is negative and the smoothed-data dissimilarity estimator $d_{ij}^{(JS)}$ is better than $d_{ij}$.
Proof of Theorem 3.2: Write $\Delta_U(a) = 0$ as $c_4 a^4 + c_3 a^3 + c_2 a^2 + c_1 a = 0$, where $c_4, c_3, c_2, c_1$ are the respective coefficients of $\Delta_U(a)$. It is clear that if $n - k > 4$, then $c_4 > 0$, $c_3 < 0$, $c_2 > 0$, $c_1 < 0$.

Note that $r$ also solves the cubic equation $f_c(a) = c_4 a^3 + c_3 a^2 + c_2 a + c_1 = 0$. Since its leading coefficient $c_4 > 0$,

$\lim_{a \to -\infty} f_c(a) = -\infty, \qquad \lim_{a \to \infty} f_c(a) = \infty.$

Since $f_c(a)$ has only one real root, it crosses the $a$-axis only once. Since its vertical intercept $c_1 < 0$, its horizontal intercept $r > 0$.





Table 3-1: Table of choices of a for various n and k.

n     k    minimizer a*    root r
20    5    9.3             19.0
50    5    31.9            72.2
100   5    71.3            169.1
200   5    153.2           367.5


And since Au(a) has leading coefficient c4 > 0,

lim Au(a) = oo.
a foo

Since it tends to oo at its endpoints, Ay(a) must be negative between its two real

roots 0 and r. Therefore A < 0 for 0 < a < r. O
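The constants in Table 3-1 can be verified numerically; the R sketch below (our own check, with the function name and tolerances chosen by us) forms the coefficients of $\Delta_U(a)$ as derived above, finds the nontrivial real root $r$ with polyroot, and locates the minimizer $a^*$ with optimize.

deltaU.roots <- function(n, k) {
  c4 <- 1 / ((n - k - 2) * (n - k - 4))
  c3 <- -4 / (n - k)
  c2 <- 6 + 2 * k / (n - k - 2)
  c1 <- -4 * n
  rts <- polyroot(c(c1, c2, c3, c4))              # roots of the cubic c4 a^3 + c3 a^2 + c2 a + c1 = 0
  r   <- Re(rts[abs(Im(rts)) < 1e-8])             # the single real root r
  dU  <- function(a) c4 * a^4 + c3 * a^3 + c2 * a^2 + c1 * a
  a.star <- optimize(dU, interval = c(0, r))$minimum
  c(root.r = r, minimizer = a.star)
}
deltaU.roots(n = 20, k = 5)   # close to the r and a* reported in Table 3-1 for n = 20, k = 5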

Using a symbolic algebra software program (such as Maple or Mathematica), one can easily obtain the formula for the second real root for general $n$ and $k$ (the formula is given in Appendix B.1) and verify that the other two roots are imaginary.

Figure 3-1 shows $\Delta_U$ plotted as a function of $a$ for varying $n$ and $k = 5$. For various choices of $n$ (and $k = 5$), Table 3-1 provides values of $r$, as well as the value $a^*$ of $a$ which minimizes the upper bound for $\Delta$. For $0 < a < r$, the risk difference is assured of being negative. For $a^*$, $\Delta_U$ is minimized.

Since $\Delta_U$ provides an upper bound for the (scaled) risk difference, it may be valuable to ascertain the size of the discrepancy between $\Delta_U$ and $\Delta$. We can estimate this discrepancy via Monte Carlo simulation. We generate a large number of random variables having the distribution of $\hat{q}_1$ (namely $\chi^2_{n-k}(q_1)$) and get an estimate of $\Delta$ using a Monte Carlo mean. For various values of $q_1$, Figure 3-2 shows $\Delta$ plotted alongside $\Delta_U$.

3.3 Extension to the Case of Unknown $\sigma^2$

In the previous section, $\sigma^2$ was assumed to be known. We now examine the situation in which $\hat{\theta} \sim N(\theta, \sigma^2 I)$ with $\sigma^2$ unknown.






Figure 3-1: Plot of $\Delta_U$ against a for varying n and for k = 5. Solid line: n = 20. Dashed line: n = 50. Dotted line: n = 100.






Figure 3-2: Plot of simulated $\Delta$ and $\Delta_U$ against a for n = 20, k = 5. Solid line: upper bound $\Delta_U$. Dashed line: simulated $\Delta$, $q_1$ = 0. Dotted line: simulated $\Delta$, $q_1$ = 135. Long-dashed line: simulated $\Delta$, $q_1$ = 540. Dot-dashed line: simulated $\Delta$, $q_1$ = 3750000.





Suppose there exists a random variable $S^2$, independent of $\hat{\theta}$, such that

$\frac{S^2}{\sigma^2} \sim \chi^2_\nu.$

Then let $\hat{\sigma}^2 = S^2/\nu$.

Consider the following definition of the James-Stein estimator which accounts for $\sigma^2$:

$\hat{\theta}^{(JS)} = S\hat{\theta} + \left(1 - \frac{a\sigma^2}{\|\hat{\theta} - S\hat{\theta}\|^2}\right)(I - S)\hat{\theta} = S\hat{\theta} + \left(1 - \frac{a\sigma^2}{\hat{\theta}'(I - S)\hat{\theta}}\right)(I - S)\hat{\theta}. \qquad (3.10)$

Note that the James-Stein estimator from Section 3.2 (when we assumed a known covariance matrix $I$) is simply (3.10) with $\sigma^2 = 1$.

Now, replacing $\sigma^2$ with the estimate $\hat{\sigma}^2$, define

$\hat{\theta}^{(JS)}_{\hat{\sigma}} = S\hat{\theta} + \left(1 - \frac{a\hat{\sigma}^2}{\hat{q}_1}\right)(I - S)\hat{\theta},$

letting $\hat{q}_1 = \hat{\theta}'(I - S)\hat{\theta}$ as in Section 3.2. Write

$\hat{\theta}^{(JS)}_{\hat{\sigma}} = \hat{\theta} - \hat{\varphi}\hat{\theta}, \quad \text{where} \quad \hat{\varphi} = \frac{a\hat{\sigma}^2}{\hat{q}_1}(I - S).$

Define the James-Stein smoothed-data dissimilarity estimator based on $\hat{\theta}^{(JS)}_{\hat{\sigma}}$ to be:

$d^{(JS)}_{ij,\hat{\sigma}} = \frac{T}{n}\hat{\theta}^{(JS)\prime}_{\hat{\sigma}}\hat{\theta}^{(JS)}_{\hat{\sigma}}.$

Then, analogously to (3.7),

$\Delta_{\hat{\sigma}} = E\left[\left(\hat{\theta}^{(JS)\prime}_{\hat{\sigma}}\hat{\theta}^{(JS)}_{\hat{\sigma}} - \theta'\theta\right)^2\right] - E\left[\left(\hat{\theta}'\hat{\theta} - \theta'\theta\right)^2\right].$

Now, $\hat{\theta}^{(JS)\prime}_{\hat{\sigma}}\hat{\theta}^{(JS)}_{\hat{\sigma}} = \hat{\theta}'\hat{\theta} - 2\hat{\theta}'\hat{\varphi}\hat{\theta} + \hat{\theta}'\hat{\varphi}'\hat{\varphi}\hat{\theta}$, so

$\Delta_{\hat{\sigma}} = E\left[\left(\hat{\theta}'\hat{\theta} - \theta'\theta - 2\hat{\theta}'\hat{\varphi}\hat{\theta} + \hat{\theta}'\hat{\varphi}'\hat{\varphi}\hat{\theta}\right)^2 - \left(\hat{\theta}'\hat{\theta} - \theta'\theta\right)^2\right]$
$= E\left[2\left(-2\hat{\theta}'\hat{\varphi}\hat{\theta} + \hat{\theta}'\hat{\varphi}'\hat{\varphi}\hat{\theta}\right)\left(\hat{\theta}'\hat{\theta} - \theta'\theta\right) + \left(2\hat{\theta}'\hat{\varphi}\hat{\theta} - \hat{\theta}'\hat{\varphi}'\hat{\varphi}\hat{\theta}\right)^2\right].$

Now,

$\hat{\theta}'\hat{\varphi}\hat{\theta} = \frac{a\hat{\sigma}^2\,\hat{\theta}'(I - S)\hat{\theta}}{\hat{q}_1} = a\hat{\sigma}^2$

and

$\hat{\theta}'\hat{\varphi}'\hat{\varphi}\hat{\theta} = \frac{a^2\hat{\sigma}^4\,\hat{\theta}'(I - S)'(I - S)\hat{\theta}}{\hat{q}_1^2} = \frac{a^2\hat{\sigma}^4}{\hat{q}_1}.$

So

$\Delta_{\hat{\sigma}} = E\left[-4a\hat{\sigma}^2\left(\hat{\theta}'\hat{\theta} - \theta'\theta\right) + \frac{2a^2\hat{\sigma}^4}{\hat{q}_1}\left(\hat{\theta}'\hat{\theta} - \theta'\theta\right) + 4a^2\hat{\sigma}^4 - \frac{4a^3\hat{\sigma}^6}{\hat{q}_1} + \frac{a^4\hat{\sigma}^8}{\hat{q}_1^2}\right].$

Since $\hat{\sigma}$ and $\hat{\theta}$ are independent,

$E\left[-4a\hat{\sigma}^2\left(\hat{\theta}'\hat{\theta} - \theta'\theta\right)\right] = E[-4a\hat{\sigma}^2]\,E[\hat{\theta}'\hat{\theta} - \theta'\theta] = E[-4a\hat{\sigma}^2]\,\sigma^2 n$

and

$E\left[\frac{2a^2\hat{\sigma}^4}{\hat{q}_1}\left(\hat{\theta}'\hat{\theta} - \theta'\theta\right)\right] = E\left[\frac{2a^2\hat{\sigma}^4}{\hat{q}_1}\left(\hat{\theta}'(I - S)\hat{\theta} + \hat{\theta}'S\hat{\theta} - \theta'(I - S)\theta - \theta'S\theta\right)\right]$
$= E\left[\frac{2a^2\hat{\sigma}^4}{\hat{q}_1}\left(\hat{q}_1 + \hat{q}_2 - q_1 - q_2\right)\right]$
$= E[2a^2\hat{\sigma}^4]\,E\left[\frac{\hat{q}_2}{\hat{q}_1}\right] + E\left[\frac{2a^2\hat{\sigma}^4(\hat{q}_1 - q_1 - q_2)}{\hat{q}_1}\right] \qquad (3.11)$

by the independence of $\hat{\sigma}$ and $\hat{\theta}$ (and thus of $\hat{\sigma}$ and $\hat{q}_2$).

Since $\hat{q}_1$ and $\hat{q}_2$ are independent, we may write (3.11) as:

$E[2a^2\hat{\sigma}^4]\,E[\hat{q}_2]\,E[1/\hat{q}_1] + E\left[\frac{2a^2\hat{\sigma}^4(\hat{q}_1 - q_1 - q_2)}{\hat{q}_1}\right]$
$= E[2a^2\hat{\sigma}^4]\,(q_2 + \sigma^2 k)\,E[1/\hat{q}_1] + E\left[\frac{2a^2\hat{\sigma}^4(\hat{q}_1 - q_1 - q_2)}{\hat{q}_1}\right]. \qquad (3.12)$

Now, since $a^2\hat{\sigma}^4/\hat{q}_1$ is decreasing in $\hat{q}_1$ and $\hat{q}_1 - q_1 - q_2$ is increasing in $\hat{q}_1$, these quantities have covariance $\le 0$ by the Covariance Inequality (Casella and Berger, 1990, p. 184). Hence

$(3.12) \le E[2a^2\hat{\sigma}^4]\,(q_2 + \sigma^2 k)\,E[1/\hat{q}_1] + E\left[\frac{2a^2\hat{\sigma}^4}{\hat{q}_1}\right]E\left[\hat{q}_1 - q_1 - q_2\right]$
$= E[2a^2\hat{\sigma}^4]\,(q_2 + \sigma^2 k)\,E[1/\hat{q}_1] + E[2a^2\hat{\sigma}^4]\,E[1/\hat{q}_1]\left(\sigma^2(n - k) - q_2\right)$
$= E[2a^2\hat{\sigma}^4]\,E[1/\hat{q}_1]\,\sigma^2 n.$

So

$\Delta_{\hat{\sigma}} \le E[-4a\hat{\sigma}^2]\,\sigma^2 n + E[2a^2\hat{\sigma}^4]\,E\left[\frac{1}{\hat{q}_1}\right]\sigma^2 n + E\left[4a^2\hat{\sigma}^4 - \frac{4a^3\hat{\sigma}^6}{\hat{q}_1} + \frac{a^4\hat{\sigma}^8}{\hat{q}_1^2}\right].$

Note that

$E[a\hat{\sigma}^2] = a\sigma^2,$
$E[a^2\hat{\sigma}^4] = a^2\sigma^4\left(\frac{\nu(\nu+2)}{\nu^2}\right),$
$E[a^3\hat{\sigma}^6] = a^3\sigma^6\left(\frac{\nu(\nu+2)(\nu+4)}{\nu^3}\right),$
$E[a^4\hat{\sigma}^8] = a^4\sigma^8\left(\frac{\nu(\nu+2)(\nu+4)(\nu+6)}{\nu^4}\right).$

Since $\hat{\sigma}^2$ and $\hat{q}_1$ are independent, taking expectations:

$\Delta_{\hat{\sigma}} \le -4a\sigma^4 n + 2a^2\sigma^6 n\left(\frac{\nu(\nu+2)}{\nu^2}\right)E\left[\frac{1}{\hat{q}_1}\right] + 4a^2\sigma^4\left(\frac{\nu(\nu+2)}{\nu^2}\right) - 4a^3\sigma^6\left(\frac{\nu(\nu+2)(\nu+4)}{\nu^3}\right)E\left[\frac{1}{\hat{q}_1}\right] + a^4\sigma^8\left(\frac{\nu(\nu+2)(\nu+4)(\nu+6)}{\nu^4}\right)E\left[\frac{1}{\hat{q}_1^2}\right].$

Define $\hat{q}_1^* = \hat{q}_1/\sigma^2 = (\hat{\theta}/\sigma)'(I - S)(\hat{\theta}/\sigma)$. Since $\hat{\theta} \sim N(\theta, \sigma^2 I)$, $(\hat{\theta}/\sigma) \sim N(\zeta, I)$ where $\zeta = \theta/\sigma$. Then the risk difference is a function of $\zeta$, and the distribution of $\hat{q}_1^*$ is a function of $\zeta$ and does not involve $\sigma$ alone. Then

$\Delta_{\hat{\sigma}} \le -4a\sigma^4 n + 2a^2\sigma^4 n\left(\frac{\nu(\nu+2)}{\nu^2}\right)E\left[\frac{1}{\hat{q}_1^*}\right] + 4a^2\sigma^4\left(\frac{\nu(\nu+2)}{\nu^2}\right) - 4a^3\sigma^4\left(\frac{\nu(\nu+2)(\nu+4)}{\nu^3}\right)E\left[\frac{1}{\hat{q}_1^*}\right] + a^4\sigma^4\left(\frac{\nu(\nu+2)(\nu+4)(\nu+6)}{\nu^4}\right)E\left[\frac{1}{\hat{q}_1^{*2}}\right].$

We may now divide the inequality through by $\sigma^4 > 0$:

$\frac{\Delta_{\hat{\sigma}}}{\sigma^4} \le -4an + 2a^2 n\left(\frac{\nu(\nu+2)}{\nu^2}\right)E\left[\frac{1}{\hat{q}_1^*}\right] + 4a^2\left(\frac{\nu(\nu+2)}{\nu^2}\right) - 4a^3\left(\frac{\nu(\nu+2)(\nu+4)}{\nu^3}\right)E\left[\frac{1}{\hat{q}_1^*}\right] + a^4\left(\frac{\nu(\nu+2)(\nu+4)(\nu+6)}{\nu^4}\right)E\left[\frac{1}{\hat{q}_1^{*2}}\right].$

Now, since $(I - S)$ is idempotent, $\hat{q}_1^*$ is noncentral $\chi^2_{n-k}(\zeta'(I - S)\zeta)$. Recall that the noncentral $\chi^2$ distribution is stochastically increasing in its noncentrality parameter, so we may replace this parameter by zero and obtain the upper bound

$E[(\hat{q}_1^*)^{-m}] \le E[(\chi^2_{n-k})^{-m}], \quad m = 1, 2.$

Hence

$\frac{\Delta_{\hat{\sigma}}}{\sigma^4} \le \Delta_U = -4an + 2a^2 n\left(\frac{\nu(\nu+2)}{\nu^2}\right)E[(\chi^2_{n-k})^{-1}] + 4a^2\left(\frac{\nu(\nu+2)}{\nu^2}\right) - 4a^3\left(\frac{\nu(\nu+2)(\nu+4)}{\nu^3}\right)E[(\chi^2_{n-k})^{-1}] + a^4\left(\frac{\nu(\nu+2)(\nu+4)(\nu+6)}{\nu^4}\right)E[(\chi^2_{n-k})^{-2}].$

If $n - k > 4$, then $E[(\chi^2_{n-k})^{-1}] = 1/(n - k - 2)$ and $E[(\chi^2_{n-k})^{-2}] = 1/[(n - k - 2)(n - k - 4)]$. So taking expectations, we have:

$\frac{\Delta_{\hat{\sigma}}}{\sigma^4} \le \Delta_U = \left(\frac{\nu(\nu+2)(\nu+4)(\nu+6)}{\nu^4(n - k - 2)(n - k - 4)}\right)a^4 - \left(\frac{4\nu(\nu+2)(\nu+4)}{\nu^3(n - k - 2)}\right)a^3 + \left(\frac{2\nu(\nu+2)n}{\nu^2(n - k - 2)} + \frac{4\nu(\nu+2)}{\nu^2}\right)a^2 - 4na.$

Note that if $\Delta_U$ (which does not involve $\sigma$) is less than 0, then $\Delta_{\hat{\sigma}} < 0$, which is what we wish to prove.
Theorem 3.3 Suppose that 8 ~ N(8, a2I) with a2 unknown and that there exists
a random variable S2, independent of 6, such that S2/a2 ~ X2, and let &2 = S2/v.
If n k > 4, then the nontrivial real root r of Av(a) = 0 is positive. Furthermore,
for any choice a E (0, r), the upper bound Au < 0, implying A < 0. That is,
for 0 < a < r, the risk difference is negative and the smoothed-data dissimilarity
estimator dij,( is better than di.
Proof of Theorem 3.3: Note that A (a) = 0 may be written as c4a4 + c3a3 +
c2a2 + cla = 0, with c4 > 0, c3 < 0, c2 > 0. c1 < 0. The proof then follows exactly as
the proof of Theorem 3.2 in Section 3.2. O
3.4 A Bayes Result: d'"mooth) and a Limit of Bayes Estimators
In this section, we again assume S, the smoothing matrix, to be symmetric and
idempotent. We again assume 6~ N(8, o2I) and again, without loss of generality,
let a2I = I. (Otherwise, we can let, for example, if = a-16 and q = a-10 and
work with ff and I7 instead.) Also. assume that 0 lies in the linear subspace (of
dimension k < n) that S projects onto, and hence SO = 0. We will split f(I10)





37

into a part involving (I S)0 and a part involving SO.

f(010) cx exp[-(1/2)(6 0)'(0 )]

S exp[-(1/2)(6 0)'(I S + S)(6 0)]

S exp{-(1/2)[0'(I S)0 + (6 0)'S(6 0)]}

which has a part involving (I S)6 and a part involving S6.
Since S is symmetric, then for some orthogonal matrix P, PSP' is diagonal.
Since S is idempotent, its eigenvalues are 0 and 1. so PSP' = B where


B= Ik 0
0 0

Let p'z,...pn be the rows of P. Then 0* = PO = (p'0,..., p',0,0...,0)'.
And let 0 be the vector containing the k nonzero elements of 0*. Similarly,
* = Pe.
Now we consider the likelihood for 0:



L(010) ca exp{-(1/2)['(I S) + ( 0)'S( 0)]}

ca exp{-(1/2)[( 0)'S(6 0)]}

= exp{-(1/2)[(O 0)'P'PSP'P(O 0)]}

= exp{-(1/2)[(6* 0*)'PSP'(6* 0*)]}

= exp{-(1/2)[(O* 0*)'B(6* 0*)]}.

Now we put a N(0, V) prior on 0*:

r(0*) oc exp[-(1/2)0*'V-0*],





38



V- = V 0
0 0

where V1 is the full-rank prior variance of 6. Note that this prior is only a proper
density when considered as a density over the first k elements of 9*. Recall that
the last n k elements of 9* are zero.
Then the posterior for 0* is

7r(90**) oc exp{-(1/2)[(0* 0*)'B(9* 0*) + 9*'VV-*]}

= exp{-(1/2)[b*'B6* 20*'BO* + 0*'(B + V-)O*]}

oc exp{-(1/2)[9*'B'(B + V-)-B9* 20*'B* + O*'(B + V-)*]}

= exp{-(1/2)[(O* (B + V-)-B*)'(B + V-)(0*- (B+ V-)-B*)]}.

Hence the posterior 7r(08*) is N[(B + V-)-B0*, (B + V-)-], which is
N[(B + V-)-BP6, (B + V-)-].
Thus the posterior expectation of T9'9 is

E -0'0) = E T('P'PO) = E (9*'9*
nn ( n
ST*'B(B + V-)-(B + V-)-BP* + tr (TI(B + V-)-

n \n
T- 6'P'B(B + V-)-(B + V-)-BP6 + tr (-I(B + V-)-

= T'SP'(B + V-)-(B + V-)-PS + tr (T(B + V-)-)
n \n

Let us choose the prior variance so that V- = IB. Then

E( T') 0 T6'SP' ( B ( B) PSO + tr(T(n+
n n n n n n
T n n T n
n n+1 /n+1 \n n+ ))
= T'SP' B PSn tr( n 1
n \(n+1)2 n





39

and since P'BP = S,



(n n (n+l)2 \n\+r +l
This posterior mean is the Bayes estimator of d1j.

Since tr(B) = k, note that tr( (- B)) = tr(- B) = -+ 0 as n -- oo.
2
Also, 2 --+ 1 as n --+ o.

Then if we let the number of measurement points n -+ oo in a fixed-domain

sense, these Bayes estimators of dij approach dimoot) = T'SS in the sense that

the difference between the Bayes estimators and dm oth) tends to zero. Recall that

dij = O'O, the approximate distance between curve i and curve j, which in the
limit approaches the exact distance 6ij between curve i and curve j.












CHAPTER 4
CASE II: DATA FOLLOWING THE FUNCTIONAL NOISE MODEL

Now we will consider functional data following model (2.2). Again, we assume

the response is measured at n discrete points in [0, T], but here we assume a

dependence among errors measured at different points.

4.1 Comparing MSEs of Dissimilarity Estimators when 0 is in the
Linear Subspace Defined by S

As with Case I, we assume our linear smoothing matrix S is symmetric and

idempotent, with r(S) = tr(S) = k. Recall that according to the functional noise

model for the data, 0 ~ N(O, E) where the covariance matrix E corresponds, e.g.,

to a stationary Ornstein-Uhlenbeck process.

Note that T1'6 represents the approximate L2 distance between observed

curves yi(t) and yj(t), and that, with an Ornstein-Uhlenbeck-type error structure, it

approaches the exact distance as n -+ oc in a fixed-domain sense.

Also, TO'S'SO = TO'SO represents the approximate L2 distance between

smoothed curves fi(t) and Lj(t), and this approaches the exact distance as n -+ oo.

We wish to see when the smoothed-data dissimilarity better estimates the true

dissimilarity 6ij between curves yi1(t) and pj(t) than the observed-data dissimilarity.

To this end let us examine MSE(TO'SO) and MSE(T 'o) in estimating TO'0

(which approaches 6ij as n -+ oo).

In this section, we consider the case in which 0 lies in the linear subspace
defined by S, i.e., SO = 0. We now present a theorem which generalizes the result
of Theorem 3.1 to the case of a general covariance matrix.

Theorem 4.1 Suppose the observed 6 ~ N(O, E). Let S be a symmetric and

idempotent linear smoothing matrix of rank k. If 0 lies in the linear subspace


40





41

defined by S, then the dissimilarity estimator TG'SO has smaller mean squared
error than does T6'6 in estimating O'0. That is, the dissimilarity estimator based
on the smooth is better than the one based on the observed data in estimating the
dissimilarities between the underlying signal curves.
Proof of Theorem 4.1:

MSE( T = E{ [Tbi ,6 T 2,









-MSE (-'S) = E{ -['0s -e' ] }
= var (T'6) + E\[T'] -E0 O
n n n

T-i2tr TlE2)+40' O+[t( ) T T 4.1)


= -2 2tr(SE2) + 40'S'EO + { tr(ES) + T- 'S'S -0'
n2 n n n









T2(E2) (E)12 }
-= 2tr[(S)2 + 40'EO + [tr(~)]2 (4.1)






Hence, if we compare (4.2) with (4.1), we must show that tr(SlI=) < tr(N) and
MSE T 'S6) = E -'SO -T' 0









tr[(SC)2] < tr[(N)2] to complete the proof.
n n n J





=tr() tr(S} + (I- S)) '
n n n I

= tr(SE)] + 40'SES + tr(SE) SS 0'0
n n n n I

= 2tr[(SE)2] + 40'EO + [tr(SE)]2 (4.2)

Hence, if we compare (4.2) with (4.1), we must show that tr(SE) < tr(E) and
tr[(SE)2] < tr[(E)2] to complete the proof.

tr(E) = tr(SE + (I S)E)
= tr(SE) + tr((I S)E)
= tr(S) + tr((El/2(I S)(I S)E1/2)
> tr(SE)





42

since the last term is the sum of the squared elements of (I S)E1/2.

tr(E2) = tr[(SE + (I-S)E)2]

= tr[(SE)2] + tr[SE(I S)E] + tr[(I S)ESE] + tr[((I S)E)2]

= tr[(SE)2] + tr[SSE(I S)(I S)E] + tr[E(I S)(I S)ESS]

+ tr[(I S)(I S)E]

= tr[(SE)2] + 2tr[SE(I- S)(I- S)ES]

+ tr[E'2(I S)E1/2E1/2(I S)E/2]

Str[(SE)2]

since tr[SE(I S)(I S)ES] is the sum of the squared elements of (I S)ES and
the last term is the sum of the squared elements of 1/2(I S)E1/2.
Hence MSE(TO'SO) < MSE(TO'd). Since (I- S)E1/2 (which has rank n k)
is not the zero matrix, the inequality is strict. l
4.2 James-Stein Shrinkage Estimation in the Functional Noise Model
Now we will consider functional data following model (2.2), and in this
section we do not assume 0 lies in the subspace defined by S. Again, we assume
the response is measured at n discrete points in [0, T], but here we assume a
dependence among errors measured at different points.
In Section 3.2, we obtained an exact upper bound for the difference in risks
between the smoothed-data estimator and observed-data estimator. In this section
we will develop an asymptotic (large n) upper bound for this difference of risks.
The upper bound is asymptotic here in the sense that the bound is valid for
sufficiently large n, not necessarily that the expression for the bound converges to a
meaningful limiting expression for infinite n.





43

Note that as the number of measurement points n grows (within a fixed

domain), assuming a certain dependence across measurement points (e.g., a corre-

lation structure like that of a stationary Ornstein-Uhlenbeck process), the observed

data y1,..., YN closely resemble the pure observed functions yl(t),..., yN(t) on

[0, T]. Therefore this result will be most appropriate for situations with a large
number of measurements taken on a functional process, so that the observed data

vector is "nearly" a pure function.

As with Case I, we assume our linear smoothing matrix S is symmetric and

idempotent, with r(S) = tr(S) = k. Recall that according to the functional noise

model for the data, 0 ~ N(0, E) where E is, for example, a known covariance

matrix corresponding to a stationary Ornstein-Uhlenbeck process. In this section,

we will assume a Gaussian error process whose covariance structure allows for the

possibility of dependence among the errors at different measurement points.

We consider the same James-Stein estimator of 0 as in Section 3.2, namely
6(JS) (in practice, again, we use the positive-part estimator (Js)), and the same

James-Stein dissimilarity estimator.

Recall the definitions of the following quadratic forms:


q1 = 0'(I-S)0

2 = O'sO

q1 = O'(I-S)O

q2 = O'SO.

Unlike in Section 3.2 when we assumed an independent error structure, under

the functional noise model, 41 and 42 do not have noncentral X2 distributions, nor

are they independent in general.





44

In this same way as in Section 3.2,,we may write A as:

1 (a24 2 413
A = E -4a[1 + 2] + 4a2 + +4a(qg + q2)
2a2
+ (41 + 42 q9 q2)

Note that

E[41 + 42] = q + q2 + tr[(I S)E] + tr[SE]

= + q2 + tr(S).

Hence

-4atr(E) + 4a2+ E a2 2 + 2a (1 +42 qa q2)
L9-q 9q2 9)
Consider

E[1 = e' K se .
Lieberman (1994) gives a Laplace approximation for E[(x'Fx/x'Gx)k], k > 1,
where F is symmetric and G positive definite:
[(' x'Fx E[(x'Fx)k

\xGx ) [E(x'Gx)]k

In our case, (I S) is merely positive semidefinite, but with a simple regularity
condition, the result will hold (see Appendix B.2). The Laplace approximation will
approach the true value of the expected ratio, i.e., the difference between the true
expected ratio and the approximated expected ratio tends to 0 as n -- oo;
Lieberman (1994) gives sufficient conditions involving the cumulants of the
quadratic forms which establish rates of convergence of the approximation.
Hence

2 = [ 'e E(6'SO) q2 +tr(SE)
E =E ) E((I S)) tr[I-S)]
41 6'(I S)OJ E('(I S)6) q + tr(I S)E]





45

Hence we have a large n approximation, or asymptotic expression, for A:

A[a4 4a3 2a2(q2 + tr(SE)) 2a2q 2a2q2
A a -4atr(E)+4a2+E [ + 2a2 +qltr[I-S)E]
19 1+ 9q + tr[(I S)] 1 41
[ a. 1 ( q2 + tr(SE) 2a + q + q2]
= -4atr(E) + 4a2 + E + 2a 1 + 2+= .
q2 q + tr[(I S)ME] 41
Since by Jensen's Inequality, E(1/41) > 1/E(41), we can bound the last term above:
E22(1 q2+tr(SE) 2a+q+ql +q2
Iq + tr[(I S)E] 41

S22 qi + tr[(I- S)E] q + tr[(I- S)]
= 2a2(q, + tr[(I S)E] + tr(SE) ql 2a

= 2a2( tr(E) 2a
qg + tr[(I S)]
( tr(E) 2a
ktr[(I-S)E]
Hence we have the asymptotic upper bound for A:

4a2 4atr(E) + E 4] + 2a2 ( tr() .2a
Ir ) \tr[(I- S)E]-
Denote the eigenvalues of E1/2(I-S)E1/2 by c,..., cn. Since (I-S) is positive
semidefinite, it is clear that cl,..., c, are all nonnegative. Note that cl,..., cn are
also the eigenvalues of (I S)E, since E1/2(I S)E1/2 = 1/2( S)ZE-1/2 and
(I-S) are similar matrices. Note that since r[El/2(I-S)E1/2] = r(I-S) = n-k,
1/2(I S)E1/2 has n k nonzero eigenvalues, and so does (I S)E.
It is well known (see, e.g., Baldessari, 1967; Tan, 1977) that if y ~ N(p, V)
for positive definite, nonsingular V, then for symmetric A, y'Ay is distributed as a
linear combination of independent noncentral X2 random variables, the coefficients
of which are the eigenvalues of AV.





46

Specifically, since q1 = 0'(I S)0, and (I S)E has eigenvalues c,..., ,c, then
n

i=1
for some noncentrality parameter 62 > 0 which is zero when 0 = 0.
Under 0 = 0, then, 41 ~ = cE' x, a linear combination of independent central

X2 variates.
Since a noncentral X, random variable stochastically dominates a central 1:,
for any 62 > 0,

P[X2(62) 2 x] > P[x2 > x] Vx > 0.

As shown in Appendix A.2, this implies

P[cLX'(62) + + c1X (62) > X] > P[cix + + CnX >_ x] V X > 0.

Hence for all x > 0, Peoo[ql 2 x] > Po[q1 > x], i.e., the distribution of 41 with
8 # 0 stochastically dominates the distribution of 41 with 0 = 0. Then


EWgol(4j)-" ] < Eo[(4i)-"]

for m = 1, 2,....
Letting M2 = Eo[(q1)-2], we have the following asymptotic upper bound for A:


2a2tr( E) 4a3
A* = 4a2 4atr(E) + M2a4 +
tr[(I- S)E] tr[(I- S)E]
4 3 2tr(E)
= M2a' tr[(I- S) 3 + (t a2 4tr(E)a.
tr[(I- S)E] tr [(I S)E] +(4)aa

Then M2 = E[(E 1 ciX)-2]. We see that if n k > 4, then M2 exists and is
positive since
1 E[ [(X1 1_1 )
Wmax Xn-k Wmin Xn-k





47

where wme is the smallest and Wma the largest of the n k nonzero eigenvalues of

(I S)E.

As was the case in the discrete noise situation, A (a) = 0 is a fourth-degree

equation with two real roots, one of which is trivially zero. Call the nontrivial real

root r.

Theorem 4.2 If n k > 4, then the nontrivial real root r of A*,(a) = 0 is positive.

Furthermore, for any choice a E (0, r), the asymptotic upper bound A* < 0,

implying A < 0. That is, for 0 < a < r, and for sufficiently large n, the risk

difference is negative and the smoothed-data dissimilarity estimator dls) is better

than dij.

Proof of Theorem 4.2: Since E is positive definite, tr(E) is positive. Since

M2 > 0, we may write Ay(a) = 0 as c4a4 + c3a3 + c2a2 + cia = 0, where

c4 > 0, C3 < 0, c2 > 0, C1 < 0. The proof then follows exactly from the proof of

Theorem 3.2 in Section 3.2. E

As with the discrete noise situation, one can easily verify, using a symbolic

algebra program, that two of the roots are imaginary, and can determine the

nontrivial real root in terms of M2, tr[(I S)E], and tr(E).

As an example, let us consider a situation in which we observe functional data

yl,..., YN measured at 50 equally spaced points tl,..., to0, one unit apart. Here,
let us assume the observations are discretized versions of functions yi(t),..., yN(t)

which contain (possibly different) signal functions, plus a noise function arising

from an Ornstein-Uhlenbeck (O-U) process. We can then calculate the covariance

matrix of each y, and the covariance matrix E of 0. Under the O-U model,

tr(S) = -n always, since each diagonal element of E is 2

For example, suppose the O-U process has a2 = 2 and / = 1. Then in this

example, tr(E) = 50. Suppose we choose S be the smoothing matrix corresponding

to a B-spline basis smoother with 6 knots, dispersed evenly within the data. Then





48

we can easily calculate the eigenvalues of (I S)E, which are cl,..., c,, and via

numerical or Monte Carlo integration, we find that M2 = 0.0012 in this case.

Substituting these values into Ay(a), we see, in the top plot of Figure 4-1,

the asymptotic upper bound plotted as a function of a. Also plotted is a simulated

true A for a variety of values (n-vectors) of 0: 0 x 1, (0, 1, 0, 1, 0,..., 0,1)', and

(-1,0, 1, -1,0, 1,..., -1,0)', where 1 is a n-vector of ones. It should be noted that
when 0 is the zero vector, it lies in the subspace defined by S, since SO = 8 in that

case. The other two values of 0 shown in this plot do not lie in the subspace. In

any case, however, choosing the a that optimizes the upper bound guarantees that
d.(Js) has smaller risk than dj.

Also shown, in the bottom plot of Figure 4-1, is the asymptotic upper bound

for data following the same Ornstein-Uhlenbeck model, except with n = 30.

Although A* is a large-n upper bound, it appears to work well for the grid size of

30.

4.3 Extension to the Case when a2 is Unknown

In Section 3.2, it was proved that the smoothed-data dissimilarity estimator,

with the shrinkage adjustment, dominated the observed-data dissimilarity estimator

in estimating the true dissimilarities, when the covariance matrix of 0 was a2I,

a2 known. In Section 3.3, we established the domination of the smoothed-data

estimator for covariance matrix a2I, a2 unknown.

In Section 4.2, we developed an analogous asymptotic result for general known

covariance matrix Z. In this section, we extend the asymptotic result to the case

of 0 having covariance matrix of the form V = a2E, where a2 is unknown and E

is a known symmetric, positive definite matrix. This encompasses the functional

noise model (2.2) in which the errors follow an Ornstein-Uhlenbeck process with

unknown a2 and known f. (Of course, this also includes the discrete noise model






49










DetaU for O-U Example (n = 50)






o








0 50 100 150
a


DeltaU for O-U Example (n = 30)







0 ----- -------------- -i------------*--- --------------



o t '



0 20 40 60
a


Figure 4-1: Plot of asymptotic upper bound, and simulated A's, for Ornstein-
Uhlenbeck-type data.
Solid line: plot of A, against a for above O-U process. Dashed line: simulated A,
0 = 0 x 1. Dotted line: simulated A, 0 = (0, 1, 0. 1, 0,..., 0, 1)'. Dot-dashed line:
simulated A, 0 = (-1, 0, 1, -1, 0, 1,..., -1, 0)'. Top: n = 50. Bottom: n = 30.





50

(2.1), in which V = a2I, but this case was dealt with in Section 3.3, in which an
exact domination result was shown.)
Assume 0 N(0, V), with V = a2E. Suppose. as in Section 3.3, that there
exists a random variable S2, independent of 0, such that
S2 ,
"2 X .

Then let a2 = S2/v.
As shown in Section 3.3, we may write

A = E[-2 2a&2 a4(0' 0'0) + (2a2 a20-4)2].

Since & and 0 are independent.

E[-4aa2(9'_ 8'8)1 = E[-4a&2]E[#' 8'8] = E[-4a&2]a2tr(E)

and

E 2a2&4 (0'6 O'0)]
[2a 1
= E 2a4 '(I S) + 'S '(I S) O'SO)

r[2a22(&4 (
= E --q- + (2 -i q- 2)

= E[2a2& ]E + E [2a (41 q q2) (4.3)

by the independence of & and 6 (and thus & and 42). We use the (large n) approx-
imation to E[42/i4] obtained in Section 4.2 to obtain an asymptotic upper bound
for (4.3):
( q2 + 2tr(S )) 2a (4.4)
E[2a2&4] q2 + E2tr(I (1 q 2) (4.4)
(q, + o,2tr((I S) Ej ) 4





51
Now, since a2&4/41 is decreasing in 41 and qi qi q2 is increasing in 41, these
quantities have covariance < 0. Hence
( q2+ +2tr(S]E) 2 [2a
(4.4) < E[2a~ (4] q2 + 0 tr(SE) + E 2a (41 q q2]
Sqi2tr(I -S)Z] J
E[2a2&4] ( q2 + 2tr(S) + E 2a2 (2tr[(I S)E] q)
q + a2tr[(I S) 1
E[2a2&4]q2 E[2a26&4]a2tr(SE) [2a2&4] tr[(I )




Since by Jensen's InequalityE /E E[/] 0,
(4I +. a2tr[(I S) + 2tr[(I S)E] 1



SE[ 41 tr[(I S) + a2tr[(I S)
_E[2a22&]
+ E 2a2tr[(I 2 trjI S)E2].



(4.4) < E[224] rtr(S) E 2a242tr[( S)E]
q- +qa2tr[(I- S)E] 41
< E[2a2&41 tr(s ) +E 2a2 4]2tr[(I- S)E]
tr[(IL S)+)] + l-
Our asymptotic upper bound for a& is:

A*u = E[-4aa2]a2tr(E) + E[2a24] tr(SE) + E 2a-alrI S)E
tr[(I S)E2 q,
+ E 2a 2 a- 24l J

=E -4a22tr(E) + (a2+4 2r- tr(SE)
=tl rtr[(I- S)E]
+ 4a4 4a3&6 a4&8]
+-i+





52

Recall that

E(a~i] = aa2
E[a &4] = a2a4((v -2)

E(a3&6] = a3(v(v 2)(v + 4)

E[a4&8] = a48Y((v +2)(v+4)( +6))
V4
Since 62 and i1 are independent, taking expectations:

A = -4au4tr(E) + 2a -E 1- o2tr[(I S

+ 2a 2 u4(v(v + 2) tr(SE) 4a24( + 2)
v2 tr [(I S) ] v2
-4a v(v+2)(v+4) E[]- 48 V(V+2)(v + 4)(v + 6)4 [1 E
V34a3( (V+2(v ) q.[ V4 E-a (
As in Section 3.3, define 4q = 41/T2 = (0/o)'(I S)(60/). Here, since
9 N(0, ,2E), (6/a) N((, E) where C = 0/a. Then the risk difference is a
function of C, and the distribution of 4q is a function of (. Then

A = -4a4tr(E) + 2a2 4 (( + 2)) [ t(I-S)E

+ 2a 2u4 (v(v + 2) tr(SE) 4a4 (+ 2)\
v2 tr[(I S)E] ( 2 )
3 v(v+2) (v4\) 1( 4 + 2)(v+4)(v+6)\ [1
-4a u")E[ 1- +auor )E[
V3 qV4 T
We may now divide the inequality through by a4 > 0.

S-= -4atr(E) + 2a ( ) E (I- S)E
2a2v( + 2) tr(SE) (4a + 2)
a v2 tr[ (I- S)E a YV2
-4a3(V(V + 2)(v + 4) ) ] v(v + +2)(v + 4)( + 6)- E I
V3 E + a4 .\2





53

As shown in Section 4.2. for all x > 0, Peoo[q1 > x] > Po[ql > x]. Note that
this is equivalent to: For all x > 0, Pcto[q* > x] >_ Po[q4 > x]. Then

E(o[(4)-m}] < Eo[(4;)-m]

for m = 1, 2.
Let M{ = Eo[(q{)-'] and M2 = Eo[(*)-2]. Then

j'v(t'+ 2) 2 v(v + 2) tr(S1E) a
A*&a < -4tr(E)a+ 2 + + 2) M*tr[(I S)Va2 + 2 tr(S)a
aO4 v2 v2 tr[(I-S)E

+4(v(v + 2) a2 (v+ 2)(v + 4) M a3

( v( + 2)(v + 4)(v + 6) Ma4
(v(v +2)(v + 4)(v + 6) 4 v(v+2)(v+4) a 3
V V4
(v(v + 2) 2tr(SE)
S(v 2)(2Mitr[(I- S)E] + +4)) a 4tr()a.
v+ tr[(I- S)E]
Note that if this last expression (which does not involve a and which we denote
simply as A,(a)) is less than 0, then Agu < 0. which is what we wish to prove.
We may repeat the argument from Section 4.2 in which we showed that when
0 = 0, 1i ~ Y-= cX, replacing 4q with q(, 0 with O/a. and 0 with C. Thus we
conclude that when 0 = 0, q( ~ i- vix, where vi,. .. v, are the eigenvalues of
(I S)E.
Again, if n k > 4, then Mj* and M2 exist and are positive, as was shown in
Section 4.2.
Again, Au(a) = 0 is a fourth-degree equation with two real roots, one of which
is trivially zero. Call the nontrivial real root r.
Theorem 4.3 Suppose the same conditions as Theorem 3.3, except generalize
the covariance matrix of 0 to be a2E, a2 unknown and E known, symmetric, and
positive definite. If n k > 4, then the nontrivial real root r of Ayu(a) = 0
is positive. Furthermore, for any choice a E (0, r), the asymptotic upper bound





54

A y < 0, implying A < 0. That is, for, 0 < a < r. and for sufficiently large n,
the risk difference is negative and the smoothed-data dissimilarity estimator ,s) is
better than d2i.
Proof of Theorem 4.3: Since E is positive definite. tr(E) is positive. Since
Mi > 0 and M.; > 0, we may write Ay (a) = 0 as c4a4 + c3a3 + c2a2 + a = 0.
where c4 > 0. ca < 0. c2 > 0, C1 < 0. The proof is exactly the same as the proof of
Theorem 4.2 in Section 4.2. -

4.4 A Pure Functional Analytic Approach to Smoothing

Suppose we have two arbitrary observed random curves yi(t) and yj(t), with
underlying signal curves pi(t) and pj (t). This is a more abstract situation in which
the responses themselves are not assumed to be discretized. but rather come to
the analyst as "pure" functions. This can be viewed as a limiting case of the
functional noise model (2.2) when the number of measurement points n -4 oo in a
fixed-domain sense. Throughout this section, we assume the noise components of
the observed curves are normally distributed for each fixed t, as, for example, in an
Ornstein-Uhlenbeck process.
Let C[0, T] denote the space of continuous functions of t on [0, T]. Assume we
have a linear smoothing operator S which acts on the observed curves to produce
a smooth: /i(t) = Syi(t), for example. The linear operator S maps a function

f E C[0, T] to C[0. T] as follows:


Sf](t) = K(t, s)f(s)ds

(where K is some known function) with the property that S[af + 3g] = aS[f] +

3S[g] for constants a, 3 and functions f,g E C[0, T] (DeVito, 1990, p. 66).
Recall that with squared L2 distance as our dissimilarity metric, we denote
the dissimilarities between the true, observed, and smoothed curves i and j,
respectively, as follows:





55


6ij = / (t) j(t)] dt = 1 pi(t) p (t)|2 (4.5)
0o

S= [y(t) yj(t)]2 dt = ||y1(t) y(t)\2 (4.6)

~!mooth) = [i(t) (t)]2 dt = SyI(t) Syj(t)~ (4.7)

Recall that for a set of points ti,...,t, in [0, T], we may approximate (4.5)-
(4.7) by


dij = T0'
n
T
cj = -O'8

dj8mooth) -s'sg
t3 n

where d i -+ 6i when the limit is taken as n -+ oo with ti,..., t, [0, T].
In this section, the smoothing matrix S corresponds to a discretized analog of
S. That is, it maps a discrete set of points y(tl),..., y(t,) on the "noisy" observed
curve to the corresponding points A(ti),..., (tn) on the smoothed curve which
results from the application of S. In particular, in this section we examine how
the characteristics of our self-adjoint S correspond to those of our symmetric S as
n oc in a fixed-domain sense.
Suppose S is self-adjoint. Then for any x, y,

(x, Sy) = I x(t) K(t s)y(s)dsdt = (Sx,y) = gjK(t,s)x(s)dsy(t)dt.

Therefore for points tl, .... t, E [0, T], with n oo in a fixed-domain sense,

lim T [T x(ti)K(t,, s)y(s) ds + ** -+ f x(t)K(t,, s)y(s) ds

n-=oo n JO (t ,.





56

Now, for a symmetric matrix S, for,any x. y, (x, Sy) = (Sx, y) < x'Sy =
x'S'y. At the sequence of points tl,...t,n, x = (x(t1),....x(t,))', and for our
smoothing matrix S,

Sy (S[y](ti).... S[y](t,))' = ( K(t,,s)y(s)ds.... K(t, s)y(s) d .

So

x'Sy = x(t)K(tl,s)y(s)ds +.+ +TX(tn)K(tn,s)y(s)ds

= xS'y = K(ti, s)x(s)y(ti) ds + ... + K(tn, s)x(s)y(t) ds



[T T
and

(T/n) x(tt)K(ti, s)y(s) ds +-- + x(tn)K(tn, s)y(s) ds

= (T/n) K(tt, s)x(s)y(ti) ds +. + f K(tn, s)z(s)y(tn) ds

So taking the (fixed-domain) limits as n -* oo, we obtain (4.8), which shows that
the defining characteristic of symmetric S corresponds to that of self-adjoint S.
Definition 4.1 S is called a shrinking smoother if IlSyll 5< llyl for all elements y
(Buja et al., 1989).
A sufficient condition for S to be shrinking is that all of the singular values of
S be < 1 (Buja et al., 1989).
Lemma 4.1 Let A be a symmetric matrix and E be a symmetric, positive definite
matrix. Then
x'A'EAx
sup [emax(A)]2
x X EX
where emax(A) denotes the maximum eigenvalue of A.
Proof of Lemma 4.1:





57

Note that

x'A' Ax
sup x''x emax(E-'iA'E1 21/2AE-/2).
x X IX

Let B = E /2AE-1/2. Then this maximum eigenvaiue is emax(B'B) < emax(A'A) =

[emax(A)]2 by a property of the spectral norm. c
Lemma 4.2 Let the linear smoother S be a self-adjoint operator with singular
values E [0, 1]. Then var(T JISO|2) var(T|li0l2). where S is symmetric with
singular values E [0, 1]. (Recall 0 = y, yj.)
Proof of Lemma 4.2:

We will show that, for any n and for any positive definite covariance matrix
E of 0. var(T!ISO)l2) < var(T lO(2) where S is symmetric. (A symmetric matrix
is the discrete form of a self-adjoint operator (see Ramsay and Silverman, 1997.

pp. 287-290).)





4 var (ISO112) var(T|6 12)

4 var(6'S'S6) < var(6'O)

2 2tr[(S'SE)2] + 40'S'SES'SO < 2tr[(E)2] + 40'E0.

Now,
0'S'SECS'S0
40'S'SES'S0 < 40'E0 eS'SE S < 1
O'EO -
and so we apply Lemma 4.1 with S'S playing the role of A. Since the singular
values of S are E [0, 1], all the eigenvalues of S'S are E [0, 1]. Hence we have

O'S'SES'SO
sup < 1.
0 0' E -





58

So we must show tr[(S'SE)2] < tr[(E)2] for any E.

tr[(E)2] = tr[(S'SE + (I S'S)E)2]
= tr[(S'SE)2] + tr[((I S'S)E)2] + tr[S'SE(I S'S)EI

+ tr[(I S'S)ES'SE
= tr[(S'SE)2] + tr12(I S)( S'S)E(I- S'S)EI/21

+ tr[SE(I S'S)ES] + tr[SE(I S'S)ES'J
= tr[(S'SE)2] + tr[EI/2(I S'S)E(I S'S)E1/]

+ 2tr[SE(I- S'S)ES]
= tr[(S'SE)2] + tr[/2(I S'S)E/2'/2(I S'S)E12]

+ 2tr[SE(I S'S)'2(I S'S)1/2ES]

> tr[(S'SE)2]

since the last two terms are > 0. This holds since tr[E'/2(I S'S)EX/2E/(I -
S'S)E1/2] is the sum of the squared elements of the symmetric matrix 1/2(I -
S'S)E1/2. And tr[SE(I S'S)'/2(I S'S)'/2S] is the sum of the squared elements
of (I S'S)1/2ES. C]
We now extend the result in Lemma 4.2 to functional space.
Corollary 4.1 Under the conditions of Lemma 4.2, var[8ij] varf6m'th)].
Proof of Corollary 4.1:
Note that since almost sure convergence implies convergence in distribution,
we have

a,, = Tee 4 Jj
T ^ ij

sdmooth) 4 ooth)

Consider E[Idij'3] = E[(dij)3j = -E[( '0)3]. We wish to show that this is
bounded for n > 1.





59

Now ~ X (A). where A = 0'0 here. The third moment of a noncentral

chi-square is given by

n3 + 6n2 + 6nA + 8n + 36nA + 12nA2 + 48A + 48A2 + 8A3.

Hence 3E[(0'0)3] =

T3[1 + 6/n + 6A/n + 8/n2 + 36A/n2 + 12A2/n2 + 48A/n3 + 48A2/n3 + 8A3/n3]

Now, A/n is bounded for n > 1 since T8O' is the discrete approximation of the

distance between signal curves ~s(t) and pj(t), which tends to 6ij. Hence the third

moment is bounded.

This boundedness. along with the convergence in distribution, implies (see

Chow and Teicher. 1997, p. 277) that

lim E[(dij)k] = E[(j)] < o
n-*oo

for k = 1,2.

So we have

lim var[dij]
n-oo
= limE[(dj)2] lim[E(d,)]2

= limE [(dj)2j [lim E(dj)]2

= E[(Sij)2] -[E(-j)]2

= var[6Jij < oo.

An almost identical argument yields

lim var [dm00t)] = var [.h)
n-oo -- 1.





60
As a consequence of Lemma 4.2. since var[dmooth)] < var[dij] for all n,

lim var[dmth)< lim varidj] < 0o.
n-oo n--m
Hence
var ooth] < var[ ij]. O
The following theorem establishes the domination of the smoothed-data
dissimilarity estimator ij"oot) over the observed-data estimator yij, when the
signal curves pi(t), i = 1,...., N. lie in the linear subspace defined by S.
Theorem 4.4 If S is a self-adjoint operator and a shrinking smoother, and if
pii(t) is in the linear subspace which S projects onto for all i = 1..., N, then
MSE(6;(smooth)) MSE().
MSE0(oj ) < MSE(6j).
Proof of Theorem 4.4:
Define the difference in MSEs A to be as follows (we want to show that A is
negative):

A = E{( (mooth) 6j)2} -E{(( j)}
= E{ ( f (t)t EI(t)]2dt /[i(t) is(t)I2dt }

E (T [y,(t) y(t)]2 dt fT [,(t) (t)]2 dt) 2}

= E( ( [Sy,(t) Sy,(t)]2 dt) + (IT [p(t) (t)12 dt)

-2 (1 [Sys(t) Syj(t)]2 dt) (J [1T,(t) y(t)]2 dt)

E [ry(t) y(t)]2 dt) + (T[i(t) g,(t) dt2

-2 U( [y(t) y3(t)]2 dt) ( [L(t) p (t)]2 dt)




61

= E{ ( [Sy(t) Syj(t)]2dt) ( [(t) y(t)J dt)}
+ 2E{ (r[y(t) y(t)]2dt)( (t) p(t)]2 dt
(f Z ) () -Y
(- [Syi(t) Syj(t)]2 dt) ( T [/i(t)- (t)] dt }.
Let f = fij(t) = yi(t) yj(t) and let f = fij(t) = pi (t) pj(t). Note that since S is
a linear smoother. Syi(t) Syj(t) = St.
A = E [S]( dt [f2dt)2
+ 2E [f]2 dt) ( [f]2 dt) ( [S2di (t f]2 dt) }
= E{(Sf.Sf)2 (,ff)2} + 2E{(, ) (f,f)( (Sf,Sf)(ff)}
= E{Ilsfi4 f4} + 2E{ llIfl|2lfI2 iS112lflI2}
= E{IIS\II4} E{ll114} 2 fI 12E{|lSfI2 -_ lf112}
var(l SfkI2) var(|I I|2) + {E(IISfl 2)}2 {E( 11f12)}2
21 f 12E{(|lISll2 fl1122)}
= var(llSftl2) var( lllf2)
+ {E(l t5f12) + E(I Il2)} {E( ISI 12) E(J1f112)}
2 f11ll2 E{(lISf l2- illl2)}.
Since pi(t) is in the subspace which S projects onto for all i, then
Sf = f IlSfl= |lfl.

A = var( llS 12) var(ll ll2) + E(IlSfll2 + ll112)E(llIf2 -1ll2)
(llSf112 + Ilfll2)E(llISf 2 Illl2).





62
Now, since var(llS fj2) <: var(Ilflj2) from Corollary 4.1. we have:

A < E(IISf!I2 + Jf112)E(I ISfl2 If 2) (I Sf2 + fl112)E( ISfII2 I12)
SE(I ISfI2 Il112) [E(I SfII2 + I112) (lISf!2 + if12)] (4.9)

Since S is a shrinking smoother, J|Sfjl < Ijffl for all f. Hence IIJSif2 < IIfl 2 for
all f and hence E(IISf!\2) E(liffl2). Therefore the first factor of (4.9) is < 0.
Also note that

E(IISfil2) = E f [S2dt = E[Sf]2 dt

S var[Sf] dt + f(E[Sf])2 dt
/T T
= var[Sf] dt + j [Sf]2 dt
/T
= Tvar[S]dt + jSf > Sf2.

Similarly,

E(I ii2) = E 0['2 dt }= fo E(112 dt
T l T T
= var[f] dt + o(E[f])2dt

= var[f] dt + [f]2 dt
T
f= var[f]dt + IIfI > |f 12.

Hence the second factor of (4.9) is positive, implying that (4.9) < 0. Hence the
risk difference A < 0. O
Remark. We know that |ISifl < I Ill since S is shrinking. If, for our particular
yi(t) and yj(t), IISf\I < iHfl, then our result is MSE(6mcoth)) < MSE(&ij).












CHAPTER 5
A PREDICTIVE LIKELIHOOD OBJECTIVE FUNCTION
FOR CLUSTERING FUNCTIONAL DATA

Most cluster analysis methods, whether deterministic or stochastic, use an
objective function to evaluate the "goodness" of any clustering of the data. In this

chapter we propose an objective function specifically intended for functional data

which have been smoothed with a linear smoother. In setting up the situation,

we assume we have N functional data objects y1(t),.... yN(t). We assume the

discretized data we observe (yl,.... YN) follow the discrete noise model (2.1).

For object i. define the indicator vector Zi = (Zi4,..., ZN) such that Zyi = 1 if

object i belongs to cluster j and 0 otherwise. Each partition in C corresponds to a

different Z = (Zi,..., ZN), and Z then determines the clustering structure. (Note

that if the number of clusters K < N, then Zi,K+1 = = ZiN = 0 for all i.)

In cluster analysis, we wish to predict the values of the Zi's based on the data

y (or data curves (yl(t),.... yv(t)). Assume the Zi's are i.i.d. unit multinomial
vectors with unknown probability vector p = (pl,....pN). Also, we assume a

model f(ylz) for the distribution of the data given the partition. The joint density
N N
f(y, z) = 11[py11jf1(yz1j)zIj
i=1 j=1

can serve as an objective function. However, the parameters in f(y|z), and p,

are unknown. In model-based clustering, these parameters are estimated by
maximizing the likelihood, given the observed y, across the values of z. If the
data model f(yfz) allows it, a predictive likelihood for Z can be constructed by

conditioning on the sufficient statistics r(y, z) of the parameters in f(y, z) (Butler,

1986; Bjornstad, 1990). This predictive likelihood, f(y, z)//f(r(y, z)), can serve as


63





64

the objective function. (Bjornstad (1990) notes that, when y or z is continuous,

an alternative predictive likelihood with a Jacobian factor for the transformation

r(y, z) is possible. Including the Jacobian factor makes the predictive likelihood
independent of the choice of minimal sufficient statistic: excluding it makes the

predictive likelihood invariant to scale changes of z.)

For instance, consider a functional datum yi(t) and its corresponding observed

data vector yi. For the purpose of cluster analysis, let us assume that functional
observations in the same cluster come from the same linear smoothing model with

a normal error structure. That is, the distribution of Yi, given that it belongs to

cluster j, is

Yi I z = 1 N(T3j, l)

for i = 1..... N. Here T is the design matrix, whose n rows contain the regression

covariates defined by the functional form we assume for yi(t). This model for

Yi allows for a variety of parametric and nonparametric linear basis smoothers,

including standard linear regression, regression splines, Fourier series models,

and smoothing splines. as long as the basis coefficients are estimated using least

squares.

For example, consider the yeast gene data described in Section 7.1. Since

the genes, being cell-cycle regulated, are periodic data, Booth et al. (2001) use a

first-order Fourier series y(t) = o0 + /1 cos(27rt/T) + 32 sin(27rt/T) as a smooth

representation of a given gene's measurement, fitting a harmonic regression model

(see Brockwell and Davis, 1996, p. 12) by least squares. Then the 18 rows of T are

the vectors (1,cos(ftj),sin(ftj)), for the n = 18 time points tj,j = 0,...,17. In

this example 0j = (o30j, 1j, a,)'.

Let J = {jlzij = 1 for some i} be the set of the K nonempty clusters.

The unknown parameters in f(y, z), j E J, are (pj, Pi, ao3). The corresponding

sufficient statistics are (mj, /^i, a) where mj = '1 zij is the number of objects





65

in the (proposed) jth cluster. j is the p*-dimensional least squares estimator
of 0j (based on a regression of all objects in the proposed jth cluster), and 5
is the unbiased estimator of a,. The estimators /j and &2 can be expressed in
terms of the estimators ,ij and &P (for all i such that zj = 1), obtained from the
regressions of the individual objects. (See Appendix A.3 for the derivations of flj
and &?.)
Marginally, the vector m = (mi,...,mN) is multinomial(N, p); conditional
on m, 3j is multivariate normal and independent of (mn )&P ~, X j-.. Then
the following theorem gives the predictive likelihood objective function, up to a
proportionality constant:
Theorem 5.1 If Z, .... ZN are i.i.d. unit multinomial vectors unth unknown
probability vector p = (pi,... PN), and

Y, I zij = 1 N(T/j, a(7I)

for i = 1,... N, then the predictive likelihood for Z can be written:


g() = f(y z)
n,/(, f (m, j)

02j6Jn + I(mjn p') (mjn p* 2
ocJ fm!(mU)-1/2 m(&2 ) (in2

Proof of Theorem 5.1:
According to the model assumed for the data and for the clusters,

) = K
f "=' Z)VIpj )n ep (Yi TBj)'(yi- TBj)]





66

Therefore log f(y, z)

= E' logpj +log y T()'(yi T13j)
i=1 j=1
n1 N
= log p + mj log zij(Y T j)'( T3j)
j=1 I i=1

= ^logp + mj log(
j=1

2? zij(y, T )'(y Ti3j)+ mj(T/ 3- T l)'(T13 TI,).

Hence f(y,z)

(1)nm. x
jEJ
Si (y T,)'(- T3,) (Tij Tj)'(T3j Tf)
exp yi 2a2 mj 23 m
i=1
m -p(27r) ,,n/2 !7j -mn exp m(Tpj TOS)'(T~j TO j) (mn p*)6j2
SjP (2 exp-m3 2oa? 2 a

Also, ( JJ f(mj, ,,2)

pi 1 1
= N! )exp{-(^ -)'[cov( 0)]-(/- 0)}
"nj! I (2,r)p*/2'!cov ( 1j)l/2 2

(1/2)n mjn-p* ^,2,'i-p'_ 1 mn 2mn-p*
x r(_ ) ( "aj) 2 exp{-( a] 2)} J )
2 3
= N -1 1/2 exp -ij ( Oj)'(T'T)(4 3)
jEJ *! (2r)P'/2 T' T) 1


2 j a
i)^ exp{- (j -)}.p

Hence, since exp{- (/4j j)'(T'T)(4j Of)} = exp{- -j(T/j -
TBj,)'(T^ TB)}, we have that





67


f(y,z)
I-TEJ f(mj, 43, &J2)

IEJ rn-j!{ (2"r) -N/2 J7(2w)P/2 1/2 }(T /
N6jEJ (my

oc n--p

mj /2n p (mjn p* 2 ) n
C m() 2 -2 \ J2)2-*2/2


jEJ

When considered as an objective function, this predictive likelihood has
certain properties which reflect intuition about what type of clustering structure is
desirable.
The data model assumes a2 represents the variability of the functional data
in cluster j. Since bj is an unbiased estimator of oj,j E J, then (2j)-_ +l1
can be interpreted as a measure of variability within cluster j. The (typically)
negative exponent indicates that "good" clustering partitions, which have small
within-cluster variances, yield large values of g(z). Clusters containing more objects
(having large mj values) contribute more heavily to this phenomenon.
The other factors of g(z) may be interpreted as penalty terms which penalize
a partition with a large number of clusters having small mj values. A possible
alternative is to adopt a Bayesian outlook and put priors on the pj's, which, when
combined with the predictive likelihood, will yield a suitable predictive posterior.













CHAPTER 6
SIMULATIONS

6.1 Setup of Simulation Study

In this chapter, we examine the results of a simulation study implemented

using the statistical software R. In each simulation. 100 samples of N = 18 noisy

functional data were generated such that the data followed models (2.1) or (2.2).

For the discrete noise data. the errors were generated as independent N(0, a2)

random variables, while for the functional noise data. the errors were generated as

the result of a stationary Ornstein-Uhlenbeck process with variability parameter a2

and "pull" parameter 6 = 1.

For each sample of curves, several (four) "clusters" were built into the data

by creating each functional observation from one of four distinct signal curves, to

which the random noise was added. Of course. within the program the curves were

represented in a discretized form, with values generated at n equally spaced points

along T = [0, 20]. In one simulation example. n = 200 measurements were used and

in a second example, n = 30.

The four distinct signal curves for the simulated data were defined as follows:


ti(t) = 0.51n(t+1)+.01cos(t),i = 1,...,5.

Li(t) = logo(t + ) .01cos(2t),i = 6,...,10.

14(t) = 0.751og(t + 1) + .01sin(3t),i = 11,...,15.

i(t) = 0.3Vt + .01 sin(4t), i = 16,..., 18.







68





69

These four curves, shown in Figure 6-1, ,were intentionally chosen to be similar

enough to provide a good test for the clustering methods that attempted to

group the curves into the correct clustering structure. yet different enough that

they represented four clearly distinct processes. They all contain some form of

periodicity. which is more prominent in some curves than others. In short, the

curves were chosen so that, when random noise was added to them, they would be

difficult but not impossible to distinguish.

After the -observed" data were generated. the pairwise dissimilarities among

the curves (measured by squared L2 distance, with the integral approximated

on account of the discretized data) were calculated. Denote these by di, i =

1,..., N.j = 1..... N. The data were then smoothed using a linear smoother S

(see below for details) and the James-Stein shrinkage adjustment. The pairwise

dissimilarities among the smoothed curves were then calculated. Denote these by

di oth). i = 1 ...., N, j = 1,..., N. The sample mean squared error criterion was

used to judge whether dij or dimooth better estimated the dissimilarities among the

signal curves, denoted dij, i = 1,..., N, j = 1, .... N. That is,


MSE(obs) = 1 [(dij di)2]
( ) i=l.....N
j>t

was compared to


MSE(Smooth) 1 (smooth) d).
( i=) ....,N
2(N)
j>i

Once the dissimilarities were calculated, we used the resulting dissimilarity

matrix to cluster the observed data, and then to cluster the smoothed data. The

clustering algorithm used was the K-medoids method, implemented by the pam

function in R. We examine the resulting clustering structure of both the observed





70







































0 5 10 15 20
time


Figure 6-1: Plot of signal curves chosen for simulations.
Solid line: I1I(t). Dashed line: pe(t). Dotted line: I11(t). Dot-dashed line: 16 (t)
tn ^--^







.- '

















Solid line: p.\(t). Dashed line: 6(t). Dotted line: ai1(t). Dot-dashed line: ie(t).





71

data and the smoothed data to determine which clustering better captures the

structure of the underlying clusters, as defined by the four signal curves.

The outputs of the cluster analyses were judged by the proportion of pairs of

objects (i.e., curves) correctly placed in the same cluster (or correctly placed in

different clusters, as the case may be). The correct clustering structure, as defined

by the signal curves, places the first five curves in one cluster, the next five in

another cluster, etc. Note that this is a type of measure of concordance between

the clustering produced by the analysis and the true clustering structure.

6.2 Smoothing the Data

The smoother used for the n = 200 example corresponded to a cubic B-spline

basis with 16 interior knots interspersed evenly within the interval [0, 20]. The rank

of S was thus k = 20 (Ramsay and Silverman, 1997, p. 49). The value of a in the

James-Stein estimator was chosen to be a = 160, a choice based on the values of

n = 200 and k = 20. For the n = 30 example, a cubic B-spline smoother with six

knots was used (hence k = 10) and the value of a was chosen to be a = 15.

We should note here an interesting operational issue that arises when using

the James-Stein adjustment. The James-Stein estimator of 0 given by (3.4) is

obtained by smoothing the differences 0 = y yj, i z j, first, and then adjusting

the SO's with the James-Stein method. This estimator for 0 is used in the James-

Stein dissimilarity estimator d(s). For a data set with N curves, this amounts to

smoothing (N2 N)/2 pairwise differences. It is computationally less intensive to

simply smooth the N curves with the linear smoother S and then adjust each of

the N smooths with the James-Stein method. This leads to the following estimator

of 0:

Sy + 1- )(yi Sy) Sy3 1- a (y Sy.
11yi Sy112 ) 11j -- S~Yj 12) Y- y)





72

Of course, when simply using a linear smoother S. it does not matter whether

we proceed by smoothing the curves or the differences, since SO = Syi Syj. But

when using the James-Stein adjustment, the overall smooth is nonlinear and there

is an difference in the two operations. Since in certain situations, it may be more

sensible to smooth the curves first and then adjust the N smooths. we can check

empirically to see whether this changes the risk very much.

To accomplish this, a simulation study was done in which the coefficients of

the above signal curves were randomly selected (within a certain range) for each

simulation. Denoting the coefficients of curve ,i(t) by (bi,, bi,2), the coefficients

were randomly chosen in the following intervals:

b,1 E [-3.3]. b1.2 E [-0.5. 0.5] b6,1 E [-6,6], 66,2 1[-0.5. 0.5],


bnl. E [-44.54.5], bn.2 E [-0.5, 0.5], b6.i e [-1.8,1.8], b6.2 E [-0.5,0.5].

For each of 100 simulations. the sample MSE for the smoothed data was the same.

within 10 significant decimal places, whether the curves were smoothed first or the

differences were smoothed first. This indicates that the choice between these two

procedures has a negligible effect on the risk of the smoothed-data dissimilarity

estimator.

Also, the simulation analyses in this chapter were carried out both when

smoothing the differences first and when smoothing the curves first. The results

were extremely similar, so we present the results obtained when smoothing the

curves first and adjusting the smooths with the James-Stein method.

6.3 Simulation Results

Results for the n = 200 example are shown in Table 6-1 for the data with

independent errors and Table 6-2 for the data with Ornstein-Uhlenbeck-type errors.

The ratio of average MSEs is the average MSE(obs) across the 100 simulations

divided by the average MSE(smooth) across the 100 simulations. As shown by





73

Table 6-1: Clustering the observed data and clustering the smoothed data (inde-
pendent error structure, n = 200).

Independent error structure
a 0.4 0.5 0.75 1.0 1.25 1.5 2.0
Ratio of avg. MSEs 12.86 30.10 62.66 i 70.89 72.40 72.24 71.03
Avg. prop. (observed) .668 .511 .353 .325 .330 .327 .318
Avg. prop. (smoothed) .937 .715 .428 .380 .372 .352 .332
Ratios of average MSE(b') to average MSE(m00oth. Average proportions of pairs
of objects correctly matched. Averages taken across 100 simulations.
Table 6-2: Clustering the observed data and clustering the smoothed data (O-U
error structure, n = 200).

O-U error structure
a 0.5 0.75 1.0 1.25 1.5 2.0 2.5
Ratio of avg. MSEs 3.53 39.09 126.53 171.39 176.52 168.13 160.52
Avg. prop. (observed) .997 .607 .445 .369 .330 .312 .310
Avg. prop. (smoothed) 1 .944 .620 .489 .396 .326 .334
Ratios of average MSE(O') to average MSE(smo"'t. Average proportions of pairs
of objects correctly matched. Averages taken across 100 simulations.

the ratios of average MSEs being greater than 1, smoothing the data results in

an improvement in estimating the dissimilarities, and. as shown graphically in

Figure 6-2, it also results in a greater proportion of pairs of objects correctly

clustered in every case. Similar results are seen for the n = 30 example in Table

6-3 and Table 6-4 and in Figure 6-3.

The improvement, it should be noted, appears to be most pronounced for

the medium values of a2, which makes sense. If there is small variability in the

data, smoothing yields little improvement since the observed data is not very

noisy to begin with. If the variability is quite large, the signal is relatively weak

and smoothing may not capture the underlying structure precisely, although the

smooths still perform far better than the observed data in estimating the true

dissimilarities. When the magnitude of the noise is moderate, the advantage of

smoothing in capturing the underlying clustering structure is sizable.






74



Independent error structure


o \










0
.. "

I I \
0.5 1.0 1.5 2.0
sigma


Ostein-Uhlenbeck error structure







o
d ~ ', ^~----- __
co ------ --- --------------------











0.5 1.0 1.5 2.0 2.5
sigma


Figure 6-2: Proportion of pairs of objects correctly matched, plotted against a
(n = 200).
Solid line: Smoothed data. Dashed line: Observed data.


Table 6-3: Clustering the observed data and clustering the smoothed data (inde-
pendent error structure, n = 30).

Independent error structure
a 0.2 0.4 0.5 0.75 1.0 1.25 1.5
Ratio of avg. MSEs 1.23 4.02 5.12 5.95 6.30 6.24 6.27
Avg. prop. (observed) .999 .466 .376 .298 .299 .312 .316
Avg. prop. (smoothed) 1.00 .552 .458 .359 .330 .321 .334
Ratios of average MSE(b) to average MSE(""OOth). Average proportions of pairs
of objects correctly matched. Averages taken across 100 simulations.






75


Table 6-4: Clustering the observed dataand clustering the smoothed data (O-U
error structure, n = 30).

O-U error structure
a 0.2 0.4 0.5 0.75 1.0 1.25 1.5
Ratio of avg. MSEs 1.26 4.07 5.03 5.69 5.78 5.85 6.04
Avg. prop. (observed) .851 .383 .280 .251 .225 .213 .233
Avg. prop. (smoothed) .881 .625 .528 .502 .478 .489 .464
Ratios of average MSE(obS) to average MSE(smoOth). Average proportions of pairs
of objects correctly matched. Averages taken across 100 simulations.



Independent error structure




0



CL
---l



II I I I I
02 0.4 0.6 0.8 1.0 1.2 1.4
sigma


Omrstein-Uhlenbeck error structure












-- --- ....-------------
N0 I I
0.2 0.4 0.6 0.8 1.0 1.2 1.4
sigma


Figure 6-3: Proportion of pairs of objects correctly matched, plotted against a
(n = 30).
Solid line: Smoothed data. Dashed line: Observed data.





76

The above analysis assumes the number of clusters is correctly specified as

4, so we repeat the n = 200 simulation, for data generated from the O-U model.

when the number of clusters is misspecified as 3 or 5. Figure 6-4 shows that, for

the most part. the same superiority, measured by proportion of pairs correctly

clustered, is exhibited by the method of smoothing before clustering.

6.4 Additional Simulation Results

In the previous section, the clustering of the observed curves was compared

with the clustering of the smooths adjusted via the James-Stein shrinking method.

To examine the effect of the James-Stein adjustment in the clustering problem, in

this section we present results of a simulation study done to compare the clustering

of the observed curves, the James-Stein smoothed curves, and smoothed curves

having no James-Stein shrinkage adjustment.

The simulation study was similar to the previous one. We had 20 sample

curves (with n = 30) in each simulation, again generated from four distinct signal

curves: in this case the signal curves could be written as Fourier series. Again, for

various simulations, both independent errors and Ornstein-Uhlenbeck errors (for

varying levels of a2) were added to the signals to create noisy curves.

For some of the simulations, the signal curves were first-order Fourier series:


pi(t) = cos(7rt/10) +2sin(7rt/10), i = 1,...,5.

/i(t) = 2cos(7rt/10) +sin(rt/10),i= 6,...,10.

Ai(t) = 1.5cos(irt/10) + sin(7rt/10), i = 11,..., 15.

pi(t) = -cos(7rt/10) +sin(7rt/10),i = 16,...,20.

In these simulations, the smoother chosen was a first-order Fourier series, so that

the smooths lay in the linear subspace that the Fourier smoother projected onto. In

this case, we would expect the unadjusted smoothing method to perform well, since






77












Number of Clusters Misspecified as 3







- - - - -








0.5 1.0 1.5 2.0 2.5





















0.5 1.0 1.5 2.0 2.5
sigma

























Figure 6-4: Proportion of pairs of objects correctly matched, plotted against a,
when the number of clustersClusters Misspcifed asisspecified.





Solid line: Smoothed data Dashed line: Observed data
II I \
0.5 1.0 1.5 2.0 2.5




Figure 6-4: Proportion of pairs of objects correctly matched, plotted against when the number of clusters is misspecified.
Solid line: Smoothed data. Dashed line: Observed data.





78

the first-order smooths should capture the underlying curves extremely well. Here

the James-Stein adjustment may be superfluous, if not detrimental.

In other simulations, the signal curves were second-order Fourier series:


pi(t) = cos(rt/10) + 2 sin(rt/10) + 5 cos(27rt/10) + sin(2rt/10), i = 1,..., 5.

pi(t) = 2 cos(rt/10) + sin(7rt/10) + cos(27rt/10) + 8 sin(27rt/10), i = 6,..., 10.

ii(t) = 1.5cos(7t/10) + sin(7rt/10)+ 5 cos(27rt/10) + 8sin(27rt/10),i = 11,.... 15.

i(t) = cos(7rt/10) + sin(7rt/10) + 8 cos(27rt/10) + 5 sin(27rt/10), i = 16,...,20.

In these simulations, again a first-order Fourier smoother was used, so that the

smoother oversmoothed the observed curves and the smooths were not in the

subspace S projected onto. In this case, the simple linear smoother would not be

expected to capture the underlying curves well, and the James-Stein adjustment

would be needed to reproduce the true nature of the curves.

For each of the three methods (no smoothing, simple linear smoothing,

and James-Stein adjusted smoothing), and for varying magnitudes of noise, the

proportions of pairs of curves correctly grouped are plotted in Figure 6-5. Again.

for each setting, 100 simulations were run, with results averaged over all 100 runs.

The James-Stein adjusted smoothing method performs essentially no worse

than its competitors in three of the four cases. With Ornstein-Uhlenbeck errors.

and when the signal curves are first-order Fourier series, the simple linear smoother

does better as a becomes larger, which is understandable since that smoother

captures the underlying truth perfectly. For higher values of a (noisier data), the

James-Stein adjustment is here shrinking the smoother slightly toward the noise

in the observed data. Nevertheless, the James-Stein smoothing method has an

advantage in the independent-errors case, even when the signal curves are first-

order. In both of these cases, both types of smooths are clustered correctly more

often than the observed curves.






79











O-U r~lars. 1 st-ardHu Fouer ~ nam curvs






O-U error nd-orer Fourier mgnal curv
8 -___--------------------------------












8.





-ndU. rrTo 2nt-owrdr Fourdr ignsi curv







0.5 10 5 20 2 30















Figure 6-5: Proportion of pairs of objects correctly matched, plotted against a
(n = 30).
Solid line: James-Stein adjusted smoothed data. Dotted line: (Unadjusted)
smoothed data. Dashed line: Observed data.





80

When the signal curves are second-order and the simple linear smoother

oversmooths. the unadjusted smooths are clustered relatively poorly. Here, the

James-Stein method is needed as a safeguard. to shrink the smooth toward the

observed data. The James-Stein adjusted smooths and the observed curves are

both clustered correctly more often than the unadjusted smooths.













CHAPTER 7
ANALYSIS OF REAL FUNCTIONAL DATA

7.1 Analysis of Expression Ratios of Yeast Genes

As an example, we consider yeast gene data analyzed in Alter et al. (2000).

The objects in this situation are 78 genes. and the measured responses are (log-

transformed) expression ratios measured 18 times (at 7-minute intervals) tj = 7j

for j = 0,...,17. As described in Spellman et al. (1998). the yeast was analyzed

in an experiment involving "alpha-factor-based synchronization," after which the

RNA for each gene was measured over time. (In fact, there were three separate

synchronization methods used, but here we focus on the measurements produced by

the alpha-factor synchronization.)

Biologists believe that these genes fall into five clusters according to the cell

cycle phase corresponding to each gene.

To analyze these data, we treated them as 78 separate functional observations.

Initially, each multivariate observation on a gene was centered by subtracting the

mean response (across the 18 measurements) from the measurements. Then we

clustered the genes into 5 clusters using the K-medoids method, implemented by

the R function pam.

To investigate the effect of smoothing on the cluster analysis, first we simply

clustered the unsmoothed observed data. Then we smoothed each observation and

clustered the smooth curves, essentially treating the fitted values of the smooths

as the data. A (cubic) B-spline smoother, with two interior knots interspersed

evenly within the given timepoints, was chosen. The rank of the smoothing matrix

S was in this case k = 6 (Ramsay and Silverman, 1997, p. 49). The smooths were



81





82

Table 7-1: The classification of the 78 yeast genes into clusters, for both observed
data and smoothed data.
Cluster  Observed Data                               Smoothed Data
1        1,5,6,7,8,11,75                             1,2,4,5,6,7,8,9,10,11,13,62,70,74,75
2        2,3,4,9,10,13,16,19,22,23,27,               16,19,22,27,30,32,36,37,41,42,
         30,32,36,37,41,44,47,49,51,53,              44,47,49,51,61,63,
         61,62,63,64,65,67,69,70,74,78               65,67,69,78
3        12,14,15,18,21,24,25,26,28,                 3,12,14,15,18,21,24,25,26,
         29,31,33,34,35,38,39,40,                    28,29,31,33,34,35,38,39,
         42,43,45,46,48,50,52                        40,43,45,46,48,50,52
4        17,20,54,55,56,57,58,59,60                  17,20,23,53,54,55,56,57,58,59,60,64
5        66,68,71,72,73,76,77                        66,68,71,72,73,76,77


adjusted via the James-Stein shrinkage procedure described in Section 3.2, with

a = 5 here, a choice based on the values of n = 18 and k = 6.
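
Continuing the sketch above, the B-spline smoother and the shrinkage adjustment might be coded as follows. The smoothing matrix uses a cubic B-spline basis with two interior knots (rank k = 6), as described; the knot placement shown is one way to intersperse the knots evenly. The shrinkage factor written in js_smooth is an assumed positive-part form consistent with the verbal description of Section 3.2 (relaxing the smooth back toward the observed curve by an amount governed by a and the noise variance), and sigma2_hat stands for whatever estimate of the noise variance is used; this is a sketch, not the thesis's exact implementation.

    library(splines)  # bs() builds the B-spline basis

    tj <- 7 * (0:17)                                   # the 18 measurement times
    knots <- quantile(tj, c(1/3, 2/3))                 # two interior knots, evenly interspersed (one choice)
    B <- bs(tj, knots = knots, degree = 3, intercept = TRUE)  # 18 x 6 cubic B-spline basis
    S <- B %*% solve(crossprod(B)) %*% t(B)            # least-squares smoothing matrix, rank k = 6

    # Assumed positive-part James-Stein adjustment (a = 5 here): keep the smooth Sy
    # and add back a data-determined fraction of the residual y - Sy.
    js_smooth <- function(y, S, a, sigma2) {
      resid <- y - S %*% y
      shrink <- max(0, 1 - a * sigma2 / sum(resid^2))
      drop(S %*% y + shrink * resid)
    }

    smooths <- t(apply(yeast_centered, 1, js_smooth, S = S, a = 5, sigma2 = sigma2_hat))
    fit_smooth <- pam(smooths, k = 5)                  # cluster the adjusted smooth curves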

Figure 7-1 shows the resulting clusters of curves for the cluster analyses of

the observed data and of the smoothed curves. We can see that the clusters of

smoothed curves tend to be somewhat less variable about their sample mean curves

than are the clusters of observed curves, which could aid the interpretability of the

clusters.

In particular, features of the second, third, and fourth clusters seem more

apparent when viewing the clusters of smooths than the clusters of observed

data. The major characteristic of the second cluster is that it seems to contain

those processes which are relatively flat, with little oscillation. This is much more

evident in the cluster of smooths. The third and fourth clusters look similar for

the observed data, but a key distinction between them is apparent if we examine

the smooth curves. The third cluster consists primarily of curves which rise to

a substantial peak, decrease, and then rise to a more gradual peak. The fourth

cluster consists mostly of curves which rise to a small peak, decrease, and then rise

to a higher peak. This distinction between clusters is seen much more clearly in the

clustering of the smooths.






[Figure 7-1 appears here: paired panels of the gene clusters, with observed curves in the left column and smoothed curves in the right column, plotted against time.]

Figure 7-1: Plots of clusters of genes.
Observed curves (left) and smoothed curves (right) are shown by dotted lines.
Mean curves for each cluster shown by solid lines.

The groupings of the curves into the five clusters are given in Table 7-1 for

both the observed data and the smoothed data. While the clusterings were fairly

similar for the two methods, there were some differences in the way the curves were

classified. Figure 7-2 shows an edited picture of the respective clusterings of the

curves, with the curves deleted which were classified the same way both as observed

curves and smoothed curves. Note that the fifth cluster was the same for the set of

observed curves and the set of smoothed curves.
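
A short sketch of how the differently classified genes in Figure 7-2 could be identified is given below; it assumes the cluster labels from the two pam runs above (fit_obs and fit_smooth) have already been matched to correspond, as they are in Table 7-1.

    # Cross-tabulate the two partitions and list the genes whose cluster changes.
    table(observed = fit_obs$clustering, smoothed = fit_smooth$clustering)
    moved <- which(fit_obs$clustering != fit_smooth$clustering)
    moved   # these are the curves retained in the edited plots of Figure 7-2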

7.2 Analysis of Research Libraries

The Association of Research Libraries has collected extensive annual data

on a large number of university libraries throughout many years (Association of

Research Libraries, 2003). Perhaps the most important (or at least oft-quoted)

variable measured in this ongoing study is the number of volumes held in a library

at a given time. In this section a cluster analysis is performed on 67 libraries whose

data on volumes were available for each year from 1967 to 2002 (a string of 36

years). This fits a strict definition of functional data, since the number of volumes

in a library is changing "nearly continuously" over time, and each measurement is

merely a snapshot of the response at a particular time, namely when the inventory

was done that year.

The goal of the analysis is to determine which libraries are most similar in

their growth patterns over the last 3-plus decades and to see into how many groups

the 67 libraries naturally cluster. In the analysis, the logarithm of volumes

held was used as the response variable, since reducing the immense magnitude of

the values in the data set appeared to produce curves with stable variances. Since

the goal was to cluster the libraries based on their "growth curves" rather than

simply their sizes, the data was centered by subtracting each library's sample mean

(across the 36 years) from each year's measurement, resulting in 67 mean-centered

data vectors.







[Figure 7-2 appears here: paired panels of the affected clusters, with observed curves on the left and smoothed curves on the right, plotted against time.]

Figure 7-2: Edited plots of clusters of genes which were classified differently as
observed curves and smoothed curves.
Observed curves (left) and smoothed curves (right).

The observed curves were each smoothed using a basis of cubic B-splines with

three knots interspersed evenly within the timepoints. The positive-part James-

Stein shrinkage was applied, with a value a = 35 in the shrinkage factor (based on

n = 36, k = 7). The K-medoids algorithm was applied to the smoothed curves,

with K = 4 clusters, a choice guided by the average silhouette width criterion that

Rousseeuw (1987) suggests for selecting K.
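
A minimal R sketch of this analysis, in the same spirit as the yeast example, is given below. The matrix name volumes is hypothetical (67 libraries by 36 years of volumes held), sigma2_lib stands for the noise-variance estimate, js_smooth is the sketch from Section 7.1, and the knot placement and the 2-8 range searched for K are illustrative choices.

    years <- 1967:2002
    logvol <- log(volumes)                                     # log of volumes held
    logvol_c <- sweep(logvol, 1, rowMeans(logvol))             # 67 mean-centered growth curves

    B2 <- bs(years, knots = quantile(years, c(0.25, 0.5, 0.75)),
             degree = 3, intercept = TRUE)                     # 36 x 7 cubic B-spline basis
    S2 <- B2 %*% solve(crossprod(B2)) %*% t(B2)                # rank k = 7 smoother
    lib_smooth <- t(apply(logvol_c, 1, js_smooth, S = S2, a = 35, sigma2 = sigma2_lib))

    # Average silhouette width (Rousseeuw, 1987) over candidate K, then K = 4.
    avg_sil <- sapply(2:8, function(k) pam(lib_smooth, k)$silinfo$avg.width)
    fit_lib <- pam(lib_smooth, k = 4)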

The resulting clusters are shown in Table 7-2. In Figure 7-3, we see the

distinctions among the curves in the different clusters. Cluster 1 contains a few

libraries whose volume size was relatively small at the beginning of the time period,

but then grew dramatically in the first 12 years of the study, growing slowly after

that. Cluster 2 curves show a similar pattern, except that the initial growth is less

dramatic. The behavior of the cluster 3 curves is the most eccentric of the four

clusters, with some notable nonmonotone behavior apparent in the middle years.

The growth curves for the libraries in cluster 4 are consistently slow, steady, and

nearly linear.

For purposes of comparing the clusters, the mean curves for the four clusters

are shown in Figure 7-4.

Note that several of the smooths may not appear visually smooth in Figure

7-3. A close inspection of the data shows some anomalous (probably misrecorded)

measurements which contribute to this chaotic behavior in the smooth. (For

example, see the data and associated smooth for the University of Arizona library,

in Figure 7-5, for which 1971 is a suspicious response.) While an extensive data

analysis would detect and fix these outliers, the data set is suitable to illustrate

the cluster analysis. The sharp-looking peak in the smooth is an artifact of the

discretized plotting mechanism, as the cubic spline is mathematically assured to

have two continuous derivatives at each point.






[Figure 7-3 appears here: four panels, "Cluster 1 for smoothed curves" through "Cluster 4 for smoothed curves", each plotted against time (years since 1967). Clusters based on smoothed curves; mean curves for each cluster shown by solid lines.]

[Figure 7-4 appears here.]

Figure 7-4: Mean curves for the four library clusters given on the same plot.
Solid: cluster 1. Dashed: cluster 2. Dotted: cluster 3. Dot-dash: cluster 4.





Table 7-2: A 4-cluster K-medoids clustering of the smooths for the library data.

Clusters of smooths for library data
Cluster 1: Arizona, British Columbia, Connecticut, Georgetown, Georgia
Cluster 2: Boston, California-Los Angeles, Florida, Florida State, Iowa,
Iowa State, Kansas, Michigan State, Nebraska, Northwestern, Notre Dame,
Oklahoma, Pennsylvania State, Pittsburgh, Princeton, Rochester,
Southern California, SUNY-Buffalo, Temple, Virginia, Washington U-St. Louis,
Wisconsin
Cluster 3: Brown, California-Berkeley, Chicago, Cincinnati, Colorado, Duke,
Indiana, Kentucky, Louisiana State, Maryland, McGill, MIT, Ohio State,
Oregon, Pennsylvania, Purdue, Rutgers, Stanford, Tennessee, Toronto, Tulane,
Utah, Washington State, Wayne State
Cluster 4: Columbia, Cornell, Harvard, Illinois-Urbana, Johns Hopkins, Michigan,
Minnesota, Missouri, New York, North Carolina, Southern Illinois, Syracuse, Yale
















[Figure 7-5 appears here.]

Figure 7-5: Measurements and B-spline smooth, University of Arizona library.
Points: log of volumes held, 1967-2002. Curve: B-spline smooth.






Table 7-3: A 4-cluster K-medoids clustering of the observed library data.

Clusters of observed library data
Cluster 1: Arizona, British Columbia, Connecticut, Georgetown, Georgia
Cluster 2: Boston, California-Los Angeles, Florida, Florida State, Iowa,
Iowa State, Kansas, McGill, Michigan State, Nebraska, Northwestern,
Notre Dame, Oklahoma, Pennsylvania State, Pittsburgh, Princeton, Rochester,
Southern California, SUNY-Buffalo, Temple, Virginia, Washington U-St. Louis,
Wayne State, Wisconsin
Cluster 3: Brown, California-Berkeley, Chicago, Cincinnati, Colorado, Duke,
Illinois-Urbana, Indiana, Kentucky, Louisiana State, Maryland, MIT,
Ohio State, Oregon, Pennsylvania, Purdue, Rutgers, Stanford, Syracuse,
Tennessee, Toronto, Tulane, Utah, Washington State
Cluster 4: Columbia, Cornell, Harvard, Johns Hopkins, Michigan, Minnesota,
Missouri, New York, North Carolina, Southern Illinois, Yale


Interpreting the meaning of the clusters is fairly subjective, but the dramatic

rises in library volumes for the cluster 1 and cluster 2 curves could correspond

to an increase in state-sponsored academic spending in the 1960s and 1970s,

especially since many of the libraries in the first two clusters correspond to public

universities. On the other hand, cluster 4 contains several old, traditional Ivy

League universities (Harvard, Yale, Columbia, Cornell), whose libraries' steady

growth could reflect the extended buildup of volumes over many previous years.

Any clear conclusions, however, would require more extensive study of the subject,

as the cluster analysis is only an exploratory first step.

In this data set, only minor differences in the clustering partition were seen

when comparing the analysis of the smoothed data with that of the observed

data (shown in Table 7-3). Interestingly, in some spots where the two partitions

differed, the clustering of the smooths did place certain libraries from the same

state together when the other method did not. For instance, Illinois-Urbana was

placed with Southern Illinois, and Syracuse was placed with Columbia, Cornell, and

NYU.




Full Text
23
On the other hand, suppose the srpooth does not reproduce 0 perfectly and
that S0 / 6. Then it can be shown (see Appendix A.l) that the smoothed-data
estimator is better when:
||S0||2 ||0||2 > -2 k Vi + 2n + n2 + 2k. (3.3)
Now, since S is a shrinking smoother, this means ||Sy|| < ||y|| for all y, and hence
||Sy|j2 < ||y||2 for all y. Therefore, ||S0||2 < ||0||2 and the left hand side of (3.3) is
negative and so is the right hand side. If 0 is such that S0 0 and ||S0||2 ||0||2
is near 0, then (3.3) will be satisfied. If, however, ||S0||2 ||0||2 0, then (3.3)
will not be satisfied and smoothing will not help.
In other words, some shrinkage smoothing of the observed curves makes the
dissimilarity estimator better, but too much shrinkage leads to a forfeiture of that
advantage. The disadvantage of the linear smoother is that it cannot learn
from the data how much to shrink 0. To improve the smoother, we can employ a
James-Stein-type adjustment to S, so that the data can determine the amount of
shrinkage.
3.2 A James-Stein Shrinkage Adjustment to the Smoother
What is now known as shrinkage estimation or Stein estimation originated
with the work of Stein in the context of estimating a multivariate normal mean.
Stein (1956) showed that the usual estimator (e.g., the sample mean vector) was
inadmissible when the data were of dimension 3 or greater. James and Stein (1961)
showed that a particular shrinkage estimator (so named because it shrunk the
usual estimate toward the origin or some appropriate point) dominated the usual
estimator. In subsequent years, many results have been derived about shrinkage
estimation in a variety of contexts. A detailed discussion can be found in Lehmann
and Casella (1998, Chapter 5).


53
As shown in Section 4.2. for all x > 0, Pe^o[qi > x] > Po[qi > x], Note that
this is equivalent to: For all x > 0, P^o[q{ > x] > Po[? > x]- Then
^o[(?!*)-m] < £o[(<7iTm]
for m = 1,2.
Let = £"o[(g) *] and AJ = 5o[() 2]- Then
+ 4
+
1/(l' + 2>V-4
u
i/(t/ + 2)(*/ + 4)
M{a3
/uiu + 2){u + 4)W + e)\M;a,
= + 2)(l/+ ^ + 6)) M2V 4 + W" +
)2-
^(i/ + 2)
(2M[tr[(l S)S] +
2ir(SS)
tr[(I-S)E]
+ 4)
4tr(E)a.
Note that if this last expression (which does not involve a and which we denote
simply as A[/(a)) is less than 0, then /S*dU < 0. which is what we wish to prove.
We may repeat the argument from Section 4.2 in which we showed that when
9 = 0. conclude that when £ = 0. ~ V*Xi> where V\,..., vn are the eigenvalues of
(I-S)E.
Again, if n k > 4, then M{ and M2* exist and are positive, as was shown in
Section 4.2.
Again. A(/(a) = 0 is a fourth-degree equation with two real roots, one of which
is trivially zero. Call the nontrivial real root r.
Theorem 4.3 Suppose the same conditions as Theorem 3.3, except generalize
the covariance matrix of 9 to be positive definite. If n k > 4, then the nontrivial real root r of A^[/(a) = 0
is positive. Furthermore, for any choice a 6 (0,r), the asymptotic upper bound


Copyright 2004
by
David B. Hitchcock


BIOGRAPHICAL SKETCH
David B. Hitchcock was born in Hartford, Connecticut, in 1974 to Richard
and Gloria Hitchcock, and he grew up in Athens. Georgia, and Stone Mountain,
Georgia, with his parents and sister, Rebecca. He graduated from St. Pius X
Catholic High School in Atlanta, Georgia, in 1992.
He earned a bachelor's degree from the University of Georgia in 1996 and a
masters degree from Clemson University in 1999. While at Clemson, he met his
future wife Cassandra Kirby, who was also a masters student in mathematical
sciences.
He came to the University of Florida in 1999 to pursue a Ph.D. degree in
statistics. While at Florida, he completed the required Ph.D. coursework and
taught several undergraduate courses. His activities in the department at Florida
also included student consulting at the statistics unit of the Institute of Food and
Agricultural Sciences and a research assistantship with Alan Agresti. He began
working with George Casella and Jim Booth on his dissertation research in the fall
of 2001 after passing the written Ph.D. qualifying exam.
In January 2004 David and Cassandra were married. After graduating, David
will be an assistant professor of statistics at the University of South Carolina.
106


34
Since <7i and q2 are independent, we may write (3.11) as:
E[2a24}E[q2]E[l/qi] + E
2a2 a4 x
(9i 9i ~ Q2)
L 9i
= E[2a24](q2 + a2k)E[l/qx] + E
2^.4
2cr<7
L 9i
(qi ~q\~ 92)
(3.12)
Now, since a2a4/qx is decreasing in qx and qx qx q2 is increasing in qx, these
quantities have covariance < 0 by the Covariance Inequality (Casella and Berger,
1990, p. 184). Hence
(3.12) < E[2a2a4]{q2 + a2k)E[l/qx} +E
2 a2a4
L 9i
E[qi 9i 92]
= E[2a2a4](q2 + a2k)E[\/qx] + £'[2a2<74].E,[l/9i](i72(n k) q2)
= £[2a2cr4].E,[l/<7i]c72n.
So
A* < E[4a2]cr2n + E[2a2a4]E
= E[4a2]a2n + E[2a2a4]E
Note that
E[ad2] = aa2
2 4(^ + 2)'
1'
a2n + E
.9i.
-
' 1 '
a2n + E
.9i.
2a? a2
4a2 2^.4 \ 2-1
aa
9i J
4a36 a48
+
9i
? j
E[aV] = aV(*;2))
£[a^] = av|-(^y+4)j
E[aV] = aV^+2)(^4)(y + 6)Y
Since (j2 and ^ are independent, taking expectations:
A* < 4ao-4n + 2aro'Jn
4 o2_6 + 2)
E
1
9i J
+ 4V(^?))
-4ov('^^i)V
1
L9iJ
, 4 8/I/0' + 2)(l' + 4)(l/ + 6)
+ a a
ir
)E
' 1'
)
L9iJ


78
the first-order smooths should capture the underlying curves extremely well. Here
the James-Stein adjustment may be superfluous, if not detrimental.
In other simulations, the signal curves were second-order Fourier series:
Hi(t) = cos(7ri/10) -I- 2sin(7r/10) -I- 5cos(27r/10) + sin(27r/10),i = 1,...,5.
Hi(t) = 2 cos(7t£/10) 4- sin(7ri/10) 4- cos(27r/10) 4- 8 sin(27r£/10), i = 6,..., 10.
Hi(t) = 1.5 cos(7r/10) -I- sin(7r/10) 4- 5 cos(27r/10) -I- 8 sin(27r/10), i = 11,.... 15.
//() = cos(7r/10) 4- sin(7r/10) 4- 8cos(27r/10) -I- 5 sin(2;r£/10), i = 16,..., 20.
In these simulations, again a first-order Fourier smoother was used, so that the
smoother oversmoothed the observed curves and the smooths were not in the
subspace S projected onto. In this case, the simple linear smoother would not be
expected to capture the underlying curves well, and the James-Stein adjustment
would be needed to reproduce the true nature of the curves.
For each of the three methods (no smoothing, simple linear smoothing,
and James-Stein adjusted smoothing), and for varying magnitudes of noise, the
proportions of pairs of curves correctly grouped are plotted in Figure 6-5. Again,
for each setting, 100 simulations were run, with results averaged over all 100 runs.
The James-Stein adjusted smoothing method performs essentially no worse
than its competitors in three of the four cases. With Ornstein-Uhlenbeck errors,
and when the signal curves are first-order Fourier series, the simple linear smoother
does better as cr becomes larger, which is understandable since that smoother
captures the underlying truth perfectly. For higher values of a (noisier data), the
James-Stein adjustment is here shrinking the smoother slightly toward the noise
in the observed data. Nevertheless, the James-Stein smoothing method has an
advantage in the independent-errors case, even when the signal curves are first-
order. In both of these cases, both types of smooths are clustered correctly more
often than the observed curves.


CHAPTER 5
A PREDICTIVE LIKELIHOOD OBJECTIVE FUNCTION
FOR CLUSTERING FUNCTIONAL DATA
Most cluster analysis methods, whether deterministic or stochastic, use an
objective function to evaluate the goodness" of any clustering of the data. In this
chapter we propose an objective function specifically intended for functional data
which have been smoothed with a linear smoother. In setting up the situation,
we assume we have N functional data objects y\(t),..., yff(t). We assume the
discretized data we observe (yi,.... yn) follow the discrete noise model (2.1).
For object i. define the indicator vector Z¡ = (Zn,..., Zm) such that Z,j = 1 if
object i belongs to cluster j and 0 otherwise. Each partition in C corresponds to a
different Z = (Zx,..., Zjv), and Z then determines the clustering structure. (Note
that if the number of clusters K < N, then Zk+1 = = Zw 0 for all i.)
In cluster analysis, we wish to predict the values of the Zs based on the data
y (or data curves (yx(i),.... /tv())- Assume the Z^s are i.i.d. unit multinomial
vectors with unknown probability vector p = (px, Pn)- Also, we assume a
model /(y|z) for the distribution of the data given the partition. The joint density
N N
/(y,z) = YlYl\pjf{yi\zn)]Zii
t=i j=i
can serve as an objective function. However, the parameters in /(y|z), and p,
are unknown. In model-based clustering, these parameters are estimated by
maximizing the likelihood, given the observed y, across the values of z. If the
data model /(y|z) allows it, a predictive likelihood for Z can be constructed by
conditioning on the sufficient statistics r(y,z) of the parameters in /(y,z) (Butler,
1986; Bjornstad, 1990). This predictive likelihood, /(y,z)/f(r(y,z)), can serve as
63


I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy. /?
*
George Casella. Chair
Professor of Statistics
I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.
* V c z 2
James G. Booth. Cochair
Professor of Statistics
I certify that I have read this study and that jn my opinion it conforms to
acceptable standards of scholarly presentation ana is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor
of Philosophy.
L
(vr
James
VHqfcert
AssociateJ^rofessor of Statistics
I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree
Brett D. Presnell
Associate Professor of Statistics
I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.
John , Henretta
Professor of Sociology


67
/(y,z)
Yljej
-111/2
/TTijn-p
jeJ
m ;n p
(*J)
jeJ
When considered as an objective function, this predictive likelihood has
certain properties which reflect intuition about what type of clustering structure is
desirable.
The data model assumes aj represents the variability of the functional data
in cluster j. Since is an unbiased estimator of Oj,j G J, then (a?) 2 +1
can be interpreted as a measure of variability within cluster j. The (typically)
negative exponent indicates that good clustering partitions, which have small
within-cluster variances, yield large values of g(z). Clusters containing more objects
(having large m; values) contribute more heavily to this phenomenon.
The other factors of g(z) may be interpreted as penalty terms which penalize
a partition with a large number of clusters having small m,j values. A possible
alternative is to adopt a Bayesian outlook and put priors on the pj's, which, when
combined with the predictive likelihood, will yield a suitable predictive posterior.
rrijU p \ ( rrijn P


55
(4.7)
(4.5)
(4.6)
Recall that for a set of points £1}..., tn in [0, T], we may approximate (4.5)-
(4.7) by
T
dii = -O'0
n
where dij > Sij when the limit is taken as n oo with G [0,T).
In this section, the smoothing matrix S corresponds to a discretized analog of
S. That is, it maps a discrete set of points y(t{),..., y(tn) on the noisy observed
curve to the corresponding points /2(£i),..., fi(tn) on the smoothed curve which
results from the application of S. In particular, in this section we examine how
the characteristics of our self-adjoint S correspond to those of our symmetric S as
n > oo in a fixed-domain sense.
Suppose S is self-adjoint. Then for any x, y,
Therefore for points tx,...,tn G [0,T], with n ^ oo in a fixed-domain sense,


Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
SMOOTHING FUNCTIONAL DATA
FOR CLUSTER ANALYSIS
By
David B. Hitchcock
August 2004
Chair: George Caseila
Cochair: James G. Booth
Major Department: Statistics
Cluster analysis, which attempts to place objects into reasonable groups on
the basis of statistical data measured on them, is an important exploratory tool
for many scientific studies. In particular, we explore the problem of clustering
functional data, which arise as curves, characteristically observed as part of
a continuous process. In recent years, methods for smoothing and clustering
functional data have appeared in the statistical literature, but little work has
appeared specifically addressing the effect of smoothing on the cluster analysis.
We discuss the purpose of cluster analysis and review some common clustering
methods, with attention given to both deterministic and stochastic methods.
We address functional data and the related field of smoothing, and a measure of
dissimilarity for functional data is suggested.
We examine the effect of smoothing functional data on estimating the dis
similarities among objects and on clustering those objects. We prove that a
shrinkage method of smoothing results in a better estimator of the dissimilarities
among a set of noisy curves. For a model having independent noise structure,
IX


22
Similarly, £[J0'S0] = $E['SO] = £[k + 0'S0] = ? + J0'S0.
MS£| -0'0 ) = £
n
= var
-O'O -O'O
n n
i 2
T2
n*
-0'0- -fl'el
n n J
2n + 40'0
+ = T2( + ^G'G + 1
n n
= r2| + 4¡WI2 + i I-
n n
(3.1)
M SE
K*8*)
= E
-G'SG- -G'G
n n
= varGSG
E
-G'SG
TI
- W
TI
T2
TI*
2k + 40 S0
f Tk T T '
+ < + -0 S0 -0 0
n n n
= T2(^ + l0'S'S0)+{^ fc + 0'S'S0-0'0]}
= T2(| + llse¡l2) +{^[<= + IIS0||2-|lll2]}!
^{h2 + 2*(||S0||2-||0f)
= T2(H + 4llS0l!2) +
n2 n2
2\2
+ (||S0||2-||0||2)
}
= T2(^ + Hse!l2 + ^) + ^ 2fc(||se||2-¡|(p)
T2
H T
n
2\2
is0ii2-¡i0in
(3-2)
Comparing (3.1) with (3.2), we see that if 0 lies in the subspace that S projects
onto, implying that S0 = 0, then MSE(^G'SG) < MSE(^G'G) since k < n,
and thus smoothing leads to a better estimator of each pairwise dissimilarity. In
this case, smoothing reduces the variance of the dissimilarity estimate, without
adversely increasing the bias, since S0 = 0.


98
SSEj, the sum of squared errors of the regression of all the objects in cluster j, is:
SSE¡ = (yi',...,y/)
y, '
\ yj /
( t)
/ T ^
-i
( \
yx
(T\.
> T')
j

lTJ
lTJ
^ yj y
= yi yi + ymj ymj
- (yi'T + + ym/T)KT'T]-1(T,yi + + T'ymj)
= yx'yi + ^ym/yiiij
- (yi'T + + ym;T)[(rT)-1T'yi + + (T'T)-1T'ymj]
TTlj
= KyiVi yiT(T'T)-1T'yi) + + (ym/ym, y_,T(T'T)-,T'jr1)]
m j j j j
+
/ \
yi
TTlj 1 .
(yi ,-.-,ymj )S
Tfli
\ y / _
(where S is the rrijU x rrijU block matrix with In in each diagonal block and
-^j3yT(T/T)-1T' in each off-diagonal block)
1 ^
m.
- T.SSE> + -7T- l(yi' y-,')S(y,' y/)'].
i / TTl-i
3 l-l / 3
Hence, the mean squared error of the regression of all the objects in cluster j,
is:
1 m
+
1
3 =1
rrijn p*
TTlj 1
mi
[(yi\---,yn/)S(yi',


ACKNOWLEDGMENTS
It has been a long but rewarding journey during my time as a student, and I
have many people I need to thank.
First of all. I thank God for my having more blessings in my life than I
possibly deserve. Next, I thank my family: the support and love of my parents and
sister, and the pride they have shown in me, have truly inspired me and kept me
determined to succeed in my studies. And I especially thank my wife Sandi, who
has shown so much confidence in me and has given me great encouragement with
her love and patience.
I owe a great deal of thanks to my Ph.D. advisors, George Casella and Jim
Booth. In addition to being great people, they have, through their fine examples,
taught me much about the research process: the importance of knowing the
literature well, of looking at simple cases to understand difficult concepts, and of
writing results carefully. I also thank my other committee members, Jim Hobert,
Brett Presnell. and John Henretta, accomplished professors who have selflessly
given their time for me. All the teachers I have had at Florida deserve my thanks,
especially Alan Agresti. Andy Rosalsky, Denny Wackerly, and Malay Ghosh.
I could not have persevered without the support of my fellow students here
at Florida. I especially thank Bernhard and Karabi, who entered the graduate
program with me and who have shared my progress. I also thank Carsten, Terry,
Jeff, Siuli, Sounak, Jamie, Dobrin. Samiran, Damaris, Christian, Ludwig, Brian,
Keith, and many others for all their help and friendship.
IV


82
Table 7-1: The classification of the 78 yeast genes into clusters, for both observed
data and smoothed data.
Cluster
Observed Data
Smoothed Data
1
1,5.6.7.8.11.75
1,2.4,5.6.7,8,9,10,11,13,62,70,74,75
2
2,3,4.9.10.13.16.19,22,23,27,
30.32.36.37,41.44,47,49,51,53,
61,62.63.64.65.67,69.70,74,78
16,19.22,27,30,32,36,37,41,42,
44,47.49,51,61.63,
65,67,69,78
3
12,14.15.18,21.24,25,26.28,
29.31.33.34.35.38,39,40,
42.43.45,46,48.50,52
3,12,14,15,18,21,24,25,26,
28,29,31,33,34.35.38,39,
40,43,45,46,48,50,52
4
17,20.54.55,56.57,58,59.60
17,20.23,53,54,55,56.57,58,59,60.64
5
66.68.71.72.73.76.77
66,68.71.72.73,76,77
adjusted via the James-Stein shrinkage procedure described in Section 3.2, with
a 5 here, a choice based on the values of n = 18 and k = 6.
Figure 7-1 shows the resulting clusters of curves for the cluster analyses of
the observed data and of the smoothed curves. We can see that the clusters of
smoothed curves tend to be somewhat less variable about their sample mean curves
than are the clusters of observed curves, which could aid the interpretability of the
clusters.
In particular, features of the second, third, and fourth clusters seem more
apparent when viewing the clusters of smooths than the clusters of observed
data. The major characteristic of the second cluster is that it seems to contain
those processes which are relatively flat, with little oscillation. This is much more
evident in the cluster of smooths. The third and fourth clusters look similar for
the observed data, but a key distinction between them is apparent if we examine
the smooth curves. The third cluster consists primarily of curves which rise to
a substantial peak, decrease, and then rise to a more gradual peak. The fourth
cluster consists mostly of curves which rise to a small peak, decrease, and then rise
to a higher peak. This distinction between clusters is seen much more clearly in the
clustering of the smooths.


15
curves at a fine grid of n points, t\,... in, so that we observe N independent
vectors, each n x 1: yi,... ,ys-
A possible model for our noisy data is the discrete noise model:
y ij = G-iji l = 1) i -N, j 1. ... ,71. (2.1)
Here, for each i = 1,..., N, ij may be considered independent for different
measurement points, having mean zero and constant variance erf.
Another possible model for our noisy curves is the functional noise model:
Vi(tj) = Mt(ij) + ^i(tj),i = = l,...,n, (2.2)
where e(£) is. for example, a stationary Ornstein-Uhlenbeck process with pull
parameter ¡3 > 0 and variability parameter of. This choice of model implies that
the errors for the ith discretized curve have variance-covariance matrix £* = of SI
where f¡m = (2/3)-1 exp(/3|t¡ £m|) (Taylor et al., 1994). Note that in this case,
the noise process is functional specifically Ornstein-Uhlenbeck but we still
assume the response data collected is discretized, and is thus a vector at the level
of analysis. Conceptually, however, the noise process is smooth and continuous in
(2.2), as is the signal process in either model (2.1) or (2.2).
Depending on the data and sampling scheme, either (2.1) or (2.2) may be an
appropriate model. If the randomness in the data arises from measurement error
which is independent from one measurement to the next, (2.1) is more appropriate.
Ramsay and Silverman (1997, p. 42) suggest a discrete noise model for the Swiss
growth data, in which heights of boys are measured at 29 separate ages, and in
which some small measuring error (independent across measurements) is likely to
be present in the recorded data.
In the case that the variation of the observed data from the underlying curve is
due to an essentially continuous random process, model (2.2) may be appropriate.


51
Now, since a2<74/ft is decreasing in .ft and ft q\ q2 is increasing in ft, these
quantities have covariance < 0. Hence
/ q2 + o2tr(S'E) >
\+E
2 a2bA
\ft + cr2ir[(I S)E]^
. ft .
/ ft + a2ir(S£) ^
+ E
2a2<74
Vft + ^2M(I-S)E]J
. ft .
£[ft ft ft]
( E[2a2a4]q2
£[2a2<74]cr2ir(S£)
ft + <72tr[(I S)£] + qi+ o2tr[(I S)E]
2 a2o4
ft
+ £
2a2o4
L ft
cr2tr[(I S)E]
- E
L ft
E[2a2o4]q2
2 a2o4
1
- £
1
LftJ
+ £
L ft
£[ft
+ £'[2a2(j4]
cr2ir(S£)
ft 4- cr2ir[(I S)£]
Since by Jensen's Inequality, \/E[q\] E[l/qi] < 0,
(4.4) < £[2a2<741
cr2ir(S£)
ft + o2tr[(I S)£]
+ E
2a2 . ft
cr2ir[(I S)S]
ir(SE)
£ £'2oV'm^I + £
Our asymptotic upper bound for is:
2^.41
2a2o
ft
ff2ir[(I-S)E].
A*au = £[4a + E
(
'ir[(I-S)£]
+ E
2a2o4
L ft
2aa2
9 4 \ 2n
aro \
ft
= E
2a24
-4a ft
2 4 4a3 + 4a2a4 h
tr[(I S)S]
ft
ft2


74
Independent error structure
Ornstein-Uhlenbeck error structure
Figure 6-2: Proportion of pairs of objects correctly matched, plotted against a
(n = 200).
Solid line: Smoothed data. Dashed line: Observed data.
Table 6-3: Clustering the observed data and clustering the smoothed data (inde
pendent error structure, n = 30).
Independent error structure
a
0.2
0.4
0.5
0.75
1.0
1.25
1.5
Ratio of avg. MSEs
1.23
4.02
5.12
5.95
6.30
6.24
6.27
Avg. prop, (observed)
.999
.466
.376
.298
.299
.312
.316
Avg. prop, (smoothed)
1.00
.552
.458
.359
.330
.321
.334
Ratios of average MSE^^ to average MS£'(smoot/l). Average proportions of pairs
of objects correctly matched. Averages taken across 100 simulations.


85
time
time
Figure 7-2: Edited plots of clusters of genes which were classified differently as
observed curves and smoothed curves.
Observed curves (left) and smoothed curves (right).


CHAPTER 4
CASE II: DATA FOLLOWING THE FUNCTIONAL NOISE MODEL
Now we will consider functional data following model (2.2). Again, we assume
the response is measured at n discrete points in [0, T], but here we assume a
dependence among errors measured at different points.
4.1 Comparing MSEs of Dissimilarity Estimators when 0 is in the
Linear Subspace Defined by S
As with Case I, we assume our linear smoothing matrix S is symmetric and
idempotent, with r(S) = tr(S) = k. Recall that according to the functional noise
model for the data, 0 ~ N(0, E) where the covariance matrix E corresponds, e.g.,
to a stationary Ornstein-Uhlenbeck process.
Note that ^0 0 represents the approximate L2 distance between observed
curves y(£) and ¡jj{t), and that, with an Ornstein-Uhlenbeck-type error structure, it
approaches the exact distance as n > oc in a fixed-domain sense.
Also, ^0'S S0 = ^0S0 represents the approximate L2 distance between
smoothed curves /z() and fij(t), and this approaches the exact distance as n > oo.
We wish to see when the smoothed-data dissimilarity better estimates the true
dissimilarity Sij between curves /(i) and Pj{t) than the observed-data dissimilarity.
To this end let us examine MSE(^0'S0) and MSE(^0'0) in estimating ^0'0
(which approaches as n > oo).
In this section, we consider the case in which 0 lies in the linear subspace
defined by S, i.e., SO = 0. We now present a theorem which generalizes the result
of Theorem 3.1 to the case of a general covariance matrix.
Theorem 4.1 Suppose the observed 0 ~ N(0, E). Let S be a symmetric and
idempotent linear smoothing matrix of rank k. If 0 lies in the linear subspace
40


100
B.2 Regularity Condition for Positive Semidefinite G in Laplace
Approximation
Recall the Laplace approximation given by Lieberman (1994) for the expecta
tion of a ratio of quadratic forms:
/ xFx \ F(x'Fx)
^\x'Gx) ~ E(x'Gx)'
Lieberman (1994) denotes the joint moment generating function of xFx and x Gx
M(ui, U2) = Ffexp^x Fx -+- u;2x Gx)].
and assumes a positive definite G. In that case. FfxGx) > 0. Lieberman uses the
positive definiteness of G to show that the derivative of the cumulant generating
function of x Gx is greater than zero. That is.
d
d0J2
\ogM(0:u2) =
>
_d_
dui*
M(0,u2)
-1
M (0, UJ2)
J (x'Gx) exp{a;2X,Gx}/(x) dx \^j exp{t2X Gx}/(x) dx J
0.
The positive derivative ensures the maximum of log M(0,a;2) is attained at the
boundary point (where uj2 = 0).
For positive semidefinite G. we need the additional regularity condition that
P(x Gx > 0) > 0. i.e., the support of x'Gx is not degenerate at zero. This will
ensure that F(x'Gx) > 0, i.e., that M(0,u2) > 0. Therefore both of the
integrals in the above expression are positive and the Laplace approximation will
hold.


102
Celeux, G. and Govaert. G. (1992). A classification EM algorithm for clustering
and two stochastic versions, Computational Statistics and Data Analysis
14: 315-332.
Chow, Y. S. and Teicher. H. (1997). Probability Theory: Independence, Interchange-
ability, Martingales. New York: Springer-Verlag Inc.
Cressie, N. A. C. (1993). Statistics for Spatial Data, New York: John Wiley and
Sons.
Cuesta-Albertos, J. A.. Gordaliza. A. C. and Matrn, C. (1997). Trimmed fc-means:
An attempt to robustifv quantizers. The Annals of Statistics 25: 553-576.
de Boor, C. (1978). A Practical Guide to Splines, New York: Springer-Verlag Inc.
DeVito, C. L. (1990). Functional Analysis and Linear Operator Theory, Redwood
City, California: Addison-Wesley.
Eubank, R. L. (1988). Spline Smoothing and Nonparametric Regression, New York:
Marcel Dekker Inc.
Falkner, F. (ed.) (1960). Child Development: An International Method of Study,
Basel: Karger.
Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering
method? Answers via model-based cluster analysis, The Computer Journal
41: 578-588.
Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis,
and density estimation, Journal of the American Statistical Association
97(458): 611-631.
Garca-Escudero, L. A. and Gordaliza, A. (1999). Robustness properties of k
means and trimmed k means, Journal of the American Statistical Association
94: 956-969.
Gnanadesikan. R., Blashfield. R. K., Breiman, L., Dunn, O. J.. Friedman, J. H., Fu,
K., Hartigan, J. A., Kettenring, J. R., Lachenbruch, P. A., Olshen, R. A. and
Rohlf, F. J. (1989). Discriminant analysis and clustering, Statistical Science
4: 34-69.
Gordon, A. D. (1981). Classification. Methods for the Exploratory Analysis of
Multivariate Data, London: Chapman and Hall Ltd.
Green, E. J. and Strawderman, W. E. (1991). A James-Stein type estimator for
combining unbiased and possibly biased estimators, Journal of the American
Statistical Association 86: 1001-1006.


86
The observed curves were each smoothed using a basis of cubic B-splines with
three knots interspersed evenly within the timepoints. The positive-part James-
Stein shrinkage was applied, with a value a = 35 in the shrinkage factor (based on
n = 36. k = 7). The K-medoids algorithm was applied to the smoothed curves,
with K = 4 clusters, a choice guided by the average silhouette width criterion that
Rousseeuw (1987) suggests for selecting K.
The resulting clusters are shown in Table 7-2. In Figure 7-3. we see the
distinctions among the curves in the different clusters. Cluster 1 contains a few
libraries whose volume size was relatively small at the begining of the time period,
but then grew dramatically in the first 12 years of the study, growing slowly after
that. Cluster 2 curves show a similar pattern, except that the initial growth is less
dramatic. The behavior of the cluster 3 curves is the most eccentric of the four
clusters, with some notable nonmonotone behavior apparent in the middle years.
The growth curves for the libraries in cluster 4 is consistently slow, steady, and
nearly linear.
For purposes of comparing the clusters, the mean curves for the four clusters
are shown in Figure 7-4.
Note that several of the smooths may not appear visually smooth in Figure
7-3. A close inspection of the data shows some anomalous (probably misrecorded)
measurements which contribute to this chaotic behavior in the smooth. (For
example, see the data and associated smooth for the University of Arizona library,
in Figure 7-5, for which 1971 is a suspicious response.) While an extensive data
analysis would detect and fix these outliers, the data set is suitable to illustrate
the cluster analysis. The sharp-looking peak in the smooth is an artifact of the
discretized plotting mechanism, as the cubic spline is mathematically assured to
have two continuous derivatives at each point.


CHAPTER 6
SIMULATIONS
6.1 Setup of Simulation Study
In this chapter, we examine the results of a simulation study implemented
using the statistical software R. In each simulation. 100 samples of N = 18 noisy
functional data were generated such that the data followed models (2.1) or (2.2).
For the discrete noise data, the errors were generated as independent Af(0, cr)
random variables, while for the functional noise data, the errors were generated as
the result of a stationary Ornstein-Uhlenbeck process with variability parameter a2
and pull' parameter /3 = 1.
For each sample of curves, several (four; clusters' were built into the data
by creating each functional observation from one of four distinct signal curves, to
which the random noise was added. Of course, within the program the curves were
represented in a discretized form, with values generated at n equally spaced points
along T = [0. 20]. In one simulation example, n = 200 measurements were used and
in a second example, n = 30.
The four distinct signal curves for the simulated data were defined as follows:
Hi (f) = 0.51n(i + 1) + .01 cos(), i = 1,..., 5.
Hi(t) = log10( T 1) .01 cos(2£), z = 6,..., 10.
Hi(t) = 0.751og5(i + l) + .01sin(3i),i = 11,..., 15.
Hi(t) = 0.3>/t + 1-.01 sin(4f), z = 16,, 18.
68


33
Now, diJS)'eiJS) = O'0 2(¡> + ' A a = E
= E
- E
(io'o o'o 2(p'e + ')2 (o'o o'o)
({O'O O'6) {2e 4>')^j ('O O'O)
-2(2'4>){0'6 OO) + {2(f)'0 )
Now,
,* -0'(I-S)0 2
(p 0 = ao2 = acr2
9i
and
a2 9i
O'{l- S)'(I- S)0 =
a2&4
9 i
So
A = E
2(2ai2 -a2pj('-e'e) + (ia* *
Since a and 0 are independent,
E[-4aa2{0'0 O'O)] = E[-4aa2]E{0'0 O'0) = £[-4a<72]cr2n
and
£
= E
= E
2(-e'e)
. 9l
/ (1. (J A A Al A I .
(0 (I S)0 + 0 S0 0 (I S)0 0 S0)
9i
2a2(j4
9i
= E[2a2a4]E
[Qi + 92 ~ 9i ~ 92)
2a2(74
>15
1
+ E
L9iJ
(9i 9i 92)
by the independence of 0 and 0 (and thus <7 and q2).
(3.11)


93
is used to estimate each signal curve. Fortunately, methods such as regression
splines, especially when the number of knots is fairly large, are flexible enough
to approximate well a variety of shapes of curves. Also, given our emphasis on
working with the pairwise differences of data vectors (i.e., discretized curves), for
the differences to make sense, we need each curve in the data set to be measured at
the same set of points il5..., in, another limitation.
The abstract case of data which come to the analyst as pure continuous
functions was resolved in this thesis in the case for which the signal curves were in
the subspace defined by the linear operator 5, but the more general case still needs
further work. A more abstract functional analytic approach to functional data
analysis, such as Bosq (2000) presents, could be useful in this problem.
The practical distinction that arises when using the James-Stein adjustment
between smoothing the pairwise differences in the noisy curves as opposed to
smoothing the noisy curves themselves was addressed in Section 6.2. Empirical ev
idence was given indicating that it mattered very little in practice which procedure
was done, so one could feel safe in employing the computationally faster method of
smoothing the noisy curves directly. More analytical exploration of this is needed,
however.
With the continuing growth in computing power available to statisticians,
computationally intensive methods like stochastic cluster analysis will become more
prominent in the future. Much of the work that has been done in this area has
addressed the clustering of traditional multivariate data, so stochastic methods for
clustering functional data are needed. Objective functions designed for functional
data, such as the one proposed in Chapter 5, are the first step in such methods.
The next step is to create better and faster search algorithms to optimize such
objective functions and discover the best clustering.


APPENDIX A
DERIVATIONS AND PROOFS
A.l Proof of Conditions for Smoothed-data Estimator Superiority
Now. since S is a shrinking smoother, this means ||Sy|| < | |y11 for all y, and
hence ||Sy||2 < ||y||2 for all y. Therefore. ||S0||2 < ||0||2. Hence, and because
k < n. we see that the first term of (3.2) is less than the first term of (3.1), and
that the second term of (3.2) is < 0. The third term of (3.2) will be positive,
though.
Therefore, the only way the observed-data estimator would have a smaller
MSE than the smoothed-data estimator is if the negative quantity ||S0||2 ||0||2 is
of magnitude large enough that its square (which appears in the third term of (3.2)
overrides the advantage (3.2) has in the first two terms. This corresponds to the
situation when the shrinking smoother shrinks too much past the underlying mean
curve. This phenomenon is the same one seen in simulation studies.
We can make this more mathematically precise. Note that
(j[(l|S0||2 lien2)2] th\ + 4n2||0||2 + 1) < 0
^[(liseil2 lien2)2] < o
95


SMOOTHING FUNCTIONAL DATA FOR CLUSTER ANALYSIS
David B. Hitchcock
(352) 392-1941
Department of Statistics
Chair: George Casella
Degree: Doctor of Philosophy
Graduation Date: August 2004
Cluster analysis, which places objects into reasonable groups based on statis
tical data measured on them, is an important exploratory tool for the social and
biological sciences. We explore the particular problem of clustering functional data,
characteristically observed as part of a continuous process. Examples of functional
data include growth curves and biomechanics data measuring movement. In recent
years, methods for smoothing and clustering functional data have appeared in the
statistical literature, but little has specifically addressed the effect of smoothing on
the cluster analysis.
We examine the effect of smoothing functional data on estimating the dis
similarities among objects and on clustering those objects. Through theory and
simulations, a shrinkage smoothing method is shown to result in a better estimator
of the dissimilarities and a more accurate grouping than using unsmoothed data.
Two examples, involving yeast gene expression levels and research library growth
curves, illustrate the technique.


50
(2.1), in which V = a21. but this case was dealt with in Section 3.3, in which an
exact domination result was shown.)
Assume 9 ~ 1V(0. V), with V = exists a random variable S2, independent of 9, such that
S2
xl-
Then let 2 = S2/v.
As shown in Section 3.3. we may write
A = E
-2 2a<72 -
a2(74
91
^(0'0-9'0)+ (2 aa2-?j^J
Since <7 and 6 are independent.
E[-Aac2{0'0 9'9)] = E[-4aa2]E{9'9 9'9} = E[-4a2]a2tr{S)
and
E
= E
-- E
2a2*V e)
L 9i
2 a2a4
9i
2 a24
9i
= E[2a2dA}E
(0'(I S)0 + 9S9 9'{I S)9 9'S9)
(91 + 92 9i 92)
92
+ E
.91.
2a2 L 9i
(9i 9i 92)
(4.3)
by the independence of a and 9 (and thus <7 and q2). We use the (large n) approx
imation to £[92/91] obtained in Section 4.2 to obtain an asymptotic upper bound
for (4.3):
r0 2~4i/ 92 + <72ir(SS) \ J2a2&4^ A ,
£[2a 17 1 ( + a2ir[(I S)E]) + B [(?1 91 H' (4'4)


REFERENCES
Abraham. C.. Cornillon, P. A., Matzner-Lober, E. and Molinari, N. (2003).
Unsupervised curve clustering using B-splines, The Scandinavian Journal of
Statistics 30: 581-595.
Alter, O., Brown. P. O. and Botstein, D. (2000). Singular value decomposition
for genome-wide expression data processing and modeling, Proceedings of the
National Academy of Sciences 97: 10101-10106.
Association of Research Libraries (2003). ARL Statistics Publication Home
Page. Available at www.arl.org/stats/arlstat. accessed November 2003.
published by Association of Research Libraries.
Baldessari. B. (1967). The distribution of a quadratic form of normal random
variables. The Annals of Mathematical Statistics 38: 1700-1704.
Bjornstad, J. F. (1990). Predictive likelihood: A review (C/R: p255-265), Statistical
Science 5: 242-254.
Booth, J. G.. Casella. G., Cooke, J. E. K. and Davis. J. M. (2001). Sorting
periodically-expressed genes using microarray data. Technical Report 2001-026.
Department of Statistics, University of Florida.
Bosq, D. (2000). Linear Processes in Function Spaces: Theory and Applications,
New York: Springer-Verlag Inc.
Brockwell. P. J. and Davis. R. A. (1996). Introduction to Time Series and Forecast
ing, New York: Springer-Verlag Inc.
Buja, A., Hastie. T. and Tibshirani, R. (1989). Linear smoothers and additive
models (C/R: p510-555), The Annals of Statistics 17: 453-510.
Butler, R. W. (1986). Predictive likelihood inference with applications (C/R:
p23-38). Journal of the Royal Statistical Society, Series B, Methodological
48: 1-23.
Casella, G. and Berger, R. L. (1990). Statistical inference, Belmont, California:
Duxbury Press.
Casella, G. and Hwang, J. T. (1987). Employing vague prior information in the
construction of confidence sets, Journal of Multivariate Analysis 21: 79-104.
101


5
object i and object j
dE{i,j) = yJ{Vi\ ~ Vj\)2 + (Vi2 ~ 2/j 2)2 + * + (Vip Vjp)2-
The Manhattan (city-block) distance is
= life yji\ + \ya ~ Va\ + ' + !ViP~ VjP\-
Certain types of data require specialized dissimilarity measures. The Canberra
metric and Czekanowski coefficient (see Johnson and Wichern. 1998, p. 729) are
two dissimilarity measures for nonnegative variables, while Johnson and Wichern
(1998, p. 733) give several dissimilarity measures for binary variables.
Having chosen a dissimilarity measure, one can construct a N x N (symmetric)
dissimilarity matrix (also called the distance matrix) D whose rows and columns
represent the objects in the data set, such that D^ = Dj = d(i,j).
In the following sections, some common methods of cluster analysis are
presented and categorized by type.
1.3 Hierarchical Methods
Hierarchical methods can be either agglomerative or divisive. Kaufman and
Rousseeuw (1990) compiled a set of methods which were adopted into the cluster
library of the S-plus computing package, and often the methods are referred to by
the names Kaufman and Rousseeuw gave them.
Agglomerative methods begin with N clusters; that is, each observation forms
its own cluster. The algorithm successively joins clusters, yielding N 1 clusters,
then N 2 clusters, and so on until there remains only one cluster containing all
N objects. The S-plus functions agues and hclust perform agglomerative clustering.
Common agglomerative methods include linkage methods and Wards method.
All agglomerative methods, at each step, join the two clusters which are
considered closest. The difference among the methods is how each defines


59
Now 00~ x'n (^)- where A = 0 0 here. The third moment of a noncentral
chi-square is given by
n3 + 6n2 -+- 6n2A 8 n + 36nA + 12?iA 4- 48A + 48A2 + 8A3.
Hence ^E[(0'0)3] =
T3[l + 6/n + 6A/n + 8/n2 + 36A/n2 + 12A2/n2 + 48A/n3 + 48A2/n3 + 8A3/n3]
Now. A/n is bounded for n > 1 since ~0'0 is the discrete approximation of the
distance between signal curves /i() and which tends to y. Hence the third
moment is bounded.
This boundedness, along with the convergence in distribution, implies (see
Chow and Teicher. 1997. p. 277) that
lim £[(4)*] = £[(4)*] < oo
for k = 1,2.
So we have
lim varldij]
n*00
= limE[(dij)2] lim[£(dy)]2
= lim£[(4)2] [Hm£'(4)]2
= E[(4)2] [£(4)]2
= var[iy] < oo.
An almost identical argument yields
lim tar[d-*moot/l)] = t/ar[4moot/l)].
noo J J


77
Number of Clusters Misspecified as 3
Number of Clusters Misspecified as 5
sigma
Figure 6-4: Proportion of pairs of objects correctly matched, plotted against cr,
when the number of clusters is misspecified.
Solid line: Smoothed data. Dashed line: Observed data.


39
and since P BP = S,
Elle'e)=ll^W
G'SSG + tri-
n
n
oo.
\n\n+ 1 ) )
This posterior mean is the Bayes estimator of dtJ.
Since fr(B) = k, note that tr{ J(^B)) = Jir(^j-B) = ^ 0 as
2
Also, 1 as n oo.
Then if we let the number of measurement points n t oo in a fixed-domain
sense, these Bayes estimators of approach d(.!mooth) ^G'SSG in the sense that
the difference between the Bayes estimators and tends to zero. Recall that
d^ ^G'G, the approximate distance between curve i and curve j, which in the
limit approaches the exact distance between curve i and curve j.


62
Now. since uar(||5fl|2) < uar(||f||2) from Corollary 4.1. we have:
A < E(\\Sf\\2 + ||f||2)£(||Sf||2 ||f||2) (||5f||2 + ||f||2)£(||Sf||2 ||f||2)
= E(\\Sf\\2 ||f||2) [^(||5f||2 + ||f||2) (||5fl|2 + ¡|f||2)l. (4.9)
Since 5 is a shrinking smoother, ||Sf|| < ||f|| for all f. Hence ||5f||2 < ||f||2 for
all f and hence £'(||5f||2) < i?(||f||2). Therefore the first factor of (4.9) is < 0.
Also note that
i?(||Sf||2) = E^[Sf]2dt\ = f* E[Sf]2dt
= [ var[Sf}dt+ [ [E[Sf])2 dt
Jo Jo
= f var[Sf]dt+ [Sf]2df
Jo Jo
= [ uar[Sf]dt + ||Sf||2 > ||Sf||2.
Jo
£(||f||2) = [tf it\ = [ E[tf dt
= [ var[f]dt+ f (E[f])2dt
Jo Jo
= f uar[f] dt + [ [f]5
Jo Jo
= / var[f]
Jo
Similarly,
]2 dt
dt + llfll2 > "f"2
Hence the second factor of (4.9) is positive, implying that (4.9) < 0. Hence the
risk difference A < 0.
Remark. We know that ||Sf|| < ||f|| since S is shrinking. If, for our particular
Vi(t) and Vj(t), ||Sf|| < ||f||, then our result is MSE(6(th)) < MSE(fy).


This dissertation was submitted to the Graduate Faculty of the Department of
Statistics in the College of Liberal Arts and Sciences and to the Graduate School
and was accepted as partial fulfillment of the requirements for the degree of Doctor
of Philosophy.
August 2004
Dean. Graduate School


60
As a consequence of Lemma 4.2. since var[d-*mooi/l)] < var[dij] for all n,
lim var\d^mooth)] < lim var\dtj] < oo.
noc J n>oc
Hence
var[S\jmooth)] < var[Sij].
The following theorem establishes the domination of the smoothed-data
dissimilarity estimator ^moothl over the observed-data estimator Sjj, when the
signal curves Pi{t), i = 1,..., N. lie in the linear subspace defined by S.
Theorem 4.4 If S is a self-adjoint operator and a shrinking smoother, and if
Pi(t) is in the linear subspace which S projects onto for all i = 1... .,N, then
MSE((ooth)) < MSEiSij).
Proof of Theorem 4-4'
Define the difference in MSEs A to be as follows (we want to show that A is
negative):
C / T T \ 2 \
= E[(Kfo [£(*) hj(t)]2dt [Pi{t) Pj{t)]2 dtj j
~ E{(fo ^Vi^ ~ Vj^2 dt ~ h ~ dt
+ (/ [/^() Mi(0]2*)
_ 2 (/ \-Syi^ ~ SyiW2 dt) (fQ ~ }
~E{(fo {yi^~yjW*dt) + (/ [m(0 -
~ 2 (/ ~ Vj^2 dt) (/ M*)]2 di) }
'¡Of
[Syi(t) Syj{t)}2dt
2


56
Now, for a symmetric matrix S, for any x. y, (x. Sy) = (Sx. y) <=> x Sy =
x Sy. At the sequence of points ti,tn: x = (x(£i),....£(£)), and for our
smoothing matrix S.
Sy = {S[y](tiS[y](t))' = (j K(tl,s)y(s)ds...., j K(tn,s)y(s) ds^j
So
x Sy =
= x'Sy =
[ x(ti)K(ti, s)y(s) ds 4 f [ x(tn)K(tn, s)y(s) ds
Jo Jo
I K(tl,s)x{s)y{t1)ds H h [ K{tn, s)x{s)y{tn) ds
Jo Jo
and
T
x{ti)K(t\, s)y(s) ds H 1-
T rT
K{t\, s)x(s)y(ti) ds H 1- / K(tn, s)x(s)y(tn) ds
Jo
So taking the (fixed-domain) limits as n oo, we obtain (4.8), which shows that
the defining characteristic of symmetric S corresponds to that of self-adjoint S.
Definition 4.1 5 is called a shrinking smoother if ||5y|| < ||t/|| for all elements y
(Buja et al., 1989).
A sufficient condition for S to be shrinking is that all of the singular values of
S be < 1 (Buja et al., 1989).
Lemma 4.1 Let A be a symmetric matrix and £ be a symmetric, positive definite
matrix. Then
j
x{tn)K(tn,s)y{s)ds
sup
X
x A SAx
x' Ex
< (emax(A)]2
where emax(A) denotes the maximum eigenvalue of A.
Proof of Lemma 4-1:


46
Specifically, since qx = &{I S)0, and (I S)£ has eigenvalues cx,..., c,,, then
Qi ~ cxi2(,2)
i=l
for some noncentrality parameter <5? > 0 which is zero when 0 = 0.
Under 6 = 0. then, qi ~ CiX2, a linear combination of independent central
xi variates.
Since a noncentral xl random variable stochastically dominates a central x2>
for any S? > 0,
P[Xi(Si) >x]> P[xl > A Vx > 0.
As shown in Appendix A.2, this implies
iW(tf) + + cx'i2(2) > x] > P[cix\ + + Cnxl >x] V X > 0.
Hence for all x > 0, Pe^o[q\ > x\ > P0[<7i > x], i.e., the distribution of q\ with
6 0 stochastically dominates the distribution of q\ with 6 = 0. Then
^oKii)"] < EolWi)-]
for m = 1,2,....
Letting M2 = i?o[(<7i)_2], we have the following asymptotic upper bound for A:
A^ = 4a2 4air(£) + M2a4 +
4a3
2a2ir(£)
£r[(I-S)E] ~~ £r[(I-S)E]
= M2a4
ra3 +
2£r(E)
+ 4 ) a2 4ir(E)a.
£r[(I S)E] \tr[(I S)E]
Then M2 = P[(X^=i ^Xi)-2]- We see that if n A: > 4, then M2 exists and is
E
Wr,
(i)l
^ A^2 ^ E
^min
positive since


29
Table 3-1: Table of choices of a for various n and k.
n
k
minimizer a*
root r
20
5
9.3
19.0
50
5
31.9
72.2
100
5
71.3
169.1
200
5
153.2
367.5
And since Δ_U(a) has leading coefficient c_4 > 0,
lim_{a→∞} Δ_U(a) = ∞.
Since it tends to ∞ at its endpoints, Δ_U(a) must be negative between its two real
roots 0 and r. Therefore Δ_U, and hence the risk difference Δ, is negative for
0 < a < r. Using a symbolic algebra software program (such as Maple or Mathematica),
one can easily obtain the formula for the second real root for general n and k
(the formula is given in Appendix B.1) and verify that the other two roots are
imaginary.
Figure 3-1 shows Δ_U plotted as a function of a for varying n and k = 5.
For various choices of n (and k = 5), Table 3-1 provides values of r, as well as
the value of a which minimizes the upper bound for Δ. For 0 < a < r, the risk
difference is assured of being negative. For a = a*, Δ_U is minimized.
Since Δ_U provides an upper bound for the (scaled) risk difference, it may
be valuable to ascertain the size of the discrepancy between Δ_U and Δ. We can
estimate this discrepancy via Monte Carlo simulation. We generate a large number
of random variables having the distribution of q̂_1 (namely χ²_{n−k}(δ²)) and get an
estimate of Δ using a Monte Carlo mean. For various values of θ, Figure 3-2
shows the simulated Δ plotted alongside Δ_U.
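For concreteness, a minimal Monte Carlo sketch of this comparison is given below, written in R (the same S dialect as the S-plus functions cited in Chapter 1). The grid size, basis dimension, shrinkage constant a, and signal difference curve are arbitrary illustrative choices, not the settings behind Table 3-1 or Figure 3-2; the sketch simply estimates the scaled risk difference directly for the independent-error model with σ² = 1.

    # Monte Carlo sketch of the risk difference between the James-Stein
    # smoothed-data dissimilarity estimator and the observed-data estimator,
    # under independent N(0, 1) errors.  All settings are illustrative.
    set.seed(1)
    n <- 20; k <- 5; a <- 9.3; Tlen <- 1
    tt <- seq(0, Tlen, length = n)
    B <- splines::bs(tt, df = k, intercept = TRUE)   # n x k B-spline basis
    S <- B %*% solve(crossprod(B)) %*% t(B)          # symmetric, idempotent smoother
    theta <- sin(2 * pi * tt)                        # hypothetical difference of signal curves
    delta <- (Tlen / n) * sum(theta^2)               # target dissimilarity
    nsim <- 10000
    se.obs <- se.js <- numeric(nsim)
    for (m in 1:nsim) {
      theta.hat <- theta + rnorm(n)                  # observed difference vector
      resid <- theta.hat - S %*% theta.hat           # (I - S) theta.hat
      theta.js <- S %*% theta.hat + (1 - a / sum(resid^2)) * resid
      se.obs[m] <- ((Tlen / n) * sum(theta.hat^2) - delta)^2
      se.js[m]  <- ((Tlen / n) * sum(theta.js^2) - delta)^2
    }
    mean(se.js) - mean(se.obs)   # Monte Carlo estimate of the risk difference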
3.3 Extension to the Case of Unknown σ²
In the previous section, σ² was assumed to be known. We now examine the
situation in which θ̂ ~ N(θ, σ²I) with σ² unknown.


80
When the signal curves are second-order and the simple linear smoother
oversmooths, the unadjusted smooths are clustered relatively poorly. Here, the
James-Stein method is needed as a safeguard, to shrink the smooth toward the
observed data. The James-Stein adjusted smooths and the observed curves are
both clustered correctly more often than the unadjusted smooths.


CHAPTER 3
CASE I: DATA FOLLOWING THE DISCRETE NOISE MODEL
First we will consider functional data following model (2.1). Recall that we
assume the response is measured at n discrete points in [0,T].
3.1 Comparing MSEs of Dissimilarity Estimators when θ is in the
Linear Subspace Defined by S
We assume our linear smoothing matrix S is symmetric and idempotent. For a
linear basis function smoother which is fitted via least squares, S will be symmetric
and idempotent as long as the n points at which μ̂_i is evaluated are identical to the
points at which y_i is observed (Ramsay and Silverman, 1997, p. 44). Examples of
such smoothers are regression splines (in particular, B-splines), wavelet bases, and
Fourier series bases (Ramsay and Silverman, 1997). Regression splines and B-spline
bases are discussed in detail by de Boor (1978) and Eubank (1988, Chapter 7).
We also assume that S projects the observed data onto a lower-dimensional
space (of dimension k < n), and thus r(S) = tr(S) = k. Note that S is a shrinking
smoother, since all its singular values are ≤ 1 (Buja et al., 1989). Recall that
according to the discrete noise model for the data, θ̂ ~ N(θ, σ²I). Without loss
of generality, let σ²I = I. (Otherwise, we can let, for example, η = σ⁻¹θ and
η̂ = σ⁻¹θ̂ and work with η and η̂ instead.)
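As a small numerical illustration of these assumptions, the R sketch below builds a B-spline regression smoother matrix fitted by least squares and checks that it is symmetric, idempotent, has trace equal to the basis dimension k, and has all singular values in [0, 1]. The grid of points and the value of k are arbitrary choices made only for illustration.

    # Checking the assumed properties of a least-squares basis-function smoother S.
    n <- 50; k <- 6
    tt <- seq(0, 1, length = n)
    B <- splines::bs(tt, df = k, intercept = TRUE)   # n x k B-spline basis matrix
    S <- B %*% solve(crossprod(B)) %*% t(B)          # S = B (B'B)^{-1} B'
    all.equal(S, t(S))                               # symmetric
    all.equal(S, S %*% S)                            # idempotent
    sum(diag(S))                                     # trace equals k (the rank of S)
    range(svd(S)$d)                                  # singular values lie in [0, 1]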
Note that (T/n) θ̂'θ̂ represents the approximate L2 distance between observed
curves y_i(t) and y_j(t), and (T/n) θ̂'S'Sθ̂ = (T/n) θ̂'Sθ̂ represents the approximate L2
distance between smoothed curves μ̂_i(t) and μ̂_j(t).
We wish to see when the smoothed-data dissimilarity better estimates
the true dissimilarity δ_ij between curves μ_i(t) and μ_j(t) than the observed-data
20


38
V = [ V_11   0 ]
    [  0     0 ],
where V_11 is the full-rank prior variance of the first k elements of θ*. Note that this prior is only a proper
density when considered as a density over the first k elements of θ*. Recall that
the last n − k elements of θ* are zero.
Then the posterior for θ* is
π(θ* | θ̂*) ∝ exp{−(1/2)[(θ̂* − θ*)'B(θ̂* − θ*) + θ*'V⁻θ*]}
= exp{−(1/2)[θ̂*'Bθ̂* − 2θ*'Bθ̂* + θ*'(B + V⁻)θ*]}
∝ exp{−(1/2)[θ̂*'B'(B + V⁻)⁻Bθ̂* − 2θ*'Bθ̂* + θ*'(B + V⁻)θ*]}
= exp{−(1/2)[(θ* − (B + V⁻)⁻Bθ̂*)'(B + V⁻)(θ* − (B + V⁻)⁻Bθ̂*)]}.
Hence the posterior π(θ*|θ̂*) is N[(B + V⁻)⁻Bθ̂*, (B + V⁻)⁻], which is
N[(B + V⁻)⁻BPθ̂, (B + V⁻)⁻].
Thus the posterior expectation of (T/n)θ'θ is
E[(T/n)θ'θ | θ̂] = E[(T/n)θ'P'Pθ | θ̂] = E[(T/n)θ*'θ* | θ̂*]
= (T/n) θ̂*'B(B + V⁻)⁻(B + V⁻)⁻Bθ̂* + tr[(T/n)(B + V⁻)⁻]
= (T/n) θ̂'P'B(B + V⁻)⁻(B + V⁻)⁻BPθ̂ + tr[(T/n)(B + V⁻)⁻]
= (T/n) θ̂'SP'(B + V⁻)⁻(B + V⁻)⁻PSθ̂ + tr[(T/n)(B + V⁻)⁻].
Let us choose the prior variance so that V⁻ = (1/n)B. Then B + V⁻ = ((n + 1)/n)B,
so that (B + V⁻)⁻ = (n/(n + 1))B⁻, and
E[(T/n)θ'θ | θ̂] = (T/n) θ̂'SP' (n/(n+1)) B⁻ (n/(n+1)) B⁻ PSθ̂ + tr[(T/n)(n/(n+1)) B⁻]
= (n²/(n + 1)²) (T/n) θ̂'SP'B⁻B⁻PSθ̂ + (n/(n + 1)) tr[(T/n) B⁻].




CHAPTER 8
CONCLUSIONS AND FUTURE RESEARCH
This dissertation has addressed the problems of clustering and estimating
dissimilarities for functional observations.
We began by reviewing important clustering methods and connecting the
clustering output to the dissimilarities among the objects. We then described
functional data and provided some background about functional data analysis,
which has become increasingly visible in the past decade. An examination of the
recent statistical literature reveals a number of methods proposed for clustering
functional data (James and Sugar, 2003; Tarpey and Kinateder, 2003; Abraham et
al., 2003). These methods involve some type of smoothing of the data, and we have
provided justification for this practice of smoothing before clustering.
We proposed a model for the functional observations and hence for the
dissimilarities among them. When the data are smoothed using a basis function
method (such as regression splines, for example), the resulting smoothed-data
dissimilarity estimator dominates the estimator based on the observed data when
the pairwise differences between the signal curves lie in the linear subspace defined
by the smoothing matrix.
When the differences do not lie in the subspace, we have shown that a James-
Stein shrinkage dissimilarity estimator dominates the observed-data estimator
under an independent error model. With dependent errors, an asymptotic (for
n large within a fixed domain) domination result was given. (While the result
appears to hold for moderately large n, the asymptotic situation of the theorem
corresponds to data which are nearly pure functions, measured nearly continuously
91


30
Figure 3-1: Plot of Δ_U against a for varying n and for k = 5.
Solid line: n = 20. Dashed line: n = 50. Dotted line: n = 100.




48
we can easily calculate the eigenvalues of (I − S)Σ, which are c_1, ..., c_n, and via
numerical or Monte Carlo integration, we find that M_2 = 0.0012 in this case.
Substituting these values into Δ_U(a), we see, in the top plot of Figure 4-1,
the asymptotic upper bound plotted as a function of a. Also plotted is a simulated
true Δ for a variety of values (n-vectors) of θ: 0 × 1, (0,1,0,1,0,...,0,1)', and
(1,0,1,1,0,1,...,1,0)', where 1 is an n-vector of ones. It should be noted that
when θ is the zero vector, it lies in the subspace defined by S, since Sθ = θ in that
case. The other two values of θ shown in this plot do not lie in the subspace. In
any case, however, choosing the a that optimizes the upper bound guarantees that
d̂_ij^(JS) has smaller risk than d̂_ij.
Also shown, in the bottom plot of Figure 4-1, is the asymptotic upper bound
for data following the same Ornstein-Uhlenbeck model, except with n = 30.
Although Δ_U is a large-n upper bound, it appears to work well for the grid size of
30.
4.3 Extension to the Case when σ² is Unknown
In Section 3.2, it was proved that the smoothed-data dissimilarity estimator,
with the shrinkage adjustment, dominated the observed-data dissimilarity estimator
in estimating the true dissimilarities, when the covariance matrix of θ̂ was σ²I with
σ² known. In Section 3.3, we established the domination of the smoothed-data
estimator for covariance matrix σ²I with σ² unknown.
In Section 4.2, we developed an analogous asymptotic result for general known
covariance matrix Σ. In this section, we extend the asymptotic result to the case
of θ̂ having covariance matrix of the form V = σ²Σ, where σ² is unknown and Σ
is a known symmetric, positive definite matrix. This encompasses the functional
noise model (2.2) in which the errors follow an Ornstein-Uhlenbeck process with
unknown σ² and known β. (Of course, this also includes the discrete noise model

26
= −2a (θ̂'(I − S)θ̂) / ||θ̂ − Sθ̂||² + a² (θ̂'(I − S)'(I − S)θ̂) / (||θ̂ − Sθ̂||² ||θ̂ − Sθ̂||²)
= −2a (θ̂'(I − S)θ̂) / (θ̂'(I − S)θ̂) + a² (θ̂'(I − S)θ̂) / (||θ̂ − Sθ̂||² θ̂'(I − S)θ̂)
  [since S is symmetric and idempotent, θ̂'(I − S)'(I − S)θ̂ = θ̂'(I − S)θ̂ = ||θ̂ − Sθ̂||²]
= −2a + a² / ||θ̂ − Sθ̂||².
Hence the (scaled) risk difference is
Δ = E[ (θ̂'θ̂ − 2a + a²/||θ̂ − Sθ̂||²)² − (θ̂'θ̂)² − 2θ'θ(−2a + a²/||θ̂ − Sθ̂||²) ].
Note that
θ̂'θ̂ = θ̂'(I − S)θ̂ + θ̂'Sθ̂
and
θ'θ = θ'(I − S)θ + θ'Sθ.
Define the following quadratic forms:
q̂_1 = θ̂'(I − S)θ̂
q̂_2 = θ̂'Sθ̂
q_1 = θ'(I − S)θ
q_2 = θ'Sθ.
(Note that q̂_1 and q̂_2 are independent, since (I − S)S = 0 and θ̂ is normally distributed.) Now write Δ as:
Δ = E[ −2(2a − a²/q̂_1)(q̂_1 + q̂_2) + (−2a + a²/q̂_1)² + 4a(q_1 + q_2) − (2a²/q̂_1)(q_1 + q_2) ].




54
< 0, implying Δ < 0. That is, for 0 < a < r, and for sufficiently large n,
the risk difference is negative and the smoothed-data dissimilarity estimator d̂_ij^(JS) is
better than d̂_ij.
Proof of Theorem 4.3: Since Σ is positive definite, tr(Σ) is positive. Since
M_2 > 0 and M_4 > 0, we may write Δ_U(a) = 0 as c_4a⁴ + c_3a³ + c_2a² + c_1a = 0,
where c_4 > 0, c_3 < 0, c_2 > 0, c_1 < 0. The proof is exactly the same as the proof of
Theorem 4.2 in Section 4.2.
4.4 A Pure Functional Analytic Approach to Smoothing
Suppose we have two arbitrary observed random curves y_i(t) and y_j(t), with
underlying signal curves μ_i(t) and μ_j(t). This is a more abstract situation in which
the responses themselves are not assumed to be discretized, but rather come to
the analyst as "pure" functions. This can be viewed as a limiting case of the
functional noise model (2.2) when the number of measurement points n → ∞ in a
fixed-domain sense. Throughout this section, we assume the noise components of
the observed curves are normally distributed for each fixed t, as, for example, in an
Ornstein-Uhlenbeck process.
Let C[0, T] denote the space of continuous functions of t on [0, T]. Assume we
have a linear smoothing operator S which acts on the observed curves to produce
a smooth: μ̂_i(t) = Sy_i(t), for example. The linear operator S maps a function
f ∈ C[0, T] to C[0, T] as follows:
S[f](t) = ∫_0^T K(t, s) f(s) ds
(where K is some known function) with the property that S[αf + βg] = αS[f] +
βS[g] for constants α, β and functions f, g ∈ C[0, T] (DeVito, 1990, p. 66).
Recall that with squared L2 distance as our dissimilarity metric, we denote
the dissimilarities between the true, observed, and smoothed curves i and j,
respectively, as follows:


19
sensitivity of clustering methods on the degree of smoothing used to estimate the
functions."
In this thesis, we will provide some rigorous theoretical justification that
smoothing the data before clustering will improve the cluster analysis. Much of
the theory will focus on the estimation of the underlying dissimilarities among
the curves, and we will show that a shrinkage-type smoothing method leads to an
improved risk in estimating the dissimilarities. A simulation study will demonstrate
that this risk improvement is accompanied by a clearly improved performance in
correctly grouping objects into their proper clusters.


25
So an appropriate shrinkage estimator of d_ij = (T/n)θ'θ is
d̂_ij^(JS) = (T/n) θ̂^(JS)'θ̂^(JS)     (3.5)
= (T/n) [Sθ̂ + (1 − a/||θ̂ − Sθ̂||²)(θ̂ − Sθ̂)]' [Sθ̂ + (1 − a/||θ̂ − Sθ̂||²)(θ̂ − Sθ̂)].     (3.6)
Now, the risk difference (difference in MSEs) between the James-Stein
smoothed dissimilarity d̂_ij^(JS) and the observed dissimilarity is:
Δ = E[ ((T/n)θ̂^(JS)'θ̂^(JS) − (T/n)θ'θ)² ] − E[ ((T/n)θ̂'θ̂ − (T/n)θ'θ)² ].     (3.7)
Since we are interested in when the risk difference is negative, we can ignore the
positive constant multiplier T²/n².
From (3.7),
Δ = E[ (θ̂^(JS)'θ̂^(JS))² − 2θ'θ θ̂^(JS)'θ̂^(JS) − (θ̂'θ̂)² + 2θ'θ θ̂'θ̂ ]
= E[ (θ̂^(JS)'θ̂^(JS))² − (θ̂'θ̂)² − 2θ'θ(θ̂^(JS)'θ̂^(JS) − θ̂'θ̂) ].
Write θ̂^(JS) = θ̂ − φ(θ̂)θ̂, where
φ(θ̂) = [a / ||θ̂ − Sθ̂||²] (I − S).
Then
θ̂^(JS)'θ̂^(JS) = θ̂'θ̂ − 2θ̂'φ(θ̂)θ̂ + θ̂'φ(θ̂)'φ(θ̂)θ̂
= θ̂'θ̂ − 2θ̂' [a(I − S)θ̂ / ||θ̂ − Sθ̂||²] + θ̂'(I − S)' [a / ||θ̂ − Sθ̂||²] [a / ||θ̂ − Sθ̂||²] (I − S)θ̂


41
defined by S, then the dissimilarity estimator (T/n)θ̂'Sθ̂ has smaller mean squared
error than does (T/n)θ̂'θ̂ in estimating (T/n)θ'θ. That is, the dissimilarity estimator based
on the smooth is better than the one based on the observed data in estimating the
dissimilarities between the underlying signal curves.
Proof of Theorem 4.1:
MSE[(T/n)θ̂'θ̂] = var((T/n)θ̂'θ̂) + [E((T/n)θ̂'θ̂) − (T/n)θ'θ]²
= (T²/n²){2tr(Σ²) + 4θ'Σθ} + [(T/n)tr(Σ) + (T/n)θ'θ − (T/n)θ'θ]²
= (T²/n²){2tr(Σ²) + 4θ'Σθ + [tr(Σ)]²}.     (4.1)
MSE[(T/n)θ̂'Sθ̂] = E{[(T/n)θ̂'Sθ̂ − (T/n)θ'θ]²}
= var((T/n)θ̂'Sθ̂) + [E((T/n)θ̂'Sθ̂) − (T/n)θ'θ]²
= (T²/n²){2tr[(SΣ)²] + 4θ'SΣSθ} + [(T/n)tr(SΣ) + (T/n)θ'S'Sθ − (T/n)θ'θ]²
= (T²/n²){2tr[(SΣ)²] + 4θ'Σθ + [tr(SΣ)]²},     (4.2)
since Sθ = θ when θ lies in the subspace onto which S projects.
Hence, if we compare (4.2) with (4.1), we must show that tr(SΣ) ≤ tr(Σ) and
tr[(SΣ)²] ≤ tr[Σ²] to complete the proof.
tr(Σ) = tr(SΣ + (I − S)Σ)
= tr(SΣ) + tr((I − S)Σ)
= tr(SΣ) + tr(Σ^(1/2)(I − S)(I − S)Σ^(1/2))
≥ tr(SΣ)


105
Sugar, C. A. and James, G. M. (2003). Finding the number of clusters in a dataset:
an information-theoretic approach, Journal of the American Statistical
Association 98: 750-763.
Tan, W. Y. (1977). On the distribution of quadratic forms in normal random
variables, The Canadian Journal of Statistics 5: 241-250.
Tarpey, T. and Kinateder, K. (2003). Clustering functional data, The Journal of
Classification 20: 93-114.
Tarpey, T., Petkova, E. and Ogden, R. T. (2003). Profiling placebo responders
by self-consistent partitioning of functional data, Journal of the American
Statistical Association 98: 850-858.
Taylor, J. M. G., Cumberland, W. G. and Sy, J. P. (1994). A stochastic model
for analysis of longitudinal AIDS data, Journal of the American Statistical
Association 89: 727-736.
Tibshirani, R., Walther, G. and Hastie, T. (2001a). Estimating the number of
clusters in a data set via the gap statistic, Journal of the Royal Statistical
Society, Series B, Methodological 63(2): 411-423.
Tibshirani, R., Walther, G., Botstein, D. and Brown, P. (2001b). Cluster validation
by prediction strength, Technical Report 2001-21, Department of Statistics,
Stanford University.
Young, F. W. and Hamer, R. M. (1987). Multidimensional Scaling: History, Theory,
and Applications. Hillsdale, New Jersey: Lawrence Erlbaum Associates.


64
the objective function. (Bjornstad (1990) notes that, when y or z is continuous,
an alternative predictive likelihood with a Jacobian factor for the transformation
r(y, z) is possible. Including the Jacobian factor makes the predictive likelihood
independent of the choice of minimal sufficient statistic; excluding it makes the
predictive likelihood invariant to scale changes of z.)
For instance, consider a functional datum y_i(t) and its corresponding observed
data vector y_i. For the purpose of cluster analysis, let us assume that functional
observations in the same cluster come from the same linear smoothing model with
a normal error structure. That is, the distribution of Y_i, given that it belongs to
cluster j, is
Y_i | z_ij = 1 ~ N(Tβ_j, σ_j²I)
for i = 1, ..., N. Here T is the design matrix, whose n rows contain the regression
covariates defined by the functional form we assume for y_i(t). This model for
Y_i allows for a variety of parametric and nonparametric linear basis smoothers,
including standard linear regression, regression splines, Fourier series models,
and smoothing splines, as long as the basis coefficients are estimated using least
squares.
For example, consider the yeast gene data described in Section 7.1. Since
the genes, being cell-cycle regulated, are periodic data, Booth et al. (2001) use a
first-order Fourier series y(t) = β_0 + β_1 cos(2πt/T) + β_2 sin(2πt/T) as a smooth
representation of a given gene's measurements, fitting a harmonic regression model
(see Brockwell and Davis, 1996, p. 12) by least squares. Then the 18 rows of T are
the vectors (1, cos(2πt_j/T), sin(2πt_j/T)), for the n = 18 time points t_j, j = 0, ..., 17. In
this example β_j = (β_0j, β_1j, β_2j)'.
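A minimal R sketch of such a harmonic regression fit is shown below. The measurement times, period, and response vector are placeholders rather than the actual yeast-gene values, so the code only illustrates the least-squares computation of the Fourier coefficients.

    # Harmonic (first-order Fourier series) regression fitted by least squares.
    # Times, period, and response are placeholders for illustration.
    n <- 18
    period <- 1                                     # assumed period of the cycle
    tj <- seq(0, period, length = n)                # hypothetical measurement times
    y <- rnorm(n)                                   # placeholder expression ratios
    Tmat <- cbind(1, cos(2 * pi * tj / period), sin(2 * pi * tj / period))
    beta.hat <- solve(crossprod(Tmat), crossprod(Tmat, y))   # (beta0, beta1, beta2)
    smooth.fit <- Tmat %*% beta.hat                 # smooth representation of the curve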
Let J = {j : z_ij = 1 for some i} be the set of the K nonempty clusters.
The unknown parameters in f(y, z), j ∈ J, are (p_j, β_j, σ_j²). The corresponding
sufficient statistics are (m_j, β̂_j, σ̂_j²), where m_j = Σ_{i=1}^N z_ij = the number of objects


57
Note that
sup_x (x'A'ΣAx) / (x'Σx) = e_max(Σ^(−1/2)A'ΣAΣ^(−1/2)).
Let B = Σ^(1/2)AΣ^(−1/2). Then this maximum eigenvalue is e_max(B'B) ≤ e_max(A'A) =
[e_max(A)]² by a property of the spectral norm. □
Lemma 4.2 Let the linear smoother S be a self-adjoint operator with singular
values ∈ [0, 1]. Then var((T/n)||Sθ̂||²) ≤ var((T/n)||θ̂||²), where S is symmetric with
singular values ∈ [0, 1]. (Recall θ̂ = y_i − y_j.)
Proof of Lemma 4.2:
We will show that, for any n and for any positive definite covariance matrix
Σ of θ̂, var((T/n)||Sθ̂||²) ≤ var((T/n)||θ̂||²) where S is symmetric. (A symmetric matrix
is the discrete form of a self-adjoint operator (see Ramsay and Silverman, 1997,
pp. 287-290).)
var((T/n)||Sθ̂||²) ≤ var((T/n)||θ̂||²)
⟺ var(||Sθ̂||²) ≤ var(||θ̂||²)
⟺ var(θ̂'S'Sθ̂) ≤ var(θ̂'θ̂)
⟺ 2tr[(S'SΣ)²] + 4θ'S'SΣS'Sθ ≤ 2tr[Σ²] + 4θ'Σθ.
Now,
4θ'S'SΣS'Sθ ≤ 4θ'Σθ ⟺ (θ'S'SΣS'Sθ) / (θ'Σθ) ≤ 1,
and so we apply Lemma 4.1 with S'S playing the role of A. Since the singular
values of S are in [0, 1], all the eigenvalues of S'S are in [0, 1]. Hence we have
sup_θ (θ'S'SΣS'Sθ) / (θ'Σθ) ≤ 1.


44
In the same way as in Section 3.2, we may write Δ as:
Δ = E[ −4a(q̂_1 + q̂_2) + 4a² − 4a³/q̂_1 + a⁴/q̂_1² + 4a(q_1 + q_2)
       + (2a²/q̂_1)(q̂_1 + q̂_2 − q_1 − q_2) ].
Note that
E[q̂_1 + q̂_2] = q_1 + q_2 + tr[(I − S)Σ] + tr[SΣ]
= q_1 + q_2 + tr(Σ).
Hence
Δ = −4a tr(Σ) + 4a² + E[ −4a³/q̂_1 + a⁴/q̂_1² + (2a²/q̂_1)(q̂_1 + q̂_2 − q_1 − q_2) ].
Consider
E[q̂_2/q̂_1] = E[ θ̂'Sθ̂ / θ̂'(I − S)θ̂ ].
Lieberman (1994) gives a Laplace approximation for E[(x'Fx/x'Gx)^k], k ≥ 1,
where F is symmetric and G positive definite:
E[(x'Fx/x'Gx)^k] ≈ E[(x'Fx)^k] / [E(x'Gx)]^k.
In our case, (I − S) is merely positive semidefinite, but with a simple regularity
condition, the result will hold (see Appendix B.2). The Laplace approximation will
approach the true value of the expected ratio, i.e., the difference between the true
expected ratio and the approximated expected ratio tends to 0 as n → ∞;
Lieberman (1994) gives sufficient conditions involving the cumulants of the
quadratic forms which establish rates of convergence of the approximation.
Hence
E[q̂_2/q̂_1] = E[ θ̂'Sθ̂ / q̂_1 ] ≈ E(θ̂'Sθ̂) / E(θ̂'(I − S)θ̂) = (q_2 + tr(SΣ)) / (q_1 + tr[(I − S)Σ]).


72
Of course, when simply using a linear smoother S, it does not matter whether
we proceed by smoothing the curves or the differences, since Sθ̂ = Sy_i − Sy_j. But
when using the James-Stein adjustment, the overall smooth is nonlinear and there
is a difference between the two operations. Since in certain situations it may be more
sensible to smooth the curves first and then adjust the N smooths, we can check
empirically to see whether this changes the risk very much.
To accomplish this, a simulation study was done in which the coefficients of
the above signal curves were randomly selected (within a certain range) for each
simulation. Denoting the coefficients of curve i by (b_i,1, b_i,2), the coefficients
were randomly chosen in the following intervals:
b_1,1 ∈ [−3, 3], b_1,2 ∈ [−0.5, 0.5], b_6,1 ∈ [−6, 6], b_6,2 ∈ [−0.5, 0.5],
b_11,1 ∈ [−4.5, 4.5], b_11,2 ∈ [−0.5, 0.5], b_16,1 ∈ [−1.8, 1.8], b_16,2 ∈ [−0.5, 0.5].
For each of 100 simulations, the sample MSE for the smoothed data was the same,
to within 10 decimal places, whether the curves were smoothed first or the
differences were smoothed first. This indicates that the choice between these two
procedures has a negligible effect on the risk of the smoothed-data dissimilarity
estimator.
Also, the simulation analyses in this chapter were carried out both when
smoothing the differences first and when smoothing the curves first. The results
were extremely similar, so we present the results obtained when smoothing the
curves first and adjusting the smooths with the James-Stein method.
6.3 Simulation Results
Results for the n = 200 example are shown in Table 6-1 for the data with
independent errors and Table 6-2 for the data with Ornstein-Uhlenbeck-type errors.
The ratio of average MSEs is the average MSE^(obs) across the 100 simulations
divided by the average MSE^(smooth) across the 100 simulations. As shown by


32
Suppose there exists a random variable S², independent of θ̂, such that
S²/σ² ~ χ²_ν.
Then let σ̂² = S²/ν.
Consider the following definition of the James-Stein estimator which accounts
for σ²:
θ̃ = Sθ̂ + (1 − aσ² / ||θ̂ − Sθ̂||²)(I − S)θ̂
  = Sθ̂ + (1 − aσ² / (θ̂'(I − S)θ̂))(I − S)θ̂.     (3.10)
Note that the James-Stein estimator from Section 3.2 (when we assumed a
known covariance matrix I) is simply (3.10) with σ² = 1.
Now, replacing σ² with the estimate σ̂², define
θ̂^(JS) = Sθ̂ + (1 − aσ̂²/q̂_1)(I − S)θ̂,
letting q̂_1 = θ̂'(I − S)θ̂ as in Section 3.2. Write
θ̂^(JS) = θ̂ − φ̂θ̂,
where
φ̂θ̂ = (aσ̂²/q̂_1)(I − S)θ̂.
Define the James-Stein smoothed-data dissimilarity estimator based on θ̂^(JS)
to be:
d̂^(JS) = (T/n) θ̂^(JS)'θ̂^(JS).
Then analogously to (3.7),
Δ = E[ (θ̂^(JS)'θ̂^(JS) − θ'θ)² ] − E[ (θ̂'θ̂ − θ'θ)² ].





TABLE OF CONTENTS
page
ACKNOWLEDGMENTS iv
LIST OF TABLES vii
LIST OF FIGURES viii
ABSTRACT ix
CHAPTERS
1 INTRODUCTION TO CLUSTER ANALYSIS 1
1.1 The Objective Function 4
1.2 Measures of Dissimilarity 4
1.3 Hierarchical Methods 5
1.4 Partitioning Methods 6
1.4.1 K-means Clustering 7
1.4.2 K-medoids and Robust Clustering 8
1.5 Stochastic Methods 9
1.6 Role of the Dissimilarity Matrix 10
2 INTRODUCTION TO FUNCTIONAL DATA AND SMOOTHING ... 13
2.1 Functional Data 13
2.2 Introduction to Smoothing 13
2.3 Dissimilarities Between Curves 17
2.4 Previous Work 18
2.5 Summary 18
3 CASE I: DATA FOLLOWING THE DISCRETE NOISE MODEL .... 20
3.1 Comparing MSEs of Dissimilarity Estimators when 9 is in the
Linear Subspace Defined by S 20
3.2 A James-Stein Shrinkage Adjustment to the Smoother 23
3.3 Extension to the Case of Unknown a2 29
3.4 A Bayes Result: and a Limit of Bayes Estimators .... 36
4 CASE II: DATA FOLLOWING THE FUNCTIONAL NOISE MODEL 40
4.1 Comparing MSEs of Dissimilarity Estimators when 9 is in the
Linear Subspace Defined by S 40
v

4.2 James-Stein Shrinkage Estimation in the Functional Noise Model. 42
4.3 Extension to the Case when a2 is Unknown 48
4.4 A Pure Functional Analytic Approach to Smoothing 54
5 A PREDICTIVE LIKELIHOOD OBJECTIVE FUNCTION
FOR CLUSTERING FUNCTIONAL DATA 63
6 SIMULATIONS 68
6.1 Setup of Simulation Study 68
6.2 Smoothing the Data 71
6.3 Simulation Results 72
6.4 Additional Simulation Results 76
7 ANALYSIS OF REAL FUNCTIONAL DATA 81
7.1 Analysis of Expression Ratios of Yeast Genes 81
7.2 Analysis of Research Libraries 84
8 CONCLUSIONS AND FUTURE RESEARCH 91
APPENDIX
A DERIVATIONS AND PROOFS 95
A.l Proof of Conditions for Smoothed-data Estimator Superiority ... 95
A.2 Extension of Stochastic Domination Result 97
A.3 Definition and Derivation of β_j and σ_j² 97
B ADDITIONAL FORMULAS AND CONDITIONS 99
B.1 Formulas for Roots of Δ_U = 0 for General n and k 99
B.2 Regularity Condition for Positive Semidefinite G in Laplace
Approximation 100
REFERENCES 101
BIOGRAPHICAL SKETCH 106
vi

LIST OF TABLES
Table page
1-1 Agricultural data for European countries 2
3-1 Table of choices of a for various n and k 29
6-1 Clustering the observed data and clustering the smoothed data (inde
pendent error structure, n = 200) 73
6-2 Clustering the observed data and clustering the smoothed data (O-U
error structure, n = 200) 73
6-3 Clustering the observed data and clustering the smoothed data (inde
pendent error structure, n = 30) 74
6-4 Clustering the observed data and clustering the smoothed data (O-U
error structure, n = 30) 75
7-1 The classification of the 78 yeast genes into clusters, for both observed
data and smoothed data 82
7-2 A 4-cluster K-medoids clustering of the smooths for the library data. 89
7-3 A 4-cluster K-medoids clustering of the observed library data 90
vii

LIST OF FIGURES
Figure page
1-1 A scatter plot of the agricultural data 3
1-2 Proportion of pairs of objects correctly grouped vs. MSE of dissimi
larities 12
3-1 Plot of Δ_U against a for varying n and for k = 5 30
3-2 Plot of simulated Δ and Δ_U against a for n = 20, k = 5 31
4-1 Plot of asymptotic upper bound, and simulated Δs, for Ornstein-
Uhlenbeck-type data 49
6-1 Plot of signal curves chosen for simulations 70
6-2 Proportion of pairs of objects correctly matched, plotted against a
(n = 200) 74
6-3 Proportion of pairs of objects correctly matched, plotted against a
(n = 30) 75
6-4 Proportion of pairs of objects correctly matched, plotted against a,
when the number of clusters is misspecified 77
6-5 Proportion of pairs of objects correctly matched, plotted against a
(n = 30) 79
7-1 Plots of clusters of genes 83
7-2 Edited plots of clusters of genes which were classified differently as
observed curves and smoothed curves 85
7-3 Plots of clusters of libraries 87
7-4 Mean curves for the four library clusters given on the same plot. ... 88
7-5 Measurements and B-spline smooth. University of Arizona library. . 89
viii

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
SMOOTHING FUNCTIONAL DATA
FOR CLUSTER ANALYSIS
By
David B. Hitchcock
August 2004
Chair: George Casella
Cochair: James G. Booth
Major Department: Statistics
Cluster analysis, which attempts to place objects into reasonable groups on
the basis of statistical data measured on them, is an important exploratory tool
for many scientific studies. In particular, we explore the problem of clustering
functional data, which arise as curves, characteristically observed as part of
a continuous process. In recent years, methods for smoothing and clustering
functional data have appeared in the statistical literature, but little work has
appeared specifically addressing the effect of smoothing on the cluster analysis.
We discuss the purpose of cluster analysis and review some common clustering
methods, with attention given to both deterministic and stochastic methods.
We address functional data and the related field of smoothing, and a measure of
dissimilarity for functional data is suggested.
We examine the effect of smoothing functional data on estimating the dis
similarities among objects and on clustering those objects. We prove that a
shrinkage method of smoothing results in a better estimator of the dissimilarities
among a set of noisy curves. For a model having independent noise structure,
IX

the smoothed-data dissimilarity estimator dominates the observed-data estima
tor. For a dependent-error model, an asymptotic domination result is given for
the smoothed-data estimator. We propose an objective function to measure the
goodness of a clustering for smoothed functional data.
Simulations give strong empirical evidence that smoothing functional data
before clustering results in a more accurate grouping than clustering the observed
data without smoothing. Two examples, involving functional data on yeast gene
expression levels and research library "growth curves," illustrate the technique.
x

CHAPTER 1
INTRODUCTION TO CLUSTER ANALYSIS
The goal of cluster analysis is to find groups, or clusters, in data. The objects
in a data set (often univariate or multivariate observations) should be grouped so
that objects in the same cluster are similar and objects in different clusters are
dissimilar (Kaufman and Rousseeuw. 1990. p. 1). How to measure the similarity
of objects is something that depends on the application, yet is a fundamental issue
in cluster analysis. Sometimes in a multivariate data set it is not the observations
that are clustered, but rather the variables (according to some similarity measure
on the variables) and this case is dealt with slightly differently (Johnson and
Wichern, 1998, p. 735). More often, though, it is the objects that are clustered
according to their observed values of one or more variables, and this introduction
will chiefly focus on this situation.
The general clustering setup for multivariate data is as follows: In a data set
there are N objects on which are measured p variables. Hence we represent this
by N vectors y_1, ..., y_N in ℝ^p. We wish to group the N objects into K clusters,
1 ≤ K ≤ N. Denote the possible clusterings of N objects into nonempty groups
as C = {c_1, ..., c_B(N)}. The number of possible clusterings B(N) depends on the
number of objects N and is known as the Bell number (Sloane and Plouffe, 1995,
entry M4981).
As a simple example, consider the data in Table 1-1. We wish to group the
12 objects (the European countries) based on the values of two variables, gross
national product (gnp) and percent of gnp due to agriculture (agric). When the
data are univariate or two-dimensional and N is not too large, it is often easy to
1

2
Table 1-1: Agricultural data for European countries.

  country           agric    gnp
  Belgium             2.7   16.8
  Denmark             5.7   21.3
  Germany             3.5   18.7
  Greece             22.2    5.9
  Spain              10.9   11.4
  France              6.0   17.8
  Ireland            14.0   10.9
  Italy               8.5   16.6
  Luxembourg          3.5   21.0
  Netherlands         4.3   16.4
  Portugal           17.4    7.8
  United Kingdom      2.3   14.0
construct a scatter plot and determine the clusters by eye (see Figure 1-1). For
higher dimensions, however, automated clustering methods become necessary.
A statistical field closely related to cluster analysis is discriminant analysis,
which also attempts to classify objects into groups. The main difference is that
in discriminant analysis there exists a training sample of objects whose group
memberships are known, and the goal is to use characteristics of the training
sample to devise a rule which classifies future objects into the prespecified groups.
In cluster analysis, however, the clusters are unknown, in form and often in
number. Thus cluster analysis is more exploratory in nature, whereas discriminant
analysis allows more precise statements about the probability of making inferential
errors (Gnanadesikan et al., 1989).
In contrast with discriminant analysis, where the number of groups and the
groups' definitions are known, cluster analysis presents two separate questions:
How many groups are there? And which objects should be allocated to which
groups, i.e., how should the objects be partitioned into groups? It is partially
because of the added difficulty of answering both of these questions that the field

3
Figure 1-1: A scatter plot of the agricultural data (axes: percent of gnp due to agriculture and gross national product).

4
of cluster analysis has not built such an extensive and thorough theory as has
discriminant analysis (Gnanadesikan et al., 1989).
1.1 The Objective Function
Naturally, it is desirable that a clustering algorithm have some optimal
property. We would like a mathematical criterion to measure how well-grouped the
data are at any point in the algorithm. A convenient way to define such a criterion
is via an objective function, a real-valued function of the possible partitions of the
objects. Mathematically, if C = {c_1, ..., c_B(N)} represents the space of all possible
partitions of N objects, a typical objective function is a mapping g : C → ℝ⁺.
Ideally, a good objective function g will increase (or decrease, depending on
the formulation of g) monotonically as the partitions group more similar objects
in the same cluster and more dissimilar objects in different clusters. Given a good
objective function g, the ideal algorithm would optimize g, resulting in the best
possible partition.
When N is very small, we might enumerate the possible partitions c_1, ..., c_B(N),
calculate the objective function for each, and choose the c_l with the optimal g(c_l).
However, B(N) grows rapidly with N. For example, for the European agriculture
data, B(12) = 4,213,597, while B(19) = 5,832,742,205,057 (Sloane and Plouffe,
1995, entry M1484). For moderate to large N, this enumeration is infeasible
(Johnson and Wichern, 1998, p. 727).
Since full enumeration is usually impossible, clustering methods tend to be
algorithms systematically designed to search for good partitions. But such deter
ministic algorithms cannot guarantee the discovery of the best overall partition.
1.2 Measures of Dissimilarity
A fundamental question for most deterministic algorithms is which measure of
dissimilarity (distance) to use. A popular choice is the Euclidean distance between

5
object i and object j
d_E(i, j) = √[(y_i1 − y_j1)² + (y_i2 − y_j2)² + ··· + (y_ip − y_jp)²].
The Manhattan (city-block) distance is
d_M(i, j) = |y_i1 − y_j1| + |y_i2 − y_j2| + ··· + |y_ip − y_jp|.
Certain types of data require specialized dissimilarity measures. The Canberra
metric and Czekanowski coefficient (see Johnson and Wichern. 1998, p. 729) are
two dissimilarity measures for nonnegative variables, while Johnson and Wichern
(1998, p. 733) give several dissimilarity measures for binary variables.
Having chosen a dissimilarity measure, one can construct an N × N (symmetric)
dissimilarity matrix (also called the distance matrix) D whose rows and columns
represent the objects in the data set, such that D_ij = D_ji = d(i, j).
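For concreteness, the short sketch below (in R, the same S dialect as the S-plus functions discussed later in this chapter) computes Euclidean and Manhattan dissimilarity matrices for a small multivariate data set; the data themselves are randomly generated placeholders.

    # Building an N x N dissimilarity matrix D from an N x p data matrix.
    set.seed(1)
    y <- matrix(rnorm(12 * 2), nrow = 12, ncol = 2)     # N = 12 objects, p = 2 variables
    D.euclid    <- as.matrix(dist(y, method = "euclidean"))
    D.manhattan <- as.matrix(dist(y, method = "manhattan"))
    D.euclid[1, 2]    # d_E(1, 2), the Euclidean dissimilarity between objects 1 and 2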
In the following sections, some common methods of cluster analysis are
presented and categorized by type.
1.3 Hierarchical Methods
Hierarchical methods can be either agglomerative or divisive. Kaufman and
Rousseeuw (1990) compiled a set of methods which were adopted into the cluster
library of the S-plus computing package, and often the methods are referred to by
the names Kaufman and Rousseeuw gave them.
Agglomerative methods begin with N clusters; that is, each observation forms
its own cluster. The algorithm successively joins clusters, yielding N − 1 clusters,
then N − 2 clusters, and so on until there remains only one cluster containing all
N objects. The S-plus functions agnes and hclust perform agglomerative clustering.
Common agglomerative methods include linkage methods and Ward's method.
All agglomerative methods, at each step, join the two clusters which are
considered closest. The difference among the methods is how each defines

6
closeness. Each method, however, defines the distance between two clusters using
some function of the dissimilarities among individual objects in those clusters.
Divisive methods begin with all objects in one cluster and successively split
clusters, resulting in partitions of 1, 2, 3, ... and finally N clusters. The S-plus
function diana performs divisive analysis.
1.4 Partitioning Methods
While hierarchical methods seek good partitions for all K = 1,..., N,
partitioning methods fix the number of clusters and seek a good partition for
that specific K. Although the hierarchical methods may seem to be more flexible,
they have an important disadvantage. Once two clusters have been joined in an
agglomerative method (or split in a divisive method), this move can never be
undone, although later in the algorithm undoing the move might improve the
clustering criterion (Kaufman and Rousseeuw, 1990, p. 44). Hence hierarchical
methods severely limit how much of the partition space C can be explored. While
this phenomenon results in higher computational speed for hierarchical algorithms,
its clear disadvantage often necessitates the use of the less rigid partitioning
methods (Kaufman and Rousseeuw, 1990. p. 44).
In practice, Johnson and Wichern (1998, p. 760) recommend running a
partitioning method for several reasonable choices of K and subjectively examining
the resulting clusterings. Finding an objective, data-dependent way to specify K
is an open question that has spurred recent research. Rousseeuw (1987) proposes
to select K to maximize the average silhouette width s(K). For each object i, the
silhouette value
s(i) = (b(i) − a(i)) / max{a(i), b(i)},
where a(i) = average dissimilarity of i to all other objects in its cluster (say, cluster
A); b(i) = min_{R ≠ A} d(i, R); and d(i, R) = average dissimilarity of i to all objects in
cluster R. Then s(K) is the average s(i) over all objects i in the data set.
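The selection of K by this criterion is easy to sketch with the cluster library (the S-plus library compiled by Kaufman and Rousseeuw, also available in R); the data below are randomly generated placeholders, and the candidate range of K is arbitrary.

    # Choosing K by maximizing the average silhouette width.
    library(cluster)
    set.seed(1)
    y <- rbind(matrix(rnorm(30, mean = 0), ncol = 3),
               matrix(rnorm(30, mean = 4), ncol = 3))    # two loose groups
    D <- dist(y)
    avg.sil <- sapply(2:6, function(K) {
      cl <- pam(D, k = K, diss = TRUE)$clustering
      mean(silhouette(cl, D)[, "sil_width"])
    })
    (2:6)[which.max(avg.sil)]   # K with the largest average silhouette width s(K)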

7
Tibshirani et al. (2001a) propose a "Gap" statistic to choose K. In a separate
paper. Tibshirani et al. (2001b) suggest treating the problem as in model selection,
and choosing K via a prediction strength measure. Sugar and James (2003)
suggest a nonparametric approach to determining K based on the distortion,
a measure of within-cluster variability. Fraley and Raftery (1998) use the Bayes
Information Criterion to select the number of clusters. Milligan and Cooper (1985)
give a survey of earlier methods of choosing K.
Model-based clustering takes a different perspective on the problem. It as
sumes the data follow a mixture of K underlying probability distributions. The
mixture likelihood is then maximized, and the maximum likelihood estimate of
the mixture parameter vector determines which objects belong to which subpop
ulations. Fraley and Raftery (2002) provide an extensive survey of model-based
clustering methods.
1.4.1 K-means Clustering
Among the oldest and most well-known partitioning methods is K-means
clustering, due to MacQueen (1967). Note that the centroid of a cluster is the
p-dimensional mean of the objects in that cluster. After the choice of K, the
K-means algorithm initially arbitrarily partitions the objects into K clusters.
(Alternatively, one can choose K centroids as an initial step.) One at a time, each
object is moved to the cluster whose centroid is closest (usually Euclidean distance
is used to determine this). When an object is moved, centroids are immediately
recalculated for the cluster gaining the object and the cluster losing it. The method
repeatedly cycles through the list of objects until no reassignments of objects take
place (Johnson and Wichern, 1998, p. 755).
A characteristic of the K-means method is that the final clustering depends
in part on the initial configuration of the objects (or initial specification of the
centroids). Hence in practice, one typically reruns the algorithm from various

8
starting points to monitor the stability of the clustering (Johnson and Wichern.
1998, p. 755).
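This rerunning-from-several-starting-points practice amounts to one extra argument in the standard R/S-plus kmeans function; the data and the choice K = 3 in the sketch below are placeholders for illustration.

    # K-means run from many random initial configurations; the solution with the
    # smallest within-cluster sum of squares is retained.
    set.seed(1)
    y <- matrix(rnorm(100 * 2), ncol = 2)
    fit <- kmeans(y, centers = 3, nstart = 25)   # 25 random starts
    fit$cluster                                  # cluster assignments
    fit$tot.withinss                             # value of the K-means criterion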
Selim and Ismail (1984) show that K-means clustering does not globally
minimize the criterion
Σ_{i=1}^N Σ_{j=1}^K z_ij d_ij²,     (1.1)
where z_ij = 1 if object i is assigned to cluster j and 0 otherwise, and d_ij denotes the
Euclidean distance between object i and the centroid of cluster j, i.e.,
d_ij² = (y_i − ȳ^(j))'(y_i − ȳ^(j)). This criterion is in essence an objective
function g(c). The K-means solution may not even locally minimize this objective
function: conditions for which it is locally optimal are given by Selim and Ismail
(1984).
1.4.2 K-medoids and Robust Clustering
Because K-means uses means (centroids) and a least squares technique in
calculating distances, it is not robust with respect to outlying observations. K-
medoids, which is used in the S-plus function pam (partitioning around medoids),
due to Kaufman and Rousseeuw (1987), has gained support as a robust alternative
to K-means. Instead of minimizing a sum of squared Euclidean distances, K-
medoids minimizes a sum of dissimilarities. Philosophically, K-medoids is to
K-means as least-absolute-residuals regression is to least-squares regression.
Consider the objective function
Σ_{i=1}^N Σ_{j=1}^K z_ij d(i, m_j),     (1.2)
where m_j denotes the medoid of cluster j.
The algorithm begins (in the so-called build-step) by selecting K representative
objects, called medoids, based on an objective function involving a sum of
dissimilarities (see Kaufman and Rousseeuw, 1990, p. 102). It proceeds by assigning
each object i to the cluster j with the closest medoid m_j, i.e., such that
d(i, m_j) ≤ d(i, m_w) for all w = 1, ..., K. Next, in the swap-step, if swapping
any unselected object with a medoid results in the decrease of the value of (1.2),

9
the swap is made. The algorithm stops when no swap can decrease (1.2). Like
K-means. K-medoids does not in general globally optimize its objective function
(Kaufman and Rousseeuw, 1990. p. 110).
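A minimal call to the pam function in the cluster library, applied to a dissimilarity matrix, looks as follows; the data are randomly generated placeholders and K = 2 is an arbitrary illustrative choice.

    # K-medoids (partitioning around medoids) applied to a dissimilarity matrix.
    library(cluster)
    set.seed(1)
    y <- matrix(rnorm(40 * 2), ncol = 2)      # placeholder data, N = 40 objects
    D <- dist(y)                              # pairwise dissimilarities
    fit <- pam(D, k = 2, diss = TRUE)
    fit$medoids                               # the two medoid objects
    fit$clustering                            # cluster membership for each object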
Cuesta-Albertos et al. (1997) propose another robust alternative to K-means
known as "trimmed K-means," which chooses centroids by minimizing an objective
function which is based only on an optimally chosen subset of the data. Robustness
properties of the trimmed K-means method are given by García-Escudero and
Gordaliza (1999).
1.5 Stochastic Methods
In general, the objective function in a cluster analysis can yield optima
which are not global (Selim and Alsultan. 1991). This leads to a weakness of the
traditional deterministic methods. The deterministic algorithms are designed
to severely limit the number of partitions searched (in the hope that good
clusterings will quickly be found). If g is especially ill-behaved, however, finding the
best clusterings may require a more wide-ranging search of C. Stochastic methods
are ideal tools for such a search (Robert and Casella, 1999, Section 1.4).
The major advantage of a stochastic method is that it can explore more of the
space of partitions. A stochastic method can be designed so that it need not always
improve the objective function value at each step. Thus at some steps the method
may move to a partition with a poorer value of g, with the benefits of exploring
parts of C that a deterministic method would ignore. Unlike greedy deterministic
algorithms, stochastic methods sacrifice immediate gain for greater flexibility and
the promise of a potentially better objective function value in another area of C.
The major disadvantage of stochastic cluster analysis is that it takes more
time than the deterministic methods, which can usually be run in a matter of
seconds. At the time most of the traditional methods were proposed, this was an
insurmountable obstacle, but the growth in computing power has narrowed the gap

10
in recent years. Stochastic methods that can run in a reasonable amount of time
can be valuable additions to the repertoire of the practitioner of cluster analysis.
Since cluster analysis often involves optimizing an objective function g, the
Monte Carlo optimization method of simulated annealing is a natural tool to use
in clustering. In the context of optimization over a large, finite set, simulated
annealing dates to Metropolis et al. (1953), while its modern incarnation was
introduced by Kirkpatrick et al. (1983).
An example of a stochastic cluster analysis algorithm is the simulated anneal
ing algorithm of Selim and Alsultan (1991), which seeks to minimize the K-means
objective function (1.1). Celeux and Govaert (1992) propose two stochastic cluster
ing methods based on the EM algorithm.
1.6 Role of the Dissimilarity Matrix
For many standard cluster analysis methods, the resulting clustering structure
is determined by the dissimilarity matrix (or distance matrix) D (containing
elements which we henceforth denote d_ij, i = 1, ..., N, j = 1, ..., N) for the objects.
With hierarchical methods, this is explicit: If we input a certain dissimilarity
matrix into a clustering algorithm, we will get one and only one resulting grouping.
With partitioning methods, the fact is less explicit since the result depends partly
on the initial partition, the starting point of the algorithm, but this is an artifact
of the imperfect search algorithm (which can only assure a locally optimal
partition), not of the clustering structure itself. An ideal search algorithm which
could examine every possible partition would always map an inputted dissimilarity
matrix to a unique final clustering.
Consider the two most common criterion-based partitioning methods. K-
medoids (the Splus function pam) and K-means. For both, the objective function
is a function of the pairwise dissimilarities among the objects. The K-medoids
objective function simply involves a sum of elements of D. With K-means, the

11
connection involves a complicated recursive formula, as indicated by Gordon
(1981. p. 42). (Because the connection is so complicated, in practice, K-means
algorithms accept the data matrix as input, but theoretically they could accept the
dissimilarity matrix it would just slow down the computational time severely.)
The result, then, is that for these methods, a specified dissimilarity matrix yields a
unique final clustering, meaning that for the purpose of cluster analysis, knowing D
is as good as knowing the complete data.
If the observed data have random variation, and hence the measurements on
the objects contain error, then the distances between pairs of objects will have
error. If we want our algorithm to produce a clustering result that is close to the
"true" clustering structure, it seems desirable that the dissimilarity matrix we use
reflect as closely as possible the (unknown) pairwise dissimilarities between the
underlying systematic components of the data.
It is intuitive that if the dissimilarities in the observed distance matrix are
near the truth, then the resulting clustering structure should be near the true
structure, and a small computer example helps to show this. We generate a sample
of 60 3-dimensional normal random variables (with covariance matrix I) such that
15 observations have mean vector (1,3,1)', 15 have mean (10,6,4)', 15 have mean
(1,10,2)', and 15 have mean (5,1,10)'. These means are well-separated enough
that the data naturally form four clusters, and the true clustering is obvious.
Then for 100 iterations we perturb the data with random N(0, σ²) noise having
varying values of σ. For each iteration, we compute the dissimilarities and input
the dissimilarity matrix of the perturbed data into the K-medoids algorithm and
obtain a resulting clustering.
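A condensed R sketch of this kind of perturbation experiment is given below. It uses the same four well-separated trivariate normal means, but collapses the 100 perturbation levels into a short loop over a few noise levels, so it reproduces the spirit of the example rather than Figure 1-2 itself.

    # How errors in the dissimilarity matrix translate into errors in the
    # recovered clustering (K-medoids).  Noise levels are illustrative.
    library(cluster)
    set.seed(1)
    means <- rbind(c(1, 3, 1), c(10, 6, 4), c(1, 10, 2), c(5, 1, 10))
    truth <- rep(1:4, each = 15)
    y <- means[truth, ] + matrix(rnorm(60 * 3), ncol = 3)
    D.true <- as.matrix(dist(y))
    same.true <- outer(truth, truth, "==")           # true pairwise co-membership
    for (a in c(1, 5, 10, 20)) {
      y.pert <- y + matrix(rnorm(60 * 3, sd = a), ncol = 3)
      D.pert <- as.matrix(dist(y.pert))
      mse <- mean((D.pert - D.true)^2)               # discrepancy in dissimilarities
      cl <- pam(as.dist(D.pert), k = 4, diss = TRUE)$clustering
      same.est <- outer(cl, cl, "==")
      prop <- mean(same.est[upper.tri(same.est)] == same.true[upper.tri(same.true)])
      cat("sd =", a, " MSE =", round(mse, 1), " prop. pairs correct =", round(prop, 3), "\n")
    }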
Figure 1-2 plots, for each perturbed data set, the mean (across elements)
squared discrepancy from the true dissimilarity matrix against the proportion of
all possible pairs of objects which are correctly matched in the clustering resulting

12
Figure 1-2: Proportion of pairs of objects correctly grouped vs. MSE of dissimilari
ties.
from that perturbed matrix. (A correct match for two objects means correctly
putting the two objects in the same cluster or correctly putting the two objects
in different clusters, depending on the "truth.") This proportion serves as a
measure of concordance between the clustering of the perturbed data set and the
underlying clustering structure. We expect that as mean squared discrepancy
among dissimilarities increases, the proportion of pairs correctly clustered will
decrease, and the plot indicates this negative association. This indicates that a
better estimate of the pairwise dissimilarities among the data tends to yield a
better estimate of the true clustering structure.
While this thesis focuses on cluster analysis, several other statistical methods
are typically based on pairwise dissimilarities among data. Examples include
multidimensional scaling (Young and Hamer, 1987) and statistical matching
(Rodgers, 1988). An improved estimate of pairwise dissimilarities would likely
benefit the results of these methods as well.

CHAPTER 2
INTRODUCTION TO FUNCTIONAL DATA AND SMOOTHING
2.1 Functional Data
Frequently, the measurements on each observation are connected by being part
of a single underlying continuous process (often, but not always, a time process).
One example of such data are the growth records of Swiss boys (Falkner, 1960),
discussed by Ramsay and Silverman (1997. p. 2) in which the measurements are the
heights of the boys at 29 different ages. Ramsay and Silverman (1997) generally
label such data as functional data, since the underlying data are thought to be
intrinsically smooth, continuous curves having domain T, which without loss of
generality we take to be [0, T]. The observed data vector y is merely a discretized
representation of the functional observation y(t).
Functional data are related to longitudinal data, data measured across time
which appear often in biostatistical applications. Typically, however, in functional
data analysis, the primary goal is to discover something about the smooth curves
which underlie the functional observations, and to analyze the entire set of func
tional data (consisting of many curves). The term "functional data analysis" is
attributed to Ramsay and Dalzell (1991), although methods of analysis existed
before the term was coined.
2.2 Introduction to Smoothing
When scientists observe data containing random noise, they typically desire
to remove the random variation to better understand the underlying process of
interest behind the data. A common method used to capture the underlying signal
process is smoothing.
13

14
Scatterplot smoothing, or nonparametric regression, may be used generally for
paired data for which some underlying regression function E[y_i] = f(t_i) is
assumed. But smoothing is particularly appropriate for functional data, for which
that functional relationship y(t) between the response and the process on T is
inherent in the data.
One option, upon observing a functional measurement y, is to imagine the
unknown underlying curve as an interpolant y(t). This results in a curve that is
visually no smoother than the observed data, however. Typically, when functional
data are analyzed, the vector of measurements is converted to a curve via a
smoothing procedure which reduces the random variation in the function. If we
wish to cluster functional data, it may be advantageous to smooth the observed
vector for each object and perform the cluster analysis on the smooth curves rather
than on the observed data. (Clearly, this option is inappropriate for cross-sectional
data such as the European agricultural data of Chapter 1.)
Smoothing data results in an unavoidable tradeoff between bias and variance
(Simonoff. 1996, p. 15). The greater the amount of smoothing of a functional
measurement, the more its variance will decrease, but the more biased it will
become (Simonoff. 1996. p. 42). In cluster analysis, we hope that clustering the
smoothed data (which contains reduced noise) will lead to smaller within-cluster
variability, since functional data which truly belong to the same cluster should
appear more similar when represented as smooth curves. This would help make
the clustering structure of the data more apparent. Using smoothed data may
introduce a bias, however, and the bias-variance tradeoff could be quantified with a
mean squared error-type criterion.
We denote the observed noisy curves by y_1(t), ..., y_N(t). The underlying
signal curves for this data set are μ_1(t), ..., μ_N(t). In reality we observe these

15
curves at a fine grid of n points, t_1, ..., t_n, so that we observe N independent
vectors, each n × 1: y_1, ..., y_N.
A possible model for our noisy data is the discrete noise model:
y_ij = μ_i(t_j) + ε_ij,   i = 1, ..., N,  j = 1, ..., n.     (2.1)
Here, for each i = 1, ..., N, the ε_ij may be considered independent for different
measurement points, having mean zero and constant variance σ_i².
Another possible model for our noisy curves is the functional noise model:
y_i(t_j) = μ_i(t_j) + ε_i(t_j),   i = 1, ..., N,  j = 1, ..., n,     (2.2)
where ε_i(t) is, for example, a stationary Ornstein-Uhlenbeck process with pull
parameter β > 0 and variability parameter σ_i². This choice of model implies that
the errors for the ith discretized curve have variance-covariance matrix Σ_i = σ_i²Ω,
where Ω_lm = (2β)^(−1) exp(−β|t_l − t_m|) (Taylor et al., 1994). Note that in this case,
the noise process is functional (specifically Ornstein-Uhlenbeck), but we still
assume the response data collected is discretized, and is thus a vector at the level
of analysis. Conceptually, however, the noise process is smooth and continuous in
(2.2), as is the signal process in either model (2.1) or (2.2).
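To illustrate the functional noise model, the R sketch below constructs the Ornstein-Uhlenbeck covariance matrix Σ_i = σ_i²Ω with Ω_lm = (2β)^(−1) exp(−β|t_l − t_m|) and draws one discretized noisy curve from model (2.2). The grid, β, σ², and signal curve are arbitrary illustrative choices.

    # Simulating one discretized curve from the functional (O-U) noise model.
    library(MASS)                                   # for mvrnorm
    n <- 30; Tlen <- 1; beta <- 2; sigma2 <- 0.5
    tt <- seq(0, Tlen, length = n)
    Omega <- (2 * beta)^(-1) * exp(-beta * abs(outer(tt, tt, "-")))
    Sigma <- sigma2 * Omega                         # error covariance for one curve
    mu <- sin(2 * pi * tt)                          # a hypothetical signal curve
    y <- mu + mvrnorm(1, mu = rep(0, n), Sigma = Sigma)   # observed noisy curve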
Depending on the data and sampling scheme, either (2.1) or (2.2) may be an
appropriate model. If the randomness in the data arises from measurement error
which is independent from one measurement to the next, (2.1) is more appropriate.
Ramsay and Silverman (1997, p. 42) suggest a discrete noise model for the Swiss
growth data, in which heights of boys are measured at 29 separate ages, and in
which some small measuring error (independent across measurements) is likely to
be present in the recorded data.
In the case that the variation of the observed data from the underlying curve is
due to an essentially continuous random process, model (2.2) may be appropriate.

16
Data which are measured frequently and almost continuously for example, via
sophisticated monitoring equipment may be more likely to follow model (2.2),
since data measured closely across time (or another domain) may more likely be
correlated. We will examine both situations.
We may apply a linear smoother to obtain the smoothed curves μ̂_1(t), ..., μ̂_N(t).
In practice, we apply a smoothing matrix S to the observed noisy data to obtain a
smooth, called linear because the smooth can be written as
μ̂_i = S y_i,   i = 1, ..., N,
where S does not depend on y_i (Buja et al., 1989), and we define μ̂_i = (μ̂_i(t_1), ..., μ̂_i(t_n))'.
Note that as n → ∞, the vector μ̂_i begins to closely resemble the curve μ̂_i(·) on
[0, T].
(Here and subsequently, when writing the limit as n → ∞, we assume
t_1, ..., t_n ∈ [0, T]; that is, the collection of points is becoming denser within [0, T],
with the maximum gap between any pair of adjacent points t_{i−1}, t_i, i = 2, ..., n,
tending to 0. Stein (1995) calls this method of taking the limit fixed-domain
asymptotics, while Cressie (1993) calls it infill asymptotics.)
Many popular smoothing methods (kernel smoothers, local polynomial
regression, smoothing splines) are linear. Note that if a bandwidth or smoothing
parameter for these methods is chosen via a data-driven method, then technically,
these smoothers become nonlinear (Buja et al., 1989).
We will focus primarily on basis function smoothing methods, in which the
smoothing matrix S is an orthogonal projection (i.e., symmetric and idempotent).
These methods seek to express the signal curve as a linear combination of k (< n)
specified basis functions, in which case the rank of S is k. Examples of such
methods are regression splines, Fourier series, polynomial regression, and some
types of wavelet smoothers (Ramsay and Silverman, 1997, pp. 44-50).

17
2.3 Dissimilarities Between Curves
If we choose squared Li distance as our dissimilarity metric, then denote
the dissimilarities between the true, observed, and smoothed curves i and j,
respectively, as follows:
0 = f ~ Hi*)]2 dt
Jo
(2.3)
rT
&ij= / [y Jo
(2.4)
rT
¡ir = / few-diWi2*.
Jo
. (2.5)
Define
Oij = m m
where m = (#ii(i1)>...,/ii(tB))',
Oij = y yj,
- (smooth)
Oij
At A j SQij.
If the data follow the discrete noise model, ct,2 I) where cq2
(a2 + er2). If the data follow the functional noise model, Qij ~ N(0ij, Ey) where
Sij = 0$n and ffj = (of + a2).
If we observe the response at points i1?..., tn in [0, T], then we may approxi
mate (2.3)-(2.5) by
n
T -
dij nOijOij

18
Note that d^ ¥ 5ij where the limit is taken as n oo, 1,..., tn £ [0, T].
Hence the question of whether smoothing aids clustering is, in a sense, closely
related to the question of: When, for large n. is d[jmooth a better estimator of
than is d^?
(Note: In the following chapters, since the pair of curves i and j is arbitrary,
we shall suppress the ij subscript on 0^, dij. <7? and writing instead 0,
6. a2, and S, understanding that we are concerned with any particular pair
i.j e {1 j.)
2.4 Previous Work
Some methods for clustering functional data recently have been presented in
the statistical literature. James and Sugar (2003) introduce a model-based cluster
ing method that is especially useful when the points measured along the sample
curves are sparse and irregular. Tarpey and Kinateder (2003) discuss represent
ing each curve with basis functions and clustering via K-means, to estimate the
principal points (underlying cluster means) of the datas distribution. Abraham
et al. (2003) propose representing the curves via a B-spline basis and clustering
the estimated coefficients with a K-means algorithm, and they derive consistency
results about the convergence of the algorithm. Tarpey et al. (2003) apply methods
for clustering functional data to the analysis of a pharmaceutical study. Hastie et
al. (1995) and James and Hastie (2001) discuss discriminant analysis for functional
data.
2.5 Summary
It seems intuitive that in the analysis of functional data, some form of smooth
ing the observed data is appropriate, and some previous methods (James and
Sugar, 2003; Tarpey and Kinateder, 2003; Abraham et al., 2003) do involve
smoothing. Tarpey and Kinateder (2003, p. 113) propose the question of the effect
of smoothing on the clustering of functional data, citing the need to study the

19
sensitivity of clustering methods on the degree of smoothing used to estimate the
functions."
In this thesis, we will provide some rigorous theoretical justification that
smoothing the data before clustering will improve the cluster analysis. Much of
the theory will focus on the estimation of the underlying dissimilarities among
the curves, and we will show that a shrinkage-type smoothing method leads to an
improved risk in estimating the dissimilarities. A simulation study will demonstrate
that this risk improvement is accompanied by a clearly improved performance in
correctly grouping objects into their proper clusters.

CHAPTER 3
CASE I: DATA FOLLOWING THE DISCRETE NOISE MODEL
First we will consider functional data following model (2.1). Recall that we
assume the response is measured at n discrete points in [0,T].
3.1 Comparing MSEs of Dissimilarity Estimators when 6 is in the
Linear Subspace Defined by S
We assume our linear smoothing matrix S is symmetric and idempotent. For a
linear basis function smoother which is fitted via least squares. S will be symmetric
and idempotent as long as the n points at which /i* is evaluated are identical to the
points at which y, is observed (Ramsay and Silverman. 1997. p. 44). Examples of
such smoothers are regression splines (in particular, B-splines), wavelet bases, and
Fourier series bases (Ramsay and Silverman. 1997). Regression splines and B-spline
bases are discussed in detail by de Boor (1978) and Eubank (1988, Chapter 7).
We also assume that S projects the observed data onto a lower-dimensional
space (of dimension k < n), and thus r(S) = tr(S) = k. Note that S is a shrinking
smoother, since all its singular values are < 1 (Buja et al., 1989). Recall that
according to the discrete noise model for the data, 0 ~ N(0,a2l). Without loss
of generality, let cr2I = I. (Otherwise, we can let, for example, fj = o~x9 and
T] = a~l9 and work with r) and 77 instead.)
Note that ^O'O represents the approximate L2 distance between observed
curves y(£) and yj(t) and ^9'S S6 = £0S0 represents the approximate L2
distance between smoothed curves /(i) and p,j(t).
We wish to see when the smoothed-data dissimilarity better estimates
the true dissimilarity 6{j between curves /q(£) and Hj(t) than the observed-data
20

21
dissimilarity. The risk of an estimator f, for r is given by R(t, t) = E[L(t, f)]
where L(-) is a loss function (see Lehmann and Casella, 1998, pp. 4-5).
For the familiar case of squared error loss L(t, f) = (r r)2, the risk is simply
the mean squared error (MSE) of the estimator. Hence we may compare the MSEs
of two competing estimators and choose the one with the smaller MSE. To this
end, let us examine MSE(^G'SG) and MSE(^G'G) in estimating ^GG (which
approaches Sij as n > oo).
In this section, we consider the case when G lies in the linear subspace that S
projects onto, i.e.. S0 = 6. Note that if two arbitrary (discretized) signal curves ^
and Hj are in this linear subspace, then the corresponding 6 is also in the subspace,
since in this case
9 = Hi Hj = SHi SHj = S(h Hj) = so.
In this idealized situation, a straightforward comparison of MSEs shows that
the smoothed-data estimator improves on the observed-data estimator.
Theorem 3.1 Suppose the observed G ~ N(0,cr21). Let S be a symmetric and
idempotent linear smoothing matrix of rank k. If G lies in the linear subspace
defined by S, then the dissimilarity estimator ^G SG has smaller mean squared
error than does ~6 G in estimating ~G' 9. That is, the dissimilarity estimator based
on the smooth is better than the one based on the observed data in estimating the
dissimilarities between the underlying signal curves.
Proof of Theorem 3.1:
Let cr2 = 1 without loss of generality.
Now, E^O'G] = lE[6'G] = £[n + 6'6} = T+ J9'0.

22
Similarly, £[J0'S0] = $E['SO] = £[k + 0'S0] = ? + J0'S0.
MS£| -0'0 ) = £
n
= var
-O'O -O'O
n n
i 2
T2
n*
-0'0- -fl'el
n n J
2n + 40'0
+ = T2( + ^G'G + 1
n n
= r2| + 4¡WI2 + i I-
n n
(3.1)
M SE
K*8*)
= E
-G'SG- -G'G
n n
= varGSG
E
-G'SG
TI
- W
TI
T2
TI*
2k + 40 S0
f Tk T T '
+ < + -0 S0 -0 0
n n n
= T2(^ + l0'S'S0)+{^ fc + 0'S'S0-0'0]}
= T2(| + llse¡l2) +{^[<= + IIS0||2-|lll2]}!
^{h2 + 2*(||S0||2-||0f)
= T2(H + 4llS0l!2) +
n2 n2
2\2
+ (||S0||2-||0||2)
}
= T2(^ + Hse!l2 + ^) + ^ 2fc(||se||2-¡|(p)
T2
H T
n
2\2
is0ii2-¡i0in
(3-2)
Comparing (3.1) with (3.2), we see that if 0 lies in the subspace that S projects
onto, implying that S0 = 0, then MSE(^G'SG) < MSE(^G'G) since k < n,
and thus smoothing leads to a better estimator of each pairwise dissimilarity. In
this case, smoothing reduces the variance of the dissimilarity estimate, without
adversely increasing the bias, since S0 = 0.

23
On the other hand, suppose the srpooth does not reproduce 0 perfectly and
that S0 / 6. Then it can be shown (see Appendix A.l) that the smoothed-data
estimator is better when:
||S0||2 ||0||2 > -2 k Vi + 2n + n2 + 2k. (3.3)
Now, since S is a shrinking smoother, this means ||Sy|| < ||y|| for all y, and hence
||Sy|j2 < ||y||2 for all y. Therefore, ||S0||2 < ||0||2 and the left hand side of (3.3) is
negative and so is the right hand side. If 0 is such that S0 0 and ||S0||2 ||0||2
is near 0, then (3.3) will be satisfied. If, however, ||S0||2 ||0||2 0, then (3.3)
will not be satisfied and smoothing will not help.
In other words, some shrinkage smoothing of the observed curves makes the
dissimilarity estimator better, but too much shrinkage leads to a forfeiture of that
advantage. The disadvantage of the linear smoother is that it cannot learn
from the data how much to shrink 0. To improve the smoother, we can employ a
James-Stein-type adjustment to S, so that the data can determine the amount of
shrinkage.
3.2 A James-Stein Shrinkage Adjustment to the Smoother
What is now known as shrinkage estimation or Stein estimation originated
with the work of Stein in the context of estimating a multivariate normal mean.
Stein (1956) showed that the usual estimator (e.g., the sample mean vector) was
inadmissible when the data were of dimension 3 or greater. James and Stein (1961)
showed that a particular shrinkage estimator (so named because it shrunk the
usual estimate toward the origin or some appropriate point) dominated the usual
estimator. In subsequent years, many results have been derived about shrinkage
estimation in a variety of contexts. A detailed discussion can be found in Lehmann
and Casella (1998, Chapter 5).

24
Lehmann and Casella (1998, p. 367) discuss shrinking an estimator toward a
linear subspace of the parameter space. In our case, we believe that 0 is near S0.
Let Cs = {0 : SO = 9} where S is symmetric and idempotent of rank k. Hence we
may shrink 9 toward S0. the MLE of 0 G Cs-
In this case, a James-Stein estimator of 9 (see Lehmann and Casella, 1998,
p. 367) is
0 = S9 + (1 ) (0 S0)
V M0-S0IIV
where a is a constant and || || is the usual Euclidean norm.
In practice, to avoid the problem of the shrinkage factor possibly being
negative for small ||0 S0||2, we will use the positive-part James-Stein estimator
(JS)
= S 9 +
a
9 S0||2
{0 S0)
(3.4)
where x+ = xl(x > 0).
Casella and Hwang (1987) propose similar shrinkage estimators in the context
of confidence sets for a multivariate normal mean. Green and Strawderman (1991),
also in the context of estimating a multivariate mean, discuss how shrinking an
unbiased estimator toward a possibly biased estimator using a James-Stein form
can result in a risk improvement.
The shrinkage estimator involves the data by giving more weight to 9 when
|| smoother is at all well-chosen, S9 is often close enough to 9 that the shrinkage fac
tor in (3.4) is very often zero. The shrinkage factor is actually merely a safeguard
against oversmoothing, in case S smooths the curves beyond what, in reality, it
should.

25
So an appropriate shrinkage estimator of djj = ^00 is
d(JS) = L§Wg(JS)
lJ n
T
n
(SO) -
S0 + | 1 -
1 -
\0- S0||2
(0-S0)'
(19 SO)
\o-so\\\
Now, the risk difference (difference in MSEs) between the James-Stein
smoothed dissimilarity d\j and the observed dissimilarity is:
(3.5)
(3.6)
T2
A = E
T
TV,
-0 n n
- E
-0'0--0'0
n n
(3.7)
Since we are interested in when the risk difference is negative, we can ignore the
positive constant multiplier T2/n2.
From (3.7),
A = £[(0(J5>'0(JS>)2 2O'OOW'OW (O'O)2 + 20'00'0}
= E[(0(Jsyo(JS'>)2 (O'O)2 20'0(0(JS)'0(JS) OO)].
Write = 0 0(0)0, where
m =
a
I0-S0II2
(I-S).
Then
g(JS)'Q(JS) =
-2 0V(0)'0 + 0>(0)'0(0)0
, a(I S)0
= 20
II0-S0IP
+ 0'(I-S)' a
0 S0||2 ||0 S0||2
(I-S)0

26
= 2a
9'{i-s)
0{ I-S)'(I-S)0 \0-SG\
n '{i-s)d 2 '(i-s)
= -2 a^r + a2
-'(i-s)'(i-s)
'(i-s) ||0- S0||20'(i- s)0
a2
2a H .
\\o-so\\2
Hence the (scaled) risk difference is
A = E
= E
(§1JS)'0(JS))2 (0'0)2 20'0 _2a +
0 0 2a +
I0-S0II2
\\e -S0||2
- (0'0)2 20'0 ( -2a +
\\o-so\\\
Note that
and
0'0 = 0'(I-S)0 + 0'S0
0'0 = 0'{I-S)0 + 0'S0.
Define the following quadratic forms:
ft = '(I-S )0
q2 = 0'S0
ft = 0'(I-S)0
?2 = 0'S0.
Note that implies (I S)S = 0 and 0 is normally distributed). Now write A as:
A = E ^[ = E
a
ft
,2\ 2
21 2a -f 1 [9i + 92] + ( 2a + ) + 4a(9i + q2) (91 + 92)
ft
2a2
ft

27
/ a2 \ a? \2 4a3
= E 4a[?i + q?\ 4- 2[ \ 9i / \9i / 9i
92)
2a2
-(91+92)
9i
= E
a
2 \ 2
4a[gi + 72] + 4a + I 1 h 4a( 9i
4a3
9i
2a2 .
+ "T(9i + 92 9i 92)
9i
Since '[(ft +92] = + 9i + 92,
A = 4 an + 4 a2 + £
( a2 ^
)2 4a3
2a2. J
+ -T(91 + 92 9i 92)
L\9i >
9i
9i J
Note that since <7x and g2 are independent, we can take the expectation of q2 in the
last term. Since E[q2] k + q2,
A = 4a2 4an + E
a4
2 a2k ,
+ + 2a2
^ 2a + 9i\
L9i
9i
< 9i )\
By Jensen's Inequality, E(\/qi) > 1/E(qi), and since E(qi) = n k + 91, we can
bound the last term above:
2a2 1 -
2a + 91
< 2a2 1 -
= 2a2
9i
*~2a) <2a2fn~*;~2a) = 2a2(l-V
\ n k + 91 / \ n k J \ n k)
2a + 91
n A: + qi
n k 2a
\ n n k + i
)=2a(in
9i 2a qi
k + 9i
Hence
A < 4a2 4an + E
a
L9i J
+ E
n k
2 a2k
9i J
+ 2a2 1 -
2a ,
n k
(3.8)
Note that the numerators of the terms in the expected values in (3.8) are
positive. The random variable 91 is a noncentral Xn-k with noncentrality parameter
9i> an<4 this distribution is stochastically increasing in q. This implies that for
m > 0, £,[(?1)_m] is decreasing in qx. So
£[(9.)-ra] <

28
where Xn-k *s a central \2 with n k degrees of freedom. So by replacing qx with
Xn-k (3-8), we obtain an upper bound Au for A:
A;/ = 4a2 4an + E
a
l(xl-k)2\
+ E
2 a2k
Z2
.Xnk.
-f- 2 a
(3.9)
Note that EUxl-t) '} = 1/in k 2) and EUxl-t) !] = '/(" k 2)(n -k-4).
So taking expectations, we have:
n4 2a2k
A n = 4a2 4an +
1
+
(n k 2)(n k 4) n k 2
+ 2a2 1
2 a \
n k )
-a4
-a3 +
2fc
f 6 a2 4na.
(n A; 2)(n k 4) n k~ \n k 2
We are looking for values of a which make Au negative. The fourth-degree
equation Au(a) = 0 can be solved analytically. (Two of the roots are imaginary, as
noted in Appendix B.l.) One real root is clearly 0; call the second real root r.
Theorem 3.2 If n k > 4, then the nontrivial real root r of Au(a) = 0 is positive.
Furthermore, for any choice a (0, r), the upper bound Au < 0, implying A < 0.
That is, for 0 < a < r, the risk difference is negative and the smoothed-data
dissimilarity estimator is better than dij.
Proof of Theorem 3.2: Write Af/(o) = 0 as c4a4 -I- c^a3 + c-^a2 + C\a = 0, where
C4,C3,C2,Ci are the respective coefficients of A[/(a). It is clear that if n k > 4,
then c4 > 0, c3 < 0, c2 > 0. ci < 0.
Note that r also solves the cubic equation fc(a) = c4a3 + c3a2 -I- c2a + Ci = 0.
Since its leading coefficient c4 > 0,
lim fc(a) = -oc, lim fc(a) = oo.
aoo a-*oo
Since /c(a) has only one real root, it crosses the a-axis only once. Since its vertical
intercept C\ < 0, its horizontal intercept r > 0.

29
Table 3-1: Table of choices of a for various n and k.
n
k
minimizer a*
root r
20
5
9.3
19.0
50
5
31.9
72.2
100
5
71.3
169.1
200
5
153.2
367.5
And since Ay(a) has leading coefficient c4 > 0,
lim Ay(a) = oo.
a>oo
Since it tends to oo at its endpoints, Ay (a) must be negative between its two real
roots 0 and r. Therefore A Using a symbolic algebra software program (such as Maple or Mathematical.
one can easily obtain the formula for the second real root for general n and k
(the formula is given in Appendix B.l) and verify that the other two roots are
imaginary.
Figure 3-1 shows Ay plotted as a function of a for varying n and k = 5.
For various choices of n (and k = 5) Table 3-1 provides values of r, as well as
the value of a which minimizes the upper bound for A. For 0 < a < r, the risk
difference is assured of being negative. For a*. Ay is minimized.
Since Ay provides an upper bound for the (scaled) risk difference, it may
be valuable to ascertain the size of the discrepancy between Ay and A. We can
estimate this discrepancy via Monte Carlo simulation. We generate a large number
of random variables having the distribution of q\ (namely Xn-kiQi)) and get an
estimate of A using a Monte Carlo mean. For various values of qi, Figure 3-2
shows A plotted alongside Ay.
3.3 Extension to the Case of Unknown er2
In the previous section, a2 was assumed to be known. We now examine the
situation in which 0 N{0, a21) with a2 unknown.

30
DeltaU for various n and k=5
0 50 100
a
Figure 3-1: Plot of against a for varying n and for k = 5.
Solid line: n = 20. Dashed line: n = 50. Dotted line: n = 100.

31
comparing upper bound to true Delta* for n=20, k=5
0 10 20 30 40
a
Figure 3-2: Plot of simulated A and Ay against a for n = 20, k = 5.
Solid line: Upper bound Ay. Dashed line: simulated A, qx = 0. Dotted line:
simulated A, qi = 135. Long-dashed line: simulated A, qi = 540. Dot-dashed line:
simulated A, qx = 3750000.

32
Suppose there exists a random variable S2, independent of 0, such that
S'2
<7
Xu
Then let a2 = S2/u.
Consider the following definition of the James-Stein estimator which accounts
for a2:
0 = S0 + 1
aa
|0-S0|'2
,2
= S0+ 1--5T
acr
(I-S)0
(I-S)0.
0 (I-S)0
Note that the James-Stein estimator from Section 3.2 (when we assumed a
known covariance matrix I) is simply (3.10) with a2 = 1.
Now, replacing a2 with the estimate a2, define
(3.10)
- 2
aa
0iJS) = S0 + ( 1 y- ) (I S)0,
letting tji = 07(I S)0 as in Section 3.2. Write
a(JS) a
0
where
- 2
aa
= (I-S)0.
Qi
Define the James-Stein smoothed-data dissimilarity estimator based on 0JS^
to be:
;(JS) T jsymjs)
- -o e>
Then analogously to (3.7),
A & = E
2-i
q(JS)'q(JS) Q'Q
-E
2-i
o'e-o'o

33
Now, diJS)'eiJS) = O'0 2(¡> + ' A a = E
= E
- E
(io'o o'o 2(p'e + ')2 (o'o o'o)
({O'O O'6) {2e 4>')^j ('O O'O)
-2(2'4>){0'6 OO) + {2(f)'0 )
Now,
,* -0'(I-S)0 2
(p 0 = ao2 = acr2
9i
and
a2 9i
O'{l- S)'(I- S)0 =
a2&4
9 i
So
A = E
2(2ai2 -a2pj('-e'e) + (ia* *
Since a and 0 are independent,
E[-4aa2{0'0 O'O)] = E[-4aa2]E{0'0 O'0) = £[-4a<72]cr2n
and
£
= E
= E
2(-e'e)
. 9l
/ (1. (J A A Al A I .
(0 (I S)0 + 0 S0 0 (I S)0 0 S0)
9i
2a2(j4
9i
= E[2a2a4]E
[Qi + 92 ~ 9i ~ 92)
2a2(74
>15
1
+ E
L9iJ
(9i 9i 92)
by the independence of 0 and 0 (and thus <7 and q2).
(3.11)

34
Since <7i and q2 are independent, we may write (3.11) as:
E[2a24}E[q2]E[l/qi] + E
2a2 a4 x
(9i 9i ~ Q2)
L 9i
= E[2a24](q2 + a2k)E[l/qx] + E
2^.4
2cr<7
L 9i
(qi ~q\~ 92)
(3.12)
Now, since a2a4/qx is decreasing in qx and qx qx q2 is increasing in qx, these
quantities have covariance < 0 by the Covariance Inequality (Casella and Berger,
1990, p. 184). Hence
(3.12) < E[2a2a4]{q2 + a2k)E[l/qx} +E
2 a2a4
L 9i
E[qi 9i 92]
= E[2a2a4](q2 + a2k)E[\/qx] + £'[2a2<74].E,[l/9i](i72(n k) q2)
= £[2a2cr4].E,[l/<7i]c72n.
So
A* < E[4a2]cr2n + E[2a2a4]E
= E[4a2]a2n + E[2a2a4]E
Note that
E[ad2] = aa2
2 4(^ + 2)'
1'
a2n + E
.9i.
-
' 1 '
a2n + E
.9i.
2a? a2
4a2 2^.4 \ 2-1
aa
9i J
4a36 a48
+
9i
? j
E[aV] = aV(*;2))
£[a^] = av|-(^y+4)j
E[aV] = aV^+2)(^4)(y + 6)Y
Since (j2 and ^ are independent, taking expectations:
A* < 4ao-4n + 2aro'Jn
4 o2_6 + 2)
E
1
9i J
+ 4V(^?))
-4ov('^^i)V
1
L9iJ
, 4 8/I/0' + 2)(l' + 4)(l/ + 6)
+ a a
ir
)E
' 1'
)
L9iJ

35
Define q{ = qja2 = {9/a)'{I S)(9/a). Since 6 ~ N{0,(J2I), (0/ where £ = 9/o. Then the risk difference is a function of £, and the distribution of
q{ is a function of £ and does not involve a alone. Then
+ ia^(,{v + 2)
< 4a ,4 o_2 _4_ / *'(*' +2)
I/
£
1
J
i/(i/ + 2)(i/ + 4)
4aVI -V- -/V- V ]£
1
L9J
+ aV4
1/(1/ + 2)(i/ + 4)(i/ + 6)'
Va
)E
1
-*2
IQi J
We may now divide the inequality through by 0.
- < 4 an + 2a2n
<74
u(u + 2)
- 4a3
i/(i/ + 2)(t/ + 4)
)E
' 1'
Ai.
)E
' 1
Al.
+ 4a2
' v(v + 2)
v*
j4/i/(t/ + 2)(i/ + 4)(y + 6) .
L Now, since (I S)I is idempotent, q{ is noncentral Xn-jt(C(I S)£). Recall
that the noncentral x2 distribution is stochastically increasing in its noncentrality
parameter, so we may replace this parameter by zero and obtain the upper bound
= 1,2.
Hence
A,
-4an + 2a2n ^ 2) ^ E[(^n_k) *] + 4a2 ^
3 ( u(u + 2)(i/ + 4)x
u(u + 2)
4a
+ a If n k > 4, then £[(*2_*) 1} = l/(n-k-2) and £[(x2_fc)-2] = l/(n -k-2){n-
k 4). So taking expectations, we have:

36
4 an 4- 2
is(is + 2)\ a2n 2(v(v + 2)
-4
4-
u(v + 2)(is + 4)
n k 2
a3
n k 2
4- 4a
i/(i^ 4- 2)(i/ 4- 4)(i^ + 6)
z/4
a
1/(1/ 4- 2)(is 4- 4)(z>' 4- 6)
is4(n k 2)(n A; 4)
(n k 2 )(n k A)
4 /,4i/(j/ + 2)(i/ + 4)
a4 (
i/3(n k 2)
a
+
2^ + 2) 4v(l/ + 2) \ 2 _
i/2(n k 2)
is
Note that if Aj/ (which does not involve a) is less than 0, then A* < 0. which is
what we wish to prove.
Theorem 3.3 Suppose that 6 ~ N(6,a2I) with a2 unknown and that there exists
a random variable S2, independent of 9. such that S2/a2 ~ xl> and ^ S2/is.
If n k > 4, then the nontrivial real root r of Ay(a) = 0 is positive. Furthermore,
for any choice a G (0, r), the upper bound Ay < 0, implying A < 0. That is,
for 0 < a < r, the risk difference is negative and the smoothed-data dissimilarity
estimator d\jf^ is better than dij.
Proof of Theorem 3.3: Note that Au{a) = 0 may be written as c4a4 4- c^a3 4-
c2a2 4- Cia = 0, with c4 > 0, C3 < 0, c2 > 0. c\ < 0. The proof then follows exactly as
the proof of Theorem 3.2 in Section 3.2.
3.4 A Bayes Result: ^moot>l) and a Limit of Bayes Estimators
In this section, we again assume S, the smoothing matrix, to be symmetric and
idempotent. We again assume 9 ~ N(9, cj2I) and again, without loss of generality,
let cr2I = I. (Otherwise, we can let, for example, 77 = a~xQ and 77 = a~l9 and
work with r) and 77 instead.) Also, assume that 6 lies in the linear subspace (of
dimension k < n) that S projects onto, and hence S0 = 9. We will split f(0\9)

37
into a part involving (I S)0 and a part involving SO.
/(0|0)oc exp[-(l/2)(0-0)'(0-0)]
= exp[(l/2)(0 0)' (l S + S)(0 0)]
= exp{-(l/2)[0'(I S)0 + (0 0)'S(0 0)]}
which has a part involving (I S)0 and a part involving SO.
Since S is symmetric, then for some orthogonal matrix P, PSP* is diagonal.
Since S is idempotent, its eigenvalues are 0 and 1. so PSP = B where
B = [ Ilc .
0 0
Let pi,.... pn be the rows of P. Then 0* = P0 = (p^,..., 0,... ,0)\
And let 0 be the vector containing the k nonzero elements of 0*. Similarly,
0* = P0.
Now we consider the likelihood for 0\
L(0\0) (x exp{-(l/2)[0'(I-S)0 + (0-0)'S(0-0)]}
oc exp{-(l/2)[(0 0),S(0 0)]}
= exp{-(l/2)[(0 0) P PSP P(0 0)]}
= exp{(l/2)[(0* 0*)'PSP'(0* 0*)]}
= exp{ (l/2)[(0* 0*)B(0* 0*)]}.
Now we put a V(0, V) prior on 0*:
7r(0*) a exp[-(l/2)0*'V-0*],

38
y- =
o
o
where Vn is the full-rank prior variance of 0. Note that this prior is only a proper
density when considered as a density over the first k elements of 0*. Recall that
the last n k elements of 9* are zero.
Then the posterior for 9* is
tt(0*|0*) oc exp{-(l/2)[(r-0*)'B(0*-0*) + 0*'V-0*]}
= exp{-(l/2)[0*'B0* 20* B0* + 0*'(B + V")0*]}
a exp{-(l/2)[0*'B'(B + y-)-B0* 20*'B0* + 0*'(B + V")0*]}
= exp{ (l/2)[(0* (B + V-)"B0*)'(B + V")(0* (B + V-)"B0*)]}.
Hence the posterior n(0\9*) is N[(B + V ) B0*, (B + V ) ], which is
N[(B + V-)-BP0. (B + V-)-].
Thus the posterior expectation of ^0 0 is
E
T
E\ -0 P P0 = E -9' 9*
n
n
0*B(B + V)-(B + V-)~B0* + tr(-I(B + V-)
n \n
0 P B(B + y-)-(B + y-)-BP0 + tr( I(B + V)
n
n
0 SP (B + y-)-(B + V-)-pS0 + tr(-(B + V-)
n
n
Let us choose the prior variance so that V = -B. Then
T
E[ -0 0 = 0 SP
n ) n
= -0'SP'
n
= -0'SP'
n
n + 1
n
n
n + 1
n
B PS0 4- tr (
(T(n+1
B
\ n
n
n + 1
n2
(n + 1)2/
n+ 1
B PS0 + tr\
n
B |PSe + tr| -(
)
n
n\n + 1
n \n + 1
bV
n
B

39
and since P BP = S,
Elle'e)=ll^W
G'SSG + tri-
n
n
oo.
\n\n+ 1 ) )
This posterior mean is the Bayes estimator of dtJ.
Since fr(B) = k, note that tr{ J(^B)) = Jir(^j-B) = ^ 0 as
2
Also, 1 as n oo.
Then if we let the number of measurement points n t oo in a fixed-domain
sense, these Bayes estimators of approach d(.!mooth) ^G'SSG in the sense that
the difference between the Bayes estimators and tends to zero. Recall that
d^ ^G'G, the approximate distance between curve i and curve j, which in the
limit approaches the exact distance between curve i and curve j.

CHAPTER 4
CASE II: DATA FOLLOWING THE FUNCTIONAL NOISE MODEL
Now we will consider functional data following model (2.2). Again, we assume
the response is measured at n discrete points in [0, T], but here we assume a
dependence among errors measured at different points.
4.1 Comparing MSEs of Dissimilarity Estimators when 0 is in the
Linear Subspace Defined by S
As with Case I, we assume our linear smoothing matrix S is symmetric and
idempotent, with r(S) = tr(S) = k. Recall that according to the functional noise
model for the data, 0 ~ N(0, E) where the covariance matrix E corresponds, e.g.,
to a stationary Ornstein-Uhlenbeck process.
Note that ^0 0 represents the approximate L2 distance between observed
curves y(£) and ¡jj{t), and that, with an Ornstein-Uhlenbeck-type error structure, it
approaches the exact distance as n > oc in a fixed-domain sense.
Also, ^0'S S0 = ^0S0 represents the approximate L2 distance between
smoothed curves /z() and fij(t), and this approaches the exact distance as n > oo.
We wish to see when the smoothed-data dissimilarity better estimates the true
dissimilarity Sij between curves /(i) and Pj{t) than the observed-data dissimilarity.
To this end let us examine MSE(^0'S0) and MSE(^0'0) in estimating ^0'0
(which approaches as n > oo).
In this section, we consider the case in which 0 lies in the linear subspace
defined by S, i.e., SO = 0. We now present a theorem which generalizes the result
of Theorem 3.1 to the case of a general covariance matrix.
Theorem 4.1 Suppose the observed 0 ~ N(0, E). Let S be a symmetric and
idempotent linear smoothing matrix of rank k. If 0 lies in the linear subspace
40

41
defined by S, then the dissimilarity estimator ^0 SO has smaller mean squared
error than does ^00 in estimating ^00. That is, the dissimilarity estimator based
on the smooth is better than the one based on the observed data in estimating the
dissimilarities between the underlying signal curves.
Proof of Theorem 4-1:
MSE
H
n2
T2
2ir(£2) + 40'£0
+ ( -ir(E) + -0'0 -0'0
[ n n n
2ir(£2) + 40E0 + [ir(£)]2 >.
2
(4.1)
MSE
e\\-0'S0- -9'9] )
{[n n J J
var(9'S9 ) + ^ E
0S0
n
T2
n2
2ir[(S£)2] + 40'S£S0
- le'eY
n J
{-tr(SE) + -0'S'S0 -0'0l
[n n n J
^|2ir[(SE)2] + Ifl'Efl + [tr(SE)],|.
(4.2)
Hence, if we compare (4.2) with (4.1), we must show that £r(SE) < ir(E) and
ir[(S£)2] < £r[(£)2] to complete the proof.
ir(E) = ir(SE + (I-S)E)
= ir(SE) + fr((I-S)E)
= £r(S£)+ir(£1/2(I-S)(I-S)£1/2)
> tr(SE)

42
since the last term is the sum of the squared elements of (I S)E^2.
r(S2) = fr[(S£ + (I S)£)2]
= ir[(S£)2] + fr[SE(I S)S] + ir[(I S)ESE] + ir[((I S)E)2]
= fr[(S£)2] + ir[SSS(I S)(I S)E] + ir[£(I S)(I S)£SS]
+ ir[(I S)E(I S)£]
= ir[(S£)2] + 2ir[S£(I- S)(I- S)£S]
+ ir[£1/2(I S)£1/2£1/2(I S)£1/2]
> r[(S£)2]
since ir[S£(I S)(I S)£S] is the sum of the squared elements of (I S)£S and
the last term is the sum of the squared elements of £1^2(I S)£1,/2.
Hence MSE^O'SO) < MSE^d'O). Since (I-S)£1/2 (which has rank n-k)
is not the zero matrix, the inequality is strict.
4.2 James-Stein Shrinkage Estimation in the Functional Noise Model
Now we will consider functional data following model (2.2), and in this
section we do not assume 6 lies in the subspace defined by S. Again, we assume
the response is measured at n discrete points in [0, T], but here we assume a
dependence among errors measured at different points.
In Section 3.2, we obtained an exact upper bound for the difference in risks
between the smoothed-data estimator and observed-data estimator. In this section
we will develop an asymptotic (large n) upper bound for this difference of risks.
The upper bound is asymptotic here in the sense that the bound is valid for
sufficiently large n, not necessarily that the expression for the bound converges to a
meaningful limiting expression for infinite n.

43
Note that as the number of measurement points n grows (within a fixed
domain), assuming a certain dependence across measurement points (e.g., a corre
lation structure like that of a stationary Ornstein-Uhlenbeck process), the observed
data yi,... ,Yn closely resemble the pure observed functions yi(t),..., yjv(i) on
[0, T]. Therefore this result will be most appropriate for situations with a large
number of measurements taken on a functional process, so that the observed data
vector is nearly a pure function.
As with Case I, we assume our linear smoothing matrix S is symmetric and
idempotent, with r(S) = tr(S) = k. Recall that according to the functional noise
model for the data, 9 ~ N(0, £) where £ is, for example, a known covariance
matrix corresponding to a stationary Ornstein-Uhlenbeck process. In this section,
we will assume a Gaussian error process whose covariance structure allows for the
possibility of dependence among the errors at different measurement points.
We consider the same James-Stein estimator of 0 as in Section 3.2, namely
0(JS) (in practice, again, we use the positive-part estimator 0+S^), and the same
James-Stein dissimilarity estimator.
Recall the definitions of the following quadratic forms:
qi = \l-S)
q2 = O' SO
?i = 0'(I-S)0
q2 = 0'S 9.
Unlike in Section 3.2 when we assumed an independent error structure, under
the functional noise model,
are they independent in general.

44
In this same way as in Section 3.2, ,we may write A as:
( oP \ ^ 4a3
A = E 4a[<7i + Q2] T 4a2 ( ) h 4a(<7i + 92)
L V 9i / 9i
2 a2
H(9i + Q2 Qi 92)
9i
Note that
-^[91+92] 9i + 92 + r[(I ~ S)S] 4- tr[SS]
= 9i + 92 + tr(H).
Hence
A =
4atr(S) + 4a2 4* E
4a3 2a2
H ~r-{Qi + 92 9i ~ 92)
9i 9i
Consider
E
92
= E
0S0 '
UiJ
L0'(I-S)0j
Lieberman (1994) gives a Laplace approximation for £[(xFx/x'Gx)*], k > 1,
where F is symmetric and G positive definite:
E
xFx
x'Gx
k~\
£[(xFx)*]
[£(x'Gx)]t-
In our case, (I S) is merely positive semidefinite, but with a simple regularity
condition, the result will hold (see Appendix B.2). The Laplace approximation will
approach the true value of the expected ratio, i.e., the difference between the true
expected ratio and the approximated expected ratio tends to 0 as n -4 00;
Lieberman (1994) gives sufficient conditions involving the cumulants of the
quadratic forms which establish rates of convergence of the approximation.
Hence
92
= E
' 0S0
L9iJ
<35
1
w
E(0'SG)
<72 + tr{ S£)
£(0'(I-S)0) gi + ir[(I-S)S]'

45
Hence we have a large n approximation, or asymptotic expression, for A:
,4
4afr(E) + 4a2 + E
= 4a£r(E) + 4a2 + E
4a3 n 2 2a2{q2 + fr(SE)) 2a?q\ 2a2q2
T + 0 + gi+tr[(I S)S] = ~
a
L1
^ + 2a'( 1 +
L9i
g-2 + £r(SE)
9i 9i J
2a -f 9i + 92
9i +ir[(I- S)E] 9i
Since by Jensens Inequality, (1 /^i) > l/E(qi), we can bound the last term above:
2a2(l+ Q2 + tr{S'E) 2a+ 91+92
E''m ' 9i + tr[{I S)E] 9i
< 2a2 ( 1 + Q2 + 9! + 92 + 2a ^
= 2 a
= 2a2
< 2a
9i+ir[(I-S)S] 9i+ir[(I-S)S]y
2 /9i + fr[(I S)E] + £r(SE) 91 2a'
(
9i + ir[(I S)S]
£r(E) 2a \
?i+fr[(I-S)£]J
2 / ir(E) 2a \
(tr[(I-S)S]J-
Hence we have the asymptotic upper bound for A:
,4'
4a2 4a£r(E) + E
IQ
+ 2a
£r(E) 2a \
tr[(I-S)EP
Denote the eigenvalues of E1^2(I-S)E1/2 by ci,...,c. Since (I S) is positive
semidefinite, it is clear that cl5..., cn are all nonnegative. Note that Ci,..., are
also the eigenvalues of (I S)E, since E1/2(I S)E1/2 = E1/,2(I S)EE-1/2 and
(I-S)E are similar matrices. Note that since r[E1/2(I S)E1/2] = r(I-S) = n-k,
E^2(I S)!]1 2 has n k nonzero eigenvalues, and so does (I S)E.
It is well known (see, e.g., Baldessari, 1967; Tan, 1977) that if y JV(m. v)
for positive definite, nonsingular V, then for symmetric A, y Ay is distributed as a
linear combination of independent noncentral y2 random variables, the coefficients
of which are the eigenvalues of AV.

46
Specifically, since qx = &{I S)0, and (I S)£ has eigenvalues cx,..., c,,, then
Qi ~ cxi2(,2)
i=l
for some noncentrality parameter <5? > 0 which is zero when 0 = 0.
Under 6 = 0. then, qi ~ CiX2, a linear combination of independent central
xi variates.
Since a noncentral xl random variable stochastically dominates a central x2>
for any S? > 0,
P[Xi(Si) >x]> P[xl > A Vx > 0.
As shown in Appendix A.2, this implies
iW(tf) + + cx'i2(2) > x] > P[cix\ + + Cnxl >x] V X > 0.
Hence for all x > 0, Pe^o[q\ > x\ > P0[<7i > x], i.e., the distribution of q\ with
6 0 stochastically dominates the distribution of q\ with 6 = 0. Then
^oKii)"] < EolWi)-]
for m = 1,2,....
Letting M2 = i?o[(<7i)_2], we have the following asymptotic upper bound for A:
A^ = 4a2 4air(£) + M2a4 +
4a3
2a2ir(£)
£r[(I-S)E] ~~ £r[(I-S)E]
= M2a4
ra3 +
2£r(E)
+ 4 ) a2 4ir(E)a.
£r[(I S)E] \tr[(I S)E]
Then M2 = P[(X^=i ^Xi)-2]- We see that if n A: > 4, then M2 exists and is
E
Wr,
(i)l
^ A^2 ^ E
^min
positive since

47
where is the smallest and wmax the largest of the n k nonzero eigenvalues of
(1-8)33.
As was the case in the discrete noise situation, A^(a) = 0 is a fourth-degree
equation with two real roots, one of which is trivially zero. Call the nontrivial real
root r.
Theorem 4.2 If n k > 4, then the nontrivial real root r of A^(a) = 0 is positive.
Furthermore, for any choice a 6 (0,r), the asymptotic upper bound Ay < 0,
implying A < 0. That is, forO < a < r, and for sufficiently large n, the risk
difference is negative and the smoothed-data dissimilarity estimator d\j is better
than dij.
Proof of Theorem f.2: Since £ is positive definite, £r(£) is positive. Since
M2 > 0, we may write Ay(a) = 0 as c4a4 + c3a3 + c2a2 + Cia = 0, where
c4 > 0, c3 < 0, c2 > 0, cj < 0. The proof then follows exactly from the proof of
Theorem 3.2 in Section 3.2.
As with the discrete noise situation, one can easily verify, using a symbolic
algebra program, that two of the roots are imaginary, and can determine the
nontrivial real root in terms of M2, £r[(I S)£], and £r(£).
As an example, let us consider a situation in which we observe functional data
yi,..., yn measured at 50 equally spaced points t\,..., £50, one unit apart. Here,
let us assume the observations are discretized versions of functions yi(£),..., ?/#(£)
which contain (possibly different) signal functions, plus a noise function arising
from an Ornstein-Uhlenbeck (O-U) process. We can then calculate the covariance
matrix of each y and the covariance matrix £ of 9. Under the O-U model,
t r(S) = always, since each diagonal element of £ is
For example, suppose the O-U process has a2 = 2 and 1. Then in this
example, £r(£) = 50. Suppose we choose S be the smoothing matrix corresponding
to a B-spline basis smoother with 6 knots, dispersed evenly within the data. Then

48
we can easily calculate the eigenvalues of (I S)£, which are ci,..., c*, and via
numerical or Monte Carlo integration, we find that M2 = 0.0012 in this case.
Substituting these values into Ay(a), we see, in the top plot of Figure 4-1,
the asymptotic upper bound plotted as a function of a. Also plotted is a simulated
true A for a variety of values (n-vectors) of 9: Ox 1, (0,1,0,1,0,..., 0,1)', and
(1,0,1, 1,0,1,..., 1,0)', where 1 is a n-vector of ones. It should be noted that
when 9 is the zero vector, it lies in the subspace defined by S, since S9 = 9 in that
case. The other two values of 9 shown in this plot do not lie in the subspace. In
any case, however, choosing the a that optimizes the upper bound guarantees that
djj5) has smaller risk than dtj.
Also shown, in the bottom plot of Figure 4-1, is the asymptotic upper bound
for data following the same Ornstein-Uhlenbeck model, except with n = 30.
Although A^ is a large-n upper bound, it appears to work well for the grid size of
30.
4.3 Extension to the Case when cr2 is Unknown
In Section 3.2, it was proved that the smoothed-data dissimilarity estimator,
with the shrinkage adjustment, dominated the observed-data dissimilarity estimator
in estimating the true dissimilarities, when the covariance matrix of 9 was a2 known. In Section 3.3, we established the domination of the smoothed-data
estimator for covariance matrix In Section 4.2, we developed an analogous asymptotic result for general known
covariance matrix £. In this section, we extend the asymptotic result to the case
of 9 having covariance matrix of the form V = (t2E, where a2 is unknown and £
is a known symmetric, positive definite matrix. This encompasses the functional
noise model (2.2) in which the errors follow an Ornstein-Uhlenbeck process with
unknown a2 and known ¡3. (Of course, this also includes the discrete noise model

49
DeltaU for O-U Example (n = 50)
0 50 100 150
DettaU for O-U Example (n = 30)
0 20 40 60
Figure 4-1: Plot of asymptotic upper bound, and simulated As, for Ornstein-
Uhlenbeck-type data.
Solid line: plot of A^ against a for above O-U process. Dashed line: simulated A,
0 = 0x1. Dotted line: simulated A, 0 = (0,1,0,1,0,..., 0,1)'. Dot-dashed line:
simulated A, 0 = (-1,0,1,-1,0,1,...,-1,0)'. Top: n = 50. Bottom: n = 30.

50
(2.1), in which V = a21. but this case was dealt with in Section 3.3, in which an
exact domination result was shown.)
Assume 9 ~ 1V(0. V), with V = exists a random variable S2, independent of 9, such that
S2
xl-
Then let 2 = S2/v.
As shown in Section 3.3. we may write
A = E
-2 2a<72 -
a2(74
91
^(0'0-9'0)+ (2 aa2-?j^J
Since <7 and 6 are independent.
E[-Aac2{0'0 9'9)] = E[-4aa2]E{9'9 9'9} = E[-4a2]a2tr{S)
and
E
= E
-- E
2a2*V e)
L 9i
2 a2a4
9i
2 a24
9i
= E[2a2dA}E
(0'(I S)0 + 9S9 9'{I S)9 9'S9)
(91 + 92 9i 92)
92
+ E
.91.
2a2 L 9i
(9i 9i 92)
(4.3)
by the independence of a and 9 (and thus <7 and q2). We use the (large n) approx
imation to £[92/91] obtained in Section 4.2 to obtain an asymptotic upper bound
for (4.3):
r0 2~4i/ 92 + <72ir(SS) \ J2a2&4^ A ,
£[2a 17 1 ( + a2ir[(I S)E]) + B [(?1 91 H' (4'4)

51
Now, since a2<74/ft is decreasing in .ft and ft q\ q2 is increasing in ft, these
quantities have covariance < 0. Hence
/ q2 + o2tr(S'E) >
\+E
2 a2bA
\ft + cr2ir[(I S)E]^
. ft .
/ ft + a2ir(S£) ^
+ E
2a2<74
Vft + ^2M(I-S)E]J
. ft .
£[ft ft ft]
( E[2a2a4]q2
£[2a2<74]cr2ir(S£)
ft + <72tr[(I S)£] + qi+ o2tr[(I S)E]
2 a2o4
ft
+ £
2a2o4
L ft
cr2tr[(I S)E]
- E
L ft
E[2a2o4]q2
2 a2o4
1
- £
1
LftJ
+ £
L ft
£[ft
+ £'[2a2(j4]
cr2ir(S£)
ft 4- cr2ir[(I S)£]
Since by Jensen's Inequality, \/E[q\] E[l/qi] < 0,
(4.4) < £[2a2<741
cr2ir(S£)
ft + o2tr[(I S)£]
+ E
2a2 . ft
cr2ir[(I S)S]
ir(SE)
£ £'2oV'm^I + £
Our asymptotic upper bound for is:
2^.41
2a2o
ft
ff2ir[(I-S)E].
A*au = £[4a + E
(
'ir[(I-S)£]
+ E
2a2o4
L ft
2aa2
9 4 \ 2n
aro \
ft
= E
2a24
-4a ft
2 4 4a3 + 4a2a4 h
tr[(I S)S]
ft
ft2

52
Recall that
£[a aa2
E'fa2^4]
= a2(j4
£[a3d6]
= aV
£'[a4 = a4a8
V
u{ v + 2)(i/ + 4)\
V* )
u{u + 2)(i/ + 4)(// + 6)
Since , o2 4/ ^ + 2)
A^ = 4a v*
.,J T4/''(<' +2) \ (r(SE)
+ 2on^^Jir[(i-s)S]
1 '
9iJ
4a rr
3 6 / ^ + 2)(i/ + 4) ^
J
1
-Lv(^)
4 8/i/(i/ + 2)(i/ + 4)(i/ + 6)'
+ aM ^
W
' 1 '
/
L<7iJ
As in Section 3.3, define q[ = qi/a2 = S)(0/cr). Here, since
0 ~ N{0.a2Jl), {0/a) ~ iV(£, S) where C = 0/a. Then the risk difference is a
function of and the distribution of q[ is a function of Then
= -4a<74ir(£) + 2a2a4
i
v ;
Id
ir[(I S)E
I Vt4^+2)) r(SE) 1-T2T4^(l/ + 2)^
+ 2 v ~2 ) -4aVR(l/ + 2).(l, + 4)|£
1
L9J
+ aV/^ + 2)(^+4)(^ + 6))E
L2
We may now divide the inequality through by a4 > 0.
^lu
= 4a£r(E) + 2a2
(^ + 2))e
' 1
\ V2 J
iQi J
\ tr( SE)
/
-4 a>(l^]E
)
9J
£r[(I S)E
+ a i ^ Kt/ + 2)(t/ + 4)(t/ + 6)\^
L9i2J

53
As shown in Section 4.2. for all x > 0, Pe^o[qi > x] > Po[qi > x], Note that
this is equivalent to: For all x > 0, P^o[q{ > x] > Po[? > x]- Then
^o[(?!*)-m] < £o[(<7iTm]
for m = 1,2.
Let = £"o[(g) *] and AJ = 5o[() 2]- Then
+ 4
+
1/(l' + 2>V-4
u
i/(t/ + 2)(*/ + 4)
M{a3
/uiu + 2){u + 4)W + e)\M;a,
= + 2)(l/+ ^ + 6)) M2V 4 + W" +
)2-
^(i/ + 2)
(2M[tr[(l S)S] +
2ir(SS)
tr[(I-S)E]
+ 4)
4tr(E)a.
Note that if this last expression (which does not involve a and which we denote
simply as A[/(a)) is less than 0, then /S*dU < 0. which is what we wish to prove.
We may repeat the argument from Section 4.2 in which we showed that when
9 = 0. conclude that when £ = 0. ~ V*Xi> where V\,..., vn are the eigenvalues of
(I-S)E.
Again, if n k > 4, then M{ and M2* exist and are positive, as was shown in
Section 4.2.
Again. A(/(a) = 0 is a fourth-degree equation with two real roots, one of which
is trivially zero. Call the nontrivial real root r.
Theorem 4.3 Suppose the same conditions as Theorem 3.3, except generalize
the covariance matrix of 9 to be positive definite. If n k > 4, then the nontrivial real root r of A^[/(a) = 0
is positive. Furthermore, for any choice a 6 (0,r), the asymptotic upper bound

54
< 0, implying A < 0. That is, for 0 < a < r. and for sufficiently large n,
T C\
the risk difference is negative and the smoothed-data dissimilarity estimator d\ - is
better than dij.
Proof of Theorem 4-3: Since £ is positive definite. ir(£) is positive. Since
> 0 and M4 > 0. we may write Au(a) = 0 as c4a4 + c3a3 + c2a2 + C\a = 0.
where c4 > 0. c3 < 0. c2 > 0, C\ < 0. The proof is exactly the same as the proof of
Theorem 4.2 in Section 4.2.
4.4 A Pure Functional Analytic Approach to Smoothing
Suppose we have two arbitrary observed random curves y() and yfft), with
underlying signal curves pi(t) and Pj(t). This is a more abstract situation in which
the responses themselves are not assumed to be discretized, but rather come to
the analyst as "pure' functions. This can be viewed as a limiting case of the
functional noise model (2.2) when the number of measurement points n oo in a
fixed-domain sense. Throughout this section, we assume the noise components of
the observed curves are normally distributed for each fixed t. as, for example, in an
Ornstein-Uhlenbeck process.
Let C[0.T] denote the space of continuous functions of t on [0, T]. Assume we
have a linear smoothing operator S which acts on the observed curves to produce
a smooth: /(i) = Syi(t), for example. The linear operator S maps a function
/ C[0, T] to C[0. T\ as follows:
S[/](t) = K(t,s)f(s)ds
Jo
(where K is some known function) with the property that S[a/ + (3g\ = <*S[f] +
(3S[g] for constants a,/? and functions /,g e C[0,T] (DeVito, 1990, p. 66).
Recall that with squared L2 distance as our dissimilarity metric, we denote
the dissimilarities between the true, observed, and smoothed curves i and j,
respectively, as follows:

55
(4.7)
(4.5)
(4.6)
Recall that for a set of points £1}..., tn in [0, T], we may approximate (4.5)-
(4.7) by
T
dii = -O'0
n
where dij > Sij when the limit is taken as n oo with G [0,T).
In this section, the smoothing matrix S corresponds to a discretized analog of
S. That is, it maps a discrete set of points y(t{),..., y(tn) on the noisy observed
curve to the corresponding points /2(£i),..., fi(tn) on the smoothed curve which
results from the application of S. In particular, in this section we examine how
the characteristics of our self-adjoint S correspond to those of our symmetric S as
n > oo in a fixed-domain sense.
Suppose S is self-adjoint. Then for any x, y,
Therefore for points tx,...,tn G [0,T], with n ^ oo in a fixed-domain sense,

56
Now, for a symmetric matrix S, for any x. y, (x. Sy) = (Sx. y) <=> x Sy =
x Sy. At the sequence of points ti,tn: x = (x(£i),....£(£)), and for our
smoothing matrix S.
Sy = {S[y](tiS[y](t))' = (j K(tl,s)y(s)ds...., j K(tn,s)y(s) ds^j
So
x Sy =
= x'Sy =
[ x(ti)K(ti, s)y(s) ds 4 f [ x(tn)K(tn, s)y(s) ds
Jo Jo
I K(tl,s)x{s)y{t1)ds H h [ K{tn, s)x{s)y{tn) ds
Jo Jo
and
T
x{ti)K(t\, s)y(s) ds H 1-
T rT
K{t\, s)x(s)y(ti) ds H 1- / K(tn, s)x(s)y(tn) ds
Jo
So taking the (fixed-domain) limits as n oo, we obtain (4.8), which shows that
the defining characteristic of symmetric S corresponds to that of self-adjoint S.
Definition 4.1 5 is called a shrinking smoother if ||5y|| < ||t/|| for all elements y
(Buja et al., 1989).
A sufficient condition for S to be shrinking is that all of the singular values of
S be < 1 (Buja et al., 1989).
Lemma 4.1 Let A be a symmetric matrix and £ be a symmetric, positive definite
matrix. Then
j
x{tn)K(tn,s)y{s)ds
sup
X
x A SAx
x' Ex
< (emax(A)]2
where emax(A) denotes the maximum eigenvalue of A.
Proof of Lemma 4-1:

57
Note that
sup^r^ =em(E-A'E 2E1/2A£-l/2).
x x Ex
Let B = E1/2AE_1 2. Then this maximum eigenvaiue is emaxiBB) < emax(A A)
[emax(A)]2 by a property of the spectral norm. C
Lemma 4.2 Let the linear smoother S be a self-adjoint operator with singular
values 6 [0.1]. Then uar(^||S0||2) < t/ar(^||0||2). where S is symmetric with
singular values e [0,1]. (Recall 6 = y yj.)
Proof of Lemma \.2:
We will show that, for any n and for any positive definite covariance matrix
E of G. var(£;|S0|j2) < uar(^||0||2) where S is symmetric. (A symmetric matrix
is the discrete form of a self-adjoint operator (see Ramsay and Silverman, 1997.
pp. 287-290).)
r(£l|S0||2) < wxr(^||0||2j
<=> uar(||S0||2) < var(||0||2)
<=> var(G S S0) < var(GG)
<=> 2tr[(S'SE)2] + 40'S'SESS0 < 2tr[(E)2] + 40'E0.
Now,
40'S'SES S0 < 40'E0 S ^SS S < 1
0E0
and so we apply Lemma 4.1 with S S playing the role of A. Since the singular
values of S are [0,1], all the eigenvalues of S S are [0,1]. Hence we have
0SSESS0
0'E0
sup
G
< 1.

58
So we must show ir[(S SE)2] < £r[(£)2] for any £.
ir[(E)2] = ir[(S'S£ + (I-S'S)£)2]
= r[(S'SE)2] + ir[((I S'S)E)2] + ir[S'S£(I S'S)£]
+ ir[(I-S'S)£S'S£]
= £r[(S'S£)2] + ir[£1/2(I S'S)E(I S'S)E1/2]
4- ir[S£(I S'S)£S] + ir[S£(I S'S)£S']
= £r[(S'S£)2] + ir[£1/2(I S'S)£(I S'S)£1/2]
+ 2ir[S£(I- S'S)£S]
= ir[(S'S£)2] + £r[E1/2(I S'S)E1/2E1/2(I S'S)E1/2]
+ 2ir[SE(I S'S)1/2(I S'S)1/2SS]
> £r[(S'S£)2]
since the last two terms are > 0. This holds since ir[E1/,2(I S S)E1/2E1^2(I
S,S)E1/2] is the sum of the squared elements of the symmetric matrix E1/2(I
S S)£1/2. And £r[SE(I S S)1/,2(I S S)1/2ES] is the sum of the squared elements
of (I-S'S)1/2ES.
We now extend the result in Lemma 4.2 to functional space.
Corollary 4.1 Under the conditions of Lemma 4-2, var[Sij] < var[S¡^mooth^].
Proof of Corollary 4-1-'
Note that since almost sure convergence implies convergence in distribution,
we have
-9'0 A L
n
smooth.)
*ij
-'s's 4 s{smooth).
n 13
Consider £[|d,j|3] = £,[(dj)3] = ^£'[(0,0)3]. We wish to show that this is
bounded for n > 1.

59
Now 00~ x'n (^)- where A = 0 0 here. The third moment of a noncentral
chi-square is given by
n3 + 6n2 -+- 6n2A 8 n + 36nA + 12?iA 4- 48A + 48A2 + 8A3.
Hence ^E[(0'0)3] =
T3[l + 6/n + 6A/n + 8/n2 + 36A/n2 + 12A2/n2 + 48A/n3 + 48A2/n3 + 8A3/n3]
Now. A/n is bounded for n > 1 since ~0'0 is the discrete approximation of the
distance between signal curves /i() and which tends to y. Hence the third
moment is bounded.
This boundedness, along with the convergence in distribution, implies (see
Chow and Teicher. 1997. p. 277) that
lim £[(4)*] = £[(4)*] < oo
for k = 1,2.
So we have
lim varldij]
n*00
= limE[(dij)2] lim[£(dy)]2
= lim£[(4)2] [Hm£'(4)]2
= E[(4)2] [£(4)]2
= var[iy] < oo.
An almost identical argument yields
lim tar[d-*moot/l)] = t/ar[4moot/l)].
noo J J

60
As a consequence of Lemma 4.2. since var[d-*mooi/l)] < var[dij] for all n,
lim var\d^mooth)] < lim var\dtj] < oo.
noc J n>oc
Hence
var[S\jmooth)] < var[Sij].
The following theorem establishes the domination of the smoothed-data
dissimilarity estimator ^moothl over the observed-data estimator Sjj, when the
signal curves Pi{t), i = 1,..., N. lie in the linear subspace defined by S.
Theorem 4.4 If S is a self-adjoint operator and a shrinking smoother, and if
Pi(t) is in the linear subspace which S projects onto for all i = 1... .,N, then
MSE((ooth)) < MSEiSij).
Proof of Theorem 4-4'
Define the difference in MSEs A to be as follows (we want to show that A is
negative):
C / T T \ 2 \
= E[(Kfo [£(*) hj(t)]2dt [Pi{t) Pj{t)]2 dtj j
~ E{(fo ^Vi^ ~ Vj^2 dt ~ h ~ dt
+ (/ [/^() Mi(0]2*)
_ 2 (/ \-Syi^ ~ SyiW2 dt) (fQ ~ }
~E{(fo {yi^~yjW*dt) + (/ [m(0 -
~ 2 (/ ~ Vj^2 dt) (/ M*)]2 di) }
'¡Of
[Syi(t) Syj{t)}2dt
2

61
= [%(0 Syj(t)}adt) ~ (J [%(*) ~Vj(*)]2dt^j |
+2E{(L ~ yj^2(it) (/~
- Qf mt) Syj(t)]2 dt^j Qf [/(f) Hj(t)}2dt^j |.
Let f = /(f) = Vi(t) yj(t) and let f = /(f) = /(f) /(f)- Note that since 5 is
a linear smoother. 5y(f) 5y(f) = Sf.
a = £{(/w<)XfM2}
+2*{ (if]2 *) (hf>2 ) (|sf|2 ) (m *)}
= £{(Sf.Sf)2 (Sf,5f)(f,f)}
= £{||Sf||4 ||f||4} + 2£{||f||2||f||2 Sf||2||f||2}
= E{\\Si\\4} £{||f||4} 2||f||2£'{||5f!|2 ||f||2}
= mr(||5f||2) var(||f||2) + {£(||Sf||2)}2 {£(||f||2)}2
-2||f||2E{(||5f||2-||f||2)}
= var(||Sf||2) yar(||f||2)
+ {£(||Sf||2) + £(||f||2)}{£(||Sf||2) £(||f||2)}
-2||f||2E{(||5f||2-||f||2)}.
Since /(f) is in the subspace which S projects onto for all i, then
Sf = f => ||5f|| = ||f||.
A = uar(||5f||2) uar(||f||2) + £(||Sf||2 + ||f||2)^(|| -(||5f||2 + ||f||2)^(||5f||2-||f||2).

62
Now. since uar(||5fl|2) < uar(||f||2) from Corollary 4.1. we have:
A < E(\\Sf\\2 + ||f||2)£(||Sf||2 ||f||2) (||5f||2 + ||f||2)£(||Sf||2 ||f||2)
= E(\\Sf\\2 ||f||2) [^(||5f||2 + ||f||2) (||5fl|2 + ¡|f||2)l. (4.9)
Since 5 is a shrinking smoother, ||Sf|| < ||f|| for all f. Hence ||5f||2 < ||f||2 for
all f and hence £'(||5f||2) < i?(||f||2). Therefore the first factor of (4.9) is < 0.
Also note that
i?(||Sf||2) = E^[Sf]2dt\ = f* E[Sf]2dt
= [ var[Sf}dt+ [ [E[Sf])2 dt
Jo Jo
= f var[Sf]dt+ [Sf]2df
Jo Jo
= [ uar[Sf]dt + ||Sf||2 > ||Sf||2.
Jo
£(||f||2) = [tf it\ = [ E[tf dt
= [ var[f]dt+ f (E[f])2dt
Jo Jo
= f uar[f] dt + [ [f]5
Jo Jo
= / var[f]
Jo
Similarly,
]2 dt
dt + llfll2 > "f"2
Hence the second factor of (4.9) is positive, implying that (4.9) < 0. Hence the
risk difference A < 0.
Remark. We know that ||Sf|| < ||f|| since S is shrinking. If, for our particular
Vi(t) and Vj(t), ||Sf|| < ||f||, then our result is MSE(6(th)) < MSE(fy).

CHAPTER 5
A PREDICTIVE LIKELIHOOD OBJECTIVE FUNCTION
FOR CLUSTERING FUNCTIONAL DATA
Most cluster analysis methods, whether deterministic or stochastic, use an
objective function to evaluate the goodness" of any clustering of the data. In this
chapter we propose an objective function specifically intended for functional data
which have been smoothed with a linear smoother. In setting up the situation,
we assume we have N functional data objects y\(t),..., yff(t). We assume the
discretized data we observe (yi,.... yn) follow the discrete noise model (2.1).
For object i. define the indicator vector Z¡ = (Zn,..., Zm) such that Z,j = 1 if
object i belongs to cluster j and 0 otherwise. Each partition in C corresponds to a
different Z = (Zx,..., Zjv), and Z then determines the clustering structure. (Note
that if the number of clusters K < N, then Zk+1 = = Zw 0 for all i.)
In cluster analysis, we wish to predict the values of the Zs based on the data
y (or data curves (yx(i),.... /tv())- Assume the Z^s are i.i.d. unit multinomial
vectors with unknown probability vector p = (px, Pn)- Also, we assume a
model /(y|z) for the distribution of the data given the partition. The joint density
N N
/(y,z) = YlYl\pjf{yi\zn)]Zii
t=i j=i
can serve as an objective function. However, the parameters in /(y|z), and p,
are unknown. In model-based clustering, these parameters are estimated by
maximizing the likelihood, given the observed y, across the values of z. If the
data model /(y|z) allows it, a predictive likelihood for Z can be constructed by
conditioning on the sufficient statistics r(y,z) of the parameters in /(y,z) (Butler,
1986; Bjornstad, 1990). This predictive likelihood, /(y,z)/f(r(y,z)), can serve as
63

64
the objective function. (Bjornstad (1990.) notes that, when y or z is continuous,
an alternative predictive likelihood with a Jacobian factor for the transformation
r(y, z) is possible. Including the Jacobian factor makes the predictive likelihood
independent of the choice of minimal sufficient statistic: excluding it makes the
predictive likelihood invariant to scale changes of z.)
For instance, consider a functional datum yi[t) and its corresponding observed
data vector y. For the purpose of cluster analysis, let us assume that functional
observations in the same cluster come from the same linear smoothing model with
a normal error structure. That is, the distribution of Y, given that it belongs to
cluster j, is
Y,- | Zij = 1 ~ N{TPj,(J*I)
for i = 1 N. Here T is the design matrix, whose n rows contain the regression
covariates defined by the functional form we assume for yi(t). This model for
Yi allows for a variety of parametric and nonparametric linear basis smoothers,
including standard linear regression, regression splines, Fourier series models,
and smoothing splines, as long as the basis coefficients are estimated using least
squares.
For example, consider the yeast gene data described in Section 7.1. Since
the genes, being cell-cycle regulated, are periodic data. Booth et al. (2001) use a
first-order Fourier series y(t) = (30 + (3X cos(27rf/T) + /32sin(2nt/T) as a smooth
representation of a given genes measurement, fitting a harmonic regression model
(see Brockwell and Davis, 1996, p. 12) by least squares. Then the 18 rows of T are
the vectors (l,cos(^fJ),sin(y:i:7)), for the n = 18 time points tj,j = 0,...,17. In
this example 0j = (P0j,Pij,p2j)'-
Let J = {j\zij 1 for some } be the set of the K nonempty clusters.
The unknown parameters in /(y,z), j 6 J, are (pj,0jtaf). The corresponding
sufficient statistics are (rrij, Pj,&j) where rrij = YliLi z*j number of objects

65
in the (proposed) jth cluster. is the ^-dimensional least squares estimator
of (3j (based on a regression of all objects in the proposed j th cluster), and d2
is the unbiased estimator of terms of the estimators and d2 (for all i such that Zij = 1), obtained from the
regressions of the individual objects. (See Appendix A.3 for the derivations of
and d2.)
Marginally, the vector m = (mi,.... m^) is multinomial(./V, p); conditional
on m, is multivariate normal and independent of ()d2 ~ Xm np*" Then
the following theorem gives the predictive likelihood objective function, up to a
proportionality constant:
Theorem 5.1 If Zl5..., are i.i.d. unit multinomial vectors with unknown
probability vector p = (pi,.... pN), and
Yi\zij = l~N(T/3j,a]l)
for i = 1,..., N, then the predictive likelihood for Z can be written:
g{ z) =
/(y-z)
jeJ x
Proof of Theorem 5.1:
According to the model assumed for the data and for the clusters,
= e*p[-|(yi-W(yi-T,3)]}

66
Therefore log / (y, z j
N K
= E E 40gp>+ log {vkT- F(y W(yi T3i)}
= {log P? + mi loS ( 7^) iE2^yi"W(y'"T^
= S{logP?
i=i 1
+ 771 j log
\¡2/nCj
2c]
Hence /(y.z
1 r 'v .11
- ^ Ztj(y¡ T^J^yi T0j) + v(T/3j T0jY(T/3j T/?,)
aj >-,=1 *
= rrr
jeJ
exp
1
\/27TCj
( N
-E
v t=i
(* Tft)'(y¡ Tft) (Tft Tft)'(Tft Tft)
2 al
m
1
= YlP?J {2n)~rn,n/2(7j m,n exp< -rrij
jeJ '
Also, TljejffaiPj'tf)
=
(T ft Tft)(Tft Tft) (m¡n-pn¡
*!
1/A
exp{-^(/3--/3 )'[ccw(/3j)] /?,)}
mJ!(27r)P'/2iC0U(/g.)|i/2
(5a^)=qF-. =
jeJ
-1 f \
77 775 p {- j£(^ ft)
mi-(2T)p-A(^|(T'T)->|) 1 2 x(ft)
/myn-p* _
2-.m,"~P* 1 ^ 2aj r P*-2
m j n p
p (rnjnt p*) exP{ 2 ^ cr?
*?)}
Hence, since expf-^O^ ^'(TT)^ 0,)} = exp{-^(T/3,. -
T/3j)'(T$j T0j)}, we have that

67
/(y,z)
Yljej
-111/2
/TTijn-p
jeJ
m ;n p
(*J)
jeJ
When considered as an objective function, this predictive likelihood has
certain properties which reflect intuition about what type of clustering structure is
desirable.
The data model assumes aj represents the variability of the functional data
in cluster j. Since is an unbiased estimator of Oj,j G J, then (a?) 2 +1
can be interpreted as a measure of variability within cluster j. The (typically)
negative exponent indicates that good clustering partitions, which have small
within-cluster variances, yield large values of g(z). Clusters containing more objects
(having large m; values) contribute more heavily to this phenomenon.
The other factors of g(z) may be interpreted as penalty terms which penalize
a partition with a large number of clusters having small m,j values. A possible
alternative is to adopt a Bayesian outlook and put priors on the pj's, which, when
combined with the predictive likelihood, will yield a suitable predictive posterior.
rrijU p \ ( rrijn P

CHAPTER 6
SIMULATIONS
6.1 Setup of Simulation Study
In this chapter, we examine the results of a simulation study implemented
using the statistical software R. In each simulation. 100 samples of N = 18 noisy
functional data were generated such that the data followed models (2.1) or (2.2).
For the discrete noise data, the errors were generated as independent Af(0, cr)
random variables, while for the functional noise data, the errors were generated as
the result of a stationary Ornstein-Uhlenbeck process with variability parameter a2
and pull' parameter /3 = 1.
For each sample of curves, several (four; clusters' were built into the data
by creating each functional observation from one of four distinct signal curves, to
which the random noise was added. Of course, within the program the curves were
represented in a discretized form, with values generated at n equally spaced points
along T = [0. 20]. In one simulation example, n = 200 measurements were used and
in a second example, n = 30.
The four distinct signal curves for the simulated data were defined as follows:
Hi (f) = 0.51n(i + 1) + .01 cos(), i = 1,..., 5.
Hi(t) = log10( T 1) .01 cos(2£), z = 6,..., 10.
Hi(t) = 0.751og5(i + l) + .01sin(3i),i = 11,..., 15.
Hi(t) = 0.3>/t + 1-.01 sin(4f), z = 16,, 18.
68

69
These four curves, shown in Figure 6-1, .were intentionally chosen to be similar
enough to provide a good test for the clustering methods that attempted to
group the curves into the correct clustering structure, yet different enough that
they represented four clearly distinct processes. They all contain some form of
periodicity, which is more prominent in some curves than others. In short, the
curves were chosen so that, when random noise was added to them, they would be
difficult but not impossible to distinguish.
After the observed data were generated, the pairwise dissimilarities among
the curves (measured by squared L2 distance, with the integral approximated
on account of the discretized data) were calculated. Denote these by d_ij, i =
1, ..., N, j = 1, ..., N. The data were then smoothed using a linear smoother S
(see below for details) and the James-Stein shrinkage adjustment. The pairwise
dissimilarities among the smoothed curves were then calculated. Denote these by
d_ij^(smooth), i = 1, ..., N, j = 1, ..., N. The sample mean squared error criterion was
used to judge whether d_ij or d_ij^(smooth) better estimated the dissimilarities among the
signal curves, denoted δ_ij, i = 1, ..., N, j = 1, ..., N. That is,
MSE^(obs) = (2/(N² - N)) Σ_{i=1}^{N} Σ_{j>i} (d_ij - δ_ij)²
was compared to the analogous quantity MSE^(smooth), computed with d_ij^(smooth) in
place of d_ij.
Once the dissimilarities were calculated, we used the resulting dissimilarity
matrix to cluster the observed data, and then to cluster the smoothed data. The
clustering algorithm used was the K-medoids method, implemented by the pam
function in R. We examine the resulting clustering structure of both the observed

Figure 6-1: Plot of signal curves chosen for simulations.
Solid line: μ_1(t). Dashed line: μ_6(t). Dotted line: μ_11(t). Dot-dashed line: μ_16(t). (Horizontal axis: time.)
data and the smoothed data to determine which clustering better captures the
structure of the underlying clusters, as defined by the four signal curves.
The outputs of the cluster analyses were judged by the proportion of pairs of
objects (i.e., curves) correctly placed in the same cluster (or correctly placed in
different clusters, as the case may be). The correct clustering structure, as defined
by the signal curves, places the first five curves in one cluster, the next five in
another cluster, etc. Note that this is a type of measure of concordance between
the clustering produced by the analysis and the true clustering structure.
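This concordance measure is the Rand index: the proportion of object pairs that the estimated clustering and the true structure treat the same way. A minimal R sketch, with hypothetical object names, is:

prop.pairs.correct <- function(est, truth) {
  pairs <- combn(length(truth), 2)
  same.est   <- est[pairs[1, ]]   == est[pairs[2, ]]
  same.truth <- truth[pairs[1, ]] == truth[pairs[2, ]]
  mean(same.est == same.truth)   # proportion of pairs treated the same way
}
# Example, using the simulated sample sketched earlier and squared L2 dissimilarities:
# truth <- rep(1:4, c(5, 5, 5, 3))
# D <- as.matrix(dist(y.indep))^2
# est <- cluster::pam(D, k = 4, diss = TRUE)$clustering
# prop.pairs.correct(est, truth)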
6.2 Smoothing the Data
The smoother used for the n = 200 example corresponded to a cubic B-spline
basis with 16 interior knots interspersed evenly within the interval [0,20]. The rank
of S was thus k = 20 (Ramsay and Silverman, 1997, p. 49). The value of a in the
James-Stein estimator was chosen to be a = 160, a choice based on the values of
n = 200 and k = 20. For the n = 30 example, a cubic B-spline smoother with six
knots was used (hence k = 10) and the value of a was chosen to be a = 15.
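As an illustration, the smoother and the positive-part James-Stein adjustment can be coded in a few lines of R. This is a sketch of the per-curve form of the adjustment discussed in the next paragraph; the exact knot placement is an assumption.

library(splines)
js.smooth <- function(y, tt, knots, a) {
  B <- bs(tt, knots = knots, degree = 3, intercept = TRUE)  # cubic B-spline basis
  S <- B %*% solve(crossprod(B), t(B))                      # projection smoother S
  sy <- as.vector(S %*% y)
  resid <- y - sy
  sy + max(0, 1 - a / sum(resid^2)) * resid                 # positive-part James-Stein
}
# n = 200 example: 16 evenly spaced interior knots (k = 20) and a = 160
tt <- seq(0, 20, length.out = 200)
knots <- seq(0, 20, length.out = 18)[2:17]
# y.sm <- js.smooth(y.indep[1, ], tt, knots, a = 160)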
We should note here an interesting operational issue that arises when using
the James-Stein adjustment. The James-Stein estimator of θ given by (3.4) is
obtained by smoothing the differences θ̂ = y_i - y_j, i ≠ j, first, and then adjusting
the Sθ̂'s with the James-Stein method. This estimator for θ is used in the James-
Stein dissimilarity estimator d_ij^(JS). For a data set with N curves, this amounts to
smoothing (N² - N)/2 pairwise differences. It is computationally less intensive to
simply smooth the N curves with the linear smoother S and then adjust each of
the N smooths with the James-Stein method. This leads to the following estimator
of θ:
[ Sy_i + (1 - a/||y_i - Sy_i||²)_+ (y_i - Sy_i) ] - [ Sy_j + (1 - a/||y_j - Sy_j||²)_+ (y_j - Sy_j) ].
Of course, when simply using a linear smoother S, it does not matter whether
we proceed by smoothing the curves or the differences, since Sθ̂ = Sy_i - Sy_j. But
when using the James-Stein adjustment, the overall smooth is nonlinear and there
is a difference between the two operations. Since, in certain situations, it may be more
sensible to smooth the curves first and then adjust the N smooths, we can check
empirically to see whether this changes the risk very much.
To accomplish this, a simulation study was done in which the coefficients of
the above signal curves were randomly selected (within a certain range) for each
simulation. Denoting the coefficients of curve i by (b_{i,1}, b_{i,2}), the coefficients
were randomly chosen in the following intervals:
b_{1,1} ∈ [-3, 3], b_{1,2} ∈ [-0.5, 0.5], b_{6,1} ∈ [-6, 6], b_{6,2} ∈ [-0.5, 0.5],
b_{11,1} ∈ [-4.5, 4.5], b_{11,2} ∈ [-0.5, 0.5], b_{16,1} ∈ [-1.8, 1.8], b_{16,2} ∈ [-0.5, 0.5].
For each of 100 simulations, the sample MSE for the smoothed data was the same,
within 10 significant decimal places, whether the curves were smoothed first or the
differences were smoothed first. This indicates that the choice between these two
procedures has a negligible effect on the risk of the smoothed-data dissimilarity
estimator.
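For reference, the two orderings compared in this check can be written as follows, using the hypothetical js.smooth helper sketched above; this is only an illustration of the bookkeeping, not the simulation code itself.

# Adjust each smooth, then difference (computationally cheaper):
theta.curvewise <- function(yi, yj, tt, knots, a)
  js.smooth(yi, tt, knots, a) - js.smooth(yj, tt, knots, a)
# Difference first, then smooth and adjust, as in (3.4):
theta.diffwise <- function(yi, yj, tt, knots, a)
  js.smooth(yi - yj, tt, knots, a)
# Squared-L2 dissimilarity estimates from either version can then be compared
# across simulated pairs, as in the risk comparison just described.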
Also, the simulation analyses in this chapter were carried out both when
smoothing the differences first and when smoothing the curves first. The results
were extremely similar, so we present the results obtained when smoothing the
curves first and adjusting the smooths with the James-Stein method.
6.3 Simulation Results
Results for the n = 200 example are shown in Table 6-1 for the data with
independent errors and Table 6-2 for the data with Ornstein-Uhlenbeck-type errors.
The ratio of average MSEs is the average MSE^(obs) across the 100 simulations
divided by the average MSE^(smooth) across the 100 simulations. As shown by

Table 6-1: Clustering the observed data and clustering the smoothed data (independent error structure, n = 200).

Independent error structure
σ                      0.4     0.5     0.75    1.0     1.25    1.5     2.0
Ratio of avg. MSEs     12.86   30.10   62.66   70.89   72.40   72.24   71.03
Avg. prop. (observed)  .668    .511    .353    .325    .330    .327    .318
Avg. prop. (smoothed)  .937    .715    .428    .380    .372    .352    .332

Ratios of average MSE^(obs) to average MSE^(smooth). Average proportions of pairs
of objects correctly matched. Averages taken across 100 simulations.
Table 6-2: Clustering the observed data and clustering the smoothed data (O-U error structure, n = 200).

O-U error structure
σ                      0.5     0.75    1.0     1.25    1.5     2.0     2.5
Ratio of avg. MSEs     3.53    39.09   126.53  171.39  176.52  168.13  160.52
Avg. prop. (observed)  .997    .607    .445    .369    .330    .312    .310
Avg. prop. (smoothed)  1       .944    .620    .489    .396    .326    .334

Ratios of average MSE^(obs) to average MSE^(smooth). Average proportions of pairs
of objects correctly matched. Averages taken across 100 simulations.
the ratios of average MSEs being greater than 1, smoothing the data results in
an improvement in estimating the dissimilarities, and, as shown graphically in
Figure 6-2, it also results in a greater proportion of pairs of objects correctly
clustered in every case. Similar results are seen for the n = 30 example in Table
6-3 and Table 6-4 and in Figure 6-3.
The improvement, it should be noted, appears to be most pronounced for
the medium values of σ², which makes sense. If there is small variability in the
data, smoothing yields little improvement since the observed data is not very
noisy to begin with. If the variability is quite large, the signal is relatively weak
and smoothing may not capture the underlying structure precisely, although the
smooths still perform far better than the observed data in estimating the true
dissimilarities. When the magnitude of the noise is moderate, the advantage of
smoothing in capturing the underlying clustering structure is sizable.

Figure 6-2: Proportion of pairs of objects correctly matched, plotted against σ (n = 200).
Panels: independent error structure; Ornstein-Uhlenbeck error structure. Solid line: smoothed data. Dashed line: observed data.
Table 6-3: Clustering the observed data and clustering the smoothed data (independent error structure, n = 30).

Independent error structure
σ                      0.2     0.4     0.5     0.75    1.0     1.25    1.5
Ratio of avg. MSEs     1.23    4.02    5.12    5.95    6.30    6.24    6.27
Avg. prop. (observed)  .999    .466    .376    .298    .299    .312    .316
Avg. prop. (smoothed)  1.00    .552    .458    .359    .330    .321    .334

Ratios of average MSE^(obs) to average MSE^(smooth). Average proportions of pairs
of objects correctly matched. Averages taken across 100 simulations.

Table 6-4: Clustering the observed data and clustering the smoothed data (O-U error structure, n = 30).

O-U error structure
σ                      0.2     0.4     0.5     0.75    1.0     1.25    1.5
Ratio of avg. MSEs     1.26    4.07    5.03    5.69    5.78    5.85    6.04
Avg. prop. (observed)  .851    .383    .280    .251    .225    .213    .233
Avg. prop. (smoothed)  .881    .625    .528    .502    .478    .489    .464

Ratios of average MSE^(obs) to average MSE^(smooth). Average proportions of pairs
of objects correctly matched. Averages taken across 100 simulations.
Figure 6-3: Proportion of pairs of objects correctly matched, plotted against σ (n = 30).
Panels: independent error structure; Ornstein-Uhlenbeck error structure. Solid line: smoothed data. Dashed line: observed data.

The above analysis assumes the number of clusters is correctly specified as
4, so we repeat the n = 200 simulation, for data generated from the O-U model,
when the number of clusters is misspecified as 3 or 5. Figure 6-4 shows that, for
the most part, the same superiority, measured by proportion of pairs correctly
clustered, is exhibited by the method of smoothing before clustering.
6.4 Additional Simulation Results
In the previous section, the clustering of the observed curves was compared
with the clustering of the smooths adjusted via the James-Stein shrinking method.
To examine the effect of the James-Stein adjustment in the clustering problem, in
this section we present results of a simulation study done to compare the clustering
of the observed curves, the James-Stein smoothed curves, and smoothed curves
having no James-Stein shrinkage adjustment.
The simulation study was similar to the previous one. We had 20 sample
curves (with n = 30) in each simulation, again generated from four distinct signal
curves: in this case the signal curves could be written as Fourier series. Again, for
various simulations, both independent errors and Ornstein-Uhlenbeck errors (for
varying levels of σ²) were added to the signals to create noisy curves.
For some of the simulations, the signal curves were first-order Fourier series:
μ_i(t) = cos(πt/10) + 2 sin(πt/10), i = 1, ..., 5.
μ_i(t) = 2 cos(πt/10) + sin(πt/10), i = 6, ..., 10.
μ_i(t) = 1.5 cos(πt/10) + sin(πt/10), i = 11, ..., 15.
μ_i(t) = cos(πt/10) + sin(πt/10), i = 16, ..., 20.
In these simulations, the smoother chosen was a first-order Fourier series, so that
the smooths lay in the linear subspace that the Fourier smoother projected onto. In
this case, we would expect the unadjusted smoothing method to perform well, since

Figure 6-4: Proportion of pairs of objects correctly matched, plotted against σ, when the number of clusters is misspecified.
Panels: number of clusters misspecified as 3; misspecified as 5. Solid line: smoothed data. Dashed line: observed data.

the first-order smooths should capture the underlying curves extremely well. Here
the James-Stein adjustment may be superfluous, if not detrimental.
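A small R sketch of the first-order Fourier smoother used in this comparison is given below; treating the basis as the two columns cos(πt/10) and sin(πt/10) at the design points is an assumption about its exact form.

tt <- seq(0, 20, length.out = 30)                 # assumed design points
Fb <- cbind(cos(pi * tt / 10), sin(pi * tt / 10))
S.fourier <- Fb %*% solve(crossprod(Fb), t(Fb))   # rank-2 projection smoother
# A first-order signal curve lies in the column space of Fb, so
# S.fourier %*% mu reproduces it exactly; the second-order curves below
# are oversmoothed, which is where the James-Stein adjustment matters.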
In other simulations, the signal curves were second-order Fourier series:
μ_i(t) = cos(πt/10) + 2 sin(πt/10) + 5 cos(2πt/10) + sin(2πt/10), i = 1, ..., 5.
μ_i(t) = 2 cos(πt/10) + sin(πt/10) + cos(2πt/10) + 8 sin(2πt/10), i = 6, ..., 10.
μ_i(t) = 1.5 cos(πt/10) + sin(πt/10) + 5 cos(2πt/10) + 8 sin(2πt/10), i = 11, ..., 15.
μ_i(t) = cos(πt/10) + sin(πt/10) + 8 cos(2πt/10) + 5 sin(2πt/10), i = 16, ..., 20.
In these simulations, again a first-order Fourier smoother was used, so that the
smoother oversmoothed the observed curves and the smooths were not in the
subspace S projected onto. In this case, the simple linear smoother would not be
expected to capture the underlying curves well, and the James-Stein adjustment
would be needed to reproduce the true nature of the curves.
For each of the three methods (no smoothing, simple linear smoothing,
and James-Stein adjusted smoothing), and for varying magnitudes of noise, the
proportions of pairs of curves correctly grouped are plotted in Figure 6-5. Again,
for each setting, 100 simulations were run, with results averaged over all 100 runs.
The James-Stein adjusted smoothing method performs essentially no worse
than its competitors in three of the four cases. With Ornstein-Uhlenbeck errors,
and when the signal curves are first-order Fourier series, the simple linear smoother
does better as σ becomes larger, which is understandable since that smoother
captures the underlying truth perfectly. For higher values of σ (noisier data), the
James-Stein adjustment is here shrinking the smoother slightly toward the noise
in the observed data. Nevertheless, the James-Stein smoothing method has an
advantage in the independent-errors case, even when the signal curves are first-
order. In both of these cases, both types of smooths are clustered correctly more
often than the observed curves.

Figure 6-5: Proportion of pairs of objects correctly matched, plotted against σ (n = 30).
Panels: O-U errors with 1st-order Fourier signal curves; O-U errors with 2nd-order Fourier signal curves; independent errors with 1st-order Fourier signal curves; independent errors with 2nd-order Fourier signal curves. Solid line: James-Stein adjusted smoothed data. Dotted line: (unadjusted) smoothed data. Dashed line: observed data.

When the signal curves are second-order and the simple linear smoother
oversmooths, the unadjusted smooths are clustered relatively poorly. Here, the
James-Stein method is needed as a safeguard, to shrink the smooth toward the
observed data. The James-Stein adjusted smooths and the observed curves are
both clustered correctly more often than the unadjusted smooths.

CHAPTER 7
ANALYSIS OF REAL FUNCTIONAL DATA
7.1 Analysis of Expression Ratios of Yeast Genes
As an example, we consider yeast gene data analyzed in Alter et al. (2000).
The objects in this situation are 78 genes, and the measured responses are (log-
transformed) expression ratios measured 18 times (at 7-minute intervals), t_j = 7j
for j = 0, ..., 17. As described in Spellman et al. (1998), the yeast was analyzed
in an experiment involving alpha-factor-based synchronization, after which the
RNA for each gene was measured over time. (In fact, there were three separate
synchronization methods used, but here we focus on the measurements produced by
the alpha-factor synchronization.)
Biologists believe that these genes fall into five clusters according to the cell
cycle phase corresponding to each gene.
To analyze these data, we treated them as 78 separate functional observations.
Initially, each multivariate observation on a gene was centered by subtracting the
mean response (across the 18 measurements) from the measurements. Then we
clustered the genes into 5 clusters using the K-medoids method, implemented by
the R function pam.
To investigate the effect of smoothing on the cluster analysis, first we simply
clustered the unsmoothed observed data. Then we smoothed each observation and
clustered the smooth curves, essentially treating the fitted values of the smooths
as the data. A (cubic) B-spline smoother, with two interior knots interspersed
evenly within the given timepoints, was chosen. The rank of the smoothing matrix
S was in this case k = 6 (Ramsay and Silverman, 1997, p. 49). The smooths were
Table 7-1: The classification of the 78 yeast genes into clusters, for both observed data and smoothed data.

Cluster 1. Observed data: 1, 5, 6, 7, 8, 11, 75.
  Smoothed data: 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 13, 62, 70, 74, 75.
Cluster 2. Observed data: 2, 3, 4, 9, 10, 13, 16, 19, 22, 23, 27, 30, 32, 36, 37, 41, 44, 47, 49, 51, 53, 61, 62, 63, 64, 65, 67, 69, 70, 74, 78.
  Smoothed data: 16, 19, 22, 27, 30, 32, 36, 37, 41, 42, 44, 47, 49, 51, 61, 63, 65, 67, 69, 78.
Cluster 3. Observed data: 12, 14, 15, 18, 21, 24, 25, 26, 28, 29, 31, 33, 34, 35, 38, 39, 40, 42, 43, 45, 46, 48, 50, 52.
  Smoothed data: 3, 12, 14, 15, 18, 21, 24, 25, 26, 28, 29, 31, 33, 34, 35, 38, 39, 40, 43, 45, 46, 48, 50, 52.
Cluster 4. Observed data: 17, 20, 54, 55, 56, 57, 58, 59, 60.
  Smoothed data: 17, 20, 23, 53, 54, 55, 56, 57, 58, 59, 60, 64.
Cluster 5. Observed data: 66, 68, 71, 72, 73, 76, 77.
  Smoothed data: 66, 68, 71, 72, 73, 76, 77.
adjusted via the James-Stein shrinkage procedure described in Section 3.2, with
a = 5 here, a choice based on the values of n = 18 and k = 6.
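A compact R sketch of this preprocessing and clustering pipeline is given below; the object yeast (assumed to be a 78 x 18 matrix of log expression ratios) and the use of squared L2 dissimilarities, as in Chapter 6, are assumptions for the illustration.

library(splines); library(cluster)
tt <- 7 * (0:17)
yeast.c <- sweep(yeast, 1, rowMeans(yeast))          # center each gene's profile
B <- bs(tt, df = 6, degree = 3, intercept = TRUE)    # cubic B-splines, k = 6
S <- B %*% solve(crossprod(B), t(B))
js.adjust <- function(y, S, a) {
  sy <- as.vector(S %*% y); r <- y - sy
  sy + max(0, 1 - a / sum(r^2)) * r                  # positive-part James-Stein
}
yeast.sm <- t(apply(yeast.c, 1, js.adjust, S = S, a = 5))
cl.obs <- pam(as.matrix(dist(yeast.c))^2,  k = 5, diss = TRUE)$clustering
cl.sm  <- pam(as.matrix(dist(yeast.sm))^2, k = 5, diss = TRUE)$clustering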
Figure 7-1 shows the resulting clusters of curves for the cluster analyses of
the observed data and of the smoothed curves. We can see that the clusters of
smoothed curves tend to be somewhat less variable about their sample mean curves
than are the clusters of observed curves, which could aid the interpretability of the
clusters.
In particular, features of the second, third, and fourth clusters seem more
apparent when viewing the clusters of smooths than the clusters of observed
data. The major characteristic of the second cluster is that it seems to contain
those processes which are relatively flat, with little oscillation. This is much more
evident in the cluster of smooths. The third and fourth clusters look similar for
the observed data, but a key distinction between them is apparent if we examine
the smooth curves. The third cluster consists primarily of curves which rise to
a substantial peak, decrease, and then rise to a more gradual peak. The fourth
cluster consists mostly of curves which rise to a small peak, decrease, and then rise
to a higher peak. This distinction between clusters is seen much more clearly in the
clustering of the smooths.

log expression ratio log expression ratio log expression ratio log expression ratio log expression ratio
83
Clusters for observed curves
Clusters for smoothed curves
cm -
5 w -
2
*- -
S'"
52
CM _
1 ^
2
a.
X
7 -1
0 20 40 60 80 100 120 0 20 40 60 80 100 120
time
time
Figure 7-1: Plots of clusters of genes.
Observed curves (left) and smoothed curves (right) are shown by dotted lines.
Mean curves for each cluster shown by solid lines.

The groupings of the curves into the five clusters are given in Table 7-1 for
both the observed data and the smoothed data. While the clusterings were fairly
similar for the two methods, there were some differences in the way the curves were
classified. Figure 7-2 shows an edited picture of the respective clusterings of the
curves, with the curves deleted which were classified the same way both as observed
curves and smoothed curves. Note that the fifth cluster was the same for the set of
observed curves and the set of smoothed curves.
7.2 Analysis of Research Libraries
The Association of Research Libraries has collected extensive annual data
on a large number of university libraries throughout many years (Association of
Research Libraries, 2003). Perhaps the most important (or at least oft-quoted)
variable measured in this ongoing study is the number of volumes held in a library
at a given time. In this section a cluster analysis is performed on 67 libraries whose
data on volumes were available for each year from 1967 to 2002 (a string of 36
years). This fits a strict definition of functional data, since the number of volumes
in a library is changing nearly continuously over time, and each measurement is
merely a snapshot of the response at a particular time, namely when the inventory
was done that year.
The goal of the analysis is to determine which libraries are most similar in
their growth patterns over the last 3-plus decades and to see into how many
groups the 67 libraries naturally cluster. In the analysis, the logarithm of volumes
held was used as the response variable, since reducing the immense magnitude of
the values in the data set appeared to produce curves with stable variances. Since
the goal was to cluster the libraries based on their growth curves rather than
simply their sizes, the data was centered by subtracting each library's sample mean
(across the 36 years) from each year's measurement, resulting in 67 mean-centered
data vectors.

Figure 7-2: Edited plots of clusters of genes which were classified differently as observed curves and smoothed curves.
Observed curves (left) and smoothed curves (right). (Horizontal axis: time.)

The observed curves were each smoothed using a basis of cubic B-splines with
three knots interspersed evenly within the timepoints. The positive-part James-
Stein shrinkage was applied, with a value a = 35 in the shrinkage factor (based on
n = 36, k = 7). The K-medoids algorithm was applied to the smoothed curves,
with K = 4 clusters, a choice guided by the average silhouette width criterion that
Rousseeuw (1987) suggests for selecting K.
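The silhouette-based choice of K can be sketched in R as follows; the object lib.sm (holding the 67 mean-centered, smoothed log-volume curves) and the candidate range of K are assumptions for the illustration.

library(cluster)
D <- as.matrix(dist(lib.sm))^2
avg.sil <- sapply(2:8, function(K) pam(D, k = K, diss = TRUE)$silinfo$avg.width)
K.best <- (2:8)[which.max(avg.sil)]   # K = 4 was the choice for these data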
The resulting clusters are shown in Table 7-2. In Figure 7-3. we see the
distinctions among the curves in the different clusters. Cluster 1 contains a few
libraries whose volume size was relatively small at the beginning of the time period,
but then grew dramatically in the first 12 years of the study, growing slowly after
that. Cluster 2 curves show a similar pattern, except that the initial growth is less
dramatic. The behavior of the cluster 3 curves is the most eccentric of the four
clusters, with some notable nonmonotone behavior apparent in the middle years.
The growth curves for the libraries in cluster 4 are consistently slow, steady, and
nearly linear.
For purposes of comparing the clusters, the mean curves for the four clusters
are shown in Figure 7-4.
Note that several of the smooths may not appear visually smooth in Figure
7-3. A close inspection of the data shows some anomalous (probably misrecorded)
measurements which contribute to this chaotic behavior in the smooth. (For
example, see the data and associated smooth for the University of Arizona library,
in Figure 7-5, for which 1971 is a suspicious response.) While an extensive data
analysis would detect and fix these outliers, the data set is suitable to illustrate
the cluster analysis. The sharp-looking peak in the smooth is an artifact of the
discretized plotting mechanism, as the cubic spline is mathematically assured to
have two continuous derivatives at each point.

Figure 7-3: Plots of clusters of libraries.
Clusters 1 through 4, based on smoothed curves. Mean curves for each cluster shown by solid lines. (Axes: time, years since 1967, versus volumes held.)

Figure 7-4: Mean curves for the four library clusters given on the same plot.
Solid: cluster 1. Dashed: cluster 2. Dotted: cluster 3. Dot-dash: cluster 4.

Table 7-2: A 4-cluster K-medoids clustering of the smooths for the library data.

Clusters of smooths for library data
Cluster 1: Arizona, British Columbia, Connecticut, Georgetown, Georgia
Cluster 2: Boston, California-Los Angeles, Florida, Florida State, Iowa, Iowa State, Kansas, Michigan State, Nebraska, Northwestern, Notre Dame, Oklahoma, Pennsylvania State, Pittsburgh, Princeton, Rochester, Southern California, SUNY-Buffalo, Temple, Virginia, Washington U-St. Louis, Wisconsin
Cluster 3: Brown, California-Berkeley, Chicago, Cincinnati, Colorado, Duke, Indiana, Kentucky, Louisiana State, Maryland, McGill, MIT, Ohio State, Oregon, Pennsylvania, Purdue, Rutgers, Stanford, Tennessee, Toronto, Tulane, Utah, Washington State, Wayne State
Cluster 4: Columbia, Cornell, Harvard, Illinois-Urbana, Johns Hopkins, Michigan, Minnesota, Missouri, New York, North Carolina, Southern Illinois, Syracuse, Yale
Figure 7-5: Measurements and B-spline smooth, University of Arizona library.
Points: log of volumes held, 1967-2002. Curve: B-spline smooth.

Table 7-3: A 4-cluster K-medoids clustering of the observed library data.

Clusters of observed library data
Cluster 1: Arizona, British Columbia, Connecticut, Georgetown, Georgia
Cluster 2: Boston, California-Los Angeles, Florida, Florida State, Iowa, Iowa State, Kansas, McGill, Michigan State, Nebraska, Northwestern, Notre Dame, Oklahoma, Pennsylvania State, Pittsburgh, Princeton, Rochester, Southern California, SUNY-Buffalo, Temple, Virginia, Washington U-St. Louis, Wayne State, Wisconsin
Cluster 3: Brown, California-Berkeley, Chicago, Cincinnati, Colorado, Duke, Illinois-Urbana, Indiana, Kentucky, Louisiana State, Maryland, MIT, Ohio State, Oregon, Pennsylvania, Purdue, Rutgers, Stanford, Syracuse, Tennessee, Toronto, Tulane, Utah, Washington State
Cluster 4: Columbia, Cornell, Harvard, Johns Hopkins, Michigan, Minnesota, Missouri, New York, North Carolina, Southern Illinois, Yale
Interpreting the meaning of the clusters is fairly subjective, but the dramatic
rises in library volumes for the cluster 1 and cluster 2 curves could correspond
to an increase in state-sponsored academic spending in the 1960s and 1970s,
especially since many of the libraries in the first two clusters correspond to public
universities. On the other hand, cluster 4 contains several old, traditional Ivy
League universities (Harvard, Yale, Columbia, Cornell), whose libraries' steady
growth could reflect the extended buildup of volumes over many previous years.
Any clear conclusions, however, would require more extensive study of the subject,
as the cluster analysis is only an exploratory first step.
In this data set, only minor differences in the clustering partition were seen
when comparing the analysis of the smoothed data with that of the observed
data (shown in Table 7-3). Interestingly, in some spots where the two partitions
differed, the clustering of the smooths did place certain libraries from the same
state together when the other method did not. For instance, Illinois-Urbana was
placed with Southern Illinois, and Syracuse was placed with Columbia, Cornell, and
NYU.

CHAPTER 8
CONCLUSIONS AND FUTURE RESEARCH
This dissertation has addressed the problems of clustering and estimating
dissimilarities for functional observations.
We began by reviewing important clustering methods and connecting the
clustering output to the dissimilarities among the objects. We then described
functional data and provided some background about functional data analysis,
which has become increasingly visible in the past decade. An examination of the
recent statistical literature reveals a number of methods proposed for clustering
functional data (James and Sugar, 2003; Tarpey and Kinateder, 2003; Abraham et
al., 2003). These methods involve some type of smoothing of the data, and we have
provided justification for this practice of smoothing before clustering.
We proposed a model for the functional observations and hence for the
dissimilarities among them. When the data are smoothed using a basis function
method (such as regression splines, for example), the resulting smoothed-data
dissimilarity estimator dominates the estimator based on the observed data when
the pairwise differences between the signal curves lie in the linear subspace defined
by the smoothing matrix.
When the differences do not lie in the subspace, we have shown that a James-
Stein shrinkage dissimilarity estimator dominates the observed-data estimator
under an independent error model. With dependent errors, an asymptotic (for
n large within a fixed domain) domination result was given. (While the result
appears to hold for moderately large n, the asymptotic situation of the theorem
corresponds to data which are nearly pure functions, measured nearly continuously
across some domain.) The shrinkage estimator is a novel way to unite linear
smoothers and Stein estimation to derive a useful smoothing method.
A good estimation of dissimilarities is only a preliminary step in the eventual
goal, the correct clustering of the data. Thus we have presented a simulation study
which indicates that, for a set of noisy functional data generated from a four-cluster
structure, the data which are smoothed before a K-medoids cluster analysis are
classified more correctly than the observed, unsmoothed data. The correctness of
each clustering was determined by a measure of concordance between the clustering
output and the underlying structure.
Some real functional data were analyzed, with the James-Stein shrinkage
smoother applied to two data sets, which were then clustered via K-medoids. One
data set consisted of 78 yeast genes whose expression level was measured across
time, while the other data set consisted of the "growth curves" (number of volumes
held measured over time) of 67 research libraries.
The applications for this research are wide-ranging, since functional data
are gathered in many fields from the biological to the social sciences. Besides the
applications mentioned in this dissertation, functional data analyses in
archaeology, economics, and biomechanics, among others, are presented by Ramsay and
Silverman (2002).
One obvious extension of this work that could be addressed in the future is
to consider linear smoothers for which S is not idempotent. While a symmetric,
idempotent S corresponds to an important class of smoothers, many other common
smoothers such as kernel smoothers, local polynomial methods, and smoothing
splines have smoothing matrices which are symmetric but not in general
idempotent.
The use of a basis function smoother requires, in some sense, that the curves in
the data set all have the same basic structure, since the same set of basis functions

is used to estimate each signal curve. Fortunately, methods such as regression
splines, especially when the number of knots is fairly large, are flexible enough
to approximate well a variety of shapes of curves. Also, given our emphasis on
working with the pairwise differences of data vectors (i.e., discretized curves), for
the differences to make sense, we need each curve in the data set to be measured at
the same set of points t_1, ..., t_n, another limitation.
The abstract case of data which come to the analyst as pure continuous
functions was resolved in this thesis in the case for which the signal curves were in
the subspace defined by the linear operator S, but the more general case still needs
further work. A more abstract functional analytic approach to functional data
analysis, such as Bosq (2000) presents, could be useful in this problem.
The practical distinction that arises when using the James-Stein adjustment
between smoothing the pairwise differences in the noisy curves as opposed to
smoothing the noisy curves themselves was addressed in Section 6.2. Empirical
evidence was given indicating that it mattered very little in practice which procedure
was done, so one could feel safe in employing the computationally faster method of
smoothing the noisy curves directly. More analytical exploration of this is needed,
however.
With the continuing growth in computing power available to statisticians,
computationally intensive methods like stochastic cluster analysis will become more
prominent in the future. Much of the work that has been done in this area has
addressed the clustering of traditional multivariate data, so stochastic methods for
clustering functional data are needed. Objective functions designed for functional
data, such as the one proposed in Chapter 5, are the first step in such methods.
The next step is to create better and faster search algorithms to optimize such
objective functions and discover the best clustering.

In a broad sense, there is an increasing need for methods of analyzing large
functional data sets. With today's advanced monitoring equipment, scientists can
measure responses nearly continuously, over time or some other domain, for a large
number of individuals. Statisticians need to have methods, both exploratory and
confirmatory, to discover genuine trends and variability across great masses of
curves. The field of data mining (Hand et al., 2000) has begun to provide some
answers for these problems, and functional data provide another rich class of
problems to address.

APPENDIX A
DERIVATIONS AND PROOFS
A.1 Proof of Conditions for Smoothed-data Estimator Superiority
Now, since S is a shrinking smoother, this means ||Sy|| ≤ ||y|| for all y, and
hence ||Sy||² ≤ ||y||² for all y. Therefore, ||Sθ||² ≤ ||θ||². Hence, and because
k < n, we see that the first term of (3.2) is less than the first term of (3.1), and
that the second term of (3.2) is ≤ 0. The third term of (3.2) will be positive,
though.
Therefore, the only way the observed-data estimator would have a smaller
MSE than the smoothed-data estimator is if the negative quantity ||Sθ||² - ||θ||² is
of magnitude large enough that its square (which appears in the third term of (3.2))
overrides the advantage (3.2) has in the first two terms. This corresponds to the
situation when the shrinking smoother shrinks too much past the underlying mean
curve. This phenomenon is the same one seen in simulation studies.
We can make this more mathematically precise. Writing out the condition
under which the third term of (3.2) outweighs the first two, and letting
w = ||Sθ||² - ||θ||², the condition reduces to a quadratic inequality
aw² + bw + c < 0,
where, because k < n, the coefficients satisfy a > 0, b > 0, and c < 0. Thus the
left-hand side is < 0 when
w ∈ ( [-b - √(b² - 4ac)]/(2a), [-b + √(b² - 4ac)]/(2a) ).     (A.1)
After cancelling the common positive factor in a, b, and c, the left endpoint of the
interval in (A.1) is
[-(4 + 2k) - √((4 + 2k)² + 4(2n + n² - 2k - k²))]/2
= [-4 - 2k - √(16 + 16k + 4k² + 8n - 8k + 4n² - 4k²)]/2
= [-4 - 2k - √(16 + 8n + 4n² + 8k)]/2
= [-4 - 2k - 2√(4 + 2n + n² + 2k)]/2
= -2 - k - √(4 + 2n + n² + 2k).
Similarly, the right endpoint of (A.1) is -2 - k + √(4 + 2n + n² + 2k). Hence,
the smoothed-data estimator is better in terms of MSE when
-2 - k - √(4 + 2n + n² + 2k) < ||Sθ||² - ||θ||² < -2 - k + √(4 + 2n + n² + 2k).     (A.2)
Note that for any n > 1 and integer k < n, the right-hand side of (A.2) is
greater than 0. Since ||Sθ||² - ||θ||² ≤ 0, the second inequality of (A.2) is always
satisfied in our setting. Hence, the smoothed-data estimator is better when
||Sθ||² - ||θ||² > -2 - k - √(4 + 2n + n² + 2k).
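As a numerical illustration (not part of the original derivation), condition (A.2) can be evaluated in R for the n = 200, k = 20 B-spline setting of Chapter 6:

n <- 200; k <- 20
bound <- sqrt(4 + 2 * n + n^2 + 2 * k)
c(lower = -2 - k - bound, upper = -2 - k + bound)
# The smoothed-data estimator has smaller MSE whenever
# ||S theta||^2 - ||theta||^2 exceeds the lower endpoint (about -223 here).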

A.2 Extension of Stochastic Domination Result
Recall that we have a linear combination of n independent noncentral χ²_1
random variables and a linear combination of n independent central χ²_1 variates.
We know that for each δ_i² > 0, i = 1, ..., n,
P[χ²_1(δ_i²) > x] ≥ P[χ²_1 > x] for all x > 0.
Since c_1, ..., c_n ≥ 0, letting Z_i = c_i X_i, i = 1, ..., n,
P[c_i χ²_1(δ_i²) > z_i] ≥ P[c_i χ²_1 > z_i] for all z_i > 0.
(If c_i = 0, the inequality trivially holds.) Now, since f(x_1, ..., x_n) = x_1 + ... + x_n is
an increasing function, and since the n noncentral chi-squares are independent and
the n central chi-squares are independent, we apply a result given in Ross (1996, p.
410, Example 9.2(A)) to conclude:
P[c_1 χ²_1(δ_1²) + ... + c_n χ²_1(δ_n²) > x] ≥ P[c_1 χ²_1 + ... + c_n χ²_1 > x] for all x > 0.
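A quick Monte Carlo check of this stochastic ordering, with arbitrary illustrative constants c_i and noncentralities δ_i², can be run in R:

set.seed(1)
nn <- 5; c.vec <- runif(nn); delta2 <- runif(nn, 0, 4); x <- 3
noncentral <- replicate(1e5, sum(c.vec * rchisq(nn, df = 1, ncp = delta2)))
central    <- replicate(1e5, sum(c.vec * rchisq(nn, df = 1)))
mean(noncentral > x) >= mean(central > x)   # TRUE, up to Monte Carlo error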
A.3 Definition and Derivation of β̂_j and σ̂_j²
Recall that y_i is the response vector for object i, and we denote the design
matrix by T. Then β̂_j, the estimator of the coefficients of the regression of all the
objects in cluster j, is:
β̂_j = [ (T', ..., T') (T', ..., T')' ]^{-1} (T', ..., T') (y_1', ..., y_{m_j}')'
    = [ m_j T'T ]^{-1} (T'y_1 + ... + T'y_{m_j})
    = (1/m_j) (T'T)^{-1} T' Σ_{i=1}^{m_j} y_i.
SSE_j, the sum of squared errors of the regression of all the objects in cluster j, is:
SSE_j = (y_1', ..., y_{m_j}') { I - (T', ..., T')' [ (T', ..., T') (T', ..., T')' ]^{-1} (T', ..., T') } (y_1', ..., y_{m_j}')'
  = y_1'y_1 + ... + y_{m_j}'y_{m_j}
    - (1/m_j) (y_1'T + ... + y_{m_j}'T) (T'T)^{-1} (T'y_1 + ... + T'y_{m_j})
  = (1/m_j) [ (y_1'y_1 - y_1'T(T'T)^{-1}T'y_1) + ... + (y_{m_j}'y_{m_j} - y_{m_j}'T(T'T)^{-1}T'y_{m_j}) ]
    + ((m_j - 1)/m_j) (y_1', ..., y_{m_j}') S̃ (y_1', ..., y_{m_j}')'
(where S̃ is the m_j n × m_j n block matrix with I_n in each diagonal block and
-(1/(m_j - 1)) T(T'T)^{-1}T' in each off-diagonal block)
  = (1/m_j) Σ_{i=1}^{m_j} SSE_i + ((m_j - 1)/m_j) [ (y_1', ..., y_{m_j}') S̃ (y_1', ..., y_{m_j}')' ].
Hence σ̂_j², the mean squared error of the regression of all the objects in cluster j,
is:
σ̂_j² = SSE_j / (m_j n - p*)
     = (1/(m_j n - p*)) { (1/m_j) Σ_{i=1}^{m_j} SSE_i + ((m_j - 1)/m_j) [ (y_1', ..., y_{m_j}') S̃ (y_1', ..., y_{m_j}')' ] }.

APPENDIX B
ADDITIONAL FORMULAS AND CONDITIONS
B.1 Formulas for Roots of Δ_U(a) = 0 for General n and k
The four roots (two real, two imaginary) of Δ_U(a) = 0 are 0, the real root
p_2^{1/3} + (2/9)p_3 - (4/3)p_1/(n + k), and the pair of imaginary roots
-(1/2)p_2^{1/3} - (1/9)p_3 - (4/3)p_1/(n + k) ± (1/2)i√3(p_2^{1/3} - (2/9)p_3).
Here,
p_1 = n² - 2nk - 6n + k² + 6k + 8,
p2 = (2/27)pj(-3072fc + 3072n 1232n2 + 2464n/t 1232:2 + 60n3-
132A:3 5n4 252n2k + 324nfc2 + lln3it 3n2k2 Ink3 + 4:4 2048)-=-
(-n + k)3 + (2/9)(-442368/c + 344064n 1032192n2 147456n: + 1179648fc2+
1250304n3 + 1274880/c3 781056n4 1238016n2fc 1287168n/t2 + 1658880n3A;-
238080n2fc2 1376256nA:3 + 736512A:4 + 248064A:5 + 265536n5 870336fcn4+
769728n3k2 + 257472n2*:3 670464nA:4 + 211680n5fc 327216n4*;2 + 133440n3A;3+
152496n2/c4 172320nA;5 21636neA: + 48288n5Jfc2 44712n4fc3 + 588n3fc4+
31308fc5n2 22776k6n 4- 576n7fc 1776n6fc2 + 2520n5k3 1080n4it4 47088n6
+49008A;6 + 3660n7 + 5280fc7 72n8 + 240A;8 1488n3*;5 + 2304n2fc6-
1224nk7 15n8/c + 27n7k2 15k4n5 + 27fc5n4 15n6A:3 15n3k6+
3n2kT + 3n9) -=- (-n + A:))1/2 -j- (-n -I- k),
p3 = (768k 768n + 344n2 688nk + 344/fc2 42n3 + 54Jfc3 n4 + I38n2k
-150nk2 + n3k + 3n2k2 5nk3 + 2k4 + 512) ((-n + fc)2p2/3).
B.2 Regularity Condition for Positive Semidefinite G in Laplace Approximation
Recall the Laplace approximation given by Lieberman (1994) for the expectation
of a ratio of quadratic forms:
E( x'Fx / x'Gx ) ≈ E(x'Fx) / E(x'Gx).
Lieberman (1994) denotes the joint moment generating function of x'Fx and x'Gx by
M(ω_1, ω_2) = E[exp(ω_1 x'Fx + ω_2 x'Gx)],
and assumes a positive definite G. In that case, E(x'Gx) > 0. Lieberman uses the
positive definiteness of G to show that the derivative of the cumulant generating
function of x'Gx is greater than zero. That is,
(∂/∂ω_2) log M(0, ω_2) = [ (∂/∂ω_2) M(0, ω_2) ] [ M(0, ω_2) ]^{-1}
  = [ ∫ (x'Gx) exp{ω_2 x'Gx} f(x) dx ] [ ∫ exp{ω_2 x'Gx} f(x) dx ]^{-1}
  > 0.
The positive derivative ensures the maximum of log M(0, ω_2) is attained at the
boundary point (where ω_2 = 0).
For positive semidefinite G, we need the additional regularity condition that
P(x'Gx > 0) > 0, i.e., the support of x'Gx is not degenerate at zero. This will
ensure that E(x'Gx) > 0 and hence that (∂/∂ω_2) M(0, ω_2) > 0. Therefore both of the
integrals in the above expression are positive and the Laplace approximation will
hold.

REFERENCES
Abraham, C., Cornillon, P. A., Matzner-Lober, E. and Molinari, N. (2003).
Unsupervised curve clustering using B-splines, The Scandinavian Journal of
Statistics 30: 581-595.
Alter, O., Brown, P. O. and Botstein, D. (2000). Singular value decomposition
for genome-wide expression data processing and modeling, Proceedings of the
National Academy of Sciences 97: 10101-10106.
Association of Research Libraries (2003). ARL Statistics Publication Home
Page. Available at www.arl.org/stats/arlstat, accessed November 2003;
published by the Association of Research Libraries.
Baldessari, B. (1967). The distribution of a quadratic form of normal random
variables, The Annals of Mathematical Statistics 38: 1700-1704.
Bjornstad, J. F. (1990). Predictive likelihood: A review (C/R: p255-265), Statistical
Science 5: 242-254.
Booth, J. G., Casella, G., Cooke, J. E. K. and Davis, J. M. (2001). Sorting
periodically-expressed genes using microarray data, Technical Report 2001-026,
Department of Statistics, University of Florida.
Bosq, D. (2000). Linear Processes in Function Spaces: Theory and Applications,
New York: Springer-Verlag Inc.
Brockwell, P. J. and Davis, R. A. (1996). Introduction to Time Series and
Forecasting, New York: Springer-Verlag Inc.
Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive
models (C/R: p510-555), The Annals of Statistics 17: 453-510.
Butler, R. W. (1986). Predictive likelihood inference with applications (C/R:
p23-38), Journal of the Royal Statistical Society, Series B, Methodological
48: 1-23.
Casella, G. and Berger, R. L. (1990). Statistical Inference, Belmont, California:
Duxbury Press.
Casella, G. and Hwang, J. T. (1987). Employing vague prior information in the
construction of confidence sets, Journal of Multivariate Analysis 21: 79-104.
Celeux, G. and Govaert, G. (1992). A classification EM algorithm for clustering
and two stochastic versions, Computational Statistics and Data Analysis
14: 315-332.
Chow, Y. S. and Teicher, H. (1997). Probability Theory: Independence,
Interchangeability, Martingales, New York: Springer-Verlag Inc.
Cressie, N. A. C. (1993). Statistics for Spatial Data, New York: John Wiley and
Sons.
Cuesta-Albertos, J. A., Gordaliza, A. C. and Matrán, C. (1997). Trimmed k-means:
An attempt to robustify quantizers, The Annals of Statistics 25: 553-576.
de Boor, C. (1978). A Practical Guide to Splines, New York: Springer-Verlag Inc.
DeVito, C. L. (1990). Functional Analysis and Linear Operator Theory, Redwood
City, California: Addison-Wesley.
Eubank, R. L. (1988). Spline Smoothing and Nonparametric Regression, New York:
Marcel Dekker Inc.
Falkner, F. (ed.) (1960). Child Development: An International Method of Study,
Basel: Karger.
Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering
method? Answers via model-based cluster analysis, The Computer Journal
41: 578-588.
Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis,
and density estimation, Journal of the American Statistical Association
97(458): 611-631.
García-Escudero, L. A. and Gordaliza, A. (1999). Robustness properties of k-means
and trimmed k-means, Journal of the American Statistical Association
94: 956-969.
Gnanadesikan, R., Blashfield, R. K., Breiman, L., Dunn, O. J., Friedman, J. H., Fu,
K., Hartigan, J. A., Kettenring, J. R., Lachenbruch, P. A., Olshen, R. A. and
Rohlf, F. J. (1989). Discriminant analysis and clustering, Statistical Science
4: 34-69.
Gordon, A. D. (1981). Classification: Methods for the Exploratory Analysis of
Multivariate Data, London: Chapman and Hall Ltd.
Green, E. J. and Strawderman, W. E. (1991). A James-Stein type estimator for
combining unbiased and possibly biased estimators, Journal of the American
Statistical Association 86: 1001-1006.

Hand, D. J., Blunt, G., Kelly, M. G. and Adams, N. M. (2000). Data mining for
fun and profit, Statistical Science 15(2): 111-126.
Hastie, T., Buja, A. and Tibshirani, R. (1995). Penalized discriminant analysis,
The Annals of Statistics 23: 73-102.
James, G. M. and Hastie, T. J. (2001). Functional linear discriminant analysis for
irregularly sampled curves, Journal of the Royal Statistical Society, Series B,
Methodological 63(3): 533-550.
James, G. M. and Sugar, C. A. (2003). Clustering for sparsely sampled functional
data, Journal of the American Statistical Association 98: 397-408.
James, W. and Stein, C. (1961). Estimation with quadratic loss, Proceedings of
the Fourth Berkeley Symposium on Mathematical Statistics and Probability,
Volume 1, pp. 361-379.
Johnson, R. A. and Wichern, D. W. (1998). Applied Multivariate Statistical
Analysis, Upper Saddle River, New Jersey: Prentice-Hall Inc.
Kaufman, L. and Rousseeuw, P. J. (1987). Clustering by means of medoids,
Statistical Data Analysis Based on the L1-norm and Related Methods, pp. 405-
416.
Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction
to Cluster Analysis, New York: John Wiley and Sons.
Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P. (1983). Optimization by simulated
annealing, Science 220: 671-680.
Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation, New York:
Springer-Verlag Inc.
Lieberman, O. (1994). A Laplace approximation to the moments of a ratio of
quadratic forms, Biometrika 81: 681-690.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate
observations, Proceedings of the Fifth Berkeley Symposium on Mathematical
Statistics and Probability, Volume 1, pp. 281-297.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E.
(1953). Equations of state calculations by fast computing machines, Journal of
Chemical Physics 21: 1087-1092.
Milligan, G. W. and Cooper, M. C. (1985). An examination of procedures for
determining the number of clusters in a data set, Psychometrika 50: 159-179.

Ramsay, J. O. and Dalzell, C. J. (1991). Some tools for functional data analysis
(Disc: p561-572), Journal of the Royal Statistical Society, Series B,
Methodological 53: 539-561.
Ramsay, J. O. and Silverman, B. W. (1997). Functional Data Analysis, New York:
Springer-Verlag Inc.
Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis:
Methods and Case Studies, New York: Springer-Verlag Inc.
Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods, New York:
Springer-Verlag Inc.
Rodgers, W. L. (1988). Statistical matching, Encyclopedia of Statistical Sciences (9
vols. plus Supplement), Volume 8, pp. 663-664.
Ross, S. M. (1996). Stochastic Processes, New York: John Wiley and Sons.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation
and validation of cluster analysis, Journal of Computational and Applied
Mathematics 20: 53-65.
Selim, S. Z. and Alsultan, K. (1991). A simulated annealing algorithm for the
clustering problem, The Journal of the Pattern Recognition Society 24: 1003-
1008.
Selim, S. Z. and Ismail, M. A. (1984). K-means-type algorithms: A generalized
convergence theorem and characterization of local optimality, IEEE
Transactions on Pattern Analysis and Machine Intelligence 6: 81-87.
Simonoff, J. S. (1996). Smoothing Methods in Statistics, New York: Springer-Verlag
Inc.
Sloane, N. J. A. and Plouffe, S. (1995). The Encyclopedia of Integer Sequences, San
Diego: Academic Press.
Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen,
M. B., Brown, P. O., Botstein, D. and Futcher, B. (1998). Comprehensive
identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae
by microarray hybridization, Molecular Biology of the Cell 9: 3273-3297.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a
multivariate normal distribution, Proceedings of the Third Berkeley Symposium on
Mathematical Statistics and Probability, Volume 1, pp. 197-206.
Stein, M. L. (1995). Locally lattice sampling designs for isotropic random fields
(Corr: 1999, V27, p. 1440), The Annals of Statistics 23: 1991-2012.

Sugar, C. A. and James, G. M. (2003). Finding the number of clusters in a dataset:
An information-theoretic approach, Journal of the American Statistical
Association 98: 750-763.
Tan, W. Y. (1977). On the distribution of quadratic forms in normal random
variables, The Canadian Journal of Statistics 5: 241-250.
Tarpey, T. and Kinateder, K. (2003). Clustering functional data, The Journal of
Classification 20: 93-114.
Tarpey, T., Petkova, E. and Ogden, R. T. (2003). Profiling placebo responders
by self-consistent partitioning of functional data, Journal of the American
Statistical Association 98: 850-858.
Taylor, J. M. G., Cumberland, W. G. and Sy, J. P. (1994). A stochastic model
for analysis of longitudinal AIDS data, Journal of the American Statistical
Association 89: 727-736.
Tibshirani, R., Walther, G. and Hastie, T. (2001a). Estimating the number of
clusters in a data set via the gap statistic, Journal of the Royal Statistical
Society, Series B, Methodological 63(2): 411-423.
Tibshirani, R., Walther, G., Botstein, D. and Brown, P. (2001b). Cluster validation
by prediction strength, Technical Report 2001-21, Department of Statistics,
Stanford University.
Young, F. W. and Hamer, R. M. (1987). Multidimensional Scaling: History, Theory,
and Applications, Hillsdale, New Jersey: Lawrence Erlbaum Associates.

BIOGRAPHICAL SKETCH
David B. Hitchcock was born in Hartford, Connecticut, in 1974 to Richard
and Gloria Hitchcock, and he grew up in Athens, Georgia, and Stone Mountain,
Georgia, with his parents and sister, Rebecca. He graduated from St. Pius X
Catholic High School in Atlanta, Georgia, in 1992.
He earned a bachelor's degree from the University of Georgia in 1996 and a
master's degree from Clemson University in 1999. While at Clemson, he met his
future wife Cassandra Kirby, who was also a master's student in mathematical
sciences.
He came to the University of Florida in 1999 to pursue a Ph.D. degree in
statistics. While at Florida, he completed the required Ph.D. coursework and
taught several undergraduate courses. His activities in the department at Florida
also included student consulting at the statistics unit of the Institute of Food and
Agricultural Sciences and a research assistantship with Alan Agresti. He began
working with George Casella and Jim Booth on his dissertation research in the fall
of 2001 after passing the written Ph.D. qualifying exam.
In January 2004 David and Cassandra were married. After graduating, David
will be an assistant professor of statistics at the University of South Carolina.

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.

George Casella, Chair
Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.

James G. Booth, Cochair
Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.

James Hobert
Associate Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.

Brett D. Presnell
Associate Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.

John Henretta
Professor of Sociology

This dissertation was submitted to the Graduate Faculty of the Department of
Statistics in the College of Liberal Arts and Sciences and to the Graduate School
and was accepted as partial fulfillment of the requirements for the degree of Doctor
of Philosophy.
August 2004
Dean, Graduate School

SMOOTHING FUNCTIONAL DATA FOR CLUSTER ANALYSIS
David B. Hitchcock
(352) 392-1941
Department of Statistics
Chair: George Casella
Degree: Doctor of Philosophy
Graduation Date: August 2004
Cluster analysis, which places objects into reasonable groups based on
statistical data measured on them, is an important exploratory tool for the social and
biological sciences. We explore the particular problem of clustering functional data,
characteristically observed as part of a continuous process. Examples of functional
data include growth curves and biomechanics data measuring movement. In recent
years, methods for smoothing and clustering functional data have appeared in the
statistical literature, but little has specifically addressed the effect of smoothing on
the cluster analysis.
We examine the effect of smoothing functional data on estimating the
dissimilarities among objects and on clustering those objects. Through theory and
simulations, a shrinkage smoothing method is shown to result in a better estimator
of the dissimilarities and a more accurate grouping than using unsmoothed data.
Two examples, involving yeast gene expression levels and research library growth
curves, illustrate the technique.



37
into a part involving (I S)0 and a part involving SO.
/(0|0)oc exp[-(l/2)(0-0)'(0-0)]
= exp[(l/2)(0 0)' (l S + S)(0 0)]
= exp{-(l/2)[0'(I S)0 + (0 0)'S(0 0)]}
which has a part involving (I S)0 and a part involving SO.
Since S is symmetric, then for some orthogonal matrix P, PSP* is diagonal.
Since S is idempotent, its eigenvalues are 0 and 1. so PSP = B where
B = [ Ilc .
0 0
Let pi,.... pn be the rows of P. Then 0* = P0 = (p^,..., 0,... ,0)\
And let 0 be the vector containing the k nonzero elements of 0*. Similarly,
0* = P0.
Now we consider the likelihood for 0\
L(0\0) (x exp{-(l/2)[0'(I-S)0 + (0-0)'S(0-0)]}
oc exp{-(l/2)[(0 0),S(0 0)]}
= exp{-(l/2)[(0 0) P PSP P(0 0)]}
= exp{(l/2)[(0* 0*)'PSP'(0* 0*)]}
= exp{ (l/2)[(0* 0*)B(0* 0*)]}.
Now we put a V(0, V) prior on 0*:
7r(0*) a exp[-(l/2)0*'V-0*],


14
Scatterplot smoothing, or nonparametric regression, may be used generally for
paired data for which some underlying regression function E[y^ = /(£*) is
assumed. But smoothing is particularly appropriate for functional data, for which
that functional relationship y[t) between the response and the process on T is
inherent in the data.
One option, upon observing a functional measurement y, is to imagine the
unknown underlying curve as an interpolant y(t). This results in a curve that is
visually no smoother than the observed data, however. Typically, when functional
data are analyzed, the vector of measurements is converted to a curve via a
smoothing procedure which reduces the random variation in the function. If we
wish to cluster functional data, it may be advantageous to smooth the observed
vector for each object and perform the cluster analysis on the smooth curves rather
than on the observed data. (Clearly, this option is inappropriate for cross-sectional
data such as the European agricultural data of Chapter 1.)
Smoothing data results in an unavoidable tradeoff between bias and variance
(Simonoff. 1996, p. 15). The greater the amount of smoothing of a functional
measurement, the more its variance will decrease, but the more biased it will
become (Simonoff. 1996. p. 42). In cluster analysis, we hope that clustering the
smoothed data (which contains reduced noise) will lead to smaller within-cluster
variability, since functional data which truly belong to the same cluster should
appear more similar when represented as smooth curves. This would help make
the clustering structure of the data more apparent. Using smoothed data may
introduce a bias, however, and the bias-variance tradeoff could be quantified with a
mean squared error-type criterion.
We denote the observed noisy curves to be yi(t),..., j/at(<). The underlying
signal curves for this data set are ..., /xat(£)* In reality we observe these


47
where is the smallest and wmax the largest of the n k nonzero eigenvalues of
(1-8)33.
As was the case in the discrete noise situation, A^(a) = 0 is a fourth-degree
equation with two real roots, one of which is trivially zero. Call the nontrivial real
root r.
Theorem 4.2 If n k > 4, then the nontrivial real root r of A^(a) = 0 is positive.
Furthermore, for any choice a 6 (0,r), the asymptotic upper bound Ay < 0,
implying A < 0. That is, forO < a < r, and for sufficiently large n, the risk
difference is negative and the smoothed-data dissimilarity estimator d\j is better
than dij.
Proof of Theorem f.2: Since £ is positive definite, £r(£) is positive. Since
M2 > 0, we may write Ay(a) = 0 as c4a4 + c3a3 + c2a2 + Cia = 0, where
c4 > 0, c3 < 0, c2 > 0, cj < 0. The proof then follows exactly from the proof of
Theorem 3.2 in Section 3.2.
As with the discrete noise situation, one can easily verify, using a symbolic
algebra program, that two of the roots are imaginary, and can determine the
nontrivial real root in terms of M2, £r[(I S)£], and £r(£).
As an example, let us consider a situation in which we observe functional data
yi,..., yn measured at 50 equally spaced points t\,..., £50, one unit apart. Here,
let us assume the observations are discretized versions of functions yi(£),..., ?/#(£)
which contain (possibly different) signal functions, plus a noise function arising
from an Ornstein-Uhlenbeck (O-U) process. We can then calculate the covariance
matrix of each y and the covariance matrix £ of 9. Under the O-U model,
t r(S) = always, since each diagonal element of £ is
For example, suppose the O-U process has a2 = 2 and 1. Then in this
example, £r(£) = 50. Suppose we choose S be the smoothing matrix corresponding
to a B-spline basis smoother with 6 knots, dispersed evenly within the data. Then


16
Data which are measured frequently and almost continuously for example, via
sophisticated monitoring equipment may be more likely to follow model (2.2),
since data measured closely across time (or another domain) may more likely be
correlated. We will examine both situations.
We may apply a linear smoother to obtain the smoothed curves pi(t),..., /i^(i).
In practice, we apply a smoothing matrix S to the observed noisy data to obtain a
smooth, called linear because the smooth can be written as
Pi = Sy ui = 1
where S does not depend on y (Buja et al.. 1989) and we define pi = (/ii(ii),... pi(tn))'.
Note that as n > oo, the vector p begins to closely resemble the curve p() on
[o,n
(Here and subsequently, when writing limit as n > oo, we assume
ti,... ,tn E [0, T]; that is. the collection of points is becoming denser within [0, T],
with the maximum gap between any pair of adjacent points _i, ,, i = 2,..., n,
tending to 0. Stein (1995) calls this method of taking the limit fixed-domain
asymptotics, while Cressie (1993) calls it infill asymptotics.)
Many popular smoothing methods (kernel smoothers, local polynomial
regression, smoothing splines) are linear. Note that if a bandwidth or smoothing
parameter for these methods is chosen via a data-driven method, then technically,
these smoothers become nonlinear (Buja et al., 1989).
We will focus primarily on basis function smoothing methods, in which the
smoothing matrix S is an orthogonal projection (i.e., symmetric and idempotent).
These methods seek to express the signal curve as a linear combination of k (< n)
specified basis functions, in which case the rank of S is k. Examples of such
methods are regression splines, Fourier series, polynomial regression, and some
types of wavelet smoothers (Ramsay and Silverman, 1997, pp. 44-50).
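For concreteness, here is a minimal R sketch of such a basis-function smoother: an orthogonal projection $\mathbf{S}$ built from $k$ B-spline basis functions and applied as $\hat{\mu}_i = \mathbf{S} y_i$. The signal curve and noise level are invented for illustration.

```r
## Sketch (R): a linear basis-function smoother mu_hat = S y, with S an
## orthogonal projection onto k basis functions (illustrative signal and noise).
library(splines)

set.seed(1)
n    <- 100
tpts <- seq(0, 1, length.out = n)
mu   <- sin(2 * pi * tpts)               # hypothetical signal curve
y    <- mu + rnorm(n, sd = 0.3)          # observed noisy curve

B <- bs(tpts, df = 8, intercept = TRUE)  # k = 8 B-spline basis functions
S <- B %*% solve(crossprod(B)) %*% t(B)  # smoothing matrix: symmetric, idempotent
mu_hat <- drop(S %*% y)                  # the linear smooth of y

max(abs(S - t(S)))                       # S = S' (up to rounding error)
max(abs(S %*% S - S))                    # S S = S, so S is a projection
sum(diag(S))                             # tr(S) = rank(S) = k = 8
```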


24
Lehmann and Casella (1998, p. 367) discuss shrinking an estimator toward a linear subspace of the parameter space. In our case, we believe that $\theta$ is near $\mathbf{S}\theta$. Let $\mathcal{L}_S = \{\theta : \mathbf{S}\theta = \theta\}$, where $\mathbf{S}$ is symmetric and idempotent of rank $k$. Hence we may shrink $\hat{\theta}$ toward $\mathbf{S}\hat{\theta}$, the MLE of $\theta \in \mathcal{L}_S$.
In this case, a James-Stein estimator of $\theta$ (see Lehmann and Casella, 1998, p. 367) is
$$\hat{\theta}^{(JS)} = \mathbf{S}\hat{\theta} + \left(1 - \frac{a}{\|\hat{\theta} - \mathbf{S}\hat{\theta}\|^2}\right)(\hat{\theta} - \mathbf{S}\hat{\theta}),$$
where $a$ is a constant and $\|\cdot\|$ is the usual Euclidean norm.
In practice, to avoid the problem of the shrinkage factor possibly being negative for small $\|\hat{\theta} - \mathbf{S}\hat{\theta}\|^2$, we will use the positive-part James-Stein estimator
$$\hat{\theta}^{(JS)} = \mathbf{S}\hat{\theta} + \left(1 - \frac{a}{\|\hat{\theta} - \mathbf{S}\hat{\theta}\|^2}\right)_{+}(\hat{\theta} - \mathbf{S}\hat{\theta}), \qquad (3.4)$$
where $x_+ = x\,\mathbf{1}(x > 0)$.
Casella and Hwang (1987) propose similar shrinkage estimators in the context
of confidence sets for a multivariate normal mean. Green and Strawderman (1991),
also in the context of estimating a multivariate mean, discuss how shrinking an
unbiased estimator toward a possibly biased estimator using a James-Stein form
can result in a risk improvement.
The shrinkage estimator involves the data by giving more weight to $\hat{\theta}$ when $\|\hat{\theta} - \mathbf{S}\hat{\theta}\|^2$ is large and more weight to $\mathbf{S}\hat{\theta}$ when $\|\hat{\theta} - \mathbf{S}\hat{\theta}\|^2$ is small. In fact, if the smoother is at all well-chosen, $\mathbf{S}\hat{\theta}$ is often close enough to $\hat{\theta}$ that the shrinkage factor in (3.4) is very often zero. The shrinkage factor is actually merely a safeguard against oversmoothing, in case $\mathbf{S}$ smooths the curves beyond what, in reality, it should.
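A small R function makes the positive-part estimator (3.4) concrete. It is a sketch only: $\mathbf{S}$ is any symmetric, idempotent smoothing matrix, and the constant $a$ shown in the example is an arbitrary illustrative value, not a recommended choice.

```r
## Sketch (R): positive-part James-Stein adjustment of (3.4).
js_smooth <- function(theta_hat, S, a) {
  resid  <- drop(theta_hat - S %*% theta_hat)   # theta_hat - S theta_hat
  shrink <- max(0, 1 - a / sum(resid^2))        # (1 - a / ||theta_hat - S theta_hat||^2)_+
  drop(S %*% theta_hat) + shrink * resid        # S theta_hat + (.)_+ (theta_hat - S theta_hat)
}

## Illustrative use with an arbitrary rank-4 projection S and noisy theta_hat
set.seed(2)
n <- 30
X <- cbind(1, poly(1:n, 3))                     # basis of k = 4 functions
S <- X %*% solve(crossprod(X)) %*% t(X)
theta     <- drop(S %*% rnorm(n))               # a signal lying in the subspace L_S
theta_hat <- theta + rnorm(n)                   # noisy observation of theta
theta_js  <- js_smooth(theta_hat, S, a = 5)     # a = 5 is illustrative only
```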


83
[Figure 7-1 appears here: two panels, "Clusters for observed curves" (left) and "Clusters for smoothed curves" (right), each plotting log expression ratio against time (0 to 120 minutes).]
Figure 7-1: Plots of clusters of genes.
Observed curves (left) and smoothed curves (right) are shown by dotted lines.
Mean curves for each cluster shown by solid lines.


45
Hence we have a large-$n$ approximation, or asymptotic expression, for $\Delta$:
$$
\Delta^A = 4a^2 - 4a\,\mathrm{tr}(\boldsymbol{\Sigma})
+ E\!\left[\frac{a^4}{\hat{q}_1^{\,2}}\right]
+ E\!\left[2a^2\left(1 + \frac{q_2 + \mathrm{tr}(\mathbf{S}\boldsymbol{\Sigma})}{\hat{q}_1}
- \frac{2a + q_1 + q_2}{\hat{q}_1}\right)\right].
$$
Since by Jensen's Inequality $E(1/\hat{q}_1) > 1/E(\hat{q}_1)$, and since $E(\hat{q}_1) = q_1 + \mathrm{tr}[(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}]$, we can bound the last term above:
$$
\begin{aligned}
E\!\left[2a^2\left(1 + \frac{q_2 + \mathrm{tr}(\mathbf{S}\boldsymbol{\Sigma})}{\hat{q}_1} - \frac{2a + q_1 + q_2}{\hat{q}_1}\right)\right]
&\le 2a^2\left(1 + \frac{q_2 + \mathrm{tr}(\mathbf{S}\boldsymbol{\Sigma})}{q_1 + \mathrm{tr}[(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}]} - \frac{q_1 + q_2 + 2a}{q_1 + \mathrm{tr}[(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}]}\right) \\
&= 2a^2\left(\frac{q_1 + \mathrm{tr}[(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}] + \mathrm{tr}(\mathbf{S}\boldsymbol{\Sigma}) - q_1 - 2a}{q_1 + \mathrm{tr}[(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}]}\right) \\
&= 2a^2\left(\frac{\mathrm{tr}(\boldsymbol{\Sigma}) - 2a}{q_1 + \mathrm{tr}[(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}]}\right) \\
&\le 2a^2\left(\frac{\mathrm{tr}(\boldsymbol{\Sigma}) - 2a}{\mathrm{tr}[(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}]}\right).
\end{aligned}
$$
Hence we have the asymptotic upper bound for $\Delta$:
$$
\Delta_U^A = 4a^2 - 4a\,\mathrm{tr}(\boldsymbol{\Sigma})
+ E\!\left[\frac{a^4}{\hat{q}_1^{\,2}}\right]
+ 2a^2\left(\frac{\mathrm{tr}(\boldsymbol{\Sigma}) - 2a}{\mathrm{tr}[(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}]}\right).
$$
Denote the eigenvalues of $\boldsymbol{\Sigma}^{1/2}(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}^{1/2}$ by $c_1, \ldots, c_n$. Since $(\mathbf{I}-\mathbf{S})$ is positive semidefinite, it is clear that $c_1, \ldots, c_n$ are all nonnegative. Note that $c_1, \ldots, c_n$ are also the eigenvalues of $(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}$, since $\boldsymbol{\Sigma}^{1/2}(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}^{1/2} = \boldsymbol{\Sigma}^{1/2}[(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}]\boldsymbol{\Sigma}^{-1/2}$, so that $\boldsymbol{\Sigma}^{1/2}(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}^{1/2}$ and $(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}$ are similar matrices. Note that since $r[\boldsymbol{\Sigma}^{1/2}(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}^{1/2}] = r(\mathbf{I}-\mathbf{S}) = n - k$, $\boldsymbol{\Sigma}^{1/2}(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}^{1/2}$ has $n - k$ nonzero eigenvalues, and so does $(\mathbf{I}-\mathbf{S})\boldsymbol{\Sigma}$.
It is well known (see, e.g., Baldessari, 1967; Tan, 1977) that if $y \sim N(m, \mathbf{V})$ for positive definite, nonsingular $\mathbf{V}$, then for symmetric $\mathbf{A}$, $y'\mathbf{A}y$ is distributed as a linear combination of independent noncentral $\chi^2$ random variables, the coefficients of which are the eigenvalues of $\mathbf{A}\mathbf{V}$.
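The similarity and rank facts above are easy to check numerically; a short R sketch with an arbitrary covariance matrix and projection smoother (both invented for illustration) is given below.

```r
## Sketch (R): (I - S) Sigma and Sigma^{1/2} (I - S) Sigma^{1/2} share the same
## nonnegative eigenvalues, n - k of them nonzero. Sigma and S are illustrative.
n    <- 20
tpts <- 1:n
Sigma <- exp(-abs(outer(tpts, tpts, "-")) / 3)       # an arbitrary covariance matrix
X <- cbind(1, poly(tpts, 4))                         # projection of rank k = 5
S <- X %*% solve(crossprod(X)) %*% t(X)

e     <- eigen(Sigma)
Shalf <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)   # Sigma^{1/2}

ev1 <- Re(eigen((diag(n) - S) %*% Sigma, only.values = TRUE)$values)
ev2 <- eigen(Shalf %*% (diag(n) - S) %*% Shalf, symmetric = TRUE,
             only.values = TRUE)$values

round(sort(ev1, decreasing = TRUE), 6)   # agrees with ev2
round(sort(ev2, decreasing = TRUE), 6)
sum(ev2 > 1e-8)                          # n - k = 15 nonzero eigenvalues
```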


CHAPTER 7
ANALYSIS OF REAL FUNCTIONAL DATA
7.1 Analysis of Expression Ratios of Yeast Genes
As an example, we consider yeast gene data analyzed in Alter et al. (2000).
The objects in this situation are 78 genes, and the measured responses are (log-transformed) expression ratios measured 18 times (at 7-minute intervals), $t_j = 7j$ for $j = 0, \ldots, 17$. As described in Spellman et al. (1998), the yeast was analyzed in an experiment involving alpha-factor-based synchronization, after which the
RNA for each gene was measured over time. (In fact, there were three separate
synchronization methods used, but here we focus on the measurements produced by
the alpha-factor synchronization.)
Biologists believe that these genes fall into five clusters according to the cell
cycle phase corresponding to each gene.
To analyze these data, we treated them as 78 separate functional observations.
Initially, each multivariate observation on a gene was centered by subtracting the
mean response (across the 18 measurements) from the measurements. Then we
clustered the genes into 5 clusters using the K-medoids method, implemented by
the R function pam.
To investigate the effect of smoothing on the cluster analysis, first we simply
clustered the unsmoothed observed data. Then we smoothed each observation and
clustered the smooth curves, essentially treating the fitted values of the smooths
as the data. A (cubic) B-spline smoother, with two interior knots interspersed
evenly within the given timepoints, was chosen. The rank of the smoothing matrix
S was in this case k = 6 (Ramsay and Silverman, 1997, p. 49). The smooths were
81
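A sketch in R of the analysis described in this section is given below. The expression matrix is represented by a placeholder (the Alter et al. data are not reproduced here), and the exact placement of the two interior knots is an assumption; only the general steps (row-centering, cubic B-spline smoothing of rank $k = 6$, and K-medoids clustering into 5 clusters via pam) follow the text.

```r
## Sketch (R): center each gene, smooth with a cubic B-spline (two interior
## knots, rank k = 6), and cluster with K-medoids (pam) into 5 clusters.
library(splines)
library(cluster)

tj   <- 7 * (0:17)                            # measurement times (minutes)
expr <- matrix(rnorm(78 * 18), 78, 18)        # placeholder for the 78 x 18 log
                                              # expression ratios (not the real data)

expr_c <- sweep(expr, 1, rowMeans(expr))      # center each gene (row)

## Projection smoother; knots at the terciles of [0, 119] is an assumed placement
B <- bs(tj, knots = quantile(tj, c(1/3, 2/3)), degree = 3, intercept = TRUE)
S <- B %*% solve(crossprod(B)) %*% t(B)       # symmetric, idempotent, rank 6
smoothed <- expr_c %*% S                      # each row replaced by its smooth (S = S')

clus_obs    <- pam(expr_c,   k = 5)$clustering   # clustering the observed data
clus_smooth <- pam(smoothed, k = 5)$clustering   # clustering the smoothed data
```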


7
Tibshirani et al. (2001a) propose a "Gap" statistic to choose K. In a separate paper, Tibshirani et al. (2001b) suggest treating the problem as in model selection, and choosing K via a "prediction strength" measure. Sugar and James (2003) suggest a nonparametric approach to determining K based on the "distortion," a measure of within-cluster variability. Fraley and Raftery (1998) use the Bayes Information Criterion to select the number of clusters. Milligan and Cooper (1985) give a survey of earlier methods of choosing K.
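As one concrete, data-driven possibility among the methods just cited, the Gap statistic is available in R through cluster::clusGap; the toy data and the pam wrapper below follow the pattern in that function's documentation and are illustrative only.

```r
## Sketch (R): choosing K with the Gap statistic (Tibshirani et al., 2001a).
library(cluster)

set.seed(3)
x <- rbind(matrix(rnorm(50 * 2),           ncol = 2),   # toy data with two
           matrix(rnorm(50 * 2, mean = 4), ncol = 2))   # well-separated groups

## clusGap needs a clustering function returning a component named "cluster"
pam1 <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))
gap  <- clusGap(x, FUNcluster = pam1, K.max = 8, B = 50)

gap$Tab                                        # gap values and simulation SEs
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])   # suggested number of clusters K
```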
Model-based clustering takes a different perspective on the problem. It assumes the data follow a mixture of K underlying probability distributions. The mixture likelihood is then maximized, and the maximum likelihood estimate of the mixture parameter vector determines which objects belong to which subpopulations. Fraley and Raftery (2002) provide an extensive survey of model-based clustering methods.
1.4.1 K-means Clustering
Among the oldest and most well-known partitioning methods is K-means
clustering, due to MacQueen (1967). Note that the centroid of a cluster is the
p-dimensional mean of the objects in that cluster. After the choice of K, the
K-means algorithm initially arbitrarily partitions the objects into K clusters.
(Alternatively, one can choose K centroids as an initial step.) One at a time, each
object is moved to the cluster whose centroid is closest (usually Euclidean distance
is used to determine this). When an object is moved, centroids are immediately
recalculated for the cluster gaining the object and the cluster losing it. The method
repeatedly cycles through the list of objects until no reassignments of objects take
place (Johnson and Wichern, 1998, p. 755).
A characteristic of the K-means method is that the final clustering depends in part on the initial configuration of the objects (or initial specification of the centroids). Hence in practice, one typically reruns the algorithm from various starting points to monitor the stability of the clustering (Johnson and Wichern, 1998, p. 755).
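In R, the standard kmeans function automates this: the nstart argument reruns the algorithm from several random initial configurations and keeps the solution with the smallest within-cluster sum of squares, the criterion that K-means targets. The toy data below are illustrative.

```r
## Sketch (R): rerunning K-means from many random starts.
set.seed(4)
x <- rbind(matrix(rnorm(50 * 2),           ncol = 2),
           matrix(rnorm(50 * 2, mean = 3), ncol = 2))

fit1  <- kmeans(x, centers = 2)                  # a single random start
fit25 <- kmeans(x, centers = 2, nstart = 25)     # best of 25 random starts
c(fit1$tot.withinss, fit25$tot.withinss)         # objective achieved by each run
```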


CHAPTER 2
INTRODUCTION TO FUNCTIONAL DATA AND SMOOTHING
2.1 Functional Data
Frequently, the measurements on each observation are connected by being part
of a single underlying continuous process (often, but not always, a time process).
One example of such data is the growth records of Swiss boys (Falkner, 1960), discussed by Ramsay and Silverman (1997, p. 2), in which the measurements are the heights of the boys at 29 different ages. Ramsay and Silverman (1997) generally label such data as functional data, since the underlying data are thought to be intrinsically smooth, continuous curves having domain T, which without loss of generality we take to be [0, T]. The observed data vector y is merely a discretized representation of the functional observation y(t).
Functional data are related to longitudinal data, data measured across time which appear often in biostatistical applications. Typically, however, in functional data analysis, the primary goal is to discover something about the smooth curves which underlie the functional observations, and to analyze the entire set of functional data (consisting of many curves). The term "functional data analysis" is attributed to Ramsay and Dalzell (1991), although methods of analysis existed before the term was coined.
2.2 Introduction to Smoothing
When scientists observe data containing random noise, they typically desire
to remove the random variation to better understand the underlying process of
interest behind the data. A common method used to capture the underlying signal
process is smoothing.
13


103
Hand, D. J., Blunt, G., Kelly, M. G. and Adams, N. M. (2000). Data mining for fun and profit, Statistical Science 15(2): 111-126.

Hastie, T., Buja, A. and Tibshirani, R. (1995). Penalized discriminant analysis, The Annals of Statistics 23: 73-102.

James, G. M. and Hastie, T. J. (2001). Functional linear discriminant analysis for irregularly sampled curves, Journal of the Royal Statistical Society, Series B, Methodological 63(3): 533-550.

James, G. M. and Sugar, C. A. (2003). Clustering for sparsely sampled functional data, Journal of the American Statistical Association 98: 397-408.

James, W. and Stein, C. (1961). Estimation with quadratic loss, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 361-379.

Johnson, R. A. and Wichern, D. W. (1998). Applied Multivariate Statistical Analysis, Upper Saddle River, New Jersey: Prentice-Hall Inc.

Kaufman, L. and Rousseeuw, P. J. (1987). Clustering by means of medoids, Statistical Data Analysis Based on the L1-norm and Related Methods, pp. 405-416.

Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley and Sons.

Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P. (1983). Optimization by simulated annealing, Science 220: 671-680.

Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation, New York: Springer-Verlag Inc.

Lieberman, O. (1994). A Laplace approximation to the moments of a ratio of quadratic forms, Biometrika 81: 681-690.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 281-297.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equations of state calculations by fast computing machines, Journal of Chemical Physics 21: 1087-1092.

Milligan, G. W. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set, Psychometrika 50: 159-179.


APPENDIX B
ADDITIONAL FORMULAS AND CONDITIONS
B.1 Formulas for Roots of $\Delta_U(a) = 0$ for General $n$ and $k$
The four roots (two real, two imaginary) of $\Delta_U(a) = 0$ are 0, $p_2^{1/3} + (2/9)p_3 - (4/3)p_1/(n+k)$, and the pair of imaginary roots $-(1/2)p_2^{1/3} - (1/9)p_3 - (4/3)p_1/(n+k) \pm (1/2)i\sqrt{3}\,\bigl(p_2^{1/3} - (2/9)p_3\bigr)$.
Here,
$p_1 = n^2 - 2nk + k^2 - 6n + 6k + 8$,
p2 = (2/27)pj(-3072fc + 3072n 1232n2 + 2464n/t 1232:2 + 60n3-
132A:3 5n4 252n2k + 324nfc2 + lln3it 3n2k2 Ink3 + 4:4 2048)-=-
(-n + k)3 + (2/9)(-442368/c + 344064n 1032192n2 147456n: + 1179648fc2+
1250304n3 + 1274880/c3 781056n4 1238016n2fc 1287168n/t2 + 1658880n3A;-
238080n2fc2 1376256nA:3 + 736512A:4 + 248064A:5 + 265536n5 870336fcn4+
769728n3k2 + 257472n2*:3 670464nA:4 + 211680n5fc 327216n4*;2 + 133440n3A;3+
152496n2/c4 172320nA;5 21636neA: + 48288n5Jfc2 44712n4fc3 + 588n3fc4+
31308fc5n2 22776k6n 4- 576n7fc 1776n6fc2 + 2520n5k3 1080n4it4 47088n6
+49008A;6 + 3660n7 + 5280fc7 72n8 + 240A;8 1488n3*;5 + 2304n2fc6-
1224nk7 15n8/c + 27n7k2 15k4n5 + 27fc5n4 15n6A:3 15n3k6+
3n2kT + 3n9) -=- (-n + A:))1/2 -j- (-n -I- k),
p3 = (768k 768n + 344n2 688nk + 344/fc2 42n3 + 54Jfc3 n4 + I38n2k
-150nk2 + n3k + 3n2k2 5nk3 + 2k4 + 512) ((-n + fc)2p2/3).
99





PAGE 35

25 So an appropriate shrinkage estimator of dij = ^9' 6 is n T n q(JSYq(JS) {SO)' se+li \0-se\\^ {9 se)' \e-s9\ {9 S9) Now, the risk difference (difference in MSEs) between the James-Stein smoothed dissimilarity d-^'^' and the observed dissimilarity dij is: (3.5) (3.6) rp2 A = E L0(JS)'Q{JS) n 9 9 n E -9'9 -9' 9 n n (3.7) Since we are interested in when the risk difference is negative, we can ignore the positive constant multiplier T^/n^. From (3.7), A = E[{9(-^^y9^-^s))^-29'99(-^^y9(-^^)-{9'9)'' + 29'99'9] = ^[(^(•'S)'^(JS))2 (^g'g^2 20' 0 {0iJ S)' QiJ S) ^'^)J Write = 0(f){0)0, where \9-S9\\^ (I-S). Then Q(JS)'Q(JS) Q'0 = -20 29' (f>{9)' 9 + 9' (j){9)' (l){9)9 a{l S)9 \9-S9\\^ + 0'{l-S)'a \0 S0\\^ \\0 S0\\ -(1-8)0

PAGE 36

26 = -2a ^'(I-S)^ e' (I sy {1 s)e ,\e-se\\^ d'{i-s)'{i-s)e 9'{l-S)0 ^ e {i-s)e 2 = -2a V — + a^— — ^. e {i-s)0 \\e S9\\w {I s)e = -2a + —. \\e-se\\^ Hence the (scaled) risk difference is A = E = E (0(JS)'^(JS))2 ^g'g^2 20' 0 -2a + \0-se\ 9 9 -2a + \9-m\^ {9'9f 29' 9 -2a + \\9-S9\\\ Note that and 9'9^9'{l-S)9^9'S9 9'9 = 9'{l-S)9 + 9'S9. Define the following quadratic forms: 92 92 9'{I-S)9 9'S9 9'{I-S)9 9'S9. Note that qi ~ Xn-fe(9i)' 92 ~ Xfc^(92) and they are independent (S idempotent implies (I S)S = 0 and 9 is normally distributed). Now write A as: A = E = E (^[9i + 92] 2a + I" j -[qi + 92]^ 2{qi + 92) ( -2a + — 9i a2\ / q2\^ 2( -2a + — [91 + 92] + -2a + — + 4a(9i + 52) —(91 + 92) 91/ V 91/ 9i

PAGE 37

27 E /a2\ /a^\^ 4a^ -4a[qi + 92] + 2[gi + 92] — + 40^ + — — + 4a(9i + 92) 2a?. —{Qi +92) 9i + —(91 +92 91 92) Qi 2\ 2 9i 4a^ 9i + 4a(gi + 92) Since E[qi + 92] = n + gi + 92, A = -4an + + E a LV9i + —(91 + 92 9i 92) 9i 91 Note that since gi and 92 are independent, we can take the expectation of 92 in the last term. Since £'[92] — k + q2, A = 4a^ 4an + E 2a?k ^ ^{ 2a + qi ^ + + 2aM 1 — ^ .9i 9i \ 9i By Jensen's Inequality, £^(1/91) > l/£^(9i), and since E{qi) = n k + qi, we can bound the last term above: E 2aM 1 2a + qi 9i <2a'{l2a^ I : 1 < 2a^ Hence n — k + qi A < 4a^ 4an + E 2a + qi \ n k + qij n — k — 2a n — k = 2a 2 / n — k + qi — 2a — qi 2a^[ 1 9i -E 2a^k 9i + 2aM 1 n — k + qi 2a n — k 2a n — k (3.8) Note that the numerators of the terms in the expected values in (3.8) are positive. The random variable is a noncentral x'n-k with noncentrality parameter 9i, and this distribution is stochastically increasing in qi. This implies that for m > 0, Eliqi)-""] is decreasing in qi. So ^[(91)-'"] < E[{xl_,)-"^]

PAGE 38

28 where Xn-k ^ central x' with n — k degrees of freedom. So by replacing qi with Xn-k (3-8), we obtain an upper bound Au for A: Au = 4a^ -4an + E a' ixl-k)' + E 2a? k xii-k + 2a^ 1 2a n — k (3.9) Note that E[{xl_,r'] = l/(n k 2) and E[{xl_,)-'] = l/(n -k2){n A; 4). So taking expectations, we have: Au = 4a^ 4an -h 1 + 2a^k {nk -2){nk -4) n-k + 2aM 1 2a n — k 2k r 6 a — 4na. {nk -2){n k 4) n-k" {n-k-'^ We are looking for values of a which make Au negative. The fourth-degree equation Au{a) = 0 can be solved analytically. (Two of the roots are imaginary, as noted in Appendix B.l.) One real root is clearly 0; call the second real root r. Theorem 3.2 If n-k > 4, then the nontrivial real root r of Au{a) = 0 is positive. Furthermore, for any choice a e (0, r), the upper bound Au < 0, implying A < 0. That is, forO < a < r, the risk difference is negative and the smoothed-data dissimilarity estimator d^^^^"* is better than dij. Proof of Theorem 3.2: Write Au{a) = 0 as c^a^^ + c^a^ + c-^a^ + ciO = 0, where C4, C3, C2, ci are the respective coefficients of Ac/ (a). It is clear that if n A; > 4, then C4 > 0, C3 < 0. c^ > 0. ci < 0. Note that r also solves the cubic equation /^(a) = c^a^ + c-^a^ + c^a + ci = 0. Since its leading coefficient C4 > 0, lira fc{a) = -00, lim fda) = 00. a--oo a-yoo Since f^a) has only one real root, it crosses the a-axis only once. Since its vertical intercept Ci < 0, its horizontal intercept r > 0.

PAGE 39

29 Table 3-1: Table of choices of a for various n and k. Tl k ft • • • ^ lIllIillXllZt^F CX rooi r 20 5 9.3 19.0 50 5 31.9 72.2 100 5 71.3 169.1 200 5 153.2 367.5 And since Au{a) has leading coefficient C4 > 0, lim Au(a) = 00. a-^oo Since it tends to 00 at its endpoints, Ay (a) must be negative between its two real roots 0 and r. Therefore A<0for0
PAGE 40

Figure 3-1: Plot of Au against a for varying n and for A; = 5. Solid line: n = 20. Dashed line: n = 50. Dotted line: n = 100.

PAGE 41

31 a Figure 3-2: Plot of simulated A and Ay against a for n = 20, k = 5. Solid line: Upper bound At/. Dashed line: simulated A, = 0. Dotted line: simulated A, qi = 135. Long-dashed line: simulated A, = 540. Dot-dashed line: simulated A, qi = 3750000.

PAGE 42

32 Suppose there exists a random variable 5^, independent of 0, such that S' 2 Then let = S^/v. Consider the following definition of the James-Stein estimator which accounts for a^: 0{JS) = + 1 aa |0-S0|I2 ,2 (I S)^ (I S)^. Note that the James-Stein estimator from Section 3.2 (when we assumed a known covariance matrix I) is simply (3.10) with a"^ = \. Now, replacing (t^ with the estimate a"^, define (3.10) ^P)==S^+fl-^|(I-S)d, letting qy = 9 {l-S)e as in Section 3.2. Write q{JS) a 9 0-0, where 2 0=— (I-S)0. Define the James-Stein smoothed-data dissimilarity estimator based on 9^^^^ to be: n Then analogously to (3.7), -E 2n 9' 9 9'e

PAGE 43

33 Now, e^^^^'o^^^^ =&0-2(i^e + 'e + {e'e e'e) {^e'e e'e) (2'(t>){e'e e'e) + {2'e (j^(j)) ,iX ,f = E = E = E Now, ^J'(i-s)e .0 and e'{i-s)'{i-s)e So A.=^E 2\2aa^ -tJ^\{e'e-e'e) + \2ad'' 2 4 \ 2 Qi Since a and e are independent, E[-4aa\e e e 6)] = E[-Aaa^]E[e e e 0] = E[-4aa^yn and E E E 2-4 {e'e -e'e) {&{! s)e + e'se -e'{is)e e'se) = E[2a^a^]E {qi + 92 91 92) 202^" + E .91. (91 9i 92) (3.11) by the independence of a and 0 (and thus 6and 92)-

PAGE 44

34 Since qi and q2 are independent, we may write (3.11) as: E[2a^a']E[q2]E[l/qi] + E L 9i (gi 91 q2) = E[2a'^a'']{q2 + a^k)E[l/qi] + E 2^4 2a^a {qi 91 92) (3.12) Now, since a^a^/qi is decreasing in qi and 91 9i — 92 is increasing in 91, these quantities have covariance < 0 by the Covariance Inequality (Casella and Berger, 1990, p. 184). Hence "2a2(7^" (3.12) < E[2a''d\q2 + a''k)E[\lqy] + E E[qi 91 92] L 9i J = E[2a'a%q2 + a''k)E[llq{\ + E[2a^ a^]E[\ / q^]{a'' {n k) q^) = £[20^ (T^]E[\/qiyn. So < E\-Aaa'^]a'^n + E[2a?a^]E = E[-Aad'^]a'^n + E[2a^a^]E Note that 1 a'^n + E (2a'a' .91. 1 a^n + E .91. 2 4 \ 2n 9i 9i + a a 1 J E[aa^] = aa^ ^3^6/^ K^ + 2)(t^ + 4) ^4^8 / W^ + 2)(t/ + 4)(:/ + 6) ) Since o-^ and gi are independent, taking expectations: A A < -Aaa^n + 2d 2) 1 .91. + 2) _4^3^6/'K'^ + 2)(i' + 4)~ > 1 .91. i/(i/ + 2)(i/ + 4)(i/ + 6) 1

PAGE 45

35 Define q', = q^/a^ = {9 /a)' {I S){e/a}. Since 6 ~ N{e,an), {0/a) ~ N{CI) where ^ = 0/a. Then the risk difference is a function of C, and the distribution of ql is a function of C and does not involve a alone. Then < -iaa'^n + 2a^a'^n u{u + 2) 1/^ E 1 1 4 4/i^(i/ + 2)(i/ + 4)(i/ + 6) + a (T E We may now divide the inequality through by > 0. % < -ian + 2a^n('^^]E i/(i/ + 2)(i/ + 4) E 1 + 4a' + 0" i/(i/ + 2)' t/(f^ + 2)(t/ + 4)(t/ + 6) ) Now, since (I S)I is idempotent, ql is noncentral x'n-kiC' ~ S)C)Recall that the noncentral distribution is stochastically increasing in its noncentrality parameter, so we may replace this parameter by zero and obtain the upper bound E[iqir"'] 4, then ^[(xL J"'] = ^ 2) and ^[(xt J"'] = A; 2)(n k — \). So taking expectations, we have:

PAGE 46

36 -^ A, then the nontrivial real root r of Au{a) = 0 is positive. Furthermore, for any choice a £ (0, r), the upper bound Au < 0, implying A < 0. That is, for 0 < a < r, the risk difference is negative and the smoothed-data dissimilarity estimator dtl^^ is better than du. Proof of Theorem 3.3: Note that At; (a) = 0 may be written as c^a^ + c^a^ + C2a^ + Cia = 0, with C4 > 0, C3 < 0, C2 > 0. Ci < 0. The proof then follows exactly as the proof of Theorem 3.2 in Section 3.2. 3.4 A Bayes Result: J**'"'"'*''' and a Limit of Bayes Estimators In this section, we again assume S, the smoothing matrix, to be symmetric and idempotent. We again assume 0 ~ N{9, ct^I) and again, without loss of generality, let (7^1 = I. (Otherwise, we can let, for example, r) = a~^0 and rj = a~^0 and work with fj and r) instead.) Also, assume that 0 lies in the linear subspace (of dimension k < n) that S projects onto, and hence SB = 0. We will split f{0\0)

PAGE 47

37 into a part involving (I S)^ and a part involving S^. f{e\e)oc exp[-{i/2){e-e)'{e-9)] exp[-{l/2){0-d)'{I-S + S){e-d)] exp{-(l/2)[0'(I S)^ +{9O)'S{0 0)]} which has a part involving (I S)0 and a part involving S0. Since S is symmetric, then for some orthogonal matrix P, PSP' is diagonal. Since S is idempotent, its eigenvalues are 0 and 1. so PSP' = B where Ik 0 B = 0 0 Let p'l, p„ be the rows of P. Then 9* = PO = {p'i9, p\^9, 0, 0)'. And let 9 be the vector containing the k nonzero elements of 9*. Similarly, 9* = P^. Now we consider the likelihood for 9: L{9\9) oc exp{-{l/2)[9'{l-S)9+i9-9)'S{9-9)]} (X exp{-{l/2)[{9-9)'S{9-9)]} = exp{-(l/2)[(^ 0)'P'PSP'P(0 9)]} = exp{-(l/2)[(0' r)'PSP'(r 9*)]} = exp{-{l/2)[{9* -9*)'B{9* -9*)]}. Now we put a N{0, V) prior on 9*: Tr{9*) oc exp[-(l/2)0*'V-0*],

PAGE 48

38 0 0 where Vn is the full-rank prior variance of 6. Note that this prior is only a proper density when considered as a density over the first k elements of 0*. Recall that the last n — k elements of 0* are zero. Then the posterior for 0* is Tr{0*\0*) oc exp{-{l/2)[{0* -0*)'B{0* 0*) + 0*'y'0*]} = exp{-(l/2)[^*'B^* 20*'B0* + 0*'{B + y')0*]} oc exp{-(l/2)[r'B'(B + V-)-B^* 20*'B0* + 0*' {B + y~)0*]} = exp{-(l/2)[(r (B + y-)-B0*)'{B + y-){0* (B + V-)-B^*)]}. Hence the posterior tt{0\0*) is N[{B + y-)-B0*, (B + V")-], which is N[{B + V-)-BP^. (B + V-)-]. Thus the posterior expectation of -0 0 is e(^~0'0^ = E^^0''P'-P0^ = e(^^0*'0* = -^*'B(B + V-)-(B + y-)-B0* + tr ( -I(B + V-)" n \n = -0'p'b{b + y-)-(B + y-)-BP0 + tr f -i(B + V-)n \n = -0'SP'{B + V-)-(B + V-)-PS^ + tr( -(B + y-)-] n \n J Let us choose the prior variance so that V~ = ^B. Then E^e'e) = W(!iiB)7!liBypse + ,r(^f^B \n J n \ n J \ n J \n \ n b]pS0 + trC^ I ^B n \n + 1 = -0'SP n \{n^\Y )

PAGE 49

39 and since P BP = S, \n J n(n + l)2 V"\" + l // This posterior mean is the Bayes estimator of dij. Since tr{B) = k, note that ir( J(;^B)) = Jir(;^B) = ^^Oasn^oo. 2 Also, ^^"^p — > 1 as n — > oo. Then if we let the number of measurement points n — >^ oo in a fixed-domain sense, these Bayes estimators of dij approach J^*""'"*''' = in the sense that the difference between the Bayes estimators and ^1^"*"*'') tends to zero. Recall that dij — ^0 0, the approximate distance between curve i and curve j, which in the limit approaches the exact distance Sij between curve i and curve j.

PAGE 50

CHAPTER 4 CASE II: DATA FOLLOWING THE FUNCTIONAL NOISE MODEL Now we will consider functional data following model (2.2). Again, we assume the response is measured at n discrete points in [0, T], but here we assume a dependence among errors measured at different points. 4.1 Comparing MSEs of Dissimilarity Estimators when 0 is in the Linecir Subspace Defined by S As with Case I, we assume our linear smoothing matrix S is symmetric and idempotent, with r(S) = tr{S) = k. Recall that according to the functional noise model for the data, 0 ~ N{6, E) where the covariance matrix S corresponds, e.g., to a stationary Ornstein-Uhlenbeck process. Note that ^0 0 represents the approximate L2 distance between observed curves yi{t) and yj{t), and that, with an Ornstein-Uhlenbeck-type error structure, it approaches the exact distance as n — > oc in a fixed-domain sense. Also, ^^'S S0 = ^0'S0 represents the approximate L2 distance between smoothed curves fli{t) and fij{t), and this approaches the exact distance as n ^ 00. We wish to see when the smoothed-data dissimilarity better estimates the true dissimilarity Sij between curves Hi{t) and Hj(t) than the observed-data dissimilarity. To this end let us examine MSE{^0'S0) and MSE{^0'0) in estimating ^O'O (which approaches Sij as n 00). In this section, we consider the case in which 0 lies in the linear subspace defined by S, i.e., S0 = 0. We now present a theorem which generalizes the result of Theorem 3.1 to the case of a general covariance matrix. Theorem 4.1 Suppose the observed 0 ~ N{0,J1). Let S be a symmetric and idempotent linear smoothing matrix of rank k. If 0 lies in the linear subspace 40

PAGE 51

41 defined by S, then the dissimilarity estimator ^0'S6 has smaller mean squared error than does ^O'O in estimating ^O'O. That is, the dissimilarity estimator based on the smooth is better than the one based on the observed data in estimating the dissimilarities between the underlying signal curves. Proof of Theorem 4-i-' MSEi-e'e T -0 0 n n var\ -0'0 ]+^E n T -00 n -0'0 n 72 2tr{J:^) + A0'J:0 + [tr{'E)f^ + I -tr{T,) + -0'0 -0'0 n n n (4.1) mse{^0's0^ = £;| T ~ ^ T -0'S0 -0'0 n n -0'S0 n 2fr[(SE)2] + 40'SES0 + n } -tr{S'E) + -0's'S0--0'0\ n n "J = ^|2fr[(SS)2]+40'S0 + [ir(Si;)pj. Hence, if we compare (4.2) with (4.1), we must show that ir(SE) < trCE) and ir[(Si:)2] < fr[(E)2] to complete the proof. (4.2) fr(E) = fr(SE + (I-S)E) = fr(SE)4-fr((I-S)E) = ir(SE) + ir(E'/2(I-S)(I-S)E'/2) > r(SE) ; -V

PAGE 52

42 since the last term is the sum of the squared elements of (I — S)S^''^. fr(E2) =
PAGE 53

43 Note that as the number of measurement points n grows (within a fixed domain), assuming a certain dependence across measurement points (e.g., a correlation structure like that of a stationary Ornstein-Uhlenbeck process), the observed data yi, yjv closely resemble the pure observed functions yi{t),. j/jv(<) on [0, T]. Therefore this result will be most appropriate for situations with a large number of measurements taken on a functional process, so that the observed data vector is "nearly" a pure function. As with Case I, we assume our linear smoothing matrix S is symmetric and idempotent, with r(S) = tr{S) = k. Recall that according to the functional noise model for the data, 6 ~ N{0, E) where S is, for example, a known covariance matrix corresponding to a stationary Ornstein-Uhlenbeck process. In this section, we will assume a Gaussian error process whose covariance structure allows for the possibility of dependence among the errors at diflferent measurement points. We consider the same James-Stein estimator of 0 as in Section 3.2, namely §iJS) ^jj^ practice, again, we use the positive-part estimator 0^^^), and the same James-Stein dissimilarity estimator. Recall the definitions of the following quadratic forms: gi = 0'{l-S)e q2 = e'se 91 = e'{i-s)e 92 = o'se. Unlike in Section 3.2 when we assumed an independent error structure, under the functional noise model, qi and 92 do not have noncentral distributions, nor are they independent in general.

PAGE 54

44 In this same way as in Section 3.2, ,we may write A as: ,2\2 E -Aa[qi + 92] + + — — + 4a(9i + 92) V 9i / 9i 2a2 91 (91 + 92 9i 92) Note that £^[91+92] = gi+92 + ^r[(I-S)E] + fr[SE] = 9i+92 + ^r(E). Hence A = -4air(E) + Aa^ + E Consider — — + —{91 + 92 9i 92) LV9i/ 9i 9i 92 9i E 9se .e'{i-s)e. Lieberman (1994) gives a Laplace approximation for £'[(x'Fx/x'Gx)*], k > I, where F is symmetric and G positive definite: ''^ £;[(x'Fx)*] E xFx xGx [£;(x'Gx)]'=' In our case, (I — S) is merely positive semidefinite, but with a simple regularity condition, the result will hold (see Appendix B.2). The Laplace approximation will approach the true value of the expected ratio, i.e., the difference between the true expected ratio and the approximated expected ratio tends to 0 as n — ) 00; Lieberman (1994) gives sufficient conditions involving the cumulants of the quadratic forms which establish rates of convergence of the approximation. Hence 02 9i = E 0S0 0(1-8)0 E{0'S0) 92 + fr(SE) £;(^'(I-S)^) 91 + ^r[(I S)S]

PAGE 55

45 Hence we have a large n approximation, or asymptotic expression, for A A -4atr{S) + ia^ + E = -AatrCE) + ia^ + E 4a^ 2a^(g2 + ^r(SS)) 2a\ 20^92 gi Qi + tr[{l S)X;] qi qi 2 A l/E{qi), we can bound the last term above: q2 + tr{S'£) 20 + 91 + 92" E 2a^ 1 + 91 + tr[{I S)E] <2a?(l+ 92 + fr-(SE) 91 + 92 + 2a 91 + fr[{I S)E] 9i+fr[(I-S)E] S)X;] +r(SS) -91 -2a' / qi+tr[{I\ 91 + ir[(I-S)E] fr(E) 2a 9i + r[(I-S)E] ir(X;) 2a = 2a2 ^"\m(i-s)s] Hence we have the asymptotic upper bound for A: 4air(E) + E ( tr(^)-2a \ Denote the eigenvalues of S)E^/^ by ci, ,Cn. Since (I-S) is positive semidefinite, it is clear that Ci, c„ are all nonnegative. Note that Ci, c„ are also the eigenvalues of (I S)S, since 2^/^(1 5)2^/^ = 2^/^(1 S)EE-^/2 and (I-S)i; are similar matrices. Note that since r[i:^/^(IS)i;^/^] = r(I-S) = n-k, 2^/2(1 — S)E^''^ has n — k nonzero eigenvalues, and so does (I — S)S. It is well known (see, e.g., Baldessari, 1967; Tan, 1977) that if y ~ Ar(/i, V) for positive definite, nonsingular V, then for symmetric A, y Ay is distributed as a linear combination of independent noncentral random variables, the coefficients of which are the eigenvalues of AV.

PAGE 56

46 Specifically, since qi e'{l~ S)0, and (I S)E has eigenvalues Ci, c„, then i=l for some noncentrality parameter > 0 which is zero when 0 = 0. Under 6 = 0, then, qi ~ Yl^=i ^Xv ^ linear combination of independent central Xi variates. Since a noncentral xl random variable stochastically dominates a central xh for any > 0, P[Xi{Sf)>x]>P[xl>x] Vx>0. As shown in Appendix A. 2, this implies P[ciXi{St) + • + CnXiiK) >x]> P[c,xi + --+ Cr,xl>x] ^ x>0. Hence for all x > 0, Pe^o[9i x] > Po[qi > x], i.e., the distribution of gi with 0^0 stochastically dominates the distribution of qi with 0 = 0. Then EeMiQirn < Eoiiqi)-"'] for m = 1, 2, Letting M2 = £'o[(gi)~^], we have the following asymptotic upper bound for A: Then M2 positive since fr[(I-S)E] Vr[(I-S)S] E[{Yl^=i We see that if n A; > 4, then M2 exists and is E Xn-k 2-1 < M2 < -^E Li-k)

PAGE 57

47 where Wjoin is the smallest and Wmax the largest of the n nonzero eigenvalues of (I-S)S. As was the case in the discrete noise situation, Ay (a) = 0 is a fourth-degree equation with two real roots, one of which is trivially zero. Call the nontrivial real root r. Theorem 4.2 Ifn-k>4, then the nontrivial real root r o/ Ay(a) = 0 is positive. Furthermore, for any choice a 6 (0, r), the asymptotic upper bound Ay < 0, implying A < 0. That is, forO < a < r, and for sufficiently large n, the risk difference is negative and the smootheddata dissimilarity estimator d\j is better than dij. Proof of Theorem 4.2: Since S is positive definite, tr{^) is positive. Since M2 > 0, we may write Ay (a) = 0 as 040^ + c^a^ + C2a^ + CiO = 0, where C4 > 0, C3 < 0, C2 > 0, ci < 0. The proof then follows exactly from the proof of Theorem 3.2 in Section 3.2. As with the discrete noise situation, one can easily verify, using a symbolic algebra program, that two of the roots are imaginary, and can determine the nontrivial real root in terms of M2, tr[{I — S)E], and tr{'S). As an example, let us consider a situation in which we observe functional data yi, • iYn me£isured at 50 equally spaced points ^l, t^o, one unit apart. Here, let us assume the observations are discretized versions of functions yi{t),..., yN{t) which contain (possibly different) signal functions, plus a noise function arising from an Ornstein-Uhlenbeck (0-U) process. We can then calculate the covariance matrix of each and the covariance matrix E of 6. Under the 0-U model,
PAGE 58

48 we can easily calculate the eigenvalues of (I S)E, which are ci, c„, and via numerical or Monte Carlo integration, we find that M2 = 0.0012 in this case. Substituting these values into Ay(a), we see, in the top plot of Figure 4-1, the asymptotic upper bound plotted as a function of a. Also plotted is a simulated true A for a variety of values (n-vectors) oi 0: Ox 1, (0, 1, 0, 1, 0, ... 0, 1)', and (-1, 0, 1, -1, 0, 1, -1, 0)', where 1 is a n-vector of ones. It should be noted that when 0 is the zero vector, it lies in the subspace defined by S, since S0 = 0 in that case. The other two values of 0 shown in this plot do not lie in the subspace. In any case, however, choosing the a that optimizes the upper bound guarantees that d]j has smaller risk than Also shown, in the bottom plot of Figure 4-1, is the asymptotic upper bound for data following the same Ornstein-Uhlenbeck model, except with n = 30. Although is a large-n upper bound, it appears to work well for the grid size of 30. 4.3 Extension to the Case when cr^ is Unknown In Section 3.2, it was proved that the smoothed-data dissimilarity estimator, with the shrinkage adjustment, dominated the observed-data dissimilarity estimator in estimating the true dissimilarities, when the covariance matrix of 0 was a% known. In Section 3.3, we established the domination of the smoothed-data estimator for covariance matrix aH, unknown. In Section 4.2, we developed an analogous asymptotic result for general known covariance matrix S. In this section, we extend the asymptotic result to the case of 0 having covariance matrix of the form V = a^E, where ct^ is unknown and E is a known symmetric, positive definite matrix. This encompasses the functional noise model (2.2) in which the errors follow an Ornstein-Uhlenbeck process with unknown cr^ and known (Of course, this also includes the discrete noise model

PAGE 59

49 Figure 4-1: Plot of asymptotic upper bound, and simulated A's, for OmsteinUhlenbeck-type data. Solid line: plot of against a for above 0-U process. Dashed line: simulated A 0-0x1. Dotted line: simulated A, 6 = (0, 1, 0, 1, 0, 0, 1)'. Dot-dashed line: simulatedA,0 = (-1,0, 1,-1, 0,1,..., -1,0)'. Top: n = 50. Bottom: n = 30.

PAGE 60

50 (2.1), in which V = a^l, but this case was dealt with in Section 3.3, in which an exact domination result was shown.) Assume 9 ~ N{e. V), with V = (t^E. Suppose, as in Section 3.3, that there exists a random variable 5^, independent of 0. such that 2 ~ V ^2 ^'Then let = S'^/v. As shown in Section 3.3, we may write ,2^-4 a a {e'0 e'o) + lao" 2 4 \ 2 Since a and 0 are independent, E[-4aa^{e'e d' 9)] = E[-Aad'^]E[9' 9 9' 9] = E[-Aaa'^]aHT{ll) and "202^" E = E = E = E[2a^a'^]E {9'9 9' 9) {9'{l S)9 + 9'S9 9'{1 S)9 9'S9) (91+92 -Qi92) + E .91. 2aV 9i (qi -qi92) (4.3) by the independence of a and 9 (and thus a and 92)We use the (large n) approximation to £'[92/91] obtained in Section 4.2 to obtain an asymptotic upper bound for (4.3): E[2a^a^](-^1^^P^)^E 2a^a^ 9i (91 9i 92) (4.4)

PAGE 61

51 Now, since a?a*/qi is decreasing in 4i and qi qi 92 is increasing in qi, these quantities have covariance < 0. Hence E[qi -qi92] E[2a^a*]q2 9i ^ E[2a'ayHr{Si:2 I E (aMl-S)S]-g2) 2a^a*' qi + aHr[{l S)S] gi + c72fr[(I S)i:] 92 9i a2ir[(I S)i:] -E = E[2a?a% '2a^a^ 1 E 1 + E E[qi] aHr[{l S)E] + E[2a''a^] 9i +a2ir[(IS)S] Since by Jensen's Inequality, l/E[qi] £"[1/91] < 0, (4.4) < E[2a^a^' qi + aHr[{I S)^] + E 2a^a' c7Hr[{l S)E] < E[2a^a*' tr{Si:) + E Ur[(I-S)S] Our asymptotic upper bound for is I 2a^a' ahr[{l S)E]. A;^ = E[-4aa'ytr{j:) + E[2a'a\ ^""^^^^ 2^4 \ 2 + E 2a^a aHr[{l S)E] + E = E + 4aV^ 2aa^ a a iaa'ahriJ:) + ^aHr[iI S)E] + 2a'a' /"^^^^ 9i ir[(I-S)E] 9i + 9?

PAGE 62

Recall that 52 Since and qi are independent, taking expectations: llu = -4a(T''^r(i;) + 2aV 2 4///(i^ + 2) 1 L9iJ a2ir[(I S)i: L9i 9f As in Section 3.3, define 9* = ^j/a^ = {e/a)'{l S){e/a). Here, since e ~ iV(0,cT2E), (^/a) ~ iV(C,S) where C = B/a. Then the risk difference is a function of C, and the distribution of q{ is a function of CThen 1 ir[(IS)E 1 We may now divide the inequality through by ct^ > 0. 1 r[(I-S)i; 4. 2a^ f fllf^) trjS^) ^ 4,2 M:^ + 2) 1 4 Wt^ + 2)(t/ + 4)(f/ + 6) 1

PAGE 63

+ + 53 As shown in Section 4.2. for all x > 0, Pfl^o[9i > 2:] > Po[q\ > x]. Note that this is equivalent to: For all x > 0, Pc^ol^i > ^] > Po[Qi > x}. Then for m = 1, 2. Let A/i* = foflgi*)"^] and = Eo[{ql)-^]. Then Note that if this last expression (which does not involve a and which we denote simply as Ay {a)) is less than 0, then A^^ < 0. which is what we wish to prove. We may repeat the argument from Section 4.2 in which we showed that when 0 = 0, Qi ^ ^"^j axl, replacing qi with ql, 9 with 0/a. and 9 with CThus we conclude that when C = 0Qi ~ Z]"^! where Vi,....Vn are the eigenvalues of (I S)E. Again, if n > 4, then and exist and are positive, as was shown in Section 4.2. Again. Ay (a) = 0 is a fourth-degree equation with two real roots, one of which is trivially zero. Call the nontrivial real root r. Theorem 4.3 Suppose the same conditions as Theorem 3.3, except generalize the covariance matrix of 9 to be a^H, unknown and E known, symmetric, and positive definite. Ifn-k > 4, then the nontrivial real root r of A'^y{a) = 0 is positive. Furthermore, for any choice a G (0,r), the asymptotic upper bound

PAGE 64

54 Al^ < 0, implying A < 0. That is, /or, 0 < a < r. and for sufficiently large n, the risk difference is negative and the smootheddata dissimilarity estimator d-"^-^ is better than dij. Proof of Theorem 4-3: Since E is positive definite. ir(S) is positive. Since Ml > 0 and Mo > 0. we may write Ac; (a) = 0 as C4a* + c^a^ + C2a^ + cio = 0. where C4 > 0. C3 < 0. C2 > 0. ci < 0. The proof is exactly the same as the proof of Theorem 4.2 in Section 4.2. 4.4 A Pure Functional Analytic Approach to Smoothing Suppose we have two arbitrary observed random curves and yj{t), with underlying signal curves /Xj(i) and fij{t). This is a more abstract situation in which the responses themselves are not assumed to be discretized. but rather come to the analyst as "pure" functions. This can be viewed as a limiting case of the functional noise model (2.2) when the number of measurement points n 00 in a fixed-domain sense. Throughout this section, we assume the noise components of the observed curves are normally distributed for each fixed t. as, for example, in an Ornstein-Uhlenbeck process. Let C[O.T] denote the space of continuous functions of t on [0,T]. Assume we have a linear smoothing operator S which acts on the observed curves to produce a smooth: fii{t) = Syi{t), for example. The linear operator S maps a function / € C[0,r] to C[Q.T\ as follows: (where K is some known function) with the property that S[af + /3g] = aS[f] + /3S[g] for constants a, (3 and functions f,ge C[0,T] (DeVito. 1990. p. 66). Recall that with squared L2 distance as our dissimilarity metric, we denote the dissimilarities between the true, observed, and smoothed curves i and j, respectively, as follows:

PAGE 65

55 ^ij = [ M) Mt)? dt = \\fi,it) tij{t)\f (4.5) 4= / iViit) yj{t)?dt = \\y,{t) yj{t)\\^ (4.6) •^Jr"'" = fl^^iit) Nit)? dt = \\Sy,{t) SyM?. (4.7) ^0 Recall that for a set of points ii, i„ in [0, T], we may approximate (4.5)(4.7) by dij = -e'e n n iij = -ee n where dij Sij when the limit is taken as n -> oo with fi, f„ G [0, T"]. In this section, the smoothing matrix S corresponds to a discretized analog of S. That is, it maps a discrete set of points y{ti), y{t„) on the '^noisy" observed curve to the corresponding points fi{ti), ,/i(i„) on the smoothed curve which results from the application of S. In particular, in this section we examine how the characteristics of our self-adjoint S correspond to those of our symmetric S as n — >• oo in a fixed-domain sense. Suppose 5 is self-adjoint. Then for any x, y, {x,Sy) = J^ x{t) K[t.s)y{s)dsdt={Sx,y)= 1^ j\{t,s)x{s)dsy{t)dt. Therefore for points fj, i„ e [0, T], with n -> oo in a fixed-domain sense, J™ ^ [/ ^{ti)K{tus)y{s) ds + --. + j\{tr,)K{tr,, s)y{s) ds n'^ooVll ^(l'^M^Ml)'^5 + --+ ^ K{tr,,s)x{s)y{tr;)ds (4.8)

PAGE 66

56 Now, for a symmetric matrix S, for any x.y, (x. Sy) = (Sx, y) <^ x'Sy = x'S'y. At the sequence of points ^i,...,i„, x = (x(ii), x(<„))', and for our smoothing matrix S. Sy So = {S[y]{h) %](^n))'= (^J^ K{t,.s)y{s)ds....j\{tr„s)y{s)ds^ X Sy = x'S = / x{ti)K{ti,s}y{s)ds + ---+ f x{tn)K{tr„s)y{s)ds Jo Jo y = / K{tus)x{s)y{h)ds + ---+ f K{trr, s)x{s)y(t^) ds Jo Jo and (T/n)|^^ xiti)K{ti,s)y{s)ds + --+ x(^„)A:(^„, 5)^(5) ds = (^/")[/ Kitus)x{s)y{h)ds + --+ J\{tr,,s)x{s)yitr,)ds So taking the (fixed-domain) limits as n 00, we obtain (4.8), which shows that the defining characteristic of symmetric S corresponds to that of self-adjoint S. Definition 4.1 5 is called a shrinking smoother ||5y|| < ||y|| for all elements y (Buja et al, 1989). A suflRcient condition for S to be shrinking is that all of the singular values of 5 be < 1 (Buja et al, 1989). Lemma 4.1 Let A be a symmetric matrix and E 6e a symmetric, positive definite matrix. Then x'A'SAx T x'Sx where eniax(A) denotes the maximum eigenvalue of A. Proof of Lemma 4-

PAGE 67

57 Note that Let B = S^/^AS"^^^. Then this maximum eigenvalue is en,ax(B'B) < e^axCA'A) [emax(A)]^ by a property of the spectral norm. Lemma 4.2 Let the linear smoother S be a self-adjoint operator with singular values e [0.1]. Then var{^\\S9\\^) < war( where S is symmetric with singular values G [0, 1]. (Recall 0 y-.) Proof of Lemma 4-2: We will show that, for any n and for any positive definite covariance matrix E of ^. var{^\\Se\\^) < var{^\\e\\^) where S is symmetric. (A symmetric matrix is the discrete form of a self-adjoint operator (see Ramsay and Silverman, 1997. pp. 287-290).) Now, var (^^WSeW'^j 2fr[(S'SS)2] + 40'S'SES'S0 < 2tr\{T,f] + A0'j:0. 40'S'SES'S0 < A0'Y0 ^ ^'S'SSS'Sg ^ 9'i:0 and so we apply Lemma 4.1 with S'S playing the role of A. Since the singul values of S are G [0, 1], all the eigenvalues of S'S are G [0, 1]. Hence we have e'S'SES'S^ sup ; < 1. 0 0E0

PAGE 68

58 So we must show ir[(S'SE)2] < tr[{'E)^] for any S. ir[(E)2] = fr[(S'SE + (I-S'S)S)2] = fr[(S'SS)2] since the last two terms are > 0. This holds since fr[S^/^(I S'S)E^''^S*/^(I sum of the squared elements of the symmetric matrix — S'S)S^/^ And fr[SE(I S'S)i/2(I S'Sj^/^ES] is the sum of the squared elements of (I-S'S)i/2ES. We now extend the result in Lemma 4.2 to functional space. Corollary 4.1 Under the conditions of Lemma 4.2, var[Sij] < var[S\'j"^^^^]. Proof of Corollary 4-1-' Note that since almost sure convergence implies convergence in distribution, we have dij =-9'eAL n ^{smooth) T-Q'^'^Q _^ ^(smooth) *^ n Consider ^[|4f ] = E[{dijf] = g£;[(0'^)3]. We wish to show that this is bounded for n > 1.

PAGE 69

59 Now O'O ~ XnWwhere A = 0 0 here. The third moment of a noncentral chi-square is given by + Gn^ + Gn^A + 8n + 36nA + l2nX^ + 48A + A^X^ + 8A^ Hence '^E[[0'Of] = T^[l + 6/n + 6A/n + B/n^ + SGA/n^ + 12AVn2 + 48A/n2 + 48AVn^ + SAVn^] Now, A/n is bounded for n > 1 since ^0 0 is the discrete approximation of the distance between signal curves ^i{t) and which tends to 6ij. Hence the third moment is bounded. This boundedness. along with the convergence in distribution, implies (see Chow and Teicher. 1997. p. 277) that lim ^[(4)*^] = E[{5i^f] < oo n—yoo for A; = 1,2. So we have lim varldij] n->oo = \imE[{dijf]-l[m[E{dij)f = lim E[{dijf][lim E{dij)]' = E[{S,j)']-[E{6,j)r = var[5ij] < oo. An almost identical argument yields lim varlS'""^'"^] = var[5lr""^\

PAGE 70

60 As a consequence of Lemma 4.2, since rar[d,-*"""'*'*'] < var[dij] for all n, lim warfd,< lim var\dij] < oo. n-t-oo n->oo Hence varlSlp""'""^] < var[6ij]. The following theorem establishes the domination of the smoothed-data dissimilarity estimator over the observed-data estimator Sjj, when the signal curves Hi{t),i = I, N. lie in the linear subspace defined by S. Theorem 4.4 If S is a self-adjoint operator and a shrinking smoother, and if fii{t) is in the linear subspace which S projects onto for all i — 1, ^V, then MSEiSl;"""'"'^) < MSE{Sij). Proo f of Theorem 4-4 Define the difference in MSEs A to be as follows (we want to show that A negative): = E r [fti{t) fij{t)]ut [fi,{t) tij{t)]ut^ I = E!^(^£[Syi{t)-Syj{t)]'dt^ + (^^ViW M.WN^j

PAGE 71

= E< 61 2 ^ [Sy,[t) SyM^ dt^ -(^^ [y,{t) yM^ dt^ | Let f = fij{t) = yi{t) yj{t) and let f = fij{t) = ^J.j{t). Note that since S is a linear smoother. 5y,(i) — Syj{t) = Sf. = £{(5f. Sff (f, f)2} + 2E{{1 f)(f, f ) (5f, 5f)(f, f)} = ^{ii5fir-iifir}+2^{iifinifii2-iisfinifin = ^{ii5fir} ^{iifin 2m'E{\\sf\\' wm = var{\\S{\\') variWiW') + {E{\\Sf\\')y {Emi')}' -2iifipE{(ii5fip-iifin} = var{\\Si\\') variim + {E{\\Sf\\') + Em')}{E{\\Sf\\') E{\m} -m\'E{{\\Sf\\'-\\f\\')}. Since fii{t) is in the subspace which S projects onto for all i, then Sf = {^\\Sf\\ = A = w(||5f|n-^;ar(||f|n+^(||5f|p + ||f||2)£;(||5f||2-||f||2) -(||5f|p + ||f|n^(||5f|r-||f||2).

PAGE 72

62 Now, since var{\\S{\\^) < t;ar(||f|p) from Corollary 4.1. we have: A < E{\\Sf\\' + ||f|n^(||5f|p ||f|n (||5f||2 + ||f||^)^(||5f|p ||f||2) = E{\\Sf\\' ||f||2) [E{\\S{\f + i|f|n (||5fl|2 + ||f||2)]. (4.9) Since 5 is a shrinking smoother, ||5f|| < ||f|| for all f. Hence ||5f|p < ||f||2 for all f and hence E{\\Sf\\^) < E{\\f\\^). Therefore the first factor of (4.9) is < 0. Also note that ^(||5f|n = E!^j\sf]'dt^ = E[Sffdt = [ var[Sf]dt+ f (E[Sf]fdt Jo Jo = / var[S{]dt+ r[Si\Ut Jo Jo = [ var[Sf]dt+\\Sf\\^ >\\Sf\\\ Jo Similarly, E{\m=E^l\f]Ut^= 1^'' E[f]'dt = f var[f]dt+ r{E[i]fdt Jo Jo = / var[f]dt+ [ [f]^dt Jo Jo = f var[f]dt+\\f\\^>\\{\\^. Jo Hence the second factor of (4.9) is positive, implying that (4.9) < 0. Hence the risk difference A < 0. Remark. We know that ||5f|| < ||f|| since S is shrinking. If, for our particular Viit) and yj{t), \\Sf\\ < ||f||, then our result is MSE{5l;""^*''^) < MSE{5ij).

PAGE 73

CHAPTER 5 A PREDICTIVE LIKELIHOOD OBJECTIVE FUNCTION FOR CLUSTERING FUNCTIONAL DATA Most cluster analysis methods, whether deterministic or stochastic, use an objective function to evaluate the "goodness" of any clustering of the data. In this chapter we propose an objective function specifically intended for functional data which have been smoothed with a linear smoother. In setting up the situation, we assume we have N functional data objects yiit), yN(t). We assume the discretized data we observe (yi, .Yn) follow the discrete noise model (2.1). For object i. define the indicator vector Zi = {Zn. Zi^) such that = 1 if object i belongs to cluster j and 0 otherwise. Each partition in C corresponds to a different Z — (Zi, Z^), and Z then determines the clustering structure. (Note that if the number of clusters K < N, then Zi^^+i = • • • = Zi^ = 0 for all t.) In cluster analysis, we wish to predict the values of the Z,'s based on the data y (or data curves {yi{t), — ysit))Assume the Zj's are i.i.d. unit multinomial vectors with unknown probability vector p = (pi, p^). Also, we assume a model /(y|z) for the distribution of the data given the partition. The joint density N N =i j=i can serve as an objective function. However, the parameters in /(y|z), and p, are unknown. In model-based clustering, these parameters are estimated by maximizing the likelihood, given the observed y, across the values of z. If the data model /(y|z) allows it, a predictive likelihood for Z can be constructed by conditioning on the sufficient statistics r(y,z) of the parameters in /(y,z) (Butler, 1986; Bjornstad, 1990). This predictive likelihood, /(y, z)//(r(y, z)), can serve as 63

PAGE 74

64 the objective function. (Bjornstad (1990) notes that, when y or z is continuous, an alternative predictive likelihood with a Jacobian factor for the transformation r(y, z) is possible. Including the Jacobian factor makes the predictive likelihood independent of the choice of minimal sufficient statistic: excluding it makes the predictive likelihood invariant to scale changes of z.) For instance, consider a functional datum yi{t) and its corresponding observ'ed data vector y,. For the purpose of cluster analysis, let us assume that functional observations in the same cluster come from the same linear smoothing model with a normal error structure. That is, the distribution of Yj, given that it belongs to cluster is Yi\zij = l^N{T0j,a]l) for i = 1 ;V. Here T is the design matrix, whose n rows contain the regression covariates defined by the functional form we assume for yi{t). This model for Y, allows for a variety of parametric and nonparametric linear basis smoothers, including standard linear regression, regression splines, Fourier series models, and smoothing splines, as long as the basis coefficients are estimated using least squares. For example, consider the yeast gene data described in Section 7.1. Since the genes, being cell-cycle regulated, are periodic data, Booth et al. (2001) use a first-order Fourier series y{t) = /3o + cos(27ri/T) + /32sm{2Trt/T) as a smooth representation of a given gene's measurement, fitting a harmonic regression model (see Brockwell and Davis, 1996, p. 12) by least squares. Then the 18 rows of T are the vectors (1, cos{^tj),sm{^tj)), for the n = 18 time points tj,j = 0, 17. In this example 0^ = (/Sqj, Ai,/32j)'. Let J = {j\zij ^1 for some i} be the set of the K nonempty clusters. The unknown parameters in /(y,z), j € J, are {pj,/3j,a]). The corresponding sufficient statistics are {mj,0j,a]) where nij = ^^^^ Zij is the number of objects

PAGE 75

65 in the (proposed) jth cluster. 0^ is the p*-dimensional least squares estimator of 0j (based on a regression of all objects in the proposed jth cluster), and is the unbiased estimator of aj. The estimators 0j and a| can be expressed in terms of the estimators 0jj and ct? (for ail i such that Zij — 1), obtained from the regressions of the individual objects. (See Appendix A. 3 for the derivations of $j and dj.) Marginally, the vector m = (mi, .m^r) is muitinomial(Ar. p); conditional on m, 0j is multivariate normal and independent of ( "^^"f^ )(7j ~ Xm^n-p'Then the following theorem gives the predictive likelihood objective function, up to a proportionality constant: Theorem 5.1 If Zi. ... .Zf^ are i.i.d. unit multinomial vectors with unknown probability vector p = (pi, p^^), and Yi\z,j^l^ N{Tl3j,a]l) for i = 1, — A^, then the predictive likelihood for Z can be written: Proof of Theorem 5.1: According to the model assumed for the data and for the clusters,

PAGE 76

66 Therefore log/(y,z) N K t=l 2af Hence /(y.z) X (y, T/3^.)-(y. T0^) {T$^ T^^yjT^j T0^) ] Also, n;ej/K^)3j,^j^) d'"' 1 1 = Y\^'-—, -~ exp{-^(/9,. -/3,)'[cot;(^,)]-^(^j-/3,.)} f^^j mj!(2;r)P-/2icot;(/3.)|i/2 2^"^^ ^^^^ ^ '"^^'^ '"^^ "^'^^ (1/2)"^ m.-n-p* .piLzpl. 1 m,n-p= N\ n h rT75-P -^(^i ^i)'(T'T)(/3, Hence, since exp{-g^0,. /3,.)'(T'T)(4,/3j)} = exp{-g^(T^,. T0^y{T$j T0j)}, we have that

PAGE 77

67 /(y,z) = {(2.)-'"'n(2-r-'^(^l(T'T)-r'^} When considered as an objective function, this predictive likelihood has certain properties which reflect intuition about what type of clustering structure is desirable. The data model assumes trj represents the variability of the functional data in cluster j. Since is an unbiased estimator of cr?, j 6 J, then (a?)" '" '" +^ can be interpreted as a measure of variability within cluster j. The (typically) negative exponent indicates that "good" clustering partitions, which have small within-cluster variances, yield large values of ^^(z). Clusters containing more objects (having large values) contribute more heavily to this phenomenon. The other factors of g{z) may be interpreted as penalty terms which penalize a partition with a large number of clusters having small ruj values. A possible alternative is to adopt a Bayesian outlook and put priors on the p/s, which, when combined with the predictive likelihood, will yield a suitable predictive posterior. n'".!K)-"V;^)-^-r(=i^)(

PAGE 78

CHAPTER 6 SIMULATIONS 6.1 Setup of Simulation Study In this chapter, we examine the results of a simulation study implemented using the statistical software R. In each simulation. 100 samples of iV = 18 noisy functional data were generated such that the data followed models (2.1) or (2.2). For the discrete noise data, the errors were generated as independent N{0, a^) random variables, while for the functional noise data, the errors were generated as the result of a stationary Ornstein-Uhlenbeck process with variability parameter aand "pull"' parameter 3 = \. For each sample of curves, several (four) "'clusters" were built into the data by creating each functional observation from one of four distinct signal curves, to which the random noise was added. Of course, within the program the curves were represented in a discretized form, with values generated at n equally spaced points along T = [0. 20]. In one simulation example, n = 200 measurements were used and in a second example, n = 30. The four distinct signal curves for the simulated data were defined as follows: 0.5 \n{t + 1)4.01 cos(<), i = 1, 5. logio(f + 1) -01 cos(2f), 2 = 6,... 10. 0.75 log5(^ + 1) + .01 sin(30, i = U, 15. 0.3x/m .01 sin(4f), z = 16, 18. 68

PAGE 79

69 These four curves, shown in Figure 6-1, .were intentionally chosen to be similar enough to provide a good test for the clustering methods that attempted to group the curves into the correct clustering structure, yet different enough that they represented four clearly distinct processes. They all contain some form of periodicity, which is more prominent in some curves than others. In short, the curves were chosen so that, when random noise was added to them, they would be difficult but not impossible to distinguish. After the observed" data were generated, the pairwise dissimilarities among the curves (measured by squared L2 distance, with the integral approximated on account of the discretized data) were calculated. Denote these by dij.i = 1,. .,N.j = 1 N. The data were then smoothed using a linear smoother S (see below for details) and the James-Stein shrinkage adjustment. The pairwise dissimilarities among the smoothed curves were then calculated. Denote these by ^i^mooth) ^ — j I jj^g sample mean squared error criterion was used to judge whether dij or better estimated the dissimilarities among the signal curves, denoted dij, i — I, N,j — 1, N. That is, vzi i=i A' was compared to ^2) i=i jv j>i Once the dissimilarities were calculated, we used the resulting dissimilarity matrix to cluster the observed data, and then to cluster the smoothed data. The clustering algorithm used was the K-medoids method, implemented by the pam function in R. We examine the resulting clustering structure of both the observed

PAGE 80

70 1 10 15 20 time Figure 6-1: Plot of signal curves chosen for simulations. Solid line: fj.i{t). Dashed line: ne{t). Dotted line: fin{t). Dot-dashed line: fi^it).

PAGE 81

71 data and the smoothed data to determine which clustering better captures the structure of the underlying clusters, as defined by the four signal curves. The outputs of the cluster analyses were judged by the proportion of pairs of objects (i.e., curves) correctly placed in the same cluster (or correctly placed in different clusters, as the case may be). The correct clustering structure, as defined by the signal curves, places the first five curves in one cluster, the next five in another cluster, etc. Note that this is a type of measure of concordance between the clustering produced by the analysis and the true clustering structure. 6.2 Smoothing the Data The smoother used for the n = 200 example corresponded to a cubic B-sphne basis with 16 interior knots interspersed evenly within the interval [0,20]. The rank of S was thus A: = 20 (Ramsay and Silverman, 1997, p. 49). The value of o in the James-Stein estimator was chosen to be a = 160, a choice based on the values of n = 200 and k = 20. For the n — 30 example, a cubic B-spIine smoother with six knots was used (hence k = 10) and the value of a was chosen to be a = 15. We should note here an interesting operational issue that arises when using the James-Stein adjustment. The James-Stein estimator of 0 given by (3.4) is obtained by smoothing the differences ^ = Yi y^, i # first, and then adjusting the S^'s with the James-Stein method. This estimator for 0 is used in the JamesStein dissimilarity estimator d\'j^\ For a data set with N curves, this amounts to smoothing (A'^^ N)/2 pairwise diflferences. It is computationally less intensive to simply smooth the A'' curves with the linear smoother S and then adjust each of the N smooths with the James-Stein method. This leads to the following estimator oi0:

PAGE 82

72 Of course, when simply using a linear smoother S. it does not matter whether we proceed by smoothing the curves or the differences, since = Syj — Sy^. But when using the James-Stein adjustment, the overall smooth is nonlinear and there is an difference in the two operations. Since in certain situations, it may be more sensible to smooth the curves first and then adjust the smooths, we can check empirically to see whether this changes the risk very much. To accomplish this, a simulation study was done in which the coefficients of the above signal curves were randomly selected (within a certain range) for each simulation. Denoting the coefficients of curve by (6^4,6^,2), the coefficients were randomly chosen in the following intervals: 61,1 e [-3. 3]. 61.2 € [-0.5,0.51,66,1 € [-6, 6], 66,2 e [-0.5.0.5], 611,1 e [-4.5, 4.5], 611.2 € [-0.5,0.5],6i6,i G [-1.8, 1.8],6i6.2 6 [-0.5,0.5]. For each of 100 simulations, the sample MSE for the smoothed data was the same, within 10 significant decimal places, whether the curves were smoothed first or the differences were smoothed first. This indicates that the choice between these two procedures has a negligible effect on the risk of the smoothed-data dissimilarity estimator. Also, the simulation analyses in this chapter were carried out both when smoothing the differences first and when smoothing the curves first. The results were extremely similar, so we present the results obtained when smoothing the curves first and adjusting the smooths with the James-Stein method. 6.3 Simulation Results Results for the n 200 example are shown in Table 6-1 for the data with independent errors and Table 6-2 for the data with Ornstein-Uhlenbeck-type errors. The ratio of average MSEs is the average M S E^^'^ across the 100 simulations divided by the average MSE^'™^^^ across the 100 simulations. As shown by

PAGE 83

73 Table 6-1: Clustering the observed data and clustering the smoothed data (independent error structure, n — 200). Independent error structure a 0.4 0.5 0.75 1.0 1.25 1.5 2.0 Ratio of avg. MSEs 12.86 30.10 62.66 70.89 72.40 72.24 71.03 Avg. prop, (observed) .668 .511 .353 .325 .330 .327 .318 Avg. prop, (smoothed) .937 .715 .428 .380 .372 .352 .332 Ratios of average MSE^'"^ to average M5£(""""'"". Average proportions of pairs of objects correctly matched. Averages taken across 100 simulations. Table 6-2: Clustering the observed data and clustering the smoothed data (0-U error structure, n — 200). 0-U error structure a 0.5 0.75 1.0 1.25 1.5 2.0 2.5 Ratio of avg. MSEs Avg. prop, (observed) Avg. prop, (smoothed) 3.53 .997 1 39.09 .607 .944 126.53 .445 .620 171.39 .369 .489 176.52 .330 .396 168.13 .312 .326 160.52 .310 .334 Ratios of average A/SE'"^"' to average M5£'>'"""'"''. Average proportions of pairs of objects correctly matched. Averages taken across 100 simulations. the ratios of average MSEs being greater than 1. smoothing the data results in an improvement in estimating the dissimilarities, and, as shown graphically in Figure 6-2, it also results in a greater proportion of pairs of objects correctly clustered in every case. Similar results are seen for the n = 30 example in Table 6-3 and Table 6-4 and in Figure 6-3. The improvement, it should be noted, appears to be most pronounced for the medium values of a^, which makes sense. If there is small variability in the data, smoothing yields little improvement since the observed data is not very noisy to begin with. If the variability is quite large, the signal is relatively weak and smoothing may not capture the underlying structure precisely, although the smooths still perform far better than the observed data in estimating the true dissimilarities. When the magnitude of the noise is moderate, the advantage of smoothing in capturing the underlying clustering structure is sizable.

PAGE 84

74 Independent error structure Sigma Ornstein-Uhlenbeck error structure m 6 Figure 6-2: Proportion of pairs of objects correctly matched, plotted against cr (n = 200). Solid line: Smoothed data. Dashed line: Observed data. Table 6-3: Clustering the observed data and clustering the smoothed data (independent error structure, n = 30). Independent error structure a 0.2 0.4 0.5 0.75 1.0 1.25 1.5 Ratio of avg. MSEs 1.23 4.02 5.12 5.95 6.30 6.24 6.27 Avg. prop. (obser\'ed) .999 .466 .376 .298 .299 .312 .316 Avg. prop, (smoothed) 1.00 .552 .458 .359 .330 .321 .334 of objects correctly matched. Averages taken across 100 simulations.

PAGE 85

75 Table 6-4: Clustering the observed dat9,'and clustering the smoothed data (0-U error structure, ra = 30). 0-U error structure a 0.2 0.4 0.5 0.75 1.0 1.25 1.5 Ratio of avg. MSEs 1.26 4.07 5.03 5.69 5.78 5.85 6.04 Avg. prop, (observed) .851 .383 .280 .251 .225 .213 .233 Avg. prop, (smoothed) .881 .625 .528 .502 .478 .489 .464 Ratios of average MSE^^'^ to average MSE^'"'^^\ Average proportions of pairs of objects correctly matched. Averages taken across 100 simulations. Independent error structure ~i I 1 1 1 1 r OJ 0.4 OS OJ IX) 15 1.4 Sigma Omstein-Uhlenbeck error structure Sigma Figure 6-3: Proportion of pairs of objects correctly matched, plotted against <7 (n = 30). Solid line: Smoothed data. Dashed line: Observed data.

PAGE 86

76 The above analysis assumes the number of clusters is correctly specified as 4, so we repeat the n — 200 simulation, for data generated from the 0-U model, when the number of clusters is misspecified as 3 or 5. Figure 6-4 shows that, for the most part, the same superiority, measured by proportion of pairs correctly clustered, is exhibited by the method of smoothing before clustering. 6.4 Additional Simulation Results In the previous section, the clustering of the observed curves was compared with the clustering of the smooths adjusted via the James-Stein shrinking method. To examine the effect of the James-Stein adjustment in the clustering problem, in this section we present results of a simulation study done to compare the clustering of the observed curves, the James-Stein smoothed curves, and smoothed curves having no James-Stein shrinkage adjustment. The simulation study was similar to the previous one. We had 20 sample curves (with n = 30) in each simulation, again generated from four distinct signal curves: in this case the signal curves could be written as Fourier series. Again, for various simulations, both independent errors and Ornstein-Uhlenbeck errors (for varying levels of a^) were added to the signals to create noisy curves. For some of the simulations, the signal curves were first-order Fourier series: Hi{t) = cos(7ri/10) + 2sin(7ri/10),i = 1,...,5. fii{t) = 2cos(7ri/10) + sin(7ri/10),i = 6, ...,10. Hi{t) = 1.5cos(7ri/10) +sin(7ri/10),f = 11,...,15. mit) = -cos(7r^/10) + sin(7rt/10),i = 16, ...,20. In these simulations, the smoother chosen was a first-order Fourier series, so that the smooths lay in the linear subspace that the Fourier smoother projected onto. In this case, we would expect the unadjusted smoothing method to perform well, since

PAGE 87

77 Figure 6-4: Proportion of pairs of objects correctly matched, plotted against when the number of clusters is misspecified. Solid line: Smoothed data. Dashed line: Observed data.

PAGE 88

78 the first-order smooths should capture the underlying curves extremely well. Here the James-Stein adjustment may be superfluous, if not detrimental. In other simulations, the signal curves were second-order Fourier series: lii{t) = cos(7ri/10) + 2 sin(7r/10) + 5 cos(27ri/10) + sin(27ri/10), i = 1, 5. fj,i{t) = 2 cos(7ri/10) -Isin(;rf/10) + cos(27ri/10) -I8 sin(27ri/10), i = 6, 10. fii{t) = 1.5 cos(7ri/10) + sin(7rf/10) + 5 cos(27ri/10) + 8 sin(27ri/10), i = 11, 15. tii{t) = cos{7r
PAGE 89

79 0-U*nra. IM-ardar Fourlaralgnal eurvM 8 H 0-U •rrort. 2nd-onMr Fourwr ugnal curvM Indap. •crora, 1 tt-onlar Fourlar Mgnal eurvM sma Indap. rrara, 2nd-ara*r Fourtar signal eurvaa Figure 6-5: Proportion of pairs of objects correctly matched, plotted against a (n = 30). Solid line: James-Stein adjusted smoothed data. Dotted line: (Unadjusted) smoothed data. Dashed line: Observed data.

PAGE 90

80 When the signal curves are second-order and the simple linear smoother oversmooths. the unadjusted smooths are clustered relatively poorly. Here, the James-Stein method is needed as a safeguard, to shrink the smooth toward the observed data. The James-Stein adjusted smooths and the observed curves are both clustered correctly more often than the unadjusted smooths.

PAGE 91

CHAPTER 7 ANALYSIS OF REAL FUNCTIONAL DATA 7.1 Analysis of Expression Ratios of Yeast Genes As an example, we consider yeast gene data analyzed in Alter et aL (2000). The objects in tiiis situation are 78 genes, and the measured responses are (logtransformed) expression ratios measured 18 times (at 7-minute intervals) tj = 7j for J = 0, .... 17. As described in Spellman et al. (1998). the yeast was analyzed in an experiment involving "alpha-factor-based synchronization," after which the RNA for each gene was measured over time. (In fact, there were three separate synchronization methods used, but here we focus on the measurements produced by the alpha-factor synchronization.) Biologists believe that these genes fall into five clusters according to the cell cycle phase corresponding to each gene. To analyze these data, we treated them as 78 separate functional observations. Initially, each multivariate observation on a gene was centered by subtracting the mean response (across the 18 measurements) from the measurements. Then we clustered the genes into 5 clusters using the K-medoids method, implemented by the R function pam. To investigate the effect of smoothing on the cluster analysis, first we simply clustered the unsmoothed observed data. Then we smoothed each observation and clustered the smooth curves, essentially treating the fitted values of the smooths as the data. A (cubic) B-spline smoother, with two interior knots interspersed evenly within the given timepoints, was chosen. Ths rank of the smoothing matrix S was in this case k = 6 (Ramsay and Silverman, 1997, p. 49). The smooths were 81

PAGE 92

82 Table 7-1: The classification of the 78 yeast genes into clusters, for both observed data and smoothed data. Cluster Observed Data Smoothed Data 1 1,5.6.7.8.11.75 1,2.4,5.6.7,8,9,10.11,13,62,70,74,75 2 4. 9. 10. 13, 16, 19, 22, 23, 27, 30.32.36.37,41.44,47,49.51,53. 61,62.63.64.65.67,69,70,74,78 16,19.22,27,30,32,36,37,41,42, 44,47,49,51,61,63, fi^ fi7 fiQ 7S 3 12,14.15.18,21.24,25,26,28, 29.31.33.34.35.38,39,40, 42,43.45,46,48.50,52 3,12,14,15,18,21,24,25,26, 28,29,31,33,34,35,38,39, 40,43,45,46,48,50,52 4 17.20.54.55,56.57,58,59.60 17,20.23,53,54,55,56.57,58,59,60.64 5 66.68.71.72.73.76.77 66,68.71.72.73,76,77 adjusted via the James-Stein shrinkage procedure described in Section 3.2, with a = 5 here, a choice based on the values of n = 18 and A; = 6. Figure 7-1 shows the resulting clusters of curves for the cluster analyses of the observed data and of the smoothed curves. We can see that the clusters of smoothed curves tend to be somewhat less variable about their sample mean curves than are the clusters of observed curves, which could aid the interpretability of the clusters. In particular, features of the second, third, and fourth clusters seem more apparent when viewing the clusters of smooths than the clusters of observed data. The major characteristic of the second cluster is that it seems to contain those processes which are relatively flat, with little oscillation. This is much more evident in the cluster of smooths. The third and fourth clusters look similar for the observed data, but a key distinction between them is apparent if we examine the smooth curves. The third cluster consists primarily of curves which rise to a substantial peak, decrease, and then rise to a more gradual peak. The fourth cluster consists mostly of curves which rise to a small peak, decrease, and then rise to a higher peak. This distinction between clusters is seen much more clearly in the clustering of the smooths.

PAGE 93

83 Clusters for observed curves Clusters for smoothed curves 0 20 40 60 80 100 120 0 20 40 60 80 100 tza Figure 7-1: Plots of clusters of genes. Observed curves (left) and smoothed curves (right) are shown by dotted lines. Mean curves for each cluster shown by solid lines.

PAGE 94

The groupings of the curves into the five clusters are given in Table 7-1 for both the observed data and the smoothed data. While the clusterings were fairly similar for the two methods, there were some differences in the way the curves were classified. Figure 7-2 shows an edited picture of the respective clusterings of the curves, with the curves deleted which were classified the same way both as observed curves and smoothed curves. Note that the fifth cluster was the same for the set of observed curves and the set of smoothed curves. 7.2 Analysis of Research Libraries The Association of Research Libraries has collected extensive annual data on a large number of university libraries throughout many years (Association of Research Libraries, 2003). Perhaps the most important (or at least oft-quoted) variable measured in this ongoing study is the number of volumes held in a library at a given time. In this section a cluster analysis is performed on 67 libraries whose data on volumes were available for each year from 1967 to 2002 (a string of 36 years). This fits a strict definition of functional data, since the number of volumes in a library is changing "nearly continuously" over time, and each measurement is merely a snapshot of the response at a particular time, namely when the inventory was done that year. The goal of the analysis is to determine which libraries are most similar in their growth patterns over the last 3-plus decades and to see how many groups into which the 67 libraries naturally clustered. In the analysis, the logarithm of volumes held was used as the response variable, since reducing the immense magnitude of the values in the data set appeared to produce curves with stable variances. Since the goal was to cluster the libraries based on their "growth curves" rather than simply their sizes, the data was centered by subtracting each library's sample mean (across the 36 years) from each year's measurement, resulting in 67 mean-centered data vectors.


Figure 7-2: Edited plots of clusters of genes which were classified differently as observed curves and smoothed curves. Observed curves (left) and smoothed curves (right); horizontal axis: time.


The observed curves were each smoothed using a basis of cubic B-splines with three knots interspersed evenly within the timepoints. The positive-part James-Stein shrinkage was applied, with a value a = 35 in the shrinkage factor (based on n = 36, k = 7). The K-medoids algorithm was applied to the smoothed curves, with K = 4 clusters, a choice guided by the average silhouette width criterion that Rousseeuw (1987) suggests for selecting K. The resulting clusters are shown in Table 7-2.

In Figure 7-3, we see the distinctions among the curves in the different clusters. Cluster 1 contains a few libraries whose volume size was relatively small at the beginning of the time period, but then grew dramatically in the first 12 years of the study, growing slowly after that. Cluster 2 curves show a similar pattern, except that the initial growth is less dramatic. The behavior of the cluster 3 curves is the most eccentric of the four clusters, with some notable nonmonotone behavior apparent in the middle years. The growth curves for the libraries in cluster 4 are consistently slow, steady, and nearly linear. For purposes of comparing the clusters, the mean curves for the four clusters are shown in Figure 7-4.

Note that several of the smooths may not appear visually smooth in Figure 7-3. A close inspection of the data shows some anomalous (probably misrecorded) measurements which contribute to this chaotic behavior in the smooth. (For example, see the data and associated smooth for the University of Arizona library, in Figure 7-5, for which 1971 is a suspicious response.) While an extensive data analysis would detect and fix these outliers, the data set is suitable to illustrate the cluster analysis. The sharp-looking peak in the smooth is an artifact of the discretized plotting mechanism, as the cubic spline is mathematically assured to have two continuous derivatives at each point.
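The choice of K by Rousseeuw's (1987) average silhouette width, and the K-medoids step itself, can be sketched in R with cluster::pam. Here smoothed is assumed to be the 67 x 36 matrix of shrunken B-spline smooths built from the centered data above; the object name and the candidate range of K are assumptions for illustration.

## Pick K to maximize the average silhouette width, then cluster.
library(cluster)

K.grid <- 2:8
avg.sil <- sapply(K.grid, function(K) pam(smoothed, k = K)$silinfo$avg.width)
K.best <- K.grid[which.max(avg.sil)]             # the text reports K = 4 for the library data
lib.pam <- pam(smoothed, k = K.best)
split(rownames(smoothed), lib.pam$clustering)    # cluster memberships, as in Table 7-2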


Figure 7-3: Plots of clusters of libraries, based on smoothed curves. Panels: Cluster 1 through Cluster 4 for smoothed curves; horizontal axis: time (years since 1967). Mean curves for each cluster are shown by solid lines.


Figure 7-4: Mean curves for the four library clusters given on the same plot. Solid: cluster 1. Dashed: cluster 2. Dotted: cluster 3. Dot-dash: cluster 4. Horizontal axis: time (years since 1967).


Table 7-2: A 4-cluster K-medoids clustering of the smooths for the library data.

Clusters of smooths for library data
  Cluster 1: Arizona, British Columbia, Connecticut, Georgetown, Georgia
  Cluster 2: Boston, California-Los Angeles, Florida, Florida State, Iowa, Iowa State, Kansas, Michigan State, Nebraska, Northwestern, Notre Dame, Oklahoma, Pennsylvania State, Pittsburgh, Princeton, Rochester, Southern California, SUNY-Buffalo, Temple, Virginia, Washington U-St. Louis, Wisconsin
  Cluster 3: Brown, California-Berkeley, Chicago, Cincinnati, Colorado, Duke, Indiana, Kentucky, Louisiana State, Maryland, McGill, MIT, Ohio State, Oregon, Pennsylvania, Purdue, Rutgers, Stanford, Tennessee, Toronto, Tulane, Utah, Washington State, Wayne State
  Cluster 4: Columbia, Cornell, Harvard, Illinois-Urbana, Johns Hopkins, Michigan, Minnesota, Missouri, New York, North Carolina, Southern Illinois, Syracuse, Yale

Figure 7-5: Measurements and B-spline smooth, University of Arizona library. Points: log of volumes held, 1967-2002. Curve: B-spline smooth. Horizontal axis: years since 1967.


Table 7-3: A 4-cluster K-medoids clustering of the observed library data.

Clusters of observed library data
  Cluster 1: Arizona, British Columbia, Connecticut, Georgetown, Georgia
  Cluster 2: Boston, California-Los Angeles, Florida, Florida State, Iowa, Iowa State, Kansas, McGill, Michigan State, Nebraska, Northwestern, Notre Dame, Oklahoma, Pennsylvania State, Pittsburgh, Princeton, Rochester, Southern California, SUNY-Buffalo, Temple, Virginia, Washington U-St. Louis, Wayne State, Wisconsin
  Cluster 3: Brown, California-Berkeley, Chicago, Cincinnati, Colorado, Duke, Illinois-Urbana, Indiana, Kentucky, Louisiana State, Maryland, MIT, Ohio State, Oregon, Pennsylvania, Purdue, Rutgers, Stanford, Syracuse, Tennessee, Toronto, Tulane, Utah, Washington State
  Cluster 4: Columbia, Cornell, Harvard, Johns Hopkins, Michigan, Minnesota, Missouri, New York, North Carolina, Southern Illinois, Yale

Interpreting the meaning of the clusters is fairly subjective, but the dramatic rises in library volumes for the cluster 1 and cluster 2 curves could correspond to an increase in state-sponsored academic spending in the 1960s and 1970s, especially since many of the libraries in the first two clusters correspond to public universities. On the other hand, cluster 4 contains several old, traditional Ivy League universities (Harvard, Yale, Columbia, Cornell), whose libraries' steady growth could reflect the extended buildup of volumes over many previous years. Any clear conclusions, however, would require more extensive study of the subject, as the cluster analysis is only an exploratory first step.

In this data set, only minor differences in the clustering partition were seen when comparing the analysis of the smoothed data with that of the observed data (shown in Table 7-3). Interestingly, in some spots where the two partitions differed, the clustering of the smooths did place certain libraries from the same state together when the other method did not. For instance, Illinois-Urbana was placed with Southern Illinois, and Syracuse was placed with Columbia, Cornell, and NYU.


CHAPTER 8
CONCLUSIONS AND FUTURE RESEARCH

This dissertation has addressed the problems of clustering and estimating dissimilarities for functional observations. We began by reviewing important clustering methods and connecting the clustering output to the dissimilarities among the objects. We then described functional data and provided some background about functional data analysis, which has become increasingly visible in the past decade.

An examination of the recent statistical literature reveals a number of methods proposed for clustering functional data (James and Sugar, 2003; Tarpey and Kinateder, 2003; Abraham et al., 2003). These methods involve some type of smoothing of the data, and we have provided justification for this practice of smoothing before clustering.

We proposed a model for the functional observations and hence for the dissimilarities among them. When the data are smoothed using a basis function method (such as regression splines, for example), the resulting smoothed-data dissimilarity estimator dominates the estimator based on the observed data when the pairwise differences between the signal curves lie in the linear subspace defined by the smoothing matrix. When the differences do not lie in the subspace, we have shown that a James-Stein shrinkage dissimilarity estimator dominates the observed-data estimator under an independent error model. With dependent errors, an asymptotic (for n large within a fixed domain) domination result was given. (While the result appears to hold for moderately large n, the asymptotic situation of the theorem corresponds to data which are nearly pure functions, measured nearly continuously


across some domain.) The shrinkage estimator is a novel way to unite linear smoothers and Stein estimation to derive a useful smoothing method.

A good estimation of dissimilarities is only a preliminary step toward the eventual goal, the correct clustering of the data. Thus we have presented a simulation study which indicates that, for a set of noisy functional data generated from a four-cluster structure, the data which are smoothed before a K-medoids cluster analysis are classified more correctly than the observed, unsmoothed data. The correctness of each clustering was determined by a measure of concordance between the clustering output and the underlying structure.

Some real functional data were analyzed, with the James-Stein shrinkage smoother applied to two data sets, which were then clustered via K-medoids. One data set consisted of 78 yeast genes whose expression level was measured across time, while the other data set consisted of the "growth curves" (number of volumes held measured over time) of 67 research libraries.

The applications for this research are wide-ranging, since functional data are gathered in many fields from the biological to the social sciences. Besides the applications mentioned in this dissertation, functional data analyses in archaeology, economics, and biomechanics, among others, are presented by Ramsay and Silverman (2002).

One obvious extension of this work that could be addressed in the future is to consider linear smoothers for which S is not idempotent. While a symmetric, idempotent S corresponds to an important class of smoothers, many other common smoothers such as kernel smoothers, local polynomial methods, and smoothing splines have smoothing matrices which are symmetric but not in general idempotent.

The use of a basis function smoother requires, in some sense, that the curves in the data set all have the same basic structure, since the same set of basis functions


is used to estimate each signal curve. Fortunately, methods such as regression splines, especially when the number of knots is fairly large, are flexible enough to approximate well a variety of shapes of curves.

Also, given our emphasis on working with the pairwise differences of data vectors (i.e., discretized curves), for the differences to make sense we need each curve in the data set to be measured at the same set of points t_1, ..., t_n, which is another limitation. The abstract case of data which come to the analyst as pure continuous functions was resolved in this thesis in the case for which the signal curves were in the subspace defined by the linear operator S, but the more general case still needs further work. A more abstract functional analytic approach to functional data analysis, such as Bosq (2000) presents, could be useful in this problem.

The practical distinction that arises when using the James-Stein adjustment between smoothing the pairwise differences in the noisy curves as opposed to smoothing the noisy curves themselves was addressed in Section 6.2. Empirical evidence was given indicating that it mattered very little in practice which procedure was done, so one could feel safe in employing the computationally faster method of smoothing the noisy curves directly (a small numerical illustration of this point appears below). More analytical exploration of this is needed, however.

With the continuing growth in computing power available to statisticians, computationally intensive methods like stochastic cluster analysis will become more prominent in the future. Much of the work that has been done in this area has addressed the clustering of traditional multivariate data, so stochastic methods for clustering functional data are needed. Objective functions designed for functional data, such as the one proposed in Chapter 5, are the first step in such methods. The next step is to create better and faster search algorithms to optimize such objective functions and discover the best clustering.
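As noted above, a projection smoother is linear and therefore commutes exactly with taking pairwise differences; only the nonlinear James-Stein step breaks the equivalence. A minimal R sketch illustrates the point, using the same assumed Green-and-Strawderman-type shrinkage form as in the Chapter 7 sketch; the two noisy curves y1 and y2 and the basis are arbitrary choices for this illustration.

## Linear smoothing commutes with differencing; the James-Stein
## adjustment does not, though the two orderings tend to be close.
library(splines)

tt <- seq_len(18)
B <- bs(tt, df = 6, intercept = TRUE)
S <- B %*% solve(crossprod(B)) %*% t(B)
js <- function(y, a = 5) {
  fit <- drop(S %*% y); resid <- y - fit
  fit + max(0, 1 - a / sum(resid^2)) * resid
}

set.seed(2)
y1 <- sin(seq(0, 3, length.out = 18)) + rnorm(18, sd = 0.3)
y2 <- cos(seq(0, 3, length.out = 18)) + rnorm(18, sd = 0.3)

all.equal(drop(S %*% (y1 - y2)), drop(S %*% y1 - S %*% y2))  # TRUE: pure smoothing commutes
max(abs(js(y1 - y2) - (js(y1) - js(y2))))                    # small but nonzero difference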


In a broad sense, there is an increasing need for methods of analyzing large functional data sets. With today's advanced monitoring equipment, scientists can measure responses nearly continuously, over time or some other domain, for a large number of individuals. Statisticians need to have methods, both exploratory and confirmatory, to discover genuine trends and variability across great masses of curves. The field of data mining (Hand et al., 2000) has begun to provide some answers for these problems, and functional data provide another rich class of problems to address.


APPENDIX A
DERIVATIONS AND PROOFS

A.1 Proof of Conditions for Smoothed-data Estimator Superiority

Now, since S is a shrinking smoother, this means $\|S\mathbf{y}\| \le \|\mathbf{y}\|$ for all $\mathbf{y}$, and hence $\|S\mathbf{y}\|^2 \le \|\mathbf{y}\|^2$ for all $\mathbf{y}$. Therefore $\|S\theta\|^2 \le \|\theta\|^2$. Hence, and because $k < n$, we see that the first term of (3.2) is less than the first term of (3.1), and that the second term of (3.2) is $\le 0$. The third term of (3.2) will be positive, though. Therefore, the only way the observed-data estimator would have a smaller MSE than the smoothed-data estimator is if the negative quantity $\|S\theta\|^2 - \|\theta\|^2$ is of magnitude large enough that its square (which appears in the third term of (3.2)) overrides the advantage (3.2) has in the first two terms. This corresponds to the situation when the shrinking smoother shrinks too far past the underlying mean curve. This phenomenon is the same one seen in simulation studies.

We can make this more mathematically precise. Note that $MSE\bigl(\tfrac{T}{n}\hat{\theta}' S \hat{\theta}\bigr) < MSE\bigl(\tfrac{T}{n}\hat{\theta}'\hat{\theta}\bigr)$ is equivalent, after expanding the two mean squared errors, to a quadratic inequality in the quantity $w$ defined next.

(Let $w = \|S\theta\|^2 - \|\theta\|^2$.) After dividing through by a common positive constant, the inequality becomes

$$w^2 + (4 + 2k)\,w + (2k + k^2 - 2n - n^2) < 0,$$

that is, $aw^2 + bw + c < 0$. Here $k < n$, so $a > 0$, $b > 0$, and $c < 0$. Thus the left-hand side is $< 0$ when

$$w \in \left( \frac{-b - \sqrt{b^2 - 4ac}}{2a},\ \frac{-b + \sqrt{b^2 - 4ac}}{2a} \right). \qquad (A.1)$$

Note that $\sqrt{b^2 - 4ac} = \sqrt{(4 + 2k)^2 + 4(2n + n^2 - 2k - k^2)}$. So the left endpoint of the interval in (A.1) is

$$\frac{-(4 + 2k) - \sqrt{(4 + 2k)^2 + 4(2n + n^2 - 2k - k^2)}}{2}
= \frac{-4 - 2k - \sqrt{16 + 16k + 4k^2 + 8n - 8k + 4n^2 - 4k^2}}{2}$$
$$= \frac{-4 - 2k - \sqrt{16 + 8n + 4n^2 + 8k}}{2}
= \frac{-4 - 2k - 2\sqrt{4 + 2n + n^2 + 2k}}{2}
= -2 - k - \sqrt{4 + 2n + n^2 + 2k}.$$

Similarly, the right endpoint of (A.1) is $-2 - k + \sqrt{4 + 2n + n^2 + 2k}$. Hence, the smoothed-data estimator is better in terms of MSE when

$$-2 - k - \sqrt{4 + 2n + n^2 + 2k} < \|S\theta\|^2 - \|\theta\|^2 < -2 - k + \sqrt{4 + 2n + n^2 + 2k}. \qquad (A.2)$$

Note that for any $n > 1$ and integer $k < n$, the right-hand side of (A.2) is greater than 0. Since $\|S\theta\|^2 - \|\theta\|^2 \le 0$, the second inequality of (A.2) is always satisfied in our setting. Hence, the smoothed-data estimator is better when

$$\|S\theta\|^2 - \|\theta\|^2 > -2 - k - \sqrt{4 + 2n + n^2 + 2k}.$$


A.2 Extension of Stochastic Domination Result

Recall that we have a linear combination of $n$ independent noncentral $\chi^2_1$ random variables and a linear combination of $n$ independent central $\chi^2_1$ variates. We know that for each $\delta_i^2 > 0$, $i = 1, \ldots, n$,

$$P[\chi^2_1(\delta_i^2) > x] > P[\chi^2_1 > x] \quad \text{for all } x > 0.$$

Since $c_1, \ldots, c_n \ge 0$, letting $z_i = c_i x$, $i = 1, \ldots, n$,

$$P[c_i \chi^2_1(\delta_i^2) > z_i] > P[c_i \chi^2_1 > z_i] \quad \text{for all } z_i > 0.$$

(If $c_i = 0$, the inequality trivially holds.) Now, since $f(x_1, \ldots, x_n) = x_1 + \cdots + x_n$ is an increasing function, and since the $n$ noncentral chi-squares are independent and the $n$ central chi-squares are independent, we apply a result given in Ross (1996, p. 410, Example 9.2(A)) to conclude:

$$P[c_1 \chi^2_1(\delta_1^2) + \cdots + c_n \chi^2_1(\delta_n^2) > x] > P[c_1 \chi^2_1 + \cdots + c_n \chi^2_1 > x] \quad \text{for all } x > 0.$$

A.3 Definition and Derivation of $\hat{\beta}_j$ and $\hat{\sigma}_j^2$

Recall that $\mathbf{y}_i$ is the response vector for object $i$, and we denote the design matrix by $T$. Then $\hat{\beta}_j$, the estimator of the coefficients of the regression of all the objects in cluster $j$, is:

$$\hat{\beta}_j = \bigl[(T', \ldots, T')(T', \ldots, T')'\bigr]^{-1}(T', \ldots, T')(\mathbf{y}_1', \ldots, \mathbf{y}_{m_j}')'
= [m_j T'T]^{-1}(T'\mathbf{y}_1 + \cdots + T'\mathbf{y}_{m_j})
= \frac{1}{m_j}\sum_{i=1}^{m_j}(T'T)^{-1}T'\mathbf{y}_i.$$


$SSE_j$, the sum of squared errors of the regression of all the objects in cluster $j$, is:

$$SSE_j = \sum_{i=1}^{m_j}\mathbf{y}_i'\mathbf{y}_i - (\mathbf{y}_1'T + \cdots + \mathbf{y}_{m_j}'T)[m_j T'T]^{-1}(T'\mathbf{y}_1 + \cdots + T'\mathbf{y}_{m_j})$$
$$= \sum_{i=1}^{m_j}\mathbf{y}_i'\mathbf{y}_i - \frac{1}{m_j}(\mathbf{y}_1'T + \cdots + \mathbf{y}_{m_j}'T)\bigl[(T'T)^{-1}T'\mathbf{y}_1 + \cdots + (T'T)^{-1}T'\mathbf{y}_{m_j}\bigr]$$
$$= \frac{1}{m_j}\bigl[(\mathbf{y}_1'\mathbf{y}_1 - \mathbf{y}_1'T(T'T)^{-1}T'\mathbf{y}_1) + \cdots + (\mathbf{y}_{m_j}'\mathbf{y}_{m_j} - \mathbf{y}_{m_j}'T(T'T)^{-1}T'\mathbf{y}_{m_j})\bigr]
+ \frac{m_j - 1}{m_j}\,(\mathbf{y}_1', \ldots, \mathbf{y}_{m_j}')\,\tilde{S}\,(\mathbf{y}_1', \ldots, \mathbf{y}_{m_j}')'$$

(where $\tilde{S}$ is the $m_j n \times m_j n$ block matrix with $I_n$ in each diagonal block and $-\frac{1}{m_j - 1}T(T'T)^{-1}T'$ in each off-diagonal block).

Hence $\hat{\sigma}_j^2$, the mean squared error of the regression of all the objects in cluster $j$, follows by dividing $SSE_j$ by the corresponding residual degrees of freedom, and it inherits the same decomposition: a term built from the individual objects' error sums of squares plus a term proportional to the cross-object quadratic form $(\mathbf{y}_1', \ldots, \mathbf{y}_{m_j}')\,\tilde{S}\,(\mathbf{y}_1', \ldots, \mathbf{y}_{m_j}')'$.
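The identity for $\hat{\beta}_j$ above is easy to check numerically. The following R sketch (not from the dissertation; the sizes and design are arbitrary) regresses the stacked responses of a cluster on the stacked design and compares the result with the average of the per-object estimates $(T'T)^{-1}T'\mathbf{y}_i$.

## Numerical check: the pooled coefficient estimate equals the average
## of the individual objects' least-squares estimates.
set.seed(1)
n <- 10; k <- 3; m <- 4                              # arbitrary sizes
T.mat <- cbind(1, poly(1:n, degree = k - 1))         # an n x k design matrix
Y <- matrix(rnorm(n * m), n, m)                      # the m objects in cluster j

T.stack <- do.call(rbind, rep(list(T.mat), m))       # design stacked m times
y.stack <- as.vector(Y)                              # responses stacked in the same order
beta.pooled <- lm.fit(T.stack, y.stack)$coefficients

beta.each <- apply(Y, 2, function(y) drop(solve(crossprod(T.mat), crossprod(T.mat, y))))
all.equal(unname(beta.pooled), unname(rowMeans(beta.each)))  # TRUE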


APPENDIX B
ADDITIONAL FORMULAS AND CONDITIONS

B.1 Formulas for Roots of $\Delta_U = 0$ for General $n$ and $k$

The four roots (two real, two imaginary) of $\Delta_U(a) = 0$ are $0$, the nontrivial real root

$$p_2^{1/3} + \frac{2}{9}\,p_3 - \frac{4}{3}\,\frac{p_1}{-n + k},$$

and the pair of complex conjugate roots

$$-\frac{1}{2}\,p_2^{1/3} - \frac{1}{9}\,p_3 - \frac{4}{3}\,\frac{p_1}{-n + k} \pm \frac{1}{2}\,i\sqrt{3}\left(p_2^{1/3} - \frac{2}{9}\,p_3\right).$$

Here $p_1 = 2nk - 6n + k^2 + 6k + 8$, while $p_2$ and $p_3$ are lengthy closed-form polynomial and rational expressions in $n$ and $k$, with $p_3$ carrying a factor of $1/\bigl[(-n + k)^3 p_2^{1/3}\bigr]$.


B.2 Regularity Condition for Positive Semidefinite G in Laplace Approximation

Recall the Laplace approximation given by Lieberman (1994) for the expectation of a ratio of quadratic forms:

$$E\!\left(\frac{\mathbf{x}'F\mathbf{x}}{\mathbf{x}'G\mathbf{x}}\right) \approx \frac{E(\mathbf{x}'F\mathbf{x})}{E(\mathbf{x}'G\mathbf{x})}.$$

Lieberman (1994) denotes the joint moment generating function of $\mathbf{x}'F\mathbf{x}$ and $\mathbf{x}'G\mathbf{x}$ by $M(\omega_1, \omega_2) = E[\exp(\omega_1 \mathbf{x}'F\mathbf{x} + \omega_2 \mathbf{x}'G\mathbf{x})]$, and assumes a positive definite $G$. In that case $E(\mathbf{x}'G\mathbf{x}) > 0$. Lieberman uses the positive definiteness of $G$ to show that the derivative of the cumulant generating function of $\mathbf{x}'G\mathbf{x}$ is greater than zero. That is,

$$\frac{\partial}{\partial \omega_2}\log M(0, \omega_2) = \frac{\frac{\partial}{\partial \omega_2}M(0, \omega_2)}{M(0, \omega_2)}
= \left[\int (\mathbf{x}'G\mathbf{x})\exp\{\omega_2 \mathbf{x}'G\mathbf{x}\}f(\mathbf{x})\,d\mathbf{x}\right]\Big/\left[\int \exp\{\omega_2 \mathbf{x}'G\mathbf{x}\}f(\mathbf{x})\,d\mathbf{x}\right] > 0.$$

The positive derivative ensures the maximum of $\log M(0, \omega_2)$ is attained at the boundary point (where $\omega_2 = 0$). For positive semidefinite $G$, we need the additional regularity condition that $P(\mathbf{x}'G\mathbf{x} > 0) > 0$, i.e., the support of $\mathbf{x}'G\mathbf{x}$ is not degenerate at zero. This will ensure that $E(\mathbf{x}'G\mathbf{x}) > 0$, i.e., that $\frac{\partial}{\partial \omega_2}M(0, \omega_2) > 0$. Therefore both of the integrals in the above expression are positive and the Laplace approximation will hold.
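As a small illustration of this approximation (not from the dissertation), the R sketch below compares a Monte Carlo estimate of $E(\mathbf{x}'F\mathbf{x}/\mathbf{x}'G\mathbf{x})$ with $E(\mathbf{x}'F\mathbf{x})/E(\mathbf{x}'G\mathbf{x})$ for an arbitrary positive definite $F$ and a positive semidefinite $G$ satisfying the regularity condition $P(\mathbf{x}'G\mathbf{x} > 0) > 0$, taking $\mathbf{x}$ standard normal.

## Monte Carlo comparison of E(x'Fx / x'Gx) with E(x'Fx) / E(x'Gx).
set.seed(1)
p <- 6
A <- matrix(rnorm(p * p), p)
F.mat <- crossprod(A)                             # positive definite F
G.mat <- diag(c(rep(1, p - 1), 0))                # positive semidefinite G (one zero eigenvalue)

x <- matrix(rnorm(1e5 * p), ncol = p)             # draws of x ~ N(0, I)
qF <- rowSums((x %*% F.mat) * x)                  # x'Fx for each draw
qG <- rowSums((x %*% G.mat) * x)                  # x'Gx for each draw
c(direct = mean(qF / qG), laplace = mean(qF) / mean(qG))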


REFERENCES

Abraham, C., Cornillon, P. A., Matzner-Lober, E. and Molinari, N. (2003). Unsupervised curve clustering using B-splines, The Scandinavian Journal of Statistics 30: 581-595.

Alter, O., Brown, P. O. and Botstein, D. (2000). Singular value decomposition for genome-wide expression data processing and modeling, Proceedings of the National Academy of Sciences 97: 10101-10106.

Association of Research Libraries (2003). ARL Statistics Publication Home Page. Available at www.arl.org/stats/arlstat, accessed November 2003, published by Association of Research Libraries.

Baldessari, B. (1967). The distribution of a quadratic form of normal random variables, The Annals of Mathematical Statistics 38: 1700-1704.

Bjornstad, J. F. (1990). Predictive likelihood: A review (C/R: p255-265), Statistical Science 5: 242-254.

Booth, J. G., Casella, G., Cooke, J. E. K. and Davis, J. M. (2001). Sorting periodically-expressed genes using microarray data, Technical Report 2001-026, Department of Statistics, University of Florida.

Bosq, D. (2000). Linear Processes in Function Spaces: Theory and Applications, New York: Springer-Verlag Inc.

Brockwell, P. J. and Davis, R. A. (1996). Introduction to Time Series and Forecasting, New York: Springer-Verlag Inc.

Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models (C/R: p510-555), The Annals of Statistics 17: 453-510.

Butler, R. W. (1986). Predictive likelihood inference with applications (C/R: p23-38), Journal of the Royal Statistical Society, Series B, Methodological 48: 1-23.

Casella, G. and Berger, R. L. (1990). Statistical Inference, Belmont, California: Duxbury Press.

Casella, G. and Hwang, J. T. (1987). Employing vague prior information in the construction of confidence sets, Journal of Multivariate Analysis 21: 79-104.


Celeux, G. and Govaert, G. (1992). A classification EM algorithm for clustering and two stochastic versions, Computational Statistics and Data Analysis 14: 315-332.

Chow, Y. S. and Teicher, H. (1997). Probability Theory: Independence, Interchangeability, Martingales, New York: Springer-Verlag Inc.

Cressie, N. A. C. (1993). Statistics for Spatial Data, New York: John Wiley and Sons.

Cuesta-Albertos, J. A., Gordaliza, A. and Matran, C. (1997). Trimmed k-means: An attempt to robustify quantizers, The Annals of Statistics 25: 553-576.

de Boor, C. (1978). A Practical Guide to Splines, New York: Springer-Verlag Inc.

DeVito, C. L. (1990). Functional Analysis and Linear Operator Theory, Redwood City, California: Addison-Wesley.

Eubank, R. L. (1988). Spline Smoothing and Nonparametric Regression, New York: Marcel Dekker Inc.

Falkner, F. (ed.) (1960). Child Development: An International Method of Study, Basel: Karger.

Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis, The Computer Journal 41: 578-588.

Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation, Journal of the American Statistical Association 97(458): 611-631.

Garcia-Escudero, L. A. and Gordaliza, A. (1999). Robustness properties of k-means and trimmed k-means, Journal of the American Statistical Association 94: 956-969.

Gnanadesikan, R., Blashfield, R. K., Breiman, L., Dunn, O. J., Friedman, J. H., Fu, K., Hartigan, J. A., Kettenring, J. R., Lachenbruch, P. A., Olshen, R. A. and Rohlf, F. J. (1989). Discriminant analysis and clustering, Statistical Science 4: 34-69.

Gordon, A. D. (1981). Classification: Methods for the Exploratory Analysis of Multivariate Data, London: Chapman and Hall Ltd.

Green, E. J. and Strawderman, W. E. (1991). A James-Stein type estimator for combining unbiased and possibly biased estimators, Journal of the American Statistical Association 86: 1001-1006.


Hand, D. J., Blunt, G., Kelly, M. G. and Adams, N. M. (2000). Data mining for fun and profit, Statistical Science 15(2): 111-126.

Hastie, T., Buja, A. and Tibshirani, R. (1995). Penalized discriminant analysis, The Annals of Statistics 23: 73-102.

James, G. M. and Hastie, T. J. (2001). Functional linear discriminant analysis for irregularly sampled curves, Journal of the Royal Statistical Society, Series B, Methodological 63(3): 533-550.

James, G. M. and Sugar, C. A. (2003). Clustering for sparsely sampled functional data, Journal of the American Statistical Association 98: 397-408.

James, W. and Stein, C. (1961). Estimation with quadratic loss, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 361-379.

Johnson, R. A. and Wichern, D. W. (1998). Applied Multivariate Statistical Analysis, Upper Saddle River, New Jersey: Prentice-Hall Inc.

Kaufman, L. and Rousseeuw, P. J. (1987). Clustering by means of medoids, Statistical Data Analysis Based on the L1-norm and Related Methods, pp. 405-416.

Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley and Sons.

Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P. (1983). Optimization by simulated annealing, Science 220: 671-680.

Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation, New York: Springer-Verlag Inc.

Lieberman, O. (1994). A Laplace approximation to the moments of a ratio of quadratic forms, Biometrika 81: 681-690.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 281-297.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equations of state calculations by fast computing machines, Journal of Chemical Physics 21: 1087-1092.

Milligan, G. W. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set, Psychometrika 50: 159-179.


Ramsay, J. O. and Dalzell, C. J. (1991). Some tools for functional data analysis (Disc: p561-572), Journal of the Royal Statistical Society, Series B, Methodological 53: 539-561.

Ramsay, J. O. and Silverman, B. W. (1997). Functional Data Analysis, New York: Springer-Verlag Inc.

Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis: Methods and Case Studies, New York: Springer-Verlag Inc.

Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods, New York: Springer-Verlag Inc.

Rodgers, W. L. (1988). Statistical matching, Encyclopedia of Statistical Sciences (9 vols. plus Supplement), Volume 8, pp. 663-664.

Ross, S. M. (1996). Stochastic Processes, New York: John Wiley and Sons.

Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20: 53-65.

Selim, S. Z. and Alsultan, K. (1991). A simulated annealing algorithm for the clustering problem, The Journal of the Pattern Recognition Society 24: 1003-1008.

Selim, S. Z. and Ismail, M. A. (1984). K-means-type algorithms: A generalized convergence theorem and characterization of local optimality, IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 81-87.

Simonoff, J. S. (1996). Smoothing Methods in Statistics, New York: Springer-Verlag Inc.

Sloane, N. J. A. and Plouffe, S. (1995). The Encyclopedia of Integer Sequences, San Diego: Academic Press.

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. and Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Molecular Biology of the Cell 9: 3273-3297.

Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 197-206.

Stein, M. L. (1995). Locally lattice sampling designs for isotropic random fields (Corr: 1999 V27 p1440), The Annals of Statistics 23: 1991-2012.


Sugar, C. A. and James, G. M. (2003). Finding the number of clusters in a dataset: An information-theoretic approach, Journal of the American Statistical Association 98: 750-763.

Tan, W. Y. (1977). On the distribution of quadratic forms in normal random variables, The Canadian Journal of Statistics 5: 241-250.

Tarpey, T. and Kinateder, K. (2003). Clustering functional data, The Journal of Classification 20: 93-114.

Tarpey, T., Petkova, E. and Ogden, R. T. (2003). Profiling placebo responders by self-consistent partitioning of functional data, Journal of the American Statistical Association 98: 850-858.

Taylor, J. M. G., Cumberland, W. G. and Sy, J. P. (1994). A stochastic model for analysis of longitudinal AIDS data, Journal of the American Statistical Association 89: 727-736.

Tibshirani, R., Walther, G. and Hastie, T. (2001a). Estimating the number of clusters in a data set via the gap statistic, Journal of the Royal Statistical Society, Series B, Methodological 63(2): 411-423.

Tibshirani, R., Walther, G., Botstein, D. and Brown, P. (2001b). Cluster validation by prediction strength, Technical Report 2001-21, Department of Statistics, Stanford University.

Young, F. W. and Hamer, R. M. (1987). Multidimensional Scaling: History, Theory, and Applications, Hillsdale, New Jersey: Lawrence Erlbaum Associates.


BIOGRAPHICAL SKETCH

David B. Hitchcock was born in Hartford, Connecticut, in 1974 to Richard and Gloria Hitchcock, and he grew up in Athens, Georgia, and Stone Mountain, Georgia, with his parents and sister, Rebecca. He graduated from St. Pius X Catholic High School in Atlanta, Georgia, in 1992. He earned a bachelor's degree from the University of Georgia in 1996 and a master's degree from Clemson University in 1999. While at Clemson, he met his future wife Cassandra Kirby, who was also a master's student in mathematical sciences. He came to the University of Florida in 1999 to pursue a Ph.D. degree in statistics.

While at Florida, he completed the required Ph.D. coursework and taught several undergraduate courses. His activities in the department at Florida also included student consulting at the statistics unit of the Institute of Food and Agricultural Sciences and a research assistantship with Alan Agresti. He began working with George Casella and Jim Booth on his dissertation research in the fall of 2001 after passing the written Ph.D. qualifying exam. In January 2004 David and Cassandra were married. After graduating, David will be an assistant professor of statistics at the University of South Carolina.


I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

George Casella, Chair
Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

James G. Booth, Cochair
Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

James P. Hobert
Associate Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Brett D. Presnell
Associate Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

John C. Henretta
Professor of Sociology


This dissertation was submitted to the Graduate Faculty of the Department of Statistics in the College of Liberal Arts and Sciences and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August 2004

Dean, Graduate School


SMOOTHING FUNCTIONAL DATA FOR CLUSTER ANALYSIS

David B. Hitchcock
(352) 392-1941
Department of Statistics
Chair: George Casella
Degree: Doctor of Philosophy
Graduation Date: August 2004

Cluster analysis, which places objects into reasonable groups based on statistical data measured on them, is an important exploratory tool for the social and biological sciences. We explore the particular problem of clustering functional data, characteristically observed as part of a continuous process. Examples of functional data include growth curves and biomechanics data measuring movement. In recent years, methods for smoothing and clustering functional data have appeared in the statistical literature, but little has specifically addressed the effect of smoothing on the cluster analysis.

We examine the effect of smoothing functional data on estimating the dissimilarities among objects and on clustering those objects. Through theory and simulations, a shrinkage smoothing method is shown to result in a better estimator of the dissimilarities and a more accurate grouping than using unsmoothed data. Two examples, involving yeast gene expression levels and research library "growth curves," illustrate the technique.


94
In a broad sense, there is an increasing need for methods of analyzing large
functional data sets. With todays advanced monitoring equipment, scientists can
measure responses nearly continuously, over time or some other domain, for a large
number of individuals. Statisticians need to have methods, both exploratory and
confirmatory, to discover genuine trends and variability across great masses of
curves. The field of data mining (Hand et al., 2000) has begun to provide some
answers for these problems, and functional data provide another rich class of
problems to address.


96
(Let w = ||S0||2 ||0||2.)
Ok P O rp2 rp2
& T2[ + 1] + (4 + 2k)w 4- w2 < 0
n n n n rr
=> aw2 + bw + c < 0.
Here, k < n, so a > 0. b > 0. and c < 0. Thus the left hand side < 0 when
w e ([-b yfb2 4ac]/2a, [b 4- y/b2 4ac]/2aj (A.l)
Note that \/b2 4ac = ^y/(4 + 2/c)2 + 4(2n + n2 2k fc2). So the left
endpoint of the interval in (A.l)
-g(4 4- 2k) g v"(4 + 2A:)2 + 4(2n + n2 2k k2)
2^
-A-2k- v716 4- 16k + \k2 + 8n-8k + An2 Ak2
2
A 2k -\/16 4- 8n 4- 4n2 4- 8k
2
-A-2k- 2\]A 4- 2n + n2 + 2k
2
= 2 k \/A + 2n 4- n2 4- 2k.
Similarly, the right endpoint of (A.l) is -2 k 4- \/4 4- 2n 4- n2 + 2k. Hence,
the smoothed-data estimator is better in terms of MSE when
2 k \/A 4- 2n 4- n2 4- 2k < ||S0||2 ||0||2 < -2 k 4- y/A 4- 2n 4- n2 4- 2k. (A.2)
Note that for any n > 1 and integer k < n, the right hand side of (A.2) is
greater than 0. Since ||S0||2 ||0||2 < 0, the second inequality of (A.2) is always
satisfied in our setting. Hence, the smoothed-data estimator is better when:
||S0||2 ||0||2 > -2 k y/A + 2n + ri2 + 2k.


58
So we must show ir[(S SE)2] < £r[(£)2] for any £.
ir[(E)2] = ir[(S'S£ + (I-S'S)£)2]
= r[(S'SE)2] + ir[((I S'S)E)2] + ir[S'S£(I S'S)£]
+ ir[(I-S'S)£S'S£]
= £r[(S'S£)2] + ir[£1/2(I S'S)E(I S'S)E1/2]
4- ir[S£(I S'S)£S] + ir[S£(I S'S)£S']
= £r[(S'S£)2] + ir[£1/2(I S'S)£(I S'S)£1/2]
+ 2ir[S£(I- S'S)£S]
= ir[(S'S£)2] + £r[E1/2(I S'S)E1/2E1/2(I S'S)E1/2]
+ 2ir[SE(I S'S)1/2(I S'S)1/2SS]
> £r[(S'S£)2]
since the last two terms are > 0. This holds since ir[E1/,2(I S S)E1/2E1^2(I
S,S)E1/2] is the sum of the squared elements of the symmetric matrix E1/2(I
S S)£1/2. And £r[SE(I S S)1/,2(I S S)1/2ES] is the sum of the squared elements
of (I-S'S)1/2ES.
We now extend the result in Lemma 4.2 to functional space.
Corollary 4.1 Under the conditions of Lemma 4-2, var[Sij] < var[S¡^mooth^].
Proof of Corollary 4-1-'
Note that since almost sure convergence implies convergence in distribution,
we have
-9'0 A L
n
smooth.)
*ij
-'s's 4 s{smooth).
n 13
Consider £[|d,j|3] = £,[(dj)3] = ^£'[(0,0)3]. We wish to show that this is
bounded for n > 1.


92
across some domain.) The shrinkage estimator is a novel way to unite linear
smoothers and Stein estimation to derive a useful smoothing method.
A good estimation of dissimilarities is only a preliminary step in the eventual
goal, the correct clustering of the data. Thus we have presented a simulation study
which indicates that, for a set of noisy functional data generated from a four-cluster
structure, the data which are smoothed before a K-medoids cluster analysis are
classified more correctly than the observed, unsmoothed data. The correctness of
each clustering was determined by a measure of concordance between the clustering
output and the underlying structure.
Some real functional data were analyzed, with the James-Stein shrinkage
smoother applied to two data sets, which were then clustered via K-medoids. One
data set consisted of 78 yeast genes whose expression level was measured across
time, while the other data set consisted of the "growth curves" (number of volumes
held measured over time) of 67 research libraries.
The applications for this research are wide-ranging, since functional data
are gathered in many fields from the biological to the social sciences. Besides the
applications mentioned in this dissertation, functional data analyses in archaeology,
economics, and biomechanics, among others, are presented by Ramsay and
Silverman (2002).
One obvious extension of this work that could be addressed in the future is
to consider linear smoothers for which S is not idempotent. While a symmetric,
idempotent S corresponds to an important class of smoothers, many other common
smoothers such as kernel smoothers, local polynomial methods, and smoothing
splines have smoothing matrices which are symmetric but not in general idempotent.
The use of a basis function smoother requires, in some sense, that the curves in
the data set all have the same basic structure, since the same set of basis functions


the swap is made. The algorithm stops when no swap can decrease (1.2). Like
K-means, K-medoids does not in general globally optimize its objective function
(Kaufman and Rousseeuw, 1990, p. 110).
Cuesta-Albertos et al. (1997) propose another robust alternative to K-means,
known as "trimmed K-means," which chooses centroids by minimizing an objective
function which is based only on an optimally chosen subset of the data. Robustness
properties of the trimmed K-means method are given by García-Escudero and
Gordaliza (1999).
1.5 Stochastic Methods
In general, the objective function in a cluster analysis can yield optima
which are not global (Selim and Alsultan, 1991). This leads to a weakness of the
traditional deterministic methods. The deterministic algorithms are designed
to severely limit the number of partitions searched (in the hope that good
clusterings will quickly be found). If g is especially ill-behaved, however, finding the
best clusterings may require a more wide-ranging search of C. Stochastic methods
are ideal tools for such a search (Robert and Casella, 1999, Section 1.4).
The major advantage of a stochastic method is that it can explore more of the
space of partitions. A stochastic method can be designed so that it need not always
improve the objective function value at each step. Thus at some steps the method
may move to a partition with a poorer value of g, with the benefits of exploring
parts of C that a deterministic method would ignore. Unlike greedy deterministic
algorithms, stochastic methods sacrifice immediate gain for greater flexibility and
the promise of a potentially better objective function value in another area of C.
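A minimal sketch of such a stochastic search is given below. It is an added illustration, not the method studied in this dissertation; it uses a simulated-annealing-style rule that occasionally accepts a move to a worse partition, and for concreteness the objective g is taken to be the K-medoids criterion computed from a dissimilarity matrix D.

# Simulated-annealing-style search over partitions (illustrative sketch only).
g <- function(labels, D, K) {
  sum(sapply(seq_len(K), function(j) {
    idx <- which(labels == j)
    if (length(idx) == 0) return(0)
    min(colSums(D[idx, idx, drop = FALSE]))   # best medoid within cluster j
  }))
}

anneal_cluster <- function(D, K, n_iter = 5000, temp = 1, cool = 0.999) {
  N <- nrow(D)
  labels <- sample(K, N, replace = TRUE)
  best <- cur <- g(labels, D, K)
  best_labels <- labels
  for (it in seq_len(n_iter)) {
    cand <- labels
    cand[sample(N, 1)] <- sample(K, 1)                       # move one object at random
    val <- g(cand, D, K)
    if (val < cur || runif(1) < exp((cur - val) / temp)) {   # sometimes accept a worse partition
      labels <- cand; cur <- val
      if (cur < best) { best <- cur; best_labels <- labels }
    }
    temp <- temp * cool
  }
  list(clustering = best_labels, objective = best)
}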
The major disadvantage of stochastic cluster analysis is that it takes more
time than the deterministic methods, which can usually be run in a matter of
seconds. At the time most of the traditional methods were proposed, this was an
insurmountable obstacle, but the growth in computing power has narrowed the gap


Ramsay, J. O. and Dalzell, C. J. (1991). Some tools for functional data analysis (Disc: p. 561-572), Journal of the Royal Statistical Society, Series B, Methodological 53: 539-561.
Ramsay, J. O. and Silverman, B. W. (1997). Functional Data Analysis, New York: Springer-Verlag Inc.
Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis: Methods and Case Studies, New York: Springer-Verlag Inc.
Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods, New York: Springer-Verlag Inc.
Rodgers, W. L. (1988). Statistical matching, Encyclopedia of Statistical Sciences (9 vols. plus Supplement), Volume 8, pp. 663-664.
Ross, S. M. (1996). Stochastic Processes, New York: John Wiley and Sons.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20: 53-65.
Selim, S. Z. and Alsultan, K. (1991). A simulated annealing algorithm for the clustering problem, The Journal of the Pattern Recognition Society 24: 1003-1008.
Selim, S. Z. and Ismail, M. A. (1984). K-means-type algorithms: A generalized convergence theorem and characterization of local optimality, IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 81-87.
Simonoff, J. S. (1996). Smoothing Methods in Statistics, New York: Springer-Verlag Inc.
Sloane, N. J. A. and Plouffe, S. (1995). The Encyclopedia of Integer Sequences, San Diego: Academic Press.
Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. and Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Molecular Biology of the Cell 9: 3273-3297.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 197-206.
Stein, M. L. (1995). Locally lattice sampling designs for isotropic random fields (Corr: 1999, V27, p. 1440), The Annals of Statistics 23: 1991-2012.


\[
\begin{aligned}
&= E\Big\{\Big(\int [Sy_i(t) - Sy_j(t)]^2\,dt\Big)^2 - \Big(\int [y_i(t) - y_j(t)]^2\,dt\Big)^2\Big\} \\
&\quad + 2E\Big\{\Big(\int [y_i(t) - y_j(t)]^2\,dt\Big)\Big(\int [\mu_i(t) - \mu_j(t)]^2\,dt\Big)
 - \Big(\int [Sy_i(t) - Sy_j(t)]^2\,dt\Big)\Big(\int [\mu_i(t) - \mu_j(t)]^2\,dt\Big)\Big\}.
\end{aligned}
\]
Let f = f(t) = y_i(t) − y_j(t) and let θ = θ(t) = μ_i(t) − μ_j(t). Note that since S is a linear smoother, Sy_i(t) − Sy_j(t) = Sf.
\[
\begin{aligned}
\Delta &= E\Big\{\Big(\int (Sf)^2\,dt\Big)^2 - \Big(\int f^2\,dt\Big)^2\Big\}
 + 2E\Big\{\Big(\int f^2\,dt\Big)\Big(\int \theta^2\,dt\Big) - \Big(\int (Sf)^2\,dt\Big)\Big(\int \theta^2\,dt\Big)\Big\} \\
&= E\{\|Sf\|^4 - \|f\|^4\} + 2E\{\|f\|^2\|\theta\|^2 - \|Sf\|^2\|\theta\|^2\} \\
&= E\{\|Sf\|^4\} - E\{\|f\|^4\} - 2\|\theta\|^2 E\{\|Sf\|^2 - \|f\|^2\} \\
&= \mathrm{var}(\|Sf\|^2) - \mathrm{var}(\|f\|^2) + \{E(\|Sf\|^2)\}^2 - \{E(\|f\|^2)\}^2
 - 2\|\theta\|^2 E\{\|Sf\|^2 - \|f\|^2\} \\
&= \mathrm{var}(\|Sf\|^2) - \mathrm{var}(\|f\|^2)
 + \{E(\|Sf\|^2) + E(\|f\|^2)\}\{E(\|Sf\|^2) - E(\|f\|^2)\}
 - 2\|\theta\|^2 E\{\|Sf\|^2 - \|f\|^2\}.
\end{aligned}
\]
Since μ_i(t) is in the subspace onto which S projects for all i,
\[
S\theta = \theta \;\Longrightarrow\; \|S\theta\| = \|\theta\|.
\]
Hence
\[
\Delta = \mathrm{var}(\|Sf\|^2) - \mathrm{var}(\|f\|^2)
 + \{E(\|Sf\|^2) + E(\|f\|^2)\}\,E(\|Sf\|^2 - \|f\|^2)
 - (\|S\theta\|^2 + \|\theta\|^2)\,E(\|Sf\|^2 - \|f\|^2).
\]


connection involves a complicated recursive formula, as indicated by Gordon
(1981, p. 42). (Because the connection is so complicated, in practice, K-means
algorithms accept the data matrix as input, but theoretically they could accept the
dissimilarity matrix; it would just slow down the computational time severely.)
The result, then, is that for these methods, a specified dissimilarity matrix yields a
unique final clustering, meaning that for the purpose of cluster analysis, knowing D
is as good as knowing the complete data.
If the observed data have random variation, and hence the measurements on
the objects contain error, then the distances between pairs of objects will have
error. If we want our algorithm to produce a clustering result that is close to the
"true" clustering structure, it seems desirable that the dissimilarity matrix we use
reflect as closely as possible the (unknown) pairwise dissimilarities between the
underlying systematic components of the data.
It is intuitive that if the dissimilarities in the observed distance matrix are
near the truth, then the resulting clustering structure should be near the true
structure, and a small computer example helps to show this. We generate a sample
of 60 3-dimensional normal random variables (with covariance matrix I) such that
15 observations have mean vector (1,3,1)', 15 have mean (10,6,4)', 15 have mean
(1,10,2)', and 15 have mean (5,1,10)'. These means are well-separated enough
that the data naturally form four clusters, and the true clustering is obvious.
Then for 100 iterations we perturb the data with random N(0,σ²) noise having
varying values of σ. For each iteration, we compute the dissimilarities and input
the dissimilarity matrix of the perturbed data into the K-medoids algorithm and
obtain a resulting clustering.
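The following R sketch reproduces the flavor of this small experiment. It is added for illustration; the grid of σ values and the use of Euclidean dissimilarities are assumptions here, not settings taken from the dissertation.

# Perturbation experiment: noisier dissimilarities tend to give poorer clusterings.
library(cluster)
set.seed(1)

means <- rbind(c(1, 3, 1), c(10, 6, 4), c(1, 10, 2), c(5, 1, 10))
truth <- rep(1:4, each = 15)
y     <- means[truth, ] + matrix(rnorm(60 * 3), 60, 3)   # covariance I
d_true <- as.matrix(dist(y))

one_run <- function(sigma) {
  y_pert <- y + matrix(rnorm(60 * 3, sd = sigma), 60, 3)
  d_pert <- as.matrix(dist(y_pert))
  cl     <- pam(as.dist(d_pert), k = 4, diss = TRUE)$clustering
  same_cl    <- outer(cl, cl, "==")
  same_truth <- outer(truth, truth, "==")
  c(mse  = mean((d_pert - d_true)^2),                    # discrepancy in dissimilarities
    prop = mean(same_cl[upper.tri(same_cl)] == same_truth[upper.tri(same_truth)]))
}

res <- t(sapply(seq(0.5, 10, length.out = 100), one_run))
plot(res[, "mse"], res[, "prop"],
     xlab = "MSE of dissimilarities", ylab = "proportion of pairs correctly grouped")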
Figure 1-2 plots, for each perturbed data set, the mean (across elements)
squared discrepancy from the true dissimilarity matrix against the proportion of
all possible pairs of objects which are correctly matched in the clustering resulting


starting points to monitor the stability of the clustering (Johnson and Wichern,
1998, p. 755).
Selim and Ismail (1984) show that K-means clustering does not globally
minimize the criterion
\[
\sum_{i=1}^{N}\sum_{j=1}^{K} z_{ij}\, d_{ij}^2, \tag{1.1}
\]
where z_ij indicates whether object i belongs to cluster j and d_ij denotes the Euclidean distance between object i and the centroid of cluster j, i.e., d_ij² = (y_i − ȳ^(j))'(y_i − ȳ^(j)). This criterion is in essence an objective function g(c). The K-means solution may not even locally minimize this objective function; conditions for which it is locally optimal are given by Selim and Ismail (1984).
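As an added check (not from the dissertation), criterion (1.1) evaluated at a K-means solution can be compared with the total within-cluster sum of squares reported by R's kmeans function; the two quantities agree.

# Criterion (1.1) at a K-means fit equals kmeans()'s tot.withinss.
set.seed(1)
y   <- matrix(rnorm(60 * 3), 60, 3)
fit <- kmeans(y, centers = 4, nstart = 20)

d2 <- rowSums((y - fit$centers[fit$cluster, ])^2)   # squared distance to own centroid
c(criterion_1_1 = sum(d2), tot_withinss = fit$tot.withinss)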
1.4.2 K-medoids and Robust Clustering
Because K-means uses means (centroids) and a least squares technique in
calculating distances, it is not robust with respect to outlying observations.
K-medoids, which is used in the S-plus function pam (partitioning around medoids),
due to Kaufman and Rousseeuw (1987), has gained support as a robust alternative
to K-means. Instead of minimizing a sum of squared Euclidean distances,
K-medoids minimizes a sum of dissimilarities. Philosophically, K-medoids is to
K-means as least-absolute-residuals regression is to least-squares regression.
Consider the objective function
\[
\sum_{i=1}^{N}\sum_{j=1}^{K} z_{ij}\, d(i, m_j). \tag{1.2}
\]
The algorithm begins (in the so-called build-step) by selecting K representative
objects, called medoids, based on an objective function involving a sum of
dissimilarities (see Kaufman and Rousseeuw, 1990, p. 102). It proceeds by
assigning each object i to the cluster j with the closest medoid m_j, i.e., such that
d(i, m_j) < d(i, m_w) for all w = 1, ..., K. Next, in the swap-step, if swapping
any unselected object with a medoid results in the decrease of the value of (1.2),


Figure 1-1: A scatter plot of the agricultural data. (Axis labels: gross national product (GNP); % of GNP due to agriculture.)


Figure 7-3: Plots of clusters of libraries.
Clusters 1-4 based on smoothed curves, plotted as volumes held against time (years since 1967). Mean curves for each cluster shown by solid lines.


Table 7-3: A 4-cluster K-medoids clustering of the observed library data.

Clusters of observed library data
Cluster 1: Arizona, British Columbia, Connecticut, Georgetown, Georgia
Cluster 2: Boston, California-Los Angeles, Florida, Florida State, Iowa, Iowa State, Kansas, McGill, Michigan State, Nebraska, Northwestern, Notre Dame, Oklahoma, Pennsylvania State, Pittsburgh, Princeton, Rochester, Southern California, SUNY-Buffalo, Temple, Virginia, Washington U-St. Louis, Wayne State, Wisconsin
Cluster 3: Brown, California-Berkeley, Chicago, Cincinnati, Colorado, Duke, Illinois-Urbana, Indiana, Kentucky, Louisiana State, Maryland, MIT, Ohio State, Oregon, Pennsylvania, Purdue, Rutgers, Stanford, Syracuse, Tennessee, Toronto, Tulane, Utah, Washington State
Cluster 4: Columbia, Cornell, Harvard, Johns Hopkins, Michigan, Minnesota, Missouri, New York, North Carolina, Southern Illinois, Yale
Interpreting the meaning of the clusters is fairly subjective, but the dramatic
rises in library volumes for the cluster 1 and cluster 2 curves could correspond
to an increase in state-sponsored academic spending in the 1960s and 1970s,
especially since many of the libraries in the first two clusters correspond to public
universities. On the other hand, cluster 4 contains several old, traditional Ivy
League universities (Harvard, Yale, Columbia, Cornell), whose libraries' steady
growth could reflect the extended buildup of volumes over many previous years.
Any clear conclusions, however, would require more extensive study of the subject,
as the cluster analysis is only an exploratory first step.
In this data set, only minor differences in the clustering partition were seen
when comparing the analysis of the smoothed data with that of the observed
data (shown in Table 7-3). Interestingly, in some spots where the two partitions
differed, the clustering of the smooths did place certain libraries from the same
state together when the other method did not. For instance, Illinois-Urbana was
placed with Southern Illinois, and Syracuse was placed with Columbia, Cornell, and
NYU.


Therefore
\[
\begin{aligned}
\log f(y, z \mid p, \beta, \sigma^2)
&= \sum_{i=1}^{N}\sum_{j=1}^{K} z_{ij}\Big\{\log p_j
 + \log\Big[\Big(\frac{1}{\sqrt{2\pi}\,\sigma_j}\Big)^{\!n}\Big]
 - \frac{1}{2\sigma_j^2}(y_i - T\beta_j)'(y_i - T\beta_j)\Big\} \\
&= \sum_{j \in J}\Big\{ m_j \log p_j + m_j n \log\Big(\frac{1}{\sqrt{2\pi}\,\sigma_j}\Big)
 - \frac{1}{2\sigma_j^2}\sum_{i=1}^{N} z_{ij}(y_i - T\beta_j)'(y_i - T\beta_j)\Big\}.
\end{aligned}
\]
Hence, completing the square around the least squares estimator β̂_j,
\[
\begin{aligned}
f(y, z \mid p, \beta, \sigma^2)
&= \prod_{j \in J} p_j^{m_j}\Big(\frac{1}{\sqrt{2\pi}\,\sigma_j}\Big)^{\!m_j n}
 \exp\Big\{-\frac{1}{2\sigma_j^2}\Big[\sum_{i=1}^{N} z_{ij}(y_i - T\hat{\beta}_j)'(y_i - T\hat{\beta}_j)
 + m_j(T\beta_j - T\hat{\beta}_j)'(T\beta_j - T\hat{\beta}_j)\Big]\Big\} \\
&= \prod_{j \in J} p_j^{m_j}\,(2\pi)^{-m_j n/2}\,\sigma_j^{-m_j n}
 \exp\Big\{-\frac{(m_j n - p^*)\hat{\sigma}_j^2}{2\sigma_j^2}
 - \frac{m_j}{2\sigma_j^2}(T\hat{\beta}_j - T\beta_j)'(T\hat{\beta}_j - T\beta_j)\Big\}.
\end{aligned}
\]
Also, \(\prod_{j \in J} f(\hat{\beta}_j, \hat{\sigma}_j^2, m_j \mid \beta_j, \sigma_j^2, p_j)\) is the product, over the nonempty clusters, of the multinomial factor p_j^{m_j}/m_j!, the multivariate normal density of β̂_j, with mean β_j and covariance cov(β̂_j) = (σ_j²/m_j)(T'T)^{-1}, and the density of σ̂_j², for which (m_j n − p*)σ̂_j²/σ_j² ∼ χ²_{m_j n − p*}.
Hence, since
\[
\exp\Big\{-\frac{m_j}{2\sigma_j^2}(\hat{\beta}_j - \beta_j)'T'T(\hat{\beta}_j - \beta_j)\Big\}
= \exp\Big\{-\frac{m_j}{2\sigma_j^2}(T\hat{\beta}_j - T\beta_j)'(T\hat{\beta}_j - T\beta_j)\Big\},
\]
we have that


Figure 6-1: Plot of signal curves chosen for simulations, plotted against time.
Solid line: μ₁(t). Dashed line: μ₆(t). Dotted line: μ₁₁(t). Dot-dashed line: μ₁₆(t).

LIST OF TABLES
Table                                                                        page

1-1 Agricultural data for European countries ............................... 2
3-1 Table of choices of a for various n and k .............................. 29
6-1 Clustering the observed data and clustering the smoothed data
    (independent error structure, n = 200) ................................. 73
6-2 Clustering the observed data and clustering the smoothed data
    (O-U error structure, n = 200) ......................................... 73
6-3 Clustering the observed data and clustering the smoothed data
    (independent error structure, n = 30) .................................. 74
6-4 Clustering the observed data and clustering the smoothed data
    (O-U error structure, n = 30) .......................................... 75
7-1 The classification of the 78 yeast genes into clusters, for both
    observed data and smoothed data ........................................ 82
7-2 A 4-cluster K-medoids clustering of the smooths for the library data ... 89
7-3 A 4-cluster K-medoids clustering of the observed library data .......... 90


Table 6-4: Clustering the observed data and clustering the smoothed data (O-U error structure, n = 30).

O-U error structure
σ                        0.2     0.4     0.5     0.75    1.0     1.25    1.5
Ratio of avg. MSEs       1.26    4.07    5.03    5.69    5.78    5.85    6.04
Avg. prop. (observed)    .851    .383    .280    .251    .225    .213    .233
Avg. prop. (smoothed)    .881    .625    .528    .502    .478    .489    .464

Ratios of average MSE^(obs) to average MSE^(smooth). Average proportions of pairs of objects correctly matched. Averages taken across 100 simulations.
Figure 6-3: Proportion of pairs of objects correctly matched, plotted against σ (n = 30).
Panels: independent error structure and Ornstein-Uhlenbeck error structure. Solid line: smoothed data. Dashed line: observed data.


data and the smoothed data to determine which clustering better captures the
structure of the underlying clusters, as defined by the four signal curves.
The outputs of the cluster analyses were judged by the proportion of pairs of
objects (i.e., curves) correctly placed in the same cluster (or correctly placed in
different clusters, as the case may be). The correct clustering structure, as defined
by the signal curves, places the first five curves in one cluster, the next five in
another cluster, etc. Note that this is a type of measure of concordance between
the clustering produced by the analysis and the true clustering structure.
6.2 Smoothing the Data
The smoother used for the n = 200 example corresponded to a cubic B-spline
basis with 16 interior knots interspersed evenly within the interval [0,20]. The rank
of S was thus k = 20 (Ramsay and Silverman, 1997, p. 49). The value of a in the
James-Stein estimator was chosen to be a = 160, a choice based on the values of
n = 200 and k = 20. For the n = 30 example, a cubic B-spline smoother with six
knots was used (hence k = 10) and the value of a was chosen to be a = 15.
We should note here an interesting operational issue that arises when using
the James-Stein adjustment. The James-Stein estimator of θ given by (3.4) is
obtained by smoothing the differences θ̂ = y_i − y_j, i ≠ j, first, and then adjusting
the Sθ̂'s with the James-Stein method. This estimator for θ is used in the James-Stein
dissimilarity estimator d_ij^{(JS)}. For a data set with N curves, this amounts to
smoothing (N² − N)/2 pairwise differences. It is computationally less intensive to
simply smooth the N curves with the linear smoother S and then adjust each of
the N smooths with the James-Stein method. This leads to the following estimator
of θ:
\[
\Big[S y_i + \Big(1 - \frac{a}{(y_i - S y_i)'(y_i - S y_i)}\Big)(y_i - S y_i)\Big]
- \Big[S y_j + \Big(1 - \frac{a}{(y_j - S y_j)'(y_j - S y_j)}\Big)(y_j - S y_j)\Big].
\]
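For concreteness, the following R sketch (an added illustration, not the dissertation's code) carries out this per-curve adjustment with a projection-type cubic B-spline smoother matching the n = 200 setting described above; the data matrix y, holding one curve per row on the grid t_grid, is assumed to exist.

# Per-curve James-Stein adjustment with a cubic B-spline projection smoother
# (df = 20 gives 16 interior knots and rank k = 20; a = 160 as quoted above).
library(splines)

n      <- 200
t_grid <- seq(0, 20, length.out = n)
B      <- bs(t_grid, df = 20, degree = 3, intercept = TRUE)
S      <- B %*% solve(crossprod(B), t(B))                    # symmetric, idempotent smoother

js_smooth <- function(y_i, S, a = 160) {
  resid  <- y_i - S %*% y_i                                  # y_i - S y_i
  shrink <- 1 - a / sum(resid^2)
  as.vector(S %*% y_i + shrink * resid)                      # S y_i + (1 - a/||y_i - S y_i||^2)(y_i - S y_i)
}

y_smooth <- t(apply(y, 1, js_smooth, S = S, a = 160))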


Figure 1-2: Proportion of pairs of objects correctly grouped vs. MSE of dissimilarities.
from that perturbed matrix. (A correct match for two objects means correctly
putting the two objects in the same cluster or correctly putting the two objects
in different clusters, depending on the "truth".) This proportion serves as a
measure of concordance between the clustering of the perturbed data set and the
underlying clustering structure. We expect that as mean squared discrepancy
among dissimilarities increases, the proportion of pairs correctly clustered will
decrease, and the plot indicates this negative association. This indicates that a
better estimate of the pairwise dissimilarities among the data tends to yield a
better estimate of the true clustering structure.
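The pair-matching proportion used as the concordance measure here is straightforward to compute; the short R function below is an added illustration, not code from the dissertation.

# Proportion of object pairs on which a clustering agrees with the true structure:
# a pair is "correctly matched" if both partitions put it together, or both split it.
pair_match_prop <- function(labels, truth) {
  same_lab   <- outer(labels, labels, "==")
  same_truth <- outer(truth,  truth,  "==")
  ut <- upper.tri(same_lab)
  mean(same_lab[ut] == same_truth[ut])
}

pair_match_prop(c(1, 1, 2, 2, 3), c(1, 1, 1, 2, 2))   # example: returns 0.6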
While this thesis focuses on cluster analysis, several other statistical methods
are typically based on pairwise dissimilarities among data. Examples include
multidimensional scaling (Young and Hamer, 1987) and statistical matching
(Rodgers, 1988). An improved estimate of pairwise dissimilarities would likely
benefit the results of these methods as well.


Note that as the number of measurement points n grows (within a fixed
domain), assuming a certain dependence across measurement points (e.g., a
correlation structure like that of a stationary Ornstein-Uhlenbeck process), the
observed data y_1, ..., y_N closely resemble the pure observed functions y_1(t), ..., y_N(t) on
[0, T]. Therefore this result will be most appropriate for situations with a large
number of measurements taken on a functional process, so that the observed data
vector is nearly a pure function.
As with Case I, we assume our linear smoothing matrix S is symmetric and
idempotent, with r(S) = tr(S) = k. Recall that, according to the functional noise
model for the data, θ̂ ∼ N(θ, Σ), where Σ is, for example, a known covariance
matrix corresponding to a stationary Ornstein-Uhlenbeck process. In this section,
we will assume a Gaussian error process whose covariance structure allows for the
possibility of dependence among the errors at different measurement points.
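For illustration, an exponentially decaying, Ornstein-Uhlenbeck-type covariance matrix and one draw of a dependent error vector can be generated in R as follows. This is an added sketch; the particular parameterization, and the values of σ and φ, are assumptions, not taken from the dissertation.

# Dependent Gaussian errors with covariance sigma^2 * exp(-phi * |t_r - t_s|).
library(MASS)

n      <- 200
t_grid <- seq(0, 20, length.out = n)
sigma  <- 1
phi    <- 2
Sigma  <- sigma^2 * exp(-phi * abs(outer(t_grid, t_grid, "-")))

eps <- mvrnorm(1, mu = rep(0, n), Sigma = Sigma)   # one realization of the error process
plot(t_grid, eps, type = "l")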
We consider the same James-Stein estimator of θ as in Section 3.2, namely
θ̂^{(JS)} (in practice, again, we use the positive-part estimator θ̂_+^{(JS)}), and the same
James-Stein dissimilarity estimator.
Recall the definitions of the following quadratic forms:
\[
q_1 = \hat{\theta}'(I - S)\hat{\theta}, \qquad
q_2 = \hat{\theta}'S\hat{\theta}, \qquad
\tilde{q}_1 = \theta'(I - S)\theta, \qquad
\tilde{q}_2 = \theta'S\theta.
\]
Unlike in Section 3.2, where we assumed an independent error structure, under
the functional noise model q_1 and q_2 are no longer chi-square distributed, nor
are they independent in general.

dissimilarity. The risk of an estimator τ̂ for τ is given by R(τ, τ̂) = E[L(τ, τ̂)],
where L(·) is a loss function (see Lehmann and Casella, 1998, pp. 4-5).
For the familiar case of squared error loss L(τ, τ̂) = (τ − τ̂)², the risk is simply
the mean squared error (MSE) of the estimator. Hence we may compare the MSEs
of two competing estimators and choose the one with the smaller MSE. To this
end, let us examine MSE((1/n)θ̂'Sθ̂) and MSE((1/n)θ̂'θ̂) in estimating (1/n)θ'θ (which
approaches δ_ij as n → ∞).
In this section, we consider the case when θ lies in the linear subspace that S
projects onto, i.e., Sθ = θ. Note that if two arbitrary (discretized) signal curves μ_i
and μ_j are in this linear subspace, then the corresponding θ is also in the subspace,
since in this case
\[
\theta = \mu_i - \mu_j = S\mu_i - S\mu_j = S(\mu_i - \mu_j) = S\theta.
\]
In this idealized situation, a straightforward comparison of MSEs shows that
the smoothed-data estimator improves on the observed-data estimator.
Theorem 3.1 Suppose the observed θ̂ ∼ N(θ, σ²I). Let S be a symmetric and
idempotent linear smoothing matrix of rank k. If θ lies in the linear subspace
defined by S, then the dissimilarity estimator (1/n)θ̂'Sθ̂ has smaller mean squared
error than does (1/n)θ̂'θ̂ in estimating (1/n)θ'θ. That is, the dissimilarity estimator based
on the smooth is better than the one based on the observed data in estimating the
dissimilarities between the underlying signal curves.
Proof of Theorem 3.1:
Let σ² = 1 without loss of generality.
Now, E[(1/n)θ̂'θ̂] = (1/n)E[θ̂'θ̂] = (1/n)[n + θ'θ] = 1 + (1/n)θ'θ.
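A small Monte Carlo experiment makes the conclusion of Theorem 3.1 concrete. The R sketch below is an added check, not part of the dissertation; the particular signal and smoother are arbitrary choices satisfying the theorem's conditions.

# Monte Carlo check: with theta in the column space of an idempotent smoother S,
# theta_hat' S theta_hat / n has smaller MSE than theta_hat' theta_hat / n.
library(splines)
set.seed(1)

n      <- 200
t_grid <- seq(0, 20, length.out = n)
B      <- bs(t_grid, df = 20, degree = 3, intercept = TRUE)
S      <- B %*% solve(crossprod(B), t(B))          # symmetric, idempotent, rank 20

theta  <- as.vector(S %*% sin(t_grid))             # a signal difference lying in the span of S
target <- sum(theta^2) / n

est <- replicate(2000, {
  theta_hat <- theta + rnorm(n)                    # theta_hat ~ N(theta, I)
  c(smooth   = drop(t(theta_hat) %*% S %*% theta_hat) / n,
    observed = sum(theta_hat^2) / n)
})

rowMeans((est - target)^2)                         # MSE: "smooth" should be smaller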


Table 7-2: A 4-cluster K-medoids clustering of the smooths for the library data.

Clusters of smooths for library data
Cluster 1: Arizona, British Columbia, Connecticut, Georgetown, Georgia
Cluster 2: Boston, California-Los Angeles, Florida, Florida State, Iowa, Iowa State, Kansas, Michigan State, Nebraska, Northwestern, Notre Dame, Oklahoma, Pennsylvania State, Pittsburgh, Princeton, Rochester, Southern California, SUNY-Buffalo, Temple, Virginia, Washington U-St. Louis, Wisconsin
Cluster 3: Brown, California-Berkeley, Chicago, Cincinnati, Colorado, Duke, Indiana, Kentucky, Louisiana State, Maryland, McGill, MIT, Ohio State, Oregon, Pennsylvania, Purdue, Rutgers, Stanford, Tennessee, Toronto, Tulane, Utah, Washington State, Wayne State
Cluster 4: Columbia, Cornell, Harvard, Illinois-Urbana, Johns Hopkins, Michigan, Minnesota, Missouri, New York, North Carolina, Southern Illinois, Syracuse, Yale
Figure 7-5: Measurements and B-spline smooth, University of Arizona library.
Points: log of volumes held, 1967-2002. Curve: B-spline smooth.


Figure 7-4: Mean curves for the four library clusters given on the same plot.
Solid: cluster 1. Dashed: cluster 2. Dotted: cluster 3. Dot-dash: cluster 4.


the smoothed-data dissimilarity estimator dominates the observed-data estimator.
For a dependent-error model, an asymptotic domination result is given for
the smoothed-data estimator. We propose an objective function to measure the
goodness of a clustering for smoothed functional data.
Simulations give strong empirical evidence that smoothing functional data
before clustering results in a more accurate grouping than clustering the observed
data without smoothing. Two examples, involving functional data on yeast gene
expression levels and research library "growth curves," illustrate the technique.


in the (proposed) jth cluster, β̂_j is the p*-dimensional least squares estimator
of β_j (based on a regression of all objects in the proposed jth cluster), and σ̂_j²
is the unbiased estimator of σ_j², in terms of the estimators β̂_i and σ̂_i² (for all i
such that z_ij = 1) obtained from the regressions of the individual objects. (See
Appendix A.3 for the derivations of β̂_j and σ̂_j².)
Marginally, the vector m = (m_1, ..., m_N) is multinomial(N, p); conditional
on m, β̂_j is multivariate normal and independent of σ̂_j², with
(m_j n − p*)σ̂_j²/σ_j² ∼ χ²_{m_j n − p*}. Then the following theorem gives the
predictive likelihood objective function, up to a proportionality constant:
Theorem 5.1 If Z_1, ..., Z_N are i.i.d. unit multinomial vectors with unknown
probability vector p = (p_1, ..., p_N), and
\[
Y_i \mid z_{ij} = 1 \;\sim\; N(T\beta_j, \sigma_j^2 I)
\]
for i = 1, ..., N, then the predictive likelihood for z can be written
\[
g(z) = \frac{f(y, z \mid p, \beta, \sigma^2)}
 {\prod_{j \in J} f(\hat{\beta}_j, \hat{\sigma}_j^2, m_j \mid \beta_j, \sigma_j^2, p_j)}.
\]
Proof of Theorem 5.1:
According to the model assumed for the data and for the clusters,
\[
f(y_i \mid z_{ij} = 1, \beta_j, \sigma_j^2)
= (2\pi\sigma_j^2)^{-n/2}\exp\Big[-\frac{1}{2\sigma_j^2}(y_i - T\beta_j)'(y_i - T\beta_j)\Big].
\]


These four curves, shown in Figure 6-1, were intentionally chosen to be similar
enough to provide a good test for the clustering methods that attempted to
group the curves into the correct clustering structure, yet different enough that
they represented four clearly distinct processes. They all contain some form of
periodicity, which is more prominent in some curves than others. In short, the
curves were chosen so that, when random noise was added to them, they would be
difficult but not impossible to distinguish.
After the "observed" data were generated, the pairwise dissimilarities among
the curves (measured by squared L2 distance, with the integral approximated
on account of the discretized data) were calculated. Denote these by d_ij, i =
1, ..., N, j = 1, ..., N. The data were then smoothed using a linear smoother S
(see below for details) and the James-Stein shrinkage adjustment. The pairwise
dissimilarities among the smoothed curves were then calculated. Denote these by
d_ij^{(smooth)}, i = 1, ..., N, j = 1, ..., N. The sample mean squared error criterion was
used to judge whether d_ij or d_ij^{(smooth)} better estimated the dissimilarities among the
signal curves, denoted δ_ij, i = 1, ..., N, j = 1, ..., N. That is,
\[
MSE^{(obs)} = \frac{2}{N(N-1)}\sum_{i=1}^{N}\sum_{j>i}\big(d_{ij} - \delta_{ij}\big)^2
\]
was compared to
\[
MSE^{(smooth)} = \frac{2}{N(N-1)}\sum_{i=1}^{N}\sum_{j>i}\big(d_{ij}^{(smooth)} - \delta_{ij}\big)^2.
\]
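The squared L2 dissimilarities described here amount to a Riemann-sum approximation of the integrated squared difference between two discretized curves. The following R sketch is an added illustration; the function name and the input matrix curves are hypothetical.

# Squared L2 dissimilarities between discretized curves, approximating the integral
# by a Riemann sum with grid spacing dt.  `curves` holds one curve per row.
sq_l2_dissim <- function(curves, dt) {
  N <- nrow(curves)
  D <- matrix(0, N, N)
  for (i in seq_len(N - 1)) {
    for (j in (i + 1):N) {
      D[i, j] <- D[j, i] <- sum((curves[i, ] - curves[j, ])^2) * dt
    }
  }
  D
}

# Example with the n = 200 grid on [0, 20] used in the simulations:
# D_obs <- sq_l2_dissim(y.obs, dt = 20 / 200)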
Once the dissimilarities were calculated, we used the resulting dissimilarity
matrix to cluster the observed data, and then to cluster the smoothed data. The
clustering algorithm used was the K-medoids method, implemented by the pam
function in R. We examine the resulting clustering structure of both the observed


Recall that, when νσ̂²/σ² ∼ χ²_ν,
\[
E[a\hat{\sigma}^2] = a\sigma^2, \qquad
E[a^2\hat{\sigma}^4] = a^2\sigma^4\,\frac{\nu+2}{\nu}, \qquad
E[a^3\hat{\sigma}^6] = a^3\sigma^6\,\frac{(\nu+2)(\nu+4)}{\nu^2}, \qquad
E[a^4\hat{\sigma}^8] = a^4\sigma^8\,\frac{(\nu+2)(\nu+4)(\nu+6)}{\nu^3}.
\]
Substituting these moments into the risk difference expresses the bound in terms of
a, ν, σ⁴, the traces tr(Σ), tr(SΣ), and tr[(I − S)Σ], and the expectations E[1/q_1]
and E[1/q_1²].
As in Section 3.3, define q_1' = q_1/σ² = (θ̂/σ)'(I − S)(θ̂/σ). Here, since
θ̂ ∼ N(θ, σ²Σ), we have (θ̂/σ) ∼ N(ζ, Σ), where ζ = θ/σ, so that the distribution of q_1'
is a function of ζ alone. Rewriting the bound in terms of q_1' and dividing the
inequality through by σ⁴ > 0 leaves a condition that involves only a, ν, ζ, tr(Σ),
tr(SΣ), tr[(I − S)Σ], E[1/q_1'], and E[1/q_1'²].


Table 6-1: Clustering the observed data and clustering the smoothed data (independent error structure, n = 200).

Independent error structure
σ                        0.4     0.5     0.75    1.0     1.25    1.5     2.0
Ratio of avg. MSEs       12.86   30.10   62.66   70.89   72.40   72.24   71.03
Avg. prop. (observed)    .668    .511    .353    .325    .330    .327    .318
Avg. prop. (smoothed)    .937    .715    .428    .380    .372    .352    .332

Ratios of average MSE^(obs) to average MSE^(smooth). Average proportions of pairs of objects correctly matched. Averages taken across 100 simulations.

Table 6-2: Clustering the observed data and clustering the smoothed data (O-U error structure, n = 200).

O-U error structure
σ                        0.5     0.75    1.0     1.25    1.5     2.0     2.5
Ratio of avg. MSEs       3.53    39.09   126.53  171.39  176.52  168.13  160.52
Avg. prop. (observed)    .997    .607    .445    .369    .330    .312    .310
Avg. prop. (smoothed)    1       .944    .620    .489    .396    .326    .334

Ratios of average MSE^(obs) to average MSE^(smooth). Average proportions of pairs of objects correctly matched. Averages taken across 100 simulations.
the ratios of average MSEs being greater than 1, smoothing the data results in
an improvement in estimating the dissimilarities, and, as shown graphically in
Figure 6-2, it also results in a greater proportion of pairs of objects correctly
clustered in every case. Similar results are seen for the n = 30 example in Table
6-3 and Table 6-4 and in Figure 6-3.
The improvement, it should be noted, appears to be most pronounced for
the medium values of σ², which makes sense. If there is small variability in the
data, smoothing yields little improvement since the observed data is not very
noisy to begin with. If the variability is quite large, the signal is relatively weak
and smoothing may not capture the underlying structure precisely, although the
smooths still perform far better than the observed data in estimating the true
dissimilarities. When the magnitude of the noise is moderate, the advantage of
smoothing in capturing the underlying clustering structure is sizable.


where χ²_{n−k} is a central χ² random variable with n − k degrees of freedom. So by replacing q_1 with
χ²_{n−k} in (3.8), we obtain an upper bound Δ_U for Δ:
\[
\Delta_U = 4a^2 - 4an + E\bigg[\frac{a^4}{(\chi^2_{n-k})^2}\bigg]
 + E\bigg[\frac{2a^2 k}{\chi^2_{n-k}}\bigg]
 + 2a^2\Big(1 - \frac{2a}{n-k}\Big). \tag{3.9}
\]
Note that E[(χ²_{n−k})⁻¹] = 1/(n − k − 2) and E[(χ²_{n−k})⁻²] = 1/[(n − k − 2)(n − k − 4)].
So taking expectations, we have
\[
\begin{aligned}
\Delta_U &= 4a^2 - 4an + \frac{a^4}{(n-k-2)(n-k-4)} + \frac{2a^2 k}{n-k-2}
 + 2a^2\Big(1 - \frac{2a}{n-k}\Big) \\
&= \frac{1}{(n-k-2)(n-k-4)}\,a^4 - \frac{4}{n-k}\,a^3
 + \Big(\frac{2k}{n-k-2} + 6\Big)a^2 - 4n\,a.
\end{aligned}
\]
We are looking for values of a which make Δ_U negative. The fourth-degree
equation Δ_U(a) = 0 can be solved analytically. (Two of the roots are imaginary, as
noted in Appendix B.1.) One real root is clearly 0; call the second real root r.
Theorem 3.2 If n − k > 4, then the nontrivial real root r of Δ_U(a) = 0 is positive.
Furthermore, for any choice a ∈ (0, r), the upper bound Δ_U < 0, implying Δ < 0.
That is, for 0 < a < r, the risk difference is negative and the smoothed-data
dissimilarity estimator is better than d_ij.
Proof of Theorem 3.2: Write Δ_U(a) = 0 as c₄a⁴ + c₃a³ + c₂a² + c₁a = 0, where
c₄, c₃, c₂, c₁ are the respective coefficients of Δ_U(a). It is clear that if n − k > 4,
then c₄ > 0, c₃ < 0, c₂ > 0, c₁ < 0.
Note that r also solves the cubic equation f_c(a) = c₄a³ + c₃a² + c₂a + c₁ = 0.
Since its leading coefficient c₄ > 0,
\[
\lim_{a \to -\infty} f_c(a) = -\infty, \qquad \lim_{a \to \infty} f_c(a) = \infty.
\]
Since f_c(a) has only one real root, it crosses the a-axis only once. Since its vertical
intercept c₁ < 0, its horizontal intercept r > 0.
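As an added numerical illustration (not from the dissertation), the root structure described in the proof can be examined with R's polyroot function. The coefficients below follow the reconstruction of Δ_U given above and are shown for n = 200 and k = 20 only as an example of the sign pattern c₄ > 0, c₃ < 0, c₂ > 0, c₁ < 0 and of the single positive real root r.

# Coefficients of the cubic f_c(a) = c4 a^3 + c3 a^2 + c2 a + c1, per the
# reconstructed form of Delta_U; polyroot() expects increasing powers.
delta_u_coefs <- function(n, k) {
  c(c1 = -4 * n,
    c2 = 2 * k / (n - k - 2) + 6,
    c3 = -4 / (n - k),
    c4 = 1 / ((n - k - 2) * (n - k - 4)))
}

co    <- delta_u_coefs(n = 200, k = 20)
roots <- polyroot(co)
r     <- Re(roots[abs(Im(roots)) < 1e-6])   # the single real root, which is positive
r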