SMOOTHING FUNCTIONAL DATA

FOR CLUSTER ANALYSIS

By

DAVID B. HITCHCOCK

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL

OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2004

Copyright 2004

by

David B. Hitchcock

To Cassandra

ACKNOWLEDGMENTS

It has been a long but rewarding journey during my time as a student, and I

have many people I need to thank.

First of all, I thank God for my having more blessings in my life than I

possibly deserve. Next, I thank my family: the support and love of my parents and

sister, and the pride they have shown in me, have truly inspired me and kept me

determined to succeed in my studies. And I especially thank my wife Sandi, who

has shown so much confidence in me and has given me great encouragement with

her love and patience.

I owe a great deal of thanks to my Ph.D. advisors, George Casella and Jim

Booth. In addition to being great people, they have, through their fine examples,

taught me much about the research process: the importance of knowing the

literature well, of looking at simple cases to understand difficult concepts, and of

writing results carefully. I also thank my other committee members, Jim Hobert,

Brett Presnell, and John Henretta, accomplished professors who have selflessly

given their time for me. All the teachers I have had at Florida deserve my thanks,

especially Alan Agresti, Andy Rosalsky, Denny Wackerly, and Malay Ghosh.

I could not have persevered without the support of my fellow students here

at Florida. I especially thank Bernhard and Karabi, who entered the graduate

program with me and who have shared my progress. I also thank Carsten, Terry,

Jeff, Siuli, Sounak, Jamie, Dobrin, Samiran, Damaris, Christian, Ludwig, Brian,

Keith, and many others for all their help and friendship.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTERS

1 INTRODUCTION TO CLUSTER ANALYSIS
1.1 The Objective Function
1.2 Measures of Dissimilarity
1.3 Hierarchical Methods
1.4 Partitioning Methods
1.4.1 K-means Clustering
1.4.2 K-medoids and Robust Clustering
1.5 Stochastic Methods
1.6 Role of the Dissimilarity Matrix

2 INTRODUCTION TO FUNCTIONAL DATA AND SMOOTHING
2.1 Functional Data
2.2 Introduction to Smoothing
2.3 Dissimilarities Between Curves
2.4 Previous Work
2.5 Summary

3 CASE I: DATA FOLLOWING THE DISCRETE NOISE MODEL
3.1 Comparing MSEs of Dissimilarity Estimators when θ is in the Linear Subspace Defined by S
3.2 A James-Stein Shrinkage Adjustment to the Smoother
3.3 Extension to the Case of Unknown σ²
3.4 A Bayes Result: d^(smooth) and a Limit of Bayes Estimators

4 CASE II: DATA FOLLOWING THE FUNCTIONAL NOISE MODEL
4.1 Comparing MSEs of Dissimilarity Estimators when θ is in the Linear Subspace Defined by S
4.2 James-Stein Shrinkage Estimation in the Functional Noise Model
4.3 Extension to the Case when σ² is Unknown
4.4 A Pure Functional Analytic Approach to Smoothing

5 A PREDICTIVE LIKELIHOOD OBJECTIVE FUNCTION FOR CLUSTERING FUNCTIONAL DATA

6 SIMULATIONS
6.1 Setup of Simulation Study
6.2 Smoothing the Data
6.3 Simulation Results
6.4 Additional Simulation Results

7 ANALYSIS OF REAL FUNCTIONAL DATA
7.1 Analysis of Expression Ratios of Yeast Genes
7.2 Analysis of Research Libraries

8 CONCLUSIONS AND FUTURE RESEARCH

APPENDIX

A DERIVATIONS AND PROOFS
A.1 Proof of Conditions for Smoothed-data Estimator Superiority
A.2 Extension of Stochastic Domination Result
A.3 Definition and Derivation of and

B ADDITIONAL FORMULAS AND CONDITIONS
B.1 Formulas for Roots of Δ_U = 0 for General n and k
B.2 Regularity Condition for Positive Semidefinite G in Laplace Approximation

REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

1-1 Agricultural data for European countries.

3-1 Table of choices of a for various n and k.

6-1 Clustering the observed data and clustering the smoothed data (independent error structure, n = 200).

6-2 Clustering the observed data and clustering the smoothed data (O-U error structure, n = 200).

6-3 Clustering the observed data and clustering the smoothed data (independent error structure, n = 30).

6-4 Clustering the observed data and clustering the smoothed data (O-U error structure, n = 30).

7-1 The classification of the 78 yeast genes into clusters, for both observed data and smoothed data.

7-2 A 4-cluster K-medoids clustering of the smooths for the library data.

7-3 A 4-cluster K-medoids clustering of the observed library data.


LIST OF FIGURES

1-1 A scatter plot of the agricultural data.

1-2 Proportion of pairs of objects correctly grouped vs. MSE of dissimilarities.

3-1 Plot of Δ_U against a for varying n and for k = 5.

3-2 Plot of simulated Δ and Δ_U against a for n = 20, k = 5.

4-1 Plot of asymptotic upper bound and simulated Δ's, for Ornstein-Uhlenbeck-type data.

6-1 Plot of signal curves chosen for simulations.

6-2 Proportion of pairs of objects correctly matched, plotted against a (n = 200).

6-3 Proportion of pairs of objects correctly matched, plotted against a (n = 30).

6-4 Proportion of pairs of objects correctly matched, plotted against a, when the number of clusters is misspecified.

6-5 Proportion of pairs of objects correctly matched, plotted against a (n = 30).

7-1 Plots of clusters of genes.

7-2 Edited plots of clusters of genes which were classified differently as observed curves and smoothed curves.

7-3 Plots of clusters of libraries.

7-4 Mean curves for the four library clusters given on the same plot.

7-5 Measurements and B-spline smooth, University of Arizona library.


Abstract of Dissertation Presented to the Graduate School

of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Doctor of Philosophy

SMOOTHING FUNCTIONAL DATA

FOR CLUSTER ANALYSIS

By

David B. Hitchcock

August 2004

Chair: George Casella

Cochair: James G. Booth

Major Department: Statistics

Cluster analysis, which attempts to place objects into reasonable groups on

the basis of statistical data measured on them, is an important exploratory tool

for many scientific studies. In particular, we explore the problem of clustering

functional data, which arise as curves, characteristically observed as part of

a continuous process. In recent years, methods for smoothing and clustering

functional data have appeared in the statistical literature, but little work has

appeared specifically addressing the effect of smoothing on the cluster analysis.

We discuss the purpose of cluster analysis and review some common clustering

methods, with attention given to both deterministic and stochastic methods.

We address functional data and the related field of smoothing, and a measure of

dissimilarity for functional data is suggested.

We examine the effect of smoothing functional data on estimating the dis-

similarities among objects and on clustering those objects. We prove that a

shrinkage method of smoothing results in a better estimator of the dissimilarities

among a set of noisy curves. For a model having independent noise structure,


the smoothed-data dissimilarity estimator dominates the observed-data estima-

tor. For a dependent-error model, an asymptotic domination result is given for

the smoothed-data estimator. We propose an objective function to measure the

goodness of a clustering for smoothed functional data.

Simulations give strong empirical evidence that smoothing functional data

before clustering results in a more accurate grouping than clustering the observed

data without smoothing. Two examples, involving functional data on yeast gene

expression levels and research library "growth curves," illustrate the technique.


CHAPTER 1

INTRODUCTION TO CLUSTER ANALYSIS

The goal of cluster analysis is to find groups, or clusters, in data. The objects

in a data set (often univariate or multivariate observations) should be grouped so

that objects in the same cluster are similar and objects in different clusters are

dissimilar (Kaufman and Rousseeuw, 1990, p. 1). How to measure the similarity

of objects is something that depends on the application, yet is a fundamental issue

in cluster analysis. Sometimes in a multivariate data set it is not the observations

that are clustered, but rather the variables (according to some similarity measure

on the variables) and this case is dealt with slightly differently (Johnson and

Wichern, 1998, p. 735). More often, though, it is the objects that are clustered

according to their observed values of one or more variables, and this introduction

will chiefly focus on this situation.

The general clustering setup for multivariate data is as follows: In a data set

there are N objects on which are measured p variables. Hence we represent this

by N vectors y_1, ..., y_N in R^p. We wish to group the N objects into K clusters,

1 ≤ K ≤ N. Denote the possible clusterings of N objects into nonempty groups

as C = {c_1, ..., c_B(N)}. The number of possible clusterings B(N) depends on the

number of objects N and is known as the Bell number (Sloane and Plouffe, 1995,

entry M4981).
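The growth of B(N) can be checked directly. The following sketch (illustrative only; it uses the Bell-triangle recurrence, which is not discussed in the text, and the function name is ours) reproduces the two counts quoted below for the agriculture data:

```python
def bell(n):
    """Compute the Bell number B(n) via the Bell triangle:
    each row starts with the last entry of the previous row, and each
    subsequent entry adds the entry above-left; B(n) is the last entry
    of row n."""
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for entry in row:
            new_row.append(new_row[-1] + entry)
        row = new_row
    return row[-1]

print(bell(12))  # number of partitions of the 12 European countries
print(bell(19))
```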

As a simple example, consider the data in Table 1-1. We wish to group the

12 objects (the European countries) based on the values of two variables, gross

national product (gnp) and percent of gnp due to agriculture (agric). When the

data are univariate or two-dimensional and N is not too large, it is often easy to


Table 1-1: Agricultural data for European countries.

country agric gnp

Belgium 2.7 16.8

Denmark 5.7 21.3

Germany 3.5 18.7

Greece 22.2 5.9

Spain 10.9 11.4

France 6.0 17.8

Ireland 14.0 10.9

Italy 8.5 16.6

Luxembourg 3.5 21.0

Netherlands 4.3 16.4

Portugal 17.4 7.8

United Kingdom 2.3 14.0

construct a scatter plot and determine the clusters by eye (see Figure 1-1). For

higher dimensions, however, automated clustering methods become necessary.

A statistical field closely related to cluster analysis is discriminant analysis,

which also attempts to classify objects into groups. The main difference is that

in discriminant analysis there exists a training sample of objects whose group

memberships are known, and the goal is to use characteristics of the training

sample to devise a rule which classifies future objects into the prespecified groups.

In cluster analysis, however, the clusters are unknown, in form and often in

number. Thus cluster analysis is more exploratory in nature, whereas discriminant

analysis allows more precise statements about the probability of making inferential

errors (Gnanadesikan et al., 1989).

In contrast with discriminant analysis, where the number of groups and the

groups' definitions are known, cluster analysis presents two separate questions:

How many groups are there? And which objects should be allocated to which

groups, i.e., how should the objects be partitioned into groups? It is partially

because of the added difficulty of answering both of these questions that the field

Figure 1-1: A scatter plot of the agricultural data.

of cluster analysis has not built such an extensive and thorough theory as has

discriminant analysis (Gnanadesikan et al., 1989).

1.1 The Objective Function

Naturally, it is desirable that a clustering algorithm have some optimal

property. We would like a mathematical criterion to measure how well-grouped the

data are at any point in the algorithm. A convenient way to define such a criterion

is via an objective function, a real-valued function of the possible partitions of the

objects. Mathematically, if C = {c_1, ..., c_B(N)} represents the space of all possible

partitions of N objects, a typical objective function is a mapping g : C → R^+.

Ideally, a good objective function g will increase (or decrease, depending on

the formulation of g) monotonically as the partitions group more similar objects

in the same cluster and more dissimilar objects in different clusters. Given a good

objective function g, the ideal algorithm would optimize g, resulting in the best

possible partition.

When N is very small, we might enumerate the possible partitions c_1, ..., c_B(N),

calculate the objective function for each, and choose the ci with the optimal g(ci).

However, B(N) grows rapidly with N. For example, for the European agriculture

data, B(12) = 4,213,597, while B(19) = 5,832,742,205,057 (Sloane and Plouffe,

1995, entry M1484). For moderate to large N, this enumeration is infeasible

(Johnson and Wichern, 1998, p. 727).

Since full enumeration is usually impossible, clustering methods tend to be

algorithms systematically designed to search for good partitions. But such deter-

ministic algorithms cannot guarantee the discovery of the best overall partition.

1.2 Measures of Dissimilarity

A fundamental question for most deterministic algorithms is which measure of

dissimilarity (distance) to use. A popular choice is the Euclidean distance between


object i and object j

d_E(i, j) = sqrt( (y_i1 - y_j1)^2 + (y_i2 - y_j2)^2 + ... + (y_ip - y_jp)^2 ).

The Manhattan (city-block) distance is

d_M(i, j) = |y_i1 - y_j1| + |y_i2 - y_j2| + ... + |y_ip - y_jp|.

Certain types of data require specialized dissimilarity measures. The Canberra

metric and Czekanowski coefficient (see Johnson and Wichern, 1998, p. 729) are

two dissimilarity measures for nonnegative variables, while Johnson and Wichern

(1998, p. 733) give several dissimilarity measures for binary variables.

Having chosen a dissimilarity measure, one can construct an N x N (symmetric)

dissimilarity matrix (also called the distance matrix) D whose rows and columns

represent the objects in the data set, such that D_ij = D_ji = d(i, j).
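As an illustration (not part of the original text; the function name is ours, and the example rows are the first three countries of Table 1-1), both distances can be computed and assembled into a dissimilarity matrix as follows:

```python
import numpy as np

def dissimilarity_matrix(Y, metric="euclidean"):
    """Build the N x N symmetric dissimilarity matrix D with
    D[i, j] = D[j, i] = d(i, j) for the rows of the data matrix Y."""
    N = Y.shape[0]
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            diff = Y[i] - Y[j]
            if metric == "euclidean":
                d = np.sqrt(np.sum(diff ** 2))
            else:  # Manhattan (city-block) distance
                d = np.sum(np.abs(diff))
            D[i, j] = D[j, i] = d
    return D

# (agric, gnp) values for Belgium, Denmark, Germany from Table 1-1
Y = np.array([[2.7, 16.8], [5.7, 21.3], [3.5, 18.7]])
D = dissimilarity_matrix(Y)
```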

In the following sections, some common methods of cluster analysis are

presented and categorized by type.

1.3 Hierarchical Methods

Hierarchical methods can be either agglomerative or divisive. Kaufman and

Rousseeuw (1990) compiled a set of methods which were adopted into the cluster

library of the S-plus computing package, and often the methods are referred to by

the names Kaufman and Rousseeuw gave them.

Agglomerative methods begin with N clusters; that is, each observation forms

its own cluster. The algorithm successively joins clusters, yielding N - 1 clusters,

then N - 2 clusters, and so on until there remains only one cluster containing all

N objects. The S-plus functions agnes and hclust perform agglomerative clustering.

Common agglomerative methods include linkage methods and Ward's method.

All agglomerative methods, at each step, join the two clusters which are

considered "closest." The difference among the methods is how each defines


"closeness." Each method, however, defines the distance between two clusters using

some function of the dissimilarities among individual objects in those clusters.

Divisive methods begin with all objects in one cluster and successively split

clusters, resulting in partitions of 1, 2, 3, ..., and finally N clusters. The S-plus

function diana performs divisive analysis.

1.4 Partitioning Methods

While hierarchical methods seek good partitions for all K = 1,..., N,

partitioning methods fix the number of clusters and seek a good partition for

that specific K. Although the hierarchical methods may seem to be more flexible,

they have an important disadvantage. Once two clusters have been joined in an

agglomerative method (or split in a divisive method), this move can never be

undone, although later in the algorithm undoing the move might improve the

clustering criterion (Kaufman and Rousseeuw, 1990, p. 44). Hence hierarchical

methods severely limit how much of the partition space C can be explored. While

this phenomenon results in higher computational speed for hierarchical algorithms,

its clear disadvantage often necessitates the use of the less rigid partitioning

methods (Kaufman and Rousseeuw, 1990, p. 44).

In practice, Johnson and Wichern (1998, p. 760) recommend running a

partitioning method for several reasonable choices of K and subjectively examining

the resulting clusterings. Finding an objective, data-dependent way to specify K

is an open question that has spurred recent research. Rousseeuw (1987) proposes

to select K to maximize the average silhouette width S(K). For each object i, the

silhouette value

s(i) = (b(i) - a(i)) / max{a(i), b(i)},

where a(i) = average dissimilarity of i to all other objects in its cluster (say, cluster

A); b(i) = min_{R ≠ A} d(i, R); and d(i, R) = average dissimilarity of i to all objects in

cluster R. Then S(K) is the average s(i) over all objects i in the data set.
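A direct computation of the average silhouette width from a dissimilarity matrix might look like the following sketch (illustrative code, not Rousseeuw's implementation; all names are ours):

```python
import numpy as np

def average_silhouette_width(D, labels):
    """Average of s(i) = (b(i) - a(i)) / max{a(i), b(i)} over all objects,
    given a dissimilarity matrix D and cluster labels."""
    N = len(labels)
    s = np.zeros(N)
    for i in range(N):
        # a(i): average dissimilarity to the other members of i's own cluster
        own = [j for j in range(N) if labels[j] == labels[i] and j != i]
        a = D[i, own].mean() if own else 0.0
        # b(i): smallest average dissimilarity to any other cluster R != A
        b = min(D[i, [j for j in range(N) if labels[j] == k]].mean()
                for k in set(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```

In practice one would compute this for several candidate values of K and keep the K with the largest average width.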

Tibshirani et al. (2001a) propose a "Gap" statistic to choose K. In a separate

paper, Tibshirani et al. (2001b) suggest treating the problem as in model selection,

and choosing K via a "prediction strength" measure. Sugar and James (2003)

suggest a nonparametric approach to determining K based on the "distortion,"

a measure of within-cluster variability. Fraley and Raftery (1998) use the Bayes

Information Criterion to select the number of clusters. Milligan and Cooper (1985)

give a survey of earlier methods of choosing K.

Model-based clustering takes a different perspective on the problem. It as-

sumes the data follow a mixture of K underlying probability distributions. The

mixture likelihood is then maximized, and the maximum likelihood estimate of

the mixture parameter vector determines which objects belong to which subpop-

ulations. Fraley and Raftery (2002) provide an extensive survey of model-based

clustering methods.

1.4.1 K-means Clustering

Among the oldest and most well-known partitioning methods is K-means

clustering, due to MacQueen (1967). Note that the centroid of a cluster is the

p-dimensional mean of the objects in that cluster. After the choice of K, the

K-means algorithm initially partitions the objects arbitrarily into K clusters.

(Alternatively, one can choose K centroids as an initial step.) One at a time, each

object is moved to the cluster whose centroid is closest (usually Euclidean distance

is used to determine this). When an object is moved, centroids are immediately

recalculated for the cluster gaining the object and the cluster losing it. The method

repeatedly cycles through the list of objects until no reassignments of objects take

place (Johnson and Wichern, 1998, p. 755).
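The one-object-at-a-time scheme described above can be sketched as follows (a minimal illustration of the MacQueen-style update, not an implementation from the text; all names are ours, and a guard keeps a cluster from being emptied):

```python
import numpy as np

def k_means(Y, labels, max_passes=100):
    """MacQueen-style K-means: cycle through the objects, move each to the
    cluster with the closest centroid (Euclidean distance), and immediately
    recalculate the centroids of the gaining and losing clusters. Stop when
    a full pass makes no reassignments."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels).copy()
    K = labels.max() + 1
    centroids = np.array([Y[labels == k].mean(axis=0) for k in range(K)])
    for _ in range(max_passes):
        moved = False
        for i in range(len(Y)):
            dists = ((centroids - Y[i]) ** 2).sum(axis=1)
            k_new = int(dists.argmin())
            if k_new != labels[i] and (labels == labels[i]).sum() > 1:
                k_old = labels[i]
                labels[i] = k_new
                centroids[k_old] = Y[labels == k_old].mean(axis=0)
                centroids[k_new] = Y[labels == k_new].mean(axis=0)
                moved = True
        if not moved:
            break
    return labels, centroids
```

Rerunning the sketch from several initial partitions, as the text recommends, is a one-line loop over different `labels` arguments.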

A characteristic of the K-means method is that the final clustering depends

in part on the initial configuration of the objects (or initial specification of the

centroids). Hence in practice, one typically reruns the algorithm from various


starting points to monitor the stability of the clustering (Johnson and Wichern,

1998, p. 755).

Selim and Ismail (1984) show that K-means clustering does not globally

minimize the criterion

sum_{i=1}^{N} sum_{j=1}^{K} d_ij^2 I[i ∈ j],    (1.1)

where d_ij denotes the Euclidean distance between object i and the centroid of

cluster j, i.e., d_ij^2 = (y_i - ybar^(j))'(y_i - ybar^(j)). This criterion is in essence an objective

function g(c). The K-means solution may not even locally minimize this objective

function; conditions for which it is locally optimal are given by Selim and Ismail

(1984).

1.4.2 K-medoids and Robust Clustering

Because K-means uses means (centroids) and a least squares technique in

calculating distances, it is not robust with respect to outlying observations. K-

medoids, which is used in the S-plus function pam (partitioning around medoids),

due to Kaufman and Rousseeuw (1987), has gained support as a robust alternative

to K-means. Instead of minimizing a sum of squared Euclidean distances, K-

medoids minimizes a sum of dissimilarities. Philosophically, K-medoids is to

K-means as least-absolute-residuals regression is to least-squares regression.

Consider the objective function

sum_{i=1}^{N} sum_{j=1}^{K} d(i, m_j) I[i ∈ j].    (1.2)

The algorithm begins (in the so-called build-step) by selecting K representative

objects, called medoids, based on an objective function involving a sum of

dissimilarities (see Kaufman and Rousseeuw, 1990, p. 102). It proceeds by as-

signing each object i to the cluster j with the closest medoid mj, i.e., such that

d(i, m_j) ≤ d(i, m_w) for all w = 1, ..., K. Next, in the swap-step, if swapping

any unselected object with a medoid results in the decrease of the value of (1.2),

9

the swap is made. The algorithm stops when no swap can decrease (1.2). Like

K-means, K-medoids does not in general globally optimize its objective function

(Kaufman and Rousseeuw, 1990, p. 110).
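The assign-and-swap procedure might be sketched as follows (a simplified illustration that starts from user-supplied medoids rather than the build-step of Kaufman and Rousseeuw; all names are ours):

```python
import numpy as np
from itertools import product

def k_medoids(D, K, init_medoids):
    """Naive PAM-style K-medoids on a dissimilarity matrix D: assign each
    object to its closest medoid, then accept any medoid/non-medoid swap
    that lowers the total dissimilarity, until no swap helps."""
    medoids = list(init_medoids)
    N = D.shape[0]

    def cost(meds):
        # sum over objects of the dissimilarity to the nearest medoid
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        for m_idx, h in product(range(K), range(N)):
            if h in medoids:
                continue
            trial = medoids.copy()
            trial[m_idx] = h  # candidate swap: medoid m_idx -> object h
            if cost(trial) < cost(medoids):
                medoids = trial
                improved = True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```

Because every accepted swap strictly lowers the objective, the loop terminates at a local (not necessarily global) optimum, mirroring the remark above.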

Cuesta-Albertos et al. (1997) propose another robust alternative to K-means

known as "trimmed K-means," which chooses centroids by minimizing an objective

function which is based only on an optimally chosen subset of the data. Robustness

properties of the trimmed K-means method are given by Garcia-Escudero and

Gordaliza (1999).

1.5 Stochastic Methods

In general, the objective function in a cluster analysis can yield optima

which are not global (Selim and Alsultan, 1991). This leads to a weakness of the

traditional deterministic methods. The deterministic algorithms are designed

to severely limit the number of partitions searched (in the hope that "good"

clusterings will quickly be found). If g is especially ill-behaved, however, finding the

best clusterings may require a more wide-ranging search of C. Stochastic methods

are ideal tools for such a search (Robert and Casella, 1999, Section 1.4).

The major advantage of a stochastic method is that it can explore more of the

space of partitions. A stochastic method can be designed so that it need not always

improve the objective function value at each step. Thus at some steps the method

may move to a partition with a poorer value of g, with the benefits of exploring

parts of C that a deterministic method would ignore. Unlike "greedy" deterministic

algorithms, stochastic methods sacrifice immediate gain for greater flexibility and

the promise of a potentially better objective function value in another area of C.

The major disadvantage of stochastic cluster analysis is that it takes more

time than the deterministic methods, which can usually be run in a matter of

seconds. At the time most of the traditional methods were proposed, this was an

insurmountable obstacle, but the growth in computing power has narrowed the gap

10

in recent years. Stochastic methods that can run in a reasonable amount of time

can be valuable additions to the repertoire of the practitioner of cluster analysis.

Since cluster analysis often involves optimizing an objective function g, the

Monte Carlo optimization method of simulated annealing is a natural tool to use

in clustering. In the context of optimization over a large, finite set, simulated

annealing dates to Metropolis et al. (1953), while its modern incarnation was

introduced by Kirkpatrick et al. (1983).

An example of a stochastic cluster analysis algorithm is the simulated anneal-

ing algorithm of Selim and Alsultan (1991), which seeks to minimize the K-means

objective function (1.1). Celeux and Govaert (1992) propose two stochastic cluster-

ing methods based on the EM algorithm.
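A minimal annealing search over partitions can be sketched as follows (illustrative only; unlike Selim and Alsultan, who minimize criterion (1.1), this sketch uses a within-cluster sum of pairwise dissimilarities as the objective g, and all names and tuning constants are ours):

```python
import math
import random

def anneal_clustering(D, K, n_iter=20000, temp0=1.0, cooling=0.9995, seed=1):
    """Simulated-annealing search over partitions: propose moving one object
    to a random cluster; accept a worse partition with probability
    exp(-increase / temperature), so the search can escape local optima."""
    rng = random.Random(seed)
    N = len(D)
    labels = [rng.randrange(K) for _ in range(N)]

    def g(lab):
        # objective: sum of within-cluster pairwise dissimilarities
        total = 0.0
        for i in range(N):
            for j in range(i + 1, N):
                if lab[i] == lab[j]:
                    total += D[i][j]
        return total

    current = g(labels)
    best, best_labels = current, labels[:]
    temp = temp0
    for _ in range(n_iter):
        i = rng.randrange(N)
        old = labels[i]
        labels[i] = rng.randrange(K)
        new = g(labels)
        if new <= current or rng.random() < math.exp(-(new - current) / temp):
            current = new
            if current < best:
                best, best_labels = current, labels[:]
        else:
            labels[i] = old  # reject the move
        temp *= cooling  # geometric cooling schedule
    return best_labels, best
```

The occasional acceptance of uphill moves is exactly the flexibility described above: the chain may pass through poorer partitions to reach a better region of C.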

1.6 Role of the Dissimilarity Matrix

For many standard cluster analysis methods, the resulting clustering structure

is determined by the dissimilarity matrix (or distance matrix) D (containing

elements which we henceforth denote δ_ij, i = 1, ..., N; j = 1, ..., N) for the objects.

With hierarchical methods, this is explicit: If we input a certain dissimilarity

matrix into a clustering algorithm, we will get one and only one resulting grouping.

With partitioning methods, the fact is less explicit since the result depends partly

on the initial partition, the starting point of the algorithm, but this is an artifact

of the imperfect search algorithm (which can only assure a "locally optimal"

partition), not of the clustering structure itself. An ideal search algorithm which

could examine every possible partition would always map an inputted dissimilarity

matrix to a unique final clustering.

Consider the two most common criterion-based partitioning methods, K-

medoids (the S-plus function pam) and K-means. For both, the objective function

is a function of the pairwise dissimilarities among the objects. The K-medoids

objective function simply involves a sum of elements of D. With K-means, the


connection involves a complicated recursive formula, as indicated by Gordon

(1981, p. 42). (Because the connection is so complicated, in practice, K-means

algorithms accept the data matrix as input, but theoretically they could accept the

dissimilarity matrix; it would just slow down the computational time severely.)

The result, then, is that for these methods, a specified dissimilarity matrix yields a

unique final clustering, meaning that for the purpose of cluster analysis, knowing D

is as good as knowing the complete data.

If the observed data have random variation, and hence the measurements on

the objects contain error, then the distances between pairs of objects will have

error. If we want our algorithm to produce a clustering result that is close to the

"true" clustering structure, it seems desirable that the dissimilarity matrix we use

reflect as closely as possible the (unknown) pairwise dissimilarities between the

underlying systematic components of the data.

It is intuitive that if the dissimilarities in the observed distance matrix are

near the "truth," then the resulting clustering structure should be near the true

structure, and a small computer example helps to show this. We generate a sample

of 60 3-dimensional normal random variables (with covariance matrix I) such that

15 observations have mean vector (1, 3, 1)', 15 have mean (10, 6, 4)', 15 have mean

(1, 10, 2)', and 15 have mean (5, 1, 10)'. These means are well-separated enough

that the data naturally form four clusters, and the true clustering is obvious.

Then for 100 iterations we perturb the data with random N(0, σ²) noise having

varying values of σ. For each iteration, we compute the dissimilarities and input

the dissimilarity matrix of the perturbed data into the K-medoids algorithm and

obtain a resulting clustering.
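The two quantities compared in this experiment can be computed as follows (an illustrative sketch, not the original simulation code; the pair-matching proportion is the Rand index, and all names are ours):

```python
import numpy as np

def pair_concordance(truth, clustering):
    """Proportion of all object pairs on which two partitions agree: a pair
    is correct if it is together in both partitions or apart in both."""
    truth, clustering = np.asarray(truth), np.asarray(clustering)
    N = len(truth)
    agree = pairs = 0
    for i in range(N):
        for j in range(i + 1, N):
            pairs += 1
            agree += (truth[i] == truth[j]) == (clustering[i] == clustering[j])
    return agree / pairs

def dissimilarity_mse(Y_true, Y_perturbed):
    """Mean squared discrepancy (across elements) between the Euclidean
    dissimilarity matrices of the true and perturbed data."""
    D1 = np.sqrt(((Y_true[:, None] - Y_true[None]) ** 2).sum(-1))
    D2 = np.sqrt(((Y_perturbed[:, None] - Y_perturbed[None]) ** 2).sum(-1))
    return ((D1 - D2) ** 2).mean()
```

Plotting `pair_concordance` against `dissimilarity_mse` over repeated perturbations yields a plot of the kind shown in Figure 1-2.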

Figure 1-2 plots, for each perturbed data set, the mean (across elements)

squared discrepancy from the true dissimilarity matrix against the proportion of

all possible pairs of objects which are correctly matched in the clustering resulting

Figure 1-2: Proportion of pairs of objects correctly grouped vs. MSE of dissimilarities.

from that perturbed matrix. (A correct match for two objects means correctly

putting the two objects in the same cluster or correctly putting the two objects

in different clusters, depending on the -truth.") This proportion serves as a

measure of concordance between the clustering of the perturbed data set and the

underlying clustering structure. We expect that as mean squared discrepancy

among dissimilarities increases, the proportion of pairs correctly clustered will

decrease, and the plot indicates this negative association. This indicates that a

better estimate of the pairwise dissimilarities among the data tends to yield a

better estimate of the true clustering structure.

While this thesis focuses on cluster analysis, several other statistical methods

are typically based on pairwise dissimilarities among data. Examples include

multidimensional scaling (Young and Hamer, 1987) and statistical matching

(Rodgers, 1988). An improved estimate of pairwise dissimilarities would likely

benefit the results of these methods as well.


CHAPTER 2

INTRODUCTION TO FUNCTIONAL DATA AND SMOOTHING

2.1 Functional Data

Frequently, the measurements on each observation are connected by being part

of a single underlying continuous process (often, but not always, a time process).

One example of such data is the growth records of Swiss boys (Falkner, 1960),

discussed by Ramsay and Silverman (1997, p. 2) in which the measurements are the

heights of the boys at 29 different ages. Ramsay and Silverman (1997) generally

label such data as functional data, since the underlying data are thought to be

intrinsically smooth, continuous curves having domain T, which without loss of

generality we take to be [0, T]. The observed data vector y is merely a discretized

representation of the functional observation y(t).

Functional data are related to longitudinal data, data measured across time

which appear often in biostatistical applications. Typically, however, in functional

data analysis, the primary goal is to discover something about the smooth curves

which underlie the functional observations, and to analyze the entire set of func-

tional data (consisting of many curves). The term "functional data analysis" is

attributed to Ramsay and Dalzell (1991), although methods of analysis existed

before the term was coined.

2.2 Introduction to Smoothing

When scientists observe data containing random noise, they typically desire

to remove the random variation to better understand the underlying process of

interest behind the data. A common method used to capture the underlying signal

process is smoothing.


Scatterplot smoothing, or nonparametric regression, may be used generally for

paired data (t_i, y_i) for which some underlying regression function E[y_i] = f(t_i) is

assumed. But smoothing is particularly appropriate for functional data, for which

that functional relationship y(t) between the response and the process on T is

inherent in the data.

One option, upon observing a functional measurement y, is to imagine the

unknown underlying curve as an interpolant y(t). This results in a curve that is

visually no smoother than the observed data, however. Typically, when functional

data are analyzed, the vector of measurements is converted to a curve via a

smoothing procedure which reduces the random variation in the function. If we

wish to cluster functional data, it may be advantageous to smooth the observed

vector for each object and perform the cluster analysis on the smooth curves rather

than on the observed data. (Clearly, this option is inappropriate for cross-sectional

data such as the European agricultural data of Chapter 1.)

Smoothing data results in an unavoidable tradeoff between bias and variance

(Simonoff, 1996, p. 15). The greater the amount of smoothing of a functional

measurement, the more its variance will decrease, but the more biased it will

become (Simonoff, 1996, p. 42). In cluster analysis, we hope that clustering the

smoothed data (which contains reduced noise) will lead to smaller within-cluster

variability, since functional data which truly belong to the same cluster should

appear more similar when represented as smooth curves. This would help make

the clustering structure of the data more apparent. Using smoothed data may

introduce a bias, however, and the bias-variance tradeoff could be quantified with a

mean squared error-type criterion.

We denote the "observed" noisy curves by y1(t), ..., yN(t). The underlying signal curves for this data set are μ1(t), ..., μN(t). In reality we observe these curves at a fine grid of n points, t1, ..., tn, so that we observe N independent vectors, each n × 1: y1, ..., yN.

A possible model for our noisy data is the discrete noise model:

yij = μi(tj) + εij, i = 1, ..., N, j = 1, ..., n. (2.1)

Here, for each i = 1, ..., N, the εij may be considered independent across the measurement points, having mean zero and constant variance σi².

Another possible model for our noisy curves is the functional noise model:

yi(tj) = μi(tj) + εi(tj), i = 1, ..., N, j = 1, ..., n, (2.2)

where εi(t) is, for example, a stationary Ornstein-Uhlenbeck process with "pull" parameter θ > 0 and variability parameter σi². This choice of model implies that the errors for the ith discretized curve have variance-covariance matrix Σi = σi²Ω, where Ωlm = (2θ)⁻¹ exp(−θ|tl − tm|) (Taylor et al., 1994). Note that in this case, the noise process is functional (specifically Ornstein-Uhlenbeck), but we still assume the response data collected are discretized, and thus a vector at the level of analysis. Conceptually, however, the noise process is smooth and continuous in (2.2), as is the signal process in either model (2.1) or (2.2).
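The Ornstein-Uhlenbeck covariance above is easy to form numerically. The following sketch (plain NumPy; the grid, the pull parameter 2, and the variability 0.5 are illustrative choices, not values from the text) builds the matrix with entries (2θ)⁻¹σ² exp(−θ|tl − tm|) and draws one noisy curve from model (2.2):

```python
import numpy as np

def ou_covariance(t, pull=2.0, sigma2=0.5):
    """Covariance of a stationary Ornstein-Uhlenbeck error process on the
    grid t: entry (l, m) is sigma2 * (2*pull)**-1 * exp(-pull*|t_l - t_m|)."""
    gaps = np.abs(t[:, None] - t[None, :])
    return sigma2 * np.exp(-pull * gaps) / (2.0 * pull)

rng = np.random.default_rng(0)
T, n = 1.0, 100
t = np.linspace(0.0, T, n)
mu = np.sin(2.0 * np.pi * t)            # a hypothetical signal curve mu_i(t)
Sigma = ou_covariance(t)
y = rng.multivariate_normal(mu, Sigma)  # one discretized draw from model (2.2)
```

Note that the diagonal of this covariance is σ²/(2θ), and neighboring measurement points are strongly correlated, unlike under the independent errors of model (2.1).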

Depending on the data and sampling scheme, either (2.1) or (2.2) may be an

appropriate model. If the randomness in the data arises from measurement error

which is independent from one measurement to the next, (2.1) is more appropriate.

Ramsay and Silverman (1997, p. 42) suggest a discrete noise model for the Swiss

growth data, in which heights of boys are measured at 29 separate ages, and in

which some small measuring error (independent across measurements) is likely to

be present in the recorded data.

In the case that the variation of the observed data from the underlying curve is due to an essentially continuous random process, model (2.2) may be appropriate. Data which are measured frequently and almost continuously (for example, via sophisticated monitoring equipment) may be more likely to follow model (2.2), since data measured closely across time (or another domain) are more likely to be correlated. We will examine both situations.

We may apply a linear smoother to obtain the smoothed curves μ̂1(t), ..., μ̂N(t). In practice, we apply a smoothing matrix S to the observed noisy data to obtain a smooth; the smoother is called linear because the smooth μ̂i can be written as

μ̂i = Syi, i = 1, ..., N,

where S does not depend on yi (Buja et al., 1989), and we define μ̂i = (μ̂i(t1), ..., μ̂i(tn))'. Note that as n → ∞, the vector μ̂i begins to closely resemble the curve μ̂i(t) on [0, T].

(Here and subsequently, when writing "limit as n → ∞," we assume t1, ..., tn ∈ [0, T]; that is, the collection of points is becoming denser within [0, T], with the maximum gap between any pair of adjacent points ti−1, ti, i = 2, ..., n, tending to 0. Stein (1995) calls this method of taking the limit "fixed-domain asymptotics," while Cressie (1993) calls it "infill asymptotics.")

Many popular smoothing methods (kernel smoothers, local polynomial

regression, smoothing splines) are linear. Note that if a bandwidth or smoothing

parameter for these methods is chosen via a data-driven method, then technically,

these smoothers become nonlinear (Buja et al., 1989).

We will focus primarily on basis function smoothing methods, in which the

smoothing matrix S is an orthogonal projection (i.e., symmetric and idempotent).

These methods seek to express the signal curve as a linear combination of k (< n)

specified basis functions, in which case the rank of S is k. Examples of such

methods are regression splines, Fourier series, polynomial regression, and some

types of wavelet smoothers (Ramsay and Silverman, 1997, pp. 44-50).
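As a concrete illustration, a basis function smoother fitted by least squares yields the orthogonal projection S = B(B'B)⁻¹B', where the columns of B hold the k basis functions evaluated at the measurement points. The sketch below (an illustrative example using a Fourier basis on an arbitrary grid, not code from the text) constructs such an S and confirms the projection properties used throughout:

```python
import numpy as np

def fourier_projection(t, k):
    """Smoothing matrix for least-squares regression on k Fourier basis
    functions (constant plus sine/cosine pairs): S = B (B'B)^{-1} B'."""
    period = t[-1] - t[0]
    cols = [np.ones_like(t)]
    j = 1
    while len(cols) < k:
        cols.append(np.sin(2.0 * np.pi * j * t / period))
        if len(cols) < k:
            cols.append(np.cos(2.0 * np.pi * j * t / period))
        j += 1
    B = np.column_stack(cols)
    return B @ np.linalg.solve(B.T @ B, B.T)

t = np.linspace(0.0, 1.0, 60)
S = fourier_projection(t, k=5)
```

By construction S is symmetric and idempotent with tr(S) = r(S) = k, which is exactly the class of smoothers analyzed in the following chapters.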


2.3 Dissimilarities Between Curves

If we choose squared L2 distance as our dissimilarity metric, then denote the dissimilarities between the true, observed, and smoothed curves i and j, respectively, as follows:

δij = ∫ [μi(t) − μj(t)]² dt (2.3)

dij = ∫ [yi(t) − yj(t)]² dt (2.4)

dij(smooth) = ∫ [μ̂i(t) − μ̂j(t)]² dt. (2.5)

Define

θij = μi − μj, where μi = (μi(t1), ..., μi(tn))',

θ̂ij = yi − yj,

θ̂ij(smooth) = μ̂i − μ̂j = Sθ̂ij.

If the data follow the discrete noise model, θ̂ij ~ N(θij, σij²I), where σij² = (σi² + σj²). If the data follow the functional noise model, θ̂ij ~ N(θij, Σij), where Σij = σij²Ω and σij² = (σi² + σj²).

If we observe the response at points t1, ..., tn in [0, T], then we may approximate (2.3)-(2.5) by

dij = (T/n) θ̂ij'θ̂ij

dij(smooth) = (T/n) θ̂ij'S'Sθ̂ij.


Note that dij → δij, where the limit is taken as n → ∞, t1, ..., tn ∈ [0, T]. Hence the question of whether smoothing aids clustering is, in a sense, closely related to the question of: When, for large n, is dij(smooth) a better estimator of δij than is dij?
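Numerically, both dissimilarities are Riemann-sum approximations of a squared L2 distance. A small sketch (the piecewise-constant projection basis, signal curves, and noise level are illustrative assumptions, not choices from the text):

```python
import numpy as np

def dissimilarity_pair(y_i, y_j, S, T):
    """Observed-data and smoothed-data dissimilarities between two
    discretized curves: (T/n) theta'theta and (T/n) theta'S'S theta,
    where theta = y_i - y_j."""
    n = len(y_i)
    theta_hat = y_i - y_j
    smoothed = S @ theta_hat
    return (T / n) * theta_hat @ theta_hat, (T / n) * smoothed @ smoothed

rng = np.random.default_rng(1)
n, k, T = 90, 9, 1.0
t = np.linspace(0.0, T, n)
B = np.kron(np.eye(k), np.ones((n // k, 1)))   # piecewise-constant basis
S = B @ np.linalg.solve(B.T @ B, B.T)          # projection smoother, rank k
y_i = np.sin(2.0 * np.pi * t) + 0.3 * rng.standard_normal(n)
y_j = np.cos(2.0 * np.pi * t) + 0.3 * rng.standard_normal(n)
d_obs, d_smooth = dissimilarity_pair(y_i, y_j, S, T)
```

Because S is a shrinking smoother, the smoothed dissimilarity never exceeds the observed one here; the question studied below is when it is the better estimator of the true δij.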

(Note: In the following chapters, since the pair of curves i and j is arbitrary, we shall suppress the ij subscript on θij, θ̂ij, σij², and Σij, writing instead θ, θ̂, σ², and Σ, understanding that we are concerned with any particular pair i, j ∈ {1, ..., N}, i ≠ j.)

2.4 Previous Work

Some methods for clustering functional data have recently been presented in

the statistical literature. James and Sugar (2003) introduce a model-based cluster-

ing method that is especially useful when the points measured along the sample

curves are sparse and irregular. Tarpey and Kinateder (2003) discuss represent-

ing each curve with basis functions and clustering via K-means, to estimate the

"principal points" (underlying cluster means) of the data's distribution. Abraham

et al. (2003) propose representing the curves via a B-spline basis and clustering

the estimated coefficients with a K-means algorithm, and they derive consistency

results about the convergence of the algorithm. Tarpey et al. (2003) apply methods

for clustering functional data to the analysis of a pharmaceutical study. Hastie et

al. (1995) and James and Hastie (2001) discuss discriminant analysis for functional

data.

2.5 Summary

It seems intuitive that in the analysis of functional data, some form of smooth-

ing the observed data is appropriate, and some previous methods (James and

Sugar, 2003; Tarpey and Kinateder, 2003; Abraham et al., 2003) do involve

smoothing. Tarpey and Kinateder (2003, p. 113) pose the question of the effect

of smoothing on the clustering of functional data, citing the need "to study the


sensitivity of clustering methods on the degree of smoothing used to estimate the

functions."

In this thesis, we will provide some rigorous theoretical justification that

smoothing the data before clustering will improve the cluster analysis. Much of

the theory will focus on the estimation of the underlying dissimilarities among

the curves, and we will show that a shrinkage-type smoothing method leads to an

improved risk in estimating the dissimilarities. A simulation study will demonstrate

that this risk improvement is accompanied by a clearly improved performance in

correctly grouping objects into their proper clusters.

CHAPTER 3

CASE I: DATA FOLLOWING THE DISCRETE NOISE MODEL

First we will consider functional data following model (2.1). Recall that we

assume the response is measured at n discrete points in [0, T].

3.1 Comparing MSEs of Dissimilarity Estimators when θ is in the

Linear Subspace Defined by S

We assume our linear smoothing matrix S is symmetric and idempotent. For a linear basis function smoother which is fitted via least squares, S will be symmetric and idempotent as long as the n points at which μ̂i is evaluated are identical to the points at which yi is observed (Ramsay and Silverman, 1997, p. 44). Examples of such smoothers are regression splines (in particular, B-splines), wavelet bases, and Fourier series bases (Ramsay and Silverman, 1997). Regression splines and B-spline bases are discussed in detail by de Boor (1978) and Eubank (1988, Chapter 7).

We also assume that S projects the observed data onto a lower-dimensional space (of dimension k < n), and thus r(S) = tr(S) = k. Note that S is a shrinking smoother, since all its singular values are ≤ 1 (Buja et al., 1989). Recall that according to the discrete noise model for the data, θ̂ ~ N(θ, σ²I). Without loss of generality, let σ²I = I. (Otherwise, we can let, for example, η̂ = σ⁻¹θ̂ and η = σ⁻¹θ and work with η̂ and η instead.)

Note that (T/n)θ̂'θ̂ represents the approximate L2 distance between observed curves yi(t) and yj(t), and (T/n)θ̂'S'Sθ̂ = (T/n)θ̂'Sθ̂ represents the approximate L2 distance between smoothed curves μ̂i(t) and μ̂j(t).

We wish to see when the "smoothed-data dissimilarity" better estimates the true dissimilarity δij between curves μi(t) and μj(t) than the observed-data

dissimilarity. The risk of an estimator τ̂ for τ is given by R(τ, τ̂) = E[L(τ, τ̂)], where L(·, ·) is a loss function (see Lehmann and Casella, 1998, pp. 4-5).

For the familiar case of squared error loss L(τ, τ̂) = (τ − τ̂)², the risk is simply the mean squared error (MSE) of the estimator. Hence we may compare the MSEs of two competing estimators and choose the one with the smaller MSE. To this end, let us examine MSE((T/n)θ̂'Sθ̂) and MSE((T/n)θ̂'θ̂) in estimating (T/n)θ'θ (which approaches δij as n → ∞).

In this section, we consider the case when θ lies in the linear subspace that S projects onto, i.e., Sθ = θ. Note that if two arbitrary (discretized) signal curves μi and μj are in this linear subspace, then the corresponding θ is also in the subspace, since in this case

θ = μi − μj = Sμi − Sμj = S(μi − μj) = Sθ.

In this idealized situation, a straightforward comparison of MSEs shows that

the smoothed-data estimator improves on the observed-data estimator.

Theorem 3.1 Suppose the observed θ̂ ~ N(θ, σ²I). Let S be a symmetric and idempotent linear smoothing matrix of rank k. If θ lies in the linear subspace defined by S, then the dissimilarity estimator (T/n)θ̂'Sθ̂ has smaller mean squared error than does (T/n)θ̂'θ̂ in estimating (T/n)θ'θ. That is, the dissimilarity estimator based on the smooth is better than the one based on the observed data in estimating the dissimilarities between the underlying signal curves.

Proof of Theorem 3.1:

Let σ² = 1 without loss of generality.

Now, E[(T/n)θ̂'θ̂] = (T/n)E[θ̂'θ̂] = (T/n)[n + θ'θ] = T + (T/n)θ'θ.

Similarly, E[(T/n)θ̂'Sθ̂] = (T/n)E[θ̂'Sθ̂] = (T/n)[k + θ'Sθ] = (T/n)k + (T/n)θ'Sθ.

MSE((T/n)θ̂'θ̂) = E{[(T/n)θ̂'θ̂ − (T/n)θ'θ]²}

= var((T/n)θ̂'θ̂) + {E[(T/n)θ̂'θ̂] − (T/n)θ'θ}²

= (T²/n²)[2n + 4θ'θ] + {(T/n)[n + θ'θ] − (T/n)θ'θ}²

= (T²/n²)(2n + 4‖θ‖² + n²). (3.1)

MSE((T/n)θ̂'Sθ̂) = E{[(T/n)θ̂'Sθ̂ − (T/n)θ'θ]²}

= var((T/n)θ̂'Sθ̂) + {E[(T/n)θ̂'Sθ̂] − (T/n)θ'θ}²

= (T²/n²)[2k + 4θ'S'Sθ] + {(T/n)[k + θ'S'Sθ] − (T/n)‖θ‖²}²

= (T²/n²){2k + 4‖Sθ‖² + k² + 2k(‖Sθ‖² − ‖θ‖²) + (‖Sθ‖² − ‖θ‖²)²}. (3.2)

Comparing (3.1) with (3.2), we see that if θ lies in the subspace that S projects onto, implying that Sθ = θ, then MSE((T/n)θ̂'Sθ̂) < MSE((T/n)θ̂'θ̂) since k < n, and thus smoothing leads to a better estimator of each pairwise dissimilarity. In this case, smoothing reduces the variance of the dissimilarity estimate, without adversely increasing the bias, since Sθ = θ. □
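A quick Monte Carlo check of Theorem 3.1 is straightforward (a sketch under the stated assumptions: σ² = 1, a rank-k projection S, and a θ forced into the subspace so that Sθ = θ; the particular piecewise-constant basis is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, T = 40, 4, 1.0
B = np.kron(np.eye(k), np.ones((n // k, 1)))   # rank-k projection basis
S = B @ np.linalg.solve(B.T @ B, B.T)
theta = 3.0 * (S @ rng.standard_normal(n))     # force S theta = theta
target = (T / n) * theta @ theta               # the quantity being estimated
sq_err_obs, sq_err_smooth = [], []
for _ in range(20000):
    th = theta + rng.standard_normal(n)        # theta_hat ~ N(theta, I)
    sq_err_obs.append(((T / n) * th @ th - target) ** 2)
    sq_err_smooth.append(((T / n) * th @ S @ th - target) ** 2)
mse_obs = float(np.mean(sq_err_obs))
mse_smooth = float(np.mean(sq_err_smooth))
# Theoretical MSEs when S theta = theta, per (3.1) and (3.2):
theory_obs = (T / n) ** 2 * (2 * n + 4 * theta @ theta + n ** 2)
theory_smooth = (T / n) ** 2 * (2 * k + 4 * theta @ theta + k ** 2)
```

The simulated MSEs land close to the theoretical values, with the smoothed-data estimator clearly the better of the two since k < n.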


On the other hand, suppose the smooth does not reproduce θ perfectly and that Sθ ≠ θ. Then it can be shown (see Appendix A.1) that the smoothed-data estimator is better when:

‖Sθ‖² − ‖θ‖² > −2 − k − √(4 + 2n + n² + 2k). (3.3)

Now, since S is a shrinking smoother, ‖Sy‖ ≤ ‖y‖ for all y, and hence ‖Sy‖² ≤ ‖y‖² for all y. Therefore ‖Sθ‖² ≤ ‖θ‖², so the left-hand side of (3.3) is nonpositive, and so is the right-hand side. If θ is such that Sθ ≠ θ and ‖Sθ‖² − ‖θ‖² is near 0, then (3.3) will be satisfied. If, however, ‖Sθ‖² − ‖θ‖² ≪ 0, then (3.3) will not be satisfied and smoothing will not help.

In other words, some shrinkage smoothing of the observed curves makes the

dissimilarity estimator better, but too much shrinkage leads to a forfeiture of that

advantage. The disadvantage of the linear smoother is that it cannot "learn"

from the data how much to shrink θ̂. To improve the smoother, we can employ a

James-Stein-type adjustment to S, so that the data can determine the amount of

shrinkage.

3.2 A James-Stein Shrinkage Adjustment to the Smoother

What is now known as "shrinkage estimation" or "Stein estimation" originated

with the work of Stein in the context of estimating a multivariate normal mean.

Stein (1956) showed that the usual estimator (e.g., the sample mean vector) was

inadmissible when the data were of dimension 3 or greater. James and Stein (1961)

showed that a particular "shrinkage estimator" (so named because it shrunk the

usual estimate toward the origin or some appropriate point) dominated the usual

estimator. In subsequent years, many results have been derived about shrinkage

estimation in a variety of contexts. A detailed discussion can be found in Lehmann

and Casella (1998, Chapter 5).


Lehmann and Casella (1998, p. 367) discuss shrinking an estimator toward a linear subspace of the parameter space. In our case, we believe that θ is near Sθ. Let ℒ_S = {θ : Sθ = θ}, where S is symmetric and idempotent of rank k. Hence we may shrink θ̂ toward Sθ̂, the MLE of θ ∈ ℒ_S.

In this case, a James-Stein estimator of θ (see Lehmann and Casella, 1998, p. 367) is

θ̂(JS) = Sθ̂ + (1 − a/‖θ̂ − Sθ̂‖²)(θ̂ − Sθ̂),

where a is a constant and ‖·‖ is the usual Euclidean norm.

In practice, to avoid the problem of the shrinkage factor possibly being negative for small ‖θ̂ − Sθ̂‖², we will use the positive-part James-Stein estimator

θ̂+(JS) = Sθ̂ + (1 − a/‖θ̂ − Sθ̂‖²)+ (θ̂ − Sθ̂), (3.4)

where x+ = xI(x ≥ 0).

Casella and Hwang (1987) propose similar shrinkage estimators in the context

of confidence sets for a multivariate normal mean. Green and Strawderman (1991),

also in the context of estimating a multivariate mean, discuss how shrinking an

unbiased estimator toward a possibly biased estimator using a James-Stein form

can result in a risk improvement.

The shrinkage estimator involves the data by giving more weight to θ̂ when ‖θ̂ − Sθ̂‖² is large and more weight to Sθ̂ when ‖θ̂ − Sθ̂‖² is small. In fact, if the smoother is at all well-chosen, Sθ̂ is often close enough to θ̂ that the shrinkage factor in (3.4) is very often zero. The shrinkage factor is actually merely a safeguard against oversmoothing, in case S smooths the curves beyond what, in reality, it should.
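In code, the positive-part estimator (3.4) is a short wrapper around any projection smoother (a sketch; S is assumed to be a symmetric, idempotent smoothing matrix as described above):

```python
import numpy as np

def james_stein_positive_part(theta_hat, S, a):
    """Positive-part James-Stein smooth (3.4): shrink theta_hat toward
    S theta_hat, with the amount of shrinkage determined by the data."""
    resid = theta_hat - S @ theta_hat          # (I - S) theta_hat
    rss = float(resid @ resid)                 # ||theta_hat - S theta_hat||^2
    factor = max(0.0, 1.0 - a / rss) if rss > 0.0 else 0.0
    return S @ theta_hat + factor * resid
```

With a = 0 the function returns θ̂ unchanged, and once a ≥ ‖θ̂ − Sθ̂‖² the positive part sets the shrinkage factor to zero and the fully smoothed Sθ̂ is returned exactly.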


So an appropriate shrinkage estimator of the dissimilarity (T/n)θ'θ is

dij(JS) = (T/n) θ̂(JS)'θ̂(JS) (3.5)

= (T/n) [Sθ̂ + (1 − a/‖θ̂ − Sθ̂‖²)(θ̂ − Sθ̂)]' [Sθ̂ + (1 − a/‖θ̂ − Sθ̂‖²)(θ̂ − Sθ̂)]. (3.6)

Now, the risk difference (difference in MSEs) between the James-Stein smoothed dissimilarity dij(JS) and the observed dissimilarity dij is:

Δ = E[((T/n)θ̂(JS)'θ̂(JS) − (T/n)θ'θ)²] − E[((T/n)θ̂'θ̂ − (T/n)θ'θ)²]. (3.7)

Since we are interested in when the risk difference is negative, we can ignore the positive constant multiplier T²/n².

From (3.7),

A = E[((Js)'O(Js))2 20'0&(JS)'O(JS) (6'0)2 + 20'00'0]

= E[((JS)'O(Js))2 (6'0)2 20'0(o(JS)'(JS) ')].

Write P(Js) = 0 p(0)0, where

(0)A a (I- S).

110 S61 '

Then

θ̂(JS)'θ̂(JS) − θ̂'θ̂ = −2θ̂'φ(θ̂)θ̂ + θ̂'φ(θ̂)'φ(θ̂)θ̂

= −(2a/‖θ̂ − Sθ̂‖²) θ̂'(I − S)θ̂ + (a²/‖θ̂ − Sθ̂‖⁴) θ̂'(I − S)'(I − S)θ̂

= −2a + a²/‖θ̂ − Sθ̂‖²,

since ‖θ̂ − Sθ̂‖² = θ̂'(I − S)θ̂ when S is symmetric and idempotent.

Hence the (scaled) risk difference is

Δ = E[(θ̂'θ̂ − 2a + a²/‖θ̂ − Sθ̂‖²)² − (θ̂'θ̂)² − 2θ'θ(−2a + a²/‖θ̂ − Sθ̂‖²)].

Note that

θ̂'θ̂ = θ̂'(I − S)θ̂ + θ̂'Sθ̂

and

θ'θ = θ'(I − S)θ + θ'Sθ.

Define the following quadratic forms:

q̂1 = θ̂'(I − S)θ̂

q̂2 = θ̂'Sθ̂

q1 = θ'(I − S)θ

q2 = θ'Sθ.

Note that q̂1 ~ χ²_{n-k}(q1), q̂2 ~ χ²_k(q2), and they are independent (S idempotent implies (I − S)S = 0, and θ̂ is normally distributed). Now write Δ as:

Δ = E[([q̂1 + q̂2] + (−2a + a²/q̂1))² − (q̂1 + q̂2)² − 2(q1 + q2)(−2a + a²/q̂1)]

= E[2(−2a + a²/q̂1)[q̂1 + q̂2] + (−2a + a²/q̂1)² + 4a(q1 + q2) − (2a²/q̂1)(q1 + q2)]

= E[−4a[q̂1 + q̂2] + 2[q̂1 + q̂2](a²/q̂1) + 4a² + a⁴/q̂1² − 4a³/q̂1 + 4a(q1 + q2) − (2a²/q̂1)(q1 + q2)]

= E[−4a[q̂1 + q̂2] + 4a² + a⁴/q̂1² − 4a³/q̂1 + 4a(q1 + q2) + (2a²/q̂1)(q̂1 + q̂2 − q1 − q2)].

Since E[q̂1 + q̂2] = n + q1 + q2,

Δ = −4an + 4a² + E[a⁴/q̂1² − 4a³/q̂1 + (2a²/q̂1)(q̂1 + q̂2 − q1 − q2)].

Note that since q̂1 and q̂2 are independent, we can take the expectation of q̂2 in the last term. Since E[q̂2] = k + q2,

Δ = 4a² − 4an + E[a⁴/q̂1² + 2a²k/q̂1 + 2a²(1 − (2a + q1)/q̂1)].

By Jensen's Inequality, E(1/q̂1) ≥ 1/E(q̂1), and since E(q̂1) = n − k + q1, we can bound the last term above:

E[2a²(1 − (2a + q1)/q̂1)] ≤ 2a²(1 − (2a + q1)/(n − k + q1)) = 2a²(n − k − 2a)/(n − k + q1)

≤ 2a²(n − k − 2a)/(n − k) = 2a²(1 − 2a/(n − k)).

Hence

Δ ≤ 4a² − 4an + E[a⁴/q̂1²] + E[2a²k/q̂1] + 2a²(1 − 2a/(n − k)). (3.8)

Note that the numerators of the terms in the expected values in (3.8) are positive. The random variable q̂1 is a noncentral χ²_{n-k} with noncentrality parameter q1, and this distribution is stochastically increasing in q1. This implies that for m > 0, E[(q̂1)^(-m)] is decreasing in q1. So

E[(q̂1)^(-m)] ≤ E[(χ²_{n-k})^(-m)],

where χ²_{n-k} is a central χ² with n − k degrees of freedom. So by replacing q̂1 with χ²_{n-k} in (3.8), we obtain an upper bound ΔU for Δ:

ΔU = 4a² − 4an + E[a⁴/(χ²_{n-k})²] + E[2a²k/χ²_{n-k}] + 2a²(1 − 2a/(n − k)). (3.9)

Note that E[(χ²_{n-k})⁻¹] = 1/(n − k − 2) and E[(χ²_{n-k})⁻²] = 1/[(n − k − 2)(n − k − 4)]. So taking expectations, we have:

ΔU = 4a² − 4an + a⁴/[(n − k − 2)(n − k − 4)] + 2a²k/(n − k − 2) + 2a² − 4a³/(n − k)

= a⁴/[(n − k − 2)(n − k − 4)] − 4a³/(n − k) + (6 + 2k/(n − k − 2))a² − 4na.

We are looking for values of a which make ΔU negative. The fourth-degree equation ΔU(a) = 0 can be solved analytically. (Two of the roots are imaginary, as noted in Appendix B.1.) One real root is clearly 0; call the second real root r.

Theorem 3.2 If n − k > 4, then the nontrivial real root r of ΔU(a) = 0 is positive. Furthermore, for any choice a ∈ (0, r), the upper bound ΔU < 0, implying Δ < 0. That is, for 0 < a < r, the risk difference is negative and the smoothed-data dissimilarity estimator dij(JS) is better than dij.

Proof of Theorem 3.2: Write ΔU(a) = 0 as c4a⁴ + c3a³ + c2a² + c1a = 0, where c4, c3, c2, c1 are the respective coefficients of ΔU(a). It is clear that if n − k > 4, then c4 > 0, c3 < 0, c2 > 0, c1 < 0.

Note that r also solves the cubic equation fc(a) = c4a³ + c3a² + c2a + c1 = 0. Since its leading coefficient c4 > 0,

lim(a → −∞) fc(a) = −∞, lim(a → ∞) fc(a) = ∞.

Since fc(a) has only one real root, it crosses the a-axis only once. Since its vertical intercept c1 < 0, its horizontal intercept r > 0.


Table 3-1: Table of choices of a for various n and k.

  n    k    minimizer a*    root r
 20    5        9.3          19.0
 50    5       31.9          72.2
100    5       71.3         169.1
200    5      153.2         367.5

And since ΔU(a) has leading coefficient c4 > 0,

lim(a → ±∞) ΔU(a) = ∞.

Since it tends to ∞ at its endpoints, ΔU(a) must be negative between its two real roots 0 and r. Therefore ΔU < 0 for 0 < a < r. □

Using a symbolic algebra program (such as Maple or Mathematica), one can easily obtain the formula for the second real root for general n and k (the formula is given in Appendix B.1) and verify that the other two roots are imaginary.

Figure 3-1 shows ΔU plotted as a function of a for varying n and k = 5. For various choices of n (and k = 5), Table 3-1 provides values of r, as well as the value a* of a which minimizes the upper bound for Δ. For 0 < a < r, the risk difference is assured of being negative; at a*, ΔU is minimized.
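The root r and the minimizer a* are also easy to recover numerically. The sketch below uses the quartic's coefficients as reconstructed in the derivation above (so the decimals may differ slightly in the last digit from the symbolically computed entries of Table 3-1):

```python
import numpy as np

def delta_u_coeffs(n, k):
    """Coefficients (c4, c3, c2, c1) of Delta_U(a) = c4 a^4 + c3 a^3
    + c2 a^2 + c1 a, following the bound derived above; needs n - k > 4."""
    return (1.0 / ((n - k - 2) * (n - k - 4)),
            -4.0 / (n - k),
            6.0 + 2.0 * k / (n - k - 2),
            -4.0 * n)

def nontrivial_root(n, k):
    """The positive real root r of the cubic c4 a^3 + c3 a^2 + c2 a + c1
    (the quartic's roots are 0, r, and a complex pair)."""
    c4, c3, c2, c1 = delta_u_coeffs(n, k)
    roots = np.roots([c4, c3, c2, c1])
    real = roots[np.abs(roots.imag) < 1e-8].real
    return float(real[real > 0].min())

r = nontrivial_root(20, 5)
c4, c3, c2, c1 = delta_u_coeffs(20, 5)
mid = r / 2.0
delta_u_mid = c4 * mid**4 + c3 * mid**3 + c2 * mid**2 + c1 * mid
```

For n = 20 and k = 5 this gives r near 19, with ΔU negative throughout (0, r), in line with Theorem 3.2.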

Since ΔU provides an upper bound for the (scaled) risk difference, it may be valuable to ascertain the size of the discrepancy between ΔU and Δ. We can estimate this discrepancy via Monte Carlo simulation. We generate a large number of random variables having the distribution of q̂1 (namely χ²_{n-k}(q1)) and get an estimate of Δ using a Monte Carlo mean. For various values of q1, Figure 3-2 shows Δ plotted alongside ΔU.
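The Monte Carlo estimate just described can be sketched with plain NumPy, generating the noncentral χ² draws as squared norms of shifted standard normal vectors (the particular a, q1, q2 values below are illustrative, not the ones used for Figure 3-2):

```python
import numpy as np

def delta_monte_carlo(a, n, k, q1, q2, reps=200000, seed=4):
    """Monte Carlo estimate of the scaled risk difference Delta for the
    James-Stein dissimilarity estimator (discrete noise model, sigma^2 = 1).
    q1_hat ~ chi^2_{n-k}(q1) and q2_hat ~ chi^2_k(q2) are drawn as squared
    norms of shifted standard normal vectors."""
    rng = np.random.default_rng(seed)
    mu1 = np.zeros(n - k); mu1[0] = np.sqrt(q1)
    mu2 = np.zeros(k); mu2[0] = np.sqrt(q2)
    q1_hat = ((rng.standard_normal((reps, n - k)) + mu1) ** 2).sum(axis=1)
    q2_hat = ((rng.standard_normal((reps, k)) + mu2) ** 2).sum(axis=1)
    adj = -2.0 * a + a ** 2 / q1_hat
    term = ((q1_hat + q2_hat + adj) ** 2 - (q1_hat + q2_hat) ** 2
            - 2.0 * (q1 + q2) * adj)
    return float(term.mean())

delta_est = delta_monte_carlo(a=10.0, n=20, k=5, q1=0.0, q2=0.0)
```

For a inside (0, r) the estimate comes out clearly negative, consistent with Theorem 3.2 and with the behavior shown in Figure 3-2.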

3.3 Extension to the Case of Unknown σ²

In the previous section, σ² was assumed to be known. We now examine the situation in which θ̂ ~ N(θ, σ²I) with σ² unknown.


Figure 3-1: Plot of ΔU against a for varying n and for k = 5. Solid line: n = 20. Dashed line: n = 50. Dotted line: n = 100.


Figure 3-2: Plot of simulated Δ and ΔU against a for n = 20, k = 5. Solid line: upper bound ΔU. Dashed line: simulated Δ, q1 = 0. Dotted line: simulated Δ, q1 = 135. Long-dashed line: simulated Δ, q1 = 540. Dot-dashed line: simulated Δ, q1 = 3750000.


Suppose there exists a random variable S², independent of θ̂, such that

S²/σ² ~ χ²_ν.

Then let σ̂² = S²/ν.

Consider the following definition of the James-Stein estimator which accounts for σ²:

θ̂(JS) = Sθ̂ + (1 − aσ²/‖θ̂ − Sθ̂‖²)(θ̂ − Sθ̂)

= Sθ̂ + (1 − aσ²/(θ̂'(I − S)θ̂))(I − S)θ̂. (3.10)

Note that the James-Stein estimator from Section 3.2 (when we assumed a known covariance matrix I) is simply (3.10) with σ² = 1.

Now, replacing σ² with the estimate σ̂², define

θ̂σ̂(JS) = Sθ̂ + (1 − aσ̂²/q̂1)(I − S)θ̂,

letting q̂1 = θ̂'(I − S)θ̂ as in Section 3.2. Write

θ̂σ̂(JS) = θ̂ − ψ̂(I − S)θ̂,

where

ψ̂ = aσ̂²/q̂1.

Define the James-Stein smoothed-data dissimilarity estimator based on θ̂σ̂(JS) to be:

dij,σ̂(JS) = (T/n) θ̂σ̂(JS)'θ̂σ̂(JS).

Then, analogously to (3.7), the (scaled) risk difference is

Δσ̂ = E[(θ̂σ̂(JS)'θ̂σ̂(JS) − θ'θ)²] − E[(θ̂'θ̂ − θ'θ)²].

Now, ψ̂ θ̂'(I − S)θ̂ = (aσ̂²/q̂1) q̂1 = aσ̂² and ψ̂² θ̂'(I − S)'(I − S)θ̂ = a²σ̂⁴/q̂1, so

θ̂σ̂(JS)'θ̂σ̂(JS) = θ̂'θ̂ − 2aσ̂² + a²σ̂⁴/q̂1.

Hence

Δσ̂ = E[(θ̂'θ̂ − θ'θ − 2aσ̂² + a²σ̂⁴/q̂1)² − (θ̂'θ̂ − θ'θ)²]

= E[2(−2aσ̂² + a²σ̂⁴/q̂1)(θ̂'θ̂ − θ'θ) + (−2aσ̂² + a²σ̂⁴/q̂1)²]

= E[−4aσ̂²(θ̂'θ̂ − θ'θ) + 2a²σ̂⁴(θ̂'θ̂ − θ'θ)/q̂1 + 4a²σ̂⁴ − 4a³σ̂⁶/q̂1 + a⁴σ̂⁸/q̂1²].

Since σ̂ and θ̂ are independent,

E[−4aσ̂²(θ̂'θ̂ − θ'θ)] = E[−4aσ̂²] E[θ̂'θ̂ − θ'θ] = E[−4aσ̂²] σ²n

and

E[2a²σ̂⁴(θ̂'θ̂ − θ'θ)/q̂1]

= E[(2a²σ̂⁴/q̂1)(θ̂'(I − S)θ̂ + θ̂'Sθ̂ − θ'(I − S)θ − θ'Sθ)]

= E[(2a²σ̂⁴/q̂1)(q̂1 + q̂2 − q1 − q2)]

= E[2a²σ̂⁴] E[q̂2/q̂1] + E[(2a²σ̂⁴/q̂1)(q̂1 − q1 − q2)] (3.11)

by the independence of σ̂ and θ̂ (and thus σ̂ and q̂2).


Since q̂1 and q̂2 are independent, we may write (3.11) as:

E[2a²σ̂⁴] E[q̂2] E[1/q̂1] + E[(2a²σ̂⁴/q̂1)(q̂1 − q1 − q2)]

= E[2a²σ̂⁴](q2 + σ²k) E[1/q̂1] + E[(2a²σ̂⁴/q̂1)(q̂1 − q1 − q2)]. (3.12)

Now, since a²σ̂⁴/q̂1 is decreasing in q̂1 and q̂1 − q1 − q2 is increasing in q̂1, these quantities have covariance ≤ 0 by the Covariance Inequality (Casella and Berger, 1990, p. 184). Hence

(3.12) ≤ E[2a²σ̂⁴](q2 + σ²k) E[1/q̂1] + E[2a²σ̂⁴/q̂1] E[q̂1 − q1 − q2]

= E[2a²σ̂⁴](q2 + σ²k) E[1/q̂1] + E[2a²σ̂⁴] E[1/q̂1] (σ²(n − k) − q2)

= E[2a²σ̂⁴] E[1/q̂1] σ²n.

So

Δσ̂ ≤ E[−4aσ̂²]σ²n + E[2a²σ̂⁴] E[1/q̂1] σ²n + E[4a²σ̂⁴] − E[4a³σ̂⁶] E[1/q̂1] + E[a⁴σ̂⁸] E[1/q̂1²].

Note that

E[aσ̂²] = aσ²

E[a²σ̂⁴] = a²σ⁴ (ν(ν + 2)/ν²)

E[a³σ̂⁶] = a³σ⁶ (ν(ν + 2)(ν + 4)/ν³)

E[a⁴σ̂⁸] = a⁴σ⁸ (ν(ν + 2)(ν + 4)(ν + 6)/ν⁴).
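These moment formulas follow from E[(χ²_ν)^m] = ν(ν + 2)···(ν + 2m − 2). A quick simulation check (sample size and parameter values are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(5)
nu, sigma2 = 12, 2.0
s_sq = sigma2 * rng.chisquare(nu, size=500000)  # S^2 with S^2/sigma^2 ~ chi^2_nu
sigma2_hat = s_sq / nu                          # sigma^2-hat = S^2 / nu
m1 = float(sigma2_hat.mean())
m2 = float((sigma2_hat ** 2).mean())
theory_m1 = sigma2
theory_m2 = sigma2 ** 2 * nu * (nu + 2) / nu ** 2
```

The sample moments sit within simulation error of the stated formulas, with σ̂² unbiased for σ² as expected.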

Since σ̂² and q̂1 are independent, taking expectations:

Δσ̂ ≤ −4aσ⁴n + 2a²σ⁶n (ν(ν + 2)/ν²) E[1/q̂1] + 4a²σ⁴ (ν(ν + 2)/ν²)

− 4a³σ⁶ (ν(ν + 2)(ν + 4)/ν³) E[1/q̂1] + a⁴σ⁸ (ν(ν + 2)(ν + 4)(ν + 6)/ν⁴) E[1/q̂1²].


Define q̃1 = q̂1/σ² = (θ̂/σ)'(I − S)(θ̂/σ). Since θ̂ ~ N(θ, σ²I), (θ̂/σ) ~ N(ζ, I), where ζ = θ/σ. Then the risk difference is a function of ζ, and the distribution of q̃1 is a function of ζ and does not involve σ alone. Then

Δσ̂ ≤ −4aσ⁴n + 2a²σ⁴n (ν(ν + 2)/ν²) E[1/q̃1] + 4a²σ⁴ (ν(ν + 2)/ν²)

− 4a³σ⁴ (ν(ν + 2)(ν + 4)/ν³) E[1/q̃1] + a⁴σ⁴ (ν(ν + 2)(ν + 4)(ν + 6)/ν⁴) E[1/q̃1²].

We may now divide the inequality through by σ⁴ > 0:

Δσ̂/σ⁴ ≤ −4an + 2a²n (ν(ν + 2)/ν²) E[1/q̃1] + 4a² (ν(ν + 2)/ν²)

− 4a³ (ν(ν + 2)(ν + 4)/ν³) E[1/q̃1] + a⁴ (ν(ν + 2)(ν + 4)(ν + 6)/ν⁴) E[1/q̃1²].

Now, since (I − S) is idempotent, q̃1 is noncentral χ²_{n-k}(ζ'(I − S)ζ). Recall that the noncentral χ² distribution is stochastically increasing in its noncentrality parameter, so we may replace this parameter by zero and obtain the upper bound

E[(q̃1)^(-m)] ≤ E[(χ²_{n-k})^(-m)], m = 1, 2.

Hence

Δσ̂/σ⁴ ≤ ΔU = −4an + 2a²n (ν(ν + 2)/ν²) E[(χ²_{n-k})⁻¹] + 4a² (ν(ν + 2)/ν²)

− 4a³ (ν(ν + 2)(ν + 4)/ν³) E[(χ²_{n-k})⁻¹] + a⁴ (ν(ν + 2)(ν + 4)(ν + 6)/ν⁴) E[(χ²_{n-k})⁻²].

If n − k > 4, then E[(χ²_{n-k})⁻¹] = 1/(n − k − 2) and E[(χ²_{n-k})⁻²] = 1/[(n − k − 2)(n − k − 4)]. So taking expectations, we have:


Δσ̂/σ⁴ ≤ ΔU = −4an + (2ν(ν + 2)n/ν²) a²/(n − k − 2) + (4ν(ν + 2)/ν²) a²

− (4ν(ν + 2)(ν + 4)/ν³) a³/(n − k − 2)

+ (ν(ν + 2)(ν + 4)(ν + 6)/ν⁴) a⁴/[(n − k − 2)(n − k − 4)]

= (ν(ν + 2)(ν + 4)(ν + 6)/(ν⁴(n − k − 2)(n − k − 4))) a⁴ − (4ν(ν + 2)(ν + 4)/(ν³(n − k − 2))) a³

+ (2ν(ν + 2)n/(ν²(n − k − 2)) + 4ν(ν + 2)/ν²) a² − 4na.

Note that if ΔU (which does not involve σ) is less than 0, then Δσ̂ < 0, which is what we wish to prove.

Theorem 3.3 Suppose that θ̂ ~ N(θ, σ²I) with σ² unknown, and that there exists a random variable S², independent of θ̂, such that S²/σ² ~ χ²_ν; let σ̂² = S²/ν. If n − k > 4, then the nontrivial real root r of ΔU(a) = 0 is positive. Furthermore, for any choice a ∈ (0, r), the upper bound ΔU < 0, implying Δσ̂ < 0. That is, for 0 < a < r, the risk difference is negative and the smoothed-data dissimilarity estimator dij,σ̂(JS) is better than dij.

Proof of Theorem 3.3: Note that ΔU(a) = 0 may be written as c4a⁴ + c3a³ + c2a² + c1a = 0, with c4 > 0, c3 < 0, c2 > 0, c1 < 0. The proof then follows exactly as the proof of Theorem 3.2 in Section 3.2. □

3.4 A Bayes Result: dij(smooth) as a Limit of Bayes Estimators

In this section, we again assume S, the smoothing matrix, to be symmetric and idempotent. We again assume θ̂ ~ N(θ, σ²I) and again, without loss of generality, let σ²I = I. (Otherwise, we can let, for example, η̂ = σ⁻¹θ̂ and η = σ⁻¹θ and work with η̂ and η instead.) Also, assume that θ lies in the linear subspace (of dimension k < n) that S projects onto, and hence Sθ = θ. We will split f(θ̂|θ)


into a part involving (I − S)θ̂ and a part involving Sθ̂.

f(θ̂|θ) ∝ exp[−(1/2)(θ̂ − θ)'(θ̂ − θ)]

= exp[−(1/2)(θ̂ − θ)'(I − S + S)(θ̂ − θ)]

= exp{−(1/2)[θ̂'(I − S)θ̂ + (θ̂ − θ)'S(θ̂ − θ)]},

which indeed has a part involving (I − S)θ̂ and a part involving Sθ̂.

Since S is symmetric, for some orthogonal matrix P, PSP' is diagonal. Since S is idempotent, its eigenvalues are 0 and 1, so PSP' = B, where

B = [ Ik  0
      0   0 ].

Let p1', ..., pn' be the rows of P. Then θ* = Pθ = (p1'θ, ..., pk'θ, 0, ..., 0)'. And let θ̃ be the vector containing the k nonzero elements of θ*. Similarly, θ̂* = Pθ̂.

Now we consider the likelihood for θ:

L(θ|θ̂) ∝ exp{−(1/2)[θ̂'(I − S)θ̂ + (θ̂ − θ)'S(θ̂ − θ)]}

∝ exp{−(1/2)[(θ̂ − θ)'S(θ̂ − θ)]}

= exp{−(1/2)[(θ̂ − θ)'P'PSP'P(θ̂ − θ)]}

= exp{−(1/2)[(θ̂* − θ*)'PSP'(θ̂* − θ*)]}

= exp{−(1/2)[(θ̂* − θ*)'B(θ̂* − θ*)]}.

Now we put a N(0, V) prior on θ*:

π(θ*) ∝ exp[−(1/2)θ*'V⁻θ*],

V⁻ = [ V1⁻¹  0
        0    0 ],

where V1 is the full-rank prior variance of θ̃. Note that this prior is only a proper density when considered as a density over the first k elements of θ*. Recall that the last n − k elements of θ* are zero.

Then the posterior for θ* is

π(θ*|θ̂*) ∝ exp{−(1/2)[(θ̂* − θ*)'B(θ̂* − θ*) + θ*'V⁻θ*]}

= exp{−(1/2)[θ̂*'Bθ̂* − 2θ*'Bθ̂* + θ*'(B + V⁻)θ*]}

∝ exp{−(1/2)[θ̂*'B'(B + V⁻)⁻Bθ̂* − 2θ*'Bθ̂* + θ*'(B + V⁻)θ*]}

= exp{−(1/2)[(θ* − (B + V⁻)⁻Bθ̂*)'(B + V⁻)(θ* − (B + V⁻)⁻Bθ̂*)]}.

Hence the posterior π(θ*|θ̂*) is N[(B + V⁻)⁻Bθ̂*, (B + V⁻)⁻], which is N[(B + V⁻)⁻BPθ̂, (B + V⁻)⁻].

Thus the posterior expectation of (T/n)θ'θ is

E[(T/n)θ'θ | θ̂] = E[(T/n)θ'P'Pθ | θ̂] = E[(T/n)θ*'θ* | θ̂]

= (T/n) θ̂*'B(B + V⁻)⁻(B + V⁻)⁻Bθ̂* + tr((T/n)(B + V⁻)⁻)

= (T/n) θ̂'P'B(B + V⁻)⁻(B + V⁻)⁻BPθ̂ + tr((T/n)(B + V⁻)⁻)

= (T/n) θ̂'SP'(B + V⁻)⁻(B + V⁻)⁻PSθ̂ + tr((T/n)(B + V⁻)⁻).

Let us choose the prior variance so that V⁻ = (1/n)B. Then

E[(T/n)θ'θ | θ̂] = (T/n) θ̂'SP'((n/(n + 1))B)((n/(n + 1))B)PSθ̂ + tr((T/n)(n/(n + 1))B)

= (T/n) θ̂'SP'(n²/(n + 1)²)BPSθ̂ + tr((T/(n + 1))B)


and since P'BP = S,

E[(T/n)θ'θ | θ̂] = (T/n)(n²/(n + 1)²) θ̂'S'Sθ̂ + (T/(n + 1))k.

This posterior mean is the Bayes estimator of the dissimilarity.

Since tr(B) = k, note that tr((T/(n + 1))B) = Tk/(n + 1) → 0 as n → ∞. Also, n²/(n + 1)² → 1 as n → ∞.

Then if we let the number of measurement points n → ∞ in a fixed-domain sense, these Bayes estimators approach dij(smooth) = (T/n)θ̂'S'Sθ̂ in the sense that the difference between the Bayes estimators and dij(smooth) tends to zero. Recall that (T/n)θ'θ is the approximate distance between curve i and curve j, which in the limit approaches the exact distance δij between curve i and curve j.

CHAPTER 4

CASE II: DATA FOLLOWING THE FUNCTIONAL NOISE MODEL

Now we will consider functional data following model (2.2). Again, we assume

the response is measured at n discrete points in [0, T], but here we assume a

dependence among errors measured at different points.

4.1 Comparing MSEs of Dissimilarity Estimators when θ is in the

Linear Subspace Defined by S

As with Case I, we assume our linear smoothing matrix S is symmetric and idempotent, with r(S) = tr(S) = k. Recall that according to the functional noise model for the data, θ̂ ~ N(θ, Σ), where the covariance matrix Σ corresponds, e.g., to a stationary Ornstein-Uhlenbeck process.

Note that (T/n)θ̂'θ̂ represents the approximate L2 distance between observed curves yi(t) and yj(t), and that, with an Ornstein-Uhlenbeck-type error structure, it approaches the exact distance as n → ∞ in a fixed-domain sense.

Also, (T/n)θ̂'S'Sθ̂ = (T/n)θ̂'Sθ̂ represents the approximate L2 distance between smoothed curves μ̂i(t) and μ̂j(t), and this approaches the exact distance as n → ∞.

We wish to see when the smoothed-data dissimilarity better estimates the true dissimilarity δij between curves μi(t) and μj(t) than does the observed-data dissimilarity. To this end, let us examine MSE((T/n)θ̂'Sθ̂) and MSE((T/n)θ̂'θ̂) in estimating (T/n)θ'θ (which approaches δij as n → ∞).

In this section, we consider the case in which θ lies in the linear subspace defined by S, i.e., Sθ = θ. We now present a theorem which generalizes the result of Theorem 3.1 to the case of a general covariance matrix.

Theorem 4.1 Suppose the observed θ̂ ~ N(θ, Σ). Let S be a symmetric and idempotent linear smoothing matrix of rank k. If θ lies in the linear subspace defined by S, then the dissimilarity estimator (T/n)θ̂'Sθ̂ has smaller mean squared error than does (T/n)θ̂'θ̂ in estimating (T/n)θ'θ. That is, the dissimilarity estimator based on the smooth is better than the one based on the observed data in estimating the dissimilarities between the underlying signal curves.

Proof of Theorem 4.1:

MSE((T/n)θ̂'θ̂) = E{[(T/n)θ̂'θ̂ − (T/n)θ'θ]²}

= var((T/n)θ̂'θ̂) + {E[(T/n)θ̂'θ̂] − (T/n)θ'θ}²

= (T²/n²)[2tr(Σ²) + 4θ'Σθ] + {(T/n)[tr(Σ) + θ'θ] − (T/n)θ'θ}²

= (T²/n²){2tr(Σ²) + 4θ'Σθ + [tr(Σ)]²}. (4.1)

MSE((T/n)θ̂'Sθ̂) = E{[(T/n)θ̂'Sθ̂ − (T/n)θ'θ]²}

= var((T/n)θ̂'Sθ̂) + {E[(T/n)θ̂'Sθ̂] − (T/n)θ'θ}²

= (T²/n²)[2tr((SΣ)²) + 4θ'SΣSθ] + {(T/n)[tr(SΣ) + θ'Sθ] − (T/n)θ'θ}²

= (T²/n²){2tr[(SΣ)²] + 4θ'Σθ + [tr(SΣ)]²}, (4.2)

using Sθ = θ (so that θ'SΣSθ = θ'Σθ and θ'Sθ = θ'θ).

Hence, if we compare (4.2) with (4.1), we must show that tr(SΣ) < tr(Σ) and tr[(SΣ)²] < tr[Σ²] to complete the proof. First,

tr(Σ) = tr(SΣ + (I − S)Σ)

= tr(SΣ) + tr((I − S)Σ)

= tr(SΣ) + tr(Σ^(1/2)(I − S)(I − S)Σ^(1/2))

≥ tr(SΣ)

since the last term is the sum of the squared elements of (I − S)Σ^(1/2). Second,

tr(Σ²) = tr[(SΣ + (I − S)Σ)²]

= tr[(SΣ)²] + tr[SΣ(I − S)Σ] + tr[(I − S)ΣSΣ] + tr[((I − S)Σ)²]

= tr[(SΣ)²] + 2tr[SΣ(I − S)(I − S)ΣS] + tr[Σ^(1/2)(I − S)Σ^(1/2)Σ^(1/2)(I − S)Σ^(1/2)]

≥ tr[(SΣ)²]

since tr[SΣ(I − S)(I − S)ΣS] is the sum of the squared elements of (I − S)ΣS and the last term is the sum of the squared elements of Σ^(1/2)(I − S)Σ^(1/2).

Hence MSE((T/n)θ̂'Sθ̂) < MSE((T/n)θ̂'θ̂). Since (I − S)Σ^(1/2) (which has rank n − k) is not the zero matrix, the inequality is strict. □
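The two trace inequalities are easy to verify numerically for a concrete Σ and S (the sketch below uses an Ornstein-Uhlenbeck covariance and a piecewise-constant projection; both choices are illustrative assumptions):

```python
import numpy as np

n, k = 60, 6
t = np.linspace(0.0, 1.0, n)
# Ornstein-Uhlenbeck covariance with pull parameter 2 and sigma^2 = 1
Sigma = np.exp(-2.0 * np.abs(t[:, None] - t[None, :])) / 4.0
B = np.kron(np.eye(k), np.ones((n // k, 1)))   # rank-k projection basis
S = B @ np.linalg.solve(B.T @ B, B.T)
tr_Sigma = float(np.trace(Sigma))
tr_SSigma = float(np.trace(S @ Sigma))
tr_Sigma2 = float(np.trace(Sigma @ Sigma))
tr_SSigma2 = float(np.trace(S @ Sigma @ S @ Sigma))
```

As the proof requires, tr(SΣ) falls strictly below tr(Σ) and tr[(SΣ)²] strictly below tr[Σ²], so the smoothed-data estimator has the smaller MSE here.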

4.2 James-Stein Shrinkage Estimation in the Functional Noise Model

Now we will consider functional data following model (2.2), and in this

section we do not assume θ lies in the subspace defined by S. Again, we assume

the response is measured at n discrete points in [0, T], but here we assume a

dependence among errors measured at different points.

In Section 3.2, we obtained an exact upper bound for the difference in risks

between the smoothed-data estimator and observed-data estimator. In this section

we will develop an asymptotic (large n) upper bound for this difference of risks.

The upper bound is asymptotic here in the sense that the bound is valid for

sufficiently large n, not necessarily that the expression for the bound converges to a

meaningful limiting expression for infinite n.


Note that as the number of measurement points n grows (within a fixed

domain), assuming a certain dependence across measurement points (e.g., a corre-

lation structure like that of a stationary Ornstein-Uhlenbeck process), the observed

data y1, ..., yN closely resemble the pure observed functions y1(t), ..., yN(t) on

[0, T]. Therefore this result will be most appropriate for situations with a large

number of measurements taken on a functional process, so that the observed data

vector is "nearly" a pure function.

As with Case I, we assume our linear smoothing matrix S is symmetric and

idempotent, with r(S) = tr(S) = k. Recall that according to the functional noise
model for the data, the error vector ε ~ N(0, Σ), where Σ is, for example, a known covariance
matrix corresponding to a stationary Ornstein-Uhlenbeck process. In this section,

we will assume a Gaussian error process whose covariance structure allows for the

possibility of dependence among the errors at different measurement points.
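For concreteness, the covariance matrix of a stationary O-U error process observed on a grid can be built directly from its exponential autocovariance. The sketch below assumes the standard O-U parameterization, under which each diagonal element of Σ is σ²/(2β); with σ² = 2 and β = 1 this reproduces the tr(Σ) = 50 used in the example later in this section:

```python
import numpy as np

def ou_covariance(tpoints, sigma2=2.0, beta=1.0):
    """Covariance of a stationary O-U error process at the given points:
    Cov[eps(t_i), eps(t_j)] = sigma2/(2*beta) * exp(-beta*|t_i - t_j|)."""
    t = np.asarray(tpoints, dtype=float)
    return sigma2 / (2.0 * beta) * np.exp(-beta * np.abs(t[:, None] - t[None, :]))

t = np.arange(50.0)        # 50 equally spaced points, one unit apart
Sigma = ou_covariance(t)   # sigma^2 = 2, beta = 1: each diagonal element is 1
print(np.trace(Sigma))     # -> 50.0
```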

We consider the same James-Stein estimator of θ as in Section 3.2, namely
θ̂^(JS) (in practice, again, we use the positive-part estimator θ̂^(JS+)), and the same
James-Stein dissimilarity estimator.
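The estimator's form is not restated here; the sketch below assumes the usual positive-part construction (the rough component (I − S)θ̂ is shrunk by a factor (1 − a/q̂1)₊, consistent with the quantities q̂1 and q̂2 that appear in the derivation), so it should be read as an illustrative assumption rather than a verbatim implementation of Section 3.2:

```python
import numpy as np

def js_dissimilarity(theta_hat, S, a, T=1.0):
    # Smooth and rough components of theta_hat = y_i - y_j.
    smooth = S @ theta_hat
    rough = theta_hat - smooth              # (I - S) theta_hat
    q1 = rough @ rough                      # q1_hat = theta_hat'(I - S)theta_hat
    shrink = max(0.0, 1.0 - a / q1)         # positive-part shrinkage factor
    theta_js = smooth + shrink * rough
    n = len(theta_hat)
    return (T / n) * (theta_js @ theta_js)

rng = np.random.default_rng(1)
n, k = 30, 4
X = rng.standard_normal((n, k))
S = X @ np.linalg.solve(X.T @ X, X.T)       # symmetric idempotent smoother
theta_hat = rng.standard_normal(n)

d_js = js_dissimilarity(theta_hat, S, a=2.0, T=20.0)
d_obs = (20.0 / n) * (theta_hat @ theta_hat)
print(d_js <= d_obs)                        # -> True: shrinkage cannot increase the norm
```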

Recall the definitions of the following quadratic forms:

q1 = θ'(I − S)θ

q2 = θ'Sθ

q̂1 = θ̂'(I − S)θ̂

q̂2 = θ̂'Sθ̂.

Unlike in Section 3.2, where we assumed an independent error structure, under the functional noise model q̂1 and q̂2 do not have noncentral χ² distributions, nor are they independent in general.


In the same way as in Section 3.2, we may write Δ as:

Δ = E[ a⁴/q̂1² − 4a³/q̂1 − 4a(q̂1 + q̂2) + 4a² + 4a(q1 + q2) + (2a²/q̂1)(q̂1 + q̂2 − q1 − q2) ].

Note that

E[q̂1 + q̂2] = q1 + q2 + tr[(I − S)Σ] + tr(SΣ)

= q1 + q2 + tr(Σ).

Hence

Δ = −4a tr(Σ) + 4a² + E[ a⁴/q̂1² − 4a³/q̂1 + (2a²/q̂1)(q̂1 + q̂2 − q1 − q2) ].

Consider

E[q̂2/q̂1] = E[ θ̂'Sθ̂ / (θ̂'(I − S)θ̂) ].

Lieberman (1994) gives a Laplace approximation for E[(x'Fx/x'Gx)^k], k ≥ 1, where F is symmetric and G positive definite:

E[(x'Fx/x'Gx)^k] ≈ E[(x'Fx)^k] / [E(x'Gx)]^k.

In our case, (I − S) is merely positive semidefinite, but under a simple regularity
condition the result still holds (see Appendix B.2). The Laplace approximation
approaches the true value of the expected ratio, i.e., the difference between the true
expected ratio and the approximated expected ratio tends to 0 as n → ∞;

Lieberman (1994) gives sufficient conditions involving the cumulants of the

quadratic forms which establish rates of convergence of the approximation.

Hence

E[q̂2/q̂1] = E[ θ̂'Sθ̂ / (θ̂'(I − S)θ̂) ] ≈ E(θ̂'Sθ̂) / E(θ̂'(I − S)θ̂) = (q2 + tr(SΣ)) / (q1 + tr[(I − S)Σ]).


Hence we have a large-n approximation, or asymptotic expression, for Δ:

Δ ≈ −4a tr(Σ) + 4a² + E[ a⁴/q̂1² − 4a³/q̂1 ] + 2a²( (q2 + tr(SΣ)) / (q1 + tr[(I − S)Σ]) ) + 2a² − 2a²(q1 + q2)E[1/q̂1]

= −4a tr(Σ) + 4a² + E[ a⁴/q̂1² ] + 2a²( 1 + (q2 + tr(SΣ)) / (q1 + tr[(I − S)Σ]) − (2a + q1 + q2)E[1/q̂1] ).

Since by Jensen's inequality E(1/q̂1) ≥ 1/E(q̂1), we can bound the last term above:

2a²( 1 + (q2 + tr(SΣ)) / (q1 + tr[(I − S)Σ]) − (2a + q1 + q2)E[1/q̂1] )

≤ 2a²( 1 + (q2 + tr(SΣ)) / (q1 + tr[(I − S)Σ]) − (2a + q1 + q2) / (q1 + tr[(I − S)Σ]) )

= 2a²( (q1 + tr[(I − S)Σ] + q2 + tr(SΣ) − 2a − q1 − q2) / (q1 + tr[(I − S)Σ]) )

= 2a²( (tr(Σ) − 2a) / (q1 + tr[(I − S)Σ]) )

≤ 2a²( (tr(Σ) − 2a) / tr[(I − S)Σ] ).

Hence we have the asymptotic upper bound for Δ:

Δ_U = 4a² − 4a tr(Σ) + E[ a⁴/q̂1² ] + 2a²( (tr(Σ) − 2a) / tr[(I − S)Σ] ).

Denote the eigenvalues of Σ^{1/2}(I − S)Σ^{1/2} by c1, ..., cn. Since (I − S) is positive semidefinite, it is clear that c1, ..., cn are all nonnegative. Note that c1, ..., cn are also the eigenvalues of (I − S)Σ, since Σ^{1/2}(I − S)Σ^{1/2} = Σ^{1/2}[(I − S)Σ]Σ^{−1/2}, so that Σ^{1/2}(I − S)Σ^{1/2} and (I − S)Σ are similar matrices. Note that since r[Σ^{1/2}(I − S)Σ^{1/2}] = r(I − S) = n − k, Σ^{1/2}(I − S)Σ^{1/2} has n − k nonzero eigenvalues, and so does (I − S)Σ.

It is well known (see, e.g., Baldessari, 1967; Tan, 1977) that if y ~ N(μ, V) for positive definite V, then for symmetric A, y'Ay is distributed as a linear combination of independent noncentral χ² random variables, the coefficients of which are the eigenvalues of AV.


Specifically, since q̂1 = θ̂'(I − S)θ̂, and (I − S)Σ has eigenvalues c1, ..., cn, then

q̂1 ~ Σ_{i=1}^{n} ci χ²_{1,i}(δi²)

for noncentrality parameters δi² ≥ 0 which are zero when θ = 0.

Under θ = 0, then, q̂1 ~ Σ_{i=1}^{n} ci χ²_{1,i}, a linear combination of independent central χ² variates.

Since a noncentral χ²₁ random variable stochastically dominates a central χ²₁, for any δ² > 0,

P[χ²₁(δ²) ≥ x] ≥ P[χ²₁ ≥ x]  for all x > 0.

As shown in Appendix A.2, this implies

P[c1 χ²_{1,1}(δ1²) + ··· + cn χ²_{1,n}(δn²) ≥ x] ≥ P[c1 χ²_{1,1} + ··· + cn χ²_{1,n} ≥ x]  for all x > 0.

Hence for all x > 0, P_{θ≠0}[q̂1 ≥ x] ≥ P_{0}[q̂1 ≥ x], i.e., the distribution of q̂1 with θ ≠ 0 stochastically dominates the distribution of q̂1 with θ = 0. Then

E_{θ≠0}[(q̂1)^{−m}] ≤ E_{0}[(q̂1)^{−m}]

for m = 1, 2, ....

Letting M2 = E_0[(q̂1)^{−2}], we have the following asymptotic upper bound for Δ:

Δ*_U = 4a² − 4a tr(Σ) + M2 a⁴ + 2a² tr(Σ)/tr[(I − S)Σ] − 4a³/tr[(I − S)Σ]

= M2 a⁴ − (4/tr[(I − S)Σ]) a³ + (2 tr(Σ)/tr[(I − S)Σ] + 4) a² − 4 tr(Σ) a.

Then M2 = E[(Σ_{i=1}^{n} ci χ²_{1,i})^{−2}]. We see that if n − k > 4, then M2 exists and is positive, since

E[(w_max χ²_{n−k})^{−2}] ≤ M2 ≤ E[(w_min χ²_{n−k})^{−2}] < ∞,

where w_min is the smallest and w_max the largest of the n − k nonzero eigenvalues of (I − S)Σ.

As was the case in the discrete noise situation, Δ*_U(a) = 0 is a fourth-degree equation with two real roots, one of which is trivially zero. Call the nontrivial real root r.

Theorem 4.2 If n − k > 4, then the nontrivial real root r of Δ*_U(a) = 0 is positive. Furthermore, for any choice a ∈ (0, r), the asymptotic upper bound Δ*_U < 0, implying Δ < 0. That is, for 0 < a < r, and for sufficiently large n, the risk difference is negative and the smoothed-data dissimilarity estimator d̂_ij^(JS) is better than d̂_ij.

Proof of Theorem 4.2: Since Σ is positive definite, tr(Σ) is positive. Since M2 > 0, we may write Δ*_U(a) = 0 as c4a⁴ + c3a³ + c2a² + c1a = 0, where c4 > 0, c3 < 0, c2 > 0, c1 < 0. The proof then follows exactly from the proof of Theorem 3.2 in Section 3.2. □

As with the discrete noise situation, one can easily verify, using a symbolic algebra program, that two of the roots are imaginary, and can determine the nontrivial real root in terms of M2, tr[(I − S)Σ], and tr(Σ).

As an example, let us consider a situation in which we observe functional data y1, ..., yN measured at 50 equally spaced points t1, ..., t50, one unit apart. Here, let us assume the observations are discretized versions of functions y1(t), ..., yN(t) which contain (possibly different) signal functions, plus a noise function arising from an Ornstein-Uhlenbeck (O-U) process. We can then calculate the covariance matrix of each yi, and the covariance matrix Σ of θ̂. Under the O-U model, tr(Σ) = (σ²/(2β))n always, since each diagonal element of Σ is σ²/(2β).

For example, suppose the O-U process has σ² = 2 and β = 1. Then in this example, tr(Σ) = 50. Suppose we choose S to be the smoothing matrix corresponding to a B-spline basis smoother with 6 knots, dispersed evenly within the data. Then


we can easily calculate the eigenvalues of (I − S)Σ, which are c1, ..., cn, and via numerical or Monte Carlo integration, we find that M2 = 0.0012 in this case.
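A Monte Carlo computation of M2 along these lines can be sketched as follows; a polynomial projection is used here in place of the B-spline smoother, so the resulting value will not match the 0.0012 reported above:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 7

# A rank-k polynomial projection stands in for the B-spline smoother of the text.
X = np.vander(np.linspace(0.0, 1.0, n), k, increasing=True)
S = X @ np.linalg.solve(X.T @ X, X.T)

t = np.arange(n, dtype=float)
Sigma = np.exp(-np.abs(t[:, None] - t[None, :]))   # O-U with sigma^2 = 2, beta = 1

c = np.linalg.eigvals((np.eye(n) - S) @ Sigma).real
c = np.clip(c, 0.0, None)          # k eigenvalues are zero up to round-off

# Under theta = 0, q1_hat ~ sum_i c_i * chi^2_{1,i}; estimate M2 = E[q1_hat^(-2)].
chi2 = rng.standard_normal((100000, n)) ** 2
M2 = np.mean((chi2 @ c) ** (-2.0))
print(M2)
```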

Substituting these values into Δ*_U(a), we see, in the top plot of Figure 4-1, the asymptotic upper bound plotted as a function of a. Also plotted is a simulated true Δ for a variety of values (n-vectors) of θ: 0 × 1, (0, 1, 0, 1, 0, ..., 0, 1)', and (−1, 0, 1, −1, 0, 1, ..., −1, 0)', where 1 is an n-vector of ones. It should be noted that when θ is the zero vector, it lies in the subspace defined by S, since Sθ = θ in that case. The other two values of θ shown in this plot do not lie in the subspace. In any case, however, choosing the a that optimizes the upper bound guarantees that d̂_ij^(JS) has smaller risk than d̂_ij.

Also shown, in the bottom plot of Figure 4-1, is the asymptotic upper bound

for data following the same Ornstein-Uhlenbeck model, except with n = 30.

Although Δ*_U is a large-n upper bound, it appears to work well for the grid size of 30.

4.3 Extension to the Case when a2 is Unknown

In Section 3.2, it was proved that the smoothed-data dissimilarity estimator,

with the shrinkage adjustment, dominated the observed-data dissimilarity estimator

in estimating the true dissimilarities, when the covariance matrix of θ̂ was σ²I,
σ² known. In Section 3.3, we established the domination of the smoothed-data
estimator for covariance matrix σ²I, σ² unknown.

In Section 4.2, we developed an analogous asymptotic result for general known covariance matrix Σ. In this section, we extend the asymptotic result to the case of θ̂ having covariance matrix of the form V = σ²Σ, where σ² is unknown and Σ is a known symmetric, positive definite matrix. This encompasses the functional noise model (2.2) in which the errors follow an Ornstein-Uhlenbeck process with unknown σ² and known β. (Of course, this also includes the discrete noise model


[Figure 4-1 appears here: two panels plotting Δ*_U against a for the O-U example, with n = 50 (top, a from 0 to 150) and n = 30 (bottom, a from 0 to 60).]

Figure 4-1: Plot of asymptotic upper bound, and simulated Δ's, for Ornstein-Uhlenbeck-type data.

Solid line: plot of Δ*_U against a for the above O-U process. Dashed line: simulated Δ, θ = 0 × 1. Dotted line: simulated Δ, θ = (0, 1, 0, 1, 0, ..., 0, 1)'. Dot-dashed line: simulated Δ, θ = (−1, 0, 1, −1, 0, 1, ..., −1, 0)'. Top: n = 50. Bottom: n = 30.


(2.1), in which V = σ²I, but this case was dealt with in Section 3.3, in which an exact domination result was shown.)

Assume θ̂ ~ N(θ, V), with V = σ²Σ. Suppose, as in Section 3.3, that there exists a random variable S², independent of θ̂, such that

S²/σ² ~ χ²_ν.

Then let σ̂² = S²/ν.

As shown in Section 3.3, we may write

Δ = E[ −(4aσ̂² − 2a²σ̂⁴/q̂1)(θ̂'θ̂ − θ'θ) + (2aσ̂² − a²σ̂⁴/q̂1)² ].

Since σ̂ and θ̂ are independent,

E[−4aσ̂²(θ̂'θ̂ − θ'θ)] = E[−4aσ̂²]E[θ̂'θ̂ − θ'θ] = E[−4aσ̂²]σ²tr(Σ)

and

E[ (2a²σ̂⁴/q̂1)(θ̂'θ̂ − θ'θ) ]

= E[ (2a²σ̂⁴/q̂1)(θ̂'(I − S)θ̂ + θ̂'Sθ̂ − θ'(I − S)θ − θ'Sθ) ]

= E[ (2a²σ̂⁴/q̂1)(q̂1 + q̂2 − q1 − q2) ]

= E[2a²σ̂⁴]E[q̂2/q̂1] + E[ (2a²σ̂⁴/q̂1)(q̂1 − q1 − q2) ]   (4.3)

by the independence of σ̂ and θ̂ (and thus of σ̂ and q̂2/q̂1). We use the (large n) approximation to E[q̂2/q̂1] obtained in Section 4.2 to obtain an asymptotic upper bound for (4.3):

E[2a²σ̂⁴]( (q2 + σ²tr(SΣ)) / (q1 + σ²tr[(I − S)Σ]) ) + E[ (2a²σ̂⁴/q̂1)(q̂1 − q1 − q2) ].   (4.4)


Now, since 2a²σ̂⁴/q̂1 is decreasing in q̂1 and q̂1 − q1 − q2 is increasing in q̂1, these quantities have covariance ≤ 0. Hence

(4.4) ≤ E[2a²σ̂⁴]( (q2 + σ²tr(SΣ)) / (q1 + σ²tr[(I − S)Σ]) ) + E[2a²σ̂⁴/q̂1] E[q̂1 − q1 − q2]

= E[2a²σ̂⁴]( (q2 + σ²tr(SΣ)) / (q1 + σ²tr[(I − S)Σ]) ) + E[2a²σ̂⁴/q̂1]( σ²tr[(I − S)Σ] − q2 ).

Since by Jensen's inequality (and the independence of σ̂ and q̂1), E[2a²σ̂⁴/q̂1] ≥ E[2a²σ̂⁴]/(q1 + σ²tr[(I − S)Σ]), the term involving −q2 satisfies

−q2 E[2a²σ̂⁴/q̂1] ≤ −q2 E[2a²σ̂⁴] / (q1 + σ²tr[(I − S)Σ]).

Hence

(4.4) ≤ E[2a²σ̂⁴]( σ²tr(SΣ) / (q1 + σ²tr[(I − S)Σ]) ) + E[ 2a²σ̂⁴ σ²tr[(I − S)Σ] / q̂1 ]

≤ E[2a²σ̂⁴] tr(SΣ)/tr[(I − S)Σ] + E[ 2a²σ̂⁴ σ²tr[(I − S)Σ] / q̂1 ].

Our asymptotic upper bound for Δ is:

Δ_U = E[−4aσ̂²]σ²tr(Σ) + E[2a²σ̂⁴] tr(SΣ)/tr[(I − S)Σ] + E[ 2a²σ̂⁴ σ²tr[(I − S)Σ] / q̂1 ]

+ E[ 4a²σ̂⁴ − 4a³σ̂⁶/q̂1 + a⁴σ̂⁸/q̂1² ]

= E[ −4aσ̂²σ²tr(Σ) + 2a²σ̂⁴ tr(SΣ)/tr[(I − S)Σ] + 2a²σ̂⁴ σ²tr[(I − S)Σ]/q̂1

+ 4a²σ̂⁴ − 4a³σ̂⁶/q̂1 + a⁴σ̂⁸/q̂1² ].


Recall that

E[aσ̂²] = aσ²

E[a²σ̂⁴] = a²σ⁴( ν(ν + 2)/ν² )

E[a³σ̂⁶] = a³σ⁶( ν(ν + 2)(ν + 4)/ν³ )

E[a⁴σ̂⁸] = a⁴σ⁸( ν(ν + 2)(ν + 4)(ν + 6)/ν⁴ ).
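These moment formulas follow from the moments of a χ²_ν variable and are easy to confirm by simulation; the parameter values below are illustrative, and a = 1 is taken without loss of generality:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, nu = 1.7, 12.0                                    # illustrative values
s2 = sigma2 * rng.chisquare(int(nu), size=500000) / nu    # sigma_hat^2 = S^2 / nu

# Sample moments against the displayed formulas.
for m, target in [
    (1, sigma2),
    (2, sigma2**2 * nu * (nu + 2) / nu**2),
    (3, sigma2**3 * nu * (nu + 2) * (nu + 4) / nu**3),
    (4, sigma2**4 * nu * (nu + 2) * (nu + 4) * (nu + 6) / nu**4),
]:
    print(m, np.mean(s2**m), target)
```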

Since σ̂² and q̂1 are independent, taking expectations:

Δ_U = −4aσ⁴tr(Σ) + 2a²σ⁴( ν(ν + 2)/ν² ) σ²tr[(I − S)Σ] E[1/q̂1]

+ 2a²σ⁴( ν(ν + 2)/ν² ) tr(SΣ)/tr[(I − S)Σ] + 4a²σ⁴( ν(ν + 2)/ν² )

− 4a³σ⁶( ν(ν + 2)(ν + 4)/ν³ ) E[1/q̂1] + a⁴σ⁸( ν(ν + 2)(ν + 4)(ν + 6)/ν⁴ ) E[1/q̂1²].

As in Section 3.3, define q̂1* = q̂1/σ² = (θ̂/σ)'(I − S)(θ̂/σ). Here, since θ̂ ~ N(θ, σ²Σ), (θ̂/σ) ~ N(ζ, Σ) where ζ = θ/σ. Then the risk difference is a function of ζ, and the distribution of q̂1* is a function of ζ. Then

Δ_U = −4aσ⁴tr(Σ) + 2a²σ⁴( ν(ν + 2)/ν² ) tr[(I − S)Σ] E[1/q̂1*]

+ 2a²σ⁴( ν(ν + 2)/ν² ) tr(SΣ)/tr[(I − S)Σ] + 4a²σ⁴( ν(ν + 2)/ν² )

− 4a³σ⁴( ν(ν + 2)(ν + 4)/ν³ ) E[1/q̂1*] + a⁴σ⁴( ν(ν + 2)(ν + 4)(ν + 6)/ν⁴ ) E[1/q̂1*²].

We may now divide through by σ⁴ > 0:

Δ_U/σ⁴ = −4a tr(Σ) + 2a²( ν(ν + 2)/ν² ) tr[(I − S)Σ] E[1/q̂1*]

+ 2a²( ν(ν + 2)/ν² ) tr(SΣ)/tr[(I − S)Σ] + 4a²( ν(ν + 2)/ν² )

− 4a³( ν(ν + 2)(ν + 4)/ν³ ) E[1/q̂1*] + a⁴( ν(ν + 2)(ν + 4)(ν + 6)/ν⁴ ) E[1/q̂1*²].


As shown in Section 4.2, for all x > 0, P_{θ≠0}[q̂1 ≥ x] ≥ P_{0}[q̂1 ≥ x]. Note that this is equivalent to: for all x > 0, P_{ζ≠0}[q̂1* ≥ x] ≥ P_{0}[q̂1* ≥ x]. Then

E_{ζ≠0}[(q̂1*)^{−m}] ≤ E_{0}[(q̂1*)^{−m}]

for m = 1, 2.

Let M1* = E_0[(q̂1*)^{−1}] and M2* = E_0[(q̂1*)^{−2}]. Then

Δ_U/σ⁴ ≤ ( ν(ν + 2)(ν + 4)(ν + 6)/ν⁴ ) M2* a⁴ − 4( ν(ν + 2)(ν + 4)/ν³ ) M1* a³

+ ( ν(ν + 2)/ν² )( 2M1* tr[(I − S)Σ] + 2 tr(SΣ)/tr[(I − S)Σ] + 4 ) a² − 4 tr(Σ) a.

Note that if this last expression (which does not involve σ and which we denote simply as Δ*_U(a)) is less than 0, then Δ_U < 0, which is what we wish to prove.

We may repeat the argument from Section 4.2 in which we showed that when θ = 0, q̂1 ~ Σ_{i=1}^{n} ci χ²_{1,i}, replacing q̂1 with q̂1*, θ̂ with θ̂/σ, and θ with ζ. Thus we conclude that when ζ = 0, q̂1* ~ Σ_{i=1}^{n} vi χ²_{1,i}, where v1, ..., vn are the eigenvalues of (I − S)Σ.

Again, if n − k > 4, then M1* and M2* exist and are positive, as was shown in Section 4.2.

Again, Δ*_U(a) = 0 is a fourth-degree equation with two real roots, one of which is trivially zero. Call the nontrivial real root r.

Theorem 4.3 Suppose the same conditions as Theorem 3.3, except generalize the covariance matrix of θ̂ to be σ²Σ, with σ² unknown and Σ known, symmetric, and positive definite. If n − k > 4, then the nontrivial real root r of Δ*_U(a) = 0 is positive. Furthermore, for any choice a ∈ (0, r), the asymptotic upper bound Δ*_U < 0, implying Δ < 0. That is, for 0 < a < r, and for sufficiently large n, the risk difference is negative and the smoothed-data dissimilarity estimator d̂_ij^(JS) is better than d̂_ij.

Proof of Theorem 4.3: Since Σ is positive definite, tr(Σ) is positive. Since M1* > 0 and M2* > 0, we may write Δ*_U(a) = 0 as c4a⁴ + c3a³ + c2a² + c1a = 0, where c4 > 0, c3 < 0, c2 > 0, c1 < 0. The proof is exactly the same as the proof of Theorem 4.2 in Section 4.2. □

4.4 A Pure Functional Analytic Approach to Smoothing

Suppose we have two arbitrary observed random curves yi(t) and yj(t), with underlying signal curves μi(t) and μj(t). This is a more abstract situation in which the responses themselves are not assumed to be discretized, but rather come to the analyst as "pure" functions. This can be viewed as a limiting case of the functional noise model (2.2) when the number of measurement points n → ∞ in a fixed-domain sense. Throughout this section, we assume the noise components of the observed curves are normally distributed for each fixed t, as, for example, in an Ornstein-Uhlenbeck process.

Let C[0, T] denote the space of continuous functions of t on [0, T]. Assume we have a linear smoothing operator S which acts on the observed curves to produce a smooth: μ̂i(t) = S[yi](t), for example. The linear operator S maps a function f ∈ C[0, T] to C[0, T] as follows:

S[f](t) = ∫₀ᵀ K(t, s)f(s) ds

(where K is some known function) with the property that S[αf + βg] = αS[f] + βS[g] for constants α, β and functions f, g ∈ C[0, T] (DeVito, 1990, p. 66).
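On a grid, such an integral operator becomes an n × n matrix acting on sampled function values. A minimal sketch, with a Gaussian kernel K as an illustrative (assumed) choice, row-normalized so the discrete operator preserves constant functions:

```python
import numpy as np

# Discretization of S[f](t) = int_0^T K(t, s) f(s) ds on an n-point grid.
T_len, n = 20.0, 200
t = np.linspace(0.0, T_len, n)

K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 1.0) ** 2)   # Gaussian kernel
K /= K.sum(axis=1, keepdims=True)    # row-normalize (absorbs the ds factor)

rng = np.random.default_rng(5)
f = np.sin(t) + 0.3 * rng.standard_normal(n)
g = np.cos(t)
smooth = K @ f                       # the matrix form of the operator on the grid

# Linearity S[a f + b g] = a S[f] + b S[g] holds exactly in matrix form:
lin_ok = np.allclose(K @ (2.0 * f + 3.0 * g), 2.0 * (K @ f) + 3.0 * (K @ g))
print(lin_ok)    # -> True
```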

Recall that with squared L2 distance as our dissimilarity metric, we denote

the dissimilarities between the true, observed, and smoothed curves i and j,

respectively, as follows:


δij = ∫₀ᵀ [μi(t) − μj(t)]² dt = ‖μi(t) − μj(t)‖²   (4.5)

δ̂ij = ∫₀ᵀ [yi(t) − yj(t)]² dt = ‖yi(t) − yj(t)‖²   (4.6)

δ̂ij^(smooth) = ∫₀ᵀ [S yi(t) − S yj(t)]² dt = ‖S yi(t) − S yj(t)‖²   (4.7)

Recall that for a set of points t1, ..., tn in [0, T], we may approximate (4.5)-(4.7) by

dij = (T/n)θ'θ

d̂ij = (T/n)θ̂'θ̂

d̂ij^(smooth) = (T/n)θ̂'S'Sθ̂

where dij → δij when the limit is taken as n → ∞ with t1, ..., tn ∈ [0, T].

In this section, the smoothing matrix S corresponds to a discretized analog of the operator S. That is, it maps a discrete set of points y(t1), ..., y(tn) on the "noisy" observed curve to the corresponding points μ̂(t1), ..., μ̂(tn) on the smoothed curve which results from the application of the operator S. In particular, in this section we examine how the characteristics of our self-adjoint operator S correspond to those of our symmetric matrix S as n → ∞ in a fixed-domain sense.

Suppose the operator S is self-adjoint. Then for any x, y,

⟨x, Sy⟩ = ∫₀ᵀ x(t) ∫₀ᵀ K(t, s)y(s) ds dt = ⟨Sx, y⟩ = ∫₀ᵀ ∫₀ᵀ K(t, s)x(s) ds y(t) dt.

Therefore, for points t1, ..., tn ∈ [0, T], with n → ∞ in a fixed-domain sense,

lim_{n→∞} (T/n)[ ∫₀ᵀ x(t1)K(t1, s)y(s) ds + ··· + ∫₀ᵀ x(tn)K(tn, s)y(s) ds ]

= lim_{n→∞} (T/n)[ ∫₀ᵀ K(t1, s)x(s)y(t1) ds + ··· + ∫₀ᵀ K(tn, s)x(s)y(tn) ds ].   (4.8)

Now, for a symmetric matrix S, for any x, y, ⟨x, Sy⟩ = ⟨Sx, y⟩, i.e., x'Sy = x'S'y. At the sequence of points t1, ..., tn, x = (x(t1), ..., x(tn))', and for our smoothing matrix S,

Sy = (S[y](t1), ..., S[y](tn))' = ( ∫₀ᵀ K(t1, s)y(s) ds, ..., ∫₀ᵀ K(tn, s)y(s) ds )'.

So

x'Sy = ∫₀ᵀ x(t1)K(t1, s)y(s) ds + ··· + ∫₀ᵀ x(tn)K(tn, s)y(s) ds

= x'S'y = ∫₀ᵀ K(t1, s)x(s)y(t1) ds + ··· + ∫₀ᵀ K(tn, s)x(s)y(tn) ds,

and multiplying both sides by T/n gives

(T/n)[ ∫₀ᵀ x(t1)K(t1, s)y(s) ds + ··· + ∫₀ᵀ x(tn)K(tn, s)y(s) ds ]

= (T/n)[ ∫₀ᵀ K(t1, s)x(s)y(t1) ds + ··· + ∫₀ᵀ K(tn, s)x(s)y(tn) ds ].

So taking the (fixed-domain) limits as n → ∞, we obtain (4.8), which shows that the defining characteristic of the symmetric matrix S corresponds to that of the self-adjoint operator S.

Definition 4.1 S is called a shrinking smoother if ‖Sy‖ ≤ ‖y‖ for all elements y (Buja et al., 1989).

A sufficient condition for S to be shrinking is that all of the singular values of S be ≤ 1 (Buja et al., 1989).
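This sufficient condition is straightforward to check numerically: a matrix whose largest singular value is at most 1 cannot increase the Euclidean norm of any vector. A minimal sketch with an arbitrary (illustrative) such S:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40

# Build a smoother whose singular values all lie in [0, 1].
U, _, Vt = np.linalg.svd(rng.standard_normal((n, n)))
svals = rng.uniform(0.0, 1.0, size=n)
S = U @ np.diag(svals) @ Vt

y = rng.standard_normal(n)
print(np.linalg.norm(S @ y) <= np.linalg.norm(y))   # -> True: S is shrinking
```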

Lemma 4.1 Let A be a symmetric matrix and Σ be a symmetric, positive definite matrix. Then

sup_x ( x'A'ΣAx / x'Σx ) ≤ [emax(A)]²

where emax(A) denotes the maximum eigenvalue of A.

Proof of Lemma 4.1:

Note that

sup_x ( x'A'ΣAx / x'Σx ) = emax( Σ^{−1/2}A'Σ^{1/2}Σ^{1/2}AΣ^{−1/2} ).

Let B = Σ^{1/2}AΣ^{−1/2}. Then this maximum eigenvalue is emax(B'B) ≤ emax(A'A) = [emax(A)]² by a property of the spectral norm. □

Lemma 4.2 Let the linear smoother S be a self-adjoint operator with singular values ∈ [0, 1]. Then var((T/n)‖Sθ̂‖²) ≤ var((T/n)‖θ̂‖²), where S is symmetric with singular values ∈ [0, 1]. (Recall θ̂ = yi − yj.)

Proof of Lemma 4.2:

We will show that, for any n and for any positive definite covariance matrix Σ of θ̂, var((T/n)‖Sθ̂‖²) ≤ var((T/n)‖θ̂‖²) where S is symmetric. (A symmetric matrix is the discrete form of a self-adjoint operator (see Ramsay and Silverman, 1997, pp. 287-290).)

var((T/n)‖Sθ̂‖²) ≤ var((T/n)‖θ̂‖²)

⟺ var(θ̂'S'Sθ̂) ≤ var(θ̂'θ̂)

⟺ 2 tr[(S'SΣ)²] + 4θ'S'SΣS'Sθ ≤ 2 tr[(Σ)²] + 4θ'Σθ.

Now,

4θ'S'SΣS'Sθ ≤ 4θ'Σθ ⟸ θ'S'SΣS'Sθ / θ'Σθ ≤ 1,

and so we apply Lemma 4.1 with S'S playing the role of A. Since the singular values of S are ∈ [0, 1], all the eigenvalues of S'S are ∈ [0, 1]. Hence we have

sup_θ ( θ'S'SΣS'Sθ / θ'Σθ ) ≤ 1.


So we must show tr[(S'SΣ)²] ≤ tr[(Σ)²] for any Σ.

tr[(Σ)²] = tr[(S'SΣ + (I − S'S)Σ)²]

= tr[(S'SΣ)²] + tr[((I − S'S)Σ)²] + tr[S'SΣ(I − S'S)Σ]

+ tr[(I − S'S)ΣS'SΣ]

= tr[(S'SΣ)²] + tr[Σ^{1/2}(I − S'S)Σ(I − S'S)Σ^{1/2}]

+ tr[SΣ(I − S'S)ΣS'] + tr[SΣ(I − S'S)ΣS']

= tr[(S'SΣ)²] + tr[Σ^{1/2}(I − S'S)Σ^{1/2}Σ^{1/2}(I − S'S)Σ^{1/2}]

+ 2 tr[SΣ(I − S'S)^{1/2}(I − S'S)^{1/2}ΣS']

≥ tr[(S'SΣ)²]

since the last two terms are ≥ 0. This holds since tr[Σ^{1/2}(I − S'S)Σ^{1/2}Σ^{1/2}(I − S'S)Σ^{1/2}] is the sum of the squared elements of the symmetric matrix Σ^{1/2}(I − S'S)Σ^{1/2}, and tr[SΣ(I − S'S)^{1/2}(I − S'S)^{1/2}ΣS'] is the sum of the squared elements of (I − S'S)^{1/2}ΣS'. □

We now extend the result in Lemma 4.2 to functional space.

Corollary 4.1 Under the conditions of Lemma 4.2, var[δ̂ij^(smooth)] ≤ var[δ̂ij].

Proof of Corollary 4.1:

Note that since almost sure convergence implies convergence in distribution, we have

d̂ij = (T/n)θ̂'θ̂ →d δ̂ij

d̂ij^(smooth) →d δ̂ij^(smooth).

Consider E[|d̂ij|³] = E[(d̂ij)³] = (T/n)³ E[(θ̂'θ̂)³]. We wish to show that this is bounded for n ≥ 1.


Now θ̂'θ̂ ~ χ²_{n}(λ), where λ = θ'θ here. The third raw moment of a noncentral chi-square is given by

E[(χ²_{n}(λ))³] = (n + λ)³ + 6(n + λ)(n + 2λ) + 8(n + 3λ).

Hence

(T/n)³ E[(θ̂'θ̂)³] = T³[ (1 + λ/n)³ + 6(1 + λ/n)(1 + 2λ/n)/n + 8(1 + 3λ/n)/n² ].

Now, λ/n is bounded for n ≥ 1 since (T/n)θ'θ is the discrete approximation of the distance between signal curves μi(t) and μj(t), which tends to δij. Hence the third moment is bounded.

This boundedness, along with the convergence in distribution, implies (see Chow and Teicher, 1997, p. 277) that

lim_{n→∞} E[(d̂ij)^k] = E[(δ̂ij)^k] < ∞

for k = 1, 2.

So we have

lim_{n→∞} var[d̂ij]

= lim E[(d̂ij)²] − lim [E(d̂ij)]²

= E[(δ̂ij)²] − [E(δ̂ij)]²

= var[δ̂ij] < ∞.

An almost identical argument yields

lim_{n→∞} var[d̂ij^(smooth)] = var[δ̂ij^(smooth)].


As a consequence of Lemma 4.2, since var[d̂ij^(smooth)] ≤ var[d̂ij] for all n,

lim_{n→∞} var[d̂ij^(smooth)] ≤ lim_{n→∞} var[d̂ij] < ∞.

Hence

var[δ̂ij^(smooth)] ≤ var[δ̂ij]. □

The following theorem establishes the domination of the smoothed-data dissimilarity estimator δ̂ij^(smooth) over the observed-data estimator δ̂ij, when the signal curves μi(t), i = 1, ..., N, lie in the linear subspace defined by S.

Theorem 4.4 If S is a self-adjoint operator and a shrinking smoother, and if μi(t) is in the linear subspace which S projects onto for all i = 1, ..., N, then

MSE(δ̂ij^(smooth)) ≤ MSE(δ̂ij).

Proof of Theorem 4.4:

Define the difference in MSEs, Δ, as follows (we want to show that Δ is negative):

Δ = E{(δ̂ij^(smooth) − δij)²} − E{(δ̂ij − δij)²}

= E{ ( ∫₀ᵀ [S yi(t) − S yj(t)]² dt − ∫₀ᵀ [μi(t) − μj(t)]² dt )² }

− E{ ( ∫₀ᵀ [yi(t) − yj(t)]² dt − ∫₀ᵀ [μi(t) − μj(t)]² dt )² }

= E{ ( ∫₀ᵀ [S yi(t) − S yj(t)]² dt )² − ( ∫₀ᵀ [yi(t) − yj(t)]² dt )² }

+ 2E{ ( ∫₀ᵀ [yi(t) − yj(t)]² dt )( ∫₀ᵀ [μi(t) − μj(t)]² dt )

− ( ∫₀ᵀ [S yi(t) − S yj(t)]² dt )( ∫₀ᵀ [μi(t) − μj(t)]² dt ) }.

Let f̂ = f̂ij(t) = yi(t) − yj(t) and let f = fij(t) = μi(t) − μj(t). Note that since S is a linear smoother, S yi(t) − S yj(t) = S f̂.

Δ = E{ ( ∫₀ᵀ [S f̂]² dt )² − ( ∫₀ᵀ [f̂]² dt )² }

+ 2E{ ( ∫₀ᵀ [f̂]² dt )( ∫₀ᵀ [f]² dt ) − ( ∫₀ᵀ [S f̂]² dt )( ∫₀ᵀ [f]² dt ) }

= E{ ⟨S f̂, S f̂⟩² − ⟨f̂, f̂⟩² } + 2E{ ⟨f̂, f̂⟩⟨f, f⟩ − ⟨S f̂, S f̂⟩⟨f, f⟩ }

= E{ ‖S f̂‖⁴ − ‖f̂‖⁴ } − 2‖f‖² E{ ‖S f̂‖² − ‖f̂‖² }

= var(‖S f̂‖²) − var(‖f̂‖²) + {E(‖S f̂‖²)}² − {E(‖f̂‖²)}²

− 2‖f‖² E{ ‖S f̂‖² − ‖f̂‖² }

= var(‖S f̂‖²) − var(‖f̂‖²)

+ {E(‖S f̂‖²) + E(‖f̂‖²)}{E(‖S f̂‖²) − E(‖f̂‖²)}

− 2‖f‖² E{ ‖S f̂‖² − ‖f̂‖² }.

Since μi(t) is in the subspace which S projects onto for all i, then

S f = f, so ‖S f‖ = ‖f‖ and 2‖f‖² = ‖S f‖² + ‖f‖². Thus

Δ = var(‖S f̂‖²) − var(‖f̂‖²) + E(‖S f̂‖² + ‖f̂‖²) E(‖S f̂‖² − ‖f̂‖²)

− (‖S f‖² + ‖f‖²) E(‖S f̂‖² − ‖f̂‖²).


Now, since var(‖S f̂‖²) ≤ var(‖f̂‖²) from Corollary 4.1, we have:

Δ ≤ E(‖S f̂‖² + ‖f̂‖²) E(‖S f̂‖² − ‖f̂‖²) − (‖S f‖² + ‖f‖²) E(‖S f̂‖² − ‖f̂‖²)

= E(‖S f̂‖² − ‖f̂‖²) [ E(‖S f̂‖² + ‖f̂‖²) − (‖S f‖² + ‖f‖²) ].   (4.9)

Since S is a shrinking smoother, ‖S f̂‖ ≤ ‖f̂‖ for all f̂. Hence ‖S f̂‖² ≤ ‖f̂‖² for all f̂ and hence E(‖S f̂‖²) ≤ E(‖f̂‖²). Therefore the first factor of (4.9) is ≤ 0.

Also note that

E(‖S f̂‖²) = E{ ∫₀ᵀ [S f̂]² dt } = ∫₀ᵀ E[S f̂]² dt

= ∫₀ᵀ var[S f̂] dt + ∫₀ᵀ (E[S f̂])² dt

= ∫₀ᵀ var[S f̂] dt + ∫₀ᵀ [S f]² dt

= ∫₀ᵀ var[S f̂] dt + ‖S f‖² ≥ ‖S f‖².

Similarly,

E(‖f̂‖²) = E{ ∫₀ᵀ [f̂]² dt } = ∫₀ᵀ E[f̂]² dt

= ∫₀ᵀ var[f̂] dt + ∫₀ᵀ (E[f̂])² dt

= ∫₀ᵀ var[f̂] dt + ∫₀ᵀ [f]² dt

= ∫₀ᵀ var[f̂] dt + ‖f‖² ≥ ‖f‖².

Hence the second factor of (4.9) is positive, implying that (4.9) ≤ 0. Hence the risk difference Δ ≤ 0. □

Remark. We know that ‖S f̂‖ ≤ ‖f̂‖ since S is shrinking. If, for our particular yi(t) and yj(t), ‖S f̂‖ < ‖f̂‖, then our result is MSE(δ̂ij^(smooth)) < MSE(δ̂ij).

CHAPTER 5

A PREDICTIVE LIKELIHOOD OBJECTIVE FUNCTION

FOR CLUSTERING FUNCTIONAL DATA

Most cluster analysis methods, whether deterministic or stochastic, use an

objective function to evaluate the "goodness" of any clustering of the data. In this

chapter we propose an objective function specifically intended for functional data

which have been smoothed with a linear smoother. In setting up the situation,

we assume we have N functional data objects y1(t), ..., yN(t). We assume the
discretized data we observe (y1, ..., yN) follow the discrete noise model (2.1).

For object i, define the indicator vector Zi = (Zi1, ..., ZiN) such that Zij = 1 if object i belongs to cluster j and 0 otherwise. Each partition in C corresponds to a different Z = (Z1, ..., ZN), and Z then determines the clustering structure. (Note that if the number of clusters K < N, then Zi,K+1 = ··· = ZiN = 0 for all i.)

In cluster analysis, we wish to predict the values of the Zi's based on the data y (or data curves y1(t), ..., yN(t)). Assume the Zi's are i.i.d. unit multinomial vectors with unknown probability vector p = (p1, ..., pN). Also, we assume a model f(y|z) for the distribution of the data given the partition. The joint density

f(y, z) = ∏_{i=1}^{N} ∏_{j=1}^{N} [ pj f(yi | zij = 1) ]^{zij}

can serve as an objective function. However, the parameters in f(y|z), and p,

are unknown. In model-based clustering, these parameters are estimated by

maximizing the likelihood, given the observed y, across the values of z. If the
data model f(y|z) allows it, a predictive likelihood for Z can be constructed by
conditioning on the sufficient statistics r(y, z) of the parameters in f(y, z) (Butler,

1986; Bjornstad, 1990). This predictive likelihood, f(y, z)//f(r(y, z)), can serve as


the objective function. (Bjornstad (1990) notes that, when y or z is continuous,

an alternative predictive likelihood with a Jacobian factor for the transformation

r(y, z) is possible. Including the Jacobian factor makes the predictive likelihood
independent of the choice of minimal sufficient statistic; excluding it makes the
predictive likelihood invariant to scale changes of z.)

For instance, consider a functional datum yi(t) and its corresponding observed

data vector yi. For the purpose of cluster analysis, let us assume that functional

observations in the same cluster come from the same linear smoothing model with

a normal error structure. That is, the distribution of Yi, given that it belongs to cluster j, is

Yi | zij = 1 ~ N(Tβj, σj²I)

for i = 1, ..., N. Here T is the design matrix, whose n rows contain the regression

covariates defined by the functional form we assume for yi(t). This model for

Yi allows for a variety of parametric and nonparametric linear basis smoothers,

including standard linear regression, regression splines, Fourier series models,

and smoothing splines, as long as the basis coefficients are estimated using least

squares.

For example, consider the yeast gene data described in Section 7.1. Since the genes, being cell-cycle regulated, are periodic data, Booth et al. (2001) use a first-order Fourier series y(t) = β0 + β1 cos(2πt/T) + β2 sin(2πt/T) as a smooth representation of a given gene's measurement, fitting a harmonic regression model (see Brockwell and Davis, 1996, p. 12) by least squares. Then the 18 rows of T are the vectors (1, cos(ωtj), sin(ωtj)), with ω = 2π/T, for the n = 18 time points tj, j = 0, ..., 17. In this example βj = (β0j, β1j, β2j)'.
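The harmonic-regression design matrix and its least squares fit can be sketched as follows; the period and the simulated measurements below are illustrative assumptions, not the actual yeast data:

```python
import numpy as np

def fourier_design(tpoints, period):
    """n x 3 design matrix for y(t) = b0 + b1 cos(2*pi*t/period) + b2 sin(2*pi*t/period)."""
    t = np.asarray(tpoints, dtype=float)
    w = 2.0 * np.pi / period
    return np.column_stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])

t = np.arange(18.0)                   # n = 18 time points, as in the yeast example
T = fourier_design(t, period=18.0)    # the period here is an illustrative assumption

rng = np.random.default_rng(7)
y = 1.0 + 0.5 * np.cos(2.0 * np.pi * t / 18.0) + rng.normal(0.0, 0.1, 18)
beta_hat, *_ = np.linalg.lstsq(T, y, rcond=None)
print(beta_hat)    # approximately (1.0, 0.5, 0.0)
```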

Let J = {j | zij = 1 for some i} be the set of the K nonempty clusters. The unknown parameters in f(y, z), j ∈ J, are (pj, βj, σj²). The corresponding sufficient statistics are (mj, β̂j, σ̂j²), where mj = Σ_{i=1}^{N} zij is the number of objects in the (proposed) jth cluster, β̂j is the p*-dimensional least squares estimator of βj (based on a regression of all objects in the proposed jth cluster), and σ̂j² is the unbiased estimator of σj². The estimators β̂j and σ̂j² can be expressed in terms of the estimators β̂i and σ̂i² (for all i such that zij = 1), obtained from the regressions of the individual objects. (See Appendix A.3 for the derivations of β̂j and σ̂j².)

Marginally, the vector m = (m1, ..., mN) is multinomial(N, p); conditional on m, β̂j is multivariate normal and independent of σ̂j², with (mjn − p*)σ̂j²/σj² ~ χ²_{mjn−p*}. Then the following theorem gives the predictive likelihood objective function, up to a proportionality constant:

Theorem 5.1 If Z1, ..., ZN are i.i.d. unit multinomial vectors with unknown probability vector p = (p1, ..., pN), and

Yi | zij = 1 ~ N(Tβj, σj²I)

for i = 1, ..., N, then the predictive likelihood for Z can be written:

g(z) = f(y, z) / ∏_{j∈J} f(mj, β̂j, σ̂j²)

∝ ∏_{j∈J} mj! (mj)^{−p*/2} Γ((mjn − p*)/2) [ (mjn − p*)π ]^{−(mjn−p*)/2} (σ̂j²)^{−(mjn−p*)/2+1}.

Proof of Theorem 5.1:

According to the model assumed for the data and for the clusters,

f(y, z) = ∏_{i=1}^{N} ∏_{j=1}^{N} [ pj (2πσj²)^{−n/2} exp{ −(yi − Tβj)'(yi − Tβj) / (2σj²) } ]^{zij}.

Therefore log f(y, z)

= Σ_{i=1}^{N} Σ_{j=1}^{N} zij [ log pj − (n/2) log(2πσj²) − (yi − Tβj)'(yi − Tβj) / (2σj²) ]

= Σ_{j∈J} { mj log pj − (mjn/2) log(2πσj²) − (1/(2σj²)) Σ_{i=1}^{N} zij (yi − Tβj)'(yi − Tβj) },

where

Σ_{i=1}^{N} zij (yi − Tβj)'(yi − Tβj) = (mjn − p*)σ̂j² + mj (Tβ̂j − Tβj)'(Tβ̂j − Tβj).

Hence f(y, z)

= ∏_{j∈J} pj^{mj} (2π)^{−mjn/2} σj^{−mjn} exp{ −mj (Tβ̂j − Tβj)'(Tβ̂j − Tβj) / (2σj²) − (mjn − p*)σ̂j² / (2σj²) }.

Also, f(m) ∏_{j∈J} f(β̂j, σ̂j² | m)

= N! ∏_{j∈J} (pj^{mj} / mj!) × ∏_{j∈J} (2π)^{−p*/2} |cov(β̂j)|^{−1/2} exp{ −(1/2)(β̂j − βj)'[cov(β̂j)]^{−1}(β̂j − βj) }

× [ (1/2)^{(mjn−p*)/2} / Γ((mjn − p*)/2) ] ( (mjn − p*)σ̂j² / σj² )^{(mjn−p*)/2−1} exp{ −(mjn − p*)σ̂j² / (2σj²) } ( (mjn − p*) / σj² ),

where cov(β̂j) = σj² (mj T'T)^{−1}, so that |cov(β̂j)|^{−1/2} = σj^{−p*} mj^{p*/2} |T'T|^{1/2}.

Hence, since exp{ −(mj/(2σj²))(β̂j − βj)'(T'T)(β̂j − βj) } = exp{ −(mj/(2σj²))(Tβ̂j − Tβj)'(Tβ̂j − Tβj) }, the exponential factors, the pj^{mj} factors, and all powers of σj cancel in the ratio, and we have that

f(y, z) / ∏_{j∈J} f(mj, β̂j, σ̂j²)

∝ ∏_{j∈J} mj! (mj)^{−p*/2} Γ((mjn − p*)/2) [ (mjn − p*)π ]^{−(mjn−p*)/2} (σ̂j²)^{−(mjn−p*)/2+1}. □

When considered as an objective function, this predictive likelihood has

certain properties which reflect intuition about what type of clustering structure is

desirable.

The data model assumes σj² represents the variability of the functional data in cluster j. Since σ̂j² is an unbiased estimator of σj², j ∈ J, the factor (σ̂j²)^{−(mjn−p*)/2+1} can be interpreted as a measure of variability within cluster j. The (typically) negative exponent indicates that "good" clustering partitions, which have small within-cluster variances, yield large values of g(z). Clusters containing more objects (having large mj values) contribute more heavily to this phenomenon.

The other factors of g(z) may be interpreted as penalty terms which penalize

a partition with a large number of clusters having small mj values. A possible

alternative is to adopt a Bayesian outlook and put priors on the pj's, which, when

combined with the predictive likelihood, will yield a suitable predictive posterior.
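As a sketch, the logarithm of this objective can be evaluated for a proposed partition directly from the cluster sizes mj and the within-cluster variance estimates σ̂j², using the proportional form stated in Theorem 5.1 (terms not depending on z are dropped; treat this as an illustration rather than a definitive implementation):

```python
import math

def log_pred_likelihood(cluster_sizes, s2_hats, n, p_star):
    """log g(z), up to an additive constant not depending on z,
    following the proportional form in Theorem 5.1."""
    total = 0.0
    for m_j, s2_j in zip(cluster_sizes, s2_hats):
        d = m_j * n - p_star                        # degrees of freedom m_j*n - p*
        total += (math.lgamma(m_j + 1)              # log m_j!
                  - 0.5 * p_star * math.log(m_j)
                  + math.lgamma(d / 2.0)
                  - 0.5 * d * math.log(math.pi * d * s2_j)
                  + math.log(s2_j))
    return total

# Tight within-cluster variances yield a larger objective than loose ones:
sizes = [5, 5, 5, 3]                                # illustrative cluster sizes
tight = log_pred_likelihood(sizes, [0.01] * 4, n=18, p_star=3)
loose = log_pred_likelihood(sizes, [1.00] * 4, n=18, p_star=3)
print(tight > loose)    # -> True
```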

CHAPTER 6

SIMULATIONS

6.1 Setup of Simulation Study

In this chapter, we examine the results of a simulation study implemented using the statistical software R. In each simulation, 100 samples of N = 18 noisy functional data were generated such that the data followed models (2.1) or (2.2). For the discrete noise data, the errors were generated as independent N(0, σ²) random variables, while for the functional noise data, the errors were generated as the result of a stationary Ornstein-Uhlenbeck process with variability parameter σ² and "pull" parameter β = 1.

For each sample of curves, several (four) "clusters" were built into the data

by creating each functional observation from one of four distinct signal curves, to

which the random noise was added. Of course, within the program the curves were
represented in a discretized form, with values generated at n equally spaced points
along [0, 20]. In one simulation example, n = 200 measurements were used and

in a second example, n = 30.

The four distinct signal curves for the simulated data were defined as follows:

μi(t) = 0.5 ln(t + 1) + .01 cos(t),  i = 1, ..., 5.

μi(t) = log10(t + 1) + .01 cos(2t),  i = 6, ..., 10.

μi(t) = 0.75 log(t + 1) + .01 sin(3t),  i = 11, ..., 15.

μi(t) = 0.3√t + .01 sin(4t),  i = 16, ..., 18.
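The data-generating step of this design can be sketched as follows; the noise variance σ², and the base of the logarithm in the third signal curve (taken here as natural log), are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
t = np.linspace(0.0, 20.0, n)

def signal(i):
    """The four signal curves, indexed by object i (0-based)."""
    if i < 5:
        return 0.5 * np.log(t + 1.0) + 0.01 * np.cos(t)
    if i < 10:
        return np.log10(t + 1.0) + 0.01 * np.cos(2.0 * t)
    if i < 15:
        return 0.75 * np.log(t + 1.0) + 0.01 * np.sin(3.0 * t)
    return 0.3 * np.sqrt(t) + 0.01 * np.sin(4.0 * t)

def ou_noise(sigma2=0.05, beta=1.0):
    """Stationary O-U noise on the grid via its exact AR(1) discretization."""
    x = np.empty(n)
    x[0] = rng.normal(0.0, np.sqrt(sigma2 / (2.0 * beta)))
    rho = np.exp(-beta * (t[1] - t[0]))
    sd = np.sqrt(sigma2 / (2.0 * beta) * (1.0 - rho**2))
    for j in range(1, n):
        x[j] = rho * x[j - 1] + rng.normal(0.0, sd)
    return x

curves = np.array([signal(i) + ou_noise() for i in range(18)])
```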


These four curves, shown in Figure 6-1, were intentionally chosen to be similar enough to provide a good test for the clustering methods that attempted to group the curves into the correct clustering structure, yet different enough that they represented four clearly distinct processes. They all contain some form of periodicity, which is more prominent in some curves than others. In short, the curves were chosen so that, when random noise was added to them, they would be difficult but not impossible to distinguish.

After the "observed" data were generated, the pairwise dissimilarities among the curves (measured by squared L2 distance, with the integral approximated on account of the discretized data) were calculated. Denote these by d_ij, i = 1, ..., N, j = 1, ..., N. The data were then smoothed using a linear smoother S (see below for details) and the James-Stein shrinkage adjustment. The pairwise dissimilarities among the smoothed curves were then calculated. Denote these by d_ij^(smooth), i = 1, ..., N, j = 1, ..., N. The sample mean squared error criterion was used to judge whether d_ij or d_ij^(smooth) better estimated the dissimilarities among the signal curves, denoted δ_ij, i = 1, ..., N, j = 1, ..., N. That is,

MSE^(obs) = [N(N − 1)/2]^(−1) Σ_{i=1,...,N; j>i} (d_ij − δ_ij)²

was compared to

MSE^(smooth) = [N(N − 1)/2]^(−1) Σ_{i=1,...,N; j>i} (d_ij^(smooth) − δ_ij)².
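The dissimilarities and the MSE criterion just described can be computed as below. This is a Python sketch (the thesis's computations were done in R), with the integral in the squared L2 distance approximated by a simple Riemann sum over the discretization grid.

```python
def sq_l2_dist(f, g, dt):
    """Squared L2 distance between two discretized curves, the integral
    approximated by a Riemann sum on a grid of spacing dt."""
    return dt * sum((a - b) ** 2 for a, b in zip(f, g))

def pairwise_dissim(curves, dt):
    """All pairwise dissimilarities d_ij, i < j, as a dict keyed by (i, j)."""
    N = len(curves)
    return {(i, j): sq_l2_dist(curves[i], curves[j], dt)
            for i in range(N) for j in range(i + 1, N)}

def sample_mse(d_hat, delta):
    """Sample MSE criterion: the average of (d_ij - delta_ij)^2 over the
    N(N-1)/2 pairs, where delta holds the signal-curve dissimilarities."""
    return sum((d_hat[p] - delta[p]) ** 2 for p in delta) / len(delta)
```

Applying sample_mse to the observed-curve and smoothed-curve dissimilarities, with delta computed from the signal curves, gives MSE^(obs) and MSE^(smooth).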

Once the dissimilarities were calculated, we used the resulting dissimilarity

matrix to cluster the observed data, and then to cluster the smoothed data. The

clustering algorithm used was the K-medoids method, implemented by the pam

function in R. We examine the resulting clustering structure of both the observed data and the smoothed data to determine which clustering better captures the structure of the underlying clusters, as defined by the four signal curves.

Figure 6-1: Plot of signal curves chosen for simulations.

Solid line: μ_1(t). Dashed line: μ_6(t). Dotted line: μ_11(t). Dot-dashed line: μ_16(t).

The outputs of the cluster analyses were judged by the proportion of pairs of

objects (i.e., curves) correctly placed in the same cluster (or correctly placed in

different clusters, as the case may be). The correct clustering structure, as defined

by the signal curves, places the first five curves in one cluster, the next five in

another cluster, etc. Note that this is a type of measure of concordance between

the clustering produced by the analysis and the true clustering structure.
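This pair-counting concordance measure is the Rand index; a minimal Python sketch (the thesis's computations were in R):

```python
from itertools import combinations

def pair_agreement(est, truth):
    """Proportion of object pairs on which two partitions agree: the pair
    is placed together in both, or apart in both (the Rand index)."""
    agree = total = 0
    for i, j in combinations(range(len(truth)), 2):
        agree += ((est[i] == est[j]) == (truth[i] == truth[j]))
        total += 1
    return agree / total

# the true clustering structure for the N = 18 simulated curves
truth = [1] * 5 + [2] * 5 + [3] * 5 + [4] * 3
```

A value of 1 means the estimated partition matches the true structure exactly; cluster labels themselves are arbitrary, only the grouping matters.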

6.2 Smoothing the Data

The smoother used for the n = 200 example corresponded to a cubic B-spline

basis with 16 interior knots interspersed evenly within the interval [0, 20]. The rank

of S was thus k = 20 (Ramsay and Silverman, 1997, p. 49). The value of a in the

James-Stein estimator was chosen to be a = 160, a choice based on the values of

n = 200 and k = 20. For the n = 30 example, a cubic B-spline smoother with six

knots was used (hence k = 10) and the value of a was chosen to be a = 15.

We should note here an interesting operational issue that arises when using the James-Stein adjustment. The James-Stein estimator of θ given by (3.4) is obtained by smoothing the differences δ_ij = y_i − y_j, i ≠ j, first, and then adjusting the Sδ_ij's with the James-Stein method. This estimator for θ is used in the James-Stein dissimilarity estimator d_ij^(smooth). For a data set with N curves, this amounts to smoothing (N² − N)/2 pairwise differences. It is computationally less intensive to simply smooth the N curves with the linear smoother S and then adjust each of the N smooths with the James-Stein method. This leads to the following estimator of θ:

θ̂ = Sy_i + (1 − a/||y_i − Sy_i||²)(y_i − Sy_i) − [ Sy_j + (1 − a/||y_j − Sy_j||²)(y_j − Sy_j) ].
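A sketch of this curve-by-curve James-Stein smooth in Python (the thesis's code is in R, and its smoother S is a cubic B-spline projection; here, purely for illustration, a first-order Fourier basis like the one used later in Section 6.4 stands in for the B-spline basis, and the shrinkage factor is truncated at zero, as in the positive-part rule):

```python
import math

def solve(A, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[c][c] != 0:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def linear_smooth(y, basis):
    """Projection smoother Sy = B (B'B)^{-1} B'y; basis is a list of k
    basis functions, each evaluated on the n grid points."""
    k, n = len(basis), len(y)
    BtB = [[sum(basis[p][t] * basis[q][t] for t in range(n))
            for q in range(k)] for p in range(k)]
    Bty = [sum(basis[p][t] * y[t] for t in range(n)) for p in range(k)]
    coef = solve(BtB, Bty)
    return [sum(coef[p] * basis[p][t] for p in range(k)) for t in range(n)]

def james_stein_smooth(y, basis, a):
    """Positive-part James-Stein adjustment: shrink the smooth back toward
    the data by the factor (1 - a / ||y - Sy||^2), truncated at zero."""
    sy = linear_smooth(y, basis)
    resid = [yi - si for yi, si in zip(y, sy)]
    ss = sum(r * r for r in resid)
    shrink = max(0.0, 1.0 - a / ss) if ss > 0 else 0.0
    return [si + shrink * ri for si, ri in zip(sy, resid)]
```

Differencing two such adjusted smooths, james_stein_smooth(y_i, ...) minus james_stein_smooth(y_j, ...), gives the estimator of θ displayed above, with only N smoothing operations instead of (N² − N)/2.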


Of course, when simply using a linear smoother S, it does not matter whether we proceed by smoothing the curves or the differences, since Sδ_ij = Sy_i − Sy_j. But when using the James-Stein adjustment, the overall smooth is nonlinear and there is a difference between the two operations. Since in certain situations it may be more sensible to smooth the curves first and then adjust the N smooths, we can check empirically to see whether this changes the risk very much.

To accomplish this, a simulation study was done in which the coefficients of the above signal curves were randomly selected (within a certain range) for each simulation. Denoting the coefficients of curve μ_i(t) by (b_{i,1}, b_{i,2}), the coefficients were randomly chosen in the following intervals:

b_{1,1} ∈ [−3, 3], b_{1,2} ∈ [−0.5, 0.5],       b_{6,1} ∈ [−6, 6], b_{6,2} ∈ [−0.5, 0.5],
b_{11,1} ∈ [−4.5, 4.5], b_{11,2} ∈ [−0.5, 0.5], b_{16,1} ∈ [−1.8, 1.8], b_{16,2} ∈ [−0.5, 0.5].

For each of 100 simulations, the sample MSE for the smoothed data was the same,

within 10 significant decimal places, whether the curves were smoothed first or the

differences were smoothed first. This indicates that the choice between these two

procedures has a negligible effect on the risk of the smoothed-data dissimilarity

estimator.

Also, the simulation analyses in this chapter were carried out both when

smoothing the differences first and when smoothing the curves first. The results

were extremely similar, so we present the results obtained when smoothing the

curves first and adjusting the smooths with the James-Stein method.

6.3 Simulation Results

Results for the n = 200 example are shown in Table 6-1 for the data with

independent errors and Table 6-2 for the data with Ornstein-Uhlenbeck-type errors.

The ratio of average MSEs is the average MSE(obs) across the 100 simulations

divided by the average MSE(smooth) across the 100 simulations. As shown by


Table 6-1: Clustering the observed data and clustering the smoothed data (independent error structure, n = 200).

Independent error structure
σ                       0.4     0.5     0.75    1.0     1.25    1.5     2.0
Ratio of avg. MSEs      12.86   30.10   62.66   70.89   72.40   72.24   71.03
Avg. prop. (observed)   .668    .511    .353    .325    .330    .327    .318
Avg. prop. (smoothed)   .937    .715    .428    .380    .372    .352    .332

Ratios of average MSE^(obs) to average MSE^(smooth). Average proportions of pairs of objects correctly matched. Averages taken across 100 simulations.

Table 6-2: Clustering the observed data and clustering the smoothed data (O-U error structure, n = 200).

O-U error structure
σ                       0.5     0.75    1.0     1.25    1.5     2.0     2.5
Ratio of avg. MSEs      3.53    39.09   126.53  171.39  176.52  168.13  160.52
Avg. prop. (observed)   .997    .607    .445    .369    .330    .312    .310
Avg. prop. (smoothed)   1.00    .944    .620    .489    .396    .326    .334

Ratios of average MSE^(obs) to average MSE^(smooth). Average proportions of pairs of objects correctly matched. Averages taken across 100 simulations.

the ratios of average MSEs being greater than 1, smoothing the data results in an improvement in estimating the dissimilarities, and, as shown graphically in Figure 6-2, it also results in a greater proportion of pairs of objects correctly clustered in every case. Similar results are seen for the n = 30 example in Table 6-3 and Table 6-4 and in Figure 6-3.

The improvement, it should be noted, appears to be most pronounced for

the medium values of σ², which makes sense. If there is small variability in the

data, smoothing yields little improvement since the observed data is not very

noisy to begin with. If the variability is quite large, the signal is relatively weak

and smoothing may not capture the underlying structure precisely, although the

smooths still perform far better than the observed data in estimating the true

dissimilarities. When the magnitude of the noise is moderate, the advantage of

smoothing in capturing the underlying clustering structure is sizable.

Figure 6-2: Proportion of pairs of objects correctly matched, plotted against σ (n = 200).

Solid line: Smoothed data. Dashed line: Observed data. Top panel: independent error structure; bottom panel: Ornstein-Uhlenbeck error structure.

Table 6-3: Clustering the observed data and clustering the smoothed data (independent error structure, n = 30).

Independent error structure
σ                       0.2     0.4     0.5     0.75    1.0     1.25    1.5
Ratio of avg. MSEs      1.23    4.02    5.12    5.95    6.30    6.24    6.27
Avg. prop. (observed)   .999    .466    .376    .298    .299    .312    .316
Avg. prop. (smoothed)   1.00    .552    .458    .359    .330    .321    .334

Ratios of average MSE^(obs) to average MSE^(smooth). Average proportions of pairs of objects correctly matched. Averages taken across 100 simulations.


Table 6-4: Clustering the observed data and clustering the smoothed data (O-U error structure, n = 30).

O-U error structure
σ                       0.2     0.4     0.5     0.75    1.0     1.25    1.5
Ratio of avg. MSEs      1.26    4.07    5.03    5.69    5.78    5.85    6.04
Avg. prop. (observed)   .851    .383    .280    .251    .225    .213    .233
Avg. prop. (smoothed)   .881    .625    .528    .502    .478    .489    .464

Ratios of average MSE^(obs) to average MSE^(smooth). Average proportions of pairs of objects correctly matched. Averages taken across 100 simulations.

Figure 6-3: Proportion of pairs of objects correctly matched, plotted against σ (n = 30).

Solid line: Smoothed data. Dashed line: Observed data. Top panel: independent error structure; bottom panel: Ornstein-Uhlenbeck error structure.


The above analysis assumes the number of clusters is correctly specified as 4, so we repeat the n = 200 simulation, for data generated from the O-U model, when the number of clusters is misspecified as 3 or 5. Figure 6-4 shows that, for the most part, the same superiority, measured by proportion of pairs correctly clustered, is exhibited by the method of smoothing before clustering.

6.4 Additional Simulation Results

In the previous section, the clustering of the observed curves was compared

with the clustering of the smooths adjusted via the James-Stein shrinking method.

To examine the effect of the James-Stein adjustment in the clustering problem, in

this section we present results of a simulation study done to compare the clustering

of the observed curves, the James-Stein smoothed curves, and smoothed curves

having no James-Stein shrinkage adjustment.

The simulation study was similar to the previous one. We had 20 sample

curves (with n = 30) in each simulation, again generated from four distinct signal

curves: in this case the signal curves could be written as Fourier series. Again, for

various simulations, both independent errors and Ornstein-Uhlenbeck errors (for

varying levels of σ²) were added to the signals to create noisy curves.

For some of the simulations, the signal curves were first-order Fourier series:

μ_i(t) = cos(πt/10) + 2 sin(πt/10),     i = 1, ..., 5,
μ_i(t) = 2 cos(πt/10) + sin(πt/10),     i = 6, ..., 10,
μ_i(t) = 1.5 cos(πt/10) + sin(πt/10),   i = 11, ..., 15,
μ_i(t) = −cos(πt/10) + sin(πt/10),      i = 16, ..., 20.

In these simulations, the smoother chosen was a first-order Fourier series, so that

the smooths lay in the linear subspace that the Fourier smoother projected onto. In

this case, we would expect the unadjusted smoothing method to perform well, since


Figure 6-4: Proportion of pairs of objects correctly matched, plotted against σ, when the number of clusters is misspecified.

Solid line: Smoothed data. Dashed line: Observed data. Top panel: number of clusters misspecified as 3; bottom panel: number of clusters misspecified as 5.

the first-order smooths should capture the underlying curves extremely well. Here

the James-Stein adjustment may be superfluous, if not detrimental.

In other simulations, the signal curves were second-order Fourier series:

μ_i(t) = cos(πt/10) + 2 sin(πt/10) + 5 cos(2πt/10) + sin(2πt/10),     i = 1, ..., 5,
μ_i(t) = 2 cos(πt/10) + sin(πt/10) + cos(2πt/10) + 8 sin(2πt/10),     i = 6, ..., 10,
μ_i(t) = 1.5 cos(πt/10) + sin(πt/10) + 5 cos(2πt/10) + 8 sin(2πt/10), i = 11, ..., 15,
μ_i(t) = cos(πt/10) + sin(πt/10) + 8 cos(2πt/10) + 5 sin(2πt/10),     i = 16, ..., 20.

In these simulations, again a first-order Fourier smoother was used, so that the

smoother oversmoothed the observed curves and the smooths were not in the

subspace S projected onto. In this case, the simple linear smoother would not be

expected to capture the underlying curves well, and the James-Stein adjustment

would be needed to reproduce the true nature of the curves.
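This oversmoothing effect is easy to exhibit numerically. Below is a Python sketch (assumptions: an equally spaced grid of n = 30 points covering one full period, on which the first-order Fourier basis is exactly orthogonal, so the projection reduces to discrete Fourier coefficients); the second-order terms of the signal pass entirely into the residual, which is exactly the part the James-Stein adjustment can recover from the observed data.

```python
import math

n = 30
t = [20 * j / n for j in range(n)]   # one full period of cos(pi*t/10)

def fourier1_smooth(y):
    """Project y onto span{1, cos(pi*t/10), sin(pi*t/10)}; on this grid the
    basis is orthogonal, so the projection is given by Fourier coefficients."""
    c0 = sum(y) / n
    c1 = 2 / n * sum(yi * math.cos(math.pi * ti / 10) for yi, ti in zip(y, t))
    s1 = 2 / n * sum(yi * math.sin(math.pi * ti / 10) for yi, ti in zip(y, t))
    return [c0 + c1 * math.cos(math.pi * ti / 10) + s1 * math.sin(math.pi * ti / 10)
            for ti in t]

# a second-order signal curve (the first one in the list above)
y = [math.cos(math.pi * ti / 10) + 2 * math.sin(math.pi * ti / 10)
     + 5 * math.cos(2 * math.pi * ti / 10) + math.sin(2 * math.pi * ti / 10)
     for ti in t]
smooth = fourier1_smooth(y)
resid = [yi - si for yi, si in zip(y, smooth)]
# resid is exactly the second-order part, 5*cos(2*pi*t/10) + sin(2*pi*t/10):
# the first-order smoother cannot represent it, no matter how small the noise.
```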

For each of the three methods (no smoothing, simple linear smoothing,

and James-Stein adjusted smoothing), and for varying magnitudes of noise, the

proportions of pairs of curves correctly grouped are plotted in Figure 6-5. Again,

for each setting, 100 simulations were run, with results averaged over all 100 runs.

The James-Stein adjusted smoothing method performs essentially no worse

than its competitors in three of the four cases. With Ornstein-Uhlenbeck errors, and when the signal curves are first-order Fourier series, the simple linear smoother does better as σ becomes larger, which is understandable since that smoother captures the underlying truth perfectly. For higher values of σ (noisier data), the

James-Stein adjustment is here shrinking the smoother slightly toward the noise

in the observed data. Nevertheless, the James-Stein smoothing method has an

advantage in the independent-errors case, even when the signal curves are first-

order. In both of these cases, both types of smooths are clustered correctly more

often than the observed curves.


Figure 6-5: Proportion of pairs of objects correctly matched, plotted against σ (n = 30).

Solid line: James-Stein adjusted smoothed data. Dotted line: (Unadjusted) smoothed data. Dashed line: Observed data. Panels show the independent and O-U error structures crossed with first- and second-order Fourier signal curves.

When the signal curves are second-order and the simple linear smoother oversmooths, the unadjusted smooths are clustered relatively poorly. Here, the James-Stein method is needed as a safeguard, to shrink the smooth toward the

observed data. The James-Stein adjusted smooths and the observed curves are

both clustered correctly more often than the unadjusted smooths.

CHAPTER 7

ANALYSIS OF REAL FUNCTIONAL DATA

7.1 Analysis of Expression Ratios of Yeast Genes

As an example, we consider yeast gene data analyzed in Alter et al. (2000).

The objects in this situation are 78 genes, and the measured responses are (log-transformed) expression ratios measured 18 times (at 7-minute intervals), t_j = 7j for j = 0, ..., 17. As described in Spellman et al. (1998), the yeast was analyzed

in an experiment involving "alpha-factor-based synchronization," after which the

RNA for each gene was measured over time. (In fact, there were three separate

synchronization methods used, but here we focus on the measurements produced by

the alpha-factor synchronization.)

Biologists believe that these genes fall into five clusters according to the cell

cycle phase corresponding to each gene.

To analyze these data, we treated them as 78 separate functional observations.

Initially, each multivariate observation on a gene was centered by subtracting the

mean response (across the 18 measurements) from the measurements. Then we

clustered the genes into 5 clusters using the K-medoids method, implemented by

the R function pam.
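In outline, the centering and clustering steps look like the following. The thesis uses R's pam, which implements the full PAM build-and-swap algorithm; the Python sketch below uses a simpler alternating k-medoids update, and is an illustration rather than a reimplementation of pam.

```python
import random

def center(y):
    """Center one observation by subtracting its mean response."""
    m = sum(y) / len(y)
    return [v - m for v in y]

def k_medoids(D, K, n_iter=100, seed=0):
    """Simple alternating k-medoids on a dissimilarity matrix D: assign
    each object to its nearest medoid, then reset each medoid to the
    cluster member minimizing total within-cluster dissimilarity."""
    N = len(D)
    medoids = random.Random(seed).sample(range(N), K)
    for _ in range(n_iter):
        labels = [min(range(K), key=lambda k: D[i][medoids[k]])
                  for i in range(N)]
        new_medoids = []
        for k in range(K):
            members = [i for i in range(N) if labels[i] == k] or [medoids[k]]
            new_medoids.append(min(members,
                                   key=lambda m: sum(D[i][m] for i in members)))
        if new_medoids == medoids:      # converged
            break
        medoids = new_medoids
    return labels, medoids
```

With the genes' pairwise squared-L2 dissimilarities collected in D, k_medoids(D, 5) plays the role of pam(..., k = 5) in the analysis described above.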

To investigate the effect of smoothing on the cluster analysis, first we simply

clustered the unsmoothed observed data. Then we smoothed each observation and

clustered the smooth curves, essentially treating the fitted values of the smooths

as the data. A (cubic) B-spline smoother, with two interior knots interspersed

evenly within the given timepoints, was chosen. The rank of the smoothing matrix

S was in this case k = 6 (Ramsay and Silverman, 1997, p. 49). The smooths were

adjusted via the James-Stein shrinkage procedure described in Section 3.2, with a = 5 here, a choice based on the values of n = 18 and k = 6.

Table 7-1: The classification of the 78 yeast genes into clusters, for both observed data and smoothed data.

Cluster 1. Observed: 1, 5, 6, 7, 8, 11, 75.
   Smoothed: 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 13, 62, 70, 74, 75.
Cluster 2. Observed: 2, 3, 4, 9, 10, 13, 16, 19, 22, 23, 27, 30, 32, 36, 37, 41, 44, 47, 49, 51, 53, 61, 62, 63, 64, 65, 67, 69, 70, 74, 78.
   Smoothed: 16, 19, 22, 27, 30, 32, 36, 37, 41, 42, 44, 47, 49, 51, 61, 63, 65, 67, 69, 78.
Cluster 3. Observed: 12, 14, 15, 18, 21, 24, 25, 26, 28, 29, 31, 33, 34, 35, 38, 39, 40, 42, 43, 45, 46, 48, 50, 52.
   Smoothed: 3, 12, 14, 15, 18, 21, 24, 25, 26, 28, 29, 31, 33, 34, 35, 38, 39, 40, 43, 45, 46, 48, 50, 52.
Cluster 4. Observed: 17, 20, 54, 55, 56, 57, 58, 59, 60.
   Smoothed: 17, 20, 23, 53, 54, 55, 56, 57, 58, 59, 60, 64.
Cluster 5. Observed: 66, 68, 71, 72, 73, 76, 77.
   Smoothed: 66, 68, 71, 72, 73, 76, 77.

Figure 7-1 shows the resulting clusters of curves for the cluster analyses of

the observed data and of the smoothed curves. We can see that the clusters of

smoothed curves tend to be somewhat less variable about their sample mean curves

than are the clusters of observed curves, which could aid the interpretability of the

clusters.

In particular, features of the second, third, and fourth clusters seem more

apparent when viewing the clusters of smooths than the clusters of observed

data. The major characteristic of the second cluster is that it seems to contain

those processes which are relatively flat, with little oscillation. This is much more

evident in the cluster of smooths. The third and fourth clusters look similar for

the observed data, but a key distinction between them is apparent if we examine

the smooth curves. The third cluster consists primarily of curves which rise to

a substantial peak, decrease, and then rise to a more gradual peak. The fourth

cluster consists mostly of curves which rise to a small peak, decrease, and then rise

to a higher peak. This distinction between clusters is seen much more clearly in the

clustering of the smooths.


Figure 7-1: Plots of clusters of genes.

Observed curves (left) and smoothed curves (right) are shown by dotted lines. Mean curves for each cluster shown by solid lines.

The groupings of the curves into the five clusters are given in Table 7-1 for

both the observed data and the smoothed data. While the clusterings were fairly

similar for the two methods, there were some differences in the way the curves were

classified. Figure 7-2 shows an edited picture of the respective clusterings of the

curves, with those curves deleted that were classified the same way as both observed curves and smoothed curves. Note that the fifth cluster was the same for the set of

observed curves and the set of smoothed curves.

7.2 Analysis of Research Libraries

The Association of Research Libraries has collected extensive annual data

on a large number of university libraries throughout many years (Association of

Research Libraries, 2003). Perhaps the most important (or at least oft-quoted)

variable measured in this ongoing study is the number of volumes held in a library

at a given time. In this section a cluster analysis is performed on 67 libraries whose

data on volumes were available for each year from 1967 to 2002 (a string of 36

years). This fits a strict definition of functional data, since the number of volumes

in a library is changing "nearly continuously" over time, and each measurement is

merely a snapshot of the response at a particular time, namely when the inventory

was done that year.

The goal of the analysis is to determine which libraries are most similar in

their growth patterns over the last 3-plus decades and to see into how many groups the 67 libraries naturally cluster. In the analysis, the logarithm of volumes

held was used as the response variable, since reducing the immense magnitude of

the values in the data set appeared to produce curves with stable variances. Since

the goal was to cluster the libraries based on their "growth curves" rather than

simply their sizes, the data was centered by subtracting each library's sample mean

(across the 36 years) from each year's measurement, resulting in 67 mean-centered

data vectors.


Figure 7-2: Edited plots of clusters of genes which were classified differently as observed curves and smoothed curves.

Observed curves (left) and smoothed curves (right).


The observed curves were each smoothed using a basis of cubic B-splines with

three knots interspersed evenly within the timepoints. The positive-part James-

Stein shrinkage was applied, with a value a = 35 in the shrinkage factor (based on

n = 36, k = 7). The K-medoids algorithm was applied to the smoothed curves,

with K = 4 clusters, a choice guided by the average silhouette width criterion that

Rousseeuw (1987) suggests for selecting K.
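For reference, the average silhouette width can be computed directly from a dissimilarity matrix; a Python sketch follows (the thesis's analysis used R, where pam reports this quantity automatically).

```python
def avg_silhouette_width(D, labels):
    """Average silhouette width (Rousseeuw, 1987). For object i with
    within-cluster mean dissimilarity a(i) and nearest-other-cluster mean
    dissimilarity b(i), s(i) = (b(i) - a(i)) / max(a(i), b(i)); the K
    maximizing the average s(i) is a reasonable choice of cluster count."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    s = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:                 # convention: s(i) = 0 for singleton clusters
            s.append(0.0)
            continue
        a = sum(D[i][j] for j in own) / len(own)
        b = min(sum(D[i][j] for j in mem) / len(mem)
                for other, mem in clusters.items() if other != lab)
        s.append((b - a) / max(a, b))
    return sum(s) / len(s)
```

Scanning K = 2, 3, ... and keeping the K whose clustering maximizes this criterion reproduces the selection rule used above.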

The resulting clusters are shown in Table 7-2. In Figure 7-3, we see the distinctions among the curves in the different clusters. Cluster 1 contains a few libraries whose volume size was relatively small at the beginning of the time period, but then grew dramatically in the first 12 years of the study, growing slowly after that. Cluster 2 curves show a similar pattern, except that the initial growth is less dramatic. The behavior of the cluster 3 curves is the most eccentric of the four clusters, with some notable nonmonotone behavior apparent in the middle years. The growth curves for the libraries in cluster 4 are consistently slow, steady, and nearly linear.

For purposes of comparing the clusters, the mean curves for the four clusters

are shown in Figure 7-4.

Note that several of the smooths may not appear visually smooth in Figure

7-3. A close inspection of the data shows some anomalous (probably misrecorded)

measurements which contribute to this chaotic behavior in the smooth. (For

example, see the data and associated smooth for the University of Arizona library,

in Figure 7-5, for which 1971 is a suspicious response.) While an extensive data

analysis would detect and fix these outliers, the data set is suitable to illustrate

the cluster analysis. The sharp-looking peak in the smooth is an artifact of the

discretized plotting mechanism, as the cubic spline is mathematically assured to

have two continuous derivatives at each point.


Figure 7-3: Clusters 1 through 4 based on the smoothed library curves (time axis in years since 1967).

Mean curves for each cluster shown by solid lines.

Figure 7-4: Mean curves for the four library clusters given on the same plot.

Solid: cluster 1. Dashed: cluster 2. Dotted: cluster 3. Dot-dash: cluster 4.


Table 7-2: A 4-cluster K-medoids clustering of the smooths for the library data.

Cluster 1: Arizona, British Columbia, Connecticut, Georgetown, Georgia
Cluster 2: Boston, California-Los Angeles, Florida, Florida State, Iowa, Iowa State, Kansas, Michigan State, Nebraska, Northwestern, Notre Dame, Oklahoma, Pennsylvania State, Pittsburgh, Princeton, Rochester, Southern California, SUNY-Buffalo, Temple, Virginia, Washington U-St. Louis, Wisconsin
Cluster 3: Brown, California-Berkeley, Chicago, Cincinnati, Colorado, Duke, Indiana, Kentucky, Louisiana State, Maryland, McGill, MIT, Ohio State, Oregon, Pennsylvania, Purdue, Rutgers, Stanford, Tennessee, Toronto, Tulane, Utah, Washington State, Wayne State
Cluster 4: Columbia, Cornell, Harvard, Illinois-Urbana, Johns Hopkins, Michigan, Minnesota, Missouri, New York, North Carolina, Southern Illinois, Syracuse, Yale

Figure 7-5: Measurements and B-spline smooth, University of Arizona library.

Points: log of volumes held, 1967-2002. Curve: B-spline smooth.


Table 7-3: A 4-cluster K-medoids clustering of the observed library data.

Cluster 1: Arizona, British Columbia, Connecticut, Georgetown, Georgia
Cluster 2: Boston, California-Los Angeles, Florida, Florida State, Iowa, Iowa State, Kansas, McGill, Michigan State, Nebraska, Northwestern, Notre Dame, Oklahoma, Pennsylvania State, Pittsburgh, Princeton, Rochester, Southern California, SUNY-Buffalo, Temple, Virginia, Washington U-St. Louis, Wayne State, Wisconsin
Cluster 3: Brown, California-Berkeley, Chicago, Cincinnati, Colorado, Duke, Illinois-Urbana, Indiana, Kentucky, Louisiana State, Maryland, MIT, Ohio State, Oregon, Pennsylvania, Purdue, Rutgers, Stanford, Syracuse, Tennessee, Toronto, Tulane, Utah, Washington State
Cluster 4: Columbia, Cornell, Harvard, Johns Hopkins, Michigan, Minnesota, Missouri, New York, North Carolina, Southern Illinois, Yale

Interpreting the meaning of the clusters is fairly subjective, but the dramatic

rises in library volumes for the cluster 1 and cluster 2 curves could correspond

to an increase in state-sponsored academic spending in the 1960s and 1970s,

especially since many of the libraries in the first two clusters correspond to public

universities. On the other hand, cluster 4 contains several old, traditional Ivy

League universities (Harvard, Yale, Columbia, Cornell), whose libraries' steady

growth could reflect the extended buildup of volumes over many previous years.

Any clear conclusions, however, would require more extensive study of the subject,

as the cluster analysis is only an exploratory first step.

In this data set, only minor differences in the clustering partition were seen

when comparing the analysis of the smoothed data with that of the observed

data (shown in Table 7-3). Interestingly, in some spots where the two partitions

differed, the clustering of the smooths did place certain libraries from the same

state together when the other method did not. For instance, Illinois-Urbana was

placed with Southern Illinois, and Syracuse was placed with Columbia, Cornell, and

NYU.