Citation
Convexity and geometry of estimating functions

Material Information

Title:
Convexity and geometry of estimating functions
Creator:
Chan, Schultz
Publication Date:
Language:
English
Physical Description:
vi, 97 leaves : ; 29 cm.

Subjects

Subjects / Keywords:
Estimation theory ( jstor )
Experiment design ( jstor )
Geometry ( jstor )
Inner products ( jstor )
Mathematical independent variables ( jstor )
Mathematical vectors ( jstor )
Projective geometry ( jstor )
Statistical estimation ( jstor )
Statistics ( jstor )
Unbiased estimators ( jstor )
Dissertations, Academic -- Statistics -- UF
Statistics thesis, Ph. D
Genre:
government publication (state, provincial, territorial, dependent) ( marcgt )
bibliography ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1996.
Bibliography:
Includes bibliographical references (leaves 91-96).
General Note:
Typescript.
General Note:
Vita.
Statement of Responsibility:
by Schultz Chan.

Record Information

Source Institution:
University of Florida
Rights Management:
The University of Florida George A. Smathers Libraries respect the intellectual property rights of others and do not claim any copyright interest in this item. This item may be protected by copyright but is made available here under a claim of fair use (17 U.S.C. §107) for non-profit research and educational purposes. Users of this work have responsibility for determining copyright status prior to reusing, publishing or reproducing this item for purposes other than what is allowed by fair use or other copyright exemptions. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder. The Smathers Libraries would like to learn more about this item and invite individuals or organizations to contact the RDS coordinator (ufdissertations@uflib.ufl.edu) with any additional information they can provide.
Resource Identifier:
023780656 ( ALEPH )
35777909 ( OCLC )

Full Text










CONVEXITY AND GEOMETRY OF ESTIMATING FUNCTIONS


By

SCHULTZ CHAN













A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
1996
















TABLE OF CONTENTS




ABSTRACT

1 INTRODUCTION
  1.1 Preamble
  1.2 Literature Review
  1.3 The Subject of the Dissertation

2 THE GEOMETRY OF ESTIMATING FUNCTIONS I
  2.1 Introduction
  2.2 Generalized Inner Product Spaces and Orthogonal Projection
  2.3 Optimal Estimating Functions: A Geometric Approach
  2.4 Optimal Bayesian Estimating Functions
  2.5 Orthogonal Decomposition and Information Inequality
  2.6 Orthogonal Decomposition for Estimating Functions

3 THE GEOMETRY OF ESTIMATING FUNCTIONS II
  3.1 Introduction
  3.2 Properties of Orthogonal Projections
  3.3 Global Optimality of Estimating Functions
    3.3.1 The General Result
    3.3.2 Geometry of Conditional Inferences
    3.3.3 Geometry of Marginal Inference
  3.4 Locally Optimal Estimating Functions
    3.4.1 A General Result
    3.4.2 Local Optimality of Conditional Score Functions
    3.4.3 Locally Optimal Estimating Functions for Stochastic Processes
    3.4.4 Local Optimality of Projected Partial Likelihood
  3.5 Optimal Conditional Estimating Functions

4 CONVEXITY AND ITS APPLICATIONS TO STATISTICS
  4.1 Introduction
  4.2 Some Simple Results About Convexity
  4.3 Theory of Optimum Experimental Designs
  4.4 Fundamental Theorem of Mixture Distributions
  4.5 Asymptotic Minimaxity of Estimating Functions
    4.5.1 One Dimensional Case
    4.5.2 Multi-Dimensional Case

5 SUMMARY AND FUTURE RESEARCH
  5.1 Summary
  5.2 Future Research

BIBLIOGRAPHY

BIOGRAPHICAL SKETCH














Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy


CONVEXITY AND GEOMETRY OF ESTIMATING FUNCTIONS


By

Schultz Chan

August 1996

Chairman: Malay Ghosh
Major Department: Statistics

In this dissertation, a general way of constructing optimal generalized estimating equations (GEE) is given. Applications of this general method to various statistical problems, such as the quasi-likelihood method in generalized linear models, Cox's partial likelihood method in survival analysis, Bayesian inference, conditional and marginal inferences, are also studied. Also, some simple results about matrix valued convex functions are proved and are applied to the study of optimal designs, mixture distributions and asymptotic minimaxity.

First, a notion of generalized inner product spaces is introduced to study optimal estimating functions. A characterization of orthogonal projections in generalized inner product spaces is given. It is shown that the orthogonal projection of the score function into a linear subspace of estimating functions is optimal in that subspace, and a characterization of optimal estimating functions is given. Optimal estimating functions in the Bayesian framework are also studied.








In the case of no nuisance parameters, the results are applied to study longitudinal data, stochastic processes, time series models, generalized linear models and Bayesian inference. As special cases of the main results of this chapter, we derive the results of Godambe on the foundation of estimation in stochastic processes, the result of Godambe and Thompson on the extension of quasi-likelihood, and the linear (and quadratic) generalized estimating equations for multivariate data due to Liang and Zeger and to Liang, Zeger and Qaqish. We have also derived optimal estimating equations in the Bayesian framework.

In the case where there are nuisance parameters, the results are applied to study survival analysis models, the generalized estimating equations proposed by Liang, Zeger and their associates, and the optimality of the marginal and conditional inferences. The three main topics are (A) globally optimal generalized estimating equations; (B) locally optimal generalized estimating equations; (C) conditionally optimal generalized estimating equations. A general result is derived in each case. As special cases, we rederive some of the results already available in the literature and also find some new results. In particular, as special cases of our result on globally optimal generalized estimating equations, we find the results of Godambe and Thompson and of Godambe with nuisance parameters. The results of Bhapkar on conditional and marginal inference are also obtained as special cases. As applications of our result on locally optimal generalized estimating equations, we find Lindsay's result on the optimality of conditional score functions, extend Godambe's result on optimal estimating functions for stochastic processes to nuisance parameters, and extend a recent result of Murphy and Li about projected partial likelihood. Finally, our general result on conditionally optimal generalized estimating equations helps generalize the findings of Godambe and Thompson to situations which admit the presence of nuisance parameters.







Finally, some simple results for matrix valued convex functions are proved, and are used to find optimum experimental designs, the fundamental theorem of mixture distributions, and a generalization of the asymptotic result of Huber.














CHAPTER 1

INTRODUCTION



1.1 Preamble


The objective of this thesis is to provide a geometric insight behind many useful concepts in statistics, and to utilize this geometry for unifying many existing results as well as for deriving several new ones. One major focus is to find optimal estimating functions as orthogonal projections of score functions into appropriate linear subspaces. The second goal is to use some important theorems from convex analysis for finding optimal experimental designs, for deriving the fundamental theorem of mixture distributions, and for proving the asymptotic minimaxity of estimating functions in a very general framework.

1.2 Literature Review


We begin by reviewing the literature on estimating functions. The topic has grown into an active research area over the past decade. Its beginning is marked by the celebrated articles of Godambe (1960) and Durbin (1960). While Durbin (1960) used estimating functions to study Gauss-Markov type results in a time series setting, Godambe's (1960) main objective was to prove the optimality of the score function in a parametric framework when there were no nuisance parameters. As is well-known, the Gauss-Markov theory and maximum likelihood estimation form two cornerstones of statistical estimation. In their review article, Godambe and Kale (1991) have pointed out that the theory of estimating functions combines the strengths of these two methods, eliminating at the same time many of their weaknesses. To cite an example, the Gauss-Markov theorem fails for nonlinear least squares, but estimators obtained as solutions of optimal estimating equations are identical to the least squares estimators under homoscedasticity.

The theory of estimating functions has made rapid strides since the 1970s. Godambe and Thompson (1974) and Godambe (1976) studied optimal estimating functions in the presence of nuisance parameters, and proved a variety of optimality results. Bhapkar (1972, 1989, 1991) and Bhapkar and Srinivasan (1994), in a series of articles, studied the notions of sufficiency, ancillarity and information in the context of estimating functions, and found conditional as well as marginal optimal estimating functions. Amari and Kumon (1988) and Kumon and Amari (1984) used estimating functions to estimate structural parameters in the presence of a large number of nuisance parameters, their approach being based on vector bundle theory from differential geometry.

Nelder and Wedderburn (1971), in their pioneering paper on generalized linear models, showed that using one algorithm (the Newton-Raphson method), a large family of models could be iteratively fitted. Later, Wedderburn (1974) realized that only the first two moments were utilized in fitting the models, and this led to the development of the so-called quasi-likelihood functions for fitting generalized linear models. Firth (1987), and Godambe and Thompson (1989) pointed out the connection between quasi-likelihood and optimal estimating functions. An interesting review article is due to Desmond (1991).

Cox (1972), in his seminal paper, introduced the proportional hazards model. Later, Cox (1975) introduced the notion of partial likelihood. The latter is intended to eliminate nuisance parameters (baseline hazards for the proportional hazards model)







by using a conditioning argument. Because of the nested structure of the conditioning variables, Cox's approach also fits into the estimating function framework.

Liang and Zeger (1986) used estimating functions (they used the terminology generalized estimating equations) to study longitudinal data. Liang and Zeger had motivation similar to Wedderburn's quasi-likelihood function, but in the multivariate setting in order to take into account the correlation between responses within each subject.

The theory of Bayesian estimating functions is of more recent origin and is still in its infancy. Ferreira (1981, 1982) and Ghosh (1990) initiated the study of optimal estimating functions in a Bayesian framework. While Ferreira's formulation involves the joint distribution of the observations and the parameters, Ghosh used a purely Bayesian approach based only on the posterior probability density function.

The theory of optimum experimental designs was initiated by Elfving (1952, 1959) and Kiefer (1959). For references up to the early eighties, we refer to the two monographs of Silvey (1980) and Pazman (1986). During the last decade, there have been major advances in optimum experimental design theory; here we list only a few of the main publications. Chaloner and Larntz (1989) studied optimal Bayesian design for the logistic regression model, El-Krunz and Studden (1991) studied Bayesian optimal design for linear regression models, while DasGupta and Studden (1991) studied robust Bayesian experimental designs for normal linear models. Dette and Studden (1993) studied the geometry of E-optimal designs, while Dette (1993) studied the geometry of D-optimal designs, and Haines (1995) studied the geometry of Bayesian designs.

There is a vast literature on the theory of mixture distributions. Laird (1978) studied nonparametric maximum likelihood estimation of a mixing distribution. Lindsay (1981, 1983a, 1983b) studied the properties and geometry of the maximum likelihood estimator of a mixing distribution. In a recent monograph, Lindsay (1995) presented a comprehensive treatment of "mixture models: theory, geometry and applications." In this book, a variety of topics about mixture distributions are discussed, including the well-known result proved by Shaked (1980) on mixtures from the exponential family, and the fundamental theorem on mixture distributions proved by Lindsay (1983a).

Huber (1964), in his pioneering paper, proved the well-known asymptotic minimaxity result for estimating functions for a location parameter. In his classical book on robust statistics, Huber (1980) presented a more systematic treatment of asymptotic minimaxity.

1.3 The Subject of the Dissertation


This dissertation begins by unfolding the geometry of estimating functions and pointing out many applications. Although the geometry is primarily used to study estimating functions, it can also be used to study other statistical topics, such as the Rao-Blackwell theorem, the Lehmann-Scheffe approach to uniform minimum variance unbiased estimators, and prediction theory.

In Chapter 2, a notion of generalized inner product spaces is introduced to study optimal estimating functions. A characterization of orthogonal projections in generalized inner product spaces is given. It is shown that the orthogonal projection of the score function into a linear subspace of estimating functions is optimal in that subspace, and a characterization of optimal estimating functions is given. As special cases of the main results of this chapter, we derive the results of Godambe (1985) on the foundation of estimation in stochastic processes, the result of Godambe and Thompson (1989) on the extension of quasi-likelihood, and the generalized estimating equations for multivariate data due to Liang and Zeger (1986). We have also derived optimal estimating functions in the Bayesian framework. This generalizes the results obtained by Ferreira (1981, 1982) and Ghosh (1990).







In Chapter 3, the geometry of estimating functions in the presence of nuisance parameters is studied. The three main topics are: (A) globally optimal estimating functions; (B) locally optimal estimating functions; (C) conditionally optimal estimating functions. A general result is derived in each case. As special cases, we rederive some of the results already available in the literature, and also find some new results. In particular, as special cases of our result on globally optimal estimating functions, we find the results of Godambe and Thompson (1974) and Godambe (1976) with nuisance parameters. The results of Bhapkar (1989, 1991a) on conditional and marginal inference are also obtained as special cases. As applications of our result on locally optimal estimating functions, we find Lindsay's (1982) result on the optimality of conditional score functions, extend Godambe's (1985) result on optimal estimating functions for stochastic processes, and extend a recent result of Murphy and Li (1995) about projected partial likelihood. Finally, our general result on conditionally optimal estimating functions helps generalize the findings of Godambe and Thompson (1989) to situations which admit the presence of nuisance parameters.

In Chapter 4, we first prove some general results about convexity, and then apply the results to various statistical problems, which include the theory of optimum experimental designs (Silvey, 1980), the fundamental theorem of mixture distributions due to Lindsay (1983a), and the asymptotic minimaxity of robust estimation due to Huber (1964). In his classical paper on M-estimation, Huber (1964) proved an asymptotic minimaxity result for estimating functions for a location parameter. In this chapter, this fundamental result is generalized to general estimating functions. The geometric optimality of estimating functions proved in Chapter 2 will be used to prove a necessary and sufficient condition for the asymptotic minimaxity of estimating functions when the parameter space is multi-dimensional.

In Chapter 5, we summarize the results of this dissertation, and propose some topics of future research.














CHAPTER 2

THE GEOMETRY OF ESTIMATING FUNCTIONS I



2.1 Introduction


The theory of estimating functions has advanced quite rapidly over the past two decades. Godambe (1960) introduced the subject to prove finite sample optimality of the score function in a parametric framework when no nuisance parameters were present. Later, his idea was extended in many different directions, and optimal estimating functions were derived under many different formulations.

The underlying thread in all these results is a geometric phenomenon which seems to have gone unnoticed, or at least has never been brought out explicitly. In the present chapter, we make this geometry explicit, and use it to derive optimal estimating functions in certain contexts. In particular, it is shown that optimal estimating functions for certain semiparametric models are indeed the orthogonal projections of score functions into certain linear subspaces. Also, this geometry, by its very nature, is neutral, and can be adapted both within the frequentist and the Bayesian paradigms. Second, the multiparameter situation can be handled automatically through this geometry without involving any additional work.

The outline of the remaining sections is as follows. In Section 2.2, we develop the mathematical prerequisite for the results of the subsequent sections. In particular, we define generalized inner product spaces, and show the existence of orthogonal







projections of elements in these spaces into some linear subspaces. A characterization theorem for these orthogonal projections is given, which is used repeatedly in subsequent sections.

Section 2.3 generalizes the results of Godambe (1985) and Godambe and Thompson (1989) to multiparameter situations and also finds optimal generalized estimating equations (GEEs) for multivariate data. The GEEs used in Liang and Zeger (1986) and Liang, Zeger and Qaqish (1992) turn out to be special cases of those proposed in this section. The common thread in the derivation of all the optimal estimating functions is the idea of orthogonal projection developed in Section 2.2.

Section 2.4 uses the orthogonal projection idea in deriving optimal Bayes estimating functions. The results of Ferreira (1981, 1982) and Ghosh (1990) are included as special cases. Section 2.5 uses an orthogonal decomposition to study information inequality. Section 2.6 studies the Hoeffding type decomposition for estimating functions.

2.2 Generalized Inner Product Spaces and Orthogonal Projection


In this section, we first introduce a matrix version of inner product spaces which generalizes the notion of the usual scalar valued inner product space. Next we provide the definition of the orthogonal projection of an element of a generalized inner product space, say, L into a linear subspace L0 of L. A characterization of the orthogonal projection in the generalized inner product space is also given. As will be seen, such a characterization generalizes a corresponding result for scalar inner product spaces. We also show that for a finite dimensional subspace of a generalized inner product space, an orthogonal projection always exists.

We begin with the definition of a matrix valued inner product space.







Definition 2.2.1. Let $L$ be a real linear space, and let $M_{k\times k}$ be the set of all $k \times k$ real matrices. The map
$$\langle \cdot\,,\cdot \rangle : L \times L \to M_{k\times k}$$
is called a generalized inner product if
(1) $\forall\, x, y \in L$, $\langle x, y \rangle = \langle y, x \rangle^t$;
(2) for any $k \times k$ matrix $M$ and $x, y \in L$, $\langle Mx, y \rangle = M \langle x, y \rangle$;
(3) $\forall\, x, y, z \in L$, $\langle x, y+z \rangle = \langle x, y \rangle + \langle x, z \rangle$;
(4) $\forall\, x \in L$, $\langle x, x \rangle$ is non negative definite (n. n. d.), and $\langle x, x \rangle = 0$ iff $x = 0$.

Two elements $x, y \in L$ are said to be orthogonal if $\langle x, y \rangle = 0$. Two sets $S_1, S_2$ are orthogonal if every element of $S_1$ is orthogonal to every element of $S_2$.

An example of a generalized inner product space of great interest to statisticians is the one where the generalized inner product is defined by the covariance matrix of random vectors. Specifically, let $\mathcal{X}$ be a sample space, and let $\Theta \subset R^k$ be the parameter space, which is open. Consider the space $L$ of all functions $h : \mathcal{X} \times \Theta \to R^k$ such that every element of the matrix $E[h(X,\theta)h(X,\theta)^t \mid \theta]$ is finite. For any $h, g \in L$ and $\theta \in \Theta$, the family of generalized inner products is defined by
$$\langle h, g \rangle_\theta = E[h(X,\theta)\,g(X,\theta)^t \mid \theta].$$
Then it is easy to verify that for fixed $\theta \in \Theta$, $\langle \cdot\,,\cdot \rangle_\theta$ is a generalized inner product on $L$.
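To make the covariance example concrete, the following is a minimal numerical sketch (not part of the dissertation) in which $\langle h, g \rangle_\theta = E[h(X,\theta)g(X,\theta)^t \mid \theta]$ is approximated by Monte Carlo under an assumed normal model; the functions, parameter value and sample size are purely illustrative assumptions.

```python
# Illustrative sketch: Monte Carlo approximation of the covariance-type generalized
# inner product <h, g>_theta = E[h(X, theta) g(X, theta)^t | theta] for X ~ N(mu, sigma^2),
# theta = (mu, log_sigma). The estimating functions below are hypothetical examples.
import numpy as np

rng = np.random.default_rng(0)

def h(x, theta):
    mu, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    # components of the N(mu, sigma^2) score: d/dmu and d/dlog_sigma of the log-density
    return np.array([(x - mu) / sigma2, (x - mu) ** 2 / sigma2 - 1.0])

def g(x, theta):
    mu, _ = theta
    # a simple unbiased estimating function
    return np.array([x - mu, (x - mu) ** 3])

def gen_inner_product(f1, f2, theta, n_draws=50_000):
    mu, log_sigma = theta
    x = rng.normal(mu, np.exp(log_sigma), size=n_draws)
    v1 = np.stack([f1(xi, theta) for xi in x])   # n_draws x k
    v2 = np.stack([f2(xi, theta) for xi in x])
    return v1.T @ v2 / n_draws                   # k x k matrix, approx E[f1 f2^t | theta]

theta = (1.0, 0.5)
print(gen_inner_product(h, g, theta))            # <h, g>_theta
print(gen_inner_product(g, h, theta).T)          # approximately equal, illustrating property (1)
```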

Definition 2.2.2. Let $L$ be a generalized inner product space with inner product $\langle \cdot\,,\cdot \rangle$. Suppose $L_0$ is a linear subspace of $L$, and let $s \in L$. An element $y_0 \in L_0$ is called the orthogonal projection of $s$ into $L_0$ if
$$\langle s - y_0, s - y_0 \rangle = \min_{y \in L_0} \langle s - y, s - y \rangle, \qquad (2.2.1)$$
where the min is taken with respect to the usual ordering of matrices. More specifically, for two square matrices $A$ and $B$ of the same order, we say that $A \ge B$ if $A - B$ is n. n. d.

The following theorem characterizes the orthogonal projection in generalized inner product spaces.

Theorem 2.2.1. Let $L$ be a generalized inner product space with inner product $\langle \cdot\,,\cdot \rangle$, and let $L_0$ be a linear subspace of $L$. Let $s \in L$. Then $y_0 \in L_0$ is the orthogonal projection of $s$ into $L_0$ if and only if
$$\langle s - y_0, y \rangle = 0 \qquad (2.2.2)$$
for all $y \in L_0$, i. e., $s - y_0$ and $L_0$ are orthogonal. Furthermore, if the orthogonal projection exists, then it is unique.

Proof. Only if. For all $y \in L_0$ and $a \in R$, since $y_0 - a y \in L_0$,
$$\langle s - y_0 + a y, s - y_0 + a y \rangle - \langle s - y_0, s - y_0 \rangle$$
is n. n. d., i. e.,
$$a[\langle s - y_0, y \rangle + \langle s - y_0, y \rangle^t] + a^2 \langle y, y \rangle \qquad (2.2.3)$$
is n. n. d. for all $y \in L_0$, $a \in R$.

Now suppose that there exists $y^* \in L_0$ such that $B := \langle s - y_0, y^* \rangle \ne 0$. Then $B y^* \in L_0$ and, by property (2), $\langle s - y_0, B y^* \rangle + \langle s - y_0, B y^* \rangle^t = 2 B B^t =: A$. Since $B \ne 0$, $A$ is real symmetric, n. n. d., and $A \ne 0$. Let $\lambda_1 > 0$ be the largest eigenvalue of $A$, and denote by $z_1$ the corresponding unit eigenvector. Then from (2.2.3), applied with $y = B y^*$ and using $z_1^t A z_1 = \lambda_1$,
$$a \lambda_1 + a^2\, z_1^t \langle B y^*, B y^* \rangle z_1 \ge 0, \qquad \text{for all } a \in R.$$
This forces $\lambda_1 = 0$, a contradiction. Hence
$$\langle s - y_0, y \rangle = 0$$
for all $y \in L_0$.

If. Suppose $\langle s - y_0, y \rangle = 0$ for all $y \in L_0$. Then, for any $y \in L_0$,
$$\langle s - y, s - y \rangle - \langle s - y_0, s - y_0 \rangle = \langle s - y_0 + y_0 - y, s - y_0 + y_0 - y \rangle - \langle s - y_0, s - y_0 \rangle = \langle y_0 - y, y_0 - y \rangle, \qquad (2.2.4)$$
which is n. n. d.; the last equality follows since $\langle s - y_0, y_0 - y \rangle = 0$. This implies
$$\langle s - y_0, s - y_0 \rangle = \min_{y \in L_0} \langle s - y, s - y \rangle.$$

Finally, we show that if an orthogonal projection exists, then it is unique. Suppose that $y_1, y_2 \in L_0$ are both orthogonal projections of $s$ into $L_0$. Then $\langle s - y_i, y \rangle = 0$ for all $y \in L_0$, $i = 1, 2$. In particular,
$$\langle y_1 - y_2, y_1 - y_2 \rangle = \langle s - y_2, y_1 - y_2 \rangle - \langle s - y_1, y_1 - y_2 \rangle = 0 - 0 = 0.$$
So $y_1 = y_2$. This completes the proof of the theorem.

Next we apply Theorem 2.2.1 to generalize a result of Lehmann-Scheffe to the multidimensional case. Let $\mathcal{X}$ be a sample space, $\Theta \subset R^k$ an open set, and $\gamma : \Theta \to R^d$ an estimable function, i. e., there exists $g : \mathcal{X} \to R^d$ such that $E[g(X)\mid\theta] = \gamma(\theta)$, $\forall\,\theta \in \Theta$. Let
$$U_\gamma = \{g : \mathcal{X} \to R^d \mid E[g(X)\mid\theta] = \gamma(\theta),\ \forall\,\theta \in \Theta\},$$
$$U_0 = \{h : \mathcal{X} \to R^d \mid E[h(X)\mid\theta] = 0,\ \forall\,\theta \in \Theta\},$$
where $g \in U_\gamma$, $h \in U_0$ are such that $E[g(X)g(X)^t\mid\theta]$ and $E[h(X)h(X)^t\mid\theta]$ are well defined. Note that $g_* \in U_\gamma$ is a locally minimum variance unbiased estimator of $\gamma(\theta)$ at $\theta = \theta_0$ if
$$E[g_*(X)g_*(X)^t\mid\theta_0] = \min_{g \in U_\gamma} E[g(X)g(X)^t\mid\theta_0].$$
Also, it is easy to see that $U_\gamma = g + U_0$, $\forall\, g \in U_\gamma$.

Thus, as an easy consequence of Theorem 2.2.1, we have the following generalization of the Lehmann-Scheffe theorem.

Corollary 2.2.1. With the same notation as above, $g_* \in U_\gamma$ is a locally minimum variance unbiased estimator of $\gamma(\theta)$ at $\theta = \theta_0$ iff
$$\langle g_*, h \rangle_{\theta_0} = E[g_*(X)\,h(X)^t\mid\theta_0] = 0,$$
$\forall\, h \in U_0$.

Next we show that for any finite dimensional subspace in a generalized inner product space, the orthogonal projection always exists. In order to do this, the famous Gram-Schmidt orthogonalization procedure is used in generalized inner product spaces. We need another definition.

Definition 2.2.3. Let $(L, \langle \cdot\,,\cdot \rangle)$ be a generalized inner product space. A set of functions $\{h_i\}_{i=1}^n$ is said to be linearly independent if, for any set of $k \times k$ matrices $\{A_{ij}\}$, defining
$$e_1 = h_1, \qquad e_i = h_i - \sum_{j=1}^{i-1} A_{ij} h_j, \quad i = 2, \dots, n,$$
the matrices $\{\langle e_i, e_i \rangle : i \in \{1, \dots, n\}\}$ are all invertible.

The following is the Gram-Schmidt orthogonalization procedure in generalized inner product spaces.







Proposition 2.2.1. If $\{h_i\}_{i=1}^n$ is linearly independent, let
$$e_1 = h_1, \qquad e_2 = h_2 - \langle h_2, e_1 \rangle \langle e_1, e_1 \rangle^{-1} e_1, \qquad \dots, \qquad e_k = h_k - \sum_{i=1}^{k-1} \langle h_k, e_i \rangle \langle e_i, e_i \rangle^{-1} e_i, \quad k \in \{2, \dots, n\}.$$
Then $\{e_i\}_{i=1}^n$ are orthogonal.

Proof. First note that
$$\langle e_2, e_1 \rangle = \langle h_2, e_1 \rangle - \langle h_2, e_1 \rangle \langle e_1, e_1 \rangle^{-1} \langle e_1, e_1 \rangle = 0.$$
Now suppose that $\langle e_i, e_j \rangle = 0$ for all $j < i \le m$. Then
$$\langle e_{m+1}, e_j \rangle = \langle h_{m+1}, e_j \rangle - \langle h_{m+1}, e_j \rangle \langle e_j, e_j \rangle^{-1} \langle e_j, e_j \rangle = 0, \qquad j = 1, \dots, m,$$
so that $\{e_i\}_{i=1}^n$ are orthogonal.

The above result is used to prove the existence of the orthogonal projection of every element of a generalized inner product space into a finite dimensional subspace.

Theorem 2.2.2. Let $(L, \langle \cdot\,,\cdot \rangle)$ be a generalized inner product space, and let $L_0$ be a finite dimensional subspace of $L$ with a linearly independent basis. Then for any $s \in L$, the orthogonal projection of $s$ into $L_0$ always exists.

Proof. From Proposition 2.2.1, without loss of generality, we can assume that $\{h_1, \dots, h_n\}$ is an orthogonal basis for $L_0$. Let $A_i = \langle s, h_i \rangle \langle h_i, h_i \rangle^{-1}$, $i \in \{1, \dots, n\}$. We claim that the orthogonal projection of $s$ into $L_0$ is
$$h_* = \sum_{i=1}^n A_i h_i.$$
To see this, for any $h = \sum_{j=1}^n B_j h_j \in L_0$,
$$\langle s - h_*, h \rangle = \langle s, h \rangle - \langle h_*, h \rangle = \sum_{j=1}^n \langle s, h_j \rangle B_j^t - \sum_{j=1}^n A_j \langle h_j, h_j \rangle B_j^t = \sum_{j=1}^n [\langle s, h_j \rangle - A_j \langle h_j, h_j \rangle] B_j^t = 0.$$
Now apply Theorem 2.2.1.

Theorem 2.2.2 will be used repeatedly in the subsequent sections for the derivation of optimal estimating functions.
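As a purely illustrative complement (not part of the original text), the following sketch carries out Proposition 2.2.1 and Theorem 2.2.2 numerically: elements of $L$ are represented by matrices of Monte Carlo draws, so that $\langle x, y \rangle = x^t y / n$ plays the role of $E[xy^t]$; the particular functions and dimensions are assumptions made only for this example.

```python
# Illustrative sketch: matrix-valued Gram-Schmidt (Proposition 2.2.1) and the projection
# formula h_* = sum_i A_i e_i with A_i = <s, e_i><e_i, e_i>^{-1} (Theorem 2.2.2),
# with <x, y> = x^t y / n acting as the generalized inner product on sample matrices.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def ip(x, y):
    return x.T @ y / n            # k x k generalized inner product

z = rng.normal(size=(n, 6))
s  = z[:, :2] + 0.4 * z[:, 2:4] + 0.1 * z[:, 4:6]   # element to be projected
h1 = z[:, 2:4]
h2 = z[:, 4:6] + 0.3 * z[:, 2:4]                    # not orthogonal to h1

# Gram-Schmidt (Proposition 2.2.1)
e1 = h1
e2 = h2 - e1 @ (ip(h2, e1) @ np.linalg.inv(ip(e1, e1))).T

# Orthogonal projection of s into L0 = span{h1, h2} (Theorem 2.2.2)
A1 = ip(s, e1) @ np.linalg.inv(ip(e1, e1))
A2 = ip(s, e2) @ np.linalg.inv(ip(e2, e2))
h_star = e1 @ A1.T + e2 @ A2.T

print(ip(e2, e1))            # ~0: the constructed basis is orthogonal
print(ip(s - h_star, h1))    # ~0: residual orthogonal to L0 (Theorem 2.2.1)
print(ip(s - h_star, h2))    # ~0
```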

Next we establish an abstract information inequality, which is fundamental to our later study. The motivation for the following definition will be clear from the subsequent section, where we define the information associated with an estimating function.

Definition 2.2.4. Let $(L, \langle \cdot\,,\cdot \rangle)$ be a generalized inner product space, and let $s \in L$ be a fixed element. For any $g \in L$, the information of $g$ with respect to $s$ is defined as
$$I_g = \langle g, s \rangle^t \langle g, g \rangle^- \langle g, s \rangle, \qquad (2.2.5)$$
where $^-$ denotes a generalized inverse.

We shall also need the following theorem later.

Theorem 2.2.3. Let $(L, \langle \cdot\,,\cdot \rangle)$ be a generalized inner product space, and let $L_0$ be a linear subspace of $L$. For any $s \in L$, suppose $g^*$ is the orthogonal projection of $s$ into $L_0$. Consider the function
$$I_g = \langle g, s \rangle^t \langle g, g \rangle^- \langle g, s \rangle.$$
Then
$$I_{g^*} - I_g$$
is n. n. d. for all $g \in L_0$.

Proof. Let $s = g^* + h$. Then, using $\langle g, h \rangle = 0$,
$$I_g = \langle g, s \rangle^t \langle g, g \rangle^- \langle g, s \rangle = \langle g, g^* \rangle^t \langle g, g \rangle^- \langle g, g^* \rangle.$$
Also, using $\langle g^*, h \rangle = 0$,
$$I_{g^*} = \langle g^*, s \rangle^t \langle g^*, g^* \rangle^- \langle g^*, s \rangle = \langle g^*, g^* \rangle.$$
Now consider the matrix
$$\begin{bmatrix} \langle g^*, g^* \rangle & \langle g^*, g \rangle \\ \langle g, g^* \rangle & \langle g, g \rangle \end{bmatrix}.$$
For any $k$-dimensional vectors $a$ and $b$, we have
$$[a^t\ b^t] \begin{bmatrix} \langle g^*, g^* \rangle & \langle g^*, g \rangle \\ \langle g, g^* \rangle & \langle g, g \rangle \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = a^t \langle g^*, g^* \rangle a + 2 a^t \langle g, g^* \rangle^t b + b^t \langle g, g \rangle b = \langle a^t g^* + b^t g,\ a^t g^* + b^t g \rangle \ge 0.$$
Thus
$$\begin{bmatrix} \langle g^*, g^* \rangle & \langle g^*, g \rangle \\ \langle g, g^* \rangle & \langle g, g \rangle \end{bmatrix}$$
is n. n. d., which implies that
$$I_{g^*} - I_g = \langle g^*, g^* \rangle - \langle g, g^* \rangle^t \langle g, g \rangle^- \langle g, g^* \rangle$$
is n. n. d. The proof of the theorem is complete.

The following result will be used to establish the essential uniqueness of the optimal estimating function.

Theorem 2.2.4. With the same notation as above, suppose $g^*, g \in L_0$, $g^*$ is the orthogonal projection of $s$ into $L_0$, and $\langle g^*, g^* \rangle$, $\langle g, g \rangle$ are invertible. Then $I_{g^*} = I_g$ if and only if there exists an invertible matrix $M$ such that $g^* = Mg$.

Proof. If. If $g^* = Mg$, by a straightforward calculation we get
$$I_{g^*} = \langle g^*, s \rangle^t \langle g^*, g^* \rangle^{-1} \langle g^*, s \rangle = \langle g, s \rangle^t M^t [M \langle g, g \rangle M^t]^{-1} M \langle g, s \rangle = \langle g, s \rangle^t \langle g, g \rangle^{-1} \langle g, s \rangle = I_g.$$
Only if. If $I_{g^*} = I_g$, note that
$$I_{g^*} = \langle g^*, s \rangle^t \langle g^*, g^* \rangle^{-1} \langle g^*, s \rangle = \langle g^*, g^* \rangle,$$
and
$$I_g = \langle g, s \rangle^t \langle g, g \rangle^{-1} \langle g, s \rangle = \langle g, g^* \rangle^t \langle g, g \rangle^{-1} \langle g, g^* \rangle,$$
since $g^*$ is the orthogonal projection of $s$. Then
$$0 = I_{g^*} - I_g = \langle g^*, g^* \rangle - \langle g, g^* \rangle^t \langle g, g \rangle^{-1} \langle g, g^* \rangle.$$
Let $M = \langle g^*, g \rangle \langle g, g \rangle^{-1}$; then it is easy to verify that
$$\langle g^* - Mg, g^* - Mg \rangle = \langle g^*, g^* \rangle - \langle g, g^* \rangle^t \langle g, g \rangle^{-1} \langle g, g^* \rangle = 0,$$
so that $g^* = Mg$. Finally, $M$ is invertible, since $\langle g^*, g^* \rangle = M \langle g, g \rangle M^t$ is invertible.

As an easy consequence of Theorems 2.2.2-2.2.4, we have the following corollary.

Corollary 2.2.2. Let $(L, \langle \cdot\,,\cdot \rangle)$ be a generalized inner product space, and let $L_0$ be a finite dimensional subspace of $L$ with a linearly independent basis. For all $s \in L$ and $g \in L_0$, let
$$I_g = \langle g, s \rangle^t \langle g, g \rangle^- \langle g, s \rangle.$$
Then there exists $g^* \in L_0$ such that
$$I_{g^*} - I_g \qquad (2.2.6)$$
is n. n. d. for all $g \in L_0$. Furthermore, if $\langle g^*, g^* \rangle$ and $\langle g, g \rangle$ are invertible, then $I_{g^*} = I_g$ if and only if there exists an invertible matrix $M$ such that $g^* = Mg$.

Proof. Since $L_0$ is a finite dimensional subspace of $L$ with a linearly independent basis, by Theorem 2.2.2 the orthogonal projection $g^*$ of $s$ into $L_0$ exists for any $s \in L$. The first part of the corollary now follows from Theorem 2.2.3. The second part follows from Theorem 2.2.4.

2.3 Optimal Estimating Functions: A Geometric Approach


In this section, we will apply the results obtained in the previous section to the theory of estimating functions. We begin with the definition of unbiased estimating functions.

Let $\mathcal{X}$ be a sample space and let $\Theta \subset R^k$ be a $k$-dimensional parameter space. A function
$$g : \mathcal{X} \times \Theta \to R^k$$
is an unbiased estimating function if $E[g(X,\theta)\mid\theta] = 0$, $\forall\,\theta \in \Theta$. An unbiased estimating function $g$ is called regular if the following conditions hold:
(i) $d_{ij}(\theta) = E[\partial g_i/\partial\theta_j \mid \theta]$ $(1 \le i, j \le k)$ exists;
(ii) $E[g(X,\theta)\,g(X,\theta)^t \mid \theta]$ is positive definite.

Let $L$ denote the space of all regular unbiased estimating functions. For $g_1, g_2 \in L$, we define the family of generalized inner products of $g_1, g_2$ as
$$\langle g_1, g_2 \rangle_\theta = E[g_1(X,\theta)\,g_2(X,\theta)^t \mid \theta], \qquad \theta \in \Theta. \qquad (2.3.7)$$

This family of generalized inner products will be used throughout this section without specific reference to it. Also, we shall denote by $s$ the score function of the parametric family of distributions. We assume also that the score vector is regular in the sense described in (i) and (ii).

Definition 2.3.1. With the same notation as above, let $(L, \langle \cdot\,,\cdot \rangle_\theta)$ be the family of generalized inner product spaces, and let $L_0$ be a subspace of $L$. For any $g \in L_0$, the information function of $g$ is defined as follows:
$$I_g(\theta) = E\Big[\frac{\partial g}{\partial\theta}\,\Big|\,\theta\Big]^t\, \langle g, g \rangle_\theta^{-1}\, E\Big[\frac{\partial g}{\partial\theta}\,\Big|\,\theta\Big]. \qquad (2.3.8)$$
An element $g^* \in L_0$ is said to be an optimal estimating function in $L_0$ if
$$I_{g^*}(\theta) - I_g(\theta)$$
is n. n. d. for all $g \in L_0$ and $\theta \in \Theta$.

Next we prove a key result which shows that definition (2.3.8) is indeed equivalent to definition (2.2.5) of the previous section.

In the rest of this section, unless otherwise stated, we shall assume the following regularity condition for unbiased estimating functions.

(R). For any $g \in L$,
$$E\Big[\frac{\partial g}{\partial\theta}\,\Big|\,\theta\Big] = -E[g\,s^t \mid \theta]. \qquad (2.3.9)$$

Lemma 2.3.1. Under the regularity condition (R), for any $g \in L$, the information matrix of $g$ can be written as
$$I_g(\theta) = \langle g, s \rangle_\theta^t\, \langle g, g \rangle_\theta^-\, \langle g, s \rangle_\theta,$$
where $s$ is the score function.

Proof. The result follows easily since, for any $g \in L$, (2.3.9) gives
$$-\langle g, s \rangle_\theta = E\Big[\frac{\partial g}{\partial\theta}\,\Big|\,\theta\Big].$$







Theorem 2.3.1. Let $L_0$ be a subspace of $L$. Assume that the orthogonal projection $g^*$ of $s$ into $L_0$ exists. Then
$$I_g(\theta) \le I_{g^*}(\theta), \qquad \forall\,\theta \in \Theta,\ g \in L_0, \qquad (2.3.10)$$
that is, $g^* \in L_0$ is an optimal estimating function in $L_0$. The optimal element in $L_0$ is unique in the following sense: if $g \in L_0$, then $I_g(\theta) = I_{g^*}(\theta)$, $\forall\,\theta \in \Theta$, if and only if there exists an invertible matrix valued function $M : \Theta \to M_{k\times k}$ such that, for any $\theta \in \Theta$,
$$g^*(X, \theta) = M(\theta)\, g(X, \theta), \qquad (2.3.11)$$
with probability 1 with respect to $P_\theta$.

Proof. The first part follows easily from Lemma 2.3.1 and Theorem 2.2.3. The second part follows from Theorem 2.2.4.

Note that if $L_0$ is a finite dimensional subspace of $L$, then from Theorem 2.2.2 an orthogonal projection $g^*$ of $s$ ($\in L$) into $L_0$ always exists, so that the conclusions given in (2.3.10) and (2.3.11) always hold. Also, in this case, Proposition 2.2.1 and Theorem 2.2.2 show how to construct optimal estimating functions.

In the remainder of this section, we shall see several applications of Theorem 2.3.1 for deriving optimal estimating functions in different contexts. We begin by generalizing a result of Godambe (1985) to the case where the parameter space is multidimensional. Also, we bring out more explicitly the characterization of optimal estimating functions in a more general framework than what is given in Theorem 1 of Godambe (1985). Let $\{X_1, X_2, \dots, X_n\}$ be a discrete stochastic process, and let $\Theta \subset R^k$ be an open set. Let $h_i$ be an $R^k$ valued function of $X_1, \dots, X_i$ and $\theta$ such that
$$E_{i-1}[h_i(X_1, \dots, X_i; \theta) \mid \theta] = 0, \qquad (i = 1, \dots, n,\ \theta \in \Theta), \qquad (2.3.12)$$
where $E_{i-1}$ denotes the conditional expectation given the first $i-1$ variables, namely $X_1, \dots, X_{i-1}$. Let
$$L_0 = \Big\{g : g = \sum_{i=1}^n A_{i-1} h_i\Big\},$$
where $A_{i-1}$ is an $M_{k\times k}$ valued function of $X_1, \dots, X_{i-1}$ and $\theta$, for all $i \in \{1, \dots, n\}$.

The following theorem generalizes the result of Godambe (1985).

Theorem 2.3.2. With the same notation as above, suppose each $h_i$ satisfies the regularity condition (R). Let
$$A_{i-1}^* = E_{i-1}\Big[-\frac{\partial h_i}{\partial\theta}\,\Big|\,\theta\Big]^t\, E_{i-1}[h_i h_i^t \mid \theta]^{-1}, \qquad i \in \{1, 2, \dots, n\},$$
and
$$g^* = \sum_{i=1}^n A_{i-1}^* h_i.$$
Then the following conclusions hold:

(a) $g^*$ is the orthogonal projection of $s$ into $L_0$.

(b) $g^*$ is an optimal estimating function in $L_0$, i. e.,
$$I_g(\theta) \le I_{g^*}(\theta),$$
for all $g \in L_0$ and $\theta \in \Theta$.

(c) If $g \in L_0$ and $E[g g^t \mid \theta]$ is invertible, then $I_g(\theta) = I_{g^*}(\theta)$ for all $\theta \in \Theta$ if and only if there exists an invertible matrix function $M : \Theta \to M_{k\times k}$ such that, for any $\theta \in \Theta$,
$$g^*(X_1, \dots, X_n; \theta) = M(\theta)\, g(X_1, \dots, X_n; \theta),$$
with probability 1 with respect to $P_\theta$.

Proof. (a) For any $g = \sum_{i=1}^n A_{i-1} h_i \in L_0$ and $\theta \in \Theta$,
$$\langle s - g^*, g \rangle_\theta = \langle s, g \rangle_\theta - \langle g^*, g \rangle_\theta = \sum_{i=1}^n E[s\, h_i^t A_{i-1}^t \mid \theta] - \sum_{i=1}^n \sum_{j=1}^n E[A_{j-1}^* h_j h_i^t A_{i-1}^t \mid \theta]. \qquad (2.3.13)$$
But for $i < j$,
$$E[A_{j-1}^* h_j h_i^t A_{i-1}^t \mid \theta] = E\{E_{j-1}[A_{j-1}^* h_j h_i^t A_{i-1}^t \mid \theta] \mid \theta\} = E\{A_{j-1}^*\, E_{j-1}[h_j \mid \theta]\, h_i^t A_{i-1}^t \mid \theta\} = 0.$$
Similarly, for $i > j$,
$$E[A_{j-1}^* h_j h_i^t A_{i-1}^t \mid \theta] = 0.$$
Moreover, applying the regularity condition (R) to the conditional distribution of $X_i$ given $X_1, \dots, X_{i-1}$ gives
$$E_{i-1}[s\, h_i^t \mid \theta] = E_{i-1}\Big[-\frac{\partial h_i}{\partial\theta}\,\Big|\,\theta\Big]^t.$$
Thus, from equation (2.3.13), we get
$$\langle s - g^*, g \rangle_\theta = \sum_{i=1}^n E\Big\{E_{i-1}\Big[-\frac{\partial h_i}{\partial\theta}\,\Big|\,\theta\Big]^t A_{i-1}^t\,\Big|\,\theta\Big\} - \sum_{i=1}^n E\{A_{i-1}^*\, E_{i-1}[h_i h_i^t \mid \theta]\, A_{i-1}^t \mid \theta\} = 0.$$
Hence $g^*$ is the orthogonal projection of $s$ into $L_0$.

Parts (b) and (c) of the theorem follow easily from part (a) and Theorem 2.3.1.
A second application of Theorem 2.3.1 is to give a geometric formulation of a result of Godambe and Thompson (1989), who proved the existence of optimal estimating functions using mutually orthogonal estimating functions. What we show is that the optimal estimating function of Godambe and Thompson is indeed the orthogonal projection of the score function into an appropriate linear subspace.

To this end, let $\mathcal{X}$ denote the sample space, let $\theta = (\theta_1, \dots, \theta_m)$ be a vector of parameters, and let $h_j$, $j = 1, \dots, k$, be real functions on $\mathcal{X} \times \Theta$ such that
$$E[h_j(X,\theta) \mid \theta, \mathcal{X}_j] = 0, \qquad \forall\,\theta \in \Theta,\ j = 1, \dots, k,$$
where $\mathcal{X}_j$ is a specified partition of $\mathcal{X}$, $j = 1, \dots, k$. We will denote $E[\,\cdot \mid \theta, \mathcal{X}_j] = E_{(j)}[\,\cdot \mid \theta]$.

Consider the class of estimating functions
$$L_0 = \{g : g = (g_1, \dots, g_m)^t\},$$
where
$$g_r = \sum_{j=1}^k q_{jr} h_j, \qquad r = 1, \dots, m,$$
with $q_{jr} : \mathcal{X} \times \Theta \to R$ measurable with respect to the partition $\mathcal{X}_j$, for $j = 1, \dots, k$, $r = 1, \dots, m$. Let
$$q_{jr}^* = \frac{E_{(j)}[-\partial h_j/\partial\theta_r \mid \theta]}{E_{(j)}[h_j^2 \mid \theta]} \qquad (2.3.14)$$
for all $j = 1, \dots, k$, $r = 1, \dots, m$, and
$$g_r^* = \sum_{j=1}^k q_{jr}^* h_j, \qquad r = 1, \dots, m, \qquad g^* = (g_1^*, \dots, g_m^*)^t.$$

The estimating functions $h_j$, $j = 1, \dots, k$, are said to be mutually orthogonal if
$$E[q_{jr} h_j\, q_{j'r'} h_{j'} \mid \theta] = 0, \qquad \forall\, j \ne j',\ r, r' = 1, \dots, m. \qquad (2.3.15)$$


Theorem 2.3.3. With the same notation as above, if $\{h_j\}_{j=1}^k$ are mutually orthogonal, then the following hold:

(a) $g^*$ is the orthogonal projection of the score function $s$ into $L_0$.

(b) $g^*$ is an optimal estimating function in $L_0$.

(c) If $g \in L_0$ and $E[g g^t \mid \theta]$ is invertible, then $I_g(\theta) = I_{g^*}(\theta)$, $\forall\,\theta \in \Theta$, if and only if there exists an invertible matrix function $M : \Theta \to M_{m\times m}$ such that, for any $\theta \in \Theta$,
$$g^*(X; \theta) = M(\theta)\, g(X; \theta),$$
with probability 1 with respect to $P_\theta$.

Proof. (a) We only need to show that $\langle s, g \rangle_\theta = \langle g^*, g \rangle_\theta$ for every $g \in L_0$ and $\theta \in \Theta$; entrywise, this amounts to showing that, for all $r, r' \in \{1, \dots, m\}$ and $g_{r'} = \sum_{j=1}^k q_{jr'} h_j$,
$$\langle s_r, g_{r'} \rangle_\theta = \langle g_r^*, g_{r'} \rangle_\theta, \qquad \forall\,\theta \in \Theta.$$
But, using the mutual orthogonality of $\{h_j\}_{j=1}^k$,
$$\langle g_r^*, g_{r'} \rangle_\theta = \sum_{j=1}^k \sum_{j'=1}^k E[q_{jr}^* h_j\, q_{j'r'} h_{j'} \mid \theta] = \sum_{j=1}^k E[q_{jr}^* q_{jr'} h_j^2 \mid \theta] = \sum_{j=1}^k E\{q_{jr'}\, q_{jr}^*\, E_{(j)}[h_j^2 \mid \theta] \mid \theta\} = \sum_{j=1}^k E\Big\{q_{jr'}\, E_{(j)}\Big[-\frac{\partial h_j}{\partial\theta_r}\,\Big|\,\theta\Big]\,\Big|\,\theta\Big\}.$$
Also,
$$\langle s_r, g_{r'} \rangle_\theta = \sum_{j=1}^k E[q_{jr'}\, s_r\, h_j \mid \theta] = \sum_{j=1}^k E\{q_{jr'}\, E_{(j)}[s_r h_j \mid \theta] \mid \theta\} = \sum_{j=1}^k E\Big\{q_{jr'}\, E_{(j)}\Big[-\frac{\partial h_j}{\partial\theta_r}\,\Big|\,\theta\Big]\,\Big|\,\theta\Big\}.$$
Thus $g^*$ is the orthogonal projection of the score function into $L_0$.

Once again, (b) and (c) follow from part (a) and Theorem 2.3.1. This completes the proof.

Note that part (b) of Theorem 2.3.3 is due to Godambe and Thompson (1989), while the other two parts are new. We repeat that this theorem provides a geometric formulation of optimal estimating functions in a finite dimensional subspace of estimating functions. Also, through this approach, the characterization of optimal estimating functions is very easy to establish.

Finally, we apply the above results to obtain optimal generalized estimating equations for multivariate data. Let $\mathcal{X}_i$ denote the sample space for the $i$th subject, let $\Theta \subset R^d$ be a subset with nonempty interior, and let $u_i : \mathcal{X}_i \times \Theta \to R^{n_i}$, $i = 1, \dots, k$, be such that $E[u_i(X_i,\theta) \mid \theta] = 0$, $\forall\,\theta \in \Theta$. Suppose that, conditional on $\theta$, $\{u_i(X_i,\theta)\}_{i=1}^k$ are independent. Consider the estimating space
$$L_0 = \Big\{\sum_{i=1}^k W_i(\theta)\, u_i(X_i,\theta)\Big\},$$
where $W_i(\theta)$ is a $d \times n_i$ matrix, $i = 1, \dots, k$. Let
$$W_i^*(\theta) = E\Big[-\frac{\partial u_i}{\partial\theta}\,\Big|\,\theta\Big]^t\, [\mathrm{Var}(u_i \mid \theta)]^{-1}, \qquad i = 1, \dots, k, \qquad g^* = \sum_{i=1}^k W_i^*(\theta)\, u_i(X_i, \theta).$$


Then we have the following result.

Theorem 2.3.4. With the same notation as above,

(a) $g^*$ is the orthogonal projection of the score function into $L_0$;

(b) $g^*$ is an optimal estimating function in $L_0$;

(c) if $g \in L_0$ and $E[g g^t \mid \theta]$ is invertible, then $I_g(\theta) = I_{g^*}(\theta)$, $\forall\,\theta \in \Theta$, if and only if there exists an invertible matrix function $M : \Theta \to M_{d\times d}$ such that, for any $\theta \in \Theta$,
$$g^*(X; \theta) = M(\theta)\, g(X; \theta),$$
with probability 1 with respect to $P_\theta$.

Proof. (a) We only need to show that, for every $g = \sum_{i=1}^k W_i(\theta)\, u_i(X_i,\theta) \in L_0$,
$$\langle s, g \rangle_\theta = \langle g^*, g \rangle_\theta.$$
But, using the conditional independence of the $u_i$ and $E[u_i \mid \theta] = 0$,
$$\langle g^*, g \rangle_\theta = \sum_{i=1}^k \sum_{j=1}^k W_i^* \langle u_i, u_j \rangle_\theta W_j^t = \sum_{i=1}^k W_i^* \langle u_i, u_i \rangle_\theta W_i^t = \sum_{i=1}^k E\Big[-\frac{\partial u_i}{\partial\theta}\,\Big|\,\theta\Big]^t W_i^t;$$
also, by the regularity condition (R),
$$\langle s, g \rangle_\theta = \sum_{i=1}^k \langle s, u_i \rangle_\theta W_i^t = \sum_{i=1}^k E\Big[-\frac{\partial u_i}{\partial\theta}\,\Big|\,\theta\Big]^t W_i^t.$$
Thus $g^*$ is the orthogonal projection of the score function into $L_0$.

Parts (b) and (c) follow from part (a) and Theorem 2.3.1.

Note that by choosing appropriate functions $u_i$, we can very easily get the generalized estimating equations introduced by Liang and Zeger (1986). For further information about generalized estimating equations, we refer to Liang, Zeger and Qaqish (1992).
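As a hedged illustration of how Theorem 2.3.4 leads to Liang-Zeger type equations (the linear mean model, the exchangeable working covariance and all names below are assumptions of this sketch, not the dissertation's code):

```python
# Minimal GEE-style sketch: linear mean mu_i = X_i beta, u_i = y_i - X_i beta, and a known
# exchangeable working covariance V. Theorem 2.3.4 gives W_i* = E[-du_i/dbeta]^t V^{-1}
# = X_i^t V^{-1}, so the optimal estimating equation is sum_i X_i^t V^{-1} (y_i - X_i beta) = 0.
import numpy as np

rng = np.random.default_rng(3)
k, n_i, p, rho = 200, 4, 2, 0.5
V = 0.5 * ((1 - rho) * np.eye(n_i) + rho * np.ones((n_i, n_i)))   # working covariance
V_inv = np.linalg.inv(V)
beta_true = np.array([1.0, -0.5])

X = rng.normal(size=(k, n_i, p))
y = np.array([X[i] @ beta_true + rng.multivariate_normal(np.zeros(n_i), V) for i in range(k)])

# The estimating equation is linear in beta, so it can be solved in closed form.
A = sum(X[i].T @ V_inv @ X[i] for i in range(k))
b = sum(X[i].T @ V_inv @ y[i] for i in range(k))
beta_hat = np.linalg.solve(A, b)
print(beta_hat)                                                   # close to beta_true

# Check: the optimal estimating equation is (numerically) zero at beta_hat.
print(sum(X[i].T @ V_inv @ (y[i] - X[i] @ beta_hat) for i in range(k)))
```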

2.4 Optimal Bayesian Estimating Functions


In this section, we study the geometry of estimating functions within a Bayesian framework. There are two basic approaches here. One formulation is based on the joint distribution of the data and prior, as introduced by Ferreira (1981, 1982). The second formulation, due to Ghosh (1990), is based on the posterior density. We shall study both and see how the notion of orthogonal projection can be brought within Bayesian formulation as well.

We begin with Ferreira's (1981, 1982) formulation. Let $\mathcal{X}$ be the sample space, let $\Theta \subset R^k$ be an open set, let $p(x\mid\theta)$ be the conditional density of $X$ given $\theta$, and let $\pi(\theta)$ be a prior density. Let $g : \mathcal{X} \times \Theta \to R^k$ be a function such that
(1) $\partial g/\partial\theta$ exists, $\forall\,\theta \in \Theta$;
(2) $E[g(X,\theta)\,g(X,\theta)^t]$ is invertible, where $E$ denotes expectation over the joint distribution of $X$ and $\theta$.

Let $L$ denote the set of all functions $g : \mathcal{X} \times \Theta \to R^k$ which satisfy (1) and (2) above. The generalized inner product on $L$ is defined by
$$\langle f, g \rangle = E[f(X,\theta)\,g(X,\theta)^t], \qquad \forall\, f, g \in L. \qquad (2.4.16)$$
It is straightforward to verify that (2.4.16) is a generalized inner product on $L$.

The following calculation will serve as a key connection between Ferreira's formulation of optimal Bayesian estimating functions and our geometric formulation. It also provides a geometric insight into the result of Ferreira. Throughout this section, we shall always assume that $p(x\mid\theta)$ and $\pi(\theta)$ are differentiable with respect to $\theta$.

Lemma 2.4.1. Let $\pi(\theta\mid X)$ be the posterior density, let
$$s_j = \frac{\partial \log \pi(\theta\mid X)}{\partial\theta_j}, \qquad j \in \{1, \dots, k\},$$
and let $g_i : \mathcal{X} \times \Theta \to R$ be a function. Then
$$E[g_i s_j] = -E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] + E\Big\{\frac{\partial E[g_i\mid\theta]}{\partial\theta_j} + E[g_i\mid\theta]\,\frac{\partial \log \pi(\theta)}{\partial\theta_j}\Big\}. \qquad (2.4.17)$$

Proof.
$$E[g_i s_j] = E\Big\{E\Big[g_i\Big(\frac{\partial \log p(X\mid\theta)}{\partial\theta_j} + \frac{\partial \log \pi(\theta)}{\partial\theta_j}\Big)\Big|\,\theta\Big]\Big\} = E\Big\{E\Big[g_i\,\frac{\partial \log p(X\mid\theta)}{\partial\theta_j}\Big|\,\theta\Big]\Big\} + E\Big\{E[g_i\mid\theta]\,\frac{\partial \log \pi(\theta)}{\partial\theta_j}\Big\}$$
$$= E\Big\{\frac{\partial E[g_i\mid\theta]}{\partial\theta_j} - E\Big[\frac{\partial g_i}{\partial\theta_j}\Big|\,\theta\Big]\Big\} + E\Big\{E[g_i\mid\theta]\,\frac{\partial \log \pi(\theta)}{\partial\theta_j}\Big\} = -E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] + E\Big\{\frac{\partial E[g_i\mid\theta]}{\partial\theta_j} + E[g_i\mid\theta]\,\frac{\partial \log \pi(\theta)}{\partial\theta_j}\Big\}.$$
This completes the proof.

Note that if $E[g_i\mid\theta] = 0$, then
$$E[g_i s_j] = -E\Big[\frac{\partial g_i}{\partial\theta_j}\Big]; \qquad (2.4.18)$$
also, if $g_i$ is only a function of $\theta$, then
$$E[g_i s_j] = E\{E[g_i s_j\mid\theta]\} = E\Big[g_i\,\frac{\partial \log \pi(\theta)}{\partial\theta_j}\Big]. \qquad (2.4.19)$$







Suppose now
$$B_{ij}(g) = E\Big\{\frac{\partial E[g_i\mid\theta]}{\partial\theta_j} + E[g_i\mid\theta]\,\frac{\partial \log \pi(\theta)}{\partial\theta_j}\Big\} \qquad (2.4.20)$$
for $i = 1, \dots, k$, $j = 1, \dots, k$, where $g = (g_1, \dots, g_k)^t$. Let $s = (s_1, \dots, s_k)^t$; then, using Lemma 2.4.1,
$$\langle g, s \rangle = -\Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] - B_{ij}(g)\Big)\Big).$$
If $E[g\mid\theta] = 0$, from (2.4.17) and (2.4.18),
$$\langle g, s \rangle = -E\Big[\frac{\partial g}{\partial\theta}\Big]; \qquad (2.4.21)$$
also, if $g$ is only a function of $\theta$, then
$$\langle g, s \rangle = E\Big[g\Big(\frac{\partial \log \pi(\theta)}{\partial\theta}\Big)^t\Big]. \qquad (2.4.22)$$


Now by combining the previous theorem and the above lemma, we have the following result, which is a generalization of the main result due to Ferreira (1981, 1982) to the multidimensional case.

Theorem 2.4.1. For $g \in L$, let
$$M_g = E[g(X,\theta)\,g(X,\theta)^t]; \qquad (2.4.23)$$
then
$$\Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] - B_{ij}(g)\Big)\Big)^t\, M_g^{-1}\, \Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] - B_{ij}(g)\Big)\Big) \le M_s,$$
for all $g \in L$.

Proof. From the previous lemma,
$$\langle g, s \rangle = -\Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] - B_{ij}(g)\Big)\Big).$$
Also, $M_s = \langle s, s \rangle^t \langle s, s \rangle^{-1} \langle s, s \rangle$. Thus the result follows easily from Theorem 2.2.3, applied with $L_0 = L$, in which case the orthogonal projection of $s$ into $L_0$ is $s$ itself.

Note that if $k = 1$, the above theorem reduces to the result proved by Ferreira (1981, 1982).

For any $g \in L$, let
$$I_g = \Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] - B_{ij}(g)\Big)\Big)^t\, M_g^{-1}\, \Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] - B_{ij}(g)\Big)\Big). \qquad (2.4.24)$$
In the definition of $I_g$, $((E[\partial g_i/\partial\theta_j] - B_{ij}(g)))$ is a measure of the sensitivity of $g$, and $M_g$ is a measure of the variability of $g$. Thus, analogous to the frequentist case, the following definition seems to be appropriate for optimal estimating functions in the Bayesian framework.

Definition. If $L_0$ is a subspace of $L$ and $g^* \in L_0$, then $g^*$ is called an optimal Bayesian estimating function in $L_0$ if, for any $g \in L_0$,
$$I_g \le I_{g^*}.$$

Next we prove an optimality result about Bayesian estimating functions in this formulation.

Theorem 2.4.2. With the same notation as above, let the generalized inner product on $L$ be defined by (2.4.16), and let $L_0$ be a subspace of $L$. If $g^*$ is the orthogonal projection of $s$ into $L_0$, then

(1) $g^*$ is an optimal Bayesian estimating function in $L_0$;

(2) the optimal Bayesian estimating function in $L_0$ is unique in the following sense: for any $g \in L_0$, $I_g = I_{g^*}$ if and only if there exists an invertible $k \times k$ matrix $M$ such that $g^* = Mg$.

Proof. (1) From Theorem 2.2.3,
$$\langle g^*, s \rangle^t \langle g^*, g^* \rangle^- \langle g^*, s \rangle - \langle g, s \rangle^t \langle g, g \rangle^- \langle g, s \rangle$$
is n. n. d. for all $g \in L_0$. But from Lemma 2.4.1,
$$I_g = \langle g, s \rangle^t \langle g, g \rangle^- \langle g, s \rangle$$
for any $g \in L_0$. Thus the result follows easily.

(2) follows easily from Theorem 2.2.4.

Next we apply Theorem 2.4.2 to a case where $L_0$ is a finite dimensional subspace of $L$ with a linearly independent basis.

Let $\{u_i(X_i,\theta)\}_{i=1}^K$ be a family of $n_i \times 1$ vectors of parametric functions and let $v(\theta)$ be an $m \times 1$ vector such that
(1) for fixed $\theta \in \Theta$, $u_i(\cdot,\theta) : \mathcal{X}_i \to R^{n_i}$ is measurable;
(2) $v : \Theta \to R^m$ is measurable;
(3) $E[u_i \mid \theta] = 0$ and $E[v] = 0$;
(4) conditional on $\theta$, $\{u_i(X_i,\theta)\}_{i=1}^K$ are independent.

Consider the space of estimating functions of the form
$$L_0 = \Big\{\sum_{i=1}^K W_i(\theta)\, u_i(X_i,\theta) + Q\, v(\theta)\Big\},$$
where, for any $\theta \in \Theta$, $W_i(\theta)$ is a $k \times n_i$ matrix, for all $i \in \{1, \dots, K\}$, and $Q$ is a $k \times m$ matrix.

Theorem 2.4.3. With the same notation as above, let
$$W_i^*(\theta) = E\Big[-\frac{\partial u_i}{\partial\theta}\,\Big|\,\theta\Big]^t\, [\mathrm{Var}(u_i \mid \theta)]^{-1}, \qquad Q^* = E\Big[\frac{\partial \log \pi(\theta)}{\partial\theta}\, v(\theta)^t\Big]\, (E[v(\theta)\, v(\theta)^t])^{-1},$$
and
$$g^* = \sum_{i=1}^K W_i^*(\theta)\, u_i(X_i,\theta) + Q^*\, v(\theta).$$
Then

(a) $g^*$ is the orthogonal projection of $s$ into $L_0$;

(b) $g^*$ is an optimal Bayesian estimating function in $L_0$;

(c) the optimal Bayesian estimating function in $L_0$ is unique in the following sense: if $g \in L_0$, then $I_g = I_{g^*}$ if and only if there exists an invertible matrix $M$ such that
$$g^*(X_1, \dots, X_K; \theta) = M\, g(X_1, \dots, X_K; \theta),$$
with probability 1 with respect to the joint distribution of the $X_i$ and $\theta$.

Proof. (a) For any $g = \sum_{i=1}^K W_i(\theta)\, u_i(X_i,\theta) + Q\, v(\theta)$,
$$\langle s - g^*, g \rangle = \langle s, g \rangle - \langle g^*, g \rangle.$$
But
$$\langle s, g \rangle = \sum_{i=1}^K E\{E[s\, u_i(X_i,\theta)^t \mid \theta]\, W_i(\theta)^t\} + E[s\, v(\theta)^t]\, Q^t = \sum_{i=1}^K E\Big\{E\Big[-\frac{\partial u_i}{\partial\theta}\,\Big|\,\theta\Big]^t W_i(\theta)^t\Big\} + E\Big[\frac{\partial \log \pi(\theta)}{\partial\theta}\, v(\theta)^t\Big]\, Q^t,$$
and, using the conditional independence of the $u_i$ and $E[u_i \mid \theta] = 0$,
$$\langle g^*, g \rangle = \sum_{i=1}^K E\{W_i^*(\theta)\, E[u_i(X_i,\theta)\, u_i(X_i,\theta)^t \mid \theta]\, W_i(\theta)^t\} + Q^*\, E[v(\theta)\, v(\theta)^t]\, Q^t = \sum_{i=1}^K E\Big\{E\Big[-\frac{\partial u_i}{\partial\theta}\,\Big|\,\theta\Big]^t W_i(\theta)^t\Big\} + E\Big[\frac{\partial \log \pi(\theta)}{\partial\theta}\, v(\theta)^t\Big]\, Q^t.$$
Thus, by Theorem 2.2.1, $g^*$ is the orthogonal projection of $s$ into $L_0$.

Parts (b) and (c) follow from (a) and Theorem 2.4.2.
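A worked illustration of Theorem 2.4.3 (my sketch, with a conjugate normal model assumed only for this example): take $X_i \mid \theta \sim N(\theta, \sigma^2)$ i.i.d., prior $\theta \sim N(\mu_0, \tau^2)$, $u_i = X_i - \theta$ and $v(\theta) = -(\theta - \mu_0)/\tau^2$, so that $E[u_i \mid \theta] = 0$ and $E[v] = 0$; then $W_i^* = 1/\sigma^2$, $Q^* = 1$, and the optimal Bayesian estimating equation has the posterior mean as its root.

```python
# Hypothetical illustration of Theorem 2.4.3 in a normal-normal model:
# g*(theta) = sum_i (x_i - theta)/sigma^2 - (theta - mu0)/tau^2, whose root is the posterior mean.
import numpy as np

rng = np.random.default_rng(4)
sigma, mu0, tau, n = 1.0, 0.0, 2.0, 20
theta_true = rng.normal(mu0, tau)
x = rng.normal(theta_true, sigma, size=n)

def g_star(theta):
    return np.sum(x - theta) / sigma**2 - (theta - mu0) / tau**2

# Closed-form root of g*(theta) = 0 (the usual conjugate posterior mean):
theta_hat = (np.sum(x) / sigma**2 + mu0 / tau**2) / (n / sigma**2 + 1 / tau**2)
print(theta_hat, g_star(theta_hat))     # posterior mean, and g* ~ 0 at the root
```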
Next we turn to the formulation of Bayesian estimating functions introduced by Ghosh (1990). In this formulation, the parameter space is assumed to have the form $\Theta = (a_1, b_1) \times \cdots \times (a_k, b_k)$. We start with a result which is very similar to Lemma 2.4.1.

Lemma 2.4.2. Let $\pi(\theta\mid X)$ be the posterior density, let $s_j = \partial \log \pi(\theta\mid X)/\partial\theta_j$, $j = 1, \dots, k$, and let $g_i : \mathcal{X} \times \Theta \to R$ be a function satisfying suitable regularity conditions. Then
$$E[g_i s_j \mid X] = -E\Big[\frac{\partial g_i}{\partial\theta_j}\,\Big|\,X\Big] + E[B_j(g_i)\mid X],$$
where
$$B_j(g_i) = \lim_{\theta_j \to b_j^-} g_i(X,\theta)\,\pi(\theta\mid X) - \lim_{\theta_j \to a_j^+} g_i(X,\theta)\,\pi(\theta\mid X).$$

Proof. Note that
$$E[g_i s_j \mid X] = \int g_i\,\frac{\partial \pi(\theta\mid X)}{\partial\theta_j}\,d\theta = E[B_j(g_i)\mid X] - E\Big[\frac{\partial g_i}{\partial\theta_j}\,\Big|\,X\Big].$$

Next the definition of posterior estimating functions is introduced. A function $g : \Theta \times \mathcal{X} \to R^k$ is called a posterior unbiased estimating function (PUEF) if
$$E[g(\theta,X)\mid X] = 0, \qquad (2.4.25)$$
$$B_j(g_i) = 0, \qquad \forall\, x \in \mathcal{X},\ i, j \in \{1, \dots, k\}. \qquad (2.4.26)$$
Actually, all we require is that $E[B_j(g_i)\mid X] = 0$, $\forall\, x \in \mathcal{X}$, $i, j \in \{1, \dots, k\}$. Let $L$ be the space consisting of all functions $g : \Theta \times \mathcal{X} \to R^k$ which are PUEFs and for which $E[g g^t \mid X]$ is invertible. A family of generalized inner products on $L$ is defined as follows: for any $f, g \in L$ and $x \in \mathcal{X}$,
$$\langle f, g \rangle_x = E[f(\theta,X)\,g(\theta,X)^t \mid X = x]. \qquad (2.4.27)$$
If the score function $s \in L$, then from Lemma 2.4.2,
$$\langle g, s \rangle_x = -\Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\,\Big|\,X = x\Big]\Big)\Big).$$
Next, for every $g \in L$ and $x \in \mathcal{X}$, define
$$I_g(x) = \Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\,\Big|\,X = x\Big]\Big)\Big)^t\, (E[g(\theta,X)\,g(\theta,X)^t \mid X = x])^{-1}\, \Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\,\Big|\,X = x\Big]\Big)\Big). \qquad (2.4.28)$$







Let $L_0$ be a subspace of $L$; $g^* \in L_0$ is said to be an optimal element in $L_0$ if
$$I_g(x) \le I_{g^*}(x)$$
for all $g \in L_0$ and $x \in \mathcal{X}$.

The following result now follows very easily.

Theorem 2.4.4. With the same notation as above, suppose that the orthogonal projection $g^*$ of $s$ into $L_0$ exists with respect to the generalized inner products. Then $g^*$ is optimal in $L_0$; that is, for all $g \in L_0$, we have
$$I_g(x) \le I_{g^*}(x), \qquad \forall\, x \in \mathcal{X}.$$
Furthermore, the optimal element in $L_0$ is unique in the following sense: if $g \in L_0$, then $I_g(x) = I_{g^*}(x)$, $\forall\, x \in \mathcal{X}$, if and only if there exists an invertible matrix valued function $M : \mathcal{X} \to M_{k\times k}$ such that
$$g(\theta; x) = M(x)\, g^*(\theta; x).$$

Proof. The first part of the theorem is a consequence of Theorem 2.2.3, and the second part is a consequence of Theorem 2.2.4.

Note that if $s \in L_0$, then $s$ is an optimal estimating function.

As a corollary of Theorem 2.4.4, we have the following generalization to a multi-dimensional parameter space of a result due to Godambe (1994) about optimal estimating functions.

Corollary 2.4.1. If $g^* \in L_0$ is the orthogonal projection of $s$ into $L_0$, then

(a) $I_g(x) \le I_{g^*}(x)$, for all $g \in L_0$ and $x \in \mathcal{X}$;

(b) $E[(g^* - s)(g^* - s)^t \mid x] \le E[(g - s)(g - s)^t \mid x]$, for all $g \in L_0$ and $x \in \mathcal{X}$.

Note that it is easy to see that if the parameter space is one dimensional, then (a) is equivalent to
$$\mathrm{corr}\{g^*, s \mid x\}^2 \ge \mathrm{corr}\{g, s \mid x\}^2,$$
for all $g \in L_0$ and $x \in \mathcal{X}$. This is the result proved by Godambe (1994).







2.5 Orthogonal Decomposition and Information Inequality


In this section, we give a geometric intuition for some information inequalities. Let us start with one of the main results of this section.

Theorem 2.5.1. Suppose $L_0$ is a subspace of $L$, and for $g \in L$, let $g_0$ be the orthogonal projection of $g$ into $L_0$. Also, let $s$ be the score function.

(i) If $\langle g - g_0, s \rangle_\theta = 0$, $\forall\,\theta \in \Theta$, then
$$I_g(\theta) \le I_{g_0}(\theta), \qquad \forall\,\theta \in \Theta.$$

(ii) If $\langle g_0, s \rangle_\theta = 0$, $\forall\,\theta \in \Theta$, then
$$I_g(\theta) \le I_{g - g_0}(\theta), \qquad \forall\,\theta \in \Theta.$$

Proof. Note that
$$I_g(\theta) = \langle g, s \rangle_\theta^t\, \langle g, g \rangle_\theta^{-1}\, \langle g, s \rangle_\theta, \qquad \forall\,\theta \in \Theta.$$

(i) If $\langle g - g_0, s \rangle_\theta = 0$, $\forall\,\theta \in \Theta$, then $\langle g, s \rangle_\theta = \langle g_0, s \rangle_\theta$, $\forall\,\theta \in \Theta$. Also,
$$\langle g, g \rangle_\theta = \langle g_0, g_0 \rangle_\theta + \langle g - g_0, g - g_0 \rangle_\theta \ge \langle g_0, g_0 \rangle_\theta, \qquad \forall\,\theta \in \Theta.$$
Thus
$$I_g(\theta) \le I_{g_0}(\theta), \qquad \forall\,\theta \in \Theta.$$

(ii) If $\langle g_0, s \rangle_\theta = 0$, $\forall\,\theta \in \Theta$, then $\langle g, s \rangle_\theta = \langle g - g_0, s \rangle_\theta$, $\forall\,\theta \in \Theta$, and
$$\langle g, g \rangle_\theta = \langle g_0, g_0 \rangle_\theta + \langle g - g_0, g - g_0 \rangle_\theta \ge \langle g - g_0, g - g_0 \rangle_\theta, \qquad \forall\,\theta \in \Theta.$$
Thus
$$I_g(\theta) \le I_{g - g_0}(\theta), \qquad \forall\,\theta \in \Theta.$$







As an application of the previous result, we have the following.

Corollary 2.5.1. Let $T$ be a statistic, and for $g \in L$ let $g_0 = E[g(X,\theta)\mid T]$. Then

(1) if $T$ is sufficient, then
$$I_g(\theta) \le I_{g_0}(\theta), \qquad \forall\,\theta \in \Theta;$$

(2) if $T$ is ancillary, then
$$I_g(\theta) \le I_{g - g_0}(\theta), \qquad \forall\,\theta \in \Theta.$$

Proof. (1) If $T$ is sufficient, then, using the factorization theorem, we have
$$\langle g - g_0, s \rangle_\theta = 0, \qquad \forall\,\theta \in \Theta,$$
since the score function is a function of $T$ only. The result now follows from part (i) of the previous theorem.

(2) If $T$ is ancillary, then, $\forall\,\theta \in \Theta$,
$$\langle g_0, s \rangle_\theta = E\Big[g_0\Big(\frac{\partial \log f_X}{\partial\theta}\Big)^t\Big|\,\theta\Big] = E\Big[g_0\Big(\frac{\partial \log f_T}{\partial\theta} + \frac{\partial \log f_{X\mid T}}{\partial\theta}\Big)^t\Big|\,\theta\Big] = E\Big\{g_0\, E\Big[\Big(\frac{\partial \log f_{X\mid T}}{\partial\theta}\Big)^t\Big|\,\theta, T\Big]\Big|\,\theta\Big\} = 0,$$
where $f_X$, $f_T$ and $f_{X\mid T}$ are the marginal pdf of $X$, the marginal pdf of $T$ and the conditional pdf of $X$ given $T$ respectively; the second equality uses $\partial \log f_T/\partial\theta = 0$ (ancillarity of $T$), and the last step uses the fact that the conditional score function has zero expectation with respect to the conditional density.

Thus both results follow easily from the previous theorem.
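The following minimal Monte Carlo sketch (my illustration, not part of the text) checks part (1) of Corollary 2.5.1 in an assumed normal model; the sample size, parameter value and variable names are all assumptions of this example.

```python
# Numerical check of Corollary 2.5.1(1): X_1, ..., X_n i.i.d. N(theta, 1), T = X_bar sufficient.
# With g = X_1 - theta and g_0 = E[g | T] = X_bar - theta, the scalar information
# I_g = (E[dg/dtheta])^2 / E[g^2] equals 1 for g and n for g_0, so conditioning on T helps.
import numpy as np

rng = np.random.default_rng(5)
n, theta, reps = 5, 0.3, 400_000
x = rng.normal(theta, 1.0, size=(reps, n))

g  = x[:, 0] - theta            # E[dg/dtheta] = -1, E[g^2] = 1
g0 = x.mean(axis=1) - theta     # E[dg0/dtheta] = -1, E[g0^2] = 1/n

I_g  = (-1.0) ** 2 / np.mean(g ** 2)
I_g0 = (-1.0) ** 2 / np.mean(g0 ** 2)
print(I_g, I_g0)                # approximately 1 and approximately n
```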

Next we study the information decomposition for information unbiased estimating functions. Let us start with the definition introduced by Lindsay (1982). Let $g$ be an estimating function. Then $g$ is called information unbiased if
$$\langle g, s \rangle_\theta = \langle g, g \rangle_\theta, \qquad \forall\,\theta \in \Theta,$$
i. e., $g$ and $s - g$ are orthogonal, where $s$ is the score function.

The main result on information unbiased estimating functions is given in the following theorem.

Theorem 2.5.2. Let $T$ be a statistic such that the marginal and conditional densities of $T$ satisfy the usual regularity conditions. For $g \in L$, let $g_0 = E[g\mid T, \theta]$ and $h = g - E[g\mid T, \theta]$. Suppose that $g$ is information unbiased. Then

(1) if $h$ is information unbiased with respect to $f_{X\mid T}$, then $g_0$ is information unbiased with respect to $f_T$;

(2) if $g_0$ is information unbiased with respect to $f_T$, then $h$ is information unbiased with respect to $f_{X\mid T}$;

(3) if at least one of (1) or (2) holds, then
$$I_g(\theta) = I_{g_0}(\theta) + I_h(\theta), \qquad \forall\,\theta \in \Theta.$$

Proof. Let $s_X$ denote the score function of $X$. Then we have
$$s_X = s_T + s_{X\mid T},$$
and
$$\langle g, s_X - g \rangle_\theta = E[g\,(s_X - g)^t \mid \theta] = E[(g_0 + h)(s_T - g_0 + s_{X\mid T} - h)^t \mid \theta]$$
$$= \langle g_0, s_T - g_0 \rangle_\theta + \langle h, s_{X\mid T} - h \rangle_\theta + E[g_0\,(s_{X\mid T} - h)^t \mid \theta] + E[h\,(s_T - g_0)^t \mid \theta].$$
But
$$E[g_0\,(s_{X\mid T} - h)^t \mid \theta] = E\{g_0\, E[(s_{X\mid T} - h)^t \mid T, \theta] \mid \theta\} = 0,$$
since $E[h\mid T, \theta] = 0$ and the conditional score has zero conditional expectation, and
$$E[h\,(s_T - g_0)^t \mid \theta] = E\{E[h\mid T, \theta]\,(s_T - g_0)^t \mid \theta\} = 0,$$
since $E[h\mid T, \theta] = 0$. So
$$\langle g, s_X - g \rangle_\theta = \langle g_0, s_T - g_0 \rangle_\theta + \langle h, s_{X\mid T} - h \rangle_\theta, \qquad \forall\,\theta \in \Theta. \qquad (2.5.29)$$
(1) and (2) follow from the above equality.

(3) First note that if $g$ is information unbiased, then $I_g(\theta) = \langle g, g \rangle_\theta$, $\forall\,\theta \in \Theta$. Also, because $g = g_0 + h$ and $g_0, h$ are orthogonal,
$$\langle g, g \rangle_\theta = \langle g_0, g_0 \rangle_\theta + \langle h, h \rangle_\theta, \qquad \forall\,\theta \in \Theta,$$
which implies that
$$I_g(\theta) = I_{g_0}(\theta) + I_h(\theta), \qquad \forall\,\theta \in \Theta.$$

As an easy consequence of the above theorem, we have the following result proved by Bhapkar (1989, 1991a).

Corollary 2.5.2. With the same notation as in the above theorem,
$$I_{s_X}(\theta) = I_{s_T}(\theta) + I_{s_{X\mid T}}(\theta), \qquad \forall\,\theta \in \Theta.$$
Proof. The result follows by noting that, under the usual regularity conditions, $s_T$ and $s_{X\mid T}$ are information unbiased.

Note that if $T$ is sufficient, then
$$I_{s_X}(\theta) = I_{s_T}(\theta), \qquad \forall\,\theta \in \Theta.$$
If $T$ is ancillary, then
$$I_{s_X}(\theta) = I_{s_{X\mid T}}(\theta), \qquad \forall\,\theta \in \Theta.$$

2.6 Orthogonal Decomposition for Estimating Functions


In this section, we prove a Hoeffding type decomposition for estimating functions, revealing the geometric nature of the Hoeffding decomposition for U-statistics.

Let $X_1, \dots, X_n$ be independent random variables, let $\mathcal{X}_i$ be the sample space for $X_i$, $i = 1, \dots, n$, and let $\Theta$ be the parameter space. A function
$$h : \mathcal{X}_1 \times \cdots \times \mathcal{X}_n \times \Theta \to R^k$$
is called an unbiased estimating function if $E[h(X_1, \dots, X_n; \theta) \mid \theta] = 0$, $\forall\,\theta \in \Theta$. Let $\mathcal{E}$ consist of all unbiased estimating functions $h : \mathcal{X}_1 \times \cdots \times \mathcal{X}_n \times \Theta \to R^k$ such that
$$E[h(X_1, \dots, X_n; \theta)\, h(X_1, \dots, X_n; \theta)^t \mid \theta]$$
is well defined for all $\theta \in \Theta$.

For $h_1, h_2 \in \mathcal{E}$, define a family of generalized inner products of $h_1, h_2$ as
$$\langle h_1, h_2 \rangle_\theta = E[h_1 h_2^t \mid \theta].$$
Then $(\mathcal{E}, \langle \cdot\,,\cdot \rangle_\theta)$ is a generalized inner product space for all $\theta \in \Theta$.

For $m \le n$, let $\mathcal{E}_m$ be the linear span of the functions of the form
$$h(X_{i_1}, \dots, X_{i_m}; \theta)$$
which satisfy
$$E[h(X_{i_1}, \dots, X_{i_m}; \theta) \mid \theta] = 0,$$
and for which $\langle h, h \rangle_\theta$ is well defined for all $\theta \in \Theta$, where $\{i_1, \dots, i_m\} \subset \{1, \dots, n\}$.

If $m_1 \le m_2 \le n$, then $\mathcal{E}_{m_1}$ can be regarded as a subspace of $\mathcal{E}_{m_2}$ in the obvious fashion. Now a natural question is: for $h \in \mathcal{E}$ and $m \le n$, does the orthogonal projection of $h$ into $\mathcal{E}_m$ exist? If it does, how do we find it? The answer to the above question is affirmative, and it turns out to be very closely related to the Hoeffding type decomposition for U-statistics. Before proving the main result of this section, let us introduce some further notation to simplify our presentation. For $I = \{i_1, \dots, i_m\} \subset \{1, \dots, n\}$ with $i_1 < \cdots < i_m$, write
$$X_I = (X_{i_1}, \dots, X_{i_m}), \qquad E[h \mid X_I] = E[h \mid X_{i_1}, \dots, X_{i_m}].$$

Now we return to the orthogonal decomposition of estimating functions. Let $h \in \mathcal{E}$. For nonempty $I \subset \{1, \dots, n\}$, define recursively
$$g_I(X_I) = E[h \mid X_I] - \sum_{J \subsetneq I,\ J \ne \emptyset} g_J(X_J), \qquad (2.6.30)$$
and, for $1 \le k \le n$, let
$$h_k = \sum_{I :\ |I| = k} g_I(X_I). \qquad (2.6.31)$$


Then we have the following result.

Theorem 2.6.1. Let $\mathcal{E}$ and $\mathcal{E}_m$ be the generalized inner product spaces defined as above, where $m \le n$. Then, for every $h \in \mathcal{E}$, the orthogonal projection of $h$ into $\mathcal{E}_m$ exists and is given by
$$\sum_{i=1}^m h_i,$$
where the $h_i$ $(1 \le i \le m)$ are defined as above.

Proof. We are only going to show that $h_1$ and $h_1 + h_2$ are the orthogonal projections of $h$ into $\mathcal{E}_1$ and $\mathcal{E}_2$ respectively. The rest can be proved similarly.

(1) $h_1 = \sum_{i=1}^n E[h \mid X_i]$ is the orthogonal projection of $h$ into $\mathcal{E}_1$. In fact, for every $\sum_{j=1}^n g_j(X_j) \in \mathcal{E}_1$, we have
$$\Big\langle h - \sum_{i=1}^n E[h \mid X_i],\ \sum_{j=1}^n g_j(X_j) \Big\rangle_\theta = \sum_{j=1}^n \langle h, g_j(X_j) \rangle_\theta - \sum_{i=1}^n \sum_{j=1}^n \langle E[h \mid X_i], g_j(X_j) \rangle_\theta$$
$$= \sum_{j=1}^n \langle E[h \mid X_j], g_j(X_j) \rangle_\theta - \sum_{j=1}^n \langle E[h \mid X_j], g_j(X_j) \rangle_\theta = 0,$$
since $\{X_i\}_{i=1}^n$ are independent. So, by Theorem 2.2.1, $\sum_{i=1}^n E[h \mid X_i]$ is the orthogonal projection of $h$ into $\mathcal{E}_1$.

(2) Next we show that
$$h_1 + h_2 = \sum_{i=1}^n E[h \mid X_i] + \sum_{i_1 < i_2} \big(E[h \mid X_{i_1}, X_{i_2}] - E[h \mid X_{i_1}] - E[h \mid X_{i_2}]\big)$$
is the orthogonal projection of $h$ into $\mathcal{E}_2$. In fact, for every $\sum_{j_1 < j_2} g_{j_1 j_2}(X_{j_1}, X_{j_2}) \in \mathcal{E}_2$,
$$\Big\langle h - h_2 - h_1,\ \sum_{j_1 < j_2} g_{j_1 j_2}(X_{j_1}, X_{j_2}) \Big\rangle_\theta = \sum_{j_1 < j_2} \Big\langle h - \sum_{i_1 < i_2} \big(E[h \mid X_{i_1}, X_{i_2}] - E[h \mid X_{i_1}] - E[h \mid X_{i_2}]\big) - \sum_{i=1}^n E[h \mid X_i],\ g_{j_1 j_2}(X_{j_1}, X_{j_2}) \Big\rangle_\theta. \qquad (2.6.32)$$
Note that if $\{i_1, i_2\} \cap \{j_1, j_2\} = \emptyset$, then
$$\langle E[h \mid X_{i_1}, X_{i_2}] - E[h \mid X_{i_1}] - E[h \mid X_{i_2}],\ g_{j_1 j_2}(X_{j_1}, X_{j_2}) \rangle_\theta = 0,$$
by the independence of $\{X_i\}_{i=1}^n$. If $\{i_1, i_2\}$ and $\{j_1, j_2\}$ have exactly one index in common, then again
$$\langle E[h \mid X_{i_1}, X_{i_2}] - E[h \mid X_{i_1}] - E[h \mid X_{i_2}],\ g_{j_1 j_2}(X_{j_1}, X_{j_2}) \rangle_\theta = 0.$$
Also, if $i \ne j_1$ and $i \ne j_2$, then $\langle E[h \mid X_i], g_{j_1 j_2}(X_{j_1}, X_{j_2}) \rangle_\theta = 0$. Thus, in (2.6.32), for each pair $j_1 < j_2$ only the terms with $\{i_1, i_2\} = \{j_1, j_2\}$ and $i \in \{j_1, j_2\}$ remain, and these add up to
$$\langle h - E[h \mid X_{j_1}, X_{j_2}],\ g_{j_1 j_2}(X_{j_1}, X_{j_2}) \rangle_\theta = 0.$$
Hence
$$\Big\langle h - h_2 - h_1,\ \sum_{j_1 < j_2} g_{j_1 j_2}(X_{j_1}, X_{j_2}) \Big\rangle_\theta = 0,$$
so $h_1 + h_2$ is the orthogonal projection of $h$ into $\mathcal{E}_2$.
As a consequence of the above theorem, we have the following result.

Corollary 2.6.1. With the same notation as above, $\{h_1, \dots, h_n\}$ are orthogonal to each other.

Proof. For all $1 < m_2 \le n$, since $\sum_{i=1}^{m_2} h_i$ and $\sum_{i=1}^{m_2 - 1} h_i$ are the orthogonal projections of $h$ into $\mathcal{E}_{m_2}$ and $\mathcal{E}_{m_2 - 1}$ respectively,
$$h_{m_2} = \sum_{i=1}^{m_2} h_i - \sum_{i=1}^{m_2 - 1} h_i$$
is orthogonal to $\mathcal{E}_{m_2 - 1}$. Hence $h_{m_2}$ is orthogonal to $\{h_1, \dots, h_{m_2 - 1}\}$, since $\{h_1, \dots, h_{m_2 - 1}\} \subset \mathcal{E}_{m_2 - 1}$.

Theorem 2.6.1 generalizes the ANOVA decomposition for statistics proved by Efron and Stein (1981) to the estimating function case.

Also, as another consequence of the orthogonal decomposition theorem, we have the following variance decomposition result.

Corollary 2.6.2. With the same notation as above, we have
$$\mathrm{Var}_\theta(h) = \sum_{i=1}^n \mathrm{Var}_\theta(h_i).$$
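As an illustrative Monte Carlo check of this decomposition (my sketch, with a toy function and distribution assumed only for the example), take two independent standard normal variables and $h = X_1 + X_2 + X_1 X_2$; then $h_1 = X_1 + X_2$, $h_2 = X_1 X_2$, the two components are orthogonal, and the variances add.

```python
# Monte Carlo check of the Section 2.6 decomposition for h = X_1 + X_2 + X_1 X_2,
# X_1, X_2 independent N(0, 1): h_1 = E[h|X_1] + E[h|X_2] = X_1 + X_2 and h_2 = X_1 X_2.
import numpy as np

rng = np.random.default_rng(6)
x1, x2 = rng.normal(size=(2, 1_000_000))

h  = x1 + x2 + x1 * x2
h1 = x1 + x2            # first-order terms, sum of E[h | X_i]
h2 = x1 * x2            # second-order term g_{12}

print(np.mean(h1 * h2))                       # ~0: orthogonality (Corollary 2.6.1)
print(np.var(h), np.var(h1) + np.var(h2))     # approximately equal (Corollary 2.6.2)
```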














CHAPTER 3

THE GEOMETRY OF ESTIMATING FUNCTIONS II



3.1 Introduction


In Chapter 2, the notion of generalized inner product spaces was introduced to study optimal estimating functions without nuisance parameters. It was shown that the orthogonal projection of the score function into a linear subspace of estimating functions was optimal in that subspace, and a general method for the construction of such orthogonal projections was also given. As applications, both frequentist and Bayesian optimal estimating functions were found, including as special cases some of the frequentist and Bayesian results derived earlier.

In this chapter, we extend the results of the previous chapter to find optimal estimating functions in the presence of nuisance parameters. First in Section 3.2, we derive some simple extension of the basic geometric results of Chapter 2. Next, in Section 3.3, we derive a general result on global optimal estimating functions which extends the results of Godambe and Thompson (1974) and Godambe (1976) to the multiparameter case. The general result is also used to study the geometry of conditional and marginal inference including as special cases some of the results of Bhapkar (1989, 1991).







In Section 3.4, a general result on locally optimal estimating functions is found, and is used to generalize (1) Lindsay's (1982) result on the local optimality of conditional score functions, (2) Godambe's (1985) result on estimation for stochastic processes, and (3) Murphy and Li's (1995) result on projected partial likelihood.

Finally, in Section 3.5, we derive optimal conditional estimating functions. As an application, we generalize the results of Godambe and Thompson (1989) in the presence of nuisance parameters.

3.2 Properties of Orthogonal Projections


The following result demonstrates that the operation of orthogonal projection is compatible with linear operations in a generalized inner product space.

Proposition 3.2.1. Let (L, < .,. >) be a generalized inner product space, and let L0 be a subspace of L. If s1, S2 C L, and the orthogonal projection gq of si into L0 exists for i = 1, 2, then

(i) g1 + g2 is the orthogonal projection of s, + S2 into L0;

(ii) for any matrix N, N g, is the orthgonal projection of N s, into N L0.

Proof. From Theorem 2.2.1, g* is the orthogonal projection of s into L0 if and only if
< - g*, g >= 0,

for all g C L0.

(i) For any g E L0,


< s1 +S2 -gi -g2, g >=< s1 -g9,g > + < s2 -g2,g> 0; so g1 + g2 is the orthogonal projection of s, + 82 into L0.

(ii) For any g E N L0,


< N S1 - N gl,g >=< N (si gl),g >







=N =O;

so N g1 is the orthgonal projection of N s, into L0.

The following result is a slight generalization of Theorem 2.2.3

Theorem 3.2.1. Let (L, < .,. >) be a generalized inner product space, and let L0 be a subspace of L. For any fixed s G L, and any invertible matrix N, suppose that the orthogonal projection g* of N s into L0 exists. Consider the function Ig =< gs >t< g,g >-< g,s >

Then



is non negative definite, for all g E L0.

Proof. Since g* is the orthogonal projection of N s into L0 and N is invertible, N-1 g* is the orthogonal projection of s into N-' L0, from part (ii) of Proposition 1. Hence, from part (b) of Theorem 2.2.3, IN-1 9* - IN-Ig (3.2.1) is non negative definite for all g E L0. But for any g,

IN-1 g =< N-1 g, s >'< N-1 g, N-1 g >-< N-1 g, s > =< g' 8>' (Nt)-I [N-' < g, g > (Nt)-']-N-' < g,. > =< >tg,s>t< fg >-'< g>s >= g.


Hence, the result follows from (3.2.1).

3.3 Global Optimality of Estimating Functions


In this section, a general result about global optimality of estimating functions in the presence of nuisance parameters will be proved. As easy consequences of this result, some results of Godambe and Thompson (1974), and Godambe (1976) are found. Further, the geometry of conditional and marginal inferences will be explored.







3.3.1 The General Result


Suppose X is a sample space, E) = x 2 is the parameter space, with ei c Rdi (i = 1, 2). Let 0 = (01, 02), 01(E -01) be the parameter of interest, and 02(E -02) be the nuisance parameter. Consider the function g : X x 0l -- Rl

where g satisfies the following conditions:

(I) E[gj0] = 0, for all 0 E '9;

(II) for almost all x, '99 exists, for all 0 E 0; o0i
(III) f g pdp is differentiable with respect to 01, and differentiation can be taken under the integral sign;

(IV) E[2 0] is invertible.

The functions which satisfy conditions (I) - (IV) are called regular estimating functions with respect to 01.

Let L denote the space of all regular unbiased estimating functions. For gl, g2 e L. define the family of generalized inner products of g1, 92 as

< gl,.q2 >o= E[gl(X,0)g2(X,0)tI], V 0 E 0. (3.3.2) Also we shall denote by s the score function of a parametric family of distributions with respect to 01. We assume also that the score vector is regular in the sense described in (I) to (IV).

Definition 3.3.1. Let (L, < .,. >o) be the family of generalized inner product spaces, and let L0 be a subspace of L. For any g E Lo, let


Ig(0) =- E[ I0]tE[ -0 1] (3.3.3) An element g* E Lo is said to be an optimal estimating function in L0 if


19.0) - 19(0)








is n. n. d., for all g E L0 and 0 E 0.

In the rest of this section, unless otherwise stated, we shall assume the following regularity condition for estimating functions, which basically involves the interchange of differentiation and expectation.

(R). For any g E L,
Og
E[- O1] = -E[g st10]. (3.3.4)


Thus combining (3.3.3) and (3.3.4), Ig(0) is the information matrix of g with respect to s.

We now state the main result of this section, which is an immediate consequence of Theorem 2.2.3.

Theorem 3.3.1. Let s = .1og 1(0) be an invertible matrix valued finction.
� 001 ' and Lo a subspace of L. If the orthogonal projection g* of A(O) s into Lo exists, then g* is optimal in L0.

As an easy consequence of Theorem 3.3.1, the following result generalizes the results due to Godambe (1976), Godambe and Thompson (1974) to the multiparameter case.

Corollary 3.3.1. Suppose there exists g: X x E) - Rdl such that

g* = M(O) s + g E Lo, (3.3.5)


and y is orthogonal to every element in L0, then g* is optimal in L0.

Proof. Since g is orthogonal to every element in L0, and g* E L0, g* is the orthogonal projection of A1(0) s into L0. The optimality of g* in L0 is immediate from Theorem 3.3.1.

Note that the above corollary generalizes the main result in Godambe and Thompson (1974) to the multiparameter case. It also provides a geometric explanation of equation (5) in Godambe and Thompson (1974). To see this, suppose there exist







matrices {M(O), Mi(O)}i=1 of appropriate dimensions such that M(O) is invertible and

g*= M(O) s + 1 09I(O) c Lo.


Then g* is the orthogonal projection of M(O) s into L0.

For d, = d2 = 1 and k = 2, the above result reduces to the main result in Godambe and Thompson (1974).

Next we use Theorem 3.3.1 to study the geometric ideas behind conditional and marginal inferences.

3.3.2 Geometry of Conditional Inferences




In this subsection, we study the geometry of conditional inference. As an easy consequence of this geometric approach, some of the results due to Bhapkar (1989, 1991) follow easily.

First use the identity

E[Ogl 0 i o o
001


where s log s may involves both 01 and 02. The information of g, is then


(01; 0) -< g1,Sol >lO< g1,g1 >01 < g1,So, >0

Let L denote the space of all estimating functions which satisfy conditions (I)

(IV) of Section 3.3.1. Let L0 be a subspace of L.

Following Bhapkar (1989, 1991a), suppose statistic (S, U) is jointly sufficient for the family {po : 0 E 0}, and furthermore, suppose U satisfies the following condition C:







(C): The conditional distribution of S, given u = U(X), dependes on 0 only through 01, for almost all u, that is S is sufficient for the nuisance )arameter 02.

Denote by h(s; 01 1u) the conditional pdtf of S = S(X). given i.

Definition 3.3.2. A statistic U = U(X) is said to be partially ancillary for 01 in the complete sense if

(i) U satisfies requirement C;

(ii) the family {pU : 02 E 02} of distributions of U for fixed 01 is complete for every 01 e 1.

A statistic U = U(X) is said to be partially ancillary for 01 in the weak sense if

(i) U satisfies requirement C;

(ii) the marginal distribution of U depends on 0 only through a parameteric function 6 = 6(0) (6 is assumed to be differentiable) such that (01, 6) is a one-to-one function of 0.

Letting p"j be the pdf of U and


1,(X;0) =0 log h(s, 011u) 001


the following theorem connects Theorem 3.3.1 and Bhapkar's (1989, 1991a) results. Take L = L0.

Theorem 3.3.2. If the statistic U = U(X) is partially ancillary for 01 in the complete sense or in the weak sense, then 1,(x; 01) is the orthogonal projection of so, into L.

Proof. We are going to prove this result in two cases.

Case 1. U is partially ancillary for 01 in the complete sense.

Note that
Sl0ogpU'
Sol = I(x; 01)+ -- -0 0







We only need to show that V gl c L, 0 E 0 01ogp '
< gl, 09 >0= 0.
(01

But
0 logp> E[ (O log pOU )t10] ,00 001


=E{E[gi (Ol gPo)tO, U]1} =E{E[.gl1O, U] (01log p0l)1101} 00,


Since E{E[g1j0, U]jO} = 0, by the completeness of {p' :02 E 62} for fixed 01 c 01, E[g1 0, U] = 0 almost everywhere for fixed 01. This implies that

0 log p
0=0, V0EO.


Thus l(x; 01) is the orthogonal projection of so, into L.

Case 2. U is partially ancillary for 01 in the weak sense.
Again note that

a log p)
Sol = l,(X; 01) +


Now since the map (01.02) --* (01,6(0)) is a one-to-one map, the matrix


Idi a6
0


is invertible. It implies that - is invertible. Note that


0logpO' 0 logpo (06 )_1 06 0 - 002 002 001







- 01O�p D(0)
002

where D(0) = - ' a-. Thus we only need to show that V g1 E L, 0 E 8,

0 log p'
< l, 00 P 0O 0.
002

But this follows easily by differentiating E[g,10] = 0

with respect to 02, and using the regularity conditions.

By combining Theorems 3.3.1 and 3.3.2, we get the following result due to Bhapkar (1989, 1991a).

Corollary 3.3.2. With the notation as above, l(x; 01) is optimal in L if either U is partially ancillary for 01 in the complete sense or in the weak sense, that is ig(O) < II(O), VO E 0,g E L. Note that from the proof of Theorem 3.3.2, the condition of partial ancillaritv for 01 in either the complete sense or in weak sense guarantees that the conditional score is the orthogonal projection of the score function with respect to 01 into L.

3.3.3 Geometry of Marginal Inference




In this subsection, we study the geometry of marginal inference. As an easy consequence of our approach, the optimality result of Bhapkar (1989) and Lloyd (1987) on marginal inference will follow easily. Assume that

(M): the distribution of statistic S = S(X) depends on 0 only through 01.

Definition 3.3.3. A statistic S = S(X) is said to be partially sufficient for 01 in the complete sense if







(i) S satisfies condition (1l),
(ii)~~~~ ~ ~~ gieg (Xtefm lyfus
(ii) given s - S(X), the family/P0 " 02 E 02} of the conditional distributions of U for fixed 01 is complete for almost all s, and for every 01 C 01.

A statistic S = S(X) is said to be partially sufficient for 01 in the weak sense if

(i) S satisfies condition (M);

(ii) the conditional distribution of U, given s = S(X), depends on 0 only through a parameteric function 6 = 6(0) (6 is assumed to be differentiable) such that (01, 6) is a one-to-one function of 0.

If S = S(X) is partially sufficient for 01 in the complete ( or weak) sense. Let poJS denote the conditional pdf of U given S = s, and


(X; 01) =0 log f(s; 01) 001


The main result of this subsection is given in the following theorem.

Theorem 3.3.3. If the statistic S = S(X) is partially sufficient for 01 in the complete sense or in the weak sense, then 1lm(x; 01) is the orthogonal projection of so, into L.

Proof. We are going to prove this result in two cases.

Case 1. S is partially sufficient for 01 in the complete sense.

Note that

0 log pui)
S01

We only need to show that for any g, E L, 0 E 0

0 log pI)
< 1, 0. >0= 0.


But

alogpU1 log__'
< 9, 0>=00 0 1







- E{E~1 (0ljgP(UIS) )t]O' S]jO} =E{E[gi ( a 0P 110 10


= E{E[g,10, S] (Ol9pl))t1O}.
001


Since E{E[giO, S]1O} = 0, by the completeness of {pUIs) 02 C e2} for fixed 01 E 01, E[gi[o, S] = 0 for fixed 01. This implies that


0 log pV 0
001

Thus I'(x; 01) is the orthogonal projection of so, into L.

Case 2. S is partially sufficient for 01 in the weak sense.

Again note that

0 log Po
so1 1m(x; 01) - 0Now since the map (01,02) --- (01,6(0)) is a one-to-one map, the matrix Idi .



is invertible. This implies that a is invertible. Hence


0logpUs) Olog P~uO) 06 1 06 00 - 002 02 001

1o (UIs)
- logp 0 N(O), 002

where N(-) ) 19 . Thus we only need to show that V g1 E L. 0 E O,


0 logp U!s)
< 91, 002 >0= 0.







But this follows easily by differentiating E[g,10] = 0


with respect to 02, and using the regularity conditions. Thus I,,(x, 0[) is the orthogonal projection of so, into L. This completes the proof.

By combining Theorem 3.3.1 and 3.3.3, we get I(0) < Im (0) for all 0 G O, g E L, which includes the results of Bhapkar(1989, 1991a) and Lloyd (1987) as a corollary.

Corollary 3.3.3.

(1) if S is partially ancillary for 01 in the complete sense, then lm(x; 01) is optimal in L;

(2) if S is partially ancillary for 01 in the weak sense, then lm(x; 01) is optimal in L.

Proof. In both cases, lm(x; 01) is the orthogonal projection of so, into L. So the inequality

(0) <_ i,(0), V 0 cE, g c L, follows from Theorem 3.3.1.

Note that (1) of the above corollary is the main result of Lloyd (1987) and (2) is due to Bhapkar (1989, 1991a).

3.4 Locally Optimal Estimating Functions


In this section, a general result about locally optimality of estimating functions in the presence of nuisance parameters will be proved. As easy consequences of this result, some results of Lindsay (1982), Godambe (1985), Murphy and Li (1995) will be studied.




52


3.4.1 A General Result


In this subsection, we study the local optimal estimating functions in the presnece of nuisance parameters. We first introduce the space of estimating functions of interest.

Let X be a sample space, E = E) x 62 the d, + d2 dimensional parameter space, with Oi C Rdi (Z = 1, 2). A function


g: X x 0 --4 Rd,


is said to be an unbiased estimating function if E[g(X,0)10] = 0, VO = (01,02) E &. An estimating function g is said to be regular if it satisfies conditions (I) - (IV) as in Section 3.3.1.

Let L denote the space of all regular unbiased estimating functions from X x E) to Rdl. Also the score vector so, is assumed to be regular in the sense described in

(i) and (ii).

Definition 3.4.1. Let (L, < .,. >o) be the family of generalized inner product spaces, and let L0 be a subspace of L. For any g E L0, the information function of g is defined by
I9(O). = [Og Ot 01E[ ay 10 (3.4.6)


An element g* E Lo is said to be a locally optimal estimating function at 02 = 020 if


Ig. (01, 02o) - 1g (01, 020)

is n. n. d., for all g E Lo and 01 E 01.

The following is the main result of this section.







Theorem 3.4.1. Let L be the space of all regular unbiased estimating functions


g : X x O ---+ .

Let Lo be subspace of L. If g* is the orthogonal projection of s into Lo with respect to the generalized inner products < .,. >0 with 02 = 020, then g* is locally optimal in Lo, that is for any fixed 020 E 02,

Ig* (01, 020) > Ig (01, 020),


for all 01 E 01. Also the local optimal estimating function in Lo is unique in the following sense: if g E Lo, then I (01, 020) = Ig(01, 020) for all 01 E E), if and only if there exists an invertible matrix valued function N : 01 x {20} --+ Mdl�xd such that for all 01 E 01,

g*(X; 01, 020) = N(01, 020) .g(X; 01,02o),


with probability 1 with respect to Pol,o0.

Proof. This follows easily from Theorem 2.2.4 and 3.3.1.

Next we apply Theorem 3.4.1 to generalize the results in three different cases:

(1) Lindsay's (1982) result on the local optimality of conditional score functions; (2) Godambe's (1985) result on the estimation in stochastic processes; (3) Murphy and Li's (1995) result on the projected partial likelihood.

3.4.2 Local Optimality of Conditional Score Functions




Suppose that (X1,..., X.) = X(,) is a sequence of possibly dependent observations with pdf


f(X(n);0) = fl(X(l);0)f2(X(2)IX(1);0) ... f.(X(.) IX(n-1);0). (


(3.4.7)







Let Ui = a log fj, and Sj (01) = Sj (X(j); 01) be minimal si ifficnt for 02 with 01 fixed in pdf fj, let
l47l = Uj - E[UjISj,X(j-1)], (3.4.8) for j c {1,...,n}. The sequence {Sj(O1)}j= is called sequentially complete if for each k from 1 to n, the system of equalities E[H(Sk, X(k- ); 01)] 0 (3.4.9)

for all 0 with 01 fixed implies that H(Sk, X(k- 1); 01) is a constant in Sk with probability one, that is, H(Sk, X(k-1); 01) does not depend on Sk. The following is a slight generalization of the main result in Lindsay (1982).
Theorem 3.4.2. Assume sequential completeness and E[UiIX(i- )] = 0 for all i - 1...., n. For fixed 020 E 02, consider unbiased estimating function

h: X x 01 x {020} ---+ R (3.4.10) Let Lo be the subspace, which consists of all unbiased estimating functions from X x 01 x {020} into Rd,. Then
(a) W(01,020) = E - 1W is the orthogonal projection of so, into Lo with respect to the generalized inner product < .,. 02o;
(b) W(01, 020) is optimal in Lo, and the optimal element in Lo is unique in the following sense: if g E Lo and I.(01,02o) = Iw(O1, 02o) for all 01 E E) , then there exists an invertible matrix valued function N : 01 X {020} - Md xd, such that for all 01 E E)
I' = N(01, 020) g,

with probability one with respect to Po1,o2o.
Proof. First note that, for any H C L0. consider the decomposition


H = H. + H,._I +... + HI + Ho,







where

H(X(n); 01) = H- E[HISn, X(n_-)], H_ l(S ,, X(.,); 01) = E[HIS,, X(,-I)] - E[HISn-I, X(n-2)],

Hk(Sk+l, X(k); 01) = E[HISk+l, X(k)] - E[H[Sk, X(k_1)],

for all k E {1,2,...,n}, and H0 = E[HIS1]. By the sequential completeness of {Sj(O1)}j, Hk(Sk+1,X(k); 01) does not depend on Sk+1, that is Hk(Sk+1, X(k); 01) Hk(X(k); 01).

(a) Since
j=1 1U=W� + 1 I E[UjIS3, i-, it suffices to show that

< E[Uj ISj, Xj- ], Hk(Sk+I, X(k); 01) >0= 01


for all j, k E {1, ... , n }. But this follows from an easy conditioning argument.
(b) This follows from part (a) and Theorem 3.4.1.

Note that part (a) of Theorem 3.4.2 is a restatement of the main result in Lindsay (1982).

3.4.3 Locally Optimal Estimating Functions for Stochastic Processes




In this subsection, we generalize the results of Godambe (1985) to the case where there are nuisance parameters. As a special case, we get generalized estimating equations in the presence of nuisance parameters.

Let {X1, X2,..., Xn} be a discrete stochastic process, Oj C Raj (j = 1,2) be open sets. Let hi be a Rd valued function of X1,... , Xi and 0, which satisfies for fixed 020 E 02,


(3.4.11)


Ei-[hi(X ,...,A ;O)j01,02o] = 0, ( 1 . , , 0 O 1~ ).







In the above, Ei-1 denotes the conditional expectation conditioning on the first i - 1 variables, namely, X1,..., Xi-1. Let L0 - {g = Ai-1 hi


where Ai- 1 is a Akxk valued function of X1,. . . , Xi- and 01, for all i E {1,... n}.

The following theorem, which generalizes the result of Godambe (1985) on optimal estimating functions for stochastic processes.

Theorem 3.4.3. Let 02 = 020, and suppose hi satisfies the regularity condition

(R). Let
,Ohi
A* = Ei-,[--h 101, 02o]' Ei-1[hi h'101,02- i {,..,}



and
g* Enl A* hi,

then the following conclusions hold:

(a). g* is the orthogonal projection of so, into L0 with respect to the generalized inner product < .,. >0,02o.

(b). g* is a locally optimal estimating function in Lo, i. e., Ig(01, 020) g. (01, 020),


for all g E Lo and 01 E (1.

(c). If g e Lo and E[g gt10] is invertible, then Ig(01,020) = Ig,(01,02o), V01 E E)1 if and only if there exists an invertible matrix function N : 81 X {020} ----> Mkxk such that for any 01 C O)1,

*.(X1,...,Xn;01,02o) = N(01,020) g(X1 ,...,X,;01,02o),


with probability 1 with respect to Po,,020.







Proof. (a). For any g = En=Ai hi c Lo, 01 E 01,

< 801 - g*, g >(01,02o) < o1, >(01,020) - < g* 1g >(0,020)

hiE AsolhtAtI0i,02o)] - ZI= 1ZjE[AihihtAj(01,02o)]
Z1=1E{E,_I[sohtAt1(01,02o)]1(01,020)} - E[A~ hihtA'I(0,


1>E[AihihyAi(01, 020)]. (3.4.12) But for i < j,

E[AhihjAl(0,020)] = E{j _[A7 1hhAj(01,02o)]l(01,02o)} E E{A*hi Ej I[htAt(0,,02o)](01,02o)} 0. Similarly, for i > j,
E[A~hithAy1(01,020)] =-.

Thus from equation (3.4.12), we get


< 80 - gg >01,020 i=1 01, 020] Atj, 02o En E{A*Ei-,[hi h4OI, 020] A'101, 020} = 0. Hence g* is the orthogonal projection of s into Lo.
Parts (b) and (c) of the theorem follows easily from part (a) and Theorem 3.4.1.
As a corollary to Theorem 3.4.3, the generalized estimating equations in the presence of nuisance parameters for multivariate data can be easily obtained.

Corollary 3.4.1. Suppose that X,... ,X, are independent, for fixed 020 E 02, for each i {1, 2,...,n},

hi : A x 6 -4 Rdl,

with E[hi(Xi,0)I01, 02o] = 0. Consider the subspace L0 as that in Theorem 3.4.3. Then the generalized estimating equations determined by {hi} is given by


n Ei=lAi hi = 0,








where


A* = Ei_1[a-i 101, 0201 Ei-,[hi h"101, 020]-' V i 1 {1,2,...,n.


The above corollary provides a very convenient way to construct generalized estimating equations. For instance, if hi is chosen as linear (or quadratic) function of Xi, then the corresponding generalized estimating equations reduce to the GEE1 and GEE2 studied by Liang, Zeger and their associates. One may refer to see Liang and Zeger (1986), Liang, Zeger and Qaqish (1992), Diggle, Liang and Zeger (1994).


3.4.4 Local Optimality of Projected Partial Likelihood




In this subsection, we generalize the result of Murphy and Li (1995) on projected partial likelihood to the nuisance parameter case. Also the application of this result to longitudinal data will be pointed out.

Suppose that the data consist of a vector of observations X with density f(x; 01, 02), 01 is the vector of parameters of interest, which is finite dimensional, and 02 is the vector of nuisance parameters, which may be infinite dimensional. Suppose there is a one-to-one transformation of the data X into a set of variables Y1, C1,... , Y"', C"'. Let
Y ) = ( 1 j , C U ) = ( 1 - - C j = 1, . . . ,M . (3 .4 .1 3 )


For instance, in survival analysis, Y1, ... , Y, denote the lifetime variables, and C1, . . ., C the censoring variables.

Note that the joint density of y(m), C(m) can be written as

m m
IH f(cjc1-),yU-);01,2) IH f(Yic(j),y(J1);o,,02), (3.4.14)
j=l j=l







where r(�) and y(O) are arbitrary constants, and are used only for notational purposes. Then P(01) = lj7 f(yjlc(j), yU -1); 01, 02) is called the Cox partial likelihood.
Let
O log f(cj5c(j'), y(J-); 01, 02)
SO! = j1 001 , *


where
0 alog f(yj ICU), y( -1); 01, 02) (3.4.15) S = l (.4.1001

Next we introduce the subspace of unbiased estimating functions which is of interest in this subsection. This is similar to the space considered by Godambe (1985) in studying the foundation of finite sample estimation in stochastic processes.
For any j E {1, 2,... , rn}, consider estimating functions

hj : (j) x 0)x 01 -4 R, (3.4.16) E[hjIy 0-l), c(j), O] = 0, (3.4.17) for all 0 e E1 x 02.
For chosen {hi}i1, consider the space

Lo = {g g = Ej'IAj(O) hi}, (3.4.18)

where for all j E {1,..., m}, Aj(O) is a di x di matrix, and hj satisfies (3.4.16) and (3.4.17).
Let
=a Olog P = EMalogf(yj Ic(J)'Y(J-1) ;Ol1,02)) (3.4.19)


The main result of this subsection is the following.
Theorem 3.4.4. For fixed 02 = 020, for any i E {1,... , m}, let

-Ohi 02 1) ) E 0 A*; =- E[ a 1 101, 020, y(i-1)C(i)]tE[hiht 1101,0 , y -),c~i]1







and
g* = m

Then
(a) g* is the orthogonal projection of s* into L0, that is, for any g E Lo,

< S* - g*, g >01,020= 0,


for all 0 E O1;
(b) g* is locally optimal in L0 at 02 = 020:
(c) If g E L0 and E[g gt10] is invertible, then I_(01,020) ( 4101,020), V 01 E E1 if and only if there exists an invertible matrix function N : 61 X {020} -4 Mkxk such that for any 01 E 01, g,(X;0,02o) = N(01,020) g(X;01,02o), with probability 1 with respect to Po0,,O2.
Proof. For any j c {1,. . . , m}, let

Oj = lo10 f (Cj I C(j -l), Y (j - 1); 0 1, 02)


then s - s* Sj.

(a) For any g E Lo, let gj = Ay(O) hjj 1,..., m be the components in the definition of L0. In order to show that E[(s - s*) g1l0] = 0, it suffices to prove that E[sj g',0] = 0,

for all j, j' E {1, .. . , m}. Consider the following three cases:
Case 1. j > j'. Then

E[sj gj, 0] = E{E[sjc(J-1), y(J-1)] gl,0} -=- 0,


since E[sj Ic(j-1), y(- 1)] = 0.







Case 2. j = j'. Then

E[sj g'10] = E{sj E[gqjy(J-1),c(J)]t o}=O, since E[gjIy - 1), c(J)]t = 0.

Case 3. j' > j. Then

E[sj gjt,10] = E{s E[gyjy I C 1) 101 -0, since E[gyj y('-l), CI'-1)] = 0.

Part (b) and (c) follows from Theorem 3.4.1 and part (a).

Note that Murphy and Li (1995) studied projected partial likelihood in the case d, = 1 , the absence of nuisance parameters and when all the C('s are empty. Similar to Murphy and Li's comments, because of the nested structure of {Y(J), C(j)}, removing drop-out factors in this way will not cause bias in the resulting partial score function, as long as the subjects' drop-out depends only the past. This is in contrast to a generalized estimating equation, which is biased under random drop-out.

3.5 Optimal Conditional Estimating Functions


In this section, we study optimal conditional estimating functions. Let X be a sample space, e = 1 02, Oi c Rdi, (i = 1, 2), and 01 is the parameter of interest. For fixed 01, assume that S(01) is sufficient for 02. A function g : X x 01 -+ R,

is called a regular conditional unbiased estimating function if
(1) E[gIS(01)] = 0, for all 01 E - 1c

(2) E[g gtS(O1)] is positive definite.

Consider the space L of all regular conditional unbiased estimating functions, a family of generalized inner products on L is defined as follows: for any gl, 92 E L,


(3.5.20)


< 91,,9 >g o E[g ,1( )]







For any g E L, the conditional information of g is defined as follows:


I9(01gS(01)) E[ gIS(01)]t < gg >S(S01) E[ (O)] (3.5.21)


Definition 3.5.1. Let L0 be a subspace of L, a function g* E L0 is called an optimal conditional estimating function in Lo if Ig(O, Is(o,)) >_ Ig(O, IS(01)),


for all g C Lo.

The main result in this section is given in the following theorem.
Theorem 3.5.1. Define

s= log f(XIS(01)). (3.5.22) 00


Suppose L0 is a subspace of L, and assume that the orthogonal projection g* of s into Lo exists. Then

(a) g* is an optimal conditional estimating function in Lo;

(b) the optimal element in L0 is unique in the sense that if g E L0, then 1,* (01 1S(01)) Ig(011 S(01)), for all 01 E E1 if and only if there exists an invertible matrix valued function N: X x 01 -4 Mkxk of the form N(X, 01) = N(S(01)) such that g* (X;O1) - N(S(01)) g(X; 01). Proof. (a) Since E[glS(01)] = 0, under the regularity condition given in (2.5), E[-(gIS(01)] = -E[g stIS(01)]. Thus, from the definition of I,(01 S(01)),


zg(o0ls(01)) =< g,s >S(0,)< g,g >S(0,)< gS >S(oi)







Since g* is the orthogonal projection of s into L0 with respect to < .,. >s(o1), thus the optimality of g* in L0 follows from Theorem 2.2.3.

(b) This follows from Theorem 2.2.4.

Next as applications of Theorem 3.5.1, we generalize the results of Godambe and Thompson (1989) on optimal estimating functions into the conditional estimating functions framework.

To this end, let X denote the sample space, 01 = (011, Old,) be a vector of parameters, hj, j = 1,... , k be real functions on X x 0 1 such that

E[hj (X, 0) S(Ol), X] = 0, V 0 z E), j = 1.,k,


where Xj be a specified partition of X, j - 1,..., k. We will denote E[.IS(01), Xj] - E(j)S.IS(o)]. Consider the class of estimating functions

L0 = {g : g = (gl,..,.d,)}


where
gr = j=lqjr]tj " 1 , r, qjr : X x 01 - R being measurable with respect to the partition Xj for j = I,-., k, r= I,-.,adi.

Let
*Ewj["' IS(61)]
O (3.5.23)
qjr -E(j)[h2JS(01)],


for all j1,...,k,r =1,...,l, and yg. = 1lqjr h, r = 1,..., dl.


The estimating functions hij = 1,..., k are said to be mutually orthogonal if


Vj 0 ( 3,r, r/ = I-. di.


E(j) [qj; hj q*, , I S (01 ) ] = 0,


(3.5.24)







Theorem 3.5.2. Suppose {hj}k= are mutually orthogonal. Then the following hold:

(a) g* is the orthogonal projection of the score function s into L0.

(b) g* is an optimal estimating function in L0.

(c). If g E L0, and E[g gt'1] is invertible, then Ig(O) = IgJ(0), V 0 E 0 if and only if there exists an invertible matrix function N : X x 01 + AMkxk of the form N(X, 01) = N(S(01)) such that for any 01 G 61, .g*(X; 01) = N(S(O,)) g(X; 01),


with probability 1 with respect to PO.

Proof. (1). We only need to show that, V r E {1,... ,dil},gr j iqjrhj, < S - grgr >S(01)= O, V 01 C I1 that is

< S,gr >S(01)=< .r4,gr >S(01), V 01 E 01. But

S9r >S(O)= Ek=lj,=lE[qjhjqj'rhj'S(1)] <- gr=1 gr *-11Eqj ii i (0
= >j,=iE{qjrlhjqj'rE(j)[qjrhjqjlrhjIS(01)]IS(O1)} Z= iE{qjrqirE(j)[h'S(01)] S(01)}


= E " [Ohj
= E {qjrE(j)[-3r S(0i)]S(O1)}"

Also
< s, gr >S(O,)= Yk~lEfqjr .S hjS(01)] = 1E{qjE(j)[shjJS(01)] S(01)}


= =,Eq OhE(j)[ajS(0l)]IS(01)}.




65


Thus g* is the orthogonal projection of the score function into L0.

Once again (b) and (c) follow from part (a) and Theorem 2.2.4.














CHAPTER. 4

CONVEXITY AND ITS APPLICATIONS TO STATISTICS



4.1 Introduction


In this chapter, we first prove some general results about convexity, and then apply the results to various statistical problems, which include the theory of optimum experimental designs, the fundamental theorem of mixture distributions due to Lindsay (1983a), and the asymptotic ininimaxity of robust estimation due to Huber. Huber (1964) proved an asymptotic minimaxity result for estimating functions about the location parameter. In this chapter, this fundamental result will be generalized to general estimating functions. The geometric optimality of estimating functions proved in Chapter 2 will be used to prove a necessary and sufficient condition for the asymptotic minimaxity of estimating functions in multi-dimensional parameter spaces.

The contents of this chapter are organized as follows: in Section 4.2, a few simple results about matrix valued convex functions will be proved. Also we include some of the well known results in convex analysis, such as the Krein-Milman theorem about extreme points of convex sets, and the Caratheodory theorem about the representation of elements of a convex set in a finite dimensional vector space. In Section 4.3, the results of Section 4.2 are applied to the theory of optimum experimental designs. The fundamental result on optimal design theory is generalized to the matrix valued case. In Section 4.4, the results of Section 4.2 are applied to the mixture distribution







situation; the fundamental result about mixture distribution due to Lindsay (1983a) is an easy consequence. In Section 4.5, the results of Section 4.2 and Chapter 2 will be used to generalize the classical asymptotic minimaxity result of Huber (1964) in the estimating function framework.

4.2 Some Simple Results About Convexity


Let L be a linear space. A subset C of L is said to be convex if for every x, y E C, A E [0,11,

Ax + (1 - A)y E C.

A function f : C -+ R is said to be convex if for any x, y E C, A c [0,11, f(Ax + (1 - A)y) < Af(x) + (1 - A)f(y). A symmetric matrix-valued function N : C -+ Mkk (i. e., for any x E C, N(x) is a symmetric k x k matrix) is said to be convex, if for any x, y E C, A [0, 1], N(Ax + (1 - A)y) < AN(x) + (1 - A)N(y), where for two k x k matrices A, B, A < B means that B - A is nonnegative (n. n. d.). In the following, we only study properties of matrix valued convex functions, since for k = 1, they are reduced to the real valued case.

For every x, y E C, consider the function on [0, 1] as follows N(A; x, y)= N((1 - A)x + Ay), then N(A; x, y) is a convex function on A. The directional derivative of N at x in the direction of y is defined as


FN(x; y) = lim N(A; x, y) - N(O; x, y) (4.2.1) A--+0+ A







The existence of the limit is justified as follows: Since A, = (I - A)o + gA A, A2 A2


for 0 < A1

N(Aj; x, y) < (1 - 7)N(0; x, y) +A'1(2 ,Y A2 A2


This implies


N(Ai; x, y) - N(0; x, y) < N(A2; X, y) N(0; x, y)


(4.2.2)


that is N(A;x,y)-N(O;x,y) is a nonincreasing function of A in (0, 1]. Hence the limit in
A
(4.2.1) is well defined.

From (4.2.1) and (4.2.2), for a convex function N,


FN(x; y) < N(1; x, y) - N(O; x, y) = N(y) - N(x), for all x,y c C.


(4.2.3)


The following result will be used repeatedly in the sequel.

Theorem 4.2.1. Suppose that N is convex, then for x0 C C satisfies that N(xo) < N(y) for all y E C if and only if


FN(XO; y) > 0,


for all y E C.

Proof. Suppose that N(xo) < N(y) for all non-negative definite. Hence


FN (xo;y) = lim N(A;xo, y) A-4+


(4.2.4)


y (z C. Then N(A;xo,y)-AN(O;xo,y) is
A


- N(0;xo, y)


is n. n. d., for all y E C.

Conversely, if FA,(xo; y) > 0 for all y E C, then from (4.2.3),


N(y)- N(,:o) > FN(.ro; y) > 0.







Thus N(xo) < N(y) for all y E C.

Next let L be a locally convex vector space, and let N be a symmetric matrix valued function; then N is said to be Gateaux differentiable at x, if there exists a continuous linear operator A : L --+ Alkxk such that FN(x; y) = A(y - x), for all y E C. (4.2.5) Before stating the next result, let us recall one of the well known results from functional analysis.

Theorem 4.2.2 (Krein-Milman). Let L be a locally convex vector space, and let C be a convex compact subset of L. Then

C =-F(OflL4VCxt(C)],

where ext(C) denotes the set of extreme points of C, and con-v(A) denotes the closed convex hull of A, it is the smallest closed convex set containing A.

Now equipped with Gateaux differentiability and Krein-Milman theorem, we are in the position to prove the following result.

Theorem 4.2.3. Let L be a locally convex vector space, and let C be a convex compact subset of L. If N is convex Gateaux differentiable at x0, then x0 E C satisfies N(xo) < N(y) for all y C C if and only if FN(XO; y) > 0, (4.2.6) for all y E ext(C).

Proof. Since FN(xo; y) = A(y - ro) for some continuous linear opeator A for all y C C, from the definition of Gateaux differentiability, FN(xo; y) > 0, for all y E C is equivalent to FN(xo; y) > 0, for all y E ext(C). Thus Theorem 4.2.3 follows from Theorem 4.2.1.

Next the famous theorem of Caratheodory about the representation of elements of convex set in a finite dimensional vector space is presented. The present proof, taken directly from Silvey (1980), is included for the sake of completeness.







Theorem 4.2.4 (Caratheodory). Let S be a subset of R'. Then every element c in cony(S) can be expressed as a convex combination of at most n + 1 elements of S. If c is in the boundary of conv(S), n + 1 can be replaced by n.

Proof. Let

S' {(1,X): X C S}

be a subset of R'L+I, let K be the convex cone generated by S'. Let y E K, then y can be written as

y = Alyi +... + Ay.,

where each Ai > 0 and each yi E S'. Suppose that the y, are not linearly independent. Then there exists pl, ... , i,, not all zeroes such that [ly, +...+ImYm =O.


Since the first component of each yi is 1, so IL1 + ... + 1m 0. Hence, at least one pi is positive. Let A be the largest number such that Api < A, i = 1,..., m: A is finite since at least one pi is positive. Now let A = Ai - Api, then


y = Aly + ... + Ar,,y,,


and at least one A' = 0. Thus y can be expressed as a positive linear combination of fewer than m elements of S'. This argument can continue, until y has been expressed as a positive linear combination of at most n + 1 elements of S', since more than n + 1 elements are linearly dependent. Now the first part of the theorem follows by applying the above result to (1, c) E S'.

Next suppose that y C K and


y = Aly +... + An+ly+,


where each Ai > 0 and the yi are linearly independent. Then y is an interior point of K. Thus any boundary point of K can be expressed as a positive linear combination







of at most n linearly independent elements of S'. So the second part of the theorem follows.

Proposition 4.2.1. If C is a compact subset of a locally convex vector space, then conv(C) is compact.

Theorem 4.2.5. (a) With the same notation as Theorem 4.2.3, and the Gateaux differentiablity of N on C. The following are equivalent:

(i) xo minimizes N(x);

(ii) xo maximizes infyc atFN(x; y)a, for any k x 1 real vector a;

(iii) infYc atFN(xo; y)a = 0, for any k x 1 real vector a.

(b) If xo minimizes N(x), then (xo, xo) is a saddle point of FN, that is, FN(Xo; yl) > 0= FN(xo; Xo) > FN(y2; Xo), for all Yi, Y2 E C.

(c) If xo minimizes N(x), then the support of xo is contained in {y: FN(�o; y) = 0}. More precisely,

{y, EC, xo= EAjyj, Ai >0, EiAj=1} C {y :FN(xo;y)=0}.

Proof. (a) First note that from Gateaux differentiablity of N, for any real k x 1 vector a, and x E C,

inf atFN(x;y)a = inf atFN(x;y)a, yEext(C) yeC

and

irif atFN(x; y)a < atFN(x; x)a = 0. yEC

((i) (iii)). -Note that x0 minimizes N(x), if and only if for any real k x 1 vector a, and y E C, atFN(Xo;y)a > 0. The last inequality holds if and only if infysc atFN(xo; y)a > 0, for any k x 1 real vector a. This, in turn, is equivalent to infyEext(c) atFN(xo; y)a = 0, for every k x 1 real vector a.







((ii) .' (iii)). Note that x0 maximizes infyEc a'FN(x; y)a, for every k x 1 real vector a, if and only if infyEc atFN(xo; y)a > 0, for any k x 1 real vector a. This is equivalent to infy~ext(c) atFN(xo; y)a = 0, for every k x 1 real vector a.

(b) This follows from Theorem 4.2.1 and the definition of FN.

(C) If x0 = EiAY, Ai > 0, EjA = 1, since N is Gateaux differentiable,

0 = FN(xo; x0) = FN(xo; EZ Aiyi)

= E iAFN(.0ro; yi).

Since FN(xo; y) > 0 for all y E C, FI(xo; yi) = 0 for all yi.

4.3 Theory of Optimum Experimental Designs


In this section, the results of the previous section are applied to fixed optimal experimental designs. First, we formulate the problem.

Let f = (fl,. . . , f.n) denote m linearly independent continuous functions on a compact set X, and let 0 = (01, ... , 0,m) denote a vector of parameters. For each x E X, an experiment is performed. The outcome is a random variable y(x) with mean value f(x)tO = E Ifi(x) O, and a variance a2, independent of x. The functions fl, ... , fin, called the regression functions, are assumed to be known, while 0 = (01,..., 0,) and a are unknown. An experimental design is a probability measure p defined on a fixed a-algebra of subsets of X, which include the one point subsets. In practice, the experimenter is allowed N uncorrelated observations and the number of observations that he (or she) takes at each x E X is proportional to the measure p. For a given /, let

M(y) = ((7nij(p)))'=l, mij (p) = ffi(x)fj(x)dp(x). (4.3.7) The matrix M(p) is called the information matrix of the design p.

Let 7 denote the set of all probability measures on A' with the fixed a-algebra, and A {M(p) : p E 71}, : A -4 MkXk be a symmetric matrix-valued function. The







problem of interest is to determine p,,, which maximizes O(M(p)) over all probability measures. Any such p will be called O-optimal.
Proposition 4.3.1.

M =conv({f(x)f(x)t' x X}). Proof. Since M is a convex set, and {f(x)f(x)t" x E X} C A, so conv({f(x)f(x)t' x X}) C M. Next since X is compact, and f is continuous, thus {f(x)f(x)t x E X} C M is compact. Hence

conv({f(x)f(x)t" x c X}) = conv( {f(x)f(x)t" x E X}). Also since M- C co-n({f(x)f(x)t'x G X}), hence MA = conv({f(x)f(x)t : x E X}). From the above proposition and Caratheordory's theorem, the following is true.
Corollary 4.3.1. For any M(p) E A, there exists xi E X,i 1,... ,I,I < rn+1) + 1, such that
2
M(it) = E'= 1Ai f(xi).f(xi)', where Ai > 0, Ef 1Ai = 1. If M(pi) is a boundary point of M, the inequality involving I can be reduced to I< m(m+)
- 2
From the practical point of view, this corollary is extremely important. For it means that if 0 is maximal at A,, then AJ, can always be expressed as AI(t,). where p. is a discrete design measure supported by at most n(1 + 1 points.
~2







Now we are in the position to prove the fundamental theorem in optimum design theory, which is a generalization of the result in Silvey (1980) to the matrix valued case.

Theorem 4.3.1. (A) If 0 is a concave function on A, then AI(IL,) is O-optimal if and only if
FO(M(p,), M(p)) < 0,

for all p E H;

(B) If 0 is a concave function on M, which is Gateaux differentiable on M4, then M(p) is O-optimal if and only if FO(M (p.), f(x) f(x)t) < 0,


for all x G X;

(C) If � is Gateaux differentiable at M(p,), and M(pi) is � optimal, then

{xi E X: M(pi) = EiAXf(xi)f(xi)t, Ai > 0, E, A1 = 1} C {x E X: FO(M(p.),f(x).f(x)t) =0}. Proof. They are easy consequences of Theorem 4.2.1 - 4.2.3.

Next we apply Theorem 4.3.1 to study the relationship between D and G optimal designs.

The D-optimality criterion is defined by the criterion function

O[M(p)] = logdet[MI(p)], if det[M(u)] 54 0

= -00, if det[M(y)] = 0. (4.3.8) /t C 7- is said to be D-optimal if y. maximizes 0. Let M denote the set of all positive definite matrices, then � has the following properties:

(a) � is continuous on M;

(b) 4 is concave on M;







(c) 0 is Gateaux differentiable at M1 if it is nonsingular, and FO(M1, M2) = tr(M2MT1) - k.

Proof. (a) The continuity of 0 follows from the continuity of det.

(b) We want to show that, for every A E (0, 1), and Al,, A12 C ",

0[(1 - A)M1 + AM2] > (1 - A)�(M1) + A�(M2).

This inequality is obvious if either M1 or M12 is singular. Thus we only need to prove the inequality if both I1 and M2 are nonsingular. From a standard result from matrix algebra, there is a nonsingular matrix U such that


UM1Ut = I,


UA2U" = A = diag(A1, ... , Ak).


Using the concavity of log,

0[(1 - A)MI + AM2] = log det{U- [(1 - A)I + AA]U-"i > log detU-2 + Ek 1AlogA,

- (1 - A) log detU -2 + A log det(U-1AU- 't) = (1 - A)�(MI) + A�(M2).

(c) For nonsingular matrix M1, we have

�(MI + EM2) - O(M1) = log det(I + IM2M l)

- log{1 + tr(M2M1 I)} + Q((2)

- tr(M2M 1) + 0(,2).


Thus


FO(AI1, M2) = tr[(112 - AII)M11] = tr-(A21I-l) - (.


(4.3.9)







The G-optimality criterion is defined by the criterion function


�[M (/ ,)] max ft'(x)M - (p ) f(x), xEX

o op,

A design p., is said to be G-optimal if


if detM(u) $ 0, if detM(tu) = 0.


O[M(p,)] < [M(1i)], for all t E W. Example. The equivalence of D and G optimal designs.

In this example, we derive the famous equivalence theorem due to Kiefer and Wolfowitz (1960) about the D and G optimal designs.

Theorem 4.3.2 (Kiefer and Wolfowitz). If IL, E RN satisfies the condition that M(p,) is nonsingular, then p, is D-optimal if and only if p. is G-optimal.

Proof. (=r=). From Theorem 4.3.1 and (4.3.9), p, is D-optimal if and only if


for all x E X,


that is


max ft(x)M -'(p)f(x) < k. XEXA


On the other hand, for any p E - such that M(p) is nonsingular,

mxf t(x) M- 1(p) f(x) > f ft(x)M-l (p)f (x)dp(x)



=tr[I- 1(1)/f f(x)ft(x)dpt(x)] tr[M-l(1)M(p,)] = k.


Hence,


k = max t(x)M-l(1,)f(x) < max ft(x)M-l(I)f(x)
xEA EX 'G


for any p E R- such that M(p) is nonsingular, therefore, p, is G-optimal.


(4.3.10)


tr[f(x)ft(x)M(jt,)-1] <_ k,







(-=). Now suppose that p, is G-optimal, then from the definition, M(uI) is nonsingular. Let ,. be any D-optimal design. Then

k
1 < det[M(P,)M'(PI)] Ai,
i=1


where A,,..., Ak are the eigenvalues of the matrix M- /2(pi)AI(i,)AI 1/2 () Hence k 1 k 1
(Ai) /k '= jE=Ai = k tr [M 0*) M -,(I' )]I



1 f t(x)M- (pi)f(x)dlL*(x)

1

< -max ft(x)M- (ii)f(x) = 1.
k xX

Therefore,

detM(pl) = detM(p,),

hence p1 is D-optimal.

4.4 Fundamental Theorem of Mixture Distributions


In this section, we apply the results of Section 2 to the mixture distribution problem. The fundamental result is due to Lindsay (1982, 1995). We begin with the formulation of the problem.

Let {fo : 0 E -} be a parametric family of densities with respect to some afinite measure, let the parameter space O have a a-algebra of measurable sets which contains all atomic sets {0}. Let W- be the class of all probability measures on . Define the function


fQ(X) Jefo(x)dQ(O), Q E W, (4.4.11)







to be the mixture density corresponding to mixing distribution Q. Since the densities {fo} correspond to the atomic mixing distribution {5(0)}, which assign probability one to any set containing 0, they are called the atomic densities. A finite discrete mixing distribution with support size J will be expressed as Q = Ejrj6(Oj), and the Oj's are distinct, 7rj > 0, Ejrj = 1.
Given a random sample X1,..., X, from the mixture density fQ, the objective will be to estimate the mixing distribution Q by Q,, a maximizer of the likelihood L(Q) = 1rI. IfQ(xi).


Now suppose that the observation vector (x1, Xn) has K distinct data points Y1,..., YK, and let nk be the number of x's which equals to Yk. Define the atomic and mixture likelihood to be fo = (fo(Yl),...,fO(YK)), and fQ (fQ(YO),...,fQ(YK)), respectively. The likelihood curve is the function from 0 to R defined by 0 --+ fo. The orbit of this curve, given by F = {fo : 0 C 1}, represents all possible fitted values of the atomic likelihood vector. Then cony(F) {fQ : Q E 7, Isupport(Q)I < oc}, where JAI denotes the cardinality of A. Furthermore, if E is compact and fo is a continuous function of 0, then conv() = {fQ : Q E R(}. In this case, maximizing L(Q) over Q E 7 may be accomplished by maximizing the concave functional 0(f) Eknklogfk over f in the K-dimensional set conv(F). Note that 0(f) is a strict concave function of f.
Now we are in the position to state the fundamental result about mixture distributions.
Theorem 4.4.1 (Lindsay). Suppose that 0 is compact, and fo is continuous.

(A) There exists a unique vector f on the boundary of conv(F) which maximizes the log likelihood 0(f) on conv(F). f can be expressed as .fQ, where Q has K or fewer points of support.







(B) The measure Q which maximizes log L(Q) can be equivalently characterized by three conditions:

(i) Q maximizes L(Q);

(ii) Q minimizes supo D(6; Q);

(iii) supo D(O; Q) 0

(C) The point (f,f) is a saddle point of 4, in the sense that D(fQoI f)- 0 = - D(f, J) < ' (f,.fQl)


for all Q0, Q, E H.

(D). The support of Q is contained in the set of 0 for which D(O, Q) 0.

Proof. The results are easy consequences of Caratheodory's theorem and Theorem 4.2.5.

4.5 Asymptotic Minimaxity of Estimating Functions


In this section, the famous asymptotic minimaxity result due to Huber (1964) will be generalized. First we formulate the problem of interest.

Let E be an open subset of Rk, X be the sample space, a function g: X x 0 -4 Rk, is called an unbiased estimating function if E[g(X; 0) 10] = 0,


for all 0 E 0. An unbiased estimating function is called regular if EOg10
O],


is nonsingular for all 0 E 0.







In the rest of this section, the regularity conditions (I) - (IV) of Section 3.3.1 for estimating functions will always be assumed.

Let C be a convex set of distribution functions such that every F E C has an absolutely continuous density f satisfying I(F) = E[( O----0--)(--9lo f)'IF, 0], (4.5.12)


is positive definite. Let L be the space of unbiased estimating functions with respect to C, that is every element of L is unbiased with respect to every distribution in C. Let 40 be the subset of L which consists of all regular unbiased estimating functions in L.

Consider the function K : -o x C ---4 Mkxk , defined by


K(O, F) = E[O 9" F, 0]'(E [E 'F, 0])- EO IF, 0]. (4.5.13)


for all 0 E (DO, F c C. Note that when k = 1, then K(, F)- =(fx Of'dx)2
fX 02fdx


For every F E C, for any g1, g E L, the inner product of g, and g2 is defined by < 1,92 >F= E[gig'IF].


For every F E C, the orthogonal projection of the score function of F into the subspace L0, with respect to the inner product < .,. >F (if it exists), is denoted by OF

Lemma 4.5.1. (a) For any (u, v) E R x R+, the function defined by


1 (uv) -,







is convex, that is for any (ui, vi) E R x R+, i = 1, 2, A E (0, 1) (Aul + (1 - A)u2)2 1
- Av, + (1- A)V2 vI v2

(b) For any (Ml, M2) E Mkxk X MIVxk, where Mk+x denotes the set of all k x k positive definite matrices, the matrix valued function defined by


J(M1, M2)


is convex in the sense that, for any (M1, M2), (M, M4) C M'Ikxk X Mkj�k, A C (0,1),


J(A) = [AMi + (1 - A)M3]t [AM2 + (1


is convex in A.

Proof. (a)


Oh 2u 19u V


02h 02U


Q2 h c9uOv


Oh Ov


2u V2


The matrix


2/v -2u/v2] 2u /V2 2u 2/v3 J


is non negative definite, so h is convex.


(b) By straightforward calculation, and using repeatedly the relation


dM
dA


dM I


one gets,


dJ(A)
dA


M3)t[AM2 + (1 - A)M4]-'[AM + (1


-[A.I +(1-A)M3]t[AM2+(1-A)M4] -' ([M2


A)M4]- [AAI1 + (I - A)M.],


02h 2u2 027) V3


A)M3]


MA M-1 M1,


-M4) [A 12 +(1


A) M4]- l [AM1+ (1-A)M /3]







+[AM, + (1 - A)M3]t[AM2 + (1 - A)M4]-'(M, - Nl3), (4.5.14) and
d2J(A) = 2{(M1- M3)t[AM2 + (1 - A)M4]-'(Mi - M3)
d2A

[AM, + (1 - A)Ml3]'[AM2 + (1 - A)M-4]-1(A,12 M,)[AM2 + (1 - A)M4]-'

(112 - 14)[AM12 + (1 - A)314]-'[AM, + (1 - A)M3]

-(Mi - Al3)t[AM2 + (1 - A)M4]-'(M2 - M4)[AM2 + (1 - A)M4]-'[AM, + (1 - A)M3] [AM1 + (1 - A)M3]t[AM2 + (1 - A)M4]- '(A/12 - M4)[AM2 + (1 - A)M4]- '(Mi - M3)} = 2(AAt + BtB - AB - BtAt)

- 2(A - Bt)(A - Bt)t > 0, (4.5.15) where
A = (Mi - M3)t[AM2 + (1 - A)M4]-1/2

B = [AMA/12 + (1 - A)M4]-1/2(M2 - M4)[AM2 + (1 - A)M4]-'[AM, + (1 - A)M3]. This completes the proof of the Lemma.
Note that part (a) of Lemma 4.5.1 was proved by Huber (1964) by using a different argument. Also from part (b), dJ(O) - (M1 3)tM4jM3- M M41(M2 - M4)M4'M3 + MM4 '(M - M3). dA
(4.5.16)

We will use this identity is Section 6.2.

4.5.1 One Dimensional Case


In this subsection, a necessary and sufficient condition of the asymptotic minimaxity of estimating functions will be given when the parameter space is one-dimensional. This result generalizes Theorem 2 of Huber (1964).







Theorem 4.5.1. Suppose the parameter space is one dimensional. Then (OF, Fo) is a saddle point of K, that is K(0, Fo) < K(OF,, Fo) -- K(OFo, F), for all 0 E D, and F E C, if and only if (20,(f' - f ) - (OFo)2(f - fo))dx > 0, (4.5.17) where f' denotes the derivative of f with respect to the parameter.
Proof. Note that since OF,, is the orthogonal projection of sfo into L0, K(O, F0) < K(�Fo, Fo),


for all (p C 4. This fact has been established in Chapter 2.
Also for any F1 E C, consider the function

hF, : [0, 1] --+ R,

given by

(fX OFo [(1 - t)fol + tf]dx)2 (4.5.18)
h ,- (t) = f,_ -2 1(4.018


Then by (a) of Lemma 4.5.1, hyl is a convex function, and by direct calculation,
h' (0+) (f= I4 fodx [2 J /o0g'dx J 4fodx "r ('F�ofodx)2 I


- 0, foqfdx f 1 g]dx, (4.5.19) where g = f, - fo. Since PFo is the orthogonal projection of SFo into Lo with respect to the inner product < .,. ,


jX O5F fod ' dx = fxki A I 4 fod







Hence,

h'1 (0+) /(2�Fg' - �3og)dx. (4.5.20) Only if. Suppose that (OF, F0) is a saddle point of K. Then for any F1 c C, and every t E (0, 1),

hF, (0) = K(OF, Fo) :_ K(OF, (1 - t)Fo + tF) = hF (t). Now, since h' (0+) > 0, J(20Fog' - 0og)dx > 0, where g = f, - fo.

If. Suppose that

J(20Fog' o2g)dx > 0, where g = f, - fo. Then from Theorem 4.2.1, hF1 is a monotone function in [0, 1]. Hence,

hF1(0) = K(OF, Fo) < hF,(1)= K(OFFI).

Thus (OF0, Fo) is a saddle point of K. This completes the proof of Theorem 4.5.1.

Corollary 4.5.1 (Huber) . Assume that F0 C C such that I(Fo) < I(F) for all F E C, and 00 = E D. Then (00, F0) is a saddle point of K.
fo
Proof. For any F1 E C, consider the function
hF(t) = 1((l - t)Fo + tF) f (f + t(fo - fl) T dx"
((1 Ao + t(f, -fo)


Then by (a) of Leima 4.5.1, hF, is convex, and attains its minimum at t = 0. Thus


0 < h' (0+) = [2fg' - (f�)2g]dx, (4.5.21)
-- Ao







where g = fl - fo. The above equality follows from the Lebesgue dominated convergence theorem, and the facts that


t ft f0 -2 fog' -()2g,


and
1 (f;)2] _< (f)2 (f )2 t ft fA fi fA

uniformly in t E (0, 1).

4.5.2 Multi-Dimensional Case




In this subsection, by using the geometry of optimal estimating functions proved in Chapter 2, a necessary and sufficient condition of the asymptotic miniiaxity result for estimating functions in a multi dimensional parameter space will be given. This result generalizes one main result of Huber (1964) to the multi dimensional parameter space.

Theorem 4.5.2. Suppose the parameter space is multi-dimensional. Then (OFo, F0) is a saddle point of K, that is K(0, Fo) _- K(OFo, FO) < K(OF., F), for all 4 C F, and F G C, if and only if Ofo,t Of Of
[0(f- MY+( af-i~o)(OF.)' - OF 04' (f -fo)]dxr (4.5.22) is non negative definite.

Proof. Note that since OF0 is the orthogonal projection of sFo into L0,


K(4, Fo) < K(Oro, Fo),







for all 0 E (D. This has been proved in Chapter 2.
Also for any F1 G C, consider the function

JF, : [0, 1] --4 Mk. k,


given by
JF,(A) =( �1A)fo + AO]tdx)t(J OF��t[(1 - A)f� + Afl]dx)-i



. F0o[(1 -A) afA + A a-f, ]tdx). (4.5.23) From (b) of Lemma 4.5.1, JF, is convex, and by direct calculation, JF, (0 +) = (1M1- M ) t M -IM3 M32VI4 '(M2 - M4)M4 1M3 M M4 (M- 13),

where


MXI OF �(oflO )dx, M12 = fo�Fofldx,

Since OF,, is the orthogonal projection of A13 = .4. Hence,

JF, (0+) = (M1 - M3) +

[0,("f a~fo), + )(f f
ao ao (ao o

Only if. Suppose that (OFo, Fo) is a s, and every t E (0, 1),

JF (0) = K(OFo, FO) K(5F0


f Ofot
.M3 0 � (-b-)dx, M4 - JF � oofodx. sFo into Lo with respect to < .,. >F0,



M - A3) - (02 - AM4) OF.)' - Fo �o0(f - fo)]dx. (4.5.24) addle point of K. Then for any F1 c C,


t) Fo + tF1) =Jp, (M).







Thus from the definition of JF, (0+), JF, (0+) is non negative definite. Hence, Jx[�Fo(t +g . g]dx,
(')+ - o


is non negative definite, where g = fi - fo.
If. Now suppose that

] 0(,)t + Og (OF,) - 0&0g]dx



is non negative definite, where g = f, - fo. Then from Theorem 4.2.1, JF, is a monotone function in [0, 1]. Hence, JF,(0) = K(Fo, Fo) < JFI(1) = K(OFO,FI). Hence, (OF, F0) is a saddle point of K. This completes the proof.
Corollary 4.5.2. Assume that F0 E C is such that I(Fo) < I(F) for all F G C, and SFo = I E (. Then (SFo, F0) is a saddle point of K.
fo

Proof. For any F1 E C, consider the function

JF, (A) = I((1 - A)F0 + AF,) jO(fo + A(fo- f)) (O(fo + A(fo - fl)))t 1
S0 ao fo - A (f - fo)dx. Then by (b) of Lemma 4.5.1, JF is convex, and attains its minimum at t = 0. Thus ,(0+) =Ix[�Fo(--!))+ O--(�F)t - O op/g]dx, (4.5.25)


is non negative definite, where g = f, - fo. The above equality follows from the Lebesgue dominated convergence theorem and the facts that

lao%)~~ _~f0%ot fo I1(gOg\ JfOI dOd fo + aofo-) + d-0(0 o )2 A f ao (goaofo (.O)2




88


and
1 fA ~ ( __ofo, _ (_af )t (af(o )t A fo - fi fo














CHAPTER 5

SUMMARY AND FUTURE RESEARCH



5.1 Summary


In this dissertation, we have studied optimal estimating functions through the introduction of the generalized inner product space. It turns out that, the orthogonal projection of the score function into the subspace of estimating functions (if it exists), is optimal in that subspace. Also, the estimating function theory in the Bayesian framework is studied. We have shown that the orthogonal projection of the posterior score function into a subspace of estimating functions (if it exists) is optimal in that subspace. The geometry of estimating functions in the presence of nuisance parameters is also studied. The geometric idea of conditional, marginal and partial likelihood inference become transparent when viewed as orthogonal projections of score functions into appropriate subspaces. Finally, a general result about matrix valued convex functions was also proved, and then this result was applied to study optimum experimental designs, mixture distributions and asymptotic minimaxity of estimating functions.

5.2 Future Research


We have studied the geometry of estimating functions in the discrete setting; it will be of great interest to extend these results to the martingale framework. I believe that there is a lot of potential in pursuing a vigorious research in this direction.




90


In the last decade., there are major advances in the study of geometry of optimum experimental designs. I believe most of these geometric results are direct consequences of the duality theory in convex analysis. As far as I know, the applications of duality theory to statistics are very limited. It will be of great interest to establish a general duality theory in the statistical framework.















BIBLIOGRAPHY


Amari, S. I. and Kumon, M. (1988), Estimation in the presence of infinitely many nuisance
parameters - geometry of estimating functions. Ann. Statist., 16 (3), 1044-1068.

Bhapkar, V. P. (1972), On a measure of efficiency of an estimating equation. Sankhya A,
34, 467-472.

Bhapkar, V. P. (1989), Conditioning on ancillary statistics and loss of information in the
presence of nuisance parameters. J. Stat. Plan. Inf., 21, 139-160.

Bhapkar, V. P. (1991a), Loss of information in the presence of nuisance parameters and
partial sufficiency. J. Stat. Plan. Inf., 28, 195-203.

Bhapkar, V. P. (1991b), Sufficiency, ancillarity, and information in estimating functions.
Estimating Functions, edited by V. P. Godambe, Oxford University Press, New York,
241-254.

Bhapkar, V. P. and Srinivasan, C. (1994), On Fisher information inequalities in the presence
of nuisance parameters. Ann. Inst. Stat. Math., 46, 593-604.

Breslow, N. E. and Clayton, D. G. (1993), Approximate inference in generalized linear
mixed models. Journal of American Statistical Association, 88 (421), 9-25.

Chaloner, K. and Larntz, K. (1989), Optimal Bayesian design applied to logistic regression
experiments. J. Stat. Plan. Inf., 21, 191-208.

Cox, D. R. (1972), Regression models and life tables (with discussion). J. R. Stat. Soc. B,
34, 187-220.

Cox, D. R. (1975), Partial likelihood. Biometrika, 62, 269-276.

Crowder, M. (1995), On the use of a working correlation matrix in using generalized linear
models for repeated measures. Biometrika, 82, 407-410.

DasGupta, A. and Studden, W. (1991), Robust Bayesian designs. Ann. Stat.

Desmond, A. F. (1991), Quasi-likelihood, stochastic processes, and optimal estimating functions. Estimating Functions, edited by V. P. Godambe, Oxford University Press, New
York, 133-146.

Dette, H. (1993), Elfving's theorem for D-optimality. Ann. Stat., 21 (2), 753-766.

Dette, H. and Studden, W. J. (1993), Geometry of E-optimality. Ann. Stat., 21 (1), 416-433.

Diggle, P., Liang, K. Y. and Zeger, S. L. (1994), Analysis of Longitudinal Data. Oxford
University Press, New York.

Durbin, J. (1960). Estimation of parameters in time series regression models. J. R. Stat.
Soc. B, 22, 139-153.

Efron, B. and Stein, C. (1981), The jackknife estimate of variance. Ann. Statist., 9 (2),
586-596.

Elfving, G. (1952), Optimum allocation in linear regression. Ann. Math. Stat., 23, 255-262.

Elfving, G. (1959), Design of linear experiments. Cramér Festschrift Volume. Wiley, New
York, 58-58.

El-Krunz, S. M. and Studden, W. J. (1991), Bayesian optimal designs for linear regression
models, Ann. Statist., 19 (4), 2183-2208.

Ferreira, P. E. (1981), Extending Fisher's measure of information. Biometrika, 68, 695-698.

Ferreira, P. E. (1982), Estimating equations in the presence of prior knowledge. Biometrika,
69, 667-669.

Firth, D. (1987), On the efficiency of quasi-likelihood estimation. Biometrika, 74, 233-245.

Ghosh, M. (1990), On a Bayesian analog of the theory of estimating function. D. G. Khatri
Memorial Volume, Gujarat Stat. Rev., 47-52.

Ghosh, M. and Rao, J. N. K. (1994), Small area estimation: an appraisal (with discussion).
Statistical Sciences, 9 (1), 55-93.

Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation.
Ann. Math. Stat., 31, 1208-1212.

Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika, 63, 277-284.

Godambe, V. P. (1985). The foundations of finite sample estimation in stochastic processes.
Biometrika, 72, 419-428.

Godambe, V. P. (1994). Linear Bayes and optimal estimation. Preprint.








Godambe, V. P. and Heyde, C. C. (1987), Quasi-likelihood and optimal estimation. Int.
Stat. Rev., 55, 231-244.

Godambe, V. P. and Kale, B. K. (1991), Estimating functions: an overview. Estimating Functions, edited by V. P. Godambe, Oxford University Press, New York, 2-30.

Godambe, V. P. and Thompson, M. E. (1974), Estimating equations in the presence of nuisance
parameters. Ann. Stat., 2 (3), 568-571.

Godambe, V. P. and Thompson, M. E. (1989). An extension of quasi-likelihood estimation
(with discussions). J. Stat. Plan. Inf., 22, 137-172.

Haines, L. M. (1995), A geometric approach to optimal design for one-parameter non-linear
models. J. R. Stat. Soc. B, 57 (3), 575-598.

Hoeffding, W. (1992), A class of statistics with asymptotically normal distribution. Breakthroughs in Statistics, Vol. 1, 308-334.

Huber, P. J. (1964), Robust estimation of a location parameter. Ann. Math. Stat., 35, 73-101.

Huber, P. (1980), Robust Statistics. John Wiley and Sons, New York.

Heyde, C. C. (1989), Quasi-likelihood and optimality for estimating functions: some current
unifying themes. Bull. Inter. Stat. Inst., 1, 19-29.

Karlin, S. and Studden, W. J. (1966), Optimal experimental designs. Ann. Math. Stat., 37,
783-815.

Kale, B. K. (1962). An extension of Cramer-Rao inequality for statistical estimation functions. Skand. Aktur., 45 , 60-89.

Kiefer, J. (1959), Optimum experimental designs. J. Roy. Stat. Soc. B, 21, 272-319.

Kiefer, J. (1974), General equivalence theory for optimum designs (approximate theory).
Ann. Stat., 2, 849-879.

Kiefer, J. and Wolfowitz, J. (1959), Optimum designs in regression problems. Ann. Math.
Stat., 30, 271-294.

Kiefer, J. and Wolfowitz, J. (1960), The equivalence of two extremum problems. Can. J.
Math., 14, 363-366.

Kumon, M. and Amari, S. I. (1984), Estimation of structural parameters in the presence of
a large number of nuisance parameters. Biometrika, 71 (3), 445-459.








Laird, N. M. (1978), Nonparametric maximum likelihood estimation of a mixing distribution. J. Amer. Stat. Assoc., 73, 805-811.

Liang, K. Y. and Waclawiw, M. A. (1990), Extension of the Stein estimating procedure
through the use of estimating functions. Journal of American Statistical Association,
85 (410), 435-440.

Liang, K. Y. and Zeger, S. L. (1986), Longitudinal data analysis using generalized linear
models. Biometrika, 73, 13-22.

Liang, K. Y. and Zeger, S. L. (1995), Inference based on estimating functions in the presence
of nuisance parameters (with discussions). Statistical Science, 10, 158-199.

Liang, K. Y. Zeger, S. L. and Qaqish, B. (1992), Multivariate regression analysis for categorical data (with discussion). J. R. Stat. Soc. B 54, 3-40.

Lindsay, B. G. (1981), Properties of the maximum likelihood estimator of a mixing distribution. Statistical Distributions in Scientific Work, edited by G. P. Patil, Vol.5. Reidel,
Boston, 95-109.

Lindsay, B. G. (1982), Conditional score functions: some optimality results. Biometrika, 69
503-512.

Lindsay, B. G. (1983a), The geometry of mixture likelihoods: a general theory. Ann. Stat.
,11 (1), 86-94.

Lindsay, B. G. (1983b), The geometry of mixture likelihoods, Part II: the exponential family. Ann. Stat. , 11 (3), 783-792.

Lindsay, B. G. (1995), Mixture Models: Theory, Geometry and Applications. Institute of
Mathematical Statistics, Vol. 5.

Lloyd, C. J. (1987), Optimality of marginal likelihood estimating equations. Comm. Stat.,
Theory and Meth. , 16, 1733-1741.

McCullagh, P. (1983), Quasi-likelihood functions. Ann. Statist., 11 (1), 59-67.

McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models. 2nd Ed. Chapman and
Hall, London.

McGilchrist, C. A. (1994), Estimation in generalized mixed models. J. R. Statist. Soc. B 56
(1), 61-69.

McLeish, D. L. and Small, C. G. (1992), A projected likelihood function for semiparametric
models. Biometrika, 79 (1), 93-102.





PAGE 50

44 is n. n. d., for all g G L 0 and 9 e Q. In the rest of this section, unless otherwise stated, we shall assume the following regularity condition for estimating functions, which basically involves the interchange of differentiation and expectation. (11). For any g E L, E[f Q \9] = -E[g s'W. (3.3.4) Thus combining (3.3.3) and (3.3.4), I g (9) is the information matrix of g with respect to s. We now state the main result of this section, which is an immediate consequence of Theorem 2.2.3. Theorem 3.3.1. Let s = dl j$ P , M{6) be an invertible matrix valued function, and L 0 a subspace of L. If the orthogonal projection g* of M(9) s into L Q exists, then g* is optimal in L 0 . As an easy consequence of Theorem 3.3.1, the following result generalizes the results due to Godambe (1976), Godambe and Thompson (1974) to the multiparameter case. Corollary 3.3.1. Suppose there exists g : X x 0 — R dl such that g* = M{6) s + g€L Q , (3.3.5) and g is orthogonal to every element in Lq, then g* is optimal in Lq. Proof. Since g is orthogonal to every element in L 0 , and g* G L 0 , g* is the orthogonal projection of M(9) s into L 0 . The optimality of g* in L 0 is immediate from Theorem 3.3.1. Note that the above corollary generalizes the main result in Godambe and Thompson (1974) to the multiparameter case. It also provides a geometric explanation of equation (5) in Godambe and Thompson (1974). To see this, suppose there exist

PAGE 51

45 matrices {M(0),Mj (#)}*_ j of appropriate dimensions such that M{9) is invertible and Then g* is the orthogonal projection of M(9) s into L 0 . For d\ = d,2 = 1 and k = 2, the above result reduces to the main result in Godambe and Thompson (1974). Next we use Theorem 3.3.1 to study the geometric ideas behind conditional and marginal inferences. 3.3.2 Geometry of Conditional Inferences In this subsection, we study the geometry of conditional inference. As an easy consequence of this geometric approach, some of the results due to Bhapkar (1989, 1991) follow easily. First use the identity E[^-\6) = < g u s 6l >e, where se l = a ^ p , se l may involves both 8 X and # 2 The information of gi is then hA e ^ e ) =< 0i> s «i >9< 9u9i >e Y < 9us &1 >e Let L denote the space of all estimating functions which satisfy conditions (I) (IV) of Section 3.3.1. Let L 0 be a subspace of L. Following Bhapkar (1989, 1991a), suppose statistic (S,U) is jointly sufficient for the family {p g : 6 G 0}, and furthermore, suppose U satisfies the following condition C:

PAGE 52

16 (C): The conditional distribution of S, given u = U(X), dependes on 9 only through #1, for almost all u, that is S is sufficient for the nuisance parameter 9 2 Denote by h(s;6i\u) the conditional pdf of S = S(X), given u. Definition 3.3.2. A statistic U = U(X) is said to be partially ancillary for 6\ in the complete sense if (i) U satisfies requirement C; (ii) the family {p% : 6 2 G 62} of distributions of U for fixed 61 is complete for every 6\ G Q\. A statistic U = U(X) is said to be partially ancillary for 9\ in the weak sense if (i) U satisfies requirement C; (ii) the marginal distribution of U depends on 9 only through a parameteric function 5 = 5(6) (5 is assumed to be differentiable) such that (6 U 5) is a one-to-one function of 6. Letting p L d l be the pdf of U and the following theorem connects Theorem 3.3.1 and Bhapkar's (1989, 1991a) results. Take L = Lq. Theorem 3.3.2. If the statistic U = U(X) is partially ancillary for 6\ in the complete sense or in the weak sense, then l c (x;6\) is the orthogonal projection of sg l into L. Proof. We are going to prove this result in two cases. Case 1. U is partially ancillary for 9 X in the complete sense. Note that l c (x\9 x ) d\ogh(s,9i\u) d9 x s«i = h(x\9 x ) + d9 x

PAGE 53

47 We only need to show that V g x G L, 9 G 9 But de x 501 = E{E[ 9l ( d ~^)% U}\9} = E{E[g x \9, U] i^f-YlO}. Since E{E[g x \9, U}\9} = 0, by the completeness of {p% : 9 2 G 0 2 } for fixed 9\ G Q%, E[gi\9, U] = 0 almost everywhere for fixed 9\. This implies that < 0 1 ' aZ — >0=°> V0G6. o9\ Thus / c (x;0i) is the orthogonal projection of sg 1 into L. Case 2. C/ is partially ancillary for 9\ in the weak sense. Again note that dlogPe s 0l = l e (x;0i) + Now since the map (9i,9 2 ) — > (6i,5(8)) is a one-to-one map, the matrix r 85_ ld l 60 y 0 ^~ U 802 J is invertible. It implies that ^ is invertible. Note that dlog/# _ dlog/# — d9 x d8 2 y d9 2 ' ddi

PAGE 54

48 se 2 D(9), where D(9) = f^. Thus we only need to show that V g x e L,6 e 0, < 9i, dd 2 >o=0. But this follows easily by differentiating E\gi\0]=0 with respect to 02, and using the regularity conditions. By combining Theorems 3.3.1 and 3.3.2, we get the following result due to Bhapkar (1989, 1991a). Corollary 3.3.2. With the notation as above, l c (x;9i) is optimal in L if either U is partially ancillary for 9\ in the complete sense or in the weak sense, that is Note that from the proof of Theorem 3.3.2, the condition of partial ancillarity for 6\ in either the complete sense or in weak sense guarantees that the conditional score is the orthogonal projection of the score function with respect to 9\ into L. 3.3.3 Geometry of Marginal Inference In this subsection, we study the geometry of marginal inference. As an easy consequence of our approach, the optimality result of Bhapkar (1989) and Lloyd (1987) on marginal inference will follow easily. Assume that (M): the distribution of statistic S = S(X) depends on 9 only through 9\. Definition 3.3.3. A statistic S = S(X) is said to be partially sufficient for 9\ in the complete sense if I 9 (9)
PAGE 55

49 (i) S satisfies condition (M); (ii) given s = S(X), the family {p^ s : 9 2 € 62} of the conditional distributions of U for fixed 9\ is complete for almost all s, and for every 9\ E Q\. A statistic S = S(X) is said to be partially sufficient for 9\ in the weak sense if (i) S satisfies condition (M); (ii) the conditional distribution of U, given s = S(X), depends on 9 only through a parameteric function 5 = 5(9) (5 is assumed to be differentiable) such that (9\,6) is a one-to-one function of 9. If S = S(X) is partially sufficient for 9 X in the complete ( or weak) sense. Let u\s p# 1 denote the conditional pdf of U given S = s, and The main result of this subsection is given in the following theorem. Theorem 3.3.3. If the statistic S = S(X) is partially sufficient for 9\ in the complete sense or in the weak sense, then l m (x; 9\) is the orthogonal projection of sg l into L. Proof. We are going to prove this result in two cases. Case 1. 5 is partially sufficient for 9\ in the complete sense. Note that 90i dlogp)) We only need to show that for any g\ 6 L, 9 6 6 39, W\s) <9i But d\ogp\ U\s 89, ,(U\s) < 9u >9=E[ gi ( Y\o]

PAGE 56

50 = E{E[ 9l S\\9) =E{E[ 9x \e, s\ ( d -^f^ym. Since E{E[g x \6, S}\9} = 0, by the completeness of {p { e uls) : 6 2 e 0 2 } for fixed 61 E 61, E[gi\6, S] = 0 for fixed B x . This implies that d\ogp u e ^ n V # E (-). Thus /^(:r;0i) is the orthogonal projection of into L. Case 2. 5 is partially sufficient for 9\ in the weak sense. Again note that a , {U\s) oe 1 Now since the map (61,62) — > (#1, 6(6)) is a one-to-one map, the matrix T dS 1 0 — is invertible. This implies that Jj^ is invertible. Hence aiogp^ |s) where N(6) = Thus we only need to show that V g x e L, 6 e 0, a , (f/|s) <5l '^T>e=0 -

PAGE 57

51 But this follows easily by differentiating E\g 1 \9] = 0 with respect to 9 2 , and using the regularity conditions. Thus l m (x;9i) is the orthogonal projection of sg 1 into L. This completes the proof. By combining Theorem 3.3.1 and 3.3.3, we get I g (9) < I tm (6) for all 9 e e.g E L, which includes the results of Bhapkar(1989, 1991a) and Lloyd (1987) as a corollary. Corollary 3.3.3. (1) if S is partially ancillary for 9\ in the complete sense, then l m (x; 9\) is optimal in L; (2) if S is partially ancillary for 9\ in the weak sense, then l m {x\6{) is optimal in L. Proof. In both cases, l m (x;9i) is the orthogonal projection of sg x into L. So the inequality i g (9)
PAGE 58

52 3.4.1 A General Result In this subsection, we study the local optimal estimating functions in the presnece of nuisance parameters. We first introduce the space of estimating functions of interest. Let X be a sample space, 0 = Qi x 9 2 the d x + d 2 dimensional parameter space, with 6 t C R di (i = 1,2). A function g : X x 0 — is said to be an unbiased estimating function if E[g(X,9)\9) = 0, W={e 1 ,6 2 )ee. An estimating function g is said to be regular if it satisfies conditions (I) (IV) as in Section 3.3.1. Let L denote the space of all regular unbiased estimating functions from ^xO to . Also the score vector sq 1 is assumed to be regular in the sense described in (i) and (ii). Definition 3.4.1. Let (L, < .,. >#) be the family of generalized inner product spaces, and let L 0 be a subspace of L. For any g e L 0 , the information function of g is defined by m = E[^\e)*?E\^L\e\ (3.4.6) An element g* € L 0 is said to be a locally optimal estimating function at 9 2 = #20 if Ig.(6l,02o) Ig{9\,0 2 o) is n. n. d., for all g G L 0 and 9i E Q\. The following is the main result of this section.

PAGE 59

53 Theorem 3.4.1. Let L be the space of all regular unbiased estimating functions g:XxO — R dl . Let Lq be subspace of L. If g* is the orthogonal projection of s into L 0 with respect to the generalized inner products < ., . >g with 9 2 = 6 20 , then g* is locally optimal in L 0 , that is for any fixed #20 G @2 5 Ig' 620) > Ig(6\,6 2 o), for all #1 G 0i. Also the local optimal estimating function in L 0 is unique in the following sense: if g G L 0 , then I g . (61,620) — Ig(6i,6 2 o) for all 6 X G 0i if and only if there exists an invertible matrix valued function iV : 0j x {#20} — -^dixdi such that for all 6 X G 0i, g*(X;6 x ,6 2 o) = N(6 X ,6 20 ) g(X;d x ,d 2 o), with probability 1 with respect to Pe u e 20 Proof. This follows easily from Theorem 2.2.4 and 3.3.1. Next we apply Theorem 3.4.1 to generalize the results in three different cases: (1) Lindsay's (1982) result on the local optimality of conditional score functions; (2) Godambe's (1985) result on the estimation in stochastic processes; (3) Murphy and Li's (1995) result on the projected partial likelihood. 3.4.2 Local Optimality of Conditional Score Functions Suppose that (X x , . . . , X n ) = X( n ) is a sequence of possibly dependent observations with pdf f(X (n y,e) = f 1 (X (1) ;d)f 2 (X {2) \X w ;e) ... f n (X {n) \X (n _ x) ;6). (3.4.7)

PAGE 60

54 Let Ui = ^log/j, and Sy(0i) — Sj(X^y9i) be minimal sufficient for 0 2 with #i fixed in pdf fj, let W^ = tO-^|5 il Xy_ 1) ], (3.4.8) for j G {l,...,n}. The sequence {5j(^i)}j=i is called sequentially complete if for each A; from 1 to n, the system of equalities E[H(S k ,X {k 1) ;9 1 )]=0 (3.4.9) for all 9 with #i fixed implies that H(Sk, X^-\)]9\) is a constant in Sk with probability one, that is, H(Sk, Xik-i)', 9\) does not depend on SkThe following is a slight generalization of the main result in Lindsay (1982). Theorem 3.4.2. Assume sequential completeness and E[Ui\X^^i)] = 0 for all i = 1, . . . , n. For fixed 9 20 € 02, consider unbiased estimating function h: X xQ^ {9 20 } ^ R dl , (3.4.10) Let L Q be the subspace, which consists of all unbiased estimating functions from X x 0i x {020 } into R d K Then (a) W(#i,02o) = ^i^iWi is the orthogonal projection of s$ 1 into L 0 with respect to the generalized inner product < ., . >e u e 20 ', (b) W(0i,#2o) is optimal in Lq, and the optimal element in L 0 is unique in the following sense: if g G L 0 and I g (9i,9 20 ) = Av(#i,02o) for all 9\ £ 0i, then there exists an invertible matrix valued function iV : 0! x {9 20 } — > Md lXdl such that for all By e 0i W = N(9 U 9 2Q ) g, with probability one with respect to Pe u e 20 Proof. First note that, for any H G L 0 , consider the decomposition H = H n + ff n _i + ... + #! + #„,

PAGE 61

55 where H n {X( n )\6\) =H E[H\S n , X(„_i)], H n -i(S n , A r ( n _i); #i) = £ , [if|5 n , A r ( n _]j] i^i/IS^-i, A r („_ 2 )], ^(SWi, = F[ff|5fc + i, A'( fc )] E[H\Sk, X(k-i)], for all A; G {1,2, ...,n}, and 7/ 0 = By the sequential completeness of {Sj(Qi)Yj=u H k (Sk+i,X( k y,6i) does not depend on S k +i, that is H k (S k+ i, A (fc) ; 6 X ) = Hk(X(ic)', (a) Since s 9l = ^ =1 Uj = W + ^m\Si, Aj-i], it suffices to show that < E[Uj\Sj, A'j_i], H k (Sk+i, X( k y, #1) >g= 0, for all j, k G {1, . . . , n). But this follows from an easy conditioning argument. (b) This follows from part (a) and Theorem 3.4.1. Note that part (a) of Theorem 3.4.2 is a restatement of the main result in Lindsay (1982). 3.4.3 Locally Optimal Estimating Functions for Stochastic Processes In this subsection, we generalize the results of Godambe (1985) to the case where there are nuisance parameters. As a special case, we get generalized estimating equations in the presence of nuisance parameters. Let {A 1? A 2 , . . . , X n } be a discrete stochastic process, 0., C R dj (j = 1, 2) be open sets. Let hi be a R dl valued function of X\, . . . , Aj and 9, which satisfies for fixed #20 G 02, E i 1 [h i (X l ,...,X i ;e)\6 1 ,e 2O ] = 0, (t= l,...,n, 9 } G 0J. (3.4.11)

PAGE 62

56 In the above, E{-\ denotes the conditional expectation conditioning on the first i — 1 variables, namely, X\, . . . ,Xi-\. Let L 0 = {g 9= S"=i^»-i hi}, where A^i is a Mkxk valued function of X\, . . . , AVi and 6 X , for all t € {1, ... , n). The following theorem, which generalizes the result of Godambe (1985) on optimal estimating functions for stochastic processes. Theorem 3.4.3. Let 6 2 = 0 2 o, and suppose hi satisfies the regularity condition (K). Let A* = E^i^-lOu fco]' Et-i[hi h\\e u e 2Q ]1 Vie {1,2,..., n}, and g* = E? =1 A* h u then the following conclusions hold: (a) , g* is the orthogonal projection of s fll into L 0 with respect to the generalized inner product < ., . >9 u e 20 (b) . g* is a locally optimal estimating function in L 0 , i. e., ^(#i,#2o) < Jff„(01)02o), for all g (z Lo and #1 G @i(c) . If .9 e L 0 and g l \9] is invertible, then I g {6 1 ,9 2 o) = / 9 ,(#i, 6> 2 o), V0i € ©i if and only if there exists an invertible matrix function N : ©i x {#20} — -^fcxfc such that for any 6\ € ©i, <7»(Xi, . . . , X n ; 61, 620) = N(9 U 9 20 ) g(X\, ... , X n ; 0i, 9 20 ), with probability 1 with respect to Poi,o20 -

PAGE 63

57 Proof, (a). For any g = S" =1 Aj hi £ L 0 , 0 X e 0i, < s 9l g*,g >(e u 02o) =< s e^9 >(.e l ,e 20 ) ~ < 9*, 9 >(e u e 2 o) -Xi^EWhitfjA^ew)] Eoi^MSAjK^.flao)]. (3.4.12) But for i < j, = ^Vi-iI^-Kfli^ISi^)} = o. Similarly, for i > j, E[A*h l ti 3 A%e u 6 2Q )} = i). Thus from equation (3.4.12), we get e^ 0 =^UE{E l l [^u 0»MWi> M" T^ =1 E{A^Ei-i[hi h\\0 u #20] A\\0 U 9 20 } = 0. Hence g* is the orthogonal projection of s into L 0 . Parts (b) and (c) of the theorem follows easily from part (a) and Theorem 3.4.1. As a corollary to Theorem 3.4.3, the generalized estimating equations in the presence of nuisance parameters for multivariate data can be easily obtained. Corollary 3.4.1. Suppose that Aj, . . . ,X n are independent, for fixed #20 £ ®2> for each i £ {1,2, ... ,n}, hi : Xi x O — R d \ with E[hi(Xi,9)\0\, #20] = 0. Consider the subspace L 0 as that in Theorem 3.4.3. Then the generalized estimating equations determined by {hi} is given by S?=i4 ? ^ = 0,

PAGE 64

58 where A* = E^&9 U 02o]' Ek-Ahi h^\e u 9 20 }~ x V i € {l,2,...,n}. The above corollary provides a very convenient way to construct generalized estimating equations. For instance, if hj is chosen as linear (or quadratic) function of Xi, then the corresponding generalized estimating equations reduce to the GEE1 and GEE2 studied by Liang , Zeger and their associates. One may refer to see Liang and Zeger (1986), Liang, Zeger and Qaqish (1992), Diggle, Liang and Zeger (1994). 3.4.4 Local Optimality of Projected Partial Likelihood In this subsection, we generalize the result of Murphy and Li (1995) on projected partial likelihood to the nuisance parameter case. Also the application of this result to longitudinal data will be pointed out. Suppose that the data consist of a vector of observations X with density /(x; 0i, 82), 6\ is the vector of parameters of interest, which is finite dimensional, and 82 is the vector of nuisance parameters, which may be infinite dimensional. Suppose there is a one-to-one transformation of the data X into a set of variables Y\,C\, . . . , Y m , C m . Let yfr) = (Yi, . . . , Yj), CW = (C u ...,Cj), j = l,...,m. (3.4.13) For instance, in survival analysis, Y\ t . . . ,Y m denote the lifetime variables, and C\, . . . ,C, the censoring variables. Note that the joint density of Y^ m \ C (m) can be written as m m n/fok u 1 \y w 1) ;0i f fc) n/tel^?^ 1 ';^), (3.4.14) i=i j=i

PAGE 65

59 where and are arbitrary constants, and are used only for notational purposes. Then P(9 l ) = Uj=i f{yj\c U \ y U ~ l) ; #i, 9 2 ) is called the Cox partial likelihood. Let vm aiog/^lc^1 ),^1 );^,^) where 5 _E j=1 — • {6AAO) Next we introduce the subspace of unbiased estimating functions which is of interest in this subsection. This is similar to the space considered by Godambe (1985) in studying the foundation of finite sample estimation in stochastic processes. For any j € {1, 2, ... , ra}, consider estimating functions hj : Y u) x C {j) x 6j — R dl , (3.4.16) E{h j \ y U1 \cU\e] = 0, (3.4.17) for all 9 e 0i x 0 2 . For chosen {hi}^, consider the space L 9 = {g:g = T% sl A j (0)lhh ( 3 4 18 ) where for all j € {1, . . . , m}, Aj(9) is a d\ x d\ matrix, and hj satisfies (3.4.16) and (3.4.17). Let _ aiogP _ aiog/folc^ y^;!^)) The main result of this subsection is the following. Theorem 3.4.4. For fixed 0 2 = 9 2 o, for any i €• {1, . . . , m}, let

PAGE 66

60 and g* = m^a,. Then (a) g* is the orthogonal projection of s* into L 0 , that is, for any g 6 L 0 , < s* >tf 1 ,e 20 = 0, for all 0 € 0i ; (b) g* is locally optimal in L 0 at 9 2 = 9 2Q ; (c) If g E U and g*\0] is invertible, then I g (9 u 9 20 ) = / 9 .(0i,02o), V 0j € 0j if and only if there exists an invertible matrix function N : Q x x {9 2 o} — Mkxk such that for any 9 X € €>i, g.{X;9 l ,0 2O ) = N(0 U 9 2O ) g{X-9 u 9 w ), with probability 1 with respect to Pe u e 20 Proof. For any j 6 {1, . . . , m}, let d\ogf{ Cj \^l \ y ^-9 u 9 2 ) then s — s* = TSL^j. (a) For any g £ L 0 , let <7j = A,-(0) hj,j = l,...,m be the components in the definition of L 0 . In order to show that E[(s — s*) g l \9] — 0, it suffices to prove that E[s j g t jl \e] = 0, for all j, j' £ {1, . . . , m}. Consider the following three cases: Case 1. j > j'. Then E[ 8j g),\0] = E{E[ Sj \c«l \yU-V\ g ),\9) = 0, since E[sj\6ix \y^-^\ = 0.

PAGE 67

61 Case 2. j = f. Then E[s j g t j \9] = E{s j E\ 9j \yVl \cW) t \e} = 0, since E[g 3 \y^l \ c^] ( = 0. Case 3. j' > j. Then E[ Sj g\,\B\ = E{s 3 EbrfiP-W-^ffl = 0, since E[g r \yV'l \cW-V} = 0. Part (b) and (c) follows from Theorem 3.4.1 and part (a). Note that Murphy and Li (1995) studied projected partial likelihood in the case d x = 1 , the absence of nuisance parameters and when all the C^'s are empty. Similar to Murphy and Li's comments, because of the nested structure of {Y^\ C^}, removing drop-out factors in this way will not cause bias in the resulting partial score function, as long as the subjects' drop-out depends only the past. This is in contrast to a generalized estimating equation, which is biased under random drop-out. 3.5 Optimal Conditional Estimating Functions In this section, we study optimal conditional estimating functions. Let X be a sample space, 6 = Q\ x 0 2 , &i C R di , (i = 1,2), and 9\ is the parameter of interest. For fixed 9i, assume that S(#i) is sufficient for 8 2 . A function g : X x 0! — R d \ is called a regular conditional unbiased estimating function if (1) E[g\S(9 l )]=0, for all 9 X e 9i; (2) E[g g t \S(9i)] is positive definite. Consider the space L of all regular conditional unbiased estimating functions, a family of generalized inner products on L is defined as follows: for any gi,g 2 6 L, < 9u92 >s(9 1 )= E[g x gtlSfr)]. (3.5.20)

PAGE 68

62 For any g G L, the conditional information of g is denned as follows: iMSm = ^|5(0,)F < 9,9 >;/,(0i|S(0i)), for all g £ L 0 . The main result in this section is given in the following theorem. Theorem 3.5.1. Define s = -^-\o%f(X\S(9 1 )). (3.5.22) Suppose Lq is a subspace of L, and assume that the orthogonal projection g* of s into L 0 exists. Then (a) g* is an optimal conditional estimating function in L 0 ; (b) the optimal element in L 0 is unique in the sense that if g £ L 0 , then I g > (9i\S(9i)) I g (9i\S(9i)), for all 9 X e 0i if and only if there exists an invertible matrix valued function N : X x 0i — Mjfcx* of the form iV(A r , #i) = W(5(0i)) such that g*(X;0 l ) = N(S(9 1 ))g(X;9 l ). Proof, (a) Since E[g\S(9i)\ — 0, under the regularity condition given in (2.5), E[ wJ s{9l)] = ~ E[9 s ' lsm Thus, from the definition of I g (9i\S(9i)), I g {di\S(6i)) =< g,s> l s{6l) < g,g >s( 8l )< 9,s> S (e 1 )

PAGE 69

63 Since g* is the orthogonal projection of s into L 0 with respect to < .,. >s(fli), thus the optimality of g* in L 0 follows from Theorem 2.2.3. (b) This follows from Theorem 2.2.4. Next as applications of Theorem 3.5.1, we generalize the results of Godambe and Thompson (1989) on optimal estimating functions into the conditional estimating functions framework. To this end, let X denote the sample space, 9 X = (9 n , . . . , 0^,) be a vector of parameters, hj, j = 1, . . . , k be real functions on X x Q x such that E[hj(X, 9)13(6,), X j ] = Q, \/9eB, j = 1, .... *, where Xj be a specified partition of X, j = 1, . . . , k. We will denote E[.\S(9 l ),X J ] = E U) [.\S(9 l )}. Consider the class of estimating functions Lo = {g9 = where g r = Yjj = ^qj r hj , r = 1, . . . , m, qj r : X x ©! — > R being measurable with respect to the partition Xj for j — 1, . . . , k, r — 1, . . . , d\. Let q *~ E U) [h)\S{9 x )Y for all j = 1, . . . , k, r = 1, . . . , d\, and g; = Y. k j=l q* jT hj, r = l,...,d 1 . The estimating functions hj, j = 1, . . . , k are said to be mutually orthogonal if E {j) [q* r h 3 q], rl h r \S(9 x )} = 0. Vy / ./. r. r' = 1,...,^. (3.5.24)

PAGE 70

64 Theorem 3.5.2. Suppose {hj} 1 ^^ are mutually orthogonal. Then the following hold: (a) g* is the orthogonal projection of the score function s into L 0 . (b) g* is an optimal estimating function in L 0 . (c) . If g € L 0 , and E[g g l \9\ is invertible, then I g (9) = I g *{9), V 9 e 6 if and only if there exists an invertible matrix function N : X x 0! — M kxk of the form N(X,6i) = N(S{9i)) such that for any 9 X e G u S T(X;$ l ) = N{S{9 l ))g(X;$ l ) i with probability 1 with respect to Pq. Proof. (1). We only need to show that, V r G {1, . . .,di},g r = Ej =l qj r hj, < s g*, g r >s(o l )= 0, V 9 X e 9i that is < s,g r >s(0i)=< 9*,9r >s{0i), V 9i s(0,)= ^ k j=l ^ k f=l E[q* r h J q jlr h r \S{9 l )] = £* =1 4 =1 £{ 9 ;,; V/ri^ = X k j=1 E{ q ; r q jr E U) [hp(9 l ))\S(e 1 )} = ^U E ^rE U ){^\S(0 1 )}\S(9 l )}. Also < s,g r >s(e 1 )= Z h j=l E[q jr s hj\S(9i)] = E k J=l E{q jr E U) [sh J \S(9 l )]\S(9 1 )} = ^ =l E{q jr E u) &S(9 l )]\S(9 l )}.

PAGE 71

Thus g* is the orthogonal projection of the score function into L 0 . Once again (b) and (c) follow from part (a) and Theorem 2.2.4.

PAGE 72

CHAPTER 4 CONVEXITY AND ITS APPLICATIONS TO STATISTICS 4.1 Introduction In this chapter, we first prove some general results about convexity, and then apply the results to various statistical problems, which include the theory of optimum experimental designs, the fundamental theorem of mixture distributions due to Lindsay (1983a), and the asymptotic minimaxity of robust estimation due to Huber. Huber (1964) proved an asymptotic minimaxity result for estimating functions about the location parameter. In this chapter, this fundamental result will be generalized to general estimating functions. The geometric optimality of estimating functions proved in Chapter 2 will be used to prove a necessary and sufficient condition for the asymptotic minimaxity of estimating functions in multi-dimensional parameter spaces. The contents of this chapter are organized as follows: in Section 4.2, a few simple results about matrix valued convex functions will be proved. Also we include some of the well known results in convex analysis, such as the Krein-Milman theorem about extreme points of convex sets, and the Caratheodory theorem about the representation of elements of a convex set in a finite dimensional vector space. In Section 4.3, the results of Section 4.2 are applied to the theory of optimum experimental designs. The fundamental result on optimal design theory is generalized to the matrix valued case. In Section 4.4, the results of Section 4.2 are applied to the mixture distribution 66

PAGE 73

67 situation; the fundamental result about mixture distribution due to Lindsay (1983a) is an easy consequence. In Section 4.5, the results of Section 4.2 and Chapter 2 will be used to generalize the classical asymptotic minimaxity result of Huber (1964) in the estimating function framework. 4.2 Some Simple Results About Convexity Let L be a linear space. A subset C of L is said to be convex if for every x,y E C, A €[0,1], Xx + (1 X)y € C. A function / : C — > R is said to be convex if for any x,y E C, X E [0, 1], f(Xx + (l-X)y) M kxk (i. e., for any x E C, N(x) is a symmetric k x k matrix) is said to be convex, if for any x, y G C, A G [0, 1], N{Xx + (1 X)y) < \N(x) + (1 X)N{y), where for two k x k matrices A,B, A < B means that B — A is nonnegative (n. n. d.). In the following, we only study properties of matrix valued convex functions, since for k = 1, they are reduced to the real valued case. For every x,y E C, consider the function on [0, 1] as follows N{X;x,y) = N((l-X)x + Xy), then N(X; x, y) is a convex function on A. The directional derivative of N at x in the direction of y is defined as Af(A; x, y) — N(0; x, y) F N (x; y) = hm -. 4.2.1) A->0+ A

PAGE 74

68 The existence of the limit is justified as follows: Since Ai«(1-£)0 + £a» for 0 < A] < A 2 < 1, N(Xu x, y) < (1 ^)/V(0; x, y) + ^iV(A 2 ; x, y). This implies NjX^x.y) -N{0;x,y) < N(\ 2 ;x,y) N(0;x,y) ^ ^ Ai A 2 that is N (* x 'y)N (°'> x 0, (4.2.4) for all y eC. Proof. Suppose that N(x 0 ) < N(y) for all y G C. Then *(**o,y)-jv(0;» 0 ,y) is non-negative definite. Hence rp 1 , y N (X;x o ,y)-N(0 ;x o ,y) FN(Xo;y)= lim , A->0+ A is n. n. d., for all y G C. Conversely, if F N (x 0 ;y) > 0 for all y G C, then from (4.2.3), N(y)-N(x 0 )>F ff (x 0 ',y)>0.

PAGE 75

69 Thus N(x 0 ) < N(y) for all y € C. Next let L be a locally convex vector space, and let N be a symmetric matrix valued function; then N is said to be Gateaux differentiable at x, if there exists a continuous linear operator A : L — Mkxk such that F N (x; y) = A{y x), for all yeC. (4.2.5) Before stating the next result, let us recall one of the well known results from functional analysis. Theorem 4.2.2 (Krein-Milman). Let L be a locally convex vector space, and let C be a convex compact subset of L. Then C = conv{ext(C)), where ext(C) denotes the set of extreme points of C, and conv(A) denotes the closed convex hull of A, it is the smallest closed convex set containing A. Now equipped with Gateaux differentiability and Krein-Milman theorem, we are in the position to prove the following result. Theorem 4.2.3. Let L be a locally convex vector space, and let C be a convex compact subset of L. If N is convex Gateaux differentiable at x 0 , then x 0 G C satisfies N(x 0 ) < N(y) for all y G C if and only if F N (x 0 ;y)>0, (4.2.6) for all y £ ext(C). Proof. Since F^{xa;y) = A(y — x 0 ) for some continuous linear opeator A for all y G C, from the definition of Gateaux differentiability, F^(xo',y) > 0, for all y G C is equivalent to F N (x 0 ;y) > 0, for all y G ext(C). Thus Theorem 4.2.3 follows from Theorem 4.2.1. Next the famous theorem of Caratheodory about the representation of elements of convex set in a finite dimensional vector space is presented. The present proof, taken directly from Silvey (1980), is included for the sake of completeness.

PAGE 76

70 Theorem 4.2.4 (Caratheodory). Let S be a subset of R n . Then every element c in conv(S) can be expressed as a convex combination of at most n + 1 elements of S. If c is in the boundary of conv(S), n + 1 can be replaced by n. Proof. Let S' = {(l,x) :xeS} be a subset of R n+l , let K be the convex cone generated by S' . Let y £ K, then ?/ can be written as y = Aij/i + . . . + A m j/ m , where each A, > 0 and each y; € 5". Suppose that the y t are not linearly independent. Then there exists Hi, . . . , /x m , not all zeroes such that + . . . + /Lt m y m = 0. Since the first component of each y$ is 1, so + . . . + ^ m = 0. Hence, at least one fa is positive. Let A be the largest number such that A/ij < Aj, i = 1, . . . ,m; A is finite since at least one fa is positive. Now let A' f = Aj — A/Zj, then y = XiVi + + KuVm, and at least one X\ — 0. Thus y can be expressed as a positive linear combination of fewer than m elements of S'. This argument can continue, until y has been expressed as a positive linear combination of at most n + 1 elements of S", since more than n + 1 elements are linearly dependent. Now the first part of the theorem follows by applying the above result to (l,c) 6 S'. Next suppose that y e K and J/ = Ai'f/1 + . . . + Xn+lVn+U where each Aj > 0 and the yi are linearly independent. Then y is an interior point of K. Thus any boundary point of K can be expressed as a positive linear combination

PAGE 77

71 of at most n linearly independent elements of S'. So the second part of the theorem follows. Proposition 4.2.1. If C is a compact subset of a locally convex vector space, then conv(C) is compact. Theorem 4.2.5. (a) With the same notation as Theorem 4.2.3, and the Gateaux differentiablity of N on C. The following are equivalent: (i) xq minimizes N(x); (ii) x 0 maximizes inf^c a t F N (x; y)a, for any fc x 1 real vector a; (hi) inf ye c a t F N (x 0 ; y)a = 0, for any k x 1 real vector a. (b) If x 0 minimizes N(x), then (x Q ,x 0 ) is a saddle point of F N , that is, F N (xo;yi) > 0 = F N (x 0 ;x Q ) > F N (y 2 ;x 0 ), for all j/1,2/2 € C. (c) If x 0 minimizes N(x), then the support of x 0 is contained in {y : F N (xo; y) — 0}. More precisely, {y t eC,x 0 = EfAfj/j, Aj > 0, EjAj = 1} C {?y : F N (x 0 ;y) = 0}. Proof, (a) First note that from Gateaux differentiablity of N, for any real k x 1 vector a, and x G C, inf a t F N (x;y)a = inf a t F N (x;y)a, y£ext(C) y€C and inf a l F^(x;y)a < a t F^(x;x)a = 0. s/ec ((i) (Hi)). Note that minimizes N(x), if and only if for any real k x 1 vector a, and y £ C. a l F N (xo;y)a > 0. The last inequality holds if and only if inf^gc a l F N (x 0 ; y)a > 0, for any k x I real vector a. This, in turn, is equivalent to inf yeext (c) a t FN{xQ\y)a = 0, for every k x 1 real vector a.

PAGE 78

72 ((ii) (Hi)). Note that xo maximizes ini ye c o. 1 Fn(x; y)a, for every k x 1 real vector a, if and only if mi y€ c a 1 Fn(x q ; y)a > 0, for any k x 1 real vector a. This is equivalent to inf y€eif (c) a l F N (x 0 ; y)a — 0, for every k x 1 real vector a. (b) This follows from Theorem 4.2.1 and the definition of F^. (c) If .To = ^iKVii K > 0, S,A( = 1, since N is Gateaux differentiable. 0 = F N (x 0 ; x 0 ) = F N (x 0 ; EfA^) = T,iXiF N (x 0 ;yi). Since F N (x 0 ; y) > 0 for all y G C, F M (x 0 ; y t ) = 0 for all y t . 4.3 Theory of Optimum Experimental Designs In this section, the results of the previous section are applied to fixed optimal experimental designs. First, we formulate the problem. Let / = (/i, . . . , f m ) denote m linearly independent continuous functions on a compact set X, and let 8 = (9\, . . . , 6 m ) denote a vector of parameters. For each x € X, an experiment is performed. The outcome is a random variable y(x) with mean value f(x) l 6 = Y^L x fi(x)9i, and a variance a 2 , independent of x. The functions fx, . . . , / m , called the regression functions, are assumed to be known, while 9 = (6i, . . . , 9 m ) and a are unknown. An experimental design is a probability measure //. defined on a fixed cr-algebra of subsets of X, which include the one point subsets. In practice, the experimenter is allowed ./V uncorrected observations and the number of observations that he (or she) takes at each x 6 X is proportional to the measure /j,. For a given /i, let M(/i) = ((m I ,(/i)))=1 , m,^) / f t (x)f 3 {x)dii(*)( 4 -3-7) The matrix M(//) is called the information matrix of the design /j,. Let % denote the set of all probability measures on X with the fixed cr-algebra, and M = {M(fj) : // E (f> : M — Mt-xk be a symmetric matrix-valued function. The

PAGE 79

73 problem of interest is to determine which maximizes <£(M(/i)) over all probability measures. Any such will be called ^-optimal. Proposition 4.3.1. M = conv({f{x)f(xf : x G X}). Proof. Since M is a convex set, and {f(x)f(x) t : x G A} C M, so conv({f(x)f(xY :xeX})cM. Next since A is compact, and / is continuous, thus {/(*)/(*)' : x G A} C A4 is compact. Hence conv({f(x)f{xY : x G X}) = 7Sm({f(x)f(x)* : a; G #}). Also since A4 C conv({f(x)f(x) 1 : :r G A'}), hence M = cont;({/(x)/(x)* : x G A}). From the above proposition and Caratheordory's theorem, the following is true. Corollary 4.3.1. For any M(/x) G M, there exists X{ G X,i — I,..., I, I < + 1, such that where A; > 0, S/ =1 A 8 = 1. If M(/i) is a boundary point of M, the inequality involving / can be reduced to / < m (™ +1 ) . From the practical point of view, this corollary is extremely important. For it means that if 4> is maximal at M*, then M* can always be expressed as M (//*), where //„ is a discrete design measure supported by at most m (™ +l ) + i points.

PAGE 80

74 Now we are in the position to prove the fundamental theorem in optimum design theory, which is a generalization of the result in Silvey (1980) to the matrix valued case. Theorem 4.3.1. (A) If 0 is a concave function on M, then M(fi*) is ^-optimal if and only if F^(M(^),M(/i)) <0, for all fi 6 H; (B) If 0 is a concave function on M, which is Gateaux differentiable on M, then M((i t ) is 0-optimal if and only if ^(M(^),/(*)/(*)')<0, for all x e X; (C) If 4> is Gateaux differentiable at M (//*), and M(/i*) is (f) optimal, then {xi G X : M{^) = T ii \ i f(xi)f{x i )\ Xi > 0,S,A, = 1} C{xeX:F < / > (M(n.),f(x)f(x) t )=0}. Proof. They are easy consequences of Theorem 4.2.1 4.2.3. Next we apply Theorem 4.3.1 to study the relationship between D and G optimal designs. The D-optimality criterion is defined by the criterion function 0[M(/*)] = log det[M(fi)l if det[M(n)} ^ 0 = -oo, if det[M(n)\ = 0. (4.3.8) li* € % is said to be D-optimal if //, maximizes 0. Let M denote the set of all positive definite matrices, then 0 has the following properties: (a) (j) is continuous on M.\ (b) 0 is concave on M. \

PAGE 81

75 (c) is Gateaux differentiable at M\ if it is nonsingular, and F 4> (M ] ,M 2 ) = tr(M 2 M[ l )-k. Proof, (a) The continuity of 4> follows from the continuity of det. (b) We want to show that, for every Ae (0,1), and Mi, M 2 G M, (f)[(l X)M l + AM 2 ] > (1 A)0(Mi) + A0(M 2 ). This inequality is obvious if either Mi or M 2 is singular. Thus we only need to prove the inequality if both Mi and M 2 are nonsingular. From a standard result from matrix algebra, there is a nonsingular matrix U such that UMxU* = I, UM 2 U l = A = diag(X u . . . , A fc ). Using the concavity of log, 0[(1 A) Afi + AA/ 2 ] = logde^CT^l A)/ + AA]£T U } > logdet^ + Ef^AlogAi = (1 A) logde*f/2 + Alogdet(C/1 A£/-") = (1 A)0(Mi) + A0(M 2 ). (c) For nonsingular matrix Mi, we have (/.(Mi + eM 2 ) (/.(Mi) = log(iet(/ + eM 2 Mf 1 ) = log{l + (tr{M 2 M^)} + 0(e 2 ) = etr{M 2 M[ 1 ) + 0(e 2 ). Thus F 0 (Mj, M 2 ) = *r[(M 2 Mi) M^ 1 ] = tr{M 2 M; 1 ) k. (4.3.9)

PAGE 82

76 The G-optimality criterion is defined by the criterion function 0[M(//)] = m a xf t (x)Ml ( f i)f(x) if detM(fi) / 0, = oo, if detM(n) = 0. (4.3.10) A design is said to be G-optimal if 4>{M{^)\ < ). From Theorem 4.3.1 and (4.3.9), /z* is D-optimal if and only if tr{f(x)f\x)M(ii*)~ l ] < k, for all x€ X, that is maxf t (x)Ml {fi m )f(x) < k. On the other hand, for any ji G H such that M(fi) is nonsingular, tr[Mx {n)M(n)] = k. Hence, k = va^f\x)M-\^)f{x) < m a xf t (x)M1 ( t x)f(x), for any // G % such that M(fi) is nonsingular, therefore, //* is G-optimal.

PAGE 83

77 {<=) Now suppose that Hi is G-optimal, then from the definition, M(/ii) is nonsingular. Let /x* be any D-optimal design. Then k l
PAGE 84

78 to be the mixture density corresponding to mixing distribution Q. Since the densities {fg} correspond to the atomic mixing distribution {5(9)}, which assign probability one to any set containing 9, they are called the atomic densities. A finite discrete mixing distribution with support size J will be expressed as Q — HjiTj5(9j), and the 0j's are distinct, 7Tj > 0, = 1. Given a random sample X\ , . . . , X n from the mixture density /q, the objective will be to estimate the mixing distribution Q by Q n , a maximizer of the likelihood HQ) = n? =1 f Q ( Xi ). Now suppose that the observation vector (x\,... ,x n ) has K distinct data points 2/i> • • • > UK, and let be the number of x's which equals to y k . Define the atomic and mixture likelihood to be fg = (fe(yi), , feivx)), and f Q = (f Q (yi), . . . , /o(j//f)), respectively. The likelihood curve is the function from 0 to R defined by 9 — > fg. The orbit of this curve, given by T = {fg : 9 G 6}, represents all possible fitted values of the atomic likelihood vector. Then conv(F) = {/q : Q € H, \support(Q)\ < oo}, where denotes the cardinality of A. Furthermore, if 6 is compact and fg is a continuous function of 9, then conv{T) = {/q : Q G %}. In this case, maximizing L(Q) over Q £ H may be accomplished by maximizing the concave functional 0(/) = T, k nklogfk over / in the if-dimensional set conviT). Note that (/) is a strict concave function of /. Now we are in the position to state the fundamental result about mixture distributions. Theorem 4.4.1 (Lindsay). Suppose that 0 is compact, and fg is continuous. (A) There exists a unique vector / on the boundary of conv(Y) which maximizes the log likelihood (/) on conv(T). f can be expressed as /q, where Q has K or fewer points of support.

PAGE 85

79 (B) The measure Q which maximizes log L(Q) can be equivalently characterized by three conditions: (i) Q maximizes L(Q); (ii) Q minimizes sup fl D(9;Q); (iii) sup 9 D(6; Q) = 0 . (C) The point (/, /) is a saddle point of <3>, in the sense that *(/*,/) <(> = *(/,/) <*(/,/<*), for all Q 0 ,Qi G H. (D) . The support of Q is contained in the set of 9 for which D(6, Q) = 0. Proof. The results are easy consequences of Caratheodory's theorem and Theorem 4.2.5. 4.5 Asymptotic Minimaxitv of Estimating Functions In this section, the famous asymptotic minimaxity result due to Huber (1964) will be generalized. First we formulate the problem of interest. Let 0 be an open subset of R k , X be the sample space, a function is called an unbiased estimating function if E[g(X;6)\6} = 0, for all 0 G 0. An unbiased estimating function is called regular if is nonsingular for all 9 G 0.

PAGE 86

80 In the rest of this section, the regularity conditions (I) (IV) of Section 3.3.1 for estimating functions will always be assumed. Let C be a convex set of distribution functions such that every F £ C has an absolutely continuous density / satisfying W Bl (2te/, ( 2^Z,.|F.H, (4.5.12) is positive definite. Let L be the space of unbiased estimating functions with respect to C, that is every element of L is unbiased with respect to every distribution in C. Let <3> 0 be the subset of L which consists of all regular unbiased estimating functions in L. Consider the function K : 0 x C — M kxk , defined by K{4>,F) = E^F^E^IF^-'E^F,^ (4.5.13) for all (j) e $o, F f= Elg^F}. For every F G C, the orthogonal projection of the score function of F into the subspace L 0 , with respect to the inner product <.,.>/? (if it exists), is denoted by F . Lemma 4.5.1. (a) For any (it, v) € R x R + , the function defined by u 2 h(u,v) = — , v

PAGE 87

8 1 is convex, that is for any (ui, Vi) G R x R + , i = 1, 2, A G (0, 1) (A, 1 + (l-A)u 2 )^ A u| + (i _ A) u| ; Aui + (1 X)v 2 vi v 2 (b) For any (Mi,M 2 ) G M kxk x M^ xfc , where M^" xfc denotes the set of all k x k positive definite matrices, the matrix valued function defined by J(M U M 2 ) = M\M^M U is convex in the sense that, for any (M x , M 2 ), (M 3 , M 4 ) G M kxk x M fc + xfc , A G (0, 1), J(A) = [AM, + (1 A)A/ 3 ]'[Ail/ 2 + (1 A)A/ 4 ]" 1 [AA/ 1 + (1 A)M 3 ], is convex in A. Proof, (a) dh 2u dh u 2 du i V dv V 2 ' d 2 h 2 d 2 h 2u d 2 h _ 2u 2 d 2 u — i V dudv v 2 ' d 2 v v 3 The matrix 2/v -2u/v 2 -2u/v 2 2u 2 /v 3 is non negative definite, so h is convex. (b) By straightforward calculation, and using repeatedly the relation ~dX~ ~ ~ ~d\ one gets, = (Mi M 3 f[XM 2 + (1 A)M 4 ]" 1 [AM 1 + (1 A)M 3 ] -[AAf 1 + (l-A)M 3 ] t [AM 2 +(l-A)M4]-HM2-M 4 )[AM 2 +(l-A)M4]" 1 [AM 1 + (l-A)M3]

PAGE 88

S2 +[AM X + (1 X)M 3 ] t [XM 2 + (1 A)M 4 ] _1 (M 1 M 3 ), (4.5.14) and d 2 J(X) cPX 2{(Mi M 3 )'[AM 2 + (1 A)M 4 ]" 1 (M 1 M 3 ) [AMi + (1 X)M 3 ] t [XM 2 + (1 A)Af 4 ]" 1 (A/ 2 M 4 )[AM 2 + (1 A)M 4 ] _1 (M 2 M A )[XM 2 + (1 A)]^] -1 ^ + (1 A ) M a] -(Mj MaJ'fAMa + (1 A)M 4 ]" 1 (Af 2 M 4 )[AM 2 + (1 X)M 4 }~ l [XM 1 + (1 A)M 3 ] [AMi + (1 A)M 3 ]'[AM 2 + (1 A)M 4 ]" 1 (M 2 Af 4 )[AM 2 + (1 X)M 4 ]' 1 (M 1 M 3 )} = 2{AA* + B l B -ABB l A l ) A = (M 1 M 3 f[XM 2 + (1 A)M 4 ]" 1/2 , B = [XM 2 + (1 A)M 4 ]1/2 (Af 2 M 4 )[AM 2 + (1 A)M 4 ]" 1 [AM 1 + (1 A)M 3 ]. This completes the proof of the Lemma. Note that part (a) of Lemma 4.5.1 was proved by Huber (1964) by using a different argument. Also from part (b), = (Mj M 3 ) t M^M 3 M\M^ l {M 2 M A )M^M 3 + MlM^{M x M 3 ). 2(A B l )(A B 1 ) 1 > 0, (4.5.15) where (4.5.16) We will use this identity is Section 6.2. 4.5.1 One Dimensional Case In this subsection, a necessary and sufficient condition of the asymptotic minimaxity of estimating functions will be given when the parameter space is one-dimensional. This result generalizes Theorem 2 of Huber (1964).

PAGE 89

S3 Theorem 4.5.1. Suppose the parameter space is one dimensional. Then ((fi Fo , F 0 ) is a saddle point of A', that is K(4>,F 0 ) < K( Fo ,F 0 ) < K( Fo ,F), for all e 0, (4.5.17) Jx where /' denotes the derivative of / with respect to the parameter. Proof. Note that since 4> Fo is the orthogonal projection of s Fo into L 0 , K(,F 0 )< K( Fo ,F 0 ), for all 0 € This fact has been established in Chapter 2. Also for any F± G C, consider the function h Fl : [0, 1] > A, given by W " / J r*} k [(l-t)/. + tA] Then by (a) of Lemma 4.5.1, h Fl is a convex function, and by direct calculation, Ux^Fofodx) 2 Jx Jx [ foj'^dx I (t* 2 Fo g}dx, (4.5.19) where g — f x — f 0 . Since (p Fo is the orthogonal projection of s Fo into L 0 with respect to the inner product < ., . > Fo , /o f 0f o /o^= / F 0 -rfodx= I 2 Fo f 0 dx Jx Jx to Jx "

PAGE 90

84 Hence, h' Fl (0+) = / (24> Fo g' Fo , F 0 ) < A'(0f o , (1 t)F 0 + tFi) = M*)Now, since /i' F] (0+) > 0. [ {2 Fo9 '-(l) 2 FQ g)dx>0, where g = fa f 0 . If . Suppose that / (20 Fo %g)dx > 0, where g = f l — f 0 . Then from Theorem 4.2.1, h Fl is a monotone function in [0, 1]. Hence, h Fl (0) = A'(0 Fo ,F o ) < Ml) = K{ FQ ,F x ). Thus (0f o , F 0 ) is a saddle point of K. This completes the proof of Theorem 4.5.1. Corollary 4.5.1 (Huber) . Assume that F 0 G C such that I(F 0 ) < 1(F) for all F G C, and 0 O = ^ G Then ( 0 , F 0 ) is a saddle point of K. Proof. For any F x G C, consider the function h Fl (t)=w-t)F Q +tF 1 )= / (/ ° tlfr^? 2 ^ Then by (a) of Lemma 4.5.1, h Fl is convex, and attains its minimum at t = 0. Thus 0 < h' Fl (0+) = [ [2 S fsf S) 2 g]dx, (4.5.21) Jx In J a

PAGE 91

85 where g = f\ — foThe above equality follows from the Lebesgue dominated convergence theorem, and the facts that 1 |Cffl! _ iff] _> 2 &sf S?9, t jt Jo Jo Jo and l r (/ t ') 2 (ftf^ifl) 2 (fo? t ft fo fi fo uniformly in t G (0, 1). 4.5.2 Multi-Dimensional Case In this subsection, by using the geometry of optimal estimating functions proved in Chapter 2, a necessary and sufficient condition of the asymptotic minimaxity result for estimating functions in a multi dimensional parameter space will be given. This result generalizes one main result of Huber (1964) to the multi dimensional parameter space. Theorem 4.5.2. Suppose the parameter space is multi-dimensional. Then (F 0 > Fo) is a saddle point of K, that is K(,F 0 )< A'(0 Fo ,F o ) <^(0 Fo ,F), for all €$, and F € C, if and only if />-"' + f 0 is the orthogonal projection of sp 0 into Lq, K(,F 0 ) Fo ,Fo),

PAGE 92

86 for all 4> € $. This has been proved in Chapter 2. Also for any Fi £ C, consider the function J Fl : [0, 1] — > M kxk , given by J Fl (A) = ( fj Fo [(l A)^ + ^UxY( J^fM^ ~ A)/o + A/i]rfx) _1 (/^^[(l-A^ + A^frfa:). (4.5.23) From (b) of Lemma 4.5.1, J Fl is convex, and by direct calculation, J Fl (0+) = (Af 1 -M 3 ) t M 4 1 M 3 M|M 4 _1 (M 2 M A )M^M 3 + MlM^{M x M 3 ), where Since Fo is the orthogonal projection of s Fo into L 0 with respect to < .,. > Fo , M 3 = Af 4 . Hence, J Fl (0+) = (M a M 3 Y + (M, M 3 ) (M 2 M 4 ) = L [M % d -k ]t +{ %w )(0Fo)t 0f ° ^° (/ /o)]dx (4 5 24) On/?/ z/ . Suppose that (cf) Fo ,F 0 ) is a saddle point of K. Then for any F\ £ C, and every t £ (0, 1), J Fl (0) = tf(0 Fo , ^o) < #(f 0 , (1 <)F 0 + = J Fl (t).

PAGE 93

87 Thus from the definition of J' Fi (0+), J' Fl (0+) is non negative definite. Hence, is non negative definite, where g = fi — foIf. Now suppose that 99 st , d 9 is non negative definite, where g = fi — foThen from Theorem 4.2.1, J Fl is a monotone function in [0, 1]. Hence, J Fl (0) = A'(0 Fo ,F o ) < J Fl (l) = K{^F X ). Hence, ( Fo ,Fo) is a saddle point of K. This completes the proof. Corollary 4.5.2. Assume that F 0 G C is such that I(F 0 ) < 1(F) for all F e C, el and s Fo = ^ G Then (s Fo , F 0 ) is a saddle point of K. Proof. For any F\ 6 C, consider the function J Fl (A) = /((l-A)F 0 + AF 1 ) a(/o + A(/o-/i)),9(/o + A(/ 0 -/i)), t 1 -da;. /o + Hh fo) Then by (b) of Lemma 4.5.1, J Fl is convex, and attains its minimum at t = 0. Thus 4(0+) = J x [M%y + j|(
PAGE 94

1 §I±(9A\t d[o(d[o\t 9R(dL\t d[o(dJo\t 1 ^ 86 V 86 > _ 8&\ 86 > j <86 v 86 > _ 86 ^ 86 > A }\ fo f\ fo

PAGE 95

CHAPTER 5 SUMMARY AND FUTURE RESEARCH 5.1 Summary In this dissertation, we have studied optimal estimating functions through the introduction of the generalized inner product space. It turns out that, the orthogonal projection of the score function into the subspace of estimating functions (if it exists), is optimal in that subspace. Also, the estimating function theory in the Bayesian framework is studied. We have shown that the orthogonal projection of the posterior score function into a subspace of estimating functions (if it exists) is optimal in that subspace. The geometry of estimating functions in the presence of nuisance parameters is also studied. The geometric idea of conditional, marginal and partial likelihood inference become transparent when viewed as orthogonal projections of score functions into appropriate subspaces. Finally, a general result about matrix valued convex functions was also proved, and then this result was applied to study optimum experimental designs, mixture distributions and asymptotic minimaxity of estimating functions. 5.2 Future Research We have studied the geometry of estimating functions in the discrete setting; it will be of great interest to extend these results to the martingale framework. I believe that there is a lot of potential in pursuing a vigorious research in this direction. 89

PAGE 96

90 In the last decade, there are major advances in the study of geometry of optimum experimental designs. I believe most of these geometric results are direct consequences of the duality theory in convex analysis. As far as I know, the applications of duality theory to statistics are very limited. It will be of great interest to establish a general duality theory in the statistical framework.

PAGE 97

BIBLIOGRAPHY Amari, S. I. and Kumon, M. (1988), Estimation in the presence of infinitely many nuisance parameters geometry of estimating functions. Ann. Statist., 16 (3), 1044-1068. Bhapkar, V. P. (1972), On a measure of efficiency of an estimating equation. Sankhya A, 34, 467-472. Bhapkar, V. P. (1989), Conditioning on ancillary statistics and loss of information in the presence of nuisance parameters. J. Stat. Plan. Inf., 21, 139-160. Bhapkar, V. P. (1991a), Loss of information in the presence of nuisance parameters and partial sufficiency. J. Stat. Plan. Inf., 28, 195-203. Bhapkar, V. P. (1991b), Sufficiency, ancillarity, and information in estimating functions. Estimating Functions, edited by V. P. Godambe, Oxford University Press, New York, 241-254. Bhapkar, V. P. and Srinivasan, C. (1994), On Fisher information inequalities in the presence of nuisance parameters. Ann. Inst. Stat. Math., 46, 593-604. Breslow, N. E. and Clayton, D. G. (1993), Approximate inference in generalized linear mixed models. Journal of American Statistical Association, 88 (421), 9-25. Chaloner, K. and Larntz, K. (1989), Optimal Bayesian design applied to logistic regression experiments. J. Stat. Plan. Inf., 21, 1991-208. Cox, D. R. (1972), Regression models and life tables (with discussion). J. R. Stat. Soc. B, 34, 187-220. Cox, D. R. (1975), Partial likelihood. Biometrika, 62, 269-276. Crowder, M. (1995), On the use of a working correlation matrix in using generalized linear models for repeated measures. Biometrika, 82, 407-410. DasGupta, A. and Studden, W. (1991), Robust Bayesian designs. Ann. Stat., Desmond, A. F. (1991), Quasi-likelihood, stochastic processes, and optimal estimating functions. Estimating Functions, edited by V. P. Godambe. Oxford University Press, New 91

PAGE 98

92 York, 133-146. Dette, H. (1993), Eflving's theorem for D-optimality. Ann, Stat. , 21 (2), 753-766. Dette, H. and Stridden, W. J. (1993), Geometry of £-optimality. Ann. Stat, 21 (1), 416-433. Diggle, P., Liang, K. Y. and Zeger, S. L. (1994), Analysis of Longitudinal Data. Oxford University Press, New York. Durbin, J. (1960). Estimation of parameters in time series regression models. J. R. Stat. Soc. B, 22, 139-153. Efron, B. and Stein, C (1981) , The jackknife estimate of variance. Ann. Statist., 9 (2), 586-596. Elfving, G. (1952), Optimum allocation in linear regression. Ann. Math. Stat. , 23, 255-262. Elfving, G. (1959), Design of linear experiments. Cramer Restschrift Volume. Wiley, New York. 58-58. El-Krunz, S. M. and Studden, W. J. (1991), Bayesian optimal designs for linear regression models, Ann. Statist, 19 (4), 2183-2208. Ferreira, P. E. (1981), Extending Fisher's measure of information. Biometrika, 68, 695-698. Ferreira, P. E. (1982), Estimating equations in the presence of prior knowledge. Biometrika. 69, 667-669. Firth, D. (1987), On the efficiency of quasi-likelihood estimation. Biometrika, 74, 233-245. Ghosh, M. (1990), On a Bayesian analog of the theory of estimating function. D. G. Khatri Memorial Volume, Gujarat Stat. Rev., 47-52. Ghosh, M. and Rao, J. N. K. (1994), Small area estimation: an appraisal (with discussion). Statistical Sciences, 9 (1), 55-93. Godambe, V. P. (1960). An optimum property of a regular maximum likelihood estimation. Ann. Math. Stat. 31, 1208-1212. Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika, 63, 277-284. Godambe, V. P. (1985). The foundations of finite sample estimation in stochastic processes. Biometrika, 72, 419-428. Godambe, V. P. (1994). Linear Bayes and optimal estimation, preprint.

PAGE 99

93 Godambe, V. P. and Heyde, C. C. (1987), Quasi-likelihood and optimal estimation. Int. Stat. Rev., 55, 231-244. Godambe, V. P. and Kale, (1991), Estimating functions: an overview, Estimating Functions. Ed. V. P. Godambe, Oxford Unviersity Press, New York, 2-30. Godambe, V. P. and Thompson, M (1974) Estimating equations in the presence of nuisance parameters. Ann. Stat., 2 (3), 568-571. Godambe, V. P. and Thompson, M. E. (1989). An extension of quasi-likelihood estimation (with discussions). J. Stat. Plan. Inf., 22, 137-172. Haines, L. M. (1995), A geometric approach to optimal design for one-parameter non-linear models. J. R. Stat. Soc. B, 57 (3), 575-598. Hoeffding, W. (1992), A class of statistics with asymptotically normal distributuion. Breakthroughs in Statistics, Vol.1, 308-334. Huber, P. J. (1964), Robust estimation of a location parameter. Ann. Math. Stat, 35, 73101. Huber, P. (1980), Robust Statistics. John Wiley and Sons, New York. Heyde, C. C. (1989), Quasi-likelihood and optimality for estimating functions: some current unifying themes. Bull. Inter. Stat. Inst., 1, 19-29. Karlin, S. and Studden, W. J. (1966), Optimal experimental designs. Ann. Math. Stat, 37, 783-815. Kale, B. K. (1962). An extension of Cramer-Rao inequality for statistical estimation functions. Skand. Aktur., 45 , 60-89. Kiefer, J. (1959), Optimum experimental designs. J. Roy. Stat. Soc. B, 21, 272-319. Kiefer, J. (1974), General equivalence theory for optimum designs (approximate theory). Ann. Stat, 2, 849-879. Kiefer, J. and Wolfowitz, J. (1959), Optimum designs in regression problems. Ann. Math. Stat., 30, 271-294. Kiefer, J. and Wolfowitz, J. (1960), The equivalence of two extremum problems. Can. J. Math., 14, 363-366. Kumon, M. and Amari, S. I. (1984), Estimation of structural parameters in the presence of a large number of nuisance parameters. Biometrika, 71 (3), 445-459.


Laird, N. M. (1978), Nonparametric maximum likelihood estimation of a mixing distribution. J. Amer. Stat. Assoc., 73, 805-811.
Liang, K. Y. and Waclawiw, M. A. (1990), Extension of the Stein estimating procedure through the use of estimating functions. Journal of the American Statistical Association, 85 (410), 435-440.
Liang, K. Y. and Zeger, S. L. (1986), Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22.
Liang, K. Y. and Zeger, S. L. (1995), Inference based on estimating functions in the presence of nuisance parameters (with discussions). Statistical Science, 10, 158-199.
Liang, K. Y., Zeger, S. L. and Qaqish, B. (1992), Multivariate regression analysis for categorical data (with discussion). J. R. Stat. Soc. B, 54, 3-40.
Lindsay, B. G. (1981), Properties of the maximum likelihood estimator of a mixing distribution. Statistical Distributions in Scientific Work, edited by G. P. Patil, Vol. 5. Reidel, Boston, 95-109.
Lindsay, B. G. (1982), Conditional score functions: some optimality results. Biometrika, 69, 503-512.
Lindsay, B. G. (1983a), The geometry of mixture likelihoods: a general theory. Ann. Stat., 11 (1), 86-94.
Lindsay, B. G. (1983b), The geometry of mixture likelihoods, Part II: the exponential family. Ann. Stat., 11 (3), 783-792.
Lindsay, B. G. (1995), Mixture Models: Theory, Geometry and Applications. Institute of Mathematical Statistics, Vol. 5.
Lloyd, C. J. (1987), Optimality of marginal likelihood estimating equations. Comm. Stat., Theory and Meth., 16, 1733-1741.
McCullagh, P. (1983), Quasi-likelihood functions. Ann. Statist., 11 (1), 59-67.
McCullagh, P. and Nelder, J. A. (1989), Generalized Linear Models. 2nd Ed. Chapman and Hall, London.
McGilchrist, C. A. (1994), Estimation in generalized mixed models. J. R. Statist. Soc. B, 56 (1), 61-69.
McLeish, D. L. and Small, C. G. (1992), A projected likelihood function for semiparametric models. Biometrika, 79 (1), 93-102.


Morris, C. N. (1983), Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association, 78 (381), 47-55.
Murphy, S. and Li, B. (1995), Projected partial likelihood and its application to longitudinal data. Biometrika, 82, 399-406.
Nelder, J. A. and Wedderburn, R. W. M. (1971), Generalized linear models. Breakthroughs in Statistics, Vol. 2, Springer-Verlag, New York, 547-563.
Pazman, A. (1986), Foundations of Optimum Experimental Design. D. Reidel Publishing Company, Boston.
Raghunathan, T. E. (1993), A quasi-empirical Bayes method for small area estimation. Journal of the American Statistical Association, 88 (424), 1444-1448.
Schall, R. (1991), Estimation in generalized linear models with random effects. Biometrika, 78 (4), 719-727.
Serfling, R. J. (1980), Approximation Theorems of Mathematical Statistics. John Wiley and Sons Inc., New York.
Shaked, M. (1980), On mixtures from exponential families. J. R. Stat. Soc. B, 42 (2), 192-198.
Silvey, S. D. (1980), Optimal Design. Chapman and Hall, London.
Small, C. G. and McLeish, D. L. (1994), Hilbert Space Methods in Probability and Statistical Inference. John Wiley and Sons, Inc., New York.
Small, C. G. and McLeish, D. L. (1988), Generalization of ancillarity, completeness and sufficiency in an inference function space. Ann. Stat., 16, 534-551.
Small, C. G. and McLeish, D. L. (1989), Projection as a method for increasing sensitivity and eliminating nuisance parameters. Biometrika, 76, 693-703.
Studden, W. J. (1971), Elfving's theorem and optimal designs for quadratic loss. Ann. Math. Stat., 42 (5), 1613-1621.
Waclawiw, M. A. and Liang, K. Y. (1994), Empirical Bayes estimation and inference for the random effects models with binary response. Statistics in Medicine, 13, 541-551.
Waclawiw, M. A. and Liang, K. Y. (1993), Prediction of random effects in the generalized linear model. Journal of the American Statistical Association, 88 (421), 171-178.
Wedderburn, R. W. M. (1974), Quasi-likelihood functions, generalized linear models and the Gauss-Newton method. Biometrika, 61 (3), 439-447.


Whittle, P. (1973), Some general points in the theory of optimal experimental design. J. Roy. Stat. Soc. B, 35, 123-130.
Zeger, S. L. and Liang, K. Y. (1986), Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42, 121-130.
Zeger, S. L. and Liang, K. Y. (1992), An overview of methods for the analysis of longitudinal data. Statistics in Medicine, 11, 1825-1839.
Zeger, S. L., Liang, K. Y. and Albert, P. S. (1988), Models for longitudinal data: a generalized estimating equation approach. Biometrics, 44, 1049-1060.


BIOGRAPHICAL SKETCH

Schultz Chan received his first Ph.D., in mathematical physics, from the University of Iowa in August 1992 and came to the University of Florida as a postdoctoral research associate. After completing his postdoctoral appointment in August 1994, and realizing that he wanted to change careers, he joined the Statistics Department to study applied statistics. He expects to receive his second Ph.D. this August and is ready to face the real world.


I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Malay Ghosh, Chairman
Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Richard Scheaffer
Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Ramon Littell
Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

James Booth
Associate Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Murali Rao
Professor of Mathematics


This dissertation was submitted to the Graduate Faculty of the Department of Statistics in the College of Liberal Arts and Sciences and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August 1996

Dean, Graduate School