From Hölder's inequality for positive exponents, with p = 1 + β, q = (1 + β)/β and g(θ) = I^{1/2}(θ), we obtain

∫ π^{1+β}(θ) I^{−β/2}(θ) dθ ≥ [ ∫ I^{1/2}(θ) dθ ]^{−β},

with equality if and only if π(θ) ∝ I^{1/2}(θ).
When β < −1, using Hölder's inequality for negative exponents with p = 1 + β < 0, the inequality above is reversed, and the same prior π(θ) ∝ I^{1/2}(θ) is again extremal.
The reference prior is obtained by maximizing the expected chi-square distance between the prior distribution and the corresponding posterior. By (4-29), this amounts to maximizing, with respect to the prior π(θ), an integral functional involving I(θ), π'(θ) and π''(θ), up to o(n^{−1/2}) terms. To simplify this further, we use the substitution y(θ) = π'(θ)/π(θ), so that (4-30) reduces to

∫ [ y²(θ) − 3 y'(θ) ] / (2 I^{1/2}(θ)) dθ.  (4-31)

Maximizing the last expression with respect to y(θ), and noting that (d/dθ) I^{−1/2}(θ) = −I'(θ)/(2 I^{3/2}(θ)), one gets the Euler-Lagrange equation

∂L/∂y − (d/dθ)(∂L/∂y') = 0

for (4-31). This is equivalent to

y(θ) = 3 I'(θ)/(4 I(θ)),
thereby producing the reference prior π(θ) ∝ I^{3/4}(θ).
…[ 5 ] dominates the sample mean in three or higher dimensions under a general divergence loss which includes the Kullback-Leibler (KL) and Bhattacharyya-Hellinger (BH) losses ([ 13 ]; [ 38 ]) as special cases. An analogous result is found for estimating the predictive density of a normal variable with the same mean and a known, but possibly different, scalar multiple of the identity matrix as its variance. The results are extended to accommodate shrinkage towards a regression surface.

These results are extended to the estimation of the multivariate normal mean with an unknown variance-covariance matrix. First, it is shown that for an unknown scalar multiple of the identity matrix as the variance-covariance matrix, a general class of estimators along the lines of Baranchik [ 5 ] and Efron and Morris [ 30 ] continues to dominate the sample mean in three or higher dimensions. Second, it is shown that even for an unknown positive definite variance-covariance matrix, the dominance continues to hold for a general class of suitably defined shrinkage estimators.

Also, the problem of prior selection for an estimation problem is considered. It is shown that the first order reference prior under divergence loss coincides with Jeffreys' prior.
…[ 34 ] between estimation and prediction problems, if such an identity exists. …[ 2 ].
The author was born in Korukivka, Ukraine, in 1973. He received the Specialist and Candidate of Science degrees in Probability Theory and Statistics from Kiev National University of Taras Shevchenko in 1997 and 2001, respectively. In 2001 he came to UF to pursue a Ph.D. degree in the Department of Statistics.



DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION AND PRIOR SELECTION

By VICTOR MERGEL

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA, 2006

Copyright 2006 by Victor Mergel

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor, Dr. Malay Ghosh, for his support and professional guidance. Working with him was not only enjoyable but also a very valuable personal experience. I would also like to thank Michael Daniels, Panos M. Pardalos, Brett Presnell, and Ronald Randles for their careful reading of this dissertation and their extensive comments.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
CHAPTER
1 INTRODUCTION AND LITERATURE REVIEW
  1.1 Statistical Decision Theory
  1.2 Literature Review
    1.2.1 Point Estimation of the Multivariate Normal Mean
    1.2.2 Shrinkage towards Regression Surfaces
    1.2.3 Baranchik Class of Estimators Dominating the Sample Mean
  1.3 Shrinkage Predictive Distribution for the Multivariate Normal Density
    1.3.1 Shrinkage of Predictive Distribution
    1.3.2 Minimax Shrinkage towards Points or Subspaces
  1.4 Prior Selection Methods and Shrinkage Argument
    1.4.1 Prior Selection
    1.4.2 Shrinkage Argument
2 ESTIMATION, PREDICTION AND THE STEIN PHENOMENON UNDER DIVERGENCE LOSS
  2.1 Some Preliminary Results
  2.2 Minimaxity Results
  2.3 Admissibility for p = 1
  2.4 Inadmissibility Results for p ≥ 3
  2.5 Lindley's Estimator and Shrinkage to Regression Surface
3 POINT ESTIMATION UNDER DIVERGENCE LOSS WHEN VARIANCE-COVARIANCE MATRIX IS UNKNOWN
  3.1 Preliminary Results
  3.2 Inadmissibility Results when Variance-Covariance Matrix is Proportional to Identity Matrix
  3.3 Unknown Positive Definite Variance-Covariance Matrix
4 REFERENCE PRIORS UNDER DIVERGENCE LOSS
  4.1 First Order Reference Prior under Divergence Loss
  4.2 Reference Prior Selection under Divergence Loss for One Parameter Exponential Family
5 SUMMARY AND FUTURE RESEARCH
  5.1 Summary
  5.2 Future Research
REFERENCES
BIOGRAPHICAL SKETCH

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION AND PRIOR SELECTION

By Victor Mergel

August 2006

Chair: Malay Ghosh
Major Department: Statistics

In this dissertation, we consider the following problems: (1) estimate a normal mean under a general divergence loss, and (2) find a predictive density of a new observation drawn independently of the sampled observations from a normal distribution with the same mean but possibly with a different variance, under the same loss. The general divergence loss includes as special cases both the Kullback-Leibler and Bhattacharyya-Hellinger losses. The sample mean, which is a Bayes estimator of the population mean under this loss and the improper uniform prior, is shown to be minimax in any arbitrary dimension. A counterpart of this result for the predictive density is also proved in any arbitrary dimension.
The admissibility of these rules holds in one dimension, and we conjecture that the result is true in two dimensions as well. However, the general Baranchik class of estimators, which includes the James-Stein estimator and the Strawderman class of estimators, dominates the sample mean in three or higher dimensions for the estimation problem. An analogous class of predictive densities is defined, and any member of this class is shown to dominate the predictive density corresponding to a uniform prior in three or higher dimensions. For the prediction problem, in the special case of Kullback-Leibler loss, our results complement to a certain extent some of the recent important work of Komaki and George et al. While our proposed approach produces a general class of empirical Bayes predictive densities dominating the predictive density under a uniform prior, George et al. produce a general class of Bayes predictors achieving a similar dominance. We show also that various modifications of the James-Stein estimator continue to dominate the sample mean, and by the duality of the estimation and predictive density results which we will show, similar results continue to hold for the prediction problem as well. In the last chapter we consider the problem of objective prior selection by maximizing the distance between the prior and the posterior. We show that the reference prior under divergence loss coincides with Jeffreys' prior except in one special case.

CHAPTER 1
INTRODUCTION AND LITERATURE REVIEW

1.1 Statistical Decision Theory

Statistical Decision Theory primarily consists of three basic elements: the sample space X, the parameter space Θ, and the action space A. We assume that an unknown element θ ∈ Θ labels the otherwise known distribution. We are concerned with inferential procedures for θ using the sampled observations x (real or vector valued). A decision rule δ is a function with domain space X and range space A. Thus, for each x ∈ X we have an action a = δ(x) ∈ A.
For every θ ∈ Θ and δ(x) ∈ A, we incur a loss L(θ, δ(x)). The long-term average loss associated with δ is the expectation E_θ[L(θ, δ(X))]; this expectation is called the risk function of δ and will be denoted by R(θ, δ). Since the risk function depends on the unknown parameter θ, it is very often impossible to find a decision rule that is optimal for every θ. Thus the statistician needs to restrict attention to decision rules with some optimality property, such as Bayes, minimax, or admissible rules. The method required to solve the statistical problem at hand depends strongly on the parametric model considered (the class P = {P_θ, θ ∈ Θ} to which the distribution of X belongs), the structure of the decision space, and the choice of loss function. The choice of the decision space often depends on the statistical problem at hand. For example, two-decision problems are used in testing of hypotheses; for point estimation problems, the decision space often coincides with the parameter space. The choice of the loss function is up to the decision maker, and it is supposed to evaluate the penalty (or error) associated with the decision δ when the parameter takes the value θ. When the setting of an experiment is such that the loss function cannot be determined, the most common option is to resort to classical losses such as quadratic loss or absolute error loss. Sometimes the experiment settings are very uninformative, and the decision maker may need to use an intrinsic loss, such as the general divergence loss considered in this dissertation. This is discussed, for example, in [56]. In this dissertation, we mostly look at the point estimation problem of the multivariate normal mean under general divergence loss, and we also consider the prediction problem, where we are interested in estimating the density function f(x | θ) itself. In multidimensional settings, for dimensions high enough, the best invariant estimator is not always admissible. There often exists a class of estimators that dominates the intuitive choice.
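As a concrete illustration of why no single rule is optimal for every θ (our own toy example, not from the dissertation), compare two estimators of a N(θ, 1) mean from one observation under squared error loss; the risk formulas below follow from Var(aX) = a² and bias (a − 1)θ:

```python
# Risk functions R(theta, delta) = E_theta[(delta(X) - theta)^2] for two
# estimators of a N(theta, 1) mean based on one observation X:
#   delta1(x) = x      -> constant risk 1 (variance only, no bias)
#   delta2(x) = x / 2  -> risk Var(X/2) + bias^2 = 1/4 + theta^2 / 4
# Neither risk function lies below the other for all theta, which is why
# decision theory restricts attention to Bayes, minimax, or admissible rules.

def risk_identity(theta: float) -> float:
    return 1.0

def risk_half(theta: float) -> float:
    return 0.25 + 0.25 * theta ** 2

# delta2 wins near theta = 0, loses for |theta| > sqrt(3):
print(risk_half(0.0) < risk_identity(0.0))   # True
print(risk_half(3.0) > risk_identity(3.0))   # True
```

The two risk curves cross at θ² = 3, so neither estimator dominates the other.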
For quadratic loss this effect was first discovered by Stein [58]. In this dissertation, we consider the estimation and prediction problems simultaneously under a broader class of losses to examine whether the Stein effect continues to hold. Since many results for estimation and prediction problems for the multivariate normal distribution, as considered in this dissertation, have some inherent similarities with the parallel theory of estimating a multivariate normal mean under quadratic loss, we begin with a literature review of known results.

1.2 Literature Review

1.2.1 Point Estimation of the Multivariate Normal Mean

Suppose X ~ N(θ, I_p). For estimating the unknown normal mean θ under quadratic loss, the best equivariant estimator is X, the MLE, which is also the posterior mean under the improper uniform prior; see [56], pp. 429-431. Blyth [14] showed that this estimator is minimax and admissible when p = 1. Unfortunately, this estimator may fail to be admissible in multidimensional problems. For simultaneous estimation of p (≥ 2) normal means, this natural estimator is admissible for p = 2, but it is inadmissible for p ≥ 3 for a wide class of losses. This fact was first discovered by [58] for the sum of squared error losses, i.e., when L(θ, a) = ||θ − a||². The inadmissibility result was later extended by Brown [17] to a wider class of losses. For the sum of squared error losses, an explicit estimator dominating the sample mean was proposed by James and Stein [42]. For estimating the multivariate normal mean, Stein [58] recommended using "spherically symmetric estimators" of θ since, under the loss L(θ, a) = ||θ − a||², X is an admissible estimator of θ if and only if it is admissible in the class of all spherically symmetric estimators. The definition of spherically symmetric estimators is as follows.

Definition 1.2.1.1. An estimator δ(X) of θ is said to be spherically symmetric if and only if δ(X) has the form δ(X) = h(||X||²)X.

Stein used this result and the Cramer-Rao inequality to prove the admissibility of X for p = 2. Later, Brown and Hwang [19] provided a Blyth-type argument proving the same result. As mentioned earlier, X is a generalized Bayes estimator of θ ∈ R^p under the loss L(θ, a) = ||θ − a||² and the uniform prior. Stein [58] showed the existence of a, b such that (1 − b/(a + ||X||²))X dominates X for p ≥ 3. Later, James and Stein [42] produced the explicit estimator δ(X) = (1 − (p − 2)/||X||²)X, which dominates X for p ≥ 3. Efron and Morris [28] show how James-Stein estimators arise in an empirical Bayes context. A good review of empirical Bayes (EB) and hierarchical Bayes (HB) approaches can be found in [35]. As described in [8], an EB scenario is one in which known relationships among the coordinates of a parameter vector allow use of the data to estimate some features of the prior distribution. Both EB and HB procedures recognize the uncertainty in the prior information. However, while the EB method estimates the unknown prior parameters in some classical way, such as the MLE or the method of moments, from the marginal distributions (after integrating θ out) of the observations, the HB procedure models the prior distribution in stages. To illustrate this, we begin with the following setup.

(A) Conditional on θ_1, ..., θ_p, let X_1, ..., X_p be independent with X_i ~ N(θ_i, σ²), i = 1, ..., p, σ² (> 0) being known. Without loss of generality, assume σ² = 1.

(B) The θ_i's have independent N(μ_i, A), i = 1, ..., p, priors.

The posterior distribution of θ given X = x is then N((1 − B)x + Bμ, (1 − B)I_p), where B = (1 + A)^{-1}. The posterior mean (the usual Bayes estimate) of θ is given by

E(θ | X = x) = (1 − B)x + Bμ.  (1-1)

Now consider the following three scenarios.

Case I. Let μ_1 = ... = μ_p = μ, where μ (real) is unknown, but A (> 0) is known. Based on the marginal distribution of X, X̄ is the UMVUE, MLE and best equivariant estimator of μ.
Thus, using the EB approach, an EB estimator of θ is

θ̂^(1)_EB = (1 − B)X + B X̄ 1_p.  (1-2)

This estimator was proposed by Lindley and Smith [51], who, however, used the HB approach. Their model was: (i) conditional on θ and μ, X ~ N(θ, I_p); (ii) conditional on μ, θ ~ N(μ 1_p, A I_p); (iii) μ is uniform on (−∞, ∞). Then the joint pdf of X, θ and μ is given by

f(x, θ, μ) ∝ exp[−(1/2)||x − θ||² − (1/(2A))||θ − μ 1_p||²].  (1-3)

Thus the joint pdf of X and θ is

f(x, θ) ∝ exp[−(1/2)(θ^T D θ − 2 θ^T x + x^T x)],  (1-4)

and the posterior distribution of θ given X = x is N(D^{-1}x, D^{-1}), where D = A^{-1}[(A + 1)I_p − p^{-1}J_p]. Thus one gets

E(θ | X = x) = (1 − B)x + B x̄ 1_p,  (1-5)

and

V(θ | X = x) = (1 − B)I_p + B p^{-1} J_p,  (1-6)

which gives the same estimator as under the EB approach. But the EB approach ignores the uncertainty involved in estimating the prior parameters, and thus underestimates the posterior variance. Lindley and Smith [51] have shown that the risk of θ̂^(1)_EB is not uniformly smaller than that of X under squared error loss. However, there is a Bayes risk superiority of θ̂^(1)_EB over X, as shown in the following theorem of Ghosh [35].

Theorem 1.2.1.2. Consider the model X | θ ~ N(θ, I_p) and the prior θ ~ N(μ 1_p, A I_p). Let E denote expectation over the joint distribution of X and θ. Then, using the loss L_1(θ, a) = (a − θ)(a − θ)^T, and writing θ̂_B as the Bayes estimator of θ under L_1,

E L_1(θ, X) = I_p;  E L_1(θ, θ̂_B) = (1 − B) I_p;  (1-7)

E L_1(θ, θ̂^(1)_EB) = (1 − B) I_p + B p^{-1} J_p.  (1-8)

Now considering the quadratic loss L_2(θ, a) = (a − θ)^T Q (a − θ), where Q is a known nonnegative definite weight matrix,

E L_2(θ, X) = tr(Q);  E L_2(θ, θ̂_B) = (1 − B) tr(Q);  (1-9)

E L_2(θ, θ̂^(1)_EB) = (1 − B) tr(Q) + B tr(Q p^{-1} J_p).  (1-10)

Case II. Assume that μ is known, but its components need not be equal. Also assume A to be unknown. Then ||X − μ||² is a complete sufficient statistic for B. Accordingly, for p ≥ 3, the UMVUE of B is given by (p − 2)/||X − μ||². Substituting this estimator of B in (1-1), an EB estimator of θ is given by

θ̂^(2)_EB = X − ((p − 2)/||X − μ||²)(X − μ).  (1-11)

This estimator is known as a James-Stein estimator (see [42]).
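The dominance of the James-Stein estimator over the sample mean for p ≥ 3 can be illustrated by simulation. The sketch below is our own (not part of the dissertation); it shrinks toward the origin rather than a general μ, and the choices p = 10 and θ_i = 0.5 are arbitrary. It estimates both risks under squared error loss by Monte Carlo:

```python
import random

def mc_risk(estimator, theta, n_rep=20000, seed=0):
    """Monte Carlo estimate of E_theta || estimator(X) - theta ||^2, X ~ N(theta, I_p)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        d = estimator(x)
        total += sum((di - ti) ** 2 for di, ti in zip(d, theta))
    return total / n_rep

def james_stein(x):
    # shrink toward the origin: (1 - (p - 2) / ||x||^2) x
    p = len(x)
    s = sum(xi * xi for xi in x)
    return [(1.0 - (p - 2) / s) * xi for xi in x]

theta = [0.5] * 10                       # p = 10
r_mean = mc_risk(lambda x: x, theta)     # theory: risk of X is exactly p = 10
r_js = mc_risk(james_stein, theta)       # strictly smaller for p >= 3
```

With ||θ||² moderate, the James-Stein risk comes out well below p, consistent with the dominance result quoted above.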
The EB interpretation of this estimator was given in a series of articles by Efron and Morris ([27], [28], [29]). James and Stein have shown that for p ≥ 3, the risk of θ̂^(2)_EB is smaller than that of X under the squared error loss. However, if the loss is changed to the arbitrary quadratic loss L_2 of the previous theorem, then the risk dominance of θ̂^(2)_EB over X does not necessarily hold. The estimator θ̂^(2)_EB dominates X under the loss L_2 ([8]; [15]) if (i) tr(Q) ≥ 2 ch_1(Q) and (ii) 0 < p − 2 ≤ 2[tr(Q)/ch_1(Q) − 2], where ch_1(Q) denotes the largest eigenvalue of Q. The Bayes risk dominance, however, still holds, as follows from the following theorem (see [35]).

Theorem 1.2.1.3. Let X | θ ~ N(θ, I_p) and θ ~ N(μ, A I_p). It follows that for p ≥ 3,

E[L_1(θ, θ̂^(2)_EB)] = I_p − B(p − 2)p^{-1} I_p,  (1-12)

and

E[L_2(θ, θ̂^(2)_EB)] = tr(Q) − B(p − 2)p^{-1} tr(Q).  (1-13)

Consider the HB approach in this case, with X ~ N(θ, I_p), θ ~ N(μ, A I_p), and A having the Type II beta density ∝ A^{n−1}(1 + A)^{−(m+n)}, with m > 0, n > 0; equivalently, B = (A + 1)^{-1} has a Beta(m, n) density. Using the iterated formula for conditional expectations,

θ̂_HB = E(θ | x) = E[E(θ | B, x) | x] = (1 − E(B | x))x + E(B | x)μ,  (1-14)

where

E(B | x) = ∫_0^1 B^{m+p/2}(1 − B)^{n−1} exp[−(B/2)||x − μ||²] dB / ∫_0^1 B^{m+p/2−1}(1 − B)^{n−1} exp[−(B/2)||x − μ||²] dB.  (1-15)

Strawderman [60] considered the case m = 1, and found sufficient conditions on n under which the risk of θ̂_HB is smaller than that of X. His results were generalized by Faith [31]. When m = 1, the posterior mode of B is

B_MO = min((p + 2n − 2)/||x − μ||², 1).  (1-16)

This leads to the estimator

θ̂_MO = (1 − B_MO)x + B_MO μ  (1-17)

of θ. When n = 0, this estimator becomes the positive-part James-Stein estimator, which dominates the usual James-Stein estimator.

Case III. We consider the same model as in Case I, except that now μ and A (> 0) are both unknown. In this case (X̄, Σ_{i=1}^p (X_i − X̄)²) is complete sufficient, so that the UMVUEs of μ and B are given by X̄ and (p − 3)/Σ_{i=1}^p (X_i − X̄)². The EB estimator of θ in this case is

θ̂^(3)_EB = X − ((p − 3)/Σ_{i=1}^p (X_i − X̄)²)(X − X̄ 1_p).  (1-18)

This modification of the James-Stein estimator was proposed by Lindley [50].
Whereas the James-Stein estimator shrinks X toward a specified point, the above estimator shrinks X towards the hyperplane spanned by 1_p. The estimator θ̂^(3)_EB is known to dominate X for p ≥ 4. Ghosh [35] has found the Bayes risk of this estimator under the L_1 and L_2 losses.

Theorem 1.2.1.4. Assume the model and the prior given in Theorem 1.2.1.2. Then for p ≥ 4,

E[L_1(θ, θ̂^(3)_EB)] = I_p − B(p − 3)(p − 1)^{-1}(I_p − p^{-1}J_p),  (1-19)

and

E[L_2(θ, θ̂^(3)_EB)] = tr(Q) − B(p − 3)(p − 1)^{-1} tr[Q(I_p − p^{-1}J_p)].  (1-20)

To find the HB estimator of θ in this case, consider the model where (i) conditional on θ, μ and A, X ~ N(θ, I_p); (ii) conditional on μ and A, θ ~ N(μ 1_p, A I_p); (iii) marginally, μ and A are independently distributed, with μ uniform on (−∞, ∞) and A having the uniform improper pdf on (0, ∞). Under this model, as shown in [52],

E(θ | x) = x − E(B | x)(x − x̄ 1_p),  (1-21)

and

V(θ | x) = V(B | x)(x − x̄ 1_p)(x − x̄ 1_p)^T + I_p − E(B | x)(I_p − p^{-1} J_p),  (1-22)

where

E(B | x) = ∫_0^1 B^{(p−3)/2} exp[−(B/2) Σ_{i=1}^p (x_i − x̄)²] dB / ∫_0^1 B^{(p−5)/2} exp[−(B/2) Σ_{i=1}^p (x_i − x̄)²] dB,  (1-23)

and

E(B² | x) = ∫_0^1 B^{(p−1)/2} exp[−(B/2) Σ_{i=1}^p (x_i − x̄)²] dB / ∫_0^1 B^{(p−5)/2} exp[−(B/2) Σ_{i=1}^p (x_i − x̄)²] dB.  (1-24)

Also, one can obtain a positive-part version of Lindley's estimator by substituting the posterior mode of B, namely min((p − 5)/Σ_{i=1}^p (x_i − x̄)², 1), in (1-21). Morris (1981) suggested approximations to E(B | x) and E(B² | x) involving the replacement of ∫_0^1 by ∫_0^∞ in both the numerator and the denominator of (1-23) and (1-24), leading to the approximations

E(B | x) ≈ (p − 3)/Σ_{i=1}^p (x_i − x̄)²,  E(B² | x) ≈ (p − 1)(p − 3)/{Σ_{i=1}^p (x_i − x̄)²}²,

so that

V(B | x) ≈ 2(p − 3)/{Σ_{i=1}^p (x_i − x̄)²}².

Morris [52] points out that these approximations amount to putting a uniform prior on A over (−1, ∞), which gives the approximation

E(θ | x) ≈ x − ((p − 3)/Σ_{i=1}^p (x_i − x̄)²)(x − x̄ 1_p),  (1-25)

which is Lindley's modification of the James-Stein estimator, with

V(θ | x) ≈ [2(p − 3)/{Σ_{i=1}^p (x_i − x̄)²}²](x − x̄ 1_p)(x − x̄ 1_p)^T + I_p − [(p − 3)/Σ_{i=1}^p (x_i − x̄)²](I_p − p^{-1} J_p).  (1-26)

1.2.2 Shrinkage towards Regression Surfaces

In the previous section, the sample mean was shrunk towards a point or a subspace spanned by the vector 1_p.
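Lindley's modification (1-18) can likewise be checked by simulation. The sketch below is our own (not the dissertation's code); the means θ_i = 2.0 + 0.1 i are an arbitrary configuration clustered near a common value, which is exactly the situation this estimator targets:

```python
import random

def lindley(x):
    """Lindley's modification (1-18): shrink each coordinate of x toward
    the grand mean x_bar; requires p >= 4."""
    p = len(x)
    xbar = sum(x) / p
    s = sum((xi - xbar) ** 2 for xi in x)
    return [xbar + (1.0 - (p - 3) / s) * (xi - xbar) for xi in x]

def mc_risk(estimator, theta, n_rep=20000, seed=0):
    """Monte Carlo estimate of E_theta || estimator(X) - theta ||^2, X ~ N(theta, I_p)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        d = estimator(x)
        total += sum((di - ti) ** 2 for di, ti in zip(d, theta))
    return total / n_rep

theta = [2.0 + 0.1 * i for i in range(8)]   # p = 8, means near a common value
r_mean = mc_risk(lambda x: x, theta)        # risk of X is exactly p = 8
r_lindley = mc_risk(lindley, theta)         # much smaller in this configuration
```

Because the rule shrinks toward the hyperplane spanned by 1_p rather than a fixed point, its gains do not depend on knowing the common level of the means in advance.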
Ghosh [35] synthesized the EB and HB methods to shrink the sample mean towards an arbitrary regression surface. The HB approach was discussed in detail in [51] with known variance components, while the EB procedure was discussed in [53]. The setup considered in [35] is as follows.

I. Conditional on θ, b and A, let X_1, ..., X_p be independent with X_i ~ N(θ_i, V_i), i = 1, ..., p, where the V_i's are known positive constants.

II. Conditional on b and A, the θ_i's are independently distributed with θ_i ~ N(z_i^T b, A), i = 1, ..., p, where z_1, ..., z_p are known regression vectors of dimension r and b is r × 1.

III. b and A are marginally independent, with b ~ uniform(R^r) and A ~ uniform(0, ∞).

We assume that p ≥ r + 3. Also, we write Z^T = (z_1, ..., z_p), G = Diag(V_1, ..., V_p), and assume rank(Z) = r. As shown in [35], under this model,

E(θ_i | x) = E[(1 − U_i)x_i + U_i z_i^T b̂ | x];  (1-27)

V(θ_i | x) = V[U_i(x_i − z_i^T b̂) | x] + E[V_i(1 − U_i) + A U_i² z_i^T (Z^T D Z)^{-1} z_i | x];  (1-28)

Cov(θ_i, θ_j | x) = Cov[U_i(x_i − z_i^T b̂), U_j(x_j − z_j^T b̂) | x] + E[A U_i U_j z_i^T (Z^T D Z)^{-1} z_j | x], i ≠ j,  (1-29)

where U_i = V_i/(A + V_i), b̂ = (Z^T D Z)^{-1}(Z^T D x), and D = Diag(1 − U_1, ..., 1 − U_p), i = 1, ..., p.

Morris [53] approximated E[θ_i | x] by x_i − û_i(x_i − z_i^T b̂) and V[θ_i | x] by v̂_i(x_i − z_i^T b̂)² + V_i(1 − û_i)[1 + û_i z_i^T (Z^T D̂ Z)^{-1} z_i], i = 1, ..., p. In the above, û_i = V_i/(V_i + Â), where Â is an estimator of A; D̂ = Diag(1 − û_1, ..., 1 − û_p); b̂ is recomputed with Â in place of A; and the v̂_i's are purported to estimate the V(U_i | x)'s. When V_1 = ... = V_p = V, so that u_1 = ... = u_p = V/(A + V) = U, D = (1 − U)I_p, Z^T D Z = (1 − U)Z^T Z, and b̂ = (Z^T Z)^{-1} Z^T x ≡ b̃, the following result holds:

E(θ_i | x) = x_i − E(U | x)(x_i − z_i^T b̃),  (1-30)

and

V(θ_i | x) = V(U | x)(x_i − z_i^T b̃)² + V − V E(U | x)(1 − z_i^T (Z^T Z)^{-1} z_i).  (1-31)

If one adopts Morris's approximations, then one estimates E(U | x) by Û = V(p − r − 2)/SSE and V(U | x) by [2/(p − r − 2)]Û², where SSE = Σ_i x_i² − (Σ_i x_i z_i)^T (Z^T Z)^{-1} (Σ_i x_i z_i).

1.2.3 Baranchik Class of Estimators Dominating the Sample Mean

In some situations, when prior information is available and one believes that the unknown θ is close to some known vector μ, it makes more sense to shrink the estimator X toward μ instead of 0. In such cases Baranchik [5] proposed a more general class of shrinkage minimax estimators dominating X. Let S = Σ_{i=1}^p (X_i − μ_i)² and φ_i(X) = −(τ(S)/S)(X_i − μ_i); then the estimator X + φ(X) dominates X under the quadratic loss function if the following conditions hold: (i) 0 ≤ τ(S) ≤ 2(p − 2); (ii) τ(S) is nondecreasing in S and differentiable in S.

Efron and Morris [30] slightly widened Baranchik's class of minimax estimators. They proved that the following conditions guarantee that the estimator X + φ(X) dominates X under the quadratic loss function: (i) 0 ≤ τ(S) ≤ 2(p − 2), p ≥ 3; (ii) τ(S) is differentiable in S; and (iii) σ(S) = S τ'(S)/{2(p − 2) − τ(S)} is nondecreasing in S.

Thus the Baranchik class of estimators dominates the best equivariant estimator. A natural question is whether that class has a subclass of admissible estimators. Strawderman [60] shows that there exists a subclass of the Baranchik class of estimators which is proper Bayes with respect to the following class of two-stage priors. The prior distribution for θ is constructed as follows. Conditional on A = a, θ ~ N(0, a I_p), while A itself has pdf g(a) = δ(1 + a)^{−(1+δ)}, a > 0, δ > 0. Under this two-stage prior, the Bayes estimator of θ has the Baranchik form (1 − τ(S)/S)X with

τ(S) = p + 2δ − 2 exp(−S/2) / ∫_0^1 λ^{p/2+δ−1} exp(−λS/2) dλ,

and conditions (i)-(iii) hold for p ≥ 5. When p = 5 and 0 < δ ≤ 1/2, we get a class of proper Bayes, and thus admissible, estimators dominating X. When p > 5, choosing 0 < δ ≤ 1 leads to a class of proper Bayes estimators dominating X.
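A minimal sketch of the Baranchik form (our own illustration, not the dissertation's code): the function τ(S) is supplied by the caller, and the two choices below recover the James-Stein and positive-part James-Stein estimators; the Monte Carlo settings p = 6, θ = 0 are arbitrary.

```python
import random

def baranchik(x, mu, tau):
    """Baranchik-form estimator: mu + (1 - tau(S, p)/S)(x - mu), S = ||x - mu||^2.
    Minimaxity requires 0 <= tau <= 2(p - 2) with tau nondecreasing in S."""
    p = len(x)
    s = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    w = 1.0 - tau(s, p) / s
    return [mi + w * (xi - mi) for xi, mi in zip(x, mu)]

tau_js = lambda s, p: p - 2           # tau(S) = p - 2: James-Stein
tau_pp = lambda s, p: min(s, p - 2)   # tau(S) = min(S, p - 2): positive part

def mc_risk(estimator, theta, n_rep=20000, seed=0):
    """Monte Carlo estimate of E_theta || estimator(X) - theta ||^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        d = estimator(x)
        total += sum((di - ti) ** 2 for di, ti in zip(d, theta))
    return total / n_rep

p = 6
mu = [0.0] * p
theta = [0.0] * p                     # shrinkage point equals the truth
r_js = mc_risk(lambda x: baranchik(x, mu, tau_js), theta)
r_pp = mc_risk(lambda x: baranchik(x, mu, tau_pp), theta)
# both risks are far below p = 6, the risk of X; the positive part does better
```

At θ = μ the exact James-Stein risk is p − (p − 2)²/(p − 2) = 2, and the positive-part member of the class improves on it further, matching the dominance claims above.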
For p = 3 and 4, Strawderman [61] showed that there do not exist any proper Bayes estimators dominating X.

1.3 Shrinkage Predictive Distribution for the Multivariate Normal Density

1.3.1 Shrinkage of Predictive Distribution

When fitting a parametric model or estimating a parametric density function, the two most common methods are: estimate the unknown parameter first and then plug this estimator into the unknown density, or estimate the predictive density directly without estimating the unknown parameter. The first method often fails to provide a good overall fit to the unknown density, even when the estimator of the unknown quantity has optimal properties itself. The second method tends to produce density estimators with smaller risk than those constructed by the plug-in method. To evaluate the goodness of fit of a predictive density p̂(y | x) (where x is the observed random vector) to the unknown p(y | θ), the most often used measure of divergence is the Kullback-Leibler [46] directed measure of divergence

L(θ, p̂(y | x)) = ∫ p(y | θ) log [p(y | θ)/p̂(y | x)] dy,  (1-32)

which is nonnegative, and is zero if and only if p̂(y | x) coincides with p(y | θ). The average loss, or risk, of the predictive density p̂(y | x) can then be defined as

R_KL(θ, p̂) = ∫ p(x | θ) L(θ, p̂(y | x)) dx,  (1-33)

and under a (possibly improper) prior distribution π on θ, the Bayes risk is

r_KL(π, p̂) = ∫ R_KL(θ, p̂) π(θ) dθ.  (1-34)

As shown in Aitchison [1], the Bayes predictive density under the prior π is given by

p̂_π(y | x) = ∫ p(x | θ) p(y | θ) π(θ) dθ / ∫ p(x | θ) π(θ) dθ,  (1-35)

and this density is superior, in terms of Bayes risk, to any plug-in density as a fit to the class of models. Let X | θ ~ N(θ, v_x I_p) and Y | θ ~ N(θ, v_y I_p) be independent p-dimensional multivariate normal vectors with common unknown mean θ and known variances v_x, v_y.
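For normal models the KL loss (1-32) has a closed form, which makes the plug-in versus predictive comparison above easy to check numerically. The univariate sketch below is our own (the values v_x = v_y = 1 and θ = 0.7 are arbitrary); it estimates both KL risks by Monte Carlo, and the theoretical values are v_x/(2v_y) = 0.5 for the plug-in density N(x, v_y) and (1/2)log(1 + v_x/v_y) ≈ 0.347 for the predictive density N(x, v_x + v_y):

```python
import math
import random

def kl_normal(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ), univariate closed form."""
    return 0.5 * (var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1.0 + math.log(var1 / var0))

rng = random.Random(1)
theta, vx, vy = 0.7, 1.0, 1.0
n = 100000
plugin = predictive = 0.0
for _ in range(n):
    x = theta + rng.gauss(0.0, math.sqrt(vx))
    plugin += kl_normal(theta, vy, x, vy)           # plug-in density N(x, vy)
    predictive += kl_normal(theta, vy, x, vx + vy)  # predictive N(x, vx + vy)
plugin /= n
predictive /= n
# since log(1 + r) < r for r > 0, the predictive risk is smaller for every theta
```

The inequality log(1 + v_x/v_y) < v_x/v_y shows the predictive density beats the plug-in uniformly in θ here, in line with Aitchison's result.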
As shown by Murray [54] and Ng [55], the best invariant predictive density in this situation is the constant-risk Bayes rule under the uniform prior π_U(θ) = 1, which can be written as

p_U(y | x) = {2π(v_x + v_y)}^{−p/2} exp[−||y − x||²/{2(v_x + v_y)}].  (1-36)

Although invariance is a frequently used restriction, for point estimation the best invariant estimator θ̂ = X of θ is not admissible if the dimension of the problem is p ≥ 3. It is known that the James-Stein estimator dominates the best invariant estimator θ̂. Komaki [45] showed that the same effect holds for the prediction problem. The best predictive density p_U(y | x), which is invariant under the translation group, is not admissible when p ≥ 3, and is dominated by the predictive density under the Stein harmonic prior (see [59])

π_H(θ) ∝ ||θ||^{−(p−2)}.  (1-37)

Under this prior, the Bayes predictive density p_H(y | x) is given in closed form by Komaki [45]; in the notation of (1-43)-(1-45) below, it can be written as

p_H(y | x) = p_U(y | x) m_H(w; v_w)/m_H(x; v_x),  (1-38)

where m_H(·; v) denotes the marginal density under π_H. The harmonic prior is a special case of the Strawderman class of priors. More recently, Liang [49] showed that p_U(y | x) is dominated by the proper Bayes rule p_a(y | x) under the Strawderman prior π_a(θ), defined hierarchically by

θ | s ~ N(0, s v_0 I_p), with s having density proportional to (1 + s)^{a−2},  (1-40)

when v_x ≤ v_0, p = 5 and a ∈ [.5, 1), or p ≥ 6 and a ∈ [0, 1). When a = 2, it is well known that π_H(θ) is a special case of π_a(θ). As shown in George, Liang and Xu [34], this result closely parallels some key developments concerning minimax estimation of a multivariate normal mean under quadratic loss. As shown in Brown [17], any Bayes rule θ̂_π = E(θ | X) under quadratic loss has the form

θ̂_π = X + ∇ log m_π(X).  (1-41)

A similar representation was proved by George, Liang and Xu [34] for the predictive density under the KL loss:

p_π(y | x) = p_U(y | x) m_π(w; v_w)/m_π(x; v_x),  (1-43)

where

w = (v_y x + v_x y)/(v_x + v_y),  (1-44)

v_w = v_x v_y/(v_x + v_y),  (1-45)

and m_π(w; v_w) is the marginal density of W. Now we present the domination result of George et al. under the KL loss.

Theorem 1.3.1.1. For Z | θ ~ N(θ, v I_p) and a given prior π on θ, let m_π(z; v) be the marginal density of Z. If m_π(z; v) is finite for all z, then p_π(y | x) dominates p_U(y | x) if any one of the following conditions holds for all v ≤ v_x: (i) (∂/∂v) E_θ log m_π(Z; v) ≤ 0 for all θ; (ii) (∂/∂v) m_π(√v z + θ; v) ≤ 0 for all θ, with strict inequality on some interval; (iii) m_π(z; v) is superharmonic, with strict inequality on some interval; (iv) √m_π(z; v) is superharmonic, with strict inequality on some interval; or (v) π(θ) is superharmonic.

From the previous theorem and the minimaxity of p_U(y | x) under the KL loss, the minimaxity of p_π(y | x) follows. George et al. [34] also proved the following theorem, which is similar to Theorem 1 of Fourdrinier et al. [33].

Theorem 1.3.1.2. Let h be a positive function such that (i) (s + 1)h'(s)/h(s) can be decomposed as l_1(s) + l_2(s), where l_1 ≤ A is nondecreasing while 0 ≤ l_2 ≤ B, with A/2 + B ≤ (p − 2)/4, and (ii) lim_{s→∞} h(s)/(s + 1)^{p/2} = 0. Then (i) m_h(z; v) is superharmonic for all v ≤ v_0, and (ii) the Bayes rule p_h(y | x) under the prior π_h(θ) generated by h dominates p_U(y | x) and is minimax when v_x ≤ v_0.

1.3.2 Minimax Shrinkage towards Points or Subspaces

When a prior distribution is centered around 0, minimax Bayes rules p_π(y | x) yield most risk reduction when θ is close to 0 (see [34]). Recentering the prior π(θ) around any b ∈ R^p results in π_b(θ) = π(θ − b). The marginal m_b corresponding to π_b can be obtained directly by recentering the marginal m_π:

m_{π_b}(z; v) = m_π(z − b; v).  (1-46)

Such recentered marginals yield the predictive distributions

p_{π_b}(y | x) = p_U(y | x) m_{π_b}(w; v_w)/m_{π_b}(x; v_x).  (1-47)

More generally, in order to recenter a prior π(θ) around a (possibly affine) subspace B ⊂ R^p, George et al. [34] considered only spherically symmetric (in θ) priors, recentered as

π_B(θ) = π(θ − P_B θ),  (1-48)

where P_B θ = argmin_{b ∈ B} ||θ − b|| is the projection of θ onto B. Note that the dimension of θ − P_B θ must be taken into account when considering π_B. Thus, for example, recentering the harmonic prior π_H(θ) ∝ ||θ||^{−(p−2)} around the subspace spanned by 1_p yields

π_H^B(θ) ∝ ||θ − P_B θ||^{−(p−3)}.  (1-49)

Recentered priors yield predictive distributions of the form (1-47). George et al. [34] also considered multiple shrinkage prediction. Using the mixture prior

π_*(θ) = Σ_{i=1}^N w_i π_{B_i}(θ)  (1-51)

leads to the predictive distribution

p_*(y | x) = p_U(y | x) [Σ_{i=1}^N w_i m_{π_{B_i}}(w; v_w)] / [Σ_{i=1}^N w_i m_{π_{B_i}}(x; v_x)].  (1-52)

1.4 Prior Selection Methods and Shrinkage Argument

1.4.1 Prior Selection

Since Bayes [6], and later since Fisher [32], the idea of Bayesian inference has been debated. The cornerstone of Bayesian analysis, namely prior selection, has been criticized for its arbitrariness and for the overwhelming difficulty of choosing a prior. From the very beginning, when Laplace proposed the uniform prior as a noninformative prior, inconsistencies were found, generating further criticism. This gave way to new ideas, such as that of Jeffreys [44], who proposed a prior which remains invariant under any one-to-one reparametrization. Jeffreys' prior is the positive square root of the determinant of the Fisher information matrix. However, this prior is not ideal in the presence of nuisance parameters. Bernardo [12] noticed that Jeffreys' prior can lead to the marginalization paradox (see Dawid et al. [26]) for inferences about μ/σ when the model is normal with mean μ and variance σ². These inconsistencies led Bernardo [12], and later Berger and Bernardo ([9], [10], [11]), to propose uninformative priors known as "reference" priors.
Two basic ideas were used by Bernardo to construct his prior: the idea of missing information, and a stepwise procedure to deal with nuisance parameters. Without any nuisance parameters, Bernardo's prior is identical to Jeffreys' prior. The missing-information idea makes one choose the prior which is furthest, in terms of Kullback-Leibler distance, from the posterior under that prior, and thus allows the observed sample to change the prior the most. Another class of reference priors is obtained using the invariance principle, which is attributed to Laplace's idea of insufficient reason. Indeed, the simplest example of invariance involves permutations on a finite set; the only invariant distribution in this case is the uniform distribution over the set. Laplace's idea was generalized as follows. Consider a random variable X from a family of distributions parameterized by θ. If there exists a group of transformations, say h_a(θ), on the parameter space, such that the distribution of Y = h_a(X) belongs to the same family with corresponding parameter h_a(θ), then we want the prior distribution for the parameter θ to be invariant under this group of transformations. A good description of this approach is given in Jaynes [43], Hartigan [37] and Dawid [25]. A somewhat different criterion is based on matching the posterior coverage probability of a Bayesian credible set with the corresponding frequentist coverage probability. Most often, matching is accomplished by matching (a) posterior quantiles, (b) highest posterior density (HPD) regions, or (c) inversions of test statistics. In this dissertation we will find uninformative priors by maximizing the divergence between the prior and the corresponding posterior distribution asymptotically. To develop the asymptotic expansions we will use the so-called "shrinkage argument" introduced by J.K. Ghosh [36].
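Jeffreys' prior is computable whenever the Fisher information is. The sketch below is our own illustration (not from the dissertation): it recovers I(θ) for a Bernoulli model as the variance of the score by Monte Carlo, and then forms the unnormalized Jeffreys prior π(θ) ∝ I^{1/2}(θ); the exact information here is 1/{θ(1 − θ)}.

```python
import math
import random

def bernoulli_fisher_info(theta, n=200000, seed=2):
    """Monte Carlo check that Var_theta[score] recovers the Fisher information;
    for Bernoulli(theta) the score is X/theta - (1 - X)/(1 - theta)."""
    rng = random.Random(seed)
    s1 = s2 = 0.0
    for _ in range(n):
        x = 1.0 if rng.random() < theta else 0.0
        score = x / theta - (1.0 - x) / (1.0 - theta)
        s1 += score
        s2 += score * score
    mean = s1 / n
    return s2 / n - mean * mean

theta = 0.3
i_hat = bernoulli_fisher_info(theta)      # exact value: 1/(0.3 * 0.7) ≈ 4.76
jeffreys_unnormalized = math.sqrt(i_hat)  # Jeffreys' prior: pi(theta) ∝ I(theta)^{1/2}
```

Normalizing θ^{−1/2}(1 − θ)^{−1/2} over (0, 1) gives the Beta(1/2, 1/2) distribution, the familiar form of Jeffreys' prior for a Bernoulli parameter.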
This method is particularly suitable for carrying out the asymptotics, and avoids the calculation of multivariate cumulants inherent in any multidimensional Edgeworth expansion.

1.4.2 Shrinkage Argument
We follow the description of Datta and Mukerjee [24] to explain the shrinkage argument. Consider a possibly vector-valued random variable X with a probability density function p(x; θ), with θ ∈ R or some open subset thereof. We need to find an asymptotic expansion for E_θ[h(X; θ)], where h is a jointly measurable function. The following steps describe the Bayesian approach towards the evaluation of E_θ[h(X; θ)].

Step 1: Consider a proper prior density π(θ) for θ, such that the support of π is a compact rectangle in the parameter space, and π vanishes on the boundary of its support while being positive in the interior. Under this prior one obtains the posterior expectation E^π[h(X; θ) | X].

Step 2: For θ in the interior of the support of π, find the expectation

  λ(θ) = E_θ[E^π[h(X; θ) | X]].

Step 3: Integrate λ(θ) with respect to π(θ), and then let π converge to the prior degenerate at the true value of θ, supposing that the true value of θ is an interior point of the support of π. This yields E_θ[h(X; θ)].

The rationale behind this process, assuming integrability of h(X; θ) with respect to the joint probability measure, is as follows. Note that the posterior density of θ under the prior π is given by p(X; θ)π(θ)/m(X), where m(X) = ∫ p(X; θ)π(θ) dθ. Hence, in Step 1, we get E^π[h(X; θ) | X] = K(X)/m(X), where K(X) = ∫ h(X; θ)p(X; θ)π(θ) dθ. Step 2 yields λ(θ) = ∫ {K(x)/m(x)} p(x; θ) dx. In Step 3 one gets

  ∫∫ {K(x)/m(x)} p(x; θ)π(θ) dθ dx = ∫ {K(x)/m(x)} [∫ p(x; θ)π(θ) dθ] dx = ∫ K(x) dx
  = ∫∫ h(x; θ)p(x; θ)π(θ) dθ dx = ∫ E_θ[h(X; θ)] π(θ) dθ.

The last integral gives the desired expectation when π converges to the prior degenerate at the true value of θ.
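The three-step identity behind the shrinkage argument can be checked numerically. The following sketch (numpy/scipy; the model X ~ N(θ, 1), the bump prior, and the function h are arbitrary illustrative choices, not from the text) discretizes each integral on a grid and verifies that Step 3 reproduces ∫ E_θ[h(X; θ)] π(θ) dθ.

```python
import numpy as np
from scipy.stats import norm

# Model: X ~ N(theta, 1); h(x, theta) is any bounded measurable function.
h = lambda x, th: np.exp(-0.25 * (x - th) ** 2)

# Step 1 ingredients: a proper prior on a compact interval, vanishing at
# the boundary (a cosine-squared bump on [-1, 1]).
th = np.linspace(-1, 1, 401)
dth = th[1] - th[0]
pi = np.cos(np.pi * th / 2) ** 2
pi /= (pi * dth).sum()

x = np.linspace(-8, 8, 1601)
dx = x[1] - x[0]
px = norm.pdf(x[:, None], loc=th[None, :])        # p(x; theta) on the grid

m = (px * pi * dth).sum(axis=1)                   # marginal m(x)
K = (h(x[:, None], th[None, :]) * px * pi * dth).sum(axis=1)

# Step 2: lambda(theta) = E_theta[ E^pi{h(X; theta) | X} ] = int (K/m) p(x; theta) dx
lam = ((K / m)[:, None] * px * dx).sum(axis=0)

# Step 3: integrating lambda against pi equals int E_theta[h] pi(theta) dtheta
lhs = (lam * pi * dth).sum()
Eh = (h(x[:, None], th[None, :]) * px * dx).sum(axis=0)
rhs = (Eh * pi * dth).sum()
```

The equality of `lhs` and `rhs` is exact at the level of the discretized sums, which is precisely the chain of identities displayed above.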
CHAPTER 2
ESTIMATION, PREDICTION AND THE STEIN PHENOMENON UNDER DIVERGENCE LOSS

2.1 Some Preliminary Results
We start this section with the definition of the divergence loss. Among others, we refer to Amari [2] and Cressie and Read [23]. This loss is given by

  L_β(θ, a) = [1 − ∫ p^{1−β}(x | θ) p^β(x | a) dx] / (β(1−β)).   (2-1)

The above loss is to be interpreted as its limit when β → 0 or β → 1; the KL loss obtains in these two limiting situations. For β = 1/2, the divergence loss is 4 times the BH loss. Throughout this dissertation, we will perform the calculations with β ∈ (0, 1), and pass on to the endpoints only in the limit when needed.

Let X and Y be conditionally independent given θ with corresponding pdf's p(x|θ) and p(y|θ). We begin with a general expression for the predictive density of Y based on X under the divergence loss and a prior pdf π(θ), possibly improper. Under the KL loss and the prior pdf π(θ), the predictive density of Y is given by π_KL(y|x) = ∫ p(y|θ)π(θ|x) dθ, where π(θ|x) is the posterior of θ based on X = x; see [1]. The predictive density is proper if and only if the posterior pdf is proper. We now provide a similar result based on the general divergence loss, which includes the previous result of Aitchison as a special case when β → 0.

Lemma 2.1.0.1. Under the divergence loss and the prior π, the Bayes predictive density of Y is given by

  π_D(y|x) = k^{1/(1−β)}(y, x) / ∫ k^{1/(1−β)}(y, x) dy,   (2-2)

where k(y, x) = ∫ p^{1−β}(y|θ) π(θ|x) dθ.

Proof of Lemma 2.1.0.1. Under the divergence loss, the posterior risk of predicting p(y|θ) by a pdf p̂(y|x) is {β(1−β)}^{−1} times

  1 − ∫ [∫ p^{1−β}(y|θ) p̂^β(y|x) dy] π(θ|x) dθ
  = 1 − ∫ p̂^β(y|x) [∫ p^{1−β}(y|θ) π(θ|x) dθ] dy = 1 − ∫ k(y, x) p̂^β(y|x) dy.   (2-3)

An application of Hölder's inequality now shows that the integral in (2-3) is maximized at p̂(y|x) ∝ k^{1/(1−β)}(y, x). Again by the same inequality, the denominator of (2-2) is finite provided the posterior pdf is proper. This leads to the result, noting that π_D(y|x) has to be a pdf. □
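As a quick numerical check of the definition (not part of the original text), the sketch below (scipy; parameter values are arbitrary) compares direct quadrature of the divergence loss between two normal densities with the closed form exp{−β(1−β)(a−θ)²/(2σ²)} inside the bracket, valid for two normals with common variance.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def div_loss_numeric(theta, a, sigma, beta):
    """[1 - int p^{1-beta}(x|theta) p^beta(x|a) dx] / (beta (1-beta)) by quadrature."""
    f = lambda x: (norm.pdf(x, theta, sigma) ** (1 - beta)
                   * norm.pdf(x, a, sigma) ** beta)
    return (1 - quad(f, -np.inf, np.inf)[0]) / (beta * (1 - beta))

def div_loss_closed(theta, a, sigma, beta):
    """Closed form for two normal densities with common variance sigma^2."""
    return (1 - np.exp(-beta * (1 - beta) * (a - theta) ** 2
                       / (2 * sigma ** 2))) / (beta * (1 - beta))
```

As β → 0 or β → 1 the expressions approach the KL loss, and β = 1/2 gives 4 times the BH loss, in line with the remarks above.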
The next lemma, to be used repeatedly in the sequel, provides an expression for the integral of the product of two normal densities, each raised to a certain power.

Lemma 2.1.0.2. Let N_p(x | μ, Σ) denote the pdf of a p-variate normal random variable with mean vector μ and positive definite variance-covariance matrix Σ. Then for a₁ > 0, a₂ > 0,

  ∫ [N_p(x | μ₁, Σ₁)]^{a₁} [N_p(x | μ₂, Σ₂)]^{a₂} dx
  = (2π)^{p(1−a₁−a₂)/2} |Σ₁|^{(1−a₁)/2} |Σ₂|^{(1−a₂)/2} |a₁Σ₂ + a₂Σ₁|^{−1/2}
    × exp[−(a₁a₂/2)(μ₁ − μ₂)ᵀ(a₁Σ₂ + a₂Σ₁)^{−1}(μ₁ − μ₂)].   (2-4)

Proof of Lemma 2.1.0.2. Writing H = a₁Σ₁^{−1} + a₂Σ₂^{−1} and g = H^{−1}(a₁Σ₁^{−1}μ₁ + a₂Σ₂^{−1}μ₂), it follows after some simplification that

  ∫ [N_p(x|μ₁, Σ₁)]^{a₁} [N_p(x|μ₂, Σ₂)]^{a₂} dx
  = (2π)^{p(1−a₁−a₂)/2} |Σ₁|^{−a₁/2} |Σ₂|^{−a₂/2} |H|^{−1/2}
    × exp[−½{a₁μ₁ᵀΣ₁^{−1}μ₁ + a₂μ₂ᵀΣ₂^{−1}μ₂ − gᵀHg}].   (2-5)

It can be checked that

  a₁μ₁ᵀΣ₁^{−1}μ₁ + a₂μ₂ᵀΣ₂^{−1}μ₂ − gᵀHg = a₁a₂(μ₁ − μ₂)ᵀ(a₁Σ₂ + a₂Σ₁)^{−1}(μ₁ − μ₂),   (2-6)

and

  |H|^{1/2} = |Σ₁|^{−1/2} |Σ₂|^{−1/2} |a₁Σ₂ + a₂Σ₁|^{1/2}.   (2-7)

Then (2-4) follows from (2-5) through (2-7). This proves the lemma. □

The above results are now used to obtain the Bayes estimator of θ and the Bayes predictive density of a future Y ~ N(θ, σ₂²I_p) under the general divergence loss and the N(μ, AI_p) prior for θ. We continue to assume that, conditional on θ, X ~ N(θ, σ₁²I_p), where σ₁² > 0 is known. The Bayes estimator of θ is obtained by minimizing

  1 − ∫ exp[−(β(1−β)/(2σ₁²)) ‖θ − a‖²] N(θ | (1−B)X + Bμ, σ₁²(1−B)I_p) dθ

with respect to a, where B = σ₁²/(σ₁² + A). By Lemma 2.1.0.2,

  ∫ exp[−(β(1−β)/(2σ₁²)) ‖θ − a‖²] N(θ | (1−B)X + Bμ, σ₁²(1−B)I_p) dθ
  = (2πσ₁²/(β(1−β)))^{p/2} ∫ N(θ | a, σ₁²{β(1−β)}^{−1}I_p) N(θ | (1−B)X + Bμ, σ₁²(1−B)I_p) dθ
  ∝ exp[− ‖a − (1−B)X − Bμ‖² / (2σ₁²({β(1−β)}^{−1} + 1 − B))],   (2-8)

which is maximized with respect to a at (1−B)X + Bμ. Hence, the Bayes estimator of θ under the N(μ, AI_p) prior and the general divergence loss is (1−B)X + Bμ, the posterior mean.
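The identity of Lemma 2.1.0.2 can be verified numerically. The sketch below (scipy quadrature; parameter values are arbitrary) checks the univariate case, where the determinants reduce to powers of the standard deviations.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def lhs(mu1, s1, mu2, s2, a1, a2):
    """Direct quadrature of the product of powered normal densities (p = 1)."""
    f = lambda x: norm.pdf(x, mu1, s1) ** a1 * norm.pdf(x, mu2, s2) ** a2
    return quad(f, -np.inf, np.inf)[0]

def rhs(mu1, s1, mu2, s2, a1, a2):
    """Closed form of Lemma 2.1.0.2 with p = 1: Sigma_i = s_i^2."""
    v = a1 * s2 ** 2 + a2 * s1 ** 2
    return ((2 * np.pi) ** ((1 - a1 - a2) / 2)
            * s1 ** (1 - a1) * s2 ** (1 - a2) * v ** -0.5
            * np.exp(-a1 * a2 * (mu1 - mu2) ** 2 / (2 * v)))
```

For a₁ = 1, a₂ = 0 the right-hand side reduces to 1, as it must for a single normalized density.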
Also, by Lemma 2.1.0.2, the Bayes predictive density under the divergence loss is given by

  π_D(y | X) ∝ [∫ N^{1−β}(y | θ, σ₂²I_p) N(θ | (1−B)X + Bμ, σ₁²(1−B)I_p) dθ]^{1/(1−β)}
            = N(y | (1−B)X + Bμ, (σ₁²(1−B)(1−β) + σ₂²)I_p).

In the limiting (B → 0) case, i.e. under the uniform prior π(θ) = 1, the Bayes estimator of θ is X, and the Bayes predictive density of Y is N(y | X, ((1−β)σ₁² + σ₂²)I_p).

It may be noted that, by Lemma 2.1.0.2 with a₁ = 1−β and a₂ = β, the divergence loss for the plug-in predictive density N(y | X, σ₂²I_p), which we denote by δ₀, is

  L(θ, δ₀) = (1/(β(1−β))) [1 − ∫ N^{1−β}(y | θ, σ₂²I_p) N^β(y | X, σ₂²I_p) dy]
           = (1/(β(1−β))) [1 − exp(−(β(1−β)/(2σ₂²)) ‖X − θ‖²)].   (2-9)

Noting that ‖X − θ‖² ~ σ₁²χ²_p, the corresponding risk is given by

  R(θ, δ₀) = (1/(β(1−β))) [1 − {1 + β(1−β)σ₁²/σ₂²}^{−p/2}].   (2-10)

On the other hand, by Lemma 2.1.0.2 again, the divergence loss for the Bayes predictive density (under the uniform prior) of N(y | θ, σ₂²I_p), which we denote by δ_u, is

  L(θ, δ_u) = (1/(β(1−β))) [1 − ∫ N^{1−β}(y | θ, σ₂²I_p) N^β(y | X, ((1−β)σ₁² + σ₂²)I_p) dy]
  = (1/(β(1−β))) [1 − (σ₂²)^{pβ/2} {(1−β)σ₁² + σ₂²}^{p(1−β)/2} {(1−β)²σ₁² + σ₂²}^{−p/2}
    × exp(−(β(1−β) ‖X − θ‖²) / (2((1−β)²σ₁² + σ₂²)))].   (2-11)

The corresponding risk is

  R(θ, δ_u) = (1/(β(1−β))) [1 − {σ₂²/((1−β)σ₁² + σ₂²)}^{pβ/2}],   (2-12)

which is constant in θ. To show that R(θ, δ₀) > R(θ, δ_u) for all θ, σ₁² > 0 and σ₂² > 0, it suffices to show that

  {1 + β(1−β)σ₁²/σ₂²}^{−p/2} < {1 + (1−β)σ₁²/σ₂²}^{−pβ/2},   (2-13)

or equivalently that

  1 + β(1−β)σ₁²/σ₂² > (1 + (1−β)σ₁²/σ₂²)^β,   (2-14)

for all 0 < β < 1, σ₁² > 0 and σ₂² > 0. But the last inequality is a consequence of the elementary inequality (1 + z)^u < 1 + uz for all z > 0 and 0 < u < 1, applied with z = (1−β)σ₁²/σ₂² and u = β.

In the next section, we prove the minimaxity of X as an estimator of θ, and the minimaxity of N(y | X, ((1−β)σ₁² + σ₂²)I_p) as the predictive density of Y, in any arbitrary dimension.

2.2 Minimaxity Results
Suppose X ~ N(θ, σ₁²I_p), where θ ∈ R^p.
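The comparison between the plug-in and uniform-prior Bayes predictive densities can be illustrated numerically. The sketch below (numpy; parameter values are arbitrary) encodes the two risk expressions as functions of r = σ₁²/σ₂², confirms the plug-in risk by Monte Carlo, and checks the dominance over a grid of (β, r).

```python
import numpy as np

def risk_plugin(p, beta, r):
    """Risk of the plug-in density N(y | X, sigma2^2 I_p); r = sigma1^2/sigma2^2."""
    return (1 - (1 + beta * (1 - beta) * r) ** (-p / 2)) / (beta * (1 - beta))

def risk_bayes(p, beta, r):
    """Constant risk of the uniform-prior Bayes predictive density."""
    return (1 - (1 + (1 - beta) * r) ** (-p * beta / 2)) / (beta * (1 - beta))

rng = np.random.default_rng(0)
p, beta, s1, s2 = 5, 0.4, 1.5, 1.0
theta = rng.normal(size=p)
X = theta + s1 * rng.normal(size=(200000, p))
q = ((X - theta) ** 2).sum(axis=1) / (2 * s2 ** 2)
mc = ((1 - np.exp(-beta * (1 - beta) * q)) / (beta * (1 - beta))).mean()
```

The Monte Carlo average `mc` estimates the plug-in risk, which does not depend on θ because ‖X − θ‖² ~ σ₁²χ²_p.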
By Lemma 2.1.0.2, under the general divergence loss given in (2-1), the risk of X is given by

  R(θ, X) = (1/(β(1−β))) [1 − {1 + β(1−β)}^{−p/2}]   (2-15)

for all θ. We now prove the minimaxity of X as an estimator of θ.

Theorem 2.2.0.3. X is a minimax estimator of θ in any arbitrary dimension under the divergence loss given in (2-1).

Proof of Theorem 2.2.0.3. Consider the sequence of proper priors N(0, σ_n²I_p) for θ, where σ_n² → ∞ as n → ∞. We denote this sequence of priors by π_n. The Bayes estimator of θ, namely the posterior mean, under the prior π_n is

  δ^{π_n}(X) = (1 − B_n)X,   (2-16)

with B_n = σ₁²/(σ₁² + σ_n²). The Bayes risk of δ^{π_n} under the prior π_n is given by

  r(π_n, δ^{π_n}) = (1/(β(1−β))) [1 − E exp{−(β(1−β)/(2σ₁²)) ‖δ^{π_n}(X) − θ‖²}],   (2-17)

where the expectation is taken over the joint distribution of X and θ, with θ having the prior π_n. Since, under the prior π_n,

  ‖θ − δ^{π_n}(X)‖² | X = x ~ σ₁²(1 − B_n)χ²_p,

which does not depend on x, it follows from (2-17) that

  r(π_n, δ^{π_n}) = (1/(β(1−β))) [1 − {1 + β(1−β)(1 − B_n)}^{−p/2}].   (2-18)

Since B_n → 0 as n → ∞, it follows from (2-18) that r(π_n, δ^{π_n}) → (1/(β(1−β)))[1 − {1 + β(1−β)}^{−p/2}] as n → ∞. Noting (2-15), an appeal to a result of Hodges and Lehmann [40] now shows that X is a minimax estimator of θ for all p. □

Next we prove the minimaxity of the predictive density δ_u(X) = N(y | X, ((1−β)σ₁² + σ₂²)I_p) of Y having pdf N(y | θ, σ₂²I_p).

Theorem 2.2.0.4. δ_u(X) is a minimax predictive density of N(y | θ, σ₂²I_p) in any arbitrary dimension under the general divergence loss given in (2-1).

Proof of Theorem 2.2.0.4. We have shown already that the predictive density δ_u(X) of N(y | θ, σ₂²I_p) has constant risk

  (1/(β(1−β))) [1 − {σ₂²/((1−β)σ₁² + σ₂²)}^{pβ/2}]

under the divergence loss given in (2-1). Under the same sequence π_n of priors considered earlier in this section, by Lemma 2.1.0.2, the Bayes predictive density of N(y | θ, σ₂²I_p) is given by N(y | (1−B_n)X, {(1−β)(1−B_n)σ₁² + σ₂²}I_p).
By Lemma 2.1.0.2 once again, one gets the identity

  ∫ N^{1−β}(y | θ, σ₂²I_p) N^β(y | (1−B_n)X, {(1−β)(1−B_n)σ₁² + σ₂²}I_p) dy
  = (σ₂²)^{pβ/2} {(1−β)(1−B_n)σ₁² + σ₂²}^{p(1−β)/2} {(1−β)²(1−B_n)σ₁² + σ₂²}^{−p/2}
    × exp[−(β(1−β) ‖θ − (1−B_n)X‖²) / (2((1−β)²(1−B_n)σ₁² + σ₂²))].   (2-19)

Noting once again that ‖θ − (1−B_n)X‖² | X = x ~ σ₁²(1−B_n)χ²_p, the posterior risk of δ_n(X) = N(y | (1−B_n)X, {(1−β)(1−B_n)σ₁² + σ₂²}I_p) simplifies to

  (1/(β(1−β))) [1 − {σ₂²/((1−β)(1−B_n)σ₁² + σ₂²)}^{pβ/2}].   (2-20)

Since this expression does not depend on x, it is also the Bayes risk of δ_n(X). As B_n → 0, the Bayes risk converges to

  (1/(β(1−β))) [1 − {σ₂²/((1−β)σ₁² + σ₂²)}^{pβ/2}],

the constant risk of δ_u(X). An appeal to Hodges and Lehmann [40] once again proves the theorem. □

2.3 Admissibility for p = 1
We use Blyth's [14] original technique for proving admissibility. First consider the estimation problem. Suppose that X is not an admissible estimator of θ. Then there exists an estimator δ₀(X) of θ such that R(θ, δ₀) ≤ R(θ, X) for all θ, with strict inequality for some θ = θ₀. Let r = R(θ₀, X) − R(θ₀, δ₀(X)) > 0. Due to the continuity of the risk function, there exists an interval [θ₀ − ε, θ₀ + ε], with ε > 0, such that R(θ, X) − R(θ, δ₀(X)) > r/2 for all θ ∈ [θ₀ − ε, θ₀ + ε]. Now, with the same prior π_n(θ) = N(θ | 0, σ_n²),

  r(π_n, X) − r(π_n, δ₀(X)) ≥ ∫_{θ₀−ε}^{θ₀+ε} [R(θ, X) − R(θ, δ₀(X))] π_n(dθ)
  ≥ (r/2)(2πσ_n²)^{−1/2} exp{−(|θ₀| + ε)²/(2σ_n²)} (2ε).   (2-21)

Again,

  r(π_n, X) − r(π_n, δ^{π_n}(X)) = (1/(β(1−β))) [{1 + β(1−β)(1 − B_n)}^{−1/2} − {1 + β(1−β)}^{−1/2}] = O(B_n)   (2-22)

for large n, where O denotes the exact order. Since B_n = σ₁²(σ₁² + σ_n²)^{−1} and σ_n² → ∞ as n → ∞, denoting by C(>0) a generic constant, it follows from (2-21) and (2-22) that for large n, say n ≥ n₀,

  [r(π_n, X) − r(π_n, δ₀(X))] / [r(π_n, X) − r(π_n, δ^{π_n}(X))] ≥ C σ_n^{−1} B_n^{−1} → ∞   (2-23)

as n → ∞. Hence, for large n, r(π_n, δ^{π_n}(X)) > r(π_n, δ₀(X)), which contradicts the Bayesness of δ^{π_n}(X) with respect to π_n. This proves the admissibility of X for p = 1.
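The sequence-of-priors computations used in the minimaxity and admissibility arguments can also be checked by simulation. The sketch below (numpy; parameter values are arbitrary) simulates the Bayes risk of the posterior mean (1 − B_n)X under a N(0, σ_n²I_p) prior and compares it with the closed form, whose B_n → 0 limit is the constant risk of X.

```python
import numpy as np

def bayes_risk(p, beta, B):
    """Bayes risk of the posterior mean (1 - B)X under a normal prior that
    yields shrinkage factor B; B = 0 recovers the constant risk of X."""
    return (1 - (1 + beta * (1 - beta) * (1 - B)) ** (-p / 2)) / (beta * (1 - beta))

rng = np.random.default_rng(1)
p, beta, s1, sn = 4, 0.3, 1.0, 3.0
B = s1 ** 2 / (s1 ** 2 + sn ** 2)                   # here B = 0.1
theta = sn * rng.normal(size=(200000, p))           # theta drawn from the prior
X = theta + s1 * rng.normal(size=(200000, p))
err = (((1 - B) * X - theta) ** 2).sum(axis=1)
mc = ((1 - np.exp(-beta * (1 - beta) * err
                  / (2 * s1 ** 2))) / (beta * (1 - beta))).mean()
risk_X = bayes_risk(p, beta, 0.0)
```

Because the Bayes risk stays strictly below `risk_X` for every finite prior variance and climbs to it in the limit, the Hodges-Lehmann criterion applies.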
For the prediction problem, suppose there exists a density p(y | v(X)) which dominates N(y | X, (1−β)σ₁² + σ₂²). Since

  r(π_n, N(y | X, (1−β)σ₁² + σ₂²)) − r(π_n, N(y | X, (1−β)(1−B_n)σ₁² + σ₂²)) = O(B_n)

for large n under the same prior π_n, using a similar argument,

  [r(π_n, N(y | X, (1−β)σ₁² + σ₂²)) − r(π_n, p(y | v(X)))]
  / [r(π_n, N(y | X, (1−β)σ₁² + σ₂²)) − r(π_n, N(y | X, (1−β)(1−B_n)σ₁² + σ₂²))]
  ≥ C σ_n^{−1} B_n^{−1} → ∞   (2-24)

as n → ∞. An argument similar to the previous result now completes the proof. □

Remark 1. The above technique of proving admissibility does not work for p = 2. This is because, for p = 2, the ratios on the left-hand sides of (2-23) and (2-24) are greater than or equal to some constant times σ_n^{−2}B_n^{−1} for large n, which tends to a constant as n → ∞. We conjecture the admissibility of X and of N(y | X, ((1−β)σ₁² + σ₂²)I_p) for p = 2 under the general divergence loss for the respective problems of estimation and prediction.

2.4 Inadmissibility Results for p ≥ 3
Let S = ‖X‖²/σ₁². The Baranchik class of estimators of θ is given by

  δ_τ(X) = (1 − τ(S)/S) X,

where one needs some restrictions on τ. The special choice τ(S) = p − 2 (with p ≥ 3) leads to the James-Stein estimator. It is important to note that the class of estimators δ_τ(X) can be motivated from an empirical Bayes (EB) point of view. To see this, we first note that with the N(0, AI_p) (A > 0) prior for θ, the Bayes estimator of θ under the divergence loss is (1−B)X, where B = σ₁²(A + σ₁²)^{−1}. An EB estimator of θ estimates B from the marginal distribution of X. Marginally, X ~ N(0, σ₁²B^{−1}I_p), so that S is minimal sufficient for B. Thus, a general EB estimator of θ can be written in the form δ_τ(X). In particular, the UMVUE of B is (p−2)/S, which leads to the James-Stein estimator [28].
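A Monte Carlo sketch of the resulting dominance follows (numpy; the dimension, β, and sample size are arbitrary choices). It evaluates the divergence-loss risk of the James-Stein member τ(S) = p − 2 of the Baranchik class at θ = 0, where the gap over X is largest.

```python
import numpy as np

def div_risk_mc(est, theta, sigma2, beta, n=200000, seed=2):
    """Monte Carlo divergence-loss risk of an estimator est(X) at a fixed theta."""
    rng = np.random.default_rng(seed)
    X = theta + np.sqrt(sigma2) * rng.normal(size=(n, theta.size))
    q = ((est(X) - theta) ** 2).sum(axis=1) / (2 * sigma2)
    return ((1 - np.exp(-beta * (1 - beta) * q)) / (beta * (1 - beta))).mean()

def james_stein(X, sigma2):
    """Baranchik class with tau(S) = p - 2, S = ||X||^2 / sigma^2."""
    S = (X ** 2).sum(axis=1) / sigma2
    p = X.shape[1]
    return (1 - (p - 2.0) / S)[:, None] * X

p, sigma2, beta = 6, 1.0, 0.5
theta0 = np.zeros(p)
r_js = div_risk_mc(lambda X: james_stein(X, sigma2), theta0, sigma2, beta)
r_x = div_risk_mc(lambda X: X, theta0, sigma2, beta)
```

Other members of the class are obtained by swapping in any τ satisfying the conditions of the dominance theorem below.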
Note that for the estimation problem, L(0, 6(X)) 1 exp [ )6(X) 0 2] f3(1 /3) (225) while for the prediction problem, L (N(y I,), N (yl 6(X), ((1 3) + r)I,)) P13 p(l 3) 1 () ((1 ) ) 2 {(1 )22 + 2}p/2 S(1 3) x exp (1 )ll(X) .ll0 (2 26) 2((1 0)2j2 + j2) The first result of this section finds an expression for EO exp 6,(X) 02}] b > 0. We will need b = (1 j)/2 and b (((1 ") to evaluate (225) and (226). Theorem 2.4.0.5. E [exp 1 67(X) 02} L a 2 OO (2b + 1)p/2 {exp( )r/r!}Ib(r), (227) r0 where (b= + ) and I (r) J 1 0 b 2t 2' r+  7i t 2b + (r + ) b(b) ) 2 t 2t x exp t b(b + 2 ( 2 ) + 2br 2 )1 2 dt. (228) t 2b+1 2b+1t Proof. Recall that S =  X 2/72. For 011 0, proof is straightforward. So we consider 10 > 0. Let Z = X/ux and r1 = 0/u. First we reexpress S1 (t _S)X 02 as ,722 2 as (1 2  )112 Z = S+ 7S 2(S) 2+ 227TZ ( (S). (229) We begin with the orthogonal transformation Y = CZ where C is an orthogonal matrix with its first row given by (01/11011,... ,0p/01181). Writing Y = (Y1,... ,p)T, the right hand side of (229) can be written as S+ (S) 27(S)+ 1712 2 11 7 1 S\ (230) where S = Y11 2. Also we note that Y1,... ,Yp are mutually independent with Y1 ~ N(1711, 1), and Y2,... ,Yp are iid N(0, 1). Now writing Z = Y 2 1 we i= 2 have EIexp 1 6(X) 02l +00 +00 exp b + { + 1 ) 0 0 2T(y2 + z) + I2 211y, 1 T +z }]x x (27)1/2 exp (YI )2 exp  ,, dy, dz 2 2 F (P 2 +00 +00 (27) 1/2 exp (b ) (y )2 ) + 2bT( by2+ z) 2/ (y2 +) 0 o (y 2 ) + t ) 1 z 3 2bllr yi I exp b+ z dyi dz (231) y~+z J 2 r (S) S ), We first simplify +oo S(2) 12 exp 00 br2 (2 + _) + 2bT(yj + z) 2bl (y + 1z) 1exp b+( + b2) x exp {2 (b + 2) 11y + exp { 2 (b + y1 I dyi y+ z b 2 (y2 ) 1+ 2 + + 2bT(y + z) (yI z)IjTy ) _r(y + z) } 2bllyi y z qyll1 + 2bllllly y z) 1/2 exp (b+ (Y + ll72) bT(y2 + Z) (y2Z)+ 2br(y + z) (y + z) S (211lly1)2r ( bT(y2 + ) 2r r0 (2r)! 2 y + z (b +)(W+ 1l2) br2(w + z) (w +) + 2br(w + z)] S (2r)! w' 1b (2 )! 2 br(w + z) w + z S2r] dw, where w = y. 
With the substitution v = w + z and u = w/(w + z), it follows from (231) and (232) that E exp A +oo 1 0= 0 0 1167(X) ( 1/2 exp r0 112} (2bllrll)2r (2r)! ((2b) + 1) x exp (b + 1t 2) V b (v) + 2br(v)] 1 Ur+ I 1(1 2 du dv (233) v 22d ( (3) (b +) (y h)2 +00O I (27) 0/ 0 +00 +oo 2 (2) 0 dyi (232) +oo 2 (27) 0 1/2 exp [ v V ( t 1) 2) By the Legendre duplication formula, namely, (2r)! F(2r + 1) F (r + A) F(r + 1)22r~ 1/2 (233) simplifies into E exp +00 1 o o 0 0 b 6T(X) ( x rt0 112} b + I ) r2 (2b)c 2r) 2 { +2 rT (r + 1) 22r ((2b)1 + 1) x v+ lexp (b + v +0021 +~ 1 J0 exp 0 0 r=0 (b + )2)((b + 2) (b 1) )2 r! ((2b)1 ) v ((2b)1+ l)v Xvr+ lexp (b+ ) b2 () r+1 ( + 2br(v) v 2 F(r + 1) 2) u)2lF (r + ) dud 2du dv S(PL) F (r+ L) (234) Integrating with respect to u, (234) leads to b 6(X) ( x 00 7 exp{ /} r70 12} 2 ) ( T(v) )2 ((2b)1 + 1)v r+~1 2Fr (r + 2) b ( + 2br(v) dv (235) vI where = (b + ) 1.2. Now putting t E exp +ooexp x j exp 0 (x ) 0 2} b(b+ 1) 72 2_T) 2 +1 t (b + ) v, we get from (235) (2b + 1) exp{0} r0 + 2b 2 1) (1 (2b + 1t 2b + )) tr+P1 dt F(r +) The theorem follows. U br2(v) + 2br(v)] v I 7(v) 2r v I I r+l1 1 Ur+2 U( ) 2  2 2i ( du dv 2 0 ) E exp x exp (b + v (12) As a consequence of this theorem, putting b = 3(1 3)/2, it follows from (225) and (227) that 00 1 (1 + 3(1 3))p/2 E {exp(O)O/r!} (1 )/2() R(0, 6(X)) r=o 3/( 2(/ while putting b = (( 1. it follows from (226) and (227) that R (N(y O, a2~IP), N (y 67(X), ((1 ),a + 2)IJ,)) t (, )p3/2((l )2 2+)p//2 ep(^/ ,} ()) S0r=2(( 3)2x + ) S3(1 3) Hence, proving Ib(r) > 1 for all b > 0 under certain conditions on 7 leads to R(O, 6T(X)) < R(, X) and R (N(ylO, a72,), N (y 67(X), ((1 3)a7 + 7 )I,)) < R (N (yl aIP),N (yX, ((1 /3) + 72)IJ)) for all 0. In the limiting case when 3 + 0, i.e. for the KL loss, one gets RKL (0, 67(X)) < p/2 RKL(0, X) for all 0, since as shown in Section 1, for estimation, the KL loss is half of the squared error loss. 
Similarly, for the prediction problem, as3  0, RKL(N(y\ 0, cI p), N(yl 6(X), ((1 3)t + cr)I>) p 2 2___ < 2 log +ar<} = RKL(N(y O, rI, N(y X, ((1 43)r + )IJ,) for all 0. The following theorem provides sufficient conditions on the function 7r() which guarantee Ib(r) > 1 for all r = 0, 1, . Theorem 2.4.0.6. Let p > 3. Suppose (i) 0 < T(t) < 2(p 2) for all t > 0; (ii) 7(t) is a differentiable nondecreasing function of t. Then Ib(r) > 1 for all b > 0. Proof of Theorem 2.4.0.6. Define To(t) = T7( ). Notice that To(t) will also satisfy conditions of Theorem 2.4.0.6. Now 2b (2b + and +ooexp t (1 I(7 b2) 2 (p2) x exp{ t o(t)} dt. (2: 2t t Define to = sup{t > 0 : To(t)/t > b }. Since To(t)/t is continuous in t with limTo(t)/t +oo and lim To(t)/t = 0, there exists such a to which also satisfies tO too ro(to)/to = b1. We now need the following lemma. Lemma 2.4.0.7. For t > to, b > 0 and To(t) M ,/';,if,,/ conditions of Ti,.., ,, 2.4.0.6 the following .:,,';. 1.,/; i holds: exp 7)) (p2) q(t) > 0, (2 where q(t) = t ( b(t)2 Proof of 2.4.0.7. Notice first that for t > to, by the inequality, (1 exp(cz) for c > 0 and 0 < z < 1, one gets exp 1 b7r02(t)} b ()>exp br,2(t) b(p 2)ro( 2t } ( t 2t t2 {+ r2 t0) ()) 2 r(r + ) 36) 37) z) > (238) bro(t) 2+ b ) t ) 2t b (+ b + 12t t + 2 t t \2b +1} t 1 "); for 0 < To(t) < 2(p 2). Notice that q(t) 2b(t) + 2b2ro (t)r (t) b202(t) (239) q'(t) = 1 2br'(t) + 22 (239) Thus q'(t) < 1 for t > to if and only if 2bg (t) b + 22t> 0. (240) The last inequality is true since T'(t) > 0 for all t > 0 and To(t)/t < b1 for all t > to. The lemma follows. U In view of previous lemma, it follows from (237) that +oo ) > (r ) J exp{q(t)}(q(t))r+1q'(t) dt 1. to This proves Theorem 2.4.0.6. U Remark 2. Baranchik [5], under squared error loss, proved the dominance of 6'(X) over X under (i) and (ii). We may note that the special choice r(t) = p 2 for all t leading to the JamesStein estimator, satisfies both conditions (i) and (ii) of the theorem. Remark 3. 
We may note that the Baranchik class of estimators shrinks the sample mean X towards 0. Instead, one can shrink X towards any arbitrary constant μ. In particular, if we consider the N(μ, AI_p) prior for θ, where μ ∈ R^p is known, then the Bayes estimator of θ is (1−B)X + Bμ, where B = σ₁²(A + σ₁²)^{−1}. A general EB estimator of θ is then given by

  δ_τ**(X) = X − (τ(S′)/S′)(X − μ),

where S′ = ‖X − μ‖²/σ₁², and Theorem 2.4.0.6, with obvious modifications, then provides the dominance of the EB estimator δ_τ**(X) over X under the divergence loss. The corresponding prediction result is also true.

Remark 4. The special case τ(t) = c satisfies the conditions of the theorem if 0 < c < 2(p − 2). This is the original James-Stein result.

Remark 5. Strawderman [60] considered the hierarchical prior θ | A ~ N(0, AI_p), where A has pdf π(A) = δ(1 + A)^{−(δ+1)} I_{[A>0]}, with δ > 0. Under the above prior, assuming squared error loss, and recalling that S = ‖X‖²/σ₁², the Bayes estimator of θ is given by (1 − τ(S)/S)X, where

  τ(t) = p + 2δ − 2exp(−t/2) / ∫₀¹ λ^{p/2+δ−1} exp(−λt/2) dλ.   (2-41)

Under the general divergence loss, it is not clear whether this estimator is the hierarchical Bayes estimator of θ, although its EB interpretation continues to hold. Besides, as is well known, this particular τ satisfies the conditions of Theorem 2.4.0.6 if p ≥ 4 + 2δ. Thus the Strawderman class of estimators dominates X under the general divergence loss. The corresponding predictive density also dominates N(y | X, ((1−β)σ₁² + σ₂²)I_p).

For the special KL loss, the present results complement those of Komaki [45] and George et al. [34]. The predictive densities obtained by these authors under the Strawderman prior (and Stein's superharmonic prior as a special case) are quite different from the general class of EB predictive densities of this dissertation. One of the virtues of the latter is that the expressions are in closed form, and thus these densities are easy to implement.
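The Strawderman score τ above can be evaluated by numerical integration. The sketch below (scipy; δ = 0.5 and p = 7 are chosen so that p ≥ 4 + 2δ) checks, on a grid, that τ stays in (0, 2(p − 2)] and is nondecreasing, i.e. the Baranchik-type conditions of Theorem 2.4.0.6.

```python
import numpy as np
from scipy.integrate import quad

def strawderman_tau(t, p, delta):
    """tau(t) = p + 2 delta - 2 exp(-t/2) / int_0^1 lam^{p/2+delta-1} e^{-lam t/2} dlam."""
    denom = quad(lambda lam: lam ** (p / 2 + delta - 1)
                 * np.exp(-lam * t / 2), 0, 1)[0]
    return p + 2 * delta - 2 * np.exp(-t / 2) / denom

p, delta = 7, 0.5                        # p >= 4 + 2*delta holds
ts = np.linspace(0.1, 20, 120)
vals = np.array([strawderman_tau(t, p, delta) for t in ts])
```

As t → ∞, τ(t) increases to p + 2δ, which is at most 2(p − 2) exactly when p ≥ 4 + 2δ.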
2.5 Lindley's Estimator and Shrinkage to Regeression Surface Lindley [50] considered a modification of the JamesStein estimator. Rather then shrinking X towards an arbitrary point, ,v tt, he proposed shrinking X towards Xp1, where X = p1 1 Xi and lp is a pcomponent column vector with each element equal to 1. Writing R = E(X X)2/,c, Lindley's estimator is given i= by p3 6(X) = X (X X1), p > 4. (242) R The above estimator has a simple EB interpretation. Suppose XI 0 N(0, JIp) and 0 has the Np(plp, AIp) prior. Then the B,v, estimator of 0 is given by (1 B)X + Bp/1 where B = jx(A + oX)1. Now if both p and A are unknown, since marginally X ~ N(plp, a B1I), (X, R) is complete sufficient for p and B, and the UMVUE of p and B1 are given by X and (p 3)/R, p > 4. Following Baranchik [5] a more general class of EB estimators is given by 6 (R) 6:(X) X ( (X X1), p > 4. (243) R Theorem 2.5.0.8. Assume (i) 0 < r(t) < 2(p 3) for all t > 0 p > 4; (ii) 7(t) is a nondecreasing differentiable function of t. Then the estimator 6 (X) dominates X under the divergence loss given in 2 1. Similarly, N(yl 6:(X), ((1 /3)o + o)Ip) dominates N(yl X, ((1 P3)a + oa)Ip) as the predictor of N(yl 0, Ir2). Proof of Theorem 2.5.0.8. Let p 0P ol rI 0/c ii and 2 1 p 2 Z(o 0)2. Sii 1ii As in the proof of Theorem 2.4.0.5 we first rewrite 2 *6l(X) 0ll2 ZZ  ^' T (R)( R Z1,) 12 T(R)(z Zl,) (7 1,P) + (Z )lP R(Z R (Z [1 (R) 2 zip) )2 + (2 2( P) (1 By the orthogonal transformation G 2 zip). (244) CZ, where C S1 is an orthogonal matrix with first two rows given by (p2',... ,p ) and ((ll 1)/(,..., (T 1 2 116(X ) 0112 S (G2 Q) 2 i U G I+ qr)/). We can rewrite (G2 + Q) + (G1 VrP)2 + 2 2(G2 (1 T(G+ Q) G(2Q 5) (2 45) P where Q = C G2 and G1, G2,..., Gp are mutually independent with i=3 G1 ~ N]V(Vlp], 1), G2 ~ NV(, 1) and G3, Gp are iid N(0, 1). Hence due to the independence of G1 with (G2,... 
Gp), and the fact that (G1 v/p)2 ~ X2, from (245), 16 (X) 0112 = x (G + Q) + (2 x(G2 (2b + 1)E  2(G2 ( exp b 1 (( r T(G+Q) 1] T(G2+Q) 2 G+Q ) G 2 + 00 (2b+1 ) exp{ r=0 +00 x exp 0 o/ t b(b + ) + 2bTo(t)4 2 t _ ) 2r r+P1 b M 2 t dt, (2 46) t F(r +) E exp ()Z (Gi,..., G)T where = (b + 1)(2 and as before To(t) T( '). The second equality in 246 follows after long simplifications proceeding as in the proof of Theorem 2.4.0.5. Hence, by (246), the dominance of 6'(X) over X follows if the right hand side of (246) > (2b+ )p/2. This however is an immediate consequence of Theorem 2.4.0.6. U The above result can immediately be extended to shrinkage towards an arbitrary regression surface. Suppose now that X 10 ~ N(0, a I) and 0 ~ Np(K3, AI,) where K is a known p x r matrix of rank r(< p) and / is r x 1 regression coefficient. Writing P = K(KTK)1KT, the projection of X on the regression surface is given by P X = K/, where / = (K K)1K X is the least squares estimator of 0. Now we consider the general class of estimators given by S ( (X PX), R* where R* = IIX P X12/a The above estimator also has an EB interpretation noting that marginally (/, R*) is complete sufficient for (3, A). The following theorem now extends Theorem 2.5.0.8. Theorem 2.5.0.9. Let p > r + 3 and (i) 0 < r(t) < 2(p r 2) for allt > 0; (ii) 7(t) is a nondecreasing differentiable function of t. Then the estimator X >)(X P X) dominates X under the divergence loss. A similar dominance result holds for prediction of N(yl 0, O I2,). CHAPTER 3 POINT ESTIMATION UNDER DIVERGENCE LOSS WHEN VARIANCE COVARIANCE MATRIX IS UNKNOWN 3.1 Preliminary Results In this chapter we will consider the following situation. Let vectors Xi ~ Np (0, E) i = ,..., n be n i.i.d. random vectors, where E is the unknown variance covariance matrix. In Section 3.2 we consider E = a21p with a2 unknown, while in section 3.3 we consider the most general situation of unknown E. Our goal is to estimate the unknown vector 0 under divergence loss. 
First note that X̄ is distributed as N_p(θ, n^{−1}Σ), and thus the divergence loss for an estimator a of θ is as follows:

  L_β(θ, a) = (1/(β(1−β))) [1 − ∫ f^{1−β}(x | θ) f^β(x | a) dx]
            = (1/(β(1−β))) [1 − exp{−(nβ(1−β)/2)(a − θ)ᵀΣ^{−1}(a − θ)}].   (3-1)

The best unbiased equivariant estimator is X̄. We will begin with the expression for the risk of this estimator.

Lemma 3.1.0.10. Let X_i ~ N_p(θ, Σ), i = 1, …, n, be i.i.d. Then the risk of the best unbiased estimator X̄ of θ is as follows:

  R_β(θ, X̄) = (1/(β(1−β))) [1 − {1 + β(1−β)}^{−p/2}].   (3-2)

Proof of Lemma 3.1.0.10. Note first that n(X̄ − θ)ᵀΣ^{−1}(X̄ − θ) ~ χ²_p. Then

  E_θ[exp{−(nβ(1−β)/2)(X̄ − θ)ᵀΣ^{−1}(X̄ − θ)}] = {1 + β(1−β)}^{−p/2}.   (3-3)

The lemma follows from (3-3). □

Thus, for any rival estimator δ(X) to dominate X̄ under the divergence loss, we will need the following inequality to be true for all possible values of θ:

  E_θ[exp{−(nβ(1−β)/2)(δ(X) − θ)ᵀΣ^{−1}(δ(X) − θ)}] ≥ {1 + β(1−β)}^{−p/2}.   (3-4)

3.2 Inadmissibility Results when the Variance-Covariance Matrix is Proportional to the Identity Matrix
Let X ~ N_p(θ, σ²I_p), where σ²(>0) is unknown, while S ~ σ²χ²_m/(m+2), independently of X. This situation arises quite naturally in a balanced fixed-effects one-way ANOVA model. For example, let X_ij = θ_i + ε_ij (i = 1, …, p; j = 1, …, n), where the ε_ij are i.i.d. N(0, σ_e²). Then the minimal sufficient statistic is given by (X̄₁, …, X̄_p, S), where

  X̄_i = n^{−1} Σ_{j=1}^n X_ij  (i = 1, …, p)

and

  S = [(n−1)p + 2]^{−1} Σ_{i=1}^p Σ_{j=1}^n (X_ij − X̄_i)².

This leads to the proposed setup with X = (X̄₁, …, X̄_p)ᵀ, θ = (θ₁, …, θ_p)ᵀ, σ² = σ_e²/n and m = (n−1)p.

Efron and Morris [30], in the above scenario, proposed a general class of shrinkage estimators dominating the sample mean in three or higher dimensions under squared error loss. This class of estimators was developed along the lines of Baranchik [5]. Using equation (3-1), the divergence loss for an estimator a of θ is given by

  L_β(θ, a) = (1/(β(1−β))) [1 − exp{−(β(1−β)/(2σ²)) ‖θ − a‖²}].   (3-5)

The above loss is to be interpreted as its limit when β → 0 or β → 1. The KL loss occurs as a special case when β → 0.
Also, noting that IX 0112 2X2, the risk of the classical estimator X of 0 is readily calculated as 1 [1 +(1 s)1p/2 R3(0, X) = 3)(36) P(1 P) Throughout we will perform calculations in the case 0 < 3 < 1, and will pass to the limit as and when needed. Following Baranchik [5] and Efron and Morris [30], we consider the rival class of estimators 6(X) = 1 r(lXIXI12/S)II I (37) where we will impose some conditions later on r. First we observe that under the divergence loss given in (35), S exp [ (1 ) (X) 0 2 L(0, 6(X)) 3= (38) We now prove the following dominance result. Theorem 3.2.0.11. Letp > 3. Assume (i) 0 < r(t) < 2(p 2) for all t > 0; (ii) 7(t) is a differentiable nondecreasing function of t for t > 0. Then R(0, 6(X)) < R(0, X) for all 0 e I. Proof of Theorem 3.2.0.11 First with the transformation Y = 1X, = a 10 and U = S/2, one can rewrite R(o, 6(X)) 1 E exp (1 ) 1 i(Yi2 j )  f(1 f) where Y ~ Np(7r, Ip) and U ~ (m + 2) 1X is distributed independently of Y. Hence a comparison of (39) with (36) reveals that Theorem (3.2.0.11) holds if and only if Sexp (1 ll ) Y > [1+(1 )]p/2. (310) Next writing z U(m + 2) 2 and 2 (m + 2)t" T(t/z)  2l 2= m2 2z we reexpress left hand side of (310) as E exp ( f (70( 2/Z) Y 2}. (311) Note that in order to find the above expectation, we first condition on Z and then average over the distribution of Z. By the independence of Z and Y and Theorem 2.4.0.5, the expression given in (429) simplifies to [1 + (1 )]p/2 exp() I) (r), r0 (312) where I[1 + 3(1 /)]117112, and writing b = (1 b 2 r ( r+t 1 I()[i ?To(t/z)] r 0 0 r b(b + 1/2)z2 + 1 (z 1 x exp t To ) + 2bzt/z) dtdz. (313) \27 (39) 47 From (310) (431), it remains only to show that I(r) > 1 Vr 0,1,...; p>3 under conditions (i) and (ii) of the theorem. To show this we first use the transformation t = zu. 
Then from (431), 00 00 0 0 x exp [ Ur+ 1 bTo(u)/u]2~ rF(r+)r(L) b(b + 1/2) 2 z({u + 1 + o (U)  71 2bTo (u)} bro (u)/u] 2 r B (r + L, 2 (r+P} ) b(b + 1/2) 2(U) + 1 + rTo+u) 2b0 (u)] (314) Since To(u)/u is a continuous function of u with lim To(U)/U u0 +oo and lim ro ()/u D c it follows that there exists uo such that uo = sup{u > 0oTo(u)/u > 1/b} and To(uo)/uo J[1 0 zr+ (P+) dzdu zr+ 2 I dzdu 48 Thus for u > Uo, To(u)/u < 1/b, from (314), I(r) > [1 U r+ b(b + 1/2) bro(u)/u]2r U 02(u) B (r + 2 ++) bro(u)/u)2}r+ 1 (1 bt (**) _(r+p+P d) x u(1 bTo(u)/u)2 + 1 + bT2u 2 2u / [{u(1 bTo(u)/u)2} /{1 + br(u)/2}]r+ l + u(bro(u)/u)2 ] 2 uo [ + II, "/(2u) bTo(u)/u)(p2) 1 x (1 bTo()/u)(p2) (1 + br(u)/(2u)) (+1) du. (315) By the inequalities (1 bTo(u)/u)(p2) > exp[(p 2)bTo(u)/u] and (1 + bT02(u)/(2u))( +1) > exp[(1 + m/2)bT02(u)/(2u)], it follows that (p2) (1 Sbr2() (+1 2u To())l > 1 (316) >exp [(p2 bTo (u) (m + 2) br2 (u) >exp (p 2)2b(^) fbTo(u)(m + 2) 4(p 2) exp 4u m +2 since 0 < To(u) < 4(p  Moreover, putting 2)/(m + 2). S(1 bo(u)2 1 + (u) 2u 2[u bro(u)2 [2u + br02(u) [ {u(1 J B (r +j, ) brTo(u)) u ) 2o)](r+ 2bTo (u)I it follows that dw 2(u bro(u)) d 2( bT (u) [2(1 b(u))(2u + b0(u)) (u bTo(u))(2 + 2bro(u) (u))] du [2u + bT02(U)]2 2(u bro(u)) 2(u bT(u)) [2u + 2bo(u) + 2b2(u) 4bu(u) 2buro(u)r(u)]. [2u + b702(U)]2 Hence < 1 if and only if 2[u bro(u)][2u + 2bo(u) + 2bT0o2(u) 4bu T(u) 2buTo(u)T'(u)] < [2u + bT7o(u)]2 The last inequality is equivalent to b22(u) [2 + To(u)]2 + 4bu (u) [2 + To(u)][u bro(u)] > 0. (317) Since for u > no, u > bro(u), (317) holds if To(u) > 0, and the latter is true due to assumption (ii). Now from (315) (317) noting that w = 0 when u = Uo, one gets OO /wr+21 for all r = 0, 1, 2,.... This completes the proof of Theorem 3.2.0.11. U Remark 1. It is interesting to note that Il(r) > 1 for all r = 0, 1, 2,... and any arbitrary b > 0. The particular choice b = 3(1 3)/2 does not have any special significance. 
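Theorem 3.2.0.11 can be illustrated by simulation. The sketch below (numpy; the dimension, degrees of freedom, and β are arbitrary choices) draws X together with the independent variance estimator S, applies the shrinkage estimator with the constant choice τ(t) = p − 2, and compares Monte Carlo divergence-loss risks at θ = 0, where the improvement over X is largest.

```python
import numpy as np

rng = np.random.default_rng(3)
p, m, sigma2, beta, n = 6, 30, 2.0, 0.4, 200000
theta = np.zeros(p)
X = theta + np.sqrt(sigma2) * rng.normal(size=(n, p))
S = sigma2 * rng.chisquare(m, size=n) / (m + 2)   # independent variance estimator
F = (X ** 2).sum(axis=1) / S                      # ||X||^2 / S, as in (3-7)
d = (1 - (p - 2.0) / F)[:, None] * X              # tau(t) = p - 2, within condition (i)

def mc_risk(a):
    q = ((a - theta) ** 2).sum(axis=1) / (2 * sigma2)
    return ((1 - np.exp(-beta * (1 - beta) * q)) / (beta * (1 - beta))).mean()

r_shrink, r_mean = mc_risk(d), mc_risk(X)
```

The estimator never uses the unknown σ² directly; only the ratio ‖X‖²/S enters, which is the point of the Efron-Morris construction.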
We now consider an extension of the above result when V(X) = E is an unknown variancecovariance matrix. We solve the problem by reducing the risk expression of the corresponding shrinkage estimator to the one in this section after a suitable transformation. 3.3 Unknown Positive Definite VarianceCovariance Matrix Consider the situation when Zi,..., Z, (n > 2) are i.i.d. Np(O, E), where E is an unknown positive definite matrix. The goal is once again to estimate 0. The usual estimator of 0 is Z = n1 E Z (iv). It is the MLE, UMVUE and i=1 the best equivariant estimator of 0, and is distributed as Np,(, n1E). In addition the usual estimator of E is S 1 Z Z)(Z Z)T i=1 and S is distributed independently of Z. Based on distribution of Z, the minimal sufficient statistic for 0 for any given E, the divergence loss is given by (see equation 31) 1 exp 3(3(a ) ')a 0)] L (0,a) )1. (318) The corresponding risk of Z is the same as the one given in (32), i.e. R(, Z) [1 _{1 + (1 3)}p/2]. (3 19) Consider now the general class of estimators 6'(Z, S) 1( TS Z] (320) of 0. Under the divergence loss given in (21), 1 exp n()(6T(Z, S) 0)Ty (6Z, S) 0) L(0, 6(Z, S)) (321) By the Helmert orthogonal transformation, 1 HI t(Z2 ZI), v/2 H2 = (2Z3 Z1 Z2), V6 1 H,_1 [(n 1)Z, ZI Z2 ... Z,_] n/,(n 1) 1 n 7n z, VnZ, one can rewrite 6'(Z, S) as ((n 1)H( HH )1H) 6'(Z, S)= I  n1H,, (n 1)HT( HiHi)IH,) i= 1 (322) where HI,..., H, are mutually independent with HI,..., H,_ i.i.d. N(0, E) and H, N( VNO, E). Let Yi E= Hi and 1 r7 E (vnO). Then from (321) and (322) one can rewrite ( T ( ( If nT 2 ] 1 exp ((1n Y) 1 Y , L2 ( 6(O (Z'S))( V ) ) I (323) where Y1,..., Y, are mutually independent with Y1,... Y,_ i.i.d. N(O, Ip) and Y, N(q, Ip). Now from Arnold ([3], p. 333) or Anderson([4], p.172), (~nY1 1 d y12 i 1n where U ~ X,p, and is distributed independently of Y,. Now from (323) 1 exp [ (12 ) (1 T ((n1) Y II '/U) (n1)Y /U f3(1 /3) y 721 and 1, ... 
$$L(\theta,\delta_\tau(\bar Z,S))\ \stackrel{d}{=}\ \frac{1-\exp\Big[-\tfrac12\beta(1-\beta)\Big\|\Big\{1-\frac{\tau\big((n-1)\|Y_n\|^2/U\big)}{(n-1)\|Y_n\|^2/U}\Big\}Y_n-\eta\Big\|^2\Big]}{\beta(1-\beta)}. \qquad(3\text{-}24)$$

Next, writing $u=\|Y_n\|^2/U$ and $\tau_0(u)=2\tau((n-1)u)/(n-1)$, (3-24) reduces to the form

$$L(\theta,\delta_\tau(\bar Z,S))\ \stackrel{d}{=}\ \frac{1-\exp\Big[-\tfrac12\beta(1-\beta)\Big\|\Big\{1-\frac{\tau_0(u)}{2u}\Big\}Y_n-\eta\Big\|^2\Big]}{\beta(1-\beta)} \qquad(3\text{-}25)$$

treated in Section 3.2. By Theorem 3.2.0.11, $\delta_\tau(\bar Z,S)$ dominates $\bar Z$ as an estimator of $\theta$ provided $0<\tau_0(u)<4(p-2)/(n-p+2)$ for all $u$ and $3\le p<n$. Accordingly, $\delta_\tau(\bar Z,S)$ dominates $\bar Z$ provided $0<\tau(u)<2(p-2)(n-1)/(n-p+2)$. We state this result in the form of the following theorem.

Theorem 3.3.0.12. Let $p\ge 3$. Assume (i) $0<\tau(t)<2(p-2)(n-1)/(n-p+2)$ for all $t>0$; (ii) $\tau(t)$ is a differentiable nondecreasing function of $t$ for $t>0$. Then $R(\theta,\delta_\tau(\bar Z,S))<R(\theta,\bar Z)$ for all $\theta\in\mathbb R^p$.

CHAPTER 4
REFERENCE PRIORS UNDER DIVERGENCE LOSS

4.1 First Order Reference Prior under Divergence Loss

In this section we find a reference prior for estimation problems under divergence losses. Such a prior is obtained by maximizing the expected distance between the prior and the corresponding posterior distribution, and can thus be interpreted as noninformative: it is the prior which changes the most, on average, when the sample is observed. If we use the divergence as a distance between a proper prior distribution $\pi(\theta)$ (putting all its mass on a compact set if needed) and the corresponding posterior distribution $\pi(\theta\,|\,x)$, we can express the expected divergence as

$$R_\beta(\pi)=\frac{1-\iint \pi^\beta(\theta)\,\pi^{1-\beta}(\theta\,|\,x)\,m(x)\,dx\,d\theta}{\beta(1-\beta)}=\frac{1-\iint m^\beta(x)\,p^{1-\beta}(x\,|\,\theta)\,\pi(\theta)\,dx\,d\theta}{\beta(1-\beta)}. \qquad(4\text{-}1)$$

Using this expression one can easily see that in order to find a prior that maximizes $R_\beta(\pi)$ we first need an asymptotic expression for $\int m^\beta(x)\,p^{1-\beta}(x\,|\,\theta)\,dx$. In this section we assume that we have a parametric family $\{p_\theta:\theta\in\Theta\}$, $\Theta\subseteq\mathbb R^p$, of probability density functions $\{p_\theta=p(x\,|\,\theta):\theta\in\Theta\}$ with respect to a $\sigma$-finite dominating measure $\lambda(dx)$ on a measurable space $\mathcal X$, and that the prior distribution for $\theta$ has a pdf $\pi(\theta)$ with respect to Lebesgue measure. Next we give a definition of the divergence rate when the parameter $\beta<1$ and the sample size is $n$.
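A small Monte Carlo experiment consistent with Theorem 3.3.0.12 can be run with one simple member of the class: a constant $\tau$, which is trivially nondecreasing and differentiable, taken inside the bound of assumption (i). The exact placement of the $n$ and $n-1$ factors in (3-20) follows our reading of the display, so the scaling below is an assumption, and the choice $\theta=0$ (where shrinkage helps most) is also ours:

```python
import numpy as np

rng = np.random.default_rng(42)
p, n, beta = 6, 20, 0.5
theta = np.zeros(p)                        # shrinkage helps most at theta = 0
tau = (p - 2) * (n - 1) / (n - p + 2)      # constant tau, inside (0, 2(p-2)(n-1)/(n-p+2))

def div_loss(a):
    # divergence loss (3-18) with the true Sigma = I_p
    q = 0.5 * beta * (1 - beta) * n * np.sum((a - theta) ** 2)
    return (1 - np.exp(-q)) / (beta * (1 - beta))

loss_mean, loss_shrink = [], []
for _ in range(4000):
    Z = rng.standard_normal((n, p)) + theta
    Zbar = Z.mean(axis=0)
    S = (Z - Zbar).T @ (Z - Zbar) / (n - 1)     # usual (unbiased) estimator of Sigma
    F = n * Zbar @ np.linalg.solve(S, Zbar)     # n Zbar^T S^{-1} Zbar
    delta = (1 - tau / F) * Zbar                # shrinkage estimator, as we read (3-20)
    loss_mean.append(div_loss(Zbar))
    loss_shrink.append(div_loss(delta))

assert np.mean(loss_shrink) < np.mean(loss_mean)   # empirical dominance at theta = 0
```

With these settings the empirical risk of the shrinkage estimator falls well below that of $\bar Z$; a constant $\tau$ is only one member of the class covered by the theorem.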
We define the relative divergence rate between the true distribution $P_\theta$ of the sample of size $n$ and its marginal distribution $m_n(x)$ to be

$$R_\beta(\theta,\pi)=D_R(P_\theta,m_n)=\frac{1-\int \{m_n(x)\}^{\beta/n}\,\{p(x\,|\,\theta)\}^{1-\beta/n}\,\lambda(dx)}{(\beta/n)(1-\beta/n)}. \qquad(4\text{-}2)$$

It is easy to check that as $\beta\to 0$ this definition reduces to the relative entropy rate considered, for example, in Clarke and Barron [21]. Using this definition, we can define, for a given prior $\pi$, the corresponding Bayes risk

$$R_\beta(\pi)=E\big[D_R(P_\theta,m_n)\big]. \qquad(4\text{-}3)$$

To find an asymptotic expansion for this risk function, we re-express it as

$$R_\beta(\theta,\pi)=\frac{1-E_\theta\exp\Big\{-\frac{\beta}{n}\,\ln\frac{p(x\,|\,\theta)}{m(x)}\Big\}}{(\beta/n)(1-\beta/n)}, \qquad(4\text{-}4)$$

where $p(x\,|\,\theta)=\prod_{i=1}^n p(x_i\,|\,\theta)$ and $m(x)=\int p(x\,|\,\theta)\,\pi(\theta)\,d\theta$. Clarke and Barron [20] derived the following formula:

$$\ln\frac{p(x\,|\,\theta)}{m(x)}=\frac{p}{2}\ln\frac{n}{2\pi}+\frac12\ln|I(\theta)|-\ln\pi(\theta)+\frac12 S_n^T I^{-1}(\theta)S_n+o(1), \qquad(4\text{-}5)$$

where $o(1)\to 0$ in $L^1(P_\theta)$ as well as in probability as $n\to\infty$. Here $S_n=(1/\sqrt n)\,\nabla\ln p(x\,|\,\theta)$ is the standardized score function, for which $E(S_nS_n^T)=I(\theta)$ and $E[S_n^T(I(\theta))^{-1}S_n]=p$. Using this formula we can write an asymptotic expansion (4-6) for the risk function (4-4). Since $S_n^T(I(\theta))^{-1}S_n$ is asymptotically distributed as $\chi^2_p$, the expectation in (4-6) can be evaluated up to $o(1)$ terms, giving (4-7). Hence, to first order, the prior which maximizes the Bayes risk $R_\beta(\pi)$ is the one that maximizes the integral

$$\int \pi(\theta)\,\ln\frac{|I(\theta)|^{1/2}}{\pi(\theta)}\,d\theta \qquad(4\text{-}8)$$

subject to the constraint $\int\pi(\theta)\,d\theta=1$. A simple calculus of variations argument gives this maximizer as

$$\pi(\theta)\ \propto\ |I(\theta)|^{1/2}, \qquad(4\text{-}9)$$

which is Jeffreys' prior.

4.2 Reference Prior Selection under Divergence Loss for One Parameter Exponential Family

Let $X_1,\ldots,X_n$ be iid with common pdf (with respect to some $\sigma$-finite measure) belonging to the regular one-parameter exponential family, given by

$$p(x\,|\,\theta)=\exp[\theta x-\psi(\theta)+c(x)]. \qquad(4\text{-}10)$$

Consider a prior $\pi(\theta)$ for $\theta$ which puts all its mass on a compact set.
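The calculus-of-variations step can be illustrated numerically. On a compact parameter set, the functional $\int\pi(\theta)\ln\{\sqrt{I(\theta)}/\pi(\theta)\}\,d\theta$ equals $\ln\int\sqrt{I(\theta)}\,d\theta$ minus the Kullback-Leibler divergence from $\pi$ to the normalized Jeffreys prior, so Jeffreys' prior is the maximizer. The model choice below ($I(\theta)=e^\theta$ on $[0,1]$, the Poisson natural parameter) is ours, purely for illustration:

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 2001)
dx = theta[1] - theta[0]
I = np.exp(theta)                # Fisher information of the Poisson natural parameter

def functional(pi):
    """Riemann-sum version of  int pi * log(sqrt(I)/pi) dtheta  for normalized pi."""
    pi = pi / (pi.sum() * dx)
    return np.sum(pi * np.log(np.sqrt(I) / pi)) * dx

jeffreys = np.sqrt(I)
best = functional(jeffreys)

assert best >= functional(np.ones_like(theta))   # beats the uniform prior
assert best >= functional(np.exp(theta))         # beats an arbitrary tilted prior
# the maximum equals the log normalizing constant of Jeffreys' prior
assert np.isclose(best, np.log(np.sqrt(I).sum() * dx))
```

Any perturbation of the Jeffreys density lowers the functional by exactly the Kullback-Leibler divergence it introduces, which is the discrete analogue of the variational argument in the text.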
We will pass to the limit later as needed. Then the posterior is given by

$$\pi(\theta\,|\,x_1,\ldots,x_n)\ \propto\ \exp[n\{\theta\bar x-\psi(\theta)\}]\,\pi(\theta). \qquad(4\text{-}11)$$

We denote the same by $\pi(\theta\,|\,\bar x)$. Also, let $p(x\,|\,\theta)$ denote the conditional pdf of $X$ given $\theta$ and $m(x)$ the marginal pdf of $X$. The general expected divergence between the prior and the posterior is given by

$$R_\beta(\pi)=\frac{1-\int\big[\int \pi^\beta(\theta)\,\pi^{1-\beta}(\theta\,|\,x)\,d\theta\big]\,m(x)\,dx}{\beta(1-\beta)}. \qquad(4\text{-}12)$$

From the relation $p(x\,|\,\theta)\,\pi(\theta)=\pi(\theta\,|\,x)\,m(x)$, one can re-express $R_\beta(\pi)$ given in (4-12) as

$$R_\beta(\pi)=\frac{1-\iint \pi^{1+\beta}(\theta)\,\pi^{-\beta}(\theta\,|\,x)\,p(x\,|\,\theta)\,dx\,d\theta}{\beta(1-\beta)}=\frac{1-\int \pi^{1+\beta}(\theta)\,E\big[\pi^{-\beta}(\theta\,|\,\bar x)\,\big|\,\theta\big]\,d\theta}{\beta(1-\beta)}. \qquad(4\text{-}13)$$

Let $I(\theta)=\psi''(\theta)$ denote the per observation Fisher information number. Then we have the following theorem.

Theorem 4.2.0.13. As $n\to\infty$,

$$E_\theta\big[\pi^{-\beta}(\theta\,|\,\bar x)\big]=\Big(\frac{2\pi}{n\,I(\theta)}\Big)^{\beta/2}\frac{1}{\sqrt{1-\beta}}\,\big\{1+n^{-1}Q_\beta(\theta)\big\}+o(n^{-1}), \qquad(4\text{-}14)$$

where the correction $Q_\beta(\theta)$ collects terms in $(\psi'''(\theta))^2/I^3(\theta)$ (with a coefficient involving $3\beta^2+7\beta+10$), $\psi''''(\theta)/I^2(\theta)$, $\pi''(\theta)/\{I(\theta)\pi(\theta)\}$, $(\pi'(\theta)/\pi(\theta))^2/I(\theta)$ and $\psi'''(\theta)\pi'(\theta)/\{I^2(\theta)\pi(\theta)\}$; only the leading factor matters for the first order reference prior below.

Proof of Theorem 4.2.0.13. Let $\hat\theta$ denote the MLE of $\theta$, so that $\psi'(\hat\theta)=\bar x$. Throughout this section we use the notations

$$c=\psi''(\hat\theta),\quad a_3=\psi'''(\hat\theta),\quad a_4=\psi''''(\hat\theta),\quad h=\sqrt n\,(\theta-\hat\theta).$$

We use the shrinkage argument as presented in Datta and Mukerjee [24]. From Datta and Mukerjee ([24], p. 13), the posterior density of $h$ admits the expansion

$$\pi(h\,|\,\bar x)=\sqrt{\frac{c}{2\pi}}\,e^{-ch^2/2}\Big[1+\frac{1}{\sqrt n}\Big\{-\frac{a_3h^3}{6}+h\,\frac{\pi'(\hat\theta)}{\pi(\hat\theta)}\Big\}+\frac{1}{n}\Big\{-\frac{a_4h^4}{24}+\frac{a_3^2h^6}{72}+\frac{h^2}{2}\frac{\pi''(\hat\theta)}{\pi(\hat\theta)}-\frac{a_3h^4}{6}\frac{\pi'(\hat\theta)}{\pi(\hat\theta)}+k_n\Big\}\Big]+o(n^{-1}), \qquad(4\text{-}15)$$

where $k_n$, free of $h$, is the normalizing constant determined by $\int\pi(h\,|\,\bar x)\,dh=1$. With the general expansion $(1+\epsilon)^{-\beta}=1-\beta\epsilon+\tfrac12\beta(\beta+1)\epsilon^2+\cdots$ applied to (4-15), one obtains the corresponding expansion of $\pi^{-\beta}(h\,|\,\bar x)$, whose leading factor is $(c/2\pi)^{-\beta/2}\,e^{\beta ch^2/2}$.
Multiplying this by (4-15) and integrating term by term against the normal kernel (the odd moments of the $N(0,c^{-1})$ distribution vanishing), one obtains (4-16)-(4-17) and hence

$$\int\pi^{1-\beta}(h\,|\,\bar x)\,dh=\Big(\frac{2\pi}{c}\Big)^{\beta/2}\frac{1}{\sqrt{1-\beta}}\,\big\{1+n^{-1}\,b(\hat\theta,\beta)\big\}+o(n^{-1}), \qquad(4\text{-}18)$$

where $b(\hat\theta,\beta)$ collects the quadratic contributions of the $n^{-1/2}$ terms of (4-15) and the direct contributions of its $n^{-1}$ terms; it involves $a_3^2/c^3$, $a_4/c^2$, $\pi''(\hat\theta)/\pi(\hat\theta)$, $a_3\pi'(\hat\theta)/\{c^2\pi(\hat\theta)\}$ and $\{\pi'(\hat\theta)/\pi(\hat\theta)\}^2/c$. From the relation $\theta=\hat\theta+h/\sqrt n$, so that $\pi(\theta\,|\,\bar x)=\sqrt n\,\pi(h\,|\,\bar x)$, we get

$$E^{\pi}\big[\pi^{-\beta}(\theta\,|\,\bar x)\,\big|\,\bar x\big]=\int\pi^{1-\beta}(\theta\,|\,\bar x)\,d\theta=n^{-\beta/2}\int\pi^{1-\beta}(h\,|\,\bar x)\,dh. \qquad(4\text{-}19)$$

Thus by (4-18) and (4-19),

$$E^{\pi}\big[\pi^{-\beta}(\theta\,|\,\bar x)\,\big|\,\bar x\big]=\Big(\frac{2\pi}{n\,c}\Big)^{\beta/2}\frac{1}{\sqrt{1-\beta}}\,\big\{1+n^{-1}\,b(\hat\theta,\beta)\big\}+o(n^{-1}). \qquad(4\text{-}20)$$

In the next step, we replace $c=\psi''(\hat\theta)$, $a_3$, $a_4$ and $\pi(\hat\theta)$ by their values at the true $\theta$ through the further expansion $\hat\theta=\theta+O_p(n^{-1/2})$ and take expectations over $\bar x$; the additional terms so produced are again of order $n^{-1}$ and involve $\psi'''(\theta)$, $\psi''''(\theta)$, $\pi'(\theta)/\pi(\theta)$ and $\pi''(\theta)/\pi(\theta)$, leading to (4-21).
The last step gives an expression for $E_\theta[\pi^{-\beta}(\theta\,|\,\bar x)]$. We consider $\pi(\theta)$ to converge weakly to the prior degenerate at the true $\theta$, and we have chosen $\pi(\theta)$ in such a way that the last two integrals in (4-21) can be integrated by parts with the boundary term equal to zero every time integration by parts is used. Carrying out these integrations by parts and collecting terms, (4-22)-(4-24), one finally gets

$$E_\theta\big[\pi^{-\beta}(\theta\,|\,\bar x)\big]=\Big(\frac{2\pi}{n\,I(\theta)}\Big)^{\beta/2}\frac{1}{\sqrt{1-\beta}}\,\big\{1+n^{-1}\,Q_\beta(\theta)\big\}+o(n^{-1}), \qquad(4\text{-}25)$$

with $Q_\beta(\theta)$ as in the statement of the theorem. This completes the proof. $\square$

In view of Theorem 4.2.0.13, for $\beta<1$ and $\beta\neq 0$, one has

$$R_\beta(\pi)=\frac{1-\big(\frac{2\pi}{n}\big)^{\beta/2}(1-\beta)^{-1/2}\int\pi^{1+\beta}(\theta)\,I^{-\beta/2}(\theta)\,d\theta\,\{1+o(1)\}}{\beta(1-\beta)}. \qquad(4\text{-}26)$$

Thus the first order approximation to $R_\beta(\pi)$ is given by

$$\frac{1-\big(\frac{2\pi}{n}\big)^{\beta/2}(1-\beta)^{-1/2}\int\pi^{1+\beta}(\theta)\,I^{-\beta/2}(\theta)\,d\theta}{\beta(1-\beta)}.$$

We want to maximize this expression with respect to $\pi(\theta)$ subject to $\int\pi(\theta)\,d\theta=1$. We will show that Jeffreys' prior asymptotically maximizes $R_\beta(\pi)$ for every $\beta<1$, $\beta\neq 0$. To do this we use Hölder's inequality, in the following two forms.

Hölder's inequality for positive exponents ([39], p. 190). Let $p,q>1$ be real numbers satisfying $1/p+1/q=1$, and let $f\in L^p$, $g\in L^q$.
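The leading term of Theorem 4.2.0.13 can be checked exactly in a Gaussian model, where the posterior is available in closed form. With $X_1,\ldots,X_n$ i.i.d. $N(\theta,1)$ (so $I(\theta)=1$) and a $N(0,\sigma_0^2)$ prior, which is our illustrative choice, $E_\theta[\pi^{-\beta}(\theta\,|\,\bar x)]$ has a closed form that can be compared with $(2\pi/(nI(\theta)))^{\beta/2}(1-\beta)^{-1/2}$:

```python
import math

beta, sigma0_sq, theta = 0.4, 2.0, 0.3   # illustrative values; beta < 1

def exact_and_leading(n):
    v = 1.0 / (n + 1.0 / sigma0_sq)      # posterior variance
    a = n * v                            # weight of xbar in the posterior mean
    # E_theta exp{beta (theta - mu_n)^2 / (2v)} via the normal mgf of a square:
    # theta - mu_n ~ N(theta (1 - a), a^2 / n), and 2 t s^2 = beta * a < 1
    exact = ((2 * math.pi * v) ** (beta / 2) / math.sqrt(1 - beta * a)
             * math.exp(beta * theta**2 * (1 - a)**2 / (2 * v * (1 - beta * a))))
    leading = (2 * math.pi / n) ** (beta / 2) / math.sqrt(1 - beta)
    return exact, leading

e1, l1 = exact_and_leading(100)
e2, l2 = exact_and_leading(10000)
err1, err2 = abs(e1 / l1 - 1), abs(e2 / l2 - 1)
assert err2 < err1 < 0.05                # the relative error is O(1/n)
```

Here the whole $O(n^{-1})$ correction comes from the prior, since all higher cumulants of the model vanish ($\psi'''=\psi''''=0$); the computation is deterministic, with no simulation involved.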
Then $fg\in L^1$ and

$$\int|fg|\,d\mu\ \le\ \Big(\int|f|^p\,d\mu\Big)^{1/p}\Big(\int|g|^q\,d\mu\Big)^{1/q}, \qquad(4\text{-}27)$$

with equality iff $|f|^p\propto|g|^q$.

Hölder's inequality for negative exponents ([39], p. 191). Let $0<q<1$ and $p\in\mathbb R$ be such that $1/p+1/q=1$ (hence $p<0$). If $f,g$ are measurable functions then

$$\int|fg|\,d\mu\ \ge\ \Big(\int|f|^p\,d\mu\Big)^{1/p}\Big(\int|g|^q\,d\mu\Big)^{1/q}, \qquad(4\text{-}28)$$

with equality iff $|f|^p\propto|g|^q$.

First we consider $0<\beta<1$. In this case it is enough to minimize

$$\int\pi^{1+\beta}(\theta)\,\big(I(\theta)\big)^{-\beta/2}\,d\theta.$$

From Hölder's inequality for positive exponents with

$$p=1+\beta,\qquad q=\frac{1+\beta}{\beta},\qquad f(\theta)=\pi(\theta)\,\big(I(\theta)\big)^{-\frac{\beta}{2(1+\beta)}},\qquad g(\theta)=\big(I(\theta)\big)^{\frac{\beta}{2(1+\beta)}},$$

we can write

$$1=\int\pi(\theta)\,d\theta=\int f(\theta)\,g(\theta)\,d\theta\ \le\ \Big(\int\pi^{1+\beta}(\theta)\,I^{-\beta/2}(\theta)\,d\theta\Big)^{\frac{1}{1+\beta}}\Big(\int I^{1/2}(\theta)\,d\theta\Big)^{\frac{\beta}{1+\beta}},$$

so that

$$\int\pi^{1+\beta}(\theta)\,I^{-\beta/2}(\theta)\,d\theta\ \ge\ \Big(\int I^{1/2}(\theta)\,d\theta\Big)^{-\beta},$$

with "=" iff $\pi(\theta)\propto I^{1/2}(\theta)$.

Next, consider the case $-1<\beta<0$, when maximization of $R_\beta(\pi)$ is equivalent to maximization of $\int\pi^{1+\beta}(\theta)\,I^{-\beta/2}(\theta)\,d\theta$. From Hölder's inequality for positive exponents with

$$p=\frac{1}{1+\beta},\qquad q=-\frac{1}{\beta},\qquad f(\theta)=\pi^{1+\beta}(\theta),\qquad g(\theta)=\big(I(\theta)\big)^{-\beta/2},$$

we obtain

$$\int\pi^{1+\beta}(\theta)\,I^{-\beta/2}(\theta)\,d\theta\ \le\ \Big(\int\pi(\theta)\,d\theta\Big)^{1+\beta}\Big(\int I^{1/2}(\theta)\,d\theta\Big)^{-\beta}=\Big(\int I^{1/2}(\theta)\,d\theta\Big)^{-\beta},$$

with "=" iff $\pi(\theta)\propto I^{1/2}(\theta)$.

When $\beta<-1$, using Hölder's inequality for negative exponents with

$$p=\frac{1}{1+\beta}<0,\qquad 0<q=-\frac{1}{\beta}<1,$$
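Both forms of Hölder's inequality are easy to sanity-check on a discrete measure. The sketch below uses random positive functions (our own toy data) together with the exponent pairs appearing in (4-27) and (4-28):

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.uniform(0.5, 2.0, size=1000)
g = rng.uniform(0.5, 2.0, size=1000)
mu = 1.0 / f.size                        # uniform probability measure on 1000 atoms

def norm(h, r):
    """(integral |h|^r dmu)^(1/r); r may be negative for the reversed inequality."""
    return np.sum(np.abs(h) ** r * mu) ** (1.0 / r)

# (4-27): positive exponents, p, q > 1 with 1/p + 1/q = 1
p, q = 1.8, 1.8 / 0.8
assert np.sum(f * g * mu) <= norm(f, p) * norm(g, q) + 1e-12

# (4-28): negative exponent, 0 < q < 1 forces p = q/(q-1) < 0
q = 0.4
p = q / (q - 1.0)                        # p = -2/3, and 1/p + 1/q = 1
assert np.sum(f * g * mu) >= norm(f, p) * norm(g, q) - 1e-12
```

For $p<0$ the "norm" is no longer a norm, which is why the inequality reverses; positivity of $f$ and $g$ keeps both sides finite.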
