
Divergence Loss for Shrinkage Estimation, Prediction and Prior Selection





Permanent Link: http://ufdc.ufl.edu/UFE0015678/00001

Material Information

Title: Divergence Loss for Shrinkage Estimation, Prediction and Prior Selection
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0015678:00001





Full Text











DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION AND
PRIOR SELECTION
















By

VICTOR MERGEL


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


2006

































Copyright 2006

by

Victor Mergel















ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor Dr. Malay

Ghosh for his support and professional guidance. Working with him was not

only enjoyable but also a very valuable personal experience.

I would also like to thank Michael Daniels, Panos M. Pardalos, Brett Presnell,

and Ronald Randles, for their careful reading of and extensive comments on this

dissertation.















TABLE OF CONTENTS
page

ACKNOWLEDGMENTS ............................................... iii

ABSTRACT ...................................................... vi

CHAPTER

1 INTRODUCTION AND LITERATURE REVIEW ............................ 1

   1.1 Statistical Decision Theory ............................... 1
   1.2 Literature Review ......................................... 2
       1.2.1 Point Estimation of the Multivariate Normal Mean .... 2
       1.2.2 Shrinkage towards Regression Surfaces ............... 10
       1.2.3 Baranchik Class of Estimators Dominating the Sample Mean ... 12
   1.3 Shrinkage Predictive Distribution for the Multivariate Normal Density ... 13
       1.3.1 Shrinkage of Predictive Distribution ................ 13
       1.3.2 Minimax Shrinkage towards Points or Subspaces ....... 17
   1.4 Prior Selection Methods and Shrinkage Argument ............ 18
       1.4.1 Prior Selection ..................................... 18
       1.4.2 Shrinkage Argument .................................. 20

2 ESTIMATION, PREDICTION AND THE STEIN PHENOMENON UNDER
  DIVERGENCE LOSS ............................................... 22

   2.1 Some Preliminary Results .................................. 22
   2.2 Minimaxity Results ........................................ 27
   2.3 Admissibility for p = 1 ................................... 30
   2.4 Inadmissibility Results for p ≥ 3 ......................... 31
   2.5 Lindley's Estimator and Shrinkage to Regression Surface ... 40

3 POINT ESTIMATION UNDER DIVERGENCE LOSS WHEN VARIANCE-
  COVARIANCE MATRIX IS UNKNOWN .................................. 43

   3.1 Preliminary Results ....................................... 43
   3.2 Inadmissibility Results when Variance-Covariance Matrix is
       Proportional to Identity Matrix ........................... 44
   3.3 Unknown Positive Definite Variance-Covariance Matrix ...... 49

4 REFERENCE PRIORS UNDER DIVERGENCE LOSS ......................... 53

   4.1 First Order Reference Prior under Divergence Loss ......... 53
   4.2 Reference Prior Selection under Divergence Loss for One Parameter
       Exponential Family ........................................ 55

5 SUMMARY AND FUTURE RESEARCH .................................... 66

   5.1 Summary ................................................... 66
   5.2 Future Research ........................................... 66

REFERENCES ..................................................... 68

BIOGRAPHICAL SKETCH ............................................ 73















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

DIVERGENCE LOSS FOR SHRINKAGE ESTIMATION, PREDICTION AND
PRIOR SELECTION

By

Victor Mergel

August 2006

Chair: Malay Ghosh
Major Department: Statistics

In this dissertation, we consider the following problems: (1) estimate a normal

mean under a general divergence loss and (2) find a predictive density of a new

observation drawn independently of the sampled observations from a normal

distribution with the same mean but possibly with a different variance under

the same loss. The general divergence loss includes as special cases both the

Kullback-Leibler and Bhattacharyya-Hellinger losses. The sample mean, which is a

Bayes estimator of the population mean under this loss and the improper uniform

prior, is shown to be minimax in any arbitrary dimension. A counterpart of this

result for the predictive density is also proved in any arbitrary dimension. The

admissibility of these rules holds in one dimension, and we conjecture that the

result is true in two dimensions as well. However, the general Baranchik class of

estimators, which includes the James-Stein estimator and the Strawderman class

of estimators, dominates the sample mean in three or higher dimensions for the

estimation problem. An analogous class of predictive densities is defined and any

member of this class is shown to dominate the predictive density corresponding

to a uniform prior in three or higher dimensions. For the prediction problem,









in the special case of Kullback-Leibler loss, our results complement to a certain

extent some of the recent important work of Komaki and George et al. While our

proposed approach produces a general class of empirical Bayes predictive densities

dominating the predictive density under a uniform prior, George et al. produce a

general class of Bayes predictors achieving a similar dominance. We also show that

various modifications of the James-Stein estimator continue to dominate the sample

mean, and by the duality of the estimation and predictive density results which we

will show, similar results continue to hold for the prediction problem as well. In

the last chapter we consider the problem of objective prior selection by maximizing

the distance between the prior and the posterior. We show that the reference prior

under divergence loss coincides with Jeffreys' prior except in one special case.















CHAPTER 1
INTRODUCTION AND LITERATURE REVIEW

1.1 Statistical Decision Theory

Statistical Decision Theory primarily consists of three basic elements:

the sample space X,

the parameter space Θ,

and

the action space, A.

We assume that an unknown element θ ∈ Θ labels the otherwise known

distribution. We are concerned with inferential procedures for θ using the sampled

observations x (real or vector valued).

A decision rule δ is a function with domain space X and a range space A.

Thus, for each x ∈ X we have an action a = δ(x) ∈ A. For every θ ∈ Θ and

δ(x) ∈ A, we incur a loss L(θ, δ(x)). The long-term average loss associated with δ is

the expectation Eθ[L(θ, δ(X))], and this expectation is called the risk function of δ

and will be denoted as R(θ, δ).
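To fix ideas, the risk function defined above can be approximated by simulation. The short Python sketch below is an illustration added to this review (not part of the original text); the two rules and the value of θ are arbitrary choices.

```python
import numpy as np

# Risk of two simple rules for estimating theta from X ~ N(theta, 1) under squared error:
# R(theta, delta) = E_theta[(delta(X) - theta)^2], approximated by Monte Carlo.
rng = np.random.default_rng(3)
theta, reps = 2.0, 100_000
x = rng.normal(theta, 1.0, reps)
for name, delta in [("delta1(x) = x", lambda v: v), ("delta2(x) = 0.8 x", lambda v: 0.8 * v)]:
    print(name, np.mean((delta(x) - theta) ** 2))
# delta1 has constant risk 1; delta2 has risk 0.64 + 0.04*theta^2, smaller only for small |theta|.
```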

Since the risk function depends on the unknown parameter θ, it is very often

impossible to find a decision rule that is optimal for every θ. Thus the statistician

needs to restrict attention to decision rules such as Bayes, minimax and admissible rules.

The method required to solve the statistical problem at hand strongly depends

on the parametric model considered (the class P = {Pθ, θ ∈ Θ} to which the

distribution of X belongs), the structure of the decision space and the choice of loss function.

The choice of the decision space often depends on the statistical problem at

hand. For example, two-decision problems are used in testing of hypotheses; for

point estimation problems the decision space often coincides with the parameter space.









The choice of the loss function is up to the decision maker, and it is supposed

to evaluate the penalty (or error) associated with the decision δ when the

parameter takes the value θ. When the setting of an experiment is such that the loss

function cannot be determined, the most common option is to resort to classical

losses such as quadratic loss or absolute error loss. Sometimes, the experiment

settings are very uninformative and the decision maker may need to use an intrinsic

loss such as general divergence loss as considered in this dissertation. This is

discussed for example in [56].

In this dissertation, we mostly look at the point estimation problem of the

multivariate normal mean under general divergence loss, and we also consider the

prediction problem, where we are interested in estimating the density function

f(x | θ) itself.

In multidimensional settings, for dimensions high enough, the best invariant

estimator is not always admissible. There often exists a class of estimators that

dominates the intuitive choice. For quadratic loss this effect was first discovered by

Stein [58]. In this dissertation, we consider the estimation and prediction problems

simultaneously under a broader class of losses to examine whether the Stein effect

continues to hold.

Since many results for estimation and prediction problems for the multivariate

normal distribution, as considered in this dissertation, seem to have some inherent

similarities with the parallel theory of estimating a multivariate normal mean under

the quadratic loss, we will begin with a literature review of known results.

1.2 Literature Review

1.2.1 Point Estimation of the Multivariate Normal Mean

Suppose X ~ N(θ, Ip). For estimating the unknown normal mean θ under

quadratic loss, the best equivariant estimator is X, the MLE, which is also the

posterior mean under the improper uniform prior; see [56], pp. 429-431. Blyth [14]









showed that this estimator is minimax and admissible when p = 1. Unfortunately,

this estimator may fail to be admissible in multidimensional problems. For

simultaneous estimation of p (≥ 2) normal means, this natural estimator is

admissible for p = 2, but it is inadmissible for p ≥ 3 for a wide class of losses.

This fact was first discovered by Stein [58] for the sum of squared error losses, i.e., when

L(θ, a) = ||a - θ||². The inadmissibility result was later extended by Brown [17]

to a wider class of losses. For the sum of squared error losses, an explicit estimator

dominating the sample mean was proposed by James and Stein [42].

For estimating the multivariate normal mean, Stein [58] recommended using

"spherically symmetric estimators" of 0 since, under the loss L(0, a) = I 0 a |2,

X is an admissible estimator of 0 if and only if it is admissible in the class of

all spherically symmetric estimators. The definition of spherically symmetric

estimators is as follows.

Definition 1.2.1.1. An estimator δ(X) of θ is said to be spherically symmetric if

and only if δ(X) has the form δ(X) = h(||X||²)X.

Stein used this result and the Cramér-Rao inequality to prove admissibility

of X for p = 2. Later Brown and Hwang [19] provided a Blyth-type argument for

proving the same result.

As mentioned earlier, X is a generalized Bayes estimator of θ ∈ Rᵖ under the

loss L(θ, a) = ||a - θ||² and the uniform prior. Stein [58] showed the existence of

a, b such that (1 - b/(a + ||X||²))X dominates X for p ≥ 3. Later, James and Stein

[42] produced the explicit estimator

δ(X) = (1 - (p - 2)/||X||²) X,

which dominates X for p ≥ 3.
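As a quick numerical illustration of this dominance (my own sketch, not from the original text; the dimension, true mean and replication count are arbitrary), the following compares the Monte Carlo squared-error risks of X and of the James-Stein estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

def james_stein(x):
    # James-Stein estimator shrinking x toward the origin; identity covariance assumed.
    p = x.size
    return (1.0 - (p - 2) / np.sum(x ** 2)) * x

p, reps = 10, 20_000
theta = np.full(p, 1.0)                      # arbitrary true mean
x = rng.normal(theta, 1.0, size=(reps, p))   # X ~ N(theta, I_p), repeated
risk_x = np.mean(np.sum((x - theta) ** 2, axis=1))        # approximately p
js = np.apply_along_axis(james_stein, 1, x)
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))      # strictly smaller for p >= 3
print(f"risk of X: {risk_x:.2f}, risk of James-Stein: {risk_js:.2f}")
```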

Efron and Morris [28] show how James-Stein estimators arise in an empirical

Bayes context. A good review of empirical Bayes (EB) and hierarchical Bayes (HB)









approaches can be found in [35]. As described in [8], an EB scenario is one in

which known relationships among the coordinates of a parameter vector allow use

of the data to estimate some features of the prior distribution. Both EB and HB

procedures recognize the uncertainty in the prior information. However, while the

EB method estimates the unknown prior parameters in some classical way like the

MLE or the method of moments from the marginal distributions (after integrating

θ out) of observations, the HB procedure models the prior distribution in stages.

To illustrate this, we begin with the following setup.

A Conditional on θ1,...,θp, let X1,...,Xp be independent with

Xi ~ N(θi, σ²), i = 1,...,p, σ² (> 0) being known. Without loss of generality,

assume σ² = 1.

B The θi's have independent N(μi, A), i = 1,...,p, priors.

The posterior distribution of θ given X = x is then

N((1 - B)x + Bμ, (1 - B)Ip),

where B = (1 + A)⁻¹. The posterior mean (the usual Bayes estimate) of θ is given

by

E(θ | X = x) = (1 - B)x + Bμ.   (1-1)

Now consider the following three scenarios.

Case I. Let μ1 = ... = μp = μ, where μ (real) is unknown, but A (> 0) is known.

Based on the marginal distribution of X, X̄ is the UMVUE, MLE and the best

equivariant estimator of μ. Thus, using the EB approach, an EB estimator of θ is

θ̂_EB^(1) = (1 - B)X + B X̄ 1p.   (1-2)

This estimator was proposed in Lindley and Smith (1972), but they used the

HB approach. Their model was:








(i) conditional on θ and μ, X ~ N(θ, Ip);

(ii) conditional on μ, θ ~ N(μ1p, AIp);

(iii) μ is uniform on (-∞, ∞).

Then the joint pdf of X, θ and μ is given by

f(x, θ, μ) ∝ exp[-(1/2)||x - θ||²] A^(-p/2) exp[-(1/(2A))||θ - μ1p||²].   (1-3)

Thus the joint pdf of X and θ is

f(x, θ) ∝ exp[-(1/2)(θᵀDθ - 2θᵀx + xᵀx)],   (1-4)

and the posterior distribution of θ given X = x is N(D⁻¹x, D⁻¹), where

D = A⁻¹[(A + 1)Ip - p⁻¹Jp].

Thus one gets

E(θ | X = x) = (1 - B)x + B x̄ 1p,   (1-5)

and

V(θ | X = x) = (1 - B)Ip + B p⁻¹Jp,   (1-6)

which gives the same estimator as under the EB approach. But the EB approach

ignores the uncertainty involved in estimating the prior parameters, and thus

underestimates the posterior variance.
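To make the Case I computation concrete, here is a small sketch (an illustration added to this review, with A treated as known and σ² = 1) of the EB estimate (1-2)/(1-5) and the HB posterior covariance (1-6):

```python
import numpy as np

def case1_estimates(x, A):
    """EB/HB point estimate and HB posterior covariance for Case I (sigma^2 = 1, A known)."""
    p = x.size
    B = 1.0 / (1.0 + A)
    xbar = x.mean()
    point = (1 - B) * x + B * xbar * np.ones(p)      # equations (1-2) and (1-5)
    J = np.ones((p, p))                              # J_p, the p x p matrix of ones
    post_cov = (1 - B) * np.eye(p) + (B / p) * J     # equation (1-6)
    return point, post_cov

x = np.array([2.3, -0.7, 1.1, 0.4, 3.0])
est, cov = case1_estimates(x, A=2.0)
print(est)           # each coordinate pulled toward the overall mean x-bar
print(np.diag(cov))  # HB posterior variances exceed the naive value 1 - B by B/p
```

Note how the HB posterior variances, the diagonal of (1-6), exceed the naive EB value 1 - B, reflecting the extra term B p⁻¹Jp.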
Lindley and Smith [51] have shown that the risk of θ̂_EB^(1) is not uniformly

smaller than that of X under squared error loss. However, there is a Bayes risk

superiority of θ̂_EB^(1) over X, as is shown in the following theorem of Ghosh [35]:

Theorem 1.2.1.2. Consider the model X | θ ~ N(θ, Ip) and the prior

θ ~ N(μ1p, AIp). Let E denote expectation over the joint distribution of X and θ.

Then, assuming the loss L1(θ, a) = (a - θ)(a - θ)ᵀ, and writing θ̂_B as the Bayes

estimator of θ under L1,

E L1(θ, X) = Ip;   E L1(θ, θ̂_B) = (1 - B)Ip;   (1-7)

E L1(θ, θ̂_EB^(1)) = (1 - B)Ip + B p⁻¹Jp.   (1-8)

Now, assuming the quadratic loss L2(θ, a) = (a - θ)ᵀQ(a - θ), where Q is a

known non-negative definite weight matrix,

E L2(θ, X) = tr(Q);   E L2(θ, θ̂_B) = (1 - B)tr(Q);   (1-9)

E L2(θ, θ̂_EB^(1)) = (1 - B)tr(Q) + B tr(Q p⁻¹Jp).   (1-10)
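A small Monte Carlo check of the traces of (1-7) and (1-8) (my own illustration; the constants p, A and μ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p, A, reps = 6, 1.5, 200_000
B = 1.0 / (1.0 + A)
mu = 0.3 * np.ones(p)

theta = mu + np.sqrt(A) * rng.standard_normal((reps, p))   # theta ~ N(mu*1_p, A I_p)
x = theta + rng.standard_normal((reps, p))                 # X | theta ~ N(theta, I_p)
eb = (1 - B) * x + B * x.mean(axis=1, keepdims=True)       # EB estimator (1-2)

print(np.mean(np.sum((x - theta) ** 2, axis=1)))    # approx p, the trace of (1-7)
print(np.mean(np.sum((eb - theta) ** 2, axis=1)))   # approx p - B(p - 1), the trace of (1-8)
print(p, p - B * (p - 1))                            # theoretical values for comparison
```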

Case II. Assume that μ is known but its components need not be equal. Also
assume A to be unknown. Marginally X ~ N(μ, B⁻¹Ip), so ||X - μ||² ~ B⁻¹χ²_p is a
complete sufficient statistic. Accordingly, for p ≥ 3, the UMVUE of B is given by
(p - 2)/||X - μ||². Substituting this estimator of B, an EB estimator of θ is given by

θ̂_EB^(2) = X - [(p - 2)/||X - μ||²](X - μ).   (1-11)

This estimator is known as a James-Stein estimator (see [42]). The EB
interpretation of this estimator was given in a series of articles by Efron and Morris
([27], [28], [29]).

James and Stein have shown that for p ≥ 3, the risk of θ̂_EB^(2) is smaller than
that of X under the squared error loss. However, if the loss is changed to the
arbitrary quadratic loss L2 of the previous theorem, then the risk dominance of
θ̂_EB^(2) over X does not necessarily hold. The estimator θ̂_EB^(2) dominates X under the loss L2
([8]; [15]) if
(i) tr(Q) > 2 ch1(Q), and
(ii) 0 < p - 2 < 2[tr(Q)/ch1(Q) - 2],
where ch1(Q) denotes the largest eigenvalue of Q.

The Bayes risk dominance, however, still holds, as follows from the following theorem
(see [35]).









Theorem 1.2.1.3. Let X | θ ~ N(θ, Ip) and θ ~ N(μ, AIp). It follows that for

p ≥ 3,

E[L1(θ, θ̂_EB^(2))] = Ip - B(p - 2)p⁻¹ Ip,   (1-12)

and

E[L2(θ, θ̂_EB^(2))] = tr(Q) - B(p - 2)p⁻¹ tr(Q).   (1-13)

Consider the HB approach in this case with X ~ N(θ, Ip), θ ~ N(μ, AIp),

and A having the Type II beta density ∝ A^(m-1)(1 + A)^(-(m+n)), with m > 0, n > 0.

Using the iterated formula for conditional expectations,

θ̂_HB^(2) = E(θ | x) = E(E(θ | B, x)) = (1 - B̂)x + B̂μ,   (1-14)

where B = (A + 1)⁻¹, and

B̂ = ∫₀¹ B^(p/2+n) (1 - B)^(m-1) exp[-(B/2)||x - μ||²] dB / ∫₀¹ B^(p/2+n-1) (1 - B)^(m-1) exp[-(B/2)||x - μ||²] dB.   (1-15)

Strawderman [60] considered the case m = 1, and found sufficient conditions

on n under which the risk of θ̂_HB^(2) is smaller than that of X. His results were

generalized by Faith [31].
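Since B̂ in (1-15) is a ratio of one-dimensional integrals, it is easy to evaluate numerically. The sketch below (an added illustration, not from the original; the function name and data are mine) computes θ̂_HB^(2) for the Strawderman case m = 1:

```python
import numpy as np
from scipy.integrate import quad

def hb_estimate(x, mu, m=1.0, n=1.0):
    """Hierarchical Bayes estimate (1-14) with the Type II beta prior on A; B-hat from (1-15)."""
    p = x.size
    s = np.sum((x - mu) ** 2)
    num = quad(lambda B: B ** (p / 2 + n) * (1 - B) ** (m - 1) * np.exp(-0.5 * B * s), 0, 1)[0]
    den = quad(lambda B: B ** (p / 2 + n - 1) * (1 - B) ** (m - 1) * np.exp(-0.5 * B * s), 0, 1)[0]
    B_hat = num / den
    return (1 - B_hat) * x + B_hat * mu, B_hat

x = np.array([1.8, -0.4, 0.9, 2.2, -1.1, 0.3])
mu = np.zeros_like(x)
est, B_hat = hb_estimate(x, mu)
print(B_hat)   # estimated shrinkage factor in (0, 1)
print(est)     # x pulled toward mu by the factor B_hat
```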

When m = 1, the posterior mode of B is

B̂_MO = min((p + 2n - 2)/||x - μ||², 1).   (1-16)

This leads to

θ̂_HB^(3) = (1 - B̂_MO)X + B̂_MO μ   (1-17)

of θ. When n = 0 this estimator becomes the positive-part James-Stein

estimator, which dominates the usual James-Stein estimator.
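A minimal sketch of the n = 0 case just described (illustration only; shrinkage is toward a known μ):

```python
import numpy as np

def positive_part_james_stein(x, mu):
    # Equations (1-16)-(1-17) with n = 0: the shrinkage factor is capped at 1,
    # so the estimator never overshoots past mu.
    p = x.size
    B = min((p - 2) / np.sum((x - mu) ** 2), 1.0)
    return (1 - B) * x + B * mu

x = np.array([0.2, -0.1, 0.3, 0.1])                 # ||x - mu||^2 is small, so B is capped at 1
print(positive_part_james_stein(x, np.zeros(4)))    # returns mu exactly instead of overshooting
```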









Case III. We consider the same model as in Case I, except that now μ and A > 0

are both unknown. In this case (X̄, Σ(Xi - X̄)²) is complete sufficient, so that the

UMVUEs of μ and B are given by X̄ and (p - 3)/Σ(Xi - X̄)². The EB estimator

of θ in this case is

θ̂_EB^(3) = X - [(p - 3)/Σ(Xi - X̄)²](X - X̄1p).   (1-18)

This modification of the James-Stein estimator was proposed by Lindley [50].

Whereas the James-Stein estimator shrinks X toward a specified point, the above

estimator shrinks X towards the hyperplane spanned by 1p.

The estimator θ̂_EB^(3) is known to dominate X for p ≥ 4. Ghosh [35] has found

the Bayes risk of this estimator under the L1 and L2 losses.

Theorem 1.2.1.4. Assume the model and the prior given in Theorem 1.2.1.2 Then

for p > 4,
L (3)
E[L1(0, B) Ip B(p- 3)(p- 1)-( p-1J), (119)

and

(3)
E[L2(0, ) tr(Q) B(p- 3)(p- )-tr[Q(I p-1J)]. (1-20)


To find the HB estimator of 0 in this case consider the model where

(i) conditional on 0, p and A, X ~ N(0,I,);

(ii) conditional on p and A, 0 N(plp,AIp);

(iii) marginally p and A are independently distributed with p uniform on (-oo, oo),

and A has uniform improper pdf on (0, oo).

Under this model, as shown in [52],


E(0l x) x E(BI x)(x xl,),


(1-21)









and

V(01 x) = V(BI x)(x xlp)(x xl)T +Ip E(BI x)(Ip p-Jp),

where


1- r
E(B x) =B(-3) exp [
0


-B Y(x -)2 dB
2
i= 1

B (pexp -2 B(xi
0 i 1


)2] dB, (1-23)


and


E(B2 x)= Bp-) exp -- B (x,
0


)2] dB


Bp-5) exp --2 (B x x)2 dB. (24)
0 ii 1
Also, one can obtain a positive-part version of Lindley's estimator by

substituting the posterior mode of B namely min (p 5) / (Xi X)2, )

in 1-1. Morris (1981) si-:-' -1- approximations to E(B x) and E(B2 x) involving
1 oo
replacement of f by f both in the numerator as well as in denominator of 1-23
0 0
and 1-24 leading to the following approximations:


p
E(B x) (p -3) / (xi
i= 1


and


E(B21 ) (p l)(p


3) { (xi )2}1
/ iIil


so that


V(B x) 2(p-3) { (xi
/P ui=


(1-22)


2
X)2









Morris [52] points out that the above approximations amount to putting a

uniform prior to A on (-1, oo) which gives the approximation


E(01 x) X- p- 3 (X X ,) (1-25)
E(X- X)2
i= 1

which is Lindley's modification of the James-Stein estimator with


V(0 2(p-3) 2 (X X1,)(X X1)T



+IP- p -3 (p -1J). (1-26)
E(x X)2
i= 1

1.2.2 Shrinkage towards Regression Surfaces

In the previous section, the sample mean was shrunk towards a point or a

subspace spanned by the vector lp. Ghosh [35] synthesized EB and HB methods to

shrink the sample mean towards an arbitrary regression surface. The HB approach

was discussed in details in [51] with known variance components, while the EB

procedure was discussed in [53].

The set up considered in [35] is as follow

I Conditional on 0,b and a let X1,...,Xp be independent with

Xi ~ N(0i, V), i = 1,... ,p, where the i's are known positive constants;

II Conditional on b and a, Oi's are independently distributed with

Oi ~ N(zTb, a), i 1,... ,p, where z1,...,zp are known regression vectors of

dimension r and b is r x 1.

III B and A are marginally independent with B ~ uniform(RP) and

A ~ uniform(0, oc). We assume that p > r + 3. Also, we write

Z = (zi,..., Zp); G = Diag(VI,..., Vp) and assume rank(Z) = r.








As shown in [35], under this model

E(Oil x) = E[(1 Ui)xi + UiZfb x]; (1-27)


V(O| x) =

V[U(xi- zf) I x] + E[V(1- Ui) + (1- U)zT(ZTDZ)-lzi x]; (1-28)


Cov(0i, Oj I) )

Cov[Uj(xj z Tb), Uj(xj zjT) 1 x] + E[AU UzUjT(ZTDZ)- zj, x], (1-29)

where Ui = V/(A + V), = (ZTDZ)-1(ZTDx), D = 1,1(1 Ul..., ,1 Up) with
(i = 1,. ,p).
Morris [53] has approximated E[Oi x] by xi ui(xi zfb), and V[Oi|x] by
(xi zb)2 + V(1 u,)[1 + uzfTDZ)-1Z], i = 1,...,p. In the above
[2/(p- 2)](V + a) ( + a), 1,... ,p,

V = ELC, (V+ a)1 +LI(V + )-1,D = Diag(1- ul,...,1- u),and b
is obtained from b by substituting the estimator of A. The 's are purported to
estimate V(Uilx)'s.
When V = ... = V = V, with u = ... = = V/(a + V) = U,
D = (1 u)I, ZTDZ = (1 u)ZTZ, b= (ZTZ)-lZx = b, then the following
result holds
E(O, x) =xi E(U x)(x zb), (1 30)

and


z b)2 + V VE(UI x)(1


zT(ZTZ)-lzi). (1-31)


v(OjjX) V(Ulx)(Xi









If one adopts Morris's approximations, then one estimates E(UI x) by

UV(p -r 2)/SSE and V(U| x) by [2/(p -r -2)]W/2 (SSE EZ x2 -

(Ei P iZ )T (ZTZ)- ( I))
1.2.3 Baranchik Class of Estimators Dominating the Sample Mean

In some situations when prior information is available and one believes

that unknown 0 is close to some known vector g it makes more sense to shrink

estimator X to p instead of 0. In such cases Baranchick [5] proposed using a more

general class of shrinkage minimax estimators dominating X. Let S = (X, p)2
i=1
and ii(X) = ) (Xi pi) then estimator X + Q(X) will dominate X under

quadratic loss function if the following conditions hold:

(i) 0 < r(S) < 2(p- 2),

(ii) T(S) is nondecreasing in S and differentiable in S.

Efron and Morris [30] have slightly widened Baranchick's class of minimax

estimators. They have proved that the following conditions will guarantee that the

estimator X + O(X) dominates X under the quadratic loss function:

(i) 0 < r(S) < 2(p- 2),p > 2,

(ii) r(S) is differentiable in S,
and

(1ii) u(S) S 2'(S' is increasing in S.
2(p-2)- (S) i l in S.

Thus the Baranchick class of estimators dominates the best equivariant

estimator. The natural question was if that class had a subclass of admissible

estimators.

Strawderman [60] shows that there exists a subclass in the Baranchick class of

estimators which is proper Bw ,i with respect to the following class of two stage

priors. The prior distribution for 0 is constructed as follows.









Conditional on A = a, 0 ~ N(0,al,), while A itself has pdf


g(a) = 6(1 + a)--, a > 0, 6 > 0.

Under this two stage prior, the B-i -- estimator of 0 has the Baranchick form


( '7(S)X

with
2exp{- }
r(S) +26 + 2-- xp --
f A'+- exp{- S}dA
0
and conditions (i)-(iii) holds for p > 4. When p = 5 and 0 < 6 < 2 we will get

class of proper B-v.-- i and thus admissible estimators dominating X. When p > 5

choosing 0 < 6 < 1 will lead to proper BE,,- class of estimators dominating X.

For p = 3 and 4 Strawderman [61] showed that there do not exist any proper

B-,-v estimators dominating X.

1.3 Shrinkage Predictive Distribution for the Multivariate Normal
Density

1.3.1 Shrinkage of Predictive Distribution

When fitting a parametric model or estimating parametric density function,

two most common methods are: estimate the unknown parameter first and

then use this estimator to estimate the unknown density, or try to estimate

the predictive density without directly estimating the unknown parameter. In

situations like this, the first method often fails to provide a good overall fit to the

unknown density even when the estimator of unknown quantity may have optimal

properties itself. The second method seems to produce estimators with smaller risk

then those constructed with plug-in method.

To evaluate the goodness of fit of predictive distribution p(y| x) (where x is

the observed random vector) to the unknown p(y| 0), the most often used measure








of divergence is Kullback-Leibler [46] directed measure of divergence

L(0,p(y x))= JP( ) log P (y L) dy (1-32)

which is nonnegative, and is zero if and only if p(y\ x) coincides with p(y| 0).
Then the average loss or risk function of predictive distribution of p(y| x) could be
defined as follows:

RKL(O,p) p(x 0)L(0,p(y x))dx, (1-33)

and under (possibly improper) prior distribution 7r on 0, the B- i-,- risk is

r (,p)= RKL 0,P)(0) dO. (1 34)

As shown in Atchinson [1] the B- i-,- predictive density under the prior 7r is
given by
) p(xl O)p(y O)7(O) dO
f p(xl 8)(0) dO(1 3

and this density is superior to any p(ylx) as a fit to the class of models.
Let X0 ~ N(O, vxI) and YO ~ N(O, vyIp) be independent p-dimensional
multivariate normal vectors with common unknown mean 0 and known variances

Vx, Vy. As shown by Murray [54] and Ng [55], the best invariant predictive density
in this situation is the constant risk B i-v--i rule under the uniform prior 7ru(0) = 1,
which can be written as

Pu(y 1 = X)1 exp Ilv- .1 (1-36)
{27(vx(V + )2}2 2(v ,+V)"

Although invariance is a frequently used restriction, for point estimation, the
best invariant estimator 0 = X of 0 is not admissible if dimension of the problem
is p > 3. It is known that the James-Stein estimator dominates the best invariant
estimator 0. Komaki [45] showed that the same effect holds for the prediction
problem. The best predictive density pu(ylx) which is invariant under translation
group is not admissible when p > 3, and is dominated by predictive density under









the Stein harmonic prior (see [59])

7H(O) C 1101-(p-2). (1 37)

Under this prior, the B l -i ,i predictive density is given by Komaki [45]

/x Op 11( (Y Y + V x 'X))(VY'+ V x
pn~~~y~1x)=( -+i -- -- ---

1 1
PH(y x) (y+ O) {(v -y+ vlx)(vI +vl)}I
x(, e xp )

{2(v + vy)} exp 2(v + v)y) ( )

where
I12
(u) U-+2 / P- exp(-v) dv. (1-39)
0
The harmonic prior is a special case of the Strawderman class of priors given

by

7"s(O) oc a- exp 2}2a -11daoc 1181-p-25
0
with J = -1. More recently, Liang [49] showed that pu(y| x) is dominated by the

proper B i-; rule pa(y x) under the Strawderman prior 7a (0)

Ols N(0svo, sv s (1 + s)a-2, (140)

when vx < I, p = 5 and a E [.5, 1) or p > 6 and a E [0, 1). When a = 2, it is

well known that 7rH(0) is a special case of 7a (0). As shown in George, Liang and

Xu [34], this result closely parallels some key developments concerning minimax

estimation of a multivariate normal mean under quadratic loss. As shown in Brown

[17], any B-,--i rule 0p = E(01 X) under quadratic loss,has the form

0= X + Vlog m (X). (1-41)

As shown in George, Liang and Xu [34], this result closely parallels some

key developments concerning minimax estimation of a multivariate normal mean









under quadratic loss. As shown in Brown [17], any Bw,-.- rule Op = E(01 X) under

quadratic loss,has the form

0= X + Vlog m (X). (1-42)

Similar representation was proved by George, Liang and Xu [34], for the

predictive density under the KL loss:


pA(y x) = pu(y x) m(W (1-43)
mr{(x; v)

where
vX + vxY v + vY (44)
W= 1 (144)
V. + Vy J7 +Vy
W v", (1-45)
VX + Vy
and m,(w; vw) is a marginal distribution of W.

Now we present the domination result of George et al. under the KL loss.

Theorem 1.3.1.1. For Z 0 ~ N(0, vlp) and a given prior r on 0, let m,(z; v)

be the i,,.i.,':rl distribution of Z. If mrn(z; v) is finite for all z, then p,(yl x) will

dominate pu(yl x), if ,.;i one of the following conditions holds for all v < vx

(i) E, log mn (Z; v) < 0,
(ii) mrn7( vz + 0; v) < 0 for all 0, with strict .:,. ;,.:1,.i;/ on some interval A,

(iii) rm(z; v) is superharmonic with strict .,'.; .,l.;.i1 on some interval A,

(iv) mr(z; v) is superharmonic with strict ,'., ,L;I,oa.:li/ on some interval A,
or

(v) 7r(0) is superharmonic.

From the previous theorem and minimaxity of pu(yl x) under the KL loss, the

minimaxity of p (yl x) follows.

George et al. [34] also proved the following theorem which is similar to

Theorem 1 of Fourdrinier et al. [33].










Theorem 1.3.1.2. If h is a positive function such that

(i) -(s + 1)h'(s)/h(s) can be decomposed as li(s) + 12(s) where 11 < A is
nondecreasing while 0 < 12 < B with !A + B < (p 2)/4,
(ii) lim,,, h(s)/(s + 1)P/2 = 0.
Then

(i) m/(mz; v) is superharmonic for all v < vo.
(ii) the B.n,. rule ph(yl x) under prior 7h(O) dominates pu(yl x) and is
minimax when v < vo.
1.3.2 Minimax Shrinkage towards Points or Subspaces

When a prior distribution is centered around 0, minimax B-,-.- rules p,(y| x)
yield most risk reduction when 0 is close to 0 (see [34]). Recentering the prior 7r(0)
around any b E IR results in 7b(0) =7(0 b). The marginal mb corresponding to
7b can be directly obtained by recentering the marginal m, n

b(z; v) mr(z b; v). (1 46)

Such recentered marginals yield predictive distributions

b (w; v".)
pb (y P x)m ( (I ) (1-47)

More generally, in order to recenter a prior 7r(0) around (a possibly affine)
subspace B C RP, George et al. [34] considered only spherically symmetric in 0
priors recentered as
7"(0) -= (6- PB0), (1-48)

where PBO = argminbs, 0O b|| is the projection of 0 onto B. Note that the
dimension of 0 PBO must be taken into account when considering 7r. Thus, for
example, recentering the harmonic prior 7rH(O) = (11-"-2) around the subspace









spanned by lp yields

4(e0) = le 01P,-(p- 3 (149)

Recentered priors yields predictive distributions


mB(x;; ,x)

George et al. [34] also considered multiple shrinkage prediction. Using the

mixture prior
N
7,(o) = WT "' (0), (1-51)
i= 1
leads to the predictive distribution
ZN 1" ,,_, (w; V)
p*(y x) = pu(yl ) N B 7- (1 52)
2Ki 1"' '"T1(x; VX)

1.4 Prior Selection Methods and Shrinkage Argument

1.4.1 Prior Selection

Since B iv-; [6] and later since Fisher [32], the idea of B li, -i i, inference

was debated. The cornerstone of B ,il, -i i, analysis, namely prior selection was

criticized for arbitrariness and overwhelming difficulty in the choice of prior. From

the very beginning, when Laplace proposed a uniform prior as a noninformative

prior, its inconsistencies have been found, generating further criticism. This has

given a way to new ideas such as the one of Jeffreys [44] who proposed a prior

which remains invariant under any one-to-one reparametrization. Jeffrey' prior

was derived as the positive square root of Fisher information matrix. However,

this prior was not an ideal one in the presence of nuisance parameters. Bernardo

[12] has noticed that Jeffreys prior can lead to marginalization paradox (see Dawid

et al. [26]) for inferences about p/a when the model is normal with mean p and

variance a2. These inconsistencies have led Bernardo [12] and later Berger and

Bernardo ( [9], [10], [11]) to propose uninformative priors known as "reference"









priors. Two basic ideas were used by Bernardo to construct his prior: the idea of

missing information and the stepwise procedure to deal with nuisance parameters.

Without any nuisance parameters, Bernardo's prior is identical to Jeffreys prior.

The missing information idea makes one to choose the prior which is furthest in

terms of Kullback-Leibler distance from the posterior under this prior and thus

allows the observed sample to change this prior the most.

Another class of reference priors is obtained using the invariance principle

which is attributed to Laplace's idea of insufficient reasons. Indeed, the simplest

example of invariance involves permutations on a finite set. The only invariant

distribution in this case is the uniform distribution over this set. Laplace's idea was

generalized as follows. Consider a random variable X from a family of distributions

parameterized by 0. Then if there exists a group of transformations, -;v ha(0) on

the parameter space, such that the distribution of Y = ha(X) belongs to the same

family with corresponding parameter ha(0), then we want the prior distribution for

parameter 0 to be invariant under this group of transformations. Good description

of this approach is given in Jaynes [43], Hartigan [37] and Dawid [25].

A somewhat different criteria is based on matching posterior coverage

probability of a B ,i-, -i in credible set with the corresponding frequentist coverage

probability. Most often matching is accomplished by matching a) posteriors

quantiles b) highest posterior densities (HPD) regions or c) inversion of test

statistics.

In this dissertation we will find uninformative priors by maximizing divergence

between prior and corresponding posterior distribution .,-i-ii!,ii. ically. To develop

.i-vmptotic expansions we will use so called "shrinkage arguing i, i -Ii.-.-- 1.

by J.K. Ghosh [36]. This method is particularly suitable for carrying out the

.,-vmptotics, and avoids calculation of multivariate cumulants inherent in any

multidimentional Edgeworth expansion.









1.4.2 Shrinkage Argument

We follow the description of Datta and Mukerjee [24] to explain the shrinkage

argument.

Consider a possibly vector-valued random variable X with a probability

density function p(x; 0) with 0 E IR or some open subset thereof. We need to find

an .,-vmptotic expansion for Eo [h(X; 0)] where h is a joint measurable function.

The following steps describe the B i, -i in approach towards the evaluation of

Eo[h(X; 8)].

Step 1: Consider a proper prior density 7r(O) for 0, such that support of 7r(O) is a

compact rectangle in the parameter space, and -r(0) vanishes on the boundary of

support while being positive in the interior. Under this prior one obtains posterior

expectation of EW[h(X; ) IX].

Step 2: In this step one finds the following expectation for 0 being in the interior

of support of 7r

A(0) = Ee [E[[h(X; 8)IX]].

Step 3: Integrate A(0) with respect to -r(O) and then allow -r(O) converge to the

degenerate prior at the true value of 0 supposing that the true value of 0 is an

interior point of the support -r(O). This yields Eo[h(X; 0)].

The rationale behind this process, if integrability of h(X; 0) with respect to

joint probability measure assumed, is as follows.

Note that posterior density of 0 under prior -r(O) is given by

p(X; 8) (0)
m(X)

where

m(X) p(X; )F(0) dO.







Hence, in step 1, we will get

Ex [h(X; 8)X]= K(X)/m(X),


where


K(X) = h(X; O)p(X; o0)(0) dO.


Step 2 yields
A(O) {K(x)/m(x)}p(x; ) dx.
In step 3 one would get


f{K(x)/mn(x)}p(x; o)7-(O) dO dx
f {K(x)/m(x)} f p(x; o)j(O) dO dx


I K(x) dx


f h(x; O)p(x; o) (0) dO dx


[Eo[h(X; 8)]7(O) dO.


The last integral gives us desired expectation when -r(0) converge to the degenerate
prior at the true value of 0.


SA(o) (0o) dO














CHAPTER 2
ESTIMATION, PREDICTION AND THE STEIN PHENOMENON UNDER
DIVERGENCE LOSS

2.1 Some Preliminary Results

We will start this section with definition of divergence loss. Among others, we

refer to Amari [2] and Cressie and Read [23]. This loss is given by

L (Oa) Jpl-p(x O)p3(x a)dx
L((1, a)=- (2-1)

This above loss is to be interpreted as its limit when -i 0 or -- 1. The KL loss

obtains in these two limiting situations. For 3 = 1/2, the divergence loss is 4 times

the BH loss. Throughout this dissertation, we will perform the calculations with

3 (0, 1), and pass on to the endpoints only in the limit when needed.

Let X and Y be conditionally independent given 0 with corresponding

pdf's p(x\ 0) and p(y| 0). We begin with a general expression for the predictive

density of Y based on X under the divergence loss and a prior pdf 7(0), possibly

improper. Under the KL loss and the prior pdf 7(0), the predictive density of Y is

given by

7KL(y Ix) Jp(y\ 0)7(01x)d0,

where 7(0| x) is the posterior of 0 based on X = x see [1]. The predictive density

is proper if and only if the posterior pdf is proper. We now provide a similar result

based on the general divergence loss which includes the previous result of Aitchison

as a special case when 3 -- 0.








Lemma 2.1.0.1. Under the divergence loss and the prior r, the B.rl predictive
1 ,:.,!i of Yis given by

7rD(Y x)= k (y,x k (y,) k( x) dy, (2-2)

where k(y, x) = Jpl-(yl O8)(0 x)) dO.
Proof of Lemma 2.1.0.1. Under the divergence loss, the posterior risk of
predicting p(y| 0), by a pdf p(y| x), is /-(1 3)-1 times

1- J lp (y| )p(y| )dy] 7(0 x)dO

S1 (y x) { p L (y1 8o)(0 (x) d dy

1- J k(y,x)rp(y|x)dy. (2-3)

An application of Holder's inequality now shows that the integral in (2-3) is
maximized at p(yl x) oc ki (y, x). Again by the same inequality, the denominator
of (2-2) is finite provided the posterior pdf is proper. This leads to the result
noting that 7rD(yl x) has to be a pdf. U
The next lemma, to be used repeatedly in the sequel, provides an expression
for the integral of the product of two normal densities each raised to a certain
power.
Lemma 2.1.0.2. Let Np(x lp, E) denote the plf of a p-variate normal random
variable with mean vector tt and positive /. I;, .: variance-covariance matrix E.
Then for a1 > O, a2 > 0,

I [N,(X| lt, E0)]1 [N,(X A2, E2)12 dx
S(27)n(1-a-a2) IEll(1- a) 2(1-a2) 1 + a2 -
x exp ([- (1 T 2 (12 + c-1 2 -
x exp L-C2 a12 (~]tt 2)T (IlA) 2 @+ ci2Ei)-1 (Atl 12)] (2 4)









Proof of Lemma 2.1.0.2. Writing H = alE1+a2E21 and g = H-l(aEl1l +

ca2 -12), it follows after some simplification that


I[N(x A, E0i)1" [N(x 2 2)]"1 dx

S(2)- (a+a2) Ej I 2 I exp ( g)H(x g)
1L

2 {aO i 1T_ 1) + a2 (22 12) -gTHg} dx

= (2x^) 0(l-a2) 2 -H 'a2H
(27)2(1 a1 a2) X12 X21 2 Hj

x exp -1 {Q(i ll )+Q a2( 2 21A2)- THg} (2-5)

It can be checked that


al OATE-ipl) + (ATEl a) gTHs
ai(p{X1- ) + a2 2 X2 2) gTHg

=1i2(/tl A21)T (a22 + 2c21i)2(2 1 A2), (2-6)

and
IH I-1/2 I 1 11/21 12 a1/ + 2 I2 1-1/2 (2 7)

Then by (2-6) and (2-7),

right hand side of (2-5)

(27r)p(1-a(-a2)/2 1 (1-al)/2 22 (1-a2)/2 a1X2 + a2I -1/2

2
x exp[--2(li .2)T(aiX2 + a2i)1)(. .2)]

This proves the lemma. U
The above results are now used to obtain the Bayes estimator of 0 and
the B i, -i i predictive density of a future Y ~ N(0, a I p) under the general
divergence loss and the N(t, AIp) prior for 0. We continue to assume that
conditional on 0, X ~ N(O, aIp), where oa > 0 is known. The B-v-;,- estimator of








0 is obtained by minimizing

1 exp[ (1 0- all 2]N(0(1 B)X + Bu,o 7(1 B)I)dO

with respect to a, where B = ax(a + A)-. By Lemma 2.1.0.2,
fj p3(1 -j3)
exp[- I22 a ll2N(l (1 B)X + B, 7(1 B)Ip)d0
7T2 \p/2
27(1 ) p N(Ola, c -1(1- 3)-Ip)N(0l(1 B)X + BX 72 (1 B)Ip)d

Sea -(1- B)X B 2 8)
ox exp L2- (3-1(1 _13) (2 8)
2a2(0-1(1 0)-1 + 1 B)

which is maximized with respect to a at (1 B)X + Btt. Hence, the Bayes
estimator of 0 under the N(p, AIp) prior and the general divergence loss is
(1 B)X + Btt, the posterior mean. Also, by Lemma 2.1.0.2, the B i-; predictive
density under the divergence loss is given by


TD(y| X) Nx [Nl-3'(0 y, Ip)N(0| (1 B)X + Bp, (1 B)Ip)d0

cx N(yl(l B)X + Bp, (a (1 B)(1 -/3) + C) Ip).

In the limiting (B 0) case, i.e. under the uniform prior Tr(0) = 1, the B-i,.
estimator of 0 is X, and the B-v.-; predictive density of Y is

N(y|X, ((1 -3) + )I).

It may be noted that by Lemma 2.1.0.2 with a1 = 1 3 and a2 = the
divergence loss for the plug-in predictive density N(0, orI,), which we denote by








60, is

L(0, o) (1T 1- N1 -(y l, o I)N'(y X,21 I,) dy


2 ( p( l 3)
(os)2 (o) 2 (1


23o^a + p)/2


j3(1 -3)||X- 02
0(1tl- ) 13)X _0 )1l (2-9)
2((1 0), + P )


Noting that |X -

1
R(O,60) = 1
0(t 0)


0112 ~ oX2, the corresponding risk is given by


a ) 2 ( ) 2 {( -

X(t 0)7
-0),72+


1 1
[31 3


(a )2 ( )


)/2 2 (- }-p/2




{ (1 02),2 j '-p/2 (2-10)


On the other hand, by Lemma 2.1.0.2 again, the divergence loss for the B-iv.
predictive density (under uniform prior) of N(y 0, l 2IP) which we denote by 6, is
rl~U~UI~ UIIUUY \IIUI UIIIlllrllV1 VI*\IV)~lY


L(0, 6- ) [ -(

0(1 1 [ (s)


Nl-'(1Y 1, a I)N ( (y X, ((1
p3(( ) + ) {(l 3)
2((1 ),72+7) 2 {


-3) + y)I) dy]

)2a2 + 2-p/2


Sexp -1 .)2] (2- 1)


The corresponding risk


(a2) (( 0),7 + 72) P(I

2 2
1 1
2 -l ( 0


3(1 3)2
(1 ) 2 +


( ) 2 ((1 3)o a + ) 2 ]


(2-12)


1
3(1 3)


x exp


x 1+


2 2 2-
Jx + a Y 2P


R(0, 6') 1









To show that R(O, 60) > R(O, 6,) for all 0, ua > 0 and a > 0, it suffices to

show that

(2) ( 2 p( 3)2 2t2 2(23)
( (a ) 0{(1 + 23-p/2 < (o) 2 ((1 )o3 + -) 2, (2 -13)

or equivalently that

1 + (3(,2 3) > (1 + 7/o )), (2-14)

for all 0 < p < 1, oa > 0 and a2 > 0. But the last inequality is a consequence of the

elementary inequality (1 + z)' < 1 + uz for all real z and 0 < u < 1.

In the next section, we prove the minimaxity of X as an estimator of 0 and

the minimaxity of N(yl X, ((1 ),o +o72)Ip) as the predictive density of Y in any

arbitrary dimension.

2.2 Minimaxity Results

Suppose X ~ N(O, caIp), where 0 E IRP. By Lemma 2.1.0.2 under the general

divergence loss given in 2-1, the risk of X is given by

R(0, X) = [1 {1 + 3(1 P)1-p/2] (2-15)
f(1 )[

for all 0. We now prove the minimaxity of X as an estimator of 0.

Theorem 2.2.0.3. X is a minimax estimator of the 0 in r n arbitrary dimension

under the divergence loss given in 2-1.

Proof of Theorem 2.2.0.3. Consider the sequence of proper priors N(0, 2rIp)

for 0, where o 0 o as n o0. We denote this sequence of priors by 7r,. The

B-i,-, estimator of 0, namely the posterior mean, under the prior rF, is

6r (X) = (1 B,)X, (2-16)


with B,= a(ao + a)-








The B .-;-i risk of 6" under the prior 7~ is given by:
(1 [1) P (1 ) r)((X) 211
r(7,w"') 1 E exp |67,(X) 01}2 (2-17)

where expectation is taken over the joint distribution of X and 0, with 0 having
the prior -r,.
Since under the prior r,,

|X N (6 x),2(1 B,)IP)

it follows that
110 67,(X) 112 X = x a(l1 B,)X,

which does not depend on x. Accordingly, from (3.3),

r(,F, 6 ") = [1 {1 + 3(1 _)(1 B,)}-p/2]. (2-18)

Since B, -- 0 as n -+ oo, it follows from 2-18 that

r(",,6") [1 {1+ 3(1 3)}-p/2]

as n --- oo.
Noting (2-15), an appeal to a result of Hodges and Lehmann [40] now shows
that X is a minimax estimator of 0 for all p.
Next we prove the minimaxity of the predictive density

6,(X) = N(yl X, ((1- f)o- + )I)

of Y having pdf N(y| 0, O21p).
Theorem 2.2.0.4. 6,(X) is a minimax predictive I <,.:/;l of N(yl0, 72Ip) in ,:1,
arbitrary dimension under the general divergence loss given in (2-1).








Proof of Theorem 2.2.0.4. We have shown already that the predictive density
6,(X) of N(y| 0, c2Ip) has constant risk

1 1 ,7 2((1 "32 72) P
3( [) 2((t -/32 ) ]

under the divergence loss given in (2-1). Under the same sequence Tr of priors
considered earlier in this section, by Lemma 2.1.0.2, the B -; predictive density of
N(y lO, JIp) is given by N(yi (1 B)X, {(1 3)(1 B,)a + 2}Ip). By Lemma
2.1.0.2 once again, one gets the identity


Noting


Nl- (yl 0, 2 Ip)N'3(y| (1 B,)X, {(1 /3)(1 B),7 + ,7}Ip)dy
( p(2 3)
=()t2 ((1 /3)(1 B.)7 + 72) 2 {(1 -/)2(3 B-), + 72}-p/2
x/3( /3)1| (1 B)X| 2
x exp[- ]. (
2((1 /)2(1 + )

once again that |0 (1 B)X|I21 X =x ~ ( B,)X2, the post


2-19)


rior


risk of 6,(X) simplifies to


1 (,2) P' 2 +3)
/3(2 /3) ((1-/3)(- B2) + o) 2

(( ^.2 ~2 2.- f /3(1-/3)o(l- Bn)

03(- ) [ ( (1 /3)(1 B .)o + o)
/3(X Y3


'-] (2-20)


Since the expression does not depend on x, this is also the same as the Bi,-,-
risk of 6,(X). The B-v,- risk converges to

o 1 [l ) -2 t2 t-h t.
An appeal to Hodges and Lehmann [40] once again proves the theorem.
An appeal to Hodges and Lehmann [40] once again proves the theorem.


X P









2.3 Admissibility for p= 1

We use Blyth's [14] original technique for proving admissibility. First consider

the estimation problem. Suppose that X is not an admissible estimator of 0. Then

there exists an estimator 6o(X) of 0 such that R(0, 6o) < R(0, X) for all 0 with

strict inequality for some 0 = Oo. Let r = R(00,X) R(00, 6o(X)) > 0. Due to

continuity of the risk function, there exists an interval [00 E, 0o + E] with E > 0

such that R(0, X) R(0, 6o(X)) > 1 for all 0 e [0o E, 0o + E]. Now with the same
--2

prior rF,(0) N(010, a ),


r(7i, X) r(-o, 60(X))
Oo+E
= [R(0, X) R(0, o(X))] T,(do) > [R(0, X) R(0, 6o(X))] T,(do)



1 f(2w72)- exp 02} dO> t(2 )- (2E) (2-21)
_> (2 -') exp 2 do _> -/ T (2 -7 ,72.
00-E

Again,

[{1 + (1 P)(1 Bz)}-1/2 t + 0(1 0)}1-1/2]
r(w,, X) -r(wn, Jw (X)) ( l-
(1 P )
O= (B,) (2-22)


for large n, where O denotes the exact order. Since BT = (, (o + oT-)1 and

i -- oc as n oo, denoting C(> 0) as a generic constant, it follows from (2-21)

and (2-22) that for large n, i- n > no,


> -r1(2)- 1/2 1 (2-23)
r(, X) r(Fn, Jwr(X)) 4

as n 0oo.

Hence, for large n, r(Tr, 67"(X)) > r(7r, 6o(X)) which contradicts the

B wi-ness of 6"- (X) with respect to r,,. This proves the admissibility of X for

p 1=.









For the prediction problem, suppose there exists a density p(y| v(X)) which
dominates N(yl X, ((1 3)ac + a'). Since

[((1 /)(1 B.)a + a~))- ((1 0) + ())B] 2)
3(1 f)

for large n under the same prior Tr,, using a similar argument,

r(T,, N(y X, ((1 3)a + a))) r(Tn, p(y v(X)))
r(T., N(y|X, ((1 )a + ))) r(T., N(y X, ((1 3)(1 B.)a + a7)))

= O(alB1) oo, (2-24)

as n -- oo. An argument similar to the previous result now completes the proof. U
Remark 1. The above technique of proving admissibility does not work for
p = 2. This is because for p = 2, the ratios in the left hand side of (2-23)
and (2-24) are greater than or equal to some constant times ca2B-1 for large n
which tends to a constant as n -- oo. We conjecture the admissibility of X or
N (y X, ((1 03)o + o~ )IP) for p = 2 under the general divergence loss for the

respective problems of estimation and prediction.
2.4 Inadmissibility Results for p > 3

Let S = IX12/a The Baranchik class of estimators for 0 is given by

6-(X) 1 -T j X,
r(S)

where one needs some restrictions on r. The special choice r(S) = p 2 (with

p > 3) leads to the James-Stein estimator.
It is important to note that the class of estimators 6'(X) can be motivated
from an empirical B -,-, (EB) point of view. To see this, we first note that with the
N(O, Alp) (A > 0) prior for 0, the By,-;., estimator of 0 under the divergence loss is
(1 B)X, where B = a(A + a )-1









An EB estimator of 0 estimates B from the marginal distribution of X.

Marginally, X ~ N(O, acB-1I,) so that S is minimal sufficient for B. Thus, a

general EB estimator of 0 can be written in the form 6'(X). In particular, the

UMVUE of B is (p 2)/S which leads to the James-Stein estimator [28].

Note that for the estimation problem,


L(0, 6(X))


1 -exp [- ||)6(X)- 0 2]
f3(1 /3)


(2-25)


while for the prediction problem,


L (N(y I,), N (yl 6(X), ((1 3) + r)I,))
P13 p(l 3)
1 -() ((1- ) ) 2 {(1 )22 + 2}-p/2
S(1 3)
x exp (1 -)ll(X) .ll0 (2 26)
2((1 0)2j2 + j2)

The first result of this section finds an expression for

EO exp 6,(X)- 02}] b > 0.

We will need b = (1 j)/2 and b (((1- ") to evaluate (2-25) and (2-26).


Theorem 2.4.0.5.


E [exp -1 67-(X) 0|2}
L a 2


OO
(2b + 1)-p/2 {exp(- )r/r!}Ib(r), (2-27)
r-0


where (b= + ) and


I (r) J 1
0


b 2t 2' r+-
- 7i
t 2b + (r + )

b(b)- ) 2 t 2t
x exp t b(b- + 2 ( 2 ) + 2br 2 )1 2 dt. (2-28)
t 2b+1 2b+1t









Proof.
Recall that S = | X 2/72. For 0|11| 0, proof is straightforward. So

we consider 1||0| > 0. Let Z = X/ux and r1 = 0/u. First we reexpress
S1 (t _S)X -02 as

,722 2 as
(1 2- | )112 Z- = S+ 7S 2-(S) 2+ 2-27TZ ( -(S). (2-29)

We begin with the orthogonal transformation Y = CZ where C is an orthogonal
matrix with its first row given by (01/11011,... ,0p/01181). Writing Y = (Y1,... ,p)T,

the right hand side of (2-29) can be written as


S+ -(S) 27(S)+ |1712- 2 |11 7 1
S\


(2-30)


where S = Y11 2. Also we note that Y1,... ,Yp are mutually independent with

Y1 ~ N(|1711, 1), and Y2,... ,Yp are iid N(0, 1). Now writing Z = Y 2 -1 we
i= 2
have


EIexp 1- 6(X) -0|2l
+00 +00
exp -b + { + 1 )
0 -0
-2T(y2 + z) + |I||2 211y, 1 T +z }]x

x (27)-1/2 exp (YI ||||)2 exp -- ,, dy, dz
2 2 F (P 2
+00 +00
(27) -1/2 exp (b ) (y |)2 ) + 2bT( by2+ z)
2/ (y2 +)
0 -o
(y 2 ) + t ) 1 z 3
-2bllr| yi I exp b+ z dyi dz (2-31)
y~+z J 2


r (S)
S ),









We first simplify


+oo
S(2) -12 exp
-00


br2 (2 + _)
+ 2bT(yj + z) 2bl
(y + 1z)

1exp b+-( + b2)


x exp {2 (b + 2) 11y


+ exp {


-2 (b +


||y1 I dyi
y+ z
b 2 (y2 ) 1+ 2
+ + 2bT(y + z)
(yI z)IjTy )


_r(y + z) }
2bllyi y z

qyll1 + 2bllllly y z)


-1/2 exp (b+


(Y + |ll|72)


bT(y2 + Z)
(y2Z)+ 2br(y + z)
(y + z)


S (211lly1)2r ( bT(y2 + ) 2r
r0 (2r)! 2 y + z


(b +)(W+ 1l2)


br2(w + z)
(w +)


+ 2br(w + z)]


S- (2r)!- w'-- 1b
(2 )! 2


br(w + z)
w + z


S2r]
dw,


where w = y.


With the substitution v = w + z and u = w/(w + z), it follows from (2-31) and
(2-32) that


E exp -A
+oo 1

0=
0 0


1167(X) (

-1/2 exp
r-0


112}


(2bllrll)2r
(2r)!


((2b)- + 1)


x exp (b +


1t
2) V


b (v) + 2br(v)] 1 Ur+ -I 1(1 2 du dv (2-33)
v 22d ( (3)


(b +) (y h)2


+00O
I (27)
0/
0


+00
+oo
2 (2)-
0


dyi (2-32)


+oo
2 (27)
0


1/2 exp [


v
V


( t |1) |2)









By the Legendre duplication formula, namely,


(2r)! -F(2r + 1) F (r


+ A) F(r + 1)22r~ -1/2


(2-33) simplifies into


E exp
+00 1

o o
0 0


b
-|6T(X) (-
x

rt0


112}


b + I ) r||2 (2b)c 2r) 2 {
+2 rT (r + 1) 22r


((2b)-1 + 1)


x v+ -lexp (b + v
+0021


+~ 1

J0 exp
0 0 r=0


(b + )2)((b +


2) (b 1)
)2 r!


((2b)1 ) v
((2b)-1+ l)v


Xvr+ -lexp (b+- )


b2 () r+-1 (
+ 2br(v)
v 2 F(r +


1)
2)


u)2-lF (r + ) dud
2du dv
S(PL) F (r+ L)
(2-34)


Integrating with respect to u, (2-34) leads to


b
||6(X) -(
x
00 7
exp{- /}
r70


12}


2 ) (


T(v) )2
((2b)-1 + 1)v


r+~-1
2Fr (r + 2)


b- ( + 2br(v) dv (2-35)
vI


where = (b + -) 1||.|2. Now putting t


E exp

+ooexp
x j exp
0


(x )


0 2}


b(b+ 1) 72 2_T)
2 +1
t


(b + -) v, we get from (2-35)


(2b + 1)- exp{-0}
r-0


+ 2b 2 1) (1
(2b + 1t


2b + ))


tr+P-1
dt
F(r +)


The theorem follows. U


br2(v) + 2br(v)]
v I


7(v) 2r
v I


I r+-l1 -1
Ur+2 U( ) 2 -
2-- 2i -( du dv
2 0 )


E exp


x exp


(b + v
(12)








As a consequence of this theorem, putting b = 3(1 3)/2, it follows from
(2-25) and (2-27) that
00
1 (1 + 3(1 3))-p/2 E {exp(-O)O/r!} (1- )/2()
R(0, 6(X)) r=o
3/( 2(-/

while putting b = (( 1-. it follows from (2-26) and (2-27) that


R (N(y O, a2~IP), N (y 67(X), ((1 ),a + 2)IJ,))

t (, )p3/2((l )2 2+)-p//2 ep(-^/ ,} ())
S0r=2(( 3)2x + )
S3(1 3)

Hence, proving Ib(r) > 1 for all b > 0 under certain conditions on 7 leads to

R(O, 6T(X)) < R(, X)

and


R (N(ylO, a72,), N (y 67(X), ((1 3)a7 + 7 )I,))

< R (N (yl aIP),N (y|X, ((1 /3) + 72)IJ))

for all 0. In the limiting case when 3 -+ 0, i.e. for the KL loss, one gets

RKL (0, 67(X)) < p/2 RKL(0, X) for all 0, since as shown in Section 1,
for estimation, the KL loss is half of the squared error loss. Similarly, for the
prediction problem, as3 -- 0,


RKL(N(y\ 0, cI p), N(yl 6(X), ((1 3)t + cr)I>)

p 2 2___
< 2 log +ar<} = RKL(N(y O, rI, N(y| X, ((1 43)r + )IJ,)

for all 0.
The following theorem provides sufficient conditions on the function 7r() which
guarantee Ib(r) > 1 for all r = 0, 1, .








Theorem 2.4.0.6. Let p > 3. Suppose
(i) 0 < T(t) < 2(p 2) for all t > 0;
(ii) 7(t) is a differentiable nondecreasing function of t.
Then Ib(r) > 1 for all b > 0.
Proof of Theorem 2.4.0.6.

Define To(t) = T7( ). Notice that To(t) will also satisfy conditions of Theorem
2.4.0.6. Now


2b- (2b +


and


+ooexp -t (1
I(7


b2) 2 -(p-2)
x exp{ t o(t)} dt. (2:
2t t

Define to = sup{t > 0 : To(t)/t > b- }. Since To(t)/t is continuous in t with
lim-To(t)/t +oo and lim To(t)/t = 0, there exists such a to which also satisfies
t-O t-oo
-ro(to)/to = b-1. We now need the following lemma.
Lemma 2.4.0.7. For t > to, b > 0 and To(t) M ,/';,if,,/ conditions of Ti,.., ,,
2.4.0.6 the following .:,,';. 1.,/; i holds:

exp 7)) (p-2) q(t) > 0, (2-


where q(t) = t ( b(t)2
Proof of 2.4.0.7. Notice first that for t > to, by the inequality, (1
exp(cz) for c > 0 and 0 < z < 1, one gets

exp 1 b7r02(t)} b -(-)>exp br,2(t) b(p 2)ro(
2t } ( t 2t t2 {+


r2 t0) ()) 2

r(r + )


36)


37)


z)- >


(2-38)


bro(t) 2+ b )
t ) 2t


b (+- b + 12t
t + 2 t
t \2b +1}


t 1


");









for 0 < To(t) < 2(p- 2).

Notice that

q(t) 2b(t) + 2b2ro (t)r (t) b202(t) (239)
q'(t) = 1 2br'(t) + 22 (2-39)

Thus q'(t) < 1 for t > to if and only if


2bg (t) b + 2-2t> 0. (2-40)

The last inequality is true since T'(t) > 0 for all t > 0 and To(t)/t < b-1 for all

t > to. The lemma follows. U

In view of previous lemma, it follows from (2-37) that

+oo
) > (r ) J exp{-q(t)}(q(t))r+-1q'(t) dt 1.
to
This proves Theorem 2.4.0.6. U

Remark 2. Baranchik [5], under squared error loss, proved the dominance of

6'(X) over X under (i) and (ii). We may note that the special choice r(t) = p 2

for all t leading to the James-Stein estimator, satisfies both conditions (i) and (ii)

of the theorem.

Remark 3. We may note that the Baranchik class of estimators shrinks the sample

mean X towards 0. Instead one can shrink X towards any arbitrary constant tt. In

particular, if we consider the N(p, AIp) prior for 0, where p E RP is known, then

the B-,-.-; estimator of 0 is (1- B)X +B l, where B= o-2(A +2)-1. A general EB

estimator of 0 is then given by


6**(X) 1 X + -'









where S'= IIX pj|2/at, and Theorem 2.4.0.6 with obvious modifications will then
provide the dominance of the EB estimator 6**(X) over X under the divergence

loss. The corresponding prediction result is also true.

Remark 4. The special case with r(t) = c satisfies conditions of the theorem if
0 < c < 2(p 2). This is the original James-Stein result.

Remark 5. Strawderman [60] considered the hierarchical prior

o|A N(O, AI),

where A has pdf
T(A) = 6( + A)-1-I[A>o]

with 6 > 0.

Under the above prior, assuming squared error loss, and recalling that S

IX112/at, the B-,-.- estimator of 0 is given by

t r(S)
S)

where
2exp(- .)
r(t) = p + 26 ep( ) (2-41)
fo1 Ai+-' exp(-t) dA
Under the general divergence loss, it is not clear whether this estimator is the
hierarchical B-,-i estimator of 0, although its EB interpretation continues to
hold. Besides, as it is well known, this particular 7r satisfies conditions of Theorem

2.4.0.6 if p > 4 + 26. Thus the Strawderman class of estimators dominates X

under the general divergence loss. The corresponding predictive density also
dominates N (y X, ((1 3)cr + 72)Ip) For the special KL loss, the present

results complement those of Komaki [45] and George et al. [34]. The predictive

density obtained by these authors under the Strawderman prior, (and Stein's
superharmonic prior as a special case) are quite different from the general class of









EB predictive densities of this dissertation. One of the virtues of the latter is that

the expressions are in closed form, and thus these densities are easy to implement.

2.5 Lindley's Estimator and Shrinkage to Regeression Surface

Lindley [50] considered a modification of the James-Stein estimator. Rather

then shrinking X towards an arbitrary point, -,-v tt, he proposed shrinking X

towards Xp1, where X = p-1 1 Xi and lp is a p-component column vector with

each element equal to 1. Writing R = E(X X)2/,c, Lindley's estimator is given
i=
by
p-3
6(X) = X (X X1), p > 4. (2-42)
R
The above estimator has a simple EB interpretation. Suppose

XI 0 N(0, JIp) and 0 has the Np(plp, AIp) prior. Then the B-,v, estimator of

0 is given by (1 B)X + Bp/1 where B = jx(A + oX)-1. Now if both p and A are

unknown, since marginally X ~ N(plp, a B-1I), (X, R) is complete sufficient for

p and B, and the UMVUE of p and B-1 are given by X and (p 3)/R, p > 4.

Following Baranchik [5] a more general class of EB estimators is given by

6 (R)
6:(X) X -( (X X1), p > 4. (2-43)
R

Theorem 2.5.0.8. Assume

(i) 0 < -r(t) < 2(p 3) for all t > 0 p > 4;

(ii) 7(t) is a nondecreasing differentiable function of t.

Then the estimator 6 (X) dominates X under the divergence loss given in 2 1.

Similarly, N(yl 6:(X), ((1 /3)o + o-)Ip) dominates N(yl X, ((1 P3)a + oa)Ip)

as the predictor of N(yl 0, Ir2).

Proof of Theorem 2.5.0.8. Let
p
0-P ol rI 0/c
i-i









and
2 1 -p
2 Z(o 0)2.
Sii 1ii
As in the proof of Theorem 2.4.0.5 we first rewrite


2 *6l(X) 0ll2 ZZ -
^' T


-(R)(
R


Z1,) 12


T(R)(z
Zl,) (7 1,P) + (Z )lP R(Z
R


(Z


[1 (R) 2


zip)


)2 + (2 2( P) (1


By the orthogonal transformation G


2


zip). (2-44)


CZ, where C


S1
is an orthogonal matrix with first two rows given by (p-2',... ,p -) and ((ll


1)/(,..., (T


1
2 116(X ) 0112

S (G2 Q) 2
i U G I+


qr)/). We can rewrite


(G2 + Q) + (G1


VrP)2 + 2


2(G2 (1


T(G+ Q)
G(2Q 5)
(2 45)


P
where Q = C G2 and G1, G2,..., Gp are mutually independent with
i=3
G1 ~ N]V(Vlp], 1), G2 ~ NV(, 1) and G3, Gp are iid N(0, 1). Hence due to the


independence of G1 with (G2,... Gp), and the fact that (G1


v/p)2 ~ X2, from


(2-45),


-16 (X) 0112 =

x (G + Q) + (2
x(G2


(2b + 1)-E

- 2(G2 (


exp b 1 -((



r
T(G|+Q) 1]


T(G2+Q) 2
G+Q )
G 2 +


00
(2b+1 )- exp{
r=0


+00
x exp
0
o/


t b(b + ) + 2bTo(t)4
2 t


_ ) 2r r+P-1
b M 2 t dt, (2 46)
t F(r +)


E exp


()Z


(Gi,..., G)T









where = (b + 1)(2 and as before To(t) T( '). The second equality in 2-46

follows after long simplifications proceeding as in the proof of Theorem 2.4.0.5.

Hence, by (2-46), the dominance of 6'(X) over X follows if the right hand

side of (2-46) > (2b+ )-p/2. This however is an immediate consequence of Theorem

2.4.0.6. U

The above result can immediately be extended to shrinkage towards an

arbitrary regression surface. Suppose now that X 10 ~ N(0, a I) and 0 ~

Np(K3, AI,) where K is a known p x r matrix of rank r(< p) and / is r x 1

regression coefficient. Writing P = K(KTK)-1KT, the projection of X on the

regression surface is given by P X = K/, where / = (K K)-1K X is the least

squares estimator of 0. Now we consider the general class of estimators given by


S (- (X PX),
R*

where R* = IIX P X12/a The above estimator also has an EB interpretation

noting that marginally (/, R*) is complete sufficient for (3, A).

The following theorem now extends Theorem 2.5.0.8.

Theorem 2.5.0.9. Let p > r + 3 and

(i) 0 < r(t) < 2(p- r 2) for allt > 0;

(ii) 7(t) is a nondecreasing differentiable function of t.
Then the estimator X >)(X P X) dominates X under the divergence loss.

A similar dominance result holds for prediction of N(yl 0, O I2,).














CHAPTER 3
POINT ESTIMATION UNDER DIVERGENCE LOSS WHEN VARIANCE
COVARIANCE MATRIX IS UNKNOWN
3.1 Preliminary Results

In this chapter we will consider the following situation. Let vectors Xi ~

Np (0, E) i = ,..., n be n i.i.d. random vectors, where E is the unknown
variance covariance matrix. In Section 3.2 we consider E = a21p with a2 unknown,

while in section 3.3 we consider the most general situation of unknown E. Our goal

is to estimate the unknown vector 0 under divergence loss.
First note that X is distributed as Np (0, E). And thus divergence loss for an

estimator a of 0 is as follows:

1 j.f- (x 0 )f ( a)dx 1 exp [- (a 0) -'(a )]
L0a- (1 --W) 3(1- 3)
(3-1)
The best unbiased equivariant estimator is X. We will begin with the

expression for the risk of this estimator.
Lemma 3.1.0.10. Let X ~ Np (0, E) i = 1,..., n be i.i.d. Then the risk of the

best unbiased estimator X of 0 is as follows:


R3(0) (1 ( (3 2)
) ) ( [t + (1 )]p/3-2

Proof of the lemma 3.1.0.10. Note first that

n{X 6fTE-\(X 6) X2-

Then

Eo exp nf-(12 (X )T (X 0)}] (1 + (1 p))p/2 (









The lemma follows from 3-3.



Thus for any rival estimator 6(X) to dominate X under the divergence loss

we will need the following inequality to be true for all possible values of 0:


E0 exp n(- ) (6(X)- )TE- (6(X)- ) 1> ( ))2.(3-4)


3.2 Inadmissibility Results when Variance-Covariance Matrix is
Proportional to Identity Matrix

Let X N,(0, 21 ), where o(> 0) is unknown, while S 2 x ,

independently of X. This situation arises quite naturally in a balanced fixed effects

one-way ANOVA model. For example, let


Xij = i + Eij (i ,...,;j = ,..., n)

where the Eij are i.i.d. Np(0, o2). Then the minimal sufficient statistics is given by

(X1,...,X,, S), where
n
Xi n- X,, (i 1,... ,p)
j=1

and
p n
S [(n l)p+ 2]-1 (X Xi)2.
i= j=1
This leads to the proposed setup with X = (X1,...,X)T, 0 = (01,... ,O)T,
a2 = /n and m (n 1)p.

Efron and Morris [30], in the above scenario, proposed a general class of

shrinkage estimators dominating the sample mean in three or higher dimensions

under squared error loss. This class of estimators was developed along the ones of

Baranchik [5].








Using equation (3-1), the divergence loss for an estimator a of 0 is given by

S- exp [- 1 10 a |2
Lp(0,a) =3(1- 3) (3-5)

The above loss to be interpreted as its limit when -- 0 or -- 1. The KL loss
occurs as a special case when 3 0. Also, noting that I|X 0112 2X2, the risk
of the classical estimator X of 0 is readily calculated as

1 [1 +(1 s)1-p/2
R3(0, X) = 3)(3-6)
P(1 P)

Throughout we will perform calculations in the case 0 < 3 < 1, and will pass
to the limit as and when needed.
Following Baranchik [5] and Efron and Morris [30], we consider the rival class
of estimators

6(X) = 1 r(lXIXI12/S)II I (3-7)

where we will impose some conditions later on r.
First we observe that under the divergence loss given in (3-5),

S- exp [ (1- ||) (X)- 0 |2
L(0, 6(X)) 3= (3-8)

We now prove the following dominance result.
Theorem 3.2.0.11. Letp > 3. Assume

(i) 0 < r(t) < 2(p 2) for all t > 0;
(ii) 7(t) is a differentiable nondecreasing function of t for t > 0.
Then R(0, 6(X)) < R(0, X) for all 0 e I.
Proof of Theorem 3.2.0.11 First with the transformation


Y = -1X, = a- 10 and U = S/2,









one can rewrite


R(o, 6(X))


1 E exp (1 -) 1 i(Yi2 j ) -

f(1 f)


where Y ~ Np(7r, Ip) and U ~ (m + 2) 1-X is distributed independently of Y.
Hence a comparison of (3-9) with (3-6) reveals that Theorem (3.2.0.11) holds if
and only if

Sexp (1 ll ) Y > [1+(1- )]-p/2. (3-10)

Next writing
z U(m + 2)
2

and
2 (m + 2)t"
-T(t/z) -- 2l 2=
m2 2z
we reexpress left hand side of (3-10) as

E exp ( f- (70( 2/Z) Y- 2}. (3-11)

Note that in order to find the above expectation, we first condition on Z and
then average over the distribution of Z. By the independence of Z and Y and
Theorem 2.4.0.5, the expression given in (4-29) simplifies to


[1 + (1 )]-p/2 exp(-) I) (r),
r-0


(3-12)


where I[1 + 3(1 /)]117112, and writing b = (1-

b 2 r ( r+t -1
I()[i -?To(t/z)] r
0 0
r b(b + 1/2)z2 + 1 (-z 1
x exp -t To ) + 2bzt/z) dtdz. (3-13)
\27


(3-9)






47


From (3-10) (4-31), it remains only to show that


I(r) > 1 Vr 0,1,...; p>3


under conditions (i) and (ii) of the theorem. To show this we first use the

transformation


t = zu.


Then from (4-31),

00 00


0 0

x exp [


Ur+ -1
b-To(u)/u]2~
rF(r+)r(L)

b(b + 1/2) -2
z({u + 1 + o (U) -
71


2bTo (u)}


bro (u)/u] 2 r
B (r + L, 2


(r+P} )


b(b + 1/2) 2(U)
+ 1 + rTo+u)


2b-0 (u)]


(3-14)


Since To(u)/u is a continuous function of u with


lim To(U)/U
u-0


+oo and lim ro ()/u
---D c


it follows that there exists uo such that


uo = sup{u > 0oTo(u)/u > 1/b} and To(uo)/uo


J[1
0


zr+ (P+) dzdu
zr+ 2 I dzdu





48


Thus for u > Uo, To(u)/u < 1/b, from (3-14),


I(r) > [1


U r+- b(b + 1/2)
bro(u)/u]2r U 02(u)
B (r + 2 ++)


bro(u)/u)2}r+ -1 (1


bt (**) _(r+p+P d)
x u(1 bTo(u)/u)2 + 1 + bT2u 2
2u
/ [{u(1 bTo(u)/u)2} /{1 + br(u)/2}]r+-
l + u(-bro(u)/u)2 ] 2
uo [ + II-,- "/(2u)


bTo(u)/u)-(p-2)


-1


x (1 bTo()/u)-(p-2) (1 + br(u)/(2u)) (+1) du. (3-15)


By the inequalities


(1 bTo(u)/u)-(p-2) > exp[(p


2)bTo(u)/u]


and


(1 + bT02(u)/(2u))-( +1) > exp[-(1 + m/2)bT02(u)/(2u)],


it follows that


-(p-2)
(1


Sbr2() -(+1
2u


To())l > 1 (3-16)


>exp [(p-2 bTo (u) (m + 2) br2 (u)
>exp (p- 2)2-b(^)

fbTo(u)(m + 2) 4(p- 2)
exp 4u m +2


since 0 < To(u) < 4(p -
Moreover, putting


2)/(m + 2).


S(1- bo(u)2
1 + (u)
2u


2[u bro(u)2
[2u + br02(u)


[ {u(1


J


B (r +j, )


brTo(u))
u )


2o)]-(r+
2bTo (u)I









it follows that

dw 2(u bro(u))
d 2( bT (u) [2(1 b(u))(2u + b0(u)) (u bTo(u))(2 + 2bro(u) (u))]
du [2u + bT02(U)]2
2(u bro(u))
2(u bT(u)) [2u + 2bo(u) + 2b2(u) 4bu(u) 2buro(u)r(u)].
[2u + b702(U)]2

Hence < 1 if and only if

2[u bro(u)][2u + 2b-o(u) + 2bT0o2(u) 4bu T(u) 2buTo(u)T'(u)] < [2u + bT7o(u)]2

The last inequality is equivalent to

b22(u) [2 + To(u)]2 + 4bu (u) [2 + To(u)][u bro(u)] > 0. (3-17)

Since for u > no, u > bro(u), (3-17) holds if To(u) > 0, and the latter is true

due to assumption (ii). Now from (3-15) (3-17) noting that w = 0 when u = Uo,

one gets
OO
/wr+2-1


for all r = 0, 1, 2,.... This completes the proof of Theorem 3.2.0.11. U

Remark 1. It is interesting to note that Il(r) > 1 for all r = 0, 1, 2,... and any

arbitrary b > 0. The particular choice b = 3(1 3)/2 does not have any special

significance.

We now consider an extension of the above result when V(X) = E is an

unknown variance-covariance matrix. We solve the problem by reducing the risk

expression of the corresponding shrinkage estimator to the one in this section after

a suitable transformation.

3.3 Unknown Positive Definite Variance-Covariance Matrix

Consider the situation when Zi,..., Z, (n > 2) are i.i.d. Np(O, E), where E is

an unknown positive definite matrix. The goal is once again to estimate 0.









The usual estimator of 0 is Z = n-1 E Z (-iv). It is the MLE, UMVUE and
i=1
the best equivariant estimator of 0, and is distributed as Np,(, n-1E). In addition

the usual estimator of E is

S 1 Z Z)(Z Z)T
i=1

and S is distributed independently of Z.

Based on distribution of Z, the minimal sufficient statistic for 0 for any given

E, the divergence loss is given by (see equation 3-1)

1 -exp -3(-3(a )- ')-a 0)]
L (0,a) )1. (3-18)

The corresponding risk of Z is the same as the one given in (3-2), i.e.

R(, Z) [1- _{1 + (1 3)}-p/2]. (3 19)

Consider now the general class of estimators


6'(Z, S) 1-( TS-- Z] (3-20)

of 0. Under the divergence loss given in (2-1),

1 exp n(-)(6T(Z, S) 0)Ty- (6Z, S) -0)
L(0, 6(Z, S)) (3-21)

By the Helmert orthogonal transformation,

1
HI -t(Z2 ZI),
v/2

H2 = (2Z3 Z1 Z2),
V6


1
H,_1 -[(n 1)Z, ZI Z2 ...- Z,_]
n/,(n 1)









1 n
7n z-,


VnZ,


one can rewrite 6'(Z, S) as


((n 1)H( HH )-1H)
6'(Z, S)= I --- n-1H,,
(n 1)HT( HiHi)-IH,)
i= 1


(3-22)


where HI,..., H, are mutually independent with HI,..., H,_- i.i.d. N(0, E) and

H, N( VNO, E).
Let


Yi E-= Hi


and


1
r7 E- (vnO).

Then from (3-21) and (3-22) one can rewrite

( T ( ( If nT 2 ]
1 -exp ((1-n Y) 1 Y ,-
L2 ( 6(O (Z'S))( V ) ) I

(3-23)
where Y1,..., Y, are mutually independent with Y1,... Y,_ i.i.d. N(O, Ip) and
Y, N(q, Ip).
Now from Arnold ([3], p. 333) or Anderson([4], p.172),

(~nY1 1 d y12
i 1n

where U ~ X,-p, and is distributed independently of Y,. Now from (3-23)


1 exp [ (1-2 ) (1


T ((n-1) |Y II '/U)
(n-1)||Y ||-/U
f3(1 -/3)


y 721


and


1, ... ,,n)


L(0,6'(Z, S))


(3-24)









Next writing

U
z -
2
and

To(t/z) = 2 (- t/tz
n 2 2

1 exp [ (1-2 ) 1 T-o(Y I'/z) Y 27]
L3(0, 6 (Z, S)) 3( j3)-- (3-25)
O(1 0)
By Theorem 3.2.0.11, 6'(Z, S) dominates Z as an estimator of 0 provided 0 <

To(u) < 4-2) for all u and 3 < p < n. Accordingly, 6T(Z, S) dominates Z provided
0 < r(u) < 2(p-2)(n-1). We state this result in the form of the following theorem.
n-p+2
Theorem 3.3.0.12. Let p > 3. Assume

(i) 0 < r(t) < 2(p-2)(n-1) for all t > 0;
(ii) 7(t) is a differentiable nondecreasing function of t for t > 0.

Then R(O, 6(Z, S)) < R(O, Z) for all 0 E RP.














CHAPTER 4
REFERENCE PRIORS UNDER DIVERGENCE LOSS

4.1 First Order Reference Prior under Divergence Loss

In this section we will find a reference prior for estimation problems under

divergence losses. Such a prior is obtained by maximizing the expected distance

between prior and corresponding posterior distribution and thus can be interpreted

as noninformative or prior which changes the most on average when the sample is

observed.

If we use divergence as a distance between a proper prior distribution r(0)

(putting all its mass on a compact set if needed) and the corresponding posterior

distribution 7(01 x) we can reexpress the expected divergence as

R 1 ff (0j)w1-3(0| x)m(x) dx dO
3(1 4 3)
1 ffjr3(x)pl-( (x 0)(0) dxdO0 (
/3(1 -3)

Using this expression one can easily see that in order to find a prior that

maximizes R(Tr) we need to find an .ii-!,il, ,l ic expression for

rm3(x)p1-3(x 0)dx

first.

In this section we assume that we have a parametric family {p0 : 0 E 0},

o C RP, of probability density functions {po p(xl 0) : 0 E 0} with respect to

a finite dominating measure A(dx) on a measurable space X, and we have a prior

distribution for 0 that has a pdf r(0) with respect to Lebesgue measure.








Next we will give a definition of Divergence rate when parameter f < 1
and sample size is n. We define the relative Divergence rate between the true
distribution Pb(x) and the marginal distribution of the sample of size n mn(x) to
be

1 f- (Tn.(x)) T )-13/n A(dx)
R(0, ) = DR ( |P mx)) = (4-2)
v u3 P/n(1 P/n)

It is easy to check for f 0 that this definition is equivalent to the definition
of relative entropy rate considered for example in Clarke and Barron [21].
Using this definition, we can define for a given prior 7 the corresponding B-,-,-
risk as follows:
R(3) E [DR P ( |p mx))] (4-3)

To find an i ,l .Iic expansion for this risk function, we will reexpress the
risk function as follows:

1 E-g [exp{- In ln )1
R3(0, ) /=l n ) ( (4-4)
Sp/n~l PON

where

p(xl 0) = Jp(xl 0),
i=1
and
m(x) J p(x )7(0) dO.

Clarke and Barron [20] derived the following formula:
p(Xl 0) P in t S (
n In In 1I()1 + In 1 s + o(), (4-5)
m(x) 2 27 2 O(o0) 2

where o(1) 0 in L'(P) as well as in probability as n -i oo. Here, S, =

(1/ v)Vlnp(xl 0) is the standardized score function for which E(SS,{) = 1(0)
and E [S (I(0))-1 n] =p.









Using this formula we can write the following .i-iv,!.li l ic expansion for risk

function (4-1):


R3 (0, ) 1(0) /n( N
Since (()) [mptotically distributed as we can rewrite the))


Since S, (1(0))- S,, is .i-. mptotically distributed as X2 we can rewrite the
P2w a ert h


4-6)


above expression as follow:


Hence, the D


maximizes the integral:
maximizes the integral:


subject to the constraint


1 0 () 1 + o(n-p/4)
t 'n 1_ -- )p/2
R (0, ) /, n
/rior wh 1 t ll

rior which minimizes the B,.i -- risk will be the one that


dOe
1 (0) 2


7 (0) dO=.


A simple calculus of variations argument gives this maximizer as


17(0) cX | (0)


(4-9)


which is Jeffreys' prior.

4.2 Reference Prior Selection under Divergence Loss for One
Parameter Exponential Family

Let X1,...,X, be iid with common pdf (with respect to some u-finite

measure) belonging to the regular one-parameter exponential family, and is

given by


p(x|O) = exp[8x (O) + c(x)].


(4-7)





(4-8)


(4-10)








Consider a prior 7r(0) for 0 which puts all its mass on a compact set. We will
pass on to the limit later as needed. Then the posterior is given by

7r(01xi,... ,Xn) oc exp[n{0x 6(0)}]7r(0). (4-11)

We denote the same by 7(0Ox). Also, let p(xlO) denote the conditional pdf of X
given 0 and m(x) the marginal pdf of X.
The general expected divergence between the prior and the posterior is given
by
R( 1 ff3(0)S1- (01 x)dO] mx(2)dx (
R3(Q) ( (4 12)

From the relation p(xl0)}(0) = ~(0Ox)m(x), one can reexpress RN(r) given in
(4-12) as

R3() 1- JffI 73+1(0)7r- (O )p(xlO) dx dO 1 fi7rI ()E)E [r-3(0 2x) O] dO
[(1 P3 -(1 3)
(4-13)
Let 1(0) = "(0) denote the per observation Fisher information number. Then
we have the following theorem.
Theorem 4.2.0.13.

Eo [ -(01 X)] 27 \ 1 t "'(t)W '(0)
ul (= -(0) n ( ) 7()
+( "'(0))20 (32 + 73 +10) 3 7'(0) 2
24(1 -/3)I3() 21(0) r(O70)
+ 3(2 ) r"(0) '(0)0(2 + +) (n--/2 (44)
2(1 3)I(0) 7(0) 8(1 /)12()

Proof of Theorem 4.2.0.13. Let \hat{\theta} denote the MLE of \theta. Throughout this
section we will use the following notation:

\ell(\theta) = \theta\bar{x} - \psi(\theta), \qquad c = \psi''(\hat{\theta}) = -\ell''(\hat{\theta}), \qquad a_3 = \ell'''(\hat{\theta}), \qquad a_4 = \ell^{(4)}(\hat{\theta}), \qquad h = \sqrt{n}\,(\theta - \hat{\theta}).

We will use the shrinkage argument as it is presented in Datta and Mukerjee [24].
From Datta and Mukerjee ([24], p. 13), the posterior density of h given \bar{x} admits the expansion

\pi(h \mid \bar{x}) = \sqrt{\frac{c}{2\pi}}\, e^{-ch^{2}/2}\Bigg[1 + \frac{1}{\sqrt{n}}\bigg\{\frac{\pi'(\hat{\theta})}{\pi(\hat{\theta})}\,h + \frac{a_3}{6}\,h^{3}\bigg\}
+ \frac{1}{n}\bigg\{\frac{\pi''(\hat{\theta})}{2\pi(\hat{\theta})}\,h^{2} + \frac{a_4}{24}\,h^{4} + \frac{a_3}{6}\,\frac{\pi'(\hat{\theta})}{\pi(\hat{\theta})}\,h^{4} + \frac{a_3^{2}}{72}\,h^{6}
- \frac{\pi''(\hat{\theta})}{2c\,\pi(\hat{\theta})} - \frac{a_4}{8c^{2}} - \frac{a_3}{2c^{2}}\,\frac{\pi'(\hat{\theta})}{\pi(\hat{\theta})} - \frac{15\,a_3^{2}}{72\,c^{3}}\bigg\}\Bigg] + o(n^{-1}). \qquad (4-15)

With the general expansion

\big(1 + a_1 n^{-1/2} + a_2 n^{-1} + o(n^{-1})\big)^{-\beta} = 1 - \beta a_1 n^{-1/2} + \Big\{\tfrac{1}{2}\beta(\beta+1)a_1^{2} - \beta a_2\Big\} n^{-1} + o(n^{-1}),

applied to the bracketed series in (4-15), we get the corresponding expansion of \pi^{-\beta}(h \mid \bar{x}),

\pi^{-\beta}(h \mid \bar{x}) = \Big(\frac{c}{2\pi}\Big)^{-\beta/2} e^{\beta c h^{2}/2}\,\big[1 + n^{-1/2}(\,\cdot\,) + n^{-1}(\,\cdot\,)\big] + o(n^{-1}). \qquad (4-16)

Using (4-15) and (4-16) we get the expansion of the product \pi^{-\beta}(h \mid \bar{x})\,\pi(h \mid \bar{x}), which is
\exp\{-(1-\beta)c h^{2}/2\} multiplied by polynomials in h whose coefficients involve c, a_3, a_4,
\pi'(\hat{\theta})/\pi(\hat{\theta}) and \pi''(\hat{\theta})/\pi(\hat{\theta}) (4-17). Integrating this last expression with respect to h
gives an expansion of \int \pi^{-\beta}(h \mid \bar{x})\,\pi(h \mid \bar{x})\,dh up to terms of order n^{-1} (4-18).

From the relation \theta = \hat{\theta} + h/\sqrt{n} we get

E^{\pi}\big[\pi^{-\beta}(\theta \mid \bar{x}) \mid \bar{x}\big] = n^{-\beta/2}\int \pi^{-\beta}(h \mid \bar{x})\,\pi(h \mid \bar{x})\,dh. \qquad (4-19)

Thus by (4-18) and (4-19) we obtain an expansion of E^{\pi}[\pi^{-\beta}(\theta \mid \bar{x}) \mid \bar{x}] whose leading term is
(2\pi/(nc))^{\beta/2}(1-\beta)^{-1/2}, with a correction of order n^{-1} involving c, a_3, a_4,
\pi'(\hat{\theta})/\pi(\hat{\theta}) and \pi''(\hat{\theta})/\pi(\hat{\theta}) (4-20).

In the next step, we take the expectation of (4-20) with respect to the marginal distribution of \bar{x}.
Since this expectation equals \int \pi(\theta)\,E_\theta\big[E^{\pi}\{\pi^{-\beta}(\theta \mid \bar{x}) \mid \bar{x}\}\big]\,d\theta, it can be written as a sum of
integrals over \theta involving \pi(\theta), \pi'(\theta) and \pi''(\theta) (4-21).

The last step will give an expression for E_\theta[\pi^{-\beta}(\theta \mid \bar{x})]. We consider \pi(\theta) to
converge weakly to the degenerate prior at the true \theta and have chosen \pi(\theta) in such a
way that we could integrate the last two integrals in (4-21) by parts and have the
first term equal to zero every time we use integration by parts. These integrations by parts,
together with the first and second derivatives of (2\pi/(nI(\theta)))^{\beta/2} with respect to \theta (4-22)-(4-24),
yield

E_\theta\big[\pi^{-\beta}(\theta \mid \bar{x})\big] = \left(\frac{2\pi}{nI(\theta)}\right)^{\beta/2}\frac{1}{\sqrt{1-\beta}}\,\big[1 + n^{-1}\{\,\cdot\,\}\big] + o(n^{-1-\beta/2}), \qquad (4-25)

with the term in braces as displayed in (4-14).
This completes the proof. \Box
In view of Theorem 4.2.0.13, for \beta < 1 and \beta \neq 0 or -1, one has

R_\beta(\pi) = \frac{1 - (1-\beta)^{-1/2}\int \left(\frac{2\pi}{nI(\theta)}\right)^{\beta/2} \pi^{1+\beta}(\theta)\,d\theta}{\beta(1-\beta)} + o(1). \qquad (4-26)

Thus the first order approximation to R_\beta(\pi) is given by

\frac{1 - (2\pi/n)^{\beta/2}(1-\beta)^{-1/2}\int \pi^{1+\beta}(\theta)\,I^{-\beta/2}(\theta)\,d\theta}{\beta(1-\beta)}.

We want to maximize this expression with respect to \pi(\theta) subject to
\int \pi(\theta)\,d\theta = 1.
We will show that Jeffreys' prior asymptotically maximizes R_\beta(\pi) when
-1 < \beta < 1.
To do this we will use Hölder's inequality as follows.
Hölder's inequality for positive exponents ([39], p. 190). Let p, q > 1 be
real numbers satisfying 1/p + 1/q = 1. Let f \in L_p, g \in L_q. Then fg \in L_1 and

\int |fg|\,d\mu \le \left(\int |f|^{p}\,d\mu\right)^{1/p}\left(\int |g|^{q}\,d\mu\right)^{1/q}, \qquad (4-27)

with equality iff |f|^{p} \propto |g|^{q}.
Hölder's inequality for negative exponents ([39], p. 191). Let 0 < q < 1 and
p \in \mathbb{R} be such that 1/p + 1/q = 1 (hence p < 0). If f, g are measurable functions,
then

\int |fg|\,d\mu \ge \left(\int |f|^{p}\,d\mu\right)^{1/p}\left(\int |g|^{q}\,d\mu\right)^{1/q}, \qquad (4-28)

with equality iff |f|^{p} \propto |g|^{q}.
First we will consider 0 < \beta < 1. In this case it is enough to minimize

\int \pi^{1+\beta}(\theta)\,I^{-\beta/2}(\theta)\,d\theta.







From Hölder's inequality for positive exponents with

p = 1 + \beta, \qquad q = \frac{1+\beta}{\beta},

f(\theta) = \pi(\theta)\,(I(\theta))^{-\beta/(2(1+\beta))}, \qquad \text{and} \qquad g(\theta) = (I(\theta))^{\beta/(2(1+\beta))},

we can write

1 = \int \pi(\theta)\,d\theta = \int f(\theta)\,g(\theta)\,d\theta
\le \left(\int \pi^{1+\beta}(\theta)\,(I(\theta))^{-\beta/2}\,d\theta\right)^{1/(1+\beta)}\left(\int I^{1/2}(\theta)\,d\theta\right)^{\beta/(1+\beta)},

so that

\int \pi^{1+\beta}(\theta)\,(I(\theta))^{-\beta/2}\,d\theta \ge \left(\int I^{1/2}(\theta)\,d\theta\right)^{-\beta},

with "=" iff
\pi(\theta) \propto I^{1/2}(\theta).
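
The Hölder bound above can be illustrated numerically. In the sketch below (an illustration with hypothetical settings), the Bernoulli(\theta) model has I(\theta) = 1/(\theta(1-\theta)), so Jeffreys' prior is the Beta(1/2, 1/2) density; within the family of Beta(a, a) priors the functional \int \pi^{1+\beta}(\theta)\,I^{-\beta/2}(\theta)\,d\theta should therefore be smallest at a = 1/2.

import numpy as np
from scipy import stats
from scipy.integrate import quad

beta = 0.5   # any 0 < beta < 1

def J(a):
    # integrand: pi^{1+beta}(theta) * I(theta)^{-beta/2} with pi = Beta(a, a), I = 1/(theta(1-theta))
    integrand = lambda t: stats.beta.pdf(t, a, a) ** (1 + beta) * (t * (1 - t)) ** (beta / 2)
    return quad(integrand, 0.0, 1.0, limit=200)[0]

for a in (0.4, 0.5, 0.75, 1.0, 2.0):
    print(a, J(a))   # the value at a = 0.5 should be the smallest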

Next, consider the case -1 < \beta < 0. Since \beta(1-\beta) < 0, maximization
of R_\beta(\pi) is equivalent to maximization of

\int \pi^{1+\beta}(\theta)\,(I(\theta))^{-\beta/2}\,d\theta.

From Hölder's inequality for positive exponents with

p = \frac{1}{1+\beta}, \qquad q = -\frac{1}{\beta},

f(\theta) = \pi^{1+\beta}(\theta), \qquad \text{and} \qquad g(\theta) = (I(\theta))^{-\beta/2},

we obtain

\int \pi^{1+\beta}(\theta)\,(I(\theta))^{-\beta/2}\,d\theta \le \left(\int I^{1/2}(\theta)\,d\theta\right)^{-\beta},

with "=" iff

\pi(\theta) \propto I^{1/2}(\theta).

When \beta < -1, using Hölder's inequality for negative exponents with

p = \frac{1}{1+\beta} < 0, \qquad 0 < q = -\frac{1}{\beta} < 1,

f(\theta) = \pi^{1+\beta}(\theta), \qquad \text{and} \qquad g(\theta) = (I(\theta))^{-\beta/2},

we obtain

\int \pi^{1+\beta}(\theta)\,(I(\theta))^{-\beta/2}\,d\theta \ge \left(\int I^{1/2}(\theta)\,d\theta\right)^{-\beta},

with "=" iff
\pi(\theta) \propto I^{1/2}(\theta).

Unfortunately, in this case Jeffreys' prior is a minimizer, and since it is the only
solution of the corresponding Euler-Lagrange equation, there is no proper prior that
asymptotically maximizes R_\beta(\pi).
There are only two cases left: \beta = 0, which corresponds to the KL loss, and \beta = -1,
which corresponds to the chi-square loss.
As \beta \to 0+, one obtains the expression due to Clarke and Barron (1990, 1994)
[20], [21], namely

R(\pi) = \frac{1}{2}\log\frac{n}{2\pi e} - \int \pi(\theta)\log\frac{\pi(\theta)}{I^{1/2}(\theta)}\,d\theta + o(1),

which is maximized when \int \pi(\theta)\log\{\pi(\theta)/I^{1/2}(\theta)\}\,d\theta is minimized, i.e. when \pi(\theta) \propto I^{1/2}(\theta), once again
leading to Jeffreys' prior.
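
The same kind of numerical check applies to the Kullback-Leibler case. In the sketch below (an illustration with hypothetical settings), the Bernoulli(\theta) model again has Jeffreys' prior Beta(1/2, 1/2), and the functional \int \pi(\theta)\log\{I^{1/2}(\theta)/\pi(\theta)\}\,d\theta appearing in the Clarke-Barron expansion should be largest at a = 1/2 within the Beta(a, a) family.

import numpy as np
from scipy import stats
from scipy.integrate import quad

def F(a):
    # integrand: pi(theta) * log( I^{1/2}(theta) / pi(theta) ) with pi = Beta(a, a)
    def integrand(t):
        p = stats.beta.pdf(t, a, a)
        return p * (-0.5 * np.log(t * (1 - t)) - np.log(p))
    return quad(integrand, 1e-12, 1 - 1e-12, limit=200)[0]

for a in (0.4, 0.5, 0.75, 1.0, 2.0):
    print(a, F(a))   # the value at a = 0.5 should be the largest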
The only exception is \beta = -1, the chi-square distance as considered in Clarke
and Sun (1997) [22]. In this case \pi^{\beta+1}(\theta) = 1, so that the first order term as
obtained from Theorem 4.2.0.13 is a constant, and one needs to consider the second
order term. In this case,

1 + 2R_{-1}(\pi) = \int \left(\frac{nI(\theta)}{4\pi}\right)^{1/2}
\Bigg[1 + \frac{1}{n}\Bigg\{\frac{\psi'''(\theta)}{2I^{2}(\theta)}\frac{\pi'(\theta)}{\pi(\theta)}
+ \frac{1}{2I(\theta)}\left(\frac{\pi'(\theta)}{\pi(\theta)}\right)^{2}
- \frac{3}{4I(\theta)}\frac{\pi''(\theta)}{\pi(\theta)}
- \frac{(\psi'''(\theta))^{2}}{8I^{3}(\theta)}
+ \frac{\psi^{(4)}(\theta)}{16I^{2}(\theta)}\Bigg\}\Bigg]\,d\theta + o(n^{-1/2}). \qquad (4-29)

The reference prior is obtained by maximizing the expected chi-square distance
between the prior distribution and the corresponding posterior. By (4-29), this amounts to
maximizing the following integral with respect to the prior \pi(\theta):

\int \left[\frac{\psi'''(\theta)}{2I^{3/2}(\theta)}\frac{\pi'(\theta)}{\pi(\theta)}
+ \frac{1}{2I^{1/2}(\theta)}\left(\frac{\pi'(\theta)}{\pi(\theta)}\right)^{2}
- \frac{3}{4I^{1/2}(\theta)}\frac{\pi''(\theta)}{\pi(\theta)}\right] d\theta. \qquad (4-30)

To simplify this further, we will use the substitution

y(\theta) = \frac{\pi'(\theta)}{\pi(\theta)},

so that (4-30) reduces to

\int \left[\frac{\psi'''(\theta)\,y(\theta)}{2I^{3/2}(\theta)} - \frac{y^{2}(\theta)}{4I^{1/2}(\theta)} - \frac{3\,y'(\theta)}{4I^{1/2}(\theta)}\right] d\theta. \qquad (4-31)

Maximizing the last expression with respect to y(\theta) and noting that \psi'''(\theta) = I'(\theta), one gets the Euler-Lagrange equation

\frac{\partial L}{\partial y} - \frac{d}{d\theta}\frac{\partial L}{\partial y'} = 0,

with L the functional under the integral sign in (4-31).
The last expression is equivalent to

\frac{I'(\theta)}{4I^{3/2}(\theta)} = \frac{y(\theta)}{I^{1/2}(\theta)}.

Solving, we get

y(\theta) = \frac{I'(\theta)}{4I(\theta)},

thereby producing the reference prior

\pi(\theta) \propto I^{1/4}(\theta). \qquad (4-32)
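
The Euler-Lagrange computation above can be verified symbolically. The sketch below (a symbolic check under the stated form of (4-31), with I(\theta) left as an abstract positive function) substitutes the candidate y(\theta) = I'(\theta)/(4I(\theta)) into the Euler-Lagrange equation and confirms that the residual vanishes.

import sympy as sp

theta = sp.symbols('theta')
Ith = sp.Function('I', positive=True)(theta)   # Fisher information I(theta), kept symbolic
y, yp = sp.symbols('y yp')                     # stand-ins for y(theta) and y'(theta)

# Integrand of (4-31), using psi'''(theta) = I'(theta) since I(theta) = psi''(theta):
L = Ith.diff(theta) * y / (2 * Ith**sp.Rational(3, 2)) \
    - y**2 / (4 * sp.sqrt(Ith)) \
    - 3 * yp / (4 * sp.sqrt(Ith))

cand = Ith.diff(theta) / (4 * Ith)             # candidate solution y(theta) = I'(theta)/(4 I(theta))
sub = {y: cand, yp: cand.diff(theta)}

dL_dy = sp.diff(L, y).subs(sub)
dL_dyp = sp.diff(L, yp).subs(sub)

# Euler-Lagrange residual dL/dy - d/dtheta(dL/dy') should vanish at the candidate.
print(sp.simplify(dL_dy - sp.diff(dL_dyp, theta)))   # expected output: 0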















CHAPTER 5
SUMMARY AND FUTURE RESEARCH

5.1 Summary

This dissertation revisits the problem of simultaneous estimation of normal

means. It is shown that a general class of shrinkage estimators as introduced

by Baranchik [5] dominates the sample mean in three or higher dimensions

under a general divergence loss which includes the Kullback-Leibler (KL) and

Bhattacharyya-Hellinger (BH) losses ([13]; [38]) as special cases. An analogous

result is found for estimating the predictive density of a normal variable with

the same mean and a known but possibly different scalar multiple of the identity

matrix as its variance. The results are extended to accommodate shrinkage towards

a regression surface.

These results are extended to the estimation of the multivariate normal

mean with an unknown variance-covariance matrix. First, it is shown that for an

unknown scalar multiple of the identity matrix as the variance-covariance matrix,

a general class of estimators along the lines of Baranchik [5] and Efron and Morris

[30] continues to dominate the sample mean in three or higher dimensions. Second,

it is shown that even for an unknown positive definite variance-covariance matrix,

the dominance continues to hold for a general class of suitably defined shrinkage

estimators.

Also the problem of prior selection for an estimation problem is considered. It

is shown that the first order reference prior under divergence loss coincides with

Jeffreys' prior.

5.2 Future Research

The following is a list of future research problems:









* The admissibility of the MLE under divergence loss is an open question when

p = 2. It is conjectured that the MLE is admissible, but the proofs available under

squared error loss are difficult to adapt to the divergence loss.


* Another important problem is to find an admissible class of estimators of the

multivariate normal mean under general divergence loss.


* Extend the results for the simultaneous estimation problem with an unknown

variance-covariance matrix to prediction problems.


* Find a link identity, similar to that of George et al. [34], between estimation

and prediction problems, if such an identity exists.


* Explain the Stein phenomenon using differential-geometric methods on

statistical manifolds as in Amari [2].














REFERENCES


[1] AITCHISON, J. (1975). Goodness of prediction fit. Biometrika 62 547-554.

[2] AMARI, S. (1982). Differential geometry of curved exponential families -
curvatures and information loss. Ann. Statist. 10 357-387.

[3] ARNOLD, S.F. (1981). The theory of linear models and multivariate analysis.
John Wiley & Sons, New York.

[4] ANDERSON, T.W. (1984). An introduction to multivariate statistical
analysis, 2nd ed. John Wiley & Sons, New York.

[5] BARANCHIK, A.J. (1970). A family of minimax estimators of the mean of a
multivariate normal distribution. Ann. Math. Statist. 41 642-645.

[6] BAYES, T.R. (1763). An essay towards solving a problem in the doctrine
of chances. Philosophical Transactions of the Royal Society 53 370-418.
Reprinted in Biometrika 45 243-315, 1958.

[7] BERGER, J.O. (1975). Minimax estimation of location vectors for a wide
class of densities. Ann. Statist. 3 1318-1328.

[8] BERGER, J.O. (1985). Statistical Decision Theory and Bayesian Analysis
(2nd edition). Springer-Verlag, New York.

[9] BERGER, J.O. AND BERNARDO, J.M. (1989). Estimating a product of
means: Bayesian analysis with reference priors. J. Amer. Statist. Assoc. 84
200-207.

[10] BERGER, J.O. AND BERNARDO, J.M. (1992a). Reference priors in a variance
components problem. In Bayesian Analysis in Statistics and Econometrics
(P.K. Goel and N.S. Iyengar, eds.) 177-194. Springer-Verlag, New York.

[11] BERGER, J.O. AND BERNARDO, J.M. (1992b). On the development of
reference priors (with discussion). In Bayesian Statistics 4 (J.M. Bernardo et al.,
eds.) 35-60. Oxford Univ. Press, London.

[12] BERNARDO, J.M. (1979). Reference posterior distributions for Bayesian
inference. J. Roy. Statist. Soc. B 41 113-147.

[13] BHATTACHARYYA, A.K. (1943). On a measure of divergence between
two statistical populations defined by their probability distributions. Bull.
Calcutta Math. Soc. 35 99-109.









[14] BLYTH, C.R. (1951). On minimax statistical decision procedures and their
admissibility. Ann. Math. Statist. 22 22-42.

[15] BOCK, M.E. (1975). Minimax estimators of the mean of a multivariate
normal distribution. Ann. Statist. 3 209-218.

[16] BOLZA, 0. (1904). Lectures on the Calculus of Variations. Univ. Chicago
Press, Chicago.

[17] BROWN, L.D. (1966). On the admissibility of invariant estimators of one or
more location parameters. Ann. Math. Statist. 38 1087-1136.

[18] BROWN, L.D. (1971). Admissible estimators, recurrent diffusions and
insoluble boundary value problems. Ann. Math. Statist. 42 855-903.

[19] BROWN, L.D. AND HWANG, J.T. (1982). A unified admissibility proof.
Statistical Decision Theory and related topics Academic Press, New York, 3
205-267.

[20] CLARKE, B. AND BARRON, A. (1990). Information-theoretic asymptotics of
Bayes methods. IEEE Trans. Inform. Theory 36 453-471.

[21] CLARKE, B. AND BARRON, A. (1994). Jeffreys' prior is asymptotically least
favorable under entropy risk. J. Statist. Plann. Infer. 41 37-60.

[22] CLARKE, B. AND SUN, D. (1997). Reference priors under the chi-square
distance. Sankhya, Ser.A 59 215-231.

[23] CRESSIE, N. AND READ, T. R. C. (1984). Multinomial Goodness-of-Fit
Tests. J. Roy. Statist. Soc. B 46 440-464.

[24] DATTA, G.S. AND MUKERJEE, R. (2004). Probability matching priors:
higher order asymptotics. Springer, New York.

[25] DAWID, A.P. (1983). Invariant Prior Distributions. In Encyclopedia of
Statistical Sciences, eds. Kotz, S. and Johnson, N.L. New York: John Wiley,
228-236.

[26] DAWID, A.P., STONE, N. AND ZIDEK, J.V. (1973). Marginalization
paradoxes in Bayesian and structural inference (with discussion). J. Roy.
Statist. Soc. B 35 189-233.

[27] EFRON, B. AND MORRIS, C. (1972). Limiting the risk of Bayes and
empirical Bayes estimators, Part II: The empirical Bayes case. J. Amer.
Statist. Assoc. 67 130-139.

[28] EFRON, B. AND MORRIS, C. (1973). Stein's estimation rule and its
competitors: an empirical Bayes approach. J. Amer. Statist. Assoc. 68
117-130.









[29] EFRON, B. AND MORRIS, C. (1975). Data analysis using Stein's estimator
and its generalizations. J. Amer. Statist. Assoc. 70 311-319.

[30] EFRON, B. AND MORRIS, C. (1976). Families of minimax estimators of the
mean of a multivariate normal distribution. Ann. Statist. 4 11-21.

[31] FAITH, R.E. (1978). Minimax Bayes set and point estimators of a multivariate
normal mean. J. Mult. Anal. 8 372-379.

[32] FISHER, R.A. (1922). On the mathematical foundations of theoretical
statistics. Philosophical Transactions of the Royal Society of London, Ser. A,
222 309-368.

[33] FOURDRINIER, D., STRAWDERMAN, W.E. AND WELLS, M.T. (1998). On
the construction of Bayes minimax estimators. Ann. Statist. 26 660-671.

[34] GEORGE, E.I., LIANG, F. AND XU, X. (2006). Improved minimax
predictive densities under Kullback-Leibler loss. Ann. Statist. 34 78-92.

[35] GHOSH, M. (1992). Hierarchical and empirical Bayes multivariate
estimation. Current Issues in Statistical Inference: Essays in Honor of D.
Basu, Ghosh, M. and Pathak, P.K. eds., Institute of Mathematical Statistics
Lecture Notes and Monograph Series, 17 151-177.

[36] GHOSH, J.K. AND MUKERJEE, R. (1991). Characterization of priors under
which Bayesian and Bartlett corrections are equivalent in the multiparameter
case. J. Mult. Anal. 38 385-393.

[37] HARTIGAN, J.A. (1964). Invariant Prior Distributions. Ann. Math. Statist.
35 836-845.

[38] HELLINGER, E. (1909). Neue Begründung der Theorie quadratischer Formen
von unendlichvielen Veränderlichen. Journal für die reine und angewandte
Mathematik 136 210-271.

[39] HEWITT, E. AND STROMBERG, K. (1969). Real and Abstract Analysis. A
Modern Treatment of the Theory of Functions of a Real Variable. Second
printing corrected, Springer-Verlag, Berlin.

[40] HODGES, J.L. AND LEHMANN, E.L. (1950). Some problems in minimax
point estimation. Ann. Math. Statist. 21 182-197.

[41] HWANG, J.T. AND CASELLA, G. (1982). Minimax confidence sets for the
mean of a multivariate normal distribution. Ann. Statist. 10 868-881.

[42] JAMES, W. AND STEIN, C. (1961). Estimation with quadratic loss.
Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and
Probability University of California Press, 1 361-380.









[43] JAYNES, E.T. (1968). Prior probabilities. IEEE Transactions on Systems
Science and Cybernetics SSC-4 227-241.

[44] JEFFREYS, H. (1961). Theory of Probability. (3rd edition.) London: Oxford
University Press.

[45] KOMAKI, F. (2001). A shrinkage predictive distribution for multivariate
normal observations. Biometrika 88 859-864.

[46] KULLBACK, S. AND LEIBLER, R.A. (1951). On information and sufficiency.
Ann. Math. Statist. 22 79-86.

[47] LEHMANN, E.L. (1986). Testing Statistical Hypotheses. (2nd edition). J.
Wiley, New York.

[48] LEHMANN, E.L. AND CASELLA, G. (1998). Theory of Point Estimation.
(2nd edition). Springer-Verlag, New York.

[49] LIANG, F. (2002). Exact Minimax Strategies for predictive density estimation
and data. Ph.D. dissertation, Dept. Statistics, Yale Univ.

[50] LINDLEY, D.V. (1962). Discussions of Professor Stein's paper 'Confidence
sets for the mean of a multivariate distribution'. J. Roy. Statist. Soc. B 24
265-296.

[51] LINDLEY, D.V. AND SMITH, A.F.M. (1972). Bayes estimates for the linear
model. J. Roy. Statist. Soc. B 34 1-41.

[52] MORRIS, C. (1981). Parametric empirical Bayes confidence intervals.
Scientific Inference, Data Analysis, and Robustness. eds. Box, G.E.P.,
Leonard, T. and Jeff Wu, C.F. Academic Press, 25-50.

[53] MORRIS, C. (1983). Parametric empirical Bayes inference and applications.
J. Amer. Statist. Assoc. 78 47-65.

[54] MURRAY, G.D. (1977). A note on the estimation of probability density
functions. Biometrika 64 150-152.

[55] NG, V.M. (1980). On the estimation of parametric density functions.
Biometrika 67 505-506.

[56] ROBERT, C.P. (2001). The Bayesian Choice. (2nd edition). Springer-Verlag,
New York.

[57] RUKHIN, A.L. (1995). Admissibility: Survey of a concept in progress. Inter.
Statist. Review, 63 95-115.

[58] STEIN, C. (1955). Inadmissibility of the usual estimator for the mean
of a multivariate normal distribution. Proceedings of the Third Berkeley








Symposium on Mathematical Statistics and Probability Berkeley and Los
Angeles, University of California Press, 197-206.

[59] STEIN, C. (1974). Estimation of the mean of a multivariate normal
distribution. Proceedings of the Prague Symposium on Asymptotic Statistics
ed. Hajek, J. Prague, Universita Karlova, 345-381.

[60] STRAWDERMAN, W.E. (1971). Proper Bayes minimax estimators of the
multivariate normal mean. Ann. Math. Statist. 42 385-388.

[61] STRAWDERMAN, W. E. (1972). On the existence of proper Bayes minimax
estimators of the mean of a multivariate distribution. Proceedings of the
Sixth Berkeley Symposium on Mathematical Statistics and Probability
Berkeley and Los Angeles, University of California Press, 6 51-55.















BIOGRAPHICAL SKETCH

The author was born in Korukivka, Ukraine in 1973. He received the Specialist

and Candidate of Science degrees in Probability Theory and Statistics from Kiev

National University of Taras Shevchenko in 1997 and 2001 respectively. In 2001 he

came to the University of Florida to pursue a Ph.D. degree in the Department of Statistics.