
- Permanent Link:
- https://ufdc.ufl.edu/UFE0041145/00001
## Material Information
- Title:
- Computational Approaches for Empirical Bayes Methods and Bayesian Sensitivity Analysis
- Creator:
- Buta, Eugenia
- Place of Publication:
- [Gainesville, Fla.]
- Florida
- Publisher:
- University of Florida
- Publication Date:
- 2010
- Language:
- English
- Physical Description:
- 1 online resource (104 p.)
## Thesis/Dissertation Information
- Degree:
- Doctorate (Ph.D.)
- Degree Grantor:
- University of Florida
- Degree Disciplines:
- Statistics
- Committee Chair:
- Doss, Hani
- Committee Members:
- Hobert, James P.
- Casella, George
- AitSahlia, Farid
- Graduation Date:
- 8/7/2010
## Subjects
- Subjects / Keywords:
- Bayes estimators ( jstor )
- Ergodic theory ( jstor )
- Estimation methods ( jstor )
- Estimators ( jstor )
- Markov chains ( jstor )
- Maximum likelihood estimations ( jstor )
- Modeling ( jstor )
- Point estimators ( jstor )
- Skeleton ( jstor )
- Statistics ( jstor )
- Statistics -- Dissertations, Academic -- UF
- analysis, bayes, bayesian, empirical, factor, hyperparameter, posterior, prior, selection, sensitivity, variable
- Genre:
- bibliography ( marcgt )
- theses ( marcgt )
- government publication (state, provincial, territorial, dependent) ( marcgt )
- born-digital ( sobekcm )
- Electronic Thesis or Dissertation
- Statistics thesis, Ph.D.
## Notes
- Abstract:
- We consider situations in Bayesian analysis where we have a family of priors on the parameter theta, and we deal with two related problems. The first involves sensitivity analysis and is stated as follows. Suppose we fix a function f of theta. How do we efficiently estimate the posterior expectation of f(theta) simultaneously for all priors in the family? The second problem is how to identify reasonable choices of priors. We assume that we are able to generate Markov chain samples from the posterior for a finite number of the priors, and we develop a methodology, based on a combination of importance sampling and the use of control variates, for dealing with these two problems. The methodology applies very generally, and we show how it applies in particular to a commonly used model for variable selection in Bayesian linear regression, in which the unknown parameter includes the model and the regression coefficients for the selected model. The prior is a hierarchical prior in which first the model is selected, then the coefficients for this model are chosen, and this prior is indexed by two hyperparameters. These hyperparameters effectively determine whether the selected model will be a large model with many variables, or a parsimonious model with only a few variables, so choosing them is very important. We give illustrations of our methodology on real data sets.
- General Note:
- In the series University of Florida Digital Collections.
- General Note:
- Includes vita.
- Bibliography:
- Includes bibliographical references.
- Source of Description:
- Description based on online resource; title from PDF title page.
- Source of Description:
- This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
- Thesis:
- Thesis (Ph.D.)--University of Florida, 2010.
- Local:
- Adviser: Doss, Hani.
- Electronic Access:
- RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2011-08-31
- Statement of Responsibility:
- by Eugenia Buta.
## Record Information
- Source Institution:
- UFRGP
- Rights Management:
- Copyright Buta, Eugenia. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
- Embargo Date:
- 8/31/2011
- Classification:
- LD1780 2010 ( lcc )
## Full Text

Now that we have shown that (1) 1 y ,N ( (0(1)0 1 +..+ (1) 1 Nk (l) (o(k)O) a1 vN 1 =1 + + a1 k N 1 Yk ( (k) 1 2 1i (k)(0(1)0) + (k) 1 1N 2 k;= k)i (k)O) U= 1 v + p(l), (k+l) 1 NAi ,(k+1)/o(1)0) a i 1 1 I (22k) 1 W N k(2k) (k)O) we can prove that U is asymptotically normal by using the Cramer-Wold device. Let us denote the asymptotic variance of U by S. Then S(1) S(2)\ S= I S(2) S(3) where S(i) = B+ CB+, (A.44) with k Crs = A Cov(p,(O/')0, TIo), ps(i~)o0, TIo)) /=1 k oo + ZiA, ov(pr ([C ~ o), Ps(Og, o)) + Cov(Pr((0 go), Ps(j1)0, 1o))] /=1 g=1 for r = 1,... k, and s = 1,..., k, S = [Var(f( (or)) + 2O Cov(f((r)O), f( r))] g=1 (A.45) S3) = 0 when r s, r = 1...k, s = 1...k, and S(2) = B+D, (A.46) Theorem, we know that there exists a d* between d and d such that G(d) = G(d) + VG(d*)'(d d) = R,,, + VG(d*)'(c d) + o,(1). Note that the last equality above comes from applying the SLLN. Next we show that VG(d*) = O,(1). We have three cases for t = 2,..., k. Case 1: t {j,j'}. We have I[VG(d*)]t-l -. k 1 h, [ (OV )/d < 2a,- /=1 1=1 = ni ' ^v^ a, [^^)I'/. _ +Vh, (/))] [,h, (1) d - d*2 k( 1s] [ ( )/d i _,, (0 (d h, ( )] + V h,( ) (0) ,)3 -e) + h1(0(')) atVh,(0(1)) (de )2 (YZ=1 asvh(O, )/(ds + ))3 The term inside the inner sum is bounded, so we can conclude that [VG(d*)]t-1 is bounded in probability, as it is bounded by a O,(1) term on Z. Case 2: j / j', t c {j,j'}, say t = j. We have [V G(d*)]j_ k al n' 2((h ))/d /=1 n=1 Vh, ( ) (' (- 0) / 0)) ) d2 (yik 01))Id)3 d* s=1 as7-h, (1 d k ni =1n 1 I=1 iil and this is bounded in probability. Case 3: t = j = j'. We have [VG(d*)],_ 1k i n 7h( l sh (()/ I=1 ni" =1 Cs=1as 7hs /d; d*2 C = aslhs(O1)/d* (vhs ('()I )/d Vh1 (0')) a Vh, (I') d*2 (kS1 aVh (0)/ds) 2 and again this is bounded in probability. , i (')(,, ( )/d, Sh, ((/)) d2 k(s=l )( )2 Table 5-1 gives the posterior inclusion probabilities for each of the fifteen predictors, i.e. P(7Q = 1 y) for i = 1,..., 15, under several models. 
Line 2 gives the inclusion probabilities when we use model (1-1) with the values $w = .65$ and $g = 20$, which are the values at which the graph in Figure 5-1 attains its maximum. Line 4 gives the inclusion probabilities when the hyper-g prior "HG3" in Liang et al. (2008) is used. As can be seen, the inclusion probabilities we obtained under the EB model are comparable to, but somewhat larger than, the probabilities when the HG3 prior is used. This is not surprising since our model allows $w$ to be chosen, and the data-driven choice gives a value (.65) greater than the value $w = .5$ used in Liang et al. (2008). (Table 2 of Liang et al. (2008) gives a comparison of posterior inclusion probabilities for a total of ten models taken from the literature.) Line 3 of Table 5-1 gives the inclusion probabilities under model (1-1) when we use $w = .5$ and the value of $g$ that maximizes the likelihood with $w$ constrained to be .5. It is interesting to note that the inclusion probabilities are then strikingly close to those under the HG3 model.

Table 5-1. Posterior inclusion probabilities for the fifteen predictor variables in the U.S. crime data set, under three models. Names of the variables are as in Table 2 of Liang et al. (2008) (but all variables except for the binary variable S have been log transformed).

| Model | Age | S | Ed | Ex0 | Ex1 | LF | M | N | NW | U1 | U2 | W | X | Prison | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EB(20,.65) | .93 | .39 | .99 | .70 | .51 | .34 | .35 | .52 | .83 | .40 | .76 | .55 | 1.00 | .96 | .55 |
| EB(20,.5) | .85 | .29 | .97 | .67 | .45 | .22 | .22 | .38 | .70 | .27 | .62 | .38 | 1.00 | .90 | .39 |
| HG3 | .84 | .29 | .97 | .66 | .47 | .23 | .23 | .39 | .69 | .27 | .61 | .38 | .99 | .89 | .38 |

Figure 5-2 gives plots of the posterior inclusion probabilities for Variables 1 and 6, as $w$ and $g$ vary. The literature recommends various choices for $g$ [in particular $g = m$ in Kass and Wasserman (1995), $g = q^2$ in Foster and George (1994), $g = \max(m, q^2)$ in Fernandez et al. (2001)], and posterior inclusion probabilities for all these choices combined with any choice of $w$ can be read directly from the figure. 
The extent to which these probabilities change with the choice of g is quite striking. on all 1 through q diagonal entries, then we obtain the matrix S ( -(X'X)-1 (X'X)-X'Y Y'X(X'X)-1 Y'Y Y'X(X'X)-1X'Y If we sweep the augmented matrix T defined in (B.3) on the diagonal entries corresponding to the covariates in 7, then from the resulting matrix S we can obtain all the important quantities needed by Steps 1-4: (XX,)-1 is the negative of the submatrix of S corresponding to rows and columns in y, (XX,)-1XX Y is the submatrix of S corresponding to rows in 7 and column q + 1, and Y'X,(X'X,)-'X' Y can be obtained by subtracting the (q + 1, q + 1) element of S from Y'Y (with other methods, we may need to compute separately the last three quantities). To illustrate the use of this operator, suppose that we have already swept T over the covariates in 7 = (0, 72,..., 7q). Then we only need to perform one sweep on the first diagonal entry to get the swept matrix corresponding to the first predictor being added to the previous model (7 = (1, 72,..., 7q)). Conversely, since the sweep operator has an inverse, the latter matrix could be "unswept" over the first diagonal entry to get the swept matrix corresponding to dropping the first predictor. Let ec (0, min(d2,..., dk)). Then P(lld* dl < ) 1. For t j we have nl1/2[VH(d*)]t-l Z nt lhj(Ol))/dj __ h(0/)) at h ( )) < *2t k ( )) )2 i=1 S=1 avha(,s)/d nh-1 t ni ) t h ( 1) -_ 1 n h ( ) h ) < n-h 2(O))atVh(O)+ + nhl (Ol))atl/h (Ol)) =1 d d* (y kl1 s Vh(O)/d)2 / dt*2f ( 1 asvMh (O))/dS) < n t Vhj ( ))th (0)) j= ( -t c)2(d (c)( = aSvh5(0,'))/(ds + ) n (d )2(C/, aV e)( h, (l/(d + e))2 + ni+ ))2 = o0(1)+ o0(1) = o(1). Similarly, 11 0 10) k O n-1/2[VH(d*)]j_1 I < 1 h( )) n (dj C)2 s 1 as h7(O')/(ds + e) 1 nhj (0(, )) ajh (h l )) n+ (d, e)3(E:=1 aSvh,()/(d + ))2 1 nh, ( / ) aj hj(0 ) n'1 Y (d C(Eyk= lavh(')/(ds + e))2' and the right side of this inequality is Op(1), as it is the sum of three Op(1) terms. 
So (A.16) now implies that nln = la V Op( O,(1) + Op,(1) = Op,(1). in1 We now consider vln(Im 7_ ), the middle term in (A.7). Define Sk n vh(O)) k ,hj(O))/J hl(O')) K(u) = k jjOim)/Us E ash(O)/U / n /=1 1=1 ( s=1 as h, (o) s j=2 s=1 sh, ,()/Us By (2-11), P(6i = 17 ('), ('+1)) = s(y('))d(y('+))/v(7(') ,(i'+)), and the normalizing constant c(Q*) cancels in the numerator, so in practice there is no need to compute it, and the success probability of the regeneration indicator is simply (min v(.,.,,)]Ev(.7*, )/(('+ c)E D)] P(b, = 1 '), )) D V-V ( ), 7('+)) D)] (C.2) The choice of 7* and D affects the regeneration rate. Ideally we would like the regeneration probability to be as big as possible. Notice that regeneration can occur only if 7 is in D. This suggests making D large. However, increasing the size of D makes the first term in brackets in (C.2) smaller. We have found that a reasonable tradeoff consists of taking D to be the smallest set of models that encompasses 25% of the posterior probability. Also, the obvious choice for 7* is the HPM model. The distinguished model and the set D are selected from the output of an initial chain. For the all-inclusive chain that runs not only on the model space but also on the space of error variance and model coefficients, we can obtain a minorization condition if we multiply the condition in (C.1) on both sides by p(,'21y', Y)p(/3o 7,2 yr)p(/ Y,, 712, 0 Y). This yields k(O, O') > sl(O)dl(O') for all 0 = (7 ,, o, ), 0' = (y', o', i, j'), (C.3) where si(0) = s(7) and dl(') = d(7')p('y, Y)p(32I '/2, Y)p(Q3 I <17', )'2, Y). Hence for this bigger chain the regeneration indicator has, according to (2-11), success probability P(6, = 11 ), 1) = l i)) i+ )) ( (i))d( (i )) Sk(O(i), 0(i+ )) v(7('), 7(i+ )) ' Now define the function g: Rk+l Rk by el' -2 A2/A1 e' r-3A3/A1 S e'l- kAk/Al b where Tr is a k-dimensional vector and b is a real number. 
Applying the delta method to the previously established result that T -d A/(0, Z), we get Sd 0O, Vg I ZVg Iori , SB(h, hl, d) B(h, hl) B(h, hl) B(h, h)) where Vg I = (A.50) B(h, h) 0' 1 with Se"1-'72A2/A1 ell-1'3A3/A1 ... e'-kAk/A -e'l-'2A2/A1 0 ... 0 E = : : (A.51) 0 -el-'73A3/A1 ... 0 0 0 ... -e1-7kAk/A1 and 0 in (A.50) representing the column vector of k zeros. Hence, we know that (B(h, hi, d) B(h, hi)) has an asymptotically normal distribution with mean 0 and variance c(h)'Zc(h) + -2(h) + 2c(h)'E'z, where Z denotes, as in the statement of Theorem 1, the asymptotic variance of v/n(d - d), E is given in (A.51), and z in (A.49). D LIST OF FIGURES Figure page 5-1 Estimates of Bayes factors for the U.S. crime data. The plots give two different views of the graph of the Bayes factor as a function of w and g when the baseline value of the hyperparameter is given by w = 0.5 and g = 15. The estimate is (2-13), which uses control variates. ... 48 5-2 Estimates of posterior inclusion probabilities for Variables 1 and 6 for the U.S. crime data. The estimate used is (2-16). ... 50 5-3 Variance functions for two versions of / (). The left panel is for the estimate based on the skeleton (5-4). The points in this skeleton were shifted to better cover the problematic region near the back of the plot (g small and w large), creating the skeleton (5-5). The maximum variance is then reduced by a factor of 9 (right panel).... ................ ........ ... .. 51 5-4 Estimates of Bayes factors for the ozone data. The plots give two different views of the graph of the Bayes factor as a function of w and g when the baseline value of the hyperparameter is given by w = .2 and g = 50. ... 52 5-5 95% confidence intervals of the posterior inclusion probabilities for the 44 predictors in the ozone data when the hyperparameter value is given by w = .13 and g = 75. A table giving the correspondence between the integers 1-44 and the predictors is given in Appendix D ............. .. .. 
ACKNOWLEDGMENTS

I would like to thank my advisor, Professor Hani Doss, for the invaluable help and guidance with writing this dissertation. I am also thankful to Professors Farid AitSahlia, George Casella, and James Hobert for serving on my supervisory committee and offering me helpful comments. In addition, I owe a debt of gratitude to the Department of Statistics at the University of Florida. I greatly appreciate the chance I have been given to come here and learn Statistics from many exceptional teachers, all while benefiting from the kindness and support of other students and staff.

If we define $c_u(j) = \mathrm{Cov}(f_u(\theta_1), f_u(\theta_{1+j}))$, $j = 1, 2, \ldots$, then $\sum_{j=1}^{\infty} c_u(j)$ is absolutely convergent. This is because under geometric ergodicity, the so-called strong mixing coefficients $\alpha(j)$ decrease to 0 exponentially fast (a definition of strong mixing is given on p. 349 of Ibragimov (1962)), and

$$|\mathrm{Cov}(f_u(\theta_1), f_u(\theta_{1+j}))| \le [\alpha(j)]^{\delta/(2+\delta)} \left[E\left(|f_u(\theta_1)|^{2+\delta}\right)\right]^{2/(2+\delta)}, \qquad \text{(3-3)}$$

for some $\delta > 0$. See Theorem 18.5.3 of Ibragimov and Linnik (1971) or Lemma 7.7 of Chapter 7 in Durrett (1991). Since $c_{u^{(n)}}(j) \to c_d(j)$ for each $j$, (3-2) and (3-3) enable us to again apply Dominated Convergence to conclude that $\sum_{j=1}^{\infty} c_{u^{(n)}}(j) \to \sum_{j=1}^{\infty} c_d(j)$, and this proves that $\tau_l^2(h, u)$ is continuous in $u$.

Let $g(u)$ be the spectral density at 0 of the series $f_u(\theta_i)$. Note that $g(u)$ is equal to $\tau_l^2(h, u)$, except for a normalizing constant. Under strong mixing (implied by geometric ergodicity), standard spectral density estimates $\hat g(u)$ are consistent, and bounds on the discrepancy $|\hat g(u) - g(u)|$ depend on the mixing rate and bounds on the moments of the function $f_u(\theta)$ (Rosenblatt 1984). By (3-2), the rate is uniform as long as $\|u - d\|$ is small, and the condition that $\|\hat d - d\|$ is small is guaranteed if the Stage 1 sample size $N$ is large. Geyer (1994) gives an expression for $\Sigma$ involving infinite series of the form (3-1), and this enables estimation of $\Sigma$ by spectral methods. 
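As a rough illustration of the spectral approach mentioned above, the following sketch estimates the asymptotic variance of a sample mean, $\mathrm{Var}(X_1) + 2\sum_{j \ge 1}\mathrm{Cov}(X_1, X_{1+j})$, with a Bartlett lag-window estimate. The choice of window, the bandwidth rule $\lfloor n^{1/3} \rfloor$, the function names, and the AR(1) test series are all assumptions made for illustration; the text above does not say which spectral estimate was used.

```python
import numpy as np


def spectral_variance(x, bandwidth=None):
    """Bartlett lag-window estimate of Var(X_1) + 2 * sum_j Cov(X_1, X_{1+j}).

    Illustrative only: the window and the default bandwidth are assumptions.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if bandwidth is None:
        bandwidth = int(np.floor(n ** (1.0 / 3.0)))
    xc = x - x.mean()
    # Lag-0 sample autocovariance.
    estimate = np.dot(xc, xc) / n
    for j in range(1, bandwidth + 1):
        gamma_j = np.dot(xc[:-j], xc[j:]) / n  # lag-j sample autocovariance
        weight = 1.0 - j / (bandwidth + 1.0)   # Bartlett taper
        estimate += 2.0 * weight * gamma_j
    return estimate


def ar1_series(n, phi, seed=0):
    """Hypothetical test series: stationary AR(1) with unit innovation variance."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = eps[0] / np.sqrt(1.0 - phi ** 2)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x


# For an AR(1) series the target is known in closed form: 1 / (1 - phi)^2.
series = ar1_series(100_000, phi=0.5)
print(spectral_variance(series), 1.0 / (1.0 - 0.5) ** 2)
```

In practice `x` would be the values $f_u(\theta_i)$ along one chain, and the estimate would be computed separately for each chain.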
Now, c(h) is a vector each of whose components is an integral with respect to the posterior Vh,y (see (A.3)). The estimate derived in Section 2.3 (see (2-16)) is designed precisely to estimate such posterior expectations. Combining, we arrive at an overall estimate of K2(h), and the asymptotic variances of our other estimates are handled similarly. Methods Based on Regeneration The cleanest approach to estimating asymptotic variances is based on regeneration. Let Xo, X1, X2, ... be a Markov chain on the measurable space (X, B), let K(x, A) be the Markov transition distribution, and assume Consider now the difference 1n l /22-_k ( [ f ] m f ] ' 1/2 l k Pj,lim f]) i z,[,] J)n /22 2(Jim j]) ( l 1l Z ,) I n 7If, )71/2, k/ ,n1 i-,, -0^ _ Sj=2J-- J/,lIim n i= i (j=l3jim 3 f ] ) 1 1 / 2 1y:l)n E(zY j ())] k L;=2(/j,1im j) =1 a 1 1/2 1-Z0) -E(Z- I] )Z k -- .ac=-k 1, nl/2 nl / u-: 1, J/ where the last equality follows from (A.31). By the assumption that the chains are geometrically ergodic (condition Al), the boundedness of Z,)'s, and the moment condition imposed on f in A4, we know that n1/2 ([ZiZ) E(Zi0J)]/ni) and n1/2 n1 [Z(]i) E(ZlifO))]/n,) are asymptotically normal, hence Op(1). This fact, combined with (A.27) and the corresponding result for (3o, /3), yields n"2 (j^, 1 Jim,') = Op(1). Hence we can conclude that /2( [ (lf](h)B(h, h) d) > (O, [f]). SB(h,h,) , Now applying the delta method with the function g(u, v) = u/v we have \/ =1 1 1 l (y [f]= _-[f] [f](j) \ i.e. nl/2 [,]- I[f](h)) d ) (O, r(h)), where r(h,) = Vg(l[f](h)B(h, h/ ), B(h, h))' t[f] Vg(l[f](h)B(h, h), B(h, hl)), (A.33) with Vg(u, v) = (l/v, -u/I2)' and F[f] as in (A.29). O Proof of Theorem 3 First, we note that n(F(7f]](h, d) I[f](h)) = nn(7rfl(h, d) 1f](h, d)) + n[(7[f](h, d) If](h)). (A.19) We begin by analyzing the second term on the right side of (A.19), which only involves randomness from the second stage of sampling, and show that it is asymptotically normal. 
As for the first term, a closer examination reveals that it is also asymptotically normal, with all its randomness coming from Stage 1. The asymptotic normality of the sum of these two terms then follows immediately from the independence of the two stages of sampling. Note that E=1 alE(Y ) = I[f](h) B(h, hi), and in particular, when f 1, this gives =1 alE(Y1,i) = B(h, hi). Also, we have n Y) i I:'l(h) B(h, hi) V iV aiE(Yl]) n1/2 =1 i=1 1/2 /= 1 =1 1 1 k n k nl k n n = 1 ,=1 \ = i= =1 i== 1 k n1 [f] E (YfI) = al/2 11/ 2 >, 2 1,, (A.20) /=1 =1 ,l E(Y1,i) By condition (2-17), assumption A2 of Theorem 1, and the assumed geometric ergodicity and independence of the k Markov chains used, the vector in (A.20) converges in distribution to a normal random vector with mean 0 and covariance matrix F(h)= E=1 a/,F(h), where /(h) = 711 712 , 721 722 APPENDIX D MAP FOR THE OZONE PREDICTORS IN FIGURE 5-5 Table D-1. The 44 predictors used in the ozone illustration. The symbol "." represents an interaction. Number Predictor 1 vh (Vandenburg 500 millibar pressure height (m)) 2 wind (Wind speed (mph) at Los Angeles International Airport (LAX)) 3 humid (Humidity (percent) at LAX) 4 temp (Sandburg Air Force Base temperature (F)) 5 ibh (Inversion base height at LAX) 6 dpg (Daggett pressure gradient (mm Hg) from LAX to Daggett, CA) 7 ibt (Inversion base temperature at LAX) 8 vis (Visibility (miles) at LAX) Number Predictor Number Predictor 9 vh2 27 vh.dpg 10 wind2 28 wind.dp 11 humid2 29 humid.d 12 temp2 30 temp.dp 13 ibh2 31 ibh.dpg 14 dpg2 32 vh.ibt 15 ibt2 33 wind.ib 16 vis2 34 humid.i 17 vh.wind 35 temp.ib 18 vh.humid 36 ibh.ibt 19 wind.humid 37 dpg.ibt 20 vh.temp 38 vh.vis 21 wind.temp 39 wind.vi 22 humid.temp 40 humid.v 23 vh.ibh 41 temp.vi 24 wind.ibh 42 ibh.vis 25 humid.ibh 43 dpg.vis 26 temp.ibh 44 ibt.vis -0 00. o.o 0.2 .8 .8. < 0.6 m 0.6 200 200 150 0. i 150 0.44 100 100 9 50 0.2 9 50 0.2 Figure 5-2. 
Estimates of posterior inclusion probabilities for Variables 1 and 6 for the U.S. crime data. The estimate used is (2-16). Selection of the skeleton points was discussed at the end of Chapter 3, and we now return to this issue. Consider the Bayes factor estimate based on the skeleton (5-4), which was chosen in an ad-hoc manner. The left panel in Figure 5-3 gives a plot of the variance of this estimate, as a function of h. As can be seen from the plot, the variance is greatest in the region where g is small and w is large. We changed the skeleton from (5-4) to (w, g) e {.5, .7, .8,.9} x {10, 15, 50, 100} (5-5) and reran the algorithm. The variance for the estimate based on (5-5) is given by the right panel of Figure 5-3, from which we see that the maximum variance has been reduced by a factor of about 9. 5.3.2 Ozone Data This data set was originally analyzed in Breiman and Friedman (1985), was used in many papers since, and was recently analyzed in a Bayesian framework by Casella and Moreno (2006) and Liang et al. (2008). The data consist of daily measurements of ozone concentration and eight meteorological quantities in the Los Angeles basin for 330 days of 1976. The response variable is the daily ozone concentration, and we follow Liang et al. (2008) in considering 44 possible predictors: the eight meteorological [L(d)]_i, j = 2,..., k, converges almost surely. We have [VL(d)]jl_ 11 d=(1 k 1 ash7(- ,) )2 k1 ni = ,=l Zks1 S --- h (O1) asVh(O()ds k ni /=1 1=1 af(e) h, (e) ,1 asvh',()/ds f(0(1) ),h(1(() 1=1 as sh (O,))/ds ( k =i s := ;= k- B(h, hi) a, f(e) h( e) d2 J (h, hl() B(h, h1) Swh,y(O) de k ni /=1 i=1 ajVh(Oi)Vh,) (Oh') d2 ( k=1 ash(1))/ds) (1 (12sl) ) 1 ash 0()) ds Vh,y(O) dO Sh,y(O) de I[l(h) dj := [v(h)]j_, j = 2,..., k. (A.22) As in the proof of Theorem 1, it can be shown that each element of the second-derivative matrix V2L(d*) is Op(1). 
Now, we can rewrite (A.19) as I[f](h)) = 7VL(d)' /( d) ( h) [f]( + / [v//N(d - + v"(!1f](d, h)-/1f]( qv(h)' /N(d d)] 2 L(d*) [/ (l d) + vn('f](h, d) Since the two sampling stages are assumed to be independent, we conclude that I/f](h)) d A(O, qv(h)'Tv(h) + p(h)). 1 d2I ) f a ,vh,(e) h 1 asvh ds B(h, hi)2 ajVh, (0) =1 asVhc(0)/ ds Sh,y(O) dO, vn ('](h, a) I[](h)) + op(l). Bhh B(h, h f](h) B(h, hi) 2 dJ v (7!I](h, d) we were able to calculate maxhE V(h, hi,..., hk), the design problem would involve the minimization of a function of k x dim(-t) variables, and in general, solving the design problem is hopeless. In our experience, we have found that the following method works reasonably well. Having specified the range -, we select trial values h, ..., hk and plot the estimated variance as a function of h, using one of the methods described above. If we find a region in R- where this variance is unacceptably large, we "cover" this region by moving some hi's closer to the region, or by simply adding new hi's in that region, which increases k. This is illustrated in the example in Chapter 5. [See equation (31) in Geyer (1994).] Note that, by applying the Mean Value Theorem to V/1N(q), BN defined in (A.38) can also be expressed as for some T* between 7N and qo. Hence, with pr, r elements of BN are given by 1,..., k defined as in (A.35), the 1 k N, [BN]r,r = N pr(Oi)O, 1 l*) [ pr(0(')0, 1*)], r ...k, /=1 i=1 1 k Ni [BN]r,s = / r(O(')0, *)P(O(')o, *), r s, /=1 i=1 which makes it easy to verify that BNU = 0. Combining this with equation (A.39), it can be shown that S1 o), \N(1N -17o) BN= V/N(T7o), N V/- (A.40) where S(BN+ U' UU N kUU k is the Moore-Penrose inverse of BN. Furthermore, letting B+ denote the Moore-Penrose inverse of B, we can alternatively write the equality in (A.40) as N(1N o) = (B B + B+) V--VN(1o) 1 1+ = (B B+) VI/N(o) + B V1/N(/o). Nv/ v N (A.41) Now, using the result BN Ea B established by Geyer (1994), we can easily deduce that B+ 2a. 
B+, (A.42) 1NV2 N(* ),

estimate through an iterative scheme which is fast and stable [Meng and Wong (1996, p. 849)] and this is the computational method we use in the present paper.

Owen and Zhou (2000) consider the problem of estimating an integral of the form $I = \int h(x) f(x)\,dx$, where $f$ is a probability density that is completely known (as opposed to known up to a normalizing constant) and $h$ is a known function. They wish to estimate $I$ through importance sampling. They assume they can generate sequences $X_{1,l}, \ldots, X_{n_l,l} \overset{\text{iid}}{\sim} p_l$, $l = 1, \ldots, k$, where the $p_l$'s are completely known densities. The doubly indexed sequence $X_{i,l}$, $i = 1, \ldots, n_l$, $l = 1, \ldots, k$ forms a (stratified) sample from the mixture density $p_a = \sum_{l=1}^{k} a_l p_l$, where $a_l = n_l / \sum_{s=1}^{k} n_s$, so one can carry out importance sampling with respect to this mixture. They point out that since the $p_l$'s are completely known, they can form the functions $H_j(x) = [p_j(x)/p_a(x)] - 1$, $j = 1, \ldots, k$, and these satisfy $E_{p_a}(H_j(X)) = 0$, where the subscript indicates that the expectation is taken with respect to the mixture density $p_a$. Therefore, these $k$ functions can be used as control variates. What we do in Chapter 2 is similar, except that we are working with densities whose functional form is known, but whose normalizing constants are not.

Kong et al. (2003) also consider the k-sample model for biased sampling, but have a different perspective, and we describe their work in the notation of the present paper. They assume that there are probability measures $Q_1, \ldots, Q_k$ with densities $q_1/m_1, \ldots, q_k/m_k$, respectively, relative to some dominating measure $\mu$, and for each $l = 1, \ldots, k$, we have an iid sample $\{X_{i,l}\}_{i=1}^{n_l}$ from $Q_l$. Here, the $q_l$'s are known, but the $m_l$'s are not. Their objective is to estimate all possible ratios $m_l/m_j$, $l, j \in \{1, \ldots, k\}$ or, equivalently, the vector $d = (1, m_2/m_1, \ldots, m_k/m_1)$. In their highly unorthodox approach, Kong et al. 
(2003) obtain the maximum likelihood estimate $\hat\mu$ of the dominating measure itself ($\hat\mu$ is given up to an overall multiplicative constant). They can then estimate the ratios $m_l/m_j$, since the normalizing constants are known functions of $\mu$ (i.e. $m_l = \int q_l(x)\,d\mu(x)$, and $q_l$ is known). They show that the resulting estimate of $d$ is obtained

By condition A5, we also have [ 1 =k n, Z[f](0) (Z[f])'Y[f] n Yn k,=1 y /=1 Z,[f](k) ( k a.s. k 1 a E Z[f]() YVf ) 1 a/E(Z[f](k) Y[f] a 1,1 1,1 Let vf]1 = (If] ..., vl)' denote the vector on the right side of (A.26). Combining (A.25) and (A.26) we get (A.27) k /1 n /----1 r( [y,] /y- R[f] Z0) CA k= 1|,j i miZ,, ,/) Y 1,/ k j=2/ J,limZ ,/ Ul") Yi[f] U[f](2) /ida) (Yi, pk = [f] 7[f]Z j)l k=1 Pj,lim i),/ Z =k (jim ) 2 j mZ;,, ) Also, let f] = E(U if]). Now since A2, A3, and A4 hold, for each I= 1,..., k we have nt n1/2 (1 [ li i=1 where 2 S 1,11 0/,12 21 J/,21 J/,22 / with -2 = Var (U[ )) +" 2 l Cov(U ](), U[f](l)' 7,1 1,g= ]\ 1,1 "-l+g,l/' 0/,12 0/,21 C v(Ui[(1), U[f](2)) + CO LCov ( ](1), U[f(2) + Cov(Uf ](2), U f1)] 222 =, V r 1 g=U I 1,d l-g,l d 1,~+ lCg,1j, 2U[f](2)1 COy/u[f](2)i U[f](2)\ -/,22 = Var+[f](2) 2 1Co1 /,22 1,1 id 1, 1 -l+g,l/ " (A.26) where ( f]', a[f]) s ( i[f] / ) \00',1im' (Rlfl)-lvlfl. 1 k ni Ijm,,, = Y V /=1 i=1 d r(0, El]),

and this quantity converges to $\int B(h, h_1)\,\frac{\nu_{h,y}(\theta)}{\nu_{\cdot,y}(\theta)}\,\nu_{\cdot,y}(\theta)\,d\theta = B(h, h_1)$. Assuming that for each $l = 1, \ldots, k$ we have samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, from the posterior density $\nu_{h_l,y}$, then for $a_s = n_s/n$, the estimate in (2-4) can be written as

$$\hat B(h, h_1, d) = \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{n \sum_{s=1}^{k} a_s\, \nu_{h_s}(\theta_i^{(l)})/d_s}. \qquad \text{(2-5)}$$

(Note that the combined samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, $l = 1, \ldots, k$ form a stratified sample from the mixture distribution $\nu_{\cdot,y}$.) Doss (2010) shows that under certain regularity conditions the estimate (2-5) is consistent and asymptotically normal. In virtually all applications, the value of the vector $d$ is unknown and has to be estimated. Doss (2010) does not deal with the case where $d$ is unknown. 
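To make the estimate (2-5) concrete, here is a minimal sketch on a toy conjugate model in which every quantity is available in closed form. Everything about the toy setup is an assumption made for illustration: the prior family $\theta \sim N(0, h)$, the likelihood $y \mid \theta \sim N(\theta, 1)$, the skeleton values, the function names, the use of iid posterior draws in place of Markov chain output, and the use of the true vector $d$ (which in practice would have to be estimated). The denominator is written as $\sum_s n_s \nu_{h_s}(\theta)/d_s$, which is the same as $n \sum_s a_s \nu_{h_s}(\theta)/d_s$ when $a_s = n_s/n$.

```python
import numpy as np


def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return np.exp(-0.5 * np.square(x) / var) / np.sqrt(2.0 * np.pi * var)


def bayes_factor_estimate(prior_h, skeleton_priors, samples, d):
    """Estimate B(h, h_1) from posterior samples drawn under the skeleton priors.

    samples[l] holds the draws from the posterior under skeleton prior l,
    and d[l] is the ratio m_{h_l} / m_{h_1} (so d[0] == 1).
    """
    n_s = [len(s) for s in samples]
    total = 0.0
    for theta in samples:
        denom = sum(n * p(theta) / ds
                    for n, p, ds in zip(n_s, skeleton_priors, d))
        total += float(np.sum(prior_h(theta) / denom))
    return total


def toy_example(h=3.0, skeleton=(0.5, 2.0, 8.0), y=1.0,
                n_per_chain=20_000, seed=0):
    """Hypothetical check: theta ~ N(0, h) prior, y | theta ~ N(theta, 1)."""
    rng = np.random.default_rng(seed)

    def marginal(var):
        # Closed-form marginal likelihood m_h(y) = N(y; 0, 1 + h).
        return normal_pdf(y, 1.0 + var)

    d = [marginal(hs) / marginal(skeleton[0]) for hs in skeleton]
    # Posterior under prior variance hs is N(y*hs/(1+hs), hs/(1+hs)).
    samples = [rng.normal(y * hs / (1.0 + hs), np.sqrt(hs / (1.0 + hs)),
                          n_per_chain) for hs in skeleton]
    priors = [lambda t, v=hs: normal_pdf(t, v) for hs in skeleton]
    estimate = bayes_factor_estimate(lambda t: normal_pdf(t, h),
                                     priors, samples, d)
    truth = marginal(h) / marginal(skeleton[0])
    return estimate, truth


estimate, truth = toy_example()
print(estimate, truth)
```

Only prior densities appear in `bayes_factor_estimate` because the likelihood cancels between numerator and denominator, which is why the same posterior samples can be reused for every value of `h`.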
In this chapter, we assume that $d$ is estimated via preliminary MCMC runs generated independently of the runs subsequently used to estimate $B(h, h_1)$. Hence the sampling will consist of the following two stages.

Stage 1 Generate samples $\theta_i^{(l)}$, $i = 1, \ldots, N_l$, from $\nu_{h_l,y}$, the posterior density of $\theta$ given $Y = y$, assuming that the prior is $\nu_{h_l}$, for each $l = 1, \ldots, k$, and use these $N = \sum_{l=1}^{k} N_l$ observations to form an estimate of $d$.

Stage 2 Independently of Stage 1, again generate samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, from $\nu_{h_l,y}$, for each $l = 1, \ldots, k$, and construct the estimate of the Bayes factor $B(h, h_1)$ based on this second set of $n = \sum_{l=1}^{k} n_l$ observations and the estimate of $d$ from Stage 1.

From now on, for $l = 1, \ldots, k$, we use the notations $A_l$ and $a_l$ to identify the ratios $N_l/N$ and $n_l/n$, respectively. It is natural to ask why it is necessary to have two stages of sampling, instead of estimating the vector $d$ and $B(h, h_1)$ from a single sample. The reason is that we are interested in estimating Bayes factors and posterior expectations for a very large number of values of $h$, and for each $h$, the computational time needed is linear in the

Step 3 Generate $\beta_0^{(i)}$ given $\gamma^{(i)}, \sigma^{2(i)}, Y$ according to the density

$$p(\beta_0 \mid \gamma, \sigma^2, Y) \propto \int p(Y \mid \gamma, \sigma^2, \beta_0, \beta_\gamma)\, p(\beta_\gamma \mid \gamma, \sigma^2)\, d\beta_\gamma \; p(\beta_0) \qquad \text{(B.2a)}$$

$$\propto \exp\left[-\frac{1}{2\sigma^2}(Y - 1_m\beta_0)'(Y - 1_m\beta_0)\right] \qquad \text{(B.2b)}$$

$$\propto N(\bar Y, \sigma^2/m).$$

Note that (B.2b) follows from (B.2a) because $1_m' X_\gamma = 0$, since the columns of $X_\gamma$ are centered.

Step 4 Generate $\beta_\gamma^{(i)}$ given $\gamma^{(i)}, \sigma^{2(i)}, \beta_0^{(i)}, Y$ according to the density

$$p(\beta_\gamma \mid \gamma, \sigma^2, \beta_0, Y) \propto p(Y \mid \gamma, \sigma^2, \beta_0, \beta_\gamma)\, p(\beta_\gamma \mid \gamma, \sigma^2) \propto \exp\left\{-\frac{1}{2\sigma^2}\left[(Y - 1_m\beta_0 - X_\gamma\beta_\gamma)'(Y - 1_m\beta_0 - X_\gamma\beta_\gamma) + \frac{1}{g}\,\beta_\gamma' X_\gamma' X_\gamma \beta_\gamma\right]\right\},$$

which can be shown to be a $q_\gamma$-dimensional normal with mean and covariance matrix given respectively by

$$\mu = \frac{g}{g+1}\,\hat\beta_\gamma \quad \text{and} \quad \Sigma = \frac{g}{g+1}\,\sigma^2 (X_\gamma' X_\gamma)^{-1},$$

where $\hat\beta_\gamma = (X_\gamma' X_\gamma)^{-1} X_\gamma' Y$ is the usual least squares estimate for model $\gamma$.

We now discuss the computational effort needed to implement our sampler. Consider generating the first component of $\gamma$. As seen in Step 1, the conditional distribution for this component is Bernoulli with success probability

$$P(\gamma_1 = 1 \mid \gamma_{j \neq 1}, Y) = \frac{p((1, \gamma_2, \ldots, \gamma_q) \mid Y)}{p((0, \gamma_2, \ldots, \gamma_q) \mid Y) + p((1, \gamma_2, \ldots, \gamma_q) \mid Y)},$$

with the expression for $p(\gamma \mid Y)$ given by (B.1). The other components of $\gamma$ can be in turn similarly generated, and then the other components of $\theta$ can be generated according to the conditional distributions from Steps 2-4. The main computational burden is in (i) forming $R_\gamma^2$, (ii) forming $\hat\beta_\gamma$, and (iii) generating from $N_{q_\gamma}(c_1 \hat\beta_\gamma, c_2 (X_\gamma' X_\gamma)^{-1})$, where $c_1$ and $c_2$ are constants. All of these ostensibly require calculation of $(X_\gamma' X_\gamma)^{-1}$, for which $O(q^3)$

where y(h) = Vg(I[f](h)B(h, hi), B(h, h,))' q (q\ Wl(h)' V(w[l(h), wo(h)) + /El w wo (h)' SVg(Ilf](h)B(h, hi), B(h, hi)), (A.57) with $\nabla g(u, v) = (1/v, -u/v^2)'$. □

and hence the null model has the strictly largest marginal likelihood among all models. Lemma 4.1 of Scott and Berger (2010) implies that, with a Zellner-Siow prior on $g$, $\hat w = 0$, while in our setup, the same data give $\hat w > 0$.

5.3 Examples

We illustrate our methods on two examples. The first is the U.S. crime data of Vandaele (1978), which can be found in the R library MASS under the name UScrime. We use this data set because it has been studied in several papers already so we can compare our results with previous analyses, and also because the number of variables is small enough to enable a closed-form calculation of the marginal likelihood $m_h$, so we can compare our estimates with the gold standard. The second data set is the ozone data originally analyzed by Breiman and Friedman (1985). We use this data set because it involves 44 variables, even though only a few of those are important, and we wanted to show how our methodology handles a data set with this character.

5.3.1 U.S. Crime Data

The data set gives, for each of $m = 47$ U.S. states, the crime rate, defined as number of offenses per 100,000 individuals (the response variable), and $q = 15$ predictors measuring different characteristics of the population, such as average number of years of schooling, average income, unemployment rate, etc. 
To be consistent with what is done in the literature, we applied a log transformation to all variables, except the indicator variable. We took the baseline hyperparameter to be $h_1 = (w_1, g_1) = (.5, 15)$, and our goal was to estimate $B(h, h_1)$ for the 924 values of $h$ obtained when $w$ ranges from 0.1 to 0.91 by increments of 0.03, and $g$ ranges from 4 to 100 by increments of 3. We used (2-13), and this estimate was based on 16 chains, each of length 10,000, corresponding to the skeleton grid of hyperparameter values

$$(w, g) \in \{.3, .5, .6, .8\} \times \{15, 50, 100, 225\} \qquad \text{(5-4)}$$

© 2010 Eugenia Buta

COMPUTATIONAL APPROACHES FOR EMPIRICAL BAYES METHODS AND BAYESIAN SENSITIVITY ANALYSIS

By EUGENIA BUTA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 2010

1 Establish a functional central limit theorem that says that $n^{1/2}(\hat B(h) - B(h))$ converges in distribution to a Gaussian process $\{W(h), h \in \mathcal{H}\}$.

2 Find the distribution of $\sup_{h \in \mathcal{H}} |W(h)|$. If $s_\alpha$ is the $(1 - \alpha)$-quantile of the distribution of this supremum, then the band $\hat B(h) \pm s_\alpha / n^{1/2}$ has asymptotic coverage probability equal to $1 - \alpha$. The value $s_\alpha$ is typically too difficult to compute analytically, but can be obtained by simulation [see, e.g. Burr and Doss (1993) among many others].

The maximal inequalities needed to establish functional central limit theorems typically require an iid structure, and for this reason we believe that the regeneration method offers the best hope for establishing such theorems.

3.2 Selection of the Skeleton Points

The asymptotic variances of any of our estimates depend on the choice of the points $h_1, \ldots, h_k$. For concreteness, consider $\hat B(h, h_1, \hat d)$, and to emphasize this dependence, let $V(h, h_1, \ldots, h_k)$ denote the asymptotic variance of $\hat B(h, h_1, \hat d)$. For fixed $h_1, \ldots, h_k$, identifying the set of $h$'s for which $V(h, h_1, \ldots, h_k)$ is finite is typically a feasible problem. 
For instance, Doss (1994) considered the pump data example discussed in Tierney (1994), for which the hyperparameter h has dimension 3, and determined this set for the case k = 1. He showed that one can go as far away from h1 as one wants in certain directions, but in other directions the range is limited. (The calculation can be extended to any k.) Suppose now that we fix a range '- over which h is to vary. Typically, we will want more than just a positioning of hi,..., hk that guarantee that V(h, hi,..., hk) is finite for all he 'c and we will face the problem below. Design Problem Find the values of hi,..., hk that minimize maxh6E V(h, hi,..., hk). Unfortunately, except for extremely simple cases, it is not possible to calculate V(h, hi,..., hk) analytically (even if k = 1, V(h, hi) is an infinite sum each of whose terms depends on the Markov transition distribution in a complicated way), and maximizing it over h E c- would present additional difficulties. Furthermore, even if where d is formed from Stage 1 runs, it is necessary to consider the quantity r2(h, u), defined as the asymptotic variance of n1 h()/us n1 k i h(O(,)) ni kS=1 IsVh I() s where u (ul, u2,..., Uk)'. After defining ()Vh (0) f( s= 1 as Vh, (0) /us we get rT2(h, u) = Var(f,(0('))) + 2 Cov(ful(00)), fu(Os )) g=1 We now proceed to establish continuity of r2(h, u) in u, and to do this we will show that for each I = 1,..., k, ,r(h, u) is continuous in u. For the rest of this discussion expectations and variances are taken with respect to Vh,,y, and we drop I from the notation. Let u(n) be any sequence of vectors such that u(n) d. Then trivially fu,) (0) fd(O)for all 0, and letting c = min{di, ..., dk}, there exists a positive integer n(c) such that I|u(n) dll < c for all n > n(c). 
Consequently, fu(n)(0) < f2d(0) = 2fd(0) for all 0 and all n > n(e) (3-2) and we can apply the Lebesgue Dominated Convergence Theorem twice to conclude that Var(fu(n)(01)) = E(fu2n)(01)) [E(fu(n)(01))]2 converges to E(fd2(81)) [E(fd(01))12 = Var(fd( 1)). Note that condition A2 guarantees that the dominating function in (3-2) has finite expectation. Similarly, for each of the covariance terms, Cov(futn)(01), fu(,,(Ol+,)) Cov(fd(l), fd(1+j)). Theorem 4 by /3if. For fixed j, j' e {1,... k}, consider the function 1 k n, )f(o)h ()/j ;1 [f(o ) )Jh()j G(u, v) n= v -/- v/ , n= 1 as iV0=1 ;,=1 1usshs Y() Usk1 asVh, ) us where u = (u2,..., uk)' with ui > 0, for / = 2,..., k, and v = (v, ..., k)'. Note that setting u = d and v = e gives G(,e) k ni G(d, e) =nZ Z=f) ZYI0' /=1 i=1 By the Mean Value Theorem, we know that there exists a (d*, e*) between (d, e) and (d, e) such that G(d, e)G(d, e) +VG(d*, e)' d "JR+l,+l r (- ,e -- + Op(l). As in previous proofs, with some calculations we can show that VG(d*, e*) = Op(1). Therefore G(d, e) Rlf] and since R[] is assumed invertible, we have n[(f[q)'f]]-1 (R[f)-1, where 2] is obtained from the matrix Zf] in (A.23) by replacing d and e with d and 8. The same reasoning extends to the case where = 0 orj' = 0. In a similar way, if we let *[f] denote the vector obtained from Y[f] in (A.24) by replacing d with d and we recall that v1] was defined to be the vector on the right side of (A.26), it can be proved that n Recalling that v.y := C= asvhs,y, we have E,,(Y(O)) = B(h, hi), where the subscript v.y to the expectation indicates that 0 vy. Also, forj = 2,..., k, let ) ()dJ -h() (2-9) zs)) = asVhS(0)/ds VhjIY ) a 'hS,,y(0) S ,y( (2-10) Es=1 ash,,y( ) Expression (2-10) shows that E,y(ZW)(0)) = 0. This is true even if the priors Vh, and Vh, are improper, as long as the posteriors vh,y and Vh,,y are proper, exactly our situation in the Bayesian variable selection example of Chapter 1. 
On the other hand, the representation (2-9) shows that ZO)(0) is computable if we know the d,'s-it involves the priors and not the posteriors. (A similar remark applies to (2-8).) Therefore, if as in Doss (2010) we define for 1, ..., k, i = 1,..., n1 Vh(l) ) ( 1) ) h (1))dj- Vh (0)) Yi, k=Z s ak j ()/d j 2. k Yk=1 as h, ( 1()) ds Y: i asVh,(s (1) Ids (2-11) then for any fixed 3 = (/32,... ,/3) Sk nk Y (Y,, = A=2 /Z, ), (2-12) /=1 ;=1 is an unbiased estimate of B(h, hi). The value of 3 that minimizes the variance of I is unknown. As is commonly done when one uses control variates, we use instead the estimate obtained by doing ordinary linear regression of the response Y,,i on the predictors ZT), j = 2,..., k, and to emphasize that this estimate depends on d, we denote it by /(d). Theorem 1 of Doss (2010) states that the estimator Breg(h, h) = /(d, obtained under the assumption that we know the constants d2,..., dk, has an asymptotically normal distribution. As mentioned earlier, d2 ..., dk are typically unknown, and must be estimated. Let d2,..., dk be estimates obtained from previous MCMC runs REFERENCES ANTONIAK, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2 1152-1174. ATHREYA, K. B., Doss, H. and SETHURAMAN, J. (1996). On the convergence of the Markov chain simulation method. The Annals of Statistics, 24 69-100. BARBIERI, M. M. and BERGER, J. O. (2004). Optimal predictive model selection. The Annals of Statistics, 32 870-897. BREIMAN, L. and FRIEDMAN, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80 580-598. BURR, D. and Doss, H. (1993). Confidence bands for the median survival time as a function of the covariates in the Cox model. Journal of the American Statistical Association, 88 1330-1340. BURR, D. and Doss, H. (2005). A Bayesian semiparametric model for random-effects meta-analysis. 
Journal of the American Statistical Association, 100 242-251. CASELLA, G. and MORENO, E. (2006). Objective Bayesian variable selection. Journal of the American Statistical Association, 101 157-167. CLYDE, M., DESIMONE, H. and PARMIGIANI, G. (1996). Prediction via orthogonalized model mixing. Journal of the American Statistical Association, 91 1197-1208. CLYDE, M., GHOSH, J. and LITTMAN, M. (2009). Bayesian adaptive sampling for variable selection and model averaging. Discussion Paper 2009-16, Duke University Department of Statistical Science. COGBURN, R. (1972). The central limit theorem for Markov processes. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2. University of California Press, Berkeley, 485-512. CuI, W. and GEORGE, E. (2008). Empirical Bayes vs. fully Bayes variable selection. Journal of Statistical Planning and Inference, 138 888-900. Doss, H. (1994). Comment on "Markov chains for exploring posterior distributions" by Luke Tierney. The Annals of Statistics, 22 1728-1734. Doss, H. (2007). Bayesian model selection: Some thoughts on future directions. Statistica Sinica, 17 413-421. Doss, H. (2010). Estimation of large families of Bayes factors from Markov chain output. Statistica Sinica, 20 537-560. DURRETT, R. (1991). Probability: Theory and Examples. Brooks/Cole Publishing Co. 100 Proof of Theorem 6 Let 1 k ( L n1 r YJk= 1 Zid /Y I1 ,, )= pJ Zi i where the superscripts d, e indicate the values of d and e used when computing Y's and Z's, while the subscripts indicate the coefficients of Z's. With lf] as in the proof of Theorem 4 we now write v^ ,, aI>,/")[ = ^_d,,^, ), SS a )3(d),,4[f](d)) /=1 (A.52) S 1(d),,4 -(d)). /=1 Note that the second quantity on the right side of (A.52), which involves only known d and e, was shown to be asymptotically normal with mean 0 and variance F[ ] in the proof of Theorem 4 [see (A.32)]. 
Now let us expand the first term on the right side of (A.52) by writing v (d, ) (,+(d) (ad e /+ /V3/ -ad, ) [ []3 1[f],3 f' /3,Im , +rdne ( n \. r -1d, (A.53) We next proceed as follows: 1. We note that the third term on the right side of (A.53) was shown to converge to 0 in probability in the proof of Theorem 4. 2. We show that the first term on the right side of (A.53) also converges to 0 in probability. 3. We show that the second term on the right side of (A.53) is asymptotically normal. To deal with the second step, as in the proof of Theorem 1, first we show that J3 (d) and [3f (d) converge in probability to the same limit, which we denoted in the proof of improved estimator for the posterior expectation I[](h), which is given by '[ (y -z, I ,3 -= (2-18) 2I= n (Y2i= 1 kj =2 zJi, j Theorem 4 Suppose conditions A 1 and A2 stated in Theorem 1 are satisfied and the matrix R defined in Theorem 2 is nonsingular. Also, suppose that A3 for each I = 1,..., k, there exists e > 0 such that E( Yf] 2 6) < oo; A4 for each I = 1,..., k, there exists e > 0 such that E,, (If l2+ (0)) < o; A5 for each I = 1,..., k, Eh f2(0) =Vh ( - E"sY1 as, vh, (0)/ds A6 the (k + 1) x (k + 1) matrix Rf defined by R[f] _Z ]k 1 a [f (j ') k J+llj+l = a iZ 1()Z ,/ )' J) 0,..., k, is nonsingular. Then S(il(h)) (0, r(h)), where r(h) is given in equation (A.33) of the Appendix. Remarks 1. If h = h for some e {1,..., k}, our estimator of posterior expectation I, [f] given above in (2-18) has zero variance. To see why, note that in this case the response Y[f] can be written as y[f] = dj/[f(hj) + djZ[fl+), so there is no noise in the regression of Y[f] on predictors Z[lf)'s, and as a consequence, the numerator of this estimator is constant (specifically, ndj/l[l(hj)). Through similar arguments, the denominator was shown to be constant (nd,) in Doss (2010). Hence, for h = hj, 71,,I[ is a perfect estimator of I[f](hj). 2. 
Theorem 4 pertains to the case where $d$ and the posterior expectations $I^{[f]}(h_j)$, $j = 1, \ldots, k$, are known. There do exist some situations where this is the case. For example, in the hierarchical model for Bayesian linear regression discussed in

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
2 ESTIMATION OF BAYES FACTORS AND POSTERIOR EXPECTATIONS
  2.1 Estimation of Bayes Factors
  2.2 Estimation of Bayes Factors Using Control Variates
  2.3 Estimation of Posterior Expectations
  2.4 Estimation of Posterior Expectations Using Control Variates
  2.5 Estimation of Posterior Expectations Using Control Variates With Estimated Skeleton Bayes Factors and Expectations
3 VARIANCE ESTIMATION AND SELECTION OF THE SKELETON POINTS
  3.1 Estimation of the Variance
  3.2 Selection of the Skeleton Points
4 REVIEW OF PREVIOUS WORK
5 ILLUSTRATION ON VARIABLE SELECTION
  5.1 A Markov Chain for Estimating the Posterior Distribution of Model Parameters
  5.2 Choice of the Hyperparameter
  5.3 Examples
    5.3.1 U.S. Crime Data
    5.3.2 Ozone Data
6 DISCUSSION

APPENDIX
A PROOF OF RESULTS FROM CHAPTER 1
B DETAILS REGARDING GENERATION OF THE MARKOV CHAIN FROM CHAPTER 5

Our proof is organized as follows:
* We note that the third term on the right side of (A.7) was shown to converge to 0 in probability by Doss (2010).
* We will show that the first term on the right side of (A.7) also converges to 0 in probability.
* The second term on the right side of (A.7) involves randomness from both Stage 1 and Stage 2.
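However, we will show that the randomness from Stage 2 is asymptotically negligible, and that this term is asymptotically equivalent to an expression of the form $w(h)'(\hat d - d)$, where $w(h)$ is a deterministic vector. This will show that the second term is asymptotically normal.

Now we prove that the first term on the right side of (A.7) is $o_p(1)$, and to do this we begin by showing that $\hat\beta(\hat d)$ and $\hat\beta(d)$ converge in probability to the same limit. Let $Z$ be the $n \times k$ matrix whose transpose is

$$Z' = \begin{pmatrix} 1 & \cdots & 1 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \\ Z^{(2)}_{1,1} & \cdots & Z^{(2)}_{n_1,1} & Z^{(2)}_{1,2} & \cdots & Z^{(2)}_{n_2,2} & \cdots & Z^{(2)}_{1,k} & \cdots & Z^{(2)}_{n_k,k} \\ \vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\ Z^{(k)}_{1,1} & \cdots & Z^{(k)}_{n_1,1} & Z^{(k)}_{1,2} & \cdots & Z^{(k)}_{n_2,2} & \cdots & Z^{(k)}_{1,k} & \cdots & Z^{(k)}_{n_k,k} \end{pmatrix} \qquad \text{(A.8)}$$

and let $Y$ be the vector

$$Y = (Y_{1,1}, \ldots, Y_{n_1,1}, Y_{1,2}, \ldots, Y_{n_2,2}, \ldots, Y_{1,k}, \ldots, Y_{n_k,k})'. \qquad \text{(A.9)}$$

Let $\hat Z$ be the $n \times k$ matrix corresponding to $Z$ when we replace $d$ by $\hat d$. Similarly, $\hat Y$ is like $Y$, but using $\hat d$ for $d$. For fixed $j, j' \in \{2, \ldots, k\}$, consider the function

$$G(u) = \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{\nu_{h_j}(\theta_i^{(l)})/u_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/u_s} \cdot \frac{\nu_{h_{j'}}(\theta_i^{(l)})/u_{j'} - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/u_s}, \qquad \text{(A.10)}$$

where $u = (u_2, \ldots, u_k)'$ and $u_l > 0$ for $l = 2, \ldots, k$. (On the right side of (A.10), $u_1$ is taken to be 1.) Note that setting $u = d$ gives $G(d) = \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} Z^{(j)}_{i,l} Z^{(j')}_{i,l}$. By the Mean Value

CHAPTER 2
ESTIMATION OF BAYES FACTORS AND POSTERIOR EXPECTATIONS

Let $\nu_{h,y}$ denote the posterior density of $\theta$ given $Y = y$ when the prior is $\nu_h$. Suppose we have a sample $\theta_1, \ldots, \theta_n$ (iid or ergodic Markov chain output) from the posterior density $\nu_{h_1,y}$ for a fixed $h_1$ and we are interested in the posterior expectation $E_h(f(\theta) \mid Y = y) = \int f(\theta)\,\nu_{h,y}(\theta)\,d\theta$ for different values of the hyperparameter $h$. We may write $\int f(\theta)\,\nu_{h,y}(\theta)\,d\theta$ as

$$\int f(\theta)\,\frac{\ell_\theta(y)\,\nu_h(\theta)/m_h}{\ell_\theta(y)\,\nu_{h_1}(\theta)/m_{h_1}}\,\nu_{h_1,y}(\theta)\,d\theta = \frac{m_{h_1}}{m_h} \int f(\theta)\,\frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\,\nu_{h_1,y}(\theta)\,d\theta \qquad \text{(2-1a)}$$

$$= \frac{\dfrac{m_{h_1}}{m_h} \displaystyle\int f(\theta)\,\frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\,\nu_{h_1,y}(\theta)\,d\theta}{\dfrac{m_{h_1}}{m_h} \displaystyle\int \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\,\nu_{h_1,y}(\theta)\,d\theta} \qquad \text{(2-1b)}$$

$$= \frac{\displaystyle\int f(\theta)\,\frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\,\nu_{h_1,y}(\theta)\,d\theta}{\displaystyle\int \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\,\nu_{h_1,y}(\theta)\,d\theta}, \qquad \text{(2-1c)}$$

where in (2-1b) we have used the fact that the integral in the denominator is just 1, in order to cancel the unknown constant $m_{h_1}/m_h$ in (2-1c).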
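The ratio in (2-1c) needs only the prior ratio $\nu_h/\nu_{h_1}$ evaluated at draws from $\nu_{h_1,y}$, so it can be computed as a self-normalized importance sampling average. The following is a minimal sketch, not the author's code: the function name `reweighted_posterior_mean` and the log-scale inputs are assumptions made for illustration, and the weights are normalized on the log scale purely for numerical stability.

```python
import numpy as np

def reweighted_posterior_mean(f_vals, log_prior_h, log_prior_h1):
    """Self-normalized importance sampling estimate of E_h[f(theta) | Y].

    f_vals       : f(theta_i) at draws theta_i from the posterior under h_1
    log_prior_h  : log nu_h(theta_i)      (hypothetical input, log scale)
    log_prior_h1 : log nu_{h_1}(theta_i)  (hypothetical input, log scale)
    Returns the weighted average and the normalized weights.
    """
    f_vals = np.asarray(f_vals, dtype=float)
    log_ratio = (np.asarray(log_prior_h, dtype=float)
                 - np.asarray(log_prior_h1, dtype=float))
    # Subtracting the maximum leaves the normalized weights unchanged
    # but avoids overflow in exp().
    log_ratio = log_ratio - log_ratio.max()
    w = np.exp(log_ratio)
    w = w / w.sum()
    return float(np.dot(w, f_vals)), w

# Toy usage: two draws, prior ratios 3 and 1, so the weights are 0.75 and 0.25.
est, w = reweighted_posterior_mean([0.0, 1.0], [np.log(3.0), 0.0], [0.0, 0.0])
print(est, w)
```

When a few weights dominate the sum, the estimate is effectively based on very few draws, which is one way to see the instability that arises when $h$ is far from $h_1$.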
The idea to express $\int f(\theta)\,\nu_{h,y}(\theta)\,d\theta$ in this way was proposed in a different context by Hastings (1970). Expression (2-1c) is the ratio of two integrals with respect to $\nu_{h_1,y}$, each of which may be estimated from the sequence $\theta_1, \ldots, \theta_n$. We may estimate the numerator and the denominator by

$$\frac{1}{n} \sum_{i=1}^{n} f(\theta_i)\,[\nu_h(\theta_i)/\nu_{h_1}(\theta_i)] \quad \text{and} \quad \frac{1}{n} \sum_{i=1}^{n} [\nu_h(\theta_i)/\nu_{h_1}(\theta_i)], \qquad \text{(2-2)}$$

respectively. Thus, if we let

$$w_i^{(h)} = \frac{\nu_h(\theta_i)/\nu_{h_1}(\theta_i)}{\sum_{e=1}^{n} [\nu_h(\theta_e)/\nu_{h_1}(\theta_e)]},$$

then these are weights, and we see that the desired integral may be estimated by the weighted average $\sum_{i=1}^{n} f(\theta_i)\,w_i^{(h)}$

by solving the system

$$d_r = \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{q_r(X_{li})}{\sum_{s=1}^{k} n_s q_s(X_{li})/d_s}, \qquad r = 1, \ldots, k, \qquad \text{(4-3)}$$

which is easily seen to be identical to the system (4-2) of Gill et al. (1988).

Tan (2004) shows how control variates can be incorporated in the likelihood framework of Kong et al. (2003). When there are $r$ functions $H_j$, $j = 1, \ldots, r$, for which we know that $\int H_j\,d\mu = 0$, the parameter space is restricted to the set of all sigma-finite measures satisfying these $r$ constraints. For the case where $X_{li}$, $i = 1, \ldots, n_l$, are iid for each $l = 1, \ldots, k$, he obtains the maximum likelihood estimate of $\mu$ in this reduced parameter space, and therefore corresponding estimates of $d$ and $m_h/m_{h_1}$, and shows that this approach gives estimates that are asymptotically equivalent to estimates that use control variates via regression. He also obtains results on asymptotic normality of his estimators that are valid when we have the iid structure.

The estimates of $d$ in Gill et al. (1988), Geyer (1994), Meng and Wong (1996), and Kong et al. (2003) are all equivalent. Theorem 1 of Tan (2004) establishes asymptotic optimality of this estimate under the iid assumption. When the samples are Markov chain draws, the asymptotically optimal estimate is essentially impossible to obtain (Romero 2003).
But the estimate derived under the iid assumption can still be used in the Markov chain setting if one can develop asymptotic results that are valid in the Markov chain case, and this is done by Geyer (1994), whose results we use in all our theorems.

Figure 5-5. 95% confidence intervals of the posterior inclusion probabilities for the 44 predictors in the ozone data when the hyperparameter value is given by w = .13 and g = 75. A table giving the correspondence between the integers 1-44 and the predictors is given in Appendix D.

VARDI, Y. (1985). Empirical distributions in selection bias models. The Annals of Statistics, 13 178-203.
ZELLNER, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (P. K. Goel and A. Zellner, eds.). Elsevier, New York, 233-243.
ZELLNER, A. and SIOW, A. (1980). Posterior odds ratios for selected regression hypotheses. In Bayesian Statistics: Proceedings of the First International Meeting held in Valencia (Spain) (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.). Valencia: University Press, 585-603.

Here is how the regenerative method is used to get valid asymptotic standard errors. Suppose we wish to approximate the posterior expectation of some function $f(X)$. Suppose further that the Markov chain is to be run for $R$ regenerations (or tours); that is, we begin by drawing the starting value from d and we stop the simulation the $R$th time that a $\delta_n = 1$. Let $0 = \tau_0 < \tau_1 < \tau_2 < \cdots < \tau_R$ be the (random) regeneration times, i.e. $\tau_t = \min\{n > \tau_{t-1} : \delta_{n-1} = 1\}$ for $t \in \{1, 2, \ldots, R\}$. The total length of the simulation, $\tau_R$, is random. Let $N_1, N_2, \ldots, N_R$ be the lengths of the tours, i.e. $N_t = \tau_t - \tau_{t-1}$, and define $S_t = \sum_{n=\tau_{t-1}}^{\tau_t - 1} f(X_n)$, $t = 1, \ldots, R$.
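Given the tour lengths $N_t$ and tour sums $S_t$ just defined, the usual regenerative ratio estimator is $\sum_t S_t / \sum_t N_t$, and its variance can be estimated from the iid pairs $(N_t, S_t)$. The following is a minimal sketch under stated assumptions: the function name `regenerative_estimate` and the flag-array input are hypothetical, the chain is assumed to start at a regeneration, and any draws after the last regeneration are discarded.

```python
import numpy as np

def regenerative_estimate(f_vals, regen_flags):
    """Ratio estimator and regenerative standard error from one chain.

    f_vals[n]      : f(X_n) for n = 0, 1, ...
    regen_flags[n] : 1 if delta_n = 1 (a new tour starts at time n + 1), else 0
    Returns (estimate, standard_error, number_of_tours).
    """
    f_vals = np.asarray(f_vals, dtype=float)
    flags = np.asarray(regen_flags)
    ends = np.flatnonzero(flags == 1) + 1        # tau_1, ..., tau_R
    if ends.size == 0:
        raise ValueError("no completed tours in the supplied output")
    starts = np.concatenate(([0], ends[:-1]))    # tau_0, ..., tau_{R-1}
    N = ends - starts                            # tour lengths N_t
    S = np.array([f_vals[a:b].sum() for a, b in zip(starts, ends)])  # tour sums S_t
    R = N.size
    f_bar = S.sum() / N.sum()                    # equals S-bar / N-bar
    # Estimated asymptotic variance of sqrt(R) * (f_bar - target).
    var_hat = np.sum((S - f_bar * N) ** 2) / (R * N.mean() ** 2)
    return float(f_bar), float(np.sqrt(var_hat / R)), int(R)

# Toy usage with three tours of lengths 2, 3 and 1.
print(regenerative_estimate([1, 2, 3, 4, 5, 6], [0, 1, 0, 0, 1, 1]))
```

Because each chain is handled separately, the same routine can be applied chain by chain in the multiple-chain setting, with no need for a common set of regeneration times.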
Note that the $(N_t, S_t)$ pairs are iid, and a strongly consistent estimator of the posterior expectation of $f(X)$ is $\bar f_{\tau_R} = \bar S / \bar N = (1/\tau_R) \sum_{n=0}^{\tau_R - 1} f(X_n)$, where $\bar S = (1/R) \sum_{t=1}^{R} S_t$ and $\bar N = (1/R) \sum_{t=1}^{R} N_t$, and the asymptotic variance of $\bar f_{\tau_R}$ may be estimated very simply by $\sum_{t=1}^{R} (S_t - \bar f_{\tau_R} N_t)^2 / (R \bar N^2)$. Moment and ergodicity conditions that guarantee strong consistency of this variance estimator are given in Hobert et al. (2002). This method has recently been applied successfully in a number of problems involving continuous state spaces; see, e.g., Tan and Hobert (2009) and the references therein, and we use the method in the illustration in Chapter 5.

In our framework of multiple chains, one might think that we need to identify a sequence of times $0 = \tau_0 < \tau_1 < \tau_2 < \cdots < \tau_R$ at which all the chains regenerate. This is not the case, and we need only identify, for each chain, a sequence of regeneration times for that chain. Since the overall estimate is essentially a function of averages involving the $k$ chains, its asymptotic variance is a function of the asymptotic variances of averages formed from the individual chains.

Consider the function $B(h, h_1)$, $h \in \mathcal{H}$, and an estimator, such as $\hat B(h, h_1, \hat d)$ (for the rest of this discussion, we will denote these by $B(h)$ and $\hat B(h)$, for brevity). It is of interest to provide a confidence band (region, if $h$ is multidimensional) for $B(h)$ that is valid simultaneously for all $h \in \mathcal{H}$. A closely related problem is to produce a confidence interval for $\arg\max_{h \in \mathcal{H}} B(h)$. The traditional way of forming confidence bands that are valid globally is to proceed as follows:

Proof of Theorem 4

Here $Z$ and $Y$ represent the matrix and vector, respectively, previously defined in (A.8) and (A.9). In addition, let $Z^{[f]}$ denote the $n \times (k + 1)$ matrix with transpose

$$(Z^{[f]})' = \begin{pmatrix} 1 & \cdots & 1 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \\ Z^{[f](1)}_{1,1} & \cdots & Z^{[f](1)}_{n_1,1} & Z^{[f](1)}_{1,2} & \cdots & Z^{[f](1)}_{n_2,2} & \cdots & Z^{[f](1)}_{1,k} & \cdots & Z^{[f](1)}_{n_k,k} \\ Z^{[f](2)}_{1,1} & \cdots & Z^{[f](2)}_{n_1,1} & Z^{[f](2)}_{1,2} & \cdots & Z^{[f](2)}_{n_2,2} & \cdots & Z^{[f](2)}_{1,k} & \cdots & Z^{[f](2)}_{n_k,k} \\ \vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\ Z^{[f](k)}_{1,1} & \cdots & Z^{[f](k)}_{n_1,1} & Z^{[f](k)}_{1,2} & \cdots & Z^{[f](k)}_{n_2,2} & \cdots & Z^{[f](k)}_{1,k} & \cdots & Z^{[f](k)}_{n_k,k} \end{pmatrix} \qquad \text{(A.23)}$$

and let $Y^{[f]}$ be the vector

$$Y^{[f]} = (Y^{[f]}_{1,1}, \ldots, Y^{[f]}_{n_1,1}, Y^{[f]}_{1,2}, \ldots, Y^{[f]}_{n_2,2}, \ldots, Y^{[f]}_{1,k}, \ldots, Y^{[f]}_{n_k,k})'. \qquad \text{(A.24)}$$

We know from Doss (2010) that the least squares estimate when $Y$ is regressed on $Z$, denoted by $(\hat\beta_0, \hat\beta_2, \ldots, \hat\beta_k) =: (\hat\beta_0, \hat\beta)$, converges almost surely to $(\beta_{0,\lim}, \beta_{\lim}) = R^{-1} v$. In a similar way, we will show here that the least squares estimate when $Y^{[f]}$ is regressed on $Z^{[f]}$, $(\hat\beta_0^{[f]}, \hat\beta^{[f]}) = (\hat\beta_0^{[f]}, \hat\beta_1^{[f]}, \ldots, \hat\beta_k^{[f]})$, converges almost surely to a vector $(\beta_{0,\lim}^{[f]}, \beta_{\lim}^{[f]})$. Note that, under the assumption that $[(Z^{[f]})' Z^{[f]}]^{-1}$ exists,

$$(\hat\beta_0^{[f]}, \hat\beta^{[f]}) = n\,[(Z^{[f]})' Z^{[f]}]^{-1}\,\frac{(Z^{[f]})' Y^{[f]}}{n}.$$

Since A4 is satisfied, we have

$$\frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} Z^{[f](j)}_{i,l} Z^{[f](j')}_{i,l} = \sum_{l=1}^{k} \frac{n_l}{n} \cdot \frac{1}{n_l} \sum_{i=1}^{n_l} Z^{[f](j)}_{i,l} Z^{[f](j')}_{i,l} \xrightarrow{\text{a.s.}} R^{[f]}_{j+1,j'+1}, \qquad j, j' = 0, \ldots, k,$$

and hence $(Z^{[f]})' Z^{[f]}/n \to R^{[f]}$ almost surely. Therefore by A6, with probability one $(Z^{[f]})' Z^{[f]}$ is nonsingular for large $n$, and furthermore

$$n\,[(Z^{[f]})' Z^{[f]}]^{-1} \xrightarrow{\text{a.s.}} (R^{[f]})^{-1}. \qquad \text{(A.25)}$$

$V(\gamma^{(0)}, \cdot)$ be the Markov transition function corresponding to the Gibbs sampler in Smith and Kohn (1996), i.e. $V(\gamma^{(0)}, \cdot)$ is the distribution of $\gamma^{(1)}$ given that the current state is $\gamma^{(0)}$, and let $v(\gamma^{(0)}, \gamma) = V(\gamma^{(0)}, \{\gamma\})$ be the corresponding probability mass function. Suppose the current state is $\theta^{(i)} = (\gamma^{(i)}, \sigma^{(i)}, \beta_0^{(i)}, \beta_\gamma^{(i)})$. We proceed as follows.

1. We update $\gamma^{(i)}$ to $\gamma^{(i+1)}$ using $V(\gamma^{(i)}, \cdot)$. The generation of $\gamma^{(i+1)}$ does not involve $(\sigma^{(i)}, \beta_0^{(i)}, \beta_\gamma^{(i)})$.
2. We generate $\sigma^{(i+1)}$ from the conditional distribution of $\sigma$ given $\gamma = \gamma^{(i+1)}$ and the data.
3. We generate $\beta_0^{(i+1)}$ from the conditional distribution of $\beta_0$ given $\gamma = \gamma^{(i+1)}$, $\sigma = \sigma^{(i+1)}$, and the data.
4. We generate $\beta_\gamma^{(i+1)}$ from the conditional distribution of $\beta_\gamma$ given $\gamma = \gamma^{(i+1)}$, $\sigma = \sigma^{(i+1)}$, $\beta_0 = \beta_0^{(i+1)}$, and the data.

The details describing the distributions involved and the computations needed are given in Appendix B. The algorithm above gives a sequence $\theta^{(1)}, \theta^{(2)}, \ldots$, and it is easy to see that this sequence is a Markov chain. As Markov chains on the $\gamma$ sequence, the relative performance of the Gibbs sampler vs.
SS(2) depends, in part, on $m$, $q$, $h$, and the data set itself, and neither algorithm is uniformly superior to the other. In principle, in Step 1 of our algorithm we can use any Markov transition function that generates a chain on $\gamma$, including SS(2). We chose to work with the Gibbs sampler because it is easier to develop a regeneration scheme for this chain than for the other chains.

The output of the chain can be used in several ways. An obvious way is to use the highest posterior probability model (HPM). Unfortunately, when $q$ is bigger than around 20, the number of models, $2^q$, is very large, and it may happen that no single model has appreciable probability, and in any case, it is very difficult or impossible to identify the HPM from the Markov chain output. Barbieri and Berger (2004) argue in favor of the median probability model (MPM), which is defined to be the model that includes all variables $j$ for which the marginal inclusion probability $P(\gamma_j = 1 \mid Y) > 1/2$. We mention

models in Bayesian analyses. For selecting models that are better than others from the family of models indexed by $h \in \mathcal{H}$, our strategy will be to compute and subsequently compare all the Bayes factors $B(h, h_1)$, for all $h \in \mathcal{H}$ and a fixed hyperparameter value $h_1$. We could then consider as good candidate models those with values of $h$ that result in the largest Bayes factors.

Suppose now that we fix a particular function $f$ of the parameter $\theta$; for instance, in the example, this might be the indicator that variable 1 is included in the regression model. It is of general interest to determine the posterior expectation $E_h(f(\theta) \mid Y)$ as a function of $h$ and to determine whether or not $E_h(f(\theta) \mid Y)$ is very sensitive to the value of $h$. If it is not, then two individuals using two different hyperparameters will reach approximately the same conclusions and the analysis will not be controversial.
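The median probability model mentioned above is simple to read off from chain output, since the marginal inclusion probability of variable $j$ is estimated by the fraction of draws in which $\gamma_j = 1$. The following is a minimal sketch, assuming the draws of $\gamma$ are stored as the rows of a 0/1 array; the function name `median_probability_model` is hypothetical, and the draws are treated as coming from the posterior for a single fixed $h$ (no reweighting).

```python
import numpy as np

def median_probability_model(gamma_draws, threshold=0.5):
    """Estimated inclusion probabilities and the indices of the MPM variables.

    gamma_draws : (n_draws, q) array of 0/1 inclusion indicators from the chain
    Returns (inclusion_probabilities, indices with probability > threshold).
    """
    gamma_draws = np.asarray(gamma_draws, dtype=float)
    incl = gamma_draws.mean(axis=0)            # estimate of P(gamma_j = 1 | Y)
    selected = np.flatnonzero(incl > threshold)
    return incl, selected

# Toy usage: four draws on q = 3 variables.
draws = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
incl, selected = median_probability_model(draws)
print(incl, selected)
```

Applying the same function with importance weights in place of the plain average would give inclusion probabilities as a function of $h$, which is the kind of sensitivity question raised in the paragraph above.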
On the other hand, if for a function of interest the posterior expectation varies considerably as we change the hyperparameter, then we will want to know which aspects of the hyperparameter (e.g. which components of $h$) produce big changes and we may want to see a plot of the posterior expectations as we vary those aspects of the hyperparameter.

Except for extremely simple cases, posterior expectations cannot be obtained in closed form, and are typically estimated via Markov chain Monte Carlo (MCMC). It is slow and inefficient to run Markov chains for every hyperparameter value $h$. Chapter 2 reviews an existing method for estimating $E_h(f(\theta) \mid Y)$ that bypasses the need to run a separate Markov chain for every $h$. The method has an analogue for the problem of estimating Bayes factors. Unfortunately, the method has severe limitations, which we also discuss.

The purpose of this work is to introduce a methodology for dealing with the sensitivity analysis and model selection issues discussed above. The basic idea is, not surprisingly, to use Markov chains corresponding to a few values of the hyperparameter in order to estimate $E_h(f(\theta) \mid Y)$ for all $h \in \mathcal{H}$ and also the Bayes factors $B(h, h_1)$ for all $h \in \mathcal{H}$, and this is done through importance sampling. The difficulty we face is that there

Fully Bayes (FB) Methods

The most common prior on $g$ is the Zellner and Siow (1980) prior, an inverse-gamma which results in a multivariate Cauchy prior for $\beta$. The family of "hyper-g" priors is introduced by Cui and George (2008) and developed further by Liang et al. (2008), who show that these have several desirable properties. In particular, they do not suffer from the information paradox, and they exhibit important consistency properties.

Both the EB methods and FB methods have their own advantages and disadvantages. Cui and George (2008) give evidence that EB methods outperform FB methods. This is based on extensive simulation studies in cases where numerical methods are feasible.
Also, FB methods require one to specify hyperparameters of the prior on the hyperparameter $h$, and different choices lead to different inferences. Additionally, in EB methods, one uses a model with a single value of $h$, and the resulting inference is more parsimonious and interpretable. On the other hand, as with many likelihood-based methods, special care needs to be taken when the maximizing value is at the boundary. When we use the EB method, if the maximizing value of $w$ is 0 or 1, the posterior assigns probability one to the null model or full model (model that includes all variables), respectively. This is similar to the very simple situation in which we have $X \sim \text{binomial}(n, p)$: if we observe $X = 0$, then not only is the maximum likelihood estimate of $p$ equal to 0, but the associated standard error estimate is also 0, and the naive Wald-type confidence interval for $p$ is the singleton $\{0\}$. Of course in this simple case there exist modifications to the maximum likelihood estimate $\hat p = X/n$ which yield procedures that do not give rise to this degeneracy. How to develop corresponding modifications to the maximum likelihood estimate of the Bernoulli parameter $w$ in the present context is a problem that is much more difficult, but certainly worthy of investigation.

Scott and Berger (2010) consider the same model for variable selection that we consider here, i.e. model (1-1), but with a Zellner-Siow prior on $g$, and the remaining

CHAPTER 3
VARIANCE ESTIMATION AND SELECTION OF THE SKELETON POINTS

Estimation of the variance of our estimates is important for several reasons. In addition to the usual need for providing error margins for our point estimates, variance estimates are of great help in selecting the skeleton points.

3.1 Estimation of the Variance

There are two approaches one can use to estimate the variance of any of our estimates.
For the sake of concreteness, consider $\hat B(h, h_1, \hat d)$, whose asymptotic variance is the expression $q\,c(h)'\Sigma\,c(h) + \tau^2(h)$ (see Theorem 1).

Spectral Methods

If $X_0, X_1, X_2, \ldots$ is a Markov chain and $f$ is a function, the asymptotic variance of $(1/n) \sum_{i=0}^{n-1} f(X_i)$ (when it exists) is the infinite series

$$\text{Var}(f(X_0)) + 2 \sum_{j=1}^{\infty} \text{Cov}(f(X_0), f(X_j)) \qquad \text{(3-1)}$$

where the variances and covariances are calculated under the assumption that $X_0$ has the stationary distribution. Spectral methods involve estimating an initial segment of the series, using techniques from time series; see Geyer (1992) for a review. Our problem is more complicated because we are dealing with multiple chains. In our situation, the term $\tau^2(h)$ may be estimated through spectral methods, and this is done in a straightforward manner. We now give technical details regarding the consistency of this method. The quantity $\tau^2(h)$ is given by $\tau^2(h) = \sum_{l=1}^{k} a_l \tau_l^2(h)$, where $\tau_l^2(h)$ is the asymptotic variance of

$$\frac{1}{n_l} \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s}.$$

(See equation (A.9) of Doss (2010).) Because for each $l$ we will be estimating $\tau_l^2(h)$ by the asymptotic variance of

$$\frac{1}{n_l} \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/\hat d_s}$$

requires neither matrix inversion nor calculation of a determinant, so can be done very quickly. Note that in view of (5-3), it is not enough to have Markov chains running on the $\gamma$'s and we need Markov chains running on the $\theta$'s (or at least $(\gamma, \sigma, \beta_0)$).

5.2 Choice of the Hyperparameter

As mentioned earlier, regarding $w$, the proposals in the literature are quite simple: either $w$ is fixed at 1/2, or a beta prior is put on $w$. The discussion below focuses primarily on $g$, for which there is an extensive literature, and we now summarize the portion of this literature that is directly relevant to the present work. Broadly speaking, recommendations regarding $g$ can be divided into three categories:

Data-Independent Choices

In the simple case where the setup is given by (1-1) but without (1-1c), i.e.
the true model $\gamma$ is assumed known, the posterior distribution of $\beta_\gamma$ given $\sigma$ is $N\big((g/(g+1))\,\hat\beta_\gamma,\ (g/(g+1))\,\sigma^2 (X_\gamma' X_\gamma)^{-1}\big)$, where $\hat\beta_\gamma$ is the usual least squares estimate of $\beta_\gamma$. If $q$ is fixed and $m \to \infty$, under standard conditions $X_\gamma' X_\gamma / m \to \Sigma$, where $\Sigma$ is a positive definite matrix; therefore if $g$ is fixed, this distribution is approximately a point mass at $(g/(g+1))\,\hat\beta_\gamma$, so the posterior is not even consistent, and we see that a necessary condition for consistency is that $g \to \infty$. Data-independent choices of $g$ include Kass and Wasserman's (1995) recommendation of $g = m$, and Fernandez et al.'s (2001) recommendation of $g = \max(m, q^2)$, following up on Foster and George's (1994) earlier recommendation of $g = q^2$. Liang et al. (2008) argue that, in general, data-independent choices of $g$ have the following undesirable property, referred to as the "Information Paradox." When the data give overwhelming evidence in favor of model $\gamma$ (e.g. $\|\hat\beta_\gamma\| \to \infty$), then using $\gamma_0$ to denote the null model (i.e. the model that includes only the intercept), the ratio of posterior probabilities $p(\gamma \mid Y)/p(\gamma_0 \mid Y)$ does not tend to infinity.

FERNANDEZ, C., LEY, E. and STEEL, M. F. J. (2001). Benchmark priors for Bayesian model averaging. Journal of Econometrics, 100 381-427.
FOSTER, D. P. and GEORGE, E. I. (1994). The risk inflation criterion for multiple regression. The Annals of Statistics, 22 1947-1975.
GEORGE, E. I. and FOSTER, D. P. (2000). Calibration and empirical Bayes variable selection. Biometrika, 87 731-747.
GEORGE, E. I. and MCCULLOCH, R. E. (1997). Approaches for Bayesian variable selection. Statistica Sinica, 7 339-374.
GEYER, C. J. (1992). Practical Markov chain Monte Carlo (with discussion). Statistical Science, 7 473-511.
GEYER, C. J. (1994). Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Tech. Rep. 568r, Department of Statistics, University of Minnesota.
GILL, R. D., VARDI, Y. and WELLNER, J. A. (1988).
Large sample theory of empirical distributions in biased sampling models. The Annals of Statistics, 16 1069-1112.
HANSEN, M. H. and YU, B. (2001). Model selection and the principle of minimum description length. Journal of the American Statistical Association, 96 746-774.
HASTINGS, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57 97-109.
HOBERT, J. P., JONES, G. L., PRESNELL, B. and ROSENTHAL, J. S. (2002). On the applicability of regenerative simulation in Markov chain Monte Carlo. Biometrika, 89 731-743.
IBRAGIMOV, I. (1962). Some limit theorems for stationary processes. Theory of Probability and its Applications, 7 349-382.
IBRAGIMOV, I. A. and LINNIK, Y. V. (1971). Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff, Groningen.
KASS, R. E. and WASSERMAN, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90 928-934.
KOHN, R., SMITH, M. and CHAN, D. (2001). Nonparametric regression using linear combinations of basis functions. Statistics and Computing, 11 313-322.
KONG, A., MCCULLAGH, P., MENG, X.-L., NICOLAE, D. and TAN, Z. (2003). A theory of statistical models for Monte Carlo integration (with discussion). Journal of the Royal Statistical Society, Series B, 65 585-618.

here the Bayesian Adaptive Sampling method of Clyde et al. (2009), which gives an algorithm for providing samples without replacement from the set of models. Under certain conditions, the algorithm has the feature that these are perfect samples without replacement; it then enables an efficient search for the HPM.

Uniform Ergodicity

Let $\Theta = \{0, 1\}^q \times (0, \infty) \times \mathbb{R}^{q+1}$, let $\nu$ be the (prior) distribution of $\theta$ specified by (1-1b) and (1-1c), and let $\nu_y$ be the posterior distribution of $\theta$ given $Y = y$.
(For the remainder of this section the subscript $h$ is suppressed since we are dealing with a single specification of this hyperparameter.) Let $K$ denote the Markov transition function for the Markov chain on $\Theta$ described in the beginning of this chapter, i.e. $K(\theta_0, \cdot)$ is the distribution of $\theta_1$ given that the current state is $\theta_0$, and let $K^n(\theta_0, \cdot)$ denote the corresponding $n$-step Markov transition function. Harris ergodicity of the chain is the condition that $\|K^n(\theta, \cdot) - \nu_y(\cdot)\| \to 0$ for all $\theta \in \Theta$, where $\|\cdot\|$ denotes the total variation norm, i.e. the supremum over all Borel subsets of $\Theta$. This condition is guaranteed by the so-called "usual regularity conditions," namely that the chain has an invariant probability measure, is irreducible, aperiodic, and Harris recurrent; see, e.g., Theorem 13.0.1 of Meyn and Tweedie (1993). These usual regularity conditions are typically easy to check; in the present context, they are implied for example if the Markov transition function has a density (with respect to the product of counting measure on $\{0, 1\}^q$ and Lebesgue measure on $(0, \infty) \times \mathbb{R}^{q+1}$) which is everywhere positive, which is the case in our situation. Uniform ergodicity is the far stronger condition that there exist constants $c \in [0, 1)$ and $M > 0$ such that for any $n \in \mathbb{N}$, $\|K^n(\theta, \cdot) - \nu_y(\cdot)\| \le M c^n$ for all $\theta$.

Proposition 1 The chain driven by $K$ is uniformly ergodic.

The proof of Proposition 1 is given in Appendix C. Let $\theta_0, \theta_1, \ldots$ be a Markov chain driven by $K$, let $l$ be a real-valued function of $\theta$ (for example $l(\theta) = I(\gamma_1 = 1)$, the indicator that variable 1 is in the model), and suppose we wish to form confidence

and $\cdots := w^{[f]}_{t+k-1}(h)$, for $t = 1, \ldots, k$. (A.56b)

Proceeding as we did in the proof of Theorem 2 when we showed that $\nabla^2 K(d^*) = O_p(1)$, we can show here that $\nabla^2 K^{[f]}(d^*, e^*)$ is bounded in probability.
Hence n(K[Kl(d, e) K[(d, e)) = qwE(h)' -N ) d) + Op(1), and together with (A.55) and (A.54) this implies that rn / 1)- ^ ^lW(h)' |)+ op) V qw(h)' N(d d) + o,p(l) qW[f](h)') N (a d) + O(l wo(h)' e e where wo(h) is the column-vector obtained from w(h) by concatenating k zeros at its end. Now returning to (A.52) and (A.53) we get k w[i] (h)' d ai])=1 wo(h) N e k ), .,(d),I )) + Op() Swo(h)' We can now apply the delta method with the function g(u, v) = u/v to get our result V^(c/,^^ v quasi-likelihood k N, (AIVh,(o1)o)l/d, INv(d) = log k (2-6) /=1 ,=1 s= s The estimate is the same as the estimates obtained by Gill et al. (1988), Meng and Wong (1996), and Kong et al. (2003). We assume that for all the Markov chains we use a Strong Law of Large Numbers (SLLN) holds for all integrable functions [for sufficient conditions see, e.g., Theorem 2 of Athreya et al. (1996)]. In the next theorem, we show that if d is the estimate produced by Geyer's (1994) method, or any of the equivalent estimates discussed above, then the estimate of the Bayes factor given by (h,k nh )h(Ol)) (2-7) /=1 i=1 ks=lnshs(O(1)/dS is asymptotically normal if certain regularity conditions are met. In (2-7), d = 1. Theorem 1 Suppose the chains in Stage 2 satisfy conditions Al and A2 in Doss (2010): Al For each I = 1,..., k, the chain {O(I }J1 is geometrically ergodic. A2 For each I = 1,..., k, there exists c > 0 such that E2 < 00. ( *'"!"' 2+e\ S1 sVhs (Vh, ) /)ds Assume also that the chains in Stage 1 satisfy the conditions in Theorem 2 of Geyer (1994) that imply vN(d d) d Afi(O, Z). In addition, suppose the total sample sizes for the two stages, N and n, are chosen such that n/N -- q e [0, oo). Then vn(B(h, hi, d) B(h, hi)) A,(O, qc(h)'-c(h) + 72(h)), where c(h) and r2(h) are given in equation (A.3) in the Appendix and equation (A.9) in Doss (2010), respectively. Remarks with w[f](h) defined by (A.56) below. 
By Taylor series expansion, we get n(K[lf(d, ) K[](d, e)) = vnVK[f (d, e)' d - e-e + V2K[f](d*, e*) ( 2 e e where (d*, e*) is between (d, e) and (d, e). Below we compute the gradient VK[f](d, e) and show that it converges almost surely to a vector w[f](h). We have 9K n f(OI ())h(O1) )atht( 1) (d, e) 2 O ut n d (E,2 asVh, (e,))/ds)2 k f(0 ))Vh ( 10))at h, 0t()) S dd dt (k1 asVh,(O )/d)2 j#t iimc d Ek1 h( '))/ds tlim d (Ek,1 as7,h(0o'))/ds)2 tI S 'lah, ( '))/ds O ta'ah7(O-)/S)j a.s. B(h, h) f f(0)at h (0) d2 k a h,y (0) dO d J fs= asVh /(O)ds k -f([f] ) atVh] [f] I (ht) Is hImk "W ,y() dO +- tlim =J di S =1as Vh, ()/ds dt j#t 0[] Jq f( (0) ath(0) t,lim 2 k Vhh,y(O) dO i dtm S= 1 as~h (0) / ds S Es=1 asVh (O)/ds k -f( )a] h,() ,,() dO + f] I[f] ](ht) J=1 dtEs= ashV(O)/h, (0 tds ( dt for t= 2,..., k, (A.56a) LIANG, F., PAULO, R., MOLINA, G., CLYDE, M. A. and BERGER, J. O. (2008). Mixtures of g-priors for Bayesian variable selection. Journal of the American Statistical Association, 103 410-423. MADIGAN, D. and YORK, J. (1995). Bayesian graphical models for discrete data. International Statistical Review, 63 215-232. MENG, X.-L. and WONG, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6 831-860. MEYN, S. P. and TWEEDIE, R. L. (1993). Markov Chains and Stochastic Stability. Springer-Verlag, New York, London. MYKLAND, P., TIERNEY, L. and Yu, B. (1995). Regeneration in Markov chain samplers. Journal of the American Statistical Association, 90 233-241. OWEN, A. and ZHOU, Y. (2000). Safe and effective importance sampling. Journal of the American Statistical Association, 95 135-143. RAFTERY, A. E., MADIGAN, D. and HOETING, J. A. (1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92 179-191. ROMERO, M. (2003). On Two Topics with no Bridge: Bridge Sampling with Dependent Draws and Bias of the Multiple Imputation Variance Estimator. Ph.D. 
thesis, University of Chicago. ROSENBLATT, M. (1984). Asymptotic normality, strong mixing and spectral density estimates. The Annals of Probability, 12 1167-1180. SCOTT, J. G. and BERGER, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics (to appear). SMITH, M. and KOHN, R. (1996). Nonparametric regression using Bayesian variable selection. Journal of Econometrics, 75 317-343. TAN, A. and HOBERT, J. P. (2009). Block Gibbs sampling for Bayesian random effects models with improper priors: convergence and regeneration. Journal of Computa- tional and Graphical Statistics, 18 861-878. TAN, Z. (2004). On a likelihood approach for Monte Carlo integration. Journal of the American Statistical Association, 99 1027-1036. TIERNEY, L. (1994). Markov chains for exploring posterior distributions (Disc: p1728-1762). The Annals of Statistics, 22 1701-1728. VANDAELE, W. (1978). Participation in illegitimate activities: Ehrlich revisited. In Deterrence and Incapacitation. US National Academy of Sciences, Washington DC, 270-335. 102 where the notation in (A.18) indicates that w(h) denotes the finite vector limit to which VK(d) converges. We now deal with the Hessian matrix V2K(d*). For t $ u we have [V2 K(d*)]t-l,u-, 1 n2 ) Lt7c ( () h (_)-3 n i=1 ;=1 dt "2 (2 ks=1 as h, 1)/d) 2 [Vh (01))/dc* 1 (O1'))] atlh, ( '))auv,h (O1')) I 22 d,2 (y kah(0,)IV/S)3 j=2 t d2d" 2 kC=1 as Vh, i()d j#t j#u + ui im d2 du (:sk= asV, (0I)) d) 2 (h,,,(O '))/du h, (('))) at Vh,(0') u)au (0(')) 2u*2U (s=k a (7sho(l)) /dS)3 m Vh(OB())au, h (O'()) + !t,Iim d*2d*2 k=1/d s h5(1)/dS) d 2 ( ,u2 (yC:k as7Vh, (O')) 2l (2 2(h( '))/)d h 1 ( '))) aSth,( 1))au7,h,( 1)) dm dv2d*2 k= (0s hs 1 )3 Y 1tii 1 I I and as before, it can be shown that this is bounded in probability. Similarly, we can show that the diagonal terms of V2K(d*) are also bounded in probability. 
Therefore, using the fact that V2K(d*) is bounded in probability, we can now rewrite (A.17) as Iim) = w(h)YN( d) + vd d)'O(1) v( d) 11Pm N N V 2/2N = qw(h)'N(d- d) + op(l). Together with (A.6), this gives v(l(3) a- B(h, h1)) = qw(h)' N( d) + v((d) B(h, h1)) + op(1) -d A/(O, qw(h)'_w(h) + a2(h)), by the independence of the two sampling stages, the assumption that v/N(d d) is asymptotically normal with mean 0 and variance Z, and the result from Doss (2010) that v7(d) B(h, h1)) is asymptotically normal with mean 0 and variance -2(h). D As was done for U in the proof of Theorem 5, we can write T as (1) ynl (1) ( (1)) + + a(1) 1 nk 1)(Ok)) T='7= T (ik) 1 ,1 (k )) + + a() ,k1 k)(0k) + (1), Ek 1/2 1 yn/ 7=1 a 2 i ,= l (,- E(YI,,)) where the first k components are the same as the first k components of the vector U in (A.36), and Y,,/ is given in (2-11). By applying the Cramer-Wold device, we can conclude that the vector T converges in distribution to a normal random variable with mean 0 and variance Z= Sz' 72(h)) in which S(1) is the k x k matrix given in (A.44), z is the k x 1 vector given by z = B+y (A.49) where k k oo Yr = A Cov (p,(0/), To), Yl,,) + AI [Cov (pr(0), 1o), Y1+g,1) /=1 /=1 g=l + Cov(pr(O(1 To), Y,/)], for r = 1,..., k, with B+ as in Theorem 5, and as in (A.9) of Doss (2010) k oo 72(h) = a, [Var(Y,i) + 2 Cov(Yl,i, Y+,g,)]. /=1 g=1 Expressions for v(h) and p(h) are given in equations (A.22) and (A.21), respectively, in the Appendix. 2.4 Estimation of Posterior Expectations Using Control Variates We assume in this section that the values of the vector d and the posterior expectations I[f](hj), forj = 1,..., k, are available to us. In reality, these quantities are seldom known, and the next section deals with the case when they are estimated based on previous MCMC runs. Recall that the integral we want to estimate is I[f](h) = J f(O)h,y(O) dO. In (2-14) we established that (1/n) EZ=1 i nl i[f] is a strongly consistent estimator of I[f](h) B(h, hi). 
Define
$$Z^{[f](j)}_{i,l} = \frac{f(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})/d_j}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s} - I^{[f]}(h_j), \quad j = 1, \ldots, k,$$
and let
$$Z^{[f](j)}(\theta) = \frac{f(\theta)\,\nu_{h_j}(\theta)/d_j}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/d_s} - I^{[f]}(h_j), \quad j = 1, \ldots, k.$$
With $\bar\nu_y$ denoting the mixture distribution $\sum_{s=1}^{k} a_s \nu_{h_s,y}$, it can be easily checked that $E_{\bar\nu_y}\big(Z^{[f](j)}(\theta)\big) = 0$, for $j = 1, \ldots, k$, so we can use the $Z^{[f](j)}_{i,l}$'s as control variates to reduce the variance of the original estimator $(1/n) \sum_{l=1}^{k} \sum_{i=1}^{n_l} Y^{[f]}_{i,l}$. Doing so gives the estimator
$$\frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} \Big( Y^{[f]}_{i,l} - \sum_{j=1}^{k} \hat\beta^{[f]}_j Z^{[f](j)}_{i,l} \Big),$$
where the $\hat\beta^{[f]}_j$'s denote the least squares estimates resulting from the regression of $Y^{[f]}_{i,l}$ on predictors $Z^{[f](j)}_{i,l}$. The Bayes factor $B(h, h_1)$ will be estimated as before in Section 2.2, using the estimator $(1/n) \sum_{l=1}^{k} \sum_{i=1}^{n_l} Y_{i,l}$ and the $k - 1$ control variates $Z^{(j)}$, for $j = 2, \ldots, k$. The ratio of these two control variate adjusted estimators provides us with an

the "independence Bernoulli prior" (each variable goes into the model with a certain probability $w$, independently of all the other variables), and then choose the vector of regression coefficients corresponding to the selected variables. In more detail, the model is described as follows:
$$Y \sim \mathcal{N}_m(1_m \beta_0 + X_\gamma \beta_\gamma,\ \sigma^2 I) \qquad \text{(1-1a)}$$
$$(\sigma^2, \beta_0) \sim p(\sigma^2, \beta_0) \propto 1/\sigma^2, \quad \text{and given } \sigma,\ \ \beta_\gamma \sim \mathcal{N}_{q_\gamma}\big(0,\ g \sigma^2 (X_\gamma' X_\gamma)^{-1}\big) \qquad \text{(1-1b)}$$
$$\gamma \sim w^{q_\gamma} (1 - w)^{q - q_\gamma}. \qquad \text{(1-1c)}$$
The prior on $(\sigma, \beta_0, \beta_\gamma)$ is Zellner's g-prior introduced in Zellner (1986), and is indexed by a hyperparameter $g$. Although this prior is improper, the resulting posterior distribution is proper. Note that we have used the word "model" in two different ways: (i) a model is a specification of the hyperparameter $h$, and (ii) a model in regression is a list of variables to include. The meaning of the word will always be clear from context. To summarize, the prior on the parameter $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$ is given by the two-level hierarchy (1-1c) and (1-1b), and is indexed by $h = (w, g)$.
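The first level of the hierarchy, the independence Bernoulli prior (1-1c), can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `sample_gamma` and `log_prior_gamma`, the seed, and the values of `q` and `w` are hypothetical and not from the text. It draws inclusion vectors, checks that the mass function sums to one over all $2^q$ models, and shows that the average model size is close to $qw$, which is the sense in which $w$ controls how large the selected model tends to be.

```python
import math
import random
from itertools import product

def sample_gamma(q, w, rng):
    """Draw an inclusion vector: each of the q variables enters with probability w."""
    return [1 if rng.random() < w else 0 for _ in range(q)]

def log_prior_gamma(gamma, w):
    """Log of the mass function w^{q_gamma} * (1 - w)^{q - q_gamma}."""
    q_gamma = sum(gamma)
    return q_gamma * math.log(w) + (len(gamma) - q_gamma) * math.log(1.0 - w)

rng = random.Random(0)   # fixed seed, for illustration only
q, w = 6, 0.25           # hypothetical number of predictors and inclusion probability

draws = [sample_gamma(q, w, rng) for _ in range(20000)]
mean_size = sum(sum(g) for g in draws) / len(draws)   # close to q * w

# The mass function sums to one over all 2^q candidate models.
total_mass = sum(math.exp(log_prior_gamma(g, w)) for g in product([0, 1], repeat=q))
```

Drawing the coefficients in the second level (1-1b) would additionally require the design matrix and a multivariate normal sampler, which this sketch leaves out.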
Loosely speaking, when w is large and g is small, the prior encourages models with many variables and small coefficients, whereas when w is small and g is large, the prior concentrates its mass on parsimonious models with large coefficients. Therefore, the hyperparameter h = (w, g) plays a very important role, and in effect determines the model that will be used to carry out variable selection. A standard method for approaching model selection involves the use of Bayes factors. For each he T- let mh(y) denote the marginal likelihood of the data under the prior Vh, that is, mh(y) = J p(y)vh(O) dO. We will write mh instead of mh(y). The Bayes factor of the model indexed by h2 vs. the model indexed by hi is defined as the ratio of the marginal likelihood of the data under the two models, mh2/mh,, and is denoted throughout by B(h2, hi). Bayes factors are widely used as a criterion for comparing is also asymptotically normal. To carry out the first step, we will express each U), j = 1,..., k, as the sum of a linear combination of standardized averages of functions of the 0')0's and a op(1) quantity. We will also need the central limit theorem to hold for these averages. Hence, for each j = 1,..., k, we plan to find constants ) ..., ) and functions (),... 0), which satisfy the conditions E,,,, ( )(0)) = 0 and E,,,, (0 )(2) ) < =1... k, (A.37a) SN Nk U 0) = a) i 'J)(1)0) + ... + 0) ) )(ok)O0)+ Op(1) (A.37b) N=1 ==1 for some c > 0. Note that conditions (A.37a) and B1 yield central limit theorems for the averages in the linear combination above. For U(k+1),..., U(2k), condition (A.37) is clearly satisfied since +k) = 1 1 NJ UO) ( e, e,) 1 1 (f(o)) e,), forj = 1... k, V" j VN"j 1=1 and the moment conditions in (A.37a) hold (see B2 in the statement of this theorem). Next, we show that condition (A.37) also holds for the first k components of U. 
In the proof of his Theorem 2, Geyer (1994) defines the matrix BN via 1 (VN(IN) V/AN(o0)) = BN(l rIo), (A.38) where IN was defined in (A.34), and establishes that BN -a B, where B is given by equation (19) in Geyer (1994). He also shows that, with u being the k-dimensional column vector of l's, /NON T0) -J (A.39) U/) 0 To see why the last relationship statement is true, we first consider the integral with respect to so. We have Ip(Y 7, 72/3o, /y)p( 3o) d3o o J(2) -(m2) exp (Y lm3o XO, )'(Y ,1m X,3)] do3 x (2) -(m2) expL (Y X3,)'(Y X )] exp [-m (o2 20o0Y)] do3 -2 x (-2)-(m-1)/2 exp [- (Y Xp)'(Y XO3)] exp (m2 So we may now write P(72 7I Y) 52 S(a2) -(m+1)/2 2exp ) xp (Y X,)/(Y X,)] p(1, |7, 72) d, (2)-(m+1+q,)/2 exp(- exp d-07 X'X + 'X p d N 72)-(m+1+q,)/2 exp -xp S2 )(a2 q /2 exp 2 92 Y Y X'XY )-Y x (2) -(m / 1) exp{- 52)] x(72)-^1)/2 exp{J-ij [1 + g(1 R7}, where the next-to-last proportionality relation results from using the formula exp( W- 1 + a'/) d3 = (27)q2| W1/2 exp(a'Wa/2) which can be shown to hold for any vector a of length q. and any positive definite matrix W by using a "completing the squares" argument. In practice, we use the distributional relationship S2 [1 + (- R-)] 2 2(g + 1) Xm-1 to draw -2. for u = (u2,... Uk)' with ui > 0, I = 2, ... k, u1 = 1. Note that H(d) = n,-12 1n, Zi). To see why (A.14) is true we begin by writing n1/2 = 1/2 ) + n 1/2 I n=1 i= n 1=1 = H(d) H(d) + Op(1). (A.15) Note that the fact that n/12 i 1 ([Z,) e(j, 1)]/n,) = Op(1), which was used to establish the second equality in (A.15), is proved in Doss (2010). Now, applying the Mean Value Theorem to the function H, we know that there exists a point d* between d and d such that (A.15) becomes n1/2 n2i) -e(j VH(d*)'( d) + Op(1) 1=1 = na n-/2VH(d*)' N( d) + Op(1), (A.16) so that the right side of (A.16) is Op(1). 
To see this last assertion, note that the (t l)th element of the gradient of H, [VH(d)]t-~, is given by ,=1 dt (Cs=l as (h, 1)ds) I -1/2 nnv -(Ohj),)00) n'J(7 ), ())dj l))-)ja(() if t=j. n-1/2 h(+ 1()) if t ( )) by writing S(B + 1 uu' 1 uu' N k k where the last equality comes from Geyer (1994). Next, we establish asymptotic normality of V/N(rOo)/VN. Since the gradient V/vN(ro) is the vector whose rth element is given by OIN( \ k NI OO) = Nr pr(OP)0,i, I), ari I=1 ,=1 we can see that 1 aN(O) 1 A k NN arlr N rPr (0()0, -IO) A Nr k( r NI = Nr (1- p (O(r)O, o)) Pr(O0 C}i), O), S i=1 /=1 v =1 Ir k N1 /=1 =l[ Ilr k N, = A 1 [p r(o l0'i)) E(pr(O(')o, rlo))], /=1 =1 which is a linear combination of the form given in (A.37b) and, because 0 < pr(O, I) < 1 for all 0 and r, condition (A.37a) is also satisfied. Note that we are allowed to insert the Kohn et al. (2001) consider Metropolized Gibbs algorithms which are the same as the Hybrid Algorithm of Clyde et al. (1996), except that at coordinate j, instead of deterministically proposing to go from 7j to 7, = 1 Ty, the proposed value 7* is equal to 1 7 with probability depending on the current state y. Kohn et al. (2001) describe two such algorithms, and show that these are more computationally efficient than the Gibbs sampler in situations where on average q. is small, i.e. the models are sparse. They also conduct a detailed simulation study of one of their sampling schemes (their "SS(2)") which suggests that, while the scheme produces estimates whose standard errors are a bit larger than those produced by the Gibbs sampler, this disadvantage is more than outweighed by its computational efficiency. All the algorithms mentioned above require, in one way or another, the calculation of p(7* I Y)/p(7 I Y). Because of the conjugate nature of model (1-1), the marginal likelihood of model 7 is available in closed form, and therefore p(7 Y) is available up to a normalizing constant. 
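As a rough illustration of why the ratio $p(\gamma^* \mid Y)/p(\gamma \mid Y)$ is cheap to evaluate, the sketch below codes the standard closed form of the unnormalized posterior model probability for Zellner's g-prior combined with the independence Bernoulli prior (the form given, e.g., in Liang et al. (2008)); that this matches the thesis's own expression term by term is an assumption. The function names and numerical values are hypothetical, and the coefficient of determination `r2` is taken as an input rather than computed from data.

```python
import math

def log_post_unnorm(q_gamma, r2, m, q, g, w):
    """Log of p(gamma | Y) up to an additive constant.

    q_gamma: number of variables in the model, r2: its coefficient of
    determination, m: sample size, q: number of candidate predictors,
    (w, g): the hyperparameters. Factors that do not depend on gamma are dropped.
    """
    return (0.5 * (m - 1 - q_gamma) * math.log1p(g)
            - 0.5 * (m - 1) * math.log1p(g * (1.0 - r2))
            + q_gamma * math.log(w)
            + (q - q_gamma) * math.log(1.0 - w))

def log_ratio(q_star, r2_star, q_cur, r2_cur, m, q, g, w):
    """log[p(gamma* | Y) / p(gamma | Y)]; the normalizing constant cancels."""
    return (log_post_unnorm(q_star, r2_star, m, q, g, w)
            - log_post_unnorm(q_cur, r2_cur, m, q, g, w))

# Hypothetical numbers: adding one variable raises R^2 from 0.40 to 0.55.
lr = log_ratio(q_star=3, r2_star=0.55, q_cur=2, r2_cur=0.40, m=50, q=10, g=50.0, w=0.3)
```

Working on the log scale avoids overflow when $m$ is large; a sampler would compare `lr` with the log of a uniform draw, or convert it to a conditional inclusion probability.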
We have
$$p(\gamma \mid y) \propto (1 + g)^{(m - 1 - q_\gamma)/2}\, S^{-(m-1)}\, \big[1 + g(1 - R_\gamma^2)\big]^{-(m-1)/2}\, w^{q_\gamma} (1 - w)^{q - q_\gamma}, \qquad \text{(5-1)}$$
where $S^2 = \sum_j (Y_j - \bar Y)^2$ and $R_\gamma^2$ is the coefficient of determination of model $\gamma$. As is standard for model (1-1), we assume that the columns of the design matrix are centered, and in this case, $R_\gamma^2 = Y' X_\gamma (X_\gamma' X_\gamma)^{-1} X_\gamma' Y / S^2$. The main computational burden in obtaining (5-1) is the calculation of $R_\gamma^2$, which is time-consuming if $q_\gamma$ is large. Smith and Kohn (1996) note that, when $\gamma^*$ and $\gamma$ differ in only one component, $R_{\gamma^*}^2$ can be obtained rapidly from $R_\gamma^2$. We return to this point in Appendix B.

In our situation, we need to generate a Markov chain on $\theta$, because the Bayes factor estimates given in Chapter 2 require samples from the posterior distribution of $\theta$. The algorithm we use in the present paper is based on the Gibbs sampler on $\gamma$ introduced in Smith and Kohn (1996) (although the computational implementation we use is different from theirs), followed by three steps to generate $\sigma$, $\beta_0$, and $\beta_\gamma$. In a bit more detail, let

APPENDIX A PROOF OF RESULTS FROM CHAPTER 1

Proof of Theorem 1

We begin by writing
$$\sqrt{n}\big(\hat B(h, h_1, \hat d) - B(h, h_1)\big) = \sqrt{n}\big(\hat B(h, h_1, \hat d) - \hat B(h, h_1, d)\big) + \sqrt{n}\big(\hat B(h, h_1, d) - B(h, h_1)\big). \qquad \text{(A.1)}$$
The second term on the right side of (A.1) involves randomness coming only from the second stage of sampling. This term was analyzed by Doss (2010), who showed that it is asymptotically normal, with mean 0 and variance $\tau^2(h)$. The first term ostensibly involves randomness from both Stage 1 and Stage 2 sampling. However, as will emerge from our proof, the randomness from Stage 2 is of lower order, and effectively all the randomness is from Stage 1. This randomness is non-negligible. We mention here the often-cited work of Geyer (1994) (whose nice results we use in the present paper). In the context of a setup very similar to ours, his Theorem 4 states that using an estimated $d$ and using the true $d$ results in the same asymptotic variance. From our proof (refer also to Remark 2 of Section 2.1), we see that this statement is not correct.
To analyze the first term on the right side of (A.1), we define the function $F(u) = \hat B(h, h_1, u)$, where $u = (u_2, \ldots, u_k)'$ is a real vector with $u_l > 0$, $l = 2, \ldots, k$. Then, by the Taylor series expansion of $F$ about $d$, we get
$$\sqrt{n}\big(\hat B(h, h_1, \hat d) - \hat B(h, h_1, d)\big) = \sqrt{n}\big(F(\hat d) - F(d)\big) = \sqrt{n}\, \nabla F(d)'(\hat d - d) + \frac{\sqrt{n}}{2} (\hat d - d)' \nabla^2 F(d^*) (\hat d - d), \qquad \text{(A.2)}$$
where $d^*$ is between $d$ and $\hat d$. First, we show that the gradient $\nabla F(d) = (\partial F(d)/\partial d_2, \ldots, \partial F(d)/\partial d_k)'$ converges almost surely to a finite constant. For $j = 2, \ldots, k$, the $(j-1)$th component of this vector converges almost surely since, with the SLLN assumed to hold for the Markov chains

CHAPTER 1 INTRODUCTION

In the Bayesian paradigm we have a data vector $Y$ with density $p_\theta$ for some unknown $\theta \in \Theta$, and we wish to put a prior density on $\theta$. The available family of prior densities is $\{\nu_h, h \in \mathcal{H}\}$, where $h$ is called a hyperparameter. Typically, the hyperparameter is multivariate and choosing it can be difficult. But this choice is very important and can have a large impact on subsequent inference. There are two issues we wish to consider:

(A) Suppose we fix a quantity of interest, say $f(\theta)$, where $f$ is a function. How do we assess how the posterior expectation of $f(\theta)$ changes as we vary $h$? More generally, how do we assess changes in the posterior distribution of $f(\theta)$ as we vary $h$?

(B) How do we determine if a given subset of $\mathcal{H}$ constitutes a class of reasonable choices?

The first issue is one of sensitivity analysis and the second is one of model selection. As an example of the kind of problem we wish to deal with, consider the problem of variable selection in Bayesian linear regression. Here, we have a response variable $Y$ and a set of predictors $X_1, \ldots, X_q$, each a vector of length $m$. For every subset $\gamma$ of $\{1, \ldots, q\}$ we have a potential model $\mathcal{M}_\gamma$ given by $Y = 1_m \beta_0 + X_\gamma \beta_\gamma + \epsilon$, where $1_m$ is the vector of $m$ 1's, $X_\gamma$
is the design matrix whose columns consist of the predictor vectors corresponding to the subset $\gamma$, $\beta_\gamma$ is the vector of coefficients for that subset, and $\epsilon \sim \mathcal{N}_m(0, \sigma^2 I)$. Let $q_\gamma$ denote the number of variables in the subset $\gamma$. The unknown parameter is $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$, which includes the indicator of the subset of variables that go into the linear model. A very commonly used prior distribution on $\theta$ is given by a hierarchy in which we first choose the indicator $\gamma$ from

mass function $d$ on $\Gamma$ such that $v(\gamma, \gamma')$ satisfies the minorization condition
$$v(\gamma, \gamma') \ge s(\gamma)\, d(\gamma') \quad \text{for all } \gamma, \gamma' \in \Gamma. \qquad \text{(C.1)}$$
We proceed via the "distinguished point" technique introduced in Mykland et al. (1995). Let $\gamma^*$ denote a fixed model, which we will refer to as a distinguished model, and let $D \subset \Gamma$ be a set of models. The model $\gamma^*$ and the set $D$ are arbitrary, but below we give guidelines for making a practical choice of $\gamma^*$ and $D$. For all $\gamma, \gamma' \in \Gamma$ we have
$$v(\gamma, \gamma') = \frac{v(\gamma, \gamma')}{v(\gamma^*, \gamma')}\, v(\gamma^*, \gamma') \ge \Big[ \min_{\gamma'' \in D} \frac{v(\gamma, \gamma'')}{v(\gamma^*, \gamma'')} \Big]\, v(\gamma^*, \gamma')\, I(\gamma' \in D).$$
If we let $c(\gamma^*)$ denote the normalizing constant for $v(\gamma^*, \gamma') I(\gamma' \in D)$, that is, $c(\gamma^*) = \sum_{\gamma' \in D} v(\gamma^*, \gamma')$, then we get
$$v(\gamma, \gamma') \ge c(\gamma^*) \Big[ \min_{\gamma'' \in D} \frac{v(\gamma, \gamma'')}{v(\gamma^*, \gamma'')} \Big] \times \frac{v(\gamma^*, \gamma')\, I(\gamma' \in D)}{c(\gamma^*)} = s(\gamma)\, d(\gamma'),$$
where
$$s(\gamma) = c(\gamma^*) \min_{\gamma'' \in D} \frac{v(\gamma, \gamma'')}{v(\gamma^*, \gamma'')} \quad \text{and} \quad d(\gamma') = \frac{v(\gamma^*, \gamma')\, I(\gamma' \in D)}{c(\gamma^*)}.$$
Evaluating both $s$ and $d$ requires computing transition probabilities of the form $v(\gamma, \gamma')$ which, due to the fact that the Markov chain on $\gamma$ is a Gibbs sampler, can be expressed as
$$v(\gamma, \gamma') = P(\gamma_1' \mid \gamma_2, \gamma_3, \ldots, \gamma_q, Y) \times P(\gamma_2' \mid \gamma_1', \gamma_3, \ldots, \gamma_q, Y) \times \cdots \times P(\gamma_q' \mid \gamma_1', \gamma_2', \ldots, \gamma_{q-1}', Y),$$
where the formula for the right side terms is given by (B.1). Since the $\gamma$'s in the terms on the right side differ in at most one component, the fast updating techniques discussed in Appendix A can be applied here too to speed up the computations.
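The distinguished point construction can be checked numerically on a small chain. The sketch below is an illustration only: it uses a hypothetical 3-state transition matrix `V` in place of the Gibbs transition probabilities $v(\gamma, \gamma')$, and the function name, the distinguished state, and the set `D` are assumptions made for the example. It computes $s$ and $d$ exactly as in the displayed formulas and leaves the minorization inequality to be verified entrywise.

```python
def distinguished_point_minorization(V, star, D):
    """Return (s, d) with V[i][j] >= s[i] * d[j] for all states i, j.

    V: transition matrix as a list of rows with strictly positive entries,
    star: index of the distinguished state, D: set of state indices.
    """
    n = len(V)
    c = sum(V[star][j] for j in D)                        # normalizing constant c(star)
    d = [V[star][j] / c if j in D else 0.0 for j in range(n)]
    s = [c * min(V[i][j] / V[star][j] for j in D) for i in range(n)]
    return s, d

# Hypothetical 3-state chain with positive transition probabilities.
V = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
s, d = distinguished_point_minorization(V, star=0, D={0, 1})
```

A larger set `D` spreads `d` over more states but makes the minimum in `s` smaller, so the choice trades off how often the chain regenerates against where it regenerates.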
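and let
$$\hat I^{\hat\beta}_{\hat d} = \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} \Big( \hat Y_{i,l} - \sum_{j=2}^{k} \hat\beta^{(\hat d)}_j \hat Z^{(j)}_{i,l} \Big), \qquad \text{(2-13)}$$
where $\hat Y_{i,l}$ and $\hat Z^{(j)}_{i,l}$ are as in (2-11), except using $\hat d$ for $d$, and $\hat\beta^{(\hat d)}$ is the least squares regression estimator from regressing $\hat Y_{i,l}$ on predictors $\hat Z^{(j)}_{i,l}$, $j = 2, \ldots, k$. The next theorem gives the asymptotic distribution of this new estimator.

Theorem 2 Suppose all the conditions from Theorem 1 are satisfied. Moreover, assume that $R$, the $k \times k$ matrix defined by
$$R_{j,j'} = E\Big( \sum_{l=1}^{k} a_l Z^{(j)}_{1,l} Z^{(j')}_{1,l} \Big), \quad j, j' = 1, \ldots, k,$$
is nonsingular. Then
$$\sqrt{n}\big(\hat I^{\hat\beta}_{\hat d} - B(h, h_1)\big) \xrightarrow{d} \mathcal{N}\big(0,\ q\, w(h)' \Sigma\, w(h) + \sigma^2(h)\big).$$
Expressions for $w(h)$ and $\sigma^2(h)$ are given in equation (A.18) to follow and equation (A.7) in Doss (2010), respectively.

2.3 Estimation of Posterior Expectations

In this section we give a method for estimating the posterior expectation of a function $f$ when the prior is $\nu_h$. Let us denote this quantity by $I^{[f]}(h) = \int f(\theta)\, \nu_{h,y}(\theta)\, d\theta$. Define
$$Y^{[f]}_{i,l} = \frac{f(\theta_i^{(l)})\, \nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s} = \frac{f(\theta_i^{(l)})\, \nu_h(\theta_i^{(l)})/m_h}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/m_{h_s}} \cdot \frac{m_h}{m_{h_1}} = \frac{f(\theta_i^{(l)})\, \nu_{h,y}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s,y}(\theta_i^{(l)})}\, B(h, h_1).$$
Assuming a SLLN holds for the Markov chains $\{\theta_i^{(l)}\}$, $l = 1, \ldots, k$, $i = 1, \ldots, n_l$, we have
$$\frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} Y^{[f]}_{i,l} \xrightarrow{a.s.} \sum_{l=1}^{k} a_l \int \frac{f(\theta)\, \nu_{h,y}(\theta)}{\sum_{s=1}^{k} a_s \nu_{h_s,y}(\theta)}\, \nu_{h_l,y}(\theta)\, d\theta \cdot B(h, h_1).$$

APPENDIX C PROOF OF THE UNIFORM ERGODICITY AND DEVELOPMENT OF THE MINORIZATION CONDITION FROM CHAPTER 5

Proof of Proposition 1

Let $\nu_y$ and $p(\cdot \mid Y)$ denote the posterior distribution of $\theta$ and $\gamma$, respectively, under the prior $\nu$ on $\theta$ (we are suppressing the subscript $h$, since the hyperparameter is fixed throughout). We use $K^n$ and $V^n$ to denote the $n$-step Markov transition functions for the $\theta$ and $\gamma$ chains, respectively. Also, letting $\lambda$ denote the product of counting measure on $\{0, 1\}^q$ and Lebesgue measure on $(0, \infty) \times \mathbb{R}^{q+1}$, we use $k^n$ to denote the density of $K^n$ with respect to $\lambda$ and $v^n$ to denote the probability mass function of $V^n$. We now show that the $\theta$-chain and the $\gamma$-chain converge to their corresponding posterior distributions at exactly the same rate.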
For any starting state $\theta_0$ and $n \in \mathbb{N}$, we have
$$\|K^n(\theta_0, \cdot) - \nu_y(\cdot)\| = \sup_A |K^n(\theta_0, A) - \nu_y(A)| = \frac{1}{2} \int |k^n(\theta_0, \theta) - \nu_y(\theta)|\, d\lambda$$
$$= \frac{1}{2} \int \big| v^n(\gamma_0, \gamma)\, p(\sigma^2, \beta_0, \beta_\gamma \mid \gamma, Y) - p(\gamma \mid Y)\, p(\sigma^2, \beta_0, \beta_\gamma \mid \gamma, Y) \big|\, d\lambda$$
$$= \frac{1}{2} \sum_{\gamma \in \Gamma} |v^n(\gamma_0, \gamma) - p(\gamma \mid Y)| \Big[ \int p(\sigma^2, \beta_0, \beta_\gamma \mid \gamma, Y)\, d\sigma^2\, d\beta_0\, d\beta_\gamma \Big]$$
$$= \frac{1}{2} \sum_{\gamma \in \Gamma} |v^n(\gamma_0, \gamma) - p(\gamma \mid Y)| = \sup_B |V^n(\gamma_0, B) - p(\gamma \in B \mid Y)|.$$
Hence, the $\theta$ chain inherits the convergence rate of its $\gamma$-subchain, a uniformly ergodic Gibbs sampler on a finite state space. $\Box$

Description of the Regeneration Scheme

For regeneration purposes, it is enough to restrict our attention to the Markov chain that runs on $\gamma$. This is because, as we will see later, whenever this subchain regenerates, the augmented chain that produces draws from the posterior distribution of $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$ also regenerates. We will find a function $s\colon \Gamma \to [0, 1)$ and a probability

Figure 5-3. Variance functions for two versions of the estimate. The left panel is for the estimate based on the skeleton (5-4). The points in this skeleton were shifted to better cover the problematic region near the back of the plot ($g$ small and $w$ large), creating the skeleton (5-5). The maximum variance is then reduced by a factor of 9 (right panel).

measurements, their squares, and their two-way interactions. Liang et al. (2008) give a review of the literature on priors for the hyperparameter $g$ and advocate the hyper-g priors. They compare 10 variable selection techniques (including three hyper-g priors) on this data set by using a cross-validation procedure: the data set is randomly split in two halves, one of which (the training sample) is used for selecting the model (for the Bayesian methods this is the highest probability model), while the other (the validation sample) is used for measuring the predictive accuracy.
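The split-sample evaluation just described can be sketched as follows. This is a minimal illustration, not the code used by Liang et al. (2008): the function names are hypothetical, the model-selection and fitting step on the training half is omitted, and the fitted values are simply passed in. Predictive accuracy is summarized here by the root mean squared prediction error over the validation half.

```python
import math
import random

def split_halves(n, rng):
    """Randomly split the indices 0..n-1 into a training and a validation half."""
    idx = list(range(n))
    rng.shuffle(idx)
    half = n // 2
    return sorted(idx[:half]), sorted(idx[half:])

def rmse(y, y_hat, validation):
    """Square root of the mean squared prediction error over the validation indices."""
    sq = sum((y[i] - y_hat[i]) ** 2 for i in validation)
    return math.sqrt(sq / len(validation))

rng = random.Random(1)                  # fixed seed, for illustration only
train, valid = split_halves(10, rng)    # hypothetical data set of 10 observations
```

In use, the model would be selected and fitted on `train` only, and `rmse` would then be evaluated with the resulting fitted values on `valid`.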
The predictive accuracy of method $j$ is measured through the square-root of the mean squared prediction error (RMSE) of the selected model $\hat\gamma_j$, defined by
$$\mathrm{RMSE}(\hat\gamma_j) = \Big( \frac{1}{n_V} \sum_{i \in V} (Y_i - \hat Y_i)^2 \Big)^{1/2}.$$
Here, $V$ is the validation set, $n_V$ is its size, and $\hat Y_i$ is the fitted value of observation $i$ under model $\hat\gamma_j$. Liang et al. (2008) point out the curious fact that the RMSE's of the 10 methods are all very close (they range from 4.4 to 4.6), but the selected models differ greatly in the number of variables selected, which range from 3 to 18.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

COMPUTATIONAL APPROACHES FOR EMPIRICAL BAYES METHODS AND BAYESIAN SENSITIVITY ANALYSIS

By Eugenia Buta
August 2010
Chair: Hani Doss
Major: Statistics

We consider situations in Bayesian analysis where we have a family of priors $\nu_h$ on the parameter $\theta$, where $h$ varies continuously over a space $\mathcal{H}$, and we deal with two related problems. The first involves sensitivity analysis and is stated as follows. Suppose we fix a function $f$ of $\theta$. How do we efficiently estimate the posterior expectation of $f(\theta)$ simultaneously for all $h$ in $\mathcal{H}$? The second problem is how do we identify subsets of $\mathcal{H}$ which give rise to reasonable choices of $\nu_h$? We assume that we are able to generate Markov chain samples from the posterior for a finite number of the priors, and we develop a methodology, based on a combination of importance sampling and the use of control variates, for dealing with these two problems. The methodology applies very generally, and we show how it applies in particular to a commonly used model for variable selection in Bayesian linear regression, in which the unknown parameter includes the model and the regression coefficients for the selected model.
The prior is a hierarchical prior in which first the model is selected, then the coefficients for this model are chosen, and this prior is indexed by two hyperparameters. These hyperparameters effectively determine whether the selected model will be a large model with many variables, or a parsimonious model with only a few variables, so choosing them is very important. We give two illustrations of our methodology, one on the U.S. crime data of Vandaele and the other on ground level ozone data originally analyzed by Breiman and Friedman.

that $\pi$ is a stationary probability distribution for the chain. Suppose that for each $x$, $K(x, \cdot)$ has density $k(x, \cdot)$ with respect to a dominating measure $\mu$. Regeneration methods require the existence of a function $s\colon X \to [0, 1)$, whose expectation with respect to $\pi$ is strictly positive, and a probability density $d$ with respect to $\mu$, such that $k(\cdot, \cdot)$ satisfies
$$k(x, x') \ge s(x)\, d(x') \quad \text{for all } x, x' \in X. \qquad \text{(3-4)}$$
This is called a minorization condition and, as we describe below, it can be used to introduce regenerations into the Markov chain driven by $k$. These regenerations are the key to constructing a simple, consistent estimator of the variance in the central limit theorem. Define
$$r(x, x') = \frac{k(x, x') - s(x)\, d(x')}{1 - s(x)}.$$
Note that, for fixed $x \in X$, $r(x, x')$ is a density function in $x'$. We may therefore write
$$k(x, x') = s(x)\, d(x') + (1 - s(x))\, r(x, x'),$$
which gives a representation of $k(x, \cdot)$ as a mixture of two densities, $d(\cdot)$ and $r(x, \cdot)$. This provides an alternative method of simulating from $k$. Suppose that the current state of the chain is $X_n$. We generate $\delta_n \sim \text{Bernoulli}(s(X_n))$. If $\delta_n = 1$, we draw $X_{n+1} \sim d$; otherwise, we draw $X_{n+1} \sim r(X_n, \cdot)$. Note that, if $\delta_n = 1$, the next state of the chain is drawn from $d$, which does not depend on the current state. Hence, the chain "forgets" the current state and we have a regeneration.
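The mixture representation can be simulated directly on a finite state space. The sketch below is an illustration under stated assumptions: the 3-state matrix `K` is hypothetical, and the minorization used is the simple one with constant $s(x) = \varepsilon = \sum_{x'} \min_x k(x, x')$ and $d(x') = \min_x k(x, x') / \varepsilon$, chosen here for brevity rather than taken from the text. At each step a Bernoulli variable decides whether the next state is drawn from $d$ (a regeneration) or from the residual kernel $r$.

```python
import random

def column_min_minorization(K):
    """Constant-s minorization: K[x][j] >= eps * d[j], with residual kernel r."""
    n = len(K)
    mins = [min(K[x][j] for x in range(n)) for j in range(n)]
    eps = sum(mins)                      # s(x) = eps for every x; needs 0 < eps < 1
    d = [mj / eps for mj in mins]
    r = [[(K[x][j] - mins[j]) / (1.0 - eps) for j in range(n)] for x in range(n)]
    return eps, d, r

def draw(probs, rng):
    """Draw an index from a discrete distribution by inverting its CDF."""
    u, cum = rng.random(), 0.0
    for j, p in enumerate(probs):
        cum += p
        if u < cum:
            return j
    return len(probs) - 1                # guard against rounding in the last cell

def simulate_split_chain(K, n_steps, rng):
    """Simulate (X_n, delta_n); delta_n = 1 marks a regeneration at step n + 1."""
    eps, d, r = column_min_minorization(K)
    x = draw(d, rng)                     # start the chain from d
    states, deltas = [x], []
    for _ in range(n_steps):
        delta = 1 if rng.random() < eps else 0
        x = draw(d, rng) if delta else draw(r[x], rng)
        states.append(x)
        deltas.append(delta)
    return states, deltas

K = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]                    # hypothetical transition matrix
eps, d, r = column_min_minorization(K)
states, deltas = simulate_split_chain(K, 20000, random.Random(0))
regeneration_rate = sum(deltas) / len(deltas)   # close to eps
```

The segments of `states` between consecutive regenerations are independent and identically distributed tours, which is what makes a simple variance estimate possible.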
To be more specific, suppose we start the Markov chain with Xo ~ d and then use the method described above to simulate the chain. Each time 6n = 1, we have Xn+1 d and the process stochastically restarts itself; that is, the process regenerates. Even though the description above involves generating observations from r, there are clever tricks that enable the user to bypass generating from r, and to obtain the sequence (Xn, 6n) in a way that requires only generating directly from k; see, e.g., Tan and Hobert (2009). Chapter 1, for each j= 1,..., k, the marginal likelihood mhJ is a sum of 2q terms, and if q is relatively small, these marginal likelihood are computable, so the vector d is available. Likewise, for some functions f, the posterior expectations I[](hj) can be numerically obtained; see Section 3 of George and Foster (2000). So it is possible to calculate d and I[/](hj), j = 1,..., k for skeleton points hi, ..., hk, and the method described in this section enables us to efficiently estimate the family I[f](h), h c T. 2.5 Estimation of Posterior Expectations Using Control Variates With Estimated Skeleton Bayes Factors and Expectations This section undertakes estimation of the posterior expectation I[f](h) = f f(O)vh,y() dO in the case where the quantities d and If](hJ)'s are unknown and estimated based on previous MCMC runs. Let Ni Nk \ e = (I[f(hl),. I[f(hk))' and e = f(0(1o) 1. r f(O ) i= 1 = 1 i.e. e is the vector of true expectations, which had been assumed known in Theorem 4, and e is its natural estimate based on the samples in Stage 1. To account for the fact that the responses Y[f] and covariates Z},]0) used in the previous section are now unknown, as they involve the unknown d and e, we need to consider new responses and covariates based on estimates d and e obtained from Stage 1 samples. Define f )( kand 2 )= -k -h- ) J k ^[f] f')v ) and z]) ()h, (o .)/ s=' 1 as1 vhs(Os'))aS s= a1s)hs /i as j 1. 
Hence we obtain a new control variate adjusted estimator for $I^{[f]}(h)$, corresponding to the estimator (2-18) of the previous section (which assumed knowledge of $d$ and the $I^{[f]}(h_j)$'s), in which $\hat\beta^{[f]}(\hat d)$ is the least squares estimate resulting from the regression of $\hat Y^{[f]}_{i,l}$ on the predictors $\hat Z^{[f](j)}_{i,l}$. Theorem 6 establishes the asymptotic normality of this estimator. But before stating this theorem, we give an auxiliary theorem which shows the joint

CHAPTER 6 DISCUSSION

The following fact is obvious, but it may be worthwhile to state it explicitly. If $h_1$ is fixed, maximizing $B(h, h_1)$ and maximizing the marginal likelihood $m_h$ are equivalent. Choosing the value of $h$ that maximizes $m_h$ is by definition the empirical Bayes method. Thus, the development in Chapter 2 can be used to implement empirical Bayes methods.

Our methodology for dealing with the sensitivity analysis and model selection problems discussed in Chapter 1 can be applied to many classes of Bayesian models. In addition to the usual parametric models, we mention also Bayesian nonparametric models involving mixtures of Dirichlet processes (Antoniak (1974)), in which one of the hyperparameters is the so-called total mass parameter; very briefly, this hyperparameter controls the extent to which the nonparametric model differs from a purely parametric model. (Among the many papers that use such models, we mention in particular Burr and Doss (2005), who give a more detailed discussion of the role of the total mass parameter.) The approach developed in Sections 2.1 and 2.2 can be used to select this parameter.

When the dimension of $h$ is low, it will be possible to plot $B(h, h_1)$, or at least plot it as $h$ varies along some of its dimensions. Empirical Bayes methods are notoriously difficult to implement when the dimension of the hyperparameter $h$ is high.
In this case, it is possible to use the methods developed in Sections 2.1 and 2.2 to enable approaches based on stochastic search algorithms. These require the calculation of the gradient OB(h, hl)/9h. We note that the same methodology used to estimate B(h, hi) can also be used to estimate its gradient. For example, in (2-7), vh((0)) is simply replaced by a eh(O(1))1/h. Therefore G() = Rij, + VG(d*)/(d d) + o,(1) Similar arguments extend to the case j R,, + O,(1)o,(1) + o,(1) Ri,,. 1 orj' = 1. By the fact that R is assumed invertible, we have n(7' f)-1 R-1 (A.11) In a similar way, it can be shown that (A.12) where v is the same limit vector to which Z'Y/n has been proved to converge in Doss (2010). Combining (A.11) and (A.12) we have Ln(Z'Z)-l] [z'*/n] A) (0o,1im, Arim) Let e(j, I) = E(Z0). We now have k v/( )-/m) = e( im j=2 J op()2 ( an1 j=2 (/=1 To show that (A. 13) converges to 0 in probability it suffices to show that for each I and j n /1 ) /2i=l i -ne(, I) Op(1). (A.14) For fixed j {2, ..., k} and / {1, ..., k}, define H(u) = n1/2 i=1 Vh, ( ) ) Vh (01) =1 asVhs,(Oi)/Us R-iv. (3)J) ( \/=1 a n1/2 n ) -ne(jl nl S / /2 1 i, ne(j, I) (A.13) 2/ Pn P v, Z Y/n tv, f )(0)W) used, we have [VkF(d)]_ nJ jh(0 )) hj(O) /=1 i=1 d2( s= nVh (0(,)/ds) k 1 n1 ajallh(O(',)l/h, (O,)) = n d2 (Ek=1 ash(0)/ds)2 a.s. 1 Jk j h (0) h (0) d2 'J /k Vh /y(O) dO I/i ( =1 avh (O) /ds) B(h, h1) f aJyh,() S 2 k aVh() Vh,y(0) dO := [c(h)]j_i. (A.3) d Es=l asVh,(0)/ds The last integral is clearly finite, and the last equality in (A.3) indicates that c(h) denotes the constant vector to which VF(d) converges. Next, we show that the random Hessian matrix V2F(d*) of second-order derivatives of F evaluated at d* is bounded in probability. To this end, it suffices to show that each element of this matrix, say [V2F(d*)]t-,1,1, where t,j c {2,..., k}, is Op(1). Since I d* dll < lid dll > 0, it follows that d* A d. Let c e (0, min(d2,..., dk)). Then we have P(l d* dll < e) 1. 
We now show that, on the set $\{\|d^* - d\| < \epsilon\}$, $\nabla^2 F(d^*)$ is bounded in probability. Let $I = I(\|d^* - d\| < \epsilon)$.

for the Stage 1 samples, and 16 new chains, each of length 1000, corresponding to the same hyperparameter values, for the Stage 2 samples. The plots in Figure 5-1 give graphs of the estimate (2-13) as $w$ and $g$ vary, from two different angles. These indicate that values for $w$ around 0.65 and for $g$ around 20 seem appropriate, while values of $w$ less than .3 and values of $g$ greater than 60 should be avoided. A side calculation showed that, interestingly, for $g = \max\{m, q^2\}$ $(= 225)$, the estimate of $B((w, g), (.65, 20))$ is less than .008 regardless of the value of $w$, so this choice should not be used for this data set. With the long chains used and the estimate that uses control variates, the Bayes factor estimates in Figure 5-1 are extremely accurate: root mean squared errors are less than 0.04 uniformly over the entire domain of the plot and considerably less in the convex hull of the skeleton grid (our calculation of the root mean squared errors used the closed-form expression for the Bayes factors based on complete enumeration). The figure took about a half hour to generate on an Intel 2.8 GHz Q9550 running Linux. (The accuracy we obtained is overkill and the figure can be created in a few minutes if we use more typical Markov chain lengths.)

Figure 5-1. Estimates of Bayes factors for the U.S. crime data. The plots give two different views of the graph of the Bayes factor as a function of $w$ and $g$ when the baseline value of the hyperparameter is given by $w = 0.5$ and $g = 15$. The estimate is (2-13), which uses control variates.

BIOGRAPHICAL SKETCH

Eugenia Buta was born in 1982 in Romania. In 2000, she was admitted to the University of Oradea, Romania, where she earned her Bachelor's degree in Mathematics-Informatics in 2004. She then joined the Department of Statistics at the University of Florida to pursue a Ph.D.
degree. During her graduate student years, she served as a Teaching Assistant for several undergraduate and graduate level courses in the Department of Statistics. She expects to receive her doctorate degree in Statistics in August 2010. combining (A.1) and (A.2), we obtain B(h, hi)) = NF(d)'N(d d) + [, N(a d)]'V2F(d*)[ N(d d)] 2 Nv + vn(B(h, hi, d)- B(h, hi)) = vc(h)' N( d) + n((h, hi, d) B(h, hi)) + op(1), (A.5) where the last line follows from the previously established fact that VF(d) a-a c(h), and the assumptions of Theorem 1 that n/N -- /q and that N(d d) converges in distribution (hence is Op(1)). Because the two sampling stages (for estimating d and B(h, hi)) are assumed to be independent, using the assumption that ,N(d d) -d /V(0, Z) in conjunction with the result n(B(h, hi, d) B(h, hi)) d V A(0, 72(h)) established in Theorem 1 of Doss (2010) under conditions Al and A2, we conclude that vf(B(h, hi, d) B(h, hi)) -d A'(O, qc(h)'Zc(h) + T2(h)). Proof of Theorem 2 We begin by writing (A.6) ^d- B( )) = ^( ) (d) + (d B(h, h d )), B(h,hi)) = ()- i(d) + where the second term on the right side of (A.6) was analyzed by Doss (2010) who showed that it is asymptotically normal, with mean 0 and variance o-2(h). Our plan is to show that 3(d) and )(d) converge in probability to the same limit, which we denote 01im. We then expand the first term on the right side of (A.6) by writing (A.7) (7(d) -/() vd(()- i,,3 + Ji (7~,, Iim) + (7m (d)) v/n(B(h, hi, l) with e'll-12A2/A e1l-'13A3/A1 ... e'1-' kAk/Al 0 0 ... 0 -e 1-q2A2/Am 0 ... 0 0 0 ... 0 0 -e"l -a3A3/A1 ... 0 0 0 ... 0 Vg = e 0 0 ... -el-kAk/Am 0 0 ... 0 0 0 ... 0 1 0 ... 0 0 0 ... 0 0 1 ... 0 0 0 ... 0 0 0 ... 1 and S given by (A.44), (A.46), and (A.45). O Proof of Remark 2 to Theorem 1 Following the lines of the proof of Theorem 1 with q = 1, we get as in (A.5) that /n([(h, hi, d) B(h, hi)) = c(h)' /( d) + /n(B(h, hi, d) B(h, hi)) + op(1), where c(h) is the constant column vector given in (A.3). 
This decomposition can be rewritten as

$$\sqrt{n}\big(\hat B(h, h_1, \hat d) - B(h, h_1)\big) = (c(h)', 1) \begin{pmatrix} \sqrt{n}(\hat d - d) \\ \sqrt{n}\big(\hat B(h, h_1, d) - B(h, h_1)\big) \end{pmatrix} + o_p(1).$$

Now note that in order to establish the asymptotic normality of $\sqrt{n}(\hat B(h, h_1, \hat d) - B(h, h_1))$, it is enough to show that

$$\begin{pmatrix} \sqrt{n}(\hat d - d) \\ \sqrt{n}\big(\hat B(h, h_1, d) - B(h, h_1)\big) \end{pmatrix}$$

is asymptotically normal. Using the $\eta$-notation introduced in the proof of Theorem 5, let

CHAPTER 4 REVIEW OF PREVIOUS WORK

Vardi (1985) introduced the following $k$-sample model for biased sampling. There is an unknown distribution function $F$, which we wish to estimate. For each weight function $w_l$, $l = 1, \ldots, k$, we have a sample $X_{l1}, \ldots, X_{l n_l} \stackrel{\text{iid}}{\sim} F_l$, where

$$F_l(x) = \frac{1}{W_l} \int_{-\infty}^{x} w_l(s)\, dF(s). \tag{4-1}$$

In (4-1), $W_l = \int_{-\infty}^{\infty} w_l(s)\, dF(s)$. The weight functions $w_1, \ldots, w_k$ are known, but the normalizing constants $W_1, \ldots, W_k$ are not. Vardi (1985) was interested in conditions that guarantee that a nonparametric maximum likelihood estimator (NPMLE) exists and is unique, and he gave the form of the NPMLE. (The conditions for existence and uniqueness involve issues regarding the supports of the $F_l$'s and do not concern us in the present paper.)

To estimate $F$, a preliminary step is to estimate the vector $W = (W_1, \ldots, W_k)$. Vardi (1985) and Gill et al. (1988) show that $W$ may be estimated by the solution to the system of $k$ equations

$$W_l = \int \frac{w_l(y)}{\sum_{s=1}^{k} a_s w_s(y)/W_s}\, dF_n(y), \quad l = 1, \ldots, k, \tag{4-2}$$

where $a_s = n_s/n$, $n = \sum_{s=1}^{k} n_s$, and $F_n$ is the empirical distribution function that gives mass $1/n$ to each of the $X_{li}$. Actually, the solution to (4-2) is not unique: it is trivial to see that if the vector $W$ solves (4-2), then so does $\alpha W$, for any $\alpha > 0$. However, it turns out that knowing $W$ only up to a multiplicative constant is all that is needed, and to avoid non-identifiability issues, we define the vector $V = (W_2/W_1, \ldots, W_k/W_1)$. Gill et al. (1988) show that if $\hat W$ is any solution to (4-2), and $\hat V$ is defined by $\hat V = (\hat W_2/\hat W_1, \ldots, \hat W_k/\hat W_1)$, then $n^{1/2}(\hat V - V)$ is asymptotically normal (Proposition 2.3 in Gill et al. (1988)).
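The system (4-2) can be explored numerically with a small Python sketch. The fixed-point iteration below, renormalized at each step so that the first component equals 1, is one common way to solve such self-consistency equations; it is an assumption of this example and not necessarily the scheme used by Vardi (1985) or Gill et al. (1988). The function name and the array layout are hypothetical.

```python
import numpy as np


def solve_vardi_system(weights, a, n_iter=500, tol=1e-12):
    """Iterate W_l <- mean_i [ w_l(x_i) / sum_s a_s w_s(x_i) / W_s ].

    weights: array of shape (k, n), weights[l, i] = w_l(x_i), where
             x_1, ..., x_n is the pooled sample from all k biased samples.
    a:       array of shape (k,), sample proportions a_s = n_s / n.
    Returns W normalized so that W[0] == 1, i.e. (1, V_2, ..., V_k).
    """
    k = weights.shape[0]
    W = np.ones(k)
    for _ in range(n_iter):
        # Denominator of (4-2) at every pooled observation.
        denom = (a[:, None] * weights / W[:, None]).sum(axis=0)
        W_new = (weights / denom).mean(axis=1)   # integral against F_n
        W_new = W_new / W_new[0]                 # fix the arbitrary scale
        converged = np.max(np.abs(W_new - W)) < tol
        W = W_new
        if converged:
            break
    return W
```

For instance, with $F$ uniform on $(0, 1)$, $w_1 \equiv 1$ and $w_2(x) = x$, the true ratio is $W_2/W_1 = 1/2$, and the iteration applied to samples from $F_1$ and $F_2$ should return a second component close to that value.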
Once an estimate of $W$ is formed, it is relatively easy to form an estimate $\hat F_n$ of $F$, and consequently of integrals of the form $\int h\, dF$. Gill et al. (1988)

dpg, ibt, vh.ibh, and humid.ibt (see Appendix D for a description of these variables). This model yields an out-of-sample RMSE of 4.5. Since the empirical Bayes choice of $w$ is relatively small ($\hat w = .13$), it is not surprising that the highest probability model includes only 4 variables, fewer than in any of the hyper-g models recommended by Liang et al. (2008), which all include at least 6 variables. But it is interesting to note that nevertheless, this model gives an RMSE that is essentially the same as the RMSE of any of the other models.

We applied the regeneration algorithm described in Appendix B to the chain corresponding to the hyperparameter $h = (.13, 75)$ deemed optimal by our previous analysis. We ran the chain until $R = 3000$ regenerations occurred, which took 85,000 iterations. From the output, we obtained estimates of the posterior inclusion probabilities for every one of the 44 predictors, and formed the corresponding 95% confidence intervals, using the regeneration method discussed in Chapter 3. These are displayed in Figure 5-5. Our choice of $R$ was arbitrary, but this choice should ultimately be based on the degree of accuracy one desires for the estimates of the quantities of interest. We considered our choice to be satisfactory for this particular analysis since the confidence intervals for the posterior inclusion probabilities for the 44 predictors have margins of error of at most 1%. Note that our chain regenerates relatively often, with the average length of a tour ($\bar N$) being about 28. Mykland et al. (1995) recommend that one check that the coefficient of variation $CV(\bar N) = (\text{Var}(\bar N))^{1/2}/E(\bar N)$ of the average tour length is less than .1 before deeming $\kappa^2$ to be estimated properly by $\hat\kappa^2$. Their criterion seems to be met here since the strongly consistent estimator $\widehat{CV}(\bar N) = \big(\sum_{t=1}^{R}(N_t - \bar N)^2/(R \bar N)^2\big)^{1/2}$ equals .02.
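The tour-based quantities mentioned above can be computed from regeneration output with a few lines of Python. The sketch below uses the standard regenerative ratio estimator and its variance estimate, together with the coefficient of variation estimator quoted in the text. The function name and return layout are hypothetical, and the inputs (tour lengths $N_t$ and tour sums $S_t$ of the function of interest) are assumed to have been extracted from the chain already.

```python
import numpy as np


def regenerative_summary(tour_lengths, tour_sums, z=1.96):
    """Regenerative point estimate, standard error, CI and CV of the mean tour length.

    tour_lengths: N_1, ..., N_R, the number of iterations in each tour.
    tour_sums:    S_1, ..., S_R, the sum of f over the states of each tour.
    """
    N = np.asarray(tour_lengths, dtype=float)
    S = np.asarray(tour_sums, dtype=float)
    R = len(N)
    N_bar = N.mean()
    estimate = S.sum() / N.sum()                      # ratio estimator of E f
    # Variance estimate in the regenerative central limit theorem.
    sigma2 = np.sum((S - estimate * N) ** 2) / (R * N_bar ** 2)
    std_err = np.sqrt(sigma2 / R)
    ci = (estimate - z * std_err, estimate + z * std_err)
    # Estimated coefficient of variation of the average tour length.
    cv = np.sqrt(np.sum((N - N_bar) ** 2) / (R * N_bar) ** 2)
    return estimate, std_err, ci, cv
```

A small value of the returned `cv` is the diagnostic referred to above; the margin of error of the interval is `z * std_err`.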
parameter, w, estimated by maximum likelihood. They show that if the null model has the largest marginal likelihood, then the MLE of w is 0 and if the full model has the largest marginal likelihood, then the MLE of w is 1. Each of these gives rise to the degeneracy discussed above. Their result is not true in our setup, in which we do not put a prior on g, but rather estimate both w and g by maximum likelihood. To see this, consider a very simple example, in which Y = (2, 1, 9, 5)' and X = 3 3 7 10.5 We have R2 = 0.52, R2 = 0.51, R2 = 0.40, and R2 = 0, Now y=(1,1) y=(1,0) y=(0,1) y=(O,O) (1 + g(1 R2))3/2 where c(Y) does not depend on g or 7. Therefore, (1 + g)(3-q,)/2 (g, vv) = argmax( 9,) w (1 w)q-q ( + m (.5, .2). 7 (1 + g(1 R2))3/2 From equation (38) of Scott and Berger (2010) we know that under the Zellner-Siow null prior, we have P(Y 17y) P(Y 17 = (0, 0)) S (1 + g)( 3- R)/2 I (1+ g(1 -R))3/2 .72 < 1 .58 < 1 .31 < 1 for 7 = (1, 0) for 7 = (0, 1) for = (1, 1) obtain functional weak convergence results of the sort n1/2 (f h dFn f h dF) -d Z(h), where Z is a mean-0 Gaussian process indexed by h e H, where H is a large class of square integrable functions. It is not difficult to see that our setup is the same as that considered in Vardi (1985) and Gill et al. (1988): their F corresponds to our Vh,y; their w1 to vh,/lh; Fi to vh,,y; Wi to mh,/mh; and V to d. But there are major differences between our framework and theirs. They deal with iid samples, and so can use empirical process theory, whereas we deal with Markov chains, for which such a theory is not available. In their framework, the samples arise from some experiment, and they are seeking optimal estimates given data that is given to them. In contrast, our samples are obtained by Monte Carlo, so we have control over design issues. 
In particular, we are concerned with computational efficiency, in addition to statistical efficiency; hence our interest in the two-stage sampling method for preliminary estimation of $d$ and for enabling the use of control variates.

Geyer (1994) also deals with the setup in Vardi (1985) and Gill et al. (1988), i.e. the $k$-sample model for biased sampling, and he also considers the problem of estimating $d$. As mentioned in Section 2.1, his estimator is obtained by maximizing (2-6), and the solution is numerically identical to the solution to the system (4-2). However, he considers the situation where each of the $k$ samples is a Markov chain, as opposed to an iid sample, and assuming that the chains satisfy certain mixing conditions, he obtains a central limit theorem for $n^{1/2}(\hat d - d)$. Naturally, the variance of the limiting distribution is different from the variance obtained in Gill et al. (1988), and is typically larger.

In Section 7 of their paper Meng and Wong (1996) consider the situation where for each $l = 1, \ldots, k$, we have an iid sample from the density $f_l = q_l/m_l$, where the functions $q_1, \ldots, q_k$ are known, but the normalizing constants $m_1, \ldots, m_k$ are not, and we wish to estimate the vector $(m_2/m_1, \ldots, m_k/m_1)$. Without going into detail, we mention that they develop a family of "bridge functions" and show that, in the iid setting, the optimal bridge function gives rise to an estimate identical to that of Geyer (1994). They obtain their

For $t \ne j$, we have, on the set $\{\|d^* - d\| < \epsilon\}$,

$$[\nabla^2 F(d^*)]_{t-1,j-1} = \frac{2}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{a_j a_t\, \nu_h(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})\, \nu_{h_t}(\theta_i^{(l)})}{(d_j^*)^2 (d_t^*)^2 \big[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s^*\big]^3}$$

$$\le \frac{2}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{a_j a_t\, \nu_h(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})\, \nu_{h_t}(\theta_i^{(l)})}{(d_j - \epsilon)^2 (d_t - \epsilon)^2 \big[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/(d_s + \epsilon)\big]^3}$$

$$\xrightarrow{\text{a.s.}} \frac{2}{(d_j - \epsilon)^2 (d_t - \epsilon)^2} \sum_{l=1}^{k} a_l \int \frac{a_j a_t\, \nu_h(\theta)\, \nu_{h_j}(\theta)\, \nu_{h_t}(\theta)}{\big[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/(d_s + \epsilon)\big]^3}\, \nu_{h_l,y}(\theta)\, d\theta$$

$$= \frac{2 B(h, h_1)}{(d_j - \epsilon)^2 (d_t - \epsilon)^2} \int \Big\{ \frac{a_j a_t\, \nu_{h_j}(\theta)\, \nu_{h_t}(\theta) \sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/d_s}{\big[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/(d_s + \epsilon)\big]^3} \Big\}\, \nu_{h,y}(\theta)\, d\theta. \tag{A.4}$$

Note that the expression inside the braces in (A.4) is clearly bounded above by a constant, so expression (A.4) is finite.
Similarly, for $t = j$, on the set $\{\|d^* - d\| < \epsilon\}$,

$$\big|[\nabla^2 F(d^*)]_{j-1,j-1}\big| = \Big| \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{2 a_j\, \nu_h(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})}{(d_j^*)^3 \big[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s^*\big]^2} \Big( \frac{a_j \nu_{h_j}(\theta_i^{(l)})/d_j^*}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s^*} - 1 \Big) \Big|$$

$$\le \frac{2}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{a_j\, \nu_h(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})}{(d_j^*)^3 \big[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s^*\big]^2}$$

$$\le \frac{2}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{a_j\, \nu_h(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})}{(d_j - \epsilon)^3 \big[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/(d_s + \epsilon)\big]^2}$$

$$\xrightarrow{\text{a.s.}} \frac{2 B(h, h_1)}{(d_j - \epsilon)^3} \int \frac{a_j\, \nu_{h_j}(\theta) \sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/d_s}{\big[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/(d_s + \epsilon)\big]^2}\, \nu_{h,y}(\theta)\, d\theta.$$

Again, this limit is a finite constant by the same reasoning we used earlier. Since $P(\|d^* - d\| < \epsilon) \to 1$, it follows that $\nabla^2 F(d^*)$ is bounded in probability. Now, by

total sample size. This fact limits the total sample size and hence the accuracy of the estimates. An increase in accuracy can be achieved essentially for free by estimating $d$ from long preliminary runs in Stage 1. The cost incurred in Stage 1 is minimal because generating the chains is typically extremely fast, and has to be done only once.

Doss (2010) also developed an improvement of (2-5) that is based on control variates, and showed that this improvement is also consistent and asymptotically normal. Unfortunately, both of these estimates require us to know the vector $d$ exactly. One may be tempted to believe that using an estimated $d$ instead of the true $d$ will not inflate the asymptotic variance; indeed, the literature has errors regarding this point, and this is discussed in Appendix A. Here we provide a careful analysis of the increase in the asymptotic variance that results when we use an estimate of $d$. A more detailed summary of the main contributions of the present work is as follows.
1. We develop a complete characterization of the asymptotic distribution of both the estimate (2-5) and the improvement that uses control variates for the realistic case where $d$ is estimated from Stage 1 sampling (Theorems 1 and 2).
2. We develop an analogous theory for the problem of estimating a family of posterior expectations $E_h(f(\theta) \mid Y = y)$, $h \in \mathcal{H}$ (Theorems 3, 4, and 6).
3.
We discuss estimation of the variance, and show how variance estimates can be used to guide selection of the skeleton points $h_1, \ldots, h_k$.
4. We apply the methodology to the problem of Bayesian variable selection discussed earlier. In particular, we show how our methods enable us to select good values of $h = (w, g)$ and to also see how the probability that a given variable is included in the regression varies with $(w, g)$.

2.1 Estimation of Bayes Factors

Here, we analyze the asymptotic distributional properties of the estimator that results if in (2-5) we replace $d$ with an estimate. Geyer (1994) proposes an estimator for $d$ based on the "reverse logistic regression" method and Theorem 2 therein shows that this estimator is asymptotically normal when the samplers used satisfy certain regularity conditions. This estimator is obtained by maximizing with respect to $d_2, \ldots, d_k$ the log

CHAPTER 5 ILLUSTRATION ON VARIABLE SELECTION

There exist many classes of problems in Bayesian analysis in which the sensitivity analysis and model selection issues discussed earlier arise; see Chapter 6. Here we give an application involving the hierarchical prior used in variable selection in the Bayesian linear regression model discussed in Chapter 1. This chapter consists of three parts. First we discuss an MCMC algorithm for this model and state some of its theoretical properties; then we discuss the literature on selection of the hyperparameter $h$; and finally we present two detailed illustrations of our methodology.

5.1 A Markov Chain for Estimating the Posterior Distribution of Model Parameters

The design of MCMC algorithms for estimating the posterior distribution of $\theta$ under (1-1) revolves around the generation of the indicator variable $\gamma$. We now briefly review the algorithms for running a Markov chain on $\gamma$ that are proposed in the literature, and the main issues of implementation of these algorithms. Raftery et al.
(1997) and Madigan and York (1995) discuss the following Metropolis-Hastings algorithm for generating a sequence 7(l), 7(2),.... If the current state is 7, a new state 7* is formed by selecting at random a coordinate, setting 7* = 1 7j, and 7~ = 7k for k z j. The proposal 7* is then accepted or rejected with the Metropolis-Hastings acceptance probability min{p(7* I Y)/p(7 Y), 1}. Madigan and York (1995) call this algorithm MC3. Clyde et al. (1996) propose a modification of this algorithm in which we do not select a component at random and update it, but instead sequentially update all components. They call this the "Hybrid Algorithm." (Strictly speaking, this is a Metropolized Gibbs sampler, and is not actually a Metropolis-Hastings algorithm.) Smith and Kohn (1996) propose a Gibbs sampler which simply cycles through the coordinates 7, one at a time. George and McCulloch (1997) show that when compared with MC3, the Gibbs sampler algorithm gives estimates with smaller standard error, and is also slightly faster, at least in several simulation studies they conducted. expectations in the next-to-last equality because k -Ar Nr [1 E(pr(Or), 1o))] + ~ Ai NNiE(pr(O), ro)) V N-r /1 V'/ I/r NAr1 k hr( hr,y (0) dO + N AE (pr (0), o)) -ZN 1- : rl vh 5 (e)e/ Ir 1+0)) s=l Vh(O) s /=1 I/r N= J kh h, ,y (0) dO + vhA,(E (pr (0i()o, o0)) /k1 /(1 Ir IIr k k = -- ar e77- / k ) l-?Ir khr ) r (h0 )h ey(0) d7l ( + '\/N AiE (pr (0('), T1o)) /=1 s=m1 hh,(0)es mhe /=1 lor Ior k k /Ar r mhi e77l E(pr,(On'), 10)) + V/N AIE(pr(O!i')0, 1)) /=1 mhr /=1 Ijr Ijr =0. The asymptotic normality of V/N(T1o)/vN now follows from the Cramer-Wold device. In view of this convergence in distribution and the convergence result in (A.42), (A.41) gives 1 1 N(1N 10) = Op(1)Op(1) + B VlN(lo) = B 1 VN(l/o) + Op(l). Therefore, we can now easily see that condition (A.37) is also satisfied by the first k components of U, i.e. 
v/(1^N 1 /o) because, as we have shown in (A.43), every element of V/N(r/o)/V/- is a linear combination of the form (A.37). with 711 = Var(Y']) + 2 1 Cov(Y'], Y]1,), 712 =721= Cov(Y1',, Y1,) + -, [cov(Y',, Y1+g,1) + Cov(Yi,,, Yf] ), 722 = Var(Yi,,) + 2 =1Cov(Yi,,, YI+,,i). Since 7[1](h, d) is given by the ratio (2-15), in view of (A.20), its asymptotic distribution may be obtained by applying the delta method to the function g(u, v) = u/v. This gives v(7Q'f(h, d) I-[](h)) -d A/(0, p(h)), where p(h) = Vg(l[ l(h)B(h, hi), B(h, hl))' F(h) Vg(l[f](h)B(h, hi), B(h, hl)), (A.21) with Vg(u, v) = (/v, -u/v2)'. We now consider the first term on the right side of (A.19). Define k n1 f (o(i)) lh (0()) L(u)= 1=1 '=1 C=1 sIVhs )/US k n h ( )) /=1 i=1 s=l s Vhs )/ for u = (u2,... k)' with ul > 0 for / = 2,..., k. Then v,=1C /1 yf] L(d) = f](h, d)= 1 -'1 l= , k,=1 yni-1 y,l and vz(l([(h, !) I[](h, d)) = (-(L(d) L(d)). Now, by the Taylor series expansion of L about d we get vn(I[](h, d) 1f](h, d)) = vVL(d)'(d d) + -(d d)'V2L(d*)( d), where d* is between d and d. First, we show that the gradient VL(d) converges almost surely to a finite constant vector by proving that each one of its components, APPENDIX B DETAILS REGARDING GENERATION OF THE MARKOV CHAIN FROM CHAPTER 5 To generate a Markov chain of length n on 0 = (7, a, 3o, ,y) for a fixed choice of the hyperparameter h = (w, g), we use the following sampling scheme. First, we pick an arbitrary value for 70). Then we draw o2(0), ~o), and (O) as indicated in Steps 2-4 below (with i = 0). To generate the rest of the chain, we iterate through Steps 1-4 described below for each i = 1,..., n 1. Step 1 In this stage we generate the binary vector 7(') by using a Gibbs sampler on < = (1, 72,... 7q). 
Thus, we first generate 7 () 2'-l), ...7 '-), Y according to the following Bernoulli distribution: P(71 7j1, Y) oc p(7 Y) o (1 + g)-q 2S- -1)[1+ g(1 R2)] -(m-1)/2 W ) (B.1) where, recall that 52 = 1i( Y)2, and R2 is the coefficient of determination of model 7; see (5-1). Similarly, generate 7(') from p(2 7 ), ('-i ..., (i-l), Y), and so on for 3 ), ..., 7q'). (This Gibbs sampler is not identical to that of Smith and Kohn (1996) in that in our model the prior on 3o is a flat prior, whereas Smith and Kohn (1996) use a proper prior on 3o.) Step 2 Generate 02(') 7(,), Y according to the density x Jp( Y 7,2,/ 0,/)P/)P(/a '7,2) d3o d p(2) (n 7)-2 ex 22(g + 1) 1 + (l R27m dn an inverse1)ex gamma density. 2 an inverse gamma density. The disappearance of the likelihood function in (2-1 a) is very convenient because its computation requires considerable effort in some cases (for example, when we have missing or censored data, the likelihood is a possibly high-dimensional integral). Note that the second average in (2-2) is an estimate of mh/mh,, i.e. the Bayes factor B(h, hi). Ideally, we would like to use the estimates in (2-2) for multiple values of h using only a sample from the posterior distribution corresponding to the fixed hyperparameter value hi. But, when the prior Vh differs from Vh, greatly, the two estimates in (2-2) are unstable because of the potential that only a few observations will dominate the sums. Their ratio suffers the same defect. A natural approach for dealing with the instability of these simple estimates is to choose k hyperparameter values hi,..., hk e T-i and to replace Vh with a mixture Es=l asVhs, where as > 0, for s = 1,..., k, and k= as = 1. For concreteness, consider the estimate of the Bayes factor. To estimate B(h, hi) using a sample 01,..., 0n (iid or ergodic Markov chain output) from the posterior mixture v. 
$:= \sum_{s=1}^{k} a_s \nu_{h_s,y}$, we are tempted to write

$$\frac{1}{n}\sum_{i=1}^{n} \frac{\nu_h(\theta_i)}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i)} = \frac{m_h}{m_{h_1}} \frac{1}{n}\sum_{i=1}^{n} \frac{\ell_{\theta_i}(y)\, \nu_h(\theta_i)/m_h}{\sum_{s=1}^{k} a_s\, \ell_{\theta_i}(y)\, \nu_{h_s}(\theta_i)/m_{h_1}} \tag{2-3a}$$

$$= B(h, h_1)\, \frac{1}{n}\sum_{i=1}^{n} \frac{\nu_{h,y}(\theta_i)}{\sum_{s=1}^{k} a_s \nu_{h_s,y}(\theta_i)\, d_s}, \tag{2-3b}$$

where $d_s = m_{h_s}/m_{h_1}$, $s = 1, \ldots, k$. Thus, in order to have the mixture $\sum_{s=1}^{k} a_s \nu_{h_s,y}$ in the denominator in (2-3b) (which would imply that the average in (2-3b) converges to 1, so that (2-3b) converges to $B(h, h_1)$), we need to start out with $\sum_{s=1}^{k} a_s \nu_{h_s}/d_s$ in the denominator of the left side of (2-3a). Unfortunately, this requires the condition that we know the vector $d = (d_2, \ldots, d_k)'$. Under this condition, if $\theta_1, \ldots, \theta_n$ are drawn from the mixture $\sum_{s=1}^{k} a_s \nu_{h_s,y}$, instead of from $\nu_{h_1,y}$, we may form

$$\frac{1}{n}\sum_{i=1}^{n} \frac{\nu_h(\theta_i)}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i)/d_s}. \tag{2-4}$$

Empirical Bayes (EB) Methods

In global EB procedures, an estimate of $g$ common for all models is derived from its marginal likelihood; see George and Foster (2000). In local EB, an estimate of $g$ is derived for each model; see Hansen and Yu (2001). Unfortunately, the EB method is in general computationally demanding because the likelihood is a sum over all $2^q$ models $\gamma$, so it is practically feasible only for relatively small values of $q$. Liang et al. (2008) show that the EB method is consistent in the frequentist sense: if $\gamma_T$ is the true model, then if $g$ is chosen via the EB method, the posterior probability $P(\gamma = \gamma_T \mid Y)$ converges to 1 as $m \to \infty$. See Theorem 3 of Liang et al. (2008) for a precise statement. (This result refers only to the case where $w$ is fixed at 1/2, and only $g$ is estimated.)

Liang et al. (2008) propose an EM algorithm for estimating $g$ in the global EB setting. In their algorithm, the model indicator and $\sigma$ are treated as missing data. While their approach is certainly useful, there are some problems associated with it. Each step in the EM algorithm involves a sum of $2^q$ terms. Unless $q$ is relatively small, complete enumeration is not possible, and Liang et al. (2008) propose summing only over the most significant terms.
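To make the cost of complete enumeration concrete, here is a hedged Python sketch that evaluates the marginal likelihood of $h = (w, g)$, up to a constant not depending on $h$, by summing over all $2^q$ models. It assumes the closed-form per-model factor $(1+g)^{(m-1-q_\gamma)/2}\,[1+g(1-R_\gamma^2)]^{-(m-1)/2}$ for Zellner's g-prior given in Liang et al. (2008), together with independent Bernoulli($w$) inclusion indicators. The function names and the grid search are illustrative choices, not the author's code.

```python
import itertools

import numpy as np


def r_squared(X, y, gamma):
    """Coefficient of determination of the model with the columns flagged in gamma."""
    yc = y - y.mean()
    cols = [j for j, on in enumerate(gamma) if on]
    if not cols:
        return 0.0                        # null model: intercept only
    Xc = X[:, cols] - X[:, cols].mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    resid = yc - Xc @ coef
    return 1.0 - (resid @ resid) / (yc @ yc)


def log_marginal(X, y, w, g):
    """log of sum over gamma of prior(gamma) * Bayes factor of gamma vs null model."""
    m, q = X.shape
    terms = []
    for gamma in itertools.product([0, 1], repeat=q):   # all 2^q models
        q_gamma = sum(gamma)
        r2 = r_squared(X, y, gamma)
        log_prior = q_gamma * np.log(w) + (q - q_gamma) * np.log1p(-w)
        log_bf = (0.5 * (m - 1 - q_gamma) * np.log1p(g)
                  - 0.5 * (m - 1) * np.log1p(g * (1.0 - r2)))
        terms.append(log_prior + log_bf)
    terms = np.array(terms)
    return terms.max() + np.log(np.exp(terms - terms.max()).sum())


def empirical_bayes_grid(X, y, w_grid, g_grid):
    """Return the grid point (w, g) with the largest marginal likelihood."""
    best = max((log_marginal(X, y, w, g), w, g) for w in w_grid for g in g_grid)
    return best[1], best[2]
```

Each call to `log_marginal` fits $2^q$ regressions, which is why this approach is limited to small $q$.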
However, determining which terms these are may be very difficult in some problems. Also, the EM algorithm gives a single point estimate. What we do is different: we estimate the Bayes factor for all $g$ (and $w$). This enables us in particular to estimate the maximizing values; but it also allows us to rule out large regions of the hyperparameter space. Additionally, our method allows us to carry out sensitivity analysis. We also mention very briefly that if we are interested only in the maximizing values, then the method proposed in the present paper can be used to form a stochastic search algorithm. The basic requirement for such algorithms is that we know the gradient $\partial B(h, h_1)/\partial h$. But the same methodology used to estimate $B(h, h_1)$ can also be used to estimate its gradient. For example, in the simple estimate (2-7), we just replace $\nu_h(\theta_i^{(l)})$ by $\partial \nu_h(\theta_i^{(l)})/\partial h$.

is a severe computational burden caused by the requirement that we handle a very large number of values of $h$.

The main contributions of this work are the development of computationally efficient schemes for estimating large families of posterior expectations and Bayes factors that are based on a combination of MCMC, importance sampling, and the use of control variates, therefore providing an answer to questions (A) and (B) raised earlier. We also provide theory to support the methods we propose.

Chapter 2 describes our methodology for estimating Bayes factors and posterior expectations, and gives statements of theoretical results associated with the methodology. In Chapter 3 we discuss estimation of the variance of our estimates. Chapter 4 gives a review of the relevant literature, along with a discussion of how the present work fits in the context of previous related work. In Chapter 5 we return to the problem of variable selection in Bayesian linear regression.
There, first we consider a Markov chain algorithm that generates a sequence $(\gamma^{(1)}, \sigma^{(1)}, \beta_0^{(1)}, \beta_\gamma^{(1)}), (\gamma^{(2)}, \sigma^{(2)}, \beta_0^{(2)}, \beta_\gamma^{(2)}), \ldots$, describe theoretical properties of this chain, and show how to implement the methods developed in Chapter 2 to answer questions (A) and (B) posed earlier. We also illustrate our methodology on two data sets. Appendix A contains the proofs of the theorems stated in Chapter 2, Appendix B provides details regarding the generation of our Markov chain and its computational complexity, and Appendix C gives technical details regarding the theoretical properties of the Markov chain.

asymptotic normality of $\hat d$ and $\hat e$. Let us first define separable and inseparable Monte Carlo samples as introduced by Geyer (1994). The Monte Carlo sample $\{\theta_i^{(l)}\}_{i=1}^{N_l}$, $l = 1, \ldots, k$, is said to be separable if there are disjoint subsets $L$ and $M$ of $\{1, \ldots, k\}$ such that for each $\theta$ in the sample and each $l \in L$ and $m \in M$ either $\nu_{h_l,y}(\theta)$ or $\nu_{h_m,y}(\theta)$ is zero. A Monte Carlo sample that is not separable is said to be inseparable.

Theorem 5 Assume that the Monte Carlo sample from Stage 1, $\{\theta_i^{(l)}\}_{i=1}^{N_l}$, $l = 1, \ldots, k$, is inseparable, and the following conditions hold:
B1 for each $l = 1, \ldots, k$, the chain $\{\theta_i^{(l)}\}_{i=0}^{\infty}$ is geometrically ergodic;
B2 for each $l = 1, \ldots, k$, there exists $\epsilon > 0$ such that $E_{\nu_{h_l,y}}\big(|f|^{2+\epsilon}(\theta)\big) < \infty$.
Then

$$\sqrt{N} \begin{pmatrix} \hat d - d \\ \hat e - e \end{pmatrix} \xrightarrow{d} \mathcal{N}(0, V), \tag{2-19}$$

where $V$ is given in equation (A.48) in the Appendix.

Theorem 6 If the conditions stated in Theorem 4 and (2-19) hold, then the control variate adjusted estimator of $I^{[f]}(h)$ defined above, centered at $I^{[f]}(h)$ and scaled by $\sqrt{n}$, converges in distribution to a normal distribution with mean 0 and variance given in equation (A.57) in the Appendix.

We investigated the performance of our methodology using a split of the data into training and validation samples identical to the one used by Liang et al. (2008). We took the baseline hyperparameter to be the pair $h_1 = (w_1, g_1) = (.2, 50)$ and the skeleton grid of hyperparameters to consist of the 16 pairs $(w, g) \in \{.1, .2, .3, .5\} \times \{15, 50, 100, 150\}$.
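The role of a skeleton grid can be illustrated on a toy problem where everything is available in closed form. The Python sketch below uses the estimate (2-4) with samples pooled across the skeleton posteriors and $a_s = n_s/n$. The model is an assumption made purely for illustration: $y \mid \theta \sim N(\theta, 1)$ with prior $\nu_h = N(0, h)$, so that $m_h$ is the $N(0, 1 + h)$ density at $y$. The vector $d$ is taken as known, and iid posterior draws stand in for Markov chain output. None of the names come from the thesis or its software.

```python
import numpy as np


def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return np.exp(-0.5 * x * x / var) / np.sqrt(2.0 * np.pi * var)


def bayes_factor_estimate(h, skeleton, d, samples):
    """Estimate B(h, h_1) from samples drawn under each skeleton posterior.

    skeleton: prior variances h_1, ..., h_k.
    d:        d_s = m_{h_s} / m_{h_1}, with d[0] == 1.
    samples:  list of arrays; samples[s] holds draws from the posterior under h_s.
    """
    n_s = np.array([len(x) for x in samples])
    theta = np.concatenate(samples)
    # n * sum_s a_s nu_{h_s}(theta) / d_s with a_s = n_s / n.
    denom = sum(n * normal_pdf(theta, hs) / ds
                for n, hs, ds in zip(n_s, skeleton, d))
    return np.sum(normal_pdf(theta, h) / denom)


def exact_bayes_factor(h, h1, y):
    """Closed-form B(h, h1) = m_h / m_{h1} in the toy model."""
    return normal_pdf(y, 1.0 + h) / normal_pdf(y, 1.0 + h1)
```

Because the mixture in the denominator covers every skeleton prior, the importance weights stay bounded for values of $h$ between the skeleton points, which is the stability property that motivates the skeleton construction.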
To identify the value of $h$ that maximizes the Bayes factor $B(h, h_1)$, we estimated this quantity for a grid of the 750 values of $h$ obtained when $w$ ranges from .01 to .5 by increments of .02, and $g$ ranges from 5 to 150 by increments of 5. These estimates were based on 16 chains each of length 10,000, corresponding to the skeleton grid of hyperparameter values for the Stage 1 samples, and 16 new chains, each of length 1000, corresponding to the same hyperparameter values, for the Stage 2 samples. Figure 5-4 gives a plot of these estimates of $B(h, h_1)$ as a function of $w$ and $g$. The standard error is less than .014 over the entire range of the plot.

Figure 5-4. Estimates of Bayes factors for the ozone data. The plots give two different views of the graph of the Bayes factor as a function of $w$ and $g$ when the baseline value of the hyperparameter is given by $w = .2$ and $g = 50$.

The value of $h$ at which the maximum of $B(h, h_1)$ is attained is $\hat h = (.13, 75)$. We ran a new chain of length 100,000 corresponding to this value of $h$, and based on it we estimated the highest probability model to be the model containing the 4 variables

I dedicate this to my brother Florin.

which is exactly the same as the regeneration success probability for the chain on $\Gamma$. Hence, the augmented $\theta$-chain and the $\gamma$-chain regenerate simultaneously. Note that sampling from $d_1$ (which is needed to start the regeneration) is trivial. We first sample $\gamma$ from $d$, which is done by sampling from $v(\gamma^*, \cdot)$ and retaining $\gamma$ only if it is in $D$ (and to do this we do not need to know the normalizing constant $c(\gamma^*)$); then we sequentially sample $\sigma^2$, $\beta_0$, and $\beta_\gamma$ from $p(\sigma^2 \mid \gamma, Y)$, $p(\beta_0 \mid \gamma, \sigma^2, Y)$, and $p(\beta_\gamma \mid \gamma, \sigma^2, \beta_0, Y)$, respectively.

$= 2, \ldots, k$. By Taylor series expansion, we have

$$\sqrt{n}\big(K(\hat d) - K(d)\big) = \sqrt{n}\, \nabla K(d)'(\hat d - d) + \frac{\sqrt{n}}{2} (\hat d - d)' \nabla^2 K(d^*) (\hat d - d), \tag{A.17}$$

where $d^*$ is between $d$ and $\hat d$. We now focus our attention on $\nabla K(d)$. For $t = 2, \ldots, k$ we have
wh,(1-))ath"(I) - !3j,lim d?(Ek1 (0))/ds)2 j=22 dt s=l a Vh, jot + 3t,lim 2 yCk a.s. B(h, h) [ 2k dt Es:= j#t k + ij,lim j=2 jot h, (O 1) =1 as Vh ())/ds a.vh(O) V.h,y (O) dO 1 asvh ,(O)/ds at Vh (0) d 1s= asVh, (0)/ds d asth (0) d?~ =1as h5(0)/ds SVhj,y(O) dO 1 Shh,y(O) dO + 3t,lim- dt at~h, (0) d Z~ ,asVh(O)/ds at h, ( d s=1 asVh, (O)ds * Vh,,y(O) dO * Vhl,y(O) dO B(h, h) k Oj=,im j=2 atvht () Ss=, asvh, (O)ds I /3j,iim Vh,y(O) dO kat h, () vy() dO d? s, asvh,(O)/ds atvh, (0e Sk at vht(O yh,y() dO + lt,lim dt s=, asvh, (O)ds := [w(h)]t-l, [VK(d)]t-1 1 k nl 1 /=1 i=1 lim (Ol)")/dt Vh (O1l)) )tVht (Oi)) t,Im ((O)/ds) d (s=1 asVh, H) ds) - /t,lim J + 3t,lim / (7i where u = (2, ... Uk)', and ui > 0 for / ,,m m) = VnVK(d)'(d (A. 18) LIST OF TABLES Table page 5-1 Posterior inclusion probabilities for the fifteen predictor variables in the U.S. crime data set, under three models. Names of the variables are as in Table 2 of Liang et al. (2008) (but all variables except for the binary variable S have been log transformed)... ................ ............ 49 D-1 The 44 predictors used in the ozone illustration. The symbol "." represents an interaction .................... .... ............. 99 By the assumed independence of the k Markov chains, we have n (Z1/2 l k= : ,1) ]_ d (, (O l.), Er = rn f1. n ~ ()31,3 /= li We now show that =1 /i1 (A.30) rf](h)B(h, h1) B(h, h) ) and to do this we write k 5- alE(ulf]) I E(]=1- al =1 \E(YI,) - k,=l aE(Y I,,) (k=1 a/E(Yi,i) -k 30[f] E(Z[f]O)) 2j=1 PjlimE Z,/ J/1 ~-Jk=2jc ,limE(Zli) ) Y-- k,0[f] [ m~ k k1 E f(Z j[]')) ] - kj=2 Pj,lim[ =1 al E (Zi,)J E= 1 a/E(YI,E ) Ilf](h)B(h, hi)) B(h, hi) the next-to-last equality being a consequence of the readily verifiable fact that k 0 and aaE(zu) /=1 From (A.30) and (A.28) we conclude that I A (O, Z-l]). where (A.28) (A.29) k /=1 k a/ E(Z 1lu)) /=1 forj = 2, ..., k. (A.31) n1/2s [ f],, (I'f](h)B(h, SIMB(h, h ) intervals for the posterior expectation of 1(0). Suppose that E(12(0)) < oo. 
Then since the chain is uniformly ergodic, Corollary 4.2 of Cogburn (1972) implies that, with Var(/(0o)) and Cov(/(0o), 1(0O)) calculated under the assumption that 60 has the stationary distribution, the series K2 = Var(/(0o)) + 2 Cov(/(0o), 1(0)) (5-2) J=1 converges absolutely, and if K2 > 0, then with 0o having an arbitrary distribution, the estimate I/= (1/n) 'jo1 /(0) satisfies n1/2 n( E[(0) ) | d y] (0, K2) as n oo. The Markov chain driven by K is also regenerative, and in Appendix C we give an explicit minorization condition that can be used to introduce regenerations into the chain. Functions that run the chain and implement the regeneration scheme are provided in the R package bvslr, available from http: //www.stat.ufl.edu/~ebuta/BVSLR. In Chapters 1 and 2, vh and Vh,y refer to the prior and posterior densities, and all estimates in Chapter 2 involve ratios of these prior densities. In the Bayesian linear regression model that we are considering here, the priors vh on (7-, a, 03o, ) are actually probability measures on {0, 1}q x (0, oo) x Rq+', which in fact are not absolutely continuous with respect to the product of counting measure on {0, 1}q and Lebesgue measure on (0, oo) x Rq+1. For hi = (wl, gi) and h2 = (w2, g2), the Radon-Nikodym derivative of vh with respect to Vh2 is given by dVh, W1 ) 1 q- W1 X9 (7; 0, g72 (X7IX7)-1) (,17, 0,wi) ) =-- w ; 2(x7/X )-1) (5-3) dvh2 W2 1- W2 q, (7; 0, 922XX)-1) where qy(u; a, V) is the density of the q,-dimensional normal distribution with mean a and covariance V, evaluated at u (Doss (2007)). It is immediate that all formulas in Chapter 2 remain valid if ratios of the form Vh(O)/Vhz(0) (see, e.g., equation (2-2)) are replaced by the Radon-Nikodym derivative [dvh/dvh,](O). Fortunately, evaluation of (5-3) Proof of Theorem 5 We begin by reviewing some related notation and results established by Geyer (1994). Recall that Nj denotes the length of the jth chain in Stage 1 samples, N = Z-1i N,, and A, = Nj/N. 
Using the notation j = log mh + log(A,), forj= 1,...,k, Geyer's (1994) reverse logistic regression estimator = (1i..., ^k) for the unknown vector Tr is obtained by maximizing the log quasi-likelihood k NI IN(q) = log (p/(0'(0, r)), (A.34) /=1 i=1 where 0, = V ()e" for/= 1..., k. (A.35) Es=(,() el Theorem 1 of Geyer (1994) states that this maximizer is unique up to an additive constant if the Monte Carlo sample is inseparable. Geyer (1994) also proves that, under certain conditions, v/N(^N rio) is asymptotically normal, where ryo is defined by 1 k [Tolj = Tj-YE s, j= 1..., k. s=1 Our proof is structured as follows. First, we extend Geyer's (1994) proof in order to show that the 2k-dimensional vector vN( ( =: : U (A.36) U(2k) ) is asymptotically normal. Then, by getting back to the d notation through a transformation, we show that our vector of interest d |

Full Text |

COMPUTATIONAL APPROACHES FOR EMPIRICAL BAYES METHODS AND BAYESIAN SENSITIVITY ANALYSIS

By EUGENIA BUTA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 2010

© 2010 Eugenia Buta

I dedicate this to my brother Florin.

ACKNOWLEDGMENTS

I would like to thank my advisor, Professor Hani Doss, for the invaluable help and guidance with writing this dissertation. I am also thankful to Professors Farid AitSahlia, George Casella, and James Hobert for serving on my supervisory committee and offering me helpful comments. In addition, I owe a debt of gratitude to the Department of Statistics at the University of Florida. I greatly appreciate the chance I have been given to come here and learn Statistics from many exceptional teachers, all while benefiting from the kindness and support of other students and staff.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER
1 INTRODUCTION
2 ESTIMATION OF BAYES FACTORS AND POSTERIOR EXPECTATIONS
  2.1 Estimation of Bayes Factors
  2.2 Estimation of Bayes Factors Using Control Variates
  2.3 Estimation of Posterior Expectations
  2.4 Estimation of Posterior Expectations Using Control Variates
  2.5 Estimation of Posterior Expectations Using Control Variates With Estimated Skeleton Bayes Factors and Expectations
3 VARIANCE ESTIMATION AND SELECTION OF THE SKELETON POINTS
  3.1 Estimation of the Variance
  3.2 Selection of the Skeleton Points
4 REVIEW OF PREVIOUS WORK
5 ILLUSTRATION ON VARIABLE SELECTION
  5.1 A Markov Chain for Estimating the Posterior Distribution of Model Parameters
  5.2 Choice of the Hyperparameter
  5.3 Examples
    5.3.1 U.S. Crime Data
    5.3.2 Ozone Data
6 DISCUSSION
APPENDIX
A PROOF OF RESULTS FROM CHAPTER 1
B DETAILS REGARDING GENERATION OF THE MARKOV CHAIN FROM CHAPTER 5
C PROOF OF THE UNIFORM ERGODICITY AND DEVELOPMENT OF THE MINORIZATION CONDITION FROM CHAPTER 5
D MAP FOR THE OZONE PREDICTORS IN FIGURE 5-5
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

5-1 Posterior inclusion probabilities for the fifteen predictor variables in the U.S. crime data set, under three models. Names of the variables are as in Table 2 of Liang et al. (2008) (but all variables except for the binary variable S have been log transformed).
D-1 The 44 predictors used in the ozone illustration. The symbol "." represents an interaction.

LIST OF FIGURES

5-1 Estimates of Bayes factors for the U.S. crime data. The plots give two different views of the graph of the Bayes factor as a function of $w$ and $g$ when the baseline value of the hyperparameter is given by $w = 0.5$ and $g = 15$. The estimate used is the one from Chapter 2 that uses control variates.
5-2 Estimates of posterior inclusion probabilities for Variables 1 and 6 for the U.S. crime data. The estimate used is one of the posterior expectation estimates of Chapter 2.
5-3 Variance functions for two versions of $\hat I_{\hat d}^{\hat\beta(\hat d)}$. The left panel is for the estimate based on an initial skeleton. The points in this skeleton were shifted to better cover the problematic region near the back of the plot ($g$ small and $w$ large), creating a second skeleton. The maximum variance is then reduced by a factor of 9 (right panel).
5-4 Estimates of Bayes factors for the ozone data. The plots give two different views of the graph of the Bayes factor as a function of $w$ and $g$ when the baseline value of the hyperparameter is given by $w = .2$ and $g = 50$.
5-5 95% confidence intervals of the posterior inclusion probabilities for the 44 predictors in the ozone data when the hyperparameter value is given by $w = .13$ and $g = 75$. A table giving the correspondence between the integers 1-44 and the predictors is given in Appendix D.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

COMPUTATIONAL APPROACHES FOR EMPIRICAL BAYES METHODS AND BAYESIAN SENSITIVITY ANALYSIS

By Eugenia Buta
August 2010
Chair: Hani Doss
Major: Statistics

We consider situations in Bayesian analysis where we have a family of priors $\nu_h$ on the parameter $\theta$, where $h$ varies continuously over a space $\mathcal{H}$, and we deal with two related problems. The first involves sensitivity analysis and is stated as follows. Suppose we fix a function $f$ of $\theta$. How do we efficiently estimate the posterior expectation of $f(\theta)$ simultaneously for all $h$ in $\mathcal{H}$? The second problem is how do we identify subsets of $\mathcal{H}$ which give rise to reasonable choices of $\nu_h$? We assume that we are able to generate Markov chain samples from the posterior for a finite number of the priors, and we develop a methodology, based on a combination of importance sampling and the use of control variates, for dealing with these two problems. The methodology applies very generally, and we show how it applies in particular to a commonly used model for variable selection in Bayesian linear regression, in which the unknown parameter includes the model and the regression coefficients for the selected model. The prior is a hierarchical prior in which first the model is selected, then the coefficients for this model are chosen, and this prior is indexed by two hyperparameters. These hyperparameters effectively determine whether the selected model will be a large model with many variables, or a parsimonious model with only a few variables, so choosing them is very important. We give two illustrations of our methodology, one on the U.S. crime data of Vandaele and the other on ground level ozone data originally analyzed by Breiman and Friedman.

CHAPTER 1
INTRODUCTION

In the Bayesian paradigm we have a data vector $Y$ with density $p_\theta$ for some unknown $\theta \in \Theta$, and we wish to put a prior density on $\theta$. The available family of prior densities is $\{\nu_h,\ h \in \mathcal{H}\}$, where $h$ is called a hyperparameter. Typically, the hyperparameter is multivariate and choosing it can be difficult. But this choice is very important and can have a large impact on subsequent inference. There are two issues we wish to consider:

(A) Suppose we fix a quantity of interest, say $f(\theta)$, where $f$ is a function. How do we assess how the posterior expectation of $f(\theta)$ changes as we vary $h$? More generally, how do we assess changes in the posterior distribution of $f(\theta)$ as we vary $h$?
(B) How do we determine if a given subset of $\mathcal{H}$ constitutes a class of reasonable choices?

The first issue is one of sensitivity analysis and the second is one of model selection.
As an example of the kind of problem we wish to deal with, consider the problem of variable selection in Bayesian linear regression. Here, we have a response variable $Y$ and a set of predictors $X_1, \ldots, X_q$, each a vector of length $m$. For every subset $\gamma$ of $\{1, \ldots, q\}$ we have a potential model $M_\gamma$ given by

$$Y = 1_m \beta_0 + X_\gamma \beta_\gamma + \epsilon,$$

where $1_m$ is the vector of $m$ 1's, $X_\gamma$ is the design matrix whose columns consist of the predictor vectors corresponding to the subset $\gamma$, $\beta_\gamma$ is the vector of coefficients for that subset, and $\epsilon \sim N_m(0, \sigma^2 I)$. Let $q_\gamma$ denote the number of variables in the subset $\gamma$. The unknown parameter is $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$, which includes the indicator of the subset of variables that go into the linear model. A very commonly used prior distribution on $\theta$ is given by a hierarchy in which we first choose the indicator $\gamma$ from the independence Bernoulli prior (each variable goes into the model with a certain probability $w$, independently of all the other variables) and then choose the vector of regression coefficients corresponding to the selected variables. In more detail, the model is described as follows:

\begin{align}
Y \mid \gamma, \sigma, \beta_0, \beta_\gamma &\sim N_m\bigl(1_m \beta_0 + X_\gamma \beta_\gamma,\ \sigma^2 I\bigr) \tag{1-1a}\\
(\sigma^2, \beta_0) \sim p(\sigma^2, \beta_0) \propto 1/\sigma^2, \quad \text{and given } \gamma \text{ and } \sigma,\quad \beta_\gamma &\sim N_{q_\gamma}\bigl(0,\ g \sigma^2 (X_\gamma' X_\gamma)^{-1}\bigr) \tag{1-1b}\\
\gamma &\sim w^{q_\gamma} (1 - w)^{q - q_\gamma} \tag{1-1c}
\end{align}

The prior on $(\sigma, \beta_0, \beta_\gamma)$ is Zellner's $g$-prior introduced in Zellner (1986), and is indexed by a hyperparameter $g$. Although this prior is improper, the resulting posterior distribution is proper.

Note that we have used the word model in two different ways: (i) a model is a specification of the hyperparameter $h$, and (ii) a model in regression is a list of variables to include. The meaning of the word will always be clear from context.

To summarize, the prior on the parameter $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$ is given by the two-level hierarchy (1-1c) and (1-1b), and is indexed by $h = (w, g)$. Loosely speaking, when $w$ is large and $g$ is small, the prior encourages models with many variables and small coefficients, whereas when $w$ is small and $g$ is large, the prior concentrates its mass on parsimonious models with large coefficients. Therefore, the hyperparameter $h = (w, g)$ plays a very important role, and in effect determines the model that will be used to carry out variable selection.
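To make the role of $w$ concrete, the independence Bernoulli prior on the inclusion indicator can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical and do not come from the dissertation or its software, and only the prior on $\gamma$ is shown (not the $g$-prior on the coefficients).

```python
import itertools
import random


def gamma_prior_prob(gamma, w):
    """Prior probability w^{q_gamma} (1 - w)^{q - q_gamma} of an inclusion vector."""
    q = len(gamma)
    q_gamma = sum(gamma)  # number of selected variables
    return w ** q_gamma * (1.0 - w) ** (q - q_gamma)


def sample_gamma(q, w, rng):
    """Draw gamma: each variable enters independently with probability w."""
    return tuple(1 if rng.random() < w else 0 for _ in range(q))


def expected_model_size(q, w):
    """Exact prior mean of q_gamma, obtained by enumerating all 2^q subsets."""
    return sum(sum(g) * gamma_prior_prob(g, w)
               for g in itertools.product((0, 1), repeat=q))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily for the illustration
    for w in (0.1, 0.5):
        draws = [sample_gamma(5, w, rng) for _ in range(3)]
        print(w, expected_model_size(5, w), draws)
```

Enumerating the subsets shows that the prior mean model size grows linearly in $w$, which is the sense in which a large $w$ favors models with many variables.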
A standard method for approaching model selection involves the use of Bayes factors. For each $h \in \mathcal{H}$, let $m_h(y)$ denote the marginal likelihood of the data under the prior $\nu_h$, that is, $m_h(y) = \int p_\theta(y)\, \nu_h(\theta)\, d\theta$. We will write $m_h$ instead of $m_h(y)$. The Bayes factor of the model indexed by $h_2$ vs. the model indexed by $h_1$ is defined as the ratio of the marginal likelihoods of the data under the two models, $m_{h_2}/m_{h_1}$, and is denoted throughout by $B(h_2, h_1)$. Bayes factors are widely used as a criterion for comparing models in Bayesian analyses. For selecting models that are better than others from the family of models indexed by $h \in \mathcal{H}$, our strategy will be to compute and subsequently compare all the Bayes factors $B(h, h_*)$, for all $h \in \mathcal{H}$, and a fixed hyperparameter value $h_*$. We could then consider as good candidate models those with values of $h$ that result in the largest Bayes factors.

Suppose now that we fix a particular function $f$ of the parameter $\theta$; for instance, in the example, this might be the indicator that variable 1 is included in the regression model. It is of general interest to determine the posterior expectation $E_h(f(\theta) \mid Y)$ as a function of $h$ and to determine whether or not $E_h(f(\theta) \mid Y)$ is very sensitive to the value of $h$. If it is not, then two individuals using two different hyperparameters will reach approximately the same conclusions and the analysis will not be controversial. On the other hand, if for a function of interest the posterior expectation varies considerably as we change the hyperparameter, then we will want to know which aspects of the hyperparameter (e.g. which components of $h$) produce big changes and we may want to see a plot of the posterior expectations as we vary those aspects of the hyperparameter. Except for extremely simple cases, posterior expectations cannot be obtained in closed form, and are typically estimated via Markov chain Monte Carlo (MCMC). It is slow and inefficient to run Markov chains for every hyperparameter value $h$. Chapter 2 reviews an existing method for estimating $E_h(f(\theta) \mid Y)$ that bypasses the need to run a separate Markov chain for every $h$. The method has an analogue for the problem of estimating Bayes factors. Unfortunately, the method has severe limitations, which we also discuss.
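The strategy of comparing Bayes factors against a fixed reference hyperparameter can be illustrated in one of the "extremely simple cases" where everything is available in closed form. The sketch below assumes a toy model that is not the one studied in this work: a single observation $y \mid \theta \sim N(\theta, 1)$ with prior $\theta \sim N(0, h)$, so that $m_h(y)$ is the $N(0, 1 + h)$ density at $y$. All function names are hypothetical.

```python
import math


def marginal_likelihood(y, h):
    """m_h(y) when y | theta ~ N(theta, 1) and theta ~ N(0, h) (toy assumption)."""
    var = 1.0 + h
    return math.exp(-0.5 * y * y / var) / math.sqrt(2.0 * math.pi * var)


def bayes_factor(y, h, h_ref):
    """B(h, h_ref) = m_h(y) / m_{h_ref}(y)."""
    return marginal_likelihood(y, h) / marginal_likelihood(y, h_ref)


def best_h(y, grid, h_ref):
    """Grid point with the largest Bayes factor against the reference value."""
    return max(grid, key=lambda h: bayes_factor(y, h, h_ref))


if __name__ == "__main__":
    grid = [0.5 * i for i in range(1, 21)]  # candidate prior variances 0.5, 1.0, ..., 10.0
    y_obs = 2.0
    h_star = best_h(y_obs, grid, h_ref=1.0)
    print(h_star, bayes_factor(y_obs, h_star, 1.0))
```

In this toy model the maximizer can be checked by calculus (the marginal likelihood is maximized at $h = y^2 - 1$ when $y^2 > 1$); in realistic models no such closed form exists, which is what motivates the Monte Carlo estimates developed in Chapter 2.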
The purpose of this work is to introduce a methodology for dealing with the sensitivity analysis and model selection issues discussed above. The basic idea is (not surprisingly) to use Markov chains corresponding to a few values of the hyperparameter in order to estimate $E_h(f(\theta) \mid Y)$ for all $h \in \mathcal{H}$ and also the Bayes factors $B(h, h_*)$ for all $h \in \mathcal{H}$, and this is done through importance sampling. The difficulty we face is that there is a severe computational burden caused by the requirement that we handle a very large number of values of $h$.

The main contributions of this work are the development of computationally efficient schemes for estimating large families of posterior expectations and Bayes factors that are based on a combination of MCMC, importance sampling, and the use of control variates, therefore providing an answer to questions (A) and (B) raised earlier. We also provide theory to support the methods we propose. Chapter 2 describes our methodology for estimating Bayes factors and posterior expectations, and gives statements of theoretical results associated with the methodology. In Chapter 3 we discuss estimation of the variance of our estimates. Chapter 4 gives a review of the relevant literature, along with a discussion of how the present work fits in the context of previous related work. In Chapter 5 we return to the problem of variable selection in Bayesian linear regression. There, first we consider a Markov chain algorithm that generates a sequence $(\gamma^{(1)}, \sigma^{(1)}, \beta_0^{(1)}, \beta_\gamma^{(1)}), (\gamma^{(2)}, \sigma^{(2)}, \beta_0^{(2)}, \beta_\gamma^{(2)}), \ldots$, describe theoretical properties of this chain, and show how to implement the methods developed in Chapter 2 to answer questions (A) and (B) posed earlier. We also illustrate our methodology on two data sets. Appendix A contains the proofs of the theorems stated in Chapter 2, Appendix B provides details regarding the generation of our Markov chain and its computational complexity, and Appendix C gives technical details regarding the theoretical properties of the Markov chain.
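CHAPTER 2
ESTIMATION OF BAYES FACTORS AND POSTERIOR EXPECTATIONS

Let $\nu_{h,y}$ denote the posterior density of $\theta$ given $Y = y$ when the prior is $\nu_h$. Suppose we have a sample $\theta_1, \ldots, \theta_n$ (iid or ergodic Markov chain output) from the posterior density $\nu_{h_1,y}$ for a fixed $h_1$ and we are interested in the posterior expectation

$$E_h(f(\theta) \mid Y = y) = \int f(\theta)\, \nu_{h,y}(\theta)\, d\theta$$

for different values of the hyperparameter $h$. We may write $\int f(\theta)\, \nu_{h,y}(\theta)\, d\theta$ as

\begin{align}
\int f(\theta)\, \frac{p_\theta(y)\,\nu_h(\theta)/m_h}{p_\theta(y)\,\nu_{h_1}(\theta)/m_{h_1}}\, \nu_{h_1,y}(\theta)\, d\theta
&= \frac{m_{h_1}}{m_h} \int f(\theta)\, \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta \tag{2-1a}\\
&= \frac{\dfrac{m_{h_1}}{m_h} \displaystyle\int f(\theta)\, \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta}{\dfrac{m_{h_1}}{m_h} \displaystyle\int \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta} \tag{2-1b}\\
&= \frac{\displaystyle\int f(\theta)\, \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta}{\displaystyle\int \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta} \tag{2-1c}
\end{align}

where in (2-1b) we have used the fact that the integral in the denominator is just 1 in order to cancel the unknown constant $m_{h_1}/m_h$ in (2-1c). The idea to express $\int f(\theta)\, \nu_{h,y}(\theta)\, d\theta$ in this way was proposed in a different context by Hastings (1970).

Expression (2-1c) is the ratio of two integrals with respect to $\nu_{h_1,y}$, each of which may be estimated from the sequence $\theta_1, \ldots, \theta_n$. We may estimate the numerator and the denominator by

$$\frac{1}{n} \sum_{i=1}^{n} f(\theta_i)\, \bigl[\nu_h(\theta_i)/\nu_{h_1}(\theta_i)\bigr] \quad \text{and} \quad \frac{1}{n} \sum_{i=1}^{n} \bigl[\nu_h(\theta_i)/\nu_{h_1}(\theta_i)\bigr], \tag{2-2}$$

respectively. Thus, if we let

$$w_h^{(i)} = \frac{\nu_h(\theta_i)/\nu_{h_1}(\theta_i)}{\sum_{e=1}^{n} \bigl[\nu_h(\theta_e)/\nu_{h_1}(\theta_e)\bigr]},$$

then these are weights, and we see that the desired integral may be estimated by the weighted average

$$\sum_{i=1}^{n} f(\theta_i)\, w_h^{(i)}.$$

The disappearance of the likelihood function in (2-1a) is very convenient because its computation requires considerable effort in some cases (for example, when we have missing or censored data, the likelihood is a possibly high-dimensional integral). Note that the second average in (2-2) is an estimate of $m_h/m_{h_1}$, i.e. the Bayes factor $B(h, h_1)$. Ideally, we would like to use the estimates in (2-2) for multiple values of $h$ using only a sample from the posterior distribution corresponding to the fixed hyperparameter value $h_1$. But, when the prior $\nu_h$ differs from $\nu_{h_1}$ greatly, the two estimates in (2-2) are unstable because of the potential that only a few observations will dominate the sums. Their ratio suffers the same defect.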
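The reweighting idea above can be sketched in Python on a toy problem where the exact answers are known. The sketch assumes, purely for illustration, a single observation $y \mid \theta \sim N(\theta, 1)$ with prior $\theta \sim N(0, h)$, so that the posterior under $h_1$ is normal and can be sampled directly (iid draws stand in for Markov chain output). The function names are hypothetical.

```python
import math
import random


def prior_density(theta, h):
    """N(0, h) prior density nu_h(theta) (toy assumption)."""
    return math.exp(-0.5 * theta * theta / h) / math.sqrt(2.0 * math.pi * h)


def posterior_sample(y, h, n, rng):
    """n iid draws from the posterior under prior N(0, h), which is normal here."""
    mean = y * h / (1.0 + h)
    sd = math.sqrt(h / (1.0 + h))
    return [rng.gauss(mean, sd) for _ in range(n)]


def reweighted_estimates(f, thetas, h, h1):
    """Estimate E_h(f(theta) | y) and B(h, h1) from draws made under h1.

    Only ratios of prior densities are needed; the likelihood cancels.
    """
    ratios = [prior_density(t, h) / prior_density(t, h1) for t in thetas]
    total = sum(ratios)
    post_mean = sum(f(t) * r for t, r in zip(thetas, ratios)) / total  # weighted average
    bayes_factor = total / len(thetas)  # plain average of the prior ratios
    return post_mean, bayes_factor


if __name__ == "__main__":
    rng = random.Random(0)
    draws = posterior_sample(1.5, 2.0, 20000, rng)  # y = 1.5, sampling prior h1 = 2.0
    print(reweighted_estimates(lambda t: t, draws, 1.5, 2.0))
```

Taking $h$ far from $h_1$ in this sketch (for example a much larger prior variance) makes a few large ratios dominate the sums, which is the instability described in the text.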
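A natural approach for dealing with the instability of these simple estimates is to choose $k$ hyperparameter values $h_1, \ldots, h_k \in \mathcal{H}$ and to replace $\nu_{h_1}$ with a mixture $\sum_{s=1}^{k} a_s \nu_{h_s}$, where $a_s \ge 0$, for $s = 1, \ldots, k$, and $\sum_{s=1}^{k} a_s = 1$. For concreteness, consider the estimate of the Bayes factor. To estimate $B(h, h_1)$ using a sample $\theta_1, \ldots, \theta_n$ (iid or ergodic Markov chain output) from the posterior mixture $\nu_y := \sum_{s=1}^{k} a_s \nu_{h_s,y}$, we are tempted to write

\begin{align}
\frac{1}{n} \sum_{i=1}^{n} \frac{\nu_h(\theta_i)}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i)}
&= \frac{m_h}{m_{h_1}}\, \frac{1}{n} \sum_{i=1}^{n} \frac{p_{\theta_i}(y)\, \nu_h(\theta_i)/m_h}{\sum_{s=1}^{k} a_s\, p_{\theta_i}(y)\, \nu_{h_s}(\theta_i)/m_{h_1}} \tag{a}\\
&= B(h, h_1)\, \frac{1}{n} \sum_{i=1}^{n} \frac{\nu_{h,y}(\theta_i)}{\sum_{s=1}^{k} a_s\, \nu_{h_s,y}(\theta_i)\, d_s} \tag{b}
\end{align}

where $d_s = m_{h_s}/m_{h_1}$, $s = 1, \ldots, k$. Thus, in order to have $\nu_y$ in the denominator in line (b) (which would imply that the average in line (b) converges to 1, so that line (b) converges to $B(h, h_1)$), we need to start out with $\sum_{s=1}^{k} a_s \nu_{h_s}/d_s$ in the denominator of the left side of line (a). Unfortunately, this requires the condition that we know the vector $d = (d_2, \ldots, d_k)'$. Under this condition, if $\theta_1, \ldots, \theta_n$ are drawn from the mixture $\nu_y$, instead of from $\nu_{h_1,y}$, we may form

$$\frac{1}{n} \sum_{i=1}^{n} \frac{\nu_h(\theta_i)}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i)/d_s}$$

and this quantity converges to

$$B(h, h_1) \int \frac{\nu_{h,y}(\theta)}{\nu_y(\theta)}\, \nu_y(\theta)\, d\theta = B(h, h_1).$$

Assuming that for each $l = 1, \ldots, k$ we have samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$ from the posterior density $\nu_{h_l,y}$, then for $a_s = n_s/n$, the estimate just formed can be written as

$$\hat B(h, h_1, d) = \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} n_s\, \nu_{h_s}(\theta_i^{(l)})/d_s}.$$

Note that the combined samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, $l = 1, \ldots, k$ form a stratified sample from the mixture distribution $\nu_y$. Doss (2010) shows that under certain regularity conditions the estimate $\hat B(h, h_1, d)$ is consistent and asymptotically normal.

In virtually all applications, the value of the vector $d$ is unknown and has to be estimated. Doss (2010) does not deal with the case where $d$ is unknown. In this Chapter, we assume that $d$ is estimated via preliminary MCMC runs generated independently of the runs subsequently used to estimate $B(h, h_1)$. Hence the sampling will consist of the following two stages.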
Stage 1: Generate samples $\theta_i^{(l)}$, $i = 1, \ldots, N_l$ from $\nu_{h_l,y}$, the posterior density of $\theta$ given $Y = y$, assuming that the prior is $\nu_{h_l}$, for each $l = 1, \ldots, k$, and use these $N = \sum_{l=1}^{k} N_l$ observations to form an estimate of $d$.

Stage 2: Independently of Stage 1, again generate samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$ from $\nu_{h_l,y}$ for each $l = 1, \ldots, k$, and construct the estimate of the Bayes factor $B(h, h_1)$ based on this second set of $n = \sum_{l=1}^{k} n_l$ observations and the estimate of $d$ from Stage 1.

From now on, for $l = 1, \ldots, k$, we use the notations $A_l$ and $a_l$ to identify the ratios $N_l/N$ and $n_l/n$, respectively.

It is natural to ask why it is necessary to have two steps of sampling, instead of estimating the vector $d$ and $B(h, h_1)$ from a single sample. The reason is that we are interested in estimating Bayes factors and posterior expectations for a very large number of values of $h$, and for each $h$, the computational time needed is linear in the total sample size. This fact limits the total sample size and hence the accuracy of the estimates. An increase in accuracy can be achieved essentially for free by estimating $d$ from long preliminary runs in Stage 1. The cost incurred in Stage 1 is minimal because generating the chains is typically extremely fast, and has to be done only once.

Doss (2010) also developed an improvement of $\hat B(h, h_1, d)$ that is based on control variates, and showed that this improvement is also consistent and asymptotically normal. Unfortunately, both of these estimates require us to know the vector $d$ exactly. One may be tempted to believe that using an estimated $d$ instead of the true $d$ will not inflate the asymptotic variance (indeed, the literature has errors regarding this point, and this is discussed in Appendix A). Here we provide a careful analysis of the increase in the asymptotic variance that results when we use an estimate of $d$. A more detailed summary of the main contributions of the present work is as follows.

1. We develop a complete characterization of the asymptotic distribution of both the Bayes factor estimate $\hat B$ and the improvement that uses control variates, for the realistic case where $d$ is estimated from Stage 1 sampling (Theorems 1 and 2).
2. We develop an analogous theory for the problem of estimating a family of posterior expectations $E_h(f(\theta) \mid Y = y)$, $h \in \mathcal{H}$ (Theorems 3, 4, and 6).
3. We discuss estimation of the variance, and show how variance estimates can be used to guide selection of the skeleton points $h_1, \ldots, h_k$.
4. We apply the methodology to the problem of Bayesian variable selection discussed earlier. In particular, we show how our methods enable us to select good values of $h = (w, g)$ and to also see how the probability that a given variable is included in the regression varies with $(w, g)$.

2.1 Estimation of Bayes Factors

Here, we analyze the asymptotic distributional properties of the estimator that results if in $\hat B(h, h_1, d)$ we replace $d$ with an estimate. Geyer (1994) proposes an estimator for $d$ based on the reverse logistic regression method and Theorem 2 therein shows that this estimator is asymptotically normal when the samplers used satisfy certain regularity conditions. This estimator is obtained by maximizing with respect to $d_2, \ldots, d_k$ the log quasi-likelihood

$$\ell_N(d) = \sum_{l=1}^{k} \sum_{i=1}^{N_l} \log \left( \frac{A_l\, \nu_{h_l}(\theta_i^{(l)})/d_l}{\sum_{s=1}^{k} A_s\, \nu_{h_s}(\theta_i^{(l)})/d_s} \right).$$

The estimate is the same as the estimates obtained by Gill et al. (1988), Meng and Wong (1996), and Kong et al. (2003). We assume that for all the Markov chains we use a Strong Law of Large Numbers (SLLN) holds for all integrable functions [for sufficient conditions see, e.g., Theorem 2 of Athreya et al. (1996)]. In the next theorem, we show that if $\hat d$ is the estimate produced by Geyer's (1994) method, or any of the equivalent estimates discussed above, then the estimate of the Bayes factor given by

$$\hat B(h, h_1, \hat d) = \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} n_s\, \nu_{h_s}(\theta_i^{(l)})/\hat d_s} \tag{2-7}$$

is asymptotically normal if certain regularity conditions are met. In (2-7), $\hat d_1 = 1$.

Theorem 1. Suppose the chains in Stage 2 satisfy conditions A1 and A2 in Doss (2010):

A1 For each $l = 1, \ldots, k$, the chain $\{\theta_i^{(l)}\}_{i=1}^{\infty}$ is geometrically ergodic.
A2 For each $l = 1, \ldots, k$, there exists $\epsilon > 0$ such that

$$E\left( \left| \frac{\nu_h(\theta_1^{(l)})}{\sum_{s=1}^{k} a_s\, \nu_{h_s}(\theta_1^{(l)})/d_s} \right|^{2+\epsilon} \right) < \infty.$$

Assume also that the chains in Stage 1 satisfy the conditions in Theorem 2 of Geyer (1994) that imply $\sqrt{N}(\hat d - d) \xrightarrow{d} N(0, \Sigma)$. In addition, suppose the total sample sizes for the two stages, $N$ and $n$, are chosen such that $n/N \to q \in [0, \infty)$. Then

$$\sqrt{n}\, \bigl( \hat B(h, h_1, \hat d) - B(h, h_1) \bigr) \xrightarrow{d} N\bigl( 0,\ q\, c(h)' \Sigma\, c(h) + \tau^2(h) \bigr),$$

where $c(h)$ and $\tau^2(h)$ are given in equation (A.3) in the Appendix and equation (A.9) in Doss (2010), respectively.

Remarks

1. There are two components to the expression for the variance. The first component arises from estimating $d$, and the second component is the variance that we would have if we had estimated the Bayes factor knowing what $d$ is. As can be seen from the formula, the first component vanishes if $q = 0$, i.e., if the sample size for estimating the parameter $d$ converges to infinity at a faster rate than does the sample size used to estimate the Bayes factor. In this case the Bayes factor estimator (2-7) using the estimate $\hat d$ has the same asymptotic distribution as the estimator $\hat B(h, h_1, d)$ which uses the true value of $d$. Otherwise, the variance of (2-7) is greater than that of $\hat B(h, h_1, d)$, and the difference between the variances depends on the magnitude of $q$.

2. This theorem assumes the sampling is done in two independent stages: Stage 1 samples to estimate $d$, and Stage 2 samples used together with $\hat d$ to estimate the Bayes factor $B(h, h_1)$. As a byproduct of our approach, we can get a similar theorem for the situation where both $d$ and $B(h, h_1)$ are estimated from the same sample. However, for the reasons discussed earlier, ordinarily we would not use a single sample. In more detail, if we impose the same conditions as in the theorem above on samples of total size $n$ from a single stage, except for the condition that $n/N \to q \in [0, \infty)$, then

$$\sqrt{n}\, \bigl( \hat B(h, h_1, \hat d) - B(h, h_1) \bigr) \xrightarrow{d} N\bigl( 0,\ c(h)' \Sigma\, c(h) + \tau^2(h) + 2\, c(h)' E' z \bigr),$$

where $\Sigma$ denotes, as in the statement of Theorem 1, the asymptotic variance of $\sqrt{n}(\hat d - d)$, $E$ is the matrix given in equation (A.51), and $z$ is the column vector given in (A.49). A proof of this result is given in Appendix A.

2.2 Estimation of Bayes Factors Using Control Variates

Recall that we have samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$ from $\nu_{h_l,y}$, $l = 1, \ldots, k$, with independence across samples (Stage 2 of sampling) and that, based on an independent set of preliminary MCMC runs (Stage 1 of sampling), we have estimated the constants $d_2, \ldots, d_k$. Also, $n_l/n = a_l$ and $n = \sum_{l=1}^{k} n_l$. Let

$$Y(\theta) = \frac{\nu_h(\theta)}{\sum_{s=1}^{k} a_s\, \nu_{h_s}(\theta)/d_s}.$$

Recalling that $\nu_y := \sum_{s=1}^{k} a_s \nu_{h_s,y}$, we have $E_{\nu_y}(Y(\theta)) = B(h, h_1)$, where the subscript $\nu_y$ to the expectation indicates that $\theta \sim \nu_y$. Also, for $j = 2, \ldots, k$, let

$$Z_j(\theta) = \frac{\nu_{h_j}(\theta)/d_j - \nu_{h_1}(\theta)}{\sum_{s=1}^{k} a_s\, \nu_{h_s}(\theta)/d_s} = \frac{\nu_{h_j,y}(\theta) - \nu_{h_1,y}(\theta)}{\sum_{s=1}^{k} a_s\, \nu_{h_s,y}(\theta)}.$$

The second expression shows that $E_{\nu_y}(Z_j(\theta)) = 0$. This is true even if the priors $\nu_{h_j}$ and $\nu_{h_1}$ are improper, as long as the posteriors $\nu_{h_j,y}$ and $\nu_{h_1,y}$ are proper, exactly our situation in the Bayesian variable selection example of Chapter 1. On the other hand, the first expression shows that $Z_j(\theta)$ is computable if we know the $d_j$'s (it involves the priors and not the posteriors). A similar remark applies to $Y(\theta)$. Therefore, if as in Doss (2010) we define for $l = 1, \ldots, k$, $i = 1, \ldots, n_l$,

$$Y_{i,l} = \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\, \nu_{h_s}(\theta_i^{(l)})/d_s}, \qquad Z^{(1)}_{i,l} = 1, \qquad Z^{(j)}_{i,l} = \frac{\nu_{h_j}(\theta_i^{(l)})/d_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\, \nu_{h_s}(\theta_i^{(l)})/d_s}, \quad j = 2, \ldots, k,$$

then for any fixed $\beta = (\beta_2, \ldots, \beta_k)$,

$$\hat I_d^{\beta} = \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} \Bigl( Y_{i,l} - \sum_{j=2}^{k} \beta_j Z^{(j)}_{i,l} \Bigr)$$

is an unbiased estimate of $B(h, h_1)$. The value of $\beta$ that minimizes the variance of $\hat I_d^{\beta}$ is unknown. As is commonly done when one uses control variates, we use instead the estimate obtained by doing ordinary linear regression of the response $Y_{i,l}$ on the predictors $Z^{(j)}_{i,l}$, $j = 2, \ldots, k$, and to emphasize that this estimate depends on $d$, we denote it by $\hat\beta(d)$. Theorem 1 of Doss (2010) states that the estimator $\hat B_{\mathrm{reg}}(h, h_1) = \hat I_d^{\hat\beta(d)}$, obtained under the assumption that we know the constants $d_2, \ldots, d_k$, has an
asymptoticallynormaldistribution.Asmentionedearlier, d 2 ,..., d k aretypicallyunknown, andmustbeestimated.Let ^ d 2 ,..., ^ d k beestimatesobtainedfrompreviousMCMCruns 20 PAGE 21 andlet ^ I ^ d ^ ^ d = 1 n k X l =1 n l X i =1 ^ Y i l )]TJ/F40 7.9701 Tf 18.176 14.944 Td [(k X j =2 ^ j ^ d ^ Z j i l where ^ Y i l and ^ Z j i l arelikein2,exceptusing ^ d for d ,and ^ ^ d istheleastsquares regressionestimatorfromregressing ^ Y i l onpredictors ^ Z j i l j =2,..., k .Thenext theoremgivestheasymptoticdistributionofthisnewestimator. Theorem2 SupposealltheconditionsfromTheorem1aresatised.Moreover,assume that R ,the k k matrixdenedby R j j 0 = E P k l =1 a l Z j 1, l Z j 0 1, l j j 0 =1,..., k isnonsingular.Then p n )]TJ/F22 11.9552 Tf 4.731 -7.027 Td [(^ I ^ d ^ ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(B h h 1 d )167(!N )]TJ/F22 11.9552 Tf 5.48 -9.683 Td [(0, qw h 0 w h + 2 h Expressionsfor w h and 2 h aregiveninequation A.18 tofollowandequationA.7 inDoss2010,respectively. 2.3EstimationofPosteriorExpectations Inthissectionwegiveamethodforestimatingtheposteriorexpectationofa function f whentheprioris h .Letusdenotethisquantityby I [ f ] h = Z f h y d Dene Y [ f ] i l = f l i h l i P k s =1 a s h s l i = d s = f l i h l i = m h P k s =1 a s h s l i = m h s m h m h 1 = f l i h y l i P k s =1 a s h s y l i B h h 1 AssumingaSLLNholdsfortheMarkovchains l i l =1,..., k i =1,..., n l ,wehave 1 n l n l X i =1 Y [ f ] i l a.s. )167(! Z f h y P k s =1 a s h s y h l y d B h h 1 21 PAGE 22 Therefore, 1 n k X l =1 n l X i =1 Y [ f ] i l = k X l =1 n l X i =1 n l n 1 n l Y [ f ] i l a.s. )167(! Z f h y P k s =1 a s h s y k X l =1 a l h l y d B h h 1 = I [ f ] h B h h 1 Similarly,wehave 1 n k X l =1 n l X i =1 Y i l a.s. )167(! B h h 1 the Y i l 'saredenedin2. Notethat Y i l = Y [ f ] i l when f 1 .Letting ^ I [ f ] h d = P k l =1 P n l i =1 Y [ f ] i l P k l =1 P n l i =1 Y i l weseethat ^ I [ f ] h d a.s. )167(! 
$I^{[f]}(h)$. Replacing the unknown $d$ with an estimate $\hat d$ obtained from Stage 1 sampling, we form the estimator

$$\hat I^{[f]}(h,\hat d) = \frac{\displaystyle\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{f(\theta_i^{(l)})\,\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/\hat d_s}}{\displaystyle\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/\hat d_s}}.$$

It is the asymptotic behavior of this estimator that we are concerned with in the following theorem.

Theorem 3 Suppose the conditions stated in Theorem 1 are satisfied and, in addition, for each $l = 1,\ldots,k$, there exists an $\epsilon > 0$ such that $E\big(|Y^{[f]}_{1,l}|^{2+\epsilon}\big) < \infty$. Then

$$\sqrt{n}\,\big(\hat I^{[f]}(h,\hat d) - I^{[f]}(h)\big) \xrightarrow{d} N\big(0,\; q\,v(h)'\,\Sigma\,v(h) + \rho(h)\big).$$

Expressions for $v(h)$ and $\rho(h)$ are given in equations A.22 and A.21, respectively, in the Appendix.

2.4 Estimation of Posterior Expectations Using Control Variates

We assume in this section that the values of the vector $d$ and the posterior expectations $I^{[f]}(h_j)$, for $j = 1,\ldots,k$, are available to us. In reality, these quantities are seldom known, and the next section deals with the case when they are estimated based on previous MCMC runs. Recall that the integral we want to estimate is $I^{[f]}(h) = \int f(\theta)\,\nu_{h,y}(\theta)\,d\theta$. Earlier in this chapter we established that $\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y^{[f]}_{i,l}$ is a strongly consistent estimator of $I^{[f]}(h)\,B(h,h_1)$, where $Y^{[f]}_{i,l} = f(\theta_i^{(l)})\,\nu_h(\theta_i^{(l)}) \big/ \sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s$. Define

$$Z^{[f](0)}_{i,l} = 1, \qquad Z^{[f](j)}_{i,l} = \frac{f(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})/d_j}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s} - I^{[f]}(h_j), \quad j = 1,\ldots,k,$$

and let

$$Z^{[f](j)}(\theta) = \frac{f(\theta)\,\nu_{h_j}(\theta)/d_j}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s} - I^{[f]}(h_j), \quad j = 1,\ldots,k.$$

With $\bar\nu_y$ denoting the mixture distribution $\sum_{s=1}^{k} a_s\,\nu_{h_s,y}$, it can be easily checked that

$$E_{\bar\nu_y}\big(Z^{[f](j)}(\theta)\big) = 0 \quad \text{for } j = 1,\ldots,k,$$

so we can use the $Z^{[f](j)}$'s as control variates to reduce the variance of the original estimator $\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y^{[f]}_{i,l}$. Doing so gives the estimator

$$\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Big(Y^{[f]}_{i,l} - \sum_{j=1}^{k}\hat\beta^{[f]}_j\,Z^{[f](j)}_{i,l}\Big),$$

where the $\hat\beta^{[f]}_j$'s denote the least squares estimates resulting from the regression of $Y^{[f]}_{i,l}$ on the predictors $Z^{[f](j)}_{i,l}$. The Bayes factor $B(h,h_1)$ will be estimated as before in Section 2.2, using the estimator $\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y_{i,l}$ and the $k-1$ control variates $Z^{(j)}$, for $j = 2,\ldots,k$. The ratio of these two control variate adjusted estimators provides us with an improved estimator for the posterior expectation $I^{[f]}(h)$, which is given by

$$\hat I_{\hat\beta,\hat\beta^{[f]}} = \frac{\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Big(Y^{[f]}_{i,l} - \sum_{j=1}^{k}\hat\beta^{[f]}_j\,Z^{[f](j)}_{i,l}\Big)}{\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Big(Y_{i,l} - \sum_{j=2}^{k}\hat\beta_j\,Z^{(j)}_{i,l}\Big)}.$$

Theorem 4 Suppose conditions A1 and A2 stated in Theorem 1 are satisfied and the matrix $R$ defined in Theorem 2 is nonsingular. Also, suppose that

A3 for each $l = 1,\ldots,k$, there exists $\epsilon > 0$ such that $E\big(|Y^{[f]}_{1,l}|^{2+\epsilon}\big) < \infty$;

A4 for each $l = 1,\ldots,k$, there exists $\epsilon > 0$ such that $E_{\nu_{h_l,y}}\big(|f(\theta)|^{2+\epsilon}\big) < \infty$;

A5 for each $l = 1,\ldots,k$, $E_{\nu_{h_l,y}}\Big(f^2(\theta)\,\nu_h(\theta) \big/ \sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s\Big) < \infty$;

A6 the $(k+1)\times(k+1)$ matrix $R^{[f]}$ defined by $R^{[f]}_{j+1,\,j'+1} = E\Big(\sum_{l=1}^{k} a_l\,Z^{[f](j)}_{1,l}\,Z^{[f](j')}_{1,l}\Big)$, $j, j' = 0,\ldots,k$, is nonsingular.

Then

$$\sqrt{n}\,\big(\hat I_{\hat\beta,\hat\beta^{[f]}} - I^{[f]}(h)\big) \xrightarrow{d} N\big(0,\; r(h)\big),$$

where $r(h)$ is given in equation A.33 of the Appendix.
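To make the control variate adjustment concrete, the following Python sketch treats the simplest case $k = 1$ (a single skeleton point $h_1$, so that $a_1 = d_1 = 1$). In that case the only control variate in the numerator is $f(\theta) - I^{[f]}(h_1)$, and the denominator needs no adjustment. The function name, its arguments, and the user-supplied prior ratio $\nu_h/\nu_{h_1}$ are assumptions made for illustration; this is not the implementation used in the thesis.

```python
def cv_posterior_mean(thetas, f, prior_ratio, known_mean):
    """Control-variate-adjusted ratio estimate of I^{[f]}(h), case k = 1.

    thetas      : draws from the posterior under the skeleton point h_1
    f           : function whose posterior expectation is wanted
    prior_ratio : theta -> nu_h(theta) / nu_{h_1}(theta)  (hypothetical input)
    known_mean  : I^{[f]}(h_1), assumed known as in Section 2.4
    """
    n = len(thetas)
    y = [prior_ratio(t) for t in thetas]            # Y_i
    yf = [f(t) * w for t, w in zip(thetas, y)]      # Y^{[f]}_i
    z = [f(t) - known_mean for t in thetas]         # control variate Z_i

    # Least squares slope of Y^{[f]} on Z (regression with an intercept).
    z_bar = sum(z) / n
    yf_bar = sum(yf) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    szy = sum((zi - z_bar) * (yi - yf_bar) for zi, yi in zip(z, yf))
    beta = szy / szz if szz > 0 else 0.0

    numerator = sum(yi - beta * zi for yi, zi in zip(yf, z))
    denominator = sum(y)                            # no control variates when k = 1
    return numerator / denominator
```

When the prior ratio is constant (the case $h = h_1$), the regression has no noise and the function returns the supplied value of $I^{[f]}(h_1)$ up to rounding error.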
Remarks

1. If $h = h_j$ for some $j \in \{1,\ldots,k\}$, our estimator of the posterior expectation, $\hat I_{\hat\beta,\hat\beta^{[f]}}$, given above has zero variance. To see why, note that in this case the response $Y^{[f]}$ can be written as $Y^{[f]} = d_j\,I^{[f]}(h_j) + d_j\,Z^{[f](j)}$, so there is no noise in the regression of $Y^{[f]}$ on the predictors $Z^{[f](j)}$, and as a consequence, the numerator of this estimator is constant (specifically, $n\,d_j\,I^{[f]}(h_j)$). Through similar arguments, the denominator was shown to be constant ($n\,d_j$) in Doss (2010). Hence, for $h = h_j$, $\hat I_{\hat\beta,\hat\beta^{[f]}}$ is a perfect estimator of $I^{[f]}(h_j)$.

2. Theorem 4 pertains to the case where $d$ and the posterior expectations $I^{[f]}(h_j)$, $j = 1,\ldots,k$, are known. There do exist some situations where this is the case. For example, in the hierarchical model for Bayesian linear regression discussed in Chapter 1, for each $j = 1,\ldots,k$, the marginal likelihood $m_{h_j}$ is a sum of $2^q$ terms, and if $q$ is relatively small, these marginal likelihoods are computable, so the vector $d$ is available. Likewise, for some functions $f$, the posterior expectations $I^{[f]}(h_j)$ can be numerically obtained; see Section 3 of George and Foster (2000). So it is possible to calculate $d$ and $I^{[f]}(h_j)$, $j = 1,\ldots,k$, for skeleton points $h_1,\ldots,h_k$, and the method described in this section enables us to efficiently estimate the family $\{I^{[f]}(h),\ h \in \mathcal{H}\}$.

2.5 Estimation of Posterior Expectations Using Control Variates With Estimated Skeleton Bayes Factors and Expectations

This section undertakes estimation of the posterior expectation $I^{[f]}(h) = \int f(\theta)\,\nu_{h,y}(\theta)\,d\theta$ in the case where the quantities $d$ and $I^{[f]}(h_j)$ are unknown and estimated based on previous MCMC runs. Let

$$e = \big(I^{[f]}(h_1),\ldots,I^{[f]}(h_k)\big)' \quad\text{and}\quad \hat e = \Big(\frac{1}{N_1}\sum_{i=1}^{N_1} f(\theta_i^{(1)}),\;\ldots,\;\frac{1}{N_k}\sum_{i=1}^{N_k} f(\theta_i^{(k)})\Big)',$$

i.e.
$e$ is the vector of true expectations, which had been assumed known in Theorem 4, and $\hat e$ is its natural estimate based on the samples in Stage 1. To account for the fact that the responses $Y^{[f]}_{i,l}$ and covariates $Z^{[f](j)}_{i,l}$ used in the previous section are now unknown, as they involve the unknown $d$ and $e$, we need to consider new responses and covariates based on estimates $\hat d$ and $\hat e$ obtained from Stage 1 samples. Define

$$\hat Y^{[f]}_{i,l} = \frac{f(\theta_i^{(l)})\,\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/\hat d_s}$$

and

$$\hat Z^{[f](j)}_{i,l} = \frac{f(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})/\hat d_j}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/\hat d_s} - \hat e_j, \quad j = 1,\ldots,k.$$

Hence a new control variate adjusted estimator for $I^{[f]}(h)$, corresponding to the estimator of the previous section (which assumed knowledge of $d$ and the $I^{[f]}(h_j)$'s), is

$$\hat I^{\hat d,\hat e}_{\hat\beta(\hat d),\,\hat\beta^{[f]}(\hat d)} = \frac{\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Big(\hat Y^{[f]}_{i,l} - \sum_{j=1}^{k}\hat\beta^{[f]}_j(\hat d)\,\hat Z^{[f](j)}_{i,l}\Big)}{\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Big(\hat Y_{i,l} - \sum_{j=2}^{k}\big[\hat\beta(\hat d)\big]_j\,\hat Z^{(j)}_{i,l}\Big)},$$

where $\hat\beta^{[f]}(\hat d)$ is the least squares estimate resulting from the regression of $\hat Y^{[f]}_{i,l}$ on the predictors $\hat Z^{[f](j)}_{i,l}$. Theorem 6 establishes the asymptotic normality of this estimator. But before stating this theorem, we give an auxiliary theorem which shows the joint asymptotic normality of

$$\sqrt{N}\left(\begin{pmatrix}\hat d\\ \hat e\end{pmatrix} - \begin{pmatrix} d\\ e\end{pmatrix}\right).$$

Let us first define separable and inseparable Monte Carlo samples as introduced by Geyer (1994). The Monte Carlo sample $\{\theta_i^{(l)}\}_{i=1}^{\infty}$, $l = 1,\ldots,k$, is said to be separable if there are disjoint subsets $L$ and $M$ of $\{1,\ldots,k\}$ such that for each $\theta$ in the sample and each $l \in L$ and $m \in M$, either $\nu_{h_l,y}(\theta)$ or $\nu_{h_m,y}(\theta)$ is zero. A Monte Carlo sample that is not separable is said to be inseparable.
Theorem 5 Assume that the Monte Carlo sample from Stage 1, $\{\theta_i^{(l)}\}_{i=1}^{\infty}$, $l = 1,\ldots,k$, is inseparable, and the following conditions hold:

B1 for each $l = 1,\ldots,k$, the chain $\{\theta_i^{(l)}\}_{i=1}^{\infty}$ is geometrically ergodic;

B2 for each $l = 1,\ldots,k$, there exists $\epsilon > 0$ such that $E_{\nu_{h_l,y}}\big(|f(\theta)|^{2+\epsilon}\big) < \infty$.

Then

$$\sqrt{N}\left(\begin{pmatrix}\hat d\\ \hat e\end{pmatrix} - \begin{pmatrix} d\\ e\end{pmatrix}\right) \xrightarrow{d} N(0, V),$$

where $V$ is given in equation A.48 in the Appendix.

Theorem 6 If the conditions stated in Theorem 4 and 2 hold, then

$$\sqrt{n}\,\Big(\hat I^{\hat d,\hat e}_{\hat\beta(\hat d),\,\hat\beta^{[f]}(\hat d)} - I^{[f]}(h)\Big) \xrightarrow{d} N\big(0,\; \psi(h)\big),$$

with $\psi(h)$ given in equation A.57 in the Appendix.

CHAPTER 3
VARIANCE ESTIMATION AND SELECTION OF THE SKELETON POINTS

Estimation of the variance of our estimates is important for several reasons. In addition to the usual need for providing error margins for our point estimates, variance estimates are of great help in selecting the skeleton points.

3.1 Estimation of the Variance

There are two approaches one can use to estimate the variance of any of our estimates. For the sake of concreteness, consider $\hat B(h,h_1,\hat d)$, whose asymptotic variance is the expression $\kappa^2(h) = q\,c(h)'\,\Sigma\,c(h) + \tau^2(h)$ (see Theorem 1).

Spectral Methods. If $X_0, X_1, X_2, \ldots$
is a Markov chain and $f$ is a function, the asymptotic variance of $\frac{1}{n}\sum_{i=0}^{n-1} f(X_i)$ (when it exists) is the infinite series

$$\mathrm{Var}\big(f(X_0)\big) + 2\sum_{j=1}^{\infty}\mathrm{Cov}\big(f(X_0), f(X_j)\big),$$

where the variances and covariances are calculated under the assumption that $X_0$ has the stationary distribution. Spectral methods involve estimating an initial segment of the series, using techniques from time series; see Geyer (1992) for a review. Our problem is more complicated because we are dealing with multiple chains. In our situation, the term $\tau^2(h)$ may be estimated through spectral methods, and this is done in a straightforward manner. We now give technical details regarding the consistency of this method. The quantity $\tau^2(h)$ is given by $\tau^2(h) = \sum_{l=1}^{k} a_l\,\tau_l^2(h)$, where $\tau_l^2(h)$ is the asymptotic variance of

$$\frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s}.$$

See equation A.9 of Doss (2010). Because for each $l$ we will be estimating $\tau_l^2(h)$ by the asymptotic variance of

$$\frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/\hat d_s},$$

where $\hat d$ is formed from Stage 1 runs, it is necessary to consider the quantity $\tau_l^2(h,u)$, defined as the asymptotic variance of

$$\frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/u_s},$$

where $u = (u_1, u_2, \ldots, u_k)'$. After defining

$$f_u(\theta) = \frac{\nu_h(\theta)}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/u_s},$$

we get

$$\tau_l^2(h,u) = \mathrm{Var}\big(f_u(\theta_1^{(l)})\big) + 2\sum_{j=1}^{\infty}\mathrm{Cov}\big(f_u(\theta_1^{(l)}),\, f_u(\theta_{1+j}^{(l)})\big).$$

We now proceed to establish continuity of $\tau^2(h,u)$ in $u$, and to do this we will show that for each $l = 1,\ldots,k$, $\tau_l^2(h,u)$ is continuous in $u$. For the rest of this discussion expectations and variances are taken with respect to $\nu_{h_l,y}$, and we drop $l$ from the notation. Let $u_n$ be any sequence of vectors such that $u_n \to d$. Then trivially $f_{u_n}(\theta) \to f_d(\theta)$ for all $\theta$, and letting $\epsilon = \min\{d_1,\ldots,d_k\}$, there exists a positive integer $n_\epsilon$ such that $\|u_n - d\| \le \epsilon$ for all $n \ge n_\epsilon$. Consequently,

$$f_{u_n}(\theta) \le f_{2d}(\theta) = 2 f_d(\theta) \quad\text{for all } \theta \text{ and all } n \ge n_\epsilon,$$

and we can apply the Lebesgue Dominated Convergence Theorem twice to conclude that

$$\mathrm{Var}\big(f_{u_n}(\theta_1)\big) = E\big(f_{u_n}^2(\theta_1)\big) - \big[E\big(f_{u_n}(\theta_1)\big)\big]^2 \quad\text{converges to}\quad E\big(f_d^2(\theta_1)\big) - \big[E\big(f_d(\theta_1)\big)\big]^2 = \mathrm{Var}\big(f_d(\theta_1)\big).$$
Note that condition A2 guarantees that the dominating function in the bound above has finite expectation. Similarly, for each of the covariance terms,

$$\mathrm{Cov}\big(f_{u_n}(\theta_1),\, f_{u_n}(\theta_{1+j})\big) \to \mathrm{Cov}\big(f_d(\theta_1),\, f_d(\theta_{1+j})\big).$$

If we define $c_u(j) = \mathrm{Cov}\big(f_u(\theta_1), f_u(\theta_{1+j})\big)$, $j = 1, 2, \ldots$, then $\sum_{j=1}^{\infty} c_u(j)$ is absolutely convergent. This is because under geometric ergodicity, the so-called strong mixing coefficients $\alpha(j)$ decrease to $0$ exponentially fast (a definition of strong mixing is given on p. 349 of Ibragimov 1962), and

$$\big|\mathrm{Cov}\big(f_u(\theta_1),\, f_u(\theta_{1+j})\big)\big| \le C\,[\alpha(j)]^{\epsilon/(2+\epsilon)}\,\big(E|f_u(\theta_1)|^{2+\epsilon}\big)^{2/(2+\epsilon)}$$

for some $\epsilon > 0$ and a universal constant $C$. See Theorem 18.5.3 of Ibragimov and Linnik (1971) or Lemma 7.7 of Chapter 7 in Durrett (1991). Since $c_{u_n}(j) \to c_d(j)$ for each $j$, the bound $f_{u_n} \le 2 f_d$ and the covariance inequality above enable us to again apply Dominated Convergence to conclude that $\sum_{j=1}^{\infty} c_{u_n}(j) \to \sum_{j=1}^{\infty} c_d(j)$, and this proves that $\tau^2(h,u)$ is continuous in $u$.

Let $g_u$ be the spectral density at $0$ of the series $f_u(\theta_i)$. Note that $g_u$ is equal to $\tau_l^2(h,u)$, except for a normalizing constant. Under strong mixing (implied by geometric ergodicity), standard spectral density estimates $\hat g_u$ are consistent, and bounds on the discrepancy $|\hat g_u - g_u|$ depend on the mixing rate and bounds on the moments of the function $f_u$ (Rosenblatt 1984). By the bound $f_u \le 2 f_d$, the rate is uniform as long as $\|u - d\|$ is small, and the condition that $\|\hat d - d\|$ is small is guaranteed if the Stage 1 sample size $N$ is large.

Geyer (1994) gives an expression for $\Sigma$ involving infinite series of the same variance-plus-covariances form as the series given at the start of this subsection, and this enables estimation of $\Sigma$ by spectral methods. Now, $c(h)$ is a vector each of whose components is an integral with respect to the posterior $\nu_{h,y}$ (see A.3). The estimate derived in Section 2.3 is designed precisely to estimate such posterior expectations. Combining, we arrive at an overall estimate of $\kappa^2(h)$, and the asymptotic variances of our other estimates are handled similarly.

Methods Based on Regeneration. The cleanest approach to estimating asymptotic variances is based on regeneration. Let $X_0, X_1, X_2, \ldots$
be a Markov chain on the measurable space $(\mathsf{X}, \mathcal{B})$, let $K(x, A)$ be the Markov transition distribution, and assume that $\pi$ is a stationary probability distribution for the chain. Suppose that for each $x$, $K(x,\cdot)$ has density $k(x,\cdot)$ with respect to a dominating measure $\mu$. Regeneration methods require the existence of a function $s : \mathsf{X} \to [0,1)$, whose expectation with respect to $\pi$ is strictly positive, and a probability density $d$ with respect to $\mu$, such that $k$ satisfies

$$k(x, x') \ge s(x)\,d(x') \quad\text{for all } x, x' \in \mathsf{X}.$$

This is called a minorization condition and, as we describe below, it can be used to introduce regenerations into the Markov chain driven by $k$. These regenerations are the key to constructing a simple, consistent estimator of the variance in the central limit theorem. Define

$$r(x, x') = \frac{k(x, x') - s(x)\,d(x')}{1 - s(x)}.$$

Note that, for fixed $x \in \mathsf{X}$, $r(x, x')$ is a density function in $x'$. We may therefore write

$$k(x, x') = s(x)\,d(x') + \big(1 - s(x)\big)\,r(x, x'),$$

which gives a representation of $k(x,\cdot)$ as a mixture of two densities, $d(\cdot)$ and $r(x,\cdot)$. This provides an alternative method of simulating from $k$. Suppose that the current state of the chain is $X_n$. We generate $\delta_n \sim \mathrm{Bernoulli}(s(X_n))$. If $\delta_n = 1$, we draw $X_{n+1} \sim d(\cdot)$; otherwise, we draw $X_{n+1} \sim r(X_n,\cdot)$. Note that, if $\delta_n = 1$, the next state of the chain is drawn from $d$, which does not depend on the current state. Hence, the chain forgets the current state and we have a regeneration. To be more specific, suppose we start the Markov chain with $X_0 \sim d$ and then use the method described above to simulate the chain. Each time $\delta_n = 1$, we have $X_{n+1} \sim d$ and the process stochastically restarts itself; that is, the process regenerates. Even though the description above involves generating observations from $r$, there are clever tricks that enable the user to bypass generating from $r$, and to obtain the sequence $(X_n, \delta_n)$ in a way that requires only generating directly from $k$; see, e.g., Tan and Hobert (2009).
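The mixture representation above can be illustrated with a small Python sketch for a finite state space. It assumes a toy transition matrix `P` and uses the simplest possible minorization, with a constant function $s(x) \equiv \epsilon$ obtained from the column minima of `P`; all names are hypothetical, and the sketch draws from the residual kernel $r$ directly rather than using the shortcuts of Tan and Hobert (2009).

```python
import random

def minorize(P):
    """Return (eps, d, R) with P[x][y] = eps * d[y] + (1 - eps) * R[x][y]."""
    m = len(P)
    col_min = [min(P[x][y] for x in range(m)) for y in range(m)]
    eps = sum(col_min)                       # constant s(x) = eps
    if not 0.0 < eps < 1.0:
        raise ValueError("this construction needs 0 < eps < 1")
    d = [c / eps for c in col_min]           # regeneration density d
    R = [[(P[x][y] - col_min[y]) / (1.0 - eps) for y in range(m)]
         for x in range(m)]                  # residual kernel r(x, .)
    return eps, d, R

def draw(probs, rng):
    """Draw an index from a discrete distribution given as a list."""
    u, acc = rng.random(), 0.0
    for state, p in enumerate(probs):
        acc += p
        if u < acc:
            return state
    return len(probs) - 1                    # guard against rounding

def simulate_split_chain(P, n_steps, rng):
    """Simulate X_0..X_n and the regeneration indicators delta_0..delta_{n-1}."""
    eps, d, R = minorize(P)
    x = draw(d, rng)                         # X_0 ~ d
    states, deltas = [x], []
    for _ in range(n_steps):
        delta = 1 if rng.random() < eps else 0   # delta_n ~ Bernoulli(s(X_n))
        x = draw(d, rng) if delta else draw(R[x], rng)
        deltas.append(delta)
        states.append(x)
    return states, deltas
```

A call such as `simulate_split_chain(P, 1000, random.Random(1))` returns the path together with the indicators; every index `n` with `deltas[n] == 1` marks a regeneration at time `n + 1`.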
Here is how the regenerative method is used to get valid asymptotic standard errors. Suppose we wish to approximate the posterior expectation of some function $f(X)$. Suppose further that the Markov chain is to be run for $R$ regenerations (or tours); that is, we begin by drawing the starting value from $d$ and we stop the simulation the $R$th time that a $\delta_n = 1$. Let $0 = \tau_0 < \tau_1 < \tau_2 < \cdots < \tau_R$ be the random regeneration times, i.e. $\tau_t = \min\{n > \tau_{t-1} : \delta_{n-1} = 1\}$ for $t \in \{1, 2, \ldots, R\}$. The total length of the simulation, $\tau_R$, is random. Let $N_1, N_2, \ldots, N_R$ be the lengths of the tours, i.e. $N_t = \tau_t - \tau_{t-1}$, and define

$$S_t = \sum_{n=\tau_{t-1}}^{\tau_t - 1} f(X_n), \quad t = 1,\ldots,R.$$

Note that the $(N_t, S_t)$ pairs are iid, and a strongly consistent estimator of $E_\pi f(X)$ is $\bar f_R = \bar S/\bar N = \frac{1}{\tau_R}\sum_{n=0}^{\tau_R - 1} f(X_n)$, where $\bar S = \frac{1}{R}\sum_{t=1}^{R} S_t$ and $\bar N = \frac{1}{R}\sum_{t=1}^{R} N_t$, and the asymptotic variance of $\bar f_R$ may be estimated very simply by

$$\frac{\sum_{t=1}^{R}\big(S_t - \bar f_R\,N_t\big)^2}{R\,\bar N^2}.$$

Moment and ergodicity conditions that guarantee strong consistency of this variance estimator are given in Hobert et al. (2002). This method has recently been applied successfully in a number of problems involving continuous state spaces; see, e.g., Tan and Hobert (2009) and the references therein, and we use the method in the illustration in Chapter 5.

In our framework of multiple chains, one might think that we need to identify a sequence of times $0 = \tau_0 < \tau_1 < \tau_2 < \cdots < \tau_R$ at which all the chains regenerate. This is not the case, and we need only identify, for each chain, a sequence of regeneration times for that chain. Since the overall estimate is essentially a function of averages involving the $k$ chains, its asymptotic variance is a function of the asymptotic variances of averages formed from the individual chains.
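For a single chain, the regenerative point estimate and variance estimate described above can be sketched in a few lines of Python. The function names and the input layout (a list of values $f(X_n)$ and a parallel list of indicators $\delta_n$) are assumptions made for illustration, and the standard error returned is the square root of the variance estimate divided by the number of tours $R$.

```python
import math

def tours_from_chain(values, deltas):
    """Cut f(X_0), f(X_1), ... into tours; delta_n = 1 closes a tour at X_n.

    Returns a list of (N_t, S_t) pairs. An unfinished final tour is dropped.
    """
    tours, length, total = [], 0, 0.0
    for value, delta in zip(values, deltas):
        length += 1
        total += value
        if delta == 1:                 # X_{n+1} is a fresh draw from d
            tours.append((length, total))
            length, total = 0, 0.0
    return tours

def regenerative_estimate(tours):
    """Return (f_bar, variance estimate, standard error of f_bar)."""
    R = len(tours)
    n_bar = sum(n for n, _ in tours) / R
    s_bar = sum(s for _, s in tours) / R
    f_bar = s_bar / n_bar
    # sum_t (S_t - f_bar * N_t)^2 / (R * N_bar^2)
    var_hat = sum((s - f_bar * n) ** 2 for n, s in tours) / (R * n_bar ** 2)
    return f_bar, var_hat, math.sqrt(var_hat / R)
```

With several chains, one would apply `regenerative_estimate` to each chain separately, in line with the remark that regeneration times are needed only chain by chain.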
Consider the function $B(h,h_1)$, $h \in \mathcal{H}$, and an estimator, such as $\hat B(h,h_1,\hat d)$ (for the rest of this discussion, we will denote these by $B(h)$ and $\hat B(h)$, for brevity). It is of interest to provide a confidence band (region, if $h$ is multidimensional) for $B(h)$ that is valid simultaneously for all $h \in \mathcal{H}$. A closely related problem is to produce a confidence interval for $\arg\max_{h \in \mathcal{H}} B(h)$. The traditional way of forming confidence bands that are valid globally is to proceed as follows:

1. Establish a functional central limit theorem that says that $n^{1/2}\big(\hat B(h) - B(h)\big)$ converges in distribution to a Gaussian process $W(h)$, $h \in \mathcal{H}$.

2. Find the distribution of $\sup_{h \in \mathcal{H}} |W(h)|$.

If $s_\alpha$ is the $(1-\alpha)$-quantile of the distribution of this supremum, then the band $\hat B(h) \pm s_\alpha/n^{1/2}$ has asymptotic coverage probability equal to $1 - \alpha$. The value $s_\alpha$ is typically too difficult to compute analytically, but can be obtained by simulation [see, e.g., Burr and Doss (1993) among many others]. The maximal inequalities needed to establish functional central limit theorems typically require an iid structure, and for this reason we believe that the regeneration method offers the best hope for establishing such theorems.
3.2 Selection of the Skeleton Points

The asymptotic variances of any of our estimates depend on the choice of the points $h_1,\ldots,h_k$. For concreteness, consider $\hat B(h,h_1,\hat d)$, and to emphasize this dependence, let $V(h,h_1,\ldots,h_k)$ denote the asymptotic variance of $\hat B(h,h_1,\hat d)$. For fixed $h_1,\ldots,h_k$, identifying the set of $h$'s for which $V(h,h_1,\ldots,h_k)$ is finite is typically a feasible problem. For instance, Doss (1994) considered the pump data example discussed in Tierney (1994), for which the hyperparameter $h$ has dimension 3, and determined this set for the case $k = 1$. He showed that one can go as far away from $h_1$ as one wants in certain directions, but in other directions the range is limited. The calculation can be extended to any $k$. Suppose now that we fix a range $\mathcal{H}$ over which $h$ is to vary. Typically, we will want more than just a positioning of $h_1,\ldots,h_k$ that guarantees that $V(h,h_1,\ldots,h_k)$ is finite for all $h \in \mathcal{H}$, and we will face the problem below.

Design Problem. Find the values of $h_1,\ldots,h_k$ that minimize $\max_{h \in \mathcal{H}} V(h,h_1,\ldots,h_k)$.

Unfortunately, except for extremely simple cases, it is not possible to calculate $V(h,h_1,\ldots,h_k)$ analytically (even if $k = 1$, $V(h,h_1)$ is an infinite sum each of whose terms depends on the Markov transition distribution in a complicated way), and maximizing it over $h \in \mathcal{H}$ would present additional difficulties. Furthermore, even if we were able to calculate $\max_{h \in \mathcal{H}} V(h,h_1,\ldots,h_k)$, the design problem would involve the minimization of a function of $k \times \dim(\mathcal{H})$ variables, and in general, solving the design problem is hopeless.

In our experience, we have found that the following method works reasonably well. Having specified the range $\mathcal{H}$, we select trial values $h_1,\ldots,h_k$ and plot the estimated variance as a function of $h$, using one of the methods described above. If we find a region in $\mathcal{H}$ where this variance is unacceptably large, we cover this region by moving some $h_l$'s closer to the region, or by simply adding new $h_l$'s in that region, which increases $k$. This is illustrated in the example in Chapter 5.
CHAPTER 4
REVIEW OF PREVIOUS WORK

Vardi (1985) introduced the following $k$-sample model for biased sampling. There is an unknown distribution function $F$, which we wish to estimate. For each weight function $w_l$, $l = 1,\ldots,k$, we have a sample $X_{l1},\ldots,X_{ln_l} \stackrel{\text{iid}}{\sim} F_l$, where

$$F_l(x) = \frac{1}{W_l}\int_{-\infty}^{x} w_l(s)\,dF(s).$$

In this expression, $W_l = \int_{-\infty}^{\infty} w_l(s)\,dF(s)$. The weight functions $w_1,\ldots,w_k$ are known, but the normalizing constants $W_1,\ldots,W_k$ are not. Vardi (1985) was interested in conditions that guarantee that a nonparametric maximum likelihood estimator (NPMLE) exists and is unique, and he gave the form of the NPMLE. The conditions for existence and uniqueness involve issues regarding the supports of the $F_l$'s and do not concern us in the present paper.

To estimate $F$, a preliminary step is to estimate the vector $W = (W_1,\ldots,W_k)$. Vardi (1985) and Gill et al. (1988) show that $W$ may be estimated by the solution to the system of $k$ equations

$$W_l = \int \frac{w_l(y)}{\sum_{j=1}^{k} a_j\,w_j(y)/W_j}\,dF_n(y), \quad l = 1,\ldots,k,$$

where $a_j = n_j/n$, $n = \sum_{j=1}^{k} n_j$, and $F_n$ is the empirical distribution function that gives mass $1/n$ to each of the $X_{li}$. Actually, the solution to this system is not unique: it is trivial to see that if the vector $W$ solves the system, then so does $\alpha W$, for any $\alpha > 0$. However, it turns out that knowing $W$ only up to a multiplicative constant is all that is needed, and to avoid non-identifiability issues, we define the vector $V = (W_2/W_1,\ldots,W_k/W_1)$. Gill et al. (1988) show that if $\widehat W$ is any solution to the system, and $\widehat V$ is defined by $\widehat V = (\widehat W_2/\widehat W_1,\ldots,\widehat W_k/\widehat W_1)$, then $n^{1/2}(\widehat V - V)$ is asymptotically normal (Proposition 2.3 in Gill et al. 1988). Once an estimate of $W$ is formed, it is relatively easy to form an estimate $\hat F_n$ of $F$, and consequently of integrals of the form $\int h\,dF$. Gill et al. (1988) obtain functional weak convergence results of the sort

$$n^{1/2}\Big(\int h\,d\hat F_n - \int h\,dF\Big) \xrightarrow{d} Z(h),$$

where $Z$ is a mean-0 Gaussian process indexed by $h \in H$, where $H$ is a large class of square integrable functions.
It is not difficult to see that our setup is the same as that considered in Vardi (1985) and Gill et al. (1988): their $F$ corresponds to our $\nu_{h,y}$; their $w_l$ to $\nu_{h_l}/\nu_h$; $F_l$ to $\nu_{h_l,y}$; $W_l$ to $m_{h_l}/m_h$; and $V$ to $d$. But there are major differences between our framework and theirs. They deal with iid samples, and so can use empirical process theory, whereas we deal with Markov chains, for which such a theory is not available. In their framework, the samples arise from some experiment, and they are seeking optimal estimates given data that is given to them. In contrast, our samples are obtained by Monte Carlo, so we have control over design issues. In particular, we are concerned with computational efficiency, in addition to statistical efficiency; hence our interest in the two-stage sampling method for preliminary estimation of $d$ and for enabling the use of control variates.

Geyer (1994) also deals with the setup in Vardi (1985) and Gill et al. (1988), i.e. the $k$-sample model for biased sampling, and he also considers the problem of estimating $d$. As mentioned in Section 2.1, his estimator is obtained by maximizing the objective function given there, and the solution is numerically identical to the solution to the system of Vardi (1985) and Gill et al. (1988) displayed earlier in this chapter. However, he considers the situation where each of the $k$ samples is a Markov chain, as opposed to an iid sample, and assuming that the chains satisfy certain mixing conditions, he obtains a central limit theorem for $n^{1/2}(\hat d - d)$. Naturally, the variance of the limiting distribution is different from the variance obtained in Gill et al. (1988), and is typically larger.

In Section 7 of their paper, Meng and Wong (1996) consider the situation where for each $l = 1,\ldots,k$, we have an iid sample from the density $f_l = q_l/m_l$, where the functions $q_1,\ldots,q_k$ are known, but the normalizing constants $m_1,\ldots,m_k$ are not, and we wish to estimate the vector $(m_2/m_1,\ldots,m_k/m_1)$. Without going into detail, we mention that they develop a family of bridge functions and show that, in the iid setting, the optimal bridge function gives rise to an estimate identical to that of Geyer (1994). They obtain their estimate through an iterative scheme which is fast and stable [Meng and Wong 1996, p. 849] and this is the computational method we use in the present paper.
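As an illustration of how an estimate of the vector of ratios can be computed iteratively, the Python sketch below applies a plain fixed-point iteration to the estimating equations $d_r = \sum_x q_r(x) \big/ \sum_s n_s\,q_s(x)/d_s$, where the outer sum runs over the pooled draws, renormalizing so that $d_1 = 1$ after every sweep. The function name, the input layout and the stopping rule are assumptions; the sketch is not claimed to reproduce the exact scheme of Meng and Wong (1996).

```python
def solve_ratios(q_values, sample_sizes, n_iter=200, tol=1e-12):
    """Fixed-point iteration for d = (1, m_2/m_1, ..., m_k/m_1).

    q_values[i][r] : unnormalized density q_r evaluated at the i-th pooled draw
    sample_sizes   : [n_1, ..., n_k], the number of draws taken from each density
    """
    k = len(sample_sizes)
    d = [1.0] * k
    for _ in range(n_iter):
        new = [0.0] * k
        for row in q_values:
            denom = sum(n_s * q_s / d_s
                        for n_s, q_s, d_s in zip(sample_sizes, row, d))
            for r in range(k):
                new[r] += row[r] / denom
        new = [v / new[0] for v in new]        # fix the scale: d_1 = 1
        if max(abs(a - b) for a, b in zip(new, d)) < tol:
            return new
        d = new
    return d
```

In a toy example with two densities on the two-point space $\{0, 1\}$, $q_1 = (1, 1)$ and $q_2 = (1, 3)$, and pooled draws in exactly the expected proportions, the iteration recovers the true ratio $m_2/m_1 = 2$.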
Owen and Zhou (2000) consider the problem of estimating an integral of the form $I = \int h(x) f(x)\,dx$, where $f$ is a probability density that is completely known (as opposed to known up to a normalizing constant) and $h$ is a known function. They wish to estimate $I$ through importance sampling. They assume they can generate sequences $X_{l1},\ldots,X_{ln_l} \stackrel{\text{iid}}{\sim} p_l$, $l = 1,\ldots,k$, where the $p_l$'s are completely known densities. The doubly indexed sequence $X_{li}$, $i = 1,\ldots,n_l$, $l = 1,\ldots,k$, forms a stratified sample from the mixture density $p_a = \sum_{l=1}^{k} a_l\,p_l$, where $a_l = n_l/\sum_{s=1}^{k} n_s$, so one can carry out importance sampling with respect to this mixture. They point out that since the $p_l$'s are completely known, they can form the functions $H_j(x) = [p_j(x)/p_a(x)] - 1$, $j = 1,\ldots,k$, and these satisfy $E_{p_a}\big(H_j(X)\big) = 0$, where the subscript indicates that the expectation is taken with respect to the mixture density $p_a$. Therefore, these $k$ functions can be used as control variates. What we do in Chapter 2 is similar, except that we are working with densities whose functional form is known, but whose normalizing constants are not.

Kong et al. (2003) also consider the $k$-sample model for biased sampling, but have a different perspective, and we describe their work in the notation of the present paper. They assume that there are probability measures $Q_1,\ldots,Q_k$, with densities $q_1/m_1,\ldots,q_k/m_k$, respectively, relative to some dominating measure $\mu$, and for each $l = 1,\ldots,k$, we have an iid sample $\{X_{li}\}_{i=1}^{n_l}$ from $Q_l$. Here, the $q_l$'s are known, but the $m_l$'s are not. Their objective is to estimate all possible ratios $m_l/m_j$, $l, j \in \{1,\ldots,k\}$ or, equivalently, the vector $d = (1, m_2/m_1,\ldots,m_k/m_1)$. In their highly unorthodox approach, Kong et al. (2003) obtain the maximum likelihood estimate $\hat\mu$ of the dominating measure $\mu$ itself ($\hat\mu$ is given up to an overall multiplicative constant). They can then estimate the ratios $m_l/m_j$, since the normalizing constants are known functions of $\mu$, i.e.
$m_r = \int q_r(x)\,d\mu(x)$, and $q_r$ is known. They show that the resulting estimate of $d$ is obtained by solving the system

$$d_r = \sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{q_r(X_{li})}{\sum_{s=1}^{k} n_s\,q_s(X_{li})/d_s}, \quad r = 1,\ldots,k,$$

which is easily seen to be identical to the system of Gill et al. (1988) given earlier in this chapter.

Tan (2004) shows how control variates can be incorporated in the likelihood framework of Kong et al. (2003). When there are $r$ functions $H_j$, $j = 1,\ldots,r$, for which we know that $\int H_j\,d\mu = 0$, the parameter space is restricted to the set of all sigma-finite measures satisfying these $r$ constraints. For the case where $X_{li}$, $i = 1,\ldots,n_l$, are iid for each $l = 1,\ldots,k$, he obtains the maximum likelihood estimate of $\mu$ in this reduced parameter space, and therefore corresponding estimates of $d$ and $m_h/m_{h_1}$, and shows that this approach gives estimates that are asymptotically equivalent to estimates that use control variates via regression. He also obtains results on asymptotic normality of his estimators that are valid when we have the iid structure.

The estimates of $d$ in Gill et al. (1988), Geyer (1994), Meng and Wong (1996), and Kong et al. (2003) are all equivalent. Theorem 1 of Tan (2004) establishes asymptotic optimality of this estimate under the iid assumption. When the samples are Markov chain draws, the asymptotically optimal estimate is essentially impossible to obtain (Romero 2003). But the estimate derived under the iid assumption can still be used in the Markov chain setting if one can develop asymptotic results that are valid in the Markov chain case, and this is done by Geyer (1994), whose results we use in all our theorems.

CHAPTER 5
ILLUSTRATION ON VARIABLE SELECTION

There exist many classes of problems in Bayesian analysis in which the sensitivity analysis and model selection issues discussed earlier arise; see Chapter 6. Here we give an application involving the hierarchical prior used in variable selection in the Bayesian linear regression model discussed in Chapter 1. This chapter consists of three parts. First we discuss an MCMC algorithm for this model and state some of its theoretical properties; then we discuss the literature on selection of the hyperparameter $h$; and finally we present two detailed illustrations of our methodology.
5.1 A Markov Chain for Estimating the Posterior Distribution of Model Parameters

The design of MCMC algorithms for estimating the posterior distribution of $\theta$ under model (1) of Chapter 1 revolves around the generation of the indicator variable $\gamma$. We now briefly review the algorithms for running a Markov chain on $\gamma$ that are proposed in the literature, and the main issues of implementation of these algorithms. Raftery et al. (1997) and Madigan and York (1995) discuss the following Metropolis-Hastings algorithm for generating a sequence $\gamma^{(1)}, \gamma^{(2)}, \ldots$. If the current state is $\gamma$, a new state $\gamma^*$ is formed by selecting at random a coordinate $j$, setting $\gamma^*_j = 1 - \gamma_j$, and $\gamma^*_k = \gamma_k$ for $k \ne j$. The proposal $\gamma^*$ is then accepted or rejected with the Metropolis-Hastings acceptance probability $\min\{p(\gamma^* \mid Y)/p(\gamma \mid Y),\, 1\}$. Madigan and York (1995) call this algorithm MC$^3$. Clyde et al. (1996) propose a modification of this algorithm in which we do not select a component at random and update it, but instead sequentially update all components. They call this the Hybrid Algorithm. Strictly speaking, this is a Metropolized Gibbs sampler, and is not actually a Metropolis-Hastings algorithm. Smith and Kohn (1996) propose a Gibbs sampler which simply cycles through the coordinates $\gamma_j$ one at a time. George and McCulloch (1997) show that when compared with MC$^3$, the Gibbs sampler algorithm gives estimates with smaller standard error, and is also slightly faster, at least in several simulation studies they conducted.
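The single-coordinate flip algorithm (MC$^3$) described above can be sketched as follows in Python. The sketch assumes a user-supplied function `log_post` returning the log of $p(\gamma \mid Y)$ up to an additive constant; that function, the starting state (the null model) and the other names are hypothetical choices made for illustration.

```python
import math
import random

def mc3(log_post, q, n_iter, rng):
    """Run the MC^3 chain on gamma in {0,1}^q and return the visited states."""
    gamma = [0] * q                       # start from the null model
    current = log_post(gamma)
    chain = []
    for _ in range(n_iter):
        j = rng.randrange(q)              # coordinate chosen uniformly at random
        proposal = gamma[:]
        proposal[j] = 1 - proposal[j]     # flip gamma_j
        candidate = log_post(proposal)
        # accept with probability min{p(gamma* | Y) / p(gamma | Y), 1}
        if rng.random() < math.exp(min(0.0, candidate - current)):
            gamma, current = proposal, candidate
        chain.append(gamma[:])
    return chain
```

Averaging the $j$th coordinate over the returned states gives a Monte Carlo estimate of the marginal inclusion probability of variable $j$.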
Kohn et al. (2001) consider Metropolized Gibbs algorithms which are the same as the Hybrid Algorithm of Clyde et al. (1996), except that at coordinate $j$, instead of deterministically proposing to go from $\gamma_j$ to $\gamma^*_j = 1 - \gamma_j$, the proposed value $\gamma^*_j$ is equal to $1 - \gamma_j$ with probability depending on the current state $\gamma$. Kohn et al. (2001) describe two such algorithms, and show that these are more computationally efficient than the Gibbs sampler in situations where on average $q_\gamma$ is small, i.e. the models are sparse. They also conduct a detailed simulation study of one of their sampling schemes (their SS) which suggests that, while the scheme produces estimates whose standard errors are a bit larger than those produced by the Gibbs sampler, this disadvantage is more than outweighed by its computational efficiency.

All the algorithms mentioned above require, in one way or another, the calculation of $p(\gamma^* \mid Y)/p(\gamma \mid Y)$. Because of the conjugate nature of model (1), the marginal likelihood of model $\gamma$ is available in closed form, and therefore $p(\gamma \mid Y)$ is available up to a normalizing constant. We have

$$p(\gamma \mid Y) \propto (1+g)^{-q_\gamma/2}\, S^{-(m-1)}\,\big[1 + g\,(1 - R_\gamma^2)\big]^{-(m-1)/2}\, w^{q_\gamma}(1-w)^{q-q_\gamma},$$

where $S^2 = \sum_{j=1}^{m}(Y_j - \bar Y)^2$ and $R_\gamma^2$ is the coefficient of determination of model $\gamma$. As is standard for model (1), we assume that the columns of the design matrix are centered, and in this case, $R_\gamma^2 = Y' X_\gamma (X_\gamma' X_\gamma)^{-1} X_\gamma' Y / S^2$. The main computational burden in obtaining this expression is the calculation of $R_\gamma^2$, which is time-consuming if $q_\gamma$ is large. Smith and Kohn (1996) note that, when $\gamma$ and $\gamma^*$ differ in only one component, $R_{\gamma^*}^2$ can be obtained rapidly from $R_\gamma^2$. We return to this point in Appendix B.
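A minimal Python sketch of the log of this unnormalized posterior probability is given below. It assumes the form $(1+g)^{-q_\gamma/2}\,[1 + g(1 - R_\gamma^2)]^{-(m-1)/2}\,w^{q_\gamma}(1-w)^{q-q_\gamma}$, drops the factor $S^{-(m-1)}$ because it does not depend on $\gamma$, and takes $R_\gamma^2$ as an input rather than computing it from the data. The function name and argument names are hypothetical.

```python
import math

def log_post_unnormalized(q_gamma, r2_gamma, m, q, g, w):
    """Log of p(gamma | Y) up to an additive constant that is free of gamma.

    q_gamma  : number of variables included in model gamma
    r2_gamma : coefficient of determination R^2 of model gamma
    m        : number of observations
    q        : total number of candidate variables
    g, w     : the two hyperparameters
    """
    return (-0.5 * q_gamma * math.log1p(g)
            - 0.5 * (m - 1) * math.log1p(g * (1.0 - r2_gamma))
            + q_gamma * math.log(w)
            + (q - q_gamma) * math.log(1.0 - w))
```

Differences of this function between two models give the log acceptance ratio needed by the samplers reviewed above.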
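In our situation, we need to generate a Markov chain on $\theta$, because the Bayes factor estimates given in Chapter 2 require samples from the posterior distribution of $\theta$. The algorithm we use in the present paper is based on the Gibbs sampler on $\gamma$ introduced in Smith and Kohn (1996) (although the computational implementation we use is different from theirs), followed by three steps to generate $\sigma$, $\beta_0$, and $\beta_\gamma$. In a bit more detail, let $V(\gamma,\cdot)$ be the Markov transition function corresponding to the Gibbs sampler in Smith and Kohn (1996), i.e. $V(\gamma,\cdot)$ is the distribution of the next state given that the current state is $\gamma$, and let $v(\gamma,\gamma') = V(\gamma,\{\gamma'\})$ be the corresponding probability mass function.

Suppose the current state is $\big(\gamma^{(i)}, \sigma^{(i)}, \beta_0^{(i)}, \beta^{(i)}_{\gamma^{(i)}}\big)$. We proceed as follows.

1. We update $\gamma^{(i)}$ to $\gamma^{(i+1)}$ using $V(\gamma^{(i)},\cdot)$. The generation of $\gamma^{(i+1)}$ does not involve $\big(\sigma^{(i)}, \beta_0^{(i)}, \beta^{(i)}_{\gamma^{(i)}}\big)$.

2. We generate $\sigma^{(i+1)}$ from the conditional distribution of $\sigma$ given $\gamma = \gamma^{(i+1)}$ and the data.

3. We generate $\beta_0^{(i+1)}$ from the conditional distribution of $\beta_0$ given $\gamma = \gamma^{(i+1)}$, $\sigma = \sigma^{(i+1)}$, and the data.

4. We generate $\beta^{(i+1)}_{\gamma^{(i+1)}}$ from the conditional distribution of $\beta_\gamma$ given $\gamma = \gamma^{(i+1)}$, $\sigma = \sigma^{(i+1)}$, $\beta_0 = \beta_0^{(i+1)}$, and the data.

The details describing the distributions involved and the computations needed are given in Appendix B. The algorithm above gives a sequence $\theta^{(1)}, \theta^{(2)}, \ldots$, and it is easy to see that this sequence is a Markov chain.

As Markov chains on the $\gamma$ sequence, the relative performance of the Gibbs sampler vs. SS depends, in part, on $m$, $q$, $h$, and the data set itself, and neither algorithm is uniformly superior to the other. In principle, in Step 1 of our algorithm we can use any Markov transition function that generates a chain on $\gamma$, including SS. We chose to work with the Gibbs sampler because it is easier to develop a regeneration scheme for this chain than for the other chains.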
The output of the chain can be used in several ways. An obvious way is to use the highest posterior probability model (HPM). Unfortunately, when $q$ is bigger than around 20, the number of models, $2^q$, is very large, and it may happen that no single model has appreciable probability, and in any case, it is very difficult or impossible to identify the HPM from the Markov chain output. Barbieri and Berger (2004) argue in favor of the median probability model (MPM), which is defined to be the model that includes all variables $j$ for which the marginal inclusion probability $P(\gamma_j = 1 \mid Y) \ge 1/2$. We mention here the Bayesian Adaptive Sampling method of Clyde et al. (2009), which gives an algorithm for providing samples without replacement from the set of models. Under certain conditions, the algorithm has the feature that these are perfect samples without replacement; it then enables an efficient search for the HPM.

Uniform Ergodicity. Let $\Theta = \{0,1\}^q \times (0,\infty) \times \mathbb{R}^{q+1}$, let $\nu$ be the prior distribution of $\theta$ specified by (1b) and (1c), and let $\nu_y$ be the posterior distribution of $\theta$ given $Y = y$. For the remainder of this section the subscript $h$ is suppressed since we are dealing with a single specification of this hyperparameter. Let $K$ denote the Markov transition function for the Markov chain on $\Theta$ described in the beginning of this chapter, i.e. $K(\theta_0,\cdot)$ is the distribution of $\theta_1$ given that the current state is $\theta_0$, and let $K^n(\theta_0,\cdot)$ denote the corresponding $n$-step Markov transition function. Harris ergodicity of the chain is the condition that $\|K^n(\theta,\cdot) - \nu_y(\cdot)\| \to 0$ for all $\theta \in \Theta$, where $\|\cdot\|$ denotes supremum over all Borel subsets of $\Theta$. This condition is guaranteed by the so-called usual regularity conditions, namely that the chain has an invariant probability measure, is irreducible, aperiodic, and Harris recurrent; see, e.g., Theorem 13.0.1 of Meyn and Tweedie (1993).
These usual regularity conditions are typically easy to check; in the present context, they are implied for example if the Markov transition function has a density with respect to the product of counting measure on $\{0,1\}^q$ and Lebesgue measure on $(0,\infty) \times \mathbb{R}^{q+1}$ which is everywhere positive, which is the case in our situation. Uniform ergodicity is the far stronger condition that there exist constants $c \in [0,1)$ and $M > 0$ such that for any $n \in \mathbb{N}$,

$$\|K^n(\theta,\cdot) - \nu_y(\cdot)\| \le M c^n \quad\text{for all } \theta \in \Theta.$$

Proposition 1 The chain driven by $K$ is uniformly ergodic.

The proof of Proposition 1 is given in Appendix C. Let $\theta_0, \theta_1, \ldots$ be a Markov chain driven by $K$, let $l$ be a real-valued function of $\theta$ (for example $l(\theta) = I(\gamma_1 = 1)$, the indicator that variable 1 is in the model), and suppose we wish to form confidence intervals for the posterior expectation of $l(\theta)$. Suppose that $E\big(l^2(\theta) \mid y\big) < \infty$. Then since the chain is uniformly ergodic, Corollary 4.2 of Cogburn (1972) implies that, with $\mathrm{Var}(l(\theta_0))$ and $\mathrm{Cov}(l(\theta_0), l(\theta_j))$ calculated under the assumption that $\theta_0$ has the stationary distribution, the series

$$\varsigma^2 = \mathrm{Var}\big(l(\theta_0)\big) + 2\sum_{j=1}^{\infty}\mathrm{Cov}\big(l(\theta_0), l(\theta_j)\big)$$

converges absolutely, and if $\varsigma^2 > 0$, then with $\theta_0$ having an arbitrary distribution, the estimate $\bar l_n = \frac{1}{n}\sum_{i=0}^{n-1} l(\theta_i)$ satisfies

$$n^{1/2}\big(\bar l_n - E[l(\theta) \mid y]\big) \xrightarrow{d} N(0, \varsigma^2) \quad\text{as } n \to \infty.$$

The Markov chain driven by $K$ is also regenerative, and in Appendix C we give an explicit minorization condition that can be used to introduce regenerations into the chain.
Functions that run the chain and implement the regeneration scheme are provided in the R package bvslr, available from http://www.stat.ufl.edu/~ebuta/BVSLR

In Chapters 1 and 2, $\nu_h$ and $\nu_{h,y}$ refer to the prior and posterior densities, and all estimates in Chapter 2 involve ratios of these prior densities. In the Bayesian linear regression model that we are considering here, the priors $\nu_h$ on $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$ are actually probability measures on $\{0,1\}^q \times (0,\infty) \times \mathbb{R}^{q+1}$, which in fact are not absolutely continuous with respect to the product of counting measure on $\{0,1\}^q$ and Lebesgue measure on $(0,\infty) \times \mathbb{R}^{q+1}$. For $h_1 = (w_1, g_1)$ and $h_2 = (w_2, g_2)$, the Radon-Nikodym derivative of $\nu_{h_1}$ with respect to $\nu_{h_2}$ is given by

$$\Big[\frac{d\nu_{h_1}}{d\nu_{h_2}}\Big](\gamma, \sigma, \beta_0, \beta_\gamma) = \Big(\frac{w_1}{w_2}\Big)^{q_\gamma}\Big(\frac{1-w_1}{1-w_2}\Big)^{q-q_\gamma}\,\frac{\phi_{q_\gamma}\big(\beta_\gamma;\, 0,\, g_1\sigma^2 (X_\gamma' X_\gamma)^{-1}\big)}{\phi_{q_\gamma}\big(\beta_\gamma;\, 0,\, g_2\sigma^2 (X_\gamma' X_\gamma)^{-1}\big)},$$

where $\phi_q(u; a, V)$ is the density of the $q$-dimensional normal distribution with mean $a$ and covariance $V$, evaluated at $u$ (Doss 2007). It is immediate that all formulas in Chapter 2 remain valid if ratios of the form $\nu_h/\nu_{h_1}$ (see, e.g., the estimators of Chapter 2) are replaced by the Radon-Nikodym derivative $[d\nu_h/d\nu_{h_1}]$. Fortunately, evaluation of this derivative requires neither matrix inversion nor calculation of a determinant, so can be done very quickly. Note that in view of this expression, it is not enough to have Markov chains running on the $\gamma$'s, and we need Markov chains running on the $\theta$'s, or at least on the components of $\theta$ that enter the derivative.

5.2 Choice of the Hyperparameter

As mentioned earlier, regarding $w$, the proposals in the literature are quite simple: either $w$ is fixed at $1/2$, or a beta prior is put on $w$. The discussion below focuses primarily on $g$, for which there is an extensive literature, and we now summarize the portion of this literature that is directly relevant to the present work. Broadly speaking, recommendations regarding $g$ can be divided into three categories:

Data-Independent Choices. In the simple case where the setup is given by (1) but without (1c), i.e. the true model $\gamma$ is assumed known, the posterior distribution of $\beta_\gamma$ given $\sigma$ is $N\big(\tfrac{g}{g+1}\hat\beta_\gamma,\ \tfrac{g}{g+1}\sigma^2 (X_\gamma' X_\gamma)^{-1}\big)$, where $\hat\beta_\gamma$ is the usual least squares estimate of $\beta_\gamma$. If $q$ is fixed and $m \to \infty$, under standard conditions $X_\gamma' X_\gamma/m \to \Sigma_\gamma$, where $\Sigma_\gamma$ is a positive definite matrix; therefore if $g$ is fixed, this distribution is approximately a point mass at $\tfrac{g}{g+1}\hat\beta_\gamma$, so the posterior is not even consistent, and we see that a necessary condition for consistency is that $g \to \infty$. Data-independent choices of $g$ include Kass and Wasserman's (1995) recommendation of $g = m$, and Fernandez et al.'s (2001) recommendation of $g = \max(m, q^2)$, following upon Foster and George's (1994) earlier recommendation of $g = q^2$.

Liang et al. (2008) argue that, in general, data-independent choices of $g$ have the following undesirable property, referred to as the Information Paradox. When the data give overwhelming evidence in favor of model $\gamma$ (e.g. $\|\hat\beta_\gamma\| \to \infty$), then using $\gamma_0$ to denote the null model (i.e. the model that includes only the intercept), the ratio of posterior probabilities $p(\gamma \mid Y)/p(\gamma_0 \mid Y)$ does not tend to infinity.

Empirical Bayes (EB) Methods. In global EB procedures, an estimate of $g$ common for all models is derived from its marginal likelihood; see George and Foster (2000). In local EB, an estimate of $g$ is derived for each model; see Hansen and Yu (2001).
Unfortunately, the EB method is in general computationally demanding because the likelihood is a sum over all $2^q$ models $\gamma$, so it is practically feasible only for relatively small values of $q$. Liang et al. (2008) show that the EB method is consistent in the frequentist sense: if $\gamma_{\text{true}}$ is the true model, then if $g$ is chosen via the EB method, the posterior probability $P(\gamma = \gamma_{\text{true}} \mid Y)$ converges to 1 as $m \to \infty$. See Theorem 3 of Liang et al. (2008) for a precise statement. This result refers only to the case where $w$ is fixed at $1/2$, and only $g$ is estimated. Liang et al. (2008) propose an EM algorithm for estimating $g$ in the global EB setting. In their algorithm, the model indicator and $\sigma$ are treated as missing data. While their approach is certainly useful, there are some problems associated with it. Each step in the EM algorithm involves a sum of $2^q$ terms. Unless $q$ is relatively small, complete enumeration is not possible, and Liang et al. (2008) propose summing only over the most significant terms. However, determining which terms these are may be very difficult in some problems. Also, the EM algorithm gives a single point estimate. What we do is different: we estimate the Bayes factor for all $g$ and $w$. This enables us in particular to estimate the maximizing values; but it also allows us to rule out large regions of the hyperparameter space. Additionally, our method allows us to carry out sensitivity analysis. We also mention very briefly that if we are interested only in the maximizing values, then the method proposed in the present paper can be used to form a stochastic search algorithm. The basic requirement for such algorithms is that we know the gradient $\partial B(h, h_1) / \partial h$. But the same methodology used to estimate $B(h, h_1)$ can also be used to estimate its gradient. For example, in the simple estimate (2), we just replace $\nu_h(\theta_i^{(l)})$ by $\partial \nu_h(\theta_i^{(l)}) / \partial h$.

Fully Bayes (FB) Methods. The most common prior on $g$ is the Zellner and Siow (1980) prior, an inverse-gamma which results in a multivariate Cauchy prior for $\beta_\gamma$. The family of hyper-$g$ priors is introduced by Cui and George (2008) and developed further by Liang et al. (2008), who show that these have several desirable properties. In particular, they do not suffer from the information paradox, and they exhibit important consistency properties.
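Returning to the Radon-Nikodym derivative of $\nu_{h_1}$ with respect to $\nu_{h_2}$ given before Section 5.2, the claim that it needs neither a matrix inverse nor a determinant can be made concrete. Because the two normal densities share the matrix $(X_\gamma' X_\gamma)^{-1}$, their ratio reduces to $(g_2/g_1)^{q_\gamma/2} \exp\{ -(1/g_1 - 1/g_2) \|X_\gamma \beta_\gamma\|^2 / (2\sigma^2) \}$. The sketch below evaluates the logarithm of the derivative this way; the function name, argument layout, and numerical inputs are hypothetical and chosen only for illustration.

```python
import math


def log_rn_derivative(gamma, sigma, beta_gamma, X, h1, h2):
    """log of d nu_{h1} / d nu_{h2} at (gamma, sigma, beta_gamma); h = (w, g).

    X is a list of rows; gamma is a 0/1 tuple selecting columns of X.
    Only the quadratic form ||X_gamma beta_gamma||^2 is needed.
    """
    (w1, g1), (w2, g2) = h1, h2
    q = len(gamma)
    cols = [j for j, included in enumerate(gamma) if included]
    q_gamma = len(cols)
    # beta' X'X beta computed as the squared norm of X_gamma beta_gamma
    quad = sum(sum(row[j] * b for j, b in zip(cols, beta_gamma)) ** 2 for row in X)
    return (q_gamma * math.log(w1 / w2)
            + (q - q_gamma) * math.log((1.0 - w1) / (1.0 - w2))
            + 0.5 * q_gamma * math.log(g2 / g1)
            - 0.5 * (1.0 / g1 - 1.0 / g2) * quad / sigma ** 2)


def normal_pdf(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


# Illustrative inputs (assumptions): a 4 x 2 design, model gamma = (1, 0).
X = [[1.0, 3.0], [5.0, 3.0], [8.0, 7.0], [8.0, 10.5]]
gamma, sigma, beta_gamma = (1, 0), 2.0, [0.3]
h1, h2 = (0.5, 15.0), (0.3, 50.0)

fast = math.exp(log_rn_derivative(gamma, sigma, beta_gamma, X, h1, h2))

# Direct evaluation for comparison; with one selected column, (X'X)^{-1} is a scalar.
sxx = sum(row[0] ** 2 for row in X)
direct = ((h1[0] / h2[0]) * ((1 - h1[0]) / (1 - h2[0]))
          * normal_pdf(beta_gamma[0], h1[1] * sigma ** 2 / sxx)
          / normal_pdf(beta_gamma[0], h2[1] * sigma ** 2 / sxx))
print(fast, direct)
```

Working on the log scale is a design choice made here to avoid underflow when many such ratios are multiplied or averaged.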
Both the EB methods and FB methods have their own advantages and disadvantages. Cui and George (2008) give evidence that EB methods outperform FB methods. This is based on extensive simulation studies in cases where numerical methods are feasible. Also, FB methods require one to specify hyperparameters of the prior on the hyperparameter $h$, and different choices lead to different inferences. Additionally, in EB methods, one uses a model with a single value of $h$, and the resulting inference is more parsimonious and interpretable.

On the other hand, as with many likelihood-based methods, special care needs to be taken when the maximizing value is at the boundary. When we use the EB method, if the maximizing value of $w$ is 0 or 1, the posterior assigns probability one to the null model or full model (the model that includes all variables), respectively. This is similar to the very simple situation in which we have $X \sim \text{binomial}(n, p)$: if we observe $X = 0$ then not only is the maximum likelihood estimate of $p$ equal to 0, but the associated standard error estimate is also 0, and the naive Wald-type confidence interval for $p$ is the singleton $\{0\}$. Of course in this simple case there exist modifications to the maximum likelihood estimate $\hat p = X/n$ which yield procedures that do not give rise to this degeneracy. How to develop corresponding modifications to the maximum likelihood estimate of the Bernoulli parameter $w$ in the present context is a problem that is much more difficult, but certainly worthy of investigation.
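The binomial degeneracy is easy to reproduce. The sketch below computes the naive Wald interval for $X = 0$ and, as one example of a modification that avoids the collapse, the Agresti-Coull adjusted interval. The choice of Agresti-Coull and the value $n = 20$ are assumptions made for illustration; the text does not single out a particular modification.

```python
import math


def wald_interval(x, n, z=1.96):
    """Naive Wald interval p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return (p_hat - z * se, p_hat + z * se)


def agresti_coull_interval(x, n, z=1.96):
    """Adjusted interval: add z^2/2 successes and z^2/2 failures before applying Wald."""
    n_tilde = n + z * z
    p_tilde = (x + z * z / 2.0) / n_tilde
    se = math.sqrt(p_tilde * (1.0 - p_tilde) / n_tilde)
    return (max(0.0, p_tilde - z * se), min(1.0, p_tilde + z * se))


wald = wald_interval(0, 20)
adjusted = agresti_coull_interval(0, 20)
print(wald)       # both endpoints are 0: the interval is the singleton {0}
print(adjusted)   # lower endpoint clipped at 0, upper endpoint strictly positive
```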
Scott and Berger (2010) consider the same model for variable selection that we consider here, i.e. model (1), but with a Zellner-Siow prior on $g$, and the remaining parameter, $w$, estimated by maximum likelihood. They show that if the null model has the largest marginal likelihood, then the MLE of $w$ is 0 and if the full model has the largest marginal likelihood, then the MLE of $w$ is 1. Each of these gives rise to the degeneracy discussed above. Their result is not true in our setup, in which we do not put a prior on $g$, but rather estimate both $w$ and $g$ by maximum likelihood. To see this, consider a very simple example, in which $Y = (2, 1, 9, 5)'$ and

$$X = \begin{pmatrix} 1 & 3 \\ 5 & 3 \\ 8 & 7 \\ 8 & 10.5 \end{pmatrix}.$$

We have $R^2_{\gamma=(1,1)} = 0.52$, $R^2_{\gamma=(1,0)} = 0.51$, $R^2_{\gamma=(0,1)} = 0.40$, and $R^2_{\gamma=(0,0)} = 0$. Now

$$P(Y \mid \gamma, g) = c_Y \, \frac{(1 + g)^{(3 - q_\gamma)/2}}{\big( 1 + g (1 - R^2_\gamma) \big)^{3/2}},$$

where $c_Y$ does not depend on $g$ or $\gamma$. Therefore,

$$(\hat g, \hat w) = \operatorname*{argmax}_{g, w} \sum_{\gamma} w^{q_\gamma} (1 - w)^{q - q_\gamma} \frac{(1 + g)^{(3 - q_\gamma)/2}}{\big( 1 + g (1 - R^2_\gamma) \big)^{3/2}} \approx (.5, .2).$$

From an equation of Scott and Berger (2010) we know that under the Zellner-Siow null prior, we have

$$\frac{P(Y \mid \gamma)}{P(Y \mid \gamma = (0,0))} = \int_0^\infty \frac{(1 + g)^{(3 - q_\gamma)/2}}{\big( 1 + g (1 - R^2_\gamma) \big)^{3/2}} \, (2/\pi)^{1/2} g^{-3/2} \exp(-2/g) \, dg = \begin{cases} .72 < 1 & \text{for } \gamma = (1,0) \\ .58 < 1 & \text{for } \gamma = (0,1) \\ .31 < 1 & \text{for } \gamma = (1,1) \end{cases}$$

and hence the null model has the strictly largest marginal likelihood among all models.
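Both displays in this toy example can be checked numerically from the four quoted $R^2$ values alone. The sketch below is a minimal check, not the computation used in the text: it maximizes the sum over the four models on a grid in $(g, w)$, and it evaluates the Zellner-Siow integrals with a trapezoid rule in $t = \log g$. The grid limits, step sizes, and function names are assumptions, and the rounded $R^2$ values are used as printed.

```python
import math

# Rounded R^2 values quoted in the text, keyed by the inclusion vector gamma.
R2 = {(0, 0): 0.0, (1, 0): 0.51, (0, 1): 0.40, (1, 1): 0.52}
Q = 2  # number of candidate predictors in the toy example


def scaled_marginal(gamma, g):
    """P(Y | gamma, g) / c_Y = (1+g)^((3-q_gamma)/2) / (1 + g(1 - R2_gamma))^(3/2)."""
    q_gamma = sum(gamma)
    return (1.0 + g) ** ((3 - q_gamma) / 2) / (1.0 + g * (1.0 - R2[gamma])) ** 1.5


def eb_objective(g, w):
    """Marginal likelihood of (g, w), up to the constant c_Y."""
    return sum(w ** sum(gamma) * (1.0 - w) ** (Q - sum(gamma)) * scaled_marginal(gamma, g)
               for gamma in R2)


def grid_argmax(g_max=5.0, step=0.01):
    """Exhaustive search over g in (0, g_max] and w in [0, 1]."""
    best_value, best_g, best_w = -1.0, None, None
    for i in range(1, int(round(g_max / step)) + 1):
        g = i * step
        for j in range(int(round(1.0 / step)) + 1):
            w = j * step
            value = eb_objective(g, w)
            if value > best_value:
                best_value, best_g, best_w = value, g, w
    return best_value, best_g, best_w


def zellner_siow_ratio(gamma, t_lo=-4.0, t_hi=40.0, steps=44_000):
    """Trapezoid rule in t = log g for the integral of scaled_marginal times the ZS density."""
    h = (t_hi - t_lo) / steps
    total = 0.0
    for k in range(steps + 1):
        g = math.exp(t_lo + k * h)
        density = math.sqrt(2.0 / math.pi) * g ** -1.5 * math.exp(-2.0 / g)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * scaled_marginal(gamma, g) * density * g   # dg = g dt
    return total * h


best_value, g_hat, w_hat = grid_argmax()
zs = {gamma: zellner_siow_ratio(gamma) for gamma in [(1, 0), (0, 1), (1, 1)]}
print(g_hat, w_hat, best_value)
print(zs)
```

At $w = 0$ the objective equals 1 for every $g$, so a maximum value above 1 at an interior $w$ is exactly the statement that the maximizing $w$ is not 0, even though every Zellner-Siow ratio is below 1.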
Lemma 4.1 of Scott and Berger (2010) implies that, with a Zellner-Siow prior on $g$, $\hat w = 0$, while in our setup, the same data give $\hat w > 0$.

5.3 Examples

We illustrate our methods on two examples. The first is the U.S. crime data of Vandaele (1978), which can be found in the R library MASS under the name UScrime. We use this data set because it has been studied in several papers already so we can compare our results with previous analyses, and also because the number of variables is small enough to enable a closed-form calculation of the marginal likelihood $m_h$, so we can compare our estimates with the gold standard. The second data set is the ozone data originally analyzed by Breiman and Friedman (1985). We use this data set because it involves 44 variables, even though only a few of those are important, and we wanted to show how our methodology handles a data set with this character.

5.3.1 U.S. Crime Data

The data set gives, for each of $m = 47$ U.S. states, the crime rate, defined as number of offenses per 100,000 individuals (the response variable), and $q = 15$ predictors measuring different characteristics of the population, such as average number of years of schooling, average income, unemployment rate, etc.
To be consistent with what is done in the literature, we applied a log transformation to all variables, except the indicator variable. We took the baseline hyperparameter to be $h_1 = (w_1, g_1) = (.5, 15)$, and our goal was to estimate $B(h, h_1)$ for the 924 values of $h$ obtained when $w$ ranges from 0.1 to 0.91 by increments of 0.03, and $g$ ranges from 4 to 100 by increments of 3. We used (2) and this estimate was based on 16 chains each of length 10,000, corresponding to the skeleton grid of hyperparameter values

$$(w, g) \in \{.3, .5, .6, .8\} \times \{15, 50, 100, 225\}$$

for the Stage 1 samples, and 16 new chains, each of length 1000, corresponding to the same hyperparameter values, for the Stage 2 samples. The plots in Figure 5-1 give graphs of the estimate (2) as $w$ and $g$ vary, from two different angles. These indicate that values for $w$ around 0.65 and for $g$ around 20 seem appropriate, while values of $w$ less than .3 and values of $g$ greater than 60 should be avoided. A side calculation showed that, interestingly, for $g = \max\{m, q^2\} = 225$, the estimate of $B((w, g), (.65, 20))$ is less than .008 regardless of the value of $w$, so this choice should not be used for this data set. With the long chains used and the estimate that uses control variates, the Bayes factor estimates in Figure 5-1 are extremely accurate (root mean squared errors are less than 0.04 uniformly over the entire domain of the plot and considerably less in the convex hull of the skeleton grid; our calculation of the root mean squared errors used the closed-form expression for the Bayes factors based on complete enumeration). The figure took about a half hour to generate on an Intel 2.8 GHz Q9550 running Linux. The accuracy we obtained is overkill and the figure can be created in a few minutes if we use more typical Markov chain lengths.

Figure 5-1. Estimates of Bayes factors for the U.S. crime data. The plots give two different views of the graph of the Bayes factor as a function of $w$ and $g$ when the baseline value of the hyperparameter is given by $w = 0.5$ and $g = 15$. The estimate is (2), which uses control variates.

Table 5-1 gives the posterior inclusion probabilities for each of the fifteen predictors, i.e. $P(\gamma_i = 1 \mid y)$ for $i = 1, \ldots, 15$, under several models. Line 2 gives the inclusion probabilities when we use model (1) with the values $w = .65$ and $g = 20$, which are the values at which the graph in Figure 5-1 attains its maximum. Line 4 gives the inclusion probabilities when the hyper-$g$ prior "HG3" in Liang et al. (2008) is used. As can be seen, the inclusion probabilities we obtained under the EB model are comparable to, but somewhat larger than, the probabilities when the HG3 prior is used. This is not surprising since our model allows $w$ to be chosen, and the data-driven choice gives a value (.65) greater than the value $w = .5$ used in Liang et al. (2008). Table 2 of Liang et al. (2008) gives a comparison of posterior inclusion probabilities for a total of ten models taken from the literature. Line 3 of Table 5-1 gives the inclusion probabilities under model (1) when we use $w = .5$ and the value of $g$ that maximizes the likelihood with $w$ constrained to be .5. It is interesting to note that the inclusion probabilities are then strikingly close to those under the HG3 model.

Table 5-1. Posterior inclusion probabilities for the fifteen predictor variables in the U.S. crime data set, under three models. Names of the variables are as in Table 2 of Liang et al. (2008) but all variables except for the binary variable S have been log transformed.

| Model | Age | S | Ed | Ex0 | Ex1 | LF | M | N | NW | U1 | U2 | W | X | Prison | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EB, $w = .65$ | .93 | .39 | .99 | .70 | .51 | .34 | .35 | .52 | .83 | .40 | .76 | .55 | 1.00 | .96 | .55 |
| EB, $w = .5$ | .85 | .29 | .97 | .67 | .45 | .22 | .22 | .38 | .70 | .27 | .62 | .38 | 1.00 | .90 | .39 |
| HG3 | .84 | .29 | .97 | .66 | .47 | .23 | .23 | .39 | .69 | .27 | .61 | .38 | .99 | .89 | .38 |

Figure 5-2 gives plots of the posterior inclusion probabilities for Variables 1 and 6 as $w$ and $g$ vary. The literature recommends various choices for $g$ [in particular $g = m$ in Kass and Wasserman (1995), $g = q^2$ in Foster and George (1994), $g = \max(m, q^2)$ in Fernandez et al. (2001)], and posterior inclusion probabilities for all these choices combined with any choice of $w$ can be read directly from the figure. The extent to which these probabilities change with the choice of $g$ is quite striking.

Figure 5-2. Estimates of posterior inclusion probabilities for Variables 1 and 6 for the U.S. crime data. The estimate used is (2).
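The bookkeeping for the crime-data analysis can be reproduced in a few lines. The sketch below builds the evaluation grid and the skeleton grid described above and evaluates the data-independent choice $g = \max(m, q^2)$; the helper name `hyper_grid` and the small tolerance used to guard against floating-point error are assumptions of this example.

```python
import math
from itertools import product


def hyper_grid(start, stop, step):
    """Values start, start + step, ... that do not exceed stop."""
    count = int(math.floor((stop - start) / step + 1e-9)) + 1
    return [round(start + i * step, 10) for i in range(count)]


w_grid = hyper_grid(0.10, 0.91, 0.03)     # 28 values of w
g_grid = hyper_grid(4, 100, 3)            # 33 values of g
evaluation_points = list(product(w_grid, g_grid))

skeleton = list(product([0.3, 0.5, 0.6, 0.8], [15, 50, 100, 225]))

m, q = 47, 15
g_fernandez = max(m, q ** 2)              # data-independent choice g = max(m, q^2)

print(len(w_grid), len(g_grid), len(evaluation_points), len(skeleton), g_fernandez)
```

The counts agree with the 924 evaluation points and 16 skeleton points stated in the text, and the largest skeleton value of $g$ coincides with $\max(m, q^2) = 225$.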
Selection of the skeleton points was discussed at the end of Chapter 3, and we now return to this issue. Consider the Bayes factor estimate based on the skeleton (5), which was chosen in an ad-hoc manner. The left panel in Figure 5-3 gives a plot of the variance of this estimate, as a function of $h$. As can be seen from the plot, the variance is greatest in the region where $g$ is small and $w$ is large. We changed the skeleton from (5) to

$$(w, g) \in \{.5, .7, .8, .9\} \times \{10, 15, 50, 100\}$$

and reran the algorithm. The variance for the estimate based on (5-5) is given by the right panel of Figure 5-3, from which we see that the maximum variance has been reduced by a factor of about 9.

Figure 5-3. Variance functions for two versions of $\hat I_{\hat d, \hat\beta(\hat d)}$. The left panel is for the estimate based on the skeleton (5). The points in this skeleton were shifted to better cover the problematic region near the back of the plot ($g$ small and $w$ large), creating the skeleton (5-5). The maximum variance is then reduced by a factor of 9 (right panel).

5.3.2 Ozone Data

This data set was originally analyzed in Breiman and Friedman (1985), was used in many papers since, and was recently analyzed in a Bayesian framework by Casella and Moreno (2006) and Liang et al. (2008). The data consist of daily measurements of ozone concentration and eight meteorological quantities in the Los Angeles basin for 330 days of 1976. The response variable is the daily ozone concentration, and we follow Liang et al. (2008) in considering 44 possible predictors: the eight meteorological measurements, their squares, and their two-way interactions. Liang et al. (2008) give a review of the literature on priors for the hyperparameter $g$ and advocate the hyper-$g$ priors. They compare 10 variable selection techniques (including three hyper-$g$ priors) on this data set by using a cross-validation procedure: the data set is randomly split in two halves, one of which (the training sample) is used for selecting the model (for the Bayesian methods this is the highest probability model), while the other (the validation sample) is used for measuring the predictive accuracy. The predictive accuracy of method $j$ is measured through the square-root of the mean squared prediction error (RMSE) of the selected model $\gamma_j$, defined by

$$\mathrm{RMSE}(\gamma_j) = \Big( n_V^{-1} \sum_{i \in V} (Y_i - \hat Y_i)^2 \Big)^{1/2}.$$

Here, $V$ is the validation set, $n_V$ is its size, and $\hat Y_i$ is the fitted value of observation $i$ under model $\gamma_j$. Liang et al. (2008) point out the curious fact that the RMSE's of the 10 methods are all very close (they range from 4.4 to 4.6), but the selected models differ greatly in the number of variables selected, which range from 3 to 18.

We investigated the performance of our methodology using a split of the data into training and validation sample identical to the one used by Liang et al. (2008). We took the baseline hyperparameter to be the pair $h_1 = (w_1, g_1) = (.2, 50)$ and the skeleton grid of hyperparameters to consist of the 16 pairs

$$(w, g) \in \{.1, .2, .3, .5\} \times \{15, 50, 100, 150\}.$$

To identify the value of $h$ that maximizes the Bayes factor $B(h, h_1)$, we estimated this quantity for a grid of the 750 values of $h$ obtained when $w$ ranges from .01 to .5 by increments of .02, and $g$ ranges from 5 to 150 by increments of 5. These estimates were based on 16 chains each of length 10,000, corresponding to the skeleton grid of hyperparameter values for the Stage 1 samples, and 16 new chains, each of length 1000, corresponding to the same hyperparameter values, for the Stage 2 samples. Figure 5-4 gives a plot of these estimates of $B(h, h_1)$ as a function of $w$ and $g$. The standard error is less than .014 over the entire range of the plot.
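Two quantities in this subsection are simple enough to sketch directly: the validation RMSE and the size of the evaluation grid. The code below is illustrative only; the function names and the made-up numbers passed to `rmse` are assumptions and are not taken from the ozone data.

```python
import math


def rmse(observed, fitted):
    """Square root of the mean squared prediction error over a validation set."""
    if len(observed) != len(fitted) or not observed:
        raise ValueError("observed and fitted must be non-empty and of equal length")
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(observed, fitted)) / len(observed))


def hyper_grid(start, stop, step):
    """Values start, start + step, ... that do not exceed stop."""
    count = int(math.floor((stop - start) / step + 1e-9)) + 1
    return [round(start + i * step, 10) for i in range(count)]


# Hypothetical validation responses and fitted values.
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])   # sqrt(4 / 3)

w_grid = hyper_grid(0.01, 0.5, 0.02)    # .01, .03, ..., .49
g_grid = hyper_grid(5, 150, 5)          # 5, 10, ..., 150
print(example_rmse, len(w_grid), len(g_grid), len(w_grid) * len(g_grid))
```

With increments of .02 starting at .01, the last value of $w$ not exceeding .5 is .49, which gives 25 values of $w$ and, with 30 values of $g$, the 750 grid points stated in the text.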
Figure 5-4. Estimates of Bayes factors for the ozone data. The plots give two different views of the graph of the Bayes factor as a function of $w$ and $g$ when the baseline value of the hyperparameter is given by $w = .2$ and $g = 50$.

The value of $h$ at which the maximum $B(h, h_1)$ is attained is $h = (.13, 75)$. We ran a new chain of length 100,000 corresponding to this value of $h$, and based on it we estimated the highest probability model to be the model containing the 4 variables dpg, ibt, vh.ibh, and humid.ibt (see Appendix D for a description of these variables). This model yields an out-of-sample RMSE of 4.5. Since the empirical Bayes choice of $w$ is relatively small ($\hat w = .13$), it is not surprising that the highest probability model includes only 4 variables (fewer than in any of the hyper-$g$ models recommended by Liang et al. (2008), which all include at least 6 variables). But it is interesting to note that nevertheless, this model gives an RMSE that is essentially the same as the RMSE of any of the other models.

We applied the regeneration algorithm described in Appendix B to the chain corresponding to the hyperparameter $h = (.13, 75)$ deemed optimal by our previous analysis. We ran the chain until $R = 3000$ regenerations occurred, which took 85,000 iterations. From the output, we obtained estimates of the posterior inclusion probabilities for every one of the 44 predictors, and formed the corresponding 95% confidence intervals, using the regeneration method discussed in Chapter 3. These are displayed in Figure 5-5.
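For readers who want to see how such intervals are typically formed, the sketch below implements the standard ratio estimator for regenerative simulation: the chain is cut into $R$ tours, each tour contributes its length $N_t$ and the sum $S_t$ of the function of interest over the tour, and the variance is estimated from the residuals $S_t - \hat\tau N_t$. This is the textbook form of the method and is assumed, not confirmed, to match the version discussed in Chapter 3. The function also returns the coefficient of variation of the mean tour length, a diagnostic proposed by Mykland et al. (1995). All names and the tiny input lists are hypothetical.

```python
import math


def regenerative_summary(tour_sums, tour_lengths, z=1.96):
    """Ratio estimate, normal-theory interval, and CV of the mean tour length.

    tour_sums[t]    : sum of l(theta_i) over tour t (e.g. iterations with a variable included)
    tour_lengths[t] : number of iterations N_t in tour t
    """
    R = len(tour_lengths)
    n_bar = sum(tour_lengths) / R
    estimate = sum(tour_sums) / sum(tour_lengths)
    # Variance of sqrt(R) * (estimate - truth), estimated from per-tour residuals.
    gamma2 = sum((s - estimate * n) ** 2
                 for s, n in zip(tour_sums, tour_lengths)) / (R * n_bar ** 2)
    half = z * math.sqrt(gamma2 / R)
    cv = math.sqrt(sum((n - n_bar) ** 2 for n in tour_lengths)) / (R * n_bar)
    return estimate, (estimate - half, estimate + half), cv


# Two tiny hand-checkable inputs.
est_a, ci_a, cv_a = regenerative_summary([2, 1, 3], [4, 2, 6])
est_b, ci_b, cv_b = regenerative_summary([1, 3], [2, 2])
print(est_a, ci_a, cv_a)
print(est_b, ci_b, cv_b)
```

In the first input every tour has the same within-tour average, so the interval collapses to a point; in the second the tour lengths are equal, so the coefficient of variation is 0 while the interval has positive width.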
Our choice of $R$ was arbitrary, but this choice should ultimately be based on the degree of accuracy one desires for the estimates of the quantities of interest. We considered our choice to be satisfactory for this particular analysis since the confidence intervals for the posterior inclusion probabilities for the 44 predictors have margins of error of at most 1%. Note that our chain regenerates relatively often with the average length of a tour ($\bar N$) being about 28. Mykland et al. (1995) recommend that one check that the coefficient of variation $\mathrm{CV}(\bar N) = (\mathrm{Var}(\bar N))^{1/2} / E(\bar N)$ of the average tour length is less than .1 before deeming $\sigma^2$ to be estimated properly by $\hat\sigma^2$. Their criterion seems to be met here since the strongly consistent estimator

$$\widehat{\mathrm{CV}}(\bar N) = \Big( \sum_{t=1}^{R} (N_t - \bar N)^2 \big/ (R \bar N)^2 \Big)^{1/2}$$

equals .02.

Figure 5-5. 95% confidence intervals of the posterior inclusion probabilities for the 44 predictors in the ozone data when the hyperparameter value is given by $w = .13$ and $g = 75$. A table giving the correspondence between the integers 1-44 and the predictors is given in Appendix D.

CHAPTER 6
DISCUSSION

The following fact is obvious, but it may be worthwhile to state it explicitly. If $h_1$ is fixed, maximizing $B(h, h_1)$ and maximizing the marginal likelihood $m_h$ are equivalent. Choosing the value of $h$ that maximizes $m_h$ is by definition the empirical Bayes method. Thus, the development in Chapter 2 can be used to implement empirical Bayes methods.

Our methodology for dealing with the sensitivity analysis and model selection problems discussed in Chapter 1 can be applied to many classes of Bayesian models.
In addition to the usual parametric models, we mention also Bayesian nonparametric models involving mixtures of Dirichlet processes (Antoniak 1974), in which one of the hyperparameters is the so-called total mass parameter (very briefly, this hyperparameter controls the extent to which the nonparametric model differs from a purely parametric model). Among the many papers that use such models, we mention in particular Burr and Doss (2005), who give a more detailed discussion of the role of the total mass parameter. The approach developed in Sections 2.1 and 2.2 can be used to select this parameter.

When the dimension of $h$ is low, it will be possible to plot $B(h, h_1)$, or at least plot it as $h$ varies along some of its dimensions. Empirical Bayes methods are notoriously difficult to implement when the dimension of the hyperparameter $h$ is high. In this case, it is possible to use the methods developed in Sections 2.1 and 2.2 to enable approaches based on stochastic search algorithms. These require the calculation of the gradient $\partial B(h, h_1) / \partial h$. We note that the same methodology used to estimate $B(h, h_1)$ can also be used to estimate its gradient. For example, in (2), $\nu_h(\theta_i^{(l)})$ is simply replaced by $\partial \nu_h(\theta_i^{(l)}) / \partial h$.

APPENDIX A
PROOF OF RESULTS FROM CHAPTER 1

Proof of Theorem 1

We begin by writing

$$\sqrt{n} \big( \hat B(h, h_1, \hat d) - B(h, h_1) \big) = \sqrt{n} \big( \hat B(h, h_1, \hat d) - \hat B(h, h_1, d) \big) + \sqrt{n} \big( \hat B(h, h_1, d) - B(h, h_1) \big). \tag{A.1}$$

The second term on the right side of (A.1) involves randomness coming only from the second stage of sampling. This term was analyzed by Doss (2010), who showed that it is asymptotically normal, with mean 0 and variance $\tau^2(h)$. The first term ostensibly involves randomness from both Stage 1 and Stage 2 sampling. However, as will emerge from our proof, the randomness from Stage 2 is of lower order, and effectively all the randomness is from Stage 1. This randomness is non-negligible. We mention here the often-cited work of Geyer (1994) whose nice results we use in the present paper. In the context of a setup very similar to ours, his Theorem 4 states that using an estimated $d$ and using the true $d$ results in the same asymptotic variance. From our proof (refer also to Remark 2 of Section 2.1), we see that this statement is not correct.

To analyze the first term on the right side of (A.1), we define the function $F(u) = \hat B(h, h_1, u)$, where $u = (u_2, \ldots, u_k)'$ is a real vector with $u_l > 0$, $l = 2, \ldots, k$. Then, by the Taylor series expansion of $F$ about $d$, we get

$$\sqrt{n} \big( \hat B(h, h_1, \hat d) - \hat B(h, h_1, d) \big) = \sqrt{n} \big( F(\hat d) - F(d) \big) = \sqrt{n} \, \nabla F(d)' (\hat d - d) + \frac{\sqrt{n}}{2} (\hat d - d)' \nabla^2 F(d^*) (\hat d - d), \tag{A.2}$$

where $d^*$ is between $d$ and $\hat d$.

First, we show that the gradient $\nabla F(d) = (\partial F(d)/\partial d_2, \ldots, \partial F(d)/\partial d_k)'$ converges almost surely to a finite constant. For $j = 2, \ldots, k$, the $(j-1)$th component of this vector converges almost surely since, with the SLLN assumed to hold for the Markov chains used, we have

$$\begin{aligned}
[\nabla F(d)]_{j-1} &= \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{n_j \, \nu_h(\theta_i^{(l)}) \, \nu_{h_j}(\theta_i^{(l)})}{d_j^2 \big( \sum_{s=1}^{k} n_s \nu_{h_s}(\theta_i^{(l)}) / d_s \big)^2} \\
&= \sum_{l=1}^{k} \frac{1}{n_l} \sum_{i=1}^{n_l} \frac{a_j a_l \, \nu_h(\theta_i^{(l)}) \, \nu_{h_j}(\theta_i^{(l)})}{d_j^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s \big)^2} \\
&\xrightarrow{\text{a.s.}} \frac{1}{d_j^2} \sum_{l=1}^{k} a_l \int \frac{a_j \, \nu_h(\theta) \, \nu_{h_j}(\theta)}{\big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta) / d_s \big)^2} \, \nu_{h_l, y}(\theta) \, d\theta \\
&= \frac{1}{d_j^2} \int \frac{m_h}{m_{h_1}} \, \frac{a_j \, \nu_{h_j}(\theta)}{\big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta) / d_s \big)^2} \Big[ \sum_{l=1}^{k} a_l \nu_{h_l}(\theta) / d_l \Big] \nu_{h, y}(\theta) \, d\theta \\
&= \frac{B(h, h_1)}{d_j^2} \int \frac{a_j \, \nu_{h_j}(\theta)}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta) / d_s} \, \nu_{h, y}(\theta) \, d\theta := [c(h)]_{j-1}.
\end{aligned} \tag{A.3}$$

The last integral is clearly finite, and the last equality in (A.3) indicates that $c(h)$ denotes the constant vector to which $\nabla F(d)$ converges.
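The first line of (A.3) can be checked numerically in a toy setting. The sketch below assumes that $F(u)$ has the form implied by that line, namely $F(u) = \sum_l \sum_i \nu_h(\theta_i^{(l)}) / \sum_s n_s \nu_{h_s}(\theta_i^{(l)}) / u_s$ with $u_1 = 1$, and compares the analytic partial derivatives with central finite differences. The conjugate normal model, the prior family $N(0, h)$, the use of independent posterior draws in place of Markov chain output, and all names are assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def prior(theta, h):
    """Toy prior family (an assumption): nu_h is the N(0, h) density."""
    return normal_pdf(theta, 0.0, h)


def mixture_denominator(theta, skeleton, n, u):
    return sum(n_s * prior(theta, h_s) / u_s for n_s, h_s, u_s in zip(n, skeleton, u))


def F(h, skeleton, chains, u):
    """Bayes factor estimate as a function of u = (1, u_2, ..., u_k)."""
    n = [len(c) for c in chains]
    return sum(prior(theta, h) / mixture_denominator(theta, skeleton, n, u)
               for chain in chains for theta in chain)


def grad_F(h, skeleton, chains, u):
    """Analytic partial derivatives with respect to u_2, ..., u_k (first line of (A.3))."""
    n = [len(c) for c in chains]
    grad = [0.0] * (len(skeleton) - 1)
    for chain in chains:
        for theta in chain:
            denom = mixture_denominator(theta, skeleton, n, u)
            for j in range(1, len(skeleton)):
                grad[j - 1] += (n[j] * prior(theta, h) * prior(theta, skeleton[j])
                                / (u[j] ** 2 * denom ** 2))
    return grad


# Toy conjugate model (assumption): y | theta ~ N(theta, 1), theta ~ N(0, h).
y = 1.0
skeleton = [0.5, 1.0, 2.0]                       # h_1, h_2, h_3
rng = random.Random(1)
chains = [[rng.gauss(y * h_l / (1 + h_l), math.sqrt(h_l / (1 + h_l))) for _ in range(2000)]
          for h_l in skeleton]                   # i.i.d. posterior draws stand in for MCMC


def marginal(h):
    return normal_pdf(y, 0.0, 1.0 + h)           # m_h(y) in the toy model


d = [marginal(h_l) / marginal(skeleton[0]) for h_l in skeleton]   # d_1 = 1
h = 1.5
estimate = F(h, skeleton, chains, d)
truth = marginal(h) / marginal(skeleton[0])      # B(h, h_1)

analytic = grad_F(h, skeleton, chains, d)
eps = 1e-6
numeric = []
for j in range(1, len(skeleton)):
    up, down = list(d), list(d)
    up[j] += eps
    down[j] -= eps
    numeric.append((F(h, skeleton, chains, up) - F(h, skeleton, chains, down)) / (2 * eps))
print(estimate, truth)
print(analytic, numeric)
```

Every component of the gradient is a sum of positive terms, so $F$ is increasing in each $u_j$; this is one way to see why error in $\hat d$ propagates to the Bayes factor estimate rather than averaging out.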
Next, we show that the random Hessian matrix $\nabla^2 F(d^*)$ of second-order derivatives of $F$ evaluated at $d^*$ is bounded in probability. To this end, it suffices to show that each element of this matrix, say $[\nabla^2 F(d^*)]_{t-1, j-1}$, where $t, j \in \{2, \ldots, k\}$, is $O_p(1)$. Since $\|d^* - d\| \le \|\hat d - d\| \xrightarrow{p} 0$, it follows that $d^* \xrightarrow{p} d$. Let $\epsilon \in (0, \min(d_2, \ldots, d_k))$. Then we have $P(\|d^* - d\| \le \epsilon) \to 1$. We now show that, on the set $\{\|d^* - d\| \le \epsilon\}$, $\nabla^2 F(d^*)$ is bounded in probability. Let $I = I(\|d^* - d\| \le \epsilon)$. For $t \ne j$, we have

$$\begin{aligned}
[\nabla^2 F(d^*)]_{t-1, j-1} \cdot I &= \sum_{l=1}^{k} \frac{2}{n_l} \sum_{i=1}^{n_l} \frac{a_j a_l a_t \, \nu_h(\theta_i^{(l)}) \, \nu_{h_j}(\theta_i^{(l)}) \, \nu_{h_t}(\theta_i^{(l)})}{(d_j^*)^2 (d_t^*)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* \big)^3} \cdot I \\
&\le \sum_{l=1}^{k} \frac{2}{n_l} \sum_{i=1}^{n_l} \frac{a_j a_l a_t \, \nu_h(\theta_i^{(l)}) \, \nu_{h_j}(\theta_i^{(l)}) \, \nu_{h_t}(\theta_i^{(l)})}{(d_j - \epsilon)^2 (d_t - \epsilon)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / (d_s + \epsilon) \big)^3} \\
&\xrightarrow{\text{a.s.}} \frac{2}{(d_j - \epsilon)^2 (d_t - \epsilon)^2} \sum_{l=1}^{k} \int \frac{a_j a_l a_t \, \nu_h(\theta) \, \nu_{h_j}(\theta) \, \nu_{h_t}(\theta)}{\big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta) / (d_s + \epsilon) \big)^3} \, \nu_{h_l, y}(\theta) \, d\theta \\
&= \frac{2}{(d_j - \epsilon)^2 (d_t - \epsilon)^2} \sum_{l=1}^{k} B(h, h_l) \int \bigg\{ \frac{a_j a_l a_t \, \nu_{h_j}(\theta) \, \nu_{h_t}(\theta) \, \nu_{h_l}(\theta)}{\big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta) / (d_s + \epsilon) \big)^3} \bigg\} \, \nu_{h, y}(\theta) \, d\theta.
\end{aligned} \tag{A.4}$$

Note that the expression inside the braces in (A.4) is clearly bounded above by a constant, so expression (A.4) is finite. Similarly, for $t = j$,

$$\begin{aligned}
\big| [\nabla^2 F(d^*)]_{j-1, j-1} \big| \cdot I &= \sum_{l=1}^{k} \frac{2}{n_l} \sum_{i=1}^{n_l} \frac{a_j a_l \, \nu_h(\theta_i^{(l)}) \, \nu_{h_j}(\theta_i^{(l)}) \big[ \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* - a_j \nu_{h_j}(\theta_i^{(l)}) / d_j^* \big]}{(d_j^*)^3 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* \big)^3} \cdot I \\
&\le \sum_{l=1}^{k} \frac{2}{n_l} \sum_{i=1}^{n_l} \frac{a_j a_l \, \nu_h(\theta_i^{(l)}) \, \nu_{h_j}(\theta_i^{(l)})}{(d_j^*)^3 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* \big)^2} \cdot I \\
&\le \sum_{l=1}^{k} \frac{2}{n_l} \sum_{i=1}^{n_l} \frac{a_j a_l \, \nu_h(\theta_i^{(l)}) \, \nu_{h_j}(\theta_i^{(l)})}{(d_j - \epsilon)^3 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / (d_s + \epsilon) \big)^2} \\
&\xrightarrow{\text{a.s.}} \frac{2}{(d_j - \epsilon)^3} \sum_{l=1}^{k} B(h, h_l) \int \frac{a_j a_l \, \nu_{h_j}(\theta) \, \nu_{h_l}(\theta)}{\big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta) / (d_s + \epsilon) \big)^2} \, \nu_{h, y}(\theta) \, d\theta.
\end{aligned}$$

Again, this limit is a finite constant by the same reasoning we used earlier. Since $P(\|d^* - d\| \le \epsilon) \to 1$, it follows that $\nabla^2 F(d^*)$ is bounded in probability. Now, by combining (A.1) and (A.2), we obtain

$$\begin{aligned}
\sqrt{n} \big( \hat B(h, h_1, \hat d) - B(h, h_1) \big) &= \sqrt{\frac{n}{N}} \, \nabla F(d)' \sqrt{N} (\hat d - d) + \frac{1}{2 \sqrt{N}} \sqrt{\frac{n}{N}} \big[ \sqrt{N} (\hat d - d) \big]' \nabla^2 F(d^*) \big[ \sqrt{N} (\hat d - d) \big] \\
&\quad + \sqrt{n} \big( \hat B(h, h_1, d) - B(h, h_1) \big) \\
&= \sqrt{q} \, c(h)' \sqrt{N} (\hat d - d) + \sqrt{n} \big( \hat B(h, h_1, d) - B(h, h_1) \big) + o_p(1),
\end{aligned} \tag{A.5}$$

where the last line follows from the previously established fact that $\nabla F(d) \xrightarrow{\text{a.s.}} c(h)$ and the assumptions of Theorem 1 that $\sqrt{n/N} \to \sqrt{q}$ and that $\sqrt{N} (\hat d - d)$ converges in distribution (hence is $O_p(1)$). Because the two sampling stages for estimating $d$ and $B(h, h_1)$ are assumed to be independent, using the assumption that $\sqrt{N} (\hat d - d) \xrightarrow{d} N(0, \Sigma)$ in conjunction with the result $\sqrt{n} \big( \hat B(h, h_1, d) - B(h, h_1) \big) \xrightarrow{d} N(0, \tau^2(h))$ established in Theorem 1 of Doss (2010) under conditions A1 and A2, we conclude that

$$\sqrt{n} \big( \hat B(h, h_1, \hat d) - B(h, h_1) \big) \xrightarrow{d} N\big( 0, \; q \, c(h)' \Sigma \, c(h) + \tau^2(h) \big).$$

Proof of Theorem 2

We begin by writing

$$\sqrt{n} \big( \hat I_{\hat d, \hat\beta(\hat d)} - B(h, h_1) \big) = \sqrt{n} \big( \hat I_{\hat d, \hat\beta(\hat d)} - \hat I_{d, \hat\beta(d)} \big) + \sqrt{n} \big( \hat I_{d, \hat\beta(d)} - B(h, h_1) \big), \tag{A.6}$$

where the second term on the right side of (A.6) was analyzed by Doss (2010) who showed that it is asymptotically normal, with mean 0 and variance $\sigma^2(h)$. Our plan is to show that $\hat\beta(d)$ and $\hat\beta(\hat d)$ converge in probability to the same limit, which we denote $\beta_{\lim}$. We then expand the first term on the right side of (A.6) by writing

$$\sqrt{n} \big( \hat I_{\hat d, \hat\beta(\hat d)} - \hat I_{d, \hat\beta(d)} \big) = \sqrt{n} \big( \hat I_{\hat d, \hat\beta(\hat d)} - \hat I_{\hat d, \beta_{\lim}} \big) + \sqrt{n} \big( \hat I_{\hat d, \beta_{\lim}} - \hat I_{d, \beta_{\lim}} \big) + \sqrt{n} \big( \hat I_{d, \beta_{\lim}} - \hat I_{d, \hat\beta(d)} \big). \tag{A.7}$$

Our proof is organized as follows:

- We note that the third term on the right side of (A.7) was shown to converge to 0 in probability by Doss (2010).
- We will show the first term on the right side of (A.7) also converges to 0 in probability.
- The second term on the right side of (A.7) involves randomness from both Stage 1 and Stage 2. However, we will show that the randomness from Stage 2 is asymptotically negligible, and that this term is asymptotically equivalent to an expression of the form $w(h)' (\hat d - d)$, where $w(h)$ is a deterministic vector. This will show that the second term is asymptotically normal.

Now we prove that the first term on the right side of (A.7) is $o_p(1)$, and to do this we begin by showing that $\hat\beta(d)$ and $\hat\beta(\hat d)$ converge in probability to the same limit. Let $Z$ be the $n \times k$ matrix whose transpose is

$$Z' = \begin{pmatrix}
1 & \cdots & 1 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \\
Z^{(2)}_{1,1} & \cdots & Z^{(2)}_{n_1,1} & Z^{(2)}_{1,2} & \cdots & Z^{(2)}_{n_2,2} & \cdots & Z^{(2)}_{1,k} & \cdots & Z^{(2)}_{n_k,k} \\
\vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
Z^{(k)}_{1,1} & \cdots & Z^{(k)}_{n_1,1} & Z^{(k)}_{1,2} & \cdots & Z^{(k)}_{n_2,2} & \cdots & Z^{(k)}_{1,k} & \cdots & Z^{(k)}_{n_k,k}
\end{pmatrix} \tag{A.8}$$

and let $Y$ be the vector

$$Y = \big( Y_{1,1}, \ldots, Y_{n_1,1}, Y_{1,2}, \ldots, Y_{n_2,2}, \ldots, Y_{1,k}, \ldots, Y_{n_k,k} \big)'. \tag{A.9}$$

Let $\hat Z$ be the $n \times k$ matrix corresponding to $Z$ when we replace $d$ by $\hat d$. Similarly, $\hat Y$ is like $Y$, but using $\hat d$ for $d$.

For fixed $j, j' \in \{2, \ldots, k\}$, consider the function

$$G(u) = \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{\nu_{h_j}(\theta_i^{(l)}) / u_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / u_s} \cdot \frac{\nu_{h_{j'}}(\theta_i^{(l)}) / u_{j'} - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / u_s}, \tag{A.10}$$

where $u = (u_2, \ldots, u_k)'$ and $u_l > 0$ for $l = 2, \ldots, k$. On the right side of (A.10), $u_1$ is taken to be 1. Note that setting $u = d$ gives $G(d) = \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} Z^{(j)}_{i,l} Z^{(j')}_{i,l}$. By the Mean Value Theorem, we know that there exists a $d^*$ between $d$ and $\hat d$ such that

$$G(\hat d) = G(d) + \nabla G(d^*)' (\hat d - d) = R_{j, j'} + \nabla G(d^*)' (\hat d - d) + o_p(1).$$

Note that the last equality above comes from applying the SLLN. Next we show that $\nabla G(d^*) = O_p(1)$. We have three cases for $t = 2, \ldots, k$.

Case 1: $t \notin \{j, j'\}$. We have

$$\begin{aligned}
\big| [\nabla G(d^*)]_{t-1} \big| \cdot I &= \bigg| \sum_{l=1}^{k} 2 a_l \frac{1}{n_l} \sum_{i=1}^{n_l} \frac{\big( \nu_{h_j}(\theta_i^{(l)}) / d_j^* - \nu_{h_1}(\theta_i^{(l)}) \big) \big( \nu_{h_{j'}}(\theta_i^{(l)}) / d_{j'}^* - \nu_{h_1}(\theta_i^{(l)}) \big) a_t \nu_{h_t}(\theta_i^{(l)})}{(d_t^*)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* \big)^3} \bigg| \cdot I \\
&\le \sum_{l=1}^{k} \frac{2 a_l}{n_l} \sum_{i=1}^{n_l} \frac{\big( \nu_{h_j}(\theta_i^{(l)}) / (d_j - \epsilon) + \nu_{h_1}(\theta_i^{(l)}) \big) \big( \nu_{h_{j'}}(\theta_i^{(l)}) / (d_{j'} - \epsilon) + \nu_{h_1}(\theta_i^{(l)}) \big) a_t \nu_{h_t}(\theta_i^{(l)})}{(d_t - \epsilon)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / (d_s + \epsilon) \big)^3}.
\end{aligned}$$

The term inside the inner sum is bounded, so we can conclude that $[\nabla G(d^*)]_{t-1}$ is bounded in probability, as it is bounded by an $O_p(1)$ term on the event where $I = 1$.

Case 2: $j \ne j'$, $t \in \{j, j'\}$, say $t = j$. We have

$$\begin{aligned}
[\nabla G(d^*)]_{j-1} &= \sum_{l=1}^{k} \frac{a_l}{n_l} \sum_{i=1}^{n_l} \frac{2 \big( \nu_{h_j}(\theta_i^{(l)}) / d_j^* - \nu_{h_1}(\theta_i^{(l)}) \big) \big( \nu_{h_{j'}}(\theta_i^{(l)}) / d_{j'}^* - \nu_{h_1}(\theta_i^{(l)}) \big) a_j \nu_{h_j}(\theta_i^{(l)})}{(d_j^*)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* \big)^3} \\
&\quad + \sum_{l=1}^{k} \frac{a_l}{n_l} \sum_{i=1}^{n_l} \frac{-\nu_{h_j}(\theta_i^{(l)}) \big( \nu_{h_{j'}}(\theta_i^{(l)}) / d_{j'}^* - \nu_{h_1}(\theta_i^{(l)}) \big)}{(d_j^*)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* \big)^2}
\end{aligned}$$

and this is bounded in probability.

Case 3: $t = j = j'$. We have

$$[\nabla G(d^*)]_{j-1} = \sum_{l=1}^{k} 2 a_l \frac{1}{n_l} \sum_{i=1}^{n_l} \frac{\nu_{h_j}(\theta_i^{(l)}) / d_j^* - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^*} \Bigg[ \frac{-\nu_{h_j}(\theta_i^{(l)})}{(d_j^*)^2 \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^*} + \frac{\big( \nu_{h_j}(\theta_i^{(l)}) / d_j^* - \nu_{h_1}(\theta_i^{(l)}) \big) a_j \nu_{h_j}(\theta_i^{(l)})}{(d_j^*)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* \big)^2} \Bigg]$$

and again this is bounded in probability.

Therefore

$$G(\hat d) = R_{j, j'} + \nabla G(d^*)' (\hat d - d) + o_p(1) = R_{j, j'} + O_p(1) \, o_p(1) + o_p(1) \xrightarrow{p} R_{j, j'}.$$

Similar arguments extend to the case $j = 1$ or $j' = 1$. By the fact that $R$ is assumed invertible, we have

$$n (\hat Z' \hat Z)^{-1} \xrightarrow{p} R^{-1}. \tag{A.11}$$

In a similar way, it can be shown that

$$\hat Z' \hat Y / n \xrightarrow{p} v, \tag{A.12}$$

where $v$ is the same limit vector to which $Z' Y / n$ has been proved to converge in Doss (2010). Combining (A.11) and (A.12) we have

$$\big( \hat\beta_0(\hat d), \hat\beta(\hat d) \big) = n (\hat Z' \hat Z)^{-1} \, \hat Z' \hat Y / n \xrightarrow{p} (\beta_{0, \lim}, \beta_{\lim}) = R^{-1} v.$$

Let $e^{(j)}_l = E\big( Z^{(j)}_{1,l} \big)$. We now have

$$\begin{aligned}
\sqrt{n} \big( \hat I_{\hat d, \hat\beta(\hat d)} - \hat I_{\hat d, \beta_{\lim}} \big) &= \sum_{j=2}^{k} \big( \beta_{j, \lim} - \hat\beta_j(\hat d) \big) \sum_{l=1}^{k} a_l \, n^{1/2} \sum_{i=1}^{n_l} \frac{\hat Z^{(j)}_{i,l} - e^{(j)}_l}{n_l} \\
&= \sum_{j=2}^{k} o_p(1) \sum_{l=1}^{k} a_l \, n^{1/2} \sum_{i=1}^{n_l} \frac{\hat Z^{(j)}_{i,l} - e^{(j)}_l}{n_l}.
\end{aligned} \tag{A.13}$$

To show that (A.13) converges to 0 in probability it suffices to show that for each $l$ and $j$,

$$n_l^{1/2} \sum_{i=1}^{n_l} \frac{\hat Z^{(j)}_{i,l} - e^{(j)}_l}{n_l} = O_p(1). \tag{A.14}$$

For fixed $j \in \{2, \ldots, k\}$ and $l \in \{1, \ldots, k\}$, define

$$H(u) = n_l^{-1/2} \sum_{i=1}^{n_l} \frac{\nu_{h_j}(\theta_i^{(l)}) / u_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / u_s}$$

for $u = (u_2, \ldots, u_k)'$ with $u_l > 0$, $l = 2, \ldots, k$ ($u_1 = 1$). Note that $H(d) = n_l^{-1/2} \sum_{i=1}^{n_l} Z^{(j)}_{i,l}$. To see why (A.14) is true we begin by writing

$$n_l^{1/2} \sum_{i=1}^{n_l} \frac{\hat Z^{(j)}_{i,l} - e^{(j)}_l}{n_l} = n_l^{1/2} \sum_{i=1}^{n_l} \frac{\hat Z^{(j)}_{i,l} - Z^{(j)}_{i,l}}{n_l} + n_l^{1/2} \sum_{i=1}^{n_l} \frac{Z^{(j)}_{i,l} - e^{(j)}_l}{n_l} = H(\hat d) - H(d) + O_p(1). \tag{A.15}$$

Note that the fact that $n_l^{1/2} \sum_{i=1}^{n_l} \big( [Z^{(j)}_{i,l} - e^{(j)}_l] / n_l \big) = O_p(1)$, which was used to establish the second equality in (A.15), is proved in Doss (2010). Now, applying the Mean Value Theorem to the function $H$, we know that there exists a point $d^*$ between $d$ and $\hat d$ such that (A.15) becomes

$$n_l^{1/2} \sum_{i=1}^{n_l} \frac{\hat Z^{(j)}_{i,l} - e^{(j)}_l}{n_l} = \nabla H(d^*)' (\hat d - d) + O_p(1) = \sqrt{a_l} \sqrt{\frac{n}{N}} \, n_l^{-1/2} \nabla H(d^*)' \sqrt{N} (\hat d - d) + O_p(1), \tag{A.16}$$

so that the right side of (A.16) is $O_p(1)$. To see this last assertion, note that the $(t-1)$th element of the gradient of $H$, $[\nabla H(d^*)]_{t-1}$, is given by

$$\begin{cases}
\displaystyle n_l^{-1/2} \sum_{i=1}^{n_l} \frac{\big( \nu_{h_j}(\theta_i^{(l)}) / d_j^* - \nu_{h_1}(\theta_i^{(l)}) \big) a_t \nu_{h_t}(\theta_i^{(l)})}{(d_t^*)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* \big)^2} & \text{if } t \ne j, \\[3ex]
\displaystyle n_l^{-1/2} \Bigg[ \sum_{i=1}^{n_l} \frac{-\nu_{h_j}(\theta_i^{(l)})}{(d_j^*)^2 \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^*} + \sum_{i=1}^{n_l} \frac{\big( \nu_{h_j}(\theta_i^{(l)}) / d_j^* - \nu_{h_1}(\theta_i^{(l)}) \big) a_j \nu_{h_j}(\theta_i^{(l)})}{(d_j^*)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* \big)^2} \Bigg] & \text{if } t = j.
\end{cases}$$

Let $\epsilon \in (0, \min(d_2, \ldots, d_k))$. Then $P(\|d^* - d\| \le \epsilon) \to 1$. For $t \ne j$ we have

$$\begin{aligned}
n_l^{-1/2} \big| [\nabla H(d^*)]_{t-1} \big| \cdot I &= \bigg| n_l^{-1} \sum_{i=1}^{n_l} \frac{\big( \nu_{h_j}(\theta_i^{(l)}) / d_j^* - \nu_{h_1}(\theta_i^{(l)}) \big) a_t \nu_{h_t}(\theta_i^{(l)})}{(d_t^*)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* \big)^2} \bigg| \cdot I \\
&\le n_l^{-1} \sum_{i=1}^{n_l} \frac{\nu_{h_j}(\theta_i^{(l)}) \, a_t \nu_{h_t}(\theta_i^{(l)})}{(d_t^*)^2 d_j^* \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* \big)^2} \cdot I + n_l^{-1} \sum_{i=1}^{n_l} \frac{\nu_{h_1}(\theta_i^{(l)}) \, a_t \nu_{h_t}(\theta_i^{(l)})}{(d_t^*)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s^* \big)^2} \cdot I \\
&\le n_l^{-1} \sum_{i=1}^{n_l} \frac{\nu_{h_j}(\theta_i^{(l)}) \, a_t \nu_{h_t}(\theta_i^{(l)})}{(d_t - \epsilon)^2 (d_j - \epsilon) \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / (d_s + \epsilon) \big)^2} + n_l^{-1} \sum_{i=1}^{n_l} \frac{\nu_{h_1}(\theta_i^{(l)}) \, a_t \nu_{h_t}(\theta_i^{(l)})}{(d_t - \epsilon)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / (d_s + \epsilon) \big)^2} \\
&= O_p(1) + O_p(1) = O_p(1).
\end{aligned}$$

Similarly,

$$\begin{aligned}
n_l^{-1/2} \big| [\nabla H(d^*)]_{j-1} \big| \cdot I &\le \frac{1}{n_l} \sum_{i=1}^{n_l} \frac{\nu_{h_j}(\theta_i^{(l)})}{(d_j - \epsilon)^2 \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / (d_s + \epsilon)} + \frac{1}{n_l} \sum_{i=1}^{n_l} \frac{\nu_{h_j}(\theta_i^{(l)}) \, a_j \nu_{h_j}(\theta_i^{(l)})}{(d_j - \epsilon)^3 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / (d_s + \epsilon) \big)^2} \\
&\quad + \frac{1}{n_l} \sum_{i=1}^{n_l} \frac{\nu_{h_1}(\theta_i^{(l)}) \, a_j \nu_{h_j}(\theta_i^{(l)})}{(d_j - \epsilon)^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / (d_s + \epsilon) \big)^2}
\end{aligned}$$

and the right side of this inequality is $O_p(1)$, as it is the sum of three $O_p(1)$ terms. So (A.16) now implies that

$$n_l^{1/2} \sum_{i=1}^{n_l} \frac{\hat Z^{(j)}_{i,l} - e^{(j)}_l}{n_l} = \sqrt{a_l} \sqrt{\frac{n}{N}} \, O_p(1) \, O_p(1) + O_p(1) = O_p(1).$$

We now consider $\sqrt{n} \big( \hat I_{\hat d, \beta_{\lim}} - \hat I_{d, \beta_{\lim}} \big)$, the middle term in (A.7). Define

$$K(u) = \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} \Bigg[ \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / u_s} - \sum_{j=2}^{k} \beta_{j, \lim} \frac{\nu_{h_j}(\theta_i^{(l)}) / u_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / u_s} \Bigg],$$

where $u = (u_2, \ldots, u_k)'$, and $u_l > 0$ for $l = 2, \ldots, k$. By Taylor series expansion, we have

$$\sqrt{n} \big( \hat I_{\hat d, \beta_{\lim}} - \hat I_{d, \beta_{\lim}} \big) = \sqrt{n} \, \nabla K(d)' (\hat d - d) + \sqrt{n} \, \frac{1}{2} (\hat d - d)' \nabla^2 K(d^*) (\hat d - d), \tag{A.17}$$

where $d^*$ is between $\hat d$ and $d$. We now focus our attention on $\nabla K(d)$. For $t = 2, \ldots, k$ we have

$$\begin{aligned}
[\nabla K(d)]_{t-1} = \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} \Bigg[ & \frac{\nu_h(\theta_i^{(l)}) \, a_t \nu_{h_t}(\theta_i^{(l)})}{d_t^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s \big)^2} - \sum_{\substack{j=2 \\ j \ne t}}^{k} \beta_{j, \lim} \frac{\big( \nu_{h_j}(\theta_i^{(l)}) / d_j - \nu_{h_1}(\theta_i^{(l)}) \big) a_t \nu_{h_t}(\theta_i^{(l)})}{d_t^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s \big)^2} \\
& + \beta_{t, \lim} \frac{\nu_{h_t}(\theta_i^{(l)})}{d_t^2 \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s} - \beta_{t, \lim} \frac{\big( \nu_{h_t}(\theta_i^{(l)}) / d_t - \nu_{h_1}(\theta_i^{(l)}) \big) a_t \nu_{h_t}(\theta_i^{(l)})}{d_t^2 \big( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)}) / d_s \big)^2} \Bigg]
\end{aligned}$$
=1 a s h s l i = d s 2 # a.s. )167(! B h h 1 d 2 t Z a t h t P k s =1 a s h s = d s h y d )]TJ/F40 7.9701 Tf 18.176 14.944 Td [(k X j =2 j 6 = t j lim Z a t h t d 2 t P k s =1 a s h s = d s h j y d + k X j =2 j 6 = t j lim Z a t h t d 2 t P k s =1 a s h s = d s h 1 y d + t lim 1 d t )]TJ/F25 11.9552 Tf 11.956 0 Td [( t lim Z a t h t d 2 t P k s =1 a s h s = d s h t y d + t lim Z a t h t d 2 t P k s =1 a s h s = d s h 1 y d = B h h 1 d 2 t Z a t h t P k s =1 a s h s = d s h y d )]TJ/F40 7.9701 Tf 18.176 14.944 Td [(k X j =2 j lim Z a t h t d 2 t P k s =1 a s h s = d s h j y d + k X j =2 j lim Z a t h t d 2 t P k s =1 a s h s = d s h 1 y d + t lim 1 d t :=[ w h ] t )]TJ/F23 7.9701 Tf 6.587 0 Td [(1 A.18 65 PAGE 66 wherethenotationinA.18indicatesthat w h denotesthenitevectorlimittowhich r K d converges.WenowdealwiththeHessianmatrix r 2 K d .For t 6 = u wehave [ r 2 K d ] t )]TJ/F23 7.9701 Tf 6.586 0 Td [(1, u )]TJ/F23 7.9701 Tf 6.587 0 Td [(1 = 1 n k X l =1 n l X i =1 2 h l i a t h t l i a u h u l i d t 2 d u 2 )]TJ 5.479 -0.718 Td [(P k s =1 a s h s l i = d s 3 )]TJ/F40 7.9701 Tf 18.176 14.944 Td [(k X j =2 j 6 = t j 6 = u j lim 2 h j l i = d j )]TJ/F25 11.9552 Tf 11.955 0 Td [( h 1 l i a t h t l i a u h u l i d t 2 d u 2 )]TJ 5.479 -0.717 Td [(P k s =1 a s h s l i = d s 3 + u lim h u l i a t h t l i d t 2 d u )]TJ 5.479 -0.718 Td [(P k s =1 a s h s l i = d s 2 )]TJ/F25 11.9552 Tf 11.955 0 Td [( u lim 2 )]TJ/F25 11.9552 Tf 5.48 -9.684 Td [( h u l i = d u )]TJ/F25 11.9552 Tf 11.956 0 Td [( h 1 l i a t h t l i a u h u l i d t 2 d u 2 )]TJ 5.479 -0.717 Td [(P k s =1 a s h s l i = d s 3 + t lim h l i a u h u l i d t 2 d u 2 )]TJ 5.479 -0.717 Td [(P k s =1 a s h s l i = d s 2 )]TJ/F25 11.9552 Tf 11.955 0 Td [( t lim 2 )]TJ/F25 11.9552 Tf 5.48 -9.684 Td [( h t l i = d t )]TJ/F25 11.9552 Tf 11.955 0 Td [( h 1 l i a t h t l i a u h u l i d t 2 d u 2 )]TJ 5.48 -0.717 Td [(P k s =1 a s h s l i = d s 3 # andasbefore,itcanbeshownthatthisisboundedinprobability.Similarly,wecanshow 
thatthediagonaltermsof r 2 K d arealsoboundedinprobability.Therefore,usingthe factthat r 2 K d isboundedinprobability,wecannowrewriteA.17as p n )]TJ/F22 11.9552 Tf 4.731 -7.028 Td [(^ I ^ d lim )]TJ/F22 11.9552 Tf 11.206 2.656 Td [(^ I d lim = r n N w h 0 p N ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(d + r n N 1 2 p N p N ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(d 0 O p p N ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(d = p qw h 0 p N ^ d )]TJ/F39 11.9552 Tf 11.956 0 Td [(d + o p TogetherwithA.6,thisgives p n )]TJ/F22 11.9552 Tf 4.73 -7.027 Td [(^ I ^ d ^ ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(B h h 1 = p qw h 0 p N ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(d + p n )]TJ/F22 11.9552 Tf 4.731 -7.027 Td [(^ I d ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(B h h 1 + o p d )167(!N )]TJ/F22 11.9552 Tf 5.479 -9.683 Td [(0, qw h 0 w h + 2 h bytheindependenceofthetwosamplingstages,theassumptionthat p N ^ d )]TJ/F39 11.9552 Tf 12.708 0 Td [(d is asymptoticallynormalwithmean 0 andvariance ,andtheresultfromDoss2010that p n )]TJ/F22 11.9552 Tf 4.73 -7.027 Td [(^ I d ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(B h h 1 isasymptoticallynormalwithmean 0 andvariance 2 h 66 PAGE 67 ProofofTheorem3 First,wenotethat p n )]TJ/F22 11.9552 Tf 4.73 -7.027 Td [(^ I [ f ] h ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(I [ f ] h = p n )]TJ/F22 11.9552 Tf 4.73 -7.027 Td [(^ I [ f ] h ^ d )]TJ/F22 11.9552 Tf 11.206 2.656 Td [(^ I [ f ] h d + p n )]TJ/F22 11.9552 Tf 4.73 -7.027 Td [(^ I [ f ] h d )]TJ/F39 11.9552 Tf 11.955 0 Td [(I [ f ] h A.19 WebeginbyanalyzingthesecondtermontherightsideofA.19,whichonlyinvolves randomnessfromthesecondstageofsampling,andshowthatitisasymptotically normal.Asfortherstterm,acloserexaminationrevealsthatitisalsoasymptotically normal,withallitsrandomnesscomingfromStage 1 .Theasymptoticnormalityofthe sumofthesetwotermsthenfollowsimmediatelyfromtheindependenceofthetwo stagesofsampling. 
Note that $\sum_{l=1}^k a_l E\bigl(Y^{[f]}_{1,l}\bigr) = I^{[f]}(h) B(h,h_1)$, and in particular, when $f \equiv 1$, this gives $\sum_{l=1}^k a_l E(Y_{1,l}) = B(h,h_1)$. Also, we have
\[
\begin{aligned}
n^{1/2} \begin{pmatrix} \dfrac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} Y^{[f]}_{i,l} - I^{[f]}(h) B(h,h_1) \\[2ex] \dfrac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} Y_{i,l} - B(h,h_1) \end{pmatrix}
&= n^{1/2} \begin{pmatrix} \dfrac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} Y^{[f]}_{i,l} - \sum_{l=1}^k a_l E\bigl(Y^{[f]}_{1,l}\bigr) \\[2ex] \dfrac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} Y_{i,l} - \sum_{l=1}^k a_l E(Y_{1,l}) \end{pmatrix} \\
&= \sum_{l=1}^k a_l^{1/2}\, \frac{1}{n_l^{1/2}} \sum_{i=1}^{n_l} \left[ \begin{pmatrix} Y^{[f]}_{i,l} \\ Y_{i,l} \end{pmatrix} - \begin{pmatrix} E\bigl(Y^{[f]}_{1,l}\bigr) \\ E(Y_{1,l}) \end{pmatrix} \right].
\end{aligned} \tag{A.20}
\]
By condition (2), assumption A2 of Theorem 1, and the assumed geometric ergodicity and independence of the $k$ Markov chains used, the vector in (A.20) converges in distribution to a normal random vector with mean $0$ and covariance matrix $\Gamma(h) = \sum_{l=1}^k a_l \Gamma_l(h)$, where
\[
\Gamma_l(h) = \begin{pmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{pmatrix}
\]
with
\[
\begin{aligned}
\gamma_{11} &= \operatorname{Var}\bigl(Y^{[f]}_{1,l}\bigr) + 2 \sum_{g=1}^\infty \operatorname{Cov}\bigl(Y^{[f]}_{1,l}, Y^{[f]}_{1+g,l}\bigr), \\
\gamma_{12} = \gamma_{21} &= \operatorname{Cov}\bigl(Y^{[f]}_{1,l}, Y_{1,l}\bigr) + \sum_{g=1}^\infty \Bigl[ \operatorname{Cov}\bigl(Y^{[f]}_{1,l}, Y_{1+g,l}\bigr) + \operatorname{Cov}\bigl(Y_{1,l}, Y^{[f]}_{1+g,l}\bigr) \Bigr], \\
\gamma_{22} &= \operatorname{Var}(Y_{1,l}) + 2 \sum_{g=1}^\infty \operatorname{Cov}(Y_{1,l}, Y_{1+g,l}).
\end{aligned}
\]
Since $\hat I^{[f]}(h, d)$ is given by the ratio in (2), in view of (A.20), its asymptotic distribution may be obtained by applying the delta method to the function $g(u, v) = u/v$. This gives $\sqrt{n}\bigl(\hat I^{[f]}(h, d) - I^{[f]}(h)\bigr) \xrightarrow{d} N(0, \rho(h))$, where
\[
\rho(h) = \nabla g\bigl(I^{[f]}(h) B(h,h_1),\, B(h,h_1)\bigr)'\; \Gamma(h)\; \nabla g\bigl(I^{[f]}(h) B(h,h_1),\, B(h,h_1)\bigr) \tag{A.21}
\]
with $\nabla g(u, v) = (1/v,\, -u/v^2)'$.

We now consider the first term on the right side of (A.19). Define
\[
L(u) = \frac{\displaystyle\sum_{l=1}^k \sum_{i=1}^{n_l} \frac{f(\theta_i^{(l)})\, \nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/u_s}}{\displaystyle\sum_{l=1}^k \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/u_s}}
\]
for $u = (u_2, \ldots, u_k)'$ with $u_l > 0$ for $l = 2, \ldots, k$. Then
\[
L(d) = \hat I^{[f]}(h, d) = \frac{\sum_{l=1}^k \sum_{i=1}^{n_l} Y^{[f]}_{i,l}}{\sum_{l=1}^k \sum_{i=1}^{n_l} Y_{i,l}}
\]
and $\sqrt{n}\bigl(\hat I^{[f]}(h, \hat d) - \hat I^{[f]}(h, d)\bigr) = \sqrt{n}\bigl(L(\hat d) - L(d)\bigr)$. Now, by the Taylor series expansion of $L$ about $d$ we get
\[
\sqrt{n}\bigl(\hat I^{[f]}(h, \hat d) - \hat I^{[f]}(h, d)\bigr) = \sqrt{n}\, \nabla L(d)'(\hat d - d) + \frac{\sqrt{n}}{2} (\hat d - d)'\, \nabla^2 L(d^*)\, (\hat d - d)
\]
where $d^*$ is between $d$ and $\hat d$. First, we show that the gradient $\nabla L(d)$ converges almost surely to a finite constant vector by proving that each one of its components, $[\nabla L(d)]_{j-1}$, $j = 2, \ldots, k$, converges almost surely. We have
\[
\begin{aligned}
[\nabla L(d)]_{j-1}
&= \frac{\displaystyle\sum_{l=1}^k \sum_{i=1}^{n_l} \frac{a_j f(\theta_i^{(l)})\, \nu_h(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})}{d_j^2 \bigl(\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/d_s\bigr)^2}}{\displaystyle\sum_{l=1}^k \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/d_s}}
 - \frac{\displaystyle\Biggl[\sum_{l=1}^k \sum_{i=1}^{n_l} \frac{f(\theta_i^{(l)})\, \nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/d_s}\Biggr] \Biggl[\sum_{l=1}^k \sum_{i=1}^{n_l} \frac{a_j \nu_h(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})}{d_j^2 \bigl(\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/d_s\bigr)^2}\Biggr]}{\displaystyle\Biggl[\sum_{l=1}^k \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/d_s}\Biggr]^2} \\
&\xrightarrow{\text{a.s.}} \frac{\dfrac{B(h,h_1)}{d_j^2} \displaystyle\int \frac{a_j f(\theta)\, \nu_{h_j}(\theta)}{\sum_{s=1}^k a_s \nu_{h_s}(\theta)/d_s}\, \nu_{h,y}(\theta)\, d\theta}{B(h,h_1)}
 - \frac{I^{[f]}(h) B(h,h_1)\, \dfrac{B(h,h_1)}{d_j^2} \displaystyle\int \frac{a_j \nu_{h_j}(\theta)}{\sum_{s=1}^k a_s \nu_{h_s}(\theta)/d_s}\, \nu_{h,y}(\theta)\, d\theta}{B(h,h_1)^2} \\
&= \frac{1}{d_j^2} \int \frac{a_j f(\theta)\, \nu_{h_j}(\theta)}{\sum_{s=1}^k a_s \nu_{h_s}(\theta)/d_s}\, \nu_{h,y}(\theta)\, d\theta
 - \frac{I^{[f]}(h)}{d_j^2} \int \frac{a_j \nu_{h_j}(\theta)}{\sum_{s=1}^k a_s \nu_{h_s}(\theta)/d_s}\, \nu_{h,y}(\theta)\, d\theta \\
&:= [v(h)]_{j-1}, \qquad j = 2, \ldots, k.
\end{aligned} \tag{A.22}
\]
As in the proof of Theorem 1, it can be shown that each element of the second-derivative matrix $\nabla^2 L(d^*)$ is $O_p(1)$. Now, we can rewrite (A.19) as
\[
\begin{aligned}
\sqrt{n}\bigl(\hat I^{[f]}(h, \hat d) - I^{[f]}(h)\bigr)
&= \sqrt{\frac{n}{N}}\, \nabla L(d)' \sqrt{N}(\hat d - d) + \frac{1}{2\sqrt{N}} \sqrt{\frac{n}{N}}\, \bigl[\sqrt{N}(\hat d - d)\bigr]'\, \nabla^2 L(d^*)\, \bigl[\sqrt{N}(\hat d - d)\bigr] \\
&\quad + \sqrt{n}\bigl(\hat I^{[f]}(h, d) - I^{[f]}(h)\bigr) \\
&= \sqrt{q}\, v(h)' \sqrt{N}(\hat d - d) + \sqrt{n}\bigl(\hat I^{[f]}(h, d) - I^{[f]}(h)\bigr) + o_p(1).
\end{aligned}
\]
Since the two sampling stages are assumed to be independent, we conclude that
\[
\sqrt{n}\bigl(\hat I^{[f]}(h, \hat d) - I^{[f]}(h)\bigr) \xrightarrow{d} N\bigl(0,\; q\, v(h)' \Sigma\, v(h) + \rho(h)\bigr).
\]

Proof of Theorem 4

Here $Z$ and $Y$ represent the matrix and vector, respectively, previously defined in (A.8) and (A.9). In addition, let $Z^{[f]}$ denote the $n \times (k+1)$ matrix with transpose
\[
Z^{[f]\prime} = \begin{pmatrix}
1 & \cdots & 1 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \\
Z^{[f](1)}_{1,1} & \cdots & Z^{[f](1)}_{n_1,1} & Z^{[f](1)}_{1,2} & \cdots & Z^{[f](1)}_{n_2,2} & \cdots & Z^{[f](1)}_{1,k} & \cdots & Z^{[f](1)}_{n_k,k} \\
Z^{[f](2)}_{1,1} & \cdots & Z^{[f](2)}_{n_1,1} & Z^{[f](2)}_{1,2} & \cdots & Z^{[f](2)}_{n_2,2} & \cdots & Z^{[f](2)}_{1,k} & \cdots & Z^{[f](2)}_{n_k,k} \\
\vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
Z^{[f](k)}_{1,1} & \cdots & Z^{[f](k)}_{n_1,1} & Z^{[f](k)}_{1,2} & \cdots & Z^{[f](k)}_{n_2,2} & \cdots & Z^{[f](k)}_{1,k} & \cdots & Z^{[f](k)}_{n_k,k}
\end{pmatrix} \tag{A.23}
\]
and let $Y^{[f]}$ be the vector
\[
Y^{[f]} = \bigl(Y^{[f]}_{1,1}, \ldots, Y^{[f]}_{n_1,1},\, Y^{[f]}_{1,2}, \ldots, Y^{[f]}_{n_2,2},\, \ldots,\, Y^{[f]}_{1,k}, \ldots, Y^{[f]}_{n_k,k}\bigr)'. \tag{A.24}
\]
We know from Doss (2010) that the least squares estimate when $Y$ is regressed on $Z$, denoted by $(\hat\beta_0, \hat\beta_2, \ldots, \hat\beta_k) =: (\hat\beta_0, \hat\beta)$, converges almost surely to $(\beta_{0,\lim}, \beta_{\lim}) = R^{-1} v$. In a similar way, we will show here that the least squares estimate when $Y^{[f]}$ is regressed on $Z^{[f]}$, $\bigl(\hat\beta^{[f]}_0, \hat\beta^{[f]}\bigr) = \bigl(\hat\beta^{[f]}_0, \hat\beta^{[f]}_1, \ldots, \hat\beta^{[f]}_k\bigr)$, converges almost surely to a vector $\bigl(\beta^{[f]}_{0,\lim}, \beta^{[f]}_{\lim}\bigr)$. Note that, under the assumption that $\bigl(Z^{[f]\prime} Z^{[f]}\bigr)^{-1}$ exists,
\[
\bigl(\hat\beta^{[f]}_0, \hat\beta^{[f]}\bigr) = n \bigl(Z^{[f]\prime} Z^{[f]}\bigr)^{-1}\, \frac{Z^{[f]\prime} Y^{[f]}}{n}.
\]
Since A4 is satisfied, we have
\[
\frac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} Z^{[f](j)}_{i,l} Z^{[f](j')}_{i,l} = \sum_{l=1}^k \frac{n_l}{n}\, \frac{1}{n_l} \sum_{i=1}^{n_l} Z^{[f](j)}_{i,l} Z^{[f](j')}_{i,l} \xrightarrow{\text{a.s.}} R^{[f]}_{j+1,\, j'+1}, \qquad j, j' = 0, \ldots, k,
\]
and hence $Z^{[f]\prime} Z^{[f]}/n \xrightarrow{\text{a.s.}} R^{[f]}$. Therefore by A6, with probability one $Z^{[f]\prime} Z^{[f]}$ is nonsingular for large $n$, and furthermore
\[
n \bigl(Z^{[f]\prime} Z^{[f]}\bigr)^{-1} \xrightarrow{\text{a.s.}} \bigl(R^{[f]}\bigr)^{-1}. \tag{A.25}
\]
By condition A5, we also have
\[
\frac{Z^{[f]\prime} Y^{[f]}}{n} = \begin{pmatrix} \dfrac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} Z^{[f](0)}_{i,l} Y^{[f]}_{i,l} \\ \vdots \\ \dfrac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} Z^{[f](k)}_{i,l} Y^{[f]}_{i,l} \end{pmatrix}
\xrightarrow{\text{a.s.}} \begin{pmatrix} \sum_{l=1}^k a_l E\bigl(Z^{[f](0)}_{1,l} Y^{[f]}_{1,l}\bigr) \\ \vdots \\ \sum_{l=1}^k a_l E\bigl(Z^{[f](k)}_{1,l} Y^{[f]}_{1,l}\bigr) \end{pmatrix}. \tag{A.26}
\]
Let $v^{[f]} = \bigl(v^{[f]}_0, \ldots, v^{[f]}_k\bigr)'$ denote the vector on the right side of (A.26). Combining (A.25) and (A.26) we get
\[
\bigl(\hat\beta^{[f]}_0, \hat\beta^{[f]}\bigr) \xrightarrow{\text{a.s.}} \bigl(\beta^{[f]}_{0,\lim}, \beta^{[f]}_{\lim}\bigr) = \bigl(R^{[f]}\bigr)^{-1} v^{[f]}. \tag{A.27}
\]
Let
\[
\hat J_{\beta_{\lim},\, \beta^{[f]}_{\lim}} = \frac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} \left[ \begin{pmatrix} Y^{[f]}_{i,l} \\ Y_{i,l} \end{pmatrix} - \begin{pmatrix} \sum_{j=1}^k \beta^{[f]}_{j,\lim} Z^{[f](j)}_{i,l} \\ \sum_{j=2}^k \beta_{j,\lim} Z^{(j)}_{i,l} \end{pmatrix} \right] = \sum_{l=1}^k a_l\, \frac{1}{n_l} \sum_{i=1}^{n_l} U^{[f]}_{i,l}
\]
where
\[
U^{[f]}_{i,l} = \begin{pmatrix} U^{[f](1)}_{i,l} \\ U^{[f](2)}_{i,l} \end{pmatrix} = \begin{pmatrix} Y^{[f]}_{i,l} - \sum_{j=1}^k \beta^{[f]}_{j,\lim} Z^{[f](j)}_{i,l} \\ Y_{i,l} - \sum_{j=2}^k \beta_{j,\lim} Z^{(j)}_{i,l} \end{pmatrix}.
\]
Also, let $\mu^{[f]}_l = E\bigl(U^{[f]}_{1,l}\bigr)$. Now since A2, A3, and A4 hold, for each $l = 1, \ldots, k$ we have
\[
n_l^{1/2} \Bigl( \frac{1}{n_l} \sum_{i=1}^{n_l} U^{[f]}_{i,l} - \mu^{[f]}_l \Bigr) \xrightarrow{d} N\bigl(0, \Sigma^{[f]}_l\bigr)
\quad\text{where}\quad
\Sigma^{[f]}_l = \begin{pmatrix} \sigma^2_{l,11} & \sigma_{l,12} \\ \sigma_{l,21} & \sigma^2_{l,22} \end{pmatrix}
\]
with
\[
\begin{aligned}
\sigma^2_{l,11} &= \operatorname{Var}\bigl(U^{[f](1)}_{1,l}\bigr) + 2 \sum_{g=1}^\infty \operatorname{Cov}\bigl(U^{[f](1)}_{1,l}, U^{[f](1)}_{1+g,l}\bigr), \\
\sigma_{l,12} = \sigma_{l,21} &= \operatorname{Cov}\bigl(U^{[f](1)}_{1,l}, U^{[f](2)}_{1,l}\bigr) + \sum_{g=1}^\infty \Bigl[ \operatorname{Cov}\bigl(U^{[f](1)}_{1,l}, U^{[f](2)}_{1+g,l}\bigr) + \operatorname{Cov}\bigl(U^{[f](2)}_{1,l}, U^{[f](1)}_{1+g,l}\bigr) \Bigr], \\
\sigma^2_{l,22} &= \operatorname{Var}\bigl(U^{[f](2)}_{1,l}\bigr) + 2 \sum_{g=1}^\infty \operatorname{Cov}\bigl(U^{[f](2)}_{1,l}, U^{[f](2)}_{1+g,l}\bigr).
\end{aligned}
\]
By the assumed independence of the $k$ Markov chains, we have
\[
n^{1/2} \Bigl( \hat J_{\beta_{\lim},\, \beta^{[f]}_{\lim}} - \sum_{l=1}^k a_l \mu^{[f]}_l \Bigr) \xrightarrow{d} N\bigl(0, \Sigma^{[f]}\bigr) \tag{A.28}
\]
where
\[
\Sigma^{[f]} = \sum_{l=1}^k a_l \Sigma^{[f]}_l. \tag{A.29}
\]
We now show that
\[
\sum_{l=1}^k a_l \mu^{[f]}_l = \begin{pmatrix} I^{[f]}(h) B(h,h_1) \\ B(h,h_1) \end{pmatrix} \tag{A.30}
\]
and to do this we write
\[
\begin{aligned}
\sum_{l=1}^k a_l \mu^{[f]}_l = \sum_{l=1}^k a_l E\bigl(U^{[f]}_{1,l}\bigr)
&= \sum_{l=1}^k a_l \begin{pmatrix} E\bigl(Y^{[f]}_{1,l}\bigr) - \sum_{j=1}^k \beta^{[f]}_{j,\lim} E\bigl(Z^{[f](j)}_{1,l}\bigr) \\ E(Y_{1,l}) - \sum_{j=2}^k \beta_{j,\lim} E\bigl(Z^{(j)}_{1,l}\bigr) \end{pmatrix} \\
&= \begin{pmatrix} \sum_{l=1}^k a_l E\bigl(Y^{[f]}_{1,l}\bigr) - \sum_{j=1}^k \beta^{[f]}_{j,\lim} \sum_{l=1}^k a_l E\bigl(Z^{[f](j)}_{1,l}\bigr) \\ \sum_{l=1}^k a_l E(Y_{1,l}) - \sum_{j=2}^k \beta_{j,\lim} \sum_{l=1}^k a_l E\bigl(Z^{(j)}_{1,l}\bigr) \end{pmatrix} \\
&= \begin{pmatrix} \sum_{l=1}^k a_l E\bigl(Y^{[f]}_{1,l}\bigr) \\ \sum_{l=1}^k a_l E(Y_{1,l}) \end{pmatrix} = \begin{pmatrix} I^{[f]}(h) B(h,h_1) \\ B(h,h_1) \end{pmatrix},
\end{aligned}
\]
the next-to-last equality being a consequence of the readily verifiable fact that
\[
\sum_{l=1}^k a_l E\bigl(Z^{[f](j)}_{1,l}\bigr) = 0 \quad\text{and}\quad \sum_{l=1}^k a_l E\bigl(Z^{(j)}_{1,l}\bigr) = 0 \quad\text{for } j = 2, \ldots, k. \tag{A.31}
\]
From (A.30) and (A.28) we conclude that
\[
n^{1/2} \left( \hat J_{\beta_{\lim},\, \beta^{[f]}_{\lim}} - \begin{pmatrix} I^{[f]}(h) B(h,h_1) \\ B(h,h_1) \end{pmatrix} \right) \xrightarrow{d} N\bigl(0, \Sigma^{[f]}\bigr).
\]
Consider now the difference
\[
\begin{aligned}
n^{1/2} \bigl( \hat J_{\hat\beta,\, \hat\beta^{[f]}} - \hat J_{\beta_{\lim},\, \beta^{[f]}_{\lim}} \bigr)
&= \begin{pmatrix} n^{1/2} \sum_{j=1}^k \bigl(\beta^{[f]}_{j,\lim} - \hat\beta^{[f]}_j\bigr)\, \dfrac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} Z^{[f](j)}_{i,l} \\[2ex] n^{1/2} \sum_{j=2}^k \bigl(\beta_{j,\lim} - \hat\beta_j\bigr)\, \dfrac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} Z^{(j)}_{i,l} \end{pmatrix} \\
&= \begin{pmatrix} \sum_{j=1}^k \bigl(\beta^{[f]}_{j,\lim} - \hat\beta^{[f]}_j\bigr) \sum_{l=1}^k a_l\, n^{1/2} \sum_{i=1}^{n_l} \Bigl[ \dfrac{Z^{[f](j)}_{i,l} - E\bigl(Z^{[f](j)}_{1,l}\bigr)}{n_l} \Bigr] \\[2ex] \sum_{j=2}^k \bigl(\beta_{j,\lim} - \hat\beta_j\bigr) \sum_{l=1}^k a_l\, n^{1/2} \sum_{i=1}^{n_l} \Bigl[ \dfrac{Z^{(j)}_{i,l} - E\bigl(Z^{(j)}_{1,l}\bigr)}{n_l} \Bigr] \end{pmatrix}
\end{aligned}
\]
where the last equality follows from (A.31). By the assumption that the chains are geometrically ergodic (condition A1), the boundedness of the $Z^{(j)}_{i,l}$'s, and the moment condition imposed on $f$ in A4, we know that
\[
n^{1/2} \sum_{i=1}^{n_l} \Bigl[ \frac{Z^{(j)}_{i,l} - E\bigl(Z^{(j)}_{1,l}\bigr)}{n_l} \Bigr] \quad\text{and}\quad n^{1/2} \sum_{i=1}^{n_l} \Bigl[ \frac{Z^{[f](j)}_{i,l} - E\bigl(Z^{[f](j)}_{1,l}\bigr)}{n_l} \Bigr]
\]
are asymptotically normal, hence $O_p(1)$. This fact, combined with (A.27) and the corresponding result for $(\hat\beta_0, \hat\beta)$, yields
\[
n^{1/2} \bigl( \hat J_{\hat\beta,\, \hat\beta^{[f]}} - \hat J_{\beta_{\lim},\, \beta^{[f]}_{\lim}} \bigr) = o_p(1).
\]
Hence we can conclude that
\[
n^{1/2} \left( \hat J_{\hat\beta,\, \hat\beta^{[f]}} - \begin{pmatrix} I^{[f]}(h) B(h,h_1) \\ B(h,h_1) \end{pmatrix} \right) \xrightarrow{d} N\bigl(0, \Sigma^{[f]}\bigr). \tag{A.32}
\]
Now applying the delta method with the function $g(u, v) = u/v$ we have
\[
n^{1/2} \left( \frac{\sum_{l=1}^k \sum_{i=1}^{n_l} \bigl( Y^{[f]}_{i,l} - \sum_{j=1}^k \hat\beta^{[f]}_j Z^{[f](j)}_{i,l} \bigr)}{\sum_{l=1}^k \sum_{i=1}^{n_l} \bigl( Y_{i,l} - \sum_{j=2}^k \hat\beta_j Z^{(j)}_{i,l} \bigr)} - I^{[f]}(h) \right) \xrightarrow{d} N(0, r(h)),
\]
i.e.
\[
n^{1/2} \bigl( \hat I_{\hat\beta,\, \hat\beta^{[f]}} - I^{[f]}(h) \bigr) \xrightarrow{d} N(0, r(h)),
\]
where
\[
r(h) = \nabla g\bigl(I^{[f]}(h) B(h,h_1),\, B(h,h_1)\bigr)'\; \Sigma^{[f]}\; \nabla g\bigl(I^{[f]}(h) B(h,h_1),\, B(h,h_1)\bigr) \tag{A.33}
\]
with $\nabla g(u, v) = (1/v,\, -u/v^2)'$ and $\Sigma^{[f]}$ as in (A.29).
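The delta-method step used in (A.21) and (A.33) reduces to the quadratic form $\nabla g' \, C \, \nabla g$ with $\nabla g(u, v) = (1/v, -u/v^2)'$, where $C$ is the limiting $2 \times 2$ covariance matrix. The following is a minimal Python sketch of that computation only; the function name `ratio_delta_variance` and all numerical values are hypothetical and chosen purely for illustration, not taken from the thesis.

```python
def ratio_delta_variance(u, v, cov):
    """Asymptotic variance of u_hat / v_hat by the delta method.

    cov is the 2x2 asymptotic covariance matrix of
    sqrt(n) * (u_hat - u, v_hat - v), given as nested lists.
    """
    # Gradient of g(u, v) = u / v.
    grad = [1.0 / v, -u / v ** 2]
    # Quadratic form grad' * cov * grad.
    return sum(grad[r] * cov[r][s] * grad[s]
               for r in range(2) for s in range(2))


# Hypothetical stand-ins for I^{[f]}(h), B(h, h_1) and the limiting covariance.
I_f, B = 2.0, 0.5
cov = [[4.0, 1.0],
       [1.0, 1.0]]

# The numerator limit is I^{[f]}(h) * B(h, h_1); the denominator limit is B(h, h_1).
variance = ratio_delta_variance(I_f * B, B, cov)
print(variance)
```

In practice the covariance matrix would itself have to be estimated from the Markov chain output (for example by batch means or spectral methods), which this sketch does not attempt.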
Proof of Theorem 5

We begin by reviewing some related notation and results established by Geyer (1994). Recall that $N_j$ denotes the length of the $j$th chain in Stage 1 samples, $N = \sum_{j=1}^k N_j$, and $A_j = N_j/N$. Using the notation
\[
\zeta_j = -\log(m_{h_j}) + \log(A_j) \quad\text{for } j = 1, \ldots, k,
\]
Geyer's (1994) reverse logistic regression estimator $\hat\zeta = (\hat\zeta_1, \ldots, \hat\zeta_k)$ for the unknown vector $\zeta$ is obtained by maximizing the log quasi-likelihood
\[
l_N(\zeta) = \sum_{l=1}^k \sum_{i=1}^{N_l} \log\bigl(p_l(\theta_i^{(l)}, \zeta)\bigr) \tag{A.34}
\]
where
\[
p_l(\theta, \zeta) = \frac{\nu_{h_l}(\theta)\, e^{\zeta_l}}{\sum_{s=1}^k \nu_{h_s}(\theta)\, e^{\zeta_s}} \quad\text{for } l = 1, \ldots, k. \tag{A.35}
\]
Theorem 1 of Geyer (1994) states that this maximizer is unique up to an additive constant if the Monte Carlo sample is inseparable. Geyer (1994) also proves that, under certain conditions, $\sqrt{N}(\hat\zeta_N - \zeta_0)$ is asymptotically normal, where $\zeta_0$ is defined by
\[
[\zeta_0]_j = \zeta_j - \frac{1}{k} \sum_{s=1}^k \zeta_s, \qquad j = 1, \ldots, k.
\]
Our proof is structured as follows. First, we extend Geyer's (1994) proof in order to show that the $2k$-dimensional vector
\[
\sqrt{N} \left( \begin{pmatrix} \hat\zeta \\ \hat e \end{pmatrix} - \begin{pmatrix} \zeta_0 \\ e \end{pmatrix} \right) =: \begin{pmatrix} U_1 \\ \vdots \\ U_{2k} \end{pmatrix} =: U \tag{A.36}
\]
is asymptotically normal. Then, by getting back to the $d$ notation through a transformation, we show that our vector of interest
\[
\sqrt{N} \left( \begin{pmatrix} \hat d \\ \hat e \end{pmatrix} - \begin{pmatrix} d \\ e \end{pmatrix} \right)
\]
is also asymptotically normal.

To carry out the first step, we will express each $U_j$, $j = 1, \ldots, 2k$, as the sum of a linear combination of standardized averages of functions of the $\theta_i^{(l)}$'s and an $o_p(1)$ quantity. We will also need the central limit theorem to hold for these averages. Hence, for each $j = 1, \ldots, 2k$, we plan to find constants $\alpha_{j,1}, \ldots, \alpha_{j,k}$ and functions $\varphi_{j,1}, \ldots, \varphi_{j,k}$ which satisfy the conditions
\[
E_{\nu_{h_l,y}}\bigl(\varphi_{j,l}(\theta)\bigr) = 0 \quad\text{and}\quad E_{\nu_{h_l,y}}\bigl(|\varphi_{j,l}(\theta)|^{2+\epsilon}\bigr) < \infty, \qquad l = 1, \ldots, k, \tag{A.37a}
\]
\[
U_j = \alpha_{j,1}\, \frac{1}{\sqrt{N_1}} \sum_{i=1}^{N_1} \varphi_{j,1}(\theta_i^{(1)}) + \cdots + \alpha_{j,k}\, \frac{1}{\sqrt{N_k}} \sum_{i=1}^{N_k} \varphi_{j,k}(\theta_i^{(k)}) + o_p(1) \tag{A.37b}
\]
for some $\epsilon > 0$. Note that conditions (A.37a) and B1 yield central limit theorems for the averages in the linear combination above.
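The log quasi-likelihood (A.34) with the probabilities (A.35) reviewed at the start of this proof can be evaluated directly from the log prior densities at each draw. The sketch below is only an illustration under stated assumptions: each draw is represented by a hypothetical list of values $\log \nu_{h_1}(\theta), \ldots, \log \nu_{h_k}(\theta)$, and the function names and toy numbers are not from the thesis or from Geyer (1994).

```python
import math


def class_probs(log_nu, zeta):
    """p_l(theta, zeta) for l = 1..k as in (A.35).

    log_nu[l] is log nu_{h_l}(theta) for one draw theta.
    """
    scores = [a + z for a, z in zip(log_nu, zeta)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def log_quasi_likelihood(samples, zeta):
    """l_N(zeta) as in (A.34).

    samples[l] holds the draws of chain l; each draw is the list
    [log nu_{h_1}(theta), ..., log nu_{h_k}(theta)].
    """
    return sum(math.log(class_probs(draw, zeta)[l])
               for l, chain in enumerate(samples)
               for draw in chain)


# Hypothetical log prior-density evaluations for k = 2 chains.
samples = [
    [[-1.0, -2.0], [-0.5, -1.5]],  # two draws from chain 1
    [[-2.0, -1.0]],                # one draw from chain 2
]
value = log_quasi_likelihood(samples, [0.0, 0.0])
shifted = log_quasi_likelihood(samples, [3.0, 3.0])
print(value, shifted)
```

Adding the same constant to every component of $\zeta$ leaves each $p_l$ unchanged, so `value` and `shifted` agree up to rounding; this is the reason the maximizer is determined only up to an additive constant and the centered vector $\zeta_0$ is used.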
For $U_{k+1}, \ldots, U_{2k}$, condition (A.37) is clearly satisfied since
\[
U_{j+k} = \sqrt{N}(\hat e_j - e_j) = \frac{1}{\sqrt{A_j}}\, \frac{1}{\sqrt{N_j}} \sum_{i=1}^{N_j} \bigl( f(\theta_i^{(j)}) - e_j \bigr) \quad\text{for } j = 1, \ldots, k
\]
and the moment conditions in (A.37a) hold (see B2 in the statement of this theorem).

Next, we show that condition (A.37) also holds for the first $k$ components of $U$. In the proof of his Theorem 2, Geyer (1994) defines the matrix $B_N$ via
\[
-\frac{1}{N} \bigl( \nabla l_N(\hat\zeta_N) - \nabla l_N(\zeta_0) \bigr) = B_N (\hat\zeta_N - \zeta_0) \tag{A.38}
\]
where $l_N$ was defined in (A.34), and establishes that $B_N \xrightarrow{\text{a.s.}} B$, where $B$ is given by an equation in Geyer (1994). He also shows that, with $u$ being the $k$-dimensional column vector of 1's,
\[
\begin{pmatrix} B_N \\ u' \end{pmatrix} \sqrt{N}(\hat\zeta_N - \zeta_0) = \begin{pmatrix} \dfrac{1}{\sqrt{N}} \nabla l_N(\zeta_0) \\ 0 \end{pmatrix}. \tag{A.39}
\]
[See Geyer (1994).] Note that, by applying the Mean Value Theorem to $\nabla l_N$, $B_N$ defined in (A.38) can also be expressed as
\[
B_N = -\frac{1}{N} \nabla^2 l_N(\zeta^*)
\]
for some $\zeta^*$ between $\hat\zeta_N$ and $\zeta_0$. Hence, with $p_r$, $r = 1, \ldots, k$, defined as in (A.35), the elements of $B_N$ are given by
\[
\begin{aligned}
[B_N]_{rr} &= \frac{1}{N} \sum_{l=1}^k \sum_{i=1}^{N_l} p_r(\theta_i^{(l)}, \zeta^*) \bigl( 1 - p_r(\theta_i^{(l)}, \zeta^*) \bigr), \qquad r = 1, \ldots, k, \\
[B_N]_{rs} &= -\frac{1}{N} \sum_{l=1}^k \sum_{i=1}^{N_l} p_r(\theta_i^{(l)}, \zeta^*)\, p_s(\theta_i^{(l)}, \zeta^*), \qquad r \neq s,
\end{aligned}
\]
which makes it easy to verify that $B_N u = 0$. Combining this with equation (A.39), it can be shown that
\[
\sqrt{N}(\hat\zeta_N - \zeta_0) = B_N^{+}\, \frac{1}{\sqrt{N}} \nabla l_N(\zeta_0) \tag{A.40}
\]
where
\[
B_N^{+} = \Bigl( B_N + \frac{1}{k} u u' \Bigr)^{-1} - \frac{1}{k} u u'
\]
is the Moore-Penrose inverse of $B_N$. Furthermore, letting $B^{+}$ denote the Moore-Penrose inverse of $B$, we can alternatively write the equality in (A.40) as
\[
\begin{aligned}
\sqrt{N}(\hat\zeta_N - \zeta_0) &= \bigl( B_N^{+} - B^{+} + B^{+} \bigr)\, \frac{1}{\sqrt{N}} \nabla l_N(\zeta_0) \\
&= \bigl( B_N^{+} - B^{+} \bigr)\, \frac{1}{\sqrt{N}} \nabla l_N(\zeta_0) + B^{+}\, \frac{1}{\sqrt{N}} \nabla l_N(\zeta_0).
\end{aligned} \tag{A.41}
\]
Now, using the result $B_N \xrightarrow{\text{a.s.}} B$ established by Geyer (1994), we can easily deduce that
\[
B_N^{+} \xrightarrow{\text{a.s.}} B^{+} \tag{A.42}
\]
by writing
\[
B_N^{+} = \Bigl( B_N + \frac{1}{k} u u' \Bigr)^{-1} - \frac{1}{k} u u' \xrightarrow{\text{a.s.}} \Bigl( B + \frac{1}{k} u u' \Bigr)^{-1} - \frac{1}{k} u u' = B^{+}
\]
where the last equality comes from Geyer (1994).

Next, we establish asymptotic normality of $\nabla l_N(\zeta_0)/\sqrt{N}$. Since the gradient $\nabla l_N(\zeta_0)$ is the vector whose $r$th element is given by
\[
\frac{\partial l_N(\zeta_0)}{\partial \zeta_r} = N_r - \sum_{l=1}^k \sum_{i=1}^{N_l} p_r(\theta_i^{(l)}, \zeta_0),
\]
we can see that
\[
\begin{aligned}
\frac{1}{\sqrt{N}}\, \frac{\partial l_N(\zeta_0)}{\partial \zeta_r}
&= \frac{1}{\sqrt{N}} \Bigl[ N_r - \sum_{l=1}^k \sum_{i=1}^{N_l} p_r(\theta_i^{(l)}, \zeta_0) \Bigr] \\
&= \sqrt{A_r}\, \frac{1}{\sqrt{N_r}} \sum_{i=1}^{N_r} \bigl( 1 - p_r(\theta_i^{(r)}, \zeta_0) \bigr) - \sum_{\substack{l=1 \\ l \neq r}}^k \sqrt{A_l}\, \frac{1}{\sqrt{N_l}} \sum_{i=1}^{N_l} p_r(\theta_i^{(l)}, \zeta_0) \\
&= \sqrt{A_r}\, \frac{1}{\sqrt{N_r}} \sum_{i=1}^{N_r} \Bigl[ \bigl( 1 - p_r(\theta_i^{(r)}, \zeta_0) \bigr) - \bigl( 1 - E\bigl( p_r(\theta_1^{(r)}, \zeta_0) \bigr) \bigr) \Bigr] \\
&\quad - \sum_{\substack{l=1 \\ l \neq r}}^k \sqrt{A_l}\, \frac{1}{\sqrt{N_l}} \sum_{i=1}^{N_l} \Bigl[ p_r(\theta_i^{(l)}, \zeta_0) - E\bigl( p_r(\theta_1^{(l)}, \zeta_0) \bigr) \Bigr] \\
&= -\sum_{l=1}^k \sqrt{A_l}\, \frac{1}{\sqrt{N_l}} \sum_{i=1}^{N_l} \Bigl[ p_r(\theta_i^{(l)}, \zeta_0) - E\bigl( p_r(\theta_1^{(l)}, \zeta_0) \bigr) \Bigr]
\end{aligned} \tag{A.43}
\]
which is a linear combination of the form given in (A.37b) and, because $0 \le p_r(\theta, \zeta) \le 1$ for all $\theta$ and $\zeta$, condition (A.37a) is also satisfied. Note that we are allowed to insert the expectations in the next-to-last equality because
\[
\begin{aligned}
& -\sqrt{A_r}\, \frac{1}{\sqrt{N_r}}\, N_r \Bigl( 1 - E\bigl( p_r(\theta_1^{(r)}, \zeta_0) \bigr) \Bigr) + \sum_{\substack{l=1 \\ l \neq r}}^k \sqrt{A_l}\, \frac{1}{\sqrt{N_l}}\, N_l\, E\bigl( p_r(\theta_1^{(l)}, \zeta_0) \bigr) \\
&\quad = -\sqrt{N} A_r \Bigl( 1 - \int \frac{\nu_{h_r}(\theta)\, e^{\zeta_r}}{\sum_{s=1}^k \nu_{h_s}(\theta)\, e^{\zeta_s}}\, \nu_{h_r,y}(\theta)\, d\theta \Bigr) + \sqrt{N} \sum_{\substack{l=1 \\ l \neq r}}^k A_l\, E\bigl( p_r(\theta_1^{(l)}, \zeta_0) \bigr) \\
&\quad = -\sqrt{N} A_r \sum_{\substack{l=1 \\ l \neq r}}^k \int \frac{\nu_{h_l}(\theta)\, e^{\zeta_l}}{\sum_{s=1}^k \nu_{h_s}(\theta)\, e^{\zeta_s}}\, \nu_{h_r,y}(\theta)\, d\theta + \sqrt{N} \sum_{\substack{l=1 \\ l \neq r}}^k A_l\, E\bigl( p_r(\theta_1^{(l)}, \zeta_0) \bigr) \\
&\quad = -\sqrt{N} A_r \sum_{\substack{l=1 \\ l \neq r}}^k \int e^{\zeta_l - \zeta_r}\, \frac{\nu_{h_r}(\theta)\, e^{\zeta_r}}{\sum_{s=1}^k \nu_{h_s}(\theta)\, e^{\zeta_s}}\, \frac{m_{h_l}}{m_{h_r}}\, \nu_{h_l,y}(\theta)\, d\theta + \sqrt{N} \sum_{\substack{l=1 \\ l \neq r}}^k A_l\, E\bigl( p_r(\theta_1^{(l)}, \zeta_0) \bigr) \\
&\quad = -\sqrt{N} A_r \sum_{\substack{l=1 \\ l \neq r}}^k \frac{m_{h_l}}{m_{h_r}}\, e^{\zeta_l - \zeta_r}\, E\bigl( p_r(\theta_1^{(l)}, \zeta_0) \bigr) + \sqrt{N} \sum_{\substack{l=1 \\ l \neq r}}^k A_l\, E\bigl( p_r(\theta_1^{(l)}, \zeta_0) \bigr) \\
&\quad = 0.
\end{aligned}
\]
The asymptotic normality of $\nabla l_N(\zeta_0)/\sqrt{N}$ now follows from the Cramér-Wold device. In view of this convergence in distribution and the convergence result in (A.42), (A.41) gives
\[
\sqrt{N}(\hat\zeta_N - \zeta_0) = o_p(1)\, O_p(1) + B^{+}\, \frac{1}{\sqrt{N}} \nabla l_N(\zeta_0) = B^{+}\, \frac{1}{\sqrt{N}} \nabla l_N(\zeta_0) + o_p(1).
\]
Therefore, we can now easily see that condition (A.37) is also satisfied by the first $k$ components of $U$, i.e. $\sqrt{N}(\hat\zeta_N - \zeta_0)$, because, as we have shown in (A.43), every element of $\nabla l_N(\zeta_0)/\sqrt{N}$ is a linear combination of the form (A.37).
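The two matrix facts used in this part of the proof, namely that a matrix with the structure of $B_N$ annihilates the vector of ones and that $(B_N + uu'/k)^{-1} - uu'/k$ is its Moore-Penrose inverse, can be checked numerically on a toy example. The sketch below is an illustration only: the probability vectors are hypothetical, and the small pure-Python helpers `mat_mul` and `mat_inv` are written here for self-containment rather than taken from any library.

```python
def mat_mul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def mat_inv(a):
    """Inverse of a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    aug = [[float(x) for x in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def b_matrix(probs):
    """Average over draws of diag(p) - p p', the structure of B_N."""
    k, n = len(probs[0]), len(probs)
    return [[sum(p[r] * (1.0 - p[r]) if r == s else -p[r] * p[s]
                 for p in probs) / n
             for s in range(k)] for r in range(k)]


def pinv_by_shift(b):
    """(B + uu'/k)^{-1} - uu'/k, with u the k-vector of ones."""
    k = len(b)
    inv = mat_inv([[b[r][s] + 1.0 / k for s in range(k)] for r in range(k)])
    return [[inv[r][s] - 1.0 / k for s in range(k)] for r in range(k)]


# Hypothetical values of (p_1, p_2, p_3) at four draws; each row sums to 1.
probs = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3], [0.25, 0.25, 0.5], [0.1, 0.7, 0.2]]
B = b_matrix(probs)
B_plus = pinv_by_shift(B)
row_sums = [sum(row) for row in B]               # entries of B u
B_Bplus_B = mat_mul(mat_mul(B, B_plus), B)       # should reproduce B
Bplus_B_Bplus = mat_mul(mat_mul(B_plus, B), B_plus)  # should reproduce B_plus
```

The identity relies on $B$ being symmetric with null space spanned by $u$, so that $uu'/k$ is the orthogonal projector onto that null space; the checks $B B^{+} B = B$ and $B^{+} B B^{+} = B^{+}$ are two of the defining Penrose conditions.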
Now that we have shown that
\[
U = \begin{pmatrix}
\alpha_{1,1}\, \dfrac{1}{\sqrt{N_1}} \sum_{i=1}^{N_1} \varphi_{1,1}(\theta_i^{(1)}) + \cdots + \alpha_{1,k}\, \dfrac{1}{\sqrt{N_k}} \sum_{i=1}^{N_k} \varphi_{1,k}(\theta_i^{(k)}) \\
\vdots \\
\alpha_{k,1}\, \dfrac{1}{\sqrt{N_1}} \sum_{i=1}^{N_1} \varphi_{k,1}(\theta_i^{(1)}) + \cdots + \alpha_{k,k}\, \dfrac{1}{\sqrt{N_k}} \sum_{i=1}^{N_k} \varphi_{k,k}(\theta_i^{(k)}) \\
\alpha_{k+1,1}\, \dfrac{1}{\sqrt{N_1}} \sum_{i=1}^{N_1} \varphi_{k+1,1}(\theta_i^{(1)}) \\
\vdots \\
\alpha_{2k,k}\, \dfrac{1}{\sqrt{N_k}} \sum_{i=1}^{N_k} \varphi_{2k,k}(\theta_i^{(k)})
\end{pmatrix} + o_p(1),
\]
we can prove that $U$ is asymptotically normal by using the Cramér-Wold device. Let us denote the asymptotic variance of $U$ by $S$. Then
\[
S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}
\]
where
\[
S_{11} = B^{+} C B^{+} \tag{A.44}
\]
with
\[
C_{rs} = \sum_{l=1}^k A_l \operatorname{Cov}\bigl( p_r(\theta_1^{(l)}, \zeta_0),\, p_s(\theta_1^{(l)}, \zeta_0) \bigr) + \sum_{l=1}^k A_l \sum_{g=1}^\infty \Bigl[ \operatorname{Cov}\bigl( p_r(\theta_1^{(l)}, \zeta_0),\, p_s(\theta_{1+g}^{(l)}, \zeta_0) \bigr) + \operatorname{Cov}\bigl( p_r(\theta_{1+g}^{(l)}, \zeta_0),\, p_s(\theta_1^{(l)}, \zeta_0) \bigr) \Bigr]
\]
for $r = 1, \ldots, k$, and $s = 1, \ldots, k$,
\[
\begin{aligned}
[S_{22}]_{rr} &= \frac{1}{A_r} \Bigl[ \operatorname{Var}\bigl( f(\theta_1^{(r)}) \bigr) + 2 \sum_{g=1}^\infty \operatorname{Cov}\bigl( f(\theta_1^{(r)}),\, f(\theta_{1+g}^{(r)}) \bigr) \Bigr], \\
[S_{22}]_{rs} &= 0 \quad\text{when } r \neq s, \qquad r = 1, \ldots, k, \; s = 1, \ldots, k,
\end{aligned} \tag{A.45}
\]
and
\[
S_{12} = B^{+} D \tag{A.46}
\]
with
\[
D_{rs} = -\operatorname{Cov}\bigl( p_r(\theta_1^{(s)}, \zeta_0),\, f(\theta_1^{(s)}) \bigr) - \sum_{g=1}^\infty \Bigl[ \operatorname{Cov}\bigl( p_r(\theta_1^{(s)}, \zeta_0),\, f(\theta_{1+g}^{(s)}) \bigr) + \operatorname{Cov}\bigl( p_r(\theta_{1+g}^{(s)}, \zeta_0),\, f(\theta_1^{(s)}) \bigr) \Bigr]
\]
for $r = 1, \ldots, k$, and $s = 1, \ldots, k$. Now, having established the convergence result
\[
U = \sqrt{N} \left( \begin{pmatrix} \hat\zeta \\ \hat e \end{pmatrix} - \begin{pmatrix} \zeta_0 \\ e \end{pmatrix} \right) \xrightarrow{d} N(0, S), \tag{A.47}
\]
consider the function $g : \mathbb{R}^{2k} \to \mathbb{R}^{2k-1}$ given by
\[
g \begin{pmatrix} \zeta \\ e \end{pmatrix} = \begin{pmatrix} e^{\zeta_1 - \zeta_2} A_2/A_1 \\ e^{\zeta_1 - \zeta_3} A_3/A_1 \\ \vdots \\ e^{\zeta_1 - \zeta_k} A_k/A_1 \\ e \end{pmatrix}
\]
where $\zeta$ and $e$ are $k$-dimensional vectors. The delta method applied to (A.47) with $g$ as the transformation gives
\[
\sqrt{N} \left( \begin{pmatrix} \hat d_2 \\ \vdots \\ \hat d_k \\ \hat e \end{pmatrix} - \begin{pmatrix} d_2 \\ \vdots \\ d_k \\ e \end{pmatrix} \right) \xrightarrow{d} N(0, V)
\]
where
\[
V = \nabla g \begin{pmatrix} \zeta_0 \\ e \end{pmatrix}' \, S \; \nabla g \begin{pmatrix} \zeta_0 \\ e \end{pmatrix} \tag{A.48}
\]
with
\[
\nabla g \begin{pmatrix} \zeta \\ e \end{pmatrix} = \begin{pmatrix}
e^{\zeta_1 - \zeta_2} A_2/A_1 & e^{\zeta_1 - \zeta_3} A_3/A_1 & \cdots & e^{\zeta_1 - \zeta_k} A_k/A_1 & 0 & 0 & \cdots & 0 \\
-e^{\zeta_1 - \zeta_2} A_2/A_1 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & -e^{\zeta_1 - \zeta_3} A_3/A_1 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & & \ddots & \vdots & \vdots & & & \vdots \\
0 & 0 & \cdots & -e^{\zeta_1 - \zeta_k} A_k/A_1 & 0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & 1 & \cdots & 0 \\
\vdots & & & \vdots & \vdots & & \ddots & \vdots \\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 1
\end{pmatrix}
\]
and $S$ given by (A.44), (A.46), and (A.45).

Proof of Remark 2 to Theorem 1

Following the lines of the proof of Theorem 1 with $q = 1$, we get as in (A.5) that
\[
\sqrt{n}\bigl( \hat B(h, h_1, \hat d) - B(h,h_1) \bigr) = c(h)' \sqrt{n}(\hat d - d) + \sqrt{n}\bigl( \hat B(h, h_1, d) - B(h,h_1) \bigr) + o_p(1)
\]
where $c(h)$ is the constant column vector given in (A.3). This decomposition can be rewritten as
\[
\sqrt{n}\bigl( \hat B(h, h_1, \hat d) - B(h,h_1) \bigr) = \bigl( c(h)',\, 1 \bigr) \begin{pmatrix} \sqrt{n}(\hat d - d) \\ \sqrt{n}\bigl( \hat B(h, h_1, d) - B(h,h_1) \bigr) \end{pmatrix} + o_p(1).
\]
Now note that in order to establish the asymptotic normality of $\sqrt{n}\bigl( \hat B(h, h_1, \hat d) - B(h,h_1) \bigr)$ it is enough to show that
\[
\begin{pmatrix} \sqrt{n}(\hat d - d) \\ \sqrt{n}\bigl( \hat B(h, h_1, d) - B(h,h_1) \bigr) \end{pmatrix}
\]
is asymptotically normal. Using the $\zeta$-notation introduced in the proof of Theorem 5, let
\[
T = \sqrt{n} \left( \begin{pmatrix} \hat\zeta \\ \hat B(h, h_1, d) \end{pmatrix} - \begin{pmatrix} \zeta_0 \\ B(h,h_1) \end{pmatrix} \right).
\]
As was done for $U$ in the proof of Theorem 5, we can write $T$ as
\[
T = \begin{pmatrix}
\alpha_{1,1}\, \dfrac{1}{\sqrt{n_1}} \sum_{i=1}^{n_1} \varphi_{1,1}(\theta_i^{(1)}) + \cdots + \alpha_{1,k}\, \dfrac{1}{\sqrt{n_k}} \sum_{i=1}^{n_k} \varphi_{1,k}(\theta_i^{(k)}) \\
\vdots \\
\alpha_{k,1}\, \dfrac{1}{\sqrt{n_1}} \sum_{i=1}^{n_1} \varphi_{k,1}(\theta_i^{(1)}) + \cdots + \alpha_{k,k}\, \dfrac{1}{\sqrt{n_k}} \sum_{i=1}^{n_k} \varphi_{k,k}(\theta_i^{(k)}) \\
\sum_{l=1}^k a_l^{1/2}\, \dfrac{1}{\sqrt{n_l}} \sum_{i=1}^{n_l} \bigl( Y_{i,l} - E(Y_{1,l}) \bigr)
\end{pmatrix} + o_p(1)
\]
where the first $k$ components are the same as the first $k$ components of the vector $U$ in (A.36), and $Y_{i,l}$ is given in (2). By applying the Cramér-Wold device, we can conclude that the vector $T$ converges in distribution to a normal random variable with mean $0$ and variance
\[
Z = \begin{pmatrix} S_{11} & z \\ z' & \tau^2(h) \end{pmatrix}
\]
in which $S_{11}$ is the $k \times k$ matrix given in (A.44), $z$ is the $k \times 1$ vector given by
\[
z = B^{+} y \tag{A.49}
\]
where
\[
y_r = \sum_{l=1}^k A_l \operatorname{Cov}\bigl( p_r(\theta_1^{(l)}, \zeta_0),\, Y_{1,l} \bigr) + \sum_{l=1}^k A_l \sum_{g=1}^\infty \Bigl[ \operatorname{Cov}\bigl( p_r(\theta_1^{(l)}, \zeta_0),\, Y_{1+g,l} \bigr) + \operatorname{Cov}\bigl( p_r(\theta_{1+g}^{(l)}, \zeta_0),\, Y_{1,l} \bigr) \Bigr]
\]
for $r = 1, \ldots, k$, with $B^{+}$ as in Theorem 5, and as in (A.9) of Doss (2010)
\[
\tau^2(h) = \sum_{l=1}^k a_l \Bigl[ \operatorname{Var}(Y_{1,l}) + 2 \sum_{g=1}^\infty \operatorname{Cov}(Y_{1,l}, Y_{1+g,l}) \Bigr].
\]
Now define the function $g : \mathbb{R}^{k+1} \to \mathbb{R}^{k}$ by
\[
g \begin{pmatrix} \zeta \\ b \end{pmatrix} = \begin{pmatrix} e^{\zeta_1 - \zeta_2} A_2/A_1 \\ e^{\zeta_1 - \zeta_3} A_3/A_1 \\ \vdots \\ e^{\zeta_1 - \zeta_k} A_k/A_1 \\ b \end{pmatrix}
\]
where $\zeta$ is a $k$-dimensional vector and $b$ is a real number. Applying the delta method to the previously established result that $T \xrightarrow{d} N(0, Z)$, we get
\[
\sqrt{n} \left( \begin{pmatrix} \hat d \\ \hat B(h, h_1, d) \end{pmatrix} - \begin{pmatrix} d \\ B(h,h_1) \end{pmatrix} \right) \xrightarrow{d} N\left( 0,\; \nabla g \begin{pmatrix} \zeta_0 \\ B(h,h_1) \end{pmatrix}' Z \; \nabla g \begin{pmatrix} \zeta_0 \\ B(h,h_1) \end{pmatrix} \right)
\]
where
\[
\nabla g \begin{pmatrix} \zeta \\ B(h,h_1) \end{pmatrix} = \begin{pmatrix} E & 0 \\ 0' & 1 \end{pmatrix} \tag{A.50}
\]
with
\[
E = \begin{pmatrix}
e^{\zeta_1 - \zeta_2} A_2/A_1 & e^{\zeta_1 - \zeta_3} A_3/A_1 & \cdots & e^{\zeta_1 - \zeta_k} A_k/A_1 \\
-e^{\zeta_1 - \zeta_2} A_2/A_1 & 0 & \cdots & 0 \\
0 & -e^{\zeta_1 - \zeta_3} A_3/A_1 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & -e^{\zeta_1 - \zeta_k} A_k/A_1
\end{pmatrix} \tag{A.51}
\]
and $0$ in (A.50) representing the column vector of $k$ zeros.

Hence, we know that $\sqrt{n}\bigl( \hat B(h, h_1, \hat d) - B(h,h_1) \bigr)$ has an asymptotically normal distribution with mean $0$ and variance
\[
c(h)' \Sigma\, c(h) + \tau^2(h) + 2\, c(h)' E' z
\]
where $\Sigma$ denotes, as in the statement of Theorem 1, the asymptotic variance of $\sqrt{n}(\hat d - d)$, $E$ is given in (A.51), and $z$ in (A.49).

Proof of Theorem 6

Let
\[
\hat J^{d,e}_{\beta,\, \beta^{[f]}} = \frac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} \left[ \begin{pmatrix} Y^{[f]}_{i,l} \\ Y_{i,l} \end{pmatrix} - \begin{pmatrix} \sum_{j=1}^k \beta^{[f]}_j Z^{[f](j)}_{i,l} \\ \sum_{j=2}^k \beta_j Z^{(j)}_{i,l} \end{pmatrix} \right]
\]
where the superscripts $d, e$ indicate the values of $d$ and $e$ used when computing the $Y$'s and $Z$'s, while the subscripts indicate the coefficients of the $Z$'s. With $\mu^{[f]}_l$ as in the proof of Theorem 4 we now write
\[
\sqrt{n} \Bigl( \hat J^{\hat d, \hat e}_{\hat\beta(\hat d),\, \hat\beta^{[f]}(\hat d)} - \sum_{l=1}^k a_l \mu^{[f]}_l \Bigr) = \sqrt{n} \Bigl( \hat J^{\hat d, \hat e}_{\hat\beta(\hat d),\, \hat\beta^{[f]}(\hat d)} - \hat J^{d,e}_{\hat\beta(d),\, \hat\beta^{[f]}(d)} \Bigr) + \sqrt{n} \Bigl( \hat J^{d,e}_{\hat\beta(d),\, \hat\beta^{[f]}(d)} - \sum_{l=1}^k a_l \mu^{[f]}_l \Bigr). \tag{A.52}
\]
Note that the second quantity on the right side of (A.52), which involves only known $d$ and $e$, was shown to be asymptotically normal with mean $0$ and variance $\Sigma^{[f]}$ in the proof of Theorem 4 [see (A.32)]. Now let us expand the first term on the right side of (A.52) by writing
\[
\begin{aligned}
\sqrt{n} \Bigl( \hat J^{\hat d, \hat e}_{\hat\beta(\hat d),\, \hat\beta^{[f]}(\hat d)} - \hat J^{d,e}_{\hat\beta(d),\, \hat\beta^{[f]}(d)} \Bigr)
&= \sqrt{n} \Bigl( \hat J^{\hat d, \hat e}_{\hat\beta(\hat d),\, \hat\beta^{[f]}(\hat d)} - \hat J^{\hat d, \hat e}_{\beta_{\lim},\, \beta^{[f]}_{\lim}} \Bigr) \\
&\quad + \sqrt{n} \Bigl( \hat J^{\hat d, \hat e}_{\beta_{\lim},\, \beta^{[f]}_{\lim}} - \hat J^{d,e}_{\beta_{\lim},\, \beta^{[f]}_{\lim}} \Bigr) \\
&\quad + \sqrt{n} \Bigl( \hat J^{d,e}_{\beta_{\lim},\, \beta^{[f]}_{\lim}} - \hat J^{d,e}_{\hat\beta(d),\, \hat\beta^{[f]}(d)} \Bigr).
\end{aligned} \tag{A.53}
\]
We next proceed as follows:
1. We note that the third term on the right side of (A.53) was shown to converge to $0$ in probability in the proof of Theorem 4.
2. We show that the first term on the right side of (A.53) also converges to $0$ in probability.
3. We show that the second term on the right side of (A.53) is asymptotically normal.

To deal with the second step, as in the proof of Theorem 1, first we show that $\hat\beta^{[f]}(d)$ and $\hat\beta^{[f]}(\hat d)$ converge in probability to the same limit, which we denoted in the proof of Theorem 4 by $\beta^{[f]}_{\lim}$. For fixed $j, j' \in \{1, \ldots, k\}$, consider the function
\[
G(u, v) = \frac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} \Biggl[ \frac{f(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})/u_j}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/u_s} - v_j \Biggr] \Biggl[ \frac{f(\theta_i^{(l)})\, \nu_{h_{j'}}(\theta_i^{(l)})/u_{j'}}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/u_s} - v_{j'} \Biggr]
\]
where $u = (u_2, \ldots, u_k)'$ with $u_l > 0$, for $l = 2, \ldots, k$, and $v = (v_1, \ldots, v_k)'$. Note that setting $u = d$ and $v = e$ gives
\[
G(d, e) = \frac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} Z^{[f](j)}_{i,l} Z^{[f](j')}_{i,l}.
\]
By the Mean Value Theorem, we know that there exists a $(d^*, e^*)$ between $(\hat d, \hat e)$ and $(d, e)$ such that
\[
\begin{aligned}
G(\hat d, \hat e) &= G(d, e) + \nabla G(d^*, e^*)' \left( \begin{pmatrix} \hat d \\ \hat e \end{pmatrix} - \begin{pmatrix} d \\ e \end{pmatrix} \right) \\
&= R^{[f]}_{j+1,\, j'+1} + \nabla G(d^*, e^*)' \left( \begin{pmatrix} \hat d \\ \hat e \end{pmatrix} - \begin{pmatrix} d \\ e \end{pmatrix} \right) + o_p(1).
\end{aligned}
\]
As in previous proofs, with some calculations we can show that $\nabla G(d^*, e^*) = O_p(1)$. Therefore $G(\hat d, \hat e) \xrightarrow{p} R^{[f]}_{j+1,\, j'+1}$, and since $R^{[f]}$ is assumed invertible, we have
\[
n \bigl( \hat Z^{[f]\prime} \hat Z^{[f]} \bigr)^{-1} \xrightarrow{p} \bigl( R^{[f]} \bigr)^{-1}
\]
where $\hat Z^{[f]}$ is obtained from the matrix $Z^{[f]}$ in (A.23) by replacing $d$ and $e$ with $\hat d$ and $\hat e$. The same reasoning extends to the case where $j = 0$ or $j' = 0$. In a similar way, if we let $\hat Y^{[f]}$ denote the vector obtained from $Y^{[f]}$ in (A.24) by replacing $d$ with $\hat d$ and we recall that $v^{[f]}$ was defined to be the vector on the right side of (A.26), it can be proved that
\[
\frac{\hat Z^{[f]\prime} \hat Y^{[f]}}{n} \xrightarrow{p} v^{[f]}
\]
which together with the previous result implies that $\hat\beta^{[f]}(d)$ and $\hat\beta^{[f]}(\hat d)$ converge in probability to the same limit. Also,
\[
\begin{aligned}
\sqrt{n} \Bigl( \hat J^{\hat d, \hat e}_{\hat\beta(\hat d),\, \hat\beta^{[f]}(\hat d)} - \hat J^{\hat d, \hat e}_{\beta_{\lim},\, \beta^{[f]}_{\lim}} \Bigr)
&= \begin{pmatrix} n^{1/2} \sum_{j=1}^k \bigl( \beta^{[f]}_{j,\lim} - \hat\beta^{[f]}_j(\hat d) \bigr)\, \dfrac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} \hat Z^{[f](j)}_{i,l} \\[2ex] n^{1/2} \sum_{j=2}^k \bigl( \beta_{j,\lim} - \hat\beta_j(\hat d) \bigr)\, \dfrac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} \hat Z^{(j)}_{i,l} \end{pmatrix} \\
&= \begin{pmatrix} \sum_{j=1}^k \bigl( \beta^{[f]}_{j,\lim} - \hat\beta^{[f]}_j(\hat d) \bigr) \Bigl[ \sum_{l=1}^k a_l\, n^{1/2} \sum_{i=1}^{n_l} \dfrac{\hat Z^{[f](j)}_{i,l} - E\bigl( Z^{[f](j)}_{1,l} \bigr)}{n_l} \Bigr] \\[2ex] \sum_{j=2}^k \bigl( \beta_{j,\lim} - \hat\beta_j(\hat d) \bigr) \Bigl[ \sum_{l=1}^k a_l\, n^{1/2} \sum_{i=1}^{n_l} \dfrac{\hat Z^{(j)}_{i,l} - E\bigl( Z^{(j)}_{1,l} \bigr)}{n_l} \Bigr] \end{pmatrix}.
\end{aligned}
\]
From the proof of Theorem 2 we already know that the second component of this last vector, denoted therein by $\sqrt{n}\bigl( \hat I^{\hat d}_{\hat\beta(\hat d)} - \hat I^{\hat d}_{\beta_{\lim}} \bigr)$, is $o_p(1)$. In an analogous manner, it can be shown that the first component is $o_p(1)$. Thus the whole vector is $o_p(1)$.

As for the middle term of the right side of (A.53), if we define
$K^{[f]}(u, v) = \dfrac{1}{n} \displaystyle\sum_{l=1}^k \sum_{i=1}^{n_l} \Biggl( \dfrac{f(\theta_i^{(l)})\, \nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/u_s} - \sum_{j=1}^k \beta^{[f]}_{j,\lim} \Biggl( \dfrac{f(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})/u_j}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/u_s} - v_j \Biggr) \Biggr)$
where u = u 2 ,..., u k 0 with u l > 0 for l =2,..., k and v = v 1 v 2 ,..., v k 0 ,then p n ^ J ^ d ,^ e lim [ f ] lim )]TJ/F22 11.9552 Tf 12.173 2.657 Td [(^ J d e lim [ f ] lim = 0 B @ p n )]TJ/F39 11.9552 Tf 5.479 -9.684 Td [(K [ f ] ^ d ,^ e )]TJ/F39 11.9552 Tf 11.955 0 Td [(K [ f ] d e p n )]TJ/F39 11.9552 Tf 5.479 -9.683 Td [(K ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(K d 1 C A A.54 with K denedasintheproofofTheorem2.Fromthissameproof,weknowthat p n )]TJ/F39 11.9552 Tf 5.479 -9.684 Td [(K ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(K d = p qw h 0 p N ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(d + o p A.55 Wewillnowshowthat,similarly, p n )]TJ/F39 11.9552 Tf 5.48 -9.683 Td [(K [ f ] ^ d ,^ e )]TJ/F39 11.9552 Tf 11.955 0 Td [(K [ f ] d e = p qw [ f ] h 0 p N 0 B @ ^ d ^ e 1 C A )]TJ/F30 11.9552 Tf 11.955 27.616 Td [(0 B @ d e 1 C A + o p 86 PAGE 87 with w [ f ] h denedbyA.56below.ByTaylorseriesexpansion,weget p n )]TJ/F39 11.9552 Tf 5.479 -9.683 Td [(K [ f ] ^ d ,^ e )]TJ/F39 11.9552 Tf 11.955 0 Td [(K [ f ] d e = p n r K [ f ] d e 0 0 B @ ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(d ^ e )]TJ/F39 11.9552 Tf 11.955 0 Td [(e 1 C A + p n 1 2 0 B @ ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(d ^ e )]TJ/F39 11.9552 Tf 11.955 0 Td [(e 1 C A 0 r 2 K [ f ] d e 0 B @ ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(d ^ e )]TJ/F39 11.9552 Tf 11.955 0 Td [(e 1 C A where d e isbetween ^ d ,^ e and d e Belowwecomputethegradient r K [ f ] d e andshowthatitconvergesalmost surelytoavector w [ f ] h .Wehave @ K [ f ] @ u t d e = 1 n k X l =1 n l X i =1 f l i h l i a t h t l i d 2 t )]TJ 5.479 -0.717 Td [(P k s =1 a s h s l i = d s 2 )]TJ/F40 7.9701 Tf 18.176 14.944 Td [(k X j =1 j 6 = t [ f ] j lim f l i h j l i a t h t l i d j d 2 t )]TJ 5.479 -0.717 Td [(P k s =1 a s h s l i = d s 2 + [ f ] t lim f l i h t l i d 2 t P k s =1 a s h s l i = d s )]TJ/F25 11.9552 Tf 11.955 0 Td [( [ f ] t lim f l i h t l i a t h t l i d 3 t )]TJ 5.48 -0.717 Td [(P k s =1 a s h s l i = d s 2 # a.s. )167(! 
B h h 1 d 2 t Z f a t h t P k s =1 a s h s = d s h y d )]TJ/F40 7.9701 Tf 18.176 14.944 Td [(k X j =1 j 6 = t [ f ] j lim Z f a t h t d 2 t P k s =1 a s h s = d s h j y d + [ f ] t lim I [ f ] h t d t )]TJ/F25 11.9552 Tf 11.955 0 Td [( [ f ] t lim Z f a t h t d 2 t P k s =1 a s h s = d s h t y d = B h h 1 d 2 t Z f a t h t P k s =1 a s h s = d s h y d )]TJ/F40 7.9701 Tf 18.176 14.944 Td [(k X j =1 [ f ] j lim Z f a t h t d 2 t P k s =1 a s h s = d s h j y d + [ f ] t lim I [ f ] h t d t := w [ f ] t )]TJ/F23 7.9701 Tf 6.587 0 Td [(1 h for t =2,..., k A.56a 87 PAGE 88 and @ K [ f ] @ v t d e = [ f ] t lim := w [ f ] k )]TJ/F23 7.9701 Tf 6.587 0 Td [(1+ t h for t =1,..., k A.56b ProceedingaswedidintheproofofTheorem2whenweshowedthat r 2 K d = O p ,wecanshowherethat r 2 K [ f ] d e isboundedinprobability.Hence p n )]TJ/F39 11.9552 Tf 5.48 -9.684 Td [(K [ f ] ^ d ,^ e )]TJ/F39 11.9552 Tf 11.955 0 Td [(K [ f ] d e = p qw [ f ] h 0 p N 0 B @ ^ d ^ e 1 C A )]TJ/F30 11.9552 Tf 11.955 27.617 Td [(0 B @ d e 1 C A + o p andtogetherwithA.55andA.54thisimpliesthat p n ^ J ^ d ,^ e lim [ f ] lim )]TJ/F22 11.9552 Tf 12.173 2.657 Td [(^ J d e lim [ f ] lim = 0 B B B B @ p qw [ f ] h 0 p N 0 B @ ^ d ^ e 1 C A )]TJ/F30 11.9552 Tf 11.955 27.617 Td [(0 B @ d e 1 C A + o p p qw h 0 p N ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(d + o p 1 C C C C A = p q 0 B @ w [ f ] h 0 w 0 h 0 1 C A p N 0 B @ ^ d ^ e 1 C A )]TJ/F30 11.9552 Tf 11.955 27.617 Td [(0 B @ d e 1 C A + o p where w 0 h isthecolumn-vectorobtainedfrom w h byconcatenating k zerosatits end.NowreturningtoA.52andA.53weget p n ^ J ^ d ,^ e ^ ^ d ^ [ f ] ^ d )]TJ/F40 7.9701 Tf 18.176 14.944 Td [(k X l =1 a l [ f ] l = p q 0 B @ w [ f ] h 0 w 0 h 0 1 C A p N 0 B @ ^ d ^ e 1 C A )]TJ/F30 11.9552 Tf 11.955 27.616 Td [(0 B @ d e 1 C A + p n ^ J d e ^ d ^ [ f ] d )]TJ/F40 7.9701 Tf 18.176 14.944 Td [(k X l =1 a l [ f ] l + o p d )167(!N 0, q 0 B @ w [ f ] h 0 w 0 h 0 1 C A V )]TJ/F39 11.9552 Tf 5.48 -9.684 Td [(w [ f ] h w 0 h + [ f ] 
Wecannowapplythedeltamethodwiththefunction g u v = u = v togetourresult p n ^ I ^ d ,^ e ^ ^ d ^ [ f ] ^ d )]TJ/F39 11.9552 Tf 11.955 0 Td [(I [ f ] h d )167(!N h 88 PAGE 89 where h = r g )]TJ/F39 11.9552 Tf 5.479 -9.684 Td [(I [ f ] h B h h 1 B h h 1 0 q 0 B @ w [ f ] h 0 w 0 h 0 1 C A V )]TJ/F39 11.9552 Tf 5.479 -9.684 Td [(w [ f ] h w 0 h + [ f ] r g )]TJ/F39 11.9552 Tf 5.48 -9.684 Td [(I [ f ] h B h h 1 B h h 1 A.57 with r g u v = = v )]TJ/F39 11.9552 Tf 9.298 0 Td [(u = v 2 0 89 PAGE 90 APPENDIXB DETAILSREGARDINGGENERATIONOFTHEMARKOVCHAINFROMCHAPTER5 TogenerateaMarkovchainoflength n on = 0 foraxedchoiceof thehyperparameter h = w g ,weusethefollowingsamplingscheme.First,wepick anarbitraryvaluefor .Thenwedraw 2 0 ,and asindicatedinSteps 2 4 belowwith i =0 .Togeneratetherestofthechain,weiteratethroughSteps 1 4 describedbelowforeach i =1,..., n )]TJ/F22 11.9552 Tf 11.956 0 Td [(1 Step1 Inthisstagewegeneratethebinaryvector i byusingaGibbssampleron = 1 2 ,..., q .Thus,werstgenerate i 1 j i )]TJ/F23 7.9701 Tf 6.587 0 Td [(1 2 ,..., i )]TJ/F23 7.9701 Tf 6.587 0 Td [(1 q Y accordingtothe followingBernoullidistribution: p 1 j j 6 =1 Y / p )]TJ/F25 11.9552 Tf 5.479 -9.684 Td [( j Y / + g )]TJ/F40 7.9701 Tf 6.586 0 Td [(q = 2 S )]TJ/F23 7.9701 Tf 6.587 0 Td [( m )]TJ/F23 7.9701 Tf 6.586 0 Td [(1 1+ g )]TJ/F39 11.9552 Tf 11.956 0 Td [(R 2 )]TJ/F23 7.9701 Tf 6.586 0 Td [( m )]TJ/F23 7.9701 Tf 6.586 0 Td [(1 = 2 w 1 )]TJ/F39 11.9552 Tf 11.955 0 Td [(w q B.1 where,recallthat S 2 = P m j =1 Y j )]TJETq1 0 0 1 284.934 413.69 cm[]0 d 0 J 0.478 w 0 0 m 10.148 0 l SQBT/F39 11.9552 Tf 284.934 403.714 Td [(Y 2 ,and R 2 isthecoefcientofdetermination ofmodel ;see5.Similarly,generate i 2 from p )]TJ/F25 11.9552 Tf 5.479 -9.684 Td [( i 2 j i 1 i )]TJ/F23 7.9701 Tf 6.586 0 Td [(1 3 ,..., i )]TJ/F23 7.9701 Tf 6.587 0 Td [(1 q Y andsoonfor i 3 ,..., i q .ThisGibbssamplerisnotidenticaltothatofSmithand Kohn1996inthatinourmodeltheprioron 0 isaatprior,whereasSmithand Kohn1996useaproperprioron 
$\beta_0$.

Step 2 Generate $\sigma^{2(i)} \mid \gamma^{(i)}, Y$ according to the density
$$
\begin{aligned}
p(\sigma^2 \mid \gamma, Y) &\propto p(Y \mid \gamma, \sigma^2)\, p(\sigma^2) \\
&\propto \Bigl[\int p(Y \mid \gamma, \sigma^2, \beta_0, \beta_\gamma)\, p(\beta_0)\, p(\beta_\gamma \mid \gamma, \sigma^2)\, d\beta_0\, d\beta_\gamma\Bigr]\, p(\sigma^2) \\
&\propto (\sigma^2)^{-(m+1)/2} \exp\Bigl\{-\frac{S^2}{2\sigma^2 (g+1)} \bigl[1 + g(1 - R_\gamma^2)\bigr]\Bigr\},
\end{aligned}
$$
an inverse gamma density.

To see why the last proportionality statement is true, we first consider the integral with respect to $\beta_0$. We have
$$
\begin{aligned}
\int p(Y \mid \gamma, \sigma^2, \beta_0, \beta_\gamma)\, p(\beta_0)\, d\beta_0
&\propto \int (\sigma^2)^{-m/2} \exp\Bigl[-\frac{1}{2\sigma^2} (Y - 1_m \beta_0 - X_\gamma \beta_\gamma)'(Y - 1_m \beta_0 - X_\gamma \beta_\gamma)\Bigr]\, d\beta_0 \\
&\propto (\sigma^2)^{-m/2} \exp\Bigl[-\frac{1}{2\sigma^2} (Y - X_\gamma \beta_\gamma)'(Y - X_\gamma \beta_\gamma)\Bigr] \int \exp\Bigl[-\frac{m}{2\sigma^2} \bigl(\beta_0^2 - 2 \beta_0 \bar{Y}\bigr)\Bigr]\, d\beta_0 \\
&\propto (\sigma^2)^{-(m-1)/2} \exp\Bigl[-\frac{1}{2\sigma^2} (Y - X_\gamma \beta_\gamma)'(Y - X_\gamma \beta_\gamma)\Bigr] \exp\Bigl(\frac{m \bar{Y}^2}{2\sigma^2}\Bigr).
\end{aligned}
$$
So we may now write
$$
\begin{aligned}
p(\sigma^2 \mid \gamma, Y)
&\propto (\sigma^2)^{-(m+1)/2} \exp\Bigl(\frac{m \bar{Y}^2}{2\sigma^2}\Bigr) \int \exp\Bigl[-\frac{1}{2\sigma^2} (Y - X_\gamma \beta_\gamma)'(Y - X_\gamma \beta_\gamma)\Bigr]\, p(\beta_\gamma \mid \gamma, \sigma^2)\, d\beta_\gamma \\
&\propto (\sigma^2)^{-(m+1+q_\gamma)/2} \exp\Bigl(-\frac{S^2}{2\sigma^2}\Bigr) \int \exp\Bigl[-\frac{1}{2\sigma^2}\, \frac{g+1}{g}\, \beta_\gamma' X_\gamma' X_\gamma \beta_\gamma + \frac{1}{\sigma^2}\, Y' X_\gamma \beta_\gamma\Bigr]\, d\beta_\gamma \\
&\propto (\sigma^2)^{-(m+1+q_\gamma)/2} \exp\Bigl(-\frac{S^2}{2\sigma^2}\Bigr) (\sigma^2)^{q_\gamma/2} \exp\Bigl(\frac{g}{2\sigma^2 (g+1)}\, Y' X_\gamma (X_\gamma' X_\gamma)^{-1} X_\gamma' Y\Bigr) \\
&\propto (\sigma^2)^{-(m+1)/2} \exp\Bigl\{-\frac{S^2}{2\sigma^2 (g+1)} \bigl[1 + g(1 - R_\gamma^2)\bigr]\Bigr\},
\end{aligned}
$$
where the next-to-last proportionality relation results from using the formula
$$
\int \exp\Bigl(-\tfrac{1}{2}\, \beta' W^{-1} \beta + a' \beta\Bigr)\, d\beta = (2\pi)^{q/2}\, |W|^{1/2} \exp\bigl(a' W a / 2\bigr),
$$
which can be shown to hold for any vector $a$ of length $q$ and any positive definite matrix $W$ by using a completing the squares argument. In practice, we use the distributional relationship
$$
\frac{S^2 \bigl[1 + g(1 - R_\gamma^2)\bigr]}{\sigma^2 (g+1)} \sim \chi^2_{m-1}
$$
to draw $\sigma^2$.

Step 3 Generate $\beta_0^{(i)} \mid \gamma^{(i)}, \sigma^{2(i)}, Y$ according to the density
$$
\begin{aligned}
p(\beta_0 \mid \gamma, \sigma^2, Y) &\propto \Bigl[\int p(Y \mid \gamma, \sigma^2, \beta_0, \beta_\gamma)\, p(\beta_\gamma \mid \gamma, \sigma^2)\, d\beta_\gamma\Bigr]\, p(\beta_0) && \text{(B.2a)} \\
&\propto \exp\Bigl[-\frac{1}{2\sigma^2} (Y - 1_m \beta_0)'(Y - 1_m \beta_0)\Bigr] && \text{(B.2b)} \\
&\propto N(\bar{Y}, \sigma^2 / m).
\end{aligned}
$$
Note that (B.2b) follows from (B.2a) because $1_m' X_\gamma = 0$, since the columns of $X$ are centered.

Step 4 Generate $\beta_\gamma^{(i)} \mid \gamma^{(i)}, \sigma^{2(i)}, \beta_0^{(i)}, Y$ according to the density
$$
\begin{aligned}
p(\beta_\gamma \mid \gamma, \sigma^2, \beta_0, Y) &\propto p(Y \mid \gamma, \sigma^2, \beta_0, \beta_\gamma)\, p(\beta_\gamma \mid \gamma, \sigma^2) \\
&\propto \exp\Bigl\{-\frac{1}{2\sigma^2} \Bigl[(Y - 1_m \beta_0 - X_\gamma \beta_\gamma)'(Y - 1_m \beta_0 - X_\gamma \beta_\gamma) + \frac{\beta_\gamma' X_\gamma' X_\gamma \beta_\gamma}{g}\Bigr]\Bigr\},
\end{aligned}
$$
which can be shown to be a $q_\gamma$-dimensional normal with mean and covariance matrix given respectively by
$$
\frac{g}{g+1}\, \hat{\beta}_\gamma \quad \text{and} \quad \frac{g}{g+1}\, \sigma^2 (X_\gamma' X_\gamma)^{-1},
$$
where $\hat{\beta}_\gamma = (X_\gamma' X_\gamma)^{-1} X_\gamma' Y$ is the usual least squares estimate for model $\gamma$.

We now discuss the computational effort needed to implement our sampler.
Consider generating the first component of $\gamma$. As seen in Step 1, the conditional distribution for this component is Bernoulli with success probability
$$
p(\gamma_1 = 1 \mid \gamma_{j \neq 1}, Y) = \frac{p\bigl((1, \gamma_2, \ldots, \gamma_q) \mid Y\bigr)}{p\bigl((1, \gamma_2, \ldots, \gamma_q) \mid Y\bigr) + p\bigl((0, \gamma_2, \ldots, \gamma_q) \mid Y\bigr)},
$$
with the expression for $p(\gamma \mid Y)$ given by (B.1). The other components of $\gamma$ can be in turn similarly generated, and then the other components of $\theta$ can be generated according to the conditional distributions from Steps 2-4. The main computational burden is in (i) forming $R_\gamma^2$, (ii) forming $\hat{\beta}_\gamma$, and (iii) generating from $N\bigl(c_1 \hat{\beta}_\gamma, c_2 (X_\gamma' X_\gamma)^{-1}\bigr)$, where $c_1$ and $c_2$ are constants. All of these ostensibly require calculation of $(X_\gamma' X_\gamma)^{-1}$, for which $O(q_\gamma^3)$ operations are required. In fact, (i) and (ii) require only $\hat{\beta}_\gamma$, which can be calculated by solving $X_\gamma' X_\gamma \hat{\beta}_\gamma = X_\gamma' Y$, requiring only $O(q_\gamma^2)$ operations. Now the essence of (iii) is generating from a $N\bigl(0, (X_\gamma' X_\gamma)^{-1}\bigr)$ distribution, and to do this we do not need to form $(X_\gamma' X_\gamma)^{-1}$. We need only express $X_\gamma' X_\gamma = U'U$, where $U$ is upper triangular. For if $Z \sim N(0, I_{q_\gamma})$, then $U^{-1} Z \sim N\bigl(0, (X_\gamma' X_\gamma)^{-1}\bigr)$, and $U^{-1} Z$ is obtained without calculating $U^{-1}$, simply by solving for $\beta$ in the equation $U \beta = Z$, which requires only $O(q_\gamma^2)$ operations, since $U$ is upper triangular. We note that, if we start with $X_\gamma$, finding the factorization $X_\gamma' X_\gamma = U'U$ requires $O(q_\gamma^2)$ operations.
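The triangular-solve idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the function names `back_substitute` and `draw_given_model`, the use of NumPy, and the simulated data are all hypothetical choices made for the example. For a fixed model it draws the error variance, the intercept and the coefficients as in Steps 2-4, and it obtains the $N(0, (X_\gamma' X_\gamma)^{-1})$ variate by solving $U\beta = Z$ instead of inverting $X_\gamma' X_\gamma$.

```python
import numpy as np

def back_substitute(U, z):
    """Solve U b = z for upper-triangular U by back substitution (O(q^2) work)."""
    q = U.shape[0]
    b = np.zeros(q)
    for i in range(q - 1, -1, -1):
        b[i] = (z[i] - U[i, i + 1:] @ b[i + 1:]) / U[i, i]
    return b

def draw_given_model(X, Y, g, rng):
    """Draw (sigma2, beta0, beta) for one fixed model.

    X is the m x q_gamma matrix of centered predictors in the model
    (a hypothetical input layout chosen for this sketch).
    """
    m, q = X.shape
    XtX, XtY = X.T @ X, X.T @ Y
    y_bar = Y.mean()
    S2 = np.sum((Y - y_bar) ** 2)
    beta_hat = np.linalg.solve(XtX, XtY)       # least squares estimate
    R2 = (XtY @ beta_hat) / S2                 # coefficient of determination
    # Step 2: S2 * [1 + g(1 - R2)] / (sigma2 * (g + 1)) ~ chi-square(m - 1)
    sigma2 = S2 * (1.0 + g * (1.0 - R2)) / ((g + 1.0) * rng.chisquare(m - 1))
    # Step 3: beta0 ~ N(y_bar, sigma2 / m)
    beta0 = rng.normal(y_bar, np.sqrt(sigma2 / m))
    # Step 4: beta ~ N(c1 * beta_hat, c2 * (X'X)^{-1})
    c1 = g / (g + 1.0)
    c2 = c1 * sigma2
    U = np.linalg.cholesky(XtX).T              # X'X = U'U with U upper triangular
    e = back_substitute(U, rng.standard_normal(q))   # e ~ N(0, (X'X)^{-1})
    beta = c1 * beta_hat + np.sqrt(c2) * e
    return sigma2, beta0, beta

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
X -= X.mean(axis=0)                            # center the columns
Y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.standard_normal(40)
sigma2, beta0, beta = draw_given_model(X, Y, g=40.0, rng=rng)
```

The least squares solve above uses a general solver for brevity; in a sampler that visits many models one would presumably reuse the triangular factor for that step as well.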
Now if $\gamma$ and $\gamma'$ differ in a single component (this is the case, for example, when cycling through Step 1 of the algorithm), then a factorization of $X_{\gamma'}' X_{\gamma'}$ can be obtained from the factorization of $X_\gamma' X_\gamma$ very efficiently: there are well-known methods for updating the fit and related quantities of a linear regression model when a predictor is added or dropped from the model. These rely on fast updates of QR, Cholesky, or singular value decompositions when the design matrix is changed by the addition or deletion of a column. Smith and Kohn (1996) rely on fast updating of the Cholesky decomposition of $X_\gamma' X_\gamma$ in order to update $R_\gamma^2$. Our Markov chain is more involved than that of Smith and Kohn (1996) since our chain runs on $\theta = (\gamma, \sigma^2, \beta_0, \beta_\gamma)$. Our implementation uses the sweep operator, a well-known method for updating a linear fit, because this provides all the quantities needed for our chain in one shot. We now describe this in more detail.

We first define the sweep operator. Let $T$ be a symmetric matrix. The sweep of $T$ on its $k$th diagonal entry $t_{kk} \neq 0$ is the symmetric matrix $S$ with
$$
s_{kk} = -\frac{1}{t_{kk}}, \qquad s_{ik} = \frac{t_{ik}}{t_{kk}}, \qquad s_{kj} = \frac{t_{kj}}{t_{kk}}, \qquad s_{ij} = t_{ij} - \frac{t_{ik} t_{kj}}{t_{kk}},
$$
for $i \neq k$ and $j \neq k$. The sweep operator has an obvious inverse operator.
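The definition translates almost line by line into code. The following is a small sketch, assuming NumPy and 0-based indices; the function names `sweep` and `inverse_sweep` are illustrative, and the inverse formulas are the ones obtained by solving the sweep equations for the original entries.

```python
import numpy as np

def sweep(T, k):
    """Sweep the symmetric matrix T on its k-th diagonal entry (0-based)."""
    T = np.asarray(T, dtype=float)
    t_kk = T[k, k]
    if t_kk == 0.0:
        raise ValueError("cannot sweep on a zero diagonal entry")
    S = T - np.outer(T[:, k], T[k, :]) / t_kk   # s_ij = t_ij - t_ik t_kj / t_kk
    S[:, k] = T[:, k] / t_kk                    # s_ik = t_ik / t_kk
    S[k, :] = T[k, :] / t_kk                    # s_kj = t_kj / t_kk
    S[k, k] = -1.0 / t_kk                       # s_kk = -1 / t_kk
    return S

def inverse_sweep(S, k):
    """Undo a sweep on the k-th diagonal entry."""
    S = np.asarray(S, dtype=float)
    s_kk = S[k, k]
    if s_kk == 0.0:
        raise ValueError("cannot unsweep on a zero diagonal entry")
    T = S - np.outer(S[:, k], S[k, :]) / s_kk   # t_ij = s_ij - s_ik s_kj / s_kk
    T[:, k] = -S[:, k] / s_kk                   # t_ik = -s_ik / s_kk
    T[k, :] = -S[k, :] / s_kk                   # t_kj = -s_kj / s_kk
    T[k, k] = -1.0 / s_kk                       # t_kk = -1 / s_kk
    return T

A = np.array([[4.0, 2.0], [2.0, 3.0]])
swept_once = sweep(A, 0)
swept_all = sweep(swept_once, 1)      # sweeping every diagonal entry gives -A^{-1}
restored = inverse_sweep(swept_once, 0)
```

Both functions return new arrays rather than modifying their input, which keeps the example easy to check at the cost of extra copies.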
If we apply the sweep operator to the matrix
$$
T = \begin{pmatrix} X'X & X'Y \\ Y'X & Y'Y \end{pmatrix} \qquad \text{(B.3)}
$$
on all $1$ through $q$ diagonal entries, then we obtain the matrix
$$
S = \begin{pmatrix} -(X'X)^{-1} & (X'X)^{-1} X'Y \\ Y'X (X'X)^{-1} & Y'Y - Y'X (X'X)^{-1} X'Y \end{pmatrix}.
$$
If we sweep the augmented matrix $T$ defined in (B.3) on the diagonal entries corresponding to the covariates in $\gamma$, then from the resulting matrix $S_\gamma$ we can obtain all the important quantities needed by Steps 1-4: $(X_\gamma' X_\gamma)^{-1}$ is the negative of the submatrix of $S_\gamma$ corresponding to rows and columns in $\gamma$; $(X_\gamma' X_\gamma)^{-1} X_\gamma' Y$ is the submatrix of $S_\gamma$ corresponding to rows in $\gamma$ and column $q+1$; and $Y' X_\gamma (X_\gamma' X_\gamma)^{-1} X_\gamma' Y$ can be obtained by subtracting the $(q+1, q+1)$ element of $S_\gamma$ from $Y'Y$ (with other methods, we may need to compute separately the last three quantities).

To illustrate the use of this operator, suppose that we have already swept $T$ over the covariates in $\gamma = (0, \gamma_2, \ldots, \gamma_q)$. Then we only need to perform one sweep on the first diagonal entry to get the swept matrix corresponding to the first predictor being added to the previous model, that is, to the model $(1, \gamma_2, \ldots, \gamma_q)$. Conversely, since the sweep operator has an inverse, the latter matrix could be unswept over the first diagonal entry to get the swept matrix corresponding to dropping the first predictor.
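A short numerical sketch of this bookkeeping follows. It assumes NumPy, simulated data, 0-based indices, and a centered response (so that the last diagonal entry of $T$ equals the total sum of squares); the helper `swept_summaries` and all variable names are hypothetical. The example sweeps $T$ over a two-predictor model, adds the first predictor with one extra sweep, and reads the inverse cross-product matrix, the least squares estimate and $R^2$ directly off the swept matrix.

```python
import numpy as np

def sweep(T, k):
    """Sweep the symmetric matrix T on its k-th diagonal entry (0-based)."""
    T = np.asarray(T, dtype=float)
    t_kk = T[k, k]
    S = T - np.outer(T[:, k], T[k, :]) / t_kk
    S[:, k] = T[:, k] / t_kk
    S[k, :] = T[k, :] / t_kk
    S[k, k] = -1.0 / t_kk
    return S

def swept_summaries(S, in_model, q, total_ss):
    """Read (X'X)^{-1}, the least squares estimate and R^2 off a swept matrix."""
    idx = np.flatnonzero(in_model)
    XtX_inv = -S[np.ix_(idx, idx)]        # negative of the model block
    beta_hat = S[idx, q]                  # model rows, last column
    regression_ss = total_ss - S[q, q]    # Y'X (X'X)^{-1} X'Y
    return XtX_inv, beta_hat, regression_ss / total_ss

rng = np.random.default_rng(1)
m, q = 60, 4
X = rng.standard_normal((m, q))
X -= X.mean(axis=0)                       # centered predictors
Y = X[:, 0] - 2.0 * X[:, 2] + rng.standard_normal(m)
Yc = Y - Y.mean()                         # centered response (assumption of this sketch)

XtY = X.T @ Yc
T = np.block([[X.T @ X, XtY[:, None]],
              [XtY[None, :], np.array([[Yc @ Yc]])]])

model = np.array([0, 1, 1, 0])            # predictors 2 and 3 are in the model
S = T.copy()
for k in np.flatnonzero(model):
    S = sweep(S, k)

S_bigger = sweep(S, 0)                    # one more sweep adds the first predictor
bigger = np.array([1, 1, 1, 0])
XtX_inv, beta_hat, R2 = swept_summaries(S_bigger, bigger, q, T[q, q])
```

Only one additional sweep was needed to move from the smaller model to the larger one, which is the point of the paragraph above.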
APPENDIX C
PROOF OF THE UNIFORM ERGODICITY AND DEVELOPMENT OF THE MINORIZATION CONDITION FROM CHAPTER 5

Proof of Proposition 1

Let $\nu_y$ and $p(\gamma \mid Y)$ denote the posterior distribution of $\theta$ and $\gamma$, respectively, under the prior $\nu$ on $\theta$ (we are suppressing the subscript $h$, since the hyperparameter is fixed throughout). We use $K^n$ and $V^n$ to denote the $n$-step Markov transition functions for the $\theta$ and $\gamma$ chains, respectively. Also, letting $\mu$ denote the product of counting measure on $\{0,1\}^q$ and Lebesgue measure on $(0, \infty) \times \mathbb{R}^{q+1}$, we use $k^n$ to denote the density of $K^n$ with respect to $\mu$ and $v^n$ to denote the probability mass function of $V^n$. We now show that the $\theta$-chain and the $\gamma$-chain converge to their corresponding posterior distributions at exactly the same rate. For any starting state $\theta_0$, with model component $\gamma_0$, and $n \in \mathbb{N}$, we have
$$
\begin{aligned}
\| K^n(\theta_0, \cdot) - \nu_y(\cdot) \| &= \sup_A \bigl| K^n(\theta_0, A) - \nu_y(A) \bigr| \\
&= \frac{1}{2} \int \bigl| k^n(\theta_0, \theta) - \nu_y(\theta) \bigr| \, d\mu(\theta) \\
&= \frac{1}{2} \int \bigl| v^n(\gamma_0, \gamma)\, p(\sigma^2, \beta_0, \beta_\gamma \mid \gamma, Y) - p(\gamma \mid Y)\, p(\sigma^2, \beta_0, \beta_\gamma \mid \gamma, Y) \bigr| \, d\mu(\theta) \\
&= \frac{1}{2} \sum_{\gamma \in \Gamma} \bigl| v^n(\gamma_0, \gamma) - p(\gamma \mid Y) \bigr| \iiint p(\sigma^2, \beta_0, \beta_\gamma \mid \gamma, Y) \, d\sigma^2 \, d\beta_0 \, d\beta_\gamma \\
&= \frac{1}{2} \sum_{\gamma \in \Gamma} \bigl| v^n(\gamma_0, \gamma) - p(\gamma \mid Y) \bigr| \\
&= \sup_B \bigl| V^n(\gamma_0, B) - p(\gamma \in B \mid Y) \bigr|.
\end{aligned}
$$
Hence, the $\theta$ chain inherits the convergence rate of its $\gamma$-subchain, a uniformly ergodic Gibbs sampler on a finite state space.

Description of the Regeneration Scheme

For regeneration purposes, it is enough to restrict our attention to the Markov chain that runs on $\gamma$. This is because, as we will see later, whenever this subchain regenerates, the augmented chain that produces draws from the posterior distribution of $\theta = (\gamma, \sigma^2, \beta_0, \beta_\gamma)$ also regenerates. We will find a function $s : \Gamma \to [0, 1)$ and a probability mass function $d$ on $\Gamma$ such that $v(\gamma, \gamma')$ satisfies the minorization condition
$$
v(\gamma, \gamma') \geq s(\gamma)\, d(\gamma') \quad \text{for all } \gamma, \gamma' \in \Gamma. \qquad \text{(C.1)}
$$
We proceed via the distinguished point technique introduced in Mykland et al. (1995).
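Let $\gamma^*$ denote a fixed model, which we will refer to as a distinguished model, and let $D \subset \Gamma$ be a set of models. The model $\gamma^*$ and the set $D$ are arbitrary, but below we give guidelines for making a practical choice of $\gamma^*$ and $D$. For all $\gamma, \gamma' \in \Gamma$ we have
$$
v(\gamma, \gamma') = \frac{v(\gamma, \gamma')}{v(\gamma^*, \gamma')}\, v(\gamma^*, \gamma') \geq \Bigl[\min_{\gamma'' \in D} \frac{v(\gamma, \gamma'')}{v(\gamma^*, \gamma'')}\Bigr]\, v(\gamma^*, \gamma')\, I(\gamma' \in D).
$$
If we let $c$ denote the normalizing constant for $v(\gamma^*, \gamma')\, I(\gamma' \in D)$, that is, $c = \sum_{\gamma' \in D} v(\gamma^*, \gamma')$, then we get
$$
v(\gamma, \gamma') \geq c \Bigl[\min_{\gamma'' \in D} \frac{v(\gamma, \gamma'')}{v(\gamma^*, \gamma'')}\Bigr] \frac{v(\gamma^*, \gamma')\, I(\gamma' \in D)}{c} = s(\gamma)\, d(\gamma'),
$$
where
$$
s(\gamma) = c \min_{\gamma'' \in D} \frac{v(\gamma, \gamma'')}{v(\gamma^*, \gamma'')} \quad \text{and} \quad d(\gamma') = \frac{v(\gamma^*, \gamma')\, I(\gamma' \in D)}{c}.
$$
Evaluating both $s$ and $d$ requires computing transition probabilities of the form $v(\gamma, \gamma')$ which, due to the fact that the Markov chain on $\gamma$ is a Gibbs sampler, can be expressed as
$$
v(\gamma, \gamma') = p(\gamma_1' \mid \gamma_2, \gamma_3, \ldots, \gamma_q, Y)\, p(\gamma_2' \mid \gamma_1', \gamma_3, \ldots, \gamma_q, Y) \cdots p(\gamma_q' \mid \gamma_1', \gamma_2', \ldots, \gamma_{q-1}', Y),
$$
where the formula for the right side terms is given by (B.1). Since the $\gamma$'s in the terms on the right side differ in at most one component, the fast updating techniques discussed in Appendix B can be applied here too to speed up the computations.

By the expression for the regeneration probability given in Chapter 2, $P(\delta_i = 1 \mid \gamma^{(i)}, \gamma^{(i+1)}) = s(\gamma^{(i)})\, d(\gamma^{(i+1)}) / v(\gamma^{(i)}, \gamma^{(i+1)})$, and the normalizing constant $c$ cancels in the numerator, so in practice there is no need to compute it, and the success probability of the regeneration indicator is simply
$$
P(\delta_i = 1 \mid \gamma^{(i)}, \gamma^{(i+1)}) = \Bigl[\min_{\gamma'' \in D} \frac{v(\gamma^{(i)}, \gamma'')}{v(\gamma^*, \gamma'')}\Bigr] \frac{v(\gamma^*, \gamma^{(i+1)})\, I(\gamma^{(i+1)} \in D)}{v(\gamma^{(i)}, \gamma^{(i+1)})}. \qquad \text{(C.2)}
$$
The choice of $\gamma^*$ and $D$ affects the regeneration rate. Ideally we would like the regeneration probability to be as big as possible. Notice that regeneration can occur only if $\gamma^{(i+1)}$ is in $D$. This suggests making $D$ large. However, increasing the size of $D$ makes the first term in brackets in (C.2) smaller. We have found that a reasonable tradeoff consists of taking $D$ to be the smallest set of models that encompasses 25% of the posterior probability. Also, the obvious choice for $\gamma^*$ is the HPM model. The distinguished model $\gamma^*$ and the set $D$ are selected from the output of an initial chain.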
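The regeneration probability in (C.2) can be computed directly once the one-sweep Gibbs transition probabilities are available. The sketch below uses only the Python standard library and a toy three-variable model space; the function names, the distinguished model, the set `D`, and especially `toy_log_post` (a made-up unnormalized log posterior standing in for (B.1)) are assumptions for illustration and are not taken from the thesis.

```python
import itertools
import math

def gibbs_transition_prob(gamma, gamma_new, log_post):
    """Probability that one systematic Gibbs sweep moves gamma to gamma_new."""
    current = list(gamma)
    prob = 1.0
    for j, target in enumerate(gamma_new):
        with_one = current.copy()
        with_one[j] = 1
        with_zero = current.copy()
        with_zero[j] = 0
        # Bernoulli success probability p(1, rest) / (p(1, rest) + p(0, rest))
        diff = log_post(tuple(with_zero)) - log_post(tuple(with_one))
        p_one = 1.0 / (1.0 + math.exp(diff))
        prob *= p_one if target == 1 else 1.0 - p_one
        current[j] = target        # later components condition on the new value
    return prob

def regeneration_prob(gamma_i, gamma_next, gamma_star, D, log_post):
    """Success probability of the regeneration indicator, as in (C.2)."""
    if gamma_next not in D:
        return 0.0
    def v(a, b):
        return gibbs_transition_prob(a, b, log_post)
    smallest_ratio = min(v(gamma_i, g2) / v(gamma_star, g2) for g2 in D)
    return smallest_ratio * v(gamma_star, gamma_next) / v(gamma_i, gamma_next)

def toy_log_post(gamma):
    """Hypothetical unnormalized log posterior on {0,1}^3, for illustration only."""
    return 0.8 * gamma[0] - 0.3 * gamma[1] + 0.5 * gamma[0] * gamma[2]

states = list(itertools.product([0, 1], repeat=3))
gamma_star = (1, 0, 1)                       # distinguished model (assumed)
D = [(1, 0, 1), (1, 0, 0), (1, 1, 1)]        # set of models (assumed)
p_regen = regeneration_prob((0, 0, 1), (1, 0, 1), gamma_star, D, toy_log_post)
```

As the text notes, the normalizing constant $c$ never appears. When the current state equals the distinguished model and the next state lies in `D`, every ratio inside the minimum is 1, so the chain regenerates with probability 1.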
For the all-inclusive chain that runs not only on the model space but also on the space of error variance and model coefficients, we can obtain a minorization condition if we multiply the condition in (C.1) on both sides by
$$
p(\sigma'^2 \mid \gamma', Y)\, p(\beta_0' \mid \gamma', \sigma'^2, Y)\, p(\beta_{\gamma'}' \mid \gamma', \sigma'^2, \beta_0', Y).
$$
This yields
$$
k(\theta, \theta') \geq s_1(\theta)\, d_1(\theta') \quad \text{for all } \theta = (\gamma, \sigma^2, \beta_0, \beta_\gamma),\ \theta' = (\gamma', \sigma'^2, \beta_0', \beta_{\gamma'}'), \qquad \text{(C.3)}
$$
where
$$
s_1(\theta) = s(\gamma) \quad \text{and} \quad d_1(\theta') = d(\gamma')\, p(\sigma'^2 \mid \gamma', Y)\, p(\beta_0' \mid \gamma', \sigma'^2, Y)\, p(\beta_{\gamma'}' \mid \gamma', \sigma'^2, \beta_0', Y).
$$
Hence for this bigger chain the regeneration indicator has, according to (2-11), success probability
$$
P(\delta_i = 1 \mid \theta^{(i)}, \theta^{(i+1)}) = \frac{s_1(\theta^{(i)})\, d_1(\theta^{(i+1)})}{k(\theta^{(i)}, \theta^{(i+1)})} = \frac{s(\gamma^{(i)})\, d(\gamma^{(i+1)})}{v(\gamma^{(i)}, \gamma^{(i+1)})},
$$
which is exactly the same as the regeneration success probability for the chain on $\Gamma$. Hence, the augmented $\theta$-chain and the $\gamma$-chain regenerate simultaneously. Note that sampling from $d_1$ (which is needed to start the regeneration) is trivial. We first sample $\gamma$ from $d$, which is done by sampling from $v(\gamma^*, \cdot)$ and retaining $\gamma$ only if it is in $D$ (and to do this we do not need to know the normalizing constant $c$); then we sequentially sample $\sigma^2$, $\beta_0$, and $\beta_\gamma$ from $p(\sigma^2 \mid \gamma, Y)$, $p(\beta_0 \mid \gamma, \sigma^2, Y)$, and $p(\beta_\gamma \mid \gamma, \sigma^2, \beta_0, Y)$, respectively.

APPENDIX D
MAP FOR THE OZONE PREDICTORS IN FIGURE 5-5

Table D-1. The 44 predictors used in the ozone illustration. The symbol "." represents an interaction.

Number  Predictor
1       vh      Vandenburg 500 millibar pressure height (m)
2       wind    Wind speed (mph) at Los Angeles International Airport (LAX)
3       humid   Humidity (percent) at LAX
4       temp    Sandburg Air Force Base temperature (F)
5       ibh     Inversion base height at LAX
6       dpg     Daggett pressure gradient (mmHg) from LAX to Daggett, CA
7       ibt     Inversion base temperature at LAX
8       vis     Visibility (miles) at LAX
9       vh^2
10      wind^2
11      humid^2
12      temp^2
13      ibh^2
14      dpg^2
15      ibt^2
16      vis^2
17      vh.wind
18      vh.humid
19      wind.humid
20      vh.temp
21      wind.temp
22      humid.temp
23      vh.ibh
24      wind.ibh
25      humid.ibh
26      temp.ibh
27      vh.dpg
28      wind.dp
29      humid.d
30      temp.dp
31      ibh.dpg
32      vh.ibt
33      wind.ib
34      humid.i
35      temp.ib
36      ibh.ibt
37      dpg.ibt
38      vh.vis
39      wind.vi
40      humid.v
41      temp.vi
42      ibh.vis
43      dpg.vis
44      ibt.vis

REFERENCES

Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems.
The Annals of Statistics 2, 1152.
Athreya, K. B., Doss, H. and Sethuraman, J. On the convergence of the Markov chain simulation method. The Annals of Statistics 24, 69.
Barbieri, M. M. and Berger, J. O. Optimal predictive model selection. The Annals of Statistics 32, 870.
Breiman, L. and Friedman, J. H. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80, 580.
Burr, D. and Doss, H. Confidence bands for the median survival time as a function of the covariates in the Cox model. Journal of the American Statistical Association 88, 1330.
Burr, D. and Doss, H. A Bayesian semiparametric model for random-effects meta-analysis. Journal of the American Statistical Association 100, 242.
Casella, G. and Moreno, E. Objective Bayesian variable selection. Journal of the American Statistical Association 101, 157.
Clyde, M., DeSimone, H. and Parmigiani, G. Prediction via orthogonalized model mixing. Journal of the American Statistical Association 91, 1197.
Clyde, M., Ghosh, J. and Littman, M. Bayesian adaptive sampling for variable selection and model averaging. Discussion Paper 2009-16, Duke University Department of Statistical Science.
Cogburn, R. The central limit theorem for Markov processes. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2. University of California Press, Berkeley, 485.
Cui, W. and George, E. Empirical Bayes vs. fully Bayes variable selection. Journal of Statistical Planning and Inference 138, 888.
Doss, H. Comment on "Markov chains for exploring posterior distributions" by Luke Tierney. The Annals of Statistics 22, 1728.
Doss, H. Bayesian model selection: Some thoughts on future directions. Statistica Sinica 17, 413.
Doss, H. Estimation of large families of Bayes factors from Markov chain output. Statistica Sinica 20, 537.
Durrett, R. Probability: Theory and Examples. Brooks/Cole Publishing Co.
Fernandez, C., Ley, E. and Steel, M. F. J. Benchmark priors for Bayesian model averaging. Journal of Econometrics 100, 381.
Foster, D. P. and George, E. I. The risk inflation criterion for multiple regression. The Annals of Statistics 22, 1947.
George, E. I. and Foster, D. P. Calibration and empirical Bayes variable selection. Biometrika 87, 731.
George, E. I. and McCulloch, R. E. Approaches for Bayesian variable selection. Statistica Sinica 7, 339.
Geyer, C. J. Practical Markov chain Monte Carlo (with discussion). Statistical Science 7, 473.
Geyer, C. J. Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Tech. Rep. 568r, Department of Statistics, University of Minnesota.
Gill, R. D., Vardi, Y. and Wellner, J. A. Large sample theory of empirical distributions in biased sampling models. The Annals of Statistics 16, 1069.
Hansen, M. H. and Yu, B. Model selection and the principle of minimum description length. Journal of the American Statistical Association 96, 746.
Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97.
Hobert, J. P., Jones, G. L., Presnell, B. and Rosenthal, J. S. On the applicability of regenerative simulation in Markov chain Monte Carlo. Biometrika 89, 731.
Ibragimov, I. Some limit theorems for stationary processes. Theory of Probability and its Applications 7, 349.
Ibragimov, I. A. and Linnik, Y. V. Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff, Groningen.
Kass, R. E. and Wasserman, L. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association 90, 928.
Kohn, R., Smith, M. and Chan, D. Nonparametric regression using linear combinations of basis functions. Statistics and Computing 11, 313.
Kong, A., McCullagh, P., Meng, X.-L., Nicolae, D. and Tan, Z. A theory of statistical models for Monte Carlo integration (with discussion). Journal of the Royal Statistical Society, Series B 65, 585.
Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. Mixtures of g-priors for Bayesian variable selection. Journal of the American Statistical Association 103, 410.
Madigan, D. and York, J. Bayesian graphical models for discrete data. International Statistical Review 63, 215.
Meng, X.-L. and Wong, W. H. Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica 6, 831.
Meyn, S. P. and Tweedie, R. L. Markov Chains and Stochastic Stability. Springer-Verlag, New York, London.
Mykland, P., Tierney, L. and Yu, B. Regeneration in Markov chain samplers. Journal of the American Statistical Association 90, 233.
Owen, A. and Zhou, Y. Safe and effective importance sampling. Journal of the American Statistical Association 95, 135.
Raftery, A. E., Madigan, D. and Hoeting, J. A. Bayesian model averaging for linear regression models. Journal of the American Statistical Association 92, 179.
Romero, M. On Two Topics with no Bridge: Bridge Sampling with Dependent Draws and Bias of the Multiple Imputation Variance Estimator. Ph.D. thesis, University of Chicago.
Rosenblatt, M. Asymptotic normality, strong mixing and spectral density estimates. The Annals of Probability 12, 1167.
Scott, J. G. and Berger, J. O. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics (to appear).
Smith, M. and Kohn, R. Nonparametric regression using Bayesian variable selection. Journal of Econometrics 75, 317.
Tan, A. and Hobert, J. P. Block Gibbs sampling for Bayesian random effects models with improper priors: convergence and regeneration. Journal of Computational and Graphical Statistics 18, 861.
Tan, Z. On a likelihood approach for Monte Carlo integration. Journal of the American Statistical Association 99, 1027.
Tierney, L. Markov chains for exploring posterior distributions (Disc: p1728). The Annals of Statistics 22, 1701.
Vandaele, W. Participation in illegitimate activities: Ehrlich revisited. In Deterrence and Incapacitation. US National Academy of Sciences, Washington DC, 270.
Vardi, Y. Empirical distributions in selection bias models. The Annals of Statistics 13, 178.
Zellner, A. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (P. K. Goel and A. Zellner, eds.). Elsevier, New York, 233.
Zellner, A. and Siow, A. Posterior odds ratios for selected regression hypotheses. In Bayesian Statistics: Proceedings of the First International Meeting held in Valencia (Spain) (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.). Valencia: University Press, 585.

BIOGRAPHICAL SKETCH

Eugenia Buta was born in 1982 in Romania. In 2000, she was admitted to the University of Oradea, Romania, from where she earned her Bachelor's degree in Mathematics-Informatics in 2004. She then joined the Department of Statistics at the University of Florida to pursue a Ph.D. degree. During her graduate student years, she served as a Teaching Assistant for several undergraduate and graduate level courses in the Department of Statistics. She expects to receive her doctorate degree in Statistics in August 2010. |