
# Computational Approaches for Empirical Bayes Methods and Bayesian Sensitivity Analysis

## Material Information

Title: Computational Approaches for Empirical Bayes Methods and Bayesian Sensitivity Analysis
Physical Description: 1 online resource (104 p.)
Language: english
Creator: Buta, Eugenia
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2010

## Subjects

Subjects / Keywords: analysis, bayes, bayesian, empirical, factor, hyperparameter, posterior, prior, selection, sensitivity, variable
Statistics -- Dissertations, Academic -- UF
Genre: Statistics thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

## Notes

Abstract: We consider situations in Bayesian analysis where we have a family of priors on the parameter theta, and we deal with two related problems. The first involves sensitivity analysis and is stated as follows. Suppose we fix a function f of theta. How do we efficiently estimate the posterior expectation of f(theta) simultaneously for all priors in the family of priors? The second problem is how do we identify reasonable choices of priors? We assume that we are able to generate Markov chain samples from the posterior for a finite number of the priors, and we develop a methodology, based on a combination of importance sampling and the use of control variates, for dealing with these two problems. The methodology applies very generally, and we show how it applies in particular to a commonly used model for variable selection in Bayesian linear regression, in which the unknown parameter includes the model and the regression coefficients for the selected model. The prior is a hierarchical prior in which first the model is selected, then the coefficients for this model are chosen, and this prior is indexed by two hyperparameters. These hyperparameters effectively determine whether the selected model will be a large model with many variables, or a parsimonious model with only a few variables, so choosing them is very important. We give illustrations of our methodology on real data sets.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Eugenia Buta.
Thesis: Thesis (Ph.D.)--University of Florida, 2010.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2011-08-31

## Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2010
System ID: UFE0041145:00001

Full Text

COMPUTATIONAL APPROACHES FOR EMPIRICAL BAYES METHODS AND
BAYESIAN SENSITIVITY ANALYSIS

By

EUGENIA BUTA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010

© 2010 Eugenia Buta

I dedicate this to my brother Florin.

ACKNOWLEDGMENTS

I would like to thank my advisor, Professor Hani Doss, for the invaluable help and guidance with writing this dissertation. I am also thankful to Professors Farid AitSahlia, George Casella, and James Hobert for serving on my supervisory committee and [...] Statistics at the University of Florida. I greatly appreciate the chance I have been given to come here and learn Statistics from many exceptional teachers, all while benefiting from the kindness and support of other students and staff.

TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 ESTIMATION OF BAYES FACTORS AND POSTERIOR EXPECTATIONS
    2.1 Estimation of Bayes Factors
    2.2 Estimation of Bayes Factors Using Control Variates
    2.3 Estimation of Posterior Expectations
    2.4 Estimation of Posterior Expectations Using Control Variates
    2.5 Estimation of Posterior Expectations Using Control Variates With Estimated Skeleton Bayes Factors and Expectations

3 VARIANCE ESTIMATION AND SELECTION OF THE SKELETON POINTS
    3.1 Estimation of the Variance
    3.2 Selection of the Skeleton Points

4 REVIEW OF PREVIOUS WORK

5 ILLUSTRATION ON VARIABLE SELECTION
    5.1 A Markov Chain for Estimating the Posterior Distribution of Model Parameters
    5.2 Choice of the Hyperparameter
    5.3 Examples
        5.3.1 U.S. Crime Data
        5.3.2 Ozone Data

6 DISCUSSION

APPENDIX

A PROOF OF RESULTS FROM CHAPTER 1

B DETAILS REGARDING GENERATION OF THE MARKOV CHAIN FROM CHAPTER 5

C PROOF OF THE UNIFORM ERGODICITY AND DEVELOPMENT OF THE MINORIZATION CONDITION FROM CHAPTER 5

D MAP FOR THE OZONE PREDICTORS IN FIGURE 5-5

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

5-1 Posterior inclusion probabilities for the fifteen predictor variables in the U.S. crime data set, under three models. Names of the variables are as in Table 2 of Liang et al. (2008) (but all variables except for the binary variable S have been log transformed).

D-1 The 44 predictors used in the ozone illustration. The symbol "." represents an interaction.

LIST OF FIGURES

5-1 Estimates of Bayes factors for the U.S. crime data. The plots give two different views of the graph of the Bayes factor as a function of w and g when the baseline value of the hyperparameter is given by w = 0.5 and g = 15. The estimate is (2-13), which uses control variates.

5-2 Estimates of posterior inclusion probabilities for Variables 1 and 6 for the U.S. crime data. The estimate used is (2-16).

5-3 Variance functions for two versions of / (). The left panel is for the estimate based on the skeleton (5-4). The points in this skeleton were shifted to better cover the problematic region near the back of the plot (g small and w large), creating the skeleton (5-5). The maximum variance is then reduced by a factor of 9 (right panel).

5-4 Estimates of Bayes factors for the ozone data. The plots give two different views of the graph of the Bayes factor as a function of w and g when the baseline value of the hyperparameter is given by w = .2 and g = 50.

5-5 95% confidence intervals of the posterior inclusion probabilities for the 44 predictors in the ozone data when the hyperparameter value is given by w = .13 and g = 75. A table giving the correspondence between the integers 1-44 and the predictors is given in Appendix D.

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

COMPUTATIONAL APPROACHES FOR EMPIRICAL BAYES METHODS AND
BAYESIAN SENSITIVITY ANALYSIS

By

Eugenia Buta

August 2010

Chair: Hani Doss
Major: Statistics

We consider situations in Bayesian analysis where we have a family of priors $\nu_h$ on the parameter $\theta$, where $h$ varies continuously over a space $\mathcal{H}$, and we deal with two related problems. The first involves sensitivity analysis and is stated as follows. Suppose we fix a function $f$ of $\theta$. How do we efficiently estimate the posterior expectation of $f(\theta)$ simultaneously for all $h$ in $\mathcal{H}$? The second problem is how do we identify subsets of $\mathcal{H}$ which give rise to reasonable choices of $\nu_h$? We assume that we are able to generate Markov chain samples from the posterior for a finite number of the priors, and we develop a methodology, based on a combination of importance sampling and the use of control variates, for dealing with these two problems. The methodology applies very generally, and we show how it applies in particular to a commonly used model for variable selection in Bayesian linear regression, in which the unknown parameter includes the model and the regression coefficients for the selected model. The prior is a hierarchical prior in which first the model is selected, then the coefficients for this model are chosen, and this prior is indexed by two hyperparameters. These hyperparameters effectively determine whether the selected model will be a large model with many variables, or a parsimonious model with only a few variables, so choosing them is very important. We give two illustrations of our methodology, one on the U.S. crime data of Vandaele and the other on ground level ozone data originally analyzed by Breiman and Friedman.

CHAPTER 1
INTRODUCTION

In the Bayesian paradigm we have a data vector $Y$ with density $p_\theta$ for some unknown $\theta \in \Theta$, and we wish to put a prior density on $\theta$. The available family of prior densities is $\{\nu_h, h \in \mathcal{H}\}$, where $h$ is called a hyperparameter. Typically, the hyperparameter is multivariate and choosing it can be difficult. But this choice is very important and can have a large impact on subsequent inference. There are two issues we wish to consider:

(A) Suppose we fix a quantity of interest, say $f(\theta)$, where $f$ is a function. How do we assess how the posterior expectation of $f(\theta)$ changes as we vary $h$? More generally, how do we assess changes in the posterior distribution of $f(\theta)$ as we vary $h$?

(B) How do we determine if a given subset of $\mathcal{H}$ constitutes a class of reasonable choices?

The first issue is one of sensitivity analysis and the second is one of model selection.

As an example of the kind of problem we wish to deal with, consider the problem of variable selection in Bayesian linear regression. Here, we have a response variable $Y$ and a set of predictors $X_1, \ldots, X_q$, each a vector of length $m$. For every subset $\gamma$ of $\{1, \ldots, q\}$ we have a potential model $M_\gamma$ given by

$$Y = 1_m \beta_0 + X_\gamma \beta_\gamma + \varepsilon,$$

where $1_m$ is the vector of $m$ 1's, $X_\gamma$ is the design matrix whose columns consist of the predictor vectors corresponding to the subset $\gamma$, $\beta_\gamma$ is the vector of coefficients for that subset, and $\varepsilon \sim \mathcal{N}_m(0, \sigma^2 I)$. Let $q_\gamma$ denote the number of variables in the subset $\gamma$. The unknown parameter is $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$, which includes the indicator of the subset of variables that go into the linear model. A very commonly used prior distribution on $\theta$ is given by a hierarchy in which we first choose the indicator $\gamma$ from the "independence Bernoulli prior" (each variable goes into the model with a certain probability $w$, independently of all the other variables) and then choose the vector of regression coefficients corresponding to the selected variables. In more detail, the model is described as follows:

$$Y \sim \mathcal{N}_m(1_m \beta_0 + X_\gamma \beta_\gamma, \sigma^2 I) \tag{1-1a}$$

$$(\sigma^2, \beta_0) \sim p(\beta_0, \sigma^2) \propto 1/\sigma^2, \text{ and given } \sigma,\ \beta_\gamma \sim \mathcal{N}_{q_\gamma}\bigl(0, g \sigma^2 (X_\gamma' X_\gamma)^{-1}\bigr) \tag{1-1b}$$

$$\gamma \sim w^{q_\gamma} (1 - w)^{q - q_\gamma} \tag{1-1c}$$

The prior on $(\sigma, \beta_0, \beta_\gamma)$ is Zellner's $g$-prior introduced in Zellner (1986), and is indexed by a hyperparameter $g$. Although this prior is improper, the resulting posterior distribution is proper.

Note that we have used the word "model" in two different ways: (i) a model is a specification of the hyperparameter $h$, and (ii) a model in regression is a list of variables to include. The meaning of the word will always be clear from context.

To summarize, the prior on the parameter $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$ is given by the two-level hierarchy (1-1c) and (1-1b), and is indexed by $h = (w, g)$. Loosely speaking, when $w$ is large and $g$ is small, the prior encourages models with many variables and small coefficients, whereas when $w$ is small and $g$ is large, the prior concentrates its mass on parsimonious models with large coefficients. Therefore, the hyperparameter $h = (w, g)$ plays a very important role, and in effect determines the model that will be used to carry out variable selection.
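To make the role of $w$ concrete, the following minimal sketch computes the prior distribution of the model size $q_\gamma$ implied by the independence Bernoulli prior (1-1c). It is only an illustration: the function names and the choice $q = 15$ are assumptions made here, not part of the dissertation.

```python
from math import comb

def model_size_pmf(q, w):
    """Prior P(q_gamma = j), j = 0..q, when each of q variables enters with probability w."""
    # There are comb(q, j) subsets of size j, each with prior mass w^j (1-w)^(q-j).
    return [comb(q, j) * w**j * (1 - w) ** (q - j) for j in range(q + 1)]

def expected_model_size(q, w):
    """Prior mean of the model size q_gamma (equals q * w)."""
    return sum(j * p for j, p in enumerate(model_size_pmf(q, w)))

q = 15  # hypothetical number of candidate predictors
for w in (0.1, 0.5, 0.9):
    print(f"w = {w}: expected model size = {expected_model_size(q, w):.2f}")
```

Under this prior the expected model size is $qw$, so moving $w$ from 0.1 to 0.9 shifts the prior mass from sparse models to nearly full ones.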

A standard method for approaching model selection involves the use of Bayes factors. For each $h \in \mathcal{H}$ let $m_h(y)$ denote the marginal likelihood of the data under the prior $\nu_h$, that is, $m_h(y) = \int p_\theta(y) \nu_h(\theta)\, d\theta$. We will write $m_h$ instead of $m_h(y)$. The Bayes factor of the model indexed by $h_2$ vs. the model indexed by $h_1$ is defined as the ratio of the marginal likelihoods of the data under the two models, $m_{h_2}/m_{h_1}$, and is denoted throughout by $B(h_2, h_1)$. Bayes factors are widely used as a criterion for comparing models in Bayesian analyses. For selecting models that are better than others from the family of models indexed by $h \in \mathcal{H}$, our strategy will be to compute and subsequently compare all the Bayes factors $B(h, h_*)$, for all $h \in \mathcal{H}$, and a fixed hyperparameter value $h_*$. We could then consider as good candidate models those with values of $h$ that result in the largest Bayes factors.
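The strategy of comparing Bayes factors against a fixed baseline can be illustrated on a toy problem where the marginal likelihood is available in closed form. The sketch below assumes a deliberately simple conjugate model that is not the one studied in this work: a single observation $y \mid \theta \sim N(\theta, 1)$ with prior $\theta \sim N(0, h)$, so that marginally $y \sim N(0, 1 + h)$. The names `h_star`, `grid`, and the value of `y` are hypothetical.

```python
from math import exp, pi, sqrt

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return exp(-x * x / (2.0 * var)) / sqrt(2.0 * pi * var)

def marginal_likelihood(y, h):
    # Toy assumption: y | theta ~ N(theta, 1), theta ~ N(0, h), hence y ~ N(0, 1 + h).
    return normal_pdf(y, 1.0 + h)

def bayes_factor(y, h, h_star):
    """Bayes factor of the prior indexed by h against the baseline h_star."""
    return marginal_likelihood(y, h) / marginal_likelihood(y, h_star)

y = 2.0          # hypothetical observation
h_star = 1.0     # fixed baseline hyperparameter value
grid = [0.5 * j for j in range(1, 41)]  # candidate values of h
best_h = max(grid, key=lambda h: bayes_factor(y, h, h_star))
print(best_h, bayes_factor(y, best_h, h_star))
```

In realistic models the marginal likelihoods are not available in closed form, which is why the Bayes factors have to be estimated from posterior samples.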

Suppose now that we fix a particular function $f$ of the parameter $\theta$; for instance, in the example, this might be the indicator that variable 1 is included in the regression model. It is of general interest to determine the posterior expectation $E_h(f(\theta) \mid Y)$ as a function of $h$ and to determine whether or not $E_h(f(\theta) \mid Y)$ is very sensitive to the value of $h$. If it is not, then two individuals using two different hyperparameters will reach approximately the same conclusions and the analysis will not be controversial. On the other hand, if for a function of interest the posterior expectation varies considerably as we change the hyperparameter, then we will want to know which aspects of the hyperparameter (e.g. which components of $h$) produce big changes and we may want to see a plot of the posterior expectations as we vary those aspects of the hyperparameter. Except for extremely simple cases, posterior expectations cannot be obtained in closed form, and are typically estimated via Markov chain Monte Carlo (MCMC). It is slow and inefficient to run Markov chains for every hyperparameter value $h$. Chapter 2 reviews an existing method for estimating $E_h(f(\theta) \mid Y)$ that bypasses the need to run a separate Markov chain for every $h$. The method has an analogue for the problem of estimating Bayes factors. Unfortunately, the method has severe limitations, which we also discuss.

The purpose of this work is to introduce a methodology for dealing with the sensitivity analysis and model selection issues discussed above. The basic idea is, not surprisingly, to use Markov chains corresponding to a few values of the hyperparameter in order to estimate $E_h(f(\theta) \mid Y)$ for all $h \in \mathcal{H}$ and also the Bayes factors $B(h, h_*)$ for all $h \in \mathcal{H}$, and this is done through importance sampling. The difficulty we face is that there is a severe computational burden caused by the requirement that we handle a very large number of values of $h$.

The main contributions of this work are the development of computationally efficient schemes for estimating large families of posterior expectations and Bayes factors that are based on a combination of MCMC, importance sampling, and the use of control variates, therefore providing an answer to questions (A) and (B) raised earlier. We also provide theory to support the methods we propose. Chapter 2 describes our methodology for estimating Bayes factors and posterior expectations, and gives statements of theoretical results associated with the methodology. In Chapter 3 we discuss estimation of the variance of our estimates. Chapter 4 gives a review of the relevant literature, along with a discussion of how the present work fits in the context of previous related work. In Chapter 5 we return to the problem of variable selection in Bayesian linear regression. There, first we consider a Markov chain algorithm that generates a sequence $(\gamma^{(1)}, \sigma^{(1)}, \beta_0^{(1)}, \beta_\gamma^{(1)}), (\gamma^{(2)}, \sigma^{(2)}, \beta_0^{(2)}, \beta_\gamma^{(2)}), \ldots$, describe theoretical properties of this chain, and show how to implement the methods developed in Chapter 2 to answer questions (A) and (B) posed earlier. We also illustrate our methodology on two data sets. Appendix A contains the proofs of the theorems stated in Chapter 2, Appendix B provides details regarding the generation of our Markov chain and its computational complexity, and Appendix C gives technical details regarding the theoretical properties of the Markov chain.

CHAPTER 2
ESTIMATION OF BAYES FACTORS AND POSTERIOR EXPECTATIONS
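Let $\nu_{h,y}$ denote the posterior density of $\theta$ given $Y = y$ when the prior is $\nu_h$. Suppose we have a sample $\theta_1, \ldots, \theta_n$ (iid or ergodic Markov chain output) from the posterior density $\nu_{h_1,y}$ for a fixed $h_1$ and we are interested in the posterior expectation

$$E_h(f(\theta) \mid Y = y) = \int f(\theta)\, \nu_{h,y}(\theta)\, d\theta$$

for different values of the hyperparameter $h$. We may write $\int f(\theta) \nu_{h,y}(\theta)\, d\theta$ as

$$\int f(\theta)\, \frac{p_\theta(y) \nu_h(\theta)/m_h}{p_\theta(y) \nu_{h_1}(\theta)/m_{h_1}}\, \nu_{h_1,y}(\theta)\, d\theta = \frac{m_{h_1}}{m_h} \int f(\theta)\, \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta \tag{2-1a}$$

$$= \frac{\dfrac{m_{h_1}}{m_h} \displaystyle\int f(\theta)\, \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta}{\dfrac{m_{h_1}}{m_h} \displaystyle\int \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta} \tag{2-1b}$$

$$= \frac{\displaystyle\int f(\theta)\, \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta}{\displaystyle\int \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta} \tag{2-1c}$$

where in (2-1b) we have used the fact that the integral in the denominator is just 1, in order to cancel the unknown constant $m_{h_1}/m_h$ in (2-1c). The idea to express $\int f(\theta) \nu_{h,y}(\theta)\, d\theta$ in this way was proposed in a different context by Hastings (1970). Expression (2-1c) is the ratio of two integrals with respect to $\nu_{h_1,y}$, each of which may be estimated from the sequence $\theta_1, \ldots, \theta_n$. We may estimate the numerator and the denominator by

$$\frac{1}{n} \sum_{i=1}^n f(\theta_i)\, [\nu_h(\theta_i)/\nu_{h_1}(\theta_i)] \quad \text{and} \quad \frac{1}{n} \sum_{i=1}^n [\nu_h(\theta_i)/\nu_{h_1}(\theta_i)], \tag{2-2}$$

respectively. Thus, if we let

$$w_i^{(h)} = \frac{\nu_h(\theta_i)/\nu_{h_1}(\theta_i)}{\sum_{e=1}^n [\nu_h(\theta_e)/\nu_{h_1}(\theta_e)]},$$

then these are weights, and we see that the desired integral may be estimated by the weighted average $\sum_{i=1}^n f(\theta_i)\, w_i^{(h)}$.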
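The reweighting in (2-2) can be sketched on a toy conjugate problem where the exact answers are known. The sketch below assumes, purely for illustration, the model $y \mid \theta \sim N(\theta, 1)$ with prior $\theta \sim N(0, h)$, for which the posterior is $N(yh/(1+h), h/(1+h))$; iid posterior draws stand in for Markov chain output, and all function and variable names are hypothetical.

```python
import random
from math import exp, pi, sqrt

def prior_pdf(theta, h):
    """Density of the toy N(0, h) prior."""
    return exp(-theta * theta / (2.0 * h)) / sqrt(2.0 * pi * h)

def reweighted_estimates(thetas, f, h, h1):
    """Return (estimate of E_h(f | y), estimate of B(h, h1)) from draws under h1."""
    ratios = [prior_pdf(t, h) / prior_pdf(t, h1) for t in thetas]
    numerator = sum(f(t) * r for t, r in zip(thetas, ratios)) / len(thetas)
    denominator = sum(ratios) / len(thetas)  # second average in (2-2)
    return numerator / denominator, denominator

rng = random.Random(0)
y, h1, h = 1.0, 2.0, 1.0
# Draws from the posterior under the prior N(0, h1).
post_mean, post_sd = y * h1 / (1 + h1), sqrt(h1 / (1 + h1))
thetas = [rng.gauss(post_mean, post_sd) for _ in range(20000)]
est_mean, est_bf = reweighted_estimates(thetas, lambda t: t, h, h1)
print(est_mean, est_bf)
```

In this toy setting the exact posterior mean under $h = 1$ is $y/2 = 0.5$, and the exact Bayes factor is the ratio of the $N(0, 2)$ and $N(0, 3)$ densities at $y$, so the two estimates can be checked directly. Only prior densities are evaluated; the likelihood never appears.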

The disappearance of the likelihood function in (2-1a) is very convenient because its computation requires considerable effort in some cases (for example, when we have missing or censored data, the likelihood is a possibly high-dimensional integral). Note that the second average in (2-2) is an estimate of $m_h/m_{h_1}$, i.e. the Bayes factor $B(h, h_1)$. Ideally, we would like to use the estimates in (2-2) for multiple values of $h$ using only a sample from the posterior distribution corresponding to the fixed hyperparameter value $h_1$. But, when the prior $\nu_h$ differs from $\nu_{h_1}$ greatly, the two estimates in (2-2) are unstable because of the potential that only a few observations will dominate the sums. Their ratio suffers the same defect.
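One common way to see this instability numerically is the effective sample size $1/\sum_i w_i^2$ of the normalized weights. This diagnostic is a standard importance sampling device introduced here for illustration; it is not part of the methodology described in this chapter. The sketch reuses the toy assumption $y \mid \theta \sim N(\theta, 1)$, $\theta \sim N(0, h)$, and all names are hypothetical.

```python
import random
from math import exp, sqrt

def normalized_weights(thetas, h, h1):
    """Weights proportional to nu_h / nu_h1 for N(0, h) priors; constants cancel."""
    raw = [exp(-t * t / (2.0 * h) + t * t / (2.0 * h1)) for t in thetas]
    total = sum(raw)
    return [r / total for r in raw]

def effective_sample_size(weights):
    """Equals n for uniform weights and approaches 1 when one weight dominates."""
    return 1.0 / sum(w * w for w in weights)

rng = random.Random(1)
y, h1 = 1.0, 1.0
# iid draws from the toy posterior under h1, standing in for MCMC output.
thetas = [rng.gauss(y * h1 / (1 + h1), sqrt(h1 / (1 + h1))) for _ in range(5000)]
ess_near = effective_sample_size(normalized_weights(thetas, 1.1, h1))
ess_far = effective_sample_size(normalized_weights(thetas, 50.0, h1))
print(ess_near, ess_far)
```

When $h$ is close to $h_1$ the weights are nearly uniform, whereas for a distant $h$ a handful of draws in the tails carry most of the weight.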
A natural approach for dealing with the instability of these simple estimates is to choose $k$ hyperparameter values $h_1, \ldots, h_k \in \mathcal{H}$ and to replace $\nu_{h_1}$ with a mixture $\sum_{s=1}^k a_s \nu_{h_s}$, where $a_s \ge 0$, for $s = 1, \ldots, k$, and $\sum_{s=1}^k a_s = 1$. For concreteness, consider the estimate of the Bayes factor. To estimate $B(h, h_1)$ using a sample $\theta_1, \ldots, \theta_n$ (iid or ergodic Markov chain output) from the posterior mixture $\nu_{\cdot,y} := \sum_{s=1}^k a_s \nu_{h_s,y}$, we are tempted to write

$$\frac{1}{n} \sum_{i=1}^n \frac{\nu_h(\theta_i)}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i)} = \frac{m_h}{m_{h_1}}\, \frac{1}{n} \sum_{i=1}^n \frac{p_{\theta_i}(y) \nu_h(\theta_i)/m_h}{\sum_{s=1}^k a_s\, p_{\theta_i}(y) \nu_{h_s}(\theta_i)/m_{h_1}} \tag{2-3a}$$

$$= B(h, h_1)\, \frac{1}{n} \sum_{i=1}^n \frac{\nu_{h,y}(\theta_i)}{\sum_{s=1}^k a_s \nu_{h_s,y}(\theta_i)\, d_s} \tag{2-3b}$$

where $d_s = m_{h_s}/m_{h_1}$, $s = 1, \ldots, k$. Thus, in order to have $\nu_{\cdot,y}$ in the denominator in (2-3b) (which would imply that the average in (2-3b) converges to 1, so that (2-3b) converges to $B(h, h_1)$), we need to start out with $\sum_{s=1}^k a_s \nu_{h_s}/d_s$ in the denominator of the left side of (2-3a). Unfortunately, this requires the condition that we know the vector $d = (d_2, \ldots, d_k)'$. Under this condition, if $\theta_1, \ldots, \theta_n$ are drawn from the mixture $\nu_{\cdot,y}$, instead of from $\nu_{h_1,y}$, we may form

$$\frac{1}{n} \sum_{i=1}^n \frac{\nu_h(\theta_i)}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i)/d_s} \tag{2-4}$$

and this quantity converges to

$$B(h, h_1) \int \frac{\nu_{h,y}(\theta)}{\nu_{\cdot,y}(\theta)}\, \nu_{\cdot,y}(\theta)\, d\theta = B(h, h_1).$$

Assuming that for each $l = 1, \ldots, k$ we have samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, from the posterior density $\nu_{h_l,y}$, then for $a_s = n_s/n$, the estimate in (2-4) can be written as

$$\hat B(h, h_1, d) = \sum_{l=1}^k \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^k n_s \nu_{h_s}(\theta_i^{(l)})/d_s} \tag{2-5}$$

(Note that the combined samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, $l = 1, \ldots, k$, form a stratified sample from the mixture distribution $\nu_{\cdot,y}$.) Doss (2010) shows that under certain regularity conditions the estimate (2-5) is consistent and asymptotically normal.
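A minimal sketch of the estimate (2-5) follows, again on the toy conjugate model $y \mid \theta \sim N(\theta, 1)$, $\theta \sim N(0, h)$, which is an assumption made here only because it makes $d$ and the exact Bayes factor available in closed form. The skeleton $\{1, 4\}$, the sample sizes, and all names are hypothetical, and iid posterior draws stand in for Markov chain output.

```python
import random
from math import exp, pi, sqrt

def prior_pdf(theta, h):
    """Density of the toy N(0, h) prior."""
    return exp(-theta * theta / (2.0 * h)) / sqrt(2.0 * pi * h)

def marginal(y, h):
    """Toy marginal likelihood: y ~ N(0, 1 + h)."""
    return exp(-y * y / (2.0 * (1.0 + h))) / sqrt(2.0 * pi * (1.0 + h))

def bayes_factor_estimate(samples, skeleton, d, h):
    """Estimate (2-5); samples[l] holds draws from the posterior at skeleton[l], d[0] = 1."""
    sizes = [len(s) for s in samples]
    total = 0.0
    for draws in samples:
        for t in draws:
            mix = sum(n_s * prior_pdf(t, h_s) / d_s
                      for n_s, h_s, d_s in zip(sizes, skeleton, d))
            total += prior_pdf(t, h) / mix
    return total

rng = random.Random(2)
y, skeleton = 1.0, [1.0, 4.0]
# d is known exactly here only because the toy model is conjugate.
d = [marginal(y, h_s) / marginal(y, skeleton[0]) for h_s in skeleton]
samples = [[rng.gauss(y * h_s / (1 + h_s), sqrt(h_s / (1 + h_s))) for _ in range(5000)]
           for h_s in skeleton]
h = 2.0
estimate = bayes_factor_estimate(samples, skeleton, d, h)
exact = marginal(y, h) / marginal(y, skeleton[0])
print(estimate, exact)
```

The same pooled samples can be reused for every $h$ on a grid, at a cost that is linear in the total sample size per value of $h$.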
In virtually all applications, the value of the vector $d$ is unknown and has to be estimated. Doss (2010) does not deal with the case where $d$ is unknown. In this chapter, we assume that $d$ is estimated via preliminary MCMC runs generated independently of the runs subsequently used to estimate $B(h, h_1)$. Hence the sampling will consist of the following two stages.

Stage 1 Generate samples $\theta_i^{(l)}$, $i = 1, \ldots, N_l$, from $\nu_{h_l,y}$, the posterior density of $\theta$ given $Y = y$, assuming that the prior is $\nu_{h_l}$, for each $l = 1, \ldots, k$, and use these $N = \sum_{l=1}^k N_l$ observations to form an estimate of $d$.

Stage 2 Independently of Stage 1, again generate samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, from $\nu_{h_l,y}$, for each $l = 1, \ldots, k$, and construct the estimate of the Bayes factor $B(h, h_1)$ based on this second set of $n = \sum_{l=1}^k n_l$ observations and the estimate of $d$ from Stage 1.

From now on, for $l = 1, \ldots, k$, we use the notations $A_l$ and $a_l$ to identify the ratios $N_l/N$ and $n_l/n$, respectively.
It is natural to ask why it is necessary to have two stages of sampling, instead of estimating the vector $d$ and $B(h, h_1)$ from a single sample. The reason is that we are interested in estimating Bayes factors and posterior expectations for a very large number of values of $h$, and for each $h$, the computational time needed is linear in the total sample size. This fact limits the total sample size and hence the accuracy of the estimates. An increase in accuracy can be achieved essentially for free by estimating $d$ from long preliminary runs in Stage 1. The cost incurred in Stage 1 is minimal because generating the chains is typically extremely fast, and has to be done only once.

Doss (2010) also developed an improvement of (2-5) that is based on control variates, and showed that this improvement is also consistent and asymptotically normal. Unfortunately, both of these estimates require us to know the vector $d$ exactly. One may be tempted to believe that using an estimated $d$ instead of the true $d$ will not inflate the asymptotic variance; indeed, the literature has errors regarding this point, and this is discussed in Appendix A. Here we provide a careful analysis of the increase in the asymptotic variance that results when we use an estimate of $d$. A more detailed summary of the main contributions of the present work is as follows.

1. We develop a complete characterization of the asymptotic distribution of both the estimate (2-5) and the improvement that uses control variates for the realistic case where $d$ is estimated from Stage 1 sampling (Theorems 1 and 2).

2. We develop an analogous theory for the problem of estimating a family of posterior expectations $E_h(f(\theta) \mid Y = y)$, $h \in \mathcal{H}$ (Theorems 3, 4, and 6).

3. We discuss estimation of the variance, and show how variance estimates can be used to guide selection of the skeleton points $h_1, \ldots, h_k$.

4. We apply the methodology to the problem of Bayesian variable selection discussed earlier. In particular, we show how our methods enable us to select good values of $h = (w, g)$ and to also see how the probability that a given variable is included in the regression varies with $(w, g)$.

2.1 Estimation of Bayes Factors

Here, we analyze the asymptotic distributional properties of the estimator that results if in (2-5) we replace $d$ with an estimate. Geyer (1994) proposes an estimator for $d$ based on the "reverse logistic regression" method and Theorem 2 therein shows that this estimator is asymptotically normal when the samplers used satisfy certain regularity conditions. This estimator is obtained by maximizing with respect to $d_2, \ldots, d_k$ the log quasi-likelihood

$$\ell_N(d) = \sum_{l=1}^k \sum_{i=1}^{N_l} \log \left( \frac{A_l \nu_{h_l}(\theta_i^{(l)})/d_l}{\sum_{s=1}^k A_s \nu_{h_s}(\theta_i^{(l)})/d_s} \right) \tag{2-6}$$

The estimate is the same as the estimates obtained by Gill et al. (1988), Meng and Wong (1996), and Kong et al. (2003). We assume that for all the Markov chains we use a Strong Law of Large Numbers (SLLN) holds for all integrable functions [for sufficient conditions see, e.g., Theorem 2 of Athreya et al. (1996)]. In the next theorem, we show that if $\hat d$ is the estimate produced by Geyer's (1994) method, or any of the equivalent estimates discussed above, then the estimate of the Bayes factor given by

$$\hat B(h, h_1, \hat d) = \sum_{l=1}^k \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^k n_s \nu_{h_s}(\theta_i^{(l)})/\hat d_s} \tag{2-7}$$

is asymptotically normal if certain regularity conditions are met. In (2-7), $\hat d_1 = 1$.
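The two-stage scheme can be sketched as follows. Setting the derivative of (2-6) with respect to $1/d_r$ to zero gives the self-consistency equations $d_r = \sum_{l,i} \nu_{h_r}(\theta_i^{(l)}) / \sum_s N_s \nu_{h_s}(\theta_i^{(l)})/d_s$, and the sketch solves them by simple fixed-point iteration with $d_1$ normalized to 1. This solver is an assumption made for illustration, not necessarily how Geyer's estimate is computed in practice. The toy model ($y \mid \theta \sim N(\theta, 1)$, $\theta \sim N(0, h)$), the skeleton, the sample sizes, and all names are likewise hypothetical.

```python
import random
from math import exp, pi, sqrt

def prior_pdf(theta, h):
    """Density of the toy N(0, h) prior."""
    return exp(-theta * theta / (2.0 * h)) / sqrt(2.0 * pi * h)

def draw_posterior(rng, y, h, size):
    """iid draws from the toy posterior N(y*h/(1+h), h/(1+h)); stand-in for MCMC output."""
    return [rng.gauss(y * h / (1 + h), sqrt(h / (1 + h))) for _ in range(size)]

def estimate_d(samples, skeleton, iterations=50):
    """Fixed-point iteration for the stationarity equations of (2-6), with d[0] = 1."""
    sizes = [len(s) for s in samples]
    dens = [[prior_pdf(t, h_s) for h_s in skeleton] for draws in samples for t in draws]
    d = [1.0] * len(skeleton)
    for _ in range(iterations):
        new_d = [0.0] * len(skeleton)
        for row in dens:
            mix = sum(n_s * v / d_s for n_s, v, d_s in zip(sizes, row, d))
            for r, v in enumerate(row):
                new_d[r] += v / mix
        d = [x / new_d[0] for x in new_d]  # the equations are scale invariant
    return d

def plug_in_bayes_factor(samples, skeleton, d_hat, h):
    """Estimate (2-7): the form of (2-5) with d replaced by d_hat."""
    sizes = [len(s) for s in samples]
    total = 0.0
    for draws in samples:
        for t in draws:
            mix = sum(n_s * prior_pdf(t, h_s) / d_s
                      for n_s, h_s, d_s in zip(sizes, skeleton, d_hat))
            total += prior_pdf(t, h) / mix
    return total

rng = random.Random(3)
y, skeleton = 1.0, [1.0, 4.0]
stage1 = [draw_posterior(rng, y, h_s, 2000) for h_s in skeleton]  # Stage 1: estimate d
d_hat = estimate_d(stage1, skeleton)
stage2 = [draw_posterior(rng, y, h_s, 1000) for h_s in skeleton]  # Stage 2: independent runs
b_hat = plug_in_bayes_factor(stage2, skeleton, d_hat, h=2.0)
print(d_hat, b_hat)
```

In this toy setting the exact values are $d_2 = m_4/m_1 \approx 0.735$ and $B(2, 1) \approx 0.887$, which gives a direct check on both stages.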
Theorem 1 Suppose the chains in Stage 2 satisfy conditions A1 and A2 in Doss (2010):

A1 For each $l = 1, \ldots, k$, the chain $\{\theta_i^{(l)}\}_{i=1}^\infty$ is geometrically ergodic.

A2 For each $l = 1, \ldots, k$, there exists $\epsilon > 0$ such that

$$E\left( \left| \frac{\nu_h(\theta_1^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_1^{(l)})/d_s} \right|^{2+\epsilon} \right) < \infty.$$

Assume also that the chains in Stage 1 satisfy the conditions in Theorem 2 of Geyer (1994) that imply $\sqrt{N}(\hat d - d) \xrightarrow{d} \mathcal{N}(0, \Sigma)$. In addition, suppose the total sample sizes for the two stages, $N$ and $n$, are chosen such that $n/N \to q \in [0, \infty)$. Then

$$\sqrt{n}\bigl(\hat B(h, h_1, \hat d) - B(h, h_1)\bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\ q\, c(h)' \Sigma\, c(h) + \tau^2(h)\bigr),$$

where $c(h)$ and $\tau^2(h)$ are given in equation (A.3) in the Appendix and equation (A.9) in Doss (2010), respectively.
Remarks

1. There are two components to the expression for the variance. The first component arises from estimating $d$, and the second component is the variance that we would have if we had estimated the Bayes factor knowing what $d$ is. As can be seen from the formula, the first component vanishes if $q = 0$, i.e., if the sample size for estimating the parameter $d$ converges to infinity at a faster rate than does the sample size used to estimate the Bayes factor. In this case the Bayes factor estimator (2-7) using the estimate $\hat d$ has the same asymptotic distribution as the estimator in (2-5) which uses the true value of $d$. Otherwise, the variance of (2-7) is greater than that of (2-5), and the difference between the variances depends on the magnitude of $q$.

2. This theorem assumes the sampling is done in two independent stages: Stage 1 samples to estimate $d$, and Stage 2 samples used together with $\hat d$ to estimate the Bayes factor $B(h, h_1)$. As a byproduct of our approach, we can get a similar theorem for the situation where both $d$ and $B(h, h_1)$ are estimated from the same sample. (However, for the reasons discussed earlier, ordinarily we would not use a single sample.) In more detail, if we impose the same conditions as in the theorem above on samples of total size $n$ from a single stage, except for the condition that $n/N \to q \in [0, \infty)$, then

$$\sqrt{n}\bigl(\hat B(h, h_1, \hat d) - B(h, h_1)\bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\ c(h)' \Sigma\, c(h) + \tau^2(h) + 2 c(h)' E' z\bigr),$$

where $\Sigma$ denotes, as in the statement of Theorem 1, the asymptotic variance of $\sqrt{n}(\hat d - d)$, $E$ is the matrix given in equation (A.51), and $z$ is the column vector given in (A.49). A proof of this result is given in Appendix A.

2.2 Estimation of Bayes Factors Using Control Variates

Recall that we have samples 0'), i = 1,..., n, from Vhi,y, I = 1, ..., k, with
independence across samples (Stage 2 of sampling) and that, based on an independent

set of preliminary MCMC runs (Stage 1 of sampling), we have estimated the constants

d2, ..., dk. Also, ni/n = al and n = k n1 n1. Let

Y(O) V= ) (2-8)
Es= asVhh (0)/ds

Recalling that v.y := C= asvhs,y, we have E,,(Y(O)) = B(h, hi), where the subscript v.y
to the expectation indicates that 0 vy. Also, forj = 2,..., k, let

) ()dJ -h() (2-9)
zs)) = asVhS(0)/ds

VhjIY ) a 'hS,,y(0)
S ,y( (2-10)
Es=1 ash,,y( )
Expression (2-10) shows that E,y(ZW)(0)) = 0. This is true even if the priors Vh,
and Vh, are improper, as long as the posteriors vh,y and Vh,,y are proper, exactly our
situation in the Bayesian variable selection example of Chapter 1. On the other hand,
the representation (2-9) shows that ZO)(0) is computable if we know the d,'s-it involves
the priors and not the posteriors. (A similar remark applies to (2-8).) Therefore, if as in
Doss (2010) we define for 1, ..., k, i = 1,..., n1

Vh(l) ) ( 1) ) h (1))dj- Vh (0))
Yi, k=Z s ak j ()/d j 2. k
Yk=1 as h, ( 1()) ds Y: i asVh,(s (1) Ids
(2-11)
then for any fixed 3 = (/32,... ,/3)
Sk nk
Y (Y,, = A=2 /Z, ), (2-12)
/=1 ;=1

is an unbiased estimate of $B(h, h_1)$. The value of $\beta$ that minimizes the variance of $\hat I_\beta$ is unknown. As is commonly done when one uses control variates, we use instead the estimate obtained by doing ordinary linear regression of the response $Y_{i,l}$ on the predictors $Z^{(j)}_{i,l}$, $j = 2, \ldots, k$, and to emphasize that this estimate depends on $d$, we denote it by $\hat\beta(d)$. Theorem 1 of Doss (2010) states that the estimator $\hat B_{\mathrm{reg}}(h, h_1) = \hat I_{\hat\beta(d)}$, obtained under the assumption that we know the constants $d_2, \ldots, d_k$, has an asymptotically normal distribution. As mentioned earlier, $d_2, \ldots, d_k$ are typically unknown and must be estimated. Let $\hat d_2, \ldots, \hat d_k$ be estimates obtained from previous MCMC runs and let

\[ \hat I^{\hat d}_{\hat\beta(\hat d)} = \frac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} \Bigl( \hat Y_{i,l} - \sum_{j=2}^k \hat\beta_j(\hat d)\, \hat Z^{(j)}_{i,l} \Bigr), \tag{2-13} \]

where $\hat Y_{i,l}$ and $\hat Z^{(j)}_{i,l}$ are as in (2-11), except using $\hat d$ for $d$, and $\hat\beta(\hat d)$ is the least squares regression estimator from regressing $\hat Y_{i,l}$ on the predictors $\hat Z^{(j)}_{i,l}$, $j = 2, \ldots, k$. The next theorem gives the asymptotic distribution of this new estimator.
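
The regression-adjusted estimate of Section 2.2 can be sketched numerically as follows. This is only an illustrative sketch under loud assumptions: the toy model (a normal likelihood with N(0, h) priors), the use of independent draws in place of MCMC output, and all function names (`normal_pdf`, `toy_problem`, `bayes_factor_cv`) are hypothetical and are not taken from the thesis. The code forms the responses and control variates of (2-11), and uses the fact that the intercept of the least squares fit equals the average of the responses minus the fitted control-variate correction.

```python
import numpy as np

def normal_pdf(t, var):
    """Density of N(0, var) evaluated at t."""
    return np.exp(-t ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def bayes_factor_cv(nu_h, nu_skel, samples, d):
    """Return (plain, regression-adjusted) estimates of B(h, h_1).

    nu_h    : callable, prior density at the target hyperparameter h
    nu_skel : list of k callables, prior densities at h_1, ..., h_k
    samples : list of k 1-D arrays, draws from the k skeleton posteriors
    d       : length-k array of ratios m_{h_s} / m_{h_1}, with d[0] == 1
    """
    d = np.asarray(d, dtype=float)
    theta = np.concatenate(samples)
    n = theta.size
    a = np.array([len(s) for s in samples]) / n           # a_l = n_l / n
    dens = np.stack([nu(theta) for nu in nu_skel])        # shape (k, n)
    denom = (a[:, None] * dens / d[:, None]).sum(axis=0)  # sum_s a_s nu_{h_s}/d_s
    Y = nu_h(theta) / denom                               # responses, as in (2-11)
    Z = (dens[1:] / d[1:, None] - dens[0]) / denom        # control variates, j = 2..k
    X = np.column_stack([np.ones(n), Z.T])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y.mean(), coef[0]                              # intercept = adjusted estimate

def toy_problem(n_per_chain=5000, y=1.0, skeleton=(0.5, 2.0, 8.0), seed=0):
    """Hypothetical toy model: y | theta ~ N(theta, 1), theta ~ N(0, h)."""
    rng = np.random.default_rng(seed)
    hs = np.array(skeleton)
    # Exact posterior draws stand in for MCMC output (an assumption).
    samples = [rng.normal(h * y / (1 + h), np.sqrt(h / (1 + h)), n_per_chain)
               for h in hs]
    marg = normal_pdf(y, 1.0 + hs)                        # m_h = N(y; 0, 1 + h)
    nu_skel = [lambda t, h=h: normal_pdf(t, h) for h in hs]
    return nu_skel, samples, marg / marg[0]

if __name__ == "__main__":
    nu_skel, samples, d = toy_problem()
    plain, adjusted = bayes_factor_cv(lambda t: normal_pdf(t, 3.0), nu_skel, samples, d)
    print(plain, adjusted, normal_pdf(1.0, 4.0) / normal_pdf(1.0, 1.5))
```

In this toy setting, choosing the target prior equal to one of the skeleton priors makes the response an exact linear function of the control variates, so the adjusted estimate reproduces the corresponding $d_j$ up to rounding error.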
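Theorem 2 Suppose all the conditions from Theorem 1 are satisfied. Moreover, assume that $R$, the $k \times k$ matrix defined by

\[ R_{j,j'} = E\Bigl( \sum_{l=1}^k a_l Z^{(j)}_{1,l} Z^{(j')}_{1,l} \Bigr), \quad j, j' = 1, \ldots, k, \]

is nonsingular. Then

\[ \sqrt{n}\,\bigl( \hat I^{\hat d}_{\hat\beta(\hat d)} - B(h, h_1) \bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\; q\,w(h)'\Sigma w(h) + \sigma^2(h)\bigr). \]

Expressions for $w(h)$ and $\sigma^2(h)$ are given in equation (A.18) to follow and equation (A.7) in Doss (2010), respectively.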
2.3 Estimation of Posterior Expectations
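In this section we give a method for estimating the posterior expectation of a function $f$ when the prior is $\nu_h$. Let us denote this quantity by

\[ I^{[f]}(h) = \int f(\theta)\, \nu_{h,y}(\theta)\, d\theta. \]

Define

\[ Y^{[f]}_{i,l} = \frac{f(\theta_i^{(l)})\, \nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/d_s} = \frac{f(\theta_i^{(l)})\, \nu_h(\theta_i^{(l)})/m_h}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/m_{h_s}} \cdot \frac{m_h}{m_{h_1}} = \frac{f(\theta_i^{(l)})\, \nu_{h,y}(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s,y}(\theta_i^{(l)})} \cdot B(h, h_1). \]

Assuming a SLLN holds for the Markov chains $\{\theta_i^{(l)}\}_{i=1}^{n_l}$, $l = 1, \ldots, k$, we have

\[ \frac{1}{n_l} \sum_{i=1}^{n_l} Y^{[f]}_{i,l} \xrightarrow{a.s.} \int \frac{f(\theta)\, \nu_{h,y}(\theta)}{\sum_{s=1}^k a_s \nu_{h_s,y}(\theta)}\, \nu_{h_l,y}(\theta)\, d\theta \cdot B(h, h_1). \]

Therefore,

\[ \frac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} Y^{[f]}_{i,l} = \sum_{l=1}^k a_l \frac{1}{n_l} \sum_{i=1}^{n_l} Y^{[f]}_{i,l} \xrightarrow{a.s.} \int \frac{f(\theta)\, \nu_{h,y}(\theta)}{\sum_{s=1}^k a_s \nu_{h_s,y}(\theta)} \sum_{l=1}^k a_l \nu_{h_l,y}(\theta)\, d\theta \cdot B(h, h_1) = I^{[f]}(h)\, B(h, h_1). \tag{2-14} \]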

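Similarly, we have

\[ \frac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} Y_{i,l} \xrightarrow{a.s.} B(h, h_1) \]

(the $Y_{i,l}$'s are defined in (2-11)). Note that $Y_{i,l} = Y^{[f]}_{i,l}$ when $f \equiv 1$. Letting

\[ \hat I^{[f]}(h, d) = \frac{\sum_{l=1}^k \sum_{i=1}^{n_l} Y^{[f]}_{i,l}}{\sum_{l=1}^k \sum_{i=1}^{n_l} Y_{i,l}}, \tag{2-15} \]

we see that $\hat I^{[f]}(h, d) \xrightarrow{a.s.} I^{[f]}(h)$. Replacing the unknown $d$ with an estimate $\hat d$ obtained from Stage 1 sampling, we form the estimator

\[ \hat I^{[f]}(h, \hat d) = \frac{\displaystyle \sum_{l=1}^k \sum_{i=1}^{n_l} \frac{f(\theta_i^{(l)})\, \nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/\hat d_s}}{\displaystyle \sum_{l=1}^k \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/\hat d_s}}. \tag{2-16} \]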
It is the asymptotic behavior of this estimator that we are concerned with in the following theorem.

Theorem 3 Suppose the conditions stated in Theorem 1 are satisfied and, in addition, for each $l = 1, \ldots, k$, there exists an $\epsilon > 0$ such that

\[ E\bigl( |Y^{[f]}_{1,l}|^{2+\epsilon} \bigr) < \infty. \tag{2-17} \]

Then

\[ \sqrt{n}\,\bigl( \hat I^{[f]}(h, \hat d) - I^{[f]}(h) \bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\; q\,v(h)'\Sigma v(h) + \rho(h)\bigr). \]

Expressions for $v(h)$ and $\rho(h)$ are given in equations (A.22) and (A.21), respectively, in the Appendix.
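
As an illustration of the ratio estimator (2-16), here is a minimal sketch. It makes several assumptions that are not part of the thesis: the toy normal model, the use of exact posterior draws rather than MCMC output, the use of the true $d$ as a stand-in for the Stage 1 estimate $\hat d$, and the hypothetical names `normal_pdf` and `posterior_expectation`.

```python
import numpy as np

def normal_pdf(t, var):
    """Density of N(0, var) evaluated at t."""
    return np.exp(-t ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def posterior_expectation(f, nu_h, nu_skel, samples, d_hat):
    """Ratio estimator of the posterior expectation of f under prior nu_h.

    f       : callable applied to the pooled draws
    nu_h    : callable, prior density at the target hyperparameter h
    nu_skel : list of k callables, prior densities at the skeleton points
    samples : list of k 1-D arrays, draws from the k skeleton posteriors
    d_hat   : length-k array of (estimated) ratios m_{h_s} / m_{h_1}
    """
    d_hat = np.asarray(d_hat, dtype=float)
    theta = np.concatenate(samples)
    a = np.array([len(s) for s in samples]) / theta.size
    dens = np.stack([nu(theta) for nu in nu_skel])
    denom = (a[:, None] * dens / d_hat[:, None]).sum(axis=0)
    w = nu_h(theta) / denom            # importance weights relative to the mixture
    return np.sum(f(theta) * w) / np.sum(w)

if __name__ == "__main__":
    # Hypothetical toy model: y | theta ~ N(theta, 1), theta ~ N(0, h), y = 1.
    rng = np.random.default_rng(0)
    y, hs = 1.0, np.array([0.5, 2.0, 8.0])
    samples = [rng.normal(h * y / (1 + h), np.sqrt(h / (1 + h)), 5000) for h in hs]
    marg = normal_pdf(y, 1.0 + hs)
    nu_skel = [lambda t, h=h: normal_pdf(t, h) for h in hs]
    est = posterior_expectation(lambda t: t, lambda t: normal_pdf(t, 3.0),
                                nu_skel, samples, marg / marg[0])
    print(est, 3.0 * y / 4.0)          # exact posterior mean is h*y/(1+h)
```

Because the same weights appear in the numerator and the denominator, rescaling every component of `d_hat` by a common constant leaves the estimate unchanged.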
2.4 Estimation of Posterior Expectations Using Control Variates
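We assume in this section that the values of the vector $d$ and the posterior expectations $I^{[f]}(h_j)$, for $j = 1, \ldots, k$, are available to us. In reality, these quantities are seldom known, and the next section deals with the case when they are estimated based on previous MCMC runs. Recall that the integral we want to estimate is $I^{[f]}(h) = \int f(\theta)\, \nu_{h,y}(\theta)\, d\theta$. In (2-14) we established that $(1/n) \sum_{l=1}^k \sum_{i=1}^{n_l} Y^{[f]}_{i,l}$ is a strongly consistent estimator of $I^{[f]}(h)\, B(h, h_1)$. Define

\[ Z^{[f](j)}_{i,l} = \frac{f(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})/d_j}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/d_s} - I^{[f]}(h_j), \quad j = 1, \ldots, k, \]

and let

\[ Z^{[f](j)}(\theta) = \frac{f(\theta)\, \nu_{h_j}(\theta)/d_j}{\sum_{s=1}^k a_s \nu_{h_s}(\theta)/d_s} - I^{[f]}(h_j), \quad j = 1, \ldots, k. \]

With $\nu_{\cdot,y}$ denoting the mixture distribution $\sum_{s=1}^k a_s \nu_{h_s,y}$, it can be easily checked that

\[ E_{\nu_{\cdot,y}}\bigl( Z^{[f](j)}(\theta) \bigr) = 0, \quad \text{for } j = 1, \ldots, k, \]

so we can use the $Z^{[f](j)}$'s as control variates to reduce the variance of the original estimator $(1/n) \sum_{l=1}^k \sum_{i=1}^{n_l} Y^{[f]}_{i,l}$. Doing so gives the estimator

\[ \frac{1}{n} \sum_{l=1}^k \sum_{i=1}^{n_l} \Bigl( Y^{[f]}_{i,l} - \sum_{j=1}^k \hat\beta^{[f]}_j Z^{[f](j)}_{i,l} \Bigr), \]

where the $\hat\beta^{[f]}_j$'s denote the least squares estimates resulting from the regression of $Y^{[f]}_{i,l}$ on the predictors $Z^{[f](j)}_{i,l}$. The Bayes factor $B(h, h_1)$ will be estimated as before in Section 2.2, using the estimator $(1/n) \sum_{l=1}^k \sum_{i=1}^{n_l} Y_{i,l}$ and the $k - 1$ control variates $Z^{(j)}_{i,l}$, for $j = 2, \ldots, k$. The ratio of these two control variate adjusted estimators provides us with an improved estimator for the posterior expectation $I^{[f]}(h)$, which is given by

\[ \hat I^{[f]}_{\hat\beta^{[f]}, \hat\beta} = \frac{\sum_{l=1}^k \sum_{i=1}^{n_l} \bigl( Y^{[f]}_{i,l} - \sum_{j=1}^k \hat\beta^{[f]}_j Z^{[f](j)}_{i,l} \bigr)}{\sum_{l=1}^k \sum_{i=1}^{n_l} \bigl( Y_{i,l} - \sum_{j=2}^k \hat\beta_j Z^{(j)}_{i,l} \bigr)}. \tag{2-18} \]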

Theorem 4 Suppose conditions A1 and A2 stated in Theorem 1 are satisfied and the matrix $R$ defined in Theorem 2 is nonsingular. Also, suppose that

A3 for each $l = 1, \ldots, k$, there exists $\epsilon > 0$ such that $E\bigl( |Y^{[f]}_{1,l}|^{2+\epsilon} \bigr) < \infty$;
A4 for each $l = 1, \ldots, k$, there exists $\epsilon > 0$ such that $E_{\nu_{h_l,y}}\bigl( |f|^{2+\epsilon}(\theta) \bigr) < \infty$;
A5 for each $l = 1, \ldots, k$,

Eh f2(0) =Vh ( -
E"sY1 as, vh, (0)/ds
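A6 the $(k + 1) \times (k + 1)$ matrix $R^{[f]}$ defined by

\[ R^{[f]}_{j+1,\, j'+1} = E\Bigl( \sum_{l=1}^k a_l Z^{[f](j)}_{1,l} Z^{[f](j')}_{1,l} \Bigr), \quad j, j' = 0, \ldots, k, \]

is nonsingular.

Then

\[ \sqrt{n}\,\bigl( \hat I^{[f]}_{\hat\beta^{[f]}, \hat\beta} - I^{[f]}(h) \bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\; r(h)\bigr), \]

where $r(h)$ is given in equation (A.33) of the Appendix.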
Remarks

1. If $h = h_j$ for some $j \in \{1, \ldots, k\}$, our estimator of the posterior expectation given above in (2-18) has zero variance. To see why, note that in this case the response $Y^{[f]}_{i,l}$ can be written as

\[ Y^{[f]}_{i,l} = d_j I^{[f]}(h_j) + d_j Z^{[f](j)}_{i,l}, \]

so there is no noise in the regression of $Y^{[f]}_{i,l}$ on the predictors $Z^{[f](j)}_{i,l}$, and as a consequence, the numerator of this estimator is constant (specifically, $n d_j I^{[f]}(h_j)$). Through similar arguments, the denominator was shown to be constant ($n d_j$) in Doss (2010). Hence, for $h = h_j$, the estimator (2-18) is a perfect estimator of $I^{[f]}(h_j)$.

2. Theorem 4 pertains to the case where $d$ and the posterior expectations $I^{[f]}(h_j)$, $j = 1, \ldots, k$, are known. There do exist some situations where this is the case. For example, in the hierarchical model for Bayesian linear regression discussed in Chapter 1, for each $j = 1, \ldots, k$, the marginal likelihood $m_{h_j}$ is a sum of $2^q$ terms, and if $q$ is relatively small, these marginal likelihoods are computable, so the vector $d$ is available. Likewise, for some functions $f$, the posterior expectations $I^{[f]}(h_j)$ can be numerically obtained; see Section 3 of George and Foster (2000). So it is possible to calculate $d$ and $I^{[f]}(h_j)$, $j = 1, \ldots, k$, for skeleton points $h_1, \ldots, h_k$, and the method described in this section enables us to efficiently estimate the family $I^{[f]}(h)$, $h \in \mathcal{H}$.
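
The control variate adjusted ratio of Section 2.4, and the zero-variance property noted in Remark 1, can be illustrated with the following sketch. Everything specific here is an assumption made for illustration only: the toy normal model (in which $d$ and the skeleton expectations are available in closed form), the use of exact posterior draws in place of MCMC output, and the hypothetical names `intercept` and `posterior_expectation_cv`.

```python
import numpy as np

def normal_pdf(t, var):
    """Density of N(0, var) evaluated at t."""
    return np.exp(-t ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def intercept(y, Z):
    """Intercept of the least squares fit of y on the columns of Z.

    It equals mean(y) minus the fitted control-variate correction.
    """
    X = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0]

def posterior_expectation_cv(f, nu_h, nu_skel, samples, d, e):
    """Ratio of two control variate adjusted averages, in the spirit of (2-18).

    d : known ratios m_{h_s} / m_{h_1};  e : known skeleton expectations.
    """
    d, e = np.asarray(d, float), np.asarray(e, float)
    theta = np.concatenate(samples)
    a = np.array([len(s) for s in samples]) / theta.size
    U = np.stack([nu(theta) for nu in nu_skel]) / d[:, None]   # nu_{h_s} / d_s
    denom = (a[:, None] * U).sum(axis=0)
    fv = f(theta)
    Y = nu_h(theta) / denom                    # responses for the Bayes factor
    Yf = fv * Y                                # responses for the numerator
    Z = ((U[1:] - U[0]) / denom).T             # k - 1 control variates (d_1 = 1)
    Zf = (fv * U / denom - e[:, None]).T       # k control variates involving f
    return intercept(Yf, Zf) / intercept(Y, Z)

if __name__ == "__main__":
    # Hypothetical toy model: y | theta ~ N(theta, 1), theta ~ N(0, h), y = 1.
    rng = np.random.default_rng(0)
    y, hs = 1.0, np.array([0.5, 2.0, 8.0])
    samples = [rng.normal(h * y / (1 + h), np.sqrt(h / (1 + h)), 5000) for h in hs]
    marg = normal_pdf(y, 1.0 + hs)
    nu_skel = [lambda t, h=h: normal_pdf(t, h) for h in hs]
    d, e = marg / marg[0], hs * y / (1 + hs)
    print(posterior_expectation_cv(lambda t: t, lambda t: normal_pdf(t, 3.0),
                                   nu_skel, samples, d, e))
    # At a skeleton point the estimate reproduces e[1] up to rounding error.
    print(posterior_expectation_cv(lambda t: t, nu_skel[1], nu_skel, samples, d, e), e[1])
```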
2.5 Estimation of Posterior Expectations Using Control Variates With Estimated
Skeleton Bayes Factors and Expectations
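This section undertakes estimation of the posterior expectation $I^{[f]}(h) = \int f(\theta)\, \nu_{h,y}(\theta)\, d\theta$ in the case where the quantities $d$ and $I^{[f]}(h_j)$'s are unknown and estimated based on previous MCMC runs. Let

\[ e = \bigl( I^{[f]}(h_1), \ldots, I^{[f]}(h_k) \bigr)' \quad \text{and} \quad \hat e = \Bigl( \frac{1}{N_1} \sum_{i=1}^{N_1} f(\theta_i^{(1)}), \ldots, \frac{1}{N_k} \sum_{i=1}^{N_k} f(\theta_i^{(k)}) \Bigr)', \]

i.e. $e$ is the vector of true expectations, which had been assumed known in Theorem 4, and $\hat e$ is its natural estimate based on the samples in Stage 1. To account for the fact that the responses $Y^{[f]}_{i,l}$ and covariates $Z^{[f](j)}_{i,l}$ used in the previous section are now unknown, as they involve the unknown $d$ and $e$, we need to consider new responses and covariates based on estimates $\hat d$ and $\hat e$ obtained from Stage 1 samples. Define

\[ \hat Y^{[f]}_{i,l} = \frac{f(\theta_i^{(l)})\, \nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/\hat d_s} \quad \text{and} \quad \hat Z^{[f](j)}_{i,l} = \frac{f(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})/\hat d_j}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/\hat d_s} - \hat e_j, \quad j = 1, \ldots, k. \]

Hence a new control variate adjusted estimator for $I^{[f]}(h)$ corresponding to the estimator (2-18) of the previous section, which assumed knowledge of $d$ and the $I^{[f]}(h_j)$'s, is

\[ \hat I^{[f], \hat d, \hat e}_{\hat\beta^{[f]}, \hat\beta} = \frac{\sum_{l=1}^k \sum_{i=1}^{n_l} \bigl( \hat Y^{[f]}_{i,l} - \sum_{j=1}^k \hat\beta^{[f]}_j(\hat d)\, \hat Z^{[f](j)}_{i,l} \bigr)}{\sum_{l=1}^k \sum_{i=1}^{n_l} \bigl( \hat Y_{i,l} - \sum_{j=2}^k \hat\beta_j(\hat d)\, \hat Z^{(j)}_{i,l} \bigr)}, \]

where $\hat\beta^{[f]}(\hat d)$ is the least squares estimate resulting from the regression of $\hat Y^{[f]}_{i,l}$ on the predictors $\hat Z^{[f](j)}_{i,l}$. Theorem 6 establishes the asymptotic normality of this estimator. But before stating this theorem, we give an auxiliary theorem which shows the joint asymptotic normality of

\[ \sqrt{N} \begin{pmatrix} \hat d - d \\ \hat e - e \end{pmatrix}. \]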

Let us first define separable and inseparable Monte Carlo samples as introduced by Geyer (1994). The Monte Carlo sample $\{\theta_i^{(l)}\}_{i=1}^{N_l}$, $l = 1, \ldots, k$, is said to be separable if there are disjoint subsets $L$ and $M$ of $\{1, \ldots, k\}$ such that for each $\theta$ in the sample and each $l \in L$ and $m \in M$, either $\nu_{h_l,y}(\theta)$ or $\nu_{h_m,y}(\theta)$ is zero. A Monte Carlo sample that is not separable is said to be inseparable.
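Theorem 5 Assume that the Monte Carlo sample from Stage 1, $\{\theta_i^{(l)}\}_{i=1}^{N_l}$, $l = 1, \ldots, k$, is inseparable, and the following conditions hold:

B1 for each $l = 1, \ldots, k$, the chain $\{\theta_i^{(l)}\}_{i=1}^{\infty}$ is geometrically ergodic;
B2 for each $l = 1, \ldots, k$, there exists $\epsilon > 0$ such that $E_{\nu_{h_l,y}}\bigl( |f|^{2+\epsilon}(\theta) \bigr) < \infty$.

Then

\[ \sqrt{N} \begin{pmatrix} \hat d - d \\ \hat e - e \end{pmatrix} \xrightarrow{d} \mathcal{N}(0, V), \tag{2-19} \]

where $V$ is given in equation (A.48) in the Appendix.

Theorem 6 If the conditions stated in Theorem 4 and (2-19) hold, then

\[ \sqrt{n}\,\bigl( \hat I^{[f], \hat d, \hat e}_{\hat\beta^{[f]}, \hat\beta} - I^{[f]}(h) \bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\; b(h)\bigr), \]

with $b(h)$ given in equation (A.57) in the Appendix.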

CHAPTER 3
VARIANCE ESTIMATION AND SELECTION OF THE SKELETON POINTS
Estimation of the variance of our estimates is important for several reasons. In
addition to the usual need for providing error margins for our point estimates, variance
estimates are of great help in selecting the skeleton points.
3.1 Estimation of the Variance

There are two approaches one can use to estimate the variance of any of our estimates. For the sake of concreteness, consider $\hat B(h, h_1, \hat d)$, whose asymptotic variance is the expression $\kappa^2(h) = q\,c(h)'\Sigma c(h) + \tau^2(h)$ (see Theorem 1).

Spectral Methods If $X_0, X_1, X_2, \ldots$ is a Markov chain and $f$ is a function, the asymptotic variance of $(1/n) \sum_{i=0}^{n-1} f(X_i)$ (when it exists) is the infinite series

\[ \mathrm{Var}(f(X_0)) + 2 \sum_{j=1}^{\infty} \mathrm{Cov}(f(X_0), f(X_j)), \tag{3-1} \]

where the variances and covariances are calculated under the assumption that $X_0$ has the stationary distribution. Spectral methods involve estimating an initial segment of the series, using techniques from time series; see Geyer (1992) for a review. Our problem is more complicated because we are dealing with multiple chains. In our situation, the term $\tau^2(h)$ may be estimated through spectral methods, and this is done in a straightforward manner. We now give technical details regarding the consistency of this method. The quantity $\tau^2(h)$ is given by $\tau^2(h) = \sum_{l=1}^k a_l \tau_l^2(h)$, where $\tau_l^2(h)$ is the asymptotic variance of

\[ \frac{1}{n_l} \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/d_s}. \]

(See equation (A.9) of Doss (2010).) Because for each $l$ we will be estimating $\tau_l^2(h)$ by the asymptotic variance of

\[ \frac{1}{n_l} \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/\hat d_s}, \]

where $\hat d$ is formed from Stage 1 runs, it is necessary to consider the quantity $\tau_l^2(h, u)$, defined as the asymptotic variance of

\[ \frac{1}{n_l} \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^k a_s \nu_{h_s}(\theta_i^{(l)})/u_s}, \]

where $u = (u_1, u_2, \ldots, u_k)'$. After defining

\[ f_u(\theta) = \frac{\nu_h(\theta)}{\sum_{s=1}^k a_s \nu_{h_s}(\theta)/u_s}, \]

we get

\[ \tau_l^2(h, u) = \mathrm{Var}\bigl( f_u(\theta_1^{(l)}) \bigr) + 2 \sum_{g=1}^{\infty} \mathrm{Cov}\bigl( f_u(\theta_1^{(l)}), f_u(\theta_{1+g}^{(l)}) \bigr). \]
g=1
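We now proceed to establish continuity of $\tau^2(h, u)$ in $u$, and to do this we will show that for each $l = 1, \ldots, k$, $\tau_l^2(h, u)$ is continuous in $u$. For the rest of this discussion expectations and variances are taken with respect to $\nu_{h_l,y}$, and we drop $l$ from the notation. Let $u^{(n)}$ be any sequence of vectors such that $u^{(n)} \to d$. Then trivially $f_{u^{(n)}}(\theta) \to f_d(\theta)$ for all $\theta$, and letting $\epsilon = \min\{d_1, \ldots, d_k\}$, there exists a positive integer $n(\epsilon)$ such that $\|u^{(n)} - d\| < \epsilon$ for all $n > n(\epsilon)$. Consequently,

\[ f_{u^{(n)}}(\theta) \le f_{2d}(\theta) = 2 f_d(\theta) \quad \text{for all } \theta \text{ and all } n > n(\epsilon), \tag{3-2} \]

and we can apply the Lebesgue Dominated Convergence Theorem twice to conclude that

\[ \mathrm{Var}\bigl( f_{u^{(n)}}(\theta_1) \bigr) = E\bigl( f^2_{u^{(n)}}(\theta_1) \bigr) - \bigl[ E\bigl( f_{u^{(n)}}(\theta_1) \bigr) \bigr]^2 \]

converges to

\[ E\bigl( f_d^2(\theta_1) \bigr) - \bigl[ E\bigl( f_d(\theta_1) \bigr) \bigr]^2 = \mathrm{Var}\bigl( f_d(\theta_1) \bigr). \]

Note that condition A2 guarantees that the dominating function in (3-2) has finite expectation. Similarly, for each of the covariance terms,

\[ \mathrm{Cov}\bigl( f_{u^{(n)}}(\theta_1), f_{u^{(n)}}(\theta_{1+j}) \bigr) \to \mathrm{Cov}\bigl( f_d(\theta_1), f_d(\theta_{1+j}) \bigr). \]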

If we define

\[ c_u(j) = \mathrm{Cov}\bigl( f_u(\theta_1), f_u(\theta_{1+j}) \bigr), \quad j = 1, 2, \ldots, \]

then $\sum_{j=1}^{\infty} c_u(j)$ is absolutely convergent. This is because under geometric ergodicity, the so-called strong mixing coefficients $\alpha(j)$ decrease to 0 exponentially fast (a definition of strong mixing is given on p. 349 of Ibragimov (1962)), and

\[ \bigl| \mathrm{Cov}\bigl( f_u(\theta_1), f_u(\theta_{1+j}) \bigr) \bigr| \le [\alpha(j)]^{\delta/(2+\delta)} \bigl[ E\bigl( |f_u(\theta_1)|^{2+\delta} \bigr) \bigr]^{2/(2+\delta)}, \tag{3-3} \]

for some $\delta > 0$. See Theorem 18.5.3 of Ibragimov and Linnik (1971) or Lemma 7.7 of Chapter 7 in Durrett (1991). Since $c_{u^{(n)}}(j) \to c_d(j)$ for each $j$, (3-2) and (3-3) enable us to again apply Dominated Convergence to conclude that $\sum_{j=1}^{\infty} c_{u^{(n)}}(j) \to \sum_{j=1}^{\infty} c_d(j)$, and this proves that $\tau_l^2(h, u)$ is continuous in $u$.
Let $g(u)$ be the spectral density at 0 of the series $f_u(\theta_i)$. Note that $g(u)$ is equal to $\tau_l^2(h, u)$, except for a normalizing constant. Under strong mixing (implied by geometric ergodicity), standard spectral density estimates $\hat g(u)$ are consistent, and bounds on the discrepancy $|\hat g(u) - g(u)|$ depend on the mixing rate and bounds on the moments of the function $f_u(\theta)$ (Rosenblatt 1984). By (3-2), the rate is uniform as long as $\|u - d\|$ is small, and the condition that $\|\hat d - d\|$ is small is guaranteed if the Stage 1 sample size $N$ is large.

Geyer (1994) gives an expression for $\Sigma$ involving infinite series of the form (3-1), and this enables estimation of $\Sigma$ by spectral methods. Now, $c(h)$ is a vector each of whose components is an integral with respect to the posterior $\nu_{h,y}$ (see (A.3)). The estimate derived in Section 2.3 (see (2-16)) is designed precisely to estimate such posterior expectations. Combining, we arrive at an overall estimate of $\kappa^2(h)$, and the asymptotic variances of our other estimates are handled similarly.
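
To make the idea of estimating an initial segment of the series (3-1) concrete, here is a minimal sketch of one common spectral-type estimate, a Bartlett (triangular) lag window applied to the sample autocovariances of a single chain. The choice of window, the default bandwidth of order $n^{1/3}$, and the function name `bartlett_asymptotic_variance` are assumptions made for illustration; the thesis does not specify which spectral estimate it uses.

```python
import numpy as np

def bartlett_asymptotic_variance(x, bandwidth=None):
    """Estimate Var(f(X_0)) + 2 * sum_j Cov(f(X_0), f(X_j)) from one chain.

    x         : 1-D array holding f(X_0), ..., f(X_{n-1})
    bandwidth : number of lags kept (assumed default: floor(n ** (1/3)))
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if bandwidth is None:
        bandwidth = int(np.floor(n ** (1.0 / 3.0)))
    xc = x - x.mean()
    total = np.dot(xc, xc) / n                         # lag-0 autocovariance
    for j in range(1, bandwidth + 1):
        gamma_j = np.dot(xc[:-j], xc[j:]) / n          # lag-j autocovariance
        total += 2.0 * (1.0 - j / (bandwidth + 1.0)) * gamma_j
    return total

if __name__ == "__main__":
    # Check on an AR(1) series, whose series (3-1) equals 1 / (1 - phi)^2.
    rng = np.random.default_rng(0)
    n, phi = 100_000, 0.5
    noise = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = noise[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    print(bartlett_asymptotic_variance(x), 1.0 / (1.0 - phi) ** 2)
```

In the multiple-chain setting of this section, such an estimate would be applied separately to each chain's values of $f_{\hat d}$, and the results combined with the weights $a_l$.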

Methods Based on Regeneration The cleanest approach to estimating asymptotic variances is based on regeneration. Let $X_0, X_1, X_2, \ldots$ be a Markov chain on the measurable space $(\mathsf{X}, \mathcal{B})$, let $K(x, A)$ be the Markov transition distribution, and assume that $\pi$ is a stationary probability distribution for the chain. Suppose that for each $x$, $K(x, \cdot)$ has density $k(x, \cdot)$ with respect to a dominating measure $\mu$. Regeneration methods require the existence of a function $s : \mathsf{X} \to [0, 1)$, whose expectation with respect to $\pi$ is strictly positive, and a probability density $d$ with respect to $\mu$, such that $k(\cdot, \cdot)$ satisfies

\[ k(x, x') \ge s(x)\, d(x') \quad \text{for all } x, x' \in \mathsf{X}. \tag{3-4} \]

This is called a minorization condition and, as we describe below, it can be used to introduce regenerations into the Markov chain driven by $k$. These regenerations are the key to constructing a simple, consistent estimator of the variance in the central limit theorem. Define

\[ r(x, x') = \frac{k(x, x') - s(x)\, d(x')}{1 - s(x)}. \]

Note that, for fixed $x \in \mathsf{X}$, $r(x, x')$ is a density function in $x'$. We may therefore write

\[ k(x, x') = s(x)\, d(x') + (1 - s(x))\, r(x, x'), \]

which gives a representation of $k(x, \cdot)$ as a mixture of two densities, $d(\cdot)$ and $r(x, \cdot)$. This provides an alternative method of simulating from $k$. Suppose that the current state of the chain is $X_n$. We generate $\delta_n \sim \mathrm{Bernoulli}(s(X_n))$. If $\delta_n = 1$, we draw $X_{n+1} \sim d$; otherwise, we draw $X_{n+1} \sim r(X_n, \cdot)$. Note that, if $\delta_n = 1$, the next state of the chain is drawn from $d$, which does not depend on the current state. Hence, the chain "forgets" the current state and we have a regeneration. To be more specific, suppose we start the Markov chain with $X_0 \sim d$ and then use the method described above to simulate the chain. Each time $\delta_n = 1$, we have $X_{n+1} \sim d$ and the process stochastically restarts itself; that is, the process regenerates. Even though the description above involves generating observations from $r$, there are clever tricks that enable the user to bypass generating from $r$, and to obtain the sequence $(X_n, \delta_n)$ in a way that requires only generating directly from $k$; see, e.g., Tan and Hobert (2009).

Here is how the regenerative method is used to get valid asymptotic standard errors. Suppose we wish to approximate the posterior expectation of some function $f(X)$. Suppose further that the Markov chain is to be run for $R$ regenerations (or tours); that is, we begin by drawing the starting value from $d$ and we stop the simulation the $R$th time that a $\delta_n = 1$. Let $0 = \tau_0 < \tau_1 < \tau_2 < \cdots < \tau_R$ be the (random) regeneration times, i.e. $\tau_t = \min\{n > \tau_{t-1} : \delta_{n-1} = 1\}$ for $t \in \{1, 2, \ldots, R\}$. The total length of the simulation, $\tau_R$, is random. Let $N_1, N_2, \ldots, N_R$ be the lengths of the tours, i.e. $N_t = \tau_t - \tau_{t-1}$, and define $S_t = \sum_{n=\tau_{t-1}}^{\tau_t - 1} f(X_n)$, $t = 1, \ldots, R$. Note that the $(N_t, S_t)$ pairs are iid, and a strongly consistent estimator of $E_\pi(f(X))$ is $\bar f_{\tau_R} = \bar S/\bar N = (1/\tau_R) \sum_{n=0}^{\tau_R - 1} f(X_n)$, where $\bar S = (1/R) \sum_{t=1}^R S_t$ and $\bar N = (1/R) \sum_{t=1}^R N_t$, and the asymptotic variance of $\bar f_{\tau_R}$ may be estimated very simply by $\sum_{t=1}^R (S_t - \bar f_{\tau_R} N_t)^2 / (R \bar N^2)$. Moment and ergodicity conditions that guarantee strong consistency of this variance estimator are given in Hobert et al. (2002). This method has recently been applied successfully in a number of problems involving continuous state spaces; see, e.g., Tan and Hobert (2009) and the references therein, and we use the method in the illustration in Chapter 5.
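
The mixture representation and the regenerative variance estimate described above can be sketched as follows. The toy kernel is an assumption chosen so that the minorization holds by construction: with constant $s(x) = 0.2$, $d$ the standard normal density, and $r(x, \cdot)$ an AR(1) step that preserves N(0, 1), the stationary distribution is N(0, 1). The function names `simulate_tours` and `regenerative_estimate` are hypothetical. The standard error returned uses the convention that the variance estimate refers to $\sqrt{R}\,(\bar f_{\tau_R} - E_\pi f)$, as in Hobert et al. (2002).

```python
import numpy as np

def simulate_tours(R, rng, s=0.2, rho=0.7):
    """Simulate R tours of a toy chain with kernel s*d + (1 - s)*r(x, .)."""
    tours, current = [], []
    x = rng.standard_normal()                 # X_0 ~ d
    while len(tours) < R:
        current.append(x)
        if rng.random() < s:                  # delta_n = 1: regeneration
            tours.append(np.array(current))
            current = []
            x = rng.standard_normal()         # X_{n+1} ~ d
        else:                                 # X_{n+1} ~ r(X_n, .)
            x = rho * x + np.sqrt(1.0 - rho ** 2) * rng.standard_normal()
    return tours

def regenerative_estimate(tours, f):
    """Return (estimate, variance estimate, standard error) from iid tours."""
    N = np.array([len(t) for t in tours], dtype=float)   # tour lengths N_t
    S = np.array([np.sum(f(t)) for t in tours])          # tour sums S_t
    R = len(tours)
    f_bar = S.sum() / N.sum()                            # S-bar / N-bar
    var_hat = np.sum((S - f_bar * N) ** 2) / (R * N.mean() ** 2)
    return f_bar, var_hat, np.sqrt(var_hat / R)

if __name__ == "__main__":
    tours = simulate_tours(2000, np.random.default_rng(1))
    print(regenerative_estimate(tours, lambda x: x ** 2))   # E(X^2) = 1 under N(0, 1)
```

Only per-tour lengths and sums are needed, which is what makes the estimator simple to apply separately to each of the $k$ chains.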
In our framework of multiple chains, one might think that we need to identify a sequence of times $0 = \tau_0 < \tau_1 < \tau_2 < \cdots < \tau_R$ at which all the chains regenerate. This is not the case, and we need only identify, for each chain, a sequence of regeneration times for that chain. Since the overall estimate is essentially a function of averages involving the $k$ chains, its asymptotic variance is a function of the asymptotic variances of averages formed from the individual chains.

Consider the function $B(h, h_1)$, $h \in \mathcal{H}$, and an estimator, such as $\hat B(h, h_1, \hat d)$ (for the rest of this discussion, we will denote these by $B(h)$ and $\hat B(h)$, for brevity). It is of interest to provide a confidence band (region, if $h$ is multidimensional) for $B(h)$ that is valid simultaneously for all $h \in \mathcal{H}$. A closely related problem is to produce a confidence interval for $\arg\max_{h \in \mathcal{H}} B(h)$. The traditional way of forming confidence bands that are valid globally is to proceed as follows:

1 Establish a functional central limit theorem that says that $n^{1/2}(\hat B(h) - B(h))$ converges in distribution to a Gaussian process $W(h)$, $h \in \mathcal{H}$.

2 Find the distribution of $\sup_{h \in \mathcal{H}} |W(h)|$.

If $s_\alpha$ is the $(1 - \alpha)$-quantile of the distribution of this supremum, then the band $\hat B(h) \pm s_\alpha / n^{1/2}$ has asymptotic coverage probability equal to $1 - \alpha$. The value $s_\alpha$ is typically too difficult to compute analytically, but can be obtained by simulation [see, e.g. Burr and Doss (1993) among many others]. The maximal inequalities needed to establish functional central limit theorems typically require an iid structure, and for this reason we believe that the regeneration method offers the best hope for establishing such theorems.
3.2 Selection of the Skeleton Points

The asymptotic variances of any of our estimates depend on the choice of the points $h_1, \ldots, h_k$. For concreteness, consider $\hat B(h, h_1, \hat d)$, and to emphasize this dependence, let $V(h, h_1, \ldots, h_k)$ denote the asymptotic variance of $\hat B(h, h_1, \hat d)$. For fixed $h_1, \ldots, h_k$, identifying the set of $h$'s for which $V(h, h_1, \ldots, h_k)$ is finite is typically a feasible problem. For instance, Doss (1994) considered the pump data example discussed in Tierney (1994), for which the hyperparameter $h$ has dimension 3, and determined this set for the case $k = 1$. He showed that one can go as far away from $h_1$ as one wants in certain directions, but in other directions the range is limited. (The calculation can be extended to any $k$.) Suppose now that we fix a range $\mathcal{H}$ over which $h$ is to vary. Typically, we will want more than just a positioning of $h_1, \ldots, h_k$ that guarantees that $V(h, h_1, \ldots, h_k)$ is finite for all $h \in \mathcal{H}$, and we will face the problem below.

Design Problem Find the values of $h_1, \ldots, h_k$ that minimize $\max_{h \in \mathcal{H}} V(h, h_1, \ldots, h_k)$.

Unfortunately, except for extremely simple cases, it is not possible to calculate $V(h, h_1, \ldots, h_k)$ analytically (even if $k = 1$, $V(h, h_1)$ is an infinite sum each of whose terms depends on the Markov transition distribution in a complicated way), and maximizing it over $h \in \mathcal{H}$ would present additional difficulties. Furthermore, even if we were able to calculate $\max_{h \in \mathcal{H}} V(h, h_1, \ldots, h_k)$, the design problem would involve the minimization of a function of $k \times \dim(\mathcal{H})$ variables, and in general, solving the design problem is hopeless.

In our experience, we have found that the following method works reasonably well. Having specified the range $\mathcal{H}$, we select trial values $h_1, \ldots, h_k$ and plot the estimated variance as a function of $h$, using one of the methods described above. If we find a region in $\mathcal{H}$ where this variance is unacceptably large, we "cover" this region by moving some $h_l$'s closer to the region, or by simply adding new $h_l$'s in that region, which increases $k$. This is illustrated in the example in Chapter 5.

CHAPTER 4
REVIEW OF PREVIOUS WORK
Vardi (1985) introduced the following $k$-sample model for biased sampling. There is an unknown distribution function $F$, which we wish to estimate. For each weight function $w_l$, $l = 1, \ldots, k$, we have a sample $X_{l1}, \ldots, X_{ln_l} \stackrel{\mathrm{iid}}{\sim} F_l$, where

\[ F_l(x) = \frac{1}{W_l} \int_{-\infty}^{x} w_l(s)\, dF(s). \tag{4-1} \]

In (4-1), $W_l = \int_{-\infty}^{\infty} w_l(s)\, dF(s)$. The weight functions $w_1, \ldots, w_k$ are known, but the normalizing constants $W_1, \ldots, W_k$ are not. Vardi (1985) was interested in conditions that guarantee that a nonparametric maximum likelihood estimator (NPMLE) exists and is unique, and he gave the form of the NPMLE. (The conditions for existence and uniqueness involve issues regarding the supports of the $F_l$'s and do not concern us in the present paper.)

To estimate $F$, a preliminary step is to estimate the vector $W = (W_1, \ldots, W_k)$. Vardi (1985) and Gill et al. (1988) show that $W$ may be estimated by the solution to the system of $k$ equations

\[ W_l = \int \frac{w_l(y)}{\sum_{s=1}^k a_s w_s(y)/W_s}\, dF_n(y), \quad l = 1, \ldots, k, \tag{4-2} \]

where $a_l = n_l/n$, $n = \sum_{l=1}^k n_l$, and $F_n$ is the empirical distribution function that gives mass $1/n$ to each of the $X_{li}$. Actually, the solution to (4-2) is not unique: it is trivial to see that if the vector $W$ solves (4-2), then so does $aW$, for any $a$. However, it turns out that knowing $W$ only up to a multiplicative constant is all that is needed, and to avoid non-identifiability issues, we define the vector $V = (W_2/W_1, \ldots, W_k/W_1)$. Gill et al. (1988) show that if $\hat W$ is any solution to (4-2), and $\hat V$ is defined by $\hat V = (\hat W_2/\hat W_1, \ldots, \hat W_k/\hat W_1)$, then $n^{1/2}(\hat V - V)$ is asymptotically normal (Proposition 2.3 in Gill et al. (1988)). Once an estimate of $W$ is formed, it is relatively easy to form an estimate $\hat F_n$ of $F$, and consequently of integrals of the form $\int h\, dF$. Gill et al. (1988) obtain functional weak convergence results of the sort $n^{1/2}\bigl( \int h\, d\hat F_n - \int h\, dF \bigr) \xrightarrow{d} Z(h)$, where $Z$ is a mean-0 Gaussian process indexed by $h \in H$, where $H$ is a large class of square integrable functions.

It is not difficult to see that our setup is the same as that considered in Vardi (1985) and Gill et al. (1988): their $F$ corresponds to our $\nu_{h,y}$; their $w_l$ to $\nu_{h_l}/\nu_h$; $F_l$ to $\nu_{h_l,y}$; $W_l$ to $m_{h_l}/m_h$; and $V$ to $d$. But there are major differences between our framework and theirs. They deal with iid samples, and so can use empirical process theory, whereas we deal with Markov chains, for which such a theory is not available. In their framework, the samples arise from some experiment, and they are seeking optimal estimates given data that is given to them. In contrast, our samples are obtained by Monte Carlo, so we have control over design issues. In particular, we are concerned with computational efficiency, in addition to statistical efficiency; hence our interest in the two-stage sampling method for preliminary estimation of $d$ and for enabling the use of control variates.

Geyer (1994) also deals with the setup in Vardi (1985) and Gill et al. (1988), i.e. the $k$-sample model for biased sampling, and he also considers the problem of estimating $d$. As mentioned in Section 2.1, his estimator is obtained by maximizing (2-6), and the solution is numerically identical to the solution to the system (4-2). However, he considers the situation where each of the $k$ samples is a Markov chain, as opposed to an iid sample, and assuming that the chains satisfy certain mixing conditions, he obtains a central limit theorem for $n^{1/2}(\hat d - d)$. Naturally, the variance of the limiting distribution is different from the variance obtained in Gill et al. (1988), and is typically larger.

In Section 7 of their paper Meng and Wong (1996) consider the situation where for each $l = 1, \ldots, k$, we have an iid sample from the density $f_l = q_l/m_l$, where the functions $q_1, \ldots, q_k$ are known, but the normalizing constants $m_1, \ldots, m_k$ are not, and we wish to estimate the vector $(m_2/m_1, \ldots, m_k/m_1)$. Without going into detail, we mention that they develop a family of "bridge functions" and show that, in the iid setting, the optimal bridge function gives rise to an estimate identical to that of Geyer (1994). They obtain their estimate through an iterative scheme which is fast and stable [Meng and Wong (1996, p. 849)] and this is the computational method we use in the present paper.

Owen and Zhou (2000) consider the problem of estimating an integral of the form $I = \int h(x) f(x)\, dx$, where $f$ is a probability density that is completely known (as opposed to known up to a normalizing constant) and $h$ is a known function. They wish to estimate $I$ through importance sampling. They assume they can generate sequences $X_{l1}, \ldots, X_{ln_l} \stackrel{\mathrm{iid}}{\sim} p_l$, $l = 1, \ldots, k$, where the $p_l$'s are completely known densities. The doubly indexed sequence $X_{li}$, $i = 1, \ldots, n_l$, $l = 1, \ldots, k$, forms a (stratified) sample from the mixture density $p_a = \sum_{l=1}^k a_l p_l$, where $a_l = n_l / \sum_{s=1}^k n_s$, so one can carry out importance sampling with respect to this mixture. They point out that since the $p_l$'s are completely known, they can form the functions $H_j(x) = [p_j(x)/p_a(x)] - 1$, $j = 1, \ldots, k$, and these satisfy $E_{p_a}(H_j(X)) = 0$, where the subscript indicates that the expectation is taken with respect to the mixture density $p_a$. Therefore, these $k$ functions can be used as control variates. What we do in Chapter 2 is similar, except that we are working with densities whose functional form is known, but whose normalizing constants are not.
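
A small numerical sketch of mixture importance sampling with the control variates $H_j$ described above is given below. The specific target (standard normal), integrand ($x^2$), proposals (centered normals with different scales), and the function name `mixture_is_with_cv` are assumptions chosen for illustration. One control variate is dropped in the regression because $\sum_j a_j H_j = 0$ identically, so the $k$ functions are linearly dependent.

```python
import numpy as np

def normal_pdf(x, sd):
    """Density of N(0, sd^2) evaluated at x."""
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def mixture_is_with_cv(h, f, proposal_sds, n_per, rng):
    """Return (plain, control variate adjusted) estimates of the integral of h*f.

    Draws n_per points from each N(0, sd^2) proposal (a stratified mixture sample).
    """
    sds = np.asarray(proposal_sds, dtype=float)
    k = sds.size
    x = np.concatenate([rng.normal(0.0, sd, n_per) for sd in sds])
    a = np.full(k, 1.0 / k)                      # equal sample sizes: a_l = 1/k
    p = np.stack([normal_pdf(x, sd) for sd in sds])
    p_a = a @ p                                  # mixture density at each draw
    Y = h(x) * f(x) / p_a                        # importance-weighted integrand
    H = (p / p_a - 1.0).T                        # control variates, mean 0 under p_a
    X = np.column_stack([np.ones(x.size), H[:, 1:]])   # drop H_1 (linear dependence)
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y.mean(), coef[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    plain, adjusted = mixture_is_with_cv(lambda x: x ** 2,
                                         lambda x: normal_pdf(x, 1.0),
                                         [1.0, 3.0], 10_000, rng)
    print(plain, adjusted)   # both estimate E(X^2) = 1 for X ~ N(0, 1)
```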

Kong et al. (2003) also consider the $k$-sample model for biased sampling, but have a different perspective, and we describe their work in the notation of the present paper. They assume that there are probability measures $Q_1, \ldots, Q_k$ with densities $q_1/m_1, \ldots, q_k/m_k$, respectively, relative to some dominating measure $\mu$, and for each $l = 1, \ldots, k$, we have an iid sample $\{X_{li}\}_{i=1}^{n_l}$ from $Q_l$. Here, the $q_l$'s are known, but the $m_l$'s are not. Their objective is to estimate all possible ratios $m_l/m_j$, $l, j \in \{1, \ldots, k\}$ or, equivalently, the vector $d = (1, m_2/m_1, \ldots, m_k/m_1)$. In their highly unorthodox approach, Kong et al. (2003) obtain the maximum likelihood estimate $\hat\mu$ of the dominating measure itself ($\hat\mu$ is given up to an overall multiplicative constant). They can then estimate the ratios $m_l/m_j$, since the normalizing constants are known functions of $\mu$ (i.e. $m_r = \int q_r(x)\, d\mu(x)$, and $q_r$ is known). They show that the resulting estimate of $d$ is obtained by solving the system

\[ \hat d_r = \sum_{l=1}^k \sum_{i=1}^{n_l} \frac{q_r(X_{li})}{\sum_{s=1}^k n_s q_s(X_{li})/\hat d_s}, \quad r = 1, \ldots, k, \tag{4-3} \]

which is easily seen to be identical to the system (4-2) of Gill et al. (1988).
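
One simple way to solve a system of the form (4-3) numerically is a self-consistent (fixed-point) iteration, renormalizing at each step so that the first component equals 1. The sketch below is an assumption made for illustration and is not claimed to be the iterative scheme of Meng and Wong (1996); the function name `solve_ratios` and the Gaussian toy example are hypothetical.

```python
import numpy as np

def solve_ratios(q_vals, n_sizes, tol=1e-10, max_iter=1000):
    """Fixed-point iteration for d_r = sum_x q_r(x) / sum_s n_s q_s(x) / d_s.

    q_vals  : (k, n) array, q_vals[r, i] = q_r evaluated at the i-th pooled draw
    n_sizes : length-k array of sample sizes n_1, ..., n_k
    Returns d normalized so that d[0] == 1.
    """
    q_vals = np.asarray(q_vals, dtype=float)
    n_sizes = np.asarray(n_sizes, dtype=float)
    d = np.ones(q_vals.shape[0])
    for _ in range(max_iter):
        denom = (n_sizes[:, None] * q_vals / d[:, None]).sum(axis=0)
        d_new = (q_vals / denom).sum(axis=1)
        d_new /= d_new[0]                       # fix the arbitrary scale
        if np.max(np.abs(d_new - d) / d) < tol:
            return d_new
        d = d_new
    return d

if __name__ == "__main__":
    # Toy check: q_s(x) = exp(-x^2 / (2 sigma_s^2)), so m_s is proportional to sigma_s.
    rng = np.random.default_rng(0)
    sigmas = np.array([1.0, 2.0, 4.0])
    n_sizes = np.array([4000, 4000, 4000])
    x = np.concatenate([rng.normal(0.0, s, n) for s, n in zip(sigmas, n_sizes)])
    q_vals = np.exp(-x[None, :] ** 2 / (2.0 * sigmas[:, None] ** 2))
    print(solve_ratios(q_vals, n_sizes))        # roughly (1, 2, 4)
```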
Tan (2004) shows how control variates can be incorporated in the likelihood framework of Kong et al. (2003). When there are $r$ functions $H_j$, $j = 1, \ldots, r$, for which we know that $\int H_j\, d\mu = 0$, the parameter space is restricted to the set of all sigma-finite measures satisfying these $r$ constraints. For the case where $X_{li}$, $i = 1, \ldots, n_l$, are iid for each $l = 1, \ldots, k$, he obtains the maximum likelihood estimate of $\mu$ in this reduced parameter space, and therefore corresponding estimates of $d$ and $m_h/m_{h_1}$, and shows that this approach gives estimates that are asymptotically equivalent to estimates that use control variates via regression. He also obtains results on asymptotic normality of his estimators that are valid when we have the iid structure.

The estimates of $d$ in Gill et al. (1988), Geyer (1994), Meng and Wong (1996), and Kong et al. (2003) are all equivalent. Theorem 1 of Tan (2004) establishes asymptotic optimality of this estimate under the iid assumption. When the samples are Markov chain draws, the asymptotically optimal estimate is essentially impossible to obtain (Romero 2003). But the estimate derived under the iid assumption can still be used in the Markov chain setting if one can develop asymptotic results that are valid in the Markov chain case, and this is done by Geyer (1994), whose results we use in all our theorems.

CHAPTER 5
ILLUSTRATION ON VARIABLE SELECTION

There exist many classes of problems in Bayesian analysis in which the sensitivity analysis and model selection issues discussed earlier arise; see Chapter 6. Here we give an application involving the hierarchical prior used in variable selection in the Bayesian linear regression model discussed in Chapter 1. This chapter consists of three parts. First we discuss an MCMC algorithm for this model and state some of its theoretical properties; then we discuss the literature on selection of the hyperparameter $h$; and finally we present two detailed illustrations of our methodology.

5.1 A Markov Chain for Estimating the Posterior Distribution of Model Parameters

The design of MCMC algorithms for estimating the posterior distribution of $\theta$ under (1-1) revolves around the generation of the indicator variable $\gamma$. We now briefly review the algorithms for running a Markov chain on $\gamma$ that are proposed in the literature, and the main issues of implementation of these algorithms. Raftery et al. (1997) and Madigan and York (1995) discuss the following Metropolis-Hastings algorithm for generating a sequence $\gamma^{(1)}, \gamma^{(2)}, \ldots$. If the current state is $\gamma$, a new state $\gamma^*$ is formed by selecting at random a coordinate $j$, setting $\gamma^*_j = 1 - \gamma_j$, and $\gamma^*_k = \gamma_k$ for $k \ne j$. The proposal $\gamma^*$ is then accepted or rejected with the Metropolis-Hastings acceptance probability $\min\{p(\gamma^* \mid Y)/p(\gamma \mid Y),\, 1\}$. Madigan and York (1995) call this algorithm MC3. Clyde et al. (1996) propose a modification of this algorithm in which we do not select a component at random and update it, but instead sequentially update all components. They call this the "Hybrid Algorithm." (Strictly speaking, this is a Metropolized Gibbs sampler, and is not actually a Metropolis-Hastings algorithm.) Smith and Kohn (1996) propose a Gibbs sampler which simply cycles through the coordinates $\gamma_j$, one at a time. George and McCulloch (1997) show that when compared with MC3, the Gibbs sampler algorithm gives estimates with smaller standard error, and is also slightly faster, at least in several simulation studies they conducted.

Kohn et al. (2001) consider Metropolized Gibbs algorithms which are the same as the Hybrid Algorithm of Clyde et al. (1996), except that at coordinate $j$, instead of deterministically proposing to go from $\gamma_j$ to $\gamma^*_j = 1 - \gamma_j$, the proposed value $\gamma^*_j$ is equal to $1 - \gamma_j$ with probability depending on the current state $\gamma$. Kohn et al. (2001) describe two such algorithms, and show that these are more computationally efficient than the Gibbs sampler in situations where on average $q_\gamma$ is small, i.e. the models are sparse. They also conduct a detailed simulation study of one of their sampling schemes (their "SS(2)") which suggests that, while the scheme produces estimates whose standard errors are a bit larger than those produced by the Gibbs sampler, this disadvantage is more than outweighed by its computational efficiency.

All the algorithms mentioned above require, in one way or another, the calculation of $p(\gamma^* \mid Y)/p(\gamma \mid Y)$. Because of the conjugate nature of model (1-1), the marginal likelihood of model $\gamma$ is available in closed form, and therefore $p(\gamma \mid Y)$ is available up to a normalizing constant. We have

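\[ p(\gamma \mid Y) \propto (1 + g)^{(m - 1 - q_\gamma)/2}\, S^{-(m-1)}\, \bigl[ 1 + g(1 - R_\gamma^2) \bigr]^{-(m-1)/2}\, w^{q_\gamma} (1 - w)^{q - q_\gamma}, \tag{5-1} \]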

where $S^2 = \sum_{i=1}^m (Y_i - \bar Y)^2$ and $R_\gamma^2$ is the coefficient of determination of model $\gamma$.

As is standard for model (1-1), we assume that the columns of the design matrix are centered, and in this case, $R_\gamma^2 = Y'X_\gamma (X_\gamma'X_\gamma)^{-1} X_\gamma'Y / S^2$. The main computational burden in obtaining (5-1) is the calculation of $R_\gamma^2$, which is time-consuming if $q_\gamma$ is large. Smith and Kohn (1996) note that, when $\gamma^*$ and $\gamma$ differ in only one component, $R_{\gamma^*}^2$ can be obtained rapidly from $R_\gamma^2$. We return to this point in Appendix B.
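
The single-coordinate-flip Metropolis-Hastings scheme (MC3) reviewed earlier in this section can be sketched as follows. This sketch rests on a loud assumption: it takes the unnormalized posterior to have the usual g-prior form, $(1+g)^{(m-1-q_\gamma)/2}[1+g(1-R_\gamma^2)]^{-(m-1)/2} w^{q_\gamma}(1-w)^{q-q_\gamma}$, with independent Bernoulli($w$) inclusion indicators; the exact form in the thesis should be checked against (5-1) and model (1-1). The function names `log_post` and `mc3` are hypothetical, and $R_\gamma^2$ is recomputed from scratch at every step rather than updated rapidly as in Smith and Kohn (1996).

```python
import numpy as np

def log_post(gamma, Xc, yc, g, w):
    """Log of the assumed unnormalized p(gamma | Y); Xc and yc are centered."""
    m, q = Xc.shape
    idx = np.flatnonzero(gamma)
    q_g = idx.size
    if q_g == 0:
        r2 = 0.0
    else:
        Xg = Xc[:, idx]
        beta, *_ = np.linalg.lstsq(Xg, yc, rcond=None)
        fitted = Xg @ beta
        r2 = (fitted @ fitted) / (yc @ yc)       # coefficient of determination
    return (0.5 * (m - 1 - q_g) * np.log1p(g)
            - 0.5 * (m - 1) * np.log1p(g * (1.0 - r2))
            + q_g * np.log(w) + (q - q_g) * np.log(1.0 - w))

def mc3(X, y, g, w, n_iter, rng):
    """Run n_iter single-flip Metropolis-Hastings steps; return the gamma draws."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    q = X.shape[1]
    gamma = np.zeros(q, dtype=int)               # start from the null model
    lp = log_post(gamma, Xc, yc, g, w)
    draws = np.empty((n_iter, q), dtype=int)
    for t in range(n_iter):
        j = rng.integers(q)                      # coordinate chosen at random
        prop = gamma.copy()
        prop[j] = 1 - prop[j]                    # flip gamma_j
        lp_prop = log_post(prop, Xc, yc, g, w)
        if np.log(rng.random()) < lp_prop - lp:  # accept w.p. min{ratio, 1}
            gamma, lp = prop, lp_prop
        draws[t] = gamma
    return draws

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 4))
    y = 2.0 * X[:, 0] + rng.standard_normal(100)   # only variable 0 matters
    draws = mc3(X, y, g=100.0, w=0.5, n_iter=2000, rng=rng)
    print(draws.mean(axis=0))                      # estimated inclusion frequencies
```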
In our situation, we need to generate a Markov chain on $\theta$, because the Bayes factor estimates given in Chapter 2 require samples from the posterior distribution of $\theta$. The algorithm we use in the present paper is based on the Gibbs sampler on $\gamma$ introduced in Smith and Kohn (1996) (although the computational implementation we use is different from theirs), followed by three steps to generate $\sigma$, $\beta_0$, and $\beta_\gamma$. In a bit more detail, let $V(\gamma^{(0)}, \cdot)$ be the Markov transition function corresponding to the Gibbs sampler in Smith and Kohn (1996), i.e. $V(\gamma^{(0)}, \cdot)$ is the distribution of $\gamma^{(1)}$ given that the current state is $\gamma^{(0)}$, and let $v(\gamma^{(0)}, \gamma) = V(\gamma^{(0)}, \{\gamma\})$ be the corresponding probability mass function. Suppose the current state is $(\gamma^{(i)}, \sigma^{(i)}, \beta_0^{(i)}, \beta_\gamma^{(i)})$. We proceed as follows.

1. We update $\gamma^{(i)}$ to $\gamma^{(i+1)}$ using $V(\gamma^{(i)}, \cdot)$. The generation of $\gamma^{(i+1)}$ does not involve $(\sigma^{(i)}, \beta_0^{(i)}, \beta_\gamma^{(i)})$.

2. We generate $\sigma^{(i+1)}$ from the conditional distribution of $\sigma$ given $\gamma = \gamma^{(i+1)}$ and the data.

3. We generate $\beta_0^{(i+1)}$ from the conditional distribution of $\beta_0$ given $\gamma = \gamma^{(i+1)}$, $\sigma = \sigma^{(i+1)}$, and the data.

4. We generate $\beta_\gamma^{(i+1)}$ from the conditional distribution of $\beta_\gamma$ given $\gamma = \gamma^{(i+1)}$, $\sigma = \sigma^{(i+1)}$, $\beta_0 = \beta_0^{(i+1)}$, and the data.

The details describing the distributions involved and the computations needed are given in Appendix B. The algorithm above gives a sequence $\theta^{(1)}, \theta^{(2)}, \ldots$, and it is easy to see that this sequence is a Markov chain.
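Step 1 of the scheme above can be sketched as a generic component-wise Gibbs sweep over the entries of $\gamma$. This is only an illustration under stated assumptions: `log_post` is a hypothetical user-supplied function returning $\log p(\gamma \mid Y)$ up to a constant (for instance from (5-1)), the scan order is taken to be systematic, and Steps 2 to 4 are omitted because the conditional distributions are specified in Appendix B. The thesis notes that its own computational implementation differs from that of Smith and Kohn (1996), so this should not be read as either of those implementations.

```python
import numpy as np

def gibbs_sweep_gamma(gamma, log_post, rng):
    """One sweep of a component-wise Gibbs sampler on gamma.

    gamma    : current 0/1 vector
    log_post : callable, log p(gamma | Y) up to an additive constant
               (hypothetical interface, assumed for this sketch)
    rng      : numpy random Generator
    """
    gamma = np.array(gamma, dtype=int)
    for j in range(gamma.size):
        with_j = gamma.copy()
        with_j[j] = 1
        without_j = gamma.copy()
        without_j[j] = 0
        diff = log_post(with_j) - log_post(without_j)
        # P(gamma_j = 1 | other components, Y), written with tanh so that
        # large |diff| does not overflow.
        p_one = 0.5 * (1.0 + np.tanh(0.5 * diff))
        gamma[j] = int(rng.random() < p_one)
    return gamma
```

Each component update needs only the ratio of two unnormalized posterior values, which is why the normalizing constant of (5-1) never has to be computed.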
As Markov chains on the $\gamma$ sequence, the relative performance of the Gibbs sampler vs. SS(2) depends, in part, on $m$, $q$, $h$, and the data set itself, and neither algorithm is uniformly superior to the other. In principle, in Step 1 of our algorithm we can use any Markov transition function that generates a chain on $\gamma$, including SS(2). We chose to work with the Gibbs sampler because it is easier to develop a regeneration scheme for this chain than for the other chains.

The output of the chain can be used in several ways. An obvious way is to use the highest posterior probability model (HPM). Unfortunately, when $q$ is bigger than around 20, the number of models, $2^q$, is very large, and it may happen that no single model has appreciable probability, and in any case, it is very difficult or impossible to identify the HPM from the Markov chain output. Barbieri and Berger (2004) argue in favor of the median probability model (MPM), which is defined to be the model that includes all variables $j$ for which the marginal inclusion probability $P(\gamma_j = 1 \mid Y) > 1/2$. We mention here the Bayesian Adaptive Sampling method of Clyde et al. (2009), which gives an algorithm for providing samples without replacement from the set of models. Under certain conditions, the algorithm has the feature that these are perfect samples without replacement; it then enables an efficient search for the HPM.
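The median probability model described above can be read off the chain output directly. The sketch below assumes the sampled model indicators are stored as rows of a 0/1 array; the function name `median_probability_model` is made up for this illustration, and the inclusion probabilities are the simple ergodic averages rather than any variance-reduced estimate from Chapter 2.

```python
import numpy as np

def median_probability_model(gamma_draws):
    """Estimate inclusion probabilities and the MPM from sampled gammas.

    gamma_draws : array of shape (n_draws, q) with 0/1 entries.
    Returns (inclusion_probabilities, indices_of_selected_variables).
    """
    draws = np.asarray(gamma_draws, dtype=float)
    inclusion = draws.mean(axis=0)             # estimate of P(gamma_j = 1 | Y)
    selected = np.flatnonzero(inclusion > 0.5) # strict inequality, as in the text
    return inclusion, selected
```

Unlike the HPM, this requires estimating only $q$ marginal probabilities, which remain well determined even when no single model is visited often.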

Uniform Ergodicity

Let $\Theta = \{0,1\}^q \times (0,\infty) \times \mathbb{R}^{q+1}$, let $\nu$ be the (prior) distribution of $\theta$ specified by (1-1b) and (1-1c), and let $\nu_y$ be the posterior distribution of $\theta$ given $Y = y$. (For the remainder of this section the subscript $h$ is suppressed since we are dealing with a single specification of this hyperparameter.) Let $K$ denote the Markov transition function for the Markov chain on $\Theta$ described in the beginning of this chapter, i.e. $K(\theta_0, \cdot)$ is the distribution of $\theta_1$ given that the current state is $\theta_0$, and let $K^n(\theta_0, \cdot)$ denote the corresponding $n$-step Markov transition function. Harris ergodicity of the chain is the condition that $\|K^n(\theta, \cdot) - \nu_y(\cdot)\| \to 0$ for all $\theta \in \Theta$, where $\|\cdot\|$ denotes supremum over all Borel subsets of $\Theta$. This condition is guaranteed by the so-called "usual regularity conditions," namely that the chain has an invariant probability measure, is irreducible, aperiodic, and Harris recurrent; see, e.g., Theorem 13.0.1 of Meyn and Tweedie (1993). These usual regularity conditions are typically easy to check; in the present context, they are implied for example if the Markov transition function has a density (with respect to the product of counting measure on $\{0,1\}^q$ and Lebesgue measure on $(0,\infty) \times \mathbb{R}^{q+1}$) which is everywhere positive, which is the case in our situation. Uniform ergodicity is the far stronger condition that there exist constants $c \in [0,1)$ and $M > 0$ such that for any $n \in \mathbb{N}$,

$$\|K^n(\theta, \cdot) - \nu_y(\cdot)\| \le M c^n \quad \text{for all } \theta \in \Theta.$$

Proposition 1 The chain driven by $K$ is uniformly ergodic.

The proof of Proposition 1 is given in Appendix C. Let $\theta_0, \theta_1, \ldots$ be a Markov chain driven by $K$, let $l$ be a real-valued function of $\theta$ (for example $l(\theta) = I(\gamma_1 = 1)$, the indicator that variable 1 is in the model), and suppose we wish to form confidence intervals for the posterior expectation of $l(\theta)$. Suppose that $E(l^2(\theta)) < \infty$. Then since the chain is uniformly ergodic, Corollary 4.2 of Cogburn (1972) implies that, with $\mathrm{Var}(l(\theta_0))$ and $\mathrm{Cov}(l(\theta_0), l(\theta_j))$ calculated under the assumption that $\theta_0$ has the stationary distribution, the series

$$\kappa^2 = \mathrm{Var}(l(\theta_0)) + 2\sum_{j=1}^{\infty} \mathrm{Cov}(l(\theta_0), l(\theta_j)) \tag{5-2}$$

converges absolutely, and if $\kappa^2 > 0$, then with $\theta_0$ having an arbitrary distribution, the estimate $\bar{l}_n = (1/n)\sum_{j=0}^{n-1} l(\theta_j)$ satisfies

$$n^{1/2}\bigl(\bar{l}_n - E[l(\theta) \mid y]\bigr) \xrightarrow{d} N(0, \kappa^2) \quad \text{as } n \to \infty.$$

The Markov chain driven by $K$ is also regenerative, and in Appendix C we give an explicit minorization condition that can be used to introduce regenerations into the chain. Functions that run the chain and implement the regeneration scheme are provided in the R package bvslr, available from http://www.stat.ufl.edu/~ebuta/BVSLR.

In Chapters 1 and 2, $\nu_h$ and $\nu_{h,y}$ refer to the prior and posterior densities, and all estimates in Chapter 2 involve ratios of these prior densities. In the Bayesian linear regression model that we are considering here, the priors $\nu_h$ on $(\gamma, \sigma, \beta_0, \beta_\gamma)$ are actually probability measures on $\{0,1\}^q \times (0,\infty) \times \mathbb{R}^{q+1}$, which in fact are not absolutely continuous with respect to the product of counting measure on $\{0,1\}^q$ and Lebesgue measure on $(0,\infty) \times \mathbb{R}^{q+1}$. For $h_1 = (w_1, g_1)$ and $h_2 = (w_2, g_2)$, the Radon-Nikodym derivative of $\nu_{h_1}$ with respect to $\nu_{h_2}$ is given by

$$\frac{d\nu_{h_1}}{d\nu_{h_2}}(\gamma, \sigma, \beta_0, \beta_\gamma) = \left(\frac{w_1}{w_2}\right)^{q_\gamma} \left(\frac{1-w_1}{1-w_2}\right)^{q-q_\gamma} \frac{\phi_{q_\gamma}\bigl(\beta_\gamma;\, 0,\, g_1\sigma^2(X_\gamma'X_\gamma)^{-1}\bigr)}{\phi_{q_\gamma}\bigl(\beta_\gamma;\, 0,\, g_2\sigma^2(X_\gamma'X_\gamma)^{-1}\bigr)}, \tag{5-3}$$

where $\phi_{q_\gamma}(u; a, V)$ is the density of the $q_\gamma$-dimensional normal distribution with mean $a$ and covariance $V$, evaluated at $u$ (Doss (2007)). It is immediate that all formulas in Chapter 2 remain valid if ratios of the form $\nu_{h_1}(\theta)/\nu_{h_2}(\theta)$ (see, e.g., equation (2-2)) are replaced by the Radon-Nikodym derivative $[d\nu_{h_1}/d\nu_{h_2}](\theta)$. Fortunately, evaluation of (5-3) requires neither matrix inversion nor calculation of a determinant, so it can be done very quickly. Note that in view of (5-3), it is not enough to have Markov chains running on the $\gamma$'s; we need Markov chains running on the $\theta$'s (or at least on $(\gamma, \sigma, \beta_\gamma)$).
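To see why (5-3) needs no inversion or determinant, note that the two normal densities share the matrix $(X_\gamma'X_\gamma)^{-1}$, so their ratio reduces to $(g_2/g_1)^{q_\gamma/2} \exp\{-\tfrac{1}{2\sigma^2}(1/g_1 - 1/g_2)\,\|X_\gamma\beta_\gamma\|^2\}$. The Python sketch below evaluates the log of (5-3) this way. The function name and argument layout are assumptions made for illustration and are not taken from the bvslr package.

```python
import numpy as np

def log_rn_derivative(gamma, sigma, beta_gamma, X, h1, h2):
    """Log of the Radon-Nikodym derivative (5-3) of nu_{h1} w.r.t. nu_{h2}.

    gamma      : 0/1 vector of length q
    sigma      : error standard deviation
    beta_gamma : coefficients of the selected columns (length q_gamma)
    X          : m x q (centered) design matrix
    h1, h2     : pairs (w, g)
    """
    gamma = np.asarray(gamma, dtype=bool)
    (w1, g1), (w2, g2) = h1, h2
    q = gamma.size
    q_gamma = int(gamma.sum())
    # beta' X'X beta = ||X beta||^2, so no inverse or determinant is needed.
    fitted = X[:, gamma] @ np.asarray(beta_gamma, dtype=float)
    quad = float(fitted @ fitted)
    return (q_gamma * np.log(w1 / w2)
            + (q - q_gamma) * np.log((1.0 - w1) / (1.0 - w2))
            + 0.5 * q_gamma * np.log(g2 / g1)
            - 0.5 * quad * (1.0 / g1 - 1.0 / g2) / sigma**2)
```

The cost is one matrix-vector product per draw, so the derivative can be evaluated for every point of a large hyperparameter grid.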
5.2 Choice of the Hyperparameter

As mentioned earlier, regarding w, the proposals in the literature are quite simple:
either w is fixed at 1/2, or a beta prior is put on w. The discussion below focuses

primarily on g, for which there is an extensive literature, and we now summarize the

portion of this literature that is directly relevant to the present work. Broadly speaking,
recommendations regarding g can be divided into three categories:

Data-Independent Choices In the simple case where the setup is given by (1-1) but without (1-1c), i.e. the true model $\gamma$ is assumed known, the posterior distribution of $\beta_\gamma$ given $\sigma$ is $N\bigl((g/(g+1))\hat{\beta}_\gamma,\ (g/(g+1))\sigma^2(X_\gamma'X_\gamma)^{-1}\bigr)$, where $\hat{\beta}_\gamma$ is the usual least squares estimate of $\beta_\gamma$. If $q$ is fixed and $m \to \infty$, under standard conditions $X_\gamma'X_\gamma/m \to \Sigma$, where $\Sigma$ is a positive definite matrix; therefore if $g$ is fixed, this distribution is approximately a point mass at $(g/(g+1))\hat{\beta}_\gamma$, so the posterior is not even consistent, and we see that a necessary condition for consistency is that $g \to \infty$. Data-independent choices of $g$ include Kass and Wasserman's (1995) recommendation of $g = m$, and Fernandez et al.'s (2001) recommendation of $g = \max(m, q^2)$, following up on Foster and George's (1994) earlier recommendation of $g = q^2$.

Liang et al. (2008) argue that, in general, data-independent choices of $g$ have the following undesirable property, referred to as the "Information Paradox." When the data give overwhelming evidence in favor of model $\gamma$ (e.g. $\|\hat{\beta}_\gamma\| \to \infty$), then using $\gamma_0$ to denote the null model (i.e. the model that includes only the intercept), the ratio of posterior probabilities $p(\gamma \mid Y)/p(\gamma_0 \mid Y)$ does not tend to infinity.
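The paradox can be made concrete with a short calculation. Assuming the posterior has the form $p(\gamma \mid Y) \propto (1+g)^{(m-1-q_\gamma)/2}[1+g(1-R_\gamma^2)]^{-(m-1)/2} w^{q_\gamma}(1-w)^{q-q_\gamma}$, as in (5-1), and that the null model has $q_{\gamma_0} = 0$ and $R_{\gamma_0}^2 = 0$, the ratio against the null model stays bounded for fixed $g$ even when the fit becomes perfect:

```latex
\[
\frac{p(\gamma \mid Y)}{p(\gamma_0 \mid Y)}
  = \left(\frac{w}{1-w}\right)^{q_\gamma}
    \frac{(1+g)^{(m-1-q_\gamma)/2}}{\bigl[1+g(1-R_\gamma^2)\bigr]^{(m-1)/2}}
  \;\longrightarrow\;
  \left(\frac{w}{1-w}\right)^{q_\gamma}(1+g)^{(m-1-q_\gamma)/2} < \infty
  \qquad \text{as } R_\gamma^2 \to 1 .
\]
```

The limit is finite for any fixed $g$, which is the behavior Liang et al. (2008) object to; it diverges only if $g$ is allowed to grow.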

Empirical Bayes (EB) Methods In global EB procedures, an estimate of g common for

all models is derived from its marginal likelihood; see George and Foster (2000). In

local EB, an estimate of g is derived for each model; see Hansen and Yu (2001).

Unfortunately, the EB method is in general computationally demanding because

the likelihood is a sum over all $2^q$ models $\gamma$, so it is practically feasible only for

relatively small values of q. Liang et al. (2008) show that the EB method is

consistent in the frequentist sense: if $\gamma_*$ is the true model, then if $g$ is chosen

via the EB method, the posterior probability $P(\gamma = \gamma_* \mid Y)$ converges to 1 as

$m \to \infty$. See Theorem 3 of Liang et al. (2008) for a precise statement. (This result

refers only to the case where w is fixed at 1/2, and only g is estimated.) Liang

et al. (2008) propose an EM algorithm for estimating g in the global EB setting. In

their algorithm, the model indicator and $\sigma$ are treated as missing data. While their

approach is certainly useful, there are some problems associated with it. Each

step in the EM algorithm involves a sum of $2^q$ terms. Unless $q$ is relatively small,

complete enumeration is not possible, and Liang et al. (2008) propose summing

only over the most significant terms. However, determining which terms these

are may be very difficult in some problems. Also, the EM algorithm gives a single

point estimate. What we do is different: we estimate the Bayes factor for all g (and

w). This enables us in particular to estimate the maximizing values; but it also

allows us to rule out large regions of the hyperparameter space. Additionally, our

method allows us to carry out sensitivity analysis. We also mention very briefly that

if we are interested only in the maximizing values, then the method proposed in

the present paper can be used to form a stochastic search algorithm. The basic

requirement for such algorithms is that we know the gradient $\partial B(h, h_1)/\partial h$. But

the same methodology used to estimate $B(h, h_1)$ can also be used to estimate

its gradient. For example, in the simple estimate (2-7), we just replace $\nu_h(\theta^{(i)})$ by

$\partial \nu_h(\theta^{(i)})/\partial h$.

Fully Bayes (FB) Methods The most common prior on g is the Zellner and Siow (1980)

prior, an inverse-gamma which results in a multivariate Cauchy prior for $\beta_\gamma$. The

family of "hyper-g" priors is introduced by Cui and George (2008) and developed

further by Liang et al. (2008), who show that these have several desirable

properties. In particular, they do not suffer from the information paradox, and

they exhibit important consistency properties.

Both the EB methods and FB methods have their own advantages and disadvantages.

Cui and George (2008) give evidence that EB methods outperform FB methods.

This is based on extensive simulation studies in cases where numerical methods are

feasible. Also, FB methods require one to specify hyperparameters of the prior on the

hyperparameter h, and different choices lead to different inferences. Additionally, in EB

methods, one uses a model with a single value of h, and the resulting inference is more

parsimonious and interpretable.

On the other hand, as with many likelihood-based methods, special care needs to

be taken when the maximizing value is at the boundary. When we use the EB method,

if the maximizing value of w is 0 or 1, the posterior assigns probability one to the null

model or full model (model that includes all variables), respectively. This is similar to

the very simple situation in which we have X ~ binomial(n, p): if we observe X = 0,

then not only is the maximum likelihood estimate of p equal to 0, but the associated

standard error estimate is also 0, and the naive Wald-type confidence interval for p

is the singleton {0}. Of course in this simple case there exist modifications to the

maximum likelihood estimate $\hat{p} = X/n$ which yield procedures that do not give rise to

this degeneracy. How to develop corresponding modifications to the maximum likelihood

estimate of the Bernoulli parameter w in the present context is a problem that is much

more difficult, but certainly worthy of investigation.
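The binomial analogy can be checked numerically. The sketch below contrasts the naive Wald interval with the Wilson score interval, which is used here only as one well-known example of a modification that avoids the degeneracy; the text does not say which modification it has in mind, and the function names are illustrative.

```python
import math

def wald_interval(x, n, z=1.96):
    """Naive Wald interval p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)."""
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return (p_hat - half, p_hat + half)

def wilson_interval(x, n, z=1.96):
    """Wilson score interval, which stays non-degenerate when x = 0."""
    p_hat = x / n
    denom = 1.0 + z**2 / n
    center = (p_hat + z**2 / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z**2 / (4.0 * n**2)) / denom
    return (center - half, center + half)
```

With `x = 0` the Wald interval collapses to the single point 0, while the Wilson interval has a strictly positive upper endpoint, equal to $z^2/(n + z^2)$.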

Scott and Berger (2010) consider the same model for variable selection that we

consider here, i.e. model (1-1), but with a Zellner-Siow prior on g, and the remaining

parameter, w, estimated by maximum likelihood. They show that if the null model has
the largest marginal likelihood, then the MLE of w is 0 and if the full model has the
largest marginal likelihood, then the MLE of w is 1. Each of these gives rise to the
degeneracy discussed above. Their result is not true in our setup, in which we do not
put a prior on g, but rather estimate both w and g by maximum likelihood. To see this,
consider a very simple example, in which Y = (2, 1, 9, 5)' and

X =

3
3
7
10.5

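We have $R^2_{\gamma=(1,1)} = 0.52$, $R^2_{\gamma=(1,0)} = 0.51$, $R^2_{\gamma=(0,1)} = 0.40$, and $R^2_{\gamma=(0,0)} = 0$. Now

$$p(Y \mid \gamma, g) = c(Y)\,\frac{(1+g)^{(3-q_\gamma)/2}}{(1 + g(1 - R_\gamma^2))^{3/2}},$$

where $c(Y)$ does not depend on $g$ or $\gamma$. Therefore,

$$(\hat{g}, \hat{w}) = \operatorname*{argmax}_{(g,w)} \sum_{\gamma} w^{q_\gamma}(1-w)^{q-q_\gamma}\,\frac{(1+g)^{(3-q_\gamma)/2}}{(1 + g(1 - R_\gamma^2))^{3/2}} \approx (.5, .2).$$

From equation (38) of Scott and Berger (2010) we know that under the Zellner-Siow null prior, with density $\pi(g)$, we have

$$\frac{P(Y \mid \gamma)}{P(Y \mid \gamma = (0,0))} = \int_0^\infty \frac{(1+g)^{(3-q_\gamma)/2}}{(1 + g(1 - R_\gamma^2))^{3/2}}\,\pi(g)\,dg = \begin{cases} .72 < 1 & \text{for } \gamma = (1,0), \\ .58 < 1 & \text{for } \gamma = (0,1), \\ .31 < 1 & \text{for } \gamma = (1,1), \end{cases}$$

and hence the null model has the strictly largest marginal likelihood among all models. Lemma 4.1 of Scott and Berger (2010) implies that, with a Zellner-Siow prior on $g$, $\hat{w} = 0$, while in our setup, the same data give $\hat{w} > 0$.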

5.3 Examples

We illustrate our methods on two examples. The first is the U.S. crime data of Vandaele

(1978), which can be found in the R library MASS under the name UScrime. We use

this data set because it has been studied in several papers already so we can compare

our results with previous analyses, and also because the number of variables is small

enough to enable a closed-form calculation of the marginal likelihood mh, so we can

compare our estimates with the gold standard. The second data set is the ozone data

originally analyzed by Breiman and Friedman (1985). We use this data set because it

involves 44 variables, even though only a few of those are important, and we wanted to

show how our methodology handles a data set with this character.

5.3.1 U.S. Crime Data

The data set gives, for each of m = 47 U.S. states, the crime rate, defined as number

of offenses per 100,000 individuals (the response variable), and q = 15 predictors

measuring different characteristics of the population, such as average number of years

of schooling, average income, unemployment rate, etc.

To be consistent with what is done in the literature, we applied a log transformation

to all variables, except the indicator variable. We took the baseline hyperparameter to be

$h_1 = (w_1, g_1) = (.5, 15)$, and our goal was to estimate $B(h, h_1)$ for the 924 values of $h$

obtained when w ranges from 0.1 to 0.91 by increments of 0.03, and g ranges from 4 to

100 by increments of 3. We used (2-13) and this estimate was based on 16 chains each

of length 10,000, corresponding to the skeleton grid of hyperparameter values

$(w, g) \in \{.3, .5, .6, .8\} \times \{15, 50, 100, 225\}$ (5-4)

for the Stage 1 samples, and 16 new chains, each of length 1000, corresponding to

the same hyperparameter values, for the Stage 2 samples. The plots in Figure 5-1

give graphs of the estimate (2-13) as w and g vary, from two different angles. These

indicate that values for w around 0.65 and for g around 20 seem appropriate, while

values of w less than .3 and values of g greater than 60 should be avoided. A side

calculation showed that, interestingly, for $g = \max\{m, q^2\}$ (= 225), the estimate of

B((w, g), (.65, 20)) is less than .008 regardless of the value of w, so this choice should

not be used for this data set. With the long chains used and the estimate that uses

control variates, the Bayes factor estimates in Figure 5-1 are extremely accurate: root

mean squared errors are less than 0.04 uniformly over the entire domain of the plot

and considerably less in the convex hull of the skeleton grid (our calculation of the root

mean squared errors used the closed-form expression for the Bayes factors based on

complete enumeration). The figure took about a half hour to generate on an Intel 2.8

GHz Q9550 running Linux. (The accuracy we obtained is overkill and the figure can be

created in a few minutes if we use more typical Markov chain lengths.)
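For concreteness, the two grids described in this paragraph can be generated as follows. This is only a sketch of the bookkeeping; the variable names are illustrative, and the estimation itself (equation (2-13)) is not shown.

```python
import itertools
import numpy as np

# Fine grid on which B(h, h_1) is estimated:
# w = 0.10, 0.13, ..., 0.91 and g = 4, 7, ..., 100.
w_grid = np.round(0.10 + 0.03 * np.arange(28), 2)
g_grid = 4 + 3 * np.arange(33)
h_grid = list(itertools.product(w_grid, g_grid))

# Skeleton grid (5-4): one Markov chain is run at each of these 16 points.
skeleton = list(itertools.product([0.3, 0.5, 0.6, 0.8], [15, 50, 100, 225]))

print(len(h_grid), len(skeleton))
```

The fine grid has 28 values of $w$ and 33 values of $g$, which gives the 924 hyperparameter values mentioned in the text, while chains are run only at the 16 skeleton points.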


Figure 5-1. Estimates of Bayes factors for the U.S. crime data. The plots give two
different views of the graph of the Bayes factor as a function of w and g
when the baseline value of the hyperparameter is given by w = 0.5 and
g = 15. The estimate is (2-13), which uses control variates.

Table 5-1 gives the posterior inclusion probabilities for each of the fifteen predictors,

i.e. $P(\gamma_i = 1 \mid y)$ for $i = 1, \ldots, 15$, under several models. Line 2 gives the inclusion

probabilities when we use model (1-1) with the values w = .65 and g = 20, which

are the values at which the graph in Figure 5-1 attains its maximum. Line 4 gives the

inclusion probabilities when the hyper-g prior "HG3" in Liang et al. (2008) is used. As

can be seen, the inclusion probabilities we obtained under the EB model are comparable

to, but somewhat larger than, the probabilities when the HG3 prior is used. This is not

surprising since our model allows w to be chosen, and the data-driven choice gives a

value (.65) greater than the value w = .5 used in Liang et al. (2008). (Table 2 of Liang

et al. (2008) gives a comparison of posterior inclusion probabilities for a total of ten

models taken from the literature.) Line 3 of Table 5-1 gives the inclusion probabilities

under model (1-1) when we use w = .5 and the value of g that maximizes the likelihood

with w constrained to be .5. It is interesting to note that the inclusion probabilities are

then strikingly close to those under the HG3 model.

Table 5-1. Posterior inclusion probabilities for the fifteen predictor variables in the U.S.
crime data set, under three models. Names of the variables are as in Table 2
of Liang et al. (2008) (but all variables except for the binary variable S have
been log transformed).
               Age     S    Ed   Ex0   Ex1    LF     M     N    NW    U1    U2     W     X  Prison  Time
EB (20, .65)   .93   .39   .99   .70   .51   .34   .35   .52   .83   .40   .76   .55  1.00     .96   .55
EB (20, .5)    .85   .29   .97   .67   .45   .22   .22   .38   .70   .27   .62   .38  1.00     .90   .39
HG3            .84   .29   .97   .66   .47   .23   .23   .39   .69   .27   .61   .38   .99     .89   .38

Figure 5-2 gives plots of the posterior inclusion probabilities for Variables 1 and 6,

as w and g vary. The literature recommends various choices for g [in particular g = m

in Kass and Wasserman (1995), $g = q^2$ in Foster and George (1994), $g = \max(m, q^2)$

in Fernandez et al. (2001)], and posterior inclusion probabilities for all these choices

combined with any choice of w can be read directly from the figure. The extent to which

these probabilities change with the choice of g is quite striking.


Figure 5-2. Estimates of posterior inclusion probabilities for Variables 1 and 6 for the
U.S. crime data. The estimate used is (2-16).

Selection of the skeleton points was discussed at the end of Chapter 3, and we now

return to this issue. Consider the Bayes factor estimate based on the skeleton (5-4),

which was chosen in an ad-hoc manner. The left panel in Figure 5-3 gives a plot of the

variance of this estimate, as a function of h. As can be seen from the plot, the variance

is greatest in the region where g is small and w is large. We changed the skeleton

from (5-4) to

$(w, g) \in \{.5, .7, .8, .9\} \times \{10, 15, 50, 100\}$ (5-5)

and reran the algorithm. The variance for the estimate based on (5-5) is given by the

right panel of Figure 5-3, from which we see that the maximum variance has been

reduced by a factor of about 9.

Figure 5-3. Variance functions for two versions of the Bayes factor estimate. The left panel is for the estimate based on the skeleton (5-4). The points in this skeleton were shifted to better cover the problematic region near the back of the plot ($g$ small and $w$ large), creating the skeleton (5-5). The maximum variance is then reduced by a factor of 9 (right panel).

5.3.2 Ozone Data

This data set was originally analyzed in Breiman and Friedman (1985), was used in many papers since, and was recently analyzed in a Bayesian framework by Casella and Moreno (2006) and Liang et al. (2008). The data consist of daily measurements of ozone concentration and eight meteorological quantities in the Los Angeles basin for 330 days of 1976. The response variable is the daily ozone concentration, and we follow Liang et al. (2008) in considering 44 possible predictors: the eight meteorological measurements, their squares, and their two-way interactions. Liang et al. (2008) give

a review of the literature on priors for the hyperparameter g and advocate the hyper-g

priors. They compare 10 variable selection techniques (including three hyper-g priors)

on this data set by using a cross-validation procedure: the data set is randomly split in

two halves, one of which (the training sample) is used for selecting the model (for the

Bayesian methods this is the highest probability model), while the other (the validation

sample) is used for measuring the predictive accuracy. The predictive accuracy of

method j is measured through the square-root of the mean squared prediction error

(RMSE) of the selected model $\hat{\gamma}_j$, defined by $\mathrm{RMSE}(\hat{\gamma}_j) = \bigl(n_V^{-1} \sum_{i \in V} (Y_i - \hat{Y}_i)^2\bigr)^{1/2}$.

Here, $V$ is the validation set, $n_V$ is its size, and $\hat{Y}_i$ is the fitted value of observation $i$

under model $\hat{\gamma}_j$. Liang et al. (2008) point out the curious fact that the RMSE's of the 10

methods are all very close (they range from 4.4 to 4.6), but the selected models differ

greatly in the number of variables selected, which range from 3 to 18.
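The out-of-sample criterion is simple to compute once fitted values on the validation set are available. The sketch below assumes the fitted values have already been produced by whatever model was selected on the training half; the function name is illustrative.

```python
import numpy as np

def validation_rmse(y_valid, y_fitted):
    """Square root of the mean squared prediction error on the validation set."""
    y_valid = np.asarray(y_valid, dtype=float)
    y_fitted = np.asarray(y_fitted, dtype=float)
    return float(np.sqrt(np.mean((y_valid - y_fitted) ** 2)))
```

Because the criterion depends on the selected model only through its fitted values, models of quite different sizes can produce nearly identical RMSE values, which is the phenomenon noted above.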

We investigated the performance of our methodology using a split of the data into training and validation samples identical to the one used by Liang et al. (2008). We took the baseline hyperparameter to be the pair $h_1 = (w_1, g_1) = (.2, 50)$ and the skeleton grid of hyperparameters to consist of the 16 pairs

$$(w, g) \in \{.1, .2, .3, .5\} \times \{15, 50, 100, 150\}.$$

To identify the value of $h$ that maximizes the Bayes factor $B(h, h_1)$, we estimated this quantity for a grid of the 750 values of $h$ obtained when $w$ ranges from .01 to .5 by increments of .02, and $g$ ranges from 5 to 150 by increments of 5. These estimates were based on 16 chains each of length 10,000, corresponding to the skeleton grid of hyperparameter values for the Stage 1 samples, and 16 new chains, each of length 1000, corresponding to the same hyperparameter values, for the Stage 2 samples. Figure 5-4 gives a plot of these estimates of $B(h, h_1)$ as a function of $w$ and $g$. The standard error is less than .014 over the entire range of the plot.


Figure 5-4. Estimates of Bayes factors for the ozone data. The plots give two different
views of the graph of the Bayes factor as a function of w and g when the
baseline value of the hyperparameter is given by w = .2 and g = 50

The value of $h$ at which the maximum of $B(h, h_1)$ is attained is $\hat{h} = (.13, 75)$. We ran a new chain of length 100,000 corresponding to this value of $h$, and based on it we estimated the highest probability model to be the model containing the 4 variables dpg, ibt, vh.ibh, and humid.ibt (see Appendix D for a description of these variables). This model yields an out-of-sample RMSE of 4.5. Since the empirical Bayes choice of $w$ is relatively small ($\hat{w} = .13$), it is not surprising that the highest probability model includes only 4 variables, fewer than in any of the hyper-g models recommended by Liang et al. (2008), which all include at least 6 variables. But it is interesting to note that nevertheless, this model gives an RMSE that is essentially the same as the RMSE of any of the other models.

We applied the regeneration algorithm described in Appendix B to the chain
corresponding to the hyperparameter h = (.13, 75) deemed optimal by our previous
analysis. We ran the chain until R = 3000 regenerations occurred, which took 85,000
iterations. From the output, we obtained estimates of the posterior inclusion probabilities
for every one of the 44 predictors, and formed the corresponding 95% confidence
intervals, using the regeneration method discussed in Chapter 3. These are displayed in
Figure 5-5.

Our choice of R was arbitrary, but this choice should ultimately be based on the
degree of accuracy one desires for the estimates of the quantities of interest. We
considered our choice to be satisfactory for this particular analysis since the confidence
intervals for the posterior inclusion probabilities for the 44 predictors have margins of
error of at most 1%. Note that our chain regenerates relatively often, with the average length of a tour ($\bar{N}$) being about 28. Mykland et al. (1995) recommend that one check that the coefficient of variation $CV(\bar{N}) = (\mathrm{Var}(\bar{N}))^{1/2}/E(\bar{N})$ of the average tour length is less than .1 before deeming $\kappa^2$ to be estimated properly by $\hat{\kappa}^2$. Their criterion seems to be met here since the strongly consistent estimator $\widehat{CV}(\bar{N}) = \bigl(\sum_{t=1}^{R}(N_t - \bar{N})^2/(R\bar{N})^2\bigr)^{1/2}$ equals .02.
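The diagnostic in this paragraph can be computed from the recorded tour lengths alone. The sketch below assumes the tour lengths $N_1, \ldots, N_R$ have been stored in an array; the function name is made up for illustration, and the formula is the estimator $(\sum_t (N_t - \bar{N})^2/(R\bar{N})^2)^{1/2}$ quoted above.

```python
import numpy as np

def tour_length_cv(tour_lengths):
    """Estimated coefficient of variation of the average tour length."""
    N = np.asarray(tour_lengths, dtype=float)
    R = N.size                      # number of regenerations (tours)
    N_bar = N.mean()                # average tour length
    return float(np.sqrt(np.sum((N - N_bar) ** 2) / (R * N_bar) ** 2))
```

The estimate shrinks at rate $R^{-1/2}$, so if the criterion is not met the natural remedy is to run the chain for more regenerations.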

Figure 5-5. 95% confidence intervals of the posterior inclusion probabilities for the 44
predictors in the ozone data when the hyperparameter value is given by
w = .13 and g = 75. A table giving the correspondence between the integers
1-44 and the predictors is given in Appendix D.

CHAPTER 6
DISCUSSION

The following fact is obvious, but it may be worthwhile to state it explicitly. If $h_1$ is

fixed, maximizing $B(h, h_1)$ and maximizing the marginal likelihood $m_h$ are equivalent.

Choosing the value of $h$ that maximizes $m_h$ is by definition the empirical Bayes method.

Thus, the development in Chapter 2 can be used to implement empirical Bayes

methods.

Our methodology for dealing with the sensitivity analysis and model selection

problems discussed in Chapter 1 can be applied to many classes of Bayesian models.

In addition to the usual parametric models, we mention also Bayesian nonparametric

models involving mixtures of Dirichlet processes (Antoniak (1974)), in which one

of the hyperparameters is the so-called total mass parameter; very briefly, this

hyperparameter controls the extent to which the nonparametric model differs from a
purely parametric model. (Among the many papers that use such models, we mention in

particular Burr and Doss (2005), who give a more detailed discussion of the role of the

total mass parameter.) The approach developed in Sections 2.1 and 2.2 can be used to

select this parameter.

When the dimension of $h$ is low, it will be possible to plot $B(h, h_1)$, or at least plot

it as h varies along some of its dimensions. Empirical Bayes methods are notoriously

difficult to implement when the dimension of the hyperparameter h is high. In this case, it

is possible to use the methods developed in Sections 2.1 and 2.2 to enable approaches

based on stochastic search algorithms. These require the calculation of the gradient

$\partial B(h, h_1)/\partial h$. We note that the same methodology used to estimate $B(h, h_1)$ can also

be used to estimate its gradient. For example, in (2-7), $\nu_h(\theta^{(i)})$ is simply replaced by

$\partial \nu_h(\theta^{(i)})/\partial h$.

APPENDIX A
PROOF OF RESULTS FROM CHAPTER 1

Proof of Theorem 1

We begin by writing

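$$\sqrt{n}\,\bigl(\hat{B}(h, h_1, \hat{d}) - B(h, h_1)\bigr) = \sqrt{n}\,\bigl(\hat{B}(h, h_1, \hat{d}) - \hat{B}(h, h_1, d)\bigr) + \sqrt{n}\,\bigl(\hat{B}(h, h_1, d) - B(h, h_1)\bigr). \tag{A.1}$$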
The second term on the right side of (A.1) involves randomness coming only from the
second stage of sampling. This term was analyzed by Doss (2010), who showed that

it is asymptotically normal, with mean 0 and variance $\tau^2(h)$. The first term ostensibly
involves randomness from both Stage 1 and Stage 2 sampling. However, as will emerge

from our proof, the randomness from Stage 2 is of lower order, and effectively all the

randomness is from Stage 1. This randomness is non-negligible. We mention here the
often-cited work of Geyer (1994) (whose nice results we use in the present paper). In

the context of a setup very similar to ours, his Theorem 4 states that using an estimated
d and using the true d results in the same asymptotic variance. From our proof (refer

also to Remark 2 of Section 2.1), we see that this statement is not correct.

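To analyze the first term on the right side of (A.1), we define the function $F(u) = \hat{B}(h, h_1, u)$, where $u = (u_2, \ldots, u_k)'$ is a real vector with $u_l > 0$, $l = 2, \ldots, k$. Then, by the Taylor series expansion of $F$ about $d$, we get

$$\sqrt{n}\,\bigl(\hat{B}(h, h_1, \hat{d}) - \hat{B}(h, h_1, d)\bigr) = \sqrt{n}\,\bigl(F(\hat{d}) - F(d)\bigr) = \sqrt{n}\,\nabla F(d)'(\hat{d} - d) + \frac{\sqrt{n}}{2}(\hat{d} - d)'\nabla^2 F(d^*)(\hat{d} - d), \tag{A.2}$$

where $d^*$ is between $d$ and $\hat{d}$.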

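First, we show that the gradient $\nabla F(d) = (\partial F(d)/\partial d_2, \ldots, \partial F(d)/\partial d_k)'$ converges almost surely to a finite constant. For $j = 2, \ldots, k$, the $(j-1)$th component of this vector converges almost surely since, with the SLLN assumed to hold for the Markov chains used, we have

$$\begin{aligned}
[\nabla F(d)]_{j-1} &= \sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{n_j\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})}{d_j^2\,\bigl(\sum_{s=1}^{k} n_s \nu_{h_s}(\theta_i^{(l)})/d_s\bigr)^2} \\
&= \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{a_j\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})}{d_j^2\,\bigl(\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s\bigr)^2} \\
&\xrightarrow{\text{a.s.}} \frac{1}{d_j^2}\sum_{l=1}^{k} a_l \int \frac{a_j\,\nu_h(\theta)\,\nu_{h_j}(\theta)}{\bigl(\sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/d_s\bigr)^2}\,\nu_{h_l,y}(\theta)\,d\theta \\
&= \frac{B(h, h_1)}{d_j^2}\int \frac{a_j\,\nu_{h_j}(\theta)}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/d_s}\,\nu_{h,y}(\theta)\,d\theta =: [c(h)]_{j-1}.
\end{aligned} \tag{A.3}$$

The last integral is clearly finite, and the last equality in (A.3) indicates that $c(h)$ denotes the constant vector to which $\nabla F(d)$ converges.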
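Next, we show that the random Hessian matrix $\nabla^2 F(d^*)$ of second-order derivatives of $F$ evaluated at $d^*$ is bounded in probability. To this end, it suffices to show that each element of this matrix, say $[\nabla^2 F(d^*)]_{t-1,j-1}$, where $t, j \in \{2, \ldots, k\}$, is $O_p(1)$. Since $\|d^* - d\| \le \|\hat{d} - d\| \xrightarrow{p} 0$, it follows that $d^* \xrightarrow{p} d$. Let $\epsilon \in (0, \min(d_2, \ldots, d_k))$. Then we have $P(\|d^* - d\| \le \epsilon) \to 1$. We now show that, on the set $\{\|d^* - d\| \le \epsilon\}$, $\nabla^2 F(d^*)$ is bounded in probability. Let

$$I = I(\|d^* - d\| \le \epsilon).$$

For $t \ne j$, we have

$$\begin{aligned}
[\nabla^2 F(d^*)]_{t-1,j-1} \cdot I &= \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{2\,a_j a_t\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})\,\nu_{h_t}(\theta_i^{(l)})}{(d_j^*)^2 (d_t^*)^2 \bigl[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s^*\bigr]^3} \cdot I \\
&\le \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{2\,a_j a_t\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})\,\nu_{h_t}(\theta_i^{(l)})}{(d_j - \epsilon)^2 (d_t - \epsilon)^2 \bigl[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/(d_s + \epsilon)\bigr]^3} \\
&\xrightarrow{\text{a.s.}} \frac{2}{(d_j - \epsilon)^2 (d_t - \epsilon)^2}\sum_{l=1}^{k} a_l \int \frac{a_j a_t\,\nu_h(\theta)\,\nu_{h_j}(\theta)\,\nu_{h_t}(\theta)}{\bigl[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/(d_s + \epsilon)\bigr]^3}\,\nu_{h_l,y}(\theta)\,d\theta \\
&= \frac{2\,B(h, h_1)}{(d_j - \epsilon)^2 (d_t - \epsilon)^2}\int \left\{ \frac{a_j a_t\,\nu_{h_j}(\theta)\,\nu_{h_t}(\theta)\,\sum_{l=1}^{k} a_l \nu_{h_l}(\theta)/d_l}{\bigl[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/(d_s + \epsilon)\bigr]^3} \right\} \nu_{h,y}(\theta)\,d\theta.
\end{aligned} \tag{A.4}$$

Note that the expression inside the braces in (A.4) is clearly bounded above by a constant, so expression (A.4) is finite. Similarly, for $t = j$,

$$\begin{aligned}
\bigl|[\nabla^2 F(d^*)]_{j-1,j-1}\bigr| \cdot I &= \left| \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{2\,a_j\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})\,\bigl(\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s^* - a_j \nu_{h_j}(\theta_i^{(l)})/d_j^*\bigr)}{(d_j^*)^3 \bigl(\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s^*\bigr)^3} \right| \cdot I \\
&\le \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{2\,a_j\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})}{(d_j^*)^3 \bigl(\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s^*\bigr)^2} \cdot I \\
&\le \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{2\,a_j\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})}{(d_j - \epsilon)^3 \bigl[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/(d_s + \epsilon)\bigr]^2} \\
&\xrightarrow{\text{a.s.}} \frac{2\,B(h, h_1)}{(d_j - \epsilon)^3}\int \frac{a_j\,\nu_{h_j}(\theta)\,\sum_{l=1}^{k} a_l \nu_{h_l}(\theta)/d_l}{\bigl[\sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/(d_s + \epsilon)\bigr]^2}\,\nu_{h,y}(\theta)\,d\theta.
\end{aligned}$$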

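Again, this limit is a finite constant by the same reasoning we used earlier. Since $P(\|d^* - d\| \le \epsilon) \to 1$, it follows that $\nabla^2 F(d^*)$ is bounded in probability. Now, by combining (A.1) and (A.2), we obtain

$$\begin{aligned}
\sqrt{n}\,\bigl(\hat{B}(h, h_1, \hat{d}) - B(h, h_1)\bigr) &= \sqrt{\frac{n}{N}}\,\nabla F(d)'\,\sqrt{N}(\hat{d} - d) \\
&\quad + \frac{1}{2\sqrt{N}}\sqrt{\frac{n}{N}}\,\bigl[\sqrt{N}(\hat{d} - d)\bigr]'\,\nabla^2 F(d^*)\,\bigl[\sqrt{N}(\hat{d} - d)\bigr] \\
&\quad + \sqrt{n}\,\bigl(\hat{B}(h, h_1, d) - B(h, h_1)\bigr) \\
&= \sqrt{q}\,c(h)'\,\sqrt{N}(\hat{d} - d) + \sqrt{n}\,\bigl(\hat{B}(h, h_1, d) - B(h, h_1)\bigr) + o_p(1),
\end{aligned} \tag{A.5}$$

where the last line follows from the previously established fact that $\nabla F(d) \xrightarrow{\text{a.s.}} c(h)$, and the assumptions of Theorem 1 that $n/N \to q$ and that $\sqrt{N}(\hat{d} - d)$ converges in distribution (hence is $O_p(1)$). Because the two sampling stages (for estimating $d$ and $B(h, h_1)$) are assumed to be independent, using the assumption that $\sqrt{N}(\hat{d} - d) \xrightarrow{d} N(0, \Sigma)$ in conjunction with the result $\sqrt{n}\,(\hat{B}(h, h_1, d) - B(h, h_1)) \xrightarrow{d} N(0, \tau^2(h))$ established in Theorem 1 of Doss (2010) under conditions A1 and A2, we conclude that

$$\sqrt{n}\,\bigl(\hat{B}(h, h_1, \hat{d}) - B(h, h_1)\bigr) \xrightarrow{d} N\bigl(0,\ q\,c(h)'\Sigma\,c(h) + \tau^2(h)\bigr).$$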

Proof of Theorem 2

We begin by writing

(A.6)

^d- B( )) = ^( ) (d) + (d B(h, h d )),
B(h,hi)) = ()- i(d) +

where the second term on the right side of (A.6) was analyzed by Doss (2010) who showed that it is asymptotically normal, with mean 0 and variance $\sigma^2(h)$. Our plan is to show that $\hat{\beta}(\hat{d})$ and $\hat{\beta}(d)$ converge in probability to the same limit, which we denote $\beta_{\lim}$. We then expand the first term on the right side of (A.6) by writing
(A.7)

(7(d) -/() vd(()- i,,3 + Ji (7~,, Iim) + (7m (d))

v/n(B(h, hi, l)

Our proof is organized as follows:

* We note that the third term on the right side of (A.7) was shown to converge to 0 in probability by Doss (2010).

* We will show the first term on the right side of (A.7) also converges to 0 in probability.

* The second term on the right side of (A.7) involves randomness from both Stage 1 and Stage 2. However, we will show that the randomness from Stage 2 is asymptotically negligible, and that this term is asymptotically equivalent to an expression of the form $w(h)'(\hat{d} - d)$, where $w(h)$ is a deterministic vector. This will show that the second term is asymptotically normal.

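Now we prove that the first term on the right side of (A.7) is $o_p(1)$, and to do this we begin by showing that $\hat\beta(\hat d)$ and $\hat\beta(d)$ converge in probability to the same limit. Let $Z$ be the $n \times k$ matrix whose transpose is

$$
Z' = \begin{pmatrix}
1 & \cdots & 1 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \\
Z^{(2)}_{1,1} & \cdots & Z^{(2)}_{n_1,1} & Z^{(2)}_{1,2} & \cdots & Z^{(2)}_{n_2,2} & \cdots & Z^{(2)}_{1,k} & \cdots & Z^{(2)}_{n_k,k} \\
\vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
Z^{(k)}_{1,1} & \cdots & Z^{(k)}_{n_1,1} & Z^{(k)}_{1,2} & \cdots & Z^{(k)}_{n_2,2} & \cdots & Z^{(k)}_{1,k} & \cdots & Z^{(k)}_{n_k,k}
\end{pmatrix}
\tag{A.8}
$$

and let $Y$ be the vector

$$
Y = (Y_{1,1}, \dots, Y_{n_1,1},\; Y_{1,2}, \dots, Y_{n_2,2},\; \dots,\; Y_{1,k}, \dots, Y_{n_k,k})'.
\tag{A.9}
$$

Let $\hat Z$ be the $n \times k$ matrix corresponding to $Z$ when we replace $d$ by $\hat d$. Similarly, $\hat Y$ is like $Y$, but using $\hat d$ for $d$.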

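For fixed $j, j' \in \{2, \dots, k\}$, consider the function

$$
G(u) = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}
\frac{\nu_{h_j}(\theta_i^{(l)})/u_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/u_s}
\cdot
\frac{\nu_{h_{j'}}(\theta_i^{(l)})/u_{j'} - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/u_s},
\tag{A.10}
$$

where $u = (u_2, \dots, u_k)'$ and $u_l > 0$, for $l = 2, \dots, k$. (On the right side of (A.10), $u_1$ is taken to be 1.) Note that setting $u = d$ gives $G(d) = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Z^{(j)}_{i,l} Z^{(j')}_{i,l}$. By the Mean Value Theorem, we know that there exists a $d^*$ between $d$ and $\hat d$ such that

$$
G(\hat d) = G(d) + \nabla G(d^*)'(\hat d - d) = R_{j,j'} + \nabla G(d^*)'(\hat d - d) + o_p(1).
$$

Note that the last equality above comes from applying the SLLN. Next we show that $\nabla G(d^*) = O_p(1)$. We have three cases for $t = 2, \dots, k$.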

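In the displays below, inside the double sums $\nu_s$ abbreviates $\nu_{h_s}(\theta_i^{(l)})$, $D^* = \sum_{s=1}^{k} a_s \nu_s / d_s^*$ (with $d_1^* = 1$), and $D_\epsilon = \sum_{s=1}^{k} a_s \nu_s/(d_s + \epsilon)$. The second inequality in each case holds on the event $\{\|d^* - d\| \le \epsilon\}$.

Case 1: $t \notin \{j, j'\}$. We have

$$
\bigl|[\nabla G(d^*)]_{t-1}\bigr|
= \Biggl|\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}
\frac{2\,a_t\nu_t\,(\nu_j/d_j^* - \nu_1)(\nu_{j'}/d_{j'}^* - \nu_1)}{d_t^{*2}\,D^{*3}}\Biggr|
\le \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}
\frac{2\,a_t\nu_t\,[\nu_j/(d_j-\epsilon) + \nu_1]\,[\nu_{j'}/(d_{j'}-\epsilon) + \nu_1]}{(d_t-\epsilon)^2\,D_\epsilon^{3}}.
$$

The term inside the inner sum is bounded, so we can conclude that $[\nabla G(d^*)]_{t-1}$ is bounded in probability, as it is bounded by an $O_p(1)$ term on an event whose probability tends to one.

Case 2: $j \ne j'$, $t \in \{j, j'\}$, say $t = j$. We have

$$
\bigl|[\nabla G(d^*)]_{j-1}\bigr|
= \Biggl|\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}
\Biggl\{\frac{2\,a_j\nu_j\,(\nu_j/d_j^* - \nu_1)(\nu_{j'}/d_{j'}^* - \nu_1)}{d_j^{*2}\,D^{*3}}
- \frac{\nu_j\,(\nu_{j'}/d_{j'}^* - \nu_1)}{d_j^{*2}\,D^{*2}}\Biggr\}\Biggr|
$$

$$
\le \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}
\Biggl\{\frac{2\,a_j\nu_j\,[\nu_j/(d_j-\epsilon) + \nu_1]\,[\nu_{j'}/(d_{j'}-\epsilon) + \nu_1]}{(d_j-\epsilon)^2\,D_\epsilon^{3}}
+ \frac{\nu_j\,[\nu_{j'}/(d_{j'}-\epsilon) + \nu_1]}{(d_j-\epsilon)^2\,D_\epsilon^{2}}\Biggr\},
$$

and this is bounded in probability.

Case 3: $t = j = j'$. We have

$$
\bigl|[\nabla G(d^*)]_{j-1}\bigr|
= \Biggl|\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}
\Biggl\{\frac{2\,a_j\nu_j\,(\nu_j/d_j^* - \nu_1)^2}{d_j^{*2}\,D^{*3}}
- \frac{2\,\nu_j\,(\nu_j/d_j^* - \nu_1)}{d_j^{*2}\,D^{*2}}\Biggr\}\Biggr|
$$

$$
\le \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}
\Biggl\{\frac{2\,a_j\nu_j\,[\nu_j/(d_j-\epsilon) + \nu_1]^2}{(d_j-\epsilon)^2\,D_\epsilon^{3}}
+ \frac{2\,\nu_j\,[\nu_j/(d_j-\epsilon) + \nu_1]}{(d_j-\epsilon)^2\,D_\epsilon^{2}}\Biggr\},
$$

and again this is bounded in probability.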

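Therefore

$$
G(\hat d) = R_{j,j'} + \nabla G(d^*)'(\hat d - d) + o_p(1) = R_{j,j'} + O_p(1)\,o_p(1) + o_p(1) \xrightarrow{p} R_{j,j'}.
$$

Similar arguments extend to the case $j = 1$ or $j' = 1$. By the fact that $R$ is assumed invertible, we have

$$
n(\hat Z'\hat Z)^{-1} \xrightarrow{p} R^{-1}.
\tag{A.11}
$$

In a similar way, it can be shown that

$$
\hat Z'\hat Y/n \xrightarrow{p} v,
\tag{A.12}
$$

where $v$ is the same limit vector to which $Z'Y/n$ has been proved to converge in Doss (2010). Combining (A.11) and (A.12) we have

$$
\bigl(\hat\beta_0(\hat d), \hat\beta(\hat d)\bigr)' = \bigl[n(\hat Z'\hat Z)^{-1}\bigr]\bigl[\hat Z'\hat Y/n\bigr] \xrightarrow{p} R^{-1}v = (\beta_{0,\lim}, \beta_{\lim})'.
$$

Let $e(j,l) = E(Z^{(j)}_{1,l})$. We now have

$$
\sqrt{n}\,\bigl(\hat I^{\hat d}_{\hat\beta(\hat d)} - \hat I^{\hat d}_{\beta_{\lim}}\bigr)
= \sum_{j=2}^{k}\bigl(\beta_{j,\lim} - \hat\beta_j(\hat d)\bigr)\,\sqrt{n}\,\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\hat Z^{(j)}_{i,l}
= \sum_{j=2}^{k} o_p(1)\sum_{l=1}^{k} a_l^{1/2}\; n_l^{1/2}\Biggl(\frac{1}{n_l}\sum_{i=1}^{n_l}\hat Z^{(j)}_{i,l} - e(j,l)\Biggr),
\tag{A.13}
$$

where centering by $e(j,l)$ is permitted because $\sum_{l=1}^{k} a_l\, e(j,l) = 0$. To show that (A.13) converges to 0 in probability it suffices to show that for each $l$ and $j$

$$
n_l^{1/2}\Biggl(\frac{1}{n_l}\sum_{i=1}^{n_l}\hat Z^{(j)}_{i,l} - e(j,l)\Biggr) = O_p(1).
\tag{A.14}
$$

For fixed $j \in \{2, \dots, k\}$ and $l \in \{1, \dots, k\}$, define

$$
H(u) = n_l^{-1/2}\sum_{i=1}^{n_l}\frac{\nu_{h_j}(\theta_i^{(l)})/u_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\nu_{h_s}(\theta_i^{(l)})/u_s}
$$

for $u = (u_2, \dots, u_k)'$ with $u_l > 0$, $l = 2, \dots, k$, $u_1 = 1$. Note that $H(\hat d) = n_l^{-1/2}\sum_{i=1}^{n_l}\hat Z^{(j)}_{i,l}$. To see why (A.14) is true we begin by writing

$$
n_l^{1/2}\Biggl(\frac{1}{n_l}\sum_{i=1}^{n_l}\hat Z^{(j)}_{i,l} - e(j,l)\Biggr)
= n_l^{-1/2}\sum_{i=1}^{n_l}\bigl(\hat Z^{(j)}_{i,l} - Z^{(j)}_{i,l}\bigr) + n_l^{1/2}\Biggl(\frac{1}{n_l}\sum_{i=1}^{n_l} Z^{(j)}_{i,l} - e(j,l)\Biggr)
= H(\hat d) - H(d) + O_p(1).
\tag{A.15}
$$

Note that the fact that $n_l^{1/2}\bigl(\sum_{i=1}^{n_l}[Z^{(j)}_{i,l} - e(j,l)]/n_l\bigr) = O_p(1)$, which was used to establish the second equality in (A.15), is proved in Doss (2010). Now, applying the Mean Value Theorem to the function $H$, we know that there exists a point $d^*$ between $d$ and $\hat d$ such that (A.15) becomes

$$
n_l^{1/2}\Biggl(\frac{1}{n_l}\sum_{i=1}^{n_l}\hat Z^{(j)}_{i,l} - e(j,l)\Biggr)
= \nabla H(d^*)'(\hat d - d) + O_p(1)
= \Bigl(\frac{n_l}{N}\Bigr)^{1/2} n_l^{-1/2}\,\nabla H(d^*)'\,\sqrt{N}(\hat d - d) + O_p(1),
\tag{A.16}
$$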

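so that the right side of (A.16) is $O_p(1)$. To see this last assertion, note that the $(t-1)$th element of the gradient of $H$, $[\nabla H(d)]_{t-1}$, is given by

$$
[\nabla H(d)]_{t-1} =
\begin{cases}
\displaystyle n_l^{-1/2}\sum_{i=1}^{n_l}\frac{(\nu_j/d_j - \nu_1)\,a_t\nu_t}{d_t^{2}\,D^{2}} & \text{if } t \ne j, \\[3ex]
\displaystyle n_l^{-1/2}\sum_{i=1}^{n_l}\Biggl\{\frac{(\nu_j/d_j - \nu_1)\,a_j\nu_j}{d_j^{2}\,D^{2}} - \frac{\nu_j}{d_j^{2}\,D}\Biggr\} & \text{if } t = j,
\end{cases}
$$

where inside the sums $\nu_s$ abbreviates $\nu_{h_s}(\theta_i^{(l)})$ and $D = \sum_{s=1}^{k} a_s\nu_s/d_s$. Let $\epsilon \in (0, \min(d_2, \dots, d_k))$. Then $P(\|d^* - d\| \le \epsilon) \to 1$. Writing $D^* = \sum_{s=1}^{k} a_s\nu_s/d_s^*$ and $D_\epsilon = \sum_{s=1}^{k} a_s\nu_s/(d_s+\epsilon)$, for $t \ne j$ we have, on the event $\{\|d^* - d\| \le \epsilon\}$,

$$
\begin{aligned}
\bigl|n_l^{-1/2}[\nabla H(d^*)]_{t-1}\bigr|
&\le \frac{1}{n_l}\sum_{i=1}^{n_l}\frac{(\nu_j/d_j^* + \nu_1)\,a_t\nu_t}{d_t^{*2}\,D^{*2}} \\
&\le \frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_j\,a_t\nu_t}{(d_j-\epsilon)(d_t-\epsilon)^2\,D_\epsilon^{2}}
+ \frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_1\,a_t\nu_t}{(d_t-\epsilon)^2\,D_\epsilon^{2}} \\
&= O_p(1) + O_p(1) = O_p(1).
\end{aligned}
$$

Similarly,

$$
\bigl|n_l^{-1/2}[\nabla H(d^*)]_{j-1}\bigr|
\le \frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_j}{(d_j-\epsilon)^2\,D_\epsilon}
+ \frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_j\,a_j\nu_j}{(d_j-\epsilon)^3\,D_\epsilon^{2}}
+ \frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_1\,a_j\nu_j}{(d_j-\epsilon)^2\,D_\epsilon^{2}},
$$

and the right side of this inequality is $O_p(1)$, as it is the sum of three $O_p(1)$ terms. So (A.16) now implies that

$$
n_l^{1/2}\Biggl(\frac{1}{n_l}\sum_{i=1}^{n_l}\hat Z^{(j)}_{i,l} - e(j,l)\Biggr) = \Bigl(\frac{n_l}{N}\Bigr)^{1/2} O_p(1)\,O_p(1) + O_p(1) = O_p(1).
$$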
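We now consider $\sqrt{n}\,\bigl(\hat I^{\hat d}_{\beta_{\lim}} - \hat I^{d}_{\beta_{\lim}}\bigr)$, the middle term in (A.7). Define

$$
K(u) = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Biggl[
\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\nu_{h_s}(\theta_i^{(l)})/u_s}
- \sum_{j=2}^{k}\beta_{j,\lim}\,\frac{\nu_{h_j}(\theta_i^{(l)})/u_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\nu_{h_s}(\theta_i^{(l)})/u_s}\Biggr],
$$

where $u = (u_2, \dots, u_k)'$, and $u_l > 0$ for $l = 2, \dots, k$. By Taylor series expansion, we have

$$
\sqrt{n}\,\bigl(\hat I^{\hat d}_{\beta_{\lim}} - \hat I^{d}_{\beta_{\lim}}\bigr)
= \sqrt{n}\,\nabla K(d)'(\hat d - d) + \frac{\sqrt{n}}{2}(\hat d - d)'\,\nabla^2 K(d^*)\,(\hat d - d),
\tag{A.17}
$$

where $d^*$ is between $d$ and $\hat d$. We now focus our attention on $\nabla K(d)$. For $t = 2, \dots, k$ we have

$$
[\nabla K(d)]_{t-1} = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Biggl\{
\frac{\nu_h\,a_t\nu_t}{d_t^{2}\,D^{2}}
- \sum_{j=2}^{k}\beta_{j,\lim}\,\frac{(\nu_j/d_j - \nu_1)\,a_t\nu_t}{d_t^{2}\,D^{2}}
+ \beta_{t,\lim}\,\frac{\nu_t}{d_t^{2}\,D}\Biggr\},
$$

where inside the double sum $\nu_h$ and $\nu_s$ abbreviate $\nu_h(\theta_i^{(l)})$ and $\nu_{h_s}(\theta_i^{(l)})$, and $D = \sum_{s=1}^{k} a_s\nu_s/d_s$. Consequently

$$
\begin{aligned}
[\nabla K(d)]_{t-1} \xrightarrow{a.s.}\;
& \frac{B(h,h_1)}{d_t^{2}}\int\frac{a_t\nu_{h_t}(\theta)}{\sum_{s=1}^{k} a_s\nu_{h_s}(\theta)/d_s}\,\nu_{h,y}(\theta)\,d\theta \\
& - \sum_{j=2}^{k}\frac{\beta_{j,\lim}}{d_t^{2}}\Biggl[\int\frac{a_t\nu_{h_t}(\theta)}{\sum_{s=1}^{k} a_s\nu_{h_s}(\theta)/d_s}\,\nu_{h_j,y}(\theta)\,d\theta
- \int\frac{a_t\nu_{h_t}(\theta)}{\sum_{s=1}^{k} a_s\nu_{h_s}(\theta)/d_s}\,\nu_{h_1,y}(\theta)\,d\theta\Biggr]
+ \frac{\beta_{t,\lim}}{d_t} \\
=:\; & [w(h)]_{t-1},
\end{aligned}
\tag{A.18}
$$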

where the notation in (A.18) indicates that $w(h)$ denotes the finite vector limit to which $\nabla K(d)$ converges. We now deal with the Hessian matrix $\nabla^2 K(d^*)$. For $t \ne u$ we have

[V2 K(d*)]t-l,u-, 1 n2 ) Lt7c ( () h (_)-3
n i=1 ;=1 dt "2 (2 ks=1 as h, 1)/d)
2 [Vh (01))/dc* 1 (O1'))] atlh, ( '))auv,h (O1'))
I 22 d,2 (y kah(0,)IV/S)3
j=2 t d2d" 2 kC=1 as Vh, i()d
j#t
j#u

+ ui im
d2 du (:sk= asV, (0I)) d)

2 (h,,,(O '))/du h, (('))) at Vh,(0') u)au (0('))
2u*2U (s=k a (7sho(l)) /dS)3

m Vh(OB())au, h (O'())
+ !t,Iim d*2d*2 k=1/d s h5(1)/dS)
d 2 ( ,u2 (yC:k as7Vh, (O')) 2l (2
2(h( '))/)d h 1 ( '))) aSth,( 1))au7,h,( 1))
dm dv2d*2 k= (0s hs 1 )3
Y 1tii 1 I I

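and as before, it can be shown that this is bounded in probability. Similarly, we can show that the diagonal terms of $\nabla^2 K(d^*)$ are also bounded in probability. Therefore, using the fact that $\nabla^2 K(d^*)$ is bounded in probability, we can now rewrite (A.17) as

$$
\begin{aligned}
\sqrt{n}\,\bigl(\hat I^{\hat d}_{\beta_{\lim}} - \hat I^{d}_{\beta_{\lim}}\bigr)
&= \sqrt{\tfrac{n}{N}}\,\nabla K(d)'\,\sqrt{N}(\hat d - d) + \frac{1}{2\sqrt{N}}\sqrt{\tfrac{n}{N}}\,[\sqrt{N}(\hat d - d)]'\,O_p(1)\,[\sqrt{N}(\hat d - d)] \\
&= \sqrt{q}\,w(h)'\,\sqrt{N}(\hat d - d) + o_p(1).
\end{aligned}
$$

Together with (A.6), this gives

$$
\sqrt{n}\,\bigl(\hat I^{\hat d}_{\hat\beta(\hat d)} - B(h,h_1)\bigr)
= \sqrt{q}\,w(h)'\,\sqrt{N}(\hat d - d) + \sqrt{n}\,\bigl(\hat I^{d}_{\hat\beta(d)} - B(h,h_1)\bigr) + o_p(1)
\xrightarrow{d} \mathcal N\bigl(0,\; q\,w(h)'\Sigma\,w(h) + \sigma^2(h)\bigr),
$$

by the independence of the two sampling stages, the assumption that $\sqrt{N}(\hat d - d)$ is asymptotically normal with mean 0 and variance $\Sigma$, and the result from Doss (2010) that $\sqrt{n}\,(\hat I^{d}_{\hat\beta(d)} - B(h,h_1))$ is asymptotically normal with mean 0 and variance $\sigma^2(h)$. □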

Proof of Theorem 3

First, we note that

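$$
\sqrt{n}\,\bigl(\hat I^{[f]}(h,\hat d) - I^{[f]}(h)\bigr)
= \sqrt{n}\,\bigl(\hat I^{[f]}(h,\hat d) - \hat I^{[f]}(h,d)\bigr)
+ \sqrt{n}\,\bigl(\hat I^{[f]}(h,d) - I^{[f]}(h)\bigr).
\tag{A.19}
$$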

We begin by analyzing the second term on the right side of (A.19), which only involves

randomness from the second stage of sampling, and show that it is asymptotically

normal. As for the first term, a closer examination reveals that it is also asymptotically

normal, with all its randomness coming from Stage 1. The asymptotic normality of the

sum of these two terms then follows immediately from the independence of the two

stages of sampling.

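Note that $\sum_{l=1}^{k} a_l E(Y^{[f]}_{1,l}) = I^{[f]}(h)\,B(h,h_1)$, and in particular, when $f \equiv 1$, this gives $\sum_{l=1}^{k} a_l E(Y_{1,l}) = B(h,h_1)$. Also, we have

$$
\sqrt{n}\begin{pmatrix}
\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y^{[f]}_{i,l} - I^{[f]}(h)\,B(h,h_1) \\[1ex]
\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y_{i,l} - B(h,h_1)
\end{pmatrix}
= \sum_{l=1}^{k} a_l^{1/2}\,\frac{1}{n_l^{1/2}}\sum_{i=1}^{n_l}
\begin{pmatrix}
Y^{[f]}_{i,l} - E(Y^{[f]}_{1,l}) \\[1ex]
Y_{i,l} - E(Y_{1,l})
\end{pmatrix}.
\tag{A.20}
$$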

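By condition (2-17), assumption A2 of Theorem 1, and the assumed geometric ergodicity and independence of the $k$ Markov chains used, the vector in (A.20) converges in distribution to a normal random vector with mean 0 and covariance matrix $\Gamma(h) = \sum_{l=1}^{k} a_l\Gamma_l(h)$, where

$$
\Gamma_l(h) = \begin{pmatrix}\gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22}\end{pmatrix},
$$

with

$$
\begin{aligned}
\gamma_{11} &= \mathrm{Var}(Y^{[f]}_{1,l}) + 2\sum_{g=1}^{\infty}\mathrm{Cov}(Y^{[f]}_{1,l}, Y^{[f]}_{1+g,l}), \\
\gamma_{12} = \gamma_{21} &= \mathrm{Cov}(Y^{[f]}_{1,l}, Y_{1,l}) + \sum_{g=1}^{\infty}\bigl[\mathrm{Cov}(Y^{[f]}_{1,l}, Y_{1+g,l}) + \mathrm{Cov}(Y_{1,l}, Y^{[f]}_{1+g,l})\bigr], \\
\gamma_{22} &= \mathrm{Var}(Y_{1,l}) + 2\sum_{g=1}^{\infty}\mathrm{Cov}(Y_{1,l}, Y_{1+g,l}).
\end{aligned}
$$

Since $\hat I^{[f]}(h,d)$ is given by the ratio (2-15), in view of (A.20), its asymptotic distribution may be obtained by applying the delta method to the function $g(u,v) = u/v$. This gives $\sqrt{n}\,(\hat I^{[f]}(h,d) - I^{[f]}(h)) \xrightarrow{d} \mathcal N(0, \rho(h))$, where

$$
\rho(h) = \nabla g\bigl(I^{[f]}(h)B(h,h_1),\, B(h,h_1)\bigr)'\;\Gamma(h)\;\nabla g\bigl(I^{[f]}(h)B(h,h_1),\, B(h,h_1)\bigr),
\tag{A.21}
$$

with $\nabla g(u,v) = (1/v,\, -u/v^2)'$.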
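For orientation, the quadratic form in (A.21) can be written out explicitly. The following is a worked expansion rather than a formula from the source; the entry names $\Gamma_{11}(h)$, $\Gamma_{12}(h)$, $\Gamma_{22}(h)$ for the elements of the symmetric matrix $\Gamma(h)$ are introduced here only for illustration. Evaluating the gradient at $(u,v) = (I^{[f]}(h)B(h,h_1),\, B(h,h_1))$ gives $\nabla g = (1/B(h,h_1),\, -I^{[f]}(h)/B(h,h_1))'$, so that

$$
\rho(h) = \frac{\Gamma_{11}(h) - 2\,I^{[f]}(h)\,\Gamma_{12}(h) + \bigl(I^{[f]}(h)\bigr)^2\,\Gamma_{22}(h)}{B(h,h_1)^2}.
$$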
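We now consider the first term on the right side of (A.19). Define

$$
L(u) = \frac{\displaystyle\sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{f(\theta_i^{(l)})\,\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\nu_{h_s}(\theta_i^{(l)})/u_s}}
{\displaystyle\sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\nu_{h_s}(\theta_i^{(l)})/u_s}}
$$

for $u = (u_2, \dots, u_k)'$ with $u_l > 0$ for $l = 2, \dots, k$. Then

$$
L(d) = \hat I^{[f]}(h,d) = \frac{\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y^{[f]}_{i,l}}{\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y_{i,l}},
$$

and $\sqrt{n}\,(\hat I^{[f]}(h,\hat d) - \hat I^{[f]}(h,d)) = \sqrt{n}\,(L(\hat d) - L(d))$. Now, by the Taylor series expansion of $L$,

$$
\sqrt{n}\,\bigl(\hat I^{[f]}(h,\hat d) - \hat I^{[f]}(h,d)\bigr) = \sqrt{n}\,\nabla L(d)'(\hat d - d) + \frac{\sqrt{n}}{2}(\hat d - d)'\,\nabla^2 L(d^*)\,(\hat d - d),
$$

where $d^*$ is between $d$ and $\hat d$. First, we show that the gradient $\nabla L(d)$ converges almost surely to a finite constant vector by proving that each one of its components, $[\nabla L(d)]_{j-1}$, $j = 2, \dots, k$, converges almost surely. Writing $\bar S(\varphi) = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\varphi(\theta_i^{(l)})$ for a sample average and $D(\theta) = \sum_{s=1}^{k} a_s\nu_{h_s}(\theta)/d_s$, we have

$$
[\nabla L(d)]_{j-1}
= \frac{\bar S\Bigl(\dfrac{f\,\nu_h\,a_j\nu_{h_j}}{d_j^{2}D^{2}}\Bigr)\,\bar S\Bigl(\dfrac{\nu_h}{D}\Bigr)
- \bar S\Bigl(\dfrac{f\,\nu_h}{D}\Bigr)\,\bar S\Bigl(\dfrac{\nu_h\,a_j\nu_{h_j}}{d_j^{2}D^{2}}\Bigr)}
{\Bigl[\bar S\Bigl(\dfrac{\nu_h}{D}\Bigr)\Bigr]^{2}}
$$

$$
\xrightarrow{a.s.}
\frac{\dfrac{B(h,h_1)^2}{d_j^{2}}\displaystyle\int f(\theta)\frac{a_j\nu_{h_j}(\theta)}{D(\theta)}\,\nu_{h,y}(\theta)\,d\theta
- \dfrac{I^{[f]}(h)\,B(h,h_1)^2}{d_j^{2}}\displaystyle\int\frac{a_j\nu_{h_j}(\theta)}{D(\theta)}\,\nu_{h,y}(\theta)\,d\theta}{B(h,h_1)^2}
$$

$$
= \frac{1}{d_j^{2}}\Biggl[\int f(\theta)\frac{a_j\nu_{h_j}(\theta)}{\sum_{s=1}^{k} a_s\nu_{h_s}(\theta)/d_s}\,\nu_{h,y}(\theta)\,d\theta
- I^{[f]}(h)\int\frac{a_j\nu_{h_j}(\theta)}{\sum_{s=1}^{k} a_s\nu_{h_s}(\theta)/d_s}\,\nu_{h,y}(\theta)\,d\theta\Biggr]
=: [v(h)]_{j-1}, \quad j = 2, \dots, k.
\tag{A.22}
$$

As in the proof of Theorem 1, it can be shown that each element of the second-derivative matrix $\nabla^2 L(d^*)$ is $O_p(1)$. Now, we can rewrite (A.19) as

$$
\begin{aligned}
\sqrt{n}\,\bigl(\hat I^{[f]}(h,\hat d) - I^{[f]}(h)\bigr)
&= \sqrt{\tfrac{n}{N}}\,\nabla L(d)'\,\sqrt{N}(\hat d - d)
+ \frac{1}{2\sqrt{N}}\sqrt{\tfrac{n}{N}}\,[\sqrt{N}(\hat d - d)]'\,\nabla^2 L(d^*)\,[\sqrt{N}(\hat d - d)] \\
&\quad + \sqrt{n}\,\bigl(\hat I^{[f]}(h,d) - I^{[f]}(h)\bigr) \\
&= \sqrt{q}\,v(h)'\,\sqrt{N}(\hat d - d) + \sqrt{n}\,\bigl(\hat I^{[f]}(h,d) - I^{[f]}(h)\bigr) + o_p(1).
\end{aligned}
$$

Since the two sampling stages are assumed to be independent, we conclude that

$$
\sqrt{n}\,\bigl(\hat I^{[f]}(h,\hat d) - I^{[f]}(h)\bigr) \xrightarrow{d} \mathcal N\bigl(0,\; q\,v(h)'\Sigma\,v(h) + \rho(h)\bigr).
$$

□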
Proof of Theorem 4

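Here $Z$ and $Y$ represent the matrix and vector, respectively, previously defined in (A.8) and (A.9). In addition, let $Z^{[f]}$ denote the $n \times (k+1)$ matrix with transpose

$$
(Z^{[f]})' = \begin{pmatrix}
1 & \cdots & 1 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \\
Z^{[f](1)}_{1,1} & \cdots & Z^{[f](1)}_{n_1,1} & Z^{[f](1)}_{1,2} & \cdots & Z^{[f](1)}_{n_2,2} & \cdots & Z^{[f](1)}_{1,k} & \cdots & Z^{[f](1)}_{n_k,k} \\
Z^{[f](2)}_{1,1} & \cdots & Z^{[f](2)}_{n_1,1} & Z^{[f](2)}_{1,2} & \cdots & Z^{[f](2)}_{n_2,2} & \cdots & Z^{[f](2)}_{1,k} & \cdots & Z^{[f](2)}_{n_k,k} \\
\vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
Z^{[f](k)}_{1,1} & \cdots & Z^{[f](k)}_{n_1,1} & Z^{[f](k)}_{1,2} & \cdots & Z^{[f](k)}_{n_2,2} & \cdots & Z^{[f](k)}_{1,k} & \cdots & Z^{[f](k)}_{n_k,k}
\end{pmatrix}
\tag{A.23}
$$

and let $Y^{[f]}$ be the vector

$$
Y^{[f]} = \bigl(Y^{[f]}_{1,1}, \dots, Y^{[f]}_{n_1,1},\; Y^{[f]}_{1,2}, \dots, Y^{[f]}_{n_2,2},\; \dots,\; Y^{[f]}_{1,k}, \dots, Y^{[f]}_{n_k,k}\bigr)'.
\tag{A.24}
$$

We know from Doss (2010) that the least squares estimate when $Y$ is regressed on $Z$, denoted by $(\hat\beta_0, \hat\beta_2, \dots, \hat\beta_k) =: (\hat\beta_0, \hat\beta)$, converges almost surely to $(\beta_{0,\lim}, \beta_{\lim}) = R^{-1}v$. In a similar way, we will show here that the least squares estimate when $Y^{[f]}$ is regressed on $Z^{[f]}$, $(\hat\beta_0^{[f]}, \hat\beta^{[f]}) = (\hat\beta_0^{[f]}, \hat\beta_1^{[f]}, \dots, \hat\beta_k^{[f]})$, converges almost surely to a vector $(\beta_{0,\lim}^{[f]}, \beta_{\lim}^{[f]})$. Note that, under the assumption that $[(Z^{[f]})'Z^{[f]}]^{-1}$ exists,

$$
(\hat\beta_0^{[f]}, \hat\beta^{[f]})' = n\bigl[(Z^{[f]})'Z^{[f]}\bigr]^{-1}\,\bigl[(Z^{[f]})'Y^{[f]}/n\bigr].
$$

Since A4 is satisfied, we have

$$
\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Z^{[f](j)}_{i,l}\,Z^{[f](j')}_{i,l} \xrightarrow{a.s.} R^{[f]}_{j+1,j'+1}, \quad j, j' = 0, \dots, k,
$$

and hence $(Z^{[f]})'Z^{[f]}/n \xrightarrow{a.s.} R^{[f]}$. Therefore by A6, with probability one $(Z^{[f]})'Z^{[f]}$ is nonsingular for large $n$, and furthermore

$$
n\bigl[(Z^{[f]})'Z^{[f]}\bigr]^{-1} \xrightarrow{a.s.} (R^{[f]})^{-1}.
\tag{A.25}
$$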

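By condition A5, we also have

$$
\frac{(Z^{[f]})'Y^{[f]}}{n}
= \begin{pmatrix}
\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Z^{[f](0)}_{i,l}\,Y^{[f]}_{i,l} \\
\vdots \\
\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Z^{[f](k)}_{i,l}\,Y^{[f]}_{i,l}
\end{pmatrix}
\xrightarrow{a.s.}
\begin{pmatrix}
\sum_{l=1}^{k} a_l\,E\bigl(Z^{[f](0)}_{1,l}\,Y^{[f]}_{1,l}\bigr) \\
\vdots \\
\sum_{l=1}^{k} a_l\,E\bigl(Z^{[f](k)}_{1,l}\,Y^{[f]}_{1,l}\bigr)
\end{pmatrix}.
\tag{A.26}
$$

Let $v^{[f]} = (v_0^{[f]}, \dots, v_k^{[f]})'$ denote the vector on the right side of (A.26). Combining (A.25) and (A.26) we get

$$
(\hat\beta_0^{[f]}, \hat\beta^{[f]})' \xrightarrow{a.s.} (\beta_{0,\lim}^{[f]}, \beta_{\lim}^{[f]})' := (R^{[f]})^{-1}v^{[f]}.
\tag{A.27}
$$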

k

/1 n
/----1

r( [y,] /y- R[f] Z0) CA
k= 1|,j i miZ,, ,/)

Y 1,/ k j=2/ J,limZ ,/

Ul") Yi[f]

U[f](2)
/ida) (Yi,

pk = [f] 7[f]Z j)l
k=1 Pj,lim i),/

Z =k (jim )
2 j mZ;,, )

Also, let f] = E(U if]).

Now since A2, A3, and A4 hold, for each I= 1,..., k we have

nt
n1/2 (1 [ li
i=1

where

2
S 1,11 0/,12
21
J/,21 J/,22 /

with

-2 = Var (U[ )) +" 2 l Cov(U ](), U[f](l)'
7,1 1,g= ]\ 1,1 "-l+g,l/'
0/,12 0/,21 C v(Ui[(1), U[f](2)) + CO LCov ( ](1), U[f(2) + Cov(Uf ](2), U f1)]
222 =, V r 1 g=U I 1,d l-g,l d 1,~+ lCg,1j,
2U[f](2)1 COy/u[f](2)i U[f](2)\
-/,22 = Var+[f](2) 2 1Co1
/,22 1,1 id 1, 1 -l+g,l/ "

(A.26)

where

( f]', a[f]) s ( i[f] / )
\00',1im'

(Rlfl)-lvlfl.

1 k ni
Ijm,,, = Y V
/=1 i=1

d r(0, El]),

By the assumed independence of the k Markov chains, we have

n (Z1/2 l k= : ,1) ]_ d (, (O l.),

Er = rn f1.
n ~ ()31,3 /= li

We now show that

=1

/i1

(A.30)

rf](h)B(h, h1)
B(h, h) )

and to do this we write

k
5- alE(ulf])

I E(]=1-
al
=1 \E(YI,) -

k,=l aE(Y I,,)
(k=1 a/E(Yi,i)

-k 30[f] E(Z[f]O))
2j=1 PjlimE Z,/ J/1

~-Jk=2jc ,limE(Zli) )
Y-- k,0[f] [ m~ k k1 E f(Z j[]')) ]

- kj=2 Pj,lim[ =1 al E (Zi,)J

E= 1 a/E(YI,E )

Ilf](h)B(h, hi))

B(h, hi)

the next-to-last equality being a consequence of the readily verifiable fact that

k
0 and aaE(zu)
/=1

From (A.30) and (A.28) we conclude that

I A (O, Z-l]).

where

(A.28)

(A.29)

k

/=1

k
a/ E(Z 1lu))
/=1

forj = 2, ..., k.

(A.31)

n1/2s [ f],, (I'f](h)B(h,
SIMB(h, h )

Consider now the difference

1n l /22-_k ( [ f ] m f ] '
1/2 l k Pj,lim f]) i z,[,] J)n
/22 2(Jim j]) ( l 1l Z ,) I
n 7If, )71/2, k/ ,n1 i-,, -0^ _
Sj=2J-- J/,lIim n i=

i (j=l3jim 3 f ] ) 1 1 / 2 1y:l)n E(zY j ())]

k L;=2(/j,1im j) =1 a 1 1/2 1-Z0) -E(Z- I] )Z
k -- .ac=-k 1, nl/2 nl / u-: 1, J/

where the last equality follows from (A.31). By the assumption that the chains are geometrically ergodic (condition A1), the boundedness of the $Z^{(j)}_{i,l}$'s, and the moment condition imposed on $f$ in A4, we know that $n_l^{1/2}\bigl(\sum_{i=1}^{n_l}[Z^{(j)}_{i,l} - E(Z^{(j)}_{1,l})]/n_l\bigr)$ and $n_l^{1/2}\bigl(\sum_{i=1}^{n_l}[Z^{[f](j)}_{i,l} - E(Z^{[f](j)}_{1,l})]/n_l\bigr)$ are asymptotically normal, hence $O_p(1)$. This fact, combined with (A.27) and the corresponding result for $(\hat\beta_0, \hat\beta)$, yields

n"2 (j^, 1 Jim,') = Op(1).

Hence we can conclude that
/2( [ (lf](h)B(h, h) d) > (O, [f]).

SB(h,h,) ,

Now applying the delta method with the function g(u, v) = u/v we have
\/ =1 1 1 l (y [f]= _-[f] [f](j) \

i.e.
nl/2 [,]- I[f](h)) d ) (O, r(h)),

where

r(h,) = Vg(l[f](h)B(h, h/ ), B(h, h))' t[f] Vg(l[f](h)B(h, h), B(h, hl)), (A.33)

with $\nabla g(u,v) = (1/v,\, -u/v^2)'$ and $\Gamma^{[f]}$ as in (A.29). □

Proof of Theorem 5

We begin by reviewing some related notation and results established by Geyer (1994). Recall that $N_j$ denotes the length of the $j$th chain in Stage 1 samples, $N = \sum_{j=1}^{k} N_j$, and $A_j = N_j/N$. Using the notation

$$
\eta_j = -\log m_{h_j} + \log(A_j), \quad \text{for } j = 1, \dots, k,
$$

Geyer's (1994) reverse logistic regression estimator $\hat\eta_N = (\hat\eta_1, \dots, \hat\eta_k)$ for the unknown vector $\eta$ is obtained by maximizing the log quasi-likelihood

$$
l_N(\eta) = \sum_{l=1}^{k}\sum_{i=1}^{N_l}\log p_l(\theta_i^{(l)}, \eta),
\tag{A.34}
$$

where the $\theta_i^{(l)}$ here are the Stage 1 draws and

$$
p_l(\theta, \eta) = \frac{\nu_{h_l}(\theta)\,e^{\eta_l}}{\sum_{s=1}^{k}\nu_{h_s}(\theta)\,e^{\eta_s}}, \quad \text{for } l = 1, \dots, k.
\tag{A.35}
$$

Theorem 1 of Geyer (1994) states that this maximizer is unique up to an additive constant if the Monte Carlo sample is inseparable. Geyer (1994) also proves that, under certain conditions, $\sqrt{N}(\hat\eta_N - \eta_0)$ is asymptotically normal, where $\eta_0$ is defined by

$$
[\eta_0]_j = \eta_j - \frac{1}{k}\sum_{s=1}^{k}\eta_s, \quad j = 1, \dots, k.
$$
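The log quasi-likelihood of reverse logistic regression is straightforward to evaluate once the log prior densities are tabulated at every Stage 1 draw. The sketch below is illustrative only: the function name and the nested-list layout `log_nu[l][i][s]` (holding $\log\nu_{h_s}$ at the $i$th draw of chain $l$) are assumptions made for this example, not part of the dissertation or of Geyer (1994). It evaluates the objective on the log scale for numerical stability; maximizing it is left to any general-purpose optimizer.

```python
import math


def log_quasi_likelihood(eta, log_nu):
    """Evaluate l_N(eta) = sum_l sum_i log p_l(theta_i^(l), eta).

    eta    : list of k real numbers
    log_nu : hypothetical layout; log_nu[l][i][s] is log nu_{h_s} evaluated
             at the i-th draw of chain l
    """
    total = 0.0
    for l, chain in enumerate(log_nu):
        for row in chain:
            terms = [lv + e for lv, e in zip(row, eta)]
            mx = max(terms)
            # log-sum-exp of the k terms in the denominator of p_l
            log_denom = mx + math.log(sum(math.exp(t - mx) for t in terms))
            total += terms[l] - log_denom
    return total
```

Because each $p_l$ depends on $\eta$ only through differences of its components, adding a constant to every $\eta_s$ leaves the value unchanged, which is why the maximizer is identified only up to an additive constant and the centered vector $\eta_0$ is used.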
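Our proof is structured as follows. First, we extend Geyer's (1994) proof in order to show that the $2k$-dimensional vector

$$
\sqrt{N}\begin{pmatrix}\hat\eta_N - \eta_0 \\ \hat e - e\end{pmatrix} =: U = \begin{pmatrix}U^{(1)} \\ \vdots \\ U^{(2k)}\end{pmatrix}
\tag{A.36}
$$

is asymptotically normal. Then, by getting back to the $d$ notation through a transformation, we show that our vector of interest

$$
\sqrt{N}\begin{pmatrix}\hat d - d \\ \hat e - e\end{pmatrix}
$$

is also asymptotically normal.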
To carry out the first step, we will express each U), j = 1,..., k, as the sum of
a linear combination of standardized averages of functions of the 0')0's and a op(1)
quantity. We will also need the central limit theorem to hold for these averages. Hence,
for each j = 1,..., k, we plan to find constants ) ..., ) and functions (),... 0),
which satisfy the conditions

E,,,, ( )(0)) = 0 and E,,,, (0 )(2) ) < =1... k, (A.37a)
SN Nk
U 0) = a) i 'J)(1)0) + ... + 0) ) )(ok)O0)+ Op(1) (A.37b)
N=1 ==1
for some c > 0. Note that conditions (A.37a) and B1 yield central limit theorems for the
averages in the linear combination above.
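For $U^{(k+1)}, \dots, U^{(2k)}$, condition (A.37) is clearly satisfied since

$$
U^{(j+k)} = \sqrt{N}(\hat e_j - e_j) = \frac{1}{\sqrt{A_j}}\,\frac{1}{\sqrt{N_j}}\sum_{i=1}^{N_j}\bigl(f(\theta_i^{(j)}) - e_j\bigr), \quad \text{for } j = 1, \dots, k,
$$

and the moment conditions in (A.37a) hold (see B2 in the statement of this theorem).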
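Next, we show that condition (A.37) also holds for the first $k$ components of $U$. In the proof of his Theorem 2, Geyer (1994) defines the matrix $B_N$ via

$$
\frac{1}{N}\bigl(\nabla l_N(\hat\eta_N) - \nabla l_N(\eta_0)\bigr) = -B_N(\hat\eta_N - \eta_0),
\tag{A.38}
$$

where $l_N$ was defined in (A.34), and establishes that $B_N \xrightarrow{a.s.} B$, where $B$ is given by equation (19) in Geyer (1994). He also shows that, with $u$ being the $k$-dimensional column vector of 1's,

$$
\begin{pmatrix}B_N \\ u'\end{pmatrix}\sqrt{N}(\hat\eta_N - \eta_0) = \begin{pmatrix}\frac{1}{\sqrt{N}}\nabla l_N(\eta_0) \\ 0\end{pmatrix}.
\tag{A.39}
$$

[See equation (31) in Geyer (1994).] Note that, by applying the Mean Value Theorem to $\nabla l_N(\eta)$, $B_N$ defined in (A.38) can also be expressed as

$$
B_N = -\frac{1}{N}\nabla^2 l_N(\eta^*),
$$

for some $\eta^*$ between $\hat\eta_N$ and $\eta_0$. Hence, with $p_r$, $r = 1, \dots, k$ defined as in (A.35), the elements of $B_N$ are given by

$$
\begin{aligned}
[B_N]_{r,r} &= \frac{1}{N}\sum_{l=1}^{k}\sum_{i=1}^{N_l} p_r(\theta_i^{(l)}, \eta^*)\bigl[1 - p_r(\theta_i^{(l)}, \eta^*)\bigr], \quad r = 1, \dots, k, \\
[B_N]_{r,s} &= -\frac{1}{N}\sum_{l=1}^{k}\sum_{i=1}^{N_l} p_r(\theta_i^{(l)}, \eta^*)\,p_s(\theta_i^{(l)}, \eta^*), \quad r \ne s,
\end{aligned}
$$

which makes it easy to verify that $B_N u = 0$. Combining this with equation (A.39), it can be shown that

$$
\sqrt{N}(\hat\eta_N - \eta_0) = B_N^{+}\,\frac{1}{\sqrt{N}}\nabla l_N(\eta_0),
\tag{A.40}
$$

where

$$
B_N^{+} = \Bigl(B_N + \frac{1}{k}uu'\Bigr)^{-1} - \frac{1}{k}uu'
$$

is the Moore-Penrose inverse of $B_N$. Furthermore, letting $B^{+}$ denote the Moore-Penrose inverse of $B$, we can alternatively write the equality in (A.40) as

$$
\sqrt{N}(\hat\eta_N - \eta_0) = (B_N^{+} - B^{+} + B^{+})\,\frac{1}{\sqrt{N}}\nabla l_N(\eta_0)
= (B_N^{+} - B^{+})\,\frac{1}{\sqrt{N}}\nabla l_N(\eta_0) + B^{+}\,\frac{1}{\sqrt{N}}\nabla l_N(\eta_0).
\tag{A.41}
$$

Now, using the result $B_N \xrightarrow{a.s.} B$ established by Geyer (1994), we can easily deduce that

$$
B_N^{+} \xrightarrow{a.s.} B^{+},
\tag{A.42}
$$

by writing

$$
B_N^{+} = \Bigl(B_N + \frac{1}{k}uu'\Bigr)^{-1} - \frac{1}{k}uu' \xrightarrow{a.s.} \Bigl(B + \frac{1}{k}uu'\Bigr)^{-1} - \frac{1}{k}uu' = B^{+},
$$

where the last equality comes from Geyer (1994).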

Next, we establish asymptotic normality of $\nabla l_N(\eta_0)/\sqrt{N}$. Since the gradient $\nabla l_N(\eta_0)$ is the vector whose $r$th element is given by

$$
\frac{\partial l_N(\eta_0)}{\partial\eta_r} = N_r - \sum_{l=1}^{k}\sum_{i=1}^{N_l} p_r(\theta_i^{(l)}, \eta_0),
$$

we can see that

$$
\begin{aligned}
\frac{1}{\sqrt{N}}\frac{\partial l_N(\eta_0)}{\partial\eta_r}
&= \frac{1}{\sqrt{N}}\Biggl[\sum_{i=1}^{N_r}\bigl(1 - p_r(\theta_i^{(r)}, \eta_0)\bigr) - \sum_{l \ne r}\sum_{i=1}^{N_l} p_r(\theta_i^{(l)}, \eta_0)\Biggr] \\
&= -\sum_{l=1}^{k}\sqrt{A_l}\,\frac{1}{\sqrt{N_l}}\sum_{i=1}^{N_l}\bigl[p_r(\theta_i^{(l)}, \eta_0) - E\bigl(p_r(\theta_1^{(l)}, \eta_0)\bigr)\bigr],
\end{aligned}
\tag{A.43}
$$

which is a linear combination of the form given in (A.37b) and, because $0 < p_r(\theta, \eta) < 1$ for all $\theta$ and $r$, condition (A.37a) is also satisfied. Note that we are allowed to insert the expectations in the last equality because

$$
\begin{aligned}
\frac{1}{\sqrt{N}}\Biggl\{N_r\bigl[1 - E\bigl(p_r(\theta_1^{(r)}, \eta_0)\bigr)\bigr] - \sum_{l \ne r} N_l\,E\bigl(p_r(\theta_1^{(l)}, \eta_0)\bigr)\Biggr\}
&= \sqrt{N}\Biggl\{A_r - \sum_{l=1}^{k} A_l\,E\bigl(p_r(\theta_1^{(l)}, \eta_0)\bigr)\Biggr\} \\
&= \sqrt{N}\Biggl\{A_r - \int p_r(\theta, \eta_0)\sum_{l=1}^{k} A_l\,\nu_{h_l,y}(\theta)\,d\theta\Biggr\} \\
&= \sqrt{N}\Biggl\{A_r - A_r\int\nu_{h_r,y}(\theta)\,d\theta\Biggr\} = 0,
\end{aligned}
$$

where the next-to-last equality uses $e^{\eta_l} = A_l/m_{h_l}$, which gives $p_r(\theta, \eta_0)\sum_{l=1}^{k} A_l\,\nu_{h_l,y}(\theta) = A_r\,\nu_{h_r,y}(\theta)$.

The asymptotic normality of $\nabla l_N(\eta_0)/\sqrt{N}$ now follows from the Cramér-Wold device. In view of this convergence in distribution and the convergence result in (A.42), (A.41) gives

$$
\sqrt{N}(\hat\eta_N - \eta_0) = o_p(1)\,O_p(1) + B^{+}\,\frac{1}{\sqrt{N}}\nabla l_N(\eta_0) = B^{+}\,\frac{1}{\sqrt{N}}\nabla l_N(\eta_0) + o_p(1).
$$

Therefore, we can now easily see that condition (A.37) is also satisfied by the first $k$ components of $U$, i.e. $\sqrt{N}(\hat\eta_N - \eta_0)$ because, as we have shown in (A.43), every element of $\nabla l_N(\eta_0)/\sqrt{N}$ is a linear combination of the form (A.37).

Now that we have shown that

(1) 1 y ,N ( (0(1)0 1 +..+ (1) 1 Nk (l) (o(k)O)
a1 vN 1 =1 + + a1 k N 1 Yk

( (k) 1 2 1i (k)(0(1)0) + (k) 1 1N 2 k;= k)i (k)O)
U= 1 v + p(l),
(k+l) 1 NAi ,(k+1)/o(1)0)
a i 1 1 I

(22k) 1 W N k(2k) (k)O)

we can prove that U is asymptotically normal by using the Cramer-Wold device. Let us

denote the asymptotic variance of U by S. Then

S(1) S(2)\
S= I
S(2) S(3)

where

S(i) = B+ CB+, (A.44)

with
k
Crs = A Cov(p,(O/')0, TIo), ps(i~)o0, TIo))
/=1
k oo
+ ZiA, ov(pr ([C ~ o), Ps(Og, o)) + Cov(Pr((0 go), Ps(j1)0, 1o))]
/=1 g=1

for r = 1,... k, and s = 1,..., k,

S = [Var(f( (or)) + 2O Cov(f((r)O), f( r))]
g=1 (A.45)

S3) = 0 when r s, r = 1...k, s = 1...k,

and

S(2) = B+D, (A.46)

with

Ds = Cov(p,(O(S), Io), f(0s)o))

Cov( (0 o) f( s), )) + Cov(p,r(O(, ho), ((s)o))]
OV (f(O( v l__v+g)) 1r-Vpr+g, 1\0),
g=1
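for $r = 1, \dots, k$, and $s = 1, \dots, k$. Now, having established the convergence result

$$
\sqrt{N}\begin{pmatrix}\hat\eta_N - \eta_0 \\ \hat e - e\end{pmatrix} \xrightarrow{d} \mathcal N(0, S),
\tag{A.47}
$$

consider the function $g: \mathbb R^{2k} \to \mathbb R^{2k-1}$ given by

$$
g\begin{pmatrix}\eta \\ e\end{pmatrix} = \begin{pmatrix}
e^{\eta_1 - \eta_2}A_2/A_1 \\
e^{\eta_1 - \eta_3}A_3/A_1 \\
\vdots \\
e^{\eta_1 - \eta_k}A_k/A_1 \\
e
\end{pmatrix},
$$

where $\eta$ and $e$ are $k$-dimensional vectors. The delta method applied to (A.47) with $g$ as the transformation gives

$$
\sqrt{N}\begin{pmatrix}\hat d_2 - d_2 \\ \vdots \\ \hat d_k - d_k \\ \hat e - e\end{pmatrix} \xrightarrow{d} \mathcal N(0, V),
\tag{A.48}
$$

where

$$
V = \nabla g\begin{pmatrix}\eta_0 \\ e\end{pmatrix}' S\;\nabla g\begin{pmatrix}\eta_0 \\ e\end{pmatrix},
$$

with

$$
\nabla g\begin{pmatrix}\eta \\ e\end{pmatrix} = \begin{pmatrix}
e^{\eta_1-\eta_2}A_2/A_1 & e^{\eta_1-\eta_3}A_3/A_1 & \cdots & e^{\eta_1-\eta_k}A_k/A_1 & 0 & 0 & \cdots & 0 \\
-e^{\eta_1-\eta_2}A_2/A_1 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & -e^{\eta_1-\eta_3}A_3/A_1 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & -e^{\eta_1-\eta_k}A_k/A_1 & 0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 1
\end{pmatrix}
$$

and $S$ given by (A.44), (A.46), and (A.45). □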
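The change of variables from $\eta$ back to $d$ and the upper-left block of the gradient matrix can be checked numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, vectors are plain Python lists indexed from 0 (so `eta[0]` plays the role of $\eta_1$), and only the $k \times (k-1)$ block that involves $\eta$ is built, since the block for $e$ is an identity matrix.

```python
import math


def eta_to_d(eta, A):
    """Map eta to (d_2, ..., d_k), with d_j = exp(eta_1 - eta_j) * A_j / A_1."""
    return [math.exp(eta[0] - eta[j]) * A[j] / A[0] for j in range(1, len(eta))]


def eta_block_of_gradient(eta, A):
    """Upper-left k x (k-1) block of the gradient matrix of g.

    Row r corresponds to the derivative with respect to eta_{r+1};
    column c corresponds to the component d_{c+2}.
    """
    k = len(eta)
    d = eta_to_d(eta, A)
    J = [[0.0] * (k - 1) for _ in range(k)]
    for j in range(1, k):
        J[0][j - 1] = d[j - 1]    # derivative of d_j with respect to eta_1
        J[j][j - 1] = -d[j - 1]   # derivative of d_j with respect to eta_j
    return J
```

With $\eta_j = -\log m_{h_j} + \log A_j$, the map returns the ratios $m_{h_j}/m_{h_1}$, which is the sense in which the transformation "gets back to the $d$ notation".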
Proof of Remark 2 to Theorem 1
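Following the lines of the proof of Theorem 1 with $q = 1$, we get as in (A.5) that

$$
\sqrt{n}\,\bigl(\hat B(h,h_1,\hat d) - B(h,h_1)\bigr) = c(h)'\sqrt{n}(\hat d - d) + \sqrt{n}\,\bigl(\hat B(h,h_1,d) - B(h,h_1)\bigr) + o_p(1),
$$

where $c(h)$ is the constant column vector given in (A.3). This decomposition can be rewritten as

$$
\sqrt{n}\,\bigl(\hat B(h,h_1,\hat d) - B(h,h_1)\bigr) = (c(h)', 1)\begin{pmatrix}\sqrt{n}(\hat d - d) \\ \sqrt{n}\,(\hat B(h,h_1,d) - B(h,h_1))\end{pmatrix} + o_p(1).
$$

Now note that in order to establish the asymptotic normality of $\sqrt{n}\,(\hat B(h,h_1,\hat d) - B(h,h_1))$, it is enough to show that

$$
\begin{pmatrix}\sqrt{n}(\hat d - d) \\ \sqrt{n}\,(\hat B(h,h_1,d) - B(h,h_1))\end{pmatrix}
$$

is asymptotically normal. Using the $\eta$-notation introduced in the proof of Theorem 5, let

$$
T = \sqrt{n}\begin{pmatrix}\hat\eta - \eta_0 \\ \hat B(h,h_1,d) - B(h,h_1)\end{pmatrix}.
$$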

As was done for U in the proof of Theorem 5, we can write T as

(1) ynl (1) ( (1)) + + a(1) 1 nk 1)(Ok))

T='7=
T (ik) 1 ,1 (k )) + + a() ,k1 k)(0k) + (1),

Ek 1/2 1 yn/
7=1 a 2 i ,= l (,- E(YI,,))

where the first k components are the same as the first k components of the vector

U in (A.36), and Y,,/ is given in (2-11). By applying the Cramer-Wold device, we can

conclude that the vector T converges in distribution to a normal random variable with

mean 0 and variance

Z=
Sz' 72(h))

in which S(1) is the k x k matrix given in (A.44), z is the k x 1 vector given by

z = B+y (A.49)

where
k k oo
Yr = A Cov (p,(0/), To), Yl,,) + AI [Cov (pr(0), 1o), Y1+g,1)
/=1 /=1 g=l

+ Cov(pr(O(1 To), Y,/)],

for r = 1,..., k, with B+ as in Theorem 5, and as in (A.9) of Doss (2010)
k oo
72(h) = a, [Var(Y,i) + 2 Cov(Yl,i, Y+,g,)].
/=1 g=1

Now define the function g: Rk+l Rk by

el' -2 A2/A1
e' r-3A3/A1

S e'l- kAk/Al

b

where Tr is a k-dimensional vector and b is a real number. Applying the delta method to
the previously established result that T -d A/(0, Z), we get

Sd 0O, Vg I ZVg Iori ,
SB(h, hl, d) B(h, hl) B(h, hl) B(h, h))

where

Vg I = (A.50)
B(h, h) 0' 1
with
Se"1-'72A2/A1 ell-1'3A3/A1 ... e'-kAk/A
-e'l-'2A2/A1 0 ... 0
E = : : (A.51)

0 -el-'73A3/A1 ... 0
0 0 ... -e1-7kAk/A1
and 0 in (A.50) representing the column vector of $k$ zeros.
Hence, we know that $\sqrt{n}\,(\hat B(h,h_1,\hat d) - B(h,h_1))$ has an asymptotically normal distribution with mean 0 and variance

$$
c(h)'\Sigma\,c(h) + \tau^2(h) + 2\,c(h)'E'z,
$$

where $\Sigma$ denotes, as in the statement of Theorem 1, the asymptotic variance of $\sqrt{n}(\hat d - d)$, $E$ is given in (A.51), and $z$ in (A.49). □

Proof of Theorem 6

Let
1 k ( L n1 r YJk= 1 Zid
/Y I1 ,, )= pJ Zi i
where the superscripts d, e indicate the values of d and e used when computing Y's
and Z's, while the subscripts indicate the coefficients of Z's. With lf] as in the proof of
Theorem 4 we now write

v^ ,, aI>,/")[ = ^_d,,^, ),
SS a )3(d),,4[f](d))
/=1 (A.52)

S 1(d),,4 -(d)).
/=1
Note that the second quantity on the right side of (A.52), which involves only known
d and e, was shown to be asymptotically normal with mean 0 and variance F[ ] in the
proof of Theorem 4 [see (A.32)]. Now let us expand the first term on the right side
of (A.52) by writing

[ []3 1[f],3 f' /3,Im ,

+rdne ( n \. r -1d, (A.53)

We next proceed as follows:

1. We note that the third term on the right side of (A.53) was shown to converge to 0
in probability in the proof of Theorem 4.

2. We show that the first term on the right side of (A.53) also converges to 0 in
probability.

3. We show that the second term on the right side of (A.53) is asymptotically normal.
To deal with the second step, as in the proof of Theorem 1, first we show that J3 (d)
and [3f (d) converge in probability to the same limit, which we denoted in the proof of

Theorem 4 by /3if. For fixed j, j' e {1,... k}, consider the function

1 k n, )f(o)h ()/j ;1 [f(o ) )Jh()j
G(u, v) n= v -/- v/ ,
n= 1 as iV0=1 ;,=1 1usshs Y() Usk1 asVh, ) us
where u = (u2,..., uk)' with ui > 0, for / = 2,..., k, and v = (v, ..., k)'. Note that setting
u = d and v = e gives
G(,e) k ni
G(d, e) =nZ Z=f) ZYI0'
/=1 i=1
By the Mean Value Theorem, we know that there exists a (d*, e*) between (d, e) and
(d, e) such that

G(d, e)G(d, e) +VG(d*, e)' d

"JR+l,+l r (- ,e -- + Op(l).

As in previous proofs, with some calculations we can show that VG(d*, e*) = Op(1).
Therefore G(d, e) Rlf] and since R[] is assumed invertible, we have

n[(f[q)'f]]-1 (R[f)-1,

where 2] is obtained from the matrix Zf] in (A.23) by replacing d and e with d and 8.
The same reasoning extends to the case where = 0 orj' = 0. In a similar way, if we
let *[f] denote the vector obtained from Y[f] in (A.24) by replacing d with d and we recall
that v1] was defined to be the vector on the right side of (A.26), it can be proved that

n

which together with the previous result implies that [3 (d) and [] (d) converge in
probability to the same limit. Also,

1n(/2 y-k (O[f] )[f]im (a jk 1 in 1
/k 2 j=l \J jlim k/ n 1 = 1 ]/ ) /

(l'j=l \jlim -- (a)) [1 = 1 2a Z=i--1 y( nl

k=2(, lim -,k= -k n1/2 1 y ( Z '
,u=2O J ,liM A 0 )) LYI=l a, n ,,=1k n, -- )\j
From the proof of Theorem 2 we already know that the second component of this last
vector, denoted therein by (vr() i/im), is op(l). In an analogous manner, it can be
shown that the first component is op(1). Thus the whole vector is op(1).
As for the middle term of the right side of (A.53), if we define

1 k 1f (0 V 0) klm ( f(0,) )/U (0
K[f](u, v) = n1 y k -l) h( ) im .(k vj
n s=1 =1 as= Vhsl /Us J=1 \ ls=1 shs (0(')/ Us
where u = (u2, .. Uk)', with ul > 0 for / = 2,..., k, and v = (vl, v2,... Vk)', then

V/( K, ( nnKf(d,e)- KE](d,e))
V ^e' E )d,e Kf](d, e)) (A.54)
"/31 m If 31,m,,311 1 (K(c) K(d))

with K defined as in the proof of Theorem 2. From this same proof, we know that

(K(d) K(d)) = Vqw(h)'vN (d d) + Op(1). (A.55)

We will now show that, similarly,

n(K'l(d, e) K['](d, e)) = qw f(h)' I + Op(1),
V Ve

with w[f](h) defined by (A.56) below. By Taylor series expansion, we get

n(K[lf(d, ) K[](d, e)) = vnVK[f (d, e)' d -
e-e

+ V2K[f](d*, e*) (
2 e e

where (d*, e*) is between (d, e) and (d, e).
Below we compute the gradient VK[f](d, e) and show that it converges almost
surely to a vector w[f](h). We have

9K n f(OI ())h(O1) )atht( 1)
(d, e) 2
O ut n d (E,2 asVh, (e,))/ds)2
k f(0 ))Vh ( 10))at h, 0t())
S dd dt (k1 asVh,(O )/d)2
j#t
iimc d Ek1 h( '))/ds tlim d (Ek,1 as7,h(0o'))/ds)2
tI S 'lah, ( '))/ds O ta'ah7(O-)/S)j

a.s. B(h, h) f f(0)at h (0)
d2 k a h,y (0) dO
d J fs= asVh /(O)ds
k -f([f] ) atVh] [f] I (ht)
Is hImk "W ,y() dO +- tlim
=J di S =1as Vh, ()/ds dt
j#t
0[] Jq f( (0) ath(0)
t,lim 2 k Vhh,y(O) dO
i dtm S= 1 as~h (0) / ds

S Es=1 asVh (O)/ds
k -f( )a] h,() ,,() dO + f] I[f] ](ht)
J=1 dtEs= ashV(O)/h, (0 tds ( dt

for t= 2,..., k,

(A.56a)

and

(K e = ],m := w[f] +(h), for t = 1,..., k. (A.56b)
vt, e k-1

Proceeding as we did in the proof of Theorem 2 when we showed that V2K(d*) =
Op(1), we can show here that V2K[f](d*, e*) is bounded in probability. Hence

n(K[Kl(d, e) K[(d, e)) = qwE(h)' -N ) d) + Op(1),

and together with (A.55) and (A.54) this implies that

rn / 1)- ^ ^lW(h)' |)+ op)

V qw(h)' N(d d) + o,p(l)

qW[f](h)') N (a d) + O(l
wo(h)' e e

where wo(h) is the column-vector obtained from w(h) by concatenating k zeros at its
end. Now returning to (A.52) and (A.53) we get

k w[i] (h)' d
ai])=1 wo(h) N e
k
), .,(d),I )) + Op()

Swo(h)'

We can now apply the delta method with the function g(u, v) = u/v to get our result

V^(c/,^^ v

where

y(h) = Vg(I[f](h)B(h, hi), B(h, h,))' q (q\ Wl(h)' V(w[l(h), wo(h)) + /El
w wo (h)'

SVg(Ilf](h)B(h, hi), B(h, hi)), (A.57)

with Vg(u, v) = (l/v, -u/v2)'. E

APPENDIX B
DETAILS REGARDING GENERATION OF THE MARKOV CHAIN FROM CHAPTER 5
To generate a Markov chain of length $n$ on $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$ for a fixed choice of the hyperparameter $h = (w, g)$, we use the following sampling scheme. First, we pick an arbitrary value for $\gamma^{(0)}$. Then we draw $\sigma^{2(0)}$, $\beta_0^{(0)}$, and $\beta_\gamma^{(0)}$ as indicated in Steps 2-4 below (with $i = 0$). To generate the rest of the chain, we iterate through Steps 1-4 described below for each $i = 1, \dots, n - 1$.
Step 1 In this stage we generate the binary vector $\gamma^{(i)}$ by using a Gibbs sampler on $\gamma = (\gamma_1, \gamma_2, \dots, \gamma_q)$. Thus, we first generate $\gamma_1^{(i)} \mid \gamma_2^{(i-1)}, \dots, \gamma_q^{(i-1)}, Y$ according to the following Bernoulli distribution:

$$
P(\gamma_1 \mid \gamma_{j \ne 1}, Y) \propto p(\gamma \mid Y)
\propto (1 + g)^{-q_\gamma/2}\,S^{-(m-1)}\,\bigl[1 + g(1 - R_\gamma^2)\bigr]^{-(m-1)/2}\,w^{q_\gamma}(1 - w)^{q - q_\gamma},
\tag{B.1}
$$

where, recall that $S^2 = \sum_{j=1}^{m}(Y_j - \bar Y)^2$, and $R_\gamma^2$ is the coefficient of determination of model $\gamma$; see (5-1). Similarly, generate $\gamma_2^{(i)}$ from $p(\gamma_2 \mid \gamma_1^{(i)}, \gamma_3^{(i-1)}, \dots, \gamma_q^{(i-1)}, Y)$, and so on for $\gamma_3^{(i)}, \dots, \gamma_q^{(i)}$. (This Gibbs sampler is not identical to that of Smith and Kohn (1996) in that in our model the prior on $\beta_0$ is a flat prior, whereas Smith and Kohn (1996) use a proper prior on $\beta_0$.)
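Step 2 Generate $\sigma^{2(i)} \mid \gamma^{(i)}, Y$ according to the density

$$
p(\sigma^2 \mid \gamma, Y) \propto \iint p(Y \mid \gamma, \sigma^2, \beta_0, \beta_\gamma)\,p(\beta_0)\,p(\beta_\gamma \mid \gamma, \sigma^2)\,d\beta_0\,d\beta_\gamma\;p(\sigma^2)
\propto (\sigma^2)^{-(m+1)/2}\exp\Bigl\{-\frac{S^2}{2\sigma^2(g+1)}\bigl[1 + g(1 - R_\gamma^2)\bigr]\Bigr\},
$$

an inverse gamma density.

To see why the last relationship statement is true, we first consider the integral with respect to $\beta_0$. We have

$$
\begin{aligned}
\int p(Y \mid \gamma, \sigma^2, \beta_0, \beta_\gamma)\,p(\beta_0)\,d\beta_0
&\propto \int(\sigma^2)^{-m/2}\exp\Bigl[-\frac{1}{2\sigma^2}(Y - 1_m\beta_0 - X_\gamma\beta_\gamma)'(Y - 1_m\beta_0 - X_\gamma\beta_\gamma)\Bigr]d\beta_0 \\
&\propto (\sigma^2)^{-m/2}\exp\Bigl[-\frac{1}{2\sigma^2}(Y - X_\gamma\beta_\gamma)'(Y - X_\gamma\beta_\gamma)\Bigr]\int\exp\Bigl[-\frac{m}{2\sigma^2}(\beta_0^2 - 2\beta_0\bar Y)\Bigr]d\beta_0 \\
&\propto (\sigma^2)^{-(m-1)/2}\exp\Bigl[-\frac{1}{2\sigma^2}(Y - X_\gamma\beta_\gamma)'(Y - X_\gamma\beta_\gamma)\Bigr]\exp\Bigl(\frac{m\bar Y^2}{2\sigma^2}\Bigr).
\end{aligned}
$$

So we may now write

$$
\begin{aligned}
p(\sigma^2 \mid \gamma, Y)
&\propto (\sigma^2)^{-(m+1)/2}\exp\Bigl(\frac{m\bar Y^2}{2\sigma^2}\Bigr)\int\exp\Bigl[-\frac{1}{2\sigma^2}(Y - X_\gamma\beta_\gamma)'(Y - X_\gamma\beta_\gamma)\Bigr]p(\beta_\gamma \mid \gamma, \sigma^2)\,d\beta_\gamma \\
&\propto (\sigma^2)^{-(m+1+q_\gamma)/2}\exp\Bigl(-\frac{S^2}{2\sigma^2}\Bigr)\int\exp\Bigl\{-\frac{1}{2\sigma^2}\Bigl[\frac{g+1}{g}\,\beta_\gamma'X_\gamma'X_\gamma\beta_\gamma - 2\beta_\gamma'X_\gamma'Y\Bigr]\Bigr\}d\beta_\gamma \\
&\propto (\sigma^2)^{-(m+1+q_\gamma)/2}\exp\Bigl(-\frac{S^2}{2\sigma^2}\Bigr)(\sigma^2)^{q_\gamma/2}\exp\Bigl\{\frac{g}{2\sigma^2(g+1)}\,Y'X_\gamma(X_\gamma'X_\gamma)^{-1}X_\gamma'Y\Bigr\} \\
&\propto (\sigma^2)^{-(m+1)/2}\exp\Bigl\{-\frac{S^2}{2\sigma^2}\Bigl[1 - \frac{g}{g+1}R_\gamma^2\Bigr]\Bigr\} \\
&\propto (\sigma^2)^{-(m+1)/2}\exp\Bigl\{-\frac{S^2}{2\sigma^2(g+1)}\bigl[1 + g(1 - R_\gamma^2)\bigr]\Bigr\},
\end{aligned}
$$

where the third proportionality relation results from using the formula

$$
\int\exp\bigl(-\tfrac{1}{2}\beta'W^{-1}\beta + a'\beta\bigr)\,d\beta = (2\pi)^{q_\gamma/2}\,|W|^{1/2}\exp(a'Wa/2),
$$

which can be shown to hold for any vector $a$ of length $q_\gamma$ and any positive definite matrix $W$ by using a "completing the squares" argument. In practice, we use the distributional relationship

$$
\sigma^2 \sim \frac{S^2\bigl[1 + g(1 - R_\gamma^2)\bigr]}{g + 1}\cdot\frac{1}{\chi^2_{m-1}}
$$

to draw $\sigma^2$.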

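Step 3 Generate $\beta_0^{(i)} \mid \gamma^{(i)}, \sigma^{2(i)}, Y$ according to the density

$$
p(\beta_0 \mid \gamma, \sigma^2, Y) \propto \int p(Y \mid \gamma, \sigma^2, \beta_0, \beta_\gamma)\,p(\beta_\gamma \mid \gamma, \sigma^2)\,d\beta_\gamma\;p(\beta_0)
\tag{B.2a}
$$

$$
\propto \exp\Bigl[-\frac{1}{2\sigma^2}(Y - 1_m\beta_0)'(Y - 1_m\beta_0)\Bigr]
\tag{B.2b}
$$

$$
\propto \mathcal N(\bar Y, \sigma^2/m).
$$

Note that (B.2b) follows from (B.2a) because $1_m'X_\gamma = 0$, since the columns of $X_\gamma$ are centered.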
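Step 4 Generate $\beta_\gamma^{(i)} \mid \gamma^{(i)}, \sigma^{2(i)}, \beta_0^{(i)}, Y$ according to the density

$$
p(\beta_\gamma \mid \gamma, \sigma^2, \beta_0, Y) \propto p(Y \mid \gamma, \sigma^2, \beta_0, \beta_\gamma)\,p(\beta_\gamma \mid \gamma, \sigma^2)
\propto \exp\Bigl\{-\frac{1}{2\sigma^2}\Bigl[(Y - 1_m\beta_0 - X_\gamma\beta_\gamma)'(Y - 1_m\beta_0 - X_\gamma\beta_\gamma) + \frac{\beta_\gamma'X_\gamma'X_\gamma\beta_\gamma}{g}\Bigr]\Bigr\},
$$

which can be shown to be a $q_\gamma$-dimensional normal with mean and covariance matrix given respectively by

$$
\frac{g}{g+1}\,\hat\beta_\gamma \quad\text{and}\quad \frac{g}{g+1}\,\sigma^2(X_\gamma'X_\gamma)^{-1},
$$

where $\hat\beta_\gamma = (X_\gamma'X_\gamma)^{-1}X_\gamma'Y$ is the usual least squares estimate for model $\gamma$.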
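Steps 1 to 3 reduce to a few scalar computations once $q_\gamma$, $R_\gamma^2$, $S^2$ and $\bar Y$ are available. The following sketch illustrates them under stated assumptions: all function names are hypothetical, the quantities $R_\gamma^2$, $S^2$ and $\bar Y$ are taken as precomputed inputs, and the factor $S^{-(m-1)}$ in (B.1) is dropped because it does not depend on $\gamma$. It is not the dissertation's implementation.

```python
import math
import random


def log_marginal(q_gamma, r2, m, q, g, w):
    """Log of the right side of (B.1), up to an additive constant in gamma."""
    return (-0.5 * q_gamma * math.log1p(g)
            - 0.5 * (m - 1) * math.log1p(g * (1.0 - r2))
            + q_gamma * math.log(w)
            + (q - q_gamma) * math.log1p(-w))


def inclusion_probability(log_p_in, log_p_out):
    """P(gamma_j = 1 | rest, Y) from the two unnormalized log posteriors."""
    diff = log_p_out - log_p_in
    if diff > 0:                      # avoid overflow in exp
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


def draw_sigma2(s2, r2, m, g, rng):
    """Step 2: sigma^2 = S^2 [1 + g (1 - R^2)] / (g + 1) / chi^2_{m-1}."""
    scale = s2 * (1.0 + g * (1.0 - r2)) / (g + 1.0)
    chi2 = rng.gammavariate((m - 1) / 2.0, 2.0)   # chi-square with m - 1 df
    return scale / chi2


def draw_beta0(ybar, sigma2, m, rng):
    """Step 3: beta_0 ~ N(ybar, sigma^2 / m)."""
    return rng.gauss(ybar, math.sqrt(sigma2 / m))


rng = random.Random(0)
sigma2 = draw_sigma2(10.0, 0.5, 13, 4.0, rng)   # illustrative inputs only
beta0 = draw_beta0(1.0, sigma2, 13, rng)
```

Working with log posteriors in Step 1 avoids underflow when $m$ is large, since only the difference of the two log values enters the Bernoulli probability.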
We now discuss the computational effort needed to implement our sampler.
Consider generating the first component of $\gamma$. As seen in Step 1, the conditional
distribution for this component is Bernoulli with success probability

$$P(\gamma_1 = 1 \mid \gamma_j, j \neq 1, Y) = \frac{p((1, \gamma_2, \ldots, \gamma_q) \mid Y)}{p((0, \gamma_2, \ldots, \gamma_q) \mid Y) + p((1, \gamma_2, \ldots, \gamma_q) \mid Y)},$$

with the expression for $p(\gamma \mid Y)$ given by (B.1). The other components of $\gamma$ can be in turn
similarly generated, and then the other components of $\theta$ can be generated according
to the conditional distributions from Steps 2-4. The main computational burden is in (i)
forming $R_\gamma^2$, (ii) forming $\hat\beta_\gamma$, and (iii) generating from $\mathcal{N}_{q_\gamma}(c_1\hat\beta_\gamma, c_2(X_\gamma' X_\gamma)^{-1})$, where $c_1$ and
$c_2$ are constants. All of these ostensibly require calculation of $(X_\gamma' X_\gamma)^{-1}$, for which $O(q_\gamma^3)$
operations are required. In fact, (i) and (ii) require only $\hat\beta_\gamma$, which can be calculated
by solving $(X_\gamma' X_\gamma)\beta = X_\gamma' Y$, requiring only $O(q_\gamma^2)$ operations. Now the essence of
(iii) is generating from a $\mathcal{N}_{q_\gamma}(0, (X_\gamma' X_\gamma)^{-1})$ distribution, and to do this we do not need to
form $(X_\gamma' X_\gamma)^{-1}$. We need only express $X_\gamma' X_\gamma = U'U$, where $U$ is upper triangular. For if
$Z \sim \mathcal{N}_{q_\gamma}(0, I)$, then $U^{-1}Z \sim \mathcal{N}_{q_\gamma}(0, (X_\gamma' X_\gamma)^{-1})$, and $U^{-1}Z$ is obtained without calculating
$U^{-1}$, simply by solving for $\beta$ in the equation $U\beta = Z$ (which requires only $O(q_\gamma^2)$
operations, since $U$ is upper triangular). We note that, if we start with $X_\gamma$, finding the
factorization $X_\gamma' X_\gamma = U'U$ requires $O(q_\gamma^3)$ operations.
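The triangular-solve idea in the preceding paragraph can be sketched as follows. The helper names `back_substitute` and `draw_centered_normal` are hypothetical, and the matrix passed in stands for $X_\gamma' X_\gamma$; the sketch uses NumPy's Cholesky routine, which returns the lower factor, so its transpose plays the role of $U$.

```python
import numpy as np

def back_substitute(U, z):
    """Solve U b = z for upper-triangular U in O(q^2) operations."""
    q = len(z)
    b = np.zeros(q)
    for i in range(q - 1, -1, -1):
        # Subtract the already-solved components, then divide by the pivot.
        b[i] = (z[i] - U[i, i + 1:] @ b[i + 1:]) / U[i, i]
    return b

def draw_centered_normal(XtX, rng):
    """Draw from N(0, (X'X)^{-1}) without ever forming the inverse."""
    U = np.linalg.cholesky(XtX).T          # X'X = U'U with U upper triangular
    z = rng.standard_normal(XtX.shape[0])  # Z ~ N(0, I)
    return back_substitute(U, z)           # U^{-1} Z has covariance (X'X)^{-1}

rng = np.random.default_rng(0)
XtX = np.array([[2.0, 0.5], [0.5, 1.0]])
sample = np.array([draw_centered_normal(XtX, rng) for _ in range(20000)])
print(np.cov(sample.T))  # should be close to the inverse of XtX
```

A draw from the normal distribution in (iii) is then obtained by scaling such a draw by $\sqrt{c_2}$ and adding $c_1\hat\beta_\gamma$.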

Now if $\gamma^*$ and $\gamma$ differ in a single component (this is the case for example when
cycling through Step 1 of the algorithm), then a factorization of $X_{\gamma^*}' X_{\gamma^*}$ can be obtained
from the factorization of $X_\gamma' X_\gamma$ very efficiently: there are well known methods for updating
the fit (and related quantities) of a linear regression model when a predictor is added
or dropped from the model. These rely on fast updates of QR, Cholesky, or singular
value decompositions when the design matrix is changed by the addition or deletion of
a column. Smith and Kohn (1996) rely on fast updating of the Cholesky decomposition
of $X_\gamma' X_\gamma$ in order to update $R_\gamma^2$. Our Markov chain is more involved than that of Smith and
Kohn (1996) since our chain runs on $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$. Our implementation uses the
"sweep operator," a well-known method for updating a linear fit, because this provides all
the quantities needed for our chain in one shot. We now describe this in more detail.

We first define the sweep operator. Let $T$ be a symmetric matrix. The sweep of $T$
on its $k$th diagonal entry $t_{kk} \neq 0$ is the symmetric matrix $S$ with

$$s_{kk} = -\frac{1}{t_{kk}}, \quad s_{ik} = \frac{t_{ik}}{t_{kk}}, \quad s_{kj} = \frac{t_{kj}}{t_{kk}}, \quad s_{ij} = t_{ij} - \frac{t_{ik}\,t_{kj}}{t_{kk}},$$

for $i \neq k$ and $j \neq k$. The sweep operator has an obvious inverse operator.

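If we apply the sweep operator to the matrix

$$T = \begin{pmatrix} X'X & X'Y \\ Y'X & Y'Y \end{pmatrix} \qquad \text{(B.3)}$$

on all 1 through $q$ diagonal entries, then we obtain the matrix

$$S = \begin{pmatrix} -(X'X)^{-1} & (X'X)^{-1}X'Y \\ Y'X(X'X)^{-1} & Y'Y - Y'X(X'X)^{-1}X'Y \end{pmatrix}.$$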

If we sweep the augmented matrix $T$ defined in (B.3) on the diagonal entries corresponding
to the covariates in $\gamma$, then from the resulting matrix $S$ we can obtain all the important
quantities needed by Steps 1-4: $(X_\gamma' X_\gamma)^{-1}$ is the negative of the submatrix of $S$
corresponding to rows and columns in $\gamma$, $(X_\gamma' X_\gamma)^{-1} X_\gamma' Y$ is the submatrix of $S$ corresponding
to rows in $\gamma$ and column $q + 1$, and $Y' X_\gamma (X_\gamma' X_\gamma)^{-1} X_\gamma' Y$ can be obtained by subtracting
the $(q + 1, q + 1)$ element of $S$ from $Y'Y$ (with other methods, we may need to compute
separately the last three quantities).
To illustrate the use of this operator, suppose that we have already swept $T$ over the
covariates in $\gamma = (0, \gamma_2, \ldots, \gamma_q)$. Then we only need to perform one sweep on the first
diagonal entry to get the swept matrix corresponding to the first predictor being added
to the previous model ($\gamma = (1, \gamma_2, \ldots, \gamma_q)$). Conversely, since the sweep operator has an
inverse, the latter matrix could be "unswept" over the first diagonal entry to get the swept
matrix corresponding to dropping the first predictor.
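As a rough illustration of the sweep operator and its inverse, here is a minimal sketch. The function names `sweep` and `augmented_matrix` are hypothetical, the sign convention is the one that yields $-(X'X)^{-1}$ in the upper-left block, and indices are zero-based, so the column called $q + 1$ in the text is column `q` here.

```python
import numpy as np

def sweep(T, k, inverse=False):
    """Sweep (or unsweep, if inverse=True) the symmetric matrix T on entry k."""
    d = T[k, k]
    col = T[:, k].copy()
    # Off-pivot entries: t_ij - t_ik t_kj / t_kk (identical for both directions).
    S = T - np.outer(col, col) / d
    # Pivot row and column: +t_ik / t_kk for a sweep, -t_ik / t_kk for an unsweep.
    sign = -1.0 if inverse else 1.0
    S[:, k] = sign * col / d
    S[k, :] = sign * col / d
    S[k, k] = -1.0 / d
    return S

def augmented_matrix(X, Y):
    """Build the (q+1) x (q+1) matrix [[X'X, X'Y], [Y'X, Y'Y]]."""
    XtY = X.T @ Y
    return np.block([[X.T @ X, XtY[:, None]],
                     [XtY[None, :], np.array([[Y @ Y]])]])

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))
Y = rng.standard_normal(30)
S = augmented_matrix(X, Y)
for k in (0, 2):                 # sweep in the predictors of the model {1, 3}
    S = sweep(S, k)
beta_hat = S[[0, 2], 3]          # least squares estimate for that model
rss = S[3, 3]                    # Y'Y minus the regression sum of squares
S_drop = sweep(S, 2, inverse=True)  # drop the third predictor again
```

Adding or dropping one predictor costs a single sweep of a $(q+1) \times (q+1)$ matrix, which is what makes the operator convenient inside Step 1.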

APPENDIX C
PROOF OF THE UNIFORM ERGODICITY AND DEVELOPMENT OF THE
MINORIZATION CONDITION FROM CHAPTER 5
Proof of Proposition 1
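Let $\nu_y$ and $p(\cdot \mid Y)$ denote the posterior distribution of $\theta$ and $\gamma$, respectively, under
the prior $\nu$ on $\theta$ (we are suppressing the subscript $h$, since the hyperparameter is fixed
throughout). We use $K^n$ and $V^n$ to denote the $n$-step Markov transition functions for the
$\theta$ and $\gamma$ chains, respectively. Also, letting $\lambda$ denote the product of counting measure on
$\{0, 1\}^q$ and Lebesgue measure on $(0, \infty) \times \mathbb{R}^{q+1}$, we use $k^n$ to denote the density of $K^n$
with respect to $\lambda$ and $v^n$ to denote the probability mass function of $V^n$. We now show
that the $\theta$-chain and the $\gamma$-chain converge to their corresponding posterior distributions
at exactly the same rate. For any starting state $\theta_0$ and $n \in \mathbb{N}$, we have

$$
\begin{aligned}
\|K^n(\theta_0, \cdot) - \nu_y(\cdot)\| &= \sup_A |K^n(\theta_0, A) - \nu_y(A)| \\
&= \frac{1}{2} \int |k^n(\theta_0, \theta) - \nu_y(\theta)|\, d\lambda(\theta) \\
&= \frac{1}{2} \int \big|v^n(\gamma_0, \gamma)\, p(\sigma^2, \beta_0, \beta_\gamma \mid \gamma, Y) - p(\gamma \mid Y)\, p(\sigma^2, \beta_0, \beta_\gamma \mid \gamma, Y)\big|\, d\lambda \\
&= \frac{1}{2} \sum_{\gamma \in \Gamma} |v^n(\gamma_0, \gamma) - p(\gamma \mid Y)| \Big[\int p(\sigma^2, \beta_0, \beta_\gamma \mid \gamma, Y)\, d\sigma^2\, d\beta_0\, d\beta_\gamma\Big] \\
&= \frac{1}{2} \sum_{\gamma \in \Gamma} |v^n(\gamma_0, \gamma) - p(\gamma \mid Y)| \\
&= \sup_B |V^n(\gamma_0, B) - p(\gamma \in B \mid Y)|.
\end{aligned}
$$

Hence, the $\theta$-chain inherits the convergence rate of its $\gamma$-subchain, a uniformly ergodic
Gibbs sampler on a finite state space. $\square$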
Description of the Regeneration Scheme
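For regeneration purposes, it is enough to restrict our attention to the Markov
chain that runs on $\gamma$. This is because, as we will see later, whenever this subchain
regenerates, the augmented chain that produces draws from the posterior distribution of
$\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$ also regenerates. We will find a function $s\colon \Gamma \to [0, 1)$ and a probability
mass function $d$ on $\Gamma$ such that $v(\gamma, \gamma')$ satisfies the minorization condition

$$v(\gamma, \gamma') \ge s(\gamma)\, d(\gamma') \quad \text{for all } \gamma, \gamma' \in \Gamma. \qquad \text{(C.1)}$$

We proceed via the "distinguished point" technique introduced in Mykland et al. (1995).
Let $\gamma^*$ denote a fixed model, which we will refer to as a distinguished model, and let
$D \subset \Gamma$ be a set of models. The model $\gamma^*$ and the set $D$ are arbitrary, but below we give
guidelines for making a practical choice of $\gamma^*$ and $D$. For all $\gamma, \gamma' \in \Gamma$ we have

$$v(\gamma, \gamma') = \frac{v(\gamma, \gamma')}{v(\gamma^*, \gamma')}\, v(\gamma^*, \gamma') \ge \Big[\min_{\gamma'' \in D} \frac{v(\gamma, \gamma'')}{v(\gamma^*, \gamma'')}\Big]\, v(\gamma^*, \gamma')\, I(\gamma' \in D).$$

If we let $c(\gamma^*)$ denote the normalizing constant for $v(\gamma^*, \gamma')\, I(\gamma' \in D)$, that is, $c(\gamma^*) = \sum_{\gamma' \in D} v(\gamma^*, \gamma')$, then we get

$$v(\gamma, \gamma') \ge c(\gamma^*) \Big[\min_{\gamma'' \in D} \frac{v(\gamma, \gamma'')}{v(\gamma^*, \gamma'')}\Big] \times \frac{v(\gamma^*, \gamma')\, I(\gamma' \in D)}{c(\gamma^*)} = s(\gamma)\, d(\gamma'),$$

where

$$s(\gamma) = c(\gamma^*) \min_{\gamma'' \in D} \frac{v(\gamma, \gamma'')}{v(\gamma^*, \gamma'')} \quad \text{and} \quad d(\gamma') = \frac{v(\gamma^*, \gamma')\, I(\gamma' \in D)}{c(\gamma^*)}.$$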
Evaluating both $s$ and $d$ requires computing transition probabilities of the form $v(\gamma, \gamma')$
which, due to the fact that the Markov chain on $\gamma$ is a Gibbs sampler, can be expressed
as

$$v(\gamma, \gamma') = P(\gamma_1' \mid \gamma_2, \gamma_3, \ldots, \gamma_q, Y) \times P(\gamma_2' \mid \gamma_1', \gamma_3, \ldots, \gamma_q, Y) \times \cdots \times P(\gamma_q' \mid \gamma_1', \ldots, \gamma_{q-1}', Y),$$

where the formula for the right side terms is given by (B.1). Since the $\gamma$'s in the terms on
the right side differ in at most one component, the fast updating techniques discussed in
Appendix B can be applied here too to speed up the computations.
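The product form of the transition probability can be sketched as below. This is an illustration only: `gibbs_transition_prob` is a hypothetical name, and `log_post` is an assumed user-supplied function returning the log of the unnormalized posterior probability of a model, passed as a tuple of 0/1 indicators.

```python
import numpy as np

def gibbs_transition_prob(gamma, gamma_new, log_post):
    """One-sweep transition probability v(gamma, gamma_new) of the Gibbs sampler.

    log_post is an assumed callable giving the unnormalized log posterior
    of a model encoded as a tuple of 0/1 indicators.
    """
    cur = list(gamma)
    prob = 1.0
    for j, target in enumerate(gamma_new):
        # Conditional probability that component j equals 1, given updated
        # components before j and not-yet-updated components after j.
        cur[j] = 1
        lp1 = log_post(tuple(cur))
        cur[j] = 0
        lp0 = log_post(tuple(cur))
        p1 = 1.0 / (1.0 + np.exp(lp0 - lp1))
        prob *= p1 if target == 1 else 1.0 - p1
        cur[j] = target  # move on with the new value of component j
    return prob
```

Each factor compares two models that differ in a single component, which is the situation where single-column updates of the fit apply.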

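By (2-11), $P(\delta_i = 1 \mid \gamma^{(i)}, \gamma^{(i+1)}) = s(\gamma^{(i)})\, d(\gamma^{(i+1)}) / v(\gamma^{(i)}, \gamma^{(i+1)})$, and the normalizing
constant $c(\gamma^*)$ cancels in the numerator, so in practice there is no need to compute it,
and the success probability of the regeneration indicator is simply

$$P(\delta_i = 1 \mid \gamma^{(i)}, \gamma^{(i+1)}) = \Big[\min_{\gamma'' \in D} \frac{v(\gamma^{(i)}, \gamma'')}{v(\gamma^*, \gamma'')}\Big] \Big[\frac{v(\gamma^*, \gamma^{(i+1)})\, I(\gamma^{(i+1)} \in D)}{v(\gamma^{(i)}, \gamma^{(i+1)})}\Big]. \qquad \text{(C.2)}$$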

The choice of $\gamma^*$ and $D$ affects the regeneration rate. Ideally we would like the
regeneration probability to be as big as possible. Notice that regeneration can occur only
if $\gamma^{(i+1)}$ is in $D$. This suggests making $D$ large. However, increasing the size of $D$ makes the
first term in brackets in (C.2) smaller. We have found that a reasonable tradeoff consists
of taking $D$ to be the smallest set of models that encompasses 25% of the posterior
probability. Also, the obvious choice for $\gamma^*$ is the HPM model. The distinguished model
and the set $D$ are selected from the output of an initial chain.
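The success probability of the regeneration indicator can be evaluated with a short routine such as the following sketch. The name `regeneration_prob` is hypothetical, and `v` is an assumed callable returning the one-sweep transition probability between two models; the sketch also assumes $v(\gamma^*, \cdot)$ is strictly positive on $D$.

```python
def regeneration_prob(gamma, gamma_next, gamma_star, D, v):
    """Success probability of the regeneration indicator for one transition.

    v(a, b) is an assumed callable for the transition probability from model
    a to model b; D is a collection of models; gamma_star is the
    distinguished model.
    """
    if gamma_next not in D:
        return 0.0  # the indicator that gamma_next lies in D vanishes
    # First bracket: smallest ratio of transition probabilities over D.
    min_ratio = min(v(gamma, g) / v(gamma_star, g) for g in D)
    # Second bracket: ratio of the two transition probabilities into gamma_next.
    return min_ratio * v(gamma_star, gamma_next) / v(gamma, gamma_next)

# Toy three-state chain standing in for the model space.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
v = lambda a, b: P[a][b]
print(regeneration_prob(1, 1, 0, {0, 1}, v))
```

In the toy example, enlarging `D` from `{0, 1}` to all three states can only lower the minimum ratio, which mirrors the tradeoff described above.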
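For the all-inclusive chain that runs not only on the model space but also on the
space of error variance and model coefficients, we can obtain a minorization condition if
we multiply the condition in (C.1) on both sides by

$$p(\sigma'^2 \mid \gamma', Y)\, p(\beta_0' \mid \gamma', \sigma'^2, Y)\, p(\beta_{\gamma'}' \mid \gamma', \sigma'^2, \beta_0', Y).$$

This yields

$$k(\theta, \theta') \ge s_1(\theta)\, d_1(\theta') \quad \text{for all } \theta = (\gamma, \sigma, \beta_0, \beta_\gamma),\ \theta' = (\gamma', \sigma', \beta_0', \beta_{\gamma'}'), \qquad \text{(C.3)}$$

where

$$s_1(\theta) = s(\gamma) \quad \text{and} \quad d_1(\theta') = d(\gamma')\, p(\sigma'^2 \mid \gamma', Y)\, p(\beta_0' \mid \gamma', \sigma'^2, Y)\, p(\beta_{\gamma'}' \mid \gamma', \sigma'^2, \beta_0', Y).$$

Hence for this bigger chain the regeneration indicator has, according to (2-11), success
probability

$$P(\delta_i = 1 \mid \theta^{(i)}, \theta^{(i+1)}) = \frac{s_1(\theta^{(i)})\, d_1(\theta^{(i+1)})}{k(\theta^{(i)}, \theta^{(i+1)})} = \frac{s(\gamma^{(i)})\, d(\gamma^{(i+1)})}{v(\gamma^{(i)}, \gamma^{(i+1)})},$$

which is exactly the same as the regeneration success probability for the chain on
$\Gamma$. Hence, the augmented $\theta$-chain and the $\gamma$-chain regenerate simultaneously. Note
that sampling from $d_1$ (which is needed to start the regeneration) is trivial. We first
sample $\gamma$ from $d$, which is done by sampling from $v(\gamma^*, \cdot)$ and retaining $\gamma$ only if
it is in $D$ (and to do this we do not need to know the normalizing constant $c(\gamma^*)$);
then we sequentially sample $\sigma^2$, $\beta_0$, and $\beta_\gamma$ from $p(\sigma^2 \mid \gamma, Y)$, $p(\beta_0 \mid \gamma, \sigma^2, Y)$, and
$p(\beta_\gamma \mid \gamma, \sigma^2, \beta_0, Y)$, respectively.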

APPENDIX D
MAP FOR THE OZONE PREDICTORS IN FIGURE 5-5

Table D-1. The 44 predictors used in the ozone illustration. The symbol "." represents an
interaction.
Number Predictor
1 vh (Vandenburg 500 millibar pressure height (m))
2 wind (Wind speed (mph) at Los Angeles International Airport (LAX))
3 humid (Humidity (percent) at LAX)
4 temp (Sandburg Air Force Base temperature (F))
5 ibh (Inversion base height at LAX)
6 dpg (Daggett pressure gradient (mm Hg) from LAX to Daggett, CA)
7 ibt (Inversion base temperature at LAX)
8 vis (Visibility (miles) at LAX)

Number Predictor Number Predictor
9 vh2 27 vh.dpg
10 wind2 28 wind.dp
11 humid2 29 humid.d
12 temp2 30 temp.dp
13 ibh2 31 ibh.dpg
14 dpg2 32 vh.ibt
15 ibt2 33 wind.ib
16 vis2 34 humid.i
17 vh.wind 35 temp.ib
18 vh.humid 36 ibh.ibt
19 wind.humid 37 dpg.ibt
20 vh.temp 38 vh.vis
21 wind.temp 39 wind.vi
22 humid.temp 40 humid.v
23 vh.ibh 41 temp.vi
24 wind.ibh 42 ibh.vis
25 humid.ibh 43 dpg.vis
26 temp.ibh 44 ibt.vis

REFERENCES

ANTONIAK, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian
nonparametric problems. The Annals of Statistics, 2 1152-1174.

ATHREYA, K. B., DOSS, H. and SETHURAMAN, J. (1996). On the convergence of the
Markov chain simulation method. The Annals of Statistics, 24 69-100.

BARBIERI, M. M. and BERGER, J. O. (2004). Optimal predictive model selection. The
Annals of Statistics, 32 870-897.

BREIMAN, L. and FRIEDMAN, J. H. (1985). Estimating optimal transformations for
multiple regression and correlation. Journal of the American Statistical Association, 80
580-598.

BURR, D. and DOSS, H. (1993). Confidence bands for the median survival time as
a function of the covariates in the Cox model. Journal of the American Statistical
Association, 88 1330-1340.

BURR, D. and DOSS, H. (2005). A Bayesian semiparametric model for random-effects
meta-analysis. Journal of the American Statistical Association, 100 242-251.

CASELLA, G. and MORENO, E. (2006). Objective Bayesian variable selection. Journal of
the American Statistical Association, 101 157-167.

CLYDE, M., DESIMONE, H. and PARMIGIANI, G. (1996). Prediction via orthogonalized
model mixing. Journal of the American Statistical Association, 91 1197-1208.

CLYDE, M., GHOSH, J. and LITTMAN, M. (2009). Bayesian adaptive sampling for
variable selection and model averaging. Discussion Paper 2009-16, Duke University
Department of Statistical Science.

COGBURN, R. (1972). The central limit theorem for Markov processes. In Proceedings
of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume
2. University of California Press, Berkeley, 485-512.

CUI, W. and GEORGE, E. (2008). Empirical Bayes vs. fully Bayes variable selection.
Journal of Statistical Planning and Inference, 138 888-900.

DOSS, H. (1994). Comment on "Markov chains for exploring posterior distributions" by
Luke Tierney. The Annals of Statistics, 22 1728-1734.

DOSS, H. (2007). Bayesian model selection: Some thoughts on future directions.
Statistica Sinica, 17 413-421.

DOSS, H. (2010). Estimation of large families of Bayes factors from Markov chain
output. Statistica Sinica, 20 537-560.

DURRETT, R. (1991). Probability: Theory and Examples. Brooks/Cole Publishing Co.

FERNANDEZ, C., LEY, E. and STEEL, M. F. J. (2001). Benchmark priors for Bayesian
model averaging. Journal of Econometrics, 100 381-427.

FOSTER, D. P. and GEORGE, E. I. (1994). The risk inflation criterion for multiple
regression. The Annals of Statistics, 22 1947-1975.

GEORGE, E. I. and FOSTER, D. P. (2000). Calibration and empirical Bayes variable
selection. Biometrika, 87 731-747.

GEORGE, E. I. and MCCULLOCH, R. E. (1997). Approaches for Bayesian variable
selection. Statistica Sinica, 7 339-374.

GEYER, C. J. (1992). Practical Markov chain Monte Carlo (with discussion). Statistical
Science, 7 473-511.

GEYER, C. J. (1994). Estimating normalizing constants and reweighting mixtures in
Markov chain Monte Carlo. Tech. Rep. 568r, Department of Statistics, University of
Minnesota.

GILL, R. D., VARDI, Y. and WELLNER, J. A. (1988). Large sample theory of empirical
distributions in biased sampling models. The Annals of Statistics, 16 1069-1112.

HANSEN, M. H. and YU, B. (2001). Model selection and the principle of minimum
description length. Journal of the American Statistical Association, 96 746-774.

HASTINGS, W. K. (1970). Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57 97-109.

HOBERT, J. P., JONES, G. L., PRESNELL, B. and ROSENTHAL, J. S. (2002). On the
applicability of regenerative simulation in Markov chain Monte Carlo. Biometrika, 89
731-743.

IBRAGIMOV, I. (1962). Some limit theorems for stationary processes. Theory of
Probability and its Applications, 7 349-382.

IBRAGIMOV, I. A. and LINNIK, Y. V. (1971). Independent and Stationary Sequences of
Random Variables. Wolters-Noordhoff, Groningen.

KASS, R. E. and WASSERMAN, L. (1995). A reference Bayesian test for nested
hypotheses and its relationship to the Schwarz criterion. Journal of the American
Statistical Association, 90 928-934.

KOHN, R., SMITH, M. and CHAN, D. (2001). Nonparametric regression using linear
combinations of basis functions. Statistics and Computing, 11 313-322.

KONG, A., MCCULLAGH, P., MENG, X.-L., NICOLAE, D. and TAN, Z. (2003). A theory of
statistical models for Monte Carlo integration (with discussion). Journal of the Royal
Statistical Society, Series B, 65 585-618.

LIANG, F., PAULO, R., MOLINA, G., CLYDE, M. A. and BERGER, J. O. (2008). Mixtures
of g-priors for Bayesian variable selection. Journal of the American Statistical
Association, 103 410-423.

MADIGAN, D. and YORK, J. (1995). Bayesian graphical models for discrete data.
International Statistical Review, 63 215-232.

MENG, X.-L. and WONG, W. H. (1996). Simulating ratios of normalizing constants via a
simple identity: A theoretical exploration. Statistica Sinica, 6 831-860.

MEYN, S. P. and TWEEDIE, R. L. (1993). Markov Chains and Stochastic Stability.
Springer-Verlag, New York, London.

MYKLAND, P., TIERNEY, L. and YU, B. (1995). Regeneration in Markov chain samplers.
Journal of the American Statistical Association, 90 233-241.

OWEN, A. and ZHOU, Y. (2000). Safe and effective importance sampling. Journal of the
American Statistical Association, 95 135-143.

RAFTERY, A. E., MADIGAN, D. and HOETING, J. A. (1997). Bayesian model averaging
for linear regression models. Journal of the American Statistical Association, 92
179-191.

ROMERO, M. (2003). On Two Topics with no Bridge: Bridge Sampling with Dependent
Draws and Bias of the Multiple Imputation Variance Estimator. Ph.D. thesis, University
of Chicago.

ROSENBLATT, M. (1984). Asymptotic normality, strong mixing and spectral density
estimates. The Annals of Probability, 12 1167-1180.

SCOTT, J. G. and BERGER, J. O. (2010). Bayes and empirical-Bayes multiplicity
adjustment in the variable-selection problem. The Annals of Statistics (to appear).

SMITH, M. and KOHN, R. (1996). Nonparametric regression using Bayesian variable
selection. Journal of Econometrics, 75 317-343.

TAN, A. and HOBERT, J. P. (2009). Block Gibbs sampling for Bayesian random effects
models with improper priors: convergence and regeneration. Journal of
Computational and Graphical Statistics, 18 861-878.

TAN, Z. (2004). On a likelihood approach for Monte Carlo integration. Journal of the
American Statistical Association, 99 1027-1036.

TIERNEY, L. (1994). Markov chains for exploring posterior distributions (Disc:
p1728-1762). The Annals of Statistics, 22 1701-1728.

VANDAELE, W. (1978). Participation in illegitimate activities: Ehrlich revisited. In
Deterrence and Incapacitation. US National Academy of Sciences, Washington DC,
270-335.

VARDI, Y. (1985). Empirical distributions in selection bias models. The Annals of
Statistics, 13 178-203.

ZELLNER, A. (1986). On assessing prior distributions and Bayesian regression analysis
with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays
in Honor of Bruno de Finetti (P. K. Goel and A. Zellner, eds.). Elsevier, New York,
233-243.

ZELLNER, A. and SIOW, A. (1980). Posterior odds ratios for selected regression
hypotheses. In Bayesian Statistics: Proceedings of the First International Meeting held
in Valencia (Spain) (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith,
eds.). Valencia: University Press, 585-603.

BIOGRAPHICAL SKETCH

Eugenia Buta was born in 1982 in Romania. In 2000, she was admitted to the
University of Oradea, Romania, from where she earned her Bachelor's degree in
Mathematics-Informatics in 2004. She then joined the Department of Statistics at the
University of Florida to pursue a Ph.D. degree. During her graduate student years, she
served as a Teaching Assistant for several undergraduate and graduate level courses in
the Department of Statistics. She expects to receive her doctorate degree in Statistics in
August 2010.

© 2010 Eugenia Buta

I dedicate this to my brother Florin.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
2 ESTIMATION OF BAYES FACTORS AND POSTERIOR EXPECTATIONS
2.1 Estimation of Bayes Factors
2.2 Estimation of Bayes Factors Using Control Variates
2.3 Estimation of Posterior Expectations
2.4 Estimation of Posterior Expectations Using Control Variates
2.5 Estimation of Posterior Expectations Using Control Variates With Estimated Skeleton Bayes Factors and Expectations
3 VARIANCE ESTIMATION AND SELECTION OF THE SKELETON POINTS
3.1 Estimation of the Variance
3.2 Selection of the Skeleton Points
4 REVIEW OF PREVIOUS WORK
5 ILLUSTRATION ON VARIABLE SELECTION
5.1 A Markov Chain for Estimating the Posterior Distribution of Model Parameters
5.2 Choice of the Hyperparameter
5.3 Examples
5.3.1 U.S. Crime Data
5.3.2 Ozone Data
6 DISCUSSION

APPENDIX
A PROOF OF RESULTS FROM CHAPTER 1
B DETAILS REGARDING GENERATION OF THE MARKOV CHAIN FROM CHAPTER 5
C PROOF OF THE UNIFORM ERGODICITY AND DEVELOPMENT OF THE MINORIZATION CONDITION FROM CHAPTER 5
D MAP FOR THE OZONE PREDICTORS IN FIGURE 5-5

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
5-1 Posterior inclusion probabilities for the fifteen predictor variables in the U.S. crime data set, under three models. Names of the variables are as in Table 2 of Liang et al. (2008) but all variables except for the binary variable S have been log transformed.
D-1 The 44 predictors used in the ozone illustration. The symbol "." represents an interaction.

LIST OF FIGURES

Figure
5-1 Estimates of Bayes factors for the U.S. crime data. The plots give two different views of the graph of the Bayes factor as a function of $w$ and $g$ when the baseline value of the hyperparameter is given by $w = 0.5$ and $g = 15$. The estimate is (2-…), which uses control variates.
5-2 Estimates of posterior inclusion probabilities for Variables 1 and 6 for the U.S. crime data. The estimate used is (2-…).
5-3 Variance functions for two versions of $\hat{I}$. The left panel is for the estimate based on the skeleton (5-…). The points in this skeleton were shifted to better cover the problematic region near the back of the plot ($g$ small and $w$ large), creating the skeleton (5-…). The maximum variance is then reduced by a factor of 9 (right panel).
5-4 Estimates of Bayes factors for the ozone data. The plots give two different views of the graph of the Bayes factor as a function of $w$ and $g$ when the baseline value of the hyperparameter is given by $w = .2$ and $g = 50$.
5-5 95% confidence intervals of the posterior inclusion probabilities for the 44 predictors in the ozone data when the hyperparameter value is given by $w = .13$ and $g = 75$. A table giving the correspondence between the integers 1-44 and the predictors is given in Appendix D.

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

COMPUTATIONAL APPROACHES FOR EMPIRICAL BAYES METHODS AND
BAYESIAN SENSITIVITY ANALYSIS

By Eugenia Buta
August 2010
Chair: Hani Doss
Major: Statistics

We consider situations in Bayesian analysis where we have a family of priors $\nu_h$ on the parameter $\theta$, where $h$ varies continuously over a space $\mathcal{H}$, and we deal with two related problems. The first involves sensitivity analysis and is stated as follows. Suppose we fix a function $f$ of $\theta$. How do we efficiently estimate the posterior expectation of $f(\theta)$ simultaneously for all $h$ in $\mathcal{H}$? The second problem is how do we identify subsets of $\mathcal{H}$ which give rise to reasonable choices of $\nu_h$? We assume that we are able to generate Markov chain samples from the posterior for a finite number of the priors, and we develop a methodology, based on a combination of importance sampling and the use of control variates, for dealing with these two problems. The methodology applies very generally, and we show how it applies in particular to a commonly used model for variable selection in Bayesian linear regression, in which the unknown parameter includes the model and the regression coefficients for the selected model. The prior is a hierarchical prior in which first the model is selected, then the coefficients for this model are chosen, and this prior is indexed by two hyperparameters. These hyperparameters effectively determine whether the selected model will be a large model with many variables, or a parsimonious model with only a few variables, so choosing them is very important. We give two illustrations of our methodology, one on the U.S. crime data of Vandaele and the other on ground level ozone data originally analyzed by Breiman and Friedman.

CHAPTER 1
INTRODUCTION

In the Bayesian paradigm we have a data vector $Y$ with density $p_\theta$ for some unknown $\theta \in \Theta$, and we wish to put a prior density on $\theta$. The available family of prior densities is $\{\nu_h, h \in \mathcal{H}\}$, where $h$ is called a hyperparameter. Typically, the hyperparameter is multivariate and choosing it can be difficult. But this choice is very important and can have a large impact on subsequent inference. There are two issues we wish to consider:

(A) Suppose we fix a quantity of interest, say $f(\theta)$, where $f$ is a function. How do we assess how the posterior expectation of $f(\theta)$ changes as we vary $h$? More generally, how do we assess changes in the posterior distribution of $f(\theta)$ as we vary $h$?

(B) How do we determine if a given subset of $\mathcal{H}$ constitutes a class of reasonable choices?

The first issue is one of sensitivity analysis and the second is one of model selection.

As an example of the kind of problem we wish to deal with, consider the problem of variable selection in Bayesian linear regression. Here, we have a response variable $Y$ and a set of predictors $X_1, \ldots, X_q$, each a vector of length $m$. For every subset $\gamma$ of $\{1, \ldots, q\}$ we have a potential model $M_\gamma$ given by

$$Y = 1_m \beta_0 + X_\gamma \beta_\gamma + \epsilon,$$

where $1_m$ is the vector of $m$ 1's, $X_\gamma$ is the design matrix whose columns consist of the predictor vectors corresponding to the subset $\gamma$, $\beta_\gamma$ is the vector of coefficients for that subset, and $\epsilon \sim \mathcal{N}_m(0, \sigma^2 I)$. Let $q_\gamma$ denote the number of variables in the subset $\gamma$. The unknown parameter is $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$, which includes the indicator of the subset of variables that go into the linear model. A very commonly used prior distribution on $\theta$ is given by a hierarchy in which we first choose the indicator $\gamma$ from the "independence Bernoulli prior" (each variable goes into the model with a certain probability $w$, independently of all the other variables) and then choose the vector of regression coefficients corresponding to the selected variables. In more detail, the model is described as follows:

$$
\begin{aligned}
Y \mid \gamma, \sigma, \beta_0, \beta_\gamma &\sim \mathcal{N}_m(1_m \beta_0 + X_\gamma \beta_\gamma, \sigma^2 I) && \text{(1-1a)} \\
(\sigma^2, \beta_0) \sim p(\sigma^2, \beta_0) \propto 1/\sigma^2 \ \text{ and given } \sigma,\ \ \beta_\gamma &\sim \mathcal{N}_{q_\gamma}\big(0, g\sigma^2 (X_\gamma' X_\gamma)^{-1}\big) && \text{(1-1b)} \\
\gamma &\sim w^{q_\gamma} (1 - w)^{q - q_\gamma} && \text{(1-1c)}
\end{aligned}
$$

The prior on $(\sigma, \beta_0, \beta_\gamma)$ is Zellner's $g$-prior introduced in Zellner (1986), and is indexed by a hyperparameter $g$. Although this prior is improper, the resulting posterior distribution is proper.

Note that we have used the word "model" in two different ways: (i) a model is a specification of the hyperparameter $h$, and (ii) a model in regression is a list of variables to include. The meaning of the word will always be clear from context.

To summarize, the prior on the parameter $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$ is given by the two-level hierarchy (1-1c) and (1-1b), and is indexed by $h = (w, g)$. Loosely speaking, when $w$ is large and $g$ is small, the prior encourages models with many variables and small coefficients, whereas when $w$ is small and $g$ is large, the prior concentrates its mass on parsimonious models with large coefficients. Therefore, the hyperparameter $h = (w, g)$ plays a very important role, and in effect determines the model that will be used to carry out variable selection.

A standard method for approaching model selection involves the use of Bayes factors. For each $h \in \mathcal{H}$, let $m_h(y)$ denote the marginal likelihood of the data under the prior $\nu_h$, that is, $m_h(y) = \int p_\theta(y)\, \nu_h(\theta)\, d\theta$. We will write $m_h$ instead of $m_h(y)$. The Bayes factor of the model indexed by $h_2$ vs. the model indexed by $h_1$ is defined as the ratio of the marginal likelihoods of the data under the two models, $m_{h_2}/m_{h_1}$, and is denoted throughout by $B(h_2, h_1)$. Bayes factors are widely used as a criterion for comparing models in Bayesian analyses. For selecting models that are better than others from the family of models indexed by $h \in \mathcal{H}$, our strategy will be to compute and subsequently compare all the Bayes factors $B(h, h_*)$, for all $h \in \mathcal{H}$, and a fixed hyperparameter value $h_*$. We could then consider as good candidate models those with values of $h$ that result in the largest Bayes factors.

Suppose now that we fix a particular function $f$ of the parameter $\theta$; for instance, in the example, this might be the indicator that variable 1 is included in the regression model. It is of general interest to determine the posterior expectation $E_h(f(\theta) \mid Y)$ as a function of $h$ and to determine whether or not $E_h(f(\theta) \mid Y)$ is very sensitive to the value of $h$. If it is not, then two individuals using two different hyperparameters will reach approximately the same conclusions and the analysis will not be controversial. On the other hand, if for a function of interest the posterior expectation varies considerably as we change the hyperparameter, then we will want to know which aspects of the hyperparameter (e.g. which components of $h$) produce big changes and we may want to see a plot of the posterior expectations as we vary those aspects of the hyperparameter. Except for extremely simple cases, posterior expectations cannot be obtained in closed form, and are typically estimated via Markov chain Monte Carlo (MCMC). It is slow and inefficient to run Markov chains for every hyperparameter value $h$. Chapter 2 reviews an existing method for estimating $E_h(f(\theta) \mid Y)$ that bypasses the need to run a separate Markov chain for every $h$. The method has an analogue for the problem of estimating Bayes factors. Unfortunately, the method has severe limitations, which we also discuss.

The purpose of this work is to introduce a methodology for dealing with the sensitivity analysis and model selection issues discussed above. The basic idea is (not surprisingly) to use Markov chains corresponding to a few values of the hyperparameter in order to estimate $E_h(f(\theta) \mid Y)$ for all $h \in \mathcal{H}$ and also the Bayes factors $B(h, h_*)$ for all $h \in \mathcal{H}$, and this is done through importance sampling. The difficulty we face is that there is a severe computational burden caused by the requirement that we handle a very large number of values of $h$.

The main contributions of this work are the development of computationally efficient schemes for estimating large families of posterior expectations and Bayes factors that are based on a combination of MCMC, importance sampling, and the use of control variates, therefore providing an answer to questions (A) and (B) raised earlier. We also provide theory to support the methods we propose. Chapter 2 describes our methodology for estimating Bayes factors and posterior expectations, and gives statements of theoretical results associated with the methodology. In Chapter 3 we discuss estimation of the variance of our estimates. Chapter 4 gives a review of the relevant literature, along with a discussion of how the present work fits in the context of previous related work. In Chapter 5 we return to the problem of variable selection in Bayesian linear regression. There, first we consider a Markov chain algorithm that generates a sequence $(\gamma^{(1)}, \sigma^{(1)}, \beta_0^{(1)}, \beta_\gamma^{(1)}), (\gamma^{(2)}, \sigma^{(2)}, \beta_0^{(2)}, \beta_\gamma^{(2)}), \ldots$, describe theoretical properties of this chain, and show how to implement the methods developed in Chapter 2 to answer questions (A) and (B) posed earlier. We also illustrate our methodology on two data sets. Appendix A contains the proofs of the theorems stated in Chapter 2, Appendix B provides details regarding the generation of our Markov chain and its computational complexity, and Appendix C gives technical details regarding the theoretical properties of the Markov chain.

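CHAPTER 2
ESTIMATION OF BAYES FACTORS AND POSTERIOR EXPECTATIONS

Let $\nu_{h,y}$ denote the posterior density of $\theta$ given $Y = y$ when the prior is $\nu_h$. Suppose we have a sample $\theta_1, \ldots, \theta_n$ (iid or ergodic Markov chain output) from the posterior density $\nu_{h_1,y}$ for a fixed $h_1$ and we are interested in the posterior expectation

$$E_h(f(\theta) \mid Y = y) = \int f(\theta)\, \nu_{h,y}(\theta)\, d\theta$$

for different values of the hyperparameter $h$. We may write $\int f(\theta)\, \nu_{h,y}(\theta)\, d\theta$ as

$$
\begin{aligned}
\int f(\theta)\, \frac{p_\theta(y)\nu_h(\theta)/m_h}{p_\theta(y)\nu_{h_1}(\theta)/m_{h_1}}\, \nu_{h_1,y}(\theta)\, d\theta
&= \frac{m_{h_1}}{m_h} \int f(\theta)\, \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta && \text{(2-1a)} \\
&= \frac{\dfrac{m_{h_1}}{m_h} \displaystyle\int f(\theta)\, \dfrac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta}{\dfrac{m_{h_1}}{m_h} \displaystyle\int \dfrac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta} && \text{(2-1b)} \\
&= \frac{\displaystyle\int f(\theta)\, \dfrac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta}{\displaystyle\int \dfrac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta}, && \text{(2-1c)}
\end{aligned}
$$

where in (2-1b) we have used the fact that the integral in the denominator is just 1 in order to cancel the unknown constant $m_{h_1}/m_h$ in (2-1c). The idea to express $\int f(\theta)\, \nu_{h,y}(\theta)\, d\theta$ in this way was proposed in a different context by Hastings (1970). Expression (2-1c) is the ratio of two integrals with respect to $\nu_{h_1,y}$, each of which may be estimated from the sequence $\theta_1, \ldots, \theta_n$. We may estimate the numerator and the denominator by

$$\frac{1}{n} \sum_{i=1}^n f(\theta_i)\, [\nu_h(\theta_i)/\nu_{h_1}(\theta_i)] \quad \text{and} \quad \frac{1}{n} \sum_{i=1}^n [\nu_h(\theta_i)/\nu_{h_1}(\theta_i)], \qquad \text{(2-2)}$$

respectively. Thus, if we let

$$w_i^{(h)} = \frac{\nu_h(\theta_i)/\nu_{h_1}(\theta_i)}{\sum_{e=1}^n [\nu_h(\theta_e)/\nu_{h_1}(\theta_e)]},$$

then these are weights, and we see that the desired integral may be estimated by the weighted average $\sum_{i=1}^n f(\theta_i)\, w_i^{(h)}$.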
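The weighted average above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used in this work: the function name `reweight` is hypothetical, the priors enter only through their logarithms evaluated at the draws, and normalizing constants of the priors that are common to all draws cancel in the weights.

```python
import numpy as np

def reweight(f_vals, log_nu_h, log_nu_h1):
    """Estimate the posterior expectation under prior nu_h from draws under nu_h1.

    f_vals, log_nu_h and log_nu_h1 hold f(theta_i), log nu_h(theta_i) and
    log nu_h1(theta_i) for draws theta_i from the posterior under nu_h1.
    """
    log_ratio = log_nu_h - log_nu_h1
    ratio = np.exp(log_ratio - log_ratio.max())  # rescale to avoid overflow
    weights = ratio / ratio.sum()                # weights sum to one
    return np.sum(weights * f_vals), weights

# Toy check: Y | theta ~ N(theta, 1) with y = 1.5 and priors N(0, 4) and N(0, 2).
rng = np.random.default_rng(0)
theta = rng.normal(1.2, np.sqrt(0.8), size=200000)  # posterior under the N(0, 4) prior
est, w = reweight(theta, -theta**2 / 4.0, -theta**2 / 8.0)
print(est)  # the exact posterior mean under the N(0, 2) prior is 1.0
```

In the toy example the ratio of priors is bounded, so the weights are well behaved; when the two priors differ greatly, a few draws can carry most of the weight.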

PAGE 15

Thedisappearanceofthelikelihoodfunctionin2aisveryconvenientbecause itscomputationrequiresconsiderableeffortinsomecasesforexample,whenwehave missingorcensoreddata,thelikelihoodisapossiblyhigh-dimensionalintegral.Note thatthesecondaveragein2isanestimateof m h = m h 1 ,i.e.theBayesfactor B h h 1 Ideally,wewouldliketousetheestimatesin2formultiplevaluesof h usingonlya samplefromtheposteriordistributioncorrespondingtothexedhyperparametervalue h 1 .But,whentheprior h differsfrom h 1 greatly,thetwoestimatesin2areunstable becauseofthepotentialthatonlyafewobservationswilldominatethesums.Theirratio suffersthesamedefect. Anaturalapproachfordealingwiththeinstabilityofthesesimpleestimatesis tochoose k hyperparametervalues h 1 ,..., h k 2H andtoreplace h 1 withamixture P k s =1 a s h s ,where a s 0 ,for s =1,..., k ,and P k s =1 a s =1 .Forconcreteness,consider theestimateoftheBayesfactor.Toestimate B h h 1 usingasample 1 ,..., n iidor ergodicMarkovchainoutputfromtheposteriormixture y := P k s =1 a s h s y ,weare temptedtowrite 1 n n X i =1 h i P k s =1 a s h s i = m h m h 1 1 n n X i =1 p i y h i = m h P k s =1 a s p i y h s i = m h 1 a = B h h 1 1 n n X i =1 h y i P k s =1 a s h s y i d s b where d s = m h s = m h 1 s =1,..., k .Thus,inordertohave y inthedenominator in2bwhichwouldimplythattheaveragein2bconvergesto 1 ,sothat2b convergesto B h h 1 ,weneedtostartoutwith P k s =1 a s h s = d s inthedenominatorof theleftsideof2a.Unfortunately,thisrequirestheconditionthatweknowthevector d = d 2 ,..., d k 0 .Underthiscondition,if 1 ,..., n aredrawnfromthemixture y ,instead offrom h 1 y ,wemayform 1 n n X i =1 h i P k s =1 a s h s i = d s 15

PAGE 16

andthisquantityconvergesto B h h 1 Z h y y y d = B h h 1 Assumingthatforeach l =1,..., k wehavesamples l i i =1,..., n l fromtheposterior density h l y ,thenfor a s = n s = n ,theestimatein2canbewrittenas ^ B h h 1 d = k X l =1 n l X i =1 h l i P k s =1 n s h s l i = d s Notethatthecombinedsamples l i i =1,..., n l l =1,..., k formastratiedsample fromthemixturedistribution y .Doss2010showsthatundercertainregularity conditionstheestimate2isconsistentandasymptoticallynormal. Invirtuallyallapplications,thevalueofthevector d isunknownandhastobe estimated.Doss2010doesnotdealwiththecasewhere d isunknown.Inthis Chapter,weassumethat d isestimatedviapreliminaryMCMCrunsgenerated independentlyoftherunssubsequentlyusedtoestimate B h h 1 .Hencethesampling willconsistofthefollowingtwostages. Stage1 Generatesamples l i i =1,..., N l from h l y ,theposteriordensityof given Y = y ,assumingthattheprioris h l ,foreach l =1,..., k ,andusethese N = P k l =1 N l observationstoformanestimateof d Stage2 IndependentlyofStage 1 ,againgeneratesamples l i i =1,..., n l from h l y foreach l =1,..., k ,andconstructtheestimateoftheBayesfactor B h h 1 based onthissecondsetof n = P k l =1 n l observationsandtheestimateof d fromStage 1 Fromnowon,for l =1,..., k ,weusethenotations A l and a l toidentifytheratios N l = N and n l = n ,respectively. Itisnaturaltoaskwhyisitnecessarytohavetwostepsofsampling,instead ofestimatingthevector d and B h h 1 fromasinglesample.Thereasonisthatwe areinterestedinestimatingBayesfactorsandposteriorexpectationsforaverylarge numberofvaluesof h ,andforeach h ,thecomputationaltimeneededislinearinthe 16


total sample size. This fact limits the total sample size and hence the accuracy of the estimates. An increase in accuracy can be achieved essentially for free by estimating $d$ from long preliminary runs in Stage 1. The cost incurred in Stage 1 is minimal because generating the chains is typically extremely fast, and has to be done only once.

Doss (2010) also developed an improvement of $\hat B(h,h_1,d)$ that is based on control variates, and showed that this improvement is also consistent and asymptotically normal. Unfortunately, both of these estimates require us to know the vector $d$ exactly. One may be tempted to believe that using an estimated $d$ instead of the true $d$ will not inflate the asymptotic variance (indeed, the literature has errors regarding this point, and this is discussed in Appendix A). Here we provide a careful analysis of the increase in the asymptotic variance that results when we use an estimate of $d$. A more detailed summary of the main contributions of the present work is as follows.

1. We develop a complete characterization of the asymptotic distribution of both the Bayes factor estimate above and the improvement that uses control variates, for the realistic case where $d$ is estimated from Stage 1 sampling (Theorems 1 and 2).
2. We develop an analogous theory for the problem of estimating a family of posterior expectations $E_h(f(\theta)\mid Y=y)$, $h\in\mathcal{H}$ (Theorems 3, 4, and 6).
3. We discuss estimation of the variance, and show how variance estimates can be used to guide selection of the skeleton points $h_1,\dots,h_k$.
4. We apply the methodology to the problem of Bayesian variable selection discussed earlier. In particular, we show how our methods enable us to select good values of $h=(w,g)$ and also to see how the probability that a given variable is included in the regression varies with $(w,g)$.

2.1 Estimation of Bayes Factors

Here, we analyze the asymptotic distributional properties of the estimator that results if in $\hat B(h,h_1,d)$ we replace $d$ with an estimate. Geyer (1994) proposes an estimator for $d$ based on the reverse logistic regression method, and Theorem 2 therein shows that this estimator is asymptotically normal when the samplers used satisfy certain regularity conditions. This estimator is obtained by maximizing with respect to $d_2,\dots,d_k$ the log


quasi-likelihood

$$
l_N(d)=\sum_{l=1}^{k}\sum_{i=1}^{N_l}\log\!\left(\frac{A_l\,\nu_{h_l}(\theta_i^{(l)})/d_l}{\sum_{s=1}^{k}A_s\,\nu_{h_s}(\theta_i^{(l)})/d_s}\right).\tag{2-6}
$$

The estimate is the same as the estimates obtained by Gill et al. (1988), Meng and Wong (1996), and Kong et al. (2003). We assume that for all the Markov chains we use, a Strong Law of Large Numbers (SLLN) holds for all integrable functions [for sufficient conditions see, e.g., Theorem 2 of Athreya et al. (1996)]. In the next theorem, we show that if $\hat d$ is the estimate produced by Geyer's (1994) method, or any of the equivalent estimates discussed above, then the estimate of the Bayes factor given by

$$
\hat B(h,h_1,\hat d)=\sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k}n_s\,\nu_{h_s}(\theta_i^{(l)})/\hat d_s}\tag{2-7}
$$

is asymptotically normal if certain regularity conditions are met. In (2-7), $\hat d_1=1$.

Theorem 1. Suppose the chains in Stage 2 satisfy conditions A1 and A2 in Doss (2010):

A1 For each $l=1,\dots,k$, the chain $\{\theta_i^{(l)}\}_{i=1}^{\infty}$ is geometrically ergodic.

A2 For each $l=1,\dots,k$, there exists $\epsilon>0$ such that

$$
E\!\left(\left[\frac{\nu_h(\theta_1^{(l)})}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_1^{(l)})/d_s}\right]^{2+\epsilon}\right)<\infty.
$$

Assume also that the chains in Stage 1 satisfy the conditions in Theorem 2 of Geyer (1994) that imply $\sqrt{N}(\hat d-d)\xrightarrow{d}\mathcal{N}(0,\Sigma)$. In addition, suppose the total sample sizes for the two stages, $N$ and $n$, are chosen such that $n/N\to q\in[0,\infty)$. Then

$$
\sqrt{n}\,\bigl(\hat B(h,h_1,\hat d)-B(h,h_1)\bigr)\xrightarrow{d}\mathcal{N}\bigl(0,\;q\,c(h)'\Sigma\,c(h)+\tau^2(h)\bigr),
$$

where $c(h)$ and $\tau^2(h)$ are given in equation (A.3) in the Appendix and equation (A.9) in Doss (2010), respectively.

Remarks


1. There are two components to the expression for the variance. The first component arises from estimating $d$, and the second component is the variance that we would have if we had estimated the Bayes factor knowing what $d$ is. As can be seen from the formula, the first component vanishes if $q=0$, i.e., if the sample size for estimating the parameter $d$ converges to infinity at a faster rate than does the sample size used to estimate the Bayes factor. In this case the Bayes factor estimator (2-7), which uses the estimate $\hat d$, has the same asymptotic distribution as the estimator $\hat B(h,h_1,d)$, which uses the true value of $d$. Otherwise, the variance of (2-7) is greater than that of $\hat B(h,h_1,d)$, and the difference between the variances depends on the magnitude of $q$.

2. This theorem assumes the sampling is done in two independent stages: Stage 1 samples to estimate $d$, and Stage 2 samples used together with $\hat d$ to estimate the Bayes factor $B(h,h_1)$. As a byproduct of our approach, we can get a similar theorem for the situation where both $d$ and $B(h,h_1)$ are estimated from the same sample. However, for the reasons discussed earlier, ordinarily we would not use a single sample. In more detail, if we impose the same conditions as in the theorem above on samples of total size $n$ from a single stage, except for the condition that $n/N\to q\in[0,\infty)$, then

$$
\sqrt{n}\,\bigl(\hat B(h,h_1,\hat d)-B(h,h_1)\bigr)\xrightarrow{d}\mathcal{N}\bigl(0,\;c(h)'\Sigma\,c(h)+\tau^2(h)+2\,c(h)'E'z\bigr),
$$

where $\Sigma$ denotes, as in the statement of Theorem 1, the asymptotic variance of $\sqrt{n}(\hat d-d)$, $E$ is the matrix given in equation (A.51), and $z$ is the column vector given in (A.49). A proof of this result is given in Appendix A.

2.2 Estimation of Bayes Factors Using Control Variates

Recall that we have samples $\theta_i^{(l)}$, $i=1,\dots,n_l$, from $\nu_{h_l,y}$, $l=1,\dots,k$, with independence across samples (Stage 2 of sampling) and that, based on an independent set of preliminary MCMC runs (Stage 1 of sampling), we have estimated the constants $d_2,\dots,d_k$. Also, $n_l/n=a_l$ and $n=\sum_{l=1}^{k}n_l$. Let

$$
Y(\theta)=\frac{\nu_h(\theta)}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta)/d_s}.
$$


Recalling that $\bar\nu_y:=\sum_{s=1}^{k}a_s\nu_{h_s,y}$, we have $E_{\bar\nu_y}(Y(\theta))=B(h,h_1)$, where the subscript $\bar\nu_y$ to the expectation indicates that $\theta\sim\bar\nu_y$. Also, for $j=2,\dots,k$, let

$$
Z_j(\theta)=\frac{\nu_{h_j}(\theta)/d_j-\nu_{h_1}(\theta)}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta)/d_s}
=\frac{\nu_{h_j,y}(\theta)-\nu_{h_1,y}(\theta)}{\sum_{s=1}^{k}a_s\,\nu_{h_s,y}(\theta)}.
$$

The second expression shows that $E_{\bar\nu_y}(Z_j(\theta))=0$. This is true even if the priors $\nu_{h_j}$ and $\nu_{h_1}$ are improper, as long as the posteriors $\nu_{h_j,y}$ and $\nu_{h_1,y}$ are proper, which is exactly our situation in the Bayesian variable selection example of Chapter 1. On the other hand, the first expression shows that $Z_j(\theta)$ is computable if we know the $d_j$'s: it involves the priors and not the posteriors. A similar remark applies to $Y(\theta)$. Therefore, if as in Doss (2010) we define, for $l=1,\dots,k$ and $i=1,\dots,n_l$,

$$
Y_{i,l}=\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s},\qquad
Z^{(1)}_{i,l}=1,\qquad
Z^{(j)}_{i,l}=\frac{\nu_{h_j}(\theta_i^{(l)})/d_j-\nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s},\quad j=2,\dots,k,
$$

then for any fixed $\beta=(\beta_2,\dots,\beta_k)$,

$$
\hat I_{d,\beta}=\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Bigl(Y_{i,l}-\sum_{j=2}^{k}\beta_j Z^{(j)}_{i,l}\Bigr)
$$

is an unbiased estimate of $B(h,h_1)$. The value of $\beta$ that minimizes the variance of $\hat I_{d,\beta}$ is unknown. As is commonly done when one uses control variates, we use instead the estimate obtained by doing ordinary linear regression of the response $Y_{i,l}$ on the predictors $Z^{(j)}_{i,l}$, $j=2,\dots,k$, and to emphasize that this estimate depends on $d$, we denote it by $\hat\beta(d)$. Theorem 1 of Doss (2010) states that the estimator $\hat B_{\mathrm{reg}}(h,h_1)=\hat I_{d,\hat\beta(d)}$, obtained under the assumption that we know the constants $d_2,\dots,d_k$, has an asymptotically normal distribution. As mentioned earlier, $d_2,\dots,d_k$ are typically unknown, and must be estimated. Let $\hat d_2,\dots,\hat d_k$ be estimates obtained from previous MCMC runs


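and let

$$
\hat I_{\hat d,\hat\beta(\hat d)}=\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Bigl(\hat Y_{i,l}-\sum_{j=2}^{k}\hat\beta_j(\hat d)\,\hat Z^{(j)}_{i,l}\Bigr),
$$

where $\hat Y_{i,l}$ and $\hat Z^{(j)}_{i,l}$ are defined like $Y_{i,l}$ and $Z^{(j)}_{i,l}$ above, except using $\hat d$ for $d$, and $\hat\beta(\hat d)$ is the least squares regression estimator from regressing $\hat Y_{i,l}$ on the predictors $\hat Z^{(j)}_{i,l}$, $j=2,\dots,k$. The next theorem gives the asymptotic distribution of this new estimator.

Theorem 2. Suppose all the conditions from Theorem 1 are satisfied. Moreover, assume that $R$, the $k\times k$ matrix defined by

$$
R_{j,j'}=E\Bigl(\sum_{l=1}^{k}a_l\,Z^{(j)}_{1,l}Z^{(j')}_{1,l}\Bigr),\qquad j,j'=1,\dots,k,
$$

is nonsingular. Then

$$
\sqrt{n}\,\bigl(\hat I_{\hat d,\hat\beta(\hat d)}-B(h,h_1)\bigr)\xrightarrow{d}\mathcal{N}\bigl(0,\;q\,w(h)'\Sigma\,w(h)+\sigma^2(h)\bigr).
$$

Expressions for $w(h)$ and $\sigma^2(h)$ are given in equation (A.18) to follow and equation (A.7) in Doss (2010), respectively.

2.3 Estimation of Posterior Expectations

In this section we give a method for estimating the posterior expectation of a function $f$ when the prior is $\nu_h$. Let us denote this quantity by

$$
I^{[f]}(h)=\int f(\theta)\,\nu_{h,y}(\theta)\,d\theta.
$$

Define

$$
\begin{aligned}
Y^{[f]}_{i,l}
&=\frac{f(\theta_i^{(l)})\,\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s}
=\frac{f(\theta_i^{(l)})\,\nu_h(\theta_i^{(l)})/m_h}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_i^{(l)})/m_{h_s}}\cdot\frac{m_h}{m_{h_1}}\\
&=\frac{f(\theta_i^{(l)})\,\nu_{h,y}(\theta_i^{(l)})}{\sum_{s=1}^{k}a_s\,\nu_{h_s,y}(\theta_i^{(l)})}\cdot B(h,h_1).
\end{aligned}
$$

Assuming a SLLN holds for the Markov chains $\theta_i^{(l)}$, $l=1,\dots,k$, $i=1,\dots,n_l$, we have

$$
\frac{1}{n_l}\sum_{i=1}^{n_l}Y^{[f]}_{i,l}\xrightarrow{\text{a.s.}}\int\frac{f(\theta)\,\nu_{h,y}(\theta)}{\sum_{s=1}^{k}a_s\,\nu_{h_s,y}(\theta)}\,\nu_{h_l,y}(\theta)\,d\theta\cdot B(h,h_1).
$$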


Therefore,

$$
\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}Y^{[f]}_{i,l}
=\sum_{l=1}^{k}\frac{n_l}{n}\,\frac{1}{n_l}\sum_{i=1}^{n_l}Y^{[f]}_{i,l}
\xrightarrow{\text{a.s.}}\int\frac{f(\theta)\,\nu_{h,y}(\theta)}{\sum_{s=1}^{k}a_s\,\nu_{h_s,y}(\theta)}\sum_{l=1}^{k}a_l\,\nu_{h_l,y}(\theta)\,d\theta\cdot B(h,h_1)
=I^{[f]}(h)\,B(h,h_1).
$$

Similarly, we have

$$
\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}Y_{i,l}\xrightarrow{\text{a.s.}}B(h,h_1)
$$

(the $Y_{i,l}$'s are defined in Section 2.2). Note that $Y_{i,l}=Y^{[f]}_{i,l}$ when $f\equiv 1$. Letting

$$
\hat I^{[f]}(h,d)=\frac{\sum_{l=1}^{k}\sum_{i=1}^{n_l}Y^{[f]}_{i,l}}{\sum_{l=1}^{k}\sum_{i=1}^{n_l}Y_{i,l}},
$$

we see that $\hat I^{[f]}(h,d)\xrightarrow{\text{a.s.}}I^{[f]}(h)$. Replacing the unknown $d$ with an estimate $\hat d$ obtained from Stage 1 sampling, we form the estimator

$$
\hat I^{[f]}(h,\hat d)=
\left.\sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{f(\theta_i^{(l)})\,\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_i^{(l)})/\hat d_s}
\right/
\sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_i^{(l)})/\hat d_s}.
$$

It is the asymptotic behavior of this estimator that we are concerned with in the following theorem.

Theorem 3. Suppose the conditions stated in Theorem 1 are satisfied and, in addition, for each $l=1,\dots,k$, there exists an $\epsilon>0$ such that $E\bigl(|Y^{[f]}_{1,l}|^{2+\epsilon}\bigr)<\infty$. Then

$$
\sqrt{n}\,\bigl(\hat I^{[f]}(h,\hat d)-I^{[f]}(h)\bigr)\xrightarrow{d}\mathcal{N}\bigl(0,\;q\,v(h)'\Sigma\,v(h)+\rho(h)\bigr).
$$


Expressions for $v(h)$ and $\rho(h)$ are given in equations (A.22) and (A.21), respectively, in the Appendix.

2.4 Estimation of Posterior Expectations Using Control Variates

We assume in this section that the values of the vector $d$ and the posterior expectations $I^{[f]}(h_j)$, for $j=1,\dots,k$, are available to us. In reality, these quantities are seldom known, and the next section deals with the case when they are estimated based on previous MCMC runs. Recall that the integral we want to estimate is $I^{[f]}(h)=\int f(\theta)\,\nu_{h,y}(\theta)\,d\theta$. In Section 2.3 we established that $(1/n)\sum_{l=1}^{k}\sum_{i=1}^{n_l}Y^{[f]}_{i,l}$ is a strongly consistent estimator of $I^{[f]}(h)\,B(h,h_1)$. Define

$$
Z^{[f](0)}_{i,l}=1,\qquad
Z^{[f](j)}_{i,l}=\frac{f(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})/d_j}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s}-I^{[f]}(h_j),\quad j=1,\dots,k,
$$

and let

$$
Z^{[f](j)}(\theta)=\frac{f(\theta)\,\nu_{h_j}(\theta)/d_j}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta)/d_s}-I^{[f]}(h_j),\quad j=1,\dots,k.
$$

With $\bar\nu_y$ denoting the mixture distribution $\sum_{s=1}^{k}a_s\nu_{h_s,y}$, it can be easily checked that

$$
E_{\bar\nu_y}\bigl(Z^{[f](j)}(\theta)\bigr)=0\quad\text{for } j=1,\dots,k,
$$

so we can use the $Z^{[f](j)}$'s as control variates to reduce the variance of the original estimator $(1/n)\sum_{l=1}^{k}\sum_{i=1}^{n_l}Y^{[f]}_{i,l}$. Doing so gives the estimator

$$
\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Bigl(Y^{[f]}_{i,l}-\sum_{j=1}^{k}\hat\beta^{[f]}_j Z^{[f](j)}_{i,l}\Bigr),
$$

where the $\hat\beta^{[f]}_j$'s denote the least squares estimates resulting from the regression of $Y^{[f]}_{i,l}$ on the predictors $Z^{[f](j)}_{i,l}$. The Bayes factor $B(h,h_1)$ will be estimated as before in Section 2.2, using the estimator $(1/n)\sum_{l=1}^{k}\sum_{i=1}^{n_l}Y_{i,l}$ and the $k-1$ control variates $Z^{(j)}$, for $j=2,\dots,k$. The ratio of these two control variate adjusted estimators provides us with an


improved estimator for the posterior expectation $I^{[f]}(h)$, which is given by

$$
\hat I_{\hat\beta,\hat\beta^{[f]}}=
\frac{\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Bigl(Y^{[f]}_{i,l}-\sum_{j=1}^{k}\hat\beta^{[f]}_j Z^{[f](j)}_{i,l}\Bigr)}
{\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Bigl(Y_{i,l}-\sum_{j=2}^{k}\hat\beta_j Z^{(j)}_{i,l}\Bigr)}.
$$

Theorem 4. Suppose conditions A1 and A2 stated in Theorem 1 are satisfied and the matrix $R$ defined in Theorem 2 is nonsingular. Also, suppose that

A3 for each $l=1,\dots,k$, there exists $\epsilon>0$ such that $E\bigl(|Y^{[f]}_{1,l}|^{2+\epsilon}\bigr)<\infty$;

A4 for each $l=1,\dots,k$, there exists $\epsilon>0$ such that $E_{\nu_{h_l,y}}\bigl(|f(\theta)|^{2+\epsilon}\bigr)<\infty$;

A5 for each $l=1,\dots,k$, $E_{\nu_{h_l,y}}\!\left(\dfrac{f^2(\theta)\,\nu_h(\theta)}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta)/d_s}\right)<\infty$;

A6 the $(k+1)\times(k+1)$ matrix $R^{[f]}$ defined by $R^{[f]}_{j+1,j'+1}=E\bigl(\sum_{l=1}^{k}a_l\,Z^{[f](j)}_{1,l}Z^{[f](j')}_{1,l}\bigr)$, $j,j'=0,\dots,k$, is nonsingular.

Then

$$
\sqrt{n}\,\bigl(\hat I_{\hat\beta,\hat\beta^{[f]}}-I^{[f]}(h)\bigr)\xrightarrow{d}\mathcal{N}\bigl(0,\;r(h)\bigr),
$$

where $r(h)$ is given in equation (A.33) of the Appendix.

Remarks

1. If $h=h_j$ for some $j\in\{1,\dots,k\}$, our estimator of the posterior expectation, $\hat I_{\hat\beta,\hat\beta^{[f]}}$ given above, has zero variance. To see why, note that in this case the response $Y^{[f]}$ can be written as

$$
Y^{[f]}=d_j\,I^{[f]}(h_j)+d_j\,Z^{[f](j)},
$$

so there is no noise in the regression of $Y^{[f]}$ on the predictors $Z^{[f](j)}$, and as a consequence, the numerator of this estimator is constant (specifically, $n\,d_j\,I^{[f]}(h_j)$). Through similar arguments, the denominator was shown to be constant ($n\,d_j$) in Doss (2010). Hence, for $h=h_j$, $\hat I_{\hat\beta,\hat\beta^{[f]}}$ is a perfect estimator of $I^{[f]}(h_j)$.

2. Theorem 4 pertains to the case where $d$ and the posterior expectations $I^{[f]}(h_j)$, $j=1,\dots,k$, are known. There do exist some situations where this is the case. For example, in the hierarchical model for Bayesian linear regression discussed in


Chapter 1, for each $j=1,\dots,k$, the marginal likelihood $m_{h_j}$ is a sum of $2^q$ terms, and if $q$ is relatively small, these marginal likelihoods are computable, so the vector $d$ is available. Likewise, for some functions $f$, the posterior expectations $I^{[f]}(h_j)$ can be numerically obtained; see Section 3 of George and Foster (2000). So it is possible to calculate $d$ and $I^{[f]}(h_j)$, $j=1,\dots,k$, for skeleton points $h_1,\dots,h_k$, and the method described in this section enables us to efficiently estimate the family $I^{[f]}(h)$, $h\in\mathcal{H}$.

2.5 Estimation of Posterior Expectations Using Control Variates With Estimated Skeleton Bayes Factors and Expectations

This section undertakes estimation of the posterior expectation $I^{[f]}(h)=\int f(\theta)\,\nu_{h,y}(\theta)\,d\theta$ in the case where the quantities $d$ and $I^{[f]}(h_j)$ are unknown and estimated based on previous MCMC runs. Let

$$
e=\bigl(I^{[f]}(h_1),\dots,I^{[f]}(h_k)\bigr)'
\quad\text{and}\quad
\hat e=\Bigl(\frac{1}{N_1}\sum_{i=1}^{N_1}f(\theta_i^{(1)}),\dots,\frac{1}{N_k}\sum_{i=1}^{N_k}f(\theta_i^{(k)})\Bigr)',
$$

i.e. $e$ is the vector of true expectations, which had been assumed known in Theorem 4, and $\hat e$ is its natural estimate based on the samples in Stage 1. To account for the fact that the responses $Y^{[f]}_{i,l}$ and covariates $Z^{[f](j)}_{i,l}$ used in the previous section are now unknown, as they involve the unknown $d$ and $e$, we need to consider new responses and covariates based on estimates $\hat d$ and $\hat e$ obtained from Stage 1 samples. Define

$$
\hat Y^{[f]}_{i,l}=\frac{f(\theta_i^{(l)})\,\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_i^{(l)})/\hat d_s}
\quad\text{and}\quad
\hat Z^{[f](j)}_{i,l}=\frac{f(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})/\hat d_j}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_i^{(l)})/\hat d_s}-\hat e_j,\quad j=1,\dots,k.
$$

Hence a new control variate adjusted estimator for $I^{[f]}(h)$, corresponding to the estimator of the previous section, which assumed knowledge of $d$ and the $I^{[f]}(h_j)$'s, is

$$
\hat I^{\hat d,\hat e}_{\hat\beta(\hat d),\hat\beta^{[f]}(\hat d)}=
\frac{\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Bigl(\hat Y^{[f]}_{i,l}-\sum_{j=1}^{k}\hat\beta^{[f]}_j(\hat d)\,\hat Z^{[f](j)}_{i,l}\Bigr)}
{\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Bigl(\hat Y_{i,l}-\sum_{j=2}^{k}\hat\beta_j(\hat d)\,\hat Z^{(j)}_{i,l}\Bigr)},
$$

where $\hat\beta^{[f]}(\hat d)$ is the least squares estimate resulting from the regression of $\hat Y^{[f]}_{i,l}$ on the predictors $\hat Z^{[f](j)}_{i,l}$. Theorem 6 establishes the asymptotic normality of this estimator. But before stating this theorem, we give an auxiliary theorem which shows the joint


asymptotic normality of

$$
\sqrt{N}\left(\begin{pmatrix}\hat d\\ \hat e\end{pmatrix}-\begin{pmatrix}d\\ e\end{pmatrix}\right).
$$

Let us first define separable and inseparable Monte Carlo samples as introduced by Geyer (1994). The Monte Carlo sample $\{\theta_i^{(l)}\}_{i=1}^{\infty}$, $l=1,\dots,k$, is said to be separable if there are disjoint subsets $L$ and $M$ of $\{1,\dots,k\}$ such that for each $\theta$ in the sample and each $l\in L$ and $m\in M$, either $\nu_{h_l,y}(\theta)$ or $\nu_{h_m,y}(\theta)$ is zero. A Monte Carlo sample that is not separable is said to be inseparable.

Theorem 5. Assume that the Monte Carlo sample from Stage 1, $\{\theta_i^{(l)}\}_{i=1}^{\infty}$, $l=1,\dots,k$, is inseparable, and the following conditions hold:

B1 for each $l=1,\dots,k$, the chain $\{\theta_i^{(l)}\}_{i=1}^{\infty}$ is geometrically ergodic;

B2 for each $l=1,\dots,k$, there exists $\epsilon>0$ such that $E_{\nu_{h_l,y}}\bigl(|f(\theta)|^{2+\epsilon}\bigr)<\infty$.

Then

$$
\sqrt{N}\left(\begin{pmatrix}\hat d\\ \hat e\end{pmatrix}-\begin{pmatrix}d\\ e\end{pmatrix}\right)\xrightarrow{d}\mathcal{N}(0,V),
$$

where $V$ is given in equation (A.48) in the Appendix.

Theorem 6. If the conditions stated in Theorems 4 and 2 hold, then

$$
\sqrt{n}\,\Bigl(\hat I^{\hat d,\hat e}_{\hat\beta(\hat d),\hat\beta^{[f]}(\hat d)}-I^{[f]}(h)\Bigr)
$$

converges in distribution to a normal distribution with mean $0$ and variance given in equation (A.57) in the Appendix.
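The two basic estimators of this chapter, $\hat B(h,h_1,d)$ of Section 2.1 and the ratio estimator $\hat I^{[f]}(h,d)$ of Section 2.3, can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the toy model ($y\mid\theta\sim N(\theta,1)$ with prior $\theta\sim N(0,h)$), all function names, the use of iid posterior draws in place of MCMC output, and the use of the exact $d$ in place of a Stage 1 estimate are choices made here so that the answers can be checked in closed form. They are not taken from the thesis. An estimated $\hat d$ would be plugged in through the same `d` argument.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def prior(theta, h):
    """Toy prior nu_h: theta ~ N(0, h) (assumed for illustration)."""
    return normal_pdf(theta, 0.0, h)


def marginal(y, h):
    """Toy marginal likelihood m_h when y | theta ~ N(theta, 1)."""
    return normal_pdf(y, 0.0, 1.0 + h)


def posterior_draws(y, h, n, rng):
    """iid draws from the toy posterior nu_{h,y}, standing in for MCMC output."""
    shrink = h / (1.0 + h)
    return [rng.gauss(shrink * y, math.sqrt(shrink)) for _ in range(n)]


def bayes_factor_and_expectation(h, skeleton, d, samples, f):
    """Return (B_hat, I_hat) for hyperparameter h.

    skeleton: h_1..h_k; d: d_1..d_k with d[0] == 1 (exact or estimated);
    samples[l]: Stage 2 draws from the posterior under skeleton[l].
    """
    sizes = [len(draws) for draws in samples]
    sum_y = 0.0    # sum over all draws of nu_h / sum_s n_s nu_{h_s} / d_s
    sum_yf = 0.0   # the same terms, each weighted by f(theta)
    for draws in samples:
        for theta in draws:
            denom = sum(n_s * prior(theta, h_s) / d_s
                        for n_s, h_s, d_s in zip(sizes, skeleton, d))
            weight = prior(theta, h) / denom
            sum_y += weight
            sum_yf += f(theta) * weight
    return sum_y, sum_yf / sum_y


rng = random.Random(0)
y = 1.5
skeleton = [0.5, 2.0, 8.0]
d = [marginal(y, h_s) / marginal(y, skeleton[0]) for h_s in skeleton]
samples = [posterior_draws(y, h_s, 4000, rng) for h_s in skeleton]

h = 4.0
b_hat, i_hat = bayes_factor_and_expectation(h, skeleton, d, samples, lambda t: t)
b_true = marginal(y, h) / marginal(y, skeleton[0])
i_true = y * h / (1.0 + h)   # posterior mean of theta under nu_{h,y}
print(b_hat, b_true, i_hat, i_true)
```

Because the likelihood never appears in `bayes_factor_and_expectation`, the same function can be evaluated on a fine grid of `h` values from one set of stored draws, which is the point of the approach.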


CHAPTER 3
VARIANCE ESTIMATION AND SELECTION OF THE SKELETON POINTS

Estimation of the variance of our estimates is important for several reasons. In addition to the usual need for providing error margins for our point estimates, variance estimates are of great help in selecting the skeleton points.

3.1 Estimation of the Variance

There are two approaches one can use to estimate the variance of any of our estimates. For the sake of concreteness, consider $\hat B(h,h_1,\hat d)$, whose asymptotic variance is the expression $q\,c(h)'\Sigma\,c(h)+\tau^2(h)$ (see Theorem 1).

Spectral Methods. If $X_0,X_1,X_2,\dots$ is a Markov chain and $f$ is a function, the asymptotic variance of $(1/n)\sum_{i=0}^{n-1}f(X_i)$ (when it exists) is the infinite series

$$
\operatorname{Var}(f(X_0))+2\sum_{j=1}^{\infty}\operatorname{Cov}(f(X_0),f(X_j)),
$$

where the variances and covariances are calculated under the assumption that $X_0$ has the stationary distribution. Spectral methods involve estimating an initial segment of the series, using techniques from time series; see Geyer (1992) for a review. Our problem is more complicated because we are dealing with multiple chains. In our situation, the term $\tau^2(h)$ may be estimated through spectral methods, and this is done in a straightforward manner. We now give technical details regarding the consistency of this method. The quantity $\tau^2(h)$ is given by $\tau^2(h)=\sum_{l=1}^{k}a_l\,\tau_l^2(h)$, where $\tau_l^2(h)$ is the asymptotic variance of

$$
\frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s}.
$$

See equation (A.9) of Doss (2010). Because for each $l$ we will be estimating $\tau_l^2(h)$ by the asymptotic variance of

$$
\frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_i^{(l)})/\hat d_s},
$$


where $\hat d$ is formed from Stage 1 runs, it is necessary to consider the quantity $\tau_l^2(h,u)$ defined as the asymptotic variance of

$$
\frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta_i^{(l)})/u_s},
$$

where $u=(u_1,u_2,\dots,u_k)'$. After defining

$$
f_u(\theta)=\frac{\nu_h(\theta)}{\sum_{s=1}^{k}a_s\,\nu_{h_s}(\theta)/u_s},
$$

we get

$$
\tau_l^2(h,u)=\operatorname{Var}\bigl(f_u(\theta_1^{(l)})\bigr)+2\sum_{j=1}^{\infty}\operatorname{Cov}\bigl(f_u(\theta_1^{(l)}),f_u(\theta_{1+j}^{(l)})\bigr).
$$

We now proceed to establish continuity of $\tau^2(h,u)$ in $u$, and to do this we will show that for each $l=1,\dots,k$, $\tau_l^2(h,u)$ is continuous in $u$. For the rest of this discussion, expectations and variances are taken with respect to $\nu_{h_l,y}$, and we drop $l$ from the notation. Let $u^{(n)}$ be any sequence of vectors such that $u^{(n)}\to d$. Then trivially $f_{u^{(n)}}(\theta)\to f_d(\theta)$ for all $\theta$, and letting $\eta=\min\{d_1,\dots,d_k\}$, there exists a positive integer $n_\eta$ such that $\|u^{(n)}-d\|\le\eta$ for all $n\ge n_\eta$. Consequently,

$$
f_{u^{(n)}}(\theta)\le f_{2d}(\theta)=2f_d(\theta)\quad\text{for all }\theta\text{ and all }n\ge n_\eta,
$$

and we can apply the Lebesgue Dominated Convergence Theorem twice to conclude that

$$
\operatorname{Var}\bigl(f_{u^{(n)}}(\theta_1)\bigr)=E\bigl(f^2_{u^{(n)}}(\theta_1)\bigr)-\bigl[E\bigl(f_{u^{(n)}}(\theta_1)\bigr)\bigr]^2
$$

converges to

$$
E\bigl(f^2_d(\theta_1)\bigr)-\bigl[E\bigl(f_d(\theta_1)\bigr)\bigr]^2=\operatorname{Var}\bigl(f_d(\theta_1)\bigr).
$$

Note that condition A2 guarantees that the dominating function in the bound above has finite expectation. Similarly, for each of the covariance terms,

$$
\operatorname{Cov}\bigl(f_{u^{(n)}}(\theta_1),f_{u^{(n)}}(\theta_{1+j})\bigr)\to\operatorname{Cov}\bigl(f_d(\theta_1),f_d(\theta_{1+j})\bigr).
$$


If we define

$$
c_u(j)=\operatorname{Cov}\bigl(f_u(\theta_1),f_u(\theta_{1+j})\bigr),\qquad j=1,2,\dots,
$$

then $\sum_{j=1}^{\infty}c_u(j)$ is absolutely convergent. This is because under geometric ergodicity, the so-called strong mixing coefficients $\alpha(j)$ decrease to $0$ exponentially fast (a definition of strong mixing is given on p. 349 of Ibragimov (1962)), and

$$
\bigl|\operatorname{Cov}\bigl(f_u(\theta_1),f_u(\theta_{1+j})\bigr)\bigr|\le C\,[\alpha(j)]^{\epsilon/(2+\epsilon)}\bigl(E|f_u(\theta_1)|^{2+\epsilon}\bigr)^{2/(2+\epsilon)}
$$

for some $\epsilon>0$ and a constant $C$ not depending on $j$. See Theorem 18.5.3 of Ibragimov and Linnik (1971) or Lemma 7.7 of Chapter 7 in Durrett (1991). Since $c_{u^{(n)}}(j)\to c_d(j)$ for each $j$, the domination bound $f_{u^{(n)}}\le 2f_d$ and the covariance inequality above enable us to again apply Dominated Convergence to conclude that $\sum_{j=1}^{\infty}c_{u^{(n)}}(j)\to\sum_{j=1}^{\infty}c_d(j)$, and this proves that $\tau^2(h,u)$ is continuous in $u$.

Let $g_u(\cdot)$ be the spectral density at $0$ of the series $f_u(\theta_i)$. Note that $g_u$ is equal to $\tau_l^2(h,u)$, except for a normalizing constant. Under strong mixing (implied by geometric ergodicity), standard spectral density estimates $\hat g_u$ are consistent, and bounds on the discrepancy $|\hat g_u-g_u|$ depend on the mixing rate and bounds on the moments of the function $f_u$ (Rosenblatt 1984). By the domination bound above, the rate is uniform as long as $\|u-d\|$ is small, and the condition that $\|\hat d-d\|$ is small is guaranteed if the Stage 1 sample size $N$ is large.

Geyer (1994) gives an expression for $\Sigma$ involving infinite series of the same variance-plus-covariances form as above, and this enables estimation of $\Sigma$ by spectral methods. Now, $c(h)$ is a vector each of whose components is an integral with respect to the posterior $\nu_{h,y}$ (see (A.3)). The estimate $\hat I^{[f]}(h,\hat d)$ derived in Section 2.3 is designed precisely to estimate such posterior expectations. Combining, we arrive at an overall estimate of the asymptotic variance of $\hat B(h,h_1,\hat d)$, and the asymptotic variances of our other estimates are handled similarly.

Methods Based on Regeneration. The cleanest approach to estimating asymptotic variances is based on regeneration. Let $X_0,X_1,X_2,\dots$ be a Markov chain on the measurable space $(\mathsf{X},\mathcal{B})$, let $K(x,A)$ be the Markov transition distribution, and assume


that $\pi$ is a stationary probability distribution for the chain. Suppose that for each $x$, $K(x,\cdot)$ has density $k(\cdot\mid x)$ with respect to a dominating measure $\mu$. Regeneration methods require the existence of a function $s:\mathsf{X}\to[0,1)$, whose expectation with respect to $\pi$ is strictly positive, and a probability density $d$ with respect to $\mu$, such that $k$ satisfies

$$
k(x'\mid x)\ge s(x)\,d(x')\quad\text{for all }x,x'\in\mathsf{X}.
$$

This is called a minorization condition and, as we describe below, it can be used to introduce regenerations into the Markov chain driven by $k$. These regenerations are the key to constructing a simple, consistent estimator of the variance in the central limit theorem. Define

$$
r(x'\mid x)=\frac{k(x'\mid x)-s(x)\,d(x')}{1-s(x)}.
$$

Note that, for fixed $x\in\mathsf{X}$, $r(x'\mid x)$ is a density function in $x'$. We may therefore write

$$
k(x'\mid x)=s(x)\,d(x')+(1-s(x))\,r(x'\mid x),
$$

which gives a representation of $k(\cdot\mid x)$ as a mixture of two densities, $d(\cdot)$ and $r(\cdot\mid x)$. This provides an alternative method of simulating from $k$. Suppose that the current state of the chain is $X_n$. We generate $\delta_n\sim\text{Bernoulli}(s(X_n))$. If $\delta_n=1$, we draw $X_{n+1}\sim d(\cdot)$; otherwise, we draw $X_{n+1}\sim r(\cdot\mid X_n)$. Note that, if $\delta_n=1$, the next state of the chain is drawn from $d$, which does not depend on the current state. Hence, the chain forgets the current state and we have a regeneration. To be more specific, suppose we start the Markov chain with $X_0\sim d$ and then use the method described above to simulate the chain. Each time $\delta_n=1$, we have $X_{n+1}\sim d$ and the process stochastically restarts itself; that is, the process regenerates. Even though the description above involves generating observations from $r$, there are clever tricks that enable the user to bypass generating from $r$, and to obtain the sequence $(X_n,\delta_n)$ in a way that requires only generating directly from $k$; see, e.g., Tan and Hobert (2009).


Here is how the regenerative method is used to get valid asymptotic standard errors. Suppose we wish to approximate the posterior expectation of some function $f(X)$. Suppose further that the Markov chain is to be run for $R$ regenerations (or tours); that is, we begin by drawing the starting value from $d$ and we stop the simulation the $R$th time that a $\delta_n=1$. Let $0=\tau_0<\tau_1<\tau_2<\dots<\tau_R$ be the (random) regeneration times, i.e. $\tau_t=\min\{n>\tau_{t-1}:\delta_{n-1}=1\}$ for $t\in\{1,2,\dots,R\}$. The total length of the simulation, $\tau_R$, is random. Let $N_1,N_2,\dots,N_R$ be the lengths of the tours, i.e. $N_t=\tau_t-\tau_{t-1}$, and define

$$
S_t=\sum_{n=\tau_{t-1}}^{\tau_t-1}f(X_n),\qquad t=1,\dots,R.
$$

Note that the $(N_t,S_t)$ pairs are iid, and a strongly consistent estimator of $E_\pi(f(X))$ is $\bar f_R=\bar S/\bar N=(1/\tau_R)\sum_{n=0}^{\tau_R-1}f(X_n)$, where $\bar S=(1/R)\sum_{t=1}^{R}S_t$ and $\bar N=(1/R)\sum_{t=1}^{R}N_t$, and the asymptotic variance of $\bar f_R$ may be estimated very simply by

$$
\frac{\sum_{t=1}^{R}\bigl(S_t-\bar f_R N_t\bigr)^2}{R\,\bar N^2}.
$$

Moment and ergodicity conditions that guarantee strong consistency of this variance estimator are given in Hobert et al. (2002). This method has recently been applied successfully in a number of problems involving continuous state spaces; see, e.g., Tan and Hobert (2009) and the references therein, and we use the method in the illustration in Chapter 5.

In our framework of multiple chains, one might think that we need to identify a sequence of times $0=\tau_0<\tau_1<\tau_2<\dots<\tau_R$ at which all the chains regenerate. This is not the case, and we need only identify, for each chain, a sequence of regeneration times for that chain. Since the overall estimate is essentially a function of averages involving the $k$ chains, its asymptotic variance is a function of the asymptotic variances of averages formed from the individual chains.
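The regenerative recipe for a single chain can be sketched as follows. This is a minimal illustration under assumptions that are not in the thesis: the regeneration density $d$ is taken to be $N(0,1)$, the residual kernel $r(\cdot\mid x)$ is taken to be $N(\rho x,1-\rho^2)$, and $s$ is a constant, so that the stationary distribution is $N(0,1)$ and the answer for $f(x)=x^2$ is known to be $1$. The function names are hypothetical, and the standard error assumes the central limit theorem is scaled by $\sqrt{R}$, as in Hobert et al. (2002).

```python
import math
import random


def simulate_tours(num_tours, s, rho, f, rng):
    """Simulate a chain with kernel k = s * d + (1 - s) * r and split it into tours.

    Toy assumptions: d is N(0, 1), r(. | x) is N(rho * x, 1 - rho^2), and s is
    constant, so the stationary distribution is N(0, 1).
    Returns a list of (S_t, N_t) pairs, one per tour.
    """
    tours = []
    x = rng.gauss(0.0, 1.0)          # X_0 ~ d
    tour_sum, tour_len = 0.0, 0
    while len(tours) < num_tours:
        tour_sum += f(x)
        tour_len += 1
        if rng.random() < s:         # delta_n = 1: the chain regenerates
            tours.append((tour_sum, tour_len))
            tour_sum, tour_len = 0.0, 0
            x = rng.gauss(0.0, 1.0)  # X_{n+1} ~ d
        else:                        # delta_n = 0: move with the residual kernel r
            x = rng.gauss(rho * x, math.sqrt(1.0 - rho ** 2))
    return tours


def regenerative_estimate(tours):
    """Return (f_bar, var_hat, std_err) computed from the (S_t, N_t) pairs."""
    num_tours = len(tours)
    s_bar = sum(s_t for s_t, _ in tours) / num_tours
    n_bar = sum(n_t for _, n_t in tours) / num_tours
    f_bar = s_bar / n_bar
    var_hat = sum((s_t - f_bar * n_t) ** 2 for s_t, n_t in tours) / (num_tours * n_bar ** 2)
    # Standard error under the assumed sqrt(R) scaling of the CLT.
    return f_bar, var_hat, math.sqrt(var_hat / num_tours)


rng = random.Random(1)
tours = simulate_tours(5000, 0.2, 0.9, lambda x: x * x, rng)
f_bar, var_hat, std_err = regenerative_estimate(tours)
print(f_bar, var_hat, std_err)
```

With several chains, `regenerative_estimate` would be applied to each chain separately, in line with the remark that regeneration times need only be identified chain by chain.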
Consider the function $B(h,h_1)$, $h\in\mathcal{H}$, and an estimator, such as $\hat B(h,h_1,\hat d)$ (for the rest of this discussion, we will denote these by $B(h)$ and $\hat B(h)$, for brevity). It is of interest to provide a confidence band (region, if $h$ is multidimensional) for $B(h)$ that is valid simultaneously for all $h\in\mathcal{H}$. A closely related problem is to produce a confidence interval for $\arg\max_{h\in\mathcal{H}}B(h)$. The traditional way of forming confidence bands that are valid globally is to proceed as follows:


1. Establish a functional central limit theorem that says that $n^{1/2}\bigl(\hat B(h)-B(h)\bigr)$ converges in distribution to a Gaussian process $W(h)$, $h\in\mathcal{H}$.
2. Find the distribution of $\sup_{h\in\mathcal{H}}|W(h)|$.

If $s_\alpha$ is the $(1-\alpha)$-quantile of the distribution of this supremum, then the band $\hat B(h)\pm s_\alpha/n^{1/2}$ has asymptotic coverage probability equal to $1-\alpha$. The value $s_\alpha$ is typically too difficult to compute analytically, but can be obtained by simulation [see, e.g., Burr and Doss (1993) among many others]. The maximal inequalities needed to establish functional central limit theorems typically require an iid structure, and for this reason we believe that the regeneration method offers the best hope for establishing such theorems.

3.2 Selection of the Skeleton Points

The asymptotic variances of any of our estimates depend on the choice of the points $h_1,\dots,h_k$. For concreteness, consider $\hat B(h,h_1,\hat d)$, and to emphasize this dependence, let $V(h,h_1,\dots,h_k)$ denote the asymptotic variance of $\hat B(h,h_1,\hat d)$. For fixed $h_1,\dots,h_k$, identifying the set of $h$'s for which $V(h,h_1,\dots,h_k)$ is finite is typically a feasible problem. For instance, Doss (1994) considered the pump data example discussed in Tierney (1994), for which the hyperparameter $h$ has dimension $3$, and determined this set for the case $k=1$. He showed that one can go as far away from $h_1$ as one wants in certain directions, but in other directions the range is limited. The calculation can be extended to any $k$. Suppose now that we fix a range $\mathcal{H}$ over which $h$ is to vary. Typically, we will want more than just a positioning of $h_1,\dots,h_k$ that guarantees that $V(h,h_1,\dots,h_k)$ is finite for all $h\in\mathcal{H}$, and we will face the problem below.

Design Problem. Find the values of $h_1,\dots,h_k$ that minimize $\max_{h\in\mathcal{H}}V(h,h_1,\dots,h_k)$.

Unfortunately, except for extremely simple cases, it is not possible to calculate $V(h,h_1,\dots,h_k)$ analytically (even if $k=1$, $V(h,h_1)$ is an infinite sum each of whose terms depends on the Markov transition distribution in a complicated way), and maximizing it over $h\in\mathcal{H}$ would present additional difficulties. Furthermore, even if


we were able to calculate $\max_{h\in\mathcal{H}}V(h,h_1,\dots,h_k)$, the design problem would involve the minimization of a function of $k\times\dim(\mathcal{H})$ variables, and in general, solving the design problem is hopeless.

In our experience, we have found that the following method works reasonably well. Having specified the range $\mathcal{H}$, we select trial values $h_1,\dots,h_k$ and plot the estimated variance as a function of $h$, using one of the methods described above. If we find a region in $\mathcal{H}$ where this variance is unacceptably large, we cover this region by moving some $h_l$'s closer to the region, or by simply adding new $h_l$'s in that region, which increases $k$. This is illustrated in the example in Chapter 5.


CHAPTER 4
REVIEW OF PREVIOUS WORK

Vardi (1985) introduced the following $k$-sample model for biased sampling. There is an unknown distribution function $F$, which we wish to estimate. For each weight function $w_l$, $l=1,\dots,k$, we have a sample $X_{l1},\dots,X_{ln_l}\overset{\text{iid}}{\sim}F_l$, where

$$
F_l(x)=\frac{1}{W_l}\int_{-\infty}^{x}w_l(s)\,dF(s).
$$

Here, $W_l=\int_{-\infty}^{\infty}w_l(s)\,dF(s)$. The weight functions $w_1,\dots,w_k$ are known, but the normalizing constants $W_1,\dots,W_k$ are not. Vardi (1985) was interested in conditions that guarantee that a nonparametric maximum likelihood estimator (NPMLE) exists and is unique, and he gave the form of the NPMLE. The conditions for existence and uniqueness involve issues regarding the supports of the $F_l$'s and do not concern us in the present paper.

To estimate $F$, a preliminary step is to estimate the vector $W=(W_1,\dots,W_k)$. Vardi (1985) and Gill et al. (1988) show that $W$ may be estimated by the solution to the system of $k$ equations

$$
W_l=\int\frac{w_l(y)}{\sum_{j=1}^{k}a_j\,w_j(y)/W_j}\,dF_n(y),\qquad l=1,\dots,k,
$$

where $a_j=n_j/n$, $n=\sum_{j=1}^{k}n_j$, and $F_n$ is the empirical distribution function that gives mass $1/n$ to each of the $X_{il}$. Actually, the solution to this system is not unique: it is trivial to see that if the vector $W$ solves the system, then so does $cW$, for any positive constant $c$. However, it turns out that knowing $W$ only up to a multiplicative constant is all that is needed, and to avoid non-identifiability issues, we define the vector $V=(W_2/W_1,\dots,W_k/W_1)$. Gill et al. (1988) show that if $\widehat W$ is any solution to the system, and $\widehat V$ is defined by $\widehat V=(\widehat W_2/\widehat W_1,\dots,\widehat W_k/\widehat W_1)$, then $n^{1/2}(\widehat V-V)$ is asymptotically normal (Proposition 2.3 in Gill et al. (1988)). Once an estimate of $W$ is formed, it is relatively easy to form an estimate $\hat F_n$ of $F$, and consequently of integrals of the form $\int h\,dF$. Gill et al. (1988)


obtain functional weak convergence results of the sort

$$
n^{1/2}\Bigl(\int h\,d\hat F_n-\int h\,dF\Bigr)\xrightarrow{d}Z(h),
$$

where $Z$ is a mean $0$ Gaussian process indexed by $h\in H$, where $H$ is a large class of square integrable functions.

It is not difficult to see that our setup is the same as that considered in Vardi (1985) and Gill et al. (1988): their $F$ corresponds to our $\nu_{h,y}$; their $w_l$ to $\nu_{h_l}/\nu_h$; $F_l$ to $\nu_{h_l,y}$; $W_l$ to $m_{h_l}/m_h$; and $V$ to $d$. But there are major differences between our framework and theirs. They deal with iid samples, and so can use empirical process theory, whereas we deal with Markov chains, for which such a theory is not available. In their framework, the samples arise from some experiment, and they are seeking optimal estimates given the data that are handed to them. In contrast, our samples are obtained by Monte Carlo, so we have control over design issues. In particular, we are concerned with computational efficiency, in addition to statistical efficiency; hence our interest in the two-stage sampling method for preliminary estimation of $d$ and for enabling the use of control variates.

Geyer (1994) also deals with the setup in Vardi (1985) and Gill et al. (1988), i.e. the $k$-sample model for biased sampling, and he also considers the problem of estimating $d$. As mentioned in Section 2.1, his estimator is obtained by maximizing (2-6), and the solution is numerically identical to the solution to the system of $k$ equations of Vardi (1985) and Gill et al. (1988) given above. However, he considers the situation where each of the $k$ samples is a Markov chain, as opposed to an iid sample, and assuming that the chains satisfy certain mixing conditions, he obtains a central limit theorem for $n^{1/2}(\hat d-d)$. Naturally, the variance of the limiting distribution is different from the variance obtained in Gill et al. (1988), and is typically larger.
In Section 7 of their paper, Meng and Wong (1996) consider the situation where for each $l=1,\dots,k$, we have an iid sample from the density $f_l=q_l/m_l$, where the functions $q_1,\dots,q_k$ are known, but the normalizing constants $m_1,\dots,m_k$ are not, and we wish to estimate the vector $(m_2/m_1,\dots,m_k/m_1)$. Without going into detail, we mention that they develop a family of bridge functions and show that, in the iid setting, the optimal bridge function gives rise to an estimate identical to that of Geyer (1994). They obtain their


estimate through an iterative scheme which is fast and stable [Meng and Wong (1996), p. 849] and this is the computational method we use in the present paper.

Owen and Zhou (2000) consider the problem of estimating an integral of the form $I=\int h(x)f(x)\,dx$, where $f$ is a probability density that is completely known (as opposed to known up to a normalizing constant) and $h$ is a known function. They wish to estimate $I$ through importance sampling. They assume they can generate sequences $X_{l1},\dots,X_{ln_l}\overset{\text{iid}}{\sim}p_l$, $l=1,\dots,k$, where the $p_l$'s are completely known densities. The doubly indexed sequence $X_{li}$, $i=1,\dots,n_l$, $l=1,\dots,k$, forms a stratified sample from the mixture density $p_a=\sum_{l=1}^{k}a_l p_l$, where $a_l=n_l/\sum_{l=1}^{k}n_l$, so one can carry out importance sampling with respect to this mixture. They point out that since the $p_l$'s are completely known, they can form the functions $H_j(x)=[p_j(x)/p_a(x)]-1$, $j=1,\dots,k$, and these satisfy $E_{p_a}(H_j(X))=0$, where the subscript indicates that the expectation is taken with respect to the mixture density $p_a$. Therefore, these $k$ functions can be used as control variates. What we do in Chapter 2 is similar, except that we are working with densities whose functional form is known, but whose normalizing constants are not.

Kong et al. (2003) also consider the $k$-sample model for biased sampling, but have a different perspective, and we describe their work in the notation of the present paper. They assume that there are probability measures $Q_1,\dots,Q_k$, with densities $q_1/m_1,\dots,q_k/m_k$, respectively, relative to some dominating measure $\mu$, and for each $l=1,\dots,k$, we have an iid sample $\{X_{li}\}_{i=1}^{n_l}$ from $Q_l$. Here, the $q_l$'s are known, but the $m_l$'s are not. Their objective is to estimate all possible ratios $m_l/m_j$, $l,j\in\{1,\dots,k\}$, or, equivalently, the vector $d=(m_2/m_1,\dots,m_k/m_1)$. In their highly unorthodox approach, Kong et al. (2003) obtain the maximum likelihood estimate $\hat\mu$ of the dominating measure $\mu$ itself ($\hat\mu$ is given up to an overall multiplicative constant). They can then estimate the ratios $m_l/m_j$, since the normalizing constants are known functions of $\mu$, i.e. $m_r=\int q_r(x)\,d\mu(x)$, and $q_r$ is known. They show that the resulting estimate of $d$ is obtained


by solving the system
$$d_r = \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{q_r(X_{li})}{\sum_{s=1}^{k} n_s\, q_s(X_{li})/d_s}, \qquad r = 1, \ldots, k,$$
which is easily seen to be identical to the system (4) of Gill et al. (1988).

Tan (2004) shows how control variates can be incorporated in the likelihood framework of Kong et al. (2003). When there are $r$ functions $H_j$, $j = 1, \ldots, r$, for which we know that $\int H_j\,d\mu = 0$, the parameter space is restricted to the set of all sigma-finite measures satisfying these $r$ constraints. For the case where $X_{li}$, $i = 1, \ldots, n_l$, are iid for each $l = 1, \ldots, k$, he obtains the maximum likelihood estimate of $\mu$ in this reduced parameter space, and therefore corresponding estimates of $d$ and $m_h/m_{h_1}$, and shows that this approach gives estimates that are asymptotically equivalent to estimates that use control variates via regression. He also obtains results on asymptotic normality of his estimators that are valid when we have the iid structure.

The estimates of $d$ in Gill et al. (1988), Geyer (1994), Meng and Wong (1996), and Kong et al. (2003) are all equivalent. Theorem 1 of Tan (2004) establishes asymptotic optimality of this estimate under the iid assumption. When the samples are Markov chain draws, the asymptotically optimal estimate is essentially impossible to obtain (Romero 2003). But the estimate derived under the iid assumption can still be used in the Markov chain setting if one can develop asymptotic results that are valid in the Markov chain case, and this is done by Geyer (1994), whose results we use in all our theorems.
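As a rough illustration of how the system for $d$ can be solved numerically, the sketch below iterates the right-hand side as a fixed-point map, rescaling so that $d_1 = 1$ at each pass. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function name `estimate_ratios`, the iid sampling, and the two Gaussian kernels (for which $m_2/m_1 = 2$ is known) are all hypothetical choices made only for illustration.

```python
import numpy as np

def estimate_ratios(q_funcs, samples, n_iter=500, tol=1e-12):
    """Fixed-point iteration for the ratios d_r = m_r / m_1 (hypothetical helper).

    q_funcs : list of k vectorized unnormalized densities q_1, ..., q_k
    samples : list of k 1-D arrays; samples[l] is drawn from q_l / m_l
    Returns an array d with d[0] == 1.
    """
    n = np.array([len(s) for s in samples], dtype=float)   # sample sizes n_l
    pooled = np.concatenate(samples)                        # all X_li pooled together
    Q = np.stack([q(pooled) for q in q_funcs])              # Q[s, i] = q_s(x_i)
    d = np.ones(len(q_funcs))
    for _ in range(n_iter):
        denom = (n[:, None] * Q / d[:, None]).sum(axis=0)   # sum_s n_s q_s(x) / d_s
        d_new = (Q / denom).sum(axis=1)                     # right side of the system
        d_new /= d_new[0]                                   # fix the scale so d_1 = 1
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d

# Illustrative data (an assumption, not from the text): two Gaussian kernels.
rng = np.random.default_rng(0)
q1 = lambda x: np.exp(-x**2 / 2.0)    # N(0, 1) kernel, m_1 = sqrt(2*pi)
q2 = lambda x: np.exp(-x**2 / 8.0)    # N(0, 4) kernel, m_2 = 2*sqrt(2*pi)
x1 = rng.normal(0.0, 1.0, size=5000)
x2 = rng.normal(0.0, 2.0, size=5000)
d_hat = estimate_ratios([q1, q2], [x1, x2])   # d_hat[1] estimates m_2/m_1 = 2
```

The system is invariant to rescaling of $d$, which is why the normalization $d_1 = 1$ can be imposed inside the loop without changing the solution.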


CHAPTER 5
ILLUSTRATION ON VARIABLE SELECTION

There exist many classes of problems in Bayesian analysis in which the sensitivity analysis and model selection issues discussed earlier arise; see Chapter 6. Here we give an application involving the hierarchical prior used in variable selection in the Bayesian linear regression model discussed in Chapter 1. This chapter consists of three parts. First we discuss an MCMC algorithm for this model and state some of its theoretical properties; then we discuss the literature on selection of the hyperparameter $h$; and finally we present two detailed illustrations of our methodology.

5.1 A Markov Chain for Estimating the Posterior Distribution of Model Parameters

The design of MCMC algorithms for estimating the posterior distribution of $\theta$ under (1) revolves around the generation of the indicator variable $\gamma$. We now briefly review the algorithms for running a Markov chain on $\gamma$ that are proposed in the literature, and the main issues of implementation of these algorithms. Raftery et al. (1997) and Madigan and York (1995) discuss the following Metropolis-Hastings algorithm for generating a sequence $\gamma^{(1)}, \gamma^{(2)}, \ldots$. If the current state is $\gamma$, a new state $\gamma^*$ is formed by selecting at random a coordinate $j$, setting $\gamma^*_j = 1 - \gamma_j$, and $\gamma^*_k = \gamma_k$ for $k \neq j$. The proposal $\gamma^*$ is then accepted or rejected with the Metropolis-Hastings acceptance probability $\min\{p(\gamma^* \mid Y)/p(\gamma \mid Y),\, 1\}$. Madigan and York (1995) call this algorithm MC$^3$. Clyde et al. (1996) propose a modification of this algorithm in which we do not select a component at random and update it, but instead sequentially update all components. They call this the Hybrid Algorithm. Strictly speaking, this is a Metropolized Gibbs sampler, and is not actually a Metropolis-Hastings algorithm. Smith and Kohn (1996) propose a Gibbs sampler which simply cycles through the coordinates $\gamma_j$ one at a time. George and McCulloch (1997) show that when compared with MC$^3$, the Gibbs sampler algorithm gives estimates with smaller standard error, and is also slightly faster, at least in several simulation studies they conducted.
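The single-flip Metropolis-Hastings scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `mc3` and the toy target (independent inclusion indicators with known probabilities) are hypothetical, and any function returning $\log p(\gamma \mid Y)$ up to an additive constant could be passed in as `log_post`.

```python
import numpy as np

def mc3(log_post, q, n_iter, rng):
    """Single-flip Metropolis-Hastings chain on {0,1}^q (hypothetical helper).

    log_post : function mapping a 0/1 vector gamma to log p(gamma | Y) + constant
    Returns an (n_iter, q) integer array of visited states.
    """
    gamma = np.zeros(q, dtype=int)          # start from the null model
    lp = log_post(gamma)
    draws = np.empty((n_iter, q), dtype=int)
    for t in range(n_iter):
        j = rng.integers(q)                 # choose a coordinate uniformly at random
        prop = gamma.copy()
        prop[j] = 1 - prop[j]               # proposal: gamma*_j = 1 - gamma_j
        lp_prop = log_post(prop)
        # accept with probability min{p(gamma*|Y) / p(gamma|Y), 1}
        if np.log(rng.random()) < lp_prop - lp:
            gamma, lp = prop, lp_prop
        draws[t] = gamma
    return draws

# Toy target (an assumption for illustration): independent inclusion indicators.
p_incl = np.array([0.8, 0.5, 0.2])
toy_log_post = lambda g: float(np.sum(g * np.log(p_incl) + (1 - g) * np.log(1 - p_incl)))
draws = mc3(toy_log_post, q=3, n_iter=30000, rng=np.random.default_rng(1))
incl_freq = draws.mean(axis=0)              # should be close to p_incl
```

Because the proposal is symmetric, the acceptance ratio involves only the posterior ratio, so the normalizing constant of $p(\gamma \mid Y)$ is never needed.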


Kohn et al. (2001) consider Metropolized Gibbs algorithms which are the same as the Hybrid Algorithm of Clyde et al. (1996), except that at coordinate $j$, instead of deterministically proposing to go from $\gamma_j$ to $\gamma^*_j = 1 - \gamma_j$, the proposed value $\gamma^*_j$ is equal to $1 - \gamma_j$ with probability depending on the current state $\gamma$. Kohn et al. (2001) describe two such algorithms, and show that these are more computationally efficient than the Gibbs sampler in situations where on average $q_\gamma$ is small, i.e. the models are sparse. They also conduct a detailed simulation study of one of their sampling schemes (their SS) which suggests that, while the scheme produces estimates whose standard errors are a bit larger than those produced by the Gibbs sampler, this disadvantage is more than outweighed by its computational efficiency.

All the algorithms mentioned above require, in one way or another, the calculation of $p(\gamma^* \mid Y)/p(\gamma \mid Y)$. Because of the conjugate nature of model (1), the marginal likelihood of model $\gamma$ is available in closed form, and therefore $p(\gamma \mid Y)$ is available up to a normalizing constant. We have
$$p(\gamma \mid Y) \propto (1+g)^{-q_\gamma/2}\, S^{-(m-1)} \Bigl[1 - \frac{g}{1+g}\, R_\gamma^2\Bigr]^{-(m-1)/2} w^{q_\gamma} (1-w)^{q - q_\gamma},$$
where $S^2 = \sum_{j=1}^{m} (Y_j - \bar{Y})^2$ and $R_\gamma^2$ is the coefficient of determination of model $\gamma$. As is standard for model (1), we assume that the columns of the design matrix are centered, and in this case, $R_\gamma^2 = Y' X_\gamma (X_\gamma' X_\gamma)^{-1} X_\gamma' Y / S^2$. The main computational burden in obtaining (5) is the calculation of $R_\gamma^2$, which is time-consuming if $q_\gamma$ is large. Smith and Kohn (1996) note that, when $\gamma$ and $\gamma^*$ differ in only one component, $R_{\gamma^*}^2$ can be obtained rapidly from $R_\gamma^2$. We return to this point in Appendix B.
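To make the closed-form expression concrete, the sketch below evaluates $\log p(\gamma \mid Y)$ up to an additive constant, computing $R_\gamma^2$ by least squares on the centered columns. It is a minimal sketch under stated assumptions: the function name `log_post_gamma` is hypothetical, the factor $S^{-(m-1)}$ is dropped because it does not depend on $\gamma$, and no fast updating of $R_\gamma^2$ (as in Smith and Kohn 1996) is attempted.

```python
import numpy as np

def log_post_gamma(gamma, X, Y, g, w):
    """log p(gamma | Y) up to an additive constant (hypothetical helper).

    gamma : 0/1 vector of length q selecting columns of X
    X, Y  : (m, q) design matrix and length-m response
    g, w  : the two hyperparameters of the prior
    """
    gamma = np.asarray(gamma, dtype=bool)
    m, q = X.shape
    Xc = X - X.mean(axis=0)                 # center the columns of the design matrix
    Yc = Y - Y.mean()
    S2 = float(Yc @ Yc)                     # S^2 = sum_j (Y_j - Ybar)^2
    q_g = int(gamma.sum())                  # q_gamma, the size of the model
    if q_g == 0:
        R2 = 0.0                            # null model
    else:
        Xg = Xc[:, gamma]
        coef, *_ = np.linalg.lstsq(Xg, Yc, rcond=None)
        R2 = float(Yc @ (Xg @ coef)) / S2   # Y'X(X'X)^{-1}X'Y / S^2
    return (-0.5 * q_g * np.log1p(g)
            - 0.5 * (m - 1) * np.log(1.0 - g / (1.0 + g) * R2)
            + q_g * np.log(w) + (q - q_g) * np.log1p(-w))
```

A function of this kind is what a sampler on $\gamma$ would call repeatedly; only differences of its values between two models matter.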
In our situation, we need to generate a Markov chain on $\theta$, because the Bayes factor estimates given in Chapter 2 require samples from the posterior distribution of $\theta$. The algorithm we use in the present paper is based on the Gibbs sampler on $\gamma$ introduced in Smith and Kohn (1996) (although the computational implementation we use is different from theirs), followed by three steps to generate $\sigma$, $\beta_0$, and $\beta_\gamma$. In a bit more detail, let


$V_\gamma$ be the Markov transition function corresponding to the Gibbs sampler in Smith and Kohn (1996), i.e. $V_\gamma(\cdot)$ is the distribution of the next state given that the current state is $\gamma$, and let $v_\gamma(\gamma') = V_\gamma(\{\gamma'\})$ be the corresponding probability mass function.

Suppose the current state is $(\gamma^{(i)}, \sigma^{(i)}, \beta_0^{(i)}, \beta^{(i)}_{\gamma^{(i)}})$. We proceed as follows.

1. We update $\gamma^{(i)}$ to $\gamma^{(i+1)}$ using $V_{\gamma^{(i)}}$. The generation of $\gamma^{(i+1)}$ does not involve $(\sigma^{(i)}, \beta_0^{(i)}, \beta^{(i)}_{\gamma^{(i)}})$.
2. We generate $\sigma^{(i+1)}$ from the conditional distribution of $\sigma$ given $\gamma = \gamma^{(i+1)}$ and the data.
3. We generate $\beta_0^{(i+1)}$ from the conditional distribution of $\beta_0$ given $\gamma = \gamma^{(i+1)}$, $\sigma = \sigma^{(i+1)}$, and the data.
4. We generate $\beta^{(i+1)}_{\gamma^{(i+1)}}$ from the conditional distribution of $\beta_\gamma$ given $\gamma = \gamma^{(i+1)}$, $\sigma = \sigma^{(i+1)}$, $\beta_0 = \beta_0^{(i+1)}$, and the data.

The details describing the distributions involved and the computations needed are given in Appendix B. The algorithm above gives a sequence $\theta^{(1)}, \theta^{(2)}, \ldots$, and it is easy to see that this sequence is a Markov chain.

As Markov chains on the $\gamma$ sequence, the relative performance of the Gibbs sampler vs. SS depends, in part, on $m$, $q$, $h$, and the data set itself, and neither algorithm is uniformly superior to the other. In principle, in Step 1 of our algorithm we can use any Markov transition function that generates a chain on $\gamma$, including SS. We chose to work with the Gibbs sampler because it is easier to develop a regeneration scheme for this chain than for the other chains.

The output of the chain can be used in several ways. An obvious way is to use the highest posterior probability model (HPM). Unfortunately, when $q$ is bigger than around 20, the number of models, $2^q$, is very large, and it may happen that no single model has appreciable probability, and in any case, it is very difficult or impossible to identify the HPM from the Markov chain output. Barbieri and Berger (2004) argue in favor of the median probability model (MPM), which is defined to be the model that includes all variables $j$ for which the marginal inclusion probability $P(\gamma_j = 1 \mid Y) \geq 1/2$. We mention


here the Bayesian Adaptive Sampling method of Clyde et al. (2009), which gives an algorithm for providing samples without replacement from the set of models. Under certain conditions, the algorithm has the feature that these are perfect samples without replacement; it then enables an efficient search for the HPM.

Uniform Ergodicity

Let $\Theta = \{0,1\}^q \times (0,\infty) \times \mathbb{R}^{q+1}$, let $\nu$ be the prior distribution of $\theta$ specified by (1b) and (1c), and let $\nu_y$ be the posterior distribution of $\theta$ given $Y = y$. For the remainder of this section the subscript $h$ is suppressed since we are dealing with a single specification of this hyperparameter. Let $K$ denote the Markov transition function for the Markov chain on $\Theta$ described in the beginning of this chapter, i.e. $K_{\theta_0}(\cdot)$ is the distribution of $\theta_1$ given that the current state is $\theta_0$, and let $K^n_{\theta_0}(\cdot)$ denote the corresponding $n$-step Markov transition function. Harris ergodicity of the chain is the condition that $\|K^n_\theta(\cdot) - \nu_y(\cdot)\| \to 0$ for all $\theta \in \Theta$, where $\|\cdot\|$ denotes supremum over all Borel subsets of $\Theta$. This condition is guaranteed by the so-called usual regularity conditions, namely that the chain has an invariant probability measure, is irreducible, aperiodic, and Harris recurrent; see, e.g., Theorem 13.0.1 of Meyn and Tweedie (1993). These usual regularity conditions are typically easy to check; in the present context, they are implied for example if the Markov transition function has a density with respect to the product of counting measure on $\{0,1\}^q$ and Lebesgue measure on $(0,\infty) \times \mathbb{R}^{q+1}$ which is everywhere positive, which is the case in our situation. Uniform ergodicity is the far stronger condition that there exist constants $c \in [0,1)$ and $M > 0$ such that for any $n \in \mathbb{N}$, $\|K^n_\theta(\cdot) - \nu_y(\cdot)\| \leq M c^n$ for all $\theta \in \Theta$.

Proposition 1. The chain driven by $K$ is uniformly ergodic.

The proof of Proposition 1 is given in Appendix C. Let $\theta_0, \theta_1, \ldots$ be a Markov chain driven by $K$, let $l$ be a real-valued function of $\theta$ [for example $l(\theta) = I(\gamma_1 = 1)$, the indicator that variable 1 is in the model], and suppose we wish to form confidence


intervals for the posterior expectation of $l(\theta)$. Suppose that $E[l^2(\theta) \mid y] < \infty$. Then since the chain is uniformly ergodic, Corollary 4.2 of Cogburn (1972) implies that, with $\mathrm{Var}(l(\theta_0))$ and $\mathrm{Cov}(l(\theta_0), l(\theta_j))$ calculated under the assumption that $\theta_0$ has the stationary distribution, the series
$$\kappa^2 = \mathrm{Var}(l(\theta_0)) + 2 \sum_{j=1}^{\infty} \mathrm{Cov}(l(\theta_0), l(\theta_j))$$
converges absolutely, and if $\kappa^2 > 0$, then with $\theta_0$ having an arbitrary distribution, the estimate $\bar{l}_n = (1/n) \sum_{i=0}^{n-1} l(\theta_i)$ satisfies
$$n^{1/2} \bigl( \bar{l}_n - E[l(\theta) \mid y] \bigr) \xrightarrow{d} N(0, \kappa^2) \quad \text{as } n \to \infty.$$

The Markov chain driven by $K$ is also regenerative, and in Appendix C we give an explicit minorization condition that can be used to introduce regenerations into the chain. Functions that run the chain and implement the regeneration scheme are provided in the R package bvslr, available from http://www.stat.ufl.edu/ ebuta/BVSLR

In Chapters 1 and 2, $\nu_h$ and $\nu_{h,y}$ refer to the prior and posterior densities, and all estimates in Chapter 2 involve ratios of these prior densities. In the Bayesian linear regression model that we are considering here, the priors $\nu_h$ on $\theta = (\gamma, \sigma, \beta_0, \beta_\gamma)$ are actually probability measures on $\{0,1\}^q \times (0,\infty) \times \mathbb{R}^{q+1}$, which in fact are not absolutely continuous with respect to the product of counting measure on $\{0,1\}^q$ and Lebesgue measure on $(0,\infty) \times \mathbb{R}^{q+1}$. For $h_1 = (w_1, g_1)$ and $h_2 = (w_2, g_2)$, the Radon-Nikodym derivative of $\nu_{h_1}$ with respect to $\nu_{h_2}$ is given by
$$\Bigl[\frac{d\nu_{h_1}}{d\nu_{h_2}}\Bigr](\gamma, \sigma, \beta_0, \beta_\gamma) = \Bigl(\frac{w_1}{w_2}\Bigr)^{q_\gamma} \Bigl(\frac{1 - w_1}{1 - w_2}\Bigr)^{q - q_\gamma} \frac{\phi_{q_\gamma}\bigl(\beta_\gamma;\, 0,\, g_1 \sigma^2 (X_\gamma' X_\gamma)^{-1}\bigr)}{\phi_{q_\gamma}\bigl(\beta_\gamma;\, 0,\, g_2 \sigma^2 (X_\gamma' X_\gamma)^{-1}\bigr)},$$
where $\phi_{q_\gamma}(u; a, V)$ is the density of the $q_\gamma$-dimensional normal distribution with mean $a$ and covariance $V$, evaluated at $u$ (Doss 2007). It is immediate that all formulas in Chapter 2 remain valid if ratios of the form $\nu_h/\nu_{h_1}$ [see, e.g., equation (2)] are replaced by the Radon-Nikodym derivative $[d\nu_h/d\nu_{h_1}]$. Fortunately, evaluation of (5)


requires neither matrix inversion nor calculation of a determinant, so can be done very quickly. Note that in view of (5), it is not enough to have Markov chains running on the $\gamma$'s, and we need Markov chains running on the $\theta$'s, or at least on $(\gamma, \sigma, \beta_\gamma)$.

5.2 Choice of the Hyperparameter

As mentioned earlier, regarding $w$, the proposals in the literature are quite simple: either $w$ is fixed at $1/2$, or a beta prior is put on $w$. The discussion below focuses primarily on $g$, for which there is an extensive literature, and we now summarize the portion of this literature that is directly relevant to the present work. Broadly speaking, recommendations regarding $g$ can be divided into three categories:

Data-Independent Choices. In the simple case where the setup is given by (1) but without (1c), i.e. the true model $\gamma$ is assumed known, the posterior distribution of $\beta_\gamma$ given $\sigma$ is $N\bigl( \tfrac{g}{g+1} \hat\beta_\gamma,\ \tfrac{g}{g+1} \sigma^2 (X_\gamma' X_\gamma)^{-1} \bigr)$, where $\hat\beta_\gamma$ is the usual least squares estimate of $\beta_\gamma$. If $q$ is fixed and $m \to \infty$, under standard conditions $X_\gamma' X_\gamma / m$ converges to a positive definite matrix; therefore if $g$ is fixed, this distribution is approximately a point mass at $\tfrac{g}{g+1} \hat\beta_\gamma$, so the posterior is not even consistent, and we see that a necessary condition for consistency is that $g \to \infty$. Data-independent choices of $g$ include Kass and Wasserman's (1995) recommendation of $g = m$, and Fernandez et al.'s (2001) recommendation of $g = \max(m, q^2)$, following upon Foster and George's (1994) earlier recommendation of $g = q^2$. Liang et al. (2008) argue that, in general, data-independent choices of $g$ have the following undesirable property, referred to as the Information Paradox. When the data give overwhelming evidence in favor of model $\gamma$ (e.g. $\|\hat\beta_\gamma\| \to \infty$), then using $\gamma_0$ to denote the null model (i.e. the model that includes only the intercept), the ratio of posterior probabilities $p(\gamma \mid Y)/p(\gamma_0 \mid Y)$ does not tend to infinity.


Empirical Bayes (EB) Methods. In global EB procedures, an estimate of $g$ common for all models is derived from its marginal likelihood; see George and Foster (2000). In local EB, an estimate of $g$ is derived for each model; see Hansen and Yu (2001). Unfortunately, the EB method is in general computationally demanding because the likelihood is a sum over all $2^q$ models $\gamma$, so it is practically feasible only for relatively small values of $q$. Liang et al. (2008) show that the EB method is consistent in the frequentist sense: if $\gamma^*$ is the true model, then if $g$ is chosen via the EB method, the posterior probability $P(\gamma = \gamma^* \mid Y)$ converges to 1 as $m \to \infty$. See Theorem 3 of Liang et al. (2008) for a precise statement. This result refers only to the case where $w$ is fixed at $1/2$, and only $g$ is estimated. Liang et al. (2008) propose an EM algorithm for estimating $g$ in the global EB setting. In their algorithm, the model indicator and $\sigma$ are treated as missing data. While their approach is certainly useful, there are some problems associated with it. Each step in the EM algorithm involves a sum of $2^q$ terms. Unless $q$ is relatively small, complete enumeration is not possible, and Liang et al. (2008) propose summing only over the most significant terms. However, determining which terms these are may be very difficult in some problems. Also, the EM algorithm gives a single point estimate. What we do is different: we estimate the Bayes factor for all $g$ and $w$. This enables us in particular to estimate the maximizing values; but it also allows us to rule out large regions of the hyperparameter space. Additionally, our method allows us to carry out sensitivity analysis. We also mention very briefly that if we are interested only in the maximizing values, then the method proposed in the present paper can be used to form a stochastic search algorithm. The basic requirement for such algorithms is that we know the gradient $\partial B(h, h_1)/\partial h$. But the same methodology used to estimate $B(h, h_1)$ can also be used to estimate its gradient. For example, in the simple estimate (2), we just replace $\nu_h(\theta_i^{(l)})$ by $\partial \nu_h(\theta_i^{(l)})/\partial h$.


Fully Bayes (FB) Methods. The most common prior on $g$ is the Zellner and Siow (1980) prior, an inverse-gamma which results in a multivariate Cauchy prior for $\beta_\gamma$. The family of hyper-$g$ priors is introduced by Cui and George (2008) and developed further by Liang et al. (2008), who show that these have several desirable properties. In particular, they do not suffer from the information paradox, and they exhibit important consistency properties.

Both the EB methods and FB methods have their own advantages and disadvantages. Cui and George (2008) give evidence that EB methods outperform FB methods. This is based on extensive simulation studies in cases where numerical methods are feasible. Also, FB methods require one to specify hyperparameters of the prior on the hyperparameter $h$, and different choices lead to different inferences. Additionally, in EB methods, one uses a model with a single value of $h$, and the resulting inference is more parsimonious and interpretable.

On the other hand, as with many likelihood-based methods, special care needs to be taken when the maximizing value is at the boundary. When we use the EB method, if the maximizing value of $w$ is 0 or 1, the posterior assigns probability one to the null model or full model (model that includes all variables), respectively. This is similar to the very simple situation in which we have $X \sim \mathrm{binomial}(n, p)$: if we observe $X = 0$ then not only is the maximum likelihood estimate of $p$ equal to 0, but the associated standard error estimate is also 0, and the naive Wald-type confidence interval for $p$ is the singleton $\{0\}$. Of course in this simple case there exist modifications to the maximum likelihood estimate $\hat{p} = X/n$ which yield procedures that do not give rise to this degeneracy. How to develop corresponding modifications to the maximum likelihood estimate of the Bernoulli parameter $w$ in the present context is a problem that is much more difficult, but certainly worthy of investigation.

Scott and Berger (2010) consider the same model for variable selection that we consider here, i.e. model (1), but with a Zellner-Siow prior on $g$, and the remaining


parameter, $w$, estimated by maximum likelihood. They show that if the null model has the largest marginal likelihood, then the MLE of $w$ is 0 and if the full model has the largest marginal likelihood, then the MLE of $w$ is 1. Each of these gives rise to the degeneracy discussed above. Their result is not true in our setup, in which we do not put a prior on $g$, but rather estimate both $w$ and $g$ by maximum likelihood. To see this, consider a very simple example, in which $Y = (2, 1, 9, 5)'$ and
$$X = \begin{pmatrix} 1 & 3 \\ 5 & 3 \\ 8 & 7 \\ 8 & 10.5 \end{pmatrix}.$$
We have $R^2_{\gamma=(1,1)} = 0.52$, $R^2_{\gamma=(1,0)} = 0.51$, $R^2_{\gamma=(0,1)} = 0.40$, and $R^2_{\gamma=(0,0)} = 0$. Now
$$P(Y \mid \gamma, g) = c_Y\, (1+g)^{-q_\gamma/2} \Bigl[1 - \frac{g}{1+g}\, R_\gamma^2\Bigr]^{-3/2},$$
where $c_Y$ does not depend on $g$ or $\gamma$. Therefore,
$$(\hat{g}, \hat{w}) = \operatorname*{argmax}_{g,\,w} \sum_{\gamma} w^{q_\gamma} (1-w)^{q - q_\gamma} (1+g)^{-q_\gamma/2} \Bigl[1 - \frac{g}{1+g}\, R_\gamma^2\Bigr]^{-3/2} = (.5, .2).$$
From the corresponding equation of Scott and Berger (2010) we know that under the Zellner-Siow null prior, we have
$$\frac{P(Y \mid \gamma)}{P\bigl(Y \mid \gamma = (0,0)\bigr)} = \int_0^\infty (1+g)^{-q_\gamma/2} \Bigl[1 - \frac{g}{1+g}\, R_\gamma^2\Bigr]^{-3/2} \Bigl(\frac{2}{\pi}\Bigr)^{1/2} g^{-3/2} \exp(-2/g)\, dg = \begin{cases} .72 < 1 & \text{for } \gamma = (1,0), \\ .58 < 1 & \text{for } \gamma = (0,1), \\ .31 < 1 & \text{for } \gamma = (1,1), \end{cases}$$


and hence the null model has the strictly largest marginal likelihood among all models. Lemma 4.1 of Scott and Berger (2010) implies that, with a Zellner-Siow prior on $g$, $\hat{w} = 0$, while in our setup, the same data give $\hat{w} > 0$.

5.3 Examples

We illustrate our methods on two examples. The first is the U.S. crime data of Vandaele (1978), which can be found in the R library MASS under the name UScrime. We use this data set because it has been studied in several papers already so we can compare our results with previous analyses, and also because the number of variables is small enough to enable a closed-form calculation of the marginal likelihood $m_h$, so we can compare our estimates with the gold standard. The second data set is the ozone data originally analyzed by Breiman and Friedman (1985). We use this data set because it involves 44 variables, even though only a few of those are important, and we wanted to show how our methodology handles a data set with this character.

5.3.1 U.S. Crime Data

The data set gives, for each of $m = 47$ U.S. states, the crime rate, defined as number of offenses per 100,000 individuals (the response variable), and $q = 15$ predictors measuring different characteristics of the population, such as average number of years of schooling, average income, unemployment rate, etc.

To be consistent with what is done in the literature, we applied a log transformation to all variables, except the indicator variable. We took the baseline hyperparameter to be $h_1 = (w_1, g_1) = (.5, 15)$, and our goal was to estimate $B(h, h_1)$ for the 924 values of $h$ obtained when $w$ ranges from 0.1 to 0.91 by increments of 0.03, and $g$ ranges from 4 to 100 by increments of 3. We used (2) and this estimate was based on 16 chains each of length 10,000, corresponding to the skeleton grid of hyperparameter values
$$(w, g) \in \{.3, .5, .6, .8\} \times \{15, 50, 100, 225\}$$


for the Stage 1 samples, and 16 new chains, each of length 1000, corresponding to the same hyperparameter values, for the Stage 2 samples. The plots in Figure 5-1 give graphs of the estimate (2) as $w$ and $g$ vary, from two different angles. These indicate that values for $w$ around 0.65 and for $g$ around 20 seem appropriate, while values of $w$ less than .3 and values of $g$ greater than 60 should be avoided. A side calculation showed that, interestingly, for $g = \max\{m, q^2\} = 225$, the estimate of $B\bigl((w, g), (.65, 20)\bigr)$ is less than .008 regardless of the value of $w$, so this choice should not be used for this data set. With the long chains used and the estimate that uses control variates, the Bayes factor estimates in Figure 5-1 are extremely accurate (root mean squared errors are less than 0.04 uniformly over the entire domain of the plot and considerably less in the convex hull of the skeleton grid; our calculation of the root mean squared errors used the closed-form expression for the Bayes factors based on complete enumeration). The figure took about a half hour to generate on an Intel 2.8 GHz Q9550 running Linux. The accuracy we obtained is overkill and the figure can be created in a few minutes if we use more typical Markov chain lengths.

Figure 5-1. Estimates of Bayes factors for the U.S. crime data. The plots give two different views of the graph of the Bayes factor as a function of $w$ and $g$ when the baseline value of the hyperparameter is given by $w = 0.5$ and $g = 15$. The estimate is (2), which uses control variates.
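For readers who wish to reproduce the layout of the computation, the evaluation grid and the skeleton grid described for the U.S. crime data can be enumerated as in the sketch below. The variable names are hypothetical; only the grid end points, increments, and skeleton values are taken from the text.

```python
import itertools

# Evaluation grid: w from 0.10 to 0.91 in steps of 0.03, g from 4 to 100 in steps of 3.
w_grid = [round(0.10 + 0.03 * i, 2) for i in range(28)]
g_grid = list(range(4, 101, 3))
eval_grid = list(itertools.product(w_grid, g_grid))        # all (w, g) pairs

# Skeleton grid: the 16 hyperparameter values at which chains are run.
skeleton = list(itertools.product([0.3, 0.5, 0.6, 0.8], [15, 50, 100, 225]))

baseline = (0.5, 15)                                       # h_1 = (w_1, g_1)
```

The product of 28 values of $w$ and 33 values of $g$ gives the 924 hyperparameter values mentioned in the text, and the baseline $h_1$ is one of the skeleton points.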


Table 5-1 gives the posterior inclusion probabilities for each of the fifteen predictors, i.e. $P(\gamma_i = 1 \mid y)$ for $i = 1, \ldots, 15$, under several models. Line 2 gives the inclusion probabilities when we use model (1) with the values $w = .65$ and $g = 20$, which are the values at which the graph in Figure 5-1 attains its maximum. Line 4 gives the inclusion probabilities when the hyper-$g$ prior HG3 in Liang et al. (2008) is used. As can be seen, the inclusion probabilities we obtained under the EB model are comparable to, but somewhat larger than, the probabilities when the HG3 prior is used. This is not surprising since our model allows $w$ to be chosen, and the data-driven choice gives a value (.65) greater than the value $w = .5$ used in Liang et al. (2008). Table 2 of Liang et al. (2008) gives a comparison of posterior inclusion probabilities for a total of ten models taken from the literature. Line 3 of Table 5-1 gives the inclusion probabilities under model (1) when we use $w = .5$ and the value of $g$ that maximizes the likelihood with $w$ constrained to be .5. It is interesting to note that the inclusion probabilities are then strikingly close to those under the HG3 model.

Table 5-1. Posterior inclusion probabilities for the fifteen predictor variables in the U.S. crime data set, under three models. Names of the variables are as in Table 2 of Liang et al. (2008) (but all variables except for the binary variable S have been log transformed).

              Age   S     Ed    Ex0   Ex1   LF    M     N     NW    U1    U2    W     X     Prison  Time
EB, w = .65   .93   .39   .99   .70   .51   .34   .35   .52   .83   .40   .76   .55   1.00  .96     .55
EB, w = .5    .85   .29   .97   .67   .45   .22   .22   .38   .70   .27   .62   .38   1.00  .90     .39
HG3           .84   .29   .97   .66   .47   .23   .23   .39   .69   .27   .61   .38   .99   .89     .38

Figure 5-2 gives plots of the posterior inclusion probabilities for Variables 1 and 6 as $w$ and $g$ vary. The literature recommends various choices for $g$ [in particular $g = m$ in Kass and Wasserman (1995), $g = q^2$ in Foster and George (1994), $g = \max(m, q^2)$ in Fernandez et al. (2001)], and posterior inclusion probabilities for all these choices combined with any choice of $w$ can be read directly from the figure. The extent to which these probabilities change with the choice of $g$ is quite striking.
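As a small worked example, the median probability model of Barbieri and Berger (2004), i.e. the set of variables whose inclusion probability is at least $1/2$, can be read off Table 5-1 mechanically. The sketch below is illustrative only: the function name and the dictionary keys are hypothetical, and the probabilities are the two-decimal values printed in the table, so variables close to the $1/2$ threshold (such as Ex1 or N in the first row) could change status with more precise estimates.

```python
names = ["Age", "S", "Ed", "Ex0", "Ex1", "LF", "M", "N",
         "NW", "U1", "U2", "W", "X", "Prison", "Time"]

# Posterior inclusion probabilities as printed in Table 5-1.
incl = {
    "EB, w=.65": [.93, .39, .99, .70, .51, .34, .35, .52, .83, .40, .76, .55, 1.00, .96, .55],
    "EB, w=.5":  [.85, .29, .97, .67, .45, .22, .22, .38, .70, .27, .62, .38, 1.00, .90, .39],
    "HG3":       [.84, .29, .97, .66, .47, .23, .23, .39, .69, .27, .61, .38, .99, .89, .38],
}

def median_probability_model(names, probs, threshold=0.5):
    """Variables whose marginal inclusion probability is at least the threshold."""
    return [name for name, p in zip(names, probs) if p >= threshold]

mpm = {label: median_probability_model(names, probs) for label, probs in incl.items()}
```

With these rounded values, the two rows with $w = .5$ lead to the same set of variables, while the row with $w = .65$ leads to a larger one.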


Figure 5-2. Estimates of posterior inclusion probabilities for Variables 1 and 6 for the U.S. crime data. The estimate used is (2).

Selection of the skeleton points was discussed at the end of Chapter 3, and we now return to this issue. Consider the Bayes factor estimate based on the skeleton (5), which was chosen in an ad-hoc manner. The left panel in Figure 5-3 gives a plot of the variance of this estimate, as a function of $h$. As can be seen from the plot, the variance is greatest in the region where $g$ is small and $w$ is large. We changed the skeleton from (5) to
$$(w, g) \in \{.5, .7, .8, .9\} \times \{10, 15, 50, 100\}$$
and reran the algorithm. The variance for the estimate based on (55) is given by the right panel of Figure 5-3, from which we see that the maximum variance has been reduced by a factor of about 9.

5.3.2 Ozone Data

This data set was originally analyzed in Breiman and Friedman (1985), was used in many papers since, and was recently analyzed in a Bayesian framework by Casella and Moreno (2006) and Liang et al. (2008). The data consist of daily measurements of ozone concentration and eight meteorological quantities in the Los Angeles basin for 330 days of 1976. The response variable is the daily ozone concentration, and we follow Liang et al. (2008) in considering 44 possible predictors: the eight meteorological


measurements, their squares, and their two-way interactions. Liang et al. (2008) give a review of the literature on priors for the hyperparameter $g$ and advocate the hyper-$g$ priors. They compare 10 variable selection techniques (including three hyper-$g$ priors) on this data set by using a cross-validation procedure: the data set is randomly split in two halves, one of which (the training sample) is used for selecting the model (for the Bayesian methods this is the highest probability model), while the other (the validation sample) is used for measuring the predictive accuracy. The predictive accuracy of method $j$ is measured through the square-root of the mean squared prediction error (RMSE) of the selected model $\gamma_j$, defined by
$$\mathrm{RMSE}(\gamma_j) = \Bigl( n_V^{-1} \sum_{i \in V} (Y_i - \hat{Y}_i)^2 \Bigr)^{1/2}.$$
Here, $V$ is the validation set, $n_V$ is its size, and $\hat{Y}_i$ is the fitted value of observation $i$ under model $\gamma_j$. Liang et al. (2008) point out the curious fact that the RMSE's of the 10 methods are all very close (they range from 4.4 to 4.6), but the selected models differ greatly in the number of variables selected, which range from 3 to 18.

Figure 5-3. Variance functions for two versions of $\hat{I}^{\hat{d}}_{\hat\beta(\hat{d})}$. The left panel is for the estimate based on the skeleton (5). The points in this skeleton were shifted to better cover the problematic region near the back of the plot ($g$ small and $w$ large), creating the skeleton (5). The maximum variance is then reduced by a factor of 9 (right panel).
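The validation criterion used in this comparison is straightforward to compute once fitted values on the validation sample are available. The following is a minimal sketch; the function name `validation_rmse` and the small numeric example are hypothetical and serve only to illustrate the formula for $\mathrm{RMSE}(\gamma_j)$.

```python
import numpy as np

def validation_rmse(y_valid, y_fitted):
    """Square root of the mean squared prediction error over the validation set V."""
    y_valid = np.asarray(y_valid, dtype=float)
    y_fitted = np.asarray(y_fitted, dtype=float)
    return float(np.sqrt(np.mean((y_valid - y_fitted) ** 2)))

# Illustrative values only (not from the ozone data).
example_rmse = validation_rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

In the illustrative call, the squared errors are 0, 0 and 4, so the result is $\sqrt{4/3}$.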


We investigated the performance of our methodology using a split of the data into training and validation sample identical to the one used by Liang et al. (2008). We took the baseline hyperparameter to be the pair $h_1 = (w_1, g_1) = (.2, 50)$ and the skeleton grid of hyperparameters to consist of the 16 pairs
$$(w, g) \in \{.1, .2, .3, .5\} \times \{15, 50, 100, 150\}.$$
To identify the value of $h$ that maximizes the Bayes factor $B(h, h_1)$, we estimated this quantity for a grid of the 750 values of $h$ obtained when $w$ ranges from .01 to .5 by increments of .02, and $g$ ranges from 5 to 150 by increments of 5. These estimates were based on 16 chains each of length 10,000, corresponding to the skeleton grid of hyperparameter values for the Stage 1 samples, and 16 new chains, each of length 1000, corresponding to the same hyperparameter values, for the Stage 2 samples. Figure 5-4 gives a plot of these estimates of $B(h, h_1)$ as a function of $w$ and $g$. The standard error is less than .014 over the entire range of the plot.

Figure 5-4. Estimates of Bayes factors for the ozone data. The plots give two different views of the graph of the Bayes factor as a function of $w$ and $g$ when the baseline value of the hyperparameter is given by $w = .2$ and $g = 50$.

The value of $h$ at which the maximum of $B(h, h_1)$ is attained is $\hat{h} = (.13, 75)$. We ran a new chain of length 100,000 corresponding to this value of $h$, and based on it we estimated the highest probability model to be the model containing the 4 variables


dpg, ibt, vh.ibh, and humid.ibt (see Appendix D for a description of these variables). This model yields an out-of-sample RMSE of 4.5. Since the empirical Bayes choice of $w$ is relatively small ($\hat{w} = .13$), it is not surprising that the highest probability model includes only 4 variables [fewer than in any of the hyper-$g$ models recommended by Liang et al. (2008), which all include at least 6 variables]. But it is interesting to note that nevertheless, this model gives an RMSE that is essentially the same as the RMSE of any of the other models.

We applied the regeneration algorithm described in Appendix B to the chain corresponding to the hyperparameter $h = (.13, 75)$ deemed optimal by our previous analysis. We ran the chain until $R = 3000$ regenerations occurred, which took 85,000 iterations. From the output, we obtained estimates of the posterior inclusion probabilities for every one of the 44 predictors, and formed the corresponding 95% confidence intervals, using the regeneration method discussed in Chapter 3. These are displayed in Figure 5-5.

Our choice of $R$ was arbitrary, but this choice should ultimately be based on the degree of accuracy one desires for the estimates of the quantities of interest. We considered our choice to be satisfactory for this particular analysis since the confidence intervals for the posterior inclusion probabilities for the 44 predictors have margins of error of at most 1%. Note that our chain regenerates relatively often, with the average length of a tour $\bar{N}$ being about 28. Mykland et al. (1995) recommend that one check that the coefficient of variation $\mathrm{CV}(\bar{N}) = (\mathrm{Var}(\bar{N}))^{1/2}/E(\bar{N})$ of the average tour length is less than .1 before deeming the asymptotic variance to be estimated properly by its regeneration-based estimate. Their criterion seems to be met here since the strongly consistent estimator
$$\widehat{\mathrm{CV}}(\bar{N}) = \Bigl( \sum_{t=1}^{R} (N_t - \bar{N})^2 \big/ (R \bar{N})^2 \Bigr)^{1/2}$$
equals .02.
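The diagnostic above needs only the observed tour lengths $N_1, \ldots, N_R$. A minimal sketch of the estimator $\widehat{\mathrm{CV}}(\bar{N})$ follows; the function name `cv_mean_tour_length` and the toy tour lengths are hypothetical, and the .1 cutoff in the comment is the one quoted in the text.

```python
import numpy as np

def cv_mean_tour_length(tour_lengths):
    """Estimated coefficient of variation of the average tour length.

    Implements ( sum_t (N_t - Nbar)^2 / (R * Nbar)^2 )^{1/2}.
    """
    N = np.asarray(tour_lengths, dtype=float)
    R = N.size                       # number of regenerations (tours)
    N_bar = N.mean()                 # average tour length
    return float(np.sqrt(np.sum((N - N_bar) ** 2)) / (R * N_bar))

# Toy tour lengths (an assumption for illustration, not output from the chain).
cv_hat = cv_mean_tour_length([2, 4, 6, 8])
diagnostic_ok = cv_hat < 0.1         # cutoff quoted in the text
```

For the toy input, $\bar{N} = 5$ and the sum of squared deviations is 20, so the estimate is $\sqrt{20}/20 \approx .224$, which would not pass the check; with thousands of tours the estimate shrinks roughly like $R^{-1/2}$.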


Figure 5-5. 95% confidence intervals of the posterior inclusion probabilities for the 44 predictors in the ozone data when the hyperparameter value is given by $w = .13$ and $g = 75$. A table giving the correspondence between the integers 1-44 and the predictors is given in Appendix D.


CHAPTER 6
DISCUSSION

The following fact is obvious, but it may be worthwhile to state it explicitly. If $h_1$ is fixed, maximizing $B(h, h_1)$ and maximizing the marginal likelihood $m_h$ are equivalent. Choosing the value of $h$ that maximizes $m_h$ is by definition the empirical Bayes method. Thus, the development in Chapter 2 can be used to implement empirical Bayes methods.

Our methodology for dealing with the sensitivity analysis and model selection problems discussed in Chapter 1 can be applied to many classes of Bayesian models. In addition to the usual parametric models, we mention also Bayesian nonparametric models involving mixtures of Dirichlet processes (Antoniak 1974), in which one of the hyperparameters is the so-called total mass parameter (very briefly, this hyperparameter controls the extent to which the nonparametric model differs from a purely parametric model). Among the many papers that use such models, we mention in particular Burr and Doss (2005), who give a more detailed discussion of the role of the total mass parameter. The approach developed in Sections 2.1 and 2.2 can be used to select this parameter.

When the dimension of $h$ is low, it will be possible to plot $B(h, h_1)$, or at least plot it as $h$ varies along some of its dimensions. Empirical Bayes methods are notoriously difficult to implement when the dimension of the hyperparameter $h$ is high. In this case, it is possible to use the methods developed in Sections 2.1 and 2.2 to enable approaches based on stochastic search algorithms. These require the calculation of the gradient $\partial B(h, h_1)/\partial h$. We note that the same methodology used to estimate $B(h, h_1)$ can also be used to estimate its gradient. For example, in (2), $\nu_h(\theta_i^{(l)})$ is simply replaced by $\partial \nu_h(\theta_i^{(l)})/\partial h$.


APPENDIX A
PROOF OF RESULTS FROM CHAPTER 1

Proof of Theorem 1

We begin by writing
$$\sqrt{n}\,\bigl(\hat{B}(h, h_1, \hat{d}) - B(h, h_1)\bigr) = \sqrt{n}\,\bigl(\hat{B}(h, h_1, \hat{d}) - \hat{B}(h, h_1, d)\bigr) + \sqrt{n}\,\bigl(\hat{B}(h, h_1, d) - B(h, h_1)\bigr). \tag{A.1}$$
The second term on the right side of (A.1) involves randomness coming only from the second stage of sampling. This term was analyzed by Doss (2010), who showed that it is asymptotically normal, with mean 0 and variance $\tau^2(h)$. The first term ostensibly involves randomness from both Stage 1 and Stage 2 sampling. However, as will emerge from our proof, the randomness from Stage 2 is of lower order, and effectively all the randomness is from Stage 1. This randomness is non-negligible. We mention here the often-cited work of Geyer (1994) whose nice results we use in the present paper. In the context of a setup very similar to ours, his Theorem 4 states that using an estimated $d$ and using the true $d$ results in the same asymptotic variance. From our proof (refer also to Remark 2 of Section 2.1), we see that this statement is not correct.

To analyze the first term on the right side of (A.1), we define the function $F(u) = \hat{B}(h, h_1, u)$, where $u = (u_2, \ldots, u_k)'$ is a real vector with $u_l > 0$, $l = 2, \ldots, k$. Then, by the Taylor series expansion of $F$ about $d$, we get
$$\sqrt{n}\,\bigl(\hat{B}(h, h_1, \hat{d}) - \hat{B}(h, h_1, d)\bigr) = \sqrt{n}\,\bigl(F(\hat{d}) - F(d)\bigr) = \sqrt{n}\, \nabla F(d)' (\hat{d} - d) + \frac{\sqrt{n}}{2} (\hat{d} - d)' \nabla^2 F(d^*) (\hat{d} - d), \tag{A.2}$$
where $d^*$ is between $d$ and $\hat{d}$.

First, we show that the gradient $\nabla F(d) = \bigl(\partial F(d)/\partial d_2, \ldots, \partial F(d)/\partial d_k\bigr)'$ converges almost surely to a finite constant. For $j = 2, \ldots, k$, the $(j-1)$th component of this vector converges almost surely since, with the SLLN assumed to hold for the Markov chains


used, we have
$$\begin{aligned}
[\nabla F(d)]_{j-1} &= \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{n_j\, \nu_h(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})}{d_j^2 \bigl( \sum_{s=1}^{k} n_s \nu_{h_s}(\theta_i^{(l)})/d_s \bigr)^2} \\
&= \sum_{l=1}^{k} \frac{1}{n_l} \sum_{i=1}^{n_l} \frac{a_j a_l\, \nu_h(\theta_i^{(l)})\, \nu_{h_j}(\theta_i^{(l)})}{d_j^2 \bigl( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s \bigr)^2} \\
&\xrightarrow{\text{a.s.}} \frac{1}{d_j^2} \sum_{l=1}^{k} a_l \int \frac{a_j\, \nu_h(\theta)\, \nu_{h_j}(\theta)}{\bigl( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/d_s \bigr)^2}\, \nu_{h_l, y}(\theta)\, d\theta \\
&= \frac{1}{d_j^2} \int \frac{m_h}{m_{h_1}} \frac{a_j\, \nu_{h_j}(\theta)}{\bigl( \sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/d_s \bigr)^2} \sum_{l=1}^{k} \frac{a_l \nu_{h_l}(\theta)}{d_l}\, \nu_{h, y}(\theta)\, d\theta \\
&= \frac{B(h, h_1)}{d_j^2} \int \frac{a_j\, \nu_{h_j}(\theta)}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/d_s}\, \nu_{h, y}(\theta)\, d\theta := [c(h)]_{j-1}.
\end{aligned} \tag{A.3}$$
The last integral is clearly finite, and the last equality in (A.3) indicates that $c(h)$ denotes the constant vector to which $\nabla F(d)$ converges.

Next, we show that the random Hessian matrix $\nabla^2 F(d^*)$ of second-order derivatives of $F$ evaluated at $d^*$ is bounded in probability. To this end, it suffices to show that each element of this matrix, say $[\nabla^2 F(d^*)]_{t-1, j-1}$, where $t, j \in \{2, \ldots, k\}$, is $O_p(1)$. Since $\|d^* - d\| \leq \|\hat{d} - d\| \xrightarrow{p} 0$, it follows that $d^* \xrightarrow{p} d$. Let $\epsilon \in \bigl(0, \min(d_2, \ldots, d_k)\bigr)$. Then we have $P(\|d^* - d\| \leq \epsilon) \to 1$. We now show that, on the set $\{\|d^* - d\| \leq \epsilon\}$, $\nabla^2 F(d^*)$ is bounded in probability. Let $I = I(\|d^* - d\| \leq \epsilon)$.

For $t \ne j$, we have
$$
\begin{aligned}
[\nabla^2 F(d^{*})]_{t-1,\,j-1}\, I
&= \sum_{l=1}^{k}\frac{2}{n_l}\sum_{i=1}^{n_l} \frac{a_j a_l a_t\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})\,\nu_{h_t}(\theta_i^{(l)})}{(d_j^{*})^2 (d_t^{*})^2 \bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^3}\, I \\
&\le \sum_{l=1}^{k}\frac{2}{n_l}\sum_{i=1}^{n_l} \frac{a_j a_l a_t\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})\,\nu_{h_t}(\theta_i^{(l)})}{(d_j - \epsilon)^2 (d_t - \epsilon)^2 \bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/(d_s + \epsilon)\bigr)^3} \\
&\xrightarrow{\text{a.s.}} \frac{2}{(d_j - \epsilon)^2 (d_t - \epsilon)^2}\sum_{l=1}^{k}\int \frac{a_j a_l a_t\,\nu_h(\theta)\,\nu_{h_j}(\theta)\,\nu_{h_t}(\theta)}{\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/(d_s + \epsilon)\bigr)^3}\,\nu_{h_l,y}(d\theta) \\
&= \frac{2}{(d_j - \epsilon)^2 (d_t - \epsilon)^2}\sum_{l=1}^{k} B(h,h_l)\int \biggl\{\frac{a_j a_l a_t\,\nu_{h_j}(\theta)\,\nu_{h_t}(\theta)\,\nu_{h_l}(\theta)}{\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/(d_s + \epsilon)\bigr)^3}\biggr\}\,\nu_{h,y}(d\theta).
\end{aligned}
\tag{A.4}
$$
Note that the expression inside the braces in (A.4) is clearly bounded above by a constant, so expression (A.4) is finite. Similarly, for $t = j$,
$$
\begin{aligned}
\bigl|[\nabla^2 F(d^{*})]_{j-1,\,j-1}\bigr|\, I
&\le \sum_{l=1}^{k}\frac{2}{n_l}\sum_{i=1}^{n_l} \frac{a_j a_l\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*} - a_j\,\nu_{h_j}(\theta_i^{(l)})/d_j^{*}\bigr)}{(d_j^{*})^3 \bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^3}\, I \\
&\le \sum_{l=1}^{k}\frac{2}{n_l}\sum_{i=1}^{n_l} \frac{a_j a_l\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})}{(d_j^{*})^3 \bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^2}\, I \\
&\le \sum_{l=1}^{k}\frac{2}{n_l}\sum_{i=1}^{n_l} \frac{a_j a_l\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})}{(d_j - \epsilon)^3 \bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/(d_s + \epsilon)\bigr)^2} \\
&\xrightarrow{\text{a.s.}} \frac{2}{(d_j - \epsilon)^3}\sum_{l=1}^{k} B(h,h_l)\int \frac{a_j a_l\,\nu_{h_j}(\theta)\,\nu_{h_l}(\theta)}{\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/(d_s + \epsilon)\bigr)^2}\,\nu_{h,y}(d\theta).
\end{aligned}
$$
Again, this limit is a finite constant by the same reasoning we used earlier. Since $P(\|d^{*} - d\| \le \epsilon) \to 1$, it follows that $\nabla^2 F(d^{*})$ is bounded in probability.

Combining (A.1) and (A.2), we obtain
$$
\begin{aligned}
\sqrt{n}\bigl(\hat B(h,h_1,\hat d) - B(h,h_1)\bigr)
&= \sqrt{\frac{n}{N}}\,\nabla F(d)'\,\sqrt{N}(\hat d - d) + \frac{1}{2\sqrt{N}}\sqrt{\frac{n}{N}}\,\bigl[\sqrt{N}(\hat d - d)\bigr]'\,\nabla^2 F(d^{*})\,\bigl[\sqrt{N}(\hat d - d)\bigr] \\
&\quad + \sqrt{n}\bigl(\hat B(h,h_1,d) - B(h,h_1)\bigr) \\
&= \sqrt{q}\,c(h)'\,\sqrt{N}(\hat d - d) + \sqrt{n}\bigl(\hat B(h,h_1,d) - B(h,h_1)\bigr) + o_p(1),
\end{aligned}
\tag{A.5}
$$
where the last line follows from the previously established fact that $\nabla F(d) \xrightarrow{\text{a.s.}} c(h)$ and the assumptions of Theorem 1 that $\sqrt{n/N} \to \sqrt{q}$ and that $\sqrt{N}(\hat d - d)$ converges in distribution (hence is $O_p(1)$). Because the two sampling stages for estimating $d$ and $B(h,h_1)$ are assumed to be independent, using the assumption that $\sqrt{N}(\hat d - d) \xrightarrow{d} N(0,\Sigma)$ in conjunction with the result $\sqrt{n}\bigl(\hat B(h,h_1,d) - B(h,h_1)\bigr) \xrightarrow{d} N\bigl(0,\tau^2(h)\bigr)$ established in Theorem 1 of Doss (2010) under conditions A1 and A2, we conclude that
$$
\sqrt{n}\bigl(\hat B(h,h_1,\hat d) - B(h,h_1)\bigr) \xrightarrow{d} N\bigl(0,\; q\,c(h)'\,\Sigma\,c(h) + \tau^2(h)\bigr).
$$

Proof of Theorem 2

We begin by writing
$$
\sqrt{n}\bigl(\hat I_{\hat d,\hat\beta(\hat d)} - B(h,h_1)\bigr) = \sqrt{n}\bigl(\hat I_{\hat d,\hat\beta(\hat d)} - \hat I_{d,\hat\beta(d)}\bigr) + \sqrt{n}\bigl(\hat I_{d,\hat\beta(d)} - B(h,h_1)\bigr), \tag{A.6}
$$
where the second term on the right side of (A.6) was analyzed by Doss (2010), who showed that it is asymptotically normal, with mean $0$ and variance $\sigma^2(h)$. Our plan is to show that $\hat\beta(d)$ and $\hat\beta(\hat d)$ converge in probability to the same limit, which we denote $\beta_{\lim}$. We then expand the first term on the right side of (A.6) by writing
$$
\sqrt{n}\bigl(\hat I_{\hat d,\hat\beta(\hat d)} - \hat I_{d,\hat\beta(d)}\bigr) = \sqrt{n}\bigl(\hat I_{\hat d,\hat\beta(\hat d)} - \hat I_{\hat d,\beta_{\lim}}\bigr) + \sqrt{n}\bigl(\hat I_{\hat d,\beta_{\lim}} - \hat I_{d,\beta_{\lim}}\bigr) + \sqrt{n}\bigl(\hat I_{d,\beta_{\lim}} - \hat I_{d,\hat\beta(d)}\bigr). \tag{A.7}
$$

Our proof is organized as follows:

- We note that the third term on the right side of (A.7) was shown to converge to $0$ in probability by Doss (2010).
- We will show that the first term on the right side of (A.7) also converges to $0$ in probability.
- The second term on the right side of (A.7) involves randomness from both Stage 1 and Stage 2. However, we will show that the randomness from Stage 2 is asymptotically negligible, and that this term is asymptotically equivalent to an expression of the form $w(h)'(\hat d - d)$, where $w(h)$ is a deterministic vector. This will show that the second term is asymptotically normal.

Now we prove that the first term on the right side of (A.7) is $o_p(1)$, and to do this we begin by showing that $\hat\beta(d)$ and $\hat\beta(\hat d)$ converge in probability to the same limit. Let $Z$ be the $n \times k$ matrix whose transpose is
$$
Z' = \begin{pmatrix}
1 & \cdots & 1 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \\
Z^{(2)}_{1,1} & \cdots & Z^{(2)}_{n_1,1} & Z^{(2)}_{1,2} & \cdots & Z^{(2)}_{n_2,2} & \cdots & Z^{(2)}_{1,k} & \cdots & Z^{(2)}_{n_k,k} \\
\vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
Z^{(k)}_{1,1} & \cdots & Z^{(k)}_{n_1,1} & Z^{(k)}_{1,2} & \cdots & Z^{(k)}_{n_2,2} & \cdots & Z^{(k)}_{1,k} & \cdots & Z^{(k)}_{n_k,k}
\end{pmatrix}
\tag{A.8}
$$
and let $Y$ be the vector
$$
Y = \bigl(Y_{1,1},\dots,Y_{n_1,1},\,Y_{1,2},\dots,Y_{n_2,2},\,\dots,\,Y_{1,k},\dots,Y_{n_k,k}\bigr)'. \tag{A.9}
$$
Let $\hat Z$ be the $n \times k$ matrix corresponding to $Z$ when we replace $d$ by $\hat d$. Similarly, $\hat Y$ is like $Y$, but using $\hat d$ for $d$.

For fixed $j, j' \in \{2,\dots,k\}$, consider the function
$$
G(u) = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}
\frac{\nu_{h_j}(\theta_i^{(l)})/u_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/u_s}
\cdot
\frac{\nu_{h_{j'}}(\theta_i^{(l)})/u_{j'} - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/u_s},
\tag{A.10}
$$
where $u = (u_2,\dots,u_k)'$ and $u_l > 0$ for $l = 2,\dots,k$. On the right side of (A.10), $u_1$ is taken to be $1$. Note that setting $u = d$ gives $G(d) = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Z^{(j)}_{i,l} Z^{(j')}_{i,l}$.

By the Mean Value Theorem, we know that there exists a $d^{*}$ between $d$ and $\hat d$ such that
$$
G(\hat d) = G(d) + \nabla G(d^{*})'(\hat d - d) = R_{j,j'} + \nabla G(d^{*})'(\hat d - d) + o_p(1).
$$
Note that the last equality above comes from applying the SLLN. Next we show that $\nabla G(d^{*}) = O_p(1)$. We have three cases for $t = 2,\dots,k$.

Case 1: $t \notin \{j, j'\}$. We have
$$
\begin{aligned}
\bigl|[\nabla G(d^{*})]_{t-1}\bigr|\, I
&\le \sum_{l=1}^{k} 2 a_l \frac{1}{n_l}\sum_{i=1}^{n_l}
\frac{\bigl|\nu_{h_j}(\theta_i^{(l)})/d_j^{*} - \nu_{h_1}(\theta_i^{(l)})\bigr|\,\bigl|\nu_{h_{j'}}(\theta_i^{(l)})/d_{j'}^{*} - \nu_{h_1}(\theta_i^{(l)})\bigr|\,a_t\,\nu_{h_t}(\theta_i^{(l)})}{(d_t^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^3}\, I \\
&\le \sum_{l=1}^{k} \frac{2 a_l}{n_l}\sum_{i=1}^{n_l}
\frac{\bigl(\nu_{h_j}(\theta_i^{(l)})/(d_j - \epsilon) + \nu_{h_1}(\theta_i^{(l)})\bigr)\bigl(\nu_{h_{j'}}(\theta_i^{(l)})/(d_{j'} - \epsilon) + \nu_{h_1}(\theta_i^{(l)})\bigr)\,a_t\,\nu_{h_t}(\theta_i^{(l)})}{(d_t - \epsilon)^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/(d_s + \epsilon)\bigr)^3}.
\end{aligned}
$$
The term inside the inner sum is bounded, so we can conclude that $[\nabla G(d^{*})]_{t-1}$ is bounded in probability, as it is bounded by an $O_p(1)$ term on the set where $I = 1$.

Case 2: $j \ne j'$, $t \in \{j, j'\}$, say $t = j$. We have
$$
\begin{aligned}
[\nabla G(d^{*})]_{j-1}
&= \sum_{l=1}^{k} \frac{a_l}{n_l}\sum_{i=1}^{n_l}
\frac{2\bigl(\nu_{h_j}(\theta_i^{(l)})/d_j^{*} - \nu_{h_1}(\theta_i^{(l)})\bigr)\bigl(\nu_{h_{j'}}(\theta_i^{(l)})/d_{j'}^{*} - \nu_{h_1}(\theta_i^{(l)})\bigr)\,a_j\,\nu_{h_j}(\theta_i^{(l)})}{(d_j^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^3} \\
&\quad + \sum_{l=1}^{k} \frac{a_l}{n_l}\sum_{i=1}^{n_l}
\frac{-\nu_{h_j}(\theta_i^{(l)})\bigl(\nu_{h_{j'}}(\theta_i^{(l)})/d_{j'}^{*} - \nu_{h_1}(\theta_i^{(l)})\bigr)}{(d_j^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^2},
\end{aligned}
$$
and this is bounded in probability.
Case 3: $t = j = j'$. We have
$$
[\nabla G(d^{*})]_{j-1}
= \sum_{l=1}^{k} 2 a_l \frac{1}{n_l}\sum_{i=1}^{n_l}
\frac{\nu_{h_j}(\theta_i^{(l)})/d_j^{*} - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}}
\Biggl[
\frac{-\nu_{h_j}(\theta_i^{(l)})}{(d_j^{*})^2\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}}
+ \frac{\bigl(\nu_{h_j}(\theta_i^{(l)})/d_j^{*} - \nu_{h_1}(\theta_i^{(l)})\bigr)\,a_j\,\nu_{h_j}(\theta_i^{(l)})}{(d_j^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^2}
\Biggr],
$$
and again this is bounded in probability.

Therefore
$$
G(\hat d) = R_{j,j'} + \nabla G(d^{*})'(\hat d - d) + o_p(1) = R_{j,j'} + O_p(1)\,o_p(1) + o_p(1) \xrightarrow{p} R_{j,j'}.
$$
Similar arguments extend to the case $j = 1$ or $j' = 1$. By the fact that $R$ is assumed invertible, we have
$$
n\bigl(\hat Z'\hat Z\bigr)^{-1} \xrightarrow{p} R^{-1}. \tag{A.11}
$$
In a similar way, it can be shown that
$$
\hat Z'\hat Y/n \xrightarrow{p} v, \tag{A.12}
$$
where $v$ is the same limit vector to which $Z'Y/n$ has been proved to converge in Doss (2010). Combining (A.11) and (A.12) we have
$$
\bigl(\hat\beta_0(\hat d), \hat\beta(\hat d)\bigr) = n\bigl(\hat Z'\hat Z\bigr)^{-1}\,\hat Z'\hat Y/n \xrightarrow{p} \bigl(\beta_{0,\lim}, \beta_{\lim}\bigr) = R^{-1}v.
$$
Let $e_{j,l} = E\bigl(Z^{(j)}_{1,l}\bigr)$. We now have
$$
\begin{aligned}
\sqrt{n}\bigl(\hat I_{\hat d,\hat\beta(\hat d)} - \hat I_{\hat d,\beta_{\lim}}\bigr)
&= \sum_{j=2}^{k}\bigl(\beta_{j,\lim} - \hat\beta_j(\hat d)\bigr)\sum_{l=1}^{k} a_l\, n^{1/2}\sum_{i=1}^{n_l}\frac{\hat Z^{(j)}_{i,l} - e_{j,l}}{n_l} \\
&= \sum_{j=2}^{k} o_p(1)\sum_{l=1}^{k} a_l\, n^{1/2}\sum_{i=1}^{n_l}\frac{\hat Z^{(j)}_{i,l} - e_{j,l}}{n_l}.
\end{aligned}
\tag{A.13}
$$
To show that (A.13) converges to $0$ in probability it suffices to show that for each $l$ and $j$,
$$
n_l^{1/2}\sum_{i=1}^{n_l}\frac{\hat Z^{(j)}_{i,l} - e_{j,l}}{n_l} = O_p(1). \tag{A.14}
$$
For fixed $j \in \{2,\dots,k\}$ and $l \in \{1,\dots,k\}$, define
$$
H(u) = n_l^{-1/2}\sum_{i=1}^{n_l}\frac{\nu_{h_j}(\theta_i^{(l)})/u_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/u_s}
$$

for $u = (u_2,\dots,u_k)'$ with $u_l > 0$, $l = 2,\dots,k$, and $u_1 = 1$. Note that $H(d) = n_l^{-1/2}\sum_{i=1}^{n_l} Z^{(j)}_{i,l}$. To see why (A.14) is true we begin by writing
$$
\begin{aligned}
n_l^{1/2}\sum_{i=1}^{n_l}\frac{\hat Z^{(j)}_{i,l} - e_{j,l}}{n_l}
&= n_l^{1/2}\sum_{i=1}^{n_l}\frac{\hat Z^{(j)}_{i,l} - Z^{(j)}_{i,l}}{n_l} + n_l^{1/2}\sum_{i=1}^{n_l}\frac{Z^{(j)}_{i,l} - e_{j,l}}{n_l} \\
&= H(\hat d) - H(d) + O_p(1).
\end{aligned}
\tag{A.15}
$$
Note that the fact that $n_l^{1/2}\sum_{i=1}^{n_l}\bigl([Z^{(j)}_{i,l} - e_{j,l}]/n_l\bigr) = O_p(1)$, which was used to establish the second equality in (A.15), is proved in Doss (2010). Now, applying the Mean Value Theorem to the function $H$, we know that there exists a point $d^{*}$ between $d$ and $\hat d$ such that (A.15) becomes
$$
\begin{aligned}
n_l^{1/2}\sum_{i=1}^{n_l}\frac{\hat Z^{(j)}_{i,l} - e_{j,l}}{n_l}
&= \nabla H(d^{*})'(\hat d - d) + O_p(1) \\
&= \sqrt{a_l}\,\sqrt{\frac{n}{N}}\; n_l^{-1/2}\,\nabla H(d^{*})'\,\sqrt{N}(\hat d - d) + O_p(1),
\end{aligned}
\tag{A.16}
$$
so that the right side of (A.16) is $O_p(1)$. To see this last assertion, note that the $(t-1)$th element of the gradient of $H$, $[\nabla H(d^{*})]_{t-1}$, is given by
$$
\begin{cases}
\displaystyle n_l^{-1/2}\sum_{i=1}^{n_l}\frac{\bigl(\nu_{h_j}(\theta_i^{(l)})/d_j^{*} - \nu_{h_1}(\theta_i^{(l)})\bigr)\,a_t\,\nu_{h_t}(\theta_i^{(l)})}{(d_t^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^2} & \text{if } t \ne j, \\[3ex]
\displaystyle n_l^{-1/2}\Biggl[\sum_{i=1}^{n_l}\frac{-\nu_{h_j}(\theta_i^{(l)})}{(d_j^{*})^2\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}} + \sum_{i=1}^{n_l}\frac{\bigl(\nu_{h_j}(\theta_i^{(l)})/d_j^{*} - \nu_{h_1}(\theta_i^{(l)})\bigr)\,a_j\,\nu_{h_j}(\theta_i^{(l)})}{(d_j^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^2}\Biggr] & \text{if } t = j.
\end{cases}
$$

Let $\epsilon \in \bigl(0, \min(d_2,\dots,d_k)\bigr)$. Then $P(\|d^{*} - d\| \le \epsilon) \to 1$. For $t \ne j$ we have
$$
\begin{aligned}
n_l^{-1/2}\,\bigl|[\nabla H(d^{*})]_{t-1}\bigr|\, I
&\le n_l^{-1}\sum_{i=1}^{n_l}\frac{\bigl|\nu_{h_j}(\theta_i^{(l)})/d_j^{*} - \nu_{h_1}(\theta_i^{(l)})\bigr|\,a_t\,\nu_{h_t}(\theta_i^{(l)})}{(d_t^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^2}\, I \\
&\le n_l^{-1}\sum_{i=1}^{n_l}\frac{\nu_{h_j}(\theta_i^{(l)})\,a_t\,\nu_{h_t}(\theta_i^{(l)})}{(d_t^{*})^2\, d_j^{*}\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^2}\, I
+ n_l^{-1}\sum_{i=1}^{n_l}\frac{\nu_{h_1}(\theta_i^{(l)})\,a_t\,\nu_{h_t}(\theta_i^{(l)})}{(d_t^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^2}\, I \\
&\le n_l^{-1}\sum_{i=1}^{n_l}\frac{\nu_{h_j}(\theta_i^{(l)})\,a_t\,\nu_{h_t}(\theta_i^{(l)})}{(d_t - \epsilon)^2 (d_j - \epsilon)\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/(d_s + \epsilon)\bigr)^2}
+ n_l^{-1}\sum_{i=1}^{n_l}\frac{\nu_{h_1}(\theta_i^{(l)})\,a_t\,\nu_{h_t}(\theta_i^{(l)})}{(d_t - \epsilon)^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/(d_s + \epsilon)\bigr)^2} \\
&= O_p(1) + O_p(1) = O_p(1).
\end{aligned}
$$
Similarly,
$$
\begin{aligned}
n_l^{-1/2}\,\bigl|[\nabla H(d^{*})]_{j-1}\bigr|\, I
&\le \frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_{h_j}(\theta_i^{(l)})}{(d_j - \epsilon)^2\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/(d_s + \epsilon)}
+ \frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_{h_j}(\theta_i^{(l)})\,a_j\,\nu_{h_j}(\theta_i^{(l)})}{(d_j - \epsilon)^3\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/(d_s + \epsilon)\bigr)^2} \\
&\quad + \frac{1}{n_l}\sum_{i=1}^{n_l}\frac{\nu_{h_1}(\theta_i^{(l)})\,a_j\,\nu_{h_j}(\theta_i^{(l)})}{(d_j - \epsilon)^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/(d_s + \epsilon)\bigr)^2},
\end{aligned}
$$
and the right side of this inequality is $O_p(1)$, as it is the sum of three $O_p(1)$ terms.
So (A.16) now implies that
$$
n_l^{1/2}\sum_{i=1}^{n_l}\frac{\hat Z^{(j)}_{i,l} - e_{j,l}}{n_l} = \sqrt{a_l}\,\sqrt{\frac{n}{N}}\;O_p(1)\,O_p(1) + O_p(1) = O_p(1).
$$
We now consider $\sqrt{n}\bigl(\hat I_{\hat d,\beta_{\lim}} - \hat I_{d,\beta_{\lim}}\bigr)$, the middle term in (A.7). Define
$$
K(u) = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Biggl[\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/u_s} - \sum_{j=2}^{k}\beta_{j,\lim}\,\frac{\nu_{h_j}(\theta_i^{(l)})/u_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/u_s}\Biggr]
$$

where $u = (u_2,\dots,u_k)'$, and $u_l > 0$ for $l = 2,\dots,k$. By Taylor series expansion, we have
$$
\sqrt{n}\bigl(\hat I_{\hat d,\beta_{\lim}} - \hat I_{d,\beta_{\lim}}\bigr) = \sqrt{n}\,\nabla K(d)'(\hat d - d) + \sqrt{n}\,\frac{1}{2}(\hat d - d)'\,\nabla^2 K(d^{*})\,(\hat d - d), \tag{A.17}
$$
where $d^{*}$ is between $\hat d$ and $d$. We now focus our attention on $\nabla K(d)$. For $t = 2,\dots,k$ we have
$$
\begin{aligned}
[\nabla K(d)]_{t-1}
&= \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Biggl[
\frac{\nu_h(\theta_i^{(l)})\,a_t\,\nu_{h_t}(\theta_i^{(l)})}{d_t^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s\bigr)^2}
- \sum_{\substack{j=2 \\ j \ne t}}^{k}\beta_{j,\lim}\,\frac{\bigl(\nu_{h_j}(\theta_i^{(l)})/d_j - \nu_{h_1}(\theta_i^{(l)})\bigr)\,a_t\,\nu_{h_t}(\theta_i^{(l)})}{d_t^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s\bigr)^2} \\
&\qquad\qquad + \beta_{t,\lim}\,\frac{\nu_{h_t}(\theta_i^{(l)})}{d_t^2\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s}
- \beta_{t,\lim}\,\frac{\bigl(\nu_{h_t}(\theta_i^{(l)})/d_t - \nu_{h_1}(\theta_i^{(l)})\bigr)\,a_t\,\nu_{h_t}(\theta_i^{(l)})}{d_t^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s\bigr)^2}
\Biggr] \\
&\xrightarrow{\text{a.s.}} \frac{B(h,h_1)}{d_t^2}\int\frac{a_t\,\nu_{h_t}(\theta)}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s}\,\nu_{h,y}(d\theta)
- \sum_{\substack{j=2 \\ j \ne t}}^{k}\beta_{j,\lim}\int\frac{a_t\,\nu_{h_t}(\theta)}{d_t^2\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s}\,\nu_{h_j,y}(d\theta) \\
&\qquad + \sum_{\substack{j=2 \\ j \ne t}}^{k}\beta_{j,\lim}\int\frac{a_t\,\nu_{h_t}(\theta)}{d_t^2\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s}\,\nu_{h_1,y}(d\theta)
+ \beta_{t,\lim}\frac{1}{d_t} \\
&\qquad - \beta_{t,\lim}\int\frac{a_t\,\nu_{h_t}(\theta)}{d_t^2\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s}\,\nu_{h_t,y}(d\theta)
+ \beta_{t,\lim}\int\frac{a_t\,\nu_{h_t}(\theta)}{d_t^2\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s}\,\nu_{h_1,y}(d\theta) \\
&= \frac{B(h,h_1)}{d_t^2}\int\frac{a_t\,\nu_{h_t}(\theta)}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s}\,\nu_{h,y}(d\theta)
- \sum_{j=2}^{k}\beta_{j,\lim}\int\frac{a_t\,\nu_{h_t}(\theta)}{d_t^2\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s}\,\nu_{h_j,y}(d\theta) \\
&\qquad + \sum_{j=2}^{k}\beta_{j,\lim}\int\frac{a_t\,\nu_{h_t}(\theta)}{d_t^2\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s}\,\nu_{h_1,y}(d\theta)
+ \beta_{t,\lim}\frac{1}{d_t} \\
&:= [w(h)]_{t-1},
\end{aligned}
\tag{A.18}
$$

where the notation in (A.18) indicates that $w(h)$ denotes the finite vector limit to which $\nabla K(d)$ converges. We now deal with the Hessian matrix $\nabla^2 K(d^{*})$. For $t \ne u$ we have
$$
\begin{aligned}
[\nabla^2 K(d^{*})]_{t-1,\,u-1}
= \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Biggl[
&\frac{2\,\nu_h(\theta_i^{(l)})\,a_t\,\nu_{h_t}(\theta_i^{(l)})\,a_u\,\nu_{h_u}(\theta_i^{(l)})}{(d_t^{*})^2 (d_u^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^3} \\
&- \sum_{\substack{j=2 \\ j \ne t,\ j \ne u}}^{k}\beta_{j,\lim}\,\frac{2\bigl(\nu_{h_j}(\theta_i^{(l)})/d_j^{*} - \nu_{h_1}(\theta_i^{(l)})\bigr)\,a_t\,\nu_{h_t}(\theta_i^{(l)})\,a_u\,\nu_{h_u}(\theta_i^{(l)})}{(d_t^{*})^2 (d_u^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^3} \\
&+ \beta_{u,\lim}\,\frac{\nu_{h_u}(\theta_i^{(l)})\,a_t\,\nu_{h_t}(\theta_i^{(l)})}{(d_t^{*})^2 (d_u^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^2} \\
&- \beta_{u,\lim}\,\frac{2\bigl(\nu_{h_u}(\theta_i^{(l)})/d_u^{*} - \nu_{h_1}(\theta_i^{(l)})\bigr)\,a_t\,\nu_{h_t}(\theta_i^{(l)})\,a_u\,\nu_{h_u}(\theta_i^{(l)})}{(d_t^{*})^2 (d_u^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^3} \\
&+ \beta_{t,\lim}\,\frac{\nu_{h_t}(\theta_i^{(l)})\,a_u\,\nu_{h_u}(\theta_i^{(l)})}{(d_t^{*})^2 (d_u^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^2} \\
&- \beta_{t,\lim}\,\frac{2\bigl(\nu_{h_t}(\theta_i^{(l)})/d_t^{*} - \nu_{h_1}(\theta_i^{(l)})\bigr)\,a_t\,\nu_{h_t}(\theta_i^{(l)})\,a_u\,\nu_{h_u}(\theta_i^{(l)})}{(d_t^{*})^2 (d_u^{*})^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s^{*}\bigr)^3}
\Biggr],
\end{aligned}
$$
and as before, it can be shown that this is bounded in probability. Similarly, we can show that the diagonal terms of $\nabla^2 K(d^{*})$ are also bounded in probability. Therefore, using the fact that $\nabla^2 K(d^{*})$ is bounded in probability, we can now rewrite (A.17) as
$$
\begin{aligned}
\sqrt{n}\bigl(\hat I_{\hat d,\beta_{\lim}} - \hat I_{d,\beta_{\lim}}\bigr)
&= \sqrt{\frac{n}{N}}\,w(h)'\,\sqrt{N}(\hat d - d) + \sqrt{\frac{n}{N}}\,\frac{1}{2\sqrt{N}}\,\bigl[\sqrt{N}(\hat d - d)\bigr]'\,O_p(1)\,\bigl[\sqrt{N}(\hat d - d)\bigr] \\
&= \sqrt{q}\,w(h)'\,\sqrt{N}(\hat d - d) + o_p(1).
\end{aligned}
$$
Together with (A.6), this gives
$$
\begin{aligned}
\sqrt{n}\bigl(\hat I_{\hat d,\hat\beta(\hat d)} - B(h,h_1)\bigr)
&= \sqrt{q}\,w(h)'\,\sqrt{N}(\hat d - d) + \sqrt{n}\bigl(\hat I_{d,\hat\beta(d)} - B(h,h_1)\bigr) + o_p(1) \\
&\xrightarrow{d} N\bigl(0,\; q\,w(h)'\,\Sigma\,w(h) + \sigma^2(h)\bigr)
\end{aligned}
$$
by the independence of the two sampling stages, the assumption that $\sqrt{N}(\hat d - d)$ is asymptotically normal with mean $0$ and variance $\Sigma$, and the result from Doss (2010) that $\sqrt{n}\bigl(\hat I_{d,\hat\beta(d)} - B(h,h_1)\bigr)$ is asymptotically normal with mean $0$ and variance $\sigma^2(h)$.
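The least squares step used in this proof can be illustrated numerically. The following Python sketch checks that the control-variate average $\frac{1}{n}\sum (Y - \sum_j \hat\beta_j Z^{(j)})$ coincides with the intercept of the least squares regression of $Y$ on $Z$ (with a column of ones). It is only a sketch under stated assumptions: the data are synthetic i.i.d. draws standing in for Markov chain output, and the function name `control_variate_estimate`, the coefficients, and the sample size are hypothetical choices made for the example.

```python
import numpy as np


def control_variate_estimate(y, Z):
    """Least squares control-variate adjustment (illustrative sketch).

    y : (n,) responses, playing the role of the Y_{i,l}
    Z : (n, p) mean-zero control variates, playing the role of the Z^{(j)}_{i,l}
    Returns the adjusted mean, the regression intercept, and the slopes.
    """
    n = len(y)
    X = np.column_stack([np.ones(n), Z])           # design matrix with intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # (beta_0, beta_1, ..., beta_p)
    intercept, beta = coef[0], coef[1:]
    adjusted_mean = float(np.mean(y - Z @ beta))   # (1/n) sum (Y - sum_j beta_j Z_j)
    return adjusted_mean, float(intercept), beta


# Synthetic data (assumption): the target mean is 3.0 and Z has mean zero.
rng = np.random.default_rng(1)
n = 2000
Z = rng.normal(size=(n, 2))
y = 3.0 + Z @ np.array([1.5, -0.7]) + 0.1 * rng.normal(size=n)

estimate, intercept, beta_hat = control_variate_estimate(y, Z)
print(estimate, intercept)
```

Because least squares residuals sum to zero when an intercept is included, the adjusted mean and the fitted intercept agree up to floating-point error, which is why the estimator can be read off directly from the regression output.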

Proof of Theorem 3

First, we note that
$$
\sqrt{n}\bigl(\hat I^{[f]}(h,\hat d) - I^{[f]}(h)\bigr) = \sqrt{n}\bigl(\hat I^{[f]}(h,\hat d) - \hat I^{[f]}(h,d)\bigr) + \sqrt{n}\bigl(\hat I^{[f]}(h,d) - I^{[f]}(h)\bigr). \tag{A.19}
$$
We begin by analyzing the second term on the right side of (A.19), which only involves randomness from the second stage of sampling, and show that it is asymptotically normal. As for the first term, a closer examination reveals that it is also asymptotically normal, with all its randomness coming from Stage 1. The asymptotic normality of the sum of these two terms then follows immediately from the independence of the two stages of sampling.

Note that $\sum_{l=1}^{k} a_l E\bigl(Y^{[f]}_{1,l}\bigr) = I^{[f]}(h)\,B(h,h_1)$, and in particular, when $f \equiv 1$, this gives $\sum_{l=1}^{k} a_l E(Y_{1,l}) = B(h,h_1)$. Also, we have
$$
\begin{aligned}
n^{1/2}\begin{pmatrix}
\dfrac{1}{n}\displaystyle\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y^{[f]}_{i,l} - I^{[f]}(h)\,B(h,h_1) \\[2ex]
\dfrac{1}{n}\displaystyle\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y_{i,l} - B(h,h_1)
\end{pmatrix}
&= n^{1/2}\begin{pmatrix}
\dfrac{1}{n}\displaystyle\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y^{[f]}_{i,l} - \sum_{l=1}^{k} a_l E\bigl(Y^{[f]}_{1,l}\bigr) \\[2ex]
\dfrac{1}{n}\displaystyle\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y_{i,l} - \sum_{l=1}^{k} a_l E(Y_{1,l})
\end{pmatrix} \\
&= \sum_{l=1}^{k} a_l^{1/2}\,\frac{1}{n_l^{1/2}}\sum_{i=1}^{n_l}\Biggl[\begin{pmatrix} Y^{[f]}_{i,l} \\ Y_{i,l}\end{pmatrix} - \begin{pmatrix} E\bigl(Y^{[f]}_{1,l}\bigr) \\ E(Y_{1,l})\end{pmatrix}\Biggr].
\end{aligned}
\tag{A.20}
$$
By condition (2), assumption A2 of Theorem 1, and the assumed geometric ergodicity and independence of the $k$ Markov chains used, the vector in (A.20) converges in distribution to a normal random vector with mean $0$ and covariance matrix $\Gamma(h) = \sum_{l=1}^{k} a_l\,\Gamma_l(h)$, where
$$
\Gamma_l(h) = \begin{pmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{pmatrix}
$$

with
$$
\begin{aligned}
\gamma_{11} &= \operatorname{Var}\bigl(Y^{[f]}_{1,l}\bigr) + 2\sum_{g=1}^{\infty}\operatorname{Cov}\bigl(Y^{[f]}_{1,l},\, Y^{[f]}_{1+g,l}\bigr), \\
\gamma_{12} = \gamma_{21} &= \operatorname{Cov}\bigl(Y^{[f]}_{1,l},\, Y_{1,l}\bigr) + \sum_{g=1}^{\infty}\Bigl[\operatorname{Cov}\bigl(Y^{[f]}_{1,l},\, Y_{1+g,l}\bigr) + \operatorname{Cov}\bigl(Y_{1,l},\, Y^{[f]}_{1+g,l}\bigr)\Bigr], \\
\gamma_{22} &= \operatorname{Var}(Y_{1,l}) + 2\sum_{g=1}^{\infty}\operatorname{Cov}(Y_{1,l},\, Y_{1+g,l}).
\end{aligned}
$$
Since $\hat I^{[f]}(h,d)$ is given by the ratio (2), in view of (A.20), its asymptotic distribution may be obtained by applying the delta method to the function $g(u,v) = u/v$. This gives $\sqrt{n}\bigl(\hat I^{[f]}(h,d) - I^{[f]}(h)\bigr) \xrightarrow{d} N\bigl(0, \rho(h)\bigr)$, where
$$
\rho(h) = \nabla g\bigl(I^{[f]}(h)B(h,h_1),\, B(h,h_1)\bigr)'\;\Gamma(h)\;\nabla g\bigl(I^{[f]}(h)B(h,h_1),\, B(h,h_1)\bigr), \tag{A.21}
$$
with $\nabla g(u,v) = (1/v,\; -u/v^2)'$.

We now consider the first term on the right side of (A.19). Define
$$
L(u) = \frac{\displaystyle\sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{f(\theta_i^{(l)})\,\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/u_s}}{\displaystyle\sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/u_s}}
$$
for $u = (u_2,\dots,u_k)'$ with $u_l > 0$ for $l = 2,\dots,k$. Then
$$
L(d) = \hat I^{[f]}(h,d) = \frac{\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y^{[f]}_{i,l}}{\sum_{l=1}^{k}\sum_{i=1}^{n_l} Y_{i,l}}
$$
and $\sqrt{n}\bigl(\hat I^{[f]}(h,\hat d) - \hat I^{[f]}(h,d)\bigr) = \sqrt{n}\bigl(L(\hat d) - L(d)\bigr)$. Now, by the Taylor series expansion of $L$ about $d$ we get
$$
\sqrt{n}\bigl(\hat I^{[f]}(h,\hat d) - \hat I^{[f]}(h,d)\bigr) = \sqrt{n}\,\nabla L(d)'(\hat d - d) + \frac{\sqrt{n}}{2}(\hat d - d)'\,\nabla^2 L(d^{*})\,(\hat d - d),
$$
where $d^{*}$ is between $d$ and $\hat d$. First, we show that the gradient $\nabla L(d)$ converges almost surely to a finite constant vector by proving that each one of its components converges almost surely.

For the components $[\nabla L(d)]_{j-1}$, $j = 2,\dots,k$, we have
$$
\begin{aligned}
[\nabla L(d)]_{j-1}
&= \frac{\displaystyle\sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{a_j\,f(\theta_i^{(l)})\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})}{d_j^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s\bigr)^2}}{\displaystyle\sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s}}
- \frac{\displaystyle\Biggl[\sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{f(\theta_i^{(l)})\,\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s}\Biggr]\Biggl[\sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{a_j\,\nu_h(\theta_i^{(l)})\,\nu_{h_j}(\theta_i^{(l)})}{d_j^2\bigl(\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s\bigr)^2}\Biggr]}{\displaystyle\Biggl[\sum_{l=1}^{k}\sum_{i=1}^{n_l}\frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta_i^{(l)})/d_s}\Biggr]^2} \\
&\xrightarrow{\text{a.s.}} \frac{\dfrac{B(h,h_1)}{d_j^2}\displaystyle\int\frac{a_j\,f(\theta)\,\nu_{h_j}(\theta)}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s}\,\nu_{h,y}(d\theta)}{B(h,h_1)}
- \frac{I^{[f]}(h)\,B(h,h_1)\,\dfrac{B(h,h_1)}{d_j^2}\displaystyle\int\frac{a_j\,\nu_{h_j}(\theta)}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s}\,\nu_{h,y}(d\theta)}{B(h,h_1)^2} \\
&= \frac{1}{d_j^2}\int\frac{a_j\,f(\theta)\,\nu_{h_j}(\theta)}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s}\,\nu_{h,y}(d\theta)
- \frac{I^{[f]}(h)}{d_j^2}\int\frac{a_j\,\nu_{h_j}(\theta)}{\sum_{s=1}^{k} a_s\,\nu_{h_s}(\theta)/d_s}\,\nu_{h,y}(d\theta) \\
&:= [v(h)]_{j-1}, \qquad j = 2,\dots,k.
\end{aligned}
\tag{A.22}
$$
As in the proof of Theorem 1, it can be shown that each element of the second-derivative matrix $\nabla^2 L(d^{*})$ is $O_p(1)$. Now, we can rewrite (A.19) as
$$
\begin{aligned}
\sqrt{n}\bigl(\hat I^{[f]}(h,\hat d) - I^{[f]}(h)\bigr)
&= \sqrt{\frac{n}{N}}\,\nabla L(d)'\,\sqrt{N}(\hat d - d) + \frac{1}{2\sqrt{N}}\sqrt{\frac{n}{N}}\,\bigl[\sqrt{N}(\hat d - d)\bigr]'\,\nabla^2 L(d^{*})\,\bigl[\sqrt{N}(\hat d - d)\bigr] \\
&\quad + \sqrt{n}\bigl(\hat I^{[f]}(h,d) - I^{[f]}(h)\bigr) \\
&= \sqrt{q}\,v(h)'\,\sqrt{N}(\hat d - d) + \sqrt{n}\bigl(\hat I^{[f]}(h,d) - I^{[f]}(h)\bigr) + o_p(1).
\end{aligned}
$$
Since the two sampling stages are assumed to be independent, we conclude that
$$
\sqrt{n}\bigl(\hat I^{[f]}(h,\hat d) - I^{[f]}(h)\bigr) \xrightarrow{d} N\bigl(0,\; q\,v(h)'\,\Sigma\,v(h) + \rho(h)\bigr).
$$
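The two ingredients of the limiting variance in this proof, the delta-method variance for the ratio $g(u,v) = u/v$ with $\nabla g(u,v) = (1/v, -u/v^2)'$ and the additional Stage 1 term of the form $q\,v(h)'(\cdot)\,v(h)$, are simple to compute once their inputs are available. The Python sketch below shows the arithmetic only; the function names and all numerical inputs are hypothetical placeholders chosen for the example, not quantities from the thesis.

```python
import numpy as np


def ratio_gradient(u, v):
    """Gradient of g(u, v) = u / v, i.e. (1/v, -u/v^2)'."""
    return np.array([1.0 / v, -u / v**2])


def delta_method_variance(u, v, gamma):
    """Quadratic form grad' * gamma * grad, as in the delta method for a ratio."""
    grad = ratio_gradient(u, v)
    return float(grad @ gamma @ grad)


def two_stage_variance(q, v_vec, stage1_cov, stage2_var):
    """Stage 1 contribution q * v' S v plus the Stage 2 variance."""
    v_vec = np.asarray(v_vec, dtype=float)
    return float(q * v_vec @ stage1_cov @ v_vec + stage2_var)


# Hypothetical inputs: a 2x2 limiting covariance and the point (u, v) = (2, 4).
gamma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
stage2_var = delta_method_variance(2.0, 4.0, gamma)

# Hypothetical Stage 1 inputs: q = 0.5, v = (1, 2)', identity covariance.
total_var = two_stage_variance(0.5, [1.0, 2.0], np.eye(2), stage2_var)
print(stage2_var, total_var)
```

With these placeholder inputs the gradient is $(0.25, -0.125)'$, so the Stage 2 variance is $0.0625$ and the combined variance is $0.5 \cdot 5 + 0.0625 = 2.5625$.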

Proof of Theorem 4

Here $Z$ and $Y$ represent the matrix and vector, respectively, previously defined in (A.8) and (A.9). In addition, let $Z^{[f]}$ denote the $n \times (k+1)$ matrix with transpose
$$
Z^{[f]\prime} = \begin{pmatrix}
1 & \cdots & 1 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \\
Z^{[f](1)}_{1,1} & \cdots & Z^{[f](1)}_{n_1,1} & Z^{[f](1)}_{1,2} & \cdots & Z^{[f](1)}_{n_2,2} & \cdots & Z^{[f](1)}_{1,k} & \cdots & Z^{[f](1)}_{n_k,k} \\
Z^{[f](2)}_{1,1} & \cdots & Z^{[f](2)}_{n_1,1} & Z^{[f](2)}_{1,2} & \cdots & Z^{[f](2)}_{n_2,2} & \cdots & Z^{[f](2)}_{1,k} & \cdots & Z^{[f](2)}_{n_k,k} \\
\vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
Z^{[f](k)}_{1,1} & \cdots & Z^{[f](k)}_{n_1,1} & Z^{[f](k)}_{1,2} & \cdots & Z^{[f](k)}_{n_2,2} & \cdots & Z^{[f](k)}_{1,k} & \cdots & Z^{[f](k)}_{n_k,k}
\end{pmatrix}
\tag{A.23}
$$
and let $Y^{[f]}$ be the vector
$$
Y^{[f]} = \bigl(Y^{[f]}_{1,1},\dots,Y^{[f]}_{n_1,1},\,Y^{[f]}_{1,2},\dots,Y^{[f]}_{n_2,2},\,\dots,\,Y^{[f]}_{1,k},\dots,Y^{[f]}_{n_k,k}\bigr)'. \tag{A.24}
$$
We know from Doss (2010) that the least squares estimate when $Y$ is regressed on $Z$, denoted by $(\hat\beta_0, \hat\beta_2,\dots,\hat\beta_k) =: (\hat\beta_0, \hat\beta)$, converges almost surely to $(\beta_{0,\lim}, \beta_{\lim}) = R^{-1}v$. In a similar way, we will show here that the least squares estimate when $Y^{[f]}$ is regressed on $Z^{[f]}$, $\bigl(\hat\beta^{[f]}_0, \hat\beta^{[f]}\bigr) = \bigl(\hat\beta^{[f]}_0, \hat\beta^{[f]}_1,\dots,\hat\beta^{[f]}_k\bigr)$, converges almost surely to a vector $\bigl(\beta^{[f]}_{0,\lim}, \beta^{[f]}_{\lim}\bigr)$. Note that, under the assumption that $\bigl(Z^{[f]\prime}Z^{[f]}\bigr)^{-1}$ exists,
$$
\bigl(\hat\beta^{[f]}_0, \hat\beta^{[f]}\bigr) = n\bigl(Z^{[f]\prime}Z^{[f]}\bigr)^{-1}\,\frac{Z^{[f]\prime}Y^{[f]}}{n}.
$$
Since A4 is satisfied, we have
$$
\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Z^{[f](j)}_{i,l}\,Z^{[f](j')}_{i,l} = \sum_{l=1}^{k}\frac{n_l}{n}\,\frac{1}{n_l}\sum_{i=1}^{n_l} Z^{[f](j)}_{i,l}\,Z^{[f](j')}_{i,l} \xrightarrow{\text{a.s.}} R^{[f]}_{j+1,\,j'+1}, \qquad j, j' = 0,\dots,k,
$$
and hence $Z^{[f]\prime}Z^{[f]}/n \xrightarrow{\text{a.s.}} R^{[f]}$. Therefore by A6, with probability one $Z^{[f]\prime}Z^{[f]}$ is nonsingular for large $n$, and furthermore
$$
n\bigl(Z^{[f]\prime}Z^{[f]}\bigr)^{-1} \xrightarrow{\text{a.s.}} \bigl(R^{[f]}\bigr)^{-1}. \tag{A.25}
$$

By condition A5, we also have
$$
\frac{Z^{[f]\prime}Y^{[f]}}{n} = \begin{pmatrix}
\dfrac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Z^{[f](0)}_{i,l}\,Y^{[f]}_{i,l} \\
\vdots \\
\dfrac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Z^{[f](k)}_{i,l}\,Y^{[f]}_{i,l}
\end{pmatrix}
\xrightarrow{\text{a.s.}}
\begin{pmatrix}
\sum_{l=1}^{k} a_l E\bigl(Z^{[f](0)}_{1,l}\,Y^{[f]}_{1,l}\bigr) \\
\vdots \\
\sum_{l=1}^{k} a_l E\bigl(Z^{[f](k)}_{1,l}\,Y^{[f]}_{1,l}\bigr)
\end{pmatrix}.
\tag{A.26}
$$
Let $v^{[f]} = \bigl(v^{[f]}_0,\dots,v^{[f]}_k\bigr)'$ denote the vector on the right side of (A.26). Combining (A.25) and (A.26) we get
$$
\bigl(\hat\beta^{[f]}_0, \hat\beta^{[f]}\bigr) \xrightarrow{\text{a.s.}} \bigl(\beta^{[f]}_{0,\lim}, \beta^{[f]}_{\lim}\bigr) = \bigl(R^{[f]}\bigr)^{-1}v^{[f]}. \tag{A.27}
$$
Let
$$
\hat J_{\beta_{\lim},\,\beta^{[f]}_{\lim}} = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Biggl[\begin{pmatrix} Y^{[f]}_{i,l} \\ Y_{i,l}\end{pmatrix} - \begin{pmatrix}\sum_{j=1}^{k}\beta^{[f]}_{j,\lim}\,Z^{[f](j)}_{i,l} \\ \sum_{j=2}^{k}\beta_{j,\lim}\,Z^{(j)}_{i,l}\end{pmatrix}\Biggr] = \sum_{l=1}^{k} a_l\,\frac{1}{n_l}\sum_{i=1}^{n_l} U^{[f]}_{i,l},
$$
where
$$
U^{[f]}_{i,l} = \begin{pmatrix} U^{[f](1)}_{i,l} \\ U^{[f](2)}_{i,l}\end{pmatrix} = \begin{pmatrix} Y^{[f]}_{i,l} - \sum_{j=1}^{k}\beta^{[f]}_{j,\lim}\,Z^{[f](j)}_{i,l} \\ Y_{i,l} - \sum_{j=2}^{k}\beta_{j,\lim}\,Z^{(j)}_{i,l}\end{pmatrix}.
$$
Also, let $\mu^{[f]}_l = E\bigl(U^{[f]}_{1,l}\bigr)$. Now since A2, A3, and A4 hold, for each $l = 1,\dots,k$ we have
$$
n_l^{1/2}\Bigl(\frac{1}{n_l}\sum_{i=1}^{n_l} U^{[f]}_{i,l} - \mu^{[f]}_l\Bigr) \xrightarrow{d} N\bigl(0, \Sigma^{[f]}_l\bigr),
\qquad\text{where}\qquad
\Sigma^{[f]}_l = \begin{pmatrix}\sigma^2_{l,11} & \sigma_{l,12} \\ \sigma_{l,21} & \sigma^2_{l,22}\end{pmatrix}
$$
with
$$
\begin{aligned}
\sigma^2_{l,11} &= \operatorname{Var}\bigl(U^{[f](1)}_{1,l}\bigr) + 2\sum_{g=1}^{\infty}\operatorname{Cov}\bigl(U^{[f](1)}_{1,l},\, U^{[f](1)}_{1+g,l}\bigr), \\
\sigma_{l,12} = \sigma_{l,21} &= \operatorname{Cov}\bigl(U^{[f](1)}_{1,l},\, U^{[f](2)}_{1,l}\bigr) + \sum_{g=1}^{\infty}\Bigl[\operatorname{Cov}\bigl(U^{[f](1)}_{1,l},\, U^{[f](2)}_{1+g,l}\bigr) + \operatorname{Cov}\bigl(U^{[f](2)}_{1,l},\, U^{[f](1)}_{1+g,l}\bigr)\Bigr], \\
\sigma^2_{l,22} &= \operatorname{Var}\bigl(U^{[f](2)}_{1,l}\bigr) + 2\sum_{g=1}^{\infty}\operatorname{Cov}\bigl(U^{[f](2)}_{1,l},\, U^{[f](2)}_{1+g,l}\bigr).
\end{aligned}
$$
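The entries of the limiting covariance matrix above are long-run variances and covariances of the form $\operatorname{Var} + 2\sum_g \operatorname{Cov}$, which account for the serial dependence in each chain. The text does not specify how such quantities would be estimated in practice; one common option, shown here purely as an illustration, is the method of batch means. In the Python sketch below, the function name `batch_means_cov`, the number of batches, and the AR(1) test chain are all assumptions made for the example.

```python
import numpy as np


def batch_means_cov(x, n_batches=30):
    """Batch-means estimate of a long-run covariance matrix (illustrative sketch).

    x : (n,) or (n, p) array of serially dependent draws from a single chain.
    The chain is cut into n_batches consecutive batches of equal length b;
    b times the sample covariance of the batch means estimates the matrix
    with entries Var + 2 * sum of lagged covariances.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    b = x.shape[0] // n_batches                      # batch length
    x = x[: b * n_batches]                           # drop the incomplete tail
    means = x.reshape(n_batches, b, -1).mean(axis=1)
    centered = means - means.mean(axis=0)
    return b * centered.T @ centered / (n_batches - 1)


# Test chain (assumption): AR(1) with coefficient 0.5 and unit innovations,
# whose long-run variance is 1 / (1 - 0.5)^2 = 4.
rng = np.random.default_rng(2)
n, phi = 100_000, 0.5
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = eps[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

chain = np.column_stack([x, x**2])                   # two functions of the chain
cov = batch_means_cov(chain, n_batches=400)
print(cov)
```

The upper-left entry of `cov` should be in the neighborhood of the known long-run variance of the AR(1) chain, while the remaining entries estimate the corresponding quantities for the second coordinate and the cross term.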

By the assumed independence of the $k$ Markov chains, we have
$$
n^{1/2}\Bigl(\hat J_{\beta_{\lim},\,\beta^{[f]}_{\lim}} - \sum_{l=1}^{k} a_l\,\mu^{[f]}_l\Bigr) \xrightarrow{d} N\bigl(0, \Sigma^{[f]}\bigr), \tag{A.28}
$$
where
$$
\Sigma^{[f]} = \sum_{l=1}^{k} a_l\,\Sigma^{[f]}_l. \tag{A.29}
$$
We now show that
$$
\sum_{l=1}^{k} a_l\,\mu^{[f]}_l = \begin{pmatrix} I^{[f]}(h)\,B(h,h_1) \\ B(h,h_1)\end{pmatrix}, \tag{A.30}
$$
and to do this we write
$$
\begin{aligned}
\sum_{l=1}^{k} a_l\,\mu^{[f]}_l = \sum_{l=1}^{k} a_l\,E\bigl(U^{[f]}_{1,l}\bigr)
&= \sum_{l=1}^{k} a_l\begin{pmatrix} E\bigl(Y^{[f]}_{1,l}\bigr) - \sum_{j=1}^{k}\beta^{[f]}_{j,\lim}\,E\bigl(Z^{[f](j)}_{1,l}\bigr) \\ E(Y_{1,l}) - \sum_{j=2}^{k}\beta_{j,\lim}\,E\bigl(Z^{(j)}_{1,l}\bigr)\end{pmatrix} \\
&= \begin{pmatrix}\sum_{l=1}^{k} a_l E\bigl(Y^{[f]}_{1,l}\bigr) - \sum_{j=1}^{k}\beta^{[f]}_{j,\lim}\sum_{l=1}^{k} a_l E\bigl(Z^{[f](j)}_{1,l}\bigr) \\ \sum_{l=1}^{k} a_l E(Y_{1,l}) - \sum_{j=2}^{k}\beta_{j,\lim}\sum_{l=1}^{k} a_l E\bigl(Z^{(j)}_{1,l}\bigr)\end{pmatrix} \\
&= \begin{pmatrix}\sum_{l=1}^{k} a_l E\bigl(Y^{[f]}_{1,l}\bigr) \\ \sum_{l=1}^{k} a_l E(Y_{1,l})\end{pmatrix}
= \begin{pmatrix} I^{[f]}(h)\,B(h,h_1) \\ B(h,h_1)\end{pmatrix},
\end{aligned}
$$
the next-to-last equality being a consequence of the readily verifiable fact that
$$
\sum_{l=1}^{k} a_l E\bigl(Z^{[f](j)}_{1,l}\bigr) = 0 \quad\text{and}\quad \sum_{l=1}^{k} a_l E\bigl(Z^{(j)}_{1,l}\bigr) = 0 \quad\text{for } j = 2,\dots,k. \tag{A.31}
$$
From (A.30) and (A.28) we conclude that
$$
n^{1/2}\Biggl(\hat J_{\beta_{\lim},\,\beta^{[f]}_{\lim}} - \begin{pmatrix} I^{[f]}(h)\,B(h,h_1) \\ B(h,h_1)\end{pmatrix}\Biggr) \xrightarrow{d} N\bigl(0, \Sigma^{[f]}\bigr).
$$

Consider now the difference
$$
\begin{aligned}
n^{1/2}\bigl(\hat J_{\hat\beta,\,\hat\beta^{[f]}} - \hat J_{\beta_{\lim},\,\beta^{[f]}_{\lim}}\bigr)
&= \begin{pmatrix}
n^{1/2}\sum_{j=1}^{k}\bigl(\beta^{[f]}_{j,\lim} - \hat\beta^{[f]}_j\bigr)\dfrac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Z^{[f](j)}_{i,l} \\[2ex]
n^{1/2}\sum_{j=2}^{k}\bigl(\beta_{j,\lim} - \hat\beta_j\bigr)\dfrac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} Z^{(j)}_{i,l}
\end{pmatrix} \\
&= \begin{pmatrix}
\sum_{j=1}^{k}\bigl(\beta^{[f]}_{j,\lim} - \hat\beta^{[f]}_j\bigr)\sum_{l=1}^{k} a_l\, n^{1/2}\sum_{i=1}^{n_l}\dfrac{Z^{[f](j)}_{i,l} - E\bigl(Z^{[f](j)}_{1,l}\bigr)}{n_l} \\[2ex]
\sum_{j=2}^{k}\bigl(\beta_{j,\lim} - \hat\beta_j\bigr)\sum_{l=1}^{k} a_l\, n^{1/2}\sum_{i=1}^{n_l}\dfrac{Z^{(j)}_{i,l} - E\bigl(Z^{(j)}_{1,l}\bigr)}{n_l}
\end{pmatrix},
\end{aligned}
$$
where the last equality follows from (A.31). By the assumption that the chains are geometrically ergodic (condition A1), the boundedness of the $Z^{(j)}_{i,l}$'s, and the moment condition imposed on $f$ in A4, we know that $n^{1/2}\sum_{i=1}^{n_l}\bigl(\bigl[Z^{(j)}_{i,l} - E\bigl(Z^{(j)}_{1,l}\bigr)\bigr]/n_l\bigr)$ and $n^{1/2}\sum_{i=1}^{n_l}\bigl(\bigl[Z^{[f](j)}_{i,l} - E\bigl(Z^{[f](j)}_{1,l}\bigr)\bigr]/n_l\bigr)$ are asymptotically normal, hence $O_p(1)$. This fact, combined with (A.27) and the corresponding result for $(\hat\beta_0, \hat\beta)$, yields
$$
n^{1/2}\bigl(\hat J_{\hat\beta,\,\hat\beta^{[f]}} - \hat J_{\beta_{\lim},\,\beta^{[f]}_{\lim}}\bigr) = o_p(1).
$$
Hence we can conclude that
$$
n^{1/2}\Biggl(\hat J_{\hat\beta,\,\hat\beta^{[f]}} - \begin{pmatrix} I^{[f]}(h)\,B(h,h_1) \\ B(h,h_1)\end{pmatrix}\Biggr) \xrightarrow{d} N\bigl(0, \Sigma^{[f]}\bigr). \tag{A.32}
$$
Now applying the delta method with the function $g(u,v) = u/v$ we have
$$
n^{1/2}\Biggl(\frac{\sum_{l=1}^{k}\sum_{i=1}^{n_l}\bigl(Y^{[f]}_{i,l} - \sum_{j=1}^{k}\hat\beta^{[f]}_j Z^{[f](j)}_{i,l}\bigr)}{\sum_{l=1}^{k}\sum_{i=1}^{n_l}\bigl(Y_{i,l} - \sum_{j=2}^{k}\hat\beta_j Z^{(j)}_{i,l}\bigr)} - I^{[f]}(h)\Biggr) \xrightarrow{d} N\bigl(0, r(h)\bigr),
$$
i.e.
$$
n^{1/2}\bigl(\hat I_{\hat\beta,\,\hat\beta^{[f]}} - I^{[f]}(h)\bigr) \xrightarrow{d} N\bigl(0, r(h)\bigr),
$$
where
$$
r(h) = \nabla g\bigl(I^{[f]}(h)B(h,h_1),\, B(h,h_1)\bigr)'\;\Sigma^{[f]}\;\nabla g\bigl(I^{[f]}(h)B(h,h_1),\, B(h,h_1)\bigr), \tag{A.33}
$$
with $\nabla g(u,v) = (1/v,\; -u/v^2)'$ and $\Sigma^{[f]}$ as in (A.29).
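The estimator treated in this proof is a ratio of two control-variate-adjusted sums, each adjusted with its own least squares coefficients. A minimal Python sketch of that computation follows. It is illustrative only: the synthetic i.i.d. data stand in for Markov chain output, the helper names `adjusted_mean` and `ratio_estimate` are hypothetical, and the numbers of control variates (three for the numerator, two for the denominator) simply mimic the index ranges $j = 1,\dots,k$ and $j = 2,\dots,k$ with $k = 3$.

```python
import numpy as np


def adjusted_mean(y, Z):
    """Mean of y after subtracting the least squares fit on the columns of Z."""
    X = np.column_stack([np.ones(len(y)), Z])        # include an intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean(y - Z @ coef[1:]))


def ratio_estimate(y_f, Z_f, y, Z):
    """Ratio of the adjusted numerator mean to the adjusted denominator mean."""
    return adjusted_mean(y_f, Z_f) / adjusted_mean(y, Z)


# Synthetic data (assumption): numerator mean 6.0, denominator mean 3.0,
# so the target ratio is 2.0; all control variates have mean zero.
rng = np.random.default_rng(3)
n = 4000
Z = rng.normal(size=(n, 2))
Z_f = rng.normal(size=(n, 3))
y = 3.0 + Z @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=n)
y_f = 6.0 + Z_f @ np.array([0.8, 0.3, -1.2]) + 0.1 * rng.normal(size=n)

ratio = ratio_estimate(y_f, Z_f, y, Z)
print(ratio)
```

Taking the ratio of means rather than of sums is harmless here because the common factor $1/n$ cancels.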

Proof of Theorem 5

We begin by reviewing some related notation and results established by Geyer (1994). Recall that $N_j$ denotes the length of the $j$th chain in Stage 1 samples, $N = \sum_{j=1}^{k} N_j$, and $A_j = N_j/N$. Using the notation $\zeta_j = -\log(m_{h_j}) + \log(A_j)$ for $j = 1,\dots,k$, Geyer's (1994) reverse logistic regression estimator $\hat\zeta = (\hat\zeta_1,\dots,\hat\zeta_k)$ for the unknown vector $\zeta$ is obtained by maximizing the log quasi-likelihood
$$
l_N(\zeta) = \sum_{l=1}^{k}\sum_{i=1}^{N_l}\log\bigl(p_l(\theta_i^{(l)}, \zeta)\bigr), \tag{A.34}
$$
where
$$
p_l(\theta, \zeta) = \frac{\nu_{h_l}(\theta)\,e^{\zeta_l}}{\sum_{s=1}^{k}\nu_{h_s}(\theta)\,e^{\zeta_s}} \quad\text{for } l = 1,\dots,k. \tag{A.35}
$$
Theorem 1 of Geyer (1994) states that this maximizer is unique up to an additive constant if the Monte Carlo sample is inseparable. Geyer (1994) also proves that, under certain conditions, $\sqrt{N}(\hat\zeta_N - \zeta_0)$ is asymptotically normal, where $\zeta_0$ is defined by
$$
[\zeta_0]_j = \zeta_j - \frac{1}{k}\sum_{s=1}^{k}\zeta_s, \qquad j = 1,\dots,k.
$$
Our proof is structured as follows. First, we extend Geyer's (1994) proof in order to show that the $2k$-dimensional vector
$$
\sqrt{N}\Biggl(\begin{pmatrix}\hat\zeta \\ \hat e\end{pmatrix} - \begin{pmatrix}\zeta_0 \\ e\end{pmatrix}\Biggr) =: \begin{pmatrix} U_1 \\ \vdots \\ U_{2k}\end{pmatrix} =: U \tag{A.36}
$$
is asymptotically normal. Then, by getting back to the $d$ notation through a transformation, we show that our vector of interest
$$
\sqrt{N}\Biggl(\begin{pmatrix}\hat d \\ \hat e\end{pmatrix} - \begin{pmatrix} d \\ e\end{pmatrix}\Biggr)
$$

PAGE 75

is also asymptotically normal.

To carry out the first step, we will express each $U_j$, $j=1,\dots,2k$, as the sum of a linear combination of standardized averages of functions of the $\theta^{(l)}_i$'s and an $o_p(1)$ quantity. We will also need the central limit theorem to hold for these averages. Hence, for each $j=1,\dots,2k$, we plan to find constants $\lambda_{j,1},\dots,\lambda_{j,k}$ and functions $\varphi_{j,1},\dots,\varphi_{j,k}$ which satisfy the conditions

$$E_{\nu_{h_l,y}}\bigl(\varphi_{j,l}(\theta)\bigr)=0\quad\text{and}\quad E_{\nu_{h_l,y}}\bigl(|\varphi_{j,l}(\theta)|^{2+\epsilon}\bigr)<\infty,\quad l=1,\dots,k, \tag{A.37a}$$

$$U_j = \lambda_{j,1}\frac{1}{\sqrt{N_1}}\sum_{i=1}^{N_1}\varphi_{j,1}(\theta^{(1)}_i) + \cdots + \lambda_{j,k}\frac{1}{\sqrt{N_k}}\sum_{i=1}^{N_k}\varphi_{j,k}(\theta^{(k)}_i) + o_p(1) \tag{A.37b}$$

for some $\epsilon>0$. Note that conditions (A.37a) and B1 yield central limit theorems for the averages in the linear combination above.

For $U_{k+1},\dots,U_{2k}$, condition (A.37) is clearly satisfied since

$$U_{j+k} = \sqrt{N}(\hat e_j - e_j) = \frac{1}{\sqrt{A_j}}\,\frac{1}{\sqrt{N_j}}\sum_{i=1}^{N_j}\bigl(f(\theta^{(j)}_i)-e_j\bigr)\quad\text{for } j=1,\dots,k,$$

and the moment conditions in (A.37a) hold (see B2 in the statement of this theorem).

Next, we show that condition (A.37) also holds for the first $k$ components of $U$. In the proof of his Theorem 2, Geyer (1994) defines the matrix $B_N$ via

$$-\frac{1}{N}\bigl(\nabla l_N(\hat\zeta_N)-\nabla l_N(\zeta_0)\bigr) = B_N(\hat\zeta_N-\zeta_0), \tag{A.38}$$

where $l_N$ was defined in (A.34), and establishes that $B_N\xrightarrow{a.s.}B$, where $B$ is given by an equation in Geyer (1994). He also shows that, with $u$ being the $k$-dimensional column vector of 1's,

$$\begin{pmatrix}B_N\\ u'\end{pmatrix}\sqrt{N}(\hat\zeta_N-\zeta_0) = \begin{pmatrix}\frac{1}{\sqrt{N}}\nabla l_N(\zeta_0)\\ 0\end{pmatrix}. \tag{A.39}$$

PAGE 76

[See the corresponding equation in Geyer (1994).] Note that, by applying the Mean Value Theorem to $\nabla l_N$, the matrix $B_N$ defined in (A.38) can also be expressed as

$$B_N = -\frac{1}{N}\nabla^2 l_N(\zeta^{*})$$

for some $\zeta^{*}$ between $\hat\zeta_N$ and $\zeta_0$. Hence, with $p_r$, $r=1,\dots,k$, defined as in (A.35), the elements of $B_N$ are given by

$$[B_N]_{rr} = \frac{1}{N}\sum_{l=1}^{k}\sum_{i=1}^{N_l}p_r(\theta^{(l)}_i,\zeta^{*})\bigl(1-p_r(\theta^{(l)}_i,\zeta^{*})\bigr),\quad r=1,\dots,k,$$

$$[B_N]_{rs} = -\frac{1}{N}\sum_{l=1}^{k}\sum_{i=1}^{N_l}p_r(\theta^{(l)}_i,\zeta^{*})\,p_s(\theta^{(l)}_i,\zeta^{*}),\quad r\neq s,$$

which makes it easy to verify that $B_N u=0$. Combining this with equation (A.39), it can be shown that

$$\sqrt{N}(\hat\zeta_N-\zeta_0) = B_N^{+}\,\frac{1}{\sqrt{N}}\nabla l_N(\zeta_0), \tag{A.40}$$

where

$$B_N^{+} = \Bigl(B_N+\frac{1}{k}uu'\Bigr)^{-1}-\frac{1}{k}uu'$$

is the Moore-Penrose inverse of $B_N$. Furthermore, letting $B^{+}$ denote the Moore-Penrose inverse of $B$, we can alternatively write the equality in (A.40) as

$$\sqrt{N}(\hat\zeta_N-\zeta_0) = \bigl(B_N^{+}-B^{+}+B^{+}\bigr)\frac{1}{\sqrt{N}}\nabla l_N(\zeta_0) = \bigl(B_N^{+}-B^{+}\bigr)\frac{1}{\sqrt{N}}\nabla l_N(\zeta_0) + B^{+}\frac{1}{\sqrt{N}}\nabla l_N(\zeta_0). \tag{A.41}$$

Now, using the result $B_N\xrightarrow{a.s.}B$ established by Geyer (1994), we can easily deduce that

$$B_N^{+}\xrightarrow{a.s.}B^{+} \tag{A.42}$$

PAGE 77

by writing

$$B_N^{+} = \Bigl(B_N+\frac{1}{k}uu'\Bigr)^{-1}-\frac{1}{k}uu' \xrightarrow{a.s.} \Bigl(B+\frac{1}{k}uu'\Bigr)^{-1}-\frac{1}{k}uu' = B^{+},$$

where the last equality comes from Geyer (1994).

Next, we establish asymptotic normality of $\nabla l_N(\zeta_0)/\sqrt{N}$. Since the gradient $\nabla l_N(\zeta_0)$ is the vector whose $r$th element is given by

$$\frac{\partial l_N(\zeta_0)}{\partial\zeta_r} = N_r - \sum_{l=1}^{k}\sum_{i=1}^{N_l}p_r(\theta^{(l)}_i,\zeta_0),$$

we can see that

$$\frac{1}{\sqrt{N}}\frac{\partial l_N(\zeta_0)}{\partial\zeta_r} = \frac{1}{\sqrt{N}}\Bigl(N_r-\sum_{l=1}^{k}\sum_{i=1}^{N_l}p_r(\theta^{(l)}_i,\zeta_0)\Bigr)$$

$$= \sqrt{A_r}\,\frac{1}{\sqrt{N_r}}\sum_{i=1}^{N_r}\bigl(1-p_r(\theta^{(r)}_i,\zeta_0)\bigr) - \sum_{l\neq r}\sqrt{A_l}\,\frac{1}{\sqrt{N_l}}\sum_{i=1}^{N_l}p_r(\theta^{(l)}_i,\zeta_0)$$

$$= \sqrt{A_r}\,\frac{1}{\sqrt{N_r}}\sum_{i=1}^{N_r}\Bigl[\bigl(1-p_r(\theta^{(r)}_i,\zeta_0)\bigr)-\bigl(1-E\bigl(p_r(\theta^{(r)}_1,\zeta_0)\bigr)\bigr)\Bigr] - \sum_{l\neq r}\sqrt{A_l}\,\frac{1}{\sqrt{N_l}}\sum_{i=1}^{N_l}\Bigl[p_r(\theta^{(l)}_i,\zeta_0)-E\bigl(p_r(\theta^{(l)}_1,\zeta_0)\bigr)\Bigr]$$

$$= -\sum_{l=1}^{k}\sqrt{A_l}\,\frac{1}{\sqrt{N_l}}\sum_{i=1}^{N_l}\Bigl[p_r(\theta^{(l)}_i,\zeta_0)-E\bigl(p_r(\theta^{(l)}_1,\zeta_0)\bigr)\Bigr], \tag{A.43}$$

which is a linear combination of the form given in (A.37b) and, because $0\le p_r(\theta,\zeta)\le 1$ for all $\theta$ and $\zeta$, condition (A.37a) is also satisfied. Note that we are allowed to insert the

PAGE 78

expectations in the next-to-last equality because

$$-\sqrt{A_r}\,\frac{1}{\sqrt{N_r}}\,N_r\Bigl(1-E\bigl(p_r(\theta^{(r)}_1,\zeta_0)\bigr)\Bigr) + \sum_{l\neq r}\sqrt{A_l}\,\frac{1}{\sqrt{N_l}}\,N_l\,E\bigl(p_r(\theta^{(l)}_1,\zeta_0)\bigr)$$

$$= -\sqrt{N}A_r\Bigl(1-\int\frac{\nu_{h_r}(\theta)e^{\zeta_r}}{\sum_{s=1}^{k}\nu_{h_s}(\theta)e^{\zeta_s}}\,\nu_{h_r,y}(\theta)\,d\theta\Bigr) + \sqrt{N}\sum_{l\neq r}A_l\,E\bigl(p_r(\theta^{(l)}_1,\zeta_0)\bigr)$$

$$= -\sqrt{N}A_r\sum_{l\neq r}\int\frac{\nu_{h_l}(\theta)e^{\zeta_l}}{\sum_{s=1}^{k}\nu_{h_s}(\theta)e^{\zeta_s}}\,\nu_{h_r,y}(\theta)\,d\theta + \sqrt{N}\sum_{l\neq r}A_l\,E\bigl(p_r(\theta^{(l)}_1,\zeta_0)\bigr)$$

$$= -\sqrt{N}A_r\sum_{l\neq r}\int e^{\zeta_l-\zeta_r}\,\frac{\nu_{h_r}(\theta)e^{\zeta_r}}{\sum_{s=1}^{k}\nu_{h_s}(\theta)e^{\zeta_s}}\,\frac{m_{h_l}}{m_{h_r}}\,\nu_{h_l,y}(\theta)\,d\theta + \sqrt{N}\sum_{l\neq r}A_l\,E\bigl(p_r(\theta^{(l)}_1,\zeta_0)\bigr)$$

$$= -\sqrt{N}A_r\sum_{l\neq r}\frac{m_{h_l}}{m_{h_r}}\,e^{\zeta_l-\zeta_r}\,E\bigl(p_r(\theta^{(l)}_1,\zeta_0)\bigr) + \sqrt{N}\sum_{l\neq r}A_l\,E\bigl(p_r(\theta^{(l)}_1,\zeta_0)\bigr) = 0,$$

where the final equality holds because $(m_{h_l}/m_{h_r})\,e^{\zeta_l-\zeta_r}=A_l/A_r$ by the definition of $\zeta$.

The asymptotic normality of $\nabla l_N(\zeta_0)/\sqrt{N}$ now follows from the Cramér-Wold device. In view of this convergence in distribution and the convergence result in (A.42), (A.41) gives

$$\sqrt{N}(\hat\zeta_N-\zeta_0) = o_p(1)\,O_p(1) + B^{+}\frac{1}{\sqrt{N}}\nabla l_N(\zeta_0) = B^{+}\frac{1}{\sqrt{N}}\nabla l_N(\zeta_0) + o_p(1).$$

Therefore, we can now easily see that condition (A.37) is also satisfied by the first $k$ components of $U$, i.e. $\sqrt{N}(\hat\zeta_N-\zeta_0)$, because, as we have shown in (A.43), every element of $\nabla l_N(\zeta_0)/\sqrt{N}$ is a linear combination of the form (A.37).
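The reverse logistic regression estimator reviewed at the start of this proof can be illustrated on a toy problem. The sketch below is not the author's implementation: it maximizes the log quasi-likelihood (A.34) by plain gradient ascent, using the gradient $N_r-\sum_l\sum_i p_r(\theta^{(l)}_i,\zeta)$, and it assumes i.i.d. draws from two Gaussian densities known only up to their normalizing constants (standing in for Markov chain output). All names (`fit_reverse_logistic`, `scales`, `log_ratio_hat`) and the step size are hypothetical choices.

```python
import math
import random


def fit_reverse_logistic(samples, log_dens, n_iter=100, step=2.0):
    """Maximize l_N(zeta) = sum_l sum_i log p_l(x_i^(l), zeta) by gradient ascent.

    samples[l]  : draws from the l-th normalized density
    log_dens[l] : log of the l-th unnormalized density
    The maximizer is unique only up to an additive constant, so the returned
    vector is centered to sum to zero.
    """
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    pooled = [x for s in samples for x in s]
    logh = [[log_dens[s](x) for s in range(k)] for x in pooled]
    zeta = [0.0] * k
    for _ in range(n_iter):
        grad = [float(len(samples[r])) for r in range(k)]  # starts at N_r
        for row in logh:
            a = [row[s] + zeta[s] for s in range(k)]
            top = max(a)                                   # for numerical stability
            w = [math.exp(val - top) for val in a]
            tot = sum(w)
            for r in range(k):
                grad[r] -= w[r] / tot                      # subtract p_r(x, zeta)
        zeta = [zeta[r] + step * grad[r] / n_total for r in range(k)]
    mean = sum(zeta) / k
    return [z - mean for z in zeta]


random.seed(0)
scales = [1.0, 2.0]                      # toy densities exp(-x^2 / (2 s^2))
samples = [[random.gauss(0.0, s) for _ in range(1000)] for s in scales]
log_dens = [lambda x, s=s: -x * x / (2.0 * s * s) for s in scales]
zeta_hat = fit_reverse_logistic(samples, log_dens)

# With zeta_l = -log(c_l) + log(A_l) and equal sample sizes, zeta_1 - zeta_2
# estimates log(c_2 / c_1); the true value here is log(2).
log_ratio_hat = zeta_hat[0] - zeta_hat[1]
```

A Newton iteration would converge in fewer steps; gradient ascent is used here only to keep the sketch dependency-free.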

PAGE 79

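Now that we have shown that

$$U = \begin{pmatrix}
\lambda_{1,1}\frac{1}{\sqrt{N_1}}\sum_{i=1}^{N_1}\varphi_{1,1}(\theta^{(1)}_i) + \cdots + \lambda_{1,k}\frac{1}{\sqrt{N_k}}\sum_{i=1}^{N_k}\varphi_{1,k}(\theta^{(k)}_i)\\
\vdots\\
\lambda_{k,1}\frac{1}{\sqrt{N_1}}\sum_{i=1}^{N_1}\varphi_{k,1}(\theta^{(1)}_i) + \cdots + \lambda_{k,k}\frac{1}{\sqrt{N_k}}\sum_{i=1}^{N_k}\varphi_{k,k}(\theta^{(k)}_i)\\
\lambda_{k+1,1}\frac{1}{\sqrt{N_1}}\sum_{i=1}^{N_1}\varphi_{k+1,1}(\theta^{(1)}_i)\\
\vdots\\
\lambda_{2k,k}\frac{1}{\sqrt{N_k}}\sum_{i=1}^{N_k}\varphi_{2k,k}(\theta^{(k)}_i)
\end{pmatrix} + o_p(1),$$

we can prove that $U$ is asymptotically normal by using the Cramér-Wold device. Let us denote the asymptotic variance of $U$ by $S$. Then

$$S = \begin{pmatrix}S_{11} & S_{12}\\ S_{21} & S_{22}\end{pmatrix},$$

where

$$S_{11} = B^{+}CB^{+} \tag{A.44}$$

with

$$C_{rs} = \sum_{l=1}^{k}A_l\,\mathrm{Cov}\bigl(p_r(\theta^{(l)}_1,\zeta_0),\,p_s(\theta^{(l)}_1,\zeta_0)\bigr) + \sum_{l=1}^{k}A_l\sum_{g=1}^{\infty}\Bigl[\mathrm{Cov}\bigl(p_r(\theta^{(l)}_1,\zeta_0),\,p_s(\theta^{(l)}_{1+g},\zeta_0)\bigr)+\mathrm{Cov}\bigl(p_r(\theta^{(l)}_{1+g},\zeta_0),\,p_s(\theta^{(l)}_1,\zeta_0)\bigr)\Bigr]$$

for $r=1,\dots,k$ and $s=1,\dots,k$,

$$[S_{22}]_{rr} = \frac{1}{A_r}\Bigl[\mathrm{Var}\bigl(f(\theta^{(r)}_1)\bigr)+2\sum_{g=1}^{\infty}\mathrm{Cov}\bigl(f(\theta^{(r)}_1),\,f(\theta^{(r)}_{1+g})\bigr)\Bigr],\qquad [S_{22}]_{rs}=0\ \text{when } r\neq s,\quad r,s=1,\dots,k, \tag{A.45}$$

and

$$S_{12} = B^{+}D \tag{A.46}$$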

PAGE 80

with

$$D_{rs} = -\mathrm{Cov}\bigl(p_r(\theta^{(s)}_1,\zeta_0),\,f(\theta^{(s)}_1)\bigr) - \sum_{g=1}^{\infty}\Bigl[\mathrm{Cov}\bigl(p_r(\theta^{(s)}_1,\zeta_0),\,f(\theta^{(s)}_{1+g})\bigr)+\mathrm{Cov}\bigl(p_r(\theta^{(s)}_{1+g},\zeta_0),\,f(\theta^{(s)}_1)\bigr)\Bigr]$$

for $r=1,\dots,k$ and $s=1,\dots,k$. Now, having established the convergence result

$$U = \sqrt{N}\left(\begin{pmatrix}\hat\zeta\\ \hat e\end{pmatrix}-\begin{pmatrix}\zeta_0\\ e\end{pmatrix}\right)\xrightarrow{d}N(0,S), \tag{A.47}$$

consider the function $g:\mathbb{R}^{2k}\to\mathbb{R}^{2k-1}$ given by

$$g\begin{pmatrix}\zeta\\ e\end{pmatrix} = \begin{pmatrix}
e^{\zeta_1-\zeta_2}A_2/A_1\\
e^{\zeta_1-\zeta_3}A_3/A_1\\
\vdots\\
e^{\zeta_1-\zeta_k}A_k/A_1\\
e
\end{pmatrix},$$

where $\zeta$ and $e$ are $k$-dimensional vectors. The delta method applied to (A.47) with $g$ as the transformation gives

$$\sqrt{N}\left(\begin{pmatrix}\hat d_2\\ \vdots\\ \hat d_k\\ \hat e\end{pmatrix}-\begin{pmatrix}d_2\\ \vdots\\ d_k\\ e\end{pmatrix}\right)\xrightarrow{d}N(0,V),$$

where

$$V = \nabla g\begin{pmatrix}\zeta_0\\ e\end{pmatrix}'\,S\,\nabla g\begin{pmatrix}\zeta_0\\ e\end{pmatrix} \tag{A.48}$$

PAGE 81

with

$$\nabla g\begin{pmatrix}\zeta\\ e\end{pmatrix} = \begin{pmatrix}
e^{\zeta_1-\zeta_2}\frac{A_2}{A_1} & e^{\zeta_1-\zeta_3}\frac{A_3}{A_1} & \cdots & e^{\zeta_1-\zeta_k}\frac{A_k}{A_1} & 0 & 0 & \cdots & 0\\
-e^{\zeta_1-\zeta_2}\frac{A_2}{A_1} & 0 & \cdots & 0 & 0 & 0 & \cdots & 0\\
0 & -e^{\zeta_1-\zeta_3}\frac{A_3}{A_1} & \cdots & 0 & 0 & 0 & \cdots & 0\\
\vdots & & \ddots & \vdots & \vdots & & & \vdots\\
0 & 0 & \cdots & -e^{\zeta_1-\zeta_k}\frac{A_k}{A_1} & 0 & 0 & \cdots & 0\\
0 & 0 & \cdots & 0 & 1 & 0 & \cdots & 0\\
0 & 0 & \cdots & 0 & 0 & 1 & \cdots & 0\\
\vdots & & & \vdots & \vdots & & \ddots & \vdots\\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 1
\end{pmatrix}$$

and $S$ given by (A.44), (A.46), and (A.45).

Proof of Remark 2 to Theorem 1

Following the lines of the proof of Theorem 1 with $q=1$, we get as in (A.5) that

$$\sqrt{n}\bigl(\hat B(h,h_1,\hat d)-B(h,h_1)\bigr) = c(h)'\sqrt{n}(\hat d-d) + \sqrt{n}\bigl(\hat B(h,h_1,d)-B(h,h_1)\bigr) + o_p(1),$$

where $c(h)$ is the constant column vector given in (A.3). This decomposition can be rewritten as

$$\sqrt{n}\bigl(\hat B(h,h_1,\hat d)-B(h,h_1)\bigr) = \bigl(c(h)',\,1\bigr)\begin{pmatrix}\sqrt{n}(\hat d-d)\\ \sqrt{n}\bigl(\hat B(h,h_1,d)-B(h,h_1)\bigr)\end{pmatrix} + o_p(1).$$

Now note that in order to establish the asymptotic normality of $\sqrt{n}\bigl(\hat B(h,h_1,\hat d)-B(h,h_1)\bigr)$ it is enough to show that

$$\begin{pmatrix}\sqrt{n}(\hat d-d)\\ \sqrt{n}\bigl(\hat B(h,h_1,d)-B(h,h_1)\bigr)\end{pmatrix}$$

is asymptotically normal. Using the $\zeta$-notation introduced in the proof of Theorem 5, let

$$T = \sqrt{n}\left(\begin{pmatrix}\hat\zeta\\ \hat B(h,h_1,d)\end{pmatrix}-\begin{pmatrix}\zeta_0\\ B(h,h_1)\end{pmatrix}\right).$$

PAGE 82

As was done for $U$ in the proof of Theorem 5, we can write $T$ as

$$T = \begin{pmatrix}
\lambda_{1,1}\frac{1}{\sqrt{n_1}}\sum_{i=1}^{n_1}\varphi_{1,1}(\theta^{(1)}_i) + \cdots + \lambda_{1,k}\frac{1}{\sqrt{n_k}}\sum_{i=1}^{n_k}\varphi_{1,k}(\theta^{(k)}_i)\\
\vdots\\
\lambda_{k,1}\frac{1}{\sqrt{n_1}}\sum_{i=1}^{n_1}\varphi_{k,1}(\theta^{(1)}_i) + \cdots + \lambda_{k,k}\frac{1}{\sqrt{n_k}}\sum_{i=1}^{n_k}\varphi_{k,k}(\theta^{(k)}_i)\\
\sum_{l=1}^{k}a_l^{1/2}\frac{1}{\sqrt{n_l}}\sum_{i=1}^{n_l}\bigl(Y_{i,l}-E(Y_{1,l})\bigr)
\end{pmatrix} + o_p(1),$$

where the first $k$ components are the same as the first $k$ components of the vector $U$ in (A.36), and $Y_{i,l}$ is as defined in Chapter 2. By applying the Cramér-Wold device, we can conclude that the vector $T$ converges in distribution to a normal random variable with mean 0 and variance

$$Z = \begin{pmatrix}S_{11} & z\\ z' & \tau^2(h)\end{pmatrix},$$

in which $S_{11}$ is the $k\times k$ matrix given in (A.44), and $z$ is the $k\times 1$ vector given by

$$z = B^{+}y, \tag{A.49}$$

where

$$y_r = \sum_{l=1}^{k}A_l\,\mathrm{Cov}\bigl(p_r(\theta^{(l)}_1,\zeta_0),\,Y_{1,l}\bigr) + \sum_{l=1}^{k}A_l\sum_{g=1}^{\infty}\Bigl[\mathrm{Cov}\bigl(p_r(\theta^{(l)}_1,\zeta_0),\,Y_{1+g,l}\bigr)+\mathrm{Cov}\bigl(p_r(\theta^{(l)}_{1+g},\zeta_0),\,Y_{1,l}\bigr)\Bigr]$$

for $r=1,\dots,k$, with $B^{+}$ as in Theorem 5, and, as in (A.9) of Doss (2010),

$$\tau^2(h) = \sum_{l=1}^{k}a_l\Bigl[\mathrm{Var}(Y_{1,l})+2\sum_{g=1}^{\infty}\mathrm{Cov}(Y_{1,l},\,Y_{1+g,l})\Bigr].$$

PAGE 83

Now define the function $g:\mathbb{R}^{k+1}\to\mathbb{R}^{k}$ by

$$g\begin{pmatrix}\zeta\\ b\end{pmatrix} = \begin{pmatrix}
e^{\zeta_1-\zeta_2}A_2/A_1\\
e^{\zeta_1-\zeta_3}A_3/A_1\\
\vdots\\
e^{\zeta_1-\zeta_k}A_k/A_1\\
b
\end{pmatrix},$$

where $\zeta$ is a $k$-dimensional vector and $b$ is a real number. Applying the delta method to the previously established result that $T\xrightarrow{d}N(0,Z)$, we get

$$\sqrt{n}\left(\begin{pmatrix}\hat d\\ \hat B(h,h_1,d)\end{pmatrix}-\begin{pmatrix}d\\ B(h,h_1)\end{pmatrix}\right)\xrightarrow{d}N\left(0,\ \nabla g\begin{pmatrix}\zeta_0\\ B(h,h_1)\end{pmatrix}'Z\,\nabla g\begin{pmatrix}\zeta_0\\ B(h,h_1)\end{pmatrix}\right),$$

where

$$\nabla g\begin{pmatrix}\zeta\\ B(h,h_1)\end{pmatrix} = \begin{pmatrix}E & 0\\ 0' & 1\end{pmatrix} \tag{A.50}$$

with

$$E = \begin{pmatrix}
e^{\zeta_1-\zeta_2}\frac{A_2}{A_1} & e^{\zeta_1-\zeta_3}\frac{A_3}{A_1} & \cdots & e^{\zeta_1-\zeta_k}\frac{A_k}{A_1}\\
-e^{\zeta_1-\zeta_2}\frac{A_2}{A_1} & 0 & \cdots & 0\\
0 & -e^{\zeta_1-\zeta_3}\frac{A_3}{A_1} & \cdots & 0\\
\vdots & & \ddots & \vdots\\
0 & 0 & \cdots & -e^{\zeta_1-\zeta_k}\frac{A_k}{A_1}
\end{pmatrix} \tag{A.51}$$

and 0 in (A.50) representing the column vector of $k$ zeros.

Hence, we know that $\sqrt{n}\bigl(\hat B(h,h_1,\hat d)-B(h,h_1)\bigr)$ has an asymptotically normal distribution with mean 0 and variance

$$c(h)'\,\Sigma\,c(h) + \tau^2(h) + 2\,c(h)'E'z,$$

where $\Sigma$ denotes, as in the statement of Theorem 1, the asymptotic variance of $\sqrt{n}(\hat d-d)$, $E$ is given in (A.51), and $z$ in (A.49).
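The matrix $E$ in (A.51) is the Jacobian of the map $\zeta\mapsto(d_2,\dots,d_k)$ with $d_l=e^{\zeta_1-\zeta_l}A_l/A_1$. As a sanity check on its sign pattern, the following sketch (hypothetical helper names and made-up inputs, not taken from the dissertation) builds $E$ and compares it with forward finite differences.

```python
import math


def d_from_zeta(zeta, A):
    """Return (d_2, ..., d_k) with d_l = exp(zeta_1 - zeta_l) * A_l / A_1."""
    return [math.exp(zeta[0] - zeta[l]) * A[l] / A[0]
            for l in range(1, len(zeta))]


def jacobian_E(zeta, A):
    """k x (k-1) matrix whose (r, c) entry is the derivative of d_{c+2}
    with respect to zeta_{r+1} (0-based r and c)."""
    k = len(zeta)
    d = d_from_zeta(zeta, A)
    E = [[0.0] * (k - 1) for _ in range(k)]
    for c in range(k - 1):
        E[0][c] = d[c]        # first row: derivatives with respect to zeta_1
        E[c + 1][c] = -d[c]   # diagonal block: derivatives w.r.t. zeta_{c+2}
    return E


# Made-up inputs for illustration only.
zeta = [0.3, -0.1, 0.5]
A = [0.5, 0.3, 0.2]
E = jacobian_E(zeta, A)

# Forward finite-difference check of every entry.
eps = 1e-6
max_err = 0.0
base = d_from_zeta(zeta, A)
for r in range(len(zeta)):
    shifted = list(zeta)
    shifted[r] += eps
    bumped = d_from_zeta(shifted, A)
    for c in range(len(base)):
        numeric = (bumped[c] - base[c]) / eps
        max_err = max(max_err, abs(numeric - E[r][c]))
```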

PAGE 84

Proof of Theorem 6

Let

$$\hat J^{d,e}_{\beta,\beta^{[f]}} = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\left[\begin{pmatrix}Y^{[f]}_{i,l}\\ Y_{i,l}\end{pmatrix}-\begin{pmatrix}\sum_{j=1}^{k}\beta^{[f]}_j Z^{[f](j)}_{i,l}\\ \sum_{j=2}^{k}\beta_j Z^{(j)}_{i,l}\end{pmatrix}\right],$$

where the superscripts $d,e$ indicate the values of $d$ and $e$ used when computing the $Y$'s and $Z$'s, while the subscripts indicate the coefficients of the $Z$'s. With $\mu^{[f]}_l$ as in the proof of Theorem 4 we now write

$$\sqrt{n}\Bigl(\hat J^{\hat d,\hat e}_{\hat\beta(\hat d),\hat\beta^{[f]}(\hat d)}-\sum_{l=1}^{k}a_l\mu^{[f]}_l\Bigr) = \sqrt{n}\Bigl(\hat J^{\hat d,\hat e}_{\hat\beta(\hat d),\hat\beta^{[f]}(\hat d)}-\hat J^{d,e}_{\hat\beta(d),\hat\beta^{[f]}(d)}\Bigr) + \sqrt{n}\Bigl(\hat J^{d,e}_{\hat\beta(d),\hat\beta^{[f]}(d)}-\sum_{l=1}^{k}a_l\mu^{[f]}_l\Bigr). \tag{A.52}$$

Note that the second quantity on the right side of (A.52), which involves only known $d$ and $e$, was shown to be asymptotically normal with mean 0 and variance $\Sigma^{[f]}$ in the proof of Theorem 4 [see (A.32)]. Now let us expand the first term on the right side of (A.52) by writing

$$\sqrt{n}\Bigl(\hat J^{\hat d,\hat e}_{\hat\beta(\hat d),\hat\beta^{[f]}(\hat d)}-\hat J^{d,e}_{\hat\beta(d),\hat\beta^{[f]}(d)}\Bigr) = \sqrt{n}\Bigl(\hat J^{\hat d,\hat e}_{\hat\beta(\hat d),\hat\beta^{[f]}(\hat d)}-\hat J^{\hat d,\hat e}_{\beta_{\lim},\beta^{[f]}_{\lim}}\Bigr) + \sqrt{n}\Bigl(\hat J^{\hat d,\hat e}_{\beta_{\lim},\beta^{[f]}_{\lim}}-\hat J^{d,e}_{\beta_{\lim},\beta^{[f]}_{\lim}}\Bigr) + \sqrt{n}\Bigl(\hat J^{d,e}_{\beta_{\lim},\beta^{[f]}_{\lim}}-\hat J^{d,e}_{\hat\beta(d),\hat\beta^{[f]}(d)}\Bigr). \tag{A.53}$$

We next proceed as follows:

1. We note that the third term on the right side of (A.53) was shown to converge to 0 in probability in the proof of Theorem 4.
2. We show that the first term on the right side of (A.53) also converges to 0 in probability.
3. We show that the second term on the right side of (A.53) is asymptotically normal.

To deal with the second step, as in the proof of Theorem 1, first we show that $\hat\beta^{[f]}(d)$ and $\hat\beta^{[f]}(\hat d)$ converge in probability to the same limit, which we denoted in the proof of

PAGE 85

Theorem 4 by $\beta^{[f]}_{\lim}$. For fixed $j,j'\in\{1,\dots,k\}$, consider the function

$$G(u,v) = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\left[\frac{f(\theta^{(l)}_i)\,\nu_{h_j}(\theta^{(l)}_i)/u_j}{\sum_{s=1}^{k}a_s\nu_{h_s}(\theta^{(l)}_i)/u_s}-v_j\right]\left[\frac{f(\theta^{(l)}_i)\,\nu_{h_{j'}}(\theta^{(l)}_i)/u_{j'}}{\sum_{s=1}^{k}a_s\nu_{h_s}(\theta^{(l)}_i)/u_s}-v_{j'}\right],$$

where $u=(u_2,\dots,u_k)'$ with $u_l>0$ for $l=2,\dots,k$, and $v=(v_1,\dots,v_k)'$. Note that setting $u=d$ and $v=e$ gives

$$G(d,e) = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}Z^{[f](j)}_{i,l}\,Z^{[f](j')}_{i,l}.$$

By the Mean Value Theorem, we know that there exists a $(d^{*},e^{*})$ between $(\hat d,\hat e)$ and $(d,e)$ such that

$$G(\hat d,\hat e) = G(d,e) + \nabla G(d^{*},e^{*})'\left(\begin{pmatrix}\hat d\\ \hat e\end{pmatrix}-\begin{pmatrix}d\\ e\end{pmatrix}\right) = R^{[f]}_{j+1,j'+1} + \nabla G(d^{*},e^{*})'\left(\begin{pmatrix}\hat d\\ \hat e\end{pmatrix}-\begin{pmatrix}d\\ e\end{pmatrix}\right) + o_p(1).$$

As in previous proofs, with some calculations we can show that $\nabla G(d^{*},e^{*})=O_p(1)$. Therefore $G(\hat d,\hat e)\xrightarrow{p}R^{[f]}_{j+1,j'+1}$, and since $R^{[f]}$ is assumed invertible, we have

$$n\bigl(\hat Z^{[f]\prime}\hat Z^{[f]}\bigr)^{-1}\xrightarrow{p}\bigl(R^{[f]}\bigr)^{-1},$$

where $\hat Z^{[f]}$ is obtained from the matrix $Z^{[f]}$ in (A.23) by replacing $d$ and $e$ with $\hat d$ and $\hat e$. The same reasoning extends to the case where $j=0$ or $j'=0$. In a similar way, if we let $\hat Y^{[f]}$ denote the vector obtained from $Y^{[f]}$ in (A.24) by replacing $d$ with $\hat d$ and we recall that $v^{[f]}$ was defined to be the vector on the right side of (A.26), it can be proved that

$$\frac{\hat Z^{[f]\prime}\hat Y^{[f]}}{n}\xrightarrow{p}v^{[f]},$$

PAGE 86

which together with the previous result implies that $\hat\beta^{[f]}(d)$ and $\hat\beta^{[f]}(\hat d)$ converge in probability to the same limit. Also,

$$\sqrt{n}\Bigl(\hat J^{\hat d,\hat e}_{\hat\beta(\hat d),\hat\beta^{[f]}(\hat d)}-\hat J^{\hat d,\hat e}_{\beta_{\lim},\beta^{[f]}_{\lim}}\Bigr) = \begin{pmatrix}
n^{1/2}\sum_{j=1}^{k}\bigl(\beta^{[f]}_{j,\lim}-\hat\beta^{[f]}_j(\hat d)\bigr)\,\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\hat Z^{[f](j)}_{i,l}\\[4pt]
n^{1/2}\sum_{j=2}^{k}\bigl(\beta_{j,\lim}-\hat\beta_j(\hat d)\bigr)\,\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\hat Z^{(j)}_{i,l}
\end{pmatrix}$$

$$= \begin{pmatrix}
\sum_{j=1}^{k}\bigl(\beta^{[f]}_{j,\lim}-\hat\beta^{[f]}_j(\hat d)\bigr)\Bigl[\sum_{l=1}^{k}a_l\,n^{1/2}\,\dfrac{\sum_{i=1}^{n_l}\bigl(\hat Z^{[f](j)}_{i,l}-E(Z^{[f](j)}_{1,l})\bigr)}{n_l}\Bigr]\\[8pt]
\sum_{j=2}^{k}\bigl(\beta_{j,\lim}-\hat\beta_j(\hat d)\bigr)\Bigl[\sum_{l=1}^{k}a_l\,n^{1/2}\,\dfrac{\sum_{i=1}^{n_l}\bigl(\hat Z^{(j)}_{i,l}-E(Z^{(j)}_{1,l})\bigr)}{n_l}\Bigr]
\end{pmatrix}.$$

From the proof of Theorem 2 we already know that the second component of this last vector, denoted therein by $\sqrt{n}\bigl(\hat I^{\hat d}_{\hat\beta(\hat d)}-\hat I^{\hat d}_{\beta_{\lim}}\bigr)$, is $o_p(1)$. In an analogous manner, it can be shown that the first component is $o_p(1)$. Thus the whole vector is $o_p(1)$.

As for the middle term of the right side of (A.53), if we define

$$K^{[f]}(u,v) = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\left(\frac{f(\theta^{(l)}_i)\,\nu_h(\theta^{(l)}_i)}{\sum_{s=1}^{k}a_s\nu_{h_s}(\theta^{(l)}_i)/u_s}-\sum_{j=1}^{k}\beta^{[f]}_{j,\lim}\left(\frac{f(\theta^{(l)}_i)\,\nu_{h_j}(\theta^{(l)}_i)/u_j}{\sum_{s=1}^{k}a_s\nu_{h_s}(\theta^{(l)}_i)/u_s}-v_j\right)\right),$$

where $u=(u_2,\dots,u_k)'$ with $u_l>0$ for $l=2,\dots,k$ and $v=(v_1,v_2,\dots,v_k)'$, then

$$\sqrt{n}\Bigl(\hat J^{\hat d,\hat e}_{\beta_{\lim},\beta^{[f]}_{\lim}}-\hat J^{d,e}_{\beta_{\lim},\beta^{[f]}_{\lim}}\Bigr) = \begin{pmatrix}\sqrt{n}\bigl(K^{[f]}(\hat d,\hat e)-K^{[f]}(d,e)\bigr)\\ \sqrt{n}\bigl(K(\hat d)-K(d)\bigr)\end{pmatrix} \tag{A.54}$$

with $K$ defined as in the proof of Theorem 2. From this same proof, we know that

$$\sqrt{n}\bigl(K(\hat d)-K(d)\bigr) = \sqrt{q}\,w(h)'\sqrt{N}(\hat d-d) + o_p(1). \tag{A.55}$$

We will now show that, similarly,

$$\sqrt{n}\bigl(K^{[f]}(\hat d,\hat e)-K^{[f]}(d,e)\bigr) = \sqrt{q}\,w^{[f]}(h)'\sqrt{N}\left(\begin{pmatrix}\hat d\\ \hat e\end{pmatrix}-\begin{pmatrix}d\\ e\end{pmatrix}\right) + o_p(1)$$

PAGE 87

with $w^{[f]}(h)$ defined by (A.56) below. By Taylor series expansion, we get

$$\sqrt{n}\bigl(K^{[f]}(\hat d,\hat e)-K^{[f]}(d,e)\bigr) = \sqrt{n}\,\nabla K^{[f]}(d,e)'\begin{pmatrix}\hat d-d\\ \hat e-e\end{pmatrix} + \sqrt{n}\,\frac{1}{2}\begin{pmatrix}\hat d-d\\ \hat e-e\end{pmatrix}'\nabla^2 K^{[f]}(d^{*},e^{*})\begin{pmatrix}\hat d-d\\ \hat e-e\end{pmatrix},$$

where $(d^{*},e^{*})$ is between $(\hat d,\hat e)$ and $(d,e)$.

Below we compute the gradient $\nabla K^{[f]}(d,e)$ and show that it converges almost surely to a vector $w^{[f]}(h)$. Writing $\theta$ for $\theta^{(l)}_i$ inside the sums and $\Delta(\theta)=\sum_{s=1}^{k}a_s\nu_{h_s}(\theta)/d_s$, we have

$$\frac{\partial K^{[f]}}{\partial u_t}(d,e) = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l}\Biggl[\frac{f(\theta)\nu_h(\theta)\,a_t\nu_{h_t}(\theta)}{d_t^2\,\Delta(\theta)^2} - \sum_{j\neq t}\beta^{[f]}_{j,\lim}\frac{f(\theta)\nu_{h_j}(\theta)\,a_t\nu_{h_t}(\theta)}{d_j\,d_t^2\,\Delta(\theta)^2} + \beta^{[f]}_{t,\lim}\frac{f(\theta)\nu_{h_t}(\theta)}{d_t^2\,\Delta(\theta)} - \beta^{[f]}_{t,\lim}\frac{f(\theta)\nu_{h_t}(\theta)\,a_t\nu_{h_t}(\theta)}{d_t^3\,\Delta(\theta)^2}\Biggr]$$

$$\xrightarrow{a.s.}\ \frac{B(h,h_1)}{d_t^2}\int\frac{f(\theta)\,a_t\nu_{h_t}(\theta)}{\Delta(\theta)}\,\nu_{h,y}(\theta)\,d\theta - \sum_{j\neq t}\beta^{[f]}_{j,\lim}\int\frac{f(\theta)\,a_t\nu_{h_t}(\theta)}{d_t^2\,\Delta(\theta)}\,\nu_{h_j,y}(\theta)\,d\theta + \beta^{[f]}_{t,\lim}\frac{I^{[f]}(h_t)}{d_t} - \beta^{[f]}_{t,\lim}\int\frac{f(\theta)\,a_t\nu_{h_t}(\theta)}{d_t^2\,\Delta(\theta)}\,\nu_{h_t,y}(\theta)\,d\theta$$

$$= \frac{B(h,h_1)}{d_t^2}\int\frac{f(\theta)\,a_t\nu_{h_t}(\theta)}{\Delta(\theta)}\,\nu_{h,y}(\theta)\,d\theta - \sum_{j=1}^{k}\beta^{[f]}_{j,\lim}\int\frac{f(\theta)\,a_t\nu_{h_t}(\theta)}{d_t^2\,\Delta(\theta)}\,\nu_{h_j,y}(\theta)\,d\theta + \beta^{[f]}_{t,\lim}\frac{I^{[f]}(h_t)}{d_t} := \bigl[w^{[f]}(h)\bigr]_{t-1}\quad\text{for } t=2,\dots,k, \tag{A.56a}$$

PAGE 88

and

$$\frac{\partial K^{[f]}}{\partial v_t}(d,e) = \beta^{[f]}_{t,\lim} := \bigl[w^{[f]}(h)\bigr]_{k-1+t}\quad\text{for } t=1,\dots,k. \tag{A.56b}$$

Proceeding as we did in the proof of Theorem 2 when we showed that $\nabla^2K(d^{*})=O_p(1)$, we can show here that $\nabla^2K^{[f]}(d^{*},e^{*})$ is bounded in probability. Hence

$$\sqrt{n}\bigl(K^{[f]}(\hat d,\hat e)-K^{[f]}(d,e)\bigr) = \sqrt{q}\,w^{[f]}(h)'\sqrt{N}\left(\begin{pmatrix}\hat d\\ \hat e\end{pmatrix}-\begin{pmatrix}d\\ e\end{pmatrix}\right) + o_p(1),$$

and together with (A.55) and (A.54) this implies that

$$\sqrt{n}\Bigl(\hat J^{\hat d,\hat e}_{\beta_{\lim},\beta^{[f]}_{\lim}}-\hat J^{d,e}_{\beta_{\lim},\beta^{[f]}_{\lim}}\Bigr) = \begin{pmatrix}\sqrt{q}\,w^{[f]}(h)'\sqrt{N}\left(\begin{pmatrix}\hat d\\ \hat e\end{pmatrix}-\begin{pmatrix}d\\ e\end{pmatrix}\right)+o_p(1)\\[6pt] \sqrt{q}\,w(h)'\sqrt{N}(\hat d-d)+o_p(1)\end{pmatrix} = \sqrt{q}\begin{pmatrix}w^{[f]}(h)'\\ w_0(h)'\end{pmatrix}\sqrt{N}\left(\begin{pmatrix}\hat d\\ \hat e\end{pmatrix}-\begin{pmatrix}d\\ e\end{pmatrix}\right) + o_p(1),$$

where $w_0(h)$ is the column vector obtained from $w(h)$ by concatenating $k$ zeros at its end. Now returning to (A.52) and (A.53) we get

$$\sqrt{n}\Bigl(\hat J^{\hat d,\hat e}_{\hat\beta(\hat d),\hat\beta^{[f]}(\hat d)}-\sum_{l=1}^{k}a_l\mu^{[f]}_l\Bigr) = \sqrt{q}\begin{pmatrix}w^{[f]}(h)'\\ w_0(h)'\end{pmatrix}\sqrt{N}\left(\begin{pmatrix}\hat d\\ \hat e\end{pmatrix}-\begin{pmatrix}d\\ e\end{pmatrix}\right) + \sqrt{n}\Bigl(\hat J^{d,e}_{\hat\beta(d),\hat\beta^{[f]}(d)}-\sum_{l=1}^{k}a_l\mu^{[f]}_l\Bigr) + o_p(1)$$

$$\xrightarrow{d}\ N\left(0,\ q\begin{pmatrix}w^{[f]}(h)'\\ w_0(h)'\end{pmatrix}V\bigl(w^{[f]}(h),\,w_0(h)\bigr)+\Sigma^{[f]}\right).$$

We can now apply the delta method with the function $g(u,v)=u/v$ to get our result

$$\sqrt{n}\Bigl(\hat I^{\hat d,\hat e}_{\hat\beta(\hat d),\hat\beta^{[f]}(\hat d)}-I^{[f]}(h)\Bigr)\xrightarrow{d}N\bigl(0,\rho(h)\bigr),$$

PAGE 89

where

$$\rho(h) = \nabla g\bigl(I^{[f]}(h)B(h,h_1),\,B(h,h_1)\bigr)'\left[q\begin{pmatrix}w^{[f]}(h)'\\ w_0(h)'\end{pmatrix}V\bigl(w^{[f]}(h),\,w_0(h)\bigr)+\Sigma^{[f]}\right]\nabla g\bigl(I^{[f]}(h)B(h,h_1),\,B(h,h_1)\bigr) \tag{A.57}$$

with $\nabla g(u,v)=(1/v,\,-u/v^2)'$.

PAGE 90

APPENDIX B
DETAILS REGARDING GENERATION OF THE MARKOV CHAIN FROM CHAPTER 5

To generate a Markov chain of length $n$ on $\theta=(\gamma,\sigma^2,\beta_0,\beta_\gamma)$ for a fixed choice of the hyperparameter $h=(w,g)$, we use the following sampling scheme. First, we pick an arbitrary value for $\gamma^{(0)}$. Then we draw $\sigma^{2(0)}$, $\beta_0^{(0)}$, and $\beta_\gamma^{(0)}$ as indicated in Steps 2-4 below with $i=0$. To generate the rest of the chain, we iterate through Steps 1-4 described below for each $i=1,\dots,n-1$.

Step 1 In this stage we generate the binary vector $\gamma^{(i)}$ by using a Gibbs sampler on $\gamma=(\gamma_1,\gamma_2,\dots,\gamma_q)$. Thus, we first generate $\gamma_1^{(i)}\mid\gamma_2^{(i-1)},\dots,\gamma_q^{(i-1)},Y$ according to the following Bernoulli distribution:

$$p(\gamma_1\mid\gamma_j,\,j\neq 1,\,Y)\ \propto\ p(\gamma\mid Y)\ \propto\ (1+g)^{-q_\gamma/2}\,S^{-(m-1)}\,\bigl[1+g(1-R_\gamma^2)\bigr]^{-(m-1)/2}\,w^{q_\gamma}(1-w)^{q-q_\gamma}, \tag{B.1}$$

where, recall that $S^2=\sum_{j=1}^{m}(Y_j-\bar Y)^2$, and $R_\gamma^2$ is the coefficient of determination of model $\gamma$; see Chapter 5. Similarly, generate $\gamma_2^{(i)}$ from $p\bigl(\gamma_2^{(i)}\mid\gamma_1^{(i)},\gamma_3^{(i-1)},\dots,\gamma_q^{(i-1)},Y\bigr)$ and so on for $\gamma_3^{(i)},\dots,\gamma_q^{(i)}$. This Gibbs sampler is not identical to that of Smith and Kohn (1996) in that in our model the prior on $\beta_0$ is a flat prior, whereas Smith and Kohn (1996) use a proper prior on $\beta_0$.

Step 2 Generate $\sigma^{2(i)}\mid\gamma^{(i)},Y$ according to the density

$$p(\sigma^2\mid\gamma,Y)\ \propto\ p(Y\mid\gamma,\sigma^2)\,p(\sigma^2)\ \propto\ \Bigl[\iint p(Y\mid\gamma,\sigma^2,\beta_0,\beta_\gamma)\,p(\beta_0)\,p(\beta_\gamma\mid\gamma,\sigma^2)\,d\beta_0\,d\beta_\gamma\Bigr]p(\sigma^2)$$

$$\propto\ (\sigma^2)^{-(m+1)/2}\exp\Bigl\{-\frac{S^2}{2\sigma^2(g+1)}\bigl[1+g(1-R_\gamma^2)\bigr]\Bigr\},$$

an inverse gamma density.

PAGE 91

To see why the last relationship statement is true, we first consider the integral with respect to $\beta_0$. We have

$$\int p(Y\mid\gamma,\sigma^2,\beta_0,\beta_\gamma)\,p(\beta_0)\,d\beta_0\ \propto\ \int(\sigma^2)^{-m/2}\exp\Bigl[-\frac{1}{2\sigma^2}(Y-1_m\beta_0-X_\gamma\beta_\gamma)'(Y-1_m\beta_0-X_\gamma\beta_\gamma)\Bigr]d\beta_0$$

$$\propto\ (\sigma^2)^{-m/2}\exp\Bigl[-\frac{1}{2\sigma^2}(Y-X_\gamma\beta_\gamma)'(Y-X_\gamma\beta_\gamma)\Bigr]\int\exp\Bigl[-\frac{m}{2\sigma^2}\bigl(\beta_0^2-2\beta_0\bar Y\bigr)\Bigr]d\beta_0$$

$$\propto\ (\sigma^2)^{-(m-1)/2}\exp\Bigl[-\frac{1}{2\sigma^2}(Y-X_\gamma\beta_\gamma)'(Y-X_\gamma\beta_\gamma)\Bigr]\exp\Bigl(\frac{m\bar Y^2}{2\sigma^2}\Bigr).$$

So we may now write

$$p(\sigma^2\mid\gamma,Y)\ \propto\ (\sigma^2)^{-(m+1)/2}\exp\Bigl(\frac{m\bar Y^2}{2\sigma^2}\Bigr)\int\exp\Bigl[-\frac{1}{2\sigma^2}(Y-X_\gamma\beta_\gamma)'(Y-X_\gamma\beta_\gamma)\Bigr]p(\beta_\gamma\mid\gamma,\sigma^2)\,d\beta_\gamma$$

$$\propto\ (\sigma^2)^{-(m+1+q_\gamma)/2}\exp\Bigl(-\frac{S^2}{2\sigma^2}\Bigr)\int\exp\Bigl[-\frac{1}{2\sigma^2}\frac{g+1}{g}\,\beta_\gamma'X_\gamma'X_\gamma\beta_\gamma+\frac{1}{\sigma^2}Y'X_\gamma\beta_\gamma\Bigr]d\beta_\gamma$$

$$\propto\ (\sigma^2)^{-(m+1+q_\gamma)/2}\exp\Bigl(-\frac{S^2}{2\sigma^2}\Bigr)(\sigma^2)^{q_\gamma/2}\exp\Bigl(\frac{g}{2\sigma^2(g+1)}\,Y'X_\gamma(X_\gamma'X_\gamma)^{-1}X_\gamma'Y\Bigr)$$

$$\propto\ (\sigma^2)^{-(m+1)/2}\exp\Bigl\{-\frac{S^2}{2\sigma^2(g+1)}\bigl[1+g(1-R_\gamma^2)\bigr]\Bigr\},$$

where the next-to-last proportionality relation results from using the formula

$$\int\exp\Bigl(-\frac{1}{2}\beta'W^{-1}\beta+a'\beta\Bigr)d\beta = (2\pi)^{q/2}\,|W|^{1/2}\exp\bigl(a'Wa/2\bigr),$$

which can be shown to hold for any vector $a$ of length $q$ and any positive definite matrix $W$ by using a completing the squares argument. In practice, we use the distributional relationship

$$\frac{S^2\bigl[1+g(1-R_\gamma^2)\bigr]}{\sigma^2(g+1)}\sim\chi^2_{m-1}$$

to draw $\sigma^2$.

PAGE 92

Step 3 Generate $\beta_0^{(i)}\mid\gamma^{(i)},\sigma^{2(i)},Y$ according to the density

$$p(\beta_0\mid\gamma,\sigma^2,Y)\ \propto\ \Bigl[\int p(Y\mid\gamma,\sigma^2,\beta_0,\beta_\gamma)\,p(\beta_\gamma\mid\gamma,\sigma^2)\,d\beta_\gamma\Bigr]p(\beta_0) \tag{B.2a}$$

$$\propto\ \exp\Bigl[-\frac{1}{2\sigma^2}(Y-1_m\beta_0)'(Y-1_m\beta_0)\Bigr], \tag{B.2b}$$

which is the $N(\bar Y,\sigma^2/m)$ density. Note that (B.2b) follows from (B.2a) because $1_m'X_\gamma=0$, since the columns of $X_\gamma$ are centered.

Step 4 Generate $\beta_\gamma^{(i)}\mid\gamma^{(i)},\sigma^{2(i)},\beta_0^{(i)},Y$ according to the density

$$p(\beta_\gamma\mid\gamma,\sigma^2,\beta_0,Y)\ \propto\ p(Y\mid\gamma,\sigma^2,\beta_0,\beta_\gamma)\,p(\beta_\gamma\mid\gamma,\sigma^2)$$

$$\propto\ \exp\Bigl\{-\frac{1}{2\sigma^2}\Bigl[(Y-1_m\beta_0-X_\gamma\beta_\gamma)'(Y-1_m\beta_0-X_\gamma\beta_\gamma)+\frac{\beta_\gamma'X_\gamma'X_\gamma\beta_\gamma}{g}\Bigr]\Bigr\},$$

which can be shown to be a $q_\gamma$-dimensional normal with mean and covariance matrix given respectively by

$$\mu=\frac{g}{g+1}\hat\beta_\gamma\quad\text{and}\quad\Sigma=\frac{g}{g+1}\,\sigma^2\,(X_\gamma'X_\gamma)^{-1},$$

where $\hat\beta_\gamma=(X_\gamma'X_\gamma)^{-1}X_\gamma'Y$ is the usual least squares estimate for model $\gamma$.

We now discuss the computational effort needed to implement our sampler. Consider generating the first component of $\gamma$. As seen in Step 1, the conditional distribution for this component is Bernoulli with success probability

$$p(\gamma_1=1\mid\gamma_j,\,j\neq 1,\,Y) = \frac{p\bigl((1,\gamma_2,\dots,\gamma_q)\mid Y\bigr)}{p\bigl((1,\gamma_2,\dots,\gamma_q)\mid Y\bigr)+p\bigl((0,\gamma_2,\dots,\gamma_q)\mid Y\bigr)},$$

with the expression for $p(\gamma\mid Y)$ given by (B.1). The other components of $\gamma$ can be in turn similarly generated, and then the other components of $\theta$ can be generated according to the conditional distributions from Steps 2-4. The main computational burden is in (i) forming $R_\gamma^2$, (ii) forming $\hat\beta_\gamma$, and (iii) generating from $N\bigl(c_1\hat\beta_\gamma,\,c_2(X_\gamma'X_\gamma)^{-1}\bigr)$, where $c_1$ and $c_2$ are constants. All of these ostensibly require calculation of $(X_\gamma'X_\gamma)^{-1}$, for which $O(q^3)$

PAGE 93

operations are required. In fact, (i) and (ii) require only $\hat\beta_\gamma$, which can be calculated by solving $X_\gamma'X_\gamma\hat\beta_\gamma=X_\gamma'Y$, requiring only $O(q^2)$ operations. Now the essence of (iii) is generating from a $N\bigl(0,(X_\gamma'X_\gamma)^{-1}\bigr)$ distribution, and to do this we do not need to form $(X_\gamma'X_\gamma)^{-1}$. We need only express $X_\gamma'X_\gamma=U'U$, where $U$ is upper triangular. For if $Z\sim N(0,I_q)$, then $U^{-1}Z\sim N\bigl(0,(X_\gamma'X_\gamma)^{-1}\bigr)$, and $U^{-1}Z$ is obtained without calculating $U^{-1}$, simply by solving for $x$ in the equation $Ux=Z$, which requires only $O(q^2)$ operations, since $U$ is upper triangular. We note that, if we start with $X_\gamma$, finding the factorization $X_\gamma'X_\gamma=U'U$ requires $O(q^2)$ operations.

Now if $\gamma$ and $\gamma'$ differ in a single component (this is the case, for example, when cycling through Step 1 of the algorithm), then a factorization of $X_{\gamma'}'X_{\gamma'}$ can be obtained from the factorization of $X_\gamma'X_\gamma$ very efficiently: there are well-known methods for updating the fit and related quantities of a linear regression model when a predictor is added to or dropped from the model. These rely on fast updates of QR, Cholesky, or singular value decompositions when the design matrix is changed by the addition or deletion of a column. Smith and Kohn (1996) rely on fast updating of the Cholesky decomposition of $X_\gamma'X_\gamma$ in order to update $R_\gamma^2$. Our Markov chain is more involved than that of Smith and Kohn (1996) since our chain runs on $\theta=(\gamma,\sigma^2,\beta_0,\beta_\gamma)$. Our implementation uses the sweep operator, a well-known method for updating a linear fit, because this provides all the quantities needed for our chain in one shot. We now describe this in more detail.

We first define the sweep operator. Let $T$ be a symmetric matrix. The sweep of $T$ on its $k$th diagonal entry $t_{kk}\neq 0$ is the symmetric matrix $S$ with

$$s_{kk}=-\frac{1}{t_{kk}},\qquad s_{ik}=\frac{t_{ik}}{t_{kk}},\qquad s_{kj}=\frac{t_{kj}}{t_{kk}},\qquad s_{ij}=t_{ij}-\frac{t_{ik}t_{kj}}{t_{kk}}$$

for $i\neq k$ and $j\neq k$. The sweep operator has an obvious inverse operator.
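Returning to points (i)-(iii) above, the draws in Steps 2-4 for a fixed model $\gamma$ can be sketched with a Cholesky factor and triangular solves, so that $(X_\gamma'X_\gamma)^{-1}$ is never formed. This is an illustrative sketch under stated assumptions, not the dissertation's implementation (which uses the sweep operator): the function names, the simulated data, and the value of `g` are all hypothetical, and `X` is assumed to hold the centered columns of the predictors in $\gamma$.

```python
import math
import random


def cholesky_upper(M):
    """Upper-triangular U with U'U = M, for symmetric positive definite M."""
    q = len(M)
    U = [[0.0] * q for _ in range(q)]
    for i in range(q):
        for j in range(i, q):
            s = M[i][j] - sum(U[r][i] * U[r][j] for r in range(i))
            U[i][j] = math.sqrt(s) if i == j else s / U[i][i]
    return U


def solve_upper(U, b):
    """Solve U x = b by back substitution."""
    q = len(b)
    x = [0.0] * q
    for i in reversed(range(q)):
        x[i] = (b[i] - sum(U[i][j] * x[j] for j in range(i + 1, q))) / U[i][i]
    return x


def solve_upper_transposed(U, b):
    """Solve U' x = b by forward substitution."""
    q = len(b)
    x = [0.0] * q
    for i in range(q):
        x[i] = (b[i] - sum(U[j][i] * x[j] for j in range(i))) / U[i][i]
    return x


def draw_steps_2_to_4(X, Y, g, rng):
    """One draw of (sigma^2, beta_0, beta_gamma) given the model's centered X."""
    m, q = len(Y), len(X[0])
    ybar = sum(Y) / m
    S2 = sum((y - ybar) ** 2 for y in Y)
    XtX = [[sum(X[r][a] * X[r][b] for r in range(m)) for b in range(q)]
           for a in range(q)]
    XtY = [sum(X[r][a] * Y[r] for r in range(m)) for a in range(q)]
    U = cholesky_upper(XtX)
    beta_hat = solve_upper(U, solve_upper_transposed(U, XtY))  # least squares
    R2 = sum(a * b for a, b in zip(XtY, beta_hat)) / S2
    # Step 2: S^2 [1 + g(1 - R^2)] / (sigma^2 (g + 1)) ~ chi-square(m - 1).
    chi2 = rng.gammavariate((m - 1) / 2.0, 2.0)
    sigma2 = S2 * (1.0 + g * (1.0 - R2)) / ((g + 1.0) * chi2)
    # Step 3: beta_0 ~ N(ybar, sigma^2 / m).
    beta0 = rng.gauss(ybar, math.sqrt(sigma2 / m))
    # Step 4: solve U x = Z so that x ~ N(0, (X'X)^{-1}), then shift and scale.
    noise = solve_upper(U, [rng.gauss(0.0, 1.0) for _ in range(q)])
    shrink = g / (g + 1.0)
    beta = [shrink * bh + math.sqrt(shrink * sigma2) * e
            for bh, e in zip(beta_hat, noise)]
    return sigma2, beta0, beta, beta_hat, R2


# Simulated data, for illustration only.
rng = random.Random(1)
m = 50
raw = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(m)]
means = [sum(r[a] for r in raw) / m for a in range(2)]
X = [[r[a] - means[a] for a in range(2)] for r in raw]
Y = [1.0 + 2.0 * x[0] - 1.0 * x[1] + rng.gauss(0, 0.1) for x in X]
sigma2, beta0, beta, beta_hat, R2 = draw_steps_2_to_4(X, Y, g=100.0, rng=rng)
```

The chi-square variate is generated as a gamma variate with shape $(m-1)/2$ and scale 2, which is a standard identity rather than something stated in the text.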
If we apply the sweep operator to the matrix

$$T = \begin{pmatrix}X'X & X'Y\\ Y'X & Y'Y\end{pmatrix} \tag{B.3}$$

PAGE 94

on all 1 through $q$ diagonal entries, then we obtain the matrix

$$S = \begin{pmatrix}-(X'X)^{-1} & (X'X)^{-1}X'Y\\ Y'X(X'X)^{-1} & Y'Y-Y'X(X'X)^{-1}X'Y\end{pmatrix}.$$

If we sweep the augmented matrix $T$ defined in (B.3) on the diagonal entries corresponding to the covariates in $\gamma$, then from the resulting matrix $S_\gamma$ we can obtain all the important quantities needed by Steps 1-4: $(X_\gamma'X_\gamma)^{-1}$ is the negative of the submatrix of $S_\gamma$ corresponding to rows and columns in $\gamma$; $(X_\gamma'X_\gamma)^{-1}X_\gamma'Y$ is the submatrix of $S_\gamma$ corresponding to rows in $\gamma$ and column $q+1$; and $Y'X_\gamma(X_\gamma'X_\gamma)^{-1}X_\gamma'Y$ can be obtained by subtracting the $(q+1,q+1)$ element of $S_\gamma$ from $Y'Y$ (with other methods, we may need to compute the last three quantities separately).

To illustrate the use of this operator, suppose that we have already swept $T$ over the covariates in $\gamma=(0,\gamma_2,\dots,\gamma_q)$. Then we only need to perform one sweep on the first diagonal entry to get the swept matrix corresponding to the first predictor being added to the previous model, $\gamma'=(1,\gamma_2,\dots,\gamma_q)$. Conversely, since the sweep operator has an inverse, the latter matrix could be unswept over the first diagonal entry to get the swept matrix corresponding to dropping the first predictor.
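The sweep operator as defined above can be sketched in a few lines. The code below is an illustration, not the author's implementation: `sweep` follows the stated definition entry by entry, while `unsweep` is the inverse derived from that definition (the text only notes that an inverse exists), and the small matrix `T` uses made-up values for $X'X$, $X'Y$, and $Y'Y$.

```python
def sweep(T, k):
    """Sweep the symmetric matrix T (nested lists) on diagonal entry k (0-based)."""
    n = len(T)
    t_kk = T[k][k]
    if t_kk == 0:
        raise ZeroDivisionError("cannot sweep on a zero diagonal entry")
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                S[i][j] = -1.0 / t_kk
            elif i == k or j == k:
                S[i][j] = T[i][j] / t_kk
            else:
                S[i][j] = T[i][j] - T[i][k] * T[k][j] / t_kk
    return S


def unsweep(S, k):
    """Undo sweep(., k); derived from the definition, not quoted from the text."""
    n = len(S)
    s_kk = S[k][k]
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                T[i][j] = -1.0 / s_kk
            elif i == k or j == k:
                T[i][j] = -S[i][j] / s_kk
            else:
                T[i][j] = S[i][j] - S[i][k] * S[k][j] / s_kk
    return T


# Made-up augmented matrix with X'X = [[4, 2], [2, 3]], X'Y = (2, 1), Y'Y = 10.
T = [[4.0, 2.0, 2.0],
     [2.0, 3.0, 1.0],
     [2.0, 1.0, 10.0]]

swept_first = sweep(T, 0)              # model containing only predictor 1
swept_both = sweep(swept_first, 1)     # add predictor 2 with one more sweep

beta_hat = [swept_both[0][2], swept_both[1][2]]   # (X'X)^{-1} X'Y
residual_ss = swept_both[2][2]                    # Y'Y - Y'X (X'X)^{-1} X'Y
dropped_again = unsweep(swept_both, 1)            # back to predictor 1 only
```

Each sweep costs on the order of the square of the matrix dimension, which is what makes adding or dropping a single predictor cheap relative to refitting from scratch.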

PAGE 95

APPENDIX C
PROOF OF THE UNIFORM ERGODICITY AND DEVELOPMENT OF THE MINORIZATION CONDITION FROM CHAPTER 5

Proof of Proposition 1

Let $\nu_y$ and $p(\gamma\mid Y)$ denote the posterior distribution of $\theta$ and $\gamma$, respectively, under the prior $\nu$ on $\theta$ (we are suppressing the subscript $h$, since the hyperparameter is fixed throughout). We use $K^n$ and $V^n$ to denote the $n$-step Markov transition functions for the $\theta$ and $\gamma$ chains, respectively. Also, letting $\mu$ denote the product of counting measure on $\{0,1\}^q$ and Lebesgue measure on $(0,\infty)\times\mathbb{R}^{q+1}$, we use $k^n$ to denote the density of $K^n$ with respect to $\mu$ and $v^n$ to denote the probability mass function of $V^n$. We now show that the $\theta$-chain and the $\gamma$-chain converge to their corresponding posterior distributions at exactly the same rate. For any starting state $\theta_0$ and $n\in\mathbb{N}$, we have

$$\|K^n(\theta_0,\cdot)-\nu_y\| = \sup_A|K^n(\theta_0,A)-\nu_y(A)| = \frac{1}{2}\int|k^n(\theta_0,\theta)-\nu_y(\theta)|\,d\mu(\theta)$$

$$= \frac{1}{2}\int\bigl|v^n(\gamma_0,\gamma)\,p(\sigma^2,\beta_0,\beta_\gamma\mid\gamma,Y)-p(\gamma\mid Y)\,p(\sigma^2,\beta_0,\beta_\gamma\mid\gamma,Y)\bigr|\,d\mu(\theta)$$

$$= \frac{1}{2}\sum_{\gamma\in\Gamma}\bigl|v^n(\gamma_0,\gamma)-p(\gamma\mid Y)\bigr|\iiint p(\sigma^2,\beta_0,\beta_\gamma\mid\gamma,Y)\,d\sigma^2\,d\beta_0\,d\beta_\gamma$$

$$= \frac{1}{2}\sum_{\gamma\in\Gamma}\bigl|v^n(\gamma_0,\gamma)-p(\gamma\mid Y)\bigr| = \sup_B\bigl|V^n(\gamma_0,B)-p(\gamma\in B\mid Y)\bigr|.$$

Hence, the $\theta$ chain inherits the convergence rate of its $\gamma$-subchain, a uniformly ergodic Gibbs sampler on a finite state space.

Description of the Regeneration Scheme

For regeneration purposes, it is enough to restrict our attention to the Markov chain that runs on $\gamma$. This is because, as we will see later, whenever this subchain regenerates, the augmented chain that produces draws from the posterior distribution of $\theta=(\gamma,\sigma^2,\beta_0,\beta_\gamma)$ also regenerates. We will find a function $s:\Gamma\to[0,1)$ and a probability


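mass function $d$ on $\Gamma$ such that $v(\gamma, \gamma')$ satisfies the minorization condition

$$v(\gamma, \gamma') \ge s(\gamma) \, d(\gamma') \quad \text{for all } \gamma, \gamma' \in \Gamma. \tag{C.1}$$

We proceed via the distinguished point technique introduced in Mykland et al. (1995). Let $\gamma^*$ denote a fixed model, which we will refer to as a distinguished model, and let $D \subset \Gamma$ be a set of models. The model $\gamma^*$ and the set $D$ are arbitrary, but below we give guidelines for making a practical choice of $\gamma^*$ and $D$. For all $\gamma, \gamma' \in \Gamma$ we have

$$v(\gamma, \gamma') = \frac{v(\gamma, \gamma')}{v(\gamma^*, \gamma')} \, v(\gamma^*, \gamma') \ge \left[ \min_{\gamma'' \in D} \frac{v(\gamma, \gamma'')}{v(\gamma^*, \gamma'')} \right] v(\gamma^*, \gamma') \, I(\gamma' \in D).$$

If we let $c$ denote the normalizing constant for $v(\gamma^*, \gamma') \, I(\gamma' \in D)$, that is, $c = \sum_{\gamma' \in D} v(\gamma^*, \gamma')$, then we get

$$v(\gamma, \gamma') \ge c \left[ \min_{\gamma'' \in D} \frac{v(\gamma, \gamma'')}{v(\gamma^*, \gamma'')} \right] \frac{v(\gamma^*, \gamma') \, I(\gamma' \in D)}{c} = s(\gamma) \, d(\gamma'),$$

where

$$s(\gamma) = c \min_{\gamma'' \in D} \frac{v(\gamma, \gamma'')}{v(\gamma^*, \gamma'')} \quad \text{and} \quad d(\gamma') = \frac{v(\gamma^*, \gamma') \, I(\gamma' \in D)}{c}.$$

Evaluating both $s$ and $d$ requires computing transition probabilities of the form $v(\gamma, \gamma')$ which, due to the fact that the Markov chain on $\gamma$ is a Gibbs sampler, can be expressed as

$$v(\gamma, \gamma') = p(\gamma'_1 \mid \gamma_2, \gamma_3, \ldots, \gamma_q, Y) \, p(\gamma'_2 \mid \gamma'_1, \gamma_3, \ldots, \gamma_q, Y) \cdots p(\gamma'_q \mid \gamma'_1, \gamma'_2, \ldots, \gamma'_{q-1}, Y),$$

where the formula for the right side terms is given by (B.1). Since the $\gamma$'s in the terms on the right side differ in at most one component, the fast updating techniques discussed in Appendix A can be applied here too to speed up the computations.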
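The construction of $s$ and $d$ above can be checked numerically on a toy model space. The sketch below is illustrative only: the unnormalized posterior weights are random numbers standing in for the values given by (B.1), the choice of the distinguished model and of $D$ is arbitrary, and the function names `gibbs_transition_matrix` and `minorization` are hypothetical.

```python
import itertools
import numpy as np

def gibbs_transition_matrix(weights, q):
    """Transition matrix of a systematic-scan Gibbs sampler on {0,1}^q.

    weights[i] is the unnormalized posterior probability of the i-th model,
    with models ordered as in itertools.product([0, 1], repeat=q).
    """
    models = list(itertools.product([0, 1], repeat=q))
    index = {m: i for i, m in enumerate(models)}
    V = np.zeros((len(models), len(models)))
    for g in models:
        for g_new in models:
            prob = 1.0
            current = list(g)
            for j in range(q):
                with_one, with_zero = current.copy(), current.copy()
                with_one[j], with_zero[j] = 1, 0
                w1 = weights[index[tuple(with_one)]]
                w0 = weights[index[tuple(with_zero)]]
                p_one = w1 / (w0 + w1)          # p(gamma_j = 1 | all other components)
                prob *= p_one if g_new[j] == 1 else 1.0 - p_one
                current[j] = g_new[j]           # component j has now been updated
            V[index[g], index[g_new]] = prob
    return models, V

def minorization(V, star, D):
    """Return (s, d) of the distinguished point construction for model `star` and set D."""
    D = np.asarray(D)
    c = V[star, D].sum()                                  # normalizing constant
    s = c * np.min(V[:, D] / V[star, D], axis=1)          # s(gamma)
    d = np.zeros(V.shape[0])
    d[D] = V[star, D] / c                                 # d(gamma'), supported on D
    return s, d

rng = np.random.default_rng(0)
q = 3
weights = rng.uniform(0.5, 2.0, size=2 ** q)   # toy stand-in for the posterior
models, V = gibbs_transition_matrix(weights, q)
star, D = 0, [0, 3, 5]                          # arbitrary choices for illustration
s, d = minorization(V, star, D)

print(np.all(V >= np.outer(s, d) - 1e-12))      # minorization condition (C.1)
print(np.isclose(d.sum(), 1.0))                 # d is a probability mass function
```

In a real application the model space is too large to enumerate, so $s$ and $d$ would be evaluated only at the states the chain actually visits.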


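By 2, $P(\delta_i = 1 \mid \gamma_i, \gamma_{i+1}) = s(\gamma_i) \, d(\gamma_{i+1}) / v(\gamma_i, \gamma_{i+1})$, and the normalizing constant $c$ cancels in the numerator, so in practice there is no need to compute it, and the success probability of the regeneration indicator is simply

$$P(\delta_i = 1 \mid \gamma_i, \gamma_{i+1}) = \left[ \min_{\gamma'' \in D} \frac{v(\gamma_i, \gamma'')}{v(\gamma^*, \gamma'')} \right] \frac{v(\gamma^*, \gamma_{i+1}) \, I(\gamma_{i+1} \in D)}{v(\gamma_i, \gamma_{i+1})}. \tag{C.2}$$

The choice of $\gamma^*$ and $D$ affects the regeneration rate. Ideally we would like the regeneration probability to be as big as possible. Notice that regeneration can occur only if the new state is in $D$. This suggests making $D$ large. However, increasing the size of $D$ makes the first term in brackets in (C.2) smaller. We have found that a reasonable tradeoff consists of taking $D$ to be the smallest set of models that encompasses 25% of the posterior probability. Also, the obvious choice for $\gamma^*$ is the HPM model. The distinguished model $\gamma^*$ and the set $D$ are selected from the output of an initial chain.

For the all-inclusive chain that runs not only on the model space but also on the space of error variance and model coefficients, we can obtain a minorization condition if we multiply the condition in (C.1) on both sides by

$$p(\sigma'^2 \mid \gamma', Y) \, p(\beta'_0 \mid \gamma', \sigma'^2, Y) \, p(\beta'_{\gamma'} \mid \gamma', \sigma'^2, \beta'_0, Y).$$

This yields

$$k(\theta, \theta') \ge s_1(\theta) \, d_1(\theta') \quad \text{for all } \theta = (\gamma, \sigma, \beta_0, \beta_\gamma), \ \theta' = (\gamma', \sigma', \beta'_0, \beta'_{\gamma'}), \tag{C.3}$$

where

$$s_1(\theta) = s(\gamma) \quad \text{and} \quad d_1(\theta') = d(\gamma') \, p(\sigma'^2 \mid \gamma', Y) \, p(\beta'_0 \mid \gamma', \sigma'^2, Y) \, p(\beta'_{\gamma'} \mid \gamma', \sigma'^2, \beta'_0, Y).$$

Hence for this bigger chain the regeneration indicator has, according to (2-11), success probability

$$P(\delta_i = 1 \mid \theta_i, \theta_{i+1}) = \frac{s_1(\theta_i) \, d_1(\theta_{i+1})}{k(\theta_i, \theta_{i+1})} = \frac{s(\gamma_i) \, d(\gamma_{i+1})}{v(\gamma_i, \gamma_{i+1})},$$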


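which is exactly the same as the regeneration success probability for the chain on $\Gamma$. Hence, the augmented $\theta$-chain and the $\gamma$-chain regenerate simultaneously. Note that sampling from $d_1$ (which is needed to start the regeneration) is trivial. We first sample $\gamma$ from $d(\cdot)$, which is done by sampling $\gamma$ from $v(\gamma^*, \cdot)$ and retaining $\gamma$ only if it is in $D$ (and to do this we do not need to know the normalizing constant $c$); then we sequentially sample $\sigma^2$, $\beta_0$, and $\beta_\gamma$ from $p(\sigma^2 \mid \gamma, Y)$, $p(\beta_0 \mid \gamma, \sigma^2, Y)$, and $p(\beta_\gamma \mid \gamma, \sigma^2, \beta_0, Y)$, respectively.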
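The regeneration scheme for the chain on the model space can be sketched as follows. This is a hedged illustration rather than the dissertation's implementation: a small random transition matrix stands in for the Gibbs transition function $v$, its stationary distribution stands in for $p(\gamma \mid Y)$ (in practice both the distinguished model and $D$ would be estimated from an initial chain), and the names `regen_prob` and `sample_from_d` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the transition function v on a small model space.
N = 6
V = rng.uniform(0.1, 1.0, size=(N, N))
V /= V.sum(axis=1, keepdims=True)

# Stationary distribution (plays the role of p(gamma | Y)), by power iteration.
pi = np.full(N, 1.0 / N)
for _ in range(1000):
    pi = pi @ V

star = int(np.argmax(pi))                       # distinguished model: most probable one
order = np.argsort(-pi)                         # models by decreasing probability
m = int(np.searchsorted(np.cumsum(pi[order]), 0.25)) + 1
D = np.sort(order[:m])                          # smallest set holding at least 25% of the mass

def regen_prob(i, j):
    """Success probability of the regeneration indicator for a move i -> j, as in (C.2)."""
    if j not in D:
        return 0.0
    ratio = np.min(V[i, D] / V[star, D])        # term in brackets in (C.2)
    return float(ratio * V[star, j] / V[i, j])

def sample_from_d():
    """Draw from d by rejection: sample from v(star, .) and keep the draw only if it is in D."""
    while True:
        g = int(rng.choice(N, p=V[star]))
        if g in D:
            return g

# Start the chain from d, then record regenerations along the run.
state = sample_from_d()
regenerations = 0
for _ in range(2000):
    nxt = int(rng.choice(N, p=V[state]))
    regenerations += int(rng.random() < regen_prob(state, nxt))
    state = nxt
print("D =", D.tolist(), "regenerations:", regenerations)
```

Neither function needs the normalizing constant $c$, which is the practical point made in the text.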


APPENDIX D
MAP FOR THE OZONE PREDICTORS IN FIGURE 5-5

Table D-1. The 44 predictors used in the ozone illustration. The symbol "." represents an interaction.

| Number | Predictor | Description |
|---|---|---|
| 1 | vh | Vandenburg 500 millibar pressure height (m) |
| 2 | wind | Wind speed (mph) at Los Angeles International Airport (LAX) |
| 3 | humid | Humidity (percent) at LAX |
| 4 | temp | Sandburg Air Force Base temperature (F) |
| 5 | ibh | Inversion base height at LAX |
| 6 | dpg | Daggett pressure gradient (mmHg) from LAX to Daggett, CA |
| 7 | ibt | Inversion base temperature at LAX |
| 8 | vis | Visibility (miles) at LAX |

| Number | Predictor | Number | Predictor |
|---|---|---|---|
| 9 | vh^2 | 27 | vh.dpg |
| 10 | wind^2 | 28 | wind.dp |
| 11 | humid^2 | 29 | humid.d |
| 12 | temp^2 | 30 | temp.dp |
| 13 | ibh^2 | 31 | ibh.dpg |
| 14 | dpg^2 | 32 | vh.ibt |
| 15 | ibt^2 | 33 | wind.ib |
| 16 | vis^2 | 34 | humid.i |
| 17 | vh.wind | 35 | temp.ib |
| 18 | vh.humid | 36 | ibh.ibt |
| 19 | wind.humid | 37 | dpg.ibt |
| 20 | vh.temp | 38 | vh.vis |
| 21 | wind.temp | 39 | wind.vi |
| 22 | humid.temp | 40 | humid.v |
| 23 | vh.ibh | 41 | temp.vi |
| 24 | wind.ibh | 42 | ibh.vis |
| 25 | humid.ibh | 43 | dpg.vis |
| 26 | temp.ibh | 44 | ibt.vis |
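The numbering in Table D-1 follows a regular pattern: the 8 main effects, then their 8 squares, then the 28 pairwise interactions grouped by their second factor. The short sketch below reproduces that ordering. It is an illustration only; the full interaction names it generates (for instance `wind.dpg`) are an assumption, since several labels in the table are printed in truncated form.

```python
# Main effects in the order of Table D-1.
main = ["vh", "wind", "humid", "temp", "ibh", "dpg", "ibt", "vis"]

# Squares, in the same order as the main effects.
squares = [f"{name}^2" for name in main]

# Pairwise interactions: for each second factor j, all first factors i < j.
# Assumption: untruncated names; the table prints some of them shortened.
interactions = [
    f"{main[i]}.{main[j]}" for j in range(1, len(main)) for i in range(j)
]

predictors = main + squares + interactions
numbered = dict(enumerate(predictors, start=1))

print(len(predictors))
print(numbered[17], numbered[26], numbered[44])
```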


REFERENCES

Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics 2 1152.

Athreya, K. B., Doss, H. and Sethuraman, J. On the convergence of the Markov chain simulation method. The Annals of Statistics 24 69.

Barbieri, M. M. and Berger, J. O. Optimal predictive model selection. The Annals of Statistics 32 870.

Breiman, L. and Friedman, J. H. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80 580.

Burr, D. and Doss, H. Confidence bands for the median survival time as a function of the covariates in the Cox model. Journal of the American Statistical Association 88 1330.

Burr, D. and Doss, H. A Bayesian semiparametric model for random-effects meta-analysis. Journal of the American Statistical Association 100 242.

Casella, G. and Moreno, E. Objective Bayesian variable selection. Journal of the American Statistical Association 101 157.

Clyde, M., DeSimone, H. and Parmigiani, G. Prediction via orthogonalized model mixing. Journal of the American Statistical Association 91 1197.

Clyde, M., Ghosh, J. and Littman, M. Bayesian adaptive sampling for variable selection and model averaging. Discussion Paper 2009-16, Duke University Department of Statistical Science.

Cogburn, R. The central limit theorem for Markov processes. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2. University of California Press, Berkeley, 485.

Cui, W. and George, E. Empirical Bayes vs. fully Bayes variable selection. Journal of Statistical Planning and Inference 138 888.

Doss, H. Comment on "Markov chains for exploring posterior distributions" by Luke Tierney. The Annals of Statistics 22 1728.

Doss, H. Bayesian model selection: Some thoughts on future directions. Statistica Sinica 17 413.

Doss, H. Estimation of large families of Bayes factors from Markov chain output. Statistica Sinica 20 537.

Durrett, R. Probability: Theory and Examples. Brooks/Cole Publishing Co.


Fernandez, C., Ley, E. and Steel, M. F. J. Benchmark priors for Bayesian model averaging. Journal of Econometrics 100 381.

Foster, D. P. and George, E. I. The risk inflation criterion for multiple regression. The Annals of Statistics 22 1947.

George, E. I. and Foster, D. P. Calibration and empirical Bayes variable selection. Biometrika 87 731.

George, E. I. and McCulloch, R. E. Approaches for Bayesian variable selection. Statistica Sinica 7 339.

Geyer, C. J. Practical Markov chain Monte Carlo (with discussion). Statistical Science 7 473.

Geyer, C. J. Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Tech. Rep. 568r, Department of Statistics, University of Minnesota.

Gill, R. D., Vardi, Y. and Wellner, J. A. Large sample theory of empirical distributions in biased sampling models. The Annals of Statistics 16 1069.

Hansen, M. H. and Yu, B. Model selection and the principle of minimum description length. Journal of the American Statistical Association 96 746.

Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97.

Hobert, J. P., Jones, G. L., Presnell, B. and Rosenthal, J. S. On the applicability of regenerative simulation in Markov chain Monte Carlo. Biometrika 89 731.

Ibragimov, I. Some limit theorems for stationary processes. Theory of Probability and its Applications 7 349.

Ibragimov, I. A. and Linnik, Y. V. Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff, Groningen.

Kass, R. E. and Wasserman, L. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association 90 928.

Kohn, R., Smith, M. and Chan, D. Nonparametric regression using linear combinations of basis functions. Statistics and Computing 11 313.

Kong, A., McCullagh, P., Meng, X.-L., Nicolae, D. and Tan, Z. A theory of statistical models for Monte Carlo integration (with discussion). Journal of the Royal Statistical Society, Series B 65 585.


Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. Mixtures of g-priors for Bayesian variable selection. Journal of the American Statistical Association 103 410.

Madigan, D. and York, J. Bayesian graphical models for discrete data. International Statistical Review 63 215.

Meng, X.-L. and Wong, W. H. Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica 6 831.

Meyn, S. P. and Tweedie, R. L. Markov Chains and Stochastic Stability. Springer-Verlag, New York, London.

Mykland, P., Tierney, L. and Yu, B. Regeneration in Markov chain samplers. Journal of the American Statistical Association 90 233.

Owen, A. and Zhou, Y. Safe and effective importance sampling. Journal of the American Statistical Association 95 135.

Raftery, A. E., Madigan, D. and Hoeting, J. A. Bayesian model averaging for linear regression models. Journal of the American Statistical Association 92 179.

Romero, M. On Two Topics with no Bridge: Bridge Sampling with Dependent Draws and Bias of the Multiple Imputation Variance Estimator. Ph.D. thesis, University of Chicago.

Rosenblatt, M. Asymptotic normality, strong mixing and spectral density estimates. The Annals of Probability 12 1167.

Scott, J. G. and Berger, J. O. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics (to appear).

Smith, M. and Kohn, R. Nonparametric regression using Bayesian variable selection. Journal of Econometrics 75 317.

Tan, A. and Hobert, J. P. Block Gibbs sampling for Bayesian random effects models with improper priors: convergence and regeneration. Journal of Computational and Graphical Statistics 18 861.

Tan, Z. On a likelihood approach for Monte Carlo integration. Journal of the American Statistical Association 99 1027.

Tierney, L. Markov chains for exploring posterior distributions (Disc: p1728). The Annals of Statistics 22 1701.

Vandaele, W. Participation in illegitimate activities: Ehrlich revisited. In Deterrence and Incapacitation. US National Academy of Sciences, Washington DC, 270.


Vardi, Y. Empirical distributions in selection bias models. The Annals of Statistics 13 178.

Zellner, A. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (P. K. Goel and A. Zellner, eds.). Elsevier, New York, 233.

Zellner, A. and Siow, A. Posterior odds ratios for selected regression hypotheses. In Bayesian Statistics: Proceedings of the First International Meeting held in Valencia (Spain) (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.). Valencia: University Press, 585.
