Citation
Monte Carlo methods for posterior distributions associated with multivariate student's t data

Material Information

Title:
Monte Carlo methods for posterior distributions associated with multivariate student's t data
Creator:
Marchev, Dobrin
Place of Publication:
Gainesville FL
Publisher:
University of Florida
Publication Date:
Language:
English
Physical Description:
x, 76 leaves : ill. ; 29 cm.

Subjects

Subjects / Keywords:
Burn in ( jstor )
Ergodic theory ( jstor )
Markov chains ( jstor )
Mathematical theorems ( jstor )
Mathematical variables ( jstor )
Standard error ( jstor )
Statistical models ( jstor )
Statistics ( jstor )
T distribution ( jstor )
Trucks ( jstor )
Dissertations, Academic -- Statistics -- UF
Statistics thesis, Ph. D
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 2004.
Bibliography:
Includes bibliographical references.
General Note:
Printout.
General Note:
Vita.
Statement of Responsibility:
by Dobrin Marchev.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Dobrin Marchev. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
022825016 ( ALEPH )
880637194 ( OCLC )

Downloads

This item has the following downloads:


Full Text










MONTE CARLO METHODS FOR POSTERIOR DISTRIBUTIONS
ASSOCIATED WITH
MULTIVARIATE STUDENT'S t DATA















By

DOBRIN MARCHEV


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


2004











.7-.


Copyright 2004

by

Dobrin Marchev



































To the memory of my grandfather, Dimiter














ACKNOWLEDGMENTS

I would like to thank Jim Hobert for his support throughout the last 3 years. His guidance, endless questions and rigorous criticism brought the necessary quality to my work. Without his insights, this dissertation would have never been the same.

I would also like to thank Brett Presnell, Jim Booth, Alex Trindade, Murali Rao and Mike Daniels for serving on my committee and spending their valuable time with my dissertation. To Brett Presnell, I am particularly thankful for stimulating discussions throughout my studies at the University of Florida. I learned a lot about probability, mathematics, and LaTex from him.

Above all, I thank Pavlina for her love and for sharing with me all the moments of happiness and grief.

Finally, I thank my parents, Angel and Liliana, who supported me and pointed me in the right direction for the first 25 years of my life.


iv















TABLE OF CONTENTS


page


ACKNOW LEDGMENTS .............................

LIST OF TABLES ........................................

LIST OF FIGURES .......................................

A B ST R A C T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

CHAPTER

1 INTRODUCTION ..............................


1.1
1.2 1.3


iv


vii viii


ix


1


From the Normal to the Student's t Model ... O utline . . . . . . . . . . . . . . . . . . . . . . .
How to Sample from an Intractable Density . .


. . . . . . . . .1. I
. . . . . . . . . . 4
. . . . . . . . . . 4


2 MONTE CARLO METHODS FOR THE STUDENT'S t MODEL

2.1 T he M odel . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.1.1 D efinition . . . . . . . . . . . . . . . . . . . . . . .
2.1.2 Propriety of the Posterior . . . . . . . . . . . . . . .
2.2 Making Exact Draws from the Target Posterior . . . . . .
2.3 Simulation Examples . . . . . . . . . . . . . . . . . . . . .

3 DATA-AUGMENTATION ALGORITHMS . . . . . . . . . . . . .


3.1 3.2 3.3




3.4 3.5 3.6


7


7
7
8 13
24

26


Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Standard Data Augmentation . . . . . . . . . . . . . . . . . . . . 27
Meng and van Dyk's Extensions . . . . . . . . . . . . . . . . . . . 30
3.3.1 Conditional Augmentation . . . . . . . . . . . . . . . . . . 31
3.3.2 Marginal Augmentation . . . . . . . . . . . . . . . . . . . . 32
3.3.3 Marginal Augmentation with Improper Prior . . . . . . . . 34
Connection with Group Theory . . . . . . . . . . . . . . . . . . . 39
Understanding the Artwork of van Dyk and Meng . . . . . . . . . 47
Other Problems with van Dyk and Meng's Lemma 1 . . . . . . . . 51


4 GEOMETRIC ERGODICITY OF STUDENT'S t PROBLEM . . . . . . . . . . . . . . . . . . . . . . . . .

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Markov Chain Background . . . . . . . . . . . . . . . . . . . . . .


v


55 55 56









4.3 Multivariate Student's t Problem . . . . . . . . . . . . . . . . . . 59
4.4 Numerical Example . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 70

REFERENCES........ ................................... 72

BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


vi














LIST OF TABLES
Table

2-1 Acceptance rates of the rejection sampler when d = 2 2-2 Acceptance rates of the rejection sampler when d = 3


vii


page

. . . . . . . . . 25

. . . . . . . . . 25














LIST OF FIGURES
Figure page

4-1 Empirical comparison of Markov chains and the target . . . . . . . . 69


viii














Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

MONTE CARLO METHODS FOR POSTERIOR DISTRIBUTIONS ASSOCIATED WITH
MULTIVARIATE STUDENT'S t DATA By

Dobrin Marchev

August 2004

Chair: James P. Hobert
Major Department: Statistics

Let f denote the posterior density that results when a random sample of size n from a d-dimensional location-scale Student's t distribution (with v degrees of freedom) is combined with the standard non-informative prior. We consider several Monte Carlo methods for sampling from the intractable density f, including rejection samplers and Gibbs samplers. Special attention is paid to the Markov chain Monte Carlo (MCMC) algorithm developed by van Dyk and Meng who provided considerable empirical evidence suggesting that their algorithm converges to stationarity much faster than the standard data-augmentation Gibbs sampler. In addition to its practical importance, this algorithm is interesting from a theoretical standpoint because it is based on a Markov chain that is not positive recurrent. We formally analyze the relevant marginal chain underlying van Dyk and Meng's algorithm. In particular, we establish drift and minorization conditions showing that, for many (d, v, n) triples, the marginal chain is geometrically ergodic. This is the first general, rigorous analysis of an MCMC algorithm based on a non-positive recurrent Markov chain. Moreover, our results are important from a practical


ix









standpoint since geometric ergodicity guarantees the existence of central limit theorems that can be used to calculate Monte Carlo standard errors; and because the drift and minorization conditions themselves allow for the calculation of exact upper bounds on the total variation distance to stationarity. Our results are illustrated using several numerical examples.


x














CHAPTER 1
INTRODUCTION

1.1 From the Normal to the Student's t Model

Scientists have built all kinds of models to explain the surrounding world. Statisticians, in particular, use stochastic models in an attempt to describe and summarize data; and to predict future observations. The most widely used statistical models are based on the assumption that the data are normally distributed. For example, a very simple model might assume that the sample data, yi,... , yn, are realizations of independent and identically distributed (iid) normal random variables. That is,


yj ~ N(p, u-), i = ,...,n,( .)


where I E R and u E R+ := (0, oo) are the unknown parameters denoting mean and variance, respectively. To accommodate more complex sampling designs, the model in Equation (1.1) can be generalized in many ways. For example, it can be extended to multivariate scenarios allowing yis to be vectors (of possibly different dimensions), or it can be assumed that the mean 1t is dependent on other variables (regression analysis). One reason these models have been studied so extensively is that they can be analyzed relatively easily.

Unfortunately, there are situations where the normal distribution is inappropriate because of its light tails. A possible solution to this problem is to change the model in a different direction. Instead of the normal distribution, one can use a distribution with heavier tails. A natural candidate is the Student's t distribution.


1






2


Therefore, can replace the assumption in Equation (1.1) with


yj ~ t (V, f, U-), i = 1, . . . , n, (1.2)


where t(v, p, o) is the location-scale Student's t distribution (Section 2.1.1) and v is the degrees of freedom (assumed to be known). This idea is not new. Lange, Little and Taylor (1989) extensively listed articles and books using different t models, claiming that the first use dates back at least to Jeffreys (1939), who fit it to a series of astronomical data. Of course, the assumption in Equation (1.2) can also be generalized in many ways. Lange et al. (1989) described results from applying different generalizations of (1.2) to all sorts of real data sets, ranging from the stack-loss data set of Brownlee (1965) to the traffic-accident data of Draper and Smith (1981, p. 191). Another modification of the model in Equation (1.2) is given by Zellner (1976) who considers y (yi, . . . , y,)' to be multivariate t distributed with identity covariance matrix, instead of assuming that the individual tis are independently t distributed. Notice that, contrary to the normal case, these are not equivalent models, as it is suggested by Kelejian and Prucha (1985).

Heavier-tailed distributions are especially needed when modeling the high

volatility inherent in financial data sets (see; e.g., Mandelbroit, 1963). In particular, the market model is a multivariate linear regression extension of (1.2) that relates the returns on assets (y) to an overall market return (embedded in p). Substantial empirical evidence shows that the market model error distribution should be heavier-tailed than the multivariate normal (see; e.g., Affleck-Grave and McDonald, 1989). Praetz (1972) considers a similar model that describes the change in the log of several share-price index series with a t distribution, and compares the fit (by X2-test) to three other models, concluding that only the t model is acceptable, even at 0.01 level of significance. Many studies concerned with economical or financial applications, like Fernandez and Steel (2000), use Bayesian t models, changing the






3


model of Equation (1.2) to


yily, a t(v, y-, a), i =n;,.(.1.,3) (1.3)
A, 0' 7r(p, U),

where ir(p, a) is a (possibly improper) prior. The multivariate version of (1.3) is the focus of our attention. We assume that given (pI, E)1, the vectors yi,..., y, are conditionally independent with

Ejid
yi p, td(V, A, E), i = ,...,n;
(1.4)
yL, E ~(p, ),

where td(v, P, E) represents the d-dimensional Student's t distribution (whose density is defined in Section 4.3) and ir(pL, E) oc E is the standard noninformative prior for a multivariate location-scale problem.

Even though the Student's t models are more realistic, there is a practical

limitation to their use: making inferences about u and o is not as straightforward as in the normal case. For example, the maximum likelihood estimators of A and a in Equation (1.2) are not available in closed form. Similarly, there is no known exact method to sample directly from the posterior distribution resulting from Equations (1.3) and (1.4). A possible solution to these problems is to apply the EM algorithm to (1.2) or a Gibbs sampler to (1.4). We concentrate on the Bayesian paradigm and give some answers to the question of how to sample from the posterior distribution associated with Equation (1.3).

Geweke (1993) was among the first to study a Bayesian t model with independent errors, but a mistake in the main result invalidates his proof of the propriety



Throughout the dissertation, we use bold for variables that are vectors or matrices






4


of the posterior. Geweke (1993) was also among the first to suggest a Gibbs sampler to get draws from the posterior and to actually apply his results to a real data set involving U.S. macroeconomic time series, studied by Nelson and Plosser (1982). van Dyk and Meng (2001) and Meng and van Dyk (1999) show how to improve on the standard Gibbs method for sampling from the t posterior, but they do not deal with the problem of the propriety of the posterior.

1.2 Outline

The purpose of this dissertation is to present and compare different Monte

Carlo methods for sampling from the posterior resulting from Equation (1.4). The organization is as follows.

In Chapter 2 we define the most general version of the Student's t model that will be considered herein, and we prove a result that guarantees the propriety of the posterior. We then show how to use classical Monte Carlo methods in order to sample from the posterior, and illustrate with some examples.

Chapter 3 deals with MCMC methods of getting draws from the target posterior associated with our model. We first introduce the standard data-augmentation algorithm, and then we attempt to explain the new MCMC methods developed by Meng and van Dyk (1999), van Dyk and Meng (2001), and Liu and Wu (1999). In our view, these papers did not provide a rigorous foundation for these new methods. In this chapter we attempt to lay such a foundation.

Finally, in Chapter 4, we state and prove our results about the geometric ergodicity of van Dyk and Meng's algorithm.

Before moving to Chapter 2, we conclude this chapter with a brief description of various methods for obtaining draws from an intractable density.

1.3 How to Sample from an Intractable Density

Suppose we are given a density, f(0), where 0 E 0. We may be interested in computing a particular numerical characteristic of f, like its mean or standard






5


deviation; but sometimes, the complexity of the expressions may not allow us to do the computations directly. In such cases, there are several methods for estimating or approximating the properties of interest of f. These include asymptotic approximations, numerical integration, and sampling. Sampling methods (also called Monte Carlo methods) are ways of generating samples from a density that equals f or that approximates f. Using such samples, we can estimate the characteristics of the target density, f. For example, if we denote the sample by 00, 01,..., then there are strong laws of large numbers that guarantee (under some regularity conditions) that
I g() n- g(0)f(0)d0 a.s.,
2=0
as n -+ oc for any integrable function g : e -> R.

There are many different sampling methods. In some cases, it is possible to

sample directly from f, and thus obtain an independent and identically distributed (iid) sample. When direct sampling is infeasible (especially when we don't know the normalizing constant of f), a possible way to get iid draws from the target density f is the rejection sampler. Here is a brief description of this method. Suppose we know how to sample from the density h(0),0 E E) (often called a candidate) and we also know that f() K M < oc V 0 E E. Then the basic algorithm h (0)
of the rejection sampler consists of performing the following steps:

1. Generate 0 from h
2. Generate u from U(O, 1)
3. If u M h(0) < f(0), then return 0, or else go to 1.

More details about this algorithm, along with some examples and a proof that it returns iid samples from f, can be found in Ripley (1987).

In many problems it is impractical or even impossible to get an iid sample, and then Markov chain Monte Carlo (MCMC) is often used. Generally speaking, when an MCMC algorithm is applied, dependent samples are generated using a Markov






6


chain with stationary density f. The most commonly used MCMC algorithms are the Metropolis-Hastings algorithm and the Gibbs sampler, which itself can be viewed as a special case of the Metropolis algorithm. Tierney (1994) gives a good overview of MCMC methods for exploring intractable densities.

In this dissertation we consider the special case when f is the posterior density that results when data from a multivariate location-scale Student's t distribution are combined with the standard non-informative prior. van Dyk and Meng (2001) developed an efficient MCMC algorithm for sampling from f, and provided empirical evidence suggesting that their algorithm converges to stationarity much faster than the standard Gibbs sampler. We formally analyze the Markov chain underlying Meng and van Dyk's algorithm. Specifically, we derive drift and minorization conditions, which can be used to get rigorous bounds on the total variation distance to stationarity, and to do regenerative simulation.














CHAPTER 2
MONTE CARLO METHODS FOR THE STUDENT'S t MODEL

2.1 The Model

In many statistical models, the observations are assumed to be normally

distributed. For example, inference procedures for the classical linear regression and ANOVA models are based on the normality assumption. In many applications, however, there is a need for heavier-tailed distributions. In particular, observations on daily returns of stocks (and financial data in general) prove to be incompatible with the assumption of normality. The Student's t model is a relatively tractable heavy-tailed alternative.

2.1.1 Definition

The model we consider is as follows. Let p and E denote a d x 1 mean vector and a d x d covariance matrix, respectively. Also, let v > 0 be fixed and known. Assume that, conditional on (ti, E), the d-dimensional random vectors y, ... .. , are iid td(v, yu, E); that is, y, has density |E r- (!t)- (y1 - L)T E-(1 )
fed(y; v,y, E).: a-pdIw [ +
ft ( 1 V f) r (,) di +

Let f(ylt, E) denote the joint density of y = (yi, ..., y); that is,
n
f (y I /_t, E) = ft (Y , ,

We take the prior on (p, E) to be the standard noninformative prior for a multivariate location-scale problem; that is, ir(pI, E) oc E- 2 (see, e.g., Section 8.2 of Box and Tiao, 1992). We show later in this section that (no matter what the value of v) the resulting posterior is proper if and only if n > d + 1. Assuming that


7






8


n > d + 1, the posterior density is of course

f (p, Ely) =f yIIE7(L )
m n (2.1)
c |E ~l- 1 + v-( )TE-l(yi - )where

m(y)= S I [E (p E)d[d

with

S = {z R d(d2+) unvech(z) is p.d. where unvech(z) is the symmetric matrix corresponding to the elements of z and "p.d." means positive definite. Our ultimate goal will be to draw samples from (2.1), but before developing methods for sampling we will establish the propriety of the posterior.

2.1.2 Propriety of the Posterior

We will prove that f(IL, Ely) is proper if and only if n > d + 1. We start with the following lemma, which seems obvious but could not be found in the literature.

Lemma 2.1. Suppose that 0 C R" and let {f(x 0) : 0 c 0} be a family of

densities indexed by 0 such that the support, X = {x : f(x|0) > 0}, does not depend on 0. Let x1,. . . ,x, be iid from f(|0) and suppose that ir(0) is an improper prior density yielding the posterior


f (Blo , . ,xn) oC f (xi 10) 7(0) .(2.2) If there exists an n' C N := {1, 2,... } such that the posterior is proper; that is, such that

c(x,..., )= J 7 f(xi 0)] (0) dO < oo

for all (x1,...,xI) E X"', then the posterior (2.2) is proper for all n > n'.






9


Proof. If n > n' we have,


Sf (Xi lo) r7(0) dO = n f (zi6)j n' f (xi lo) 7(0) dO
9 _=1 . i=n'+1 . i=1 I
n] n f( il1)7 ( 0)
=c(x... X,],) d f(X [i) H )] d O
S _ c(x ,... ,1 r)

Now since the quantity in curly brackets is a proper (prior) density on 0, the integral is finite and the posterior is proper. D

Lemma 2.1 can be stated to hold for almost all x1, ... .,x (not all x1,. . ., ), but we will omit such technicalities even though we use the lemma when it holds only outside a set of measure 0. Here is our main result about the propriety of the posterior.

Theorem 2.1. Let A denote Lebesgue measure in Rdn. The posterior, f(/, EIy), is proper for A-almost all y if and only if n > d + 1.

Proof. The goal is to show that


lSd <00 (2.3)

for A-almost all y, if and only if n > d + 1. Instead of analyzing this integral directly, we will analyze a related integral that is more tractable. This integral involves the augmented likelihood, which is based on the so called "missing data" q := (qi,. .., qn )T E Q, where Q is usually a subset of some Euclidean space (detailed in Chapter 3)

Conditional on (p, E), let (yi, qc), i = 1, . . . , n, be iid pairs satisfying the relations

yilqi, yL, E ~Nd P, -D (2.4)
qifLy, E ~ 1 .)






10


Then the implied joint density of y and q is f (y, qIpt, E)= f (yiIqi, I, E)f (qi I [p, E)


and it is easy to see that


f (y, ql p, E) dq = f (y l, E),


where f(yIL, E) is as in Equation (2.3). Therefore,


S Rd f I , )dlid =


S d A;


Of course, by Fubini's Theorem, we are allowed to integrate in any order we like (notice that the integrand is nonnegative). First,

f (y, qI Lp, E) 7r(p, E)


2 e e2 Z-j=1q*3j~y -_T -(jA


2


_


where q. := E' qj. But


1 ( j - _) T
TiT
j=q.


[L) = Z q'y]TE'yj - 2 Zq'y7TiP1fL +q.[T 1t
j=1 j=1


qjT


+TE
I ~qiyYi
j=1


1yj ,


1)


where A = q.-1 E qyj. Thus,


LRd


nv / -d2


- +d 1 [E jT -~jA ( )1L 2e . lJ \q /


(27r) d(, 2 F1 (g)


(q)-d e'q


(2.5)


f (y, q I tt, E) -F (tt, E) dq dtt dE .


2(L) 2
(27)d2" Fn I
(L2


f (y, qlIp-, E) 7(pL, E) dpL






11


Also notice that
-1 n
yT E~ yj - ii = r qyT E-1Yj - TE-A
ZE~ ~~Y -y )~t ji f tr [Z q~7 , - q.]
j=1 qj=1

= tr E qjyjy - q.p4T E-1.
_ j=1I

We now assume that n > d + 1 and will show that the posterior is proper for A-almost all y; that is, we will show that (2.5) is integrable with respect to both E and q. To find the integral with respect to E, we use the fact that the Inverse Wishart density fw - 1yT - q. ) ), defined in Section

4.3, integrates to 1. That is, we have


IS R df(,qIfr) Ttr)dt




Now, for i = 1,2, .. .,n, let p, = ~. Also, let p, = (P2, P3,.. ,P)T and let P, be a diagonal matrix with p, on the diagonal. Finally, set



M1) = 1.
(d+ -2n)( q. -q.







Now, straightforward calculations show that



2 (j=1 j=1

= M()PM() - M()ppIM() = M(1)T (P - pnpl] M).






12


Hence, if n d + 1, we have


q yjy' - q.pT = q. M( P, - p~pT] MS1
j=1

q.d IM (l)2 p* _P*PTi



= q.d |M 2 1 _ pTpqd )2 /) I J7J
= q.a| |2 j=1

=q1 M 1)2 1qj j=1

where the fourth equation follows from an elementary property of determinants that can be found in Fang and Zhang (1990, p. 3). Therefore, when n = d + 1 we have


(2.6)


i= F ('(d + j )) n(L) g- _2z I l)~ d 7d(d ) 2I qj e 2
4 j=1 F(II)

which is clearly integrable in q. Thus, assuming q y yT - q. PpT 1 $ 0 is true for A-almost all y), it follows that the posterior is proper when n = d and because of Lemma 2.1 that the posterior is proper for n > d + 1.

Now, to show that the posterior is improper when n < d, denote Q = E I9 q3yy - q. AiiT and notice that by the transformation Z = E-' with
Jacobian J = ZI-(d+1) (see; e.g., Fang and Zhang, 1990), we have


(which +1,


I -1-ritr(QiZ-, C, --2
tr'I2.~j dE = 2dZ.
s 2Jse2


Since Z is a p.d. matrix, by the Cholesky decomposition, it can be uniquely represented as Z = XXT, where X is a lower triangular matrix (l.t.m.)






13


with positive diagonal elements, and the Jacobian of this transformation is J = 2d Hd1_ xj-i+1. Therefore

d
e-itri(Qz)| n2 dZ = 22 et(QXXT)Xn-d-2 1 X-i+ ldX
is e J 2R e

= 2dJR etr(QXXT) J x -i1dX d JR - 1=x d X,
JR R

where

d(d+1)
S = {(ri1, r21, r22, rF3, .. ., Tdd) E R 2 :ri > 0, V i = 1, . . . , d},


and xi for i =1,..., d are the columns of X. Recall that n < d implying that Q is non-negative definite (xTQx = 0 for at least one non-zero x E Rd), which means that Q = TTT, where T is a l.t.m. with non-negative diagonal elements, at least one of which is zero. Let tkk = 0 and observe that the integral d d
eJ c f=1 aTQxi 1 x -+ldX e-- E -1(Tx,)T(Txi) 1 x7- 7+dX
JR I iR i i

diverges because the coefficient in front of Xkk in the exponential is 0; that is,

d
J =1(Txi)T(Tx) 1Jx -i+1dX = cf e-kkkk k+d-k+
R k + k+ kk

where the first equality is justified by Fubini's theorem even if c = 00. E

2.2 Making Exact Draws from the Target Posterior

As we now explain, a byproduct of the proof of Theorem 2.1 is a method for making exact draws from the target posterior, f([t, Ely), in the special case where n is exactly d + 1.

First notice that

f (p, E ly) = f(q, u, E Iy)dq, +~






14


where

f (q p, ly)f (ylq, [L, E)f (ql p, E)7(pL, E,)
f~,,~)=fQ f, f1 ,. f (y Iq, [L, E) f(q I p, E) 7(/-t, F-) d[LdEdq

Thus, our problem would be solved if we could make draws from f(q, [t, E y) since we could then just ignore the q component. Now, since f(q, [y, E Iy) = f(qIy)f(E q, y)f(tu IE, q, y), we can make draws from f(q, yL, E Iy) sequentially. Actually the draws from f(EIq, y) and f(pIE, q, y) are trivial, since the calculations in the proof of Theorem 2.1 show that


Elq, y ~ Wd n -1, Eqj(yj - )(yj - t)T ,


and

pLJE,q,y ~Nd (, -),
q.
and exact methods for sampling from these distributions are well known. Furthermore, Equation (2.6) shows that when n = d + 1, conditional on y, the qis are iid F(v/2, v/2). Hence, one can make exact draws from f(q, [, Ely) f(q y)f(EIq, y)f([IE, q, y) sequentially, using the following three steps:

1. Draw qi,.. ., q, independently from the F(v/2, v/2) distribution.

2. Draw E IWd n - 1, _1 qj(yj - f)(yj - j)T

3. Draw /t Nd .

Marginally, the resulting (p, E) is an exact draw from f(pi, E y). Unfortunately, when n > d + 1, the qjs are no longer conditionally independent, and thus drawing from f(qIy) is no longer straightforward. However, as we will show, it is possible to construct rejection samplers for f(qly) that are reasonably efficient when d is small.






15


We now consider the problem of sampling from

n qj dn
f(qly) cx q ) (q.I e- q3y y
j=h j=

Notice that Equation (2.7) can also be written as


f(qly) oc qi


d
qI- E= _Iy . y 2
e~. 2= i= q


But

qyyy - q.ppi q. (yj - A)(y - ) : q.C(q, y),
j=1 j=1
and therefore Equation (2.8) becomes


\ n= 2 - q. )d 2
I \ 2


= q I q=1*
f (q I) y q) (Fx= qj)q d
_=1 qqC(q, y) 2
d

C(q, Y) 1 2i


oc fr( ,v)(I ) . . . fr( , -)(qn) C q, ) Since the first part is just a product of iid F', ) densities, we might consider a rejection sampler in which the candidate is exactly this product of Gamma densities. To do this, we need a bound for


( i)d
( F ~ i = q . -(2 .1 0 )
1 C(q, y) In-1

The following lemmas are needed to establish our bound.

Lemma 2.2. Suppose that Y is a discrete random vector putting mass pi at yi, i = 1, 2,..., n. Then


cov(Y) = pip (yi - yi)(yi - yi)
1

(2.7) (2.8)


(2.9)






16


Proof. Let X be an independent copy of Y. Then notice that

1 1
2 2
-E(X - /) (X - Y)T = - E(X - p +t yY)(X-p+ )

2 2
1 1

+ -E(X - p)(X - y)T + -E(tt - Y)(X -Y)T
2 2
1 1
S-cov(X) + -cov(Y) = cov(Y),
2 2

where t = E(Y) = E(X). Now apply Equation (2.2) to get


cov(Y) =E(X -" Y)X-Y pipj (yi -j yy(yi -j yT
i=1 j=1

Pppy- y(yi - yj)T



Before we proceed any further, we need to develop some notation. For m < n, let Cnn be the set of all possible subsets of size m from Nn := {1, 2,. . . , n}; that is,


CIm", := { (ki, . .., k,) : ki < ... < km and ki E N,,, i = 1, ... ., m}.

For example, C2 := {(1, 2), (1, 3), (2, 3)}. Also, for a given matrix A of dimension m x n, let Ar(T) denote the submatrix obtained from A using only the rows that are in the index set T (of course, here T C Nm). Similarly, let Ac(T) be the submatrix obtained from the columns of A that are in the set T. We now state the Binet-Cauchy formula for the determinant of a product of two matrices. It can be found, for example, in Karlin (1968).

Lemma 2.3 (Binet-Cauchy Formula). Let A be an m x n matrix, and let B be an n x m matrix. Then |AB|= : |Ac(T)rB,(T).
T C;C'






17


Let's now go back to the question of how to bound the expression in Equation (2.10).

Theorem 2.2. FordEN, n>d+1, y E d' and q E R,


)d
< (2.11)
JC(q, y) In-I (n-d)"- 'd(y)' where


(1 Y2)2 if n = 2
d(y) = -d(n-2)
1 ,+--+ 1 if n> 2,

2d!(n-d-1)I
ci(y) = M (
T ,C-1
TCd

and M(') is an (n - 1) x d matrix, given by (yi - y1)





(yi - y1)T



(y - y)

for = ,...,n.

Proof. First notice that when n = 2, d must be 1, and in this case both sides of (2.11) are equal to (Y .)2 For the nontrivial case when n > 2, denote

= , i = 1, .. ., n and notice that because of Lemma 2.2


)1 pi)d sup = sup 1'
qER- -C(q, y)In-l PEP ) (y - nwhere P = {(pi,. . . ,pn) E (0, 1)n : _ = 1}.






18


What follows is true for any Pk, k = 1, . . ., n; but for ease of exposition, we'll show the details only for pi. First, notice that in the sense of matrix ordering1,
n
Z" PiPj (Yi - Yj -(iY) > jPi( - YM I- Y
i
= iMS P.Ml),

where P, is as defined in Section 2.1. Now applying the Binet-Cauchy formula with A M()T and B = P,M ), we obtain


MP.M') E M PM . (2.13)
C(T

Since


P. M

pn (y1 -- U

for any T E Cd , we can write


(P.M (1)(T n~~ MM
PMS Pi+1 M .)

Thus,

M()TP.M) + Mr)


and consequently from Equations (2.12), (2.13) and the fact that if A > B then IA I >B 1, it follows that


pi (yi - yj )(yi - yj)T > P di1 (2.14)
i

1 For nonnegative definite A and B, A > B iff A - B is nonnegative definite






19


Now by the inequality between the arithmetic and geometric means, we obtain


E
TEC'


MM 2
S ip+11 M1
T (T)

1(n d<


(2.15)


Notice that each index i E {1, .. . n} is contained in exactly (,_ ) elements of the set C"1. Therefore, P+1=n(P2g t TECC-1 iET
implying that


(2.16)


1


ddT
-( Pi+1 M (TE 1 TEp.)-- MMT
(P2 - ) O M
TECdn-


From Equations (2.14), (2.15) and (2.16) we obtain that


(n -


d(n-1) 2 - p )dC (y), P1 P ..P


and consequently


(Hn 1 pi)d Spi (yi - yj)(yi


n-i 1
)T n-T (n-).n-1Pd(n-2)c)
- j yd) I c(y)


By symmetry, for any k E Nn


)T]n-1 )-d1n-1d(n-2)
-yj) d k k (y)


n-i


P pj (yi - yj) (yg







20


Thus,

(In = pi)d )(y,- y)T n-1 - 1
and therefore


i= 1 )d sup
qER- Iqy)Tl


1
< sup min pE7p1k n kn (n-2)


Finally, we will show that


1
sup min PEP d(n-2)
AEl~p ci (Y)


1 . y


(2.17)


Let


bi = c (y) d-2
n
b. = bi
i=1
, b.


Then


1
sup min pE-p 1

inf max picj(y) d(n-2) PEP 1

-d(n-2)


= bd(n-2) (inf max
PEP 1

Assuming pl > 0 V i e Nn, it is clear that if p = p*, then


max = 1, Assuming10V n
P*


whereas for any other choice of p,


max <1,
1 --pi


since if pi $ p* for some i, then there exists a j E Nn such that pj > p .






21


In the univariate case (d = 1), the constants ci(y) can be simplified considerably. We state this simplification of Theorem 2.2 as a corollary.

Corollary 2.1. For n > 2, y C Rn and q C R',


[ q, < [(n - 1)"- di(y)] [v (q, y) ]nwhere

v(q, y) = qi (yidi(y) =[ 1 + - - -+ (n ,
e 1(y).- en (y).and

e, (y)= (y,- y)2, n.


Proof. First notice that Cn-1 {1,... , n - 1} and therefore


S M~ = f(yi - y)2.


The rest of the proof is just substitution. E

The bound proved in Theorem 2.2 is not sharp, which reduces the acceptance rate of the resulting rejection sampler. When d = 1, we were able to find the exact supremum when n E {2, 3} and this result is stated in Proposition 2.1. Unfortunately, we were unable to get a better bound in the general case, but we can state a conjecture about the exact supremum of (2.10) when d = 1, which is presented in Conjecture 2.1

Proposition 2.1. For n C {2, 3}, y E R' and q (E R+


sup H ,_1 = [(n - ,) c(y)] (2.18)
qER- [v(q,y






22


where

c(y) = min ]7(yi -yj)2. (2.19)
jii
Proof. Notice that when n = 2, v(q, y) = P1P2(Yl - Y2)2, c(y) = (YI - y2)2 and consequently (2.18) is obvious. The non-trivial case is when n = 3.

We will first prove that

_. I [(n - 1)"1c(y)]~1 (2.20)
[v(q, y)]fl1

and then we will establish the attainment of the sup. Suppose that miin 7(yi - yj)2 17(Yj 2 1<< j7 i j541

i.e., the minimum is achieved at i = 1. This simply means that yi is between Y2 and y3. Without loss of generality, we will assume that the ordering of yis is Y2 < Y1 Y <3. Notice that (2.20) is now equivalent to


(yi -2)2 + (y1 - y)2 + (y2-113) > 2(y - Y2)(Y3 -11). (2.21)
P2(Y Pi Y

But

(Y2 - Y3)2 = (11 - 12)2 + (Y3 - yi)2 + 2(Y3 - Y1)(Y1 - 2), and because of the assumed ordering, (2.21) is equivalent to

11- Y2 ( 1P2+( P3> 11 3 - Yi1 P3: ~P +2) ___P
2y131 G P3 VP1 211y2 P2 PI Pi

Now an application of the inequality between the arithmetic and geometric mean gives us

2 P2++ Y Y1 ( 1P+3 P2P3.
2 Y3 -YIk VP3 V p,]2 111-1Y2 VP2 PV

>1P P2 ) > ( 1 3P2 3 + P21)3
( l :+ rPP P:+ _P2:
V P3 Pi 1 P2 P1Pi






23


Consequently, (2.20) will be established, if we can show that +P2P3 + P2P3 >1
1 + + 21
Pi P1

which is obviously true.

Notice that the last inequality becomes equality when pi -* 1 and P2,3 -* 0 with the rest of the inequalities suggesting that = _, . It is easy to verify that in terms of qis this is satisfied when, for example, we take


q1= t, q2= -andq3= 2
t t(y3 - yi)2

and let t -* oc. Then

t
q. t + 1+ (1 -Y22 t t(y3-yl) 1
(Y1 -Y2)2
q2q3 t2(y3-yl)2 -+0
q~q. t [t + , + (Y1 -y2)2 t t(y3 -y1)2

as t -> oo, implying that


lim + -Y3y - Y1 P3 +1 -- +, -V
t y- ( Pi Y2 P2 Pi Pi



Conjecture 2.1. For n > 2, y (E R' and q C R,


sup _ )1 = [(n - 1)1-lc(y)] -, (2.22)
qERn [v(q, y)] where

c(y) = min (y, - y,)2. (2.23)
i
Remark 2.1. Notice that the result of Theorem 2.2 provides an alternative proof of the fact that the posterior is proper for n > d + 1, because it shows that (2.6) is bounded from above by a quantity integrable in q.






24


2.3 Simulation Examples

The purpose of this Section is to study the acceptance rate of the rejection sampler proposed in Section 2.2 as a method for obtaining draws from our target density. Clearly, the bounds established in Lemma 2.2 and the subsequent corollary and conjecture depend on y, n and d. Furthermore, the candidate density depends on v. It is, therefore, interesting to investigate the behavior of the algorithm when changing the values of these parameters. Generally speaking, the results of our computer simulations show that the acceptance rate of the algorithm varies greatly depending on the specific data set, y, and suggest that it works better if either n is small and/or v is big. Also, the acceptance rate tends to be higher when the yis are equally spaced. Here are some specific examples.

Example 2.1: Suppose a particular set of yis is given by2

y = (-1.449605, -0.996631, 0.228872, 0.068414, -0.126978, -0.563358, 0.766889)T

and is assumed to have t, distribution with v = 5. A computer simulation using the bound in the conjecture resulted in an Accept - Reject algorithm with acceptance rate of 0.0384 estimated with a standard error of 0.00061 (using the well-known formula for the standard deviation of one observation from the Binomial distribution). Using the bound from Lemma 2.2 on the same data, results in acceptance rate of 0.00131 estimated with a standard error of 0.00011. Assuming that v = 10, changes the acceptance rates for the unproved and proved bound to 0.04141 (0.00063) and 0.00139 (0.00012), respectively (the numbers in parenthesis are the standard errors).

Example 2.2: For this example we simulated Student's t data with parameters v = 3, d = 2, and n = 7:


2 The actual values were obtained by simulation






25


Yi = (-1.0082, 1.5155) T, Y2 = (-0.39561, 0.066761)T

Y3 = (1.4634, 2.5265)T, y4 = (1.5163, 1.5953 )T

y5 = (-1.0334, 3.0844)T, Y6 = (1.5207, 0.21293)T

y7 = (6.0723, -0.069308)T,
which produced a bound of 0.000668735. Table 2.3 gives the acceptance rates of the rejection sampler for different values of v along with their estimated standard errors and is based on 1,000,000 iterations.

Table 2-1: Acceptance rates of the rejection sampler when d = 2 v Acceptance rate Standard Error 1 1.1 x10-6 3.31662 x 10-7
5 4.1 x10-6 6.40311 x 10-7
10 4.1 x10-6 6.40311 x 10-7
15 5.1 x 10-6 7.14141 x 10-7


Example 2.3: In our final example, we first obtained another simulated

sample from the multivariate Student's t distribution, but this time with v = 5, d 3, and n = 6:

yi = (-0.92497, -0.63609, -0.058751)T, y2 = (0.32547, 0.0031071, 0.98505)T y3 = (0.77057,1.8546, -0.086936)T, y4 = (1.5351, -1.1889, -1.6686)T Y5 = (0.037699, -1.1527, 0.29751) T, Y6 = (0.14760, 0.51222, -1.9081) T, which produced a bound of 0.0465003.

Here are the acceptance rates for different values of v, based on 10,000,000 iterations along with their standard errors:

Table 2-2: Acceptance rates of the rejection sampler when d = 3 v Acceptance rate Standard Error 1 2.3 x10-6 4.79583 x 10-7
5 6.8 x 10-6 8.24618 x 10-7
10 6.7 x10-6 8.18533 x 10-7
15 8.3 x10-6 9.1104 x 10-7














CHAPTER 3
DATA-AUGMENTATION ALGORITHMS

3.1 Introduction

In two related papers (Meng and van Dyk, 1999; van Dyk and Meng, 2001), X.-L. Meng and D.A. van Dyk developed two generalizations of the standard data augmentation (DA) algorithm. The basic idea underlying these generalizations is to construct a large class of candidate augmentation schemes by introducing a working parameter, a, into the standard augmentation scheme. In conditional augmentation, a particular value of a is chosen, while in marginal augmentation, a is actually incorporated into the augmented model by way of a working prior, w(a). In all of the examples considered by Meng and van Dyk, the marginal augmentation (MA) algorithm becomes more efficient as the working prior approaches a special improper form. The basic MCMC theory breaks down, however, when the working prior is improper. Indeed, when w(a) is improper, the Markov chain driving the MA algorithm is not positive recurrent. On the other hand, this nonpositive recurrent chain is on an augmented state space that is larger than the parameter space. It turns out that, under regularity conditions, the non-positive recurrent chain possesses a well behaved marginal chain that has the correct (proper) invariant distribution. In this chapter, we consider in detail the general MA algorithm and show how it can be applied to the Student's t problem. The chapter is organized as follows.

In the next section we describe the standard data augmentation algorithm for dealing with intractable posterior densities like (2.1). Then in Section 3.3 we describe the generalizations of Meng and van Dyk. One drawback of their methods is that it is very hard to verify the conditions that guarantee convergence of the


26






27


algorithm to the target density. In Section 3.4 we provide an alternative validation of the MA method that is inspired by the work of Liu and Wu (1999). Finally, in Sections 3.5 and 3.6 we highlight some problems with the proofs in van Dyk and Meng (2001). Throughout the chapter we present all the algorithms in their full generality and we use the Student's t model as an example.

3.2 Standard Data Augmentation Suppose that f(yJO) is a likelihood and r(0) is a prior, with supports Y and 8, respectively, yielding the proper posterior density f (y 0) ir(O)
f (01y) = ,(y) (3.1)

where m(y) fe f(y 0) 7r(0) dO < oc. Unless 7r(-) is a conjugate prior, the

denominator, m(y), is typically an intractable integral and hence expectations with respect to this posterior (required for Bayesian inference) are not available in closed form. The posterior (2.1) is an example. In such cases, one way to analyze the posterior distribution is through Monte Carlo methods. Assuming that direct (iid) sampling is not an option, one might resort to an MCMC method where dependent samples are generated using a Markov chain with stationary density (3.1), and these samples are used to approximate expectations with respect to f(01y). A good overview of MCMC methods can be found in Tierney (1994).

The MCMC algorithms developed by Meng and van Dyk are extensions of the standard DA algorithm so this is where we begin. The basic idea of DA is to construct unobserved data, q E Q, that serve as a computational tool. To be specific, we introduce a joint density f(y, q 0), with support Y x Q, such that (i) the marginal density of y implied by f(y, qJO) coincides with the original likelihood f(y 0); that is,

If (y, qO) dq = f (yJO) , V 0 E O , V y E y, (3.2)






28


and (ii) sampling from both conditional densities

__(y,____)____( _ f (y, ql0) wr(O) (3
f(qlO, y) := f(y,qO)() and f(Olq,y):= f(y,qO)7() (3.3)
fQ f (y, qj0) 7r(0) dq f. f (y, qjO) r(0) dO

is feasible, if not convenient. Since f(y, qjO) is introduced purely for computational purposes it should not alter the posterior model (3.1). To see that condition (3.2) is sufficient to guarantee this, notice that the "augmented posterior" satisfies

f (0, qy) f=(y, qjO) 7(0) _ f(y, qj) 7(0) f (y, qlO)7(0)
fe f f (y, qjO) 7(0) dq dO fe f (yjO) 7(0) dO m(y)

and therefore

fe f(y, q|6)dq r( (O6)
jf (0, qy) dq- m ()- f (0y) (3.4)
Q Tm(y)

our target density.

The DA algorithm is a method for obtaining a sample {1,. . . , 0, with the property that the sampling distribution of n, converges to (3.1), in total variation distance, as n -+ oo. It is based on a discrete time Markov chain,


(DA = 0,j'O

with Markov transition density


KDA(010') f (01q, y)f (qj0, y) dq; (3.5)

that is, given that the current state of the chain is 0', the conditional density of the next state, 0, is (3.5). Since


Je KDA(O 10') f(O'I y) dO' = J f (01q, y) [J f(q 10', y) f (0ly) dO'jI dq

JQ f q, y)f (qly)dq = JQf(,qyd = f(1)






29


our target density, f(O1y), is an invariant density for <(DDA. Therefore1, we can use (DDA to get approximate draws from f(0 y). More details about data augmentation can be found in Tanner and Wong (1987), Liu, Wong and Kong (1995) and Liu, Wong and Kong (1994).

There are many ways to construct a joint density that satisfies (3.2), but here we will specifically consider the following hierarchical description: y~q, 0 ~ f(ylq,0) (3.6)

q|0 ~ f (q|0).

Typically (3.6) represents some well known decomposition as is demonstrated in the next example.

Example 3.1: Consider the univariate version of the Student's t problem,

described at the beginning of Section 2.1, with 0 = (p, a); that is, assume yI,..., yn are iid ti(v, p, o-) with joint density f(ylp, -). The prior is r(p, a) Oc 1/o and the goal is to sample from the posterior f(p, oly).

There is a simple data augmentation algorithm for this problem that is based on the well known decomposition (that we express loosely as) y A + Vz/ q,

where z ~ N(O, 1) and q - x'/v with z and q independent. To be specific, "missing data" q = (qi,. . . , q,)T are introduced such that, conditional on ([t, o), (yj, qj) are iid pairs satisfying

y j q j , I , - N ( , -3

q3 1p, ao- (, 2)




1 Throughout this chapter, we assume (unless otherwise stated) that all Markov chains satisfy the usual regularity conditions described in Section 4.2






30


where by X ~ F(a, b) we mean that the density of the random variable X is proportional to xa-i e-bxI(0'0)(x) and X =F ( , '). The reader is invited to verify that the marginal density of yj given (p, a) under (3.7) is ft, (yj; v, p, a) and hence, because of the independence assumption, (3.2) is satisfied. Thus, we are now in a position to apply the DA algorithm; i.e., we consider simulation of the Markov chain defined by (3.5). It is trivial to verify that, conditional on (P, cr, y), the random variables q, . . . , q, are independent with V +1 I y )
qj 1p, o,, y ~7 ( ( 0 + V)

for j E N,. Furthermore, making draws from f(a, p q, y) can be done in two easy steps by noticing that


crq,y~IG n +,q.(q) and p 1,q,y~N (q),
2 2 q.

where
n n in2
q. = qi, /(q) = yjqj , &(q) = q3 [y - A(q)]2
j=1 j=1. j=1

and by X ~ IG(a, b) we mean that the density of X is ba
fIG (x; a, b) ba-e (0)W (3.8)
F(a)xa+l

i.e., X has the distribution of 1/Y when Y ~ F(a, b). Thus, it is straightforward to simulate the chain defined by (3.5) in this case.

3.3 Meng and van Dyk's Extensions The key idea underlying the DA algorithm is that it is easier to simulate the Markov chain 4DDA than to sample directly from the target posterior. If, in addition,





31


simplicity and speed. For example, the speed of the EM algorithm, which can be viewed as the deterministic analog of the DA algorithm, is inversely proportional to the amount of missing information. This suggests that a more clever augmentation may lead to a better DA algorithm. Indeed, this is the idea behind conditional and marginal augmentation, which are now described.

3.3.1 Conditional Augmentation

Suppose we are able to introduce a working parameter, say, a E R+, into the standard data augmentation density f(y, qj0) yielding a family of densities, {g(y, rJO, a)}-ER, each of which is a candidate for use in the DA algorithm.2 One way to do this is via a family of one-to-one transformations D, : Q - Q, indexed by the working parameter a. That is, for each a E R+, there is an analog to the hierarchical model (3.6):

y r, 0, a ~ g(ylr, 0, a);
(3.9)
rJ0, a ~ g(rl0, a),

where

g(ylr, 0, a) = f (yjD-1(r), 0), (3.10)

and g(r 0, a) is obtained from f(q10) by change of variable r = D,(q); i.e.,


g(rJ, a) = f (D-(r)10) D (3.11)
Q ar

Usually, the original augmentation scheme, f(y, q10), is a member of the family with, say, a = 1.



2 To clarify the notation, we use f to denote densities associated with the standard DA algorithm and g for densities involved in van Dyk and Meng's extensions. Similarly, q refers to missing data in the standard DA and r denotes missing data in CA and MA.






32


Generally speaking, in order for each member of such a family to be a viable augmentation scheme, it must be the case that a is "invisible" in the marginal distribution of the observed data; i.e., we must have


J g(y, rJ, a) dr = f(y10) V a E R+ , V y EY, V 0 e 8. (3.12)

Notice that under the hierarchy (3.9), the above condition is automatically satisfied, since for each a c R+

g(y, r,a)dr = g(ylr, 0, a)g(rjO, a)dr = (y ID- 10(r), 0)f (D-(r)1) O dr = (3.13)
] Or

jf (ylq, 0)f (ql )dq = I f (y, ql0)dq = f (y 0).

Once the validity of the CA is verified through equation (3.12), we can fix a at any value that leads to relatively quickly converging Markov chain. For a specific value of a, the transition kernel of the corresponding Markov chain is K: (010') = j g(O r, y, a)g(r10', y, a)dr and it is easy to show that f(01y) remains its invariant density (since g(y, rJ0, a) is a legitimate data augmentation scheme). van Dyk and Meng (2001) provide guidance as to choosing the value of a that corresponds to the fastest converging Markov chain by establishing connections with the corresponding EM algorithm. We do not pursue these ideas any further since we are mainly interested in MA, which is presented next.

3.3.2 Marginal Augmentation

In MA, the working parameter a is incorporated into the augmented model through a proper working prior w(a). There are several different ways to use g(y, r 0, a), 7r(O) and w(a) to construct an MCMC algorithm with invariant






33


density f(01y) (Meng and van Dyk, 1999). Here we describe the two methods that Meng and van Dyk (1999) call Scheme 1 and Scheme 2 (see also Liu et al., 1994).

Assume that a and 0 are a priori independent; i.e., w(a 0) = w(a) V 0 E . Then

g"'(y, Vr10) := g (y, r 10, a) w(oz) doz is a viable augmentation scheme. 3 To see this notice that (3.12) and Fubini's theorem yield

g(y, rj0)dr = f f g(y, r|1, a)w(a)dadr
Q JQ R+3.4

R w(a) JQg(Y r1, a)drda = f (y 10) w(a)da = f (y10).

Therefore, for each fixed proper prior, w(-), we can make approximate draws from the target posterior by simulating the Markov chain, (DMA =Oi}'o, with Markov transition density

K,(0 0') = g,(0 ry) g,(r1', y) dr, where
g'(0rY) fR+ g(y, rJ0, a)ir(0)w(a)da g( r, ff+ g(y, r:=, a)r(0)w(a)dd0

and
(r 1, Y) fRg(y, rJO, a)7(0)w(a)da g (r|0,y) :
f g(y, rJ0, a)7r(0)w(a)dadr
This is our version of Scheme 1. In Section 3.5 we show how it is related to Meng and van Dyk's work and we give more details on how the chain is actually run.

As expressed in the notation, the Markov transition density K, clearly

depends on the choice of working prior w(a). Each different proper prior on a will result in a different version of


3 It is important to recognize that gw(y, rJ0) and f(y, q10) are typically not the same.






34


and Meng (2001, Section 4) give a heuristic argument suggesting that DMA should converge faster as the working prior, w(a), becomes more "diffuse." For example, consider the following parametric family of IG priors for a 6J e-61c"
w(a; 6) = I6(,0)(O),


where 6 > 0. The "diffuseness" of this density increases as J decreases4. Thus, it is tempting to consider what would happen to the convergence rate of 4IMA under the improper prior w(a) = a-'I(o,o)(a). However, it is clear from Equation (3.14) that g,(r 1', y) will be improper when w(a) is improper and hence K, is not defined in that case. Therefore, we need a new sampling scheme when w(a) is improper. It is called Scheme 2 and is described in the next subsection.

3.3.3 Marginal Augmentation with Improper Prior

We now consider a different Markov chain in which a is actually one of the variables. Suppose that w(a) is proper and consider the joint density defined by


g(0, r, aIy) c g(y, r 0, a) w(a) 7r(0) . (3.15)


The main idea of Scheme 2 is to use a two variable Gibbs sampler based on the above density with components (0, a) and r. To see that it produces an approximate draw from f(0 y) notice that equation (3.12) shows that integrating r out of (3.15) yields


/ g(0, r, aly)dr = f (01y) m(y)w(a).

The reason for introducing this chain is not as an alternative to )MA, but because, unlike (MA, it may still be well-defined when the working prior is improper. When


4 Notice that Ea2 < oo iff 6 > 2






35


w(a) is improper we have


/ f g(O, r, ay) dr da oc f (yO) 7r() w(a) da = o0 ,
R+ Q JR+

so the density in (3.15) is improper. Consequently, under an improper prior on a, the Gibbs sampler based on (3.15) with components (0, a) and r will not be positive recurrent (Hobert, 2001b). However, in certain situations, it is still possible to construct a useful MCMC algorithm using (3.15) when w(a) is improper. Indeed, we have already seen that for each fixed (0, a) E 0 x R+,


j g(O, r, aly) dr = f (O1y)m(y)w(a) < oo . (3.16)

Suppose further that for each fixed r E Q,


/ / g(, r, aly) dO da < oo . (3.17)

Then we can run a two-variable Gibbs sampler by iteratively sampling from the densities

g*(rlO, a, y) = g(Or,aly) and fQ g(, r, aly)dr

g*(-,alr,y) = g(O, r, aly)
gRk + f9 g (O, r,ay)dOda

The "*"s are used to remind the reader that the "joint density" to which these conditionals correspond is improper. Following the notation of Meng and van Dyk (1999), we will call this method Scheme 2. Write the resulting Markov chain as

= (ri, (0i, ai))} -0. It is easy to show that (3.15) is an (improper) invariant density for this chain, and hence (as mentioned above) 1 is not positive recurrent. Furthermore, while I = {r}'o and 12 = {(Oi, ai)}0 are themselves (nonpositive recurrent) Markov chains (see; e.g. Liu et al., 1994), the marginal sequence MA = i}o induced by this algorithm is not necessarily a Markov chain nor






36


is this sequence necessarily convergent. Before discussing MA in more detail we return again to Example 3.1.

Example 3.1 continued: One common transformation is r = D,(q) = aq, which corresponds to the alternative representation t1 (V, A, o) = [ + v../vlaz/jF, where z ~ N(0, 1) and r - aX2/v with z and r independent. The corresponding MA algorithm is based on the following hierarchal model. Conditional on (, 9- a), let (yj, rj),j 1, . . . , n be iid pairs satisfying yj~rj, a, a ~ N A,r"7 (3.19)
r t, 0-, a - F , ) with (p, o, a) ~ ir(p, o) w(a), where L(a) is the improper prior density a-I(o,)(a) and 7r(p, a) oc -. The posterior "density" of (r, i, a, a) given y under the above hierarchy is


g (y, u, r, aIy) = [f g(yi 1r, p, 0- a) g (rIp, a, a) 7(p, a) w(a) . (3.20)
_j=1I

It is easy to check that (3.17) is satisfied, which makes it possible to define two legitimate conditional densities as follows g*(raa, y) g(r, y, a, aly)
g(r, p, a, aIy) dr

and
g* y) g (r, p, a, c Iy)
fR g(r, , , a,, y) =dlu dada Sampling from these two densities is easy (Meng and van Dyk, 1999; Hobert, 2001a). For example, g*(rlp, a, a, y) is just a product of gamma densities and it is






37


easy to verify that

g*(9, a r, y) g(pt U, Oar, y) = g (ar, y)g(O-la, r, y)g(Plo-, a, r, y) =fIG ( cr , vr fjG (.; n- fN p(P,
2 2 2 2a r
- r.+1 -1()+ jAr]2 -1 [&('r) -I
ca 2 e Ir. 2 [&r]2 U- 2,

where fIG(-;-,.) is defined in equation (3.8), and

1 (xr-~) 2
fN (x; , b) e

Therefore, we can apply the Gibbs sampler (by iterating between these two conditionals) to form the Markov chain 1 = {(ri, (pi, o-, aj))}i=O, where ri = (rij,.. .,rin). Also define

1 = {ri}'0 and '2 {(Ai, o-i, ai) which are themselves Markov chains.

The improper density (3.20) is invariant for (D and consequently b, #1, and (D2 are all non-positive recurrent with f(p, o-ly) w(a) being an invariant density for "2. The fact that, under our choices for ir(p, -) and w(a), the sequence


'MA PiI i I=O

is a positive recurrent Markov chain with invariant density f(p, a Iy) should follow by Lemma 1 from van Dyk and Meng (2001) or Theorem 4 from Liu and Wu (1999). That is, one can simulate the unstable Markov chain, (, and use the (p, o-) component to explore the proper posterior density f((p, o-ly). However, we have serious doubts about the validity of the results of van Dyk and Meng, and Liu and Wu. Consequently, we have developed our own proofs (see Sections 3.4 - 3.6).

We now discuss the conditions under which the sequence *MA is a positive recurrent Markov chain. Recall that (2 = {(0j, ac))}-o is a Markov chain with






38


Markov transition density given by


j g*(Q, alr, y)g*(rO', a', y)dr. A sufficient condition for MA = Oi'o to be a Markov chain is that the integral


JQ fRg*(, ar, y)g*(rl', a', y)dadr. (3.21)

does not depend on a'. Another sufficient condition under which the sequence MA is a Markov chain is given in Meng and van Dyk (1999). Here we establish a connection between their condition and the integral (3.21).

Lemma 3.1. Assume the hierarchical structure of Equations (3.9) - (3.11). If f g*(0, aID,(q), y)da does not depend on a', then (b* is a Markov chain with Markov transition density

KMA(O1') g*(, a r, y)g*(rO', a', y)dadr.

Proof. Notice that by Equations (3.18), (3.15), (3.10), and (3.11)

g*(r 0', a', y) g (', r, a' y)
Qg (0', i , e'ly) di
g(ylr, 0', a')gr|0', a')7(0')W(ae') fQ g (y10', i , W) g(iIa', 0') 7(0') w(a') di f (y ID-,1 (r), 0') f (D-,1(r) 10') aO-, (r) Or
f (y 10')
Now, after change of variable q = D1-,'(r), we see that (3.21) can be written as jj*0 ( y 0')
ifI g* (0, a|ID.,(q), y) da fyl , q' dq = g*(0, aID.,(q), y)daf (q10', y)dq,

which combined with the assumption completes the proof. E

We were able to show that the condition of Lemma 3.1 is satisfied in a very general set-up (see Theorem 3.1). Here we conclude the section with an example






39


that demonstrates that the condition of Lemma 3.1 is satisfied at least for the univariate t model.

Example 3.1 continued: We now show that the condition of Lemma 3.1 is satisfied for the univariate t model. Using the definition of the IG density, we obtain that

{ &(r) +[P -- AM?)2 2 -1+ ,_1 ,+2
g*(A, U-Ir, y) Oc (. v + + r. 2 [&(r)] 2 o- 2


S~ + & (r ) + __(r )]2 2 [_ ,1 +2 V+ + [&r]2 a

Finally, since &(a'q) = &(q) and A(a'q) = A(q), it follows that the condition of Lemma 3.1 is satisfied.

For further details about MA with improper working priors, along with

empirical evidence that it produces fast mixing MCMC algorithms, the reader is referred to van Dyk and Meng (2001). For more information on the use of nonpositive recurrent Markov chains in MCMC algorithms, see Liu and Wu (1999), Hobert (2001b) and Lavine (2003).

Liu and Wu (1999) noticed that some of the properties of the Markov chain which provides peace of mind about the results for the multivariate t model.

3.4 Connection with Group Theory

We start with some fundamental definitions that can be found, for example, in Green (1987).

Definition 3.1. If A is a nonempty set, then every function from A x A to A is called a binary operation in the set A.






40


Definition 3.2. A set A together with a binary operation ""is called a group if the following properties are satisfied.

(a) Associativity: (al - a2) - aO3 a - (a2 - a3) for all a,, a2, a3 E A.

(b) Identity: there exists an identity element e G A such that e a = a - e = a for any a E A.

(c) Invertibility: for each a G A, there exists a unique inverse element a-1 2 A such that a- a= a-1 a= e.

The functions that preserve the group operation play a special role in group theory. Here is the formal definition of such functions.

Definition 3.3. A function D : A -* B between the groups (A, -) and (B,.) is called a homomorphism if

(a) D(ai - a2) = D(ai) - D(a2) for all a1, a2 E A;

(b) D(eA) = eB.

We will now proceed to the main definition of this section, namely the notion of a transformation group, which can be viewed as a special case of the following general definition taken from Olver (1999).

Definition 3.4. Let Q be a set. Define 9(Q) to be the family of all one-to-one functions d : Q -* Q.

9(Q) is a group in which the group operation is defined by composition of functions. The identity transformation plays the role of the identity element, and the inverse of a transformation is the usual functional inverse. If the space Q is equipped with some additional structure (like a topology), then the class of possible functions included in 9(Q) may be restricted correspondingly. For example, when Q = Rn, 9(Q) may contain all continuous invertible functions d : R' --2> R' (recall that the composition of two continuous functions is also continuous).






41


Definition 3.5. For a given group A, a transformation group, acting on a

space Q is defined as any homomorphism D : A -* 9(Q) mapping A to the group of invertible functions on Q, and is denoted by (A, Q, D). In other words, for each element a E A there is a corresponding invertible function D, : Q -- Q. Because this identification is a homomorphism the following properties are satisfied:

(i) Da, o D2(q) := D,(D2(q)) = D,.2(q) for all zi, a2 E A;

(ii) De(q) - q, where e is the identity element of A; (iii) D,-i(q) D-(q) for any a E A, where (iii) is a consequence of (i) and (ii).

Definition 3.6. For a fixed q G Q, the subset


Oq := {D,(q) : a C A} C Q


is called the orbit of q.

Some authors use the term group action instead of transformation group (see, e.g., Eaton, 1989). Notice that the action of the group A on the space Q induces an equivalence relationship among the elements of Q, namely two points are equivalent iff they are in the same orbit. To see this, notice that

(i) q E Oq, since De(q) = q;

(ii) If q, E Oq2 then q2 C Oql, since there exists an a E A such that qi = D,(q2), implying that D,-1(qi) = Do-1(D.(q2)) = D.-1.,(q2) = De(q2) =2;

(iii) If qi E Oq2 and q2 C Oq3 then qi E Oe , since there exist al, a2 E A such that

qi = D,, (q2) = D., (D2(q3)) = D, -2(q3)Thus, the space Q can be partitioned into disjoint sets, corresponding to the equivalence classes (the orbits), induced by the above equivalence relationship. By the axiom of choice, we can choose a representative element for each of these equivalence classes. Denote the set of the representative elements by Z. Now, since






42


the equivalence classes partition Q, each element q C Q must belong to a unique equivalence class. Thus, if z E Z is the corresponding representative of that class, then q E Oz and there exists a c A such that q = Da(z) (see Example 3.2 below). In fact, it is easy to see that for any q E Q, there is unique z C Z and at least one a E A such that q = Da(z). This means that the map D : A x Z -- Q defined by D(a, z) := D,(z) is onto. Suppose further that the index set Z can be chosen in such a way that the map D(a, z) is one-to-one and differentiable. When this is possible, the index set Z is called a smooth cross-section. Usually, the smooth cross-section, Z, has lower dimension than the space Q and it is convenient to write Z as the image of another space R via the one-to-one transformation Z; i.e. Z = Z(R). Here is an example.

Example 3.2: Let A be the group of all positive real numbers, R+, with

product as the group binary operation. Obviously, 1 C R+ is the identity element and the reciprocal = a-1 is the unique inverse of a E R+. Also let Q =R' and for any a E R+ define Da : R' - R' by D,(q) := aq,


which is a one-to-one transformation satisfying the conditions of Definition 3.5. The structure (R+, R , {Da}aER,) is called the scale transformation group. Notice that the set of all orbits is the set of all lines in the first quadrant that go through the origin. They do not intersect since the point 0 is not in R n . A smooth crosssection is the part of the unit sphere that is restricted to the first quadrant; i.e., Z :{(ql,..., qn) (E R + : q 2 + ---+ gi 1) Furthermore, any point q E R' can be represented as q = D,(z) = az, where a C R+ and z E Z with a := ||qjj and z := q. Finally, any point






43


z = (zi,..., z,) E Z can be written as z, = COS r, COs T2 . .. COS Tn-3 COS rn-2 COS r-I Z2 = COS T COS T2 . . COS T-3 COS Tn-2 sin rn-i Z3 = COS r COS r2 ... COS Tn-3 sin rn-2



Zn = sin r,

where r :(Ti, ... , Ti)T E R := (0, 7)n-1.

We will now show that for the scale transformation group from Example 3.2 with improper prior -, the MA algorithm works, by proving that
Theorem 3.1. Assume the hierarchical structure of Equations (3.9) - (3.11) and consider MA with improper pTior w(a) = . If the transformation is given by D,(q) = aq, then I4*I is a Markov chain and


SKmA (0 ')f (O y )dO' = f (0| y).

Proof. We will establish that MA is a Markov chain on our way to showing that f(0 y) is an invariant density for KMA (01'). First notice that based on the proof of Lemma 3.1, we can write


KMA(010') = fQ g*(0, ae D,(q), y)daf (q 0', y)dq.






44


Then

KMA(0 1')f(O'ly)dO'

g*(JfaD,(q),y)daf(q| ',y)f('|y)dqd'


= I 1+ g*(0, a| D&(q), y)da e f(q, '|y)dO'dq

Sf g*(, alDc& (q), y)daf (qly)dq,

where

f(qly) J f(q, 0ly)d0.

Therefore, we need to show that


J g*(, aD.,(q), y)daf (qly)dq = f (01y).
JQ JR+

Recall that we can write q = D(s, z) = D(s, Z(r)), where the mapping D A x Z -+ Q is one-to-one and differentiable. Thus, we can make change of variables q = D,(Z(r)) with Jacobian J equal to the absolute value of the determinant of the n x n matrix
OD,,(Z(r)) OD,(Z(r)))
as Or
In our case (the scale transformation group), D,(Z(r)) = sZ(r) and thus aD,(Z(r)) = Z(r). Therefore, J = sn- L(r), where L (0, ()-1 - R is a
as
function of r only (this can be seen from the definition of determinant and the fact that s is a scalar in all of the elements of the matrix except those in the first column). Hence, we need to prove


fR J4 fR g*(0, aID,(Ds(Z(r))), y)daf (D,(Z(r))Iy)s'-1L(r)dr ds = f (01y), which is equivalent to


f f g*(O, aja'sZ(r)), y)daf (sZ(r) y)sn- L(r)drds = f (01y).






45


Now by definition (see equation (3.18)),


g*(0 ala'sZ(r), y) = g(y, a'sZ(r) 0, a) r (0)w(a) fe g(y, a'sZ(r)10, a) r (0)w(a)dOda
_ f(y, D-(a'sZ(r)), 0)r (0)w(a) J.(a'sZ(r))
fR fe f (y, D-1 (a'sZ(r)) 0)7r(0)w(a)J,(a'sZ(r))d~da


and since J,(q) : aD (r) = ., we have


g*(0, aja'sZ(r), y)


f (0, a-1a'sZ(r)Iy)m(y)w(a) J0 (a'sZ(r))
fa+f (a-la'sZ(r)ly)m(y)w(a)Ja(a'sZ(r))da f (a-a'sZ(r), Ojy)a--1
fR f (a-1a'sZ(r)jy)a-n-1da'


Another change of variable 3= a- a's with Jacobian j gives us


1 f (a-a'sZ(r) y)a-'-1da = f (OZ(r) y) d 2

~)'f+ f(OZ(r)jy)O--1dO ( 'sL f f(OZ(r)jy)/3n'L(r)dO

_ I h(Z(,r)ly) a's L(r)'
where
h(Z(r)ly) := u(s, Z(r) y)ds and u(s, Z(r)Iy) is obtained from f(qIy) after change of variable q = D,(Z(r)) = sZ(r). Therefore, a asZ~), =f(a-la'sZ(r), 0|y)a-41
9 (0, ala'sZ(r), y) = (a a z(r), Y) Ct's) L(r)
Similarly, we can make the change of variable to obtain that


f(a la'sZ(r), 0jy)a-'-1da ( " h(Z r), O1y)






46


where

h(Z(r), 01y) := u(s, Z(r), 01y)ds and u(s, Z(r), O1y) is obtained from f(q, 01y) after change of variable q = Ds(Z(r)) = sZ(r). Thus, I 0 a ID,,,(D, (Z(r))), y) da =-(~ ),0y h(Z(r) y)

which does not depend on a'. Thus, by Lemma 3.1, the sequence DMA is a Markov chain. Notice that all the results after the change of variable 0 = a-la's depend heavily on the choice of prior w(a). It is our belief that this is where the Haar measure plays a special role in the proof. However, it still remains to be investigated rigorously what happens in the general case.

Now, in order to show that f(01y) is the invariant density for the Markov chain
j h(Z(r), 01y) f (sZ(r) y)s"- L(r)dsdr = f (01y),
fxh(Z(r)|y) R,

or equivalently (by the definition of h(Z(r)ly)),


J h(Z(r), Oly)dr = f (61y). (3.22)

But

h(Z(r), 01y) := u(s, Z(r), 01y)ds = 4 f (sZ(r), 01y)s"-1L(r)ds. Therefore, (3.22) is equivalent to


4 f(sZ(r), 0 y)s'-1L(r)dsdr = f (01y).

Finally, the inverse transformation q = sZ(r) shows that the above is equivalent to


Jf (0, qly)dq = f (01y),






47


which we already know is satisfied. E

When the conditions of Theorem 3.1 are satisfied, one can simulate the nonpositive recurrent chain 4 and actually use the marginal chain @MA to explore the proper, but intractable posterior f(O1y). Moreover, this phenomenon is more than just a theoretical curiosity. Indeed, there is substantial empirical evidence suggesting that 4MA converges faster than (DMA and much faster than 4DA- It is not yet fully understood why an algorithm that is based on a non-positive recurrent Markov chain converges faster than related algorithms based on positive recurrent chains.

3.5 Understanding the Artwork of van Dyk and Meng

In this section we will show that the descriptions van Dyk and Meng give to define their Markov chains agree with our definitions. We start with Scheme 1; i.e., the conditional densities used in the expressions that follow are based on the joint density g(ylr, 0, a)g(rl0, a)7(0)w(a) that is now based on a proper w(a). First note that


g(r 0', y) := g(r, a10', y)da = g(rJ0', a, y)g(al0', y)da.


But

g(a0', y) = g(r, a10', y)dr

g(ylr, 0', a)g (r10', a)r(0')w(a) dr


f_ g (y, r 0', a)drw(a) _ f (yl')w(a) fR+ fQ g (y, i I0', a)w(a)drda f (y 0') fR w (a) . Therefore,

g(r|0', y) = + g(rJ0', a, y) w(a) da. (3.23)






48


Hence, K, can be alternatively represented as

K(0| 0')
-(3.24)
= jR jg(01r, a", y)g(a"Ir, y)da" [j g(r 0', a', y)w(a')da'] dr.

In Meng and van Dyk (1999) page 309 and van Dyk and Meng (2001) Lemma 1, the authors claim that the Markov chain corresponding to Scheme 1 (according to us, it has transition kernel (3.24)) can be simulated by performing the following steps: first draw a' from the prior w(a'), and then draw r from g(rj0', a', y); second draw a" from the posterior g(a" r, y), and then draw 0 from g(01r, a", y). In other words, the following steps must be performed:

1. Draw q from f(q10', y);
2. Draw a' from w(a');
3. Take r = D,(q);
4. Draw 0 from g(01r, y).

However, Meng and van Dyk did not provide the transition kernel for their Markov chain which makes it harder to understand their idea. To see that van Dyk and Meng's and our description agree, first notice the simple fact that if we need a draw from f(y) = fx f(x, y)dx, we can first get x ~ f(x) and then y - f(ylx), and the resulting y is marginally from f(y). Now notice that conditionally on a' , the variable r obtained in step 3 has density f(DI'(r)10', y)J,(r), where J01(r) = aD r
ar

Thus, the result of steps 1, 2 and 3 is a draw from fR, f(D),1(r) 0', y)J,,(r)w(a')da'. But
, f(ylq, 0')f(q|0')
f (q , y) = f(Y0,)






49


and therefore

f (D-(r)|',y) Ji(r) =f (y I D-7(r), 0')f (D-(r) 1') J(r)
f (yI 0) (3.25)
g(y r, 0', a')g (r|', a') g(r, o',y).
f (yI0')
Thus, the final result after step 3 is a draw from


/ f (D-(r) 1', y)J&,(r)w(a')da' = j g(r10', a', y)w(a')da'.

This combined with step 4 makes it crystal clear that the two definitions of the transition kernel under Scheme 1 agree.

We now briefly discuss the same issue but for Scheme 2. As noted in van Dyk and Meng (2001), KMA can be described similarly to what we did in Scheme 1. For any fixed a' E R+ perform the following steps:

1. Draw q from f(q10', y);
2. Take r = D,(q);
3. Draw 0 from f, g*(0, adr, y)da.

Similarly to (3.25), the result of steps 1 and 2 above is a draw from g*(rI1', a', y). Therefore, our definition for Scheme 2 agrees with the above.

We now address the question about the stationary distribution of DMAThere are problems with van Dyk and Meng's results concerning the invariance of f(01y) for *MA and these are described now. Since these authors never defined their Markov chains with transition kernels, we start by first restating their main result. As can be seen afterwards, this result might not even be true. The following theorem is an attempt to rigorously state Lemma 1 from van Dyk and Meng (2001).

Theorem 3.2. Assume the condition of Lemma 3.1 is satisfied and let

KMA(O0') be the transition kernel of C*y when implementing Scheme 2 under an improper prior w(a) and let km(01|') be the transition kernel of GDMA when






50


implementing Scheme 1 under the proper prior wm(z), where m E N. Define g(ylr, 0, a)g(rl0, a)7(0)wm(a)
gm(0,cry) :=f f , g(ylr, 0, a)g(rJ0, a)r()wm(a)dadO and, similarly,

g(ylr, 0, a)g(rl0, a)7r(0)wm(a)
gm(r, ,,y) = g(yr, 0, a)g(r, a)(0)w()dadr

If the limit limm-. gm(0, aIr, y) exists and

lim gm (0, a Ir, y) = g*(0, aIr, y), where g*(O, a r, y) is calculated using the improper prior w(a), then

km(0|1') -* KmA(0|1') as m , o, implying that the stationary distribution of b*A is f(0|y).

The fact that km(0 0') -- KMA(00') as m --+ o implies that f(01y) is

invariant for KMA, follows from Liu and Wu's Lemma 1, which we restate here for reference.

Lemma 3.2 (Liu and Wu). Suppose that Km(ylx) is a sequence of Markov transition functions, all having r(x) as invariant density. If K(ylx) limm_.c Km(ylx) is a proper transition function, then -r is an invariant density of K.

If true, Theorem 3.2 would provide justification for the use of the improper

prior w(a). Unfortunately, the "proof" of van Dyk and Meng has some gaps. More specifically, notice the following

km(010') := gm(0r, y)gm(r10', y)dr


= I R+ gm(0, oz r, y)gm(r 0', y)dadr






51


and by equation (3.23)

km(1') = gm(0, a r, y) j g(rJO', a', y)wm(a')da'dadr

g(ylr, 0, a)g (r10, a)7r(0)wm(a) g (r10', a', y)wm(a')da'dadr JQ JR+ L f J (yJr, 0, &)g(r10, &+ ) (0)w,(6) dd0


= Ig*(O, ar, y)g*(rl ', a', y) x

X Wr(a) fe fR+ g(y r, 0, a)g(rJ0, a)rr(0)w(a)dazd0
xw(a) f fR+ g(ylr, 0, a)g(rJm, a)r()w(a)dad0 J (a')dadadr. Thus, it is clear that the assumptions of the theorem do not concern the convergence of the integral over 9, but only the convergence of its integrand. Assuming that wm(a) -+ w(a) as m - oc (which van Dyk and Meng never say in their theorem) in conjunction with the assumptions of Theorem 3.2 can only be used to conclude that the limit of the quantity in the square brackets is 1. Even under some regularity conditions, that would allow us to pass the limit inside the integral, the only thing that can be concluded is that


km(O0') , j j j g*(O, a r, y)g*(rJ0', a', y)w(a')da'dadr = oc as m -* oc.


Meng and van Dyk (1999) provide other "checkable" and even more obscure conditions (Lemma 3 and Theorem 2) "ensuring" that
3.6 Other Problems with van Dyk and Meng's Lemma 1

We conclude this chapter with an interesting question that arises when working with sequences of random vectors. The question is whether the convergence of the joint densities implies convergence of the corresponding marginal densities. Here we show that in general this is not true and give an example to illustrate that. The reason we are interested in this question is because in van Dyk and






52


Meng's (2001) "proof" of their Lemma 1 on page 10, the authors claim that the convergence of gm(0jr, y) to g(01r, y) follows from their assumption that gm(6, ar, y) -> g(O, a r, y) as m -> oo. Here we show that this is not true.

Lemma 3.3. Let (Xv, Yr), n 1, 2,... be a sequence of random vectors with values in X x Y and densities fn(x, y), n = 1, 2,... (with respect to the Lebesgue measure). Suppose that


fn(x, y) -+ f(x, y), V (x, y) C X x Y as n -- 00, where f(x, y) is a density on X x Y. Then lim inf j f.(x, y)dy j f(x, y)dy a.s. n-oo y fy

Proof. By Fatou's lemma we have


lim inf fn(x, Y)dy > j lim inf fn(x, y)dy f (x, y)dy, (3.26)
n-o fy JY n-ox fy

for any x E X. Applying Fatou's lemma again we obtain


I lim inf f fn(x, y)dydx < lim inf fn(x, y)dyd= 1.
Jy -x Jy

On the other hand, from (3.26) it follows


J lim inf j fn(x, y)dydx > f (x, y)dydx = 1, x -ofy Jry

and thus,

J lim infj fn(x, y)dydx = 1 <-> FminfL fn(x, y)dy - f (, y)dy dx =0.
[li nf y y .

But by (3.26) the integrand is a non-negative function and therefore we must have lim inf j fn(x, y)dy = f (x, y)dy a.s. n-ox fy fy






53


F

The following example shows that without further assumptions, there is no

guarantee that the marginal densities even converge, much less that they converge to the desired limit.

Example 3.3: Let n C N \ {1} and for k E N, define the following sequence of bivariate densities:

3 2
fnk(X, y) = nIB. (x, y) + n2 -1 IB11\Bnk (XY)I where
{(~) 2: k -1 k
Bnk = (x, y) E R2 : < X < -, 0 < Y
n n n

For n = 1 simply take fii (X, y) = IB11 (x, y) which is the uniform density on the unit square.

Now consider the sequence fi(x, y), f2(x, y), f3(x, y), f4(x, y), .... which is

obtained by re-enumerating fll(x, y), f21(x, y), f22(X, y), f3(x, y) .... Since for any particular (x, y) E B1 eventually as n gets large, (x, y) E B1 \ Bnk, V k = 1,... , it is clear that

fn(x, y) - j1(x, y), V (x, y) C B1 as n - 00. On the other hand the marginal densities


ink W = k(X, y)dy = J + -_ n (0,1}\(r ,]
Jn + 1 n'In n2 -1 I(O n\( II,(X

do not converge to any version of the uniform density on the unit interval. Moreover, limsup,_ogn(x) = oc for all 0 < x < 1 even though liminf - gn() = I(o,1i(x) as it is expected to be based on Lemma 1, where the sequence {gn(x)} is obtained from the sequence {gnk(X)} the same way as {f,(x, y)} from {fnk(x, y)}. To see this, let x E (0, 1] and notice that for every n = 1, 2,... there is a k c {1,. . . , n} such that k-i < X < k and this k is given by k = [n]. Therefore
n - n






54

the subsequence in rnx (X) -+ oc as n -+ oc. The result about liM infn-o, gn(x) is obtained similarly.

The question we answer in the next chapter is "How fast do
multivariate Student's t problem converge?" The Markov chain theory described in Section 4.2 will allow us to formulate and then answer a more precise version of this question.














CHAPTER 4
GEOMETRIC ERGODICITY OF
4.1 Introduction

Meng and van Dyk (1999) suggested that MA with an improper prior could be applied to the univariate version of the Student's t problem (d = 1), and van Dyk and Meng (2001) extended this to the multivariate case. These authors provided considerable empirical evidence suggesting that this new algorithm converges much faster than the standard data augmentation algorithm. However, they stopped short of formally proving anything about the convergence rate of the underlying marginal chain. In a discussion of van Dyk and Meng (2001), Hobert (2001a) showed that when d =1 and n = 2, the marginal chain converges to stationarity at a geometric rate; that is, the chain is geometrically ergodic (Meyn and Tweedie, 1993, Chapter 15). In this chapter, we extend Hobert's results to the case d > 1 and n > 2. This is the first general, rigorous analysis of an MCMC algorithm based on a non-positive recurrent Markov chain.

There are two important practical applications of the results established in this chapter. First, if geometric ergodicity is established through drift and minorization conditions, as it is here, then formulas like those of Rosenthal (1995) and Roberts and Tweedie (1999) can be used to calculate an appropriate burn-in before sampling begins. In other words, given a starting value for the Markov chain, one can calculate an exact upper bound on the number of iterations required to get the total variation distance to stationarity below a pre-specified level. Second, geometric ergodicity implies the existence of central limit theorems that can be used to assess the Monte Carlo error of estimates of posterior quantities


55






56


of interest. This assessment can be carried out in a number of ways including regenerative simulation (Mykland, Tierney and Yu, 1995; Hobert, Jones, Presnell and Rosenthal, 2002) and time series-type methods (Geyer, 1992; Chan and Geyer, 1994).

The rest of this chapter is organized as follows. The necessary Markov chain background is provided in Section 4.2. The main results are given in Section 4.3 and they are applied to a simple illustrative example in Section 4.4. Finally, some concluding remarks are given in Section 4.5.

4.2 Markov Chain Background

In this section we give some basic definitions and then briefly describe what drift and minorization conditions are and how they can be used to establish geometric convergence of Markov chains. For a more general discussion of the ideas described in this section, see Jones and Hobert (2001). Let X = {ilg=


be a time-homogeneous discrete-time Markov chain defined on the probability space (Q, 9, Pr) that takes values in the measurable state space (X, -). For t E N, let Pt(x, A) be the t-step transition kernel 1 for X; that is, for all x E X and all A c -4,

P'(x, A) = Pr(X,+t E AIX, = x) V s E {0, 1, 2,...}.

The following definitions can be found, for example, in Nummelin (1984) or Meyn and Tweedie (1993), but we state them here for completeness of the exposition.


1 The superscript will be omitted when t = 1






57


Definition 4.1. The Markov chain X is called 7-irreducible if there exists a measure 'y on - such that, whenever -y(A) > 0, we have L(x, A) > 0 V x E X,


where

L(x, A) := Pr(X ever enters A|Xo = x). Definition 4.2. If 7r is a or-finite measure on - such that


r(A) = j P(x, A) r(dx) V A C -4,


then ,r is called an invariant measure for the Markov chain X.

Definition 4.3. The 7r-irreducible Markov chain X is periodic if there exist an integer d ;> 2 and a collection {Ao, A1,... , AdI} of nonempty and disjoint subsets in 4 such that, for all i =0,...,d - l and all x c Aj, P(x, Aj) = 1 forj = i +1 mod d.


Otherwise, X is called aperiodic.

For the rest of the section, let +: {A E - : 7r(A) > 0}.

Definition 4.4. The Markov chain X is called Harris recurrent if it is 7r-irreducible and for every set A G 4+,


Pr(rA = OCIXo = x) - 1 V x E A, where the occupation time random variable, nA := IA IA(X,), counts the number of visits to the set A.

We will say that a Markov chain satisfies the usual regularity conditions if it:

(a) possesses an invariant probability measure 7r;

(b) is 7r-irreducible;

(c) is aperiodic; and






58


(d) is Harris recurrent.

Under the usual regularity conditions, for every x E X, we have the following limit for the total variation distance between P'(x,-) and 7(.):


sup P'(x, A) -(A)| |P'(x,-)-(-)| -*0 as t-* oo .


If this convergence takes place at a geometric rate; that is, if there exist a constant

0 < # < 1 and a function M: X -- [0, oc) such that for any x E X,


Pt(x, -) - ir(-) 1 M(x) k0

for all t C N, then X is said to be geometrically ergodic.

One method of proving that X is geometrically ergodic and, simultaneously, getting an upper bound on M(x) #' is by establishing drift and minorization conditions. There are several ways to do this (Meyn and Tweedie, 1994; Rosenthal, 1995; Roberts and Tweedie, 1999). The version we describe here is based on the work of Rosenthal (1995). A drift condition holds if for some function V : X -+ [0, oc),


and some constants A E (0, 1) and L E R PV(x) < AV(x)+ L V x C X,


where

PV(x) := V(y)P(x, dy).

An associated minorization condition holds if for some probability measure H on e and some e > 0

P(x, A) > EH(A) V x C S V A e -, where S := {x E X : V(x) < l} with I being any number larger than 2L/(1 - A). Together, drift and minorization imply that X is geometrically ergodic, and once






59

drift and minorization conditions are established, Theorem 12 from Rosenthal (1995) gives a formula for calculating an upper bound on M(x) <. This bound can be used to compute a sufficiently large burn-in (see Section 4.4 for an example).

4.3 Multivariate Student's t Problem We now establish drift and minorization (and hence geometric ergodicity) for van Dyk and Meng's (2001) MA algorithm for the general multivariate Student's t problem described at the beginning of Section 2.1. The algorithm is based on the multivariate analogue of (3.19). Conditional on (p, E, a), let (y,, qj), j E N, be iid pairs satisfying the relations yjjqj,, oa ~Nd P, --qN (4.1)

qj Ip, E, a ~ F V, .

d+1
We take the prior on (t, E, a) to be 7r(pi, E) w(a), where 7r(pt, E) = 2E and w(a) = a1-J(0,O0)(a). Let g*(~~ag*(q, , E, ay) g*(qlpt, E, a, y) := (q ,LozIy (4.2)
fg* (q, p, E, a Iy) dq

and
a q, , E, a f, y) g*(q, y, E, aIy) (4.3)
f, fs f,,dg*(q, tz, E, ozjyd Eda
be the "conditional densities" based on (4.1). As in Subsection 3.3, these are legitimate densities despite the fact that the posterior g*(q, E, Eay), based on (4.1), is improper. As a matter of fact we can even derive the explicit distribution of (4.2) and (4.3). First,

qjIt, , , Fv + d (yj - p_) E_(yj - t) + v
qglE~~y~ 2 ' 2a'






60


independently for j = 1,..., n. Furthermore, it is easy to identify all the conditionals in the decomposition


g*( , E, aq, y) = g*(pzIE, a, q, y) g*(E la, q, y) g*(alq, y) Indeed,

plE,a,q,y Nd A, )


a, q, y ~ IWd n - 1, a q3(y - (yj - [1)


and

aq,y IG , where ft = q.- j gy and by X ~ IWp(k, A) we mean that the p x p

random matrix X has Inverse Wishart distribution (see; e.g. Mardia, Kent and Bibby, 1979) with density

- k_+p+le tr(A'X' )
f1w,(X; k, A) = e )_)
2 2 7r 4 JAlI f1 r( (k + I - i)) It is straightforward to show that everything from the univariate case, described in Example 3.1, follows through; that is, we can use the conditionals in (4.2) and (4.3) to construct a non-positive recurrent two-variable Gibbs sampler that has a marginal chain


Ky(p, EM /' E') = j g*(t, E, aq, y) g*(q lI', E', a', y) dq da,
R+ g

where a' can be taken to be any positive number.

We now prove a lemma that will be needed in establishing the drift condition.






61


Lemma 4.1. Let A be an m x m positive definite matrix. Then


xT(xxT + A)-lx < 1 V x E R". Proof. Applying the Sherman-Morrison-Woodbury formula (see, for example, Golub and Van Loan (1989), page 51) we obtain


(xxT + A)-1 = A-1 - A-x(1 + xTA-lx)-IxT A-. Therefore,

xT A-IxxT A-1x t2 t
xT(xxT + A)-l x' A-'x - = t - = < 1,
1+ XTA-'x 1+ t 1+ t

where t = xT Alx > 0, since A is positive definite.

The following result yields the drift condition: Theorem 4.1. Define
n
V(u, E) = Z(y - p)TE-(yi - tt).
i=1

If nu> 2 and v + d> 2 then

(n - 1)2(v + d) ndv nv(n - 1)(nv + nd - 2)
PV(pu', E') < V(u', E') + +
- (nv - 2)(v + d - 2) nv - 2 (nv - 2)(v + d - 2)

Proof. We will evaluate the expectation of the drift function using several layers of iterated expectations:

PV(pL', E') = E [E(V(pt, E)jq, y)|pL', E', y]

= E E [E(V(pL, E)Joz, q, y)lq, y] p', E', y

= E (E{E[E(V(pt, E)E, a, q, y)la, q, y] q, y I, E', Y)

The expectation on the left-hand side is with respect to the Markov transition density K(t, YEup', E') and the first equality follows from the properties of the






62


Gibbs sampler. The other two equalities are nothing more than the usual iterated expectation.

Starting with the innermost expectation, we obtain



n
E [S - [V)p, E(y) ,) oz q, y]

n
= E[y-1 TE - )| E, a, q, y]
= Y E~l-yi - yi P-l - PT -1 T -L


n

1


n
- yE-21 - fTE Y=1


= yi


=T (nda
-~ y _ ft)E-(yi _ ft) + _


i=1


V *


Now if X ~ IWP(k, A), then X-1 Wishart,(k, A) and EX- = kA (see, for example, Mardia et al. (1979), page 85). Thus,


ndan
E [V(p, E)ja, q, y] = -+ E(y

nda n
q . ji 1


A)


-1


- ,)T [ qj(yj - )(yj -- )T
j=1 .I


Since vn > 2, conditional on (q, y), a has a finite mean given by E(a q, y) = v . Therefore


nd
E [V(pu, E) q, y] = -E(aI q, y)
q.


n
+ (n - 1)E(aoq, y) 7(y
i= 1


ndv
=n - 2


-1


n
[L)T Iq (yj- l)(yj -A)T
j=1I


(yi - A)


(4.4)


v(n-1 n -'
nv-2 , [ (Yj (Y z -


Now, we need a bound for the quantity


n - -1
(y- )(y - )I -1 -


n
q. (y - 2)1


(4.5)


ft)T E (E-1 la, q, y) (yi


+ n ^T E-I + tr (E-1






63


which is obtained by applying Lemma 4.1 to the ith component in (4.5) as follows

-n - _I
(yi -A)T Yqj(yj- A)(yj - 1)T (yi - A) (y-p )T If)
-11
=(yi - E ) (yj - t)(yj - ,)T +(yi - )(yi - ) (yi - A) < .
q 5 q q

Therefore from (4.4) we obtain that

ndv v(n - 1) 1 ndv v(n - 1 q
E[V(p-,E)|q,y] < n + q . --- + n+
n-)2 n1-2 q, nv-2 nv -2 E qFinally,

EY[IV(A, E) I p', E']
ndv v(n - 1) v + d (yi - [')T(E')-1(yi _ t') + _v 1
nv - 2 nv - 2 v + d - 2 (y - T(E'(y - p') +v J

ndu v(n - 1) (v +d)(n -1 1
nv-2 +n-2 v(v+d- 2)

and some rearrangement yields the result.

Remark 4.1. This result agrees perfectly with Hobert's (2001a) result in the case where d = 1 and n = 2.

Remark 4.2. A similar result was obtained by Jones (2001) for the case

d = 1, but our result is uniformly better in the sense that our coefficient in front of V(p' a-') is always smaller and doesn't depend on the data.
The last thing we need is the minorization condition. The following is a

straightforward generalization of the minorization developed in Hobert (2001a).

Theorem 4.2. Fix 1 > 0 and let S = {(t, E) : V(1p, E) < l}. Then the Markov transition density Ky(pu, Ejpt', E') satisfies the following minorization condition


K,,(p, E Ip', E') ;> --h (p, E I y) V (/', E') E S,






64


where the density h(pu, Ely) is given by


h(u, El y) = g*(p, E, z Iq, y) g(q) 1 dqda,
Jo iw~ [~=1 f0o g(t) dtj

(f,= ' g(t) dt)n and the function g(-) is given by

v+d v v+d v+l
g(q) = F ( 2 , 2 ; I(0,q*)(q) + F ( ' d ; I(q.o)(q),


where 17(a, b; x) denotes the Gamma density with parameters a and b evaluated at the point x and q* = ld log(1 + Together, Theorems 4.1 and 4.2 show that for many (d, v, n) triples,
Corollary 4.1. The Markov chain (b*A is geometrically ergodic if nV > 2, v + d > 2 and
(n - 1)2(v + d)
(nv - 2)(v + d - 2)
When v > 2, condition (4.6) is satisfied for

v2 +vd+2d+ \v+d-2)[d(V2+4v-8)+v(v+4)(v-2)]
2(v + d)

Roughly speaking, all the conditions of Corollary 4.1 hold when n < v if d > 2 and when n < v - 1 if d = 1.

Of course, ideally, we would like to be able to say that 1, v/ > 0 and n > d + 1. The most restrictive of the three conditions in Corollary 4.1 is clearly (4.6) which is a direct result of the upper bound we use for the quantity (4.5) in the proof of Theorem 4.1. (Note that up until this point in the proof, all of the expectations are evaluated exactly.) Our experience suggests that finding a closed form expression for the expectation of (4.5) is impossible. Furthermore, alternative upper bounds that one might consider using are either not as sharp or have other undesirable consequences. For example, note that our bound on (4.5) holds uniformly in y. If






65


a bound depending on y is utilized, then the coefficient in front of V(1', E') in the drift condition will also depend on y. This could lead to the unfortunate situation where the sufficient conditions for geometric convergence of the Markov chain actually depend on the data. It is our belief that a substantial improvement on Theorem 4.1 will likely require the use of a different drift function, V(pt, E), which basically means starting over from square one.

Amazingly, the standard DA algorithm is also geometrically ergodic. We prove this fact in the next theorem which is an analog of Theorem 4.1.

Theorem 4.3. Let V([y, E) be defined as in Theorem 4.1. If v + d > 2 then

EY[VpE)ft' ,] v+d-2 v+d-2

where Ey[.I.] is with respect to the Markov transition density of the standard DA chain.

Proof. It is easy to check that for the standard DA we have the following facts. First,

Ev + d (y2 - ) (yj - p) + v


independently for j = 1, ..., n. Second,


ptjE,q,y ~Nd , )


and

Elq,y ~IWd -1, (q(yj - -)(y,-f).
j=1
Now we have that

Ey[V(p, E)Jp', E'] = E[E(V(p, E)jq, y)|p', E', y]

= E (E [E(V(pt, E)JE, q, y)lq, y] pt', E', y.






66


Skipping some of the details we obtain


[V(, E-)Iq, y] = + (n-i) (y2 - t) [ g(y - E)(y - ) (yi -A) q. j=1

Now using the fact that q.I 5 I _ q- as well as the results established earlier in the proof of Theorem 4.1, we have


E [V(p, E)jq, y] < (n + d - 1) .


Finally,


E [V(p, E) E',E'] < d I [(yi - ')T (E')-I(yi - ) +v,

and some rearrangement yields the result. l

The above theorem combined with the fact that Theorem 4.2 also applies to standard DA shows that the standard DA algorithm is also geometrically ergodic.

4.4 Numerical Example

Suppose that d = 2, v = 10, n = 3 and that the data are: yi = (-5, -2)T, y2 = (1, 5)T, y3 = (-10, 4)T. In this section, we will use our results in conjunction with Theorem 12 from Rosenthal (1995) to calculate sufficient burn-in for 4DM*A and for the standard DA algorithm. We will also take advantage of the fact that it is possible to get iid samples from the target in this particular case (see Section 2.2) to examine empirically how long it takes for the Markov chains to converge.

For DIA, Theorem 4.1 yields

6 66
E.[V(p, E) I ', E'] < AV(p', E') + L = -V(ti', E') + .
35 7

In order to apply Rosenthal's Theorem 12, we need to choose an I > b 22.76 and an r E (0, 1). Taking I = 27.241 and r = 0.04 leads to q* ~ 0.58 and this yields E ~~ 0.00158. Finally, suppose we start the chain with tto = (0, 0)T and Eo = I.






67


Then our Theorems 4.1 and 4.2 in conjunction with Rosenthal's Theorem 12 yield


|P'[(1O, Eo), - r(y)ll < (0.9999368199)' + (183.3793103)(0.9996034577)' , where P ,[(pO, Eo), denotes the t-step Markov transition kernel corresponding to MA and 7r(.y) denotes the probability measure corresponding to (2.1). It is easy to verify that

|P47411[(PO, Eo), ] - 7r(-ly) < 0.05.

Hence, a burn-in of 47,415 iterations for (*MA is more than enough. To put this in perspective, it takes about a minute to run 1,000,000 iterations of this Markov chain (on a slow laptop).

The reader may wonder how we chose these particular values for 1 and r. Because Rosenthal's bound holds for all 1 > 2L/(1 - A) and all r C (0, 1), the user is free to choose the values that lead to the smallest upper bound. In our applications, we simply used a grid-search to find the (approximate) minimizers.

For the standard DA algorithm, Theorem 4.3 yields

2
E5

Taking 1 = 46.017 and r = 0.019 (and starting the chain at the same initial values as the MA chain) yields


R'[(pLo, Eo),] - 7r(-y)l < (0.999998668)' + 23(0.9999831867)' ,

where R' [(io, Eo), -] is the t-step Markov transition kernel corresponding to the standard DA algorithm. In this case, we need 2,249,050 iterations to get the total variation distance below 0.05. That's nearly 50 times as many as are needed for MA!
It is tempting to conclude from this comparison that, in this particular case,

*MA converges faster than the Markov chain underlying standard DA . Keep in






68


mind, however, that we are comparing two upper bounds. It is still possible, albeit unlikely, that standard DA gets within 0.05 of the stationary distribution sooner than
Recall that n is exactly equal to d + 1 in the example we are considering and hence it is possible to make iid draws from ir(. y) in this particular case. Thus, for fixed t, we can compare ir(-y), P [(pto, Eo), -] and R [(o, Eo), -] empirically through random samples generated from the three distributions. A method of drawing a random sample from r(-ly) is given in Section 2.2. Making iid draws from P [(pto, Eo), -] and R [(pio, Eo), -] is simple - just run independent copies of the Markov chains (starting each at (pO, Eo)) and take the tth iteration of each chain.

We simulated random samples of size 106 from 7r(-1y), P'[(tto, Eo) .]

P 3[((o, Eo), ], and Pl0[(po, Eo), -] as well as the corresponding distributions under standard DA. The 5-dimensional samples are compared on the basis of three univariate quantities: pi, o-u (the first diagonal element of E) and 0-12 (the offdiagonal element of E). The univariate comparisons are made via density estimates calculated using the density() command in the R programming language (under the default settings). The results are shown in Figure 4-1. These plots suggest that the standard DA algorithm takes a bit longer to converge than MA and that both algorithms have converged after about 10 iterations. Thus, the empirical results suggest that the burn-ins calculated above are extremely conservative. It should be kept in mind, however, that nothing conclusive about the total variation distance (between the 5-dimensional distributions) can be said based on such empirical comparisons.











69


C



0
6



0 C'j 0 ,


PI


-20 -10 0 10

























-20 -10 0 10























20 I 0
-2 -1 0 10


0






0


0


0 50 100 150 200











0
0






















o I I
0
0



0








0





0
0





0 50 100 150 200


01


Figure 4-1: Empirical comparison of Markov chains and the target




From left to right, the three columns correspond to the parameters [ti, a and 0-12. From top to bottom, the three rows correspond to 1 iteration, 3 iterations and 10 iterations. Each plot contains three density estimates and each density estimate is based on a random sample of size 106. The solid line, dashed line and dots correspond to the iid sample, 'biA and the standard DA algorithm, respectively. Since "iterations" are irrelevant for the iid sample, the same density estimate is shown in each column for the iid case.


0
0 C\j



0 r

-5 5 0




0


0

0





0 (0







0




0
0


0



-50 0 50 100


0 -


1
























3 10






70

4.5 Concluding Remarks

The results in Section 4.4 contribute to mounting evidence that upper bounds on the total variation distance calculated using Rosenthal's (1995) Theorem 12 (or the related results of Roberts and Tweedie (1999)) are typically very conservative. Indeed, Rosenthal (1995, Example 1) demonstrated such conservatism in the context of a simple Markov chain whose exact convergence rate is known. In the Rejoinder of van Dyk and Meng (2001), an empirical estimator of total variation distance was used to show that the amount of burn-in suggested by Hobert (2001a) (calculated using Rosenthal's (1995) Theorem 12) is probably much more than is actually needed. By applying both the techniques of Roberts and Tweedie (1999) and those of Rosenthal (1995) in the same example, Jones and Hobert (2004) illustrated the potential conservativeness of results based on Roberts and Tweedie's (1999) techniques.

On the other hand, suppose that one of these techniques is applied in a particular situation (where it is impossible to sample directly from the target) and the result is that n iterations is a sufficient burn-in. Even if we believe (say, because of the mounting evidence alluded to above) that n is probably much larger than is actually necessary, there is no harm in burning in n iterations, unless this is simply too time consuming. Thus, from a "better safe than sorry" standpoint, even an extremely conservative upper bound on the necessary amount of burn-in can be useful.

We conclude with a few comments regarding a possible generalization of our results. The basic DA algorithm underlying all of the results in this dissertation is based on the fact that the location-scale Student's t density can be written as a mixture of normals. For example, the d-variate Student's t density with v degrees






71


of freedom, ft,, can be written as


ft,(y) = (2fr)-2 q2 e-2 dG(q) , (4.7)

where G is the Gamma(v/2, v/2) distribution function. Now consider a more general problem where the data come from a density having the same form as (4.7), but where G is the distribution function of any positive random variable. Bayesian regression models with errors of this form have been studied by many authors including West (1984), Liu (1996) and Ferndndez and Steel (1999, 2000). Perhaps general versions of Theorems 4.1 and 4.2 could be developed to handle this problem.














REFERENCES


Affleck-Grave, J. and McDonald, B. (1989). Nonnormalities and tests of assest
pricing theories, Journal of Finance 44: 889-908.

Box, G. E. P. and Tiao, G. C. (1992). Bayesian Inference in Statistical Analysis,
Wiley, New York.

Brownlee, K. A. (1965). Statistical Theory and Methodology in Science and
Engineering, John Wiley, New York.

Chan, K. S. and Geyer, C. J. (1994). Comment on "Markov chains for exploring
posterior distributions" by L. Tierney, The Annals of Statistics 22: 1747-1758.

Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, John Wiley, New
York.

Eaton, M. L. (1989). Group Invariance Applications in Statistics, Institute of
Mathematical Statistics and the American Statistical Association, Hayward,
California and Alexandria, Virginia.

Fang, K.-T. and Zhang, Y.-T. (1990). Generalized Multivariate Analysis, SpringerVerlag, Berlin.

Fernandez, C. and Steel, M. F. J. (2000). Bayesian regression analysis with scale
mixtures of normals, Econometric Theory 16: 80-101.

Geweke, J. (1993). Bayesian treatment of the independent student-t linear model,
Journal of Applied Econometrics 8: S19-S40.

Geyer, C. J. (1992). Practical Markov chain Monte Carlo (with discussion),
Statistical Science 7: 473-511.

Golub, G. H. and Van Loan, C. F. (1989). Matrix Computations, The Johns
Hopkins University Press, Baltimore and London.

Green, J. A. (1987). Sets and Groups, Routledge and Kegan Paul, London and New
York.

Hobert, J. P. (2001a). Discussion of "The art of data augmentation" by D.A.
van Dyk and X.-L. Meng, Journal of Computational and Graphical Statistics
10: 59-68.

Hobert, J. P. (2001b). Stability relationships among the Gibbs sampler and its
subchains, Journal of Computational and Graphical Statistics 10: 185-205.


72






73


Hobert, J. P., Jones, G. L., Presnell, B. and Rosenthal, J. S. (2002). On the applicability of regenerative simulation in Markov chain Monte Carlo, Biometrika
89: 731-743.

Jeffreys, H. (1939). Theory of Probability, Clarendon Press, Oxford, U.K.

Jones, G. L. (2001). Convergence Rates and Monte Carlo Standard Errors for
Markov Chain Monte Carlo Algorithms, unpublished PhD dissertation,
University of Florida, Gainesville.

Jones, G. L. and Hobert, J. P. (2001). Honest exploration of intractable probability
distributions via Markov chain Monte Carlo, Statistical Science 16: 312-334.

Jones, G. L. and Hobert, J. P. (2004). Sufficient burn-in for Gibbs samplers for a
hierarchical random effects model, The Annals of Statistics 32: 784-817.

Karlin, S. (1968). Total Positivity, Vol. 1, Stanford University Press, Stanford,
California.

Kelejian, H. H. and Prucha, I. R. (1985). Independent or uncorrelated disturbances
in linear regression, Economics Letter 19: 35-38.

Lange, K. L., Little, R. J. A. and Taylor, J. M. G. (1989). Robust statistical modeling using the t distribution, Journal of the American Statistical Association
84: 881-896.

Lavine, M. (2003). A marginal ergodic theorem, in J. M. Bernardo, A. P. Dawid,
J. 0. Berger and M. West (eds), Bayesian Statistics 7, Oxford University
Press, Oxford.

Liu, C. (1996). Bayesian robust multivariate linear regression with incomplete data,
Journal of the American Statistical Association 91: 1219-1227.

Liu, J. S., Wong, W. H. and Kong, A. (1994). Covariance structure of the Gibbs
sampler with applications to comparisons of estimators and augmentation
schemes, Biometrika 81: 27-40.

Liu, J. S., Wong, W. H. and Kong, A. (1995). Covariance structure and convergence rate of the gibbs sampler with various scans, Journal of the Royal
Statistical Society, Series B, Methodological 57: 157-169.

Liu, J. S. and Wu, Y. N. (1999). Parameter expansion for data augmentation,
Journal of the American Statistical Association 94: 1264-1274.

Mandelbroit, B. (1963). The variation of certain speculative prices, Journal of
Business 36: 394-419.

Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis,
Academic Press, London.






74


Meng, X.-L. and van Dyk, D. A. (1999). Seeking efficient data augmentation
schemes via conditional and marginal augmentation, Biometrika 86: 301-320.

Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability,
Springer-Verlag, London.

Meyn, S. P. and Tweedie, R. L. (1994). Computable bounds for geometric
convergence rates of Markov chains, The Annals of Applied Probability 4: 9811011.

Mykland, P., Tierney, L. and Yu, B. (1995). Regeneration in Markov chain
samplers, Journal of the American Statistical Association 90: 233-241.

Nelson, C. R. and Plosser, C. I. (1982). Trends and random walks in macroeconomic time series: Some evidence and implications, Journal of Monetary
Economics 10: 139-162.

Nummelin, E. (1984). General Irreducible Markov Chains and Non-negative
Operators, Cambridge, London.

Olver, P. J. (1999). Classical Invariant Theory, Cambridge University Press,
Cambridge, United Kingdom.

Praetz, P. D. (1972). The distribution of share price changes, The Journal of
Business 45: 49-55.

Ripley, B. D. (1987). Stochastic Simulation, John Wiley and Sons, New York.

Roberts, G. and Tweedie, R. L. (1999). Bounds on regeneration times and convergence rates for Markov chains, Stochastic Processes and their Applications
80: 211-229. Corrigendum (2001) 91: 337-338.

Rosenthal, J. S. (1995). Minorization conditions and convergence rates for Markov
chain Monte Carlo, Journal of the American Statistical Association 90: 558566.

Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions
by data augmentation (with discussion), Journal of the American Statistical
Association 82: 528-550.

Tierney, L. (1994). Markov chains for exploring posterior distributions (with
discussion), The Annals of Statistics 22: 1701-1762.

van Dyk, D. A. and Meng, X.-L. (2001). The art of data augmentation (with
discussion), Journal of Computational and Graphical Statistics 10: 1-50.

West, M. (1984). Outlier models and prior distributions in Bayesian linear
regression, Journal of the Royal Statistical Society, Series B 46: 431-439.






75

Zellner, A. (1976). Bayesian and non-Bayesian analysis of the regression model
with multivariate student-t error terms, Journal of the American Statistical
Association 71: 400-405.














BIOGRAPHICAL SKETCH

I was born in Bulgaria, in 1974. My educational background consists of a high school diploma in Mathematics and a master's degree in Mathematics (with emphasis on Probability and Statistics) from Sofia University. Shortly after receiving my master's degree in 1997, I went to the Bulgarian army to fulfill a 9-month duty required by law. Immediately after that, in 1998, I was appointed as an Assistant Professor of Statistics in the Department of Economics and Business Administration at Sofia University. By that time I had already decided to pursue a Ph.D. degree in the USA. Luckily, I was accepted at the University of Florida's Department of Statistics, where I began studying in 1999. After graduating from the University of Florida, I will move to New York City to work as an Assistant Professor in the Department of Statistics in the Zicklin School of Business at Baruch College, City University of New York.


76








I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation an is f ily adequate, in scope and quality, as a dissertation for the degree of Docto f P losophy.



Jam s P. Ho t, Chair
Associate Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philoso



Brett D. Presnell
Associate Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.


Jam s G. Booth
Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

aw
Adao A. Trindade
Assistant Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.


Murali Rao
Professor of Mathematics








This dissertation was submitted to the Graduate Faculty of the Department of Statistics in the College of Liberal Arts and Sciences and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August 2004
Dean, Graduate School




Full Text

PAGE 1

MONTE CARLO METHODS FOR POSTERIOR DISTRIBUTIONS ASSOCIATED WITH MULTIVARIATE STUDENTÂ’S t DATA By DOBRIN MARCHEV A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2004

PAGE 2

Copyright 2004 by Dobrin Marchev

PAGE 3

To the memory of my grandfather, Dimiter

PAGE 4

ACKNOWLEDGMENTS I would like to thank Jim Hobert for his support throughout the last 3 years. His guidance, endless questions and rigorous criticism brought the necessary quality to my work. Without his insights, this dissertation would have never been the same. I would also like to thank Brett Presnell, Jim Booth, Alex Trindade, Murali Rao and Mike Daniels for serving on my committee and spending their valuable time with my dissertation. To Brett Presnell, I am particularly thankful for stimulating discussions throughout my studies at the University of Florida. I learned a lot about probability, mathematics, and LaTex from him. Above all, I thank Pavlina for her love and for sharing with me all the moments of happiness and grief. Finally, I thank my parents, Angel and Liliana, who supported me and pointed me in the right direction for the first 25 years of my life. IV

PAGE 5

TABLE OF CONTENTS page ACKNOWLEDGMENTS iv LIST OF TABLES vii LIST OF FIGURES viii ABSTRACT ix CHAPTER 1 INTRODUCTION 1 1.1 Prom the Normal to the StudentÂ’s t Model 1 1.2 Outline 4 1.3 How to Sample from an Intractable Density 4 2 MONTE CARLO METHODS FOR THE STUDENTÂ’S t MODEL .... 7 2.1 The Model 7 2.1.1 Definition 7 2.1.2 Propriety of the Posterior 8 2.2 Making Exact Draws from the Target Posterior 13 2.3 Simulation Examples 24 3 DATAAUGMENTATION ALGORITHMS 26 3.1 Introduction 26 3.2 Standard Data Augmentation 27 3.3 Meng and van DykÂ’s Extensions 30 3.3.1 Conditional Augmentation 31 3.3.2 Marginal Augmentation 32 3.3.3 Marginal Augmentation with Improper Prior 34 3.4 Connection with Group Theory 39 3.5 Understanding the Artwork of van Dyk and Meng 47 3.6 Other Problems with van Dyk and MengÂ’s Lemma 1 51 4 GEOMETRIC ERGODICITY OF FOR THE MULTIVARIATE STUDENTÂ’S t PROBLEM 55 4.1 Introduction 55 4.2 Markov Chain Background 56 v

PAGE 6

4.3 Multivariate StudentÂ’s t Problem 59 4.4 Numerical Example 66 4.5 Concluding Remarks 70 REFERENCES 72 BIOGRAPHICAL SKETCH 76 vi

PAGE 7

LIST OF TABLES Table page 2-1 Acceptance rates of the rejection sampler when d — 2 25 2-2 Acceptance rates of the rejection sampler when d — 3 25 vii

PAGE 8

LIST OF FIGURES Figure page 4-1 Empirical comparison of Markov chains and the target 69 viii

PAGE 9

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy MONTE CARLO METHODS FOR POSTERIOR DISTRIBUTIONS ASSOCIATED WITH MULTIVARIATE STUDENTÂ’S t DATA By Dobrin Marchev August 2004 Chair: James P. Hobert Major Department: Statistics Let / denote the posterior density that results when a random sample of size n from a d-dimensional location-scale StudentÂ’s t distribution (with v degrees of freedom) is combined with the standard non-informative prior. We consider several Monte Carlo methods for sampling from the intractable density /, including rejection samplers and Gibbs samplers. Special attention is paid to the Markov chain Monte Carlo (MCMC) algorithm developed by van Dyk and Meng who provided considerable empirical evidence suggesting that their algorithm converges to stationarity much faster than the standard data-augmentation Gibbs sampler. In addition to its practical importance, this algorithm is interesting from a theoretical standpoint because it is based on a Markov chain that is not positive recurrent. We formally analyze the relevant marginal chain underlying van Dyk and MengÂ’s algorithm. In particular, we establish drift and minorization conditions showing that, for many (d, u, n ) triples, the marginal chain is geometrically ergodic. This is the first general, rigorous analysis of an MCMC algorithm based on a non-positive recurrent Markov chain. Moreover, our results are important from a practical IX

PAGE 10

standpoint since geometric ergodicity guarantees the existence of central limit theorems that can be used to calculate Monte Carlo standard errors; and because the drift and minorization conditions themselves allow for the calculation of exact upper bounds on the total variation distance to stationarity. Our results are illustrated using several numerical examples. x

PAGE 11

CHAPTER 1 INTRODUCTION 1.1 From the Normal to the Student’s t Model Scientists have built all kinds of models to explain the surrounding world. Statisticians, in particular, use stochastic models in an attempt to describe and summarize data; and to predict future observations. The most widely used statistical models are based on the assumption that the data are normally distributed. For example, a very simple model might assume that the sample data, yi, ,y n , are realizations of independent and identically distributed (iid) normal random variables. That is, !/i ~N(/i,a), i = l,...,n, (1.1) where // 6 R and a € R+ := (0, oo) are the unknown parameters denoting mean and variance, respectively. To accommodate more complex sampling designs, the model in Equation (1.1) can be generalized in many ways. For example, it can be extended to multivariate scenarios allowing to be vectors (of possibly different dimensions), or it can be assumed that the mean y is dependent on other variables (regression analysis). One reason these models have been studied so extensively is that they can be analyzed relatively easily. Unfortunately, there are situations where the normal distribution is inappropriate because of its light tails. A possible solution to this problem is to change the model in a different direction. Instead of the normal distribution, one can use a distribution with heavier tails. A natural candidate is the Student’s t distribution. 1

PAGE 12

2 Therefore, can replace the assumption in Equation (1.1) with yi~t(v,ii,a), i = l,... t n, (1.2) where t(v,y,a) is the location-scale StudentÂ’s t distribution (Section 2.1.1) and u is the degrees of freedom (assumed to be known). This idea is not new. Lange, Little and Taylor (1989) extensively listed articles and books using different t models, claiming that the first use dates back at least to Jeffreys (1939), who fit it to a series of astronomical data. Of course, the assumption in Equation (1.2) can also be generalized in many ways. Lange et al. (1989) described results from applying different generalizations of (1.2) to all sorts of real data sets, ranging from the stackloss data set of Brownlee (1965) to the traffic-accident data of Draper and Smith (1981, p. 191). Another modification of the model in Equation (1.2) is given by Zellner (1976) who considers y (jq, . . . ,y n ) T to be multivariate t distributed with identity covariance matrix, instead of assuming that the individual tiS are independently t distributed. Notice that, contrary to the normal case, these are not equivalent models, as it is suggested by Kelejian and Prucha (1985). Heavier-tailed distributions are especially needed when modeling the high volatility inherent in financial data sets (see; e.g., Mandelbroit, 1963). In particular, the market model is a multivariate linear regression extension of (1.2) that relates the returns on assets ( y ) to an overall market return (embedded in y). Substantial empirical evidence shows that the market model error distribution should be heavier-tailed than the multivariate normal (see; e.g., Affleck-Grave and McDonald, 1989). Praetz (1972) considers a similar model that describes the change in the log of several share-price index series with a t distribution, and compares the fit (by X 2 -test) to three other models, concluding that only the t model is acceptable, even at 0.01 level of significance. Many studies concerned with economical or financial applications, like Fernandez and Steel (2000), use Bayesian t models, changing the

PAGE 13

3 model of Equation (1.2) to Vi\^cr ~ t(u, //, a), i = 1,. . . ,ra; a ~ 7r(/i,
PAGE 14

4 of the posterior. Geweke (1993) was also among the first to suggest a Gibbs sampler to get draws from the posterior and to actually apply his results to a real data set involving U.S. macroeconomic time series, studied by Nelson and Plosser (1982). van Dyk and Meng (2001) and Meng and van Dyk (1999) show how to improve on the standard Gibbs method for sampling from the t posterior, but they do not deal with the problem of the propriety of the posterior. 1.2 Outline The purpose of this dissertation is to present and compare different Monte Carlo methods for sampling from the posterior resulting from Equation (1.4). The organization is as follows. In Chapter 2 we define the most general version of the StudentÂ’s t model that will be considered herein, and we prove a result that guarantees the propriety of the posterior. We then show how to use classical Monte Carlo methods in order to sample from the posterior, and illustrate with some examples. Chapter 3 deals with MCMC methods of getting draws from the target posterior associated with our model. We first introduce the standard data-augmentation algorithm, and then we attempt to explain the new MCMC methods developed by Meng and van Dyk (1999), van Dyk and Meng (2001), and Liu and Wu (1999). In our view, these papers did not provide a rigorous foundation for these new methods. In this chapter we attempt to lay such a foundation. Finally, in Chapter 4, we state and prove our results about the geometric ergodicity of van Dyk and MengÂ’s algorithm. Before moving to Chapter 2, we conclude this chapter with a brief description of various methods for obtaining draws from an intractable density. 1.3 How to Sample from an Intractable Density Suppose we are given a density, f(9), where 6 G 0. We may be interested in computing a particular numerical characteristic of /, like its mean or standard

PAGE 15

5 deviation; but sometimes, the complexity of the expressions may not allow us to do the computations directly. In such cases, there are several methods for estimating or approximating the properties of interest of /. These include asymptotic approximations, numerical integration, and sampling. Sampling methods (also called Monte Carlo methods) are ways of generating samples from a density that equals / or that approximates /. Using such samples, we can estimate the characteristics of the target density, /. For example, if we denote the sample by 9o, # 1 , • , then there are strong laws of large numbers that guarantee (under some regularity conditions) that f g(0)f(0)dd a.s., n *=o as n — > oo for any integrable function g : 0 — » M. There are many different sampling methods. In some cases, it is possible to sample directly from /, and thus obtain an independent and identically distributed (iid) sample. When direct sampling is infeasible (especially when we don’t know the normalizing constant of /), a possible way to get iid draws from the target density / is the rejection sampler. Here is a brief description of this method. Suppose we know how to sample from the density h{0), 6 £ 0 (often called a candidate) and we also know that j
PAGE 16

6 chain with stationary density /. The most commonly used MCMC algorithms are the Metropolis-Hastings algorithm and the Gibbs sampler, which itself can be viewed as a special case of the Metropolis algorithm. Tierney (1994) gives a good overview of MCMC methods for exploring intractable densities. In this dissertation we consider the special case when / is the posterior density that results when data from a multivariate location-scale StudentÂ’s t distribution are combined with the standard non-informative prior, van Dyk and Meng (2001) developed an efficient MCMC algorithm for sampling from /, and provided empirical evidence suggesting that their algorithm converges to stationarity much faster than the standard Gibbs sampler. We formally analyze the Markov chain underlying Meng and van DykÂ’s algorithm. Specifically, we derive drift and minorization conditions, which can be used to get rigorous bounds on the total variation distance to stationarity, and to do regenerative simulation.

PAGE 17

CHAPTER 2 MONTE CARLO METHODS FOR THE STUDENT’S t MODEL 2.1 The Model In many statistical models, the observations are assumed to be normally distributed. For example, inference procedures for the classical linear regression and ANOVA models are based on the normality assumption. In many applications, however, there is a need for heavier-tailed distributions. In particular, observations on daily returns of stocks (and financial data in general) prove to be incompatible with the assumption of normality. The Student’s t model is a relatively tractable heavy-tailed alternative. 2.1.1 Definition The model we consider is as follows. Let // and E denote a d x 1 mean vector and a d x d covariance matrix, respectively. Also, let u > 0 be fixed and known. Assume that, conditional on (fx, E), the d-dimensional random vectors y x , . . . , y n are iid td{y, /x, E); that is, y x has density ft d {y iPA.S) S |-i r (^) a*) t £ l (vi a *) ~ ' V Let f(y\fx, E) denote the joint density of y = (y v . . . , y n )\ that is, n f(y\n, £) = JJ f td (yi, v, n, S). 1=1 We take the prior on (/ x , E) to be the standard noninformat ive prior for a multivariate location-scale problem; that is, 7r(/x, E) oc lEj ^ (see, e.g., Section 8.2 of Box and Tiao, 1992). We show later in this section that (no matter what the value of v) the resulting posterior is proper if and only if n > d + 1. Assuming that 7

PAGE 18

8 n > d + 1, the posterior density is of course oc m(y) ^ + _ /X )T S -1^ _ ( 2 . 1 ) i= 1 where with m{y)= f [ f(y\p, S)tt(/u, JS J M d { rf(d-t-l) 'I z G R 2 : unvech(z) is p.d. > , where unvech(z) is the symmetric matrix corresponding to the elements of z and “p.d.” means positive definite. Our ultimate goal will be to draw samples from (2.1), but before developing methods for sampling we will establish the propriety of the posterior. 2.1.2 Propriety of the Posterior We will prove that f(pi, E|y) is proper if and only if n > d + 1. We start with the following lemma, which seems obvious but could not be found in the literature. Lemma 2.1. Suppose that O C R p and let {f(x\9) : 9 G 0} be a family of densities indexed by 9 such that the support, X = {x : f(x\9) > 0}, does not depend on 9. Let x \, . . . , x n be iid from f(-\9) and suppose that n(9) is an improper prior density yielding the posterior f{9\xi , . . . , x n ) oc il f(xi\e) i = 1 ir(«) . ( 2 . 2 ) If there exists an n! G N := {1,2,...} such that the posterior is proper; that is, such that c{xi , . . . , x n >) = / Je II /W0) _i=l n(9) d9 < oo for all (xi, . . . ,x n t) G X n> , then the posterior (2.2) is proper for all n > n 1 .

PAGE 19

9 Proof. If n > n' we have, H/(zi|0) , 2=1 7T (9)d0= [ Je 1 [ f{xi\0) \\f( x i\0) i=i 7 r(0) dO = c(xi, . . .,x n >) / Je n f( x i\ e ) i=n , -\1 nr =1 /N0M0) do . c{x i, . . . ,X„/) Now since the quantity in curly brackets is a proper (prior) density on 6, the integral is finite and the posterior is proper. Lemma 2.1 can be stated to hold for almost all x\,...,x n (not all x\, . . . , x n ), but we will omit such technicalities even though we use the lemma when it holds only outside a set of measure 0. Here is our main result about the propriety of the posterior. Theorem 2.1. Let A denote Lebesgue measure in R d ". The posterior, E| y), is proper for X-almost all y if and only if n > d + 1. Proof. The goal is to show that // Js JR d /(2/|^,S)7r(/x,S)d/xdE < oo (2.3) for A-almost all y , if and only if n > d + 1. Instead of analyzing this integral directly, we will analyze a related integral that is more tractable. This integral involves the augmented likelihood, which is based on the so called “missing data” q := (qi, . . . , q n ) T € Q, where Q is usually a subset of some Euclidean space (detailed in Chapter 3) Conditional on (/x, S), let (y { , qf), i = 1, . . . , n, be iid pairs satisfying the relations yMuh ~ N d ( /x, — Qi 9,1^, s ~ r ( 5 , 5 ) . (2.4)

PAGE 20

10 Then the implied joint density of y and q is f{y, S) i = 1 and it is easy to see that /(y,9|A*,S)dqr = /(y|/x,S), where /(j/|/z, £) is as in Equation (2.3). Therefore, f [ /(2/|/i,S)7r(/x,S)d/xdS = f [ [ f(y, q\y., £) tt(/u, S) dqdudZ Js JR d JS J& d J R" Of course, by Fubini’s Theorem, we are allowed to integrate in any order we like (notice that the integrand is nonnegative). First, f{y,Q _ (?) nu 2 u-\-d— 2 2 wf'© in 5 ' where q. := Y^j=iQj‘ But I S I ^ ® -m) t s1 (y, -m) j ^ Qj(Vj ~ y) = ^qjyjE 1 y j -2^q j yJS 1 n + q.tA T 'Z 1 [i j = i 3=1 j=i = (m A) T ( (m A) A T A + XJ S_ 1 A> > where A = 91 Zj=i QjVjThus > / /(y.gl/^EM/x, s)d/x dlR d _ (1)^ (n?.,®)^ («•)-*«-*• (2.5) n+d — SI 2 e

PAGE 21

Also notice that 11 Xfc3/J S l Vj A t (j) A = j = i tr 1 y j -q.fi T 'Z . 1= 1 = tr -w-AA t vl = l We now assume that n > d + 1 and will show that the posterior is proper for A-almost all y\ that is, we will show that (2.5) is integrable with respect to both £ and q. To find the integral with respect to S, we use the fact that the Inverse Wishart density f m<1 n 1, (X" =1 q^y^yj ~ 9-AA 4.3, integrates to 1. That is, we have T -r | , defined in Section .// JS JR d f(y, q\n, £) 7r(/x, S) d/2 d£ (ill ntuQtn-.)) w v+d .— 2 2 (
PAGE 22

12 Hence, if n = d + 1, we have E 3 = 1 IjVjVj 9-AA T <1 m' i,t [p.-p.pI] M<« g.‘‘|M< 1 >| 2 |P.-p.pJ| 9 j |m (1) | 2 |p,||/-p; 1 p,pJ| slMOip Ii-p1p. _1 p. q d \M^ fl p, 3 = 1 =,.->iM<')| 2 n®, j=i where the fourth equation follows from an elementary property of determinants that can be found in Fang and Zhang (1990, p. 3). Therefore, when n = d + 1 we have /(2/,g|/J,£)7r(/i,E)d/zdE i _n?=.r(i(i+i-i)) A (I ) 1 d(d+ 1 ) H r (f) i^_i qjv Qi e 2 ( 2 . 6 ) which is clearly integrable in q. Thus, assuming | )T)" =1 (IjUjU] ~ Q 7^ 0 (which is true for A-almost all y), it follows that the posterior is proper when n — d + 1, and because of Lemma 2.1 that the posterior is proper for n > d + 1. Now, to show that the posterior is improper when n < d, denote Q = Sj=i QjVjyJ ~ 9AA~ and notice that by the transformation Z = £ _1 with Jacobian J = \Z\~^ d+l ^ (see; e.g., Fang and Zhang, 1990), we have L 2 d£ i^QZ) izi^dz. Since Z is a p.d. matrix, by the Cholesky decomposition, it can be uniquely represented as Z = XX T , where X is a lower triangular matrix (l.t.m.)

PAGE 23

13 with positive diagonal elements, and the Jacobian of this transformation is J = 2 d nti Therefore [ e~^ QZ ^\Z\^dZ = 2 d [ e-^ QXxT )\X\ n d ~ 2 f[xlr i+1 dX Js Jr = 2 d [ e -^ l{QXxT) T\xlr i+l dX = 2 d [ e 4£?=i R i= 1 i= 1 xl~ w dX, where d(d+l) R = {(ni,r2i,r22,r 3 i, . . . ,r dd ) e R 2 : m > 0 , V i = 1 , . . . , d}, and Xi for i — 1 , ,d are the columns of X. Recall that n < d implying that Q is non-negative definite (x T Qx = 0 for at least one non-zero x £ M' z ), which means that Q = T t T, where T is a l.t.m. with non-negative diagonal elements, at least one of which is zero. Let tkk — 0 and observe that the integral f e~5 ESU *7 TT xl~ i+l dX = [ e"* Eti(^) T (T*,) ^n-i+i d x i = i ^ R i=i diverges because the coefficient in front of Xkk in the exponential is 0; that is, d f e -t£?=i( Tx<)T ( Tx<) 17 xl~ i+l dX = c [ e^ tkkXkk x n k ^ k+1 dx kk = [ Jr i=\ 'A* ; J m x n— fc-t -1 kk OO, where the first equality is justified by Fubini’s theorem even if c = oc. 2.2 Making Exact Draws from the Target Posterior As we now explain, a byproduct of the proof of Theorem 2.1 is a method for making exact draws from the target posterior, f(/x 1 E|y), in the special case where n is exactly d + 1. First notice that /(/*>E|y) = [ f(q,v,'£\y)dq, Jr*

PAGE 24

14 where /(«>/*, £|w) f(y \q, AT S)/(q |/x, E)tt(/x, E) Jq Is /r" /( 2 /I 9 . AT S)/(q [|/x, S)tt(ax, E)d/xdEdq' Thus, our problem would be solved if we could make draws from /(q, /x, E|y) since we could then just ignore the q component. Now, since f(q, y, E| y) = f(q\y)f(£,\q, y)/(/x |E, q, y ), we can make draws from /( d + 1, the qi s are no longer conditionally independent, and thus drawing from f(q\y) is no longer straightforward. However, as we will show, it is possible to construct rejection samplers for f(q\y) that are reasonably efficient when d is and and exact methods for sampling from these distributions are well known. Fur2. Draw E ~ TW d n 1, (E"=i QjiVj ~ A )(Vj ~ A) T ) 3. Draw /x ~ j). small.

PAGE 25

15 We now consider the problem of sampling from v+d—2 2 71 (n«) 1 ( q.y 1 c-*«Ylvyjy] -QAA T 1=1 n— 1 2 Notice that Equation (2.7) can also be written as --i 2 1 f(q\y) °c INm e 2^?= 1 9i. , 2—1 d q.2 ELi ^ 2 / j 2/7 ~ 4-AA T But ^2 qjVjV] 9-AA T = 9~( y j ~ A)(l/i A) T =: q-C(q, y ), 1=1 1=1 <7 ' and therefore Equation (2.8) becomes f{q\y) « ( II ^ ) e 2^r= ,z=l 1 ft OE.i 9 .)’ n «< v*=i e 2 oc d 9-2 \q-C{q,y) | V . d. (r GU*) 1 \C(q,y)\^ (nr.i *)“ (2.7) ( 2 . 8 ) (2.9) Since the first part is just a product of iid T(|, |) densities, we might consider a rejection sampler in which the candidate is exactly this product of Gamma densities. To do this, we need a bound for d (nr=. k)‘ \C{g,y)\ 71—1 * (2,10) The following lemmas are needed to establish our bound. Lemma 2.2. Suppose that Y is a discrete random vector putting mass pi at y if i = 1, 2, . . . , n. Then cov{y) = 52 PiPj(yi-yj)( y iy j) r l
PAGE 26

Proof. Let X be an independent copy of Y . Then notice that \e(X Y)(X Y) T = ^E(X M + /, Y)(X M + M r) T = ^E(X M )(X v ) 1 + |E(M ~ E)(M Y) T + iE(X f*)( M Y) t + |E( M Y)(X ^) T = -cov(X) H — cov{y) = cov(V), 2 2 where /lx = E (Y) = E(X). Now apply Equation (2.2) to get cov(y) = ±E(X v)(x r) T = W i=l j=l = PiPitVi Vj)(Vi ~yj) T i
PAGE 27

17 Let’s now go back to the question of how to bound the expression in Equation ( 2 . 10 ). Theorem 2.2. For d£N,n>d+l,y£ R dn and q £ R" , d (IE* “)‘ < 1 \C(q,y)\ n -i f-y-'diy)’ where d(y) = C yi-y2 ) 2 ifn = 2 Ci(y) = n TgCJ1 + 1 —J c\ ( y ) d ( n— 2 ) cn(y)*n=V 2d\(n — d— 1)! \ — d(n— 2 ) 1 ' ifn > 2, M M r(T) (n— 2)! and is an {n — 1) x d matrix, given by ( M (i) = T \ (Vi-y i) (l/i (s/i yi+ i) T (2.11) \ ( yi-y n ) T for i = 1, . . . ,n. Proof. First notice that when n = 2, d must be 1, and in this case both sides of (2.11) are equal to ( yi l y2 y For the nontrivial case when n > 2, denote Pi = ^,i = 1, . . . , n and notice that because of Lemma 2.2 d (nr=i ”)‘ SU P ru-T = sup qe Ri \C(q,y)\ n pev (FELi Pi) d T,i
PAGE 28

18 What follows is true for any Pk,k = 1, . . . , n; but for ease of exposition, weÂ’ll show the details only for p\. First, notice that in the sense of matrix ordering 1 , Y PiPjiVi ~ Vj)(Vi VjV > Y PlPi ( y i ~ Vi)(Vi y ^> iT )J( F * M(I> ) Tec" Since P*M (1) = ^ P2(yi 2 / 2 ) t ^ ^ PniVl VnV j r(T) (2.13) for any T G we can write M (i) r(T) Thus, and consequently from Equations (2.12), (2.13) and the fact that if A > B then | A | > |P|, it follows that m (1)T p*m (1) E (nn /Vfb) 1V1 r(T) TeC ,n-l Vier / Y PiP ^ y i ~ y ^ y i y i) ipi E M (1) r(T) Tec"1 VieT / (2.14) 1 For nonnegative definite A and B , A > P iff A -P is nonnegative definite

PAGE 29

19 Now by the inequality between the arithmetic and geometric means, we obtain e (riK+i) Xecj -1 / M (i) r(T) > n — 1 n (it« TeC7 -1 \®er , M (i) CO ( V )' 1 (2.15) Notice that each index i G {1, . . . , n} is contained in exactly Q_j) elements of the set (7? 1 Therefore, Tecj -1 *ex implying that n m ( !L r(T) 2 -Tecj -1 VieT / err 1 ovr 1 (2.16) = (P2 -Pn)"" 1 n Im. ( 1 ) r(T) ^TeC" Prom Equations (2.14), (2.15) and (2.16) we obtain that Y^PiPitVi ~ Vj)(Vi ~ yj) T i / -i \ n— 1 ^ -L \ d(n— 1) d Pl (P2 ---Pn) Ci(y), (nr=i Pi) d < i 1 Zi^PiPAvi Vj)(Vi ~ Vj) T n-l — (nl\ n -! d(n2) / \ ( d ) Pl C l(P) By symmetry, for any k G N, C UhPiY < £i
PAGE 30

20 Thus, gnu po EkjW(i/i Vj)(yi yjV and therefore sup (n« f. 1 ,7»; |C( 9 ,»)|-> ' ~ 2) c t (v) Finally, we will show that 1 1 Let sup mm —j . — = ~r. r . P eV 1 < i < n pS n a(y ) d(y) bi = Ci(y)~ d ^ n ~ 2) L = ± b , i = 1 * bi P ^b. Then sup min , _ 0 , P epl<» pG'P l 0 V i G N„, it is clear that if p — p* , then —d(n— 2 ) Pi max — l p*.

PAGE 31

21 In the univariate case (d = 1), the constants Ci(y) can be simplified considerably. We state this simplification of Theorem 2.2 as a corollary. Corollary 2.1. For n > 2, y E R n and q E R” , [(n-ir-'d^y)]" 1 , where 1 " (9.3/) = ~ A) d\(y) = e i (y) i i T 1 f e n (y) n ~ -(n— 2) and e*(y) = Ifa “ %) 2 ’ * = !> • • > n Proof. First notice that C” -1 :={!,. ..,n — 1} and therefore %) 2 The rest of the proof is just substitution. The bound proved in Theorem 2.2 is not sharp, which reduces the acceptance rate of the resulting rejection sampler. When d = 1, we were able to find the exact supremum when n E {2,3} and this result is stated in Proposition 2.1. Unfortunately, we were unable to get a better bound in the general case, but we can state a conjecture about the exact supremum of (2.10) when d = 1, which is presented in Conjecture 2.1 Proposition 2.1. For n E {2, 3}, y E R n and q E R” , sup rt, (?) qeR” [v(q, y)} 71—1 [(« i)" 'c(y)] 1 (2.18)

PAGE 32

22 where c (y) = min %) 2 . (2.19) l 2( Vl y 2 )(y 3 yi ). (2.21) V P3 V P2 V Pi But (2/2 P3) 2 = (2/1 2/2 ) 2 + (2/3 yi ) 2 + 2(2/3 2/1) (2/1 2/2), and because of the assumed ordering, (2.21) is equivalent to 1 2 /i 2/2 / IP1P2 , / P2P3 ^ , 1 2/3 — 2 /i ^ / P1P3 , / P2P3 ^ , IP2P3 ^ 1 2 y 3 -yi VV w V pi J bti-aW P2 V Pi / V Pi “ Now an application of the inequality between the arithmetic and geometric mean gives us 1 22i 2/2 / /P 1 P 2 /P 2 PA 1 2/3~ 2/i / /P 1 P 3 /P 2 P:i \ 2 2 / 3 2/1 \V Ps V Pi / 22/1 — 2/2 \V P2 V Pi /

PAGE 33

23 Consequently, (2.20) will be established, if we can show that Pi 1 + mL + jm> lt pi which is obviously true. Notice that the last inequality becomes equality when pi — > 1 and p 2,3 0 with the rest of the inequalities suggesting that ^ • It is easy to verify that in terms of qiS this is satisfied when, for example, we take qi = t, q 2 = and q 3 = {yi V2) 2 t(y 3 yiY and let t — > oo. Then qi q92 <73 9i9t + I 1 I (yi-3/2) 2 * ' t(y3-yi) 2 fai-i/2) 2 t 2 ( 2 / 3 2 / 1 ) 2 i,l, fai-y2) 2 ^ t t(y3-yi) 2 0 as f > 00 , implying that hm (J™i + . /^) + 1^1 (.1^ + .lm) + I™. = 1 . t-oo2y 3 -t/i VV P3 \ Pi J 2y l -y 2 \\ P 2 \ Pi J V Pi Conjecture 2.1. For n>2,j/el n and q 6 nr., (“) sup qeK" [u(q, y)]’ rr = [(nl) n 1 c(s/)] ' ( 2 . 22 ) where c ( y ) = mm Y[(Vi ~ Pj ) 2 (2-23) Remark 2.1. Notice that the result of Theorem 2.2 provides an alternative proof of the fact that the posterior is proper for n > d + 1, because it shows that (2.6) is bounded from above by a quantity integrable in q.

PAGE 34

24 2.3 Simulation Examples The purpose of this Section is to study the acceptance rate of the rejection sampler proposed in Section 2.2 as a method for obtaining draws from our target density. Clearly, the bounds established in Lemma 2.2 and the subsequent corollary and conjecture depend on y, n and d. Furthermore, the candidate density depends on v. It is, therefore, interesting to investigate the behavior of the algorithm when changing the values of these parameters. Generally speaking, the results of our computer simulations show that the acceptance rate of the algorithm varies greatly depending on the specific data set, y, and suggest that it works better if either n is small and/or p is big. Also, the acceptance rate tends to be higher when the are equally spaced. Here are some specific examples. Example 2.1: Suppose a particular set of yi s is given by 2 y = (-1.449605, -0.996631, 0.228872, 0.068414, -0.126978, -0.563358, 0.766889) T and is assumed to have t\ distribution with p = 5. A computer simulation using the bound in the conjecture resulted in an Accept Reject algorithm with acceptance rate of 0.0384 estimated with a standard error of 0.00061 (using the well-known formula for the standard deviation of one observation from the Binomial distribution). Using the bound from Lemma 2.2 on the same data, results in acceptance rate of 0.00131 estimated with a standard error of 0.00011. Assuming that p = 10, changes the acceptance rates for the unproved and proved bound to 0.04141 (0.00063) and 0.00139 (0.00012), respectively (the numbers in parenthesis are the standard errors). Example 2.2: For this example we simulated Student’s t data with parameters p = 3, d = 2, and n — 7: 2 The actual values were obtained by simulation

PAGE 35

25 y l = (-1.0082, 1.5155) t , y 2 = (-0.39561, 0.066761) T y 3 = (1.4634, 2.5265) t , y 4 = (1.5163, 1.5953) T l/ 5 = (-1.0334, 3.0844) t , y 6 = (-1.5207, 0.21293) T 2/ 7 = (6.0723, — 0.069308) t , which produced a bound of 0.000668735. Table 2.3 gives the acceptance rates of the rejection sampler for different values of u along with their estimated standard errors and is based on 1,000,000 iterations. Table 2-1: Acceptance rates of the rejection sampler when d = 2 V Acceptance rate Standard Error 1 1.1 xl0~ e 3.31662 x 10~ 7 5 4.1 xlO -6 6.40311 x 107 10 4.1 xlO -6 6.40311 x 107 15 5.1 xlO" 6 7.14141 x 107 Example 2.3: In our final example, we first obtained another simulated sample from the multivariate Student’s t distribution, but this time with v — 5, d 3, and n = 6: 2 h = (-0.92497, -0.63609, -0.058751) t , y 2 = (0.32547, 0.0031071, 0.98505) T 2/3 = (0.77057, 1.8546, -0.086936) T , y 4 = (1.5351, -1.1889, -1.6686) T 2/ 5 = (0.037699, -1.1527, 0.29751) t , y 6 = (0.14760, 0.51222, -1.9081) 7 , which produced a bound of 0.0465003. Here are the acceptance rates for different values of u, based on 10,000,000 iterations along with their standard errors: Table 2-2: Acceptance rates of the rejection sampler when d = 3 V Acceptance rate Standard Error 1 2.3 xlO -6 4.79583 x 107 5 6.8 xl0“ 6 8.24618 x 107 10 6.7 xlO6 8.18533 x lO" 7 15 8.3 xl0“ 6 9.1104 x 107

PAGE 36

CHAPTER 3 DATAAUGMENTATION ALGORITHMS 3.1 Introduction In two related papers (Meng and van Dyk, 1999; van Dyk and Meng, 2001), X.-L. Meng and D.A. van Dyk developed two generalizations of the standard data augmentation (DA) algorithm. The basic idea underlying these generalizations is to construct a large class of candidate augmentation schemes by introducing a working parameter , a , into the standard augmentation scheme. In conditional augmentation , a particular value of a is chosen, while in marginal augmentation , a is actually incorporated into the augmented model by way of a working prior , u(a). In all of the examples considered by Meng and van Dyk, the marginal augmentation (MA) algorithm becomes more efficient as the working prior approaches a special improper form. The basic MCMC theory breaks down, however, when the working prior is improper. Indeed, when cu(a) is improper, the Markov chain driving the MA algorithm is not positive recurrent. On the other hand, this nonpositive recurrent chain is on an augmented state space that is larger than the parameter space. It turns out that, under regularity conditions, the non-positive recurrent chain possesses a well behaved marginal chain that has the correct (proper) invariant distribution. In this chapter, we consider in detail the general MA algorithm and show how it can be applied to the StudentÂ’s t problem. The chapter is organized as follows. In the next section we describe the standard data augmentation algorithm for dealing with intractable posterior densities like (2.1). Then in Section 3.3 we describe the generalizations of Meng and van Dyk. One drawback of their methods is that it is very hard to verify the conditions that guarantee convergence of the 26

PAGE 37

27 algorithm to the target density. In Section 3.4 we provide an alternative validation of the MA method that is inspired by the work of Liu and Wu (1999). Finally, in Sections 3.5 and 3.6 we highlight some problems with the proofs in van Dyk and Meng (2001). Throughout the chapter we present all the algorithms in their full generality and we use the StudentÂ’s t model as an example. 3.2 Standard Data Augmentation Suppose that f(y\0) is a likelihood and it(0) is a prior, with supports y and 0, respectively, yielding the proper posterior density /(%) f(y\e)n(G) m(y) (3.1) where m(y) := J e f(y\0)n(0)d0 < oo. Unless 7r(-) is a conjugate prior, the denominator, m(y), is typically an intractable integral and hence expectations with respect to this posterior (required for Bayesian inference) are not available in closed form. The posterior (2.1) is an example. In such cases, one way to analyze the posterior distribution is through Monte Carlo methods. Assuming that direct (iid) sampling is not an option, one might resort to an MCMC method where dependent samples are generated using a Markov chain with stationary density (3.1), and these samples are used to approximate expectations with respect to f(0\y). A good overview of MCMC methods can be found in Tierney (1994). The MCMC algorithms developed by Meng and van Dyk are extensions of the standard DA algorithm so this is where we begin. The basic idea of DA is to construct unobserved data , q 6 Q, that serve as a computational tool. To be specific, we introduce a joint density f(y,q \0), with support T x Q, such that (i) the marginal density of y implied by f(y, q\0) coincides with the original likelihood f(y\0)\ that is, [ f(y,q\0)dq = f(y\0) ,V6> e 0,Vy e y, Jq (3.2)

PAGE 38

28 and (ii) sampling from both conditional densities f(y,q\0)n(0) f(q\ 0 ,y) = and f{9\q,y)\= f(y,q\9)n(G) (3-3) f Q f(y,q\o)n( 0 ) d q f e f(y,q \o)v(0)d6 is feasible, if not convenient. Since f(y , q\0) is introduced purely for computational purposes it should not alter the posterior model (3.1). To see that condition (3.2) is sufficient to guarantee this, notice that the “augmented posterior” satisfies f(y,q\0)n{9) f(y,q\G)n(G) f(y, q\G)ir(G) f(0,q\y) := and therefore J© Iq /(i/» q\ 0 ) A 0 ) d q de f 0 f(y\°) n ( 0 ) dQ m(y) ft q i w I Q f(y^q\ 0 ) d q< 0 ) f(0,q\y) dq= = /(%) , (3.4) Jq m(y) our target density. The DA algorithm is a method for obtaining a sample {Gi , . . . , G n ] with the property that the sampling distribution of G n converges to (3.1), in total variation distance, as n — » oo. It is based on a discrete time Markov chain, ®da = {0,}£o, with Markov transition density Kba(G\G')= [ f{G\q,y)f{q\G\y)dq(3.5) J Q that is, given that the current state of the chain is G', the conditional density of the next state, G, is (3.5). Since [ K VA (G\G')f(G'\y)dG'= [ f(0\q t y)\ f f(q\G',y)f(G'\y)dG , ]dq = Je Jq Ue [ f( 0 \q,y)f{q\y)dq = [ f( 0 ,q\y)dq = f{0\y), Jq Jq

PAGE 39

29 our target density, f(6\y), is an invariant density for $daTherefore 1 , we can use $da to get approximate draws from f(0\y). More details about data augmentation can be found in Tanner and Wong (1987), Liu, Wong and Kong (1995) and Liu, Wong and Kong (1994). There are many ways to construct a joint density that satisfies (3.2), but here we will specifically consider the following hierarchical description: y\q,o ~ f{v\q,o) q\e ~ f{q\0). (3.6) Typically (3.6) represents some well known decomposition as is demonstrated in the next example. Example 3.1: Consider the univariate version of the Student’s t problem, described at the beginning of Section 2.1, with 9 = (//, cr); that is, assume yi,..-,y n are iid t\{y,n,a) with joint density f(y\y,a). The prior is ir(y,a) oc 1/a and the goal is to sample from the posterior f(y,a\y). There is a simple data augmentation algorithm for this problem that is based on the well known decomposition (that we express loosely as) y = y + y/az/^/q, where z ~ N(0, 1) and q ~ )(il v with z and Q independent. To be specific, “missing data” q = (qi , . . . , q n ) T are introduced such that, conditional on (y, a), ( yj , qj) are iid pairs satisfying Vj\qj,y,a ~ N Qj\y,o~ r (3.7) 1 Throughout this chapter, we assume (unless otherwise stated) that all Markov chains satisfy the usual regularity conditions described in Section 4.2

PAGE 40

30 where by X ~ T(a, b ) we mean that the density of the random variable X is proportional to x a ~ 1 e~ bx I(o t00 )(x) and xl — T (§, |). The reader is invited to verify that the marginal density of yj given (/r, a) under (3.7) is f tl (yj ; u, y, a) and hence, because of the independence assumption, (3.2) is satisfied. Thus, we are now in a position to apply the DA algorithm; i.e., we consider simulation of the Markov chain defined by (3.5). It is trivial to verify that, conditional on (y, cr, y), the random variables qi,...,q n are independent with , n ,T + 1 1 f{Vjy ) 2 , ,, qj\y,a,y ~ h ( — x— , I V v 2 ’ 2 a for j G N n . Furthermore, making draws from f(a, y\q, y) can be done in two easy steps by noticing that a\q,y ~ IG fn + 1 and vW,q,y ~ N where n i n 1 71 q= q j > £(«) = X] > *(«) = q i [ % A(g)] 2 , i=i i=i i=i and by AT ~ IG(a, b ) we mean that the density of A1 is /ig(s;q,&) = r^a+i Ao.oo)^); (3-8) i.e., X has the distribution of 1/y when Y ~ T(a, b). Thus, it is straightforward to simulate the chain defined by (3.5) in this case. 3.3 Meng and van Dyk’s Extensions The key idea underlying the DA algorithm is that it is easier to simulate the Markov chain $da than to sample directly from the target posterior. If, in addition, <3 >da converges geometrically, then the algorithm is even more useful because of supplementary results, like central limit theorems and exact bounds for the length of the burn-in period. Unfortunately, there is a general conflict between

PAGE 41

31 simplicity and speed. For example, the speed of the EM algorithm, which can be viewed as the deterministic analog of the DA algorithm, is inversely proportional to the amount of missing information. This suggests that a more clever augmentation may lead to a better DA algorithm. Indeed, this is the idea behind conditional and marginal augmentation, which are now described. 3.3.1 Conditional Augmentation Suppose we are able to introduce a working parameter, say, a G M+, into the standard data augmentation density f(y, q\0 ) yielding a family of densities, {g(y, r|0, a:)} Qe R + , each of which is a candidate for use in the DA algorithm. 2 One way to do this is via a family of one-to-one transformations D a : Q — > Q, indexed by the working parameter a. That is, for each a G R+, there is an analog to the hierarchical model (3.6): where y\r,0,a ~ g(y\r,0,a)i r\0 , a ~ g(r\6, a), g(y\r, 0 , a) = f(y\ D;‘(r), 0), (3.9) (3.10) and g(r\6,a ) is obtained from f(q\0) by change of variable r = D a (q)\ i.e. , dD~ l (r) g(r\9,a) = f{D a (r)|0) dr (3.11) Usually, the original augmentation scheme, f(y, q\0), is a member of the family with, say, a = 1. 2 To clarify the notation, we use / to denote densities associated with the standard DA algorithm and g for densities involved in van Dyk and Meng’s extensions. Similarly, q refers to missing data in the standard DA and r denotes missing data in CA and MA.

PAGE 42

32 Generally speaking, in order for each member of such a family to be a viable augmentation scheme, it must be the case that a is “invisible” in the marginal distribution of the observed data; i.e., we must have L g(y, r\0, a) dr = f(y\0) V a G R + , V y V0G0. (3.12) Notice that under the hierarchy (3.9), the above condition is automatically satisfied, since for each a € R+ / g{y,r\6,a)dr = / g(y\r, 0, a)g(r\0, a)dr = Jq Jq [ m DZ\r),e)S(D~\r)\9) dD °' (r) Jq dr = (3.13) f{vW^)f{q\0)dq = [ f(y,q\0)dq = f(y\0). Jq Once the validity of the CA is verified through equation (3.12), we can fix a. at any value that leads to relatively quickly converging Markov chain. For a specific value of a, the transition kernel of the corresponding Markov chain is K a (0\0')= [ g(d\r,y,a)g(r\G' 1 y,a)dr Jq and it is easy to show that f(0\y) remains its invariant density (since g(y, r\6, a) is a legitimate data augmentation scheme), van Dyk and Meng (2001) provide guidance as to choosing the value of a that corresponds to the fastest converging Markov chain by establishing connections with the corresponding EM algorithm. We do not pursue these ideas any further since we are mainly interested in MA, which is presented next. 3.3.2 Marginal Augmentation In MA, the working parameter a is incorporated into the augmented model through a proper working prior u(a). There are several different ways to use g{y, r\0, a), 7r(0) and u(a) to construct an MCMC algorithm with invariant

PAGE 43

33 density f(0\y) (Meng and van Dyk, 1999). Here we describe the two methods that Meng and van Dyk (1999) call Scheme 1 and Scheme 2 (see also Liu et ah, 1994). Assume that a and 6 are a priori independent; i.e. , u(a\8) = u(a) V 9 £ 0. Then 9u>(y, r \9) := / g(y,r\9,a)u(a)da J R+ is a viable augmentation scheme. 3 To see this notice that (3.12) and Fubini’s theorem yield / 9 u(y,r\G)dr = / / g(y, r\0, a)u(a)dadr Jq JqJ e + = f u(a) [ g(y,r\0,a)drda = f(y\0) [ u(at)da = f(y\9). J R + JQ J R + (3.14) Therefore, for each fixed proper prior, cu(-), we can make approximate draws from the target posterior by simulating the Markov chain, $ma = with Markov transition density K u (9\0')= [ g u (9\r,y)g u (r\9',y)dr , Jq where and 9w{0\r,y) := 9u(r\9,y) := f R+ g(y, r\9 , a)n(9)uj(a)da JeJZ g(y, r\9, a)'n(9)uj(a)doid9 f R+ 9{y , r \0, a)n(9)u(a)da Iq Jr+ 9(y, r\8, a)ir(9)u(a)dadr ' This is our version of Scheme 1. In Section 3.5 we show how it is related to Meng and van Dyk’s work and we give more details on how the chain is actually run. As expressed in the notation, the Markov transition density K u clearly depends on the choice of working prior co(a). Each different proper prior on a will result in a different version of $ma with its own convergence properties, van Dyk 3 It is important to recognize that g u (y, r\9) and f(y, q\9) are typically not the same.

PAGE 44

34 and Meng (2001, Section 4) give a heuristic argument suggesting that 4 >ma should converge faster as the working prior, uj(a), becomes more “diffuse.” For example, consider the following parametric family of IG priors for a where 8 > 0. The “diffuseness” of this density increases as 5 decreases 4 . Thus, it is tempting to consider what would happen to the convergence rate of 4 >ma under the improper prior u(a) = a~ 1 I( 0tOO )(a). However, it is clear from Equation (3.14) that gu>(r\0', y ) will be improper when u(a) is improper and hence K u is not defined in that case. Therefore, we need a new sampling scheme when u(a) is improper. It is called Scheme 2 and is described in the next subsection. 3.3.3 Marginal Augmentation with Improper Prior We now consider a different Markov chain in which a is actually one of the variables. Suppose that u(a) is proper and consider the joint density defined by The main idea of Scheme 2 is to use a two variable Gibbs sampler based on the above density with components (0, a) and r. To see that it produces an approximate draw from f(9\y) notice that equation (3.12) shows that integrating r out of (3.15) yields The reason for introducing this chain is not as an alternative to mai it may still be well-defined when the working prior is improper. When u ( a ’$) Qji+i ho,oo){oi) , 8 s e~ s/a g{9, r, a\y) oc g(y, r\9, a) u ;(<*) 7 r(9) . (3.15) 4 Notice that E a 1 < oo iff 8 > 2

PAGE 45

35 uj(a) is improper we have / / g(0,r,ot\y)drda oc f(y\9)n(0) / u(a) da = oo , -/k+ Jq 7r + so the density in (3.15) is improper. Consequently, under an improper prior on a, the Gibbs sampler based on (3.15) with components ( 0, a) and r will not be positive recurrent (Hobert, 2001b). However, in certain situations, it is still possible to construct a useful MCMC algorithm using (3.15) when u(a) is improper. Indeed, we have already seen that for each fixed (0, a) G 0 x M + , L g(6, r , a\y) dr = f(0\y)m(y)u(a) < oo . i Suppose further that for each fixed r £ Q, / g(0, r, a\y ) dO da < oo . Je (3.16) / (3.17) Then we can run a twovariable Gibbs sampler by iteratively sampling from the densities 9 -(r\e,c,y)= f r ’ a| . 3,) M and J Q g{0,r,a\y)dr g(0,r,a\y) fu+ fe 9(0, r > a\y)d0da g*(0,a\r,y) = (3.18) The “*”s are used to remind the reader that the “joint density” to which these conditionals correspond is improper. Following the notation of Meng and van Dyk (1999), we will call this method Scheme 2. Write the resulting Markov chain as $ — {( r i> (&i, a i)) } i=0 ft eas Y t° show that (3.15) is an (improper) invariant density for this chain, and hence (as mentioned above) is not positive recurrent. Furthermore, while = {rj}?8 0 and <3> 2 = { (^i, ctj) }^= 0 are themselves (nonpositive recurrent) Markov chains (see; e.g. Liu et ah, 1994), the marginal sequence *^ma ~ {0i}i^o induced by this algorithm is not necessarily a Markov chain nor

PAGE 46

36 is this sequence necessarily convergent. Before discussing in more detail we return again to Example 3.1. Example 3.1 continued: One common transformation is r = D a (q) = aq , which corresponds to the alternative representation ti(u, y, a) = y + y/a^z/^/r, where z ~ N(0, 1) and r ~ oiy^/v with z and r independent. The corresponding MA algorithm is based on the following hierarchal model. Conditional on (y, cr, a), let (yj, Tj),j = 1, . . . , n be iid pairs satisfying VjVji y,cr,a ~ N rj \y, a, a ~ T (3.19) with (y, a, a) ~ 7r(/U,
PAGE 47

37 easy to verify that g*(6, a\ r, y ) = g(y, a, a\r, y ) = g(a\r, y)g{a\a, r, y)g(y\a, a, r, y) n — 1 = {(rj, (/q, <7,, cq))}?! 0 , where r l — (ri 1 , . . . ,r in ). Also define $1 = M£o and $2 = {(Ait, Oi, ai)}So > which are themselves Markov chains. The improper density (3.20) is invariant for $ and consequently , $ 1 , and $2 are all non-positive recurrent with f(/j,,a\y)u>(a) being an invariant density for $ 2 . The fact that, under our choices for 7r(/q a) and u(a), the sequence $ma = 0 is a positive recurrent Markov chain with invariant density f(y,a\y) should follow by Lemma 1 from van Dyk and Meng (2001) or Theorem 4 from Liu and Wu (1999). That is, one can simulate the unstable Markov chain, $, and use the (//, a) component to explore the proper posterior density /(/q a\y). However, we have serious doubts about the validity of the results of van Dyk and Meng, and Liu and Wu. Consequently, we have developed our own proofs (see Sections 3.4 3.6). We now discuss the conditions under which the sequence is a positive recurrent Markov chain. Recall that $2 = {(#,, cq)}°^ 0 is a Markov chain with

PAGE 48

38 Markov transition density given by 1 , g*(0, a\r , y)g*(r\0', a', y)dr. A sufficient condition for = {0,}^ o to be a Markov chain is that the integral [ [ g*(0,a\r,y)g*(r\G\a',y)dadr. (3.21) Jq J r+ does not depend on a! . Another sufficient condition under which the sequence is a Markov chain is given in Meng and van Dyk (1999). Here we establish a connection between their condition and the integral (3.21). Lemma 3.1. Assume the hierarchical structure of Equations (3.9) (3.11). If fu+ g*(@’ a \D a '(q), y)da does not depend on a' , then <&* MA is a Markov chain with Markov transition density K M a{0\0')= f [ g*(0,a\r,y)g*(r\0',a',y)dadr. Jq Jm.+ Proof. Notice that by Equations (3.18), (3.15), (3.10), and (3.11) g*(r\0',a',y) g(0',r,a'\y) Jq 9{&'i r > Oi'\y)dr g{y\r , 0', a')g(r\0 a')i t(0')uj(ol') Jq 9(y\0', r, a')g(r\a', 0')n(0')uj(a’)dr f(y\D-f(r),0')f(D-}(v)\0') f{y\ e ') 9D ~Hr) dr Now, after change of variable q — D a f(r), we see that (3.21) can be written as f f * lo m f \ w /(y|«> j JJm y )da m#) dq = [ [ 9*(0,a\D a ,(q),y)da'f(q\0',y)dq, Jq J iu which combined with the assumption completes the proof. We were able to show that the condition of Lemma 3.1 is satisfied in a very general set-up (see Theorem 3.1). Here we conclude the section with an example

PAGE 49

39 that demonstrates that the condition of Lemma 3.1 is satisfied at least for the univariate t model. Example 3.1 continued: We now show that the condition of Lemma 3.1 is satisfied for the univariate t model. Using the definition of the IG density, we obtain that g*(n,a\r,y) oc r. L . [ °(r) + In ~ A(r)] 2 _ n( v+\) 2 r. 2 [cr(r)J 2 a 2 = i v | | ^ ~ AH 2 n(i/+l) r a / \ T li L — n ~t " 2 [cr(r)] 2 a 2 . cr cr J Finally, since cr(a'q ) = a(q) and jx{a'q) = /i(q), it follows that the condition of Lemma 3.1 is satisfied. For further details about MA with improper working priors, along with empirical evidence that it produces fast mixing MCMC algorithms, the reader is referred to van Dyk and Meng (2001). For more information on the use of nonpositive recurrent Markov chains in MCMC algorithms, see Liu and Wu (1999), Hobert (2001b) and Lavine (2003). Liu and Wu (1999) noticed that some of the properties of the Markov chain can be derived in a more general set-up if the family of transformations {D a (q)} ae u+ has a specific group structure. Following their ideas, we will show, in the next section that, in general, the target density f{9\y) is an invariant density for $m A , which provides peace of mind about the results for the multivariate t model. 3.4 Connection with Group Theory We start with some fundamental definitions that can be found, for example, in Green (1987). Definition 3.1. If A is a nonempty set, then every function from A x A to A is called a binary operation in the set A.

PAGE 50

40 Definition 3.2. A set A together with a binary operation is called a group if the following properties are satisfied. (a) Associativity: (cti • 0:2) • 03 = ol \ • 0:3) for all aq, a 2, 03 G A. (b ) Identity: there exists an identity element e G A such that e a. — a e — a for any a G A. (c) Invertibility: for each a G A, there exists a unique inverse element or 1 G A such that a • aT 1 = a~ l a = e. The functions that preserve the group operation play a special role in group theory. Here is the formal definition of such functions. Definition 3.3. A function D : A -> B between the groups ( A , •) and (B, •) is called a homomorphism if (a) D(a\ • a 2 ) = D(a 1 ) • D(a 2 ) for all 01,02 £ A; (b) D(e A ) = e B . We will now proceed to the main definition of this section, namely the notion of a transformation group , which can be viewed as a special case of the following general definition taken from Olver (1999). Definition 3.4. Let Q be a set. Define @(Q) to be the family of all one-to-one functions d : Q — > Q. t&(Q) is a group in which the group operation is defined by composition of functions. The identity transformation plays the role of the identity element, and the inverse of a transformation is the usual functional inverse. If the space Q is equipped with some additional structure (like a topology), then the class of possible functions included in S>(Q) may be restricted correspondingly. For example, when Q = R", 2>(Q) may contain all continuous invertible functions d : M n — > M n (recall that the composition of two continuous functions is also continuous).

PAGE 51

41 Definition 3.5. For a given group A, a transformation group, acting on a space Q is defined as any homomorphism D : A — > @(Q) mapping A to the group of invertible functions on Q, and is denoted by (.4., Q, D ). In other words, for each element o G A there is a corresponding invertible function D a : Q — > Q. Because this identification is a homomorphism the following properties are satisfied: (i) D ax o D a2 (q) := D ax (D a2 (q)) = D ax . a2 (q ) for all a u a 2 G A\ (ii) D e (q) = q, where e is the identity element of A ; (hi) D a -i(q) = D~ l (q) for any a e A, where (iii) is a consequence of (i) and (ii). Definition 3.6. For a fixed q G Q, the subset O q := { D a (q ) : a G A} C Q is called the orbit of q. Some authors use the term group action instead of transformation group (see, e.g., Eaton, 1989). Notice that the action of the group A on the space Q induces an equivalence relationship among the elements of Q, namely two points are equivalent iff they are in the same orbit. To see this, notice that (i) q G O q , since D e (q) = q; (ii) If q\ G O q2 then q 2 G O qx , since there exists anaGd such that q\ = D a (q 2 ), implying that D a -i{qf) = D a -i(D a (q 2 )) = D a -i. a (q 2 ) = D e (q 2 ) = q 2 , (iii) If qi G O q2 and q 2 G O q3 then q x G O q3 , since there exist 01,0:2 G A such that *7i Dociipif) ~ F) ax (D Q 2 ( 93 ) ) F) ax , a2 (§ 3 ) • Thus, the space Q can be partitioned into disjoint sets, corresponding to the equivalence classes (the orbits), induced by the above equivalence relationship. By the axiom of choice, we can choose a representative element for each of these equivalence classes. Denote the set of the representative elements by Z. Now, since

PAGE 52

42 the equivalence classes partition Q, each element q E Q must belong to a unique equivalence class. Thus, if z E Z is the corresponding representative of that class, then q E O z and there exists a E A such that q = D a (z) (see Example 3.2 below). In fact, it is easy to see that for any q E Q, there is unique z E Z and at least one a E A such that q = D a (z). This means that the map D : A x Z — > Q defined by D(a,z) D a (z) is onto. Suppose further that the index set Z can be chosen in such a way that the map D(a, z) is one-to-one and differentiable. When this is possible, the index set Z is called a smooth cross-section. Usually, the smooth cross-section, Z, has lower dimension than the space Q and it is convenient to write Z as the image of another space 7 Z via the one-to-one transformation Z\ i.e. Z = Z(1Z). Here is an example. Example 3.2: Let A be the group of all positive real numbers, R + , with product as the group binary operation. Obviously, 1 E R+ is the identity element and the reciprocal ^ = a -1 is the unique inverse of a E R+. Also let Q = R” and for any a E R+ define D a : R" — > R" by D a (q) := aq , which is a one-to-one transformation satisfying the conditions of Definition 3.5. The structure (R + ,R+, (D q } q6 r + ) is called the scale transformation group. Notice that the set of all orbits is the set of all lines in the first quadrant that go through the origin. They do not intersect since the point 0 is not in R” . A smooth crosssection is the part of the unit sphere that is restricted to the first quadrant; i.e., Z := {(? 1 ) • ) Qn) £ R+ : q\ H H 7n — !} Furthermore, any point q E R" can be represented as q = D a (z) = az , where a E R+ and z E Z with a := ||qr|| and 2 := Finally, any point

PAGE 53

43 z = (z \, . . . , z n ) G Z can be written as Z\ = cos r i cos r^ A is a Markov Liu and Wu (1999) claim to prove this result for any transformation group as long as the improper prior, u;(oi), corresponds to the density of the Haar measure of the family of transformations D a (q). Unfortunately, they did not provide adequate definitions and consequently it is very hard to follow the proof of their theorem, so we provide an alternative proof for the scale transformation group and invite the reader to attempt to follow the discourse in Liu and Wu (1999). Theorem 3.1. Assume the hierarchical structure of Equations (3.9) (3.11) and consider MA with improper prior uj ( a) — If the transformation is given by D a (q ) = aq, then $* MA is a Markov chain and Proof. We will establish that is a Markov chain on our way to showing that f(0\y) is an invariant density for Ki/[a{0\6'). First notice that based on the proof of Lemma 3.1, we can write chain and that the target density f(0\y ) is invariant for the transition kernel Ama9*(0, a\D a fq), y)daf(q\6 y)dq.

PAGE 54

44 Then where [ K MA {0\0')f(0'\y)d0' Je = [ f [ g*(O,a\D a '(q),y)daf(q\0 , ,y)f(O'\y)dqd0' Je Jq J r+ = [ / ^*(6>,a|D a /(qr),y)da [ f(q,0'\y)dO'dq Jq J k + Je = f [ 9*(O t Q is one-to-one and differentiable. Thus, we can make change of variables q = D s (Z(r )) with Jacobian J equal to the absolute value of the determinant of the n x n matrix f dD s (Z(r)) dD s (Z(r)) \ V ds dr ) In our case (the scale transformation group), D s {Z(r )) = sZ{r) and thus = Z(r)Therefore, J = s n_1 L(r), where L : (0, f ) n_1 — > R is a function of r only (this can be seen from the definition of determinant and the fact that s is a scalar in all of the elements of the matrix except those in the first column). Hence, we need to prove 9*(0, a\ D al (D s (Z(r))), y)daf(D s (Z(r))\y)s n l L{r)drds = f(0\y ), which is equivalent to a\a'sZ(r)),y)daf(sZ(r)\y)s n l L{r)drds = f(Q\y).

PAGE 55

45 Now by definition (see equation (3.18)), *(f) | ' 7( ) )= ff ( y ’ a ' sZ ( r )\ e i a)n(0)uj(a) g ,a a s r ,y / R+ f e g(y, a'sZ(r)\0, a)n(0)cj(a)d0da = f(y , D^ja'sZjr)), \9)ir(0)uj(a)J a (a , sZ(r)) f« + fe /(?/» D1 (a'sZ(r))\9)Tv(e)uj{a)J a (a , sZ(r))deda and since J a (q) := dD«\r) dr -^r, we have a n ’ g*(9,a\a'sZ(r),y) /(#, a 1 a'sZ(r)\y)m(y)u>(a)J a (a'sZ(r)) Ir+ f{u1 a l sZ(r)\y)m(y)uj(a)J ol {a'sZ(r))da /(a -1 Q:'sZ(r), G\ y)a~ n ~ l J K f(a~ l a'sZ{r) \y)a~ n ~ l da Another change of variable P = a 1 a!s with Jacobian 0 gives us [ f(a 1 a'sZ(r)\y)a n l da = [ f(pZ(r)\y){-^Jr+ Jr. \as R+ n+l / as P 2 f {PZ (r )\y)P n ~ 1 d(5 dp R+ 1 f{pZ(r)\y)p n l L(r)dp /»+ a's J L{r ) 1 \ n K z i r )\y) a's J L(r ) ’ where h(Z(r)\y):=[ u{s, Z(r)\y)ds J R+ and u(s, Z(r)\y ) is obtained from f(q\y) after change of variable q = D s (Z(r )) sZ(r). Therefore, g*(0,a\a'sZ(r),y) f(a 1 a'sZ(r),9\y)a n 1 / J_\n h(Z(r),y) la's/ L(r) Similarly, we can make the change of variable to obtain that 1 a'sZ(r), 0\y)a n x da = f 1 Y h(Z(r),0\y) \a's J L(r)

PAGE 56

46 where h{Z{r),0\y) := f u(s,Z(r),0\y)ds Jr+ and u(s, Z(r), 0\y) is obtained from f(q, 9\y) after change of variable q = D s (Z(r)) = sZ(r). Thus, J R+ 9 *( 6 ' a l D A D *( z ( r )))> V) da = h(z( r )\y) which does not depend on a'. Thus, by Lemma 3.1, the sequence is a Markov chain. Notice that all the results after the change of variable (3 = a~ l a!s depend heavily on the choice of prior ui(a). It is our belief that this is where the Haar measure plays a special role in the proof. However, it still remains to be investigated rigorously what happens in the general case. Now, in order to show that f(0\y) is the invariant density for the Markov chain < f>^ IA , we have to prove that / h ^hw \ fy / f(s z (r)\y)s n l L{r)dsdr = f(6\y), Jtz h(Z(r)\y ) J R+ or equivalently (by the definition of h(Z(r)\y)), f h(Z(r),0\y)dr = f($\y). (3.22) Jn But h(Z(r),0\y) := f u{s, Z(r), 0\y)ds = f f(sZ(r),6\y)s n 1 L(r)ds. J R + J IR_j_ Therefore, (3.22) is equivalent to f f f(sZ{r),G\y)s n l L(r)dsdr = f(0\y). Jn J r + Finally, the inverse transformation q = sZ(r) shows that the above is equivalent to 'Q f(6,q\y)dq = f(0\y),

PAGE 57

47 which we already know is satisfied. When the conditions of Theorem 3.1 are satisfied, one can simulate the nonpositive recurrent chain $ and actually use the marginal chain ^ A to explore the proper, but intractable posterior f(0\y). Moreover, this phenomenon is more than just a theoretical curiosity. Indeed, there is substantial empirical evidence suggesting that converges faster than $ma and much faster than TdaIt is not yet fully understood why an algorithm that is based on a non-positive recurrent Markov chain converges faster than related algorithms based on positive recurrent chains. 3.5 Understanding the Artwork of van Dyk and Meng In this section we will show that the descriptions van Dyk and Meng give to define their Markov chains agree with our definitions. We start with Scheme 1; i.e. , the conditional densities used in the expressions that follow are based on the joint density g(y\r, 6 , a)g(r\0, a)iv(0)u)(a) that is now based on a proper u(a). First note that g(r\0',y):= g(r,a\0',y)da = g(r\6' ,ot,y)g{a\& ,y)da. J R-|J M-j. But g(a\0',y)= / g(r,a\0' ,y)dr dr f g{y\r, 0\ a)g(r\0', a)n(0')uj(a) JQ fm+ Iq 9(V lu 0'Oi)g(r\G', a)-* (0')u(a)drda f Q g(y,r\0',a)druj(a ) = f(y\0')u(a) Ir+ Iq 9(y> r\0', a)u(a)drda f(y\0') J R+ c o(a)da w(q:). Therefore, g(r\0' ,ot.,y)u(a) da . (3.23)

PAGE 58

48 Hence, K u can be alternatively represented as K U {O\0) = f [ g(0\r,a",y)g(a"\r,y)da" [ g(r\0',a',y)u(a')da' dr. J Q m JR+ _ M-)_ (3.24) In Meng and van Dyk (1999) page 309 and van Dyk and Meng (2001) Lemma 1, the authors claim that the Markov chain corresponding to Scheme 1 (according to us, it has transition kernel (3.24)) can be simulated by performing the following steps: first draw cd from the prior o>(cd), and then draw r from g(r\0' , cd, y); second draw cd' from the posterior g(a"\r, y), and then draw G from g(0\r, cd', y). In other words, the following steps must be performed: 1. Draw q from f(q\0', y); 2. Draw cd from cn(cd); 3. Take r = D a >(q); 4. Draw 0 from g(0\r,y). However, Meng and van Dyk did not provide the transition kernel for their Markov chain which makes it harder to understand their idea. To see that van Dyk and MengÂ’s and our description agree, first notice the simple fact that if we need a draw from f(g) = J x f(x,y)dx, we can first get x ~ f{pc) and then y f(y\x), and the resulting y is marginally from f(y). Now notice that conditionally on a' , the variable r obtained in step 3 has density f(D~i L (r)\0',y)J a /(r), where J a '{r) dDj(r) dr Thus, the result of steps 1, 2 and 3 is a draw from J R+ f{D a }{r)\0\y)J a '{r)uj{a l )da'. But f(y m f{q\o\y)

PAGE 59

49 and therefore f(Dj(r)\G',y)j a >(r) = g{y\r,0',ot')g{r\0',a') f{y\D-}(r)^')f{D-}{r)\G')J a ,{r) f{y\o') Thus, the final result after step 3 is a draw from f(y \ 6 ') = g{r\G',a.',y). (3.25) f f(Dj(r)\G',y)J a >(r)uj(a , )da , = [ g{r\G', a', y)u(a')da'. J R+ J R+ This combined with step 4 makes it crystal clear that the two definitions of the transition kernel under Scheme 1 agree. We now briefly discuss the same issue but for Scheme 2. As noted in van Dyk and Meng (2001), Ama can be described similarly to what we did in Scheme 1. For any fixed a' G R + perform the following steps: 1. Draw q from f(q\G', y)', 2. Take r = D a >(q ); 3. Draw G from f R g*(G, a|r, y)da. Similarly to (3.25), the result of steps 1 and 2 above is a draw from g*(r\G' , a', y). Therefore, our definition for Scheme 2 agrees with the above. We now address the question about the stationary distribution of 4 >maThere are problems with van Dyk and MengÂ’s results concerning the invariance of f(G\y) for ^ A and these are described now. Since these authors never defined their Markov chains with transition kernels, we start by first restating their main result. As can be seen afterwards, this result might not even be true. The following theorem is an attempt to rigorously state Lemma 1 from van Dyk and Meng ( 2001 ). Theorem 3.2. Assume the condition of Lemma 3.1 is satisfied and let Kma{Q\ 0') he the transition kernel of $* MA when implementing Scheme 2 under an improper prior cu(a) and let k m me') be the transition kernel of $ma when

PAGE 60

50 implementing Scheme 1 under the proper prior uj m (a) , where m E N. Define (6 air v) = 0, a)g(r\0, a)n(0)u m (a) J e J R+ g(y\r,G,a)g(r\e,a)TT(9)oj m (a)dadO and, similarly, t I g \ . = 9(y\r , 0, a)g(r\0, a)n (9)u m (a) J Q J R+ g(y\r,d,a)g(r\e,a)TT(G)u m (a)dadr If the limit lim m _oo g m (d , a|r, y) exists and lim g m (0,a\r,y) = g*(0,a\r,y), ra— oo where g*(6,a\r,y) is calculated using the improper prior u (a), then km{9\0') — > Kma(9\0') asm — » oo, implying that the stationary distribution of &* MA is f(0\y). The fact that k^dlO') Kma{G\G') as m — * oo implies that f(0\y) is invariant for Kma, follows from Liu and Wu’s Lemma 1, which we restate here for reference. Lemma 3.2 (Liu and Wu). Suppose that K m (y\x) is a sequence of Markov transition functions, all having n(x) as invariant density. If K(y\x) := linim^oo K m (y \x) is a proper transition function, then i r is an invariant density of K. If true, Theorem 3.2 would provide justification for the use of the improper prior l o(a). Unfortunately, the “proof” of van Dyk and Meng has some gaps. More specifically, notice the following km{0\0') := f g rn (G\r,y)g m {r\0' ,y)dr JQ. ~ gm{0,Oi\r,y)g m (r\0' ,y)dadr JO. J K-u

PAGE 61

51 and by equation (3.23) km(d\e') = [ [ g m (0,a\r,y) f g(r\0 ' , a' ^^(a^da'dadr J 0 J] R_l JR+ g{y\r, 9, a)g{r\0 , a)7r(0)a; m (a;) Iq L + L + fe f R+ 9(y\ r , a)g{r\9, a)n{9)u > m (a)dad0 = f [ [ g*(0,a\r,y)g*(,r\0',a',y)x Jo Jr, Jr, g(r\9\ a ' , y)u m {a.')da! dadr x ’u; m (a) IelR + 9(y\r,9,a)g(r \9, ct)rr(9)(jj(a)dad9 _ u ( a ) fe Ir + 9(y\ r > &> a )9( r \ 9, a)ir(9)L0 m (a)dad9 u> m (a')da'dadr. Thus, it is clear that the assumptions of the theorem do not concern the convergence of the integral over Q, but only the convergence of its integrand. Assuming that uJm(cc) -> u(a) as m — >• oo (which van Dyk and Meng never say in their theorem) in conjunction with the assumptions of Theorem 3.2 can only be used to conclude that the limit of the quantity in the square brackets is 1. Even under some regularity conditions, that would allow us to pass the limit inside the integral, the only thing that can be concluded is that k m (0\O') — * f j j g*(6,a\r,y)g*(r\6’,a',y)uj(a , )da'dadr = oo asm J Q Jr.\00 . Meng and van Dyk (1999) provide other “checkable” and even more obscure conditions (Lemma 3 and Theorem 2) “ensuring” that d>^A is a positive recurrent Markov chain with f(9\y) as its unique invariant density. However, we do not pursue them any further. Instead, we will point out to another problem with their Lemma 1. 3.6 Other Problems with van Dyk and Meng’s Lemma 1 We conclude this chapter with an interesting question that arises when working with sequences of random vectors. The question is whether the convergence of the joint densities implies convergence of the corresponding marginal densities. Here we show that in general this is not true and give an example to illustrate that. The reason we are interested in this question is because in van Dyk and

PAGE 62

52 Meng’s (2001) “proof” of their Lemma 1 on page 10, the authors claim that the convergence of g m (6\r,y ) to g(0\r,y) follows from their assumption that g m (0, ol\ r, y) — > g(0, a\r, y) as m — * oo. Here we show that this is not true. Lemma 3.3. Let (X n , Y n ), n = 1, 2, . . . be a sequence of random vectors with values in ST x y and densities f n (x,y),n = 1,2,... (with respect to the Lebesgue measure). Suppose that f n (x, y) -> f(x, y), V (x, t/)ef x ^ as n -> oo, where f(x,y) is a density on 3£ x y. Then liminf / f n (x,y)dy = / f(x,y)dy a.s. n +0 ° Jy Jy Proof. By Fatou’s lemma we have liminf / f n (x,y)dy> / liminf f n (x,y)dy = / f(x,y)dy, n ->°° Jy Jy Jy for any x £ 3C . Applying Fatou’s lemma again we obtain / liminf / f n (x, y)dydx < liminf / / f n (x,y)dydx= 1. ./.ser n ^°° iy n > °° Jar Jy On the other hand, from (3.26) it follows / liminf / f n (x,y)dydx > / / f(x,y)dydx = l, Jar n_> °° Jy Jar Jy and thus, ix = 1 dx = 0. (3.26) / liminf / f n (x,y)dyda Jar n ^°° Jy / liminf / f n (x,y)dy / f(x,y)dy J x L n*°° Jy Jy But by (3.26) the integrand is a non-negative function and therefore we must have liminf / f n (x,y)dy = / f{x,y)dy a.s. n ~ > °° Jy Jy

PAGE 63

53 The following example shows that without further assumptions, there is no guarantee that the marginal densities even converge, much less that they converge to the desired limit. Example 3.3: Let n £ N \ {1} and for k £ N n define the following sequence of bivariate densities: fnk{x,y) = n~^I Bjik (x,y) + n<2 _I Bll \ Bnk (x,y), where B nk = ( ( x , y) El 2 : — 00 5 f„(x) = oo for all 0 < x < 1 even though liminf^oo g n (x) = /(o,i] (x) as h is expected to be based on Lemma 1, where the sequence {g n (x)} is obtained from the sequence {g n k(x)} the same way as {f n (x,y)} from {f n k( x , y)}. To see this, let x € (0, 1] and notice that for every n = 1,2,... there is a k £ {1, . . . , n) such that < x < d and this k is given by k = \nx ] . Therefore

PAGE 64

54 the subsequence g n \ nx \(x) — > oo as n — > oo. The result about liminf n _ >00 5 , n (x) is obtained similarly. The question we answer in the next chapter is “How fast do 4> AIA for the multivariate Student’s t problem converge?” The Markov chain theory described in Section 4.2 will allow us to formulate and then answer a more precise version of this question.

PAGE 65

CHAPTER 4 GEOMETRIC ERGODICITY OF FOR THE MULTIVARIATE STUDENTÂ’S t PROBLEM 4.1 Introduction Meng and van Dyk (1999) suggested that MA with an improper prior could be applied to the univariate version of the StudentÂ’s t problem ( d = 1), and van Dyk and Meng (2001) extended this to the multivariate case. These authors provided considerable empirical evidence suggesting that this new algorithm converges much faster than the standard data augmentation algorithm. However, they stopped short of formally proving anything about the convergence rate of the underlying marginal chain. In a discussion of van Dyk and Meng (2001), Hobert (2001a) showed that when d = 1 and n = 2, the marginal chain converges to stationarity at a geometric rate; that is, the chain is geometrically ergodic (Meyn and Tweedie, 1993, Chapter 15). In this chapter, we extend HobertÂ’s results to the case d > 1 and n > 2. This is the first general, rigorous analysis of an MCMC algorithm based on a non-positive recurrent Markov chain. There are two important practical applications of the results established in this chapter. First, if geometric ergodicity is established through drift and minorization conditions, as it is here, then formulas like those of Rosenthal (1995) and Roberts and Tweedie (1999) can be used to calculate an appropriate burn-in before sampling begins. In other words, given a starting value for the Markov chain, one can calculate an exact upper bound on the number of iterations required to get the total variation distance to stationarity below a pre-specified level. Second, geometric ergodicity implies the existence of central limit theorems that can be used to assess the Monte Carlo error of estimates of posterior quantities 55

PAGE 66

56 of interest. This assessment can be carried out in a number of ways including regenerative simulation (Mykland, Tierney and Yu, 1995; Hobert, Jones, Presnell and Rosenthal, 2002) and time series-type methods (Geyer, 1992; Chan and Geyer, 1994). The rest of this chapter is organized as follows. The necessary Markov chain background is provided in Section 4.2. The main results are given in Section 4.3 and they are applied to a simple illustrative example in Section 4.4. Finally, some concluding remarks are given in Section 4.5. 4.2 Markov Chain Background In this section we give some basic definitions and then briefly describe what drift and minorization conditions are and how they can be used to establish geometric convergence of Markov chains. For a more general discussion of the ideas described in this section, see Jones and Hobert (2001). Let X = TOS, be a time-homogeneous discrete-time Markov chain defined on the probability space (fl, Pr) that takes values in the measurable state space (Y, 3§). For t 6 N, let P t (x , A) be the t step transition kernel 1 for X; that is, for all x G X and all A E 3$, P\x, A) = Pv(X s+t e H|X S = x) V s € {0, 1, 2, . . . }. The following definitions can be found, for example, in Nummelin (1984) or Meyn and Tweedie (1993), but we state them here for completeness of the exposition. 1 The superscript will be omitted when t = 1

PAGE 67

57 Definition 4.1. The Markov chain X is called 7 -irreducible if there exists a measure 7 on 38 such that, whenever 7 (A) > 0, we have L(x, A) > 0 V x E X, where L(x, A) Pr(X ever enters A\X 0 = x ). Definition 4.2. If it is a a-finite measure on £8 such that 7 v(A) = f P(x , A)Tr(dx) V A E £#, Jx then 7 r is called an invariant measure for the Markov chain X . Definition 4.3. The tt irreducible Markov chain X is periodic if there exist an integer d > 2 and a collection {Ao, Ai, . . . , Ad1} of nonempty and disjoint subsets in £§ such that, for all i = 0, . . . , d — 1 and all x G Ai, P(x, Aj) — 1 forj = 7 + 1 mod d. Otherwise, X is called aperiodic. For the rest of the section, let £§ + := {A E : n(A) > 0}. Definition 4.4. The Markov chain X is called Harris recurrent if it is 7T -irreducible and for every set A E £§ + , PifrjA — oo|X 0 = x) = 1 V x E A, where the occupation time random variable, rjA '= Ia{Xu), counts the number of visits to the set A. We will say that a Markov chain satisfies the usual regularity conditions if it: (a) possesses an invariant probability measure 7r; (b) is 7r-irreducible; (c) is aperiodic; and

PAGE 68

58 (d) is Harris recurrent. Under the usual regularity conditions, for every x G A, we have the following limit for the total variation distance between P t (x, •) and 7r(-): sup \P t (x, A) — 7t(A)| =: ||P‘(x, •) — 7r(-) || — 0 as t —> oo . If this convergence takes place at a geometric rate; that is, if there exist a constant 0 < (j> < 1 and a function M : X — > [0, oo) such that for any x € A, •)-*(•) II < M{x) t for all t G N, then X is said to be geometrically ergodic. One method of proving that X is geometrically ergodic and, simultaneously, getting an upper bound on M(x) ft is by establishing drift and minorization conditions. There are several ways to do this (Meyn and Tweedie, 1994; Rosenthal, 1995; Roberts and Tweedie, 1999). The version we describe here is based on the work of Rosenthal (1995). A drift condition holds if for some function V : A — » [0, oo), and some constants A G (0, 1) and L G R PV(x) < \V(x) + L V rr 6 A , where PV(x):= f V(y)P(x, dy). Jx An associated minorization condition holds if for some probability measure H on SB and some e > 0 P(x, A) > eH(A) WxeS VdeJ, where S := {x € X : V{x) < 1} with l being any number larger than 2L/(1 — A). Together, drift and minorization imply that X is geometrically ergodic, and once

PAGE 69

59 drift and minorization conditions are established, Theorem 12 from Rosenthal (1995) gives a formula for calculating an upper bound on M(x)

/b£,a ~ N d (/x,— ) V Qj J (4.1) We take the prior on (/x,S,a) to be 7r(/z, S) u{a), where 7r(/x, £) = |£| 2 “ and cj(q:) = CK _ 1 /( 0 ,oo)(«)Let and 9*(q |/x,S,a,j/) := g*(n,'Z,a\q,y) := g*(q,^^,a\y) f M n g*(q,/J,i:,a\y)dq' ff*(q,t*,E,a\y) (4.2) (4.3) fs JiR d 9 (*?; AT ot\y)d fid's doc be the “conditional densities” based on (4.1). As in Subsection 3.3, these are legitimate densities despite the fact that the posterior g*(q , /x, E, a\ y), based on (4.1), is improper. As a matter of fact we can even derive the explicit distribution of (4.2) and (4.3). First, qj\n, ~ T v + d (y -/x) T E \ yj -fi) + u 2 a

PAGE 70

60 independently for j = 1 , ,n. Furthermore, it is easy to identify all the conditionals in the decomposition 9 *(p,
PAGE 71

61 Lemma 4.1. Let A be an m x m positive definite matrix. Then x T (xx T + A)~ l x <1 V x € R m . Proof. Applying the Sherman-MorrisonWoodbury formula (see, for example, Golub and Van Loan (1989), page 51) we obtain (xx T + A)1 = A~ l A -1 cc(l + x t A~ 1 x)1 x t A1 . Therefore, x J {xx T + A) 1 x = x J A 1 x — x T A 1 xx T A -l x 1 + x T A -l = t x i t t It t < 1 , where t — x T A~ l x > 0, since A is positive definite. The following result yields the drift condition: Theorem 4.1. Define va,/, £) = J>. Ts-T, M). i = 1 If nv > 2 and u T d > 2 then (n 1 ) 2 {y T d) PVV,£')< Tr/ , ndu nv(n — l)(nv + nd — 2) V(n',H') + T v ;v (nv — 2)(v T d — 2) ’ nv — 2 (nv — 2)(v T d — 2) Proof. We will evaluate the expectation of the drift function using several layers of iterated expectations: PV(fi', S') = E[E(V(ax, S)| q, y)\fx', S', y] = e|e[E(V(/x, S)|o, q, y)\q , y\ | M ', S', y] = E ^e|e [E(V(/ r, S)|S, a, q, 2 /)|o!, q, j/] \q,y} The expectation on the left-hand side is with respect to the Markov transition density K y (fx , S|/z', S') and the first equality follows from the properties of the

PAGE 72

62 Gibbs sampler. The other two equalities are nothing more than the usual iterated expectation. Starting with the innermost expectation, we obtain E[V(^,S)| H,a,q,y] n = E t(^ ~ A*) T ^ _1 (2/i a0|£> a, y ] i— 1 n = _ 0i" S ~V ~ y a, q , y } i = 1 n 2=1 2=1 2 = 1 pJT, 1 /i + tr^S = 5I(^-A) T £ 1 (y i -A) + nda 2=1 Now if X ~ IW p (fc, A), then X -1 ~ Wishart p (fc, A) and EX -1 = kA (see, for example, Mardia et al. (1979), page 85). Thus, nda f i = 1 E [V(n,J:)\a,q,y} = + ^(y* A) T E (E ^aqg.y) (y* A) nda + a{n — 1) ^(y, A) T 2=1 A)(yj A) T .i=i (3/i — A)Since i/n > 2, conditional on (qr, y), ct has a finite mean given by E(a|qr, y) = Therefore nd E['E(/u, E)| q,y] = — E(a\q,y) n + {nl)E(a|q, y) J^(y t A) 2=1 1 “I ^Qjiyj A ){Vj A) T j=i (i/i A) (4.4) ndu u(n — 1) v^, = o + — TT^ L “ A*) nu — 2 nv — 2 T i=l ^Qjiyj y){y 3 A) .j=i T (i/< A)Now, we need a bound for the quantity 9J^(yi A) 2=1 “1 “I A)(2/jA) .i=i (»i A) , (4.5)

PAGE 73

63 which is obtained by applying Lemma 4.1 to the zth component in (4.5) as follows -\T (Vi ~ A) '52 q j( y j ~ A)(l/j A) T .3 = 1 ( y i ~ A) T ~ -i (s/i A) ft ^riyj A ) (Vj A) t + ( 2 /i A)(y* A) “ 0, -i (y* A) < -7 • Q.i Therefore from (4.4) we obtain that ndu u(n — 1) E[V(/z,£)| q,y]< Finally, ^ 1 ndu u(n — 1) q.y = rH — rU L ^ Cl: 711/ nu — 2 nu — 2 ' a; nu — 2 nu — 2 *=i ^ * + ££ 9j . , " q i i=i E„[V(f*. S)|/4 S'] ndi/ z/(n — 1) < + — nu — 2 nu — 2 ndi/ — 1) < + — nu — 2 nu — 2 ^ + d ST' (g/i ~ A a/ ) T (^' / ) Hl/i — Ap + fc' * + d 2 £ ( Vj M o T (s')1 (y j /*') + * ^^E (( v,-iOW ( y i -iO^) + v 7 i = 1 n and some rearrangement yields the result. Remark 4.1. This result agrees perfectly with Hobert’s (2001a) result in the case where d = 1 and n = 2. Remark 4.2. 2l similar result was obtained by Jones (2001) for the case d = 1, but our result is uniformly better in the sense that our coefficient in front of V (/a', a') is always smaller and doesn’t depend on the data. The last thing we need is the minorization condition. The following is a straightforward generalization of the minorization developed in Hobert (2001a). Theorem 4.2. Fix l > 0 and let S = { (/x, S) : V(/u, S) < l}. Then the Markov transition density S|/z', S') satisfies the following minorization condition K y (n,'Z\ t i\'£')>eh( f x,'Z\y) V (//, S') G S,

PAGE 74

64 where the density h(/x , £|y) is given by h(^, £|y)=/ / g*(fi,T,,a\q,y) Jo J K? g(Qi) .El r 9(t) dt dqda , e = (/ 0 °° g(t) dt) n and the function g(-) is given by is + d v + 1 9{q) = r ( V ~y l ^ ^ W )(^ A is geometrically ergodic. We state this as a corollary. Corollary 4.1. The Markov chain $* MA is geometrically ergodic if nv > 2, v + d > 2 and (n 1 ) 2 {y + d) (ms — 2)(is + d — 2) When is > 2, condition (4.6) is satisfied for (4.6) is 2 + vd + 2d + y/(is + d — 2 )[d(is 2 + 4is — 8) + is(is + 4)(^ — 2)] 2(is + d) Roughly speaking, all the conditions of Corollary 4.1 hold when n < v if d > 2 and when n < is — 1 if d = 1. Of course, ideally, we would like to be able to say that ^ A is geometrically ergodic for all ( d , is, n) triples such that d > 1, is > 0 and n > d + 1. The most restrictive of the three conditions in Corollary 4.1 is clearly (4.6) which is a direct result of the upper bound we use for the quantity (4.5) in the proof of Theorem 4.1. (Note that up until this point in the proof, all of the expectations are evaluated exactly.) Our experience suggests that finding a closed form expression for the expectation of (4.5) is impossible. Furthermore, alternative upper bounds that one might consider using are either not as sharp or have other undesirable consequences. For example, note that our bound on (4.5) holds uniformly in y. If

PAGE 75

65 a bound depending on y is utilized, then the coefficient in front of V (/z', S') in the drift condition will also depend on y. This could lead to the unfortunate situation where the sufficient conditions for geometric convergence of the Markov chain actually depend on the data. It is our belief that a substantial improvement on Theorem 4.1 will likely require the use of a different drift function, P(/z, S), which basically means starting over from square one. Amazingly, the standard DA algorithm is also geometrically ergodic. We prove this fact in the next theorem which is an analog of Theorem 4.1. Theorem 4.3. Let V(/z, S) be defined as in Theorem 4-1If v + d> 2 then E y [v(^r.w, £,']< n + d — v + d — lvV,E') + n(n + d — 1 )u v + d — 2 where £ y [-|*] is with respect to the Markov transition density of the standard DA chain. Proof. It is easy to check that for the standard DA we have the following facts. First, fv + d (2/y V 2 1 {y j + independently for j — 1, ...,n. Second, ~ N d A, — T and S| q,y ~ lW d Now we have that -l n ~ ~ P) T . 3 = 1 E y [V(ta, S)|/x', S'] = E[E(V(/x, S)|q, y) |/x', S', y] = e{e[E(P(m, S)|S, q, y)\q , y\ |/x'. S', y}.

PAGE 76

66 Skipping some of the details we obtain E[l/(^,£)|<7,:y] = — + (n-l)^(yiA) i=i n -1 A )(Vj A) -j=l T (i/i A) Now using the fact that q.~ l < £ Y^i=\ Qi 1 as we ^ as th e results established earlier in the proof of Theorem 4.1, we have E[V(n,'Z)\q 1 y] < (n + d 1) ^ — . i=i 9i Finally, EyM* S)!m', S'] < "tl 1 £ [( V , M') T (S')-‘(» i m') + *] , i=l and some rearrangement yields the result. The above theorem combined with the fact that Theorem 4.2 also applies to standard DA shows that the standard DA algorithm is also geometrically ergodic. 4.4 Numerical Example Suppose that d = 2, u = 10, n = 3 and that the data are: y 1 = (—5, — 2) T , y 2 — (1,5) t , y 3 = ( — 10, 4) T . In this section, we will use our results in conjunction with Theorem 12 from Rosenthal (1995) to calculate sufficient burn-in for $^ IA and for the standard DA algorithm. We will also take advantage of the fact that it is possible to get iid samples from the target in this particular case (see Section 2.2) to examine empirically how long it takes for the Markov chains to converge. For Theorem 4.1 yields E y [V(y, S) | /*', S'] < AVV, S') + L = ^V(/z', S') + y . In order to apply Rosenthal’s Theorem 12, we need to choose an l > « 22.76 and an r G (0, 1). Taking l = 27.241 and r = 0.04 leads to q* ~ 0.58 and this yields e ~ 0.00158. Finally, suppose we start the chain with /* 0 = (0, 0) T and Eq = I.

PAGE 77

67 Then our Theorems 4.1 and 4.2 in conjunction with Rosenthal’s Theorem 12 yield || i*[(/x 0 , S 0 ), •] 7r(-|y)|| < (0.9999368199)* + (183.3793103)(0.9996034577)‘ , where P^[(/x 0 , S 0 ), ] denotes the t-step Markov transition kernel corresponding to and 7 r(-|y) denotes the probability measure corresponding to (2.1). It is easy to verify that ||P y 47415 [(^ 0 ,S o ),-]-7r(-|y)||<0.05. Hence, a burn-in of 47,415 iterations for 4 >m A is more than enough. To put this in perspective, it takes about a minute to run 1,000,000 iterations of this Markov chain (on a slow laptop). The reader may wonder how we chose these particular values for / and r. Because Rosenthal’s bound holds for all l > 2L/(1 — A) and all r E (0, 1), the user is free to choose the values that lead to the smallest upper bound. In our applications, we simply used a grid-search to find the (approximate) minimizers. For the standard DA algorithm, Theorem 4.3 yields E„[y( M , S)|/x', S'] < |VV, S') + 12 . Taking l = 46.017 and r = 0.019 (and starting the chain at the same initial values as the MA chain) yields ||i?y(M 0 , S 0 ), •] tt(-|j/)|| < (0.999998668)* + 23(0.9999831867)* , where Ry[(/z 0 , £ 0 ), •] is the t-step Markov transition kernel corresponding to the standard DA algorithm. In this case, we need 2,249,050 iterations to get the total variation distance below 0.05. That’s nearly 50 times as many as are needed for $ma! It is tempting to conclude from this comparison that, in this particular case, ^ma converges faster than the Markov chain underlying standard DA . Keep in

PAGE 78

68 mind, however, that we are comparing two upper bounds. It is still possible, albeit unlikely, that standard DA gets within 0.05 of the stationary distribution sooner than $ma(See Jones and Hobert (2004) for some general discussion along these lines.) Recall that n is exactly equal to d + 1 in the example we are considering and hence it is possible to make iid draws from 7r(-| y) in this particular case. Thus, for fixed t, we can compare 7r(-|y), Py[(fJ0 , S 0 ), •] and Ry[(pi 0 , S 0 ), •] empirically through random samples generated from the three distributions. A method of drawing a random sample from 7r(-|y) is given in Section 2.2. Making iid draws from Py[(li 0 , S 0 ), •] and i?y[(/Lt 0 , S 0 ), ] is simple just run independent copies of the Markov chains (starting each at (/x 0 , So)) and take the fth iteration of each chain. We simulated random samples of size 10 6 from 7r(-|y), T^[(/x 0 , S 0 ), •], Fy[(/x 0 , S 0 ), ], and Py°[(fx 0 , S 0 ), •] as well as the corresponding distributions under standard DA. The 5-dimensional samples are compared on the basis of three univariate quantities: yi, (the first diagonal element of S) and
PAGE 79

69 ^ 1 cr 12 10 Figure 4-1: Empirical comparison of Markov chains and the target From left to right, the three columns correspond to the parameters /xi, cri and oyiFrom top to bottom, the three rows correspond to 1 iteration, 3 iterations and 10 iterations. Each plot contains three density estimates and each density estimate is based on a random sample of size 10 6 . The solid line, dashed line and dots correspond to the iid sample, 4 >m A and the standard DA algorithm, respectively. Since “iterations” are irrelevant for the iid sample, the same density estimate is shown in each column for the iid case.

PAGE 80

70 4.5 Concluding Remarks The results in Section 4.4 contribute to mounting evidence that upper bounds on the total variation distance calculated using Rosenthal’s (1995) Theorem 12 (or the related results of Roberts and Tweedie (1999)) are typically very conservative. Indeed, Rosenthal (1995, Example 1) demonstrated such conservatism in the context of a simple Markov chain whose exact convergence rate is known. In the Rejoinder of van Dyk and Meng (2001), an empirical estimator of total variation distance was used to show that the amount of burn-in suggested by Hobert (2001a) (calculated using Rosenthal’s (1995) Theorem 12) is probably much more than is actually needed. By applying both the techniques of Roberts and Tweedie (1999) and those of Rosenthal (1995) in the same example, Jones and Hobert (2004) illustrated the potential conservativeness of results based on Roberts and Tweedie ’s (1999) techniques. On the other hand, suppose that one of these techniques is applied in a particular situation (where it is impossible to sample directly from the target) and the result is that n iterations is a sufficient burn-in. Even if we believe (say, because of the mounting evidence alluded to above) that n is probably much larger than is actually necessary, there is no harm in burning in n iterations, unless this is simply too time consuming. Thus, from a “better safe than sorry” standpoint, even an extremely conservative upper bound on the necessary amount of burn-in can be useful. We conclude with a few comments regarding a possible generalization of our results. The basic DA algorithm underlying all of the results in this dissertation is based on the fact that the location-scale Student’s t density can be written as a mixture of normals. For example, the d variate Student’s t density with u degrees

PAGE 81

71 of freedom, f tv , can be written as ftM= [ (2tt )~% q% e-^ dG(q) , (4.7) J R+ where G is the Gamma(i//2,y/2) distribution function. Now consider a more general problem where the data come from a density having the same form as (4.7), but where G is the distribution function of any positive random variable. Bayesian regression models with errors of this form have been studied by many authors including West (1984), Liu (1996) and Fernandez and Steel (1999, 2000). Perhaps general versions of Theorems 4.1 and 4.2 could be developed to handle this problem.

PAGE 82

REFERENCES Affleck-Grave, J. and McDonald, B. (1989). Nonnormalities and tests of assest pricing theories, Journal of Finance 44: 889-908. Box, G. E. P. and Tiao, G. C. (1992). Bayesian Inference in Statistical Analysis , Wiley, New York. Brownlee, K. A. (1965). Statistical Theory and Methodology in Science and Engineering, John Wiley, New York. Chan, K. S. and Geyer, C. J. (1994). Comment on “Markov chains for exploring posterior distributions” by L. Tierney, The Annals of Statistics 22: 1747-1758. Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, John Wiley, New York. Eaton, M. L. (1989). Group Invariance Applications in Statistics, Institute of Mathematical Statistics and the American Statistical Association, Hayward, California and Alexandria, Virginia. Fang, K.-T. and Zhang, Y.-T. (1990). Generalized Multivariate Analysis, SpringerVerlag, Berlin. Fernandez, C. and Steel, M. F. J. (2000). Bayesian regression analysis with scale mixtures of normals, Econometric Theory 16: 80-101. Geweke, J. (1993). Bayesian treatment of the independent student-f linear model, Journal of Applied Econometrics 8: S19-S40. Geyer, C. J. (1992). Practical Markov chain Monte Carlo (with discussion), Statistical Science 7: 473-511. Golub, G. H. and Van Loan, C. F. (1989). Matrix Computations, The Johns Hopkins University Press, Baltimore and London. Green, J. A. (1987). Sets and Groups, Routledge and Kegan Paul, London and New York. Hobert, J. P. (2001a). Discussion of “The art of data augmentation” by D.A. van Dyk and X.-L. Meng, Journal of Computational and Graphical Statistics 10: 59-68. Hobert, J. P. (2001b). Stability relationships among the Gibbs sampler and its subchains, Journal of Computational and Graphical Statistics 10: 185-205. 72

PAGE 83

73 Hobert, J. P., Jones, G. L., Presnell, B. and Rosenthal, J. S. (2002). On the applicability of regenerative simulation in Markov chain Monte Carlo, Biometrika 89: 731-743. Jeffreys, H. (1939). Theory of Probability, Clarendon Press, Oxford, U.K. Jones, G. L. (2001). Convergence Rates and Monte Carlo Standard Errors for Markov Chain Monte Carlo Algorithms, unpublished PhD dissertation, University of Florida, Gainesville. Jones, G. L. and Hobert, J. P. (2001). Honest exploration of intractable probability distributions via Markov chain Monte Carlo, Statistical Science 16 : 312-334. Jones, G. L. and Hobert, J. P. (2004). Sufficient burn-in for Gibbs samplers for a hierarchical random effects model, The Annals of Statistics 32 : 784-817. Karlin, S. (1968). Total Positivity , Vol. 1, Stanford University Press, Stanford, California. Kelejian, H. H. and Prucha, I. R. (1985). Independent or uncorrelated disturbances in linear regression, Economics Letter 19 : 35-38. Lange, K. L., Little, R. J. A. and Taylor, J. M. G. (1989). Robust statistical modeling using the t distribution, Journal of the American Statistical Association 84 : 881-896. Lavine, M. (2003). A marginal ergodic theorem, in J. M. Bernardo, A. P. Dawid, J. O. Berger and M. West (eds), Bayesian Statistics 7, Oxford University Press, Oxford. Liu, C. (1996). Bayesian robust multivariate linear regression with incomplete data, Journal of the American Statistical Association 91 : 1219-1227. Liu, J. S., Wong, W. H. and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to comparisons of estimators and augmentation schemes, Biometrika 81 : 27-40. Liu, J. S., Wong, W. H. and Kong, A. (1995). Covariance structure and convergence rate of the gibbs sampler with various scans, Journal of the Royal Statistical Society, Series B, Methodological 57 : 157-169. Liu, J. S. and Wu, Y. N. (1999). Parameter expansion for data augmentation, Journal of the American Statistical Association 94 : 1264-1274. Mandelbroit, B. (1963). The variation of certain speculative prices, Journal of Business 36 : 394-419. Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis, Academic Press, London.

PAGE 84

74 Meng, X.-L. and van Dyk, D. A. (1999). Seeking efficient data augmentation schemes via conditional and marginal augmentation, Biometrika 86: 301-320. Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability, SpringerVerlag, London. Meyn, S. P. and Tweedie, R. L. (1994). Computable bounds for geometric convergence rates of Markov chains, The Annals of Applied Probability 4: 981— 1011 . Mykland, P., Tierney, L. and Yu, B. (1995). Regeneration in Markov chain samplers, Journal of the American Statistical Association 90 : 233-241. Nelson, C. R. and Plosser, C. I. (1982). Trends and random walks in macroeconomic time series: Some evidence and implications, Journal of Monetary Economics 10 : 139-162. Nummelin, E. (1984). General Irreducible Markov Chains and Non-negative Operators, Cambridge, London. Olver, P. J. (1999). Classical Invariant Theory, Cambridge University Press, Cambridge, United Kingdom. Praetz, P. D. (1972). The distribution of share price changes, The Journal of Business 45 : 49-55. Ripley, B. D. (1987). Stochastic Simulation, John Wiley and Sons, New York. Roberts, G. and Tweedie, R. L. (1999). Bounds on regeneration times and convergence rates for Markov chains, Stochastic Processes and their Applications 80 : 211-229. Corrigendum (2001) 91 : 337-338. Rosenthal, J. S. (1995). Minorization conditions and convergence rates for Markov chain Monte Carlo, Journal of the American Statistical Association 90 : 558566. Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion), Journal of the American Statistical Association 82 : 528-550. Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion), The Annals of Statistics 22 : 1701-1762. van Dyk, D. A. and Meng, X.-L. (2001). The art of data augmentation (with discussion), Journal of Computational and Graphical Statistics 10 : 1-50. West, M. (1984). Outlier models and prior distributions in Bayesian linear regression, Journal of the Royal Statistical Society, Series B 46: 431-439.

PAGE 85

75 Zellner, A. (1976). Bayesian and non-Bayesian analysis of the regression model with multivariate student-t error terms, Journal of the American Statistical Association 71 : 400-405.

PAGE 86

BIOGRAPHICAL SKETCH I was born in Bulgaria, in 1974. My educational background consists of a high school diploma in Mathematics and a masterÂ’s degree in Mathematics (with emphasis on Probability and Statistics) from Sofia University. Shortly after receiving my masterÂ’s degree in 1997, I went to the Bulgarian army to fulfill a 9-month duty required by law. Immediately after that, in 1998, I was appointed as an Assistant Professor of Statistics in the Department of Economics and Business Administration at Sofia University. By that time I had already decided to pursue a Ph.D. degree in the USA. Luckily, I was accepted at the University of FloridaÂ’s Department of Statistics, where I began studying in 1999. After graduating from the University of Florida, I will move to New York City to work as an Assistant Professor in the Department of Statistics in the Zicklin School of Business at Baruch College, City University of New York. 76

PAGE 87

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation an® is fwlly adequate, in scope and quality, as a dissertation for the degree of Doctonpf Pmlosophy. Jamjfes P. Hobm, Chair Associate Professor of Statistics I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.^— Brett D. Presnell Associate Professor of Statistics I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. ames G. Booth Jam£ Professor of Statistics I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. Art T^C-^C — Adao A. Trindade Assistant Professor of Statistics I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. ITAwvaJo Sjlo Murali Rao Professor of Mathematics

PAGE 88

This dissertation was submitted to the Graduate Faculty of the Department of Statistics in the College of Liberal Arts and Sciences and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy. August 2004 Dean, Graduate School