Regression models for discrete-valued time series data

Material Information

Title:
Regression models for discrete-valued time series data
Creator:
Klingenberg, Bernhard
Publication Date:
2004
Language:
English
Physical Description:
xii, 177 leaves : ill. ; 29 cm.

Subjects

Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 2004.
Bibliography:
Includes bibliographical references.
General Note:
Printout.
General Note:
Vita.
Statement of Responsibility:
by Bernhard Klingenberg.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Bernhard Klingenberg. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
003100752 ( ALEPH )
706801267 ( OCLC )

Full Text









REGRESSION MODELS FOR
DISCRETE-VALUED TIME SERIES DATA














By

BERNHARD KLINGENBERG


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


2004






























Copyright 2004

by

Bernhard Klingenberg

































To Sophia and Jean-Luc Picard













ACKNOWLEDGMENTS

I would like to express my sincere gratitude to Drs. Alan Agresti and James

Booth for their guidance and assistance with my dissertation research and for

their support throughout my years at the University of Florida. During the past

three years as a research assistant for Dr. Agresti, I gained valuable experience in

conducting statistical research and writing scholarly papers, for which I am very

grateful. I would also like to thank Dr. Ramon Littell, who guided me through a

year of invaluable statistical consulting experience at IFAS, Dr. George Casella,

who taught me Monte Carlo methods, and Drs. Jeff Gill and Michael Martinez for

serving on my committee. My gratitude extends to all the faculty and present and

former graduate students of the Department, among them, in alphabetical order,

Brian Caffo, Sounak Chakraborty, Dr. Herwig Friedl, Ludwig Heigenhauser, David

Hitchcock, Wolfgang Jank, Galin Jones, Ziyad Mahfoud, Siuli Mukhopadhyay and

Brian Stephens.

I would like to thank my family, foremost my wife Sophia and my daughter

Franziska, for all their support and the light and joy they bring into my life, and for

providing me with energy and fulfillment.

Lastly, this dissertation would not have been written in English had it not

been for the countless adventures of the Starship Enterprise and its Captain,

Jean-Luc Picard, whose episodes kept me glued to the TV in Austria and helped to

sufficiently improve my knowledge of the English language.














TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

1.1 Regression Models for Correlated Discrete Data
1.2 Marginal Models
1.2.1 Likelihood Based Estimation Methods
1.2.2 Quasi-Likelihood Based Estimation Methods
1.3 Transitional Models
1.3.1 Model Fitting
1.3.2 Transitional Models for Time Series of Counts
1.3.3 Transitional Models for Binary Data
1.4 Random Effects Models
1.4.1 Correlated Random Effects in GLMMs
1.4.2 Other Modeling Approaches
1.5 Motivation and Outline of the Dissertation

2 GENERALIZED LINEAR MIXED MODELS

2.1 Definition and Notation
2.1.1 Generalized Linear Mixed Models for Univariate Discrete Time Series
2.1.2 State Space Models for Discrete Time Series Observations
2.1.3 Structural Similarities Between State Space Models and GLMMs
2.1.4 Practical Differences
2.2 Maximum Likelihood Estimation
2.2.1 Direct and Indirect Maximum Likelihood Procedures
2.2.2 Model Fitting in a Bayesian Framework
2.2.3 Maximum Likelihood Estimation for State Space Models
2.3 The Monte Carlo EM Algorithm
2.3.1 Maximization of $Q_m$
2.3.2 Generating Samples from $h(u \mid y; \beta, \Sigma)$
2.3.3 Convergence Criteria

3 CORRELATED RANDOM EFFECTS

3.1 A Motivating Example: Data from the General Social Survey
3.1.1 A GLMM Approach
3.1.2 Motivating Correlated Random Effects
3.2 Equally Correlated Random Effects
3.2.1 Definition of Equally Correlated Random Effects
3.2.2 The M-step with Equally Correlated Random Effects
3.3 Autoregressive Random Effects
3.3.1 Definition of Autoregressive Random Effects
3.3.2 The M-step with Autoregressive Random Effects
3.4 Sampling from the Posterior Distribution Via Gibbs Sampling
3.4.1 A Gibbs Sampler for Autoregressive Random Effects
3.4.2 A Gibbs Sampler for Equally Correlated Random Effects
3.5 A Simulation Study

4 MODEL PROPERTIES FOR NORMAL, POISSON AND BINOMIAL OBSERVATIONS

4.1 Analysis for a Time Series of Normal Observations
4.1.1 Analysis via Linear Mixed Models
4.1.2 Parameter Interpretation
4.2 Analysis for a Time Series of Counts
4.2.1 Marginal Model Implied by the Poisson GLMM
4.2.2 Parameter Interpretation
4.3 Analysis for a Time Series of Binomial or Binary Observations
4.3.1 Marginal Model Implied by the Binomial GLMM
4.3.2 Approximation Techniques for Marginal Moments
4.3.3 Parameter Interpretation

5 EXAMPLES OF COUNT, BINOMIAL AND BINARY TIME SERIES

5.1 Graphical Exploration of Correlation Structures
5.1.1 The Variogram
5.1.2 The Lorelogram
5.2 Normal Time Series
5.3 Analysis of the Polio Count Data
5.3.1 Comparison of ARGLMMs to Other Approaches
5.3.2 A Residual Analysis for the ARGLMM
5.4 Binary and Binomial Time Series
5.4.1 Old Faithful Geyser Data
5.4.2 Oxford versus Cambridge Boat Race Data

6 SUMMARY, DISCUSSION AND FUTURE RESEARCH

6.1 Cross-Sectional Time Series
6.2 Univariate Time Series
6.2.1 Clipping of Time Series
6.2.2 Longitudinal Data
6.3 Extensions and Further Research
6.3.1 Alternative Random Effects Distribution
6.3.2 Topics in GLMM Research

REFERENCES

BIOGRAPHICAL SKETCH














LIST OF TABLES

3-1 A simulation study for a logistic GLMM with autoregressive random effects.

3-2 Simulation study for modeling unequally spaced binary time series.

5-1 Comparing estimates from two models for the log-odds.

5-2 Parameter estimates for the polio data.

5-3 Autocorrelation functions for the Old Faithful geyser data.

5-4 Comparison of observed and expected counts for the Old Faithful geyser data.

5-5 Maximum likelihood estimates for the boat race data.

5-6 Observed and expected counts of sequences of wins (W) and losses (L) for the Cambridge University team.

5-7 Estimated random effects fit for the last 30 years for the boat race data.

5-8 Estimated probabilities of a Cambridge win in 2004, given the past $s + 1$ outcomes of the race.














LIST OF FIGURES

2-1 Plot of the typical behavior of the Monte Carlo sample size $m^{(k)}$ and the Q-function $Q^{(k)}$ through MCEM iterations.

3-1 Sampling proportions from the GSS data set.

3-2 Iteration history for selected parameters and their asymptotic standard errors for the GSS data.

3-3 Realized (simulated) random effects $u_1, \ldots, u_T$ versus estimated random effects $\hat{u}_1, \ldots, \hat{u}_T$.

3-4 Comparing simulated and estimated random effects.

4-1 Approximated marginal probabilities for the fixed part predictor value $x'\beta$ ranging from -4 to 4 in a logit model.

4-2 Comparison of conditional logit and probit model based probabilities.

4-3 Comparison of implied marginal probabilities from logit and probit models.

5-1 Empirical standard deviations std($\hat{\theta}_{it}$) for the log odds of favoring homosexual relationships by race.

5-2 Plot of the polio data.

5-3 Iteration history for the polio data.

5-4 Residual autocorrelations for the polio data.

5-5 Residual autocorrelations with outlier adjustment for the polio data.

5-6 Autocorrelation functions for the Old Faithful geyser data.

5-7 Lorelogram for the Old Faithful geyser data.

5-8 Plot of the Oxford vs. Cambridge boat race data.

5-9 Variogram for the Oxford vs. Cambridge boat race data.

5-10 Lorelogram for the Oxford vs. Cambridge boat race data.

5-11 Path plots of fixed and random effects parameter estimates for the boat race data.

6-1 Association graphs for GLMMs.













Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

REGRESSION MODELS FOR
DISCRETE-VALUED TIME SERIES DATA

By

Bernhard Klingenberg

August 2004

Chair: Alan G. Agresti
Cochair: James G. Booth
Major Department: Statistics

Independent random effects in generalized linear models induce an exchange-

able correlation structure, but long sequences of counts or binomial observations

typically show correlations decaying with increasing lag. This dissertation intro-

duces models with autocorrelated random effects for a more appropriate, parameter-

driven analysis of discrete-valued time series data. We present a Monte Carlo EM

algorithm with Gibbs sampling to jointly obtain maximum likelihood estimates

of regression parameters and variance components. Marginal mean, variance and

correlation properties of the conditionally specified models are derived for Poisson,

negative binomial and binary/binomial random components. They are used for

constructing goodness of fit tables and checking the appropriateness of the modeled

correlation structure. Our models define a likelihood and hence estimation of the

joint probability of two or more events is possible and used in predicting future

responses. Also, all methods are flexible enough to allow for multiple gaps or miss-

ing observations in the observed time series. The approach is illustrated with the

analysis of a cross-sectional study over 30 years, where only observations from 16

unequally spaced years are available, a time series of 168 monthly counts of polio

infections and two long binary time series.













CHAPTER 1
INTRODUCTION

Correlated discrete data arise in a variety of settings in the biomedical,

social, political or business sciences whenever a discrete response variable is

measured repeatedly. Examples are time series of counts or longitudinal studies

measuring a binary response. Correlations between successive observations arise

naturally through a time, space or some other cluster forming context and have

to be incorporated in any inferential procedure. Standard regression models

for independent data can be expanded to accommodate such correlations. For

continuous type responses, the normal linear mixed effects model offers such a

flexible framework and has been well studied in the past. A recent reference is

Verbeke and Molenberghs (2000), who also discuss computer software for fitting

linear mixed effects models with popular statistical packages. Although the normal

linear mixed effects model is but one member of the broader class of generalized

linear mixed effects models, it enjoys unique properties which simplify parameter

estimation and interpretation substantially. For discrete response data, however,

the normal distribution is not appropriate, and other members in the exponential

family of distributions have to be considered.

1.1 Regression Models for Correlated Discrete Data

In this introduction we will review extensions of the basic generalized linear

model (McCullagh and Nelder, 1989) for analyzing independent observations to

models for correlated data. These models are marginal (Section 1.2), transitional

(Section 1.3) and random effects models (Section 1.4). An extensive discussion of

these models with respect to discrete longitudinal data is given in the books by

Agresti (2002) and Diggle, Heagerty, Liang and Zeger (2002). In general, longi-

tudinal studies concern only a few repeated measurements. In this dissertation,

however, we are interested in the analysis of much longer series of repeated observa-

tions, often exceeding 100 repeated measurements. Therefore, the following review

focuses specifically on models for univariate time series observations, some of which

are presented in Fahrmeir and Tutz (2001).

Let $y_t$ be a response at time $t$, $t = 1, \ldots, T$, observed together with a vector of covariates denoted by $x_t$. In a generalized linear model (GLM), the mean $\mu_t = E[y_t]$ of observation $y_t$ depends on a linear predictor $\eta_t = x_t'\beta$ through a link function $h(\cdot)$, forming the relationship $\mu_t = h^{-1}(x_t'\beta)$. The variance of $y_t$ depends on the mean through the relationship $\text{var}(y_t) = \phi_t v(\mu_t)$, where $v(\cdot)$ is a distribution specific variance function and $\{\phi_t\}$ are additional dispersion parameters. (For Poisson counts, for example, $v(\mu_t) = \mu_t$ and $\phi_t = 1$.) In a regular GLM, observations at any two distinct time points $t$ and $t^*$ are assumed independent.

In the models discussed below, the type of extension to accommodate correlated data depends on the way the correlation is introduced into the model. In marginal models, the correlation can be specified directly, e.g., $\text{corr}(y_t, y_{t^*}) = \rho$, or left completely unspecified, but nonetheless accounted for in likelihood based and non-likelihood based inferences. In transitional models correlation is introduced by including previous observations in the linear predictor, e.g., $\eta_t = \tilde{x}_t'\tilde{\beta}$, where $\tilde{x}_t = (x_t', y_{t-1}, y_{t-2}, \ldots)'$ and $\tilde{\beta} = (\beta', \alpha_1, \alpha_2, \ldots)'$ are extensions of the design and parameter vector of a GLM with independent components. Random effects models induce correlation between observations by including random effects rather than previous observations in the linear predictor, e.g., $\eta_t = x_t'\beta + u$, where $u$ is a random effect shared by all observations.

The way correlation is built into a model also determines the type of inference.

Typically, marginal models are fitted by a quasi-likelihood approach, estimation in

transitional models is based on a conditional or partial likelihood, and inference

in random effects models relies on a full likelihood (possibly Bayesian) approach.

However, models and inferential procedures have been developed that allow more

flexibility than the above categorization.

1.2 Marginal Models

In marginal regression models, the main scientific goal is to assess the influence

of covariates on the marginal mean of yt, treating the association structure between

repeated observations as a nuisance. The marginal mean $\mu_t$ and variance $\text{var}(y_t)$

are modeled separately from a correlation structure between two observations

$y_t$ and $y_{t^*}$. Regression parameters in the linear predictor are called population-

averaged parameters, because their interpretation is based on an average over

all individuals in a specific covariate subgroup. Due to the correlation among

repeated observations, the likelihood for the model refers to the joint distribution

of all observations and not to the simpler product of their marginal distributions.

However, the model is specified in terms of these marginal distributions, which

makes maximum likelihood fitting particularly hard for even a moderate number $T$

of repeated measurements.

1.2.1 Likelihood Based Estimation Methods

For binary data, Fitzmaurice and Laird (1993) discuss a parametrization of the

joint distribution in terms of conditional probabilities and log odds ratios. These

parameters are related to the marginal mean and the same conditional log odds

ratios, which describe the higher order associations among the repeated responses.

The marginal mean and the higher order associations are then modeled in terms of

orthogonal parameters $\beta$ and $\alpha$, respectively. Fitzmaurice and Laird (1993) present

an algorithm for maximizing the likelihood with respect to these two parameter

sets. The algorithm has been implemented in a freely available computer program

(MAREG) by Kastner et al. (1997).








Another approach to maximum likelihood fitting for longitudinal discrete

data regards the marginal model as a constraint on the joint distribution and

maximizes the likelihood subject to this constraint. The model is written in terms

of a generalized log-linear model $C\log(A\mu) = X\beta$, where $\mu$ is a vector of expected

counts and A and C are matrices to form marginal counts and functions of those

marginal counts, respectively. With this approach, no specific assumption about

the correlation structure of repeated observations is made, and the likelihood

refers to the most general form for the joint distribution. However, simultaneous

modeling of the marginal distribution and a simplified joint distribution is also

possible. Details can be found in Lang and Agresti (1994) and Lang (1996). Lang

(2004) also offers an R computer program (mph.fit) for maximum likelihood fitting

of these very general marginal models.

1.2.2 Quasi-Likelihood Based Estimation Methods

The drawback of the two approaches mentioned above and likelihood based

methods in general is that they require enormous computing resources as the

number of repeated responses increases or the number of covariates is large, making

maximum likelihood fitting computationally impossible for long time series. This is

also true for estimation based on alternative parameterizations of a distribution for

multivariate binary data such as those discussed in Bahadur (1961), Cox (1972) or

Zhao and Prentice (1990).

Estimating methods leading to computationally simpler inference (albeit

not maximum likelihood) for marginal models are based on a quasi-likelihood

approach (Wedderburn, 1974). In a quasi-likelihood approach, no specific form

for the distribution of the responses is assumed and only the mean, variance and

correlation are specified. However, with discrete data, specifying the mean and

covariances does not determine the likelihood, as it would with normal data, so

parameter estimation cannot be based on it. Liang and Zeger (1986) proposed

generalized estimating equations (GEE) to estimate parameters, which have

the form of score equations for GLMs, but cannot be interpreted as such. Their

approach also requires the specification of a working correlation matrix for the

repeated responses. They show that if the mean function is correctly specified,

the solution to the generalized estimating equations is a consistent estimator,

regardless of the assumed variance-covariance structure for the repeated responses.

They also present an estimator of the asymptotic variance-covariance matrix

for the GEE estimates, which is robust against misspecification of the working

correlation matrix. Several structured working correlation matrices have been

proposed for parsimonious modeling of the marginal correlation, and some of them

are implemented in statistical software packages for GEE estimation (e.g., SAS's

proc genmod with the repeated statement and the type option or the gee and geel

packages in R).
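For illustration, the following minimal Python sketch (not code from this dissertation; the data and variable names are simulated and hypothetical) fits a Poisson marginal model by GEE with an exchangeable working correlation, using the GEE implementation in the statsmodels package:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical long-format data: 50 clusters with 6 repeated counts each.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"cluster": np.repeat(np.arange(50), 6),
                       "x": rng.normal(size=300)})
    df["y"] = rng.poisson(np.exp(0.5 + 0.3 * df["x"]))

    # Poisson mean/variance specification with an exchangeable working
    # correlation; robust (sandwich) standard errors are reported.
    fit = sm.GEE.from_formula("y ~ x", groups="cluster", data=df,
                              family=sm.families.Poisson(),
                              cov_struct=sm.cov_struct.Exchangeable()).fit()
    print(fit.summary())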

1.2.2.1 GEE for time series of counts

Zeger (1988) uses the GEE methodology to fit a marginal model to a time series $\{y_t\}_{t=1}^T$ of $T = 168$ monthly counts of cases of poliomyelitis in the United States. He specifies the marginal mean, variance and correlation by

$$\mu_t = \exp(x_t'\beta),$$
$$\text{var}(y_t) = \mu_t + \sigma^2\mu_t^2, \qquad (1.1)$$
$$\text{corr}(y_t, y_{t+\tau}) = \frac{\rho(\tau)}{\left[\{1 + (\sigma^2\mu_t)^{-1}\}\{1 + (\sigma^2\mu_{t+\tau})^{-1}\}\right]^{1/2}},$$

where $\sigma^2$ is the variance and $\rho(\tau)$ the autocorrelation function of an underlying random process $\{u_t\}$. To fit this marginal model, he proposes and outlines the GEE approach, but notes that it requires inversion of the $T \times T$ variance-covariance matrix of $y_1, \ldots, y_T$, which has no recognizable structure and therefore no simple inverse. Subsequently, he suggests approximating this matrix by a simpler, structured matrix, leading to nearly as efficient estimators as would have been obtained with the GEE approach. The variance component $\sigma^2$ and unknown parameters in $\rho(\tau)$ are estimated by a method of moments approach.
Interestingly, Zeger (1988) derives the marginal mean, variance and correlation in (1.1) from a random effects model specification: Conditional on an underlying latent random process $\{u_t\}$ with $E[u_t] = 1$ and $\text{cov}(u_t, u_{t+\tau}) = \sigma^2\rho(\tau)$, he initially models the time series observations as conditionally independent Poisson variables with mean and variance

$$E[y_t \mid u_t] = \text{var}(y_t \mid u_t) = \exp(x_t'\beta)u_t. \qquad (1.2)$$

Marginally, by the formula for repeated expectation, this leads to the moments presented in (1.1). From there we also see that the latent random process $\{u_t\}$ has introduced both overdispersion relative to a Poisson variable and autocorrelation among the observations. The models we will develop in subsequent chapters have similar features. The equation for the marginal correlation between $y_t$ and $y_{t^*}$ shows that the autocorrelation in the observed time series must be less than the autocorrelation in the latent process $\{u_t\}$. We will return to the polio data set in Chapter 5, where we compare this model to models suggested in this dissertation and elsewhere.
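Writing out the repeated expectation step makes the overdispersion explicit. Using (1.2) and the standard conditioning identities,

$$E[y_t] = E\{E[y_t \mid u_t]\} = \exp(x_t'\beta)E[u_t] = \mu_t,$$
$$\text{var}(y_t) = E\{\text{var}(y_t \mid u_t)\} + \text{var}\{E[y_t \mid u_t]\} = \mu_t + \sigma^2\mu_t^2,$$
$$\text{cov}(y_t, y_{t+\tau}) = \text{cov}\{E[y_t \mid u_t],\, E[y_{t+\tau} \mid u_{t+\tau}]\} = \mu_t\mu_{t+\tau}\sigma^2\rho(\tau), \qquad \tau \neq 0,$$

and dividing the covariance by the product of the standard deviations recovers the correlation formula in (1.1).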

1.2.2.2 GEE for binomial time series

For binary and binomial time series data, it is often more advantageous to model the association between observations using the odds ratio rather than directly specifying the marginal correlation $\text{corr}(y_t, y_{t^*})$ as with count data. The odds ratio is a more natural metric to measure association between binary outcomes and easier to interpret. The correlation between two binary outcomes $Y_1$ and $Y_2$ is also constrained in a complicated way by their marginal means $\mu_1 = P(Y_1 = 1)$ and $\mu_2 = P(Y_2 = 1)$ as a consequence of the following inequalities for their joint distribution:

$$P(Y_1 = 1, Y_2 = 1) = \mu_1 + \mu_2 - P(Y_1 = 1 \text{ or } Y_2 = 1) \geq \max\{0, \mu_1 + \mu_2 - 1\}$$

and

$$P(Y_1 = 1, Y_2 = 1) \leq \min\{\mu_1, \mu_2\},$$

leading to

$$\max\{0, \mu_1 + \mu_2 - 1\} \leq P(Y_1 = 1, Y_2 = 1) \leq \min\{\mu_1, \mu_2\}.$$

(For instance, with $\mu_1 = 0.9$ and $\mu_2 = 0.2$ the joint probability must lie in $[0.1, 0.2]$, which sharply restricts the attainable correlation.) Therefore, instead of marginal correlations, a number of authors (Fitzmaurice, Laird and Rotnitzky, 1993; Carey, Zeger and Diggle, 1993) propose the use of marginal odds ratios. For unequally spaced and unbalanced binary time series data, Fitzmaurice and Lipsitz (1995) present a GEE approach which models the marginal association using serial odds ratio patterns. Let $\psi_{tt^*}$ denote the marginal odds ratio between two binary observations $y_t$ and $y_{t^*}$. Their model for the association has the form

$$\psi_{tt^*} = \alpha^{1/|t - t^*|}, \qquad 1 < \alpha < \infty,$$

which has the property that as $|t - t^*| \to 0$, there is perfect association ($\psi_{tt^*} \to \infty$), and as $|t - t^*| \to \infty$, the observations are independent ($\psi_{tt^*} \to 1$). Note, however, that only positive association is possible with this type of model. (SAS's proc genmod now offers the possibility of specifying a general regression structure for the log odds ratios with the logor option.)

1.3 Transitional Models

In transitional models, past observations are simply treated as additional predictors. Interest lies in estimating the effects of these and other explanatory variables on the conditional mean of the response $y_t$, given realizations of the past responses. Specifying the relationship between the mean of $y_t$ and previous observations $y_{t-1}, y_{t-2}, \ldots$ is another way (and in contrast to the direct way of marginal models) of modeling the dependency between correlated responses. Transitional models fit into the framework of GLMs, where, however, the distribution of $y_t$ is now conditional on the past responses. The model in its most general form (Diggle et al., 2002) expresses the conditional mean of $y_t$ as a function of explanatory variables and $q$ functions $f_r(\cdot)$ of past responses,

$$E[y_t \mid H_t] = h^{-1}\left(x_t'\beta + \sum_{r=1}^{q} f_r(H_t; \alpha)\right), \qquad (1.3)$$

where $H_t = \{y_{t-1}, y_{t-2}, \ldots, y_1\}$ denotes the collection of past responses. $H_t$ can also include past explanatory variables and parameters. Often, the models are in discrete-time Markov chain form of order $q$, and the conditional distribution of $y_t$ given $H_t$ only depends on the last $q$ responses $y_{t-1}, \ldots, y_{t-q}$. For example, a transitional logistic regression model for binary responses that is a second order Markov chain has form

$$\text{logit } P(Y_t = 1 \mid y_{t-1}, y_{t-2}) = x_t'\beta + \alpha_1 y_{t-1} + \alpha_2 y_{t-2}.$$

The main difference between transitional models and regular GLMs or marginal models is parameter interpretation. Both the interpretation of $\alpha$ and the interpretation of $\beta$ are conditional on previous outcomes and depend on how many of these are included. As the time dependence in the model changes, so does the interpretation of parameters. With the logistic regression example from above, the conditional odds of success at time $t$ are $\exp(\alpha_1)$ times higher if the given previous response was a success rather than a failure. However, this interpretation assumes a fixed and given outcome at time $t - 2$. Similarly, a coefficient in $\beta$ represents the change in the log odds for a unit change in $x_t$, conditional on the two prior responses. It might be possible that we lose information on the covariate effect by conditioning on these previous outcomes. In general, the interpretation of parameters in transitional models is different from the population-averaged interpretation we discussed for marginal models, where parameters are effects on the marginal mean without conditioning on any previous outcomes.
1.3.1 Model Fitting

If a discrete-time Markov model applies, the likelihood for a generic series $y_1, \ldots, y_T$ is determined by the Markov chain structure:

$$L(\beta, \alpha; y_1, \ldots, y_T) = f(y_1, \ldots, y_q)\prod_{t=q+1}^{T} f(y_t \mid y_{t-1}, \ldots, y_{t-q}).$$

However, the transitional model (1.3) only specifies the conditional distributions appearing in the product, but not the first term of the likelihood. Often, instead of a full maximum likelihood approach, one conditions on the first $q$ observations and maximizes the corresponding conditional likelihood. If in addition $f_r(H_t; \alpha)$ in (1.3) is a linear function in $\alpha$ (and possibly $\beta$), then maximization follows along the lines of GLMs for independent data. Kaufmann (1987) establishes the asymptotic properties such as consistency, asymptotic normality and efficiency of the conditional maximum likelihood estimator.
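As an illustration of this conditional likelihood approach (on simulated data, not an example from this dissertation), a second order Markov logistic model reduces to an ordinary logistic GLM once the first two observations are conditioned on and lagged responses are added as covariates:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    T = 500
    x = rng.normal(size=T)
    y = np.zeros(T, dtype=int)
    # Simulate a second order Markov binary chain (hypothetical true values).
    for t in range(2, T):
        eta = -0.5 + 0.8 * x[t] + 1.0 * y[t - 1] + 0.5 * y[t - 2]
        y[t] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

    # Condition on y_1, y_2 and maximize the conditional likelihood: a
    # logistic GLM with y_{t-1} and y_{t-2} as additional covariates.
    X = sm.add_constant(np.column_stack([x[2:], y[1:-1], y[:-2]]))
    fit = sm.GLM(y[2:], X, family=sm.families.Binomial()).fit()
    print(fit.params)  # approximately (-0.5, 0.8, 1.0, 0.5)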

If a Markov assumption is not warranted, estimation can be based on the partial likelihood (Cox, 1975). To motivate the partial likelihood approach, we follow Kedem and Fokianos (2002): They consider occasions where a time series $\{y_t\}$ is observed jointly with a random covariate series $\{x_t\}$. The joint density of $(y_t, x_t)$, $t = 1, \ldots, T$, parameterized by a vector $\theta$, can be expressed as

$$f(x_1, y_1, \ldots, x_T, y_T; \theta) = f(x_1; \theta)\left[\prod_{t=2}^{T} f(x_t \mid H_t; \theta)\right]\left[\prod_{t=1}^{T} f(y_t \mid \tilde{H}_t; \theta)\right], \qquad (1.4)$$

where $H_t = (x_1, y_1, \ldots, x_{t-1}, y_{t-1})$ and $\tilde{H}_t = (x_1, y_1, \ldots, x_{t-1}, y_{t-1}, x_t)$ hold the history up to time points $t - 1$ and $t$, respectively. Let $\mathcal{F}_{t-1}$ denote the $\sigma$-field generated by $y_{t-1}, y_{t-2}, \ldots, x_t, x_{t-1}, \ldots$, i.e., $\mathcal{F}_{t-1}$ is generated by past responses and present and past values of the covariates. Also, let $f_t(y_t \mid \mathcal{F}_{t-1}; \theta)$ denote the conditional density of $y_t$, given $\mathcal{F}_{t-1}$, which is of exponential density form with mean modeled by (1.3). Then, the partial likelihood for $\theta = (\alpha, \beta)$ is given by

$$PL(\theta; y_1, \ldots, y_T) = \prod_{t=1}^{T} f_t(y_t \mid \mathcal{F}_{t-1}; \theta), \qquad (1.5)$$

which is the second product in (1.4) and hence the term partial. The loss of information by ignoring the first product in the joint density is considered small. If the covariate process is deterministic, then the partial likelihood becomes a conditional likelihood, but without the necessity of a Markov assumption on the distribution of the $y_t$'s.

Standard asymptotic results from likelihood analysis of independent data carry over to the case of partial likelihood estimation with dependent data. Fokianos and Kedem (1998) showed consistency and asymptotic normality of $\hat{\theta}$ and provided an expression for the asymptotic covariance matrix. Since the score equation obtained from (1.5) is identical to one for independent data in a GLM, partial likelihood holds the advantage of easy, fast and readily available software implementation with standard estimation routines such as iteratively re-weighted least squares.

1.3.2 Transitional Models for Time Series of Counts

For a time series of counts $\{y_t\}$, Zeger and Qaqish (1988) propose Markov-type transitional models which they fit using quasi-likelihood methods and the estimating equations approach. They consider various models for the conditional mean $\mu_t^c = E[y_t \mid H_t]$ of form $\log(\mu_t^c) = x_t'\beta + \sum_{r=1}^{q}\alpha_r f_r(H_{t-r})$, where for example $f_r(H_{t-r}) = y_{t-r}$ or $f_r(H_{t-r}) = \log(y_{t-r} + c) - \log(\exp[x_{t-r}'\beta] + c)$. One common goal of their models is to approximate the marginal mean by $E[y_t] = E[\mu_t^c] \approx \exp(x_t'\beta)$, so that $\beta$ has an approximate marginal interpretation as the change in the log mean for a unit change in the explanatory variables. Davis et al. (2003) develop these models further and propose $f_r(H_{t-r}) = (y_{t-r} - \mu_{t-r})/\mu_{t-r}^{\lambda}$ as a more appropriate function to build serial dependence into the model, where $\lambda$ is an additional parameter. They explore stability properties such as stationarity and ergodicity of these models and describe fast (in comparison to maximum likelihood techniques required for competing random effects models), recursive and iterative maximum likelihood estimation algorithms.

Chapter 4 in Kedem and Fokianos (2002) discusses regression models of form

(1.3) assuming a conditional Poisson or double-truncated Poisson distribution

for the counts, with inference based on the partial likelihood concept. Their

methodology is illustrated with two examples about monthly counts of rainy days

and counts of tourist arrivals.

1.3.3 Transitional Models for Binary Data

For binary data $\{y_t\}$, a two state, first order Markov chain can be defined by its probability transition matrix

$$P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix},$$

where $p_{ab} = P(Y_t = b \mid Y_{t-1} = a)$, $a, b = 0, 1$, are the one-step transition probabilities between the two states $a$ and $b$. Diggle et al. (2002, Chapt. 10.3) discuss various logistic regression models for these probabilities and higher order Markov chains for equally spaced observations. Unequally spaced data cannot be routinely handled with these models.

How can we determine the marginal association structure implied by the conditionally specified model? Let $p_1 = (p_0, p_1)$ be the initial marginal distribution for the states at time $t = 1$. Then the distribution of the states at time $n$ is given by $p_n = p_1 P^{n-1}$. As $n$ increases, $p_n$ approaches a steady state or equilibrium distribution that satisfies $p = pP$. The solution to this equation is given by $p_1 = P(Y_t = 1) = E[y_t] = p_{01}/(p_{01} + p_{10})$ and is used to derive marginal moments implied by the transitional model. For example, it can be shown (Kedem, 1980) that in the steady state, the marginal variance and correlation implied by the transitional model are $\text{var}(y_t) = p_0 p_1$ (as it should be) and $\text{corr}(y_{t-1}, y_t) = p_{11} - p_{01}$, respectively.
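These steady-state formulas are easy to verify numerically; the following Python sketch (an illustration with arbitrary transition probabilities) iterates the chain to its equilibrium distribution:

    import numpy as np

    p01, p10 = 0.3, 0.2                    # arbitrary transition probabilities
    P = np.array([[1 - p01, p01],
                  [p10, 1 - p10]])

    # Iterate p_n = p_1 P^{n-1} until the equilibrium distribution is reached.
    p = np.array([0.5, 0.5])
    for _ in range(200):
        p = p @ P

    print(p[1], p01 / (p01 + p10))         # both ~0.6, i.e., P(Y_t = 1)
    print(p[0] * p[1])                     # steady-state variance p0 * p1
    print(P[1, 1] - P[0, 1])               # lag-one autocorrelation p11 - p01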
Azzalini (1994) models serial dependence in binary data through transition models, but at the same time retains the marginal interpretation of regression parameters. He specifies the marginal regression model $\text{logit}(\mu_t) = x_t'\beta$ for a binary time series $\{y_t\}$ with $E[y_t] = \mu_t$, but assumes that a binary Markov chain with transition probabilities $p_{ab}$ has generated the data. Therefore, the likelihood refers to these probabilities but the model specifies marginal probabilities, a complication similar to the fitting of marginal models discussed in the previous section. However, assuming a constant log odds ratio

$$\psi = \log\frac{P(Y_{t-1} = 1, Y_t = 1)\,P(Y_{t-1} = 0, Y_t = 0)}{P(Y_{t-1} = 0, Y_t = 1)\,P(Y_{t-1} = 1, Y_t = 0)}$$

between any two adjacent observations, Azzalini (1994) shows how to write $p_{ab}$ in terms of just this log odds ratio $\psi$ and the marginal probabilities $\mu_t$ and $\mu_{t-1}$. Maximum likelihood estimation for such models is tedious but possible in closed form, although second derivatives of the log likelihood function have to be calculated numerically. A software package (the S-plus function rm.tools, Azzalini and Chiogna, 1997) exists to fit such models for binary and Poisson observations. Azzalini (1994) mentions that this basic approach can be extended to include variable odds ratios between any two adjacent observations, possibly depending on covariates, but this is not pursued in the article. Diggle et al. (2002) discuss these marginalized transitional models further.
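Although Azzalini (1994) gives closed-form expressions, the map from the log odds ratio and the two marginal means back to the transition probabilities can also be sketched numerically; the function below (our illustration, not the rm.tools code) solves for $P(Y_{t-1} = 1, Y_t = 1)$ by root finding between the Fréchet bounds from Section 1.2.2.2:

    import numpy as np
    from scipy.optimize import brentq

    def transition_probs(psi, mu_prev, mu_t, eps=1e-12):
        # Recover p_ab = P(Y_t = b | Y_{t-1} = a) from the constant log odds
        # ratio psi and the marginal means mu_prev = E[y_{t-1}], mu_t = E[y_t].
        lo = max(0.0, mu_prev + mu_t - 1.0) + eps    # Frechet lower bound
        hi = min(mu_prev, mu_t) - eps                # Frechet upper bound
        def f(p11):                                  # log odds ratio minus psi
            p10, p01 = mu_prev - p11, mu_t - p11
            p00 = 1.0 - mu_prev - mu_t + p11
            return np.log(p11 * p00 / (p01 * p10)) - psi
        p11 = brentq(f, lo, hi)
        return {"P(1->1)": p11 / mu_prev,
                "P(0->1)": (mu_t - p11) / (1.0 - mu_prev)}

    print(transition_probs(psi=1.5, mu_prev=0.4, mu_t=0.5))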
Chapter 2 in Kedem and Fokianos (2002) presents a detailed discussion of
partial likelihood estimation for transitional binary models and discusses, among
other examples, the eruption data of the Old Faithful geyser which we will turn to

in Chapter 5.








1.4 Random Effects Models

A popular way of modeling correlation among dependent observations is to

include random effects u in the linear predictor. One of the first developments for

discrete data occurred for longitudinal binary data, where subject-specific random

effects induced correlation between repeated binary measurements on a subject

(Bock and Aitkin, 1981; Stiratelli, Laird and Ware, 1984). In general, we assume

that unmeasurable factors give rise to the dependency in the data {yt} and random

effects {ut} represent the heterogeneity due to these unmeasured factors. Given

these effects, the responses are assumed independent. However, no values for these

factors are observed, and so marginally (i.e., averaged over these factors), the

responses are dependent.

Conditional on some random effects, we consider models that fit into the framework of GLMs for independent data, i.e., where the conditional distribution of $y_t \mid u$ is a member of the exponential family of distributions, whose mean $E[y_t \mid u_t]$ is modeled as a function of a linear predictor $\eta_t = x_t'\beta + z_t'u_t$. Together
with a distributional assumption for the random effects (usually independent and

identically normal), this leads to generalized linear mixed models (GLMMs), where

the term mixed refers to the mixture of fixed and random effects in the linear pre-

dictor. Chapter 2 contains a detailed definition of GLMMs and discusses maximum

likelihood fitting and parameter interpretation; in Chapter 3, correlated random

effects for the description of time dependent observations $\{y_t\}$ are motivated and

described. Here, we only give a short literature review about GLMMs which use

correlated random effects to model time (or space) dependent data.

1.4.1 Correlated Random Effects in GLMMs

One of the first papers considering correlated random effects in GLMMs for

the description of (spatial) dependence in Poisson data is Breslow and Clayton

(1993), who analyze lip cancer rates in Scottish counties. They propose correlated

normal random effects to capture the correlation in counts of adjacent districts in

Scotland. A random effect is assigned to each district, and two random effects are

correlated if their districts are adjacent to each other.

In Section 1.2.2 we mentioned the Polio data set of a time series of equally

spaced counts $\{y_t\}_{t=1}^{168}$ and formulated the conditional model (1.2) with a latent

process for the random effects. Instead of obtaining marginal moments as in Zeger

(1988), Chan and Ledolter (1995) use a GLMM approach with Poisson random

components and autoregressive random effects to analyze the time series. They

outline parameter estimation via an MCEM algorithm similar to the one discussed

in Sections 2.4 and 3.2 in this dissertation.

One of the three central generalized linear models advocated by Diggle et al.

(2002, Chap. 11.2) to model longitudinal data uses correlated random effects. For

equally spaced binary longitudinal data $\{y_{it}\}$, they plot response profiles simulated according to the model

$$\text{logit}[P(Y_{it} = 1 \mid u_{it})] = x_{it}'\beta + u_{it},$$
$$\text{cov}(u_{it}, u_{it^*}) = \sigma^2\rho^{|t - t^*|},$$

with $\sigma^2 = 2.5^2$ and $\rho = 0.9$, and note that the profiles exhibit more alternating runs of 0's and 1's than a random intercept model with $u_{i1} = u_{i2} = \cdots = u_{iT} = u_i$. However, based on the similarity between plots of random intercepts, random intercepts and slopes and autoregressive random effects models, they mention the challenge that binary data present in distinguishing and modeling the underlying dependency structure in longitudinal data. (They used $T = 25$ repeated observations for their simulations.) Furthermore, they state that numerical methods

for maximum likelihood estimation are computationally impractical for fitting

models with higher dimensional random effects. This makes it impossible, they

conclude, to fit the GLMM with serially correlated random effects using maximum

likelihood. Instead, they propose a Bayesian analysis using powerful Markov chain

Monte Carlo methods.
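The kind of model simulated by Diggle et al. is easy to reproduce; the following Python sketch (our illustration, using the variance and correlation values quoted above and a zero fixed part) generates one binary profile from a logit model with AR(1) random effects:

    import numpy as np

    rng = np.random.default_rng(2)
    T, sigma, rho = 25, 2.5, 0.9

    # Stationary AR(1) random effects: u_t = rho u_{t-1} + eps_t with
    # eps_t ~ N(0, sigma^2 (1 - rho^2)), so that var(u_t) = sigma^2.
    u = np.empty(T)
    u[0] = rng.normal(0.0, sigma)
    for t in range(1, T):
        u[t] = rho * u[t - 1] + rng.normal(0.0, sigma * np.sqrt(1 - rho**2))

    prob = 1.0 / (1.0 + np.exp(-u))    # logit model with x_it' beta = 0
    y = rng.binomial(1, prob)          # one simulated binary response profile
    print(y)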

Indeed, the majority of examples in the literature which consider correlated

random effects in a GLMM framework take a Bayesian approach. Sun, Speckman

and Tsutakawa (2000) explore several types of correlated random effects (autore-

gressive, generalized autoregressive and conditional autoregressive) in a Bayesian

analysis of a GLMM. As in any Bayesian analysis, the propriety of the posterior

distribution given the data is of concern when fixed effects and variance compo-

nents have improper prior distributions and random effects are (possibly singular)

multivariate normal. One of their results applied to Poisson or binomial data {yt}

states that the posterior might be improper when yt = 0 in the Poisson case and

cannot be proper when $y_t = 0$ or $y_t = n_t$ in the binomial case for any $t$ when

improper or non-informative priors are used.

Diggle, Tawn and Moyeed (1998) consider Gaussian spatial processes S(x)

to model spatial count data at locations x. The role of S(x) is to explain any

residual spatial variation after accounting for all known explanatory variables.

They also use a Bayesian framework to estimate parameters and give a solution to

the problem of predicting the count at a new location $\tilde{x}$. Ghosh et al. (1998) use

correlated random effects in Bayesian models for small area estimation problems.

They present an application of pairwise difference priors for random effects

to model a series of spatially correlated binomial observations in a Bayesian

framework. Zhang (2002) discusses maximum likelihood estimation with an

underlying spatial Gaussian process for spatially correlated binomial observations.

Bayesian models for binary time series are described in Liu (2001), based on probit-type models for correlated binary data which are discussed in Chib and Greenberg (1998). Probit-type models are motivated by assuming latent random variables $z = (z_1, \ldots, z_T)$, which follow a $N(\mu, \Sigma)$ distribution with $\mu = (\mu_1, \ldots, \mu_T)'$, $\mu_t = x_t'\beta$ and $\Sigma$ a correlation matrix. The $y_t$'s are assumed to be generated according to

$$y_t = I(z_t > 0),$$

where $I(\cdot)$ is the indicator function. This leads to the (marginal) probit model $P(Y_t = 1 \mid \beta, \Sigma) = \Phi(x_t'\beta)$. Rich classes of dependency structures between binary outcomes can be modeled through $\Sigma$. These models can further be extended to include random effects through $\mu_t = x_t'\beta + z_t'u_t$ or $q$ previous responses such as $\mu_t = x_t'\beta + \sum_{r=1}^{q}\alpha_r y_{t-r}$. It is important to note that $\Sigma$ has to be in correlation form. To see this, suppose it is not and let $S = D\Sigma D$ be a covariance matrix for the latent random variables $z$, where $D$ is a diagonal matrix holding standard deviation parameters. The joint density of the time series under the multivariate probit model is given by

$$P[(Y_1, \ldots, Y_T) = (y_1, \ldots, y_T)] = P[z \in A] = P[D^{-1}z \in A],$$

where $A = A_1 \times \cdots \times A_T$ with $A_t = (-\infty, 0]$ if $y_t = 0$ and $A_t = (0, \infty)$ if $y_t = 1$ are the intervals corresponding to the relationship $y_t = I(z_t > 0)$, for $t = 1, \ldots, T$. However, the above relationship is true for any parametrization of $D$, because the intervals $A_t$ are not affected by the transformation from $z$ to $D^{-1}z$. Hence, the elements of $D$ are not identifiable based on the joint distribution of the observed time series $y$.
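This argument is easily checked by simulation; in the sketch below (our illustration), the latent pair $(\mu, \Sigma)$ and its rescaled version $(D\mu, D\Sigma D)$ produce the same distribution over binary patterns:

    import numpy as np

    rng = np.random.default_rng(3)
    T, n = 3, 200000
    mu = np.array([0.2, -0.1, 0.4])
    Sigma = np.array([[1.00, 0.50, 0.25],
                      [0.50, 1.00, 0.50],
                      [0.25, 0.50, 1.00]])   # correlation matrix
    D = np.diag([1.0, 2.0, 0.5])             # arbitrary diagonal scales

    def pattern_freqs(mean, cov):
        # Relative frequency of each of the 2^T binary patterns y = I(z > 0).
        z = rng.multivariate_normal(mean, cov, size=n)
        codes = (z > 0).astype(int) @ (2 ** np.arange(T))
        return np.bincount(codes, minlength=2 ** T) / n

    # Rescaling z to Dz leaves every event {z in A} unchanged, since the
    # rectangles A_t are half-lines anchored at zero.
    print(pattern_freqs(mu, Sigma))
    print(pattern_freqs(D @ mu, D @ Sigma @ D))   # ~identical frequencies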

Lee and Nelder (2001) present models to analyze spatially correlated Poisson

counts and binomial longitudinal data about cancer mortality rates. They explore

a variety of patterned correlation structures for random effects in a GLMM setup.

Model fitting is based on the joint data likelihood of observations and unobserved

random effects (Lee and Nelder, 1996) and not on the marginal likelihood of the

observed data. Model diagnostic plots of estimated random effects are presented to

aid in selecting an appropriate correlation structure.

1.4.2 Other Modeling Approaches

In hidden Markov models (MacDonald and Zucchini, 1997) the underlying

random process is assumed to be a discrete state-space Markov chain instead

of a continuous (normal) process. Probability transition matrices describe the

connection between states. A very convenient property of hidden Markov models

is that the likelihood can be evaluated sufficiently fast to permit direct numerical

maximization. MacDonald and Zucchini (1997) present a detailed description of

hidden Markov models for the analysis of binary and count time series.

A connection between transitional models and random effects models is

explored in Aitkin and Alfò (1998). They model the success probabilities of

serial binary observations conditional on subject-specific random effects and on

the previous outcome. As in the models before, transition probabilities Pab are

changing over time due to the inclusion of time-dependent covariates and the

previous observation in the linear predictor. Additionally, random effects account

for possibly unobserved sources of heterogeneity between subjects. The authors

argue that the conditional model specification together with the specification of

the random effects distribution does not determine the distribution of the initial

observation, and hence the likelihood for this model is unspecified. They present

a solution by maximizing the likelihood obtained from conditioning on this first

observation. However, this causes the specified random effects distribution to shift

to an unknown distribution. Two approaches for estimation are outlined: The first

assumes another normal distribution for the new random effects distribution and

the likelihood is maximized using Gauss-Hermite quadrature. The second approach

assumes no parametric form for the new random effects distribution and follows the

nonparametric maximum likelihood approach (Aitkin, 1999). For binary data, the

new random effects distribution is only a two point distribution and its parameters
can be estimated via maximum likelihood jointly with the other model parameters.

Marginalized transitional models were briefly mentioned with the approach
taken by Azzalini (1994). The idea of "marginalizing", i.e., modeling the marginal
mean of an otherwise conditionally specified model, can also be applied to random
mean of an otherwise conditionally specified model can also be applied to random
effects models. The advantage of transitional or random effects models is the
ability to easily specify correlation patterns, with the potential disadvantage that
parameters in such models have conditional interpretations when the scientific goal

is on the interpretation of marginal relationships. In marginal models parameters

can directly be interpreted as contrasts between subpopulations without the need

of conditioning on previous observations or unobserved random effects. However, as

we mentioned in Section 1.2.2, likelihood based inference in marginal models might
not be possible.

A marginalized random effects model (Heagerty, 1999; Heagerty and Zeger, 2000) specifies two regression equations that are consistent with each other. The first equation,

$$\mu_t^M = E[y_t] = h^{-1}(x_t'\beta),$$

expresses the marginal mean $\mu_t^M$ as a function of covariates and describes systematic variation. The second equation characterizes the dependency structure among observations through specification of the conditional mean $\mu_t^C$,

$$\mu_t^C = E[y_t \mid u_t] = h^{-1}(\Delta_t(x_t) + z_t'u_t),$$

where $u_t$ are random effects with design vector $z_t$. Consistency between the marginal and conditional specification is achieved by defining $\Delta_t(x_t)$ implicitly through

$$\mu_t^M = E_u[\mu_t^C] = E_u[E[y_t \mid u_t]] = E_u[h^{-1}(\Delta_t(x_t) + z_t'u_t)].$$

For instance, in a marginalized GLMM with random effects distribution $F(u_t)$, $\Delta_t(x_t)$ is the solution to the integral equation

$$\mu_t^M = h^{-1}(x_t'\beta) = \int h^{-1}(\Delta_t(x_t) + z_t'u_t)\,dF(u_t),$$

so that $\Delta_t(x_t)$ is a function of the marginal regression coefficients $\beta$ and the (variance) parameters in $F(u_t)$. Maximum likelihood estimation is based on the integrated likelihood from the GLMM model.
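To make the implicit definition of $\Delta_t(x_t)$ concrete, the following sketch (our illustration, assuming a logit link and a scalar random intercept $u_t \sim N(0, \sigma^2)$) solves the integral equation by Gauss-Hermite quadrature and root finding:

    import numpy as np
    from scipy.optimize import brentq
    from scipy.special import expit

    nodes, weights = np.polynomial.hermite.hermgauss(40)

    def solve_delta(xb, sigma):
        # Find Delta with E_u[expit(Delta + u)] = expit(xb), u ~ N(0, sigma^2),
        # using int f(u) dN(0, sigma^2) ~ sum_i w_i f(sqrt(2) sigma x_i) / sqrt(pi).
        def marginal_mean(delta):
            return weights @ expit(delta + np.sqrt(2.0) * sigma * nodes) / np.sqrt(np.pi)
        return brentq(lambda d: marginal_mean(d) - expit(xb), -50.0, 50.0)

    # The conditional intercept exceeds the marginal linear predictor in
    # magnitude (attenuation of subject-specific effects):
    print(solve_delta(xb=1.0, sigma=2.0))   # > 1.0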

1.5 Motivation and Outline of the Dissertation

In this dissertation we propose generalized linear mixed models (GLMMs)

with correlated random effects to model count or binomial response data collected

over time or in space. For sequential or spatial Gaussian measurements, maximum

likelihood estimation is well established and software (e.g. SAS's proc mixed) is

available to fit fairly complicated correlation structures. The challenge for discrete

data lies in the fact that the observed (marginal) likelihood is not analytically

tractable, and maximization of it is more involved. Furthermore, with correlated

random effects, the likelihood does not break down into lower-dimensional com-

ponents which are easier to integrate numerically. Therefore, most approaches in

the literature are based on a quasi-likelihood approach or take a Bayesian per-

spective. The advantage of Bayesian models is that powerful Markov chain Monte

Carlo methods make it easier to obtain a sample from the posterior distribution of

interest than to obtain maximum likelihood estimates. However, priors must be

specified very carefully to ensure posterior propriety.

In addition, repeated observations are prone to missing data or unequally

spaced observation times. We would like to develop methods and models that allow

for unequally spaced binary, binomial or Poisson observations, making them more

general than previously presented in the literature.








To our knowledge, maximum likelihood estimation of GLMMs with such high

dimensional random effects has not been demonstrated before, with the exception

of the paper by Chan and Ledolter (1995) who consider fitting of a time series

of counts. However, they do not consider unequally spaced data and employ a

different implementation of the MCEM algorithm. In Chapter 5 we argue that

their implementation of the algorithm might have been stopped prematurely,

leading to different conclusions than our analysis and analyses published elsewhere.

Most articles that discuss correlated random effects do so for only a small number

of correlated random effects. For example, Chan and Kuk (1997) show that the data set

on salamander mating behavior published and analyzed in McCullagh and Nelder

(1989) is more appropriately analyzed when random effects pertaining to the male

salamander population are correlated over the three different time points when they

were observed. In this thesis, we would like to consider much longer sequences of

repeated observations.

In Chapter 2 we introduce the GLMM as the model of our choice to analyze

correlated discrete data and outline an EM algorithm to estimate fixed and random

effects, where both the E-step and the M-step require numerical approximations,

leading to an EM algorithm based on Monte Carlo methods (MCEM). Correlated

random effects and their implications on the analysis of GLMMs are discussed in

Chapter 3, together with a motivating example. This chapter also gives details for

the implementation of the algorithm and reports results from simulation studies.

Chapter 4 looks at marginal model properties and interpretation for correlated

binary, binomial or Poisson observations and Chapter 5 applies our methods to

real data sets from the social sciences, public health, sports and other backgrounds.

A summary and discussion of the methods and models presented here is given in

Chapter 6.













CHAPTER 2
GENERALIZED LINEAR MIXED MODELS

Chapter 1 reviewed various approaches of extending GLMs to deal with

correlated data. In this Chapter we will take a closer look at generalized linear

mixed models (GLMMs) which were briefly mentioned in Section 1.4. When

the response variables are normal, these models are simply called linear mixed

models (LMMs) and have been extensively discussed in the literature (see, for

example, the books by Searle, Casella and McCulloch, 1992, and Verbeke and

Molenberghs, 2000). The form of the normal density for observations and random

effects allows for analytical evaluation of the integrals together with straightforward

maximization. Hence, LMMs can be readily fit with existing software (e.g., SAS's

proc mixed), using rich classes of pre-specified correlation structures for the random

effects to model the dependence in the data more precisely. The broader notion

of GLMMs also encompasses binary, binomial, Poisson or gamma responses.

A distinctive feature of GLMMs is their so-called subject-specific parameter

interpretation, which differs from the interpretation of parameters in marginal

(Section 1.2) or transitional (Section 1.3) models. This feature is discussed in

Section 2.1, after a formal introduction of the GLMM. Throughout, special

attention is devoted to define GLMMs for discrete time series observations.

GLMMs are harder to fit because they typically involve intractable integrals

in the likelihood function. Section 2.2 outlines various approaches to model fitting.

Section 2.3 focuses on a Monte Carlo version of the EM algorithm which is an

indirect method of finding maximum likelihood estimates in GLMMs. Monte Carlo

methods are necessary because our applications involve correlated random effects

which lead to a very high-dimensional integral in the likelihood function. Parallel

to the discussion of GLMMs, state space models are introduced and a fitting

algorithm is described. State space models are popular models for discrete time

series in econometric applications (Durbin and Koopman, 2001). The presentation

of specific examples of GLMMs for discrete time series observation is deferred until

Chapter 5.

2.1 Definition and Notation

The generalized linear mixed model is an extension of the well known gen-

eralized linear model (McCullagh and Nelder, 1989) that permits fixed as well as

random effects in the linear predictor (hence the word mixed). The setup process

for GLMMs is split into two stages, which we present here using notation common

for longitudinal studies:

Firstly, conditional on cluster specific random effects ui, the data are assumed

to follow a GLM with independent random components Yit, the t-th response

in cluster i, i = 1,..., n, t = 1,..., ni. A cluster here is a generic expression

and means any form of observations being grouped together, such as repeated

observation on the same subject (cluster = subject), observations on different

students in the same school (cluster = school) or observations recorded in a

common time interval (cluster = time interval). The conditional distribution of Yit

is a member of the exponential family of distributions (e.g., McCullagh and Nelder,

1989) with form


f(yit I u,) = exp {[yitOit b(0it)]/(it + c(yit, ,t)} (2.1)

where Oit are natural parameters and b(.) and c(.) are certain functions determined

by the specific member of the exponential family. The parameters it are typically

of form Oit = O/wit where the wit's are known weights and 0 is a possibly unknown

dispersion parameter. For the discrete response GLMMs we are considering,

0 = 1. For a specific link function h(.), the model for the conditional mean for








observations yit has form


pit = b'(Oit) = E[Yi ui] = h-1(x'tP + zItu,), (2.2)

where x't and zt are covariate or design vectors for fixed and random effects

associated with observation yit and / is a vector of unknown regression coefficients.

At this first stage, z'tui can be regarded as a known offset for each observation,

and observations are conditionally independent.

It should be noted that relationship (2.2) between the mean of the observation

and fixed and random effects is exactly as is specified in the systematic part of

GLMs, with the exception that in GLMMs a conditional mean is modeled. This

affects parameter interpretation. The regression coefficients $\beta$ represent the effect

of explanatory variables on the conditional mean of observations, given the random

effects. For instance, observations in the same cluster $i$ share a common value

of the random cluster effect $u_i$, and hence $\beta$ describes the conditional effect of

explanatory variables, given the value for $u_i$. If the cluster consists of repeated

observations on the same subject, these effects are called subject-specific effects.

In contrast, regression coefficients in GLMs and marginal models describe the

effect of explanatory variables on the population average, which is an average over

observations in different clusters.

At the second stage, the random effects ui are specified to follow a multi-

variate normal distribution with mean zero and variance-covariance matrix $\Sigma_i$. A

standard assumption is that random effects $\{u_i\}$ are independent and identically

distributed, but an example at the beginning of Chapter 3 will show that this

is sometimes not appropriate. With time series observations, where the clusters

refer to time segments, it is reasonable to assume that observations are not only

correlated within the cluster (modeled by sharing the same cluster specific random

effect), but also across clusters, which we will model by assuming correlated cluster

specific random effects.

2.1.1 Generalized Linear Mixed Models for Univariate Discrete Time
Series

Most of the data we are going to analyze is in the form of a single univariate

time series. To emphasize this data structure, the general two-dimensional notation

(indices i and t) of a GLMM to model observations which come in clusters can be

simplified in two ways:

We can assume that a single cluster (i.e., $n = 1$ and $n_1 = T$) contains the entire time series $y_1, \ldots, y_T$. The random effects vector $u = (u_1, \ldots, u_T)$ associated with the single cluster has a random effects component for each individual time series member. The distribution of $u$ is multivariate normal with variance-covariance matrix $\Sigma$, which is different from the identity matrix. The correlation of the components of $u$ induces a correlation among the time series members. However, conditional on $u$, observations within the single cluster are independent. The cluster index $i$ is redundant in the notation and hence can be dropped. This representation is particularly useful when used with existing software to fit GLMMs, where it is often necessary to include a column indicating the cluster membership information for each observation. Here, since we have only one cluster, it suffices to include a column of all ones, say.

Alternatively, we can adopt the point of view that each member of the time series is a mini cluster by itself, containing only one observation (i.e., $n_i = 1$ for all $i = 1, \ldots, T$) in the case of a single time series. When multiple, parallel time series are observed, the cluster contains all $c$ observations at time point $t$ from the $c$ parallel time series (i.e., $n_i = c$ for all $i = 1, \ldots, T$). In any case, the clusters are then synonymous with the discrete time points at which observations were recorded. This makes the index $t$, which counts the repeated observations within a cluster, redundant ($t = 1$, or $t = 1, \ldots, c$, for all clusters $i$), but instead of denoting the time series by $\{y_i\}_{i=1}^T$, we decided to use the more common notation $\{y_t\}_{t=1}^T$, where $t$ now is the index for clusters or, equivalently, time points. In the following definition of GLMMs for univariate time series, the notation of clusters or time points can be used interchangeably. Conditional on unobserved random effects $u_1, \ldots, u_T$ for the different time points, observations $y_1, \ldots, y_T$ are assumed independent with distributions

\[
f(y_t \mid u_t) = \exp\left\{ \left[ y_t\theta_t - b(\theta_t) \right]/\phi_t + c(y_t, \phi_t) \right\} \tag{2.3}
\]
in the exponential family. As before, for a specific link function $h(\cdot)$, the model for the conditional mean has form
\[
E[y_t \mid u_t] = \mu_t = b'(\theta_t) = h^{-1}(x_t'\beta + z_t'u_t), \tag{2.4}
\]
where $x_t$ and $z_t$ are covariate or design vectors for the fixed and random effects associated with the $t$-th observation and $\beta$ is a vector of unknown regression coefficients. The random effects $u_1, \ldots, u_T$ are typically not independent. When collected in the vector $u = (u_1, \ldots, u_T)'$, a multivariate normal distribution with mean $0$ and covariance matrix $\Sigma$ can be directly specified. In particular, in Chapter 3 we will assume special patterned covariance matrices to allow for rich, but still parsimonious, classes of correlation structures among the time series observations.
The advantage of the second setup of mini clusters is that it also allows for other, indirect specifications of the random effects distribution, for instance through a latent random process. For this, we relate cluster-specific random effects from successive time points. For example, with univariate random effects, a first-order latent autoregressive process assumes that the random effects follow
\[
u_{t+1} = \rho u_t + \epsilon_t, \tag{2.5}
\]
where $\epsilon_t$ has a zero-mean normal distribution and $\rho$ is a correlation parameter.
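To make this concrete, here is a minimal simulation sketch (our own illustration in Python, with arbitrary parameter values; it is not taken from the analyses in this thesis) of a Poisson time series whose log-mean contains a latent first-order autoregressive process as in (2.5):

```python
import numpy as np

rng = np.random.default_rng(1)
T, beta0, rho, sigma_eps = 200, 0.5, 0.8, 0.3  # illustrative values

# latent first-order autoregressive random effects, as in (2.5)
u = np.zeros(T)
u[0] = rng.normal(0, sigma_eps / np.sqrt(1 - rho**2))  # stationary start
for t in range(1, T):
    u[t] = rho * u[t - 1] + rng.normal(0, sigma_eps)

# conditional on u, the observations are independent Poisson counts
mu = np.exp(beta0 + u)          # log link: h^{-1} = exp
y = rng.poisson(mu)
```

Conditional on the simulated path of $u$, the counts are independent; marginally, the autocorrelation of $u$ induces serial dependence in $y$.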

Cox (1981) called this type of model parameter-driven, as opposed to the transitional (or observation-driven) models discussed in Section 1.3. In parameter-driven models, an underlying and unobserved parameter process influences the distribution of a series of observations. The model for the polio data in Zeger (1988) is an example of a parameter-driven model. However, Zeger (1988) assumes neither normality nor zero mean for the latent autoregressive process. Furthermore, the natural logarithm of the random effects, and not the random effects themselves, appears additively in the linear predictor. Therefore, this model is slightly different from the specification of the time series GLMM above.

Another application of the mini cluster setup is to spatial settings, where clusters represent spatially aggregated data instead of time points. Then $u_1, \ldots, u_T$ is a collection of random effects associated with spatial clusters. Again, independent random effects are inappropriate for describing the spatial dependencies. In general, time dependent data are easier to handle, since observations are linearly ordered in time; more complicated random effects distributions are needed for spatial applications (e.g., Besag et al. 1995).

We will use the mini-cluster representation to facilitate a comparison to state

space models. This is the focus of the next section.

2.1.2 State Space Models for Discrete Time Series Observations

State space models are a rich alternative to the traditional Box-Jenkins ARIMA system for time series analysis. Similar to GLMMs, state space models for Gaussian and non-Gaussian time series split the modeling process into two stages: At the first stage, the responses $y_t$ are related to unobserved "states" by an observation equation. (State space models originated in the engineering literature, where parameters are often called states.) At the second stage, a latent or hidden Markov model is assumed for the states. For univariate Gaussian responses $y_1, \ldots, y_T$, the two equations of a state space model take the form
\[
y_t = w_t'\alpha_t + \epsilon_t, \qquad \epsilon_t \sim N(0, \sigma_\epsilon^2), \tag{2.6}
\]
\[
\alpha_t = T_t\alpha_{t-1} + R_t\xi_t, \qquad \xi_t \sim N(0, Q_t), \qquad t = 1, \ldots, T,
\]

where $w_t$ is an $m \times 1$ observation or design vector and $\epsilon_t$ is a white noise process. The unobserved $m \times 1$ state or parameter vector $\alpha_t$ is defined by the second, transition equation, where $T_t$ is a transition matrix and $\xi_t$ is another white noise process, independent of the first one. Compared to the standard GLMMs of Section 2.1, the main difference is that the random effects are correlated instead of i.i.d. In state space models, no clear distinction between fixed and random effects is made, and the state vector $\alpha_t$ can contain both. However, the form of the transition matrix $T_t$, together with the form of the matrix $R_t$, which consists of columns of the identity matrix $I_m$, allows one to declare certain elements of $\alpha_t$ as fixed effects and others as random. The matrix $R_t$ is called the selection matrix, since it selects the rows of the state equation which have nonzero variance terms. With this formulation, the variance-covariance matrix $Q_t$ is assumed to be non-singular. Furthermore, the transition matrix $T_t$ allows specification of which effects vary through time and which stay constant. (For a slightly different formulation without a selection matrix $R_t$ but with a possibly singular variance-covariance matrix $Q_t$, see Fahrmeir and Tutz, 2001, Chap. 8.)

State space models for non-Gaussian time series were considered by West et al. (1985) under the name dynamic generalized linear model. They used a Bayesian framework with conjugate priors to specify and fit their models. Durbin and Koopman (1997, 2000) and Fahrmeir and Tutz (2001) describe a state space structure for non-Gaussian observations similar to the two equations above. The normal distribution assumption for the observations in (2.6) is replaced by assuming a distribution in the exponential family with natural parameters $\theta_t$. With a canonical link, $\theta_t = w_t'\alpha_t$. This is called the signal by Durbin and Koopman (1997, 2000). In particular, given the states $\alpha_1, \ldots, \alpha_T$, observations $y_1, \ldots, y_T$ are conditionally independent and have density $p(y_t \mid \theta_t)$ in the exponential family (2.3). As in Gaussian state space models, the state vector $\alpha_t$ is determined by the vector autoregressive relationship
\[
\alpha_t = T_t\alpha_{t-1} + R_t\xi_t, \tag{2.7}
\]
where the serially independent $\xi_t$ typically have normal distributions with mean $0$ and variance-covariance matrix $Q_t$.

2.1.3 Structural Similarities Between State Space Models and GLMMs

There is a strong connection between state space models and GLMMs with a canonical link. To see this, we write the GLMM in state space form: Let $w_t = (x_t', z_t')'$ and $\alpha_t = (\beta_t', u_t')'$, where $x_t$, $z_t$, $\beta$ and $u_t$ are from the GLMM notation as defined in (2.4). Hence, the linear predictor $x_t'\beta + z_t'u_t$ of the GLMM is equal to the state space signal $\theta_t = w_t'\alpha_t$. Next, partition the disturbance term $\xi_t$ of the state equation into $\xi_t = (\xi_{1t}', \xi_{2t}')'$ and consider special transition and selection matrices of block form
\[
T_t = \begin{pmatrix} I & 0 \\ 0 & \tilde T_t \end{pmatrix}, \qquad
R_t = \begin{pmatrix} 0 & 0 \\ 0 & \tilde R_t \end{pmatrix}.
\]
Using transition equation (2.7) results in the following autoregressive relationship between the random effects of a GLMM: $u_t = \tilde T_t u_{t-1} + \epsilon_t$, where $\epsilon_t = \tilde R_t \xi_{2t}$ is a white noise component. In a univariate context, we have already motivated this type of relationship between random effects in equation (2.5). The transition equation also implies a constant effect $\beta$ for the GLMM, since $\beta_1 = \beta_2 = \cdots = \beta_T := \beta$. Hence, both models use correlated random effects, but GLMMs also typically involve fixed parameters which are not modeled as evolving over time.
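As a small numerical check of this block structure (our own sketch; the dimensions, the AR coefficient and the noise scale are hypothetical), iterating the transition equation with these matrices keeps the $\beta$ block of the state constant while the random effects block evolves as an AR(1):

```python
import numpy as np

p, q, rho = 2, 1, 0.8              # illustrative: 2 fixed effects, 1 random effect
I_p = np.eye(p)

# block transition and selection matrices as in Section 2.1.3
T = np.block([[I_p, np.zeros((p, q))],
              [np.zeros((q, p)), rho * np.eye(q)]])
R = np.block([[np.zeros((p, q))],
              [np.eye(q)]])

rng = np.random.default_rng(0)
alpha = np.concatenate([np.array([0.5, -1.0]), np.zeros(q)])  # (beta', u_0')'
for t in range(5):
    xi = rng.normal(0, 0.3, size=q)
    alpha = T @ alpha + R @ xi     # beta block stays constant, u evolves as AR(1)
    print(alpha)
```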








The restriction of the transition equation to the autoregressive form (often a simple random walk) is only one way of specifying a distribution for the random effects in the GLMMs of Section 2.1.1. Other structures, such as equally correlated random effects, are possible within the GLMM framework and are considered in Chapter 3.

2.1.4 Practical Differences

Although GLMMs and state space models are similar in structure, they are used differently in practice. This is in part due to the fact that in GLMMs the focus is on the fixed subject-specific regression parameters $\beta$, which refer to time-constant and time-varying covariates, while in state space models the main purpose is to infer properties of the time-varying random states $\alpha_t$. These are often assumed to follow a first or second order random walk. To illustrate, consider a data set on a monthly time series of counts presented in Durbin and Koopman (2000) for investigating the effectiveness of new seat belt legislation on automobile accidents. They specify the log-mean of a Poisson state space model as
\[
\log(\mu_t) = \nu_t + \lambda x_t + \gamma_t,
\]
where $\nu_t$ is a trend component following the random walk
\[
\nu_{t+1} = \nu_t + \xi_t,
\]
with $\xi_t$ a white noise process. Further, $\lambda$ is an intervention parameter corresponding to the change in seat-belt legislation ($x_t = 0$ before the change and equal to $1$ afterwards) and $\{\gamma_t\}$ are fixed seasonal components with $\sum_t \gamma_t = 0$, equal in every year. The main focus is on the parameter $\lambda$ describing the drop in the log means, also called the level, after the seat belt legislation went into effect.








For the series at hand, a GLMM approach would consider a fixed linear time effect $\beta$ and model the log-mean as
\[
\log(\mu_t) = \alpha + \beta t + \lambda x_t + \gamma_t + u_t,
\]
where $\alpha$ is the intercept of the linear time trend with slope $\beta$ and where correlated random effects $\{u_t\}$ account for the correlation in the monthly log means. Similar to above, $\lambda$ describes the effect on the log means after the seat belt legislation went into effect and the $\gamma_t$ are fixed seasonal components, equal for every year. The trend component $\nu_t$ of the state space model corresponds to $\alpha + u_t$ in the GLMM, which additionally allows for a linear time trend $\beta$. This approach seems to be favored by some discussants of the Durbin and Koopman (2000) paper. (In particular, see the discussions of that paper by Chatfield and by Aitkin, who mention the lack of a linear trend term in the proposed state space model.)

An even better GLMM approach, with linear time trends explicitly modeled, could use a change point formulation in the linear predictor, with the month the legislation went into effect (or was enforced) as the change point, and again with correlated random effects $\{u_t\}$ to capture the dependency among successive means. Such a specification would be harder to accommodate in a state space model.

In the reply to the discussion of their paper, Durbin and Koopman (2000) wrote that the two approaches (state space models versus hierarchical generalized linear models, a model class very similar to GLMMs) are very different and that they regard their treatment as more transparent and general for problems that specifically relate to time series. The presentation in this thesis of GLMMs with correlated random effects for time series analysis might weaken that argument. For instance, with the proposal of autocorrelated random effects $u_{t+1} = \rho u_t + \epsilon_t$ in a GLMM context, we have elegant means of introducing autocorrelation into a basic regression model that is well understood and whose parameters are easily interpreted. Furthermore, GLMMs can easily accommodate the case of multiple time series observations on each of several individuals or cross-sectional units, as is often observed in a longitudinal study.

One common feature of both models is the intractability of the likelihood function and the use of numerical and simulation techniques to obtain maximum likelihood estimates. In general, state space models for non-Gaussian time series are fit using a simulated maximum likelihood approach, which is also a popular method for fitting GLMMs. However, long time series necessarily result in models with complex and high dimensional random effects, and alternative, indirect methods may work better. Jank and Booth (2003) indicate that simulated maximum likelihood, the method of choice for estimation in state space models, may not work as well as indirect methods based on the EM algorithm, the method we will use to fit GLMMs for time series observations. The next section reviews various approaches to fitting GLMMs and contrasts them with the approach taken for state space models.

2.2 Maximum Likelihood Estimation

Maximum likelihood estimation in GLMMs is a challenging task, because it requires the calculation of integrals (often high dimensional) that have no known analytic solution. Following the general notation of a GLMM in Section 2.1, let $y_i = (y_{i1}, \ldots, y_{in_i})'$ be the vector of all observations in cluster $i$, whose associated random effects vector is $u_i$. Conditional independence of the $y_{it}$'s implies that the density function of $y_i$ is given by
\[
f(y_i \mid u_i; \beta) = \prod_{t=1}^{n_i} f(y_{it} \mid u_i; \beta), \tag{2.8}
\]
where the $f(y_{it} \mid u_i; \beta)$ are the exponential family densities in (2.1). The parameter $\beta$ is the vector of all unknown regression coefficients introduced by specifying model (2.2) for the mean of the observations. Furthermore, observations from different clusters are assumed conditionally independent, leading to the conditional joint density
\[
f(y_1, \ldots, y_n \mid u_1, \ldots, u_n; \beta) = \prod_{i=1}^{n} f(y_i \mid u_i; \beta)
\]
for all observations given all random effects. Let $g(u_1, \ldots, u_n; \psi)$ denote the multivariate normal density function of the random effects, whose variance-covariance matrix $\Sigma$ is determined by the variance component vector $\psi$. The goal is to estimate the unknown parameter vectors $\beta$ and $\psi$ by maximum likelihood. The likelihood function $L(\beta, \psi; y_1, \ldots, y_n)$ for a GLMM is given by the marginal density function of the observations $y_1, \ldots, y_n$, viewed as a function of the parameters, and is equal to
\[
L(\beta, \psi; y_1, \ldots, y_n) = \int f(y_1, \ldots, y_n \mid u_1, \ldots, u_n; \beta)\, g(u_1, \ldots, u_n; \psi)\, du_1 \cdots du_n
\]
\[
= \int \prod_{i=1}^{n} f(y_i \mid u_i; \beta)\, g(u_1, \ldots, u_n; \psi)\, du_1 \cdots du_n
\]
\[
= \int \prod_{i=1}^{n} \prod_{t=1}^{n_i} f(y_{it} \mid u_i; \beta)\, g(u_1, \ldots, u_n; \psi)\, du_1 \cdots du_n. \tag{2.9}
\]
It is called the "observed" likelihood because the unobserved random effects have been integrated out and (2.9) is a function of the observed data only. Except for the linear mixed model, where $f(y_1, \ldots, y_n \mid u_1, \ldots, u_n)$ is a normal density, the integral has no closed-form solution, and numerical procedures (analytic or stochastic) are necessary to calculate and maximize it. Standard maximization techniques such as Newton-Raphson or EM for fitting GLMs and LMMs have to be modified, because the conditional distribution of the observations and the distribution of the random effects are not conjugate and the integral is analytically intractable.
2.2.1 Direct and Indirect Maximum Likelihood Procedures

In general, there are two ways to obtain maximum likelihood estimates from the marginal likelihood in (2.9). The first is a direct approach that attempts to approximate the integral by either analytic or stochastic methods and then maximizes this approximation with respect to the parameters $\beta$ and $\psi$. Some common analytic approximation methods are Gauss-Hermite quadrature (Abramowitz and Stegun, 1964), a first-order Taylor series expansion of the integrand, and the Laplace approximation (Tierney and Kadane, 1986), which is based on a second-order Taylor series expansion. The two latter methods result in likelihood equations similar to those of a linear mixed model (Breslow and Clayton, 1993; Wolfinger and O'Connell, 1993), and by iteratively fitting such a model and re-expanding the integrand around updated parameter estimates one can obtain approximate maximum likelihood estimates. However, these methods have been shown to yield estimates which can be biased and inconsistent, an issue which is discussed in Lin and Breslow (1996) and Breslow and Lin (1995).

Techniques using stochastic integral approximations are known under the name simulated maximum likelihood and have been proposed by Geyer and Thompson (1992) and Gelfand and Carlin (1993). These methods approximate the integral in (2.9) by importance sampling (Robert and Casella, 1999) and are better suited for higher dimensional integrals than analytic approximations. Usually, the importance density depends on the parameters to be estimated, and so simulated maximum likelihood is used iteratively by first approximating the integral by a Monte Carlo sum with some initial values for the unknown parameters. Then the likelihood is maximized and the resulting parameter values are used to generate a new sample from the importance density in the next iteration. We will briefly discuss the idea behind importance sampling in Section 2.3.2. The simulated maximum likelihood approach is also further illustrated in the next section, where we discuss it in the context of state space models.
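To illustrate the stochastic approach in its simplest form (a sketch of our own; the model, the made-up data and the choice of the random effects density itself as the importance density are illustrative assumptions, not the procedure used later in this thesis), the integral in (2.9) for a Poisson random-intercept GLMM can be approximated cluster by cluster with a Monte Carlo sum:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)

def mc_loglik(beta0, sigma, y, m=5000):
    """Monte Carlo approximation of the marginal log-likelihood (2.9)
    for y_it | u_i ~ Poisson(exp(beta0 + u_i)), u_i ~ N(0, sigma^2) i.i.d."""
    total = 0.0
    for y_i in y:                          # one cluster at a time
        u = rng.normal(0.0, sigma, size=m) # draws from g(u_i; sigma)
        mu = np.exp(beta0 + u)             # conditional means, shape (m,)
        # conditional likelihood of the whole cluster for each sampled u
        lik = np.prod(poisson.pmf(np.asarray(y_i)[:, None], mu), axis=0)
        total += np.log(lik.mean())        # average over the m draws
    return total

y = [[2, 3, 1], [0, 1, 0], [5, 4, 6]]      # three small clusters (made up)
print(mc_loglik(beta0=0.7, sigma=0.5, y=y))
```

In practice a better-matched importance density is used, and the approximation is embedded in an iterative maximization as described above.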

An alternative to the direct approximating methods is the EM algorithm (Dempster et al., 1977). The integral in (2.9) is not directly maximized in this method; instead, it is maximized indirectly by considering a related function $Q(\cdot \mid \cdot)$. At each step of this algorithm, maximization of the $Q$-function increases the marginal likelihood, a fact that can be verified using Jensen's inequality. The EM algorithm relies on recognizing or inventing missing data which, together with the observed data, simplify maximum likelihood calculations. For GLMMs, the random effects $u_1, \ldots, u_n$ are treated as the missing data. In particular, let $\beta^{(k-1)}$ and $\psi^{(k-1)}$ denote the current (at the end of iteration $k-1$) values of the parameter vectors $\beta$ and $\psi$. Also, let $y' = (y_1', \ldots, y_n')$ and $u' = (u_1', \ldots, u_n')$ denote the vectors of all observations and their associated random effects. Then the $Q(\cdot \mid \cdot)$ function at the start of iteration $k$ has form
\[
Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) = E\left[\log j(y, u; \beta, \psi) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]
\]
\[
= E\left[\log f(y \mid u; \beta) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] + E\left[\log g(u; \psi) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right], \tag{2.10}
\]
where
\[
j(y, u; \beta, \psi) = f(y \mid u; \beta)\, g(u; \psi)
\]
denotes the joint density of the observed and missing data, also known as the complete data. The expectation in (2.10) is with respect to the conditional distribution of $u \mid y$, evaluated at the current parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)}$. The calculation of the expected value is called the E-step, and it is followed by an M-step which maximizes $Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)})$ with respect to $\beta$ and $\psi$. The resulting estimates $\beta^{(k)}$ and $\psi^{(k)}$ are used as updates in the next iteration to re-calculate the E-step and the M-step. Since $(\beta^{(k)}, \psi^{(k)})$ is the maximizer at iteration $k$,
\[
Q(\beta^{(k)}, \psi^{(k)} \mid \beta^{(k-1)}, \psi^{(k-1)}) \geq Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) \quad \text{for all } (\beta, \psi), \tag{2.11}
\]
and it follows that the likelihood increases (or at worst stays the same) from one iteration to the next:
\[
\log L(\beta^{(k)}, \psi^{(k)}; y) = Q(\beta^{(k)}, \psi^{(k)} \mid \beta^{(k-1)}, \psi^{(k-1)}) - E\left[\log h(u \mid y; \beta^{(k)}, \psi^{(k)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]
\]
\[
\geq Q(\beta^{(k-1)}, \psi^{(k-1)} \mid \beta^{(k-1)}, \psi^{(k-1)}) - E\left[\log h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]
\]
\[
= \log L(\beta^{(k-1)}, \psi^{(k-1)}; y).
\]
Here we used (2.11) and the fact that
\[
E\left[\log h(u \mid y; \beta^{(k)}, \psi^{(k)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] - E\left[\log h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]
\]
\[
= E\left[\log\!\left( \frac{h(u \mid y; \beta^{(k)}, \psi^{(k)})}{h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})} \right) \,\Big|\, y, \beta^{(k-1)}, \psi^{(k-1)}\right]
\leq \log E\left[ \frac{h(u \mid y; \beta^{(k)}, \psi^{(k)})}{h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})} \,\Big|\, y, \beta^{(k-1)}, \psi^{(k-1)}\right] = 0,
\]
where the inequality in the last step derives from Jensen's inequality. Under regularity conditions (Wu, 1983) and given initial starting values $(\beta^{(0)}, \psi^{(0)})$, the sequence of estimates $\{(\beta^{(k)}, \psi^{(k)})\}$ converges to the maximum likelihood estimators $(\hat\beta, \hat\psi)$.

The EM algorithm is most useful if replacing the calculation of the integral in the marginal likelihood (2.9) by the calculation of the integral in the $Q$-function (2.10) simplifies computation and maximization. Unfortunately, for GLMMs the integrals in (2.10) are also intractable, since the conditional density of $u \mid y$ involves the integral in (2.9). However, the EM algorithm may still be used by approximating the expectation in the E-step with appropriate Monte Carlo methods. The resulting algorithm is called the Monte Carlo EM algorithm (MCEM) and was proposed by Wei and Tanner (1990). We review it in detail in Section 2.3.








Some arguments favoring the use of the MCEM algorithm over direct methods such as simulated maximum likelihood for fitting GLMMs, especially when some variance components in $\psi$ are large, are given in Jank and Booth (2003) and Booth et al. (2001). Currently, the only available software for fitting GLMMs uses direct methods such as Gauss-Hermite quadrature or simulated maximum likelihood (e.g., SAS's proc nlmixed). The state space models of Section 2.1.2 are also fitted via simulated maximum likelihood. This is discussed in Section 2.2.3.

2.2.2 Model Fitting in a Bayesian Framework

In a Bayesian context, GLMMs are two-stage hierarchical models with appropriate priors on $\beta$ and $\psi$. Instead of obtaining maximum likelihood estimates of unknown parameters, a Bayesian analysis looks at their entire posterior distributions, given the observed data. Markov chain Monte Carlo techniques avoid the tedious integration in the posterior densities and allow for relatively easy simulation from these distributions, compared with the problems encountered in maximum likelihood estimation. This suggests approximating maximum likelihood estimates via a Bayesian route, assuming improper or at least very diffuse priors and exploiting the proportionality of the likelihood function and the posterior distribution of the parameters. However, for many discrete data models, improper priors may lead to improper posteriors. Natarajan and McCulloch (1995) demonstrate this with GLMMs for correlated binary data assuming independent $N(0, \sigma^2)$ random effects and a flat or non-informative prior for $\sigma^2$. Sun, Tsutakawa and Speckman (1999) and Sun, Speckman and Tsutakawa (2000) show that with noninformative (flat) priors on fixed effects and variance components of more complicated random effects distributions, propriety of the posterior distribution cannot be guaranteed for a Poisson GLMM when one of the observed counts is zero, and is impossible in a logit link GLMM for binomial$(n, \pi)$ observations if just one of the observations is equal to $0$ or $n$. Of course, the use of proper priors will always lead to proper posteriors. However, for the often employed diffuse but proper priors, Natarajan and McCulloch (1998) show that even with enormous simulation sizes, posterior estimates (such as the posterior mode) can be far away from maximum likelihood estimates, which makes their use undesirable in a frequentist setting.
2.2.3 Maximum Likelihood Estimation for State Space Models

The same problems as in the GLMM case arise in maximum likelihood fitting of non-Gaussian state space models. Here, we review a simulated maximum likelihood approach suggested by Durbin and Koopman (1997), using the notation introduced in Section 2.1.2. Let $p(y \mid \alpha; \psi) = \prod_t p(y_t \mid \alpha_t; \psi)$ denote the distribution of all observations given the states and let $p(\alpha; \psi)$ denote the distribution of the states, where $y$ and $\alpha$ are the stacked vectors of all observations and all states, respectively. The vector $\psi$ holds parameters that may appear in $w_t$, $T_t$ and $Q_t$. Let $p(y, \alpha; \psi)$ denote the joint density of observations and states. For practical purposes, it is easier to work with the signal $\theta_t$ instead of the high dimensional state vector $\alpha_t$. Hence, let $p(y \mid \theta; \psi)$, $p(\theta; \psi)$ and $p(y, \theta; \psi)$ denote the corresponding conditional, marginal and joint distributions parameterized in terms of the signals $\theta_t = w_t'\alpha_t$, $t = 1, \ldots, T$, where $\theta = (\theta_1, \ldots, \theta_T)'$. The observed likelihood is then given by the integral
\[
L(\psi; y) = \int p(y \mid \theta; \psi)\, p(\theta; \psi)\, d\theta. \tag{2.12}
\]
To maximize (2.12) with respect to $\psi$, Durbin and Koopman (1997, 2000) first calculate the likelihood $L_g(\psi; y)$ of an approximating Gaussian model and then obtain the true likelihood $L(\psi; y)$ by an adjustment to it. However, two different approaches for constructing the approximating Gaussian model are presented in the two papers. In Durbin and Koopman (1997) the approximating model is obtained by assuming that observations follow a linear Gaussian model
\[
y_t = w_t'\alpha_t + \epsilon_t = \theta_t + \epsilon_t,
\]







with $\epsilon_t \sim N(\mu_t, \sigma_t^2)$. All densities generated under this model are denoted by $g(\cdot)$. The two parameters $\mu_t$ and $\sigma_t^2$ are chosen such that the true density $p(y \mid \theta; \psi)$ and its normal approximation $g(y \mid \theta; \psi)$ are as close as possible in the neighborhood of the posterior mean $E_g[\theta \mid y]$. The state equations of the true non-Gaussian model and the Gaussian approximating model are assumed to be the same, which implies that the marginal density of $\theta$ is the same under both models, i.e., $p(\theta; \psi) = g(\theta; \psi)$. The likelihood of the approximating model is then given by
\[
L_g(\psi; y) = g(y; \psi) = \frac{g(y, \theta; \psi)}{g(\theta \mid y; \psi)} = \frac{g(y \mid \theta; \psi)\, p(\theta; \psi)}{g(\theta \mid y; \psi)}. \tag{2.13}
\]
This likelihood is calculated using a recursive procedure known as the Kalman filter (see, for instance, Fahrmeir and Tutz, 2001, Chap. 8). Alternatively, the approximating Gaussian model is a regular linear mixed model, and maximum likelihood calculations can be carried out using more familiar algorithms from the linear mixed model literature (see, for instance, Verbeke and Molenberghs, 2000). From (2.13),
\[
p(\theta; \psi) = L_g(\psi; y)\, \frac{g(\theta \mid y; \psi)}{g(y \mid \theta; \psi)},
\]
and upon plugging this into (2.12),
\[
L(\psi; y) = L_g(\psi; y)\, E_g\!\left[\frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\right], \tag{2.14}
\]
where $E_g$ denotes expectation with respect to the Gaussian density $g(\theta \mid y; \psi)$ generated by the approximating model. Hence, the observed likelihood of the non-Gaussian model can be estimated by the likelihood of an approximating Gaussian model and an adjustment factor, in particular
\[
L(\psi; y) \approx L_g(\psi; y)\, \hat w(\psi),
\]
where
\[
\hat w(\psi) = \frac{1}{m}\sum_{i=1}^{m} \frac{p(y \mid \theta^{(i)}; \psi)}{g(y \mid \theta^{(i)}; \psi)}
\]
is a Monte Carlo sum approximating the expected value $E_g$ with $m$ random samples $\theta^{(i)}$ from $g(\theta \mid y; \psi)$. Normality of $g(\theta \mid y; \psi)$ allows for straightforward simulation from this density.
simulation from this density.
A different approach for choosing the approximating Gaussian model is presented in Durbin and Koopman (2000). There, the model is determined by choosing $\mu_t$ and $Q_t$ of an approximating Gaussian state space model (2.6) such that the posterior densities $g(\theta \mid y; \psi)$ implied by the Gaussian model and $p(\theta \mid y; \psi)$ implied by the true model have the same posterior mode $\hat\theta$.

Formally, by dividing and multiplying (2.12) by the importance density $g(\theta \mid y; \psi)$, we can interpret approximation (2.14) as an importance sampling estimate of the observed likelihood, and the entire procedure as a simulated maximum likelihood approach:
\[
L(\psi; y) = \int \frac{p(y \mid \theta; \psi)\, p(\theta; \psi)}{g(\theta \mid y; \psi)}\, g(\theta \mid y; \psi)\, d\theta
= \int p(y \mid \theta; \psi)\, \frac{p(\theta; \psi)\, g(y; \psi)}{g(y, \theta; \psi)}\, g(\theta \mid y; \psi)\, d\theta
\]
\[
= \int \frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\, g(y; \psi)\, g(\theta \mid y; \psi)\, d\theta
= g(y; \psi)\, E_g\!\left[\frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\right].
\]
Durbin and Koopman (1997, 2000) present a clever way of artificially enlarging the simulated sample of $\theta^{(i)}$'s from the importance density $g(\theta \mid y; \psi)$ through the use of antithetic variables (Robert and Casella, 1999). These quadruple the sample size without additional simulation effort and balance the sample for location and scale. Overall, this leads to a reduction in the total sample size necessary to achieve a given precision in the estimates.








In practice, it is desirable in the maximization process to work with $\log L(\psi; y)$. Durbin and Koopman (1997, 2000) present a correction for the bias introduced by estimating $\log\left(E_g[p(y \mid \theta; \psi)/g(y \mid \theta; \psi)]\right)$. Finally, the resulting estimator of $\log L(\psi; y)$ can be maximized with respect to $\psi$ by a suitable numerical procedure, such as Newton-Raphson.

We mentioned before that simulated maximum likelihood can be computationally inefficient and suboptimal, especially when some variance components are large (Jank and Booth, 2003). As we will see in various examples in Chapter 5, large variance components (e.g., a large random effects variance) are the norm rather than the exception with the type of time series models we consider. Next, we will look at an alternative, indirect method for fitting our models. In principle, though, the methods just described are also applicable to GLMMs, through the close connections of GLMMs and state space models described above.

2.3 The Monte Carlo EM Algorithm

In Section 2.2 we presented the EM algorithm as an iterative procedure consisting of two components, the E-step and the M-step. The E-step calculates a conditional expectation, while the M-step subsequently maximizes this expectation. Often, at least one of these steps is analytically intractable, and in most of the applications considered here both steps are. Numerical methods (analytic and stochastic) have to be used to overcome these difficulties, whereby the E-step is usually the more troublesome. One popular way of approximating the expected value in the E-step uses Monte Carlo methods and is discussed in Wei and Tanner (1990), McCulloch (1994, 1997) and Booth and Hobert (1999). The Monte Carlo EM (MCEM) algorithm uses a sample from the distribution of the random effects $u$ given the observed data $y$ to approximate the $Q$-function in (2.10). In particular, at iteration $k$, let $u^{(1)}, \ldots, u^{(m)}$ be a sample from this distribution, denoted by $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$ and evaluated at the parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)}$








from the previous iteration. The approximation to (2.10) is then given by
\[
Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) = \frac{1}{m}\sum_{j=1}^{m} \log f(y \mid u^{(j)}; \beta) + \frac{1}{m}\sum_{j=1}^{m} \log g(u^{(j)}; \psi). \tag{2.15}
\]
As $m \to \infty$, $Q_m \to Q$ with probability one. The M-step then maximizes $Q_m$ instead of $Q$ with respect to $\beta$ and $\psi$, and the resulting estimates $\beta^{(k)}$ and $\psi^{(k)}$ are used in the next iteration to generate a new sample from $h(u \mid y; \beta^{(k)}, \psi^{(k)})$. If maximization is not possible in closed form, sometimes only a pair of values $(\beta, \psi)$ which satisfies $Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) \geq Q_m(\beta^{(k-1)}, \psi^{(k-1)} \mid \beta^{(k-1)}, \psi^{(k-1)})$, but which does not attain the global maximum, is chosen as the new parameter update $(\beta^{(k)}, \psi^{(k)})$. However, we show for our models that the global maximum can be approximated in very few steps.
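In code, the approximation (2.15) is just an average of complete-data log-likelihood terms over the sampled random effects. The sketch below is our own (it assumes a Poisson GLMM with one univariate random intercept per cluster and takes the E-step draws as given); it also makes visible that the first term depends only on $\beta$ and the second only on $\psi$:

```python
import numpy as np
from scipy.stats import poisson, norm

def Q_m(beta0, sigma, y, u_samples):
    """Monte Carlo E-step approximation (2.15) for a Poisson GLMM with
    one random intercept per cluster: y_it | u_i ~ Poisson(exp(beta0 + u_i)).
    u_samples has shape (m, n): m draws of the n cluster effects from h(u | y)."""
    y = np.asarray(y)                          # shape (n, n_i)
    mu = np.exp(beta0 + u_samples)[:, :, None] # shape (m, n, 1)
    q1 = poisson.logpmf(y[None, :, :], mu).sum(axis=(1, 2)).mean()  # depends on beta
    q2 = norm.logpdf(u_samples, 0.0, sigma).sum(axis=1).mean()      # depends on sigma
    return q1 + q2

# toy usage with made-up data and draws standing in for an actual E-step sample
y = [[2, 3, 1], [0, 1, 0]]
u = np.random.default_rng(3).normal(0, 0.5, size=(100, 2))
print(Q_m(beta0=0.5, sigma=0.5, y=y, u_samples=u))
```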

Maximization of $Q_m$ with respect to $\beta$ and $\psi$ is equivalent to maximizing the first term in (2.15) with respect to $\beta$ only and the second term with respect to $\psi$ only. This is due to the two-stage hierarchy of the response distribution and the random effects distribution in GLMMs and is discussed next. Different approaches to obtaining a sample from $h(u \mid y; \beta, \psi)$ for the approximation of the E-step are presented in Section 2.3.2, and convergence criteria are discussed in Section 2.3.3.

2.3.1 Maximization of Qm

For now we assume we have available a sample $u^{(1)}, \ldots, u^{(m)}$ from $h(u \mid y; \beta, \psi)$, or from an importance sampling distribution, generated by one of the mechanisms described in Section 2.3.2. Let $Q_m^1$ and $Q_m^2$ be the first and second terms of the sum in (2.15). Using the exponential family expression for the densities $f(y_{it} \mid u_i)$, at iteration $k$,
\[
Q_m^1(\beta \mid \beta^{(k-1)}, \psi^{(k-1)}) = \frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{n}\sum_{t=1}^{n_i}\left[y_{it}\,\theta_{it}^{(j)} - b\!\left(\theta_{it}^{(j)}\right)\right]\big/\phi_{it} + \text{const}, \tag{2.16}
\]
where, according to the GLMM specifications,
\[
b'\!\left(\theta_{it}^{(j)}\right) = \mu_{it}^{(j)} = h^{-1}\!\left(x_{it}'\beta + z_{it}'u_i^{(j)}\right),
\]
with $u_i^{(j)}$ the $i$-th component of the $j$-th sampled random effects vector $u^{(j)}$. Maximizing $Q_m^1$ with respect to $\beta$ is equivalent to fitting an augmented GLM with known offsets: For $j = 1, \ldots, m$, let $y_{it}^{(j)} = y_{it}$ and $x_{it}^{(j)} = x_{it}$ be the random components and known design vectors for this augmented GLM, and let $z_{it}'u_i^{(j)}$ be a known offset associated with each $y_{it}^{(j)}$. That is, we duplicate the original data set $m$ times and attach a known offset $z_{it}'u_i^{(j)}$ to each replicated observation. The model for the mean in the augmented GLM, $E[y_{it}^{(j)}] = \mu_{it}^{(j)} = h^{-1}(x_{it}'\beta + z_{it}'u_i^{(j)})$, is structurally equivalent to the model for the mean in the GLMM. Then the log-likelihood equations for estimating $\beta$ in the augmented GLM are proportional to $Q_m^1$. Hence, maximization of $Q_m^1$ with respect to $\beta$ follows along the lines of the well known iterative Newton-Raphson or Fisher scoring algorithms for GLMs. Denote by $\beta^{(k)}$ the parameter vector after convergence of one of these algorithms. It represents the value of the maximum likelihood estimator of $\beta$ at iteration $k$ of the MCEM algorithm.
the MCEM algorithm.
The expression for Q' depends on the assumed random effects distribution.
Most generally, let E be an unstructured nq x nq covariance matrix for the random
effects vector u = (ux,..., un), where q = E i ni and ni is the dimension of each
cluster specific random effect ui. Then, assuming u has a mean zero multivariate
normal distribution g(u; i) where 0/ holds the !nq(nq + 1) distinct elements of E,
Q2 has form

2 -c log u()1y-1
j=1
The goal is to maximize Q2 with respect to the variance components If of E. For
a general E, the maximum is obtained at the variance components of the sample








covariance matrix Sm = L u(ij)uWl)'. Denoting these by O(k) gives the value of
the maximum likelihood estimator of i, at iteration k of the MCEM algorithm.
The simplest structure occurs when random effects ui have independent
components and are i.i.d. across all clusters, where g(u; i) is then the product of
n N(0, ua2) densities and i/ = a. Q2 at iteration k is then maximized at a(k)
(- ZLj o u(')'U) 12. Many applications of GLMMs use this simple structure
of i.i.d. random effects, where often ui is a univariate random intercept. In this
case, the estimate of a at iteration k reduces to a(k) = ( 1 L 1 uJi)2)1/2
In Chapter 3 we will drop the assumption of independence and look at correlated
random effects, but with more parsimonious covariance structures than the most
general case presented here. Maximization of Q' with respect to t1 will be
presented there on a case by case basis.
2.3.2 Generating Samples from h(u | y; β, ψ)

So far we have assumed we had available a sample $u^{(1)}, \ldots, u^{(m)}$ to approximate the expected value in the E-step of the MCEM algorithm. This section describes how to generate such a sample from $h(u \mid y; \beta, \psi)$, which is known only up to a normalizing constant, or from an importance density $g(u)$. In the following, we will suppress the dependency on the parameters $\beta$ and $\psi$, since the densities are always evaluated at their current values. Three methods are presented: the accept-reject algorithm produces independent samples, while Metropolis-Hastings algorithms produce dependent samples. A detailed description of all three methods can be found in Robert and Casella (1999).
2.3.2.1 Accept-reject sampling in GLMMs
In general, for accept-reject sampling we need to find a candidate density $g$ and a constant $M$ such that, for the density of interest $h$ (the target density), $h(x) \leq M g(x)$ holds for all $x$ in the support of $h$. The algorithm is then to

1. generate $x \sim g$ and $w \sim$ Uniform$[0, 1]$;
2. accept $x$ as a random sample from $h$ if $w \leq \frac{h(x)}{M g(x)}$;
3. return to 1. otherwise.

This produces one random sample $x$ from the target density $h$. The probability of acceptance is $1/M$, and the expected number of trials until a variable is accepted is $M$.
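A generic implementation of this scheme for a GLMM might look as follows (our own sketch; the conditional log-likelihood and the bound log_M are supplied by the user, for instance via the saturated-likelihood bound derived below). Using the random effects prior $g(u)$ as the candidate, the acceptance condition reduces to $w \leq f(y \mid u)/M$ with $M \geq \sup_u f(y \mid u)$:

```python
import numpy as np

def accept_reject(log_f_y_given_u, sample_prior, log_M, rng, max_tries=100000):
    """Draw one u from h(u | y), proportional to f(y | u) g(u), using the
    prior g as candidate. log_f_y_given_u(u): conditional log-likelihood;
    log_M: an upper bound on its supremum over u."""
    for _ in range(max_tries):
        u = sample_prior(rng)                  # candidate draw from g(u)
        if np.log(rng.uniform()) <= log_f_y_given_u(u) - log_M:
            return u                           # accepted: an exact draw from h
    raise RuntimeError("no acceptance; the bound M may be too large")
```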
For our purpose, the target density is $h(u \mid y)$. Since $h(u \mid y) = f(y \mid u)g(u)/a \leq Mg(u)$, where $M = \sup_u f(y \mid u)/a$ and $a$ is an unknown normalizing constant equal to the marginal likelihood, the multivariate normal random effects distribution $g(u)$ can be used as a candidate density. Booth and Hobert (1999, Sect. 4.1) show that for certain models $\sup_u f(y \mid u)$ can be easily calculated from the data alone and thus need not be updated at every iteration. For some models we discuss here, the condition of Booth and Hobert (1999, Sect. 4.1, page 272) required for this simplification does not hold. However, the likelihood of a saturated GLM is always an upper bound for $f(y \mid u)$. To illustrate, regard $L(u) = f(y \mid u)$ as the likelihood corresponding to a GLM with random components $y_{it}$ and linear predictor $\eta_{it} = z_{it}'u_i + x_{it}'\beta$, where now $x_{it}'\beta$ plays the role of a known offset and the $u_i$ are the parameters of interest. The maximized likelihood $L(\hat u)$ for this model is always less than the maximized likelihood $L(y)$ for a saturated model. Hence, $\sup_u f(y \mid u) \leq L(\hat u) \leq L(y)$, and $L(\hat u)$ or $L(y)$ can be used to construct $M$.
Example: In Section 3.1 we consider a data set where, conditional on a random effect $u_t$, the $t$-th observation in group $i$, $y_{it}$, is modeled as a Binomial$(n_{it}, \pi_{it})$ random variable. There are 16 time points, i.e., $t = 1, \ldots, 16$, and two groups, $i = 1, 2$. A very simple logistic-normal GLMM for these data has form logit$(\pi_{it}(u_t)) = \alpha + \beta x_i + u_t$, where $x_i$ is a binary group indicator. The overall design matrix for this problem is the $32 \times 18$ matrix
\[
\begin{pmatrix}
1 & 0 & 1 & 0 & \cdots & 0 \\
1 & 1 & 1 & 0 & \cdots & 0 \\
1 & 0 & 0 & 1 & \cdots & 0 \\
1 & 1 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & & & \ddots & \vdots \\
1 & 0 & 0 & 0 & \cdots & 1 \\
1 & 1 & 0 & 0 & \cdots & 1
\end{pmatrix},
\]
where the columns hold the coefficients corresponding to $\alpha, \beta, u_1, u_2, \ldots, u_{16}$. All rows of this matrix are different, and as a consequence the condition of Booth and Hobert (1999, Sect. 4.1, page 272) does not hold. However, the saturated binomial likelihood $L(y)$ is an upper bound for $f(y \mid u)$, i.e.,
\[
\sup_u f(y \mid u) \leq L(y).
\]
For instance, with the logistic-normal example from above with linear predictor $\eta_{it} = c + u_t$, where $c = \alpha + \beta x_i$ represents the fixed part of the model, we have
\[
\sup_{u_t} f(y_{it} \mid u_t) = \sup_{u_t} \left(\frac{e^{c+u_t}}{1+e^{c+u_t}}\right)^{y_{it}} \left(\frac{1}{1+e^{c+u_t}}\right)^{n_{it}-y_{it}}.
\]
By first taking logs and then finding first and second derivatives with respect to $u_t$, we see that $u_t = \log\left(y_{it}/(n_{it}-y_{it})\right) - c$ maximizes this expression for $0 < y_{it} < n_{it}$. Plugging in, we obtain the result
\[
\sup_{u_t} f(y_{it} \mid u_t) = \left(\frac{y_{it}}{n_{it}}\right)^{y_{it}} \left(1 - \frac{y_{it}}{n_{it}}\right)^{n_{it}-y_{it}}.
\]
For the special cases of $y_{it} = 0$ or $y_{it} = n_{it}$, the trivial bound on $f(y_{it} \mid u_t)$ is $1$. Hence, the following inequality, which follows immediately from the above, can be used in constructing the accept-reject algorithm for a logistic-normal model with linear predictor of form $\eta_{it} = c + u_t$:
\[
\sup_{u} f(y \mid u) \leq \prod_{i,t} \sup_{u_t} \left(\frac{e^{c+u_t}}{1+e^{c+u_t}}\right)^{y_{it}} \left(\frac{1}{1+e^{c+u_t}}\right)^{n_{it}-y_{it}}
= \prod_{i,t} \left(\frac{y_{it}}{n_{it}}\right)^{y_{it}} \left(1 - \frac{y_{it}}{n_{it}}\right)^{n_{it}-y_{it}} = L(y).
\]
This means we can select $M = L(y)$ to meet the accept-reject condition, and consequently we accept a sample $u$ from $g(u)$ if, for $w \sim$ Uniform$[0, 1]$,
\[
w \leq \frac{h(u \mid y)}{M g(u)} = \frac{f(y \mid u)}{L(y)}.
\]
Notice that this condition is free of the normalizing constant $a$. In practice, especially for high dimensional random effects, $M$ can be very large, and therefore we almost never accept a sample. The two alternative methods described below may avoid this problem. Note, however, that the accept-reject method yields an independent and identically distributed sample from the target distribution. This is important if one wants to implement an automated MCEM algorithm (Booth and Hobert, 1999), where the Monte Carlo sample size $m$ is increased automatically as the algorithm progresses to adjust for the error in the Monte Carlo approximation to the E-step.

2.3.2.2 Markov chain Monte Carlo methods

For high dimensional distributions $h(u \mid y)$, which are unavoidable if correlated random effects are used, accept-reject methods can be very slow. An alternative is to generate a Markov chain with invariant distribution $h(u \mid y)$, which may be much faster but results in dependent samples. McCulloch (1997) discussed a Metropolis-Hastings algorithm for creating such a chain in the logistic-normal regression case. In general, an independence Metropolis-Hastings algorithm is built as follows: Choose a candidate density $\tilde g(u)$ with the same support as $h(u \mid y)$. Then, for a current state $u^{(j-1)}$,

1. generate $w \sim \tilde g$;
2. set $u^{(j)}$ equal to $w$ with probability $\rho = \min\left(1,\ \frac{h(w \mid y)\,\tilde g(u^{(j-1)})}{h(u^{(j-1)} \mid y)\,\tilde g(w)}\right)$ and equal to $u^{(j-1)}$ with probability $1 - \rho$.

After a sufficient burn-in time, the states of the generated chain can be regarded as a (dependent) sample from $h(u \mid y)$. If the candidate density $\tilde g(u)$ is chosen to be the density of the random effects $g(u)$, the acceptance probability in step 2 reduces to the simple form $\min\left(1,\ f(y \mid w)/f(y \mid u^{(j-1)})\right)$. To further speed up simulations, McCulloch (1997) uses a random scan algorithm which only updates the $k$-th component of the previous state $u^{(j-1)}$ and, upon acceptance in step 2, uses it as the new state.
Another popular MCMC algorithm is the Gibbs sampler. Let $u^{(j-1)} = (u_1^{(j-1)}, \ldots, u_n^{(j-1)})$ denote the current state of a Markov chain with invariant distribution $h(u \mid y)$. One iteration of the Gibbs sampler generates, componentwise,
\[
u_1^{(j)} \sim h(u_1 \mid u_2^{(j-1)}, \ldots, u_n^{(j-1)}, y)
\]
\[
\vdots
\]
\[
u_n^{(j)} \sim h(u_n \mid u_1^{(j)}, \ldots, u_{n-1}^{(j)}, y),
\]
where the $h(u_i \mid u_1^{(j)}, \ldots, u_{i-1}^{(j)}, u_{i+1}^{(j-1)}, \ldots, u_n^{(j-1)}, y)$ are the so-called full conditionals of $h(u \mid y)$. The vector $u^{(j)} = (u_1^{(j)}, \ldots, u_n^{(j)})$ represents the new state of the chain and, after a sufficient burn-in time, can be regarded as a sample from $h(u \mid y)$. The advantage of the Gibbs sampler is that it reduces sampling of a possibly very high-dimensional vector $u$ to sampling of several lower-dimensional components of $u$. We will use the Gibbs sampler in connection with autoregressive random effects to simplify sampling from an initially very high-dimensional distribution $h(u \mid y)$ by sampling from its simpler full univariate conditionals.
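In skeleton form (our own sketch; the full-conditional samplers themselves are model-specific and, for the autoregressive random effects used here, are developed in Section 3.4), one Gibbs iteration cycles once through the components:

```python
import numpy as np

def gibbs(full_conditional_samplers, n_iter, u0, rng):
    """Generic Gibbs sampler. full_conditional_samplers[i](u, rng) returns a
    draw of component i from h(u_i | all other components, y); u is updated
    in place so later components condition on the newest values."""
    u = np.array(u0, dtype=float)
    chain = np.empty((n_iter, u.size))
    for j in range(n_iter):
        for i, draw in enumerate(full_conditional_samplers):
            u[i] = draw(u, rng)      # condition on current values of the rest
        chain[j] = u
    return chain                     # discard a burn-in in practice
```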








2.3.2.3 Importance sampling
An importance sampling approximation to the $Q$-function in (2.10) is given by
\[
Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) = \sum_{j=1}^{m} w_j \log f(y \mid u^{(j)}; \beta) + \sum_{j=1}^{m} w_j \log g(u^{(j)}; \psi),
\]
where the $u^{(j)}$ are independent samples from an importance density $\tilde g(u; \psi^{(k-1)})$ and
\[
w_j = \frac{h(u^{(j)} \mid y; \beta^{(k-1)}, \psi^{(k-1)})}{\tilde g(u^{(j)}; \psi^{(k-1)})}
\propto \frac{f(y \mid u^{(j)}; \beta^{(k-1)})\, g(u^{(j)}; \psi^{(k-1)})}{\tilde g(u^{(j)}; \psi^{(k-1)})}
\]
are importance weights at iteration $k$. Usually, $Q_m$ is divided by the sum of the importance weights $\sum_{j=1}^m w_j$. The normalizing constant $a$ only depends on the known parameters $(\beta^{(k-1)}, \psi^{(k-1)})$ and hence plays no part in the following maximization step. Selecting the importance density $\tilde g$ is a delicate issue. It should be easy to simulate from but should also resemble $h(u \mid y)$ as closely as possible. Booth and Hobert (1999) suggest a Student $t$ density as the importance distribution $\tilde g$, whose mean and variance match those of $h(u \mid y)$, derived via a Laplace approximation.
2.3.3 Convergence Criteria

Due to the stochastic nature of the algorithm, parameter estimates from two successive iterations can be close together just by chance, even though convergence has not yet been achieved. To reduce the risk of stopping prematurely, we declare convergence if the relative change in parameter estimates is less than some $\epsilon_1$ for $c$ (e.g., five) consecutive times. Let $\lambda^{(k)} = (\beta^{(k)\prime}, \psi^{(k)\prime})'$ be the vector of unknown fixed effects parameters and variance components. Then this condition means that
\[
\max_i \frac{\left|\lambda_i^{(k)} - \lambda_i^{(k-1)}\right|}{\left|\lambda_i^{(k-1)}\right|} < \epsilon_1 \tag{2.17}
\]
has to be fulfilled for $c$ consecutive (e.g., five) $k$'s. For any $\lambda_i^{(k)}$, an exception to this rule occurs when the estimated standard error of that parameter is substantially larger than the change from one iteration to the next. Hence, at iteration $k$, for






those parameters satisfying
\[
\frac{\left|\lambda_i^{(k)} - \lambda_i^{(k-1)}\right|}{\sqrt{\widehat{\mathrm{var}}\left(\lambda_i^{(k)}\right)}} < \epsilon_2,
\]
where $\widehat{\mathrm{var}}(\lambda_i^{(k)})$ is the current estimate of the variance of the MLE $\hat\lambda_i$, the relative precision of criterion (2.17) need not be met. An estimate of this variance can be obtained from the observed information matrix of the ML estimator of $\lambda$. Louis (1982) showed that the observed information matrix can be written in terms of the first ($l'$) and second ($l''$) derivatives of the complete data log-likelihood $l(\lambda; y, u) = \log j(y, u; \lambda)$. Evaluated at the MLE $\hat\lambda$, it is given by
\[
I(\hat\lambda) = -E_{u \mid y}\left[l''(\hat\lambda; y, u) \mid y\right] - \mathrm{var}_{u \mid y}\left[l'(\hat\lambda; y, u) \mid y\right].
\]
An approximation to this matrix, at iteration $k$, uses Monte Carlo sums with draws from $h(u \mid y; \lambda^{(k)})$ from the current iteration of the MCEM algorithm.
To further safeguard against stopping prematurely, we use a third convergence criterion based on the $Q_m$ function. For deterministic EM, the $Q$ function is guaranteed to increase from iteration to iteration. With MCEM, because of the stochastic approximation, $Q_m^{(k)}$ can be less than $Q_m^{(k-1)}$ because of an "unlucky" Monte Carlo sample at iteration $k$. Hence, the parameter estimates obtained from maximizing $Q_m^{(k)}$ can be a step in the wrong direction and may actually decrease the value of the likelihood. To counter this, we declare convergence only if successive values of $Q_m^{(k)}$ are within a small neighborhood. More importantly, however, we accept the $k$-th parameter update $\lambda^{(k)}$ only if the relative change in the $Q_m$ function is larger than some small negative constant $\epsilon_3$, i.e.,
\[
\frac{Q_m^{(k)} - Q_m^{(k-1)}}{\left|Q_m^{(k-1)}\right|} > \epsilon_3. \tag{2.18}
\]
If at iteration $k$ (2.18) is not met and there is reason to believe that $\lambda^{(k)}$ decreases the likelihood and is worse than the parameter update from the previous iteration,







we repeat the $k$-th iteration with a new and larger Monte Carlo sample. Thereby, we hope to better approximate the $Q$-function and, as a result, obtain a better estimate $\lambda^{(k)}$ with a $Q_m$ value larger than the previous one. If this does not happen, we nevertheless accept $\lambda^{(k)}$ and proceed to the next iteration, possibly letting the algorithm temporarily move in the direction of a lower likelihood region. Otherwise, the Monte Carlo sample size quickly grows without bound at an early stage of the algorithm. Furthermore, at early stages the Monte Carlo error in the approximation of the $Q$ function can be large, and hence its trace plot is very volatile.
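Taken together, the stopping rules (2.17)-(2.18) and the standard-error exemption translate into a small helper function. The sketch below is our own reading of these rules (the bookkeeping of the counter and the exact combination of the criteria are implementation choices), with the tolerance values of the boat race example further below:

```python
import numpy as np

def converged(lam, lam_prev, Q, Q_prev, se, history,
              eps1=0.001, eps2=0.003, eps3=-0.005, c=4):
    """Check the stopping rules (2.17)-(2.18) for one MCEM iteration.
    history counts how many consecutive iterations have met (2.17) so far."""
    rel = np.abs(lam - lam_prev) / np.abs(lam_prev)
    # parameters whose change is small relative to their standard error
    # are exempt from the relative-precision criterion (2.17)
    exempt = np.abs(lam - lam_prev) / se < eps2
    crit1 = np.all(rel[~exempt] < eps1)
    crit3 = (Q - Q_prev) / abs(Q_prev) > eps3   # accept the update per (2.18)
    history = history + 1 if crit1 else 0
    return history >= c and crit3, history
```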

Caffo, Jank and Jones (2003) go a step further and calculate asymptotic confidence intervals for the change in the $Q_m$-function, based on which they construct a rule for accepting or rejecting $\lambda^{(k)}$. They discuss schemes for increasing the Monte Carlo sample size accordingly, and their MCEM algorithm inherits the ascent property of EM with high probability. However, we feel that the simpler criterion (2.18) suffices for the examples considered here.

Coupled with any convergence criterion is the question of the updating scheme for the Monte Carlo sample size $m$ between iterations. In general, we will use $m^{(k)} = a\, m^{(k-1)}$, where $a > 1$ and $m^{(k)}$ is the Monte Carlo sample size at iteration $k$. At early iterations, $m^{(k)}$ will be low, since big parameter jumps are expected regardless of the quality of the approximation and the Monte Carlo error associated with it. Later, as more weight is put on decreasing the Monte Carlo error in the approximations, this increase guarantees sufficiently large Monte Carlo samples. Furthermore, condition (2.18) signals when an additional boost in $m^{(k)}$ is needed to better approximate the $Q$-function in a given iteration. Hence, whenever (2.18) is not met, we re-run iteration $k$ with a bigger sample size $q\, m^{(k)}$, where $q > 1$ is usually between 1 and 2.









[Figure 2-1]

Figure 2-1: Plot of the typical behavior of the Monte Carlo sample size $m^{(k)}$ and the $Q$-function $Q_m^{(k)}$ through MCEM iterations. The iteration number is shown on the x-axis. Plots are based on the data and model for the boat race data discussed in Chapter 5.


A typical picture of the Monte Carlo sample size $m^{(k)}$ and the $Q_m^{(k)}$ function through the iterations of an MCEM algorithm is presented in Figure 2-1. The increase in the $Q_m^{(k)}$ function is large at the first iterations, but its Monte Carlo error is also large due to the small Monte Carlo sample size. The plot of the Monte Carlo sample size $m^{(k)}$ shows several jumps, corresponding to the events that the $Q_m^{(k)}$ function actually decreased by more than $\epsilon_3$ from one iteration to the next and we adjusted with an additional boost in generated samples. The data and model on which this plot is based are taken from the boat race example analyzed and discussed in Chapter 5, with convergence criteria set to $\epsilon_1 = 0.001$, $c = 4$, $\epsilon_2 = 0.003$, $\epsilon_3 = -0.005$, $a = 1.03$ and $q = 1.05$.

Fort and Moulines (2003) show that with geometrically ergodic (see, e.g., Robert and Casella, 1999) MCMC samplers, a polynomial increase in the Monte Carlo sample size leads to convergence of the MCEM parameter estimates. However, establishing geometric ergodicity is not an easy task. Other, more sophisticated and automated Monte Carlo sample size updating schemes are presented by Booth and Hobert (1999), for independent sampling, and by Caffo, Jank and Jones (2003), for independent and MCMC sampling.













CHAPTER 3
CORRELATED RANDOM EFFECTS

In Chapter 2 we mentioned on several occasions that for certain data structures the usual assumption of independent random effects is inappropriate. For instance, if clusters represent time points in a study over time, observations from different clusters can no longer be assumed (marginally) independent. Or, in longitudinal studies, the non-negative and exchangeable correlation structure among repeated observations implied by a single random effect can be far from the truth for long sequences of repeated observations. Section 3.1 presents data from a cross-sectional time series which motivate the use of correlated random effects and discusses their implications. In Sections 3.2 and 3.3, two special correlation structures useful for modeling the dependence structure in discrete repeated measures with possibly unequally spaced observation times are discussed. The main focus of this chapter is on the technical implications for the MCEM algorithm arising from estimating an additional variance (correlation) component. In contrast to models with independent random effects, the M-step has no closed form solution, and iterative methods have to be used to find the maximum. Also, because the random effects are correlated a priori, they are correlated a posteriori, and sampling from the posterior distribution of $u \mid y$, as required by the MCEM algorithm, is more involved than with independent random effects. A Gibbs sampling approach is developed in Section 3.4.

From here on we let $t$ denote the index for the discrete observation times, $t = 1, \ldots, T$, and we let $y_{it}$ denote a response at time point $t$ for stratum $i$, $i = 1, \ldots, n$. Throughout, we will assume univariate but correlated random effects $\{u_t\}_{t=1}^T$ associated with the observations over time.








3.1 A Motivating Example: Data from the General Social Survey

The basic purpose of the General Social Survey (GSS), conducted by the National Opinion Research Center, is to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors and attributes. It is second only to the census in popularity among sociologists as a data source for conducting research. The GSS questionnaire contains a standard core of demographic and attitudinal variables whose wording is retained throughout the years to facilitate time trend studies. (Source: www.norc.uchicago.edu/projects/gensocl.asp). Currently, the GSS comprises a total of 24 surveys, conducted in the years 1973-1978, 1980, 1982-1994, 1996, 1998, 2000 and 2002, with data available online (at www.webapp.icpsr.umich.edu/GSS/) through 1998. Two features, a discrete response variable (most of the attitude questions) observed through time and unequally spaced observation times, make it a prime resource for applying the models proposed in this dissertation. Data obtained from the GSS are different from longitudinal studies, where subjects are followed through time. Here, responses are from independent cross-sectional surveys of different subjects in each year.

One question, included in 16 of the 22 surveys through 1998, recorded attitude towards homosexual relationships. It was asked in the years 1974, 1976-77, 1980, 1982, 1984-85, 1987-1991, 1993-94, 1996 and 1998. We will use these data to motivate and illustrate the use of correlated random effects. Figure 3-1 shows the proportion of respondents who agreed with the statement that homosexual relationships are not wrong at all, for the two race cohorts of white respondents and black respondents. For simplicity in this introductory example, only race was chosen as a cross-classifying variable, and attitude was measured as answering "yes" or "no" to the aforementioned question.

[Figure 3-1]

Figure 3-1: Sampling proportions from the GSS data set. Proportion of whites (squares) and blacks (circles) agreeing with the statement that homosexual relationships are not wrong at all, from 1974 to 1998.

Let $y_{it}$ denote the number of people in year $t$ and of race $i$ who agreed with the statement that homosexual relationships

are not wrong at all. The index $t = 1, \ldots, 16$ runs through the set of 16 years $\{1974, 1976, 1977, 1980, \ldots, 1998\}$ mentioned above, and $i = 1$ for race equal to white and $i = 2$ for race equal to black. The conditional independence assumption discussed in Section 2.1 allows us to model $y_{it}$, the sum of $n_{it}$ binary variables which are the individual responses, as a binomial variable conditional on a yearly random effect. That is, the probabilistic model we propose assumes a conditional Binomial$(n_{it}, \pi_{it})$ distribution for each member of the two time series $\{y_{1t}\}_{t=1}^{16}$ and $\{y_{2t}\}_{t=1}^{16}$ pictured in Figure 3-1. The parameters $n_{it}$ and $\pi_{it}$ are, respectively, the total number of respondents of race $i$ in year $t$ and their conditional probability of agreeing with the statement that homosexual relationships are not wrong at all.

3.1.1 A GLMM Approach

A popular model for $\pi_{it}$ is a logistic-normal model, for which the link function $h(\cdot)$ in (2.2) is the logit link and the random effects structure simplifies to a random intercept $u_t$. We will assume that the fixed parameter vector $\beta$ is composed of an intercept term $\alpha$, linear and quadratic time effects $\beta_1$ and $\beta_2$, a race effect $\beta_3$ and a year-by-race interaction $\beta_4$. With $x_{1t}$ representing the year variable centered around 1984 (e.g., $x_{11} = 1974 - 1984 = -10$) and $x_{2i}$ the indicator variable for race (for whites $x_{21} = 0$, for blacks $x_{22} = 1$), the model has form
\[
\mathrm{logit}(\pi_{it}(u_t)) = \alpha + \beta_1 x_{1t} + \beta_2 x_{1t}^2 + \beta_3 x_{2i} + \beta_4 x_{1t} x_{2i} + u_t. \tag{3.1}
\]

Apart from the fixed effects, the random time effect $u_t$ captures the dependency structure over the years. Note that $\pi_{it}(u_t)$ is a conditional probability, given the random effect $u_t$ from the year the question was asked. This random effect $u_t$ can be interpreted as the unmeasurable public opinion about homosexual relationships, common to all respondents within the same year. By introducing this random effect, we assume that individual opinions are influenced by this overall opinion or by the social and political climate on homosexual relationships (like awareness of AIDS and the social spending associated with it, which is hard to measure). Thus, individual responses within a given year are no longer independent of each other, but share a common random effect. Furthermore, it is natural to assume that the public opinion about homosexual relationships changes gradually over time, with higher correlations for years closer together and lower correlations for years further apart. It would be wrong and unnatural to assume that the public opinion (or political climate) is independent from one year to the next. However, this is what would be assumed by modeling the random effects $\{u_t\}$ as independent of each other. It would also be wrong to assume a common, time-independent random effect $u = u_t$ for all time points $t$, as this implies that public opinion does not change over time. Its effect would then be the same whether responses were measured in 1974 or in 1998.

3.1.2 Motivating Correlated Random Effects

To capture the dependency in public opinion, and therefore in responses over different years, we propose random effects that are correlated. In particular, for this example with unequally spaced observation times, we suggest normal autocorrelated random effects $\{u_t\}$ with variance function
\[
\mathrm{var}(u_t) = \sigma^2, \qquad t = 1, \ldots, 16,
\]
and correlation function
\[
\mathrm{corr}(u_t, u_{t^*}) = \rho^{|x_{1t} - x_{1t^*}|}, \qquad 1 \leq t < t^* \leq 16,
\]
where $x_{1t} - x_{1t^*}$ is the difference between the two years identified by the indices $t$ and $t^*$. This is equivalent to specifying a latent autoregressive process
\[
u_{t+1} = \rho^{|x_{1,t+1} - x_{1t}|}\, u_t + \epsilon_t
\]
underlying the data generation mechanism. Both formulations naturally handle the multiple gaps in the observed time series. There is no need to make adjustments (such as imputation of data or artificially treating the series as equally spaced) in our analysis due to "missing" data in the years 1975, 1978-79, 1981, 1983, 1986, 1992, 1995 and 1997.

With correlated random effects, we have to distinguish between two situations:

- the correlation induced by assuming a common random effect $u_t$ for each cluster (here: year), and
- the correlation induced by assuming a correlation among the cluster-specific random effects $\{u_t\}$.

Correlation among observations in the same cluster is a consequence of assuming a single, cluster-specific random effect shared by all observations in that cluster. For example, the presence of the cluster-specific random effect $u_t$ in (3.1) leads to a (marginal) non-negative correlation among the two binomial responses $y_{1t}$ and $y_{2t}$ in year $t$. With conditional independence, the marginal covariance between these








two observations is given by
\[
\mathrm{cov}(y_{1t}, y_{2t}) = E[\mathrm{cov}(y_{1t}, y_{2t} \mid u_t)] + \mathrm{cov}\left(E[y_{1t} \mid u_t],\, E[y_{2t} \mid u_t]\right)
\]
\[
= \mathrm{cov}\left(\mathrm{logit}^{-1}(\eta_{1t} + u_t),\ \mathrm{logit}^{-1}(\eta_{2t} + u_t)\right), \tag{3.2}
\]
where $\eta_{it}$ is the fixed part of the linear predictor in (3.1). Both functions in (3.2) are monotone increasing in $u_t$, leading to a non-negative correlation. Approximations to (3.2) will be dealt with in Section 4.3.

In the example, we attributed the cause of this correlation to the current (at the time of the interview) public opinion about homosexual relationships, influencing all respondents in that year. The estimate of $\sigma$ gives an idea of the magnitude of this correlation, since the more disperse the $u_t$'s are, the stronger the correlation among the responses within a year. For instance, if the true $u_t$ for a particular year is positive and far away from zero, as measured by $\sigma$, then all respondents have a common tendency to give a positive answer. If it is far away from zero on the negative side, respondents have a common tendency towards a negative answer. This interpretation, of course, is only relative to the other fixed effects included in the linear predictor. For the GSS data, there seems to be moderate correlation between responses, based on a maximum likelihood estimate of $\hat\sigma = 0.10$ with an approximate asymptotic standard error of 0.03. This interpretation of a moderate effect of public opinion on responses within the same year is further supported by the fact that $\sigma$ can also be interpreted as the regression coefficient for a standardized version of the random effect $u_t$. A regression coefficient of 0.10 for a standard normal variable on the logit scale leads to moderate heterogeneity on the probability scale. This shows that the correlation between responses within a common year cannot be neglected.








The second consequence of correlated random effects is that observations from

different clusters are correlated, which is a distinctive feature compared to GLMMs

assuming independence between cluster-specific random effects. The conditional log

odds of agreeing with the statement that homosexual relationships are not wrong

at all are now correlated over the years, a feature which is natural for a time series

of binomial observations but would have gone unaccounted for if time-independent

random effects were used. For instance, for the cohort of white respondents (i = 1), the correlation between the conditional log odds at years t and t* is

corr(logit(π_{1t}(ut)), logit(π_{1t*}(ut*))) = ρ^{|x_{1t} − x_{1t*}|}

and therefore directly related to the assumed random effects correlation structure.

Marginally, the two binomial responses at the different observation times have covariance

cov(Y_{1t}, Y_{2t*}) = cov(logit^{−1}(η_{1t} + ut), logit^{−1}(η_{2t*} + ut*)),

which accommodates changing covariance patterns for different observation times (e.g., decreasing with increasing lag) and also negative covariances (see for instance the analysis of the Old Faithful geyser eruption data in Chapter 5). We will present approximations to these marginal correlations in binomial time series in Section 4.3.

Summing up, correlated random effects give us a means of incorporating correlation between sequential binomial observations that goes beyond independent or exchangeable correlation structures.

In our example, we attributed the sequential correlation to the gradual change

in public opinion about homosexual relationships over the years, affecting both

races equally. In fact, the maximum likelihood estimate of ρ is equal to 0.65 (s.e.








0.25), indicating that rather strong correlations might exist between responses from

adjacent years.

The model uses 7 parameters (5 fixed effects, 2 variance components) to describe the 32 probabilities. In comparison with a regular GLM and a GLMM with independent random time effects, the maximized log-likelihood increases from −113.0 for the regular GLM to approximately −109.7 for a GLMM with independent random effects and to approximately −107.5 for the GLMM with autoregressive random effects. Note that the GLM assumes independent observations within and between the years, and that the GLMM with independent random effects {ut} for each year t assumes correlation of responses within a year, but independence of responses over the years. Both assumptions might be inappropriate. Our model implies that the log odds of approval of homosexual relationships are correlated for blacks and whites within a year (though not very strongly, with an estimate of σ equal to 0.1) and are also correlated for two consecutive years.

The estimates of the fixed parameters and their asymptotic standard errors are

given in Table 5-1. The MCEM algorithm converged after 128 iterations with a

starting Monte Carlo sample size of 50 and a final Monte Carlo sample size of 8600.

Convergence parameters (cf. Section 2.3.3) were ε₁ = 0.002, c = 5, ε₂ = 0.003, ε₃ = −0.001, α = 1.01 and q = 1.2. Path plots of selected parameter estimates

for two different sets of starting values are shown in Figure 3-2. A detailed

interpretation of the parameters and the effects of the explanatory variables on the

odds of approval is provided in Section 4.3.

Although this example assumed autocorrelated random effects, we will look at the simpler case of equally correlated random effects next. Then, we discuss how the correlation parameter ρ can be estimated within the MCEM framework presented in Section 2.3. Summing up, correlated random effects in GLMMs allow one to model within-cluster as well as between-cluster correlations for discrete response variables, where clusters refer to a grouping of responses in time.

Figure 3-2: Iteration history for selected parameters and their asymptotic standard errors for the GSS data.
The iteration number is plotted on the x-axis. The estimates and standard errors for σ² were multiplied by 10³ for better plotting. The two different lines in each plot correspond to two different sets of starting values.

3.2 Equally Correlated Random Effects

The introductory example modeled decaying correlation between cross-
sectional data over time through the use of autocorrelated random effects. In other
temporal or spatial settings, the correlation might stay nearly constant between any
two observation times, regardless of time or location differences between the two
discrete responses. Equally correlated random effects might then be appropriate to describe such behavior.
3.2.1 Definition of Equally Correlated Random Effects

We call random effects equally correlated if

var(ut) = σ² for all t

and

corr(ut, ut*) = ρ for all t ≠ t*.

More generally, the covariance matrix of the random effects vector u = (u₁, ..., u_T)' is given by Σ = σ²[(1 − ρ)I_T + ρJ_T], where J_T = 1_T 1_T'. To ensure positive definiteness, ρ has restricted range, i.e., 1 > ρ > −1/(T − 1). The random effects density is given by

g(u; ψ) ∝ |Σ|^{−1/2} exp(−½ u'Σ^{−1}u),    (3.3)

where now, due to the pattern in Σ, |Σ| = σ^{2T}(1 − ρ)^{T−1}[1 + (T − 1)ρ] and

Σ^{−1} = (1/(σ²(1 − ρ))) [ I_T − (ρ/(1 + (T − 1)ρ)) J_T ].

The vector ψ = (σ, ρ) holds the variance components of Σ.
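As a quick sanity check on these closed forms (not part of the fitting algorithm itself), one can compare them numerically to generic matrix routines; the dimension and parameter values below are arbitrary.

```python
# Numerical check of |Sigma| and Sigma^{-1} for the equicorrelated
# structure; T, sigma and rho are arbitrary test values.
import numpy as np

T, sigma, rho = 5, 1.3, 0.4            # requires rho > -1/(T-1)
I, J = np.eye(T), np.ones((T, T))
Sigma = sigma**2 * ((1 - rho) * I + rho * J)

det_closed = sigma**(2 * T) * (1 - rho)**(T - 1) * (1 + (T - 1) * rho)
inv_closed = (I - rho / (1 + (T - 1) * rho) * J) / (sigma**2 * (1 - rho))

print(np.isclose(np.linalg.det(Sigma), det_closed))   # True
print(np.allclose(np.linalg.inv(Sigma), inv_closed))  # True
```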
The more complicated random effects structure (as compared to independence or a single latent random effect) leads to a more complicated M-step in the MCEM algorithm described in Section 2.3. For a sample u^{(1)}, ..., u^{(m)} from the posterior h(u | y; β^{(k−1)}, ψ^{(k−1)}) evaluated at the previous parameter estimates β^{(k−1)} and ψ^{(k−1)}, the function Q₂(ψ) introduced in Section 2.3.1 has form

Q₂(ψ) = (1/m) Σ_{j=1}^m log g(u^{(j)}; ψ)
      ∝ −T log σ − ((T − 1)/2) log(1 − ρ) − ½ log[1 + (T − 1)ρ]
        − a/(2σ²(1 − ρ)) + ρ b/(2σ²(1 − ρ)[1 + (T − 1)ρ]),

where a = (1/m) Σ_{j=1}^m u^{(j)'}u^{(j)} and b = (1/m) Σ_{j=1}^m u^{(j)'}J_T u^{(j)} are constants depending on the sample only.
3.2.2 The M-step with Equally Correlated Random Effects
The M-step seeks to maximize Q₂ with respect to σ and ρ, which is equivalent to finding their MLEs treating the sample u^{(1)}, ..., u^{(m)} as independent. Since this is not possible in closed form, one way to maximize Q₂ uses a bivariate Newton-Raphson algorithm with the Hessian formed by the second order partial and mixed derivatives of Q₂ with respect to σ and ρ. Some authors (e.g., Lange, 1995; Zhang, 2002) use only a single iteration of the Newton-Raphson algorithm instead of an entire M-step to speed up convergence. However, this might not always lead to convergence, since the interval for which the Newton-Raphson algorithm converges is restricted through the restrictions on ρ. We show now that with a little bit of work the maximizers for σ and ρ can be obtained very quickly.
For any given value of ρ, the ML estimator for σ (at iteration k) is available in closed form and is equal to

σ̂^{(k)} = ( a/(T(1 − ρ)) − ρ b/(T(1 − ρ)[1 + (T − 1)ρ]) )^{1/2}.

Note that if ρ = 0, σ̂^{(k)} = ((1/(Tm)) Σ_{j=1}^m u^{(j)'}u^{(j)})^{1/2}, the estimator for the independence case presented at the end of Section 2.3.1. Unfortunately, the ML estimator for ρ has no closed form solution. The first and second partial derivatives of Q₂ with respect to ρ are given by

∂Q₂/∂ρ = T(T − 1)ρ / (2(1 − ρ)[1 + (T − 1)ρ]) − (a − b/T)/(2σ²(1 − ρ)²)
         + (T − 1)b/(2Tσ²[1 + (T − 1)ρ]²),

∂²Q₂/∂ρ² = T(T − 1)[1 + (T − 1)ρ²] / (2(1 − ρ)²[1 + (T − 1)ρ]²) − (a − b/T)/(σ²(1 − ρ)³)
           − (T − 1)²b/(Tσ²[1 + (T − 1)ρ]³).

We obtain the profile likelihood for ρ by plugging the MLE σ̂^{(k)} into the likelihood equation for ρ. Then we use a simple and fast interval-halving (or bisection) method to find the root for ρ. This is advantageous compared to a Newton-Raphson algorithm since the range of ρ is restricted. Let f(ρ) = ∂Q₂/∂ρ |_{σ=σ̂^{(k)}} and let ρ₁ and ρ₂ be two initial estimates in the appropriate range, satisfying ρ₁ < ρ₂ and f(ρ₁)f(ρ₂) < 0. Without loss of generality, assume f(ρ₁) < 0. Clearly, the maximum likelihood estimate ρ̂ must be in the interval [ρ₁, ρ₂]. The interval-halving method computes the midpoint ρ₃ = (ρ₁ + ρ₂)/2 of this interval and updates one of its endpoints in the following way: it sets ρ₁ = ρ₃ if f(ρ₃) < 0, or ρ₂ = ρ₃ otherwise. The newly formed interval [ρ₁, ρ₂] has half the length of the initial interval, but still contains ρ̂. Subsequently, a new midpoint ρ₃ is calculated, giving rise to a new interval with one fourth of the length of the initial interval, but still containing ρ̂. This process is iterated until |f(ρ₃)| < ε, where ε is a small positive constant. To ensure it is a maximum, we can check that the value of the second derivative f'(ρ) is negative at ρ₃. (The second derivative is also needed for approximating standard errors in the EM algorithm.) The value of ρ₃ is then used as an update for ρ in the maximum likelihood estimator for σ, and the whole process of finding the roots of Q₂ is repeated. Convergence is declared when the relative change in σ and ρ is less than some pre-specified small constant. The values of σ and ρ at this final iteration are the estimates σ̂^{(k)} and ρ̂^{(k)} from MCEM iteration k.
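A minimal sketch of this interval-halving step is given below; the function score stands for the profile score f(ρ) = ∂Q₂/∂ρ evaluated at σ = σ̂^{(k)} and is assumed to be supplied by the surrounding M-step code.

```python
# Minimal sketch of the interval-halving (bisection) root search for rho.
# `score` is a placeholder for f(rho) = dQ2/drho at sigma = sigma_hat(rho).
def bisect_rho(score, rho1, rho2, eps=1e-8, max_iter=200):
    f1, f2 = score(rho1), score(rho2)
    assert f1 * f2 < 0, "initial values must bracket the root"
    if f1 > 0:                      # arrange f(rho1) < 0 as in the text
        rho1, rho2 = rho2, rho1
    for _ in range(max_iter):
        rho3 = 0.5 * (rho1 + rho2)  # midpoint of the current bracket
        f3 = score(rho3)
        if abs(f3) < eps:           # |f(rho3)| < eps: declare convergence
            break
        if f3 < 0:
            rho1 = rho3             # root lies between rho3 and rho2
        else:
            rho2 = rho3             # root lies between rho1 and rho3
    return rho3
```

In practice one would also verify that the second derivative is negative at the returned value, as noted above.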








The issue of how to obtain a sample u^{(1)}, ..., u^{(m)} from h(u | y; β^{(k−1)}, ψ^{(k−1)}),

taking into account the special structure of the random effects distribution will be

discussed in Section 3.4.2.
3.3 Autoregressive Random Effects

The use of autoregressive random effects was demonstrated in the introductory

example. Their property of a decaying correlation function makes them a useful

tool for modeling temporal or spatial associations among discrete data. We will

limit ourselves to instances where there is a natural ordering of random effects, and

consider time dependent data first.

3.3.1 Definition of Autoregressive Random Effects

As with equally correlated random effects in Section 3.2, we can look at the joint distribution of autoregressive (or autocorrelated) random effects {ut}_{t=1}^T as a mean-zero multivariate normal distribution with patterned covariance matrix Σ, defined by the variance and correlation functions

var(ut) = σ² for all t

and

corr(ut, ut*) = ρ^{|xt − xt*|} for all t ≠ t*,

where xt and xt* are time points (e.g., years, as in the GSS example) associated with random effects ut and ut*. Let dt = x_{t+1} − xt denote the time difference between two successive time points and let ft = 1/(1 − ρ^{2dt}), t = 1, ..., T − 1. Then, due to the special structure, the determinant of the covariance matrix is given by |Σ| = σ^{2T} Π_{t=1}^{T−1} ft^{−1}, and Σ^{−1} is tri-diagonal (Crowder and Hand, 1990, with correction of a typo in there) with main diagonal

(1/σ²) ( f₁, f₁ + f₂ − 1, ..., f_{t−1} + f_t − 1, ..., f_{T−2} + f_{T−1} − 1, f_{T−1} )

and sub-diagonals

−(1/σ²) ( f₁ρ^{d₁}, f₂ρ^{d₂}, ..., f_{T−1}ρ^{d_{T−1}} ).

For a sample u^{(1)}, ..., u^{(m)} from the posterior h(u | y; β^{(k−1)}, ψ^{(k−1)}) evaluated at the previous parameter estimates β^{(k−1)} and ψ^{(k−1)} = (σ^{(k−1)}, ρ^{(k−1)}), the function Q₂ (cf. Section 2.3.1) now has form

Q₂(ψ) = −T log σ − ½ Σ_{t=1}^{T−1} log(1 − ρ^{2dt})
        − (1/(2σ²)) (1/m) Σ_{j=1}^m { u₁^{(j)2} + Σ_{t=1}^{T−1} [u_{t+1}^{(j)} − ρ^{dt} u_t^{(j)}]² / (1 − ρ^{2dt}) },    (3.4)

where u_t^{(j)} is the t-th component of the j-th sampled vector u^{(j)}. In the M-step of an MCEM algorithm, we seek to maximize Q₂ with respect to σ and ρ.

Alternatively, we can view the random effects {ut} as a latent first-order autoregressive process: random effect u_{t+1} at time t + 1 is related to its predecessor ut by the equation

u_{t+1} = ρ^{dt} ut + εt,   εt ~ N(0, σ²(1 − ρ^{2dt})),   t = 1, ..., T − 1,    (3.5)

where dt again denotes the lag between the two successive time points associated with random effects ut and u_{t+1}. Assuming a N(0, σ²) distribution for the first random effect u₁, the joint random effects density for u = (u₁, ..., u_T) enjoys a Markov property and has form

g(u; ψ) = g(u₁; ψ) g(u₂ | u₁; ψ) ⋯ g(ut | u_{t−1}; ψ) ⋯ g(u_T | u_{T−1}; ψ)    (3.6)
        ∝ (1/σ^T) Π_{t=1}^{T−1} (1 − ρ^{2dt})^{−1/2} exp{−u₁²/(2σ²)}
          exp{ −Σ_{t=1}^{T−1} [u_{t+1} − ρ^{dt} ut]² / (2σ²(1 − ρ^{2dt})) },

leading, of course, to the same expression for Q₂ as given in (3.4). For two time indices t and t* with t < t*, the random process has autocorrelation function

ρ(t, t*) = corr(ut, ut*) = ρ^{Σ_{k=t}^{t*−1} dk}.
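A short simulation sketch may help fix ideas: the latent process (3.5) is generated forward in time, with the gaps dt entering both the autoregressive coefficient and the innovation variance. The time points and parameter values below are illustrative assumptions only.

```python
# Illustrative simulation of the latent process (3.5) on unequally
# spaced time points; x, sigma and rho are assumed values.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 10.0])  # observation times with gaps
d = np.diff(x)                                  # d_t = x_{t+1} - x_t
sigma, rho = 2.0, 0.8

u = np.empty(len(x))
u[0] = rng.normal(0.0, sigma)                   # u_1 ~ N(0, sigma^2)
for t, dt in enumerate(d):
    # u_{t+1} = rho^{d_t} u_t + eps_t, var(eps_t) = sigma^2 (1 - rho^{2 d_t})
    u[t + 1] = rho**dt * u[t] + rng.normal(0.0, sigma * np.sqrt(1 - rho**(2 * dt)))
```

Under this parametrization every ut has marginal variance σ² regardless of the spacing, which is the point made below in the comparison with the Chan and Ledolter (1995) form.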








Before we discuss maximization of Q₂ in this setting with possibly unequally spaced observation times, let us comment on the rather unusual parametrization of the latent random process (3.5). Chan and Ledolter (1995), in their development of time series models for equally spaced discrete events, use the more common form

u_{t+1} = ρ ut + εt,   εt ~ N(0, σ²),   t = 1, 2, ..., T − 1.

This leads to var(ut) = σ²/(1 − ρ²) for all t if we assume a N(0, σ²/(1 − ρ²)) distribution for u₁. (Chan and Ledolter (1995) condition on this first observation, which leads to closed form solutions for both σ and ρ in the case of equidistant observations.) Since it is common practice to let σ² describe the strength of association between observations in a common cluster sharing that random effect, our parametrization seems more natural. In Chan and Ledolter's parametrization, both the variance and the correlation parameter appear in the variance of the random effect.

In the more general case of unequally spaced observations, the parametrization εt ~ N(0, σ²) results in different variances of the random effects at different time points (i.e., var(ut) = σ²/(1 − ρ^{2dt})). Considering that the random effects represent

unobservable phenomena common to all clusters, their variability should be about

the same for all clusters, and not depend on the time difference between any two

clusters. There is no reason to believe that the strength of association is larger in

some clusters and weaker in others. Therefore, the parametrization we choose in

(3.5) seems natural and appropriate.

For spatially correlated data, a relationship between random effects ui and ui* is defined in terms of a distance function d(xi, xi*) between covariates xi and xi* associated with them. Each ui then represents a random effect for a spatial cluster, and correlated random effects are again natural to model spatial dependency among observations in different clusters. In the time setting, we had d(xi, xi*) = |xi − xi*| with xi and xi* representing time points. In 2-dimensional spatial settings, xi = (x_{i1}, x_{i2})' may represent midpoints in a Cartesian system and d(xi, xi*) = ||xi − xi*|| is the Euclidean distance function. The so-defined distance between clusters can be used in a model with correlations between random effects decaying as distances between cluster midpoints grow, e.g., corr(ui, ui*) = ρ^{||xi − xi*||}. Models of this form are discussed in Zhang (2002) and in Diggle et al. (1998) in a Bayesian framework.

Sometimes only the information concerning whether clusters are adjacent to

each other is used to form the correlation structure. In this case, d(xi, xi*) is a binary function, indicating if clusters i and i* are adjacent or not. Usually, this

leads to an improper joint distribution for the random effects, as for instance in

the analysis of the Scottish Lip cancer data set presented in Breslow and Clayton

(1993).

3.3.2 The M-step with Autoregressive Random Effects

Maximizing Q₂ with respect to σ and ρ is again equivalent to finding their MLEs for the sample u^{(1)}, ..., u^{(m)}, pretending they are independent. For fixed ρ, maximizing Q₂ with respect to σ is possible in closed form. For notational convenience, denote the parts depending on ρ and the generated sample u^{(1)}, ..., u^{(m)} by

at(ρ, u) = (1/m) Σ_{j=1}^m [u_{t+1}^{(j)} − ρ^{dt} u_t^{(j)}]²

and

bt(ρ, u) = (1/m) Σ_{j=1}^m [u_{t+1}^{(j)} − ρ^{dt} u_t^{(j)}] u_t^{(j)},

with derivatives with respect to ρ (indicated by a prime) given by

at'(ρ, u) = −2 dt ρ^{dt−1} bt(ρ, u)

and

bt'(ρ, u) = −dt ρ^{dt−1} (1/m) Σ_{j=1}^m (u_t^{(j)})².

The maximum likelihood estimator of σ at iteration k of the MCEM algorithm has form

σ̂^{(k)} = { (1/T) [ (1/m) Σ_{j=1}^m u₁^{(j)2} + Σ_{t=1}^{T−1} at(ρ, u)/(1 − ρ^{2dt}) ] }^{1/2}.

For the special case of independent random effects (ρ = 0), this simplifies to the estimator ((1/(Tm)) Σ_{j=1}^m u^{(j)'}u^{(j)})^{1/2} presented at the end of Section 2.3.1. (The equal correlation structure cannot be presented as a special case of the autocorrelation structure.) No closed form solution exists for ρ̂^{(k)}. Let

ct(ρ) = dt ρ^{dt−1} / (1 − ρ^{2dt})

and

et(ρ) = (ρ/dt) [ct(ρ)]²

be terms depending on ρ but not on u, with derivatives given by

ct'(ρ) = ((dt − 1)/ρ) ct(ρ) + 2 ρ^{dt} [ct(ρ)]²

and

et'(ρ) = (2ρ/dt) ct(ρ) ct'(ρ) + (1/dt) [ct(ρ)]²,

respectively. Then, the first and second partial derivatives of Q₂ with respect to ρ can be written as

∂Q₂/∂ρ = Σ_{t=1}^{T−1} ρ^{dt} ct(ρ) + (1/σ²) Σ_{t=1}^{T−1} [ct(ρ) bt(ρ, u) − et(ρ) at(ρ, u)],

∂²Q₂/∂ρ² = Σ_{t=1}^{T−1} ρ^{dt−1} ct(ρ) [2dt − 1 + 2ρ^{dt+1} ct(ρ)]
           + (1/σ²) Σ_{t=1}^{T−1} [ct'(ρ) bt(ρ, u) + ct(ρ) bt'(ρ, u) − et'(ρ) at(ρ, u) − et(ρ) at'(ρ, u)].








A Newton-Raphson algorithm with Hessian formed of the partial and mixed derivatives of Q₂ with respect to σ and ρ can be employed to find the maximum likelihood estimators at iteration k. However, since the range of ρ is restricted, it might be advantageous to use the interval-halving method on ∂Q₂/∂ρ |_{σ=σ̂^{(k)}} described in the previous section.

For the special case of equidistant time points {xt}, the distances dt are equal for all t = 1, ..., T − 1. Without loss of generality, we assume dt = 1 for all t. Then the random effects follow the simple first-order autoregression u_{t+1} = ρ ut + εt, t = 1, ..., T − 1, where we assume that u₁ ~ N(0, σ²) and the εt are i.i.d. N(0, σ²(1 − ρ²)). Certain simplifications occur. Let

a = (1/m) Σ_{j=1}^m u₁^{(j)2},   b = (1/m) Σ_{j=1}^m Σ_{t=1}^{T−1} u_{t+1}^{(j)2},
c = (1/m) Σ_{j=1}^m Σ_{t=1}^{T−1} u_{t+1}^{(j)} u_t^{(j)},   d = (1/m) Σ_{j=1}^m Σ_{t=1}^{T−1} u_t^{(j)2}

denote constants depending on the generated samples only, but not on any parameters. Then the maximum likelihood estimator of σ at iteration k is

σ̂^{(k)} = { (1/T) [ a + (b − 2ρc + ρ²d)/(1 − ρ²) ] }^{1/2}

and, upon plugging it into the score equation ∂Q₂/∂ρ |_{σ=σ̂^{(k)}} for ρ, we obtain as the new score equation

ρ³ [((T − 1)/T)(d − a)] + ρ² [((2 − T)/T) c] + ρ [((T − 1)/T) a − (1/T) b − d] + c = 0,    (3.7)

a polynomial of order three. A result by Witt (1987) mentioned in McKeown and Johnson (1996) shows that (3.7) has three real solutions, only one of which lies in the interval (−1, 1). This must be the maximum likelihood estimator ρ̂^{(k)} at iteration k for the case of equidistant time points. Exact solutions to this third degree polynomial are for instance given in Abramowitz and Stegun (1964), and we only need to iterate between the two explicit solutions till convergence.
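Equivalently, since (3.7) is a cubic with known coefficients, one can simply compute all three roots numerically and keep the one in (−1, 1). A minimal sketch, with a, b, c, d the sample constants defined above:

```python
# Minimal sketch: solve the cubic score equation (3.7) and return the
# unique root in (-1, 1); a, b, c, d are the sample constants above.
import numpy as np

def rho_update(a, b, c, d, T):
    coeffs = [(T - 1) / T * (d - a),        # coefficient of rho^3
              (2 - T) / T * c,              # coefficient of rho^2
              (T - 1) / T * a - b / T - d,  # coefficient of rho
              c]                            # constant term
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-10].real
    return real[(real > -1) & (real < 1)][0]  # unique by Witt (1987)
```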








In most of our applications the number of distinct observation times T

is rather large, and generating independent T-dimensional vectors u from the

posterior h(u | y; β^{(k−1)}, ψ^{(k−1)}) as required to approximate the E-step is difficult, even

with the nice (prior) autoregressive relationship among the components of u. The

next section discusses this issue.

There are many other correlation structures which are not discussed here. For

instance, the first order autoregressive random process can be extended to a higher-order autoregressive process, and the formulas provided here and in the next section can be modified accordingly.

3.4 Sampling from the Posterior Distribution Via Gibbs Sampling

In Section 2.3.2 we gave a general description of how to obtain a random sample u^{(1)}, ..., u^{(m)} from h(u | y). (As in Section 2.3.2, we suppress the dependency on the parameter estimates from the previous iteration.) For high dimensional random effects distributions g(u), generating independent draws from h(u | y) can get very time consuming, if not impossible. The Gibbs sampler introduced in Section 2.3.2 offers an alternative because it involves sampling from lower dimensional (often univariate) conditional distributions of h(u | y), which is considerably faster. However, it results in dependent samples from the posterior random effects

distribution. The distributional structure of equally correlated random effects

or autoregressive random effects is very amenable to Gibbs sampling because of

the simplifications that occur in the full univariate conditionals. Remember that the two-stage hierarchy and the conditional independence assumption in GLMMs imply that

h(u | y) ∝ f(y | u) g(u) = [ Π_{t=1}^T f(yt | ut) ] g(u),

the product of the conditional densities of observations sharing a common random effect, times the random effects density. In the following, let u = (u₁, ..., u_T). We discuss the case of autoregressive random effects first.








3.4.1 A Gibbs Sampler for Autoregressive Random Effects
From representation (3.6) of the random effects distribution, we see that the full univariate conditional distribution of ut given the other T − 1 components of u only depends on its neighbors u_{t−1} and u_{t+1}, i.e.,

g(ut | u₁, ..., u_{t−1}, u_{t+1}, ..., u_T) ∝ g(ut | u_{t−1}) g(u_{t+1} | ut),   t = 2, ..., T − 1.

At the beginning (t = 1) and the end (t = T) of the process, the conditional distributions of u₁ and u_T only depend on the successor u₂ and the predecessor u_{T−1}, respectively. Furthermore, random effect ut only applies to observations yt = (y_{t1}, ..., y_{tn_t}) at a common time point that share that random effect, but not to other observations. Hence, the full univariate conditionals of the posterior random effects distribution can be expressed as

h₁(u₁ | u₂, y₁) ∝ f(y₁ | u₁) g₁(u₁ | u₂)

ht(ut | u_{t−1}, u_{t+1}, yt) ∝ f(yt | ut) gt(ut | u_{t−1}, u_{t+1}),   t = 2, ..., T − 1

h_T(u_T | u_{T−1}, y_T) ∝ f(y_T | u_T) g_T(u_T | u_{T−1}),

where, using standard multivariate normal theory results,

g₁(u₁ | u₂) = N( ρ^{d₁} u₂, σ²[1 − ρ^{2d₁}] )

gt(ut | u_{t−1}, u_{t+1}) = N( [ρ^{d_{t−1}}(1 − ρ^{2dt}) u_{t−1} + ρ^{dt}(1 − ρ^{2d_{t−1}}) u_{t+1}] / [1 − ρ^{2(d_{t−1}+dt)}],
                              σ²[1 − ρ^{2d_{t−1}} − ρ^{2dt} + ρ^{2(d_{t−1}+dt)}] / [1 − ρ^{2(d_{t−1}+dt)}] ),   t = 2, ..., T − 1

g_T(u_T | u_{T−1}) = N( ρ^{d_{T−1}} u_{T−1}, σ²[1 − ρ^{2d_{T−1}}] ).

For equally spaced data (dt = 1 for all t) these distributions reduce to the ones
derived in Chan and Ledolter (1995).
Direct sampling from the full univariate conditionals ht is not possible.
However, it is straightforward to implement an accept-reject algorithm. In fact,







the accept-reject algorithm as outlined in Section 2.3.2 applies directly with target density ht and candidate density gt, since ht has the form of an exponential family density multiplied by a normal density. In Section 2.3.2, we discussed the accept-reject algorithm for generating an entire vector u from the posterior random effects distribution h(u | y) with candidate density g(u) and mentioned that acceptance probabilities are virtually zero for large dimensional u's. With the Gibbs sampler, we have reduced the problem to univariate sampling of the t-th component ut from the univariate target density ht with univariate candidate density gt. By selecting Mt = L(yt), where L(yt) is the saturated likelihood for the observations at time point t, we ensure that the target density satisfies ht ≤ Mt gt.
Given u^{(j−1)} = (u₁^{(j−1)}, ..., u_T^{(j−1)}) from the previous iteration, the Gibbs sampler with accept-reject sampling from the full univariate conditionals consists of:

1. generate the first component u₁^{(j)} ~ h₁(u₁ | u₂^{(j−1)}, y₁) by
   (a) generation step: generate u₁ from the candidate density g₁(u₁ | u₂^{(j−1)}); generate U ~ Uniform[0, 1];
   (b) acceptance step: set u₁^{(j)} = u₁ if U ≤ f(y₁ | u₁)/L(y₁); return to (a) otherwise;
2. for t = 2, ..., T − 1: generate component u_t^{(j)} ~ ht(ut | u_{t−1}^{(j)}, u_{t+1}^{(j−1)}, yt) by
   (a) generation step: generate ut from the candidate density gt(ut | u_{t−1}^{(j)}, u_{t+1}^{(j−1)}); generate U ~ Uniform[0, 1];
   (b) acceptance step: set u_t^{(j)} = ut if U ≤ f(yt | ut)/L(yt); return to (a) otherwise;
3. generate the last component u_T^{(j)} ~ h_T(u_T | u_{T−1}^{(j)}, y_T) by
   (a) generation step: generate u_T from the candidate density g_T(u_T | u_{T−1}^{(j)}); generate U ~ Uniform[0, 1];
   (b) acceptance step: set u_T^{(j)} = u_T if U ≤ f(y_T | u_T)/L(y_T); return to (a) otherwise;
4. set u^{(j)} = (u₁^{(j)}, ..., u_T^{(j)}).
The sample u^{(1)}, ..., u^{(m)} so obtained (after allowing for burn-in) forms a dependent sample which we use to approximate the E-step in the k-th iteration of the MCEM algorithm. Note that all densities are evaluated at the current parameter estimates, i.e., β^{(k−1)} for f(yt | ut) and ψ^{(k−1)} = (σ^{(k−1)}, ρ^{(k−1)}) for gt(ut | u_{t−1}, u_{t+1}).
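The sketch below illustrates one such sweep (steps 1-4 above) for binomial observations yt out of nt trials with conditional success probability logit^{−1}(ηt + ut). All names (eta, n, d, etc.) are placeholders for quantities computed elsewhere in the MCEM algorithm, and the bound Mt = L(yt) is the saturated binomial likelihood.

```python
# Minimal sketch of one Gibbs sweep with accept-reject sampling for
# binomial data; eta, n and d are assumed inputs from the MCEM loop.
import numpy as np
from scipy import stats

def gibbs_sweep(u, y, n, eta, d, sigma, rho, rng):
    T = len(u)
    u = u.copy()
    for t in range(T):
        if t == 0:                         # g_1(u_1 | u_2)
            m, v = rho**d[0] * u[1], sigma**2 * (1 - rho**(2 * d[0]))
        elif t == T - 1:                   # g_T(u_T | u_{T-1})
            m, v = rho**d[-1] * u[T - 2], sigma**2 * (1 - rho**(2 * d[-1]))
        else:                              # g_t(u_t | u_{t-1}, u_{t+1})
            r1, r2 = rho**d[t - 1], rho**d[t]
            den = 1 - (r1 * r2)**2
            m = (r1 * (1 - r2**2) * u[t - 1] + r2 * (1 - r1**2) * u[t + 1]) / den
            v = sigma**2 * (1 - r1**2) * (1 - r2**2) / den
        p_sat = min(max(y[t] / n[t], 1e-12), 1 - 1e-12)
        L_sat = stats.binom.pmf(y[t], n[t], p_sat)   # saturated likelihood L(y_t)
        while True:                        # accept-reject with candidate g_t
            cand = rng.normal(m, np.sqrt(v))
            p = 1.0 / (1.0 + np.exp(-(eta[t] + cand)))
            if rng.uniform() <= stats.binom.pmf(y[t], n[t], p) / L_sat:
                u[t] = cand
                break
    return u
```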
3.4.2 A Gibbs Sampler for Equally Correlated Random Effects
Similar results as for the autoregressive correlation structure can be derived for the case of equally correlated random effects. In this case, the full univariate conditional of ut depends on all other T − 1 components of u, as can be seen from (3.3). Let u_{−t} denote the vector u with the t-th component deleted. Using similar notation as in the previous section, the full univariate conditionals of h(u | y) are given by

ht(ut | u_{−t}, yt) ∝ f(yt | ut) gt(ut | u_{−t}),   t = 1, ..., T,

where with standard results from multivariate normal theory gt(ut | u_{−t}) is a N(μt, τt²) density with

μt = ( ρ / [1 + (T − 2)ρ] ) Σ_{k≠t} uk

and

τt² = σ² (1 − ρ)[1 + (T − 1)ρ] / [1 + (T − 2)ρ].
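These closed forms can again be verified against generic multivariate normal conditioning; the dimension and parameter values below are illustrative, and this check is not part of the algorithm itself.

```python
# Check mu_t and tau_t^2 against generic conditioning formulas.
import numpy as np

T, sigma, rho, t = 6, 1.5, 0.3, 2
Sigma = sigma**2 * ((1 - rho) * np.eye(T) + rho * np.ones((T, T)))
u = np.random.default_rng(1).normal(size=T)

rest = [k for k in range(T) if k != t]
w = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, t])
mu_gen, tau2_gen = w @ u[rest], Sigma[t, t] - Sigma[t, rest] @ w

mu_cf = rho / (1 + (T - 2) * rho) * u[rest].sum()
tau2_cf = sigma**2 * (1 - rho) * (1 + (T - 1) * rho) / (1 + (T - 2) * rho)

print(np.isclose(mu_gen, mu_cf), np.isclose(tau2_gen, tau2_cf))  # True True
```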
Given the vector u^{(j−1)} from the previous iteration, the Gibbs sampler with accept-reject sampling from the full univariate conditionals has form:

1. for t = 1, ..., T: generate component u_t^{(j)} ~ ht(ut | u₁^{(j)}, ..., u_{t−1}^{(j)}, u_{t+1}^{(j−1)}, ..., u_T^{(j−1)}, yt) by
   (a) generation step: generate ut from the candidate density gt(ut | u₁^{(j)}, ..., u_{t−1}^{(j)}, u_{t+1}^{(j−1)}, ..., u_T^{(j−1)}); generate U ~ Uniform[0, 1];
   (b) acceptance step: set u_t^{(j)} = ut if U ≤ f(yt | ut)/L(yt); return to (a) otherwise;
2. set u^{(j)} = (u₁^{(j)}, ..., u_T^{(j)}).

This leads to a sample u^{(1)}, ..., u^{(m)} from the posterior distribution used in the E- and M-steps of the MCEM algorithm at iteration k. Note again that all distributions are evaluated at their current parameter estimates β^{(k−1)} and ψ^{(k−1)} = (σ^{(k−1)}, ρ^{(k−1)}).

3.5 A Simulation Study

We conducted a simulation study to evaluate the performance of the maximum

likelihood estimation algorithm, to evaluate the bias in the estimation of covariate

effects and variance components and to compare predicted random effects to the

ones used in the simulation of the data. To this end, we generated a time series

y₁, ..., y_T of T = 400 binary observations according to the model

logit(πt(ut)) = α + β xt + ut    (3.8)

for the conditional log odds of success at time t, t = 1, ..., 400. For the simulation, we chose α = 1 and β = 1, where β is the regression coefficient for independent standard normal distributed covariates xt, i.i.d. N(0, 1). The random effects u₁, ..., u_T are thought to arise from an unobserved latent autoregressive process u_{t+1} = ρ ut + εt, where the εt are i.i.d. N(0, σ²(1 − ρ²)), i.e., the ut's have standard deviation σ and lag-t correlation ρ^t. For the simulation of these autoregressive random effects, we used σ = 2 and ρ = 0.8. The resulting sample autocorrelation function of the realized random effects is pictured in Figure 3-4. The standard deviation and lag 1 correlation of the 400 realized values of u₁, ..., u_T are equal to 1.95 and 0.77. Note that conditional on the realized values of the ut's, the yt's are generated independently with log odds given by (3.8). The MCEM algorithm as described in Sections 2.3 and 3.3 for a logistic GLMM with autocorrelated random effects yielded the following maximum likelihood estimates for the fixed effects and variance components: α̂ = 0.94 (0.39) and β̂ = 1.03 (0.22), as compared to the true values 1 and 1, and σ̂ = 2.25 (0.44) and ρ̂ = 0.74 (0.06), as compared to the realized values 1.95 and 0.77.
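The data-generating mechanism of this simulation can be sketched as follows (the seed is arbitrary; fitting by MCEM is the subject of Sections 2.3 and 3.3 and is not shown):

```python
# Sketch of the simulation design: T = 400 binary responses from model
# (3.8) with a latent AR(1) random-effects process.
import numpy as np

rng = np.random.default_rng(42)
T, alpha, beta, sigma, rho = 400, 1.0, 1.0, 2.0, 0.8

x = rng.normal(size=T)                       # covariates x_t ~ i.i.d. N(0, 1)
u = np.empty(T)
u[0] = rng.normal(0.0, sigma)
for t in range(T - 1):                       # u_{t+1} = rho u_t + eps_t
    u[t + 1] = rho * u[t] + rng.normal(0.0, sigma * np.sqrt(1 - rho**2))

pi = 1.0 / (1.0 + np.exp(-(alpha + beta * x + u)))  # logit(pi_t) = alpha + beta x_t + u_t
y = rng.binomial(1, pi)                      # conditionally independent given u
```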

The algorithm converged after 71 iterations with a starting Monte Carlo sample size of 50 and a final Monte Carlo sample size of only 880, although estimated standard errors are based on a Monte Carlo sample size of 20,000. Convergence parameters were set to ε₁ = 0.003, c = 3, ε₂ = 0.005, ε₃ = −0.001, α = 1.03 and q = 1.05 (see Section 2.3.3). Regular GLM estimates were used as starting values for α and β, and starting values for σ and ρ were set to 1.5 and 0, respectively.

As will be described in Section 5.4.2, we estimated random effects through a Monte Carlo approximation of their posterior mean: ût = E[ut | y]. The scatter plot in Figure 3-3 shows good agreement in a comparison of the realized random effects u₁, ..., u_T from the simulation and the estimated random effects û₁, ..., û_T from the model. Note though that the standard deviation of the estimated random effects is equal to 1.60 (as compared to the standard deviation 1.95 of the realized values), showing that estimated random effects are less variable and illustrating the general shrinkage effect (compare the scales on the x and y axes of Figure 3-3) brought along by using posterior mean estimates.

Figure 3-3: Realized (simulated) random effects u₁, ..., u_T versus estimated random effects û₁, ..., û_T.

Also, a comparison of the autocorrelation and partial autocorrelation functions of the realized and estimated random effects in


Figure 3-4 reveals some differences due to the fact that estimated random effects

are based on the posterior distribution of u I y. Therefore, estimated random

effects are only of limited use in checking assumptions on the true random effects.

Only when their behavior is grossly unexpected compared to the assumed structure

of the underlying latent random process may they serve as an indication of model

inappropriateness. Related remarks are given by Verbeke and Molenberghs (2000), who generate data in a linear mixed model assuming a mixture of two normal distributions for the random effects, resulting in a bimodal distribution. There, too, the plot of posterior mean estimates of the random effects from a model that misspecified the random effects distribution does not reveal that anything went wrong.

We repeated the above simulation 100 times, using the same specifications, starting values and convergence criteria as before. Each of the 100

generated binary time series of length 400 was fit using the MCEM algorithm.

Figure 3-4: Comparing simulated and estimated random effects.
Sample autocorrelation (first row) and partial autocorrelation (second row) functions for realized (simulated) random effects u₁, ..., u_T (first column) and estimated random effects û₁, ..., û_T (second column).

Table 3-1 shows the average (over the 100 generated time series) of the fixed








parameter and variance component estimates and their average estimated standard errors. On average, the GLMM estimates of the fixed effects α and β and the variance components are very close to the true parameters, although the true lag 1 correlation of the random effects is underestimated by 6.3%. Table 3-1 also displays, in parentheses, the standard deviations of all estimated parameters in the 100 replications. Comparing these to the theoretical estimates of the asymptotic standard errors, we see good agreement. This suggests that the procedure for finding standard errors we described and implemented (via Louis's (1982) formula) in our MCEM algorithm works fine. In 5 (5%) out of the 100 simulations, the approximation of the asymptotic covariance matrix by Monte Carlo methods resulted in a negative definite matrix. For these simulations, a larger Monte Carlo sample after convergence of the MCEM algorithm (the default was 20,000) might be necessary.

It is also interesting to note that of the 95 simulations with positive definite covariance matrix, 6 (6.3%) resulted in a non-significant (based on a 5% level Wald test) estimate of the regression coefficient β under the GLMM with autoregressive random effects, while none was declared non-significant with the GLM approach. Estimates and standard errors for a corresponding GLM fit are also provided in Table 3-1. The average Monte Carlo sample size at the final iteration of the MCEM algorithm was 1200, although highly dispersed, ranging from 210 to 21,000. The average computation time (on a mobile Pentium III, 600 MHz processor with 256MB RAM) to convergence, including estimating the covariance matrix, was 73 minutes.

We ran two other simulation studies, now with a shorter series length of only T = 100 observations and true lag 1 correlations of 0.6 and −0.8, respectively. All other parameters remained unchanged. These results are also summarized in Table 3-1. Again, we observe that the estimated parameters are very close to the







Table 3-1: A simulation study for a logistic GLMM with autoregressive random effects.

            α       β       σ       ρ       s.e.(α)  s.e.(β)  s.e.(σ)  s.e.(ρ)
True:       1       1       2       0.8     (T = 400)
GLM:        0.64    0.63                    0.11     0.12
           (0.20)  (0.14)                  (0.01)   (0.01)
GLMM:       1.07    1.02    2.08    0.75    0.37     0.26     0.61     0.10
           (0.32)  (0.20)  (0.25)  (0.06)  (0.12)   (0.23)   (0.40)   (0.05)

True:       1       1       2       0.6     (T = 100)
GLM:        0.69    0.70                    0.23     0.25
           (0.38)  (0.27)                  (0.02)   (0.03)
GLMM:       1.09    1.07    1.99    0.51    0.58     0.47     1.35     0.26
           (0.58)  (0.37)  (0.39)  (0.20)  (0.33)   (0.27)   (1.15)   (0.18)

True:       1       1       2      −0.8     (T = 100)
GLM:        0.65    0.61                    0.22     0.24
           (0.21)  (0.26)                  (0.01)   (0.03)
GLMM:       1.04    0.96    2.00   −0.75    0.42     0.51     1.04     0.16
           (0.29)  (0.34)  (0.53)  (0.13)  (0.32)   (0.99)   (1.04)   (0.13)

Average and standard deviation (in parentheses) of fixed effects, variance components and their standard error estimates from a GLM and a GLMM with latent AR(1) process. The two models were fitted to each of 100 generated binary time series of length T = 400 and T = 100.

true ones, but on average the correlation was underestimated by 15% and 6.3%,

respectively. However, the sampling errors of the correlation parameters (shown in
parentheses in Table 3-1) were large enough to include the true values.
Since our methods are general enough to handle unequally spaced data, we
repeated the first simulation with a time series of T = 400 binary observations, but

now randomly deleted 10% of the observations to create random gaps in the series.
We left all parameters and the model for the conditional odds unchanged, except that we now assume that the random effects follow the latent autoregressive process u_{t+1} = ρ^{dt} ut + εt, where the εt are independent N(0, σ²(1 − ρ^{2dt})) and dt is the difference (in the units of measurement) between the time points associated with the observations at times t and t + 1. For example, the first series we generated had






Table 3-2: Simulation study for modeling unequally spaced binary time series.

            α       β       σ       ρ       s.e.(α)  s.e.(β)  s.e.(σ)  s.e.(ρ)
True:       1       1       2       0.8     (T = 360, unequally spaced)
GLM:        0.61    0.62                    0.12     0.12
           (0.19)  (0.11)                  (0.00)   (0.01)
GLMM:       1.03    1.00    2.07    0.75    0.38     0.28     0.71     0.11
           (0.29)  (0.16)  (0.25)  (0.06)  (0.19)   (0.28)   (0.80)   (0.18)

Average and standard deviation (in parentheses) of fixed effects, variance components and their standard error estimates from a GLM and a GLMM with latent autoregressive random effects accounting for unequally spaced observations. The two models were fitted to each of 100 generated binary time series of length T = 360, with random gaps of random length between observations.

1 gap of length three (i.e., dt = 4 for one t), 4 gaps of length two (i.e., dt = 3 for 4
t's) and 29 gaps of length one (i.e., dt = 2 for 29 t's). For all other t's, dt = 1, i.e.,

they are successive observations and the difference between two of them is one unit
of measurement.

Simulation results are shown in Table 3-2 and reveal that our proposed
methods and algorithm also work fine for an unequally spaced binary time series.

All true parameters are included in confidence intervals based on the average of the

estimated parameters from 100 replicated series and its standard deviation (shown
in parentheses in Table 3-2).













CHAPTER 4
MODEL PROPERTIES FOR NORMAL, POISSON AND BINOMIAL
OBSERVATIONS
So far we have discussed models for discrete valued time series data in a very

broad manner. In Chapter 2, we developed the likelihood for our models based on generic distributions f(y | u) for observations y and g(u) for random effects u and presented an algorithm for finding maximum likelihood estimates. Chapter 3 looked at two special cases of random effects distributions useful for describing temporal or spatial dependencies. In this chapter we make specific distributional assumptions

about the observations and develop some theory underlying the models we propose.

We will pay special attention to data in the form of a single (sometimes considered

generic) time series Y = (Y1,..., YT) and derive marginal properties implied by

the conditional model formulation. Multiple, independent time series Y₁, ..., Y_n can result from replication of the original time series or from stratification of the sampled population, such as in the example about homosexual relationships. All derivations given below for a generic time series Y still hold for the i-th series Y_i = (Y_{i1}, Y_{i2}, ..., Y_{iT}), provided the same latent process {ut} is assumed to underlie each one of them.

An important characteristic of any time series model is its implied serial

dependency structure. In the case of normal theory time series models, this is

specified by the autocorrelation function. In Section 4.1 we derive the implied

marginal autocorrelation function for GLMMs with normal random components

and either an equal correlation or autoregressive assumption for the random effects.

With these assumptions, our models are special cases of linear mixed models

discussed for instance in Diggle et al. (2002). In Sections 4.2 and 4.3 we explore








marginal properties of GLMMs with Poisson and binomial random components

that are induced by assuming equally correlated or autoregressive random effects.

In Chapter 5, these model properties such as the implied autocorrelation function

are then compared to empirical counterparts based on the observed data to

evaluate the proposed model.

Section 2.1 mentioned that parameters in GLMMs have a conditional interpretation, controlling for the random effects. Correlated random effects vary over

time and parameter interpretation is different from having just one common level of

a random effect, as in many standard random intercepts GLMMs. For each of the

models presented here, we discuss parameter interpretation in a separate section.

4.1 Analysis for a Time Series of Normal Observations

Suppose that conditional on time specific normal random effects {ut}, observations {Yt} are independent N(μt + ut, τ²). The marginal likelihood for this model is tractable, because marginally the joint distribution of {Yt} is multivariate normal with mean μ = (μ₁, ..., μ_T)' and covariance matrix Σu + τ²I, where Σu is the covariance matrix of the joint distribution of {ut}. With the usual assumption that var(ut) = σ², the marginal variance of Yt is given by

var(Yt) = τ² + σ²

and the marginal correlation function ρ(t, t*) for the case of equally correlated random effects (cf. Section 3.2) has form

ρ(t, t*) = corr(Yt, Yt*) = (σ²/(τ² + σ²)) ρ,    (4.1)

while for the case of autocorrelated random effects (cf. Section 3.3), it has form

ρ(t, t*) = corr(Yt, Yt*) = (σ²/(τ² + σ²)) ρ^{Σ_{k=t}^{t*−1} dk}.    (4.2)
T-+ 0








If the distances between time points are equal, then (4.2) is more conveniently written in terms of the lag h between observations as

ρ(h) = corr(Yt, Y_{t+h}) = (σ²/(τ² + σ²)) ρ^h.

For both cases, note that the autocorrelations (4.1) and (4.2) are smaller than the corresponding ones assumed for the underlying latent process {ut} by the factor σ²/(τ² + σ²). For equally correlated random effects, the marginal covariance matrix has form τ²I + σ²[(1 − ρ)I + ρJ], implying equal marginal correlations between any two members Yt and Yt* of {Yt}. (This can also be seen from (4.1), where the autocorrelations do not depend on t or t*.) Diggle et al. (2002, Sec. 5.2.2) call this a model with serial correlation plus measurement error.

Similar properties can be observed in the case of autocorrelated random effects: the basic structure of correlations decaying in absolute value with increasing distances between observation times (as measured by Σ_k dk or h) is preserved marginally. However, the first-order Markov property of the underlying autoregressive process is not preserved in the marginal distribution of {Yt}, which can be proved by calculating conditional distributions. For instance, for three (T = 3) equidistant time points, the conditional mean of Y₃ given Y₁ = y₁ and Y₂ = y₂ is equal to

E[Y₃ | y₁, y₂] = μ₃ + (σ²ρ / [(τ² + σ²)² − σ⁴ρ²]) ( τ²[(y₂ − μ₂) + ρ(y₁ − μ₁)] + σ²(1 − ρ²)(y₂ − μ₂) )

and depends on y₁.

It should be noted that in the case of independent random effects with Σu = σ²I, marginally the Yt's are also independent, but with overdispersed variances τ² + σ² relative to their conditional distribution. This case can be seen as a special case of the equally correlated model and the autoregressive model when ρ = 0.








The traditional assumption in random intercepts models is to assume a common random effect u = ut for all time points t, i.e., conditional on a N(0, σ²) random effect u, Yt is N(μt + u, τ²) for t = 1, ..., T. For this case, the marginal covariance matrix has form τ²I + σ²J. This can be derived directly or inferred from the marginal correlation expressions (4.1) and (4.2) by setting ρ = 1, implying perfect correlation among the {ut}. Hence, the random intercepts model is a special case of the equally correlated or autoregressive model when ρ = 1. It implies a constant (exchangeable) marginal correlation of σ²/(τ² + σ²) between any two observations Yt and Yt*.

4.1.1 Analysis via Linear Mixed Models

In a GLMM, we try to provide some structure for the unknown mean component μt by using covariates xt. Let xt'β be a linear predictor for μt, with β denoting a fixed effects parameter vector for the covariates xt. Using an identity link, the series {Yt} then follows a GLMM with conditional mean function E[Yt | ut] = xt'β + ut. The model can be written as Yt = xt'β + ut + εt, where the εt are i.i.d. N(0, τ²) and independent of the ut. Then, the models discussed here are special cases of mixed effects models (Verbeke and Molenberghs, 2000) with general matrix form

Y = Xβ + Zu + ε.

In our case, Y = (Y₁, ..., Y_T)' is the time series vector and X = (x₁, ..., x_T)' is the overall design matrix with associated parameter β. The design matrix Z for the random effects u' = (u₁, ..., u_T) simplifies to the identity matrix I_T. The distributional assumption on the random effects is u ~ N(0, Σu), and they are independent of the N(0, τ²I)-distributed errors ε. Exploiting this relationship, software for fitting models of this kind (i.e., correlated normal data with structured covariance matrix of form var(Y) = ZΣuZ' + τ²I) is readily available, for instance in the form of the SAS procedure proc mixed, where the equal correlation structure and the autoregressive structure are only two out of many possible choices for the covariance matrix Σu of the random effects distribution.
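For a fixed set of variance components, the implied marginal analysis is ordinary generalized least squares. The sketch below treats σ, ρ and τ as known (whereas software such as proc mixed estimates them jointly with β) and simply shows the basic GLS computation.

```python
# Sketch of GLS for the marginal model var(Y) = Sigma_u + tau^2 I with
# autoregressive Sigma_u; sigma, rho and tau are treated as known here.
import numpy as np

def gls_beta(X, y, times, sigma, rho, tau):
    lag = np.abs(times[:, None] - times[None, :])
    V = sigma**2 * rho**lag + tau**2 * np.eye(len(y))   # marginal covariance
    VX, Vy = np.linalg.solve(V, X), np.linalg.solve(V, y)
    return np.linalg.solve(X.T @ VX, X.T @ Vy)          # (X'V^-1X)^-1 X'V^-1y
```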
Mixed effects models are very popular for the regression analysis of shorter
time series, like growth curve models or data from longitudinal studies. In Sec-
tion 5.1, we illustrate an application by analyzing the motivating example of
Section 3.1 about attitudes towards homosexual relationships, based on a normal
approximation to the log odds.
4.1.2 Parameter Interpretation
Parameters in normal time series models retain their interpretation when
averaging over the random effects distribution. The interpretation of 3 as the
change in the mean for a change in the covariates is valid conditional on random
effects and also marginally. The random effects parameters only contribute to the
variance-covariance structure of the marginal distribution, inducing overdispersion
and correlation relative to the conditional assumptions.

4.2 Analysis for a Time Series of Counts
Suppose now that conditional on time specific normal random effects {ut}, observations {Yt} are independent counts, which we model as Poisson random variables with mean μt. Using a log link, explanatory variables xt and correlated random effects {ut}, we specify the conditional mean structure of a Poisson GLMM as

log(μt) = xt'β + ut,   t = 1, ..., T.    (4.3)

The correlation in the random effects allows the log-means to be correlated over time or space. The marginal likelihood corresponding to this model is given by

L(β, ψ; y) ∝ ∫_{R^T} Π_{t=1}^T μt^{yt} exp{−μt} g(u; ψ) du
           = ∫_{R^T} exp{ Σ_{t=1}^T [yt(xt'β + ut) − exp{xt'β + ut}] } g(u; ψ) du,








where g(u; ψ) is one of the random effects distributions of Chapter 3. In that case, the integral is not tractable, and numerical methods such as the MCEM algorithm of Section 2.3 must be used to find maximum likelihood estimates for β and ψ. For this, the function Q₁ defined in (2.16) has form

Q₁(β) = (1/m) Σ_{j=1}^m Σ_{t=1}^T [yt(xt'β + u_t^{(j)}) − exp{xt'β + u_t^{(j)}}],

where u_t^{(j)} is the t-th element of the j-th generated sample u^{(j)} from the posterior distribution h(u | y; β^{(k−1)}, ψ^{(k−1)}). Note that here we discuss only the case of a generic time series {Yt} with no replication, hence n = 1 (i.e., index i is redundant) and n₁ = T in the general form presented in (2.16). If replications are available, or in the case where two time series differ in the fixed effects part but not in the random effects (e.g., have the same underlying latent process), then one simply needs to include the sum over the replicates as indicated in (2.16). Choosing one of the correlated random effects distributions of Chapter 3, the Gibbs sampling algorithms developed in Sections 3.4.1 or 3.4.2 can be used to generate the sample from h(u | y), with f(yt | ut) having the form of a Poisson density with mean μt.

4.2.1 Marginal Model Implied by the Poisson GLMM

As with the normal GLMMs before, marginal first and second moments can

be obtained by integrating over the random effects distribution, although here

the complete marginal distribution of Yt is not tractable as it is in the normal

case. The random effects appearing in model (4.3) imply that the conditional

log-means {log(μt)} are random quantities. Assuming that the random effects {ut} are normal with zero mean and variance var(ut) = σ², the log-means have expectations {xt'β} and variance σ². For two distinct time points t and t*, their correlation under an independence, equal correlation or autocorrelation assumption on the random effects is given by 0, ρ or ρ^{Σ_{k=t}^{t*−1} dk}, respectively. (Remember that dk denotes the time difference between two successive observations yk and y_{k+1}.) On the original scale, the means have expectation, variance and correlation given by

E[μt] = exp{xt'β + σ²/2}

var(μt) = exp{2(xt'β + σ²/2)} (e^{σ²} − 1)

corr(μt, μt*) = (e^{cov(ut, ut*)} − 1) / (e^{σ²} − 1).

Plugging in cov(ut, ut*) = 0, σ²ρ or σ²ρ^{Σ_{k=t}^{t*−1} dk} yields the marginal correlations among means when assuming independent, equally correlated or autoregressive random effects, respectively.
4.2.1.1 Marginal distribution of Yt

Now let us turn to the marginal distribution of Yt itself, for which we can only derive moments. The marginal mean and variance of Yt are given by

E[Yt] = E[μt] = exp{xt'β + σ²/2}    (4.4)

var(Yt) = E[μt] + var(μt) = E[Yt] [1 + E[Yt](e^{σ²} − 1)].

Hence, the log of the marginal mean still follows a linear model with fixed effects parameters β, but with an additional offset σ²/2 to the intercept term. (This is not particular to the Poisson assumption, but is true for any loglinear random effects model of form (4.3) with more general random effects structure zt'ut; see Problem 13.42 in Agresti, 2002.) The marginal distribution of Yt is not Poisson, since the variance exceeds the mean by the factor [1 + E[Yt](e^{σ²} − 1)]. The marginal variance is a quadratic function of the marginal mean.

For two distinct time points t and t*, the marginal covariance between observations Yt and Yt* is given by

cov(Yt, Yt*) = cov(μt, μt*)
            = E[Yt]E[Yt*] (e^{cov(ut, ut*)} − 1)    (4.5)
            = exp{(xt + xt*)'β + σ²} (e^{cov(ut, ut*)} − 1).



Full Text
xml version 1.0 encoding UTF-8
REPORT xmlns http:www.fcla.edudlsmddaitss xmlns:xsi http:www.w3.org2001XMLSchema-instance xsi:schemaLocation http:www.fcla.edudlsmddaitssdaitssReport.xsd
INGEST IEID EPWSN1VID_CJRAAN INGEST_TIME 2017-07-13T14:51:32Z PACKAGE AA00003585_00001
AGREEMENT_INFO ACCOUNT UF PROJECT UFDC
FILES



PAGE 1

REGRESSION MODELS FOR DISCRETE-VALUED TIME SERIES DATA By BERNHARD KLINGENBERG A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2004

PAGE 2

Copyright 2004 by Bernhard Klingenberg

PAGE 3

To Sophia and Jean-Luc Picard

PAGE 4

ACKNOWLEDGMENTS I would like to express my sincere gratitude to Drs Alan Agresti and James Booth for their guidance and assistance with my dissertation research and for their support throughout my years at the University of Florida. During the past three years as a research assistant for Dr. Agresti I gained valuable experience in conducting statistical research and writing scholarly papers, for which I am very grateful. I would also like to thank Dr. Ramon Littell, who guided me through a year of invaluable statistical consulting experience at IFAS Dr. George Casella, who taught me Monte Carlo methods and Drs Jeff Gill and Michael Martinez for serving on my committee. My gratitude extends to all the faculty and present and former graduate students of the Department among them in alphabetical order Brian Caffo Sounak Chakraborty Dr. Herwig Friedl Ludwig Heigenhauser, David Hitchcock Wolfg~ng Jank Galin Jones Ziyad Mahfoud Siuli Mukhopadhyay and Brian Stephens I would like to thank my family foremost my wife Sophia and my daughter Franziska for all their support light and joy they bring into my life and for providing me with energy and fulfillment. Lastly this dissertation would not have been written in English had it not been for the countless adventures of the Starship Enterprise and its Captain Jean-Luc Picard whose episodes kept me glued to the TV in Austria and helped to sufficiently improve my knowledge of the English language. lV

PAGE 5

TABLE OF CONTENTS ACKNOWLEDGMENTS LIST OF TABLES LIST OF FIGURES ABSTRACT lV viii lX xi CHAPTER 1 2 INTRODUCTION 1.1 Regression Models for Correlated Discrete Data 1.2 Marginal Models .. ......... .. . 1.2.1 Likelihood Based Estimation Methods . 1.2.2 Quasi-Likelihood Based Estimation Methods 1.3 Transitional Models . . . . . . . . . 1.3.1 Model Fitting .. ... ..... .... 1.3.2 Transitional Models for Time Series of Counts 1.3.3 Transitional Models for Binary Data . 1.4 Random Effects Models . . . . . . 1.4 1 Correlated Random Effects in GLMMs 1.4.2 Other Modeling Approaches . . 1.5 Motivation and Outline of the Dissertation GENERALIZED LINEAR MIXED MODELS 1 1 3 3 4 7 9 10 11 13 13 17 19 21 2.1 Definition and Notation. . . . . 22 2.1.1 Generalized Linear Mixed Models for Univariate Discrete Time Series . . . . . . . . . . . . 24 2.1.2 State Space Models for Discrete Time Series Observations 26 2.1.3 Structural Similarities Between State Space Models and GLMMs . . . . . 28 2.1.4 Practical Differences . . . . . . . . . 29 2.2 Maximum Likelihood Estimation . . . . . . . 31 2.2.1 Direct and Indirect Maximum Likelihood Procedures 32 2.2.2 Model Fitting in a Bayesian Framework . . . . 36 2 2 3 Maximum Likelihood Estimation for State Space Models 37 2.3 The Monte Carlo EM Algorithm . . . . . . . . . 40 V

PAGE 6

3 4 5 2.3.1 Maximization of Qm ..... ..... 2 3 2 Generating Samples from h(u I y ; /3 1/J) 2.3.3 Convergence Criteria ......... CORRELATED RANDOM EFFECTS 41 43 48 53 3.1 A Motivating Example: Data from the General Social Survey 54 3.1.1 A GLMM Approach . . . . . 55 3.1.2 Motivating Correlated Random Effects . . 56 3 2 Equally Correlated Random Effects . . . . . 62 3 2.1 Definition of Equally Correlated Random Effects 62 3 2.2 The M-step with Equally Correlated Random Effects 63 3.3 Autoregressive Random Effects . . . . . . 65 3.3.1 D efin ition of Autoregressive Random Effects . . 65 3.3 2 The M-step with Autoregressive Random Effects . 68 3.4 Sampling from the Posterior Distribution Via Gibbs Sampling 71 3.4.1 A Gibbs Sampler for Autoregressive Random Effects . 72 3.4.2 A Gibbs Sampler for Equally Correlated Random Effects 74 3 5 A Simulation Study . . . . . . . . . . . . 75 MODEL PROPERTIES FOR NORMAL POISSON AND BINOMIAL OBSERVATIONS ..... . ........ ... . 82 4 .1 Analysis for a Time Series of Normal Observations 83 4.1.1 Analysis via Linear Mixed Models 85 4 1.2 Parameter Interpretation . . . . . 86 4.2 Analysis for a Tim e Series of Counts . . . . 86 4.2 1 Marginal Model Implied by the Poisson GLMM 87 4 .2.2 Parameter Int erpretation . . . . . . 92 4.3 Analysis for a Time Series of Binomial or Binary Observations 92 4.3.1 Marginal Model Impli ed by the Binomial GLMM 94 4 3 2 Approximation Techniques for Marginal Moments 95 4.3.3 Parameter Int erpretat ion . . . . . . . 108 EXAMPLES OF COUNT BINOMIAL AND BINARY TIME SERIES 5.1 Graphical Exploration of Correlation Structures 5.1.1 The Variogram 5 1.2 The Lorelogram ..... 5.2 Normal Time Series .. ... . 5.3 Analysis of the Polio Count Data 5 3.1 Comparison of ARGLMMs to other Approaches 5.3.2 A Residual Analysis for the ARGLMM 5.4 Binary and Binomial Time Series 5.4.1 Old Faithful Geyser Data .. ... . Vl 115 115 116 117 117 121 125 127 130 130

PAGE 7

6 5.4.2 Oxford versus Cambridge Boat Race Data SUMMARY DISCUSSION AND FUTURE RESEARCH. 6.1 Cross-Sectional Time Series 6.2 Univariate Time Series ... 6.2.1 Clipping of Time Series 6.2.2 Longitudinal Data . 6.3 Extensions and Further Research 6.3.1 Alternative Random Effects Distribution 6.3.2 Topics in GLMM Research REFERENCES BIOGRAPHICAL SKETCH. Vll 142 159 162 164 165 165 166 166 168 170 171

PAGE 8

LIST OF TABLES Table 3 1 A simulation study for a logistic GLMM with autoregressive random effects . . . . . . . . . . . . . . . . . 80 3 2 Simulation study for modeling unequally space binary time series. 81 5 1 Comparing estimates from two models for the log-odds. 121 5 2 Parameter estimates for the polio data. . . . . 125 5 3 Autocorrelation functions for the Old Faithful geyser data. 134 5 4 Comparison of observed and expected counts for the Old Faithful geyser data . . . . . . . . . . . . . 142 5 5 Maximum likelihood estimates for boat race data 147 5 6 Observed and expected counts of sequences of wins (W) and losses (L) for the Cambridge University team. . . . . . . . . 151 5 7 Estimated random effects Ut for the last 30 years for the boat race data. 152 5 8 Estimated probabilities of a Cambridge win in 2004 given the past s + l outcomes of the race . . . . . . . . . . . . 155 Vlll

PAGE 9

LIST OF FIGURES Figure 2 1 Plot of the typical behavior of the Monte Carlo sample size m(k) and the Q-function Q~) through MCEM iterations. . . . . . 51 3 1 Sampling proportions from the GSS data set. . 55 3 2 Iteration history for selected parameters and their asymptotic standard e rror s for the GSS data. . . . . . . . . . . 61 3 3 Realized (simulated) random effects u 1 ... ur versus estimated random effects u 1 .. ilr. . . . . . . . . 77 3 4 Comparing simulated and estimated random effects. 78 4 1 Approximated marginal probabilities for the fixed part predictor value x' /3 ranging from -4 to 4 in a logit model. . . . . . . . . 99 4 2 Comparison of conditional logit and probit model based probabilities. 104 4 3 Comparison of implied marginal probabilities from logit and probit models.. . . . . . . . . . . . . . . . . 106 5 1 Empirical standard deviations std(0it) for the log odds of favoring homosexual relationship by race. 119 5 2 Plot of the Polio data. . . 123 5 3 Iteration history for the Polio data. 124 5 4 Residual autocorrelations for the Polio data. 129 5 5 Residual autocorrelations with outlier adjustment for the Polio data. 131 5 6 Autocorrelation functions for the Old Faithful geyser data. 5 7 Lorelogram for the Old Faithful geyser data. .. 5 8 Plot of the Oxford vs. Cambridge boat race data. 5 9 Variogram for the Oxford vs. Cambridge boat race data. 5 10 Lorelogram for the Oxford vs. Cambridge boat race data .. IX 135 136 143 145 146

PAGE 10

5 11 Path plots of fixed and random effects parameter estimates for the boat race data . . . . . 14 7 6 1 Association graphs for GLMMs. . . . . . . . . . . 161 X


Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

REGRESSION MODELS FOR DISCRETE-VALUED TIME SERIES DATA

By

Bernhard Klingenberg

August 2004

Chair: Alan G. Agresti
Cochair: James G. Booth
Major Department: Statistics

Independent random effects in generalized linear models induce an exchangeable correlation structure, but long sequences of counts or binomial observations typically show correlations decaying with increasing lag. This dissertation introduces models with autocorrelated random effects for a more appropriate, parameter-driven analysis of discrete-valued time series data. We present a Monte Carlo EM algorithm with Gibbs sampling to jointly obtain maximum likelihood estimates of regression parameters and variance components. Marginal mean, variance and correlation properties of the conditionally specified models are derived for Poisson, negative binomial and binary/binomial random components. They are used for constructing goodness-of-fit tables and checking the appropriateness of the modeled correlation structure. Our models define a likelihood, and hence estimation of the joint probability of two or more events is possible and used in predicting future responses. Also, all methods are flexible enough to allow for multiple gaps or missing observations in the observed time series.


The approach is illustrated with the analysis of a cross-sectional study over 30 years where only observations from 16 unequally spaced years are available, a time series of 168 monthly counts of polio infections, and two long binary time series.


CHAPTER 1
INTRODUCTION

Correlated discrete data arise in a variety of settings in the biomedical, social, political or business sciences whenever a discrete response variable is measured repeatedly. Examples are time series of counts or longitudinal studies measuring a binary response. Correlations between successive observations arise naturally through a time, space or some other cluster-forming context and have to be incorporated in any inferential procedure. Standard regression models for independent data can be expanded to accommodate such correlations. For continuous-type responses, the normal linear mixed effects model offers such a flexible framework and has been well studied in the past. A recent reference is Verbeke and Molenberghs (2000), who also discuss computer software for fitting linear mixed effects models with popular statistical packages. Although the normal linear mixed effects model is but one member of the broader class of generalized linear mixed effects models, it enjoys unique properties which simplify parameter estimation and interpretation substantially. For discrete response data, however, the normal distribution is not appropriate, and other members in the exponential family of distributions have to be considered.

1.1 Regression Models for Correlated Discrete Data

In this introduction we will review extensions of the basic generalized linear model (McCullagh and Nelder, 1989) for analyzing independent observations to models for correlated data. These models are marginal (Section 1.2), transitional (Section 1.3) and random effects models (Section 1.4). An extensive discussion of these models with respect to discrete longitudinal data is given in the books by Agresti (2002) and Diggle, Heagerty, Liang and Zeger (2002).


In general, longitudinal studies concern only a few repeated measurements. In this dissertation, however, we are interested in the analysis of much longer series of repeated observations, often exceeding 100 repeated measurements. Therefore the following review focuses specifically on models for univariate time series observations, some of which are presented in Fahrmeir and Tutz (2001).

Let y_t be a response at time t, t = 1, ..., T, observed together with a vector of covariates denoted by x_t. In a generalized linear model (GLM), the mean μ_t = E[y_t] of observation y_t depends on a linear predictor η_t = x_t′β through a link function h(·), forming the relationship μ_t = h⁻¹(x_t′β). The variance of y_t depends on the mean through the relationship var(y_t) = φ v(μ_t), where v(·) is a distribution-specific variance function and φ is a dispersion parameter.
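For orientation, the independence GLM just described can be fit with standard software. The following minimal R sketch, with simulated data (all names and parameter values are illustrative only), fits a Poisson log-linear model that ignores any serial correlation:

    # Minimal sketch: a Poisson GLM with log link, treating the T
    # observations as independent (simulated, purely illustrative data)
    set.seed(1)
    T <- 100
    x <- rnorm(T)
    y <- rpois(T, lambda = exp(0.5 + 0.3 * x))
    fit <- glm(y ~ x, family = poisson(link = "log"))
    summary(fit)$coefficients  # standard errors valid only under independence

The models reviewed below address exactly what this fit ignores: the correlation between successive observations.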
PAGE 15

3 transitional models is based on a conditional or partial likelihood, and inference in random effects models relies on a full likelihood (possibly Bayesian) approach. However models and inferential procedures have been developed that allow more flexibility than the above categorization. 1.2 Marginal Models In marginal regression models the main scientific goal is to assess the influence of covariates on the marginal mean of Yt treating the association structure between repeated observations as a nuisance. The marginal mean t and variance var(yt) are modeled separately from a correlation structure between two observations Yt and Yt Regression parameters in the linear predictor are called population averaged parameters because their interpretation is based on an average over all individuals in a specific covariate subgroup. Due to the correlation among repeated observations the likelihood for the model refers to the joint distribution of all observations and not to the simpler product of their marginal distributions However the model is specified in terms of these marginal distributions, which makes maximum likelihood fitting particulary hard for even a moderate number T of repeated measurements. 1.2.1 Likelihood Based Estimation Methods For binary data Fitzmaurice and Laird (1993) discuss a parametrization of the joint distribution in terms of conditional probabilities and log odds ratios. These parameters are related to the marginal mean and the same conditional log odds ratios which describe the higher order associations among the repeated responses. The marginal mean and the higher order associations are then modelled in terms of orthogonal parameters /3 and a, respectively. Fitzmaurice and Laird (1993) present an algorithm for maximizing the likelihood with respect to these two parameter sets. The algorithm has been implemented in a freely available computer program (MAREG) by Kastner et al. (1997).

PAGE 16

4 Another approach to maximum lik elihood fitting for longitudinal discrete data regards the marginal model as a constraint on the joint distribution and maximizes the likelihood subject to this constraint. The model is written in terms of a genera liz ed log-linear model Clog(A = X/3 where is a vector of expected counts and A and C are matrices to form marginal counts and functions of those marginal counts, respectively. With this approach, no specific assumption about the correlation structure of repeated observations is made and the likelihood refers to the most general form for the joint distribution However, simultaneous modeling of the marginal distribution and a simplified joint distribution is also possible. Details can be found in Lang and Agresti (1994) and Lang (1996). Lang (2004) also offers an R computer program (mph fit) for maximum lik elihood fitting of these very general marginal models. 1.2.2 Quasi-Likelihood Based Estimation Methods The drawback of the two approaches mentioned above and likelihood based methods in general is that they require enormous computing resources as the number of repeated responses increases or the number of covariates is large, making maximum likelihood fitting computationally impossible for long time series. This is also true for estimation based on alternative parameterizations of a distribution for multivariate binary data such as those discussed in Bahadur (1961), Cox (1972) or Zhao and Prentice (1990). Estimating methods leading to computationally simpler inference (albeit not maximum likelihood) for marginal models are based on a quasi-likelihood approach (Wedderburn, 1974). In a quasi-likelihood approach, no specific form for the distribution of the responses is assumed and only the mean variance and correlation are specified. However with discrete data specifying the mean and covariances does not determine the lik elihood, as it would with normal data, so parameter estimation cannot be based on it. Liang and Zeger (1986) proposed

PAGE 17

5 generalized estimating equations (GEE) to estimate parameters, which have the form of score equations for GLMs, but cannot be interpreted as such. Their approach also requires the specification of a working correlation matrix for the repeated responses. They show that if the mean function is correctly specified the solution to the generalized estimating equations is a consistent estimator regardless of the assumed variance-covariance structure for the repeated responses. They also present an estimator of the asymptotic variance-covariance matrix for the GEE estimates which is robust against misspecification of the working correlation matrix. Several structured working correlation matrices have been proposed for parsimonious modeling of the marginal correlation, and some of them are implemented in statistical software packages for GEE estimation (e.g., SAS s proc genmod with the repeated statement and the type option or the gee and geel packages in R) 1.2.2.1 GEE for time series of counts Zeger (1988) uses the GEE methodology to fit a marginal model to a time series {Yt}f= 1 of T = 168 monthly counts of cases of poliomyelitis in the United States. He specifies the marginal mean variance and correlation by corr(yt Yt+-r) where a 2 is the variance and p(r) the autocorrelation function of an underlying random process { Ut}To fit this marginal model he proposes and outlines the (1.1) GEE approach but notes that it requires inversion of the T x T variance-covariance matrix of y 1 ... YT which has no recognizable structure and therefore no simple inverse. Subsequently he suggests approximating this matrix by a simpler, struc tured matrix leading to nearly as efficient estimators as would have been obtained

PAGE 18

6 with the GEE approach. The variance component a 2 and unknown parameters in p( T) are estimated by a methods of moments approach. Interestingly Zeger (1988) derives the marginal mean variance and correlation in (1.1) from a random effects model specification: Conditional on an underlying latent random process {ut} with E[ut] = 1 and cov(ut,Ut+ 7 ) = a 2 p(T) he initially models the time series observations as conditionally independent Poisson variables with mean and variance (1.2) Marginally by the formula for repeated expectation, this leads to the moments presented in (1.1). From there we also see that the latent random process {ut} has introduced both overdispersion relative to a Poisson variable and autocorrelation among the observations. The models we will develop in subsequent chapters have similar features. The equation for the marginal correlation between Yt and Yt shows that the autocorrelation in the observed time series must be less than the autocorrelation in the latent process { Ut}We will return to the polio data set in Chapter 5 where we compare this model to models suggested in this dissertation and elsewhere. 1.2.2.2 GEE for binomial time series For binary and binomial time series data it is often more advantageous to model the association between observations using the odds ratio rather than directly specifying the marginal correlation corr(yt, Yt) as with count data. The odds ratio is a more natural metric to measure association between binary outcomes and easier to interpret. The correlation between two binary outcomes Y 1 and Y; is also constrained in a complicated way by their marginal means 1 = P(Y 1 = 1) and 2 = P(Y 2 = 1) as a consequence of the following inequalities

PAGE 19

7 for their joint distribution : and leading to max{O + 1} :S P(Y1 = 1 Y; = 1) :S min{ }. Therefore instead of marginal correlations a number of authors (Fitzmaurice, Laird and Rotnitzky 1993; Carey Zeger and Diggle 1993) propose the use of marginal odds ratios. For unequally spaced and unbalanced binary time series data Fitzmaurice and Lipsitz (1995) present a GEE approach which models the marginal association using s e rial odds ratio patterns. Let 1/Judenote the marginal odds ratio between two binary observations Yt and Yt. Their model for the association has the form ., . = o? /l t t* I 1 < a < oo f.// tt , which has the property that as It t I 0 there is perfect association ( VJtt oo) and as It t I oo the observations are independent (1/Ju1). Note however that only positive association is possible with this type of model. (SAS s proc genmod now offers the possibility of specifying a general regression structure for th e log odds ratios with the logor option.) 1.3 Transitional Models In transitional models, past observations are simply treated as additional predictors. Int e rest lies in e stimating th e eff e cts of these and other explanatory variables on th e conditional mean of the response Yt given realizations of the past responses. Specifying the relationship between the mean of Yt and previous obser vations Yt-l Yt 2 ... is another way (and in contrast to the direct way of marginal

PAGE 20

8 models) of modeling the dependency between correlated responses. Transitional models fit into the framework of GLMs where, however the distribution of Yt is now conditional on the past responses. The model in its most general form (Diggle et al. 2002) expresses the conditional mean of Yt as a function of explanatory variables and q functions fr(-) of past responses, E[y, I H,] = h 1 ( x;/3 + t, f,(H, ; a)) (1.3) where Ht = {Yt 1 Yt-2 ... y 1 } denotes the collection of past responses. Ht can also include past explanatory variables and parameters. Often, the models are in discrete-time Markov chain form of order q, and the conditional distribution of Yt given Ht only depends on the last q responses Yt i ... Yt q For example, a transitional logistic regression model for binary responses that is a second order Markov chain has form logit P(Yi = 1 I Yt 1, Yt 2) = x~/3 + a1Yt-1 + a2Yt-2The main difference between transitional models and regular GLMs or marginal models is parameter interpretation. Both the interpretation of a and the inter pretation of {3 are conditional on previous outcomes and depend on how many of these are included. As the time dependence in the model changes so does the interpretation of parameters With the logistic regression example from above the conditional odds of success at time t are exp(a 1 ) times higher if the given previous response was a success rather than a failure. However this interpretation assumes a fixed and given outcome at time t 2. Similarly, a coefficient in {3 represents the change in the log odds for a unit change in Xt conditional on the two prior responses. It might be possible that we lose information on the covariate effect by conditioning on these previous outcomes. In general the interpretation of parame ters in transitional models is different from the population averaged interpretation

PAGE 21

9 we discussed for marginal models where parameters are effects on the marginal mean without conditioning on any previous outcomes. 1.3.1 Model Fitting If a discrete-time Markov model applies the likelihood for a generic series Y1 ... YT is determined by the Markov chain structure: T L(/3 a ; Y1 , YT) = f (Y1 , Yq) IT f (Yt I Yt-1 , Yt-q)t=q+1 However the transitional model (1.3) only specifies the conditional distributions appearing in the product, but not the first term of the likelihood. Often, instead of a full maximum likelihood approach one conditions on the first q observations and maximizes the corresponding conditional likelihood. If in addition fr(Ht; a) in (1.3) is a linear function in a (and possibly /3) then maximization follows along the lines of GLMs for independent data. Kaufmann {1987) establishes the asymptotic properties such as consistency, asymptotic normality and efficiency of the conditional maximum likelihood estimator. If a Markov assumption is not warranted estimation can be based on the partial likelihood (Cox 1975). To motivate the partial likelihood approach we follow Kedem and Fokianos (2002): They consider occasions where a time series {Yi} is observed jointly with a random covariate series { Xt}The joint density of (Yi Xt) t = l . T parameterized by a vector 8, can be expressed as where Ht = (x1 Y1 ... Xt 1 Yt 1) and Ht = (xi Yi ... Xt-1, Yt 1, Xt) hold the history up to time points t l and t respectively. Let :Ft I denote the a-field generated by Yt i Yt 2 ... Xt Xt I .. i.e. Ft I is generated by past responses and present and past values of the covariates. Also, let ft(Yt I Ft i i 8) denote the conditional density of Yi given :Ft I which is of exponential density form with

PAGE 22

10 mean modeled by (1.3). Then the partial likelihood for 0 = ( a /3) is given by T p L(0 ; Yi ... YT) = IT !t(Yt I :Ft -1; 0) (1.5) t=l which is the second produ ct in (1.4) and hence the term partial. The loss of information by ignoring the first product in the joint density is considered small. If the covariate process is deterministic, then the partial likelihood becomes a conditional likelihood, but without the necessity of a Markov assumption on the distribution of the Yt's. Standard asymptotic results from likelihood analysis of independent data carry over to the case of partial likelihood estimation with dependent data. Fokianos and Kedem (1998) showed consistency and asymptotic normality of 0 and provided an expression for the asymptotic covariance matrix. Since the score equation obtained from (1.5) is identical to one for independent data in a GLM partial likelihood holds the advantage of easy fast and readily available software implementation with standard estimation routines such as iterative re-weighted least squares. 1.3.2 Transitional Models for Time Series of Counts For a time series of counts {Yt} Zeger and Qaqish (1988) propose Markov type transitional models which they fit using quasi-likelihood methods and the estimating equations approach. They consider various models for the conditional mean t = E[yt I Ht] of form log(t) = x~/3 + L~ = I arfr(Ht-r), where for example fr(Ht-r) = Yt-r or f r( Ht -r) = log(Yt r + c) log(exp[x~/3] + c). One common goal of their models is to approximate the marginal mean by E[yt] = E[t] exp{ x~/3} so that f3 has an approximate marginal interpretation as the change in the log mean for a unit change in the explanatory variables. Davis et al. (2003) develop these models further and propose fr ( Ht -r) = (Yt-r r) / ;_r as a more appropriate function to built serial dependence in the model where ,\ is an additional parameter They explore stability properties such as stationarity and

PAGE 23

11 ergodicity of these models and describe fast (in comparison to maximum likelihood techniques required for competing random effects models) recursive and iterative maximum likelihood estimation algorithms Chapter 4 in Kedem and Fokianos (2002) discusses regression models of form (1.3) assuming a conditional Poisson or double-truncated Poisson distribution for the counts with inference based on the partial likelihood concept. Their methodology is illustrated with two examples about monthly counts of rainy days and counts of tourist arrivals. 1.3.3 Transitional Models for Binary Data For binary data {Yt} a two state first order Markov chain can be defined by its probability transition matrix p = [Poo Po1] P10 Pu where Pab = P(Yt = b I Yt 1 = a) a b = 0 1 are the one-step transition probabilities between the two states a and b. Diggle et al. (2002 Chapt. 10.3) discuss various logistic regression models for these probabilities and higher order Markov chains for equally spaced observations. Unequally spaced data cannot be routinely handled with these models. How can we determine the marginal association structure implied by the conditionally specified model? Let p 1 = (pi PD be the initial marginal distribution for the states at time t = l. Then the distribution of the states at time n is given by pn = p 1 pn As n increases pn approaches a steady state or equilibrium distribution that satisfies p = pP The solution to this equation is given by P1 = P(Yt = 1) = E[yt] = Poi/(p 01 + p 10 ) and is used to derive marginal moments implied by the transitional model. For example it can be shown (Kedem 1980) that in the steady state the marginal variance and correlation implied by the

PAGE 24

12 transitional model are var(yt) = PoPi (as it should be) and corr(Yt-1 Yt) = Pu Poi respectively. Azzalini (1994) models serial dependence in binary data through transition models but at the same time retains the marginal interpretation of regression parameters He specifies the marginal regression model logit(t) = z~/3 for a binary time series {Yt} with E[yt] = but assumes that a binary Markov chain with transition probabilities Pab has generated the data Therefore the likelihood refers to these probabilities but the model specifies marginal probabilities a complication similar to th e fitting of marginal models discussed in the previous section However assuming a constant log odds ratio 0 = log (P(Yt 1 = 1 Yi= l)P(Yi 1 = 0 Yi= 0)) P(Yi 1 = 0 Yi= l)P(Yi 1 = 1 Yi= 0) between any two adjacent observations Azzalini (1994) shows how to write Pab in terms of just this log odds ratio 0 and the marginal probabilities t and 1 Maximum likelihood estimation for such models is tedious but possible in closed form although second derivatives of the log likelihood function have to be calculated numerically. A software package (the S-plus function rm.tools, Azzalini and Chiogna 1997) exists to fit such models for binary and Poisson observations. Azzalini (1994) mentions that this basic approach can be extended to include variable odds ratios between any two adjacent observations possibly depending on covariates but this is not pursued in the article Diggle et al. (2002) discuss these marginalized transitional models further. Chapter 2 in Kedem and Fokianos (2002) presents a detailed discussion of partial likelihood estimation for transitional binary models and discusses among other examples the eruption data of the Old Faithful geyser which we will turn to in Chapter 5.

PAGE 25

13 1.4 Random Effects Models A popular way of modeling correlation among dependent observations is to include random effects u in the linear predictor. One of the first developments for discrete data occurred for longitudinal binary data where subject-specific random effects induced correlation between repeated binary measurements on a subject (Bock and Aitkin 1981; Stiratelli, Laird and Ware, 1984). In general, we assume that unmeasurable factors give rise to the dependency in the data {Yt} and random effects { ut} represent the heterogeneity due to these unmeasured factors. Given these effects the responses are assumed independent. However, no values for these factors are observed and so marginally (i.e., averaged over these factors), the responses are dependent. Conditional on some random effects, we consider models that fit into the framework of GLMs for independent data, i.e., where the conditional distribution of Yt I U t is a member of the family of exponential distributions whose mean E[yt I Ut ] is modeled as a function of a lin ear predictor f/t = x~/3 + z~ut. Together with a distributional assumption for the random effects (usually independent and identically normal) this leads to genera liz ed linear mixed models (GLMMs), where the term mixed refers to the mixture of fixed and random effects in the linear pre dictor. Chapter 2 contains a detailed definition of GLMMs and discusses maximum lik elihood fitting and parameter interpretation and in Chapter 3 correlated random effects for the description of time dependent observations {Yt} are motivated and described. Here we only give a short lit erature review about GLMMs which use correlated random effects to model time ( or space) dependent data. 1.4.1 Correlated Random Effects in GLMMs One of the first papers considering correlated random effects in G LMMs for the description of (spatial) dependence in Poisson data is Breslow and Clayton (1993), who analyze lip cancer rates in Scottish counties. They propose correlated

PAGE 26

14 normal random e ffects to capture the correlation in counts of adjacent districts in Scotland. A random effect is assigned to each district and two random effects are correlated if their districts are adjacent to each other. In Section 1.2 2 we mentioned the Polio data set of a time series of equally spaced counts {yt};~1 and formulated the conditional model (1.2) with a latent process for the random effects. Instead of obtaining marginal moments as in Zeger (1988) Chan and Ledolter (1995) use a GLMM approach with Poisson random components and autoregressive random effects to analyze the time series. They outline parameter estimation via an MCEM algorithm similar to the one discussed in Sections 2.4 and 3.2 in this dissertation. One of the three central generalized linear models advocated by Diggle et al. (2002 Chap. 11.2) to model longitudinal data uses correlated random effects. For equally spaced binary longitudinal data {yit} they plot response profiles simulated according to the model logit[P(~t = 1 I Uit)] COV ( U it, U i t ) {3 + Uit a2 p l t;1 t;1 I with a 2 = 2.5 2 and p = 0.9 and note that the profiles exhibit more alternating runs of O 's and l 's than a random intercept model with ui 1 = ui 2 = ... = uiT = ui. However based on the similarity between plots of random intercepts, random intercepts and slopes and autoregressive random effects models, they mention the challenge that binary data present in distinguishing and modeling the underlying dependency structure in longitudinal data. (They used T = 25 repeated observation for their simulations ) Furthermore they state that numerical methods for maximum likelihood estimation are computationally impractical for fitting models with higher dimensional random effects. This makes it impossible they conclude, to fit the GLMM with serially correlated random effects using maximum

PAGE 27

15 likelihood Instead they propose a Bayesian analysis using powerful Monte Carlo Markov chain methods Indeed the majority of examples in the literature which consider correlated random effects in a GLMM framework take a Bayesian approach. Sun Speckman and Tsutakawa (2000) explore several types of correlated random effects (autore gressive generalized autoregressive and conditional autoregressive) in a Bayesian analysis of a GLMM. As in any Bayesian analysis the propriety of the posterior distribution given the data is of concern when fixed effects and variance compo nents have improper prior distributions and random effects are (possibly singular) multivariate normal. One of their results applied to Poisson or binomial data {Yt} states that the posterior might be improper when Yt = 0 in the Poisson case and cannot be proper when Yt = 0 or Yt = nt in the binomial case for any t when improper or non-informative priors are used Diggle Tawn and Moyeed (1998) consider Gaussian spatial processes S(x) to model spatial count data at locations x. The role of S ( x) is to explain any residual spatial variation after accounting for all known explanatory variables They also use a Bayesian framework to estimate parameters and give a solution to the problem of predicting the count at a new location x. Ghosh et al. (1998) use correlated random effects in Bayesian models for small area estimation problems. They present an application of pairwise difference priors for random effects to model a series of spatially correlated binomial observations in a Bayesian framework Zhang (2002) discusses maximum likelihood estimation with an underlying spatial Gaussian process for spatially correlated binomial observations. Bayesian models for binary time series are described in Liu (2001), based on probit-type models for correlated binary data which are discussed in Chib and Greenberg (1998) Probit type models are motivated by assuming latent random variables z = ( z 1 ... ZT) which follow a ~) distribution with

PAGE 28

16 = 1 .. J1T) t = x~/3 and E a correlation matrix. The Yt 's are assumed to be generated according to Yt = I(zt > 0), where I(.) is the indicator function. This leads to the (marginal) probit model P(Yt = 1 I /3 E) = t(Zt I E). Rich classes of dependency structures between binary outcomes can be modeled through E. These models can further be extended to include random effects through 't = x~/3 + z~ut or q previous responses such as t = x~/3 + I::; = 1 O:rYt r It is important to note that E has to be in correlation form. To see this suppose it is not and let S = DED be a covariance matrix for the latent random variables z, where D is a diagonal matrix holding standard deviation parameters. The joint density of the times series under the multivariate probit model is given by P[(Y1 ... Yt) = (Yi , YT)] P[z EA] P[D1 z EA], where A = A 1 x x AT with At = ( -oo, 0] if Yt = 0 and At = (0, oo) if Yt = 1 are the intervals corresponding to the relationship Yt = I(zt > 0), fort = 1, ... T. However, above relationship is true for any parametrization of D because the intervals At are not affected by the transformation from z to n 1 z. Hence, the elements of D are not identifiable based on the joint distribution of the observed time series y. Lee and Nelder (2001) present models to analyze spatially correlated Poisson counts and binomial longitudinal data about cancer mortality rates They explore a variety of patterned correlation structures for random effects in a GLMM setup. Model fitting is based on the joint data likelihood of observations and unobserved random effects (Lee and Nelder, 1996) and not on the marginal likelihood of the

PAGE 29

17 observed data. Model diagnostic plots of estimated random effects are presented to aid in selecting an appropriate correlation structure. 1.4.2 Other Modeling Approaches In hidden Markov models (MacDonald and Zucchini 1997) the underlying random process is assumed to be a discrete state-space Markov chain instead of a continuous (normal) process. Probability transition matrices describe the connection between states. A very convenient property of hidden Markov models is that the likelihood can be evaluated sufficiently fast to permit direct numerical maximization. MacDonald and Zucchini (1997) present a detailed description of hidden Markov models for the analysis of binary and count time series. A connection between transitional models and random effects models is explored in Aitkin and Alf6 (1998). They model the success probabilities of serial binary observations conditional on subject-specific random effects and on the previous outcome. As in the models before transition probabilities Pab are changing over time due to the inclusion of time-dependent covariates and the previous observation in the linear predictor. Additionally random effects account for possibly unobserved sources of heterogeneity between subjects. The authors argue that the conditional model specification together with the specification of the random effects distribution does not determine the distribution of the initial observation, and hence the likelihood for this model is unspecified. They present a solution by maximizing the likelihood obtained from conditioning on this first observation. However this causes the specified random effects distribution to shift to an unknown distribution Two approaches for estimation are outlined: The first assumes another normal distribution for the new random effects distribution and the likelihood is maximized using Gauss Hermite quadrature. The second approach assumes no parametric form for the new random effects distribution and follows th e nonparametric maximum likelihood approach (Aitkin, 1999). For binary data, the

PAGE 30

18 new random effects distribution is only a two point distribution and its parameters can be estimated via maximum likelihood jointly with the other model parameters Marginalized transitional models were briefly mentioned with the approach taken by Azzalini (1994). The idea of marginalizing ", i.e. model the marginal mean of an otherwise conditionally specified model can also be applied to random effects models The advantage of transitional or random effects models is the ability to easily specify correlation patterns with the potential disadvantage that parameters in such models have conditional interpretations when the scientific goal is on the interpretation of marginal relationships In marginal models parameters can directly be interpreted as contrasts between subpopulations without the need of conditioning on previous observations or unobserved random effects. However as we mentioned in Section 1.2.2 likelihood based inference in marginal models might not be possible A marginalized random effects model (Heagerty 1999; Heagerty and Zeger, 2000) specifies two regression equations that are consistent with each other. The first equation expresses the marginal mean f'1 as a function of covariates and describes system atic variation. The second equation c haracterizes the dependency structure among observations through specification of the conditional mean where Ut are random effects with design vector Zt Consistency between the marginal and conditional specification is achieved by defining ~t(Xt) implicitly through

PAGE 31

19 For instance in a marginalized GLMM with random effects distribution F(u t ), .6.t(xt) is the solution to the integral equation f = h 1 (x~ f3 ) = J h 1 (.6.t(xt) + z~ut)dF(ut) so that .6.t ( Xt) is a function of the marginal regression coefficients {3 and the (variance) parameters in F(ut)Maximum likelihood estimation is based on the integrated likelihood from the GLMM model. 1.5 Motivation and Outline of the Dissertation In this dissertation we propose generalized linear mixed models ( G LMMs) with correlated random effects to model count or binomial response data collected over time or in space. For sequential or spatial Gaussian measurements, maximum likelihood estimation is well established and software (e.g. SAS s proc mixed) is available to fit fairly complicated correlation structures. The challenge for discrete data lies in the fact that the observed (marginal) likelihood is not analytically tractable and maximization of it is more involved. Furthermore with correlated random effects the likelihood does not break down into lower-dimensional com ponents which are easier to integrate numerically. Therefore most approaches in the literature are based on a quasi-likelihood approach or take a Bayesian per spective. The advantage of Bayesian models is that powerful Monte Carlo Markov chain methods make it easier to obtain a sample from the posterior distribution of interest than to obtain maximum likelihood estimates. However priors must be specified very carefully to ensure posterior propriety. In addition repeated observations are prone to missing data or unequally spaced observation times. We would like to develop methods and models that allow for unequally spaced binary binomial or Poisson observations making them more general than previously presented in the literature.

PAGE 32

20 To our knowledge maximum likelihood estimation of GLMMs with such high dimensional random effects has not been demonstrated before, with the exception of the paper by Chan and Ledolter (1995) who consider fitting of a time series of counts However they do not consider unequally spaced data and employ a different implementation of the MCEM algorithm In Chapter 5 we argue that their implementation of the algorithm might have been stopped prematurely leading to different conclusions than our a nalysis and analyses published elsewhere Most articles that discuss correlated random effects do so for only a small number of correlated random effects E g Chan and Kuk (1997) show that the data set on salamander mating behavior published and analyzed in McCullagh and Nelder (1989) is more appropriately analyzed when random effects pertaining to the male salamander population are correlated over the three different time points when they were observed. In this thesis we wou l d like to consider much longer sequences of repeated observations. In Chapter 2 we introduce the GLMM as the model of our choice to analyze correlated discrete data and outline an EM algorithm to estimate fixed and random effects, where both the E-step and the M-step require numerical approximations leading to an EM algorithm based on Monte Carlo methods (MCEM). Correlated random effects and their implications on the analysis of GLMMs are discussed in Chapter 3 together with a motivating example. This chapter also gives details for the implementation of the algorithm and reports results from simulation studies. Chapter 4 looks at marginal model properties and interpretation for correlated binary binomial or Poisson observations and Chapter 5 applies our methods to real data sets from the social sciences public health, sports and other backgrounds. A summary and discussion of the methods and models presented here is given in Chapter 6.

PAGE 33

CHAPTER 2 GENERALIZED LINEAR MIXED MODELS Chapter 1 reviewed various approaches of extending GLMs to deal with correlated data. In this Chapter we will take a closer look at generalized linear mixed models (GLMMs) which were briefly mentioned in Section 1.4. When the response variables are normal, these models are simply called linear mixed models (LMMs) and have been extensively discussed in the literature (see, for example the books by Searle, Casella and McCulloch, 1992, and Verbeke and Molenberghs 2000) The form of the normal density for observations and random effects allows for analytical evaluation of the integrals together with straightforward maximization. Hence LMMs can be readily fit with existing software (e.g., SAS's proc mixed), using rich classes of pre-specified correlation structures for the random effects to model the dependence in the data more precisely. The broader notion of GLMMs also encompasses binary, binomial, Poisson or gamma responses. A distinctive feature of GLMMs is their so called subject-specific parameter interpretation, which differs from the interpretation of parameters in marginal (Section 1.2) or transitional (Section 1.3) models This feature is discussed in Section 2.1, after a formal introduction of the GLMM. Throughout, special attention is devoted to define GLMMs for discrete time series observations. GLMMs are harder to fit because they typically involve intractable integrals in the likelihood function. Section 2.2 outlines various approaches to model fitting. Section 2.3 focuses on a Monte Carlo version of the EM algorithm which is an indirect method of finding maximum likelihood estimates in G LMMs. Monte Carlo methods are necessary because our applications involve correlated random effects which lead to a very high-dimensional integral in the likelihood function. Parallel 21

PAGE 34

22 to the discussion of GLMMs state space models are introduced and a fitting algorithm is described. State space models are popular models for discrete time series in econometric applications (Durbin and Koopman 2001). The presentation of specific examples of GLMMs for discrete time series observation is deferred until Chapter 5. 2.1 Definition and Notation The generalized linear mixed model is an extension of the well known gen eralized linear model (McCullagh and Nelder, 1989) that permits fixed as well as random effects in the linear predictor (hence the word mixed). The setup process for GLMMs is split into two stages which we present here using notation common for longitudinal studies.: Firstly conditional on cluster specific random effects ui the data are assumed to follow a GLM with independent random components ~t the t-th response in cluster i, i = 1 . n t = 1, ... ni. A cluster here is a generic expression and means any form of observations being grouped together such as repeated observation on the same subject (cluster= subject), observations on different students in the same school ( cluster = school) or observations recorded in a common time interval ( cluster = time interval). The conditional distribution of ~t is a member of the exponential family of distributions (e g ., McCullagh and Nelder 1989) with form (2.1) where 0it are natural parameters and b(.) and c(.) are certain functions determined by the specific member of the exponential family. The parameters >it are typically of form i t = >/w i t where the wit s are known weights and > is a possibly unknown dispersion parameter. For the discrete response GLMMs we are considering > = 1. For a specific link function h( ), the model for the conditional mean for

PAGE 35

23 observations Y it has form (2 2) where x~t and z ~t are covariate or design vectors for fixed and random effects associated with observation Yit and /3 is a vector of unknown regression coefficients. At this first stage z~tu i can b e regarded as a known offset for each observation and observations are conditionally independent. It should be noted that relationship (2.2) between the mean of the observation and fixed and random effects is exactly as is specified in the systematic part of GLMs with the exception that in GLMMs a conditional mean is modeled. This affects parameter interpretation The regression coefficients /3 represent the effect of explanatory variables on the conditional mean of observations given the random effects. For instance observations in the same cluster i share a common value of the random cluster effect u i, and hence /3 describes the conditional effect of explanatory variables given the value for u i If the cluster consists of repeated observations on the same subject these effects are called subject-specific effects. In contrast regression coefficients in GLMs and marginal models describe the effect of explanatory variables on the population average which is an average over observations in different clusters At the second stage, the random effects ui are specified to follow a multi variate normal distribution with mean zero and variance-covariance matrix E i A standard assumption is that random effects { u i } are independent and identically distributed but an example at the beginning of Chapter 3 will show that this is sometimes not appropriate. With time series observations where the clusters refer to time segments it is reasonable to assume that observations are not only correlated within the cluster (modeled by sharing the same cluster specific random

PAGE 36

24 effect), but also across clusters which we will model by assuming correlated cluster specific random effects. 2.1.1 Generalized Linear Mixed Models for Univariate Discrete Time Series Most of the data we are going to analyze is in the form of a single univariate time series. To emphasize this data structure, the general two-dimensional notation (indices i and t) of a GLMM to model observations which come in clusters can be simplified in two ways: We can assume that a single cluster (i.e. n = 1 and n 1 = T) contains the entire time series y 1 ... Yr. The random effects vector u = ( u 1 ... ur) associ ated with the single cluster has a random effects component for each individual time series member. The distribution of u is multivariate normal with variance covariance matrix I: which is different from the identity matrix. The correlation of the components of u induce a correlation among the time series members. However conditional on u observations within the single cluster are independent. The cluster index i is redundant in the notation and hence can be dropped. This representation is particulary useful when used with existing software to fit GLMMs where it is often necessary to include a column indicating the cluster membership information for each observation. Here since we have only one cluster, it suffices to include a column of all ones, say. Alternatively we can adopt the point of view that each member of the time series is a mini cluster by itself containing only one observation (i.e., ni = 1 for all i = 1 .. T) in the case of a single time series. When multiple parallel time series are observed, the cluster contains all c observations at time point t from the c parallel time series (i.e., ni = c for all i = 1 ... T). In any case, the clusters are then synonymous with the discrete time points at which observations were recorded. This makes index t which counts the repeated observations in a cluster

PAGE 37

25 redundant (t = 1 or t = c for all clusters i) but instead of denoting the time series by {yi}~ 1 we decided to use the more common notation {Yt}f = 1 where t now is the index for clusters or, equivalently, time points. In the following definition of GLMMs for univariate time series, the notation of clusters or time points can be used interchangeably Conditional on unobserved random effects u 1 .. UT for the different time points observations y 1 .. YT are assumed independent with distributions (2.3) in the exponential family. As before, for a specific link function h( ), the model for the conditional mean has form (2.4) where x ~ and z~ are covariate or design vectors for fixed and random effects associated with the t-th observation and /3 is a vector of unknown regression coefficients. The random effects u 1 ... u T are typically not independent. When collected in the vector u = ( u 1 .. UT) a multivariate normal distribution with mean O and covariance matrix can be directly specified. In particular in Chapter 3 we will assume special patterned covariance matrices to allow for rich, but still parsimonious, classes of correlation structures among the time series observations The advantage of the second setup of mini clusters is that it also allows for other indirect specifications of the random effects distribution for instance through a latent random process. For this we relate cluster-specific random effects from successive time points For example, with univariate random effects, a first-order latent autoregressive process assumes that the random effects follow (2.5)

PAGE 38

26 where lt has a zero-mean normal distribution and p is a correlation parameter. Cox (1981) called these type of models parameter-driven models as opposed to transitional ( or observation-driven) models discussed in Section 1.3. In parameter driven models an underlying and unobserved parameter process influences the distribution of a series of observations. The model for the polio data in Zeger (1988) is an example of a parameter-driven model. However Zeger (1988) does not assume normality nor zero mean for the latent autoregressive process. Furthermore the natural logarithm of the random effects, and not the random effects themselves appear additively in the linear predictor. Therefore this model is slightly different from the specifications of the time series G LMM from above. Another application of the mini cluster setup is to spatial settings, where clus ters represent spatially aggregated data instead of time points. Then u 1 ... uT is a collection of random effects associated with spatial clusters Again independent random effects to describe the spatial dependencies are inappropriate. In general, time dependent data are easier to handle since observations are linearly ordered in time and more complicated random effects distributions are needed for spatial applications (e.g. Besag et al. 1995). We will use the mini-cluster representation to facilitate a comparison to state space models This is the focus of the next section 2.1.2 State Space Models for Discrete Time Series Observations State space models are a rich alternative to the traditional Box-Jenkins ARIMA system for time series analysis. Similar to GLMMs state space models for Gaussian and non-Gaussian time series split the modeling process into two stages: At the first stage the responses Yt are related to unobserved states by an observation equation. (State space models originated in the engineering literature, where parameters are often called states.) At the second stage a latent or hidden Markov model is assumed for the states. For univariate Gaussian responses

PAGE 39

27 y 1 .. YT the two equations of a state space model take the form Yt (2.6) where Wt is an m x 1 observation or design vector and Et is a white noise process. The unobserved m x 1 state or parameter vector Ot is defined by the second transition equation, where Tt is a transition matrix and et is another white noise process, independent of the first one. Compared to the standard GLMMs of Section 2 1 the main difference is that random effects are correlated instead of i.i.d. In state space models no clear distinction between fixed and random effects is made and the state vector Ot can contain both. However, the form of the transition matrix Tt together with the form of the matrix Rt which consists of columns of the identity matrix Im allows one to declare certain elements of Ot as being fixed effects and others to be random. The matrix Rt is called the selection matrix since it selects the rows of the state equation which have nonzero variance terms. With this formulation the variancec ovariance matrix Qt is assumed to be non-singular. Furthermore the transition matrix Tt allows specification of which effects vary through time and which stay constant. (For a slightly different formulation without a selection matrix Rt but with possibly singular variance covariance matrix Qt see Fahrmeir and Tutz 2001 Chap. 8) State space models for non-Gaussian time series were considered by West et al. (1985) und e r the name dynamic generalized linear model. They used a Bayesian framework with conjugate priors to specify and fit their models. Durbin and Koopman (1997 2000) and Fahrmeir and Tutz (2001) describe a state space structure for non-Gaussian observations similar to the two equations above The normal distribution assumption for the observations in (2.6) is replaced by assuming a distribution in the exponential family with natural parameters 0t With

PAGE 40

28 a canonical link 0t = w~a t This is called the signal by Durbin and Koopman (1997, 2000). In particular, given the states a 1 .. aT observations y 1 . YT are conditionally independent and have density P(Yt I 0t) in the exponential family (2.3). As in Gaussian state space models, the state vector Ot is determined by the vector autoregressive relationship (2.7) where the serially independent et typically have normal distributions with mean 0 and variance-covariance matrix Qt. 2.1.3 Structural Similarities Between State Space Models and GLMMs There is a strong connection between state space models and G LMMs with a canonical link. To see this we write the GLMM in state space form: Let W t = (x~ z~)' and O t = ( /3' u~)' where Xt Zt /3 and U t are from the GLMM notation as defined in (2.4). Hence, the linear predictor x~/3 + z~ut of the GLMM is equal to the state space signal 0t = w~at. Next partition the disturbance term e t of the state equation into e t = ( ef et)' and consider special transition and selection matrices of block form Tt = [ 1 ~] Rt = [ 0 ] 0 Tt O Rt Using transition equation (2.7) results in the following autoregressive relationship between the random effects of a GLMM: Ut = 'I'tUt-1 + where t = Rte! is a white noise component. In a univariate context, we have already motivated this type of relationship between random effects in equation (2.5). The transition equation also implies a constant effect /3 for the GLMM since /3 1 = /3 2 = ... = /3T := /3. Hence, both models use corre lat ed random effects but GLMMs also typically involve fixed parameters which are not modeled as evolving over time.

PAGE 41

29 The restriction of the transition equation to the autoregressive form ( often a simple random walk) is but only one way of specifying a distribution for the random effects in the GLMMs of Section 2.2.1. Other structures, such as equally correlated random effects are possible within the GLMM framework and are considered in Chapter 3. 2.1.4 Practical Differences Although GLMMs and state space models are similar in structure, they are used differently in practice. This is in part due to the fact that in GLMMs the focus is on the fixed subject-specific regression parameters /3 which refer to time constant and time varying covariates while in state space models the main purpose is to infer properties about the time varying random states Ot. These are often assumed to follow a first or second order random walk. To illustrate, consider a data set about a monthly time series of counts presented in Durbin and Koopman (2000) for the investigation of the effectiveness of new seat belt legislation on automobile accidents They specify the log-mean for a Poisson state space model as log(t) = vt + Axt + rt where Vt is a trend component following the random walk with ~t a white noise process. Further A is an intervention parameter correspond ing to the change in seat-belt legislation (xt = 0 before the change and equal to 1 afterwards) and { rt} are fixed seasonal components with E~: 1 rt = 0 and equal in every year. The main focus is on the parameter A describing the drop in the log means also called level after the seat belt legislation went into effect.

PAGE 42

30 For the series at hand a GLMM approach would consider a fixed linear time effect /3, and model the log-mean as log(t) =a+ f3t + AXt +rt+ Ut, where a is the intercept of the linear time trend with slope /3 and where correlated random effects { Ut} account for the correlation in the monthly log means. Similar to above, A describes the effect on the log means after the seat belt legislation went into effect and rt are fixed seasonal components, equal for every year. The trend component lit of the state space model corresponds to a+ Ut in the GLMM which additionally allows for a linear time trend /3. This approach seems to be favored by some discussants of the Durbin and Koopman (2000) paper. (In particular, see the discussions by Chatfield or Aitkin of the paper by Durbin and Koopman (2000) who mention the lack of a linear trend term in the proposed state space model.) An even better GLMM approach with linear time trends explicitly modeled could use a change point formulation in the linear predictor with the month the legislation went into effect (or was enforced) as the change point, and again with correlated random effects { ut} to capture the dependency among successive means. Such a specification would be harder to model in a state space model. In the reply to the discussion of their paper Durbin and Koopman (2000) wrote that the two approaches (state space models versus Hierarchical Generalized Linear Models a model class very similar to GLMMs) are very different and that they regard their treatment to be more transparent and general for problems that specifically relate to time series. With the presentation of GLMMs with correlated random effects for time series analysis in this thesis their argument might weaken. For instance, with the proposal of autocorrelated random effects Ut+1 = put+ f.t in a GLMM context we have elegant means of introducing autocorrelation into a basic regression model that is well understood and whose parameters are easily

PAGE 43

31 interpreted. Furthermore GLMMs can easily accommodate the case of multiple time series observations on each of several individuals or cross-sectional units as is often observed in a longitudinal study. One common feature of both models is the intractability of the likelihood function and the use of numerical and simulation techniques to obtain maximum likelihood estimates. In general state space models for non-Gaussian time series are fit using a simulated maximum likelihood approach which is also a popular method for fitting GLMMs However, long time series necessarily result in models with complex and high dimensional random effects and alternative indirect, meth ods may work better. Jank and Booth (2003) indicate that simulated maximum likelihood the method of choice for estimation in state space models may not work as w e ll as indirect methods based on the EM algorithm the method we will use to fit GLMMs for time series observations. The next section reviews various approaches of fitting GLMMs and contrasts them with the approach taken for state space models. 2.2 Maximum Likelihood Estimation Maximum likelihood estimation in GLMMs is a challenging task because it requires the calculation of integrals ( often high dimensional) that have no known analytic solution Following the general notation of a GLMM in Section 2 1 let Y i = (Y i 1 ... Y i nJ be the vector of all observations in cluster i whose associated random effects vector is u i Conditional independence of the Y i t s implies that the density function of Y i is given by n ; !(Y i I u i ;/3) = fIJ(Y i t I u i; /3), (2 8) t=l where f (Y i t I u i; /3) are the exponential densities in (2.1). The parameter /3 is the vector of all unknown regression coefficients introduced by specifying model (2.2) for the mean of the observations. Furthermore observations from different clusters
Furthermore, observations from different clusters are assumed conditionally independent, leading to the conditional joint density
$$f(y_1, \ldots, y_n \mid u_1, \ldots, u_n; \beta) = \prod_{i=1}^{n} f(y_i \mid u_i; \beta)$$
for all observations given all random effects. Let $g(u_1, \ldots, u_n; \psi)$ denote the multivariate normal density function of the random effects, whose variance-covariance matrix $\Sigma$ is determined by the variance component vector $\psi$. The goal is to estimate the unknown parameter vectors $\beta$ and $\psi$ by maximum likelihood. The likelihood function $L(\beta, \psi; y_1, \ldots, y_n)$ for a GLMM is given by the marginal density function of the observations $y_1, \ldots, y_n$, viewed as a function of the parameters, and is equal to
$$L(\beta, \psi; y_1, \ldots, y_n) = \int f(y_1, \ldots, y_n \mid u_1, \ldots, u_n; \beta)\, g(u_1, \ldots, u_n; \psi)\, du_1 \cdots du_n$$
$$= \int \prod_{i=1}^{n} f(y_i \mid u_i; \beta)\, g(u_1, \ldots, u_n; \psi)\, du_1 \cdots du_n$$
$$= \int \prod_{i=1}^{n} \prod_{t=1}^{n_i} f(y_{it} \mid u_i; \beta)\, g(u_1, \ldots, u_n; \psi)\, du_1 \cdots du_n. \qquad (2.9)$$
It is called the observed likelihood because the unobserved random effects have been integrated out, and (2.9) is a function of the observed data only. Except for the linear mixed model, where $f(y_1, \ldots, y_n \mid u_1, \ldots, u_n)$ is a normal density, the integral has no closed-form solution, and numerical procedures (analytic or stochastic) are necessary to calculate and maximize it. Standard maximization techniques, such as Newton-Raphson or EM for fitting GLMs and linear mixed models, have to be modified, because the conditional distribution of the observations and the distribution of the random effects are not conjugate and the integral is analytically intractable.
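Because (2.9) has no closed form, even evaluating the likelihood at a single parameter value requires approximation. A naive Monte Carlo estimate simply replaces the integral by an average over draws from the random effects distribution; a minimal sketch for a random-intercept Poisson model with hypothetical data:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
y = np.array([3, 5, 2, 4])           # hypothetical counts from one cluster
beta0, sigma = 1.0, 0.5              # candidate parameter values

m = 100_000
u = rng.normal(0.0, sigma, size=m)                # draws from g(u; psi)
mu = np.exp(beta0 + u[:, None])                   # conditional means, m x n_i
lik = poisson.pmf(y, mu).prod(axis=1).mean()      # average of f(y | u^(j))
print(np.log(lik))                                # MC estimate of the log-likelihood
```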
2.2.1 Direct and Indirect Maximum Likelihood Procedures

In general, there are two ways to obtain maximum likelihood estimates from the marginal likelihood in (2.9). The first one is a direct approach, which attempts to approximate the integral by either analytic or stochastic methods and then maximizes this approximation with respect to the parameters $\beta$ and $\psi$. Some common analytic approximation methods are Gauss-Hermite quadrature (Abramowitz and Stegun, 1964), a first-order Taylor series expansion of the integrand, or a Laplace approximation (Tierney and Kadane, 1986), which is based on a second-order Taylor series expansion. The two latter methods result in likelihood equations similar to those of a linear mixed model (Breslow and Clayton, 1993; Wolfinger and O'Connell, 1993), and by iteratively fitting such a model and re-expanding the integrand around updated parameter estimates, one can obtain approximate maximum likelihood estimates. However, these methods have been shown to yield estimates which can be biased and inconsistent, an issue which is discussed in Lin and Breslow (1996) and Breslow and Lin (1995).
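To illustrate the quadrature approach, the following minimal sketch approximates the likelihood contribution of the same hypothetical cluster as above, $\int \prod_t f(y_t \mid u)\, N(u; 0, \sigma^2)\, du$, by Gauss-Hermite quadrature:

```python
import numpy as np
from scipy.stats import poisson

y = np.array([3, 5, 2, 4])           # the same hypothetical cluster as above
beta0, sigma = 1.0, 0.5

# Gauss-Hermite: int e^{-x^2} h(x) dx ~ sum_k w_k h(x_k); substituting
# u = sqrt(2) sigma x turns the N(0, sigma^2) integral into this form,
# up to a factor 1/sqrt(pi)
x, w = np.polynomial.hermite.hermgauss(30)
u = np.sqrt(2.0) * sigma * x
f = poisson.pmf(y[None, :], np.exp(beta0 + u[:, None])).prod(axis=1)
lik = (w * f).sum() / np.sqrt(np.pi)
print(np.log(lik))
```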
Techniques using stochastic integral approximations are known under the name simulated maximum likelihood and have been proposed by Geyer and Thompson (1992) and Gelfand and Carlin (1993). These methods approximate the integral in (2.9) by importance sampling (Robert and Casella, 1999) and are better suited for higher-dimensional integrals than analytic approximations. Usually, the importance density depends on the parameters to be estimated, and so simulated maximum likelihood is used iteratively, by first approximating the integral by a Monte Carlo sum with some initial values for the unknown parameters. Then the likelihood is maximized, and the resulting parameters are used to generate a new sample from the importance density in the next iteration. We will briefly discuss the idea behind importance sampling in Section 2.3.2. The simulated maximum likelihood approach is also further illustrated in Section 2.2.3, where we discuss it in the context of state space models.

An alternative to the direct approximating methods is the EM algorithm (Dempster et al., 1977). The integral in (2.9) is not directly maximized in this method, but is maximized indirectly by considering a related function $Q(\cdot \mid \cdot)$. At each step of this algorithm, maximization of the $Q$-function increases the marginal likelihood, a fact that can be verified using Jensen's inequality. The EM algorithm relies on recognizing or inventing missing data which, together with the observed data, simplify maximum likelihood calculations. For GLMMs, the random effects $u_1, \ldots, u_n$ are treated as the missing data. In particular, let $\beta^{(k-1)}$ and $\psi^{(k-1)}$ denote the current (at the end of iteration $k-1$) values of the parameter vectors $\beta$ and $\psi$. Also, let $y' = (y_1', \ldots, y_n')$ and $u' = (u_1', \ldots, u_n')$ denote the vectors of all observations and their associated random effects. Then the $Q(\cdot \mid \cdot)$ function at the start of iteration $k$ has form
$$Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) = E\left[\log f(y, u; \beta, \psi) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]$$
$$= E\left[\log f(y \mid u; \beta) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] + E\left[\log g(u; \psi) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right], \qquad (2.10)$$
where $f(y, u; \beta, \psi) = f(y \mid u; \beta)\, g(u; \psi)$ denotes the joint density of the observed and missing data, also known as the complete data. The expectation in (2.10) is with respect to the conditional distribution of $u \mid y$, evaluated at the current parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)}$. The calculation of the expected value is called the E-step, and it is followed by an M-step, which maximizes $Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)})$ with respect to $\beta$ and $\psi$. The resulting estimates $\beta^{(k)}$ and $\psi^{(k)}$ are used as updates in the next iteration to re-calculate the E-step and the M-step. Since $(\beta^{(k)}, \psi^{(k)})$ is the maximizer at iteration $k$,
$$Q(\beta^{(k)}, \psi^{(k)} \mid \beta^{(k-1)}, \psi^{(k-1)}) \geq Q(\beta^{(k-1)}, \psi^{(k-1)} \mid \beta^{(k-1)}, \psi^{(k-1)}), \qquad (2.11)$$
and it follows that the likelihood increases (or at worst stays the same) from one iteration to the next:
$$\log L(\beta^{(k)}, \psi^{(k)}; y) = Q(\beta^{(k)}, \psi^{(k)} \mid \beta^{(k-1)}, \psi^{(k-1)}) - E\left[\log h(u \mid y; \beta^{(k)}, \psi^{(k)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]$$
$$\geq Q(\beta^{(k-1)}, \psi^{(k-1)} \mid \beta^{(k-1)}, \psi^{(k-1)}) - E\left[\log h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]$$
$$= \log L(\beta^{(k-1)}, \psi^{(k-1)}; y).$$
Here we used (2.11) and the fact that
$$E\left[\log h(u \mid y; \beta^{(k)}, \psi^{(k)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] - E\left[\log h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]$$
$$= E\left[\log\left( h(u \mid y; \beta^{(k)}, \psi^{(k)}) \big/ h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \right) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]$$
$$\leq \log\left( E\left[ h(u \mid y; \beta^{(k)}, \psi^{(k)}) \big/ h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]\right) = 0,$$
where the inequality in the last step derives from Jensen's inequality. Under regularity conditions (Wu, 1983) and for some initial starting values $(\beta^{(0)}, \psi^{(0)})$, the sequence of estimates $\{(\beta^{(k)}, \psi^{(k)})\}$ converges to the maximum likelihood estimators $(\hat\beta, \hat\psi)$. The EM algorithm is most useful if replacing the calculation of the integral in the marginal likelihood (2.9) by the calculation of the integral in the $Q$-function (2.10) simplifies computation and maximization. Unfortunately, for GLMMs the integrals in (2.10) are also intractable, since the conditional density of $u \mid y$ involves the integral in (2.9). However, the EM algorithm may still be used by approximating the expectation in the E-step with appropriate Monte Carlo methods. The resulting algorithm is called the Monte Carlo EM algorithm (MCEM) and was proposed by Wei and Tanner (1990). We review it in detail in Section 2.3.
Some arguments favoring the use of the MCEM algorithm over direct methods such as simulated maximum likelihood for fitting GLMMs, especially when some variance components in $\psi$ are large, are given in Jank and Booth (2003) and Booth et al. (2001). Currently, the only available software for fitting GLMMs uses direct methods such as Gauss-Hermite quadrature or simulated maximum likelihood (e.g., SAS's proc nlmixed). State space models of Section 2.1.2 are also fitted via simulated maximum likelihood. This is discussed in Section 2.2.3.

2.2.2 Model Fitting in a Bayesian Framework

In a Bayesian context, GLMMs are two-stage hierarchical models with appropriate priors on $\beta$ and $\psi$. Instead of obtaining maximum likelihood estimates of unknown parameters, a Bayesian analysis looks at their entire posterior distributions given the observed data. Markov chain Monte Carlo techniques avoid the tedious integrations in the posterior densities and allow for relatively easy simulation from these distributions, compared with the problems encountered in maximum likelihood estimation. This suggests approximating maximum likelihood estimates via a Bayesian route, assuming improper or at least very diffuse priors and exploiting the proportionality of the likelihood function and the posterior distribution of the parameters. However, for many discrete data models, improper priors may lead to improper posteriors. Natarajan and McCulloch (1995) demonstrate this with GLMMs for correlated binary data, assuming independent $N(0, \sigma^2)$ random effects and a flat or non-informative prior for $\sigma^2$. Sun, Tsutakawa and Speckman (1999) and Sun, Speckman and Tsutakawa (2000) show that with noninformative (flat) priors on fixed effects and variance components of more complicated random effects distributions, propriety of the posterior distribution cannot be guaranteed for a Poisson GLMM when one of the observed counts is zero, and is impossible in a logit-link GLMM for binomial$(n, \pi)$ observations if just one of the observations is equal to $0$ or $n$. Of course, the use of proper priors will always lead to proper posteriors.
However, for often-employed diffuse but proper priors, Natarajan and McCulloch (1998) show that even with enormous simulation sizes, posterior estimates (such as the posterior mode) can be far away from maximum likelihood estimates, which makes their use undesirable in a frequentist setting.

2.2.3 Maximum Likelihood Estimation for State Space Models

The same problems as in the GLMM case arise in maximum likelihood fitting of non-Gaussian state space models. Here we review a simulated maximum likelihood approach suggested by Durbin and Koopman (1997), using notation introduced in Section 2.1.2. Let $p(y \mid \alpha; \psi) = \prod_t p(y_t \mid \alpha_t; \psi)$ denote the distribution of all observations given the states, and let $p(\alpha; \psi)$ denote the distribution of the states, where $y$ and $\alpha$ are the stacked vectors of all observations and all states, respectively. The vector $\psi$ holds parameters that may appear in $w_t$, $T_t$ and $Q_t$. Let $p(y, \alpha; \psi)$ denote the joint density of observations and states. For practical purposes, it is easier to work with the signal $\theta_t$ instead of the high-dimensional state vector $\alpha_t$. Hence, let $p(y \mid \theta; \psi)$, $p(\theta; \psi)$ and $p(y, \theta; \psi)$ denote the corresponding conditional, marginal and joint distributions, parameterized in terms of the signal $\theta_t = w_t'\alpha_t$, $t = 1, \ldots, T$, where $\theta = (\theta_1, \ldots, \theta_T)'$. The observed likelihood is then given by the integral
$$L(\psi; y) = \int p(y \mid \theta; \psi)\, p(\theta; \psi)\, d\theta. \qquad (2.12)$$
To maximize (2.12) with respect to $\psi$, Durbin and Koopman (1997, 2000) first calculate the likelihood $L_g(\psi; y)$ of an approximating Gaussian model and then obtain the true likelihood $L(\psi; y)$ by an adjustment to it. However, two different approaches for constructing the approximating Gaussian model are presented in the two papers. In Durbin and Koopman (1997), the approximating model is obtained by assuming that the observations follow a linear Gaussian model $y_t = \theta_t + \epsilon_t$ with $\epsilon_t \sim N(\mu_t, \sigma_t^2)$.
All densities generated under this model are denoted by $g(\cdot)$. The two parameters $\mu_t$ and $\sigma_t^2$ are chosen such that the true density $p(y \mid \theta; \psi)$ and its normal approximation $g(y \mid \theta; \psi)$ are as close as possible in the neighborhood of the posterior mean $E_g[\theta \mid y]$. The state equations of the true non-Gaussian model and the Gaussian approximating model are assumed to be the same, which implies that the marginal density of $\theta$ is the same under both models, i.e., $p(\theta; \psi) = g(\theta; \psi)$. The likelihood of the approximating model is then given by
$$L_g(\psi; y) = g(y; \psi) = \frac{g(y, \theta; \psi)}{g(\theta \mid y; \psi)} = \frac{g(y \mid \theta; \psi)\, p(\theta; \psi)}{g(\theta \mid y; \psi)}. \qquad (2.13)$$
This likelihood is calculated using a recursive procedure known as the Kalman filter (see, for instance, Fahrmeir and Tutz, 2001, Chap. 8). Alternatively, the approximating Gaussian model is a regular linear mixed model, and maximum likelihood calculations can be carried out using more familiar algorithms from the linear mixed model literature (see, for instance, Verbeke and Molenberghs, 2000). From (2.13),
$$p(\theta; \psi) = \frac{L_g(\psi; y)\, g(\theta \mid y; \psi)}{g(y \mid \theta; \psi)},$$
and upon plugging this into (2.12),
$$L(\psi; y) = L_g(\psi; y)\, E_g\!\left[\frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\right], \qquad (2.14)$$
where $E_g$ denotes expectation with respect to the Gaussian density $g(\theta \mid y; \psi)$ generated by the approximating model. Hence, the observed likelihood of the non-Gaussian model can be estimated by the likelihood of an approximating Gaussian model and an adjustment factor; in particular, $\hat L(\psi; y) = L_g(\psi; y)\, w(\psi)$, where
$$w(\psi) = \frac{1}{m}\sum_{i=1}^{m} \frac{p(y \mid \theta^{(i)}; \psi)}{g(y \mid \theta^{(i)}; \psi)}$$
is a Monte Carlo sum approximating the expected value $E_g$, with $m$ random samples $\theta^{(i)}$ drawn from $g(\theta \mid y; \psi)$. Normality of $g(\theta \mid y; \psi)$ allows for straightforward simulation from this density.
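The Kalman filter evaluation of $L_g$ is a short recursion. As a minimal illustration (for the simplest case, a Gaussian local-level model $y_t = \theta_t + \epsilon_t$, $\theta_{t+1} = \theta_t + \eta_t$ with known variances and a diffuse initial state; all inputs are hypothetical, and general state space models need the matrix form of the filter):

```python
import numpy as np

def local_level_loglik(y, sigma_eps2, sigma_eta2, a1=0.0, p1=1e7):
    """Kalman filter log-likelihood for y_t = theta_t + eps_t,
    theta_{t+1} = theta_t + eta_t, with a diffuse initial state."""
    a, p, ll = a1, p1, 0.0
    for yt in y:
        f = p + sigma_eps2              # prediction variance of y_t
        v = yt - a                      # one-step prediction error
        ll += -0.5 * (np.log(2.0 * np.pi * f) + v * v / f)
        k = p / f                       # Kalman gain
        a = a + k * v                   # updated state mean
        p = p * (1.0 - k) + sigma_eta2  # predicted state variance for t + 1
    return ll

print(local_level_loglik(np.array([1.1, 0.9, 1.4, 1.2]), 0.5, 0.1))
```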
A different approach for choosing the approximating Gaussian model is presented in Durbin and Koopman (2000). There, the model is determined by choosing $\sigma_t^2$ and $Q_t$ of an approximating Gaussian state space model (2.6) such that the posterior densities $g(\theta \mid y; \psi)$ implied by the Gaussian model and $p(\theta \mid y; \psi)$ implied by the true model have the same posterior mode $\hat\theta$. Formally, by dividing and multiplying (2.12) by the importance density $g(\theta \mid y; \psi)$, we can interpret approximation (2.14) as an importance sampling estimate of the observed likelihood, and the entire procedure as a simulated maximum likelihood approach:
$$L(\psi; y) = \int p(y \mid \theta; \psi)\, \frac{p(\theta; \psi)}{g(\theta \mid y; \psi)}\, g(\theta \mid y; \psi)\, d\theta = \int g(y; \psi)\, \frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\, g(\theta \mid y; \psi)\, d\theta = L_g(\psi; y)\, E_g\!\left[\frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\right],$$
using $p(\theta; \psi) = g(\theta; \psi)$ and $g(\theta; \psi)/g(\theta \mid y; \psi) = g(y; \psi)/g(y \mid \theta; \psi)$ in the middle step. Durbin and Koopman (1997, 2000) present a clever way of artificially enlarging the simulated sample of $\theta^{(i)}$'s from the importance density $g(\theta \mid y; \psi)$ by the use of antithetic variables (Robert and Casella, 1999). These quadruple the sample size without additional simulation effort and balance the sample for location and scale. Overall, this leads to a reduction in the total sample size necessary to achieve a certain precision in the estimates.
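A minimal sketch of the location-balancing step (scale antithetics, which complete the quadrupling, are omitted; the posterior mean and a Cholesky factor of the posterior covariance are assumed to be available from the Kalman filter and smoother, and the values below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

def location_antithetics(theta_hat, chol, m):
    """Draw m signals from g(theta | y) = N(theta_hat, C C') and pair each
    with its reflection 2 theta_hat - theta, which has the same density;
    the sample size doubles at no extra simulation cost."""
    z = rng.standard_normal((m, theta_hat.shape[0]))
    theta = theta_hat + z @ chol.T
    return np.vstack([theta, 2.0 * theta_hat - theta])

theta_hat = np.zeros(5)                           # placeholder posterior mean
chol = np.linalg.cholesky(0.3 * np.eye(5) + 0.1)  # placeholder covariance factor
draws = location_antithetics(theta_hat, chol, m=1000)
```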
In practice, it is desirable in the maximization process to work with $\log L(\psi; y)$. Durbin and Koopman (1997, 2000) present a correction for the bias introduced by estimating $\log\left(E_g\left[p(y \mid \theta; \psi)/g(y \mid \theta; \psi)\right]\right)$ by the logarithm of a Monte Carlo average. Finally, the resulting estimator of $\log L(\psi; y)$ can be maximized with respect to $\psi$ by a suitable numerical procedure, such as Newton-Raphson.

We mentioned before that simulated maximum likelihood can be computationally inefficient and suboptimal, especially when some variance components are large (Jank and Booth, 2003). As we will see in various examples in Chapter 5, large variance components (e.g., a large random effects variance) are the norm rather than the exception with the type of time series models we consider. Next, we will look at an alternative, indirect method for fitting our models. In principle, though, the methods just described are also applicable to GLMMs, through the close connections of GLMMs and state space models described above.

2.3 The Monte Carlo EM Algorithm

In Section 2.2 we presented the EM algorithm as an iterative procedure consisting of two components, the E- and the M-step. The E-step calculates a conditional expectation, while the M-step subsequently maximizes this expectation. Often at least one of these steps is analytically intractable, and in most of the applications considered here, both steps are. Numerical methods (analytic and stochastic) have to be used to overcome these difficulties, whereby the E-step usually is the more troublesome. One popular way of approximating the expected value in the E-step uses Monte Carlo methods and is discussed in Wei and Tanner (1990), McCulloch (1994, 1997) and Booth and Hobert (1999). The Monte Carlo EM (MCEM) algorithm uses a sample from the distribution of the random effects $u$ given the observed data $y$ to approximate the $Q$-function in (2.10). In particular, at iteration $k$, let $u^{(1)}, \ldots, u^{(m)}$ be a sample from this distribution, denoted by $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$ and evaluated at the parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)}$ from the previous iteration.
The approximation to (2.10) is then given by
$$Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) = \frac{1}{m}\sum_{j=1}^{m} \log f(y \mid u^{(j)}; \beta) + \frac{1}{m}\sum_{j=1}^{m} \log g(u^{(j)}; \psi). \qquad (2.15)$$
As $m \to \infty$, $Q_m \to Q$ with probability one. The M-step then maximizes $Q_m$ instead of $Q$ with respect to $\beta$ and $\psi$, and the resulting estimates $\beta^{(k)}$ and $\psi^{(k)}$ are used in the next iteration to generate a new sample from $h(u \mid y; \beta^{(k)}, \psi^{(k)})$. If maximization is not possible in closed form, sometimes only a pair of values $(\beta, \psi)$ which satisfies
$$Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) \geq Q_m(\beta^{(k-1)}, \psi^{(k-1)} \mid \beta^{(k-1)}, \psi^{(k-1)}),$$
but which does not attain the global maximum, is chosen as the new parameter update $(\beta^{(k)}, \psi^{(k)})$. However, we show for our models that the global maximum can be approximated in very few steps. Maximization of $Q_m$ with respect to $\beta$ and $\psi$ is equivalent to maximizing the first term in (2.15) with respect to $\beta$ only and the second term with respect to $\psi$ only. This is due to the two-stage hierarchy of the response distribution and the random effects distribution in GLMMs, and is discussed next. Different approaches to obtaining a sample from $h(u \mid y; \beta, \psi)$ for the approximation of the E-step are presented in Section 2.3.2, and convergence criteria are discussed in Section 2.3.3.

2.3.1 Maximization of Qm

For now, we assume we have available a sample $u^{(1)}, \ldots, u^{(m)}$ from $h(u \mid y; \beta, \psi)$, or from an importance sampling distribution, generated by one of the mechanisms described in Section 2.3.2. Let $Q_m^1$ and $Q_m^2$ be the first and second term of the sum in (2.15). Using the exponential family expression for the densities $f(y_{it} \mid u_i)$, at iteration $k$,
$$Q_m^1(\beta \mid \beta^{(k-1)}) \propto \frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{n}\sum_{t=1}^{n_i} \left[ y_{it}\, \theta_{it}^{(j)} - b\big(\theta_{it}^{(j)}\big) \right], \qquad (2.16)$$
where, according to the GLMM specifications, the canonical parameter $\theta_{it}^{(j)}$ is determined by the linear predictor $\eta_{it}^{(j)} = x_{it}'\beta + z_{it}'u_i^{(j)}$, with $u_i^{(j)}$ the $i$-th component of the $j$-th sampled random effects vector $u^{(j)}$. Maximizing $Q_m^1$ with respect to $\beta$ is equivalent to fitting an augmented GLM with known offsets: for $j = 1, \ldots, m$, let $y_{it}^{(j)} = y_{it}$ and $x_{it}^{(j)} = x_{it}$ be the random components and known design vectors for this augmented GLM, and let $z_{it}'u_i^{(j)}$ be a known offset associated with each $y_{it}^{(j)}$. That is, we duplicate the original data set $m$ times and attach a known offset $z_{it}'u_i^{(j)}$ to each replicated observation. The model for the mean in the augmented GLM,
$$E\big[Y_{it}^{(j)}\big] = \mu_{it}^{(j)} = h^{-1}\big(x_{it}'\beta + z_{it}'u_i^{(j)}\big),$$
is structurally equivalent to the model for the mean in the GLMM. Then the log-likelihood equations for estimating $\beta$ in the augmented GLM are proportional to $Q_m^1$. Hence, maximization of $Q_m^1$ with respect to $\beta$ follows along the lines of the well-known iterative Newton-Raphson or Fisher scoring algorithms for GLMs. Denote by $\beta^{(k)}$ the parameter vector after convergence of one of these algorithms. It represents the value of the maximum likelihood estimator of $\beta$ at iteration $k$ of the MCEM algorithm.
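A minimal sketch of this augmented-GLM update (shown for a Poisson GLMM, using statsmodels for the GLM fit; the data, design matrices and random effects draws are placeholders to be supplied by the surrounding MCEM loop):

```python
import numpy as np
import statsmodels.api as sm

def update_beta(y, X, Z, u_samples):
    """M-step for beta: replicate the data once per draw u^(j) and fit a
    single GLM in which z_it' u^(j) enters as a known offset."""
    m = u_samples.shape[0]
    y_aug = np.tile(y, m)                    # y_it^(j) = y_it
    X_aug = np.tile(X, (m, 1))               # x_it^(j) = x_it
    offsets = (u_samples @ Z.T).ravel()      # z_it' u^(j), one per replicate
    fit = sm.GLM(y_aug, X_aug, family=sm.families.Poisson(),
                 offset=offsets).fit()
    return fit.params
```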
The expression for $Q_m^2$ depends on the assumed random effects distribution. Most generally, let $\Sigma$ be an unstructured $q \times q$ covariance matrix for the random effects vector $u = (u_1', \ldots, u_n')'$, where $q = \sum_{i=1}^{n} q_i$ and $q_i$ is the dimension of the cluster-specific random effect $u_i$. Then, assuming $u$ has a mean-zero multivariate normal distribution $g(u; \psi)$, where $\psi$ holds the $q(q+1)/2$ distinct elements of $\Sigma$, $Q_m^2$ has form
$$Q_m^2 \propto \sum_{j=1}^{m} \left[ -\log|\Sigma| - u^{(j)\prime}\, \Sigma^{-1} u^{(j)} \right].$$
The goal is to maximize $Q_m^2$ with respect to the variance components $\psi$ of $\Sigma$. For a general $\Sigma$, the maximum is attained at the variance components of the sample covariance matrix $S_m = \frac{1}{m}\sum_{j=1}^{m} u^{(j)} u^{(j)\prime}$. Denoting these by $\psi^{(k)}$ gives the value of the maximum likelihood estimator of $\psi$ at iteration $k$ of the MCEM algorithm. The simplest structure occurs when the random effects $u_i$ have independent components and are i.i.d. across all clusters, so that $g(u; \psi)$ is the product of $n$ $N(0, \sigma^2 I)$ densities and $\psi = \sigma$. $Q_m^2$ at iteration $k$ is then maximized at $\sigma^{(k)} = \left( \frac{1}{mq}\sum_{j=1}^{m} u^{(j)\prime} u^{(j)} \right)^{1/2}$. Many applications of GLMMs use this simple structure of i.i.d. random effects, where often $u_i$ is a univariate random intercept. In this case the estimate of $\sigma$ at iteration $k$ reduces to $\sigma^{(k)} = \left( \frac{1}{mn}\sum_{j=1}^{m}\sum_{i=1}^{n} u_i^{(j)2} \right)^{1/2}$. In Chapter 3 we will drop the assumption of independence and look at correlated random effects, but with more parsimonious covariance structures than the most general case presented here. Maximization of $Q_m^2$ with respect to $\psi$ will be presented there on a case-by-case basis.
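In code, these closed-form M-step updates are one-liners; a minimal numpy sketch, with u_samples an $m \times q$ array of E-step draws:

```python
import numpy as np

def update_sigma_iid(u_samples):
    """sigma^(k) for i.i.d. N(0, sigma^2) random effects; u_samples is m x q."""
    m, q = u_samples.shape
    return np.sqrt((u_samples ** 2).sum() / (m * q))

def update_Sigma_unstructured(u_samples):
    """S_m = (1/m) sum_j u^(j) u^(j)' for an unstructured covariance."""
    return u_samples.T @ u_samples / u_samples.shape[0]
```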
2.3.2 Generating Samples from h(u | y; β, ψ)

So far we assumed we had available a sample $u^{(1)}, \ldots, u^{(m)}$ with which to approximate the expected value in the E-step of the MCEM algorithm. This section describes how to generate such a sample from $h(u \mid y; \beta, \psi)$, which is known only up to a normalizing constant, or from an importance density. In the following we will suppress the dependency on the parameters $\beta$ and $\psi$, since the densities are always evaluated at their current values. Three methods are presented: the accept-reject algorithm produces independent samples, while Metropolis-Hastings algorithms produce dependent samples. A detailed description of all three methods can be found in Robert and Casella (1999).

2.3.2.1 Accept-reject sampling in GLMMs

In general, for accept-reject sampling we need to find a candidate density $g$ and a constant $M$ such that for the density of interest $h$ (the target density), $h(x) \leq M g(x)$ holds for all $x$ in the support of $h$. The algorithm is then to

1. generate $x \sim g$ and $w \sim \mathrm{Uniform}[0, 1]$;
2. accept $x$ as a random sample from $h$ if $w \leq \frac{h(x)}{M g(x)}$;
3. return to 1 otherwise.

This produces one random sample $x$ from the target density $h$. The probability of acceptance is $1/M$, and the expected number of trials until a variable is accepted is $M$. For our purpose the target density is $h(u \mid y)$. Since
$$h(u \mid y) = \frac{f(y \mid u)\, g(u)}{a} \leq M g(u),$$
where $M = \sup_u f(y \mid u)/a$ and $a$ is an unknown normalizing constant equal to the marginal likelihood, the multivariate normal random effects distribution $g(u)$ can be used as the candidate density. Booth and Hobert (1999, Sect. 4.1) show that for certain models $\sup_u f(y \mid u)$ can easily be calculated from the data alone, and thus need not be updated at every iteration. For some models we discuss here, the condition of Booth and Hobert (1999, Sect. 4.1, page 272) required for this simplification does not hold. However, the likelihood of a saturated GLM is always an upper bound for $f(y \mid u)$. To illustrate, regard $L(u) = f(y \mid u)$ as the likelihood corresponding to a GLM with random components $y_{it}$ and linear predictor $\eta_{it} = z_{it}'u_i + x_{it}'\beta$, where now $x_{it}'\beta$ plays the role of a known offset and the $u_i$ are the parameters of interest. The maximized likelihood $L(\hat u)$ for this model is always less than the maximized likelihood $L(y)$ of a saturated model. Hence, $\sup_u f(y \mid u) \leq L(\hat u) \leq L(y)$, and $L(\hat u)$ or $L(y)$ can be used to construct $M$.

Example: In Section 3.1 we consider a data set where, conditional on a random effect $u_t$, the $t$-th observation in group $i$, $Y_{it}$, is modeled as a Binomial$(n_{it}, \pi_{it})$ random variable. There are 16 time points, i.e., $t = 1, \ldots, 16$, and two groups, $i = 1, 2$. A very simple logistic-normal GLMM for these data has form $\mathrm{logit}(\pi_{it}(u_t)) = \alpha + \beta x_i + u_t$, where $x_i$ is a binary group indicator.
The overall design matrix for this problem is the $32 \times 18$ matrix
$$\begin{pmatrix} 1_{16} & 0_{16} & I_{16} \\ 1_{16} & 1_{16} & I_{16} \end{pmatrix},$$
where the columns hold the coefficients corresponding to $\alpha$, $\beta$, $u_1, u_2, \ldots, u_{16}$. All rows of this matrix are different and, as a consequence, the condition of Booth and Hobert (1999, Sect. 4.1, page 272) does not hold. However, the saturated binomial likelihood $L(y)$ is an upper bound for $f(y \mid u)$, i.e., $\sup_u f(y \mid u) \leq L(y)$. For instance, in the logistic-normal example from above, with linear predictor $\eta_{it} = c + u_t$, where $c = \alpha + \beta x_i$ represents the fixed part of the model, we have
$$f(y_{it} \mid u_t) = \binom{n_{it}}{y_{it}} \left(\frac{e^{c + u_t}}{1 + e^{c + u_t}}\right)^{y_{it}} \left(\frac{1}{1 + e^{c + u_t}}\right)^{n_{it} - y_{it}}.$$
By first taking logs and then finding first and second derivatives with respect to $u_t$, we see that $u_t^* = \log\left(\frac{y_{it}/n_{it}}{1 - y_{it}/n_{it}}\right) - c$ maximizes this expression for $0 < y_{it} < n_{it}$. Plugging in, we obtain the result
$$\sup_{u_t} f(y_{it} \mid u_t) = \binom{n_{it}}{y_{it}} \left(\frac{y_{it}}{n_{it}}\right)^{y_{it}} \left(1 - \frac{y_{it}}{n_{it}}\right)^{n_{it} - y_{it}}.$$
For the special cases $y_{it} = 0$ and $y_{it} = n_{it}$, the trivial bound on $f(y_{it} \mid u_t)$ is 1. Hence, the following inequality, which follows immediately from the above, can be used in constructing the accept-reject algorithm for a logistic-normal model with linear predictor of the form $\eta_{it} = c + u_t$:
$$\sup_u f(y \mid u) \leq \sup_u \prod_{i=1}^{2}\prod_{t=1}^{16} \binom{n_{it}}{y_{it}}\left(\frac{e^{c + u_t}}{1 + e^{c + u_t}}\right)^{y_{it}}\left(\frac{1}{1 + e^{c + u_t}}\right)^{n_{it} - y_{it}} \leq \prod_{i=1}^{2}\prod_{t=1}^{16} \binom{n_{it}}{y_{it}}\left(\frac{y_{it}}{n_{it}}\right)^{y_{it}}\left(1 - \frac{y_{it}}{n_{it}}\right)^{n_{it} - y_{it}} = L(y).$$
This means we can select $M = L(y)/a$ to meet the accept-reject condition, and consequently we accept a sample $u$ from $g(u)$ if, for $w \sim \mathrm{Uniform}[0, 1]$,
$$w \leq \frac{h(u \mid y)}{M g(u)} = \frac{f(y \mid u)}{L(y)}.$$
Notice that this condition is free of the normalizing constant $a$. In practice, especially for high-dimensional random effects, $M$ can be very large, and therefore we almost never accept a sample. Two alternative methods, described below, may avoid this problem. Note, however, that the accept-reject method yields an independent and identically distributed sample from the target distribution. This is important if one wants to implement an automated MCEM algorithm (Booth and Hobert, 1999), where the Monte Carlo sample size $m$ is increased automatically as the algorithm progresses, to adjust for the error in the Monte Carlo approximation to the E-step.
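A minimal sketch of this sampler for the logistic-normal example (collapsing to a single group with fixed part c for brevity; all inputs are hypothetical):

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(3)

def accept_reject_draw(y, n, c, sigma):
    """One accept-reject draw of u = (u_1, ..., u_T) for the model
    logit(pi_t) = c + u_t, with the bound built from the saturated
    binomial likelihood L(y)."""
    p_sat = np.clip(y / n, 1e-12, 1 - 1e-12)      # saturated fitted values
    log_Ly = binom.logpmf(y, n, p_sat).sum()      # log L(y)
    while True:
        u = rng.normal(0.0, sigma, size=len(y))   # candidate from g(u)
        pi = 1.0 / (1.0 + np.exp(-(c + u)))
        log_f = binom.logpmf(y, n, pi).sum()      # log f(y | u)
        if np.log(rng.uniform()) <= log_f - log_Ly:
            return u
```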
2.3.2.2 Markov chain Monte Carlo methods

For high-dimensional distributions $h(u \mid y)$, which are unavoidable if correlated random effects are used, accept-reject methods can be very slow. An alternative is to generate a Markov chain with invariant distribution $h(u \mid y)$, which may be much faster but results in dependent samples. McCulloch (1997) discussed a Metropolis-Hastings algorithm for creating such a chain for the logistic-normal regression case. In general, an independence Metropolis-Hastings algorithm is built as follows. Choose a candidate density $g(u)$ with the same support as $h(u \mid y)$. Then, for a current state $u^{(j-1)}$,

1. generate $w \sim g$;
2. set $u^{(j)}$ equal to $w$ with probability
$$p = \min\left(1,\ \frac{h(w \mid y)/h(u^{(j-1)} \mid y)}{g(w)/g(u^{(j-1)})}\right)$$
and equal to $u^{(j-1)}$ with probability $1 - p$.

After a sufficient burn-in time, the states of the generated chain can be regarded as a (dependent) sample from $h(u \mid y)$. If the candidate density $g(u)$ is chosen to be the density of the random effects, the acceptance probability in step 2 reduces to the simple form $\min\left(1,\ f(y \mid w)/f(y \mid u^{(j-1)})\right)$. To further speed up simulations, McCulloch (1997) uses a random scan algorithm, which only updates the $k$-th component of the previous state $u^{(j-1)}$ and, upon acceptance in step 2, uses it as the new state.

Another popular MCMC algorithm is the Gibbs sampler. Let $u^{(j-1)} = (u_1^{(j-1)}, \ldots, u_n^{(j-1)})$ denote the current state of a Markov chain with invariant distribution $h(u \mid y)$. One iteration of the Gibbs sampler generates, componentwise,
$$u_1^{(j)} \sim h(u_1 \mid u_2^{(j-1)}, \ldots, u_n^{(j-1)}, y)$$
$$u_2^{(j)} \sim h(u_2 \mid u_1^{(j)}, u_3^{(j-1)}, \ldots, u_n^{(j-1)}, y)$$
$$\vdots$$
$$u_n^{(j)} \sim h(u_n \mid u_1^{(j)}, \ldots, u_{n-1}^{(j)}, y),$$
where the $h(u_i \mid u_1^{(j)}, \ldots, u_{i-1}^{(j)}, u_{i+1}^{(j-1)}, \ldots, u_n^{(j-1)}, y)$ are the so-called full conditionals of $h(u \mid y)$. The vector $u^{(j)} = (u_1^{(j)}, \ldots, u_n^{(j)})$ represents the new state of the chain and, after a sufficient burn-in time, can be regarded as a sample from $h(u \mid y)$. The advantage of the Gibbs sampler is that it reduces sampling of a possibly very high-dimensional vector $u$ to sampling of several lower-dimensional components of $u$. We will use the Gibbs sampler in connection with autoregressive random effects, to simplify sampling from an initially very high-dimensional distribution $h(u \mid y)$ by sampling from its simpler full univariate conditionals.
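A minimal sketch of the independence chain with the random effects density as candidate (the same toy logistic-normal setting as above; burn-in handling is left to the caller):

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(4)

def independence_mh(y, n, c, sigma, n_iter=5000):
    """Independence Metropolis-Hastings chain for h(u | y), with the random
    effects density g(u) as candidate, so that acceptance reduces to
    min(1, f(y | w) / f(y | u))."""
    def log_f(u):
        pi = 1.0 / (1.0 + np.exp(-(c + u)))
        return binom.logpmf(y, n, pi).sum()

    u = np.zeros(len(y))
    lf = log_f(u)
    chain = np.empty((n_iter, len(y)))
    for j in range(n_iter):
        w = rng.normal(0.0, sigma, size=len(y))   # proposal from g(u)
        lw = log_f(w)
        if np.log(rng.uniform()) <= lw - lf:
            u, lf = w, lw
        chain[j] = u
    return chain                                  # discard a burn-in before use
```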
2.3.2.3 Importance sampling

An importance sampling approximation to the $Q$-function in (2.10) is given by
$$Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) = \frac{1}{m}\sum_{j=1}^{m} w_j \log f(y, u^{(j)}; \beta, \psi),$$
where the $u^{(j)}$ are independent samples from an importance density $g(u; \psi^{(k-1)})$ and
$$w_j = \frac{f(y, u^{(j)}; \beta^{(k-1)}, \psi^{(k-1)})}{a\, g(u^{(j)}; \psi^{(k-1)})}$$
are importance weights at iteration $k$. Usually, $Q_m$ is divided by the sum of the importance weights $\sum_{j=1}^{m} w_j$. The normalizing constant $a$ only depends on the known parameters $(\beta^{(k-1)}, \psi^{(k-1)})$ and hence plays no part in the following maximization step. Selecting the importance density $g$ is a delicate issue: it should be easy to simulate from, but should also resemble $h(u \mid y)$ as closely as possible. Booth and Hobert (1999) suggest a Student $t$ density as the importance distribution $g$, whose mean and variance match those of $h(u \mid y)$, which are derived via a Laplace approximation.

2.3.3 Convergence Criteria

Due to the stochastic nature of the algorithm, parameter estimates of two successive iterations can be close together just by chance, although convergence is not yet achieved. To reduce the risk of stopping prematurely, we declare convergence if the relative change in parameter estimates is less than some $\epsilon_1$ for $c$ (e.g., five) consecutive times. Let $\lambda^{(k)} = (\beta^{(k)\prime}, \psi^{(k)\prime})'$ be the vector of unknown fixed effects parameters and variance components. Then this condition means that
$$\frac{\left|\lambda_i^{(k)} - \lambda_i^{(k-1)}\right|}{\left|\lambda_i^{(k-1)}\right|} < \epsilon_1 \quad \text{for all } i \qquad (2.17)$$
has to be fulfilled for $c$ consecutive (e.g., five) values of $k$. For any $\lambda_i^{(k)}$, an exception to this rule occurs when the estimated standard error of that parameter is substantially larger than the change from one iteration to the next.
Hence, at iteration $k$, the relative precision criterion (2.17) need not be met for those parameters satisfying
$$\left|\lambda_i^{(k)} - \lambda_i^{(k-1)}\right| < \epsilon_2 \left[\widehat{\mathrm{var}}\big(\lambda_i^{(k)}\big)\right]^{1/2},$$
where $\widehat{\mathrm{var}}(\lambda_i^{(k)})$ is the current estimate of the variance of the MLE $\hat\lambda_i$. An estimate of this variance can be obtained from the observed information matrix of the ML estimator of $\lambda$. Louis (1982) showed that the observed information matrix can be written in terms of the first ($l'$) and second ($l''$) derivatives of the complete data log-likelihood $l(\lambda; y, u) = \log f(y, u; \lambda)$. Evaluated at the MLE $\hat\lambda$, it is given by
$$I(\hat\lambda) = -E_{u \mid y}\left[l''(\hat\lambda; y, u) \mid y\right] - \mathrm{var}_{u \mid y}\left[l'(\hat\lambda; y, u) \mid y\right].$$
An approximation to this matrix at iteration $k$ uses Monte Carlo sums with draws from $h(u \mid y; \lambda^{(k)})$ from the current iteration of the MCEM algorithm.

To further safeguard against stopping prematurely, we use a third convergence criterion, based on the $Q_m$-function. For deterministic EM, the $Q$-function is guaranteed to increase from iteration to iteration. With MCEM, because of the stochastic approximation nature, $Q_m^{(k)}$ can be less than $Q_m^{(k-1)}$ because of an unlucky Monte Carlo sample at iteration $k$. Hence, the parameter estimates obtained from maximizing $Q_m^{(k)}$ can be a step in the wrong direction and actually decrease the value of the likelihood. To counter this, we declare convergence only if successive values of $Q_m^{(k)}$ are within a small neighborhood. More important, however, is that we accept the $k$-th parameter update $\lambda^{(k)}$ only if the relative change in the $Q_m$-function is larger than some small negative constant $\epsilon_3$, i.e.,
$$\frac{Q_m^{(k)} - Q_m^{(k-1)}}{\left|Q_m^{(k-1)}\right|} \geq \epsilon_3. \qquad (2.18)$$
If at iteration $k$ (2.18) is not met, and there is thus reason to believe that $\lambda^{(k)}$ decreases the likelihood and is worse than the parameter update from the previous iteration, we repeat the $k$-th iteration with a new and larger Monte Carlo sample. Thereby we hope to better approximate the $Q$-function and, as a result, to get a better estimate $\lambda^{(k)}$, with a $Q_m$-function larger than the previous one. If this does not happen, we nevertheless accept $\lambda^{(k)}$ and proceed to the next iteration, possibly letting the algorithm temporarily move in the direction of a lower-likelihood region. Otherwise, the Monte Carlo sample size quickly grows without bounds at an early stage of the algorithm. Furthermore, at early stages the Monte Carlo error in the approximation of the $Q$-function can be large, and hence its trace plot is very volatile. Caffo, Jank and Jones (2003) go a step further and calculate asymptotic confidence intervals for the change in the $Q_m$-function, based on which they construct a rule for accepting or rejecting $\lambda^{(k)}$. They discuss schemes for increasing the Monte Carlo sample size accordingly, and their MCEM algorithm inherits the ascent property of EM with high probability. However, we feel that the simpler criterion (2.18) suffices for the examples considered here.

Coupled with any convergence criterion is the question of the updating scheme for the Monte Carlo sample size $m$ between iterations. In general, we will use $m^{(k)} = a\, m^{(k-1)}$, where $a > 1$ and $m^{(k)}$ is the Monte Carlo sample size at iteration $k$. At early iterations $m^{(k)}$ will be low, since big parameter jumps are expected regardless of the quality of the approximation and the Monte Carlo error associated with it. Later, as more weight is put on decreasing the Monte Carlo error in the approximations, the steady increase guarantees sufficiently large Monte Carlo samples. Furthermore, condition (2.18) signals when an additional boost in $m^{(k)}$ is needed to better approximate the $Q$-function in a given iteration. Hence, whenever (2.18) is not met, we re-run iteration $k$ with a bigger sample size $q\, m^{(k)}$, where $q > 1$ is usually between 1 and 2.
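In code, this updating scheme amounts to a one-line rule plus the boost triggered by (2.18); a minimal sketch with illustrative tuning constants:

```python
def next_sample_size(m_prev, Q_new, Q_old, a=1.03, q=1.2, eps3=-0.005):
    """Monte Carlo sample size for the next iteration: a routine increase by
    the factor a, plus an extra boost by the factor q whenever the relative
    change in Q_m falls below the negative tolerance of criterion (2.18)."""
    m = int(a * m_prev)
    if (Q_new - Q_old) / abs(Q_old) < eps3:
        m = int(q * m)
    return m
```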
Figure 2-1: Plot of the typical behavior of the Monte Carlo sample size $m^{(k)}$ and the $Q$-function $Q_m^{(k)}$ through MCEM iterations. The iteration number is shown on the x-axis. Plots are based on the data and model for the boat race data discussed in Chapter 5.

A typical picture of the Monte Carlo sample size $m^{(k)}$ and the $Q_m^{(k)}$-function through the iterations of an MCEM algorithm is presented in Figure 2-1. The increase in the $Q_m^{(k)}$-function is large at the first iterations, but its Monte Carlo error is also large, due to the small Monte Carlo sample size. The plot of the Monte Carlo sample size $m^{(k)}$ shows several jumps, corresponding to the events that the $Q_m^{(k)}$-function actually decreased by more than $\epsilon_3$ from one iteration to the next and we adjusted with an additional boost in generated samples. The data and model on which this plot is based are taken from the boat race example analyzed and discussed in Chapter 5, with convergence criteria set to $\epsilon_1 = 0.001$, $c = 4$, $\epsilon_2 = 0.003$, $\epsilon_3 = -0.005$, $a = 1.03$ and $q = 1.05$.
Fort and Moulines (2003) show that with geometrically ergodic (see, e.g., Robert and Casella, 1999) MCMC samplers, a polynomial increase in the Monte Carlo sample size leads to convergence of the MCEM parameter estimates. However, establishing geometric ergodicity is not an easy task. Other, more sophisticated and automated Monte Carlo sample size updating schemes are presented by Booth and Hobert (1999) for independent sampling and by Caffo, Jank and Jones (2003) for independent and MCMC sampling.
CHAPTER 3
CORRELATED RANDOM EFFECTS

In Chapter 2 we mentioned on several occasions that for certain data structures the usual assumption of independent random effects is inappropriate. For instance, if clusters represent time points in a study over time, observations from different clusters can no longer be assumed (marginally) independent. Or, in longitudinal studies, the non-negative and exchangeable correlation structure among repeated observations implied by a single random effect can be far from the truth for long sequences of repeated observations. Section 3.1 presents data from a cross-sectional time series which motivate the use of correlated random effects, and discusses their implications. In Sections 3.2 and 3.3, two special correlation structures useful for modeling the dependence structure in discrete repeated measures with possibly unequally spaced observation times are discussed. The main focus of this chapter is on the technical implications for the MCEM algorithm arising from estimating an additional variance (correlation) component. In contrast to models with independent random effects, the M-step has no closed-form solution, and iterative methods have to be used to find the maximum. Also, because random effects that are correlated a priori are also correlated a posteriori, sampling from the posterior distribution of $u \mid y$, as required by the MCEM algorithm, is more involved than with independent random effects. A Gibbs sampling approach is developed in Section 3.4.

From here on, we let $t$ denote the index for the discrete observation times, $t = 1, \ldots, T$, and we let $Y_{it}$ denote a response at time point $t$ for stratum $i$, $i = 1, \ldots, n$. Throughout, we will assume univariate but correlated random effects $\{u_t\}_{t=1}^{T}$ associated with the observations over time.
3.1 A Motivating Example: Data from the General Social Survey

The basic purpose of the General Social Survey (GSS), conducted by the National Opinion Research Center, is to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors and attributes. It is second only to the census in popularity among sociologists as a data source for conducting research. The GSS questionnaire contains a standard core of demographic and attitudinal variables whose wording is retained throughout the years to facilitate time trend studies (source: www.norc.uchicago.edu/projects/gensocl.asp). Currently, the GSS comprises a total of 24 surveys, conducted in the years 1973-1978, 1980, 1982, 1983-1994, 1996, 1998, 2000 and 2002, with data available online (at www.webapp.icpsr.umich.edu/GSS/) through 1998. Two features, a discrete response variable (most of the attitude questions) observed through time and unequally spaced observation times, make it a prime resource for applying the models proposed in this dissertation. Data obtained from the GSS are different from longitudinal studies, where subjects are followed through time. Here, responses are from independent cross-sectional surveys of different subjects in each year.

One question, included in 16 of the 22 surveys up to 1998, recorded attitude towards homosexual relationships. It was asked in the years 1974, 1976-77, 1980, 1982, 1984-85, 1987-1991, 1993-94, 1996 and 1998. We will use these data to motivate and illustrate the use of correlated random effects. Figure 3-1 shows the proportion of respondents who agreed with the statement that homosexual relationships are not wrong at all, for the two race cohorts of white respondents and black respondents. For simplicity, in this introductory example only race was chosen as a cross-classifying variable, and attitude was measured as answering "yes" or "no" to the aforementioned question. Let $Y_{it}$ denote the number of people in year $t$ and of race $i$ who agreed with the statement that homosexual relationships are not wrong at all.
Figure 3-1: Sampling proportions from the GSS data set. Proportion of whites (squares) and blacks (circles) agreeing with the statement that homosexual relationships are not wrong at all, from 1974 to 1998.

The index $t = 1, \ldots, 16$ runs through the set of 16 years $\{1974, 1976, 1977, 1980, \ldots, 1998\}$ mentioned above, and $i = 1$ for race equal to white and $i = 2$ for race equal to black. The conditional independence assumption discussed in Section 2.1 allows us to model $Y_{it}$, the sum of $n_{it}$ binary variables which are the individual responses, as a binomial variable, conditional on a yearly random effect. That is, the probabilistic model we propose assumes a conditional Binomial$(n_{it}, \pi_{it})$ distribution for each member of the two time series $\{Y_{1t}\}_{t=1}^{16}$ and $\{Y_{2t}\}_{t=1}^{16}$ pictured in Figure 3-1. The parameters $n_{it}$ and $\pi_{it}$ are, respectively, the total number of respondents of race $i$ in year $t$ and their conditional probability of agreeing with the statement that homosexual relationships are not wrong at all.

3.1.1 A GLMM Approach

A popular model for $Y_{it}$ is a logistic-normal model, for which the link function $h(\cdot)$ in (2.2) is the logit link and the random effects structure simplifies to a random intercept $u_t$.
We will assume that the fixed parameter vector $\beta$ is composed of an intercept term $\alpha$, linear and quadratic time effects $\beta_1$ and $\beta_2$, a race effect $\beta_3$, and a year-by-race interaction $\beta_4$. With $x_{1t}$ representing the year variable centered around 1984 (e.g., $x_{11} = 1974 - 1984 = -10$) and $x_{2i}$ the indicator variable for race (for whites $x_{21} = 0$, for blacks $x_{22} = 1$), the model has form
$$\mathrm{logit}(\pi_{it}(u_t)) = \alpha + \beta_1 x_{1t} + \beta_2 x_{1t}^2 + \beta_3 x_{2i} + \beta_4 x_{1t} x_{2i} + u_t. \qquad (3.1)$$
Apart from the fixed effects, the random time effect $u_t$ captures the dependency structure over the years. Note that $\pi_{it}(u_t)$ is a conditional probability, given the random effect $u_t$ from the year the question was asked. This random effect $u_t$ can be interpreted as the unmeasurable public opinion about homosexual relationships common to all respondents within the same year. By introducing this random effect, we assume that individual opinions are influenced by this overall opinion, or by the social and political climate on homosexual relationships (like awareness of AIDS and the social spending associated with it, which is hard to measure). Thus, individual responses within a given year are no longer independent of each other, but share a common random effect. Furthermore, it is natural to assume that the public opinion about homosexual relationships changes gradually over time, with higher correlations for years closer together and lower correlations for years further apart. It would be wrong and unnatural to assume that the public opinion (or political climate) is independent from one year to the next. However, this would be assumed by modeling the random effects $\{u_t\}$ as independent of each other. It would also be wrong to assume a common, time-independent random effect $u = u_t$ for all time points $t$, as this implies that public opinion does not change over time. Its effect would then be the same whether responses are measured in 1974 or 1998.

3.1.2 Motivating Correlated Random Effects

To capture the dependency in public opinion, and therefore in responses over different years, we propose random effects that are correlated.
In particular, for this example with unequally spaced observation times, we suggest normal autocorrelated random effects $\{u_t\}$ with variance function $\mathrm{var}(u_t) = \sigma^2$, $t = 1, \ldots, 16$, and correlation function
$$\mathrm{corr}(u_t, u_{t^*}) = \rho^{\left|x_{1t} - x_{1t^*}\right|}, \qquad 1 \leq t < t^* \leq 16,$$
where $x_{1t} - x_{1t^*}$ is the difference between the two years identified by the indices $t$ and $t^*$. This is equivalent to specifying a latent autoregressive process
$$u_{t+1} = \rho^{\left|x_{1t} - x_{1,t+1}\right|}\, u_t + \epsilon_t$$
underlying the data generation mechanism. Both of these formulations naturally handle the multiple gaps in the observed time series. There is no need to make adjustments (such as imputation of data or artificially treating the series as equally spaced) in our analysis due to missing data for the years 1975, 1978-79, 1981, 1983, 1986, 1992, 1995 and 1997.
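As an illustration of how the latent process accommodates the gaps, the following minimal sketch simulates one path of such random effects over the observed GSS years (with the innovation variance chosen so that $\mathrm{var}(u_t) = \sigma^2$ at every year, matching the stated variance function; $\sigma$ and $\rho$ are set near the estimates reported later for these data):

```python
import numpy as np

rng = np.random.default_rng(5)

# the 16 GSS survey years, with their gaps
years = np.array([1974, 1976, 1977, 1980, 1982, 1984, 1985, 1987,
                  1988, 1989, 1990, 1991, 1993, 1994, 1996, 1998])
sigma, rho = 0.10, 0.65        # near the ML estimates reported for these data

# u_{t+1} = rho^{d_t} u_t + eps_t with var(eps_t) = sigma^2 (1 - rho^{2 d_t}),
# so that var(u_t) = sigma^2 and corr(u_t, u_t*) = rho^{|x_t - x_t*|}
d = np.diff(years)
u = np.empty(len(years))
u[0] = rng.normal(0.0, sigma)
for t, dt in enumerate(d):
    sd = sigma * np.sqrt(1.0 - rho ** (2 * dt))
    u[t + 1] = rho ** dt * u[t] + rng.normal(0.0, sd)
```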
With correlated random effects, we have to distinguish between two situations: the correlation induced by assuming a common random effect $u_t$ for each cluster (here: year), and the correlation induced by assuming a correlation among the cluster-specific random effects $\{u_t\}$.

Correlation among observations in the same cluster is a consequence of assuming a single, cluster-specific random effect shared by all observations in that cluster. For example, the presence of the cluster-specific random effect $u_t$ in (3.1) leads to a (marginal) non-negative correlation among the two binomial responses $Y_{1t}$ and $Y_{2t}$ in year $t$. With conditional independence, the marginal covariance between these two observations is given by
$$\mathrm{cov}(Y_{1t}, Y_{2t}) = E\left[\mathrm{cov}(Y_{1t}, Y_{2t} \mid u_t)\right] + \mathrm{cov}\left(E[Y_{1t} \mid u_t],\, E[Y_{2t} \mid u_t]\right) = \mathrm{cov}\left(\mathrm{logit}^{-1}(\eta_{1t} + u_t),\ \mathrm{logit}^{-1}(\eta_{2t} + u_t)\right), \qquad (3.2)$$
where $\eta_{it}$ is the fixed part of the linear predictor in (3.1). Both functions in (3.2) are monotone increasing in $u_t$, leading to a non-negative correlation. Approximations to (3.2) will be dealt with in Section 4.3. In the example, we attributed the cause of this correlation to the current (at the time of the interview) public opinion about homosexual relationships, influencing all respondents in that year. The estimate of $\sigma$ gives an idea about the magnitude of this correlation, since the more disperse the $u_t$'s are, the stronger the correlation among the responses within a year. For instance, if the true $u_t$ for a particular year is positive and far away from zero, as measured by $\sigma$, then all respondents have a common tendency to give a positive answer. If it is far away from zero on the negative side, respondents have a common tendency to give a negative answer. This interpretation, of course, is only relative to other fixed effects included in the linear predictor. For the GSS data there seems to be moderate correlation between responses, based on a maximum likelihood estimate of $\hat\sigma = 0.10$ with an approximated asymptotic s.e. of 0.03. This interpretation of a moderate effect of public opinion on responses within the same year is further supported by the fact that $\sigma$ can also be interpreted as the regression coefficient of a standardized version of the random effect $u_t$. A regression coefficient of 0.10 for a standard normal variable on the logit scale leads to moderate heterogeneity on the probability scale. This shows that the correlation between responses within a common year cannot be neglected.
The second consequence of correlated random effects is that observations from different clusters are correlated, which is a distinctive feature compared to GLMMs assuming independence between cluster-specific random effects. The conditional log odds of agreeing with the statement that homosexual relationships are not wrong at all are now correlated over the years, a feature which is natural for a time series of binomial observations, but which would have gone unaccounted for if time-independent random effects were used. For instance, for the cohort of white respondents ($i = 1$), the correlation between the conditional log odds at years $t$ and $t^*$ is
$$\mathrm{corr}\left(\mathrm{logit}(\pi_{1t}(u_t)),\ \mathrm{logit}(\pi_{1t^*}(u_{t^*}))\right) = \rho^{\left|x_{1t} - x_{1t^*}\right|}$$
and is therefore directly related to the assumed random effects correlation structure. Marginally, the two binomial responses at the different observation times have covariance
$$\mathrm{cov}(Y_{1t}, Y_{1t^*}) = \mathrm{cov}\left(\mathrm{logit}^{-1}(\eta_{1t} + u_t),\ \mathrm{logit}^{-1}(\eta_{1t^*} + u_{t^*})\right),$$
which accommodates changing covariance patterns for different observation times (e.g., decreasing with increasing lag) and also negative covariances (see, for instance, the analysis of the Old Faithful geyser eruption data in Chapter 5). We will present approximations to these marginal correlations in binomial time series in Section 4.3.

Summing up, correlated random effects give us a means of incorporating correlation between sequential binomial observations that goes beyond independent or exchangeable correlation structures. In our example, we attributed the sequential correlation to the gradual change in public opinion about homosexual relationships over the years, affecting both races equally. In fact, the maximum likelihood estimate of $\rho$ is equal to 0.65 (s.e. 0.25), indicating that rather strong correlations might exist between responses from adjacent years.
The model uses 7 parameters (5 fixed effects, 2 variance components) to describe the 32 probabilities. In comparison with a regular GLM and a GLMM with independent random time effects, the maximized log-likelihood improves from -113.0 for the regular GLM, to approximately -109.7 for a GLMM with independent random effects, to approximately -107.5 for the GLMM with autoregressive random effects. Note that the GLM assumes independent observations within and between the years, and that the GLMM with independent random effects $\{u_t\}$ for each year $t$ assumes correlation of responses within a year but independence of responses across the years. Both assumptions might be inappropriate. Our model implies that the log odds of approval of homosexual relationships are correlated for blacks and whites within a year (though not very strongly, with an estimate of $\sigma$ equal to 0.10) and are also correlated for two consecutive years.

The estimates of the fixed parameters and their asymptotic standard errors are given in Table 5-1. The MCEM algorithm converged after 128 iterations, with a starting Monte Carlo sample size of 50 and a final Monte Carlo sample size of 8600. Convergence parameters (cf. Section 2.3.3) were $\epsilon_1 = 0.002$, $c = 5$, $\epsilon_2 = 0.003$, $\epsilon_3 = -0.001$, $a = 1.01$ and $q = 1.2$. Path plots of selected parameter estimates for two different sets of starting values are shown in Figure 3-2. A detailed interpretation of the parameters and of the effects of the explanatory variables on the odds of approval is provided in Section 4.3. Although this example assumed autocorrelated random effects, we will look at the simpler case of equally correlated random effects next. Then we discuss how the correlation parameter $\rho$ can be estimated within the MCEM framework presented in Section 2.3. Summing up, correlated random effects in GLMMs allow one to model within-cluster as well as between-cluster correlations for discrete response variables, where clusters refer to groupings of responses in time.
Figure 3-2: Iteration history for selected parameters and their asymptotic standard errors for the GSS data. The iteration number is plotted on the x-axis. The estimates and standard errors for $\beta_2$ were multiplied by $10^3$ for better plotting. The two different lines in each plot correspond to two different sets of starting values.
3.2 Equally Correlated Random Effects

The introductory example modeled decaying correlation between cross-sectional data over time through the use of autocorrelated random effects. In other temporal or spatial settings, the correlation might stay nearly constant between any two observation times, regardless of time or location differences between the two discrete responses. Equally correlated random effects might then be appropriate to describe such behavior.

3.2.1 Definition of Equally Correlated Random Effects

We call random effects equally correlated if $\mathrm{var}(u_t) = \sigma^2$ for all $t$ and $\mathrm{corr}(u_t, u_{t^*}) = \rho$ for all $t \neq t^*$. More generally, the covariance matrix of the random effects vector $u = (u_1, \ldots, u_T)'$ is given by
$$\Sigma = \sigma^2\left[(1 - \rho) I_T + \rho J_T\right],$$
where $J_T = 1_T 1_T'$. To ensure positive definiteness, $\rho$ has restricted range, i.e., $1 > \rho > -1/(T-1)$. The random effects density is given by
$$g(u; \psi) \propto |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}\, u' \Sigma^{-1} u\right\},$$
where now, due to the pattern in $\Sigma$,
$$|\Sigma| = \sigma^{2T} (1 - \rho)^{T-1}\left[1 + (T-1)\rho\right] \quad \text{and} \quad \Sigma^{-1} = \frac{1}{\sigma^2 (1 - \rho)}\left[ I_T - \frac{\rho}{1 + (T-1)\rho}\, J_T \right]. \qquad (3.3)$$
The vector $\psi = (\sigma, \rho)'$ holds the variance components of $\Sigma$.
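The closed forms in (3.3) are easy to check numerically; a minimal sketch:

```python
import numpy as np

def equicorr_forms(sigma, rho, T):
    """Sigma = sigma^2 [(1 - rho) I_T + rho J_T] together with the closed
    forms of log|Sigma| and Sigma^{-1} from (3.3)."""
    S = sigma**2 * ((1 - rho) * np.eye(T) + rho * np.ones((T, T)))
    logdet = (2 * T * np.log(sigma) + (T - 1) * np.log(1 - rho)
              + np.log(1 + (T - 1) * rho))
    Sinv = (np.eye(T) - rho / (1 + (T - 1) * rho) * np.ones((T, T))) \
           / (sigma**2 * (1 - rho))
    return S, logdet, Sinv

S, logdet, Sinv = equicorr_forms(0.8, 0.3, 5)
assert np.isclose(np.linalg.slogdet(S)[1], logdet)
assert np.allclose(S @ Sinv, np.eye(5))
```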
The more complicated random effects structure (as compared to independence, or a single latent random effect) leads to a more complicated M-step in the MCEM algorithm described in Section 2.3. For a sample $u^{(1)}, \ldots, u^{(m)}$ from the posterior $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$, evaluated at the previous parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)}$, the function $Q_m^2(\psi)$ introduced in Section 2.3.1 has form
$$Q_m^2(\psi) = \frac{1}{m}\sum_{j=1}^{m} \log g(u^{(j)}; \psi) \propto -T\log\sigma - \frac{T-1}{2}\log(1-\rho) - \frac{1}{2}\log\left[1+(T-1)\rho\right] - \frac{1}{2\sigma^2(1-\rho)}\, a + \frac{\rho}{2\sigma^2(1-\rho)\left[1+(T-1)\rho\right]}\, b,$$
where $a = \frac{1}{m}\sum_{j=1}^{m} u^{(j)\prime} u^{(j)}$ and $b = \frac{1}{m}\sum_{j=1}^{m} u^{(j)\prime} J_T\, u^{(j)}$ are constants depending on the sample only.

3.2.2 The M-step with Equally Correlated Random Effects

The M-step seeks to maximize $Q_m^2$ with respect to $\sigma$ and $\rho$, which is equivalent to finding their MLEs treating the sample $u^{(1)}, \ldots, u^{(m)}$ as independent. Since this is not possible in closed form, one way to maximize $Q_m^2$ uses a bivariate Newton-Raphson algorithm, with the Hessian formed by the second-order partial and mixed derivatives of $Q_m^2$ with respect to $\sigma$ and $\rho$. Some authors (e.g., Lange 1995, Zhang 2002) use only a single iteration of the Newton-Raphson algorithm instead of an entire M-step, to speed up convergence. However, this might not always lead to convergence, since the interval on which the Newton-Raphson algorithm converges is restricted through the restrictions on $\rho$. We show now that, with a little bit of work, the maximizers for $\sigma$ and $\rho$ can be obtained very quickly. For any given value of $\rho$, the ML estimator of $\sigma$ (at iteration $k$) is available in closed form and is equal to
$$\hat\sigma^{(k)} = \left( \frac{a}{T(1-\rho)} - \frac{\rho\, b}{T(1-\rho)\left[1+(T-1)\rho\right]} \right)^{1/2}.$$
Note that if $\rho = 0$, $\hat\sigma^{(k)} = \left( \frac{1}{mT}\sum_{j=1}^{m} u^{(j)\prime} u^{(j)} \right)^{1/2}$, the estimator for the independence case presented at the end of Section 2.3.1. Unfortunately, the ML estimator of $\rho$ has no closed-form solution.
The first and second partial derivatives of $Q_m^2$ with respect to $\rho$ are given by
$$\frac{\partial}{\partial\rho} Q_m^2 = \frac{T(T-1)\rho}{2(1-\rho)\left[1+(T-1)\rho\right]} - \frac{a}{2\sigma^2(1-\rho)^2} + \frac{\left[1+(T-1)\rho^2\right] b}{2\sigma^2(1-\rho)^2\left[1+(T-1)\rho\right]^2}$$
and
$$\frac{\partial^2}{\partial\rho^2} Q_m^2 = \frac{T(T-1)\left[1+(T-1)\rho^2\right]}{2(1-\rho)^2\left[1+(T-1)\rho\right]^2} - \frac{a}{\sigma^2(1-\rho)^3} + \frac{b}{2\sigma^2}\left\{ \frac{2(T-1)\rho}{(1-\rho)^2\left[1+(T-1)\rho\right]^2} - \frac{2\left[1+(T-1)\rho^2\right]\left[(T-2) - 2(T-1)\rho\right]}{(1-\rho)^3\left[1+(T-1)\rho\right]^3} \right\}.$$
We obtain the profile likelihood for $\rho$ by plugging the MLE $\hat\sigma^{(k)}$ into the likelihood equation for $\rho$. Then we use a simple and fast interval-halving (or bisection) method to find the root in $\rho$. This is advantageous compared to a Newton-Raphson algorithm, since the range of $\rho$ is restricted. Let $f(\rho) = \frac{\partial}{\partial\rho} Q_m^2 \big|_{\sigma = \hat\sigma^{(k)}}$, and let $\rho_1$ and $\rho_2$ be two initial estimates in the appropriate range satisfying $\rho_1 < \rho_2$ and $f(\rho_1) f(\rho_2) < 0$. Without loss of generality, assume $f(\rho_1) < 0$. Clearly, the maximum likelihood estimate $\hat\rho$ must be in the interval $[\rho_1, \rho_2]$. The interval-halving method computes the midpoint $\rho_3 = (\rho_1 + \rho_2)/2$ of this interval and updates one of its endpoints in the following way: it sets $\rho_1 = \rho_3$ if $f(\rho_3) < 0$, or $\rho_2 = \rho_3$ otherwise. The newly formed interval $[\rho_1, \rho_2]$ has half the length of the initial interval, but still contains $\hat\rho$. Subsequently, a new midpoint $\rho_3$ is calculated, giving rise to a new interval with one fourth of the length of the initial interval, but still containing $\hat\rho$. This process is iterated until $|f(\rho_3)| < \epsilon$, where $\epsilon$ is a small positive constant. To ensure it is a maximum, we can check that the value of the second derivative, $f'(\rho)$, is negative at $\rho_3$. (The second derivative is also needed for approximating standard errors in the EM algorithm.) The value of $\rho_3$ is then used as an update for $\rho$ in the maximum likelihood estimator of $\sigma$, and the whole process of finding the roots of $Q_m^2$ is repeated. Convergence is declared when the relative change in $\sigma$ and $\rho$ is less than some pre-specified small constant. The values of $\sigma$ and $\rho$ at this final iteration are the estimates $\hat\sigma^{(k)}$ and $\hat\rho^{(k)}$ from MCEM iteration $k$.
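A minimal sketch of this interval-halving step (the profile score f is assumed to be supplied by the caller, with the bracket chosen inside the admissible range):

```python
def bisect(f, rho1, rho2, eps=1e-8, max_iter=100):
    """Interval halving for the profile score f(rho) = dQ_m^2/drho at
    sigma = sigma_hat(rho); requires f(rho1) f(rho2) < 0 with
    -1/(T-1) < rho1 < rho2 < 1."""
    f1 = f(rho1)
    rho3 = 0.5 * (rho1 + rho2)
    for _ in range(max_iter):
        rho3 = 0.5 * (rho1 + rho2)
        f3 = f(rho3)
        if abs(f3) < eps:
            break
        if f1 * f3 > 0:             # f(rho1) and f(rho3) share a sign:
            rho1, f1 = rho3, f3     # the root lies in [rho3, rho2]
        else:
            rho2 = rho3             # the root lies in [rho1, rho3]
    return rho3
```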
The issue of how to obtain a sample $u^{(1)}, \ldots, u^{(m)}$ from $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$, taking into account the special structure of the random effects distribution, will be discussed in Section 3.4.2.

3.3 Autoregressive Random Effects

The use of autoregressive random effects was demonstrated in the introductory example. Their property of a decaying correlation function makes them a useful tool for modeling temporal or spatial associations among discrete data. We will limit ourselves to instances where there is a natural ordering of the random effects, and consider time-dependent data first.

3.3.1 Definition of Autoregressive Random Effects

As with equally correlated random effects in Section 3.2, we can look at the joint distribution of autoregressive (or autocorrelated) random effects $\{u_t\}_{t=1}^{T}$ as a mean-zero multivariate normal distribution with patterned covariance matrix $\Sigma$, defined by the variance and correlation functions $\mathrm{var}(u_t) = \sigma^2$ for all $t$ and
$$\mathrm{corr}(u_t, u_{t^*}) = \rho^{\left|x_t - x_{t^*}\right|},$$
where $x_t$ and $x_{t^*}$ are time points (e.g., years, as in the GSS example) associated with the random effects $u_t$ and $u_{t^*}$. Let $d_t = x_{t+1} - x_t$ denote the time difference between two successive time points, and let $f_t = 1/(1 - \rho^{2d_t})$, $t = 1, \ldots, T-1$. Then, due to the special structure, the determinant of the covariance matrix is given by $|\Sigma| = \sigma^{2T} \prod_{t=1}^{T-1} f_t^{-1}$, and $\Sigma^{-1}$ is tri-diagonal (Crowder and Hand, 1990, correcting a typo therein), with main diagonal
$$\frac{1}{\sigma^2}\left(f_1,\ f_1 + f_2 - 1,\ f_2 + f_3 - 1,\ f_3 + f_4 - 1,\ \ldots,\ f_{T-2} + f_{T-1} - 1,\ f_{T-1}\right)$$
and sub-diagonals
$$-\frac{1}{\sigma^2}\left(\rho^{d_1} f_1,\ \rho^{d_2} f_2,\ \ldots,\ \rho^{d_{T-1}} f_{T-1}\right).$$
For a sample $u^{(1)}, \ldots, u^{(m)}$ from the posterior $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$, evaluated at the previous parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)} = (\sigma^{(k-1)}, \rho^{(k-1)})$, the function $Q_m^2$ (cf. Section 2.3.1) now has form
$$Q_m^2(\psi) \propto -T\log\sigma - \sum_{t=1}^{T-1} \frac{1}{2}\log(1 - \rho^{2d_t}) - \frac{1}{2\sigma^2}\,\frac{1}{m}\sum_{j=1}^{m} u_1^{(j)2} - \frac{1}{2\sigma^2}\,\frac{1}{m}\sum_{j=1}^{m}\sum_{t=1}^{T-1} \frac{\left[u_{t+1}^{(j)} - \rho^{d_t} u_t^{(j)}\right]^2}{1 - \rho^{2d_t}}, \qquad (3.4)$$
where $u_t^{(j)}$ is the $t$-th component of the $j$-th sampled vector $u^{(j)}$. In the M-step of an MCEM algorithm, we seek to maximize $Q_m^2$ with respect to $\sigma$ and $\rho$.

Alternatively, we can view the random effects $\{u_t\}$ as a latent first-order autoregressive process: the random effect $u_{t+1}$ at time $t+1$ is related to its predecessor $u_t$ by the equation
$$u_{t+1} = \rho^{d_t} u_t + \epsilon_t, \qquad \epsilon_t \sim N\left(0,\ (1 - \rho^{2d_t})\,\sigma^2\right), \qquad t = 1, \ldots, T-1, \qquad (3.5)$$
where $d_t$ again denotes the lag between the two successive time points associated with the random effects $u_t$ and $u_{t+1}$. Assuming a $N(0, \sigma^2)$ distribution for the first random effect $u_1$, the joint random effects density for $u = (u_1, \ldots, u_T)'$ enjoys a Markov property and has form
$$g(u; \psi) = g(u_1; \psi)\, g(u_2 \mid u_1; \psi) \cdots g(u_t \mid u_{t-1}; \psi) \cdots g(u_T \mid u_{T-1}; \psi) \qquad (3.6)$$
$$\propto \left(\frac{1}{\sigma^2}\right)^{T/2} \left( \prod_{t=1}^{T-1} (1 - \rho^{2d_t}) \right)^{-1/2} \exp\left\{ -\frac{u_1^2}{2\sigma^2} \right\} \exp\left\{ -\frac{1}{2\sigma^2}\sum_{t=1}^{T-1} \frac{\left(u_{t+1} - \rho^{d_t} u_t\right)^2}{1 - \rho^{2d_t}} \right\},$$
leading, of course, to the same expression for $Q_m^2$ as given in (3.4). For two time indices $t$ and $t^*$ with $t < t^*$, the random process has autocorrelation function $\mathrm{corr}(u_t, u_{t^*}) = \rho^{\,x_{t^*} - x_t}$.


Before we discuss maximization of Q₂* in this setting, with possibly unequally spaced observation times, let us comment on the rather unusual parametrization of the latent random process (3.5). Chan and Ledolter (1995), in their development of time series models for equally spaced discrete events, use the more common form

u_{t+1} = ρ u_t + ε_t,   ε_t ~ N(0, σ²),   t = 1, …, T−1.

This leads to var(u_t) = σ²/(1−ρ²) for all t if we assume a N(0, σ²/(1−ρ²)) distribution for u₁. (Chan and Ledolter (1995) condition on this first observation, which leads to closed form solutions for both σ and ρ in the case of equidistant observations.) Since it is common practice to let σ² describe the strength of association between observations in a common cluster sharing that random effect, our parametrization seems more natural. In Chan and Ledolter's parametrization, both the variance and the correlation parameter appear in the variance of the random effect. In the more general case of unequally spaced observations, the parametrization ε_t ~ N(0, σ²) results in different variances of the random effects at different time points (i.e., var(u_t) = σ²/(1−ρ^{2d_t})). Considering that the random effects represent unobservable phenomena common to all clusters, their variability should be about the same for all clusters and not depend on the time difference between any two clusters. There is no reason to believe that the strength of association is larger in some clusters and weaker in others. Therefore, the parametrization we choose in (3.5) seems natural and appropriate.

For spatially correlated data, a relationship between random effects u_i and u_{i*} is defined in terms of a distance function d(x_i, x_{i*}) between covariates x_i and x_{i*} associated with them. Each u_i then represents a random effect for a spatial cluster, and correlated random effects are again natural to model spatial dependency among observations in different clusters.


In the time setting we had d(x_i, x_{i*}) = |x_i − x_{i*}|, with x_i and x_{i*} representing time points. In 2-dimensional spatial settings, x_i = (x_{i1}, x_{i2})′ may represent midpoints in a Cartesian system, and d(x_i, x_{i*}) = ||x_i − x_{i*}|| is the Euclidean distance function. The so-defined distance between clusters can be used in a model with correlations between random effects decaying as distances between cluster midpoints grow, e.g., corr(u_i, u_{i*}) = ρ^{||x_i − x_{i*}||}. Models of this form are discussed in Zhang (2002) and, in a Bayesian framework, in Diggle et al. (1998). Sometimes only the information concerning whether clusters are adjacent to each other is used to form the correlation structure. In this case, d(x_i, x_{i*}) is a binary function indicating whether clusters i and i* are adjacent or not. Usually this leads to an improper joint distribution for the random effects, as, for instance, in the analysis of the Scottish lip cancer data set presented in Breslow and Clayton (1993).

3.3.2 The M-step with Autoregressive Random Effects

Maximizing Q₂* with respect to σ and ρ is again equivalent to finding their MLEs for the sample u^{(1)}, …, u^{(m)}, pretending the sampled vectors are independent. For fixed ρ, maximizing Q₂* with respect to σ is possible in closed form. For notational convenience, denote the parts depending on ρ and the generated sample u^{(1)}, …, u^{(m)} by

a_t(ρ, u) = (1/m) Σ_{j=1}^m [u_{t+1}^{(j)} − ρ^{d_t} u_t^{(j)}]²   and   b_t(ρ, u) = (1/m) Σ_{j=1}^m [u_{t+1}^{(j)} − ρ^{d_t} u_t^{(j)}] u_t^{(j)},

with derivatives with respect to ρ (indicated by a prime) given by

a_t′(ρ, u) = −2 d_t ρ^{d_t−1} b_t(ρ, u)   and   b_t′(ρ, u) = −d_t ρ^{d_t−1} (1/m) Σ_{j=1}^m (u_t^{(j)})².


The maximum likelihood estimator of σ at iteration k of the MCEM algorithm has form

σ̂^{(k)} = [ (1/(Tm)) Σ_{j=1}^m (u₁^{(j)})² + (1/T) Σ_{t=1}^{T−1} a_t(ρ, u)/(1−ρ^{2d_t}) ]^{1/2}.

For the special case of independent random effects (ρ = 0), this simplifies to the estimator ((1/(Tm)) Σ_{j=1}^m u^{(j)′} u^{(j)})^{1/2} presented at the end of Section 2.3.1. (The equal correlation structure cannot be represented as a special case of the autocorrelation structure.) No closed form solution exists for ρ̂^{(k)}. Let

c_t(ρ) = d_t ρ^{d_t−1} / (1−ρ^{2d_t})   and   e_t(ρ) = d_t ρ^{2d_t−1} / (1−ρ^{2d_t})²

be terms depending on ρ but not on u, with derivatives given by

c_t′(ρ) = ((d_t−1)/ρ) c_t(ρ) + 2ρ^{d_t} [c_t(ρ)]²   and   e_t′(ρ) = ((2d_t−1)/ρ) e_t(ρ) + 4ρ^{d_t} c_t(ρ) e_t(ρ),

respectively. Then the first and second partial derivatives of Q₂* with respect to ρ can be written as

∂Q₂*/∂ρ = Σ_{t=1}^{T−1} ρ^{d_t} c_t(ρ) + (1/σ²) Σ_{t=1}^{T−1} [c_t(ρ) b_t(ρ, u) − e_t(ρ) a_t(ρ, u)],

∂²Q₂*/∂ρ² = Σ_{t=1}^{T−1} ρ^{d_t−1} c_t(ρ) [2d_t − 1 + 2ρ^{d_t+1} c_t(ρ)] + (1/σ²) Σ_{t=1}^{T−1} [c_t′(ρ) b_t(ρ, u) + c_t(ρ) b_t′(ρ, u) − e_t′(ρ) a_t(ρ, u) − e_t(ρ) a_t′(ρ, u)].
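A minimal sketch of this M-step, assuming the posterior draws are stored in an m × T array u: m_step_sigma evaluates the closed-form σ̂ at fixed ρ, and score_rho evaluates ∂Q₂*/∂ρ at fixed σ, whose root can be found with the interval-halving routine sketched in Section 3.2. Both function names are illustrative.

    import numpy as np

    def m_step_sigma(u, d, rho):
        """Closed-form sigma-hat at fixed rho for posterior samples u
        (shape m x T); d holds the T-1 time differences d_t."""
        m, T = u.shape
        resid = u[:, 1:] - rho**d * u[:, :-1]        # u_{t+1} - rho^{d_t} u_t
        a = (resid**2).mean(axis=0)                  # a_t(rho, u)
        var_hat = (u[:, 0]**2).mean() / T + (a / (1.0 - rho**(2 * d))).sum() / T
        return np.sqrt(var_hat)

    def score_rho(u, d, rho, sigma):
        """Profile score dQ2*/drho at fixed sigma; a root gives rho-hat."""
        resid = u[:, 1:] - rho**d * u[:, :-1]
        a = (resid**2).mean(axis=0)
        b = (resid * u[:, :-1]).mean(axis=0)         # b_t(rho, u)
        c = d * rho**(d - 1) / (1.0 - rho**(2 * d))
        e = d * rho**(2 * d - 1) / (1.0 - rho**(2 * d))**2
        return (rho**d * c).sum() + ((c * b - e * a) / sigma**2).sum()

Alternating these two updates reproduces the coordinate-wise maximization of Q₂* described above.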


A Newton–Raphson algorithm, with Hessian formed from the partial and mixed derivatives of Q₂* with respect to σ and ρ, can be employed to find the maximum likelihood estimators at iteration k. However, since the range of ρ is restricted, it might be advantageous to use the interval-halving method on ∂Q₂*/∂ρ evaluated at σ = σ̂^{(k)}, as in Section 3.2.

In most of our applications the number of distinct observation times T is rather large, and generating independent T-dimensional vectors u from the posterior h(u | y; ψ^{(k−1)}), as required to approximate the E-step, is difficult, even with the nice (prior) autoregressive relationship among the components of u. The next section discusses this issue. There are many other correlation structures which are not discussed here. For instance, the first-order autoregressive random process can be extended to a p-th order random process, and the formulas provided here and in the next section can be modified accordingly.

3.4 Sampling from the Posterior Distribution Via Gibbs Sampling

In Section 2.3.2 we gave a general description of how to obtain a random sample u^{(1)}, …, u^{(m)} from h(u | y). (As in Section 2.3.2, we suppress the dependency on the parameter estimates from the previous iteration.) For high dimensional random effects distributions g(u), generating independent draws from h(u | y) can get very time consuming, if not impossible. The Gibbs sampler introduced in Section 2.3.2 offers an alternative because it involves sampling from lower dimensional (often univariate) conditional distributions of h(u | y), which is considerably faster. However, it results in dependent samples from the posterior random effects distribution. The distributional structure of equally correlated or autoregressive random effects is very amenable to Gibbs sampling because of the simplifications that occur in the full univariate conditionals. Remember that the two-stage hierarchy and the conditional independence assumption in GLMMs imply that

h(u | y) ∝ f(y | u) g(u) = [Π_{t=1}^T f(y_t | u_t)] g(u),

the product of the conditional densities of observations sharing a common random effect and the random effects density. In the following, let u = (u₁, …, u_T). We discuss the case of autoregressive random effects first.


3.4.1 A Gibbs Sampler for Autoregressive Random Effects

From representation (3.6) of the random effects distribution, we see that the full univariate conditional distribution of u_t, given the other T−1 components of u, only depends on its neighbors u_{t−1} and u_{t+1}, i.e.,

g(u_t | u₁, …, u_{t−1}, u_{t+1}, …, u_T) ∝ g(u_t | u_{t−1}) g(u_{t+1} | u_t),   t = 2, …, T−1.

At the beginning (t = 1) and the end (t = T) of the process, the conditional distributions of u₁ and u_T only depend on the successor u₂ and the predecessor u_{T−1}, respectively. Furthermore, random effect u_t only applies to observations y_t = (y_{t1}, …, y_{tn_t}) at a common time point that share that random effect, but not to other observations. Hence the full univariate conditionals of the posterior random effects distribution can be expressed as

h₁(u₁ | u₂, y₁) ∝ f(y₁ | u₁) g₁(u₁ | u₂),
h_t(u_t | u_{t−1}, u_{t+1}, y_t) ∝ f(y_t | u_t) g_t(u_t | u_{t−1}, u_{t+1}),   t = 2, …, T−1,
h_T(u_T | u_{T−1}, y_T) ∝ f(y_T | u_T) g_T(u_T | u_{T−1}),

where, using standard multivariate normal theory results,

g₁(u₁ | u₂) = N( ρ^{d₁} u₂, σ²[1−ρ^{2d₁}] ),
g_t(u_t | u_{t−1}, u_{t+1}) = N( (ρ^{d_{t−1}}[1−ρ^{2d_t}] u_{t−1} + ρ^{d_t}[1−ρ^{2d_{t−1}}] u_{t+1}) / (1−ρ^{2(d_{t−1}+d_t)}), σ²[1−ρ^{2d_{t−1}} − ρ^{2d_t} + ρ^{2(d_{t−1}+d_t)}] / (1−ρ^{2(d_{t−1}+d_t)}) ),   t = 2, …, T−1,
g_T(u_T | u_{T−1}) = N( ρ^{d_{T−1}} u_{T−1}, σ²[1−ρ^{2d_{T−1}}] ).

For equally spaced data (d_t = 1 for all t), these distributions reduce to the ones derived in Chan and Ledolter (1995). Direct sampling from the full univariate conditionals h_t is not possible; however, it is straightforward to implement an accept-reject algorithm.
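For reference, here is a small helper (ours, not from the original implementation) that returns the mean and variance of the normal candidate density g_t for a given component; the variance is coded in the equivalent factored form σ²(1−ρ^{2d_{t−1}})(1−ρ^{2d_t})/(1−ρ^{2(d_{t−1}+d_t)}).

    import numpy as np

    def full_conditional(t, u, d, sigma, rho):
        """Mean and variance of the normal candidate g_t for component t
        (0-based), given current values of the neighboring random effects."""
        T = len(u)
        if t == 0:
            return rho**d[0] * u[1], sigma**2 * (1 - rho**(2 * d[0]))
        if t == T - 1:
            return rho**d[-1] * u[-2], sigma**2 * (1 - rho**(2 * d[-1]))
        dl, dr = d[t - 1], d[t]              # lags to left and right neighbor
        denom = 1 - rho**(2 * (dl + dr))
        mean = (rho**dl * (1 - rho**(2 * dr)) * u[t - 1]
                + rho**dr * (1 - rho**(2 * dl)) * u[t + 1]) / denom
        var = sigma**2 * (1 - rho**(2 * dl)) * (1 - rho**(2 * dr)) / denom
        return mean, var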


In fact, the accept-reject algorithm as outlined in Section 2.3.2 applies directly, with target density h_t and candidate density g_t, since h_t has the form of an exponential family density multiplied by a normal density. In Section 2.3.2 we discussed the accept-reject algorithm for generating an entire vector u from the posterior random effects distribution h(u | y) with candidate density g(u), and mentioned that acceptance probabilities are virtually zero for large dimensional u's. With the Gibbs sampler we have reduced the problem to univariate sampling of the t-th component u_t from the univariate target density h_t with univariate candidate density g_t. By selecting M_t = L(y_t), where L(y_t) is the saturated likelihood for the observations at time point t, we ensure that the target density satisfies h_t ≤ M_t g_t. Given u^{(j−1)} = (u₁^{(j−1)}, …, u_T^{(j−1)}) from the previous iteration, the Gibbs sampler with accept-reject sampling from the full univariate conditionals consists of

1. generate the first component u₁^{(j)} ~ h₁(u₁ | u₂^{(j−1)}, y₁) by
   (a) generation step: generate u₁ from the candidate density g₁(u₁ | u₂^{(j−1)}); generate U ~ Uniform[0, 1];
   (b) acceptance step: set u₁^{(j)} = u₁ if U ≤ f(y₁ | u₁)/L(y₁); return to (a) otherwise;

2. for t = 2, …, T−1: generate component u_t^{(j)} ~ h_t(u_t | u_{t−1}^{(j)}, u_{t+1}^{(j−1)}, y_t) by
   (a) generation step: generate u_t from the candidate density g_t(u_t | u_{t−1}^{(j)}, u_{t+1}^{(j−1)}); generate U ~ Uniform[0, 1];
   (b) acceptance step: set u_t^{(j)} = u_t if U ≤ f(y_t | u_t)/L(y_t); return to (a) otherwise;


3. generate the last component u_T^{(j)} ~ h_T(u_T | u_{T−1}^{(j)}, y_T) by
   (a) generation step: generate u_T from the candidate density g_T(u_T | u_{T−1}^{(j)}); generate U ~ Uniform[0, 1];
   (b) acceptance step: set u_T^{(j)} = u_T if U ≤ f(y_T | u_T)/L(y_T); return to (a) otherwise;

4. set u^{(j)} = (u₁^{(j)}, …, u_T^{(j)}).

The so-obtained sample u^{(1)}, …, u^{(m)} (after allowing for burn-in) forms a dependent sample, which we use to approximate the E-step in the k-th iteration of the MCEM algorithm. Note that all densities are evaluated at the current parameter estimates, i.e., β^{(k−1)} for f(y_t | u_t) and ψ^{(k−1)} = (σ^{(k−1)}, ρ^{(k−1)}) for g_t.
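A hedged sketch of a single accept-reject draw, written here for the binomial-logit f(y_t | u_t) of Section 4.3; mean_t and sd_t would come from the candidate density g_t above, and the function name sample_component is illustrative.

    import numpy as np
    from scipy.stats import binom

    rng = np.random.default_rng(0)

    def sample_component(y_t, n_t, beta_x, mean_t, sd_t, rng):
        """One accept-reject draw of u_t from h_t for a binomial-logit
        f(y_t | u_t); (mean_t, sd_t) parameterize the normal candidate g_t."""
        pi_sat = y_t / n_t                        # saturated probability
        L_t = binom.pmf(y_t, n_t, pi_sat)         # saturated likelihood L(y_t)
        while True:
            u = rng.normal(mean_t, sd_t)          # (a) generation step
            pi = 1.0 / (1.0 + np.exp(-(beta_x + u)))
            if rng.uniform() <= binom.pmf(y_t, n_t, pi) / L_t:
                return u                          # (b) acceptance step

Because f(y_t | u_t)/L(y_t) ≤ 1 by construction, the acceptance ratio is a valid probability for every candidate u.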


3.4.2 A Gibbs Sampler for Equally Correlated Random Effects

Similar results as for the autoregressive correlation structure can be derived for the case of equally correlated random effects. In this case, the full univariate conditional of u_t depends on all other T−1 components of u, as can be seen from (3.3). Let u_{−t} denote the vector u with the t-th component deleted. Using similar notation as in the previous section, the full univariate conditionals of h(u | y) are given by

h_t(u_t | u_{−t}, y_t) ∝ f(y_t | u_t) g_t(u_t | u_{−t}),

where, with standard results from multivariate normal theory, g_t(u_t | u_{−t}) is a N(μ_t, τ_t²) density with

μ_t = [ρ/(1+(T−2)ρ)] Σ_{k≠t} u_k   and   τ_t² = σ² (1−ρ)[1+(T−1)ρ] / [1+(T−2)ρ].

Given the vector u^{(j−1)} from the previous iteration, the Gibbs sampler with accept-reject sampling from the full univariate conditionals has form

1. for t = 1, …, T, generate component u_t^{(j)} ~ h_t(u_t | u₁^{(j)}, …, u_{t−1}^{(j)}, u_{t+1}^{(j−1)}, …, u_T^{(j−1)}, y_t) by
   (a) generation step: generate u_t from the candidate density g_t(u_t | u₁^{(j)}, …, u_{t−1}^{(j)}, u_{t+1}^{(j−1)}, …, u_T^{(j−1)}); generate U ~ Uniform[0, 1];
   (b) acceptance step: set u_t^{(j)} = u_t if U ≤ f(y_t | u_t)/L(y_t); return to (a) otherwise;

2. set u^{(j)} = (u₁^{(j)}, …, u_T^{(j)}).

This leads to a sample u^{(1)}, …, u^{(m)} from the posterior distribution used in the E- and M-steps of the MCEM algorithm at iteration k. Note again that all distributions are evaluated at the current parameter estimates β^{(k−1)} and ψ^{(k−1)} = (σ^{(k−1)}, ρ^{(k−1)}).

3.5 A Simulation Study

We conducted a simulation study to evaluate the performance of the maximum likelihood estimation algorithm, to evaluate the bias in the estimation of covariate effects and variance components, and to compare predicted random effects to the ones used in the simulation of the data. To this end, we generated a time series y₁, …, y_T of T = 400 binary observations according to the model

logit P(Y_t = 1 | u_t) = α + β x_t + u_t,   t = 1, …, 400,   (3.8)

for the conditional log odds of success at time t. For the simulation we chose α = 1 and β = 1, where β is the regression coefficient for independent standard normal distributed covariates x_t ~ i.i.d. N(0, 1). The random effects u₁, …, u_T are thought to arise from an unobserved latent random autoregressive process u_{t+1} = ρ u_t + ε_t, where the ε_t are i.i.d. N(0, σ²[1−ρ²]); i.e., the u_t's have standard deviation σ and lag-t correlation ρ^t.
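The data-generating mechanism can be sketched in a few lines; this is an illustrative reconstruction of the setup just described, not the original simulation code.

    import numpy as np

    rng = np.random.default_rng(1)
    T, alpha, beta, sigma, rho = 400, 1.0, 1.0, 2.0, 0.8

    x = rng.normal(size=T)                     # i.i.d. N(0, 1) covariates
    u = np.empty(T)
    u[0] = rng.normal(0.0, sigma)              # stationary start
    for t in range(T - 1):                     # latent AR(1) random effects
        u[t + 1] = rho * u[t] + rng.normal(0.0, sigma * np.sqrt(1 - rho**2))

    logit = alpha + beta * x + u               # model (3.8)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    print(u.std(), np.corrcoef(u[:-1], u[1:])[0, 1])   # roughly 2 and 0.8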


For the simulation of these autoregressive random effects, we used σ = 2 and ρ = 0.8. The resulting sample autocorrelation function of the realized random effects is pictured in Figure 3-4. The standard deviation and lag-1 correlation of the 400 realized values of u₁, …, u_T are equal to 1.95 and 0.77. Note that, conditional on the realized values of the u_t's, the y_t's are generated independently, with log odds given by (3.8).

The MCEM algorithm as described in Sections 2.3 and 3.3, for a logistic GLMM with autocorrelated random effects, yielded the following maximum likelihood estimates for the fixed effects and variance components: α̂ = 0.94 (0.39) and β̂ = 1.03 (0.22), as compared to the true values 1 and 1, and σ̂ = 2.25 (0.44) and ρ̂ = 0.74 (0.06), as compared to the realized values 1.95 and 0.77. The algorithm converged after 71 iterations, with a starting Monte Carlo sample size of 50 and a final Monte Carlo sample size of only 880, although estimated standard errors are based on a Monte Carlo sample size of 20,000. Convergence parameters were set to c₁ = 0.003, c = 3, c₂ = 0.005, c₃ = −0.001, a = 1.03 and q = 1.05 (see Section 2.3.3). Regular GLM estimates were used as starting values for α and β, and starting values for σ and ρ were set to 1.5 and 0, respectively.

As will be described in Section 5.4.2, we estimated random effects through a Monte Carlo approximation of their posterior mean: û_t = E[u_t | y]. The scatter plot in Figure 3-3 shows good agreement in a comparison of the realized random effects u₁, …, u_T from the simulation and the estimated random effects û₁, …, û_T from the model. Note, though, that the standard deviation of the estimated random effects is equal to 1.60 (as compared to the realized standard deviation of 1.95), showing that the estimated random effects are less variable, and note the general shrinkage effect (compare the scales on the x and y axes of Figure 3-3) brought along by using posterior mean estimates.


77 "''""~~I>. "'"'1, "' "'t,.L;.b.t,.6 j "' ~t>.~I "'"'"' u 2 &, ... tj..,. . '.... "' "' "' ~"' &, : !~~ttL''"'"' E 0 111-.:. t! ... ,:, C 0 I,"' t I, (U \"' 'a lo "' ... ... "' &11 ,:, &, 1 .. 1 .. ~r (U "'L;. "' "' "' E "' L;.t,. I, ..,.t "' I>. L;. -2 I>. I, I,"' & "' &, "' ? L;.~~,e I>. """ "' t,.tf 'Al"'e "' I, I, L;. I, "'"' "' -~ -7 -5 -3 -1 3 5 7 realiz:ed random effects Figure 3 3: Realized (simulated) random effects u 1 ... UT versus estimated ran dom effects u 1 ... UT. Figure 3 4 reveals some differences due to the fact that estimated random effects are based on the posterior distribution of u I y. Therefore, estimated random effects are only of limited use in checking assumptions on the true random effects. Only when their behavior is grossly unexpected compared to the assumed structure of the underlying latent random process may they serve as an indication of model inappropriateness. Related remarks are given by Verbeke and Molenberghs (2000) who generate data in a linear mixed models assuming a mixture of two normal distributed random effects resulting in a bimodal distribution. There also the plot of posterior mean estimates of the random effects from a model that misspecified the random effects distribution does not reveal that anything went wrong. We repeated above simulation 100 times using the same specifications starting values and convergence criteria as mentioned above. Each of the 100 generated binary time series of length 400 was fit using the MCEM algorithm. Table 3 1 shows the average ( over the 100 generated time series) of the fixed


[Figure 3-4: Comparing simulated and estimated random effects. Sample autocorrelation (first row) and partial autocorrelation (second row) functions for the realized (simulated) random effects u₁, …, u_T (first column) and the estimated random effects û₁, …, û_T (second column).]


Table 3-1 shows the average (over the 100 generated time series) of the fixed parameter and variance component estimates and their average estimated standard errors. On average, the GLMM estimates of the fixed effects α and β and the variance components are very close to the true parameters, although the true lag-1 correlation of the random effects is underestimated by 6.3%. Table 3-1 also displays, in parentheses, the standard deviations of all estimated parameters over the 100 replications. Comparing these to the theoretical estimates of the asymptotic standard errors, we see good agreement. This suggests that the procedure for finding standard errors that we described and implemented (via Louis's (1982) formula) in our MCEM algorithm works well. In 5 (5%) out of the 100 simulations, the approximation of the asymptotic covariance matrix by Monte Carlo methods resulted in a negative definite matrix. For these simulations, a larger Monte Carlo sample after convergence of the MCEM algorithm (the default was 20,000) might be necessary. It is also interesting to note that, of the 95 simulations with positive definite covariance matrix, 6 (6.3%) resulted in a non-significant (based on a 5% level Wald test) estimate of the regression coefficient β under the GLMM with autoregressive random effects, while none was declared non-significant with the GLM approach. Estimates and standard errors for a corresponding GLM fit are also provided in Table 3-1. The average Monte Carlo sample size at the final iteration of the MCEM algorithm was 1,200, although highly dispersed, ranging from 210 to 21,000. The average computation time (on a mobile Pentium III, 600 MHz processor with 256MB RAM) to convergence, including estimating the covariance matrix, was 73 minutes.

We ran two other simulation studies, now with a shorter length of only T = 100 observations and a true lag-1 correlation of 0.6 and −0.8, respectively. All other parameters remained unchanged. These results are also summarized in Table 3-1.


Table 3-1: A simulation study for a logistic GLMM with autoregressive random effects.

               α       β       σ       ρ       s.e.(α)  s.e.(β)  s.e.(σ)  s.e.(ρ)
  T = 400
  True:        1       1       2       0.8
  GLM:         0.64    0.63                    0.11     0.12
              (0.20)  (0.14)                  (0.01)   (0.01)
  GLMM:        1.07    1.02    2.08    0.75    0.37     0.26     0.61     0.10
              (0.32)  (0.20)  (0.25)  (0.06)  (0.12)   (0.23)   (0.40)   (0.05)
  T = 100
  True:        1       1       2       0.6
  GLM:         0.69    0.70                    0.23     0.25
              (0.38)  (0.27)                  (0.02)   (0.03)
  GLMM:        1.09    1.07    1.99    0.51    0.58     0.47     1.35     0.26
              (0.58)  (0.37)  (0.39)  (0.20)  (0.33)   (0.27)   (1.15)   (0.18)
  T = 100
  True:        1       1       2      −0.8
  GLM:         0.65    0.61                    0.22     0.24
              (0.21)  (0.26)                  (0.01)   (0.03)
  GLMM:        1.04    0.96    2.00   −0.75    0.42     0.51     1.04     0.16
              (0.29)  (0.34)  (0.53)  (0.13)  (0.32)   (0.99)   (1.04)   (0.13)

Average and standard deviation (in parentheses) of fixed effects, variance components and their standard error estimates from a GLM and a GLMM with latent AR(1) process. The two models were fitted to each of 100 generated binary time series of length T = 400 and T = 100.

Again, we observe that the estimated parameters are very close to the true ones, but on average the correlation was underestimated by 15% and 6.3%, respectively. However, the sampling errors of the correlation parameters (shown in parentheses in Table 3-1) were large enough to include the true values. Since our methods are general enough to handle unequally spaced data, we repeated the first simulation with a time series of T = 400 binary observations, but now randomly deleted 10% of the observations to create random gaps in the series. We left all parameters and the model for the conditional odds unchanged, except that we now assume that the random effects follow the latent random autoregressive process u_{t+1} = ρ^{d_t} u_t + ε_t, where the ε_t are i.i.d. N(0, σ²[1−ρ^{2d_t}]) and d_t is the difference (in the units of measurement) between the time points associated with the observations at times t and t+1.


Table 3-2: Simulation study for modeling unequally spaced binary time series.

               α       β       σ       ρ       s.e.(α)  s.e.(β)  s.e.(σ)  s.e.(ρ)
  T = 360, unequally spaced
  True:        1       1       2       0.8
  GLM:         0.61    0.62                    0.12     0.12
              (0.19)  (0.11)                  (0.00)   (0.01)
  GLMM:        1.03    1.00    2.07    0.75    0.38     0.28     0.71     0.11
              (0.29)  (0.16)  (0.25)  (0.06)  (0.19)   (0.28)   (0.80)   (0.18)

Average and standard deviation (in parentheses) of fixed effects, variance components and their standard error estimates from a GLM and a GLMM with latent autoregressive random effects accounting for unequally spaced observations. The two models were fitted to each of 100 generated binary time series of length T = 360, with random gaps of random length between observations.

For example, the first series we generated had 1 gap of length three (i.e., d_t = 4 for one t), 4 gaps of length two (i.e., d_t = 3 for 4 t's) and 29 gaps of length one (i.e., d_t = 2 for 29 t's). For all other t's, d_t = 1; i.e., they are successive observations and the difference between two of them is one unit of measurement. Simulation results are shown in Table 3-2 and reveal that our proposed methods and algorithm also work well for an unequally spaced binary time series. All true parameters are included in confidence intervals based on the average of the estimated parameters from the 100 replicated series and its standard deviation (shown in parentheses in Table 3-2).


CHAPTER 4
MODEL PROPERTIES FOR NORMAL, POISSON AND BINOMIAL OBSERVATIONS

So far we have discussed models for discrete-valued time series data in a very broad manner. In Chapter 2 we developed the likelihood for our models, based on generic distributions f(y|u) for observations y and g(u) for random effects u, and presented an algorithm for finding maximum likelihood estimates. Chapter 3 looked at two special cases of random effects distributions useful for describing temporal or spatial dependencies. In this chapter we make specific distributional assumptions about the observations and develop some of the theory underlying the models we propose. We will pay special attention to data in the form of a single (sometimes considered generic) time series Y = (Y₁, …, Y_T)′ and derive marginal properties implied by the conditional model formulation. Multiple independent time series Y₁, …, Y_n can result from replication of the original time series, or from stratification of the sampled population, such as in the example about homosexual relationships. All derivations given below for a generic time series Y still hold for the i-th series Y_i = (Y_{i1}, Y_{i2}, …, Y_{iT}), provided the same latent process {u_t} is assumed to underlie each one of them.

An important characteristic of any time series model is its implied serial dependency structure. In the case of normal theory time series models, this is specified by the autocorrelation function. In Section 4.1 we derive the implied marginal autocorrelation function for GLMMs with normal random components and either an equal correlation or an autoregressive assumption for the random effects. With these assumptions our models are special cases of the linear mixed models discussed, for instance, in Diggle et al. (2002).


In Sections 4.2 and 4.3 we explore marginal properties of GLMMs with Poisson and binomial random components that are induced by assuming equally correlated or autoregressive random effects. In Chapter 5 these model properties, such as the implied autocorrelation function, are then compared to empirical counterparts based on the observed data to evaluate the proposed model.

Section 2.1 mentioned that parameters in GLMMs have a conditional interpretation, controlling for the random effects. Correlated random effects vary over time, and parameter interpretation is different from having just one common level of a random effect, as in many standard random intercepts GLMMs. For each of the models presented here we discuss parameter interpretation in a separate section.

4.1 Analysis for a Time Series of Normal Observations

Suppose that, conditional on time specific normal random effects {u_t}, the observations {Y_t} are independent N(μ_t + u_t, τ²). The marginal likelihood for this model is tractable because, marginally, the joint distribution of {Y_t} is multivariate normal with mean μ = (μ₁, …, μ_T)′ and covariance matrix Σ_u + τ²I, where Σ_u is the covariance matrix of the joint distribution of {u_t}. With the usual assumption that var(u_t) = σ², the marginal variance of Y_t is given by var(Y_t) = τ² + σ², and the marginal correlation function ρ(t, t*) for the case of equally correlated random effects (cf. Section 3.2) has form

ρ(t, t*) = corr(Y_t, Y_{t*}) = σ²ρ/(τ² + σ²),   (4.1)

while for the case of autocorrelated random effects (cf. Section 3.3) it has form

ρ(t, t*) = corr(Y_t, Y_{t*}) = σ²ρ^{Σ_{k=t}^{t*−1} d_k}/(τ² + σ²).   (4.2)
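Equation (4.1) is easy to confirm by simulation. The sketch below (illustrative; variable names are ours) draws many replications of a pair of equally correlated random effects, generates the corresponding normal observations, and compares the empirical correlation with (4.1).

    import numpy as np

    rng = np.random.default_rng(2)
    n_rep, tau, sigma, rho = 5000, 1.0, 1.5, 0.7

    # equally correlated random effects at two fixed time points
    z = rng.normal(size=n_rep)                             # common factor
    u1 = np.sqrt(rho) * z + np.sqrt(1 - rho) * rng.normal(size=n_rep)
    u2 = np.sqrt(rho) * z + np.sqrt(1 - rho) * rng.normal(size=n_rep)
    y1 = sigma * u1 + tau * rng.normal(size=n_rep)
    y2 = sigma * u2 + tau * rng.normal(size=n_rep)
    print(np.corrcoef(y1, y2)[0, 1],                       # empirical corr
          sigma**2 * rho / (tau**2 + sigma**2))            # formula (4.1)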


If the distances between time points are equal, then (4.2) is more conveniently written in terms of the lag h between observations as

ρ(h) = corr(Y_t, Y_{t+h}) = σ²ρ^h/(τ² + σ²).

For both cases, note that the autocorrelations (4.1) and (4.2) are smaller than the corresponding ones assumed for the underlying latent process {u_t}, by a factor of σ²/(τ²+σ²). For equally correlated random effects, the marginal covariance matrix has form τ²I + σ²[(1−ρ)I + ρJ], implying equal marginal correlations between any two members Y_t and Y_{t*} of {Y_t}. (This can also be seen from (4.1), where the autocorrelations do not depend on t or t*.) Diggle et al. (2002, Sec. 5.2.2) call this a model with serial correlation plus measurement error.

Similar properties can be observed in the case of autocorrelated random effects: the basic structure of correlations decaying in absolute value with increasing distances between observation times (as measured by Σ d_k or h) is preserved marginally. However, the first-order Markov property of the underlying autoregressive process is not preserved in the marginal distribution of {Y_t}, which can be proved by calculating conditional distributions. For instance, for three (T = 3) equidistant time points, the conditional mean of Y₃ given Y₁ = y₁ and Y₂ = y₂ is equal to

E[Y₃ | y₁, y₂] = μ₃ + ( σ²ρ / [(τ²+σ²)² − (σ²ρ)²] ) { ρτ²(y₁ − μ₁) + [τ² + σ²(1−ρ²)](y₂ − μ₂) }

and depends on y₁. It should be noted that, in the case of independent random effects with Σ_u = σ²I, marginally the Y_t's are also independent, but with overdispersed variances τ² + σ² relative to their conditional distribution. This case can be seen as a special case of the equally correlated model and the autoregressive model when ρ = 0.


The traditional assumption in random intercepts models is a common random effect u = u_t for all time points t; i.e., conditional on a N(0, σ²) random effect u, Y_t is N(μ_t + u, τ²) for t = 1, …, T. For this case the marginal covariance matrix has form τ²I + σ²J. This can be derived directly, or inferred from the marginal correlation expressions (4.1) and (4.2) by setting ρ = 1, implying perfect correlation among the {u_t}. Hence the random intercepts model is a special case of the equally correlated or autoregressive model when ρ = 1. It implies a constant (exchangeable) marginal correlation of σ²/(τ² + σ²) between any two observations Y_t and Y_{t*}.

4.1.1 Analysis via Linear Mixed Models

In a GLMM we try to provide some structure for the unknown mean component μ_t by using covariates x_t. Let x_t′β be a linear predictor for μ_t, with β denoting a fixed effects parameter vector for the covariates x_t. Using an identity link, the series {Y_t} then follows a GLMM with conditional mean function E[Y_t | u_t] = x_t′β + u_t. The model can be written as Y_t = x_t′β + u_t + ε_t, where ε_t ~ N(0, τ²), independent of u_t. Then the models discussed here are special cases of mixed effects models (Verbeke and Molenberghs, 2000) with general matrix form

Y = Xβ + Zu + ε.

In our case Y = (Y₁, …, Y_T)′ is the time series vector and X = (x₁, …, x_T)′ is the overall design matrix with associated parameter β. The design matrix Z for the random effects u′ = (u₁, …, u_T) simplifies to the identity matrix I_T. The distributional assumption on the random effects is u ~ N(0, Σ_u), and they are independent of the N(0, τ²I) distributed errors ε. Exploiting this relationship, software for fitting models of this kind (i.e., correlated normal data with structured covariance matrix of form var(Y) = ZΣ_uZ′ + τ²I) is readily available, for instance in the form of the SAS procedure proc mixed, where the equal correlation structure


and the autoregressive structure are only two out of many possible choices for the covariance matrix Σ_u of the random effects distribution. Mixed effects models are very popular for the regression analysis of shorter time series, like growth curve models or data from longitudinal studies. In Section 5.1 we illustrate an application by analyzing the motivating example of Section 3.1 about attitudes towards homosexual relationships, based on a normal approximation to the log odds.

4.1.2 Parameter Interpretation

Parameters in normal time series models retain their interpretation when averaging over the random effects distribution. The interpretation of β as the change in the mean for a change in the covariates is valid conditional on the random effects and also marginally. The random effects parameters only contribute to the variance-covariance structure of the marginal distribution, inducing overdispersion and correlation relative to the conditional assumptions.

4.2 Analysis for a Time Series of Counts

Suppose now that, conditional on time specific normal random effects {u_t}, the observations {Y_t} are independent counts, which we model as Poisson random variables with mean μ_t. Using a log link, explanatory variables x_t and correlated random effects {u_t}, we specify the conditional mean structure of a Poisson GLMM as

log(μ_t) = x_t′β + u_t,   t = 1, …, T.   (4.3)

The correlation in the random effects allows the log-means to be correlated over time or space. The marginal likelihood corresponding to this model is given by

L(β, ψ; y) ∝ ∫_{R^T} Π_{t=1}^T μ_t^{y_t} exp{−μ_t} g(u; ψ) du = ∫_{R^T} exp{ Σ_{t=1}^T [y_t(x_t′β + u_t) − exp{x_t′β + u_t}] } g(u; ψ) du,


where g(u; ψ) is one of the random effects distributions of Chapter 3. In that case the integral is not tractable, and numerical methods such as the MCEM algorithm of Section 2.3 must be used to find maximum likelihood estimates of β and ψ. For this, the function Q₁* defined in (2.16) has form

Q₁*(β | β^{(k−1)}) ∝ (1/m) Σ_{j=1}^m Σ_{t=1}^T [ y_t(x_t′β + u_t^{(j)}) − exp{x_t′β + u_t^{(j)}} ],

where u_t^{(j)} is the t-th element of the j-th generated sample u^{(j)} from the posterior distribution h(u | y; β^{(k−1)}, ψ^{(k−1)}). Note that here we discuss only the case of a generic time series {Y_t} with no replication; hence n = 1 (i.e., the index i is redundant) and n₁ = T in the general form presented in (2.16). If replications are available, or in the case where two time series differ in the fixed effects part but not in the random effects (e.g., have the same underlying latent process), then one simply needs to include the sum over the replicates, as indicated in (2.16). Choosing one of the correlated random effects distributions of Chapter 3, the Gibbs sampling algorithms developed in Sections 3.4.1 and 3.4.2 can be used to generate the sample from h(u | y), with f(y_t | u_t) having the form of a Poisson density with mean μ_t.

4.2.1 Marginal Model Implied by the Poisson GLMM

As with the normal GLMMs before, marginal first and second moments can be obtained by integrating over the random effects distribution, although here the complete marginal distribution of Y_t is not tractable as it is in the normal case. The random effects appearing in model (4.3) imply that the conditional log-means {log(μ_t)} are random quantities. Assuming that the random effects {u_t} are normal with zero mean and variance var(u_t) = σ², the log-means have expectations {x_t′β} and variance σ². For two distinct time points t and t*, their correlation under an independence, equal correlation or autocorrelation assumption on the random effects is given by 0, ρ or ρ^{Σ_{k=t}^{t*−1} d_k}, respectively. (Remember that d_k denotes the time difference between two successive observations y_k and y_{k+1}.)


On the original scale, the means μ_t = exp{x_t′β + u_t} have expectation, variance and correlation given by

E[μ_t] = exp{x_t′β + σ²/2},
var(μ_t) = exp{2(x_t′β + σ²/2)} (e^{σ²} − 1),
corr(μ_t, μ_{t*}) = (e^{cov(u_t, u_{t*})} − 1)/(e^{σ²} − 1).

Plugging in cov(u_t, u_{t*}) = 0, σ²ρ or σ²ρ^{Σ_{k=t}^{t*−1} d_k} yields the marginal correlations among the means when assuming independent, equally correlated or autoregressive random effects, respectively.

4.2.1.1 Marginal distribution of Y_t

Now let's turn to the marginal distribution of Y_t itself, for which we can only derive moments. The marginal mean and variance of Y_t are given by

E[Y_t] = E[μ_t] = exp{x_t′β + σ²/2},   (4.4)
var(Y_t) = E[μ_t] + var(μ_t) = E[Y_t] [1 + E[Y_t](e^{σ²} − 1)].

Hence the log of the marginal mean still follows a linear model with fixed effects parameters β, but with an additional offset σ²/2 added to the intercept term. (This is not particular to the Poisson assumption, but is true for any loglinear random effects model of form (4.3) with more general random effects structure z_t′u_t; see Problem 13.42 in Agresti, 2002.) The marginal distribution of Y_t is not Poisson, since the variance exceeds the mean by a factor of [1 + E[Y_t](e^{σ²} − 1)]; the marginal variance is a quadratic function of the marginal mean. For two distinct time points t and t*, the marginal covariance between the observations Y_t and Y_{t*} is given by

cov(Y_t, Y_{t*}) = cov(μ_t, μ_{t*}) = E[Y_t] E[Y_{t*}] (e^{cov(u_t, u_{t*})} − 1).   (4.5)
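Both (4.4) and (4.5) can be checked by Monte Carlo. The following sketch (ours, with arbitrary parameter values) simulates a pair of Poisson counts whose log-means share correlated normal random effects.

    import numpy as np

    rng = np.random.default_rng(4)
    m, eta_t, eta_s, sigma, rho = 200_000, 0.5, 0.2, 1.0, 0.6

    # bivariate normal random effects with correlation rho, m replications
    u_t = rng.normal(0, sigma, m)
    u_s = rho * u_t + rng.normal(0, sigma * np.sqrt(1 - rho**2), m)
    y_t = rng.poisson(np.exp(eta_t + u_t))
    y_s = rng.poisson(np.exp(eta_s + u_s))

    print(y_t.mean(), np.exp(eta_t + sigma**2 / 2))              # check (4.4)
    EY_t = np.exp(eta_t + sigma**2 / 2)
    EY_s = np.exp(eta_s + sigma**2 / 2)
    print(np.cov(y_t, y_s)[0, 1],
          EY_t * EY_s * (np.exp(sigma**2 * rho) - 1))            # check (4.5)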


In the case where the random effects {u_t} are assumed independent, the marginal covariance is zero. In longitudinal studies, usually each replicated time series has its own univariate random effect attached to it. For such a time series {Y_t}, assume a single common random effect u ~ N(0, σ²) shared by all observations in the series; i.e., in the notation used above, u = u_t for all t, and model (4.3) has form log(μ_t) = x_t′β + u. Then cov(u_t, u_{t*}) = var(u) = σ², and the marginal correlation between any two members of the time series is given by

corr(Y_t, Y_{t*}) = (E[Y_t]E[Y_{t*}])^{1/2} (e^{σ²} − 1) / { [1 + E[Y_t](e^{σ²} − 1)]^{1/2} [1 + E[Y_{t*}](e^{σ²} − 1)]^{1/2} }.   (4.6)

This is the exchangeable correlation structure implied by a random intercepts Poisson GLMM (see, e.g., Agresti, 2002, pages 564 and 575). In Chapter 2 we motivated and proposed correlated random effects {u_t} to facilitate other correlation structures. We will now derive marginal correlation properties for a time series of counts based on our conditional Poisson GLMM approach, using equally correlated or autoregressive random effects. This is easily done by plugging in for cov(u_t, u_{t*}) in (4.5) above. The equal correlation assumption cov(u_t, u_{t*}) = σ²ρ leads to the marginal structure

corr(Y_t, Y_{t*}) = (E[Y_t]E[Y_{t*}])^{1/2} (e^{σ²ρ} − 1) / { [1 + E[Y_t](e^{σ²} − 1)]^{1/2} [1 + E[Y_{t*}](e^{σ²} − 1)]^{1/2} },   (4.7)

still implying equal (but possibly negative) correlations. The autoregressive random effects approach, with cov(u_t, u_{t*}) = σ²ρ^{Σ_{k=t}^{t*−1} d_k}, leads to a correlation function decaying with time,

corr(Y_t, Y_{t*}) = (E[Y_t]E[Y_{t*}])^{1/2} (e^{σ²ρ^{Σ_{k=t}^{t*−1} d_k}} − 1) / { [1 + E[Y_t](e^{σ²} − 1)]^{1/2} [1 + E[Y_{t*}](e^{σ²} − 1)]^{1/2} }.


In the case of equally spaced observations (d_k = 1 for all k), this is more conveniently written in terms of the lag h between two observations:

corr(Y_t, Y_{t+h}) = (E[Y_t]E[Y_{t+h}])^{1/2} (e^{σ²ρ^h} − 1) / { [1 + E[Y_t](e^{σ²} − 1)]^{1/2} [1 + E[Y_{t+h}](e^{σ²} − 1)]^{1/2} }.   (4.8)

Note that if ρ = 1, i.e., perfect correlation between random effects, all correlation structures reduce to the random intercept model with correlation structure (4.6). However, with |ρ| < 1 and h → ∞, (4.8) accommodates decaying correlations, and with ρ < 0, (4.7) accommodates negative correlation. In Section 5.3 we will fit a Poisson GLMM to a time series of counts and use the marginal properties derived here to assess and interpret the regression model.

4.2.1.2 Negative Binomial GLMMs

An alternative to the Poisson assumption as the conditional distribution for the counts is to use a negative binomial distribution. The negative binomial distribution per se already allows for overdispersion relative to the mean. A second source of overdispersion is then introduced by regarding the (log-) mean of a negative binomial random variable as a normal mixture. Correlated random effects allow these means to be connected over time. Booth et al. (2004) look at negative binomial GLMMs with independent (over time) random effects. The anchovy larvae data analyzed there form a time series of correlated counts, and autoregressive random effects seem an appropriate alternative to the independent ones used by Booth et al. Using the parametrization of the negative binomial distribution as discussed in Sec. 13.4 of Agresti (2002), let Y_t be negative binomial with mean μ_t and variance μ_t + μ_t²/k, conditional on random effects u_t. (For fixed k, the negative binomial distribution is a member of the exponential family of distributions.) We consider cases where the dispersion parameter k is the same for all observations. As in the Poisson GLMM presented before, we propose the following loglinear model for the


conditional mean of Y_t given u_t:

log(E[Y_t | u_t]) = log(μ_t) = x_t′β + u_t,   t = 1, …, T,

where {u_t} follows one of the random processes discussed in Chapter 3. Marginally, this leads to the same expectations, variances and covariances for the conditional log-means and means as discussed in the Poisson model above. Furthermore, since (4.4) holds for any loglinear model, it also holds for the negative binomial loglinear model, and the marginal means of the Poisson GLMM and the negative binomial GLMM coincide. However, the marginal variance under the negative binomial assumption is given by

var(Y_t) = E[μ_t + μ_t²/k] + var(μ_t) = E[Y_t] [ 1 + E[Y_t] ( ((k+1)/k) e^{σ²} − 1 ) ],

which for k → ∞ approaches the variance of the Poisson GLMM. The difference between the variance under a negative binomial assumption and a Poisson assumption is (e^{σ²}/k) (E[Y_t])². Similarly, for each one of the random effects structures discussed under the Poisson GLMM, the formulas for the marginal correlations presented for the Poisson GLMM hold true when e^{σ²} in the denominator of each equation is replaced by ((k+1)/k) e^{σ²}. When these implied marginal properties are more plausible than the corresponding ones from a Poisson GLMM, as judged, for instance, by a comparison of the approximated maximized likelihoods or by a comparison of empirical estimates to model based estimates, then the negative binomial GLMM is a relevant alternative. Note that, as with any GLMM, the negative binomial GLMM results from the hierarchy

Y_t | u_t ~ ind. neg. bin.(μ_t, k),   t = 1, …, T,
(u₁, …, u_T) ~ N(0, Σ_u),


and the marginal correlation between the counts {Y_t} arises because of the correlations in the underlying random effects u_t, which appear in the model for the log-means.

4.2.2 Parameter Interpretation

As long as {u_t} is a mean-stationary random process (we assume a mean of zero throughout), it follows from (4.4) that all parameters except the intercept have equal interpretations conditionally on the random effects and marginally. For a Gaussian random process with variance σ², the intercept itself is set off by a factor of σ²/2. Hence, all parameters except the intercept can be interpreted as effects on the conditional or marginal log-mean. In particular, for any member β_j of β, e^{β_j} is the ratio of two conditional or marginal means after a one unit change in the covariate associated with β_j.

4.3 Analysis for a Time Series of Binomial or Binary Observations

Suppose that, conditional on a time specific normal random effect u_t, the observations {Y_st}_{s=1}^{n_t} are independent and identically distributed binary random variables with conditional success probability π_t(u_t) = P(Y_st = 1 | u_t), t = 1, …, T. Consequently, the sum Y_t = Σ_{s=1}^{n_t} Y_st has a conditional binomial(n_t, π_t(u_t)) distribution. Furthermore, given random effects u_t and u_{t*} at two different time points t and t*, Y_t and Y_{t*} are conditionally independent. Using a logit link, time-specific explanatory variables {x_t} and correlated random effects {u_t}, we specify the conditional mean structure of a binomial GLMM as

logit(π_t(u_t)) = x_t′β + u_t,   t = 1, …, T.   (4.9)

Correlated random effects allow correlation of the conditional log odds over different time points or locations. For observed data y = (y₁, …, y_T), the marginal


likelihood corresponding to this model is given by

L(β, ψ; y) ∝ ∫_{R^T} Π_{t=1}^T [π_t(u_t)]^{y_t} [1 − π_t(u_t)]^{n_t − y_t} g(u; ψ) du = ∫_{R^T} exp{ Σ_{t=1}^T y_t(x_t′β + u_t) } Π_{t=1}^T (1 + exp{x_t′β + u_t})^{−n_t} g(u; ψ) du,

where g(u; ψ) is one of the random effects distributions of Chapter 3. The integral is not tractable, and numerical methods such as the MCEM algorithm of Section 2.3 must be used to find maximum likelihood estimates of β and ψ. The function Q₁* defined in (2.16) now has form

Q₁*(β | β^{(k−1)}) ∝ (1/m) Σ_{j=1}^m Σ_{t=1}^T [ y_t(x_t′β + u_t^{(j)}) − n_t log(1 + exp{x_t′β + u_t^{(j)}}) ],

where u_t^{(j)} is the t-th element of the j-th generated sample u^{(j)} from the posterior distribution h(u | y). As before, note that here we discuss only the case of a generic time series {Y_t} with no replication; hence n = 1 (i.e., the index i is redundant) and n₁ = T in the general form presented in (2.16). Again, if replications are available, or in the case where two time series differ in the fixed effects part but not in the random effects (e.g., have the same underlying latent process), then one simply needs to include the sum over the replicates, as indicated in (2.16). An example where we assumed that two series differ in their fixed effects parameters but share the same underlying latent process {u_t} is the motivating example of Section 3.1, with one binomial time series for each of the white and black respondents. Choosing one of the correlated random effects distributions of Chapter 3, the Gibbs sampling algorithms developed in Sections 3.4.1 and 3.4.2 can be used to generate the sample from h(u | y), with f(y_t | u_t) having the form of a binomial(n_t, π_t(u_t)) density.
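As an illustration, assuming the posterior draws u^{(j)} are stored in an m × T array, Q₁* for the binomial-logit model can be evaluated as below; the function name q1_star is ours, and maximizing it over β with any numerical optimizer constitutes the M-step for the fixed effects.

    import numpy as np

    def q1_star(beta, y, n, X, u_samples):
        """Monte Carlo E-step quantity Q1* for the binomial-logit GLMM:
        average of the conditional log-likelihood over posterior draws u^(j)."""
        eta = X @ beta + u_samples              # shape (m, T) linear predictors
        return np.mean((y * eta - n * np.log1p(np.exp(eta))).sum(axis=1))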


4.3.1 Marginal Model Implied by the Binomial GLMM

Marginal properties are harder to derive than in the normal or Poisson case, because the conditional mean π_t(u_t) is not a linear or exponential function of the random effects. Assuming zero-mean random effects {u_t} with variance var(u_t) = σ², the conditional log odds {logit(π_t(u_t))} have means {x_t′β} and variance σ². For two distinct time points t and t*, the correlation between the conditional log odds at times t and t*, under an independence, equal correlation or autocorrelation assumption on the random effects, is given by 0, ρ or ρ^{Σ_{k=t}^{t*−1} d_k}, respectively. We will refer to E[logit(π_t(u_t))] = x_t′β as the unconditional or expected log odds, ones that do not depend on the random effects. It is perhaps more natural to investigate the unconditional or expected odds of success, E[π_t(u_t)/(1 − π_t(u_t))], since interpretation is on the natural scale. They are given by exp{x_t′β + σ²/2}, and any member of exp{β} can be interpreted as the change in the expected (i.e., averaged over random effects) odds of success for a unit change in the corresponding member of x_t. Alternatively, log(π_t^M/(1 − π_t^M)) ≈ x_t′β/√(1 + σ²) (derived in subsequent sections) is the logit of the marginal probabilities implied by the conditional model, and the effects of the parameters are seen to be down-weighted when using this function as the quantity of interest. Often this is the preferred measure, and we will derive it in the next section. Further discussion of the interpretation of parameters in GLMMs with time dependent random effects is provided in Section 4.3.3.

Let's turn now to the marginal distribution of Y_t, the sum of the n_t binary variables Y_st, and their dependence structure over time. At time t, the binary variables Y_st have marginal mean π_t^M = E[Y_st] = E[π_t(u_t)], variance var(Y_st) = π_t^M(1 − π_t^M) and constant covariance cov(Y_st, Y_{s*t}) = var(π_t(u_t)), which is a function of σ. By sharing a common random effect u_t at time t, the observations {Y_st}_{s=1}^{n_t} are marginally dependent, with an exchangeable correlation structure. Correlated random effects


{u_t} induce a second, time related dependency: for binary observations Y_st and Y_{s*t*} at two different time points t and t*, cov(Y_st, Y_{s*t*}) = cov(π_t(u_t), π_{t*}(u_{t*})), which depends on the assumed covariance of the random effects u_t and u_{t*}. As a consequence of the marginal dependence among the binary variables at a common time point t (within-type dependency), their sum Y_t = Σ_{s=1}^{n_t} Y_st shows overdispersion relative to a binomial random variable. Its mean and variance are given by

E[Y_t] = n_t π_t^M,
var(Y_t) = n_t π_t^M (1 − π_t^M) + n_t(n_t − 1) cov(Y_st, Y_{s*t}),

which can be evaluated by using the methods of the next section. As a consequence of the marginal dependency between binary variables at times t and t* (between-type dependency), their respective sums Y_t and Y_{t*} are correlated according to

corr(Y_t, Y_{t*}) = n_t n_{t*} cov(π_t(u_t), π_{t*}(u_{t*})) / [var(Y_t) var(Y_{t*})]^{1/2},

which again can be evaluated using the methods of the next section.

4.3.2 Approximation Techniques for Marginal Moments

4.3.2.1 Approximation based on Taylor series expansions

To evaluate moments of the marginal distribution of Y_t, we will first use a second-order Taylor series expansion of π_t(u_t) around the mean E[u_t] = 0 of the random effect u_t. It is given by

π_t(u_t) ≈ 1/(1 + exp{−x_t′β}) + [exp{−x_t′β}/(1 + exp{−x_t′β})²] u_t + [(exp{−2x_t′β} − exp{−x_t′β})/(2(1 + exp{−x_t′β})³)] u_t².


Using this expansion, we can approximate the marginal mean, variance and covariance of the conditionally specified success probabilities:

E[π_t(u_t)] ≈ [1/(1 + exp{−x_t′β})] [ 1 + ((exp{−2x_t′β} − exp{−x_t′β})/(2(1 + exp{−x_t′β})²)) σ² ],   (4.10)

var(π_t(u_t)) ≈ (exp{−x_t′β}/(1 + exp{−x_t′β})²)² [ σ² + ((1 − exp{−x_t′β})/(1 + exp{−x_t′β}))² σ⁴/2 ],   (4.11)

cov(π_t(u_t), π_{t*}(u_{t*})) ≈ (exp{−(x_t + x_{t*})′β}/[(1 + exp{−x_t′β})²(1 + exp{−x_{t*}′β})²]) × [ cov(u_t, u_{t*}) + ((1 − exp{−x_t′β})(1 − exp{−x_{t*}′β}))/((1 + exp{−x_t′β})(1 + exp{−x_{t*}′β})) cov(u_t², u_{t*}²) ].   (4.12)

For the last two expressions we used the additional assumptions that E[u_t³] = 0 and E[u_t⁴] = 3σ⁴ for all t, which, for instance, hold for the normal distribution. With independent random effects, cov(u_t, u_{t*}) = cov(u_t², u_{t*}²) = 0, and the covariance between the success probabilities is zero. Using correlated normal random effects, cov(u_t², u_{t*}²) is equal to σ⁴(ρ + ρ²/2) for equally correlated random effects, and equal to σ⁴(ρ^{Σ_{k=t}^{t*−1} d_k} + ρ^{2Σ_{k=t}^{t*−1} d_k}/2) for autoregressive random effects; this simplifies to σ⁴(ρ^h + ρ^{2h}/2) for equally spaced observations h units apart. These results are derived by evaluating the joint moment generating function of the bivariate normal distribution of (u_t, u_{t*}). Using the approximate expressions for the variance and covariance given in (4.11) and (4.12), the correlation between two success probabilities at different times t and t* is approximated, writing κ_t = (1 − exp{−x_t′β})/(1 + exp{−x_t′β}), by

corr(π_t(u_t), π_{t*}(u_{t*})) ≈ [ cov(u_t, u_{t*}) + κ_t κ_{t*} cov(u_t², u_{t*}²) ] / ( σ² [1 + κ_t² σ²/2]^{1/2} [1 + κ_{t*}² σ²/2]^{1/2} ).
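The quality of the second-order approximation (4.10) can be gauged against direct Monte Carlo integration; the sketch below is illustrative (function names are ours) and shows the behavior discussed next: the approximation is reasonable for small σ but deteriorates badly by σ = 2.5.

    import numpy as np

    rng = np.random.default_rng(5)

    def taylor_mean(eta, sigma):
        """Second-order Taylor approximation (4.10) of E[pi(u)], u ~ N(0, sigma^2)."""
        p = 1.0 / (1.0 + np.exp(-eta))
        return p * (1.0 + sigma**2 * (np.exp(-2 * eta) - np.exp(-eta))
                    / (2.0 * (1.0 + np.exp(-eta))**2))

    def mc_mean(eta, sigma, m=1_000_000):
        """Monte Carlo evaluation of E[pi(u)] for comparison."""
        u = rng.normal(0.0, sigma, m)
        return np.mean(1.0 / (1.0 + np.exp(-(eta + u))))

    for sigma in (0.5, 1.0, 2.5):
        print(sigma, taylor_mean(2.0, sigma), mc_mean(2.0, sigma))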


Plugging in the expressions for cov(u_t, u_{t*}) and cov(u_t², u_{t*}²) according to the random effects assumption yields the approximation of the implied marginal correlation between the success probabilities at times t and t*. Many authors (e.g., Zeger, Liang and Albert, 1988) only use a first-order Taylor expansion, for which the second terms in the large square brackets of (4.10)-(4.12) vanish. Then the approximation for the correlation between the conditional success probabilities at times t and t* simplifies to

corr(π_t(u_t), π_{t*}(u_{t*})) ≈ cov(u_t, u_{t*})/σ²,

and is ρ for equally correlated random effects and ρ^{Σ_{k=t}^{t*−1} d_k} for autoregressive random effects. In that case, the success probabilities directly inherit the correlation properties from the underlying latent random process.

4.3.2.2 Cumulative Gaussian approximation

An alternative approximation for E[π_t(u_t)] is given by Zeger, Liang and Albert (1988), who use a cumulative Gaussian approximation to the logistic function (Johnson and Kotz, 1970, p. 6) to derive

E[π_t(u_t)] ≈ 1/(1 + exp{ −x_t′β/√(1 + (cσ)²) }),   (4.13)

where c is the constant 16√3/(15π) ≈ 1/1.7. They use this expression for the marginal mean, together with an approximation for the marginal covariance matrix based on a first-order Taylor series expansion of π_t(u_t) (outlined above), to motivate a GEE approach to fitting GLMMs.

Approximations based on the Taylor series expansion are only accurate for random effects close to their mean value of 0, as measured by small values of σ. Figure 4-1 plots the approximation of the marginal probability π_t^M = E[π_t(u_t)] given in (4.10) over a range of (−4, 4) for the fixed part x_t′β of the linear predictor, for various values of σ. The dotted curve in Figure 4-1 corresponds to σ = 2.5.


It clearly shows that the approximation is useless for such a value of σ, since the approximated marginal probability loses the fundamental property of monotonicity in the linear predictor. For even larger values of σ, formula (4.10) shows that the approximation can even be greater than 1 or less than 0. This problem is also not alleviated by including further terms in the Taylor series, or by expanding the series around a point other than zero, with the sign depending on whether one expects a positive or negative value for u_t. Unfortunately, large values of σ are the rule rather than the exception in models with autoregressive random effects. (See the remark by Diggle et al., 2002, p. 239, and the examples in Chapter 5.) On the other hand, (4.13) does not suffer these drawbacks. As σ increases, the conditional success probabilities have distributions increasingly concentrated near 0 and near 1. Averaging over these conditional probabilities, we expect a marginal probability of 0.5, which is the limit of (4.13) when σ² → ∞. However, approximations for the marginal variance and covariance are not easy to derive with the cumulative Gaussian approximation of the logistic function, and these are the quantities we are interested in for a comparison of model based estimates and sample based estimates. The next section mentions a connection between logit and probit models which can be used to calculate all desired marginal properties for cases where σ is large.

4.3.2.3 Marginal results through probit models

In a longitudinal setting and for long sequences of binary data, Smith and Diggle (1998) propose models similar to the ones presented here, using correlated random effects, but they pursue a different fitting approach. They derive marginal moments implied by their model and construct a marginal covariance matrix of the observations, all with the intention of using the GEE methodology for estimation. Also, they use a probit link,

π_t(u_t) = Φ(x_t′β̃ + u_t),   (4.14)

to model the conditional success probability π_t(u_t) at time t, where Φ denotes the standard normal cdf.


[Figure 4-1: Approximated marginal probabilities for the fixed part predictor value x′β ranging from −4 to 4 in a logit model. Approximations are based on a second-order Taylor series expansion (first panel), the probit connection (second panel) and Monte Carlo integration over the random effects distribution (third panel). The 4 lines in each panel correspond to σ = 1, 1.5, 2, 2.5, with the dotted line corresponding to σ = 2.5.]

To distinguish these from their logit model counterparts, we use β̃ and σ̃ to denote the fixed effects parameters and the variance component in the probit model. Unlike the variance, the correlation is a scale free measurement; hence the correlation among the conditional success probabilities, as measured by ρ, is the same whether measured on the logit scale or the probit scale. Thus no new parameter for describing the correlation in the probit model is needed. The advantage of a probit link model is that marginal means and covariances can be calculated explicitly, and no approximation is needed. Smith and Diggle (1998) employ the threshold interpretation to derive these exact results. Using similar arguments, we now derive alternative approximations of marginal moments and correlations for our logit link GLMMs with correlated random effects. These results can then be compared to, or used instead of, the approximate results for the logit link derived with the Taylor expansion given above. The threshold (or latent variable) interpretation states that Y_st = 1 if and only if T_st < c for a suitable threshold value c, where {T_st} are independent N(0, 1) latent variables (and independent of any random effect in the linear predictor).


Then, under (4.14),

E[Y_st] = P(Y_st = 1) = P(T_st − u_t < x_t′β̃) = Φ( x_t′β̃ / √(1 + σ̃²) ),

since T_st − u_t has a N(0, 1 + σ̃²) distribution. The threshold interpretation, although mathematically convenient, may sometimes seem artificial. We next give an exact proof of the above result without using the threshold interpretation. (Similar proofs can be constructed for all results derived in this section.) Note that E[Y_st] = E[Φ(x_t′β̃ + u_t)]. Let g(u_t) denote the N(0, σ̃²) density of the random effect u_t, and let φ denote the standard normal density. Substituting z = s − u_t in the inner integral and interchanging the order of integration,

E[Φ(x_t′β̃ + u_t)] = ∫ [ ∫_{−∞}^{x_t′β̃ + u_t} φ(s) ds ] g(u_t) du_t = ∫_{−∞}^{x_t′β̃} [ ∫ φ(z + u_t) g(u_t) du_t ] dz.

101 Hence the inner integral giv e s th e marginal distribution of z, which is N(O 1 + o2 ) Then x'/3 [~ + a 2 ) +a 2 ) dz l x~/3 z ) d z -00 + a 2 ) giving abov e r e sult Using the thr e shold int e rpretation again the marginal joint moment of two binary variables Ys t and Y s t observed at th e same time t is given by P(T s t Ut < x~,8 Ts t Ut < x~,8) <1> 2 ( (x~,8 x~,8)' Q(t t)) wher e Q( t t ) = [ 1 + a 2 cov( Ut Ut )] c ov( Ut Ut) 1 + a 2 (4.15) and <1> 2 ((a b)' Q(t t )) is the probability that a bivariate zero-mean random variable with covarianc e matrix Q(t t ) is less than (a b)'. Summing up th e marginal prop e rties of th e binary variables {Y s t} ;;,, 1 at time t are explicitly given by (4.16) var(Ys t ) ~ M (1 ~ M) 7r t 7r t <1> 2 ((x~,8 x~,8)' Q(t t))(ir:,1)2

PAGE 114

102 For observations Yst and Yst at two different time points P(YstY s = 1) P(T s t Ut < x~'/3, Tst Ut < x~.'/3) <1> 2 ((x~'/3 x~.'/3)' Q(t t )). This leads to a marginal covariance of cov(Yst Yst) = 2 ( (x~'/3, x~.'/3)', Q(t, t )) irf'1irtf between two binary observations at time points t and t*. Plugging in different forms for the covariance of the random effects in Q(t t*) results in different covariance structures among the two binary observations. Now let Yi = I:: ;: 1 Yst represent the sum over all nt (marginally dependent) binary variables at time t. Then similar to the logit link results presented before, Yi is overdispersed relative to a binomial with mean and variance The correlation between random effects implies a correlation between outcomes at different time points t and t *, resulting in a correlation between Yi and Yi of form corr(Yi Yi) cov(Yi Yi )/[var(Yi) var(Yi )] 1 1 2 ( 4.17) 4.3.2.4 Logit-probit connection In regular GLMs parameter estimates in logit models are roughly 1.6 times those in probit models (Agresti, 2002, p. 246). This number derives from the fact

PAGE 115

103 that with a linear predictor of form a + f3x the rate of change in the success probability 1r(x) is highest at x = -a/ /3 for both the probit model (i.e., 1r(x) has the form of a normal cdf) and the logit model (i e. 1r(x) has the form of a logistic cdf). At this point 1r(x) is equal to 1/2 for both models and the rate of change, d1r(x)/dx is equal to 0.4/3 for the probit model and 0.25/3 for the logit model. Hence we get an equal rate of change (at x = -a//3) in both models when the logit /3 is 0.4/0.25 = 1.6 times the probit /3. When comparing the standard deviations implied by the normal cdf (which is equal to 1/1/31) and the logistic cdf ( which is equal to 1r / J31/3 I) then this factor between the relationships of parameters in the two models increases to 1.8. We exploited this connection between parameters in logit and probit models to construct Figure 4 2 where we compare conditional success probabilities bas e d on the logit link (4 9) and the probit link (4.14) for various values for a in a GLMM. We rewrote the linear predictor for the logit GLMM as T/t = x~/3 + CJZt where Zt is a standard normal variable and a can be interpreted as the regression coefficient for the random effect ZtThen we used the approximate connection T/robit TJ!ogit /1.6 between parameter estimates to compute conditional success probabilities under both models This was done for a random sample of 6 Zt s from a standard normal distribution For each generated Zt each panel in Figure 4 2 displays the conditional success probabilities based on the logit link (straight line) and (scaled) probit link (dashed line) for TJ!ogit ranging from -3 to 3. The 4 different panels refer to 4 different choices of a. We clearly see that conditional success probabilities based on the logit and (scaled) probit link are almost indistinguishable irrespective of the magnitude of a. The agreement is best at conditional success probabilities around 1/2, which is to be expected based on the derivations given above But even for small and large success probabilities the agreement is very good. Hence with the proper scaling

Figure 4-2: Comparison of conditional logit and probit model based probabilities. Conditional success probabilities for logit (solid line) and probit (dashed line) link GLMMs, for linear predictor values ranging from -3 to 3. Each pair of (solid, dashed) curves in each panel corresponds to one out of 6 randomly sampled random effects $z_t$. Conditional success probabilities for probit link GLMMs use a scaled version of the linear predictor for logit link GLMMs to adjust for the different parameter estimates in these two models. The four panels correspond to four different values of the random effects standard deviation, $\sigma = 1, 1.5, 2$ and $2.5$.

Hence, with the proper scaling factor on the fixed effects parameters and the standard deviation of the random effects, a probit-link GLMM corresponds to a logit-link GLMM. That is, for any fixed $z_t$,

\[
\frac{\exp\{x_t'\beta + \sigma z_t\}}{1+\exp\{x_t'\beta + \sigma z_t\}} \approx \Phi\big(x_t'\tilde{\beta} + \tilde{\sigma} z_t\big),
\]

where the left hand side in the approximation refers to a logit model for the conditional success probabilities with parameters $\beta$ and $\sigma$, and the right hand side refers to a probit model for the same conditional success probabilities with parameters $\tilde{\beta} = \beta/1.6$ and $\tilde{\sigma} = \sigma/1.6$. Taking expectations with respect to the distribution of $z_t$, one would then also expect that

\[
E\left[\frac{\exp\{x_t'\beta + \sigma z_t\}}{1+\exp\{x_t'\beta + \sigma z_t\}}\right] \approx E\big[\Phi(x_t'\tilde{\beta} + \tilde{\sigma} z_t)\big],
\]

and consequently

\[
\pi_t^M \approx \Phi\!\left(\frac{x_t'\tilde{\beta}}{\sqrt{1+\tilde{\sigma}^2}}\right). \tag{4.18}
\]

This gives an approximation of the marginal success probability of a logit model in terms of parameters from a conditional probit model. Graphically, (4.18) means that the average of conditionally specified logistic cdfs, which does not follow a logistic form itself, can be approximated by the average of subject-specific normal cdfs, which does follow a normal form, provided the correct parameter adjustments are made. The connection is pictured in Figure 4-3, where the average of 100 conditionally specified logistic curves, $\frac{1}{100}\sum_{i=1}^{100}\pi(z_i; \beta, \sigma)$ with $z_i \sim N(0, 2.5)$, is compared to the cdf $\pi_t^M = \Phi(x_t'\tilde{\beta}/\sqrt{1+\tilde{\sigma}^2})$ of a marginal probit model with adjusted parameters. Note that the agreement is almost perfect, although $\sigma$ is large.

The upshot of this exercise is that we can use the exact formulae for marginal properties of probit-link GLMMs to make good approximate marginal statements in logit-link GLMMs. These should be more accurate than the Taylor approximations based on the logit link in cases where $\sigma$ is large.
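
The quality of approximation (4.18) can be checked directly by Monte Carlo. The short sketch below averages logistic curves over simulated random effects and compares the result with the rescaled normal cdf; $\sigma = 2.5$ mirrors the setting of Figure 4-3, and the rest of the setup is illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma = 2.5
u = rng.normal(0.0, sigma, size=100_000)     # random effects u ~ N(0, sigma^2)

eta = np.linspace(-6, 6, 13)                 # fixed-effects part x'beta
marg_logit = np.array([(1 / (1 + np.exp(-(e + u)))).mean() for e in eta])
marg_probit = norm.cdf((eta / 1.6) / np.sqrt(1 + (sigma / 1.6)**2))
print(round(float(np.abs(marg_logit - marg_probit).max()), 3))   # small gap
```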

Figure 4-3: Comparison of implied marginal probabilities from logit and probit models. The plot shows the average of 100 conditionally specified logistic curves (dashed line), generated using $u \sim N(0, 2.5)$, and the marginal normal curve $\pi^M$ (solid line) from a probit model with adjusted parameters. The plot is over a linear predictor range from -6 to 6. A random sample of 10 of the 100 generated conditionally specified logistic curves is also shown (grey dashed lines).

For instance, we can use (4.16) to derive the marginal success probability or the marginal odds in logit-link GLMMs, and (4.17) to derive the marginal correlation between two observations, in models where the estimate of $\sigma$ is large. The second panel of Figure 4-1 shows that the approximation of the marginal success probabilities based on the probit connection does not suffer the drawbacks (loss of monotonicity, non-convergence to 0.5 as $\sigma \to \infty$) experienced with the Taylor-based approximation approach. Also notice the close connection between the multiplicative factor $1/1.7$ for $\sigma$ in the cumulative Gaussian approximation of the marginal mean employed by Zeger, Liang and Albert (1988) and the approximation using the probit link:

\[
\pi_t^M \approx \frac{1}{1+\exp\big(-x_t'\beta\big/\sqrt{1+\tilde{\sigma}^2}\big)}, \quad \tilde{\sigma}^2 = (\sigma/1.7)^2 \quad \text{(Zeger et al.)}
\]

and

\[
\pi_t^M \approx \Phi\!\left(\frac{x_t'\tilde{\beta}}{\sqrt{1+\tilde{\sigma}^2}}\right), \quad \tilde{\beta} = \beta/1.6,\ \tilde{\sigma}^2 = (\sigma/1.6)^2 \quad \text{(probit-logit)}. \tag{4.19}
\]

The first approximation makes a stronger statement in that it says that the marginal mean also has the form of a logistic regression model, with parameters downweighted by a factor of $\sqrt{1+\tilde{\sigma}^2}$. However, using the relationship between the probit and logit links once more, the parameter vector $\tilde{\beta}/\sqrt{1+\tilde{\sigma}^2}$ for the marginal probit model (4.19) translates to roughly $1.6 \times \tilde{\beta}/\sqrt{1+\tilde{\sigma}^2} = \beta/\sqrt{1+\tilde{\sigma}^2}$ for a marginal logit model. Hence both approximations show that the marginal mean follows roughly a logit model. They differ only by the weight factor assigned to the random effects standard deviation. Notice that the probit-logit connection and derivations outlined in this dissertation are more valuable because they also provide approximations to the marginal variance and correlation, a key component for time series observations.

In summary, fitting the logit GLMM allows for the usual interpretation of parameters as (conditional) effects on the log odds. By exploiting connections with models using a probit link, we can give good closed-form, analytical approximations for marginal probabilities, odds and correlations. Note that non-closed-form solutions for the marginal mean, variance and correlation can always be obtained by integrating over the random effects density, e.g., $\pi_t^M = \int \pi_t(u_t)\, g(u_t)\, du_t$, and approximated by stochastic methods such as a Monte Carlo sum using the fitted random effects distribution. The third panel of Figure 4-1 displays these Monte Carlo averages (based on 100,000 draws from the assumed $N(0, \sigma)$ random effects distribution) and shows almost perfect agreement with the closed-form approximations based on the probit model (see also Figure 4-3). For this simple example the Monte Carlo averages are easy to obtain, but considerably more simulation effort may be necessary to yield good approximations for higher-dimensional marginal probabilities, such as the occurrence of three consecutive successes, or for more complicated functions. The examples in Section 5.4 will illustrate this point further and make extensive use of both of these approximation techniques. There, the main use of these approximations will be in comparing the empirical dependency structure observed in the time series to the theoretical one implied by the model, and in comparing observed frequencies in the time series to estimated ones based on our proposed models.

4.3.3 Parameter Interpretation

In GLMMs for binary data we model conditional log odds, given random effects. By averaging with respect to the random effects distribution on the logit scale, we obtain unconditional (or expected) log odds. However, these are different from the marginal log odds $\log\big(\pi_t^M/(1-\pi_t^M)\big)$ obtained with the marginal

probabilities implied by the conditionally formulated model. In the previous section we derived the approximation

\[
\log\big(\pi_t^M/(1-\pi_t^M)\big) \approx x_t'\beta\big/\sqrt{1+\tilde{\sigma}^2},
\]

and we see that the parameters can still be interpreted as log odds ratios, but downweighted by a factor of $\sqrt{1+\tilde{\sigma}^2}$. In the literature on longitudinal data, where random effects pertain to subjects, this interpretation is preferred when the dependency structure is considered a nuisance. However, for interpreting regression parameters in a time series analysis based on the GLMMs outlined in this dissertation, a preference is not so clear. For a related analysis of binary time series via hidden Markov models, where the distinction to GLMMs is essentially the assumption of a discrete Markov process on a few states instead of an AR(1) process for the latent random process $\{U_t\}$, MacDonald and Zucchini (1997) give unconditional interpretations of regression parameters throughout. Since the previous section focused on marginal interpretations, this section focuses on unconditional ones, but it is noted that it may be more natural to interpret a log odds of average probabilities (i.e., to think marginally) than an average log odds. (To illustrate the issue further, in pre-GLM times people took logarithmic transforms of the observations and fitted a linear model to their means. But then one is modeling the mean of the logarithm rather than the logarithm of the mean, as would be done with a GLM.)

4.3.3.1 Conditional and unconditional log odds and log odds ratios

The conditional log odds of success at time point $t$ are given by

\[
\mathrm{logit}\big(\pi_t(u_t)\big) = x_t'\beta + u_t.
\]

Integrating over the random effects distribution, the unconditional or expected log odds are equal to $x_t'\beta$, and $\beta$ can be interpreted as the unconditional or expected change in the log odds for a change in the covariates. The central $100(1-\alpha)\%$ of

the distribution of the log odds falls in between

\[
x_t'\beta \pm z_{\alpha/2}\,\sigma.
\]

The conditional log odds ratio of success at time point $t^*$ over one at time point $t$ is given by

\[
\log\frac{\pi_{t^*}(u_{t^*})/\big(1-\pi_{t^*}(u_{t^*})\big)}{\pi_t(u_t)/\big(1-\pi_t(u_t)\big)} = (x_{t^*} - x_t)'\beta + (u_{t^*} - u_t).
\]

With correlated random effects, the interpretation of the log odds ratio is time specific, not only through the covariates but also through the random term $(u_{t^*} - u_t)$. This is different from the so-called subject-specific interpretation of the log odds ratio in a regular random intercepts model ($u_t = u$ for all $t$). There, the random effect is assumed to be constant over time and cancels out in the log odds ratio, and $\beta$ can be directly interpreted as the change in the conditional log odds for a change in the covariates. With correlated random effects, $\beta$ is the change in the conditional log odds for a change in the covariates when the random effects at times $t$ and $t^*$ have the same value.

The unconditional or expected log odds ratio is equal to $(x_{t^*} - x_t)'\beta$, and $\beta$ can alternatively be interpreted as the expected change in the log odds for a change in the covariates between times $t$ and $t^*$. The central $100(1-\alpha)\%$ of the distribution of the log odds ratio falls in between

\[
(x_{t^*} - x_t)'\beta \pm z_{\alpha/2}\sqrt{2\sigma^2\big(1 - \mathrm{corr}(U_t, U_{t^*})\big)},
\]

which can be estimated by plugging in ML estimates for $\beta$ and the random effects variance and covariances.

4.3.3.2 Conditional and unconditional odds and odds ratios

At time point $t$ with associated random effect $U_t$, we already defined the unconditional or expected odds of success as $E[\exp\{x_t'\beta + U_t\}] = \exp\{x_t'\beta + \sigma^2/2\}$. Here $\beta$ describes the effect of covariates on the expected odds of success, with the intercept term offset as in Poisson GLMMs. For two time points $t^*$ and $t$ with associated random effects $U_{t^*}$ and $U_t$, the ratio of expected odds is equal to $\exp\{(x_{t^*} - x_t)'\beta\}$ and can be interpreted as any regular odds ratio. I.e., for a positive, one-unit change in the $k$-th predictor from time $t$ to time $t^*$, the expected odds of success at time $t^*$ are $\exp\{\beta_k\}$ times those at time $t$. Due to the non-linearity of the odds, the ratio of expected odds is different from the expected odds ratio, which is given by

\[
E\left[\frac{\exp\{x_{t^*}'\beta + U_{t^*}\}}{\exp\{x_t'\beta + U_t\}}\right] = \exp\{(x_{t^*} - x_t)'\beta\}\,\exp\big\{\sigma^2\big(1 - \mathrm{corr}(U_t, U_{t^*})\big)\big\}.
\]

Using this measure, $\exp\{\beta_k\}\exp\{\sigma^2(1 - \mathrm{corr}(U_t, U_{t^*}))\}$ now equals the change in the expected odds ratio of success at time $t^*$ versus time $t$, for a positive one-unit change in the $k$-th predictor over that time span. The first measure, $\exp\{\beta_k\}$, describes the change in the expected odds at two time points; the second one, $\exp\{\beta_k\}\exp\{\sigma^2(1 - \mathrm{corr}(U_t, U_{t^*}))\}$, describes the expected change in the odds of a success at the two time points. In the following, we focus on the ratio of expected odds.
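
A small simulation confirms both moments used above: the lognormal formula for the expected odds and the extra factor $\exp\{\sigma^2(1-\mathrm{corr}(U_t,U_{t^*}))\}$ in the expected odds ratio. The numerical values of $x_t'\beta$, $\sigma$ and the correlation below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
xb, sigma, rho = 0.4, 0.8, 0.5
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
u = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)

# expected odds: simulation vs. closed form exp{x'beta + sigma^2/2}
print(np.exp(xb + u[:, 0]).mean(), np.exp(xb + sigma**2 / 2))
# extra factor in the expected odds ratio: exp{sigma^2 (1 - corr)}
print(np.exp(u[:, 1] - u[:, 0]).mean(), np.exp(sigma**2 * (1 - rho)))
```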

4.3.3.3 Multiple time series

If $n$ different time series $y_1 = (y_{11}, \ldots, y_{1T})', \ldots, y_n = (y_{n1}, \ldots, y_{nT})'$ are observed, and the same latent process $\{u_t\}_{t=1}^{T}$ is assumed to underlie each one of them, the conditional log odds at time $t$ have the form

\[
\mathrm{logit}\big(\pi_{it}(u_t)\big) = x_{it}'\beta + u_t, \quad i = 1, \ldots, n.
\]

Here $\pi_{it}(u_t)$ is the conditional probability of success at time $t$ for the $i$-th series and depends on time-specific covariates $x_{it}$ plus a serially correlated random time effect $u_t$. If two different time series $y_i$ and $y_j$ represent different subpopulations or stratifications of a population, interest can focus on each one of the following three contrasts:

- Contrasts between subpopulations at a given common observation time
- Contrasts between different time points within the same subpopulation
- Contrasts between subpopulations at different observation times

We will look at ratios of expected odds, which is perhaps the most natural metric, to address these three points. In the first case, the expected odds of success in stratum $i$ over those in stratum $j$ at a fixed time $t$ are given by

\[
\exp\{x_{it}'\beta + \sigma^2/2\}\big/\exp\{x_{jt}'\beta + \sigma^2/2\} = \exp\{(x_{it} - x_{jt})'\beta\}.
\]

Then $\exp\{\beta\}$ has the interpretation of a change in the expected odds for a change in the strata covariates at fixed time $t$. For example, with model (3.1), for the two time series measuring attitude towards homosexual relationships for whites and blacks, $\exp\{\beta_3 + \beta_4 x_{1t}\}$ describes the change in the expected odds of approval of homosexual relationships for blacks versus whites in year $x_{1t}$. That is, in year $x_{1t}$ the expected odds of approval for black respondents are $\exp\{\beta_3 + \beta_4 x_{1t}\}$ times the expected odds of approval for white respondents. Using the maximum likelihood estimates and their estimated asymptotic standard errors (see Table 5-1)

and covariances, the expected odds of approval of homosexual relationships for black respondents in 1988 are estimated to be 0.65 times (95% confidence interval: [0.54, 0.73], using the delta method) the expected odds for white respondents in that year. Ten years later, in 1998, this factor decreases to 0.49, with a 95% confidence interval of 0.37 to 0.61.

For the scenario in the second contrast, the ratio of expected odds at time $t^*$ versus time $t$ for subpopulation $i$ is given by

\[
\exp\{(x_{it^*} - x_{it})'\beta\}.
\]

Now $\exp\{\beta\}$ describes the change in the expected odds for changes in the predictors from time $t$ to time $t^*$. For the motivating example with model (3.1), $\exp\{\beta_1 h + \beta_2(x_{1,t+h}^2 - x_{1t}^2)\}$ is the change in the expected odds of approval for white respondents, and $\exp\{(\beta_1 + \beta_4)h + \beta_2(x_{1,t+h}^2 - x_{1t}^2)\}$ is the change in the expected odds of approval for black respondents, for observations $h$ years apart. For example, over a period of 10 years from 1988 to 1998, the expected odds of approval increase by a factor of 2.63 for white respondents (95% confidence interval: [1.56, 3.69], using the delta method) and by a factor of 2.01 (95% confidence interval: [1.44, 2.58]) for black respondents.

For the third contrast,

\[
\exp\{(x_{it^*} - x_{jt})'\beta\}
\]

describes how the expected odds at time $t^*$ in stratum $i$ compare to the expected odds at time $t$ in stratum $j$. Again, $\exp\{\beta\}$ describes the effect of a change in the strata covariates from time $t$ to $t^*$ on the expected odds. In the motivating example, the expected odds of approval for black respondents in year $x_{1t} + h$ are $\exp\{\beta_1 h + \beta_2[(x_{1t}+h)^2 - x_{1t}^2] + \beta_3 + \beta_4(x_{1t}+h)\}$ times those for white respondents in year $x_{1t}$. Using the maximum likelihood estimates, the expected odds of approval for black respondents in 1998 are estimated to be 0.99 times (i.e., almost equal to)

the expected odds of approval for white respondents 6 years back, in 1992. Here I fixed the year 1998 (i.e., $x_{1t} + h = 15$) and searched for the number of years $h$ one has to go back to match the expected odds of approval for the two races. I.e., I solved for $h$ in the equation

\[
\beta_1 h + \beta_2\big[15^2 - (15 - h)^2\big] + \beta_3 + 15\beta_4 = 0,
\]

yielding $h \approx 6$.
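
The back-calculation can be written as a one-line root-finding problem. The sketch below uses the logistic GLMM estimates from Table 5-1 and assumes the linear predictor form of model (3.1) as reconstructed in the equation above; it returns a root near $h = 6$.

```python
import numpy as np
from scipy.optimize import brentq

b1, b2, b3, b4 = 0.023, 0.0041, -0.33, -0.027    # logistic GLMM, Table 5-1

def log_odds_gap(h, x_star=15.0):
    # log of (expected odds, blacks in 1998) over (expected odds, whites in 1998-h)
    return b1*h + b2*(x_star**2 - (x_star - h)**2) + b3 + b4*x_star

print(brentq(log_odds_gap, 0.0, 15.0))   # about 6 years
```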

CHAPTER 5
EXAMPLES OF COUNT, BINOMIAL AND BINARY TIME SERIES

In this chapter we propose GLMMs with autocorrelated random effects for the analysis of several practical examples. We will apply the likelihood estimation theory developed in Chapter 3 and use the model-theoretic properties derived in Chapter 4. In Section 5.2 we take another look at the GSS data set discussed in Chapter 3 and re-analyze it based on a normal approximation using linear mixed model theory. A famous time series of counts is analyzed in Section 5.3, with results in the literature compared to our results based on an autoregressive GLMM. Two binary time series are analyzed next, in Section 5.4. The first one considers 299 consecutive eruptions of the Old Faithful geyser in Yellowstone National Park, which are classified as either short or long. The second one considers the annual boat race between teams from the Universities of Cambridge and Oxford and is challenging because of several missing observations. Two goals in this example are to establish the influence of the weight of the crew on the outcome of the race (demystifying a long-held belief) and to predict a future outcome. First, though, in Section 5.1 we present ways to explore and picture the dependency structure in an observed discrete time series.

5.1 Graphical Exploration of Correlation Structures

Let $\{y_t\}$ be a realization of the time series $\{Y_t\}$. In practice, we have to choose an appropriate model for $\{Y_t\}$ based on the information the data provide about the dependency structure. An important tool to explore the dependency structure of the observed time series is the sample autocorrelation function (ACF). For equally

spaced data it is defined as

\[
r(h) = \frac{\sum_{t=1}^{T-h}(y_t - \bar{y})(y_{t+h} - \bar{y})}{\sum_{t=1}^{T}(y_t - \bar{y})^2}, \tag{5.1}
\]

where $h$ is the lag between observations and $\bar{y}$ is the sample mean. If the observed time series displays any trend, we have to first estimate it (maybe by fitting a regular GLM) and subsequently explore the autocorrelations among the residuals. A comparison of the autocorrelation the model predicts with the empirical one observed in the time series serves as a crucial check on the adequacy of the fitted model and its assumptions.
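
For completeness, here is a direct implementation of (5.1); it is a sketch that matches what standard time series routines (e.g., acf in statsmodels) compute, up to the usual normalization.

```python
import numpy as np

def sample_acf(y, max_lag):
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()                     # deviations from the sample mean
    denom = np.sum(dev**2)
    return np.array([np.sum(dev[:len(y)-h] * dev[h:]) / denom
                     for h in range(1, max_lag + 1)])

print(sample_acf([2, 0, 1, 3, 1, 0, 2, 4, 1, 0], max_lag=3))
```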

5.1.1 The Variogram

For unequally spaced data, the variogram (Diggle, 1990) is a better measure than the ACF to describe the association. In Diggle et al. (2002) the variogram is discussed for longitudinal data, while we develop it here for the special case of time series data $\{Y_t\}$. Define the variogram $\gamma(h)$ at lag $h$ as

\[
\gamma(h) = \frac{1}{2}\,E\big[(Y_{t+h} - Y_t)^2\big].
\]

If $\{Y_t\}$ is stationary, the variogram is directly related to the autocorrelation function $\rho(h) = \mathrm{corr}(Y_t, Y_{t+h})$ through $\gamma(h) = \tau^2[1 - \rho(h)]$, where $\tau^2$ is the variance of $Y_t$. (Even for a non-stationary time series the variogram is well defined, provided the increments $Y_{t+h} - Y_t$ are stationary.) To develop the empirical variogram, let $d_{tt^*}$ denote the time between two observations $y_t$ and $y_{t^*}$. The sample analog $g(h)$ of the variogram at lag $h$ is calculated by averaging over all possible squared differences between observation pairs $h$ time units apart, i.e.,

\[
g(h) = \frac{1}{|C_h|}\sum_{(t,t^*)\in C_h} \frac{1}{2}\big(y_t - y_{t^*}\big)^2,
\]

where $C_h = \{(t,t^*) : d_{tt^*} = h\}$ is the set of all index pairs $(t,t^*)$ with corresponding observations measured $h$ time units apart. A comparison of the sample variogram $g(h)$ to an estimate $\hat{\gamma}(h)$ of the theoretical variogram implied by a particular model serves as a check on the adequacy of the model. In Section 5.4 we show, with the help of the variogram, the appropriateness of the modeled correlation structure for the unequally spaced Oxford versus Cambridge boat race time series data.
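
The empirical variogram is equally direct to compute, also for unequally spaced observation times. The following sketch forms the index set $C_h$ by brute force; the observation times and data are toy values.

```python
import numpy as np

def empirical_variogram(times, y, lags):
    times = np.asarray(times)
    y = np.asarray(y, dtype=float)
    g = {}
    for h in lags:
        sq = [0.5 * (y[i] - y[j])**2
              for i in range(len(y)) for j in range(len(y))
              if times[j] - times[i] == h]      # the index set C_h
        g[h] = float(np.mean(sq)) if sq else float("nan")
    return g

# toy unequally spaced series (gaps at times 4, 7 and 8)
print(empirical_variogram([1, 2, 3, 5, 6, 9],
                          [0.2, 1.1, 0.9, 0.1, 1.3, 0.8], lags=[1, 2, 3]))
```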

5.1.2 The Lorelogram

For categorical, and especially binary, responses the dependency can also be measured in terms of odds ratios. For a binary time series $\{Y_t\}$, Heagerty and Zeger (1998) define the lorelogram $\theta(h)$ at lag $h$ as the log odds ratio between observations $Y_t$ and $Y_{t+h}$,

\[
\theta(h) = \log\left(\frac{P(Y_t = 1, Y_{t+h} = 1) \times P(Y_t = 0, Y_{t+h} = 0)}{P(Y_t = 1, Y_{t+h} = 0) \times P(Y_t = 0, Y_{t+h} = 1)}\right). \tag{5.2}
\]

For an observed binary time series $y = (y_1, \ldots, y_T)'$, the lorelogram can be estimated by using sample proportions for the probabilities in (5.2). I.e., the sample lorelogram at lag $h$ is given by

\[
\mathrm{LOR}(h) = \log\left(\frac{\big(y_{[1,T-h]}'\, y_{[h+1,T]}\big)\,\big((\mathbf{1}_{T-h} - y_{[1,T-h]})'(\mathbf{1}_{T-h} - y_{[h+1,T]})\big)}{\big(y_{[1,T-h]}'(\mathbf{1}_{T-h} - y_{[h+1,T]})\big)\,\big((\mathbf{1}_{T-h} - y_{[1,T-h]})'\, y_{[h+1,T]}\big)}\right),
\]

where $y_{[a,b]}$ is the sub-vector $(y_a, \ldots, y_b)'$ of $y$ and $\mathbf{1}_{T-h}$ is a vector of $T-h$ ones. Proper adjustments have to be made for unequally spaced data. As with the variogram, a comparison of the sample lorelogram to the one implied by a particular model serves as a check on the adequacy of the model.
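
A corresponding sketch for the sample lorelogram of an equally spaced binary series is given below. For unequally spaced data the pairing would be driven by observation times instead of positions, and a small continuity correction could be added to guard against empty cells (which, as noted for the geyser series later, make the log odds ratio undefined).

```python
import numpy as np

def sample_lorelogram(y, max_lag):
    y = np.asarray(y)
    out = []
    for h in range(1, max_lag + 1):
        a, b = y[:-h], y[h:]                    # outcomes h steps apart
        n11 = np.sum((a == 1) & (b == 1))
        n10 = np.sum((a == 1) & (b == 0))
        n01 = np.sum((a == 0) & (b == 1))
        n00 = np.sum((a == 0) & (b == 0))
        out.append(float(np.log(n11 * n00 / (n10 * n01))))
    return out

print(sample_lorelogram([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1], max_lag=2))
```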

5.2 Normal Time Series

Section 4.1 discussed a linear mixed model approach to time series modeling for normal data. In this section we illustrate it with the example about attitudes towards homosexual relationships (a cross-sectional time series, see Section 3.1), where attitude was measured 16 times between 1974 and 1998 for whites and blacks. The binomial counts in this study are large enough to warrant an analysis based on a normal approximation, although the binomial sample sizes are about 8 to 11 times larger for whites than for blacks in almost all years. Initially we will assume that, conditional on random time effects $\{u_t\}$, the log odds $\theta_{it}$ of approval of homosexual relationships for race $i$ at year $t$ are independent over the years and follow a normal distribution with mean $\mu_{it} + u_t$ and standard deviation $\tau$. (We could have also modeled the binomial counts or the proportions of approval directly, but prefer the log odds approach because of the structural problem of the identity link when modeling proportion data.)

However, the usual assumption of a constant standard deviation $\tau$ throughout the two groups is grossly inappropriate. Firstly, we have to account for the fact that sample sizes in the white group are much larger than sample sizes in the black group. Secondly, the estimated asymptotic standard deviation of the log odds, derived by the delta method and based on the asymptotic normality of the sample proportions, is

\[
\widehat{\mathrm{std}}(\hat{\theta}_{it}) = \big[n_{it}\,\hat{\pi}_{it}(1 - \hat{\pi}_{it})\big]^{-1/2}, \tag{5.3}
\]

where $\hat{\pi}_{it}$ is the sample proportion and $n_{it}$ the sample size for race $i$ and year $t$. Figure 5-1 shows these estimates for the two groups. To put the scale of the empirical standard deviations in this plot into perspective, the estimated log odds range from -1.8 to -0.9 for the white group and from -2.8 to -1.4 for the black group. Figure 5-1 shows that the variability in the log odds is markedly smaller for white respondents than for black ones, due to the overall larger sample sizes in the white group. Furthermore, the variability in the log odds is not constant over time, especially in the black group. This is due to the fact that in the years 1988 to 1991 and 1993 only about half as many people were sampled as in the other years, for both groups. Also, the trend in the probabilities $\pi_{it}$ causes the standard deviations to differ over time.

Figure 5-1: Empirical standard deviations $\widehat{\mathrm{std}}(\hat{\theta}_{it})$ for the log odds of favoring homosexual relationships, by race.

In an effort to remedy all these effects simultaneously, we associate a weight with each observation that is the inverse of the empirical standard deviation given in (5.3). It might then be reasonable to assume that the weighted log odds $w_{it}\theta_{it}$ have constant conditional standard deviation, $\mathrm{std}(w_{it}\theta_{it} \mid u_t) = \tau$, or, stated differently, that $\mathrm{std}(\theta_{it} \mid u_t) = \tau/w_{it}$. We are now able to specify an autoregressive model. Analogous to Section 3.1, we assume the following linear mixed model for the log odds:

\[
\theta_{it} = \alpha + \beta_1 x_{1t} + \beta_2 x_{1t}^2 + \beta_3 x_{2i} + \beta_4 x_{1t} x_{2i} + u_t + e_{it},
\]

where, as before, $x_{1t}$ is the (centered) year the response was measured, $x_{2i}$ is an indicator for race, $u_t$ is the year-specific random effect, and the $e_{it}$ are assumed independent normal with $\mathrm{std}(e_{it}) = \tau/w_{it}$. With the motivation given in Section 3.1, we assume autocorrelated random effects, $u_{t+1} = \rho^{d_t} u_t + \epsilon_t$, to model the serial dependence in successive observations caused by gradual change in unobserved factors such as public opinion.

We used the procedure proc mixed in SAS to obtain the maximum likelihood estimates for all parameters in this model, which are shown in Table 5-1. The mixed procedure allows for the use of the weights $1/w_{it}$ in the variance-covariance matrix of the conditional log odds given the random effects through the weight statement. We selected the so-called spatial power covariance matrix (SAS proc mixed with random statement option: type=sp(pow)('year')) for the covariance matrix of the random effects, because it allows for unequally spaced observations. The results confirm the ones we saw earlier using a logistic regression approach with autoregressive random effects to model the serial log odds. Since the parameter estimates in both models are almost equal, substantially the same conclusions as given in Section 4.3.3 based on the logit GLMM are reached.

Table 5-1: Comparing estimates from two models for the log odds.

                  normal              logistic
Param        est.      s.e.      est.      s.e.
α           -1.80      0.07     -1.80      0.07
β1           0.025     0.007     0.023     0.007
β2           0.0039    0.0008    0.0041    0.0009
β3          -0.32      0.057    -0.33      0.07
β4          -0.027     0.007    -0.027     0.009
τ            0.67
σ            0.11                0.10      0.03
ρ            0.66                0.65      0.25

Maximum likelihood estimates and asymptotic standard errors based on an approximate normal linear mixed model for the log odds (first two columns) and the logistic GLMM with autocorrelated random effects of Section 3.1.

One advantage of the normal approximating model is the ease with which marginal statements can be obtained. For instance, formula (4.2) applies, but now with a weight factor. That is,

\[
\mathrm{corr}(\theta_{it}, \theta_{it^*}) = \frac{\sigma^2 \rho^{d_{tt^*}}}{\sqrt{\big(\sigma^2 + \tau^2/w_{it}^2\big)\big(\sigma^2 + \tau^2/w_{it^*}^2\big)}}
\]

is the marginal correlation between the log odds of approval of race $i$ for observations in years $t$ and $t^*$. The marginal correlations depend on the weight factors and therefore vary throughout the years. For instance, using the maximum likelihood estimates of the variance components, the estimated correlation between the log odds of approving homosexual relationships in years 1996 and 1998 is 0.52 for whites and 0.48 for blacks.

5.3 Analysis of the Polio Count Data

Table 2 in Zeger (1988) lists a time series of 168 monthly counts, most of them small, of new cases of poliomyelitis in the U.S. between 1970 and 1983, plotted in the first panel of Figure 5-2. Of interest is whether the data provide evidence of a long-term decrease in the rate of polio infections. Many authors have analyzed this data set, beginning with Zeger (1988), who uses marginal model

fitting techniques (cf. Section 1.2). He estimates a yearly decrease of 6.3% in new cases of poliomyelitis, where, however, a 95% confidence interval ranges from a decrease of 12.2% to an increase of 1.1%. We demonstrate our technique by re-analyzing these data using a GLMM with autoregressive random effects (ARGLMM) to incorporate the time dependence between adjacent counts.

Conditional on a random time effect $u_t$, let $Y_t$ be a Poisson variable representing the count in month $t$, $t = 1, \ldots, 168$. Following Zeger (1988), we model the log of the conditional mean of $Y_t$ as

\[
\alpha + \beta_1 (t - 73)/1000 + \beta_2 \cos(2\pi t/12) + \beta_3 \sin(2\pi t/12) + \beta_4 \cos(2\pi t/6) + \beta_5 \sin(2\pi t/6) + u_t, \tag{5.4}
\]

where the random effects follow the autoregressive process

\[
u_{t+1} = \rho u_t + \epsilon_t, \quad \epsilon_t \sim N\big(0, \sigma\sqrt{1-\rho^2}\big), \quad u_1 \sim N(0, \sigma).
\]

The sine and cosine pairs adjust for annual and semi-annual seasonal patterns of the counts displayed in Figure 5-2. Figure 5-3 shows the convergence of selected parameter estimates and their estimated standard errors in an MCEM algorithm for two different sets of starting values for the variance components. Convergence parameters were set to $\epsilon_1 = 0.002$, $c = 5$, $\epsilon_2 = 0.01$, $\epsilon_3 = -0.001$, $a = 1.04$ and $q = 1.1$ (see Section 2.3.3). All maximum likelihood estimates and their standard errors are presented in Table 5-2, together with estimates from other models.
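
To make the data-generating mechanism of (5.4) concrete, the following sketch simulates one realization from the model, using the ARGLMM point estimates reported in Table 5-2 as parameter values. It illustrates the model, not the estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 168
alpha = -0.03                                    # Table 5-2, ARGLMM column
beta = np.array([-3.74, -0.10, -0.50, 0.20, -0.36])
sigma, rho = 0.70, 0.66

u = np.empty(T)                                  # latent AR(1) random effects
u[0] = rng.normal(0.0, sigma)                    # stationary initialization
for s in range(1, T):
    u[s] = rho * u[s-1] + rng.normal(0.0, sigma * np.sqrt(1 - rho**2))

t = np.arange(1, T + 1)
X = np.column_stack([(t - 73) / 1000,
                     np.cos(2*np.pi*t/12), np.sin(2*np.pi*t/12),
                     np.cos(2*np.pi*t/6),  np.sin(2*np.pi*t/6)])
y = rng.poisson(np.exp(alpha + X @ beta + u))    # conditional Poisson counts
print(y[:12])
```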

Figure 5-2: Plot of the Polio data. The first panel shows the observed time series of counts of polio infections from 1970 to 1984. Not shown is the observed count of 14 in November 1972. The second panel shows the observed counts as circles and superimposes the fitted conditional model (5.4), where posterior means are used to estimate the random effects. The third panel shows the fit of the marginal model implied by the conditional formulation.

Figure 5-3: Iteration history for the Polio data. The plot shows the iteration history for the parameters $\beta_1$, $\sigma$ and $\rho$ and their estimated asymptotic standard errors for the Poisson ARGLMM. The iteration number is plotted on the x-axis. The two different lines in each plot correspond to two different sets of starting values for $\sigma$ and $\rho$. The starting value for $\beta_1$ was its GLM estimate. Final Monte Carlo sample sizes in the MCEM algorithm were 29,070 and 34,005, respectively, for the two different sets of starting values.

Table 5-2: Parameter estimates for the polio data.

        Poisson GLM     Neg. Bin      Poisson GLMM   Poisson ARGLMM  Chan & Ledolter
Param   est.    s.e.    est.   s.e.   est.    s.e.   est.    s.e.    est.    s.e.
α        0.21   0.08     0.21  0.10   -0.05   0.11   -0.03   0.15     0.21   0.13
β1      -4.80   1.40    -4.33  1.85   -4.34   1.92   -3.74   2.91    -4.62   1.38
β2      -0.15   0.10    -0.14  0.13   -0.13   0.13   -0.10   0.15     0.15   0.09
β3      -0.53   0.11    -0.50  0.14   -0.51   0.14   -0.50   0.16    -0.50   0.12
β4       0.17   0.10     0.17  0.13    0.17   0.13    0.20   0.13     0.44   0.10
β5      -0.43   0.10    -0.42  0.13   -0.38   0.13   -0.36   0.13    -0.04   0.10
1/k                      0.57  0.16
σ                                      0.72   0.10    0.70   0.12     0.64
ρ                                                     0.66   0.20     0.89   0.04
LL    -132.49         -113.37       -112.40         -92.50           19.00

Fit of a Poisson GLM, a negative binomial GLM, a Poisson GLMM with independent random effects, and a Poisson ARGLMM. The last column holds parameter estimates as reported by Chan and Ledolter (1995). Their estimates of σ and ρ are transformed to bring them into agreement with our parametrization of the latent autoregressive process.

A negative binomial model takes overdispersion into account, but treats the observations as independent. Similarly, a Poisson GLMM with independent random effects $u_t$ i.i.d. $N(0, \sigma)$ only adjusts for overdispersion in the counts, but does not address the dependency among the observations. The Poisson ARGLMM has the largest maximized log-likelihood among these models and uses only one parameter more than a negative binomial or Poisson GLMM with independent random effects. Any information criterion, such as AIC, would heavily favor this model.

5.3.1 Comparison of ARGLMMs to other Approaches

Chan and Ledolter (1995) also used autoregressive random effects in a Poisson GLMM setting to analyze these data. However, their implementation of the MCEM algorithm seems to have been stopped prematurely. The path plot of the coefficient of interest, $\beta_1$ (Chan and Ledolter, 1995, p. 246), still shows some trend movements when they declared convergence. Their convergence criterion is based on the change in the marginal log-likelihood, while ours takes into account the consecutive changes in parameter estimates and the Q-function. (Moreover, their estimate of 19 for the maximized log-likelihood seems to be a rather unusual value considering our estimate of -92.5. This excludes the constant term $\sum_t \log(y_t!)$ in the log-likelihood, which is equal to -140.5. We were not able to reproduce the estimate with the parameter estimates given by Chan and Ledolter (1995).)

Also, our Monte Carlo sample size for the final iterations of the MCEM algorithm is 28,500, about 14 times higher than theirs. The Monte Carlo sample size increased exponentially in our implementation, but only two sample sizes, of 800 and then 2,000 (for confirming the results obtained with the 800 samples), were used in Chan and Ledolter (1995). Based on their estimate of $\beta_1$ and its asymptotic standard error, they conclude that the time trend is significant at the 5% level. However, our analysis shows an insignificant time trend at the 5% level, which is in agreement with the conclusion of Zeger (1988) and with other approaches that use different ways of incorporating correlation into a Poisson model. For instance, Fahrmeir and Tutz (2001) use a transitional model including the past 5 responses and report a p-value of 0.095 for the test of a zero slope for the time trend. Li (1994) fitted a slightly different transitional model and also reported an insignificant time trend after accounting for the autocorrelation. He noted that the series might be too short to establish significance of a linear time trend. Benjamin et al. (2003) used the negative binomial instead of the Poisson as the conditional distribution in a transitional model and reported a better fit with it, but again an insignificant time trend.

Note that in general, regardless of its significance, the time trend parameter in the transitional models fitted by the above authors has to be interpreted conditional on past responses (or functions involving past responses). In our model, however, $e^{\beta_1}$ can simply be interpreted as the linear time effect on the marginal mean of the polio counts. Let $\mu_t = E[Y_t]$ denote the marginal mean of $Y_t$, as given by expression (4.4). Then, with our model (5.4), the ratio $\mu_{t+12}/\mu_t$ of two marginal means exactly one year apart is equal to $e^{12\beta_1/1000}$. Using the ML estimate of $\beta_1$, we then estimate a yearly decline of 4.5% in the marginal polio counts. However, a 95% confidence interval for this parameter ranges from a yearly decline of 10.6% to a yearly increase of 2.1%, including the possibility of no change.
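
The yearly rate ratio and its confidence interval are simple transformations of $\hat{\beta}_1$ and its standard error. The sketch below reproduces the point estimate of a roughly 4.5% yearly decline; the Wald-type interval it prints is close to, though not necessarily identical with, the one quoted above.

```python
import numpy as np

b1, se = -3.74, 2.91                         # ARGLMM estimate, Table 5-2
ratio = np.exp(12 * b1 / 1000)               # mu_{t+12} / mu_t
lo, hi = np.exp(12 * (b1 + np.array([-1.96, 1.96]) * se) / 1000)
print(f"yearly change {100*(ratio - 1):.1f}%, "
      f"95% CI [{100*(lo - 1):.1f}%, {100*(hi - 1):.1f}%]")
```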

At a given time point $t$, a negative binomial GLM (with the parametrization used in Section 4.2) implies that the variance exceeds the mean $\mu_t$ by a factor of $(1 + \mu_t/k)$. For the Poisson ARGLMM we saw that this overdispersion factor equals $[1 + \mu_t(e^{\sigma^2} - 1)]$. Using the estimates from Table 5-2, $1/k$ equals 0.57 and $e^{\sigma^2} - 1$ equals 0.63. Both models seem to propose a similar amount of overdispersion relative to their estimated marginal means. Also, both models are similar in the sense that they assume the same overdispersion parameter ($k$ in the negative binomial case, $\sigma$ in the ARGLMM case) for all observations. However, the negative binomial model does not adjust for correlation among the responses.

5.3.2 A Residual Analysis for the ARGLMM

We assess the quality of our model by a residual analysis. For the random intercepts GLMM and the ARGLMM we define the residual at time $t$ as $r_t = Y_t - \mu_t$, where $\mu_t$ is the marginal mean (4.4). Figure 5-4 shows the autocorrelation function of the estimated residuals $\hat{r}_t = y_t - \hat{\mu}_t$ based on the fit of a negative binomial GLM and a Poisson ARGLMM. The autocorrelation function of residuals from the fit of a GLMM with independent random effects is omitted from the plot, since it is almost identical to the one from the negative binomial GLM. While the significant (based on an asymptotic standard error of $\sqrt{1/T} = 0.08$) residual autocorrelation at lag 1 shows the inappropriateness of the negative binomial GLM and the Poisson GLMM with independent random effects, the autocorrelation function of the residuals is as expected for the Poisson ARGLMM. (We could also judge significance by a Monte Carlo experiment, where we look at the smallest and largest lag 1 correlations of 1000 reordered time series of the residuals. If the observed lag 1 correlation of 0.25 falls close to or above the upper bound, the correlation is deemed significant.)
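
The reordering experiment mentioned in parentheses is straightforward to carry out. In the sketch below, simulated white noise stands in for the actual residual series, which is not reproduced here; with the real residuals, the observed lag 1 correlation of 0.25 would be compared against the permutation extremes.

```python
import numpy as np

rng = np.random.default_rng(4)
resid = rng.normal(size=168)                 # stand-in for y_t - mu_hat_t

def lag1_corr(x):
    return np.corrcoef(x[:-1], x[1:])[0, 1]

obs = lag1_corr(resid)
perm = [lag1_corr(rng.permutation(resid)) for _ in range(1000)]
print(obs, min(perm), max(perm))             # obs near/above max => significant
```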

The model-implied autocorrelation function $\rho_t(h) = \mathrm{corr}(r_t, r_{t+h}) = \mathrm{corr}(Y_t, Y_{t+h})$ for the Poisson ARGLMM is given by (4.8) and depends on the time $t$. For a given lag $h$, we took the average

\[
\bar{\rho}(h) = \frac{1}{168-h}\sum_t \hat{\rho}_t(h)
\]

of all possible lag-$h$ model-based autocorrelations to construct an estimate of the correlation at that lag. The line with filled triangles in Figure 5-4 represents this average of model-based autocorrelations. It seems reasonably close to the observed autocorrelation in the residuals from the fit of the Poisson ARGLMM, especially at the most important first two lags. (A generalized estimating equations approach with a marginally specified AR(1) correlation matrix gives a very similar estimate (0.24) of the marginal lag 1 correlation. However, it is questionable whether the marginal AR(1) correlation is justified, since both the transitional model and the ARGLMM indicate a longer dependence relationship.) The estimated model-based autocorrelation function $\bar{\rho}$ indicates that marginal correlations die out for observations 3 or more months apart.

The residual analysis also reveals an extreme observation (a count of 14 new cases in Nov. 1972), which was deemed insignificant by Zeger (1988) but was addressed by Chan and Ledolter (1995), who added an additional parameter to the model. Benjamin et al. (2003) calculated a conditional tail probability of 0.02 for observing an event that extreme or even more extreme under their negative binomial transitional model. Based on our Poisson ARGLMM, we estimate a conditional mean of 9.1 (cf. Figure 5-2) for an observation at that time point, using the posterior mean of the random effect for Nov. 1972 as its prediction. This translates to an estimated conditional tail probability of 0.08 for observing 14 or more new counts of poliomyelitis for that month, which does not seem too extreme. However, if we base calculations on the marginal probability of observing an event like this, then the probability is equal to 0.007. The marginal mean count for Nov. 1972 is estimated to be 2.6 (cf. Figure 5-2), but the Poisson distribution cannot be used to calculate tail probabilities, since marginally the counts are not Poisson. Hence we used Monte Carlo integration to estimate the marginal distribution $P(Y_{\mathrm{Nov\,1972}} = k)$, $k = 0, \ldots, 13$, from the conditional Poisson model, by sampling from the fitted random effects distribution and averaging over the conditional Poisson densities specified through (5.4), to obtain the above result.
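
The Monte Carlo integration for the marginal tail probability can be sketched as follows. The fixed-effects part of the linear predictor for November 1972 is not restated in the text, so it is backed out here, as an assumption, from the quoted marginal mean of 2.6; $\sigma = 0.70$ is the ARGLMM estimate from Table 5-2.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(5)
sigma = 0.70                                    # ARGLMM estimate, Table 5-2
eta = np.log(2.6) - sigma**2 / 2                # so that E[exp(eta + u)] = 2.6
u = rng.normal(0.0, sigma, size=100_000)
print(poisson.sf(13, np.exp(eta + u)).mean())   # P(Y >= 14), order of 0.007
```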

Figure 5-4: Residual autocorrelations for the Polio data. The plot shows autocorrelation functions of residuals from the fit of a negative binomial model (circles) and a Poisson ARGLMM (triangles). Also shown is an estimate of the model-based autocorrelation function when the assumed model is the negative binomial GLM (filled circles) or the Poisson ARGLMM (filled triangles). The straight dotted line represents the asymptotic standard error $\sqrt{1/T}$ for the correlation.

If we decide to eliminate the extreme observation (and, in a more general context, if we eliminate several more observations or some observations are missing), then the methods for unequally spaced time series developed in Chapter 3 are useful and directly applicable. Here, however, we add an additional parameter, corresponding to an indicator function $I(t = \mathrm{Nov\,1972})$, to adjust for the extreme observation (in regular GLMs this would force a perfect fit for the outlying observation). The final sample size in the MCEM algorithm with the additional parameter increases to 56,900, and some parameter estimates are affected by this adjustment. For instance, the slope estimate now equals $\hat{\beta}_1 = -2.49$ (s.e. = 3.66), and the lag 1 correlation among the conditional log-means increases to $\hat{\rho} = 0.86$ (s.e. = 0.09), while their standard deviation decreases to $\hat{\sigma} = 0.59$ (s.e. = 0.15). The estimate for the coefficient of the indicator function for the outlier equals 1.84, with a standard error of 0.49. There is now even less evidence of a decreasing time trend in the observed time series. Figure 5-5 is the same as Figure 5-4, but now based on the model with the adjustment for the extreme observation.

5.4 Binary and Binomial Time Series

This section analyzes two data sets of binary time series, illustrating regression parameter estimation, checking for model appropriateness and predicting future observations. Various implied model properties concerning the marginal distribution are also derived.

5.4.1 Old Faithful Geyser Data

MacDonald and Zucchini (1997) propose a two-state hidden Markov model (for a brief summary of the hidden Markov model see Section 1.3) for data concerning the eruption times of the Old Faithful geyser. The series consists of 299 consecutive

Figure 5-5: Residual autocorrelations with outlier adjustment for the Polio data. This plot is the same as the one in Figure 5-4, but is now based on models with an extra parameter to adjust for the extreme observation of November 1972.

observations between the 1st of August and the 15th of August, 1985. Most observations can be characterized as either long or short, with very low variation within the long and the short group. In fact, some of the eruption times measured at night are only recorded as being either short or long. MacDonald and Zucchini (1997), following Azzalini and Bowman (1990), transform the series into a binary one, with the cutoff point defined at an eruption length of 3 minutes. They analyze the series assuming a discrete two-point mixture for the probability of a long eruption, where the mixture depends on the states an underlying, two-state Markov chain is in. Here we attempt an approach assuming an underlying normal first-order autoregressive process.

Let $y_t$, $t = 1, \ldots, 299$ be the discretized observations, with a value of 0 indicating eruption times less than 3 minutes (short eruptions) and a value of 1 otherwise (long eruptions). The sample ACF $r(h)$ of the series is pictured in Figure 5-6, with numerical values given in Table 5-3. It clearly shows signs of negative autocorrelation in the series. Note that the autocorrelations do not decay geometrically with increasing lag, so a first-order Markov model (a transitional model) is inappropriate. Marginal correlations implied by a GLMM with autoregressive random effects, as developed in Section 4.3.1, seem capable of capturing such behavior.

Let $\{u_t\}_{t=1}^{299}$ be autocorrelated random effects in a model for the conditional probability $\pi_t(u_t)$ of a long eruption. In particular, the model has the form

\[
\mathrm{logit}\big(\pi_t(u_t)\big) = \alpha + u_t, \quad u_1 \sim N(0, \sigma), \quad u_t = \rho u_{t-1} + \epsilon_t, \quad \epsilon_t \sim N\big(0, \sigma\sqrt{1-\rho^2}\big),
\]

where $\alpha$ is a parameter for the unconditional (or expected) logit of a long eruption and the series $\{u_t\}$ captures the serial dependency. Conditional on these random effects, we assume that successive eruption lengths are independent. The MCEM algorithm with Gibbs sampling from the posterior distribution of the random effects, as described in Sections 2.3 and 3.4.1, is used to obtain maximum likelihood

estimates for this model. Since $\sigma$ and $\rho$ are the standard deviation and the lag 1 correlation of the conditional logits, we use the sample standard deviation and sample lag 1 correlation of the empirical logits $\log\big(\frac{y_t + 0.5}{1 - y_t + 0.5}\big)$ as starting values for these parameters. The starting value for $\alpha$ is its ML estimate under a GLM assuming independent observations. Alternatively, the GLIMMIX macro in SAS, which maximizes a normal approximation of the marginal likelihood (see Section 2.2) but allows one to incorporate autocorrelated random effects, can be used. The two sets of starting values, and some more we tried, yielded similar parameter estimates and standard errors. Using the convergence parameters $\epsilon_1 = 0.001$, $c = 5$, $\epsilon_2 = 0.003$, $\epsilon_3 = -0.001$, $a = 1.02$ and $q = 1.1$ (see Section 2.3.3), we obtained the following estimates:

\[
\hat{\alpha} = 1.24\ (0.56), \quad \hat{\sigma} = 3.69\ (1.17), \quad \hat{\rho} = -0.89\ (0.037).
\]

The algorithm converged after 130 iterations, with a final Monte Carlo sample size of $m = 56{,}000$. The values in parentheses are obtained from a Monte Carlo approximation of the observed information matrix (Louis, 1982), with a Monte Carlo sample of 100,000 drawn from the fitted posterior random effects distribution.

5.4.1.1 Marginal Properties

Since the estimate of $\sigma$ is large, we use formula (4.17) to derive marginal correlations. Let $\tilde{\alpha}$ and $\tilde{\sigma}$ denote the maximum likelihood estimates of $\alpha$ and $\sigma$, divided by a factor of 1.6. The calculation of marginal correlations enables a comparison with the sample autocorrelation function of the observed series and gives valuable information about the fit of the model. From (4.17), the marginal autocorrelations are estimated by

\[
\hat{\rho}(h) = \frac{\Phi_2\big((\tilde{\alpha}, \tilde{\alpha})',\, \hat{Q}(t, t+h)\big) - (\hat{\pi}^M)^2}{\hat{\pi}^M(1 - \hat{\pi}^M)},
\]

where

\[
\hat{\pi}^M = \Phi\Big(\tilde{\alpha}\big/\sqrt{(\tilde{\sigma})^2 + 1}\Big)
\]

is the time-invariant estimated marginal probability of a long eruption time (equal to 0.62) and

\[
\hat{Q}(t, t+h) = \begin{pmatrix} 1 + (\tilde{\sigma})^2 & (\tilde{\sigma})^2 \hat{\rho}^h \\ (\tilde{\sigma})^2 \hat{\rho}^h & 1 + (\tilde{\sigma})^2 \end{pmatrix}
\]

is the estimate of the covariance matrix (4.15) for a bivariate zero-mean normal random variable with cdf $\Phi_2$. The plot in Figure 5-6, with numerical values given in Table 5-3, shows good agreement between these model-based estimates of the autocorrelation function and the sample autocorrelation function.

Table 5-3: Autocorrelation functions for the Old Faithful geyser data.

h:      1      2      3      4      5      6      7      8      9     10
r(h): -0.54   0.48  -0.35   0.32  -0.26   0.21  -0.16   0.14  -0.17   0.16
p(h): -0.48   0.46  -0.37   0.35  -0.29   0.27  -0.23   0.21  -0.18   0.17

Comparison of numerical values of the sample ACF $r(h)$ and the estimated ACF $\hat{\rho}(h)$ based on a logistic-normal ARGLMM.

Similar good agreement between the model and the data can be seen from Figure 5-7, which plots the empirical lorelogram against its model counterpart. Note that the empirical lorelogram is not defined at lag 1, since the sequence of two consecutive short eruptions was not observed. The plot also shows the asymptotic standard error (ASE) for the log odds ratio at each lag. It should be mentioned that similarly good results were also achieved with the hidden Markov model approach of MacDonald and Zucchini (1997).

The probit approach is not the only way to calculate marginal probabilities and correlations, although it provides closed-form approximations. Integrating the conditional success probability over the marginal distribution of the random effect $U_t$ gives the marginal success probability at time $t$. Similarly, integrating the joint conditional distribution of two successes, two failures, or one success and one failure over the two-dimensional distribution of the random effects vector $(U_t, U_{t^*})$ gives the corresponding marginal joint probabilities. For instance, as we already briefly

Figure 5-6: Autocorrelation functions for the Old Faithful geyser data. Comparison of the sample ACF $r(h)$ (triangles) and the estimated marginal model-based ACF $\hat{\rho}(h)$ (squares) based on a logistic-normal ARGLMM for the Old Faithful geyser data. The straight dotted lines represent the asymptotic standard error $\sqrt{1/T}$ for the correlation.

Figure 5-7: Lorelogram for the Old Faithful geyser data. Comparison of the empirical lorelogram LOR(h) (triangles) and the estimated lorelogram $\hat{\theta}(h)$ (squares) based on a logistic-normal ARGLMM for the Old Faithful geyser data.

mentioned, a unique feature of this series is that every short eruption is followed by a long one; in other words, the sequence $(y_t, y_{t+1}) = (0, 0)$ is not observed for any $t$. This is not a structural zero, as there is no a priori reason why a short eruption cannot be followed by another short one, although Azzalini and Bowman (1990) mention a geophysical interpretation which makes this quite unlikely. We estimate the marginal joint probability $P(Y_t = 0, Y_{t+1} = 0)$ of two consecutive short eruptions as

\[
\hat{P}(Y_t = 0, Y_{t+1} = 0) = \Phi_2\big((-\tilde{\alpha}, -\tilde{\alpha})',\, \hat{Q}(t, t+1)\big) = 0.031 \tag{5.5}
\]

based on the probit model approximation, using the 2-dimensional multivariate normal cdf, and as

\[
\hat{P}(Y_t = 0, Y_{t+1} = 0) = \frac{1}{m}\sum_{j=1}^{m}\Big(1 + \exp\{\hat{\alpha} + u_t^{(j)}\}\Big)^{-1}\Big(1 + \exp\{\hat{\alpha} + u_{t+1}^{(j)}\}\Big)^{-1} = 0.036 \tag{5.6}
\]

based on a Monte Carlo sum approximation of the two-dimensional integral, with $m = 500{,}000$ samples from the estimated joint distribution of $(u_t, u_{t+1})$.
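
Both computations in (5.5) and (5.6) take only a few lines. The sketch below uses the ML estimates quoted above ($\hat{\alpha} = 1.24$, $\hat{\sigma} = 3.69$, $\hat{\rho} = -0.89$), rescaled by 1.6 for the probit part, and should reproduce values close to 0.031 and 0.036.

```python
import numpy as np
from scipy.stats import multivariate_normal

alpha, sigma, rho = 1.24, 3.69, -0.89         # ML estimates from the text
a, s = alpha / 1.6, sigma / 1.6               # probit-scale parameters

Q = np.array([[1 + s**2, s**2 * rho],
              [s**2 * rho, 1 + s**2]])
p_probit = multivariate_normal([0.0, 0.0], Q).cdf([-a, -a])    # (5.5)

rng = np.random.default_rng(6)
C = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
u = rng.multivariate_normal([0.0, 0.0], C, size=500_000)
p_mc = np.mean(1 / ((1 + np.exp(alpha + u[:, 0])) *
                    (1 + np.exp(alpha + u[:, 1]))))            # (5.6)
print(p_probit, p_mc)                         # about 0.031 and 0.036
```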

With a straightforward extension of the results presented in Section 4.3.2, we can calculate joint probabilities with the probit model approach for longer sequences of long and short eruptions. For example, the joint probability of observing a long eruption at time $t$, followed by a short eruption and another long one, is given by

\[
P(Y_t = 1, Y_{t+1} = 0, Y_{t+2} = 1) = \Phi_3\big((\alpha, -\alpha, \alpha)',\, Q(t, t+1, t+2)\big),
\]

again using the threshold interpretation $Y_t = 1 \Leftrightarrow T_t < \alpha + U_t$. Here $\alpha$ is the intercept term of a probit model, $W_s = T_s - U_s$, $s = t, t+1, t+2$, and $\Phi_3$ is the cdf corresponding to the multivariate normal mean-zero random vector $(W_t, -W_{t+1}, W_{t+2})'$ with variance-covariance matrix

\[
Q(t, t+1, t+2) = \begin{pmatrix}
1 + (\tilde{\sigma})^2 & -(\tilde{\sigma})^2\rho & (\tilde{\sigma})^2\rho^2 \\
-(\tilde{\sigma})^2\rho & 1 + (\tilde{\sigma})^2 & -(\tilde{\sigma})^2\rho \\
(\tilde{\sigma})^2\rho^2 & -(\tilde{\sigma})^2\rho & 1 + (\tilde{\sigma})^2
\end{pmatrix}.
\]

For a model-based estimate of the probability of this particular sequence of three consecutive observations, we simply plug in maximum likelihood estimates of the parameters appearing in $\Phi_3(\cdot)$ and $Q(t, t+1, t+2)$. Estimates of the resulting counts for all possible combinations up to order three are displayed in Table 5-4, using both the probit-logit connection and Monte Carlo integration. The results are very similar for both types of approximation, which speaks for the quality of the closed-form, probit-based approximations derived in the previous chapter.

5.4.1.2 Exchangeability of certain sequences

The probit connection also helps in explaining certain symmetries in the model when no time-varying covariates are present. In the geyser example, the conditional probabilities depend only on an intercept term. Returning to sequences of length two, the probability of the event $(W_t, -W_{t+1})' < (\tilde{\alpha}, -\tilde{\alpha})'$ is the probability of a long eruption at time $t$ followed by a short one at time $t+1$. The symmetry of the (mean-zero) bivariate normal distribution, together with the special form of the variance-covariance matrix (most notably, equal variances), demands that this event has the same probability as the event $(-W_t, W_{t+1})' < (-\tilde{\alpha}, \tilde{\alpha})'$. The probability of the latter event is associated with the probability of a short eruption followed by a long one. Hence the estimated marginal probabilities are the same for a sequence of a long eruption followed by a short one and a short eruption followed by a long one. This is reflected in Table 5-4, which compares observed and expected counts for several more sequences of long and short eruptions.

For three consecutive eruptions, symmetry in the trivariate (mean-zero) normal distribution of $(W_t, W_{t+1}, W_{t+2})'$ demands that the events

\[
(W_t, W_{t+1}, -W_{t+2})' < (\tilde{\alpha}, \tilde{\alpha}, -\tilde{\alpha})' \quad \text{and} \quad (-W_t, W_{t+1}, W_{t+2})' < (-\tilde{\alpha}, \tilde{\alpha}, \tilde{\alpha})'
\]

have equal probability, and likewise the events

\[
(-W_t, -W_{t+1}, W_{t+2})' < (-\tilde{\alpha}, -\tilde{\alpha}, \tilde{\alpha})' \quad \text{and} \quad (W_t, -W_{t+1}, -W_{t+2})' < (\tilde{\alpha}, -\tilde{\alpha}, -\tilde{\alpha})'
\]

have equal probability. Hence the model-based marginal probabilities of two long eruptions followed by a short one (1, 1, 0) and a short one followed by two long ones (0, 1, 1) are equal. Accordingly, the marginal probability of two short eruptions and a long one (0, 0, 1) is the same as the probability of one long eruption followed by two short ones (1, 0, 0). Again, this symmetry is reflected in the expected counts presented in Table 5-4. It can be interpreted as an exchangeability property for certain sequences of long and short eruptions when no time-varying covariates are present. E.g., denoting the event of two consecutive long eruptions by A and one short eruption by B, our model suggests that the probability distribution of AB is the same as the one for BA.

5.4.1.3 Technical derivation of the exchangeability property

It is not immediately obvious why these pairs of events have equal probability under our model. Following is a proof of the fact that the two events $(W_t, W_{t+1}, -W_{t+2})' < (\alpha, \alpha, -\alpha)'$ and $(-W_t, W_{t+1}, W_{t+2})' < (-\alpha, \alpha, \alpha)'$ have equal probability. Without loss of generality, assume $t = 1$ and let $W = (W_1, W_2, W_3)'$ have a trivariate normal distribution with mean zero and variance-covariance matrix $\Sigma$, where $\Sigma$ has diagonal elements all equal to $\sigma^2$ and equal covariances

$\sigma_{12} = \sigma_{23}$. Then the density function of $W$ is proportional to

\[
f(w_1, w_2, w_3) \propto \exp\Big\{-\Big[(w_1^2 + w_3^2)(\sigma^4 - \sigma_{12}^2) + w_2^2(\sigma^4 - \sigma_{13}^2) + 2w_2\sigma_{12}(\sigma_{13} - \sigma^2)(w_1 + w_3) + 2w_1w_3(\sigma_{12}^2 - \sigma^2\sigma_{13})\Big] \Big/ \Big[2(\sigma^4 + \sigma_{13}\sigma^2 - 2\sigma_{12}^2)(\sigma^2 - \sigma_{13})\Big]\Big\}.
\]

From this expression it is straightforward to derive the corresponding expressions for the densities of $(W_1, W_2, -W_3)'$ and $(-W_3, W_2, W_1)'$. Then it can be shown algebraically, with a simple transformation argument, that these two random vectors have identical densities, i.e., $f(w_1, w_2, -w_3) = f(-w_3, w_2, w_1)$. (Notice the symmetric way in which $w_1$ and $w_3$ enter the density above.) Now the first event, $(W_1, W_2, -W_3)' < (\alpha, \alpha, -\alpha)'$, has probability

\[
\int_{-\infty}^{-\alpha}\int_{-\infty}^{\alpha}\int_{-\infty}^{\alpha} f(w_1, w_2, -w_3)\, dw_1\, dw_2\, dw_3
= \int_{-\infty}^{-\alpha}\int_{-\infty}^{\alpha}\int_{-\infty}^{\alpha} f(-w_3, w_2, w_1)\, dw_1\, dw_2\, dw_3
\]
\[
= \int_{-\infty}^{\alpha}\int_{-\infty}^{\alpha}\int_{-\infty}^{-\alpha} f(-w_3, w_2, w_1)\, dw_3\, dw_2\, dw_1
= \int_{-\infty}^{\alpha}\int_{-\infty}^{\alpha}\int_{-\infty}^{-\alpha} f(-v_1, v_2, v_3)\, dv_1\, dv_2\, dv_3,
\]

where we used Fubini's theorem and, in the last step, simply renamed the variables (i.e., the transformation $w_1 = v_3$, $w_2 = v_2$, $w_3 = v_1$). However, this last probability is the probability of the event that the random vector $(-W_1, W_2, W_3)'$ is less than $(-\alpha, \alpha, \alpha)'$, quod erat demonstrandum.

The proof for the equivalence of the other pair of three consecutive eruptions is similar. Also the case of the equivalence of the marginal probabilities of a short eruption followed by a long one and a long eruption followed by a short one is handled with similar arguments, using the bivariate normal distribution.
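
The symmetry can also be confirmed numerically: evaluating the two orthant probabilities with a multivariate normal cdf gives the same value. The sketch below uses the rescaled geyser estimates for $\tilde{\sigma}^2$ and $\rho$, but any admissible values would do.

```python
import numpy as np
from scipy.stats import multivariate_normal

a, s2, rho = 0.775, (3.69 / 1.6)**2, -0.89
Sigma = np.array([[1 + s2,      s2 * rho,   s2 * rho**2],
                  [s2 * rho,    1 + s2,     s2 * rho   ],
                  [s2 * rho**2, s2 * rho,   1 + s2     ]])  # cov of (W_t, W_{t+1}, W_{t+2})

D1, D2 = np.diag([1, 1, -1]), np.diag([-1, 1, 1])           # sign flips
p110 = multivariate_normal(np.zeros(3), D1 @ Sigma @ D1).cdf([a, a, -a])
p011 = multivariate_normal(np.zeros(3), D2 @ Sigma @ D2).cdf([-a, a, a])
print(p110, p011)     # identical up to numerical integration error
```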

Symmetry also occurs with the Monte Carlo approach for approximating marginal probabilities, since the distributions of $(U_t, U_{t+1})$ and $(U_{t+1}, U_t)$ are equivalent. Therefore a random sample $(u_t^{(j)}, u_{t+1}^{(j)})$, $j = 1, \ldots, m$ from the joint distribution of $(U_t, U_{t+1})$ is also a random sample from the distribution of $(U_{t+1}, U_t)$. Consequently, $u_t^{(j)}$ and $u_{t+1}^{(j)}$ can be used interchangeably, and the Monte Carlo approximation to $P(Y_t = 1, Y_{t+1} = 0)$,

\[
\frac{1}{m}\sum_{j=1}^m \frac{\exp\{\hat{\alpha} + u_t^{(j)}\}}{1 + \exp\{\hat{\alpha} + u_t^{(j)}\}} \cdot \frac{1}{1 + \exp\{\hat{\alpha} + u_{t+1}^{(j)}\}},
\]

with samples $(u_t^{(j)}, u_{t+1}^{(j)})$, is equivalent to

\[
\frac{1}{m}\sum_{j=1}^m \frac{1}{1 + \exp\{\hat{\alpha} + u_t^{(j)}\}} \cdot \frac{\exp\{\hat{\alpha} + u_{t+1}^{(j)}\}}{1 + \exp\{\hat{\alpha} + u_{t+1}^{(j)}\}},
\]

which is the approximation to $P(Y_t = 0, Y_{t+1} = 1)$. The last column in Table 5-4 displays the expected counts based on a Monte Carlo approximation of the marginal probabilities.

Another word of caution: simply multiplying the estimated marginal probability of a sequence by the sample size of 299 consecutive eruptions to obtain the expected count is wrong, and can lead to estimated counts larger than what is possible in a sequence of 299 observations. We calculate expected counts of a particular sequence by multiplying the estimated marginal probability for that sequence with the number of possible consecutive sequences of length two or three. E.g., there are 297 possible sequences of three consecutive eruptions in the Old Faithful data set. Hence, with an estimated marginal probability of $\hat{P}(Y_t = 1, Y_{t+1} = 1, Y_{t+2} = 1) = 0.1703$ for three consecutive long eruptions, we expect a count of $297 \times 0.1703 = 50.6$ such sequences in the time frame observed for that series. This distinction becomes more important for shorter or unequally spaced data, as will be demonstrated with the next example.

Note that the logistic ARGLMM uses only 3 parameters. The only cells showing some lack of fit in Table 5-4 are the ones which involve two or more

consecutive short eruptions, an outcome that was not observed in the given time span. However, our model assigns a very small probability to this event.

Table 5-4: Comparison of observed and expected counts for the Old Faithful geyser data.

                                       expected counts
                     observed counts   probit   Monte Carlo
long eruptions (1)        194          185.7      185.0
short eruptions (0)       105          113.3      114.0
                          299          299        299

from 1 to 1                89           81.4       81.9
from 1 to 0               105          103.6      102.6
from 0 to 1               104          103.6      102.6
from 0 to 0                 0            9.3       10.9
                          298          298        298

from 1 to 1 to 1           54           50.6       50.6
from 1 to 1 to 0           35           30.6       31.0
from 1 to 0 to 1          104           96.1       94.1
from 1 to 0 to 0            0            7.2        8.1
from 0 to 1 to 1           35           30.6       31.0
from 0 to 1 to 0           69           72.7       71.5
from 0 to 0 to 1            0            7.2        8.1
from 0 to 0 to 0            0            2.0        2.6
                          297          297        297

The table compares observed counts of short and long eruptions and various transitions with those expected under a logistic ARGLMM for the Old Faithful geyser data.

5.4.2 Oxford versus Cambridge Boat Race Data

In this illustration we consider the outcome of the annual boat race between teams representing the University of Oxford and the University of Cambridge. The first race took place in 1829 and was won by Oxford, and the last race (at the time of writing) took place in 2003 and was won by Oxford, for the second time in a row. Overall, Cambridge holds a slight edge, having won 77 out of 148 races (52.0%). Two races were held in 1849, one in March and one in December. Since all other races are traditionally held in late March or early April, we treat the

There are 26 years, such as during both world wars, when the race did not take place: 1830-1835, 1837/38, 1843/44, 1847/48, 1851, 1853, 1855, 1915-1919 and 1940-1945. No special handling of these missing data is required with our methods of maximum likelihood estimation. In 1877 the race ended in a dead heat, which we treat as another missing value, in the sense that for this year no winner could be determined. The data are available online at www.theboatrace.org and are plotted in Figure 5-8.

[Figure 5-8: Plot of the Oxford vs. Cambridge boat race data. Squares at 0 and 1 are the outcomes of the individual races, where a square at 1 stands for a Cambridge win. The jagged line connects the estimates of the conditional success probabilities π(u_t) over time.]

5.4.2.1 A GLMM with autocorrelated random effects

Let Y_t = 1 if Cambridge wins at year t and Y_t = 0 if Oxford wins, where t indexes the 148 years the race took place. Conditional on an autoregressive random time effect u_t, we model the log odds of a Cambridge win at time t as

log[ π(u_t) / (1 - π(u_t)) ] = α + βw_t + u_t,    (5-7)

where w_t is the difference between the average weight of the Cambridge crew and the Oxford crew. (The boats have standard size and weight.) This is the only covariate available.
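To make the data-generating mechanism concrete, here is a minimal simulation sketch of model (5-7) combined with the gap-aware autoregressive random-effects process specified below. It is an illustration under stated assumptions, not code used for the analyses: simulate_arglmm is a hypothetical helper name, and the parameter values are simply the estimates reported later in Table 5-5.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_arglmm(years, w, alpha, beta, sigma, rho):
    """Simulate binary outcomes from the logistic ARGLMM (5-7) over an
    unequally spaced grid of years. A gap of d years enters only through
    u_{t+1} = rho**d * u_t + eps_t, with Var(eps_t) = sigma^2 (1 - rho**(2d))
    so that every u_t keeps the stationary N(0, sigma^2) distribution."""
    T = len(years)
    u = np.empty(T)
    u[0] = rng.normal(0.0, sigma)
    for t in range(1, T):
        d = years[t] - years[t - 1]
        u[t] = rho**d * u[t - 1] + rng.normal(
            0.0, sigma * np.sqrt(1.0 - rho**(2 * d)))
    pi = 1.0 / (1.0 + np.exp(-(alpha + beta * np.asarray(w) + u)))
    return rng.binomial(1, pi), u

years = [1829, 1836, 1839, 1840, 1841, 1842]   # first race years; note the gaps
w = np.zeros(len(years))                        # zero weight differences
y, u = simulate_arglmm(years, w, alpha=0.27, beta=0.079, sigma=2.65, rho=0.68)
```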


A winner in one year is also likely to be a winner in the next year because of overlapping crew memberships, rowing techniques, training methods, experience and many other factors. In part, each outcome reflects the underlying combined efforts leading up to the race. We propose correlated random effects to parsimoniously characterize the variation and correlation in outcomes due to these efforts. That is, our model establishes a link between successive winning probabilities by specifying an underlying autoregressive process u_{t+1} = ρ^{d_t} u_t + ε_t for the random effects, where d_t is the time lag (in years) between two successive races.

Figure 5-9 shows the dependency in the data by plotting a smooth estimate s(h) of the sample variogram g(h) for lags up to 50 years. The most important feature is the strong increase over the first few lags, after which the sample variogram levels off at a constant level. A similar impression of the dependency structure in this data set is obtained from the lorelogram shown in Figure 5-10. For each lag h, the odds of a Cambridge win are estimated by cross-classifying outcomes h years apart. E.g., the first three values in the plot, LOR(1) = 1.30, LOR(2) = 1.37 and LOR(3) = 0.74, are the log odds ratios corresponding to the following contingency tables, cross-classifying outcomes one, two or three years apart:

    lag 1          lag 2          lag 3
        0   1          0   1          0   1
    0  42  24      0  43  21      0  37  25
    1  23  48      1  24  46      1  29  41

In constructing these tables, proper care must be taken to accommodate the years in which no race took place. Similar to the variogram, we observe a sharp decline in the log odds ratio over the first few lags, after which the log odds ratios level off at around 0.
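A sample lorelogram of this kind is straightforward to compute. The sketch below is a minimal version under stated assumptions (the helper name lorelogram is ours, and y is coded 0/1 with one outcome per race year): for each lag h it cross-classifies outcome pairs exactly h years apart, skipping pairs broken by a missing year.

```python
import numpy as np

def lorelogram(y, years, max_lag):
    """Sample lorelogram for an unequally spaced binary series: empirical
    log odds ratio LOR(h) and its asymptotic standard error, computed from
    the 2x2 table of outcome pairs exactly h years apart."""
    obs = dict(zip(years, y))
    out = {}
    for h in range(1, max_lag + 1):
        n = np.zeros((2, 2))
        for t, yt in obs.items():
            if t + h in obs:                 # pairs spanning missing years skip
                n[yt, obs[t + h]] += 1       # row: outcome at t, col: at t + h
        if n.min() > 0:                       # LOR undefined with an empty cell
            lor = np.log(n[1, 1] * n[0, 0] / (n[1, 0] * n[0, 1]))
            ase = np.sqrt((1.0 / n).sum())   # usual ASE: sqrt(sum of 1/n_ij)
            out[h] = (lor, ase)
    return out
```

Applied to the boat race series, the first few lags should reproduce the tables above, e.g., LOR(1) = log(48 × 42 / (24 × 23)) ≈ 1.30 with ASE ≈ 0.36.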


[Figure 5-9: Variogram for the Oxford vs. Cambridge boat race data. Comparison of a smooth estimate of the sample variogram with a model-based estimate of the variogram. Triangles represent the smooth (natural cubic spline) estimate s(h) of the sample variogram. Squares represent the model-based estimate γ̂(h) of the variogram. Crosses are the actual values of the sample variogram g(h).]


[Figure 5-10: Lorelogram for the Oxford vs. Cambridge boat race data. Comparison of a smooth estimate of the sample lorelogram with the model-based estimate of the lorelogram. Triangles represent the smooth (natural cubic spline) estimate s(h) of the sample lorelogram. Squares represent the model-based estimate θ̂(h) of the lorelogram. Crosses are the actual values LOR(h) of the sample lorelogram, and the grey dotted lines represent two times the ASE of the log odds ratio.]

Figure 5-10 also shows twice the asymptotic standard error (ASE) of the log odds ratio at each lag, calculated from the observed tables. For the three tables above, the ASEs are 0.36, 0.37 and 0.35, respectively. According to the web page www.theboatrace.org, "Boat race legend has it that the heavier and taller crews have an advantage when it comes to race day". In fact, an estimate of the weight effect β under the assumption of independent outcomes from one year to another equals 0.056, with s.e. equal to 0.023. This would support the claim that the heavier crew has higher odds of winning the race. E.g., we estimate that a 5 pound weight difference increases the odds of winning by 32%. However, can this claim still be supported in the presence of dependent observations, as both the plot of the variogram and the lorelogram suggest?


Table 5-5: Maximum likelihood estimates for the boat race data.

                 α        β        σ        ρ
  estimate:     0.27    0.079    2.65     0.68
  s.e.:         0.54    0.047    1.18     0.11

[Figure 5-11: Path plots of fixed and random effects parameter estimates for the boat race data. The x-axis shows the iteration number through the iterations of the MCEM algorithm. The Monte Carlo sample size increased over 112 iterations from 100 at the beginning to 8260 random draws from the posterior random effects distribution at the end.]

We used the MCEM algorithm described in Sections 2.3 and 3.3 for model (5-7) to obtain the maximum likelihood estimates displayed in Table 5-5. Standard errors are based on a Monte Carlo approximation of the observed information matrix using 50,000 samples from the estimated posterior distribution. Trace plots for the parameter estimates are pictured in Figure 5-11, with convergence criteria similar to the ones mentioned for the Old Faithful data: ε_1 = 0.001, c = 4, ε_2 = 0.003, ε_3 = 0.005, a = 1.03 and q = 1.05 (cf. Section 2.3.3). Regular GLM estimates for α and β, together with σ = 2 and ρ = 0, were used as starting values for the MCEM algorithm.


Based on Table 5-5, we estimate that the conditional odds of a Cambridge win increase by 48% for every 5 pounds the average Cambridge team member (and there are 9 on a boat, including the cox) weighs more, but the estimated standard error might be too large to conclude a significant effect. A quadratic effect of the weight difference on the log odds was found to be insignificant. The significant correlation of 0.68 for races one year apart indicates a strong dependency between successive outcomes. This reflects the fact that the odds of winning are influenced by factors such as overlapping crew memberships, training methods, motivation and experience from previous races.

The conditional estimate for β translates into an estimate of 0.041 for the marginal effect, using the probit-logit connection mentioned in Section 4.3.2. Hence the marginal odds of winning are estimated to increase by 23% (compared to 32% from the GLM fit) for a 5 pound difference in the average weight when we properly adjust for correlation in the series. Moreover, the large standard error of β̂ does not rule out the possibility of no effect of weight on the odds of winning.
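The conditional-to-marginal attenuation can be sketched numerically. The following is only an approximation under stated assumptions: it uses one common version of the logit-probit connection, a marginal coefficient of roughly β / sqrt(1 + c²σ²) with c = 16√3/(15π), which need not match the exact formula of Section 4.3.2 but reproduces the reported numbers closely.

```python
import numpy as np

beta, sigma = 0.079, 2.65              # conditional estimates from Table 5-5

# Logit-probit attenuation factor; c^2 is approximately 0.346.
c = 16.0 * np.sqrt(3.0) / (15.0 * np.pi)
beta_marginal = beta / np.sqrt(1.0 + c**2 * sigma**2)

print(beta_marginal)                   # ~0.043 (0.041 reported in the text)
print(np.exp(5 * beta) - 1)            # conditional odds increase per 5 lb: ~48%
print(np.exp(5 * beta_marginal) - 1)   # marginal odds increase: ~24% (23% in text)
```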


5.4.2.2 Checking the fit of the model

Given the maximum likelihood estimates, the model-based estimate of the variogram is

γ̂(h) = (1/2) Ê[(Y_{t+h} - Y_t)^2] = P̂(Y_t = 1) - P̂(Y_t = 1, Y_{t+h} = 1),  h = 1, 2, ...,

where the marginal and the marginal joint probability can be estimated via the probit connection or by integrating over the one- and two-dimensional random effects distributions. With the inclusion of a time-varying covariate (the difference in average crew weight, w_t), marginal probabilities vary over time. We assume no weight differences (i.e., w_t = 0 for all t) for the calculations in this section. Then the estimated marginal probability of a Cambridge win, P̂(Y_t = 1), or of two Cambridge wins at times t and t + h, P̂(Y_t = 1, Y_{t+h} = 1), does not depend on the years t and t + h.

Figure 5-9 shows the model-based estimate of the variogram. The agreement with the empirical variogram is good, especially for the most important first few lags, and the model seems to capture the association displayed in the data appropriately. The smooth line in Figure 5-10 represents the model-based estimate of the marginal log odds ratio

θ̂(h) = log[ P̂(Y_t = 1, Y_{t+h} = 1) P̂(Y_t = 0, Y_{t+h} = 0) / ( P̂(Y_t = 1, Y_{t+h} = 0) P̂(Y_t = 0, Y_{t+h} = 1) ) ]

for observations h years apart when both crews are of equal weight. For instance, for races one year apart, the model-based estimate of the marginal log odds ratio of a Cambridge win is 1.38, approximated via the logit-probit connection. That is, the odds of a Cambridge win over an Oxford win are estimated to be 4 times higher if Cambridge had won the previous race than if they had lost it. Naturally, this factor gets smaller the greater the time separation between two races. For instance, at lag 2, the odds of a Cambridge win are only 2.4 times higher if they had won the race two years earlier rather than having lost it. Based on Figure 5-10, a result 5 or more years in the past hardly has any influence on the result in year t. That is, the odds of a Cambridge win in year t are roughly the same whether Cambridge had won or lost the race in year t - 5.

Table 5-6 compares observed and expected counts of particular sequences of wins and losses, again assuming no weight difference between the two crews. Care must be taken in finding all possible sequences of a given length in the observed time series, because of the unequal sampling intervals. For instance, with the specific pattern of no races in certain years, in the series of 148 unequally spaced observations only 137 sequences of two consecutive years and 130 sequences of three consecutive years can be formed. These are the multipliers for the estimated probabilities of two and three consecutive outcomes, respectively.
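Counting these multipliers correctly is a small but easy-to-get-wrong step. A minimal sketch (the helper name n_consecutive is ours): a run of k consecutive years counts only if every year in it is a race year.

```python
def n_consecutive(years, k):
    """Number of length-k runs of consecutive years that are all race
    years; these runs multiply the marginal sequence probabilities
    to give expected counts."""
    s = set(years)
    return sum(all(t + j in s for j in range(k)) for t in s)

# For the 148 boat race years this should give 137 runs of length two
# and 130 runs of length three, as quoted above.
```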


Table 5-6: Observed and expected counts of sequences of wins (W) and losses (L) for the Cambridge University team.

                        expected counts
           observed    probit    Monte Carlo
  W            77        79.1        78.4
  L            71        68.9        69.6
              148       148         148
  W W          48        50.5        49.8
  W L          23        22.8        23.4
  L W          24        22.8        23.4
  L L          42        41.0        40.4
              137       137         137
  W W W        34        34.8        34.1
  W W L        13        13.1        13.2
  W L W        12         9.4         9.8
  W L L        10        12.2        12.2
  L W W        11        13.1        13.3
  L W L         9         8.6         8.9
  L L W         9        12.2        12.3
  L L L        32        26.7        26.1
              130       130         130

In general, the agreement between observed and predicted counts of sequences is excellent. The only minor discrepancy seems to be the one concerning three Cambridge losses (or, equivalently, three Oxford wins) in a row. However, in constructing this table we assumed no weight difference between the two crews. On average over the 148 races, the weight difference between the Cambridge crew and the Oxford crew is -1.03 pounds, i.e., Oxford crews are heavier on average. Out of the 33 races in which Oxford had a weight advantage (i.e., was heavier) of more than 5 pounds, it won 21 races (64%). Similarly, out of the 25 races in which Cambridge had a weight advantage of more than 5 pounds, it won 16 races (64%). Since weight seems to have some effect on the outcome of the race and was marginally significant, we overestimate the probability of three Cambridge losses (or, equivalently, underestimate the probability of three Oxford wins) by assuming no weight difference for all three races. This leads to the slightly smaller expected count than the one observed in the last entry of Table 5-6. Factoring an average weight difference of -1.03 pounds for all three races into the calculation of the marginal probability, the estimated expected count for three Cambridge losses is 27.9, which is a little closer to the observed number of 32.


5.4.2.3 Prediction of random effects

In traditional GLMMs, the prediction of a univariate random effect u describing an exchangeable correlation structure is the posterior mean E[u | y] of the distribution of the random effect given the observed data. Similarly, the prediction of the random process u = (u_1, ..., u_T)' is the posterior mean E[u | y]. Draws from the last iteration of the MCEM algorithm can be used to approximate this posterior mean by a Monte Carlo average. Let û_t denote the t-th component of this approximation. Then an estimate of the conditional probability of a Cambridge win in year t is given by

π̂(û_t) = exp{α̂ + β̂w_t + û_t} / (1 + exp{α̂ + β̂w_t + û_t}),

which is plotted in Figure 5-8.


It seems that a sequence of wins pulls the estimated conditional probability towards 1, and a couple of losses pulls it towards 0. The predicted random effects for the last 45 years are displayed in Table 5-7, together with the outcomes of these races.

Table 5-7: Estimated random effects û_t for the last 45 years of the boat race data.

  year result   û_t     year result   û_t     year result   û_t
  1959   L    -0.49     1974   L     0.23     1989   L    -2.35
  1960   L    -0.58     1975   W    -0.12     1990   L    -2.11
  1961   W     0.63     1976   L    -1.31     1991   L    -1.82
  1962   W     0.71     1977   L    -2.07     1992   L    -1.03
  1963   L    -0.19     1978   L    -2.57     1993   W     0.75
  1964   W     0.12     1979   L    -2.76     1994   W     1.63
  1965   L    -1.11     1980   L    -2.85     1995   W     2.05
  1966   L    -1.34     1981   L    -2.83     1996   W     2.16
  1967   L    -0.71     1982   L    -2.78     1997   W     2.08
  1968   W     0.85     1983   L    -2.59     1998   W     1.74
  1969   W     1.68     1984   L    -2.35     1999   W     1.07
  1970   W     2.11     1985   L    -1.81     2000   L    -0.10
  1971   W     2.11     1986   W    -0.56     2001   W    -0.08
  1972   W     1.76     1987   L    -1.41     2002   L    -1.38
  1973   W     1.05     1988   L    -1.98     2003   L    -1.82

The structure of the estimated random effects reflects the dynamics of the data: positive random effects are usually associated with a Cambridge win and negative ones with an Oxford win. The magnitude of the predicted random effects increases (decreases) the closer the start (end) of a sequence of wins or losses for one team is in sight. For example, in 1993 the predicted random effect is 0.75. During the next 3 years, all of which are Cambridge wins, the magnitude of the predicted random effects increases steadily, reflecting the increased confidence (as measured by the odds) of another Cambridge win due to past results. After 1996 the predicted random effects slowly decline, but still show a preference for a Cambridge win. In 2000 they turn negative because of an Oxford win in that year. For 2001 the random effect rises momentarily because of another Cambridge win, but it declines again in 2002 and 2003 because of two consecutive Cambridge losses.


Part of this phenomenon can be explained through the form of the full univariate conditional distribution of u_t given the data and all other random effects. Section 3.4.1 showed that this distribution depends on the immediate predecessor u_{t-1}, the immediate successor u_{t+1} and the outcome y_t of the race. In turn, the full univariate conditional distribution of the successor u_{t+1} depends directly on the result y_{t+1} of that race and again on the random effects before and after it. In this way, information about future wins or losses is incorporated into the posterior distribution of u_t. For example, a lost race in the near future results in decreasing predicted random effects preceding it, reflecting this future change in momentum. Incidentally, the distribution of the random effect u_T at the boundary t = T depends only on its immediate predecessor u_{T-1} and on y_T.

Using the prediction rule ŷ_t = 1 if π̂(û_t) > 0.5 and ŷ_t = 0 otherwise, we are able to compare outcomes predicted by our model to observed ones. Of the 148 observations, the GLMM with autocorrelated random effects misclassifies only 8, or 5.4%, as the opposite of the outcome that was actually observed. This is, of course, a significant improvement over predictions based on a regular logit GLM, which has a misclassification rate of 62 out of the 148 observations, or 41.9%.
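Given the fitted conditional probabilities π̂(û_t) (not tabulated here), the comparison is a one-liner; a minimal sketch:

```python
import numpy as np

def misclassification_rate(y, pi_hat):
    """Prediction rule: classify y_hat_t = 1 when pi(u_hat_t) > 0.5;
    return the fraction of outcomes classified as the opposite of what
    was observed (8/148 = 5.4% for the ARGLMM, 62/148 for the logit GLM)."""
    y_hat = (np.asarray(pi_hat) > 0.5).astype(int)
    return float(np.mean(y_hat != np.asarray(y)))
```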


5.4.2.4 Prediction of future outcomes

The availability of marginal joint distributions through the probit or Monte Carlo approximation allows one to consider marginal conditional distributions, such as the one-step-ahead forecast distribution

P(Y_{T+1} = y_{T+1} | Y_T = y_T, Y_{T-1} = y_{T-1}, ..., Y_{T-s} = y_{T-s})
  = P(Y_{T+1} = y_{T+1}, Y_T = y_T, ..., Y_{T-s} = y_{T-s}) / P(Y_T = y_T, ..., Y_{T-s} = y_{T-s})    (5-8)

for an outcome at time T + 1, factoring in the past s + 1 observations, s = 0, 1, 2, .... We use our proposed model to obtain estimates of the numerator and denominator in (5-8). The two-stage hierarchy, together with the autoregressive nature of the random effects, implies

P(Y_{T+1} = y_{T+1}, Y_T = y_T, ..., Y_{T-s} = y_{T-s})
  = ∫ [ ∏_{t=T-s}^{T+1} P(Y_t = y_t | u_t) ] g(u_{T-s}) [ ∏_{t=T-s+1}^{T+1} g(u_t | u_{t-1}) ] du_{T-s} ... du_T du_{T+1}.    (5-9)

The last term, P(Y_{T+1} = y_{T+1} | u_{T+1}), in the first product is given by an extrapolation of the fitted model to time point T + 1, i.e., it follows model (5-7) with linear predictor α + x'_{T+1}β + u_{T+1}, where x_{T+1} is the covariate vector for time point T + 1 and β is estimated by its MLE β̂. The last term, g(u_{T+1} | u_T), in the second product is determined by extrapolating the underlying random process {u_t}_{t=1}^{T} to time point T + 1. I.e., u_{T+1} = ρ^{d_T} u_T + ε_T, where d_T is the time distance between points T and T + 1, and g(u_{T+1} | u_T) is a normal distribution with mean ρ^{d_T} u_T and variance σ^2 (1 - ρ^{2 d_T}). For the boat race data, a Monte Carlo approximation to (5-9) is given by

(1/m) ∑_{j=1}^{m} ∏_{t=T-s}^{T+1} exp{ y_t (α̂ + β̂w_t + u_t^(j)) } / ( 1 + exp{ α̂ + β̂w_t + u_t^(j) } ),

where u_t^(j) is the t-th component of the j-th generated autoregressive process u^(j) (extrapolated to time point T + 1) with variance components σ̂^2 and ρ̂.

The estimated forecast probabilities of a Cambridge win in 2004, based on the past s + 1 observations and the GLMM with autocorrelated random effects, are given in Table 5-8. For example, P̂(Y_149 = 1 | Y_148 = 0), the conditional probability of a Cambridge win in 2004 given the outcome of the race in 2003 (a Cambridge loss), is estimated to be 0.32. We assumed a zero weight difference, w_{T+1} = 0, in calculating these forecast probabilities; however, any value can be substituted. (Historically, the crews weigh in four days prior to the race, so that w_{T+1} will be available before the actual outcome of the race.)
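A direct Monte Carlo implementation of (5-8) and (5-9) is sketched below, assuming zero weight differences. The helper name forecast_prob and the details (sample size, seeding) are illustrative assumptions, not the dissertation's code: it draws latent AR(1) paths over the conditioning years from the fitted process, extends each path one year ahead, and takes the ratio of the two joint-probability estimates.

```python
import numpy as np

rng = np.random.default_rng(2)

def forecast_prob(y_hist, hist_years, alpha, sigma, rho, m=200_000):
    """Monte Carlo approximation of the forecast probability (5-8) of a
    Cambridge win one year after the last year in hist_years, given the
    observed 0/1 outcomes y_hist at those (possibly unequally spaced)
    years; weight differences are set to zero."""
    years = np.asarray(hist_years)
    T = len(years)
    u = np.empty((m, T + 1))
    u[:, 0] = rng.normal(0.0, sigma, m)
    gaps = np.diff(np.append(years, years[-1] + 1))   # last gap: one year ahead
    for t, d in enumerate(gaps, start=1):             # u_t = rho^d u_{t-1} + eps
        u[:, t] = rho**d * u[:, t - 1] + rng.normal(
            0.0, sigma * np.sqrt(1.0 - rho**(2 * d)), m)
    p = 1.0 / (1.0 + np.exp(-(alpha + u)))            # P(Y = 1 | u) with w = 0
    lik = np.prod(np.where(np.asarray(y_hist, bool), p[:, :T], 1 - p[:, :T]),
                  axis=1)                             # history term of (5-9)
    return np.mean(lik * p[:, T]) / np.mean(lik)      # numerator / denominator

# Conditioning on the 2003 loss should give roughly the 0.32 of Table 5-8:
print(forecast_prob([0], [2003], alpha=0.27, sigma=2.65, rho=0.68))
```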


Table 5-8: Estimated probabilities of a Cambridge win in 2004, given the past s + 1 outcomes of the race.

  s:          0      1      2      3      4       5        6
  History:    L      LL     LLW    LLWL   LLWLW   LLWLWW   LLWLWWW
  estimate:   0.323  0.286  0.318  0.309  0.311   0.312    0.313

The first row displays the number of years s preceding 2003 that are conditioned on. The second row shows the history of Cambridge wins and losses from 2003 back to 2003 - s. The third row displays the estimated probabilities P̂(Y_149 = 1 | Y_148 = y_148, ..., Y_{148-s} = y_{148-s}) of a Cambridge win given the past s + 1 observations.

Including the past two results, of 2003 and 2002 (two Cambridge losses), the estimated probability of a Cambridge win in 2004 decreases to 0.29. Considering the last three outcomes (two Cambridge losses, one Cambridge win), the estimated probability is again 0.32. Conditioning on outcomes even further back in time, the estimated probability of a Cambridge win stays roughly constant at 0.31, even when the 7-year winning streak of Cambridge in the years 1993 to 1999 is factored in. It seems reasonable, however, that these outcomes do not influence the 2004 outcome, in terms of crew membership and training methods, as much as the outcomes of 2003 or 2002 do. We already mentioned, in connection with the interpretation of the lorelogram and Figure 5-10, that results five or more years in the past hardly seem to influence the current outcome.

Considering the weight factor: if the average crewman on the Oxford boat weighs more by 5 pounds, then the estimated probability of a Cambridge win decreases from 0.32 to 0.28 when conditioning on the outcome of the last race, and from 0.29 to 0.24 when factoring in the last two outcomes. On the other hand, if Cambridge holds a weight advantage of 5 pounds in 2004, their predicted winning probabilities are 0.37 and 0.33, respectively.

One special case of the derivations above is to not condition on any past outcomes and just look at the marginal probability of a Cambridge win in 2004,


which for general T is given by

P(Y_{T+1} = 1) = ∫ [ exp{α + βw_{T+1} + u_{T+1}} / (1 + exp{α + βw_{T+1} + u_{T+1}}) ] g(u_{T+1}) du_{T+1},

where u_{T+1} has a marginal N(0, σ^2) distribution. However, this estimator does not factor in any past information. For no weight difference (w_{T+1} = 0) it is calculated to be 0.52.

The above predictions are based on marginal probabilities, where the random effects have been integrated out. Another way of predicting the probability of a future outcome uses the conditional model directly, incorporating an estimate of the random effect at time T + 1. With the autoregressive nature of the random effects, the minimum mean-squared error predictor of u_{T+1} is given by u*_{T+1} = E[u_{T+1} | u_T] = ρu_T. To estimate u*_{T+1}, we use the posterior distribution of the random effects given the observed data, i.e., û*_{T+1} = E[ E[u_{T+1} | u_T] | y ] = E[ρu_T | y] = ρū_T, and plug in the maximum likelihood estimate of ρ. This is consistent with the way random effects are predicted in spatial GLMMs, as described by Zhang (2002). He proved the following theorem under the assumption of known fixed and random effects parameters. Here we adapt his theorem to our time series context, to facilitate prediction of unobserved intermediate and future outcomes:

Theorem: Let u_k, k ∈ Z+, be Gaussian with E[u_k] = 0 for all k. If, conditionally on {u_k, k ∈ Z+}, the {Y_k, k ∈ Z+} are independent and for each k the distribution of Y_k depends on u_k only, then for any k and observed time points t_1, t_2, ..., t_T,

E[u_k | y] = ∑_{i=1}^{T} c_i E[u_{t_i} | y],    (5-10)

where the coefficients c_i are such that E[u_k | u_{t_1}, ..., u_{t_T}] = ∑_{i=1}^{T} c_i u_{t_i}, and y = (y_{t_1}, ..., y_{t_T}) are the observations at time points t_1, ..., t_T.


Proof: (An adaptation from Zhang, 2002.) Let k ∈ Z+ with k ≠ t_i for all i = 1, ..., T. Let f(u_k, u, y) denote the joint density of (u_k, u, y), where u = (u_{t_1}, ..., u_{t_T}) holds the random effects at the observed time points. The distribution of the observed data depends on the random effects at the observation time points but on no other random effects; hence f(y | u_k, u) = f(y | u). Then

f(u_k, u, y) = f(y | u_k, u) f(u_k, u) = f(y | u) f(u_k, u) = f(u, y) f(u_k | u).

Dividing both sides by f(u, y), we obtain f(u_k | u, y) = f(u_k | u) and consequently E[u_k | u, y] = E[u_k | u] = ∑_{i=1}^{T} c_i u_{t_i} for some appropriate constants c_i. By the properties of repeated expectation (i.e., E_{X|Y}[X | y] = E_{Z|Y}[ E_{X|Z,Y}[X | z, y] | y ]), we have E[ E[u_k | u, y] | y ] = E[u_k | y], and (5-10) follows.

For the boat race data, we could use (5-10) to get a prediction of the probability of an outcome in a year in which no race took place, i.e., choose a k < t_T with k ≠ t_1, ..., t_T, where t_1, ..., t_T are the years a race took place. (For clarity, we now denote the set of years in which a race took place as t_1, t_2, ..., t_T, where t_1 is the year of the first race in 1829, t_2 is the year of the second race in 1836, and t_T is the year 2003.) More importantly, we can use (5-10) to predict future outcomes. For instance, by setting k = t_T + 1, i.e., to the year 2004, we obtain

û_{t_T + 1} = E[u_{t_T + 1} | y] = ∑_{i=1}^{148} c_i E[u_{t_i} | y] = ρ E[u_{t_T} | y] = ρ ū_{t_T}

as the prediction of the random effect for that year, the same result we derived before. Here we made use of (5-10) and the autoregressive nature of the random effects, which imply that E[u_{t_T + 1} | u_{t_1}, ..., u_{t_T}] = ρ u_{t_T}, i.e., c_1 = ... = c_{T-1} = 0


and c_T = ρ. The prediction can be evaluated by plugging in the MLE for ρ and using the Monte Carlo sample from the last iteration of the MCEM algorithm to approximate the posterior mean. The prediction of a random effect s years into the future (and hence the prediction of the distribution of a future outcome Y_{t_T + s} under the assumed model) can be obtained similarly. E.g., with k = t_T + s,

E[u_{t_T + s} | y] = ∑_{i=1}^{T} c_i E[u_{t_i} | y] = ρ^s E[u_{t_T} | y],

since E[u_{t_T + s} | u_{t_1}, ..., u_{t_T}] = ρ^s u_{t_T}, i.e., c_1 = ... = c_{T-1} = 0 and c_T = ρ^s. For the boat race data, the estimated random effect for 2004 is û_149 ≈ 0.69 × (-1.82) = -1.26. Then, according to our model, the estimated probability of a Cambridge win in 2004, given this prediction of the random effect in that year, is 0.26.
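As a numerical check, the chain from posterior mean to forecast probability is only a few lines. This sketch just replays the arithmetic above with the reported estimates; small rounding differences against the text's 0.26 are expected.

```python
import numpy as np

alpha_hat, rho_hat = 0.27, 0.69      # estimates used in the text
u_T = -1.82                          # posterior mean of the 2003 random effect
s = 1                                # one year ahead (2004)

u_future = rho_hat**s * u_T          # E[u_{T+s} | y] = rho^s E[u_T | y], by (5-10)
p_win = 1.0 / (1.0 + np.exp(-(alpha_hat + u_future)))
print(u_future, p_win)               # -1.26 and ~0.27 (0.26 in the text)
```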


CHAPTER 6
SUMMARY, DISCUSSION AND FUTURE RESEARCH

In this dissertation we proposed autocorrelated and other correlated random effects in GLMMs as a means of introducing and modeling correlation in regression models for series of unequally spaced counts or binary/binomial observations. In Chapter 1 we contrasted our regression approach with one based on modeling the mean, variance and covariance directly (marginal models), and another based on regressing the current response on previous observations and covariates (transitional models). At the cost of increased computational time and complex algorithms, inferential procedures for GLMMs are based on the joint likelihood of the T observations y_1, ..., y_T. In contrast, marginal models are based on a quasi-likelihood approach, and estimation in transitional models relies on conditionally or partially specified likelihoods that do not represent the full joint distribution. In particular, constructing tables such as 5-4 and 5-6, which compare observed counts to marginal predicted counts of sequences of events, is then impossible.

In Chapter 2 we presented a general MCEM algorithm for fitting GLMMs, and we derived specific algorithmic details for equally correlated and autocorrelated random effects in Chapter 3. There we gave details on the implementation of a full iterative M-step, as opposed to just a single iteration. We also focused on the Gibbs sampler as a means of sampling from the posterior distribution of the random effects given the observed data, as required for the approximation of the E-step and for posterior predictions of the random effects. Other MCMC methods, such as a Metropolis-Hastings algorithm, can also be employed; however, the Gibbs sampling approach reduces to simple forms for autoregressive random effects and was relatively fast in implementations. Furthermore, the structure of the full univariate conditionals


made clear how the correlations between the random effects serve as building blocks for the correlation between the time series observations. The first graph in Figure 6-1 shows the exchangeable correlation structure among the Y_t's implied by the traditional GLMM assumption of one random effect common to all observations. All vertices (where a vertex corresponds to a random variable) that are not joined by an edge are conditionally independent. The second graph shows the same picture for autoregressive random effects. Note that a marginal dependency between Y_t and Y_{t-1} is induced through the path via u_t and u_{t-1}. (There is no edge between u_{t-1} and u_{t+1} if we assume a lag-1 autoregressive process.) The third graph is slightly different in nature and pictures the structure of the full univariate conditional distribution of u_t given the other random effects and the data y. As we showed in Section 3.4, the conditional distribution of u_t depends on its predecessor u_{t-1}, its successor u_{t+1} and the current observation y_t. In turn, u_{t-1} and u_{t+1} depend on the observations at times t - 1 and t + 1, respectively. Thus the posterior of u_t incorporates information from past and future responses and random effects.

Although autocorrelated random effects in GLMMs are not new to the literature (e.g., Chan and Ledolter, 1995), we extended their application by explicitly allowing for gaps in the observed time series through specifying the correlation in the AR(1) process in terms of a lag. This allowed us to handle missing data in the series without any additional procedures or adjustments to likelihood inference. In some instances, predicting responses at times where no responses (and covariates) were observed could be a potential goal. We presented some theory for intermediate prediction with the Oxford vs. Cambridge boat race data. Through the way random effects incorporate information on previous and future observations, our regression models should be well suited to predict intermediate observations not observed at certain time points.


[Figure 6-1: Association graphs for GLMMs. The first two diagrams represent the associations among observations Y_1, ..., Y_T in GLMMs with one common random effect and with autocorrelated random effects {u_t}. Vertices not connected by edges are conditionally independent. The last graph represents the association structure for the posterior distribution of u_t given the other autocorrelated random effects and the data. The influence of covariates is not shown.]

Chapter 4 was devoted to deriving the marginal properties of our time series regression models based on normal, Poisson, negative binomial and binomial distributional assumptions on the observations. We saw that conditionally specified GLMMs lead to marginal overdispersion relative to the normal, Poisson, negative binomial or binomial variance, and we gave formulas for its expression in the case of the correlated random effects suggested here. More importantly, we derived expressions for the marginal correlations between any two members of the time series implied by our models. While in the case of normal, Poisson and negative binomial time series regression models these have closed forms, approximations have to be used in the binomial case with a logit link. We explored several options and presented an approximation based on the similarity between logit-link and probit-link models to evaluate marginal properties. The derived formulas for marginal means and correlations are useful for comparing empirical quantities to model-based quantities, such as comparisons of observed and predicted counts, observed and predicted autocorrelations, or observed and predicted log odds ratios


in a time series. These are important aspects in determining the appropriateness of the random effects distribution and the goodness of fit of the model in general.

Applications of the MCEM algorithm developed in Chapters 2 and 3 and of the theory developed in Chapter 4 are given in Chapter 5. Examples of binomial, binary and count time series were presented and modeled within the proposed framework of GLMMs with autoregressive random effects (ARGLMMs). For the binary case, we explored certain symmetries that the model implies in the marginal distribution when no time-varying covariates are observed. Also, some theory on predicting future events in binary time series, based on the conditional model or the implied marginal model, was developed. Some results of these data analyses will be discussed in the next two sections.

6.1 Cross-Sectional Time Series

We motivated the usefulness and appropriateness of our methodology through the analysis of a cross-sectional binomial time series from one of the largest US data sets for social science research. Scientists making use of this data base who would like to analyze developments of count or binomial responses through time should consider the methods described here, because they address the temporal dependence in the observations over the years, address the cross-sectional dependence within a year, and naturally handle gaps in the observed time series. We showed an analysis of such data assuming an approximate normal distribution for the log odds and fitting a corresponding linear mixed effects model. We made adjustments by appropriately weighting the log odds to more closely meet the assumptions of a normal linear mixed model. However, we also presented an analysis based on the true binomial nature of the observations, using a logistic ARGLMM. We consider this the better approach, particularly if the binomial sample sizes are small, because it allows the variance of the log odds to vary as a function


of the mean. In time series models the mean often displays trend behavior, and therefore the assumption of constant variance is inappropriate. To what extent weighting the log odds by their estimated asymptotic standard deviations in the normal approximation model simultaneously alleviates all these problems remains doubtful. The ARGLMM can be fit using the MCEM algorithm outlined in Chapters 2 and 3 and possesses the three features mentioned above. Furthermore, using the marginal results for binomial time series discussed in Section 4.3, these models allow for both overdispersion relative to the binomial variance and correlation between successive observations.

Data from the General Social Survey are not the only application of our methods. In political science, especially in international relations, annual binary time series cross-sectional data are very common. For instance, a lot of research focuses on the analysis of the relationship (conflict/no conflict) between two states over a long period of years. Ad hoc methods, such as including the residuals from a preliminary logit analysis in the linear predictor (Oneal and Russett, 1997), have been proposed to adjust for the temporal dependence. More sophisticated methods treat the binary time series as grouped time-to-event (or survival) data and include temporal dummy variables in the linear predictor of a logit model. These dummy variables mark the number of years between two events (i.e., conflicts) of the binary time series and are motivated by a relationship between Cox's proportional hazards model for time-to-event data and logit models (Beck, Katz and Tucker, 1998). However, one drawback of these methods is that usually we observe several events (e.g., conflicts) over time. In particular, Beck et al. (1998) treat the probability of subsequent events as independent of the first one. This is a much stricter (and often unrealistic) assumption than the conditional independence assumption in ARGLMMs.


Furthermore, by including dummy variables (or, as also suggested, a natural cubic spline version) in the linear predictor to induce dependency, the nature of this dependency cannot be modeled. Beck et al. (1998) note that temporal dependence "cannot provide a satisfactory explanation [of conflict] by itself but must instead be the consequence of some important but unobserved variable." Hence the GLMMs with the assumption of a latent autoregressive random process developed in this dissertation seem to be a natural approach to analyzing binary time series cross-sectional data and are an attractive alternative to the widely used methods suggested by Beck et al. (1998).

6.2 Univariate Time Series

Apart from cross-sectional time series data, we focused on the analysis of a single time series of counts or binary observations in Chapter 5. A standard loglinear or logit analysis ignoring the serial dependence may result in misleading inference. This was evident for the polio data of Section 5.3, where we showed that the ARGLMM adequately captured the correlation structure of the residuals, which was not the case for the other, more standard models. Hence these models (an ordinary Poisson GLM, a negative binomial GLM and a Poisson GLMM) showed strong evidence of a time trend (see Table 5-2), whereas the evidence seems considerably weaker when the correlation is accounted for. Similarly, for the boat race data we were able to correctly quantify a common belief about the influence of weight. Ignoring the correlation in the series of wins and losses would have led to an overstatement of the influence of weight, one that is not seen to be as strong once the analysis takes the serial correlation into account.

We emphasized model checking through residual analysis, with a comparison of empirical and theoretical autocorrelations, lorelograms and variograms in the case of unequally spaced observations. For the polio data we calculated residuals for all entertained models and showed that only the autocorrelation function implied by


an ARGLMM mimics the one observed in the residuals. Residual autocorrelations from other models ignoring the serial dependence showed nonconformity with the model specifications. For the Old Faithful data set we observed good agreement between the ARGLMM-implied marginal autocorrelation function and the empirical one. Similarly, the estimated and empirical variogram and lorelogram for the boat race data showed good agreement, indicating a reasonable assumption about the model-implied dependency structure. This was further justified by a comparison of observed and (marginally) predicted counts of sequences of wins and losses, which we approximated using either the connection to the exact marginal expressions for probit models or Monte Carlo approximation.

6.2.1 Clipping of Time Series

The application of our regression models to the analysis of binary time series and the methodology developed here may be broader than initially realized. Let Y_t be a binary time series obtained by clipping (Kedem, 1980) an underlying process Z_t, such that Y_t = I[Z_t ∈ C], where I is the indicator function, which is 1 if Z_t is in the set C and 0 if it is in its complement. Estimation (t ≤ T) and prediction (t > T) of π(u_t) = P(Y_t = 1 | u_t) using an ARGLMM with covariates x_t then entails estimation and prediction of the event {Z_t ∈ C} based on covariate information. This might be useful in a variety of settings, e.g., when an investigator is forced to dichotomize, or is more comfortable dichotomizing, the observed data.
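A toy illustration of clipping, with C = (c, ∞) for an arbitrary cutoff c and a stand-in Gaussian AR(1) for Z_t (both are assumptions of this sketch, not part of the methodology above):

```python
import numpy as np

rng = np.random.default_rng(3)

# Clip an underlying process Z_t into a binary series Y_t = I[Z_t in C].
T, phi, c = 200, 0.8, 0.5
z = np.empty(T)
z[0] = rng.normal()
for t in range(1, T):                 # stationary Gaussian AR(1) for Z_t
    z[t] = phi * z[t - 1] + rng.normal(0.0, np.sqrt(1.0 - phi**2))
y = (z > c).astype(int)               # the clipped series handed to an ARGLMM
```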


6.2.2 Longitudinal Data

The type of time series data we consider in this dissertation is different from longitudinal or panel data, which usually consist of only a few repeated observations (i.e., T is small, often less than 5) but with a large number of replications. It is doubtful that the estimation techniques (such as GEE) developed for longitudinal, especially interdependent, binary data are useful in our context of very long time series, since the temporal dependence is much richer. For instance, while it was possible to fit a marginal model via GEE for the two binomial time series of 16 observations each in the data about homosexual relationships, we could not fit any of the other time series mentioned in Chapter 5 with GEE methodology. The reason is that solving the estimating equations requires inversion of the T × T covariance matrix of (Y_1, ..., Y_T). In longitudinal studies this matrix is also large, but it has a block-diagonal structure with low-dimensional blocks corresponding to the few repeated measurements within a cluster. For the GEE analysis of the polio count data, Zeger (1988) proposed approximating the T × T covariance matrix of (Y_1, ..., Y_T) with a simpler band-diagonal matrix corresponding to an autoregressive process, which then has an easy inverse. Also note that in the case of a single time series (i.e., a single cluster), the usual approach of adjusting estimated standard errors by using the sample covariance matrix as a robust estimate of the correlation matrix is not applicable. This is because the usual sum over clusters in the expression for the robust asymptotic covariance matrix reduces to a single summand, which is the score equation that was set equal to 0. Furthermore, the GEE approach does not yield estimates of multivariate probabilities, so that constructing tables such as 5-4 and 5-6, comparing observed counts to marginal predicted counts of sequences to evaluate goodness of fit, is impossible.

6.3 Extensions and Further Research

Several extensions of the proposed methodology are possible, opening new lines of research. In the following we give a brief overview of some ideas.

6.3.1 Alternative Random Effects Distribution

We focused on latent first-order autoregressive random effects processes for describing time series observations in a GLMM framework, but extensions to pth-order processes are possible. The derivations of the MCEM algorithm and the


Gibbs sampler should be similar, where the conditional distribution of u_t will now depend on its 2p neighbors. More generally, other random effects distributions (possibly improper) can be explored. For example, our AR(1) process is a special case of a generalized autoregressive random effects process u_t = ρ ∑_{j=1}^{T} c_{tj} u_j + ε_t, which was proposed by Ord (1975). With appropriately specified constants c_{tj}, one alternative to AR(1) (or AR(2)) random effects is to let

u_t = ρ(u_{t-1} + u_{t+1}) + ε_t,  t = 2, ..., T - 1,
u_T = ρ u_{T-1} + ε_T.

For this case, Sun, Speckman and Tsutakawa (2000) mention that the full univariate conditional distribution of u_t depends on (u_{t-2}, u_{t-1}, u_{t+1}, u_{t+2}) for 3 ≤ t ≤ T - 2, and interpretations similar to those given above, regarding the random effects as building blocks for the correlation in the time series, apply.

As mentioned in Section 1.4.1, models with more complicated random effects structures are often fit in a Bayesian framework, assuming noninformative priors on the fixed effects and variance components. In that setting, propriety of the posterior distribution cannot be guaranteed for a Poisson GLMM when one of the observed counts is zero, and is impossible in a logit-link GLMM for binomial observations if they are equal to 0 or n_t for just one t (see Theorem 4.1 and Examples 4.1 and 4.2 in Sun, Speckman and Tsutakawa, 2000). Hence, with noninformative (flat) priors on the fixed effects and the variance components of the correlated random effects distribution, Bayesian GLMMs for binary time series result in improper posteriors. Further developing the models and methods presented in this dissertation would therefore be a worthwhile goal.


6.3.2 Topics in GLMM Research

Goodness-of-fit measures for GLMMs remain an active area of research. We tried to propose some methods here based on a comparison of observed and marginally fitted counts; however, no formal statistic was developed. Part of the problem is that GLMMs cannot easily be made a special case of some broader model (such as the saturated model for GLMs) and compared to it. Recently, Presnell and Boos (2004) suggested a test for model misspecification based on comparing the maximized likelihood of a model to one motivated by a cross-validation approach in which observations are deleted sequentially. It would be interesting to see how this applies to GLMMs, although the computational complexity due to refitting the model several times might be a huge burden. Similar computational costs arise when we try to determine whether GLMMs with autoregressive random effects are useful for prediction. Section 5.4.2 presented some theory for predicting future outcomes in binary time series, but more work is needed on cross-validation, misclassification rates or similar measures of the quality of the model for prediction. From the examples we have seen in this dissertation, it appears that the estimate of the standard deviation of the latent autoregressive process is rather large, which would lead to wide prediction intervals on the logit and original probability scales.

A lot of recent work has focused on transitional models for categorical time series with more than two categories (e.g., Fokianos and Kedem, 2003). We believe that a multivariate GLMM approach (Agresti, 2002; Fahrmeir and Tutz, 2001) with carefully specified correlated random effects (univariate or multivariate) might be an alternative worth studying.

Lastly, although we focused on analyzing a single time series, it should be straightforward to extend our methodology to the analysis of several independent time series (e.g., one for each subject in a longitudinal study with a large number


of repeated observations), as was alluded to throughout Chapters 2 and 3 of this dissertation. Unlike GEE, our methodology, with its special regard for unequally spaced time series, should be practical when not all subjects are measured at common time points and, a very realistic assumption, subjects skip certain time points. Smith and Diggle (1998) propose a GEE approach for estimating fixed effects parameters in such circumstances, coupled with a complicated pseudo-likelihood approach that assumes independence for estimating the variance components. We would like to extend our proposed likelihood framework, jointly estimating all parameters, to this situation.


REFERENCES

Abramowitz, M. and Stegun, I. (1964). Handbook of Mathematical Functions, New York: Dover.

Agresti, A. (2002). Categorical Data Analysis, 2nd edition, New York: John Wiley and Sons.

Aitkin, M. (1999). A general maximum likelihood analysis of variance components in generalized linear models, Biometrics 55: 117-128.

Aitkin, M. and Alfo, M. (1998). Regression models for binary longitudinal responses, Statistics and Computing 8: 289-307.

Azzalini, A. (1994). Logistic regression for autocorrelated data with application to repeated measures, Biometrika 81: 767-775.

Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful Geyser, Applied Statistics 39: 357-365.

Azzalini, A. and Chiogna, M. (1997). S-Plus tools for the analysis of repeated measures data, Computational Statistics 12: 53-66.

Bahadur, R. R. (1961). A representation of the joint distribution of responses to n dichotomous items, Studies in Item Analysis and Prediction, pp. 158-168.

Beck, N., Katz, J. K. and Tucker, R. (1998). Taking time seriously: Time-series cross-section analysis with a binary dependent variable, American Journal of Political Science 42: 1260-1288.

Benjamin, M. A., Rigby, R. A. and Stasinopoulos, M. D. (2003). Generalized autoregressive moving average models, Journal of the American Statistical Association 98: 214-223.

Besag, J., Green, P., Higdon, D. and Mengersen, K. (1995). Bayesian computation and stochastic systems, Statistical Science 10: 3-41.

Bock, R. D. and Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm, Psychometrika 46: 443-459.

Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm, Journal of the Royal Statistical Society, Series B 61: 265-285.


Booth, J. G., Casella, G., Friedl, H. and Hobert, J. P. (2003). Negative binomial loglinear mixed models, Statistical Modelling: An International Journal 3: 179-191.

Booth, J. G., Hobert, J. P. and Jank, W. (2001). A survey of Monte Carlo algorithms for maximizing the likelihood of a two-stage hierarchical model, Statistical Modelling: An International Journal 1: 333-349.

Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models, Journal of the American Statistical Association 88: 9-25.

Breslow, N. E. and Lin, X. (1995). Bias correction in generalised linear mixed models with a single component of dispersion, Biometrika 82: 81-91.

Caffo, B., Jank, W. and Jones, G. (2003). Ascent-based MCEM, under review.

Carey, V., Zeger, S. L. and Diggle, P. (1993). Modelling multivariate binary data with alternating logistic regressions, Biometrika 80: 517-526.

Chan, J. S. K. and Kuk, A. Y. C. (1997). Maximum likelihood estimation for probit-linear mixed models with correlated random effects, Biometrics 53: 86-97.

Chan, K. S. and Ledolter, J. (1995). Monte Carlo EM estimation for time series models involving counts, Journal of the American Statistical Association 90: 242-252.

Chib, S. and Greenberg, E. (1998). Analysis of multivariate probit models, Biometrika 85: 347-361.

Cox, D. R. (1972). The analysis of multivariate binary data, Applied Statistics 21: 113-120.

Cox, D. R. (1975). Partial likelihood, Biometrika 62: 269-276.

Cox, D. R. (1981). Statistical analysis of time series: Some recent developments, Scandinavian Journal of Statistics 8: 93-108.

Crowder, M. J. and Hand, D. J. (1990). Analysis of Repeated Measures, London: Chapman and Hall.

Davis, R., Dunsmuir, W. and Streett, S. (2003). Observation-driven models for Poisson counts, Biometrika 90: 777-790.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (C/R: p22-37), Journal of the Royal Statistical Society, Series B, Methodological, pp. 1-22.

Diggle, P., Heagerty, P., Liang, K.-Y. and Zeger, S. L. (2002). Analysis of Longitudinal Data, 2nd edition, Oxford: Oxford University Press.


Diggle, P. J. (1990). Time Series: A Biostatistical Introduction, Oxford: Oxford University Press.

Diggle, P. J., Tawn, J. A. and Moyeed, R. A. (1998). Model-based geostatistics, Applied Statistics 47: 299-326.

Durbin, J. and Koopman, S. J. (1997). Monte Carlo maximum likelihood estimation for non-Gaussian state space models, Biometrika 84: 669-684.

Durbin, J. and Koopman, S. J. (2000). Time series analysis of non-Gaussian observations based on state space models from both classical and Bayesian perspectives, Journal of the Royal Statistical Society, Series B 62: 3-56.

Durbin, J. and Koopman, S. J. (2001). Time Series Analysis by State Space Methods, Oxford: Oxford University Press.

Fahrmeir, L. and Tutz, G. (2001). Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd edition, New York: Springer.

Fitzmaurice, G. M. and Laird, N. M. (1993). A likelihood-based method for analysing longitudinal binary responses, Biometrika 80: 141-151.

Fitzmaurice, G. M. and Lipsitz, S. R. (1995). A model for binary time series data with serial odds ratio patterns, Applied Statistics 44: 51-61.

Fitzmaurice, G. M., Laird, N. M. and Rotnitzky, A. G. (1993). Regression models for discrete longitudinal responses, Statistical Science 8: 284-299.

Fokianos, K. and Kedem, B. (1998). Prediction and classification of non-stationary categorical time series, Journal of Multivariate Analysis 67: 277-296.

Fokianos, K. and Kedem, B. (2003). Regression theory for categorical time series, Statistical Science 18: 357-376.

Fort, G. and Moulines, E. (2003). Convergence of the Monte Carlo expectation maximization for curved exponential families, The Annals of Statistics 31: 1220-1259.

Gelfand, A. E. and Carlin, B. P. (1993). Maximum-likelihood estimation for constrained- or missing-data models, The Canadian Journal of Statistics 21: 303-311.

Geyer, C. J. and Thompson, E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data, Journal of the Royal Statistical Society, Series B 54: 657-683.

Ghosh, M., Natarajan, K., Stroud, T. W. F. and Carlin, B. P. (1998). Generalized linear models for small-area estimation, Journal of the American Statistical Association 93: 273-282.


Heagerty, P. J. (1999). Marginally specified logistic-normal models for longitudinal binary data, Biometrics 55: 688-698.

Heagerty, P. J. and Zeger, S. L. (2000). Marginalized multilevel models and likelihood inference, Statistical Science 15: 1-26.

Jank, W. and Booth, J. G. (2003). Efficiency of Monte Carlo EM and simulated maximum likelihood in generalized linear mixed models, Journal of Computational and Graphical Statistics 12: 214-230.

Johnson, N. L. and Kotz, S. (1970). Distributions in Statistics: Continuous Univariate Distributions, Volume 2, Boston: Houghton-Mifflin.

Kastner, C., Fieger, A. and Heumann, C. (1997). MAREG and WinMAREG: A tool for marginal regression models, Computational Statistics and Data Analysis 24: 237-241.

Kaufmann, H. (1987). Regression models for nonstationary categorical time series: Asymptotic estimation theory, The Annals of Statistics 15: 79-98.

Kedem, B. (1980). Binary Time Series, New York: Marcel Dekker.

Kedem, B. and Fokianos, K. (2002). Regression Models for Time Series Analysis, New York: John Wiley and Sons.

Lang, J. B. (1996). Maximum likelihood methods for a generalized class of log-linear models, The Annals of Statistics 24: 726-752.

Lang, J. B. (2004). Multinomial-Poisson homogeneous models for contingency tables, The Annals of Statistics 32: 340-383.

Lang, J. B. and Agresti, A. (1994). Simultaneously modeling joint and marginal distributions of multivariate categorical responses, Journal of the American Statistical Association 89: 625-632.

Lange, K. (1995). A quasi-Newton acceleration of the EM algorithm, Statistica Sinica 5: 1-18.

Lee, Y. and Nelder, J. A. (1996). Hierarchical generalized linear models, Journal of the Royal Statistical Society, Series B 58: 619-656.

Lee, Y. and Nelder, J. A. (2001). Modelling and analysing correlated non-normal data, Statistical Modelling: An International Journal 1: 3-16.

Li, W. K. (1994). Time series models based on generalized linear models: Some further results, Biometrics 50: 506-511.

Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models, Biometrika 73: 13-22.


Lin, X. and Breslow, N. E. (1996). Bias correction in generalized linear mixed models with multiple components of dispersion, Journal of the American Statistical Association 91: 1007-1016.

Liu, S.-I. (2001). Bayesian model determination for binary-time-series data with applications, Computational Statistics and Data Analysis 36: 461-473.

Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm, Journal of the Royal Statistical Society, Series B 44: 226-233.

MacDonald, I. L. and Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-Valued Time Series, New York: Chapman and Hall.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd edition, London: Chapman and Hall.

McCulloch, C. E. (1994). Maximum likelihood variance components estimation for binary data, Journal of the American Statistical Association 89: 330-335.

McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models, Journal of the American Statistical Association 92: 162-170.

McKeown, S. P. and Johnson, W. D. (1996). Testing for autocorrelation and equality of covariance matrices, Biometrics 52: 1087-1095.

Natarajan, R. and McCulloch, C. E. (1995). A note on the existence of the posterior distribution for a class of mixed models for binomial responses, Biometrika 82: 639-643.

Natarajan, R. and McCulloch, C. E. (1998). Gibbs sampling with diffuse proper priors: A valid approach to data-driven inference?, Journal of Computational and Graphical Statistics 7: 267-277.

Oneal, J. and Russett, B. (1997). The classical liberals were right: Democracy, interdependence, and conflict, International Studies Quarterly 41: 267-294.

Ord, K. (1975). Estimation methods for models of spatial interaction, Journal of the American Statistical Association 70: 120-126.

Presnell, B. and Boos, D. (2004). The IOS test for model misspecification, Journal of the American Statistical Association 99: 216-227.

Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods, New York: Springer.

Searle, S. R., Casella, G. and McCulloch, C. E. (1992). Variance Components, New York: John Wiley and Sons.


Smith, D. M. and Diggle, P. J. (1998). Compliance in an anti-hypertension trial: A latent process model for binary longitudinal data, Statistics in Medicine 17: 357-370.

Stiratelli, R., Laird, N. and Ware, J. H. (1984). Random-effects models for serial observations with binary response, Biometrics 40: 961-971.

Sun, D., Speckman, P. L. and Tsutakawa, R. K. (2000). Random effects in generalized linear mixed models, in Generalized Linear Models: A Bayesian Perspective, Dey, D., Ghosh, S. and Mallick, B. (editors), New York: Marcel Dekker.

Sun, D., Tsutakawa, R. K. and Speckman, P. L. (1999). Posterior distribution of hierarchical models using CAR(1) distributions, Biometrika 86: 341-350.

Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities, Journal of the American Statistical Association 81: 82-86.

Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data, New York: Springer.

Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method, Biometrika 61: 439-447.

Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms, Journal of the American Statistical Association 85: 699-704.

West, M., Harrison, P. J. and Migon, H. S. (1985). Dynamic generalized linear models and Bayesian forecasting, Journal of the American Statistical Association 80: 73-83.

Witt, G. (1987). The analysis of repeated measurements with first-order autocorrelation. Ph.D. dissertation, University of Pennsylvania, Philadelphia.

Wolfinger, R. and O'Connell, M. (1993). Generalized linear mixed models: A pseudo-likelihood approach, Journal of Statistical Computation and Simulation 48: 233-243.

Wu, C. F. J. (1983). On the convergence properties of the EM algorithm, The Annals of Statistics 11: 95-103.

Zeger, S. L. (1988). A regression model for time series of counts, Biometrika 75: 621-629.

Zeger, S. L. and Qaqish, B. (1988). Markov regression models for time series: A quasi-likelihood approach, Biometrics 44: 1019-1031.


Zeger, S. L., Liang, K.-Y. and Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach, Biometrics 44: 1049-1060.

Zhang, H. (2002). On estimation and prediction for spatial generalized linear mixed models, Biometrics 58: 129-136.

Zhao, L. P. and Prentice, R. L. (1990). Correlated binary regression using a quadratic exponential model, Biometrika 77: 642-648.


BIOGRAPHICAL SKETCH

Bernhard Klingenberg was born on August 31, 1973, in Graz, Austria, as the first son of Dr. Hans and Ilse Klingenberg. Upon graduating from Lichtenfels High School, he completed his compulsory military training as a truck driver responsible for group and armory transports. In 1992 he registered with the Technical University Graz to pursue studies in mathematics, operations research and statistics. He graduated with a master's degree and highest honors in 1998.

Eager to further intensify his knowledge, Bernhard was awarded a Fulbright scholarship in 1999 for academic study abroad and decided to pursue a Ph.D. degree in statistics at the University of Florida. Aside from completing the standard curriculum, he worked as a teaching assistant and as a statistical consultant before joining Dr. Alan Agresti as a research assistant. Just before passing the written Ph.D. qualifying exams in August 2001, he married Sophia Froehlich, a medical student at the University of Vienna. In spring 2003 their daughter Franziska was born in Orlando, Florida.

In the fall of 2001, Bernhard began to work with Dr. Alan Agresti and Dr. James Booth on topics in generalized linear mixed models, leading to his dissertation work on modeling discrete-valued time series data. After graduating, Bernhard will be an assistant professor in the Department of Mathematics and Statistics at Williams College, Massachusetts.


I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

    Alan G. Agresti, Chair, Distinguished Professor of Statistics
    James G. Booth, Cochair, Professor of Statistics
    George Casella, Distinguished Professor of Statistics
    [signature illegible], Professor of Statistics
    Michael Martinez, Associate Professor of Political Science


This dissertation was submitted to the Graduate Faculty of the Department of Statistics in the College of Liberal Arts and Sciences and to the Graduate School, and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August 2004

Dean, Graduate School

