Regression models for discrete-valued time series data


Material Information

Title: Regression models for discrete-valued time series data
Author: Bernhard Klingenberg
Physical Description: xii, 177 leaves : ill. ; 29 cm.
Note: Thesis (Ph. D.)--University of Florida, 2004. Includes bibliographical references.
Statement of Responsibility: by Bernhard Klingenberg.

Record Information

Source Institution: University of Florida
Rights Management: All applicable rights reserved by the source institution and holding location.
Resource Identifier: aleph 003100752; oclc 706801267







Copyright 2004


Bernhard Klingenberg

To Sophia and Jean-Luc Picard


I would like to express my sincere gratitude to Drs. Alan Agresti and James

Booth for their guidance and assistance with my dissertation research and for

their support throughout my years at the University of Florida. During the past

three years as a research assistant for Dr. Agresti, I gained valuable experience in

conducting statistical research and writing scholarly papers, for which I am very

grateful. I would also like to thank Dr. Ramon Littell, who guided me through a

year of invaluable statistical consulting experience at IFAS, Dr. George Casella,

who taught me Monte Carlo methods, and Drs. Jeff Gill and Michael Martinez for

serving on my committee. My gratitude extends to all the faculty and present and

former graduate students of the Department, among them, in alphabetical order,

Brian Caffo, Sounak Chakraborty, Dr. Herwig Friedl, Ludwig Heigenhauser, David

Hitchcock, Wolfgang Jank, Galin Jones, Ziyad Mahfoud, Siuli Mukhopadhyay and

Brian Stephens.

I would like to thank my family, foremost my wife Sophia and my daughter

Franziska, for all the support, light, and joy they bring into my life and for

providing me with energy and fulfillment.

Lastly, this dissertation would not have been written in English had it not

been for the countless adventures of the Starship Enterprise and its Captain,

Jean-Luc Picard, whose episodes kept me glued to the TV in Austria and helped to

sufficiently improve my knowledge of the English language.


ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT


1 INTRODUCTION

1.1 Regression Models for Correlated Discrete Data
1.2 Marginal Models
1.2.1 Likelihood Based Estimation Methods
1.2.2 Quasi-Likelihood Based Estimation Methods
1.3 Transitional Models
1.3.1 Model Fitting
1.3.2 Transitional Models for Time Series of Counts
1.3.3 Transitional Models for Binary Data
1.4 Random Effects Models
1.4.1 Correlated Random Effects in GLMMs
1.4.2 Other Modeling Approaches
1.5 Motivation and Outline of the Dissertation


2.1 Definition and Notation
2.1.1 Generalized Linear Mixed Models for Univariate Discrete Time Series
2.1.2 State Space Models for Discrete Time Series Observations
2.1.3 Structural Similarities Between State Space Models and GLMMs
2.1.4 Practical Differences
2.2 Maximum Likelihood Estimation
2.2.1 Direct and Indirect Maximum Likelihood Procedures
2.2.2 Model Fitting in a Bayesian Framework
2.2.3 Maximum Likelihood Estimation for State Space Models
2.3 The Monte Carlo EM Algorithm
2.3.1 Maximization of Qm
2.3.2 Generating Samples from h(u | y; β, ·)
2.3.3 Convergence Criteria


3.1 A Motivating Example: Data from the General Social Survey
3.1.1 A GLMM Approach
3.1.2 Motivating Correlated Random Effects
3.2 Equally Correlated Random Effects
3.2.1 Definition of Equally Correlated Random Effects
3.2.2 The M-step with Equally Correlated Random Effects
3.3 Autoregressive Random Effects
3.3.1 Definition of Autoregressive Random Effects
3.3.2 The M-step with Autoregressive Random Effects
3.4 Sampling from the Posterior Distribution Via Gibbs Sampling
3.4.1 A Gibbs Sampler for Autoregressive Random Effects
3.4.2 A Gibbs Sampler for Equally Correlated Random Effects
3.5 A Simulation Study

OBSERVATIONS

4.1 Analysis for a Time Series of Normal Observations
4.1.1 Analysis via Linear Mixed Models
4.1.2 Parameter Interpretation
4.2 Analysis for a Time Series of Counts
4.2.1 Marginal Model Implied by the Poisson GLMM
4.2.2 Parameter Interpretation
4.3 Analysis for a Time Series of Binomial or Binary Observations
4.3.1 Marginal Model Implied by the Binomial GLMM
4.3.2 Approximation Techniques for Marginal Moments
4.3.3 Parameter Interpretation


5.1 Graphical Exploration of Correlation Structures
5.1.1 The Variogram
5.1.2 The Lorelogram
5.2 Normal Time Series
5.3 Analysis of the Polio Count Data
5.3.1 Comparison of ARGLMMs to Other Approaches
5.3.2 A Residual Analysis for the ARGLMM
5.4 Binary and Binomial Time Series
5.4.1 Old Faithful Geyser Data
5.4.2 Oxford versus Cambridge Boat Race Data


6.1 Cross-Sectional Time Series
6.2 Univariate Time Series
6.2.1 Clipping of Time Series
6.2.2 Longitudinal Data
6.3 Extensions and Further Research
6.3.1 Alternative Random Effects Distributions
6.3.2 Topics in GLMM Research

REFERENCES

BIOGRAPHICAL SKETCH

Table

3-1 A simulation study for a logistic GLMM with autoregressive random effects.

3-2 Simulation study for modeling unequally spaced binary time series.

5-1 Comparing estimates from two models for the log-odds.

5-2 Parameter estimates for the polio data.

5-3 Autocorrelation functions for the Old Faithful geyser data.

5-4 Comparison of observed and expected counts for the Old Faithful geyser data.

5-5 Maximum likelihood estimates for boat race data.

5-6 Observed and expected counts of sequences of wins (W) and losses (L) for the Cambridge University team.

5-7 Estimated random effects fit for the last 30 years for the boat race data.

5-8 Estimated probabilities of a Cambridge win in 2004, given the past s + 1 outcomes of the race.

Figure

2-1 Plot of the typical behavior of the Monte Carlo sample size m(k) and the Q-function Q(k) through MCEM iterations.

3-1 Sampling proportions from the GSS data set.

3-2 Iteration history for selected parameters and their asymptotic standard errors for the GSS data.

3-3 Realized (simulated) random effects u1, ..., uT versus estimated random effects û1, ..., ûT.

3-4 Comparing simulated and estimated random effects.

4-1 Approximated marginal probabilities for the fixed part predictor value x'β ranging from -4 to 4 in a logit model.

4-2 Comparison of conditional logit and probit model based probabilities.

4-3 Comparison of implied marginal probabilities from logit and probit models.

5-1 Empirical standard deviations std(θit) for the log odds of favoring homosexual relationship by race.

5-2 Plot of the Polio data.

5-3 Iteration history for the Polio data.

5-4 Residual autocorrelations for the Polio data.

5-5 Residual autocorrelations with outlier adjustment for the Polio data.

5-6 Autocorrelation functions for the Old Faithful geyser data.

5-7 Lorelogram for the Old Faithful geyser data.

5-8 Plot of the Oxford vs. Cambridge boat race data.

5-9 Variogram for the Oxford vs. Cambridge boat race data.

5-10 Lorelogram for the Oxford vs. Cambridge boat race data.

5-11 Path plots of fixed and random effects parameter estimates for the boat race data.

6-1 Association graphs for GLMMs.

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Bernhard Klingenberg

August 2004

Chair: Alan G. Agresti
Cochair: James G. Booth
Major Department: Statistics

Independent random effects in generalized linear models induce an exchange-

able correlation structure, but long sequences of counts or binomial observations

typically show correlations decaying with increasing lag. This dissertation intro-

duces models with autocorrelated random effects for a more appropriate, parameter-

driven analysis of discrete-valued time series data. We present a Monte Carlo EM

algorithm with Gibbs sampling to jointly obtain maximum likelihood estimates

of regression parameters and variance components. Marginal mean, variance and

correlation properties of the conditionally specified models are derived for Poisson,

negative binomial and binary/binomial random components. They are used for

constructing goodness of fit tables and checking the appropriateness of the modeled

correlation structure. Our models define a likelihood and hence estimation of the

joint probability of two or more events is possible and used in predicting future

responses. Also, all methods are flexible enough to allow for multiple gaps or miss-

ing observations in the observed time series. The approach is illustrated with the

analysis of a cross-sectional study over 30 years, where only observations from 16

unequally spaced years are available, a time series of 168 monthly counts of polio

infections and two long binary time series.


Correlated discrete data arise in a variety of settings in the biomedical,

social, political or business sciences whenever a discrete response variable is

measured repeatedly. Examples are time series of counts or longitudinal studies

measuring a binary response. Correlations between successive observations arise

naturally through a time, space or some other cluster forming context and have

to be incorporated in any inferential procedure. Standard regression models

for independent data can be expanded to accommodate such correlations. For

continuous type responses, the normal linear mixed effects model offers such a

flexible framework and has been well studied in the past. A recent reference is

Verbeke and Molenberghs (2000), who also discuss computer software for fitting

linear mixed effects models with popular statistical packages. Although the normal

linear mixed effects model is but one member of the broader class of generalized

linear mixed effects models, it enjoys unique properties which simplify parameter

estimation and interpretation substantially. For discrete response data, however,

the normal distribution is not appropriate, and other members in the exponential

family of distributions have to be considered.

1.1 Regression Models for Correlated Discrete Data

In this introduction we will review extensions of the basic generalized linear

model (McCullagh and Nelder, 1989) for analyzing independent observations to

models for correlated data. These models are marginal (Section 1.2), transitional

(Section 1.3) and random effects models (Section 1.4). An extensive discussion of

these models with respect to discrete longitudinal data is given in the books by

Agresti (2002) and Diggle, Heagerty, Liang and Zeger (2002). In general, longi-

tudinal studies concern only a few repeated measurements. In this dissertation,

however, we are interested in the analysis of much longer series of repeated observa-

tions, often exceeding 100 repeated measurements. Therefore, the following review

focuses specifically on models for univariate time series observations, some of which

are presented in Fahrmeir and Tutz (2001).

Let Yt be a response at time t, t = 1, ..., T, observed together with a vector

of covariates denoted by xt. In a generalized linear model (GLM), the mean

μt = E[yt] of observation yt depends on a linear predictor ηt = x't β through a link

function h(·), forming the relationship μt = h^{-1}(x't β). The variance of yt depends

on the mean through the relationship var(yt) = φt v(μt), where v(·) is a distribution

specific variance function and {φt} are additional dispersion parameters. In a

regular GLM, observations at any two distinct time points t and t* are assumed

independent.
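As a concrete sketch (illustrative, not from the dissertation): for a Poisson GLM with log link, h^{-1}(η) = exp(η), the variance function is v(μ) = μ and the dispersion is φt = 1, so mean and variance coincide. The helper name below is hypothetical:

```python
import math

def poisson_glm_moments(x, beta):
    """Mean and variance of y_t in a Poisson GLM with log link:
    mu_t = h^{-1}(x_t'beta) = exp(x_t'beta), var(y_t) = phi_t * v(mu_t)
    with variance function v(mu) = mu and dispersion phi_t = 1."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))  # linear predictor x_t'beta
    mu = math.exp(eta)                             # inverse link
    return mu, 1.0 * mu                            # mean and variance coincide

mu, var = poisson_glm_moments([1.0, 0.5], [0.2, 1.0])  # intercept plus one covariate
```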


In the models discussed below, the type of extension to accommodate corre-

lated data depends on the way the correlation is introduced into the model. In

marginal models, the correlation can be specified directly, e.g., corr(yt, yt*) = ρ, or

left completely unspecified, but nonetheless accounted for in likelihood based and

non-likelihood based inferences. In transitional models correlation is introduced

by including previous observations in the linear predictor, e.g., ηt = x̃'t β̃, where

x̃t = (x't, yt−1, yt−2, ...)' and β̃ = (β', α1, α2, ...)' are extensions of the design and

parameter vector of a GLM with independent components. Random effects models

induce correlation between observations by including random effects rather than

previous observations in the linear predictor, e.g., ηt = x't β + u, where u is a

random effect shared by all observations.

The way correlation is built into a model also determines the type of inference.

Typically, marginal models are fitted by a quasi-likelihood approach, estimation in

transitional models is based on a conditional or partial likelihood, and inference

in random effects models relies on a full likelihood (possibly Bayesian) approach.

However, models and inferential procedures have been developed that allow more

flexibility than the above categorization.

1.2 Marginal Models

In marginal regression models, the main scientific goal is to assess the influence

of covariates on the marginal mean of yt, treating the association structure between

repeated observations as a nuisance. The marginal mean /it and variance var(yt)

are modeled separately from a correlation structure between two observations

Yt and yt.. Regression parameters in the linear predictor are called population-

averaged parameters, because their interpretation is based on an average over

all individuals in a specific covariate subgroup. Due to the correlation among

repeated observations, the likelihood for the model refers to the joint distribution

of all observations and not to the simpler product of their marginal distributions.

However, the model is specified in terms of these marginal distributions, which

makes maximum likelihood fitting particularly hard for even a moderate number T

of repeated measurements.

1.2.1 Likelihood Based Estimation Methods

For binary data, Fitzmaurice and Laird (1993) discuss a parametrization of the

joint distribution in terms of conditional probabilities and log odds ratios. These

parameters are related to the marginal mean and the same conditional log odds

ratios, which describe the higher order associations among the repeated responses.

The marginal mean and the higher order associations are then modelled in terms of

orthogonal parameters β and α, respectively. Fitzmaurice and Laird (1993) present

an algorithm for maximizing the likelihood with respect to these two parameter

sets. The algorithm has been implemented in a freely available computer program

(MAREG) by Kastner et al. (1997).

Another approach to maximum likelihood fitting for longitudinal discrete

data regards the marginal model as a constraint on the joint distribution and

maximizes the likelihood subject to this constraint. The model is written in terms

of a generalized log-linear model C log(Aμ) = Xβ, where μ is a vector of expected

counts and A and C are matrices to form marginal counts and functions of those

marginal counts, respectively. With this approach, no specific assumption about

the correlation structure of repeated observations is made, and the likelihood

refers to the most general form for the joint distribution. However, simultaneous

modeling of the marginal distribution and a simplified joint distribution is also

possible. Details can be found in Lang and Agresti (1994) and Lang (1996). Lang

(2004) also offers an R computer program for maximum likelihood fitting

of these very general marginal models.

1.2.2 Quasi-Likelihood Based Estimation Methods

The drawback of the two approaches mentioned above and likelihood based

methods in general is that they require enormous computing resources as the

number of repeated responses increases or the number of covariates is large, making

maximum likelihood fitting computationally impossible for long time series. This is

also true for estimation based on alternative parameterizations of a distribution for

multivariate binary data such as those discussed in Bahadur (1961), Cox (1972) or

Zhao and Prentice (1990).

Estimating methods leading to computationally simpler inference (albeit

not maximum likelihood) for marginal models are based on a quasi-likelihood

approach (Wedderburn, 1974). In a quasi-likelihood approach, no specific form

for the distribution of the responses is assumed and only the mean, variance and

correlation are specified. However, with discrete data, specifying the mean and

covariances does not determine the likelihood, as it would with normal data, so

parameter estimation cannot be based on it. Liang and Zeger (1986) proposed

generalized estimating equations (GEE) to estimate parameters, which have

the form of score equations for GLMs, but cannot be interpreted as such. Their

approach also requires the specification of a working correlation matrix for the

repeated responses. They show that if the mean function is correctly specified,

the solution to the generalized estimating equations is a consistent estimator,

regardless of the assumed variance-covariance structure for the repeated responses.

They also present an estimator of the asymptotic variance-covariance matrix

for the GEE estimates, which is robust against misspecification of the working

correlation matrix. Several structured working correlation matrices have been

proposed for parsimonious modeling of the marginal correlation, and some of them

are implemented in statistical software packages for GEE estimation (e.g., SAS's

proc genmod with the repeated statement and the type option or the gee and geel

packages in R).

GEE for time series of counts

Zeger (1988) uses the GEE methodology to fit a marginal model to a time

series {yt}T of T = 168 monthly counts of cases of poliomyelitis in the United

States. He specifies the marginal mean, variance and correlation by

μt = exp(x't β),

var(yt) = μt + σ²μt²,                                            (1.1)

corr(yt, yt+τ) = ρ(τ) / [{1 + (σ²μt)^{-1}}{1 + (σ²μt+τ)^{-1}}]^{1/2},

where σ² is the variance and ρ(τ) the autocorrelation function of an underlying

random process {ut}. To fit this marginal model, he proposes and outlines the

GEE approach, but notes that it requires inversion of the T x T variance-covariance

matrix of yl,..., YT, which has no recognizable structure and therefore no simple

inverse. Subsequently, he suggests approximating this matrix by a simpler, struc-

tured matrix, leading to nearly as efficient estimators as would have been obtained

with the GEE approach. The variance component σ² and unknown parameters in

ρ(τ) are estimated by a method of moments approach.
Interestingly, Zeger (1988) derives the marginal mean, variance and correlation

in (1.1) from a random effects model specification: Conditional on an underlying

latent random process {ut} with E[ut] = 1 and cov(ut, ut+τ) = σ²ρ(τ), he initially

models the time series observations as conditionally independent Poisson variables

with mean and variance

E[yt | ut] = var(yt | ut) = exp(x't β) ut.                       (1.2)

Marginally, by the formula for repeated expectation, this leads to the moments

presented in (1.1). From there we also see that the latent random process {ut} has

introduced both overdispersion relative to a Poisson variable and autocorrelation

among the observations. The models we will develop in subsequent chapters have

similar features. The equation for the marginal correlation between yt and Yt*

shows that the autocorrelation in the observed time series must be less than the

autocorrelation in the latent process {ut}. We will return to the polio data set in

Chapter 5, where we compare this model to models suggested in this dissertation

and elsewhere.

GEE for binomial time series

For binary and binomial time series data, it is often more advantageous to

model the association between observations using the odds ratio rather than

directly specifying the marginal correlation corr(yt, yt*) as with count data.

The odds ratio is a more natural metric to measure association between binary

outcomes and easier to interpret. The correlation between two binary outcomes

Y1 and Y2 is also constrained in a complicated way by their marginal means

μ1 = P(Y1 = 1) and μ2 = P(Y2 = 1) as a consequence of the following inequalities

for their joint distribution:

P(Y1 = 1, Y2 = 1) = μ1 + μ2 − P(Y1 = 1 or Y2 = 1) ≥ max{0, μ1 + μ2 − 1}

and

P(Y1 = 1, Y2 = 1) ≤ min{μ1, μ2},

leading to

max{0, μ1 + μ2 − 1} ≤ P(Y1 = 1, Y2 = 1) ≤ min{μ1, μ2}.
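These bounds translate directly into a restricted range for the achievable correlation, since corr(Y1, Y2) = (P(Y1 = 1, Y2 = 1) − μ1μ2)/√(μ1(1 − μ1)μ2(1 − μ2)). A small numerical sketch (illustrative, not part of the dissertation; function names are hypothetical):

```python
import math

def joint_prob_bounds(mu1, mu2):
    """Frechet bounds for P(Y1 = 1, Y2 = 1) given Bernoulli margins."""
    return max(0.0, mu1 + mu2 - 1.0), min(mu1, mu2)

def corr_bounds(mu1, mu2):
    """Implied range for corr(Y1, Y2) = (p11 - mu1*mu2) / sqrt(v1*v2)."""
    lo, hi = joint_prob_bounds(mu1, mu2)
    s = math.sqrt(mu1 * (1 - mu1) * mu2 * (1 - mu2))
    return (lo - mu1 * mu2) / s, (hi - mu1 * mu2) / s

# with unequal margins the correlation cannot span the full (-1, 1) range
lo, hi = corr_bounds(0.1, 0.6)
```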

Therefore, instead of marginal correlations, a number of authors (Fitzmaurice,

Laird and Rotnitzky, 1993; Carey, Zeger and Diggle, 1993) propose the use of

marginal odds ratios. For unequally spaced and unbalanced binary time series data,

Fitzmaurice and Lipsitz (1995) present a GEE approach which models the marginal

association using serial odds ratio patterns. Let ψtt* denote the marginal odds ratio

between two binary observations yt and yt*. Their model for the association has the

form

ψtt* = α^{1/|t−t*|},    1 < α < ∞,

which has the property that as |t − t*| → 0, there is perfect association (ψtt* → ∞),

and as |t − t*| → ∞, the observations are independent (ψtt* → 1). Note, however,

that only positive association is possible with this type of model. (SAS's proc

genmod now offers the possibility of specifying a general regression structure for the

log odds ratios with the logor option.)
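For illustration (the value of α is hypothetical), the serial pattern ψtt* = α^{1/|t−t*|} can be evaluated over increasing lags to see the decay toward independence:

```python
def serial_odds_ratio(alpha, lag):
    """Fitzmaurice-Lipsitz serial pattern: psi = alpha**(1/|t - t*|)."""
    if lag <= 0:
        raise ValueError("defined for distinct time points only")
    return alpha ** (1.0 / lag)

alpha = 4.0  # hypothetical association parameter, 1 < alpha < infinity
ratios = [serial_odds_ratio(alpha, k) for k in (1, 2, 5, 50)]
# psi decreases toward 1 (independence) as the lag grows, and it always
# exceeds 1, so only positive association can be represented
```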

1.3 Transitional Models

In transitional models, past observations are simply treated as additional

predictors. Interest lies in estimating the effects of these and other explanatory

variables on the conditional mean of the response Yt, given realizations of the past

responses. Specifying the relationship between the mean of Yt and previous obser-

vations yt−1, yt−2, ... is another way (and in contrast to the direct way of marginal

models) of modeling the dependency between correlated responses. Transitional

models fit into the framework of GLMs, where, however, the distribution of Yt is

now conditional on the past responses. The model in its most general form (Diggle

et al., 2002) expresses the conditional mean of Yt as a function of explanatory

variables and q functions fr(·) of past responses,

E[yt | Ht] = h^{-1}( x't β + Σ_{r=1}^{q} fr(Ht; α) )            (1.3)

where Ht = {yt−1, yt−2, ..., y1} denotes the collection of past responses. Ht can

also include past explanatory variables and parameters. Often, the models are in

discrete-time Markov chain form of order q, and the conditional distribution of

Yt given Ht only depends on the last q responses Yt-1, ... Yt-q. For example, a
transitional logistic regression model for binary responses that is a second order

Markov chain has form

logit P(Yt = 1 | yt−1, yt−2) = x't β + α1 yt−1 + α2 yt−2.
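Fitting such a model by conditional maximum likelihood reduces to ordinary logistic regression with lagged responses as extra predictors. The sketch below (simulated data; all parameter values are hypothetical, and plain gradient ascent is just one way to maximize the conditional likelihood) conditions on the first q = 2 observations:

```python
import math
import random

def simulate_chain(T, b0, a1, a2, seed=1):
    """Simulate a second-order Markov binary chain with
    logit P(y_t = 1 | y_{t-1}, y_{t-2}) = b0 + a1*y_{t-1} + a2*y_{t-2}."""
    rng = random.Random(seed)
    y = [0, 1]  # the first q = 2 observations are held fixed
    for t in range(2, T):
        eta = b0 + a1 * y[t - 1] + a2 * y[t - 2]
        p = 1.0 / (1.0 + math.exp(-eta))
        y.append(1 if rng.random() < p else 0)
    return y

def fit_conditional(y, steps=400, lr=1.0):
    """Maximize the conditional log likelihood (given y_1, y_2) by gradient
    ascent; the score has the same form as for an independent-data GLM."""
    theta = [0.0, 0.0, 0.0]  # (b0, a1, a2)
    rows = [((1.0, y[t - 1], y[t - 2]), y[t]) for t in range(2, len(y))]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x, yt in rows:
            eta = sum(tj * xj for tj, xj in zip(theta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for j in range(3):
                grad[j] += (yt - p) * x[j]  # Bernoulli GLM score contribution
        theta = [tj + lr * g / len(rows) for tj, g in zip(theta, grad)]
    return theta

theta = fit_conditional(simulate_chain(4000, b0=-1.0, a1=1.5, a2=0.5))
```

With a long enough series the estimates land close to the generating values, up to sampling error.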

The main difference between transitional models and regular GLMs or marginal

models is parameter interpretation. Both the interpretation of α and the inter-

pretation of β are conditional on previous outcomes and depend on how many

of these are included. As the time dependence in the model changes, so does the

interpretation of parameters. With the logistic regression example from above, the

conditional odds of success at time t are exp(α1) times higher if the given previous

response was a success rather than a failure. However, this interpretation assumes

a fixed and given outcome at time t − 2. Similarly, a coefficient in β represents

the change in the log odds for a unit change in xt, conditional on the two prior

responses. It might be possible that we lose information on the covariate effect by

conditioning on these previous outcomes. In general, the interpretation of parame-

ters in transitional models is different from the population averaged interpretation

we discussed for marginal models, where parameters are effects on the marginal

mean without conditioning on any previous outcomes.
1.3.1 Model Fitting

If a discrete-time Markov model applies, the likelihood for a generic series

yl,... T is determined by the Markov chain structure:
L(,a3,;yi,...,yT) = f(yi,...,Yq) f(Yt I Yt-1i,...,Yt-q)-
However, the transitional model (1.3) only specifies the conditional distributions

appearing in the product, but not the first term of the likelihood. Often, instead

of a full maximum likelihood approach, one conditions on the first q observations

and maximizes the corresponding conditional likelihood. If in addition fr(Ht; α)

in (1.3) is a linear function in α (and possibly β), then maximization follows

along the lines of GLMs for independent data. Kaufmann (1987) establishes the

asymptotic properties such as consistency, asymptotic normality and efficiency of

the conditional maximum likelihood estimator.

If a Markov assumption is not warranted, estimation can be based on the

partial likelihood (Cox, 1975). To motivate the partial likelihood approach, we

follow Kedem and Fokianos (2002): They consider occasions where a time series

{Yt} is observed jointly with a random covariate series {Xt}. The joint density of

(Yt, Xt), t = 1, ..., T, parameterized by a vector θ, can be expressed as

f(x1, y1, ..., xT, yT; θ) = f(x1; θ) [∏_{t=2}^{T} f(xt | Ht; θ)] [∏_{t=1}^{T} f(yt | H̃t; θ)],    (1.4)

where Ht = (x1, y1, ..., xt−1, yt−1) and H̃t = (x1, y1, ..., xt−1, yt−1, xt) hold the

history up to time points t − 1 and t, respectively. Let Ft−1 denote the σ-field

generated by Yt−1, Yt−2, ..., Xt, Xt−1, ..., i.e., Ft−1 is generated by past responses

and present and past values of the covariates. Also, let ft(yt | Ft−1; θ) denote the

conditional density of Yt, given Ft−1, which is of exponential density form with

mean modeled by (1.3). Then, the partial likelihood for θ = (α, β) is given by

PL(θ; y1, ..., yT) = ∏_{t=1}^{T} ft(yt | Ft−1; θ),              (1.5)
which is the second product in (1.4) and hence the term partial. The loss of

information by ignoring the first product in the joint density is considered small.

If the covariate process is deterministic, then the partial likelihood becomes a

conditional likelihood, but without the necessity of a Markov assumption on the

distribution of the Yt's.
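Since (1.5) is simply a product of conditional densities, evaluating a partial log likelihood is mechanical. A sketch for a Poisson transitional specification (the model, toy data, and function name here are illustrative choices, not Kedem and Fokianos's examples):

```python
import math

def partial_loglik(y, x, beta, alpha):
    """Partial log likelihood sum_t log f_t(y_t | F_{t-1}; theta) for a
    hypothetical Poisson transitional model with
    log E[y_t | F_{t-1}] = beta*x_t + alpha*log(1 + y_{t-1})."""
    ll = 0.0
    for t in range(1, len(y)):
        eta = beta * x[t] + alpha * math.log(1 + y[t - 1])
        mu = math.exp(eta)
        ll += y[t] * eta - mu - math.lgamma(y[t] + 1)  # Poisson log density
    return ll

y = [2, 3, 1, 4, 2, 5]   # toy count series
x = [1.0] * len(y)       # constant covariate
# maximizing over a coarse grid mimics solving the partial score equations
best = max(((b, a) for b in (0.5, 1.0) for a in (0.0, 0.3)),
           key=lambda th: partial_loglik(y, x, *th))
```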

Standard asymptotic results from likelihood analysis of independent data carry

over to the case of partial likelihood estimation with dependent data. Fokianos and

Kedem (1998) showed consistency and asymptotic normality of θ̂ and provided an

expression for the asymptotic covariance matrix. Since the score equation obtained

from (1.5) is identical to one for independent data in a GLM, partial likelihood

holds the advantage of easy, fast and readily available software implementation

with standard estimation routines such as iteratively reweighted least squares.

1.3.2 Transitional Models for Time Series of Counts

For a time series of counts {yt}, Zeger and Qaqish (1988) propose Markov-

type transitional models which they fit using quasi-likelihood methods and the

estimating equations approach. They consider various models for the conditional

mean μt = E[yt | Ht] of form log(μt) = x't β + Σ_{r=1}^{q} αr fr(Ht−r), where for example

fr(Ht−r) = yt−r or fr(Ht−r) = log(yt−r + c) − log(exp[x't β] + c). One common goal

of their models is to approximate the marginal mean by E[yt] = E[μt] ≈ exp{x't β}

so that β has an approximate marginal interpretation as the change in the log

mean for a unit change in the explanatory variables. Davis et al. (2003) develop

these models further and propose fr(Ht−r) = (yt−r − μt−r)/μt−r^λ as a more

appropriate function to build serial dependence in the model, where λ is an

additional parameter. They explore stability properties such as stationarity and

ergodicity of these models and describe fast (in comparison to maximum likelihood

techniques required for competing random effects models), recursive and iterative

maximum likelihood estimation algorithms.
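The conditional mean recursion of such a transitional count model can be sketched as follows, here for a first-order (q = 1) version with a constant linear predictor and with fr(Ht−r) = log(yt−r + c) − log(exp[x'β] + c); all numerical values and the function name are hypothetical:

```python
import math
import random

def transitional_count_series(T, beta0, alpha, c=0.5, seed=2):
    """Generate counts from a first-order transitional (Zeger-Qaqish type)
    conditional mean with constant predictor x_t'beta = beta0:
    log mu_t = beta0 + alpha*(log(y_{t-1} + c) - log(exp(beta0) + c));
    the constant c keeps the logarithm defined when y_{t-1} = 0."""
    rng = random.Random(seed)
    y = [int(round(math.exp(beta0)))]  # crude start near the marginal level
    for t in range(1, T):
        f = math.log(y[t - 1] + c) - math.log(math.exp(beta0) + c)
        mu = math.exp(beta0 + alpha * f)
        # Poisson draw by inversion (adequate for small means)
        u, k, p = rng.random(), 0, math.exp(-mu)
        cdf = p
        while u > cdf:
            k += 1
            p *= mu / k
            cdf += p
        y.append(k)
    return y

series = transitional_count_series(200, beta0=1.0, alpha=0.4)
# with |alpha| < 1 the series fluctuates around roughly exp(beta0)
```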

Chapter 4 in Kedem and Fokianos (2002) discusses regression models of form

(1.3) assuming a conditional Poisson or double-truncated Poisson distribution

for the counts, with inference based on the partial likelihood concept. Their

methodology is illustrated with two examples about monthly counts of rainy days

and counts of tourist arrivals.

1.3.3 Transitional Models for Binary Data

For binary data {yt}, a two state, first order Markov chain can be defined by

its probability transition matrix

P = ( p00  p01 )
    ( p10  p11 )

where pab = P(Yt = b | Yt−1 = a), a, b = 0, 1, are the one-step transition probabilities

between the two states a and b. Diggle et al. (2002, Chapt. 10.3) discuss various

logistic regression models for these probabilities and higher order Markov chains for

equally spaced observations. Unequally spaced data cannot be routinely handled

with these models.

How can we determine the marginal association structure implied by the

conditionally specified model? Let p^{(1)} = (p0, p1) be the initial marginal distribution

for the states at time t = 1. Then the distribution of the states at time n is

given by p^{(n)} = p^{(1)} P^{n−1}. As n increases, p^{(n)} approaches a steady state or equilibrium

distribution p that satisfies p = pP. The solution to this equation is given by

p1 = P(Yt = 1) = E[yt] = p01/(p01 + p10) and is used to derive marginal moments

implied by the transitional model. For example, it can be shown (Kedem, 1980)

that in the steady state, the marginal variance and correlation implied by the

transitional model are var(yt) = p0 p1 (as it should be) and corr(yt−1, yt) = p11 − p01.
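A quick numerical check of these steady-state identities (the transition probabilities below are hypothetical values, chosen only for illustration):

```python
def steady_state(p01, p10):
    """Equilibrium distribution (pi0, pi1) of a two-state chain, solving p = pP."""
    pi1 = p01 / (p01 + p10)
    return 1.0 - pi1, pi1

p01, p10 = 0.3, 0.2              # hypothetical one-step transition probabilities
p00, p11 = 1.0 - p01, 1.0 - p10
pi0, pi1 = steady_state(p01, p10)

assert abs(pi0 * p00 + pi1 * p10 - pi0) < 1e-12   # stationarity: pi P = pi
var = pi0 * pi1                                   # steady-state Bernoulli variance
cov = pi1 * p11 - pi1 * pi1                       # P(y_{t-1}=1, y_t=1) - pi1^2
assert abs(cov / var - (p11 - p01)) < 1e-12       # lag-1 corr = p11 - p01 (Kedem, 1980)
```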

Azzalini (1994) models serial dependence in binary data through transition

models, but at the same time retains the marginal interpretation of regression

parameters. He specifies the marginal regression model logit(μt) = x't β for a
binary time series {yt} with E[yt] = μt, but assumes that a binary Markov chain
with transition probabilities pab has generated the data. Therefore, the likelihood
refers to these probabilities but the model specifies marginal probabilities, a

complication similar to the fitting of marginal models discussed in the previous

section. However, assuming a constant log odds ratio

ψ = log [ P(Yt−1 = 1, Yt = 1) P(Yt−1 = 0, Yt = 0) / P(Yt−1 = 0, Yt = 1) P(Yt−1 = 1, Yt = 0) ]

between any two adjacent observations, Azzalini (1994) shows how to write

pab in terms of just this log odds ratio ψ and the marginal probabilities μt and
μt−1. Maximum likelihood estimation for such models is tedious but possible in
closed form, although second derivatives of the log likelihood function have to be
calculated numerically. A software package (an S-plus function of Azzalini

and Chiogna, 1997) exists to fit such models for binary and Poisson observations.

Azzalini (1994) mentions that this basic approach can be extended to include

variable odds ratios between any two adjacent observations, possibly depending on

covariates, but this is not pursued in the article. Diggle et al. (2002) discuss these

marginalized transitional models further.
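The mapping from the two marginal probabilities and the constant log odds ratio ψ to the joint (and hence transition) probabilities can be sketched with Plackett's classical formula for a 2×2 table with fixed margins. This illustrates the kind of computation involved; it is not Azzalini's exact implementation, and the function names are ours.

```python
import math

def joint_from_marginals(mu1, mu2, psi):
    """Cell probabilities of (Y_{t-1}, Y_t) with marginal success
    probabilities mu1, mu2 and log odds ratio psi (Plackett's formula)."""
    theta = math.exp(psi)
    if abs(theta - 1.0) < 1e-10:
        p11 = mu1 * mu2                      # psi = 0 means independence
    else:
        s = 1.0 + (mu1 + mu2) * (theta - 1.0)
        disc = s * s - 4.0 * theta * (theta - 1.0) * mu1 * mu2
        p11 = (s - math.sqrt(disc)) / (2.0 * (theta - 1.0))
    return {(0, 0): 1.0 - mu1 - mu2 + p11,
            (0, 1): mu2 - p11,
            (1, 0): mu1 - p11,
            (1, 1): p11}

def transition_probs(mu1, mu2, psi):
    """Implied transition probabilities P(Y_t = 1 | Y_{t-1} = a), a = 0, 1."""
    p = joint_from_marginals(mu1, mu2, psi)
    return p[0, 1] / (1.0 - mu1), p[1, 1] / mu1
```

With ψ > 0 (positive serial association), the probability of a success following a success exceeds that following a failure, as expected.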
Chapter 2 in Kedem and Fokianos (2002) presents a detailed discussion of
partial likelihood estimation for transitional binary models and discusses, among
other examples, the eruption data of the Old Faithful geyser which we will turn to

in Chapter 5.

1.4 Random Effects Models

A popular way of modeling correlation among dependent observations is to

include random effects u in the linear predictor. One of the first developments for

discrete data occurred for longitudinal binary data, where subject-specific random

effects induced correlation between repeated binary measurements on a subject

(Bock and Aitkin, 1981; Stiratelli, Laird and Ware, 1984). In general, we assume

that unmeasurable factors give rise to the dependency in the data {yt} and random

effects {ut} represent the heterogeneity due to these unmeasured factors. Given

these effects, the responses are assumed independent. However, no values for these

factors are observed, and so marginally (i.e., averaged over these factors), the

responses are dependent.

Conditional on some random effects, we consider models that fit into the

framework of GLMs for independent data, i.e., where the conditional distribution

of y_t | u is a member of the exponential family of distributions, whose mean

E[y_t | u_t] is modeled as a function of a linear predictor η_t = x_t'β + z_t'u_t. Together
with a distributional assumption for the random effects (usually independent and

identically normal), this leads to generalized linear mixed models (GLMMs), where

the term mixed refers to the mixture of fixed and random effects in the linear pre-

dictor. Chapter 2 contains a detailed definition of GLMMs and discusses maximum

likelihood fitting and parameter interpretation, and in Chapter 3 correlated random

effects for the description of time-dependent observations {y_t} are motivated and

described. Here, we only give a short literature review about GLMMs which use

correlated random effects to model time (or space) dependent data.

1.4.1 Correlated Random Effects in GLMMs

One of the first papers considering correlated random effects in GLMMs for

the description of (spatial) dependence in Poisson data is Breslow and Clayton

(1993), who analyze lip cancer rates in Scottish counties. They propose correlated

normal random effects to capture the correlation in counts of adjacent districts in

Scotland. A random effect is assigned to each district, and two random effects are

correlated if their districts are adjacent to each other.

In Section 1.2.2 we mentioned the Polio data set, a time series of equally

spaced counts {y_t}, and formulated the conditional model (1.2) with a latent

process for the random effects. Instead of obtaining marginal moments as in Zeger

(1988), Chan and Ledolter (1995) use a GLMM approach with Poisson random

components and autoregressive random effects to analyze the time series. They

outline parameter estimation via an MCEM algorithm similar to the one discussed

in Sections 2.4 and 3.2 in this dissertation.

One of the three central generalized linear models advocated by Diggle et al.

(2002, Chap. 11.2) to model longitudinal data uses correlated random effects. For

equally spaced binary longitudinal data {yit}, they plot response profiles simulated

according to the model

logit[P(Y_it = 1 | u_it)] = x_it'β + u_it

cov(u_it, u_it') = σ² ρ^|t - t'|

with σ² = 2.5² and ρ = 0.9, and note that the profiles exhibit more alternating

runs of 0's and 1's than a random intercept model with u_i1 = u_i2 = ... =

u_iT = u_i. However, based on the similarity between plots of random intercepts,

random intercepts and slopes and autoregressive random effects models, they

mention the challenge that binary data present in distinguishing and modeling the

underlying dependency structure in longitudinal data. (They used T = 25 repeated

observations for their simulations.) Furthermore, they state that numerical methods

for maximum likelihood estimation are computationally impractical for fitting

models with higher dimensional random effects. This makes it impossible, they

conclude, to fit the GLMM with serially correlated random effects using maximum

likelihood. Instead, they propose a Bayesian analysis using powerful Markov

chain Monte Carlo methods.
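The simulation described by Diggle et al. is easy to reproduce. A minimal sketch (our own code; the linear predictor and function names are illustrative) draws one profile of length T = 25 from the autoregressive random-effects model with σ = 2.5 and ρ = 0.9, and counts the alternating runs of 0's and 1's that distinguish it from a random-intercept profile:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_profile(T=25, sigma=2.5, rho=0.9, beta0=0.0, random_intercept=False):
    """One binary profile from logit P(Y_t = 1 | u_t) = beta0 + u_t with
    cov(u_t, u_t') = sigma^2 rho^|t - t'|; if random_intercept, u_t = u_1 for all t."""
    u = np.empty(T)
    u[0] = rng.normal(0.0, sigma)
    for t in range(1, T):
        if random_intercept:
            u[t] = u[0]
        else:  # stationary AR(1) step keeps the marginal variance at sigma^2
            u[t] = rho * u[t - 1] + rng.normal(0.0, sigma * np.sqrt(1.0 - rho**2))
    prob = 1.0 / (1.0 + np.exp(-(beta0 + u)))
    return rng.binomial(1, prob)

def n_runs(y):
    """Number of runs of equal values in a 0/1 sequence."""
    return 1 + int(np.sum(y[1:] != y[:-1]))
```

Averaged over many simulated profiles, the autoregressive version produces more runs; with σ this large, a random-intercept profile is often constant (a single run), matching the comparison drawn above.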

Indeed, the majority of examples in the literature which consider correlated

random effects in a GLMM framework take a Bayesian approach. Sun, Speckman

and Tsutakawa (2000) explore several types of correlated random effects (autore-

gressive, generalized autoregressive and conditional autoregressive) in a Bayesian

analysis of a GLMM. As in any Bayesian analysis, the propriety of the posterior

distribution given the data is of concern when fixed effects and variance compo-

nents have improper prior distributions and random effects are (possibly singular)

multivariate normal. One of their results applied to Poisson or binomial data {yt}

states that the posterior might be improper when y_t = 0 in the Poisson case and

cannot be proper when y_t = 0 or y_t = n_t in the binomial case for any t when

improper or non-informative priors are used.

Diggle, Tawn and Moyeed (1998) consider Gaussian spatial processes S(x)

to model spatial count data at locations x. The role of S(x) is to explain any

residual spatial variation after accounting for all known explanatory variables.

They also use a Bayesian framework to estimate parameters and give a solution to

the problem of predicting the count at a new location. Ghosh et al. (1998) use

correlated random effects in Bayesian models for small area estimation problems.

They present an application of pairwise difference priors for random effects

to model a series of spatially correlated binomial observations in a Bayesian

framework. Zhang (2002) discusses maximum likelihood estimation with an

underlying spatial Gaussian process for spatially correlated binomial observations.

Bayesian models for binary time series are described in Liu (2001), based

on probit-type models for correlated binary data which are discussed in Chib

and Greenberg (1998). Probit type models are motivated by assuming latent

random variables z = (z_1, ..., z_T), which follow a N(μ, Σ) distribution with

μ = (μ_1, ..., μ_T), μ_t = x_t'β and Σ a correlation matrix. The y_t's are assumed to be
generated according to

y_t = I(z_t > 0),

where I(.) is the indicator function. This leads to the (marginal) probit model

P(Y_t = 1 | Σ) = Φ(μ_t). Rich classes of dependency structures between

binary outcomes can be modeled through E. These models can further be extended

to include random effects through μ_t = x_t'β + z_t'u_t or q previous responses such

as μ_t = x_t'β + Σ_{τ=1}^q α_τ y_{t-τ}. It is important to note that Σ has to be in correlation

form. To see this, suppose it is not and let S = DΣD be a covariance matrix

for the latent random variables z, where D is a diagonal matrix holding standard

deviation parameters. The joint density of the times series under the multivariate

probit model is given by

P[(Y_1, ..., Y_T) = (y_1, ..., y_T)] = P[z ∈ A]

= P[D^{-1}z ∈ A],

where A = A_1 × ... × A_T with A_t = (-∞, 0] if y_t = 0 and A_t = (0, ∞) if y_t = 1

are the intervals corresponding to the relationship y_t = I(z_t > 0), for t = 1, ..., T.

However, the above relationship holds for any parametrization of D, because the

intervals A_t are not affected by the transformation from z to D^{-1}z. Hence, the

elements of D are not identifiable based on the joint distribution of the observed

time series y.
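This non-identifiability can be checked by simulation: rescaling the latent variables by a positive diagonal matrix changes neither the signs of the z_t's nor, therefore, the distribution of the observed binary series. A small sketch (our own; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def pattern_freqs(mu, cov, n_sims=200_000):
    """Empirical frequencies of the 2^T binary patterns y_t = I(z_t > 0),
    where z ~ N(mu, cov)."""
    T = len(mu)
    z = mu + rng.standard_normal((n_sims, T)) @ np.linalg.cholesky(cov).T
    y = (z > 0).astype(int)
    codes = y @ (2 ** np.arange(T))          # encode each pattern as an integer
    return np.bincount(codes, minlength=2 ** T) / n_sims

T, rho = 3, 0.8
Sigma = rho ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))  # AR(1) corr.
D = np.diag([1.0, 2.0, 0.5])                 # arbitrary positive rescaling
mu = np.array([0.2, -0.1, 0.4])

f1 = pattern_freqs(mu, Sigma)                # model (mu, Sigma)
f2 = pattern_freqs(D @ mu, D @ Sigma @ D)    # model (D mu, D Sigma D)
# f1 and f2 agree up to Monte Carlo error: D cannot be recovered from y
```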

Lee and Nelder (2001) present models to analyze spatially correlated Poisson

counts and binomial longitudinal data about cancer mortality rates. They explore

a variety of patterned correlation structures for random effects in a GLMM setup.

Model fitting is based on the joint data likelihood of observations and unobserved

random effects (Lee and Nelder, 1996) and not on the marginal likelihood of the

observed data. Model diagnostic plots of estimated random effects are presented to

aid in selecting an appropriate correlation structure.

1.4.2 Other Modeling Approaches

In hidden Markov models (MacDonald and Zucchini, 1997) the underlying

random process is assumed to be a discrete state-space Markov chain instead

of a continuous (normal) process. Probability transition matrices describe the

connection between states. A very convenient property of hidden Markov models

is that the likelihood can be evaluated sufficiently fast to permit direct numerical

maximization. MacDonald and Zucchini (1997) present a detailed description of

hidden Markov models for the analysis of binary and count time series.
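The fast likelihood evaluation rests on the forward recursion, which sums over the hidden states in O(Tm²) time for m states. Below is a minimal sketch for Bernoulli emissions (our own illustration, not code from MacDonald and Zucchini), with rescaling at each step to avoid numerical underflow:

```python
import numpy as np

def hmm_loglik(y, P, emit, init):
    """Log-likelihood of a 0/1 series under a hidden Markov chain.
    P[j, k]: transition probability from state j to state k.
    emit[j]: P(y_t = 1 | state j); init[j]: P(state j at t = 1)."""
    def density(obs):
        return np.where(obs == 1, emit, 1.0 - emit)
    alpha = init * density(y[0])      # forward probabilities at t = 1
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()       # rescale to prevent underflow
    for obs in y[1:]:
        alpha = (alpha @ P) * density(obs)
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return loglik
```

Direct numerical maximization then simply wraps this evaluation inside a general-purpose optimizer over the transition, emission and initial-state parameters.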

A connection between transitional models and random effects models is

explored in Aitkin and Alfò (1998). They model the success probabilities of

serial binary observations conditional on subject-specific random effects and on

the previous outcome. As in the models before, transition probabilities Pab are

changing over time due to the inclusion of time-dependent covariates and the

previous observation in the linear predictor. Additionally, random effects account

for possibly unobserved sources of heterogeneity between subjects. The authors

argue that the conditional model specification together with the specification of

the random effects distribution does not determine the distribution of the initial

observation, and hence the likelihood for this model is unspecified. They present

a solution by maximizing the likelihood obtained from conditioning on this first

observation. However, this causes the specified random effects distribution to shift

to an unknown distribution. Two approaches for estimation are outlined: The first

assumes another normal distribution for the new random effects distribution and

the likelihood is maximized using Gauss-Hermite quadrature. The second approach

assumes no parametric form for the new random effects distribution and follows the

nonparametric maximum likelihood approach (Aitkin, 1999). For binary data, the

new random effects distribution is only a two point distribution and its parameters
can be estimated via maximum likelihood jointly with the other model parameters.

Marginalized transitional models were briefly mentioned with the approach
taken by Azzalini (1994). The idea of "marginalizing", i.e., modeling the marginal

mean of an otherwise conditionally specified model can also be applied to random
effects models. The advantage of transitional or random effects models is the
ability to easily specify correlation patterns, with the potential disadvantage that
parameters in such models have conditional interpretations when the scientific goal

is the interpretation of marginal relationships. In marginal models, parameters

can directly be interpreted as contrasts between subpopulations without the need

of conditioning on previous observations or unobserved random effects. However, as

we mentioned in Section 1.2.2, likelihood based inference in marginal models might
not be possible.

A marginalized random effects model (Heagerty, 1999; Heagerty and Zeger,
2000) specifies two regression equations that are consistent with each other. The

first equation,

μ_t^M = E[y_t] = h^{-1}(x_t'β),

expresses the marginal mean μ_t^M as a function of covariates and describes system-

atic variation. The second equation characterizes the dependency structure among
observations through specification of the conditional mean μ_t^C,

μ_t^C = E[y_t | u_t] = h^{-1}(Δ_t(x_t) + z_t'u_t),

where ut are random effects with design vector zt. Consistency between the

marginal and conditional specification is achieved by defining Δ_t(x_t) implicitly via

μ_t^M = E_u[μ_t^C] = E_u[E[y_t | u_t]] = E_u[h^{-1}(Δ_t(x_t) + z_t'u_t)].

For instance, in a marginalized GLMM with random effects distribution F(u_t),

Δ_t(x_t) is the solution to the integral equation

μ_t^M = h^{-1}(x_t'β) = ∫ h^{-1}(Δ_t(x_t) + z_t'u_t) dF(u_t),

so that Δ_t(x_t) is a function of the marginal regression coefficients β and the

(variance) parameters in F(u_t). Maximum likelihood estimation is based on the

integrated likelihood from the GLMM model.
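The implicit definition of Δ_t can be solved numerically. The sketch below (our own, under the assumed special case of a logit link and a scalar N(0, σ²) random intercept) approximates the integral by Gauss-Hermite quadrature and solves for Δ_t by bisection:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def solve_delta(marg_lp, sigma, n_nodes=40):
    """Find Delta solving expit(marg_lp) = integral of expit(Delta + u)
    against N(u; 0, sigma^2); marg_lp is the marginal linear predictor."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    u = np.sqrt(2.0) * sigma * nodes     # change of variables for N(0, sigma^2)
    w = weights / np.sqrt(np.pi)
    target = expit(marg_lp)
    lo, hi = marg_lp - 10.0 * sigma - 5.0, marg_lp + 10.0 * sigma + 5.0
    for _ in range(100):                 # bisection: the integral increases in Delta
        mid = 0.5 * (lo + hi)
        if np.sum(w * expit(mid + u)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, with marginal linear predictor 1.0 and σ = 2 the solution exceeds 1.0: the conditional (subject-specific) effect is larger in magnitude than the marginal one, the familiar attenuation relationship.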

1.5 Motivation and Outline of the Dissertation

In this dissertation we propose generalized linear mixed models (GLMMs)

with correlated random effects to model count or binomial response data collected

over time or in space. For sequential or spatial Gaussian measurements, maximum

likelihood estimation is well established and software (e.g. SAS's proc mixed) is

available to fit fairly complicated correlation structures. The challenge for discrete

data lies in the fact that the observed (marginal) likelihood is not analytically

tractable, and maximization of it is more involved. Furthermore, with correlated

random effects, the likelihood does not break down into lower-dimensional com-

ponents which are easier to integrate numerically. Therefore, most approaches in

the literature are based on a quasi-likelihood approach or take a Bayesian per-

spective. The advantage of Bayesian models is that powerful Markov chain

Monte Carlo methods make it easier to obtain a sample from the posterior distribution of

interest than to obtain maximum likelihood estimates. However, priors must be

specified very carefully to ensure posterior propriety.

In addition, repeated observations are prone to missing data or unequally

spaced observation times. We would like to develop methods and models that allow

for unequally spaced binary, binomial or Poisson observations, making them more

general than previously presented in the literature.

To our knowledge, maximum likelihood estimation of GLMMs with such high

dimensional random effects has not been demonstrated before, with the exception

of the paper by Chan and Ledolter (1995) who consider fitting of a time series

of counts. However, they do not consider unequally spaced data and employ a

different implementation of the MCEM algorithm. In Chapter 5 we argue that

their implementation of the algorithm might have been stopped prematurely,

leading to different conclusions than our analysis and analyses published elsewhere.

Most articles that discuss correlated random effects do so for only a small number

of correlated random effects. For example, Chan and Kuk (1997) show that the data set

on salamander mating behavior published and analyzed in McCullagh and Nelder

(1989) is more appropriately analyzed when random effects pertaining to the male

salamander population are correlated over the three different time points when they

were observed. In this thesis, we would like to consider much longer sequences of

repeated observations.

In Chapter 2 we introduce the GLMM as the model of our choice to analyze

correlated discrete data and outline an EM algorithm to estimate fixed and random

effects, where both the E-step and the M-step require numerical approximations,

leading to an EM algorithm based on Monte Carlo methods (MCEM). Correlated

random effects and their implications on the analysis of GLMMs are discussed in

Chapter 3, together with a motivating example. This chapter also gives details for

the implementation of the algorithm and reports results from simulation studies.

Chapter 4 looks at marginal model properties and interpretation for correlated

binary, binomial or Poisson observations and Chapter 5 applies our methods to

real data sets from the social sciences, public health, sports and other backgrounds.

A summary and discussion of the methods and models presented here is given in

Chapter 6.


Chapter 1 reviewed various approaches of extending GLMs to deal with

correlated data. In this Chapter we will take a closer look at generalized linear

mixed models (GLMMs) which were briefly mentioned in Section 1.4. When

the response variables are normal, these models are simply called linear mixed

models (LMMs) and have been extensively discussed in the literature (see, for

example, the books by Searle, Casella and McCulloch, 1992, and Verbeke and

Molenberghs, 2000). The form of the normal density for observations and random

effects allows for analytical evaluation of the integrals together with straightforward

maximization. Hence, LMMs can be readily fit with existing software (e.g., SAS's

proc mixed), using rich classes of pre-specified correlation structures for the random

effects to model the dependence in the data more precisely. The broader notion

of GLMMs also encompasses binary, binomial, Poisson or gamma responses.

A distinctive feature of GLMMs is their so called subject-specific parameter

interpretation, which differs from the interpretation of parameters in marginal

(Section 1.2) or transitional (Section 1.3) models. This feature is discussed in

Section 2.1, after a formal introduction of the GLMM. Throughout, special

attention is devoted to defining GLMMs for discrete time series observations.

GLMMs are harder to fit because they typically involve intractable integrals

in the likelihood function. Section 2.2 outlines various approaches to model fitting.

Section 2.3 focuses on a Monte Carlo version of the EM algorithm which is an

indirect method of finding maximum likelihood estimates in GLMMs. Monte Carlo

methods are necessary because our applications involve correlated random effects

which lead to a very high-dimensional integral in the likelihood function. Parallel

to the discussion of GLMMs, state space models are introduced and a fitting

algorithm is described. State space models are popular models for discrete time

series in econometric applications (Durbin and Koopman, 2001). The presentation

of specific examples of GLMMs for discrete time series observations is deferred until

Chapter 5.

2.1 Definition and Notation

The generalized linear mixed model is an extension of the well known gen-

eralized linear model (McCullagh and Nelder, 1989) that permits fixed as well as

random effects in the linear predictor (hence the word mixed). The setup process

for GLMMs is split into two stages, which we present here using notation common

for longitudinal studies:

Firstly, conditional on cluster specific random effects ui, the data are assumed

to follow a GLM with independent random components Yit, the t-th response

in cluster i, i = 1,..., n, t = 1,..., ni. A cluster here is a generic expression

and means any form of observations being grouped together, such as repeated

observation on the same subject (cluster = subject), observations on different

students in the same school (cluster = school) or observations recorded in a

common time interval (cluster = time interval). The conditional distribution of Yit

is a member of the exponential family of distributions (e.g., McCullagh and Nelder,

1989) with form

f(y_it | u_i) = exp{[y_it θ_it - b(θ_it)]/φ_it + c(y_it, φ_it)} (2.1)

where θ_it are natural parameters and b(·) and c(·) are certain functions determined

by the specific member of the exponential family. The parameters φ_it are typically

of the form φ_it = φ/w_it, where the w_it's are known weights and φ is a possibly unknown

dispersion parameter. For the discrete response GLMMs we are considering,

φ = 1. For a specific link function h(·), the model for the conditional mean for

observations y_it has the form

μ_it = b'(θ_it) = E[Y_it | u_i] = h^{-1}(x_it'β + z_it'u_i), (2.2)

where x_it and z_it are covariate or design vectors for fixed and random effects

associated with observation y_it and β is a vector of unknown regression coefficients.

At this first stage, z_it'u_i can be regarded as a known offset for each observation,

and observations are conditionally independent.

It should be noted that relationship (2.2) between the mean of the observation

and fixed and random effects is exactly as is specified in the systematic part of

GLMs, with the exception that in GLMMs a conditional mean is modeled. This

affects parameter interpretation. The regression coefficients β represent the effect

of explanatory variables on the conditional mean of observations, given the random

effects. For instance, observations in the same cluster i share a common value

of the random cluster effect u_i, and hence β describes the conditional effect of

explanatory variables, given the value for ui. If the cluster consists of repeated

observations on the same subject, these effects are called subject-specific effects.

In contrast, regression coefficients in GLMs and marginal models describe the

effect of explanatory variables on the population average, which is an average over

observations in different clusters.

At the second stage, the random effects ui are specified to follow a multi-

variate normal distribution with mean zero and variance-covariance matrix Σ_i. A

standard assumption is that random effects {ui} are independent and identically

distributed, but an example at the beginning of Chapter 3 will show that this

is sometimes not appropriate. With time series observations, where the clusters

refer to time segments, it is reasonable to assume that observations are not only

correlated within the cluster (modeled by sharing the same cluster specific random

effect), but also across clusters, which we will model by assuming correlated cluster

specific random effects.

2.1.1 Generalized Linear Mixed Models for Univariate Discrete Time

Most of the data we are going to analyze is in the form of a single univariate

time series. To emphasize this data structure, the general two-dimensional notation

(indices i and t) of a GLMM to model observations which come in clusters can be

simplified in two ways:

We can assume that a single cluster (i.e., n = 1 and n_1 = T) contains the

entire time series y_1, ..., y_T. The random effects vector u = (u_1, ..., u_T) associ-

ated with the single cluster has a random effects component for each individual

time series member. The distribution of u is multivariate normal with variance-

covariance matrix Σ, which is different from the identity matrix. The correlation

of the components of u induces a correlation among the time series members.

However, conditional on u, observations within the single cluster are independent.

The cluster index i is redundant in the notation and hence can be dropped. This

representation is particularly useful when used with existing software to fit GLMMs,

where it is often necessary to include a column indicating the cluster membership

information for each observation. Here, since we have only one cluster, it suffices to

include a column of all ones, say.

Alternatively, we can adopt the point of view that each member of the time

series is a mini cluster by itself, containing only one observation (i.e., ni = 1 for

all i = 1,..., T) in the case of a single time series. When multiple, parallel time

series are observed, the cluster contains all c observations at time point t from the

c parallel time series (i.e., ni = c for all i = 1,..., T). In any case, the clusters

are then synonymous with the discrete time points at which observations were

recorded. This makes index t, which counts the repeated observations in a cluster

redundant (t = 1 or t = c for all clusters i), but instead of denoting the time series

by {yi}?=i, we decided to use the more common notation {yt}T=1, where t now is
the index for clusters or, equivalently, time points. In the following definition of
GLMMs for univariate time series, the notation of clusters or time points can be
used interchangeably. Conditional on unobserved random effects u_1, ..., u_T for
the different time points, observations y_1, ..., y_T are assumed independent with

f(y_t | u_t) = exp{[y_t θ_t - b(θ_t)]/φ_t + c(y_t, φ_t)} (2.3)

in the exponential family. As before, for a specific link function h(.), the model for

the conditional mean has form

E[y_t | u_t] = μ_t = b'(θ_t) = h^{-1}(x_t'β + z_t'u_t), (2.4)

where x_t and z_t are covariate or design vectors for fixed and random effects

associated with the t-th observation and β is a vector of unknown regression

coefficients. The random effects u_1, ..., u_T are typically not independent. When

collected in the vector u = (u_1, ..., u_T), a multivariate normal distribution with

mean 0 and covariance matrix Σ can be directly specified. In particular, in Chapter

3 we will assume special patterned covariance matrices to allow for rich, but still

parsimonious, classes of correlation structures among the time series observations.
The advantage of the second setup of mini clusters is that it also allows for

other, indirect specifications of the random effects distribution, for instance through

a latent random process. For this, we relate cluster-specific random effects from

successive time points. For example, with univariate random effects, a first-order

latent autoregressive process assumes that the random effects follow

u_{t+1} = ρu_t + ε_t, (2.5)


where ε_t has a zero-mean normal distribution and ρ is a correlation parameter.

Cox (1981) called these types of models parameter-driven models, as opposed to

transitional (or observation-driven) models discussed in Section 1.3. In parameter-

driven models, an underlying and unobserved parameter process influences the

distribution of a series of observations. The model for the polio data in Zeger

(1988) is an example of a parameter-driven model. However, Zeger (1988) does not

assume normality nor zero mean for the latent autoregressive process. Furthermore,

the natural logarithm of the random effects, and not the random effects themselves

appear additively in the linear predictor. Therefore, this model is slightly different

from the specifications of the time series GLMM from above.
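As a concrete instance of a parameter-driven model, the following sketch (our own illustration; function names are ours) simulates a Poisson time series GLMM with a stationary latent AR(1) process, i.e., normal, zero-mean random effects entering the linear predictor directly:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_poisson_glmm(x, beta, sigma, rho):
    """y_t | u_t ~ Poisson(exp(x_t' beta + u_t)) with latent AR(1)
    u_{t+1} = rho u_t + e_t, scaled so that var(u_t) = sigma^2 for all t."""
    T = x.shape[0]
    u = np.empty(T)
    u[0] = rng.normal(0.0, sigma)
    for t in range(1, T):
        u[t] = rho * u[t - 1] + rng.normal(0.0, sigma * np.sqrt(1.0 - rho**2))
    mu = np.exp(x @ beta + u)
    return rng.poisson(mu), u
```

Conditional on the simulated path u, the counts are independent Poisson draws; marginally, the shared latent process induces both overdispersion and serial correlation.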

Another application of the mini cluster setup is to spatial settings, where clus-

ters represent spatially aggregated data instead of time points. Then ul,..., UT is

a collection of random effects associated with spatial clusters. Again, independent

random effects to describe the spatial dependencies are inappropriate. In general,

time dependent data are easier to handle since observations are linearly ordered

in time, and more complicated random effects distributions are needed for spatial

applications (e.g., Besag et al. 1995).

We will use the mini-cluster representation to facilitate a comparison to state

space models. This is the focus of the next section.

2.1.2 State Space Models for Discrete Time Series Observations

State space models are a rich alternative to the traditional Box-Jenkins

ARIMA system for time series analysis. Similar to GLMMs, state space models

for Gaussian and non-Gaussian time series split the modeling process into two

stages: At the first stage, the responses Yt are related to unobserved "states" by an

observation equation. (State space models originated in the engineering literature,

where parameters are often called states.) At the second stage, a latent or hidden

Markov model is assumed for the states. For univariate Gaussian responses

y_1, ..., y_T, the two equations of a state space model take the form

y_t = w_t'α_t + ε_t,    ε_t ~ N(0, σ²), (2.6)

α_t = T_t α_{t-1} + R_t ζ_t,    ζ_t ~ N(0, Q_t),    t = 1, ..., T,

where w_t is an m × 1 observation or design vector and ε_t is a white noise process.

The unobserved m × 1 state or parameter vector α_t is defined by the second, the

transition equation, where T_t is a transition matrix and ζ_t is another white noise

process, independent of the first one. Compared to the standard GLMMs of

Section 2.1, the main difference is that random effects are correlated instead of

i.i.d. In state space models, no clear distinction between fixed and random effects

is made, and the state vector at can contain both. However, the form of the

transition matrix Tt together with the form of the matrix Rt, which consists of

columns of the identity matrix I_m, allows one to declare certain elements of α_t as

being fixed effects and others to be random. The matrix Rt is called the selection

matrix since it selects the rows of the state equation which have nonzero variance

terms. With this formulation, the variance-covariance matrix Qt is assumed to

be non-singular. Furthermore, the transition matrix Tt allows specification of

which effects vary through time and which stay constant. (For a slightly different

formulation without a selection matrix Rt but with possibly singular variance-

covariance matrix Qt, see Fahrmeir and Tutz, 2001, Chap. 8).

State space models for non-Gaussian time series were considered by West

et al. (1985) under the name dynamic generalized linear model. They used a

Bayesian framework with conjugate priors to specify and fit their models. Durbin

and Koopman (1997, 2000) and Fahrmeir and Tutz (2001) describe a state space

structure for non-Gaussian observations similar to the two equations above.

The normal distribution assumption for the observations in (2.6) is replaced by

assuming a distribution in the exponential family with natural parameters θ_t. With

a canonical link, θ_t = w_t'α_t. This is called the signal by Durbin and Koopman

(1997, 2000). In particular, given the states α_1, ..., α_T, observations y_1, ..., y_T

are conditionally independent and have density p(y_t | θ_t) in the exponential family

(2.3). As in Gaussian state space models, the state vector α_t is determined by the

vector autoregressive relationship

α_t = T_t α_{t-1} + R_t ζ_t, (2.7)

where the serially independent ζ_t typically have normal distributions with mean 0

and variance-covariance matrix Qt.

2.1.3 Structural Similarities Between State Space Models and GLMMs

There is a strong connection between state space models and GLMMs with

a canonical link. To see this, we write the GLMM in state space form: Let w_t =

(x_t', z_t')' and α_t = (β_t', u_t')', where x_t, z_t, β and u_t are from the GLMM notation
as defined in (2.4). Hence, the linear predictor x_t'β + z_t'u_t of the GLMM is equal

to the state space signal θ_t = w_t'α_t. Next, partition the disturbance term ζ_t of

the state equation into ζ_t = (δ_t', ε_t')' and consider transition and selection

matrices of the special block form

        ( I    0     )               ( 0    0     )
T_t  =  ( 0    T_t^u ) ,     R_t  =  ( 0    R_t^u ) ,

where T_t^u and R_t^u act on the random effects part of the state. Using transition

equation (2.7) results in the following autoregressive relationship between the

random effects of a GLMM: u_t = T_t^u u_{t-1} + R_t^u ε_t, where ε_t

is a white noise component. In a univariate context, we have already motivated

this type of relationship between random effects in equation (2.5). The transition

equation also implies a constant effect β for the GLMM, since β_1 = β_2 = ... =

β_T := β. Hence, both models use correlated random effects, but GLMMs also
typically involve fixed parameters which are not modeled as evolving over time.
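The block construction admits a tiny numerical check (our own sketch; T_t^u and R_t^u denote the random-effects blocks): iterating the transition equation leaves the β components of the state unchanged while the u component follows the autoregression.

```python
import numpy as np

def glmm_as_state_space(p, Tt_u, Rt_u):
    """Block transition and selection matrices embedding a GLMM with p
    constant fixed effects and random effects u_t = Tt_u u_{t-1} + e_t."""
    q, r = Tt_u.shape[0], Rt_u.shape[1]
    T = np.block([[np.eye(p), np.zeros((p, q))],
                  [np.zeros((q, p)), Tt_u]])
    R = np.block([[np.zeros((p, p)), np.zeros((p, r))],
                  [np.zeros((q, p)), Rt_u]])
    return T, R

# one univariate AR(1) random effect with rho = 0.8, two fixed effects
T_mat, R_mat = glmm_as_state_space(2, np.array([[0.8]]), np.eye(1))
alpha = np.array([1.0, -0.5, 0.3])      # state (beta_1, beta_2, u_t)
zeta = np.array([9.9, 9.9, 0.1])        # only the last disturbance is selected
alpha_next = T_mat @ alpha + R_mat @ zeta
```

Here alpha_next keeps (1.0, -0.5) in the fixed-effect positions, while the random effect advances to 0.8 × 0.3 + 0.1.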

The restriction of the transition equation to the autoregressive form (often

a simple random walk) is only one way of specifying a distribution for the

random effects in the GLMMs of Section 2.1.1. Other structures, such as equally

correlated random effects are possible within the GLMM framework and are

considered in Chapter 3.

2.1.4 Practical Differences

Although GLMMs and state space models are similar in structure, they are

used differently in practice. This is in part due to the fact that in GLMMs the

focus is on the fixed subject-specific regression parameters β, which refer to time-

constant and time varying covariates, while in state space models the main purpose

is to infer properties about the time varying random states at. These are often

assumed to follow a first or second order random walk. To illustrate, consider a

data set about a monthly time series of counts presented in Durbin and Koopman

(2000) for the investigation of the effectiveness of new seat belt legislation on

automobile accidents. They specify the log-mean for a Poisson state space model as

log(μ_t) = ν_t + λx_t + γ_t,

where vt is a trend component following the random walk

ν_{t+1} = ν_t + ξ_t

with ξ_t a white noise process. Further, λ is an intervention parameter correspond-
ing to the change in seat-belt legislation (x_t = 0 before the change and equal to
1 afterwards) and {γ_t} are fixed seasonal components with Σ_t γ_t = 0, equal
in every year. The main focus is on the parameter λ describing the drop in the log
means, also called the level, after the seat belt legislation went into effect.

For the series at hand, a GLMM approach would consider a fixed linear time

effect β, and model the log-mean as

log(μ_t) = α + βt + λ x_t + γ_t + u_t,

where α is the intercept of the linear time trend with slope β and where correlated
random effects {u_t} account for the correlation in the monthly log means. As
above, λ describes the effect on the log means after the seat belt legislation went
into effect and the γ_t are fixed seasonal components, equal for every year. The trend
component ν_t of the state space model corresponds to α + u_t in the GLMM, which
additionally allows for a linear time trend βt. This approach seems to be favored

by some discussants of the Durbin and Koopman (2000) paper; in particular,
Chatfield and Aitkin both note the absence of a linear trend term in the proposed
state space model.
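To make the GLMM specification concrete, here is a minimal simulation sketch of such a Poisson model with autocorrelated random effects. All parameter values (α, β, λ, ρ, σ and the seasonal pattern) are hypothetical and chosen for illustration only; they are not estimates from the seat belt data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parameter values, chosen for illustration only.
T = 120                                   # ten years of monthly counts
alpha, beta, lam = 3.0, -0.002, -0.25     # intercept, slope, intervention
gamma = 0.3 * np.sin(2 * np.pi * np.arange(12) / 12)  # fixed seasonal effects
x = (np.arange(T) >= 60).astype(float)    # x_t: 0 before, 1 after the change

# Correlated random effects: u_{t+1} = rho * u_t + eps_t
rho, sigma_eps = 0.8, 0.1
u = np.zeros(T)
for t in range(1, T):
    u[t] = rho * u[t - 1] + rng.normal(0.0, sigma_eps)

log_mu = alpha + beta * np.arange(T) + lam * x + gamma[np.arange(T) % 12] + u
y = rng.poisson(np.exp(log_mu))           # simulated monthly counts
```

The seasonal effects are constrained to sum to zero over a year by construction of the sine pattern; any zero-sum pattern would serve equally well for illustration.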

An even better GLMM approach, with linear time trends explicitly modeled,
could use a change-point formulation in the linear predictor, with the month the
legislation went into effect (or was enforced) as the change point, and again with
correlated random effects {u_t} to capture the dependency among successive means.
Such a specification would be harder to accommodate in a state space model.

In the reply to the discussion of their paper, Durbin and Koopman (2000)

wrote that the two approaches (state space models versus Hierarchical Generalized

Linear Models, a model class very similar to GLMMs) are very different and that

they regard their treatment to be more transparent and general for problems that

specifically relate to time series. With the presentation in this thesis of GLMMs
with correlated random effects for time series analysis, their argument may weaken.
For instance, with the proposal of autocorrelated random effects u_{t+1} = ρ u_t + ε_t
in a GLMM context we have an elegant means of introducing autocorrelation into a
basic regression model that is well understood and whose parameters are easily
interpreted. Furthermore, GLMMs can easily accommodate the case of multiple

time series observations on each of several individuals or cross-sectional units, as is

often observed in a longitudinal study.

One common feature of both models is the intractability of the likelihood

function and the use of numerical and simulation techniques to obtain maximum

likelihood estimates. In general, state space models for non-Gaussian time series

are fit using a simulated maximum likelihood approach, which is also a popular

method for fitting GLMMs. However, long time series necessarily result in models
with complex, high-dimensional random effects, and alternative, indirect methods
may work better.

likelihood, the method of choice for estimation in state space models, may not

work as well as indirect methods based on the EM algorithm, the method we will

use to fit GLMMs for time series observations. The next section reviews various

approaches of fitting GLMMs and contrasts them with the approach taken for state

space models.

2.2 Maximum Likelihood Estimation

Maximum likelihood estimation in GLMMs is a challenging task, because it

requires the calculation of integrals (often high dimensional) that have no known

analytic solution. Following the general notation of a GLMM in Section 2.1, let
y_i = (y_{i1}, ..., y_{i n_i})' be the vector of all observations in cluster i, whose associated
random effects vector is u_i. Conditional independence of the y_{it}'s implies that the
density function of y_i is given by

f(y_i | u_i; β) = Π_{t=1}^{n_i} f(y_{it} | u_i; β),      (2.8)

where f(y_{it} | u_i; β) are the exponential family densities in (2.1). The parameter β is the
vector of all unknown regression coefficients introduced by specifying model (2.2)
for the mean of the observations. Furthermore, observations from different clusters
are assumed conditionally independent, leading to the conditional joint density

f(y_1, ..., y_n | u_1, ..., u_n; β) = Π_{i=1}^n f(y_i | u_i; β)
for all observations given all random effects. Let g(u_1, ..., u_n; ψ) denote the multivariate
normal density function of the random effects, whose variance-covariance
matrix Σ is determined by the variance component vector ψ. The goal is to estimate
the unknown parameter vectors β and ψ by maximum likelihood. The
likelihood function L(β, ψ; y_1, ..., y_n) for a GLMM is given by the marginal density
function of the observations y_1, ..., y_n, viewed as a function of the parameters,
and is equal to

L(β, ψ; y_1, ..., y_n) = ∫ f(y_1, ..., y_n | u_1, ..., u_n; β) g(u_1, ..., u_n; ψ) du_1 ... du_n
                       = ∫ Π_{i=1}^n f(y_i | u_i; β) g(u_1, ..., u_n; ψ) du_1 ... du_n
                       = ∫ Π_{i=1}^n Π_{t=1}^{n_i} f(y_{it} | u_i; β) g(u_1, ..., u_n; ψ) du_1 ... du_n      (2.9)
It is called the "observed" likelihood because the unobserved random effects have

been integrated out and (2.9) is a function of the observed data only. Except
for the linear mixed model, where f(y_1, ..., y_n | u_1, ..., u_n) is a normal density,
the integral has no closed-form solution and numerical procedures (analytic or
stochastic) are necessary to calculate and maximize it. Standard maximization
techniques such as Newton-Raphson or EM for fitting GLMs and linear mixed models have
to be modified because the conditional distribution of the observations and the
distribution of the random effects are not conjugate and the integral is analytically
intractable.
2.2.1 Direct and Indirect Maximum Likelihood Procedures

In general, there are two ways to obtain maximum likelihood estimates
from the marginal likelihood in (2.9). The first is a direct approach, which
attempts to approximate the integral by either analytic or stochastic methods
and then maximizes this approximation with respect to the parameters β and ψ.

Some common analytic approximation methods are Gauss-Hermite quadrature

(Abramowitz and Stegun, 1964), a first-order Taylor series expansion of the

integrand or a Laplace approximation (Tierney and Kadane, 1986), which is

based on a second-order Taylor series expansion. The two latter methods result

in likelihood equations similar to a linear mixed model (Breslow and Clayton,

1993; Wolfinger and O'Connell, 1993), and by iteratively fitting such a model and

re-expanding the integrand around updated parameter estimates one can obtain

approximate maximum likelihood estimates. However, these methods have been

shown to yield estimates which can be biased and inconsistent, an issue which is

discussed in Lin and Breslow (1996) and Breslow and Lin (1995).

Techniques using stochastic integral approximations are known under the name

simulated maximum likelihood and have been proposed by Geyer and Thompson

(1992) and Gelfand and Carlin (1993). These methods approximate the integral in

(2.9) by importance sampling (Robert and Casella, 1999) and are better suited for

larger dimensional integrals than analytic approximations. Usually, the importance

density depends on the parameters to be estimated, and so simulated maximum

likelihood is used iteratively by first approximating the integral by a Monte Carlo

sum with some initial values for the unknown parameters. Then, the likelihood

is maximized and the resulting parameter estimates are used to generate a new sample
from the importance density in the next iteration. We will briefly discuss the idea

behind importance sampling in Section 2.3.2. The simulated maximum likelihood

approach is also further illustrated in the next section, where we discuss it in the

context of state space models.

An alternative to the direct approximating methods is the EM-algorithm
(Dempster et al., 1977). The integral in (2.9) is not directly maximized in this

method, but is maximized indirectly by considering a related function Q(. I .). At
each step of this algorithm, maximization of the Q-function increases the marginal
likelihood, a fact that can be verified using Jensen's inequality. The EM-algorithm
relies on recognizing or inventing missing data, which, together with the observed
data, simplifies maximum likelihood calculations. For GLMMs, the random effects
u_1, ..., u_n are treated as the missing data. In particular, let β^(k-1) and ψ^(k-1)
denote the current values (at the end of iteration k-1) of the parameter vectors β and
ψ. Also, let y' = (y'_1, ..., y'_n) and u' = (u'_1, ..., u'_n) denote the vectors of all
observations and their associated random effects. Then the Q(. | .) function at the
start of iteration k has the form

Q(β, ψ | β^(k-1), ψ^(k-1)) = E[ log j(y, u; β, ψ) | y, β^(k-1), ψ^(k-1) ]
                           = E[ log f(y | u; β) | y, β^(k-1), ψ^(k-1) ]      (2.10)
                             + E[ log g(u; ψ) | y, β^(k-1), ψ^(k-1) ],

where

j(y, u; β, ψ) = f(y | u; β) g(u; ψ)

denotes the joint density of observed and missing data, also known as the complete
data. The expectation in (2.10) is with respect to the conditional distribution
of u | y, evaluated at the current parameter estimates β^(k-1) and ψ^(k-1). The
calculation of the expected value is called the E-step, and it is followed by an
M-step which maximizes Q(β, ψ | β^(k-1), ψ^(k-1)) with respect to β and ψ. The
resulting estimates β^(k) and ψ^(k) are used as updates in the next iteration to re-
calculate the E-step and the M-step. Since (β^(k), ψ^(k)) is the maximizer at iteration k,

Q(β^(k), ψ^(k) | β^(k-1), ψ^(k-1)) ≥ Q(β, ψ | β^(k-1), ψ^(k-1))  for all (β, ψ),      (2.11)


and it follows that the likelihood increases (or at worst stays the same) from one
iteration to the next:

log L(β^(k), ψ^(k); y) = Q(β^(k), ψ^(k) | β^(k-1), ψ^(k-1))
                          − E[ log h(u | y; β^(k), ψ^(k)) | y, β^(k-1), ψ^(k-1) ]
                       ≥ Q(β^(k-1), ψ^(k-1) | β^(k-1), ψ^(k-1))
                          − E[ log h(u | y; β^(k-1), ψ^(k-1)) | y, β^(k-1), ψ^(k-1) ]
                       = log L(β^(k-1), ψ^(k-1); y).

Here we used (2.11) and the fact that

E[ log h(u | y; β^(k), ψ^(k)) | y, β^(k-1), ψ^(k-1) ] − E[ log h(u | y; β^(k-1), ψ^(k-1)) | y, β^(k-1), ψ^(k-1) ]
  = E[ log ( h(u | y; β^(k), ψ^(k)) / h(u | y; β^(k-1), ψ^(k-1)) ) | y, β^(k-1), ψ^(k-1) ]
  ≤ log E[ h(u | y; β^(k), ψ^(k)) / h(u | y; β^(k-1), ψ^(k-1)) | y, β^(k-1), ψ^(k-1) ] = 0,

where the inequality in the last step follows from Jensen's inequality. Under regularity
conditions (Wu, 1983) and given initial starting values (β^(0), ψ^(0)), the sequence
of estimates {(β^(k), ψ^(k))} converges to the maximum likelihood estimators (β̂, ψ̂).
The EM-algorithm is most useful if replacing the calculation of the integral in
the marginal likelihood (2.9) by the calculation of the integral in the Q-function
(2.10) simplifies computation and maximization. Unfortunately, for GLMMs the
integrals in (2.10) are also intractable since the conditional density of u I y involves
the integral in (2.9). However, the EM algorithm may still be used by approxi-
mating the expectation in the E-step with appropriate Monte Carlo methods. The
resulting algorithm is called the Monte Carlo EM-algorithm (MCEM) and was
proposed by Wei and Tanner (1990). We review it in detail in Section 2.3.

Some arguments favoring the use of the MCEM-algorithm over direct methods

such as simulated maximum likelihood for fitting GLMMs, especially when some

variance components in I are large, are given in Jank and Booth (2003) and Booth

et al. (2001). Currently, the only available software for fitting GLMMs uses direct

methods such as Gauss-Hermite quadrature or simulated maximum likelihood

(e.g., SAS's proc nlmixed). State space models of Section 2.1.2 are also fitted via
simulated maximum likelihood. This is discussed in Section 2.2.3.

2.2.2 Model Fitting in a Bayesian Framework

In a Bayesian context, GLMMs are two-stage hierarchical models with ap-
propriate priors on 3 and 0. Instead of obtaining maximum likelihood estimates

of unknown parameters, a Bayesian analysis looks at their entire posterior dis-

tributions, given the observed data. Markov chain Monte Carlo techniques avoid

the tedious integration in the posterior densities and allow for relatively easy

simulations from these distributions, compared with the problems encountered in

maximum likelihood estimation. This suggests approximating maximum likelihood

estimates via a Bayesian route, assuming improper or at least very diffuse priors

and exploiting the proportionality of the likelihood function and the posterior

distribution of the parameters. However, for many discrete data models, improper

priors may lead to improper posteriors. Natarajan and McCulloch (1995) demon-

strate this with GLMMs for correlated binary data assuming independent N(0, a2)

random effects and a flat or a non-informative prior for a2. Sun, Tsutakawa and

Speckman (1999) and Sun, Speckman and Tsutakawa (2000) show that with
noninformative (flat) priors on fixed effects and variance components of more
complicated random effects distributions, propriety of the posterior distribution cannot
be guaranteed for a Poisson GLMM when one of the observed counts is zero, and
is impossible in a logit-link GLMM for binomial(n, π) observations if just one of
the observations is equal to 0 or n. Of course, the use of proper priors will always


lead to proper posteriors. However, for the often-employed diffuse but proper priors,
Natarajan and McCulloch (1998) show that even with enormous simulation sizes,
posterior estimates (such as the posterior mode) can be far from maximum
likelihood estimates, which makes their use undesirable in a frequentist setting.
2.2.3 Maximum Likelihood Estimation for State Space Models

The same problems as in the GLMM case arise for maximum likelihood fitting
of non-Gaussian state space models. Here, we review a simulated maximum
likelihood approach suggested by Durbin and Koopman (1997), using notation introduced
in Section 2.1.2. Let p(y | α; ψ) = Π_t p(y_t | α_t; ψ) denote the distribution
of all observations given the states and let p(α; ψ) denote the distribution of the
states, where y and α are the stacked vectors of all observations and all states,
respectively. The vector ψ holds parameters that may appear in w_t, T_t and Q_t.
Let p(y, α; ψ) denote the joint density of observations and states. For practical
purposes, it is easier to work with the signal θ_t instead of the high-dimensional
state vector α_t. Hence, let p(y | θ; ψ), p(θ; ψ) and p(y, θ; ψ) denote the corresponding
conditional, marginal and joint distributions parameterized in terms of
the signal θ_t = w'_t α_t, t = 1, ..., T, where θ = (θ_1, ..., θ_T)'. The observed likelihood
is then given by the integral

L(ψ; y) = ∫ p(y | θ; ψ) p(θ; ψ) dθ.      (2.12)

To maximize (2.12) with respect to ψ, Durbin and Koopman (1997, 2000) first
calculate the likelihood L_g(ψ; y) of an approximating Gaussian model and then
obtain the true likelihood L(ψ; y) by an adjustment to it. However, two different
approaches to constructing the approximating Gaussian model are presented
in the two papers. In Durbin and Koopman (1997) the approximating model is
obtained by assuming that observations follow the linear Gaussian model

y_t = w'_t α_t + ε_t = θ_t + ε_t,

with ε_t ~ N(μ_t, σ²_t). All densities generated under this model are denoted by g(.).
The two parameters μ_t and σ²_t are chosen such that the true density p(y | θ; ψ) and
its normal approximation g(y | θ; ψ) are as close as possible in the neighborhood
of the posterior mean E_g[θ | y]. The state equations of the true non-Gaussian
model and the approximating Gaussian model are assumed to be the same,
which implies that the marginal density of θ is the same under both models, i.e.,
p(θ; ψ) = g(θ; ψ). The likelihood of the approximating model is then given by

L_g(ψ; y) = g(y; ψ) = g(y | θ; ψ) p(θ; ψ) / g(θ | y; ψ).      (2.13)
This likelihood is calculated using a recursive procedure known as the Kalman
filter (see, for instance, Fahrmeir and Tutz, 2001, Chap. 8). Alternatively, the
approximating Gaussian model is a regular linear mixed model and maximum
likelihood calculations can be carried out using more familiar algorithms in the
linear mixed model literature (see, for instance, Verbeke and Molenberghs, 2000).
From (2.13),

p(θ; ψ) = L_g(ψ; y) g(θ | y; ψ) / g(y | θ; ψ),

and upon plugging this into (2.12),

L(ψ; y) = L_g(ψ; y) E_g[ p(y | θ; ψ) / g(y | θ; ψ) ],      (2.14)
where E_g denotes expectation with respect to the Gaussian density g(θ | y; ψ)
generated by the approximating model. Hence, the observed likelihood of the non-
Gaussian model can be estimated by the likelihood of an approximating Gaussian
model times an adjustment factor; in particular,

L̂(ψ; y) = L_g(ψ; y) ŵ(ψ),

where

ŵ(ψ) = (1/m) Σ_{i=1}^m p(y | θ^(i); ψ) / g(y | θ^(i); ψ)

is a Monte Carlo sum approximating the expected value E_g with m random
samples θ^(i) from g(θ | y; ψ). Normality of g(θ | y; ψ) allows for straightforward
simulation from this density.
A different approach for choosing the approximating Gaussian model is
presented in Durbin and Koopman (2000). There, the model is determined by
choosing the parameters of an approximating Gaussian state space model (2.6) such that
the posterior densities g(θ | y; ψ) implied by the Gaussian model and p(θ | y; ψ)
implied by the true model have the same posterior mode θ̂.
Formally, by dividing and multiplying (2.12) by the importance density
g(θ | y; ψ), we can interpret approximation (2.14) as an importance sampling
estimate of the observed likelihood and the entire procedure as a simulated
maximum likelihood approach:

L(ψ; y) = ∫ [ p(y | θ; ψ) p(θ; ψ) / g(θ | y; ψ) ] g(θ | y; ψ) dθ
        = g(y; ψ) ∫ [ p(y | θ; ψ) / g(y | θ; ψ) ] g(θ | y; ψ) dθ
        = L_g(ψ; y) E_g[ p(y | θ; ψ) / g(y | θ; ψ) ],

using g(θ | y; ψ) = g(y | θ; ψ) p(θ; ψ) / g(y; ψ).

Durbin and Koopman (1997, 2000) present a clever way of artificially enlarging the
simulated sample of θ^(i)'s from the importance density g(θ | y; ψ) through the use of
antithetic variables (Robert and Casella, 1999). These quadruple the sample size
without additional simulation efforts and balance the sample for location and scale.
Overall, this leads to a reduction in the total sample size necessary to achieve a
certain precision in the estimates.

In practice, it is desirable in the maximization process to work with log L(ψ; y).
Durbin and Koopman (1997, 2000) present a correction for the bias introduced
by estimating log(E[p(y | θ; ψ)/g(y | θ; ψ)]). Finally, the resulting estimator
of log L(ψ; y) can be maximized with respect to ψ by a suitable numerical
procedure, such as Newton-Raphson.

We mentioned before that simulated maximum likelihood can be computa-

tionally inefficient and suboptimal, especially when some variance components

are large (Jank and Booth, 2003). As we will see in various examples in Chapter

5, large variance components (e.g., a large random effects variance) are the norm

rather than the exception with the type of time series models we consider. Next,

we will look at an alternative, indirect method for fitting our models. In principle,

though, the methods just described are also applicable to GLMMs, through the

close connections of GLMMs and state space models described above.

2.3 The Monte Carlo EM Algorithm

In Section 2.2 we presented the EM-algorithm as an iterative procedure

consisting of two components, the E- and the M-step. The E-step calculates a

conditional expectation while the M-step subsequently maximizes this expectation.

Often, at least one of these steps is analytically intractable and in most of the

applications considered here, both steps are. Numerical methods (analytic and

stochastic) have to be used to overcome these difficulties, whereby the E-step

usually is the more troublesome. One popular way of approximating the expected

value in the E-step uses Monte Carlo methods and is discussed in Wei and Tanner

(1990), McCulloch (1994, 1997) and Booth and Hobert (1999). The Monte Carlo

EM (MCEM) algorithm uses a sample from the distribution of the random effects

u given the observed data y to approximate the Q-function in (2.10). In particular,

at iteration k, let u(1),..., u(m) be a sample from this distribution, denoted by

h(u | y; β^(k-1), ψ^(k-1)) and evaluated at the parameter estimates β^(k-1) and ψ^(k-1)
from the previous iteration. The approximation to (2.10) is then given by

Q_m(β, ψ | β^(k-1), ψ^(k-1)) = (1/m) Σ_{j=1}^m log f(y | u^(j); β) + (1/m) Σ_{j=1}^m log g(u^(j); ψ).      (2.15)
As m → ∞, with probability one, Q_m → Q. The M-step then maximizes Q_m
instead of Q with respect to β and ψ, and the resulting estimates β^(k) and ψ^(k)
are used in the next iteration to generate a new sample from h(u | y; β^(k), ψ^(k)). If
maximization is not possible in closed form, sometimes only a pair of values (β, ψ)
which satisfies Q_m(β, ψ | β^(k-1), ψ^(k-1)) > Q_m(β^(k-1), ψ^(k-1) | β^(k-1), ψ^(k-1)), but
which does not attain the global maximum, is chosen as the new parameter update
(β^(k), ψ^(k)). However, we show for our models that the global maximum can be
approximated in very few steps.
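The overall MCEM iteration just described can be sketched as follows; sample_h, maximize_Q1 and maximize_Q2 are placeholders for the E-step sampler and the two M-step maximizations discussed in the sections that follow, not part of the text's notation.

```python
def mcem_step(beta, psi, y, m, sample_h, maximize_Q1, maximize_Q2):
    """One MCEM iteration: approximate the E-step with m draws from
    h(u | y; beta, psi), then maximize the two terms of Q_m separately."""
    u_draws = [sample_h(y, beta, psi) for _ in range(m)]  # E-step sample
    beta_new = maximize_Q1(y, u_draws, beta)  # fit augmented GLM (Sect. 2.3.1)
    psi_new = maximize_Q2(u_draws)            # update variance components
    return beta_new, psi_new
```

Iterating this step with the updated (β, ψ) pair, and a fresh sample at each iteration, gives the algorithm in full.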

Maximization of Q_m with respect to β and ψ is equivalent to maximizing the
first term in (2.15) with respect to β only and the second term with respect to ψ
only. This is due to the two-stage hierarchy of the response distribution and the
random effects distribution in GLMMs and is discussed next. Different approaches
to obtaining a sample from h(u | y; β, ψ) for the approximation of the E-step are
presented in Section 2.3.2 and convergence criteria are discussed in Section 2.3.3.

2.3.1 Maximization of Qm

For now we assume we have available a sample u^(1), ..., u^(m) from h(u | y; β, ψ)
or an importance sampling distribution, generated by one of the mechanisms
described in Sections 2.3.2 to 2.3.5. Let Q_1 and Q_2 be the first and second
terms of the sum in (2.15). Using the exponential family expression for the densities
f(y_{it} | u_i), at iteration k,

Q_1(β | β^(k-1), ψ^(k-1)) = (1/m) Σ_{j=1}^m Σ_{i=1}^n Σ_{t=1}^{n_i} [ y_{it} θ_{it}^{(j)} − b(θ_{it}^{(j)}) ] + const,      (2.16)

where, according to the GLMM specifications,

b'(θ_{it}^{(j)}) = μ_{it}^{(j)} = h^{-1}( x'_{it} β + z'_{it} u_i^{(j)} ),

with u_i^{(j)} the i-th component of the j-th sampled random effects vector u^{(j)}.
Maximizing Q_1 with respect to β is equivalent to fitting an augmented GLM with
known offsets: for j = 1, ..., m, let y_{it}^{(j)} = y_{it} and x_{it}^{(j)} = x_{it} be the random
components and known design vectors of this augmented GLM, and let z'_{it} u_i^{(j)} be
a known offset associated with each y_{it}^{(j)}. That is, we duplicate the original data
set m times and attach a known offset z'_{it} u_i^{(j)} to each replicated observation. The
model for the mean in the augmented GLM, E[y_{it}^{(j)}] = μ_{it}^{(j)} = h^{-1}(x'_{it} β + z'_{it} u_i^{(j)}),
is structurally equivalent to the model for the mean in the GLMM. The
log-likelihood equations for estimating β in the augmented GLM are then proportional
to those from Q_1. Hence, maximization of Q_1 with respect to β follows along the lines of
the well-known iterative Newton-Raphson or Fisher scoring algorithms for GLMs.
Denote by β^(k) the parameter vector after convergence of one of these algorithms.
It represents the value of the maximum likelihood estimator of β at iteration k of
the MCEM algorithm.
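The data-duplication device just described can be written down directly. This sketch (with hypothetical array shapes) builds the augmented response, design and offset vectors that an off-the-shelf GLM routine with an offset argument could then be handed.

```python
import numpy as np

def augment_for_Q1(X, Z, y, u_draws):
    """Stack the data m times and attach the offset z'_it u^(j) to each copy,
    so that maximizing Q_1 in beta is an ordinary GLM fit with an offset."""
    m = len(u_draws)
    X_aug = np.vstack([X] * m)                          # fixed effects design
    y_aug = np.concatenate([y] * m)                     # duplicated responses
    offset = np.concatenate([Z @ u for u in u_draws])   # known offsets z'u^(j)
    return X_aug, y_aug, offset
```

The returned triple can be passed to any GLM fitting routine that accepts an offset, e.g., an iteratively reweighted least squares implementation of Fisher scoring.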
The expression for Q_2 depends on the assumed random effects distribution.
Most generally, let Σ be an unstructured q × q covariance matrix for the random
effects vector u = (u'_1, ..., u'_n)', where q = Σ_i n_i and n_i is the dimension of the
cluster-specific random effect u_i. Then, assuming u has a mean-zero multivariate
normal distribution g(u; ψ), where ψ holds the q(q+1)/2 distinct elements of Σ,
Q_2 has the form

Q_2 = c − (1/2) log|Σ| − (1/(2m)) Σ_{j=1}^m u^{(j)'} Σ^{-1} u^{(j)},

where c is a constant not depending on ψ.
The goal is to maximize Q_2 with respect to the variance components ψ of Σ. For
a general Σ, the maximum is attained at the variance components of the sample
covariance matrix S_m = (1/m) Σ_{j=1}^m u^{(j)} u^{(j)'}. Denoting these by ψ^(k) gives the value of
the maximum likelihood estimator of ψ at iteration k of the MCEM algorithm.

The simplest structure occurs when the random effects u_i have independent
components and are i.i.d. across all clusters, so that g(u; ψ) is a product of
q univariate N(0, σ²) densities and ψ = σ. Q_2 at iteration k is then maximized at
σ^(k) = ( (1/(qm)) Σ_{j=1}^m u^{(j)'} u^{(j)} )^{1/2}. Many applications of GLMMs use this simple structure
of i.i.d. random effects, where often u_i is a univariate random intercept. In this
case, the estimate of σ at iteration k reduces to σ^(k) = ( (1/(nm)) Σ_{j=1}^m Σ_{i=1}^n (u_i^{(j)})² )^{1/2}.
In Chapter 3 we will drop the assumption of independence and consider correlated
random effects, but with more parsimonious covariance structures than the most
general case presented here. Maximization of Q_2 with respect to ψ will be
presented there on a case-by-case basis.
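Both variance component updates above have closed forms; a minimal sketch:

```python
import numpy as np

def update_sigma(u_draws):
    """M-step for psi = sigma with i.i.d. N(0, sigma^2) random effect
    components: the root mean square of all sampled components."""
    U = np.asarray(u_draws)            # m x q matrix of sampled random effects
    return float(np.sqrt(np.mean(U ** 2)))

def update_Sigma_unstructured(u_draws):
    """M-step for an unstructured covariance: the sample covariance
    S_m = (1/m) sum_j u^(j) u^(j)' of the draws."""
    U = np.asarray(u_draws)
    return (U.T @ U) / U.shape[0]
```

Either function takes the m draws from the current E-step and returns the variance component update for the next MCEM iteration.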
2.3.2 Generating Samples from h(u | y; β, ψ)

So far we have assumed we had available a sample u^(1), ..., u^(m) to approximate
the expected value in the E-step of the MCEM algorithm. This section describes
how to generate such a sample from h(u | y; β, ψ), which is only known up to a
normalizing constant, or from an importance density g̃(u). In the following, we
suppress the dependency on the parameters β and ψ, since the densities are always
evaluated at their current values. Three methods are presented: the accept-reject
algorithm produces independent samples, while Metropolis-Hastings algorithms
produce dependent samples. A detailed description of all three methods can be
found in Robert and Casella (1999).

Accept-reject sampling in GLMMs
In general, for accept-reject sampling we need to find a candidate density
g and a constant M such that h(x) ≤ M g(x) holds for all x in the support of
the density of interest h (the target density). The algorithm is then:

1. generate x ~ g and w ~ Uniform[0, 1];
2. accept x as a random sample from h if w ≤ h(x) / (M g(x));
3. otherwise, return to 1.

This produces one random sample x from the target density h. The
probability of acceptance is 1/M and the expected number of trials until a
variable is accepted is M.
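A generic implementation of the three steps above is given below. The target and candidate here are toy densities chosen so the bound M is easy to verify; they are not the text's GLMM example.

```python
import numpy as np

rng = np.random.default_rng(1)

def accept_reject(h, g_sample, g_density, M, max_tries=100000):
    """Draw one sample from target density h using candidate g,
    assuming h(x) <= M g(x) for all x in the support of h."""
    for _ in range(max_tries):
        x = g_sample()                    # step 1: candidate draw
        w = rng.uniform()
        if w < h(x) / (M * g_density(x)):  # step 2: accept with prob h/(Mg)
            return x
    raise RuntimeError("no acceptance within max_tries")  # step 3 exhausted

# Toy example: h = N(0,1) truncated to [0, inf), candidate g = N(0,1), M = 2.
h = lambda x: (x >= 0) * 2.0 * np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
g = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
x = accept_reject(h, lambda: rng.normal(), g, M=2.0)
```

With this choice of M the acceptance ratio is exactly 1 for nonnegative candidates and 0 otherwise, so every nonnegative draw is kept.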
For our purpose, the target density is h(u | y). Since h(u | y) = f(y | u) g(u) / a ≤
M g(u) / a, where M = sup_u f(y | u) and a is an unknown normalizing
constant equal to the marginal likelihood, the multivariate normal random effects
distribution g(u) can be used as a candidate density. Booth and Hobert (1999,
Sect. 4.1) show that for certain models sup_u f(y | u) can be easily calculated
from the data alone and thus need not be updated at every iteration. For some
models we discuss here, the condition of Booth and Hobert (1999, Sect. 4.1, page
272) required for this simplification does not hold. However, the likelihood of a
saturated GLM is always an upper bound for f(y | u). To illustrate, regard L(u) =
f(y | u) as the likelihood corresponding to a GLM with random components y_{it}
and linear predictor η_{it} = z'_{it} u_i + x'_{it} β, where now x'_{it} β plays the role of a known
offset and the u_i are the parameters of interest. The maximized likelihood L(û) for
this model is always less than the maximized likelihood L(y) for a saturated model.
Hence, sup_u f(y | u) ≤ L(û) ≤ L(y), and L(û) or L(y) can be used to construct
the bound M.
Example: In Section 3.1 we consider a data set where, conditional on a random
effect u_t, y_{it}, the t-th observation in group i, is modeled as a Binomial(n_{it}, π_{it})
random variable. There are 16 time points, t = 1, ..., 16, and two groups, i =
1, 2. A very simple logistic-normal GLMM for these data has the form logit(π_{it}(u_t)) =
α + βx_i + u_t, where x_i is a binary group indicator. The overall design matrix for
this problem is the 32 × 18 matrix

1 0 1 0 ... 0
1 1 1 0 ... 0
1 0 0 1 ... 0
1 1 0 1 ... 0
...........
1 0 0 0 ... 1
1 1 0 0 ... 1

where the columns hold the coefficients corresponding to α, β, u_1, u_2, ..., u_16. All

rows of this matrix are different, and as a consequence the condition of Booth and
Hobert (1999, Sect. 4.1, page 272) does not hold. However, the saturated binomial
likelihood L(y) is an upper bound for f(y | u), i.e.,

sup_u f(y | u) ≤ L(y).

For instance, with the logistic-normal example from above with linear predictor
η_{it} = c + u_t, where c = α + βx_i represents the fixed part of the model, we have

sup_{u_t} f(y_{it} | u_t) = sup_{u_t} ( e^{c+u_t} / (1 + e^{c+u_t}) )^{y_{it}} ( 1 / (1 + e^{c+u_t}) )^{n_{it} − y_{it}}.

By first taking logs and then finding first and second derivatives with respect to u_t,
we see that û_t = log( y_{it} / (n_{it} − y_{it}) ) − c maximizes this expression for 0 < y_{it} < n_{it}.
Plugging in, we obtain the result

sup_{u_t} f(y_{it} | u_t) = ( y_{it}/n_{it} )^{y_{it}} ( 1 − y_{it}/n_{it} )^{n_{it} − y_{it}}.

For the special cases y_{it} = 0 or y_{it} = n_{it}, the trivial bound on f(y_{it} | u_t) is 1.
Hence, the following inequality, which follows immediately from the above, can be used
in constructing the accept-reject algorithm for a logistic-normal model with linear
predictor of the form η_{it} = c + u_t:

sup_u f(y | u) ≤ Π_{i,t} sup_{u_t} ( e^{c+u_t} / (1 + e^{c+u_t}) )^{y_{it}} ( 1 / (1 + e^{c+u_t}) )^{n_{it} − y_{it}}
              ≤ Π_{i,t} ( y_{it}/n_{it} )^{y_{it}} ( 1 − y_{it}/n_{it} )^{n_{it} − y_{it}} = L(y),

with the convention that factors with y_{it} = 0 or y_{it} = n_{it} equal 1.
This means we can select M = L(y) to meet the accept-reject condition, and
consequently we accept a sample u from g(u) if, for w ~ Uniform[0, 1],

w ≤ f(y | u) / L(y).
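For the logistic-normal example, the bound L(y) and the acceptance check are conveniently computed on the log scale; since the binomial coefficients cancel in the ratio f(y | u)/L(y), both functions below omit them. The data values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_saturated_binomial(y, n):
    """log L(y): saturated binomial log likelihood (binomial coefficients
    omitted; they cancel in the ratio below). Terms with y = 0 or y = n
    contribute the trivial bound log(1) = 0."""
    p = y / n
    out = np.zeros(len(y))
    inner = (y > 0) & (y < n)
    out[inner] = (y[inner] * np.log(p[inner])
                  + (n[inner] - y[inner]) * np.log(1.0 - p[inner]))
    return float(out.sum())

def log_f_given_u(y, n, c, u):
    """Conditional binomial log likelihood with logit(pi_t) = c + u_t
    (again without binomial coefficients)."""
    eta = c + u
    return float(np.sum(y * eta - n * np.log1p(np.exp(eta))))

# Accept a draw u from g(u) when w <= f(y | u) / L(y); the condition is
# free of the normalizing constant a.
y = np.array([3, 5, 0])
n = np.array([10, 10, 10])
c = -0.5
u = rng.normal(0.0, 0.3, size=3)
accept = rng.uniform() <= np.exp(log_f_given_u(y, n, c, u)
                                 - log_saturated_binomial(y, n))
```

The per-observation bound guarantees that the log ratio is never positive, so the acceptance probability is always a valid number in (0, 1].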

Notice that this condition is free of the normalizing constant a. In practice,
especially for high-dimensional random effects, M can be very large and therefore
we almost never accept a sample. The two alternative methods described below
may avoid this problem. Note, however, that the accept-reject method yields an
independent and identically distributed sample from the target distribution. This is
important if one wants to implement an automated MCEM algorithm (Booth and
Hobert, 1999), where the Monte Carlo sample size m is increased automatically as
the algorithm progresses to adjust for the error in the Monte Carlo approximation
to the E-step.

Markov chain Monte Carlo methods

For high-dimensional distributions h(u | y), which are unavoidable if correlated
random effects are used, accept-reject methods can be very slow. An alternative
is to generate a Markov chain with invariant distribution h(u | y), which may
be much faster but results in dependent samples. McCulloch (1997) discussed a
Metropolis-Hastings algorithm for creating such a chain in the logistic-normal
regression case. In general, an independence Metropolis-Hastings algorithm is built
as follows: choose a candidate density g̃(u) with the same support as h(u | y).
Then, for a current state u^(j-1),

1. generate w ~ g̃;
2. set u^(j) equal to w with probability ρ = min( 1, [ h(w | y) g̃(u^(j-1)) ] / [ h(u^(j-1) | y) g̃(w) ] ),
   and equal to u^(j-1) with probability 1 − ρ.

After a sufficient burn-in time, the states of the generated chain can be
regarded as a (dependent) sample from h(u | y). If the candidate density g̃(u)
is chosen to be the density of the random effects g(u), the acceptance probability
in step 2 reduces to the simple form min(1, f(y | w) / f(y | u^(j-1))). To further
speed up the simulations, McCulloch (1997) uses a random scan algorithm which updates
only the k-th component of the previous state u^(j-1) and, upon acceptance in
step 2, uses it as the new state.
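An independence Metropolis-Hastings sampler in this spirit can be sketched as follows, with a toy unnormalized f-factor standing in for the GLMM likelihood f(y | u).

```python
import numpy as np

rng = np.random.default_rng(3)

def mh_chain(log_f, g_sample, n_iter, u0):
    """Independence Metropolis-Hastings with the random effects density g(u)
    as the candidate, so the acceptance ratio reduces to
    f(y | w) / f(y | u^(j-1)); log_f is the log of that f-factor."""
    u, chain = u0, []
    for _ in range(n_iter):
        w = g_sample()
        if np.log(rng.uniform()) <= log_f(w) - log_f(u):
            u = w                # accept the candidate
        chain.append(u)          # otherwise keep the old state
    return np.array(chain)

# Toy example: f-factor proportional to exp(-(u - 1)^2), candidate N(0, 2^2);
# the invariant distribution is then proportional to exp(-(u - 1)^2) g(u).
chain = mh_chain(lambda u: -(u - 1.0) ** 2,
                 lambda: rng.normal(0.0, 2.0), 2000, 0.0)
```

Comparing on the log scale avoids under- and overflow when f(y | u) is a product over many observations.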
Another popular MCMC algorithm is the Gibbs sampler. Let u^(j-1) =
(u_1^(j-1), ..., u_n^(j-1)) denote the current state of a Markov chain with invariant
distribution h(u | y). One iteration of the Gibbs sampler generates, componentwise,

u_1^(j) ~ h(u_1 | u_2^(j-1), ..., u_n^(j-1), y)
u_2^(j) ~ h(u_2 | u_1^(j), u_3^(j-1), ..., u_n^(j-1), y)
...
u_n^(j) ~ h(u_n | u_1^(j), ..., u_{n-1}^(j), y),

where h(u_i | u_1^(j), ..., u_{i-1}^(j), u_{i+1}^(j-1), ..., u_n^(j-1), y) are the so-called full
conditionals of h(u | y). The vector u^(j) = (u_1^(j), ..., u_n^(j)) represents the new state
of the chain and, after a sufficient burn-in time, can be regarded as a sample from
h(u | y). The advantage of the Gibbs sampler is that it reduces sampling of a possibly
very high-dimensional vector u to sampling of several lower-dimensional components
of u. We will use the Gibbs sampler in connection with autoregressive random
effects to simplify sampling from an initially very high-dimensional distribution
h(u | y) by sampling from its simpler full univariate conditionals.

Importance sampling
An importance sampling approximation to the Q-function in (2.10) is given by

Q_m(β, ψ; β^(k-1), ψ^(k-1)) = ∑_{j=1}^m w_j log f(y | u^(j); β) + ∑_{j=1}^m w_j log g(u^(j); ψ),

where the u^(j) are independent samples from an importance density g̃(u; ψ^(k-1)) and

w_j = h(u^(j) | y; β^(k-1), ψ^(k-1)) / g̃(u^(j); ψ^(k-1)) ∝ f(y | u^(j); β^(k-1)) g(u^(j); ψ^(k-1)) / g̃(u^(j); ψ^(k-1))

are importance weights at iteration k. Usually, Q_m is divided by the sum of the
importance weights ∑_{j=1}^m w_j. The normalizing constant hidden in the
proportionality depends only on the known parameters (β^(k-1), ψ^(k-1)) and hence
plays no part in the following maximization step. Selecting the importance density
g̃ is a delicate issue: it should be easy to simulate from, but it should also resemble
h(u | y) as closely as possible. Booth and Hobert (1999) suggest a Student t
density as the importance distribution g̃, whose mean and variance match those of
h(u | y), as derived via a Laplace approximation.
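A self-normalized computation of these weights can be sketched as follows; the three log-density callables and the toy densities at the bottom are hypothetical stand-ins for f(y | u; β^(k-1)), g(u; ψ^(k-1)) and the importance density, specified only up to constants that cancel after normalization.

```python
import numpy as np

def importance_weights(y, u_samples, log_f, log_g_prior, log_g_imp):
    # log of the unnormalized weight: log f(y|u) + log g(u) - log g_imp(u);
    # the unknown normalizing constant of h(u|y) cancels after normalization
    lw = np.array([log_f(y, u) + log_g_prior(u) - log_g_imp(u)
                   for u in u_samples])
    w = np.exp(lw - lw.max())          # subtract max for numerical stability
    return w / w.sum()                 # self-normalized weights sum to one

# toy one-dimensional illustration with hypothetical densities
log_f = lambda y, u: -0.5 * np.sum((y - u) ** 2)   # stand-in likelihood
log_prior = lambda u: -0.5 * u ** 2                # N(0,1), up to a constant
log_imp = lambda u: -0.5 * (u / 2.0) ** 2          # wider N(0,4) proposal
u_draws = np.random.default_rng(0).normal(0.0, 2.0, 500)
w = importance_weights(np.array([0.3]), u_draws, log_f, log_prior, log_imp)
```

The normalized weights `w` would then multiply the log f and log g terms in Q_m.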
2.3.3 Convergence Criteria

Due to the stochastic nature of the algorithm, parameter estimates of two suc-
cessive iterations can be close together just by chance, although convergence is not
yet achieved. To reduce the risk of stopping prematurely, we declare convergence
if the relative change in parameter estimates is less than some ε1 for c (e.g., five)
consecutive times. Let λ^(k) = (β^(k)', ψ^(k)')' be the vector of unknown fixed effects
parameters and variance components. Then this condition means that

max_i |λ_i^(k) - λ_i^(k-1)| / |λ_i^(k-1)| < ε1    (2.17)

has to be fulfilled for c consecutive (e.g., five) values of k. For any λ_i^(k), an exception to this
rule occurs when the estimated standard error of that parameter is substantially
larger than the change from one iteration to the next. Hence, at iteration k, for
those parameters satisfying

|λ_i^(k) - λ_i^(k-1)| / (vâr(λ_i^(k)))^{1/2} < ε2,

where vâr(λ_i^(k)) is the current estimate of the variance of the MLE λ_i, the relative
precision of criterion (2.17) need not be met. An estimate of this variance can
be obtained from the observed information matrix of the ML estimator for λ.
Louis (1982) showed that the observed information matrix can be written in
terms of the first (l') and second (l'') derivatives of the complete data log-likelihood
l(λ; y, u) = log f(y, u; λ). Evaluated at the MLE λ̂, it is given by

I(λ̂) = -E_{u|y}[l''(λ̂; y, u) | y] - var_{u|y}[l'(λ̂; y, u) | y].

An approximation to this matrix, at iteration k, uses Monte Carlo sums with draws
from h(u | y; λ^(k)) from the current iteration of the MCEM algorithm.
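A minimal sketch of the Monte Carlo version of Louis's identity for a scalar parameter; the per-draw values of l' and l'' would come from the MCMC sample of the current MCEM iteration, and the function name is purely illustrative.

```python
import numpy as np

def louis_information(score_draws, hessian_draws):
    # I(lambda) = -E[l''(lambda; y, u) | y] - var[l'(lambda; y, u) | y],
    # with the conditional moments replaced by Monte Carlo averages over
    # draws u^(1), ..., u^(m) from h(u | y, lambda)
    score_draws = np.asarray(score_draws, float)
    hessian_draws = np.asarray(hessian_draws, float)
    return -hessian_draws.mean() - score_draws.var()
```

For a vector parameter the same identity holds with the outer-product covariance matrix of the per-draw score vectors in place of the scalar variance.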
To further safeguard against stopping prematurely, we use a third convergence
criterion based on the Q_m function. For deterministic EM, the Q function is
guaranteed to increase from iteration to iteration. With MCEM, because of
the stochastic approximation, Q_m^(k) can be less than Q_m^(k-1) because of an
"unlucky" Monte Carlo sample at iteration k. Hence, the parameter estimates
obtained from maximizing Q_m^(k) can be a step in the wrong direction and actually
decrease the value of the likelihood. To counter this, we declare convergence only
if successive values of Q_m^(k) are within a small neighborhood. More important,
however, is that we accept the k-th parameter update λ^(k) only if the relative
change in the Q_m function is larger than some small negative constant ε3, i.e.,

(Q_m^(k) - Q_m^(k-1)) / |Q_m^(k-1)| > ε3.    (2.18)

If at iteration k (2.18) is not met, there is reason to believe that λ^(k) decreases
the likelihood and is worse than the parameter update from the previous iteration,
and we repeat the k-th iteration with a new and larger Monte Carlo sample. Thereby,
we hope to better approximate the Q-function and, as a result, get a better estimate
λ^(k), with a Q_m-function value larger than the previous one. If this does not happen, we
nevertheless accept λ^(k) and proceed to the next iteration, possibly letting the
algorithm temporarily move in the direction of a lower likelihood region. Otherwise,
the Monte Carlo sample size quickly grows without bounds at an early stage
of the algorithm. Furthermore, at early stages, the Monte Carlo error in the
approximation of the Q-function can be large and hence its trace plot is very noisy.

Caffo, Jank and Jones (2003) go a step further and calculate asymptotic

confidence intervals for the change in the Qm-function, based on which they

construct a rule for accepting or rejecting A(k). They discuss schemes of how to

increase the Monte Carlo sample accordingly and their MCEM algorithm inherits

the ascent property of EM with high probability. However, we feel that the simpler

criterion (2.18) suffices for the examples considered here.

Coupled with any convergence criterion is the question of the updating scheme

for the Monte Carlo sample size m between iterations. In general, we will use

m(k) = am(k-1), where a > 1 and m(k) is the Monte Carlo sample size at iteration

k. At early iterations, m(k) will be low, since big parameter jumps are expected

regardless of the quality of the approximation and the Monte Carlo error associated

with it. Later, as more weight will be put on decreasing the Monte Carlo error in

the approximations, the polynomial increase guarantees sufficiently large Monte

Carlo samples. Furthermore, condition (2.18) signals when an additional boost

in m(k) is needed to better approximate the Q-function in this iteration. Hence,

whenever (2.18) is not met, we re-run iteration k with a bigger sample size q·m^(k),

where the boost factor q > 1 is usually between 1 and 2.
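One possible composition of the two updating rules (multiplicative growth plus the boost triggered by a failure of (2.18)) is sketched below; the rounding details are an assumption of this illustration.

```python
def next_sample_size(m_prev, met_2_18, a=1.03, q=1.2):
    # m^(k) = a * m^(k-1) with a > 1; if criterion (2.18) failed at the
    # last iteration, it is re-run with an additional boost factor q
    m = max(int(round(a * m_prev)), m_prev + 1)   # ensure strict growth
    if not met_2_18:
        m = int(round(q * m))
    return m
```

With a close to 1 the sample size grows slowly at early iterations, where large parameter jumps are expected anyway, and the boosts concentrate extra simulation effort where (2.18) signals it is needed.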

Figure 2-1: Plot of the typical behavior of the Monte Carlo sample size m^(k) and
the Q-function Q_m^(k) through MCEM iterations.
The iteration number is shown on the x-axis. Plots are based on the data and
model for the boat race data discussed in Chapter 5.

A typical picture of the Monte Carlo sample size m^(k) and the Q_m^(k) function
through the iterations of an MCEM algorithm is presented in Figure 2-1. The
increase in the Q_m^(k) function is large at the first iterations, but its Monte Carlo
error is also large due to the small Monte Carlo sample size. The plot of the Monte
Carlo sample size m^(k) shows several jumps, corresponding to the events that the
Q_m^(k) function actually decreased by more than ε3 from one iteration to the next and
we adjusted with an additional boost in generated samples. The data and model
on which this plot is based are taken from the boat race example analyzed
and discussed in Chapter 5, with convergence criteria set to ε1 = 0.001, c = 4,
ε2 = 0.003, ε3 = -0.005, a = 1.03 and q = 1.05.

Fort and Moulines (2003) show that with geometrically ergodic (see, e.g.,

Robert and Casella, 1999) MCMC samplers, a polynomial increase in the Monte


Carlo sample size leads to convergence of MCEM parameter estimates. However,

establishing geometric ergodicity is not an easy task. Other, more sophisticated

and automated Monte Carlo sample size updating schemes are presented by Booth

and Hobert (1999) for independent sampling, and Caffo, Jank and Jones (2003), for

independent and MCMC sampling.


In Chapter 2 we mentioned on several occasions that for certain data structures

the usual assumption of independent random effects is inappropriate. For instance,

if clusters represent time points in a study over time, observations from different

clusters can no longer be assumed (marginally) independent. Or, in longitudinal

studies, the non-negative and exchangeable correlation structure among repeated

observations implied by a single random effect can be far from the truth for long

sequences of repeated observations. Section 3.1 presents data from a cross-sectional

time series which motivates the use of correlated random effects and discusses their

implications. In Sections 3.2 and 3.3 two special correlation structures useful for

modeling the dependence structure in discrete repeated measures with possibly

unequally spaced observation times are discussed. The main focus of this chapter

is on the technical implications on the MCEM algorithm arising from estimating

an additional variance (correlation) component. In contrast to models with

independent random effects, the M-step has no closed form solution and iterative

methods have to be used to find the maximum. Also, because random effects are

correlated a priori, they are correlated a posteriori, and sampling from the posterior

distribution of u I y as required by the MCEM algorithm is more involved than

with independent random effects. A Gibbs sampling approach is developed in

Section 3.4.

From here on we let t denote the index for the discrete observation times,
t = 1, ..., T, and we let Y_it denote a response at time point t for stratum i,
i = 1, ..., n. Throughout, we will assume univariate but correlated random effects
{u_t}_{t=1}^T associated with the observations over time.

3.1 A Motivating Example: Data from the General Social Survey

The basic purpose of the General Social Survey (GSS), conducted by the
National Opinion Research Center, is to gather data on contemporary Amer-
ican society in order to monitor and explain trends and constants in atti-
tudes, behaviors and attributes. It is second only to the census in popularity
among sociologists as a data source for conducting research. The GSS ques-
tionnaire contains a standard core of demographic and attitudinal variables
whose wording is retained throughout the years to facilitate time trend stud-
ies. Currently, the GSS comprises a total of 24 surveys conducted in the years
1973-1978, 1980, 1982, 1983-1994, 1996, 1998, 2000 and 2002, with data available
online through 1998. Two features, a discrete response variable (most of the
attitude questions) observed through time and unequally spaced observation
times, make it a prime resource for applying the models proposed in this
dissertation. Data obtained from the GSS are different from longitudinal studies,
where subjects are followed through time. Here, responses are from independent
cross-sectional surveys of different subjects in each year.

One question, included in 16 of the 22 surveys through 1998, recorded attitude
towards homosexual relationships. It was observed in the years 1974, 1976-77,
1980, 1982, 1984-85, 1987-1991, 1993-94, 1996 and 1998. We will use these data
to motivate and illustrate the use of correlated random effects. Figure 3-1 shows
the proportion of respondents who agreed with the statement that homosexual
relationships are not wrong at all, for the two race cohorts of white respondents
and black respondents. For simplicity in this introductory example, only race was
chosen as a cross-classifying variable, and attitude was measured as answering "yes"
or "no" to the aforementioned question. Let Y_it denote the number of people in
year t and of race i who agreed with the statement that homosexual relationships

Figure 3-1: Sampling proportions from the GSS data set.
Proportion of whites (squares) and blacks (circles) agreeing with the statement that
homosexual relationships are not wrong at all, from 1974 to 1998.

are not wrong at all. The index t = 1, ..., 16 runs through the set of 16 years
{1974, 1976, 1977, 1980, ..., 1998} mentioned above, and i = 1 for race equal to
white and i = 2 for race equal to black. The conditional independence assumption
discussed in Section 2.1 allows us to model Y_it, the sum of the n_it binary variables
which are the individual responses, as a binomial variable conditional on a yearly
random effect. That is, the probabilistic model we propose assumes a conditional
Binomial(n_it, π_it) distribution for each member of the two time series {Y_1t}_{t=1}^{16} and
{Y_2t}_{t=1}^{16} pictured in Figure 3-1. The parameters n_it and π_it are, respectively, the
total number and the conditional probability of agreeing with the statement that
homosexual relationships are not wrong at all, for respondents of race i in year t.

3.1.1 A GLMM Approach

A popular model for π_it is a logistic-normal model, for which the link function
h(·) in (2.2) is the logit link and the random effects structure simplifies to a ran-
dom intercept u_t. We will assume that the fixed parameter vector β is composed
of an intercept term α, linear and quadratic time effects β1 and β2, a race effect β3
and a year-by-race interaction β4. With x_1t representing the year variable centered
around 1984 (e.g., x_11 = 1974 - 1984 = -10) and x_2i the indicator variable for race
(for whites x_21 = 0, for blacks x_22 = 1), the model has form

logit(π_it(u_t)) = logit(P(Y_it = y_it | u_t)) = α + β1·x_1t + β2·x_1t² + β3·x_2i + β4·x_1t·x_2i + u_t.    (3.1)

Apart from the fixed effects, the random time effect ut captures the dependency

structure over the years. Note that π_it(u_t) is a conditional probability, given the
random effect u_t from the year the question was asked. This random effect u_t can

be interpreted as the unmeasurable public opinion about homosexual relationships,

common to all respondents within the same year. By introducing this random

effect, we assume that individual opinions are influenced by this overall opinion

or the social and political climate on homosexual relationships (like awareness of

AIDS and the social spending associated with it, which is hard to measure). Thus,

individual responses within a given year are no longer independent of each other,

but share a common random effect. Furthermore, it is natural to assume that the

public opinion about homosexual relationships changes gradually over time, with

higher correlations for years closer together and lower correlations for years further

apart. It would be wrong and unnatural to assume that the public opinion (or

political climate) is independent from one year to the next. However, this would

be assumed by modeling the random effects {ut} as independent of each other. It

would also be wrong to assume a common, time-independent random effect u = ut

for all time points t, as this implies that public opinion does not change over time.

Its effect would then be the same whether responses are measured in 1974 or 1998.

3.1.2 Motivating Correlated Random Effects

To capture the dependency in public opinion and therefore in responses over

different years, we propose random effects that are correlated. In particular,


for this example with unequally spaced observation times, we suggest normal
autocorrelated random effects {u_t} with variance function

var(u_t) = σ², t = 1, ..., 16,

and correlation function

corr(u_t, u_t*) = ρ^{|x_1t - x_1t*|}, 1 ≤ t < t* ≤ 16,

where |x_1t - x_1t*| is the difference between the two years identified by indices t and
t*. This is equivalent to specifying a latent autoregressive process

u_{t+1} = ρ^{|x_{1,t+1} - x_1t|} u_t + ε_t

underlying the data generation mechanism. Both of these formulations naturally
handle the multiple gaps in the observed time series. There is no need to make
adjustments (such as imputation of data or artificially treating the series as equally
spaced) in our analysis due to "missing" data at years 1975, 1978-79, 1981, 1983,
1986, 1992, 1995 or 1997.
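For illustration, the random-effects correlation matrix implied by this correlation function can be computed directly from the 16 observed years, gaps and all; ρ = 0.65 is used here purely as an illustrative value.

```python
import numpy as np

# corr(u_t, u_t*) = rho**|x_1t - x_1t*| for the 16 observed GSS years;
# the gaps in the series require no special handling
years = np.array([1974, 1976, 1977, 1980, 1982, 1984, 1985, 1987,
                  1988, 1989, 1990, 1991, 1993, 1994, 1996, 1998])
rho = 0.65
R = rho ** np.abs(years[:, None] - years[None, :])
```

Entries of `R` decay with the number of calendar years separating two random effects, e.g. adjacent survey years 1976 and 1977 are more strongly correlated than 1974 and 1998.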

With correlated random effects, we have to distinguish between two situations:
the correlation induced by assuming a common random effect u_t for each cluster
(here: year), and the correlation induced by assuming a correlation among the
cluster-specific random effects {u_t}.

Correlation among observations in the same cluster is a consequence of assuming a
single, cluster-specific random effect shared by all observations in that cluster. For
example, the presence of the cluster-specific random effect u_t in (3.1) leads to a
(marginal) non-negative correlation among the two binomial responses Y_1t and Y_2t
in year t. With conditional independence, the marginal covariance between these
two observations is given by

cov(Y_1t, Y_2t) = E[cov(Y_1t, Y_2t | u_t)] + cov(E[Y_1t | u_t], E[Y_2t | u_t])
              = cov(logit^{-1}(η_1t + u_t), logit^{-1}(η_2t + u_t)),    (3.2)

where η_it is the fixed part of the linear predictor in (3.1). Both functions in (3.2)
are monotone increasing in u_t, leading to a non-negative correlation. Approxima-
tions to (3.2) will be dealt with in Section 4.3.
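The covariance in (3.2) has no closed form, but it is straightforward to evaluate by simulation. The sketch below is a simple Monte Carlo check (not the approximation developed in Section 4.3), with all numeric values illustrative.

```python
import numpy as np

def expit(x):
    # inverse logit
    return 1.0 / (1.0 + np.exp(-x))

def mc_shared_effect_cov(eta1, eta2, sigma, m=100000, seed=0):
    # Monte Carlo evaluation of cov(expit(eta1 + u), expit(eta2 + u))
    # for a shared random effect u ~ N(0, sigma^2), as in (3.2)
    u = np.random.default_rng(seed).normal(0.0, sigma, m)
    p1, p2 = expit(eta1 + u), expit(eta2 + u)
    return float(np.mean((p1 - p1.mean()) * (p2 - p2.mean())))
```

Because both inverse-logit transforms are monotone increasing in the shared u, the simulated covariance is non-negative, and it vanishes as σ tends to zero.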

In the example, we attributed the cause of this correlation to the current

(at the time of the interview) public opinion about homosexual relationships,

influencing all respondents in that year. The estimate of σ gives an idea of the
magnitude of this correlation, since the more dispersed the u_t's are, the stronger
the correlation among the responses within a year. For instance, if the true u_t
for a particular year is positive and far from zero, as measured by σ, then
all respondents have a common tendency to give a positive answer. If it is far
from zero on the negative side, respondents have a common tendency toward
a negative answer. This interpretation, of course, is only relative to other fixed
effects included in the linear predictor. For the GSS data, there seems to be
moderate correlation between responses, based on a maximum likelihood estimate
of σ̂ = 0.10 with an approximate asymptotic s.e. of 0.03. This interpretation of
a moderate effect of public opinion on responses within the same year is further
supported by the fact that σ can also be interpreted as the regression coefficient for
a standardized version of the random effect u_t. A regression coefficient of 0.10 for

a standard normal variable on the logit scale leads to moderate heterogeneity on

the probability scale. This shows that the correlation between responses within a

common year cannot be neglected.

The second consequence of correlated random effects is that observations from
different clusters are correlated, which is a distinctive feature compared to GLMMs
assuming independence between cluster-specific random effects. The conditional log
odds of agreeing with the statement that homosexual relationships are not wrong
at all are now correlated over the years, a feature which is natural for a time series
of binomial observations but would have gone unaccounted for if time-independent
random effects were used. For instance, for the cohort of white respondents (i = 1),
the correlation between the conditional log odds at years t and t* is

corr(logit(π_1t(u_t)), logit(π_1t*(u_t*))) = ρ^{|x_1t - x_1t*|}

and therefore directly related to the assumed random effects correlation structure.
Marginally, the two binomial responses at the different observation times have
covariance

cov(Y_1t, Y_2t*) = cov(logit^{-1}(η_1t + u_t), logit^{-1}(η_2t* + u_t*)),

which accommodates changing covariance patterns for different observation times
(e.g., decreasing with increasing lag) and also negative covariances (see for instance
the analysis of the Old Faithful geyser eruption data in Chapter 5). We will present
approximations to these marginal correlations in binomial time series in Section 4.3.
Summing up, correlated random effects give us a means of incorporating
correlation between sequential binomial observations that goes beyond independent
or exchangeable correlation structures.

In our example, we attributed the sequential correlation to the gradual change

in public opinion about homosexual relationships over the years, affecting both

races equally. In fact, the maximum likelihood estimate of ρ is equal to 0.65 (s.e.

adjacent years.

The model uses 7 parameters (5 fixed effects, 2 variance components) to
describe the 32 probabilities. In comparison with a regular GLM and a GLMM
with independent random time effects, the maximized log-likelihood increases from
-113.0 for the regular GLM to approximately -109.7 for a GLMM with independent
random effects, and to approximately -107.5 for the GLMM with autoregressive
random effects. Note that the GLM assumes independent observations within and
between the years, and that the GLMM with independent random effects {u_t} for
each year t assumes correlation of responses within a year but independence of
responses over the years. Both assumptions might be inappropriate. Our model
implies that the log odds of approval of homosexual relationships are correlated
for blacks and whites within a year (though not very strongly, with an estimate of
σ equal to 0.1) and are also correlated for two consecutive years.

The estimates of the fixed parameters and their asymptotic standard errors are

given in Table 5-1. The MCEM algorithm converged after 128 iterations with a

starting Monte Carlo sample size of 50 and a final Monte Carlo sample size of 8600.

Convergence parameters (cf. Section 2.3.3) were ε1 = 0.002, c = 5, ε2 = 0.003,

ε3 = -0.001, a = 1.01 and q = 1.2. Path plots of selected parameter estimates

for two different sets of starting values are shown in Figure 3-2. A detailed

interpretation of the parameters and the effects of the explanatory variables on the

odds of approval is provided in Section 4.3.

Although this example assumed autocorrelated random effects, we will look
at the simpler case of equally correlated random effects next. Then, we discuss
how the correlation parameter ρ can be estimated within the MCEM framework
presented in Section 2.3. Summing up, correlated random effects in GLMMs allow
one to model within-cluster as well as between-cluster correlations for discrete
response variables, where clusters refer to groupings of responses in time.

Figure 3-2: Iteration history for selected parameters and their asymptotic standard
errors for the GSS data.
The iteration number is plotted on the x-axis. The estimates and standard errors
for β2 were multiplied by 10³ for better plotting. The two different lines in each
plot correspond to two different sets of starting values.

3.2 Equally Correlated Random Effects

The introductory example modeled decaying correlation between cross-
sectional data over time through the use of autocorrelated random effects. In other
temporal or spatial settings, the correlation might stay nearly constant between any
two observation times, regardless of time or location differences between the two
discrete responses. Equally correlated random effects might then be appropriate to
describe such a behavior.
3.2.1 Definition of Equally Correlated Random Effects

We call random effects equally correlated if

var(u_t) = σ² for all t

and

corr(u_t, u_t*) = ρ for all t ≠ t*.

More generally, the covariance matrix of the random effects vector u = (u_1, ..., u_T)'
is given by Σ = σ²[(1 - ρ)I_T + ρJ_T], where J_T = 1_T 1_T'. To ensure positive
definiteness, ρ has restricted range, i.e., -1/(T - 1) < ρ < 1. The random effects
density is given by

g(u; ψ) ∝ |Σ|^{-1/2} exp(-½ u'Σ^{-1}u),    (3.3)

where now, due to the pattern in Σ, |Σ| = σ^{2T}(1 - ρ)^{T-1}[1 + (T - 1)ρ] and

Σ^{-1} = (1/(σ²(1 - ρ))) [I_T - (ρ/(1 + (T - 1)ρ)) J_T].

The vector ψ = (σ, ρ) holds the variance components of Σ.
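The closed-form determinant and inverse implied by the patterned Σ are easy to check numerically; the sketch below uses arbitrary illustrative values for T, σ and ρ.

```python
import numpy as np

# numerical check of the closed-form determinant and inverse of
# Sigma = sigma^2 [(1 - rho) I_T + rho J_T]
T, sigma, rho = 5, 1.3, 0.4
J = np.ones((T, T))
Sigma = sigma**2 * ((1 - rho) * np.eye(T) + rho * J)

det_closed = sigma**(2 * T) * (1 - rho)**(T - 1) * (1 + (T - 1) * rho)
Sigma_inv_closed = (np.eye(T) - rho / (1 + (T - 1) * rho) * J) / (sigma**2 * (1 - rho))
```

Both closed forms agree with brute-force `numpy.linalg` computations, which is what makes the M-step below cheap even for large T.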
The more complicated random effects structure (as compared to independence
or a single latent random effect) leads to a more complicated M-step in the MCEM
algorithm described in Section 2.3. For a sample u^(1), ..., u^(m) from the posterior
h(u | y; β^(k-1), ψ^(k-1)), evaluated at the previous parameter estimates β^(k-1) and
ψ^(k-1), the function Q_2(ψ) introduced in Section 2.3.1 has form

Q_2(ψ) = (1/m) ∑_{j=1}^m log g(u^(j); ψ)
       ∝ -T log σ - ((T - 1)/2) log(1 - ρ) - ½ log[1 + (T - 1)ρ]
         - a/(2σ²(1 - ρ)) + ρ b/(2σ²(1 - ρ)[1 + (T - 1)ρ]),

where a = (1/m) ∑_{j=1}^m u^(j)'u^(j) and b = (1/m) ∑_{j=1}^m u^(j)'J_T u^(j) are constants
depending on the sample only.
3.2.2 The M-step with Equally Correlated Random Effects
The M-step seeks to maximize Q_2 with respect to σ and ρ, which is equivalent
to finding their MLEs treating the sample u^(1), ..., u^(m) as independent. Since this
is not possible in closed form, one way to maximize Q_2 uses a bivariate Newton-
Raphson algorithm with the Hessian formed by the second order partial and mixed
derivatives of Q_2 with respect to σ and ρ. Some authors (e.g., Lange, 1995; Zhang,
2002) use only a single iteration of the Newton-Raphson algorithm instead of an
entire M-step to speed up convergence. However, this might not always lead to
convergence, since the interval in which the Newton-Raphson algorithm converges
is constrained by the restrictions on ρ. We show now that with a little bit of
work the maximizers for σ and ρ can be obtained very quickly.
For any given value of ρ, the ML estimator for σ (at iteration k) is available in
closed form and is equal to

σ̂^(k) = { (1/(T(1 - ρ))) ( a - ρ b/[1 + (T - 1)ρ] ) }^{1/2}.

Note that if ρ = 0, σ̂^(k) = ((1/(Tm)) ∑_{j=1}^m u^(j)'u^(j))^{1/2}, the estimator for the independence
case presented at the end of Section 2.3.1. Unfortunately, the ML estimator for
ρ has no closed form solution. The first and second partial derivatives of Q_2 with
respect to ρ are given by

∂Q_2/∂ρ = T(T - 1)ρ / (2(1 - ρ)[1 + (T - 1)ρ]) - a/(2σ²(1 - ρ)²)
          + [1 + (T - 1)ρ²] b / (2σ²(1 - ρ)²[1 + (T - 1)ρ]²),

∂²Q_2/∂ρ² = T(T - 1)[1 + (T - 1)ρ²] / (2(1 - ρ)²[1 + (T - 1)ρ]²) - a/(σ²(1 - ρ)³)
           + [-(T - 2) + 3(T - 1)ρ + (T - 1)²ρ³] b / (σ²(1 - ρ)³[1 + (T - 1)ρ]³).

We obtain the profile likelihood for ρ by plugging the MLE σ̂^(k) into the likelihood
equation for ρ. Then we use a simple and fast interval-halving (or bisection)
method to find the root for ρ. This is advantageous compared to a Newton-
Raphson algorithm since the range of ρ is restricted. Let f(ρ) = ∂Q_2/∂ρ |_{σ=σ̂^(k)}
and let ρ1 and ρ2 be two initial estimates in the appropriate range, satisfying
ρ1 < ρ2 and f(ρ1)f(ρ2) < 0. Without loss of generality, assume f(ρ1) < 0.
Clearly, the maximum likelihood estimate ρ̂ must be in the interval [ρ1, ρ2]. The
interval-halving method computes the midpoint ρ3 = (ρ1 + ρ2)/2 of this interval
and updates one of its endpoints in the following way: it sets ρ1 = ρ3 if f(ρ3) < 0,
or ρ2 = ρ3 otherwise. The newly formed interval [ρ1, ρ2] has half the length of the
initial interval, but still contains ρ̂. Subsequently, a new midpoint ρ3 is calculated,
giving rise to a new interval with one fourth of the length of the initial interval,
but still containing ρ̂. This process is iterated until |f(ρ3)| < ε, where ε is a small
positive constant. To ensure it is a maximum, we can check that the value of the
second derivative f'(ρ) is negative at ρ3. (The second derivative is also needed
for approximating standard errors in the EM algorithm.) The value of ρ3 is then
used as an update for ρ in the maximum likelihood estimator for σ, and the whole
process of finding the roots of the derivatives of Q_2 is repeated. Convergence is
declared when the relative change in σ and ρ is less than some pre-specified small
constant. The values of σ and ρ at this final iteration are the estimates σ̂^(k) and
ρ̂^(k) from MCEM iteration k.
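A minimal sketch of the interval-halving step described above, written as a generic root-finder rather than tied to a particular Q_2:

```python
def interval_halving(f, p1, p2, eps=1e-10, max_iter=200):
    # bisection for a root of f between p1 and p2; requires f(p1)*f(p2) < 0
    if f(p1) > 0:                     # orient endpoints so that f(p1) < 0
        p1, p2 = p2, p1
    p3 = 0.5 * (p1 + p2)
    for _ in range(max_iter):
        p3 = 0.5 * (p1 + p2)         # midpoint of the current interval
        f3 = f(p3)
        if abs(f3) < eps:
            break
        if f3 < 0:
            p1 = p3                   # root lies in the upper half
        else:
            p2 = p3                   # root lies in the lower half
    return p3
```

Each iteration halves the bracketing interval, so the restricted range of ρ is respected automatically, unlike a Newton-Raphson step which can jump outside it.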

The issue of how to obtain a sample u^(1), ..., u^(m) from h(u | y; β^(k-1), ψ^(k-1)),
taking into account the special structure of the random effects distribution, will be
discussed in Section 3.4.2.
3.3 Autoregressive Random Effects

The use of autoregressive random effects was demonstrated in the introductory

example. Their property of a decaying correlation function makes them a useful

tool for modeling temporal or spatial associations among discrete data. We will

limit ourselves to instances where there is a natural ordering of random effects, and

consider time dependent data first.

3.3.1 Definition of Autoregressive Random Effects

As with equally correlated random effects in Section 3.2, we can look at the
joint distribution of autoregressive (or autocorrelated) random effects {u_t}_{t=1}^T as a
mean-zero multivariate normal distribution with patterned covariance matrix Σ,
defined by the variance and correlation functions

var(u_t) = σ² for all t

and

corr(u_t, u_t*) = ρ^{|x_t - x_t*|} for all t ≠ t*,

where x_t and x_t* are time points (e.g., years, as in the GSS example) associated
with random effects u_t and u_t*. Let d_t = x_{t+1} - x_t denote the time difference
between two successive time points and let f_t = 1/(1 - ρ^{2d_t}), t = 1, ..., T - 1.
Then, due to the special structure, the determinant of the covariance matrix is
given by |Σ| = σ^{2T} ∏_{t=1}^{T-1} f_t^{-1} and Σ^{-1} is tri-diagonal (Crowder and Hand, 1990,
with correction of a typo in there) with main diagonal

σ^{-2} (1 + ρ^{2d_1}f_1, f_1 + ρ^{2d_2}f_2, ..., f_{T-2} + ρ^{2d_{T-1}}f_{T-1}, f_{T-1})

and sub-diagonals

-σ^{-2} (ρ^{d_1}f_1, ρ^{d_2}f_2, ..., ρ^{d_{T-1}}f_{T-1}).
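The tri-diagonal structure is easy to verify numerically. The sketch below (with arbitrary illustrative values) builds Σ^{-1} from the f_t and checks it against a brute-force inverse of the covariance matrix implied by the correlation function.

```python
import numpy as np

def ar_precision(x, sigma, rho):
    # tri-diagonal Sigma^{-1} for unequally spaced AR(1) random effects,
    # with f_t = 1/(1 - rho**(2 d_t)) and d_t = x_{t+1} - x_t
    x = np.asarray(x, float)
    d = np.diff(x)
    f = 1.0 / (1.0 - rho ** (2 * d))
    diag = np.concatenate(([1.0 + rho ** (2 * d[0]) * f[0]],
                           f[:-1] + rho ** (2 * d[1:]) * f[1:],
                           [f[-1]]))
    off = -rho ** d * f
    Q = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
    return Q / sigma**2

# covariance built directly from corr(u_t, u_t*) = rho**|x_t - x_t*|
x, sigma, rho = [0.0, 1.0, 3.0, 4.0, 7.0], 1.1, 0.6
Sigma = sigma**2 * rho ** np.abs(np.subtract.outer(x, x))
Q = ar_precision(x, sigma, rho)
```

The same check confirms the determinant formula, since |Σ| = σ^{2T} ∏ (1 - ρ^{2d_t}).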

For a sample u^(1), ..., u^(m) from the posterior h(u | y; β^(k-1), ψ^(k-1)), evaluated at
the previous parameter estimates β^(k-1) and ψ^(k-1) = (σ^(k-1), ρ^(k-1)), the function
Q_2 (cf. Section 2.3.1) now has form

Q_2(ψ) ∝ -T log σ - ½ ∑_{t=1}^{T-1} log(1 - ρ^{2d_t})
        - (1/(2σ²m)) ∑_{j=1}^m { u_1^{(j)2} + ∑_{t=1}^{T-1} [u_{t+1}^{(j)} - ρ^{d_t} u_t^{(j)}]² / (1 - ρ^{2d_t}) },    (3.4)

where u_t^{(j)} is the t-th component of the j-th sampled vector u^(j). In the M-step of
an MCEM algorithm, we seek to maximize Q_2 with respect to σ and ρ.
Alternatively, we can view the random effects {u_t} as a latent first-order
autoregressive process: random effect u_{t+1} at time t + 1 is related to its predecessor
u_t by the equation

u_{t+1} = ρ^{d_t} u_t + ε_t,   ε_t ~ N(0, σ²(1 - ρ^{2d_t})),   t = 1, ..., T - 1,    (3.5)

where d_t again denotes the lag between the two successive time points associated
with random effects u_t and u_{t+1}. Assuming a N(0, σ²) distribution for the first
random effect u_1, the joint random effects density for u = (u_1, ..., u_T) enjoys a
Markov property and has form

g(u; ψ) = g(u_1; ψ) g(u_2 | u_1; ψ) ··· g(u_t | u_{t-1}; ψ) ··· g(u_T | u_{T-1}; ψ)    (3.6)
        ∝ (1/σ^T) ∏_{t=1}^{T-1} (1 - ρ^{2d_t})^{-1/2} exp{-u_1²/(2σ²)} ∏_{t=1}^{T-1} exp{-[u_{t+1} - ρ^{d_t}u_t]²/(2σ²(1 - ρ^{2d_t}))},

leading, of course, to the same expression for Q_2 as given in (3.4). For two time
indices t and t* with t < t*, the random process has autocorrelation function

ρ(t, t*) = corr(u_t, u_t*) = ρ^{∑_{s=t}^{t*-1} d_s} = ρ^{x_t* - x_t}.
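A short simulation sketch of (3.5), with illustrative values, confirms that this parametrization gives every u_t the same marginal variance σ² and that the correlation decays as ρ raised to the summed lags:

```python
import numpy as np

def simulate_ar(x, sigma, rho, rng=None):
    # simulate the latent process (3.5): u_1 ~ N(0, sigma^2) and
    # u_{t+1} = rho**d_t u_t + eps_t, eps_t ~ N(0, sigma^2 (1 - rho**(2 d_t)))
    rng = np.random.default_rng(rng)
    d = np.diff(np.asarray(x, float))
    u = np.empty(len(x))
    u[0] = rng.normal(0.0, sigma)
    for t, dt in enumerate(d):
        sd = sigma * np.sqrt(1.0 - rho ** (2 * dt))
        u[t + 1] = rho**dt * u[t] + rng.normal(0.0, sd)
    return u
```

Replicating the simulation many times at unequally spaced time points (e.g., x = 0, 2, 5) shows the empirical variance at every time point near σ² and corr(u_1, u_3) near ρ^5.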

Before we discuss maximization of Q2 in this setting with possibly unequally

spaced observation times, let us comment on the rather unusual parametrization of

the latent random process (3.5). Chan and Ledolter (1995), in their development
of time series models for equally spaced discrete events, use the more common form

u_{t+1} = ρ u_t + ε_t,   ε_t ~ N(0, σ²),   t = 1, 2, ..., T - 1.

This leads to var(u_t) = σ²/(1 - ρ²) for all t if we assume a N(0, σ²/(1 - ρ²))
distribution for u_1. (Chan and Ledolter (1995) condition on this first observation,
which leads to closed form solutions for both σ and ρ in the case of equidistant
observations.) Since it is common practice to let σ² describe the strength of
association between observations in a common cluster sharing that random effect,
our parametrization seems more natural. In Chan and Ledolter's parametrization,
both the variance and correlation parameters appear in the variance of the random
effects.

In the more general case of unequally spaced observations, the parametrization
ε_t ~ N(0, σ²) results in different variances of the random effects at different time
points (i.e., var(u_t) = σ²/(1 - ρ^{2d_t})). Considering that the random effects represent

unobservable phenomena common to all clusters, their variability should be about

the same for all clusters, and not depend on the time difference between any two

clusters. There is no reason to believe that the strength of association is larger in

some clusters and weaker in others. Therefore, the parametrization we choose in

(3.5) seems natural and appropriate.

For spatially correlated data, a relationship between random effects u_i and
u_i* is defined in terms of a distance function d(x_i, x_i*) between covariates x_i
and x_i* associated with them. Each u_i then represents a random effect for a
spatial cluster, and correlated random effects are again natural to model spatial
dependency among observations in different clusters. In the time setting, we had
d(x_i, x_i*) = |x_i - x_i*| with x_i and x_i* representing time points. In 2-dimensional
spatial settings, x_i = (x_i1, x_i2)' may represent midpoints in a Cartesian system
and d(x_i, x_i*) = ||x_i - x_i*|| is the Euclidean distance function. The so-defined
distance between clusters can be used in a model with correlations between random
effects decaying as distances between cluster midpoints grow, e.g., corr(u_i, u_i*) =
ρ^{||x_i - x_i*||}. Models of this form are discussed in Zhang (2002) and in Diggle et al.
(1998) in a Bayesian framework.

Sometimes only the information concerning whether clusters are adjacent to
each other is used to form the correlation structure. In this case, d(x_i, x_i*) is a
binary function, indicating if clusters i and i* are adjacent or not. Usually, this
leads to an improper joint distribution for the random effects, as for instance in
the analysis of the Scottish lip cancer data set presented in Breslow and Clayton
(1993).

3.3.2 The M-step with Autoregressive Random Effects

Maximizing Q2 with respect to σ and ρ is again equivalent to finding their MLEs for the sample u^{(1)}, ..., u^{(m)}, pretending they are independent. For fixed ρ, maximizing Q2 with respect to σ is possible in closed form. For notational convenience, denote the parts depending on ρ and the generated sample u^{(1)}, ..., u^{(m)} by

    a_t(ρ, u) = (1/m) Σ_{j=1}^m ( u_{t+1}^{(j)} − ρ^{d_t} u_t^{(j)} )²

    b_t(ρ, u) = (1/m) Σ_{j=1}^m u_t^{(j)} ( u_{t+1}^{(j)} − ρ^{d_t} u_t^{(j)} ),

with derivatives with respect to ρ (indicated by a prime) given by

    a_t'(ρ, u) = −2 d_t ρ^{d_t−1} b_t(ρ, u)

    b_t'(ρ, u) = −d_t ρ^{d_t−1} (1/m) Σ_{j=1}^m ( u_t^{(j)} )².
The maximum likelihood estimator of σ at iteration k of the MCEM algorithm has form

    σ̂^(k) = [ (1/(Tm)) Σ_{j=1}^m (u_1^{(j)})² + (1/T) Σ_{t=1}^{T−1} a_t(ρ, u)/(1 − ρ^{2d_t}) ]^{1/2}.

For the special case of independent random effects (ρ = 0), this simplifies to the estimator σ̂ = ( (1/(Tm)) Σ_{j=1}^m u^{(j)}'u^{(j)} )^{1/2} presented at the end of Section 2.3.1. (The equal correlation structure cannot be represented as a special case of the autocorrelation structure.) No closed form solution exists for ρ̂^(k). Let

    c_t(ρ) = d_t ρ^{d_t−1} / (1 − ρ^{2d_t})

    e_t(ρ) = (ρ/d_t) [c_t(ρ)]²

be terms depending on ρ but not on u, with derivatives given by

    c_t'(ρ) = ((d_t − 1)/ρ) c_t(ρ) + 2 ρ^{d_t} [c_t(ρ)]²

    e_t'(ρ) = (2ρ/d_t) c_t(ρ) c_t'(ρ) + (1/d_t) [c_t(ρ)]²,
respectively. Then, the first and second partial derivatives of Q2 with respect to ρ can be written as

    ∂Q2/∂ρ = Σ_{t=1}^{T−1} ρ^{d_t} c_t(ρ) + (1/σ²) Σ_{t=1}^{T−1} [ c_t(ρ) b_t(ρ, u) − e_t(ρ) a_t(ρ, u) ]

    ∂²Q2/∂ρ² = Σ_{t=1}^{T−1} ρ^{d_t−1} c_t(ρ) [ 2d_t − 1 + 2 ρ^{d_t+1} c_t(ρ) ]
             + (1/σ²) Σ_{t=1}^{T−1} [ c_t'(ρ) b_t(ρ, u) + c_t(ρ) b_t'(ρ, u) − e_t'(ρ) a_t(ρ, u) − e_t(ρ) a_t'(ρ, u) ].

A Newton-Raphson algorithm with Hessian formed of the partial and mixed derivatives of Q2 with respect to σ and ρ can be employed to find the maximum likelihood estimators at iteration k. However, since the range of ρ is restricted, it might be advantageous to use the interval-halving method on ∂Q2/∂ρ |_{σ=σ̂^(k)} described in the previous section.

For the special case of equidistant time points {x_t}, the distances d_t are equal for all t. Without loss of generality, we assume d_t = 1 for all t. Then the random effects follow the simple autoregression u_{t+1} = ρu_t + ε_t, t = 1, ..., T−1, where we assume that u_1 ~ N(0, σ²) and the ε_t are i.i.d. N(0, σ²(1 − ρ²)). Certain simplifications occur. Let

    a = (1/m) Σ_{j=1}^m (u_1^{(j)})²,    b = (1/m) Σ_{j=1}^m Σ_{t=1}^{T−1} (u_{t+1}^{(j)})²,

    c = (1/m) Σ_{j=1}^m Σ_{t=1}^{T−1} u_t^{(j)} u_{t+1}^{(j)},    d = (1/m) Σ_{j=1}^m Σ_{t=1}^{T−1} (u_t^{(j)})²

denote constants depending on the generated samples only, but not on any parameters. Then the maximum likelihood estimator of σ at iteration k is

    σ̂^(k) = [ (1/T) ( a + (b − 2ρc + ρ²d)/(1 − ρ²) ) ]^{1/2}

and, upon plugging it into the score equation ∂Q2/∂ρ |_{σ=σ̂^(k)} = 0 for ρ, we obtain as the new score equation

    ρ³ [(T−1)/T] (d − a) + ρ² [(2−T)/T] c + ρ [ ((T−1)/T) a − (1/T) b − d ] + c = 0,    (3.7)

a polynomial of order three. A result by Witt (1987) mentioned in McKeown and Johnson (1996) shows that (3.7) has three real solutions, only one of which lies in the interval (−1, 1). This must be the maximum likelihood estimator ρ̂^(k) at iteration k for the case of equidistant time points. Exact solutions to this third degree polynomial are for instance given in Abramowitz and Stegun (1964), and we only need to iterate between the two explicit updates (for σ and ρ) till convergence.
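A minimal sketch (ours, not the dissertation's implementation) of this M-step update solves the cubic score equation (3.7) numerically instead of using the explicit formulas: form the cubic's coefficients from the sample constants a, b, c, d and the length T, and keep the single real root inside (−1, 1).

```python
import numpy as np

# Sketch (ours) of the M-step update for rho with equidistant time points:
# solve the cubic score equation (3.7) with numpy.roots and keep the single
# real root lying in (-1, 1).
def rho_update(a, b, c, d, T):
    coeffs = [
        (T - 1) / T * (d - a),            # rho^3 coefficient
        (2 - T) / T * c,                  # rho^2 coefficient
        (T - 1) / T * a - b / T - d,      # rho^1 coefficient
        c,                                # constant term
    ]
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-8].real
    inside = real[np.abs(real) < 1]
    return float(inside[0])
```

For samples generated from an AR(1) process, the returned root coincides with the maximizer over ρ of the profile log-likelihood obtained after substituting σ̂(ρ).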

In most of our applications the number of distinct observation times T is rather large, and generating independent T-dimensional vectors u from the posterior h(u | y; ψ^{(k−1)}) as required to approximate the E-step is difficult, even with the nice (prior) autoregressive relationship among the components of u. The next section discusses this issue.

There are many other correlation structures which are not discussed here. For

instance, the first order autoregressive random process can be extended to a p-order

random process and the formulas provided here and in the next section can be

modified accordingly.

3.4 Sampling from the Posterior Distribution Via Gibbs Sampling

In Section 2.3.2 we gave a general description of how to obtain a random sample u^{(1)}, ..., u^{(m)} from h(u | y). (As in Section 2.3.2, we suppress the dependency on the parameter estimates from the previous iteration.) For high-dimensional random effects distributions g(u), generating independent draws from h(u | y) can get very time consuming, if not impossible. The Gibbs sampler introduced in Section 2.3.2 offers an alternative because it involves sampling from lower-dimensional (often univariate) conditional distributions of h(u | y), which is considerably faster. However, it results in dependent samples from the posterior random effects distribution.

distribution. The distributional structure of equally correlated random effects

or autoregressive random effects is very amenable to Gibbs sampling because of

the simplifications that occur in the full univariate conditionals. Remember that the two-stage hierarchy and the conditional independence assumption in GLMMs imply that

    h(u | y) ∝ f(y | u) g(u) = [ Π_{t=1}^T f(y_t | u_t) ] g(u),

the product of the conditional densities of the observations sharing a common random effect, multiplied by the random effects density. In the following, let u = (u_1, ..., u_T)'. We discuss the case of autoregressive random effects first.

3.4.1 A Gibbs Sampler for Autoregressive Random Effects
From representation (3.6) of the random effects distribution, we see that the full univariate conditional distribution of u_t given the other T − 1 components of u depends only on its neighbors u_{t−1} and u_{t+1}, i.e.,

    g(u_t | u_1, ..., u_{t−1}, u_{t+1}, ..., u_T) ∝ g(u_t | u_{t−1}) g(u_{t+1} | u_t), t = 2, ..., T − 1.

At the beginning (t = 1) and the end (t = T) of the process, the conditional distributions of u_1 and u_T depend only on the successor u_2 and the predecessor u_{T−1}, respectively. Furthermore, random effect u_t applies only to the observations y_t = (y_{t1}, ..., y_{tn}) at a common time point that share that random effect, but not to other observations. Hence, the full univariate conditionals of the posterior random effects distribution can be expressed as

    h_1(u_1 | u_2, y_1) ∝ f(y_1 | u_1) g_1(u_1 | u_2)

    h_t(u_t | u_{t−1}, u_{t+1}, y_t) ∝ f(y_t | u_t) g_t(u_t | u_{t−1}, u_{t+1}), t = 2, ..., T − 1

    h_T(u_T | u_{T−1}, y_T) ∝ f(y_T | u_T) g_T(u_T | u_{T−1}),

where, using standard multivariate normal theory results,

    g_1(u_1 | u_2) = N( ρ^{d_1} u_2, σ²[1 − ρ^{2d_1}] )

    g_t(u_t | u_{t−1}, u_{t+1}) = N( (ρ^{d_{t−1}}[1 − ρ^{2d_t}] u_{t−1} + ρ^{d_t}[1 − ρ^{2d_{t−1}}] u_{t+1}) / (1 − ρ^{2(d_{t−1}+d_t)}),
                                     σ²[1 − ρ^{2d_{t−1}} − ρ^{2d_t} + ρ^{2(d_{t−1}+d_t)}] / (1 − ρ^{2(d_{t−1}+d_t)}) ), t = 2, ..., T − 1

    g_T(u_T | u_{T−1}) = N( ρ^{d_{T−1}} u_{T−1}, σ²[1 − ρ^{2d_{T−1}}] ).

For equally spaced data (d_t = 1 for all t) these distributions reduce to the ones derived in Chan and Ledolter (1995).
Direct sampling from the full univariate conditionals h_t is not possible. However, it is straightforward to implement an accept-reject algorithm. In fact, the accept-reject algorithm as outlined in Section 2.3.2 applies directly with target density h_t and candidate density g_t, since h_t has the form of an exponential family density multiplied by a normal density. In Section 2.3.2, we discussed the accept-reject algorithm for generating an entire vector u from the posterior random effects distribution h(u | y) with candidate density g(u) and mentioned that acceptance probabilities are virtually zero for high-dimensional u's. With the Gibbs sampler, we have reduced the problem to univariate sampling of the t-th component u_t from the univariate target density h_t with univariate candidate density g_t. By selecting M_t = L(y_t), where L(y_t) is the saturated likelihood for the observations at time point t, we ensure that the target density satisfies h_t ≤ M_t g_t.
Given u^{(j−1)} = (u_1^{(j−1)}, ..., u_T^{(j−1)}) from the previous iteration, the Gibbs sampler with accept-reject sampling from the full univariate conditionals consists of

1. generate first component u_1^{(j)} ~ h_1(u_1 | u_2^{(j−1)}, y_1) by
(a) generation step:
generate u_1 from candidate density g_1(u_1 | u_2^{(j−1)});
generate U ~ Uniform[0, 1];
(b) acceptance step:
set u_1^{(j)} = u_1 if U ≤ f(y_1 | u_1)/L(y_1); return to (a) otherwise;
2. for t = 2, ..., T−1:
generate component u_t^{(j)} ~ h_t(u_t | u_{t−1}^{(j)}, u_{t+1}^{(j−1)}, y_t) by
(a) generation step:
generate u_t from candidate density g_t(u_t | u_{t−1}^{(j)}, u_{t+1}^{(j−1)});
generate U ~ Uniform[0, 1];
(b) acceptance step:
set u_t^{(j)} = u_t if U ≤ f(y_t | u_t)/L(y_t); return to (a) otherwise;
3. generate last component u_T^{(j)} ~ h_T(u_T | u_{T−1}^{(j)}, y_T) by
(a) generation step:
generate u_T from candidate density g_T(u_T | u_{T−1}^{(j)});
generate U ~ Uniform[0, 1];
(b) acceptance step:
set u_T^{(j)} = u_T if U ≤ f(y_T | u_T)/L(y_T); return to (a) otherwise;
4. set u^{(j)} = (u_1^{(j)}, ..., u_T^{(j)})'.

The so-obtained sample u^{(1)}, ..., u^{(m)} (after allowing for burn-in) forms a dependent sample which we use to approximate the E-step in the k-th iteration of the MCEM algorithm. Note that all densities are evaluated at the current parameter estimates, i.e., β^{(k−1)} for f(y_t | u_t) and ψ^{(k−1)} = (σ̂^{(k−1)}, ρ̂^{(k−1)}) for g_t(u_t | u_{t−1}, u_{t+1}).
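As a concrete illustration, one sweep of this sampler can be sketched in code. This is our own minimal version (not the dissertation's implementation), assuming equally spaced time points (d_t = 1) and a single Poisson observation y_t per time point with conditional mean exp(η_t + u_t); the bound M_t = L(y_t) is the saturated Poisson likelihood, i.e. the Poisson density of y_t evaluated at mean y_t.

```python
import numpy as np
from math import factorial

# Our own sketch (not the dissertation's code) of one Gibbs sweep with
# accept-reject sampling from the full univariate conditionals, assuming
# d_t = 1 and one Poisson observation y_t per time point.
def pois_pmf(k, mu):
    # Poisson density; pois_pmf(y, y) is the saturated likelihood L(y)
    return np.exp(-mu) * mu**k / factorial(int(k))

def gibbs_sweep(u, y, eta, sigma, rho, rng):
    T = len(u)
    u = u.copy()
    for t in range(T):
        # normal full conditional g_t of u_t given its neighbours
        if t == 0:
            mean, var = rho * u[1], sigma**2 * (1 - rho**2)
        elif t == T - 1:
            mean, var = rho * u[T - 2], sigma**2 * (1 - rho**2)
        else:
            mean = rho * (u[t - 1] + u[t + 1]) / (1 + rho**2)
            var = sigma**2 * (1 - rho**2) / (1 + rho**2)
        L_sat = pois_pmf(y[t], float(y[t]))       # saturated likelihood L(y_t)
        while True:                               # accept-reject step
            cand = rng.normal(mean, np.sqrt(var))
            accept = pois_pmf(y[t], np.exp(eta[t] + cand)) / L_sat
            if rng.uniform() <= accept:
                u[t] = cand
                break
    return u
```

Repeating the sweep and discarding a burn-in period yields the dependent sample u^{(1)}, ..., u^{(m)} used in the E-step.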
3.4.2 A Gibbs Sampler for Equally Correlated Random Effects
Similar results as for the autoregressive correlation structure can be derived for the case of equally correlated random effects. In this case, the full univariate conditional of u_t depends on all other T − 1 components of u, as can be seen from (3.3). Let u_{(−t)} denote the vector u with the t-th component deleted. Using similar notation as in the previous section, the full univariate conditionals of h(u | y) are given by

    h_t(u_t | u_{(−t)}, y_t) ∝ f(y_t | u_t) g_t(u_t | u_{(−t)}), t = 1, ..., T,

where, by standard results from multivariate normal theory, g_t(u_t | u_{(−t)}) is a N(μ_t, τ_t²) density with

    μ_t = [ ρ/(1 + (T−2)ρ) ] Σ_{k≠t} u_k,    τ_t² = σ² [ 1 − (T−1)ρ² / (1 + (T−2)ρ) ].
Given the vector u^{(j−1)} from the previous iteration, the Gibbs sampler with accept-reject sampling from the full univariate conditionals has form

1. for t = 1, ..., T:
generate component u_t^{(j)} ~ h_t(u_t | u_1^{(j)}, ..., u_{t−1}^{(j)}, u_{t+1}^{(j−1)}, ..., u_T^{(j−1)}, y_t) by
(a) generation step:
generate u_t from candidate density g_t(u_t | u_1^{(j)}, ..., u_{t−1}^{(j)}, u_{t+1}^{(j−1)}, ..., u_T^{(j−1)});
generate U ~ Uniform[0, 1];
(b) acceptance step:
set u_t^{(j)} = u_t if U ≤ f(y_t | u_t)/L(y_t); return to (a) otherwise;
2. set u^{(j)} = (u_1^{(j)}, ..., u_T^{(j)})'.

This leads to a sample u^{(1)}, ..., u^{(m)} from the posterior distribution used in the E- and M-steps of the MCEM algorithm at iteration k. Note again that all distributions are evaluated at their current parameter estimates β^{(k−1)} and ψ^{(k−1)} = (σ̂^{(k−1)}, ρ̂^{(k−1)}).
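The closed form for μ_t and τ_t² can be checked against the generic multivariate normal conditioning formula; the sketch below (ours, with arbitrary hypothetical values for T, σ, ρ and u) does exactly that.

```python
import numpy as np

# Sketch (ours): verify the closed-form full conditional N(mu_t, tau_t^2) for
# equally correlated random effects against generic multivariate normal
# conditioning, for one hypothetical (T, sigma, rho, u).
T, sigma, rho, t = 6, 1.5, 0.4, 2
rng = np.random.default_rng(3)
u = rng.normal(size=T)

# closed form from Section 3.4.2
mu_t = rho / (1 + (T - 2) * rho) * (u.sum() - u[t])
tau2_t = sigma**2 * (1 - (T - 1) * rho**2 / (1 + (T - 2) * rho))

# generic conditioning with Sigma = sigma^2 [(1 - rho) I + rho J]
Sigma = sigma**2 * ((1 - rho) * np.eye(T) + rho * np.ones((T, T)))
rest = [k for k in range(T) if k != t]
S12 = Sigma[t, rest]
S22inv = np.linalg.inv(Sigma[np.ix_(rest, rest)])
mu_chk = S12 @ S22inv @ u[rest]
tau2_chk = Sigma[t, t] - S12 @ S22inv @ S12

print(np.allclose(mu_t, mu_chk), np.allclose(tau2_t, tau2_chk))
```

Both comparisons agree to machine precision, for any t and any ρ for which the equicorrelation matrix is positive definite.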

3.5 A Simulation Study

We conducted a simulation study to evaluate the performance of the maximum

likelihood estimation algorithm, to evaluate the bias in the estimation of covariate

effects and variance components and to compare predicted random effects to the

ones used in the simulation of the data. To this end, we generated a time series

y_1, ..., y_T of T = 400 binary observations according to the model

    logit(π_t(u_t)) = α + βx_t + u_t    (3.8)

for the conditional log odds of success at time t, t = 1, ..., 400. For the simulation, we chose α = 1 and β = 1, where β is the regression coefficient for independent standard normal distributed covariates x_t ~ i.i.d. N(0, 1). The random effects u_1, ..., u_T are thought to arise from an unobserved latent autoregressive process u_{t+1} = ρu_t + ε_t, where the ε_t are i.i.d. N(0, σ²(1 − ρ²)), i.e., the u_t's have standard deviation σ and lag t correlation ρ^t. For the simulation of these autoregressive random effects, we used σ = 2 and ρ = 0.8. The resulting sample autocorrelation function of the realized random effects is pictured in Figure 3-4. The standard deviation and lag 1 correlation of the 400 realized values of u_1, ..., u_T are equal to 1.95 and 0.77. Note that conditional on the realized values of the u_t's, the y_t's are generated independently with log odds given by (3.8). The MCEM algorithm as described in Sections 2.3 and 3.3 for a logistic GLMM with autocorrelated random effects yielded the following maximum likelihood estimates for the fixed effects and variance components: α̂ = 0.94 (0.39), β̂ = 1.03 (0.22), as compared to the true values 1 and 1, and σ̂ = 2.25 (0.44) and ρ̂ = 0.74 (0.06), as compared to the realized values 1.95 and 0.77.
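The data-generating scheme just described can be sketched as follows (our own code, not the dissertation's; the seed is arbitrary):

```python
import numpy as np

# Sketch (ours) of the simulation design: T = 400 Bernoulli observations from
# logit(pi_t) = alpha + beta * x_t + u_t, with latent AR(1) random effects of
# standard deviation sigma and lag 1 correlation rho.
rng = np.random.default_rng(2004)
T, alpha, beta, sigma, rho = 400, 1.0, 1.0, 2.0, 0.8

x = rng.normal(0.0, 1.0, T)                         # i.i.d. N(0, 1) covariates
u = np.empty(T)
u[0] = rng.normal(0.0, sigma)                       # u_1 ~ N(0, sigma^2)
for t in range(T - 1):                              # u_{t+1} = rho*u_t + eps_t
    u[t + 1] = rho * u[t] + rng.normal(0.0, sigma * np.sqrt(1 - rho**2))

pi = 1.0 / (1.0 + np.exp(-(alpha + beta * x + u)))  # conditional P(y_t = 1)
y = rng.binomial(1, pi)                             # conditionally independent

# realized standard deviation and lag 1 correlation of the random effects
print(u.std(ddof=1).round(2), np.corrcoef(u[:-1], u[1:])[0, 1].round(2))
```

As in the text, the realized standard deviation and lag 1 correlation of one simulated path will differ somewhat from the nominal σ = 2 and ρ = 0.8.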

The algorithm converged after 71 iterations with a starting Monte Carlo sample size of 50 and a final Monte Carlo sample size of only 880, although estimated standard errors are based on a Monte Carlo sample size of 20,000. Convergence parameters were set to ε₁ = 0.003, c = 3, ε₂ = 0.005, ε₃ = −0.001, α = 1.03 and q = 1.05 (see Section 2.3.3). Regular GLM estimates were used as starting values for α and β, and starting values for σ and ρ were set to 1.5 and 0, respectively.

As will be described in Section 5.4.2, we estimated random effects through a Monte Carlo approximation of their posterior mean: û = E[u | y]. The scatter plot in Figure 3-3 shows good agreement in a comparison of the realized random effects u_1, ..., u_T from the simulation and the estimated random effects û_1, ..., û_T from the model. Note though that the standard deviation of the estimated random effects is equal to 1.60 (as compared to the true standard deviation of 1.95), showing that estimated random effects are less variable, a general shrinkage effect (compare the scales on the x and y axes of Figure 3-3) brought along by using posterior mean estimates. Also, a comparison of the autocorrelation and partial autocorrelation functions of the realized and estimated random effects in




Figure 3-3: Realized (simulated) random effects u_1, ..., u_T versus estimated random effects û_1, ..., û_T.

Figure 3-4 reveals some differences, due to the fact that estimated random effects are based on the posterior distribution of u | y. Therefore, estimated random effects are only of limited use in checking assumptions about the true random effects. Only when their behavior is grossly unexpected compared to the assumed structure of the underlying latent random process may they serve as an indication of model inappropriateness. Related remarks are given by Verbeke and Molenberghs (2000), who generate data in a linear mixed model assuming a mixture of two normal distributions for the random effects, resulting in a bimodal distribution. There, too, the plot of posterior mean estimates of the random effects from a model that misspecified the random effects distribution does not reveal that anything went wrong.

We repeated the above simulation 100 times, using the same specifications, starting values and convergence criteria as mentioned above. Each of the 100

generated binary time series of length 400 was fit using the MCEM algorithm.

Table 3-1 shows the average (over the 100 generated time series) of the fixed



Figure 3-4: Comparing simulated and estimated random effects.
Sample autocorrelation (first row) and partial autocorrelation (second row) functions for realized (simulated) random effects u_1, ..., u_T (first column) and estimated random effects û_1, ..., û_T (second column).

parameter and variance component estimates and their average estimated standard

errors. On average, the GLMM estimates of the fixed effects α and β and the variance components are very close to the true parameters, although the true lag 1 correlation of the random effects is underestimated by 6.3%. Table 3-1 also displays, in parentheses, the standard deviations of all estimated parameters over the 100 replications. Comparing these to the theoretical estimates of the asymptotic standard errors, we see good agreement. This suggests that the procedure for finding standard errors that we described and implemented (via Louis's (1982) formula) in our MCEM algorithm works well. In 5 (5%) out of the 100 simulations, the

approximation of the asymptotic covariance matrix by Monte Carlo methods

resulted in a negative definite matrix. For these simulations, a larger Monte Carlo

sample after convergence of the MCEM algorithm (the default was 20,000) might

be necessary.

It is also interesting to note that of the 95 simulations with positive definite covariance matrix, 6 (6.3%) resulted in a non-significant (based on a 5% level Wald test) estimate of the regression coefficient β under the GLMM with autoregressive random effects, while none was declared non-significant with the GLM approach. Estimates and standard errors for a corresponding GLM fit are also provided in Table 3-1. The average Monte Carlo sample size at the final iteration of the MCEM algorithm was 1200, although highly dispersed, ranging from 210 to 21,000.
210 to 21,000. The average computation time (on a mobile Pentium III, 600 MHz

processor with 256MB RAM) to convergence, including estimating the covariance

matrix, was 73 minutes.

We ran two other simulation studies, now with a shorter length of only

T = 100 observations, and a true lag 1 correlation of 0.6 and -0.8, respectively.

All other parameters remained unchanged. These results are also summarized in

Table 3-1. Again, we observe that the estimated parameters are very close to the

Table 3-1: A simulation study for a logistic GLMM with autoregressive random effects.

              α       β       σ       ρ       s.e.(α̂)  s.e.(β̂)  s.e.(σ̂)  s.e.(ρ̂)
True:         1       1       2       0.8     (T = 400)
GLM:          0.64    0.63                    0.11     0.12
              (0.20)  (0.14)                  (0.01)   (0.01)
GLMM:         1.07    1.02    2.08    0.75    0.37     0.26     0.61     0.10
              (0.32)  (0.20)  (0.25)  (0.06)  (0.12)   (0.23)   (0.40)   (0.05)

True:         1       1       2       0.6     (T = 100)
GLM:          0.69    0.70                    0.23     0.25
              (0.38)  (0.27)                  (0.02)   (0.03)
GLMM:         1.09    1.07    1.99    0.51    0.58     0.47     1.35     0.26
              (0.58)  (0.37)  (0.39)  (0.20)  (0.33)   (0.27)   (1.15)   (0.18)

True:         1       1       2       -0.8    (T = 100)
GLM:          0.65    0.61                    0.22     0.24
              (0.21)  (0.26)                  (0.01)   (0.03)
GLMM:         1.04    0.96    2.00    -0.75   0.42     0.51     1.04     0.16
              (0.29)  (0.34)  (0.53)  (0.13)  (0.32)   (0.99)   (1.04)   (0.13)

Average and standard deviation (in parentheses) of fixed effects, variance components and their standard error estimates from a GLM and a GLMM with latent AR(1) process. The two models were fitted to each of 100 generated binary time series of length T = 400 and T = 100.

true ones, but on average the correlation was underestimated by 15% and 6.3%, respectively. However, the sampling errors of the correlation parameters (shown in parentheses in Table 3-1) were large enough to include the true values.

Since our methods are general enough to handle unequally spaced data, we repeated the first simulation with a time series of T = 400 binary observations, but now randomly deleted 10% of the observations to create random gaps in the series. We left all parameters and the model for the conditional odds unchanged, except that we now assume that the random effects follow the latent autoregressive process u_{t+1} = ρ^{d_t} u_t + ε_t, where the ε_t are i.i.d. N(0, σ²(1 − ρ^{2d_t})) and d_t is the difference (in the units of measurement) between the time points associated with the observations at times t and t + 1. For example, the first series we generated had

Table 3-2: Simulation study for modeling unequally spaced binary time series.

              α       β       σ       ρ       s.e.(α̂)  s.e.(β̂)  s.e.(σ̂)  s.e.(ρ̂)
True:         1       1       2       0.8     (T = 360, unequally spaced)
GLM:          0.61    0.62                    0.12     0.12
              (0.19)  (0.11)                  (0.00)   (0.01)
GLMM:         1.03    1.00    2.07    0.75    0.38     0.28     0.71     0.11
              (0.29)  (0.16)  (0.25)  (0.06)  (0.19)   (0.28)   (0.80)   (0.18)

Average and standard deviation (in parentheses) of fixed effects, variance components and their standard error estimates from a GLM and a GLMM with latent autoregressive random effects accounting for unequally spaced observations. The two models were fitted to each of 100 generated binary time series of length T = 360, with random gaps of random length between observations.

1 gap of length three (i.e., dt = 4 for one t), 4 gaps of length two (i.e., dt = 3 for 4
t's) and 29 gaps of length one (i.e., dt = 2 for 29 t's). For all other t's, dt = 1, i.e.,

they are successive observations and the difference between two of them is one unit
of measurement.
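In practice the gap structure is read off the retained observation times; a one-line sketch (ours, with a hypothetical set of retained times):

```python
import numpy as np

# Sketch (ours): recover the gaps d_t from the observation times that remain
# after deletion. The times below are hypothetical.
times = np.array([1, 2, 3, 5, 6, 7, 10, 11])   # retained observation times
d = np.diff(times)                              # d_t, t = 1, ..., T-1
print(d)                                        # [1 1 2 1 1 3 1]
```

Here the single d_t = 2 corresponds to a gap of length one, and d_t = 3 to a gap of length two, in the terminology used above.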

Simulation results are shown in Table 3-2 and reveal that our proposed methods and algorithm also work well for an unequally spaced binary time series. All true parameters are included in confidence intervals based on the average of the estimated parameters from the 100 replicated series and its standard deviation (shown in parentheses in Table 3-2).

So far we have discussed models for discrete-valued time series data in a very broad manner. In Chapter 2, we developed the likelihood for our models based on generic distributions f(y|u) for observations y and g(u) for random effects u and presented an algorithm for finding maximum likelihood estimates. Chapter 3 looked at two special cases of random effects distributions useful for describing temporal or spatial dependencies. In this chapter we make specific distributional assumptions about the observations and develop some theory underlying the models we propose. We will pay special attention to data in the form of a single (sometimes considered generic) time series Y = (Y_1, ..., Y_T)' and derive marginal properties implied by the conditional model formulation. Multiple, independent time series Y_1, ..., Y_n can result from replication of the original time series or from stratification of the sampled population, such as in the example about homosexual relationships. All derivations given below for a generic time series Y still hold for the i-th series Y_i = (Y_{i1}, Y_{i2}, ..., Y_{iT})', provided the same latent process {u_t} is assumed to underlie each one of them.

An important characteristic of any time series model is its implied serial

dependency structure. In the case of normal theory time series models, this is

specified by the autocorrelation function. In Section 4.1 we derive the implied

marginal autocorrelation function for GLMMs with normal random components

and either an equal correlation or autoregressive assumption for the random effects.

With these assumptions, our models are special cases of linear mixed models

discussed for instance in Diggle et al. (2002). In Sections 4.2 and 4.3 we explore

marginal properties of GLMMs with Poisson and binomial random components

that are induced by assuming equally correlated or autoregressive random effects.

In Chapter 5, these model properties such as the implied autocorrelation function

are then compared to empirical counterparts based on the observed data to

evaluate the proposed model.

Section 2.1 mentioned that parameters in GLMMs have a conditional inter-

pretation, controlling for the random effects. Correlated random effects vary over

time and parameter interpretation is different from having just one common level of

a random effect, as in many standard random intercepts GLMMs. For each of the

models presented here, we discuss parameter interpretation in a separate section.

4.1 Analysis for a Time Series of Normal Observations

Suppose that conditional on time-specific normal random effects {u_t}, observations {Y_t} are independent N(μ_t + u_t, τ²). The marginal likelihood for this model is tractable, because marginally the joint distribution of {Y_t} is multivariate normal with mean μ = (μ_1, ..., μ_T)' and covariance matrix Σ_u + τ²I, where Σ_u is the covariance matrix of the joint distribution of {u_t}. With the usual assumption that var(u_t) = σ², the marginal variance of Y_t is given by

    var(Y_t) = τ² + σ²

and the marginal correlation function ρ(t, t*) for the case of equally correlated random effects (cf. Section 3.2) has form

    ρ(t, t*) = corr(Y_t, Y_{t*}) = σ²ρ / (τ² + σ²),    (4.1)

while for the case of autocorrelated random effects (cf. Section 3.3), it has form

    ρ(t, t*) = corr(Y_t, Y_{t*}) = [ σ²/(τ² + σ²) ] ρ^{Σ_{k=t}^{t*−1} d_k}.    (4.2)

If the distances between time points are equal, then (4.2) is more conveniently written in terms of the lag h between observations as

    ρ(h) = corr(Y_t, Y_{t+h}) = [ σ²/(τ² + σ²) ] ρ^h.

For both cases, note that the autocorrelations (4.1) and (4.2) are smaller than the corresponding ones assumed for the underlying latent process {u_t} by a factor of σ²/(τ² + σ²). For equally correlated random effects, the marginal covariance matrix has form τ²I + σ²[(1 − ρ)I + ρJ], implying equal marginal correlations between any two members Y_t and Y_{t*} of {Y_t}. (This can also be seen from (4.1), where the autocorrelations do not depend on t or t*.) Diggle et al. (2002, Sec. 5.2.2) call this a model with serial correlation plus measurement error.
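The attenuation by the factor σ²/(τ² + σ²) is easy to verify by simulation. The sketch below (ours, with hypothetical parameter values and μ_t = 0) compares empirical marginal autocorrelations of Y_t = u_t + ε_t against the values implied by (4.2):

```python
import numpy as np

# Sketch (ours): the marginal autocorrelation of Y_t = u_t + eps_t equals the
# latent AR(1) autocorrelation rho**h shrunk by sigma^2 / (tau^2 + sigma^2).
rng = np.random.default_rng(7)
tau, sigma, rho, T, n = 1.0, 2.0, 0.8, 5, 200_000

# latent AR(1) effects with stationary variance sigma^2
u = np.empty((n, T))
u[:, 0] = rng.normal(0.0, sigma, n)
for t in range(T - 1):
    u[:, t + 1] = rho * u[:, t] + rng.normal(0.0, sigma * np.sqrt(1 - rho**2), n)
Y = u + rng.normal(0.0, tau, (n, T))          # add measurement error

shrink = sigma**2 / (tau**2 + sigma**2)
for h in (1, 2, 3):
    emp = np.corrcoef(Y[:, 0], Y[:, h])[0, 1]
    print(h, round(emp, 3), round(shrink * rho**h, 3))
```

With σ = 2 and τ = 1, the latent correlations are shrunk by the factor 4/5, so the lag 1 marginal correlation is 0.8 × 0.8 = 0.64 rather than 0.8.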

Similar properties can be observed in the case of autocorrelated random effects: the basic structure of correlations decaying in absolute value with increasing distance between observation times (as measured by Σ_k d_k or h) is preserved marginally. However, the first-order Markov property of the underlying autoregressive process is not preserved in the marginal distribution of {Y_t}, which can be shown by calculating conditional distributions. For instance, for three (T = 3) equidistant time points, the conditional mean of Y_3 given Y_1 = y_1 and Y_2 = y_2 is equal to

    E[Y_3 | y_1, y_2] = μ_3 + σ²ρ / ((τ² + σ²)² − σ⁴ρ²) ( τ²ρ(y_1 − μ_1) + [τ² + σ²(1 − ρ²)](y_2 − μ_2) )

and depends on y_1.

It should be noted that in the case of independent random effects with Σ_u = σ²I, marginally the Y_t's are also independent, but with overdispersed variances τ² + σ² relative to their conditional distribution. This case can be seen as a special case of the equally correlated model and the autoregressive model when ρ = 0.
The traditional assumption in random intercepts models is a common random effect u_t = u for all time points t; i.e., conditional on a N(0, σ²) random effect u, Y_t is N(μ_t + u, τ²) for t = 1, ..., T. For this case, the marginal covariance matrix has form τ²I + σ²J. This can be derived directly or inferred from the marginal correlation expressions (4.1) and (4.2) by setting ρ = 1, implying perfect correlation among the {u_t}. Hence, the random intercepts model is a special case of the equally correlated or autoregressive model when ρ = 1. It implies a constant (exchangeable) marginal correlation of σ²/(τ² + σ²) between any two observations Y_t and Y_{t*}.

4.1.1 Analysis via Linear Mixed Models

In a GLMM, we try to provide some structure for the unknown mean component μ_t by using covariates x_t. Let x_t'β be a linear predictor for μ_t, with β denoting a fixed effects parameter vector for the covariates x_t. Using an identity link, the series {Y_t} then follows a GLMM with conditional mean function E[Y_t | u_t] = x_t'β + u_t. The model can be written as Y_t = x_t'β + u_t + ε_t, where ε_t ~ i.i.d. N(0, τ²), independent of u_t. Then, the models discussed here are special cases of mixed effects models (Verbeke and Molenberghs, 2000) with general matrix form

    Y = Xβ + Zu + ε.

In our case, Y = (Y_1, ..., Y_T)' is the time series vector and X = (x_1, ..., x_T)' is the overall design matrix with associated parameter β. The design matrix Z for the random effects u = (u_1, ..., u_T)' simplifies to the identity matrix I_T. The distributional assumption on the random effects is u ~ N(0, Σ_u), and they are independent of the N(0, τ²I)-distributed errors ε. Exploiting this relationship, software for fitting models of this kind (i.e., correlated normal data with structured covariance matrix of form var(Y) = ZΣ_uZ' + τ²I) is readily available, for instance in the form of the SAS procedure proc mixed, where the equal correlation structure and the autoregressive structure are only two out of many possible choices for the covariance matrix Σ_u of the random effects distribution.
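The text fits such models with SAS proc mixed; purely to illustrate the covariance structures involved, here is a small sketch (ours) that assembles var(Y) = ZΣ_uZ' + τ²I for the two choices of Σ_u discussed above:

```python
import numpy as np

# Sketch (ours): assemble the marginal covariance var(Y) = Z Sigma_u Z' + tau^2 I
# of the linear mixed model; with Z = I_T this is just Sigma_u + tau^2 I.
def marginal_cov(T, tau, sigma, rho, structure):
    if structure == "equal":          # Sigma_u = sigma^2 [(1 - rho) I + rho J]
        Su = sigma**2 * ((1 - rho) * np.eye(T) + rho * np.ones((T, T)))
    elif structure == "ar1":          # Sigma_u[t, s] = sigma^2 * rho**|t - s|
        idx = np.arange(T)
        Su = sigma**2 * rho ** np.abs(idx[:, None] - idx[None, :])
    else:
        raise ValueError(structure)
    Z = np.eye(T)                     # each time point has its own random effect
    return Z @ Su @ Z.T + tau**2 * np.eye(T)

V = marginal_cov(T=4, tau=1.0, sigma=2.0, rho=0.5, structure="ar1")
```

For these values, the diagonal entries of V equal τ² + σ² = 5 and the lag 1 entries equal σ²ρ = 2, so corr(Y_t, Y_{t+1}) = 2/5 = 0.4, the attenuated value given by (4.2).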
Mixed effects models are very popular for the regression analysis of shorter
time series, like growth curve models or data from longitudinal studies. In Sec-
tion 5.1, we illustrate an application by analyzing the motivating example of
Section 3.1 about attitudes towards homosexual relationships, based on a normal
approximation to the log odds.
4.1.2 Parameter Interpretation
Parameters in normal time series models retain their interpretation when
averaging over the random effects distribution. The interpretation of 3 as the
change in the mean for a change in the covariates is valid conditional on random
effects and also marginally. The random effects parameters only contribute to the
variance-covariance structure of the marginal distribution, inducing overdispersion
and correlation relative to the conditional assumptions.

4.2 Analysis for a Time Series of Counts
Suppose now that conditional on time-specific normal random effects {u_t}, observations {Y_t} are independent counts, which we model as Poisson random variables with mean μ_t. Using a log link, explanatory variables x_t and correlated random effects {u_t}, we specify the conditional mean structure of a Poisson GLMM as

    log(μ_t) = x_t'β + u_t, t = 1, ..., T.    (4.3)

The correlation in the random effects allows the log-means to be correlated over time or space. The marginal likelihood corresponding to this model is given by

    L(β, ψ; y) ∝ ∫_{R^T} Π_{t=1}^T μ_t^{y_t} exp{−μ_t} g(u; ψ) du

               = ∫_{R^T} Π_{t=1}^T exp[ y_t(x_t'β + u_t) − exp{x_t'β + u_t} ] g(u; ψ) du,

where g(u; ψ) is one of the random effects distributions of Chapter 3. In that case, the integral is not tractable, and numerical methods such as the MCEM algorithm of Section 2.3 must be used to find maximum likelihood estimates for β and ψ. For this, the function Q1 defined in (2.16) has form

    Q1 = (1/m) Σ_{j=1}^m Σ_{t=1}^T [ y_t(x_t'β + u_t^{(j)}) − exp{x_t'β + u_t^{(j)}} ],

where u_t^{(j)} is the t-th element of the j-th generated sample u^{(j)} from the posterior distribution h(u | y; β^{(k−1)}, ψ^{(k−1)}). Note that here we discuss only the case of a generic time series {Y_t} with no replication, hence n = 1 (i.e., index i is redundant) and n_1 = T in the general form presented in (2.16). If replications are available, or in the case where two time series differ in the fixed effects part but not in the random effects (e.g., have the same underlying latent process), then one simply needs to include the sum over the replicates as indicated in (2.16). Choosing one of the correlated random effects distributions of Chapter 3, the Gibbs sampling algorithms developed in Sections 3.4.1 or 3.4.2 can be used to generate the sample from h(u | y), with f(y_t | u_t) having the form of a Poisson density with mean μ_t.

4.2.1 Marginal Model Implied by the Poisson GLMM

As with the normal GLMMs before, marginal first and second moments can

be obtained by integrating over the random effects distribution, although here

the complete marginal distribution of Yt is not tractable as it is in the normal

case. The random effects appearing in model (4.3) imply that the conditional log-means {log(μ_t)} are random quantities. Assuming that the random effects {u_t} are normal with zero mean and variance var(u_t) = σ², the log-means have expectations {x_t'β} and variance σ². For two distinct time points t and t*, their correlation under an independence, equal correlation or autocorrelation assumption on the random effects is given by 0, ρ or ρ^{Σ_{k=t}^{t*−1} d_k}, respectively. (Remember that d_k denoted the time difference between two successive observations y_k and y_{k+1}.) On the original scale, the means have expectation, variance and correlation given by

    E[μ_t] = exp{x_t'β + σ²/2}

    var(μ_t) = exp{2(x_t'β + σ²/2)} ( e^{σ²} − 1 )

    corr(μ_t, μ_{t*}) = ( e^{cov(u_t, u_{t*})} − 1 ) / ( e^{σ²} − 1 ).

Plugging in cov(u_t, u_{t*}) = 0, σ²ρ or σ²ρ^{Σ_{k=t}^{t*−1} d_k} yields the marginal correlations among the means when assuming independent, equally correlated or autoregressive random effects, respectively.

Marginal distribution of Y_t

Now let us turn to the marginal distribution of Y_t itself, for which we can only derive moments. The marginal mean and variance of Y_t are given by

    E[Y_t] = E[μ_t] = exp{x_t'β + σ²/2}    (4.4)

    var(Y_t) = E[μ_t] + var(μ_t) = E[Y_t] [ 1 + E[Y_t]( e^{σ²} − 1 ) ].

Hence, the log of the marginal mean still follows a linear model with fixed effects parameters β, but with an additional offset σ²/2 added to the intercept term. (This is not particular to the Poisson assumption, but is true for any loglinear random effects model of form (4.3) with more general random effects structure z_t'u_t; see Problem 13.42 in Agresti, 2002.) The marginal distribution of Y_t is not Poisson, since the variance exceeds the mean by the factor [1 + E[Y_t](e^{σ²} − 1)]. The marginal variance is a quadratic function of the marginal mean.
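These moment formulas are easy to verify by Monte Carlo; the sketch below (ours, with a hypothetical linear predictor x'β = 0.5 and σ = 0.7) simulates from the Poisson-lognormal mixture and compares sample moments with (4.4):

```python
import numpy as np

# Sketch (ours): Monte Carlo check of the marginal Poisson GLMM moments (4.4),
# using a hypothetical linear predictor x'beta = 0.5 and sigma = 0.7.
rng = np.random.default_rng(11)
xb, sigma, n = 0.5, 0.7, 2_000_000

u = rng.normal(0.0, sigma, n)             # random effects u ~ N(0, sigma^2)
y = rng.poisson(np.exp(xb + u))           # Y | u ~ Poisson(exp(x'beta + u))

mean_theory = np.exp(xb + sigma**2 / 2)
var_theory = mean_theory * (1 + mean_theory * (np.exp(sigma**2) - 1))
print(round(y.mean(), 3), round(mean_theory, 3))
print(round(y.var(), 3), round(var_theory, 3))
```

The sample variance visibly exceeds the sample mean, reflecting the overdispersion factor [1 + E[Y_t](e^{σ²} − 1)] discussed above.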
For two distinct time points t and t*, the marginal covariance between observations Y_t and Y_{t*} is given by

    cov(Y_t, Y_{t*}) = cov(μ_t, μ_{t*})

                     = E[Y_t] E[Y_{t*}] ( e^{cov(u_t, u_{t*})} − 1 )    (4.5)

                     = exp{ (x_t + x_{t*})'β + σ² } ( e^{cov(u_t, u_{t*})} − 1 )