
Citation 
 Permanent Link:
 http://ufdc.ufl.edu/AA00003585/00001
Material Information
 Title:
 Regression models for discrete-valued time series data
 Creator:
 Klingenberg, Bernhard
 Publication Date:
 2004
 Language:
 English
 Physical Description:
 xii, 177 leaves : ill. ; 29 cm.
Subjects
 Genre:
 bibliography ( marcgt )
 theses ( marcgt )
 nonfiction ( marcgt )
Notes
 Thesis:
 Thesis (Ph.D.)--University of Florida, 2004.
 Bibliography:
 Includes bibliographical references.
 General Note:
 Printout.
 General Note:
 Vita.
 Statement of Responsibility:
 by Bernhard Klingenberg.
Record Information
 Source Institution:
 University of Florida
 Holding Location:
 University of Florida
 Rights Management:
 Copyright Bernhard Klingenberg. Permission granted to the University of Florida to digitize, archive and distribute this item for nonprofit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
 Resource Identifier:
 003100752 ( ALEPH )
706801267 ( OCLC )

REGRESSION MODELS FOR
DISCRETE-VALUED TIME SERIES DATA
By
BERNHARD KLINGENBERG
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2004
Copyright 2004
by
Bernhard Klingenberg
To Sophia and Jean-Luc Picard
ACKNOWLEDGMENTS
I would like to express my sincere gratitude to Drs. Alan Agresti and James
Booth for their guidance and assistance with my dissertation research and for
their support throughout my years at the University of Florida. During the past
three years as a research assistant for Dr. Agresti, I gained valuable experience in
conducting statistical research and writing scholarly papers, for which I am very
grateful. I would also like to thank Dr. Ramon Littell, who guided me through a
year of invaluable statistical consulting experience at IFAS, Dr. George Casella,
who taught me Monte Carlo methods, and Drs. Jeff Gill and Michael Martinez for
serving on my committee. My gratitude extends to all the faculty and present and
former graduate students of the Department, among them, in alphabetical order,
Brian Caffo, Sounak Chakraborty, Dr. Herwig Friedl, Ludwig Heigenhauser, David
Hitchcock, Wolfgang Jank, Galin Jones, Ziyad Mahfoud, Siuli Mukhopadhyay and
Brian Stephens.
I would like to thank my family, foremost my wife Sophia and my daughter
Franziska, for all their support, light and joy they bring into my life and for
providing me with energy and fulfillment.
Lastly, this dissertation would not have been written in English had it not
been for the countless adventures of the Starship Enterprise and its Captain,
Jean-Luc Picard, whose episodes kept me glued to the TV in Austria and helped to
sufficiently improve my knowledge of the English language.
TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Regression Models for Correlated Discrete Data
  1.2 Marginal Models
    1.2.1 Likelihood Based Estimation Methods
    1.2.2 Quasi-Likelihood Based Estimation Methods
  1.3 Transitional Models
    1.3.1 Model Fitting
    1.3.2 Transitional Models for Time Series of Counts
    1.3.3 Transitional Models for Binary Data
  1.4 Random Effects Models
    1.4.1 Correlated Random Effects in GLMMs
    1.4.2 Other Modeling Approaches
  1.5 Motivation and Outline of the Dissertation

2 GENERALIZED LINEAR MIXED MODELS
  2.1 Definition and Notation
    2.1.1 Generalized Linear Mixed Models for Univariate Discrete Time Series
    2.1.2 State Space Models for Discrete Time Series Observations
    2.1.3 Structural Similarities Between State Space Models and GLMMs
    2.1.4 Practical Differences
  2.2 Maximum Likelihood Estimation
    2.2.1 Direct and Indirect Maximum Likelihood Procedures
    2.2.2 Model Fitting in a Bayesian Framework
    2.2.3 Maximum Likelihood Estimation for State Space Models
  2.3 The Monte Carlo EM Algorithm
    2.3.1 Maximization of Q_m
    2.3.2 Generating Samples from h(u | y; β, ψ)
    2.3.3 Convergence Criteria

3 CORRELATED RANDOM EFFECTS
  3.1 A Motivating Example: Data from the General Social Survey
    3.1.1 A GLMM Approach
    3.1.2 Motivating Correlated Random Effects
  3.2 Equally Correlated Random Effects
    3.2.1 Definition of Equally Correlated Random Effects
    3.2.2 The M-step with Equally Correlated Random Effects
  3.3 Autoregressive Random Effects
    3.3.1 Definition of Autoregressive Random Effects
    3.3.2 The M-step with Autoregressive Random Effects
  3.4 Sampling from the Posterior Distribution Via Gibbs Sampling
    3.4.1 A Gibbs Sampler for Autoregressive Random Effects
    3.4.2 A Gibbs Sampler for Equally Correlated Random Effects
  3.5 A Simulation Study

4 MODEL PROPERTIES FOR NORMAL, POISSON AND BINOMIAL OBSERVATIONS
  4.1 Analysis for a Time Series of Normal Observations
    4.1.1 Analysis via Linear Mixed Models
    4.1.2 Parameter Interpretation
  4.2 Analysis for a Time Series of Counts
    4.2.1 Marginal Model Implied by the Poisson GLMM
    4.2.2 Parameter Interpretation
  4.3 Analysis for a Time Series of Binomial or Binary Observations
    4.3.1 Marginal Model Implied by the Binomial GLMM
    4.3.2 Approximation Techniques for Marginal Moments
    4.3.3 Parameter Interpretation

5 EXAMPLES OF COUNT, BINOMIAL AND BINARY TIME SERIES
  5.1 Graphical Exploration of Correlation Structures
    5.1.1 The Variogram
    5.1.2 The Lorelogram
  5.2 Normal Time Series
  5.3 Analysis of the Polio Count Data
    5.3.1 Comparison of AR-GLMMs to Other Approaches
    5.3.2 A Residual Analysis for the AR-GLMM
  5.4 Binary and Binomial Time Series
    5.4.1 Old Faithful Geyser Data
    5.4.2 Oxford versus Cambridge Boat Race Data

6 SUMMARY, DISCUSSION AND FUTURE RESEARCH
  6.1 Cross-Sectional Time Series
  6.2 Univariate Time Series
    6.2.1 Clipping of Time Series
    6.2.2 Longitudinal Data
  6.3 Extensions and Further Research
    6.3.1 Alternative Random Effects Distribution
    6.3.2 Topics in GLMM Research

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

3-1 A simulation study for a logistic GLMM with autoregressive random effects.
3-2 Simulation study for modeling unequally spaced binary time series.
5-1 Comparing estimates from two models for the log odds.
5-2 Parameter estimates for the polio data.
5-3 Autocorrelation functions for the Old Faithful geyser data.
5-4 Comparison of observed and expected counts for the Old Faithful geyser data.
5-5 Maximum likelihood estimates for boat race data.
5-6 Observed and expected counts of sequences of wins (W) and losses (L) for the Cambridge University team.
5-7 Estimated random effects fit for the last 30 years for the boat race data.
5-8 Estimated probabilities of a Cambridge win in 2004, given the past s + 1 outcomes of the race.
LIST OF FIGURES

2-1 Plot of the typical behavior of the Monte Carlo sample size m^(k) and the Q-function Q^(k) through MCEM iterations.
3-1 Sampling proportions from the GSS data set.
3-2 Iteration history for selected parameters and their asymptotic standard errors for the GSS data.
3-3 Realized (simulated) random effects u_1, ..., u_T versus estimated random effects û_1, ..., û_T.
3-4 Comparing simulated and estimated random effects.
4-1 Approximated marginal probabilities for the fixed part predictor value x'β ranging from -4 to 4 in a logit model.
4-2 Comparison of conditional logit and probit model based probabilities.
4-3 Comparison of implied marginal probabilities from logit and probit models.
5-1 Empirical standard deviations of the estimated log odds of favoring homosexual relationships, by race.
5-2 Plot of the polio data.
5-3 Iteration history for the polio data.
5-4 Residual autocorrelations for the polio data.
5-5 Residual autocorrelations with outlier adjustment for the polio data.
5-6 Autocorrelation functions for the Old Faithful geyser data.
5-7 Lorelogram for the Old Faithful geyser data.
5-8 Plot of the Oxford vs. Cambridge boat race data.
5-9 Variogram for the Oxford vs. Cambridge boat race data.
5-10 Lorelogram for the Oxford vs. Cambridge boat race data.
5-11 Path plots of fixed and random effects parameter estimates for the boat race data.
6-1 Association graphs for GLMMs.
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
REGRESSION MODELS FOR
DISCRETE-VALUED TIME SERIES DATA
By
Bernhard Klingenberg
August 2004
Chair: Alan G. Agresti
Cochair: James G. Booth
Major Department: Statistics
Independent random effects in generalized linear models induce an exchangeable correlation structure, but long sequences of counts or binomial observations typically show correlations decaying with increasing lag. This dissertation introduces models with autocorrelated random effects for a more appropriate, parameter-driven analysis of discrete-valued time series data. We present a Monte Carlo EM algorithm with Gibbs sampling to jointly obtain maximum likelihood estimates of regression parameters and variance components. Marginal mean, variance and correlation properties of the conditionally specified models are derived for Poisson, negative binomial and binary/binomial random components. They are used for constructing goodness-of-fit tables and checking the appropriateness of the modeled correlation structure. Our models define a likelihood, and hence estimation of the joint probability of two or more events is possible and used in predicting future responses. Also, all methods are flexible enough to allow for multiple gaps or missing observations in the observed time series. The approach is illustrated with the analysis of a cross-sectional study over 30 years, where only observations from 16 unequally spaced years are available, a time series of 168 monthly counts of polio infections, and two long binary time series.
CHAPTER 1
INTRODUCTION
Correlated discrete data arise in a variety of settings in the biomedical,
social, political or business sciences whenever a discrete response variable is
measured repeatedly. Examples are time series of counts or longitudinal studies
measuring a binary response. Correlations between successive observations arise
naturally through a time, space or some other cluster-forming context and have
to be incorporated in any inferential procedure. Standard regression models
for independent data can be expanded to accommodate such correlations. For
continuous-type responses, the normal linear mixed effects model offers such a
flexible framework and has been well studied in the past. A recent reference is
Verbeke and Molenberghs (2000), who also discuss computer software for fitting
linear mixed effects models with popular statistical packages. Although the normal
linear mixed effects model is but one member of the broader class of generalized
linear mixed effects models, it enjoys unique properties which simplify parameter
estimation and interpretation substantially. For discrete response data, however,
the normal distribution is not appropriate, and other members in the exponential
family of distributions have to be considered.
1.1 Regression Models for Correlated Discrete Data
In this introduction we will review extensions of the basic generalized linear
model (McCullagh and Nelder, 1989) for analyzing independent observations to
models for correlated data. These models are marginal (Section 1.2), transitional
(Section 1.3) and random effects models (Section 1.4). An extensive discussion of
these models with respect to discrete longitudinal data is given in the books by
Agresti (2002) and Diggle, Heagerty, Liang and Zeger (2002). In general, longitudinal studies concern only a few repeated measurements. In this dissertation, however, we are interested in the analysis of much longer series of repeated observations, often exceeding 100 repeated measurements. Therefore, the following review
focuses specifically on models for univariate time series observations, some of which
are presented in Fahrmeir and Tutz (2001).
Let y_t be a response at time t, t = 1, ..., T, observed together with a vector of covariates denoted by x_t. In a generalized linear model (GLM), the mean μ_t = E[y_t] of observation y_t depends on a linear predictor η_t = x_t'β through a link function h(·), forming the relationship μ_t = h^{-1}(x_t'β). The variance of y_t depends on the mean through the relationship var(y_t) = φ_t v(μ_t), where v(·) is a distribution-specific variance function and {φ_t} are additional dispersion parameters. In a regular GLM, observations at any two distinct time points t and t* are assumed independent.
In the models discussed below, the type of extension to accommodate correlated data depends on the way the correlation is introduced into the model. In marginal models, the correlation can be specified directly, e.g., corr(y_t, y_{t*}) = ρ, or left completely unspecified, but nonetheless accounted for in likelihood based and non-likelihood based inferences. In transitional models, correlation is introduced by including previous observations in the linear predictor, e.g., η_t = x̃_t'β̃, where x̃_t = (x_t', y_{t-1}, y_{t-2}, ...)' and β̃ = (β', α_1, α_2, ...)' are extensions of the design and parameter vector of a GLM with independent components. Random effects models induce correlation between observations by including random effects rather than previous observations in the linear predictor, e.g., η_t = x_t'β + u, where u is a random effect shared by all observations.
The way correlation is built into a model also determines the type of inference.
Typically, marginal models are fitted by a quasi-likelihood approach, estimation in
transitional models is based on a conditional or partial likelihood, and inference
in random effects models relies on a full likelihood (possibly Bayesian) approach.
However, models and inferential procedures have been developed that allow more
flexibility than the above categorization.
1.2 Marginal Models
In marginal regression models, the main scientific goal is to assess the influence
of covariates on the marginal mean of yt, treating the association structure between
repeated observations as a nuisance. The marginal mean μ_t and variance var(y_t) are modeled separately from a correlation structure between two observations y_t and y_{t*}. Regression parameters in the linear predictor are called population-averaged parameters, because their interpretation is based on an average over all individuals in a specific covariate subgroup. Due to the correlation among repeated observations, the likelihood for the model refers to the joint distribution of all observations and not to the simpler product of their marginal distributions. However, the model is specified in terms of these marginal distributions, which makes maximum likelihood fitting particularly hard for even a moderate number T of repeated measurements.
1.2.1 Likelihood Based Estimation Methods
For binary data, Fitzmaurice and Laird (1993) discuss a parametrization of the
joint distribution in terms of conditional probabilities and log odds ratios. These
parameters are related to the marginal mean and the same conditional log odds
ratios, which describe the higher order associations among the repeated responses.
The marginal mean and the higher order associations are then modeled in terms of orthogonal parameters β and α, respectively. Fitzmaurice and Laird (1993) present
an algorithm for maximizing the likelihood with respect to these two parameter
sets. The algorithm has been implemented in a freely available computer program
(MAREG) by Kastner et al. (1997).
Another approach to maximum likelihood fitting for longitudinal discrete
data regards the marginal model as a constraint on the joint distribution and
maximizes the likelihood subject to this constraint. The model is written in terms
of a generalized log-linear model C log(Aμ) = Xβ, where μ is a vector of expected
counts and A and C are matrices to form marginal counts and functions of those
marginal counts, respectively. With this approach, no specific assumption about
the correlation structure of repeated observations is made, and the likelihood
refers to the most general form for the joint distribution. However, simultaneous
modeling of the marginal distribution and a simplified joint distribution is also
possible. Details can be found in Lang and Agresti (1994) and Lang (1996). Lang
(2004) also offers an R computer program (mph.fit) for maximum likelihood fitting
of these very general marginal models.
1.2.2 Quasi-Likelihood Based Estimation Methods
The drawback of the two approaches mentioned above and likelihood based
methods in general is that they require enormous computing resources as the
number of repeated responses increases or the number of covariates is large, making
maximum likelihood fitting computationally impossible for long time series. This is
also true for estimation based on alternative parameterizations of a distribution for
multivariate binary data such as those discussed in Bahadur (1961), Cox (1972) or
Zhao and Prentice (1990).
Estimation methods leading to computationally simpler inference (albeit not maximum likelihood) for marginal models are based on a quasi-likelihood approach (Wedderburn, 1974). In a quasi-likelihood approach, no specific form
for the distribution of the responses is assumed and only the mean, variance and
correlation are specified. However, with discrete data, specifying the mean and
covariances does not determine the likelihood, as it would with normal data, so
parameter estimation cannot be based on it. Liang and Zeger (1986) proposed
generalized estimating equations (GEE) to estimate parameters, which have
the form of score equations for GLMs, but cannot be interpreted as such. Their
approach also requires the specification of a working correlation matrix for the
repeated responses. They show that if the mean function is correctly specified,
the solution to the generalized estimating equations is a consistent estimator,
regardless of the assumed variancecovariance structure for the repeated responses.
They also present an estimator of the asymptotic variancecovariance matrix
for the GEE estimates, which is robust against misspecification of the working
correlation matrix. Several structured working correlation matrices have been
proposed for parsimonious modeling of the marginal correlation, and some of them
are implemented in statistical software packages for GEE estimation (e.g., SAS's
proc genmod with the repeated statement and the type option, or the gee and geepack packages in R).
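To make the estimating-equations machinery concrete, here is a minimal numerical sketch (our own illustration, not from the text; all data-generating values are invented). It fits a log-linear marginal mean to clustered Poisson counts under a working-independence assumption and then computes the robust sandwich variance estimator, which is what keeps the standard errors valid when the working correlation is misspecified:

```python
import numpy as np

rng = np.random.default_rng(1)
n_clusters, m = 300, 5          # 300 clusters of 5 repeated counts
b0, b1, sd_u = 0.5, 0.3, 0.5    # hypothetical true parameters

# clustered Poisson counts: a shared cluster effect induces correlation
x = rng.normal(size=(n_clusters, m))
u = rng.normal(0.0, sd_u, size=(n_clusters, 1))
y = rng.poisson(np.exp(b0 + b1 * x + u))

# working-independence GEE for a log-linear marginal mean:
# solve sum_i X_i'(y_i - mu_i) = 0 by Fisher scoring
X = np.stack([np.ones_like(x), x], axis=-1)     # (clusters, m, 2)
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(np.einsum('imp,p->im', X, beta))
    score = np.einsum('imp,im->p', X, y - mu)
    info = np.einsum('imp,im,imq->pq', X, mu, X)
    beta += np.linalg.solve(info, score)

# sandwich (robust) variance B^{-1} M B^{-1}, summing scores by cluster
mu = np.exp(np.einsum('imp,p->im', X, beta))
S = np.einsum('imp,im->ip', X, y - mu)          # per-cluster score
M = S.T @ S
B_inv = np.linalg.inv(np.einsum('imp,im,imq->pq', X, mu, X))
robust_se = np.sqrt(np.diag(B_inv @ M @ B_inv))
print(beta, robust_se)
```

The slope estimate is consistent even though the working correlation ignores the within-cluster dependence; only the standard errors need the sandwich correction. Structured working correlations (exchangeable, AR-1) follow the same pattern with a non-diagonal weight matrix.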
1.2.2.1 GEE for time series of counts
Zeger (1988) uses the GEE methodology to fit a marginal model to a time series {y_t}_{t=1}^T of T = 168 monthly counts of cases of poliomyelitis in the United States. He specifies the marginal mean, variance and correlation by

\[
\mu_t = \exp(x_t'\beta), \qquad
\mathrm{var}(y_t) = \mu_t + \sigma^2 \mu_t^2, \qquad
\mathrm{corr}(y_t, y_{t+\tau}) = \frac{\rho(\tau)}{\left[\{1 + (\sigma^2\mu_t)^{-1}\}\{1 + (\sigma^2\mu_{t+\tau})^{-1}\}\right]^{1/2}}, \tag{1.1}
\]

where σ² is the variance and ρ(τ) the autocorrelation function of an underlying random process {u_t}. To fit this marginal model, he proposes and outlines the GEE approach, but notes that it requires inversion of the T × T variance-covariance matrix of y_1, ..., y_T, which has no recognizable structure and therefore no simple inverse. Subsequently, he suggests approximating this matrix by a simpler, structured matrix, leading to nearly as efficient estimators as would have been obtained with the GEE approach. The variance component σ² and unknown parameters in ρ(τ) are estimated by a method of moments approach.
Interestingly, Zeger (1988) derives the marginal mean, variance and correlation in (1.1) from a random effects model specification: Conditional on an underlying latent random process {u_t} with E[u_t] = 1 and cov(u_t, u_{t+τ}) = σ²ρ(τ), he initially models the time series observations as conditionally independent Poisson variables with mean and variance

\[
E[y_t \mid u_t] = \mathrm{var}(y_t \mid u_t) = \exp(x_t'\beta)\,u_t. \tag{1.2}
\]

Marginally, by the formula for repeated expectation, this leads to the moments presented in (1.1). From there we also see that the latent random process {u_t} has introduced both overdispersion relative to a Poisson variable and autocorrelation among the observations. The models we will develop in subsequent chapters have similar features. The equation for the marginal correlation between y_t and y_{t+τ} shows that the autocorrelation in the observed time series must be less than the autocorrelation in the latent process {u_t}. We will return to the polio data set in Chapter 5, where we compare this model to models suggested in this dissertation and elsewhere.
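The overdispersion implied by (1.1) is easy to check by simulation. The following sketch is our own illustration, not from the text: it uses i.i.d. gamma latent effects, an assumption chosen only because the gamma family conveniently gives E[u_t] = 1 and var(u_t) = σ² (it ignores the autocorrelation part of the model), and compares the sample variance of the counts with μ + σ²μ².

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma2 = 200_000, 5.0, 0.5   # hypothetical values

# latent effects with E[u] = 1, var(u) = sigma2 (gamma is one convenient choice)
u = rng.gamma(shape=1.0 / sigma2, scale=sigma2, size=n)

# conditionally independent Poisson counts as in (1.2), with constant mean mu
y = rng.poisson(mu * u)

print(y.mean())   # close to mu = 5
print(y.var())    # close to mu + sigma2 * mu**2 = 17.5
```

Replacing the i.i.d. u_t by an autocorrelated process with the same mean and variance, e.g. a suitably scaled log-normal AR(1) series, would additionally reproduce the damped autocorrelation in (1.1).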
1.2.2.2 GEE for binomial time series
For binary and binomial time series data, it is often more advantageous to
model the association between observations using the odds ratio rather than
directly specifying the marginal correlation corr(y_t, y_{t*}) as with count data.
The odds ratio is a more natural metric to measure association between binary
outcomes and easier to interpret. The correlation between two binary outcomes
Y_1 and Y_2 is also constrained in a complicated way by their marginal means μ_1 = P(Y_1 = 1) and μ_2 = P(Y_2 = 1), as a consequence of the following inequalities for their joint distribution:

\[
P(Y_1 = 1, Y_2 = 1) = \mu_1 + \mu_2 - P(Y_1 = 1 \text{ or } Y_2 = 1) \ge \max\{0,\ \mu_1 + \mu_2 - 1\}
\]

and

\[
P(Y_1 = 1, Y_2 = 1) \le \min\{\mu_1, \mu_2\},
\]

leading to

\[
\max\{0,\ \mu_1 + \mu_2 - 1\} \le P(Y_1 = 1, Y_2 = 1) \le \min\{\mu_1, \mu_2\}.
\]
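Plugging these bounds on the joint probability into the definition of correlation shows how restrictive they are. A small illustration (our own numbers, purely hypothetical):

```python
import numpy as np

def corr_range(mu1, mu2):
    """Attainable correlations between Bernoulli(mu1) and Bernoulli(mu2)."""
    lo = max(0.0, mu1 + mu2 - 1.0)   # Frechet lower bound on P(Y1=1, Y2=1)
    hi = min(mu1, mu2)               # Frechet upper bound
    sd = np.sqrt(mu1 * (1 - mu1) * mu2 * (1 - mu2))
    return (lo - mu1 * mu2) / sd, (hi - mu1 * mu2) / sd

print(corr_range(0.2, 0.7))   # roughly (-0.76, 0.33): far narrower than [-1, 1]
```

The attainable range collapses further as the two marginal means move apart, which is exactly why the odds ratio, whose range never depends on the margins, is the preferred association measure here.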
Therefore, instead of marginal correlations, a number of authors (Fitzmaurice,
Laird and Rotnitzky, 1993; Carey, Zeger and Diggle, 1993) propose the use of
marginal odds ratios. For unequally spaced and unbalanced binary time series data,
Fitzmaurice and Lipsitz (1995) present a GEE approach which models the marginal
association using serial odds ratio patterns. Let ψ_{tt*} denote the marginal odds ratio between two binary observations y_t and y_{t*}. Their model for the association has the form

\[
\psi_{tt^*} = \alpha^{1/|t-t^*|}, \qquad 1 < \alpha < \infty,
\]

which has the property that as |t - t*| → 0, there is perfect association (ψ_{tt*} → ∞), and as |t - t*| → ∞, the observations are independent (ψ_{tt*} → 1). Note, however, that only positive association is possible with this type of model. (SAS's proc genmod now offers the possibility of specifying a general regression structure for the log odds ratios with the logor option.)
1.3 Transitional Models
In transitional models, past observations are simply treated as additional
predictors. Interest lies in estimating the effects of these and other explanatory
variables on the conditional mean of the response Yt, given realizations of the past
responses. Specifying the relationship between the mean of y_t and previous observations y_{t-1}, y_{t-2}, ... is another way (and in contrast to the direct way of marginal
models) of modeling the dependency between correlated responses. Transitional
models fit into the framework of GLMs, where, however, the distribution of Yt is
now conditional on the past responses. The model in its most general form (Diggle
et al., 2002) expresses the conditional mean of y_t as a function of explanatory variables and q functions f_r(·) of past responses,

\[
E[y_t \mid H_t] = h^{-1}\Big(x_t'\beta + \sum_{r=1}^{q} f_r(H_t; \alpha)\Big), \tag{1.3}
\]

where H_t = {y_{t-1}, y_{t-2}, ..., y_1} denotes the collection of past responses. H_t can
also include past explanatory variables and parameters. Often, the models are in
discrete-time Markov chain form of order q, and the conditional distribution of y_t given H_t only depends on the last q responses y_{t-1}, ..., y_{t-q}. For example, a transitional logistic regression model for binary responses that is a second order Markov chain has form

\[
\mathrm{logit}\, P(y_t = 1 \mid y_{t-1}, y_{t-2}) = x_t'\beta + \alpha_1 y_{t-1} + \alpha_2 y_{t-2}.
\]
The main difference between transitional models and regular GLMs or marginal models is parameter interpretation. Both the interpretation of α and the interpretation of β are conditional on previous outcomes and depend on how many of these are included. As the time dependence in the model changes, so does the interpretation of parameters. With the logistic regression example from above, the conditional odds of success at time t are exp(α_1) times higher if the given previous response was a success rather than a failure. However, this interpretation assumes a fixed and given outcome at time t - 2. Similarly, a coefficient in β represents the change in the log odds for a unit change in x_t, conditional on the two prior responses. It might be possible that we lose information on the covariate effect by conditioning on these previous outcomes. In general, the interpretation of parameters in transitional models is different from the population-averaged interpretation we discussed for marginal models, where parameters are effects on the marginal mean without conditioning on any previous outcomes.
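To make the conditional-likelihood idea concrete, the following sketch (ours, with made-up parameter values) simulates the second-order transitional logistic model above and recovers (β₀, α₁, α₂) by Newton-Raphson on the likelihood that conditions on the first two observations; this is just an ordinary logistic regression with lagged responses as predictors.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 10_000
beta0, a1, a2 = -1.0, 1.5, 0.5   # invented true parameters

# simulate the second-order transitional logistic model
y = np.zeros(T, dtype=int)
for t in range(2, T):
    eta = beta0 + a1 * y[t - 1] + a2 * y[t - 2]
    y[t] = rng.random() < 1.0 / (1.0 + np.exp(-eta))

# condition on the first q = 2 observations; maximize the conditional
# likelihood by Newton-Raphson on the lagged-response design matrix
X = np.column_stack([np.ones(T - 2), y[1:-1], y[:-2]])
resp = y[2:]
theta = np.zeros(3)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    grad = X.T @ (resp - p)
    hess = X.T @ (X * (p * (1 - p))[:, None])
    theta += np.linalg.solve(hess, grad)

print(theta)   # close to (beta0, a1, a2)
```

The same mechanics underlie the conditional maximum likelihood approach of Section 1.3.1; only the link and the distribution change for count data.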
1.3.1 Model Fitting
If a discrete-time Markov model applies, the likelihood for a generic series y_1, ..., y_T is determined by the Markov chain structure:

\[
L(\beta, \alpha; y_1, \ldots, y_T) = f(y_1, \ldots, y_q) \prod_{t=q+1}^{T} f(y_t \mid y_{t-1}, \ldots, y_{t-q}).
\]
However, the transitional model (1.3) only specifies the conditional distributions
appearing in the product, but not the first term of the likelihood. Often, instead
of a full maximum likelihood approach, one conditions on the first q observations
and maximizes the corresponding conditional likelihood. If in addition f_r(H_t; α) in (1.3) is a linear function in α (and possibly β), then maximization follows
along the lines of GLMs for independent data. Kaufmann (1987) establishes the
asymptotic properties such as consistency, asymptotic normality and efficiency of
the conditional maximum likelihood estimator.
If a Markov assumption is not warranted, estimation can be based on the
partial likelihood (Cox, 1975). To motivate the partial likelihood approach, we
follow Kedem and Fokianos (2002): They consider occasions where a time series
{y_t} is observed jointly with a random covariate series {x_t}. The joint density of (y_t, x_t), t = 1, ..., T, parameterized by a vector θ, can be expressed as

\[
f(x_1, y_1, \ldots, x_T, y_T; \theta) = f(x_1; \theta)\left[\prod_{t=2}^{T} f(x_t \mid H_t; \theta)\right]\left[\prod_{t=1}^{T} f(y_t \mid \tilde{H}_t; \theta)\right], \tag{1.4}
\]

where H_t = (x_1, y_1, ..., x_{t-1}, y_{t-1}) and H̃_t = (x_1, y_1, ..., x_{t-1}, y_{t-1}, x_t) hold the history up to time points t - 1 and t, respectively. Let F_{t-1} denote the σ-field generated by y_{t-1}, y_{t-2}, ..., x_t, x_{t-1}, ..., i.e., F_{t-1} is generated by past responses and present and past values of the covariates. Also, let f_t(y_t | F_{t-1}; θ) denote the conditional density of y_t, given F_{t-1}, which is of exponential density form with mean modeled by (1.3). Then, the partial likelihood for θ = (α, β) is given by

\[
PL(\theta; y_1, \ldots, y_T) = \prod_{t=1}^{T} f_t(y_t \mid \mathcal{F}_{t-1}; \theta), \tag{1.5}
\]
which is the second product in (1.4) and hence the term partial. The loss of
information by ignoring the first product in the joint density is considered small.
If the covariate process is deterministic, then the partial likelihood becomes a
conditional likelihood, but without the necessity of a Markov assumption on the
distribution of the Yt's.
Standard asymptotic results from likelihood analysis of independent data carry
over to the case of partial likelihood estimation with dependent data. Fokianos and
Kedem (1998) showed consistency and asymptotic normality of 0 and provided an
expression for the asymptotic covariance matrix. Since the score equation obtained
from (1.5) is identical to one for independent data in a GLM, partial likelihood
holds the advantage of easy, fast and readily available software implementation
with standard estimation routines such as iterative reweighted least squares.
1.3.2 Transitional Models for Time Series of Counts
For a time series of counts {y_t}, Zeger and Qaqish (1988) propose Markov-type transitional models which they fit using quasi-likelihood methods and the estimating equations approach. They consider various models for the conditional mean μ_t = E[y_t | H_t] of form log(μ_t) = x_t'β + Σ_{r=1}^q α_r f_r(H_{t-r}), where for example f_r(H_{t-r}) = y_{t-r} or f_r(H_{t-r}) = log(y_{t-r} + c) − log(exp(x_{t-r}'β) + c). One common goal of their models is to approximate the marginal mean by E[y_t] = E[μ_t] ≈ exp(x_t'β), so that β has an approximate marginal interpretation as the change in the log mean for a unit change in the explanatory variables. Davis et al. (2003) develop these models further and propose f_r(H_{t-r}) = (y_{t-r} − μ_{t-r})/μ_{t-r}^λ as a more appropriate function to build serial dependence into the model, where λ is an additional parameter. They explore stability properties such as stationarity and ergodicity of these models and describe fast (in comparison to maximum likelihood techniques required for competing random effects models), recursive and iterative maximum likelihood estimation algorithms.
Chapter 4 in Kedem and Fokianos (2002) discusses regression models of form
(1.3) assuming a conditional Poisson or double-truncated Poisson distribution
for the counts, with inference based on the partial likelihood concept. Their
methodology is illustrated with two examples about monthly counts of rainy days
and counts of tourist arrivals.
1.3.3 Transitional Models for Binary Data
For binary data {y_t}, a two-state, first order Markov chain can be defined by its probability transition matrix

\[
P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix},
\]

where p_{ab} = P(y_t = b | y_{t-1} = a), a, b = 0, 1, are the one-step transition probabilities between the two states a and b. Diggle et al. (2002, Chapt. 10.3) discuss various
logistic regression models for these probabilities and higher order Markov chains for
equally spaced observations. Unequally spaced data cannot be routinely handled
with these models.
How can we determine the marginal association structure implied by the conditionally specified model? Let π_1 = (π_1(0), π_1(1)) be the initial marginal distribution for the states at time t = 1. Then the distribution of the states at time n is given by π_n = π_1 P^{n-1}. As n increases, π_n approaches a steady state or equilibrium distribution π that satisfies π = πP. The solution to this equation is given by π(1) = P(y_t = 1) = E[y_t] = p_{01}/(p_{01} + p_{10}) and is used to derive marginal moments implied by the transitional model. For example, it can be shown (Kedem, 1980) that in the steady state, the marginal variance and correlation implied by the transitional model are var(y_t) = π(0)π(1) (as it should be) and corr(y_{t-1}, y_t) = p_{11} − p_{01}, respectively.
Azzalini (1994) models serial dependence in binary data through transition models, but at the same time retains the marginal interpretation of regression parameters. He specifies the marginal regression model logit(μ_t) = x_t'β for a binary time series {y_t} with E[y_t] = μ_t, but assumes that a binary Markov chain with transition probabilities p_{ab} has generated the data. Therefore, the likelihood refers to these probabilities but the model specifies marginal probabilities, a complication similar to the fitting of marginal models discussed in the previous section. However, assuming a constant log odds ratio

\[
\psi = \log \frac{P(y_{t-1} = 1, y_t = 1)\,P(y_{t-1} = 0, y_t = 0)}{P(y_{t-1} = 0, y_t = 1)\,P(y_{t-1} = 1, y_t = 0)}
\]

between any two adjacent observations, Azzalini (1994) shows how to write p_{ab} in terms of just this log odds ratio ψ and the marginal probabilities μ_t and μ_{t-1}. Maximum likelihood estimation for such models is tedious but possible in closed form, although second derivatives of the log likelihood function have to be calculated numerically. A software package (the S-Plus function rm.tools, Azzalini and Chiogna, 1997) exists to fit such models for binary and Poisson observations.
Azzalini (1994) mentions that this basic approach can be extended to include
variable odds ratios between any two adjacent observations, possibly depending on
covariates, but this is not pursued in the article. Diggle et al. (2002) discuss these
marginalized transitional models further.
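As an illustration of this construction, the following sketch recovers the transition probabilities from the marginal means $\mu_{t-1}$, $\mu_t$ and a constant log odds ratio $\psi$: the joint probability $P(Y_{t-1}=1, Y_t=1)$ is the root of a quadratic (Plackett's formula for a 2 x 2 table). The helper name and the numerical values are ours, not Azzalini's:

```python
import math

def transition_probs(mu_prev, mu_t, psi):
    """Recover P(Y_t=1 | Y_{t-1}=a) from the marginal means mu_{t-1}, mu_t
    and a constant log odds ratio psi (helper name is ours)."""
    theta = math.exp(psi)
    if abs(theta - 1.0) < 1e-12:             # psi = 0: independence
        p11 = mu_prev * mu_t
    else:                                    # Plackett's root of the quadratic
        s = 1.0 + (mu_prev + mu_t) * (theta - 1.0)
        p11 = (s - math.sqrt(s * s - 4.0 * theta * (theta - 1.0)
                             * mu_prev * mu_t)) / (2.0 * (theta - 1.0))
    # p11 is the joint probability P(Y_{t-1}=1, Y_t=1)
    return {(1, 1): p11 / mu_prev,                   # P(Y_t=1 | Y_{t-1}=1)
            (0, 1): (mu_t - p11) / (1.0 - mu_prev)}  # P(Y_t=1 | Y_{t-1}=0)

p = transition_probs(0.5, 0.5, math.log(9.0))
print(p[(1, 1)], p[(0, 1)])   # 0.75 0.25
```

With both marginals 0.5 and odds ratio 9, the chain stays in its current state with probability 0.75, consistent with positive serial dependence.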
Chapter 2 in Kedem and Fokianos (2002) presents a detailed discussion of
partial likelihood estimation for transitional binary models and discusses, among
other examples, the eruption data of the Old Faithful geyser, which we will turn to
in Chapter 5.
1.4 Random Effects Models
A popular way of modeling correlation among dependent observations is to
include random effects u in the linear predictor. One of the first developments for
discrete data occurred for longitudinal binary data, where subject-specific random
effects induced correlation between repeated binary measurements on a subject
(Bock and Aitkin, 1981; Stiratelli, Laird and Ware, 1984). In general, we assume
that unmeasurable factors give rise to the dependency in the data {yt} and random
effects {ut} represent the heterogeneity due to these unmeasured factors. Given
these effects, the responses are assumed independent. However, no values for these
factors are observed, and so marginally (i.e., averaged over these factors), the
responses are dependent.
Conditional on some random effects, we consider models that fit into the
framework of GLMs for independent data, i.e., where the conditional distribution
of $y_t \mid u$ is a member of the exponential family of distributions, whose mean
$E[y_t \mid u_t]$ is modeled as a function of a linear predictor $\eta_t = x_t'\beta + z_t'u_t$. Together
with a distributional assumption for the random effects (usually independent and
identically distributed normal), this leads to generalized linear mixed models (GLMMs), where
the term mixed refers to the mixture of fixed and random effects in the linear predictor.
Chapter 2 contains a detailed definition of GLMMs and discusses maximum
likelihood fitting and parameter interpretation; in Chapter 3, correlated random
effects for the description of time-dependent observations $\{y_t\}$ are motivated and
described. Here, we only give a short literature review of GLMMs which use
correlated random effects to model time- (or space-) dependent data.
1.4.1 Correlated Random Effects in GLMMs
One of the first papers considering correlated random effects in GLMMs for
the description of (spatial) dependence in Poisson data is Breslow and Clayton
(1993), who analyze lip cancer rates in Scottish counties. They propose correlated
normal random effects to capture the correlation in counts of adjacent districts in
Scotland. A random effect is assigned to each district, and two random effects are
correlated if their districts are adjacent to each other.
In Section 1.2.2 we mentioned the polio data set, a time series of equally
spaced counts $\{y_t\}_{t=1}^{168}$, and formulated the conditional model (1.2) with a latent
process for the random effects. Instead of obtaining marginal moments as in Zeger
(1988), Chan and Ledolter (1995) use a GLMM approach with Poisson random
components and autoregressive random effects to analyze the time series. They
outline parameter estimation via an MCEM algorithm similar to the one discussed
in Sections 2.4 and 3.2 of this dissertation.
One of the three central generalized linear models advocated by Diggle et al.
(2002, Chap. 11.2) to model longitudinal data uses correlated random effects. For
equally spaced binary longitudinal data {yit}, they plot response profiles simulated
according to the model
$$\mathrm{logit}[P(Y_{it} = 1 \mid u_{it})] = x_{it}'\beta + u_{it}, \qquad \mathrm{cov}(u_{it}, u_{it'}) = \sigma^2 \rho^{|t - t'|},$$
with $\sigma^2 = 2.5^2$ and $\rho = 0.9$, and note that the profiles exhibit more alternating
runs of 0's and 1's than a random intercept model with $u_{i1} = u_{i2} = \cdots =
u_{iT} = u_i$. However, based on the similarity between plots of random intercept,
random intercept and slope, and autoregressive random effects models, they
mention the challenge that binary data present in distinguishing and modeling the
underlying dependency structure in longitudinal data. (They used $T = 25$ repeated
observations for their simulations.) Furthermore, they state that numerical methods
for maximum likelihood estimation are computationally impractical for fitting
models with higher-dimensional random effects. This makes it impossible, they
conclude, to fit the GLMM with serially correlated random effects using maximum
likelihood. Instead, they propose a Bayesian analysis using powerful Markov chain
Monte Carlo methods.
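The simulation Diggle et al. describe is easy to reproduce in outline. The sketch below uses their values $\sigma = 2.5$, $\rho = 0.9$ and $T = 25$, but an intercept-only linear predictor (our simplification):

```python
import numpy as np

rng = np.random.default_rng(1)
T, sigma, rho = 25, 2.5, 0.9      # values used by Diggle et al. (2002)

# Simulate a stationary AR(1) random-effects path u_t with
# cov(u_t, u_t') = sigma^2 * rho^{|t - t'|}.
u = np.empty(T)
u[0] = rng.normal(0.0, sigma)
for t in range(1, T):
    u[t] = rho * u[t - 1] + rng.normal(0.0, sigma * np.sqrt(1 - rho ** 2))

# Binary responses from logit[P(Y_t = 1 | u_t)] = beta0 + u_t
# (intercept-only linear predictor, our simplification).
beta0 = 0.0
prob = 1.0 / (1.0 + np.exp(-(beta0 + u)))
y = rng.binomial(1, prob)
print(y)      # profiles show runs of 0's and 1's driven by the latent path
```

Replacing the AR(1) path by a single draw $u_i$ shared across all $t$ reproduces the random intercept profiles they contrast this with.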
Indeed, the majority of examples in the literature that consider correlated
random effects in a GLMM framework take a Bayesian approach. Sun, Speckman
and Tsutakawa (2000) explore several types of correlated random effects (autoregressive,
generalized autoregressive and conditional autoregressive) in a Bayesian
analysis of a GLMM. As in any Bayesian analysis, the propriety of the posterior
distribution given the data is of concern when fixed effects and variance components
have improper prior distributions and random effects are (possibly singular)
multivariate normal. One of their results, applied to Poisson or binomial data $\{y_t\}$,
states that the posterior might be improper when $y_t = 0$ in the Poisson case and
cannot be proper when $y_t = 0$ or $y_t = n_t$ in the binomial case for any $t$ when
improper or noninformative priors are used.
Diggle, Tawn and Moyeed (1998) consider Gaussian spatial processes S(x)
to model spatial count data at locations x. The role of S(x) is to explain any
residual spatial variation after accounting for all known explanatory variables.
They also use a Bayesian framework to estimate parameters and give a solution to
the problem of predicting the count at a new location. Ghosh et al. (1998) use
correlated random effects in Bayesian models for small area estimation problems.
They present an application of pairwise difference priors for random effects
to model a series of spatially correlated binomial observations in a Bayesian
framework. Zhang (2002) discusses maximum likelihood estimation with an
underlying spatial Gaussian process for spatially correlated binomial observations.
Bayesian models for binary time series are described in Liu (2001), based
on probit-type models for correlated binary data which are discussed in Chib
and Greenberg (1998). Probit-type models are motivated by assuming latent
random variables $z = (z_1, \ldots, z_T)$, which follow a $N(\mu, \Sigma)$ distribution with
$\mu = (\mu_1, \ldots, \mu_T)$, $\mu_t = x_t'\beta$ and $\Sigma$ a correlation matrix. The $y_t$'s are assumed to be
generated according to
$$y_t = I(z_t > 0),$$
where $I(\cdot)$ is the indicator function. This leads to the (marginal) probit model
$P(Y_t = 1) = \Phi(x_t'\beta)$. Rich classes of dependency structures between
binary outcomes can be modeled through $\Sigma$. These models can further be extended
to include random effects through $\mu_t = x_t'\beta + z_t'u_t$, or $q$ previous responses, as
in $\mu_t = x_t'\beta + \sum_{j=1}^{q} \alpha_j y_{t-j}$. It is important to note that $\Sigma$ has to be in correlation
form. To see this, suppose it is not, and let $S = D\Sigma D$ be a covariance matrix
for the latent random variables $z$, where $D$ is a diagonal matrix holding standard
deviation parameters. The joint density of the time series under the multivariate
probit model is given by
$$P[(Y_1, \ldots, Y_T) = (y_1, \ldots, y_T)] = P[z \in A] = P[Dz \in A],$$
where $A = A_1 \times \cdots \times A_T$ with $A_t = (-\infty, 0]$ if $y_t = 0$ and $A_t = (0, \infty)$ if $y_t = 1$
are the intervals corresponding to the relationship $y_t = I(z_t > 0)$, for $t = 1, \ldots, T$.
However, the above relationship is true for any parametrization of $D$, because the
intervals $A_t$ are not affected by the transformation from $z$ to $D^{-1}z$. Hence, the
elements of $D$ are not identifiable from the joint distribution of the observed
time series $y$.
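The non-identifiability of $D$ can be checked by simulation: orthant probabilities are unchanged when the latent covariance is rescaled from $\Sigma$ to $S = D\Sigma D$. The sketch below uses a zero mean and arbitrary values for $D$ (both our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4
# An AR(1)-type correlation matrix Sigma and an arbitrary diagonal scaling D.
rho = 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
D = np.diag([0.5, 1.0, 2.0, 4.0])
S = D @ Sigma @ D                 # covariance S = D Sigma D

mu = np.zeros(T)                  # zero mean for simplicity
z1 = rng.multivariate_normal(mu, Sigma, size=200_000)
z2 = rng.multivariate_normal(mu, S, size=200_000)

# Orthant events y_t = I(z_t > 0) have the same distribution under Sigma
# and S, because rescaling each z_t by d_t > 0 does not move the threshold 0.
pat = (1, 0, 1, 1)
p1 = np.mean(np.all((z1 > 0) == np.array(pat, bool), axis=1))
p2 = np.mean(np.all((z2 > 0) == np.array(pat, bool), axis=1))
print(p1, p2)                     # agree up to Monte Carlo error
```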
Lee and Nelder (2001) present models to analyze spatially correlated Poisson
counts and binomial longitudinal data about cancer mortality rates. They explore
a variety of patterned correlation structures for random effects in a GLMM setup.
Model fitting is based on the joint data likelihood of observations and unobserved
random effects (Lee and Nelder, 1996) and not on the marginal likelihood of the
observed data. Model diagnostic plots of estimated random effects are presented to
aid in selecting an appropriate correlation structure.
1.4.2 Other Modeling Approaches
In hidden Markov models (MacDonald and Zucchini, 1997) the underlying
random process is assumed to be a discrete state-space Markov chain instead
of a continuous (normal) process. Probability transition matrices describe the
connection between states. A very convenient property of hidden Markov models
is that the likelihood can be evaluated sufficiently fast to permit direct numerical
maximization. MacDonald and Zucchini (1997) present a detailed description of
hidden Markov models for the analysis of binary and count time series.
A connection between transitional models and random effects models is
explored in Aitkin and Alfò (1998). They model the success probabilities of
serial binary observations conditional on subject-specific random effects and on
the previous outcome. As in the models before, transition probabilities $p_{ab}$ are
changing over time due to the inclusion of time-dependent covariates and the
previous observation in the linear predictor. Additionally, random effects account
for possibly unobserved sources of heterogeneity between subjects. The authors
argue that the conditional model specification together with the specification of
the random effects distribution does not determine the distribution of the initial
observation, and hence the likelihood for this model is unspecified. They present
a solution by maximizing the likelihood obtained from conditioning on this first
observation. However, this causes the specified random effects distribution to shift
to an unknown distribution. Two approaches for estimation are outlined: The first
assumes another normal distribution for the new random effects distribution and
the likelihood is maximized using GaussHermite quadrature. The second approach
assumes no parametric form for the new random effects distribution and follows the
nonparametric maximum likelihood approach (Aitkin, 1999). For binary data, the
new random effects distribution is only a two point distribution and its parameters
can be estimated via maximum likelihood jointly with the other model parameters.
Marginalized transitional models were briefly mentioned with the approach
taken by Azzalini (1994). The idea of "marginalizing", i.e., modeling the marginal
mean of an otherwise conditionally specified model, can also be applied to random
effects models. The advantage of transitional or random effects models is the
ability to easily specify correlation patterns, with the potential disadvantage that
parameters in such models have conditional interpretations when the scientific
interest lies in marginal relationships. In marginal models, parameters
can be interpreted directly as contrasts between subpopulations, without the need
to condition on previous observations or unobserved random effects. However, as
we mentioned in Section 1.2.2, likelihood-based inference in marginal models might
not be possible.
A marginalized random effects model (Heagerty, 1999; Heagerty and Zeger,
2000) specifies two regression equations that are consistent with each other. The
first equation,
$$\mu_t^M = E[y_t] = h^{-1}(x_t'\beta),$$
expresses the marginal mean $\mu_t^M$ as a function of covariates and describes systematic
variation. The second equation characterizes the dependency structure among
observations through specification of the conditional mean $\mu_t^C$,
$$\mu_t^C = E[y_t \mid u_t] = h^{-1}(\Delta_t(x_t) + z_t'u_t),$$
where $u_t$ are random effects with design vector $z_t$. Consistency between the
marginal and conditional specification is achieved by defining $\Delta_t(x_t)$ implicitly
through
$$\mu_t^M = E_u[\mu_t^C] = E_u\big[E[y_t \mid u_t]\big] = E_u[h^{-1}(\Delta_t(x_t) + z_t'u_t)].$$
For instance, in a marginalized GLMM with random effects distribution $F(u_t)$,
$\Delta_t(x_t)$ is the solution to the integral equation
$$\mu_t^M = h^{-1}(x_t'\beta) = \int h^{-1}(\Delta_t(x_t) + z_t'u_t)\, dF(u_t),$$
so that $\Delta_t(x_t)$ is a function of the marginal regression coefficients $\beta$ and the
(variance) parameters in $F(u_t)$. Maximum likelihood estimation is based on the
integrated likelihood from the GLMM.
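A minimal sketch of how $\Delta_t(x_t)$ can be computed for a logit link and a scalar normal random effect, using Gauss-Hermite quadrature for the integral and bisection for the root (the value of $\sigma$ and the target marginal mean are hypothetical):

```python
import numpy as np

# Probabilists' Gauss-Hermite rule, normalized so weights sum to one,
# approximates E[g(u)] for u ~ N(0, 1).
nodes, weights = np.polynomial.hermite_e.hermegauss(40)
weights = weights / weights.sum()

sigma = 1.5                      # hypothetical random-effects std. deviation

def marginal_of(delta):
    """E_u[h^{-1}(delta + u_t)] for a logit link, u_t ~ N(0, sigma^2)."""
    return float(np.sum(weights / (1.0 + np.exp(-(delta + sigma * nodes)))))

def solve_delta(mu_M, lo=-30.0, hi=30.0, tol=1e-10):
    """Bisection for Delta_t(x_t) solving E_u[h^{-1}(Delta + u_t)] = mu_M;
    marginal_of is monotone increasing in Delta."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if marginal_of(mid) < mu_M:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

delta = solve_delta(0.8)
print(delta)          # exceeds logit(0.8) ~ 1.386: attenuation of effects
```

The gap between $\Delta$ and $\mathrm{logit}(0.8)$ illustrates the attenuation that distinguishes marginal from subject-specific coefficients.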
1.5 Motivation and Outline of the Dissertation
In this dissertation we propose generalized linear mixed models (GLMMs)
with correlated random effects to model count or binomial response data collected
over time or in space. For sequential or spatial Gaussian measurements, maximum
likelihood estimation is well established and software (e.g. SAS's proc mixed) is
available to fit fairly complicated correlation structures. The challenge for discrete
data lies in the fact that the observed (marginal) likelihood is not analytically
tractable, and maximization of it is more involved. Furthermore, with correlated
random effects, the likelihood does not break down into lower-dimensional components
which are easier to integrate numerically. Therefore, most approaches in
the literature are based on a quasi-likelihood approach or take a Bayesian perspective.
The advantage of Bayesian models is that powerful Markov chain Monte
Carlo methods make it easier to obtain a sample from the posterior distribution of
interest than to obtain maximum likelihood estimates. However, priors must be
specified very carefully to ensure posterior propriety.
In addition, repeated observations are prone to missing data or unequally
spaced observation times. We would like to develop methods and models that allow
for unequally spaced binary, binomial or Poisson observations, making them more
general than those previously presented in the literature.
To our knowledge, maximum likelihood estimation of GLMMs with such high-dimensional
random effects has not been demonstrated before, with the exception
of the paper by Chan and Ledolter (1995), who consider fitting a time series
of counts. However, they do not consider unequally spaced data and employ a
different implementation of the MCEM algorithm. In Chapter 5 we argue that
their implementation of the algorithm might have been stopped prematurely,
leading to different conclusions than our analysis and analyses published elsewhere.
Most articles that discuss correlated random effects do so for only a small number
of correlated random effects. For example, Chan and Kuk (1997) show that the data set
on salamander mating behavior published and analyzed in McCullagh and Nelder
(1989) is more appropriately analyzed when random effects pertaining to the male
salamander population are correlated over the three different time points at which they
were observed. In this dissertation, we consider much longer sequences of
repeated observations.
In Chapter 2 we introduce the GLMM as the model of our choice to analyze
correlated discrete data and outline an EM algorithm to estimate fixed and random
effects, where both the E-step and the M-step require numerical approximations,
leading to an EM algorithm based on Monte Carlo methods (MCEM). Correlated
random effects and their implications on the analysis of GLMMs are discussed in
Chapter 3, together with a motivating example. This chapter also gives details for
the implementation of the algorithm and reports results from simulation studies.
Chapter 4 looks at marginal model properties and interpretation for correlated
binary, binomial or Poisson observations and Chapter 5 applies our methods to
real data sets from the social sciences, public health, sports and other backgrounds.
A summary and discussion of the methods and models presented here is given in
Chapter 6.
CHAPTER 2
GENERALIZED LINEAR MIXED MODELS
Chapter 1 reviewed various approaches to extending GLMs to deal with
correlated data. In this chapter we take a closer look at generalized linear
mixed models (GLMMs), which were briefly mentioned in Section 1.4. When
the response variables are normal, these models are simply called linear mixed
models (LMMs) and have been extensively discussed in the literature (see, for
example, the books by Searle, Casella and McCulloch, 1992, and Verbeke and
Molenberghs, 2000). The form of the normal density for observations and random
effects allows for analytical evaluation of the integrals together with straightforward
maximization. Hence, LMMs can be readily fit with existing software (e.g., SAS's
proc mixed), using rich classes of prespecified correlation structures for the random
effects to model the dependence in the data more precisely. The broader notion
of GLMMs also encompasses binary, binomial, Poisson or gamma responses.
A distinctive feature of GLMMs is their so-called subject-specific parameter
interpretation, which differs from the interpretation of parameters in marginal
(Section 1.2) or transitional (Section 1.3) models. This feature is discussed in
Section 2.1, after a formal introduction of the GLMM. Throughout, special
attention is devoted to defining GLMMs for discrete time series observations.
GLMMs are harder to fit because they typically involve intractable integrals
in the likelihood function. Section 2.2 outlines various approaches to model fitting.
Section 2.3 focuses on a Monte Carlo version of the EM algorithm which is an
indirect method of finding maximum likelihood estimates in GLMMs. Monte Carlo
methods are necessary because our applications involve correlated random effects
which lead to a very high-dimensional integral in the likelihood function. Parallel
to the discussion of GLMMs, state space models are introduced and a fitting
algorithm is described. State space models are popular models for discrete time
series in econometric applications (Durbin and Koopman, 2001). The presentation
of specific examples of GLMMs for discrete time series observations is deferred until
Chapter 5.
2.1 Definition and Notation
The generalized linear mixed model is an extension of the well-known generalized
linear model (McCullagh and Nelder, 1989) that permits fixed as well as
random effects in the linear predictor (hence the word mixed). The setup
of GLMMs is split into two stages, which we present here using notation common
for longitudinal studies:
Firstly, conditional on cluster-specific random effects $u_i$, the data are assumed
to follow a GLM with independent random components $Y_{it}$, the $t$th response
in cluster $i$, $i = 1, \ldots, n$, $t = 1, \ldots, n_i$. A cluster here is a generic expression
for any form of grouping of observations, such as repeated
observations on the same subject (cluster = subject), observations on different
students in the same school (cluster = school) or observations recorded in a
common time interval (cluster = time interval). The conditional distribution of $Y_{it}$
is a member of the exponential family of distributions (e.g., McCullagh and Nelder,
1989) with form
$$f(y_{it} \mid u_i) = \exp\{[y_{it}\theta_{it} - b(\theta_{it})]/\phi_{it} + c(y_{it}, \phi_{it})\} \qquad (2.1)$$
where $\theta_{it}$ are natural parameters and $b(\cdot)$ and $c(\cdot)$ are certain functions determined
by the specific member of the exponential family. The parameters $\phi_{it}$ are typically
of the form $\phi_{it} = \phi/w_{it}$, where the $w_{it}$'s are known weights and $\phi$ is a possibly unknown
dispersion parameter. For the discrete response GLMMs we are considering,
$\phi = 1$. For a specific link function $h(\cdot)$, the model for the conditional mean of
observations $y_{it}$ has form
$$\mu_{it} = b'(\theta_{it}) = E[Y_{it} \mid u_i] = h^{-1}(x_{it}'\beta + z_{it}'u_i), \qquad (2.2)$$
where $x_{it}$ and $z_{it}$ are covariate or design vectors for fixed and random effects
associated with observation $y_{it}$ and $\beta$ is a vector of unknown regression coefficients.
At this first stage, $z_{it}'u_i$ can be regarded as a known offset for each observation,
and observations are conditionally independent.
It should be noted that relationship (2.2) between the mean of the observation
and fixed and random effects is exactly as is specified in the systematic part of
GLMs, with the exception that in GLMMs a conditional mean is modeled. This
affects parameter interpretation. The regression coefficients $\beta$ represent the effect
of explanatory variables on the conditional mean of observations, given the random
effects. For instance, observations in the same cluster $i$ share a common value
of the random cluster effect $u_i$, and hence $\beta$ describes the conditional effect of
explanatory variables, given the value of $u_i$. If the cluster consists of repeated
observations on the same subject, these effects are called subjectspecific effects.
In contrast, regression coefficients in GLMs and marginal models describe the
effect of explanatory variables on the population average, which is an average over
observations in different clusters.
At the second stage, the random effects $u_i$ are specified to follow a multivariate
normal distribution with mean zero and variance-covariance matrix $\Sigma_i$. A
standard assumption is that random effects {ui} are independent and identically
distributed, but an example at the beginning of Chapter 3 will show that this
is sometimes not appropriate. With time series observations, where the clusters
refer to time segments, it is reasonable to assume that observations are not only
correlated within the cluster (modeled by sharing the same cluster-specific random
effect), but also across clusters, which we will model by assuming correlated
cluster-specific random effects.
2.1.1 Generalized Linear Mixed Models for Univariate Discrete Time
Series
Most of the data we are going to analyze is in the form of a single univariate
time series. To emphasize this data structure, the general two-dimensional notation
(indices $i$ and $t$) of a GLMM to model observations which come in clusters can be
simplified in two ways:
We can assume that a single cluster (i.e., $n = 1$ and $n_1 = T$) contains the
entire time series $y_1, \ldots, y_T$. The random effects vector $u = (u_1, \ldots, u_T)$ associated
with the single cluster has a random effects component for each individual
time series member. The distribution of $u$ is multivariate normal with variance-covariance
matrix $\Sigma$, which is different from the identity matrix. The correlation
of the components of $u$ induces a correlation among the time series members.
However, conditional on $u$, observations within the single cluster are independent.
The cluster index $i$ is redundant in the notation and hence can be dropped. This
representation is particularly useful when fitting GLMMs with existing software,
where it is often necessary to include a column indicating the cluster membership
of each observation; since we have only one cluster here, it suffices to
include a column of all ones, say.
Alternatively, we can adopt the point of view that each member of the time
series is a mini cluster by itself, containing only one observation (i.e., $n_i = 1$ for
all $i = 1, \ldots, T$) in the case of a single time series. When multiple, parallel time
series are observed, the cluster contains all $c$ observations at time point $t$ from the
$c$ parallel time series (i.e., $n_i = c$ for all $i = 1, \ldots, T$). In any case, the clusters
are then synonymous with the discrete time points at which observations were
recorded. This makes the index $t$, which counts the repeated observations within a
cluster, redundant ($n_i = 1$ or $n_i = c$ for all clusters $i$), but instead of denoting the time series
by $\{y_i\}_{i=1}^T$, we use the more common notation $\{y_t\}_{t=1}^T$, where $t$ is now
the index for clusters or, equivalently, time points. In the following definition of
GLMMs for univariate time series, the notions of clusters and time points can be
used interchangeably. Conditional on unobserved random effects $u_1, \ldots, u_T$ for
the different time points, observations $y_1, \ldots, y_T$ are assumed independent with
distributions
$$f(y_t \mid u_t) = \exp\{[y_t\theta_t - b(\theta_t)]/\phi_t + c(y_t, \phi_t)\} \qquad (2.3)$$
in the exponential family. As before, for a specific link function $h(\cdot)$, the model for
the conditional mean has form
$$E[y_t \mid u_t] = \mu_t = b'(\theta_t) = h^{-1}(x_t'\beta + z_t'u_t), \qquad (2.4)$$
where $x_t$ and $z_t$ are covariate or design vectors for fixed and random effects
associated with the $t$th observation and $\beta$ is a vector of unknown regression
coefficients. The random effects $u_1, \ldots, u_T$ are typically not independent. When
collected in the vector $u = (u_1, \ldots, u_T)$, a multivariate normal distribution with
mean 0 and covariance matrix $\Sigma$ can be directly specified. In particular, in Chapter
3 we will assume special patterned covariance matrices to allow for rich, but still
parsimonious, classes of correlation structures among the time series observations.
The advantage of the second setup of mini clusters is that it also allows for
other, indirect specifications of the random effects distribution, for instance through
a latent random process. For this, we relate cluster-specific random effects from
successive time points. For example, with univariate random effects, a first-order
latent autoregressive process assumes that the random effects follow
$$u_{t+1} = \rho u_t + \epsilon_t, \qquad (2.5)$$
where $\epsilon_t$ has a zero-mean normal distribution and $\rho$ is a correlation parameter.
Cox (1981) called these types of models parameter-driven models, as opposed to the
transitional (or observation-driven) models discussed in Section 1.3. In parameter-driven
models, an underlying and unobserved parameter process influences the
distribution of a series of observations. The model for the polio data in Zeger
(1988) is an example of a parameter-driven model. However, Zeger (1988) assumes
neither normality nor zero mean for the latent autoregressive process. Furthermore,
the natural logarithm of the random effects, and not the random effects themselves,
appears additively in the linear predictor. Therefore, this model is slightly different
from the specification of the time series GLMM above.
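A parameter-driven Poisson series with the latent process (2.5) can be simulated in a few lines; all numerical values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
T, rho, tau = 200, 0.8, 0.5      # hypothetical length, correlation, noise sd

# Latent first-order autoregressive process (2.5): u_{t+1} = rho*u_t + e_t.
u = np.empty(T)
u[0] = rng.normal(0.0, tau / np.sqrt(1 - rho ** 2))   # stationary start
for t in range(1, T):
    u[t] = rho * u[t - 1] + rng.normal(0.0, tau)

# Parameter-driven Poisson counts with log link: E[y_t | u_t] = exp(b0 + u_t).
b0 = 1.0
y = rng.poisson(np.exp(b0 + u))
print(y[:10])
```

The serial correlation in the counts is inherited entirely from the latent path $u_t$; conditional on $u$, the $y_t$ are independent.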
Another application of the mini cluster setup is to spatial settings, where clusters
represent spatially aggregated data instead of time points. Then $u_1, \ldots, u_T$ is
a collection of random effects associated with spatial clusters. Again, independent
random effects are inappropriate for describing the spatial dependencies. In general,
time-dependent data are easier to handle since observations are linearly ordered
in time, whereas more complicated random effects distributions are needed for spatial
applications (e.g., Besag et al., 1995).
We will use the mini cluster representation to facilitate a comparison to state
space models. This is the focus of the next section.
2.1.2 State Space Models for Discrete Time Series Observations
State space models are a rich alternative to the traditional BoxJenkins
ARIMA system for time series analysis. Similar to GLMMs, state space models
for Gaussian and non-Gaussian time series split the modeling process into two
stages: At the first stage, the responses Yt are related to unobserved "states" by an
observation equation. (State space models originated in the engineering literature,
where parameters are often called states.) At the second stage, a latent or hidden
Markov model is assumed for the states. For univariate Gaussian responses
$y_1, \ldots, y_T$, the two equations of a state space model take the form
$$y_t = w_t'\alpha_t + \epsilon_t, \qquad \epsilon_t \sim N(0, \sigma^2), \qquad (2.6)$$
$$\alpha_t = T_t\alpha_{t-1} + R_t\zeta_t, \qquad \zeta_t \sim N(0, Q_t), \qquad t = 1, \ldots, T,$$
where $w_t$ is an $m \times 1$ observation or design vector and $\epsilon_t$ is a white noise process.
The unobserved $m \times 1$ state or parameter vector $\alpha_t$ is defined by the second,
transition, equation, where $T_t$ is a transition matrix and $\zeta_t$ is another white noise
process, independent of the first one. Compared to the standard GLMMs of
Section 2.1, the main difference is that random effects are correlated instead of
i.i.d. In state space models, no clear distinction between fixed and random effects
is made, and the state vector $\alpha_t$ can contain both. However, the form of the
transition matrix $T_t$, together with the form of the matrix $R_t$, which consists of
columns of the identity matrix $I_m$, allows one to declare certain elements of $\alpha_t$ as
fixed effects and others as random. The matrix $R_t$ is called the selection
matrix since it selects the rows of the state equation which have nonzero variance
terms. With this formulation, the variance-covariance matrix $Q_t$ is assumed to
be nonsingular. Furthermore, the transition matrix $T_t$ allows specification of
which effects vary through time and which stay constant. (For a slightly different
formulation without a selection matrix $R_t$, but with a possibly singular variance-covariance
matrix $Q_t$, see Fahrmeir and Tutz, 2001, Chap. 8.)
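For intuition, the sketch below simulates the simplest Gaussian case of (2.6), the local level model ($w_t = 1$, $T_t = 1$, $R_t = 1$ with scalar state), and runs the standard Kalman filter recursion; the noise standard deviations are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 100
sigma_eps, sigma_eta = 1.0, 0.3    # hypothetical noise standard deviations

# Local level model: the state alpha_t is a scalar random walk,
# observed with noise: y_t = alpha_t + eps_t, alpha_t = alpha_{t-1} + zeta_t.
alpha = np.cumsum(rng.normal(0.0, sigma_eta, T))
y = alpha + rng.normal(0.0, sigma_eps, T)

# One pass of the standard Kalman filter recursion for this model.
a, P = 0.0, 1e6                    # vague initialization of the state
filtered = []
for yt in y:
    F = P + sigma_eps ** 2         # variance of the one-step prediction error
    K = P / F                      # Kalman gain
    a = a + K * (yt - a)           # filtered state E[alpha_t | y_1,...,y_t]
    P = P * (1.0 - K) + sigma_eta ** 2   # filtered variance, propagated to t+1
    filtered.append(a)
print(filtered[-1], alpha[-1])     # filtered state tracks the true level
```

For non-Gaussian observation densities this exact recursion is unavailable, which is precisely where the fitting problems discussed below arise.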
State space models for non-Gaussian time series were considered by West
et al. (1985) under the name dynamic generalized linear models. They used a
Bayesian framework with conjugate priors to specify and fit their models. Durbin
and Koopman (1997, 2000) and Fahrmeir and Tutz (2001) describe a state space
structure for non-Gaussian observations similar to the two equations above.
The normal distribution assumption for the observations in (2.6) is replaced by
assuming a distribution in the exponential family with natural parameter $\theta_t$. With
a canonical link, $\theta_t = w_t'\alpha_t$. This is called the signal by Durbin and Koopman
(1997, 2000). In particular, given the states $\alpha_1, \ldots, \alpha_T$, observations $y_1, \ldots, y_T$
are conditionally independent and have density $p(y_t \mid \theta_t)$ in the exponential family
(2.3). As in Gaussian state space models, the state vector $\alpha_t$ is determined by the
vector autoregressive relationship
$$\alpha_t = T_t\alpha_{t-1} + R_t\zeta_t, \qquad (2.7)$$
where the serially independent $\zeta_t$ typically have normal distributions with mean 0
and variance-covariance matrix $Q_t$.
2.1.3 Structural Similarities Between State Space Models and GLMMs
There is a strong connection between state space models and GLMMs with
a canonical link. To see this, we write the GLMM in state space form: Let $w_t =
(x_t', z_t')'$ and $\alpha_t = (\beta_t', u_t')'$, where $x_t$, $z_t$, $\beta$ and $u_t$ are from the GLMM notation
as defined in (2.4). Hence, the linear predictor $x_t'\beta + z_t'u_t$ of the GLMM is equal
to the state space signal $\theta_t = w_t'\alpha_t$. Next, partition the disturbance term $\zeta_t$ of
the state equation as $\zeta_t = (\delta_t', \tilde{\epsilon}_t')'$ and consider special transition and selection
matrices of block form
$$T_t = \begin{pmatrix} I & 0 \\ 0 & \tilde{T}_t \end{pmatrix}, \qquad R_t = \begin{pmatrix} 0 & 0 \\ 0 & \tilde{R}_t \end{pmatrix}.$$
Using transition equation (2.7) results in the following autoregressive relationship
between the random effects of a GLMM: $u_t = \tilde{T}_t u_{t-1} + \epsilon_t$, where $\epsilon_t = \tilde{R}_t\tilde{\epsilon}_t$
is a white noise component. In a univariate context, we have already motivated
this type of relationship between random effects in equation (2.5). The transition
equation also implies a constant effect $\beta$ for the GLMM, since $\beta_1 = \beta_2 = \cdots =
\beta_T := \beta$. Hence, both models use correlated random effects, but GLMMs also
typically involve fixed parameters which are not modeled as evolving over time.
The restriction of the transition equation to the autoregressive form (often
a simple random walk) is only one way of specifying a distribution for the
random effects in the GLMMs of Section 2.1.1. Other structures, such as equally
correlated random effects, are possible within the GLMM framework and are
considered in Chapter 3.
2.1.4 Practical Differences
Although GLMMs and state space models are similar in structure, they are
used differently in practice. This is in part due to the fact that in GLMMs the
focus is on the fixed subject-specific regression parameters $\beta$, which refer to time-constant
and time-varying covariates, while in state space models the main purpose
is to infer properties of the time-varying random states $\alpha_t$. These are often
assumed to follow a first or second order random walk. To illustrate, consider a
data set about a monthly time series of counts presented in Durbin and Koopman
(2000) for the investigation of the effectiveness of new seat belt legislation on
automobile accidents. They specify the log-mean for a Poisson state space model as
$$\log(\mu_t) = \nu_t + \lambda x_t + \gamma_t,$$
where $\nu_t$ is a trend component following the random walk
$$\nu_{t+1} = \nu_t + \xi_t$$
with $\xi_t$ a white noise process. Further, $\lambda$ is an intervention parameter corresponding
to the change in seat belt legislation ($x_t = 0$ before the change and equal to
1 afterwards) and $\{\gamma_t\}$ are fixed seasonal components with $\sum_t \gamma_t = 0$, equal
in every year. The main focus is on the parameter $\lambda$ describing the drop in the log-means,
also called the level, after the seat belt legislation went into effect.
For the series at hand, a GLMM approach would consider a fixed linear time
effect $\beta$ and model the log-mean as
\[
\log(\mu_t) = \alpha + \beta t + \lambda x_t + \gamma_t + u_t,
\]
where $\alpha$ is the intercept of the linear time trend with slope $\beta$ and where correlated
random effects $\{u_t\}$ account for the correlation in the monthly log-means. Similar
to above, $\lambda$ describes the effect on the log-means after the seat belt legislation went
into effect and the $\gamma_t$ are fixed seasonal components, equal for every year. The trend
component $\nu_t$ of the state space model corresponds to $\alpha + u_t$ in the GLMM, which
additionally allows for a linear time trend $\beta t$. This approach seems to be favored
by some discussants of the Durbin and Koopman (2000) paper; in particular,
Chatfield and Aitkin both note the lack of a linear trend term in the proposed
state space model.
An even better GLMM approach with linear time trends explicitly modeled
could use a change point formulation in the linear predictor, with the month the
legislation went into effect (or was enforced) as the change point, and again with
correlated random effects {ut} to capture the dependency among successive means.
Such a specification would be harder to model in a state space model.
In the reply to the discussion of their paper, Durbin and Koopman (2000)
wrote that the two approaches (state space models versus Hierarchical Generalized
Linear Models, a model class very similar to GLMMs) are very different and that
they regard their treatment as more transparent and general for problems that
specifically relate to time series. The presentation in this thesis of GLMMs with
correlated random effects for time series analysis may weaken that argument.
For instance, with the proposal of autocorrelated random effects $u_{t+1} = \rho u_t + \epsilon_t$
in a GLMM context we have an elegant means of introducing autocorrelation into a
basic regression model that is well understood and whose parameters are easily
interpreted. Furthermore, GLMMs can easily accommodate the case of multiple
time series observations on each of several individuals or crosssectional units, as is
often observed in a longitudinal study.
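To make this structure concrete, the following sketch simulates a monthly count series from a Poisson GLMM of the kind just described, with a linear trend, an intervention indicator, and autocorrelated random effects $u_{t+1} = \rho u_t + \epsilon_t$; all parameter values and the change point are hypothetical and chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

T = 120                                  # length of the monthly series
alpha, slope, lam = 2.0, 0.001, -0.25    # intercept, linear trend, intervention effect
rho, tau = 0.8, 0.3                      # AR(1) coefficient and innovation SD
x = (np.arange(T) >= 60).astype(float)   # intervention indicator x_t

# AR(1) random effects: u_{t+1} = rho * u_t + eps_t
u = np.zeros(T)
u[0] = rng.normal(0.0, tau / np.sqrt(1 - rho**2))  # draw from the stationary distribution
for t in range(1, T):
    u[t] = rho * u[t - 1] + rng.normal(0.0, tau)

# Poisson GLMM: log mu_t = alpha + slope*t + lam*x_t + u_t
mu = np.exp(alpha + slope * np.arange(T) + lam * x + u)
y = rng.poisson(mu)
```

Starting the random effects from their stationary $N(0, \tau^2/(1-\rho^2))$ distribution makes the series stationary apart from the trend and intervention components.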
One common feature of both models is the intractability of the likelihood
function and the use of numerical and simulation techniques to obtain maximum
likelihood estimates. In general, state space models for non-Gaussian time series
are fit using a simulated maximum likelihood approach, which is also a popular
method for fitting GLMMs. However, long time series necessarily result in models
with complex and high-dimensional random effects, and alternative, indirect
methods may work better. Jank and Booth (2003) indicate that simulated maximum
likelihood, the method of choice for estimation in state space models, may not
work as well as indirect methods based on the EM algorithm, the method we will
use to fit GLMMs for time series observations. The next section reviews various
approaches of fitting GLMMs and contrasts them with the approach taken for state
space models.
2.2 Maximum Likelihood Estimation
Maximum likelihood estimation in GLMMs is a challenging task, because it
requires the calculation of integrals (often high dimensional) that have no known
analytic solution. Following the general notation of a GLMM in Section 2.1, let
$y_i = (y_{i1}, \ldots, y_{in_i})'$ be the vector of all observations in cluster $i$, whose associated
random effects vector is $u_i$. Conditional independence of the $y_{it}$'s implies that the
density function of $y_i$ is given by
\[
f(y_i \mid u_i; \beta) = \prod_{t=1}^{n_i} f(y_{it} \mid u_i; \beta), \qquad (2.8)
\]
where the $f(y_{it} \mid u_i; \beta)$ are the exponential family densities in (2.1). The parameter $\beta$ is the
vector of all unknown regression coefficients introduced by specifying model (2.2)
for the mean of the observations. Furthermore, observations from different clusters
are assumed conditionally independent, leading to the conditional joint density
\[
f(y_1, \ldots, y_n \mid u_1, \ldots, u_n; \beta) = \prod_{i=1}^{n} f(y_i \mid u_i; \beta)
\]
for all observations given all random effects. Let $g(u_1, \ldots, u_n; \psi)$ denote the multivariate
normal density function of the random effects, whose variance-covariance
matrix $\Sigma$ is determined by the variance component vector $\psi$. The goal is to estimate
the unknown parameter vectors $\beta$ and $\psi$ by maximum likelihood. The
likelihood function $L(\beta, \psi; y_1, \ldots, y_n)$ for a GLMM is given by the marginal density
function of the observations $y_1, \ldots, y_n$, viewed as a function of the parameters,
and is equal to
\[
\begin{aligned}
L(\beta, \psi; y) &= \int f(y_1, \ldots, y_n \mid u_1, \ldots, u_n; \beta)\, g(u_1, \ldots, u_n; \psi)\, du_1 \cdots du_n \\
&= \int \prod_{i=1}^{n} f(y_i \mid u_i; \beta)\, g(u_1, \ldots, u_n; \psi)\, du_1 \cdots du_n \\
&= \int \prod_{i=1}^{n} \prod_{t=1}^{n_i} f(y_{it} \mid u_i; \beta)\, g(u_1, \ldots, u_n; \psi)\, du_1 \cdots du_n. \qquad (2.9)
\end{aligned}
\]
It is called the "observed" likelihood because the unobserved random effects have
been integrated out and (2.9) is a function of the observed data only. Except
for the linear mixed model, where $f(y_1, \ldots, y_n \mid u_1, \ldots, u_n)$ is a normal density,
the integral has no closed-form solution and numerical procedures (analytic or
stochastic) are necessary to calculate and maximize it. Standard maximization
techniques such as Newton-Raphson or EM for fitting GLMs and LMs have
to be modified because the conditional distribution of the observations and the
distribution of the random effects are not conjugate and the integral is analytically
intractable.
2.2.1 Direct and Indirect Maximum Likelihood Procedures
In general, there are two ways to obtain maximum likelihood estimates
from the marginal likelihood in (2.9). The first is a direct approach, which
attempts to approximate the integral by either analytic or stochastic methods
and then maximizes this approximation with respect to the parameters $\beta$ and $\psi$.
Some common analytic approximation methods are Gauss-Hermite quadrature
(Abramowitz and Stegun, 1964), a first-order Taylor series expansion of the
integrand, or a Laplace approximation (Tierney and Kadane, 1986), which is
based on a second-order Taylor series expansion. The two latter methods result
in likelihood equations similar to those of a linear mixed model (Breslow and Clayton,
1993; Wolfinger and O'Connell, 1993), and by iteratively fitting such a model and
re-expanding the integrand around updated parameter estimates one can obtain
approximate maximum likelihood estimates. However, these methods have been
shown to yield estimates which can be biased and inconsistent, an issue which is
discussed in Lin and Breslow (1996) and Breslow and Lin (1995).
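To illustrate the first of these approximations, here is a minimal Gauss-Hermite quadrature sketch for the marginal log-likelihood of a single cluster under a hypothetical Poisson random-intercept model ($\log \mu_t = \beta_0 + u$, $u \sim N(0, \sigma^2)$); the data and parameter values are invented for illustration only.

```python
import numpy as np
from math import lgamma, exp, log, sqrt, pi

def poisson_logpmf(y, mu):
    # log of the Poisson probability mass function
    return -mu + y * log(mu) - lgamma(y + 1)

def marginal_loglik(y, beta0, sigma, n_nodes=20):
    """Gauss-Hermite approximation to log int f(y | u) N(u; 0, sigma^2) du
    for a Poisson random-intercept model with log mu_t = beta0 + u."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    total = 0.0
    for xk, wk in zip(nodes, weights):
        u = sqrt(2.0) * sigma * xk      # change of variables u = sqrt(2)*sigma*x
        loglik = sum(poisson_logpmf(yt, exp(beta0 + u)) for yt in y)
        total += wk * exp(loglik)
    return log(total) - 0.5 * log(pi)   # the 1/sqrt(pi) factor from the substitution
```

With the substitution $u = \sqrt{2}\sigma x$, the Gaussian weight $e^{-x^2}$ of the quadrature rule absorbs the normal density, so a modest number of nodes already gives high accuracy for smooth, well-centered integrands.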
Techniques using stochastic integral approximations are known under the name
simulated maximum likelihood and have been proposed by Geyer and Thompson
(1992) and Gelfand and Carlin (1993). These methods approximate the integral in
(2.9) by importance sampling (Robert and Casella, 1999) and are better suited for
larger dimensional integrals than analytic approximations. Usually, the importance
density depends on the parameters to be estimated, and so simulated maximum
likelihood is used iteratively by first approximating the integral by a Monte Carlo
sum with some initial values for the unknown parameters. Then, the likelihood
is maximized and the resulting parameters are used to generate a new sample
from the importance density in the next iteration. We will briefly discuss the idea
behind importance sampling in Section 2.3.2. The simulated maximum likelihood
approach is also further illustrated in the next section, where we discuss it in the
context of state space models.
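As a toy illustration of the idea, the sketch below forms a single importance-sampling (simulated maximum likelihood) estimate of the marginal log-likelihood for a hypothetical one-cluster Poisson random-intercept model; the normal importance density and all parameter values are invented for illustration.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(0)

y, beta0, sigma = np.array([3, 5, 4]), 1.4, 0.5   # hypothetical data and parameters

def simulated_loglik(m=20000, center=0.0, scale=0.6):
    """Importance-sampling estimate of log L = log int f(y|u) N(u; 0, sigma^2) du,
    using the importance density N(center, scale^2)."""
    u = rng.normal(center, scale, size=m)
    mu = np.exp(beta0 + u)                               # Poisson means, one per draw
    log_f = sum(-mu + yt * np.log(mu) - lgamma(yt + 1) for yt in y)
    log_prior = -0.5 * np.log(2 * np.pi * sigma**2) - u**2 / (2 * sigma**2)
    log_imp = -0.5 * np.log(2 * np.pi * scale**2) - (u - center)**2 / (2 * scale**2)
    log_w = log_f + log_prior - log_imp                  # log importance ratios
    mx = log_w.max()                                     # log-mean-exp for stability
    return mx + np.log(np.mean(np.exp(log_w - mx)))
```

In the iterative scheme described above, this estimate would be recomputed with the importance density re-centered after each maximization step.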
An alternative to the direct approximating methods is the EM algorithm
(Dempster et al., 1977). The integral in (2.9) is not directly maximized in this
method, but is maximized indirectly by considering a related function $Q(\cdot \mid \cdot)$. At
each step of this algorithm, maximization of the $Q$-function increases the marginal
likelihood, a fact that can be verified using Jensen's inequality. The EM algorithm
relies on recognizing or inventing missing data which, together with the observed
data, simplify maximum likelihood calculations. For GLMMs, the random effects
$u_1, \ldots, u_n$ are treated as the missing data. In particular, let $\beta^{(k-1)}$ and $\psi^{(k-1)}$
denote current (at the end of iteration $k-1$) values for parameter vectors $\beta$ and
$\psi$. Also, let $y' = (y_1', \ldots, y_n')$ and $u' = (u_1', \ldots, u_n')$ denote the vectors of all
observations and their associated random effects. Then, the $Q(\cdot \mid \cdot)$ function at the
start of iteration $k$ has the form
\[
\begin{aligned}
Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) &= E\left[\log j(y, u; \beta, \psi) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] \\
&= E\left[\log f(y \mid u; \beta) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] \qquad (2.10) \\
&\quad + E\left[\log g(u; \psi) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right],
\end{aligned}
\]
where
\[
j(y, u; \beta, \psi) = f(y \mid u; \beta)\, g(u; \psi)
\]
denotes the joint density of observed and missing data, also known as the complete
data. The expectation in (2.10) is with respect to the conditional distribution
of $u \mid y$, evaluated at the current parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)}$. The
calculation of the expected value is called the E-step and it is followed by an
M-step, which maximizes $Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)})$ with respect to $\beta$ and $\psi$. The
resulting estimates $\beta^{(k)}$ and $\psi^{(k)}$ are used as updates in the next iteration to recalculate
the E-step and the M-step. Since $(\beta^{(k)}, \psi^{(k)})$ is the maximizer at iteration $k$,
\[
Q(\beta^{(k)}, \psi^{(k)} \mid \beta^{(k-1)}, \psi^{(k-1)}) \geq Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) \quad \text{for all } (\beta, \psi), \qquad (2.11)
\]
and it follows that the likelihood increases (or at worst stays the same) from one
iteration to the next:
\[
\begin{aligned}
\log L(\beta^{(k)}, \psi^{(k)}; y) &= Q(\beta^{(k)}, \psi^{(k)} \mid \beta^{(k-1)}, \psi^{(k-1)}) \\
&\quad - E\left[\log h(u \mid y; \beta^{(k)}, \psi^{(k)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] \\
&\geq Q(\beta^{(k-1)}, \psi^{(k-1)} \mid \beta^{(k-1)}, \psi^{(k-1)}) \\
&\quad - E\left[\log h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] \\
&= \log L(\beta^{(k-1)}, \psi^{(k-1)}; y).
\end{aligned}
\]
Here we used (2.11) and the fact that
\[
\begin{aligned}
&E\left[\log h(u \mid y; \beta^{(k)}, \psi^{(k)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]
- E\left[\log h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] \\
&\quad = E\left[\log\left(h(u \mid y; \beta^{(k)}, \psi^{(k)}) / h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})\right) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] \\
&\quad \leq \log E\left[h(u \mid y; \beta^{(k)}, \psi^{(k)}) / h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] = 0,
\end{aligned}
\]
where the inequality in the last step derives from Jensen's inequality. Under regularity
conditions (Wu, 1983) and for suitable initial starting values $(\beta^{(0)}, \psi^{(0)})$, the sequence
of estimates $\{(\beta^{(k)}, \psi^{(k)})\}$ converges to the maximum likelihood estimators $(\hat{\beta}, \hat{\psi})$.
The EM algorithm is most useful if replacing the calculation of the integral in
the marginal likelihood (2.9) by the calculation of the integral in the $Q$-function
(2.10) simplifies computation and maximization. Unfortunately, for GLMMs the
integrals in (2.10) are also intractable, since the conditional density of $u \mid y$ involves
the integral in (2.9). However, the EM algorithm may still be used by approximating
the expectation in the E-step with appropriate Monte Carlo methods. The
resulting algorithm is called the Monte Carlo EM algorithm (MCEM) and was
proposed by Wei and Tanner (1990). We review it in detail in Section 2.3.
Some arguments favoring the use of the MCEM algorithm over direct methods
such as simulated maximum likelihood for fitting GLMMs, especially when some
variance components in $\psi$ are large, are given in Jank and Booth (2003) and Booth
et al. (2001). Currently, the only available software for fitting GLMMs uses direct
methods such as Gauss-Hermite quadrature or simulated maximum likelihood
(e.g., SAS's proc nlmixed). The state space models of Section 2.1.2 are also fitted via
simulated maximum likelihood. This is discussed in Section 2.2.3.
2.2.2 Model Fitting in a Bayesian Framework
In a Bayesian context, GLMMs are two-stage hierarchical models with appropriate
priors on $\beta$ and $\psi$. Instead of obtaining maximum likelihood estimates
of unknown parameters, a Bayesian analysis looks at their entire posterior distributions,
given the observed data. Markov chain Monte Carlo techniques avoid
the tedious integration in the posterior densities and allow for relatively easy
simulation from these distributions, compared with the problems encountered in
maximum likelihood estimation. This suggests approximating maximum likelihood
estimates via a Bayesian route, assuming improper or at least very diffuse priors
and exploiting the proportionality of the likelihood function and the posterior
distribution of the parameters. However, for many discrete data models, improper
priors may lead to improper posteriors. Natarajan and McCulloch (1995) demonstrate
this with GLMMs for correlated binary data assuming independent $N(0, \sigma^2)$
random effects and a flat or a noninformative prior for $\sigma^2$. Sun, Tsutakawa and
Speckman (1999) and Sun, Speckman and Tsutakawa (2000) show that with
noninformative (flat) priors on fixed effects and variance components of more complicated
random effects distributions, propriety of the posterior distribution cannot
be guaranteed for a Poisson GLMM when one of the observed counts is zero, and
is impossible in a logit link GLMM for binomial$(n, \pi)$ observations if just one of
the observations is equal to $0$ or $n$. Of course, the use of proper priors will always
lead to proper posteriors. However, for the often employed diffuse but proper priors,
Natarajan and McCulloch (1998) show that even with enormous simulation sizes,
posterior estimates (such as the posterior mode) can be far away from maximum
likelihood estimates, which makes their use undesirable in a frequentist setting.
2.2.3 Maximum Likelihood Estimation for State Space Models
The same problems as in the GLMM case arise for maximum likelihood fitting
of non-Gaussian state space models. Here, we review a simulated maximum
likelihood approach suggested by Durbin and Koopman (1997), using notation introduced
in Section 2.1.2. Let $p(y \mid \alpha; \psi) = \prod_t p(y_t \mid \alpha_t; \psi)$ denote the distribution
of all observations given the states and let $p(\alpha; \psi)$ denote the distribution of the
states, where $y$ and $\alpha$ are the stacked vectors of all observations and all states,
respectively. The vector $\psi$ holds parameters that may appear in $w_t$, $T_t$ and $Q_t$.
Let $p(y, \alpha; \psi)$ denote the joint density of observations and states. For practical
purposes, it is easier to work with the signal $\theta_t$ instead of the high dimensional
state vector $\alpha_t$. Hence, let $p(y \mid \theta; \psi)$, $p(\theta; \psi)$ and $p(y, \theta; \psi)$ denote the corresponding
conditional, marginal and joint distributions parameterized in terms of
the signal $\theta_t = w_t'\alpha_t$, $t = 1, \ldots, T$, where $\theta = (\theta_1, \ldots, \theta_T)'$. The observed likelihood
is then given by the integral
\[
L(\psi; y) = \int p(y \mid \theta; \psi)\, p(\theta; \psi)\, d\theta. \qquad (2.12)
\]
To maximize (2.12) with respect to $\psi$, Durbin and Koopman (1997, 2000) first
calculate the likelihood $L_g(\psi; y)$ for an approximating Gaussian model and then
obtain the true likelihood $L(\psi; y)$ by an adjustment to it. However, two different
approaches of how to construct the approximating Gaussian model are presented
in the two papers. In Durbin and Koopman (1997) the approximating model is
obtained by assuming that observations follow a linear Gaussian model
\[
y_t = w_t'\alpha_t + \epsilon_t = \theta_t + \epsilon_t,
\]
with $\epsilon_t \sim N(\mu_t, \sigma_t^2)$. All densities generated under this model are denoted by $g(\cdot)$.
The two parameters $\mu_t$ and $\sigma_t^2$ are chosen such that the true density $p(y \mid \theta; \psi)$ and
its normal approximation $g(y \mid \theta; \psi)$ are as close as possible in the neighborhood
of the posterior mean $E_g[\theta \mid y]$. The state equations of the true non-Gaussian
model and the Gaussian approximating model are assumed to be the same,
which implies that the marginal density of $\theta$ is the same under both models, i.e.,
$p(\theta; \psi) = g(\theta; \psi)$. The likelihood of the approximating model is then given by
\[
L_g(\psi; y) = g(y; \psi) = \frac{g(y \mid \theta; \psi)\, p(\theta; \psi)}{g(\theta \mid y; \psi)}. \qquad (2.13)
\]
This likelihood is calculated using a recursive procedure known as the Kalman
filter (see, for instance, Fahrmeir and Tutz, 2001, Chap. 8). Alternatively, the
approximating Gaussian model is a regular linear mixed model and maximum
likelihood calculations can be carried out using more familiar algorithms in the
linear mixed model literature (see, for instance, Verbeke and Molenberghs, 2000).
From (2.13),
\[
p(\theta; \psi) = \frac{L_g(\psi; y)\, g(\theta \mid y; \psi)}{g(y \mid \theta; \psi)},
\]
and upon plugging this into (2.12),
\[
L(\psi; y) = L_g(\psi; y)\, E_g\!\left[\frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\right], \qquad (2.14)
\]
where $E_g$ denotes expectation with respect to the Gaussian density $g(\theta \mid y; \psi)$
generated by the approximating model. Hence, the observed likelihood of the non-Gaussian
model can be estimated by the likelihood of an approximating Gaussian
model and an adjustment factor, in particular
\[
\hat{L}(\psi; y) = L_g(\psi; y)\, \hat{w}(\psi),
\]
where
\[
\hat{w}(\psi) = \frac{1}{m} \sum_{i=1}^{m} \frac{p(y \mid \theta^{(i)}; \psi)}{g(y \mid \theta^{(i)}; \psi)}
\]
is a Monte Carlo sum approximating the expected value $E_g$ with $m$ random
samples $\theta^{(i)}$ from $g(\theta \mid y; \psi)$. Normality of $g(\theta \mid y; \psi)$ allows for straightforward
simulation from this density.
A different approach for choosing the approximating Gaussian model is
presented in Durbin and Koopman (2000). There, the model is determined by
choosing the parameters of an approximating Gaussian state space model (2.6) such that
the posterior densities $g(\theta \mid y; \psi)$ implied by the Gaussian model and $p(\theta \mid y; \psi)$
implied by the true model have the same posterior mode $\hat{\theta}$.
Formally, by dividing and multiplying (2.12) by the importance density
$g(\theta \mid y; \psi)$, we can interpret approximation (2.14) as an importance sampling
estimate of the observed likelihood and the entire procedure as a simulated
maximum likelihood approach:
\[
\begin{aligned}
L(\psi; y) &= \int p(y \mid \theta; \psi)\, \frac{p(\theta; \psi)}{g(\theta \mid y; \psi)}\, g(\theta \mid y; \psi)\, d\theta \\
&= \int p(y \mid \theta; \psi)\, \frac{g(y; \psi)}{g(y \mid \theta; \psi)}\, g(\theta \mid y; \psi)\, d\theta \\
&= g(y; \psi)\, E_g\!\left[\frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\right].
\end{aligned}
\]
Durbin and Koopman (1997, 2000) present a clever way of artificially enlarging the
simulated sample of $\theta^{(i)}$'s from the importance density $g(\theta \mid y; \psi)$ by the use of
antithetic variables (Robert and Casella, 1999). These quadruple the sample size
without additional simulation effort and balance the sample for location and scale.
Overall, this leads to a reduction in the total sample size necessary to achieve a
certain precision in the estimates.
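The location-balancing part of the antithetic idea is easy to sketch: if $\theta \sim N(\mu, \sigma^2)$, then $\tilde{\theta} = 2\mu - \theta$ has the same distribution, so every draw can be paired with its reflection at no extra simulation cost. The toy integrand below merely stands in for the ratio $p/g$ and is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, m = 0.5, 1.0, 4000

def integrand(theta):
    # stand-in for the ratio p(y | theta) / g(y | theta)
    return np.exp(-0.1 * theta**2)

theta = rng.normal(mu, sigma, size=m)
anti = 2 * mu - theta                    # antithetic draws, balanced for location

plain = integrand(theta).mean()          # plain Monte Carlo estimate
balanced = 0.5 * (integrand(theta) + integrand(anti)).mean()
```

Each antithetic pair averages exactly to $\mu$, which removes the location component of the Monte Carlo error; Durbin and Koopman additionally balance for scale, giving four draws per simulated variate.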
In practice, it is desirable in the maximization process to work with $\log L(\psi; y)$.
Durbin and Koopman (1997, 2000) present a correction for the bias introduced
by estimating $\log E_g[p(y \mid \theta; \psi)/g(y \mid \theta; \psi)]$. Finally, the resulting estimator
of $\log L(\psi; y)$ can be maximized with respect to $\psi$ by a suitable numerical
procedure, such as Newton-Raphson.
We mentioned before that simulated maximum likelihood can be computationally
inefficient and suboptimal, especially when some variance components
are large (Jank and Booth, 2003). As we will see in various examples in Chapter
5, large variance components (e.g., a large random effects variance) are the norm
rather than the exception with the type of time series models we consider. Next,
we will look at an alternative, indirect method for fitting our models. In principle,
though, the methods just described are also applicable to GLMMs, through the
close connections of GLMMs and state space models described above.
2.3 The Monte Carlo EM Algorithm
In Section 2.2 we presented the EM algorithm as an iterative procedure
consisting of two components, the E-step and the M-step. The E-step calculates a
conditional expectation, while the M-step subsequently maximizes this expectation.
Often, at least one of these steps is analytically intractable and in most of the
applications considered here, both steps are. Numerical methods (analytic and
stochastic) have to be used to overcome these difficulties, whereby the E-step
is usually the more troublesome. One popular way of approximating the expected
value in the E-step uses Monte Carlo methods and is discussed in Wei and Tanner
(1990), McCulloch (1994, 1997) and Booth and Hobert (1999). The Monte Carlo
EM (MCEM) algorithm uses a sample from the distribution of the random effects
$u$ given the observed data $y$ to approximate the $Q$-function in (2.10). In particular,
at iteration $k$, let $u^{(1)}, \ldots, u^{(m)}$ be a sample from this distribution, denoted by
$h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$ and evaluated at the parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)}$
from the previous iteration. The approximation to (2.10) is then given by
\[
Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) = \frac{1}{m} \sum_{j=1}^{m} \log f(y \mid u^{(j)}; \beta)
+ \frac{1}{m} \sum_{j=1}^{m} \log g(u^{(j)}; \psi). \qquad (2.15)
\]
As $m \to \infty$, $Q_m \to Q$ with probability one. The M-step then maximizes $Q_m$
instead of $Q$ with respect to $\beta$ and $\psi$, and the resulting estimates $\beta^{(k)}$ and $\psi^{(k)}$
are used in the next iteration to generate a new sample from $h(u \mid y; \beta^{(k)}, \psi^{(k)})$. If
maximization is not possible in closed form, sometimes only a pair of values $(\beta, \psi)$
which satisfies $Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) \geq Q_m(\beta^{(k-1)}, \psi^{(k-1)} \mid \beta^{(k-1)}, \psi^{(k-1)})$, but
which does not attain the global maximum, is chosen as the new parameter update
$(\beta^{(k)}, \psi^{(k)})$. However, we show for our models that the global maximum can be
approximated in very few steps.
Maximization of $Q_m$ with respect to $\beta$ and $\psi$ is equivalent to maximizing the
first term in (2.15) with respect to $\beta$ only and the second term with respect to $\psi$
only. This is due to the two-stage hierarchy of the response distribution and the
random effects distribution in GLMMs and is discussed next. Different approaches
to obtaining a sample from $h(u \mid y; \beta, \psi)$ for the approximation of the E-step are
presented in Section 2.3.2 and convergence criteria are discussed in Section 2.3.3.
2.3.1 Maximization of $Q_m$
For now we assume we have available a sample $u^{(1)}, \ldots, u^{(m)}$ from $h(u \mid y; \beta, \psi)$
or an importance sampling distribution, generated by one of the mechanisms
described in Sections 2.3.2 to 2.3.5. Let $Q_m^1$ and $Q_m^2$ be the first and second
term of the sum in (2.15). Using the exponential family expression for the densities
$f(y_{it} \mid u_i)$, at iteration $k$,
\[
Q_m^1(\beta \mid \beta^{(k-1)}) \propto \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} \sum_{t=1}^{n_i}
\left[ y_{it}\,\theta_{it}^{(j)} - b(\theta_{it}^{(j)}) \right], \qquad (2.16)
\]
where, according to the GLMM specifications,
\[
b'(\theta_{it}^{(j)}) = \mu_{it}^{(j)} = h^{-1}(x_{it}'\beta + z_{it}'u_i^{(j)}),
\]
with $u_i^{(j)}$ the $i$th component of the $j$th sampled random effects vector $u^{(j)}$.
Maximizing $Q_m^1$ with respect to $\beta$ is equivalent to fitting an augmented GLM with
known offsets: for $j = 1, \ldots, m$, let $y_{it}^{(j)} = y_{it}$ and $x_{it}^{(j)} = x_{it}$ be the random
components and known design vectors for this augmented GLM, and let $z_{it}'u_i^{(j)}$ be
a known offset associated with each $y_{it}^{(j)}$. That is, we duplicate the original data
set $m$ times and attach a known offset $z_{it}'u_i^{(j)}$ to each replicated observation. The
model for the mean in the augmented GLM, $E[y_{it}^{(j)}] = \mu_{it}^{(j)} = h^{-1}(x_{it}'\beta + z_{it}'u_i^{(j)})$,
is structurally equivalent to the model for the mean in the GLMM. Then, the
log-likelihood equations for estimating $\beta$ in the augmented GLM are proportional
to $Q_m^1$. Hence, maximization of $Q_m^1$ with respect to $\beta$ follows along the lines of
well known, iterative Newton-Raphson or Fisher scoring algorithms for GLMs.
Denote by $\beta^{(k)}$ the parameter vector after convergence of one of these algorithms.
It represents the value of the maximum likelihood estimator of $\beta$ at iteration $k$ of
the MCEM algorithm.
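The augmented-GLM trick can be sketched directly: stack the data $m$ times, attach the offsets $z_{it}'u_i^{(j)}$, and run Newton-Raphson for a Poisson log-link GLM with a known offset. The data and the stand-in draws for the random effects below are invented for illustration; in an actual MCEM run the $u^{(j)}$ would come from $h(u \mid y)$ at the current parameter values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: n clusters with ni = 4 observations each,
# Poisson GLMM with log mu_it = b0 + b1 * x_it + u_i (all values hypothetical)
n, ni, m = 30, 4, 50
X = np.column_stack([np.ones(n * ni), rng.normal(size=n * ni)])
beta_true = np.array([1.0, 0.4])
u_true = rng.normal(0.0, 0.5, size=n)
y = rng.poisson(np.exp(X @ beta_true + np.repeat(u_true, ni)))

# Stand-in Monte Carlo draws of the random effects
U = rng.normal(0.0, 0.5, size=(m, n))

# Augment: replicate (y, X) m times and attach the offset z'u^(j) to each copy
y_aug = np.tile(y, m)
X_aug = np.tile(X, (m, 1))
offset = np.repeat(U, ni, axis=1).ravel()

beta = np.zeros(2)
for _ in range(25):                       # Newton-Raphson for the Poisson log link
    mu = np.exp(X_aug @ beta + offset)
    score = X_aug.T @ (y_aug - mu)        # score of the augmented GLM
    info = X_aug.T @ (X_aug * mu[:, None])
    beta = beta + np.linalg.solve(info, score)
```

Because the offsets are fixed known constants, the augmented fit is an ordinary GLM computation, which is what makes this M-step so convenient.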
The expression for $Q_m^2$ depends on the assumed random effects distribution.
Most generally, let $\Sigma$ be an unstructured $q \times q$ covariance matrix for the random
effects vector $u = (u_1', \ldots, u_n')'$, where $q = \sum_i q_i$ and $q_i$ is the dimension of each
cluster specific random effect $u_i$. Then, assuming $u$ has a mean zero multivariate
normal distribution $g(u; \psi)$, where $\psi$ holds the $\frac{1}{2}q(q+1)$ distinct elements of $\Sigma$,
$Q_m^2$ has the form
\[
Q_m^2(\psi) \propto -\frac{1}{2} \log|\Sigma| - \frac{1}{2m} \sum_{j=1}^{m} u^{(j)\prime}\, \Sigma^{-1} u^{(j)}.
\]
The goal is to maximize $Q_m^2$ with respect to the variance components $\psi$ of $\Sigma$. For
a general $\Sigma$, the maximum is obtained at the variance components of the sample
covariance matrix $S_m = \frac{1}{m} \sum_{j=1}^{m} u^{(j)} u^{(j)\prime}$. Denoting these by $\psi^{(k)}$ gives the value of
the maximum likelihood estimator of $\psi$ at iteration $k$ of the MCEM algorithm.
The simplest structure occurs when the random effects $u_i$ have independent
components and are i.i.d. across all clusters, where $g(u; \psi)$ is then the product of
$n$ $N(0, \sigma^2 I)$ densities and $\psi = \sigma$. $Q_m^2$ at iteration $k$ is then maximized at
$\sigma^{(k)} = \big(\frac{1}{mq} \sum_{j=1}^{m} u^{(j)\prime} u^{(j)}\big)^{1/2}$. Many applications of GLMMs use this simple structure
of i.i.d. random effects, where often $u_i$ is a univariate random intercept. In this
case, the estimate of $\sigma$ at iteration $k$ reduces to $\sigma^{(k)} = \big(\frac{1}{mn} \sum_{j=1}^{m} \sum_{i=1}^{n} u_i^{(j)2}\big)^{1/2}$.
In Chapter 3 we will drop the assumption of independence and look at correlated
random effects, but with more parsimonious covariance structures than the most
general case presented here. Maximization of $Q_m^2$ with respect to $\psi$ will be
presented there on a case by case basis.
2.3.2 Generating Samples from $h(u \mid y; \beta, \psi)$
So far we assumed we had available a sample $u^{(1)}, \ldots, u^{(m)}$ to approximate
the expected value in the E-step of the MCEM algorithm. This section describes
how to generate such a sample from $h(u \mid y; \beta, \psi)$, which is known only up to a
normalizing constant, or from an importance density. In the following, we will
suppress the dependency on the parameters $\beta$ and $\psi$, since the densities are always
evaluated at their current values. Three methods are presented: the accept-reject
algorithm produces independent samples, while Metropolis-Hastings algorithms
produce dependent samples. A detailed description of all three methods can be
found in Robert and Casella (1999).
2.3.2.1 Accept-reject sampling in GLMMs
In general, for accept-reject sampling we need to find a candidate density
$g$ and a constant $M$, such that for the density of interest $h$ (the target density)
$h(x) \leq M g(x)$ holds for all $x$ in the support of $h$. The algorithm is then to
1. generate $x \sim g$, $w \sim$ Uniform$[0, 1]$;
2. accept $x$ as a random sample from $h$ if $w \leq \frac{h(x)}{M g(x)}$;
3. return to 1. otherwise.
This will produce one random sample $x$ from the target density $h$. The
probability of acceptance is given by $1/M$ and the expected number of trials until a
variable is accepted is $M$.
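As a self-contained illustration of these three steps, the sketch below samples from a Beta(2, 2) target using a Uniform(0, 1) candidate, for which $M = \sup_x h(x) = 1.5$; the target is chosen only to keep the example simple.

```python
import numpy as np

rng = np.random.default_rng(3)

def h(x):
    # target density: Beta(2, 2), i.e., h(x) = 6 x (1 - x) on [0, 1]
    return 6.0 * x * (1.0 - x)

M = 1.5  # sup_x h(x) / g(x) with candidate g = Uniform(0, 1), so g(x) = 1

def accept_reject(size):
    out = []
    while len(out) < size:
        x = rng.uniform()          # 1. generate x ~ g and w ~ Uniform[0, 1]
        w = rng.uniform()
        if w <= h(x) / M:          # 2. accept if w <= h(x) / (M g(x))
            out.append(x)          # 3. otherwise return to step 1
    return np.array(out)

sample = accept_reject(5000)
```

Here the acceptance probability is $1/M = 2/3$, so on average $1.5$ candidate draws are needed per accepted sample.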
For our purpose, the target density is $h(u \mid y)$. Since $h(u \mid y) = f(y \mid u)\, g(u)/c \leq (M/c)\, g(u)$,
where $M = \sup_u f(y \mid u)$ and $c$ is an unknown normalizing
constant equal to the marginal likelihood, the multivariate normal random effects
distribution $g(u)$ can be used as a candidate density. Booth and Hobert (1999,
Sect. 4.1) show that for certain models $\sup_u f(y \mid u)$ can be easily calculated
from the data alone and thus need not be updated at every iteration. For some
models we discuss here, the condition of Booth and Hobert (1999, Sect. 4.1, page
272) required for this simplification does not hold. However, the likelihood of a
saturated GLM is always an upper bound for $f(y \mid u)$. To illustrate, regard $L(u) =
f(y \mid u)$ as the likelihood corresponding to a GLM with random components $y_{it}$
and linear predictor $\eta_{it} = z_{it}'u_i + x_{it}'\beta$, where now $x_{it}'\beta$ plays the role of a known
offset and the $u_i$ are the parameters of interest. The maximized likelihood $L(\hat{u})$ for
this model is always less than the maximized likelihood $L(y)$ for a saturated model.
Hence, $\sup_u f(y \mid u) \leq L(\hat{u}) \leq L(y)$, and $L(\hat{u})$ or $L(y)$ can be used to construct $M$.
Example: In Section 3.1 we consider a data set where, conditional on a random
effect $u_t$, $y_{it}$, the $t$th observation in group $i$, is modeled as a Binomial$(n_{it}, \pi_{it})$
random variable. There are 16 time points, i.e., $t = 1, \ldots, 16$, and two groups $i =
1, 2$. A very simple logistic-normal GLMM for these data has the form logit$(\pi_{it}(u_t)) =
\alpha + \beta x_i + u_t$, where $x_i$ is a binary group indicator. The overall design matrix for
this problem is the $32 \times 18$ matrix
\[
\begin{pmatrix}
1 & 0 & 1 & 0 & \cdots & 0 \\
1 & 1 & 1 & 0 & \cdots & 0 \\
1 & 0 & 0 & 1 & \cdots & 0 \\
1 & 1 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & 0 & 0 & 0 & \cdots & 1 \\
1 & 1 & 0 & 0 & \cdots & 1
\end{pmatrix},
\]
where the columns hold the coefficients corresponding to $\alpha, \beta, u_1, u_2, \ldots, u_{16}$. All
rows of this matrix are different, and as a consequence the condition of Booth and
Hobert (1999, Sect. 4.1, page 272) does not hold. However, the saturated binomial
likelihood $L(y)$ is an upper bound for $f(y \mid u)$, i.e.,
\[
\sup_u f(y \mid u) \leq L(y).
\]
For instance, with the logistic-normal example from above with linear predictor
$\eta_{it} = c_i + u_t$, where $c_i = \alpha + \beta x_i$ represents the fixed part of the model, we have
\[
\sup_u f(y \mid u) = \sup_u \prod_{i,t}
\left(\frac{e^{c_i + u_t}}{1 + e^{c_i + u_t}}\right)^{y_{it}}
\left(\frac{1}{1 + e^{c_i + u_t}}\right)^{n_{it} - y_{it}}.
\]
By first taking logs and then finding first and second derivatives with respect to $u_t$,
we see that $\hat{u}_t = \log\big(y_{it}/(n_{it} - y_{it})\big) - c_i$ maximizes the term for observation $(i, t)$
when $0 < y_{it} < n_{it}$. Plugging in, we obtain the result
\[
\sup_{u_t} f(y_{it} \mid u_t) = \left(\frac{y_{it}}{n_{it}}\right)^{y_{it}} \left(1 - \frac{y_{it}}{n_{it}}\right)^{n_{it} - y_{it}}.
\]
For the special cases of $y_{it} = 0$ or $y_{it} = n_{it}$, the trivial bound on $f(y_{it} \mid u_t)$ is 1.
Hence, the following inequality, which immediately follows from the above, can be used
in constructing the accept-reject algorithm for a logistic-normal model with linear
predictor of the form $\eta_{it} = c_i + u_t$:
\[
\sup_u f(y \mid u) \leq \prod_{i,t}
\left(\frac{y_{it}}{n_{it}}\right)^{y_{it}} \left(1 - \frac{y_{it}}{n_{it}}\right)^{n_{it} - y_{it}} = L(y).
\]
This means we can select $M = L(y)$ to meet the accept-reject condition and
consequently we accept a sample $u$ from $g(u)$ if, for a $w \sim$ Uniform$[0, 1]$,
\[
w \leq \frac{f(y \mid u)}{L(y)}.
\]
Notice that this condition is free of the normalizing constant $c$. In practice,
especially for high dimensional random effects, $M$ can be very large and therefore
we almost never accept a sample. Two alternative methods described below
may avoid this problem. Note, however, that the accept-reject method yields an
independent and identically distributed sample from the target distribution. This is
important if one wants to implement an automated MCEM algorithm (Booth and
Hobert, 1999), where the Monte Carlo sample size $m$ is increased automatically as
the algorithm progresses to adjust for the error in the Monte Carlo approximation
to the E-step.
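A runnable sketch of this accept-reject scheme for a small logistic-normal model follows. To keep the acceptance probability non-negligible we use only $T = 4$ time points rather than 16, and the counts and parameter values are invented; note also that the binomial coefficients cancel in the ratio $f(y \mid u)/L(y)$, so only the probability parts need to be computed.

```python
import numpy as np

rng = np.random.default_rng(4)

T, sigma = 4, 0.5
alpha, beta = 0.2, -0.4                    # hypothetical fixed effects
x = np.array([0.0, 1.0])                   # binary group indicator x_i
n = np.full((2, T), 20.0)                  # n_it = 20 trials per cell
y = np.array([[12.0, 9.0, 14.0, 11.0],
              [8.0, 10.0, 7.0, 9.0]])      # hypothetical counts y_it
phat = y / n                               # saturated fit y_it / n_it

def log_ratio(u):
    """log f(y | u) - log L(y); the binomial coefficients cancel in the ratio."""
    eta = alpha + beta * x[:, None] + u[None, :]
    p = 1.0 / (1.0 + np.exp(-eta))
    return np.sum(y * (np.log(p) - np.log(phat))
                  + (n - y) * (np.log1p(-p) - np.log1p(-phat)))

def draw_posterior():
    """Accept-reject with candidate g(u) = N(0, sigma^2 I_T) and M = L(y)."""
    while True:
        u = rng.normal(0.0, sigma, size=T)
        if np.log(rng.uniform()) <= log_ratio(u):  # accept if w <= f(y|u)/L(y)
            return u

u_sample = draw_posterior()
```

Already at $T = 4$ the acceptance rate is only a few percent; with 16 random effects it would be orders of magnitude smaller, which is exactly the practical problem noted above.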
2.3.2.2 Markov chain Monte Carlo methods
For the high dimensional distributions $h(u \mid y)$ that are unavoidable when correlated
random effects are used, accept-reject methods can be very slow. An alternative
is to generate a Markov chain with invariant distribution $h(u \mid y)$, which may
be much faster but results in dependent samples. McCulloch (1997) discussed a
Metropolis-Hastings algorithm for creating such a chain for the logistic-normal
regression case. In general, an independence Metropolis-Hastings algorithm is built
as follows: choose a candidate density $g^*(u)$ with the same support as $h(u \mid y)$.
Then, for a current state $u^{(j)}$,
1. generate $w \sim g^*$;
2. set $u^{(j+1)}$ equal to $w$ with probability $\rho = \min\left(1, \frac{h(w \mid y)\, g^*(u^{(j)})}{h(u^{(j)} \mid y)\, g^*(w)}\right)$
and equal to $u^{(j)}$ with probability $1 - \rho$.
After a sufficient burn-in time, the states of the generated chain can be
regarded as a (dependent) sample from $h(u \mid y)$. If the candidate density $g^*(u)$
is chosen to be the density of the random effects $g(u)$, the acceptance probability
in step 2 reduces to the simple form $\min\left(1, f(y \mid w)/f(y \mid u^{(j)})\right)$. To further
speed up simulations, McCulloch (1997) uses a random scan algorithm which only
updates the $k$th component of the previous state $u^{(j)}$ and, upon acceptance in
step 2, uses it as the new state.
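A minimal independence Metropolis-Hastings sketch with the random effects density as the candidate follows, for a hypothetical Poisson model with one observation per random effect ($\log \mu_t = \beta_0 + u_t$); data and parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

T, sigma, m = 8, 0.7, 4000
beta0 = 1.0
y = np.array([3, 4, 2, 6, 5, 3, 2, 4])   # hypothetical counts, one per time point

def log_f(u):
    # log f(y | u) for log mu_t = beta0 + u_t, additive constants dropped
    mu = np.exp(beta0 + u)
    return np.sum(y * np.log(mu) - mu)

u = np.zeros(T)                          # starting state
chain = np.empty((m, T))
for j in range(m):
    w = rng.normal(0.0, sigma, size=T)   # 1. candidate from g(u), the random effects density
    # 2. accept with probability min(1, f(y|w) / f(y|u)); the g terms cancel
    if np.log(rng.uniform()) < log_f(w) - log_f(u):
        u = w
    chain[j] = u
```

Because the candidate is the prior $g(u)$, the acceptance ratio involves only the likelihood, but whole-vector proposals are rarely accepted in high dimensions, which is what motivates the random scan variant mentioned above.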
Another popular MCMC algorithm is the Gibbs sampler. Let $u^{(j-1)} =
(u_1^{(j-1)}, \ldots, u_n^{(j-1)})$ denote the current state of a Markov chain with invariant distribution
$h(u \mid y)$. One iteration of the Gibbs sampler generates, componentwise,
\[
\begin{aligned}
u_1^{(j)} &\sim h(u_1 \mid u_2^{(j-1)}, \ldots, u_n^{(j-1)}, y) \\
&\;\;\vdots \\
u_n^{(j)} &\sim h(u_n \mid u_1^{(j)}, \ldots, u_{n-1}^{(j)}, y),
\end{aligned}
\]
where the $h(u_i \mid u_1^{(j)}, \ldots, u_{i-1}^{(j)}, u_{i+1}^{(j-1)}, \ldots, u_n^{(j-1)}, y)$ are the so-called full conditionals of
$h(u \mid y)$. The vector $u^{(j)} = (u_1^{(j)}, \ldots, u_n^{(j)})$ represents the new state of the chain
and, after a sufficient burn-in time, can be regarded as a sample from $h(u \mid y)$.
The advantage of the Gibbs sampler is that it reduces sampling of a possibly very
high-dimensional vector $u$ to sampling of several lower-dimensional components
of $u$. We will use the Gibbs sampler in connection with autoregressive random
effects to simplify sampling from an initially very high-dimensional distribution
$h(u \mid y)$ by sampling from its simpler full univariate conditionals.
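The mechanics are easiest to see in a toy case where the full conditionals are available in closed form: a bivariate normal target with correlation $\rho$, whose full conditionals are $u_1 \mid u_2 \sim N(\rho u_2, 1-\rho^2)$ and $u_2 \mid u_1 \sim N(\rho u_1, 1-\rho^2)$. (For the non-conjugate GLMM targets considered later, the univariate full conditionals must themselves be sampled indirectly.)

```python
import numpy as np

rng = np.random.default_rng(6)
rho, m = 0.8, 20000
s = np.sqrt(1 - rho**2)   # conditional standard deviation

u1, u2 = 0.0, 0.0
draws = np.empty((m, 2))
for j in range(m):
    u1 = rng.normal(rho * u2, s)   # draw from the full conditional of u1
    u2 = rng.normal(rho * u1, s)   # then from the full conditional of u2
    draws[j] = (u1, u2)
```

Each scan replaces one two-dimensional draw by two univariate draws; with autoregressive random effects the same principle replaces one $T$-dimensional draw by $T$ univariate ones.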
2.3.2.3 Importance sampling
An importance sampling approximation to the $Q$-function in (2.10) is given by
\[
Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) = \sum_{j=1}^{m} w_j \log f(y \mid u^{(j)}; \beta)
+ \sum_{j=1}^{m} w_j \log g(u^{(j)}; \psi),
\]
where the $u^{(j)}$ are independent samples from an importance density $g^*(u; \psi^{(k-1)})$ and
\[
w_j = \frac{h(u^{(j)} \mid y; \beta^{(k-1)}, \psi^{(k-1)})}{g^*(u^{(j)}; \psi^{(k-1)})}
\propto \frac{f(y \mid u^{(j)}; \beta^{(k-1)})\, g(u^{(j)}; \psi^{(k-1)})}{g^*(u^{(j)}; \psi^{(k-1)})}
\]
are importance weights at iteration $k$. Usually, $Q_m$ is divided by the sum of the
importance weights $\sum_{j=1}^{m} w_j$. The normalizing constant of $h$ only depends on known
parameters $(\beta^{(k-1)}, \psi^{(k-1)})$ and hence plays no part in the following maximization
step. Selecting the importance density $g^*$ is a delicate issue. It should be easy
to simulate from but also resemble $h(u \mid y)$ as closely as possible. Booth and
Hobert (1999) suggest a Student $t$ density as the importance distribution $g^*$, whose
mean and variance match those of $h(u \mid y)$, derived via a Laplace
approximation.
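A sketch of these self-normalized weights for a hypothetical one-cluster Poisson random-intercept model follows, with a Student $t$ importance density in the spirit of Booth and Hobert (1999); the centering, scale and degrees of freedom below are invented rather than derived from a Laplace approximation.

```python
import numpy as np

rng = np.random.default_rng(7)

y, beta0, sigma, m = np.array([3, 5, 4]), 1.4, 0.5, 4000

def log_f(u):
    # log f(y | u) for the Poisson cluster, additive constants dropped
    mu = np.exp(beta0 + u)
    return sum(yt * np.log(mu) - mu for yt in y)

# Student t importance density (df, center, scale are illustrative choices)
df, center, scale = 5, 0.0, 0.6
u = center + scale * rng.standard_t(df, size=m)

log_g_t = -((df + 1) / 2) * np.log1p(((u - center) / scale) ** 2 / df)  # up to a constant
log_prior = -u**2 / (2 * sigma**2)                                      # up to a constant
log_w = log_f(u) + log_prior - log_g_t

w = np.exp(log_w - log_w.max())
w /= w.sum()                       # self-normalized importance weights

post_mean = np.sum(w * u)          # e.g., a weighted E-step quantity
</ ```

All additive constants (including the unknown normalizing constant of $h$) cancel in the self-normalized weights, which is why only the un-normalized log densities are needed.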
2.3.3 Convergence Criteria
Due to the stochastic nature of the algorithm, parameter estimates of two successive iterations can be close together just by chance, although convergence is not yet achieved. To reduce the risk of stopping prematurely, we declare convergence only if the relative change in parameter estimates is less than some ε₁ for c (e.g., five) consecutive times. Let λ^(k) = (β^(k)′, ψ^(k)′)′ be the vector of unknown fixed effects parameters and variance components. Then this condition means that

max_i |λ_i^(k) − λ_i^(k-1)| / |λ_i^(k-1)| < ε₁    (2.17)

has to be fulfilled for c consecutive (e.g., five) k's. For any λ_i^(k), an exception to this rule occurs when the estimated standard error of that parameter is substantially larger than the change from one iteration to the next. Hence, at iteration k, for those parameters satisfying

|λ_i^(k) − λ_i^(k-1)| / sqrt(var̂(λ_i^(k))) < ε₂,

where var̂(λ_i^(k)) is the current estimate of the variance of the MLE λ̂_i, the relative precision criterion (2.17) need not be met. An estimate of this variance can be obtained from the observed information matrix of the ML estimator for λ. Louis (1982) showed that the observed information matrix can be written in terms of the first (l′) and second (l″) derivatives of the complete data log-likelihood l(λ; y, u) = log f(y, u; λ). Evaluated at the MLE λ̂, it is given by

I(λ̂) = −E_{u|y}[l″(λ̂; y, u) | y] − var_{u|y}[l′(λ̂; y, u) | y].

An approximation to this matrix, at iteration k, uses Monte Carlo sums with draws from h(u | y; λ^(k)) from the current iteration of the MCEM algorithm.
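The stopping rule (2.17) together with the standard-error exception can be sketched as a small helper. The function name, the default tolerances, and the toy trajectory below are our own illustrative choices, not part of the dissertation's software.

```python
import numpy as np

def converged(history, se, eps1=0.002, eps2=0.003, c=5):
    """Check the stopping rule: the relative change in every parameter must
    fall below eps1 for c consecutive iterations, except for parameters
    whose change is tiny relative to their standard error (eps2).

    `history` is a list of parameter vectors (one per iteration), `se`
    the current standard-error estimates of the parameters."""
    if len(history) < c + 1:
        return False
    for new, old in zip(history[-c:], history[-c - 1:-1]):
        new, old = np.asarray(new), np.asarray(old)
        rel = np.abs(new - old) / np.abs(old)          # criterion (2.17)
        exempt = np.abs(new - old) / se < eps2         # s.e. exception
        if not np.all((rel < eps1) | exempt):
            return False
    return True

# A toy trajectory that settles down geometrically: converged() flips to True.
traj = [np.array([1.0 + 0.5 ** k, 2.0]) for k in range(20)]
done = converged(traj, se=np.array([0.1, 0.1]))
```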
To further safeguard against stopping prematurely, we use a third convergence criterion based on the Q_m function. For deterministic EM, the Q function is guaranteed to increase from iteration to iteration. With MCEM, because of the stochastic approximation nature, Q_m^(k) can be less than Q_m^(k-1) because of an "unlucky" Monte Carlo sample at iteration k. Hence, the parameter estimates obtained from maximizing Q_m^(k) can be a step in the wrong direction and actually decrease the value of the likelihood. To counter this, we declare convergence only if successive values of Q_m^(k) are within a small neighborhood. More important, however, is that we accept the k-th parameter update λ^(k) only if the relative change in the Q_m function is larger than some small negative constant −ε₃, i.e.,

[Q_m^(k) − Q_m^(k-1)] / |Q_m^(k-1)| > −ε₃.    (2.18)
If at iteration k (2.18) is not met and there is reason to believe that λ^(k) decreases the likelihood and is worse than the parameter update from the previous iteration, we repeat the k-th iteration with a new and larger Monte Carlo sample. Thereby, we hope to better approximate the Q-function and as a result get a better estimate λ^(k), with Q_m function larger than the previous one. If this does not happen, we nevertheless accept λ^(k) and proceed to the next iteration, possibly letting the algorithm temporarily move in a direction of a lower likelihood region. Otherwise, the Monte Carlo sample size quickly grows without bounds at an early stage of the algorithm. Furthermore, at early stages, the Monte Carlo error in the approximation of the Q function can be large and hence its trace plot is very volatile.
Caffo, Jank and Jones (2003) go a step further and calculate asymptotic confidence intervals for the change in the Q_m function, based on which they construct a rule for accepting or rejecting λ^(k). They discuss schemes for increasing the Monte Carlo sample size accordingly, and their MCEM algorithm inherits the ascent property of EM with high probability. However, we feel that the simpler criterion (2.18) suffices for the examples considered here.
Coupled with any convergence criterion is the question of the updating scheme for the Monte Carlo sample size m between iterations. In general, we will use m^(k) = a m^(k-1), where a > 1 and m^(k) is the Monte Carlo sample size at iteration k. At early iterations, m^(k) will be low, since big parameter jumps are expected regardless of the quality of the approximation and the Monte Carlo error associated with it. Later, as more weight is put on decreasing the Monte Carlo error in the approximations, the polynomial increase guarantees sufficiently large Monte Carlo samples. Furthermore, condition (2.18) signals when an additional boost in m^(k) is needed to better approximate the Q-function in the current iteration. Hence, whenever (2.18) is not met, we rerun iteration k with a bigger sample size q m^(k), where q > 1 (usually between 1 and 2).
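The updating scheme can be condensed to a few lines. The numerical values of a, q and ε₃ below are illustrative (the text uses values of similar magnitude), and the function names are our own.

```python
import math

def next_sample_size(m_prev, a=1.03):
    """Regular between-iteration growth m^(k) = a * m^(k-1)."""
    return math.ceil(a * m_prev)

def boosted_sample_size(m_k, q=1.5):
    """Extra boost q * m^(k), used to re-run iteration k whenever the
    relative change in Q_m violates condition (2.18)."""
    return math.ceil(q * m_k)

def q_change_ok(q_new, q_old, eps3=0.005):
    """Condition (2.18): accept the update unless Q_m dropped by more
    than a small relative amount eps3."""
    return (q_new - q_old) / abs(q_old) > -eps3

# Typical use inside one MCEM iteration:
m = next_sample_size(50)              # grow the sample size: 52
if not q_change_ok(-110.5, -109.7):   # Q_m dropped too much ...
    m = boosted_sample_size(m)        # ... redo the E-step with more draws
```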
Figure 2-1: Plot of the typical behavior of the Monte Carlo sample size m^(k) and the Q-function Q_m^(k) through MCEM iterations. The iteration number is shown on the x-axis. Plots are based on the data and model for the boat race data discussed in Chapter 5.
A typical picture of the Monte Carlo sample size m^(k) and the Q_m^(k) function through the iterations of an MCEM algorithm is presented in Figure 2-1. The increase in the Q_m^(k) function is large at the first iterations, but its Monte Carlo error is also large due to the small Monte Carlo sample size. The plot of the Monte Carlo sample size m^(k) shows several jumps, corresponding to the events that the Q_m^(k) function actually decreased by more than ε₃ from one iteration to the next and we adjusted with an additional boost in generated samples. The data and model on which this plot is based are taken from the boat race example analyzed and discussed in Chapter 5, with convergence criteria set to ε₁ = 0.001, c = 4, ε₂ = 0.003, ε₃ = 0.005, a = 1.03 and q = 1.05.
Fort and Moulines (2003) show that with geometrically ergodic (see, e.g.,
Robert and Casella, 1999) MCMC samplers, a polynomial increase in the Monte
Carlo sample size leads to convergence of MCEM parameter estimates. However,
establishing geometric ergodicity is not an easy task. Other, more sophisticated
and automated Monte Carlo sample size updating schemes are presented by Booth
and Hobert (1999) for independent sampling, and Caffo, Jank and Jones (2003), for
independent and MCMC sampling.
CHAPTER 3
CORRELATED RANDOM EFFECTS
In Chapter 2 we mentioned on several occasions that for certain data structures
the usual assumption of independent random effects is inappropriate. For instance,
if clusters represent time points in a study over time, observations from different
clusters can no longer be assumed (marginally) independent. Or, in longitudinal
studies, the nonnegative and exchangeable correlation structure among repeated
observations implied by a single random effect can be far from the truth for long
sequences of repeated observations. Section 3.1 presents data from a crosssectional
time series which motivates the use of correlated random effects and discusses their
implications. In Sections 3.2 and 3.3 two special correlation structures useful for
modeling the dependence structure in discrete repeated measures with possibly
unequally spaced observation times are discussed. The main focus of this chapter
is on the technical implications for the MCEM algorithm arising from estimating
an additional variance (correlation) component. In contrast to models with
independent random effects, the Mstep has no closed form solution and iterative
methods have to be used to find the maximum. Also, because random effects are
correlated a priori, they are correlated a posteriori, and sampling from the posterior
distribution of u I y as required by the MCEM algorithm is more involved than
with independent random effects. A Gibbs sampling approach is developed in
Section 3.4.
From here on we let t denote the index for the discrete observation times, t = 1, ..., T, and we let Y_it denote a response at time point t for stratum i, i = 1, ..., n. Throughout, we will assume univariate but correlated random effects {u_t}_{t=1}^T associated with the observations over time.
3.1 A Motivating Example: Data from the General Social Survey
The basic purpose of the General Social Survey (GSS), conducted by the National Opinion Research Center, is to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors and attributes. It is second only to the census in popularity among sociologists as a data source for conducting research. The GSS questionnaire contains a standard core of demographic and attitudinal variables whose wording is retained throughout the years to facilitate time trend studies. (Source: www.norc.uchicago.edu/projects/gensocl.asp). Currently, the GSS comprises a total of 24 surveys conducted in the years 1973-1978, 1980, 1982, 1983-1994, 1996, 1998, 2000 and 2002, with data available online (at www.webapp.icpsr.umich.edu/GSS/) through 1998. Two features, a discrete response variable (most of the attitude questions) observed through time and unequally spaced observation times, make it a prime resource for applying the models proposed in this dissertation. Data obtained from the GSS are different from longitudinal studies where subjects are followed through time. Here, responses are from independent cross-sectional surveys of different subjects in each year.
One question included in 16 of the 22 surveys through 1998 recorded attitude towards homosexual relationships. It was observed in the years 1974, 1976-77, 1980, 1982, 1984-85, 1987-1991, 1993-94, 1996 and 1998. We will use these data to motivate and illustrate the use of correlated random effects. Figure 3-1 shows the proportion of respondents who agreed with the statement that homosexual relationships are not wrong at all for the two race cohorts of white respondents and black respondents. For simplicity in this introductory example, only race was chosen as a cross-classifying variable and attitude was measured as answering "yes" or "no" to the aforementioned question. Let Y_it denote the number of people in year t and of race i who agreed with the statement that homosexual relationships
Figure 3-1: Sampling proportions from the GSS data set. Proportion of whites (squares) and blacks (circles) agreeing with the statement that homosexual relationships are not wrong at all, from 1974 to 1998.
are not wrong at all. The index t = 1, ..., 16 runs through the set of 16 years {1974, 1976, 1977, 1980, ..., 1998} mentioned above, and i = 1 for race equal to white and i = 2 for race equal to black. The conditional independence assumption discussed in Section 2.1 allows us to model Y_it, the sum of n_it binary variables which are the individual responses, as a binomial variable conditional on a yearly random effect. That is, the probabilistic model we propose assumes a conditional Binomial(n_it, π_it) distribution for each member of the two time series {Y_1t}_{t=1}^16 and {Y_2t}_{t=1}^16 pictured in Figure 3-1. The parameters n_it and π_it are the total number and the conditional probability of agreeing with the statement that homosexual relationships are not wrong at all, respectively, of respondents of race i in year t.
3.1.1 A GLMM Approach
A popular model for π_it is a logistic-normal model for which the link function h(·) in (2.2) is the logit link and the random effects structure simplifies to a random intercept u_t. We will assume that the fixed parameter vector β is composed of an intercept term α, linear and quadratic time effects β₁ and β₂, a race effect β₃ and a year-by-race interaction β₄. With x_1t representing the year variable centered around 1984 (e.g., x_11 = 1974 − 1984 = −10) and x_2i the indicator variable for race (for whites x_21 = 0, for blacks x_22 = 1), the model has form

logit(π_it(u_t)) = α + β₁x_1t + β₂x_1t² + β₃x_2i + β₄x_1t x_2i + u_t.    (3.1)
Apart from the fixed effects, the random time effect u_t captures the dependency structure over the years. Note that π_it(u_t) is a conditional probability, given the random effect u_t from the year the question was asked. This random effect u_t can
be interpreted as the unmeasurable public opinion about homosexual relationships,
common to all respondents within the same year. By introducing this random
effect, we assume that individual opinions are influenced by this overall opinion
or the social and political climate on homosexual relationships (like awareness of
AIDS and the social spending associated with it, which is hard to measure). Thus,
individual responses within a given year are no longer independent of each other,
but share a common random effect. Furthermore, it is natural to assume that the
public opinion about homosexual relationships changes gradually over time, with
higher correlations for years closer together and lower correlations for years further
apart. It would be wrong and unnatural to assume that the public opinion (or
political climate) is independent from one year to the next. However, this would
be assumed by modeling the random effects {ut} as independent of each other. It
would also be wrong to assume a common, time-independent random effect u = u_t for all time points t, as this implies that public opinion does not change over time. Its effect would then be the same, whether responses are measured in 1974 or 1998.
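To make the conditional structure of model (3.1) concrete, the two binomial time series can be simulated for given parameter values. All parameter values and the per-cell sample size n = 300 below are arbitrary illustrations, not estimates from the GSS data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Observation years from the GSS example and centered year covariate x1.
years = np.array([1974, 1976, 1977, 1980, 1982, 1984, 1985, 1987,
                  1988, 1989, 1990, 1991, 1993, 1994, 1996, 1998])
x1 = years - 1984

def simulate(alpha, b1, b2, b3, b4, u, n=300):
    """Draw one binomial time series per race from model (3.1),
    conditional on the yearly random effects u."""
    y = np.empty((2, len(u)), dtype=int)
    for i, x2 in enumerate([0, 1]):          # white: x2 = 0, black: x2 = 1
        eta = alpha + b1 * x1 + b2 * x1 ** 2 + b3 * x2 + b4 * x1 * x2 + u
        pi = 1.0 / (1.0 + np.exp(-eta))      # conditional P(agree)
        y[i] = rng.binomial(n, pi)
    return y

u = rng.normal(0.0, 0.1, size=len(years))    # shared yearly random effects
y = simulate(alpha=-1.5, b1=0.05, b2=0.001, b3=-0.5, b4=0.0, u=u)
```

Because both rows of `y` share the same draw of u, responses from the two races within a year are positively correlated marginally, exactly the within-cluster dependence discussed above.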
3.1.2 Motivating Correlated Random Effects
To capture the dependency in public opinion and therefore in responses over
different years, we propose random effects that are correlated. In particular,
for this example with unequally spaced observation times, we suggest normal autocorrelated random effects {u_t} with variance function

var(u_t) = σ², t = 1, ..., 16,

and correlation function

corr(u_t, u_t*) = ρ^|x_1t − x_1t*|, 1 ≤ t < t* ≤ 16,

where |x_1t − x_1t*| is the difference between the two years identified by indices t and t*. This is equivalent to specifying a latent autoregressive process

u_{t+1} = ρ^{x_{1,t+1} − x_1t} u_t + ε_t
underlying the data generation mechanism. Both of these formulations naturally
handle the multiple gaps in the observed time series. There is no need to make
adjustments (such as imputation of data or artificially treating the series as equally
spaced) in our analysis due to "missing" data at years 1975, 1978-79, 1981, 1983,
1986, 1992, 1995 or 1997.
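The way the gaps are absorbed into the correlation structure can be seen by writing out the implied covariance matrix over the 16 interview years; the values σ = 0.10 and ρ = 0.65 are the estimates reported later in this section.

```python
import numpy as np

# The 16 GSS interview years; the gap years enter only through |x_t - x_t*|.
years = np.array([1974, 1976, 1977, 1980, 1982, 1984, 1985, 1987,
                  1988, 1989, 1990, 1991, 1993, 1994, 1996, 1998])

def ar_cov(years, sigma, rho):
    """Covariance matrix with var(u_t) = sigma^2 and
    corr(u_t, u_t*) = rho ** |x_t - x_t*|."""
    lag = np.abs(years[:, None] - years[None, :])
    return sigma ** 2 * rho ** lag.astype(float)

S = ar_cov(years, sigma=0.10, rho=0.65)
near = S[1, 2] / S[1, 1]    # 1976 vs 1977: rho ** 1 = 0.65
far = S[0, -1] / S[0, 0]    # 1974 vs 1998: rho ** 24, essentially zero
```

Adjacent years remain strongly correlated while distant years are nearly independent, with no imputation or artificial re-spacing needed.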
With correlated random effects, we have to distinguish between two situations:
- the correlation induced by assuming a common random effect u_t for each cluster (here: year), and
- the correlation induced by assuming a correlation among the cluster-specific random effects {u_t}.
Correlation among observations in the same cluster is a consequence of assuming a single, cluster-specific random effect shared by all observations in that cluster. For example, the presence of the cluster-specific random effect u_t in (3.1) leads to a (marginal) nonnegative correlation among the two binomial responses Y_1t and Y_2t in year t. With conditional independence, the marginal covariance between these two observations is given by

cov(Y_1t, Y_2t) = E[cov(Y_1t, Y_2t | u_t)] + cov(E[Y_1t | u_t], E[Y_2t | u_t])
               = cov(logit⁻¹(η_1t + u_t), logit⁻¹(η_2t + u_t)),    (3.2)

where η_it is the fixed part of the linear predictor in (3.1). Both functions in (3.2) are monotone increasing in u_t, leading to a nonnegative correlation. Approximations to (3.2) will be dealt with in Section 4.3.
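A quick Monte Carlo check of the sign of (3.2) is possible by simulating the shared random effect directly; the linear predictors η_1t, η_2t and the value of σ below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(4)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Monte Carlo version of (3.2): the covariance of the two conditional
# success probabilities induced by the shared random effect u_t.
eta1, eta2, sigma = -1.0, 0.5, 1.0
u = rng.normal(0.0, sigma, size=200_000)
cov = np.cov(expit(eta1 + u), expit(eta2 + u))[0, 1]
```

Since both inverse-logit curves are increasing in u, the estimated covariance is positive, as claimed.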
In the example, we attributed the cause of this correlation to the current (at the time of the interview) public opinion about homosexual relationships, influencing all respondents in that year. The estimate of σ gives an idea about the magnitude of this correlation, since the more dispersed the u_t's are, the stronger the correlation among the responses within a year. For instance, if the true u_t for a particular year is positive and far away from zero, as measured by σ, then all respondents have a common tendency to give a positive answer. If it is far away from zero on the negative side, respondents have a common tendency for a negative answer. This interpretation, of course, is only relative to other fixed effects included in the linear predictor. For the GSS data, there seems to be moderate correlation between responses, based on a maximum likelihood estimate of σ̂ = 0.10 with an approximated asymptotic s.e. of 0.03. This interpretation of a moderate effect of public opinion on responses within the same year is further supported by the fact that σ can also be interpreted as the regression coefficient for a standardized version of the random effect u_t. A regression coefficient of 0.10 for a standard normal variable on the logit scale leads to moderate heterogeneity on the probability scale. This shows that the correlation between responses within a common year cannot be neglected.
The second consequence of correlated random effects is that observations from different clusters are correlated, which is a distinctive feature compared to GLMMs assuming independence between cluster-specific random effects. The conditional log odds of agreeing with the statement that homosexual relationships are not wrong at all are now correlated over the years, a feature which is natural for a time series of binomial observations but would have gone unaccounted for if time-independent random effects were used. For instance, for the cohort of white respondents (i = 1), the correlation between the conditional log odds at years t and t* is

corr(logit(π_1t(u_t)), logit(π_1t*(u_t*))) = ρ^|x_1t − x_1t*|

and therefore directly related to the assumed random effects correlation structure. Marginally, the two binomial responses at the different observation times have covariance

cov(Y_1t, Y_1t*) = cov(logit⁻¹(η_1t + u_t), logit⁻¹(η_1t* + u_t*)),

which accommodates changing covariance patterns for different observation times (e.g., decreasing with increasing lag) and also negative covariances (see, for instance, the analysis of the Old Faithful geyser eruption data in Chapter 5). We will present approximations to these marginal correlations in binomial time series in Section 4.3.
Summing up, correlated random effects give us a means of incorporating correlation between sequential binomial observations that goes beyond independent or exchangeable correlation structures.
In our example, we attributed the sequential correlation to the gradual change
in public opinion about homosexual relationships over the years, affecting both
races equally. In fact, the maximum likelihood estimate of ρ is equal to 0.65 (s.e.
0.25), indicating that rather strong correlations might exist between responses from
adjacent years.
The model uses 7 parameters (5 fixed effects, 2 variance components) to describe the 32 probabilities. In comparison with a regular GLM and a GLMM with independent random time effects, the negative of the maximized log likelihood decreases from 113.0 for the regular GLM to approximately 109.7 for a GLMM with independent random effects, and to approximately 107.5 for the GLMM with autoregressive random effects. Note that the GLM assumes independent observations within and between the years, and that the GLMM with independent random effects {u_t} for each year t assumes correlation of responses within a year, but independence of responses over the years. Both assumptions might be inappropriate. Our model implies that the log odds of approval of homosexual relationships are correlated for blacks and whites within a year (though not very strongly, with an estimate of σ equal to 0.1) and are also correlated for two consecutive years.
The estimates of the fixed parameters and their asymptotic standard errors are given in Table 5-1. The MCEM algorithm converged after 128 iterations with a starting Monte Carlo sample size of 50 and a final Monte Carlo sample size of 8600. Convergence parameters (cf. Section 2.3.3) were ε₁ = 0.002, c = 5, ε₂ = 0.003, ε₃ = 0.001, a = 1.01 and q = 1.2. Path plots of selected parameter estimates for two different sets of starting values are shown in Figure 3-2. A detailed interpretation of the parameters and the effects of the explanatory variables on the odds of approval is provided in Section 4.3.
Although this example assumed autocorrelated random effects, we will look at the simpler case of equally correlated random effects next. Then, we discuss how the correlation parameter ρ can be estimated within the MCEM framework presented in Section 2.3. Summing up, correlated random effects in GLMMs allow
Figure 3-2: Iteration history for selected parameters and their asymptotic standard errors for the GSS data. The iteration number is plotted on the x-axis. The estimates and standard errors for β₂ were multiplied by 10³ for better plotting. The two different lines in each plot correspond to two different sets of starting values.
one to model within-cluster as well as between-cluster correlations for discrete response variables, where clusters refer to groupings of responses in time.
3.2 Equally Correlated Random Effects
The introductory example modeled decaying correlation between cross-sectional data over time through the use of autocorrelated random effects. In other
temporal or spatial settings, the correlation might stay nearly constant between any
two observation times, regardless of time or location differences between the two
discrete responses. Equally correlated random effects might then be appropriate to
describe such a behavior.
3.2.1 Definition of Equally Correlated Random Effects
We call random effects equally correlated if

var(u_t) = σ² for all t

and

corr(u_t, u_t*) = ρ for all t ≠ t*.

More generally, the covariance matrix of the random effects vector u = (u_1, ..., u_T)′ is given by Σ = σ²[(1 − ρ)I_T + ρJ_T], where J_T = 1_T 1_T′. To ensure positive definiteness, ρ has restricted range, i.e., 1 > ρ > −1/(T − 1). The random effects density is given by

g(u; ψ) ∝ |Σ|^{−1/2} exp(−(1/2) u′Σ⁻¹u),    (3.3)

where now, due to the pattern in Σ, |Σ| = σ^{2T}(1 − ρ)^{T−1}[1 + (T − 1)ρ] and

Σ⁻¹ = 1/(σ²(1 − ρ)) [I_T − ρ/(1 + (T − 1)ρ) J_T].

The vector ψ = (σ, ρ)′ holds the variance components of Σ.
The more complicated random effects structure (as compared to independence or a single latent random effect) leads to a more complicated M-step in the MCEM algorithm described in Section 2.3. For a sample u^(1), ..., u^(m) from the posterior h(u | y; β^(k-1), ψ^(k-1)), evaluated at the previous parameter estimates β^(k-1) and ψ^(k-1), the function Q₂(ψ) introduced in Section 2.3.1 has form

Q₂(ψ) = (1/m) Σ_{j=1}^m log g(u^(j); ψ)
      ∝ −T log σ − ((T − 1)/2) log(1 − ρ) − (1/2) log[1 + (T − 1)ρ] − a/(2σ²(1 − ρ)) + ρ b/(2σ²(1 − ρ)[1 + (T − 1)ρ]),

where a = (1/m) Σ_{j=1}^m u^(j)′u^(j) and b = (1/m) Σ_{j=1}^m u^(j)′J_T u^(j) are constants depending on the sample only.
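The patterned determinant and inverse quoted above can be verified numerically against a direct computation; T, σ and ρ below are arbitrary test values.

```python
import numpy as np

# Closed-form determinant and inverse of the equicorrelated covariance
# matrix Sigma = sigma^2 [(1 - rho) I + rho J], checked against numpy.
T, sigma, rho = 6, 0.8, 0.3
I, J = np.eye(T), np.ones((T, T))
Sigma = sigma ** 2 * ((1 - rho) * I + rho * J)

det_closed = sigma ** (2 * T) * (1 - rho) ** (T - 1) * (1 + (T - 1) * rho)
inv_closed = (I - rho / (1 + (T - 1) * rho) * J) / (sigma ** 2 * (1 - rho))

det_err = abs(np.linalg.det(Sigma) - det_closed)
inv_err = np.max(np.abs(np.linalg.inv(Sigma) - inv_closed))
```

Both errors are at machine precision, confirming the pattern formulas used in Q₂.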
3.2.2 The M-step with Equally Correlated Random Effects
The M-step seeks to maximize Q₂ with respect to σ and ρ, which is equivalent to finding their MLEs treating the sample u^(1), ..., u^(m) as independent. Since this is not possible in closed form, one way to maximize Q₂ uses a bivariate Newton-Raphson algorithm with the Hessian formed by the second order partial and mixed derivatives of Q₂ with respect to σ and ρ. Some authors (e.g., Lange, 1995; Zhang, 2002) use only a single iteration of the Newton-Raphson algorithm instead of an entire M-step to speed up convergence. However, this might not always lead to convergence, since the region in which the Newton-Raphson algorithm converges is limited by the restrictions on ρ. We show now that with a little bit of work the maximizers for σ and ρ can be obtained very quickly.
For any given value of ρ, the ML estimator for σ (at iteration k) is available in closed form and is equal to

σ̂^(k) = ( a/(T(1 − ρ)) − ρ b/(T(1 − ρ)[1 + (T − 1)ρ]) )^{1/2}.

Note that if ρ = 0, σ̂^(k) = ( (1/(Tm)) Σ_{j=1}^m u^(j)′u^(j) )^{1/2}, the estimator for the independence case presented at the end of Section 2.3.1. Unfortunately, the ML estimator for ρ has no closed form solution. The first and second partial derivatives of Q₂ with respect to ρ are given by

∂Q₂/∂ρ = T(T − 1)ρ / (2(1 − ρ)[1 + (T − 1)ρ]) − a/(2σ²(1 − ρ)²) + [1 + (T − 1)ρ²] b / (2σ²(1 − ρ)²[1 + (T − 1)ρ]²),

∂²Q₂/∂ρ² = T(T − 1)[1 + (T − 1)ρ²] / (2(1 − ρ)²[1 + (T − 1)ρ]²) − a/(σ²(1 − ρ)³) + [−(T − 2) + 3(T − 1)ρ + (T − 1)²ρ³] b / (σ²(1 − ρ)³[1 + (T − 1)ρ]³).
We obtain the profile likelihood for ρ by plugging the MLE σ̂^(k) into the likelihood equation for ρ. Then we use a simple and fast interval-halving (or bisection) method to find the root for ρ. This is advantageous compared to a Newton-Raphson algorithm since the range of ρ is restricted. Let f(ρ) = ∂Q₂/∂ρ evaluated at σ = σ̂^(k), and let ρ₁ and ρ₂ be two initial estimates in the appropriate range, satisfying ρ₁ < ρ₂ and f(ρ₁)f(ρ₂) < 0. Without loss of generality, assume f(ρ₁) < 0. Clearly, the maximum likelihood estimate ρ̂ must be in the interval [ρ₁, ρ₂]. The interval-halving method computes the midpoint ρ₃ = (ρ₁ + ρ₂)/2 of this interval and updates one of its endpoints in the following way: it sets ρ₁ = ρ₃ if f(ρ₃) < 0, or ρ₂ = ρ₃ otherwise. The newly formed interval [ρ₁, ρ₂] has half the length of the initial interval, but still contains ρ̂. Subsequently, a new midpoint ρ₃ is calculated, giving rise to a new interval with one fourth of the length of the initial interval, but still containing ρ̂. This process is iterated until |f(ρ₃)| < ε, where ε is a small positive constant. To ensure it is a maximum we can check that the value of the second derivative, f′(ρ), is negative at ρ₃. (The second derivative is also needed for approximating standard errors in the EM algorithm.) The value of ρ₃ is then used as an update for ρ in the maximum likelihood estimator for σ, and the whole process of finding the roots of Q₂ is repeated. Convergence is declared when the relative change in σ and ρ is less than some prespecified small constant. The values of σ and ρ at this final iteration are the estimates σ̂^(k) and ρ̂^(k) from MCEM iteration k.
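A generic interval-halving routine matching this description can be written in a few lines; the toy score function with a root inside the admissible range of ρ is our own illustration.

```python
def interval_halving(f, lo, hi, tol=1e-8, max_iter=200):
    """Interval-halving (bisection) root finder as described above;
    assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        f_mid = f(mid)
        if abs(f_mid) < tol:           # stopping rule |f(rho_3)| < eps
            return mid
        if (f_lo < 0) == (f_mid < 0):  # sign change is in the right half
            lo, f_lo = mid, f_mid
        else:                          # sign change is in the left half
            hi = mid
    return (lo + hi) / 2.0

# Toy score function with a single root inside (-1, 1).
root = interval_halving(lambda r: r ** 3 - 0.3, -0.99, 0.99)
```

Because the bracket never leaves the admissible range of ρ, the method cannot overshoot the positive-definiteness constraints the way an unguarded Newton-Raphson step can.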
The issue of how to obtain a sample u^(1), ..., u^(m) from h(u | y; β^(k-1), ψ^(k-1)), taking into account the special structure of the random effects distribution, will be discussed in Section 3.4.2.
3.3 Autoregressive Random Effects
The use of autoregressive random effects was demonstrated in the introductory
example. Their property of a decaying correlation function makes them a useful
tool for modeling temporal or spatial associations among discrete data. We will
limit ourselves to instances where there is a natural ordering of random effects, and
consider time dependent data first.
3.3.1 Definition of Autoregressive Random Effects
As with equally correlated random effects in Section 3.2, we can look at the joint distribution of autoregressive (or autocorrelated) random effects {u_t}_{t=1}^T as a mean-zero multivariate normal distribution with patterned covariance matrix Σ, defined by the variance and correlation functions

var(u_t) = σ² for all t

and

corr(u_t, u_t*) = ρ^|x_t − x_t*| for all t ≠ t*,

where x_t and x_t* are time points (e.g., years, as in the GSS example) associated with random effects u_t and u_t*. Let d_t = x_{t+1} − x_t denote the time difference between two successive time points and let f_t = 1/(1 − ρ^{2d_t}), t = 1, ..., T − 1. Then, due to the special structure, the determinant of the covariance matrix is given by |Σ| = σ^{2T} Π_{t=1}^{T−1} f_t^{−1} and Σ⁻¹ is tridiagonal (Crowder and Hand, 1990, with correction of a typo in there) with main diagonal

(1/σ²) (f_1, f_1 + f_2 − 1, f_2 + f_3 − 1, ..., f_{T−2} + f_{T−1} − 1, f_{T−1})

and subdiagonals

−(1/σ²) (f_1 ρ^{d_1}, f_2 ρ^{d_2}, ..., f_{T−1} ρ^{d_{T−1}}).
For a sample u^(1), ..., u^(m) from the posterior h(u | y; β^(k-1), ψ^(k-1)), evaluated at the previous parameter estimates β^(k-1) and ψ^(k-1) = (σ^(k-1), ρ^(k-1)), the function Q₂ (cf. Section 2.3.1) now has form

Q₂(ψ) = −T log σ − (1/2) Σ_{t=1}^{T−1} log(1 − ρ^{2d_t}) − (1/(2σ²m)) Σ_{j=1}^m [u_1^(j)]²
        − (1/(2σ²m)) Σ_{j=1}^m Σ_{t=1}^{T−1} [u_{t+1}^(j) − ρ^{d_t} u_t^(j)]² / (1 − ρ^{2d_t}),    (3.4)

where u_t^(j) is the t-th component of the j-th sampled vector u^(j). In the M-step of an MCEM algorithm, we seek to maximize Q₂ with respect to σ and ρ.
Alternatively, we can view the random effects {u_t} as a latent first-order autoregressive process: random effect u_{t+1} at time t + 1 is related to its predecessor u_t by the equation

u_{t+1} = ρ^{d_t} u_t + ε_t,   ε_t ~ N(0, σ²(1 − ρ^{2d_t})),   t = 1, ..., T − 1,    (3.5)

where d_t again denotes the lag between the two successive time points associated with random effects u_t and u_{t+1}. Assuming a N(0, σ²) distribution for the first random effect u_1, the joint random effects density for u = (u_1, ..., u_T)′ enjoys a Markov property and has form

g(u; ψ) = g(u_1; ψ) g(u_2 | u_1; ψ) ··· g(u_{t+1} | u_t; ψ) ··· g(u_T | u_{T−1}; ψ)    (3.6)
        ∝ (1/σ^T) [Π_{t=1}^{T−1} (1 − ρ^{2d_t})^{−1/2}] exp(−u_1²/(2σ²)) Π_{t=1}^{T−1} exp(−[u_{t+1} − ρ^{d_t} u_t]²/(2σ²(1 − ρ^{2d_t}))),

leading, of course, to the same expression for Q₂ as given in (3.4). For two time indices t and t* with t < t*, the random process has autocorrelation function

ρ(t, t*) = corr(u_t, u_t*) = ρ^{Σ_{s=t}^{t*−1} d_s}.
Before we discuss maximization of Q₂ in this setting with possibly unequally spaced observation times, let us comment on the rather unusual parametrization of the latent random process (3.5). Chan and Ledolter (1995), in their development of time series models for equally spaced discrete events, use the more common form

u_{t+1} = ρ u_t + ε_t,   ε_t ~ N(0, σ²),   t = 1, 2, ..., T − 1.

This leads to var(u_t) = σ²/(1 − ρ²) for all t if we assume a N(0, σ²/(1 − ρ²)) distribution for u_1. (Chan and Ledolter (1995) condition on this first observation, which leads to closed form solutions for both σ and ρ in the case of equidistant observations.) Since it is common practice to let σ² describe the strength of association between observations in a common cluster sharing that random effect, our parametrization seems more natural. In Chan and Ledolter's parametrization, both the variance and the correlation parameter appear in the variance of the random effect.

In the more general case of unequally spaced observations, the parametrization ε_t ~ N(0, σ²) results in different variances of the random effects at different time points (i.e., var(u_t) = σ²/(1 − ρ^{2d_t})). Considering that the random effects represent unobservable phenomena common to all clusters, their variability should be about the same for all clusters, and not depend on the time difference between any two clusters. There is no reason to believe that the strength of association is larger in some clusters and weaker in others. Therefore, the parametrization we choose in (3.5) seems natural and appropriate.
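The constant-variance property of parametrization (3.5) can be confirmed by simulation, even with unequally spaced time points; the time grid reuses the first few GSS years, and σ = 1, ρ = 0.65 are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_latent_ar(x, sigma, rho, n_series=50_000):
    """Simulate the latent process (3.5): u_{t+1} = rho^{d_t} u_t + eps_t
    with eps_t ~ N(0, sigma^2 (1 - rho^{2 d_t})) and u_1 ~ N(0, sigma^2).
    Under this parametrization var(u_t) = sigma^2 at every time point."""
    d = np.diff(x).astype(float)
    u = np.empty((n_series, len(x)))
    u[:, 0] = rng.normal(0.0, sigma, size=n_series)
    for t, dt in enumerate(d):
        eps = rng.normal(0.0, sigma * np.sqrt(1 - rho ** (2 * dt)),
                         size=n_series)
        u[:, t + 1] = rho ** dt * u[:, t] + eps
    return u

x = np.array([1974, 1976, 1977, 1980, 1982])   # unequally spaced times
u = simulate_latent_ar(x, sigma=1.0, rho=0.65)
sds = u.std(axis=0)                            # all close to sigma = 1
emp = np.corrcoef(u[:, 1], u[:, 2])[0, 1]      # lag-1 pair: about rho
```

The empirical standard deviations stay at σ for every time point, while the lag-one correlation between the 1976 and 1977 effects recovers ρ.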
For spatially correlated data, a relationship between random effects u_i and u_i* is defined in terms of a distance function d(x_i, x_i*) between covariates x_i and x_i* associated with them. Each u_i then represents a random effect for a spatial cluster, and correlated random effects are again natural to model spatial dependency among observations in different clusters. In the time setting, we had d(x_i, x_i*) = |x_i − x_i*|, with x_i and x_i* representing time points. In 2-dimensional spatial settings, x_i = (x_i1, x_i2)′ may represent midpoints in a Cartesian system and d(x_i, x_i*) = ||x_i − x_i*|| is the Euclidean distance function. The so-defined distance between clusters can be used in a model with correlations between random effects decaying as distances between cluster midpoints grow, e.g., corr(u_i, u_i*) = ρ^{||x_i − x_i*||}. Models of this form are discussed in Zhang (2002) and, in a Bayesian framework, in Diggle et al. (1998).

Sometimes only the information concerning whether clusters are adjacent to each other is used to form the correlation structure. In this case, d(x_i, x_i*) is a binary function, indicating if clusters i and i* are adjacent or not. Usually, this leads to an improper joint distribution for the random effects, as for instance in the analysis of the Scottish lip cancer data set presented in Breslow and Clayton (1993).
3.3.2 The Mstep with Autoregressive Random Effects
Maximizing Q2 with respect to a and p is again equivalent to finding their
MLEs for the sample of u0),..., u(m), pretending they are independent. For fixed
p, maximizing Q2 with respect to a is possible in closed form. For notational con
venience, denote the parts depending on p and the generated sample u(m),..., (m)
by
at(p, u) = U+ __ p d U ) 2
j=1
and
Smw1 r
j=1
with derivatives with respect to p (indicated by a prime) given by
ta'(P, u) = 2dtpd`1'bt(p, u)
and
b(p, u) = dtp )2
m
j=1
The maximum likelihood estimator of σ at iteration k of the MCEM algorithm has
the form

    σ̂^(k) = { (1/T) [ (1/m) Σ_{j=1}^m ( u_1^(j) )^2 + Σ_{t=1}^{T-1} a_t(p, u) / (1 - p^{2 d_t}) ] }^{1/2}.
For the special case of independent random effects (p = 0), this simplifies to the
estimator ( (1/(Tm)) Σ_{j=1}^m u^(j)' u^(j) )^{1/2} presented at the end of Section 2.3.1. (The equal
correlation structure cannot be represented as a special case of the autocorrelation
structure.) No closed form solution exists for p̂^(k). Let

    c_t(p) = d_t p^{d_t - 1} / (1 - p^{2 d_t})

and

    e_t(p) = (p / d_t) [ c_t(p) ]^2
be terms depending on p but not on u, with derivatives given by

    c_t'(p) = ((d_t - 1)/p) c_t(p) + 2 p^{d_t} [ c_t(p) ]^2

and

    e_t'(p) = (2p / d_t) c_t(p) c_t'(p) + (1 / d_t) [ c_t(p) ]^2,
respectively. Then, the first and second partial derivatives of Q2 with respect to p
can be written as

    ∂Q2/∂p = Σ_{t=1}^{T-1} p^{d_t} c_t(p) + (1/σ^2) Σ_{t=1}^{T-1} [ c_t(p) b_t(p, u) - e_t(p) a_t(p, u) ],

    ∂^2 Q2/∂p^2 = Σ_{t=1}^{T-1} p^{d_t - 1} c_t(p) [ 2 d_t - 1 + 2 p^{d_t + 1} c_t(p) ]
                + (1/σ^2) Σ_{t=1}^{T-1} [ c_t'(p) b_t(p, u) + c_t(p) b_t'(p, u) - e_t'(p) a_t(p, u) - e_t(p) a_t'(p, u) ].
A Newton-Raphson algorithm with a Hessian formed of the partial and mixed derivatives
of Q2 with respect to σ and p can be employed to find the maximum likelihood
estimators at iteration k. However, since the range of p is restricted, it might be
advantageous to use the interval-halving method on ∂Q2/∂p |_{σ = σ̂^(k)} described in the
previous section.
For the special case of equidistant time points {x_t}, the distances d_t are equal
for all t = 1, ..., T-1. Without loss of generality, we assume d_t = 1 for all t. Then the
random effects follow the first-order autoregression u_{t+1} = p u_t + ε_t, t = 1, ..., T-1, where
we assume that u_1 ~ N(0, σ^2) and the ε_t are i.i.d. N(0, σ^2(1 - p^2)). Certain simplifications
occur. Let

    a = (1/m) Σ_{j=1}^m ( u_1^(j) )^2,      b = (1/m) Σ_{j=1}^m Σ_{t=1}^{T-1} ( u_{t+1}^(j) )^2,
    c = (1/m) Σ_{j=1}^m Σ_{t=1}^{T-1} u_t^(j) u_{t+1}^(j),      d = (1/m) Σ_{j=1}^m Σ_{t=1}^{T-1} ( u_t^(j) )^2
denote constants depending on the generated samples only, but not on any parameters.
Then the maximum likelihood estimator of σ at iteration k is

    σ̂^(k) = { (1/T) [ a + (b - 2pc + p^2 d) / (1 - p^2) ] }^{1/2}

and, upon plugging it into the score equation ∂Q2/∂p |_{σ = σ̂^(k)} for p, we obtain as
the new score equation

    (T-1)(d - a) p^3 - (T-2) c p^2 + [ (T-1)a - b - Td ] p + Tc = 0,    (3.7)
a polynomial of order three. A result by Witt (1987), mentioned in McKeown and
Johnson (1996), shows that (3.7) has three real solutions, only one of which lies
in the interval (-1, 1). This must be the maximum likelihood estimator p̂^(k) at
iteration k for the case of equidistant time points. Exact solutions to this third-degree
polynomial are given, for instance, in Abramowitz and Stegun (1964), and we
only need to iterate between the two explicit solutions until convergence.
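As a sketch, the closed-form M-step for the equidistant case can be implemented in Python by computing the constants a, b, c, d from the Monte Carlo draws, extracting the unique root of the cubic score equation (3.7) in (-1, 1), and plugging it into the estimator of σ. The coefficient ordering below follows our reading of the derivation; variable names and the toy check are illustrative, not part of the text.

```python
import numpy as np

def mstep_equidistant(u):
    """Closed-form M-step for (sigma, p) given m Monte Carlo draws u of shape
    (m, T) from the posterior of equidistant AR(1) random effects: solve the
    cubic score equation (3.7) for p, then plug the root into sigma-hat."""
    m, T = u.shape
    a = np.mean(u[:, 0] ** 2)                        # (1/m) sum_j (u_1^(j))^2
    b = np.mean(np.sum(u[:, 1:] ** 2, axis=1))       # sums over t = 1,...,T-1
    c = np.mean(np.sum(u[:, :-1] * u[:, 1:], axis=1))
    d = np.mean(np.sum(u[:, :-1] ** 2, axis=1))
    # (T-1)(d-a)p^3 - (T-2)c p^2 + [(T-1)a - b - Td]p + Tc = 0
    coeffs = [(T - 1) * (d - a), -(T - 2) * c, (T - 1) * a - b - T * d, T * c]
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-8].real
    p_hat = real[(real > -1) & (real < 1)][0]        # the unique root in (-1, 1)
    sigma_hat = np.sqrt((a + (b - 2 * p_hat * c + p_hat ** 2 * d)
                         / (1 - p_hat ** 2)) / T)
    return sigma_hat, p_hat

# toy check with draws from a true AR(1) process, sigma = 2, p = 0.8
rng = np.random.default_rng(1)
m, T, sig, p = 200, 400, 2.0, 0.8
u = np.empty((m, T))
u[:, 0] = rng.normal(0.0, sig, m)
for t in range(T - 1):
    u[:, t + 1] = p * u[:, t] + rng.normal(0.0, sig * np.sqrt(1 - p ** 2), m)
sigma_hat, p_hat = mstep_equidistant(u)
```

With many draws the root recovers the lag 1 correlation of the generated sample.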
In most of our applications the number of distinct observation times T
is rather large, and generating independent T-dimensional vectors u from the
posterior h(u | y; β^(k-1), ψ^(k-1)) as required to approximate the E-step is difficult, even
with the nice (prior) autoregressive relationship among the components of u. The
next section discusses this issue.
There are many other correlation structures which are not discussed here. For
instance, the first-order autoregressive random process can be extended to a pth-order
process, and the formulas provided here and in the next section can be
modified accordingly.
3.4 Sampling from the Posterior Distribution Via Gibbs Sampling
In Section 2.3.2 we gave a general description of how to obtain a random sample
u^(1), ..., u^(m) from h(u | y). (As in Section 2.3.2, we suppress the dependency
on the parameter estimates from the previous iteration.) For high-dimensional
random effects distributions g(u), generating independent draws from h(u | y)
can get very time consuming, if not impossible. The Gibbs sampler introduced in
Section 2.3.2 offers an alternative because it involves sampling from lower-dimensional
(often univariate) conditional distributions of h(u | y), which is considerably
faster. However, it results in dependent samples from the posterior random effects
distribution. The distributional structure of equally correlated random effects
or autoregressive random effects is very amenable to Gibbs sampling because of
the simplifications that occur in the full univariate conditionals. Remember that
the two-stage hierarchy and the conditional independence assumption in GLMMs
imply that

    h(u | y) ∝ f(y | u) g(u) = [ Π_{t=1}^T f(y_t | u_t) ] g(u),

the product of the conditional densities of observations sharing a
common random effect and the random effects density. In the following, let
u = (u_1, ..., u_T)'. We discuss the case of autoregressive random effects first.
3.4.1 A Gibbs Sampler for Autoregressive Random Effects
From representation (3.6) of the random effects distribution, we see that the
full univariate conditional distribution of u_t given the other T-1 components of u
only depends on its neighbors u_{t-1} and u_{t+1}, i.e.,

    g(u_t | u_1, ..., u_{t-1}, u_{t+1}, ..., u_T) ∝ g(u_t | u_{t-1}) g(u_{t+1} | u_t),   t = 2, ..., T-1.
At the beginning (t = 1) and the end (t = T) of the process, the conditional
distributions of u_1 and u_T only depend on the successor u_2 and the predecessor u_{T-1},
respectively. Furthermore, random effect u_t only applies to the observations y_t =
(y_{t1}, ..., y_{tn_t}) at a common time point that share that random effect, but not to
other observations. Hence, the full univariate conditionals of the posterior random
effects distribution can be expressed as

    h_1(u_1 | u_2, y_1) ∝ f(y_1 | u_1) g_1(u_1 | u_2),
    h_t(u_t | u_{t-1}, u_{t+1}, y_t) ∝ f(y_t | u_t) g_t(u_t | u_{t-1}, u_{t+1}),   t = 2, ..., T-1,
    h_T(u_T | u_{T-1}, y_T) ∝ f(y_T | u_T) g_T(u_T | u_{T-1}),
where, using standard multivariate normal theory results,

    g_1(u_1 | u_2) = N( p^{d_1} u_2, σ^2 [1 - p^{2 d_1}] ),

    g_t(u_t | u_{t-1}, u_{t+1}) = N( [ p^{d_{t-1}} (1 - p^{2 d_t}) u_{t-1} + p^{d_t} (1 - p^{2 d_{t-1}}) u_{t+1} ] / [ 1 - p^{2(d_{t-1} + d_t)} ],
                                  σ^2 [ 1 - p^{2 d_{t-1}} - p^{2 d_t} + p^{2(d_{t-1} + d_t)} ] / [ 1 - p^{2(d_{t-1} + d_t)} ] ),   t = 2, ..., T-1,

    g_T(u_T | u_{T-1}) = N( p^{d_{T-1}} u_{T-1}, σ^2 [1 - p^{2 d_{T-1}}] ).
For equally spaced data (d_t = 1 for all t), these distributions reduce to the ones
derived in Chan and Ledolter (1995).
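These conditionals can be checked against the joint AR covariance directly: because the latent process is Markov, conditioning on the two neighbors agrees with conditioning on all other components. The following Python sketch (our own, with illustrative parameter values) verifies the interior conditional numerically.

```python
import numpy as np

# Closed-form interior conditional of u_t versus brute-force conditioning of
# the joint AR normal on all other components (values are illustrative).
sigma, p = 1.5, 0.7
d = np.array([1.0, 2.0, 1.0, 3.0])            # gaps d_t, so T = 5 time points
T = len(d) + 1
pos = np.concatenate(([0.0], np.cumsum(d)))   # observation times
Sigma = sigma ** 2 * p ** np.abs(pos[:, None] - pos[None, :])

t = 2                                         # an interior component (0-based)
u = np.array([0.3, -1.2, 0.0, 0.8, -0.5])     # arbitrary values for the others

# closed form: weights on the two neighbours only
num1 = p ** d[t - 1] * (1 - p ** (2 * d[t]))
num2 = p ** d[t] * (1 - p ** (2 * d[t - 1]))
den = 1 - p ** (2 * (d[t - 1] + d[t]))
mean_cf = (num1 * u[t - 1] + num2 * u[t + 1]) / den
var_cf = sigma ** 2 * (1 - p ** (2 * d[t - 1])) * (1 - p ** (2 * d[t])) / den

# brute force: condition the joint multivariate normal on all other components
idx = [i for i in range(T) if i != t]
S12 = Sigma[t, idx]
S22inv = np.linalg.inv(Sigma[np.ix_(idx, idx)])
mean_bf = S12 @ S22inv @ u[idx]
var_bf = Sigma[t, t] - S12 @ S22inv @ S12
```

The two means and the two variances agree to machine precision.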
Direct sampling from the full univariate conditionals h_t is not possible.
However, it is straightforward to implement an accept-reject algorithm. In fact,
the accept-reject algorithm as outlined in Section 2.3.2 applies directly with target
density h_t and candidate density g_t, since h_t has the form of an exponential family
density multiplied by a normal density. In Section 2.3.2, we discussed the accept-reject
algorithm for generating an entire vector u from the posterior random effects
distribution h(u | y) with candidate density g(u) and mentioned that acceptance
probabilities are virtually zero for large-dimensional u's. With the Gibbs sampler,
we have reduced the problem to univariate sampling of the tth component u_t from
the univariate target density h_t with univariate candidate density g_t. By selecting
M_t = L(y_t), where L(y_t) is the saturated likelihood for observations at time point
t, we ensure that the target density satisfies h_t ≤ M_t g_t.
Given u^(j-1) = (u_1^(j-1), ..., u_T^(j-1)) from the previous iteration, the Gibbs
sampler with accept-reject sampling from the full univariate conditionals consists of:
1. generate the first component u_1^(j) ~ h_1(u_1 | u_2^(j-1), y_1) by
(a) generation step:
generate u_1 from candidate density g_1(u_1 | u_2^(j-1));
generate U ~ Uniform[0, 1];
(b) acceptance step:
set u_1^(j) = u_1 if U ≤ f(y_1 | u_1)/L(y_1); return to (a) otherwise;
2. for t = 2, ..., T-1:
generate component u_t^(j) ~ h_t(u_t | u_{t-1}^(j), u_{t+1}^(j-1), y_t) by
(a) generation step:
generate u_t from candidate density g_t(u_t | u_{t-1}^(j), u_{t+1}^(j-1));
generate U ~ Uniform[0, 1];
(b) acceptance step:
set u_t^(j) = u_t if U ≤ f(y_t | u_t)/L(y_t); return to (a) otherwise;
3. generate the last component u_T^(j) ~ h_T(u_T | u_{T-1}^(j), y_T) by
(a) generation step:
generate u_T from candidate density g_T(u_T | u_{T-1}^(j));
generate U ~ Uniform[0, 1];
(b) acceptance step:
set u_T^(j) = u_T if U ≤ f(y_T | u_T)/L(y_T); return to (a) otherwise;
4. set u^(j) = (u_1^(j), ..., u_T^(j)).
The so-obtained sample u^(1), ..., u^(m) (after allowing for burn-in) forms a
dependent sample which we use to approximate the E-step in the kth iteration of
the MCEM algorithm. Note that all densities are evaluated at the current parameter
estimates, i.e., β^(k-1) for f(y_t | u_t) and ψ^(k-1) = (σ^(k-1), p^(k-1)) for g_t(u_t | u_{t-1}, u_{t+1}).
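For concreteness, here is a minimal Python sketch of this sweep for a Bernoulli (logit link) response with equidistant AR(1) random effects. With a single binary observation per time point the saturated likelihood L(y_t) equals 1, so the acceptance probability reduces to f(y_t | u_t); all parameter values and helper names are illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def bern_lik(y, eta):
    """Bernoulli likelihood f(y | u) under a logit link with linear predictor eta."""
    prob = 1.0 / (1.0 + np.exp(-eta))
    return prob if y == 1 else 1.0 - prob

def gibbs_sweep(u, y, alpha, sigma, p):
    """One accept-reject Gibbs sweep over u_1, ..., u_T for a logistic GLMM with
    equidistant AR(1) random effects (sketch; L(y_t) = 1 for a single binary
    response, so we accept a candidate with probability f(y_t | u_t))."""
    T = len(u)
    u = u.copy()
    for t in range(T):
        if t == 0:                      # g_1(u_1 | u_2)
            mean, var = p * u[1], sigma ** 2 * (1 - p ** 2)
        elif t == T - 1:                # g_T(u_T | u_{T-1})
            mean, var = p * u[T - 2], sigma ** 2 * (1 - p ** 2)
        else:                           # g_t(u_t | u_{t-1}, u_{t+1})
            mean = p * (u[t - 1] + u[t + 1]) / (1 + p ** 2)
            var = sigma ** 2 * (1 - p ** 2) / (1 + p ** 2)
        while True:                     # accept-reject with candidate g_t
            cand = rng.normal(mean, np.sqrt(var))
            if rng.uniform() <= bern_lik(y[t], alpha + cand):
                u[t] = cand
                break
    return u

T = 50
y = rng.integers(0, 2, T)               # an arbitrary binary series
u = np.zeros(T)
for sweep in range(100):                # run a few sweeps (burn-in not shown)
    u = gibbs_sweep(u, y, alpha=0.0, sigma=1.0, p=0.6)
```

Components generated earlier in a sweep enter the conditional of later components, exactly as in steps 1 to 4 above.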
3.4.2 A Gibbs Sampler for Equally Correlated Random Effects
Similar results as for the autoregressive correlation structure can be derived
for the case of equally correlated random effects. In this case, the full univariate
conditional of u_t depends on all other T-1 components of u, as can be seen from
(3.3). Let u_{-t} denote the vector u with the tth component deleted. Using similar
notation as in the previous section, the full univariate conditionals of h(u | y) are
given by

    h_t(u_t | u_{-t}, y_t) ∝ f(y_t | u_t) g_t(u_t | u_{-t}),   t = 1, ..., T,
where, with standard results from multivariate normal theory, g_t(u_t | u_{-t}) is a
N(μ_t, τ_t^2) density with

    μ_t = [ p / (1 + (T-2)p) ] Σ_{k ≠ t} u_k

and

    τ_t^2 = σ^2 [ 1 - (T-1)p^2 / (1 + (T-2)p) ] = σ^2 (1-p)[1 + (T-1)p] / [1 + (T-2)p].
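The conditional mean and variance displayed above can be checked numerically against direct multivariate normal conditioning; the Python sketch below uses illustrative values of our own choosing.

```python
import numpy as np

# Check mu_t and tau_t^2 for the equally correlated structure against direct
# conditioning of the joint normal (values are illustrative).
sigma, p, T = 1.3, 0.4, 6
Sigma = sigma ** 2 * ((1 - p) * np.eye(T) + p * np.ones((T, T)))
u = np.array([0.5, -0.2, 1.1, 0.0, -0.7, 0.3])   # arbitrary values

t = 2
idx = [i for i in range(T) if i != t]
mu_cf = p / (1 + (T - 2) * p) * u[idx].sum()
tau2_cf = sigma ** 2 * (1 - p) * (1 + (T - 1) * p) / (1 + (T - 2) * p)

S22inv = np.linalg.inv(Sigma[np.ix_(idx, idx)])
mu_bf = Sigma[t, idx] @ S22inv @ u[idx]
tau2_bf = Sigma[t, t] - Sigma[t, idx] @ S22inv @ Sigma[t, idx]
```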
Given the vector u^(j-1) from the previous iteration, the Gibbs sampler with
accept-reject sampling from the full univariate conditionals has the form:
1. for t = 1, ..., T:
generate component u_t^(j) ~ h_t(u_t | u_1^(j), ..., u_{t-1}^(j), u_{t+1}^(j-1), ..., u_T^(j-1), y_t) by
(a) generation step:
generate u_t from candidate density
g_t(u_t | u_1^(j), ..., u_{t-1}^(j), u_{t+1}^(j-1), ..., u_T^(j-1));
generate U ~ Uniform[0, 1];
(b) acceptance step:
set u_t^(j) = u_t if U ≤ f(y_t | u_t)/L(y_t); return to (a) otherwise;
2. set u^(j) = (u_1^(j), ..., u_T^(j)).
This leads to a sample u^(1), ..., u^(m) from the posterior distribution used
in the E- and M-steps of the MCEM algorithm at iteration k. Note again that
all distributions are evaluated at their current parameter estimates β^(k-1) and
ψ^(k-1) = (σ^(k-1), p^(k-1)).
3.5 A Simulation Study
We conducted a simulation study to evaluate the performance of the maximum
likelihood estimation algorithm, to evaluate the bias in the estimation of covariate
effects and variance components and to compare predicted random effects to the
ones used in the simulation of the data. To this end, we generated a time series
y_1, ..., y_T of T = 400 binary observations according to the model

    logit(π_t(u_t)) = α + β x_t + u_t    (3.8)

for the conditional log odds of success at time t, t = 1, ..., 400. For the simulation,
we chose α = 1 and β = 1, where β is the regression coefficient for independent
standard normal covariates x_t, i.i.d. N(0, 1). The random effects
u_1, ..., u_T are thought to arise from an unobserved latent autoregressive
process u_{t+1} = p u_t + ε_t, where the ε_t are i.i.d. N(0, σ^2(1 - p^2)), i.e., the u_t's have standard
deviation σ and lag h correlation p^h. For the simulation of these autoregressive
random effects, we used σ = 2 and p = 0.8. The resulting sample autocorrelation
function of the realized random effects is pictured in Figure 3-4. The standard
deviation and lag 1 correlation of the 400 realized values of u_1, ..., u_T are
1.95 and 0.77. Note that conditional on the realized values of the u_t's, the y_t's are
generated independently with log odds given by (3.8). The MCEM algorithm as
described in Sections 2.3 and 3.3 for a logistic GLMM with autocorrelated random
effects yielded the following maximum likelihood estimates for the fixed effects and
variance components: α̂ = 0.94 (0.39), β̂ = 1.03 (0.22), as compared to the true
values 1 and 1, and σ̂ = 2.25 (0.44) and p̂ = 0.74 (0.06), as compared to the true
values 1.95 and 0.77.
The algorithm converged after 71 iterations with a starting Monte Carlo
sample size of 50 and a final Monte Carlo sample size of only 880, although
estimated standard errors are based on a Monte Carlo sample size of 20,000.
Convergence parameters were set to ε_1 = 0.003, c = 3, ε_2 = 0.005, ε_3 = 0.001,
a = 1.03 and q = 1.05 (see Section 2.3.3). Regular GLM estimates were used as
starting values for α and β, and starting values for σ and p were set to 1.5 and 0,
respectively.
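The data-generating mechanism of this simulation can be sketched as follows (the seed and implementation details are ours, not from the text):

```python
import numpy as np

# Generate one binary time series from model (3.8) with a latent AR(1) process
rng = np.random.default_rng(42)
T, alpha, beta, sigma, p = 400, 1.0, 1.0, 2.0, 0.8

u = np.empty(T)                               # latent AR(1) random effects
u[0] = rng.normal(0, sigma)
for t in range(T - 1):
    u[t + 1] = p * u[t] + rng.normal(0, sigma * np.sqrt(1 - p ** 2))

x = rng.normal(0, 1, T)                       # i.i.d. N(0, 1) covariates
logits = alpha + beta * x + u                 # conditional log odds, model (3.8)
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))
```

Conditional on the realized u, the y_t are drawn independently, exactly as described above.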
As will be described in Section 5.4.2, we estimated random effects through a
Monte Carlo approximation of their posterior mean: û_t = E[u_t | y]. The scatter
plot in Figure 3-3 shows good agreement in a comparison of the realized random
effects u_1, ..., u_T from the simulation and the estimated random effects û_1, ..., û_T
from the model. Note, though, that the standard deviation of the estimated random
effects is equal to 1.60 (as compared to the true standard deviation of 1.95),
showing that estimated random effects are less variable, a general shrinkage
effect (compare the scales on the x and y axes of Figure 3-3) brought along by
using posterior mean estimates. Also, a comparison of the autocorrelation and
partial autocorrelation functions of the realized and estimated random effects in
Figure 3-3: Realized (simulated) random effects u_1, ..., u_T versus estimated random effects û_1, ..., û_T.
Figure 3-4 reveals some differences due to the fact that estimated random effects
are based on the posterior distribution of u | y. Therefore, estimated random
effects are only of limited use in checking assumptions on the true random effects.
Only when their behavior is grossly unexpected compared to the assumed structure
of the underlying latent random process may they serve as an indication of model
inappropriateness. Related remarks are given by Verbeke and Molenberghs (2000),
who generate data from a linear mixed model assuming a mixture of two normal
distributions for the random effects, resulting in a bimodal random effects distribution.
There, too, the plot of posterior mean estimates of the random effects from a
model that misspecified the random effects distribution does not reveal that
anything went wrong.

We repeated the above simulation 100 times, using the same specifications,
starting values and convergence criteria as mentioned above. Each of the 100
generated binary time series of length 400 was fit using the MCEM algorithm.
Figure 3-4: Comparing simulated and estimated random effects.
Sample autocorrelation (first row) and partial autocorrelation (second row) functions
for realized (simulated) random effects u_1, ..., u_T (first column) and estimated
random effects û_1, ..., û_T (second column).

Table 3-1 shows the average (over the 100 generated time series) of the fixed
parameter and variance component estimates and their average estimated standard
errors. On average, the GLMM estimates of the fixed effects α and β and the
variance components are very close to the true parameters, although the true lag
1 correlation of the random effects is underestimated by 6.3%. Table 3-1 also
displays, in parentheses, the standard deviations of all estimated parameters in the
100 replications. Comparing these to the theoretical estimates of the asymptotic
standard errors, we see good agreement. This suggests that the procedure for
finding standard errors we described and implemented (via Louis's (1982) formula)
in our MCEM algorithm works fine. In 5 (5%) out of the 100 simulations, the
approximation of the asymptotic covariance matrix by Monte Carlo methods
resulted in a negative definite matrix. For these simulations, a larger Monte Carlo
sample after convergence of the MCEM algorithm (the default was 20,000) might
be necessary.
It is also interesting to note that of the 95 simulations with positive definite
covariance matrix, 6 (6.3%) resulted in a nonsignificant (based on a 5% level
Wald test) estimate of the regression coefficient β under the GLMM with
autoregressive random effects, while none was declared nonsignificant with the
GLM approach. Estimates and standard errors for a corresponding GLM fit
are also provided in Table 3-1. The average Monte Carlo sample size at the final
iteration of the MCEM algorithm was 1200, although highly dispersed, ranging from
210 to 21,000. The average computation time (on a mobile Pentium III, 600 MHz
processor with 256MB RAM) to convergence, including estimating the covariance
matrix, was 73 minutes.

We ran two other simulation studies, now with a shorter length of only
T = 100 observations, and a true lag 1 correlation of 0.6 and 0.8, respectively.
All other parameters remained unchanged. These results are also summarized in
Table 3-1. Again, we observe that the estimated parameters are very close to the
Table 3-1: A simulation study for a logistic GLMM with autoregressive random effects.

           α       β       σ       p       s.e.(α)  s.e.(β)  s.e.(σ)  s.e.(p)
True:      1       1       2       0.8     (T = 400)
GLM:       0.64    0.63                    0.11     0.12
           (0.20)  (0.14)                  (0.01)   (0.01)
GLMM:      1.07    1.02    2.08    0.75    0.37     0.26     0.61     0.10
           (0.32)  (0.20)  (0.25)  (0.06)  (0.12)   (0.23)   (0.40)   (0.05)
True:      1       1       2       0.6     (T = 100)
GLM:       0.69    0.70                    0.23     0.25
           (0.38)  (0.27)                  (0.02)   (0.03)
GLMM:      1.09    1.07    1.99    0.51    0.58     0.47     1.35     0.26
           (0.58)  (0.37)  (0.39)  (0.20)  (0.33)   (0.27)   (1.15)   (0.18)
True:      1       1       2       0.8     (T = 100)
GLM:       0.65    0.61                    0.22     0.24
           (0.21)  (0.26)                  (0.01)   (0.03)
GLMM:      1.04    0.96    2.00    0.75    0.42     0.51     1.04     0.16
           (0.29)  (0.34)  (0.53)  (0.13)  (0.32)   (0.99)   (1.04)   (0.13)

Average and standard deviation (in parentheses) of fixed effects, variance components
and their standard error estimates from a GLM and a GLMM with latent
AR(1) process. The two models were fitted to each of 100 generated binary time
series of length T = 400 and T = 100.
true ones, but on average the correlation was underestimated by 15% and 6.3%,
respectively. However, the sampling errors of the correlation parameters (shown in
parentheses in Table 3-1) were large enough to include the true values.
Since our methods are general enough to handle unequally spaced data, we
repeated the first simulation with a time series of T = 400 binary observations, but
now randomly deleted 10% of the observations to create random gaps in the series.
We left all parameters and the model for the conditional odds unchanged, except
that we now assume that the random effects follow the latent autoregressive
process u_{t+1} = p^{d_t} u_t + ε_t, where the ε_t are i.i.d. N(0, σ^2(1 - p^{2 d_t})) and d_t is the
difference (in the units of measurement) between the time points associated with
the observations at times t and t + 1. For example, the first series we generated had
Table 3-2: Simulation study for modeling unequally spaced binary time series.

           α       β       σ       p       s.e.(α)  s.e.(β)  s.e.(σ)  s.e.(p)
True:      1       1       2       0.8     (T = 360, unequally spaced)
GLM:       0.61    0.62                    0.12     0.12
           (0.19)  (0.11)                  (0.00)   (0.01)
GLMM:      1.03    1.00    2.07    0.75    0.38     0.28     0.71     0.11
           (0.29)  (0.16)  (0.25)  (0.06)  (0.19)   (0.28)   (0.80)   (0.18)

Average and standard deviation (in parentheses) of fixed effects, variance components
and their standard error estimates from a GLM and a GLMM with latent
autoregressive random effects accounting for unequally spaced observations. The two
models were fitted to each of 100 generated binary time series of length T = 360,
with random gaps of random length between observations.
1 gap of length three (i.e., dt = 4 for one t), 4 gaps of length two (i.e., dt = 3 for 4
t's) and 29 gaps of length one (i.e., dt = 2 for 29 t's). For all other t's, dt = 1, i.e.,
they are successive observations and the difference between two of them is one unit
of measurement.
Simulation results are shown in Table 3-2 and reveal that our proposed
methods and algorithm also work well for an unequally spaced binary time series.
All true parameters are included in confidence intervals based on the average of the
estimated parameters from 100 replicated series and its standard deviation (shown
in parentheses in Table 3-2).
CHAPTER 4
MODEL PROPERTIES FOR NORMAL, POISSON AND BINOMIAL
OBSERVATIONS
So far we have discussed models for discrete-valued time series data in a very
broad manner. In Section 2, we developed the likelihood for our models based on
generic distributions f(y|u) for observations y and g(u) for random effects u and
presented an algorithm for finding maximum likelihood estimates. Section 3 looked
at two special cases of random effects distributions useful for describing temporal or
spatial dependencies. In this chapter we make specific distributional assumptions
about the observations and develop some theory underlying the models we propose.
We will pay special attention to data in the form of a single (sometimes considered
generic) time series Y = (Y_1, ..., Y_T)' and derive marginal properties implied by
the conditional model formulation. Multiple, independent time series Y_1, ..., Y_n
can result from replication of the original time series or from stratification of the
sampled population, such as in the example about homosexual relationships. All
derivations given below for a generic time series Y still hold for the ith series
Y_i = (Y_{i1}, Y_{i2}, ..., Y_{iT}), provided the same latent process {u_t} is assumed to
underlie each one of them.
An important characteristic of any time series model is its implied serial
dependency structure. In the case of normal theory time series models, this is
specified by the autocorrelation function. In Section 4.1 we derive the implied
marginal autocorrelation function for GLMMs with normal random components
and either an equal correlation or autoregressive assumption for the random effects.
With these assumptions, our models are special cases of linear mixed models
discussed for instance in Diggle et al. (2002). In Sections 4.2 and 4.3 we explore
marginal properties of GLMMs with Poisson and binomial random components
that are induced by assuming equally correlated or autoregressive random effects.
In Chapter 5, these model properties such as the implied autocorrelation function
are then compared to empirical counterparts based on the observed data to
evaluate the proposed model.
Section 2.1 mentioned that parameters in GLMMs have a conditional interpretation,
controlling for the random effects. Correlated random effects vary over
time, and parameter interpretation is different from having just one common level of
a random effect, as in many standard random intercepts GLMMs. For each of the
models presented here, we discuss parameter interpretation in a separate section.
4.1 Analysis for a Time Series of Normal Observations
Suppose that, conditional on time-specific normal random effects {u_t}, observations
{Y_t} are independent N(μ_t + u_t, τ^2). The marginal likelihood for this model is
tractable, because marginally the joint distribution of {Y_t} is multivariate normal
with mean μ = (μ_1, ..., μ_T)' and covariance matrix Σ_u + τ^2 I, where Σ_u is the
covariance matrix of the joint distribution of {u_t}. With the usual assumption that
var(u_t) = σ^2, the marginal variance of Y_t is given by

    var(Y_t) = τ^2 + σ^2
and the marginal correlation function p(t, t*) for the case of equally correlated
random effects (cf. Section 3.2) has the form

    p(t, t*) = corr(Y_t, Y_t*) = σ^2 / (τ^2 + σ^2) p,    (4.1)

while for the case of autocorrelated random effects (cf. Section 3.3), it has the form

    p(t, t*) = corr(Y_t, Y_t*) = σ^2 / (τ^2 + σ^2) p^{Σ_{k=t}^{t*-1} d_k}.    (4.2)

If the distances between time points are equal, then (4.2) is more conveniently
written in terms of the lag h between observations as

    p(h) = corr(Y_t, Y_{t+h}) = σ^2 / (τ^2 + σ^2) p^h.
For both cases, note that the autocorrelations (4.1) and (4.2) are smaller than
the corresponding ones assumed for the underlying latent process {u_t} by a factor
of σ^2/(τ^2 + σ^2). For equally correlated random effects, the marginal covariance matrix
has the form τ^2 I + σ^2 [(1 - p)I + pJ], implying equal marginal correlations between
any two members Y_t and Y_t* of {Y_t}. (This can also be seen from (4.1), where the
autocorrelations do not depend on t or t*.) Diggle et al. (2002, Sec. 5.2.2) call this
a model with serial correlation plus measurement error.
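The attenuation factor σ^2/(τ^2 + σ^2) is easy to confirm by Monte Carlo; the Python sketch below (parameter values illustrative, not from the text) compares the empirical lag 2 correlation with the formula.

```python
import numpy as np

# Monte Carlo check of the marginal lag-h autocorrelation: the latent AR
# correlation p^h shrunk by sigma^2 / (tau^2 + sigma^2).
rng = np.random.default_rng(7)
T, n = 6, 200_000
sigma, tau, p = 1.5, 1.0, 0.6

u = np.empty((n, T))                  # n independent latent AR(1) paths
u[:, 0] = rng.normal(0, sigma, n)
for t in range(T - 1):
    u[:, t + 1] = p * u[:, t] + rng.normal(0, sigma * np.sqrt(1 - p ** 2), n)
y = u + rng.normal(0, tau, (n, T))    # add the "measurement error" component

h = 2
emp = np.corrcoef(y[:, 0], y[:, h])[0, 1]
theory = sigma ** 2 / (tau ** 2 + sigma ** 2) * p ** h
```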
Similar properties can be observed in the case of autocorrelated random effects:
the basic structure of correlations decaying in absolute value with increasing
distances between observation times (as measured by Σ d_k or h) is preserved
marginally. However, the first-order Markov property of the underlying autoregressive
process is not preserved in the marginal distribution of {Y_t}, which can
be proved by calculating conditional distributions. For instance, for three (T = 3)
equidistant time points, the conditional mean of Y_3 given Y_1 = y_1 and Y_2 = y_2 is
equal to

    E[Y_3 | y_1, y_2] = μ_3 + σ^2 p / [ (τ^2 + σ^2)^2 - σ^4 p^2 ] ( τ^2 p [y_1 - μ_1] + [τ^2 + σ^2(1 - p^2)] [y_2 - μ_2] )

and depends on y_1.
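This failure of the Markov property can be verified directly from the marginal covariance matrix. In the sketch below (illustrative values of our own), the regression weight on y_1 equals σ^2 p^2 τ^2 / D with D = (τ^2 + σ^2)^2 - σ^4 p^2, which vanishes only when τ^2 = 0 or p = 0.

```python
import numpy as np

sigma, tau, p = 1.5, 1.0, 0.6
# marginal covariance of (Y_1, Y_2, Y_3) for equidistant time points
S = sigma ** 2 * p ** np.abs(np.subtract.outer(np.arange(3), np.arange(3))) \
    + tau ** 2 * np.eye(3)
coef = np.linalg.solve(S[:2, :2], S[2, :2])   # regression weights of Y_3 on (Y_1, Y_2)

D = (tau ** 2 + sigma ** 2) ** 2 - sigma ** 4 * p ** 2
coef_y1 = sigma ** 2 * p ** 2 * tau ** 2 / D  # closed-form weight on y_1
```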
It should be noted that in the case of independent random effects with
Σ_u = σ^2 I, marginally the Y_t's are also independent, but with overdispersed
variances τ^2 + σ^2 relative to their conditional distribution. This case can be seen
as a special case of the equally correlated model and the autoregressive model when
p = 0.

The traditional assumption in random intercepts models is to assume a
common random effect u_t = u for all time points t; i.e., conditional on a N(0, σ^2)
random effect u, Y_t is N(μ_t + u, τ^2) for t = 1, ..., T. For this case, the marginal
covariance matrix has the form τ^2 I + σ^2 J. This can be derived directly or inferred
from the marginal correlation expressions (4.1) and (4.2) by setting p = 1, implying
perfect correlation among the {u_t}. Hence, the random intercepts model is a
special case of the equally correlated or autoregressive model when p = 1. It implies
a constant (exchangeable) marginal correlation of σ^2/(τ^2 + σ^2) between any two
observations Y_t and Y_t*.
4.1.1 Analysis via Linear Mixed Models
In a GLMM, we try to provide some structure for the unknown mean component
μ_t by using covariates x_t. Let x_t'β be a linear predictor for μ_t, with β
denoting a fixed effects parameter vector for the covariates x_t. Using an identity
link, the series {Y_t} then follows a GLMM with conditional mean function
E[Y_t | u_t] = x_t'β + u_t. The model can be written as Y_t = x_t'β + u_t + ε_t, where
the ε_t are i.i.d. N(0, τ^2) and independent of u_t. Then, the models discussed here are special
cases of mixed effects models (Verbeke and Molenberghs, 2000) with the general matrix
form

    Y = Xβ + Zu + ε.

In our case, Y = (Y_1, ..., Y_T)' is the time series vector and X = (x_1, ..., x_T)'
is the overall design matrix with associated parameter β. The design matrix Z
for the random effects u' = (u_1, ..., u_T) simplifies to the identity matrix I_T. The
distributional assumption on the random effects is u ~ N(0, Σ_u), and they are
independent of the N(0, τ^2 I)-distributed errors ε. Exploiting this relationship,
software for fitting models of this kind (i.e., correlated normal data with a structured
covariance matrix of the form var(Y) = Z Σ_u Z' + τ^2 I) is readily available, for instance
in the form of the SAS procedure proc mixed, where the equal correlation structure
and the autoregressive structure are only two out of many possible choices for the
covariance matrix Σ_u of the random effects distribution.
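Outside SAS, the same representation can be exploited directly: with Z = I and known variance components, var(Y) = Σ_u + τ^2 I and β has the closed-form GLS solution. The Python sketch below is our own illustration (values and seed hypothetical), not a substitute for proc mixed's joint estimation of the variance components.

```python
import numpy as np

# GLS fit of beta for the normal model Y = X beta + u + eps with AR(1) Sigma_u
# (variance components treated as known; all values illustrative)
rng = np.random.default_rng(5)
T, sigma, tau, p = 300, 1.2, 0.8, 0.7

Sigma_u = sigma ** 2 * p ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
V = Sigma_u + tau ** 2 * np.eye(T)            # var(Y) = Z Sigma_u Z' + tau^2 I, Z = I

X = np.column_stack([np.ones(T), rng.normal(0, 1, T)])
beta_true = np.array([2.0, -1.0])
y = rng.multivariate_normal(X @ beta_true, V)  # one marginal draw of the series

Vinv_X = np.linalg.solve(V, X)                 # V^{-1} X
beta_gls = np.linalg.solve(X.T @ Vinv_X, Vinv_X.T @ y)
```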
Mixed effects models are very popular for the regression analysis of shorter
time series, like growth curve models or data from longitudinal studies. In Section 5.1,
we illustrate an application by analyzing the motivating example of
Section 3.1 about attitudes towards homosexual relationships, based on a normal
approximation to the log odds.
4.1.2 Parameter Interpretation
Parameters in normal time series models retain their interpretation when
averaging over the random effects distribution. The interpretation of 3 as the
change in the mean for a change in the covariates is valid conditional on random
effects and also marginally. The random effects parameters only contribute to the
variancecovariance structure of the marginal distribution, inducing overdispersion
and correlation relative to the conditional assumptions.
4.2 Analysis for a Time Series of Counts
Suppose now that, conditional on time-specific normal random effects {u_t},
observations {Y_t} are independent counts, which we model as Poisson random
variables with mean μ_t. Using a log link, explanatory variables x_t and correlated
random effects {u_t}, we specify the conditional mean structure of a Poisson GLMM
as

    log(μ_t) = x_t'β + u_t,   t = 1, ..., T.    (4.3)
The correlation in the random effects allows the log-means to be correlated over
time or space. The marginal likelihood corresponding to this model is given by

    L(β, ψ; y) ∝ ∫_{R^T} [ Π_{t=1}^T μ_t^{y_t} exp{-μ_t} ] g(u; ψ) du
               = ∫_{R^T} exp{ Σ_{t=1}^T [ y_t(x_t'β + u_t) - exp{x_t'β + u_t} ] } g(u; ψ) du,

where g(u; ψ) is one of the random effects distributions of Chapter 3. In that case,
the integral is not tractable and numerical methods such as the MCEM algorithm
of Section 2.3 must be used to find maximum likelihood estimates for β and ψ. For
this, the function Q_1 defined in (2.16) has the form

    Q_1 = (1/m) Σ_{j=1}^m Σ_{t=1}^T [ y_t(x_t'β + u_t^(j)) - exp{x_t'β + u_t^(j)} ],

where u_t^(j) is the tth element of the jth generated sample u^(j) from the posterior
distribution h(u | y; β^(k-1), ψ^(k-1)). Note that here we discuss only the case of a
generic time series {Y_t} with no replication, hence n = 1 (i.e., index i is redundant)
and n_1 = T in the general form presented in (2.16). If replications are available,
or in the case where two time series differ in the fixed effects part but not in the
random effects (e.g., have the same underlying latent process), then one simply
needs to include the sum over the replicates as indicated in (2.16). Choosing one
of the correlated random effects distributions of Chapter 3, the Gibbs sampling
algorithms developed in Sections 3.4.1 or 3.4.2 can be used to generate the sample
from h(u | y), with f(y_t | u_t) having the form of a Poisson density with mean μ_t.
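Maximizing the Monte Carlo objective Q_1 in β amounts to a weighted Poisson fit: averaging exp(u_t^(j)) over the draws gives a multiplicative weight w_t, and Newton-Raphson takes the familiar form sketched below. This is our own illustration; the simulated "posterior draws" are stand-ins, not output of the actual algorithm.

```python
import numpy as np

def newton_beta(y, X, U, beta0, iters=30):
    """Newton-Raphson for beta maximizing the Monte Carlo Poisson objective
    (sketch). Rows of U are posterior draws u^(j); averaging exp(u_t^(j))
    over draws yields weights w_t, so each step is a weighted Poisson update."""
    w = np.exp(U).mean(axis=0)            # w_t = (1/m) sum_j exp(u_t^(j))
    beta = beta0.astype(float).copy()
    for _ in range(iters):
        mu = np.exp(X @ beta) * w         # E-step-averaged conditional means
        grad = X.T @ (y - mu)             # gradient of Q_1 in beta
        hess = X.T @ (X * mu[:, None])    # negative Hessian of Q_1
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# stand-in data: "posterior draws" U simulated around a true latent path
rng = np.random.default_rng(3)
T, m = 200, 50
X = np.column_stack([np.ones(T), rng.normal(0, 1, T)])
beta_true = np.array([0.5, 0.3])
u_true = rng.normal(0, 0.5, T)
y = rng.poisson(np.exp(X @ beta_true + u_true))
U = u_true + rng.normal(0, 0.1, (m, T))
beta_hat = newton_beta(y, X, U, np.zeros(2))
```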
4.2.1 Marginal Model Implied by the Poisson GLMM
As with the normal GLMMs before, marginal first and second moments can
be obtained by integrating over the random effects distribution, although here
the complete marginal distribution of Y_t is not tractable as it is in the normal
case. The random effects appearing in model (4.3) imply that the conditional
log-means {log(μ_t)} are random quantities. Assuming that the random effects {u_t} are
normal with zero mean and variance var(u_t) = σ^2, they have expectations {x_t'β}
and variance σ^2. For two distinct time points t and t*, their correlation under an
independence, equal correlation or autocorrelation assumption on the random
effects is given by 0, p or p^{Σ_{k=t}^{t*-1} d_k}, respectively. (Remember that d_k denotes the
time difference between two successive observations y_k and y_{k+1}.) On the original
scale, the means have expectation, variance and correlation given by

    E[μ_t] = exp{x_t'β + σ^2/2}
    var(μ_t) = exp{2(x_t'β + σ^2/2)} (e^{σ^2} - 1)
    corr(μ_t, μ_t*) = (e^{cov(u_t, u_t*)} - 1) / (e^{σ^2} - 1).

Plugging in cov(u_t, u_t*) = 0, σ^2 p or σ^2 p^{Σ_{k=t}^{t*-1} d_k} yields the marginal correlations
among means when assuming independent, equally correlated or autoregressive
random effects, respectively.
4.2.1.1 Marginal distribution of Y_t

Now let us turn to the marginal distribution of Y_t itself, for which we can only
derive moments. The marginal mean and variance of Y_t are given by

    E[Y_t] = E[μ_t] = exp{x_t'β + σ^2/2}    (4.4)
    var(Y_t) = E[μ_t] + var(μ_t) = E[Y_t] [ 1 + E[Y_t](e^{σ^2} - 1) ].

Hence, the log of the marginal mean still follows a linear model with fixed effects
parameters β, but with an additional offset σ^2/2 to the intercept term. (This is not
particular to the Poisson assumption, but is true for any loglinear random effects
model of the form (4.3) with more general random effects structure z_t'u_t; see Problem
13.42 in Agresti, 2002.) The marginal distribution of Y_t is not Poisson, since the
variance exceeds the mean by a factor of [1 + E[Y_t](e^{σ^2} - 1)]. The marginal variance
is a quadratic function of the marginal mean.
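These two moments are easy to confirm by simulation; the values of x_t'β and σ below are illustrative.

```python
import numpy as np

# Monte Carlo confirmation of the marginal mean and variance in (4.4)
rng = np.random.default_rng(11)
xb, sigma, n = 0.4, 0.8, 1_000_000
u = rng.normal(0, sigma, n)                  # lognormal mixing of the mean
y = rng.poisson(np.exp(xb + u))

mean_theory = np.exp(xb + sigma ** 2 / 2)
var_theory = mean_theory * (1 + mean_theory * (np.exp(sigma ** 2) - 1))
```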
For two distinct time points t and t*, the marginal covariance between
observations Y_t and Y_t* is given by

    cov(Y_t, Y_t*) = cov(μ_t, μ_t*)
                   = E[Y_t] E[Y_t*] (e^{cov(u_t, u_t*)} - 1)    (4.5)
                   = exp{(x_t + x_t*)'β + σ^2} (e^{cov(u_t, u_t*)} - 1).
Full Text 
xml version 1.0 encoding UTF8
REPORT xmlns http:www.fcla.edudlsmddaitss xmlns:xsi http:www.w3.org2001XMLSchemainstance xsi:schemaLocation http:www.fcla.edudlsmddaitssdaitssReport.xsd
INGEST IEID EPWSN1VID_CJRAAN INGEST_TIME 20170713T14:51:32Z PACKAGE AA00003585_00001
AGREEMENT_INFO ACCOUNT UF PROJECT UFDC
FILES
PAGE 1
REGRESSION MODELS FOR DISCRETEVALUED TIME SERIES DATA By BERNHARD KLINGENBERG A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2004
PAGE 2
Copyright 2004 by Bernhard Klingenberg
PAGE 3
To Sophia and JeanLuc Picard
PAGE 4
ACKNOWLEDGMENTS I would like to express my sincere gratitude to Drs Alan Agresti and James Booth for their guidance and assistance with my dissertation research and for their support throughout my years at the University of Florida. During the past three years as a research assistant for Dr. Agresti I gained valuable experience in conducting statistical research and writing scholarly papers, for which I am very grateful. I would also like to thank Dr. Ramon Littell, who guided me through a year of invaluable statistical consulting experience at IFAS Dr. George Casella, who taught me Monte Carlo methods and Drs Jeff Gill and Michael Martinez for serving on my committee. My gratitude extends to all the faculty and present and former graduate students of the Department among them in alphabetical order Brian Caffo Sounak Chakraborty Dr. Herwig Friedl Ludwig Heigenhauser, David Hitchcock Wolfg~ng Jank Galin Jones Ziyad Mahfoud Siuli Mukhopadhyay and Brian Stephens I would like to thank my family foremost my wife Sophia and my daughter Franziska for all their support light and joy they bring into my life and for providing me with energy and fulfillment. Lastly this dissertation would not have been written in English had it not been for the countless adventures of the Starship Enterprise and its Captain JeanLuc Picard whose episodes kept me glued to the TV in Austria and helped to sufficiently improve my knowledge of the English language. lV
TABLE OF CONTENTS

ACKNOWLEDGMENTS iv
LIST OF TABLES viii
LIST OF FIGURES ix
ABSTRACT xi

CHAPTER

1 INTRODUCTION 1

1.1 Regression Models for Correlated Discrete Data 1
1.2 Marginal Models 3
1.2.1 Likelihood Based Estimation Methods 3
1.2.2 Quasi-Likelihood Based Estimation Methods 4
1.3 Transitional Models 7
1.3.1 Model Fitting 9
1.3.2 Transitional Models for Time Series of Counts 10
1.3.3 Transitional Models for Binary Data 11
1.4 Random Effects Models 13
1.4.1 Correlated Random Effects in GLMMs 13
1.4.2 Other Modeling Approaches 17
1.5 Motivation and Outline of the Dissertation 19

2 GENERALIZED LINEAR MIXED MODELS 21

2.1 Definition and Notation 22
2.1.1 Generalized Linear Mixed Models for Univariate Discrete Time Series 24
2.1.2 State Space Models for Discrete Time Series Observations 26
2.1.3 Structural Similarities Between State Space Models and GLMMs 28
2.1.4 Practical Differences 29
2.2 Maximum Likelihood Estimation 31
2.2.1 Direct and Indirect Maximum Likelihood Procedures 32
2.2.2 Model Fitting in a Bayesian Framework 36
2.2.3 Maximum Likelihood Estimation for State Space Models 37
2.3 The Monte Carlo EM Algorithm 40

v
2.3.1 Maximization of Q_m 41
2.3.2 Generating Samples from h(u | y; β, ψ) 43
2.3.3 Convergence Criteria 48

3 CORRELATED RANDOM EFFECTS 53

3.1 A Motivating Example: Data from the General Social Survey 54
3.1.1 A GLMM Approach 55
3.1.2 Motivating Correlated Random Effects 56
3.2 Equally Correlated Random Effects 62
3.2.1 Definition of Equally Correlated Random Effects 62
3.2.2 The M-step with Equally Correlated Random Effects 63
3.3 Autoregressive Random Effects 65
3.3.1 Definition of Autoregressive Random Effects 65
3.3.2 The M-step with Autoregressive Random Effects 68
3.4 Sampling from the Posterior Distribution Via Gibbs Sampling 71
3.4.1 A Gibbs Sampler for Autoregressive Random Effects 72
3.4.2 A Gibbs Sampler for Equally Correlated Random Effects 74
3.5 A Simulation Study 75

4 MODEL PROPERTIES FOR NORMAL, POISSON AND BINOMIAL OBSERVATIONS 82

4.1 Analysis for a Time Series of Normal Observations 83
4.1.1 Analysis via Linear Mixed Models 85
4.1.2 Parameter Interpretation 86
4.2 Analysis for a Time Series of Counts 86
4.2.1 Marginal Model Implied by the Poisson GLMM 87
4.2.2 Parameter Interpretation 92
4.3 Analysis for a Time Series of Binomial or Binary Observations 92
4.3.1 Marginal Model Implied by the Binomial GLMM 94
4.3.2 Approximation Techniques for Marginal Moments 95
4.3.3 Parameter Interpretation 108

5 EXAMPLES OF COUNT, BINOMIAL AND BINARY TIME SERIES 115

5.1 Graphical Exploration of Correlation Structures 115
5.1.1 The Variogram 116
5.1.2 The Lorelogram 117
5.2 Normal Time Series 117
5.3 Analysis of the Polio Count Data 121
5.3.1 Comparison of AR-GLMMs to other Approaches 125
5.3.2 A Residual Analysis for the AR-GLMM 127
5.4 Binary and Binomial Time Series 130
5.4.1 Old Faithful Geyser Data 130

vi
5.4.2 Oxford versus Cambridge Boat Race Data 142

6 SUMMARY, DISCUSSION AND FUTURE RESEARCH 159

6.1 Cross-Sectional Time Series 162
6.2 Univariate Time Series 164
6.2.1 Clipping of Time Series 165
6.2.2 Longitudinal Data 165
6.3 Extensions and Further Research 166
6.3.1 Alternative Random Effects Distribution 166
6.3.2 Topics in GLMM Research 168

REFERENCES 170

BIOGRAPHICAL SKETCH 171

vii
LIST OF TABLES

Table

3-1 A simulation study for a logistic GLMM with autoregressive random effects 80
3-2 Simulation study for modeling unequally spaced binary time series 81
5-1 Comparing estimates from two models for the log-odds 121
5-2 Parameter estimates for the polio data 125
5-3 Autocorrelation functions for the Old Faithful geyser data 134
5-4 Comparison of observed and expected counts for the Old Faithful geyser data 142
5-5 Maximum likelihood estimates for boat race data 147
5-6 Observed and expected counts of sequences of wins (W) and losses (L) for the Cambridge University team 151
5-7 Estimated random effects u_t for the last 30 years for the boat race data 152
5-8 Estimated probabilities of a Cambridge win in 2004 given the past s + 1 outcomes of the race 155

viii
LIST OF FIGURES

Figure

2-1 Plot of the typical behavior of the Monte Carlo sample size m(k) and the Q-function Q^(k) through MCEM iterations 51
3-1 Sampling proportions from the GSS data set 55
3-2 Iteration history for selected parameters and their asymptotic standard errors for the GSS data 61
3-3 Realized (simulated) random effects u_1, ..., u_T versus estimated random effects û_1, ..., û_T 77
3-4 Comparing simulated and estimated random effects 78
4-1 Approximated marginal probabilities for the fixed part predictor value x'β ranging from -4 to 4 in a logit model 99
4-2 Comparison of conditional logit and probit model based probabilities 104
4-3 Comparison of implied marginal probabilities from logit and probit models 106
5-1 Empirical standard deviations std(θ_it) for the log odds of favoring homosexual relationship by race 119
5-2 Plot of the Polio data 123
5-3 Iteration history for the Polio data 124
5-4 Residual autocorrelations for the Polio data 129
5-5 Residual autocorrelations with outlier adjustment for the Polio data 131
5-6 Autocorrelation functions for the Old Faithful geyser data 135
5-7 Lorelogram for the Old Faithful geyser data 136
5-8 Plot of the Oxford vs. Cambridge boat race data 143
5-9 Variogram for the Oxford vs. Cambridge boat race data 145
5-10 Lorelogram for the Oxford vs. Cambridge boat race data 146

ix
5-11 Path plots of fixed and random effects parameter estimates for the boat race data 147
6-1 Association graphs for GLMMs 161

x
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

REGRESSION MODELS FOR DISCRETE-VALUED TIME SERIES DATA

By

Bernhard Klingenberg

August 2004

Chair: Alan G. Agresti
Cochair: James G. Booth
Major Department: Statistics

Independent random effects in generalized linear models induce an exchangeable correlation structure, but long sequences of counts or binomial observations typically show correlations decaying with increasing lag. This dissertation introduces models with autocorrelated random effects for a more appropriate, parameter-driven analysis of discrete-valued time series data. We present a Monte Carlo EM algorithm with Gibbs sampling to jointly obtain maximum likelihood estimates of regression parameters and variance components. Marginal mean, variance and correlation properties of the conditionally specified models are derived for Poisson, negative binomial and binary/binomial random components. They are used for constructing goodness-of-fit tables and checking the appropriateness of the modeled correlation structure. Our models define a likelihood, and hence estimation of the joint probability of two or more events is possible and used in predicting future responses. Also, all methods are flexible enough to allow for multiple gaps or missing observations in the observed time series. The approach is illustrated with the analysis of a cross-sectional study over 30 years where only observations from 16

xi
unequally spaced years are available, a time series of 168 monthly counts of polio infections, and two long binary time series.

xii
CHAPTER 1
INTRODUCTION

Correlated discrete data arise in a variety of settings in the biomedical, social, political or business sciences, whenever a discrete response variable is measured repeatedly. Examples are time series of counts or longitudinal studies measuring a binary response. Correlations between successive observations arise naturally through a time, space or some other cluster-forming context and have to be incorporated in any inferential procedure. Standard regression models for independent data can be expanded to accommodate such correlations. For continuous-type responses the normal linear mixed effects model offers such a flexible framework and has been well studied in the past. A recent reference is Verbeke and Molenberghs (2000), who also discuss computer software for fitting linear mixed effects models with popular statistical packages. Although the normal linear mixed effects model is but one member of the broader class of generalized linear mixed effects models, it enjoys unique properties which simplify parameter estimation and interpretation substantially. For discrete response data, however, the normal distribution is not appropriate, and other members in the exponential family of distributions have to be considered.

1.1 Regression Models for Correlated Discrete Data

In this introduction we will review extensions of the basic generalized linear model (McCullagh and Nelder, 1989) for analyzing independent observations to models for correlated data. These models are marginal (Section 1.2), transitional (Section 1.3) and random effects models (Section 1.4). An extensive discussion of these models with respect to discrete longitudinal data is given in the books by
Agresti (2002) and Diggle, Heagerty, Liang and Zeger (2002). In general, longitudinal studies concern only a few repeated measurements. In this dissertation, however, we are interested in the analysis of much longer series of repeated observations, often exceeding 100 repeated measurements. Therefore, the following review focuses specifically on models for univariate time series observations, some of which are presented in Fahrmeir and Tutz (2001).

Let y_t be a response at time t, t = 1, ..., T, observed together with a vector of covariates denoted by x_t. In a generalized linear model (GLM) the mean μ_t = E[y_t] of observation y_t depends on a linear predictor η_t = x_t'β through a link function h(·), forming the relationship μ_t = h^{-1}(x_t'β). The variance of y_t depends on the mean through the relationship var(y_t) = φ v(μ_t), where v(·) is a distribution-specific variance function and φ a dispersion parameter.
Inference in transitional models is based on a conditional or partial likelihood, and inference in random effects models relies on a full likelihood (possibly Bayesian) approach. However, models and inferential procedures have been developed that allow more flexibility than the above categorization.

1.2 Marginal Models

In marginal regression models the main scientific goal is to assess the influence of covariates on the marginal mean of y_t, treating the association structure between repeated observations as a nuisance. The marginal mean μ_t and variance var(y_t) are modeled separately from a correlation structure between two observations y_t and y_{t*}. Regression parameters in the linear predictor are called population averaged parameters because their interpretation is based on an average over all individuals in a specific covariate subgroup. Due to the correlation among repeated observations, the likelihood for the model refers to the joint distribution of all observations and not to the simpler product of their marginal distributions. However, the model is specified in terms of these marginal distributions, which makes maximum likelihood fitting particularly hard for even a moderate number T of repeated measurements.

1.2.1 Likelihood Based Estimation Methods

For binary data, Fitzmaurice and Laird (1993) discuss a parametrization of the joint distribution in terms of conditional probabilities and log odds ratios. These parameters are related to the marginal mean and the conditional log odds ratios which describe the higher order associations among the repeated responses. The marginal mean and the higher order associations are then modeled in terms of orthogonal parameters β and α, respectively. Fitzmaurice and Laird (1993) present an algorithm for maximizing the likelihood with respect to these two parameter sets. The algorithm has been implemented in a freely available computer program (MAREG) by Kastner et al. (1997).
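For the simplest case of two binary responses, the idea of parametrizing a joint distribution through marginal means and an odds ratio can be sketched numerically. The function below is a toy illustration only, not the Fitzmaurice and Laird (or MAREG) algorithm, which works with conditional odds ratios and general T; it recovers the 2 x 2 joint distribution from two marginal means and a marginal odds ratio by bisection, with illustrative input values:

```python
import numpy as np

def joint_from_margins_or(mu1, mu2, psi, tol=1e-12):
    """Recover the 2x2 joint distribution of two binary variables from
    marginal means mu1, mu2 and odds ratio psi > 0, by bisection on
    p11 = P(Y1 = 1, Y2 = 1).  Illustrative sketch only."""
    lo = max(0.0, mu1 + mu2 - 1.0)          # Frechet bounds on p11
    hi = min(mu1, mu2)

    def odds_ratio(p11):
        p10, p01 = mu1 - p11, mu2 - p11
        p00 = 1.0 - p11 - p10 - p01
        return (p11 * p00) / (p10 * p01)

    # the odds ratio is strictly increasing in p11, so bisection applies
    a, b = lo + tol, hi - tol
    while b - a > tol:
        m = 0.5 * (a + b)
        if odds_ratio(m) < psi:
            a = m
        else:
            b = m
    p11 = 0.5 * (a + b)
    return np.array([[1.0 - mu1 - mu2 + p11, mu2 - p11],
                     [mu1 - p11, p11]])

P = joint_from_margins_or(0.6, 0.3, 4.0)
print(P)    # cells sum to 1; margins and odds ratio are matched
```

The Fréchet bounds on the joint cell probability supply the bracketing interval for the bisection, which is also why the marginal means and one association parameter determine the joint distribution uniquely in this two-response case.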
Another approach to maximum likelihood fitting for longitudinal discrete data regards the marginal model as a constraint on the joint distribution and maximizes the likelihood subject to this constraint. The model is written in terms of a generalized loglinear model C log(Aμ) = Xβ, where μ is a vector of expected counts and A and C are matrices to form marginal counts and functions of those marginal counts, respectively. With this approach, no specific assumption about the correlation structure of repeated observations is made and the likelihood refers to the most general form for the joint distribution. However, simultaneous modeling of the marginal distribution and a simplified joint distribution is also possible. Details can be found in Lang and Agresti (1994) and Lang (1996). Lang (2004) also offers an R computer program (mph.fit) for maximum likelihood fitting of these very general marginal models.

1.2.2 Quasi-Likelihood Based Estimation Methods

The drawback of the two approaches mentioned above, and likelihood based methods in general, is that they require enormous computing resources as the number of repeated responses increases or the number of covariates is large, making maximum likelihood fitting computationally impossible for long time series. This is also true for estimation based on alternative parameterizations of a distribution for multivariate binary data, such as those discussed in Bahadur (1961), Cox (1972) or Zhao and Prentice (1990). Estimating methods leading to computationally simpler inference (albeit not maximum likelihood) for marginal models are based on a quasi-likelihood approach (Wedderburn, 1974). In a quasi-likelihood approach, no specific form for the distribution of the responses is assumed and only the mean, variance and correlation are specified. However, with discrete data, specifying the mean and covariances does not determine the likelihood, as it would with normal data, so parameter estimation cannot be based on it. Liang and Zeger (1986) proposed
generalized estimating equations (GEE) to estimate parameters, which have the form of score equations for GLMs, but cannot be interpreted as such. Their approach also requires the specification of a working correlation matrix for the repeated responses. They show that if the mean function is correctly specified, the solution to the generalized estimating equations is a consistent estimator, regardless of the assumed variance-covariance structure for the repeated responses. They also present an estimator of the asymptotic variance-covariance matrix for the GEE estimates which is robust against misspecification of the working correlation matrix. Several structured working correlation matrices have been proposed for parsimonious modeling of the marginal correlation, and some of them are implemented in statistical software packages for GEE estimation (e.g., SAS's proc genmod with the repeated statement and the type option, or the gee and geepack packages in R).

1.2.2.1 GEE for time series of counts

Zeger (1988) uses the GEE methodology to fit a marginal model to a time series {y_t}_{t=1}^T of T = 168 monthly counts of cases of poliomyelitis in the United States. He specifies the marginal mean, variance and correlation by

E[y_t] = μ_t = exp(x_t'β),
var(y_t) = μ_t + σ²μ_t²,
corr(y_t, y_{t+τ}) = ρ(τ) / {[1 + (σ²μ_t)^{-1}][1 + (σ²μ_{t+τ})^{-1}]}^{1/2},    (1.1)

where σ² is the variance and ρ(τ) the autocorrelation function of an underlying random process {u_t}. To fit this marginal model he proposes and outlines the GEE approach, but notes that it requires inversion of the T x T variance-covariance matrix of y_1, ..., y_T, which has no recognizable structure and therefore no simple inverse. Subsequently, he suggests approximating this matrix by a simpler, structured matrix, leading to nearly as efficient estimators as would have been obtained
with the GEE approach. The variance component σ² and unknown parameters in ρ(τ) are estimated by a method of moments approach.

Interestingly, Zeger (1988) derives the marginal mean, variance and correlation in (1.1) from a random effects model specification: Conditional on an underlying latent random process {u_t} with E[u_t] = 1 and cov(u_t, u_{t+τ}) = σ²ρ(τ), he initially models the time series observations as conditionally independent Poisson variables with mean and variance

E[y_t | u_t] = var(y_t | u_t) = exp(x_t'β) u_t = μ_t u_t.    (1.2)

Marginally, by the formula for repeated expectation, this leads to the moments presented in (1.1). From there we also see that the latent random process {u_t} has introduced both overdispersion relative to a Poisson variable and autocorrelation among the observations. The models we will develop in subsequent chapters have similar features. The equation for the marginal correlation between y_t and y_{t+τ} shows that the autocorrelation in the observed time series must be less than the autocorrelation in the latent process {u_t}. We will return to the polio data set in Chapter 5, where we compare this model to models suggested in this dissertation and elsewhere.

1.2.2.2 GEE for binomial time series

For binary and binomial time series data, it is often more advantageous to model the association between observations using the odds ratio rather than directly specifying the marginal correlation corr(y_t, y_{t*}), as with count data. The odds ratio is a more natural metric to measure association between binary outcomes and easier to interpret. The correlation between two binary outcomes Y_1 and Y_2 is also constrained in a complicated way by their marginal means μ_1 = P(Y_1 = 1) and μ_2 = P(Y_2 = 1), as a consequence of the following inequalities
for their joint distribution:

P(Y_1 = 1, Y_2 = 1) ≤ min{μ_1, μ_2} and P(Y_1 = 1, Y_2 = 1) ≥ μ_1 + μ_2 - 1,

leading to

max{0, μ_1 + μ_2 - 1} ≤ P(Y_1 = 1, Y_2 = 1) ≤ min{μ_1, μ_2}.

Therefore, instead of marginal correlations, a number of authors (Fitzmaurice, Laird and Rotnitzky, 1993; Carey, Zeger and Diggle, 1993) propose the use of marginal odds ratios. For unequally spaced and unbalanced binary time series data, Fitzmaurice and Lipsitz (1995) present a GEE approach which models the marginal association using serial odds ratio patterns. Let ψ_{tt*} denote the marginal odds ratio between two binary observations y_t and y_{t*}. Their model for the association has the form

ψ_{tt*} = α^{1/|t - t*|}, 1 < α < ∞,

which has the property that as |t - t*| → 0 there is perfect association (ψ_{tt*} → ∞), and as |t - t*| → ∞ the observations are independent (ψ_{tt*} → 1). Note, however, that only positive association is possible with this type of model. (SAS's proc genmod now offers the possibility of specifying a general regression structure for the log odds ratios with the logor option.)

1.3 Transitional Models

In transitional models, past observations are simply treated as additional predictors. Interest lies in estimating the effects of these and other explanatory variables on the conditional mean of the response y_t given realizations of the past responses. Specifying the relationship between the mean of y_t and previous observations y_{t-1}, y_{t-2}, ... is another way (and in contrast to the direct way of marginal
models) of modeling the dependency between correlated responses. Transitional models fit into the framework of GLMs where, however, the distribution of y_t is now conditional on the past responses. The model in its most general form (Diggle et al., 2002) expresses the conditional mean of y_t as a function of explanatory variables and q functions f_r(·) of past responses,

E[y_t | H_t] = h^{-1}( x_t'β + Σ_{r=1}^q f_r(H_t; α) ),    (1.3)

where H_t = {y_{t-1}, y_{t-2}, ..., y_1} denotes the collection of past responses. H_t can also include past explanatory variables and parameters. Often, the models are in discrete-time Markov chain form of order q, and the conditional distribution of y_t given H_t only depends on the last q responses y_{t-1}, ..., y_{t-q}. For example, a transitional logistic regression model for binary responses that is a second order Markov chain has form

logit P(y_t = 1 | y_{t-1}, y_{t-2}) = x_t'β + α_1 y_{t-1} + α_2 y_{t-2}.

The main difference between transitional models and regular GLMs or marginal models is parameter interpretation. Both the interpretation of α and the interpretation of β are conditional on previous outcomes and depend on how many of these are included. As the time dependence in the model changes, so does the interpretation of parameters. With the logistic regression example from above, the conditional odds of success at time t are exp(α_1) times higher if the given previous response was a success rather than a failure. However, this interpretation assumes a fixed and given outcome at time t - 2. Similarly, a coefficient in β represents the change in the log odds for a unit change in x_t, conditional on the two prior responses. It might be possible that we lose information on the covariate effect by conditioning on these previous outcomes. In general, the interpretation of parameters in transitional models is different from the population averaged interpretation
we discussed for marginal models, where parameters are effects on the marginal mean without conditioning on any previous outcomes.

1.3.1 Model Fitting

If a discrete-time Markov model applies, the likelihood for a generic series y_1, ..., y_T is determined by the Markov chain structure:

L(β, α; y_1, ..., y_T) = f(y_1, ..., y_q) Π_{t=q+1}^T f(y_t | y_{t-1}, ..., y_{t-q}).

However, the transitional model (1.3) only specifies the conditional distributions appearing in the product, but not the first term of the likelihood. Often, instead of a full maximum likelihood approach, one conditions on the first q observations and maximizes the corresponding conditional likelihood. If, in addition, f_r(H_t; α) in (1.3) is a linear function in α (and possibly β), then maximization follows along the lines of GLMs for independent data. Kaufmann (1987) establishes the asymptotic properties, such as consistency, asymptotic normality and efficiency, of the conditional maximum likelihood estimator.

If a Markov assumption is not warranted, estimation can be based on the partial likelihood (Cox, 1975). To motivate the partial likelihood approach we follow Kedem and Fokianos (2002): They consider occasions where a time series {y_t} is observed jointly with a random covariate series {x_t}. The joint density of (y_t, x_t), t = 1, ..., T, parameterized by a vector θ, can be expressed as

f(y_1, x_1, ..., y_T, x_T; θ) = Π_{t=1}^T f(x_t | H_t; θ) Π_{t=1}^T f(y_t | H_t*; θ),    (1.4)

where H_t = (x_1, y_1, ..., x_{t-1}, y_{t-1}) and H_t* = (x_1, y_1, ..., x_{t-1}, y_{t-1}, x_t) hold the history up to time points t - 1 and t, respectively. Let F_{t-1} denote the σ-field generated by y_{t-1}, y_{t-2}, ..., x_t, x_{t-1}, ..., i.e., F_{t-1} is generated by past responses and present and past values of the covariates. Also, let f_t(y_t | F_{t-1}; θ) denote the conditional density of y_t given F_{t-1}, which is of exponential density form with
mean modeled by (1.3). Then the partial likelihood for θ = (α, β) is given by

PL(θ; y_1, ..., y_T) = Π_{t=1}^T f_t(y_t | F_{t-1}; θ),    (1.5)

which is the second product in (1.4), and hence the term partial. The loss of information from ignoring the first product in the joint density is considered small. If the covariate process is deterministic, then the partial likelihood becomes a conditional likelihood, but without the necessity of a Markov assumption on the distribution of the y_t's. Standard asymptotic results from likelihood analysis of independent data carry over to the case of partial likelihood estimation with dependent data. Fokianos and Kedem (1998) showed consistency and asymptotic normality of the estimator of θ and provided an expression for the asymptotic covariance matrix. Since the score equation obtained from (1.5) is identical to one for independent data in a GLM, partial likelihood holds the advantage of easy, fast and readily available software implementation with standard estimation routines such as iteratively reweighted least squares.

1.3.2 Transitional Models for Time Series of Counts

For a time series of counts {y_t}, Zeger and Qaqish (1988) propose Markov-type transitional models which they fit using quasi-likelihood methods and the estimating equations approach. They consider various models for the conditional mean μ_t = E[y_t | H_t] of form log(μ_t) = x_t'β + Σ_{r=1}^q α_r f_r(H_{t-r}), where, for example, f_r(H_{t-r}) = y_{t-r} or f_r(H_{t-r}) = log(y_{t-r} + c) - log(exp[x_t'β] + c). One common goal of their models is to approximate the marginal mean by E[y_t] = E[μ_t] ≈ exp{x_t'β}, so that β has an approximate marginal interpretation as the change in the log mean for a unit change in the explanatory variables. Davis et al. (2003) develop these models further and propose f_r(H_{t-r}) = (y_{t-r} - μ_{t-r})/μ_{t-r}^λ as a more appropriate function to build serial dependence into the model, where λ is an additional parameter. They explore stability properties such as stationarity and
ergodicity of these models and describe fast (in comparison to maximum likelihood techniques required for competing random effects models) recursive and iterative maximum likelihood estimation algorithms. Chapter 4 in Kedem and Fokianos (2002) discusses regression models of form (1.3), assuming a conditional Poisson or double-truncated Poisson distribution for the counts, with inference based on the partial likelihood concept. Their methodology is illustrated with two examples about monthly counts of rainy days and counts of tourist arrivals.

1.3.3 Transitional Models for Binary Data

For binary data {y_t}, a two-state first order Markov chain can be defined by its probability transition matrix

P = ( p_00  p_01 )
    ( p_10  p_11 ),

where p_ab = P(y_t = b | y_{t-1} = a), a, b = 0, 1, are the one-step transition probabilities between the two states a and b. Diggle et al. (2002, Chapt. 10.3) discuss various logistic regression models for these probabilities and higher order Markov chains for equally spaced observations. Unequally spaced data cannot be routinely handled with these models.

How can we determine the marginal association structure implied by the conditionally specified model? Let p^(1) = (p_0^(1), p_1^(1)) be the initial marginal distribution for the states at time t = 1. Then the distribution of the states at time n is given by p^(n) = p^(1) P^(n-1). As n increases, p^(n) approaches a steady state or equilibrium distribution p that satisfies p = pP. The solution to this equation is given by p_1 = P(y_t = 1) = E[y_t] = p_01/(p_01 + p_10) and is used to derive marginal moments implied by the transitional model. For example, it can be shown (Kedem, 1980) that in the steady state, the marginal variance and correlation implied by the
transitional model are var(y_t) = p_0 p_1 (as it should be) and corr(y_{t-1}, y_t) = p_11 - p_01, respectively.

Azzalini (1994) models serial dependence in binary data through transition models, but at the same time retains the marginal interpretation of regression parameters. He specifies the marginal regression model logit(μ_t) = x_t'β for a binary time series {y_t} with E[y_t] = μ_t, but assumes that a binary Markov chain with transition probabilities p_ab has generated the data. Therefore, the likelihood refers to these probabilities, but the model specifies marginal probabilities, a complication similar to the fitting of marginal models discussed in the previous section. However, assuming a constant log odds ratio

θ = log[ P(y_{t-1} = 1, y_t = 1) P(y_{t-1} = 0, y_t = 0) / ( P(y_{t-1} = 0, y_t = 1) P(y_{t-1} = 1, y_t = 0) ) ]

between any two adjacent observations, Azzalini (1994) shows how to write p_ab in terms of just this log odds ratio θ and the marginal probabilities μ_t and μ_{t-1}. Maximum likelihood estimation for such models is tedious but possible in closed form, although second derivatives of the log likelihood function have to be calculated numerically. A software package (the S-Plus function rm.tools; Azzalini and Chiogna, 1997) exists to fit such models for binary and Poisson observations. Azzalini (1994) mentions that this basic approach can be extended to include variable odds ratios between any two adjacent observations, possibly depending on covariates, but this is not pursued in the article. Diggle et al. (2002) discuss these marginalized transitional models further. Chapter 2 in Kedem and Fokianos (2002) presents a detailed discussion of partial likelihood estimation for transitional binary models and discusses, among other examples, the eruption data of the Old Faithful geyser, which we will turn to in Chapter 5.
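The steady-state results quoted above for the two-state chain are easy to check numerically; the transition probabilities in the following sketch are illustrative values, not taken from any data set in this dissertation:

```python
import numpy as np

rng = np.random.default_rng(7)
p01, p10 = 0.3, 0.2                        # illustrative one-step transition probabilities
P = np.array([[1 - p01, p01],
              [p10, 1 - p10]])

pi1 = p01 / (p01 + p10)                    # steady-state P(y_t = 1) = 0.6
pi = np.array([1 - pi1, pi1])
print(pi @ P)                              # equals pi: the equilibrium equation p = pP

# simulate a long chain and check the implied lag-1 correlation p11 - p01
T = 200_000
y = np.zeros(T, dtype=int)
for t in range(1, T):
    y[t] = int(rng.random() < P[y[t - 1], 1])

print(y.mean())                            # ~ pi1
print(np.corrcoef(y[:-1], y[1:])[0, 1])    # ~ p11 - p01 = 0.8 - 0.3 = 0.5
```

With p_11 = 0.8 and p_01 = 0.3, the implied lag-1 correlation is 0.5, and the simulated chain reproduces both the steady-state mean and this correlation.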
1.4 Random Effects Models

A popular way of modeling correlation among dependent observations is to include random effects u in the linear predictor. One of the first developments for discrete data occurred for longitudinal binary data, where subject-specific random effects induced correlation between repeated binary measurements on a subject (Bock and Aitkin, 1981; Stiratelli, Laird and Ware, 1984). In general, we assume that unmeasurable factors give rise to the dependency in the data {y_t}, and random effects {u_t} represent the heterogeneity due to these unmeasured factors. Given these effects, the responses are assumed independent. However, no values for these factors are observed, and so marginally (i.e., averaged over these factors), the responses are dependent. Conditional on some random effects, we consider models that fit into the framework of GLMs for independent data, i.e., where the conditional distribution of y_t | u_t is a member of the family of exponential distributions whose mean E[y_t | u_t] is modeled as a function of a linear predictor η_t = x_t'β + z_t'u_t. Together with a distributional assumption for the random effects (usually independent and identically distributed normal), this leads to generalized linear mixed models (GLMMs), where the term mixed refers to the mixture of fixed and random effects in the linear predictor. Chapter 2 contains a detailed definition of GLMMs and discusses maximum likelihood fitting and parameter interpretation, and in Chapter 3, correlated random effects for the description of time dependent observations {y_t} are motivated and described. Here we only give a short literature review about GLMMs which use correlated random effects to model time (or space) dependent data.

1.4.1 Correlated Random Effects in GLMMs

One of the first papers considering correlated random effects in GLMMs for the description of (spatial) dependence in Poisson data is Breslow and Clayton (1993), who analyze lip cancer rates in Scottish counties. They propose correlated
normal random effects to capture the correlation in counts of adjacent districts in Scotland. A random effect is assigned to each district, and two random effects are correlated if their districts are adjacent to each other.

In Section 1.2.2 we mentioned the Polio data set of a time series of equally spaced counts {y_t}_{t=1}^{168} and formulated the conditional model (1.2) with a latent process for the random effects. Instead of obtaining marginal moments as in Zeger (1988), Chan and Ledolter (1995) use a GLMM approach with Poisson random components and autoregressive random effects to analyze the time series. They outline parameter estimation via an MCEM algorithm similar to the one discussed in Sections 2.4 and 3.2 in this dissertation.

One of the three central generalized linear models advocated by Diggle et al. (2002, Chap. 11.2) to model longitudinal data uses correlated random effects. For equally spaced binary longitudinal data {y_it}, they plot response profiles simulated according to the model

logit[P(Y_it = 1 | U_it)] = β + U_it,
cov(U_it, U_it') = σ² ρ^{|t - t'|},

with σ² = 2.5² and ρ = 0.9, and note that the profiles exhibit more alternating runs of 0's and 1's than a random intercept model with u_i1 = u_i2 = ... = u_iT = u_i. However, based on the similarity between plots of random intercepts, random intercepts and slopes, and autoregressive random effects models, they mention the challenge that binary data present in distinguishing and modeling the underlying dependency structure in longitudinal data. (They used T = 25 repeated observations for their simulations.) Furthermore, they state that numerical methods for maximum likelihood estimation are computationally impractical for fitting models with higher dimensional random effects. This makes it impossible, they conclude, to fit the GLMM with serially correlated random effects using maximum
likelihood. Instead, they propose a Bayesian analysis using powerful Markov chain Monte Carlo methods.

Indeed, the majority of examples in the literature which consider correlated random effects in a GLMM framework take a Bayesian approach. Sun, Speckman and Tsutakawa (2000) explore several types of correlated random effects (autoregressive, generalized autoregressive and conditional autoregressive) in a Bayesian analysis of a GLMM. As in any Bayesian analysis, the propriety of the posterior distribution given the data is of concern when fixed effects and variance components have improper prior distributions and random effects are (possibly singular) multivariate normal. One of their results, applied to Poisson or binomial data {y_t}, states that the posterior might be improper when y_t = 0 in the Poisson case, and cannot be proper when y_t = 0 or y_t = n_t in the binomial case, for any t, when improper or noninformative priors are used.

Diggle, Tawn and Moyeed (1998) consider Gaussian spatial processes S(x) to model spatial count data at locations x. The role of S(x) is to explain any residual spatial variation after accounting for all known explanatory variables. They also use a Bayesian framework to estimate parameters and give a solution to the problem of predicting the count at a new location x. Ghosh et al. (1998) use correlated random effects in Bayesian models for small area estimation problems. They present an application of pairwise difference priors for random effects to model a series of spatially correlated binomial observations in a Bayesian framework. Zhang (2002) discusses maximum likelihood estimation with an underlying spatial Gaussian process for spatially correlated binomial observations.

Bayesian models for binary time series are described in Liu (2001), based on probit-type models for correlated binary data which are discussed in Chib and Greenberg (1998). Probit-type models are motivated by assuming latent random variables z = (z_1, ..., z_T)' which follow a N(μ, Σ) distribution with μ = (μ_1, ..., μ_T)', μ_t = x_t'β and Σ a correlation matrix. The y_t's are assumed to be generated according to y_t = I(z_t > 0), where I(·) is the indicator function. This leads to the (marginal) probit model P(Y_t = 1 | β, Σ) = Φ(μ_t). Rich classes of dependency structures between binary outcomes can be modeled through Σ. These models can further be extended to include random effects through μ_t = x_t'β + z_t'u_t, or q previous responses, such as μ_t = x_t'β + Σ_{r=1}^q α_r y_{t−r}.

It is important to note that Σ has to be in correlation form. To see this, suppose it is not, and let S = DΣD be a covariance matrix for the latent random variables z, where D is a diagonal matrix holding standard deviation parameters. The joint density of the time series under the multivariate probit model is given by

    P[(Y_1, ..., Y_T) = (y_1, ..., y_T)] = P[z ∈ A] = P[D^{−1}z ∈ A],

where A = A_1 × ... × A_T, with A_t = (−∞, 0] if y_t = 0 and A_t = (0, ∞) if y_t = 1 the intervals corresponding to the relationship y_t = I(z_t > 0), for t = 1, ..., T. However, the above relationship is true for any parametrization of D, because the intervals A_t are not affected by the transformation from z to D^{−1}z. Hence, the elements of D are not identifiable based on the joint distribution of the observed time series y.

Lee and Nelder (2001) present models to analyze spatially correlated Poisson counts and binomial longitudinal data about cancer mortality rates. They explore a variety of patterned correlation structures for random effects in a GLMM setup. Model fitting is based on the joint data likelihood of observations and unobserved random effects (Lee and Nelder, 1996) and not on the marginal likelihood of the
observed data. Model diagnostic plots of estimated random effects are presented to aid in selecting an appropriate correlation structure.

1.4.2 Other Modeling Approaches

In hidden Markov models (MacDonald and Zucchini, 1997), the underlying random process is assumed to be a discrete state-space Markov chain instead of a continuous (normal) process. Probability transition matrices describe the connection between states. A very convenient property of hidden Markov models is that the likelihood can be evaluated sufficiently fast to permit direct numerical maximization. MacDonald and Zucchini (1997) present a detailed description of hidden Markov models for the analysis of binary and count time series.

A connection between transitional models and random effects models is explored in Aitkin and Alfò (1998). They model the success probabilities of serial binary observations conditional on subject-specific random effects and on the previous outcome. As in the models before, transition probabilities p_ab change over time due to the inclusion of time-dependent covariates and the previous observation in the linear predictor. Additionally, random effects account for possibly unobserved sources of heterogeneity between subjects. The authors argue that the conditional model specification, together with the specification of the random effects distribution, does not determine the distribution of the initial observation, and hence the likelihood for this model is unspecified. They present a solution by maximizing the likelihood obtained from conditioning on this first observation. However, this causes the specified random effects distribution to shift to an unknown distribution. Two approaches for estimation are outlined: The first assumes another normal distribution for the new random effects distribution, and the likelihood is maximized using Gauss-Hermite quadrature.
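Gauss-Hermite quadrature replaces the integral over a normal random effect by a finite weighted sum. A minimal sketch (our own, with arbitrary values β = 0.5, σ = 1.5, not the Aitkin-Alfò model itself) approximating the marginal success probability ∫ expit(β + u) φ(u; 0, σ²) du:

```python
import numpy as np

def marginal_prob(beta, sigma, n_nodes=30):
    """Approximate E[expit(beta + u)] for u ~ N(0, sigma^2) by
    Gauss-Hermite quadrature (change of variable u = sqrt(2)*sigma*x)."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)  # nodes/weights for e^{-x^2}
    u = np.sqrt(2.0) * sigma * x
    expit = 1.0 / (1.0 + np.exp(-(beta + u)))
    return np.sum(w * expit) / np.sqrt(np.pi)

# Monte Carlo check of the quadrature approximation
rng = np.random.default_rng(0)
mc = np.mean(1.0 / (1.0 + np.exp(-(0.5 + rng.normal(0, 1.5, 500_000)))))
print(marginal_prob(0.5, 1.5), mc)  # the two values should agree closely
```

With 30 nodes the quadrature rule is essentially exact for this smooth integrand; in a likelihood this sum would be evaluated once per observation and maximized over (β, σ).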
The second approach assumes no parametric form for the new random effects distribution and follows the nonparametric maximum likelihood approach (Aitkin, 1999). For binary data, the new random effects distribution is only a two-point distribution, and its parameters can be estimated via maximum likelihood jointly with the other model parameters.

Marginalized transitional models were briefly mentioned with the approach taken by Azzalini (1994). The idea of "marginalizing," i.e., modeling the marginal mean of an otherwise conditionally specified model, can also be applied to random effects models. The advantage of transitional or random effects models is the ability to easily specify correlation patterns, with the potential disadvantage that parameters in such models have conditional interpretations when the scientific goal is the interpretation of marginal relationships. In marginal models, parameters can directly be interpreted as contrasts between subpopulations, without the need of conditioning on previous observations or unobserved random effects. However, as we mentioned in Section 1.2.2, likelihood-based inference in marginal models might not be possible.

A marginalized random effects model (Heagerty, 1999; Heagerty and Zeger, 2000) specifies two regression equations that are consistent with each other. The first equation expresses the marginal mean μ_t as a function of covariates and describes systematic variation. The second equation characterizes the dependency structure among observations through specification of the conditional mean

    h(E[Y_t | u_t]) = Δ_t(x_t) + z_t'u_t,

where the u_t are random effects with design vector z_t. Consistency between the marginal and conditional specification is achieved by defining Δ_t(x_t) implicitly through
an integral equation. For instance, in a marginalized GLMM with random effects distribution F(u_t), Δ_t(x_t) is the solution to the integral equation

    μ_t = h^{−1}(x_t'β) = ∫ h^{−1}(Δ_t(x_t) + z_t'u_t) dF(u_t),

so that Δ_t(x_t) is a function of the marginal regression coefficients β and the (variance) parameters in F(u_t). Maximum likelihood estimation is based on the integrated likelihood from the GLMM.

1.5 Motivation and Outline of the Dissertation

In this dissertation we propose generalized linear mixed models (GLMMs) with correlated random effects to model count or binomial response data collected over time or in space. For sequential or spatial Gaussian measurements, maximum likelihood estimation is well established, and software (e.g., SAS's proc mixed) is available to fit fairly complicated correlation structures. The challenge for discrete data lies in the fact that the observed (marginal) likelihood is not analytically tractable and its maximization is more involved. Furthermore, with correlated random effects the likelihood does not break down into lower-dimensional components which are easier to integrate numerically. Therefore, most approaches in the literature are based on a quasi-likelihood approach or take a Bayesian perspective. The advantage of Bayesian models is that powerful Markov chain Monte Carlo methods make it easier to obtain a sample from the posterior distribution of interest than to obtain maximum likelihood estimates. However, priors must be specified very carefully to ensure posterior propriety. In addition, repeated observations are prone to missing data or unequally spaced observation times. We would like to develop methods and models that allow for unequally spaced binary, binomial or Poisson observations, making them more general than previously presented in the literature.
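Returning to the marginalized random effects model of Section 1.4, the integral equation defining Δ_t(x_t) can be solved numerically for each t. The following sketch (a hypothetical univariate illustration of ours: logit link, u_t ~ N(0, σ²), z_t = 1) combines Gauss-Hermite quadrature with root finding:

```python
import numpy as np
from scipy.optimize import brentq

def solve_delta(marginal_lp, sigma, n_nodes=40):
    """Find Delta such that expit(marginal_lp) equals the average of
    expit(Delta + u) over u ~ N(0, sigma^2) (logit link, z_t = 1)."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    u = np.sqrt(2.0) * sigma * x
    target = 1.0 / (1.0 + np.exp(-marginal_lp))  # marginal mean h^{-1}(x_t'beta)

    def gap(delta):
        cond = 1.0 / (1.0 + np.exp(-(delta + u)))
        return np.sum(w * cond) / np.sqrt(np.pi) - target

    return brentq(gap, -30.0, 30.0)  # monotone in delta, so one root

delta = solve_delta(marginal_lp=0.7, sigma=2.0)
print(delta)
```

Because averaging expit(Δ + u) over the normal distribution shrinks probabilities toward 1/2, the solution Δ is larger in magnitude than the marginal linear predictor (here Δ > 0.7), which is the sense in which Δ_t(x_t) depends on both β and the variance parameters.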
To our knowledge, maximum likelihood estimation of GLMMs with such high-dimensional random effects has not been demonstrated before, with the exception of the paper by Chan and Ledolter (1995), who consider fitting a time series of counts. However, they do not consider unequally spaced data, and they employ a different implementation of the MCEM algorithm. In Chapter 5 we argue that their implementation of the algorithm might have been stopped prematurely, leading to different conclusions than our analysis and analyses published elsewhere. Most articles that discuss correlated random effects do so for only a small number of correlated random effects. E.g., Chan and Kuk (1997) show that the data set on salamander mating behavior published and analyzed in McCullagh and Nelder (1989) is more appropriately analyzed when random effects pertaining to the male salamander population are correlated over the three different time points when they were observed. In this thesis we would like to consider much longer sequences of repeated observations.

In Chapter 2 we introduce the GLMM as the model of our choice to analyze correlated discrete data and outline an EM algorithm to estimate fixed and random effects, where both the E-step and the M-step require numerical approximations, leading to an EM algorithm based on Monte Carlo methods (MCEM). Correlated random effects and their implications for the analysis of GLMMs are discussed in Chapter 3, together with a motivating example. This chapter also gives details for the implementation of the algorithm and reports results from simulation studies. Chapter 4 looks at marginal model properties and interpretation for correlated binary, binomial or Poisson observations, and Chapter 5 applies our methods to real data sets from the social sciences, public health, sports and other backgrounds. A summary and discussion of the methods and models presented here is given in Chapter 6.
CHAPTER 2
GENERALIZED LINEAR MIXED MODELS

Chapter 1 reviewed various approaches of extending GLMs to deal with correlated data. In this chapter we will take a closer look at generalized linear mixed models (GLMMs), which were briefly mentioned in Section 1.4. When the response variables are normal, these models are simply called linear mixed models (LMMs) and have been extensively discussed in the literature (see, for example, the books by Searle, Casella and McCulloch, 1992, and Verbeke and Molenberghs, 2000). The form of the normal density for observations and random effects allows for analytical evaluation of the integrals, together with straightforward maximization. Hence, LMMs can be readily fit with existing software (e.g., SAS's proc mixed), using rich classes of prespecified correlation structures for the random effects to model the dependence in the data more precisely. The broader notion of GLMMs also encompasses binary, binomial, Poisson or gamma responses.

A distinctive feature of GLMMs is their so-called subject-specific parameter interpretation, which differs from the interpretation of parameters in marginal (Section 1.2) or transitional (Section 1.3) models. This feature is discussed in Section 2.1, after a formal introduction of the GLMM. Throughout, special attention is devoted to defining GLMMs for discrete time series observations. GLMMs are harder to fit because they typically involve intractable integrals in the likelihood function. Section 2.2 outlines various approaches to model fitting. Section 2.3 focuses on a Monte Carlo version of the EM algorithm, which is an indirect method of finding maximum likelihood estimates in GLMMs. Monte Carlo methods are necessary because our applications involve correlated random effects, which lead to a very high-dimensional integral in the likelihood function. Parallel
to the discussion of GLMMs, state space models are introduced and a fitting algorithm is described. State space models are popular models for discrete time series in econometric applications (Durbin and Koopman, 2001). The presentation of specific examples of GLMMs for discrete time series observations is deferred until Chapter 5.

2.1 Definition and Notation

The generalized linear mixed model is an extension of the well-known generalized linear model (McCullagh and Nelder, 1989) that permits fixed as well as random effects in the linear predictor (hence the word mixed). The setup process for GLMMs is split into two stages, which we present here using notation common for longitudinal studies: Firstly, conditional on cluster-specific random effects u_i, the data are assumed to follow a GLM with independent random components Y_it, the tth response in cluster i, i = 1, ..., n, t = 1, ..., n_i. A cluster here is a generic expression and means any form of observations being grouped together, such as repeated observations on the same subject (cluster = subject), observations on different students in the same school (cluster = school), or observations recorded in a common time interval (cluster = time interval). The conditional distribution of Y_it is a member of the exponential family of distributions (e.g., McCullagh and Nelder, 1989) with form

    f(y_it | u_i; β) = exp{ [y_it θ_it − b(θ_it)] / φ_it + c(y_it, φ_it) },    (2.1)

where the θ_it are natural parameters and b(·) and c(·) are certain functions determined by the specific member of the exponential family. The parameters φ_it are typically of the form φ_it = φ/w_it, where the w_it's are known weights and φ is a possibly unknown dispersion parameter. For the discrete response GLMMs we are considering, φ = 1. For a specific link function h(·), the model for the conditional mean for
observations Y_it has the form

    h(E[Y_it | u_i]) = x_it'β + z_it'u_i,    (2.2)

where x_it and z_it are covariate or design vectors for the fixed and random effects associated with observation y_it, and β is a vector of unknown regression coefficients. At this first stage, z_it'u_i can be regarded as a known offset for each observation, and observations are conditionally independent. It should be noted that relationship (2.2) between the mean of the observations and the fixed and random effects is exactly as specified in the systematic part of GLMs, with the exception that in GLMMs a conditional mean is modeled. This affects parameter interpretation. The regression coefficients β represent the effect of explanatory variables on the conditional mean of observations given the random effects. For instance, observations in the same cluster i share a common value of the random cluster effect u_i, and hence β describes the conditional effect of explanatory variables given the value for u_i. If the cluster consists of repeated observations on the same subject, these effects are called subject-specific effects. In contrast, regression coefficients in GLMs and marginal models describe the effect of explanatory variables on the population average, which is an average over observations in different clusters.

At the second stage, the random effects u_i are specified to follow a multivariate normal distribution with mean zero and variance-covariance matrix Σ_i. A standard assumption is that the random effects {u_i} are independent and identically distributed, but an example at the beginning of Chapter 3 will show that this is sometimes not appropriate. With time series observations, where the clusters refer to time segments, it is reasonable to assume that observations are not only correlated within the cluster (modeled by sharing the same cluster-specific random
effect), but also across clusters, which we will model by assuming correlated cluster-specific random effects.

2.1.1 Generalized Linear Mixed Models for Univariate Discrete Time Series

Most of the data we are going to analyze is in the form of a single univariate time series. To emphasize this data structure, the general two-dimensional notation (indices i and t) of a GLMM to model observations which come in clusters can be simplified in two ways: We can assume that a single cluster (i.e., n = 1 and n_1 = T) contains the entire time series y_1, ..., y_T. The random effects vector u = (u_1, ..., u_T)' associated with the single cluster has a random effects component for each individual time series member. The distribution of u is multivariate normal with variance-covariance matrix Σ, which is different from the identity matrix. The correlation of the components of u induces a correlation among the time series members. However, conditional on u, observations within the single cluster are independent. The cluster index i is redundant in the notation and hence can be dropped. This representation is particularly useful when used with existing software to fit GLMMs, where it is often necessary to include a column indicating the cluster membership information for each observation. Here, since we have only one cluster, it suffices to include a column of all ones, say.

Alternatively, we can adopt the point of view that each member of the time series is a mini-cluster by itself, containing only one observation (i.e., n_i = 1 for all i = 1, ..., T) in the case of a single time series. When multiple parallel time series are observed, the cluster contains all c observations at time point t from the c parallel time series (i.e., n_i = c for all i = 1, ..., T). In any case, the clusters are then synonymous with the discrete time points at which observations were recorded. This makes index t, which counts the repeated observations in a cluster,
redundant (t = 1 or t = c for all clusters i), but instead of denoting the time series by {y_i}_{i=1}^T we decided to use the more common notation {y_t}_{t=1}^T, where t now is the index for clusters or, equivalently, time points. In the following definition of GLMMs for univariate time series, the notions of clusters or time points can be used interchangeably.

Conditional on unobserved random effects u_1, ..., u_T for the different time points, observations y_1, ..., y_T are assumed independent with distributions

    f(y_t | u_t; β) = exp{ [y_t θ_t − b(θ_t)] / φ_t + c(y_t, φ_t) }    (2.3)

in the exponential family. As before, for a specific link function h(·), the model for the conditional mean has the form

    h(E[y_t | u_t]) = x_t'β + z_t'u_t,    (2.4)

where x_t and z_t are covariate or design vectors for the fixed and random effects associated with the tth observation, and β is a vector of unknown regression coefficients. The random effects u_1, ..., u_T are typically not independent. When collected in the vector u = (u_1, ..., u_T)', a multivariate normal distribution with mean 0 and covariance matrix Σ can be directly specified. In particular, in Chapter 3 we will assume special patterned covariance matrices to allow for rich, but still parsimonious, classes of correlation structures among the time series observations.

The advantage of the second setup of mini-clusters is that it also allows for other, indirect specifications of the random effects distribution, for instance through a latent random process. For this, we relate cluster-specific random effects from successive time points. For example, with univariate random effects, a first-order latent autoregressive process assumes that the random effects follow

    u_t = ρ u_{t−1} + ε_t,    (2.5)
where ε_t has a zero-mean normal distribution and ρ is a correlation parameter. Cox (1981) called this type of model parameter-driven, as opposed to the transitional (or observation-driven) models discussed in Section 1.3. In parameter-driven models, an underlying and unobserved parameter process influences the distribution of a series of observations. The model for the polio data in Zeger (1988) is an example of a parameter-driven model. However, Zeger (1988) assumes neither normality nor zero mean for the latent autoregressive process. Furthermore, the natural logarithm of the random effects, and not the random effects themselves, appears additively in the linear predictor. Therefore, this model is slightly different from the specification of the time series GLMM from above.

Another application of the mini-cluster setup is to spatial settings, where clusters represent spatially aggregated data instead of time points. Then u_1, ..., u_T is a collection of random effects associated with spatial clusters. Again, independent random effects are inappropriate for describing the spatial dependencies. In general, time-dependent data are easier to handle, since observations are linearly ordered in time, and more complicated random effects distributions are needed for spatial applications (e.g., Besag et al., 1995). We will use the mini-cluster representation to facilitate a comparison to state space models. This is the focus of the next section.

2.1.2 State Space Models for Discrete Time Series Observations

State space models are a rich alternative to the traditional Box-Jenkins ARIMA system for time series analysis. Similar to GLMMs, state space models for Gaussian and non-Gaussian time series split the modeling process into two stages: At the first stage, the responses y_t are related to unobserved states by an observation equation. (State space models originated in the engineering literature, where parameters are often called states.)
At the second stage, a latent or hidden Markov model is assumed for the states. For univariate Gaussian responses
y_1, ..., y_T, the two equations of a state space model take the form

    y_t = w_t'α_t + ε_t,
    α_t = T_t α_{t−1} + R_t e_t,    (2.6)

where w_t is an m × 1 observation or design vector and ε_t is a white noise process. The unobserved m × 1 state or parameter vector α_t is defined by the second, transition equation, where T_t is a transition matrix and e_t is another white noise process, independent of the first one.

Compared to the standard GLMMs of Section 2.1, the main difference is that the random effects are correlated instead of i.i.d. In state space models, no clear distinction between fixed and random effects is made, and the state vector α_t can contain both. However, the form of the transition matrix T_t, together with the form of the matrix R_t, which consists of columns of the identity matrix I_m, allows one to declare certain elements of α_t as fixed effects and others as random. The matrix R_t is called the selection matrix, since it selects the rows of the state equation which have nonzero variance terms. With this formulation, the variance-covariance matrix Q_t is assumed to be nonsingular. Furthermore, the transition matrix T_t allows specification of which effects vary through time and which stay constant. (For a slightly different formulation, without a selection matrix R_t but with a possibly singular variance-covariance matrix Q_t, see Fahrmeir and Tutz, 2001, Chap. 8.)

State space models for non-Gaussian time series were considered by West et al. (1985) under the name dynamic generalized linear model. They used a Bayesian framework with conjugate priors to specify and fit their models. Durbin and Koopman (1997, 2000) and Fahrmeir and Tutz (2001) describe a state space structure for non-Gaussian observations similar to the two equations above. The normal distribution assumption for the observations in (2.6) is replaced by assuming a distribution in the exponential family with natural parameter θ_t. With
a canonical link, θ_t = w_t'α_t. This is called the signal by Durbin and Koopman (1997, 2000). In particular, given the states α_1, ..., α_T, observations y_1, ..., y_T are conditionally independent and have density p(y_t | θ_t) in the exponential family (2.3). As in Gaussian state space models, the state vector α_t is determined by the vector autoregressive relationship

    α_t = T_t α_{t−1} + R_t e_t,    (2.7)

where the serially independent e_t typically have normal distributions with mean 0 and variance-covariance matrix Q_t.

2.1.3 Structural Similarities Between State Space Models and GLMMs

There is a strong connection between state space models and GLMMs with a canonical link. To see this, we write the GLMM in state space form: Let w_t = (x_t', z_t')' and α_t = (β', u_t')', where x_t, z_t, β and u_t are from the GLMM notation as defined in (2.4). Hence, the linear predictor x_t'β + z_t'u_t of the GLMM is equal to the state space signal θ_t = w_t'α_t. Next, partition the disturbance term e_t of the state equation into e_t = (e_t^β', e_t^u')' and consider special transition and selection matrices of block form

    T_t = [ I   0  ]        R_t = [  0  ]
          [ 0  T̃_t ],             [ R̃_t ].

Using transition equation (2.7) results in the following autoregressive relationship between the random effects of a GLMM:

    u_t = T̃_t u_{t−1} + ξ_t,

where ξ_t = R̃_t e_t^u is a white noise component. In a univariate context, we have already motivated this type of relationship between random effects in equation (2.5). The transition equation also implies a constant effect β for the GLMM, since β_1 = β_2 = ... = β_T := β. Hence, both models use correlated random effects, but GLMMs also typically involve fixed parameters which are not modeled as evolving over time.
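The block construction above can be checked with a few lines of code. This sketch (our own, with arbitrary scalar values) iterates the transition equation and confirms that the β-component of the state stays constant while the u-component evolves as an AR(1) process:

```python
import numpy as np

rng = np.random.default_rng(1)
T_len, beta, rho, q = 100, 0.3, 0.9, 0.2  # arbitrary illustration values

# State alpha_t = (beta_t, u_t)'; the block transition keeps beta fixed
# and lets u evolve as AR(1): alpha_t = T @ alpha_{t-1} + R * e_t.
Tmat = np.array([[1.0, 0.0],
                 [0.0, rho]])
R = np.array([0.0, 1.0])  # selection vector: noise enters only the u-row

alpha = np.zeros((T_len, 2))
alpha[0] = [beta, rng.normal(0, q)]
for t in range(1, T_len):
    alpha[t] = Tmat @ alpha[t - 1] + R * rng.normal(0, q)

# The beta-component never changes; the u-component is the AR(1) effect.
print(alpha[:, 0].min(), alpha[:, 0].max())
```

The identity block in the transition matrix and the zero block in the selection vector are exactly what encodes "fixed effect" within the state space formalism.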
The restriction of the transition equation to the autoregressive form (often a simple random walk) is only one way of specifying a distribution for the random effects in the GLMMs of Section 2.1.1. Other structures, such as equally correlated random effects, are possible within the GLMM framework and are considered in Chapter 3.

2.1.4 Practical Differences

Although GLMMs and state space models are similar in structure, they are used differently in practice. This is in part due to the fact that in GLMMs the focus is on the fixed subject-specific regression parameters β, which refer to time-constant and time-varying covariates, while in state space models the main purpose is to infer properties about the time-varying random states α_t. These are often assumed to follow a first- or second-order random walk. To illustrate, consider a data set about a monthly time series of counts presented in Durbin and Koopman (2000) for the investigation of the effectiveness of new seat belt legislation on automobile accidents. They specify the log-mean for a Poisson state space model as

    log(μ_t) = ν_t + λx_t + γ_t,

where ν_t is a trend component following the random walk ν_{t+1} = ν_t + ξ_t, with ξ_t a white noise process. Further, λ is an intervention parameter corresponding to the change in seat belt legislation (x_t = 0 before the change and equal to 1 afterwards), and the {γ_t} are fixed seasonal components which sum to zero over a year and are equal in every year. The main focus is on the parameter λ, describing the drop in the log-means (also called the level) after the seat belt legislation went into effect.
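A simulation sketch of this model structure (our own illustration; the parameter values and the sinusoidal seasonal pattern are arbitrary stand-ins, not Durbin and Koopman's estimates):

```python
import numpy as np

rng = np.random.default_rng(7)
n_months = 120
x = (np.arange(n_months) >= 60).astype(float)  # intervention dummy
lam = -0.3                                     # drop in level (arbitrary)
# A fixed yearly pattern, repeated every year; a full-period sine makes
# each year's seasonal components sum to zero.
season = np.tile(0.1 * np.sin(2 * np.pi * np.arange(12) / 12), 10)

# Random-walk trend: nu_{t+1} = nu_t + xi_t
xi = rng.normal(0, 0.05, n_months)
nu = 3.0 + np.cumsum(xi)

# Poisson counts with log-mean = trend + intervention + seasonal
y = rng.poisson(np.exp(nu + lam * x + season))
print(y[:60].mean(), y[60:].mean())
```

In a real analysis the trend, intervention and seasonal components would of course be estimated from the observed counts rather than simulated; the sketch only shows how the pieces of the log-mean combine.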
For the series at hand, a GLMM approach would consider a fixed linear time effect β and model the log-mean as

    log(μ_t) = α + βt + λx_t + γ_t + u_t,

where α is the intercept of the linear time trend with slope β, and where correlated random effects {u_t} account for the correlation in the monthly log-means. Similar to above, λ describes the effect on the log-means after the seat belt legislation went into effect, and the γ_t are fixed seasonal components, equal for every year. The trend component ν_t of the state space model corresponds to α + u_t in the GLMM, which additionally allows for a linear time trend β. This approach seems to be favored by some discussants of the Durbin and Koopman (2000) paper. (In particular, see the discussions of that paper by Chatfield and by Aitkin, who mention the lack of a linear trend term in the proposed state space model.) An even better GLMM approach, with linear time trends explicitly modeled, could use a change point formulation in the linear predictor, with the month the legislation went into effect (or was enforced) as the change point, and again with correlated random effects {u_t} to capture the dependency among successive means. Such a specification would be harder to model in a state space model.

In the reply to the discussion of their paper, Durbin and Koopman (2000) wrote that the two approaches (state space models versus hierarchical generalized linear models, a model class very similar to GLMMs) are very different, and that they regard their treatment as more transparent and general for problems that specifically relate to time series. With the presentation in this thesis of GLMMs with correlated random effects for time series analysis, their argument might weaken. For instance, with the proposal of autocorrelated random effects u_{t+1} = ρu_t + ε_t in a GLMM context, we have elegant means of introducing autocorrelation into a basic regression model that is well understood and whose parameters are easily
interpreted. Furthermore, GLMMs can easily accommodate the case of multiple time series observations on each of several individuals or cross-sectional units, as is often observed in a longitudinal study.

One common feature of both models is the intractability of the likelihood function and the use of numerical and simulation techniques to obtain maximum likelihood estimates. In general, state space models for non-Gaussian time series are fit using a simulated maximum likelihood approach, which is also a popular method for fitting GLMMs. However, long time series necessarily result in models with complex and high-dimensional random effects, and alternative, indirect methods may work better. Jank and Booth (2003) indicate that simulated maximum likelihood, the method of choice for estimation in state space models, may not work as well as indirect methods based on the EM algorithm, the method we will use to fit GLMMs for time series observations. The next section reviews various approaches of fitting GLMMs and contrasts them with the approach taken for state space models.

2.2 Maximum Likelihood Estimation

Maximum likelihood estimation in GLMMs is a challenging task because it requires the calculation of integrals (often high-dimensional) that have no known analytic solution. Following the general notation of a GLMM in Section 2.1, let y_i = (y_i1, ..., y_in_i)' be the vector of all observations in cluster i, whose associated random effects vector is u_i. Conditional independence of the y_it's implies that the density function of y_i is given by

    f(y_i | u_i; β) = ∏_{t=1}^{n_i} f(y_it | u_i; β),    (2.8)

where the f(y_it | u_i; β) are the exponential family densities in (2.1). The parameter β is the vector of all unknown regression coefficients introduced by specifying model (2.2) for the mean of the observations. Furthermore, observations from different clusters
are assumed conditionally independent, leading to the conditional joint density

    f(y_1, ..., y_n | u_1, ..., u_n; β) = ∏_{i=1}^n f(y_i | u_i; β)

for all observations given all random effects. Let g(u_1, ..., u_n; ψ) denote the multivariate normal density function of the random effects, whose variance-covariance matrix Σ is determined by the variance component vector ψ. The goal is to estimate the unknown parameter vectors β and ψ by maximum likelihood. The likelihood function L(β, ψ; y_1, ..., y_n) for a GLMM is given by the marginal density function of the observations y_1, ..., y_n, viewed as a function of the parameters, and is equal to

    L(β, ψ; y_1, ..., y_n) = ∫ f(y_1, ..., y_n | u_1, ..., u_n; β) g(u_1, ..., u_n; ψ) du_1 ... du_n
                           = ∫ [ ∏_{i=1}^n f(y_i | u_i; β) ] g(u_1, ..., u_n; ψ) du_1 ... du_n
                           = ∫ [ ∏_{i=1}^n ∏_{t=1}^{n_i} f(y_it | u_i; β) ] g(u_1, ..., u_n; ψ) du_1 ... du_n.    (2.9)

It is called the observed likelihood because the unobserved random effects have been integrated out, and (2.9) is a function of the observed data only. Except for the linear mixed model, where f(y_1, ..., y_n | u_1, ..., u_n) is a normal density, the integral has no closed-form solution, and numerical procedures (analytic or stochastic) are necessary to calculate and maximize it. Standard maximization techniques, such as Newton-Raphson or EM for fitting GLMs and LMMs, have to be modified, because the conditional distribution of the observations and the distribution of the random effects are not conjugate and the integral is analytically intractable.

2.2.1 Direct and Indirect Maximum Likelihood Procedures

In general, there are two ways to obtain maximum likelihood estimates from the marginal likelihood in (2.9): The first one is a direct approach and
attempts to approximate the integral by either analytic or stochastic methods and then maximize this approximation with respect to the parameters $\beta$ and $\psi$. Some common analytic approximation methods are Gauss-Hermite quadrature (Abramowitz and Stegun, 1964), a first-order Taylor series expansion of the integrand, or a Laplace approximation (Tierney and Kadane, 1986), which is based on a second-order Taylor series expansion. The two latter methods result in likelihood equations similar to those of a linear mixed model (Breslow and Clayton, 1993; Wolfinger and O'Connell, 1993), and by iteratively fitting such a model and re-expanding the integrand around updated parameter estimates, one can obtain approximate maximum likelihood estimates. However, these methods have been shown to yield estimates which can be biased and inconsistent, an issue which is discussed in Lin and Breslow (1996) and Breslow and Lin (1995).

Techniques using stochastic integral approximations are known under the name simulated maximum likelihood and have been proposed by Geyer and Thompson (1992) and Gelfand and Carlin (1993). These methods approximate the integral in (2.9) by importance sampling (Robert and Casella, 1999) and are better suited for larger dimensional integrals than analytic approximations. Usually the importance density depends on the parameters to be estimated, and so simulated maximum likelihood is used iteratively by first approximating the integral by a Monte Carlo sum with some initial values for the unknown parameters. Then the likelihood is maximized, and the resulting parameters are used to generate a new sample from the importance density in the next iteration. We will briefly discuss the idea behind importance sampling in Section 2.3.2. The simulated maximum likelihood approach is also further illustrated in the next section, where we discuss it in the context of state space models.

An alternative to the direct approximating methods is the EM algorithm (Dempster et al., 1977). The integral in (2.9) is not directly maximized in this
method but is maximized indirectly by considering a related function $Q(\cdot \mid \cdot)$. At each step of this algorithm, maximization of the Q-function increases the marginal likelihood, a fact that can be verified using Jensen's inequality. The EM algorithm relies on recognizing or inventing missing data which, together with the observed data, simplify maximum likelihood calculations. For GLMMs, the random effects $u_1, \ldots, u_n$ are treated as the missing data. In particular, let $\beta^{(k-1)}$ and $\psi^{(k-1)}$ denote current (at the end of iteration $k-1$) values for the parameter vectors $\beta$ and $\psi$. Also let $y' = (y_1', \ldots, y_n')$ and $u' = (u_1', \ldots, u_n')$ denote the vectors of all observations and their associated random effects. Then the $Q(\cdot \mid \cdot)$ function at the start of iteration $k$ has form

$$Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) = E\left[\log f(y, u; \beta, \psi) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]$$
$$= E\left[\log f(y \mid u; \beta) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] + E\left[\log g(u; \psi) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right], \qquad (2.10)$$

where $f(y, u; \beta, \psi) = f(y \mid u; \beta)\, g(u; \psi)$ denotes the joint density of observed and missing data, also known as the complete data. The expectation in (2.10) is with respect to the conditional distribution of $u \mid y$, evaluated at the current parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)}$. The calculation of the expected value is called the E-step, and it is followed by an M-step, which maximizes $Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)})$ with respect to $\beta$ and $\psi$. The resulting estimates $\beta^{(k)}$ and $\psi^{(k)}$ are used as updates in the next iteration to re-calculate the E-step and the M-step. Since $(\beta^{(k)}, \psi^{(k)})$ is the maximizer at iteration $k$
,

$$Q(\beta^{(k)}, \psi^{(k)} \mid \beta^{(k-1)}, \psi^{(k-1)}) \ge Q(\beta^{(k-1)}, \psi^{(k-1)} \mid \beta^{(k-1)}, \psi^{(k-1)}), \qquad (2.11)$$

and it follows that the likelihood increases (or at worst stays the same) from one iteration to the next:

$$\log L(\beta^{(k)}, \psi^{(k)}; y) = Q(\beta^{(k)}, \psi^{(k)} \mid \beta^{(k-1)}, \psi^{(k-1)}) - E\left[\log h(u \mid y; \beta^{(k)}, \psi^{(k)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]$$
$$\ge Q(\beta^{(k-1)}, \psi^{(k-1)} \mid \beta^{(k-1)}, \psi^{(k-1)}) - E\left[\log h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]$$
$$= \log L(\beta^{(k-1)}, \psi^{(k-1)}; y).$$

Here we used (2.11) and the fact that

$$E\left[\log h(u \mid y; \beta^{(k)}, \psi^{(k)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] - E\left[\log h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]$$
$$= E\left[\log\left(h(u \mid y; \beta^{(k)}, \psi^{(k)}) \big/ h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})\right) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]$$
$$\le \log\left(E\left[h(u \mid y; \beta^{(k)}, \psi^{(k)}) \big/ h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right]\right) = 0,$$

where the inequality in the last step derives from Jensen's inequality. Under regularity conditions (Wu, 1983) and some initial starting values $(\beta^{(0)}, \psi^{(0)})$, the sequence of estimates $\{(\beta^{(k)}, \psi^{(k)})\}$ converges to the maximum likelihood estimators $(\hat\beta, \hat\psi)$.

The EM algorithm is most useful if replacing the calculation of the integral in the marginal likelihood (2.9) by the calculation of the integral in the Q-function (2.10) simplifies computation and maximization. Unfortunately, for GLMMs the integrals in (2.10) are also intractable, since the conditional density of $u \mid y$ involves the integral in (2.9). However, the EM algorithm may still be used by approximating the expectation in the E-step with appropriate Monte Carlo methods. The resulting algorithm is called the Monte Carlo EM algorithm (MCEM) and was proposed by Wei and Tanner (1990). We review it in detail in Section 2.3.
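The core Monte Carlo idea can be illustrated on a toy expectation whose exact value is known. The following sketch (hypothetical numbers, not from any data set in this dissertation) approximates $E[\exp(U)]$ for $U \sim N(0,1)$ by an average over simulated draws, exactly as MCEM approximates the intractable E-step expectation by an average over draws of the random effects.

```python
import numpy as np

# Toy illustration of the Monte Carlo idea behind MCEM: replace an
# intractable expectation by an average over simulated draws. Here
# U ~ N(0, 1) and g(u) = exp(u), so the exact value E[exp(U)] = exp(1/2)
# is available for comparison.
rng = np.random.default_rng(42)

m = 200_000                      # Monte Carlo sample size
u = rng.standard_normal(m)       # draws playing the role of the random effects
mc_estimate = np.exp(u).mean()   # Monte Carlo approximation of E[exp(U)]

exact = np.exp(0.5)
print(mc_estimate, exact)        # the two values agree closely for large m
```

The accuracy improves at the usual $O(m^{-1/2})$ Monte Carlo rate, which is why the sample size $m$ is grown across MCEM iterations, as discussed in Section 2.3.3.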
Some arguments favoring the use of the MCEM algorithm over direct methods such as simulated maximum likelihood for fitting GLMMs, especially when some variance components in $\psi$ are large, are given in Jank and Booth (2003) and Booth et al. (2001). Currently, the only available software for fitting GLMMs uses direct methods such as Gauss-Hermite quadrature or simulated maximum likelihood (e.g., SAS's proc nlmixed). State space models of Section 2.1.2 are also fitted via simulated maximum likelihood. This is discussed in Section 2.2.3.

2.2.2 Model Fitting in a Bayesian Framework

In a Bayesian context, GLMMs are two-stage hierarchical models with appropriate priors on $\beta$ and $\psi$. Instead of obtaining maximum likelihood estimates of unknown parameters, a Bayesian analysis looks at their entire posterior distributions given the observed data. Markov chain Monte Carlo techniques avoid the tedious integrations in the posterior densities and allow for relatively easy simulation from these distributions, compared with the problems encountered in maximum likelihood estimation. This suggests approximating maximum likelihood estimates via a Bayesian route, assuming improper or at least very diffuse priors and exploiting the proportionality of the likelihood function and the posterior distribution of the parameters. However, for many discrete data models improper priors may lead to improper posteriors.
Natarajan and McCulloch (1995) demonstrate this with GLMMs for correlated binary data, assuming independent $N(0, \sigma^2)$ random effects and a flat or a noninformative prior for $\sigma^2$. Sun, Tsutakawa and Speckman (1999) and Sun, Speckman and Tsutakawa (2000) show that with noninformative (flat) priors on fixed effects and variance components of more complicated random effects distributions, propriety of the posterior distribution cannot be guaranteed for a Poisson GLMM when one of the observed counts is zero, and is impossible in a logit link GLMM for binomial$(n, \pi)$ observations if just one of the observations is equal to $0$ or $n$. Of course, the use of proper priors will always
lead to proper posteriors. However, for the often employed diffuse but proper priors, Natarajan and McCulloch (1998) show that even with enormous simulation sizes, posterior estimates (such as the posterior mode) can be far away from maximum likelihood estimates, which makes their use undesirable in a frequentist setting.

2.2.3 Maximum Likelihood Estimation for State Space Models

The same problems as in the GLMM case arise for maximum likelihood fitting of non-Gaussian state space models. Here we review a simulated maximum likelihood approach suggested by Durbin and Koopman (1997), using notation introduced in Section 2.1.2. Let $p(y \mid \alpha; \psi) = \prod_t p(y_t \mid \alpha_t; \psi)$ denote the distribution of all observations given the states, and let $p(\alpha; \psi)$ denote the distribution of the states, where $y$ and $\alpha$ are the stacked vectors of all observations and all states, respectively. The vector $\psi$ holds parameters that may appear in $w_t$, $T_t$ and $Q_t$. Let $p(y, \alpha; \psi)$ denote the joint density of observations and states. For practical purposes it is easier to work with the signal $\theta_t$ instead of the high-dimensional state vector $\alpha_t$. Hence, let $p(y \mid \theta; \psi)$, $p(\theta; \psi)$ and $p(y, \theta; \psi)$ denote the corresponding conditional, marginal and joint distributions, parameterized in terms of the signal $\theta_t = w_t'\alpha_t$, $t = 1, \ldots, T$, where $\theta = (\theta_1, \ldots, \theta_T)'$. The observed likelihood is then given by the integral

$$L(\psi; y) = \int p(y \mid \theta; \psi)\, p(\theta; \psi)\, d\theta. \qquad (2.12)$$

To maximize (2.12) with respect to $\psi$, Durbin and Koopman (1997, 2000) first calculate the likelihood $L_g(\psi; y)$ for an approximating Gaussian model and then obtain the true likelihood $L(\psi; y)$ by an adjustment to it. However, two different approaches of how to construct the approximating Gaussian model are presented in the two papers. In Durbin and Koopman (1997), the approximating model is obtained by assuming that observations follow a linear Gaussian model
with $\epsilon_t \sim N(\mu_t, \sigma_t^2)$. All densities generated under this model are denoted by $g(\cdot)$. The two parameters $\mu_t$ and $\sigma_t^2$ are chosen such that the true density $p(y \mid \theta; \psi)$ and its normal approximation $g(y \mid \theta; \psi)$ are as close as possible in the neighborhood of the posterior mean $E_g[\theta \mid y]$. The state equations of the true non-Gaussian model and the Gaussian approximating model are assumed to be the same, which implies that the marginal density of $\theta$ is the same under both models, i.e., $p(\theta; \psi) = g(\theta; \psi)$. The likelihood of the approximating model is then given by

$$L_g(\psi; y) = g(y; \psi) = \frac{g(y, \theta; \psi)}{g(\theta \mid y; \psi)} = \frac{g(y \mid \theta; \psi)\, p(\theta; \psi)}{g(\theta \mid y; \psi)}. \qquad (2.13)$$

This likelihood is calculated using a recursive procedure known as the Kalman filter (see, for instance, Fahrmeir and Tutz, 2001, Chap. 8). Alternatively, the approximating Gaussian model is a regular linear mixed model, and maximum likelihood calculations can be carried out using more familiar algorithms from the linear mixed model literature (see, for instance, Verbeke and Molenberghs, 2000). From (2.13),

$$p(\theta; \psi) = \frac{L_g(\psi; y)\, g(\theta \mid y; \psi)}{g(y \mid \theta; \psi)},$$

and upon plugging this into (2.12),

$$L(\psi; y) = L_g(\psi; y)\, E_g\!\left[\frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\right], \qquad (2.14)$$

where $E_g$ denotes expectation with respect to the Gaussian density $g(\theta \mid y; \psi)$ generated by the approximating model. Hence the observed likelihood of the non-Gaussian model can be estimated by the likelihood of an approximating Gaussian model times an adjustment factor; in particular,

$$\hat L(\psi; y) = L_g(\psi; y)\, w(\psi),$$
where

$$w(\psi) = \frac{1}{m} \sum_{j=1}^{m} \frac{p(y \mid \theta^{(j)}; \psi)}{g(y \mid \theta^{(j)}; \psi)}$$

is a Monte Carlo sum approximating the expected value $E_g$, with $m$ random samples $\theta^{(j)}$ from $g(\theta \mid y; \psi)$. Normality of $g(\theta \mid y; \psi)$ allows for straightforward simulation from this density.

A different approach for choosing the approximating Gaussian model is presented in Durbin and Koopman (2000). There, the model is determined by choosing $\sigma_t^2$ and $Q_t$ of an approximating Gaussian state space model (2.6) such that the posterior densities $g(\theta \mid y; \psi)$ implied by the Gaussian model and $p(\theta \mid y; \psi)$ implied by the true model have the same posterior mode $\hat\theta$. Formally, by dividing and multiplying (2.12) by the importance density $g(\theta \mid y; \psi)$, we can interpret approximation (2.14) as an importance sampling estimate of the observed likelihood, and the entire procedure as a simulated maximum likelihood approach:

$$L(\psi; y) = \int p(y \mid \theta; \psi)\, \frac{p(\theta; \psi)}{g(\theta \mid y; \psi)}\, g(\theta \mid y; \psi)\, d\theta$$
$$= \int g(y; \psi)\, \frac{p(y \mid \theta; \psi)\, p(\theta; \psi)}{g(y, \theta; \psi)}\, g(\theta \mid y; \psi)\, d\theta$$
$$= \int g(y; \psi)\, \frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\, g(\theta \mid y; \psi)\, d\theta$$
$$= L_g(\psi; y)\, E_g\!\left[\frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\right],$$

where the third equality uses $p(\theta; \psi) = g(\theta; \psi)$ and $g(y, \theta; \psi) = g(y \mid \theta; \psi)\, g(\theta; \psi)$. Durbin and Koopman (1997, 2000) present a clever way of artificially enlarging the simulated sample of $\theta^{(j)}$'s from the importance density $g(\theta \mid y; \psi)$ by the use of antithetic variables (Robert and Casella, 1999). These quadruple the sample size without additional simulation effort and balance the sample for location and scale. Overall this leads to a reduction in the total sample size necessary to achieve a certain precision in the estimates.
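As an illustration of the variance-reduction idea, the following minimal sketch implements location-only antithetics (a simplification of Durbin and Koopman's scheme, which also balances for scale, and applied here to a generic toy expectation rather than a state space model): each normal draw $u$ is paired with its mirror image $-u$.

```python
import numpy as np

# Minimal sketch of location antithetics: pair each draw u with -u so the
# simulated sample is exactly balanced for location. Both estimators below
# target E[exp(U)] = exp(1/2) for U ~ N(0, 1); repeating the experiment
# shows the antithetic version has the smaller sampling variance here,
# because exp is monotone and Cov(exp(U), exp(-U)) < 0.
rng = np.random.default_rng(0)

def plain(m):
    u = rng.standard_normal(m)
    return np.exp(u).mean()

def antithetic(m):
    u = rng.standard_normal(m // 2)
    pairs = np.concatenate([u, -u])     # balanced sample of the same size m
    return np.exp(pairs).mean()

reps_plain = [plain(1000) for _ in range(500)]
reps_anti = [antithetic(1000) for _ in range(500)]
print(np.var(reps_plain), np.var(reps_anti))
```

The same device applies to the Monte Carlo sum $w(\psi)$ above: the mirrored draws cost nothing beyond a sign flip, yet tighten the estimate of the likelihood adjustment factor.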
In practice, it is desirable in the maximization process to work with $\log L(\psi; y)$. Durbin and Koopman (1997, 2000) present a correction for the bias introduced by estimating $\log\left(E_g\left[p(y \mid \theta; \psi)/g(y \mid \theta; \psi)\right]\right)$. Finally, the resulting estimator of $\log L(\psi; y)$ can be maximized with respect to $\psi$ by a suitable numerical procedure, such as Newton-Raphson.

We mentioned before that simulated maximum likelihood can be computationally inefficient and suboptimal, especially when some variance components are large (Jank and Booth, 2003). As we will see in various examples in Chapter 5, large variance components (e.g., a large random effects variance) are the norm rather than the exception with the type of time series models we consider. Next, we will look at an alternative, indirect method for fitting our models. In principle, though, the methods just described are also applicable to GLMMs, through the close connections of GLMMs and state space models described above.

2.3 The Monte Carlo EM Algorithm

In Section 2.2 we presented the EM algorithm as an iterative procedure consisting of two components, the E- and the M-step. The E-step calculates a conditional expectation, while the M-step subsequently maximizes this expectation. Often at least one of these steps is analytically intractable, and in most of the applications considered here both steps are. Numerical methods (analytic and stochastic) have to be used to overcome these difficulties, whereby the E-step usually is the more troublesome. One popular way of approximating the expected value in the E-step uses Monte Carlo methods and is discussed in Wei and Tanner (1990), McCulloch (1994, 1997) and Booth and Hobert (1999). The Monte Carlo EM (MCEM) algorithm uses a sample from the distribution of the random effects $u$ given the observed data $y$ to approximate the Q-function in (2.10). In particular, at iteration $k$, let $u^{(1)}, \ldots, u^{(m)}$ be a sample from this distribution, denoted by $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$ and evaluated at the parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)}$
from the previous iteration. The approximation to (2.10) is then given by

$$Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) = \frac{1}{m} \sum_{j=1}^{m} \left[\log f(y \mid u^{(j)}; \beta) + \log g(u^{(j)}; \psi)\right]. \qquad (2.15)$$

As $m \to \infty$, $Q_m \to Q$ with probability one. The M-step then maximizes $Q_m$ instead of $Q$ with respect to $\beta$ and $\psi$, and the resulting estimates $\beta^{(k)}$ and $\psi^{(k)}$ are used in the next iteration to generate a new sample from $h(u \mid y; \beta^{(k)}, \psi^{(k)})$. If maximization is not possible in closed form, sometimes only a pair of values $(\beta, \psi)$ which satisfies

$$Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) \ge Q_m(\beta^{(k-1)}, \psi^{(k-1)} \mid \beta^{(k-1)}, \psi^{(k-1)}),$$

but which does not attain the global maximum, is chosen as the new parameter update $(\beta^{(k)}, \psi^{(k)})$. However, we show for our models that the global maximum can be approximated in very few steps. Maximization of $Q_m$ with respect to $\beta$ and $\psi$ is equivalent to maximizing the first term in (2.15) with respect to $\beta$ only and the second term with respect to $\psi$ only. This is due to the two-stage hierarchy of the response distribution and the random effects distribution in GLMMs and is discussed next. Different approaches to obtaining a sample from $h(u \mid y; \beta, \psi)$ for the approximation of the E-step are presented in Section 2.3.2, and convergence criteria are discussed in Section 2.3.3.

2.3.1 Maximization of $Q_m$

For now we assume we have available a sample $u^{(1)}, \ldots, u^{(m)}$ from $h(u \mid y; \beta, \psi)$, or from an importance sampling distribution, generated by one of the mechanisms described in Sections 2.3.2.1 to 2.3.2.3. Let $Q_m^1$ and $Q_m^2$ be the first and second term of the sum in (2.15). Using the exponential family expression for the densities $f(y_{it} \mid u_i)$, at iteration $k$,

$$Q_m^1(\beta \mid \beta^{(k-1)}) \propto \sum_{j=1}^{m} \sum_{i=1}^{n} \sum_{t=1}^{n_i} \left[y_{it}\, \theta_{it}^{(j)} - b(\theta_{it}^{(j)})\right], \qquad (2.16)$$
where, according to the GLMM specifications, $\theta_{it}^{(j)}$ is the canonical parameter implied by the linear predictor $x_{it}'\beta + z_{it}'u_i^{(j)}$, with $u_i^{(j)}$ the $i$th component of the $j$th sampled random effects vector $u^{(j)}$. Maximizing $Q_m^1$ with respect to $\beta$ is equivalent to fitting an augmented GLM with known offsets: for $j = 1, \ldots, m$, let $y_{it}^{(j)} = y_{it}$ and $x_{it}^{(j)} = x_{it}$ be the random components and known design vectors for this augmented GLM, and let $z_{it}'u_i^{(j)}$ be a known offset associated with each $y_{it}^{(j)}$. That is, we duplicate the original data set $m$ times and attach a known offset $z_{it}'u_i^{(j)}$ to each replicated observation. The model for the mean in the augmented GLM,

$$E[Y_{it}^{(j)}] = \mu_{it}^{(j)} = h^{-1}(x_{it}'\beta + z_{it}'u_i^{(j)}),$$

is structurally equivalent to the model for the mean in the GLMM. Then the log-likelihood equations for estimating $\beta$ in the augmented GLM are proportional to $Q_m^1$. Hence maximization of $Q_m^1$ with respect to $\beta$ follows along the lines of the well-known iterative Newton-Raphson or Fisher scoring algorithms for GLMs. Denote by $\beta^{(k)}$ the parameter vector after convergence of one of these algorithms. It represents the value of the maximum likelihood estimator of $\beta$ at iteration $k$ of the MCEM algorithm.

The expression for $Q_m^2$ depends on the assumed random effects distribution. Most generally, let $\Sigma$ be an unstructured $q \times q$ covariance matrix for the random effects vector $u = (u_1', \ldots, u_n')'$, where $q = \sum_{i=1}^{n} n_i$ and $n_i$ is the dimension of each cluster-specific random effect $u_i$. Then, assuming $u$ has a mean zero multivariate normal distribution $g(u; \psi)$, where $\psi$ holds the $q(q+1)/2$ distinct elements of $\Sigma$, $Q_m^2$ has form

$$Q_m^2 \propto \sum_{j=1}^{m} \left[-\log|\Sigma| - u^{(j)\prime}\,\Sigma^{-1} u^{(j)}\right].$$

The goal is to maximize $Q_m^2$ with respect to the variance components $\psi$ of $\Sigma$. For a general $\Sigma$, the maximum is obtained at the variance components of the sample
covariance matrix $S_m = \frac{1}{m}\sum_{j=1}^{m} u^{(j)} u^{(j)\prime}$. Denoting these by $\psi^{(k)}$ gives the value of the maximum likelihood estimator of $\psi$ at iteration $k$ of the MCEM algorithm. The simplest structure occurs when the random effects $u_i$ have independent components and are i.i.d. across all clusters, where $g(u; \psi)$ is then the product of $n$ $N(0, \sigma^2 I)$ densities and $\psi = \sigma$. $Q_m^2$ at iteration $k$ is then maximized at

$$\sigma^{(k)} = \left(\frac{1}{mq}\sum_{j=1}^{m} u^{(j)\prime} u^{(j)}\right)^{1/2}.$$

Many applications of GLMMs use this simple structure of i.i.d. random effects, where often $u_i$ is a univariate random intercept. In this case, the estimate of $\sigma$ at iteration $k$ reduces to

$$\sigma^{(k)} = \left(\frac{1}{mn}\sum_{j=1}^{m}\sum_{i=1}^{n} u_i^{(j)2}\right)^{1/2}.$$

In Chapter 3 we will drop the assumption of independence and look at correlated random effects, but with more parsimonious covariance structures than the most general case presented here. Maximization of $Q_m^2$ with respect to $\psi$ will be presented there on a case-by-case basis.

2.3.2 Generating Samples from $h(u \mid y; \beta, \psi)$

So far we assumed we had available a sample $u^{(1)}, \ldots, u^{(m)}$ to approximate the expected value in the E-step of the MCEM algorithm. This section describes how to generate such a sample from $h(u \mid y; \beta, \psi)$, which is only known up to a normalizing constant, or from an importance density. In the following we will suppress the dependency on the parameters $\beta$ and $\psi$, since the densities are always evaluated at their current values. Three methods are presented: the accept-reject algorithm produces independent samples, while Metropolis-Hastings algorithms produce dependent samples. A detailed description of all three methods can be found in Robert and Casella (1999).

2.3.2.1 Accept-reject sampling in GLMMs

In general, for accept-reject sampling we need to find a candidate density $g$ and a constant $M$ such that for the density of interest $h$ (the target density), $h(x) \le M g(x)$ holds for all $x$ in the support of $h$. The algorithm is then to

1. generate $x \sim g$ and $w \sim$ Uniform$[0, 1]$;
2. accept $x$ as a random sample from $h$ if $w \le \dfrac{h(x)}{M g(x)}$;

3. return to 1 otherwise.

This will produce one random sample $x$ from the target density $h$. The probability of acceptance is given by $1/M$, and the expected number of trials until a variable is accepted is $M$. For our purpose, the target density is $h(u \mid y)$. Since

$$h(u \mid y) = \frac{f(y \mid u)\, g(u)}{a} \le M g(u),$$

where $M = \sup_u f(y \mid u)/a$ and $a$ is an unknown normalizing constant equal to the marginal likelihood, the multivariate normal random effects distribution $g(u)$ can be used as a candidate density. Booth and Hobert (1999, Sect. 4.1) show that for certain models $\sup_u f(y \mid u)$ can be easily calculated from the data alone and thus need not be updated at every iteration. For some models we discuss here, the condition of Booth and Hobert (1999, Sect. 4.1, page 272) required for this simplification does not hold. However, the likelihood of a saturated GLM is always an upper bound for $f(y \mid u)$. To illustrate, regard $L(u) = f(y \mid u)$ as the likelihood corresponding to a GLM with random components $y_{it}$ and linear predictor $\eta_{it} = z_{it}'u_i + x_{it}'\beta$, where now $x_{it}'\beta$ plays the role of a known offset and the $u_i$ are the parameters of interest. The maximized likelihood $L(\hat u)$ for this model is always less than the maximized likelihood $L(y)$ of a saturated model. Hence $\sup_u f(y \mid u) \le L(\hat u) \le L(y)$, and $L(\hat u)$ or $L(y)$ can be used to construct $M$.

Example: In Section 3.1 we consider a data set where, conditional on a random effect $u_t$, the $t$th observation in group $i$, $y_{it}$, is modeled as a Binomial$(n_{it}, \pi_{it})$ random variable. There are 16 time points, i.e., $t = 1, \ldots, 16$, and two groups, $i = 1, 2$. A very simple logistic-normal GLMM for these data has form

$$\mathrm{logit}(\pi_{it}(u_t)) = \alpha + \beta x_i + u_t,$$

where $x_i$ is a binary group indicator. The overall design matrix for
this problem is the $32 \times 18$ matrix

$$X = \begin{pmatrix} \mathbf{1}_{16} & \mathbf{0}_{16} & I_{16} \\ \mathbf{1}_{16} & \mathbf{1}_{16} & I_{16} \end{pmatrix},$$

where $\mathbf{1}_{16}$ and $\mathbf{0}_{16}$ are vectors of ones and zeros, $I_{16}$ is the identity matrix, and the columns hold the coefficients corresponding to $\alpha, \beta, u_1, u_2, \ldots, u_{16}$. All rows of this matrix are different and, as a consequence, the condition of Booth and Hobert (1999, Sect. 4.1, page 272) does not hold. However, the saturated binomial likelihood $L(y)$ is an upper bound for $f(y \mid u)$, i.e.,

$$\sup_u f(y \mid u) \le L(y).$$

For instance, with the logistic-normal example from above, with linear predictor $\eta_{it} = c + u_t$, where $c = \alpha + \beta x_i$ represents the fixed part of the model, we have

$$f(y_{it} \mid u_t) = \binom{n_{it}}{y_{it}} \left(\frac{e^{c+u_t}}{1+e^{c+u_t}}\right)^{y_{it}} \left(\frac{1}{1+e^{c+u_t}}\right)^{n_{it}-y_{it}}.$$

By first taking logs and then finding first and second derivatives with respect to $u_t$, we see that $u_t^* = \log\left(\frac{y_{it}/n_{it}}{1 - y_{it}/n_{it}}\right) - c$ maximizes this expression for $0 < y_{it} < n_{it}$. Plugging in, we obtain the result

$$\sup_{u_t} f(y_{it} \mid u_t) = \binom{n_{it}}{y_{it}} \left(\frac{y_{it}}{n_{it}}\right)^{y_{it}} \left(1 - \frac{y_{it}}{n_{it}}\right)^{n_{it}-y_{it}}.$$

For the special cases of $y_{it} = 0$ or $y_{it} = n_{it}$, the trivial bound on $f(y_{it} \mid u_t)$ is 1. Hence the following inequality, which immediately follows from the above, can be used in constructing the accept-reject algorithm for a logistic-normal model with linear
predictor of form $\eta_{it} = c + u_t$:

$$\sup_u f(y \mid u) = \sup_u \prod_{i}\prod_{t} \binom{n_{it}}{y_{it}} \left(\frac{e^{\eta_{it}}}{1+e^{\eta_{it}}}\right)^{y_{it}} \left(\frac{1}{1+e^{\eta_{it}}}\right)^{n_{it}-y_{it}} \le \prod_{i}\prod_{t} \binom{n_{it}}{y_{it}} \left(\frac{y_{it}}{n_{it}}\right)^{y_{it}} \left(1 - \frac{y_{it}}{n_{it}}\right)^{n_{it}-y_{it}} = L(y).$$

This means we can select $M = L(y)$ to meet the accept-reject condition, and consequently we accept a sample $u$ from $g(u)$ if, for $w \sim$ Uniform$[0, 1]$,

$$w \le \frac{h(u \mid y)}{M g(u)} = \frac{f(y \mid u)}{L(y)}.$$

Notice that this condition is free of the normalizing constant $a$. In practice, especially for high-dimensional random effects, $M$ can be very large and therefore we almost never accept a sample. Two alternative methods described below may avoid this problem. Note, however, that the accept-reject method yields an independent and identically distributed sample from the target distribution. This is important if one wants to implement an automated MCEM algorithm (Booth and Hobert, 1999), where the Monte Carlo sample size $m$ is increased automatically as the algorithm progresses to adjust for the error in the Monte Carlo approximation to the E-step.

2.3.2.2 Markov chain Monte Carlo methods

For high-dimensional distributions $h(u \mid y)$, which are unavoidable if correlated random effects are used, accept-reject methods can be very slow. An alternative is to generate a Markov chain with invariant distribution $h(u \mid y)$, which may be much faster but results in dependent samples. McCulloch (1997) discussed a Metropolis-Hastings algorithm for creating such a chain for the logistic-normal regression case. In general, an independence Metropolis-Hastings algorithm is built as follows: choose a candidate density $g(u)$ with the same support as $h(u \mid y)$. Then, for a current state $u^{(j-1)}$,
1. generate $w \sim g$;

2. set $u^{(j)}$ equal to $w$ with probability

$$p = \min\left(1,\; \frac{h(w \mid y)/h(u^{(j-1)} \mid y)}{g(w)/g(u^{(j-1)})}\right),$$

and equal to $u^{(j-1)}$ with probability $1 - p$.

After a sufficient burn-in time, the states of the generated chain can be regarded as a (dependent) sample from $h(u \mid y)$. If the candidate density is chosen to be the density of the random effects, $g(u)$, the acceptance probability in step 2 reduces to the simple form $\min\left(1, f(y \mid w)/f(y \mid u^{(j-1)})\right)$. To further speed up simulations, McCulloch (1997) uses a random scan algorithm which only updates the $k$th component of the previous state $u^{(j-1)}$ and, upon acceptance in step 2, uses it as the new state.

Another popular MCMC algorithm is the Gibbs sampler. Let $u^{(j-1)} = (u_1^{(j-1)}, \ldots, u_n^{(j-1)})$ denote the current state of a Markov chain with invariant distribution $h(u \mid y)$. One iteration of the Gibbs sampler generates, componentwise,

$$u_1^{(j)} \sim h(u_1 \mid u_2^{(j-1)}, \ldots, u_n^{(j-1)}, y)$$
$$u_2^{(j)} \sim h(u_2 \mid u_1^{(j)}, u_3^{(j-1)}, \ldots, u_n^{(j-1)}, y)$$
$$\vdots$$
$$u_n^{(j)} \sim h(u_n \mid u_1^{(j)}, \ldots, u_{n-1}^{(j)}, y),$$

where the $h(u_i \mid u_1^{(j)}, \ldots, u_{i-1}^{(j)}, u_{i+1}^{(j-1)}, \ldots, u_n^{(j-1)}, y)$ are the so-called full conditionals of $h(u \mid y)$. The vector $u^{(j)} = (u_1^{(j)}, \ldots, u_n^{(j)})$ represents the new state of the chain and, after a sufficient burn-in time, can be regarded as a sample from $h(u \mid y)$. The advantage of the Gibbs sampler is that it reduces sampling of a possibly very high-dimensional vector $u$ to sampling of several lower-dimensional components of $u$. We will use the Gibbs sampler in connection with autoregressive random effects to simplify sampling from an initially very high-dimensional distribution $h(u \mid y)$ by sampling from its simpler full univariate conditionals.
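The independence Metropolis-Hastings sampler with the random-effects density as candidate can be sketched for a hypothetical one-parameter logistic-normal setting (all numbers below are illustrative assumptions, not data from this dissertation): $y \sim$ Binomial$(n, \pi(u))$ with $\mathrm{logit}(\pi(u)) = u$ and $u \sim N(0, s^2)$, so the acceptance probability reduces to $\min(1, f(y \mid w)/f(y \mid u_{\text{current}}))$.

```python
import numpy as np

# Independence Metropolis-Hastings with the random-effects prior as candidate,
# for a toy logistic-normal model (hypothetical data: 7 successes in 10 trials).
# The proposal density cancels, so only the likelihood ratio appears.
rng = np.random.default_rng(1)

y, n, s = 7, 10, 1.5

def loglik(u):                 # binomial log f(y | u), constants dropped
    p = 1.0 / (1.0 + np.exp(-u))
    return y * np.log(p) + (n - y) * np.log(1.0 - p)

def sample_posterior(n_iter=20_000, burn_in=2_000):
    u, chain = 0.0, []
    for _ in range(n_iter):
        w = s * rng.standard_normal()          # candidate drawn from the prior
        if np.log(rng.uniform()) < loglik(w) - loglik(u):
            u = w                              # accept; otherwise keep u
        chain.append(u)
    return np.array(chain[burn_in:])           # discard the burn-in states

chain = sample_posterior()
print(chain.mean())    # posterior mean: shrunk from logit(0.7) toward 0
```

The retained states are a dependent sample from $h(u \mid y)$; their mean lies between the data value $\mathrm{logit}(0.7) \approx 0.85$ and the prior mean 0, reflecting the shrinkage induced by the random-effects distribution.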
2.3.2.3 Importance sampling

An importance sampling approximation to the Q-function in (2.10) is given by

$$Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) = \sum_{j=1}^{m} w_j \left[\log f(y \mid u^{(j)}; \beta) + \log g(u^{(j)}; \psi)\right],$$

where the $u^{(j)}$ are independent samples from an importance density $\tilde g(u; \beta^{(k-1)}, \psi^{(k-1)})$ and

$$w_j = \frac{f(y \mid u^{(j)}; \beta^{(k-1)})\, g(u^{(j)}; \psi^{(k-1)})}{a\, \tilde g(u^{(j)}; \beta^{(k-1)}, \psi^{(k-1)})}$$

are importance weights at iteration $k$. Usually $Q_m$ is divided by the sum of the importance weights $\sum_{j=1}^{m} w_j$. The normalizing constant $a$ only depends on the known parameters $(\beta^{(k-1)}, \psi^{(k-1)})$ and hence plays no part in the following maximization step. Selecting the importance density $\tilde g$ is a delicate issue: it should be easy to simulate from but also resemble $h(u \mid y)$ as closely as possible. Booth and Hobert (1999) suggest a Student $t$ density as the importance distribution, whose mean and variance match those of $h(u \mid y)$ and are derived via a Laplace approximation.

2.3.3 Convergence Criteria

Due to the stochastic nature of the algorithm, parameter estimates of two successive iterations can be close together just by chance, although convergence is not yet achieved. To reduce the risk of stopping prematurely, we declare convergence if the relative change in parameter estimates is less than some $\epsilon_1$ for $c$ (e.g., five) consecutive iterations. Let $\lambda^{(k)} = (\beta^{(k)\prime}, \psi^{(k)\prime})'$ be the vector of unknown fixed effects parameters and variance components. Then this condition means that

$$\max_i \frac{\left|\lambda_i^{(k)} - \lambda_i^{(k-1)}\right|}{\left|\lambda_i^{(k-1)}\right|} < \epsilon_1 \qquad (2.17)$$

has to be fulfilled for $c$ consecutive (e.g., five) values of $k$. For any $\lambda_i^{(k)}$, an exception to this rule occurs when the estimated standard error of that parameter is substantially larger than the change from one iteration to the next. Hence, at iteration $k$, for
those parameters satisfying

$$\left|\lambda_i^{(k)} - \lambda_i^{(k-1)}\right| < \epsilon_2 \sqrt{\widehat{\mathrm{var}}\big(\lambda_i^{(k)}\big)},$$

where $\widehat{\mathrm{var}}(\lambda_i^{(k)})$ is the current estimate of the variance of the MLE $\hat\lambda_i$, the relative precision criterion (2.17) need not be met. An estimate of this variance can be obtained from the observed information matrix of the ML estimator of $\lambda$. Louis (1982) showed that the observed information matrix can be written in terms of the first ($l'$) and second ($l''$) derivatives of the complete data log-likelihood $l(\lambda; y, u) = \log f(y, u; \lambda)$. Evaluated at the MLE $\hat\lambda$, it is given by

$$I(\hat\lambda) = -E_{u \mid y}\left[l''(\hat\lambda; y, u) \mid y\right] - \mathrm{var}_{u \mid y}\left[l'(\hat\lambda; y, u) \mid y\right].$$

An approximation to this matrix at iteration $k$ uses Monte Carlo sums with draws from $h(u \mid y; \lambda^{(k)})$ from the current iteration of the MCEM algorithm.

To further safeguard against stopping prematurely, we use a third convergence criterion based on the $Q_m$ function. For deterministic EM, the Q-function is guaranteed to increase from iteration to iteration. With MCEM, because of the stochastic approximation nature, $Q_m^{(k)}$ can be less than $Q_m^{(k-1)}$ because of an unlucky Monte Carlo sample at iteration $k$. Hence the parameter estimates obtained from maximizing $Q_m^{(k)}$ can be a step in the wrong direction and actually decrease the value of the likelihood. To counter this, we declare convergence only if successive values of $Q_m^{(k)}$ are within a small neighborhood. More importantly, however, we accept the $k$th parameter update $\lambda^{(k)}$ only if the relative change in the $Q_m$ function is larger than some small negative constant $-\epsilon_3$, i.e.,

$$\frac{Q_m^{(k)} - Q_m^{(k-1)}}{\left|Q_m^{(k-1)}\right|} > -\epsilon_3. \qquad (2.18)$$

If at iteration $k$ (2.18) is not met, there is reason to believe that $\lambda^{(k)}$ decreases the likelihood and is worse than the parameter update from the previous iteration, so
we repeat the $k$th iteration with a new and larger Monte Carlo sample. Thereby we hope to better approximate the Q-function and, as a result, get a better estimate $\lambda^{(k)}$ with a $Q_m$-function value larger than the previous one. If this does not happen, we nevertheless accept $\lambda^{(k)}$ and proceed to the next iteration, possibly letting the algorithm temporarily move in the direction of a lower likelihood region. Otherwise the Monte Carlo sample size quickly grows without bounds at an early stage of the algorithm. Furthermore, at early stages the Monte Carlo error in the approximation of the Q-function can be large, and hence its trace plot is very volatile. Caffo, Jank and Jones (2003) go a step further and calculate asymptotic confidence intervals for the change in the $Q_m$-function, based on which they construct a rule for accepting or rejecting $\lambda^{(k)}$. They discuss schemes for how to increase the Monte Carlo sample size accordingly, and their MCEM algorithm inherits the ascent property of EM with high probability. However, we feel that the simpler criterion (2.18) suffices for the examples considered here.

Coupled with any convergence criterion is the question of the updating scheme for the Monte Carlo sample size $m$ between iterations. In general we will use $m^{(k)} = a\, m^{(k-1)}$, where $a > 1$ and $m^{(k)}$ is the Monte Carlo sample size at iteration $k$. At early iterations, $m^{(k)}$ will be low, since big parameter jumps are expected regardless of the quality of the approximation and the Monte Carlo error associated with it. Later, as more weight is put on decreasing the Monte Carlo error in the approximations, this steady increase guarantees sufficiently large Monte Carlo samples. Furthermore, condition (2.18) signals when an additional boost in $m^{(k)}$ is needed to better approximate the Q-function in a given iteration. Hence, whenever (2.18) is not met, we rerun iteration $k$ with a bigger sample size $q\, m^{(k)}$, where $q > 1$ is usually between 1 and 2.
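The updating rule just described might be coded as follows (a sketch; the function name is ours, and the default values of $a$ and $q$ are illustrative, chosen to match the magnitudes used for the boat race example below):

```python
import math

# Sketch of the Monte Carlo sample-size updating rule: geometric growth
# between iterations (a > 1), with an extra boost factor q applied when
# criterion (2.18) fails and the iteration is rerun with a larger sample.
def next_sample_size(m_prev, a=1.03, rerun=False, q=1.05):
    m = math.ceil(a * m_prev)              # regular update m(k) = a * m(k-1)
    return math.ceil(q * m) if rerun else m

print(next_sample_size(1000))              # regular growth step
print(next_sample_size(1000, rerun=True))  # boosted rerun of the same iteration
```

Rounding up keeps the sample size integral while preserving strict growth, so the sequence $m^{(k)}$ never stalls even for $a$ close to 1.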
[Figure 2.1: two panels, iteration number on the x-axis; left panel shows $m^{(k)}$ rising from 0 to about 8000, right panel shows $Q_m^{(k)}$ rising from about $-274$ to $-264$.]

Figure 2.1: Plot of the typical behavior of the Monte Carlo sample size $m^{(k)}$ and the Q-function $Q_m^{(k)}$ through MCEM iterations. The iteration number is shown on the x-axis. Plots are based on the data and model for the boat race data discussed in Chapter 5.

A typical picture of the Monte Carlo sample size $m^{(k)}$ and the $Q_m^{(k)}$ function through the iterations of an MCEM algorithm is presented in Figure 2.1. The increase in the $Q_m^{(k)}$ function is large at the first iterations, but its Monte Carlo error is also large due to the small Monte Carlo sample size. The plot of the Monte Carlo sample size $m^{(k)}$ shows several jumps, corresponding to the events that the $Q_m^{(k)}$ function actually decreased by more than $\epsilon_3$ from one iteration to the next and we adjusted with an additional boost in generated samples. The data and model on which this plot is based are taken from the boat race example analyzed and discussed in Chapter 5, with convergence criteria set to $\epsilon_1 = 0.001$, $c = 4$, $\epsilon_2 = 0.003$, $\epsilon_3 = 0.005$, $a = 1.03$ and $q = 1.05$.

Fort and Moulines (2003) show that with geometrically ergodic (see, e.g., Robert and Casella, 1999) MCMC samplers, a polynomial increase in the Monte
Carlo sample size leads to convergence of the MCEM parameter estimates. However, establishing geometric ergodicity is not an easy task. Other, more sophisticated and automated Monte Carlo sample size updating schemes are presented by Booth and Hobert (1999) for independent sampling and by Caffo, Jank and Jones (2003) for independent and MCMC sampling.
CHAPTER 3
CORRELATED RANDOM EFFECTS

In Chapter 2 we mentioned on several occasions that for certain data structures the usual assumption of independent random effects is inappropriate. For instance, if clusters represent time points in a study over time, observations from different clusters can no longer be assumed (marginally) independent. Or, in longitudinal studies, the nonnegative and exchangeable correlation structure among repeated observations implied by a single random effect can be far from the truth for long sequences of repeated observations. Section 3.1 presents data from a cross-sectional time series which motivate the use of correlated random effects and discusses their implications. In Sections 3.2 and 3.3 two special correlation structures useful for modeling the dependence structure in discrete repeated measures with possibly unequally spaced observation times are discussed. The main focus of this chapter is on the technical implications for the MCEM algorithm arising from estimating an additional variance (correlation) component. In contrast to models with independent random effects, the M-step has no closed form solution and iterative methods have to be used to find the maximum. Also, because random effects that are correlated a priori are correlated a posteriori, sampling from the posterior distribution of u | y, as required by the MCEM algorithm, is more involved than with independent random effects. A Gibbs sampling approach is developed in Section 3.4. From here on we let t denote the index for the discrete observation times, t = 1, ..., T, and we let Y_it denote a response at time point t for stratum i, i = 1, ..., n. Throughout we will assume univariate but correlated random effects {u_t}_{t=1}^T associated with the observations over time.
3.1 A Motivating Example: Data from the General Social Survey

The basic purpose of the General Social Survey (GSS), conducted by the National Opinion Research Center, is to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors and attributes. It is second only to the census in popularity among sociologists as a data source for conducting research. The GSS questionnaire contains a standard core of demographic and attitudinal variables whose wording is retained throughout the years to facilitate time trend studies (Source: www.norc.uchicago.edu/projects/gensocl.asp). Currently the GSS comprises a total of 24 surveys, conducted in the years 1973-1978, 1980, 1982, 1983-1994, 1996, 1998, 2000 and 2002, with data available online (at www.webapp.icpsr.umich.edu/GSS/) through 1998. Two features, a discrete response variable (most of the attitude questions) observed through time and unequally spaced observation times, make it a prime resource for applying the models proposed in this dissertation. Data obtained from the GSS are different from longitudinal studies, where subjects are followed through time. Here, responses are from independent cross-sectional surveys of different subjects in each year.

One question, included in 16 of the 22 surveys through 1998, recorded attitude towards homosexual relationships. It was observed in the years 1974, 1976-77, 1980, 1982, 1984-85, 1987-1991, 1993-94, 1996 and 1998. We will use these data to motivate and illustrate the use of correlated random effects. Figure 3-1 shows the proportion of respondents who agreed with the statement that homosexual relationships are not wrong at all for the two race cohorts, white respondents and black respondents. For simplicity, in this introductory example only race was chosen as a cross-classifying variable, and attitude was measured as answering "yes" or "no" to the aforementioned question.
Let Y_it denote the number of people in year t and of race i who agreed with the statement that homosexual relationships
are not wrong at all.

Figure 3-1: Sampling proportions from the GSS data set. Proportion of whites (squares) and blacks (circles) agreeing with the statement that homosexual relationships are not wrong at all, from 1974 to 1998.

The index t = 1, ..., 16 runs through the set of 16 years {1974, 1976, 1977, 1980, ..., 1998} mentioned above, and i = 1 for race equal to white and i = 2 for race equal to black. The conditional independence assumption discussed in Section 2.1 allows us to model Y_it, the sum of n_it binary variables (the individual responses), as a binomial variable conditional on a yearly random effect. That is, the probabilistic model we propose assumes a conditional Binomial(n_it, π_it) distribution for each member of the two time series {Y_1t}_{t=1}^16 and {Y_2t}_{t=1}^16 pictured in Figure 3-1. The parameters n_it and π_it are, respectively, the total number and the conditional probability of agreeing with the statement that homosexual relationships are not wrong at all for respondents of race i in year t.

3.1.1 A GLMM Approach

A popular model for Y_it is a logistic-normal model, for which the link function h(·) in (2.2) is the logit link and the random effects structure simplifies to a random intercept u_t. We will assume that the fixed parameter vector β is composed
of an intercept term α, linear and quadratic time effects β_1 and β_2, a race effect β_3 and a year-by-race interaction β_4. With x_1t representing the year variable centered around 1984 (e.g., x_11 = 1974 − 1984 = −10) and x_2i the indicator variable for race (for whites x_21 = 0, for blacks x_22 = 1), the model has form

logit(π_it(u_t)) = α + β_1 x_1t + β_2 x_1t² + β_3 x_2i + β_4 x_1t x_2i + u_t.   (3.1)

Apart from the fixed effects, the random time effect u_t captures the dependency structure over the years. Note that π_it(u_t) is a conditional probability, given the random effect u_t from the year the question was asked. This random effect u_t can be interpreted as the unmeasurable public opinion about homosexual relationships common to all respondents within the same year. By introducing this random effect, we assume that individual opinions are influenced by this overall opinion or the social and political climate on homosexual relationships (like awareness of AIDS and the social spending associated with it, which is hard to measure). Thus, individual responses within a given year are no longer independent of each other, but share a common random effect. Furthermore, it is natural to assume that the public opinion about homosexual relationships changes gradually over time, with higher correlations for years closer together and lower correlations for years further apart. It would be wrong and unnatural to assume that the public opinion (or political climate) is independent from one year to the next. However, this would be assumed by modeling the random effects {u_t} as independent of each other. It would also be wrong to assume a common, time-independent random effect u = u_t for all time points t, as this implies that public opinion does not change over time. Its effect would then be the same whether responses are measured in 1974 or 1998.

3.1.2 Motivating Correlated Random Effects

To capture the dependency in public opinion, and therefore in responses over different years, we propose random effects that are correlated. In particular,
for this example with unequally spaced observation times we suggest normal autocorrelated random effects {u_t} with variance function var(u_t) = σ², t = 1, ..., 16, and correlation function corr(u_t, u_t*) = ρ^|x_1t − x_1t*|, 1 ≤ t < t* ≤ 16, where x_1t − x_1t* is the difference between the two years identified by indices t and t*. This is equivalent to specifying a latent autoregressive process

u_{t+1} = ρ^|x_1,t+1 − x_1t| u_t + ε_t

underlying the data generation mechanism. Both of these formulations naturally handle the multiple gaps in the observed time series. There is no need to make adjustments (such as imputation of data or artificially treating the series as equally spaced) in our analysis due to missing data at years 1975, 1978-79, 1981, 1983, 1986, 1992, 1995 or 1997.

With correlated random effects, we have to distinguish between two situations: the correlation induced by assuming a common random effect u_t for each cluster (here: year), and the correlation induced by assuming a correlation among the cluster-specific random effects {u_t}. Correlation among observations in the same cluster is a consequence of assuming a single, cluster-specific random effect shared by all observations in that cluster. For example, the presence of the cluster-specific random effect u_t in (3.1) leads to a (marginal) nonnegative correlation among the two binomial responses Y_1t and Y_2t in year t. With conditional independence, the marginal covariance between these
two observations is given by

cov(Y_1t, Y_2t) = E[cov(Y_1t, Y_2t | u_t)] + cov(E[Y_1t | u_t], E[Y_2t | u_t])
              = cov(logit⁻¹(η_1t + u_t), logit⁻¹(η_2t + u_t)),   (3.2)

where η_it is the fixed part of the linear predictor in (3.1). Both functions in (3.2) are monotone increasing in u_t, leading to a nonnegative correlation. Approximations to (3.2) will be dealt with in Section 4.3. In the example, we attributed the cause of this correlation to the current (at the time of the interview) public opinion about homosexual relationships, influencing all respondents in that year. The estimate of σ gives an idea about the magnitude of this correlation, since the more dispersed the u_t's are, the stronger the correlation among the responses within a year. For instance, if the true u_t for a particular year is positive and far away from zero as measured by σ, then all respondents have a common tendency to give a positive answer. If it is far away from zero on the negative side, respondents have a common tendency for a negative answer. This interpretation, of course, is only relative to other fixed effects included in the linear predictor. For the GSS data there seems to be moderate correlation between responses, based on a maximum likelihood estimate of σ̂ = 0.10 with an approximate asymptotic s.e. of 0.03. This interpretation of a moderate effect of public opinion on responses within the same year is further supported by the fact that σ can also be interpreted as the regression coefficient for a standardized version of the random effect u_t. A regression coefficient of 0.10 for a standard normal variable on the logit scale leads to moderate heterogeneity on the probability scale. This shows that the correlation between responses within a common year cannot be neglected.
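The sign claim in (3.2) can be checked directly by Monte Carlo: both logit⁻¹(η_1t + u_t) and logit⁻¹(η_2t + u_t) are increasing in u_t, so their covariance over u_t ~ N(0, σ²) is nonnegative. A small sketch; the values of η_1t and η_2t are arbitrary illustrations, not the GSS estimates, while σ = 0.10 mimics the estimate quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.10                      # in the spirit of the estimate sigma-hat = 0.10
eta1, eta2 = -1.5, -2.0           # illustrative fixed parts of the linear predictors

u = rng.normal(0.0, sigma, size=200_000)    # shared yearly random effect u_t
p1 = 1.0 / (1.0 + np.exp(-(eta1 + u)))      # logit^{-1}(eta_1t + u_t)
p2 = 1.0 / (1.0 + np.exp(-(eta2 + u)))      # logit^{-1}(eta_2t + u_t)

cov = float(np.cov(p1, p2)[0, 1])           # Monte Carlo version of (3.2)
print(cov > 0)                              # prints True
```

Since both transformations are increasing in the shared u_t, the printed covariance is positive for any choice of the η's.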
The second consequence of correlated random effects is that observations from different clusters are correlated, which is a distinctive feature compared to GLMMs assuming independence between cluster-specific random effects. The conditional log odds of agreeing with the statement that homosexual relationships are not wrong at all are now correlated over the years, a feature which is natural for a time series of binomial observations but would have gone unaccounted for if time-independent random effects were used. For instance, for the cohort of white respondents (i = 1), the correlation between the conditional log odds at years t and t* is

corr(logit(π_1t(u_t)), logit(π_1t*(u_t*))) = ρ^|x_1t − x_1t*|

and therefore directly related to the assumed random effects correlation structure. Marginally, the two binomial responses at the different observation times have a covariance which accommodates changing covariance patterns for different observation times (e.g., decreasing with increasing lag) and also negative covariances (see, for instance, the analysis of the Old Faithful geyser eruption data in Chapter 5). We will present approximations to these marginal correlations in binomial time series in Section 4.3.

Summing up, correlated random effects give us a means of incorporating correlation between sequential binomial observations that goes beyond independent or exchangeable correlation structures. In our example we attributed the sequential correlation to the gradual change in public opinion about homosexual relationships over the years, affecting both races equally. In fact, the maximum likelihood estimate of ρ is equal to 0.65 (s.e.
0.25), indicating that rather strong correlations might exist between responses from adjacent years. The model uses 7 parameters (5 fixed effects, 2 variance components) to describe the 32 probabilities. In comparison with a regular GLM and a GLMM with independent random time effects, the maximized likelihoods decrease from 113.0 for the regular GLM to approximately 109.7 for a GLMM with independent random effects to approximately 107.5 for the GLMM with autoregressive random effects. Note that the GLM assumes independent observations within and between the years, and that the GLMM with independent random effects {u_t} for each year t assumes correlation of responses within a year but independence of responses over the years. Both assumptions might be inappropriate. Our model implies that the log odds of approval of homosexual relationships are correlated for blacks and whites within a year (though not very strongly, with an estimate of σ equal to 0.1) and are also correlated for two consecutive years.

The estimates of the fixed parameters and their asymptotic standard errors are given in Table 5-1. The MCEM algorithm converged after 128 iterations with a starting Monte Carlo sample size of 50 and a final Monte Carlo sample size of 8600. Convergence parameters (cf. Section 2.3.3) were ε_1 = 0.002, c = 5, ε_2 = 0.003, ε_3 = 0.001, a = 1.01 and q = 1.2. Path plots of selected parameter estimates for two different sets of starting values are shown in Figure 3-2. A detailed interpretation of the parameters and the effects of the explanatory variables on the odds of approval is provided in Section 4.3. Although this example assumed autocorrelated random effects, we will look at the simpler case of equally correlated random effects next. Then we discuss how the correlation parameter ρ can be estimated within the MCEM framework presented in Section 2.3. Summing up, correlated random effects in GLMMs allow
Figure 3-2: Iteration history for selected parameters and their asymptotic standard errors for the GSS data. The iteration number is plotted on the x-axis. The estimates and standard errors for β_2 were multiplied by 10³ for better plotting. The two different lines in each plot correspond to two different sets of starting values.
one to model within-cluster as well as between-cluster correlations for discrete response variables, where clusters refer to groupings of responses in time.

3.2 Equally Correlated Random Effects

The introductory example modeled decaying correlation between cross-sectional data over time through the use of autocorrelated random effects. In other temporal or spatial settings the correlation might stay nearly constant between any two observation times, regardless of time or location differences between the two discrete responses. Equally correlated random effects might then be appropriate to describe such behavior.

3.2.1 Definition of Equally Correlated Random Effects

We call random effects equally correlated if var(u_t) = σ² for all t and corr(u_t, u_t*) = ρ for all t ≠ t*. More generally, the covariance matrix of the random effects vector u = (u_1, ..., u_T)' is given by Σ = σ²[(1 − ρ)I_T + ρJ_T], where J_T = 1_T 1_T'. To ensure positive definiteness, ρ has restricted range, i.e., 1 > ρ > −1/(T − 1). The random effects density is given by

g(u; ψ) ∝ |Σ|^(−1/2) exp{−(1/2) u'Σ⁻¹u},   (3.3)

where now, due to the pattern in Σ,

|Σ| = σ^(2T) (1 − ρ)^(T−1) [1 + (T − 1)ρ]  and  Σ⁻¹ = (1/[σ²(1 − ρ)]) [I_T − (ρ/[1 + (T − 1)ρ]) J_T].

The vector ψ = (σ, ρ) holds the variance components of Σ.

The more complicated random effects structure (as compared to independence or a single latent random effect) leads to a more complicated M-step in the MCEM algorithm described in Section 2.3. For a sample u^(1), ..., u^(m) from the posterior h(u | y; β^(k−1), ψ^(k−1)), evaluated at the previous parameter estimates β^(k−1) and
ψ^(k−1), the function Q_m²(ψ) introduced in Section 2.3.1 has form

Q_m²(ψ) = (1/m) Σ_{j=1}^m log g(u^(j); ψ)
        ∝ −T log σ − [(T − 1)/2] log(1 − ρ) − (1/2) log[1 + (T − 1)ρ]
          − a/[2σ²(1 − ρ)] + ρ b/[2σ²(1 − ρ)(1 + (T − 1)ρ)],

where a = (1/m) Σ_{j=1}^m u^(j)'u^(j) and b = (1/m) Σ_{j=1}^m u^(j)'J_T u^(j) are constants depending on the sample only.

3.2.2 The M-step with Equally Correlated Random Effects

The M-step seeks to maximize Q_m² with respect to σ and ρ, which is equivalent to finding their MLEs treating the sample u^(1), ..., u^(m) as independent. Since this is not possible in closed form, one way to maximize Q_m² uses a bivariate Newton-Raphson algorithm with the Hessian formed by the second order partial and mixed derivatives of Q_m² with respect to σ and ρ. Some authors (e.g., Lange 1995, Zhang 2002) use only a single iteration of the Newton-Raphson algorithm instead of an entire M-step to speed up convergence. However, this might not always lead to convergence, since the interval on which the Newton-Raphson algorithm converges is restricted through the restrictions on ρ. We show now that with a little bit of work the maximizers for σ and ρ can be obtained very quickly. For any given value of ρ, the ML estimator for σ (at iteration k) is available in closed form and is equal to

σ̂^(k) = ( a/[T(1 − ρ)] − ρ b/[T(1 − ρ)(1 + (T − 1)ρ)] )^(1/2).

Note that if ρ = 0, σ̂^(k) = ( (1/(Tm)) Σ_{j=1}^m u^(j)'u^(j) )^(1/2), the estimator for the independence case presented at the end of Section 2.3.1. Unfortunately, the ML estimator for ρ has no closed form solution. The first and second partial derivatives of Q_m² with
respect to ρ are given by

∂Q_m²/∂ρ = T(T − 1)ρ / (2(1 − ρ)[1 + (T − 1)ρ]) − a/[2σ²(1 − ρ)²]
           + [1 + (T − 1)ρ²] b / (2σ²(1 − ρ)²[1 + (T − 1)ρ]²),

∂²Q_m²/∂ρ² = T(T − 1)[1 + (T − 1)ρ²] / (2(1 − ρ)²[1 + (T − 1)ρ]²) − a/[σ²(1 − ρ)³]
            + [3(T − 1)ρ + (T − 1)²ρ³ − (T − 2)] b / (σ²(1 − ρ)³[1 + (T − 1)ρ]³).

We obtain the profile likelihood for ρ by plugging the MLE σ̂^(k) into the likelihood equation for ρ. Then we use a simple and fast interval-halving (or bisection) method to find the root for ρ. This is advantageous compared to a Newton-Raphson algorithm, since the range of ρ is restricted. Let f(ρ) = ∂Q_m²/∂ρ |_{σ = σ̂^(k)} and let ρ_1 and ρ_2 be two initial estimates in the appropriate range satisfying ρ_1 < ρ_2 and f(ρ_1)f(ρ_2) < 0. Without loss of generality, assume f(ρ_1) < 0. Clearly, the maximum likelihood estimate ρ̂ must be in the interval [ρ_1, ρ_2]. The interval-halving method computes the midpoint ρ_3 = (ρ_1 + ρ_2)/2 of this interval and updates one of its endpoints in the following way: it sets ρ_1 = ρ_3 if f(ρ_3) < 0, or ρ_2 = ρ_3 otherwise. The newly formed interval [ρ_1, ρ_2] has half the length of the initial interval but still contains ρ̂. Subsequently, a new midpoint ρ_3 is calculated, giving rise to a new interval with one fourth of the length of the initial interval, but still containing ρ̂. This process is iterated until |f(ρ_3)| < c, where c is a small positive constant. To ensure it is a maximum, we can check that the value of the second derivative, f'(ρ), is negative at ρ_3. (The second derivative is also needed for approximating standard errors in the EM algorithm.) The value of ρ_3 is then used as an update for ρ in the maximum likelihood estimator for σ, and the whole process of finding the maximizers of Q_m² is repeated. Convergence is declared when the relative change in σ and ρ is less than some prespecified small constant. The values of σ and ρ at this final iteration are the estimates σ̂^(k) and ρ̂^(k) from MCEM iteration k.
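The profile-likelihood search just described is easy to sketch. The helper below uses the closed-form σ̂(ρ) and the profile score ∂Q_m²/∂ρ for the equally correlated case, with the sufficient statistics a and b taken as given; the check at the end feeds in the population values a = Tσ² and b = σ²T[1 + (T − 1)ρ] implied by σ = 1, ρ = 0.4, T = 5, which the bisection should recover. This is a sketch of the scheme, not the dissertation's code:

```python
import numpy as np

def sigma_hat(rho, a, b, T):
    # closed-form ML update for sigma at fixed rho (equally correlated case)
    return np.sqrt(a / (T * (1 - rho))
                   - rho * b / ((1 - rho) * T * (1 + (T - 1) * rho)))

def profile_score(rho, a, b, T):
    # dQ_m/drho evaluated at sigma = sigma_hat(rho); a and b are the
    # sufficient statistics (1/m) sum_j u'u and (1/m) sum_j u'J_T u
    s2 = sigma_hat(rho, a, b, T) ** 2
    return (T * (T - 1) * rho / (2 * (1 - rho) * (1 + (T - 1) * rho))
            - a / (2 * s2 * (1 - rho) ** 2)
            + (1 + (T - 1) * rho ** 2) * b
            / (2 * s2 * (1 - rho) ** 2 * (1 + (T - 1) * rho) ** 2))

def interval_halving(a, b, T, rho1, rho2, tol=1e-10):
    # bisection on the profile score over an interval with a sign change
    f1 = profile_score(rho1, a, b, T)
    while rho2 - rho1 > tol:
        rho3 = (rho1 + rho2) / 2
        if profile_score(rho3, a, b, T) * f1 > 0:
            rho1 = rho3            # same sign as the left endpoint: move it
        else:
            rho2 = rho3
    return (rho1 + rho2) / 2

# For sigma = 1, rho = 0.4, T = 5 the population statistics are a = 5, b = 13,
# and the scheme recovers the generating parameters:
rho_hat = interval_halving(5.0, 13.0, T=5, rho1=0.0, rho2=0.9)
print(round(rho_hat, 4), round(float(sigma_hat(rho_hat, 5.0, 13.0, 5)), 4))  # 0.4 1.0
```

By construction the iterates never leave the admissible range for ρ, which is the advantage over Newton-Raphson noted in the text.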
The issue of how to obtain a sample u^(1), ..., u^(m) from h(u | y; β^(k−1), ψ^(k−1)), taking into account the special structure of the random effects distribution, will be discussed in Section 3.4.2.

3.3 Autoregressive Random Effects

The use of autoregressive random effects was demonstrated in the introductory example. Their property of a decaying correlation function makes them a useful tool for modeling temporal or spatial associations among discrete data. We will limit ourselves to instances where there is a natural ordering of random effects, and consider time dependent data first.

3.3.1 Definition of Autoregressive Random Effects

As with equally correlated random effects in Section 3.2, we can look at the joint distribution of autoregressive (or autocorrelated) random effects {u_t}_{t=1}^T as a mean-zero multivariate normal distribution with patterned covariance matrix Σ, defined by the variance and correlation functions var(u_t) = σ² for all t and

corr(u_t, u_t*) = ρ^|x_t − x_t*|,

where x_t and x_t* are time points (e.g., years as in the GSS example) associated with random effects u_t and u_t*. Let d_t = x_{t+1} − x_t denote the time difference between two successive time points and let f_t = 1/(1 − ρ^(2d_t)), t = 1, ..., T − 1. Then, due to the special structure, the determinant of the covariance matrix is given by |Σ| = σ^(2T) Π_{t=1}^{T−1} f_t^(−1), and Σ⁻¹ is tridiagonal (Crowder and Hand 1990, with correction of a typo in there) with main diagonal

(1/σ²) (f_1, f_1 + f_2 − 1, f_2 + f_3 − 1, f_3 + f_4 − 1, ..., f_{T−2} + f_{T−1} − 1, f_{T−1})
and subdiagonals

(1/σ²) (−ρ^(d_1) f_1, −ρ^(d_2) f_2, ..., −ρ^(d_{T−1}) f_{T−1}).

For a sample u^(1), ..., u^(m) from the posterior h(u | y; β^(k−1), ψ^(k−1)), evaluated at the previous parameter estimates β^(k−1) and ψ^(k−1) = (σ^(k−1), ρ^(k−1)), the function Q_m² (cf. Section 2.3.1) now has form

Q_m²(ψ) ∝ −T log σ − (1/2) Σ_{t=1}^{T−1} log(1 − ρ^(2d_t)) − (1/[2σ²]) (1/m) Σ_{j=1}^m (u_1^(j))²
          − (1/[2σ²]) (1/m) Σ_{j=1}^m Σ_{t=1}^{T−1} [u_{t+1}^(j) − ρ^(d_t) u_t^(j)]² / (1 − ρ^(2d_t)),   (3.4)

where u_t^(j) is the tth component of the jth sampled vector u^(j). In the M-step of an MCEM algorithm we seek to maximize Q_m² with respect to σ and ρ.

Alternatively, we can view the random effects {u_t} as a latent first-order autoregressive process: random effect u_{t+1} at time t + 1 is related to its predecessor u_t by the equation

u_{t+1} = ρ^(d_t) u_t + ε_t,   ε_t ~ N(0, σ²[1 − ρ^(2d_t)]),   (3.5)

where d_t again denotes the lag between the two successive time points associated with random effects u_t and u_{t+1}. Assuming a N(0, σ²) distribution for the first random effect u_1, the joint random effects density for u = (u_1, ..., u_T) enjoys a Markov property and has form

g(u; ψ) = g(u_1; ψ) g(u_2 | u_1; ψ) ··· g(u_t | u_{t−1}; ψ) ··· g(u_T | u_{T−1}; ψ)   (3.6)
        ∝ (1/σ²)^(T/2) ( Π_{t=1}^{T−1} [1 − ρ^(2d_t)] )^(−1/2) exp{ −u_1²/(2σ²) } exp{ −Σ_{t=1}^{T−1} [u_{t+1} − ρ^(d_t) u_t]² / (2σ²[1 − ρ^(2d_t)]) },

leading, of course, to the same expression for Q_m² as given in (3.4). For two time indices t and t* with t < t*, the random process has autocorrelation function corr(u_t, u_t*) = ρ^(x_t* − x_t).
Before we discuss maximization of Q_m² in this setting with possibly unequally spaced observation times, let us comment on the rather unusual parametrization of the latent random process (3.5). Chan and Ledolter (1995), in their development of time series models for equally spaced discrete events, use the more common form

u_{t+1} = ρ u_t + ε_t,   ε_t ~ N(0, σ²),   t = 1, ..., T − 1.

This leads to var(u_t) = σ²/(1 − ρ²) for all t if we assume a N(0, σ²/(1 − ρ²)) distribution for u_1. (Chan and Ledolter (1995) condition on this first observation, which leads to closed form solutions for both σ and ρ in the case of equidistant observations.) Since it is common practice to let σ² describe the strength of association between observations in a common cluster sharing that random effect, our parametrization seems more natural. In Chan and Ledolter's parametrization, both the variance and correlation parameter appear in the variance of the random effect. In the more general case of unequally spaced observations, the parametrization ε_t ~ N(0, σ²) results in different variances of the random effects at different time points (i.e., var(u_t) = σ²/(1 − ρ^(2d_t))). Considering that the random effects represent unobservable phenomena common to all clusters, their variability should be about the same for all clusters and not depend on the time difference between any two clusters. There is no reason to believe that the strength of association is larger in some clusters and weaker in others. Therefore, the parametrization we choose in (3.5) seems natural and appropriate.

For spatially correlated data, a relationship between random effects u_i and u_i* is defined in terms of a distance function d(x_i, x_i*) between covariates x_i and x_i* associated with them. Each u_i then represents a random effect for a spatial cluster, and correlated random effects are again natural to model spatial dependency among observations in different clusters. In the time setting we had
d(x_i, x_i*) = |x_i − x_i*|, with x_i and x_i* representing time points. In 2-dimensional spatial settings, x_i = (x_i1, x_i2)' may represent midpoints in a Cartesian system and d(x_i, x_i*) = ||x_i − x_i*|| is the Euclidean distance function. The so-defined distance between clusters can be used in a model with correlations between random effects decaying as distances between cluster midpoints grow, e.g., corr(u_i, u_i*) = ρ^||x_i − x_i*||. Models of this form are discussed in Zhang (2002) and, in a Bayesian framework, in Diggle et al. (1998). Sometimes only the information concerning whether clusters are adjacent to each other is used to form the correlation structure. In this case, d(x_i, x_i*) is a binary function indicating if clusters i and i* are adjacent or not. Usually this leads to an improper joint distribution for the random effects, as for instance in the analysis of the Scottish lip cancer data set presented in Breslow and Clayton (1993).

3.3.2 The M-step with Autoregressive Random Effects

Maximizing Q_m² with respect to σ and ρ is again equivalent to finding their MLEs for the sample u^(1), ..., u^(m), pretending they are independent. For fixed ρ, maximizing Q_m² with respect to σ is possible in closed form. For notational convenience, denote the parts depending on ρ and the generated sample u^(1), ..., u^(m) by

a_t(ρ, u) = (1/m) Σ_{j=1}^m [u_{t+1}^(j) − ρ^(d_t) u_t^(j)]²

and

b_t(ρ, u) = (1/m) Σ_{j=1}^m [u_{t+1}^(j) − ρ^(d_t) u_t^(j)] u_t^(j),

with derivatives with respect to ρ (indicated by a prime) given by

a_t'(ρ, u) = −2 d_t ρ^(d_t − 1) b_t(ρ, u)
and

b_t'(ρ, u) = −d_t ρ^(d_t − 1) (1/m) Σ_{j=1}^m (u_t^(j))².

The maximum likelihood estimator of σ at iteration k of the MCEM algorithm has form

σ̂^(k) = ( (1/(Tm)) Σ_{j=1}^m (u_1^(j))² + (1/T) Σ_{t=1}^{T−1} a_t(ρ, u)/(1 − ρ^(2d_t)) )^(1/2).

For the special case of independent random effects (ρ = 0) this simplifies to the estimator ( (1/(Tm)) Σ_{j=1}^m u^(j)'u^(j) )^(1/2) presented at the end of Section 2.3.1. (The equal correlation structure cannot be obtained as a special case of the autocorrelation structure.) No closed form solution exists for ρ̂^(k). Let

c_t(ρ) = d_t ρ^(d_t − 1) / (1 − ρ^(2d_t))  and  e_t(ρ) = d_t ρ^(2d_t − 1) / (1 − ρ^(2d_t))²

be terms depending on ρ but not on u, with derivatives given by

c_t'(ρ) = (d_t − 1) c_t(ρ)/ρ + 2ρ^(d_t) [c_t(ρ)]²  and  e_t'(ρ) = (2d_t − 1) e_t(ρ)/ρ + 4ρ^(d_t) c_t(ρ) e_t(ρ),

respectively. Then the first and second partial derivatives of Q_m² with respect to ρ can be written as

∂Q_m²/∂ρ = Σ_{t=1}^{T−1} ρ^(d_t) c_t(ρ) + (1/σ²) Σ_{t=1}^{T−1} [c_t(ρ) b_t(ρ, u) − e_t(ρ) a_t(ρ, u)],

∂²Q_m²/∂ρ² = Σ_{t=1}^{T−1} ρ^(d_t − 1) c_t(ρ) [2d_t − 1 + 2ρ^(d_t + 1) c_t(ρ)]
             + (1/σ²) Σ_{t=1}^{T−1} [c_t'(ρ) b_t(ρ, u) + c_t(ρ) b_t'(ρ, u) − e_t'(ρ) a_t(ρ, u) − e_t(ρ) a_t'(ρ, u)].
A Newton-Raphson algorithm with Hessian formed of partial and mixed derivatives of Q_m² with respect to σ and ρ can be employed to find the maximum likelihood estimators at iteration k. However, since the range of ρ is restricted, it might be advantageous to use the interval-halving method of Section 3.2.2 on ∂Q_m²/∂ρ |_{σ = σ̂^(k)}.
In most of our applications the number of distinct observation times T is rather large, and generating independent T-dimensional vectors u from the posterior h(u | y; ψ^(k−1)), as required to approximate the E-step, is difficult even with the nice (prior) autoregressive relationship among the components of u. The next section discusses this issue. There are many other correlation structures which are not discussed here. For instance, the first-order autoregressive random process can be extended to a pth-order random process, and the formulas provided here and in the next section can be modified accordingly.

3.4 Sampling from the Posterior Distribution Via Gibbs Sampling

In Section 2.3.2 we gave a general description of how to obtain a random sample u^(1), ..., u^(m) from h(u | y). (As in Section 2.3.2, we suppress the dependency on the parameter estimates from the previous iteration.) For high dimensional random effects distributions g(u), generating independent draws from h(u | y) can get very time consuming, if not impossible. The Gibbs sampler introduced in Section 2.3.2 offers an alternative because it involves sampling from lower dimensional (often univariate) conditional distributions of h(u | y), which is considerably faster. However, it results in dependent samples from the posterior random effects distribution. The distributional structure of equally correlated random effects or autoregressive random effects is very amenable to Gibbs sampling because of the simplifications that occur in the full univariate conditionals. Remember that the two-stage hierarchy and the conditional independence assumption in GLMMs imply that

h(u | y) ∝ f(y | u) g(u) = [ Π_{t=1}^T f(y_t | u_t) ] g(u),

the product of the conditional densities of observations sharing a common random effect, multiplied by the random effects density. In the following, let u = (u_1, ..., u_T). We discuss the case of autoregressive random effects first.
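For the autoregressive case, the factorization above can be written down directly. The sketch below evaluates log h(u | y) up to an additive constant for conditionally binomial responses with a logit link, using the AR random effects density (3.6) with possibly unequal gaps d_t; the function names and the binomial setup are illustrative assumptions, not code from this dissertation:

```python
import numpy as np

def log_g_ar(u, sigma, rho, d):
    # log of the autoregressive random effects density (3.6), up to an
    # additive constant; d[t] is the gap between observation times t and t+1
    out = -len(u) * np.log(sigma) - u[0] ** 2 / (2 * sigma ** 2)
    for t in range(len(u) - 1):
        v = 1 - rho ** (2 * d[t])
        out -= 0.5 * np.log(v) + (u[t + 1] - rho ** d[t] * u[t]) ** 2 / (2 * sigma ** 2 * v)
    return out

def log_posterior(u, y, n, eta, sigma, rho, d):
    # log h(u | y) up to a constant for y_t ~ Binomial(n_t, pi_t) given u_t,
    # with logit(pi_t) = eta_t + u_t and eta the fixed part of the predictor
    pi = 1.0 / (1.0 + np.exp(-(eta + u)))
    loglik = np.sum(y * np.log(pi) + (n - y) * np.log1p(-pi))
    return loglik + log_g_ar(u, sigma, rho, d)
```

Because the density factors term by term, updating a single u_t touches only the two neighboring factors of `log_g_ar` and the single likelihood term f(y_t | u_t), which is what makes univariate Gibbs updates cheap.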
3.4.1 A Gibbs Sampler for Autoregressive Random Effects

From representation (3.6) of the random effects distribution we see that the full univariate conditional distribution of u_t, given the other T − 1 components of u, only depends on its neighbors u_{t−1} and u_{t+1}, i.e.,

g(u_t | u_1, ..., u_{t−1}, u_{t+1}, ..., u_T) ∝ g(u_t | u_{t−1}) g(u_{t+1} | u_t),   t = 2, ..., T − 1.

At the beginning (t = 1) and the end (t = T) of the process, the conditional distributions of u_1 and u_T only depend on the successor u_2 and the predecessor u_{T−1}, respectively. Furthermore, random effect u_t only applies to the observations y_t = (y_t1, ..., y_tn_t) at a common time point that share that random effect, but not to other observations. Hence, the full univariate conditionals of the posterior random effects distribution can be expressed as

h_1(u_1 | u_2, y_1) ∝ f(y_1 | u_1) g_1(u_1 | u_2),
h_t(u_t | u_{t−1}, u_{t+1}, y_t) ∝ f(y_t | u_t) g_t(u_t | u_{t−1}, u_{t+1}),   t = 2, ..., T − 1,
h_T(u_T | u_{T−1}, y_T) ∝ f(y_T | u_T) g_T(u_T | u_{T−1}),

where, using standard multivariate normal theory results,

g_1(u_1 | u_2) = N( ρ^(d_1) u_2, σ²[1 − ρ^(2d_1)] ),
g_t(u_t | u_{t−1}, u_{t+1}) = N( [ρ^(d_{t−1})(1 − ρ^(2d_t)) u_{t−1} + ρ^(d_t)(1 − ρ^(2d_{t−1})) u_{t+1}] / [1 − ρ^(2(d_{t−1}+d_t))],
    σ²[1 − ρ^(2d_{t−1}) − ρ^(2d_t) + ρ^(2(d_{t−1}+d_t))] / [1 − ρ^(2(d_{t−1}+d_t))] ),   t = 2, ..., T − 1,
g_T(u_T | u_{T−1}) = N( ρ^(d_{T−1}) u_{T−1}, σ²[1 − ρ^(2d_{T−1})] ).

For equally spaced data (d_t = 1 for all t) these distributions reduce to the ones derived in Chan and Ledolter (1995). Direct sampling from the full univariate conditionals h_t is not possible. However, it is straightforward to implement an accept-reject algorithm. In fact,
the accept-reject algorithm as outlined in Section 2.3.2 applies directly, with target density h_t and candidate density g_t, since h_t has the form of an exponential family density multiplied by a normal density. In Section 2.3.2 we discussed the accept-reject algorithm for generating an entire vector u from the posterior random effects distribution h(u | y) with candidate density g(u), and mentioned that acceptance probabilities are virtually zero for large dimensional u's. With the Gibbs sampler we have reduced the problem to univariate sampling of the tth component u_t from the univariate target density h_t with univariate candidate density g_t. By selecting M_t = L(y_t), where L(y_t) is the saturated likelihood for the observations at time point t, we ensure that the target density satisfies h_t ≤ M_t g_t. Given u^(j−1) = (u_1^(j−1), ..., u_T^(j−1)) from the previous iteration, the Gibbs sampler with accept-reject sampling from the full univariate conditionals consists of

1. generate the first component u_1^(j) ~ h_1(u_1 | u_2^(j−1), y_1) by
(a) generation step: generate u_1 from candidate density g_1(u_1 | u_2^(j−1)); generate U ~ Uniform[0, 1];
(b) acceptance step: set u_1^(j) = u_1 if U ≤ f(y_1 | u_1)/L(y_1); return to (a) otherwise;
2. for t = 2, ..., T − 1: generate component u_t^(j) ~ h_t(u_t | u_{t−1}^(j), u_{t+1}^(j−1), y_t) by
(a) generation step: generate u_t from candidate density g_t(u_t | u_{t−1}^(j), u_{t+1}^(j−1)); generate U ~ Uniform[0, 1];
(b) acceptance step: set u_t^(j) = u_t if U ≤ f(y_t | u_t)/L(y_t); return to (a) otherwise;
3. generate the last component u_T^(j) ~ h_T(u_T | u_{T−1}^(j), y_T) by
(a) generation step:
generate u_T from the candidate density g_T(u_T | u_{T-1}^{(j)}); generate U ~ Uniform[0,1];
(b) acceptance step: set u_T^{(j)} = u_T if U ≤ f(y_T | u_T)/L(y_T); return to (a) otherwise;
4. set u^{(j)} = (u_1^{(j)}, ..., u_T^{(j)}).

The so-obtained sample u^{(1)}, ..., u^{(m)} (after allowing for burn-in) forms a dependent sample which we use to approximate the E-step in the k-th iteration of the MCEM algorithm. Note that all densities are evaluated at the current parameter estimates, i.e., β^{(k-1)} for f(y_t | u_t) and ψ^{(k-1)} = (σ^{(k-1)}, ρ^{(k-1)}) for g_t(u_t | u_{-t}).

3.4.2 A Gibbs Sampler for Equally Correlated Random Effects

Results similar to those for the autoregressive correlation structure can be derived for the case of equally correlated random effects. In this case the full univariate conditional of u_t depends on all other T-1 components of u, as can be seen from (3.3). Let u_{-t} denote the vector u with the t-th component deleted. Using notation similar to that of the previous section, the full univariate conditionals of h(u | y) are given by

h_t(u_t | u_{-t}, y_t) ∝ f(y_t | u_t) g_t(u_t | u_{-t}),

where, with standard results from multivariate normal theory, g_t(u_t | u_{-t}) is a N(μ_t, τ_t²) density with

μ_t = [ρ / (1 + (T-2)ρ)] Σ_{k≠t} u_k

and

τ_t² = σ² ( 1 - (T-1)ρ² / (1 + (T-2)ρ) ).

Given the vector u^{(j-1)} from the previous iteration, the Gibbs sampler with accept-reject sampling from the full univariate conditionals has the form:
1. for t = 1, ..., T, generate component u_t^{(j)} ~ h_t(u_t | u_1^{(j)}, ..., u_{t-1}^{(j)}, u_{t+1}^{(j-1)}, ..., u_T^{(j-1)}, y_t) by
(a) generation step: generate u_t from the candidate density g_t(u_t | u_1^{(j)}, ..., u_{t-1}^{(j)}, u_{t+1}^{(j-1)}, ..., u_T^{(j-1)}); generate U ~ Uniform[0,1];
(b) acceptance step: set u_t^{(j)} = u_t if U ≤ f(y_t | u_t)/L(y_t); return to (a) otherwise;
2. set u^{(j)} = (u_1^{(j)}, ..., u_T^{(j)}).

This leads to a sample u^{(1)}, ..., u^{(m)} from the posterior distribution used in the E- and M-steps of the MCEM algorithm at iteration k. Note again that all distributions are evaluated at their current parameter estimates β^{(k-1)} and ψ^{(k-1)} = (σ^{(k-1)}, ρ^{(k-1)}).

3.5 A Simulation Study

We conducted a simulation study to evaluate the performance of the maximum likelihood estimation algorithm, to evaluate the bias in the estimation of covariate effects and variance components, and to compare predicted random effects to the ones used in the simulation of the data. To this end, we generated a time series y_1, ..., y_T of T = 400 binary observations according to model (3.8) for the conditional log odds of success at time t, t = 1, ..., 400. For the simulation we chose α = 1 and β = 1, where β is the regression coefficient for independent standard normal distributed covariates x_t ~ i.i.d. N(0,1). The random effects u_1, ..., u_T are thought to arise from an unobserved latent autoregressive process u_{t+1} = ρu_t + ε_t, where the ε_t are i.i.d. N(0, σ²[1-ρ²]), i.e., the u_t's have standard deviation σ and lag-t correlation ρ^t. For the simulation of these autoregressive
random effects we used σ = 2 and ρ = 0.8. The resulting sample autocorrelation function of the realized random effects is pictured in Figure 3-4. The standard deviation and lag-1 correlation of the 400 realized values of u_1, ..., u_T are equal to 1.95 and 0.77. Note that, conditional on the realized values of the u_t's, the y_t's are generated independently, with log odds given by (3.8).

The MCEM algorithm as described in Sections 2.3 and 3.3 for a logistic GLMM with autocorrelated random effects yielded the following maximum likelihood estimates for the fixed effects and variance components: α̂ = 0.94 (0.39) and β̂ = 1.03 (0.22), as compared to the true values of 1 and 1, and σ̂ = 2.25 (0.44) and ρ̂ = 0.74 (0.06), as compared to the realized values of 1.95 and 0.77. The algorithm converged after 71 iterations with a starting Monte Carlo sample size of 50 and a final Monte Carlo sample size of only 880, although estimated standard errors are based on a Monte Carlo sample size of 20,000. Convergence parameters were set to c_1 = 0.003, c = 3, c_2 = 0.005, c_3 = 0.001, a = 1.03 and q = 1.05 (see Section 2.3.3). Regular GLM estimates were used as starting values for α and β, and starting values for σ and ρ were set to 1.5 and 0, respectively.

As will be described in Section 5.4.2, we estimated random effects through a Monte Carlo approximation of their posterior mean: û = E[u | y]. The scatter plot in Figure 3-3 shows good agreement between the realized random effects u_1, ..., u_T from the simulation and the estimated random effects û_1, ..., û_T from the model. Note, though, that the standard deviation of the estimated random effects is equal to 1.60 (as compared to the realized standard deviation of 1.95), showing that the estimated random effects are less variable; this is the general shrinkage effect (compare the scales on the x and y axes of Figure 3-3) brought along by using posterior mean estimates.
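The data-generating scheme of this simulation can be sketched in a few lines of code. The following is an illustrative sketch only (Python with NumPy; all variable names are ours): it draws the latent AR(1) random effects with stationary standard deviation σ = 2 and lag-1 correlation ρ = 0.8, and then generates the conditionally independent binary responses from the logistic model.

```python
import numpy as np

rng = np.random.default_rng(42)
T, alpha, beta, sigma, rho = 400, 1.0, 1.0, 2.0, 0.8

# Latent AR(1) random effects with stationary sd sigma and lag-1 corr rho:
# u_{t+1} = rho u_t + eps_t, eps_t ~ N(0, sigma^2 [1 - rho^2]).
u = np.empty(T)
u[0] = sigma * rng.standard_normal()
for t in range(1, T):
    u[t] = rho * u[t - 1] + sigma * np.sqrt(1 - rho**2) * rng.standard_normal()

x = rng.standard_normal(T)                          # i.i.d. N(0,1) covariates x_t
p = 1.0 / (1.0 + np.exp(-(alpha + beta * x + u)))   # conditional success probability
y = (rng.uniform(size=T) < p).astype(int)           # conditionally independent y_t
```

The realized standard deviation and lag-1 correlation of u can then be compared to σ and ρ, exactly as reported above for the first generated series.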
Also, a comparison of the autocorrelation and partial autocorrelation functions of the realized and estimated random effects in
[Figure 3-3: Realized (simulated) random effects u_1, ..., u_T versus estimated random effects û_1, ..., û_T.]

Figure 3-4 reveals some differences, due to the fact that the estimated random effects are based on the posterior distribution of u | y. Therefore, estimated random effects are only of limited use in checking assumptions on the true random effects. Only when their behavior is grossly unexpected compared to the assumed structure of the underlying latent random process may they serve as an indication of model inappropriateness. Related remarks are given by Verbeke and Molenberghs (2000), who generate data in a linear mixed model assuming a mixture of two normal distributed random effects, resulting in a bimodal random effects distribution. There, too, a plot of posterior mean estimates of the random effects from a model that misspecified the random effects distribution does not reveal that anything went wrong.

We repeated the above simulation 100 times, using the same specifications, starting values and convergence criteria as mentioned above. Each of the 100 generated binary time series of length 400 was fit using the MCEM algorithm. Table 3-1 shows the average (over the 100 generated time series) of the fixed
[Figure 3-4: Comparing simulated and estimated random effects. Sample autocorrelation (first row) and partial autocorrelation (second row) functions for the realized (simulated) random effects u_1, ..., u_T (first column) and the estimated random effects û_1, ..., û_T (second column).]
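The accept-reject Gibbs sampler of Section 3.4.1, used within the MCEM fits reported here, can be sketched as follows for the equally spaced Bernoulli case (n_t = 1, d_t = 1). This is an illustrative sketch, not the implementation used for the reported results; note that for a single binary observation the saturated likelihood is L(y_t) = 1, so the acceptance probability is simply f(y_t | u_t).

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(u, y, eta_fixed, sigma, rho, rng):
    """One accept-reject Gibbs sweep for AR(1) random effects with a single
    Bernoulli observation per time point (n_t = 1) and equal spacing (d_t = 1)."""
    T = len(u)
    u = u.copy()
    for t in range(T):
        # Candidate g_t: full conditional of u_t under the AR(1) prior.
        if t == 0:
            mean, var = rho * u[1], sigma**2 * (1 - rho**2)
        elif t == T - 1:
            mean, var = rho * u[T - 2], sigma**2 * (1 - rho**2)
        else:
            mean = rho * (u[t - 1] + u[t + 1]) / (1 + rho**2)
            var = sigma**2 * (1 - rho**2) / (1 + rho**2)
        # Accept-reject step: for one binary observation L(y_t) = 1, so the
        # acceptance probability is the Bernoulli likelihood f(y_t | u_t).
        while True:
            cand = mean + np.sqrt(var) * rng.standard_normal()
            p = expit(eta_fixed[t] + cand)
            f = p if y[t] == 1 else 1.0 - p
            if rng.uniform() <= f:
                u[t] = cand
                break
    return u
```

Repeated sweeps (after burn-in) yield the dependent posterior sample u^{(1)}, ..., u^{(m)} used in the Monte Carlo E-step.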
parameter and variance component estimates and their average estimated standard errors. On average, the GLMM estimates of the fixed effects α and β and the variance components are very close to the true parameters, although the true lag-1 correlation of the random effects is underestimated by 6.3%. Table 3-1 also displays in parentheses the standard deviations of all estimated parameters in the 100 replications. Comparing these to the theoretical estimates of the asymptotic standard errors, we see good agreement. This suggests that the procedure for finding standard errors that we described and implemented (via Louis's (1982) formula) in our MCEM algorithm works well. In 5 (5%) out of the 100 simulations, the Monte Carlo approximation of the asymptotic covariance matrix resulted in a negative definite matrix. For these simulations, a larger Monte Carlo sample after convergence of the MCEM algorithm (the default was 20,000) might be necessary. It is also interesting to note that, of the 95 simulations with positive definite covariance matrix, 6 (6.3%) resulted in a nonsignificant (based on a 5% level Wald test) estimate of the regression coefficient β under the GLMM with autoregressive random effects, while none was declared nonsignificant with the GLM approach. Estimates and standard errors for a corresponding GLM fit are also provided in Table 3-1. The average Monte Carlo sample size at the final iteration of the MCEM algorithm was 1200, although highly dispersed, ranging from 210 to 21,000. The average computation time (on a mobile Pentium III, 600 MHz processor with 256 MB RAM) to convergence, including estimating the covariance matrix, was 73 minutes.

We ran two other simulation studies, now with a shorter length of only T = 100 observations and a true lag-1 correlation of 0.6 and 0.8, respectively. All other parameters remained unchanged. These results are also summarized in Table 3-1. Again we observe that the estimated parameters are very close to the
Table 3-1: A simulation study for a logistic GLMM with autoregressive random effects.

                  α       β       σ       ρ       s.e.(α)  s.e.(β)  s.e.(σ)  s.e.(ρ)
True (T=400):     1       1       2       0.8
GLM:              0.64    0.63                    0.11     0.12
                 (0.20)  (0.14)                  (0.01)   (0.01)
GLMM:             1.07    1.02    2.08    0.75    0.37     0.26     0.61     0.10
                 (0.32)  (0.20)  (0.25)  (0.06)  (0.12)   (0.23)   (0.40)   (0.05)
True (T=100):     1       1       2       0.6
GLM:              0.69    0.70                    0.23     0.25
                 (0.38)  (0.27)                  (0.02)   (0.03)
GLMM:             1.09    1.07    1.99    0.51    0.58     0.47     1.35     0.26
                 (0.58)  (0.37)  (0.39)  (0.20)  (0.33)   (0.27)   (1.15)   (0.18)
True (T=100):     1       1       2       0.8
GLM:              0.65    0.61                    0.22     0.24
                 (0.21)  (0.26)                  (0.01)   (0.03)
GLMM:             1.04    0.96    2.00    0.75    0.42     0.51     1.04     0.16
                 (0.29)  (0.34)  (0.53)  (0.13)  (0.32)   (0.99)   (1.04)   (0.13)

Average and standard deviation (in parentheses) of fixed effects, variance components and their standard error estimates from a GLM and a GLMM with latent AR(1) process. The two models were fitted to each of 100 generated binary time series of length T = 400 and T = 100.

true ones, but on average the correlation was underestimated by 15% and 6.3%, respectively. However, the sampling errors of the correlation parameters (shown in parentheses in Table 3-1) were large enough to include the true values.

Since our methods are general enough to handle unequally spaced data, we repeated the first simulation with a time series of T = 400 binary observations, but now randomly deleted 10% of the observations to create random gaps in the series. We left all parameters and the model for the conditional odds unchanged, except that we now assume that the random effects follow the latent autoregressive process u_{t+1} = ρ^{d_t} u_t + ε_t, where the ε_t are i.i.d. N(0, σ²[1 - ρ^{2d_t}]) and d_t is the difference (in the units of measurement) between the time points associated with the observations at times t and t+1. For example, the first series we generated had
1 gap of length three (i.e., d_t = 4 for one t), 4 gaps of length two (i.e., d_t = 3 for 4 t's) and 29 gaps of length one (i.e., d_t = 2 for 29 t's). For all other t's, d_t = 1, i.e., they are successive observations and the difference between two of them is one unit of measurement. Simulation results are shown in Table 3-2 and reveal that our proposed methods and algorithm also work well for an unequally spaced binary time series. All true parameters are included in confidence intervals based on the average of the estimated parameters from the 100 replicated series and its standard deviation (shown in parentheses in Table 3-2).

Table 3-2: Simulation study for modeling an unequally spaced binary time series.

                  α       β       σ       ρ       s.e.(α)  s.e.(β)  s.e.(σ)  s.e.(ρ)
True (T=360,      1       1       2       0.8
unequally spaced):
GLM:              0.61    0.62                    0.12     0.12
                 (0.19)  (0.11)                  (0.00)   (0.01)
GLMM:             1.03    1.00    2.07    0.75    0.38     0.28     0.71     0.11
                 (0.29)  (0.16)  (0.25)  (0.06)  (0.19)   (0.28)   (0.80)   (0.18)

Average and standard deviation (in parentheses) of fixed effects, variance components and their standard error estimates from a GLM and a GLMM with latent autoregressive random effects accounting for unequally spaced observations. The two models were fitted to each of 100 generated binary time series of length T = 360, with random gaps of random length between observations.
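The unequally spaced latent process u_{t+1} = ρ^{d_t} u_t + ε_t can be simulated as in the following illustrative sketch (the gap mechanism shown, thinning 400 time points down to 360 at random, mimics but does not exactly reproduce the particular gap pattern described above).

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, rho = 2.0, 0.8

# Keep 360 of 400 time points to create random gaps; d[t] is the spacing
# between consecutive observed time points (d_t = 1 means no gap).
times = np.sort(rng.choice(np.arange(400), size=360, replace=False))
d = np.diff(times)

# u_{t+1} = rho^{d_t} u_t + eps_t, eps_t ~ N(0, sigma^2 [1 - rho^{2 d_t}]),
# which preserves the stationary sd sigma for any spacing.
u = np.empty(len(times))
u[0] = sigma * rng.standard_normal()
for t in range(1, len(times)):
    r = rho ** d[t - 1]
    u[t] = r * u[t - 1] + sigma * np.sqrt(1 - r**2) * rng.standard_normal()
```

Because the innovation variance adapts to the gap length d_t, the marginal distribution of every u_t remains N(0, σ²) regardless of the spacing.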
CHAPTER 4
MODEL PROPERTIES FOR NORMAL, POISSON AND BINOMIAL OBSERVATIONS

So far we have discussed models for discrete-valued time series data in a very broad manner. In Chapter 2 we developed the likelihood for our models based on generic distributions f(y|u) for observations y and g(u) for random effects u, and presented an algorithm for finding maximum likelihood estimates. Chapter 3 looked at two special cases of random effects distributions useful for describing temporal or spatial dependencies. In this chapter we make specific distributional assumptions about the observations and develop some theory underlying the models we propose. We will pay special attention to data in the form of a single (sometimes considered generic) time series Y = (Y_1, ..., Y_T) and derive marginal properties implied by the conditional model formulation. Multiple independent time series Y_1, ..., Y_n can result from replication of the original time series or from stratification of the sampled population, such as in the example about homosexual relationships. All derivations given below for a generic time series Y still hold for the i-th series Y_i = (Y_{i1}, Y_{i2}, ..., Y_{iT}), provided the same latent process {u_t} is assumed to underlie each one of them.

An important characteristic of any time series model is its implied serial dependency structure. In the case of normal theory time series models this is specified by the autocorrelation function. In Section 4.1 we derive the implied marginal autocorrelation function for GLMMs with normal random components and either an equal correlation or autoregressive assumption for the random effects. With these assumptions our models are special cases of linear mixed models, discussed for instance in Diggle et al. (2002). In Sections 4.2 and 4.3 we explore
marginal properties of GLMMs with Poisson and binomial random components that are induced by assuming equally correlated or autoregressive random effects. In Chapter 5 these model properties, such as the implied autocorrelation function, are then compared to empirical counterparts based on the observed data to evaluate the proposed model.

Section 2.1 mentioned that parameters in GLMMs have a conditional interpretation, controlling for the random effects. Correlated random effects vary over time, and parameter interpretation is different from having just one common level of a random effect as in many standard random intercepts GLMMs. For each of the models presented here we discuss parameter interpretation in a separate section.

4.1 Analysis for a Time Series of Normal Observations

Suppose that, conditional on time specific normal random effects {u_t}, the observations {Y_t} are independent N(μ_t + u_t, τ²). The marginal likelihood for this model is tractable, because marginally the joint distribution of {Y_t} is multivariate normal with mean μ = (μ_1, ..., μ_T)' and covariance matrix Σ_u + τ²I, where Σ_u is the covariance matrix of the joint distribution of {u_t}. With the usual assumption that var(u_t) = σ², the marginal variance of Y_t is given by var(Y_t) = τ² + σ², and the marginal correlation function ρ(t,t*) for the case of equally correlated random effects (cf. Section 3.2) has form

ρ(t,t*) = corr(Y_t, Y_{t*}) = [σ² / (τ² + σ²)] ρ,   (4.1)

while for the case of autocorrelated random effects (cf. Section 3.3) it has form

ρ(t,t*) = corr(Y_t, Y_{t*}) = [σ² / (τ² + σ²)] ρ^{Σ_{k=t}^{t*-1} d_k}.   (4.2)
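The marginal correlation (4.2) can be verified numerically by building the marginal covariance matrix Σ_u + τ²I directly; the following is a minimal sketch (equal spacing, illustrative parameter values of our choosing).

```python
import numpy as np

tau2, sigma2, rho, T = 1.0, 3.0, 0.6, 6

# Marginal covariance of Y = (Y_1, ..., Y_T) under the latent AR(1) model with
# equal spacing: Sigma_u[i, j] = sigma^2 * rho^|i-j|, plus tau^2 on the diagonal.
h = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
Sigma_Y = sigma2 * rho**h + tau2 * np.eye(T)

sd = np.sqrt(np.diag(Sigma_Y))
Corr_Y = Sigma_Y / np.outer(sd, sd)

# Eq. (4.2) with d_k = 1: corr(Y_t, Y_{t+h}) = sigma^2/(tau^2 + sigma^2) * rho^h.
factor = sigma2 / (tau2 + sigma2)
```

Every off-diagonal entry of Corr_Y equals the latent autocorrelation ρ^h attenuated by the factor σ²/(τ² + σ²), as stated below (4.2).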
If the distances between time points are equal, then (4.2) is more conveniently written in terms of the lag h between observations as

ρ(h) = corr(Y_t, Y_{t+h}) = [σ² / (τ² + σ²)] ρ^h.

For both cases, note that the autocorrelations (4.1) and (4.2) are smaller than the corresponding ones assumed for the underlying latent process {u_t} by a factor of σ²/(τ² + σ²). For equally correlated random effects the marginal covariance matrix has form τ²I + σ²[(1-ρ)I + ρJ], implying equal marginal correlations between any two members Y_t and Y_{t*} of {Y_t}. (This can also be seen from (4.1), where the autocorrelations do not depend on t or t*.) Diggle et al. (2002, Sec. 5.2.2) call this a model with serial correlation plus measurement error.

Similar properties can be observed in the case of autocorrelated random effects: the basic structure of correlations decaying in absolute value with increasing distances between observation times (as measured by Σ d_k or h) is preserved marginally. However, the first-order Markov property of the underlying autoregressive process is not preserved in the marginal distribution of {Y_t}, which can be proved by calculating conditional distributions. For instance, for three (T = 3) equidistant time points, the conditional mean of Y_3 given Y_1 = y_1 and Y_2 = y_2 is equal to

E[Y_3 | y_1, y_2] = μ_3 + ( σ²ρ / [(τ² + σ²)² - (σ²ρ)²] ) ( τ²ρ [y_1 - μ_1] + [τ² + σ²(1 - ρ²)][y_2 - μ_2] )

and depends on y_1. It should be noted that in the case of independent random effects, with Σ_u = σ²I, marginally the Y_t's are also independent, but with overdispersed variances τ² + σ² relative to their conditional distribution. This case can be seen as a special case of the equally correlated model and of the autoregressive model when ρ = 0.
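The coefficients in the expression for E[Y_3 | y_1, y_2] can be checked against generic multivariate normal conditioning; a short sketch with illustrative parameter values:

```python
import numpy as np

tau2, sigma2, rho = 1.5, 2.0, 0.6

# Marginal covariance of (Y_1, Y_2, Y_3) for equidistant points under the
# latent AR(1) model: tau^2 I + sigma^2 R with R[i, j] = rho^|i-j|.
R = np.array([[1, rho, rho**2], [rho, 1, rho], [rho**2, rho, 1]])
S = tau2 * np.eye(3) + sigma2 * R

# Closed-form coefficients of (y_1 - mu_1) and (y_2 - mu_2) in E[Y_3 | y_1, y_2].
D = (tau2 + sigma2) ** 2 - (sigma2 * rho) ** 2
c1 = sigma2 * rho * tau2 * rho / D
c2 = sigma2 * rho * (tau2 + sigma2 * (1 - rho**2)) / D

# Generic multivariate-normal conditioning for comparison:
# E[Y_3 | Y_1, Y_2] = mu_3 + S[3,(1,2)] S[(1,2),(1,2)]^{-1} (y - mu).
coef = S[2, :2] @ np.linalg.inv(S[:2, :2])
```

Because the coefficient of (y_1 - μ_1) is nonzero whenever ρ ≠ 0 and τ² > 0, the marginal series is not first-order Markov.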
The traditional assumption in random intercepts models is to assume a common random effect u = u_t for all time points t; i.e., conditional on a N(0, σ²) random effect u, Y_t is N(μ_t + u, τ²) for t = 1, ..., T. For this case the marginal covariance matrix has form τ²I + σ²J. This can be derived directly, or inferred from the marginal correlation expressions (4.1) and (4.2) by setting ρ = 1, implying perfect correlation among the {u_t}. Hence the random intercepts model is a special case of the equally correlated or autoregressive model when ρ = 1. It implies a constant (exchangeable) marginal correlation of σ²/(τ² + σ²) between any two observations Y_t and Y_{t*}.

4.1.1 Analysis via Linear Mixed Models

In a GLMM we try to provide some structure for the unknown mean component μ_t by using covariates x_t. Let x_t'β be a linear predictor for μ_t, with β denoting a fixed effects parameter vector for the covariates x_t. Using an identity link, the series {Y_t} then follows a GLMM with conditional mean function E[Y_t | u_t] = x_t'β + u_t. The model can be written as Y_t = x_t'β + u_t + ε_t, where ε_t ~ N(0, τ²), independent of u_t. Then the models discussed here are special cases of mixed effects models (Verbeke and Molenberghs, 2000) with general matrix form Y = Xβ + Zu + ε. In our case Y = (Y_1, ..., Y_T)' is the time series vector and X = (x_1', ..., x_T')' is the overall design matrix with associated parameter β. The design matrix Z for the random effects u' = (u_1, ..., u_T) simplifies to the identity matrix I_T. The distributional assumption on the random effects is u ~ N(0, Σ_u), and they are independent of the N(0, τ²I) distributed errors ε. Exploiting this relationship, software for fitting models of this kind (i.e., correlated normal data with structured covariance matrix of form var(Y) = ZΣ_uZ' + τ²I) is readily available, for instance in the form of the SAS procedure proc mixed, where the equal correlation structure
and the autoregressive structure are only two out of many possible choices for the covariance matrix Σ_u of the random effects distribution. Mixed effects models are very popular for the regression analysis of shorter time series, like growth curve models or data from longitudinal studies. In Section 5.1 we illustrate an application by analyzing the motivating example of Section 3.1 about attitudes towards homosexual relationships, based on a normal approximation to the log odds.

4.1.2 Parameter Interpretation

Parameters in normal time series models retain their interpretation when averaging over the random effects distribution. The interpretation of β as the change in the mean for a change in the covariates is valid conditional on the random effects and also marginally. The random effects parameters only contribute to the variance-covariance structure of the marginal distribution, inducing overdispersion and correlation relative to the conditional assumptions.

4.2 Analysis for a Time Series of Counts

Suppose now that, conditional on time specific normal random effects {u_t}, the observations {Y_t} are independent counts, which we model as Poisson random variables with mean μ_t. Using a log link, explanatory variables x_t and correlated random effects {u_t}, we specify the conditional mean structure of a Poisson GLMM as

log(μ_t) = x_t'β + u_t,   t = 1, ..., T.   (4.3)

The correlation in the random effects allows the log-means to be correlated over time or space. The marginal likelihood corresponding to this model is given by

L(β, ψ; y) ∝ ∫_{R^T} Π_{t=1}^T μ_t^{y_t} exp{-μ_t} g(u; ψ) du
           = ∫_{R^T} exp{ Σ_{t=1}^T [y_t(x_t'β + u_t) - exp{x_t'β + u_t}] } g(u; ψ) du,
where g(u; ψ) is one of the random effects distributions of Chapter 3. In that case the integral is not tractable, and numerical methods such as the MCEM algorithm of Section 2.3 must be used to find maximum likelihood estimates of β and ψ. For this, the function Q̃ defined in (2.16) has form

Q̃(β | β^{(k-1)}) ∝ (1/m) Σ_{j=1}^m Σ_{t=1}^T [y_t(x_t'β + u_t^{(j)}) - exp{x_t'β + u_t^{(j)}}],

where u_t^{(j)} is the t-th element of the j-th generated sample u^{(j)} from the posterior distribution h(u | y; β^{(k-1)}, ψ^{(k-1)}). Note that here we discuss only the case of a generic time series {Y_t} with no replication, hence n = 1 (i.e., the index i is redundant) and n_1 = T in the general form presented in (2.16). If replications are available, or in the case where two time series differ in the fixed effects part but not in the random effects (e.g., have the same underlying latent process), then one simply needs to include the sum over the replicates as indicated in (2.16). Choosing one of the correlated random effects distributions of Chapter 3, the Gibbs sampling algorithms developed in Sections 3.4.1 or 3.4.2 can be used to generate the sample from h(u | y), with f(y_t) having the form of a Poisson density with mean μ_t.

4.2.1 Marginal Model Implied by the Poisson GLMM

As with the normal GLMMs before, marginal first and second moments can be obtained by integrating over the random effects distribution, although here the complete marginal distribution of Y_t is not tractable as it is in the normal case. The random effects appearing in model (4.3) imply that the conditional log-means {log(μ_t)} are random quantities. Assuming that the random effects {u_t} are normal with zero mean and variance var(u_t) = σ², the log-means have expectations {x_t'β} and variance σ². For two distinct time points t and t*, their correlation under an independence, equal correlation or autocorrelation assumption on the random effects is given by 0, ρ or ρ^{Σ_{k=t}^{t*-1} d_k}, respectively. (Remember that d_k denotes the time difference between two successive observations Y_k and Y_{k+1}.) On the original
scale, the means μ_t have expectation, variance and correlation given by

E[μ_t] = exp{x_t'β + σ²/2},
var(μ_t) = exp{2(x_t'β + σ²/2)} (e^{σ²} - 1),
corr(μ_t, μ_{t*}) = (e^{cov(u_t, u_{t*})} - 1) / (e^{σ²} - 1).

Plugging in cov(u_t, u_{t*}) = 0, σ²ρ or σ²ρ^{Σ_{k=t}^{t*-1} d_k} yields the marginal correlations among the means when assuming independent, equally correlated or autoregressive random effects, respectively.

4.2.1.1 Marginal distribution of Y_t

Now let's turn to the marginal distribution of Y_t itself, for which we can only derive moments. The marginal mean and variance of Y_t are given by

E[Y_t] = E[μ_t] = exp{x_t'β + σ²/2},   (4.4)
var(Y_t) = E[μ_t] + var(μ_t) = E[Y_t][1 + E[Y_t](e^{σ²} - 1)].

Hence the log of the marginal mean still follows a linear model with fixed effects parameters β, but with an additional offset of σ²/2 added to the intercept term. (This is not particular to the Poisson assumption, but is true for any loglinear random effects model of form (4.3) with more general random effects structure z_t'u_t; see Problem 13.42 in Agresti, 2002.) The marginal distribution of Y_t is not Poisson, since the variance exceeds the mean by a factor of [1 + E[Y_t](e^{σ²} - 1)]. The marginal variance is a quadratic function of the marginal mean. For two distinct time points t and t*, the marginal covariance between the observations Y_t and Y_{t*} is given by

cov(Y_t, Y_{t*}) = cov(μ_t, μ_{t*}) = E[Y_t]E[Y_{t*}] (e^{cov(u_t, u_{t*})} - 1).   (4.5)
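The variance identity above follows from the lognormal moments of μ_t and can be confirmed numerically; a minimal sketch with illustrative parameter values of our choosing:

```python
import numpy as np

xb, sigma2 = 0.5, 0.8              # illustrative x_t'beta and var(u_t)
m = np.exp(xb + sigma2 / 2)        # marginal mean E[Y_t], eq. (4.4)

# Lognormal moments of mu_t = exp(x_t'beta + u_t), u_t ~ N(0, sigma2):
E_mu = np.exp(xb + sigma2 / 2)
var_mu = np.exp(2 * xb + sigma2) * (np.exp(sigma2) - 1)

# var(Y_t) = E[mu_t] + var(mu_t) must equal E[Y_t][1 + E[Y_t](e^{sigma2} - 1)].
var_y = E_mu + var_mu
var_y_formula = m * (1 + m * (np.exp(sigma2) - 1))
```

The two variance expressions agree, and the excess of var(Y_t) over E[Y_t] shows the overdispersion relative to a Poisson distribution.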
In the case where the random effects {u_t} are assumed independent, the marginal covariance is zero. In longitudinal studies, usually each replicated time series has its own univariate random effect attached to it. For such a time series {Y_t}, assume a single common random effect u ~ N(0, σ²) shared by all observations in the series; i.e., in the notation used above, u = u_t for all t, and model (4.3) has form log(μ_t) = x_t'β + u. Then cov(u_t, u_{t*}) = var(u) = σ², and the marginal correlation between any two members of the time series is given by

corr(Y_t, Y_{t*}) = (E[Y_t]E[Y_{t*}])^{1/2} (e^{σ²} - 1) / ( [1 + E[Y_t](e^{σ²} - 1)]^{1/2} [1 + E[Y_{t*}](e^{σ²} - 1)]^{1/2} ).   (4.6)

This is the exchangeable correlation structure implied by a random intercepts Poisson GLMM (see, e.g., Agresti, 2002, pages 564 and 575). In Chapter 2 we motivated and proposed correlated random effects {u_t} to facilitate other correlation structures. We will now derive marginal correlation properties for a time series of counts based on our conditional Poisson GLMM approach using equally correlated or autoregressive random effects. This is easily done by plugging in for cov(u_t, u_{t*}) in (4.5) above. The equal correlation assumption cov(u_t, u_{t*}) = σ²ρ leads to the marginal structure

corr(Y_t, Y_{t*}) = (E[Y_t]E[Y_{t*}])^{1/2} (e^{σ²ρ} - 1) / ( [1 + E[Y_t](e^{σ²} - 1)]^{1/2} [1 + E[Y_{t*}](e^{σ²} - 1)]^{1/2} ),   (4.7)

still implying equal (but possibly negative) correlations. The autoregressive random effects approach with cov(u_t, u_{t*}) = σ²ρ^{Σ_{k=t}^{t*-1} d_k} leads to a correlation function decaying with time:

corr(Y_t, Y_{t*}) = (E[Y_t]E[Y_{t*}])^{1/2} (e^{σ²ρ^{Σ_{k=t}^{t*-1} d_k}} - 1) / ( [1 + E[Y_t](e^{σ²} - 1)]^{1/2} [1 + E[Y_{t*}](e^{σ²} - 1)]^{1/2} ).
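The implied marginal correlation functions can be evaluated directly from (4.5) and the marginal variance formula; the following small sketch (illustrative parameter values of our choosing) computes the implied autocorrelation function under autoregressive random effects.

```python
import numpy as np

def poisson_glmm_corr(m_t, m_s, cov_u, sigma2):
    """Marginal corr(Y_t, Y_s) implied by the Poisson GLMM:
    cov_u = cov(u_t, u_s), e.g. sigma2 * rho**h for lag h under AR(1)."""
    num = np.sqrt(m_t * m_s) * (np.exp(cov_u) - 1)
    den = np.sqrt((1 + m_t * (np.exp(sigma2) - 1)) *
                  (1 + m_s * (np.exp(sigma2) - 1)))
    return num / den

sigma2, rho, m = 0.5, 0.7, 2.0       # illustrative values; m = E[Y_t] = E[Y_s]
acf = [poisson_glmm_corr(m, m, sigma2 * rho**h, sigma2) for h in range(1, 6)]
```

Setting cov_u = σ² recovers the exchangeable random-intercepts correlation (4.6), and cov_u = σ²ρ recovers the equal-correlation structure (4.7).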
In the case of equally spaced observations (d_k = 1 for all k), this is more conveniently written in terms of the lag h between two observations:

corr(Y_t, Y_{t+h}) = (E[Y_t]E[Y_{t+h}])^{1/2} (e^{σ²ρ^h} - 1) / ( [1 + E[Y_t](e^{σ²} - 1)]^{1/2} [1 + E[Y_{t+h}](e^{σ²} - 1)]^{1/2} ).   (4.8)

Note that if ρ = 1, i.e., perfect correlation between random effects, all correlation structures reduce to the random intercept model with correlation structure (4.6). However, with |ρ| < 1 and h → ∞, (4.8) accommodates decaying correlations, and with ρ ≤ 0, (4.7) accommodates negative correlation. In Section 5.3 we will fit a Poisson GLMM to a time series of counts and use the marginal properties derived here to assess and interpret the regression model.

4.2.1.2 Negative Binomial GLMMs

An alternative to the Poisson assumption as the conditional distribution for the counts is to use a negative binomial distribution. The negative binomial distribution per se already allows for overdispersion relative to the mean. A second source of overdispersion is then introduced by regarding the (log) mean of a negative binomial random variable as a normal mixture. Correlated random effects allow these means to be connected over time. Booth et al. (2004) look at negative binomial GLMMs with independent (over time) random effects. The anchovy larvae data analyzed there is a time series of correlated counts, and autoregressive random effects seem an appropriate alternative to the independent ones used by Booth et al. Using the parametrization of the negative binomial distribution as discussed in Sec. 13.4 of Agresti (2002), let Y_t be negative binomial with mean μ_t and variance μ_t + μ_t²/k, conditional on the random effect u_t. (For fixed k, the negative binomial distribution is a member of the exponential family of distributions.) We consider cases where the dispersion parameter k is the same for all observations. As in the Poisson GLMM presented before, we propose the following loglinear model for the
conditional mean of Y_t given u_t:

log(E[Y_t | u_t]) = log(μ_t) = x_t'β + u_t,   t = 1, ..., T,

where {u_t} follows one of the random processes discussed in Chapter 3. Marginally, this leads to the same expectations, variances and covariances for the conditional log-means and means as discussed in the Poisson model above. Furthermore, since (4.4) holds for any loglinear model, it also holds for the negative binomial loglinear model, and the marginal means of the Poisson GLMM and negative binomial GLMM coincide. However, the marginal variance under the negative binomial assumption is given by

var(Y_t) = E[μ_t + μ_t²/k] + var(μ_t) = E[Y_t][ 1 + E[Y_t]( (k+1)/k · e^{σ²} - 1 ) ],

which for k → ∞ approaches the variance of the Poisson GLMM. The difference between the variance under a negative binomial assumption and a Poisson assumption is (e^{σ²}/k)(E[Y_t])². Similarly, for each one of the random effects structures discussed under the Poisson GLMM, the formulas for the marginal correlations presented for the Poisson GLMM hold true when e^{σ²} in the denominator of each equation is replaced by ((k+1)/k)e^{σ²}. When these implied marginal properties are more plausible than the corresponding ones from a Poisson GLMM, as judged for instance by a comparison of the approximated maximized likelihoods or by a comparison of empirical estimates to model based estimates, then the negative binomial GLMM is a relevant alternative. Note that, as with any GLMM, the negative binomial GLMM results from the hierarchy

Y_t | u_t ~ ind. neg. bin.(μ_t, k),   t = 1, ..., T,
(u_1, ..., u_T) ~ N(0, Σ_u),
and the marginal correlation between the counts {Y_t} arises because of the correlations in the underlying random effects u_t, which appear in the model for the log-means.

4.2.2 Parameter Interpretation

As long as {u_t} is a mean-stationary random process (we assume a mean of zero throughout), it follows from (4.4) that all parameters except the intercept have equal interpretations conditionally on the random effects and marginally. For a Gaussian random process with variance σ², the intercept itself is set off by a factor of σ²/2. Hence all parameters except the intercept can be interpreted as effects on the conditional or marginal log-mean. In particular, for any member β_j of β, e^{β_j} is the ratio of two conditional or marginal means after a one unit change in the covariate associated with β_j.

4.3 Analysis for a Time Series of Binomial or Binary Observations

Suppose that, conditional on a time specific normal random effect u_t, the observations {Y_st}_{s=1}^{n_t} are independent and identically distributed binary random variables with conditional success probability π_t(u_t) = P(Y_st = 1 | u_t), t = 1, ..., T. Consequently, the sum Y_t = Σ_{s=1}^{n_t} Y_st has a conditional binomial(n_t, π_t(u_t)) distribution. Furthermore, given random effects u_t and u_{t*} at two different time points t and t*, Y_t and Y_{t*} are conditionally independent. Using a logit link, time-specific explanatory variables {x_t} and correlated random effects {u_t}, we specify the conditional mean structure of a binomial GLMM as

logit(π_t(u_t)) = x_t'β + u_t,   t = 1, ..., T.   (4.9)

Correlated random effects allow correlation of the conditional log odds over different time points or locations. For observed data y = (y_1, ..., y_T), the marginal
likelihood corresponding to this model is given by

L(β, ψ; y) ∝ ∫_{R^T} Π_{t=1}^T [π_t(u_t)]^{y_t} [1 - π_t(u_t)]^{n_t - y_t} g(u; ψ) du
           = ∫_{R^T} exp{ Σ_{t=1}^T y_t(x_t'β + u_t) } Π_{t=1}^T (1 + exp{x_t'β + u_t})^{-n_t} g(u; ψ) du,

where g(u; ψ) is one of the random effects distributions of Chapter 3. The integral is not tractable, and numerical methods such as the MCEM algorithm of Section 2.3 must be used to find maximum likelihood estimates of β and ψ. The function Q̃ defined in (2.16) now has form

Q̃(β | β^{(k-1)}) ∝ (1/m) Σ_{j=1}^m Σ_{t=1}^T [y_t(x_t'β + u_t^{(j)}) - n_t log(1 + exp{x_t'β + u_t^{(j)}})],

where u_t^{(j)} is the t-th element of the j-th generated sample u^{(j)} from the posterior distribution h(u | y). As before, note that here we discuss only the case of a generic time series {Y_t} with no replication, hence n = 1 (i.e., the index i is redundant) and n_1 = T in the general form presented in (2.16). Again, if replications are available, or in the case where two time series differ in the fixed effects part but not in the random effects (e.g., have the same underlying latent process), then one simply needs to include the sum over the replicates as indicated in (2.16). An example where we assumed that two series differ in their fixed effects parameters but share the same underlying latent process {u_t} is the motivating example of Section 3.1, with a binomial time series for each of white and black respondents. Choosing one of the correlated random effects distributions of Chapter 3, the Gibbs sampling algorithms developed in Sections 3.4.1 or 3.4.2 can be used to generate the sample from h(u | y), with f(y_t) having the form of a binomial(n_t, π_t(u_t)) density.
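The Monte Carlo function Q̃ is cheap to evaluate once posterior draws are available; the following sketch uses simulated stand-in draws in place of actual Gibbs output, with all inputs illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, m = 6, 200
n = np.full(T, 10)                  # binomial sample sizes n_t (illustrative)
x = rng.standard_normal((T, 2))     # covariate vectors x_t
y = rng.integers(0, 11, size=T)     # observed counts y_t <= n_t
u = rng.standard_normal((m, T))     # stand-in for posterior draws u^{(j)}

def Q(beta):
    """Monte Carlo E-step objective for the binomial GLMM (up to a constant):
    (1/m) sum_j sum_t [y_t (x_t'b + u_t^(j)) - n_t log(1 + e^{x_t'b + u_t^(j)})]."""
    eta = x @ beta + u              # m x T matrix of linear predictors
    return np.mean(np.sum(y * eta - n * np.logaddexp(0.0, eta), axis=1))
```

Each term y_t η - n_t log(1 + e^η) is concave in η, and η is affine in β, so Q̃ is concave in β and the M-step is a well-behaved maximization.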
4.3.1 Marginal Model Implied by the Binomial GLMM

Marginal properties are harder to derive than in the normal or Poisson case because the conditional mean $\pi_t(u_t)$ is not a linear or exponential function of the random effects. Assuming zero-mean random effects $\{u_t\}$ with variance $\operatorname{var}(u_t) = \sigma^2$, the conditional log odds $\{\operatorname{logit}(\pi_t(u_t))\}$ have means $\{x_t'\beta\}$ and variance $\sigma^2$. For two distinct time points $t$ and $t^*$, the correlation between the conditional log odds at times $t$ and $t^*$ under independence, equal-correlation or autocorrelation assumptions on the random effects is given by $0$, $\rho$ or $\rho^{\sum_{k=t}^{t^*-1} d_k}$, respectively. We will refer to $E[\operatorname{logit}(\pi_t(u_t))] = x_t'\beta$ as the unconditional or expected log odds, ones that do not depend on the random effects. It is perhaps more natural to investigate the unconditional or expected odds of success, $E[\pi_t(u_t)/(1 - \pi_t(u_t))]$, since interpretation is on the natural scale. They are given by $\exp\{x_t'\beta + \sigma^2/2\}$, and any member of $\exp\{\beta\}$ can be interpreted as the change in the expected (i.e., averaged over random effects) odds of success for a unit change in the corresponding member of $x_t$. Alternatively, $\log(\pi_t^M/(1 - \pi_t^M)) \approx x_t'\beta/\sqrt{1 + \tilde{\sigma}^2}$ (derived in subsequent sections) is the logit of the marginal probabilities implied by the conditional model, and the effects of parameters are seen to be downweighted when using this function as the quantity of interest. Often this is the preferred measure, and we will derive it in the next section. Further discussion of the interpretation of parameters in GLMMs with time-dependent random effects is provided in Section 4.3.3.

Let's turn now to the marginal distribution of $Y_t$, the sum of $n_t$ binary variables $Y_{st}$, and their dependence structure over time. At time $t$, the binary variables $Y_{st}$ have marginal mean $\pi_t^M = E[Y_{st}] = E[\pi_t(u_t)]$, variance $\operatorname{var}(Y_{st}) = \pi_t^M(1 - \pi_t^M)$ and constant covariance $\operatorname{cov}(Y_{st}, Y_{s^*t}) = \operatorname{var}(\pi_t(u_t))$ for $s \neq s^*$, which is a function of $\sigma$.
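These marginal moment statements can be checked by simulation. The sketch below (all names and parameter values are hypothetical) simulates, for a single time point with an intercept-only linear predictor, binary observations sharing one random effect and confirms that their sum is overdispersed relative to a binomial with the same marginal mean:

```python
import math
import random

random.seed(1)

def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))

beta0, sigma, n_t, n_sim = 0.0, 1.0, 20, 20000
sums = []
for _ in range(n_sim):
    u = random.gauss(0.0, sigma)          # shared random effect at time t
    p = logistic(beta0 + u)               # conditional success probability
    y_t = sum(1 for _ in range(n_t) if random.random() < p)
    sums.append(y_t)

mean_y = sum(sums) / n_sim
var_y = sum((s - mean_y) ** 2 for s in sums) / (n_sim - 1)
pi_m = mean_y / n_t                       # estimate of the marginal probability
binom_var = n_t * pi_m * (1.0 - pi_m)     # variance if Y_t were binomial
print(mean_y, var_y, binom_var)           # var_y clearly exceeds binom_var
```

The excess of `var_y` over `binom_var` is the term $n_t(n_t-1)\operatorname{var}(\pi_t(u_t))$ discussed below.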
By sharing a common random effect $u_t$ at time $t$, observations $\{Y_{st}\}_{s=1}^{n_t}$ are marginally dependent, with an exchangeable correlation structure. Correlated random effects
$\{u_t\}$ induce a second, time-related dependency: for binary observations $Y_{st}$ and $Y_{s^*t^*}$ at two different time points $t$ and $t^*$, $\operatorname{cov}(Y_{st}, Y_{s^*t^*}) = \operatorname{cov}(\pi_t(u_t), \pi_{t^*}(u_{t^*}))$, which depends on the assumed covariance of the random effects $u_t$ and $u_{t^*}$.

As a consequence of the marginal dependence among the binary variables at a common time point $t$ (within-type dependency), their sum $Y_t = \sum_{s=1}^{n_t} Y_{st}$ shows overdispersion relative to a binomial random variable. Its mean and variance are given by

$$E[Y_t] = n_t \pi_t^M, \quad \operatorname{var}(Y_t) = n_t \pi_t^M (1 - \pi_t^M) + n_t(n_t - 1)\operatorname{var}(\pi_t(u_t)),$$

which can be evaluated by using methods of the next section. As a consequence of the marginal dependency between binary variables at times $t$ and $t^*$ (between-type dependency), their respective sums $Y_t$ and $Y_{t^*}$ are correlated according to

$$\operatorname{corr}(Y_t, Y_{t^*}) = n_t n_{t^*} \operatorname{cov}(\pi_t(u_t), \pi_{t^*}(u_{t^*})) / [\operatorname{var}(Y_t)\operatorname{var}(Y_{t^*})]^{1/2},$$

which again can be evaluated using methods of the next section.

4.3.2 Approximation Techniques for Marginal Moments

4.3.2.1 Approximation based on Taylor series expansions

To evaluate moments of the marginal distribution of $Y_t$, we will first use a second-order Taylor series expansion of $\pi_t(u_t)$ around the mean $E[u_t] = 0$ of the random effect $u_t$. It is given by

$$\pi_t(u_t) \approx \frac{\exp\{x_t'\beta\}}{1 + \exp\{x_t'\beta\}} + \frac{\exp\{x_t'\beta\}}{(1 + \exp\{x_t'\beta\})^2}\, u_t + \frac{\exp\{x_t'\beta\}(1 - \exp\{x_t'\beta\})}{2(1 + \exp\{x_t'\beta\})^3}\, u_t^2.$$
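This expansion is easy to verify numerically. In the sketch below (names hypothetical), the coefficients are written via the logistic derivatives $\pi' = \pi(1-\pi)$ and $\pi'' = \pi(1-\pi)(1-2\pi)$, which equal the displayed expressions; the quadratic approximation is accurate for small $|u_t|$ but degrades quickly as $|u_t|$ grows:

```python
import math

def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def taylor2(eta, u):
    """Second-order Taylor expansion of logistic(eta + u) around u = 0."""
    p0 = logistic(eta)
    d1 = p0 * (1.0 - p0)                     # exp(eta)/(1+exp(eta))^2
    d2 = p0 * (1.0 - p0) * (1.0 - 2.0 * p0)  # exp(eta)(1-exp(eta))/(1+exp(eta))^3
    return p0 + d1 * u + 0.5 * d2 * u * u

eta = 0.5
print(abs(taylor2(eta, 0.1) - logistic(eta + 0.1)))  # tiny for small u
print(abs(taylor2(eta, 2.0) - logistic(eta + 2.0)))  # substantial for large u
```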
Using this expansion, we can approximate the marginal mean, variance and covariance of the conditionally specified success probabilities:

$$E[\pi_t(u_t)] \approx \frac{\exp\{x_t'\beta\}}{1 + \exp\{x_t'\beta\}} \left[ 1 + \frac{1 - \exp\{x_t'\beta\}}{(1 + \exp\{x_t'\beta\})^2} \frac{\sigma^2}{2} \right] \qquad (4.10)$$

$$\operatorname{var}(\pi_t(u_t)) \approx \left( \frac{\exp\{x_t'\beta\}}{(1 + \exp\{x_t'\beta\})^2} \right)^2 \left[ \sigma^2 + \left( \frac{1 - \exp\{x_t'\beta\}}{1 + \exp\{x_t'\beta\}} \right)^2 \frac{\sigma^4}{2} \right] \qquad (4.11)$$

$$\operatorname{cov}(\pi_t(u_t), \pi_{t^*}(u_{t^*})) \approx \frac{\exp\{(x_t + x_{t^*})'\beta\}}{(1 + \exp\{x_t'\beta\})^2 (1 + \exp\{x_{t^*}'\beta\})^2} \left[ \operatorname{cov}(u_t, u_{t^*}) + \frac{(1 - \exp\{x_t'\beta\})(1 - \exp\{x_{t^*}'\beta\})}{(1 + \exp\{x_t'\beta\})(1 + \exp\{x_{t^*}'\beta\})} \frac{\operatorname{cov}(u_t^2, u_{t^*}^2)}{4} \right] \qquad (4.12)$$

For the last two expressions we used the additional assumptions that $E[u_t^3] = 0$ and $E[u_t^4] = 3\sigma^4$ for all $t$, which, for instance, hold for the normal distribution. With independent random effects, $\operatorname{cov}(u_t^2, u_{t^*}^2) = 0$ and the covariance between the success probabilities is zero. Using correlated normal random effects, $\operatorname{cov}(u_t^2, u_{t^*}^2)$ equals $2\operatorname{cov}(u_t, u_{t^*})^2$, i.e., $2\sigma^4\rho^2$ for equally correlated random effects and $2\sigma^4\rho^{2\sum_{k=t}^{t^*-1} d_k}$ for autoregressive random effects, which simplifies to $2\sigma^4\rho^{2h}$ for equally spaced observations $h$ units apart. These results are derived by evaluating the joint moment generating function of the bivariate normal distribution of $(u_t, u_{t^*})$.

Using the approximate expressions for the variance and covariance given in (4.11) and (4.12), the correlation between two success probabilities at different times $t$ and $t^*$ is approximated by

$$\operatorname{corr}(\pi_t(u_t), \pi_{t^*}(u_{t^*})) \approx \frac{\operatorname{cov}(u_t, u_{t^*}) + \dfrac{(1 - \exp\{x_t'\beta\})(1 - \exp\{x_{t^*}'\beta\})}{(1 + \exp\{x_t'\beta\})(1 + \exp\{x_{t^*}'\beta\})} \dfrac{\operatorname{cov}(u_t^2, u_{t^*}^2)}{4}}{\sigma^2 \left[ 1 + \left( \dfrac{1 - \exp\{x_t'\beta\}}{1 + \exp\{x_t'\beta\}} \right)^2 \dfrac{\sigma^2}{2} \right]^{1/2} \left[ 1 + \left( \dfrac{1 - \exp\{x_{t^*}'\beta\}}{1 + \exp\{x_{t^*}'\beta\}} \right)^2 \dfrac{\sigma^2}{2} \right]^{1/2}}.$$
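For small $\sigma$, approximations (4.10) and (4.11) can be checked against straightforward Monte Carlo integration over the random effects distribution. A sketch (all names and parameter values hypothetical):

```python
import math
import random

random.seed(2)

def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def approx_mean(eta, sigma):
    """Approximation (4.10) to E[pi_t(u_t)]."""
    e = math.exp(eta)
    return (e / (1 + e)) * (1 + (1 - e) / (1 + e) ** 2 * sigma ** 2 / 2)

def approx_var(eta, sigma):
    """Approximation (4.11) to var(pi_t(u_t))."""
    e = math.exp(eta)
    c = (1 - e) / (1 + e)
    return (e / (1 + e) ** 2) ** 2 * (sigma ** 2 + c ** 2 * sigma ** 4 / 2)

eta, sigma, m = 0.3, 0.5, 100000
draws = [logistic(eta + random.gauss(0.0, sigma)) for _ in range(m)]
mc_mean = sum(draws) / m
mc_var = sum((d - mc_mean) ** 2 for d in draws) / (m - 1)
print(approx_mean(eta, sigma), mc_mean)   # close for small sigma
print(approx_var(eta, sigma), mc_var)
```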
Plugging in the expressions for $\operatorname{cov}(u_t, u_{t^*})$ and $\operatorname{cov}(u_t^2, u_{t^*}^2)$ according to the random effects assumption yields the approximation of the implied marginal correlation between success probabilities at times $t$ and $t^*$. Many authors (e.g., Zeger, Liang and Albert 1988) only use a first-order Taylor expansion, for which the second terms in the large square brackets of (4.10)-(4.12) vanish. Then the approximation for the correlation between the conditional success probabilities at times $t$ and $t^*$ simplifies to $\operatorname{cov}(u_t, u_{t^*})/\sigma^2$, and is $\rho$ for equally correlated random effects and $\rho^{\sum_{k=t}^{t^*-1} d_k}$ for autoregressive random effects. In that case the success probabilities directly inherit the correlation properties from the underlying latent random process.

4.3.2.2 Cumulative Gaussian approximation

An alternative approximation for $E[\pi_t(u_t)]$ is given by Zeger, Liang and Albert (1988), who use a cumulative Gaussian approximation to the logistic function (Johnson and Kotz 1970, p. 6) to derive

$$E[\pi_t(u_t)] \approx \frac{1}{1 + \exp\{-x_t'\beta / \sqrt{1 + (c\sigma)^2}\}}, \qquad (4.13)$$

where $c$ is a constant equal to $16\sqrt{3}/(15\pi)$. They use this expression for the marginal mean, together with an approximation for the marginal covariance matrix based on a first-order Taylor series expansion of $\pi_t(u_t)$ (outlined above), to motivate a GEE approach to fitting GLMMs.

Approximations based on the Taylor series expansion are only accurate for random effects close to their mean value of 0, as measured by small values of $\sigma$. Figure 4-1 plots the approximations of the marginal probability $\pi_t^M = E[\pi_t(u_t)]$ given in (4.10) over a range of $(-4, 4)$ for the fixed part $x_t'\beta$ of the linear predictor, for various values of $\sigma$. The dotted curve in Figure 4-1 corresponds to $\sigma = 2.5$.
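The contrast between the two approximations for large $\sigma$ can be sketched as follows (names hypothetical; $c = 16\sqrt{3}/(15\pi) \approx 1/1.7$). The cumulative Gaussian approximation (4.13) stays monotone in the linear predictor, while the Taylor-based approximation (4.10) does not once $\sigma$ is large:

```python
import math

C = 16.0 * math.sqrt(3.0) / (15.0 * math.pi)   # approximately 1/1.7

def zla_mean(eta, sigma):
    """Cumulative Gaussian approximation (4.13) of Zeger, Liang and Albert."""
    return 1.0 / (1.0 + math.exp(-eta / math.sqrt(1.0 + (C * sigma) ** 2)))

def taylor_mean(eta, sigma):
    """Second-order Taylor approximation (4.10)."""
    e = math.exp(eta)
    return (e / (1 + e)) * (1 + (1 - e) / (1 + e) ** 2 * sigma ** 2 / 2)

sigma = 2.5
print(taylor_mean(0.0, sigma), taylor_mean(0.5, sigma))  # drops: non-monotone
print(zla_mean(0.0, sigma), zla_mean(0.5, sigma))        # increases: monotone
```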
It clearly shows that the approximation is useless for such a value of $\sigma$, since the approximated marginal probability loses the fundamental property of monotonicity in the linear predictor. For even larger values of $\sigma$, formula (4.10) shows that the approximation can even be greater than 1 or less than 0. This problem is also not alleviated by including further terms in the Taylor series, or by expanding the series around a nonzero point, with the sign depending on whether one expects a positive or negative value for $u_t$. Unfortunately, large values of $\sigma$ are the rule rather than the exception in models with autoregressive random effects. (See the remark by Diggle et al. 2002, p. 239, and the examples in Chapter 5.)

On the other hand, (4.13) does not suffer these drawbacks. As $\sigma$ increases, the conditional success probabilities have distributions increasingly concentrated near 0 and near 1. Averaging over these conditional probabilities, we expect a marginal probability of 0.5, which is the limit of (4.13) when $\sigma^2 \to \infty$. However, approximations for the marginal variance and covariance are not easy to derive with the cumulative Gaussian approximation of the logistic function, but these are the quantities we are interested in for a comparison of model-based estimates and sample-based estimates. The next section mentions a connection between logit and probit models which can be used to calculate all desired marginal properties for cases where $\sigma$ is large.

4.3.2.3 Marginal results through probit models

In a longitudinal setting and for long sequences of binary data, Smith and Diggle (1998) propose models similar to those presented here, using correlated random effects, but pursue a different fitting approach. They derive marginal moments implied by their model and construct a marginal covariance matrix of observations, all with the intention to use the GEE methodology for estimation. Also, they use a probit link,

$$\Phi^{-1}(\pi_t(u_t)) = x_t'\tilde{\beta} + u_t, \qquad (4.14)$$
Figure 4-1: Approximated marginal probabilities for the fixed-part predictor value $x'\beta$ ranging from $-4$ to $4$ in a logit model. Approximations are based on a second-order Taylor series expansion (first panel), the probit connection (second panel) and Monte Carlo integration over the random effects distribution (third panel). The 4 lines in each panel correspond to $\sigma = 1, 1.5, 2, 2.5$, with the dotted line corresponding to $\sigma = 2.5$.

to model the conditional success probability $\pi_t(u_t)$ at time $t$, where $\Phi$ is the standard normal cdf. To distinguish these from their logit model counterparts, we use $\tilde{\beta}$ and $\tilde{\sigma}$ to denote the fixed effects parameters and variance component in the probit model. Unlike the variance, the correlation is a scale-free measurement. Hence the correlation among the conditional success probabilities, as measured by $\rho$, is the same whether measured on the logit scale or the probit scale. Thus no new parameter for describing the correlation in the probit model is needed. The advantage of a probit link model is that marginal means and covariances can be calculated explicitly and no approximation is needed. Smith and Diggle (1998) employ the threshold interpretation to derive these exact results. Using similar arguments, we now intend to derive alternative approximations of marginal moments and correlations for our logit link GLMMs with correlated random effects. These results can then be compared to, or used instead of, the approximate results for the logit link derived with the Taylor expansion given above.

The threshold (or latent variable) interpretation states that $Y_{st} = 1$ if and only if $T_{st} < c$ for a suitable threshold value $c$, where $\{T_{st}\}$ are independent $N(0, 1)$ latent variables (and independent of any random effect in the linear predictor).
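Under hypothetical parameter values, the threshold construction and the closed-form marginal mean $\Phi(x_t'\tilde{\beta}/\sqrt{1+\tilde{\sigma}^2})$ derived from it below can be checked by simulation (a sketch; none of the names come from the dissertation):

```python
import math
import random

random.seed(3)

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

xb, sigma_t, m = 0.8, 1.2, 200000   # x_t' beta-tilde and sigma-tilde (hypothetical)
hits = 0
for _ in range(m):
    u = random.gauss(0.0, sigma_t)        # time-specific random effect
    t_latent = random.gauss(0.0, 1.0)     # latent N(0,1) threshold variable
    if t_latent < xb + u:                 # Y_st = 1 iff T_st < x'beta-tilde + u_t
        hits += 1

mc_prob = hits / m
exact = std_normal_cdf(xb / math.sqrt(1.0 + sigma_t ** 2))
print(mc_prob, exact)   # the two agree up to Monte Carlo error
```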
Then, under (4.14),

$$E[Y_{st}] = P(Y_{st} = 1) = P(T_{st} - u_t < x_t'\tilde{\beta}) = \Phi\!\left( \frac{x_t'\tilde{\beta}}{\sqrt{1 + \tilde{\sigma}^2}} \right),$$

since $T_{st} - u_t$ has a $N(0, 1 + \tilde{\sigma}^2)$ distribution. The threshold interpretation, although mathematically convenient, may sometimes seem artificial. We next give an exact proof of the above result without using the threshold interpretation. Similar proofs can be constructed for all results derived in this section. Note that

$$E[Y_{st}] = E[\pi_t(u_t)] = \int \Phi(x_t'\tilde{\beta} + u_t)\, g(u_t)\, du_t.$$

Let $g(u_t)$ denote the $N(0, \tilde{\sigma}^2)$ density of the random effect $u_t$ and let $\varphi$ denote the standard normal density. Writing $\Phi(x_t'\tilde{\beta} + u_t) = \int_{-\infty}^{x_t'\tilde{\beta}} \varphi(z + u_t)\, dz$ and interchanging the order of integration,

$$E[\pi_t(u_t)] = \int_{-\infty}^{x_t'\tilde{\beta}} \left[ \int \varphi(z + u_t)\, g(u_t)\, du_t \right] dz.$$
Hence the inner integral gives the marginal density of $z$, which is $N(0, 1 + \tilde{\sigma}^2)$. Then

$$E[\pi_t(u_t)] = \int_{-\infty}^{x_t'\tilde{\beta}} \frac{1}{\sqrt{1 + \tilde{\sigma}^2}}\, \varphi\!\left( \frac{z}{\sqrt{1 + \tilde{\sigma}^2}} \right) dz = \Phi\!\left( \frac{x_t'\tilde{\beta}}{\sqrt{1 + \tilde{\sigma}^2}} \right),$$

giving the above result.

Using the threshold interpretation again, the marginal joint moment of two binary variables $Y_{st}$ and $Y_{s^*t}$ observed at the same time $t$ is given by

$$E[Y_{st} Y_{s^*t}] = P(T_{st} - u_t < x_t'\tilde{\beta},\; T_{s^*t} - u_t < x_t'\tilde{\beta}) = \Phi_2\big( (x_t'\tilde{\beta}, x_t'\tilde{\beta})',\, Q(t, t) \big),$$

where

$$Q(t, t^*) = \begin{bmatrix} 1 + \tilde{\sigma}^2 & \operatorname{cov}(u_t, u_{t^*}) \\ \operatorname{cov}(u_t, u_{t^*}) & 1 + \tilde{\sigma}^2 \end{bmatrix} \qquad (4.15)$$

and $\Phi_2((a, b)', Q(t, t^*))$ is the probability that a bivariate zero-mean normal random variable with covariance matrix $Q(t, t^*)$ is less than $(a, b)'$. Summing up, the marginal properties of the binary variables $\{Y_{st}\}_{s=1}^{n_t}$ at time $t$ are explicitly given by

$$\tilde{\pi}_t^M = \Phi\!\left( \frac{x_t'\tilde{\beta}}{\sqrt{1 + \tilde{\sigma}^2}} \right), \quad \operatorname{var}(Y_{st}) = \tilde{\pi}_t^M (1 - \tilde{\pi}_t^M), \quad \operatorname{cov}(Y_{st}, Y_{s^*t}) = \Phi_2\big( (x_t'\tilde{\beta}, x_t'\tilde{\beta})',\, Q(t, t) \big) - (\tilde{\pi}_t^M)^2. \qquad (4.16)$$
For observations $Y_{st}$ and $Y_{s^*t^*}$ at two different time points,

$$P(Y_{st} Y_{s^*t^*} = 1) = P(T_{st} - u_t < x_t'\tilde{\beta},\; T_{s^*t^*} - u_{t^*} < x_{t^*}'\tilde{\beta}) = \Phi_2\big( (x_t'\tilde{\beta}, x_{t^*}'\tilde{\beta})',\, Q(t, t^*) \big).$$

This leads to a marginal covariance of

$$\operatorname{cov}(Y_{st}, Y_{s^*t^*}) = \Phi_2\big( (x_t'\tilde{\beta}, x_{t^*}'\tilde{\beta})',\, Q(t, t^*) \big) - \tilde{\pi}_t^M \tilde{\pi}_{t^*}^M$$

between two binary observations at time points $t$ and $t^*$. Plugging in different forms for the covariance of the random effects in $Q(t, t^*)$ results in different covariance structures among the two binary observations.

Now let $Y_t = \sum_{s=1}^{n_t} Y_{st}$ represent the sum over all $n_t$ (marginally dependent) binary variables at time $t$. Then, similar to the logit link results presented before, $Y_t$ is overdispersed relative to a binomial, with mean and variance

$$E[Y_t] = n_t \tilde{\pi}_t^M, \quad \operatorname{var}(Y_t) = n_t \tilde{\pi}_t^M (1 - \tilde{\pi}_t^M) + n_t(n_t - 1)\left[ \Phi_2\big( (x_t'\tilde{\beta}, x_t'\tilde{\beta})',\, Q(t, t) \big) - (\tilde{\pi}_t^M)^2 \right].$$

The correlation between random effects implies a correlation between outcomes at different time points $t$ and $t^*$, resulting in a correlation between $Y_t$ and $Y_{t^*}$ of the form

$$\operatorname{corr}(Y_t, Y_{t^*}) = \frac{n_t n_{t^*} \left[ \Phi_2\big( (x_t'\tilde{\beta}, x_{t^*}'\tilde{\beta})',\, Q(t, t^*) \big) - \tilde{\pi}_t^M \tilde{\pi}_{t^*}^M \right]}{[\operatorname{var}(Y_t)\operatorname{var}(Y_{t^*})]^{1/2}}. \qquad (4.17)$$

4.3.2.4 Logit-probit connection

In regular GLMs, parameter estimates in logit models are roughly 1.6 times those in probit models (Agresti, 2002, p. 246). This number derives from the fact
that with a linear predictor of the form $\alpha + \beta x$, the rate of change in the success probability $\pi(x)$ is highest at $x = -\alpha/\beta$ for both the probit model (i.e., $\pi(x)$ has the form of a normal cdf) and the logit model (i.e., $\pi(x)$ has the form of a logistic cdf). At this point $\pi(x)$ is equal to $1/2$ for both models, and the rate of change $d\pi(x)/dx$ is equal to $0.4\beta$ for the probit model and $0.25\beta$ for the logit model. Hence we get an equal rate of change (at $x = -\alpha/\beta$) in both models when the logit $\beta$ is $0.4/0.25 = 1.6$ times the probit $\beta$. When comparing the standard deviations implied by the normal cdf (which is equal to $1/|\beta|$) and the logistic cdf (which is equal to $\pi/(\sqrt{3}\,|\beta|)$), this factor between the parameters in the two models increases to 1.8.

We exploited this connection between parameters in logit and probit models to construct Figure 4-2, where we compare conditional success probabilities based on the logit link (4.9) and the probit link (4.14) for various values of $\sigma$ in a GLMM. We rewrote the linear predictor for the logit GLMM as $\eta_t = x_t'\beta + \sigma z_t$, where $z_t$ is a standard normal variable and $\sigma$ can be interpreted as the regression coefficient for the random effect $z_t$. Then we used the approximate connection $\eta_t^{\text{probit}} = \eta_t^{\text{logit}}/1.6$ between parameter estimates to compute conditional success probabilities under both models. This was done for a random sample of 6 $z_t$'s from a standard normal distribution. For each generated $z_t$, each panel in Figure 4-2 displays the conditional success probabilities based on the logit link (solid line) and the (scaled) probit link (dashed line) for $\eta_t^{\text{logit}}$ ranging from $-3$ to $3$. The 4 different panels refer to 4 different choices of $\sigma$. We clearly see that conditional success probabilities based on the logit and (scaled) probit link are almost indistinguishable, irrespective of the magnitude of $\sigma$.
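The quality of the 1.6 rescaling is easy to quantify: over a wide grid of linear predictor values, the logistic cdf and the rescaled normal cdf differ by at most about 0.02 (a sketch; not code from the dissertation):

```python
import math

def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# compare logistic(eta) with Phi(eta / 1.6) over a grid from -6 to 6
grid = [i / 100.0 for i in range(-600, 601)]
max_diff = max(abs(logistic(eta) - std_normal_cdf(eta / 1.6)) for eta in grid)
print(max_diff)   # roughly 0.02, attained around |eta| = 3
```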
The agreement is best at conditional success probabilities around 1/2, which is to be expected based on the derivations given above. But even for small and large success probabilities the agreement is very good. Hence, with the proper scaling
Figure 4-2: Comparison of conditional logit and probit model-based probabilities. Conditional success probabilities for logit (solid line) and probit (dashed line) link GLMMs, for linear predictor values ranging from $-3$ to $3$. Each pair of (solid, dashed) curves in each panel corresponds to one of 6 randomly sampled random effects $z_t$. Conditional success probabilities for probit link GLMMs use a scaled version of the linear predictor for logit link GLMMs to adjust for different parameter estimates in these two models. The four panels correspond to four different values of the random effects standard deviation $\sigma = 1, 1.5, 2$ and $2.5$.
factor on fixed effects parameters and the standard deviation of the random effects, a probit-link GLMM corresponds to a logit-link GLMM. That is, for any fixed $z_t$,

$$\frac{\exp\{x_t'\beta + \sigma z_t\}}{1 + \exp\{x_t'\beta + \sigma z_t\}} \approx \Phi(x_t'\tilde{\beta} + \tilde{\sigma} z_t),$$

where the left-hand side of the approximation refers to a logit model for the conditional success probabilities with parameters $\beta$ and $\sigma$, and the right-hand side refers to a probit model for the same conditional success probabilities with parameters $\tilde{\beta} = \beta/1.6$ and $\tilde{\sigma} = \sigma/1.6$. Taking expectations with respect to the distribution of $z_t$, one would then also expect the averages of the two sides to agree, and consequently

$$\pi_t^M \approx \Phi\!\left( \frac{x_t'\tilde{\beta}}{\sqrt{1 + \tilde{\sigma}^2}} \right). \qquad (4.18)$$

This gives an approximation of the marginal success probability of a logit model in terms of parameters from a conditional probit model. Graphically, (4.18) means that the average of conditionally specified logistic cdfs, which does not follow a logistic form itself, can be approximated by the average of subject-specific normal cdfs, which does follow a normal form, provided the correct parameter adjustments are made. The connection is pictured in Figure 4-3, where the average $\frac{1}{100}\sum_{i=1}^{100} \pi(z_i; \beta, \sigma)$ of 100 conditionally specified logistic curves, with $z_i \sim N(0, 2.5)$, is compared to the cdf $\tilde{\pi}_t^M = \Phi(x_t'\tilde{\beta}/\sqrt{1 + \tilde{\sigma}^2})$ of a marginal probit model with adjusted parameters. Note that the agreement is almost perfect although $\sigma$ is large.

The upshot of this exercise is that we can use the exact formulae for marginal properties of probit-link GLMMs to make good approximate marginal statements in logit-link GLMMs. These should be more accurate than the Taylor approximations based on the logit link in cases where $\sigma$ is large. For instance, we can use
Figure 4-3: Comparison of implied marginal probabilities from logit and probit models. The plot shows the average of 100 conditionally specified logistic curves (dashed line), generated using $u \sim N(0, 2.5)$, and the marginal normal curve $\tilde{\pi}^M$ (solid line) from a probit model with adjusted parameters. The plot is over a linear predictor range from $-6$ to $6$. A random sample of 10 of the 100 generated conditionally specified logistic curves is also shown (grey dashed lines).
(4.16) to derive the marginal success probability or the marginal odds in logit-link GLMMs, and (4.17) to derive the marginal correlation between two observations in models where the estimate of $\sigma$ is large. The second panel of Figure 4-1 shows that the approximation of the marginal success probabilities based on the probit connection does not suffer the drawbacks (loss of monotonicity, non-convergence to 0.5 for $\sigma \to \infty$) experienced with the Taylor-based approximation approach. Also notice the close connection between the multiplicative factor $c = 16\sqrt{3}/(15\pi) \approx 1/1.7$ for $\sigma$ in the cumulative Gaussian approximation of the marginal mean employed by Zeger, Liang and Albert (1988) and the approximation using the probit link:

$$\pi_t^M \approx \frac{1}{1 + \exp\{-x_t'\beta/\sqrt{1 + \bar{\sigma}^2}\}}, \quad \bar{\sigma}^2 = (\sigma/1.7)^2 \quad \text{(Zeger et al.)}$$

$$\pi_t^M \approx \Phi\!\left( \frac{x_t'\tilde{\beta}}{\sqrt{1 + \tilde{\sigma}^2}} \right), \quad \tilde{\beta} = \beta/1.6, \;\; \tilde{\sigma}^2 = (\sigma/1.6)^2 \quad \text{(probit-logit)} \qquad (4.19)$$

The first approximation makes a stronger statement, in that it says that the marginal mean also has the form of a logistic regression model with parameters downweighted by a factor of $\sqrt{1 + \bar{\sigma}^2}$. However, using the relationship between probit and logit links once more, the parameter vector $\tilde{\beta}/\sqrt{1 + \tilde{\sigma}^2}$ for the marginal probit model (4.19) translates to roughly $1.6 \times \tilde{\beta}/\sqrt{1 + \tilde{\sigma}^2} = \beta/\sqrt{1 + \tilde{\sigma}^2}$ for a marginal logit model. Hence both approximations show that the marginal mean follows roughly a logit model. They differ only by the weight factor assigned to the random effects standard deviation. Notice that the probit-logit connection and the derivations outlined in this dissertation are more valuable because they also provide approximations to the marginal variance and correlation, a key component for time series observations.
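As a complement to the closed-form probit expressions, the marginal correlation between two binomial counts can always be approximated by simulation. The sketch below (all parameter values hypothetical) draws a bivariate normal pair of random effects with correlation $\rho$, generates the two binomial sums, and shows that the induced marginal correlation of the counts is positive but attenuated relative to $\rho$:

```python
import math
import random

random.seed(4)

def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))

rho, sigma, n, m = 0.8, 1.5, 10, 50000   # hypothetical parameter values
ys, yt = [], []
for _ in range(m):
    u1 = random.gauss(0.0, sigma)
    # second random effect with var sigma^2 and corr(u1, u2) = rho
    u2 = rho * u1 + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, sigma)
    p1, p2 = logistic(u1), logistic(u2)  # intercept-only linear predictor
    ys.append(sum(1 for _ in range(n) if random.random() < p1))
    yt.append(sum(1 for _ in range(n) if random.random() < p2))

mean_s, mean_t = sum(ys) / m, sum(yt) / m
cov = sum((a - mean_s) * (b - mean_t) for a, b in zip(ys, yt)) / m
var_s = sum((a - mean_s) ** 2 for a in ys) / m
var_t = sum((b - mean_t) ** 2 for b in yt) / m
corr = cov / math.sqrt(var_s * var_t)
print(corr)   # positive, but smaller than rho = 0.8
```

The attenuation arises both from the nonlinear logistic transform of the random effects and from the binomial noise in the sums, exactly the two ingredients in (4.17).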
In summary, fitting the logit GLMM allows for the usual interpretation of parameters as (conditional) effects on the log odds. By exploiting connections with models using a probit link, we can give good closed-form, analytical approximations for marginal probabilities, odds and correlations. Note that non-closed-form solutions for the marginal mean, variance and correlation can always be obtained by integrating over the random effects density, e.g., $\pi_t^M = \int \pi_t(u_t)\, g(u_t)\, du_t$, and approximated by stochastic methods such as a Monte Carlo sum using the fitted random effects distribution. The third panel of Figure 4-1 displays these Monte Carlo averages (based on 100,000 draws from the assumed $N(0, \sigma^2)$ random effects distribution) and shows almost perfect agreement with the closed-form approximations based on the probit model (see also Figure 4-3). For this simple example the Monte Carlo averages are easy to obtain, but considerably more simulation effort may be necessary to yield good approximations for higher-dimensional marginal probabilities, such as the occurrence of three consecutive successes, or for more complicated functions. The examples in Section 5.4 will illustrate this point further and make extensive use of both of these approximation techniques. There, the main use of these approximations will be in comparing the empirical dependency structure observed in the time series to the theoretical one implied by the model, and in comparing observed frequencies in the time series to estimated ones based on our proposed models.

4.3.3 Parameter Interpretation

In GLMMs for binary data we model conditional log odds, given random effects. By averaging with respect to the random effects distribution on the logit scale, we obtain unconditional (or expected) log odds. However, these are different from the marginal log odds $\log(\pi_t^M/(1 - \pi_t^M))$ obtained with the marginal
probabilities implied by the conditionally formulated model. In the previous section we derived the approximation

$$\log\!\left( \frac{\pi_t^M}{1 - \pi_t^M} \right) \approx \frac{x_t'\beta}{\sqrt{1 + \tilde{\sigma}^2}},$$

and we see that parameters can still be interpreted as log odds ratios, but downweighted by a factor of $\sqrt{1 + \tilde{\sigma}^2}$. In the literature on longitudinal data, where random effects pertain to subjects, this interpretation is preferred when the dependency structure is considered a nuisance. However, for interpreting regression parameters in a time series analysis based on the GLMMs outlined in this dissertation, a preference is not so clear. For a related analysis of binary time series via hidden Markov models, where the distinction from GLMMs is essentially the assumption of a discrete Markov process on a few states instead of an AR(1) process for the latent random process $\{u_t\}$, MacDonald and Zucchini (1997) give unconditional interpretations of regression parameters throughout. Since the previous section focused on marginal interpretations, this section focuses on unconditional ones, but it is noted that it may be more natural to interpret a log odds of average probabilities (i.e., think marginally) than an average log odds. (To illustrate the issue further, in pre-GLM times people took logarithmic transforms of the observations and fitted a linear model to their means. But then one is modeling the mean of the logarithm rather than the logarithm of the mean, as would be done with a GLM.)

4.3.3.1 Conditional and unconditional log odds and log odds ratios

The conditional log odds of success at time point $t$ are given by

$$\log\!\left( \frac{\pi_t(u_t)}{1 - \pi_t(u_t)} \right) = x_t'\beta + u_t.$$

Integrating over the random effects distribution, the unconditional or expected log odds are equal to $x_t'\beta$, and $\beta$ can be interpreted as the unconditional or expected change in the log odds for a change in the covariates. The central $100(1 - \alpha)\%$ of
the distribution of the log odds falls in between

$$x_t'\beta \pm z_{\alpha/2}\, \sigma,$$

where $z_{\alpha/2}$ is the standard normal quantile. The conditional log odds ratio for success at time point $t^*$ versus time point $t$ is given by

$$(x_{t^*} - x_t)'\beta + (u_{t^*} - u_t).$$

With correlated random effects, the interpretation of the log odds ratio is time specific, not only through the covariates but also through the random term $(u_{t^*} - u_t)$. This is different from the so-called subject-specific interpretation of the log odds ratio in a regular random intercepts model ($u_t = u$ for all $t$). There, the random effect is assumed to be constant over time and cancels out in the log odds ratio, and $\beta$ can be directly interpreted as the change in the conditional log odds for a change in the covariates. With correlated random effects, $\beta$ is the change in the conditional log odds for a change in the covariates when the random effects at times $t$ and $t^*$ have the same value.

The unconditional or expected log odds ratio is equal to $(x_{t^*} - x_t)'\beta$, and $\beta$ can alternatively be interpreted as the expected change in the log odds for a change in the covariates between times $t$ and $t^*$. The central $100(1 - \alpha)\%$ of the distribution of the log odds ratio falls in between

$$(x_{t^*} - x_t)'\beta \pm z_{\alpha/2} \sqrt{2\sigma^2\big(1 - \operatorname{corr}(u_t, u_{t^*})\big)},$$

which can be estimated by plugging in ML estimates for $\beta$ and the random effects variance and covariances.
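Assuming normal random effects, these central intervals are simple to compute: the conditional log odds at time $t$ have standard deviation $\sigma$, and the difference $u_{t^*} - u_t$ has variance $2\sigma^2(1 - \operatorname{corr}(u_t, u_{t^*}))$. A sketch with hypothetical values (function names are not from the dissertation):

```python
import math
from statistics import NormalDist

def log_odds_interval(xb, sigma, alpha=0.05):
    """Central 100(1-alpha)% interval of the conditional log odds
    x_t'beta + u_t, for u_t ~ N(0, sigma^2)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return (xb - z * sigma, xb + z * sigma)

def log_odds_ratio_interval(dxb, sigma, corr_uu, alpha=0.05):
    """Central interval of the conditional log odds ratio
    (x_{t*} - x_t)'beta + (u_{t*} - u_t), using
    var(u_{t*} - u_t) = 2 * sigma^2 * (1 - corr(u_t, u_{t*}))."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hw = z * math.sqrt(2.0 * sigma ** 2 * (1.0 - corr_uu))
    return (dxb - hw, dxb + hw)

# hypothetical values: x_t'beta = 0.4, sigma = 1, corr(u_t, u_{t*}) = 0.5
print(log_odds_interval(0.4, 1.0))
print(log_odds_ratio_interval(0.7, 1.0, 0.5))
```

Note that for highly correlated random effects the log odds ratio interval shrinks toward the point $(x_{t^*} - x_t)'\beta$, recovering the random-intercept case as $\operatorname{corr}(u_t, u_{t^*}) \to 1$.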
4.3.3.2 Conditional and unconditional odds and odds ratios

At time point $t$, with associated random effect $u_t$, we already defined the unconditional or expected odds of success as $E[\exp\{x_t'\beta + u_t\}] = \exp\{x_t'\beta + \sigma^2/2\}$. Here $\beta$ describes the effect of covariates on the expected odds of success, with the intercept term offset as in Poisson GLMMs. For two time points $t^*$ and $t$, with associated random effects $u_{t^*}$ and $u_t$, the ratio of expected odds is equal to $\exp\{(x_{t^*} - x_t)'\beta\}$ and can be interpreted as any regular odds ratio. I.e., for a positive, one-unit change in the $k$th predictor from time $t$ to time $t^*$, the expected odds of success at time $t^*$ are $\exp\{\beta_k\}$ times those at time $t$. Due to the nonlinearity of the odds, the ratio of expected odds is different from the expected odds ratio, which is given by

$$E\!\left[ \frac{\exp\{x_{t^*}'\beta + u_{t^*}\}}{\exp\{x_t'\beta + u_t\}} \right] = \exp\{(x_{t^*} - x_t)'\beta\} \exp\{\sigma^2 (1 - \operatorname{corr}(u_t, u_{t^*}))\}.$$

Using this measure, $\exp\{\beta_k\} \exp\{\sigma^2(1 - \operatorname{corr}(u_t, u_{t^*}))\}$ now equals the change in the expected odds ratio of success at time $t^*$ versus time $t$, for a positive one-unit change in the $k$th predictor over that time span. The first measure, $\exp\{\beta_k\}$, describes the change in the expected odds at two time points; the second one, $\exp\{\beta_k\} \exp\{\sigma^2(1 - \operatorname{corr}(u_t, u_{t^*}))\}$, describes the expected change in the odds of a success at the two time points. In the following we focus on the ratio of expected odds.

4.3.3.3 Multiple time series

If $n$ different time series $y_1 = (y_{11}, \ldots, y_{1T}), \ldots, y_n = (y_{n1}, \ldots, y_{nT})$ are observed and the same latent process $\{u_t\}_{t=1}^T$ is assumed to
underly each one of them, the conditional log odds at time $t$ have the form

$$\operatorname{logit}(\pi_{it}(u_t)) = x_{it}'\beta + u_t, \quad i = 1, \ldots, n.$$

Here $\pi_{it}(u_t)$ is the conditional probability of success at time $t$ for the $i$th series, and depends on time-specific covariates $x_{it}$ plus a serially correlated random time effect $u_t$. If two different time series $y_i$ and $y_j$ represent different subpopulations or stratifications of a population, interest can focus on each of the following three contrasts:

Contrasts between subpopulations at a given common observation time
Contrasts between different time points within the same subpopulation
Contrasts between subpopulations at different observation times

We will look at ratios of expected odds, which is perhaps the most natural metric, to address these three points. In the first case, the expected odds of success in stratum $i$ over those in stratum $j$ at a fixed time $t$ are given by

$$\exp\{x_{it}'\beta + \sigma^2/2\} / \exp\{x_{jt}'\beta + \sigma^2/2\} = \exp\{(x_{it} - x_{jt})'\beta\}.$$

Then $\exp\{\beta\}$ has the interpretation of a change in the expected odds for a change in the strata covariates at fixed time $t$. For example, with model (3.1), for the two time series measuring attitude towards homosexual relationships for whites and blacks, $\exp\{\beta_3 + \beta_4 x_{1t}\}$ describes the change in the expected odds of approval of homosexual relationships for blacks versus whites in year $x_{1t}$. That is, in year $x_{1t}$ the expected odds of approval for black respondents are $\exp\{\beta_3 + \beta_4 x_{1t}\}$ times the expected odds of approval for white respondents. Using the maximum likelihood estimates and their estimated asymptotic standard errors (see Table 5-1)
and covariances, the expected odds of approval of homosexual relationships for black respondents in 1988 are estimated to be 0.65 times (95% confidence interval: [0.54, 0.73], using the delta method) the expected odds for white respondents in that year. Ten years later, in 1998, this factor decreases to 0.49, with a 95% confidence interval of 0.37 to 0.61.

For the scenario in the second contrast, the ratio of expected odds at time $t^*$ versus time $t$ for subpopulation $i$ is given by

$$\exp\{(x_{it^*} - x_{it})'\beta\}.$$

Now $\exp\{\beta\}$ describes the change in the expected odds for changes in the predictors from time $t$ to time $t^*$. For the motivating example with model (3.1), $\exp\{\beta_1 h + \beta_2[(x_{1t} + h)^2 - x_{1t}^2]\}$ is the change in the expected odds of approval for white respondents, and $\exp\{(\beta_1 + \beta_4) h + \beta_2[(x_{1t} + h)^2 - x_{1t}^2]\}$ is the change in the expected odds of approval for black respondents, for observations $h$ years apart. For example, over a period of 10 years from 1988 to 1998, the expected odds of approval increase by a factor of 2.63 for white respondents (95% confidence interval: [1.56, 3.69], using the delta method) and by a factor of 2.01 (95% confidence interval: [1.44, 2.58]) for black respondents.

For the third contrast,

$$\exp\{(x_{it^*} - x_{jt})'\beta\}$$

describes how the expected odds at time $t^*$ in stratum $i$ compare to the expected odds at time $t$ in stratum $j$. Again, $\exp\{\beta\}$ describes the effect of a change in the strata covariates from time $t$ to $t^*$ on the expected odds. In the motivating example, the expected odds of approval for black respondents in year $x_{1t} + h$ are $\exp\{\beta_1 h + \beta_2[(x_{1t} + h)^2 - x_{1t}^2] + \beta_3 + \beta_4(x_{1t} + h)\}$ times those for white respondents in year $x_{1t}$. Using the maximum likelihood estimates, the expected odds of approval for black respondents in 1998 are estimated to be 0.99 times (i.e., almost equal to)
the expected odds of approval for white respondents 6 years back, in 1992. Here I fixed the year 1998 (i.e., $x_{1t} + h = 15$) and searched for the number of years $h$ one has to go back to match the expected odds of approval for the two races. I.e., I solved for $h$ in the equation

$$\exp\{\hat{\beta}_1 h + \hat{\beta}_2[x_{1t^*}^2 - (x_{1t^*} - h)^2] + \hat{\beta}_3 + \hat{\beta}_4 x_{1t^*}\} = 1, \quad x_{1t^*} = 15,$$

yielding $h \approx 6$.
CHAPTER 5
EXAMPLES OF COUNT, BINOMIAL AND BINARY TIME SERIES

In this chapter we propose GLMMs with autocorrelated random effects for the analysis of several practical examples. We apply the likelihood estimation theory developed in Chapter 3 and use the model-theoretic properties derived in Chapter 4. In Section 5.2 we take another look at the GSS data set discussed in Chapter 3 and reanalyze it based on a normal approximation using linear mixed model theory. A famous time series of counts is analyzed in Section 5.3, with results in the literature compared to our results based on an autoregressive GLMM. Two binary time series are analyzed next, in Section 5.4. The first considers 299 consecutive eruptions of the Old Faithful geyser in Yellowstone National Park, each classified as either short or long. The second considers the annual boat race between teams from the Universities of Cambridge and Oxford, and is challenging because of several missing observations. Two goals in this example are to establish the influence of the weight of the crew on the outcome of the race (demystifying a long-held belief) and to predict a future outcome. First, though, in Section 5.1 we present ways to explore and picture the dependency structure in an observed discrete time series.

5.1 Graphical Exploration of Correlation Structures

Let $\{y_t\}$ be a realization of the time series $\{Y_t\}$. In practice, we have to choose an appropriate model for $\{Y_t\}$ based on the information the data provide about the dependency structure. An important tool to explore the dependency structure of the observed time series is the sample autocorrelation function (ACF). For equally
spaced data it is defined as

$$\hat{\rho}(h) = \frac{\sum_{t=1}^{T-h} (y_t - \bar{y})(y_{t+h} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2}, \qquad (5.1)$$

where $h$ is the lag between observations and $\bar{y}$ is the sample mean. If the observed time series displays any trend, we have to first estimate it (maybe by fitting a regular GLM) and subsequently explore the autocorrelations among the residuals. A comparison of the autocorrelation the model predicts with the empirical one observed in the time series serves as a crucial check on the adequacy of the fitted model and its assumptions.

5.1.1 The Variogram

For unequally spaced data, the variogram (Diggle 1990) is a better measure than the ACF to describe the association. In Diggle et al. (2002) the variogram is discussed for longitudinal data, while we develop it here for the special case of time series data $\{Y_t\}$. Define the variogram $\gamma(h)$ at lag $h$ as

$$\gamma(h) = \frac{1}{2} E\big[ (Y_{t+h} - Y_t)^2 \big].$$

If $\{Y_t\}$ is stationary, the variogram is directly related to the autocorrelation function $\rho(h) = \operatorname{corr}(Y_t, Y_{t+h})$ by

$$\gamma(h) = \tau^2 [1 - \rho(h)],$$

where $\tau^2$ is the variance of $Y_t$. (Even for a nonstationary time series the variogram is well defined, provided the increments $Y_{t+h} - Y_t$ are stationary.) To develop the empirical variogram, let $d_{tt^*}$ be the time between two observations $y_t$ and $y_{t^*}$. The sample analog $g(h)$ of the variogram at lag $h$ is calculated by averaging over all possible squared differences between observation pairs $h$ time units apart. I.e.,

$$g(h) = \frac{1}{|C_h|} \sum_{(t, t^*) \in C_h} \frac{1}{2} (y_t - y_{t^*})^2,$$
where C_h = {(t, t*) : d_{tt*} = h} is the set of all index pairs (t, t*) with corresponding observations measured h time units apart. A comparison of the sample variogram g(h) to an estimate \hat{\gamma}(h) of the theoretical variogram implied by a particular model serves as a check on the adequacy of the model. In Section 5.4 we show, with the help of the variogram, the appropriateness of the modeled correlation structure for the unequally spaced Oxford versus Cambridge boat race time series data.

5.1.2 The Lorelogram

For categorical, especially binary responses, the dependency can also be measured in terms of odds ratios. For a binary time series {Y_t}, Heagerty and Zeger (1998) define the lorelogram \theta(h) at lag h as the log odds ratio between observations Y_t and Y_{t+h},

\theta(h) = \log\left( \frac{P(Y_t = 1, Y_{t+h} = 1) \times P(Y_t = 0, Y_{t+h} = 0)}{P(Y_t = 1, Y_{t+h} = 0) \times P(Y_t = 0, Y_{t+h} = 1)} \right).   (5.2)

For an observed binary time series y = (y_1, ..., y_T), the lorelogram can be estimated by using sample proportions of the probabilities in (5.2). That is, the sample lorelogram at lag h is given by

LOR(h) = \log\left( \frac{ y_{[1,T-h]}' y_{[h+1,T]} \times (1_{T-h} - y_{[1,T-h]})'(1_{T-h} - y_{[h+1,T]}) }{ y_{[1,T-h]}'(1_{T-h} - y_{[h+1,T]}) \times (1_{T-h} - y_{[1,T-h]})' y_{[h+1,T]} } \right),

where y_{[a,b]} is the subvector (y_a, ..., y_b)' of y and 1_{T-h} is a vector of T-h ones. Proper adjustments have to be made for unequally spaced data. As with the variogram, a comparison of the sample lorelogram to one implied by a particular model serves as a check on the adequacy of the model.

5.2 Normal Time Series

Section 4.1 discussed a linear mixed model approach to time series modeling for normal data. In this section we illustrate it with the example about attitudes towards homosexual relationships (a cross-sectional time series, see Section 3.1), where attitude was measured 16 times between 1974 and 1998 for whites and
blacks. The binomial counts in this study are large enough to warrant an analysis based on a normal approximation, although the binomial sample sizes are about 8 to 11 times larger for whites than for blacks in almost all years. Initially we will assume that, conditional on random time effects {u_t}, the log odds \theta_{it} of approval of homosexual relationships for race i at year t are independent over the years and follow a normal distribution with mean \mu_{it} + u_t and standard deviation \tau. (We could have also modeled the binomial counts or the proportions of approval directly, but prefer the log odds approach because of the structural problem of the identity link with modeling proportion data.) However, the usual assumption of a constant standard deviation \tau throughout the two groups is grossly inappropriate. Firstly, we have to account for the fact that sample sizes in the white group are much larger than sample sizes in the black group. Secondly, the estimated asymptotic standard deviation of the log odds, derived by the delta method and based on the asymptotic normality of the sample proportions, is

\widehat{std}(\hat{\theta}_{it}) = \left[ n_{it} \hat{\pi}_{it} (1 - \hat{\pi}_{it}) \right]^{-1/2},   (5.3)

where \hat{\pi}_{it} is the sample proportion and n_{it} the sample size for race i and year t. Figure 5-1 shows these estimates for the two groups. To put the scale for the empirical standard deviation for this plot into perspective, the estimated log odds range from -1.8 to -0.9 for the white group and from -2.8 to -1.4 for the black group. Figure 5-1 shows that the variability in the log odds is markedly smaller for white respondents than for black ones, due to the overall larger sample sizes in the white group. Furthermore, the variability in the log odds is not constant over time, especially in the black group. This is due to the fact that in the years 1988 to 1991 and 1993 only about half as many people were sampled as in the other years, for both groups. Also, the trend in the probabilities \pi_{it} causes the standard deviations to be different over time. In an
119 .60 .50 .40 30 20 r I \ 9. I \ I \ I \ \. \ I "\ I \ I "" \ /~ : \ I \ \ I \ I \ _..._ ___ ...... I \ I \ .\ I \ \ I \ I \I \J ii. race : .10 black 0.00 ..,....... white 1972 1976 1980 1984 1988 1992 1996 2000 year Figure 5 1: Empirical standard deviations std ( 0it) for the log odds of favoring homosexual relationship by race
effort to remedy all these effects simultaneously, we associate a weight with each observation that is the inverse of the empirical standard deviation given in (5.3). It might then be reasonable to assume that the weighted log odds w_{it}\theta_{it} have constant conditional standard deviation std(w_{it}\theta_{it} | u_t) = \tau, or, stated differently, that std(\theta_{it} | u_t) = \tau / w_{it}.

We are now able to specify an autoregressive model. Analogous to Section 3.1, we assume the following linear mixed model for the log odds:

\theta_{it} = x_{it}'\beta + u_t + \epsilon_{it},

where, as before, the covariate vector x_{it} contains the (centered) year x_{1t} at which the response was measured and an indicator x_{2i} for race, u_t is the year-specific random effect, and the \epsilon_{it} are assumed i.i.d. N(0, \tau^2 / w_{it}^2). With the motivation given in Section 3.1, we assume autocorrelated random effects u_{t+1} = \rho^{d_t} u_t + e_t to model the serial dependence in successive observations caused by gradual change in unobserved factors such as public opinion. We used the procedure proc mixed in SAS to obtain the maximum likelihood estimates for all parameters in this model, which are shown in Table 5-1. The mixed procedure allows for the use of the weights 1/w_{it} in the variance-covariance matrix of the conditional log odds given the random effects through the weight statement. We selected the so-called spatial power covariance matrix (SAS proc mixed with random statement option: type=sp(pow)('year')) for the covariance matrix of the random effects because it allows for unequally spaced observations. The results confirm the ones we saw earlier using a logistic regression approach with autoregressive random effects to model the serial log odds. Since parameter estimates in both models are almost equal, substantially the same conclusions as given in Section 4.3.3 based on the logit GLMM are reached. One
Table 5-1: Comparing estimates from two models for the log odds.

                normal              logistic
Param       est.      s.e.      est.      s.e.
alpha      -1.80      0.07     -1.80      0.07
beta1       0.025     0.007     0.023     0.007
beta2       0.0039    0.0008    0.0041    0.0009
beta3       0.32      0.057     0.33      0.07
beta4       0.027     0.007     0.027     0.009
tau         0.67
sigma       0.11                0.10      0.03
rho         0.66                0.65      0.25

Maximum likelihood estimates and asymptotic standard errors based on an approximate normal linear mixed model for the log odds (first two columns) and the logistic GLMM with autocorrelated random effects of Section 3.1 (last two columns).

advantage of the normal approximating model is the ease with which marginal statements can be obtained. For instance, formula (4.2) applies, but now with a weight factor. That is,

corr(\theta_{it}, \theta_{it*}) = \sigma^2 \rho^{|t - t*|} \Big/ \left[ \left(\sigma^2 + \tau^2/w_{it}^2\right)^{1/2} \left(\sigma^2 + \tau^2/w_{it*}^2\right)^{1/2} \right]

is the marginal correlation between the log odds of approval of race i for observations in years t and t*. The marginal correlations depend on the weight factors and therefore vary throughout the years. For instance, using the maximum likelihood estimates of the variance components, the estimated correlation between the log odds of approving homosexual relationships in years 1996 and 1998 is 0.52 for whites and 0.48 for blacks.

5.3 Analysis of the Polio Count Data

Table 2 in Zeger (1988) lists a time series of 168 monthly counts, most of them small, of new cases of poliomyelitis in the U.S. between 1970 and 1983, plotted in the first panel of Figure 5-2. Of interest is whether the data provide evidence of a long-term decrease in the rate of polio infections. Many authors have analyzed this data set, beginning with Zeger (1988), who uses marginal model
fitting techniques (cf. Section 1.2). He estimates a yearly decrease of 6.3% in new cases of poliomyelitis, where, however, a 95% confidence interval ranges from a decrease of 12.2% to an increase of 1.1%. We demonstrate our technique by reanalyzing this data set using a GLMM with autoregressive random effects (AR-GLMM) to incorporate the time dependence between adjacent counts.

Conditional on a random time effect u_t, let Y_t be a Poisson variable representing the count in month t, t = 1, ..., 168. Following Zeger (1988), we model the log of the conditional mean of Y_t as

\alpha + \beta_1 (t - 73)/1000 + \beta_2 \cos(2\pi t/12) + \beta_3 \sin(2\pi t/12) + \beta_4 \cos(2\pi t/6) + \beta_5 \sin(2\pi t/6) + u_t,   (5.4)

where the random effects follow the autoregressive process u_{t+1} = \rho u_t + \epsilon_t, \epsilon_t ~ N(0, \sigma^2(1 - \rho^2)), u_1 ~ N(0, \sigma^2). The sine and cosine pairs adjust for annual and semiannual seasonal patterns of the counts displayed in Figure 5-2. Figure 5-3 shows the convergence of selected parameter estimates and their estimated standard errors in an MCEM algorithm for two different sets of starting values for the variance components. Convergence parameters were set to \epsilon_1 = 0.002, c = 5, \epsilon_2 = 0.01, \epsilon_3 = 0.001, a = 1.04 and q = 1.1 (see Section 2.3.3). All maximum likelihood estimates and their standard errors are presented in Table 5-2, together with estimates from other models. A negative binomial model takes overdispersion into account but treats the observations as independent. Similarly, a Poisson GLMM with independent random effects u_t i.i.d. N(0, \sigma^2) only adjusts for overdispersion in the counts, but does not address the dependency among the observations. The Poisson AR-GLMM has the smallest maximized likelihood among these models and uses only one parameter more than a negative binomial or
Figure 5-2: Plot of the Polio data. The first panel shows the observed time series of counts of polio infections from 1970 to 1984. Not shown is the observed count of 14 in November 1972. The second panel shows the observed counts as circles, and superimposes the fitted conditional model (5.4), where posterior means are used to estimate the random effects. The third panel shows the fit of the marginal model implied by the conditional formulation.
Figure 5-3: Iteration history for the Polio data. The plot shows the iteration history for parameters \beta_1, \sigma and \rho and their estimated asymptotic standard errors for the Poisson AR-GLMM. The iteration number is plotted on the x-axis. The two different lines in each plot correspond to two different sets of starting values for \sigma and \rho. The starting value for \beta_1 was its GLM estimate. Final Monte Carlo sample sizes in the MCEM algorithm were 29,070 and 34,005, respectively, for the two different sets of starting values.
Table 5-2: Parameter estimates for the polio data.

            GLM           Neg. Bin      Poisson-GLMM  Poisson-ARGLMM  Chan & Ledolter
Param    est.   s.e.   est.   s.e.   est.   s.e.   est.   s.e.     est.   s.e.
alpha    0.21   0.08   0.21   0.10   0.05   0.11   0.03   0.15     0.21   0.13
beta1   -4.80   1.40  -4.33   1.85  -4.34   1.92  -3.74   2.91    -4.62   1.38
beta2    0.15   0.10   0.14   0.13   0.13   0.13   0.10   0.15     0.15   0.09
beta3    0.53   0.11   0.50   0.14   0.51   0.14   0.50   0.16     0.50   0.12
beta4    0.17   0.10   0.17   0.13   0.17   0.13   0.20   0.13     0.44   0.10
beta5    0.43   0.10   0.42   0.13   0.38   0.13   0.36   0.13     0.04   0.10
1/k                    0.57   0.16
sigma                                0.72   0.10   0.70   0.12     0.64
rho                                                0.66   0.20     0.89   0.04
LL       132.49        113.37        112.40        92.50           19.00

Fit of a Poisson GLM, a negative binomial GLM, a Poisson GLMM with independent random effects, and a Poisson AR-GLMM. The last column holds parameter estimates as reported by Chan and Ledolter (1995). Their estimates of \sigma and \rho are transformed to bring them into agreement with our parametrization of the latent autoregressive process.

Poisson GLMM with independent random effects. Any information criterion, such as the AIC, would heavily favor this model.

5.3.1 Comparison of AR-GLMMs to Other Approaches

Chan and Ledolter (1995) also used autoregressive random effects in a Poisson GLMM setting to analyze these data. However, their implementation of the MCEM algorithm seems to have been stopped prematurely. The path plot of the coefficient of interest \beta_1 (Chan and Ledolter, 1995, p. 246) still shows some trend movements when they declared convergence. Their convergence criterion is based on the change in the marginal log-likelihood, while ours takes into account the consecutive changes in parameter estimates and the Q-function. (Moreover, their estimate of the maximized log-likelihood function of 19 seems to be a rather unusual value considering our estimate of 92.5. This is excluding the constant term \sum_t \log(y_t!) in the log-likelihood, which is equal to 140.5. We were not able to reproduce the estimate with the parameter estimates given by Chan and
Ledolter (1995).) Also, our Monte Carlo sample for the final iterations in the MCEM algorithm is 28,500, about 14 times higher than theirs. The Monte Carlo sample size increased exponentially in our implementation, but only two sample sizes of 800 and then 2,000 (for confirming the results obtained with the 800 samples) were used in Chan and Ledolter (1995). Based on their estimate of \beta_1 and its asymptotic standard error, they conclude that the time trend is significant at the 5% level. However, our analysis shows an insignificant time trend at the 5% level, which is in agreement with the conclusion of Zeger (1988) and other approaches that use different ways of incorporating correlation into a Poisson model. For instance, Fahrmeir and Tutz (2001) use a transitional model including the past 5 responses and report a p-value of 0.095 for the test of a zero slope for the time trend. Li (1994) fitted a slightly different transitional model and also reported an insignificant time trend after accounting for the autocorrelation. He noted that the series might be too short to establish significance of a linear time trend. Benjamin et al. (2003) used the negative binomial instead of the Poisson as the conditional distribution in a transitional model and reported a better fit with it, but again an insignificant time trend.

Note that in general, regardless of its significance, the time trend parameter in transitional models fitted by the above authors has to be explained conditional on past responses (or functions involving past responses). In our model, however, e^{\beta_1} can simply be interpreted as the linear time effect on the marginal mean of the polio counts. Let \mu_t = E[Y_t] denote the marginal mean of Y_t as given by expression (4.4). Then, with our model (5.4), the ratio \mu_{t+12}/\mu_t of two marginal means exactly one year apart is equal to e^{12\beta_1/1000}. Using the ML estimate of \beta_1, we then estimate a yearly decline of 4.5% in the marginal polio counts.
However, a 95% confidence interval for this parameter ranges from a yearly decline of 10.6% to a yearly increase of 2.1%, including the possibility of no change.
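The point estimate and interval just quoted can be reproduced from the Table 5-2 estimates, taking the trend coefficient as \hat{\beta}_1 = -3.74 with s.e. 2.91 (negative, since the fit estimates a decline); a minimal sketch, with small differences from the quoted percentages due to rounding:

```python
import math

beta1, se = -3.74, 2.91          # Poisson AR-GLMM estimates from Table 5-2

def yearly_change(b):
    """Ratio of marginal means 12 months apart: mu_{t+12}/mu_t = exp(12*b/1000)."""
    return math.exp(12 * b / 1000.0)

point = yearly_change(beta1)
lo = yearly_change(beta1 - 1.96 * se)
hi = yearly_change(beta1 + 1.96 * se)

print(f"estimated yearly change: {100 * (point - 1):.1f}%")   # about -4.4%
print(f"95% CI: {100 * (lo - 1):.1f}% to {100 * (hi - 1):.1f}%")
```

The interval straddles 1, which is exactly the "possibility of no change" noted above.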
At a given time point t, a negative binomial GLM (with the parametrization used in Section 4.2) implies that the variance exceeds the mean \mu_t by a factor of (1 + \mu_t/k). For the Poisson AR-GLMM we saw that this overdispersion factor equals [1 + \mu_t(e^{\sigma^2} - 1)]. Using the estimates from Table 5-2, 1/\hat{k} equals 0.57 and e^{\hat{\sigma}^2} - 1 equals 0.63. Both models seem to propose a similar amount of overdispersion relative to their estimated marginal means. Also, both models are similar in the sense that they assume the same overdispersion parameter (k in the negative binomial case, \sigma in the AR-GLMM case) for all observations. However, the negative binomial model does not adjust for correlation among the responses.

5.3.2 A Residual Analysis for the AR-GLMM

We assess the quality of our model by a residual analysis. For the random intercepts GLMM and the AR-GLMM we define the residual at time t as r_t = y_t - \mu_t, where \mu_t is the marginal mean (4.4). Figure 5-4 shows the autocorrelation function of the estimated residuals \hat{r}_t = y_t - \hat{\mu}_t based on the fit of a negative binomial GLM and a Poisson AR-GLMM. The autocorrelation function of residuals from the fit of a GLMM with independent random effects is omitted from the plot, since it is almost identical to the one from the negative binomial GLM. While the significant (based on an asymptotic standard error of 1/\sqrt{T} = 0.08) residual autocorrelation at lag 1 shows the inappropriateness of the negative binomial GLM and the Poisson GLMM with independent random effects, the autocorrelation function of the residuals is as expected for the Poisson AR-GLMM. (We could also judge significance by a Monte Carlo experiment, where we look at the smallest and largest lag 1 correlation of 1000 reordered time series of the residuals. If the observed lag 1 correlation of 0.25 falls close to or above the upper bound, the correlation is deemed significant.)
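The permutation experiment described in the parenthesis above can be sketched as follows; the residual series here is a synthetic stand-in for the fitted residuals, and all function names are our own:

```python
import random

def lag1_corr(x):
    """Sample lag-1 autocorrelation of a series (formula (5.1) at h = 1)."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    return sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1)) / denom

def permutation_bounds(resid, n_perm=1000, seed=0):
    """Smallest and largest lag-1 correlation over reordered copies of resid.
    An observed lag-1 correlation near or above the upper bound is deemed
    significant evidence of serial dependence."""
    rng = random.Random(seed)
    r = list(resid)
    corrs = []
    for _ in range(n_perm):
        rng.shuffle(r)
        corrs.append(lag1_corr(r))
    return min(corrs), max(corrs)

# Synthetic AR(1)-like residuals of the same length as the polio series.
rng = random.Random(42)
res, prev = [], 0.0
for _ in range(168):
    prev = 0.5 * prev + rng.gauss(0, 1)
    res.append(prev)

lo, hi = permutation_bounds(res)
print("observed lag-1 corr:", round(lag1_corr(res), 2),
      "permutation range:", round(lo, 2), "to", round(hi, 2))
```

Reordering destroys any serial structure, so the permutation range reflects what lag-1 correlations look like under independence.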
The model-implied autocorrelation function \rho_t(h) = corr(r_t, r_{t+h}) = corr(Y_t, Y_{t+h}) for the Poisson AR-GLMM is given by (4.8) and depends on the time t. For a given lag h we took the average
\bar{\rho}(h) = \frac{1}{168 - h} \sum_t \rho_t(h) of all possible lag h model-based autocorrelations to construct an estimate of the correlation at that lag. The line with filled triangles in Figure 5-4 represents this average of model-based autocorrelations. It seems reasonably close to the observed autocorrelation in the residuals from the fit of the Poisson AR-GLMM, especially at the most important first two lags. (A generalized estimating equations approach with a marginally specified AR(1) correlation matrix gives a very similar estimate (0.24) of the marginal lag 1 correlation. However, it is questionable if the marginal AR(1) correlation is justified, since both the transitional model and the AR-GLMM indicate a longer dependence relationship.) The estimated model-based autocorrelation function \bar{\rho} indicates that marginal correlations die out for observations 3 or more months apart.

The residual analysis also reveals an extreme observation (a count of 14 new cases in Nov. 1972), which was deemed insignificant by Zeger (1988) but was addressed by Chan and Ledolter (1995), who added an additional parameter to the model. Benjamin et al. (2003) calculated a conditional tail probability of 0.02 for observing an event that extreme or even more extreme under their negative binomial transitional model. Based on our Poisson AR-GLMM, we estimate a conditional mean of 9.1 (cf. Figure 5-2) for an observation at that time point, using the posterior mean of the random effect for Nov. 1972 as its prediction. This translates to an estimated conditional tail probability of 0.08 for observing 14 or more new counts of poliomyelitis for that month, which does not seem too extreme. However, if we base calculations on the marginal probability of observing an event like this, then the probability is equal to 0.007. The marginal mean count for Nov. 1972 is estimated to be 2.6 (cf.
Figure 5-2), but the Poisson distribution cannot be used to calculate tail probabilities, since marginally the counts are not Poisson. Hence we used Monte Carlo integration to estimate the marginal distribution P(Y_{Nov 1972} = k), k = 1, ..., 13, from the conditional Poisson model by sampling
Figure 5-4: Residual autocorrelations for the Polio data. The plot shows autocorrelation functions of residuals from the fit of a negative binomial model (circles) and a Poisson AR-GLMM (triangles). Also shown is an estimate of the model-based autocorrelation function when the assumed model is the negative binomial GLM (filled circles) or the Poisson AR-GLMM (filled triangles). The straight dotted line represents the asymptotic standard error 1/\sqrt{T} for the correlation.
from the fitted random effects distribution and averaging over the conditional Poisson densities specified through (5.4), to get the above result. If we decide to eliminate the extreme observation (and, in a more general context, if we eliminate several more observations or some observations are missing), then the methods for unequally spaced time series developed in Chapter 3 are useful and directly applicable. Here, however, we add an additional parameter corresponding to an indicator function I(t = Nov. 1972) for this observation to adjust for the extreme observation (in regular GLMs this would force a perfect fit for the outlying observation). The final sample size in the MCEM algorithm with the additional parameter increases to 56,900, and some parameter estimates are affected by this adjustment. For instance, the slope estimate now equals \hat{\beta}_1 = -2.49 (s.e. = 3.66), and the lag 1 correlation among the conditional log-means increases to \hat{\rho} = 0.86 (s.e. = 0.09). Their standard deviation decreases to \hat{\sigma} = 0.59 (s.e. = 0.15). The estimate for the coefficient of the indicator function for the outlier equals 1.84, with a standard error of 0.49. There is now even less evidence of a decreasing time trend in the observed time series. Figure 5-5 is the same as Figure 5-4, but now based on the model with the adjustment for the extreme observation.

5.4 Binary and Binomial Time Series

This section analyzes two data sets of binary time series, illustrating regression parameter estimation, checking for model appropriateness and predicting future observations. Various implied model properties concerning the marginal distribution are also derived.

5.4.1 Old Faithful Geyser Data

MacDonald and Zucchini (1997) propose a two-state hidden Markov model (for a brief summary of the hidden Markov model see Section 1.3) for data concerning the eruption times of the Old Faithful geyser. The series consists of 299 consecutive
Figure 5-5: Residual autocorrelations with outlier adjustment for the Polio data. This plot is the same as the one in Figure 5-4, but is now based on models with an extra parameter to adjust for the extreme observation of November 1972.
observations between the 1st of August and the 15th of August, 1985. Most observations can be characterized as either long or short, with very low variation within the long and short groups. In fact, some of the eruption times measured at night are only recorded as being either short or long. MacDonald and Zucchini (1997), following Azzalini and Bowman (1990), transform the series into a binary one, with cutoff point defined at an eruption length of 3 minutes. They analyze the series assuming a discrete two-point mixture for the probability of a long eruption, where the mixture depends on the state an underlying two-state Markov chain is in. Here we attempt an approach assuming an underlying normal first-order autoregressive process.

Let y_t, t = 1, ..., 299 be the discretized observations, with a value of 0 indicating eruption times less than 3 minutes (short eruptions) and a value of 1 otherwise (long eruptions). The sample ACF r(h) of the series is pictured in Figure 5-6, with numerical values given in Table 5-3. It clearly shows signs of negative autocorrelation in the series. Note that the autocorrelations, with increasing lag, do not decay geometrically, so a first-order Markov model (a transitional model) is inappropriate. Marginal correlations implied by a GLMM with autoregressive random effects, as developed in Section 4.3.1, seem capable of capturing such a behavior. Let {u_t}_{t=1}^{299} be autocorrelated random effects in a model for the conditional probability \pi_t(u_t) of a long eruption. In particular, the model has form

logit(\pi_t(u_t)) = \alpha + u_t, u_1 ~ N(0, \sigma^2), u_t = \rho u_{t-1} + \epsilon_t, \epsilon_t ~ N(0, \sigma^2(1 - \rho^2)),

where \alpha is a parameter for the unconditional (or expected) logit of a long eruption and the series {u_t} captures the serial dependency. Conditional on these random effects, we assume that successive eruption lengths are independent.
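As a sketch of this data-generating mechanism, the following simulates a binary series from the logistic AR-GLMM above. The parameter values are round illustrative numbers, not the fitted estimates; a negative \rho is used to mimic the alternating pattern of the geyser series:

```python
import math
import random

def simulate_binary_ar_glmm(T=299, alpha=0.5, sigma=2.0, rho=-0.8, seed=7):
    """Simulate y_1,...,y_T with logit P(Y_t = 1 | u_t) = alpha + u_t and
    u_t = rho*u_{t-1} + eps_t, eps_t ~ N(0, sigma^2*(1-rho^2)), u_1 ~ N(0, sigma^2).
    A negative rho makes long and short eruptions tend to alternate."""
    rng = random.Random(seed)
    u = rng.gauss(0.0, sigma)                      # stationary start
    y = []
    for _ in range(T):
        p = 1.0 / (1.0 + math.exp(-(alpha + u)))   # conditional P(long eruption)
        y.append(1 if rng.random() < p else 0)
        u = rho * u + rng.gauss(0.0, sigma * math.sqrt(1 - rho ** 2))
    return y

y = simulate_binary_ar_glmm()
print("proportion long:", round(sum(y) / len(y), 2))
```

Conditional independence given {u_t} is visible in the code: each y_t is drawn using only the current u_t, and all serial dependence enters through the latent autoregression.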
The MCEM algorithm, with Gibbs sampling from the posterior distribution of the random effects as described in Sections 2.3 and 3.4.1, is used to obtain maximum likelihood
estimates for this model. Since \sigma and \rho are the standard deviation and the lag 1 correlation of the conditional logits, we use the sample standard deviation and sample lag 1 correlation of the empirical logits \log\left(\frac{y_t + 0.5}{1 - y_t + 0.5}\right) as starting values for these parameters. The starting value for \alpha is its ML estimate under a GLM assuming independent observations. Alternatively, the GLIMMIX macro in SAS, which maximizes a normal approximation of the marginal likelihood (see Section 2.2) but allows one to incorporate autocorrelated random effects, can be used. The two sets of starting values, and some more we tried, yielded similar parameter estimates and standard errors. Using the convergence parameters \epsilon_1 = 0.001, c = 5, \epsilon_2 = 0.003, \epsilon_3 = 0.001, a = 1.02 and q = 1.1 (see Section 2.3.3), we obtained the following estimates: \hat{\alpha} = 1.24 (0.56), \hat{\sigma} = 3.69 (1.17), \hat{\rho} = -0.89 (0.037). The algorithm converged after 130 iterations, with a final Monte Carlo sample size of m = 56,000. The values in parentheses are obtained from a Monte Carlo approximation of the observed information matrix (Louis, 1982), with a Monte Carlo sample of 100,000 drawn from the fitted posterior random effects distribution.

5.4.1.1 Marginal Properties

Since the estimate of \sigma is large, we use formula (4.17) to derive marginal correlations. Let \tilde{\alpha} and \tilde{\sigma} denote the maximum likelihood estimates of \alpha and \sigma scaled by a factor of 1.6. The calculation of marginal correlations enables a comparison with the sample autocorrelation function of the observed series and gives valuable information about the fit of the model. From (4.17) the marginal autocorrelations are estimated by

\hat{\rho}(h) = \frac{\Phi_2\left((\tilde{\alpha}, \tilde{\alpha})', Q(t, t+h)\right) - \hat{\pi}_M^2}{\hat{\pi}_M (1 - \hat{\pi}_M)},

where \hat{\pi}_M = \Phi\left(\tilde{\alpha} / \sqrt{(\tilde{\sigma})^2 + 1}\right)
Table 5-3: Autocorrelation functions for the Old Faithful geyser data.

h:        1      2      3      4      5      6      7      8      9      10
r(h):    -0.54   0.48  -0.35   0.32  -0.26   0.21  -0.16   0.14  -0.17   0.16
rho(h):  -0.48   0.46  -0.37   0.35  -0.29   0.27  -0.23   0.21  -0.18   0.17

Comparison of numerical values of the sample ACF r(h) and the estimated ACF \hat{\rho}(h) based on a logistic-normal AR-GLMM.

is the time-invariant estimated marginal probability of a long eruption time (equal to 0.62) and

Q(t, t+h) = \begin{pmatrix} 1 + (\tilde{\sigma})^2 & (\tilde{\sigma})^2 \hat{\rho}^h \\ (\tilde{\sigma})^2 \hat{\rho}^h & 1 + (\tilde{\sigma})^2 \end{pmatrix}

is the estimate of the covariance matrix (4.15) for a bivariate zero-mean normal random variable with cdf \Phi_2. The plot in Figure 5-6, with numerical values given in Table 5-3, shows good agreement of these model-based estimates of the autocorrelation function with the sample autocorrelation function. Similar good agreement between the model and the data can be seen from Figure 5-7, which plots the empirical lorelogram against its model counterpart. Note that the empirical lorelogram is not defined at lag 1, since the sequence of two consecutive short eruptions was not observed. The plot also shows the asymptotic standard error (ASE) for the log odds ratio at each lag. It should be mentioned that similarly good results were also achieved with the hidden Markov model approach by MacDonald and Zucchini (1997).

The probit approach is not the only way to calculate marginal probabilities and correlations, although it provides closed form approximations. Integrating the conditional success probability over the marginal distribution of the random effect u_t gives the marginal success probability at time t. Similarly, integrating the joint conditional distribution of two successes, two failures, or one success and one failure over the two-dimensional distribution of the random effects vector (u_t, u_{t*}) gives the corresponding marginal joint probabilities. For instance, as we already briefly
Figure 5-6: Autocorrelation functions for the Old Faithful geyser data. Comparison of the sample ACF r(h) (triangles) and the estimated marginal model-based ACF \hat{\rho}(h) (squares) based on a logistic-normal AR-GLMM for the Old Faithful geyser data. The straight dotted lines represent the asymptotic standard error 1/\sqrt{T} for the correlation.
Figure 5-7: Lorelogram for the Old Faithful geyser data. Comparison of the empirical lorelogram LOR(h) (triangles) and the estimated lorelogram \hat{\theta}(h) (squares) based on a logistic-normal AR-GLMM for the Old Faithful geyser data.
mentioned, a unique feature of this series is that every short eruption is followed by a long one; in other words, the sequence (y_t, y_{t+1}) = (0, 0) is not observed for any t. This is not a structural zero, as there is no a priori reason why a short eruption cannot be followed by another short one, although Azzalini and Bowman (1990) mention a geophysical interpretation which makes this quite unlikely. We estimate the marginal joint probability P(Y_t = 0, Y_{t+1} = 0) of two consecutive short eruptions as

P(Y_t = 0, Y_{t+1} = 0) = \Phi_2\left((-\tilde{\alpha}, -\tilde{\alpha})', Q(t, t+1)\right) = 0.031   (5.5)

based on the probit model approximation using the 2-dimensional multivariate normal cdf, and as

P(Y_t = 0, Y_{t+1} = 0) = \frac{1}{m} \sum_{j=1}^{m} \left(1 + \exp\{\hat{\alpha} + u_t^{(j)}\}\right)^{-1} \left(1 + \exp\{\hat{\alpha} + u_{t+1}^{(j)}\}\right)^{-1} = 0.036   (5.6)

based on a Monte Carlo sum approximation of the two-dimensional integral, with m = 500,000 samples from the estimated joint distribution of (u_t, u_{t+1}).

With a straightforward extension of the results presented in Section 4.3.2, we can calculate joint probabilities with the probit model approach for longer sequences of long and short eruptions. For example, the joint probability of observing a long eruption at time t, followed by a short eruption and another long one, is given by

P(Y_t = 1, Y_{t+1} = 0, Y_{t+2} = 1) = \Phi_3\left((\tilde{\alpha}, -\tilde{\alpha}, \tilde{\alpha})', Q(t, t+1, t+2)\right),

again using the threshold interpretation Y_t = 1 \Leftrightarrow T_t < \alpha + u_t. Here \alpha is the intercept term of a probit model, W_s = T_s - u_s, s = t, t+1, t+2, and \Phi_3
is the cdf corresponding to the multivariate normal mean zero random vector (W_t, W_{t+1}, W_{t+2})', with variance-covariance matrix

Q(t, t+1, t+2) = \begin{pmatrix} 1 + (\tilde{\sigma})^2 & (\tilde{\sigma})^2\hat{\rho} & (\tilde{\sigma})^2\hat{\rho}^2 \\ (\tilde{\sigma})^2\hat{\rho} & 1 + (\tilde{\sigma})^2 & (\tilde{\sigma})^2\hat{\rho} \\ (\tilde{\sigma})^2\hat{\rho}^2 & (\tilde{\sigma})^2\hat{\rho} & 1 + (\tilde{\sigma})^2 \end{pmatrix}.

For a model-based estimate of the probability of this particular sequence of three consecutive observations, we simply plug in maximum likelihood estimates of the parameters appearing in \Phi_3(\cdot) and Q(t, t+1, t+2). Estimates of resulting counts for all possible combinations up to order three are displayed in Table 5-4, using both the probit-logit connection and Monte Carlo integration. The results are very similar for both types of approximation, which speaks for the quality of the closed form probit-based approximations derived in the previous chapter.

5.4.1.2 Exchangeability of Certain Sequences

The probit connection also helps in explaining certain symmetries in the model when no time-varying covariates are present. In the geyser example, the conditional probabilities only depend on an intercept term. Returning to sequences of length two, the probability of the event {W_t < \tilde{\alpha}, W_{t+1} > \tilde{\alpha}} is the probability of a long eruption at time t followed by a short one at time t+1. The symmetry of the (mean-zero) bivariate normal distribution, with the special form of the variance-covariance matrix (most notably equal variances), demands that this event has the same probability as the event {W_t > \tilde{\alpha}, W_{t+1} < \tilde{\alpha}}. The probability of the latter event is associated with a short eruption followed by a long one. Hence the estimated marginal probabilities are the same for a sequence of a long eruption followed by a short one and a short eruption followed by a long one. This is reflected in Table 5-4, which compares observed and expected counts for several more sequences of long and short eruptions.
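A quick Monte Carlo check of this exchangeability, in the spirit of approximation (5.6), simulates the random effects at three consecutive times and averages the conditional probabilities of the sequences (1, 1, 0) and (0, 1, 1). The sketch below uses the fitted values \hat{\alpha} = 1.24 and \hat{\sigma} = 3.69, with the lag-1 correlation taken as \hat{\rho} = -0.89; the function name and Monte Carlo size are our own choices:

```python
import math
import random

def sequence_probs(alpha=1.24, sigma=3.69, rho=-0.89, m=100000, seed=3):
    """Monte Carlo estimates of the marginal probabilities of the eruption
    sequences (1,1,0) and (0,1,1) under the logistic AR-GLMM of Section 5.4.1,
    averaging conditional probabilities over draws of (u_t, u_{t+1}, u_{t+2})."""
    rng = random.Random(seed)
    p110 = p011 = 0.0
    sd_innov = sigma * math.sqrt(1 - rho ** 2)
    for _ in range(m):
        u1 = rng.gauss(0.0, sigma)              # stationary AR(1) draws
        u2 = rho * u1 + rng.gauss(0.0, sd_innov)
        u3 = rho * u2 + rng.gauss(0.0, sd_innov)
        pi = [1.0 / (1.0 + math.exp(-(alpha + u))) for u in (u1, u2, u3)]
        p110 += pi[0] * pi[1] * (1 - pi[2]) / m
        p011 += (1 - pi[0]) * pi[1] * pi[2] / m
    return p110, p011

a, b = sequence_probs()
print(round(a, 3), round(b, 3))   # the two estimates agree closely
```

The agreement reflects the time-reversibility of the stationary AR(1) random effects: (u_t, u_{t+1}, u_{t+2}) and (u_{t+2}, u_{t+1}, u_t) have the same distribution.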
For three consecutive eruptions, symmetry in the trivariate (mean-zero) normal distribution of (W_t, W_{t+1}, W_{t+2})′ demands that the events {W_t < α̂, W_{t+1} < α̂, W_{t+2} > α̂} and {W_t > α̂, W_{t+1} < α̂, W_{t+2} < α̂} have equal probability. Hence, the model-based marginal probabilities of two long eruptions followed by a short one (1, 1, 0) and of a short one followed by two long ones (0, 1, 1) are equal. Accordingly, the marginal probability of two short eruptions and a long one (0, 0, 1) is the same as the probability of one long eruption followed by two short ones (1, 0, 0). Again, this symmetry is reflected in the expected counts presented in Table 5-4. It can be interpreted as an exchangeability property for certain sequences of long and short eruptions when no time-varying covariates are present. E.g., denoting the event of two consecutive long eruptions by A and one short eruption by B, our model suggests that the probability distribution of AB is the same as the one for BA.

5.4.1.3 Technical derivation of the exchangeability property

It is not immediately obvious why these pairs of events have equal probability under our model. Following is a proof of the fact that the two events {W_t < α, W_{t+1} < α, W_{t+2} > α} and {W_t > α, W_{t+1} < α, W_{t+2} < α} have equal probability. Without loss of generality, assume t = 1 and let W = (W₁, W₂, W₃)′ have a trivariate normal distribution with mean zero and variance-covariance matrix Σ, where Σ has diagonal elements all equal to σ² and equal covariances
σ₁₂ = σ₂₃. Then the density function of W is proportional to

    f(w₁, w₂, w₃) ∝ exp{ −½ [ (w₁² + w₃²)(σ⁴ − σ₁₂²) + w₂²(σ⁴ − σ₁₃²)
        − 2σ₁₂(σ² − σ₁₃)(w₁ + w₃)w₂ + 2w₁w₃(σ₁₂² − σ²σ₁₃) ] / [ (σ⁴ + σ₁₃σ² − 2σ₁₂²)(σ² − σ₁₃) ] }.

From this expression it is straightforward to derive the corresponding expressions for the densities of (W₁, W₂, W₃)′ and (W₃, W₂, W₁)′. Then it can be shown algebraically, with a simple transformation argument, that these two random vectors have identical densities, i.e., f(w₁, w₂, w₃) = f(w₃, w₂, w₁). (Notice the symmetric way in which w₁ and w₃ enter the density above.) Now the first event has probability

    P(W₁ < α, W₂ < α, W₃ > α) = ∫_α^∞ ∫_{−∞}^α ∫_{−∞}^α f(w₁, w₂, w₃) dw₁ dw₂ dw₃
        = ∫_α^∞ ∫_{−∞}^α ∫_{−∞}^α f(w₃, w₂, w₁) dw₁ dw₂ dw₃
        = ∫_{−∞}^α ∫_{−∞}^α ∫_α^∞ f(v₁, v₂, v₃) dv₁ dv₂ dv₃
        = P(W₁ > α, W₂ < α, W₃ < α),

where we used Fubini's theorem and, in the last step, simply renamed the variables (i.e., the transformation w₁ = v₃, w₂ = v₂, w₃ = v₁). This last probability is the probability of a short eruption followed by two long ones, quod erat demonstrandum. The proof for the equivalence of the other pair of three consecutive eruptions is similar. Also, the case of the equality of the marginal probabilities of a short eruption followed by a long one and a long one followed by a short one is handled with similar arguments, using the bivariate normal distribution.

Symmetry also occurs with the Monte Carlo approach for approximating marginal probabilities, since the distributions of (u_t, u_{t+1}) and (u_{t+1}, u_t) are
equivalent. Therefore, a random sample (u_t^{(j)}, u_{t+1}^{(j)}), j = 1, ..., m, from the joint distribution of (u_t, u_{t+1}) is also a random sample from the distribution of (u_{t+1}, u_t). Consequently, u_t and u_{t+1} can be used interchangeably, and the Monte Carlo approximation to P(Y_t = 1, Y_{t+1} = 0),

    (1/m) Σ_{j=1}^m [ exp{α̂ + u_t^{(j)}} / (1 + exp{α̂ + u_t^{(j)}}) ] [ 1 / (1 + exp{α̂ + u_{t+1}^{(j)}}) ],

with samples (u_t^{(j)}, u_{t+1}^{(j)}), is equivalent to

    (1/m) Σ_{j=1}^m [ exp{α̂ + u_{t+1}^{(j)}} / (1 + exp{α̂ + u_{t+1}^{(j)}}) ] [ 1 / (1 + exp{α̂ + u_t^{(j)}}) ],

which is the approximation to P(Y_t = 0, Y_{t+1} = 1). The last column in Table 5-4 displays the expected numbers based on a Monte Carlo approximation of the marginal probabilities.

Another word of caution: simply multiplying the estimated marginal probability of a sequence by the sample size of 299 consecutive eruptions to obtain the expected count is wrong, and can lead to estimated counts larger than what is possible in a sequence of 299 observations. We calculate expected counts of a particular sequence by multiplying the estimated marginal probability for that sequence by the number of possible consecutive sequences of length two or three. E.g., there are 297 possible sequences of three consecutive eruptions in the Old Faithful data set. Hence, with an estimated marginal probability of P̂(Y_t = 1, Y_{t+1} = 1, Y_{t+2} = 1) = 0.1703 for three consecutive long eruptions, we expect a count of 297 × 0.1703 = 50.6 such sequences in the time frame observed for that series. This distinction becomes more important for shorter or unequally spaced series, as will be demonstrated with the next example.

Note that the logistic ARGLMM uses only 3 parameters. The only cells showing some lack of fit in Table 5-4 are the ones which involve two or more
Table 5-4: Comparison of observed and expected counts for the Old Faithful geyser data.

                                              expected counts
                         observed counts    probit    Monte Carlo
    long eruptions (1)        194            185.7       185.0
    short eruptions (0)       105            113.3       114.0
                              299            299         299

    from 1 to 1                89             81.4        81.9
    from 1 to 0               105            103.6       102.6
    from 0 to 1               104            103.6       102.6
    from 0 to 0                 0              9.3        10.9
                              298            298         298

    from 1 to 1 to 1           54             50.6        50.6
    from 1 to 1 to 0           35             30.6        31.0
    from 1 to 0 to 1          104             96.1        94.1
    from 1 to 0 to 0            0              7.2         8.1
    from 0 to 1 to 1           35             30.6        31.0
    from 0 to 1 to 0           69             72.7        71.5
    from 0 to 0 to 1            0              7.2         8.1
    from 0 to 0 to 0            0              2.0         2.6
                              297            297         297

The table compares observed counts of short and long eruptions, and of various transitions, with those expected under a logistic ARGLMM for the Old Faithful geyser data.

consecutive short eruptions, an outcome that was not observed in the given time span. However, our model assigns only a very small probability to this event.

5.4.2 Oxford versus Cambridge Boat Race Data

In this illustration we consider the outcome of the annual boat race between teams representing the University of Oxford and the University of Cambridge. The first race took place in 1829 and was won by Oxford, and the last race (at the time of writing) took place in 2003 and was won by Oxford, for the second time in a row. Overall, Cambridge holds a slight edge, having won 77 out of 148 races (52.0%). Two races were held in 1849, one in March and one in December. Since all other races are traditionally held in late March or early April, we treat the
Figure 5-8: Plot of the Oxford vs. Cambridge boat race data. Squares at 0 and 1 are the outcomes of the individual races, where a square at 1 stands for a Cambridge win. The jagged line connects the estimates of the conditional success probabilities π̂_t(û_t) over time.

result of December 1849 as the result for 1850, when no race took place. There are 26 years, such as during both world wars, in which the race did not take place. These are 1830–1835, 1837/38, 1843/44, 1847/48, 1851, 1853, 1855, 1915–1919 and 1940–1945. No special handling of these missing data is required with our methods of maximum likelihood estimation. In 1877 the race ended as a dead heat, which we treat as another missing value, in the sense that for this year no winner could be determined. The data are available online at www.theboatrace.org and are plotted in Figure 5-8.

5.4.2.1 A GLMM with autocorrelated random effects

Let Y_t = 1 if Cambridge wins in year t and Y_t = 0 if Oxford wins, where t indexes the 148 years in which the race took place. Conditional on an autoregressive random time effect u_t, we model the log odds of a Cambridge win at time t as

    logit P(Y_t = 1 | u_t) = α + βw_t + u_t,    (5.7)
where w_t is the difference between the average weight of the Cambridge crew and the Oxford crew. (The boats have standard size and weight.) This is the only covariate available. A winner in one year is also likely to be a winner in the next year, because of overlapping crew memberships, rowing techniques, training methods, experience and many other factors. In part, each outcome reflects the underlying combined efforts leading up to the race. We propose correlated random effects to parsimoniously characterize the variation and correlation in outcomes due to these efforts. That is, our model establishes a link between successive winning probabilities by specifying an underlying autoregressive process u_{t+1} = ρ^{d_t} u_t + ε_t for the random effects, where d_t is the time lag (in years) between two successive races.

Figure 5-9 shows the dependency in the data by plotting a smooth estimate s(h) of the sample variogram ĝ(h) for lags up to 50 years. The most important feature is the strong increase over the first few lags, after which the sample variogram levels off at a constant level. A similar impression of the dependency structure in this data set is obtained from the lorelogram shown in Figure 5-10. For each lag h, the odds of a Cambridge win are estimated by cross-classifying outcomes h years apart. E.g., the first three values, LOR(1) = 1.30, LOR(2) = 1.37 and LOR(3) = 0.74, in the plot are the log odds ratios corresponding to the following contingency tables, cross-classifying outcomes one, two or three years apart:

        lag 1            lag 2            lag 3
          0    1           0    1           0    1
    0    42   24     0    43   21     0    37   25
    1    23   48     1    24   46     1    29   41

In constructing these tables, proper care must be taken to accommodate the years in which no race took place. Similar to the variogram, we observe a sharp decline in the log odds ratio over the first few lags, after which the log odds ratios level off at around 0. Figure 5-10 also shows twice the asymptotic standard error (ASE) of the
Figure 5-9: Variogram for the Oxford vs. Cambridge boat race data. Comparison of a smooth estimate of the sample variogram with a model-based estimate of the variogram. Triangles represent the smooth (natural cubic spline) estimate s(h) of the sample variogram. Squares represent the model-based estimate γ̂(h) of the variogram. Crosses are the actual values of the sample variogram ĝ(h).
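For reference, the sample variogram plotted here can be computed directly from the observed 0/1 series. A sketch (function name ours), with the series stored as a year-to-outcome dict so that unraced years are simply absent:

```python
def sample_variogram(series, h):
    """Sample variogram g(h) = 0.5 * mean of (y_{t+h} - y_t)^2 over all
    pairs of outcomes h years apart; years with no race are simply
    absent from the dict, so those pairs are skipped."""
    diffs = [(series[t + h] - series[t]) ** 2 for t in series if t + h in series]
    return 0.5 * sum(diffs) / len(diffs)
```

For a binary series, each squared difference is 0 or 1, so g(h) is half the proportion of discordant pairs at lag h.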
Figure 5-10: Lorelogram for the Oxford vs. Cambridge boat race data. Comparison of a smooth estimate of the sample lorelogram with the model-based estimate of the lorelogram. Triangles represent the smooth (natural cubic spline) estimate s(h) of the sample lorelogram. Squares represent the model-based estimate θ̂(h) of the lorelogram. Crosses are the actual values LOR(h) of the sample lorelogram, and the grey dotted lines represent two times the ASE of the log odds ratio.

log odds ratio at each lag, calculated from the observed tables. For the three tables above, the ASEs are 0.36, 0.37 and 0.35, respectively.

According to the webpage www.theboatrace.org, "Boat race legend has it that the heavier and taller crews have an advantage when it comes to race day". In fact, an estimate of the weight effect β, under the assumption of independent outcomes from one year to the next, equals 0.056, with s.e. equal to 0.023. This would support the claim that the heavier crew has higher odds of winning the race. E.g., we estimate that a 5-pound difference increases the odds of winning by 32%. However, can this claim still be supported in the presence of dependent observations, as both the plot of the variogram and the lorelogram suggest?
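The sample lorelogram values LOR(h) and their ASEs quoted above can be computed by cross-classifying outcomes h years apart. A sketch under our own naming, again with unraced years simply absent from the dict:

```python
import math

def lorelogram(series, h):
    """Sample log odds ratio LOR(h) and its asymptotic standard error
    from a dict {year: 0/1 outcome}; a pair is counted only when both
    years h apart were actually observed."""
    n = [[0, 0], [0, 0]]              # n[y_t][y_{t+h}] cross-classification
    for year, y in series.items():
        if year + h in series:
            n[y][series[year + h]] += 1
    lor = math.log(n[1][1] * n[0][0] / (n[1][0] * n[0][1]))
    # ASE of a log odds ratio: sqrt of the sum of reciprocal cell counts
    ase = math.sqrt(sum(1.0 / n[a][b] for a in (0, 1) for b in (0, 1)))
    return lor, ase
```

Applied to the lag-1 table above, this gives LOR(1) = log(42·48 / (24·23)) ≈ 1.30 with ASE = √(1/42 + 1/24 + 1/23 + 1/48) ≈ 0.36, matching the values quoted in the text.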
Table 5-5: Maximum likelihood estimates for the boat race data.

                  α        β        σ        ρ
    estimate:    0.27     0.079    2.65     0.68
    s.e.:        0.54     0.047    1.18     0.11

Figure 5-11: Path plots of fixed and random effects parameter estimates for the boat race data. The x-axis shows the iteration number through the iterations of the MCEM algorithm. The Monte Carlo sample size increased over 112 iterations, from 100 at the beginning to 8260 random draws from the posterior random effects distribution at the end.

We used the MCEM algorithm described in Sections 2.3 and 3.3 for model (5.7) to obtain the maximum likelihood estimates displayed in Table 5-5. Standard errors are based on a Monte Carlo approximation of the observed information matrix, using 50,000 samples from the estimated posterior distribution. Trace plots for the parameter estimates are pictured in Figure 5-11, with convergence criteria similar to the ones mentioned for the Old Faithful data: ε₁ = 0.001, c = 4, ε₂ = 0.003, ε₃ = 0.005, a = 1.03 and q = 1.05 (cf. Section 2.3.3). Regular GLM estimates for α and β, together with σ = 2 and ρ = 0, were used as starting values for the MCEM algorithm.
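Given the estimates in Table 5-5, the model-based variogram γ(h) = P(Y_t = 1) − P(Y_t = 1, Y_{t+h} = 1) and the marginal log odds ratio θ(h) used below to check the fit (with w_t = 0) can both be approximated by Monte Carlo. This is an illustrative sketch under our own naming, not the dissertation's code:

```python
import numpy as np

def marginal_pair_probs(alpha, sigma, rho, h, m=500_000, seed=2):
    """MC estimates of P(Y_t = 1) and P(Y_t = 1, Y_{t+h} = 1) under the
    logistic ARGLMM with no covariate (w_t = 0)."""
    rng = np.random.default_rng(seed)
    u_t = rng.normal(0.0, sigma, size=m)
    # lag-h correlation of the AR(1) random effects is rho^h
    u_h = rho**h * u_t + rng.normal(0.0, sigma * np.sqrt(1 - rho**(2 * h)), size=m)
    p_t = 1.0 / (1.0 + np.exp(-(alpha + u_t)))
    p_h = 1.0 / (1.0 + np.exp(-(alpha + u_h)))
    return p_t.mean(), (p_t * p_h).mean()

def variogram(alpha, sigma, rho, h, **kw):
    p1, p11 = marginal_pair_probs(alpha, sigma, rho, h, **kw)
    return p1 - p11                      # gamma(h) = P(Y_t=1) - P(Y_t=1, Y_{t+h}=1)

def log_odds_ratio(alpha, sigma, rho, h, **kw):
    p1, p11 = marginal_pair_probs(alpha, sigma, rho, h, **kw)
    p10 = p1 - p11                       # by stationarity, P(1,0) = P(0,1)
    p00 = 1.0 - 2.0 * p10 - p11
    return np.log(p11 * p00 / p10**2)    # theta(h)
```

Under independence (σ → 0), γ(h) reduces to p(1 − p) and θ(h) to 0, which provides a quick sanity check of the implementation.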
Based on Table 5-5, we estimate that the conditional odds of a Cambridge win increase by 48% for every 5 pounds the average Cambridge team member (and there are 9 on a boat, including the cox) weighs more, but the estimated standard error might be too large to conclude a significant effect. A quadratic effect of the weight difference on the log odds was found to be insignificant. The significant correlation of 0.68 for races one year apart indicates a strong dependency between successive outcomes. This reflects the fact that the odds of winning are influenced by factors such as overlapping crew memberships, training methods, motivation and experience from previous races. The conditional estimate for β translates into an estimate of 0.041 for the marginal effect, using the probit-logit connection mentioned in Section 4.3.2. Hence, the marginal odds of winning are estimated to increase by 23% (compared to 32% from the GLM fit) for a 5-pound difference in average weight, when we properly adjust for the correlation in the series. Moreover, the large standard error of β̂ does not rule out the possibility of no effect of weight on the odds of winning.

5.4.2.2 Checking the fit of the model

Given the maximum likelihood estimates, the model-based estimate of the variogram is

    γ̂(h) = ½ Ê[(Y_{t+h} − Y_t)²] = P̂(Y_t = 1) − P̂(Y_t = 1, Y_{t+h} = 1),    h = 1, 2, ...,

where the marginal and marginal joint probabilities can be estimated via the probit connection, or by integrating over the one- and two-dimensional random effects distributions. With the inclusion of a time-varying covariate (the difference in average crew weight, w_t), marginal probabilities vary over time. We assume no weight differences (i.e., w_t = 0 for all t) for the calculations in this section. Then, the estimated marginal probability of a Cambridge win, P̂(Y_t = 1), or of two Cambridge
wins at times t and t+h, P̂(Y_t = 1, Y_{t+h} = 1), do not depend on the years t and t+h. Figure 5-9 shows the model-based estimate of the variogram. The agreement with the empirical variogram is good, especially for the most important first few lags, and the model seems to capture the association displayed in the data appropriately. The smooth line in Figure 5-10 represents the model-based estimate of the marginal log odds ratio

    θ̂(h) = log [ P̂(Y_t = 1, Y_{t+h} = 1) P̂(Y_t = 0, Y_{t+h} = 0) / ( P̂(Y_t = 1, Y_{t+h} = 0) P̂(Y_t = 0, Y_{t+h} = 1) ) ]

for observations h years apart when both crews are of equal weight. For instance, for races one year apart, the model-based estimate of the marginal log odds ratio of a Cambridge win is 1.38, approximated via the logit-probit connection. That is, the odds of a Cambridge win over an Oxford win are estimated to be 4 times higher if Cambridge won the previous race than if they lost it. Naturally, this factor gets smaller the greater the time separation between two races. For instance, at lag 2, the odds of a Cambridge win are only 2.4 times higher if they won the race two years earlier rather than losing it. Based on Figure 5-10, a result 5 or more years in the past hardly has any influence on the result in year t. That is, the odds of a Cambridge win in year t are roughly the same whether Cambridge won or lost the race 5 years earlier.

Table 5-6 compares observed and expected counts of particular sequences of wins and losses, again assuming no weight difference between the two crews. Care must be taken in finding all possible sequences of a given length in the observed time series, due to the unequal sampling intervals. For instance, with the specific pattern of no races in certain years, in the series of 148 unequally spaced observations only 137 sequences of two consecutive years and 130 sequences of three consecutive years can be formed. These are the multipliers for the estimated
probabilities of two and three consecutive outcomes, respectively. In general, the agreement between observed and predicted counts of sequences is excellent. The only minor discrepancy seems to be the one concerning three Cambridge losses (or, equivalently, three Oxford wins) in a row. However, in constructing this table we assumed no weight difference between the two crews. On average, over the 148 races, the weight difference between the Cambridge crew and the Oxford crew is −1.03 pounds, i.e., Oxford crews are heavier on average. Out of the 33 races in which Oxford had a weight advantage (i.e., was heavier) by more than 5 pounds, it won 21 (64%). Similarly, out of the 25 races in which Cambridge had a weight advantage by more than 5 pounds, it won 16 (64%). Since weight seems to have some effect on the outcome of the race and was marginally significant, we overestimate the probability of three Cambridge losses (or, equivalently, underestimate the probability of three Oxford wins) by assuming no weight difference for all three races. This leads to the slightly smaller expected count than the one observed in the last entry of Table 5-6. Factoring an average weight difference of −1.03 pounds for all three races into the calculation of the marginal probability, the estimated expected count for three Cambridge losses is 27.9, which is a little closer to the observed number of 32.

5.4.2.3 Prediction of random effects

In traditional GLMMs, the prediction of a univariate random effect u describing an exchangeable correlation structure is the posterior mean E[u | y] of the distribution of the random effect given the observed data. Similarly, the prediction of the random process u = (u₁, ..., u_T)′ is the posterior mean E[u | y]. Draws from the last iteration of the MCEM algorithm can be used to approximate the posterior mean by a Monte Carlo average. Let û_t denote the t-th component of this approximation.
Then an estimate of the conditional probability of a Cambridge win in year t is given by

    π̂_t(û_t) = exp{α̂ + β̂w_t + û_t} / (1 + exp{α̂ + β̂w_t + û_t})
Table 5-6: Observed and expected counts of sequences of wins (W) and losses (L) for the Cambridge University team.

                                   expected counts
              observed counts    probit    Monte Carlo
    W               77            79.1        78.4
    L               71            68.9        69.6
                   148           148         148

    W W             48            50.5        49.8
    W L             23            22.8        23.4
    L W             24            22.8        23.4
    L L             42            41.0        40.4
                   137           137         137

    W W W           34            34.8        34.1
    W W L           13            13.1        13.2
    W L W           12             9.4         9.8
    W L L           10            12.2        12.2
    L W W           11            13.1        13.3
    L W L            9             8.6         8.9
    L L W            9            12.2        12.3
    L L L           32            26.7        26.1
                   130           130         130
Table 5-7: Estimated random effects û_t for the last 45 years of the boat race data.

    year result    û_t     year result    û_t     year result    û_t
    1959   L     −0.49     1974   L     −0.23     1989   L     −2.35
    1960   L     −0.58     1975   W      0.12     1990   L     −2.11
    1961   W      0.63     1976   L     −1.31     1991   L     −1.82
    1962   W      0.71     1977   L     −2.07     1992   L     −1.03
    1963   L     −0.19     1978   L     −2.57     1993   W      0.75
    1964   W      0.12     1979   L     −2.76     1994   W      1.63
    1965   L     −1.11     1980   L     −2.85     1995   W      2.05
    1966   L     −1.34     1981   L     −2.83     1996   W      2.16
    1967   L     −0.71     1982   L     −2.78     1997   W      2.08
    1968   W      0.85     1983   L     −2.59     1998   W      1.74
    1969   W      1.68     1984   L     −2.35     1999   W      1.07
    1970   W      2.11     1985   L     −1.81     2000   L     −0.10
    1971   W      2.11     1986   W     −0.56     2001   W      0.08
    1972   W      1.76     1987   L     −1.41     2002   L     −1.38
    1973   W      1.05     1988   L     −1.98     2003   L     −1.82

and plotted in Figure 5-8. It seems that a sequence of wins pulls the estimated conditional probability towards 1, and a couple of losses pulls it towards 0. The predicted random effects for the last 45 years are displayed in Table 5-7, together with the outcomes of these races. The structure of the estimated random effects reflects the dynamics of the data: positive random effects are usually associated with a Cambridge win and negative ones with an Oxford win. The magnitude of the predicted random effects increases (decreases) the closer the start (end) of a sequence of wins or losses for one team is in sight. For example, in 1993 the predicted random effect is 0.75. During the next 3 years, all of which are Cambridge wins, the magnitude of the predicted random effects increases steadily, reflecting the increased confidence (as measured by the odds) of another Cambridge win due to past results. After 1996, the predicted random effects slowly decline, but still show a preference for a Cambridge win. In 2000 they turn negative, because of an Oxford win in that year. For 2001 the random effect rises momentarily, because of another Cambridge win, but it declines again in 2002 and 2003 because of two consecutive Cambridge losses.
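The estimated conditional win probability π̂_t(û_t) and the 0.5-threshold prediction rule used below for the misclassification comparison amount to the following (a sketch; the parameter values shown come from Tables 5-5 and 5-7, and the function names are ours):

```python
import math

def win_prob(alpha_hat, beta_hat, w_t, u_t_hat):
    """pi_t(u_t) = exp{a + b*w_t + u_t} / (1 + exp{a + b*w_t + u_t})."""
    eta = alpha_hat + beta_hat * w_t + u_t_hat
    return 1.0 / (1.0 + math.exp(-eta))

def predict_outcome(alpha_hat, beta_hat, w_t, u_t_hat):
    """Prediction rule: y_hat = 1 iff the estimated win probability > 0.5."""
    return 1 if win_prob(alpha_hat, beta_hat, w_t, u_t_hat) > 0.5 else 0

# 2003 (a Cambridge loss), with u_hat = -1.82 and w_t set to 0:
print(predict_outcome(0.27, 0.079, 0.0, -1.82))   # -> 0, an Oxford win
```

Sweeping this rule over all 148 observed years with their posterior-mean random effects yields the misclassification counts discussed next.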
Part of this phenomenon can be explained through the form of the full univariate conditional distribution of u_t, given the data and all other random effects. Section 3.4.1 showed that this distribution depends on the immediate predecessor u_{t−1}, the immediate successor u_{t+1} and the outcome Y_t of the race. In turn, the full univariate conditional distribution of the successor u_{t+1} depends directly on the result Y_{t+1} of that race and, again, on the random effects before and after it. In this way, information about future wins or losses is incorporated into the posterior distribution of u_t. For example, a lost race in the near future results in decreasing predicted random effects preceding it, reflecting this future change in momentum. Incidentally, the distribution of the random effect u_T at the boundary t = T depends only on its immediate predecessor u_{T−1} and on Y_T.

Using the prediction rule ŷ_t = 1 if π̂_t(û_t) > 0.5 and ŷ_t = 0 otherwise, we can compare outcomes predicted by our model to observed ones. Of the 148 observations, the GLMM with autocorrelated random effects misclassifies only 8, or 5.4%, as the opposite of the outcome actually observed. This is, of course, a significant improvement over predictions based on a regular logit GLM, which has a misclassification rate of 62 out of the 148 observations, or 41.9%.

5.4.2.4 Prediction of future outcomes

The availability of marginal joint distributions, through the probit or Monte Carlo approximations, allows one to consider marginal conditional distributions such as the one-step-ahead forecast distribution

    P(Y_{T+1} = y_{T+1} | Y_T = y_T, Y_{T−1} = y_{T−1}, ..., Y_{T−s} = y_{T−s})
        = P(Y_{T+1} = y_{T+1}, Y_T = y_T, ..., Y_{T−s} = y_{T−s}) / P(Y_T = y_T, ..., Y_{T−s} = y_{T−s})    (5.8)

for an outcome at time T+1, factoring in the past s+1 observations, s = 0, 1, 2, .... We use our proposed model to obtain an estimate of the numerator
and denominator in (5.8). The two-stage hierarchy, together with the autoregressive nature of the random effects, implies

    P(Y_{T+1} = y_{T+1}, Y_T = y_T, ..., Y_{T−s} = y_{T−s})
        = ∫ [ ∏_{t=T−s}^{T+1} P(Y_t = y_t | u_t) ] g(u_{T−s}) [ ∏_{t=T−s+1}^{T+1} g(u_t | u_{t−1}) ] du_{T−s} ··· du_T du_{T+1}.    (5.9)

The last term, P(Y_{T+1} = y_{T+1} | u_{T+1}), in the first product is given by an extrapolation of the fitted model to time point T+1, i.e.,

    P(Y_{T+1} = 1 | u_{T+1}) = exp{x′_{T+1}β + u_{T+1}} / (1 + exp{x′_{T+1}β + u_{T+1}}),

where x_{T+1} is the covariate vector for time point T+1; it is estimated by using the MLE β̂ for β. The last term, g(u_{T+1} | u_T), in the second product is determined by extrapolating the underlying random process {u_t}_{t=1}^T to time point T+1, i.e., u_{T+1} = ρ^{d_T} u_T + ε_T, where d_T is the time distance between points T and T+1, and g(u_{T+1} | u_T) is a normal distribution with mean ρ^{d_T} u_T and variance σ²(1 − ρ^{2d_T}).

For the boat race data, a Monte Carlo approximation to (5.9) is given by

    (1/m) Σ_{j=1}^m ∏_{t=T−s}^{T+1} exp{ y_t (α̂ + β̂w_t + u_t^{(j)}) } / ( 1 + exp{ α̂ + β̂w_t + u_t^{(j)} } ),

where u_t^{(j)} is the t-th component of the j-th generated autoregressive process u^{(j)} (extrapolated to time point T+1), with variance components σ̂² and ρ̂. The estimated forecast probabilities of a Cambridge win in 2004, based on the past s+1 observations and the GLMM with autocorrelated random effects, are given in Table 5-8. For example, P̂(Y₁₄₉ = 1 | Y₁₄₈ = 0), the conditional probability of a Cambridge win in 2004 given the outcome of the race in 2003 (a Cambridge loss), is estimated to be 0.32. We assumed a zero weight difference, w_{T+1} = 0, in calculating these forecast probabilities; however, any value can be substituted. (Historically, the crews weigh in four days prior to the race, so that
Table 5-8: Estimated probabilities of a Cambridge win in 2004, given the past s+1 outcomes of the race.

    s:          0      1      2      3       4        5         6
    history:    L     LL    LLW   LLWL   LLWLW   LLWLWW   LLWLWWW
    P̂:       0.323  0.286  0.318  0.309   0.311    0.312     0.313

The first row displays the number of years s preceding 2003 that are conditioned on. The second row shows the history of Cambridge wins and losses from 2003 back to 2003−s. The third row displays the estimated probabilities P̂(Y₁₄₉ = 1 | Y₁₄₈ = y₁₄₈, ..., Y₁₄₈₋ₛ = y₁₄₈₋ₛ) of a Cambridge win given the past s+1 observations.

w_{T+1} will be available before the actual outcome of the race.) Including the past two results, of 2003 and 2002 (two Cambridge losses), the estimated probability of a Cambridge win in 2004 decreases to 0.29. Considering the last three outcomes (two Cambridge losses, one Cambridge win), the estimated probability is again 0.32. Conditioning on outcomes even further back in time, the estimated probability of a Cambridge win stays roughly constant at 0.31, even when the 7-year winning streak of Cambridge in the years 1993 to 1999 is factored in. It seems reasonable, however, that these outcomes do not impact the 2004 outcome, in terms of crew membership and training methods, as do the outcomes of 2003 or 2002. We already mentioned, in connection with the interpretation of the lorelogram in Figure 5-10, that results five or more years in the past hardly seem to influence the current outcome.

Considering the weight factor, if the average crewman on the Oxford boat weighs more by 5 pounds, then the estimated probability of a Cambridge win decreases from 0.32 to 0.28 when conditioning on the outcome of the last race, and from 0.29 to 0.24 when factoring in the last two outcomes. On the other hand, if Cambridge holds a weight advantage of 5 pounds in 2004, the predicted winning probabilities are 0.37 and 0.33, respectively.
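The forecast probabilities in Table 5-8 are ratios of joint probabilities, as in (5.8), with numerator and denominator each approximated by the Monte Carlo sum (5.9). A compact sketch for equally spaced histories (so d_t = 1 throughout; function name and signature are ours):

```python
import numpy as np

def forecast_prob(alpha, beta, sigma, rho, y_hist, w_hist, w_next,
                  m=200_000, seed=3):
    """MC approximation of P(Y_{T+1} = 1 | last s+1 outcomes):
    simulate AR(1) random-effect paths over the history, extrapolate one
    step, and take the ratio of the two joint-probability estimates."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y_hist, float)
    s1 = len(y)                               # s + 1 past observations
    u = np.empty((m, s1 + 1))
    u[:, 0] = rng.normal(0.0, sigma, size=m)
    innov_sd = sigma * np.sqrt(1.0 - rho**2)
    for t in range(1, s1 + 1):                # AR(1) path, incl. time T+1
        u[:, t] = rho * u[:, t - 1] + rng.normal(0.0, innov_sd, size=m)
    eta = alpha + beta * np.append(w_hist, w_next) + u
    p = 1.0 / (1.0 + np.exp(-eta))            # P(Y_t = 1 | u_t)
    hist_lik = np.prod(np.where(y == 1, p[:, :s1], 1 - p[:, :s1]), axis=1)
    # numerator: the observed history followed by a win at time T+1
    return np.mean(hist_lik * p[:, s1]) / np.mean(hist_lik)
```

With the estimates of Table 5-5 plugged in and y_hist = [0] (the 2003 loss), this kind of computation reproduces forecasts in the neighborhood of the 0.32 reported above; exact values depend on the Monte Carlo sample.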
One special case of the above derivations is not to condition on any of the past outcomes, and simply to look at the marginal probability of a Cambridge win in 2004,
which for general T is given by

    P(Y_{T+1} = 1) = ∫ π_{T+1}(u_{T+1}) g(u_{T+1}) du_{T+1},

where

    π_{T+1}(u_{T+1}) = exp{α + βw_{T+1} + u_{T+1}} / (1 + exp{α + βw_{T+1} + u_{T+1}})

and u_{T+1} has a marginal N(0, σ²) distribution. However, this estimator does not factor in any past information. For no weight difference (w_{T+1} = 0), it is calculated to be 0.52.

The above predictions are based on marginal probabilities, where the random effects have been integrated out. Another way of predicting the probability of a future outcome uses the conditional model directly, incorporating an estimate of the random effect at time T+1. With the autoregressive nature of the random effects, the minimum mean-squared error predictor of u_{T+1} is given by u*_{T+1} = E[u_{T+1} | u_T] = ρ^{d_T} u_T. To estimate u*_{T+1}, we use the posterior distribution of the random effects given the observed data, i.e.,

    û*_{T+1} = E[ E[u_{T+1} | u_T] | y ] = E[ρ^{d_T} u_T | y] = ρ̂^{d_T} û_T,

where we plug in the maximum likelihood estimate of ρ. This is consistent with the way random effects are predicted in spatial GLMMs, as described by Zhang (2002). He proved the following theorem under the assumption of known fixed and random effects parameters. Here we adapt his theorem to our time series context, to facilitate prediction of unobserved intermediate and future outcomes.

Theorem: Let u_k, k ∈ Z⁺, be Gaussian with E[u_k] = 0 for all k. If, conditionally on {u_k, k ∈ Z⁺}, the {Y_k, k ∈ Z⁺} are independent, and for each k the distribution of Y_k depends on u_k only, then for any k and observed time points t₁, t₂, ..., t_T,

    E[u_k | y] = Σ_{i=1}^T c_i E[u_{t_i} | y],    (5.10)
where the coefficients c_i are such that E[u_k | u_{t₁}, ..., u_{t_T}] = Σ_{i=1}^T c_i u_{t_i}, and y = (y_{t₁}, ..., y_{t_T}) are the observations at time points t₁, ..., t_T.

Proof: (An adaptation from Zhang, 2002.) Let k ∈ Z⁺ with k ≠ t_i for all i = 1, ..., T. Let f(u_k, u, y) denote the joint density of (u_k, u, y), where u = (u_{t₁}, ..., u_{t_T}) holds the random effects at the observed time points. The distribution of the observed data depends on the random effects at the observation time points, but on no other random effects; hence f(y | u_k, u) = f(y | u). Then

    f(u_k, u, y) = f(y | u_k, u) f(u_k, u) = f(y | u) f(u_k, u) = f(u, y) f(u_k | u).

Dividing both sides by f(u, y), we obtain f(u_k | u, y) = f(u_k | u) and, consequently, E[u_k | u, y] = E[u_k | u] = Σ_{i=1}^T c_i u_{t_i} for some appropriate constants c_i. By the properties of repeated expectation (i.e., E_{X|Y}[X | y] = E_{Z|Y}[ E_{X|Z,Y}[X | z, y] | y ]), we have E[ E[u_k | u, y] | y ] = E[u_k | y], and (5.10) follows.

For the boat race data, we could use (5.10) to obtain a prediction of the probability of an outcome in a year in which no race took place, i.e., choose a k < t_T with k ≠ t₁, ..., t_T, where t₁, ..., t_T are the years in which a race took place. (For clarity, we now denote the set of years in which a race took place by t₁, t₂, ..., t_T, where t₁ is the year of the first race, 1829, t₂ is the year of the second race, 1836, and t_T is the year 2003.) More importantly, we can use (5.10) to predict future outcomes. For instance, by setting k = t_T + 1, i.e., to the year 2004, we obtain

    û*_{t_T+1} = E[u_{t_T+1} | y] = Σ_{i=1}^{148} c_i E[u_{t_i} | y] = ρ E[u_{t_T} | y] = ρ û_{t_T}

as the prediction of the random effect for that year, the same result we derived before. Here we made use of (5.10) and of the autoregressive nature of the random effects, which imply that E[u_{t_T+1} | u_{t₁}, ..., u_{t_T}] = ρ u_{t_T}, i.e., c₁ = ... = c_{T−1} = 0
and c_T = ρ. The prediction can be evaluated by plugging in the MLE for ρ and using the Monte Carlo sample from the last iteration of the MCEM algorithm to approximate the posterior mean. The prediction of a random effect s years into the future (and hence the prediction of the distribution of a future outcome Y_{t_T+s} under the assumed model) can be obtained similarly. E.g., with k = t_T + s, E[u_{t_T+s} | y] = Σ_{i=1}^T c_i E[u_{t_i} | y] = ρ^s E[u_{t_T} | y], since E[u_{t_T+s} | u_{t₁}, ..., u_{t_T}] = ρ^s u_{t_T}, i.e., c₁ = ... = c_{T−1} = 0 and c_T = ρ^s. For the boat race data, the estimated random effect for 2004 is û*₁₄₉ = 0.69 × (−1.82) = −1.26. Then, according to our model, the estimated probability of a Cambridge win in 2004, given this prediction of the random effect for that year, is 0.26.
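The s-step prediction E[u_{t_T+s} | y] = ρ^s E[u_{t_T} | y] is a one-line computation; below, the values ρ̂ = 0.69 and û = −1.82 for 2003 are carried over from the text (a sketch, names ours):

```python
import math

def predict_future_effect(rho_hat, u_T_post_mean, s=1):
    """E[u_{T+s} | y] = rho^s * E[u_T | y] under AR(1) random effects."""
    return rho_hat**s * u_T_post_mean

u_2004 = predict_future_effect(0.69, -1.82)        # about -1.26
# plug into the inverse logit with alpha_hat = 0.27 and w = 0
p_2004 = 1.0 / (1.0 + math.exp(-(0.27 + u_2004)))
```

The resulting probability is in line with the 0.26 quoted above; as s grows, ρ^s shrinks the prediction toward the prior mean of 0, so long-range forecasts revert to the marginal winning probability.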
CHAPTER 6
SUMMARY, DISCUSSION AND FUTURE RESEARCH

In this dissertation we proposed autocorrelated and other correlated random effects in GLMMs as a means of introducing and modeling correlation in regression models for series of unequally spaced counts or binary/binomial observations. In Chapter 1 we contrasted our regression approach with one based on modeling the mean, variance and covariance directly (marginal models), and with another based on regressing the current response on previous observations and covariates (transitional models). At the cost of increased computational time and more complex algorithms, inferential procedures for GLMMs are based on the joint likelihood of the T observations y₁, ..., y_T. In contrast, marginal models are based on a quasi-likelihood approach, and estimation in transitional models relies on conditionally or partially specified likelihoods that do not represent the full joint distribution. In particular, constructing tables such as Tables 5-4 and 5-6, which compare observed counts to marginal predicted counts of sequences of events, is then impossible.

In Chapter 2 we presented a general MCEM algorithm for fitting GLMMs, and in Chapter 3 we derived specific algorithmic details for equally correlated and autocorrelated random effects. There, we gave details on the implementation of a fully iterative M-step, as opposed to just a single iteration. We also focused on the Gibbs sampler as a means of sampling from the posterior distribution of the random effects given the observed data, as required for the approximation of the E-step and for posterior predictions of the random effects. Other MCMC methods, such as a Metropolis-Hastings algorithm, can also be employed; however, the Gibbs sampling approach reduces to simple forms for autoregressive random effects and was relatively fast in implementations. Furthermore, the structure of the full univariate conditionals
made clear how the correlations between random effects serve as building blocks for the correlation between the time series observations. The first graph in Figure 6-1 shows the exchangeable correlation structure among the Y_t's implied by the traditional GLMM assumption of one random effect common to all observations. All vertices (where a vertex corresponds to a random variable) that are not joined by an edge are conditionally independent. The second graph shows the same picture for autoregressive random effects. Note that a marginal dependency between Y_t and Y_{t-1} is induced through the path via u_t and u_{t-1}. (There is no edge between u_{t-1} and u_{t+1} if we assume a lag-1 autoregressive process.) The third graph is slightly different in nature and pictures the structure of the full univariate conditional distribution of u_t given the other random effects and the data y. As we showed in Section 3.4, the conditional distribution of u_t depends on its predecessor u_{t-1}, its successor u_{t+1} and on the current observation y_t. In turn, u_{t-1} and u_{t+1} depend on the observations at times t-1 and t+1, respectively. Thus the posterior of u_t incorporates information from past and future responses and random effects. Although autocorrelated random effects in GLMMs are not new to the literature (e.g., Chan and Ledolter, 1995), we extended their application by explicitly allowing for gaps in the observed time series through specifying the correlation in the AR(1) process in terms of a lag. This allowed us to handle missing data in the series without any additional procedures or adjustments to likelihood inference. In some instances, predicting responses at times where no responses (and covariates) were observed could be a potential goal. We presented some theory for intermediate prediction with the Oxford vs. Cambridge boat race data.
Because the random effects incorporate information on previous and future observations, our regression models should be well suited to predicting intermediate observations at time points where no response was observed.
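The way an interior random effect pools information from both of its neighbors can be made concrete: for a Gaussian AR(1) process the prior part of its full conditional has a simple closed form. This is a hedged sketch under a lag-1 autoregressive assumption with innovation variance sigma2; in the actual ARGLMM Gibbs step this density is further multiplied by the likelihood contribution of y_t, so the code below covers only the Gaussian factor.

```python
def ar1_conditional(u_prev, u_next, rho, sigma2):
    """Conditional of an interior u_t given its two chain neighbors.

    For u_t = rho * u_{t-1} + eps_t with eps_t ~ N(0, sigma2), completing the
    square in u_t gives
        u_t | u_{t-1}, u_{t+1} ~ N( rho * (u_{t-1} + u_{t+1}) / (1 + rho**2),
                                    sigma2 / (1 + rho**2) ).
    """
    mean = rho * (u_prev + u_next) / (1.0 + rho ** 2)
    var = sigma2 / (1.0 + rho ** 2)
    return mean, var

# Illustrative values only: the conditional mean averages the two neighbors,
# and the conditional variance is smaller than the innovation variance.
m, v = ar1_conditional(u_prev=0.5, u_next=1.5, rho=0.8, sigma2=1.0)
```

Note how the conditional variance sigma2 / (1 + rho**2) shrinks as |rho| grows: the more strongly autocorrelated the latent process, the more an interior random effect is pinned down by its neighbors.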
Figure 6-1: Association graphs for GLMMs. The first two diagrams represent the associations among observations Y_1, ..., Y_T in GLMMs with one common random effect and with autocorrelated random effects {u_t}. Vertices not connected by edges are conditionally independent. The last graph represents the association structure for the posterior distribution of u_t given the other autocorrelated random effects and the data. The influence of covariates is not shown.

Chapter 4 was devoted to deriving marginal properties of our time series regression models based on normal, Poisson, negative binomial and binomial distributional assumptions on the observations. We saw that conditionally specified GLMMs lead to marginal overdispersion relative to the normal, Poisson, negative binomial or binomial variance, and we gave formulas for its expression in the case of the correlated random effects suggested here. More importantly, we derived expressions for the marginal correlations between any two members of the time series implied by our models. While these have closed forms in the case of normal, Poisson and negative binomial time series regression models, approximations have to be used in the binomial case with a logit link. We explored several options and presented an approximation based on the similarity between logit-link and probit-link models to evaluate marginal properties. The derived formulas for marginal means and correlations are useful for comparing empirical quantities to model-based quantities, such as a comparison of observed and predicted counts, observed and predicted autocorrelations, or observed and predicted log odds ratios
in a time series. These are important aspects in determining the appropriateness of the random effects distribution and the goodness of fit of the model in general.

Applications of the MCEM algorithm developed in Chapters 2 and 3 and of the theory developed in Chapter 4 are given in Chapter 5. Examples of binomial, binary and count time series were presented and modeled within the proposed framework of GLMMs with autoregressive random effects (ARGLMMs). For the binary case we explored certain symmetries the model implies in the marginal distribution when no time-varying covariates are observed. We also developed some theory on predicting future events in binary time series based on the conditional model or the implied marginal model. Some results of these data analyses are discussed in the next two sections.

6.1 Cross-Sectional Time Series

We motivated the usefulness and appropriateness of our methodology through the analysis of a cross-sectional binomial time series from one of the largest U.S. data sets for social science research. Scientists who use this data base and would like to analyze developments of count or binomial responses through time should consider the methods described here, because they address the temporal dependence in the observations over the years, address the cross-sectional dependence within a year, and naturally handle gaps in the observed time series. We showed an analysis of such data that assumes an approximate normal distribution for the log odds and fits a corresponding linear mixed effects model. We made adjustments by appropriately weighting the log odds to more closely meet the assumptions of a normal linear mixed model. However, we also presented an analysis based on the true binomial nature of the observations using a logistic ARGLMM. We consider this the better approach, particularly if the binomial sample sizes are small, because it allows the variance of the log odds to vary as a function
of the mean. In time series models the mean often displays trend behavior, and therefore the assumption of constant variance is inappropriate. Whether weighting the log odds by their estimated asymptotic standard deviations in the normal approximation model simultaneously alleviates all of these problems remains doubtful. The ARGLMM can be fit using the MCEM algorithm outlined in Chapters 2 and 3 and possesses the three features mentioned above. Furthermore, using the marginal results for binomial time series discussed in Section 4.3, these models allow for both overdispersion relative to the binomial variance and correlation between successive observations.

Data from the General Social Survey are not the only application of our methods. In political science, especially in international relations, annual binary time series cross-sectional data are very common. For instance, much research focuses on the analysis of the relationship (conflict/no conflict) between two states over a long period of years. Ad-hoc methods, such as including the residuals from a preliminary logit analysis in the linear predictor (Oneal and Russett, 1997), have been proposed to adjust for the temporal dependence. More sophisticated methods treat the binary time series as grouped time-to-event (or survival) data and include temporal dummy variables in the linear predictor of a logit model. These dummy variables mark the number of years between two events (i.e., conflicts) of the binary time series and are motivated by a relationship between Cox's proportional hazards model for time-to-event data and logit models (Beck, Katz and Tucker, 1998). However, one drawback of these methods is that usually several events (e.g., conflicts) are observed over time. In particular, Beck et al. (1998) treat the probabilities of subsequent events as independent of the first one. This is a much stricter (and often unrealistic) assumption than the conditional independence assumption in ARGLMMs.
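The dummy-variable construction just described can be sketched as follows. This is a hedged illustration of the general idea in Beck, Katz and Tucker (1998), not their exact implementation: the helper below computes the time elapsed since the last event in a binary series, from which the temporal dummies (or a spline basis) would be built.

```python
def years_since_last_event(y):
    """For a 0/1 event series, return the number of periods since the last event.

    The counter resets to zero in the period immediately after an event; the
    convention for the first period (no prior event observed) is to start at 0,
    which is an assumption of this sketch.
    """
    out, counter = [], 0
    for obs in y:
        out.append(counter)
        counter = 0 if obs == 1 else counter + 1
    return out

# Example: events (conflicts) in periods 3 and 7 of an 8-period series.
spells = years_since_last_event([0, 0, 1, 0, 0, 0, 1, 0])
```

Indicator dummies for each value of the spell counter would then enter the linear predictor of the logit model, which is exactly where the criticism below applies: the induced dependency is fixed by the dummy coding rather than modeled.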
Furthermore, by including dummy variables (or, as also suggested, a natural cubic spline version) in the linear predictor to induce dependency, the nature of this dependency cannot be modeled. Beck et al. (1998) note that temporal dependence cannot provide a satisfactory explanation [of conflict] by itself but must instead be the consequence of some important but unobserved variable. Hence the GLMMs with an assumed latent autoregressive random process developed in this dissertation seem to be a natural approach to analyzing binary time series cross-sectional data and are an attractive alternative to the widely used methods suggested by Beck et al. (1998).

6.2 Univariate Time Series

Apart from cross-sectional time series data, we focused on the analysis of a single time series of counts or binary observations in Chapter 5. Standard loglinear or logit analysis ignoring the serial dependence may result in misleading inference. This was evident for the Polio data of Section 5.3, where we showed that the ARGLMM adequately captured the correlation structure of the residuals, which was not the case for other, more standard models. Hence these models (an ordinary Poisson GLM, a negative binomial GLM and a Poisson GLMM) showed strong evidence of a time trend (see Table 5-2), whereas the evidence is considerably weaker when the correlation is accounted for. Similarly, for the boat race data we were able to correctly quantify a common belief about the influence of weight. Ignoring the correlation in the series of wins and losses would have led to an overstatement of the influence of weight, one that is not as strong once the analysis takes the serial correlation into account. We emphasized model checking through residual analysis, with a comparison of empirical and theoretical autocorrelations, lorelograms and variograms in the case of unequally spaced observations.
For the Polio data we calculated residuals for all entertained models and showed that only the autocorrelation function implied by
an ARGLMM mimics the one observed in the residuals. Residual autocorrelations from other models ignoring the serial dependence showed nonconformity with the model specifications. For the Old Faithful data set we observed good agreement between the ARGLMM-implied marginal autocorrelation function and the empirical one. Similarly, the estimated and empirical variogram and lorelogram for the boat race data showed good agreement, indicating a reasonable assumption about the model-implied dependency structure. This was further justified by a comparison of observed and (marginally) predicted counts of sequences of wins and losses, which we approximated using either the connection to the exact marginal expressions for probit models or Monte Carlo approximation.

6.2.1 Clipping of Time Series

The applicability of our regression models to the analysis of binary time series, and of the methodology developed here, may be broader than initially realized. Let Y_t be a binary time series obtained by clipping (Kedem, 1980) an underlying process Z_t, such that Y_t = I[Z_t in C], where I is the indicator function, which is 1 if Z_t is in the set C and 0 if it is in its complement. Estimation (t <= T) and prediction (t > T) of pi(u_t) = P(Y_t = 1 | u_t) using an ARGLMM with covariates x_t then entails estimation and prediction of the event {Z_t in C} based on covariate information. This might be useful in a variety of settings, e.g., when an investigator is forced to, or more comfortable with, dichotomizing the observed data.

6.2.2 Longitudinal Data

The type of time series data we consider in this dissertation is different from longitudinal or panel data, which usually consist of only a few repeated observations (i.e., T is small, often less than 5) but with a large number of replications. It is doubtful that the estimation techniques (such as GEE) developed for longitudinal,
especially interdependent binary, data are useful in our context of very long time series, since the temporal dependence is much richer. For instance, while it was possible to fit a marginal model via GEE for the two binomial time series of 16 observations each in the data about homosexual relationships, we could not fit any of the other time series mentioned in Chapter 5 with GEE methodology. The reason is that solving the estimating equations requires inversion of the T x T covariance matrix of (y_1, ..., y_T). In longitudinal studies this matrix is also large, but it has block-diagonal structure with low-dimensional blocks corresponding to the few repeated measurements within a cluster. For the GEE analysis of the polio count data, Zeger (1988) proposed approximating the T x T covariance matrix of (Y_1, ..., Y_T) with a simpler band-diagonal matrix corresponding to an autoregressive process, which then has an easy inverse. Also note that in the case of a single time series (i.e., a single cluster), the usual approach of adjusting estimated standard errors by using the sample covariance matrix as a robust estimate of the correlation matrix is not applicable. This is because the usual sum over clusters in the expression for the robust asymptotic covariance matrix reduces to a single summand, which is the score equation that was set equal to 0. Furthermore, the GEE approach does not yield estimates of multivariate probabilities, so that constructing tables such as 5-4 and 5-6, which compare observed counts to marginally predicted counts of sequences to evaluate goodness of fit, is impossible.

6.3 Extensions and Further Research

Several extensions of the proposed methodology are possible, opening new lines of research. In the following we give a brief overview of some ideas.
6.3.1 Alternative Random Effects Distributions

We focused on latent first-order autoregressive random effects processes for describing time series observations in a GLMM framework, but extensions to pth-order processes are possible. The derivations of the MCEM algorithm and the
Gibbs sampler should be similar, where the full conditional distribution of u_t now depends on its 2p neighbors. More generally, other random effects distributions (possibly improper) can be explored. For example, our AR(1) process is a special case of a generalized autoregressive random effects process u_t = rho * sum_{j} c_{tj} u_j + epsilon_t, which was proposed by Ord (1975). With appropriately specified constants c_{tj}, one alternative to AR(1) (or AR(2)) random effects is to let

u_t = rho (u_{t-1} + u_{t+1}) + epsilon_t, t = 2, ..., T-1,
u_1 = rho u_2 + epsilon_1, u_T = rho u_{T-1} + epsilon_T.

For this case, Sun, Speckman and Tsutakawa (2000) mention that the full univariate conditional distribution of u_t depends on (u_{t-2}, u_{t-1}, u_{t+1}, u_{t+2}) for 3 <= t <= T-2, and interpretations similar to those given above, regarding the random effects as building blocks for the correlation in the time series, apply. As mentioned in Section 1.4.1, models with more complicated random effects structures are often fit in a Bayesian framework, assuming noninformative priors on the fixed effects and variance components. In that setting, propriety of the posterior distribution cannot be guaranteed for a Poisson GLMM when one of the observed counts is zero, and is impossible in a logit-link GLMM for binomial observations if they are equal to 0 or n_t for just one t (see Theorem 4.1 and Examples 4.1 and 4.2 in Sun, Speckman and Tsutakawa, 2000). Hence, with noninformative (flat) priors on the fixed effects and variance components of the correlated random effects distribution, Bayesian GLMMs for binary time series result in improper posteriors. Further developing the models and methods presented in this dissertation would therefore be a worthwhile goal.
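The neighbor structure of such alternative processes can be checked numerically. The sketch below, under illustrative parameter values, writes the two-sided process as u = rho * B * u + eps (a simultaneous autoregression on the chain), so that the joint precision of u is (I - rho*B)'(I - rho*B) / sigma2; this matrix is pentadiagonal, which is exactly why the full conditional of an interior u_t involves its four nearest neighbors.

```python
def sar_precision(T, rho, sigma2=1.0):
    """Precision matrix of u defined by u = rho*B*u + eps on a chain of length T.

    B links each u_t to its chain neighbors (one neighbor at the ends), so
    M = I - rho*B maps u to the independent innovations and the precision of u
    is M'M / sigma2.
    """
    M = [[0.0] * T for _ in range(T)]
    for t in range(T):
        M[t][t] = 1.0
        if t - 1 >= 0:
            M[t][t - 1] = -rho
        if t + 1 < T:
            M[t][t + 1] = -rho
    return [[sum(M[k][i] * M[k][j] for k in range(T)) / sigma2
             for j in range(T)] for i in range(T)]

# Illustrative values: entries vanish beyond the second off-diagonal, so each
# interior u_t is conditionally dependent on at most four neighbors.
P = sar_precision(T=6, rho=0.3)
```

By contrast, the AR(1) process of Chapter 3 has a tridiagonal precision, which is why its Gibbs conditionals involve only the two immediate neighbors.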
6.3.2 Topics in GLMM Research

Goodness-of-fit measures for GLMMs remain an active area of research. We proposed some methods here based on a comparison of observed and marginally fitted counts; however, no formal statistic was developed. Part of the problem is that GLMMs cannot easily be made a special case of some broader model (such as the saturated model for GLMs) and compared to it. Recently, Presnell and Boos (2004) suggested a test for model misspecification based on comparing the maximized likelihood of a model to one motivated by a cross-validation approach in which observations are deleted sequentially. It would be interesting to see how this applies to GLMMs, although the computational complexity due to refitting the model several times might be a huge burden. Similar computational costs arise when we try to determine whether GLMMs with autoregressive random effects are useful for prediction. Section 5.4.2 presented some theory of predicting future outcomes in binary time series, but more work is needed on cross-validation, misclassification rates or similar measures of the quality of the model for prediction. From the examples we have seen in this dissertation, it appears that the estimate of the standard deviation of the latent autoregressive process is rather large, which would lead to wide prediction intervals on the logit and original probability scales.

Much recent work has focused on transitional models for categorical time series with more than two categories (e.g., Fokianos and Kedem, 2003). We believe that a multivariate GLMM approach (Agresti, 2002; Fahrmeir and Tutz, 2001) with carefully specified correlated random effects (univariate or multivariate) might be an alternative worth studying.

Lastly, although we focused on analyzing a single time series, it should be straightforward to extend our methodology to the analysis of several independent time series (e.g., one for each subject in a longitudinal study with a large number
of repeated observations), as was alluded to throughout Chapters 2 and 3 of this dissertation. Unlike GEE, our methodology, with its special regard for unequally spaced time series, should be practical when not all subjects are measured at common time points and, a very realistic assumption, subjects skip certain time points. Smith and Diggle (1998) propose a GEE approach for estimating fixed effects parameters in such circumstances, coupled with a complicated pseudo-likelihood approach that assumes independence for estimating the variance components. We would like to extend our proposed likelihood framework, which jointly estimates all parameters, to this situation.
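The structure of the extension to several independent series can be sketched in code: the joint log-likelihood is simply a sum of per-series terms, each an integral over that series' random effects. This hedged sketch uses the simplest possible setup (a single N(0, sigma^2) random intercept per series, Poisson responses, naive Monte Carlo for the integral) rather than the autocorrelated random effects and MCEM machinery of the dissertation; all data and parameter values are made up for illustration.

```python
import math
import random

def series_loglik(y, eta_fixed, sigma, n_draws=20000, seed=1):
    """Naive Monte Carlo approximation of one series' marginal log-likelihood.

    Approximates log integral prod_t Poisson(y_t; exp(eta_t + u)) dN(u; 0, sigma^2)
    by averaging the integrand over draws of u.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        u = rng.gauss(0.0, sigma)
        contrib = 1.0
        for yt, eta in zip(y, eta_fixed):
            mu = math.exp(eta + u)
            contrib *= math.exp(-mu) * mu ** yt / math.factorial(yt)
        total += contrib
    return math.log(total / n_draws)

# Independence across series means the joint log-likelihood is a plain sum,
# even when the series have different lengths or observation times.
series = [([1, 0, 2], [0.1, 0.0, 0.3]), ([0, 1], [0.2, 0.2])]
joint = sum(series_loglik(y, eta, sigma=0.5) for y, eta in series)
```

Jointly maximizing such a sum over the fixed effects and variance components is what distinguishes the proposed full-likelihood extension from the two-stage GEE/pseudo-likelihood approach of Smith and Diggle (1998).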
REFERENCES

Abramowitz, M. and Stegun, I. (1964). Handbook of Mathematical Functions, New York: Dover.

Agresti, A. (2002). Categorical Data Analysis, 2nd edition, New York: John Wiley and Sons.

Aitkin, M. (1999). A general maximum likelihood analysis of variance components in generalized linear models, Biometrics 55: 117-128.

Aitkin, M. and Alfo, M. (1998). Regression models for binary longitudinal responses, Statistics and Computing 8: 289-307.

Azzalini, A. (1994). Logistic regression for autocorrelated data with application to repeated measures, Biometrika 81: 767-775.

Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful Geyser, Applied Statistics 39: 357-365.

Azzalini, A. and Chiogna, M. (1997). S-Plus tools for the analysis of repeated measures data, Computational Statistics 12: 53-66.

Bahadur, R. R. (1961). A representation of the joint distribution of responses to n dichotomous items, Studies in Item Analysis and Prediction, pp. 158-168.

Beck, N., Katz, J. K. and Tucker, R. (1998). Taking time seriously: Time-series cross-section analysis with a binary dependent variable, American Journal of Political Science 42: 1260-1288.

Benjamin, M. A., Rigby, R. A. and Stasinopoulos, M. D. (2003). Generalized autoregressive moving average models, Journal of the American Statistical Association 98: 214-223.

Besag, J., Green, P., Higdon, D. and Mengersen, K. (1995). Bayesian computation and stochastic systems, Statistical Science 10: 3-41.

Bock, R. D. and Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm, Psychometrika 46: 443-459.

Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm, Journal of the Royal Statistical Society, Series B 61: 265-285.
Booth, J. G., Casella, G., Friedl, H. and Hobert, J. P. (2003). Negative binomial loglinear mixed models, Statistical Modelling: An International Journal 3: 179-191.

Booth, J. G., Hobert, J. P. and Jank, W. (2001). A survey of Monte Carlo algorithms for maximizing the likelihood of a two-stage hierarchical model, Statistical Modelling: An International Journal 1: 333-349.

Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models, Journal of the American Statistical Association 88: 9-25.

Breslow, N. E. and Lin, X. (1995). Bias correction in generalised linear mixed models with a single component of dispersion, Biometrika 82: 81-91.

Caffo, B., Jank, W. and Jones, G. (2003). Ascent-based MCEM, under review.

Carey, V., Zeger, S. L. and Diggle, P. (1993). Modelling multivariate binary data with alternating logistic regressions, Biometrika 80: 517-526.

Chan, J. S. K. and Kuk, A. Y. C. (1997). Maximum likelihood estimation for probit-linear mixed models with correlated random effects, Biometrics 53: 86-97.

Chan, K. S. and Ledolter, J. (1995). Monte Carlo EM estimation for time series models involving counts, Journal of the American Statistical Association 90: 242-252.

Chib, S. and Greenberg, E. (1998). Analysis of multivariate probit models, Biometrika 85: 347-361.

Cox, D. R. (1972). The analysis of multivariate binary data, Applied Statistics 21: 113-120.

Cox, D. R. (1975). Partial likelihood, Biometrika 62: 269-276.

Cox, D. R. (1981). Statistical analysis of time series: Some recent developments, Scandinavian Journal of Statistics 8: 93-108.

Crowder, M. J. and Hand, D. J. (1990). Analysis of Repeated Measures, London: Chapman and Hall.

Davis, R., Dunsmuir, W. and Streett, S. (2003). Observation-driven models for Poisson counts, Biometrika 90: 777-790.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (C/R: p22-37), Journal of the Royal Statistical Society, Series B, Methodological pp. 1-22.
Diggle, P., Heagerty, P., Liang, K.-Y. and Zeger, S. L. (2002). Analysis of Longitudinal Data, 2nd edition, Oxford: Oxford University Press.
Diggle, P. J. (1990). Time Series: A Biostatistical Introduction, Oxford: Oxford University Press.

Diggle, P. J., Tawn, J. A. and Moyeed, R. A. (1998). Model-based geostatistics, Applied Statistics 47: 299-326.

Durbin, J. and Koopman, S. J. (1997). Monte Carlo maximum likelihood estimation for non-Gaussian state space models, Biometrika 84: 669-684.

Durbin, J. and Koopman, S. J. (2000). Time series analysis of non-Gaussian observations based on state space models from both classical and Bayesian perspectives, Journal of the Royal Statistical Society, Series B 62: 3-56.

Durbin, J. and Koopman, S. J. (2001). Time Series Analysis by State Space Methods, Oxford: Oxford University Press.

Fahrmeir, L. and Tutz, G. (2001). Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd edition, New York: Springer.

Fitzmaurice, G. M. and Laird, N. M. (1993). A likelihood-based method for analysing longitudinal binary responses, Biometrika 80: 141-151.

Fitzmaurice, G. M. and Lipsitz, S. R. (1995). A model for binary time series data with serial odds ratio patterns, Applied Statistics 44: 51-61.

Fitzmaurice, G. M., Laird, N. M. and Rotnitzky, A. G. (1993). Regression models for discrete longitudinal responses, Statistical Science 8: 284-299.

Fokianos, K. and Kedem, B. (1998). Prediction and classification of non-stationary categorical time series, Journal of Multivariate Analysis 67: 277-296.

Fokianos, K. and Kedem, B. (2003). Regression theory for categorical time series, Statistical Science 18: 357-376.

Fort, G. and Moulines, E. (2003). Convergence of the Monte Carlo expectation maximization for curved exponential families, The Annals of Statistics 31: 1220-1259.

Gelfand, A. E. and Carlin, B. P. (1993). Maximum-likelihood estimation for constrained- or missing-data models, The Canadian Journal of Statistics 21: 303-311.

Geyer, C. J. and Thompson, E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data, Journal of the Royal Statistical Society, Series B 54: 657-683.

Ghosh, M., Natarajan, K., Stroud, T. W. F. and Carlin, B. P. (1998). Generalized linear models for small-area estimation, Journal of the American Statistical Association 93: 273-282.
Heagerty, P. J. (1999). Marginally specified logistic-normal models for longitudinal binary data, Biometrics 55: 688-698.

Heagerty, P. J. and Zeger, S. L. (2000). Marginalized multilevel models and likelihood inference, Statistical Science 15: 1-26.

Jank, W. and Booth, J. G. (2003). Efficiency of Monte Carlo EM and simulated maximum likelihood in generalized linear mixed models, Journal of Computational and Graphical Statistics 12: 214-230.

Johnson, N. L. and Kotz, S. (1970). Distributions in Statistics: Continuous Univariate Distributions, Volume 2, Boston: Houghton-Mifflin.

Kastner, C., Fieger, A. and Heumann, C. (1997). MAREG and WinMAREG: A tool for marginal regression models, Computational Statistics and Data Analysis 24: 237-241.

Kaufmann, H. (1987). Regression models for nonstationary categorical time series: Asymptotic estimation theory, The Annals of Statistics 15: 79-98.

Kedem, B. (1980). Binary Time Series, New York: Marcel Dekker.

Kedem, B. and Fokianos, K. (2002). Regression Models for Time Series Analysis, New York: John Wiley and Sons.

Lang, J. B. (1996). Maximum likelihood methods for a generalized class of log-linear models, The Annals of Statistics 24: 726-752.

Lang, J. B. (2004). Multinomial-Poisson homogeneous models for contingency tables, The Annals of Statistics 32: 340-383.

Lang, J. B. and Agresti, A. (1994). Simultaneously modeling joint and marginal distributions of multivariate categorical responses, Journal of the American Statistical Association 89: 625-632.

Lange, K. (1995). A quasi-Newton acceleration of the EM algorithm, Statistica Sinica 5: 1-18.

Lee, Y. and Nelder, J. A. (1996). Hierarchical generalized linear models, Journal of the Royal Statistical Society, Series B 58: 619-656.

Lee, Y. and Nelder, J. A. (2001). Modelling and analysing correlated non-normal data, Statistical Modelling: An International Journal 1: 3-16.

Li, W. K. (1994). Time series models based on generalized linear models: Some further results, Biometrics 50: 506-511.

Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models, Biometrika 73: 13-22.
Lin, X. and Breslow, N. E. (1996). Bias correction in generalized linear mixed models with multiple components of dispersion, Journal of the American Statistical Association 91: 1007-1016.

Liu, S.-I. (2001). Bayesian model determination for binary-time-series data with applications, Computational Statistics and Data Analysis 36: 461-473.

Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm, Journal of the Royal Statistical Society, Series B 44: 226-233.

MacDonald, I. L. and Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-Valued Time Series, New York: Chapman and Hall.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd edition, London: Chapman and Hall.

McCulloch, C. E. (1994). Maximum likelihood variance components estimation for binary data, Journal of the American Statistical Association 89: 330-335.

McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models, Journal of the American Statistical Association 92: 162-170.

McKeown, S. P. and Johnson, W. D. (1996). Testing for autocorrelation and equality of covariance matrices, Biometrics 52: 1087-1095.

Natarajan, R. and McCulloch, C. E. (1995). A note on the existence of the posterior distribution for a class of mixed models for binomial responses, Biometrika 82: 639-643.

Natarajan, R. and McCulloch, C. E. (1998). Gibbs sampling with diffuse proper priors: A valid approach to data-driven inference?, Journal of Computational and Graphical Statistics 7: 267-277.

Oneal, J. and Russett, B. (1997). The classical liberals were right: Democracy, interdependence, and conflict, International Studies Quarterly 41: 267-294.

Ord, K. (1975). Estimation methods for models of spatial interaction, Journal of the American Statistical Association 70: 120-126.

Presnell, B. and Boos, D. (2004). The IOS test for model misspecification, Journal of the American Statistical Association 99: 216-227.

Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods, New York: Springer.

Searle, S. R., Casella, G. and McCulloch, C. E. (1992). Variance Components, New York: John Wiley and Sons.
Smith, D. M. and Diggle, P. J. (1998). Compliance in an anti-hypertension trial: A latent process model for binary longitudinal data, Statistics in Medicine 17: 357-370.

Stiratelli, R., Laird, N. and Ware, J. H. (1984). Random-effects models for serial observations with binary response, Biometrics 40: 961-971.

Sun, D., Speckman, P. L. and Tsutakawa, R. K. (2000). Random effects in generalized linear mixed models, in Generalized Linear Models: A Bayesian Perspective, Dey, D., Ghosh, S. and Mallick, B. (editors), New York: Marcel Dekker.

Sun, D., Tsutakawa, R. K. and Speckman, P. L. (1999). Posterior distribution of hierarchical models using CAR(1) distributions, Biometrika 86: 341-350.

Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities, Journal of the American Statistical Association 81: 82-86.

Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data, New York: Springer.

Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method, Biometrika 61: 439-447.

Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms, Journal of the American Statistical Association 85: 699-704.

West, M., Harrison, P. J. and Migon, H. S. (1985). Dynamic generalized linear models and Bayesian forecasting, Journal of the American Statistical Association 80: 73-83.

Witt, G. (1987). The analysis of repeated measurements with first-order autocorrelation. Ph.D. dissertation, University of Pennsylvania, Philadelphia.

Wolfinger, R. and O'Connell, M. (1993). Generalized linear mixed models: A pseudo-likelihood approach, Journal of Statistical Computation and Simulation 48: 233-243.

Wu, C. F. J. (1983). On the convergence properties of the EM algorithm, The Annals of Statistics 11: 95-103.

Zeger, S. L. (1988). A regression model for time series of counts, Biometrika 75: 621-629.

Zeger, S. L. and Qaqish, B. (1988). Markov regression models for time series: A quasi-likelihood approach, Biometrics 44: 1019-1031.
Zeger, S. L., Liang, K.-Y. and Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach, Biometrics 44: 1049-1060.

Zhang, H. (2002). On estimation and prediction for spatial generalized linear mixed models, Biometrics 58: 129-136.

Zhao, L. P. and Prentice, R. L. (1990). Correlated binary regression using a quadratic exponential model, Biometrika 77: 642-648.
BIOGRAPHICAL SKETCH

Bernhard Klingenberg was born on August 31, 1973, in Graz, Austria, as the first son of Dr. Hans and Ilse Klingenberg. Upon graduating from Lichtenfels High School, he completed his compulsory military training as a truck driver responsible for group and armory transports. In 1992 he registered with the Technical University Graz to pursue studies in mathematics, operations research and statistics. He graduated with a master's degree and highest honors in 1998.

Eager to further intensify his knowledge, Bernhard was awarded a Fulbright scholarship in 1999 for academic study abroad and decided to pursue a Ph.D. degree in statistics at the University of Florida. Aside from completing the standard curriculum, he worked as a teaching assistant and as a statistical consultant before joining Dr. Alan Agresti as a research assistant. Just before passing the written Ph.D. qualifying exams in August 2001, he married Sophia Froehlich, a medical student at the University of Vienna. In spring 2003 their daughter Franziska was born in Orlando, Florida.

In the fall of 2001 Bernhard began to work with Dr. Alan Agresti and Dr. James Booth on topics in generalized linear mixed models, leading to his dissertation work in modeling discrete-valued time series data. After graduating, Bernhard will be an assistant professor in the Department of Mathematics and Statistics at Williams College, Massachusetts.
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Alan G. Agresti, Chair
Distinguished Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

James G. Booth, Cochair
Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

George Casella
Distinguished Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Associate Professor of Political Science
This dissertation was submitted to the Graduate Faculty of the Department of Statistics in the College of Liberal Arts and Sciences and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August 2004

Dean, Graduate School

