REGRESSION MODELS FOR DISCRETE-VALUED TIME SERIES DATA

By

BERNHARD KLINGENBERG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2004

Copyright 2004 by Bernhard Klingenberg

To Sophia and Jean-Luc Picard

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to Drs. Alan Agresti and James Booth for their guidance and assistance with my dissertation research and for their support throughout my years at the University of Florida. During the past three years as a research assistant for Dr. Agresti, I gained valuable experience in conducting statistical research and writing scholarly papers, for which I am very grateful. I would also like to thank Dr. Ramon Littell, who guided me through a year of invaluable statistical consulting experience at IFAS, Dr. George Casella, who taught me Monte Carlo methods, and Drs. Jeff Gill and Michael Martinez for serving on my committee. My gratitude extends to all the faculty and present and former graduate students of the Department, among them, in alphabetical order, Brian Caffo, Sounak Chakraborty, Dr. Herwig Friedl, Ludwig Heigenhauser, David Hitchcock, Wolfgang Jank, Galin Jones, Ziyad Mahfoud, Siuli Mukhopadhyay and Brian Stephens.

I would like to thank my family, foremost my wife Sophia and my daughter Franziska, for all their support, light and joy they bring into my life and for providing me with energy and fulfillment. Lastly, this dissertation would not have been written in English had it not been for the countless adventures of the Starship Enterprise and its Captain, Jean-Luc Picard, whose episodes kept me glued to the TV in Austria and helped to sufficiently improve my knowledge of the English language.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Regression Models for Correlated Discrete Data
  1.2 Marginal Models
    1.2.1 Likelihood Based Estimation Methods
    1.2.2 Quasi-Likelihood Based Estimation Methods
  1.3 Transitional Models
    1.3.1 Model Fitting
    1.3.2 Transitional Models for Time Series of Counts
    1.3.3 Transitional Models for Binary Data
  1.4 Random Effects Models
    1.4.1 Correlated Random Effects in GLMMs
    1.4.2 Other Modeling Approaches
  1.5 Motivation and Outline of the Dissertation

2 GENERALIZED LINEAR MIXED MODELS
  2.1 Definition and Notation
    2.1.1 Generalized Linear Mixed Models for Univariate Discrete Time Series
    2.1.2 State Space Models for Discrete Time Series Observations
    2.1.3 Structural Similarities Between State Space Models and GLMMs
    2.1.4 Practical Differences
  2.2 Maximum Likelihood Estimation
    2.2.1 Direct and Indirect Maximum Likelihood Procedures
    2.2.2 Model Fitting in a Bayesian Framework
    2.2.3 Maximum Likelihood Estimation for State Space Models
  2.3 The Monte Carlo EM Algorithm
    2.3.1 Maximization of Q_m
    2.3.2 Generating Samples from h(u | y; β, Σ)
    2.3.3 Convergence Criteria
3 CORRELATED RANDOM EFFECTS
  3.1 A Motivating Example: Data from the General Social Survey
    3.1.1 A GLMM Approach
    3.1.2 Motivating Correlated Random Effects
  3.2 Equally Correlated Random Effects
    3.2.1 Definition of Equally Correlated Random Effects
    3.2.2 The M-step with Equally Correlated Random Effects
  3.3 Autoregressive Random Effects
    3.3.1 Definition of Autoregressive Random Effects
    3.3.2 The M-step with Autoregressive Random Effects
  3.4 Sampling from the Posterior Distribution via Gibbs Sampling
    3.4.1 A Gibbs Sampler for Autoregressive Random Effects
    3.4.2 A Gibbs Sampler for Equally Correlated Random Effects
  3.5 A Simulation Study

4 MODEL PROPERTIES FOR NORMAL, POISSON AND BINOMIAL OBSERVATIONS
  4.1 Analysis for a Time Series of Normal Observations
    4.1.1 Analysis via Linear Mixed Models
    4.1.2 Parameter Interpretation
  4.2 Analysis for a Time Series of Counts
    4.2.1 Marginal Model Implied by the Poisson GLMM
    4.2.2 Parameter Interpretation
  4.3 Analysis for a Time Series of Binomial or Binary Observations
    4.3.1 Marginal Model Implied by the Binomial GLMM
    4.3.2 Approximation Techniques for Marginal Moments
    4.3.3 Parameter Interpretation

5 EXAMPLES OF COUNT, BINOMIAL AND BINARY TIME SERIES
  5.1 Graphical Exploration of Correlation Structures
    5.1.1 The Variogram
    5.1.2 The Lorelogram
  5.2 Normal Time Series
  5.3 Analysis of the Polio Count Data
    5.3.1 Comparison of AR-GLMMs to Other Approaches
    5.3.2 A Residual Analysis for the AR-GLMM
  5.4 Binary and Binomial Time Series
    5.4.1 Old Faithful Geyser Data
    5.4.2 Oxford versus Cambridge Boat Race Data

6 SUMMARY, DISCUSSION AND FUTURE RESEARCH
  6.1 Cross-Sectional Time Series
  6.2 Univariate Time Series
    6.2.1 Clipping of Time Series
    6.2.2 Longitudinal Data
  6.3 Extensions and Further Research
    6.3.1 Alternative Random Effects Distribution
    6.3.2 Topics in GLMM Research

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 A simulation study for a logistic GLMM with autoregressive random effects.
3-2 Simulation study for modeling unequally spaced binary time series.
5-1 Comparing estimates from two models for the log-odds.
5-2 Parameter estimates for the polio data.
5-3 Autocorrelation functions for the Old Faithful geyser data.
5-4 Comparison of observed and expected counts for the Old Faithful geyser data.
5-5 Maximum likelihood estimates for the boat race data.
5-6 Observed and expected counts of sequences of wins (W) and losses (L) for the Cambridge University team.
5-7 Estimated random effects fit for the last 30 years of the boat race data.
5-8 Estimated probabilities of a Cambridge win in 2004, given the past s + 1 outcomes of the race.

LIST OF FIGURES

2-1 Plot of the typical behavior of the Monte Carlo sample size m^(k) and the Q-function Q^(k) through MCEM iterations.
3-1 Sampling proportions from the GSS data set.
3-2 Iteration history for selected parameters and their asymptotic standard errors for the GSS data.
3-3 Realized (simulated) random effects u_1,...,u_T versus estimated random effects û_1,...,û_T.
3-4 Comparing simulated and estimated random effects.
4-1 Approximated marginal probabilities for the fixed part predictor value x'β ranging from −4 to 4 in a logit model.
4-2 Comparison of conditional logit and probit model based probabilities.
4-3 Comparison of implied marginal probabilities from logit and probit models.
5-1 Empirical standard deviations std(θ_it) for the log odds of favoring homosexual relationships, by race.
5-2 Plot of the polio data.
5-3 Iteration history for the polio data.
5-4 Residual autocorrelations for the polio data.
5-5 Residual autocorrelations with outlier adjustment for the polio data.
5-6 Autocorrelation functions for the Old Faithful geyser data.
5-7 Lorelogram for the Old Faithful geyser data.
5-8 Plot of the Oxford vs. Cambridge boat race data.
5-9 Variogram for the Oxford vs. Cambridge boat race data.
5-10 Lorelogram for the Oxford vs. Cambridge boat race data.
5-11 Path plots of fixed and random effects parameter estimates for the boat race data.
6-1 Association graphs for GLMMs.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

REGRESSION MODELS FOR DISCRETE-VALUED TIME SERIES DATA

By

Bernhard Klingenberg

August 2004

Chair: Alan G. Agresti
Cochair: James G. Booth
Major Department: Statistics

Independent random effects in generalized linear models induce an exchangeable correlation structure, but long sequences of counts or binomial observations typically show correlations decaying with increasing lag. This dissertation introduces models with autocorrelated random effects for a more appropriate, parameter-driven analysis of discrete-valued time series data. We present a Monte Carlo EM algorithm with Gibbs sampling to jointly obtain maximum likelihood estimates of regression parameters and variance components. Marginal mean, variance and correlation properties of the conditionally specified models are derived for Poisson, negative binomial and binary/binomial random components. They are used for constructing goodness-of-fit tables and checking the appropriateness of the modeled correlation structure. Our models define a likelihood, and hence estimation of the joint probability of two or more events is possible and used in predicting future responses. Also, all methods are flexible enough to allow for multiple gaps or missing observations in the observed time series. The approach is illustrated with the analysis of a cross-sectional study over 30 years, where only observations from 16 unequally spaced years are available, a time series of 168 monthly counts of polio infections and two long binary time series.

CHAPTER 1
INTRODUCTION

Correlated discrete data arise in a variety of settings in the biomedical, social, political or business sciences whenever a discrete response variable is measured repeatedly. Examples are time series of counts or longitudinal studies measuring a binary response.
Correlations between successive observations arise naturally through a time, space or some other cluster-forming context and have to be incorporated in any inferential procedure. Standard regression models for independent data can be expanded to accommodate such correlations. For continuous-type responses, the normal linear mixed effects model offers such a flexible framework and has been well studied in the past. A recent reference is Verbeke and Molenberghs (2000), who also discuss computer software for fitting linear mixed effects models with popular statistical packages. Although the normal linear mixed effects model is but one member of the broader class of generalized linear mixed effects models, it enjoys unique properties which simplify parameter estimation and interpretation substantially. For discrete response data, however, the normal distribution is not appropriate, and other members of the exponential family of distributions have to be considered.

1.1 Regression Models for Correlated Discrete Data

In this introduction we review extensions of the basic generalized linear model (McCullagh and Nelder, 1989) for analyzing independent observations to models for correlated data. These models are marginal (Section 1.2), transitional (Section 1.3) and random effects models (Section 1.4). An extensive discussion of these models with respect to discrete longitudinal data is given in the books by Agresti (2002) and Diggle, Heagerty, Liang and Zeger (2002). In general, longitudinal studies concern only a few repeated measurements. In this dissertation, however, we are interested in the analysis of much longer series of repeated observations, often exceeding 100 repeated measurements. Therefore, the following review focuses specifically on models for univariate time series observations, some of which are presented in Fahrmeir and Tutz (2001).

Let y_t be a response at time t, t = 1,...,T, observed together with a vector of covariates denoted by x_t. In a generalized linear model (GLM), the mean μ_t = E[y_t] of observation y_t depends on a linear predictor η_t = x_t'β through a link function h(·), forming the relationship μ_t = h⁻¹(x_t'β). The variance of y_t depends on the mean through the relationship var(y_t) = φ_t v(μ_t), where v(·) is a distribution-specific variance function and the {φ_t} are additional dispersion parameters. In a regular GLM, observations at any two distinct time points t and t* are assumed independent.

In the models discussed below, the type of extension to accommodate correlated data depends on the way the correlation is introduced into the model. In marginal models, the correlation can be specified directly, e.g., corr(y_t, y_t*) = ρ, or left completely unspecified, but nonetheless accounted for in likelihood based and non-likelihood based inferences. In transitional models, correlation is introduced by including previous observations in the linear predictor, e.g., η_t = x̃_t'β̃, where x̃_t = (x_t', y_{t−1}, y_{t−2}, ...)' and β̃ = (β', α_1, α_2, ...)' are extensions of the design and parameter vectors of a GLM with independent components. Random effects models induce correlation between observations by including random effects rather than previous observations in the linear predictor, e.g., η_t = x_t'β + u, where u is a random effect shared by all observations. The way correlation is built into a model also determines the type of inference.
Typically, marginal models are fitted by a quasi-likelihood approach, estimation in transitional models is based on a conditional or partial likelihood, and inference in random effects models relies on a full likelihood (possibly Bayesian) approach. However, models and inferential procedures have been developed that allow more flexibility than this categorization suggests.

1.2 Marginal Models

In marginal regression models, the main scientific goal is to assess the influence of covariates on the marginal mean of y_t, treating the association structure between repeated observations as a nuisance. The marginal mean μ_t and variance var(y_t) are modeled separately from a correlation structure between two observations y_t and y_t*. Regression parameters in the linear predictor are called population-averaged parameters, because their interpretation is based on an average over all individuals in a specific covariate subgroup. Due to the correlation among repeated observations, the likelihood for the model refers to the joint distribution of all observations and not to the simpler product of their marginal distributions. However, the model is specified in terms of these marginal distributions, which makes maximum likelihood fitting particularly hard for even a moderate number T of repeated measurements.

1.2.1 Likelihood Based Estimation Methods

For binary data, Fitzmaurice and Laird (1993) discuss a parametrization of the joint distribution in terms of conditional probabilities and log odds ratios. These parameters are related to the marginal means and the same conditional log odds ratios, which describe the higher order associations among the repeated responses. The marginal means and the higher order associations are then modeled in terms of orthogonal parameters β and α, respectively. Fitzmaurice and Laird (1993) present an algorithm for maximizing the likelihood with respect to these two parameter sets. The algorithm has been implemented in a freely available computer program (MAREG) by Kastner et al. (1997).

Another approach to maximum likelihood fitting for longitudinal discrete data regards the marginal model as a constraint on the joint distribution and maximizes the likelihood subject to this constraint. The model is written in terms of a generalized log-linear model C log(Aμ) = Xβ, where μ is a vector of expected counts and A and C are matrices that form marginal counts and functions of those marginal counts, respectively. With this approach, no specific assumption about the correlation structure of repeated observations is made, and the likelihood refers to the most general form of the joint distribution. However, simultaneous modeling of the marginal distribution and a simplified joint distribution is also possible. Details can be found in Lang and Agresti (1994) and Lang (1996). Lang (2004) also offers an R computer program (mph.fit) for maximum likelihood fitting of these very general marginal models.

1.2.2 Quasi-Likelihood Based Estimation Methods

The drawback of the two approaches mentioned above, and of likelihood based methods in general, is that they require enormous computing resources as the number of repeated responses or the number of covariates grows, making maximum likelihood fitting computationally impossible for long time series. This is also true for estimation based on alternative parameterizations of a distribution for multivariate binary data, such as those discussed in Bahadur (1961), Cox (1972) or Zhao and Prentice (1990).
Estimation methods leading to computationally simpler inference (albeit not maximum likelihood) for marginal models are based on a quasi-likelihood approach (Wedderburn, 1974). In a quasi-likelihood approach, no specific form for the distribution of the responses is assumed; only the mean, variance and correlation are specified. However, with discrete data, specifying the mean and covariances does not determine the likelihood, as it would with normal data, so parameter estimation cannot be based on it. Liang and Zeger (1986) proposed generalized estimating equations (GEE) to estimate parameters; these have the form of score equations for GLMs, but cannot be interpreted as such. Their approach also requires the specification of a working correlation matrix for the repeated responses. They show that if the mean function is correctly specified, the solution to the generalized estimating equations is a consistent estimator, regardless of the assumed variance-covariance structure for the repeated responses. They also present an estimator of the asymptotic variance-covariance matrix for the GEE estimates which is robust against misspecification of the working correlation matrix. Several structured working correlation matrices have been proposed for parsimonious modeling of the marginal correlation, and some of them are implemented in statistical software packages for GEE estimation (e.g., SAS's proc genmod with the repeated statement and the type option, or the gee and geel packages in R).

1.2.2.1 GEE for time series of counts

Zeger (1988) uses the GEE methodology to fit a marginal model to a time series {y_t}_{t=1}^T of T = 168 monthly counts of cases of poliomyelitis in the United States. He specifies the marginal mean, variance and correlation by

μ_t = exp(x_t'β),
var(y_t) = μ_t + σ²μ_t²,                                              (1.1)
corr(y_t, y_{t+τ}) = ρ(τ) / [{1 + (σ²μ_t)⁻¹}{1 + (σ²μ_{t+τ})⁻¹}]^{1/2},

where σ² is the variance and ρ(τ) the autocorrelation function of an underlying random process {u_t}. To fit this marginal model, he proposes and outlines the GEE approach, but notes that it requires inversion of the T × T variance-covariance matrix of y_1,...,y_T, which has no recognizable structure and therefore no simple inverse. Subsequently, he suggests approximating this matrix by a simpler, structured matrix, leading to nearly as efficient estimators as would have been obtained with the full GEE approach. The variance component σ² and the unknown parameters in ρ(τ) are estimated by a method of moments approach.

Interestingly, Zeger (1988) derives the marginal mean, variance and correlation in (1.1) from a random effects model specification: Conditional on an underlying latent random process {u_t} with E[u_t] = 1 and cov(u_t, u_{t+τ}) = σ²ρ(τ), he initially models the time series observations as conditionally independent Poisson variables with mean and variance

E[y_t | u_t] = var(y_t | u_t) = exp(x_t'β)u_t.                        (1.2)

Marginally, by the formula for repeated expectation, this leads to the moments presented in (1.1). From there we also see that the latent random process {u_t} has introduced both overdispersion relative to a Poisson variable and autocorrelation among the observations. The models we will develop in subsequent chapters have similar features. The equation for the marginal correlation shows that the autocorrelation in the observed time series must be less than the autocorrelation in the latent process {u_t}. We will return to the polio data set in Chapter 5, where we compare this model to models suggested in this dissertation and elsewhere.
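Since (1.1)-(1.2) fix only the first two moments of the latent process {u_t}, any process matching them produces the same marginal moments. As a quick plausibility check (ours, not part of Zeger's paper), the sketch below simulates a log-normal AR(1) latent process, scaled so that E[u_t] = 1, var(u_t) = σ² and the lag-one autocovariance σ²ρ hold exactly (longer lags are only approximate), and compares empirical moments of the simulated counts with (1.1); all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
T, sigma2, rho, beta0 = 200_000, 0.5, 0.8, 1.0

# Gaussian AR(1) on the log scale, scaled so that u_t = exp(z_t - s2z/2)
# has E[u_t] = 1, var(u_t) = sigma2 and cov(u_t, u_{t+1}) = sigma2 * rho.
s2z = np.log(1 + sigma2)                      # variance of z_t
rz = np.log(1 + sigma2 * rho) / s2z           # lag-one correlation of z_t
z = np.empty(T)
z[0] = rng.normal(0, np.sqrt(s2z))
for t in range(1, T):
    z[t] = rz * z[t - 1] + rng.normal(0, np.sqrt(s2z * (1 - rz**2)))
u = np.exp(z - s2z / 2)

mu = np.exp(beta0)                            # constant covariate part
y = rng.poisson(mu * u)

print("mean:      ", y.mean(), "vs", mu)
print("variance:  ", y.var(), "vs", mu + sigma2 * mu**2)
lag1 = np.corrcoef(y[:-1], y[1:])[0, 1]
print("lag-1 corr:", lag1, "vs", sigma2 * rho * mu**2 / (mu + sigma2 * mu**2))
```

The last line evaluates (1.1) for τ = 1 with a constant mean; the shrinkage of the observed autocorrelation relative to the latent ρ = 0.8 is clearly visible.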
1.2.2.2 GEE for binomial time series

For binary and binomial time series data, it is often more advantageous to model the association between observations using the odds ratio rather than directly specifying the marginal correlation corr(y_t, y_t*) as with count data. The odds ratio is a more natural metric for measuring association between binary outcomes and is easier to interpret. The correlation between two binary outcomes Y_1 and Y_2 is also constrained in a complicated way by their marginal means μ_1 = P(Y_1 = 1) and μ_2 = P(Y_2 = 1), as a consequence of the following inequalities for their joint distribution:

P(Y_1 = 1, Y_2 = 1) = μ_1 + μ_2 − P(Y_1 = 1 or Y_2 = 1) ≥ max{0, μ_1 + μ_2 − 1}

and

P(Y_1 = 1, Y_2 = 1) ≤ min{μ_1, μ_2},

leading to

max{0, μ_1 + μ_2 − 1} ≤ P(Y_1 = 1, Y_2 = 1) ≤ min{μ_1, μ_2}.

Therefore, instead of marginal correlations, a number of authors (Fitzmaurice, Laird and Rotnitzky, 1993; Carey, Zeger and Diggle, 1993) propose the use of marginal odds ratios. For unequally spaced and unbalanced binary time series data, Fitzmaurice and Lipsitz (1995) present a GEE approach which models the marginal association using serial odds ratio patterns. Let ψ_{tt*} denote the marginal odds ratio between two binary observations y_t and y_t*. Their model for the association has the form ψ_{tt*} = α^{1/|t−t*|}, 1 ≤ α < ∞, which has the property that as |t − t*| → 0 there is perfect association (ψ_{tt*} → ∞), and as |t − t*| → ∞ the observations become independent (ψ_{tt*} → 1). Note, however, that only positive association is possible with this type of model. (SAS's proc genmod now offers the possibility of specifying a general regression structure for the log odds ratios with the logor option.)
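Before moving on, the constraint above can be made concrete: for given marginal means, the bounds on P(Y_1 = 1, Y_2 = 1) translate into an attainable range for corr(Y_1, Y_2) that can be much narrower than [−1, 1]. The small function below is our illustration, not part of the original text.

```python
import numpy as np

def corr_range(mu1, mu2):
    """Attainable correlations of binary Y1, Y2 with means mu1, mu2.

    corr = (p11 - mu1*mu2) / sqrt(mu1(1-mu1)mu2(1-mu2)), where the joint
    probability p11 = P(Y1=1, Y2=1) is restricted by the bounds
    max{0, mu1+mu2-1} <= p11 <= min{mu1, mu2}.
    """
    lo = max(0.0, mu1 + mu2 - 1.0)
    hi = min(mu1, mu2)
    sd = np.sqrt(mu1 * (1 - mu1) * mu2 * (1 - mu2))
    return (lo - mu1 * mu2) / sd, (hi - mu1 * mu2) / sd

print(corr_range(0.5, 0.5))   # (-1.0, 1.0): the full range
print(corr_range(0.9, 0.2))   # roughly (-0.67, 0.17): strongly restricted
```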
1.3 Transitional Models

In transitional models, past observations are simply treated as additional predictors. Interest lies in estimating the effects of these and other explanatory variables on the conditional mean of the response y_t, given realizations of the past responses. Specifying the relationship between the mean of y_t and previous observations y_{t−1}, y_{t−2}, ... is another way (in contrast to the direct way of marginal models) of modeling the dependency between correlated responses. Transitional models fit into the framework of GLMs, where, however, the distribution of y_t is now conditional on the past responses. The model in its most general form (Diggle et al., 2002) expresses the conditional mean of y_t as a function of explanatory variables and q functions f_r(·) of past responses,

E[y_t | H_t] = h⁻¹(x_t'β + Σ_{r=1}^q f_r(H_t; α)),       (1.3)

where H_t = {y_{t−1}, y_{t−2}, ..., y_1} denotes the collection of past responses. H_t can also include past explanatory variables and parameters. Often, the models are in discrete-time Markov chain form of order q, and the conditional distribution of y_t given H_t only depends on the last q responses y_{t−1}, ..., y_{t−q}. For example, a transitional logistic regression model for binary responses that is a second order Markov chain has form

logit P(Y_t = 1 | y_{t−1}, y_{t−2}) = x_t'β + α_1 y_{t−1} + α_2 y_{t−2}.

The main difference between transitional models and regular GLMs or marginal models is parameter interpretation. Both the interpretation of α and the interpretation of β are conditional on previous outcomes and depend on how many of these are included. As the time dependence in the model changes, so does the interpretation of parameters. With the logistic regression example from above, the conditional odds of success at time t are exp(α_1) times higher if the previous response was a success rather than a failure. However, this interpretation assumes a fixed and given outcome at time t − 2. Similarly, a coefficient in β represents the change in the log odds for a unit change in x_t, conditional on the two prior responses. It is possible that we lose information on the covariate effect by conditioning on these previous outcomes. In general, the interpretation of parameters in transitional models is different from the population averaged interpretation we discussed for marginal models, where parameters are effects on the marginal mean without conditioning on any previous outcomes.

1.3.1 Model Fitting

If a discrete-time Markov model applies, the likelihood for a generic series y_1,...,y_T is determined by the Markov chain structure:

L(α, β; y_1,...,y_T) = f(y_1,...,y_q) ∏_{t=q+1}^T f(y_t | y_{t−1},...,y_{t−q}).

However, the transitional model (1.3) only specifies the conditional distributions appearing in the product, but not the first term of the likelihood. Often, instead of a full maximum likelihood approach, one conditions on the first q observations and maximizes the corresponding conditional likelihood. If in addition f_r(H_t; α) in (1.3) is a linear function in α (and possibly β), then maximization follows along the lines of GLMs for independent data. Kaufmann (1987) establishes asymptotic properties such as consistency, asymptotic normality and efficiency of the conditional maximum likelihood estimator.

If a Markov assumption is not warranted, estimation can be based on the partial likelihood (Cox, 1975). To motivate the partial likelihood approach, we follow Kedem and Fokianos (2002): They consider occasions where a time series {y_t} is observed jointly with a random covariate series {x_t}. The joint density of (y_t, x_t), t = 1,...,T, parameterized by a vector θ, can be expressed as

f(x_1, y_1, ..., x_T, y_T; θ) = f(x_1; θ) [∏_{t=2}^T f(x_t | H_t; θ)] [∏_{t=1}^T f(y_t | H̃_t; θ)],       (1.4)

where H_t = (x_1, y_1, ..., x_{t−1}, y_{t−1}) and H̃_t = (x_1, y_1, ..., x_{t−1}, y_{t−1}, x_t) hold the history up to time points t − 1 and t, respectively. Let F_{t−1} denote the σ-field generated by y_{t−1}, y_{t−2}, ..., x_t, x_{t−1}, ..., i.e., F_{t−1} is generated by past responses and present and past values of the covariates. Also, let f_t(y_t | F_{t−1}; θ) denote the conditional density of y_t given F_{t−1}, which is of exponential density form with mean modeled by (1.3). Then, the partial likelihood for θ = (α, β) is given by

PL(θ; y_1,...,y_T) = ∏_{t=1}^T f_t(y_t | F_{t−1}; θ),       (1.5)

which is the second product in (1.4); hence the term partial. The loss of information from ignoring the first product in the joint density is considered small. If the covariate process is deterministic, then the partial likelihood becomes a conditional likelihood, but without the necessity of a Markov assumption on the distribution of the y_t's. Standard asymptotic results from likelihood analysis of independent data carry over to the case of partial likelihood estimation with dependent data. Fokianos and Kedem (1998) showed consistency and asymptotic normality of θ̂ and provided an expression for the asymptotic covariance matrix. Since the score equation obtained from (1.5) is identical to one for independent data in a GLM, partial likelihood holds the advantage of easy, fast and readily available software implementation with standard estimation routines such as iteratively reweighted least squares.
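Because the score equations based on (1.5) coincide with GLM score equations, the second-order transitional logistic model above can be fit by ordinary iteratively reweighted least squares after appending the lagged responses to the design matrix. The sketch below (ours; all names and parameter values are illustrative) conditions on the first q = 2 observations and recovers the generating parameters on simulated data.

```python
import numpy as np

def fit_transitional_logit(y, X, q=2, iters=25):
    """Conditional ML for logit P(Y_t=1|past) = x_t'b + a_1 y_{t-1} + ... + a_q y_{t-q}.

    Plain IRLS (Newton-Raphson) for logistic regression, using only rows
    t = q,...,T-1, so the first q observations are conditioned on.
    """
    T = len(y)
    lags = np.column_stack([y[q - r:T - r] for r in range(1, q + 1)])
    Z = np.column_stack([X[q:], lags])       # covariates + lagged responses
    yy = y[q:]
    theta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ theta))
        W = p * (1.0 - p)
        theta += np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (yy - p))
    return theta

rng = np.random.default_rng(0)
T = 5000
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta, alpha = np.array([-0.5, 1.0]), np.array([0.8, -0.4])
y = np.zeros(T)
for t in range(2, T):
    eta = X[t] @ beta + alpha[0] * y[t - 1] + alpha[1] * y[t - 2]
    y[t] = rng.random() < 1.0 / (1.0 + np.exp(-eta))
print(fit_transitional_logit(y, X))   # close to (-0.5, 1.0, 0.8, -0.4)
```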
1.3.2 Transitional Models for Time Series of Counts

For a time series of counts {y_t}, Zeger and Qaqish (1988) propose Markov-type transitional models which they fit using quasi-likelihood methods and the estimating equations approach. They consider various models for the conditional mean μ_t = E[y_t | H_t] of the form log(μ_t) = x_t'β + Σ_{r=1}^q α_r f_r(H_{t−r}), where for example f_r(H_{t−r}) = y_{t−r} or f_r(H_{t−r}) = log(y_{t−r} + c) − log(exp[x_{t−r}'β] + c). One common goal of their models is to approximate the marginal mean by E[y_t] = E[μ_t] ≈ exp{x_t'β}, so that β has an approximate marginal interpretation as the change in the log mean for a unit change in the explanatory variables. Davis et al. (2003) develop these models further and propose f_r(H_{t−r}) = (y_{t−r} − μ_{t−r})/μ_{t−r}^λ as a more appropriate function to build serial dependence into the model, where λ is an additional parameter. They explore stability properties such as stationarity and ergodicity of these models and describe fast (in comparison to the maximum likelihood techniques required for competing random effects models), recursive and iterative maximum likelihood estimation algorithms. Chapter 4 in Kedem and Fokianos (2002) discusses regression models of form (1.3) assuming a conditional Poisson or double-truncated Poisson distribution for the counts, with inference based on the partial likelihood concept. Their methodology is illustrated with two examples about monthly counts of rainy days and counts of tourist arrivals.

1.3.3 Transitional Models for Binary Data

For binary data {y_t}, a two-state, first order Markov chain can be defined by its probability transition matrix

P = ( p_00  p_01 )
    ( p_10  p_11 ),

where p_ab = P(Y_t = b | Y_{t−1} = a), a, b = 0, 1, are the one-step transition probabilities between the two states a and b. Diggle et al. (2002, Chap. 10.3) discuss various logistic regression models for these probabilities and higher order Markov chains for equally spaced observations. Unequally spaced data cannot be routinely handled with these models.

How can we determine the marginal association structure implied by the conditionally specified model? Let p_1 = (p_1(0), p_1(1)) be the initial marginal distribution of the states at time t = 1. Then the distribution of the states at time n is given by p_n = p_1 P^{n−1}. As n increases, p_n approaches a steady state or equilibrium distribution π = (π_0, π_1) that satisfies π = πP. The solution to this equation is given by π_1 = P(Y_t = 1) = E[y_t] = p_01/(p_01 + p_10) and is used to derive marginal moments implied by the transitional model. For example, it can be shown (Kedem, 1980) that in the steady state, the marginal variance and lag-one correlation implied by the transitional model are var(y_t) = π_0π_1 (as it should be) and corr(y_{t−1}, y_t) = p_11 − p_01, respectively.
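These steady-state results are easy to verify numerically. The sketch below (ours; the transition probabilities are arbitrary) iterates p_n = p_1 P^{n−1} and compares with the closed-form expressions.

```python
import numpy as np

P = np.array([[0.7, 0.3],     # p_00, p_01
              [0.4, 0.6]])    # p_10, p_11

p = np.array([1.0, 0.0])      # initial distribution p_1
for _ in range(200):          # p_n = p_1 P^(n-1) converges geometrically
    p = p @ P
print("stationary:", p, "vs pi_1 =", P[0, 1] / (P[0, 1] + P[1, 0]))

pi1 = P[0, 1] / (P[0, 1] + P[1, 0])
print("variance:  ", pi1 * (1 - pi1))
print("lag-1 corr:", P[1, 1] - P[0, 1])   # Kedem (1980)
```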
Azzalini (1994) models serial dependence in binary data through transition models, but at the same time retains the marginal interpretation of regression parameters. He specifies the marginal regression model logit(μ_t) = x_t'β for a binary time series {y_t} with E[y_t] = μ_t, but assumes that a binary Markov chain with transition probabilities p_ab has generated the data. Therefore, the likelihood refers to these probabilities while the model specifies marginal probabilities, a complication similar to the fitting of marginal models discussed in the previous section. However, assuming a constant log odds ratio

ψ = log{ P(Y_{t−1} = 1, Y_t = 1) P(Y_{t−1} = 0, Y_t = 0) / [P(Y_{t−1} = 0, Y_t = 1) P(Y_{t−1} = 1, Y_t = 0)] }

between any two adjacent observations, Azzalini (1994) shows how to write p_ab in terms of just this log odds ratio ψ and the marginal probabilities μ_t and μ_{t−1}. Maximum likelihood estimation for such models is tedious but possible in closed form, although second derivatives of the log likelihood function have to be calculated numerically. A software package (the S-Plus function rm.tools; Azzalini and Chiogna, 1997) exists to fit such models for binary and Poisson observations. Azzalini (1994) mentions that this basic approach can be extended to include variable odds ratios between any two adjacent observations, possibly depending on covariates, but this is not pursued in the article. Diggle et al. (2002) discuss these marginalized transitional models further. Chapter 2 in Kedem and Fokianos (2002) presents a detailed discussion of partial likelihood estimation for transitional binary models and discusses, among other examples, the eruption data of the Old Faithful geyser, to which we will turn in Chapter 5.

1.4 Random Effects Models

A popular way of modeling correlation among dependent observations is to include random effects u in the linear predictor. One of the first developments for discrete data occurred for longitudinal binary data, where subject-specific random effects induced correlation between repeated binary measurements on a subject (Bock and Aitkin, 1981; Stiratelli, Laird and Ware, 1984). In general, we assume that unmeasurable factors give rise to the dependency in the data {y_t}, and random effects {u_t} represent the heterogeneity due to these unmeasured factors. Given these effects, the responses are assumed independent. However, no values for these factors are observed, and so marginally (i.e., averaged over these factors) the responses are dependent. Conditional on some random effects, we consider models that fit into the framework of GLMs for independent data, i.e., where the conditional distribution of y_t | u is a member of the exponential family of distributions, whose mean E[y_t | u_t] is modeled as a function of a linear predictor η_t = x_t'β + z_t'u_t. Together with a distributional assumption for the random effects (usually independent and identically distributed normal), this leads to generalized linear mixed models (GLMMs), where the term mixed refers to the mixture of fixed and random effects in the linear predictor. Chapter 2 contains a detailed definition of GLMMs and discusses maximum likelihood fitting and parameter interpretation, and in Chapter 3 correlated random effects for the description of time dependent observations {y_t} are motivated and described. Here, we only give a short literature review of GLMMs which use correlated random effects to model time (or space) dependent data.
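The exchangeable correlation induced by a shared random intercept, the structure this dissertation seeks to relax, is easy to see by simulation. In the sketch below (ours; parameter values arbitrary), the estimated within-cluster correlations at lags 1 and 3 agree, as exchangeability requires.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, sigma = 20_000, 6, 1.5

u = rng.normal(0, sigma, size=(n, 1))            # one intercept per cluster
prob = 1.0 / (1.0 + np.exp(-(-0.3 + u)))         # same at every time point
y = (rng.random((n, T)) < prob).astype(float)

def lag_corr(y, lag):
    return np.corrcoef(y[:, :-lag].ravel(), y[:, lag:].ravel())[0, 1]

# identical up to Monte Carlo noise at all lags: exchangeable correlation
print(lag_corr(y, 1), lag_corr(y, 3))
```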
11.2) to model longitudinal data uses correlated random effects. For equally spaced binary longitudinal data {yit}, they plot response profiles simulated according to the model logit[P(Yit =1 uit)] = + uit cov(Uit, uit.) = a2PItiti, I with a2 = 2.52 and p = 0.9 and note that the profiles exhibit more alternating runs of O's and 1's than a random intercept model with uil = ui = ... = UiT = ui. However, based on the similarity between plots of random intercepts, random intercepts and slopes and autoregressive random effects models, they mention the challenge that binary data present in distinguishing and modeling the underlying dependency structure in longitudinal data. (They used T = 25 repeated observation for their simulations.) Furthermore, they state that numerical methods for maximum likelihood estimation are computationally impractical for fitting models with higher dimensional random effects. This makes it impossible, they conclude, to fit the GLMM with serially correlated random effects using maximum likelihood. Instead, they propose a Bayesian analysis using powerful Monte Carlo Markov chain methods. Indeed, the majority of examples in the literature which consider correlated random effects in a GLMM framework take a Bayesian approach. Sun, Speckman and Tsutakawa (2000) explore several types of correlated random effects (autore gressive, generalized autoregressive and conditional autoregressive) in a Bayesian analysis of a GLMM. As in any Bayesian analysis, the propriety of the posterior distribution given the data is of concern when fixed effects and variance compo nents have improper prior distributions and random effects are (possibly singular) multivariate normal. One of their results applied to Poisson or binomial data {yt} states that the posterior might be improper when yt = 0 in the Poisson case and cannot be proper when yt = 0 or ye = nt in the binomial case for any t when improper or noninformative priors are used. Diggle, Tawn and Moyeed (1998) consider Gaussian spatial processes S(x) to model spatial count data at locations x. The role of S(x) is to explain any residual spatial variation after accounting for all known explanatory variables. They also use a Bayesian framework to estimate parameters and give a solution to the problem of predicting the count at a new location :. Ghosh et al. (1998) use correlated random effects in Bayesian models for small area estimation problems. They present an application of pairwise difference priors for random effects to model a series of spatially correlated binomial observations in a Bayesian framework. Zhang (2002) discusses maximum likelihood estimation with an underlying spatial Gaussian process for spatially correlated binomial observations. Bayesian models for binary time series are described in Liu (2001), based on probittype models for correlated binary data which are discussed in Chib and Greenberg (1998). Probit type models are motivated by assuming latent random variables z = (zl,..., ZT), which follow a N(pA, E) distribution with p = (p/I,, PT,), pt = x' / and E a correlation matrix. The yet's are assumed to be generated according to Yt = I(Zt > 0), where I(.) is the indicator function. This leads to the (marginal) probit model P(Yt = 1 I E) = It(zt I 1p, E). Rich classes of dependency structures between binary outcomes can be modeled through E. These models can further be extended to include random effects through pl't = zx/3 + z ut or q previous responses such as pt = x,3 + '=, ayt,. 
It is important to note that E has to be in correlation form. To see this, suppose it is not and let S = DED be a covariance matrix for the latent random variables z, where D is a diagonal matrix holding standard deviation parameters. The joint density of the times series under the multivariate probit model is given by P[(Y,,..,Y,) = (Y,,...,T)] = P[z E A] = P[Dz E A], where A = Ai x .. x AT with At = (oo, 0] if yt = 0 and At = (0, oo) if yt = 1 are the intervals corresponding to the relationship yt = I(zt > 0), for t = 1,..., T. However, above relationship is true for any parametrization of D, because the intervals At are not affected by the transformation from z to Dlz. Hence, the elements of D are not identifiable based on the joint distribution of the observed time series y. Lee and Nelder (2001) present models to analyze spatially correlated Poisson counts and binomial longitudinal data about cancer mortality rates. They explore a variety of patterned correlation structures for random effects in a GLMM setup. Model fitting is based on the joint data likelihood of observations and unobserved random effects (Lee and Nelder, 1996) and not on the marginal likelihood of the observed data. Model diagnostic plots of estimated random effects are presented to aid in selecting an appropriate correlation structure. 1.4.2 Other Modeling Approaches In hidden Markov models (MacDonald and Zucchini, 1997) the underlying random process is assumed to be a discrete statespace Markov chain instead of a continuous (normal) process. Probability transition matrices describe the connection between states. A very convenient property of hidden Markov models is that the likelihood can be evaluated sufficiently fast to permit direct numerical maximization. MacDonald and Zucchini (1997) present a detailed description of hidden Markov models for the analysis of binary and count time series. A connection between transitional models and random effects models is explored in Aitkin and Alf6 (1998). They model the success probabilities of serial binary observations conditional on subjectspecific random effects and on the previous outcome. As in the models before, transition probabilities Pab are changing over time due to the inclusion of timedependent covariates and the previous observation in the linear predictor. Additionally, random effects account for possibly unobserved sources of heterogeneity between subjects. The authors argue that the conditional model specification together with the specification of the random effects distribution does not determine the distribution of the initial observation, and hence the likelihood for this model is unspecified. They present a solution by maximizing the likelihood obtained from conditioning on this first observation. However, this causes the specified random effects distribution to shift to an unknown distribution. Two approaches for estimation are outlined: The first assumes another normal distribution for the new random effects distribution and the likelihood is maximized using GaussHermite quadrature. The second approach assumes no parametric form for the new random effects distribution and follows the nonparametric maximum likelihood approach (Aitkin, 1999). For binary data, the new random effects distribution is only a two point distribution and its parameters can be estimated via maximum likelihood jointly with the other model parameters. Marginalized transitional models were briefly mentioned with the approach taken by Azzalini (1994). 
Lee and Nelder (2001) present models to analyze spatially correlated Poisson counts and binomial longitudinal data about cancer mortality rates. They explore a variety of patterned correlation structures for random effects in a GLMM setup. Model fitting is based on the joint likelihood of the observations and the unobserved random effects (Lee and Nelder, 1996) and not on the marginal likelihood of the observed data. Model diagnostic plots of estimated random effects are presented to aid in selecting an appropriate correlation structure.

1.4.2 Other Modeling Approaches

In hidden Markov models (MacDonald and Zucchini, 1997) the underlying random process is assumed to be a discrete state-space Markov chain instead of a continuous (normal) process. Probability transition matrices describe the movement between states. A very convenient property of hidden Markov models is that the likelihood can be evaluated sufficiently fast to permit direct numerical maximization. MacDonald and Zucchini (1997) present a detailed description of hidden Markov models for the analysis of binary and count time series.

A connection between transitional models and random effects models is explored in Aitkin and Alfò (1998). They model the success probabilities of serial binary observations conditional on subject-specific random effects and on the previous outcome. As in the models before, the transition probabilities p_ab change over time due to the inclusion of time-dependent covariates and the previous observation in the linear predictor. Additionally, random effects account for possibly unobserved sources of heterogeneity between subjects. The authors argue that the conditional model specification together with the specification of the random effects distribution does not determine the distribution of the initial observation, and hence the likelihood for this model is unspecified. They present a solution by maximizing the likelihood obtained from conditioning on this first observation. However, this causes the specified random effects distribution to shift to an unknown distribution. Two approaches for estimation are outlined: The first assumes another normal distribution for the new random effects distribution, and the likelihood is maximized using Gauss-Hermite quadrature. The second approach assumes no parametric form for the new random effects distribution and follows the nonparametric maximum likelihood approach (Aitkin, 1999). For binary data, the new random effects distribution is only a two-point distribution, and its parameters can be estimated via maximum likelihood jointly with the other model parameters.

Marginalized transitional models were briefly mentioned with the approach taken by Azzalini (1994). The idea of "marginalizing", i.e., modeling the marginal mean of an otherwise conditionally specified model, can also be applied to random effects models. The advantage of transitional or random effects models is the ability to easily specify correlation patterns, with the potential disadvantage that parameters in such models have conditional interpretations when the scientific goal is the interpretation of marginal relationships. In marginal models, parameters can be directly interpreted as contrasts between subpopulations, without the need of conditioning on previous observations or unobserved random effects. However, as we mentioned in Section 1.2.2, likelihood based inference in marginal models might not be possible. A marginalized random effects model (Heagerty, 1999; Heagerty and Zeger, 2000) specifies two regression equations that are consistent with each other. The first equation,

μ_t^M = E[y_t] = h⁻¹(x_t'β),

expresses the marginal mean μ_t^M as a function of covariates and describes systematic variation. The second equation characterizes the dependency structure among observations through specification of the conditional mean μ_t^C,

μ_t^C = E[y_t | u_t] = h⁻¹(Δ_t(x_t) + z_t'u_t),

where u_t are random effects with design vector z_t. Consistency between the marginal and conditional specifications is achieved by defining Δ_t(x_t) implicitly through

μ_t^M = E_u[μ_t^C] = E_u[E[y_t | u_t]] = E_u[h⁻¹(Δ_t(x_t) + z_t'u_t)].

For instance, in a marginalized GLMM with random effects distribution F(u_t), Δ_t(x_t) is the solution to the integral equation

h⁻¹(x_t'β) = ∫ h⁻¹(Δ_t(x_t) + z_t'u_t) dF(u_t),

so that Δ_t(x_t) is a function of the marginal regression coefficients β and the (variance) parameters in F(u_t). Maximum likelihood estimation is based on the integrated likelihood from the GLMM.
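For a logit link with a scalar random effect u_t ~ N(0, σ²), the integral equation can be solved for Δ_t(x_t) numerically. The sketch below (ours) approximates the expectation by Gauss-Hermite quadrature and finds the root with scipy; the output shows the familiar attenuation, with the conditional intercept exceeding the marginal one.

```python
import numpy as np
from scipy.optimize import brentq

nodes, weights = np.polynomial.hermite.hermgauss(40)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def marginal_mean(delta, sigma):
    """E[expit(delta + u)] for u ~ N(0, sigma^2), via Gauss-Hermite.

    Substituting u = sqrt(2)*sigma*x reduces the normal expectation to
    the Hermite form (1/sqrt(pi)) * sum_i w_i f(sqrt(2)*sigma*x_i).
    """
    return np.sum(weights * expit(delta + np.sqrt(2) * sigma * nodes)) / np.sqrt(np.pi)

def solve_delta(xb, sigma):
    """Delta solving expit(x'beta) = E[expit(Delta + u)]."""
    return brentq(lambda d: marginal_mean(d, sigma) - expit(xb), -50, 50)

print(solve_delta(1.0, sigma=2.0))   # about 1.5, larger than the marginal 1.0
```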
1.5 Motivation and Outline of the Dissertation

In this dissertation we propose generalized linear mixed models (GLMMs) with correlated random effects to model count or binomial response data collected over time or in space. For sequential or spatial Gaussian measurements, maximum likelihood estimation is well established and software (e.g., SAS's proc mixed) is available to fit fairly complicated correlation structures. The challenge for discrete data lies in the fact that the observed (marginal) likelihood is not analytically tractable, and maximization of it is more involved. Furthermore, with correlated random effects, the likelihood does not break down into lower-dimensional components which are easier to integrate numerically. Therefore, most approaches in the literature are based on a quasi-likelihood approach or take a Bayesian perspective. The advantage of Bayesian models is that powerful Markov chain Monte Carlo methods make it easier to obtain a sample from the posterior distribution of interest than to obtain maximum likelihood estimates. However, priors must be specified very carefully to ensure posterior propriety. In addition, repeated observations are prone to missing data or unequally spaced observation times. We would like to develop methods and models that allow for unequally spaced binary, binomial or Poisson observations, making them more general than those previously presented in the literature. To our knowledge, maximum likelihood estimation of GLMMs with such high dimensional random effects has not been demonstrated before, with the exception of the paper by Chan and Ledolter (1995), who consider fitting a time series of counts. However, they do not consider unequally spaced data and employ a different implementation of the MCEM algorithm. In Chapter 5 we argue that their implementation of the algorithm might have been stopped prematurely, leading to different conclusions than our analysis and analyses published elsewhere.

Most articles that discuss correlated random effects do so for only a small number of correlated random effects. For example, Chan and Kuk (1997) show that the data set on salamander mating behavior published and analyzed in McCullagh and Nelder (1989) is more appropriately analyzed when random effects pertaining to the male salamander population are correlated over the three different time points when they were observed. In this thesis, we would like to consider much longer sequences of repeated observations.

In Chapter 2 we introduce the GLMM as the model of our choice to analyze correlated discrete data and outline an EM algorithm to estimate fixed and random effects, where both the E-step and the M-step require numerical approximations, leading to an EM algorithm based on Monte Carlo methods (MCEM). Correlated random effects and their implications for the analysis of GLMMs are discussed in Chapter 3, together with a motivating example. This chapter also gives details for the implementation of the algorithm and reports results from simulation studies. Chapter 4 looks at marginal model properties and interpretation for correlated binary, binomial or Poisson observations, and Chapter 5 applies our methods to real data sets from the social sciences, public health, sports and other backgrounds. A summary and discussion of the methods and models presented here is given in Chapter 6.

CHAPTER 2
GENERALIZED LINEAR MIXED MODELS

Chapter 1 reviewed various approaches to extending GLMs to deal with correlated data. In this chapter we take a closer look at generalized linear mixed models (GLMMs), which were briefly mentioned in Section 1.4. When the response variables are normal, these models are simply called linear mixed models (LMMs) and have been extensively discussed in the literature (see, for example, the books by Searle, Casella and McCulloch, 1992, and Verbeke and Molenberghs, 2000). The form of the normal density for observations and random effects allows for analytical evaluation of the integrals together with straightforward maximization. Hence, LMMs can be readily fit with existing software (e.g., SAS's proc mixed), using rich classes of prespecified correlation structures for the random effects to model the dependence in the data more precisely. The broader notion of GLMMs also encompasses binary, binomial, Poisson or gamma responses. A distinctive feature of GLMMs is their so-called subject-specific parameter interpretation, which differs from the interpretation of parameters in marginal (Section 1.2) or transitional (Section 1.3) models. This feature is discussed in Section 2.1, after a formal introduction of the GLMM. Throughout, special attention is devoted to defining GLMMs for discrete time series observations. GLMMs are harder to fit because they typically involve intractable integrals in the likelihood function. Section 2.2 outlines various approaches to model fitting. Section 2.3 focuses on a Monte Carlo version of the EM algorithm, which is an indirect method of finding maximum likelihood estimates in GLMMs. Monte Carlo methods are necessary because our applications involve correlated random effects which lead to a very high-dimensional integral in the likelihood function. Parallel to the discussion of GLMMs, state space models are introduced and a fitting algorithm is described.
State space models are popular models for discrete time series in econometric applications (Durbin and Koopman, 2001). The presentation of specific examples of GLMMs for discrete time series observations is deferred until Chapter 5.

2.1 Definition and Notation

The generalized linear mixed model is an extension of the well known generalized linear model (McCullagh and Nelder, 1989) that permits fixed as well as random effects in the linear predictor (hence the word mixed). The setup of a GLMM is split into two stages, which we present here using notation common for longitudinal studies: Firstly, conditional on cluster-specific random effects u_i, the data are assumed to follow a GLM with independent random components y_it, the t-th response in cluster i, i = 1,...,n, t = 1,...,n_i. A cluster here is a generic expression and means any form of observations being grouped together, such as repeated observations on the same subject (cluster = subject), observations on different students in the same school (cluster = school), or observations recorded in a common time interval (cluster = time interval). The conditional distribution of y_it is a member of the exponential family of distributions (e.g., McCullagh and Nelder, 1989) with form

f(y_it | u_i) = exp{[y_it θ_it − b(θ_it)]/φ_it + c(y_it, φ_it)},       (2.1)

where the θ_it are natural parameters and b(·) and c(·) are certain functions determined by the specific member of the exponential family. The parameters φ_it are typically of the form φ_it = φ/w_it, where the w_it are known weights and φ is a possibly unknown dispersion parameter. For the discrete response GLMMs we are considering, φ = 1. For a specific link function h(·), the model for the conditional mean of observation y_it has form

μ_it = b'(θ_it) = E[y_it | u_i] = h⁻¹(x_it'β + z_it'u_i),       (2.2)

where x_it' and z_it' are covariate or design vectors for the fixed and random effects associated with observation y_it and β is a vector of unknown regression coefficients. At this first stage, z_it'u_i can be regarded as a known offset for each observation, and observations are conditionally independent. It should be noted that relationship (2.2) between the mean of the observation and the fixed and random effects is exactly as specified in the systematic part of GLMs, with the exception that in GLMMs a conditional mean is modeled. This affects parameter interpretation. The regression coefficients β represent the effect of explanatory variables on the conditional mean of observations, given the random effects. For instance, observations in the same cluster i share a common value of the random cluster effect u_i, and hence β describes the conditional effect of explanatory variables, given the value of u_i. If the cluster consists of repeated observations on the same subject, these effects are called subject-specific effects. In contrast, regression coefficients in GLMs and marginal models describe the effect of explanatory variables on the population average, which is an average over observations in different clusters.

At the second stage, the random effects u_i are specified to follow a multivariate normal distribution with mean zero and variance-covariance matrix Σ_i. A standard assumption is that the random effects {u_i} are independent and identically distributed, but an example at the beginning of Chapter 3 will show that this is sometimes not appropriate.
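Before specializing to time series, it may help to make (2.1) concrete in a standard case (a routine calculation, included here for convenience). For Poisson counts with conditional mean μ_it,

```latex
f(y_{it} \mid u_i)
  = \frac{\mu_{it}^{\,y_{it}} e^{-\mu_{it}}}{y_{it}!}
  = \exp\bigl\{ y_{it}\log\mu_{it} - \mu_{it} - \log y_{it}! \bigr\},
```

so that θ_it = log μ_it, b(θ_it) = e^{θ_it}, φ_it = 1 and c(y_it, φ_it) = −log y_it!. Since b'(θ_it) = e^{θ_it} = μ_it, the canonical link in (2.2) is h(μ) = log μ.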
With time series observations, where the clusters refer to time segments, it is reasonable to assume that observations are not only correlated within a cluster (modeled by sharing the same cluster-specific random effect), but also across clusters, which we will model by assuming correlated cluster-specific random effects.

2.1.1 Generalized Linear Mixed Models for Univariate Discrete Time Series

Most of the data we are going to analyze are in the form of a single univariate time series. To emphasize this data structure, the general two-dimensional notation (indices i and t) of a GLMM for clustered observations can be simplified in two ways: We can assume that a single cluster (i.e., n = 1 and n_1 = T) contains the entire time series y_1,...,y_T. The random effects vector u = (u_1,...,u_T)' associated with the single cluster has a random effects component for each individual time series member. The distribution of u is multivariate normal with variance-covariance matrix Σ, which is different from the identity matrix. The correlation of the components of u induces a correlation among the time series members. However, conditional on u, observations within the single cluster are independent. The cluster index i is redundant in this notation and hence can be dropped. This representation is particularly useful when working with existing software to fit GLMMs, where it is often necessary to include a column indicating the cluster membership for each observation. Here, since we have only one cluster, it suffices to include a column of all ones, say.

Alternatively, we can adopt the point of view that each member of the time series is a mini-cluster by itself, containing only one observation (i.e., n_i = 1 for all i = 1,...,T) in the case of a single time series. When multiple, parallel time series are observed, the cluster contains all c observations at time point t from the c parallel time series (i.e., n_i = c for all i = 1,...,T). In any case, the clusters are then synonymous with the discrete time points at which observations were recorded. This makes the index t, which counts the repeated observations within a cluster, redundant (n_i = 1, or n_i = c, for every cluster i), but instead of denoting the time series by {y_i}_{i=1}^T, we use the more common notation {y_t}_{t=1}^T, where t now is the index for clusters or, equivalently, time points. In the following definition of GLMMs for univariate time series, the notions of clusters and time points can be used interchangeably.

Conditional on unobserved random effects u_1,...,u_T for the different time points, observations y_1,...,y_T are assumed independent with distributions

f(y_t | u_t) = exp{[y_t θ_t − b(θ_t)]/φ_t + c(y_t, φ_t)}       (2.3)

in the exponential family. As before, for a specific link function h(·), the model for the conditional mean has form

E[y_t | u_t] = μ_t = b'(θ_t) = h⁻¹(x_t'β + z_t'u_t),       (2.4)

where x_t' and z_t' are covariate or design vectors for the fixed and random effects associated with the t-th observation and β is a vector of unknown regression coefficients. The random effects u_1,...,u_T are typically not independent. When they are collected in the vector u = (u_1,...,u_T)', a multivariate normal distribution with mean 0 and covariance matrix Σ can be directly specified. In particular, in Chapter 3 we will assume special patterned covariance matrices to allow for rich, but still parsimonious, classes of correlation structures among the time series observations.
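For instance, the autoregressive pattern already encountered in Section 1.4.1, with entries Σ_{ts} = σ²ρ^{|t−s|}, can be constructed in essentially one line (our illustration):

```python
import numpy as np

def ar1_cov(T, sigma2, rho):
    """Patterned covariance matrix with entries sigma2 * rho^|t-s|."""
    idx = np.arange(T)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

print(ar1_cov(4, sigma2=1.0, rho=0.5))
```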
The advantage of the second setup of mini-clusters is that it also allows for other, indirect specifications of the random effects distribution, for instance through a latent random process. For this, we relate cluster-specific random effects from successive time points. For example, with univariate random effects, a first-order latent autoregressive process assumes that the random effects follow

u_{t+1} = ρu_t + ε_t,       (2.5)

where ε_t has a zero-mean normal distribution and ρ is a correlation parameter. Cox (1981) called these types of models parameter-driven models, as opposed to the transitional (or observation-driven) models discussed in Section 1.3. In parameter-driven models, an underlying and unobserved parameter process influences the distribution of a series of observations. The model for the polio data in Zeger (1988) is an example of a parameter-driven model. However, Zeger (1988) assumes neither normality nor zero mean for the latent autoregressive process. Furthermore, the natural logarithm of the random effects, and not the random effects themselves, appears additively in the linear predictor. Therefore, this model is slightly different from the specification of the time series GLMM above.

Another application of the mini-cluster setup is to spatial settings, where clusters represent spatially aggregated data instead of time points. Then u_1,...,u_T is a collection of random effects associated with spatial clusters. Again, independent random effects are inappropriate for describing the spatial dependencies. In general, time dependent data are easier to handle, since observations are linearly ordered in time, and more complicated random effects distributions are needed for spatial applications (e.g., Besag et al., 1995). We will use the mini-cluster representation to facilitate a comparison with state space models. This is the focus of the next section.

2.1.2 State Space Models for Discrete Time Series Observations

State space models are a rich alternative to the traditional Box-Jenkins ARIMA system for time series analysis. Similar to GLMMs, state space models for Gaussian and non-Gaussian time series split the modeling process into two stages: At the first stage, the responses y_t are related to unobserved "states" by an observation equation. (State space models originated in the engineering literature, where parameters are often called states.) At the second stage, a latent or hidden Markov model is assumed for the states. For univariate Gaussian responses y_1,...,y_T, the two equations of a state space model take the form

y_t = w_t'α_t + ε_t,    ε_t ~ N(0, σ²),
α_t = T_t α_{t−1} + R_t ζ_t,    ζ_t ~ N(0, Q_t),    t = 1,...,T,       (2.6)

where w_t is an m × 1 observation or design vector and ε_t is a white noise process. The unobserved m × 1 state or parameter vector α_t is defined by the second equation, the transition equation, where T_t is a transition matrix and ζ_t is another white noise process, independent of the first one. Compared to the standard GLMMs of Section 2.1, the main difference is that the random effects are correlated instead of i.i.d. In state space models, no clear distinction between fixed and random effects is made, and the state vector α_t can contain both. However, the form of the transition matrix T_t together with the form of the matrix R_t, which consists of columns of the identity matrix I_m, allows one to declare certain elements of α_t as fixed effects and others as random. The matrix R_t is called the selection matrix since it selects the rows of the state equation which have nonzero variance terms.
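As a concrete special case of (2.6) (our sketch), taking m = 1 with w_t = 1, T_t = 1 and R_t = 1 gives the local level model: a random walk state observed with noise.

```python
import numpy as np

rng = np.random.default_rng(6)
T, sigma_eps, sigma_zeta = 200, 1.0, 0.3

alpha = np.cumsum(rng.normal(0, sigma_zeta, T))   # alpha_t = alpha_{t-1} + zeta_t
y = alpha + rng.normal(0, sigma_eps, T)           # y_t = alpha_t + eps_t
print(y[:5].round(2))
```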
With this formulation, the variance-covariance matrix Q_t is assumed to be nonsingular. Furthermore, the transition matrix T_t allows specification of which effects vary through time and which stay constant. (For a slightly different formulation without a selection matrix R_t but with a possibly singular variance-covariance matrix Q_t, see Fahrmeir and Tutz, 2001, Chap. 8.)

State space models for non-Gaussian time series were considered by West et al. (1985) under the name dynamic generalized linear model. They used a Bayesian framework with conjugate priors to specify and fit their models. Durbin and Koopman (1997, 2000) and Fahrmeir and Tutz (2001) describe a state space structure for non-Gaussian observations similar to the two equations above. The normal distribution assumption for the observations in (2.6) is replaced by assuming a distribution in the exponential family with natural parameter θ_t. With a canonical link, θ_t = w_t'α_t. This is called the signal by Durbin and Koopman (1997, 2000). In particular, given the states α_1, ..., α_T, observations y_1, ..., y_T are conditionally independent and have density p(y_t | θ_t) in the exponential family (2.3). As in Gaussian state space models, the state vector α_t is determined by the vector autoregressive relationship

α_t = T_t α_{t-1} + R_t ξ_t, (2.7)

where the serially independent ξ_t typically have normal distributions with mean 0 and variance-covariance matrix Q_t.

2.1.3 Structural Similarities Between State Space Models and GLMMs

There is a strong connection between state space models and GLMMs with a canonical link. To see this, we write the GLMM in state space form: Let w_t = (x_t', z_t')' and α_t = (β_t', u_t')', where x_t, z_t, β and u_t are from the GLMM notation as defined in (2.4). Hence, the linear predictor x_t'β + z_t'u_t of the GLMM is equal to the state space signal θ_t = w_t'α_t. Next, consider transition and selection matrices of the block form

T_t = blockdiag(I, T̃_t),   R_t = (0', R̃_t')',

so that the disturbance ξ_t of the state equation enters only the random effects block of the state vector. Using transition equation (2.7) then results in the following autoregressive relationship between the random effects of a GLMM:

u_t = T̃_t u_{t-1} + ε_t,

where ε_t = R̃_t ξ_t is a white noise component. In a univariate context, we have already motivated this type of relationship between random effects in equation (2.5). The transition equation also implies a constant effect β for the GLMM, since β_1 = β_2 = ... = β_T := β. Hence, both models use correlated random effects, but GLMMs also typically involve fixed parameters which are not modeled as evolving over time. The restriction of the transition equation to the autoregressive form (often a simple random walk) is only one way of specifying a distribution for the random effects in the GLMMs of Section 2.1.1. Other structures, such as equally correlated random effects, are possible within the GLMM framework and are considered in Chapter 3.

2.1.4 Practical Differences

Although GLMMs and state space models are similar in structure, they are used differently in practice. This is in part because in GLMMs the focus is on the fixed subject-specific regression parameters β, which refer to time constant and time varying covariates, while in state space models the main purpose is to infer properties of the time varying random states α_t. These are often assumed to follow a first or second order random walk.
To illustrate, consider a data set on a monthly time series of counts presented in Durbin and Koopman (2000) for investigating the effectiveness of new seat belt legislation on automobile accidents. They specify the log-mean for a Poisson state space model as

log(μ_t) = ν_t + λx_t + γ_t,

where ν_t is a trend component following the random walk ν_{t+1} = ν_t + ξ_t, with ξ_t a white noise process. Further, λ is an intervention parameter corresponding to the change in seat belt legislation (x_t = 0 before the change and equal to 1 afterwards) and the {γ_t} are fixed seasonal components summing to zero over each year and equal in every year. The main focus is on the parameter λ, describing the drop in the log means (also called the level) after the seat belt legislation went into effect.

For the series at hand, a GLMM approach would consider a fixed linear time effect β and model the log-mean as

log(μ_t) = α + βt + λx_t + γ_t + u_t,

where α is the intercept of the linear time trend with slope β and where correlated random effects {u_t} account for the correlation in the monthly log means. Similar to above, λ describes the effect on the log means after the seat belt legislation went into effect and the γ_t are fixed seasonal components, equal for every year. The trend component ν_t of the state space model corresponds to α + u_t in the GLMM, which additionally allows for a linear time trend βt. This approach seems to be favored by some discussants of the Durbin and Koopman (2000) paper. (In particular, see the discussions by Chatfield and by Aitkin, who mention the lack of a linear trend term in the proposed state space model.) An even better GLMM approach, with linear time trends explicitly modeled, could use a change point formulation in the linear predictor, with the month the legislation went into effect (or was enforced) as the change point, and again with correlated random effects {u_t} to capture the dependency among successive means. Such a specification would be harder to accommodate in a state space model.

In the reply to the discussion of their paper, Durbin and Koopman (2000) wrote that the two approaches (state space models versus hierarchical generalized linear models, a model class very similar to GLMMs) are very different, and that they regard their treatment as more transparent and general for problems that specifically relate to time series. With the presentation in this thesis of GLMMs with correlated random effects for time series analysis, their argument might weaken. For instance, with the proposal of autocorrelated random effects u_{t+1} = ρu_t + ε_t in a GLMM context, we have elegant means of introducing autocorrelation into a basic regression model that is well understood and whose parameters are easily interpreted. Furthermore, GLMMs can easily accommodate the case of multiple time series observations on each of several individuals or cross-sectional units, as is often observed in a longitudinal study.

One common feature of both models is the intractability of the likelihood function and the use of numerical and simulation techniques to obtain maximum likelihood estimates. In general, state space models for non-Gaussian time series are fit using a simulated maximum likelihood approach, which is also a popular method for fitting GLMMs. However, long time series necessarily result in models with complex and high dimensional random effects, and alternative, indirect methods may work better.
Jank and Booth (2003) indicate that simulated maximum likelihood, the method of choice for estimation in state space models, may not work as well as indirect methods based on the EM algorithm, the method we will use to fit GLMMs for time series observations. The next section reviews various approaches to fitting GLMMs and contrasts them with the approach taken for state space models.

2.2 Maximum Likelihood Estimation

Maximum likelihood estimation in GLMMs is a challenging task, because it requires the calculation of integrals (often high dimensional) that have no known analytic solution. Following the general notation of a GLMM in Section 2.1, let y_i = (y_i1, ..., y_in_i) be the vector of all observations in cluster i, whose associated random effects vector is u_i. Conditional independence of the y_it's implies that the density function of y_i is given by

f(y_i | u_i; β) = Π_{t=1}^{n_i} f(y_it | u_i; β), (2.8)

where the f(y_it | u_i; β) are the exponential family densities in (2.1). The parameter β is the vector of all unknown regression coefficients introduced by specifying model (2.2) for the mean of the observations. Furthermore, observations from different clusters are assumed conditionally independent, leading to the conditional joint density

f(y_1, ..., y_n | u_1, ..., u_n; β) = Π_{i=1}^{n} f(y_i | u_i; β)

for all observations given all random effects. Let g(u_1, ..., u_n; ψ) denote the multivariate normal density function of the random effects, whose variance-covariance matrix Σ is determined by the variance component vector ψ. The goal is to estimate the unknown parameter vectors β and ψ by maximum likelihood. The likelihood function L(β, ψ; y_1, ..., y_n) for a GLMM is given by the marginal density function of the observations y_1, ..., y_n, viewed as a function of the parameters, and is equal to

L(β, ψ; y_1, ..., y_n) = ∫ f(y_1, ..., y_n | u_1, ..., u_n; β) g(u_1, ..., u_n; ψ) du_1 ... du_n
= ∫ Π_{i=1}^{n} f(y_i | u_i; β) g(u_1, ..., u_n; ψ) du_1 ... du_n
= ∫ Π_{i=1}^{n} Π_{t=1}^{n_i} f(y_it | u_i; β) g(u_1, ..., u_n; ψ) du_1 ... du_n. (2.9)

It is called the "observed" likelihood because the unobserved random effects have been integrated out and (2.9) is a function of the observed data only. Except for the linear mixed model, where f(y_1, ..., y_n | u_1, ..., u_n) is a normal density, the integral has no closed-form solution, and numerical procedures (analytic or stochastic) are necessary to calculate and maximize it. Standard maximization techniques such as Newton-Raphson or EM for fitting GLMs and linear mixed models have to be modified because the conditional distribution of the observations and the distribution of the random effects are not conjugate and the integral is analytically intractable.
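The computational obstacle can be made tangible with a brute-force Monte Carlo approximation of (2.9), averaging f(y | u) over draws of u from g(u; ψ). The sketch below does this for a toy binomial series; it is workable only because the example is tiny, since for long series the variance of this naive estimator explodes, which is precisely why the more refined methods reviewed next are needed. All inputs are illustrative stand-ins.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)

# Toy data: T binomial observations with correlated yearly random effects
T, n_t = 20, 30
y = rng.binomial(n_t, 0.4, size=T)              # stand-in observed counts
X = np.column_stack([np.ones(T), np.arange(T)])

def marginal_loglik(beta, sigma, rho, m=20000):
    """Naive Monte Carlo estimate of log L(beta, psi; y) in (2.9)."""
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    Sigma = sigma**2 * rho**lags
    u = rng.multivariate_normal(np.zeros(T), Sigma, size=m)   # m draws of u
    eta = X @ beta + u                                        # shape (m, T)
    pi = 1.0 / (1.0 + np.exp(-eta))
    cond_loglik = binom.logpmf(y, n_t, pi).sum(axis=1)        # log f(y | u)
    # Average f(y | u) over the prior draws, on the log scale for stability
    return np.logaddexp.reduce(cond_loglik) - np.log(m)

print(marginal_loglik(np.array([-0.4, 0.0]), sigma=0.5, rho=0.5))
```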
2.2.1 Direct and Indirect Maximum Likelihood Procedures

In general, there are two ways to obtain maximum likelihood estimates from the marginal likelihood in (2.9). The first one is a direct approach and attempts to approximate the integral by either analytic or stochastic methods and then maximize this approximation with respect to the parameters β and ψ. Some common analytic approximation methods are Gauss-Hermite quadrature (Abramowitz and Stegun, 1964), a first-order Taylor series expansion of the integrand, or a Laplace approximation (Tierney and Kadane, 1986), which is based on a second-order Taylor series expansion. The latter two methods result in likelihood equations similar to those of a linear mixed model (Breslow and Clayton, 1993; Wolfinger and O'Connell, 1993), and by iteratively fitting such a model and re-expanding the integrand around updated parameter estimates, one can obtain approximate maximum likelihood estimates. However, these methods have been shown to yield estimates which can be biased and inconsistent, an issue discussed in Lin and Breslow (1996) and Breslow and Lin (1995). Techniques using stochastic integral approximations are known under the name simulated maximum likelihood and have been proposed by Geyer and Thompson (1992) and Gelfand and Carlin (1993). These methods approximate the integral in (2.9) by importance sampling (Robert and Casella, 1999) and are better suited to larger dimensional integrals than analytic approximations. Usually, the importance density depends on the parameters to be estimated, and so simulated maximum likelihood is used iteratively by first approximating the integral by a Monte Carlo sum with some initial values for the unknown parameters. Then the likelihood is maximized and the resulting parameters are used to generate a new sample from the importance density in the next iteration. We will briefly discuss the idea behind importance sampling in Section 2.3.2. The simulated maximum likelihood approach is also further illustrated in the next section, where we discuss it in the context of state space models.

An alternative to the direct approximating methods is the EM algorithm (Dempster et al., 1977). The integral in (2.9) is not maximized directly in this method, but indirectly by considering a related function Q(· | ·). At each step of this algorithm, maximization of the Q-function increases the marginal likelihood, a fact that can be verified using Jensen's inequality. The EM algorithm relies on recognizing or inventing missing data which, together with the observed data, simplify maximum likelihood calculations. For GLMMs, the random effects u_1, ..., u_n are treated as the missing data. In particular, let β^(k-1) and ψ^(k-1) denote the current (at the end of iteration k-1) values of the parameter vectors β and ψ. Also, let y' = (y_1', ..., y_n') and u' = (u_1', ..., u_n') denote the vectors of all observations and their associated random effects. Then the Q(· | ·) function at the start of iteration k has form

Q(β, ψ | β^(k-1), ψ^(k-1)) = E[ log j(y, u; β, ψ) | y, β^(k-1), ψ^(k-1) ]
= E[ log f(y | u; β) | y, β^(k-1), ψ^(k-1) ] + E[ log g(u; ψ) | y, β^(k-1), ψ^(k-1) ], (2.10)

where j(y, u; β, ψ) = f(y | u; β) g(u; ψ) denotes the joint density of the observed and missing data, also known as the complete data. The expectation in (2.10) is with respect to the conditional distribution of u | y, evaluated at the current parameter estimates β^(k-1) and ψ^(k-1). The calculation of the expected value is called the E-step, and it is followed by an M-step, which maximizes Q(β, ψ | β^(k-1), ψ^(k-1)) with respect to β and ψ. The resulting estimates β^(k) and ψ^(k) are used as updates in the next iteration to re-calculate the E-step and the M-step. Since (β^(k), ψ^(k)) is the maximizer at iteration k,

Q(β^(k), ψ^(k) | β^(k-1), ψ^(k-1)) ≥ Q(β, ψ | β^(k-1), ψ^(k-1))   for all (β, ψ), (2.11)

and it follows that the likelihood increases (or at worst stays the same) from one iteration to the next:

log L(β^(k), ψ^(k); y) = Q(β^(k), ψ^(k) | β^(k-1), ψ^(k-1)) - E[ log h(u | y; β^(k), ψ^(k)) | y, β^(k-1), ψ^(k-1) ]
≥ Q(β^(k-1), ψ^(k-1) | β^(k-1), ψ^(k-1)) - E[ log h(u | y; β^(k-1), ψ^(k-1)) | y, β^(k-1), ψ^(k-1) ]
= log L(β^(k-1), ψ^(k-1); y).

Here we used (2.11) and the fact that

E[ log h(u | y; β^(k), ψ^(k)) | y, β^(k-1), ψ^(k-1) ] - E[ log h(u | y; β^(k-1), ψ^(k-1)) | y, β^(k-1), ψ^(k-1) ]
= E[ log( h(u | y; β^(k), ψ^(k)) / h(u | y; β^(k-1), ψ^(k-1)) ) | y, β^(k-1), ψ^(k-1) ]
≤ log E[ h(u | y; β^(k), ψ^(k)) / h(u | y; β^(k-1), ψ^(k-1)) | y, β^(k-1), ψ^(k-1) ] = 0,

where the inequality in the last step derives from Jensen's inequality.
Under regularity conditions (Wu, 1983) and with some initial starting values (β^(0), ψ^(0)), the sequence of estimates {(β^(k), ψ^(k))} converges to the maximum likelihood estimators (β̂, ψ̂). The EM algorithm is most useful if replacing the calculation of the integral in the marginal likelihood (2.9) by the calculation of the integral in the Q-function (2.10) simplifies computation and maximization. Unfortunately, for GLMMs the integrals in (2.10) are also intractable, since the conditional density of u | y involves the integral in (2.9). However, the EM algorithm may still be used by approximating the expectation in the E-step with appropriate Monte Carlo methods. The resulting algorithm is called the Monte Carlo EM algorithm (MCEM) and was proposed by Wei and Tanner (1990). We review it in detail in Section 2.3. Some arguments favoring the use of the MCEM algorithm over direct methods such as simulated maximum likelihood for fitting GLMMs, especially when some variance components in ψ are large, are given in Jank and Booth (2003) and Booth et al. (2001). Currently, the only available software for fitting GLMMs uses direct methods such as Gauss-Hermite quadrature or simulated maximum likelihood (e.g., SAS's proc nlmixed). The state space models of Section 2.1.2 are also fitted via simulated maximum likelihood. This is discussed in Section 2.2.3.

2.2.2 Model Fitting in a Bayesian Framework

In a Bayesian context, GLMMs are two-stage hierarchical models with appropriate priors on β and ψ. Instead of obtaining maximum likelihood estimates of unknown parameters, a Bayesian analysis looks at their entire posterior distributions, given the observed data. Markov chain Monte Carlo techniques avoid the tedious integration in the posterior densities and allow for relatively easy simulation from these distributions, compared with the problems encountered in maximum likelihood estimation. This suggests approximating maximum likelihood estimates via a Bayesian route, assuming improper or at least very diffuse priors and exploiting the proportionality of the likelihood function and the posterior distribution of the parameters. However, for many discrete data models, improper priors may lead to improper posteriors. Natarajan and McCulloch (1995) demonstrate this with GLMMs for correlated binary data assuming independent N(0, σ²) random effects and a flat or noninformative prior for σ². Sun, Tsutakawa and Speckman (1999) and Sun, Speckman and Tsutakawa (2000) show that with noninformative (flat) priors on fixed effects and variance components of more complicated random effects distributions, propriety of the posterior distribution cannot be guaranteed for a Poisson GLMM when one of the observed counts is zero, and is impossible in a logit link GLMM for Binomial(n, π) observations if just one of the observations is equal to 0 or n. Of course, the use of proper priors will always lead to proper posteriors. However, for the often employed diffuse but proper priors, Natarajan and McCulloch (1998) show that even with enormous simulation sizes, posterior estimates (such as the posterior mode) can be far away from maximum likelihood estimates, which makes their use undesirable in a frequentist setting.

2.2.3 Maximum Likelihood Estimation for State Space Models

The same problems as in the GLMM case arise in maximum likelihood fitting of non-Gaussian state space models. Here, we review a simulated maximum likelihood approach suggested by Durbin and Koopman (1997), using the notation introduced in Section 2.1.2.
Let p(y | α; ψ) = Π_t p(y_t | α_t; ψ) denote the distribution of all observations given the states, and let p(α; ψ) denote the distribution of the states, where y and α are the stacked vectors of all observations and all states, respectively. The vector ψ holds parameters that may appear in w_t, T_t and Q_t. Let p(y, α; ψ) denote the joint density of observations and states. For practical purposes, it is easier to work with the signal θ_t instead of the high dimensional state vector α_t. Hence, let p(y | θ; ψ), p(θ; ψ) and p(y, θ; ψ) denote the corresponding conditional, marginal and joint distributions parameterized in terms of the signals θ_t = w_t'α_t, t = 1, ..., T, where θ = (θ_1, ..., θ_T)'. The observed likelihood is then given by the integral

L(ψ; y) = ∫ p(y | θ; ψ) p(θ; ψ) dθ. (2.12)

To maximize (2.12) with respect to ψ, Durbin and Koopman (1997, 2000) first calculate the likelihood L_g(ψ; y) of an approximating Gaussian model and then obtain the true likelihood L(ψ; y) by an adjustment to it. However, two different approaches to constructing the approximating Gaussian model are presented in the two papers. In Durbin and Koopman (1997) the approximating model is obtained by assuming that observations follow the linear Gaussian model

y_t = w_t'α_t + ε_t = θ_t + ε_t,   with ε_t ~ N(μ_t, σ_t²).

All densities generated under this model are denoted by g(·). The two parameters μ_t and σ_t² are chosen such that the true density p(y | θ; ψ) and its normal approximation g(y | θ; ψ) are as close as possible in the neighborhood of the posterior mean E_g[θ | y]. The state equations of the true non-Gaussian model and the Gaussian approximating model are assumed to be the same, which implies that the marginal density of θ is the same under both models, i.e., p(θ; ψ) = g(θ; ψ). The likelihood of the approximating model is then given by

L_g(ψ; y) = g(y; ψ) = g(y, θ; ψ) / g(θ | y; ψ) = g(y | θ; ψ) p(θ; ψ) / g(θ | y; ψ). (2.13)

This likelihood is calculated using a recursive procedure known as the Kalman filter (see, for instance, Fahrmeir and Tutz, 2001, Chap. 8). Alternatively, the approximating Gaussian model is a regular linear mixed model, and the maximum likelihood calculations can be carried out using more familiar algorithms from the linear mixed model literature (see, for instance, Verbeke and Molenberghs, 2000). From (2.13),

p(θ; ψ) = L_g(ψ; y) g(θ | y; ψ) / g(y | θ; ψ),

and upon plugging this into (2.12),

L(ψ; y) = L_g(ψ; y) E_g[ p(y | θ; ψ) / g(y | θ; ψ) ], (2.14)

where E_g denotes expectation with respect to the Gaussian density g(θ | y; ψ) generated by the approximating model. Hence, the observed likelihood of the non-Gaussian model can be estimated by the likelihood of an approximating Gaussian model times an adjustment factor; in particular,

L̂(ψ; y) = L_g(ψ; y) ŵ(ψ),   where ŵ(ψ) = (1/m) Σ_{i=1}^m p(y | θ^(i); ψ) / g(y | θ^(i); ψ)

is a Monte Carlo sum approximating the expected value E_g with m random samples θ^(i) from g(θ | y; ψ). Normality of g(θ | y; ψ) allows for straightforward simulation from this density. A different approach for choosing the approximating Gaussian model is presented in Durbin and Koopman (2000). There, the model is determined by choosing the observation-equation parameters and Q_t of an approximating Gaussian state space model (2.6) such that the posterior density g(θ | y; ψ) implied by the Gaussian model and the posterior density p(θ | y; ψ) implied by the true model have the same posterior mode θ̂.
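The adjustment in (2.14) can be illustrated in a stripped-down, one-dimensional analogue of (2.12): a single Poisson count with a Gaussian signal, with a Laplace-type Gaussian density playing the role of g(θ | y; ψ). This is only a generic numerical analogue, not Durbin and Koopman's actual algorithm (which works with the full Kalman filter machinery); all numerical settings are assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, poisson

y, tau = 4, 1.0           # one Poisson count; signal theta ~ N(0, tau^2)

def neg_log_joint(theta):
    # -log[ p(y | theta) p(theta) ] with conditional mean exp(theta)
    return -(poisson.logpmf(y, np.exp(theta)) + norm.logpdf(theta, 0, tau))

# Gaussian importance density playing the role of g(theta | y): Laplace fit
opt = minimize_scalar(neg_log_joint)
mode = opt.x
h = 1e-4
curv = (neg_log_joint(mode + h) - 2 * opt.fun + neg_log_joint(mode - h)) / h**2
g_sd = 1.0 / np.sqrt(curv)

rng = np.random.default_rng(0)
m = 50000
theta = rng.normal(mode, g_sd, size=m)
log_w = (poisson.logpmf(y, np.exp(theta)) + norm.logpdf(theta, 0, tau)
         - norm.logpdf(theta, mode, g_sd))
L_hat = np.exp(np.logaddexp.reduce(log_w) - np.log(m))
print(L_hat)    # importance sampling estimate of the observed likelihood (2.12)
```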
Formally, by dividing and multiplying (2.12) by the importance density g(θ | y; ψ), we can interpret the approximation (2.14) as an importance sampling estimate of the observed likelihood, and the entire procedure as a simulated maximum likelihood approach:

L(ψ; y) = ∫ p(y | θ; ψ) p(θ; ψ) dθ
= ∫ [ p(y | θ; ψ) p(θ; ψ) / g(θ | y; ψ) ] g(θ | y; ψ) dθ
= L_g(ψ; y) ∫ [ p(y | θ; ψ) / g(y | θ; ψ) ] g(θ | y; ψ) dθ
= L_g(ψ; y) E_g[ p(y | θ; ψ) / g(y | θ; ψ) ],

where the third equality uses p(θ; ψ) = L_g(ψ; y) g(θ | y; ψ) / g(y | θ; ψ) from (2.13). Durbin and Koopman (1997, 2000) present a clever way of artificially enlarging the simulated sample of θ^(i)'s from the importance density g(θ | y; ψ) by the use of antithetic variables (Robert and Casella, 1999). These quadruple the sample size without additional simulation effort and balance the sample for location and scale. Overall, this leads to a reduction in the total sample size necessary to achieve a certain precision in the estimates. In practice, it is desirable to work with log L(ψ; y) in the maximization process. Durbin and Koopman (1997, 2000) present a correction for the bias introduced by estimating log E_g[ p(y | θ; ψ) / g(y | θ; ψ) ]. Finally, the resulting estimator of log L(ψ; y) can be maximized with respect to ψ by a suitable numerical procedure, such as Newton-Raphson.

We mentioned before that simulated maximum likelihood can be computationally inefficient and suboptimal, especially when some variance components are large (Jank and Booth, 2003). As we will see in various examples in Chapter 5, large variance components (e.g., a large random effects variance) are the norm rather than the exception with the type of time series models we consider. Next, we will look at an alternative, indirect method for fitting our models. In principle, though, the methods just described are also applicable to GLMMs, through the close connections between GLMMs and state space models described above.

2.3 The Monte Carlo EM Algorithm

In Section 2.2 we presented the EM algorithm as an iterative procedure consisting of two components, the E-step and the M-step. The E-step calculates a conditional expectation, which the M-step subsequently maximizes. Often at least one of these steps is analytically intractable, and in most of the applications considered here both are. Numerical methods (analytic and stochastic) have to be used to overcome these difficulties, whereby the E-step is usually the more troublesome. One popular way of approximating the expected value in the E-step uses Monte Carlo methods and is discussed in Wei and Tanner (1990), McCulloch (1994, 1997) and Booth and Hobert (1999). The Monte Carlo EM (MCEM) algorithm uses a sample from the distribution of the random effects u given the observed data y to approximate the Q-function in (2.10). In particular, at iteration k, let u^(1), ..., u^(m) be a sample from this distribution, denoted by h(u | y; β^(k-1), ψ^(k-1)) and evaluated at the parameter estimates β^(k-1) and ψ^(k-1) from the previous iteration. The approximation to (2.10) is then given by

Q_m(β, ψ | β^(k-1), ψ^(k-1)) = (1/m) Σ_{j=1}^m log f(y | u^(j); β) + (1/m) Σ_{j=1}^m log g(u^(j); ψ). (2.15)

As m → ∞, Q_m → Q with probability one. The M-step then maximizes Q_m instead of Q with respect to β and ψ, and the resulting estimates β^(k) and ψ^(k) are used in the next iteration to generate a new sample from h(u | y; β^(k), ψ^(k)). If maximization is not possible in closed form, sometimes only a pair of values (β, ψ) which satisfies Q_m(β, ψ | β^(k-1), ψ^(k-1)) ≥ Q_m(β^(k-1), ψ^(k-1) | β^(k-1), ψ^(k-1)), but which does not attain the global maximum, is chosen as the new parameter update (β^(k), ψ^(k)). However, we show for our models that the global maximum can be approximated in very few steps. Maximization of Q_m with respect to β and ψ is equivalent to maximizing the first term in (2.15) with respect to β only and the second term with respect to ψ only. This is due to the two-stage hierarchy of the response distribution and the random effects distribution in GLMMs and is discussed next. Approaches to obtaining a sample from h(u | y; β, ψ) for the approximation of the E-step are presented in Section 2.3.2, and convergence criteria are discussed in Section 2.3.3.
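In code, the Monte Carlo E-step (2.15) is a short computation once draws u^(j) are available. The sketch below evaluates Q_m for the binomial time series model of the earlier sketches; the draws here come from the prior g(u) purely for illustration, whereas in the actual algorithm they come from h(u | y), as described in Section 2.3.2. All inputs are stand-ins.

```python
import numpy as np
from scipy.stats import binom, multivariate_normal

def Qm(beta, Sigma, U, y, n_t, X):
    """Monte Carlo approximation (2.15): average the complete-data
    log-likelihood log f(y | u) + log g(u) over the sampled vectors in U."""
    eta = X @ beta + U                      # one linear predictor row per draw
    pi = 1.0 / (1.0 + np.exp(-eta))
    Q1 = binom.logpmf(y, n_t, pi).sum(axis=1).mean()         # E log f(y | u)
    g = multivariate_normal(mean=np.zeros(Sigma.shape[0]), cov=Sigma)
    Q2 = g.logpdf(U).mean()                                  # E log g(u)
    return Q1 + Q2

# Illustrative call with stand-in quantities for y and the E-step draws U
rng = np.random.default_rng(0)
T, m = 10, 500
X = np.column_stack([np.ones(T), np.arange(T)])
lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
Sigma = 0.5**2 * 0.6**lags
U = rng.multivariate_normal(np.zeros(T), Sigma, size=m)
y = rng.binomial(30, 0.4, size=T)
print(Qm(np.array([-0.4, 0.0]), Sigma, U, y, 30, X))
```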
2.3.1 Maximization of Q_m

For now we assume we have available a sample u^(1), ..., u^(m) from h(u | y; β, ψ), or from an importance sampling distribution, generated by one of the mechanisms described in Sections 2.3.2 to 2.3.5. Let Q1 and Q2 be the first and second terms of the sum in (2.15). Using the exponential family expression for the densities f(y_it | u_i), at iteration k,

Q1(β | β^(k-1), ψ^(k-1)) = (1/m) Σ_{j=1}^m Σ_{i=1}^n Σ_{t=1}^{n_i} [ y_it θ_it^(j) - b(θ_it^(j)) ] / φ_it, (2.16)

where, according to the GLMM specifications, b'(θ_it^(j)) = μ_it^(j) = h^(-1)(x_it'β + z_it'u_i^(j)), with u_i^(j) the component of the jth sampled random effects vector u^(j) belonging to cluster i. Maximizing Q1 with respect to β is equivalent to fitting an augmented GLM with known offsets: For j = 1, ..., m, let y_it^(j) = y_it and x_it^(j) = x_it be the random components and known design vectors for this augmented GLM, and let z_it'u_i^(j) be a known offset associated with each y_it^(j). That is, we duplicate the original data set m times and attach a known offset z_it'u_i^(j) to each replicated observation. The model for the mean in the augmented GLM,

E[y_it^(j)] = μ_it^(j) = h^(-1)(x_it'β + z_it'u_i^(j)),

is structurally equivalent to the model for the mean in the GLMM, and the likelihood equations for estimating β in the augmented GLM are proportional to those derived from Q1. Hence, maximization of Q1 with respect to β follows along the lines of the well known iterative Newton-Raphson or Fisher scoring algorithms for GLMs. Denote by β^(k) the parameter vector after convergence of one of these algorithms. It represents the value of the maximum likelihood estimator of β at iteration k of the MCEM algorithm.

The expression for Q2 depends on the assumed random effects distribution. Most generally, let Σ be an unstructured q x q covariance matrix for the random effects vector u = (u_1, ..., u_n), where q = Σ_{i=1}^n q_i and q_i is the dimension of the cluster-specific random effect u_i. Then, assuming u has a mean zero multivariate normal distribution g(u; ψ), where ψ holds the q(q+1)/2 distinct elements of Σ, Q2 has form

Q2(ψ | β^(k-1), ψ^(k-1)) = (1/m) Σ_{j=1}^m log g(u^(j); ψ) ∝ -(1/2) log|Σ| - (1/(2m)) Σ_{j=1}^m u^(j)' Σ^(-1) u^(j).

The goal is to maximize Q2 with respect to the variance components ψ of Σ. For a general Σ, the maximum is attained at the variance components of the sample covariance matrix S_m = (1/m) Σ_{j=1}^m u^(j) u^(j)'. Denoting these by ψ^(k) gives the value of the maximum likelihood estimator of ψ at iteration k of the MCEM algorithm. The simplest structure occurs when the random effects u_i have independent components and are i.i.d. across all clusters, so that g(u; ψ) is the product of q N(0, σ²) densities and ψ = σ. Q2 at iteration k is then maximized at

σ^(k) = ( (1/(qm)) Σ_{j=1}^m u^(j)' u^(j) )^(1/2).

Many applications of GLMMs use this simple structure of i.i.d. random effects, where often u_i is a univariate random intercept (q_i = 1 and q = n). In this case, the estimate of σ at iteration k reduces to σ^(k) = ( (1/(nm)) Σ_{j=1}^m Σ_{i=1}^n (u_i^(j))² )^(1/2). In Chapter 3 we will drop the assumption of independence and look at correlated random effects, but with more parsimonious covariance structures than the most general case presented here. Maximization of Q2 with respect to ψ will be presented there on a case-by-case basis.
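The augmented-GLM device for maximizing Q1 is easy to set up with standard GLM software. The sketch below uses Python's statsmodels for the binomial case, with simulated stand-ins for the data and for the E-step draws; in a real MCEM run the matrix U would hold draws from h(u | y).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
T, m = 10, 200
X = np.column_stack([np.ones(T), np.arange(T, dtype=float)])
n_t = np.full(T, 30)
y = rng.binomial(n_t, 0.4)                   # stand-in observed counts
U = rng.normal(0.0, 0.5, size=(m, T))        # stand-in E-step draws u^(j)

# Duplicate the data m times; copy j gets its own offset z_t'u^(j) = u_t^(j)
X_aug = np.tile(X, (m, 1))
y_aug = np.tile(y, m)
n_aug = np.tile(n_t, m)
offset = U.ravel()

# Fitting this augmented GLM maximizes Q1 in (2.16) with respect to beta
model = sm.GLM(np.column_stack([y_aug, n_aug - y_aug]),
               X_aug, family=sm.families.Binomial(), offset=offset)
beta_k = model.fit().params                  # the update beta^(k)
print(beta_k)
```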
2.3.2 Generating Samples from h(u | y; β, ψ)

So far we assumed we had available a sample u^(1), ..., u^(m) to approximate the expected value in the E-step of the MCEM algorithm. This section describes how to generate such a sample from h(u | y; β, ψ), which is known only up to a normalizing constant, or from an importance density g̃(u). In the following, we suppress the dependency on the parameters β and ψ, since the densities are always evaluated at their current values. Three methods are presented: the accept-reject algorithm produces independent samples, while Metropolis-Hastings algorithms produce dependent samples. A detailed description of all three methods can be found in Robert and Casella (1999).

2.3.2.1 Accept-reject sampling in GLMMs

In general, for accept-reject sampling we need to find a candidate density g and a constant M such that for the density of interest h (the target density), h(x) ≤ M g(x) holds for all x in the support of h. The algorithm is then to

1. generate x ~ g and w ~ Uniform[0, 1];
2. accept x as a random sample from h if w ≤ h(x)/(M g(x));
3. otherwise, return to 1.

This produces one random sample x from the target density h. The probability of acceptance is 1/M, and the expected number of trials until a candidate is accepted is M. For our purpose, the target density is h(u | y). Since

h(u | y) = f(y | u) g(u)/a ≤ M g(u),

where M = sup_u f(y | u)/a and a is an unknown normalizing constant equal to the marginal likelihood, the multivariate normal random effects distribution g(u) can be used as the candidate density. Booth and Hobert (1999, Sect. 4.1) show that for certain models sup_u f(y | u) can be calculated easily from the data alone and thus need not be updated at every iteration. For some models we discuss here, the condition of Booth and Hobert (1999, Sect. 4.1, page 272) required for this simplification does not hold. However, the likelihood of a saturated GLM is always an upper bound for f(y | u). To see this, regard L(u) = f(y | u) as the likelihood corresponding to a GLM with random components y_it and linear predictor η_it = z_it'u_i + x_it'β, where now x_it'β plays the role of a known offset and the u_i are the parameters of interest. The maximized likelihood L(û) for this model is always less than the maximized likelihood L(y) for a saturated model. Hence sup_u f(y | u) ≤ L(û) ≤ L(y), and L(û) or L(y) can be used to construct M.

Example: In Section 3.1 we consider a data set where, conditional on a random effect u_t, the tth observation in group i, y_it, is modeled as a Binomial(n_it, π_it) random variable. There are 16 time points, t = 1, ..., 16, and two groups, i = 1, 2. A very simple logistic-normal GLMM for these data has form

logit(π_it(u_t)) = α + βx_i + u_t,

where x_i is a binary group indicator. The overall design matrix for this problem is the 32 x 18 matrix whose rows are (1, x_i, e_t'), where e_t is the tth unit vector of length 16; the columns hold the coefficients corresponding to α, β, u_1, u_2, ..., u_16. All rows of this matrix are different, and as a consequence the condition of Booth and Hobert (1999, Sect. 4.1, page 272) does not hold. However, the saturated binomial likelihood L(y) is an upper bound for f(y | u), i.e., sup_u f(y | u) ≤ L(y).
For instance, for the logistic-normal example from above with linear predictor η_it = c_i + u_t, where c_i = α + βx_i represents the fixed part of the model, we have

sup_u f(y | u) = sup_{u_1,...,u_16} Π_{i,t} ( e^{c_i+u_t} / (1 + e^{c_i+u_t}) )^{y_it} ( 1 / (1 + e^{c_i+u_t}) )^{n_it - y_it}.

By first taking logs and then finding first and second derivatives with respect to u_t, we see that u_t = log( y_it / (n_it - y_it) ) - c_i maximizes the factor belonging to observation y_it when 0 < y_it < n_it. Plugging in, we obtain

sup_{u_t} f(y_it | u_t) = ( y_it / n_it )^{y_it} ( 1 - y_it / n_it )^{n_it - y_it}.

For the special cases y_it = 0 or y_it = n_it, the trivial bound on f(y_it | u_t) is 1. Hence, the following inequality, which follows immediately from the above, can be used in constructing the accept-reject algorithm for a logistic-normal model with linear predictor of form η_it = c_i + u_t:

sup_u f(y | u) ≤ Π_{i,t} ( y_it / n_it )^{y_it} ( 1 - y_it / n_it )^{n_it - y_it} = L(y).

This means we can select M = L(y)/a to meet the accept-reject condition, and consequently we accept a sample u from g(u) if, for w ~ Uniform[0, 1],

w ≤ h(u | y) / (M g(u)) = f(y | u) / L(y).

Notice that this condition is free of the normalizing constant a. In practice, especially for high dimensional random effects, M can be very large, and therefore we almost never accept a sample. The two alternative methods described below may avoid this problem. Note, however, that the accept-reject method yields an independent and identically distributed sample from the target distribution. This is important if one wants to implement an automated MCEM algorithm (Booth and Hobert, 1999), where the Monte Carlo sample size m is increased automatically as the algorithm progresses to adjust for the error in the Monte Carlo approximation to the E-step.

2.3.2.2 Markov chain Monte Carlo methods

For high dimensional distributions h(u | y), which are unavoidable if correlated random effects are used, accept-reject methods can be very slow. An alternative is to generate a Markov chain with invariant distribution h(u | y), which may be much faster but results in dependent samples. McCulloch (1997) discussed a Metropolis-Hastings algorithm for creating such a chain in the logistic-normal regression case. In general, an independence Metropolis-Hastings algorithm is built as follows: choose a candidate density g̃(u) with the same support as h(u | y). Then, for a current state u^(j),

1. generate w ~ g̃;
2. set the new state u^(j+1) equal to w with probability ρ = min( 1, [h(w | y) g̃(u^(j))] / [h(u^(j) | y) g̃(w)] ) and equal to u^(j) with probability 1 - ρ.

After a sufficient burn-in time, the states of the generated chain can be regarded as a (dependent) sample from h(u | y). If the candidate density g̃(u) is chosen to be the density of the random effects g(u), the acceptance probability in step 2 reduces to the simple form min( 1, f(y | w) / f(y | u^(j)) ). To further speed up simulations, McCulloch (1997) uses a random scan algorithm which updates only one component of the previous state u^(j) at a time and, upon acceptance in step 2, uses it as the new state.

Another popular MCMC algorithm is the Gibbs sampler. Let u^(j-1) = (u_1^(j-1), ..., u_n^(j-1)) denote the current state of a Markov chain with invariant distribution h(u | y). One iteration of the Gibbs sampler generates, component-wise,

u_1^(j) ~ h(u_1 | u_2^(j-1), ..., u_n^(j-1), y)
u_2^(j) ~ h(u_2 | u_1^(j), u_3^(j-1), ..., u_n^(j-1), y)
...
u_n^(j) ~ h(u_n | u_1^(j), ..., u_{n-1}^(j), y),

where the h(u_i | u_1^(j), ..., u_{i-1}^(j), u_{i+1}^(j-1), ..., u_n^(j-1), y) are the so-called full conditionals of h(u | y). The vector u^(j) = (u_1^(j), ..., u_n^(j)) represents the new state of the chain and, after a sufficient burn-in time, can be regarded as a sample from h(u | y). The advantage of the Gibbs sampler is that it reduces sampling of a possibly very high-dimensional vector u to sampling of several lower-dimensional components of u. We will use the Gibbs sampler in connection with autoregressive random effects to simplify sampling from an initially very high-dimensional distribution h(u | y) by sampling from its simpler full univariate conditionals.
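A minimal sketch of the independence Metropolis-Hastings sampler with the random effects density as candidate, so that the acceptance probability reduces to min(1, f(y | w)/f(y | u^(j))): for simplicity the candidate here has independent components, whereas the time series models of Chapter 3 would use a correlated g(u). All model settings are illustrative stand-ins.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
T = 10
X = np.column_stack([np.ones(T), np.arange(T, dtype=float)])
beta, sigma = np.array([-0.4, 0.05]), 0.5
n_t = np.full(T, 30)
y = rng.binomial(n_t, 0.4)                   # stand-in data

def log_f(u):
    """log f(y | u) for the binomial logit model with random effect u_t."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta + u)))
    return binom.logpmf(y, n_t, pi).sum()

def independence_mh(n_iter=5000, burn_in=1000):
    u = np.zeros(T)
    draws = []
    for j in range(n_iter):
        w = rng.normal(0.0, sigma, size=T)   # candidate drawn from g(u)
        # Accept with probability min(1, f(y | w) / f(y | u))
        if np.log(rng.uniform()) < log_f(w) - log_f(u):
            u = w
        if j >= burn_in:
            draws.append(u.copy())
    return np.array(draws)

sample = independence_mh()
print(sample.mean(axis=0))
```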
2.3.2.3 Importance sampling

An importance sampling approximation to the Q-function in (2.10) is given by

Q_m(β, ψ | β^(k-1), ψ^(k-1)) = Σ_{j=1}^m w_j log f(y | u^(j); β) + Σ_{j=1}^m w_j log g(u^(j); ψ),

where the u^(j) are independent samples from an importance density g̃(u; ψ^(k-1)) and

w_j = h(u^(j) | y; β^(k-1), ψ^(k-1)) / g̃(u^(j); ψ^(k-1)) = f(y | u^(j); β^(k-1)) g(u^(j); ψ^(k-1)) / ( a g̃(u^(j); ψ^(k-1)) )

are importance weights at iteration k. Usually, Q_m is divided by the sum of the importance weights Σ_{j=1}^m w_j. The normalizing constant a depends only on the known parameters (β^(k-1), ψ^(k-1)) and hence plays no part in the subsequent maximization step. Selecting the importance density g̃ is a delicate issue: it should be easy to simulate from, but it should also resemble h(u | y) as closely as possible. Booth and Hobert (1999) suggest a Student t density as the importance distribution g̃, with mean and variance matched to those of h(u | y) via a Laplace approximation.

2.3.3 Convergence Criteria

Due to the stochastic nature of the algorithm, parameter estimates of two successive iterations can be close together just by chance, although convergence is not yet achieved. To reduce the risk of stopping prematurely, we declare convergence if the relative change in the parameter estimates is less than some ε1 for c (e.g., five) consecutive iterations. Let Λ^(k) = (β^(k)', ψ^(k)')' be the vector of unknown fixed effects parameters and variance components. Then this condition means that

max_l | Λ_l^(k) - Λ_l^(k-1) | / | Λ_l^(k-1) | < ε1 (2.17)

has to be fulfilled for c consecutive (e.g., five) values of k. For any Λ_l^(k), an exception to this rule occurs when the estimated standard error of that parameter is substantially larger than the change from one iteration to the next. Hence, at iteration k, for those parameters satisfying

| Λ_l^(k) - Λ_l^(k-1) | / ( var̂(Λ_l^(k)) )^(1/2) < ε2,

where var̂(Λ_l^(k)) is the current estimate of the variance of the MLE Λ̂_l, the relative precision criterion (2.17) need not be met. An estimate of this variance can be obtained from the observed information matrix of the ML estimator of Λ. Louis (1982) showed that the observed information matrix can be written in terms of the first (l') and second (l'') derivatives of the complete data log-likelihood l(Λ; y, u) = log j(y, u; Λ). Evaluated at the MLE Λ̂, it is given by

I(Λ̂) = -E_{u|y}[ l''(Λ̂; y, u) | y ] - var_{u|y}[ l'(Λ̂; y, u) | y ].

An approximation to this matrix at iteration k uses Monte Carlo sums with draws from h(u | y; Λ^(k)) from the current iteration of the MCEM algorithm.

To further safeguard against stopping prematurely, we use a third convergence criterion, based on the Q_m function. For deterministic EM, the Q-function is guaranteed to increase from iteration to iteration. With MCEM, because of the stochastic approximation, Q_m^(k) can be less than Q_m^(k-1) because of an "unlucky" Monte Carlo sample at iteration k. Hence, the parameter estimates obtained from maximizing Q_m^(k) can be a step in the wrong direction and may actually decrease the value of the likelihood. To counter this, we declare convergence only if successive values of Q_m^(k) are within a small neighborhood of each other.
More important, however, is that we accept the kth parameter update Λ^(k) only if the relative change in the Q_m function is larger than some small negative constant ε3, i.e.,

( Q_m^(k) - Q_m^(k-1) ) / | Q_m^(k-1) | > -ε3. (2.18)

If (2.18) is not met at iteration k, so that there is reason to believe that Λ^(k) decreases the likelihood and is worse than the parameter update from the previous iteration, we repeat the kth iteration with a new and larger Monte Carlo sample. Thereby we hope to better approximate the Q-function and, as a result, get a better estimate Λ^(k), with a Q_m value larger than the previous one. If this does not happen, we nevertheless accept Λ^(k) and proceed to the next iteration, possibly letting the algorithm temporarily move in the direction of a lower likelihood region. Otherwise, the Monte Carlo sample size quickly grows without bounds at an early stage of the algorithm. Furthermore, at early stages the Monte Carlo error in the approximation of the Q-function can be large, and hence its trace plot is very volatile. Caffo, Jank and Jones (2003) go a step further and calculate asymptotic confidence intervals for the change in the Q_m function, based on which they construct a rule for accepting or rejecting Λ^(k). They discuss schemes for increasing the Monte Carlo sample size accordingly, and their MCEM algorithm inherits the ascent property of EM with high probability. However, we feel that the simpler criterion (2.18) suffices for the examples considered here.

Coupled with any convergence criterion is the question of the updating scheme for the Monte Carlo sample size m between iterations. In general, we will use m^(k) = a m^(k-1), where a > 1 and m^(k) is the Monte Carlo sample size at iteration k. At early iterations, m^(k) will be low, since big parameter jumps are expected regardless of the quality of the approximation and the Monte Carlo error associated with it. Later, as more weight is put on decreasing the Monte Carlo error in the approximations, this steady increase guarantees sufficiently large Monte Carlo samples. Furthermore, condition (2.18) signals when an additional boost in m^(k) is needed to better approximate the Q-function in a given iteration. Hence, whenever (2.18) is not met, we rerun iteration k with a bigger sample size q m^(k), where the factor q > 1 is usually between 1 and 2.

[Figure 2-1: Plot of the typical behavior of the Monte Carlo sample size m^(k) (left panel) and the Q-function Q_m^(k) (right panel) through MCEM iterations. The iteration number is shown on the x-axis. Plots are based on the data and model for the boat race data discussed in Chapter 5.]

A typical picture of the Monte Carlo sample size m^(k) and the Q_m^(k) function through the iterations of an MCEM algorithm is presented in Figure 2-1. The increase in the Q_m^(k) function is large at the first iterations, but its Monte Carlo error is also large due to the small Monte Carlo sample size. The plot of the Monte Carlo sample size m^(k) shows several jumps, corresponding to occasions where the Q_m^(k) function decreased by more than ε3 from one iteration to the next and we adjusted with an additional boost in generated samples. The data and model on which this plot is based are taken from the boat race example analyzed and discussed in Chapter 5, with convergence criteria set to ε1 = 0.001, c = 4, ε2 = 0.003, ε3 = 0.005, a = 1.03 and q = 1.05.
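The updating rules just described amount to a small amount of control logic. The sketch below wires them together around a stub that mimics a noisy Q_m evaluation; the stub and all constants are stand-ins, and in a real implementation mcem_iteration would run an E- and M-step and return the attained Q_m value.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcem_iteration(m):
    """Stub for one MCEM iteration: returns a noisy Q_m value whose Monte
    Carlo error shrinks as m grows (for illustration only)."""
    return -270.0 + rng.normal(0.0, 5.0 / np.sqrt(m))

a, q, eps3 = 1.03, 1.2, 0.005     # growth factor a, boost factor q, epsilon_3
m = 50
Q_old = mcem_iteration(m)
for k in range(1, 101):
    m = int(np.ceil(a * m))                 # regular increase m(k) = a m(k-1)
    Q_new = mcem_iteration(m)
    # Criterion (2.18): the relative change in Q_m must not be too negative
    if (Q_new - Q_old) / abs(Q_old) < -eps3:
        m = int(np.ceil(q * m))             # boost the sample, redo iteration k
        Q_new = mcem_iteration(m)           # accept the update either way
    Q_old = Q_new
print(m)                                    # final Monte Carlo sample size
```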
Fort and Moulines (2003) show that with geometrically ergodic (see, e.g., Robert and Casella, 1999) MCMC samplers, a polynomial increase in the Monte Carlo sample size leads to convergence of the MCEM parameter estimates. However, establishing geometric ergodicity is not an easy task. Other, more sophisticated and automated Monte Carlo sample size updating schemes are presented by Booth and Hobert (1999), for independent sampling, and by Caffo, Jank and Jones (2003), for independent and MCMC sampling.

CHAPTER 3
CORRELATED RANDOM EFFECTS

In Chapter 2 we mentioned on several occasions that for certain data structures the usual assumption of independent random effects is inappropriate. For instance, if clusters represent time points in a study over time, observations from different clusters can no longer be assumed (marginally) independent. Or, in longitudinal studies, the nonnegative and exchangeable correlation structure among repeated observations implied by a single random effect can be far from the truth for long sequences of repeated observations. Section 3.1 presents data from a cross-sectional time series which motivate the use of correlated random effects and discusses their implications. In Sections 3.2 and 3.3, two special correlation structures useful for modeling the dependence structure in discrete repeated measures with possibly unequally spaced observation times are discussed. The main focus of this chapter is on the technical implications for the MCEM algorithm arising from estimating an additional variance (correlation) component. In contrast to models with independent random effects, the M-step has no closed form solution and iterative methods have to be used to find the maximum. Also, because the random effects are correlated a priori, they are correlated a posteriori, and sampling from the posterior distribution of u | y, as required by the MCEM algorithm, is more involved than with independent random effects. A Gibbs sampling approach is developed in Section 3.4. From here on we let t denote the index for the discrete observation times, t = 1, ..., T, and we let Y_it denote a response at time point t for stratum i, i = 1, ..., n. Throughout, we will assume univariate but correlated random effects {u_t}_{t=1}^T associated with the observations over time.

3.1 A Motivating Example: Data from the General Social Survey

The basic purpose of the General Social Survey (GSS), conducted by the National Opinion Research Center, is to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors and attributes. It is second only to the census in popularity among sociologists as a data source for conducting research. The GSS questionnaire contains a standard core of demographic and attitudinal variables whose wording is retained throughout the years to facilitate time trend studies. (Source: www.norc.uchicago.edu/projects/gensocl.asp). Currently, the GSS comprises a total of 24 surveys conducted in the years 1973-1978, 1980, 1982, 1983-1994, 1996, 1998, 2000 and 2002, with data available online (at www.webapp.icpsr.umich.edu/GSS/) through 1998. Two features, a discrete response variable (most of the attitude questions) observed through time and unequally spaced observation times, make it a prime resource for applying the models proposed in this dissertation. Data obtained from the GSS are different from longitudinal studies, where subjects are followed through time.
Here, responses are from independent cross-sectional surveys of different subjects in each year. One question, included in 16 of the 22 surveys through 1998, recorded attitude towards homosexual relationships. It was asked in the years 1974, 1976-77, 1980, 1982, 1984-85, 1987-1991, 1993-94, 1996 and 1998. We will use these data to motivate and illustrate the use of correlated random effects. Figure 3-1 shows the proportion of respondents who agreed with the statement that homosexual relationships are not wrong at all, for the two race cohorts of white respondents and black respondents. For simplicity in this introductory example, only race was chosen as a cross-classifying variable, and attitude was measured as answering "yes" or "no" to the aforementioned question. Let Y_it denote the number of people in year t and of race i who agreed with the statement that homosexual relationships are not wrong at all.

[Figure 3-1: Sampling proportions from the GSS data set. Proportion of whites (squares) and blacks (circles) agreeing with the statement that homosexual relationships are not wrong at all, from 1974 to 1998.]

The index t = 1, ..., 16 runs through the set of 16 years {1974, 1976, 1977, 1980, ..., 1998} mentioned above, and i = 1 for race equal to white and i = 2 for race equal to black. The conditional independence assumption discussed in Section 2.1 allows us to model Y_it, the sum of n_it binary variables which are the individual responses, as a binomial variable conditional on a yearly random effect. That is, the probabilistic model we propose assumes a conditional Binomial(n_it, π_it) distribution for each member of the two time series {Y_1t}_{t=1}^16 and {Y_2t}_{t=1}^16 pictured in Figure 3-1. The parameters n_it and π_it are, respectively, the total number of respondents of race i in year t and their conditional probability of agreeing with the statement that homosexual relationships are not wrong at all.

3.1.1 A GLMM Approach

A popular model for π_it is a logistic-normal model, for which the link function h(·) in (2.2) is the logit link and the random effects structure simplifies to a random intercept u_t. We will assume that the fixed parameter vector β is composed of an intercept term α, linear and quadratic time effects β1 and β2, a race effect β3 and a year-by-race interaction β4. With x_1t representing the year variable centered around 1984 (e.g., x_11 = 1974 - 1984 = -10) and x_2i the indicator variable for race (for whites x_21 = 0, for blacks x_22 = 1), the model has form

logit(π_it(u_t)) = α + β1 x_1t + β2 x_1t² + β3 x_2i + β4 x_1t x_2i + u_t. (3.1)

Apart from the fixed effects, the random time effect u_t captures the dependency structure over the years. Note that π_it(u_t) is a conditional probability, given the random effect u_t from the year the question was asked. This random effect u_t can be interpreted as the unmeasurable public opinion about homosexual relationships, common to all respondents within the same year. By introducing this random effect, we assume that individual opinions are influenced by this overall opinion or by the social and political climate on homosexual relationships (like awareness of AIDS and the social spending associated with it, which is hard to measure). Thus, individual responses within a given year are no longer independent of each other, but share a common random effect.
Furthermore, it is natural to assume that the public opinion about homosexual relationships changes gradually over time, with higher correlations for years closer together and lower correlations for years further apart. It would be wrong and unnatural to assume that public opinion (or the political climate) is independent from one year to the next. However, this is exactly what is assumed by modeling the random effects {u_t} as independent of each other. It would also be wrong to assume a common, time-independent random effect u = u_t for all time points t, as this implies that public opinion does not change over time: its effect would then be the same whether responses were measured in 1974 or 1998.

3.1.2 Motivating Correlated Random Effects

To capture the dependency in public opinion, and therefore in responses, over different years, we propose random effects that are correlated. In particular, for this example with unequally spaced observation times, we suggest normal autocorrelated random effects {u_t} with variance function var(u_t) = σ², t = 1, ..., 16, and correlation function

corr(u_t, u_t*) = ρ^|x_1t - x_1t*|,   1 ≤ t < t* ≤ 16,

where x_1t - x_1t* is the difference between the two years identified by the indices t and t*. This is equivalent to specifying a latent autoregressive process

u_{t+1} = ρ^|x_{1,t+1} - x_1t| u_t + ε_t

underlying the data generation mechanism. Both of these formulations naturally handle the multiple gaps in the observed time series. There is no need to make adjustments (such as imputation of data or artificially treating the series as equally spaced) in our analysis on account of the "missing" data in the years 1975, 1978-79, 1981, 1983, 1986, 1992, 1995 or 1997.

With correlated random effects, we have to distinguish between two situations: the correlation induced by assuming a common random effect u_t for each cluster (here: year), and the correlation induced by assuming a correlation among the cluster-specific random effects {u_t}. Correlation among observations in the same cluster is a consequence of assuming a single, cluster-specific random effect shared by all observations in that cluster. For example, the presence of the cluster-specific random effect u_t in (3.1) leads to a (marginal) nonnegative correlation between the two binomial responses Y_1t and Y_2t in year t. With conditional independence, the marginal covariance between these two observations is given by

cov(Y_1t, Y_2t) = E[ cov(Y_1t, Y_2t | u_t) ] + cov( E[Y_1t | u_t], E[Y_2t | u_t] )
= cov( logit^(-1)(μ_1t + u_t), logit^(-1)(μ_2t + u_t) ), (3.2)

where μ_it is the fixed part of the linear predictor in (3.1). Both functions in (3.2) are monotone increasing in u_t, leading to a nonnegative correlation. Approximations to (3.2) will be dealt with in Section 4.3. In the example, we attributed the cause of this correlation to the current (at the time of the interview) public opinion about homosexual relationships, influencing all respondents in that year. The estimate of σ gives an idea of the magnitude of this correlation, since the more disperse the u_t's are, the stronger the correlation among the responses within a year. For instance, if the true u_t for a particular year is positive and far away from zero, as measured by σ, then all respondents have a common tendency to give a positive answer. If it is far away from zero on the negative side, respondents have a common tendency to give a negative answer. This interpretation, of course, is only relative to the other fixed effects included in the linear predictor.
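Because the correlation function is indexed by calendar years rather than by position in the series, the gaps require no special treatment when building the random effects covariance matrix. A short sketch (the values of sigma and rho are set near the estimates reported in this section; the years are the 16 observation years listed above):

```python
import numpy as np

# Observation years of the GSS item (unequally spaced)
years = np.array([1974, 1976, 1977, 1980, 1982, 1984, 1985, 1987, 1988,
                  1989, 1990, 1991, 1993, 1994, 1996, 1998])
sigma, rho = 0.10, 0.65

# corr(u_t, u_t*) = rho^|x_1t - x_1t*| handles the gaps automatically
lags = np.abs(np.subtract.outer(years, years)).astype(float)
Sigma = sigma**2 * rho**lags

print(np.round(Sigma[:4, :4], 5))   # covariance block for 1974, 1976, 1977, 1980
```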
For the GSS data, there seems to be moderate correlation between responses, based on a maximum likelihood estimate of σ̂ = 0.10 with an approximate asymptotic standard error of 0.03. This interpretation of a moderate effect of public opinion on responses within the same year is further supported by the fact that σ can also be interpreted as the regression coefficient for a standardized version of the random effect u_t. A regression coefficient of 0.10 for a standard normal variable on the logit scale leads to moderate heterogeneity on the probability scale. This shows that the correlation between responses within a common year cannot be neglected.

The second consequence of correlated random effects is that observations from different clusters are correlated, which is a distinctive feature compared to GLMMs assuming independence between cluster-specific random effects. The conditional log odds of agreeing with the statement that homosexual relationships are not wrong at all are now correlated over the years, a feature which is natural for a time series of binomial observations but which would have gone unaccounted for if time-independent random effects were used. For instance, for the cohort of white respondents (i = 1), the correlation between the conditional log odds at years t and t* is

corr( logit(π_1t(u_t)), logit(π_1t*(u_t*)) ) = ρ^|x_1t - x_1t*|

and is therefore directly related to the assumed random effects correlation structure. Marginally, two binomial responses at different observation times have covariance

cov(Y_1t, Y_2t*) = cov( logit^(-1)(μ_1t + u_t), logit^(-1)(μ_2t* + u_t*) ),

which accommodates changing covariance patterns for different observation times (e.g., decreasing with increasing lag) and also negative covariances (see, for instance, the analysis of the Old Faithful geyser eruption data in Chapter 5). We will present approximations to these marginal correlations in binomial time series in Section 4.3.

Summing up, correlated random effects give us a means of incorporating correlation between sequential binomial observations that goes beyond independent or exchangeable correlation structures. In our example, we attributed the sequential correlation to the gradual change in public opinion about homosexual relationships over the years, affecting both races equally. In fact, the maximum likelihood estimate of ρ is equal to 0.65 (s.e. 0.25), indicating that rather strong correlations might exist between responses from adjacent years. The model uses 7 parameters (5 fixed effects, 2 variance components) to describe the 32 probabilities. In comparison with a regular GLM and a GLMM with independent random time effects, the negative log-likelihood at the maximum decreases from 113.0 for the regular GLM to approximately 109.7 for a GLMM with independent random effects, and to approximately 107.5 for the GLMM with autoregressive random effects. Note that the GLM assumes independent observations within and between the years, and that the GLMM with independent random effects {u_t} for each year t assumes correlation of responses within a year but independence of responses over the years. Both assumptions might be inappropriate. Our model implies that the log odds of approval of homosexual relationships are correlated for blacks and whites within a year (though not very strongly, with an estimate of σ equal to 0.1) and are also correlated for two consecutive years. The estimates of the fixed parameters and their asymptotic standard errors are given in Table 5-1.
The MCEM algorithm converged after 128 iterations, with a starting Monte Carlo sample size of 50 and a final Monte Carlo sample size of 8600. The convergence parameters (cf. Section 2.3.3) were ε1 = 0.002, c = 5, ε2 = 0.003, ε3 = 0.001, a = 1.01 and q = 1.2. Path plots of selected parameter estimates for two different sets of starting values are shown in Figure 3-2. A detailed interpretation of the parameters and the effects of the explanatory variables on the odds of approval is provided in Section 4.3. Although this example assumed autocorrelated random effects, we will look at the simpler case of equally correlated random effects next. Then we discuss how the correlation parameter ρ can be estimated within the MCEM framework presented in Section 2.3.

[Figure 3-2: Iteration history for selected parameters and their asymptotic standard errors for the GSS data. The iteration number is plotted on the x-axis. The estimates and standard errors for β2 were multiplied by 10³ for better plotting. The two different lines in each plot correspond to two different sets of starting values.]

Summing up, correlated random effects in GLMMs allow one to model within-cluster as well as between-cluster correlations for discrete response variables, where clusters refer to groupings of responses in time.

3.2 Equally Correlated Random Effects

The introductory example modeled decaying correlation between cross-sectional data over time through the use of autocorrelated random effects. In other temporal or spatial settings, the correlation might stay nearly constant between any two observation times, regardless of the time or location differences between the two discrete responses. Equally correlated random effects might then be appropriate to describe such behavior.

3.2.1 Definition of Equally Correlated Random Effects

We call random effects equally correlated if var(u_t) = σ² for all t and corr(u_t, u_t*) = ρ for all t ≠ t*. More compactly, the covariance matrix of the random effects vector u = (u_1, ..., u_T)' is given by Σ = σ²[(1-ρ)I_T + ρJ_T], where J_T = 1_T 1_T'. To ensure positive definiteness, ρ has restricted range, i.e., 1 > ρ > -1/(T-1). The random effects density is given by

g(u; ψ) ∝ |Σ|^(-1/2) exp( -(1/2) u'Σ^(-1)u ), (3.3)

where now, due to the pattern in Σ,

|Σ| = σ^(2T) (1-ρ)^(T-1) [1 + (T-1)ρ]   and   Σ^(-1) = (1/σ²) [ I_T/(1-ρ) - ( ρ/((1-ρ)[1+(T-1)ρ]) ) J_T ].

The vector ψ = (σ, ρ) holds the variance components of Σ. The more complicated random effects structure (as compared to independence or a single latent random effect) leads to a more complicated M-step in the MCEM algorithm described in Section 2.3. For a sample u^(1), ..., u^(m) from the posterior h(u | y; β^(k-1), ψ^(k-1)), evaluated at the previous parameter estimates β^(k-1) and ψ^(k-1), the function Q2(ψ | ·) introduced in Section 2.3.1 has form

Q2(ψ | ·) = (1/m) Σ_{j=1}^m log g(u^(j); ψ)
∝ -T log σ - ((T-1)/2) log(1-ρ) - (1/2) log[1 + (T-1)ρ] - a/(2σ²(1-ρ)) + ρb/(2σ²(1-ρ)[1+(T-1)ρ]),

where a = (1/m) Σ_{j=1}^m u^(j)'u^(j) and b = (1/m) Σ_{j=1}^m u^(j)'J_T u^(j) are constants depending on the sample only.
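The closed-form determinant and inverse above are easy to check numerically; the sketch below builds Σ for a small T and compares the formulas against direct computation (all parameter values are arbitrary illustrations):

```python
import numpy as np

def exch_cov(sigma, rho, T):
    """Sigma = sigma^2 [(1 - rho) I_T + rho J_T] for equally correlated effects."""
    return sigma**2 * ((1 - rho) * np.eye(T) + rho * np.ones((T, T)))

sigma, rho, T = 0.7, 0.3, 5
S = exch_cov(sigma, rho, T)

# Closed-form determinant from the text vs. direct computation
det_closed = sigma**(2 * T) * (1 - rho)**(T - 1) * (1 + (T - 1) * rho)
print(np.isclose(np.linalg.det(S), det_closed))       # True

# Closed-form inverse from the text vs. direct computation
Sinv_closed = (np.eye(T) / (1 - rho)
               - rho / ((1 - rho) * (1 + (T - 1) * rho)) * np.ones((T, T))
               ) / sigma**2
print(np.allclose(np.linalg.inv(S), Sinv_closed))     # True
```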
3.2.2 The M-step with Equally Correlated Random Effects

The M-step seeks to maximize $Q_2$ with respect to $\sigma$ and $\rho$, which is equivalent to finding their MLEs treating the sample $u^{(1)}, \ldots, u^{(m)}$ as independent. Since this is not possible in closed form, one way to maximize $Q_2$ uses a bivariate Newton-Raphson algorithm with the Hessian formed by the second order partial and mixed derivatives of $Q_2$ with respect to $\sigma$ and $\rho$. Some authors (e.g., Lange, 1995; Zhang, 2002) use only a single iteration of the Newton-Raphson algorithm instead of an entire M-step to speed up convergence. However, this might not always lead to convergence, since the interval on which the Newton-Raphson algorithm converges is constrained by the restrictions on $\rho$. We now show that, with a little work, the maximizers for $\sigma$ and $\rho$ can be obtained very quickly.

For any given value of $\rho$, the ML estimator for $\sigma$ (at iteration $k$) is available in closed form and is equal to

$$\hat{\sigma}^{(k)} = \left\{\frac{1}{T(1-\rho)}\left(a - \frac{\rho}{1 + (T-1)\rho}\, b\right)\right\}^{1/2}.$$

Note that if $\rho = 0$, $\hat{\sigma}^{(k)} = \left(\frac{1}{Tm}\sum_{j=1}^m u^{(j)\prime}u^{(j)}\right)^{1/2}$, the estimator for the independence case presented at the end of Section 2.3.1. Unfortunately, the ML estimator for $\rho$ has no closed form solution. Writing $D(\rho) = (1-\rho)[1 + (T-1)\rho]$, the first and second partial derivatives of $Q_2$ with respect to $\rho$ are given by

$$\frac{\partial Q_2}{\partial \rho} = \frac{T(T-1)\rho}{2D(\rho)} - \frac{a}{2\sigma^2(1-\rho)^2} + \frac{[1 + (T-1)\rho^2]\, b}{2\sigma^2 D(\rho)^2}$$

$$\frac{\partial^2 Q_2}{\partial \rho^2} = \frac{T(T-1)[1 + (T-1)\rho^2]}{2D(\rho)^2} - \frac{a}{\sigma^2(1-\rho)^3} + \frac{\left\{(T-1)\rho D(\rho) - [1 + (T-1)\rho^2]D'(\rho)\right\} b}{\sigma^2 D(\rho)^3},$$

where $D'(\rho) = (T-2) - 2(T-1)\rho$. We obtain the profile score for $\rho$ by plugging the MLE $\hat{\sigma}^{(k)}$ into the likelihood equation for $\rho$. Then we use a simple and fast interval-halving (or bisection) method to find the root for $\rho$. This is advantageous compared to a Newton-Raphson algorithm since the range of $\rho$ is restricted. Let $f(\rho) = \frac{\partial Q_2}{\partial \rho}\big|_{\sigma = \hat{\sigma}^{(k)}}$ and let $\rho_1$ and $\rho_2$ be two initial estimates in the appropriate range, satisfying $\rho_1 < \rho_2$ and $f(\rho_1)f(\rho_2) < 0$. Without loss of generality, assume $f(\rho_1) < 0$. Clearly, the maximum likelihood estimate $\hat{\rho}$ must be in the interval $[\rho_1, \rho_2]$. The interval-halving method computes the midpoint $\rho_3 = (\rho_1 + \rho_2)/2$ of this interval and updates one of its endpoints in the following way: it sets $\rho_1 = \rho_3$ if $f(\rho_3) < 0$, or $\rho_2 = \rho_3$ otherwise. The newly formed interval $[\rho_1, \rho_2]$ has half the length of the initial interval, but still contains $\hat{\rho}$. Subsequently, a new midpoint $\rho_3$ is calculated, giving rise to a new interval with one fourth of the length of the initial interval, but still containing $\hat{\rho}$. This process is iterated until $|f(\rho_3)| < \epsilon$, where $\epsilon$ is a small positive constant. To ensure it is a maximum, we can check that the value of the second derivative $f'(\rho)$ is negative at $\rho_3$. (The second derivative is also needed for approximating standard errors in the EM algorithm.) The value of $\rho_3$ is then used as an update for $\rho$ in the maximum likelihood estimator for $\sigma$, and the whole process of finding the roots of $Q_2$ is repeated. Convergence is declared when the relative change in $\sigma$ and $\rho$ is less than some prespecified small constant. The values of $\sigma$ and $\rho$ at this final iteration are the estimates $\hat{\sigma}^{(k)}$ and $\hat{\rho}^{(k)}$ from MCEM iteration $k$. The issue of how to obtain a sample $u^{(1)}, \ldots, u^{(m)}$ from $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$, taking into account the special structure of the random effects distribution, will be discussed in Section 3.4.2.

3.3 Autoregressive Random Effects

The use of autoregressive random effects was demonstrated in the introductory example. Their property of a decaying correlation function makes them a useful tool for modeling temporal or spatial associations among discrete data.
We will limit ourselves to instances where there is a natural ordering of the random effects, and consider time dependent data first.

3.3.1 Definition of Autoregressive Random Effects

As with equally correlated random effects in Section 3.2, we can look at the joint distribution of autoregressive (or autocorrelated) random effects $\{u_t\}_{t=1}^T$ as a mean-zero multivariate normal distribution with patterned covariance matrix $\Sigma$, defined by the variance and correlation functions $\mathrm{var}(u_t) = \sigma^2$ for all $t$ and $\mathrm{corr}(u_t, u_{t^*}) = \rho^{|x_t - x_{t^*}|}$ for all $t \neq t^*$, where $x_t$ and $x_{t^*}$ are time points (e.g., years, as in the GSS example) associated with random effects $u_t$ and $u_{t^*}$. Let $d_t = x_{t+1} - x_t$ denote the time difference between two successive time points and let $f_t = 1/(1 - \rho^{2d_t})$, $t = 1, \ldots, T-1$. Then, due to the special structure, the determinant of the covariance matrix is given by $|\Sigma| = \sigma^{2T}\prod_{t=1}^{T-1} f_t^{-1}$, and $\Sigma^{-1}$ is tridiagonal (Crowder and Hand, 1990, with correction of a typo therein) with main diagonal

$$\frac{1}{\sigma^2}\left(f_1,\; f_1 + f_2 - 1,\; f_2 + f_3 - 1,\; \ldots,\; f_{T-2} + f_{T-1} - 1,\; f_{T-1}\right)$$

and sub-diagonals

$$-\frac{1}{\sigma^2}\left(f_1\rho^{d_1},\; f_2\rho^{d_2},\; \ldots,\; f_{T-1}\rho^{d_{T-1}}\right).$$

For a sample $u^{(1)}, \ldots, u^{(m)}$ from the posterior $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$ evaluated at the previous parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)} = (\sigma^{(k-1)}, \rho^{(k-1)})$, the function $Q_2$ (cf. Section 2.3.1) now has form

$$Q_2(\psi) = -T\log\sigma - \frac{1}{2}\sum_{t=1}^{T-1}\log(1-\rho^{2d_t}) - \frac{1}{2\sigma^2}\,\frac{1}{m}\sum_{j=1}^m\left\{u_1^{(j)2} + \sum_{t=1}^{T-1}\frac{\left[u_{t+1}^{(j)} - \rho^{d_t}u_t^{(j)}\right]^2}{1-\rho^{2d_t}}\right\}, \qquad (3.4)$$

where $u_t^{(j)}$ is the $t$-th component of the $j$-th sampled vector $u^{(j)}$. In the M-step of an MCEM algorithm, we seek to maximize $Q_2$ with respect to $\sigma$ and $\rho$.

Alternatively, we can view the random effects $\{u_t\}$ as a latent first-order autoregressive process: random effect $u_{t+1}$ at time $t+1$ is related to its predecessor $u_t$ by the equation

$$u_{t+1} = \rho^{d_t}u_t + \epsilon_t, \quad \epsilon_t \sim N\big(0,\, \sigma^2(1-\rho^{2d_t})\big), \quad t = 1, \ldots, T-1, \qquad (3.5)$$

where $d_t$ again denotes the lag between the two successive time points associated with random effects $u_t$ and $u_{t+1}$. Assuming a $N(0, \sigma^2)$ distribution for the first random effect $u_1$, the joint random effects density for $u = (u_1, \ldots, u_T)$ enjoys a Markov property and has form

$$g(u; \psi) = g(u_1; \psi)\, g(u_2 \mid u_1; \psi) \cdots g(u_t \mid u_{t-1}; \psi) \cdots g(u_T \mid u_{T-1}; \psi) \qquad (3.6)$$
$$\propto \frac{1}{\sigma^T}\prod_{t=1}^{T-1}\left(1-\rho^{2d_t}\right)^{-1/2}\exp\left\{-\frac{u_1^2}{2\sigma^2}\right\}\prod_{t=1}^{T-1}\exp\left\{-\frac{\left[u_{t+1} - \rho^{d_t}u_t\right]^2}{2\sigma^2(1-\rho^{2d_t})}\right\},$$

leading, of course, to the same expression for $Q_2$ as given in (3.4). For two time indices $t$ and $t^*$ with $t < t^*$, the random process has autocorrelation function

$$\rho(t, t^*) = \mathrm{corr}(u_t, u_{t^*}) = \rho^{\sum_{k=t}^{t^*-1}d_k}.$$

Before we discuss maximization of $Q_2$ in this setting with possibly unequally spaced observation times, let us comment on the rather unusual parametrization of the latent random process (3.5). Chan and Ledolter (1995), in their development of time series models for equally spaced discrete events, use the more common form $u_{t+1} = \rho u_t + \epsilon_t$, $\epsilon_t \sim N(0, \sigma^2)$, $t = 1, 2, \ldots, T-1$. This leads to $\mathrm{var}(u_t) = \sigma^2/(1-\rho^2)$ for all $t$ if we assume a $N(0, \sigma^2/(1-\rho^2))$ distribution for $u_1$. (Chan and Ledolter (1995) condition on this first observation, which leads to closed form solutions for both $\sigma$ and $\rho$ in the case of equidistant observations.) Since it is common practice to let $\sigma^2$ describe the strength of association between observations in a common cluster sharing that random effect, our parametrization seems more natural. In Chan and Ledolter's parametrization, both the variance and the correlation parameter appear in the variance of the random effect. In the more general case of unequally spaced observations, the parametrization $\epsilon_t \sim N(0, \sigma^2)$ results in different variances of the random effects at different time points (i.e., $\mathrm{var}(u_t) = \sigma^2/(1-\rho^{2d_t})$).
Considering that the random effects represent unobservable phenomena common to all clusters, their variability should be about the same for all clusters, and not depend on the time difference between any two clusters. There is no reason to believe that the strength of association is larger in some clusters and weaker in others. Therefore, the parametrization we chose in (3.5) seems natural and appropriate.

For spatially correlated data, a relationship between random effects $u_i$ and $u_{i^*}$ is defined in terms of a distance function $d(x_i, x_{i^*})$ between covariates $x_i$ and $x_{i^*}$ associated with them. Each $u_i$ then represents a random effect for a spatial cluster, and correlated random effects are again natural to model spatial dependency among observations in different clusters. In the time setting, we had $d(x_i, x_{i^*}) = |x_i - x_{i^*}|$, with $x_i$ and $x_{i^*}$ representing time points. In 2-dimensional spatial settings, $x_i = (x_{i1}, x_{i2})'$ may represent midpoints in a Cartesian system, and $d(x_i, x_{i^*}) = \|x_i - x_{i^*}\|$ is the Euclidean distance function. The so-defined distance between clusters can be used in a model with correlations between random effects decaying as distances between cluster midpoints grow, e.g., $\mathrm{corr}(u_i, u_{i^*}) = \rho^{\|x_i - x_{i^*}\|}$. Models of this form are discussed in Zhang (2002) and, in a Bayesian framework, in Diggle et al. (1998). Sometimes only the information concerning whether clusters are adjacent to each other is used to form the correlation structure. In this case, $d(x_i, x_{i^*})$ is a binary function, indicating if clusters $i$ and $i^*$ are adjacent or not. Usually, this leads to an improper joint distribution for the random effects, as for instance in the analysis of the Scottish lip cancer data set presented in Breslow and Clayton (1993).

3.3.2 The M-step with Autoregressive Random Effects

Maximizing $Q_2$ with respect to $\sigma$ and $\rho$ is again equivalent to finding their MLEs for the sample $u^{(1)}, \ldots, u^{(m)}$, pretending they are independent. For fixed $\rho$, maximizing $Q_2$ with respect to $\sigma$ is possible in closed form. For notational convenience, denote the parts depending on $\rho$ and the generated sample $u^{(1)}, \ldots, u^{(m)}$ by

$$a_t(\rho, u) = \frac{1}{m}\sum_{j=1}^m\left[u_{t+1}^{(j)} - \rho^{d_t}u_t^{(j)}\right]^2 \quad \text{and} \quad b_t(\rho, u) = \frac{1}{m}\sum_{j=1}^m\left[u_{t+1}^{(j)} - \rho^{d_t}u_t^{(j)}\right]u_t^{(j)},$$

with derivatives with respect to $\rho$ (indicated by a prime) given by

$$a_t'(\rho, u) = -2d_t\rho^{d_t-1}b_t(\rho, u) \quad \text{and} \quad b_t'(\rho, u) = -d_t\rho^{d_t-1}\,\frac{1}{m}\sum_{j=1}^m u_t^{(j)2}.$$

The maximum likelihood estimator of $\sigma$ at iteration $k$ of the MCEM algorithm has form

$$\hat{\sigma}^{(k)} = \left\{\frac{1}{T}\left(\frac{1}{m}\sum_{j=1}^m u_1^{(j)2} + \sum_{t=1}^{T-1}\frac{a_t(\rho, u)}{1-\rho^{2d_t}}\right)\right\}^{1/2}.$$

For the special case of independent random effects ($\rho = 0$), this simplifies to the estimator $\left(\frac{1}{Tm}\sum_{j=1}^m u^{(j)\prime}u^{(j)}\right)^{1/2}$ presented at the end of Section 2.3.1. (The equal correlation structure cannot be represented as a special case of the autocorrelation structure.) No closed form solution exists for $\hat{\rho}^{(k)}$. Let

$$c_t(\rho) = \frac{d_t\rho^{d_t-1}}{1-\rho^{2d_t}} \quad \text{and} \quad e_t(\rho) = \frac{\rho}{d_t}\left[c_t(\rho)\right]^2$$

be terms depending on $\rho$ but not on $u$, with derivatives given by

$$c_t'(\rho) = \frac{d_t-1}{\rho}\,c_t(\rho) + 2\rho^{d_t}\left[c_t(\rho)\right]^2 \quad \text{and} \quad e_t'(\rho) = \frac{2\rho}{d_t}\,c_t(\rho)c_t'(\rho) + \frac{\left[c_t(\rho)\right]^2}{d_t},$$

respectively. Then, the first and second partial derivatives of $Q_2$ with respect to $\rho$ can be written as

$$\frac{\partial Q_2}{\partial \rho} = \sum_{t=1}^{T-1}\rho^{d_t}c_t(\rho) + \frac{1}{\sigma^2}\sum_{t=1}^{T-1}\left[c_t(\rho)b_t(\rho, u) - e_t(\rho)a_t(\rho, u)\right]$$

$$\frac{\partial^2 Q_2}{\partial \rho^2} = \sum_{t=1}^{T-1}\rho^{d_t-1}c_t(\rho)\left[2d_t - 1 + 2\rho^{d_t+1}c_t(\rho)\right] + \frac{1}{\sigma^2}\sum_{t=1}^{T-1}\left[c_t'(\rho)b_t(\rho, u) + c_t(\rho)b_t'(\rho, u) - e_t'(\rho)a_t(\rho, u) - e_t(\rho)a_t'(\rho, u)\right].$$

A Newton-Raphson algorithm with Hessian formed of the partial and mixed derivatives of $Q_2$ with respect to $\sigma$ and $\rho$ can be employed to find the maximum likelihood estimators at iteration $k$. However, since the range of $\rho$ is restricted, it might be advantageous to use the interval-halving method on $\frac{\partial Q_2}{\partial \rho}\big|_{\sigma = \hat{\sigma}^{(k)}}$ described in the previous section.
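The interval-halving step is identical in the equally correlated and the autoregressive M-step; only the score function changes. A minimal Python sketch (the function argument score is ours and stands for $f(\rho) = \partial Q_2/\partial\rho$ evaluated at $\sigma = \hat{\sigma}^{(k)}$):

    def interval_halving(score, rho1, rho2, tol=1e-6, max_iter=200):
        """Bisection search for the root of the profile score of rho.
        Assumes score(rho1) and score(rho2) have opposite signs on the
        admissible range for rho."""
        if score(rho1) > 0:              # orient so that score(rho1) < 0
            rho1, rho2 = rho2, rho1
        rho3 = 0.5 * (rho1 + rho2)
        for _ in range(max_iter):
            f3 = score(rho3)
            if abs(f3) < tol:            # stop when |f(rho3)| < epsilon
                break
            if f3 < 0:
                rho1 = rho3              # root lies in the upper half
            else:
                rho2 = rho3              # root lies in the lower half
            rho3 = 0.5 * (rho1 + rho2)   # new midpoint
        return rho3

In practice one would also verify that the second derivative is negative at the returned value, as noted above, before using it as $\hat{\rho}^{(k)}$.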
For the special case of equidistant time points $\{x_t\}$, the distances $d_t$ are equal for all $t$. Without loss of generality, we assume $d_t = 1$ for all $t$. Then the random effects follow the simple first-order autoregression $u_{t+1} = \rho u_t + \epsilon_t$, $t = 1, \ldots, T-1$, where we assume that $u_1 \sim N(0, \sigma^2)$ and $\epsilon_t \stackrel{iid}{\sim} N(0, \sigma^2(1-\rho^2))$. Certain simplifications occur. Let

$$a = \frac{1}{m}\sum_{j=1}^m u_1^{(j)2}, \quad b = \frac{1}{m}\sum_{j=1}^m\sum_{t=1}^{T-1}u_{t+1}^{(j)2}, \quad c = \frac{1}{m}\sum_{j=1}^m\sum_{t=1}^{T-1}u_t^{(j)}u_{t+1}^{(j)}, \quad d = \frac{1}{m}\sum_{j=1}^m\sum_{t=1}^{T-1}u_t^{(j)2}$$

denote constants depending on the generated samples only, but not on any parameters. Then the maximum likelihood estimator of $\sigma$ at iteration $k$ is

$$\hat{\sigma}^{(k)} = \left\{\frac{1}{T}\left(a + \frac{b - 2\rho c + \rho^2 d}{1-\rho^2}\right)\right\}^{1/2}$$

and, upon plugging it into the score equation $\frac{\partial Q_2}{\partial \rho}\big|_{\sigma = \hat{\sigma}^{(k)}} = 0$ for $\rho$, we obtain as the new score equation

$$\rho^3(T-1)(d-a) + \rho^2(2-T)c + \rho\left[(T-1)a - b - Td\right] + Tc = 0, \qquad (3.7)$$

a polynomial of order three. A result by Witt (1987) mentioned in McKeown and Johnson (1996) shows that (3.7) has three real solutions, only one of which lies in the interval $(-1, 1)$. This must be the maximum likelihood estimator $\hat{\rho}^{(k)}$ at iteration $k$ for the case of equidistant time points. Exact solutions to this third degree polynomial are given, for instance, in Abramowitz and Stegun (1964), and we only need to iterate between the two explicit solutions for $\sigma$ and $\rho$ till convergence.

In most of our applications the number of distinct observation times $T$ is rather large, and generating independent $T$-dimensional vectors $u$ from the posterior $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$ as required to approximate the E-step is difficult, even with the nice (prior) autoregressive relationship among the components of $u$. The next section discusses this issue. There are many other correlation structures which are not discussed here. For instance, the first-order autoregressive random process can be extended to a $p$-th order random process, and the formulas provided here and in the next section can be modified accordingly.

3.4 Sampling from the Posterior Distribution Via Gibbs Sampling

In Section 2.3.2 we gave a general description of how to obtain a random sample $u^{(1)}, \ldots, u^{(m)}$ from $h(u \mid y)$. (As in Section 2.3.2, we suppress the dependency on the parameter estimates from the previous iteration.) For high dimensional random effects distributions $g(u)$, generating independent draws from $h(u \mid y)$ can get very time consuming, if not impossible. The Gibbs sampler introduced in Section 2.3.2 offers an alternative because it involves sampling from lower dimensional (often univariate) conditional distributions of $h(u \mid y)$, which is considerably faster. However, it results in dependent samples from the posterior random effects distribution. The distributional structure of equally correlated or autoregressive random effects is very amenable to Gibbs sampling because of the simplifications that occur in the full univariate conditionals. Remember that the two-stage hierarchy and the conditional independence assumption in GLMMs imply that

$$h(u \mid y) \propto f(y \mid u)\, g(u) = \left[\prod_{t=1}^T f(y_t \mid u_t)\right]g(u),$$

the product over time of the conditional densities of observations sharing a common random effect, multiplied by the random effects density. In the following, let $u = (u_1, \ldots, u_T)$. We discuss the case of autoregressive random effects first.

3.4.1 A Gibbs Sampler for Autoregressive Random Effects

From representation (3.6) of the random effects distribution, we see that the full univariate conditional distribution of $u_t$ given the other $T-1$ components of $u$ only depends on its neighbors $u_{t-1}$ and $u_{t+1}$, i.e.,

$$g(u_t \mid u_1, \ldots, u_{t-1}, u_{t+1}, \ldots, u_T) \propto g(u_t \mid u_{t-1})\, g(u_{t+1} \mid u_t), \quad t = 2, \ldots, T-1.$$
At the beginning ($t = 1$) and the end ($t = T$) of the process, the conditional distributions of $u_1$ and $u_T$ only depend on the successor $u_2$ and the predecessor $u_{T-1}$, respectively. Furthermore, random effect $u_t$ only applies to observations $y_t = (y_{t1}, \ldots, y_{tn})$ at a common time point that share that random effect, but not to other observations. Hence, the full univariate conditionals of the posterior random effects distribution can be expressed as

$$h_1(u_1 \mid u_2, y_1) \propto f(y_1 \mid u_1)\, g_1(u_1 \mid u_2)$$
$$h_t(u_t \mid u_{t-1}, u_{t+1}, y_t) \propto f(y_t \mid u_t)\, g_t(u_t \mid u_{t-1}, u_{t+1}), \quad t = 2, \ldots, T-1$$
$$h_T(u_T \mid u_{T-1}, y_T) \propto f(y_T \mid u_T)\, g_T(u_T \mid u_{T-1}),$$

where, using standard multivariate normal theory results,

$$g_1(u_1 \mid u_2) = N\big(\rho^{d_1}u_2,\; \sigma^2[1-\rho^{2d_1}]\big)$$
$$g_t(u_t \mid u_{t-1}, u_{t+1}) = N\left(\frac{\rho^{d_{t-1}}[1-\rho^{2d_t}]u_{t-1} + \rho^{d_t}[1-\rho^{2d_{t-1}}]u_{t+1}}{1-\rho^{2(d_{t-1}+d_t)}},\; \frac{\sigma^2[1-\rho^{2d_{t-1}}][1-\rho^{2d_t}]}{1-\rho^{2(d_{t-1}+d_t)}}\right), \quad t = 2, \ldots, T-1$$
$$g_T(u_T \mid u_{T-1}) = N\big(\rho^{d_{T-1}}u_{T-1},\; \sigma^2[1-\rho^{2d_{T-1}}]\big).$$

For equally spaced data ($d_t = 1$ for all $t$) these distributions reduce to the ones derived in Chan and Ledolter (1995). Direct sampling from the full univariate conditionals $h_t$ is not possible. However, it is straightforward to implement an accept-reject algorithm. In fact, the accept-reject algorithm as outlined in Section 2.3.2 applies directly with target density $h_t$ and candidate density $g_t$, since $h_t$ has the form of an exponential family density multiplied by a normal density. In Section 2.3.2, we discussed the accept-reject algorithm for generating an entire vector $u$ from the posterior random effects distribution $h(u \mid y)$ with candidate density $g(u)$ and mentioned that acceptance probabilities are virtually zero for large dimensional $u$'s. With the Gibbs sampler, we have reduced the problem to univariate sampling of the $t$-th component $u_t$ from the univariate target density $h_t$ with univariate candidate density $g_t$. By selecting $M_t = L(y_t)$, where $L(y_t)$ is the saturated likelihood for the observations at time point $t$, we ensure that the target density satisfies $h_t \leq M_t g_t$. Given $u^{(j-1)} = (u_1^{(j-1)}, \ldots, u_T^{(j-1)})$ from the previous iteration, the Gibbs sampler with accept-reject sampling from the full univariate conditionals consists of

1. generate the first component $u_1^{(j)} \sim h_1(u_1 \mid u_2^{(j-1)}, y_1)$ by
   (a) generation step: generate $u_1$ from the candidate density $g_1(u_1 \mid u_2^{(j-1)})$; generate $U \sim$ Uniform$[0, 1]$;
   (b) acceptance step: set $u_1^{(j)} = u_1$ if $U \leq f(y_1 \mid u_1)/L(y_1)$; return to (a) otherwise;
2. for $t = 2, \ldots, T-1$: generate component $u_t^{(j)} \sim h_t(u_t \mid u_{t-1}^{(j)}, u_{t+1}^{(j-1)}, y_t)$ by
   (a) generation step: generate $u_t$ from the candidate density $g_t(u_t \mid u_{t-1}^{(j)}, u_{t+1}^{(j-1)})$; generate $U \sim$ Uniform$[0, 1]$;
   (b) acceptance step: set $u_t^{(j)} = u_t$ if $U \leq f(y_t \mid u_t)/L(y_t)$; return to (a) otherwise;
3. generate the last component $u_T^{(j)} \sim h_T(u_T \mid u_{T-1}^{(j)}, y_T)$ by
   (a) generation step: generate $u_T$ from the candidate density $g_T(u_T \mid u_{T-1}^{(j)})$; generate $U \sim$ Uniform$[0, 1]$;
   (b) acceptance step: set $u_T^{(j)} = u_T$ if $U \leq f(y_T \mid u_T)/L(y_T)$; return to (a) otherwise;
4. set $u^{(j)} = (u_1^{(j)}, \ldots, u_T^{(j)})$.

The so-obtained sample $u^{(1)}, \ldots, u^{(m)}$ (after allowing for burn-in) forms a dependent sample which we use to approximate the E-step in the $k$-th iteration of the MCEM algorithm. Note that all densities are evaluated at the current parameter estimates, i.e., $\beta^{(k-1)}$ for $f(y_t \mid u_t)$ and $\psi^{(k-1)} = (\sigma^{(k-1)}, \rho^{(k-1)})$ for $g_t(u_t \mid u_{t-1}, u_{t+1})$.

3.4.2 A Gibbs Sampler for Equally Correlated Random Effects

Similar results as for the autoregressive correlation structure can be derived for the case of equally correlated random effects. In this case, the full univariate conditional of $u_t$ depends on all other $T-1$ components of $u$, as can be seen from (3.3). Let $u_{-t}$ denote the vector $u$ with the $t$-th component deleted.
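Both the sampler above and the equally correlated sampler developed below use the identical univariate accept-reject update; only the candidate density $g_t$ changes. A minimal sketch of one such update for a binomial response (the function and variable names are ours, the binomial form of $f$ is an assumption matching the GSS example, and mu_t and tau2_t stand for the mean and variance of the normal candidate $g_t$):

    import numpy as np
    from scipy.stats import binom

    rng = np.random.default_rng(1)

    def accept_reject_update(y_t, n_t, x_beta, mu_t, tau2_t):
        """One draw of u_t from h_t via accept-reject with candidate
        g_t = N(mu_t, tau2_t) and envelope constant M_t = L(y_t), the
        saturated binomial likelihood at time t."""
        L_sat = binom.pmf(y_t, n_t, y_t / n_t) if 0 < y_t < n_t else 1.0
        while True:
            u = rng.normal(mu_t, np.sqrt(tau2_t))    # candidate from g_t
            p = 1.0 / (1.0 + np.exp(-(x_beta + u)))  # conditional success prob.
            if rng.uniform() <= binom.pmf(y_t, n_t, p) / L_sat:
                return u                             # accepted draw from h_t

Since $f(y_t \mid u_t) \leq L(y_t)$ by definition of the saturated likelihood, the acceptance ratio never exceeds one.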
Using similar notation as in the previous section, the full univariate conditionals of $h(u \mid y)$ are given by

$$h_t(u_t \mid u_{-t}, y_t) \propto f(y_t \mid u_t)\, g_t(u_t \mid u_{-t}), \quad t = 1, \ldots, T,$$

where, with standard results from multivariate normal theory, $g_t(u_t \mid u_{-t})$ is a $N(\mu_t, \tau_t)$ density with

$$\mu_t = \frac{\rho}{1 + (T-2)\rho}\sum_{k \neq t}u_k \quad \text{and} \quad \tau_t = \sigma^2\left[1 - \frac{(T-1)\rho^2}{1 + (T-2)\rho}\right] = \sigma^2\,\frac{(1-\rho)[1 + (T-1)\rho]}{1 + (T-2)\rho}.$$

Given the vector $u^{(j-1)}$ from the previous iteration, the Gibbs sampler with accept-reject sampling from the full univariate conditionals has form

1. for $t = 1, \ldots, T$, generate component $u_t^{(j)} \sim h_t(u_t \mid u_1^{(j)}, \ldots, u_{t-1}^{(j)}, u_{t+1}^{(j-1)}, \ldots, u_T^{(j-1)}, y_t)$ by
   (a) generation step: generate $u_t$ from the candidate density $g_t(u_t \mid u_1^{(j)}, \ldots, u_{t-1}^{(j)}, u_{t+1}^{(j-1)}, \ldots, u_T^{(j-1)})$; generate $U \sim$ Uniform$[0, 1]$;
   (b) acceptance step: set $u_t^{(j)} = u_t$ if $U \leq f(y_t \mid u_t)/L(y_t)$; return to (a) otherwise;
2. set $u^{(j)} = (u_1^{(j)}, \ldots, u_T^{(j)})$.

This leads to a sample $u^{(1)}, \ldots, u^{(m)}$ from the posterior distribution used in the E- and M-steps of the MCEM algorithm at iteration $k$. Note again that all distributions are evaluated at their current parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)} = (\hat{\sigma}^{(k-1)}, \hat{\rho}^{(k-1)})$.

3.5 A Simulation Study

We conducted a simulation study to evaluate the performance of the maximum likelihood estimation algorithm, to evaluate the bias in the estimation of covariate effects and variance components, and to compare predicted random effects to the ones used in the simulation of the data. To this end, we generated a time series $y_1, \ldots, y_T$ of $T = 400$ binary observations according to the model

$$\mathrm{logit}(\pi_t(u_t)) = \alpha + \beta x_t + u_t \qquad (3.8)$$

for the conditional log odds of success at time $t$, $t = 1, \ldots, 400$. For the simulation, we chose $\alpha = 1$ and $\beta = 1$, where $\beta$ is the regression coefficient for independent standard normal covariates $x_t \stackrel{iid}{\sim} N(0, 1)$. The random effects $u_1, \ldots, u_T$ are thought to arise from an unobserved latent autoregressive process $u_{t+1} = \rho u_t + \epsilon_t$, where the $\epsilon_t$ are i.i.d. $N(0, \sigma^2(1-\rho^2))$, i.e., the $u_t$'s have standard deviation $\sigma$ and lag $t$ correlation $\rho^t$. For the simulation of these autoregressive random effects, we used $\sigma = 2$ and $\rho = 0.8$. The resulting sample autocorrelation function of the realized random effects is pictured in Figure 3-4. The standard deviation and lag 1 correlation of the 400 realized values of $u_1, \ldots, u_T$ are equal to 1.95 and 0.77. Note that conditional on the realized values of the $u_t$'s, the $y_t$'s are generated independently with log odds given by (3.8).

The MCEM algorithm as described in Sections 2.3 and 3.3 for a logistic GLMM with autocorrelated random effects yielded the following maximum likelihood estimates for the fixed effects and variance components: $\hat{\alpha} = 0.94$ (0.39) and $\hat{\beta} = 1.03$ (0.22), as compared to the true values 1 and 1, and $\hat{\sigma} = 2.25$ (0.44) and $\hat{\rho} = 0.74$ (0.06), as compared to the realized values 1.95 and 0.77. The algorithm converged after 71 iterations with a starting Monte Carlo sample size of 50 and a final Monte Carlo sample size of only 880, although estimated standard errors are based on a Monte Carlo sample size of 20,000. Convergence parameters were set to $\epsilon_1 = 0.003$, $c = 3$, $\epsilon_2 = 0.005$, $\epsilon_3 = 0.001$, $\alpha = 1.03$ and $q = 1.05$ (see Section 2.3.3). Regular GLM estimates were used as starting values for $\alpha$ and $\beta$, and starting values for $\sigma$ and $\rho$ were set to 1.5 and 0, respectively. As will be described in Section 5.4.2, we estimated random effects through a Monte Carlo approximation of their posterior mean: $\hat{u} = E[u \mid y]$. The scatter plot in Figure 3-3 shows good agreement in a comparison of the realized random effects $u_1, \ldots, u_T$ from the simulation and the estimated random effects $\hat{u}_1, \ldots, \hat{u}_T$ from the model.
Note though that the standard deviation of the estimated random effects is equal to 1.60 (as compared to the realized standard deviation of 1.95), showing that the estimated random effects are less variable; this is the general shrinkage effect (compare the scales on the x and y axes of Figure 3-3) brought along by using posterior mean estimates.

[Figure 3-3: Realized (simulated) random effects $u_1, \ldots, u_T$ versus estimated random effects $\hat{u}_1, \ldots, \hat{u}_T$.]

Also, a comparison of the autocorrelation and partial autocorrelation functions of the realized and estimated random effects in Figure 3-4 reveals some differences, due to the fact that the estimated random effects are based on the posterior distribution of $u \mid y$. Therefore, estimated random effects are only of limited use in checking assumptions about the true random effects. Only when their behavior is grossly unexpected compared to the assumed structure of the underlying latent random process may they serve as an indication of model inappropriateness. Related remarks are given by Verbeke and Molenberghs (2000), who generate data in a linear mixed model assuming a mixture of two normally distributed random effects, resulting in a bimodal distribution. There, also, the plot of posterior mean estimates of the random effects from a model that misspecified the random effects distribution does not reveal that anything went wrong.

[Figure 3-4: Comparing simulated and estimated random effects. Sample autocorrelation (first row) and partial autocorrelation (second row) functions for realized (simulated) random effects $u_1, \ldots, u_T$ (first column) and estimated random effects $\hat{u}_1, \ldots, \hat{u}_T$ (second column).]

We repeated the above simulation 100 times, using the same specifications, starting values and convergence criteria as mentioned above. Each of the 100 generated binary time series of length 400 was fit using the MCEM algorithm. Table 3-1 shows the average (over the 100 generated time series) of the fixed parameter and variance component estimates and their average estimated standard errors. On average, the GLMM estimates of the fixed effects $\alpha$ and $\beta$ and the variance components are very close to the true parameters, although the true lag 1 correlation of the random effects is underestimated by 6.3%. Table 3-1 also displays, in parentheses, the standard deviations of all estimated parameters over the 100 replications. Comparing these to the theoretical estimates of the asymptotic standard errors, we see good agreement. This suggests that the procedure for finding standard errors that we described and implemented (via Louis's (1982) formula) in our MCEM algorithm works fine. In 5 (5%) out of the 100 simulations, the approximation of the asymptotic covariance matrix by Monte Carlo methods resulted in a negative definite matrix. For these simulations, a larger Monte Carlo sample after convergence of the MCEM algorithm (the default was 20,000) might be necessary. It is also interesting to note that of the 95 simulations with positive definite covariance matrix, 6 (6.3%) resulted in a nonsignificant (based on a 5% level Wald test) estimate of the regression coefficient $\beta$ under the GLMM with autoregressive random effects, while none was declared nonsignificant with the GLM approach. Estimates and standard errors for a corresponding GLM fit are also provided in Table 3-1.
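For reference, the data-generating mechanism of this simulation study is easy to reproduce. A minimal sketch matching the specifications above (the seed is arbitrary and hypothetical):

    import numpy as np

    rng = np.random.default_rng(2004)
    T, alpha, beta, sigma, rho = 400, 1.0, 1.0, 2.0, 0.8

    x = rng.normal(size=T)                 # covariates x_t iid N(0, 1)
    u = np.empty(T)
    u[0] = rng.normal(0.0, sigma)          # u_1 ~ N(0, sigma^2)
    for t in range(T - 1):                 # latent AR(1): u_{t+1} = rho*u_t + eps_t
        u[t + 1] = rho * u[t] + rng.normal(0.0, sigma * np.sqrt(1 - rho**2))

    eta = alpha + beta * x + u             # conditional logit, model (3.8)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

Conditional on the realized $u_t$'s, the $y_t$'s are drawn independently, exactly as assumed by the GLMM.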
The average Monte Carlo sample size at the final iteration of the MCEM algorithm was 1200, although highly dispersed, ranging from 210 to 21,000. The average computation time (on a mobile Pentium III, 600 MHz processor with 256MB RAM) to convergence, including estimating the covariance matrix, was 73 minutes.

We ran two other simulation studies, now with a shorter length of only $T = 100$ observations, and a true lag 1 correlation of 0.6 and 0.8, respectively. All other parameters remained unchanged. These results are also summarized in Table 3-1. Again, we observe that the estimated parameters are very close to the true ones, but on average the correlation was underestimated by 15% and 6.3%, respectively. However, the sampling errors of the correlation parameters (shown in parentheses in Table 3-1) were large enough to include the true values.

Table 3-1: A simulation study for a logistic GLMM with autoregressive random effects.

                  alpha    beta    sigma   rho    s.e.(alpha)  s.e.(beta)  s.e.(sigma)  s.e.(rho)
 True:             1        1       2      0.8
 T = 400
 GLM:             0.64     0.63                      0.11        0.12
                 (0.20)   (0.14)                    (0.01)      (0.01)
 GLMM:            1.07     1.02    2.08    0.75      0.37        0.26        0.61        0.10
                 (0.32)   (0.20)  (0.25)  (0.06)    (0.12)      (0.23)      (0.40)      (0.05)
 True:             1        1       2      0.6
 T = 100
 GLM:             0.69     0.70                      0.23        0.25
                 (0.38)   (0.27)                    (0.02)      (0.03)
 GLMM:            1.09     1.07    1.99    0.51      0.58        0.47        1.35        0.26
                 (0.58)   (0.37)  (0.39)  (0.20)    (0.33)      (0.27)      (1.15)      (0.18)
 True:             1        1       2      0.8
 T = 100
 GLM:             0.65     0.61                      0.22        0.24
                 (0.21)   (0.26)                    (0.01)      (0.03)
 GLMM:            1.04     0.96    2.00    0.75      0.42        0.51        1.04        0.16
                 (0.29)   (0.34)  (0.53)  (0.13)    (0.32)      (0.99)      (1.04)      (0.13)

Average and standard deviation (in parentheses) of fixed effects, variance components and their standard error estimates from a GLM and a GLMM with latent AR(1) process. The two models were fitted to each of 100 generated binary time series of length $T = 400$ and $T = 100$.

Since our methods are general enough to handle unequally spaced data, we repeated the first simulation with a time series of $T = 400$ binary observations, but now randomly deleted 10% of the observations to create random gaps in the series. We left all parameters and the model for the conditional odds unchanged, except that we now assume that the random effects follow the latent autoregressive process $u_{t+1} = \rho^{d_t}u_t + \epsilon_t$, where the $\epsilon_t$ are i.i.d. $N(0, \sigma^2(1-\rho^{2d_t}))$ and $d_t$ is the difference (in the units of measurement) between the time points associated with the observations at times $t$ and $t+1$. For example, the first series we generated had 1 gap of length three (i.e., $d_t = 4$ for one $t$), 4 gaps of length two (i.e., $d_t = 3$ for 4 $t$'s) and 29 gaps of length one (i.e., $d_t = 2$ for 29 $t$'s). For all other $t$'s, $d_t = 1$, i.e., they are successive observations and the difference between two of them is one unit of measurement. Simulation results are shown in Table 3-2 and reveal that our proposed methods and algorithm also work well for an unequally spaced binary time series.

Table 3-2: Simulation study for modeling an unequally spaced binary time series.

                  alpha    beta    sigma   rho    s.e.(alpha)  s.e.(beta)  s.e.(sigma)  s.e.(rho)
 True:             1        1       2      0.8
 T = 360, unequally spaced
 GLM:             0.61     0.62                      0.12        0.12
                 (0.19)   (0.11)                    (0.00)      (0.01)
 GLMM:            1.03     1.00    2.07    0.75      0.38        0.28        0.71        0.11
                 (0.29)   (0.16)  (0.25)  (0.06)    (0.19)      (0.28)      (0.80)      (0.18)

Average and standard deviation (in parentheses) of fixed effects, variance components and their standard error estimates from a GLM and a GLMM with latent autoregressive random effects accounting for unequally spaced observations. The two models were fitted to each of 100 generated binary time series of length $T = 360$, with random gaps of random length between observations.
All true parameters are included in confidence intervals based on the average of the estimated parameters from the 100 replicated series and its standard deviation (shown in parentheses in Table 3-2).

CHAPTER 4
MODEL PROPERTIES FOR NORMAL, POISSON AND BINOMIAL OBSERVATIONS

So far we have discussed models for discrete-valued time series data in a very broad manner. In Chapter 2, we developed the likelihood for our models based on generic distributions $f(y \mid u)$ for observations $y$ and $g(u)$ for random effects $u$, and presented an algorithm for finding maximum likelihood estimates. Chapter 3 looked at two special cases of random effects distributions useful for describing temporal or spatial dependencies. In this chapter we make specific distributional assumptions about the observations and develop some theory underlying the models we propose. We will pay special attention to data in the form of a single (sometimes considered generic) time series $Y = (Y_1, \ldots, Y_T)$ and derive marginal properties implied by the conditional model formulation. Multiple, independent time series $Y_1, \ldots, Y_n$ can result from replication of the original time series or from stratification of the sampled population, such as in the example about homosexual relationships. All derivations given below for a generic time series $Y$ still hold for the $i$-th series $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iT})$, provided the same latent process $\{u_t\}$ is assumed to underlie each one of them.

An important characteristic of any time series model is its implied serial dependency structure. In the case of normal theory time series models, this is specified by the autocorrelation function. In Section 4.1 we derive the implied marginal autocorrelation function for GLMMs with normal random components and either an equal correlation or autoregressive assumption for the random effects. With these assumptions, our models are special cases of linear mixed models discussed, for instance, in Diggle et al. (2002). In Sections 4.2 and 4.3 we explore marginal properties of GLMMs with Poisson and binomial random components that are induced by assuming equally correlated or autoregressive random effects. In Chapter 5, these model properties, such as the implied autocorrelation function, are then compared to empirical counterparts based on the observed data to evaluate the proposed model. Section 2.1 mentioned that parameters in GLMMs have a conditional interpretation, controlling for the random effects. Correlated random effects vary over time, and parameter interpretation is different from having just one common level of a random effect, as in many standard random intercepts GLMMs. For each of the models presented here, we discuss parameter interpretation in a separate section.

4.1 Analysis for a Time Series of Normal Observations

Suppose that conditional on time specific normal random effects $\{u_t\}$, observations $\{Y_t\}$ are independent $N(\mu_t + u_t, \tau^2)$. The marginal likelihood for this model is tractable, because marginally the joint distribution of $\{Y_t\}$ is multivariate normal with mean $\mu = (\mu_1, \ldots, \mu_T)'$ and covariance matrix $\Sigma_u + \tau^2 I$, where $\Sigma_u$ is the covariance matrix of the joint distribution of $\{u_t\}$. With the usual assumption that $\mathrm{var}(u_t) = \sigma^2$, the marginal variance of $Y_t$ is given by $\mathrm{var}(Y_t) = \tau^2 + \sigma^2$, and the marginal correlation function $\rho(t, t^*)$ for the case of equally correlated random effects (cf. Section 3.2) has form

$$\rho(t, t^*) = \mathrm{corr}(Y_t, Y_{t^*}) = \frac{\sigma^2\rho}{\tau^2 + \sigma^2}, \qquad (4.1)$$

while for the case of autocorrelated random effects (cf. Section 3.3), it has form

$$\rho(t, t^*) = \mathrm{corr}(Y_t, Y_{t^*}) = \frac{\sigma^2\rho^{\sum_{k=t}^{t^*-1}d_k}}{\tau^2 + \sigma^2}. \qquad (4.2)$$

If the distances between time points are equal, then (4.2) is more conveniently written in terms of the lag $h$ between observations as

$$\rho(h) = \mathrm{corr}(Y_t, Y_{t+h}) = \frac{\sigma^2\rho^h}{\tau^2 + \sigma^2}.$$

For both cases, note that the autocorrelations (4.1) and (4.2) are smaller than the corresponding ones assumed for the underlying latent process $\{u_t\}$ by a factor of $\sigma^2/(\tau^2 + \sigma^2)$. For equally correlated random effects, the marginal covariance matrix has form $\tau^2 I + \sigma^2[(1-\rho)I + \rho J]$, implying equal marginal correlations between any two members $Y_t$ and $Y_{t^*}$ of $\{Y_t\}$. (This can also be seen from (4.1), where the autocorrelations do not depend on $t$ or $t^*$.) Diggle et al. (2002, Sec. 5.2.2) call this a model with serial correlation plus measurement error.

Similar properties can be observed in the case of autocorrelated random effects: the basic structure of correlations decaying in absolute value with increasing distances between observation times (as measured by $\sum d_k$ or $h$) is preserved marginally. However, the first-order Markov property of the underlying autoregressive process is not preserved in the marginal distribution of $\{Y_t\}$, which can be proved by calculating conditional distributions. For instance, for three ($T = 3$) equidistant time points, the conditional mean of $Y_3$ given $Y_1 = y_1$ and $Y_2 = y_2$ is equal to

$$E[Y_3 \mid y_1, y_2] = \mu_3 + \frac{\sigma^2\rho}{(\tau^2 + \sigma^2)^2 - \sigma^4\rho^2}\left(\rho\tau^2(y_1 - \mu_1) + \left[\tau^2 + \sigma^2(1-\rho^2)\right](y_2 - \mu_2)\right)$$

and depends on $y_1$. It should be noted that in the case of independent random effects with $\Sigma_u = \sigma^2 I$, marginally the $Y_t$'s are also independent, but with overdispersed variances $\tau^2 + \sigma^2$ relative to their conditional distribution. This case can be seen as a special case of the equally correlated model and the autoregressive model when $\rho = 0$.

The traditional assumption in random intercepts models is to assume a common random effect $u = u_t$ for all time points $t$; i.e., conditional on a $N(0, \sigma^2)$ random effect $u$, $Y_t$ is $N(\mu_t + u, \tau^2)$ for $t = 1, \ldots, T$. For this case, the marginal covariance matrix has form $\tau^2 I + \sigma^2 J$. This can be derived directly or inferred from the marginal correlation expressions (4.1) and (4.2) by setting $\rho = 1$, implying perfect correlation among the $\{u_t\}$. Hence, the random intercepts model is a special case of the equally correlated or autoregressive model when $\rho = 1$. It implies a constant (exchangeable) marginal correlation of $\sigma^2/(\tau^2 + \sigma^2)$ between any two observations $Y_t$ and $Y_{t^*}$.

4.1.1 Analysis via Linear Mixed Models

In a GLMM, we try to provide some structure for the unknown mean component $\mu_t$ by using covariates $x_t$. Let $x_t'\beta$ be a linear predictor for $\mu_t$, with $\beta$ denoting a fixed effects parameter vector for the covariates $x_t$. Using an identity link, the series $\{Y_t\}$ then follows a GLMM with conditional mean function $E[Y_t \mid u_t] = x_t'\beta + u_t$. The model can be written as $Y_t = x_t'\beta + u_t + \epsilon_t$, where $\epsilon_t \stackrel{iid}{\sim} N(0, \tau^2)$, independent of $u_t$. Then, the models discussed here are special cases of mixed effects models (Verbeke and Molenberghs, 2000) with general matrix form

$$Y = X\beta + Zu + \epsilon.$$

In our case, $Y = (Y_1, \ldots, Y_T)'$ is the time series vector and $X = (x_1, \ldots, x_T)'$ is the overall design matrix with associated parameter $\beta$. The design matrix $Z$ for the random effects $u' = (u_1, \ldots, u_T)$ simplifies to the identity matrix $I_T$. The distributional assumption on the random effects is $u \sim N(0, \Sigma_u)$, and they are independent of the $N(0, \tau^2 I)$-distributed errors $\epsilon$.
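Under these assumptions, the implied marginal autocorrelation structure can be computed directly. A small numpy sketch of (4.1) and (4.2) for equidistant time points (the function name is ours):

    import numpy as np

    def marginal_acf(tau2, sigma2, rho, max_lag, structure="ar1"):
        """Marginal autocorrelation of Y at lags 0..max_lag implied by
        the normal GLMM, per (4.1) ('equal') and (4.2) ('ar1')."""
        h = np.arange(max_lag + 1)
        if structure == "ar1":
            corr_u = rho ** h                          # corr(u_t, u_{t+h}) = rho^h
        else:
            corr_u = np.full(max_lag + 1, rho)         # constant latent correlation
        acf = sigma2 * corr_u / (tau2 + sigma2)        # attenuation factor
        acf[0] = 1.0                                   # lag 0 correlation is 1
        return acf

    print(marginal_acf(tau2=1.0, sigma2=4.0, rho=0.8, max_lag=5))

With $\tau^2 = 1$ and $\sigma^2 = 4$, the lag 1 marginal correlation is $0.8 \cdot 4/5 = 0.64$, illustrating the attenuation of the latent correlation by the factor $\sigma^2/(\tau^2 + \sigma^2)$ noted above.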
Exploiting this relationship, software for fitting models of this kind (i.e., correlated normal data with structured covariance matrix of form $\mathrm{var}(Y) = Z\Sigma_u Z' + \tau^2 I$) is readily available, for instance in the form of the SAS procedure proc mixed, where the equal correlation structure and the autoregressive structure are only two out of many possible choices for the covariance matrix $\Sigma_u$ of the random effects distribution. Mixed effects models are very popular for the regression analysis of shorter time series, like growth curve models or data from longitudinal studies. In Section 5.1, we illustrate an application by analyzing the motivating example of Section 3.1 about attitudes towards homosexual relationships, based on a normal approximation to the log odds.

4.1.2 Parameter Interpretation

Parameters in normal time series models retain their interpretation when averaging over the random effects distribution. The interpretation of $\beta$ as the change in the mean for a change in the covariates is valid conditional on the random effects and also marginally. The random effects parameters only contribute to the variance-covariance structure of the marginal distribution, inducing overdispersion and correlation relative to the conditional assumptions.

4.2 Analysis for a Time Series of Counts

Suppose now that conditional on time specific normal random effects $\{u_t\}$, observations $\{Y_t\}$ are independent counts, which we model as Poisson random variables with mean $\mu_t$. Using a log link, explanatory variables $x_t$ and correlated random effects $\{u_t\}$, we specify the conditional mean structure of a Poisson GLMM as

$$\log(\mu_t) = x_t'\beta + u_t, \quad t = 1, \ldots, T. \qquad (4.3)$$

The correlation in the random effects allows the log-means to be correlated over time or space. The marginal likelihood corresponding to this model is given by

$$L(\beta, \psi; y) \propto \int_{R^T}\prod_{t=1}^T \mu_t^{y_t}\exp\{-\mu_t\}\, g(u; \psi)\, du = \int_{R^T}\prod_{t=1}^T\exp\left[y_t(x_t'\beta + u_t) - \exp\{x_t'\beta + u_t\}\right]g(u; \psi)\, du,$$

where $g(u; \psi)$ is one of the random effects distributions of Chapter 3. In that case, the integral is not tractable, and numerical methods such as the MCEM algorithm of Section 2.3 must be used to find maximum likelihood estimates of $\beta$ and $\psi$. For this, the function $Q_1$ defined in (2.16) has form

$$Q_1(\beta) = \frac{1}{m}\sum_{j=1}^m\sum_{t=1}^T\left[y_t(x_t'\beta + u_t^{(j)}) - \exp\{x_t'\beta + u_t^{(j)}\}\right],$$

up to a constant not involving $\beta$, where $u_t^{(j)}$ is the $t$-th element of the $j$-th generated sample $u^{(j)}$ from the posterior distribution $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$. Note that here we discuss only the case of a generic time series $\{Y_t\}$ with no replication, hence $n = 1$ (i.e., index $i$ is redundant) and $n_1 = T$ in the general form presented in (2.16). If replications are available, or in the case where two time series differ in the fixed effects part but not in the random effects (e.g., have the same underlying latent process), then one simply needs to include the sum over the replicates as indicated in (2.16). Choosing one of the correlated random effects distributions of Chapter 3, the Gibbs sampling algorithms developed in Sections 3.4.1 and 3.4.2 can be used to generate the sample from $h(u \mid y)$, with $f(y_t \mid u_t)$ having the form of a Poisson density with mean $\mu_t$.

4.2.1 Marginal Model Implied by the Poisson GLMM

As with the normal GLMMs before, marginal first and second moments can be obtained by integrating over the random effects distribution, although here the complete marginal distribution of $Y_t$ is not tractable as it is in the normal case. The random effects appearing in model (4.3) imply that the conditional log-means $\{\log(\mu_t)\}$ are random quantities.
Assuming that the random effects $\{u_t\}$ are normal with zero mean and variance $\mathrm{var}(u_t) = \sigma^2$, the log-means have expectations $\{x_t'\beta\}$ and variance $\sigma^2$. For two distinct time points $t$ and $t^*$, their correlation under an independence, equal correlation or autocorrelation assumption on the random effects is given by $0$, $\rho$ or $\rho^{\sum_{k=t}^{t^*-1}d_k}$, respectively. (Remember that $d_k$ denoted the time difference between two successive observations $y_k$ and $y_{k+1}$.) On the original scale, the means have expectation, variance and correlation given by

$$E[\mu_t] = \exp\{x_t'\beta + \sigma^2/2\}$$
$$\mathrm{var}(\mu_t) = \exp\{2(x_t'\beta + \sigma^2/2)\}\left(e^{\sigma^2} - 1\right)$$
$$\mathrm{corr}(\mu_t, \mu_{t^*}) = \frac{e^{\mathrm{cov}(u_t, u_{t^*})} - 1}{e^{\sigma^2} - 1}.$$

Plugging in $\mathrm{cov}(u_t, u_{t^*}) = 0$, $\sigma^2\rho$ or $\sigma^2\rho^{\sum_{k=t}^{t^*-1}d_k}$ yields the marginal correlations among the means when assuming independent, equally correlated or autoregressive random effects, respectively.

4.2.1.1 Marginal distribution of $Y_t$

Now let's turn to the marginal distribution of $Y_t$ itself, for which we can only derive moments. The marginal mean and variance of $Y_t$ are given by

$$E[Y_t] = E[\mu_t] = \exp\{x_t'\beta + \sigma^2/2\} \qquad (4.4)$$
$$\mathrm{var}(Y_t) = E[\mu_t] + \mathrm{var}(\mu_t) = E[Y_t]\left[1 + E[Y_t]\left(e^{\sigma^2} - 1\right)\right].$$

Hence, the log of the marginal mean still follows a linear model with fixed effects parameters $\beta$, but with an additional offset $\sigma^2/2$ to the intercept term. (This is not particular to the Poisson assumption, but is true for any loglinear random effects model of form (4.3) with more general random effects structure $z_t'u_t$; see Problem 13.42 in Agresti, 2002.) The marginal distribution of $Y_t$ is not Poisson, since the variance exceeds the mean by a factor of $[1 + E[Y_t](e^{\sigma^2} - 1)]$. The marginal variance is a quadratic function of the marginal mean. For two distinct time points $t$ and $t^*$, the marginal covariance between observations $Y_t$ and $Y_{t^*}$ is given by

$$\mathrm{cov}(Y_t, Y_{t^*}) = \mathrm{cov}(\mu_t, \mu_{t^*}) = E[Y_t]E[Y_{t^*}]\left(e^{\mathrm{cov}(u_t, u_{t^*})} - 1\right) \qquad (4.5)$$
$$= \exp\{(x_t + x_{t^*})'\beta + \sigma^2\}\left(e^{\mathrm{cov}(u_t, u_{t^*})} - 1\right).$$
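These moment formulas are straightforward to evaluate numerically. A minimal sketch of (4.4) and (4.5) (function and variable names are ours; eta_t and eta_s stand for the linear predictors $x_t'\beta$ and $x_{t^*}'\beta$):

    import numpy as np

    def poisson_glmm_moments(eta_t, eta_s, sigma2, cov_u):
        """Marginal mean and variance at time t, and covariance with a
        second time point, per (4.4) and (4.5)."""
        mean_t = np.exp(eta_t + sigma2 / 2.0)                     # E[Y_t]
        var_t = mean_t * (1.0 + mean_t * (np.exp(sigma2) - 1.0))  # var(Y_t)
        mean_s = np.exp(eta_s + sigma2 / 2.0)
        cov_ts = mean_t * mean_s * (np.exp(cov_u) - 1.0)          # cov(Y_t, Y_t*)
        return mean_t, var_t, cov_ts

    # Autoregressive random effects at lag h: cov(u_t, u_t*) = sigma2 * rho**h
    print(poisson_glmm_moments(eta_t=0.5, eta_s=0.3, sigma2=1.0,
                               cov_u=1.0 * 0.8**2))

Negative $\mathrm{cov}(u_t, u_{t^*})$ gives $e^{\mathrm{cov}(u_t, u_{t^*})} - 1 < 0$, which is how the model accommodates negative marginal covariances.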