Models for repeated measures of a multivariate response


Material Information

Title: Models for repeated measures of a multivariate response
Author: Gueorguieva, Ralitza, 1971-
Physical Description: viii, 171 leaves : ill. ; 29 cm.
Subjects / Keywords: Statistics thesis, Ph.D. (lcsh); Dissertations, Academic -- Statistics -- UF (lcsh); bibliography (marcgt); non-fiction (marcgt)
Notes: Thesis (Ph.D.)--University of Florida, 1999. Includes bibliographical references (leaves 164-170).
Statement of Responsibility: by Ralitza Gueorguieva.

Record Information

Source Institution: University of Florida
Rights Management: All applicable rights reserved by the source institution and holding location.
Resource Identifiers: aleph - 021558080; oclc - 43628298

Full Text







Copyright 1999


Ralitza Gueorguieva

To my family


I would like to express my deepest gratitude to Dr. Alan Agresti for serving as my dissertation advisor. Without his guidance, constant encouragement and valuable advice this work would not have been completed. My appreciation is extended to Drs. James Booth, Randy Carter, Malay Ghosh, and Monika Ardelt for serving on my committee and for helping me with my research. I would also like to thank all the faculty, staff, and students in the Department of Statistics for their support.

I also wish to acknowledge the members of the Perinatal Data Systems group, whose constant encouragement and understanding have helped me in many ways. Special thanks go to Dr. Randy Carter, who has supported me since my first day as a graduate student.

Finally, I would like to express my gratitude to my parents, Vessela and Vladislav, for their loving care and confidence in my success; to my sister, Annie, and her fiance, Nathan, for their encouragement and continual support; and to my husband, Velizar, for his constant love and inspiration.



ACKNOWLEDGMENTS ............................................................ iv

ABSTRACT ................................................................... vii

1 INTRODUCTION ............................................................. 1

   1.1 Models for Univariate Repeated Measures ............................. 3
       1.1.1 General Linear Models ........................................ 3
       1.1.2 Generalized Linear Models .................................... 4
       1.1.3 Marginal Models .............................................. 5
       1.1.4 Random Effects Models ........................................ 7
       1.1.5 Transition Models ............................................ 9
   1.2 Models for Multivariate Repeated Measures .......................... 10
   1.3 Simultaneous Modelling of Responses of Different Types ............. 15
   1.4 Format of Dissertation ............................................. 17

   2.1 Introduction ....................................................... 20
   2.2 Model Definition ................................................... 22
   2.3 Model Properties ................................................... 24
   2.4 Applications ....................................................... 25

   MIXED MODEL ........................................................... 29

   3.1 Introduction ....................................................... 29
   3.2 Maximum Likelihood Estimation ...................................... 32
       3.2.1 Gauss-Hermite Quadrature ..................................... 33
       3.2.2 Monte Carlo EM Algorithm ..................................... 38
       3.2.3 Pseudo-likelihood Approach ................................... 42
   3.3 Simulated Data Example ............................................. 47
   3.4 Applications ....................................................... 53
       3.4.1 Developmental Toxicity Study in Mice ......................... 53
       3.4.2 Myoelectric Activity Study in Ponies ......................... 60
   3.5 Additional Methods ................................................. 69

   MIXED MODEL ........................................................... 72

   4.1 Inference about Regression Parameters .............................. 74
   4.2 Estimation of Random Effects ....................................... 77
   4.3 Inference Based on Score Tests ..................................... 78
       4.3.1 General Theory ............................................... 78
       4.3.2 Testing the Conditional Independence Assumption .............. 80
       4.3.3 Testing the Significance of Variance Components .............. 84
   4.4 Applications ....................................................... 94
   4.5 Simulation Study ................................................... 97
   4.6 Future Research .................................................... 102

5 CORRELATED PROBIT MODEL .................................................. 104

   5.1 Introduction ....................................................... 105
   5.2 Model Definition ................................................... 110
   5.3 Maximum Likelihood Estimation ...................................... 112
       5.3.1 Monte Carlo EM Algorithm ..................................... 112
       5.3.2 Stochastic Approximation EM Algorithm ........................ 121
       5.3.3 Standard Error Approximation ................................. 124
   5.4 Application ........................................................ 125
   5.5 Simulation Study ................................................... 142
   5.6 Identifiability Issue .............................................. 145
   5.7 Model Extensions ................................................... 155
   5.8 Future Research .................................................... 157

6 CONCLUSIONS .............................................................. 158

   6.1 Summary ............................................................ 158
   6.2 Future Research .................................................... 161

REFERENCES ................................................................. 164

BIOGRAPHICAL SKETCH ........................................................ 171

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment
of the Requirements for the Degree of
Doctor of Philosophy


Ralitza Gueorguieva

December 1999

Chairman: Alan Agresti
Major Department: Statistics

The goal of this dissertation is to propose and investigate random effects models for repeated measures situations in which there are two or more response variables. The emphasis is on maximum likelihood estimation and on applications with outcomes of different types. We propose a multivariate generalized linear mixed model that can accommodate any combination of outcome variables in the exponential family. This model assumes conditional independence between the response variables given the random effects. We also consider a correlated probit model that is suitable for mixtures of binary, continuous, censored continuous, and ordinal outcomes. Although more limited in its area of applicability, the correlated probit model allows for a more general correlation structure between the response variables than the corresponding multivariate generalized linear mixed model.

We extend three estimation procedures from the univariate generalized linear mixed model to the multivariate generalization proposed herein: Gauss-Hermite quadrature, a Monte Carlo EM algorithm, and pseudo-likelihood. Standard error approximations are considered along with parameter estimation. A simulated data example and two 'real-life' examples are used for illustration. We also consider hypothesis testing based on quadrature and Monte Carlo approximations to the Wald, score, and likelihood ratio tests. The performance of the approximations to the test statistics is studied via a small simulation study for checking the conditional independence assumption.

We propose a Monte Carlo EM algorithm for maximum likelihood estimation in the correlated probit model. Because of the computational inefficiency of this algorithm, we consider a modification based on stochastic approximations which leads to a significant decrease in the time for model fitting. To address the issue of advantages of joint over separate analyses of the response variables, we design a simulation study to investigate possible efficiency gains from a multivariate analysis. A noticeable increase in the estimated standard errors is observed only in the binary response case, for small numbers of subjects and observations per subject and for high correlation between the outcomes. We also briefly consider an identifiability issue for one of the variance components.


Univariate repeated measures occur when one response variable is observed on several occasions for each subject. Hereafter subject refers to any unit on which a measurement is taken, while occasion corresponds to time or to a specific condition. If more than one response is observed at each occasion, multivariate repeated measures are available. Univariate and multivariate repeated measures are very common in biomedical applications, for example when one or more variables are measured on each patient at a number of hospital visits, or when a number of questions are asked at a series of interviews. But the occasions do not necessarily refer to different times. For instance, dependent responses can be measured on litter mates, on members of the same family, or at different places on a subject's body. Difficulties in analyzing repeated measures arise because of the correlations usually present between observations on the same subject. Statistical methods and estimation techniques are well developed for repeated measures on a univariate normal variable, and lately much research has been dedicated to repeated observations on a binary variable and, more generally, on variables with distributions in the exponential family. Zeger and Liang (1992) provide an overview of methods for longitudinal data, and the books of Lindsey (1993), Diggle, Liang and Zeger (1994), and Fahrmeir and Tutz (1994) cover many details. Pendergast et al. (1996) present a comprehensive survey of models for correlated binary outcomes, including longitudinal data.

However, relatively little attention has been devoted to repeated measures of a multivariate response. General models for this situation are necessarily complex, as two types of correlations must be taken into account: correlations between measurements on different variables at each occasion, and correlations between measurements at different occasions. Reinsel (1982, 1984), Lundbye-Christensen (1991), Matsuyama and Ohashi (1997), and Heitjan and Sharma (1997) consider models for normally distributed responses. Lefkopoulou, Moore and Ryan (1989), Liang and Zeger (1989), and Agresti (1997) propose models for multivariate binary data; Catalano and Ryan (1992), Fitzmaurice and Laird (1995), and Regan and Catalano (1999) introduce models for clustered bivariate discrete and continuous outcomes. Catalano (1994) considers an extension to ordinal data of the Catalano and Ryan model. Rochon (1996) demonstrates how generalized estimating equations can be used to fit extended marginal models for bivariate repeated measures of discrete or continuous outcomes. Rochon's approach is very general and allows for a large class of response distributions. However, in many cases, especially when subject-specific inference is of primary interest, marginal models are not appropriate, as they may lead to attenuation of the estimates of the regression parameters (Zeger, Liang, and Albert, 1988).

Sammel, Ryan, and Legler (1997) analyze mixtures of discrete and continuous responses in the exponential family using latent variable models. Their approach is based on numerical or stochastic approximations to maximum likelihood and allows for subject-specific inference. Blackwell and Catalano (1999a, 1999b) consider extensions for ordinal data and for repeated measures of ordinal responses.

The Generalized Linear Mixed Model (GLMM) forms a very general class of subject-specific models for discrete and continuous responses in the exponential family and is used for univariate repeated measures (Fahrmeir and Tutz, 1994). In the current dissertation we demonstrate how the GLMM approach can be extended to multivariate repeated measures by assuming separate random effects for each outcome variable. This is in contrast to the Sammel et al. approach, in which common underlying latent variables are assumed. We also consider a correlated probit model that is more general than the appropriate GLMM for the special case of clustered binary and continuous data.

The introduction to this dissertation contains an overview of approaches for modelling univariate repeated measures (Section 1.1), existing models for multivariate repeated measures (Section 1.2), and simultaneous modelling of different types of responses (Section 1.3). The chapter concludes with an outline of the dissertation (Section 1.4).

1.1 Models for Univariate Repeated Measures

1.1.1 General Linear Models

Historically, the first models for repeated measures data to be considered were general linear models with correlated normally distributed errors (see Ware, 1982, for a review). The univariate representation of such models is as follows. Suppose each subject i is observed on J occasions, and denote the column vector of responses by y_i. Assume that y_i arises from the linear model

y_i = X_i β + ε_i,     (1.1)

where X_i is a J × p model matrix for the ith individual, β is a p × 1 unknown parameter vector, and ε_i is a J × 1 error vector with a multivariate normal distribution with mean 0 and arbitrary positive definite covariance matrix Σ: ε_i ~ N_J(0, Σ). Σ can take certain special forms. The case Σ = σ²I corresponds to the usual linear model for cross-sectional data. The equicorrelation structure, Σ = σ²((1 − ρ)I + ρJ), where J is a matrix of ones, 0 < ρ < 1 and σ² > 0, is appropriate when the repeated measures are on subjects within a cluster, for example when a certain characteristic is observed on each of a number of litter mates. The autoregressive structure, Σ = ((σ_jk)) with σ_jk = σ²ρ^|j−k|, is one of the most popular structures when the observations are over equally-spaced time periods.
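As a quick numerical illustration (a sketch added here, not part of the original text), the two special covariance structures above can be constructed directly; the values J = 4, σ² = 2 and ρ = 0.5 below are hypothetical.

```python
import numpy as np

def equicorrelation_cov(J, sigma2, rho):
    """Sigma = sigma^2 ((1 - rho) I + rho J_ones): every pair of occasions
    shares the same correlation rho."""
    return sigma2 * ((1 - rho) * np.eye(J) + rho * np.ones((J, J)))

def ar1_cov(J, sigma2, rho):
    """Sigma_jk = sigma^2 rho^|j-k|: correlation decays with the time lag."""
    idx = np.arange(J)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

S_eq = equicorrelation_cov(4, sigma2=2.0, rho=0.5)
S_ar = ar1_cov(4, sigma2=2.0, rho=0.5)
```

Both constructions yield positive definite matrices for the stated parameter ranges, which can be verified from their eigenvalues.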

Usually of primary interest in general linear models is the estimation of the regression parameters while recognizing the likely correlation structure in the data. To achieve this, one either assumes an explicit parametric model for the covariance structure, or uses methods of inference that are robust to misspecification of the covariance structure. Weighted least squares, maximum likelihood and restricted maximum likelihood are the most popular estimation methods for general linear models.

1.1.2 Generalized Linear Models

Generalized linear models (GLM) are natural extensions of classical linear models

allowing for a larger class of response distributions. Their specification consists of

three parts (McCullagh and Nelder, 1989, pp.27-30): a random component, a sys-

tematic component and a link function.

1. The random component is the probability distribution for the elements of

the response vector. The yij's, i = 1, ...n, are assumed to be independent with a

distribution in the exponential family

f (Yi; Oi,i) = expi ab(O) +c(yi, 0 (1.2)

for some specified functions a(.), b(.) and c(.). Usually

a(i) =

where 0 is called a dispersion parameter and wi are known weights. The mean

Pi and the variance function v(pi) completely specify a member of the exponential

family because y, = b'(0i) and v(pi) = b"(Oi)a(ji). Important exponential family

distributions are the normal, the binomial, the Poisson and the gamma distributions.

2. The systematic component is a linear function of the covariates

iii = xf3,

where 7i is commonly called a linear predictor.

3. The link function g(.) is a monotonic differentiable function which relates the

expected value of the response distribution pi to the linear predictor 71i:

= g(pi)

When the response distribution is normal and the link is the identity function

g(p) = p, the GLM reduces to the usual linear regression model. For each member

of the exponential family there is a special link function, called the canonical link

function, which simplifies model fitting. For that link function Oi = 77i. Maximum

likelihood estimates in GLM are obtained using iterative reweighted least squares.
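The iteratively reweighted least squares step can be sketched for the canonical-link Poisson case: each iteration solves a weighted least-squares problem with working response z = η + (y − μ)/μ and weights μ. The sketch below is illustrative only, with hypothetical data-generating values, not a production fitter.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25):
    """Iteratively reweighted least squares for a Poisson GLM with the
    canonical log link. Each step solves a weighted least-squares problem
    with working response z = eta + (y - mu) / mu and weights W = mu
    (for the canonical link, W equals the variance function v(mu) = mu)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu
        XtW = X.T * mu                      # X^T W with W = diag(mu)
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# hypothetical data: intercept 0.5, slope 0.3
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(0.5 + 0.3 * X[:, 1]))
beta_hat = irls_poisson(X, y.astype(float))
```

At convergence the score equations X^T(y − μ̂) = 0 are satisfied, which is a convenient check on the fit.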

Just as modifications of linear models are used for analyzing Gaussian repeated measures, modifications of GLMs can handle discrete and continuous outcomes. Extensions of GLMs include marginal, random effects and transition models (Zeger and Liang, 1992). Hereafter we will use y_ij to denote the response at the jth occasion for the ith subject. The ranges of the subscripts will be i = 1, ..., n, and j = 1, ..., J for balanced data and j = 1, ..., n_i for unbalanced data.

1.1.3 Marginal Models

Marginal models are designed to permit separate modeling of the regression of the response on the predictors and of the association among repeated observations for each individual. The models are defined by specifying expressions for the marginal mean and the marginal variance-covariance matrix of the response:

1. The marginal mean μ_ij = E(y_ij) is related to the predictors by a known link function g(·):

g(μ_ij) = x_ij^T β.

2. The marginal variance is a function of the marginal mean,

Var(y_ij) = V(μ_ij),

and the marginal covariance is a function of the marginal means and of additional parameters δ:

Cov(y_ij, y_ik) = c(μ_ij, μ_ik; δ).

Notice that if the correlation is ignored and the variance function is chosen to correspond to an exponential family distribution, the marginal model reduces to a GLM for independent data. But the variance function can be more general, and hence even if the responses are uncorrelated this model is more general than the corresponding GLM. Because only the first two moments are specified for the joint distribution of the response, additional assumptions are needed for likelihood inference. Alternatively, the generalized estimating equations (GEE) method can be used (Liang and Zeger, 1986; Zeger, Liang and Albert, 1988), as briefly summarized here.

Let y_i = (y_i1, ..., y_iJ)^T, μ_i = (μ_i1, ..., μ_iJ)^T, A_i = diag{V(μ_i1), ..., V(μ_iJ)}, and let R_i(δ) be a 'working' correlation matrix for the ith subject. The latter means that R_i(δ) is completely specified up to a parameter vector δ and may or may not be the true correlation matrix. The regression parameters β are then estimated by solving

Σ_{i=1}^{n} (∂μ_i/∂β)^T V_i^{-1}(δ) (y_i − μ_i) = 0,

where V_i(δ) = A_i^{1/2} R_i(δ) A_i^{1/2}. Liang and Zeger (1986) show that if the mean function is correctly specified, then the solution β̂ to the above equation is consistent and asymptotically normal as the number of subjects goes to infinity. They also propose a robust variance estimate, which is consistent even when the variance-covariance structure is misspecified. Hence, the GEE approach is appropriate when the regression relationship, and not the correlation structure of the data, is of primary interest.
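For the special case of an identity link and a working-independence correlation matrix, the estimating equation above reduces to ordinary least squares, and the robust (sandwich) variance can be assembled from cluster-level score contributions. The sketch below uses hypothetical random-intercept data to show the robust standard error exceeding the naive one that ignores within-subject correlation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subj, J, p = 200, 4, 2

# hypothetical clustered data: a random intercept induces within-subject correlation
X = np.dstack([np.ones((n_subj, J)), rng.normal(size=(n_subj, J))])  # n x J x p
beta_true = np.array([1.0, 0.5])
b = rng.normal(scale=1.0, size=n_subj)                 # subject effects
y = X @ beta_true + b[:, None] + rng.normal(size=(n_subj, J))

# working-independence GEE with identity link: sum_i X_i^T (y_i - X_i beta) = 0,
# whose solution is ordinary least squares on the stacked data
Xf = X.reshape(-1, p)
beta_hat = np.linalg.solve(Xf.T @ Xf, Xf.T @ y.ravel())

# robust (sandwich) variance in the spirit of Liang and Zeger: B^{-1} M B^{-1}
B = Xf.T @ Xf
resid = y - X @ beta_hat
M = np.zeros((p, p))
for i in range(n_subj):
    s = X[i].T @ resid[i]                              # cluster-level score
    M += np.outer(s, s)
V_robust = np.linalg.solve(B, np.linalg.solve(B, M).T)
se_robust = np.sqrt(np.diag(V_robust))

# naive variance pretending observations are independent
sigma2_hat = np.mean(resid ** 2)
se_naive = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(B)))
```

With positive within-cluster correlation, the robust intercept standard error is noticeably larger than the naive one, illustrating why the sandwich estimate is preferred when the working correlation may be wrong.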

1.1.4 Random Effects Models

An important feature of marginal models is that the regression coefficients have the same interpretation as coefficients from a cross-sectional analysis. These models are preferred when the effects of explanatory variables on the average response within a population are of primary interest. However, when it is of interest to describe how the response for a particular individual changes as a result of a change in the covariates, a more pertinent approach is to consider random (mixed) effects models.

Random effects models assume that the correlation among repeated responses arises because there is natural heterogeneity across individuals and that this heterogeneity can be represented by a probability distribution. More precisely:

1. The conditional distribution of the response y_ij given a subject-specific vector of random effects b_i satisfies a GLM with linear predictor x_ij^T β + z_ij^T b_i, where z_ij in general is a subset of x_ij.

2. The responses on the same subject, y_i1, y_i2, ..., y_in_i, are conditionally independent given b_i.

3. b_i has a certain distribution F(·; δ) with mean 0 and variance-covariance matrix Σ depending on a parameter vector δ.

Such models with a normal distribution for the random effects are considered in greater detail in Section 2.1. In contrast to marginal models, the regression coefficients in random effects models have subject-specific interpretations. To better illustrate that difference, let us consider a particular example.

Let the response y_ij be binary and the subscript j refer to time. Consider the marginal model

logit(μ_ij) = β_0 + β_1 j,

where μ_ij = E(y_ij), and the random effects model

logit(μ_ij^c) = β_0^c + β_1^c j + b_i,   b_i i.i.d. N(0, σ²),

where μ_ij^c = E(y_ij | b_i). Then β_1^c is the log-odds ratio for a positive response at time j + 1 relative to time j for any subject i, while β_1 is the population-averaged log-odds ratio. That is, β_1 describes the change in log-odds for a positive response from time j to time j + 1 for the population as a whole. In general these two interpretations are different, but in some special cases, such as identity link functions, the subject-specific and population-averaged interpretations coincide. More discussion on the connection between marginal and random effects models follows in Chapter 2.
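The attenuation of the population-averaged slope relative to the subject-specific slope can be illustrated numerically by integrating the conditional success probability over the random-intercept distribution (here via Gauss-Hermite quadrature). This is an added sketch with hypothetical parameter values.

```python
import numpy as np

def marginal_prob(eta_fixed, sigma, n_nodes=40):
    """E_b[expit(eta_fixed + b)] for b ~ N(0, sigma^2), computed by
    Gauss-Hermite quadrature with the substitution b = sqrt(2) sigma t."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma * nodes
    p = 1.0 / (1.0 + np.exp(-(eta_fixed + b)))
    return (weights / np.sqrt(np.pi)) @ p

# hypothetical subject-specific (conditional) model
beta0_c, beta1_c, sigma = -1.0, 0.8, 2.0
times = np.arange(1, 6)
p_marg = np.array([marginal_prob(beta0_c + beta1_c * j, sigma) for j in times])

# slope of the implied marginal logit curve: a least-squares line through the
# marginal log odds, attenuated relative to beta1_c
logits = np.log(p_marg / (1 - p_marg))
beta1_marg = np.polyfit(times, logits, 1)[0]
```

With these values the fitted marginal slope is roughly β_1^c / sqrt(1 + 0.346 σ²), visibly smaller than the conditional slope, which is the attenuation referred to above.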

The presence of random effects enables the pooling of information across different subjects, resulting in better subject-specific (as opposed to population-averaged) inference, but it complicates the estimation problem considerably. To obtain the likelihood function, one has to integrate out the random effects, which, except for a few special cases, cannot be done analytically. If the random effects are nuisance parameters, conditional likelihood estimates for the fixed effects may be easy to obtain for canonical link functions. This is accomplished by conditioning on the sufficient statistics for the unknown nuisance parameters and then maximizing the conditional likelihood.

When the dimension of the integral is not high, numerical methods such as Gaussian quadrature work well for normally distributed random effects (Fahrmeir and Tutz, 1994, pp. 357-362; Liu and Pierce, 1994). A variety of other methods have recently been proposed to handle more difficult cases. These include the approximate maximum likelihood estimates proposed by Schall (1991), the penalized quasi-likelihood approach of Breslow and Clayton (1993), the hierarchical maximum likelihood of Lee and Nelder (1996), the Gibbs sampling approach of Zeger and Karim (1991), and the EM algorithm approaches for GLMMs of Booth and Hobert (1999) and McCulloch (1997), among others.
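For a random-intercept logistic model, the Gauss-Hermite approximation to one subject's marginal likelihood contribution can be sketched as follows; the responses and linear-predictor values are hypothetical, and the quadrature result is checked against a brute-force grid integration.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def subject_likelihood_gh(y, eta_fixed, sigma, n_nodes=30):
    """Gauss-Hermite approximation of
    L_i = int prod_j p_j^y_j (1 - p_j)^(1 - y_j) phi(b; 0, sigma^2) db,
    where p_j = expit(eta_fixed_j + b) for a random intercept b."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma * nodes               # substitution b = sqrt(2) sigma t
    p = expit(eta_fixed[None, :] + b[:, None])     # n_nodes x J
    lik = np.prod(np.where(y[None, :] == 1, p, 1 - p), axis=1)
    return (weights / np.sqrt(np.pi)) @ lik

y = np.array([1, 0, 1, 1])                         # hypothetical binary responses
eta = np.array([-0.5, 0.0, 0.5, 1.0])              # hypothetical fixed-effect part
sigma = 1.5
L_gh = subject_likelihood_gh(y, eta, sigma)

# brute-force check: trapezoidal integration on a fine grid
bgrid = np.linspace(-12, 12, 20001)
pg = expit(eta[None, :] + bgrid[:, None])
integrand = np.prod(np.where(y[None, :] == 1, pg, 1 - pg), axis=1)
phi = np.exp(-bgrid**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
f = integrand * phi
L_grid = np.sum((f[1:] + f[:-1]) * np.diff(bgrid)) / 2
```

With a one-dimensional random effect, a modest number of nodes already matches the grid integral to several decimal places, which is why quadrature is attractive in low dimensions.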

1.1.5 Transition Models

A third group of models for dealing with longitudinal data consists of transition models. Regression Markov chain models for repeated measures data have been considered by Zeger et al. (1985) and Zeger and Qaqish (1988). This approach involves modeling the conditional expectation of the response at each occasion given the past outcomes. Specifically:

1. The conditional expectation of the response, μ_ij^c = E(y_ij | y_i,j−1, ..., y_i1), depends on previous responses and current covariates as follows:

g(μ_ij^c) = x_ij^T β + Σ_k γ_k f_k(y_i,j−1, ..., y_i1),

where the f_k(·) are known functions.

2. The conditional variance of y_ij is a function of the conditional mean,

Var(y_ij | y_i,j−1, ..., y_i1) = V(μ_ij^c),

where V is a known function.

Transition models combine the assumptions about the dependence of the response on the explanatory variables and the correlation among repeated observations into a single equation. Conditional maximum likelihood and GEE have both been used for estimating the parameters.
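A minimal sketch of a first-order transition model for binary data (all parameter values hypothetical): the lagged response enters the linear predictor as just another covariate, so conditional maximum likelihood amounts to an ordinary logistic regression over occasions j ≥ 2.

```python
import numpy as np

def fit_logistic(X, y, n_iter=30):
    """Newton-Raphson for logistic regression; used here for the conditional
    ML fit of a transition model with the lagged response as a covariate."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1 - p)
        beta += np.linalg.solve((X.T * W) @ X, X.T @ (y - p))
    return beta

# simulate a first-order chain: logit P(y_ij = 1 | y_i,j-1) = b0 + g1 * y_i,j-1
rng = np.random.default_rng(3)
n, J = 2000, 6
b0_true, g1_true = -0.5, 1.2
y = np.zeros((n, J), dtype=int)
y[:, 0] = rng.random(n) < 0.5          # first observation: no past to condition on
for j in range(1, J):
    p = 1.0 / (1.0 + np.exp(-(b0_true + g1_true * y[:, j - 1])))
    y[:, j] = rng.random(n) < p

# conditional likelihood: stack occasions 2..J with the lagged response as predictor
ylag = y[:, :-1].ravel()
yobs = y[:, 1:].ravel().astype(float)
X = np.column_stack([np.ones(yobs.size), ylag])
b0_hat, g1_hat = fit_logistic(X, yobs)
```

Note that, as discussed above for transition models generally, the information in each subject's first observation is not used by this conditional fit.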
Extensions of GLMs are not the only methods for analyzing repeated measures data (see, for example, Vonesh (1992) for an overview of non-linear models for longitudinal data), but as the models proposed in this dissertation are based on such extensions, the discussion in the following sections is restricted to GLM-type models.

1.2 Models for Multivariate Repeated Measures

In contrast to univariate longitudinal data, very few models have been discussed that specifically deal with multivariate repeated measures. These models are now briefly reviewed, starting with normal-theory linear models and proceeding to models for discrete outcomes.

A review of general linear models for the analysis of longitudinal studies is provided by Ware (1985). The general multivariate model is defined as in (1.1), but we assume that the J = KL repeated measures on each subject are made on K normally distributed variables rather than on only one normally distributed response. Hence, the general multivariate model with unspecified covariance structure can be directly applied to multivariate repeated measures data. However, the number of parameters to be estimated increases quickly as the number of occasions and/or variables increases, and the estimation may become quite burdensome. Special types of correlation structures, such as bivariate autoregressive, can be specified directly as proposed by Galecki (1994). This multivariate linear model is also not well suited for unbalanced or incomplete data.

More parsimonious covariance structures are achieved with random effects models. The linear mixed effects model is defined in two stages. At stage 1 we assume

y_i | b_i = X_i β + Z_i b_i + ε_i,

where ε_i is distributed N_{n_i}(0, σ²I), X_i and Z_i are n_i × p and n_i × q model matrices for the fixed and the random effects respectively, and b_i is a q × 1 random effects vector. At stage 2, b_i ~ N_q(0, Σ) independently of ε_i. This corresponds to a special variance-covariance structure for the ith subject: Σ_i = σ²I + Z_i Σ Z_i^T.
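The implied marginal covariance Σ_i = σ²I + Z_i Σ Z_i^T can be computed directly; the random intercept-and-slope design and variance components below are hypothetical illustration values, not from the dissertation.

```python
import numpy as np

# random intercept and slope over occasions t = 0, 1, 2, 3
t = np.arange(4.0)
Z = np.column_stack([np.ones(4), t])            # n_i x q random effects design
Sigma_b = np.array([[1.0, 0.3],
                    [0.3, 0.5]])                # Cov(b_i), hypothetical
sigma2 = 0.8                                    # error variance, hypothetical

# marginal covariance implied by the two-stage model
Sigma_i = sigma2 * np.eye(4) + Z @ Sigma_b @ Z.T

# induced correlations between occasions
d = np.sqrt(np.diag(Sigma_i))
Corr_i = Sigma_i / np.outer(d, d)
```

Only q(q + 1)/2 + 1 variance parameters generate the full J × J covariance matrix, which is the parsimony referred to above.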

Reinsel (1982) generalized this linear mixed effects model and showed that for balanced multivariate repeated measures models with a random effects structure, closed-form solutions exist for both the maximum likelihood (ML) and restricted maximum likelihood (REML) estimates of the mean and covariance parameters. Matsuyama and Ohashi (1997) considered bivariate response mixed effects models that can handle missing data. Choosing a Bayesian viewpoint, they used the Gibbs sampler to estimate the parameters.

The model of Reinsel is overly restrictive in some cases, as it prescribes the same growth pattern over all response variables for all individuals. In contrast, Heitjan and Sharma (1997) considered a model for repeated series longitudinal data, where each unit can yield multiple series of the same variable. The error term they used was a sum of a random subject effect and a vector autoregressive process, thus accounting for subject heterogeneity and time dependence in an additive fashion. A straightforward extension for multiple series of observations on distinct variables is possible.
All models discussed so far in this section are appropriate for continuous response variables that can be assumed to be normally distributed. More generally, GEE marginal models can easily accommodate multivariate repeated measures if all response variables have the same discrete or continuous distribution. We now restate the GEE marginal model from Section 1.1 in matrix notation to simplify the multivariate extension. We also consider balanced data. Let y_i represent the vector of observations for the ith individual (i = 1, 2, ..., n) and let μ_i = E(y_i) be the marginal mean vector. We assume

g(μ_i) = X_i β,

where X_i is the model matrix for the ith individual, β is an unknown parameter vector and g(·) is a known link function, applied componentwise to μ_i. Also Var(y_ij) = φ V(μ_ij), where V is a known variance function and φ is a dispersion parameter. Letting A_i = diag[V(μ_i1), ..., V(μ_iJ)] and assuming a working correlation matrix R_i(δ) for y_i, the 'working' covariance matrix for y_i is given by

V_i = φ A_i^{1/2} R_i(δ) A_i^{1/2}.

If R_i(δ) = R(δ) is the true correlation matrix, then V_i is the true covariance matrix of y_i. The J responses for each subject are usually repeated measures on the same variable, but as in the normal case, they can be repeated observations on two or more outcome variables as long as they have the same distribution. The only difference from univariate repeated measures is in the specification of the covariance matrix. The estimates of the regression parameters β will be consistent provided that the model for the marginal mean structure is specified correctly, but for better efficiency the working correlation matrix should be close to the true correlation matrix. Several correlation structures for multivariate repeated measures are discussed by Rochon (1996).
As a further extension, Rochon (1996) proposed a model for bivariate repeated measures that can accommodate both continuous and discrete outcomes. He used GEE models, such as the one described above, to relate each set of repeated measures to important explanatory variables, and then applied seemingly unrelated regression (SUR; Zellner, 1962) methodology to combine the pair of GEE models into an overall analysis framework. If the response vector for the ith subject is denoted by y_i = (y_i^(1)T, y_i^(2)T)^T and all relevant quantities for the first and second response are superscripted by 1 and 2 respectively, the SUR model may be written as

g(μ_i) = X_i β,   X_i = [ X_i^(1)  0 ; 0  X_i^(2) ],   β = (β^(1)T, β^(2)T)^T,

where g(·) is a compound function consisting of g^(1)(·) and g^(2)(·). The joint covariance matrix among the two sets of repeated measures may be written as

V_i = [ φ_1 I_J  0 ; 0  φ_2 I_J ]^{1/2} [ A_i^(1)  0 ; 0  A_i^(2) ]^{1/2} R(δ) [ A_i^(1)  0 ; 0  A_i^(2) ]^{1/2} [ φ_1 I_J  0 ; 0  φ_2 I_J ]^{1/2},

where R(δ) is the working correlation matrix among the two sets of repeated measures for each subject, and each of its elements is a function of the parameter vector δ:

R(δ) = [ R_11  R_12 ; R_21  R_22 ].

The suggested techniques may be extended to multiple outcome measures. Rochon's approach provides a great deal of flexibility in modeling the effects of both within-subject and between-subject covariates on discrete and continuous outcomes and is appropriate when the effects of covariates on the marginal distributions of the response are of interest. If subject-specific inference is preferred, however, a transitional or random effects model should be considered.
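The block structure of such a working covariance matrix can be assembled numerically as follows. This is an added sketch: the variance functions, dispersion parameters and working correlations below are hypothetical placeholders, not values from Rochon (1996).

```python
import numpy as np

def block_diag(A, B):
    """Two-block diagonal matrix [[A, 0], [0, B]]."""
    Z1 = np.zeros((A.shape[0], B.shape[1]))
    Z2 = np.zeros((B.shape[0], A.shape[1]))
    return np.block([[A, Z1], [Z2, B]])

J = 3
# hypothetical variance pieces for a continuous (1) and a count (2) response
phi1, phi2 = 1.5, 1.0
A1 = np.diag(np.full(J, 2.0))                 # V(mu^(1)) at each occasion
A2 = np.diag(np.array([0.5, 1.0, 2.0]))       # V(mu^(2)) = mu^(2) for Poisson, say

# working correlation across both sets of repeated measures
R11 = R22 = 0.4 * np.ones((J, J)) + 0.6 * np.eye(J)
R12 = 0.2 * np.ones((J, J))
R = np.block([[R11, R12], [R12.T, R22]])

Phi = block_diag(phi1 * np.eye(J), phi2 * np.eye(J))
A = block_diag(A1, A2)
half = np.sqrt(Phi) @ np.sqrt(A)              # diagonal, so elementwise square roots
Vi = half @ R @ half
```

The off-diagonal block R_12 is what links the two outcome types; setting it to zero recovers two separate univariate GEE analyses.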
Liang and Zeger (1989) suggest a class of Markov chain logistic regression models for multivariate binary time series. Their approach is to model the conditional distribution of each component of the multivariate binary time series given the others. They use 'pseudolikelihood' estimation methods to reduce the computational burden associated with maximum likelihood estimation.

Liang and Zeger's transitional model is useful when the association among variables at one time is of interest, or when the purpose is to identify temporal relationships among the variables, adjusting for covariates. However, one must use caution when interpreting the estimated parameters, because the regression parameter β in the model has a log-odds ratio interpretation conditional on the past and on the other outcomes at time t. Hence, if a covariate influences more than one component of the outcome vector or past outcomes, which is frequently the case, its regression coefficient will capture only that part of its influence that cannot be explained by the other outcomes it also affects. Another problem with the model is that in the fitting process all the information in the first q observations is ignored.
A random effects approach for dealing with multivariate longitudinal data is dis-
cussed by Agresti (1997) who develops a multivariate extension of the Rasch model for

repeated measures of a multivariate binary response. One disadvantage of the multi-
variate Rasch model shared by many random effects models, is that it can not model
negative covariance structure among repeated observations on the same variable. Al-
though usually measurements on the same variable within a subject are positively
correlated, there are cases when the correlation is negative. One such example is the
infection data, first considered by Haber (1986). Frequencies of infection profiles of a
sample of 263 individuals for four influenza outbreaks over four consecutive winters
in Michigan were recorded. The first and fourth outbreaks are known to be caused
by the same virus type and because contracting influenza during the first outbreak
provides immunity against a subsequent outbreak, a subject's responses for these two outbreaks are negatively correlated. Coull (1997) analyzed these data using the
multivariate binomial-logit normal model. As it is a special case of the models that
we propose later in this dissertation, we define it here.
Let y = (y_1, ..., y_I)^T given π = (π_1, ..., π_I)^T be a random vector of independent binomial components with numbers of trials (n_1, ..., n_I)^T. Also let logit(π) be N_I(μ, Σ). Then π has a multivariate logistic-normal distribution and unconditionally y has a multivariate binomial logit-normal mixture distribution. If the I observations correspond
to measurements on K variables at L occasions, this model can be used for analyzing
multivariate repeated measures data. The mean of the multivariate random effects
distribution can be assumed to be a function of covariates, μ = X_s β, and several groups of subjects with the same design matrices X_s, s = 1, ..., S, can be considered.
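As an illustrative sketch (not part of the dissertation's own analyses), the multivariate binomial-logit normal model is straightforward to simulate. The dimensions and parameter values below are invented; the negative covariance mimics the influenza example, in which responses to the first and fourth outbreaks are negatively correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbinomial_logit_normal(mu, Sigma, n_trials, size, rng):
    """Draw vectors y with logit(pi) ~ N(mu, Sigma) and, given pi,
    independent components y_l ~ Binomial(n_trials[l], pi_l)."""
    logit_pi = rng.multivariate_normal(mu, Sigma, size=size)
    pi = 1.0 / (1.0 + np.exp(-logit_pi))           # inverse logit, componentwise
    return rng.binomial(np.asarray(n_trials), pi)  # shape (size, I)

# I = 2 binary components with a negative covariance (values invented)
mu = np.array([-0.5, 0.3])
Sigma = np.array([[1.0, -0.6],
                  [-0.6, 1.0]])
y = rbinomial_logit_normal(mu, Sigma, n_trials=[1, 1], size=20000, rng=rng)
print(np.corrcoef(y.T)[0, 1])   # negative, inherited from Sigma_12 < 0
```

The sample correlation of the binary pair is negative, attenuated relative to the latent correlation of −0.6, which is exactly the behaviour a multivariate Rasch-type model with a common positive random effect cannot reproduce.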
The multivariate binomial-logit normal model can be regarded as an analog for

binary data of the multivariate Poisson-log normal model of Aitchison and Ho (1989)

for count data. Aitchison and Ho assume that y = (y_1, ..., y_I)^T given θ = (θ_1, ..., θ_I)^T are independent Poisson with mean vector θ, and that log(θ) = (log(θ_1), ..., log(θ_I))^T is N_I(μ, Σ). Then θ has a multivariate log-normal distribution. Like the multivariate

logit-normal model, the multivariate Poisson-log normal model can be used to model

negative correlations and can be extended to incorporate covariates.
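A parallel sketch for the Poisson-log normal model (again with invented parameter values) shows the two features noted above: negative correlation between counts inherited from Σ_12 < 0, and marginal overdispersion relative to the Poisson.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented parameters for a bivariate Poisson-log normal illustration
mu = np.array([1.0, 1.5])
Sigma = np.array([[0.4, -0.25],
                  [-0.25, 0.4]])

log_theta = rng.multivariate_normal(mu, Sigma, size=50000)
y = rng.poisson(np.exp(log_theta))   # y | theta are independent Poisson(theta)

print(np.corrcoef(y.T)[0, 1])        # negative, from Sigma_12 < 0
print(y[:, 0].var(), y[:, 0].mean()) # variance exceeds mean: overdispersion
```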

Chan and Kuk (1997) also consider random effects models for binary repeated

measures but assume an underlying threshold model with normal errors and random

effects, and use the probit link. They estimate the parameters via a Monte Carlo EM

algorithm, regarding the observations from the underlying continuous model as the

complete data. Their approach will be discussed in more detail in Chapter 5 where

an extension of their model will be considered.

1.3 Simultaneous Modelling of Responses of Different Types

Difficulties in joint modelling of responses of different types arise because of the

need to specify a multivariate joint distribution for the outcome variables. Most

research so far has concentrated on simultaneous analysis of binary and continuous outcomes.


Olkin and Tate (1961) introduced a location model for discrete and continuous
outcomes. It is based on a multinomial model for the discrete outcomes and a mul-

tivariate Gaussian model for the continuous outcomes conditional on the discrete

outcomes. Fitzmaurice and Laird (1995) discussed a generalization of this model

which turns out to be a special case of the partly exponential model introduced by

Zhao, Prentice and Self (1992). Partly exponential models for the regression analy-

sis of multivariate (discrete, continuous or mixed) response data are parametrized in

terms of the response mean and a general shape parameter. They encompass gener-

alized linear models as well as certain multivariate distributions. A fully parametric

approach to estimation leads to asymptotically independent maximum likelihood es-

timates of the mean and the shape parameters. The score equations for the mean

parameters are essentially the same as in GEE. The authors point out two major

drawbacks to their approach: one is the computational complexity of full maximum

likelihood estimation of the mean and the shape parameters together, another one is
the need to specify the shape function for the response. The latter will be especially
hard if the response vector is a mixture of discrete and continuous outcomes. Zhao,
Prentice and Self conclude that partly exponential models are mainly of theoretical

interest and can be used to evaluate properties of other mean estimation procedures.

Cox and Wermuth (1992) compared a number of special models for the joint
distribution of qualitative (binary) and quantitative variables. The joint distribution
for all bivariate models is based on the marginal distribution of one of the components
and a conditional distribution for the other component given the first one. A key

distinction is between models in which the quantitative response (Y) is assumed to be normally distributed conditional on each level of the binary outcome (A), and models in which the marginal distribution of Y is normal. Typically, simplicity in the marginal distribution of
Y corresponds to a fairly complicated conditional distribution of Y and vice versa,
but normality often holds at least approximately. Estimation procedures differ from
model to model but essentially the same tests of independence of the two components
of the response can be derived. This, however, is not true if trivariate distributions
are considered with two binary and one continuous, or with two continuous and one

binary component. In that case several different hypotheses of independence and
conditional independence can be considered and, depending on the model, some of them cannot be tested unless a stronger hypothesis of independence is assumed.

Catalano and Ryan (1992), Fitzmaurice and Laird (1995) and Regan and Catalano
(1999) considered mixed models for a bivariate response consisting of a binary and

a quantitative variable. Catalano and Ryan (1992) and Regan and Catalano (1999)
treated the binary variable as a dichotomized continuous latent trait, which had a
joint bivariate normal distribution with the other continuous response. Catalano and
Ryan (1992) then parametrized the model so that the joint distribution was a product
of a standard random effects model for the continuous variable and a correlated probit

model for the discrete variable. Estimation for the parameters of the two models is

performed using quasi-likelihood techniques. Catalano (1994) extended the Catalano

and Ryan procedure to ordinal instead of binary data.

Regan and Catalano (1999) used exact maximum likelihood for estimation, which

is computationally feasible because of the equicorrelation assumption between and

within the binary and continuous outcomes. The maximum likelihood methodology

used by the authors is an extension of the procedure suggested by Ochi and Prentice

(1984) for binary data.

Fitzmaurice and Laird (1995) assumed a logit model for the binary response and a

conditional Gaussian model for the continuous response. Unlike in Catalano and Ryan's model, all regression parameters have marginal interpretations and the estimates of

the regression parameters (based on ML or GEE) are robust to misspecification of
the association between the binary and the continuous responses.

Sammel, Ryan and Legler (1997) developed models for mixtures of outcomes in the
exponential family. They assumed that all responses are manifestations of one or more

common latent variables and that conditional independence between the outcomes

given the value of the latent trait holds. This allows one to use the EM algorithm with

some numerical or stochastic approximations at each E-step. Blackwell and Catalano

(1999) extended the Sammel, Ryan and Legler methodology to longitudinal ordinal

data by assuming correlated latent variables at each time point. For simplicity of

analysis each outcome is assumed to depend only on one latent variable and the

latent variables are assumed to be independent at each time point.

1.4 Format of Dissertation

The purpose of this dissertation is to propose and investigate random effects mod-

els for repeated measures situations when there are two or more response variables.

Of special interest is the case when the response variables are of different types. The

dissertation is organized as follows.

In Chapter 2 we propose a multivariate generalized linear mixed model which can

accommodate any combination of responses in the exponential family. We first describe

the usual generalized linear mixed model (GLMM) and then define its extension. The

relationship between marginal and conditional moments in the proposed model is

briefly discussed and two motivating examples are presented. The key assumption of

conditional independence is outlined.

Chapter 3 concentrates on maximum likelihood model fitting methods for the

proposed model. Gauss-Hermite quadrature, a Monte Carlo EM algorithm and a

pseudo-likelihood approach are extended from the univariate to the multivariate gen-

eralized linear mixed model. Standard error approximation is discussed along with

point estimation. We use a simulated data example and the two motivating examples

to illustrate the proposed methodology. We also address certain issues such as stan-

dard error variability and comparison between multivariate and univariate analyses.

In Chapter 4 we consider inference in the multivariate GLMM. We describe hy-

pothesis testing for the fixed effects based on approximations to the Wald and likeli-

hood ratio statistics and propose score tests for the variance components and for test-

ing the conditional independence assumption. The performance of the Gauss-Hermite

quadrature and Monte Carlo approximations to the Wald, score and likelihood ratio

statistics is compared via a small simulation study.

Chapter 5 introduces a correlated probit model as an alternative to the multivari-

ate GLMM for binary and continuous data when conditional independence does not

hold. We develop a Monte Carlo EM algorithm for maximum likelihood estimation

and apply it to one of the motivating examples. To address the issue of advantages

of joint over separate analyses of the response variables we design a simulation study



to investigate possible efficiency gains in a multivariate analysis. An identifiability

issue concerning one of the variance components is also discussed.

The dissertation concludes with a summary of the most important findings and

discussion of future research topics (Chapter 6).


The Generalized Linear Mixed Model (GLMM) is a very general class of random
effects models, well suited for subject-specific inference for repeated measures data.
It is a special case of the random effects model defined in Section 1.2. We start
this chapter by providing a detailed definition of the GLMM for univariate repeated measures (Section 2.1) and then introduce the multivariate extension (Section 2.2). Some properties of the multivariate GLMM are discussed in Section 2.3 and the chapter concludes with a description of two data sets which will be used for illustration throughout the thesis. Hereafter we omit the superscript c in the conditional mean notation.

2.1 Introduction

Let y_ij denote the jth response observed on the ith subject, j = 1, ..., n_i, i = 1, ..., n. The conditional distribution of y_ij given an unobserved q x 1 subject-specific random vector b_i is assumed to be in the exponential family with density

f(y_ij | b_i) = exp{ [y_ij θ_ij − b(θ_ij)] w_ij / φ + c(y_ij, φ, w_ij) },

where μ_ij = b'(θ_ij) is the conditional mean, φ is the dispersion parameter, b(.) and c(.) are specific functions corresponding to the type of exponential family, and w_ij are known weights. Also at this stage, it is assumed that g(μ_ij) = g(E(y_ij | b_i)) = η_ij, where η_ij = x_ij^T β + z_ij^T b_i is a linear predictor, b_i is a q x 1 random effect, β is a p x 1 parameter vector, and the design vectors x_ij (p x 1) and z_ij (q x 1) are functions of the covariates.

At the second stage the subject-specific effects b_i are assumed to be i.i.d. N_q(0, Σ). In general, the random effects can have other continuous or discrete distributions, but the normal distribution provides a full range of possible covariance

structures and we will restrict our attention to this case.

As an additional assumption, conditional independence of the observations within

and between subjects is required, that is,

f(y | b; β, φ) = ∏_{i=1}^n f(y_i | b_i; β, φ)  with  f(y_i | b_i; β, φ) = ∏_{j=1}^{n_i} f(y_ij | b_i; β, φ),

where y = (y_1^T, ..., y_n^T)^T and b = (b_1^T, ..., b_n^T)^T are column vectors of all responses and all random effects respectively.

Maximum-likelihood estimation for GLMM is complicated because the marginal

likelihood for the response does not have a closed form expression. Different ap-

proaches for dealing with this situation are discussed in Chapter 3.

The parameters /3 in the GLMM have subject-specific interpretations; i.e. they

describe the individual's rather than the average population response to changing the

covariates. Under the GLMM, the marginal mean is μ*_ij = E(y_ij) = E(E(y_ij | b_i)) = ∫ g^{-1}(x_ij^T β + z_ij^T b_i) f(b_i; Σ) db_i, and in general g(μ*_ij) ≠ x_ij^T β if g is a non-linear function. However, g(μ*_ij) ≈ x_ij^T β holds approximately if the standard deviations of the random effects distributions are small.

Closed-form expressions for the marginal means exist for certain link functions
(Zeger, Liang and Albert, 1988). For example, for the identity link function the

marginal and conditional moments coincide. For the log link function the marginal mean is μ*_ij = exp(x_ij^T β + z_ij^T Σ z_ij / 2). For the probit link function, the marginal mean is μ*_ij = Φ(a_ij(Σ) x_ij^T β), where a_ij(Σ) = |Σ z_ij z_ij^T + I_q|^{-1/2} and q is the dimension of b_i. For the logit link, an exact closed-form expression for the marginal mean is unavailable but using a cumulative Gaussian approximation to the logistic function leads to the expression logit(μ*_ij) ≈ a_ij(Σ) x_ij^T β, where a_ij(Σ) = |c² Σ z_ij z_ij^T + I_q|^{-1/2} and c = 16√3/(15π).
Unfortunately, there are no simple formulae for the higher order marginal moments
except in the case of a linear link function. But the second-order moments can also
be approximated and then the GEE approach can be used to fit the mixed model
(Zeger, Liang and Albert, 1988).
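These closed-form results are easy to check numerically. The sketch below, with an invented linear predictor η = x^T β and random-intercept variance σ², compares the exact probit identity E[Φ(η + b)] = Φ(η/√(1 + σ²)) and the logistic approximation with c = 16√3/(15π) against Monte Carlo averages; scipy is assumed available for Φ.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

eta, sigma2 = 0.8, 1.5                  # invented values for the check
b = rng.normal(0.0, np.sqrt(sigma2), size=200000)

# Probit link: the attenuation identity is exact
mc = norm.cdf(eta + b).mean()                    # Monte Carlo marginal mean
closed = norm.cdf(eta / np.sqrt(1.0 + sigma2))   # Phi(a(Sigma) * x'beta)
print(mc, closed)

# Logit link: only approximate, via the Gaussian approximation to the logistic
c = 16.0 * np.sqrt(3.0) / (15.0 * np.pi)
logit_mc = (1.0 / (1.0 + np.exp(-(eta + b)))).mean()
logit_approx = 1.0 / (1.0 + np.exp(-eta / np.sqrt(1.0 + c**2 * sigma2)))
print(logit_mc, logit_approx)
```

The probit pair agrees up to Monte Carlo error, while the logit pair agrees closely but not exactly, as the text indicates.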

2.2 Model Definition

Let us first consider bivariate repeated measures. Denote the response vector for the ith subject by y_i = (y_i1^T, y_i2^T)^T, where y_i1 = (y_i11, ..., y_i1n_i)^T and y_i2 = (y_i21, ..., y_i2n_i)^T are the repeated measurements on the 1st and 2nd variable respectively at n_i occasions. The number of observations for the two variables within a subject need not be the same and hence it would be more appropriate to denote them by n_i1 and n_i2, but for simplicity we will use n_i. We assume that y_i1j, j = 1, ..., n_i, are conditionally independent given b_i1 with density f_1(.) in the exponential family. Analogously, y_i2j, j = 1, ..., n_i, are conditionally independent given b_i2 with density f_2(.) in the exponential family. Note that f_1 and f_2 need not be the same. Also y_i1 and y_i2 are conditionally independent given b_i = (b_i1^T, b_i2^T)^T and the responses on different subjects are independent. Let g_1(.) and g_2(.) be appropriate link functions for f_1 and f_2. Denote the conditional means of y_i1j and y_i2j by μ_i1j and μ_i2j respectively. Let μ_i1 = (μ_i11, ..., μ_i1n_i)^T and μ_i2 = (μ_i21, ..., μ_i2n_i)^T. At stage one of the
mixed model specification we assume

g_1(μ_i1) = X_i1 β_1 + Z_i1 b_i1,   (2.1)

g_2(μ_i2) = X_i2 β_2 + Z_i2 b_i2,   (2.2)

where β_1 and β_2 are p_1 x 1 and p_2 x 1 dimensional unknown parameter vectors, X_i1 and X_i2 are n_i x p_1 and n_i x p_2 dimensional design matrices for the fixed effects, Z_i1 and Z_i2 are n_i x q_1 and n_i x q_2 design matrices for the random effects, and g_1 and g_2 are applied componentwise to μ_i1 and μ_i2. At stage two, a joint distribution for b_i1 (q_1 x 1) and b_i2 (q_2 x 1) is specified. The normal distribution is a very good candidate, as it provides a rich covariance structure. Hence, we assume

b_i = (b_i1^T, b_i2^T)^T  i.i.d.  MVN(0, Σ),   Σ = [ Σ_11  Σ_12 ; Σ_12^T  Σ_22 ],   (2.3)

where Σ, Σ_11 and Σ_22 are in general unknown positive-definite matrices. The extension of this model to higher dimensional multivariate repeated measures is straightforward but for simplicity of discussion only the bivariate case will be considered.
When Σ_12 = 0 the above model is equivalent to two separate GLMM's for the two outcome variables. Advantages of joint over separate fitting include the ability to answer intrinsically multivariate questions, better control over the type I error rates in multiple tests and possible gains in efficiency in the parameter estimates.

For example, in the developmental toxicity application described in more detail at the end of this section, it is of interest to estimate the dose effect of ethylene glycol on both outcomes (malformation and low fetal weight) simultaneously. One might be interested in the probability of either malformation or low fetal weight at any given dose. This type of question cannot be answered by a univariate analysis as it requires knowledge of the correlation between the outcomes. Also, testing the significance of the dose effect on malformation and fetal weight simultaneously (using a Wald test for example) allows one to keep the significance level fixed at α. If the two outcomes were tested separately, an adjustment would have to be made to the levels of the individual tests to achieve an overall α level for both comparisons.
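To make the model concrete, the following sketch simulates data from a model of the form (2.1)-(2.3) with a Bernoulli (logit link) and a Gaussian (identity link) outcome sharing correlated random intercepts. All dimensions and parameter values are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

n, ni = 500, 4                       # subjects, repeated measures per subject
beta1 = np.array([-0.5, 1.0])        # fixed effects, binary outcome
beta2 = np.array([2.0, -0.8])        # fixed effects, continuous outcome
Sigma = np.array([[1.0, 0.5],        # random-intercept covariance as in (2.3)
                  [0.5, 1.0]])

x = rng.uniform(size=(n, ni))        # one covariate, shared by both outcomes
b = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)  # (b_i1, b_i2)

eta1 = beta1[0] + beta1[1] * x + b[:, [0]]              # logit scale, (2.1)
y1 = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta1)))       # Bernoulli outcome
eta2 = beta2[0] + beta2[1] * x + b[:, [1]]              # identity link, (2.2)
y2 = eta2 + rng.normal(scale=0.5, size=(n, ni))         # Gaussian outcome
```

Setting the off-diagonal block of Sigma to zero here reproduces two independent univariate GLMMs, which is the comparison underlying the joint-versus-separate fitting discussion above.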
Multivariate analysis also allows one to borrow strength from the observations on
one outcome to estimate the other outcomes. This can lead to more precise estimates

of the parameters and therefore, to gains in efficiency. Theoretically, the gain will

be the greatest if there is a common random effect for all responses (this means Σ_11 = Σ_12 = Σ_22 in our notation). Also, if the number of observations per subject is
small and the correlation between the two outcomes is large, the gains in efficiency

should increase.

If the two vectors of random effects are perfectly correlated, then this is equivalent
to assuming a common latent variable with a multivariate normal distribution which

does not depend on covariates. Hence, the multivariate GLMM reduces to a special
case of the Sammel, Ryan and Legler model when the latent variable is not modelled

as a function of the covariates. It is possible to allow the random effects in our models

to depend on covariates but we have not considered that extension here.

Several other models discussed in previous sections turn out to be special cases of
this general formulation. For example the bivariate Rasch model is obtained by spec-
ifying Bernoulli response distributions for both variables, using the logit link function
and identity design matrices both for the fixed and for the random effects. The mul-

tivariate binomial-logit normal model is also a special case for binary responses with

identity design matrix for the random effects and unrestricted variance-covariance
structure. The Aitchison and Ho multivariate Poisson log-normal model also falls un-

der this general structure when the response variables are assumed to have a Poisson distribution.
2.3 Model Properties

Exactly as in GLMM, the conditional moments in the multivariate GLMM are
directly modelled, while marginal moments are harder to find. The marginal means
and the marginal variances of y_i1 and y_i2 for the model defined by (2.1), (2.2) and (2.3) are
the same as those of the GLMM considering one variable at a time:

E(y_i1) = E[E(y_i1 | b_i1)] = E[μ_i1(β_1, b_i1)],

E(y_i2) = E[μ_i2(β_2, b_i2)],

Var(y_i1) = E[Var(y_i1 | b_i1)] + Var[E(y_i1 | b_i1)] = E[φ_1 V(μ_i1)] + Var[μ_i1],

Var(y_i2) = E[φ_2 V(μ_i2)] + Var[μ_i2],

where V(μ_i1) and V(μ_i2) denote the variance functions corresponding to the expo-
nential family distributions for the two response variables.
The marginal covariance matrix between y_i1 and y_i2 is found to be equal to the covariance matrix between the conditional means μ_i1 and μ_i2:

Cov(y_i1, y_i2) = E(y_i1 y_i2^T) − E(y_i1)E(y_i2^T) =

= E[E(y_i1 y_i2^T | b_i1, b_i2)] − E(μ_i1)E(μ_i2^T) =

= E[E(y_i1 | b_i1) E(y_i2^T | b_i2)] − E(μ_i1)E(μ_i2^T) =

= Cov(μ_i1, μ_i2).

The latter property is a consequence of the key assumption of conditional indepen-
dence between the two response variables. This assumption allows one to extend
model fitting methods from the univariate to the multivariate GLMM but may not
hold in certain situations. This issue will be discussed in more detail in Chapter 4,
where score tests are proposed for verifying conditional independence.
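The equality of Cov(y_i1, y_i2) and Cov(μ_i1, μ_i2) under conditional independence can be verified by simulation. In the sketch below (all parameters invented), a Bernoulli and a Poisson outcome share correlated normal random intercepts, and the two sample covariances agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(4)

# Correlated random intercepts for the two outcomes (values invented)
Sigma = np.array([[0.8, 0.4],
                  [0.4, 0.8]])
b = rng.multivariate_normal([0.0, 0.0], Sigma, size=400000)

mu1 = 1.0 / (1.0 + np.exp(-(-0.2 + b[:, 0])))  # conditional mean, logit link
mu2 = np.exp(0.5 + b[:, 1])                    # conditional mean, log link
y1 = rng.binomial(1, mu1)                      # conditionally independent draws
y2 = rng.poisson(mu2)

cov_y = np.cov(y1, y2)[0, 1]
cov_mu = np.cov(mu1, mu2)[0, 1]
print(cov_y, cov_mu)   # close: Cov(y1, y2) = Cov(mu1, mu2)
```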

2.4 Applications

The multivariate GLMM is fitted to two data sets. The first one is a dataset from
a developmental toxicity study of ethylene glycol (EG) in mice conducted through the
National Toxicology Program (Price et al., 1985). The experiment involved four ran-
domly chosen groups of pregnant mice, one group serving as a control and the other
three exposed to three different levels of EG during major organogenesis. Following

Table 2.1. Descriptive statistics for the Ethylene Glycol data
Dose (g/kg) Dams Live Fetuses Fetal Weight (g) Malformation
Mean SD Number Percent
0 25 297 0.972 0.098 1 0.34
0.75 24 276 0.877 0.104 26 9.42
1.50 22 229 0.764 0.107 89 38.86
3.00 23 226 0.704 0.124 126 57.08

sacrifice, measurements were taken on each fetus in the uterus. The two outcome

measures on each live fetus of interest to us are fetal weight (continuous) and malfor-

mation status (dichotomous). Some descriptive statistics for the data are available in

Table 2.1. Fetal weight decreases monotonically with increasing dose with the average

weight ranging from 0.972 g in the control group to 0.704 g in the group administered

the highest dose. At the same time the malformation rate increases with dose from

0.3% in the control group to 57% in the group administered the highest dose.

The goal of the analysis is to study the joint effects of increasing dose on fetal

weight and on the probability of malformation. The analysis of these data is compli-

cated by the correlations between the repeated measures on fetuses within litter. A

multivariate GLMM with random intercepts for each variable allows one to explicitly

model the correlation structure within litter and provides subject-specific estimates

for the regression parameters.

The second data set is from a study to compare the effects of 9 drugs and a placebo

on the patterns of myoelectric activity in the intestines of ponies (Lester et al., 1998a,
1998b, 1998c). For that purpose electrodes are attached to four different areas of the

intestines of 6 ponies, and spike burst rate and duration are measured at 18 equally

spaced time intervals around the time of each drug administration. Six of the drugs

and the placebo are given twice to each pony in a randomized complete block design.

The remaining three drugs are not given to all ponies and hence will not be analyzed

here. There is a rest period after each drug's administration and no carry-over effects

Table 2.2. Descriptive statistics for pony data
Duration Count
Pony Mean SD Mean SD Corr.
1 1.03 0.25 77.57 45.53 0.59
2 1.35 0.28 128.25 76.08 0.18
3 1.40 0.36 84.33 46.27 0.45
4 1.27 0.48 111.00 49.66 0.21
5 1.14 0.21 75.44 45.03 0.50
6 1.24 0.40 66.86 48.97 0.32

are expected. The spike burst rate is a count variable reflecting the number of contractions exceeding a certain threshold in 15 minutes. The duration variable reflects

average duration of the contractions in each 15-minute interval. Figure 2.1 shows

graphical representations of the averaged responses by drug and time for one of the

electrodes and one hour after the drug administration. Table 2.2 shows the sample

means, standard deviations and correlations between the two outcome variables by

pony for this smaller data set. We analyse a restricted data set for reasons of com-

putational and practical feasibility. This issue is discussed in more detail in Chapter



[Figure 2.1 appears here: two panels showing the mean count and mean duration responses plotted against time period (1.0 to 4.0).]

Figure 2.1 Count and duration trends over time for the pony data. Each trajectory shows the change in mean response for one of the seven drugs.

This chapter focuses on methods for obtaining maximum likelihood (or approx-

imate maximum likelihood) estimates for the model parameters in the multivariate

GLMM. We first mention some approaches proposed for the univariate GLMM (Sec-

tion 3.1), and then describe in detail extensions of three of those approaches for the

multivariate GLMM (Section 3.2). The proposed methods are then illustrated on a

simulated data example (Section 3.3), and on the two 'real-life' data sets introduced

in Chapter 2 (Section 3.4). Some issues such as standard error variability and advan-

tages of one multivariate versus several univariate analyses are addressed in Sections

3.3 and 3.4. The chapter concludes with discussion of some additional model-fitting

methods (Section 3.5).

3.1 Introduction

In GLMM the marginal likelihood of the response is obtained by integrating out

the random effects

∏_{i=1}^n ∫ f(y_i | b_i; β, φ) f(b_i) db_i,

where f(bi) denotes a normal density. Usually these integrals are not analytically

tractable and some kind of approximation must be used.

Direct approaches to maximum likelihood estimation are based on numerical or

stochastic approximations of the integrals and on numerical maximizations of those

approximations. Probably the most widely used numerical approximation procedure

for this type of integral is Gauss-Hermite quadrature. It involves evaluating the
integrands at m prespecified quadrature points and substituting weighted sums in

place of the intractable integrals for the n subjects. If the number of quadrature

points m is large the approximation can be made very accurate but to keep the

numerical effort low m should be kept as small as possible. Gauss-Hermite quadrature

is appropriate when the dimension of the random effects is small. An alternative for
high-dimensional random effects is to approximate the integrals by Monte Carlo sums.

This involves generating m random values from the random effects distribution for

each subject, evaluating the conditional densities f(yilbi;/3, 0) at those values and

taking averages. Details on how to perform Gauss-Hermite quadrature and Monte

Carlo approximation for the GLMM can be found in Fahrmeir and Tutz (1994, pp.357-
365). We discuss Gauss-Hermite quadrature for the multivariate GLMM in Section
3.2. Liu and Pierce (1994) consider adaptive Gaussian quadrature, which allows a
reduction in the required number of quadrature points by centering and scaling them

around the mode of the integrand function. This procedure is described in more detail

and applied to one of the data examples in Section 3.4.
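For a single random intercept the two approximations can be compared directly. The sketch below (with invented η, σ and observed y) approximates one Bernoulli subject's marginal likelihood contribution by 20-point Gauss-Hermite quadrature and by a Monte Carlo average over draws from the random effects distribution.

```python
import numpy as np

rng = np.random.default_rng(5)

# One subject's contribution: L = ∫ f(y | b) φ(b; 0, σ²) db,
# with y | b ~ Bernoulli(expit(η + b)); η, σ, y invented for the sketch.
eta, sigma, y = 0.4, 1.2, 1

def cond_lik(b):
    p = 1.0 / (1.0 + np.exp(-(eta + b)))
    return p**y * (1.0 - p)**(1 - y)       # Bernoulli likelihood f(y | b)

# (i) Gauss-Hermite: substitute b = sqrt(2)*sigma*z, divide by sqrt(pi)
nodes, weights = np.polynomial.hermite.hermgauss(20)
gh = (weights * cond_lik(np.sqrt(2.0) * sigma * nodes)).sum() / np.sqrt(np.pi)

# (ii) Monte Carlo: average f(y | b) over draws b ~ N(0, sigma²)
mc = cond_lik(rng.normal(0.0, sigma, size=200000)).mean()
print(gh, mc)   # the two approximations agree closely
```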

Indirect approaches to maximum likelihood estimation use the EM-algorithm
(Dempster, Laird and Rubin, 1977), treating the random effects as the missing data.

We apply these methods both to the multivariate GLMM and to the correlated pro-
bit model and therefore we now introduce the basic ideas. Hereafter ψ denotes the vector of all unknown parameters in the problem, and ψ̂ denotes some estimate of ψ.

The EM algorithm is an iterative technique for finding maximum likelihood estimates
when direct maximization of the observed likelihood f(y; ψ) is not feasible. It involves

augmenting the observed data by unobserved data so that maximization at each step
of the algorithm is considerably simplified. The unobserved data are denoted by b
and in the GLMM context these are the random effects. The EM-algorithm can be

summarized as follows:

1. Select a starting value ψ^(0). Set r = 0.

2. Increase r by 1.

E-step: Calculate E{ln f(y, b; ψ) | y; ψ^(r−1)}.

3. M-step: Find a value ψ^(r) of ψ that maximizes this conditional expectation.

4. Iterate between (2) and (3) until convergence is achieved.

In the GLMM context the complete data is u = (y^T, b^T)^T and the complete data log-likelihood is given by

ln L_u = Σ_{i=1}^n ln f(y_i | b_i; β, φ) + Σ_{i=1}^n ln f(b_i | Σ).

The rth E-step of the EM algorithm involves computing E(ln L_u | y, ψ^(r−1)) and the rth M-step maximizes this quantity with respect to ψ and updates the parameter estimates. Notice that because β and φ enter only the first term of L_u, the M-step

with respect to β and φ uses only f(y | b) and so it is similar to a standard generalized

linear model computation with the values of b treated as known. Maximizing with

respect to Σ is just maximum likelihood using the distribution of b after replacing

sufficient statistics with their conditional expected values. In general, the conditional

expectations in the E-step can not be computed in closed form, but Gauss-Hermite

or different Monte Carlo approximations can be utilized (Fahrmeir and Tutz, 1994, pp. 362-365; McCulloch, 1997; Booth and Hobert, 1999).
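A minimal illustration of the Monte Carlo flavour of the E-step: for one subject with binary responses, the conditional expectation E(b_i | y_i) can be approximated by weighting prior draws from the random effects distribution by the conditional likelihood f(y_i | b_i). Everything below is an invented toy example, not the algorithm applied later in the dissertation.

```python
import numpy as np

rng = np.random.default_rng(6)

# One subject, four binary responses, logit link, b ~ N(0, σ²); values invented
eta, sigma = 0.0, 1.0
y = np.array([1, 1, 0, 1])

b = rng.normal(0.0, sigma, size=100000)              # draws from the prior
p = 1.0 / (1.0 + np.exp(-(eta + b[:, None])))        # conditional success probs
w = np.prod(p**y * (1.0 - p)**(1 - y), axis=1)       # f(y | b) for each draw
posterior_mean_b = (w * b).sum() / w.sum()           # weighted average
print(posterior_mean_b)   # positive: three of the four responses are successes
```

In an actual Monte Carlo EM implementation one would instead draw from (or importance-sample toward) f(b | y) and average the complete-data log-likelihood, but the weighting idea is the same.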
Many authors consider tractable analytical approximations to the likelihood (Bres-

low and Clayton, 1993; Wolfinger and O'Connell, 1994). Although these methods lead
to inconsistent estimates in some cases, they may have considerable advantage in com-

putational speed over 'exact' methods and can be fitted with standard software. Both
Breslow and Clayton's and Wolfinger and O'Connell's procedures amount to itera-

tive fitting of normal theory linear mixed models and can be implemented using the

%GLIMMIX macro in SAS. An extension of Wolfinger and O'Connell's approach

is considered in Section 3.2.

A Bayesian paradigm with flat priors can also be used to approximate the maxi-

mum likelihood estimates using posterior modes (Fahrmeir and Tutz, 1994, pp.233-

238) or posterior means (Zeger and Karim, 1991). Though the numerator in such

computations is the same as for the maximum likelihood calculations, the posterior

may not exist for diffuse priors (Natarajan and McCulloch, 1995). This may not be

detected using computational techniques such as the Gibbs sampler, and can result

in incorrect parameter estimates (Hobert and Casella, 1996).

3.2 Maximum Likelihood Estimation

For simplicity of presentation we again consider the case of only two response

variables. The marginal likelihood in the bivariate GLMM is obtained as in the usual

GLMM by integrating out the random effects

∏_{i=1}^n ∫ { ∏_{j=1}^{n_i} f_1(y_i1j | b_i1; β_1, φ_1) f_2(y_i2j | b_i2; β_2, φ_2) } f(b_i1, b_i2; Σ) db_i1 db_i2,   (3.1)

where f denotes the multivariate normal density of the random effects. In this sec-

tion we describe how Gauss-Hermite quadrature, Monte Carlo EM algorithm and

pseudo-likelihood can be used to obtain estimates in the multivariate GLMM. Both

Gaussian quadrature and the Monte Carlo EM algorithm are referred to as 'exact

maximum likelihood' methods because they target the exact maximum likelihood es-

timates. In comparison, methods that use analytical approximations are referred to

as 'approximate' maximum likelihood. We now start by describing the 'exact' maxi-
mum likelihood methods. Hereafter ψ = vec(β_1, β_2, φ_1, φ_2, δ), where δ are the unknown variance components in Σ.

3.2.1 Gauss-Hermite Quadrature

Parameter Estimation

The marginal log-likelihood in the multivariate GLMM is expressed as a sum of

the individual log-likelihoods for all subjects, that is

ln L(y; ψ) = Σ_{i=1}^n ln L_i(ψ),

where

L_i(ψ) = ∫ { ∏_{j=1}^{n_i} f_1(y_i1j | b_i1; β_1, φ_1) f_2(y_i2j | b_i2; β_2, φ_2) } (2π)^{-q/2} |Σ|^{-1/2} exp(−½ b_i^T Σ^{-1} b_i) db_i.

Hence Gauss-Hermite quadrature (Fahrmeir and Tutz, 1994) involves numerical approximations of n q-dimensional integrals (q = q_1 + q_2). The b_i's are first transformed so that each integral has the form

L_i = π^{-q/2} ∫_{R^q} h(z_i) exp(−z_i^T z_i) dz_i.

The needed transformation is b_i = √2 L z_i, where Σ = L L^T. L is the lower triangular Choleski factor of Σ and it always exists because Σ is non-negative definite. Here

h(z_i) = ∏_{j=1}^{n_i} f_1(y_i1j | z_i1; β_1, φ_1) f_2(y_i2j | z_i2; β_2, φ_2).

Each integral is then approximated by

L_i^{GQ} = π^{-q/2} Σ_{k_1=1}^m ... Σ_{k_q=1}^m v_k h(z^(k)),

where z^(k) = √2 L d^(k) for the multiple index k = (k_1, ..., k_q) and d^(k) denote the tabled nodes of univariate Gauss-Hermite integration of order m (Abramowitz and Stegun, 1972). The corresponding weights are given by v_k = ∏_{l=1}^q w_{k_l}, where the w_{k_l} are the tabled univariate weights, l = 1, ..., m.
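As a sketch (not from the dissertation), the q-dimensional product-rule approximation with Cholesky-scaled nodes can be checked numerically; the covariance matrix and integrand below are invented, and NumPy's hermgauss supplies the tabled nodes and weights. The integrand h(b) = e^{b_1} has the known expectation e^{Σ_11/2} under b ~ N_2(0, Σ).

```python
import numpy as np

# Invented 2-dimensional covariance for the check (q = 2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
L = np.linalg.cholesky(Sigma)                      # Sigma = L L^T
d, w = np.polynomial.hermite.hermgauss(15)         # tabled nodes and weights

# All multi-indices k = (k1, k2) via a grid; v_k = w_{k1} * w_{k2}
D1, D2 = np.meshgrid(d, d, indexing="ij")
W = np.outer(w, w)
Z = np.sqrt(2.0) * (np.stack([D1, D2], axis=-1) @ L.T)  # nodes z(k) = sqrt(2) L d(k)

h = np.exp(Z[..., 0])                              # example integrand h(b) = e^{b1}
approx = (W * h).sum() / np.pi                     # divide by pi^{q/2} with q = 2
print(approx, np.exp(Sigma[0, 0] / 2))             # matches the lognormal mean
```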
The maximization algorithm then proceeds as follows:

1. Choose an initial estimate for the parameter vector ψ^(0). Set r = 0.

2. Increase r by 1. Approximate each of the integrals L_i(ψ^(r−1)) by L_i^{GQ}(ψ^(r−1)) using m quadrature points in each direction.

3. Maximize the approximation with respect to ψ using a numerical maximization procedure.

4. Iterate between steps (2) and (3) until the parameter estimates have converged.
A popular numerical maximization procedure for step 3 is the Newton-Raphson
method. It involves iteratively solving the equations

    ψ⁽ʳ⁾ = ψ⁽ʳ⁻¹⁾ + J⁻¹(ψ⁽ʳ⁻¹⁾) S(ψ⁽ʳ⁻¹⁾),

where S(ψ) = ∂lnL/∂ψ is the score vector and J(ψ⁽ʳ⁻¹⁾) = -∂²lnL/∂ψ∂ψᵀ evaluated at
ψ⁽ʳ⁻¹⁾ is the observed information matrix. One possible criterion for convergence is

    max_s |ψ_s⁽ʳ⁺¹⁾ - ψ_s⁽ʳ⁾| / (|ψ_s⁽ʳ⁾| + δ₂) < δ₁,

where ψ_s⁽ʳ⁾ denotes the estimate of the sth element of the parameter vector at step r
of the algorithm and δ₁ and δ₂ are chosen to be small positive numbers. The role of
δ₂ is to prevent numerical problems stemming from estimates close to zero. Another
frequently used criterion is

    ||ψ⁽ʳ⁺¹⁾ - ψ⁽ʳ⁾|| < δ₁,

where ||·|| denotes the Euclidean norm.
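A minimal sketch of the Newton-Raphson update with the first (relative-change) convergence criterion follows; numpy is assumed, the score and observed information are supplied as functions, and the toy quadratic log-likelihood at the bottom is purely illustrative.

```python
import numpy as np

def newton_raphson(score, obs_info, psi0, delta1=1e-4, delta2=1e-6, max_iter=100):
    """Iterate psi^(r) = psi^(r-1) + J^{-1} S until
    max_s |psi_s^(r+1) - psi_s^(r)| / (|psi_s^(r)| + delta2) < delta1."""
    psi = np.asarray(psi0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(obs_info(psi), score(psi))
        psi_new = psi + step
        if np.max(np.abs(psi_new - psi) / (np.abs(psi) + delta2)) < delta1:
            return psi_new
        psi = psi_new
    return psi

# Toy log-likelihood l(psi) = -(psi - 2)^2: score = -2(psi - 2), J = 2.
psi_hat = newton_raphson(lambda p: -2 * (p - 2),
                         lambda p: np.array([[2.0]]),
                         [0.0])
```

For a quadratic log-likelihood the update reaches the maximizer (here ψ̂ = 2) in a single step, which is why Newton-Raphson converges quickly near the maximum.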

The numerical maximization procedure MaxBFGS, which we are going to use for
the numerical examples later in this chapter, is based on two criteria for convergence:

    |S_s(ψ⁽ʳ⁾)| ≤ ε|ψ_s⁽ʳ⁾| for all s when ψ_s⁽ʳ⁾ ≠ 0,
    |S_s(ψ⁽ʳ⁾)| ≤ ε for all s when ψ_s⁽ʳ⁾ = 0,

and

    |ψ_s⁽ʳ⁺¹⁾ - ψ_s⁽ʳ⁾| ≤ 10ε|ψ_s⁽ʳ⁾| for all s when ψ_s⁽ʳ⁾ ≠ 0,
    |ψ_s⁽ʳ⁺¹⁾ - ψ_s⁽ʳ⁾| ≤ 10ε for all s when ψ_s⁽ʳ⁾ = 0,

where S_s denotes the sth component of the score vector.

Standard Error Estimation

After the algorithm has converged, estimates of the standard errors of the pa-
rameter estimates can be based on the observed information matrix. By asymptotic
maximum-likelihood theory

    Var(ψ̂) ≈ I⁻¹(ψ̂),

where I(ψ) = E(-∂²lnL/∂ψ∂ψᵀ) is the expected information matrix. But the observed
information matrix J(ψ̂) is easier to obtain and hence we will use the latter to ap-
proximate the standard errors:

    SE(ψ̂_s) = √(j^{ss}),

where j^{ss} is the sth diagonal element of the inverse of the observed information ma-
trix. Note that if the Newton-Raphson maximization method is used, the observed
information matrix is a by-product of the algorithm.

Numerical and Exact Derivatives

The observed information matrix and the score vector are not available in closed
form and must be approximated. One can either compute numerical derivatives
of the approximated log-likelihood, or approximate the intractable integrals in the
expressions for the exact derivatives. The first approach is simpler to implement but
may require a large number of quadrature points.

The method of finite differences can be used to find numerical derivatives. The
sth element of the score vector and the (s, t)th element of the information matrix are
approximated as follows:

    S_s(ψ) ≈ ∑ᵢ₌₁ⁿ [lᵢ(ψ + ε₁z_s) - lᵢ(ψ - ε₁z_s)] / (2ε₁),

    J_{st}(ψ) ≈ -∑ᵢ₌₁ⁿ [lᵢ(ψ + ε₁z_s + ε₂z_t) - lᵢ(ψ + ε₁z_s - ε₂z_t)
               - lᵢ(ψ - ε₁z_s + ε₂z_t) + lᵢ(ψ - ε₁z_s - ε₂z_t)] / (4ε₁ε₂),

where lᵢ = ln Lᵢ^{GQ}, ε₁ and ε₂ are suitably chosen step lengths, and z_s and z_t are unit
vectors. The unit vectors have ones in positions s and t respectively and the rest of
their elements are zeros.
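The finite-difference formulas above can be sketched as follows; numpy is assumed, and the toy quadratic log-likelihood stands in for the quadrature-approximated log-likelihood of the model.

```python
import numpy as np

def num_score(loglik, psi, eps1=1e-5):
    """Central-difference approximation to the score vector S(psi)."""
    psi = np.asarray(psi, dtype=float)
    s = np.zeros(len(psi))
    for j in range(len(psi)):
        z = np.zeros(len(psi)); z[j] = eps1        # unit vector times step length
        s[j] = (loglik(psi + z) - loglik(psi - z)) / (2 * eps1)
    return s

def num_obs_info(loglik, psi, eps1=1e-4, eps2=1e-4):
    """Finite-difference approximation to J(psi) = -d2 lnL / dpsi dpsi'."""
    psi = np.asarray(psi, dtype=float)
    p = len(psi)
    J = np.zeros((p, p))
    for s in range(p):
        for t in range(p):
            zs = np.zeros(p); zs[s] = eps1
            zt = np.zeros(p); zt[t] = eps2
            J[s, t] = -(loglik(psi + zs + zt) - loglik(psi + zs - zt)
                        - loglik(psi - zs + zt) + loglik(psi - zs - zt)) / (4 * eps1 * eps2)
    return J

# Toy check: l(psi) = -0.5*(psi1^2 + 2*psi2^2) has score (-psi1, -2*psi2), J = diag(1, 2).
ll = lambda p: -0.5 * (p[0] ** 2 + 2 * p[1] ** 2)
```

In practice the step lengths trade off truncation error against rounding error, which is one reason the numerical approach can demand a large number of quadrature points for the log-likelihood itself to be evaluated accurately.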
An alternative to numerical derivatives is to approximate the exact derivatives.
The latter are obtained by interchanging the differentiation and integration signs and
are as follows:

    ∂lnL/∂ψ = ∑ᵢ₌₁ⁿ [∫ (∂lnfᵢ/∂ψ) fᵢ dbᵢ] / [∫ fᵢ dbᵢ],

    ∂²lnL/∂ψ∂ψᵀ = ∑ᵢ₌₁ⁿ { [∫ (∂²lnfᵢ/∂ψ∂ψᵀ + (∂lnfᵢ/∂ψ)(∂lnfᵢ/∂ψᵀ)) fᵢ dbᵢ] [∫ fᵢ dbᵢ]
                   - [∫ (∂lnfᵢ/∂ψ) fᵢ dbᵢ] [∫ (∂lnfᵢ/∂ψᵀ) fᵢ dbᵢ] } / (∫ fᵢ dbᵢ)²,

where

    fᵢ = { ∏ⱼ f₁(y_{i1j}|bᵢ; β₁, φ₁) f₂(y_{i2j}|bᵢ; β₂, φ₂) } f(bᵢ; Σ).

For each subject there are three intractable integrals which need to be approximated:
∫ fᵢ dbᵢ, ∫ (∂lnfᵢ/∂ψ) fᵢ dbᵢ and ∫ (∂²lnfᵢ/∂ψ∂ψᵀ + (∂lnfᵢ/∂ψ)(∂lnfᵢ/∂ψᵀ)) fᵢ dbᵢ, and
Gauss-Hermite quadrature is used for each one of them. Note that

    lnfᵢ = ∑ⱼ₌₁^{nᵢ} lnf₁(y_{i1j}|bᵢ; β₁, φ₁) + ∑ⱼ₌₁^{nᵢ} lnf₂(y_{i2j}|bᵢ; β₂, φ₂) + lnf(bᵢ; Σ).

Hence ∂lnfᵢ/∂β₁ = ∑ⱼ ∂lnf₁/∂β₁, ∂lnfᵢ/∂β₂ = ∑ⱼ ∂lnf₂/∂β₂, and ∂lnfᵢ/∂δ = ∂lnf(bᵢ; Σ)/∂δ.
Similarly the derivatives with respect to the scale parameters φ₁ and φ₂, if needed, are
obtained by differentiating the first and second terms in the summation for lnfᵢ. This
leads to simple expressions for the integrands because the functions that are
differentiated are members of the exponential family (Fahrmeir and Tutz, 1994).
Note that the random effects distribution is always multivariate normal, which
complicates the expressions for the second-order derivatives with respect to the vari-
ance components. Jennrich and Schluchter (1986) propose an elegant approach to deal
with this problem and we now illustrate their method for the multivariate GLMM. For
simplicity of notation consider bivariate random effects bᵢ. Let

    δ = (δ₁, δ₂, δ₃)ᵀ = (σ₁, σ₂, ρ)ᵀ

and let

    Σ = ( σ₁²     ρσ₁σ₂ )
        ( ρσ₁σ₂   σ₂²   ).

Then write

    Σ = σ₁² (1 0; 0 0) + σ₂² (0 0; 0 1) + ρσ₁σ₂ (0 1; 1 0).

Let Σ̇_s = ∂Σ/∂δ_s, s = 1, ..., 3. Then

    ∂lnfᵢ/∂δ_s = ½ tr{Σ⁻¹Σ̇_sΣ⁻¹(bᵢbᵢᵀ - Σ)},

    ∂²lnfᵢ/∂δ_s∂δ_t = ½ tr{Σ⁻¹Σ̈_{st}Σ⁻¹(bᵢbᵢᵀ - Σ)} + ½ tr{Σ⁻¹Σ̇_sΣ⁻¹Σ̇_t}
                      - tr{Σ⁻¹Σ̇_sΣ⁻¹Σ̇_tΣ⁻¹bᵢbᵢᵀ},

where Σ̈_{st} = ∂²Σ/∂δ_s∂δ_t.

In the above parametrization of the variance-covariance matrix there are restrictions
on the parameter space (σ₁ > 0, σ₂ > 0, -1 < ρ < 1), but the algorithm does not
guarantee that the current estimates will stay in the restricted parameter space. It is
then preferable to maximize with respect to the elements of the Choleski root of Σ.
The Choleski root is a lower triangular matrix

    L = ( l₁₁  0   )
        ( l₂₁  l₂₂ )

such that Σ = LLᵀ. Its elements are held unrestricted in intermediate steps of the
algorithm, and hence they cannot lead to a negative-definite final estimate of Σ. Such
an outcome would indicate either a numerical problem in the procedure or an
inappropriate model. However, if only intermediate estimates of Σ are negative-definite,
then the maximization algorithm with respect to the elements of the Choleski root
will work well while the other one can have problems.

Although the parametrization in terms of l₁₁, l₂₁ and l₂₂ has fewer numerical
problems than the parametrization in terms of σ₁, σ₂ and ρ, it does not have a nice
interpretation. It then seems reasonable to use the Choleski root in the maximization
procedure but to perform one extra partial step of the algorithm to approximate the
information matrix for the meaningful parametrization σ₁, σ₂ and ρ.
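The reparametrization can be sketched in a few lines; numpy is assumed and the function names are illustrative, not the dissertation's Ox code. The optimizer works with the unrestricted Choleski elements, and the meaningful parameters are recovered at the end.

```python
import numpy as np

def sigma_from_choleski(l11, l21, l22):
    """Build Sigma = L L' from unrestricted Choleski elements."""
    L = np.array([[l11, 0.0], [l21, l22]])
    return L @ L.T

def meaningful_params(Sigma):
    """Recover (sigma1, sigma2, rho) from a positive definite Sigma."""
    s1, s2 = np.sqrt(Sigma[0, 0]), np.sqrt(Sigma[1, 1])
    return s1, s2, Sigma[0, 1] / (s1 * s2)

# Any real (l11, l21, l22) yields a non-negative definite Sigma,
# so intermediate steps of the optimizer cannot leave the parameter space:
Sigma = sigma_from_choleski(1.0, -0.5, 0.8)
s1, s2, rho = meaningful_params(Sigma)   # rho automatically lies in [-1, 1]
```

Standard errors for (σ₁, σ₂, ρ) can then be obtained by the extra partial step mentioned above, applying the delta method (or one re-evaluation of the information matrix) at the converged Σ.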

3.2.2 Monte Carlo EM Algorithm

Parameter Estimation

The complete data for the multivariate GLMM is u = (yᵀ, bᵀ)ᵀ and hence the
complete data log-likelihood is given by

    lnL_u = ∑ᵢ₌₁ⁿ ∑_{j₁=1}^{nᵢ} ln f₁(y_{i1j₁}|b_{i1}; β₁, φ₁)
          + ∑ᵢ₌₁ⁿ ∑_{j₂=1}^{nᵢ} ln f₂(y_{i2j₂}|b_{i2}; β₂, φ₂) + ∑ᵢ₌₁ⁿ ln f(bᵢ; Σ).

Notice that (β₁, φ₁), (β₂, φ₂) and Σ appear in different terms in the log-likelihood
and therefore each M-step of the algorithm consists of three separate maximizations
of E[∑ᵢ∑_{j₁} ln f₁(y_{i1j₁}|b_{i1}; β₁, φ₁)|y], E[∑ᵢ∑_{j₂} ln f₂(y_{i2j₂}|b_{i2}; β₂, φ₂)|y], and
E[∑ᵢ ln f(bᵢ; Σ)|y] respectively, evaluated at the current parameter estimate of ψ.
Recall that for the regular GLMM there are two separate terms to be maximized.

These conditional expectations do not have closed form expressions but can be
approximated by Monte Carlo estimates:

    (1/m) ∑ₖ₌₁ᵐ ∑ᵢ₌₁ⁿ ∑_{j₁=1}^{nᵢ} ln f₁(y_{i1j₁}|b_{i1}^{(k)}; β₁, φ₁),    (3.2)

    (1/m) ∑ₖ₌₁ᵐ ∑ᵢ₌₁ⁿ ∑_{j₂=1}^{nᵢ} ln f₂(y_{i2j₂}|b_{i2}^{(k)}; β₂, φ₂),    (3.3)

    (1/m) ∑ₖ₌₁ᵐ ∑ᵢ₌₁ⁿ ln f(bᵢ^{(k)}; Σ),    (3.4)

where bᵢ^{(k)} = (b_{i1}^{(k)}, b_{i2}^{(k)}), k = 1, ..., m, are simulated values from the distribution
of bᵢ|yᵢ, evaluated at the current parameter estimate of ψ. In order to obtain simulated
values for bᵢ, multivariate rejection sampling or importance sampling can be used
as proposed by Booth and Hobert (1999). At the rth step of the Monte Carlo EM
algorithm the multivariate rejection sampling algorithm is as follows:
For each subject i, i = 1, ..., n:

1. Compute τᵢ = sup_{bᵢ∈R^q} f(yᵢ|bᵢ; ψ⁽ʳ⁻¹⁾).

2. Sample bᵢ^{(k)} from the multivariate normal density f(bᵢ; Σ⁽ʳ⁻¹⁾) and indepen-
dently sample w^{(k)} from Uniform(0, 1).

3. If w^{(k)} ≤ f(yᵢ|bᵢ^{(k)}; ψ⁽ʳ⁻¹⁾)/τᵢ then accept bᵢ^{(k)}; if not, go to 2.

Iterate between (2) and (3) until m generated values of bᵢ are accepted.
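The rejection step can be sketched generically as follows; numpy is assumed, `log_f_y_given_b` stands in for the model-specific conditional likelihood, and the toy check at the bottom uses f(y|b) = exp(-b²), whose supremum τ = 1 is attained at b = 0.

```python
import numpy as np

def rejection_sample(log_f_y_given_b, log_tau, chol_Sigma, q, m, rng):
    """Draw m values from f(b | y) by rejection: propose b ~ N(0, Sigma),
    accept when u <= f(y | b) / tau, with tau = sup_b f(y | b)."""
    accepted = []
    while len(accepted) < m:
        b = chol_Sigma @ rng.standard_normal(q)        # proposal from f(b; Sigma)
        u = rng.uniform()
        if np.log(u) <= log_f_y_given_b(b) - log_tau:  # accept/reject on the log scale
            accepted.append(b)
    return np.array(accepted)

rng = np.random.default_rng(1)
# Toy likelihood f(y|b) = exp(-b^2), so log f = -b^2 and log tau = 0.
draws = rejection_sample(lambda b: -b[0] ** 2, 0.0, np.eye(1), 1, 5000, rng)
# Accepted draws have density proportional to exp(-b^2/2) exp(-b^2), i.e. N(0, 1/3).
```

Because the accepted draws are independent, the Monte Carlo error of sums such as (3.2)-(3.4) can be assessed directly, which is the property Booth and Hobert exploit below.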
Once m simulated values for the random effects are available, the Monte Carlo
estimates of the conditional expectations can be maximized using a numerical
maximization procedure such as the Newton-Raphson algorithm. Convergence can be
claimed when successive values of the parameters are within some required precision.
The same convergence criteria as those mentioned for Gauss-Hermite quadrature
can be applied, but with a generally larger δ₁ value. Booth and Hobert (1999) suggest
using δ₁ between 0.002 and 0.005. This is needed to avoid excessively large simulation
sample sizes.
A nice feature of the Booth and Hobert algorithm is that because independent
random sampling is used, Monte Carlo error can be assessed at each iteration and
the simulation sample size can be automatically increased. The idea is that it is
inefficient to start with a large simulation sample but as the estimates get closer to
the maximum likelihood estimates the simulation sample size must be increased to
provide enough precision for convergence.
Booth and Hobert propose to increase the simulation sample size m by m/t,
where t is some positive number, if ψ⁽ʳ⁻¹⁾ lies in a 95% confidence ellipsoid constructed
around ψ⁽ʳ⁾, that is, if

    (ψ⁽ʳ⁾ - ψ⁽ʳ⁻¹⁾)ᵀ [Var(ψ⁽ʳ⁾ - ψ⁽ʳ⁻¹⁾)]⁻¹ (ψ⁽ʳ⁾ - ψ⁽ʳ⁻¹⁾) ≤ c²,

where c² denotes the 95th percentile of the chi-square distribution with number of de-
grees of freedom equal to the dimension of the parameter vector. To describe what
variance approximation is used above some notation must be introduced. Denote the
maximizer of the true conditional expectation Q = E(lnL_u(ψ)|y, ψ⁽ʳ⁻¹⁾) by ψ*⁽ʳ⁾. Then

    Var(ψ⁽ʳ⁾ - ψ*⁽ʳ⁾) ≈ [Q_m⁽²⁾(ψ*⁽ʳ⁾|ψ⁽ʳ⁻¹⁾)]⁻¹ Var{Q_m⁽¹⁾(ψ*⁽ʳ⁾|ψ⁽ʳ⁻¹⁾)} [Q_m⁽²⁾(ψ*⁽ʳ⁾|ψ⁽ʳ⁻¹⁾)]⁻¹.

Here Q_m = (1/m) ∑ₖ₌₁ᵐ lnL_u(b^{(k)}, y; ψ) and Q_m⁽¹⁾ and Q_m⁽²⁾ are the vector and matrix
of first and second derivatives of Q_m respectively. A sandwich estimate of the variance
is obtained by substituting ψ⁽ʳ⁾ in place of ψ*⁽ʳ⁾ and using the estimate

    Var{Q_m⁽¹⁾(ψ⁽ʳ⁾|ψ⁽ʳ⁻¹⁾)} ≈ (1/m²) ∑ₖ₌₁ᵐ [∂lnL_u(b^{(k)}, y; ψ⁽ʳ⁾)/∂ψ] [∂lnL_u(b^{(k)}, y; ψ⁽ʳ⁾)/∂ψᵀ].
In summary, the algorithm works as follows:

1. Select an initial estimate ψ⁽⁰⁾. Set r = 0.
2. Increase r by 1.
   E-step: For each subject i, i = 1, ..., n, generate m random samples from the
   distribution of bᵢ|yᵢ; ψ⁽ʳ⁻¹⁾ using rejection sampling.
3. M-step: Update the parameter estimate of ψ by maximizing (3.2), (3.3) and (3.4).
4. Iterate between (2) and (3) until convergence is achieved.

Standard Error Estimation

After the algorithm has converged, standard errors can be estimated using Louis'
method of approximation of the observed information matrix (Tanner, 1993). Denote
the observed data log-likelihood by l. Then the observed data information matrix is

    -∂²l(y, ψ)/∂ψ∂ψᵀ = -E[∂²lnL_u(b, y, ψ)/∂ψ∂ψᵀ | y] - Var[∂lnL_u(b, y, ψ)/∂ψ | y]

    = -E[∂²lnL_u(b, y, ψ)/∂ψ∂ψᵀ | y] - E[(∂lnL_u(b, y, ψ)/∂ψ)(∂lnL_u(b, y, ψ)/∂ψᵀ) | y]
      + E[∂lnL_u(b, y, ψ)/∂ψ | y] E[∂lnL_u(b, y, ψ)/∂ψᵀ | y].

By simulating values bᵢ^{(k)}, k = 1, ..., m, i = 1, ..., n, from the conditional distributions
of bᵢ|yᵢ at the final ψ̂, the conditional expectations above can be approximated by the
Monte Carlo sums

    (1/m) ∑ₖ₌₁ᵐ [∂²lnL_u(b^{(k)}, y, ψ̂)/∂ψ∂ψᵀ + (∂lnL_u(b^{(k)}, y, ψ̂)/∂ψ)(∂lnL_u(b^{(k)}, y, ψ̂)/∂ψᵀ)]

and

    (1/m) ∑ₖ₌₁ᵐ ∂lnL_u(b^{(k)}, y, ψ̂)/∂ψ.

In the above expressions the argument ψ̂ means that the derivatives are evaluated at
the final parameter estimate. Exactly as in the Gauss-Hermite quadrature case,

    Var(ψ̂) ≈ J⁻¹(ψ̂).

As the simulation sample size increases the Monte Carlo EM algorithm behaves
like a deterministic EM algorithm and usually converges to the exact maximum like-
lihood estimates, but it may also converge to a local instead of a global maximum
(Wu, 1983). A drawback of the Monte Carlo EM algorithm is that it may be very
computationally intensive. Hence important alternatives to this 'exact' method for
the GLMM are methods based on analytical approximations to the marginal likelihood,
such as the procedures suggested by Breslow and Clayton (1993) and by Wolfinger
and O'Connell (1993). Wolfinger and O'Connell propose finding approximate maxi-
mum likelihood estimates by approximating the non-normal responses using Taylor
series expansions and then solving the linear mixed model equations for the associated
normal model. The extension of their method is presented in the next section.

3.2.3 Pseudo-likelihood Approach

Parameter Estimation

Let us first assume for simplicity that the shape parameters φ₁ and φ₂ for both
response distributions are equal to 1. By the model definition

    μᵢ₁ = g₁⁻¹(Xᵢ₁β₁ + Zᵢ₁bᵢ₁),
    μᵢ₂ = g₂⁻¹(Xᵢ₂β₂ + Zᵢ₂bᵢ₂),

where g₁⁻¹ and g₂⁻¹ above are evaluated componentwise at the elements of Xᵢ₁β₁ +
Zᵢ₁bᵢ₁ and Xᵢ₂β₂ + Zᵢ₂bᵢ₂ respectively. Let β̂₁, β̂₂, b̂ᵢ₁ and b̂ᵢ₂, i = 1, ..., n, be some
estimates of the fixed and the random effects. The corresponding mean estimates
are denoted by μ̂ᵢ₁ and μ̂ᵢ₂. In a neighborhood around the fixed and random effects
estimates the errors εᵢ₁ = yᵢ₁ - μᵢ₁ and εᵢ₂ = yᵢ₂ - μᵢ₂ are then approximated by first
order Taylor series expansions. Denote the two approximations by

    ε̂ᵢ₁ = yᵢ₁ - μ̂ᵢ₁ - (g₁⁻¹)'(Xᵢ₁β̂₁ + Zᵢ₁b̂ᵢ₁)(Xᵢ₁β₁ + Zᵢ₁bᵢ₁ - Xᵢ₁β̂₁ - Zᵢ₁b̂ᵢ₁)

and

    ε̂ᵢ₂ = yᵢ₂ - μ̂ᵢ₂ - (g₂⁻¹)'(Xᵢ₂β̂₂ + Zᵢ₂b̂ᵢ₂)(Xᵢ₂β₂ + Zᵢ₂bᵢ₂ - Xᵢ₂β̂₂ - Zᵢ₂b̂ᵢ₂),

where (g₁⁻¹)'(Xᵢ₁β̂₁ + Zᵢ₁b̂ᵢ₁) and (g₂⁻¹)'(Xᵢ₂β̂₂ + Zᵢ₂b̂ᵢ₂) are diagonal matrices with
elements consisting of evaluations of the first derivatives of g₁⁻¹ and g₂⁻¹. Note that
because the estimates of the random effects are not consistent as the number of
subjects increases, this expansion may not work well when the number of observations
per subject is small.
Now as in Wolfinger and O'Connell we approximate the conditional distributions
of ε̂ᵢ₁|bᵢ and ε̂ᵢ₂|bᵢ by normal distributions with 0 means and diagonal variance ma-
trices V₁(μ̂ᵢ₁) = diag(V₁(μ̂ᵢ₁₁), ..., V₁(μ̂ᵢ₁ₙᵢ)) and V₂(μ̂ᵢ₂) = diag(V₂(μ̂ᵢ₂₁), ..., V₂(μ̂ᵢ₂ₙᵢ))
respectively, where V₁(·) and V₂(·) are the variance functions of the original responses.
Note that it is reasonable to assume that the two normal distributions are uncorrelated
because the responses are assumed to be conditionally independent given the random
effects. The normal approximation, on the other hand, may not be appropriate for
some distributions, such as the Bernoulli distribution.

Substituting μ̂ᵢ₁ and μ̂ᵢ₂ in the variance expressions, and using the fact that
(g₁⁻¹)'(Xᵢ₁β̂₁ + Zᵢ₁b̂ᵢ₁) = (g₁'(μ̂ᵢ₁))⁻¹ and (g₂⁻¹)'(Xᵢ₂β̂₂ + Zᵢ₂b̂ᵢ₂) = (g₂'(μ̂ᵢ₂))⁻¹, the
conditional distributions of g₁'(μ̂ᵢ₁)(yᵢ₁ - μ̂ᵢ₁) and g₂'(μ̂ᵢ₂)(yᵢ₂ - μ̂ᵢ₂) are also multi-
variate normal with appropriately transformed variance-covariance matrices. Defining
νᵢ₁ = g₁(μ̂ᵢ₁) + g₁'(μ̂ᵢ₁)(yᵢ₁ - μ̂ᵢ₁) and νᵢ₂ = g₂(μ̂ᵢ₂) + g₂'(μ̂ᵢ₂)(yᵢ₂ - μ̂ᵢ₂), it follows that

    νᵢ₁|bᵢ ~ N_{nᵢ}(Xᵢ₁β₁ + Zᵢ₁bᵢ₁, g₁'(μ̂ᵢ₁)V₁(μ̂ᵢ₁)g₁'(μ̂ᵢ₁)),
    νᵢ₂|bᵢ ~ N_{nᵢ}(Xᵢ₂β₂ + Zᵢ₂bᵢ₂, g₂'(μ̂ᵢ₂)V₂(μ̂ᵢ₂)g₂'(μ̂ᵢ₂)).

Now, recalling that bᵢ has a multivariate normal distribution with mean 0 and
variance-covariance matrix Σ, this approximation takes the form of a linear mixed
model with response Vec(νᵢ₁, νᵢ₂), random effect Vec(bᵢ₁, bᵢ₂), and uncorrelated but
heteroscedastic errors. Such a model is called a weighted linear mixed model with a
diagonal weight matrix W = diag(Wᵢ), where Wᵢ = diag(Wᵢ₁, Wᵢ₂),
Wᵢ₁⁻¹ = g₁'(μ̂ᵢ₁)V₁(μ̂ᵢ₁)g₁'(μ̂ᵢ₁) and Wᵢ₂⁻¹ = g₂'(μ̂ᵢ₂)V₂(μ̂ᵢ₂)g₂'(μ̂ᵢ₂). For canonical
link functions V₁(μ̂ᵢ₁) = [g₁'(μ̂ᵢ₁)]⁻¹ and V₂(μ̂ᵢ₂) = [g₂'(μ̂ᵢ₂)]⁻¹, and then
Wᵢ₁ = V₁(μ̂ᵢ₁) and Wᵢ₂ = V₂(μ̂ᵢ₂).
Estimates of β = Vec(β₁, β₂) and bᵢ can be obtained by solving the mixed-model
equations below (Harville, 1977):

    ( XᵀWX          XᵀWZ       ) ( β )   ( XᵀWν )
    ( ZᵀWX    Σ̃⁻¹ + ZᵀWZ ) ( b ) = ( ZᵀWν ),    (3.5)

where Σ̃ = diag(Σ, ..., Σ) and

    X = (X₁ᵀ, ..., Xₙᵀ)ᵀ,  Z = diag(Z₁, ..., Zₙ),  ν = Vec(ν₁, ..., νₙ),  b = Vec(b₁, ..., bₙ),

    Xᵢ = diag(Xᵢ₁, Xᵢ₂),  Zᵢ = diag(Zᵢ₁, Zᵢ₂),  νᵢ = Vec(νᵢ₁, νᵢ₂).
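Once the stacked matrices are formed, the mixed-model equations (3.5) can be solved as a single linear system. A sketch with numpy and small simulated matrices follows; the generalized least squares identity β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹ν with V = W⁻¹ + ZΣ̃Zᵀ is used afterwards as a check, since it must agree with the mixed-model solution.

```python
import numpy as np

def solve_mme(X, Z, W, Sigma_tilde_inv, nu):
    """Solve the mixed-model equations (3.5) for (beta, b)."""
    p = X.shape[1]
    A = np.block([[X.T @ W @ X, X.T @ W @ Z],
                  [Z.T @ W @ X, Sigma_tilde_inv + Z.T @ W @ Z]])
    rhs = np.concatenate([X.T @ W @ nu, Z.T @ W @ nu])
    sol = np.linalg.solve(A, rhs)
    return sol[:p], sol[p:]

rng = np.random.default_rng(0)
X, Z = rng.standard_normal((8, 2)), rng.standard_normal((8, 3))
W = np.diag(rng.uniform(0.5, 2.0, 8))     # diagonal weight matrix
Sigma_tilde = 0.5 * np.eye(3)             # random-effects covariance block
nu = rng.standard_normal(8)
beta, b = solve_mme(X, Z, W, np.linalg.inv(Sigma_tilde), nu)
```

Solving the joint system is usually cheaper than inverting V directly, which is the computational appeal of the mixed-model equations.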

Variance component estimates can be obtained by numerically maximizing the
weighted linear mixed model log-likelihood

    l = -½ ∑ᵢ₌₁ⁿ ln|Vᵢ| - ½ ∑ᵢ₌₁ⁿ rᵢᵀVᵢ⁻¹rᵢ,    (3.6)

where

    Vᵢ = Wᵢ⁻¹ + ZᵢΣZᵢᵀ

and

    rᵢ = νᵢ - Xᵢ(∑ᵢ₌₁ⁿ XᵢᵀVᵢ⁻¹Xᵢ)⁻¹ ∑ᵢ₌₁ⁿ XᵢᵀVᵢ⁻¹νᵢ.

The algorithm then is as follows:

1. Obtain initial estimates μ̂ᵢ₁⁽⁰⁾ and μ̂ᵢ₂⁽⁰⁾ using the original data, i = 1, ..., n.
2. Compute the modified responses

    νᵢ₁ = g₁(μ̂ᵢ₁) + (yᵢ₁ - μ̂ᵢ₁)g₁'(μ̂ᵢ₁),
    νᵢ₂ = g₂(μ̂ᵢ₂) + (yᵢ₂ - μ̂ᵢ₂)g₂'(μ̂ᵢ₂).

3. Maximize (3.6) with respect to the variance components.
   Stop if the difference between the new and the old variance component estimates
   is sufficiently small. Otherwise go to the next step.
4. Compute the mixed-model estimates of β and bᵢ by solving the mixed model
   equations (3.5):

    β̂ = (∑ᵢ₌₁ⁿ XᵢᵀVᵢ⁻¹Xᵢ)⁻¹ (∑ᵢ₌₁ⁿ XᵢᵀVᵢ⁻¹νᵢ),
    b̂ᵢ = ΣZᵢᵀVᵢ⁻¹(νᵢ - Xᵢβ̂).

5. Compute the new estimates of μᵢ = (μᵢ₁, μᵢ₂):

    μ̂ᵢ₁ = g₁⁻¹(Xᵢ₁β̂₁ + Zᵢ₁b̂ᵢ₁),
    μ̂ᵢ₂ = g₂⁻¹(Xᵢ₂β̂₂ + Zᵢ₂b̂ᵢ₂).

   Go to step 2.
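For a Bernoulli response with the canonical logit link, step 2 and the weights take a particularly simple form; the sketch below (numpy assumed, names illustrative) computes the modified response and the canonical-link weight W = V(μ) = μ(1-μ).

```python
import numpy as np

def working_response_logit(y, eta):
    """Modified response and weight for a Bernoulli response with logit link:
    nu = g(mu) + (y - mu) g'(mu), with g'(mu) = 1 / (mu (1 - mu)),
    and, for this canonical link, W = [g'(mu)]^{-1} = V(mu) = mu (1 - mu)."""
    mu = 1.0 / (1.0 + np.exp(-eta))       # mu = g^{-1}(eta)
    g_prime = 1.0 / (mu * (1.0 - mu))
    nu = eta + (y - mu) * g_prime
    weight = mu * (1.0 - mu)
    return nu, weight

# At eta = 0 (mu = 0.5): g'(mu) = 4, so y = 1 gives nu = 0 + 0.5 * 4 = 2.
nu, w = working_response_logit(np.array([1.0, 0.0]), np.array([0.0, 0.0]))
```

The coarseness of this linearization for binary data is one source of the downward bias in the pseudo-likelihood variance component estimates noted later in the chapter.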

In comparison to the original Wolfinger and O'Connell algorithm, the above al-
gorithm for multiple responses is computationally more intensive at step 3 because
of the more complicated structure of the variance-covariance matrices Vᵢ. As the
number of response variables increases, the computational advantages of this approx-
imate procedure over the "exact" maximum likelihood procedures may become less
pronounced.
The Newton-Raphson method is a logical choice for the numerical maximization
in step 3 because of its relatively fast convergence and because as a side product
it provides the Hessian matrix for the approximation of the standard errors of the
variance components.
In the general case when φ₁ and φ₂ are arbitrary, the variance functions V₁(μ̂ᵢ₁)
and V₂(μ̂ᵢ₂) are replaced by φ₁V₁(μ̂ᵢ₁) and φ₂V₂(μ̂ᵢ₂), and hence the weight matrices
are accordingly modified: W*ᵢ₁⁻¹ = φ₁g₁'(μ̂ᵢ₁)V₁(μ̂ᵢ₁)g₁'(μ̂ᵢ₁) and
W*ᵢ₂⁻¹ = φ₂g₂'(μ̂ᵢ₂)V₂(μ̂ᵢ₂)g₂'(μ̂ᵢ₂). The estimates of φ₁ and φ₂ can be obtained in step
3 together with the other variance component estimates. This approach is different
from the approach taken by Wolfinger and O'Connell, who estimated the dispersion
parameter for the response together with the fixed and random effects.

Standard Error Estimation

The standard errors for the regression parameter and for the random effects esti-
mates are approximated from the linear mixed model:

    Var( β̂ )  ≈  ( XᵀWX          XᵀWZ       )⁻¹
       ( b̂ )     ( ZᵀWX    Σ̃⁻¹ + ZᵀWZ )   ,

where Σ̃ = diag(Σ, ..., Σ). As Σ and W are unknown, estimates of the variance
components will be used.

The standard errors for the variance components can be approximated by using
exact or numerical derivatives of the likelihood function in step 3 of the algorithm.
For the same reasons as discussed in Section 3.2.1 it is better to maximize with respect
to the Choleski factor of Σ. The computation of the standard errors, however, should
be carried out with respect to the original parameters σ₁, σ₂ and ρ (or σ₁², σ₂² and σ₁₂).
This can be done only if the final variance-covariance estimate is positive definite. The
approach discussed in Section 3.2.1 for representation of the variance-covariance matrix
can also be adopted here to simplify computations.

3.3 Simulated Data Example

The methods discussed in the previous sections are illustrated on one simulated
data set in which the true values of the parameters are known. The data consist of
J = 10 repeated measures of a normal and of a Bernoulli response on each of I = 30
subjects. No covariates are considered and a random intercept is assumed for each
variable. In the notation introduced in Chapter 2,

    y_{i1j}|b_{i1} ~ indep. N(μ_{i1j}, σ²),
    y_{i2j}|b_{i2} ~ indep. Bernoulli(μ_{i2j}),
    μ_{i1j} = β₁ + b_{i1},
    logit(μ_{i2j}) = β₂ + b_{i2},

    bᵢ = (b_{i1}, b_{i2})ᵀ ~ MVN(0, Σ),   Σ = ( σ₁²  σ₁₂ )
                                               ( σ₁₂  σ₂² ).

This means that we define a normal theory random intercept model for one of the
responses, a logistic regression model with a random intercept for the other response,
and we assume that the random intercepts are correlated. The data are simulated
for σ² = 1, β₁ = 4, β₂ = 1, Σ = ( 1    0.5 )
                                  ( 0.5  0.5 ). The parameter estimates and their

standard errors for the three fitting methods discussed above are presented in Table
3.1. The methods are programmed in Ox and are run on Sun Ultra workstations. Ox
is an object-oriented matrix programming language developed at Oxford University

Table 3.1. Estimates and standard errors from model fitting for the simulated data
example using Gauss-Hermite quadrature (GQ), Monte Carlo EM algorithm (MCEM)
and pseudo-likelihood (PL)

Parameter  True    GQ            MCEM          PL
           value   Est.   SE     Est.   SE     Est.   SE
β₁         4.0     3.72   0.19   3.72   0.20   3.72   0.19
β₂         1.0     1.08   0.18   1.08   0.19   1.08   0.17
σ²         1.0     0.97   0.08   0.97   0.08   0.97   0.08
σ₁²        1.0     0.94   0.27   0.94   0.27   0.94   0.27
σ₂²        0.5     0.39   0.26   0.41   0.35   0.35   0.22
σ₁₂        0.5     0.54   0.21   0.54   0.24   0.48   0.20

(Doornik, 1998). It is convenient because of the availability of matrix operations and
a variety of predefined functions and, even more importantly, because of the speed of
execution.

All standard errors, except those for the Monte Carlo EM algorithm and those for
the regression parameters for the pseudo-likelihood algorithm, are based on numerical
derivatives. The identity matrix is used as an initial estimate of the covariance matrix
of the random components, and the initial estimates for β₁, β₂ and σ² are taken to
be the sample means for the normal and the Bernoulli response, and the sample
variance for the normal response. Gauss-Hermite quadrature is used with the number
of quadrature points in each dimension varying from 10 to 100.

The MaxBFGS numerical maximization procedure in Ox is used for optimization
in the Gaussian quadrature and the pseudo-likelihood algorithms with a convergence
criterion based on δ₁ = 0.0001. The convergence criterion used for the Monte Carlo
EM algorithm is based on δ₁ = 0.003. MaxBFGS employs a quasi-Newton method
for maximization developed by Broyden, Fletcher, Goldfarb and Shanno (BFGS)
(Doornik, 1998) and it can use either analytical or numerical first derivatives.

The times to achieve convergence by the three methods are 30 minutes for Gaus-

sian quadrature (GQ) with 100 quadrature points in each dimension, approximately

3 hours for the Monte Carlo EM algorithm and 1.5 hours for the pseudo-likelihood

(PL) algorithm. The numbers of iterations required are 22, 53 and 7 respectively.

The final simulation sample size for the Monte Carlo EM algorithm is 9864.

As expected, the Monte Carlo EM estimates are very close to the Gaussian quadra-
ture estimates with comparable standard errors. Only the estimate of σ₂² is slightly
different and has a rather large standard error, but this is due to premature stopping
of the EM algorithm. This problem can be solved by increasing the required precision
for convergence, but at the expense of a very large simulation sample size.

As can be seen from the table, the parameters corresponding to the normal re-

sponse are well estimated by the pseudo-likelihood algorithm, but some of the es-

timates corresponding to the Bernoulli response are somewhat underestimated and

have smaller standard errors. The estimate of the random intercept variance σ₂² is

about 10% smaller than the corresponding Gaussian quadrature estimate with about

15% smaller standard error. The estimate of σ₁₂ is also about 15% smaller than

the corresponding Gauss-Hermite quadrature and Monte Carlo estimates, but it is

actually closer to the true value in this particular example. In general the variance

component estimates for binary data obtained using the pseudo-likelihood algorithm

are expected to be biased downward as indicated by previous research in the univari-

ate GLMM (Breslow and Clayton, 1993; Wolfinger, 1998). Hence the Wolfinger and

O'Connell method should be used with caution for the extended GLMM's with at

least one Bernoulli response, keeping in mind the possible attenuation of estimates

and of the standard errors.

Gauss-Hermite Quadrature: 'Exact' vs 'Numerical' Derivatives

As mentioned before MaxBFGS can be used either with exact or with numerical
first derivatives. Comparisons of these two approaches in terms of computational

speed and accuracy are provided in Tables 3.2 and 3.3 respectively. The 'Iteration'

columns in Table 3.2 contain number of iterations and the 'Convergence' columns

Table 3.2. 'Exact' versus 'numerical' derivatives for the simulated data example:
convergence using Gauss-Hermite quadrature.
Number Exact derivatives Numerical derivatives
of quad. Time Iteration Convergence Time Iteration Convergence
points (min.) (min.)
10 27.4 56 no 0.3 19 strong
20 21.3 22 weak 1.1 19 strong
30 75 35 weak 2.8 23 strong
40 120 34 strong 4.8 22 strong
50 246 47 strong 7.2 22 strong
60 237 34 strong 10.9 22 strong

show whether the convergence tests are passed at the prespecified δ₁ level (strong
convergence), passed only at the weaker level δ₁ = 0.001 (weak convergence), or not
passed at all (no convergence). Note that for the exact derivatives strong convergence
is achieved with 40 or more quadrature points.

It is rather surprising that the 'exact' Gaussian quadrature procedure does not
perform better than the 'numerical' one. The parameter estimates from both pro-
cedures are the same (Table 3.3) for 40 or more quadrature points, but the 'exact'
Gauss-Hermite quadrature converges much more slowly than the numerical procedure
(hours as compared to minutes). The numerical procedure also has the advantage
of simpler programming code and hence for the rest of the dissertation it will be the
preferred quadrature method.

Monte Carlo EM algorithm: Variability of Estimated Standard Errors

We observed that the estimated Monte Carlo EM standard errors varied somewhat
for different random seeds, so we generated 100 samples with the final simulation sam-
ple size and computed the average and the standard deviation of the standard error

estimates for these samples. Table 3.4 shows the results. We include the Gaussian

quadrature standard error estimates for comparison. There is much variability in the

standard error estimates (especially in the estimate of σ₂²), which is primarily due to

Table 3.3. 'Exact' versus 'numerical' derivatives for the simulated data example:
estimates and standard errors using Gauss-Hermite quadrature with varying number
of quadrature points (m).
m Exact derivatives Numerical derivatives
     β₁    β₂    σ²    σ₁²   σ₂²   σ₁₂    β₁    β₂    σ²    σ₁²   σ₂²   σ₁₂
10 3.68 1.06 0.97 0.98 0.38 0.52 3.72 1.08 0.97 0.85 0.35 0.48
0.13 0.16 0.08 0.28 0.25 0.22 0.12 0.16 0.08 0.16 0.24 0.15
20 3.75 1.10 0.96 0.95 0.39 0.54 3.75 1.10 0.97 0.95 0.39 0.54
0.16 0.18 0.08 0.27 0.26 0.21 0.16 0.18 0.08 0.27 0.26 0.21
30 3.71 1.07 0.97 0.94 0.39 0.54 3.71 1.07 0.97 0.95 0.39 0.54
0.19 0.18 0.08 0.27 0.26 0.21 0.19 0.18 0.08 0.29 0.27 0.22
40 3.72 1.08 0.97 0.94 0.39 0.54 3.72 1.08 0.97 0.94 0.39 0.54
0.18 0.18 0.08 0.27 0.26 0.21 0.18 0.18 0.08 0.26 0.26 0.20
50 3.72 1.08 0.97 0.94 0.39 0.54 3.72 1.08 0.97 0.94 0.39 0.54
0.19 0.18 0.08 0.27 0.26 0.21 0.19 0.18 0.08 0.27 0.26 0.21
60 3.72 1.08 0.97 0.94 0.39 0.54 3.72 1.08 0.97 0.94 0.39 0.54
0.19 0.18 0.08 0.27 0.26 0.21 0.19 0.18 0.08 0.27 0.26 0.21

several extreme observations. If the simulation sample size is increased by one-third,

the standard error estimates are more stable (see the last two columns of Table 3.4).

It is not surprising that there is more variability in the standard error estimates than

in the parameter estimates because only the latter are controlled by the convergence

criterion. So it may be necessary to increase the simulation sample size to obtain

better estimates of the standard errors. This, however, requires additional compu-

tational effort and it is not clear by how much the simulation sample size should be

increased. We consider a different approach to dealing with standard error instability

in the next section of the dissertation.

Comparison of Simultaneous and Separate Fitting of the Response Variables

To investigate the possible efficiency gains of joint fitting of the two responses

over separate fitting, we compared estimates from the joint to estimates from the

two separate fits. The results for the three estimation methods are provided in Table

Table 3.4. Variability in Monte Carlo EM standard errors for the simulated data
example. Means and standard deviations of the standard error estimates are com-
puted for 100 samples using two different simulation sample sizes. The Gauss-Hermite
quadrature standard errors are given as a reference in the second column.

Parameter  GQ SE  Mean SE   SD of SE  Mean SE    SD of SE
                  ss=9864   ss=9864   ss=13152   ss=13152
β₁         0.19   0.21      0.07      0.19       0.03
β₂         0.18   0.20      0.09      0.19       0.03
σ²         0.08   0.08      0.0008    0.08       0.0001
σ₁²        0.27   0.27      0.03      0.27       0.01
σ₂²        0.26   0.41      0.76      0.31       0.13
σ₁₂        0.21   0.22      0.04      0.22       0.02

Table 3.5. Results from joint and separate Gauss-Hermite quadrature fitting of the
two response variables in the simulated data example

Parameter  True value  Joint estimation   Separate estimation
                       Estimate  SE       Estimate  SE
β₁         4.0         3.72      0.19     3.72      0.19
β₂         1.0         1.08      0.18     1.07      0.18
σ²         1.0         0.97      0.08     0.97      0.08
σ₁²        1.0         0.94      0.27     0.94      0.27
σ₂²        0.5         0.39      0.26     0.37      0.25
σ₁₂        0.5         0.54      0.21     -         -

3.5 (Gaussian quadrature), Table 3.6 (Monte Carlo EM), and Table 3.7 (pseudo-
likelihood). For the separate fits we used PROC NLMIXED for Gaussian quadrature
and wrote our own programs in Ox for Monte Carlo EM and pseudo-likelihood. Notice
that when the normal response is considered separately, the corresponding model is
a one-way random effects ANOVA and can be fitted using PROC MIXED in SAS, for
example. The estimates from PROC MIXED are as follows: β̂₁ = 3.72 (SE = 0.19),
σ̂² = 0.97 (SE = 0.08) and σ̂₁² = 0.94 (SE = 0.27).

It is surprising that although the estimated correlation between the random effects
was rather high (ρ̂ = σ̂₁₂/(σ̂₁σ̂₂) = 0.86), the estimates and the estimated standard
errors from the joint and from the separate fits were very similar. Keeping in mind that
it is much faster to fit the responses separately and that there is existing software for

Table 3.6. Results from joint and separate Monte Carlo EM estimation of the two
response variables in the simulated data example

Parameter  True value  Joint estimation   Separate estimation
                       Estimate  SE       Estimate  SE
β₁         4.0         3.72      0.20     3.73      0.22
β₂         1.0         1.08      0.19     1.08      0.18
σ²         1.0         0.97      0.08     0.97      0.08
σ₁²        1.0         0.94      0.27     0.95      0.28
σ₂²        0.5         0.41      0.35     0.38      0.23
σ₁₂        0.5         0.54      0.24     -         -

Table 3.7. Results from joint and separate pseudo-likelihood estimation of the two
response variables in the simulated data example

Parameter  True value  Joint estimation   Separate estimation
                       Estimate  SE       Estimate  SE
β₁         4.0         3.72      0.19     3.72      0.19
β₂         1.0         1.08      0.17     1.05      0.17
σ²         1.0         0.97      0.08     0.97      0.08
σ₁²        1.0         0.94      0.27     0.94      0.27
σ₂²        0.5         0.35      0.22     0.35      0.22
σ₁₂        0.5         0.48      0.20     -         -

in this particular example. Advantages and disadvantages of joint and separate fitting

will be discussed in Chapter 5 of the dissertation.

3.4 Applications

3.4.1 Developmental Toxicity Study in Mice

In this section the 'exact' maximum likelihood methods are applied to the ethylene
glycol (EG) example mentioned in the introduction. The model is defined as follows
(dᵢ denotes the exposure level of EG for the ith dam):

    y_{i1j} = birth weight of the jth live fetus in the ith litter,
    y_{i2j} = malformation status of the jth live fetus in the ith litter,

    y_{i1j}|b_{i1} ~ indep. N(μ_{i1j}, σ²),
    y_{i2j}|b_{i2} ~ indep. Bernoulli(μ_{i2j}).

Table 3.8. Gauss-Hermite and Monte Carlo EM estimates and standard errors from
model fitting with a quadratic effect of dose on birth weight in the ethylene glycol data

Parameter  Gauss-Hermite quadrature   Monte Carlo EM
           Estimate   SE              Estimate   SE
β₁₀        0.962      0.016           0.964      0.037
β₁₁        -0.118     0.028           -0.120     0.053
β₁₂        0.010      0.009           0.010      0.015
β₂₀        -4.267     0.409           -4.287     0.455
β₂₁        1.720      0.207           1.731      0.217
σ²         0.006      0.0003          0.006      0.0003
σ₁²        0.007      0.001           0.007      0.001
σ₂²        2.287      0.596           2.238      0.560
σ₁₂        -0.082     0.020           -0.082     0.022

Table 3.9. Gauss-Hermite quadrature and Monte Carlo EM estimates and standard
errors from model fitting with linear trends of dose in the ethylene glycol data

Parameter  GQ (logit)    MCEM (logit)  GQ (probit)   MCEM (probit)
           Est     SE    Est     SE    Est     SE    Est     SE
β₁₀        0.952   0.014 0.952   0.021 0.952   0.014 0.954   0.028
β₁₁        -0.087  0.008 -0.088  0.008 -0.087  0.008 -0.088  0.012
β₂₀        -4.335  0.411 -4.335  0.522 -2.340  0.213 -2.422  0.269
β₂₁        1.749   0.208 1.752   0.225 0.970   0.111 0.982   0.129
σ          0.075   0.002 0.075   0.002 0.075   0.002 0.075   0.002
σ₁         0.086   0.007 0.086   0.007 0.086   0.007 0.086   0.007
σ₂         1.513   0.196 1.495   0.207 0.839   0.107 0.831   0.101
ρ          -0.682  0.088 -0.687  0.091 -0.666  0.090 -0.670  0.094

    μ_{i1j} = β₁₀ + β₁₁dᵢ + b_{i1},
    logit(μ_{i2j}) = β₂₀ + β₂₁dᵢ + b_{i2}.


We considered a linear trend in dose in both linear predictors. Although other
authors have found a significant quadratic dose trend for birth weight, we found that

to be nonsignificant and hence have not included it in the final model (see Table 3.8).

We also considered a probit link.
The Gaussian quadrature and Monte Carlo EM parameter estimates from the
model fits are given in Table 3.9. The random effects are specified as

    bᵢ = (b_{i1}, b_{i2})ᵀ ~ MVN(0, Σ),   Σ = ( σ₁²  σ₁₂ )
                                               ( σ₁₂  σ₂² ).

Initial estimates for the regression parameters were
the estimates from the fixed effects models for the two responses, σ was set equal

to the estimated standard deviation from the linear mixed model for birth weight,

and the identity matrix was taken as an initial estimate for the variance-covariance

matrix of the random effects. Gaussian quadrature estimates were obtained using 100

quadrature points in each dimension which took 36 hours for both the logit and the

probit links. Adaptive Gaussian quadrature will probably be much faster because it

requires a smaller number of quadrature points. The Monte Carlo EM algorithm for

the model with logit link ran for 3 hours and 6 minutes, and required 41 iterations

and a final sample size of 989 for convergence. The Monte Carlo EM algorithm for the

model with a probit link ran for 5 hours and 13 minutes, and required 37 iterations

and a final sample size of 1757 for convergence. The convergence precisions were the

same as in the simulated data example.

The estimates from the two algorithms are similar but the standard error estimates

from the Monte Carlo EM algorithm are larger than their counterparts from Gaussian

quadrature. This is not always the case as seen from Table 3.10. There is much

variability in the standard error estimates from the Monte Carlo EM algorithm. The

estimates can be improved by considering larger simulation sample sizes but we adopt

a different approach. The observed information matrix at each step of the algorithm is

easily approximated using Louis' method and it does not require extra computational

effort, because it relies on approximations needed for the sample size increase decision.

Because the EM algorithm converges slowly in close proximity of the estimates and

because the convergence criterion must be satisfied in three consecutive iterations,

the approximated observed information matrices from the last three iterations are

likely to be equally good in approximating the variance of the parameter estimates.

Hence if we average the three matrices we are likely to obtain a better estimate of

the variance-covariance matrix than if we rely on only one particular iteration.
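As a minimal illustration of this pooling step, the snippet below averages approximated observed information matrices and converts the pooled matrix into standard errors. The 2x2 matrices are hypothetical stand-ins for the Louis-method approximations from the last three iterations, not values from the thesis.

```python
import numpy as np

def pooled_standard_errors(info_matrices):
    """Average several approximations of the observed information matrix,
    invert the pooled matrix, and return the standard errors (square roots
    of the diagonal of the inverse)."""
    J_bar = np.mean(info_matrices, axis=0)
    cov = np.linalg.inv(J_bar)        # pooled variance-covariance estimate
    return np.sqrt(np.diag(cov))

# Hypothetical 2x2 Louis-method approximations from the last three iterations
J1 = np.array([[110.0, 12.0], [12.0, 45.0]])
J2 = np.array([[98.0, 10.0], [10.0, 52.0]])
J3 = np.array([[105.0, 11.0], [11.0, 48.0]])
se = pooled_standard_errors([J1, J2, J3])
```

Averaging before inverting also reduces the chance that a single noisy iteration produces a non-positive-definite estimate, although, as noted above, positive definiteness of the pooled matrix is still not guaranteed.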

Table 3.10. Variability in Monte Carlo EM standard errors for the ethylene glycol ex-
ample. Means and standard deviations of the standard error estimates are computed
for 100 samples for the logit and probit models. Gauss-Hermite quadrature standard
errors are given as a reference.
Parameter   Logit model                       Probit model
            GQ      Mean of SE   SD of SE     GQ      Mean of SE   SD of SE
β10         0.014   0.015        0.032        0.014   0.014        0.021
β11         0.008   0.008        0.015        0.008   0.007        0.010
β20         0.411   0.440        0.500        0.213   0.200        0.137
β21         0.208   0.206        0.207        0.111   0.101        0.063
σ           0.002   0.002        0.0001       0.002   0.002        0.00002
σ1          0.007   0.007        0.0005       0.007   0.007        0.0001
σ2          0.196   0.212        0.115        0.107   0.110        0.020
ρ           0.088   0.094        0.052        0.090   0.092        0.012

Tables 3.11 and 3.12 give standard error estimates for the EG data based on Gaus-
sian quadrature (column GQ), on the Monte Carlo approximated observed
information matrix in the third to last (column A1), second to last (column A2) and
last (column A3) iterations, and on the average of the observed information matri-
ces of the last two (column A4) and of the last three (column A5) iterations. Clearly

using only the estimate of the observed information matrix from the last iteration
is not satisfactory because it depends heavily on the random seed and may contain
some negative standard error estimates. Averaging over the three final iterations is

better, although it is not guaranteed that even the pooled estimate of the observed
information matrix will be positive definite, or that it will lead to improved estimates.

All parameter estimates in the model are significantly different from zero with
birth weight significantly decreasing with increasing dose and probability for malfor-
mation significantly increasing with increasing dose. As expected, the regression pa-
rameters estimates using the logit link function are greater than those obtained using
the probit link function. The use of the logit link function facilitates the interpreta-
tion of the parameter estimates. Thus, if the EG dose level for a litter is increased

by 1 g/kg, the estimated odds of malformation for any fetus within that litter increase

Table 3.11. Monte Carlo EM standard errors using logit link in the ethylene glycol
example. The approximations are based on the estimate of the observed information
matrix from the third to last iteration (A1), second to last iteration (A2), last iter-
ation (A3), the average of the last two iterations (A4), and the average of the last
three iterations (A5). The standard error estimates obtained using Gauss-Hermite
quadrature are given as a reference in the column labeled GQ.
Parameter   Estimated S.E.
            GQ      A1      A2      A3      A4      A5
β10         0.014   0.017   0.037   0.012   0.016   0.016
β11         0.008   0.007   0.014   0.001   0.008   0.007
β20         0.411   0.385   0.429   0.365   0.409   0.396
β21         0.208   0.183   0.239   0.215   0.236   0.206
σ           0.002   0.002   0.002   0.002   0.002   0.002
σ1          0.007   0.007   0.007   0.007   0.007   0.007
σ2          0.196   0.199   0.202   0.221   0.204   0.199
ρ           0.088   0.081   0.084   0.136   0.096   0.089

Table 3.12. Monte Carlo EM standard errors using probit link in the ethylene glycol
example. The approximations are based on the estimate of the observed information
matrix from the third to last iteration (A1), second to last iteration (A2), last iter-
ation (A3), the average of the last two iterations (A4), and the average of the last
three iterations (A5). The standard error estimates obtained using Gauss-Hermite
quadrature are given as a reference in the column labeled GQ.
Parameter   Estimated S.E.
            GQ      A1      A2      A3      A4      A5
β10         0.014   0.021   0.021   0.006   0.016   0.017
β11         0.008   0.012   0.007   0       0.008   0.009
β20         0.213   0.273   0.227   0       0.243   0.233
β21         0.111   0.151   0.110   0       0.132   0.127
σ           0.002   0.002   0.002   0.002   0.002   0.002
σ1          0.007   0.008   0.007   0.007   0.007   0.007
σ2          0.107   0.127   0.097   0       0.124   0.110
ρ           0.090   0.094   0.102   0       0.101   0.095

exp(1.75) = 5.75 times. The probit link function does not offer similar easy inter-

pretation, but allows one to obtain population-averaged estimates for dose. If the

subject-specific regression estimate is β̂, then the marginal regression parameters are

estimated by β̂/(1 + σ̂2²)^(1/2) (Section 2.1). Hence, we obtain a marginal intercept of -1.793
and a marginal slope of 0.743. The latter can be interpreted as the amount by which

the population-averaged probability of malformation changes on the probit scale for

each unit increase in dose. The subject-specific slope estimate of 0.970, on the other hand,
is interpreted as the increase in the individual fetus probability of malformation on

the probit scale for each unit increase in dose. A more meaningful interpretation for
those numbers can be offered if one assumes a continuous underlying malformation

variable for the observed binary malformation outcome. The latent variable reflects

a cumulative detrimental effect which manifests itself in a malformation if it exceeds

a certain threshold. The effect of dose on this underlying latent variable is linear and

hence β̂21 is interpreted as the amount by which the cumulative effect is increased for

one unit increase in dose. β21 also determines the highest rate of change in

the probability of malformation, estimated as φ(0)β̂21 ≈ 0.4(0.970) = 0.39 (Agresti, 1990, p.103). This highest rate of change

is estimated to be achieved at dose -β̂20/β̂21 = 2.340/0.970 = 2.41 g/kg.
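These quantities can be reproduced directly from the probit estimates in Table 3.9; the sketch below assumes the scaling formula β/(1 + σ2²)^(1/2) referenced from Section 2.1.

```python
import math

# Subject-specific probit estimates from Table 3.9 (Gauss-Hermite quadrature fit)
beta20, beta21 = -2.340, 0.970      # malformation intercept and dose slope
sigma2 = 0.839                      # random-effect SD for malformation

scale = math.sqrt(1.0 + sigma2 ** 2)
marginal_intercept = beta20 / scale   # population-averaged probit intercept
marginal_slope = beta21 / scale       # population-averaged probit dose slope

# Dose at which the malformation probability crosses 1/2, i.e. where its
# rate of change is steepest
steepest_dose = -beta20 / beta21
```

The marginal values reproduce the -1.793 and 0.743 quoted above.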

As expected birth weight and malformation are significantly negatively correlated

as judged from the Wald test from Table 3.9: (ρ̂/SE(ρ̂))² = (-0.666/0.090)² = 54.8 (p-value <

0.0001). This translates into a negative correlation between the responses but there

is no closed form expression to estimate it precisely.
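The Wald calculation can be verified from the Table 3.9 probit estimates; the χ²₁ tail probability is obtained with the complementary error function, since a χ²₁ variate is the square of a standard normal.

```python
import math

rho_hat, se_rho = -0.666, 0.090     # Table 3.9, Gauss-Hermite probit fit
wald = (rho_hat / se_rho) ** 2      # (rho / SE)^2 is chi-square(1) under H0
# P(chi2_1 > wald) = P(|Z| > sqrt(wald)) = erfc(sqrt(wald / 2))
p_value = math.erfc(math.sqrt(wald / 2.0))
```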

It is interesting to compare the joint and the separate fits of the two response
variables (Tables 3.13 and 3.14). Table 3.13 presents estimates obtained using Gaus-

sian quadrature and shows that the estimates for the Normal response are identical
(within the reported precision) but the estimates for the Bernoulli response from

the separate fits are generally larger in absolute value with larger standard errors.

Table 3.13. Results from joint and separate Gauss-Hermite quadrature of the two
response variables in the ethylene glycol example.
Par.    Logit models                       Probit models
        Joint fit        Separate fits     Joint fit        Separate fits
        Est.     SE      Est.     SE       Est.     SE      Est.     SE
β10     0.952    0.014   0.952    0.014    0.952    0.014   0.952    0.014
β11     -0.087   0.008   -0.087   0.008    -0.087   0.008   -0.087   0.008
β20     -4.335   0.411   -4.356   0.438    -2.340   0.213   -2.426   0.230
β21     1.749    0.208   1.779    0.220    0.970    0.111   0.993    0.118
σ       0.075    0.002   0.075    0.002    0.075    0.002   0.075    0.002
σ1      0.086    0.007   0.086    0.007    0.086    0.007   0.086    0.007
σ2      1.513    0.196   1.577    0.211    0.839    0.107   0.881    0.117
ρ       -0.682   0.088   --       --       -0.666   0.090   --       --

Table 3.14. Results from joint and separate fitting of the Monte Carlo EM algorithm
for the two response variables in the ethylene glycol example
Par.    Logit models                       Probit models
        Joint fit        Separate fits     Joint fit        Separate fits
        Est.     SE      Est.     SE       Est.     SE      Est.     SE
β10     0.952    0.021   0.952    0.014    0.954    0.028   0.952    0.014
β11     -0.088   0.008   -0.088   0.007    -0.088   0.012   -0.088   0.007
β20     -4.335   0.522   -4.453   0.344    -2.422   0.269   -2.499   0.193
β21     1.752    0.225   1.822    0.211    0.982    0.129   1.025    0.098
σ       0.075    0.002   0.075    0.002    0.075    0.002   0.075    0.002
σ1      0.086    0.007   0.086    0.007    0.086    0.007   0.086    0.007
σ2      1.495    0.207   1.547    0.199    0.831    0.101   0.869    0.052
ρ       -0.687   0.091   --       --       -0.670   0.094   --       --

This may indicate small efficiency gains in fitting the responses together rather than

separately. More noticeable differences in the parameter estimates and especially in

the standard errors are observed in the results from the Monte Carlo EM algorithm

(Table 3.14). In this case the standard error estimates for the Bernoulli components

from the separate fits are smaller than their counterparts from the joint fits. These

results, however, may be deceiving because there is large variability in the standard

error estimates depending on the particular simulation sample used (see Table 3.10).

As a result of the conditional independence assumption the correlation between

birth weight and malformation within fetus, and the correlation between birth weight

and malformation measured on two different fetuses within the same litter, are as-

sumed to be the same. However, in practice this assumption may not be satisfied. We

would expect that measurements on the same fetus are more highly correlated than

measurements on two different fetuses within a litter. Hence, it is very important

to be able to check the conditional independence assumption and to investigate how

departures from that assumption affect the parameter estimates. In the following

chapter score tests are used to check aspects of conditional independence without fit-

ting more complicated models. When the score tests show that there is non-negligible

departure from this assumption, alternative models should be considered. In the case

of a binary and a continuous response one more general model is presented in Chapter

5. It has been considered by Catalano and Ryan (1992), who used GEE methods to

fit it. We propose "exact" maximum likelihood estimation which allows for direct

estimation of the variance-covariance structure.

3.4.2 Myoelectric Activity Study in Ponies

The second data set introduced in Section 2.4 is from a study on myoelectric activ-

ity in ponies. The purpose of this analysis is to simultaneously assess the immediate

effects of six drugs and placebo on spike burst rate and spike burst duration within

a pony. Two features of this data example are immediately obvious from Table 2.2:
there is a large number of observations within cluster (pony) and the variance of the

spike burst rate response for each pony is much larger than the mean.

The implication of the first observation is that ordinary Gauss-Hermite quadrature

as described in Section 3.2.1 may not work well because some of the integrands will

be essentially zero. Adaptive Gaussian quadrature on the other hand will be more

appropriate because it requires a smaller number of quadrature points. Also the ad-

ditional computations to obtain the necessary modes and curvatures will not slow

down the algorithm because there are only 6 subjects in the data set and hence only

six additional maximizations will be performed. Adaptive Gaussian quadrature is

described in the next subsection of the dissertation.

The implication of the second observation concerns the response distribution of the

spike burst rate response. An obvious choice for count data is the Poisson distribution,

but the Poisson distribution imposes equality of the mean and the variance of the

response and this assumption is clearly not satisfied for the data even within a pony.

Therefore some way of accounting for the extra dispersion in addition to the pony

effect must be incorporated in the model. Such an approach based on the negative

binomial distribution is discussed later in this chapter.

Adaptive Gaussian Quadrature

Adaptive Gaussian quadrature has been considered by Liu and Pierce (1994),

Pinheiro and Bates (1999), and Wolfinger (1998). We now present that idea in the

context of the bivariate GLMM. Recall that the likelihood for the ith subject has the form

L_i = ∫ f(y_i, b_i) db_i,

where

f(y_i, b_i) = [∏_{j=1}^{n_i} f(y_i1j | b_i; β1, φ1) f(y_i2j | b_i; β2, φ2)] |2πΣ|^{-1/2} exp(-(1/2) b_i'Σ^{-1}b_i).

Let b̂_i be the mode of f(y_i, b_i) and let Γ̂_i = (-∂² ln f(y_i, b_i)/∂b_i∂b_i')^{-1}, evaluated at b̂_i. Then

L_i = ∫ f*(y_i, b_i) φ(b_i; b̂_i, Γ̂_i) db_i,

where f*(y_i, b_i) = f(y_i, b_i)/φ(b_i; b̂_i, Γ̂_i) and

φ(b_i; b̂_i, Γ̂_i) = |2πΓ̂_i|^{-1/2} exp(-(1/2)(b_i - b̂_i)'Γ̂_i^{-1}(b_i - b̂_i)).

The needed transformation of b_i is then b_i = b̂_i + √2 A_i z_i, where Γ̂_i = A_i A_i'. Hence
each integral is approximated by

L_i ≈ |√2 A_i| Σ_{k1=1}^m ... Σ_{kq=1}^m [∏_{j=1}^q w_kj exp(d_kj²)] f(y_i, z_k),

with the tabled univariate weights w_kj and nodes z_k = b̂_i + √2 A_i d_k for the multiple
index k = (k1, ..., kq), where d_k = (d_k1, ..., d_kq)' collects the tabled nodes of Gauss-Hermite
integration of order m.

The difference between this procedure and the ordinary Gauss-Hermite approx-

imation is in the centering and spread of the nodes that are used. Here they are
distributed where most of the data are available and this makes adaptive quadra-

ture more efficient. The computation of the modes bi and the curvatures ri requires
maximization of the integrand function for each subject, which in general requires a

numerical procedure, but this was not a major impediment for the pony data. We

used MaxBFGS within MaxBFGS (an inner run for the subject-specific modes nested inside the outer likelihood maximization) to perform the maximization.
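A one-dimensional sketch of the procedure, with a hypothetical Poisson-normal integrand rather than the pony model itself, illustrates the centering and rescaling of the nodes:

```python
import numpy as np

def log_f(b, y, tau):
    # Log integrand: Poisson(exp(b)) likelihood for a count y times a
    # N(0, tau^2) random-effect kernel (constants omitted; they cancel below)
    return y * b - np.exp(b) - b ** 2 / (2.0 * tau ** 2)

def adaptive_gh(y, tau, m=10):
    """Adaptive Gauss-Hermite approximation of the integral of exp(log_f(b)) db."""
    b = 0.0
    for _ in range(50):                            # Newton search for the mode b_hat
        g = y - np.exp(b) - b / tau ** 2           # d log f / db
        h = -np.exp(b) - 1.0 / tau ** 2            # d^2 log f / db^2  (< 0)
        b -= g / h
    gamma = -1.0 / h                               # curvature (-d^2 log f/db^2)^(-1) at the mode
    z, w = np.polynomial.hermite.hermgauss(m)      # physicists' nodes and weights
    nodes = b + np.sqrt(2.0 * gamma) * z           # center and rescale the nodes
    return np.sqrt(2.0 * gamma) * np.sum(w * np.exp(log_f(nodes, y, tau) + z ** 2))

# Brute-force reference value on a fine grid
grid = np.linspace(-10.0, 10.0, 200001)
ref = np.exp(log_f(grid, y=7, tau=1.5)).sum() * (grid[1] - grid[0])
approx = adaptive_gh(y=7, tau=1.5, m=10)
```

With m = 1 this reduces to the Laplace approximation mentioned later in this section.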

Negative Binomial Model

When the dispersion in count data is more than that predicted by the Poisson model

a convenient parametric approach is to assume a Gamma prior for the Poisson mean

(McCullagh and Nelder, 1989). Letting Y ~ Poisson(μ), then P(Y = y) = μ^y e^{-μ}/y!,

y = 0, 1, 2, ..., and E(Y) = Var(Y) = μ. Suppose that we want to specify a larger

variance for Y but keep the mean the same. This can be accomplished as suggested

by Booth and Hobert (personal communication). Let Y|u ~ Poisson(uμ) and let
u ~ Gamma(α, 1/α) (shape α, scale 1/α, so that E(u) = 1). The density for u is
f(u) = [α^α/Γ(α)] u^{α-1} e^{-αu}. Then unconditionally Y has a negative binomial
distribution with probability density

P(Y = y) = [Γ(y + α)/(Γ(α) y!)] (α/(μ + α))^α (μ/(μ + α))^y,   y = 0, 1, 2, ...,

and E(Y) = μ and Var(Y) = μ + μ²/α. Notice that for large α the negative binomial
variance approaches the Poisson variance, meaning that there is little overdisper-
sion, but for small α the negative binomial variance can be much larger than the

Poisson variance.
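The mean-variance relationship can be checked by simulating the Gamma-Poisson mixture (arbitrary illustrative values μ = 3 and α = 2):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, alpha = 3.0, 2.0
n = 200_000

# u ~ Gamma(alpha, 1/alpha), so E(u) = 1 and Var(u) = 1/alpha
u = rng.gamma(shape=alpha, scale=1.0 / alpha, size=n)
y = rng.poisson(mu * u)          # Y | u ~ Poisson(u * mu)

# Unconditionally E(Y) = mu and Var(Y) = mu + mu^2/alpha = 3 + 9/2 = 7.5
sample_mean, sample_var = y.mean(), y.var()
```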

The mean in the above model can be specified as a function of covariates using

a log link, ln(μ_i) = x_i'β, i = 1, ..., n. The log link is convenient because it is the

canonical link for the Poisson distribution and transforms the mean so that it can

take any real value. It is possible to obtain maximum-likelihood estimates of a and

/3 by directly numerically maximizing the negative binomial likelihood but there is

a more elegant way based on the EM algorithm as proposed by Booth and Hobert

(personal communication).

The complete data consist of y and u, and the complete data log-likelihood is

ln L_c = c + Σ_{i=1}^N [α ln(α) - ln Γ(α) + y_i ln(μ_i) + α ln(u_i) - u_i(μ_i + α)],

where c is a constant depending only on the data but not on the unknown parameters.

The lth E-step of the EM algorithm then involves calculating

E(ln L_c(α, β) | y; α̂_(l-1), β̂_(l-1)) = c + Nα ln(α) - N ln Γ(α)
    + Σ_{i=1}^N [y_i x_i'β + α E(ln u_i | y_i, α̂_(l-1), β̂_(l-1)) - E(u_i | y_i, α̂_(l-1), β̂_(l-1))(exp(x_i'β) + α)].

But because of the choice of the prior distribution for u_i, the posterior distribution of
u_i | y_i evaluated at (α̂_(l-1), β̂_(l-1)) is Gamma(y_i + α̂_(l-1), 1/(α̂_(l-1) + exp(x_i'β̂_(l-1)))). Therefore

E(ln u_i | y_i, α̂_(l-1), β̂_(l-1)) = ψ(y_i + α̂_(l-1)) - ln(exp(x_i'β̂_(l-1)) + α̂_(l-1))

and

E(u_i | y_i, α̂_(l-1), β̂_(l-1)) = (y_i + α̂_(l-1)) / (α̂_(l-1) + exp(x_i'β̂_(l-1))),

where ψ(·) denotes the digamma function (the first derivative of the log-gamma function).


At the lth M-step the expected log-likelihood is maximized with respect to α and

β. The two parameters are present in different terms of the log-likelihood and hence

two separate numerical maximizations are required. Note that if we adopted a direct

approach, the two parameters would need to be maximized over jointly, which could lead to

more numerical problems.
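A runnable sketch of this EM scheme for simulated data follows. The numerical digamma (a central difference of the log-gamma function) and the log-scale bisection are implementation conveniences of this sketch, not part of the algorithm as proposed; the simulated data are hypothetical.

```python
import math
import numpy as np

def digamma(x, h=1e-5):
    # Numerical derivative of log-gamma; adequate accuracy for this sketch
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def nb_em(y, X, n_iter=200):
    """EM for the negative binomial model: Y_i|u_i ~ Poisson(u_i * mu_i),
    u_i ~ Gamma(alpha, 1/alpha), ln(mu_i) = x_i' beta."""
    n, p = X.shape
    beta, alpha = np.zeros(p), 1.0
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        # E-step: posterior of u_i is Gamma(y_i + alpha, 1/(alpha + mu_i))
        Eu = (y + alpha) / (alpha + mu)
        Elnu = np.array([digamma(yi + alpha) for yi in y]) - np.log(alpha + mu)
        # M-step for beta: Newton steps on the weighted Poisson-type score
        for _ in range(5):
            mu = np.exp(X @ beta)
            grad = X.T @ (y - Eu * mu)
            hess = -(X * (Eu * mu)[:, None]).T @ X
            beta = beta - np.linalg.solve(hess, grad)
        # M-step for alpha: the score ln(a) + 1 - psi(a) + mean(Elnu - Eu)
        # is strictly decreasing in a, so solve it by bisection on the log scale
        c = float(np.mean(Elnu - Eu))
        def score(a):
            return math.log(a) + 1.0 - digamma(a) + c
        lo, hi = 1e-3, 1e4
        if score(hi) > 0:
            alpha = hi
        else:
            for _ in range(80):
                mid = math.sqrt(lo * hi)
                if score(mid) > 0:
                    lo = mid
                else:
                    hi = mid
            alpha = math.sqrt(lo * hi)
    return beta, alpha

# Simulated example (hypothetical data, not the pony study)
rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])
beta_true, alpha_true = np.array([0.5, 0.8]), 2.0
u = rng.gamma(alpha_true, 1.0 / alpha_true, n)
y = rng.poisson(u * np.exp(X @ beta_true))
beta_hat, alpha_hat = nb_em(y, X)
```

As in the text, α and β appear in separate terms of the expected complete-data log-likelihood, so the two maximizations are carried out independently within each M-step.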

When there are random effects the model can be specified as follows:

y_ij | b_i, u_ij  indep. Poisson(μ_ij u_ij),

u_ij  i.i.d. Gamma(α, 1/α),

ln(μ_ij) = x_ij'β + z_ij'b_i,

b_i  i.i.d. N_q(0, Σ),

and the random effects bi and uij are independent.

The unknown parameters α, β and Σ can be estimated by a nested EM algorithm.
Such an algorithm has been proposed by Booth and Hobert (personal communication)
for the overdispersed Poisson GLMM, and more generally by van Dyk (1999) who
motivated it from the point of view of computational efficiency. The complete data
for the outer loop consists of y and b and the outer EM algorithm is performed as

outlined in Section 3.2.2 but with an EM maximization procedure for α and β as

introduced above. The complete data log-likelihood has the form

ln L_c = c + Σ_{i,j} [ln Γ(y_ij + α) - ln Γ(α) + α ln(α) - (α + y_ij) ln(α + μ_ij) + y_ij ln(μ_ij)]
        - (1/2) Σ_{i=1}^n [ln|Σ| + b_i'Σ^{-1}b_i],

and at the rth E-step its conditional expectation is approximated by the Monte Carlo sum

(1/m) Σ_{k=1}^m { c + Σ_{i,j} [ln Γ(y_ij + α) - ln Γ(α) + α ln(α) - (α + y_ij) ln(α + μ_ij^(k)) + y_ij ln(μ_ij^(k))]
        - (1/2) Σ_{i=1}^n [ln|Σ| + b_i^(k)'Σ^{-1}b_i^(k)] },

where b_i^(k) are generated from the conditional distribution of b_i | y_i; ψ̂^(r-1) and
μ_ij^(k) = exp(x_ij'β + z_ij'b_i^(k)). The first part of that sum is what needs to be maximized to
obtain estimates of α and β. Notice that each term in the first Monte Carlo sum is
a log-likelihood for a negative binomial model with a different (but fixed) mean μ_ij^(k)
and hence it can be subjected to an EM algorithm by augmenting the observed data
y by u.
In summary, the nested EM algorithm is as follows:

1. Select an initial estimate ψ̂^(0). Set r = 0.

2. Increase r by 1.

   E-step: For each subject i, i = 1, ..., n generate m random samples from the
   distribution of b_i | y_i; ψ̂^(r-1) using rejection sampling and approximate
   E(ln L_c(y, b; ψ) | y; ψ̂^(r-1)) by a Monte Carlo sum.

3. M-step:

   3.1. Maximize with respect to the elements of Σ as usual.

   3.2. Set l = 0, α̂_(0) = α̂^(r-1) and β̂_(0) = β̂^(r-1).

   3.3. Increase l by 1.

        Inner E-step: For any given b_i^(k) compute
        E(ln L_c(y, u; α, β) | y; α̂_(l-1), β̂_(l-1)).

   3.4. Inner M-step: Find α̂_(l) and β̂_(l) to maximize the Monte Carlo sum
        of conditional expectations.

   3.5. Iterate between (3.3) and (3.4) until convergence for α and β is
        achieved.

4. Iterate between (2) and (3) until convergence for ψ is achieved.

In the multivariate GLMM there are two or more response variables but the nested

EM algorithm works in essentially the same way because of the conditional indepen-

dence between the variables. In the next section we describe a bivariate GLMM for

the pony data.

Analysis of Pony Data

The model is defined as follows:

y_i1j = jth duration measurement on the ith pony,

y_i2j = jth spike burst rate measurement on the ith pony,

y_i1j | b_i1  indep. Gamma with mean μ_i1j and dispersion φ1 (variance φ1 μ_i1j²),

y_i2j | b_i2, u_ij  indep. Poisson(μ_i2j u_ij),

u_ij  i.i.d. Gamma(α, 1/α),

ln(μ_i1j) = x_i1j'β1 + b_i1,

ln(μ_i2j) = x_i2j'β2 + b_i2,

b_i = (b_i1, b_i2)' ~ MVN(0, Σ),   Σ = [ σ1²     ρσ1σ2 ]
                                       [ ρσ1σ2   σ2²   ].

Let d_ijk be 1 if drug k is administered to pony i at occasion j, k = 1, ..., 6, and 0
otherwise. Placebo is coded as drug 7 and will be the reference group. Let t denote
time. Then the linear predictors are

x_i1j'β1 = β10 + β11 d_ij1 + β12 d_ij2 + β13 d_ij3 + β14 d_ij4 + β15 d_ij5 + β16 d_ij6,

x_i2j'β2 = β20 + β21 d_ij1 + β22 d_ij2 + β23 d_ij3 + β24 d_ij4 + β25 d_ij5 + β26 d_ij6
         + β27 t + β28 d_ij1 t + β29 d_ij2 t + β2,10 d_ij3 t + β2,11 d_ij4 t + β2,12 d_ij5 t + β2,13 d_ij6 t.
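The two design vectors can be assembled per observation as follows (a sketch; drugs are coded 1-6 with placebo as 7, the reference group):

```python
import numpy as np

def design_rows(drug, t):
    """Design vectors x_i1j (duration) and x_i2j (spike burst rate) for one
    observation; drug in 1..7 with 7 = placebo (the reference group)."""
    d = np.zeros(6)
    if drug <= 6:
        d[drug - 1] = 1.0                        # dummy indicators d_ij1, ..., d_ij6
    x1 = np.concatenate(([1.0], d))              # intercept + six drug effects (7 terms)
    x2 = np.concatenate(([1.0], d, [t], t * d))  # plus time and drug-by-time (14 terms)
    return x1, x2

x1, x2 = design_rows(drug=1, t=30.0)
```

This yields the 7 regression parameters for duration and the 14 (β20 through β2,13) for spike burst rate described above.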

In the univariate analysis we initially considered the same linear predictor for
both variables, but the time-by-drug interaction and the time main effect were clearly
not significant for the duration response and were dropped from the model. Also, we
performed fixed effects analysis using PROC GENMOD and fitted two random effects
models on log-transformed variables using PROC MIXED for the complete data (all
time points and all electrode groups). There was evidence of a three-way interaction
between electrode group, time and drug. Describing this interaction requires estimat-
ing 76 regression parameters for a linear time trend as compared to 21 in the above
specification. Recall that in the other data example we estimated up to five regression
parameters, so the numerical maximization procedures in this example were expected
to be much more complicated. The complete pony data set was also about ten times
larger than the ethylene glycol data set and the time trends appeared complicated
and not well described by simple linear or quadratic time effects. One might need to
use splines to adequately describe the trends. Hence, we concentrated on a particular

research question, namely to describe differences in the immediate effects (up to 60

minutes after administration) of the drugs in the cecal base (corresponding to one

of the electrode groups). In the analyses performed by Lester et al. (1998a, 1998b,

1999c) there was evidence that some drugs led to immediate significant increase in

spike burst count or spike burst duration, while others took longer or did not lead to

increase at all, and we decided to address that issue in our analysis.

We programmed the nested Monte Carlo EM and the adaptive Gaussian quadra-

ture algorithms both for the joint and for the separate fits. We used a negative

binomial distribution for the count response, a gamma distribution for the duration

response and log link functions for both. (Although the log function is not the canon-

ical link function for the Gamma response, it is convenient because it transforms the

mean to the whole real line). As initial estimates we used the final estimates from the

pseudo-likelihood approach using the %GLIMMIX macro but rounded only up to

one significant digit after the decimal point. The macro does not allow specification

of a negative binomial distribution and hence we fitted a Poisson distribution with

an extra dispersion parameter.

The results from the analysis are summarized in Table 3.15. MaxBFGS had

convergence problems when the responses were fitted together and adaptive Gaussian

quadrature was used, so no results are presented for this case. We report the results

using only one quadrature point because the estimates and their standard errors are

essentially the same if the number of quadrature points is increased. Adaptive Gaus-

sian quadrature with one quadrature point is equivalent to Laplacian approximation

(Liu and Pierce, 1994) which is indicative that the analytical approximations work

well for this data set. The simulation sample size for the separate fit of the Gamma

response was increased at each iteration. Convergence was achieved when the simula-

tion sample size was 31174 after 20 iterations and took about 1 hour. For the negative

binomial response the final simulation sample size was still 100 after 542 iterations

and it also took about 1 hour to converge. The joint fit took about 21 hours until

convergence with a final simulation sample size of 23381 and 201 iterations.

The results from the joint and from the separate fits are almost identical and the

correlation between the two response variables is not significant (ρ̂ = 0.71, SE =

0.50). Notice that ρ̂ is quite large and the fact that it is not significant may be

partially due to the fact that there are only six subjects in the data set. Notice also

that the standard error estimates from the three methods are similar with only some

of the adaptive Gaussian quadrature standard errors being smaller than their Monte

Carlo counterparts. The Monte Carlo standard error estimates did not show as much

variability as in the ethylene glycol example.

For both responses the coefficients for drug 1 and for drug 4 are significantly

different from zero. This means that both drugs have significantly different immediate

effects on the response variables than saline solution. Drug 1 leads to a significant

decrease in the individual level of duration for one hour after the drug is administered

and drug 4 is associated with a significant increase. Of all the interactions between

drug and time only the one involving drug 1 for the count response is significant.

3.5 Additional Methods

The empirical Bayes posterior mode procedure, discussed in Fahrmeir and Tutz

(1994, pp.233-238), can also be used to estimate the parameters. The Newton-

Raphson equations for the extended model when a flat (or vague) prior is used and

the dispersion parameters are set to 1.0, are essentially no more complicated than

those for the generalized linear mixed models. Fahrmeir and Tutz, however, do not

discuss the estimation of φ1 and φ2 when they are unknown. The dispersion parame-

ters can be estimated together with Σ via maximum likelihood, treating the current

estimates of the fixed and the random effects as the true values. Another possibility is

to put noninformative priors on φ1 and φ2 and estimate them together with β and b.

Table 3.15. Final maximum likelihood estimates for pony data
Monte Carlo EM Adaptive
Joint fit Separate fits Separate fits
Parameter Estimate SE Estimate SE Estimate SE
What noninformative priors should be used in order to avoid dealing with improper

posteriors is a topic for further research. The question about the propriety of the

posterior also arises when Gibbs sampling is applied for posterior mean estimation,

which is yet another method that can be used for fitting the extended model.


The estimates considered in the previous chapter are approximate maximum like-

lihood and hence confidence intervals and hypothesis tests can be constructed accord-

ing to asymptotic maximum likelihood theory. Rigorous proof of the properties of

the estimates requires checking the regularity conditions for consistency and asymp-

totic normality in the case of independent but not necessarily identically distributed

random vectors. Such conditions have been established by Hoadley (1971) but are

difficult to check for the generalized linear mixed model and its multivariate exten-

sion because of the lack of closed-form expression for the marginal likelihood. To our

knowledge these conditions have not been verified for the generalized linear mixed

model and we could not do that for the multivariate GLMM. Instead we assume that

the regularity conditions are satisfied and rely on the general results for maximum
likelihood estimates. Caution is applied to testing for the significance of variance

components because when a parameter falls on the boundary of the parameter space

the asymptotic distribution of its maximum likelihood estimate is no longer normal

(Moran, 1971; Chant, 1974; Self and Liang, 1987). Score tests may be a good alterna-

tive to the usual Wald and likelihood-ratio tests because their asymptotic properties

are retained on the boundary (Chant, 1974). Score test statistics are also computed

under the null hypothesis and do not require fitting of more complicated models and

hence can be used to check the conditional independence assumption.

Since there are no closed form expressions for the marginal likelihood, the score

and the information matrix, numerical, stochastic or analytical approximations must

be used when computing test statistics and constructing confidence intervals. This

aggravates the problem of determining actual error rates and coverage probabilities

and requires the use of simulations to study the behaviour of the tests. Analytical and

stochastic approximations can be improved by increasing the number of quadrature

points or simulated sample sizes, but the precision of analytical approximations can

not be directly controlled. For example, in the case of binary data pseudo-likelihood

works well if the binomial denominator is large (Breslow and Clayton, 1993). But

the latter depends only on the data, so for a particular data set, the approximation

is either good or bad. In general the estimates based on analytical approximations

are asymptotically biased.

In this chapter we concentrate on analytical and stochastic approximations of

Wald, score and likelihood ratio statistics and study their performance for checking

the conditional independence assumption. We also show how these approximations

can be constructed for testing fixed effects and for estimating random effects, and con-

sider them for checking the significance of variance components. Since the asymptotic

maximum likelihood results hold when the number of subjects increases to infinity

we focus on the ethylene glycol example. The pony data has only six subjects and

as suggested in Chapter 3 inference concerning correlation between the two response

variables is suspect.

Section 4.1 discusses Wald and likelihood ratio tests for testing the fixed effects.

We briefly consider estimation of random effects and prediction of future observations

in Section 4.2. The score approach is introduced in Section 4.3.1 by providing a histor-

ical overview. We then propose score tests for checking the conditional independence

assumption (Section 4.3.2) and for testing the variance components (Section 4.3.3).

The ethylene glycol example is used for illustration in Section 4.4 and Section 4.5

contains the results from a small simulation study for the performance of the pro-
posed conditional independence test. The chapter concludes with discussion of future
research topics.

4.1 Inference about Regression Parameters

The asymptotic properties of maximum likelihood estimates have been studied
under a variety of conditions. The usual assumption is that the observations on
which the maximum likelihood estimates are based are independent and identically
distributed (Foutz, 1977), but results are available also for models in which the obser-
vations are independent but not identically distributed (Hoadley, 1971). In general,
let y_1, y_2, ..., y_n be independent random vectors with density or mass functions

f_1(y_1, ψ), f_2(y_2, ψ), ..., f_n(y_n, ψ) depending on a common unknown parameter vec-
tor ψ. Then as n → ∞ under certain regularity conditions the maximum likelihood

estimator ψ̂ is consistent and asymptotically normal:

√n(ψ̂ - ψ) →_d N(0, I^{-1}(ψ)),

where I(ψ) = lim_{n→∞} (1/n) Σ_{i=1}^n E[-∂²l_i(ψ)/∂ψ∂ψ'], with l_i(ψ) = ln f_i(y_i, ψ).

Two basic principles of testing are directly based on the asymptotic distribution
of the maximum likelihood estimate: the Wald test (Wald, 1943) and the likelihood

ratio test (Neyman and Pearson, 1928). Let ψ = (ψ_1', ψ_2')'. The significance of a
subset ψ_1 of the parameter vector ψ can be tested by either one of them as follows.
The null hypothesis is

H_0 : ψ_1 = 0,                                                        (4.1)

where 0 is in the interior of the parameter space. (In general the null hypothesis is
H_0 : ψ_1 = ψ_10 but herein we consider the simpler test.) Partition the information
matrix as

I(ψ) = [ I_11(ψ)  I_12(ψ)
         I_21(ψ)  I_22(ψ) ],

and let I^11(ψ) denote the corresponding upper-left block of I^{-1}(ψ).
The Wald test statistic is then

T_W = n ψ̂_1' [I^11(ψ̂)]^{-1} ψ̂_1,

with ψ_2 in I^11 replaced by the consistent maximum likelihood estimator ψ̂_2 under
the null hypothesis. Under H_0, T_W →_d χ²_d, where d is the dimension of the parameter
vector ψ_1. Because in the case of independent data nI(ψ) may not be available and in
random effects models the expected information matrix may be hard to calculate, we
use J(ψ̂) = Σ_{i=1}^n (-∂²l_i(ψ)/∂ψ∂ψ')|_{ψ=ψ̂} instead of nI(ψ̂). Some authors argue that it is more

appropriate to use the observed rather than the expected information (Efron and
Hinkley, 1978) but unlike the expected information matrix, the observed information
matrix is not guaranteed to be positive-definite. The latter problem is exacerbated
when the numerical or stochastic approximations discussed in Section 3.2 are applied
to approximate the observed information matrix J(V)), which is not available in closed
form. We already used Wald tests in Chapter 3 to test the significance of individual
regression coefficients, but we can also use Wald tests to check several regression
coefficients simultaneously.
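The Wald computation just described can be sketched in a few lines. This is an illustrative fragment, not code from the dissertation: the estimate `psi_hat`, the observed information `J` and the toy numbers are all hypothetical, with `J` standing in for the (possibly approximated) observed information matrix discussed above.

```python
import numpy as np
from scipy import stats

def wald_test(psi_hat, J, idx):
    """Wald test of H0: psi[idx] = 0.

    psi_hat : full parameter estimate
    J       : observed information matrix evaluated at psi_hat
    idx     : indices of the tested subset psi_1

    Var(psi_hat) is approximated by J^{-1}; the statistic is
    psi_1' [Var(psi_1)]^{-1} psi_1, asymptotically chi^2_d under H0.
    """
    cov = np.linalg.inv(J)            # approximate covariance of psi_hat
    v11 = cov[np.ix_(idx, idx)]       # block for the tested subset
    p1 = psi_hat[idx]
    t_w = float(p1 @ np.linalg.solve(v11, p1))
    df = len(idx)
    p_value = 1 - stats.chi2.cdf(t_w, df)
    return t_w, p_value

# toy example: 3 parameters, test the last two jointly
psi_hat = np.array([0.8, 0.1, -0.05])
J = np.diag([50.0, 100.0, 100.0])     # hypothetical observed information
t_w, p = wald_test(psi_hat, J, [1, 2])
```

In practice `J` would come from the quadrature or Monte Carlo approximations of Section 3.2 rather than being supplied directly.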
The likelihood ratio statistic for the hypothesis (4.1) is defined as follows. Let
$M_1$ be the model with unknown parameter vector $\psi$ and let $M_2$ be a reduced model
with $\psi_1 = 0$ and $\psi_2$ held unrestricted. Let also $l_1$ and $l_2$ denote the maximized
log-likelihoods for models $M_1$ and $M_2$ respectively. Under $H_0$, as $n \to \infty$,

$$T_{LR} = -2(l_2 - l_1) \xrightarrow{d} \chi^2_d,$$

so the likelihood ratio and the Wald statistic have the same asymptotic distribution
under the null hypothesis. The absence of closed-form expressions for 11 and 12 necessi-
tates the use of either Gaussian quadrature or Monte Carlo approximations. Gaussian
quadrature approximations are applied exactly as described in Section 3.2.1. To obtain
Monte Carlo approximations, $m$ samples $b_i^{(k)}$, $k = 1, \ldots, m$, from the random effects
distribution $N_q(0, \hat\Sigma)$ are generated. Then

$$l_1 \approx \sum_{i=1}^n \ln\left\{ \frac{1}{m} \sum_{k=1}^m f(y_i \mid b_i^{(k)}; \hat\beta, \hat\phi) \right\}.$$

Here $\hat\Sigma$, $\hat\beta$ and $\hat\phi$ are the final maximum likelihood estimates from the full model $M_1$.
The same type of approximation is used for $l_2$, but evaluated at $\psi_1 = 0$ and at the
maximum likelihood estimator $\hat\psi_2$ under $M_2$.
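The averaging step $\frac{1}{m}\sum_k f(y_i \mid b_i^{(k)})$ can be illustrated on a toy normal random-intercept model, chosen because the reduced model's marginal log-likelihood has a closed form to check against. Everything here (model, data, function names) is a hypothetical sketch, not the dissertation's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_loglik(y, sigma_b, sigma_e, m=2000):
    """Monte Carlo approximation of the marginal log-likelihood
    sum_i log{(1/m) sum_k f(y_i | b_k)} for a toy random-intercept
    model y_ij = b_i + e_ij with b_i ~ N(0, sigma_b^2)."""
    total = 0.0
    for yi in y:                              # yi: response vector of subject i
        b = rng.normal(0.0, sigma_b, size=m)  # draws b_i^(k) ~ N(0, sigma_b^2)
        dev = yi[:, None] - b[None, :]
        # log f(y_i | b_k) = sum_j log N(y_ij; b_k, sigma_e^2)
        logf = -0.5 * np.sum(dev**2, axis=0) / sigma_e**2 \
               - 0.5 * len(yi) * np.log(2 * np.pi * sigma_e**2)
        total += np.log(np.mean(np.exp(logf)))
    return total

# toy data: 20 subjects, 3 observations each
y = [rng.normal(0.0, 1.0, size=3) for _ in range(20)]
l1 = mc_loglik(y, sigma_b=0.5, sigma_e=1.0)   # "full" model value
l2 = mc_loglik(y, sigma_b=1e-8, sigma_e=1.0)  # near-zero variance (reduced model)
t_lr = -2.0 * (l2 - l1)
```

With the random-intercept variance driven to zero, `l2` should essentially reproduce the independent-normal log-likelihood, which is one way to sanity-check the approximation before trusting `t_lr`.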

Under ideal conditions for large enough number of quadrature points in the Gaus-
sian quadrature algorithm and for large enough simulation sample size in the MCEM
algorithm the approximate maximum likelihood estimates will be arbitrarily close to
the true estimates. Also the additional approximations needed to compute the Wald
and the likelihood ratio statistics above can be made very precise, hence the statis-
tics should perform well for a large number of subjects. But in reality there can be
problems because it is not clear how many quadrature points and what number of
simulation samples are needed for adequate approximations.
Even if all approximations are adequate and the sample size is large enough, the
Wald and likelihood ratio tests can still run into problems if a parameter is on the
boundary of the parameter space. This happens in testing the significance of a variance
term. Hence Wald and likelihood ratio tests should be used with caution in that situation.

It has been proven by several authors (Moran, 1971; Chant, 1974; Self and Liang,
1987) that when a parameter is on the boundary of the parameter space the asymptotic
distribution of the maximum likelihood estimator is no longer normal but rather
a mixture of distributions. Score tests, as discussed in Section 4.3, are not affected,
however, and therefore may be a good substitute for Wald and likelihood ratio tests.

4.2 Estimation of Random Effects

In some applications it is of interest to obtain estimates for the unobserved random
effects. A natural point estimator is the conditional mean

$$E[b_i \mid y_i; \psi] = \frac{\int b_i\, f(y_i \mid b_i; \psi)\, f(b_i; \psi)\, db_i}{\int f(y_i \mid b_i; \psi)\, f(b_i; \psi)\, db_i},$$

which is not available in closed form but can be approximated either numerically or
stochastically. Gauss-Hermite quadrature involves two separate approximations of
the numerator and the denominator, while the stochastic approximation is via the
simple Monte Carlo sum
$$\hat{E}[b_i \mid y_i] = \frac{1}{m} \sum_{k=1}^m b_i^{(k)},$$

where $b_i^{(1)}, \ldots, b_i^{(m)}$ are simulated from the conditional distribution of $b_i \mid y_i; \hat\psi$ using
a technique such as rejection sampling. Note that this approximation is performed
anyway in the MCEM algorithm and hence the random effects estimate is obtained
at no extra cost. Estimates of the random effects are needed for prediction. For
example, one might be interested in obtaining an estimate of the linear predictor for a
particular subject:

$$\hat\eta_{ij} = x_{ij}^T \hat\beta + z_{ij}^T \hat b_i.$$
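A minimal sketch of the conditional-mean estimate: in a toy normal random-intercept model the distribution $b_i \mid y_i$ is itself normal (a standard conjugate-normal result), so we can draw from it directly and compare the Monte Carlo average with the exact conditional mean. In a GLMM the draws would instead come from rejection sampling within the MCEM algorithm; all numbers here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: y_ij = b_i + e_ij, b_i ~ N(0, sb2), e_ij ~ N(0, se2).
# Then b_i | y_i ~ N(w * ybar_i, v) with
#   w = n_i sb2 / (n_i sb2 + se2),  v = sb2 se2 / (n_i sb2 + se2).
sb2, se2 = 2.0, 1.0
yi = np.array([1.2, 0.8, 1.5])
ni = len(yi)
w = ni * sb2 / (ni * sb2 + se2)
v = sb2 * se2 / (ni * sb2 + se2)
exact_mean = w * yi.mean()

# Monte Carlo estimate of E(b_i | y_i): average of m draws from b_i | y_i
m = 200_000
draws = rng.normal(exact_mean, np.sqrt(v), size=m)
b_hat = draws.mean()
```

The Monte Carlo average converges to the exact conditional mean at rate $\sqrt{v/m}$, which is the sense in which the estimator comes "at no extra cost" when the draws already exist.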

Obtaining the variance of the random effects estimate is not straightforward.
The simplest approach is to use the conditional variance

$$\mathrm{Var}(b_i \mid y_i; \psi)$$

evaluated at $\hat\psi$, but this 'naive' estimate may underestimate the true variance as it
does not account for the sampling variability of $\hat\psi$. As an alternative, Booth and
Hobert (1998) suggested using a 'conditional mean square error of prediction' as a
measure of prediction variance. Their approach can also be applied to the multivariate GLMM.
Note that in the two 'real-life' examples that we consider, estimation of the random
effects is not of particular interest. The subjects are mice and ponies respectively and
their individual characteristics are a source of variability that needs to be accounted
for but not necessarily precisely estimated for each subject. In other applications, for
example in small area estimation, prediction of the random effects is a very important
objective of the analysis.

4.3 Inference Based on Score Tests

4.3.1 General Theory

An overview of the historical development of the score test is provided in a research
paper by Bera and Bilias (1999). Rao (1947) was the first to introduce the
fundamental principle of testing based on the score function as an alternative to
likelihood ratio and Wald tests. Let $y_1, y_2, \ldots, y_n$ be i.i.d. observations with density
$f(y_i; \psi)$. Denote the joint log-likelihood, score and expected information matrix of
those observations by $l(\psi)$, $s(\psi)$ and $I(\psi)$ respectively. Suppose that the interest is
in testing a simple hypothesis against a local alternative,

$$H_0: \psi = \psi_0 \quad \text{vs} \quad H_a: \psi = \psi_0 + \Delta,$$

where $\psi = (\psi_1, \ldots, \psi_p)^T$ and $\Delta = (\Delta_1, \ldots, \Delta_p)^T$. Then the score test is based on

$$T_s = s(\psi_0)^T I(\psi_0)^{-1} s(\psi_0),$$

which has an asymptotic $\chi^2_p$ distribution. If the null hypothesis is composite, that is,

$$H_0: h(\psi) = c,$$

where $h(\psi)$ is an $r \times 1$ vector function of $\psi$ with $r < p$ restrictions and $c$ is a known
constant vector, Rao (1947) suggested using

$$T_s(\tilde\psi) = s(\tilde\psi)^T I(\tilde\psi)^{-1} s(\tilde\psi),$$

where $\tilde\psi$ is the restricted maximum likelihood estimate of $\psi$, that is, the estimate
obtained by maximizing the log-likelihood function $l(\psi)$ under $H_0$. $T_s(\tilde\psi)$ has an
asymptotic $\chi^2_r$ distribution.
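As a self-contained illustration of Rao's principle, the following sketch performs a score test for a Poisson mean, a case where the score and information are available in closed form. The data and function name are hypothetical; the point is that only quantities under $H_0$ are required, with no fitting of the alternative.

```python
import numpy as np
from scipy import stats

def score_test_poisson(y, lam0):
    """Rao score test of H0: lambda = lam0 for i.i.d. Poisson data.

    Score:       s(lam)  = sum(y)/lam - n
    Information: I(lam)  = n/lam
    T_s = s(lam0)^2 / I(lam0) ~ chi^2_1 under H0.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    s = y.sum() / lam0 - n
    info = n / lam0
    t_s = s**2 / info
    p_value = 1 - stats.chi2.cdf(t_s, df=1)
    return t_s, p_value

y = [3, 1, 4, 2, 2, 3, 5, 0, 2, 3]   # hypothetical counts
t_s, p = score_test_poisson(y, lam0=2.0)
```

A failure to reject here simply means the data are compatible with the hypothesized mean; no maximization over the alternative was ever performed.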
Independently of Rao's work Neyman (1959) proposed C(a) tests. These tests
are specifically designed to deal with hypothesis testing of a parameter of primary
interest in the presence of nuisance parameters and are more general than Rao's score
tests in that any $\sqrt{n}$-consistent estimates for the nuisance parameters can be used,
not only the maximum likelihood estimators. By design C(a) tests maximize the
slope of the limiting power function under local alternatives to the null hypothesis.
Neyman assumed the same setup as in Rao's score test but considered the simple
null hypothesis
$$H_0: \psi_1 = \psi_{10},$$

where $\psi = (\psi_1, \psi_2^T)^T$. Notice that $\psi_1$ is a scalar. The score vector and the information
matrix are partitioned as follows:

$$s(\psi) = \begin{pmatrix} s_{\psi_1}(\psi_1, \psi_2) \\ s_{\psi_2}(\psi_1, \psi_2) \end{pmatrix}, \qquad
I(\psi) = \begin{pmatrix} I_{\psi_1\psi_1}(\psi_1, \psi_2) & I_{\psi_1\psi_2}(\psi_1, \psi_2) \\ I_{\psi_2\psi_1}(\psi_1, \psi_2) & I_{\psi_2\psi_2}(\psi_1, \psi_2) \end{pmatrix}.$$

Then the C($\alpha$) score statistic is

$$C(\alpha) = \left[ s_{\psi_1}(\psi_{10}, \tilde\psi_2) - I_{\psi_1\psi_2}(\psi_{10}, \tilde\psi_2)\, I^{-1}_{\psi_2\psi_2}(\psi_{10}, \tilde\psi_2)\, s_{\psi_2}(\psi_{10}, \tilde\psi_2) \right]^T$$
$$\times \left[ I_{\psi_1\psi_1}(\psi_{10}, \tilde\psi_2) - I_{\psi_1\psi_2}(\psi_{10}, \tilde\psi_2)\, I^{-1}_{\psi_2\psi_2}(\psi_{10}, \tilde\psi_2)\, I_{\psi_2\psi_1}(\psi_{10}, \tilde\psi_2) \right]^{-1}$$
$$\times \left[ s_{\psi_1}(\psi_{10}, \tilde\psi_2) - I_{\psi_1\psi_2}(\psi_{10}, \tilde\psi_2)\, I^{-1}_{\psi_2\psi_2}(\psi_{10}, \tilde\psi_2)\, s_{\psi_2}(\psi_{10}, \tilde\psi_2) \right],$$

where $\tilde\psi_2$ is a $\sqrt{n}$-consistent estimator of $\psi_2$. Neyman's C($\alpha$) statistic reduces
to Rao's score statistic when $\tilde\psi_2$ is the maximum likelihood estimate of $\psi_2$ under $H_0$.
Buhler and Puri (1966) extended the asymptotic and local optimality of Neyman's
C($\alpha$) statistic to a vector-valued $\psi_1$ and to the case of independent but not necessarily
i.i.d. random variables. They assumed that $\psi_{10}$ was interior to an open set in the
parameter space, but as pointed out later by Moran (1971) and Chant (1974) this
restriction is unnecessary.
Chant (1974) showed that when the parameter is on the boundary of a closed pa-
rameter space, the score test retains its asymptotic properties while the asymptotic
distributional forms of the test statistics based on the maximum likelihood estimators
are no longer $\chi^2$. In addition to this advantage, the score test has the
computational advantage that only estimates under the null hypothesis are needed
to compute the test statistic. We now use this feature to propose tests for checking
the conditional independence assumption.

4.3.2 Testing the Conditional Independence Assumption

Difficulties in testing the conditional independence of the response variables arise
because of the need to specify more complicated models for the joint response dis-
tribution. Even if score tests are used (which as discussed do not require fitting of
more complicated models) extended models need to be specified and the form of the
score function and of the information matrix need to be derived. We again consider
the bivariate GLMM case for simplicity. A convenient way to introduce conditional

dependence is to use one of the response variables as a covariate in the linear predic-
tor for the other response variable. In the bivariate case this leads to considering the
following model

$$y_{i1j} \mid y_{i2j}, b_{i1} \;\overset{\text{indep}}{\sim}\; f_1(y_{i1j} \mid y_{i2j}, b_{i1}; \beta_1, \gamma, \phi_1)$$
$$y_{i2j} \mid b_{i2} \;\overset{\text{indep}}{\sim}\; f_2(y_{i2j} \mid b_{i2}; \beta_2, \phi_2)$$
$$g_1(\mu_{i1j}) = x_{i1j}^T \beta_1 + \gamma y_{i2j} + z_{i1j}^T b_{i1}$$
$$g_2(\mu_{i2j}) = x_{i2j}^T \beta_2 + z_{i2j}^T b_{i2}$$
$$b_i = \begin{pmatrix} b_{i1} \\ b_{i2} \end{pmatrix} \;\overset{\text{i.i.d.}}{\sim}\; MVN(0, \Sigma) = MVN\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^T & \Sigma_{22} \end{pmatrix} \right).$$

In general this setup leads to a complicated form of conditional dependence which
is hard to interpret if there is no natural ordering of the two responses. The case
$\gamma = 0$ corresponds to conditional independence, but testing $\gamma = 0$ in the above model
is performed against a complicated alternative on the marginal scale. When the
identity link function is used for the first response, conditional on the random effects
$\mathrm{Cov}(y_{i1j}, y_{i2j}) = \gamma\, \mathrm{Var}(y_{i2j})$, and the test is a test of conditional uncorrelatedness of
the two outcomes. If both outcomes are normally distributed then this is truly a test
of conditional independence.
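To see the covariance identity, note that with the identity link $E(y_{i1j} \mid y_{i2j}, b_i) - E(y_{i1j} \mid b_i) = \gamma\{y_{i2j} - E(y_{i2j} \mid b_i)\}$; conditioning on $y_{i2j}$ then gives (a sketch of the algebra, not spelled out in the text):

$$\mathrm{Cov}(y_{i1j}, y_{i2j} \mid b_i) = E\left\{ E\left[ y_{i1j} - E(y_{i1j} \mid b_i) \mid y_{i2j}, b_i \right] \left( y_{i2j} - E(y_{i2j} \mid b_i) \right) \mid b_i \right\}$$
$$= \gamma\, E\left\{ \left( y_{i2j} - E(y_{i2j} \mid b_i) \right)^2 \mid b_i \right\} = \gamma\, \mathrm{Var}(y_{i2j} \mid b_i).$$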
An interesting case to consider, in view of the simulated data example and the
ethylene glycol application, is when one of the responses is normally distributed and
the other one has a Bernoulli distribution. In the above general specification let $f_1$
be the normal density function and $f_2$ be the Bernoulli probability function. Also
assume that the random effects consist of two random intercepts and let

$$\mu_{i1j} = x_{i1j}^T \beta_1 + \gamma y^*_{i2j} + b_{i1},$$

where $y^*_{i2j} = 2y_{i2j} - 1$. Then

$$E(y_{i1j} \mid y^*_{i2j} = 1, b_i) = x_{i1j}^T \beta_1 + b_{i1} + \gamma,$$
$$E(y_{i1j} \mid y^*_{i2j} = -1, b_i) = x_{i1j}^T \beta_1 + b_{i1} - \gamma,$$

and hence testing $H_0: \gamma = 0$ against $H_1: \gamma \neq 0$ is equivalent to testing for a location
shift in the conditional distribution of the normal response.
The score test statistic is as follows:

$$T_s = s_\gamma \left( I_{\gamma\gamma} - I_{\gamma,-\gamma}\, I^{-1}_{-\gamma,-\gamma}\, I_{-\gamma,\gamma} \right)^{-1} s_\gamma,$$

where $s_\gamma$ is the element of the score vector corresponding to $\gamma$, $I$ is the expected
information matrix and the subscript $-\gamma$ denotes all remaining parameters. Note that
even in this simple case neither the score nor the expected information matrix has a
closed-form expression, and hence the score statistic must be approximated. We again
consider Gaussian quadrature and Monte Carlo approximations.
The log-likelihood is $\ln L = \sum_{i=1}^n \ln L_i$, where

$$\ln L_i = \ln \int f_1(y_{i1} \mid y_{i2}, b_{i1}; \beta_1, \gamma, \phi_1)\, f_2(y_{i2} \mid b_{i2}; \beta_2, \phi_2)\, f(b_i; \Sigma)\, db_i,$$

$$f_1(y_{i1} \mid y_{i2}, b_{i1}; \beta_1, \gamma, \phi_1) = (2\pi\sigma^2)^{-\frac{n_i}{2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{j=1}^{n_i} \left( y_{i1j} - x_{i1j}^T\beta_1 - b_{i1} - \gamma y^*_{i2j} \right)^2 \right\},$$

$$f_2(y_{i2} \mid b_{i2}; \beta_2, \phi_2) = \frac{\exp\left( \sum_{j=1}^{n_i} y_{i2j} (x_{i2j}^T\beta_2 + b_{i2}) \right)}{\prod_{j=1}^{n_i} \left( 1 + \exp(x_{i2j}^T\beta_2 + b_{i2}) \right)},$$

$$f(b_i; \Sigma) = |2\pi\Sigma|^{-\frac{1}{2}} \exp\left\{ -\tfrac{1}{2} b_i^T \Sigma^{-1} b_i \right\}.$$
For the Monte Carlo approximation we notice that, under the assumption of
interchangeability of the integral and differential signs,

$$s_\gamma = \frac{\partial}{\partial\gamma} \sum_{i=1}^n \ln \int f_1(y_{i1} \mid b_{i1}, y_{i2}; \beta_1, \gamma, \phi_1)\, f_2(y_{i2} \mid b_{i2}; \beta_2, \phi_2)\, f(b_i; \Sigma)\, db_i$$
$$= \sum_{i=1}^n \frac{\int \frac{\partial}{\partial\gamma} \left( f_1(y_{i1} \mid b_{i1}, y_{i2}; \beta_1, \gamma, \phi_1)\, f_2(y_{i2} \mid b_{i2}; \beta_2, \phi_2)\, f(b_i; \Sigma) \right) db_i}{f(y_i; \psi)}$$
$$= \sum_{i=1}^n \int \frac{\partial}{\partial\gamma} \ln f_1(y_{i1} \mid b_{i1}, y_{i2}; \beta_1, \gamma, \phi_1)\, \frac{f(y_i, b_i; \psi)}{f(y_i; \psi)}\, db_i$$
$$= \sum_{i=1}^n \int \frac{\partial}{\partial\gamma} \ln f_1(y_{i1} \mid b_{i1}, y_{i2}; \beta_1, \gamma, \phi_1)\, f(b_i \mid y_i; \psi)\, db_i$$
$$= \sum_{i=1}^n E\left( \frac{\partial}{\partial\gamma} \ln f_1(y_{i1} \mid y_{i2}, b_i) \,\Big|\, y_i \right).$$

The expectation above is taken with respect to the conditional distribution of the
random effects given the response vector. Differentiating with respect to $\gamma$,

$$\frac{\partial}{\partial\gamma} \ln f_1(y_{i1} \mid y_{i2}, b_i) = \frac{1}{\sigma^2} \sum_{j=1}^{n_i} y^*_{i2j} \left( y_{i1j} - x_{i1j}^T\beta_1 - b_{i1} - \gamma y^*_{i2j} \right)$$

and therefore

$$E\left( \frac{\partial}{\partial\gamma} \ln f_1(y_{i1} \mid y_{i2}, b_i) \,\Big|\, y_i \right) = \frac{1}{\sigma^2} \sum_{j=1}^{n_i} y^*_{i2j} \left( y_{i1j} - x_{i1j}^T\beta_1 - E(b_{i1} \mid y_i) - \gamma y^*_{i2j} \right).$$

So, to approximate the score under the null hypothesis we only need to approximate
the conditional mean $E(b_{i1} \mid y_i)$ by the Monte Carlo sum $\frac{1}{m}\sum_{k=1}^m b_{i1}^{(k)}$, where the $b_{i1}^{(k)}$,
$k = 1, \ldots, m$, are generated for the estimation of the standard errors in the MCEM
algorithm (Section 3.2.2). The elements of the observed information matrix $J$, which
can be used in place of $I$, can also be approximated using Louis's method. $J_{-\gamma,-\gamma}$
is available from the MCEM algorithm and only $J_{\gamma\gamma}$, $J_{\gamma,-\gamma}$ and $J_{-\gamma,\gamma}$ need to be
computed. The latter can also be performed in the procedure for finding the standard
errors of the estimates in the MCEM algorithm.
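A sketch of the Monte Carlo evaluation of $s_\gamma$ at $\gamma = 0$ using the expression above. The inputs (`y1`, `ystar2`, `x_beta`, `b1_draws`) are hypothetical stand-ins: in practice the draws $b_{i1}^{(k)}$ would come from the MCEM run rather than being supplied by hand.

```python
import numpy as np

def score_gamma(y1, ystar2, x_beta, b1_draws, sigma2):
    """Monte Carlo approximation of s_gamma at gamma = 0:
    s_gamma = (1/sigma^2) * sum_i sum_j ystar2_ij *
              (y1_ij - x_ij'beta_1 - E(b_i1 | y_i)),
    with E(b_i1 | y_i) replaced by the mean of the supplied draws."""
    s = 0.0
    for yi1, yi2s, xb, draws in zip(y1, ystar2, x_beta, b1_draws):
        eb = np.mean(draws)            # MC estimate of E(b_i1 | y_i)
        s += np.sum(yi2s * (yi1 - xb - eb))
    return s / sigma2

# toy inputs for two subjects, two observations each (all hypothetical)
y1 = [np.array([1.0, 2.0]), np.array([0.5, 1.5])]
ystar2 = [np.array([1.0, -1.0]), np.array([-1.0, 1.0])]
x_beta = [np.array([1.0, 1.0]), np.array([1.0, 1.0])]      # x_ij' beta_1
b1_draws = [np.array([0.2, 0.0, 0.1]), np.array([-0.1, 0.1, 0.0])]
s_g = score_gamma(y1, ystar2, x_beta, b1_draws, sigma2=1.0)
```

The score is a weighted sum of conditional residuals, so values of `s_g` far from zero (relative to the information) signal a location shift between the two Bernoulli outcome groups.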
Gaussian quadrature using numerical derivatives involves approximating the log-
likelihood once and then numerically differentiating with respect to $\gamma$ and the other
parameters to obtain the score and the observed information matrix. Denote the
Gauss-Hermite quadrature approximation of the log-likelihood by $l^{GQ}$, and let $s^{GQ}_\gamma =
\frac{\partial l^{GQ}}{\partial\gamma}$ and $J^{GQ} = -\frac{\partial^2 l^{GQ}}{\partial\psi\,\partial\psi^T}$. Then the approximation to the score statistic is

$$T_s^{GQ} = s^{GQ}_\gamma \left( J^{GQ}_{\gamma\gamma} - J^{GQ}_{\gamma,-\gamma} \left( J^{GQ}_{-\gamma,-\gamma} \right)^{-1} J^{GQ}_{-\gamma,\gamma} \right)^{-1} s^{GQ}_\gamma.$$

The performance of the score statistics for conditional independence is studied
in more detail in Section 4.5. When there are more than two response variables

this approach to testing for departure from conditional independence becomes very
complicated and not easily interpretable. It is also not easy to decide which variable
to use in the linear predictor for the other one, unless there is a natural ordering.
This issue is discussed in more detail for the ethylene glycol example.

4.3.3 Testing the Significance of Variance Components

Global Variance Components Test

Lin (1997) proposed a global variance component test for testing the significance
of all variance components in the univariate GLMM, which can be extended to the
multivariate GLMM. The null hypothesis for the global test is $H_0: \delta = 0$, where $\delta$ is
the vector of all variance components for the random effects. Suppose for simplicity
that $\phi_1 = \phi_2 = 1$ and that there are two response variables. The generalizations to
arbitrary $\phi_1$ and $\phi_2$, and to more than two response variables, are straightforward.
The form of the score test statistic is

$$T_s(\hat\beta) = s_\delta(\hat\beta)^T \left( I_{\delta\delta}(\hat\beta) - I_{\delta\beta}(\hat\beta)\, I^{-1}_{\beta\beta}(\hat\beta)\, I_{\beta\delta}(\hat\beta) \right)^{-1} s_\delta(\hat\beta),$$

where $\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$, and $\hat\beta_1$ and $\hat\beta_2$ are the maximum likelihood estimates under $H_0$,
i.e. the maximum likelihood estimators from the two separate fixed effects generalized
linear models for the two response variables. Under $H_0$, $T_s(\hat\beta)$ has an asymptotic $\chi^2_d$
distribution, where $d$ is the number of variance-covariance parameters for the random effects.

In the univariate GLMM considered by Lin, the $r$th element of the score vector has
the form

$$s_{\delta_r}(\hat\beta) = \frac{1}{2} \sum_{i=1}^n \left\{ (y_i - \mu_i)^T \Delta_i^{-1} W_i Z_i \frac{\partial\Sigma}{\partial\delta_r} Z_i^T W_i \Delta_i^{-1} (y_i - \mu_i) - \mathrm{tr}\left( W_{0i} Z_i \frac{\partial\Sigma}{\partial\delta_r} Z_i^T \right) \right\},$$

where $g(\mu_i) = X_i\beta$, $\Sigma = \mathrm{Var}(b_i)$ and $\frac{\partial\Sigma}{\partial\delta_r}$ is evaluated at $\delta = 0$. The matrices $\Delta_i$, $W_i$
and $W_{0i}$ are diagonal with elements $\Delta_{ij} = \frac{\partial\mu_{ij}}{\partial\eta_{ij}}$, $w_{ij} = \left( V(\mu_{ij})\{g'(\mu_{ij})\}^2 \right)^{-1}$ and
$w_{0ij} = w_{ij} + e_{ij}(y_{ij} - \mu_{ij})$, where $e_{ij} = \frac{V'(\mu_{ij})\, g'(\mu_{ij}) + V(\mu_{ij})\, g''(\mu_{ij})}{\{V(\mu_{ij})\}^2 \{g'(\mu_{ij})\}^3}$ in general and $e_{ij} = 0$
for canonical link functions. The subscript $j$ refers to the $j$th observation on the $i$th subject.
Following step by step Lin's derivation for the univariate GLMM, the corresponding
$r$th element of the score function for the multivariate GLMM is

$$s_{\delta_r}(\hat\beta) = \frac{1}{2} \sum_{i=1}^n \left\{ (y_{i1} - \mu_{i1})^T \Delta_{i1}^{-1} W_{i1} Z_{i1} \frac{\partial\Sigma_{11}}{\partial\delta_r} Z_{i1}^T W_{i1} \Delta_{i1}^{-1} (y_{i1} - \mu_{i1}) - \mathrm{tr}\left( W_{0i1} Z_{i1} \frac{\partial\Sigma_{11}}{\partial\delta_r} Z_{i1}^T \right) \right\}$$
$$+ \frac{1}{2} \sum_{i=1}^n \left\{ (y_{i2} - \mu_{i2})^T \Delta_{i2}^{-1} W_{i2} Z_{i2} \frac{\partial\Sigma_{22}}{\partial\delta_r} Z_{i2}^T W_{i2} \Delta_{i2}^{-1} (y_{i2} - \mu_{i2}) - \mathrm{tr}\left( W_{0i2} Z_{i2} \frac{\partial\Sigma_{22}}{\partial\delta_r} Z_{i2}^T \right) \right\}$$
$$+ \sum_{i=1}^n (y_{i1} - \mu_{i1})^T \Delta_{i1}^{-1} W_{i1} Z_{i1} \frac{\partial\Sigma_{12}}{\partial\delta_r} Z_{i2}^T W_{i2} \Delta_{i2}^{-1} (y_{i2} - \mu_{i2}), \qquad (4.2)$$

where the subscripts 1 and 2 refer to the parts of the vectors and matrices corre-
sponding to the first and to the second variable respectively.
The proof is as follows. Let

$$l_i(b_i) = l_{i1}(b_{i1}) + l_{i2}(b_{i2}) = \ln\left( f_1(y_{i1} \mid b_{i1}; \beta_1)\, f_2(y_{i2} \mid b_{i2}; \beta_2) \right).$$

Then the marginal likelihood for the $i$th subject is

$$f(y_i; \beta, \Sigma) = E_{b_i}[\exp(l_i(b_i))],$$

where the expectation is taken with respect to the marginal distribution of $b_i$. Expanding
the integrand in a multivariate Taylor series around $b_i = 0$ we get

$$\exp(l_i(b_i)) = \exp(l_i(0))\left[ 1 + \frac{\partial l_i(0)}{\partial b_i^T} b_i + \frac{1}{2} b_i^T \frac{\partial l_i(0)}{\partial b_i} \frac{\partial l_i(0)}{\partial b_i^T} b_i + \frac{1}{2} b_i^T \frac{\partial^2 l_i(0)}{\partial b_i \partial b_i^T} b_i + \epsilon_i \right],$$

where $\epsilon_i$ contains third and higher order terms of $b_i$. Notice that

$$\frac{\partial l_i(b_i)}{\partial b_i} = Z_i^T \frac{\partial l_i(b_i)}{\partial \eta_i}, \qquad
\frac{\partial^2 l_i(b_i)}{\partial b_i \partial b_i^T} = Z_i^T \frac{\partial^2 l_i(b_i)}{\partial \eta_i \partial \eta_i^T} Z_i,$$

where $\eta_i$ is the vector of linear predictors for the $i$th subject. Then taking expectations
and using the moment assumptions for $b_i$,

$$E_{b_i}[\exp(l_i(b_i))] = \exp(l_i(0))\left[ 1 + \frac{1}{2}\mathrm{tr}\left( Z_i^T \left( \frac{\partial l_i(0)}{\partial \eta_i} \frac{\partial l_i(0)}{\partial \eta_i^T} + \frac{\partial^2 l_i(0)}{\partial \eta_i \partial \eta_i^T} \right) Z_i \Sigma \right) + r_i \right]$$

and the marginal log-likelihood for the $i$th subject is

$$\ln f(y_i; \beta, \Sigma) = l_i(0) + \frac{1}{2}\mathrm{tr}\left( Z_i^T \left( \frac{\partial l_i(0)}{\partial \eta_i} \frac{\partial l_i(0)}{\partial \eta_i^T} + \frac{\partial^2 l_i(0)}{\partial \eta_i \partial \eta_i^T} \right) Z_i \Sigma \right) + r_i.$$

Here $r_i$ contains terms that are products of variance components, and its derivative
will be 0 when evaluated under $H_0$. Now we must take into consideration that there
are two response variables. Because $l_{i1}$ and $l_{i2}$ depend on different sets of random
effects,

$$\frac{\partial l_i(b_i)}{\partial b_i} = \begin{pmatrix} \frac{\partial l_{i1}(b_{i1})}{\partial b_{i1}} \\[4pt] \frac{\partial l_{i2}(b_{i2})}{\partial b_{i2}} \end{pmatrix}, \qquad
\frac{\partial^2 l_i(b_i)}{\partial b_i \partial b_i^T} = \begin{pmatrix} \frac{\partial^2 l_{i1}(b_{i1})}{\partial b_{i1} \partial b_{i1}^T} & 0 \\[4pt] 0 & \frac{\partial^2 l_{i2}(b_{i2})}{\partial b_{i2} \partial b_{i2}^T} \end{pmatrix},$$

and hence

$$\ln f(y_i; \beta, \Sigma) = l_{i1}(0) + l_{i2}(0)$$
$$+ \frac{1}{2}\mathrm{tr}\left( Z_{i1}^T \left( \frac{\partial l_{i1}(0)}{\partial\eta_{i1}} \frac{\partial l_{i1}(0)}{\partial\eta_{i1}^T} + \frac{\partial^2 l_{i1}(0)}{\partial\eta_{i1}\partial\eta_{i1}^T} \right) Z_{i1} \Sigma_{11} \right)$$
$$+ \frac{1}{2}\mathrm{tr}\left( Z_{i2}^T \left( \frac{\partial l_{i2}(0)}{\partial\eta_{i2}} \frac{\partial l_{i2}(0)}{\partial\eta_{i2}^T} + \frac{\partial^2 l_{i2}(0)}{\partial\eta_{i2}\partial\eta_{i2}^T} \right) Z_{i2} \Sigma_{22} \right)$$
$$+ \mathrm{tr}\left( Z_{i2}^T \frac{\partial l_{i2}(0)}{\partial\eta_{i2}} \frac{\partial l_{i1}(0)}{\partial\eta_{i1}^T} Z_{i1} \Sigma_{12} \right) + r_i.$$

To obtain (4.2) one uses the fact that $l_{i1}$ and $l_{i2}$ do not depend on $\delta$ and that for
exponential family distributions

$$\frac{\partial l_{ik}(0)}{\partial \eta_{ik}} = W_{ik} \Delta_{ik}^{-1} (y_{ik} - \mu_{ik}), \qquad
-\frac{\partial^2 l_{ik}(0)}{\partial \eta_{ik} \partial \eta_{ik}^T} = W_{0ik}$$

for $k = 1, 2$.
Notice that in the bivariate GLMM it is likely that the two responses require
different variance components, in which case the expressions for the elements of the
score vector above simplify. Suppose that the random effects variance-covariance
matrix has the form
$$\Sigma = \begin{pmatrix} \Sigma_{11}(\delta_1) & \Sigma_{12}(\delta_{12}) \\ \Sigma_{12}^T(\delta_{12}) & \Sigma_{22}(\delta_2) \end{pmatrix}, \qquad (4.3)$$

where $\delta_1$, $\delta_2$ and $\delta_{12}$ are different parameter vectors. Then

$$\frac{\partial\Sigma_{22}}{\partial\delta_1} = \frac{\partial\Sigma_{12}}{\partial\delta_1} = \frac{\partial\Sigma_{11}}{\partial\delta_2} = \frac{\partial\Sigma_{12}}{\partial\delta_2} = \frac{\partial\Sigma_{11}}{\partial\delta_{12}} = \frac{\partial\Sigma_{22}}{\partial\delta_{12}} = 0$$

and the score vector is

$$s_\delta = \begin{pmatrix} s_{\delta_1} \\ s_{\delta_2} \\ s_{\delta_{12}} \end{pmatrix} = \begin{pmatrix}
\frac{1}{2}\sum_{i=1}^n \left\{ (y_{i1}-\mu_{i1})^T \Delta_{i1}^{-1} W_{i1} Z_{i1} \frac{\partial\Sigma_{11}}{\partial\delta_1} Z_{i1}^T W_{i1} \Delta_{i1}^{-1} (y_{i1}-\mu_{i1}) - \mathrm{tr}\left( W_{0i1} Z_{i1} \frac{\partial\Sigma_{11}}{\partial\delta_1} Z_{i1}^T \right) \right\} \\
\frac{1}{2}\sum_{i=1}^n \left\{ (y_{i2}-\mu_{i2})^T \Delta_{i2}^{-1} W_{i2} Z_{i2} \frac{\partial\Sigma_{22}}{\partial\delta_2} Z_{i2}^T W_{i2} \Delta_{i2}^{-1} (y_{i2}-\mu_{i2}) - \mathrm{tr}\left( W_{0i2} Z_{i2} \frac{\partial\Sigma_{22}}{\partial\delta_2} Z_{i2}^T \right) \right\} \\
\sum_{i=1}^n (y_{i1}-\mu_{i1})^T \Delta_{i1}^{-1} W_{i1} Z_{i1} \frac{\partial\Sigma_{12}}{\partial\delta_{12}} Z_{i2}^T W_{i2} \Delta_{i2}^{-1} (y_{i2}-\mu_{i2})
\end{pmatrix}.$$

Lin showed that the information matrix in the univariate GLMM depends only on
the first two moments of the response variables and its elements can be expressed in
closed form for exponential family responses. It is easy to verify that the information
matrix for the multivariate GLMM also depends only on the first two moments of the
two response variables and does not contain more complicated expressions than its
simpler counterpart. The latter property is due to the independence of the response
variables under H0. Note that

$$I_{\delta\delta} = E(s_\delta s_\delta^T), \qquad I_{\beta\delta} = E(s_\beta s_\delta^T), \qquad I_{\beta\beta} = E(s_\beta s_\beta^T),$$

where the expectations are computed at $\delta = 0$. Consider only $I_{\delta\delta}$ for now and
let us assume that $\Sigma$ has the structure in (4.3). Then $I_{\delta_1\delta_1}$ is exactly the same
as in a univariate GLMM and hence can be expressed as proposed by Lin. On the
other hand, $I_{\delta_1\delta_2} = E(s_{\delta_1} s_{\delta_2}^T) = E(s_{\delta_1}) E(s_{\delta_2}^T) = 0$ under $H_0$ because the two score
vectors depend on different response variables which are independent under the null

X )T -W ilZil l -2Z
i----1A i

n n
S E[(h(Yil)(Yi'2 -/Pi') =
i=1 V=1
n n
E Ehi(y1)E(yi'2 t,,2) = 0.
i=1 iV=l

Here hi(yijl) is a function of the first response variable yi only. Similarly all other
parts of the information matrix which correspond to partial derivatives with respect

to parameters for different response variables are zero and hence the expected infor-
mation matrix has the form

$$I = \begin{pmatrix}
I_{\beta_1\beta_1} & 0 & I_{\beta_1\delta_1} & 0 & 0 \\
0 & I_{\beta_2\beta_2} & 0 & I_{\beta_2\delta_2} & 0 \\
I_{\delta_1\beta_1} & 0 & I_{\delta_1\delta_1} & 0 & 0 \\
0 & I_{\delta_2\beta_2} & 0 & I_{\delta_2\delta_2} & 0 \\
0 & 0 & 0 & 0 & I_{\delta_{12}\delta_{12}}
\end{pmatrix}$$

and the score statistic separates as follows

$$T_s = s_{\delta_1}^T \left( I_{\delta_1\delta_1} - I_{\delta_1\beta_1} I^{-1}_{\beta_1\beta_1} I_{\beta_1\delta_1} \right)^{-1} s_{\delta_1}
+ s_{\delta_2}^T \left( I_{\delta_2\delta_2} - I_{\delta_2\beta_2} I^{-1}_{\beta_2\beta_2} I_{\beta_2\delta_2} \right)^{-1} s_{\delta_2}
+ s_{\delta_{12}}^T I^{-1}_{\delta_{12}\delta_{12}} s_{\delta_{12}}.$$

This factorization appears only when the variance-covariance matrix is structured as
in (4.3), otherwise the expression is more complicated but still depends only on the

first two moments of the response. The key to proving this is to notice that the
highest order expectation that needs to be computed is of the form $E(y_{ij} - \mu_{ij})^4$.
The same is true for the univariate GLMM and hence Lin's arguments can be directly applied.

Lin proves that the global score statistic in the univariate GLMM follows a chi-squared
distribution with $d$ degrees of freedom ($d$ is equal to the number of random
effects) asymptotically under $\delta = 0$. The asymptotic result holds when the number
of subjects goes to infinity and the number of observations on each subject remains
bounded. In the multivariate GLMM the asymptotic distribution is also $\chi^2$, but with
the number of degrees of freedom adjusted accordingly.

The global score statistic is not very useful even in the GLMM context because it

tests the significance of all variance components simultaneously, while in most cases

it will be more interesting to check a subset of the variance components. But in the

multivariate GLMM model the global score test is even less appealing. Suppose that

the test is performed and that the null hypothesis is rejected. What information can

one get from that result? It will not be clear whether the rejection occurred because

of extra variability in one of the variables, or in the other one, or in both. It may be

more meaningful to perform score tests for each variable separately and then check

for correlation between the two responses.

Lin also develops score tests for specific variance components in the independent

random effects model. In contrast to the global test, here the score vector and the

efficient information matrix can not be computed in closed form in general and Lin

uses Laplace approximations. Not surprisingly, the approximation to the score statis-

tic does not work well in the binary case as demonstrated by some simulations that

she performed. Lin's score tests can be used with the Breslow and Clayton, and the

Wolfinger and O'Connell methods but are not adequate if used with Gaussian quadra-

ture or the MCEM algorithm. In that case it is natural to try to develop score tests

based on numerical or stochastic approximations. We now discuss a direct approach

which is of limited use, and an indirect approach which is more complicated but is

especially suited for variance components on the boundary of the parameter space.

Tests for Individual Variance Components

We consider a bivariate GLMM for simplicity. Suppose one is interested in testing
$H_0: \psi_1 = 0$, where $\psi_1$ is a subset of the variance components for the random effects.
Also let $\psi_1$ have $L$ elements and let $\psi = (\psi_1^T, \psi_{-1}^T)^T$. The score statistic is

$$T_s = s_{\psi_1}^T \left( I_{\psi_1\psi_1} - I_{\psi_1,\psi_{-1}} I^{-1}_{\psi_{-1},\psi_{-1}} I_{\psi_{-1},\psi_1} \right)^{-1} s_{\psi_1},$$

and under $H_0$, $T_s \sim \chi^2_L$ asymptotically.

To develop a Monte Carlo approximation we can try to follow the approach we used

for the conditional independence test. Under the assumption of interchangeability of

the integral and differential signs the score vector can be rewritten as follows

$$s_{\psi_1} = \sum_{i=1}^n E\left( \frac{\partial}{\partial\psi_1} \ln f(b_i; \Sigma) \,\Big|\, y_i; \psi \right).$$

The random effects density $f(b_i; \Sigma)$ is multivariate normal for our models and hence

it is possible to obtain expressions for the partial derivatives inside the expectation

using the approach of Jennrich and Schluchter as outlined in Section 3.2.1. There is

no closed form expression for the score vector but we can approximate it by

$$\frac{1}{m} \sum_{i=1}^n \sum_{k=1}^m \frac{\partial}{\partial\psi_1} \ln f(b_i^{(k)}; \Sigma),$$

where $b_i^{(k)}$ are simulated values from the conditional distribution of $b_i \mid y_i$. As mentioned

before the expected information matrix is much harder to work with and hence the

observed information matrix can be approximated using Louis' method as shown in

Section 3.2.2. Notice though that the score vector must be evaluated at $\psi_1 = 0$ and
at the restricted maximum likelihood estimates $\hat\psi_{-1}$. Depending on the subset of
the variance components tested, the derivative at $\psi_1 = 0$ may not exist. Consider
for example the case when

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$$

and the null hypothesis is $H_0: \sigma_1^2 = \sigma_{12} = 0$. Then $\Sigma$ is singular under $H_0$ and we
cannot evaluate the derivative. If we test only $H_0: \sigma_{12} = 0$ then there is no problem
with the test. In this case the parameter is not on the boundary under the null
hypothesis, but the score test is useful because the correlation between the response
variables can be tested from the fit of two separate GLMMs. Recall that univariate GLMMs can be fitted using

standard software such as PROC NLMIXED in SAS. Hence one can decide whether

there is a need to fit the two responses together based only on univariate analyses.

An alternative method to compute an approximation to the score statistic above
is to use Gauss-Hermite quadrature. Two possible approaches can be followed. The

easier one is to compute first and second order numerical derivatives of the log-
likelihood and then compute the score statistic based on them. Exact derivatives

might be useful if the number of observations per subject is not very large. Both

approaches are not applicable if the tested subset of variance components leads to a

nonpositive-definite E.
Denote the Gauss-Hermite quadrature approximation of the log-likelihood by $l^{GQ}$.
Also let $s^{GQ}_{\psi_1} = \partial l^{GQ}/\partial\psi_1$ and $J^{GQ} = -\partial^2 l^{GQ}/\partial\psi\,\partial\psi^T$, i.e. the score and the observed
information matrix are approximated by the first and second order numerical derivatives
of the Gauss-Hermite approximation to the log-likelihood. The score statistic for testing
$H_0: \psi_1 = 0$ is then approximated by

$$T_s^{GQ} = (s^{GQ}_{\psi_1})^T \left( J^{GQ}_{\psi_1\psi_1} - J^{GQ}_{\psi_1,\psi_{-1}} \left( J^{GQ}_{\psi_{-1},\psi_{-1}} \right)^{-1} J^{GQ}_{\psi_{-1},\psi_1} \right)^{-1} s^{GQ}_{\psi_1},$$

where the score and the information matrix are evaluated at $\psi_1 = 0$ and at the
restricted maximum likelihood estimates $\hat\psi_{-1}$.
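The Gauss-Hermite approximation together with numerical differentiation can be sketched on a toy normal random-intercept model, chosen because its marginal likelihood has a closed form against which the quadrature can be checked. The function names and data are hypothetical, and `num_score` is a plain central-difference stand-in for the numerical derivatives described above.

```python
import numpy as np

def gh_loglik(y, sb, se, npts=30):
    """Gauss-Hermite approximation of the marginal log-likelihood of a toy
    normal random-intercept model y_ij = b_i + e_ij, b_i ~ N(0, sb^2).
    Substituting b = sqrt(2)*sb*t turns the integral over b into
    (1/sqrt(pi)) * int e^{-t^2} f(y | sqrt(2)*sb*t) dt."""
    t, w = np.polynomial.hermite.hermgauss(npts)   # nodes, weights
    b = np.sqrt(2.0) * sb * t                      # transformed nodes
    total = 0.0
    for yi in y:
        dev = yi[:, None] - b[None, :]
        logf = -0.5 * np.sum(dev**2, axis=0) / se**2 \
               - 0.5 * len(yi) * np.log(2 * np.pi * se**2)
        total += np.log(np.sum(w * np.exp(logf)) / np.sqrt(np.pi))
    return total

def num_score(f, x, h=1e-5):
    """Central-difference approximation of df/dx (stand-in for s^GQ)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

rng = np.random.default_rng(2)
y = [rng.normal(0.0, 1.0, size=2) for _ in range(10)]
ll = gh_loglik(y, sb=0.3, se=1.0)
s_sb = num_score(lambda s: gh_loglik(y, s, 1.0), 0.3)  # numerical score in sb
```

For a full statistic one would difference twice to get the `J^GQ` blocks as well; the toy model keeps the check honest because the exact marginal is multivariate normal with covariance $\sigma_e^2 I + \sigma_b^2 \mathbf{1}\mathbf{1}^T$.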
As mentioned before, this direct approach to testing for variance components will
not work in many interesting cases when the parameters are on the boundary of
the parameter space. To be able to handle such a problem the integrand should be
modified to avoid dependence on the random effects with 0 variances. Notice that
if we want to test whether, for example, $\sigma_k^2 = 0$, we have to set all corresponding
covariance terms $\sigma_{kk'}$ equal to 0 as well. Suppose we want to test $\psi_1 = 0$ and