
Citation 
 Permanent Link:
 http://ufdc.ufl.edu/AA00020402/00001
Material Information
 Title:
 Models for repeated measures of a multivariate response
 Creator:
 Gueorguieva, Ralitza, 1971
 Publication Date:
 1999
 Language:
 English
 Physical Description:
 viii, 171 leaves : ill. ; 29 cm.
Subjects
 Subjects / Keywords:
Approximation (jstor)
Gaussian quadratures (jstor)
Maximum likelihood estimations (jstor)
Modeling (jstor)
Sample size (jstor)
Simulations (jstor)
Standard error (jstor)
Statistical discrepancies (jstor)
Statistical models (jstor)
Statistics (jstor)
Dissertations, Academic -- Statistics -- UF (lcsh)
Statistics thesis, Ph. D (lcsh)
 Genre:
 bibliography ( marcgt )
nonfiction ( marcgt )
Notes
 Thesis:
 Thesis (Ph. D.)--University of Florida, 1999.
 Bibliography:
 Includes bibliographical references (leaves 164-170).
 General Note:
 Printout.
 General Note:
 Vita.
 Statement of Responsibility:
 by Ralitza Gueorguieva.
Record Information
 Source Institution:
 University of Florida
 Holding Location:
 University of Florida
 Rights Management:
 Copyright Ralitza Gueorguieva. Permission granted to the University of Florida to digitize, archive and distribute this item for nonprofit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
 Resource Identifier:
 021558080 ( ALEPH )
43628298 ( OCLC )

MODELS FOR REPEATED MEASURES OF A MULTIVARIATE RESPONSE
By
RALITZA GUEORGUIEVA
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1999
Copyright 1999
by
Ralitza Gueorguieva
To my family
ACKNOWLEDGMENTS
I would like to express my deepest gratitude to Dr. Alan Agresti for serving as
my dissertation advisor. Without his guidance, constant encouragement and valuable
advice this work would not have been completed. My appreciation is extended to
Drs. James Booth, Randy Carter, Malay Ghosh, and Monika Ardelt for serving on
my committee and for helping me with my research. I would also like to thank all
the faculty, staff, and students in the Department of Statistics for their support and
friendship.
I also wish to acknowledge the members of the Perinatal Data Systems group
whose constant encouragement and understanding have helped me in many ways.
Special thanks go to Dr. Randy Carter, who has supported me since my first day as
a graduate student.
Finally, I would like to express my gratitude to my parents, Vessela and Vladislav,
for their loving care and confidence in my success, to my sister, Annie, and her fiancé,
Nathan, for their encouragement and continual support, and to my husband, Velizar,
for his constant love and inspiration.
TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

CHAPTERS

1 INTRODUCTION
  1.1 Models for Univariate Repeated Measures
    1.1.1 General Linear Models
    1.1.2 Generalized Linear Models
    1.1.3 Marginal Models
    1.1.4 Random Effects Models
    1.1.5 Transition Models
  1.2 Models for Multivariate Repeated Measures
  1.3 Simultaneous Modelling of Responses of Different Types
  1.4 Format of Dissertation

2 MULTIVARIATE GENERALIZED LINEAR MIXED MODEL
  2.1 Introduction
  2.2 Model Definition
  2.3 Model Properties
  2.4 Applications

3 ESTIMATION IN THE MULTIVARIATE GENERALIZED LINEAR MIXED MODEL
  3.1 Introduction
  3.2 Maximum Likelihood Estimation
    3.2.1 Gauss-Hermite Quadrature
    3.2.2 Monte Carlo EM Algorithm
    3.2.3 Pseudo-likelihood Approach
  3.3 Simulated Data Example
  3.4 Applications
    3.4.1 Developmental Toxicity Study in Mice
    3.4.2 Myoelectric Activity Study in Ponies
  3.5 Additional Methods

4 INFERENCE IN THE MULTIVARIATE GENERALIZED LINEAR MIXED MODEL
  4.1 Inference about Regression Parameters
  4.2 Estimation of Random Effects
  4.3 Inference Based on Score Tests
    4.3.1 General Theory
    4.3.2 Testing the Conditional Independence Assumption
    4.3.3 Testing the Significance of Variance Components
  4.4 Applications
  4.5 Simulation Study
  4.6 Future Research

5 CORRELATED PROBIT MODEL
  5.1 Introduction
  5.2 Model Definition
  5.3 Maximum Likelihood Estimation
    5.3.1 Monte Carlo EM Algorithm
    5.3.2 Stochastic Approximation EM Algorithm
    5.3.3 Standard Error Approximation
  5.4 Application
  5.5 Simulation Study
  5.6 Identifiability Issue
  5.7 Model Extensions
  5.8 Future Research

6 CONCLUSIONS
  6.1 Summary
  6.2 Future Research

REFERENCES

BIOGRAPHICAL SKETCH
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment
of the Requirements for the Degree of
Doctor of Philosophy
MODELS FOR REPEATED MEASURES OF A MULTIVARIATE RESPONSE
By
Ralitza Gueorguieva
December 1999
Chairman: Alan Agresti
Major Department: Statistics
The goal of this dissertation is to propose and investigate random effects models
for repeated measures situations when there are two or more response variables. The
emphasis is on maximum likelihood estimation and on applications with outcomes
of different types. We propose a multivariate generalized linear mixed model that
can accommodate any combination of outcome variables in the exponential family.
This model assumes conditional independence between the response variables given
the random effects. We also consider a correlated probit model that is suitable for
mixtures of binary, continuous, censored continuous, and ordinal outcomes. Although
more limited in area of applicability, the correlated probit model allows for more
general correlation structure between the response variables than the corresponding
multivariate generalized linear mixed model.
We extend three estimation procedures from the univariate generalized linear
mixed model to the multivariate generalization proposed herein. The methods are
Gauss-Hermite quadrature, Monte Carlo EM algorithm, and pseudo-likelihood. Standard error approximations are considered along with parameter estimation. A simulated data example and two 'real-life' examples are used for illustration. We also
consider hypothesis testing based on quadrature and Monte Carlo approximations to
the Wald, score, and likelihood ratio tests. The performance of the approximations to
the test statistics is studied via a small simulation study for checking the conditional
independence assumption.
We propose a Monte Carlo EM algorithm for maximum likelihood estimation in
the correlated probit model. Because of the computational inefficiency of the algorithm, we consider a modification based on stochastic approximations which leads to
a significant decrease in the time for model fitting. To address the issue of advantages of joint over separate analyses of the response variables, we design a simulation
study to investigate possible efficiency gains in a multivariate analysis. A noticeable
increase in the estimated standard errors is observed only in the binary response case,
for small numbers of subjects and observations per subject and for high correlation
between the outcomes. We also briefly consider an identifiability issue for one of the
variance components.
CHAPTER 1
INTRODUCTION
Univariate repeated measures occur when one response variable is observed at
several occasions for each subject. Hereafter subject refers to any unit on which a
measurement is taken, while occasion corresponds to time or to a specific condition. If
more than one response is observed at each occasion, multivariate repeated measures
are available. Univariate and multivariate repeated measures are very common in
biomedical applications, for example when one or more variables are measured on
each patient at a number of hospital visits, or when a number of questions are asked
at a series of interviews. But the occasions do not necessarily refer to different times.
For instance dependent responses can be measured on litter mates, on members of
the same family or at different places on a subject's body. Difficulties in analyzing
repeated measures arise because of correlations usually present between observations
on the same subject. Statistical methods and estimation techniques are well developed
for repeated measures on a univariate normal variable, and lately much research has
been dedicated to repeated observations on a binary variable and more generally
on variables with distributions in the exponential family. Zeger and Liang (1992)
provide an overview of methods for longitudinal data and the books of Lindsey (1993),
Diggle, Liang and Zeger (1994), and Fahrmeir and Tutz (1994) cover many details.
Pendergast et al. (1996) present a comprehensive survey of models for correlated
binary outcomes, including longitudinal data.
However, relatively little attention has been devoted to repeated measures of a multivariate response. General models for this situation are necessarily complex, as two
types of correlations must be taken into account: correlations between measurements
on different variables at each occasion and correlations between measurements at different occasions. Reinsel (1982, 1984), Lundbye-Christensen (1991), Matsuyama and
Ohashi (1997), and Heitjan and Sharma (1997) consider models for normally distributed responses. Lefkopoulou, Moore and Ryan (1989), Liang and Zeger (1989),
and Agresti (1997) propose models for multivariate binary data; Catalano and Ryan
(1992), Fitzmaurice and Laird (1995), and Regan and Catalano (1999) introduce
models for clustered bivariate discrete and continuous outcomes. Catalano (1994)
considers an extension to ordinal data of the Catalano and Ryan model. Rochon
(1996) demonstrates how generalized estimating equations can be used to fit extended
marginal models for bivariate repeated measures of discrete or continuous outcomes.
Rochon's approach is very general and allows for a large class of response distributions. However, in many cases, especially when subject-specific inference is of primary
interest, marginal models are not appropriate as they may lead to attenuation of the
estimates of the regression parameters (Zeger, Liang, and Albert, 1988).
Sammel, Ryan, and Legler (1997) analyze mixtures of discrete and continuous
responses in the exponential family using latent variable models. Their approach
is based on numerical or stochastic approximations to maximum likelihood and allows for subject-specific inference. Blackwell and Catalano (1999a, 1999b) consider
extensions for ordinal data and for repeated measures of ordinal responses.
The Generalized Linear Mixed Model (GLMM) forms a very general class of
subject-specific models for discrete and continuous responses in the exponential family and is used for univariate repeated measures (Fahrmeir and Tutz, 1994). In the
current dissertation we demonstrate how the GLMM approach can be extended to
multivariate repeated measures by assuming separate random effects for each outcome variable. This is in contrast to the Sammel et al. approach, in which common
underlying latent variables are assumed. For the special case of clustered binary
and continuous data we also consider a correlated probit model that is more general
than the corresponding GLMM.
The introduction to this dissertation contains an overview of approaches for modelling of univariate repeated measures (Section 1.1), existing models for multivariate
repeated measures (Section 1.2), and simultaneous modeling of different types of
responses (Section 1.3). The chapter concludes with an outline of the dissertation
(Section 1.4).
1.1 Models for Univariate Repeated Measures
1.1.1 General Linear Models
Historically, the first models considered for repeated measures data were
general linear models with correlated normally distributed errors (see Ware, 1982, for
a review). The univariate representation of such models is as follows.
Suppose each subject i is observed on J occasions and denote the column vector
of responses by y_i. Assume that y_i arises from the linear model

y_i = X_i β + ε_i,   (1.1)

where X_i is a J x p model matrix for the ith individual, β is a p x 1 unknown parameter
vector and ε_i is a J x 1 error vector with a multivariate normal distribution with mean
0 and arbitrary positive definite covariance matrix Σ: ε_i ~ N_J(0, Σ). Σ can take certain
special forms. The case Σ = σ²I corresponds to the usual linear model for cross-sectional data. The equicorrelation structure Σ = σ²((1 − ρ)I + ρJ), where J is a
matrix of ones, 0 < ρ < 1 and σ² > 0, is appropriate when the repeated measures
are on subjects within a cluster, for example when a certain characteristic is observed
on each of a number of litter mates. The autoregressive structure Σ = ((σ_jk))
with σ_jk = σ²ρ^|j−k| is one of the most popular structures when the observations are over
equally-spaced time periods.
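To make these special forms of Σ concrete, the following Python/NumPy sketch (function names and parameter values are ours, for illustration only) constructs the equicorrelation and first-order autoregressive structures described above:

```python
import numpy as np

def equicorrelation(J, sigma2, rho):
    """Sigma = sigma^2 ((1 - rho) I + rho J), with J a matrix of ones."""
    return sigma2 * ((1 - rho) * np.eye(J) + rho * np.ones((J, J)))

def ar1(J, sigma2, rho):
    """AR(1) structure: sigma_jk = sigma^2 rho^|j - k|."""
    idx = np.arange(J)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

S_eq = equicorrelation(4, 2.0, 0.5)   # all pairs equally correlated
S_ar = ar1(4, 2.0, 0.5)               # correlation decays with lag
```

Both matrices are positive definite for the stated parameter ranges, which can be checked through their eigenvalues.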
Usually of primary interest in general linear models is the estimation of regression
parameters while recognizing the likely correlation structure in the data. To achieve
this, one either assumes an explicit parametric model for the covariance structure,
or uses methods of inference that are robust to misspecification of the covariance
structure. Weighted least squares, maximum likelihood and restricted maximum
likelihood are the most popular estimation methods for general linear models.
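For a known (or estimated) Σ, the weighted (generalized) least squares estimator of β has a closed form; the fragment below is a minimal sketch (the function name is ours) of that estimator for model (1.1):

```python
import numpy as np

def gls(X_list, y_list, Sigma):
    """GLS estimate for y_i = X_i beta + eps_i, Cov(eps_i) = Sigma (known):
    beta_hat = (sum_i X_i' Sigma^{-1} X_i)^{-1} sum_i X_i' Sigma^{-1} y_i."""
    Sinv = np.linalg.inv(Sigma)
    A = sum(X.T @ Sinv @ X for X in X_list)
    b = sum(X.T @ Sinv @ y for X, y in zip(X_list, y_list))
    return np.linalg.solve(A, b)
```

In practice Σ is unknown and the weights are updated from an estimated covariance structure, which is what ML and REML iterations do; the sketch shows only the weighted-least-squares step.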
1.1.2 Generalized Linear Models
Generalized linear models (GLM) are natural extensions of classical linear models
allowing for a larger class of response distributions. Their specification consists of
three parts (McCullagh and Nelder, 1989, pp. 27-30): a random component, a systematic component and a link function.
1. The random component is the probability distribution for the elements of
the response vector. The y_i's, i = 1, ..., n, are assumed to be independent with a
distribution in the exponential family

f(y_i; θ_i, φ) = exp{ [y_i θ_i − b(θ_i)] / a(φ) + c(y_i, φ) }   (1.2)

for some specified functions a(.), b(.) and c(.). Usually

a(φ) = φ / w_i,

where φ is called a dispersion parameter and the w_i are known weights. The mean
μ_i and the variance function v(μ_i) completely specify a member of the exponential
family because μ_i = b'(θ_i) and Var(y_i) = b''(θ_i)a(φ). Important exponential family
distributions are the normal, the binomial, the Poisson and the gamma distributions.
2. The systematic component is a linear function of the covariates

η_i = x_i'β,

where η_i is commonly called the linear predictor.
3. The link function g(.) is a monotonic differentiable function which relates the
expected value of the response distribution μ_i to the linear predictor η_i:

η_i = g(μ_i).

When the response distribution is normal and the link is the identity function
g(μ) = μ, the GLM reduces to the usual linear regression model. For each member
of the exponential family there is a special link function, called the canonical link
function, which simplifies model fitting. For that link function θ_i = η_i. Maximum
likelihood estimates in GLM are obtained using iteratively reweighted least squares.
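The iteratively reweighted least squares (IRLS) scheme can be sketched for a concrete case; the fragment below (an illustrative implementation, not from any package) fits a Poisson GLM with the canonical log link, for which θ_i = η_i:

```python
import numpy as np

def irls_poisson(X, y, n_iter=25):
    """IRLS for a Poisson GLM with canonical log link: mu_i = exp(x_i' beta)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)               # inverse link
        W = mu                         # working weights (variance function at mu)
        z = eta + (y - mu) / mu        # working response
        WX = X * W[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (W * z))
    return beta
```

Each pass is a weighted least squares fit of the working response z on X; for an intercept-only model the algorithm converges to the closed-form MLE, exp(β₀) = ȳ.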
Just as modifications of linear models are used for analyzing Gaussian repeated
measures, modifications of GLM can handle discrete and continuous outcomes. Extensions to GLM include marginal, random effects and transition models (Zeger and
Liang, 1992). Hereafter we will use y_ij to denote the response at the jth occasion for
the ith subject. The ranges of the subscripts will be i = 1, ..., n, and j = 1, ..., J for
balanced and j = 1, ..., n_i for unbalanced data.
1.1.3 Marginal Models
Marginal models are designed to permit separate modeling of the regression of the
response on the predictors, and of the association among repeated observations for
each individual. The models are defined by specifying expressions for the marginal
mean and the marginal variance-covariance matrix of the response:
1. The marginal mean μ_ij = E(y_ij) is related to the predictors by a known link
function g(.):

g(μ_ij) = x_ij'β.

2. The marginal variance is a function of the marginal mean,

Var(y_ij) = V(μ_ij),

and the marginal covariance is a function of the marginal means and of additional
parameters δ,

Cov(y_ij, y_ij') = c(μ_ij, μ_ij'; δ).

Notice that if the correlation is ignored and the variance function is chosen to correspond to an exponential family distribution, the marginal model reduces to a GLM for
independent data. But the variance function can be more general, and hence even
if the responses are uncorrelated this model is more general than the corresponding
GLM. Because only the first two moments are specified for the joint distribution of
the response, additional assumptions are needed for likelihood inference. Alternatively, the Generalized Estimating Equations (GEE) method can be used (Liang and
Zeger, 1986; Zeger, Liang and Albert, 1988), as briefly summarized here.
Let y_i = (y_i1, ..., y_iJ)', μ_i = (μ_i1, ..., μ_iJ)', A_i = diag{V(μ_i1), ..., V(μ_iJ)} and let
R_i(δ) be a 'working' correlation matrix for the ith subject. The latter means that
R_i(δ) is completely specified up to a parameter vector δ and may or may not be the
true correlation matrix. The regression parameters β are then estimated by solving

Σ_{i=1}^{n} (∂μ_i/∂β)' V_i^{-1}(δ) (y_i − μ_i) = 0,

where V_i(δ) = A_i^{1/2} R_i(δ) A_i^{1/2}. Liang and Zeger (1986) show that if the mean function
is correctly specified, β̂, the solution to the above equation, is consistent and asymptotically normal as the number of subjects goes to infinity. They also propose a robust
variance estimate, which is consistent even when the variance-covariance structure is misspecified. Hence, the GEE approach is appropriate when the regression
relationship, and not the correlation structure of the data, is of primary interest.
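As an illustrative sketch of the estimating equation and the robust variance estimate, the fragment below (our own minimal implementation, not a library routine) solves the GEE for clustered binary data with a logistic mean model under a working-independence correlation, where V_i = A_i:

```python
import numpy as np

def gee_independence_logistic(X_list, y_list, n_iter=25):
    """GEE under working independence for binary y with logit link,
    with the robust (sandwich) covariance estimate of beta-hat."""
    p = X_list[0].shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        U = np.zeros(p)
        H = np.zeros((p, p))
        for X, y in zip(X_list, y_list):
            mu = 1 / (1 + np.exp(-(X @ beta)))
            A = mu * (1 - mu)             # variance function for binary data
            D = X * A[:, None]            # D_i = d mu_i / d beta'
            U += D.T @ ((y - mu) / A)     # D_i' V_i^{-1} (y_i - mu_i)
            H += D.T @ (D / A[:, None])   # D_i' V_i^{-1} D_i
        beta = beta + np.linalg.solve(H, U)
    # sandwich estimator: H^{-1} [sum_i s_i s_i'] H^{-1}, s_i = D_i' V_i^{-1} (y_i - mu_i)
    M = np.zeros((p, p))
    for X, y in zip(X_list, y_list):
        mu = 1 / (1 + np.exp(-(X @ beta)))
        A = mu * (1 - mu)
        D = X * A[:, None]
        s = D.T @ ((y - mu) / A)
        M += np.outer(s, s)
    Hinv = np.linalg.inv(H)
    return beta, Hinv @ M @ Hinv
```

Under working independence with the canonical logit link the estimating function simplifies to Σ_i X_i'(y_i − μ_i), but the sandwich step still uses the cluster structure, which is what makes the variance estimate robust.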
1.1.4 Random Effects Models
An important feature of the marginal models is that the regression coefficients
have the same interpretation as coefficients from a cross-sectional analysis. These
models are preferred when the effects of explanatory variables on the average response
within a population are of primary interest. However, when it is of interest to describe
how the response for a particular individual changes as a result of a change in the
covariates, a more pertinent approach is to consider random (mixed) effects models.
Random effects models assume that the correlation among repeated responses
arises because there is natural heterogeneity across individuals and that this heterogeneity can be represented by a probability distribution. More precisely:
1. The conditional distribution of the response y_ij given a subject-specific vector
of random effects b_i satisfies a GLM with linear predictor x_ij'β + z_ij'b_i, where z_ij
in general is a subset of x_ij.
2. The responses on the same subject, y_i1, y_i2, ..., y_in_i, are conditionally independent
given b_i.
3. b_i has a distribution F(.; δ) with mean 0 and variance-covariance matrix
Σ depending on a parameter vector δ.
Such models with a normal distribution for the random effects are considered in
greater detail in Section 2.1. In contrast to marginal models, the regression coefficients
in random effects models have subject-specific interpretations. To better illustrate
that difference let us consider a particular example.
Let the response y_ij be binary and the subscript j refer to time. Consider the
marginal model

logit(μ_ij) = β₀ + β₁ j,

where μ_ij = E(y_ij), and the random effects model

logit(μ*_ij) = β₀* + β₁* j + b_i,   b_i i.i.d. N(0, σ²),

where μ*_ij = E(y_ij | b_i). Then β₁* is the log-odds ratio for a positive response at time
j + 1 relative to time j for any subject i, while β₁ is the population-averaged log-odds
ratio. That is, β₁ describes the change in log-odds for a positive response from time
j to time j + 1 for the population as a whole. In general these two interpretations
are different, but in some special cases, such as identity link functions, subject-specific
and population-averaged interpretations coincide. More discussion on the connection
between marginal and random effects models follows in Chapter 2.
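The gap between the two interpretations can be checked numerically for the logit link. The sketch below (our choices of parameter values and a Gauss-Hermite rule, for illustration) integrates the subject-specific logistic curve over b_i ~ N(0, σ²) and compares the resulting population-averaged log-odds ratio with β₁*:

```python
import numpy as np

def marginal_prob(eta_fixed, sigma, n_nodes=60):
    """P(y = 1) = E_b[logistic(eta_fixed + b)], b ~ N(0, sigma^2),
    approximated by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma * x
    return (w / np.sqrt(np.pi)) @ (1 / (1 + np.exp(-(eta_fixed + b))))

beta0, beta1, sigma = -1.0, 1.0, 2.0          # subject-specific parameters
p0 = marginal_prob(beta0 + beta1 * 0, sigma)  # marginal P(y=1) at time j
p1 = marginal_prob(beta0 + beta1 * 1, sigma)  # marginal P(y=1) at time j+1
marg_logodds = np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))
# marg_logodds < beta1: the population-averaged effect is attenuated
```

For σ > 0 the population-averaged log-odds ratio is strictly smaller in magnitude than β₁*, which is the attenuation phenomenon referred to above.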
The presence of random effects enables the pooling of information across different
subjects, resulting in better subject-specific (as opposed to population-averaged) inference, but it complicates the estimation problem considerably. To obtain the likelihood
function, one has to integrate out the random effects, which, except for a few special
cases, cannot be performed analytically. If the random effects are nuisance parameters, conditional likelihood estimates for the fixed effects may be easy to obtain
for canonical link functions. This is accomplished by conditioning on the sufficient
statistics for the unknown nuisance parameters and then maximizing the conditional
likelihood.
When the dimension of the integral is not high, numerical methods such as Gaussian quadrature work well for normally distributed random effects (Fahrmeir and
Tutz, 1994, pp. 357-362; Liu and Pierce, 1994). A variety of other methods have
recently been proposed to handle more difficult cases. These include the approximate maximum likelihood estimates proposed by Schall (1991), the penalized quasi-likelihood approach of Breslow and Clayton (1993), the hierarchical likelihood of Lee and Nelder (1996), the Gibbs sampling approach of Zeger and Karim
(1991), the EM algorithm approach for GLMM of Booth and Hobert (1999) and of
McCulloch (1997), and others.
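For a random-intercept logistic GLMM, the Gauss-Hermite approximation to the marginal log-likelihood can be sketched as follows (illustrative code; function and variable names are ours, and maximization over (β, σ) would still be handed to a generic optimizer):

```python
import numpy as np

def marginal_loglik(beta, sigma, X_list, y_list, n_nodes=30):
    """log L = sum_i log  integral prod_j p_ij(b)^y_ij (1-p_ij(b))^(1-y_ij)
    phi(b; 0, sigma^2) db, via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma * x                  # nodes on the scale of b_i
    ll = 0.0
    for X, y in zip(X_list, y_list):
        eta = X @ beta                            # (J,)
        p = 1 / (1 + np.exp(-(eta[:, None] + b[None, :])))   # (J, n_nodes)
        lik_b = np.prod(np.where(y[:, None] == 1, p, 1 - p), axis=0)
        ll += np.log((w / np.sqrt(np.pi)) @ lik_b)
    return ll
```

When σ = 0 the quadrature collapses to the ordinary logistic log-likelihood, which is a convenient correctness check; the quadrature dimension here is one, which is why this direct approach works well only when the random effects vector is low-dimensional.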
1.1.5 Transition Models
A third group of models for dealing with longitudinal data consists of transition
models. Regression Markov chain models for repeated measures data have been considered by Zeger et al. (1985) and Zeger and Qaqish (1988). This approach involves
modeling the conditional expectation of the response at each occasion given past
outcomes. Specifically:
1. The conditional expectation of the response, μᶜ_ij = E(y_ij | y_i,j-1, ..., y_i1), depends
on previous responses and current covariates as follows:

g(μᶜ_ij) = x_ij'β + Σ_{k=1}^{j-1} γ_k f_k(y_i,j-1, ..., y_i1),

where the f_k(.), k = 1, ..., j-1, are known functions.
2. The conditional variance of y_ij is a function of the conditional mean,

Var(y_ij | y_i,j-1, ..., y_i1) = V(μᶜ_ij),

where V is a known function.
Transition models combine the assumptions about the dependence of the response
on the explanatory variables and the correlation among repeated observations into a
single equation. Conditional maximum likelihood and GEE have both been used for
estimating parameters.
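A first-order binary transition model is easy to simulate, and its conditional probabilities can be recovered empirically; the sketch below (our parameter values, purely illustrative) uses logit(μᶜ_ij) = β₀ + γ y_i,j-1:

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, gamma = -1.0, 2.0        # conditional intercept and lag-1 coefficient
n, J = 5000, 10
y = np.zeros((n, J), dtype=int)
y[:, 0] = rng.binomial(1, 0.5, size=n)   # initial state: model left unspecified
for j in range(1, J):
    eta = beta0 + gamma * y[:, j - 1]
    y[:, j] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# empirical transition probabilities approximate the model-implied ones:
p01 = y[:, 1:][y[:, :-1] == 0].mean()   # approx logistic(beta0)
p11 = y[:, 1:][y[:, :-1] == 1].mean()   # approx logistic(beta0 + gamma)
```

The simulation also shows why the first observation is treated separately: its distribution is not determined by the transition model, which is the source of the lost information on the initial observations mentioned later for Liang and Zeger's transitional model.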
Extensions of GLM are not the only methods for analyzing repeated measures data
(see, for example, Vonesh (1992) for an overview of nonlinear models for longitudinal
data), but as the models proposed in this dissertation are based on such extensions,
our discussion in the following sections is restricted to GLM-type models.
1.2 Models for Multivariate Repeated Measures
In contrast to univariate longitudinal data, very few models have been discussed
that specifically deal with multivariate repeated measures. These models are now
briefly discussed, starting with normal theory linear models and proceeding with
models for discrete outcomes.
A review of general linear models for the analysis of longitudinal studies is provided by Ware (1985). The general multivariate model is defined as in (1.1) but we
assume that the J = KL repeated measures on each subject are made on K normally distributed variables rather than on only one normally distributed response.
Hence, the general multivariate model with unspecified covariance structure can be
directly applied to multivariate repeated measures data. However, the number of
parameters to be estimated increases quickly as the number of occasions and/or variables increases and the estimation may become quite burdensome. Special types of
correlation structures such as bivariate autoregressive can be specified directly as proposed by Galecki (1994). This multivariate linear model is also not well suited for
unbalanced or incomplete data.
More parsimonious covariance structures are achieved in random effects models.
The linear mixed effects model is defined in two stages. At stage 1 we assume

y_i | b_i = X_i β + Z_i b_i + ε_i,

where ε_i is distributed N_{n_i}(0, σ²I), X_i and Z_i are n_i x p and n_i x q model matrices
for the fixed and the random effects respectively, and b_i is a q x 1 random effects
vector. At stage 2, b_i ~ N_q(0, Σ), independently of ε_i. This corresponds to a special
variance-covariance structure for the ith subject: Σ_i = σ²I + Z_i Σ Z_i'.
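The induced marginal covariance Σ_i = σ²I + Z_i Σ Z_i' is simple to form; the following sketch (our illustrative numbers) builds it for a random intercept and slope over four equally-spaced occasions:

```python
import numpy as np

t = np.arange(4.0)                          # occasions 0, 1, 2, 3
Z = np.column_stack([np.ones(4), t])        # n_i x q design for (intercept, slope)
Sigma_b = np.array([[1.0, 0.2],
                    [0.2, 0.5]])            # var-cov matrix of b_i
sigma2 = 0.3                                # error variance
Sigma_i = sigma2 * np.eye(4) + Z @ Sigma_b @ Z.T
# variances grow with t through the random slope; off-diagonals are nonzero
```

Only q(q+1)/2 + 1 covariance parameters generate the full n_i x n_i matrix, which is the sense in which the random effects structure is parsimonious.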
Reinsel (1982) generalized this linear mixed effects model and showed that for balanced multivariate repeated measures models with random effects structure, closed-form solutions exist for both maximum likelihood (ML) and restricted maximum
likelihood (REML) estimates of mean and covariance parameters. Matsuyama and
Ohashi (1997) considered bivariate response mixed effects models that can handle
missing data. Choosing a Bayesian viewpoint, they used the Gibbs sampler to estimate the parameters.
The model of Reinsel is overly restrictive in some cases as he prescribes the same
growth pattern over all response variables for all individuals. In contrast, Heitjan and
Sharma (1997) considered a model for repeated series longitudinal data, where each
unit could yield multiple series of the same variable. The error term they used was a
sum of a random subject effect and a vector autoregressive process, thus accounting for
subject heterogeneity and time dependence in an additive fashion. A straightforward
extension for multiple series of observations on distinct variables is possible.
All models discussed so far in this section are appropriate for continuous response
variables that can be assumed to be normally distributed. More generally, GEE
marginal models can easily accommodate multivariate repeated measures if all response variables have the same discrete or continuous distribution. We now restate
the GEE marginal model from Section 1.1 in matrix notation to simplify the multivariate extension. We also consider balanced data. Let y_i represent the vector of
observations for the ith individual (i = 1, 2, ..., n) and let μ_i = E(y_i) be the marginal
mean vector. We assume

g(μ_i) = X_i β,

where X_i is the model matrix for the ith individual, β is an unknown parameter vector
and g(.) is a known link function, applied componentwise to μ_i. Also Var(y_ij) =
φV(μ_ij), where V is a known variance function and φ is a dispersion parameter.
Letting A_i = diag[V(μ_i1), ..., V(μ_iJ)] and assuming a working correlation matrix R_i(δ)
for y_i, the 'working' covariance matrix for y_i is given by

V_i = φ A_i^{1/2} R_i(δ) A_i^{1/2}.
If R_i(δ) = R(δ) is the true correlation matrix, V_i is the true covariance matrix of
y_i. The J responses for each subject are usually repeated measures on the same
variable, but as in the normal case, they can be repeated observations on two or more
outcome variables as long as they have the same distribution. The only difference
from univariate repeated measures is in the specification of the covariance matrix.
The estimates of the regression parameters β will be consistent provided that the
model for the marginal mean structure is specified correctly, but for better efficiency
the working correlation matrix should be close to the true correlation matrix. Several
correlation structures for multivariate repeated measures are discussed by Rochon
(1996).
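One convenient working-correlation construction for repeated measures on K variables at J occasions is a Kronecker product of a between-variable and a within-variable (time) structure; the sketch below (an illustrative choice, not necessarily one of Rochon's specific structures) builds such a matrix for K = 2 variables at J = 3 occasions:

```python
import numpy as np

def ar1_corr(J, rho):
    """AR(1) correlation matrix: r_jk = rho^|j - k|."""
    idx = np.arange(J)
    return rho ** np.abs(idx[:, None] - idx[None, :])

R_var = np.array([[1.0, 0.6],
                  [0.6, 1.0]])      # correlation between the two variables
R_time = ar1_corr(3, 0.4)           # correlation across occasions
R = np.kron(R_var, R_time)          # 6 x 6 working correlation
```

The Kronecker structure keeps the parameter count small (here two correlation parameters for a 6 x 6 matrix) and is positive definite whenever both factors are.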
As a further extension, Rochon (1996) proposed a model for bivariate repeated
measures that could accommodate both continuous and discrete outcomes. He used
GEE models, like the one described above, to relate each set of repeated measures to important explanatory variables and then applied seemingly unrelated regression (SUR;
Zellner, 1962) methodology to combine the pair of GEE models into an overall analysis
framework. If the response vector for the ith subject is denoted by y_i = (y_i^(1)', y_i^(2)')'
and all relevant quantities for the first and second response are superscripted by 1 and
2 respectively, the SUR model may be written as

g(μ_i) = X_i β,

where

μ_i = [ μ_i^(1) ],   X_i = [ X_i^(1)     0     ],   β = [ β^(1) ],
      [ μ_i^(2) ]          [    0     X_i^(2)  ]        [ β^(2) ]

and g(.) is a compound function consisting of g^(1)(.) and g^(2)(.). The joint covariance
matrix among the sets of repeated measures may be written as

V_i = [ φ₁I_J    0   ]^{1/2} [ A_i^(1)    0     ]^{1/2} R(δ) [ A_i^(1)    0     ]^{1/2} [ φ₁I_J    0   ]^{1/2},
      [   0    φ₂I_J ]       [    0     A_i^(2) ]            [    0     A_i^(2) ]       [   0    φ₂I_J ]

where R(δ) is the working correlation matrix among the two sets of repeated measures
for each subject, and each of its elements is a function of the parameter vector δ:

R(δ) = [ R₁₁  R₁₂ ].
       [ R₁₂' R₂₂ ]

The suggested techniques may be extended to multiple outcome measures. Rochon's
approach provides a great deal of flexibility in modeling the effects of both within-subject and between-subject covariates on discrete and continuous outcomes and is
appropriate when the effects of covariates on the marginal distributions of the response
are of interest. If subject-specific inference is preferred, however, a transitional or
random effects model should be considered.
Liang and Zeger (1989) suggest a class of Markov chain logistic regression models
for multivariate binary time series. Their approach is to model the conditional distribution of each component of the multivariate binary time series given the others.
They use 'pseudo-likelihood' estimation methods to reduce the computational burden
associated with maximum likelihood estimation.
Liang and Zeger's transitional model is useful when the association among variables at one time is of interest, or when the purpose is to identify temporal relationships among the variables, adjusting for covariates. However, one must use caution
when interpreting the estimated parameters, because the regression parameter β in
the model has a log-odds ratio interpretation conditional on the past and on the other
outcomes at time t. Hence, if a covariate influences more than one component of the
outcome vector or past outcomes, which is frequently the case, its regression coefficient will capture only that part of its influence that cannot be explained by the other
outcomes it is also affecting. Another problem with the model is that in the fitting
process all the information in the first q observations is ignored.
A random effects approach for dealing with multivariate longitudinal data is
discussed by Agresti (1997), who develops a multivariate extension of the Rasch model
for repeated measures of a multivariate binary response. One disadvantage of the
multivariate Rasch model, shared by many random effects models, is that it cannot
accommodate a negative covariance structure among repeated observations on the same
variable. Although measurements on the same variable within a subject are usually positively
correlated, there are cases when the correlation is negative. One such example is the
infection data, first considered by Haber (1986). Frequencies of infection profiles of a
sample of 263 individuals for four influenza outbreaks over four consecutive winters
in Michigan were recorded. The first and fourth outbreaks are known to be caused
by the same virus type and because contracting influenza during the first outbreak
provides an immunity against a subsequent outbreak, a subject's response for these
two outbreaks is negatively correlated. Coull (1997) analyzed these data using the
multivariate binomiallogit normal model. As it is a special case of the models that
we propose later in this dissertation, we define it here.
Let y = (y_1, ..., y_I)^T given π = (π_1, ..., π_I)^T be a random vector of independent
binomial components with numbers of trials (n_1, ..., n_I)^T. Also let logit(π) be N_I(μ, Σ).
Then π has a multivariate logistic-normal distribution and unconditionally y has a
multivariate binomial logit-normal mixture distribution. If the I observations correspond
to measurements on K variables at L occasions, this model can be used for analyzing
multivariate repeated measures data. The mean of the multivariate random effects
distribution can be assumed to be a function of covariates, μ = Xβ, and several groups
of subjects with the same design matrices X_s, s = 1, ..., S, can be considered.
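As a concrete illustration of the mixture structure just described, the following sketch (hypothetical code with arbitrary parameter values, not part of the original analysis) simulates from a bivariate binomial logit-normal model and shows that a negative off-diagonal element of Σ induces negative correlation between the two counts, the feature that motivated the model's use for the infection data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Binomial logit-normal model: logit(pi) ~ N(mu, Sigma), and given pi the
# components of y are independent Binomial(n_trials, pi).  A negative
# off-diagonal element of Sigma induces negative correlation between the
# two counts, which a shared positive random effect cannot produce.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, -0.8],
                  [-0.8, 1.0]])
n_trials = np.array([10, 10])
n_subjects = 5000

logit_pi = rng.multivariate_normal(mu, Sigma, size=n_subjects)
pi = 1.0 / (1.0 + np.exp(-logit_pi))       # inverse logit, componentwise
y = rng.binomial(n_trials, pi)             # shape (n_subjects, 2)

print(np.corrcoef(y[:, 0], y[:, 1])[0, 1])  # markedly negative
```

The same recipe with a Poisson kernel and a log link gives the Aitchison and Ho model discussed next.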
The multivariate binomial logit-normal model can be regarded as an analog for
binary data of the multivariate Poisson-log normal model of Aitchison and Ho (1989)
for count data. Aitchison and Ho assume that y = (y_1, ..., y_I)^T given θ = (θ_1, ..., θ_I)^T
are independent Poisson with mean vector θ, and that log(θ) = (log(θ_1), ..., log(θ_I))^T
is N_I(μ, Σ). Then θ has a multivariate lognormal distribution. Like the multivariate
logit-normal model, the multivariate Poisson-log normal model can be used to model
negative correlations and can be extended to incorporate covariates.
Chan and Kuk (1997) also consider random effects models for binary repeated
measures but assume an underlying threshold model with normal errors and random
effects, and use the probit link. They estimate the parameters via a Monte Carlo EM
algorithm, regarding the observations from the underlying continuous model as the
complete data. Their approach will be discussed in more detail in Chapter 5 where
an extension of their model will be considered.
1.3 Simultaneous Modelling of Responses of Different Types
Difficulties in joint modelling of responses of different types arise because of the
need to specify a multivariate joint distribution for the outcome variables. Most
research so far has concentrated on simultaneous analysis of binary and continuous
responses.
Olkin and Tate (1961) introduced a location model for discrete and continuous
outcomes. It is based on a multinomial model for the discrete outcomes and a
multivariate Gaussian model for the continuous outcomes conditional on the discrete
outcomes. Fitzmaurice and Laird (1995) discussed a generalization of this model
which turns out to be a special case of the partly exponential model introduced by
Zhao, Prentice and Self (1992). Partly exponential models for the regression analysis
of multivariate (discrete, continuous or mixed) response data are parametrized in
terms of the response mean and a general shape parameter. They encompass
generalized linear models as well as certain multivariate distributions. A fully parametric
approach to estimation leads to asymptotically independent maximum likelihood
estimates of the mean and the shape parameters. The score equations for the mean
parameters are essentially the same as in GEE. The authors point out two major
drawbacks to their approach: one is the computational complexity of full maximum
likelihood estimation of the mean and the shape parameters together; the other is
the need to specify the shape function for the response. The latter is especially
hard if the response vector is a mixture of discrete and continuous outcomes. Zhao,
Prentice and Self conclude that partly exponential models are mainly of theoretical
interest and can be used to evaluate properties of other mean estimation procedures.
Cox and Wermuth (1992) compared a number of special models for the joint
distribution of qualitative (binary) and quantitative variables. The joint distribution
for all bivariate models is based on the marginal distribution of one of the components
and a conditional distribution for the other component given the first one. A key
distinction is between models in which the quantitative response (Y) is assumed to
be normally distributed conditional on each level of the binary outcome (A), and
models in which the marginal distribution of Y is normal. Typically, simplicity in the
marginal distribution of Y corresponds to a fairly complicated conditional distribution
of Y and vice versa, but normality often holds at least approximately. Estimation procedures differ from
model to model but essentially the same tests of independence of the two components
of the response can be derived. This, however, is not true if trivariate distributions
are considered with two binary and one continuous, or with two continuous and one
binary component. In that case several different hypotheses of independence and
conditional independence can be considered and depending on the model sometimes
they may not be tested unless a stronger hypothesis of independence is assumed.
Catalano and Ryan (1992), Fitzmaurice and Laird (1995) and Regan and Catalano
(1999) considered mixed models for a bivariate response consisting of a binary and
a quantitative variable. Catalano and Ryan (1992) and Regan and Catalano (1999)
treated the binary variable as a dichotomized continuous latent trait, which had a
joint bivariate normal distribution with the other continuous response. Catalano and
Ryan (1992) then parametrized the model so that the joint distribution was a product
of a standard random effects model for the continuous variable and a correlated probit
model for the discrete variable. Estimation for the parameters of the two models is
performed using quasilikelihood techniques. Catalano (1994) extended the Catalano
and Ryan procedure to ordinal instead of binary data.
Regan and Catalano (1999) used exact maximum likelihood for estimation, which
is computationally feasible because of the equicorrelation assumption between and
within the binary and continuous outcomes. The maximum likelihood methodology
used by the authors is an extension of the procedure suggested by Ochi and Prentice
(1984) for binary data.
Fitzmaurice and Laird (1995) assumed a logit model for the binary response and a
conditional Gaussian model for the continuous response. Unlike in Catalano and Ryan's
model, all regression parameters have marginal interpretations and the estimates of
the regression parameters (based on ML or GEE) are robust to misspecification of
the association between the binary and the continuous responses.
Sammel, Ryan and Legler (1997) developed models for mixtures of outcomes in the
exponential family. They assumed that all responses are manifestations of one or more
common latent variables and that conditional independence between the outcomes
given the value of the latent trait held. This allows one to use the EM algorithm with
some numerical or stochastic approximations at each E-step. Blackwell and Catalano
(1999) extended the Sammel, Ryan and Legler methodology to longitudinal ordinal
data by assuming correlated latent variables at each time point. For simplicity of
analysis each outcome is assumed to depend only on one latent variable and the
latent variables are assumed to be independent at each time point.
1.4 Format of Dissertation
The purpose of this dissertation is to propose and investigate random effects mod
els for repeated measures situations when there are two or more response variables.
Of special interest is the case when the response variables are of different types. The
dissertation is organized as follows.
In Chapter 2 we propose a multivariate generalized linear mixed model which can
accommodate any combination of responses in the exponential family. We first describe
the usual generalized linear mixed model (GLMM) and then define its extension. The
relationship between marginal and conditional moments in the proposed model is
briefly discussed and two motivating examples are presented. The key assumption of
conditional independence is outlined.
Chapter 3 concentrates on maximum likelihood model fitting methods for the
proposed model. Gauss-Hermite quadrature, a Monte Carlo EM algorithm and a
pseudo-likelihood approach are extended from the univariate to the multivariate
generalized linear mixed model. Standard error approximation is discussed along with
point estimation. We use a simulated data example and the two motivating examples
to illustrate the proposed methodology. We also address certain issues such as
standard error variability and comparison between multivariate and univariate analyses.
In Chapter 4 we consider inference in the multivariate GLMM. We describe
hypothesis testing for the fixed effects based on approximations to the Wald and
likelihood ratio statistics and propose score tests for the variance components and for
testing the conditional independence assumption. The performance of the Gauss-Hermite
quadrature and Monte Carlo approximations to the Wald, score and likelihood ratio
statistics is compared via a small simulation study.
Chapter 5 introduces a correlated probit model as an alternative to the multivariate
GLMM for binary and continuous data when conditional independence does not
hold. We develop a Monte Carlo EM algorithm for maximum likelihood estimation
and apply it to one of the motivating examples. To address the issue of advantages
of joint over separate analyses of the response variables we design a simulation study
to investigate possible efficiency gains in a multivariate analysis. An identifiability
issue concerning one of the variance components is also discussed.
The dissertation concludes with a summary of the most important findings and
discussion of future research topics (Chapter 6).
CHAPTER 2
MULTIVARIATE GENERALIZED LINEAR MIXED MODEL
The Generalized Linear Mixed Model (GLMM) is a very general class of random
effects models, well suited for subjectspecific inference for repeated measures data.
It is a special case of the random effects model defined in Section 1.2. We start
this chapter by providing a detailed definition of the GLMM for univariate repeated
measures (Section 2.1) and then introduce the multivariate extension (Section
2.2). Some properties of the multivariate GLMM are discussed in Section 2.3 and the
chapter concludes with a description of two data sets which will be used for illustration
throughout the thesis. Hereafter we omit the superscript c in the conditional mean
notation.
2.1 Introduction
Let y_ij denote the jth response observed on the ith subject, j = 1, ..., n_i, i = 1, ..., n.
The conditional distribution of y_ij given an unobserved q × 1 subject-specific random
vector b_i is assumed to be in the exponential family with density

    f(y_ij | b_i) = exp{ [y_ij θ_ij − b(θ_ij)] w_ij / φ + c(y_ij, φ, w_ij) },

where μ_ij = b'(θ_ij) is the conditional mean, φ is the dispersion parameter, b(.) and
c(.) are specific functions corresponding to the type of exponential family, and w_ij
are known weights. Also at this stage, it is assumed that g(μ_ij) = g(E(y_ij | b_i)) = η_ij,
where η_ij = x_ij^T β + z_ij^T b_i is a linear predictor, b_i is a q × 1 random effect, β is a
p × 1 parameter vector, and the design vectors x_ij (p × 1) and z_ij (q × 1) are functions of the
covariates.
At the second stage the subject-specific effects b_i are assumed to be i.i.d.
N_q(0, Σ). In general, the random effects can have other continuous or discrete
distributions, but the normal distribution provides a full range of possible covariance
structures and we will restrict our attention to this case.
As an additional assumption, conditional independence of the observations within
and between subjects is required, that is,

    f(y | b; β, φ) = Π_{i=1}^n f(y_i | b_i; β, φ)  with  f(y_i | b_i; β, φ) = Π_{j=1}^{n_i} f(y_ij | b_i; β, φ),

where y = (y_1^T, ..., y_n^T)^T and b = (b_1^T, ..., b_n^T)^T are column vectors of all responses and
all random effects respectively.
Maximum likelihood estimation for the GLMM is complicated because the marginal
likelihood for the response does not have a closed-form expression. Different
approaches for dealing with this situation are discussed in Chapter 3.
The parameters β in the GLMM have subject-specific interpretations; i.e. they
describe the individual's rather than the average population response to changing the
covariates. Under the GLMM, the marginal mean is

    μ_ij^m = E(y_ij) = E{E(y_ij | b_i)} = ∫ g^{-1}(x_ij^T β + z_ij^T b_i) f(b_i; Σ) db_i,

and in general g(μ_ij^m) ≠ x_ij^T β if g is a nonlinear function.
However, this equation holds approximately if the standard deviations of the random
effects distributions are small.
Closed-form expressions for the marginal means exist for certain link functions
(Zeger, Liang and Albert, 1988). For example, for the identity link function the
marginal and conditional moments coincide. For the log link function the marginal
mean is μ_ij^m = exp(x_ij^T β + z_ij^T Σ z_ij / 2). For the probit link function, the marginal
mean is μ_ij^m = Φ(a_p(Σ) x_ij^T β), where a_p(Σ) = |Σ z_ij z_ij^T + I_q|^{-1/2} and q is the
dimension of b_i. For the logit link, an exact closed-form expression for the marginal
mean is unavailable, but using a cumulative Gaussian approximation to the logistic
function leads to the expression logit(μ_ij^m) ≈ a_l(Σ) x_ij^T β, where
a_l(Σ) = |c² Σ z_ij z_ij^T + I_q|^{-1/2} and c = 16√3/(15π).
Unfortunately, there are no simple formulae for the higher order marginal moments
except in the case of a linear link function. But the second-order moments can also
be approximated and then the GEE approach can be used to fit the mixed model
(Zeger, Liang and Albert, 1988).
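The logit attenuation approximation can be checked numerically. The sketch below (all values illustrative; a random-intercept model with scalar standard deviation σ, for which the attenuation factor reduces to (1 + c²σ²)^{-1/2}) compares a Monte Carlo evaluation of the marginal mean with the cumulative Gaussian approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
expit = lambda t: 1.0 / (1.0 + np.exp(-t))

beta_x = 1.5        # fixed part of the linear predictor, x'beta (illustrative)
sigma = 1.0         # random-intercept standard deviation (illustrative)
c = 16.0 * np.sqrt(3.0) / (15.0 * np.pi)   # logistic/probit matching constant

# Marginal mean by Monte Carlo integration over the random intercept
b = rng.normal(0.0, sigma, size=1_000_000)
mu_marginal = expit(beta_x + b).mean()

# Attenuated-fixed-effect approximation to the marginal mean
mu_approx = expit(beta_x / np.sqrt(1.0 + c**2 * sigma**2))

print(mu_marginal, mu_approx)   # the two agree closely
```

The attenuation toward 0.5 relative to expit(1.5) illustrates why the marginal (population-averaged) effect is smaller in magnitude than the subject-specific β.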
2.2 Model Definition
Let us first consider bivariate repeated measures. Denote the response vector
for the ith subject by y_i = (y_i1^T, y_i2^T)^T, where y_i1 = (y_i11, ..., y_i1n_i)^T and
y_i2 = (y_i21, ..., y_i2n_i)^T are the repeated measurements on the 1st and 2nd variable
respectively at n_i occasions. The number of observations for the two variables within a
subject need not be the same, and hence it would be more appropriate to denote them
by n_i1 and n_i2, but for simplicity we will use n_i. We assume that y_i1j, j = 1, ..., n_i,
are conditionally independent given b_i1 with density f_1(.) in the exponential family.
Analogously, y_i2j, j = 1, ..., n_i, are conditionally independent given b_i2 with density
f_2(.) in the exponential family. Note that f_1 and f_2 need not be the same. Also y_i1
and y_i2 are conditionally independent given b_i = (b_i1^T, b_i2^T)^T, and the responses on
different subjects are independent. Let g_1(.) and g_2(.) be appropriate link functions
for f_1 and f_2. Denote the conditional means of y_i1j and y_i2j by μ_i1j and μ_i2j
respectively. Let μ_i1 = (μ_i11, ..., μ_i1n_i)^T and μ_i2 = (μ_i21, ..., μ_i2n_i)^T. At stage one of the
mixed model specification we assume
    g_1(μ_i1) = X_i1 β_1 + Z_i1 b_i1,    (2.1)
    g_2(μ_i2) = X_i2 β_2 + Z_i2 b_i2,    (2.2)

where β_1 and β_2 are p_1 × 1 and p_2 × 1 dimensional unknown parameter vectors, X_i1
and X_i2 are n_i × p_1 and n_i × p_2 dimensional design matrices for the fixed effects, Z_i1
and Z_i2 are n_i × q_1 and n_i × q_2 design matrices for the random effects, and g_1 and g_2
are applied componentwise to μ_i1 and μ_i2. At stage two, a joint distribution for b_i1
(q_1 × 1) and b_i2 (q_2 × 1) is specified. The normal distribution is a very good candidate,
as it provides a rich covariance structure. Hence, we assume
    b_i = (b_i1^T, b_i2^T)^T ~ i.i.d. MVN(0, Σ),  Σ = [ Σ_11    Σ_12 ]    (2.3)
                                                     [ Σ_12^T  Σ_22 ],

where Σ, Σ_11 and Σ_22 are in general unknown positive-definite matrices. The
extension of this model to higher dimensional multivariate repeated measures is
straightforward, but for simplicity of discussion only the bivariate case will be considered.
When Σ_12 = 0 the above model is equivalent to two separate GLMMs for the
two outcome variables. Advantages of joint over separate fitting include the ability to
answer intrinsically multivariate questions, better control over the type I error rates
in multiple tests and possible gains in efficiency in the parameter estimates.
For example, in the developmental toxicity application described in more detail
at the end of this section, it is of interest to estimate the dose effect of ethylene glycol
on both outcomes (malformation and low fetal weight) simultaneously. One might
be interested in the probability of either malformation or low fetal weight at any given
dose. This type of question cannot be answered by a univariate analysis as it requires
knowledge of the correlation between the outcomes. Also, testing the significance of
the dose effect on malformation and fetal weight simultaneously (using a Wald test
for example) allows one to keep the significance level fixed at α. If instead the two
outcomes were tested separately, an adjustment would have to be made to the levels of
the individual tests to achieve an overall α level for both comparisons.
Multivariate analysis also allows one to borrow strength from the observations on
one outcome in estimating the others. This can lead to more precise estimates
of the parameters and therefore to gains in efficiency. Theoretically, the gain will
be the greatest if there is a common random effect for all responses (this means
Σ_11 = Σ_12 = Σ_22 in our notation). Also, if the number of observations per subject is
small and the correlation between the two outcomes is large, the gains in efficiency
should increase.
If the two vectors of random effects are perfectly correlated, then this is equivalent
to assuming a common latent variable with a multivariate normal distribution which
does not depend on covariates. Hence, the multivariate GLMM reduces to a special
case of the Sammel, Ryan and Legler model when the latent variable is not modelled
as a function of the covariates. It is possible to allow the random effects in our models
to depend on covariates but we have not considered that extension here.
Several other models discussed in previous sections turn out to be special cases of
this general formulation. For example, the bivariate Rasch model is obtained by
specifying Bernoulli response distributions for both variables, using the logit link function
and identity design matrices both for the fixed and for the random effects. The
multivariate binomial logit-normal model is also a special case for binary responses with
identity design matrix for the random effects and unrestricted variance-covariance
structure. The Aitchison and Ho multivariate Poisson-log normal model also falls
under this general structure when the response variables are assumed to have a Poisson
distribution.
2.3 Model Properties
Exactly as in GLMM, the conditional moments in the multivariate GLMM are
directly modelled, while marginal moments are harder to find. The marginal means
and the marginal variances of y_i1 and y_i2 for the model defined by (2.1)-(2.3) are
the same as those of the GLMM considering one variable at a time:

    E(y_i1) = E{E(y_i1 | b_i1)} = E[μ_i1(β_1, b_i1)],
    E(y_i2) = E[μ_i2(β_2, b_i2)],
    Var(y_i1) = E[Var(y_i1 | b_i1)] + Var[E(y_i1 | b_i1)] = E[φ_1 V(μ_i1)] + Var[μ_i1],
    Var(y_i2) = E[φ_2 V(μ_i2)] + Var[μ_i2],

where V(μ_i1) and V(μ_i2) denote the variance functions corresponding to the
exponential family distributions for the two response variables.
The marginal covariance matrix between y_i1 and y_i2 is found to be equal to the
covariance matrix between the conditional means μ_i1 and μ_i2:

    Cov(y_i1, y_i2) = E(y_i1 y_i2^T) − E(y_i1)E(y_i2)^T
                    = E{E(y_i1 y_i2^T | b_i1, b_i2)} − E(μ_i1)E(μ_i2)^T
                    = E[E(y_i1 | b_i1) E(y_i2 | b_i2)^T] − E(μ_i1)E(μ_i2)^T
                    = Cov(μ_i1, μ_i2).
The latter property is a consequence of the key assumption of conditional indepen
dence between the two response variables. This assumption allows one to extend
model fitting methods from the univariate to the multivariate GLMM but may not
hold in certain situations. This issue will be discussed in more detail in Chapter 4,
where score tests are proposed for verifying conditional independence.
2.4 Applications
The multivariate GLMM is fitted to two data sets. The first is a data set from
a developmental toxicity study of ethylene glycol (EG) in mice conducted through the
National Toxicology Program (Price et al., 1985). The experiment involved four
randomly chosen groups of pregnant mice, one group serving as a control and the other
three exposed to three different levels of EG during major organogenesis. Following
Table 2.1. Descriptive statistics for the Ethylene Glycol data
Dose (g/kg) Dams Live Fetuses Fetal Weight (g) Malformation
Mean SD Number Percent
0 25 297 0.972 0.098 1 0.34
0.75 24 276 0.877 0.104 26 9.42
1.50 22 229 0.764 0.107 89 38.86
3.00 23 226 0.704 0.124 126 57.08
sacrifice, measurements were taken on each fetus in the uterus. The two outcome
measures on each live fetus of interest to us are fetal weight (continuous) and
malformation status (dichotomous). Some descriptive statistics for the data are available in
Table 2.1. Fetal weight decreases monotonically with increasing dose with the average
weight ranging from 0.972 g in the control group to 0.704 g in the group administered
the highest dose. At the same time the malformation rate increases with dose from
0.3% in the control group to 57% in the group administered the highest dose.
The goal of the analysis is to study the joint effects of increasing dose on fetal
weight and on the probability of malformation. The analysis of these data is
complicated by the correlations between the repeated measures on fetuses within litter. A
multivariate GLMM with random intercepts for each variable allows one to explicitly
model the correlation structure within litter and provides subjectspecific estimates
for the regression parameters.
The second data set is from a study to compare the effects of 9 drugs and a placebo
on the patterns of myoelectric activity in the intestines of ponies (Lester et al., 1998a,
1998b, 1998c). For that purpose electrodes are attached to four different areas of the
intestines of 6 ponies, and spike burst rate and duration are measured at 18 equally
spaced time intervals around the time of each drug administration. Six of the drugs
and the placebo are given twice to each pony in a randomized complete block design.
The remaining three drugs are not given to all ponies and hence will not be analyzed
here. There is a rest period after each drug's administration and no carryover effects
Table 2.2. Descriptive statistics for pony data
Duration Count
Pony Mean SD Mean SD Corr.
1 1.03 0.25 77.57 45.53 0.59
2 1.35 0.28 128.25 76.08 0.18
3 1.40 0.36 84.33 46.27 0.45
4 1.27 0.48 111.00 49.66 0.21
5 1.14 0.21 75.44 45.03 0.50
6 1.24 0.40 66.86 48.97 0.32
are expected. The spike burst rate is a count variable reflecting the number of
contractions exceeding a certain threshold in 15 minutes. The duration variable reflects the
average duration of the contractions in each 15-minute interval. Figure 2.1 shows
graphical representations of the averaged responses by drug and time for one of the
electrodes and one hour after the drug administration. Table 2.2 shows the sample
means, standard deviations and correlations between the two outcome variables by
pony for this smaller data set. We analyse a restricted data set for reasons of
computational and practical feasibility. This issue is discussed in more detail in Chapter 3.
[Figure 2.1: Count and duration trends over time for the pony data. Each
trajectory shows the change in mean response for one of the seven drugs; the
horizontal axes give the time period.]
CHAPTER 3
ESTIMATION IN THE MULTIVARIATE GENERALIZED LINEAR MIXED MODEL
This chapter focuses on methods for obtaining maximum likelihood (or
approximate maximum likelihood) estimates for the model parameters in the multivariate
GLMM. We first mention some approaches proposed for the univariate GLMM
(Section 3.1), and then describe in detail extensions of three of those approaches to the
multivariate GLMM (Section 3.2). The proposed methods are then illustrated on a
simulated data example (Section 3.3), and on the two 'real-life' data sets introduced
in Chapter 2 (Section 3.4). Some issues such as standard error variability and
advantages of one multivariate versus several univariate analyses are addressed in Sections
3.3 and 3.4. The chapter concludes with a discussion of some additional model-fitting
methods (Section 3.5).
3.1 Introduction
In GLMM the marginal likelihood of the response is obtained by integrating out
the random effects
    Π_{i=1}^n ∫ f(y_i | b_i; β, φ) f(b_i) db_i,
where f(bi) denotes a normal density. Usually these integrals are not analytically
tractable and some kind of approximation must be used.
Direct approaches to maximum likelihood estimation are based on numerical or
stochastic approximations of the integrals and on numerical maximizations of those
approximations. Probably the most widely used numerical approximation procedure
for this type of integral is GaussHermite quadrature. It involves evaluating the
integrands at m prespecified quadrature points and substituting weighted sums in
place of the intractable integrals for the n subjects. If the number of quadrature
points m is large the approximation can be made very accurate but to keep the
numerical effort low m should be kept as small as possible. GaussHermite quadrature
is appropriate when the dimension of the random effects is small. An alternative for
highdimensional random effects is to approximate the integrals by Monte Carlo sums.
This involves generating m random values from the random effects distribution for
each subject, evaluating the conditional densities f(yilbi;/3, 0) at those values and
taking averages. Details on how to perform GaussHermite quadrature and Monte
Carlo approximation for the GLMM can be found in Fahrmeir and Tutz (1994, pp.
357-365). We discuss Gauss-Hermite quadrature for the multivariate GLMM in Section
3.2. Liu and Pierce (1994) consider adaptive Gaussian quadrature, which allows a
reduction in the required number of quadrature points by centering and scaling them
around the mode of the integrand function. This procedure is described in more detail
and applied to one of the data examples in Section 3.4.
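As a minimal illustration of the quadrature idea, the following sketch (a hypothetical function with illustrative data; a univariate random-intercept logit model rather than the multivariate case) approximates one subject's marginal likelihood with an m-point Gauss-Hermite rule.

```python
import numpy as np

expit = lambda t: 1.0 / (1.0 + np.exp(-t))

def marginal_loglik_gh(y, x, beta, sigma, m=20):
    """Gauss-Hermite approximation to one subject's marginal log-likelihood
    under a random-intercept logit model (illustrative helper)."""
    nodes, weights = np.polynomial.hermite.hermgauss(m)
    total = 0.0
    for z, w in zip(nodes, weights):
        b = np.sqrt(2.0) * sigma * z        # change of variable b = sqrt(2)*sigma*z
        p = expit(x * beta + b)
        # conditional likelihood of the subject's Bernoulli responses given b
        total += w * np.prod(p**y * (1 - p)**(1 - y))
    return np.log(total / np.sqrt(np.pi))   # 1/sqrt(pi) from the Gaussian density

y = np.array([1, 0, 1, 1])                  # illustrative responses
x = np.array([0.0, 0.5, 1.0, 1.5])          # illustrative covariate values
print(marginal_loglik_gh(y, x, beta=0.8, sigma=1.2))
```

Increasing m changes the answer only negligibly here, which illustrates the point made above: the rule can be made very accurate, with cost growing in the number of points.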
Indirect approaches to maximum likelihood estimation use the EMalgorithm
(Dempster, Laird and Rubin, 1977), treating the random effects as the missing data.
We apply these methods both to the multivariate GLMM and to the correlated
probit model and therefore we now introduce the basic ideas. Hereafter ψ denotes the
vector of all unknown parameters in the problem, and ψ̂ denotes some estimate of ψ.
The EM algorithm is an iterative technique for finding maximum likelihood estimates
when direct maximization of the observed likelihood f(y; ψ) is not feasible. It involves
augmenting the observed data by unobserved data so that maximization at each step
of the algorithm is considerably simplified. The unobserved data are denoted by b,
and in the GLMM context these are the random effects. The EM algorithm can be
summarized as follows:
1. Select a starting value ψ^(0). Set r = 0.
2. Increase r by 1.
   E-step: Calculate E{ln f(y, b; ψ) | y; ψ^(r−1)}.
3. M-step: Find the value ψ^(r) of ψ that maximizes this conditional expectation.
4. Iterate between (2) and (3) until convergence is achieved.
In the GLMM context the complete data is u = (y^T, b^T)^T and the complete-data
log-likelihood is given by

    ln L_c = Σ_{i=1}^n Σ_{j=1}^{n_i} ln f(y_ij | b_i; β, φ) + Σ_{i=1}^n ln f(b_i; Σ).
The rth E-step of the EM algorithm involves computing E(ln L_c | y; ψ^(r−1)) and the
rth M-step maximizes this quantity with respect to ψ and updates the parameter
estimates. Notice that because β and φ enter only the first term of L_c, the M-step
with respect to β and φ uses only f(y | b), and so it is similar to a standard generalized
linear model computation with the values of b treated as known. Maximizing with
respect to Σ is just maximum likelihood using the distribution of b after replacing
sufficient statistics with their conditional expected values. In general, the conditional
expectations in the E-step cannot be computed in closed form, but Gauss-Hermite
or different Monte Carlo approximations can be utilized (Fahrmeir and Tutz, 1994,
pp. 362-365; McCulloch, 1997; Booth and Hobert, 1999).
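The E/M structure can be seen in miniature in the one special case whose E-step is available in closed form: the normal (identity-link) random-intercept model. The sketch below is illustrative only; the simulated data, starting values, and iteration count are arbitrary, and the closed-form posterior moments stand in for the Monte Carlo or quadrature approximations needed in the general GLMM.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data from a normal random-intercept model:
# y_ij = b_i + e_ij, b_i ~ N(0, sb2), e_ij ~ N(0, se2).
n, ni = 200, 5
b_true = rng.normal(0.0, np.sqrt(2.0), n)
y = b_true[:, None] + rng.normal(0.0, 1.0, (n, ni))

sb2, se2 = 1.0, 1.0                     # step 1: starting values
for _ in range(200):                    # steps 2-4: iterate E and M
    # E-step: posterior variance/mean of each b_i given y_i (closed form here)
    v = 1.0 / (ni / se2 + 1.0 / sb2)
    m = v * y.sum(axis=1) / se2
    # M-step: maximize the expected complete-data log-likelihood
    sb2 = np.mean(m**2 + v)             # E[b_i^2 | y] averaged over subjects
    se2 = np.mean((y - m[:, None])**2) + v   # E[(y_ij - b_i)^2 | y] averaged

print(sb2, se2)   # close to the generating values 2.0 and 1.0
```

Treating b as the missing data is exactly what makes each M-step a simple variance calculation here.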
Many authors consider tractable analytical approximations to the likelihood
(Breslow and Clayton, 1993; Wolfinger and O'Connell, 1994). Although these methods lead
to inconsistent estimates in some cases, they may have a considerable advantage in
computational speed over 'exact' methods and can be fitted with standard software. Both
Breslow and Clayton's and Wolfinger and O'Connell's procedures amount to
iterative fitting of normal theory linear mixed models and can be implemented using the
%GLIMMIX macro in SAS. An extension of Wolfinger and O'Connell's approach
is considered in Section 3.2.
A Bayesian paradigm with flat priors can also be used to approximate the
maximum likelihood estimates using posterior modes (Fahrmeir and Tutz, 1994, pp.
233-238) or posterior means (Zeger and Karim, 1991). Though the numerator in such
computations is the same as for the maximum likelihood calculations, the posterior
may not exist for diffuse priors (Natarajan and McCulloch, 1995). This may not be
detected using computational techniques such as the Gibbs sampler, and can result
in incorrect parameter estimates (Hobert and Casella, 1996).
3.2 Maximum Likelihood Estimation
For simplicity of presentation we again consider the case of only two response
variables. The marginal likelihood in the bivariate GLMM is obtained as in the usual
GLMM by integrating out the random effects
    Π_{i=1}^n ∫ { Π_{j=1}^{n_i} f_1(y_i1j | b_i1; β_1, φ_1) f_2(y_i2j | b_i2; β_2, φ_2) } f(b_i1, b_i2; Σ) db_i1 db_i2,    (3.1)
where f denotes the multivariate normal density of the random effects. In this
section we describe how Gauss-Hermite quadrature, a Monte Carlo EM algorithm and
pseudo-likelihood can be used to obtain estimates in the multivariate GLMM. Both
Gaussian quadrature and the Monte Carlo EM algorithm are referred to as 'exact
maximum likelihood' methods because they target the exact maximum likelihood
estimates. In comparison, methods that use analytical approximations are referred to
as 'approximate' maximum likelihood. We now start by describing the 'exact'
maximum likelihood methods. Hereafter ψ = vec(β_1, β_2, φ_1, φ_2, δ), where δ contains the
unknown variance components in Σ.
3.2.1 GaussHermite Quadrature
Parameter Estimation
The marginal log-likelihood in the multivariate GLMM is expressed as a sum of
the individual log-likelihoods for all subjects, that is,

    ln L(y; ψ) = Σ_{i=1}^n ln L_i(ψ),

where

    L_i(ψ) = ∫ { Π_{j=1}^{n_i} f_1(y_i1j | b_i1; β_1, φ_1) f_2(y_i2j | b_i2; β_2, φ_2) } (2π)^{-q/2} |Σ|^{-1/2} exp(−b_i^T Σ^{-1} b_i / 2) db_i.
Hence Gauss-Hermite quadrature (Fahrmeir and Tutz, 1994) involves numerical
approximation of n q-dimensional integrals (q = q_1 + q_2). The b_i's are first transformed
so that each integral has the form

    L_i = π^{-q/2} ∫_{R^q} h(z_i) exp(−z_i^T z_i) dz_i.

The needed transformation is b_i = √2 L z_i, where Σ = LL^T. L is the lower triangular
Cholesky factor of Σ and it always exists because Σ is nonnegative definite. Here

    h(z_i) = Π_{j=1}^{n_i} f_1(y_i1j | z_i1; β_1, φ_1) f_2(y_i2j | z_i2; β_2, φ_2).
Each integral is then approximated by

L_i^{GQ} = \sum_{k_1=1}^{m} \cdots \sum_{k_q=1}^{m} v_k \, h(z^{(k)}),

where z^{(k)} = \sqrt{2} L d^{(k)} for the multiple index k = (k_1, \ldots, k_q), and d^{(k)} denotes the tabled nodes of univariate Gauss-Hermite integration of order m (Abramowitz and Stegun, 1972). The corresponding weights are given by v_k = \pi^{-q/2} \prod_{l=1}^{q} w_{k_l}, where the w_{k_l} are the tabled univariate weights, k_l = 1, \ldots, m.
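As a concrete sketch of this product-rule construction, the q-dimensional approximation can be assembled from the tabled univariate nodes and weights; the integrand h below is a generic placeholder, and the choices of m and \Sigma are illustrative only.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# Sketch: approximate E[h(b)] for b ~ N(0, Sigma) by q-dimensional
# Gauss-Hermite quadrature, via the transformation b = sqrt(2) L z
# with Sigma = L L'.  h is a placeholder integrand.
def gh_integral(h, Sigma, m=20):
    q = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)            # lower triangular Choleski factor
    d, w = hermgauss(m)                      # tabled univariate nodes and weights
    # build the q-dimensional grid of nodes and the product weights
    node_grids = np.meshgrid(*([d] * q), indexing="ij")
    nodes = np.stack([g.ravel() for g in node_grids], axis=1)        # (m^q, q)
    weight_grids = np.meshgrid(*([w] * q), indexing="ij")
    weights = np.prod(np.stack([g.ravel() for g in weight_grids], axis=1), axis=1)
    b = np.sqrt(2.0) * nodes @ L.T           # transformed evaluation points
    vals = np.array([h(bk) for bk in b])
    return np.pi ** (-q / 2.0) * np.sum(weights * vals)
```

For instance, with h(b) = b_1 b_2 the rule recovers the covariance element \Sigma_{12}, since the quadrature is exact for polynomial integrands of sufficiently low degree.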
The maximization algorithm then proceeds as follows:
1. Choose an initial estimate \psi^{(0)} for the parameter vector. Set r = 0.
2. Increase r by 1. Approximate each of the integrals L_i(\psi^{(r-1)}) by L_i^{GQ}(\psi^{(r-1)}) using m quadrature points in each direction.
3. Maximize the approximation with respect to \psi using a numerical maximization routine.
4. Iterate between steps (2) and (3) until the parameter estimates have converged.
A popular numerical maximization procedure for step 3 is the Newton-Raphson method. It involves iteratively solving the equations

\psi^{(r)} = \psi^{(r-1)} + J^{-1}(\psi^{(r-1)}) S(\psi^{(r-1)}),

where S(\psi) = \partial \ln L / \partial \psi is the score vector and J(\psi) = -\partial^2 \ln L / \partial \psi \partial \psi^T is the observed information matrix. One possible criterion for convergence is

\max_s \frac{|\psi_s^{(r+1)} - \psi_s^{(r)}|}{|\psi_s^{(r)}| + \delta_2} < \delta_1,

where \psi_s^{(r)} denotes the estimate of the s-th element of the parameter vector at step r of the algorithm and \delta_1 and \delta_2 are chosen to be small positive numbers. The role of \delta_2 is to prevent numerical problems stemming from estimates close to zero. Another frequently used criterion is

\| \psi^{(r+1)} - \psi^{(r)} \| < \delta,

where \| \cdot \| denotes the Euclidean norm.
The numerical maximization procedure MaxBFGS, which we use for the numerical examples later in this chapter, is based on two criteria for convergence:

|S_s(\psi^{(r)})| \le \epsilon |\psi_s^{(r)}| for all s when \psi_s^{(r)} \ne 0,
|S_s(\psi^{(r)})| \le \epsilon for all s when \psi_s^{(r)} = 0,

and

|\psi_s^{(r+1)} - \psi_s^{(r)}| \le 10\epsilon |\psi_s^{(r)}| for all s when \psi_s^{(r)} \ne 0,
|\psi_s^{(r+1)} - \psi_s^{(r)}| \le 10\epsilon for all s when \psi_s^{(r)} = 0,

where S_s denotes the s-th component of the score vector.
Standard Error Estimation
After the algorithm has converged, estimates of the standard errors of the parameter estimates can be based on the observed information matrix. By asymptotic maximum-likelihood theory

Var(\hat\psi) \approx I^{-1}(\hat\psi),

where I(\hat\psi) = E\big( -\partial^2 \ln L / \partial \psi \partial \psi^T \big) is the expected information matrix. But the observed information matrix J(\hat\psi) is easier to obtain, and hence we use the latter to approximate the standard errors:

\widehat{SE}(\hat\psi_s) = \sqrt{j^{ss}},

where j^{ss} is the s-th diagonal element of the inverse of the observed information matrix. Note that if the Newton-Raphson maximization method is used, the observed information matrix is a byproduct of the algorithm.
Numerical and Exact Derivatives
The observed information matrix and the score vector are not available in closed
form and must be approximated. One can either compute numerical derivatives
of the approximated loglikelihood, or approximate the intractable integrals in the
expressions for the exact derivatives. The first approach is simpler to implement but
may require a large number of quadrature points.
The method of finite differences can be used to find numerical derivatives. The s-th element of the score vector and the (s, t)-th element of the information matrix are approximated as follows:

S_s(\psi) \approx \sum_{i=1}^{n} \frac{l_i(\psi + \epsilon_1 z_s) - l_i(\psi - \epsilon_1 z_s)}{2\epsilon_1},

J_{st}(\psi) \approx -\sum_{i=1}^{n} \frac{l_i(\psi + \epsilon_1 z_s + \epsilon_2 z_t) - l_i(\psi + \epsilon_1 z_s - \epsilon_2 z_t) - l_i(\psi - \epsilon_1 z_s + \epsilon_2 z_t) + l_i(\psi - \epsilon_1 z_s - \epsilon_2 z_t)}{4\epsilon_1 \epsilon_2},

where l_i = \ln L_i^{GQ}, \epsilon_1 and \epsilon_2 are suitably chosen step lengths, and z_s and z_t are unit vectors. The unit vectors have ones in positions s and t respectively and the rest of their elements are zeros.
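These central-difference formulas can be sketched as follows; `loglik` stands in for the quadrature-approximated log-likelihood, and the step lengths are illustrative.

```python
import numpy as np

# Sketch of central finite-difference approximations to the score vector
# and the observed information matrix of a generic log-likelihood.
def num_score(loglik, psi, eps=1e-5):
    p = len(psi)
    s = np.zeros(p)
    for j in range(p):
        e = np.zeros(p); e[j] = eps
        s[j] = (loglik(psi + e) - loglik(psi - e)) / (2.0 * eps)
    return s

def num_obs_info(loglik, psi, eps=1e-4):
    p = len(psi)
    J = np.zeros((p, p))
    for s in range(p):
        for t in range(p):
            es = np.zeros(p); es[s] = eps
            et = np.zeros(p); et[t] = eps
            # four-point rule matching the display above
            J[s, t] = -(loglik(psi + es + et) - loglik(psi + es - et)
                        - loglik(psi - es + et) + loglik(psi - es - et)) / (4.0 * eps ** 2)
    return J
```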
An alternative to numerical derivatives is to approximate the exact derivatives.
The latter are obtained by interchanging the differentiation and integration signs and
are as follows:
\frac{\partial \ln L}{\partial \psi} = \sum_{i=1}^{n} \frac{\int \frac{\partial \ln f_i}{\partial \psi} f_i \, db_i}{\int f_i \, db_i}

and

\frac{\partial^2 \ln L}{\partial \psi \partial \psi^T} = \sum_{i=1}^{n} \frac{\int \big( \frac{\partial^2 \ln f_i}{\partial \psi \partial \psi^T} + \frac{\partial \ln f_i}{\partial \psi} \frac{\partial \ln f_i}{\partial \psi^T} \big) f_i \, db_i \int f_i \, db_i - \int \frac{\partial \ln f_i}{\partial \psi} f_i \, db_i \int \frac{\partial \ln f_i}{\partial \psi^T} f_i \, db_i}{\big( \int f_i \, db_i \big)^2},

where

f_i = \Big\{ \prod_{j=1}^{n_i} f_1(y_{i1j} \mid b_i; \beta_1, \phi_1) f_2(y_{i2j} \mid b_i; \beta_2, \phi_2) \Big\} f(b_i; \Sigma).
For each subject there are three intractable integrals which need to be approximated:

\int f_i \, db_i, \qquad \int \frac{\partial \ln f_i}{\partial \psi} f_i \, db_i, \qquad \int \Big( \frac{\partial^2 \ln f_i}{\partial \psi \partial \psi^T} + \frac{\partial \ln f_i}{\partial \psi} \frac{\partial \ln f_i}{\partial \psi^T} \Big) f_i \, db_i,

and Gauss-Hermite quadrature is used for each one of them. Note that

\ln f_i = \sum_{j=1}^{n_i} \ln f_1(y_{i1j} \mid b_i; \beta_1, \phi_1) + \sum_{j=1}^{n_i} \ln f_2(y_{i2j} \mid b_i; \beta_2, \phi_2) + \ln f(b_i; \Sigma).

Hence \partial \ln f_i / \partial \beta_1 = \sum_{j=1}^{n_i} \partial \ln f_1 / \partial \beta_1, \partial \ln f_i / \partial \beta_2 = \sum_{j=1}^{n_i} \partial \ln f_2 / \partial \beta_2, and \partial \ln f_i / \partial \delta = \partial \ln f(b_i; \Sigma) / \partial \delta. Similarly the derivatives with respect to the scale parameters \phi_1 and \phi_2, if needed, are obtained by differentiating the first and second term in the summation for \ln f_i. This leads to simple expressions for the integrands because the functions that are differentiated are members of the exponential family (Fahrmeir and Tutz, 1994).
Note that the random effects distribution is always multivariate normal, which complicates the expressions for the second order derivatives with respect to the variance components. Jennrich and Schluchter (1986) propose an elegant approach to deal with this problem and we now illustrate their method for the multivariate GLMM. For simplicity of notation consider bivariate random effects b_i. Let

\delta = \begin{pmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \end{pmatrix} = \begin{pmatrix} \sigma_1 \\ \sigma_2 \\ \rho \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.

Then write

\Sigma = \sigma_1^2 \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} + \sigma_2^2 \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} + \rho\sigma_1\sigma_2 \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.

Let \Sigma_s = \partial \Sigma / \partial \delta_s, s = 1, \ldots, 3. Then

\frac{\partial \ln f_i}{\partial \delta_s} = \frac{1}{2} \mathrm{tr}\big\{ \Sigma^{-1} \Sigma_s \Sigma^{-1} (b_i b_i^T - \Sigma) \big\}

and

\frac{\partial^2 \ln f_i}{\partial \delta_s \partial \delta_t} = \frac{1}{2} \mathrm{tr}\big\{ \Sigma^{-1} \Sigma_t \Sigma^{-1} \Sigma_s \big\} - \frac{1}{2} \mathrm{tr}\big\{ (\Sigma^{-1} \Sigma_t \Sigma^{-1} \Sigma_s + \Sigma^{-1} \Sigma_s \Sigma^{-1} \Sigma_t) \Sigma^{-1} b_i b_i^T \big\} + \frac{1}{2} \mathrm{tr}\big\{ \Sigma_{st}' \Sigma^{-1} (b_i b_i^T - \Sigma) \big\},

where \Sigma_{st}' = \Sigma^{-1} (\partial^2 \Sigma / \partial \delta_s \partial \delta_t); when the parameters enter \Sigma linearly, as with the coefficients \sigma_1^2, \sigma_2^2 and \rho\sigma_1\sigma_2 in the decomposition above, the last term vanishes.
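The trace expression for the score can be checked numerically. The sketch below uses the linear coefficients of the decomposition above as the working parameters so that each \Sigma_s is a constant matrix; the particular values of the parameters and of b are illustrative.

```python
import numpy as np

# Numerical check of the score formula d lnf / d theta_s =
# 0.5 tr{Sigma^{-1} Sigma_s Sigma^{-1} (b b' - Sigma)} for a bivariate
# normal log-density, with linear covariance parameters theta so that
# Sigma_s is the constant matrix G[s].
G = [np.array([[1.0, 0.0], [0.0, 0.0]]),
     np.array([[0.0, 0.0], [0.0, 1.0]]),
     np.array([[0.0, 1.0], [1.0, 0.0]])]

def lognorm(b, theta):
    Sigma = sum(t * Gs for t, Gs in zip(theta, G))
    Sinv = np.linalg.inv(Sigma)
    return -np.log(2.0 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma)) \
           - 0.5 * b @ Sinv @ b

def score_trace(b, theta):
    Sigma = sum(t * Gs for t, Gs in zip(theta, G))
    Sinv = np.linalg.inv(Sigma)
    M = np.outer(b, b) - Sigma
    return np.array([0.5 * np.trace(Sinv @ Gs @ Sinv @ M) for Gs in G])
```

A central finite difference of `lognorm` with respect to each element of theta agrees with `score_trace` to numerical precision.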
In the above parametrization of the variance-covariance matrix there are restrictions on the parameter space (\sigma_1 > 0, \sigma_2 > 0, -1 < \rho < 1), but the algorithm does not guarantee that the current estimates will stay in the restricted parameter space. It is then preferable to maximize with respect to the elements of the Choleski root of \Sigma. The Choleski root is a lower triangular matrix L = \begin{pmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{pmatrix} such that \Sigma = LL^T. Its elements are held unrestricted in intermediate steps of the algorithm, and hence the final estimate of \Sigma may fail to be positive definite. Such an outcome indicates either a numerical problem in the procedure or an inappropriate model. However, if only intermediate estimates of \Sigma fail to be positive definite, then the maximization algorithm with respect to the elements of the Choleski root will work well while the other one can have problems.
Although the parametrization in terms of l_{11}, l_{21} and l_{22} has fewer numerical problems than the parametrization in terms of \sigma_1, \sigma_2 and \rho, it does not have a nice interpretation. It then seems reasonable to use the Choleski roots in the maximization procedure but to perform one extra partial step of the algorithm to approximate the information matrix for the meaningful parametrization \sigma_1, \sigma_2 and \rho.
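The mapping between the two parametrizations can be sketched as follows: the optimizer works with the unrestricted Choleski elements, and the meaningful parameters are recovered only for reporting. The numeric values in the usage check are illustrative.

```python
import numpy as np

# Sketch of the unconstrained Choleski parametrization of a 2x2
# variance-covariance matrix and the map back to (sigma1, sigma2, rho).
def sigma_from_chol(l):
    L = np.array([[l[0], 0.0], [l[1], l[2]]])
    return L @ L.T

def meaningful_params(l):
    S = sigma_from_chol(l)
    s1, s2 = np.sqrt(S[0, 0]), np.sqrt(S[1, 1])
    return s1, s2, S[0, 1] / (s1 * s2)
```

For example, l = (1, 0.5, 0.5) corresponds to \Sigma with \sigma_1^2 = 1, \sigma_2^2 = 0.5 and \sigma_{12} = 0.5, no matter what signs the intermediate Choleski elements take.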
3.2.2 Monte Carlo EM Algorithm
Parameter Estimation
The complete data for the multivariate GLMM is u = (y^T, b^T)^T and hence the complete data log-likelihood is given by

\ln L_u = \sum_{i=1}^{n} \sum_{j_1=1}^{n_i} \ln f_1(y_{i1j_1} \mid b_{i1}; \beta_1, \phi_1) + \sum_{i=1}^{n} \sum_{j_2=1}^{n_i} \ln f_2(y_{i2j_2} \mid b_{i2}; \beta_2, \phi_2) + \sum_{i=1}^{n} \ln f(b_i; \Sigma).

Notice that (\beta_1, \phi_1), (\beta_2, \phi_2) and \Sigma appear in different terms in the log-likelihood, and therefore each M-step of the algorithm consists of three separate maximizations: of E[\sum_{i=1}^{n} \sum_{j_1=1}^{n_i} \ln f_1(y_{i1j_1} \mid b_{i1}; \beta_1, \phi_1) \mid y], E[\sum_{i=1}^{n} \sum_{j_2=1}^{n_i} \ln f_2(y_{i2j_2} \mid b_{i2}; \beta_2, \phi_2) \mid y], and E[\sum_{i=1}^{n} \ln f(b_i; \Sigma) \mid y] respectively, evaluated at the current parameter estimate of \psi.
Recall that for the regular GLMM there are two separate terms to be maximized.
These conditional expectations do not have closed form expressions but can be approximated by Monte Carlo estimates:

\frac{1}{m} \sum_{k=1}^{m} \sum_{i=1}^{n} \sum_{j_1=1}^{n_i} \ln f_1(y_{i1j_1} \mid b_{i1}^{(k)}; \beta_1, \phi_1),    (3.2)

\frac{1}{m} \sum_{k=1}^{m} \sum_{i=1}^{n} \sum_{j_2=1}^{n_i} \ln f_2(y_{i2j_2} \mid b_{i2}^{(k)}; \beta_2, \phi_2),    (3.3)

\frac{1}{m} \sum_{k=1}^{m} \sum_{i=1}^{n} \ln f(b_i^{(k)}; \Sigma),    (3.4)

where (b_{i1}^{(k)}, b_{i2}^{(k)}), k = 1, \ldots, m, are simulated values from the distribution of b_i \mid y_i, evaluated at the current parameter estimate of \psi. In order to obtain simulated values for b_i, multivariate rejection sampling or importance sampling can be used, as proposed by Booth and Hobert (1999). At the r-th step of the Monte Carlo EM algorithm the multivariate rejection sampling algorithm is as follows:
For each subject i, i = 1, \ldots, n:
1. Compute \tau_i = \sup_{b_i \in R^q} f(y_i \mid b_i; \psi^{(r-1)}).
2. Sample b_i^{(k)} from the multivariate normal density f(b_i; \Sigma^{(r-1)}) and independently sample w^{(k)} from Uniform(0, 1).
3. If w^{(k)} \le f(y_i \mid b_i^{(k)}; \psi^{(r-1)}) / \tau_i, accept b_i^{(k)}; if not, go to 2.
Iterate between (2) and (3) until m generated values of b_i are accepted.
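The three steps above can be sketched for a univariate random-intercept logistic model; the data, parameter values and the crude grid search standing in for the supremum in step 1 are all illustrative.

```python
import numpy as np

# Sketch of the rejection sampler for b_i | y_i in a random-intercept
# logistic model with a single random effect per subject.
def loglik_y_given_b(y, b, beta):
    p = 1.0 / (1.0 + np.exp(-(beta + b)))
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def sample_bi(y, beta, sigma, m, rng):
    # step 1: tau_i = sup_b f(y_i | b); a grid search stands in for a
    # proper optimizer in this illustration
    grid = np.linspace(-10.0, 10.0, 2001)
    log_tau = max(loglik_y_given_b(y, b, beta) for b in grid)
    draws = []
    while len(draws) < m:
        b = rng.normal(0.0, sigma)        # step 2: proposal from f(b_i; Sigma)
        w = rng.uniform()
        # step 3: accept when w <= f(y_i | b) / tau_i (on the log scale)
        if np.log(w) <= loglik_y_given_b(y, b, beta) - log_tau:
            draws.append(b)
    return np.array(draws)
```

Because the accepted draws are an independent sample from b_i \mid y_i, the Monte Carlo error of the E-step averages is straightforward to assess, which is what the automatic sample-size rule below exploits.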
Once m simulated values for the random effects are available, the Monte Carlo
estimates of the conditional expectations can be maximized using some numerical
maximization procedure such as the NewtonRaphson algorithm. Convergence can be
claimed when successive values of the parameters are within some required precision. The same convergence criteria as those mentioned for Gauss-Hermite quadrature can be applied, but with a generally larger value of \delta_1. Booth and Hobert (1999) suggest using \delta_1 between 0.002 and 0.005. This is needed to avoid excessively large simulation sample sizes.
A nice feature of the Booth and Hobert algorithm is that because independent
random sampling is used, Monte Carlo error can be assessed at each iteration and
the simulation sample size can be automatically increased. The idea is that it is
inefficient to start with a large simulation sample but as the estimates get closer to
the maximum likelihood estimates the simulation sample size must be increased to
provide enough precision for convergence.
Booth and Hobert propose to increase the simulation sample size m by m/t, where t is some positive number, if \psi^{(r-1)} lies in a 95% confidence ellipsoid constructed around \psi^{(r)}, that is, if

(\psi^{(r)} - \psi^{(r-1)})^T \big[ \widehat{Var}(\psi^{(r)} \mid \psi^{(r-1)}) \big]^{-1} (\psi^{(r)} - \psi^{(r-1)}) \le c^2,

where c^2 denotes the 95th percentile of the chi-square distribution with number of degrees of freedom equal to the dimension of the parameter vector. To describe what variance approximation is used above, some notation must be introduced. Denote the maximizer of the true conditional expectation Q = E(\ln L_u(b) \mid y, \psi^{(r-1)}) by \psi^{*(r)}. Then

Var(\psi^{(r)} \mid \psi^{(r-1)}) \approx Q_m^{(2)}(\psi^{*(r)}; \psi^{(r-1)})^{-1} \, Var\big( Q_m^{(1)}(\psi^{*(r)}; \psi^{(r-1)}) \big) \, Q_m^{(2)}(\psi^{*(r)}; \psi^{(r-1)})^{-1}.

Here Q_m(\psi; \psi^{(r-1)}) = \frac{1}{m} \sum_{k=1}^{m} \ln L_u(b^{(k)}, y; \psi), and Q_m^{(1)} and Q_m^{(2)} are the vector and matrix of first and second derivatives of Q_m respectively. A sandwich estimate of the variance is obtained by substituting \psi^{(r)} in place of \psi^{*(r)} and using the estimate

\widehat{Var}\big( Q_m^{(1)}(\psi^{(r)}; \psi^{(r-1)}) \big) = \frac{1}{m^2} \sum_{k=1}^{m} \frac{\partial \ln f(y, b^{(k)}; \psi^{(r)})}{\partial \psi} \frac{\partial \ln f(y, b^{(k)}; \psi^{(r)})}{\partial \psi^T}.
In summary, the algorithm works as follows:
1. Select an initial estimate \psi^{(0)}. Set r = 0.
2. Increase r by 1. E-step: for each subject i, i = 1, \ldots, n, generate m random samples from the distribution of b_i \mid y_i; \psi^{(r-1)} using rejection sampling.
3. M-step: update the parameter estimate of \psi by maximizing (3.2), (3.3) and (3.4).
4. Iterate between (2) and (3) until convergence is achieved.
Standard Error Estimation
After the algorithm has converged, standard errors can be estimated using Louis' method of approximation of the observed information matrix (Tanner, 1993). Denote the observed data log-likelihood by l; then the observed data information matrix is

-\frac{\partial^2 l(y, \psi)}{\partial \psi \partial \psi^T} = E\Big[ -\frac{\partial^2 \ln L_u(b, y, \psi)}{\partial \psi \partial \psi^T} \Big| y \Big] - Var\Big[ \frac{\partial \ln L_u(b, y, \psi)}{\partial \psi} \Big| y \Big]
= E\Big[ -\frac{\partial^2 \ln L_u(b, y, \psi)}{\partial \psi \partial \psi^T} \Big| y \Big] - E\Big[ \frac{\partial \ln L_u(b, y, \psi)}{\partial \psi} \frac{\partial \ln L_u(b, y, \psi)}{\partial \psi^T} \Big| y \Big] + E\Big[ \frac{\partial \ln L_u(b, y, \psi)}{\partial \psi} \Big| y \Big] E\Big[ \frac{\partial \ln L_u(b, y, \psi)}{\partial \psi^T} \Big| y \Big].

By simulating values b_i^{(k)}, k = 1, \ldots, m, i = 1, \ldots, n, from the conditional distributions of b_i \mid y_i at the final \hat\psi, the conditional expectations above can be approximated by the Monte Carlo sums

\frac{1}{m} \sum_{k=1}^{m} \Big[ -\frac{\partial^2 \ln L_u(b^{(k)}, y, \hat\psi)}{\partial \psi \partial \psi^T} - \frac{\partial \ln L_u(b^{(k)}, y, \hat\psi)}{\partial \psi} \frac{\partial \ln L_u(b^{(k)}, y, \hat\psi)}{\partial \psi^T} \Big]

and

\frac{1}{m} \sum_{k=1}^{m} \frac{\partial \ln L_u(b^{(k)}, y, \hat\psi)}{\partial \psi}.

In the above expressions the argument \hat\psi means that the derivatives are evaluated at the final parameter estimate. Exactly as for Gauss-Hermite quadrature,

\widehat{Var}(\hat\psi) = J^{-1}(\hat\psi).
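The assembly of the Louis approximation from the simulated draws can be sketched as follows; the score and negative-Hessian arrays are placeholders for the complete-data derivatives evaluated at the final estimate.

```python
import numpy as np

# Sketch of Louis' approximation to the observed information: `scores`
# holds the m complete-data score vectors (m x p) and `neg_hessians`
# the m negative second-derivative matrices (m x p x p).
def louis_information(scores, neg_hessians):
    term1 = neg_hessians.mean(axis=0)                               # E[-d2 lnL_u | y]
    term2 = (scores[:, :, None] * scores[:, None, :]).mean(axis=0)  # E[s s' | y]
    sbar = scores.mean(axis=0)                                      # E[s | y]
    # observed information ~ E[-H | y] - E[s s' | y] + E[s | y] E[s | y]'
    return term1 - term2 + np.outer(sbar, sbar)
```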
As the simulation sample size increases, the Monte Carlo EM algorithm behaves as a deterministic EM algorithm and usually converges to the exact maximum likelihood estimates, but it may also converge to a local instead of a global maximum (Wu, 1983). A drawback of the Monte Carlo EM algorithm is that it may be very computationally intensive. Hence important alternatives to this 'exact' method for the GLMM are methods based on analytical approximations to the marginal likelihood, such as the procedures suggested by Breslow and Clayton (1993) and by Wolfinger and O'Connell (1993). Wolfinger and O'Connell propose finding approximate maximum likelihood estimates by approximating the non-normal responses using Taylor series expansions and then solving the linear mixed model equations for the associated normal model. The extension of their method is presented in the next section.
3.2.3 Pseudolikelihood Approach
Parameter Estimation
Let us first assume for simplicity that the shape parameters \phi_1 and \phi_2 for both response distributions are equal to 1. By the model definition

\mu_{i1} = g_1^{-1}(X_{i1}\beta_1 + Z_{i1}b_{i1}),
\mu_{i2} = g_2^{-1}(X_{i2}\beta_2 + Z_{i2}b_{i2}),

where g_1^{-1} and g_2^{-1} above are evaluated componentwise at the elements of X_{i1}\beta_1 + Z_{i1}b_{i1} and X_{i2}\beta_2 + Z_{i2}b_{i2} respectively. Let \hat\beta_1, \hat\beta_2, \hat{b}_{i1} and \hat{b}_{i2}, i = 1, \ldots, n, be some estimates of the fixed and the random effects. The corresponding mean estimates are denoted by \hat\mu_{i1} and \hat\mu_{i2}. In a neighborhood around the fixed and random effects estimates the errors \epsilon_{i1} = y_{i1} - \mu_{i1} and \epsilon_{i2} = y_{i2} - \mu_{i2} are then approximated by first order Taylor series expansions. Denote the two approximations by

\tilde\epsilon_{i1} = y_{i1} - \hat\mu_{i1} - (g_1^{-1})'(X_{i1}\hat\beta_1 + Z_{i1}\hat{b}_{i1})(X_{i1}\beta_1 + Z_{i1}b_{i1} - X_{i1}\hat\beta_1 - Z_{i1}\hat{b}_{i1})

and

\tilde\epsilon_{i2} = y_{i2} - \hat\mu_{i2} - (g_2^{-1})'(X_{i2}\hat\beta_2 + Z_{i2}\hat{b}_{i2})(X_{i2}\beta_2 + Z_{i2}b_{i2} - X_{i2}\hat\beta_2 - Z_{i2}\hat{b}_{i2}),

where (g_1^{-1})'(X_{i1}\hat\beta_1 + Z_{i1}\hat{b}_{i1}) and (g_2^{-1})'(X_{i2}\hat\beta_2 + Z_{i2}\hat{b}_{i2}) are diagonal matrices with elements consisting of evaluations of the first derivatives of g_1^{-1} and g_2^{-1}. Note that because the estimates of the random effects are not consistent as the number of subjects increases, this expansion may not work well when the number of observations per subject is small.
Now as in Wolfinger and O'Connell we approximate the conditional distributions of \epsilon_{i1} \mid b_i and \epsilon_{i2} \mid b_i by normal distributions with zero means and diagonal variance matrices V_1(\mu_{i1}) = \mathrm{diag}(V_1(\mu_{i11}), \ldots, V_1(\mu_{i1n_i})) and V_2(\mu_{i2}) = \mathrm{diag}(V_2(\mu_{i21}), \ldots, V_2(\mu_{i2n_i})) respectively, where V_1(\cdot) and V_2(\cdot) are the variance functions of the original responses. Note that it is reasonable to assume that the two normal distributions are uncorrelated because the responses are assumed to be conditionally independent given the random effects. The normal approximation, on the other hand, may not be appropriate for some distributions, such as the Bernoulli distribution.
Substituting \hat\mu_{i1} and \hat\mu_{i2} in the variance expressions, and using the fact that (g_1^{-1})'(X_{i1}\hat\beta_1 + Z_{i1}\hat{b}_{i1}) = (g_1'(\hat\mu_{i1}))^{-1} and (g_2^{-1})'(X_{i2}\hat\beta_2 + Z_{i2}\hat{b}_{i2}) = (g_2'(\hat\mu_{i2}))^{-1}, the conditional distributions of g_1'(\hat\mu_{i1})(y_{i1} - \hat\mu_{i1}) and g_2'(\hat\mu_{i2})(y_{i2} - \hat\mu_{i2}) are also multivariate normal with appropriately transformed variance-covariance matrices. Defining v_{i1} = g_1(\hat\mu_{i1}) + g_1'(\hat\mu_{i1})(y_{i1} - \hat\mu_{i1}) and v_{i2} = g_2(\hat\mu_{i2}) + g_2'(\hat\mu_{i2})(y_{i2} - \hat\mu_{i2}), it follows that, approximately,

v_{i1} \mid b_i \sim N\big( X_{i1}\beta_1 + Z_{i1}b_{i1}, \; g_1'(\hat\mu_{i1}) V_1(\hat\mu_{i1}) g_1'(\hat\mu_{i1}) \big),
v_{i2} \mid b_i \sim N\big( X_{i2}\beta_2 + Z_{i2}b_{i2}, \; g_2'(\hat\mu_{i2}) V_2(\hat\mu_{i2}) g_2'(\hat\mu_{i2}) \big).

Now, recalling that b_i has a multivariate normal distribution with mean 0 and variance-covariance matrix \Sigma, this approximation takes the form of a linear mixed model with response \mathrm{Vec}(v_{i1}, v_{i2}), random effect \mathrm{Vec}(b_{i1}, b_{i2}), and uncorrelated but heteroscedastic errors. Such a model is called a weighted linear mixed model with a block-diagonal weight matrix W = \mathrm{diag}(W_i), where W_i = \mathrm{diag}(W_{i1}, W_{i2}), W_{i1}^{-1} = g_1'(\hat\mu_{i1}) V_1(\hat\mu_{i1}) g_1'(\hat\mu_{i1}) and W_{i2}^{-1} = g_2'(\hat\mu_{i2}) V_2(\hat\mu_{i2}) g_2'(\hat\mu_{i2}). For canonical link functions V_1(\hat\mu_{i1}) = [g_1'(\hat\mu_{i1})]^{-1} and V_2(\hat\mu_{i2}) = [g_2'(\hat\mu_{i2})]^{-1}, and then W_i^{-1} = \mathrm{diag}\big( g_1'(\hat\mu_{i1}), g_2'(\hat\mu_{i2}) \big).
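For one subject with a normal response under the identity link and a Bernoulli response under the canonical logit link, the working responses and weights can be sketched as follows; mu1 and mu2 are placeholder mean estimates.

```python
import numpy as np

# Sketch of the working responses and (inverse) weights for one subject:
# identity link for the normal response, canonical logit link for the
# Bernoulli response.
def working_quantities(y1, mu1, y2, mu2):
    # identity link: g1(mu) = mu, g1'(mu) = 1, V1(mu) = 1
    v1 = mu1 + (y1 - mu1)
    w1_inv = np.ones_like(mu1)
    # logit link: g2'(mu) = 1 / (mu (1 - mu)), V2(mu) = mu (1 - mu)
    g2p = 1.0 / (mu2 * (1.0 - mu2))
    v2 = np.log(mu2 / (1.0 - mu2)) + (y2 - mu2) * g2p
    w2_inv = g2p * (mu2 * (1.0 - mu2)) * g2p   # reduces to g2'(mu) here
    return v1, w1_inv, v2, w2_inv
```

Since the logit link is canonical, w2_inv collapses to g_2'(\hat\mu_{i2}), the simplification noted in the text.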
Estimates of \beta = \mathrm{Vec}(\beta_1, \beta_2) and b_i can be obtained by solving the mixed-model equations below (Harville, 1977):

\begin{pmatrix} X^T W X & X^T W Z \\ Z^T W X & Z^T W Z + G^{-1} \end{pmatrix} \begin{pmatrix} \beta \\ b \end{pmatrix} = \begin{pmatrix} X^T W v \\ Z^T W v \end{pmatrix},    (3.5)

where G = \mathrm{diag}(\Sigma, \ldots, \Sigma). Here

X = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}, \quad Z = \mathrm{diag}(Z_1, \ldots, Z_n), \quad v = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}, \quad b = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix},

and

X_i = \begin{pmatrix} X_{i1} & 0 \\ 0 & X_{i2} \end{pmatrix}, \quad Z_i = \begin{pmatrix} Z_{i1} & 0 \\ 0 & Z_{i2} \end{pmatrix}, \quad v_i = \begin{pmatrix} v_{i1} \\ v_{i2} \end{pmatrix}, \quad b_i = \begin{pmatrix} b_{i1} \\ b_{i2} \end{pmatrix}.
Variance component estimates can be obtained by numerically maximizing the weighted linear mixed model log-likelihood

-\frac{1}{2} \sum_{i=1}^{n} \ln |V_i| - \frac{1}{2} \sum_{i=1}^{n} r_i^T V_i^{-1} r_i,    (3.6)

where

V_i = W_i^{-1} + Z_i \Sigma Z_i^T

and

r_i = v_i - X_i \Big( \sum_{j=1}^{n} X_j^T V_j^{-1} X_j \Big)^{-1} \sum_{j=1}^{n} X_j^T V_j^{-1} v_j.
The algorithm then is as follows:
1. Obtain initial estimates \hat\mu_{i1}^{(0)} and \hat\mu_{i2}^{(0)} using the original data, i = 1, \ldots, n.
2. Compute the modified responses

v_{i1} = g_1(\hat\mu_{i1}) + (y_{i1} - \hat\mu_{i1}) g_1'(\hat\mu_{i1})

and

v_{i2} = g_2(\hat\mu_{i2}) + (y_{i2} - \hat\mu_{i2}) g_2'(\hat\mu_{i2}).

3. Maximize (3.6) with respect to the variance components. Stop if the difference between the new and the old variance component estimates is sufficiently small. Otherwise go to the next step.
4. Compute the mixed-model estimates for \beta and b_i by solving the mixed model equations (3.5):

\hat\beta = \Big( \sum_{i=1}^{n} X_i^T V_i^{-1} X_i \Big)^{-1} \Big( \sum_{i=1}^{n} X_i^T V_i^{-1} v_i \Big), \qquad \hat{b}_i = \Sigma Z_i^T V_i^{-1} (v_i - X_i \hat\beta).

5. Compute the new estimates of \hat\mu_i = (\hat\mu_{i1}, \hat\mu_{i2}):

\hat\mu_{i1} = g_1^{-1}(X_{i1}\hat\beta_1 + Z_{i1}\hat{b}_{i1})

and

\hat\mu_{i2} = g_2^{-1}(X_{i2}\hat\beta_2 + Z_{i2}\hat{b}_{i2}).

Go to step 2.
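The linear-algebra core of step 4 can be sketched as one solve of the stacked system (3.5); all the matrices below are illustrative placeholders, with Ginv standing for the inverse of diag(\Sigma, \ldots, \Sigma).

```python
import numpy as np

# Sketch of one pass through the mixed-model equations (3.5): X and Z are
# the stacked design matrices, W the weight matrix, v the working response.
def solve_mme(X, Z, W, Ginv, v):
    XtW, ZtW = X.T @ W, Z.T @ W
    lhs = np.block([[XtW @ X, XtW @ Z],
                    [ZtW @ X, ZtW @ Z + Ginv]])
    rhs = np.concatenate([XtW @ v, ZtW @ v])
    sol = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return sol[:p], sol[p:]              # (fixed effects, random effects)
```

As a sanity check on the shrinkage behavior, a very large Ginv (tiny random-effect variance) drives the random effects toward zero and the fixed-effect solution toward ordinary weighted least squares.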
In comparison to the original Wolfinger and O'Connell algorithm, the above algorithm for multiple responses is computationally more intensive at step 3 because of the more complicated structure of the variance-covariance matrices V_i. As the number of response variables increases, the computational advantages of this approximate procedure over the "exact" maximum likelihood procedures may become less pronounced.
The Newton-Raphson method is a logical choice for the numerical maximization in step 3 because of its relatively fast convergence and because, as a side product, it provides the Hessian matrix for the approximation of the standard errors of the variance components.
In the general case when \phi_1 and \phi_2 are arbitrary, the variance functions V_1(\mu_{i1}) and V_2(\mu_{i2}) are replaced by \phi_1 V_1(\mu_{i1}) and \phi_2 V_2(\mu_{i2}), and hence the weight matrices are accordingly modified: W_{i1}^{-1} = \phi_1 g_1'(\hat\mu_{i1}) V_1(\hat\mu_{i1}) g_1'(\hat\mu_{i1}) and W_{i2}^{-1} = \phi_2 g_2'(\hat\mu_{i2}) V_2(\hat\mu_{i2}) g_2'(\hat\mu_{i2}). The estimates of \phi_1 and \phi_2 can be obtained in step 3 together with the other variance component estimates. This approach is different from the approach taken by Wolfinger and O'Connell, who estimated the dispersion parameter for the response together with the fixed and random effects.
Standard Error Estimation
The standard errors for the regression parameter and for the random effects estimates are approximated from the linear mixed model:

\widehat{Var} \begin{pmatrix} \hat\beta \\ \hat{b} \end{pmatrix} \approx \begin{pmatrix} X^T W X & X^T W Z \\ Z^T W X & Z^T W Z + G^{-1} \end{pmatrix}^{-1}.

As \Sigma and W are unknown, estimates of the variance components are used.
The standard errors for the variance components can be approximated by using exact or numerical derivatives of the likelihood function in step 3 of the algorithm. For the same reasons as discussed in Section 3.2.1 it is better to maximize with respect to the Choleski factors of \Sigma. The computation of the standard errors, however, should be carried out with respect to the original parameters \sigma_1, \sigma_2 and \rho (or \sigma_1^2, \sigma_2^2 and \sigma_{12}). This can be done only if the final variance-covariance estimate is positive definite. The approach discussed in Section 3.2.1 for representation of the variance-covariance matrix can also be adopted here to simplify computations.
3.3 Simulated Data Example
The methods discussed in the previous sections are illustrated on one simulated
data set in which the true values on the parameters are known. The data consists of
J = 10 repeated measures of a normal and of a Bernoulli response on each of I = 30
subjects. No covariates are considered and a random intercept is assumed for each
variable. In the notation introduced in Chapter 2,

y_{i1j} \mid b_{i1} \sim \text{indep. } N(\mu_{i1j}, \sigma^2),
y_{i2j} \mid b_{i2} \sim \text{indep. Bernoulli}(\mu_{i2j}),
\mu_{i1j} = \beta_1 + b_{i1},
\mathrm{logit}(\mu_{i2j}) = \beta_2 + b_{i2},
b_i = (b_{i1}, b_{i2})^T \sim MVN(0, \Sigma), \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}.
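One data set from the model above can be generated as in the following sketch; the random seed is arbitrary.

```python
import numpy as np

# Sketch: simulate J = 10 bivariate (normal, Bernoulli) measurements on
# each of I = 30 subjects with correlated random intercepts, using the
# true parameter values of the example.
rng = np.random.default_rng(1999)
I, J = 30, 10
beta1, beta2, sigma2 = 4.0, 1.0, 1.0
Sigma = np.array([[1.0, 0.5], [0.5, 0.5]])

b = rng.multivariate_normal(np.zeros(2), Sigma, size=I)           # (I, 2)
y1 = beta1 + b[:, [0]] + rng.normal(0.0, np.sqrt(sigma2), (I, J))
p = 1.0 / (1.0 + np.exp(-(beta2 + b[:, [1]])))                    # (I, 1)
y2 = rng.binomial(1, np.broadcast_to(p, (I, J)))
```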
This means that we define a normal theory random intercept model for one of the responses, a logistic regression model with a random intercept for the other response, and we assume that the random intercepts are correlated. The data are simulated for \sigma^2 = 1, \beta_1 = 4, \beta_2 = 1, and \Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}. The parameter estimates and their
standard errors for the three fitting methods discussed above are presented in Table
3.1. The methods are programmed in Ox and are run on Sun Ultra workstations. Ox is an object-oriented matrix programming language developed at Oxford University (Doornik, 1998). It is convenient because of the availability of matrix operations and a variety of predefined functions and, even more importantly, because of the speed of computations.

Table 3.1. Estimates and standard errors from model fitting for the simulated data example using Gauss-Hermite quadrature (GQ), Monte Carlo EM algorithm (MCEM) and pseudo-likelihood (PL)

Parameter   True value   GQ            MCEM          PL
                         Est.   SE     Est.   SE     Est.   SE
\beta_1     4.0          3.72   0.19   3.72   0.20   3.72   0.19
\beta_2     1.0          1.08   0.18   1.08   0.19   1.08   0.17
\sigma^2    1.0          0.97   0.08   0.97   0.08   0.97   0.08
\sigma_1^2  1.0          0.94   0.27   0.94   0.27   0.94   0.27
\sigma_2^2  0.5          0.39   0.26   0.41   0.35   0.35   0.22
\sigma_{12} 0.5          0.54   0.21   0.54   0.24   0.48   0.20
All standard errors, except those for the Monte Carlo EM algorithm and those for the regression parameters for the pseudo-likelihood algorithm, are based on numerical derivatives. The identity matrix is used as an initial estimate of the covariance matrix for the random components, and the initial estimates for \beta_1, \beta_2 and \sigma^2 are taken to be the sample means for the normal and the Bernoulli response, and the sample variance for the normal response. Gauss-Hermite quadrature is used with the number of quadrature points in each dimension varying from 10 to 100.
The MaxBFGS numerical maximization procedure in Ox is used for optimization in the Gaussian quadrature and the pseudo-likelihood algorithms with a convergence criterion based on \delta_1 = 0.0001. The convergence criterion used for the Monte Carlo EM algorithm is based on \delta_1 = 0.003. MaxBFGS employs a quasi-Newton method for maximization developed by Broyden, Fletcher, Goldfarb and Shanno (BFGS) (Doornik, 1998) and it can use either analytical or numerical first derivatives.
The times to achieve convergence by the three methods are 30 minutes for Gaussian quadrature (GQ) with 100 quadrature points in each dimension, approximately 3 hours for the Monte Carlo EM algorithm, and 1.5 hours for the pseudo-likelihood (PL) algorithm. The numbers of iterations required are 22, 53 and 7 respectively. The final simulation sample size for the Monte Carlo EM algorithm is 9864.
As expected, the Monte Carlo EM estimates are very close to the Gaussian quadrature estimates with comparable standard errors. Only the estimate of \sigma_2^2 is slightly different and has a rather large standard error, but this is due to premature stopping of the EM algorithm. This problem can be solved by increasing the required precision for convergence, but at the expense of a very large simulation sample size.
As can be seen from the table, the parameters corresponding to the normal response are well estimated by the pseudo-likelihood algorithm, but some of the estimates corresponding to the Bernoulli response are somewhat attenuated and have smaller standard errors. The estimate of the random intercept variance \sigma_2^2 is about 10% smaller than the corresponding Gaussian quadrature estimate, with an about 15% smaller standard error. The estimate of \sigma_{12} is also about 15% smaller than the corresponding Gauss-Hermite quadrature and Monte Carlo estimates, but it is actually closer to the true value in this particular example. In general the variance component estimates for binary data obtained using the pseudo-likelihood algorithm are expected to be biased downward, as indicated by previous research in the univariate GLMM (Breslow and Clayton, 1993; Wolfinger, 1998). Hence the Wolfinger and O'Connell method should be used with caution for the extended GLMMs with at least one Bernoulli response, keeping in mind the possible attenuation of estimates and of the standard errors.
Gauss-Hermite Quadrature: 'Exact' vs 'Numerical' Derivatives
As mentioned before, MaxBFGS can be used either with exact or with numerical first derivatives. Comparisons of these two approaches in terms of computational speed and accuracy are provided in Tables 3.2 and 3.3 respectively. The 'Iteration' columns in Table 3.2 contain the number of iterations, and the 'Convergence' columns show whether the convergence tests are passed at the prespecified \delta_1 level (strong convergence), passed at the less stringent level \delta_1 = 0.001 (weak convergence), or not passed at all (no convergence). Note that for the exact derivatives strong convergence is achieved with 40 or more quadrature points.

Table 3.2. 'Exact' versus 'numerical' derivatives for the simulated data example: convergence using Gauss-Hermite quadrature.

Number of      Exact derivatives                    Numerical derivatives
quad. points   Time (min.)  Iterations  Convergence   Time (min.)  Iterations  Convergence
10             27.4         56          no            0.3          19          strong
20             21.3         22          weak          1.1          19          strong
30             75           35          weak          2.8          23          strong
40             120          34          strong        4.8          22          strong
50             246          47          strong        7.2          22          strong
60             237          34          strong        10.9         22          strong
It is rather surprising that the 'exact' Gaussian quadrature procedure does not perform better than the 'numerical' one. The parameter estimates from both procedures are the same (Table 3.3) for 40 or more quadrature points, but the 'exact' Gauss-Hermite quadrature converges much more slowly than the numerical procedure (hours as compared to minutes). The numerical procedure also has the advantage of simpler programming code and hence will be the preferred quadrature method for the rest of the dissertation.
Monte Carlo EM algorithm: Variability of Estimated Standard Errors
We observed that the estimated Monte Carlo EM standard errors varied somewhat for different random seeds, so we generated 100 samples with the final simulation sample size and computed the average and the standard deviation of the standard error estimates for these samples. Table 3.4 shows the results. We include the Gaussian quadrature standard error estimates for comparison. There is much variability in the standard error estimates (especially in the estimate of \sigma_2^2), which is primarily due to several extreme observations. If the simulation sample size is increased by one-third, the standard error estimates are more stable (see the last two columns of Table 3.4). It is not surprising that there is more variability in the standard error estimates than in the parameter estimates, because only the latter are controlled by the convergence criterion. So it may be necessary to increase the simulation sample size to obtain better estimates of the standard errors. This, however, requires additional computational effort and it is not clear by how much the simulation sample size should be increased. We consider a different approach to dealing with standard error instability in the next section of the dissertation.

Table 3.3. 'Exact' versus 'numerical' derivatives for the simulated data example: estimates and standard errors using Gauss-Hermite quadrature with varying number of quadrature points (m).

m         Exact derivatives                                  Numerical derivatives
          \beta_1 \beta_2 \sigma^2 \sigma_1^2 \sigma_2^2 \sigma_{12}   \beta_1 \beta_2 \sigma^2 \sigma_1^2 \sigma_2^2 \sigma_{12}
10  Est.  3.68    1.06    0.97     0.98       0.38       0.52          3.72    1.08    0.97     0.85       0.35       0.48
    SE    0.13    0.16    0.08     0.28       0.25       0.22          0.12    0.16    0.08     0.16       0.24       0.15
20  Est.  3.75    1.10    0.96     0.95       0.39       0.54          3.75    1.10    0.97     0.95       0.39       0.54
    SE    0.16    0.18    0.08     0.27       0.26       0.21          0.16    0.18    0.08     0.27       0.26       0.21
30  Est.  3.71    1.07    0.97     0.94       0.39       0.54          3.71    1.07    0.97     0.95       0.39       0.54
    SE    0.19    0.18    0.08     0.27       0.26       0.21          0.19    0.18    0.08     0.29       0.27       0.22
40  Est.  3.72    1.08    0.97     0.94       0.39       0.54          3.72    1.08    0.97     0.94       0.39       0.54
    SE    0.18    0.18    0.08     0.27       0.26       0.21          0.18    0.18    0.08     0.26       0.26       0.20
50  Est.  3.72    1.08    0.97     0.94       0.39       0.54          3.72    1.08    0.97     0.94       0.39       0.54
    SE    0.19    0.18    0.08     0.27       0.26       0.21          0.19    0.18    0.08     0.27       0.26       0.21
60  Est.  3.72    1.08    0.97     0.94       0.39       0.54          3.72    1.08    0.97     0.94       0.39       0.54
    SE    0.19    0.18    0.08     0.27       0.26       0.21          0.19    0.18    0.08     0.27       0.26       0.21
Comparison of Simultaneous and Separate Fitting of the Response Variables
To investigate the possible efficiency gains of joint fitting of the two responses
over separate fitting, we compared estimates from the joint to estimates from the
two separate fits. The results for the three estimation methods are provided in Table
3.5 (Gaussian quadrature), Table 3.6 (Monte Carlo EM), and Table 3.7 (pseudo-likelihood). For the separate fits we used PROC NLMIXED for Gaussian quadrature and wrote our own programs in Ox for Monte Carlo EM and pseudo-likelihood. Notice that when the normal response is considered separately, the corresponding model is a one-way random effects ANOVA and can be fitted using PROC MIXED in SAS, for example. The estimates from PROC MIXED are as follows: \hat\beta_1 = 3.72 (SE = 0.19), \hat\sigma^2 = 0.97 (SE = 0.08) and \hat\sigma_1^2 = 0.94 (SE = 0.27).

Table 3.4. Variability in Monte Carlo EM standard errors for the simulated data example. Means and standard deviations of the standard error estimates are computed for 100 samples using two different simulation sample sizes. The Gauss-Hermite quadrature standard errors are given as a reference in the second column.

Parameter    GQ SE   Mean SE    SD of SE   Mean SE     SD of SE
                     (ss=9864)  (ss=9864)  (ss=13152)  (ss=13152)
\beta_1      0.19    0.21       0.07       0.19        0.03
\beta_2      0.18    0.20       0.09       0.19        0.03
\sigma^2     0.08    0.08       0.0008     0.08        0.0001
\sigma_1^2   0.27    0.27       0.03       0.27        0.01
\sigma_2^2   0.26    0.41       0.76       0.31        0.13
\sigma_{12}  0.21    0.22       0.04       0.22        0.02

Table 3.5. Results from joint and separate Gauss-Hermite quadrature estimation of the two response variables in the simulated data example

Parameter    True value   Joint estimation   Separate estimation
                          Estimate   SE      Estimate   SE
\beta_1      4.0          3.72       0.19    3.72       0.19
\beta_2      1.0          1.08       0.18    1.07       0.18
\sigma^2     1.0          0.97       0.08    0.97       0.08
\sigma_1^2   1.0          0.94       0.27    0.94       0.27
\sigma_2^2   0.5          0.39       0.26    0.37       0.25
\sigma_{12}  0.5          0.54       0.21    -
It is surprising that although the estimated correlation between the random effects was rather high (\hat\rho = 0.86), the estimates and the estimated standard errors from the joint and from the separate fits were very similar. Having in mind that it is much faster to fit the responses separately and that there is existing software for univariate repeated measures, it is not clear if there are any advantages to joint fitting
Table 3.6. Results from joint and separate Monte Carlo EM estimation of the two response variables in the simulated data example

Parameter    True value   Joint estimation   Separate estimation
                          Estimate   SE      Estimate   SE
\beta_1      4.0          3.72       0.20    3.73       0.22
\beta_2      1.0          1.08       0.19    1.08       0.18
\sigma^2     1.0          0.97       0.08    0.97       0.08
\sigma_1^2   1.0          0.94       0.27    0.95       0.28
\sigma_2^2   0.5          0.41       0.35    0.38       0.23
\sigma_{12}  0.5          0.54       0.24    -
Table 3.7. Results from joint and separate pseudo-likelihood estimation of the two response variables in the simulated data example

Parameter    True value   Joint estimation   Separate estimation
                          Estimate   SE      Estimate   SE
\beta_1      4.0          3.72       0.19    3.72       0.19
\beta_2      1.0          1.08       0.17    1.05       0.17
\sigma^2     1.0          0.97       0.08    0.97       0.08
\sigma_1^2   1.0          0.94       0.27    0.94       0.27
\sigma_2^2   0.5          0.35       0.22    0.35       0.22
\sigma_{12}  0.5          0.48       0.20    -
in this particular example. Advantages and disadvantages of joint and separate fitting
will be discussed in Chapter 5 of the dissertation.
3.4 Applications
3.4.1 Developmental Toxicity Study in Mice
In this section the 'exact' maximum likelihood methods are applied to the ethylene
glycol (EG) example mentioned in the introduction. The model is defined as follows
(d_i denotes the exposure level of EG for the i-th dam):
y_{i1j} = birth weight of the j-th live fetus in the i-th litter,
y_{i2j} = malformation status of the j-th live fetus in the i-th litter,
y_{i1j} \mid b_{i1} \sim \text{indep. } N(\mu_{i1j}, \sigma^2),
y_{i2j} \mid b_{i2} \sim \text{indep. Bernoulli}(\mu_{i2j}),
\mu_{i1j} = \beta_{10} + \beta_{11} d_i + b_{i1},
\mathrm{logit}(\mu_{i2j}) = \beta_{20} + \beta_{21} d_i + b_{i2},
b_i = (b_{i1}, b_{i2})^T \sim MVN(0, \Sigma), \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}.

Table 3.8. Gauss-Hermite quadrature and Monte Carlo EM estimates and standard errors from model fitting with a quadratic effect of dose on birth weight in the ethylene glycol data

Parameter    Gauss-Hermite quadrature   Monte Carlo EM
             Estimate   SE              Estimate   SE
\beta_{10}   0.962      0.016           0.964      0.037
\beta_{11}   0.118      0.028           0.120      0.053
\beta_{12}   0.010      0.009           0.010      0.015
\beta_{20}   4.267      0.409           4.287      0.455
\beta_{21}   1.720      0.207           1.731      0.217
\sigma^2     0.006      0.0003          0.006      0.0003
\sigma_1^2   0.007      0.001           0.007      0.001
\sigma_2^2   2.287      0.596           2.238      0.560
\sigma_{12}  0.082      0.020           0.082      0.022

Table 3.9. Gauss-Hermite quadrature and Monte Carlo EM estimates and standard errors from model fitting with linear trends of dose in the ethylene glycol data

Parameter   GQ (logit)      MCEM (logit)    GQ (probit)     MCEM (probit)
            Est     SE      Est     SE      Est     SE      Est     SE
\beta_{10}  0.952   0.014   0.952   0.021   0.952   0.014   0.954   0.028
\beta_{11}  0.087   0.008   0.088   0.008   0.087   0.008   0.088   0.012
\beta_{20}  4.335   0.411   4.335   0.522   2.340   0.213   2.422   0.269
\beta_{21}  1.749   0.208   1.752   0.225   0.970   0.111   0.982   0.129
\sigma      0.075   0.002   0.075   0.002   0.075   0.002   0.075   0.002
\sigma_1    0.086   0.007   0.086   0.007   0.086   0.007   0.086   0.007
\sigma_2    1.513   0.196   1.495   0.207   0.839   0.107   0.831   0.101
\rho        0.682   0.088   0.687   0.091   0.666   0.090   0.670   0.094
We considered a linear trend in dose in both linear predictors. Although other
authors have found a significant quadratic dose trend for birth weight, we found that
to be nonsignificant and hence have not included it in the final model (see Table 3.8).
We also considered a probit link.
The Gaussian quadrature and Monte Carlo EM parameter estimates from the
model fits are given in Table 3.9. Initial estimates for the regression parameters were
the estimates from the fixed-effects models for the two responses, σ was set equal
to the estimated standard deviation from the linear mixed model for birth weight,
and the identity matrix was taken as an initial estimate for the variance-covariance
matrix of the random effects. Gaussian quadrature estimates were obtained using 100
quadrature points in each dimension which took 36 hours for both the logit and the
probit links. Adaptive Gaussian quadrature will probably be much faster because it
requires a smaller number of quadrature points. The Monte Carlo EM algorithm for
the model with logit link ran for 3 hours and 6 minutes, and required 41 iterations
and a final sample size of 989 for convergence. The Monte Carlo EM algorithm for the
model with a probit link ran for 5 hours and 13 minutes, and required 37 iterations
and a final sample size of 1757 for convergence. The convergence precisions were the
same as in the simulated data example.
The estimates from the two algorithms are similar but the standard error estimates
from the Monte Carlo EM algorithm are larger than their counterparts from Gaussian
quadrature. This is not always the case as seen from Table 3.10. There is much
variability in the standard error estimates from the Monte Carlo EM algorithm. The
estimates can be improved by considering larger simulation sample sizes but we adopt
a different approach. The observed information matrix at each step of the algorithm is
easily approximated using Louis' method, and this requires no extra computational
effort because it relies on approximations already needed for the sample-size increase decision.
Because the EM algorithm converges slowly in the close vicinity of the estimates, and
because the convergence criterion must be satisfied in three consecutive iterations,
the approximated observed information matrices from the last three iterations are
likely to be equally good in approximating the variance of the parameter estimates.
Hence, if we average the three matrices we are likely to obtain a better estimate of
the variance-covariance matrix than if we rely on only one particular iteration.
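A minimal sketch of this pooling, with hypothetical 2x2 observed-information approximations standing in for the matrices that Louis' method would produce over the final iterations:

```python
import math

def pooled_standard_errors(info_matrices):
    """Average the observed-information approximations from the last few EM
    iterations, invert the pooled matrix, and return standard errors
    (square roots of the diagonal of the inverse)."""
    p = len(info_matrices[0])
    m = len(info_matrices)
    # Element-wise average of the matrices.
    avg = [[sum(mat[i][j] for mat in info_matrices) / m for j in range(p)]
           for i in range(p)]
    # Invert the 2x2 average (a general linear solver would be used for larger p).
    det = avg[0][0] * avg[1][1] - avg[0][1] * avg[1][0]
    inv_diag = [avg[1][1] / det, avg[0][0] / det]
    return [math.sqrt(v) for v in inv_diag]

# Hypothetical approximations from the last three iterations:
J1 = [[52.0, 3.1], [3.1, 24.0]]
J2 = [[49.5, 2.8], [2.8, 26.5]]
J3 = [[50.5, 3.0], [3.0, 25.0]]
se = pooled_standard_errors([J1, J2, J3])
print(se)
```

Averaging before inverting also reduces the chance that a single noisy iteration yields a non-positive-definite matrix, although, as noted above, positive definiteness of the pooled matrix is not guaranteed either.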
Table 3.10. Variability in Monte Carlo EM standard errors for the ethylene glycol
example. Means and standard deviations of the standard error estimates are computed
for 100 samples for the logit and probit models. Gauss-Hermite quadrature standard
errors are given as a reference.

Parameter   Logit model                       Probit model
            GQ      Mean of SE  SD of SE      GQ      Mean of SE  SD of SE
β10         0.014   0.015       0.032         0.014   0.014       0.021
β11         0.008   0.008       0.015         0.008   0.007       0.010
β20         0.411   0.440       0.500         0.213   0.200       0.137
β21         0.208   0.206       0.207         0.111   0.101       0.063
σ           0.002   0.002       0.0001        0.002   0.002       0.00002
σ1          0.007   0.007       0.0005        0.007   0.007       0.0001
σ2          0.196   0.212       0.115         0.107   0.110       0.020
ρ           0.088   0.094       0.052         0.090   0.092       0.012
Tables 3.11 and 3.12 give standard error estimates for the EG data based on Gaussian
quadrature (column GQ), on the Monte Carlo approximated observed information
matrix in the third-to-last (column A1), second-to-last (column A2) and last (column
A3) iterations, and on the averages of the observed information matrices of the last
two (column A4) and of the last three (column A5) iterations. Clearly
using only the estimate of the observed information matrix from the last iteration
is not satisfactory because it depends heavily on the random seed and may contain
some negative standard error estimates. Averaging over the three final iterations is
better, although it is not guaranteed that even the pooled estimate of the observed
information matrix will be positive definite, or that it will lead to improved estimates.
All parameter estimates in the model are significantly different from zero with
birth weight significantly decreasing with increasing dose and probability for malfor
mation significantly increasing with increasing dose. As expected, the regression
parameter estimates using the logit link function are greater than those obtained using
the probit link function. The use of the logit link function facilitates the interpretation
of the parameter estimates. Thus, if the EG dose level for a litter is increased
by 1 g/kg, the estimated odds of malformation for any fetus within that litter increase
Table 3.11. Monte Carlo EM standard errors using the logit link in the ethylene glycol
example. The approximations are based on the estimate of the observed information
matrix from the third-to-last iteration (A1), the second-to-last iteration (A2), the last
iteration (A3), the average of the last two iterations (A4), and the average of the last
three iterations (A5). The standard error estimates obtained using Gauss-Hermite
quadrature are given as a reference in the column labeled GQ.

Parameter   Estimated S.E.
            GQ      A1      A2      A3      A4      A5
β10         0.014   0.017   0.037   0.012   0.016   0.016
β11         0.008   0.007   0.014   0.001   0.008   0.007
β20         0.411   0.385   0.429   0.365   0.409   0.396
β21         0.208   0.183   0.239   0.215   0.236   0.206
σ           0.002   0.002   0.002   0.002   0.002   0.002
σ1          0.007   0.007   0.007   0.007   0.007   0.007
σ2          0.196   0.199   0.202   0.221   0.204   0.199
ρ           0.088   0.081   0.084   0.136   0.096   0.089
Table 3.12. Monte Carlo EM standard errors using the probit link in the ethylene glycol
example. The approximations are based on the estimate of the observed information
matrix from the third-to-last iteration (A1), the second-to-last iteration (A2), the last
iteration (A3), the average of the last two iterations (A4), and the average of the last
three iterations (A5). The standard error estimates obtained using Gauss-Hermite
quadrature are given as a reference in the column labeled GQ.

Parameter   Estimated S.E.
            GQ      A1      A2      A3      A4      A5
β10         0.014   0.021   0.021   0.006   0.016   0.017
β11         0.008   0.012   0.007   0       0.008   0.009
β20         0.213   0.273   0.227   0       0.243   0.233
β21         0.111   0.151   0.110   0       0.132   0.127
σ           0.002   0.002   0.002   0.002   0.002   0.002
σ1          0.007   0.008   0.007   0.007   0.007   0.007
σ2          0.107   0.127   0.097   0       0.124   0.110
ρ           0.090   0.094   0.102   0       0.101   0.095
exp(1.75) = 5.75 times. The probit link function does not offer a similarly easy
interpretation, but it allows one to obtain population-averaged estimates for dose. If
the subject-specific regression estimate is β̂, then the marginal regression parameters
are estimated by β̂/√(1 + σ̂2²) (Section 2.1). Hence, we obtain a marginal intercept
of 1.793 and a marginal slope of 0.743. The latter can be interpreted as the amount
by which the population-averaged probability of malformation changes on the probit
scale for each unit increase in dose. The subject-specific slope estimate of 0.970, on
the other hand, is interpreted as the increase in the individual fetus probability of
malformation on the probit scale for each unit increase in dose. A more meaningful interpretation for
those numbers can be offered if one assumes a continuous underlying malformation
variable for the observed binary malformation outcome. The latent variable reflects
a cumulative detrimental effect which manifests itself in a malformation if it exceeds
a certain threshold. The effect of dose on this underlying latent variable is linear and
hence β21 is interpreted as the amount by which the cumulative effect is increased for
a one-unit increase in dose. β21 can also be interpreted as the highest rate of change
in the probability of malformation (Agresti, 1990, p. 103). This highest rate of change
is estimated to be achieved at dose β̂20/β̂21 = 2.340/0.970 ≈ 2.41 g/kg, where the
estimated probability of malformation equals 0.5.
As expected, birth weight and malformation are significantly negatively correlated,
as judged by the Wald test based on Table 3.9: (ρ̂/SE(ρ̂))² = (0.666/0.090)² = 54.8
(p-value < 0.0001). This translates into a negative correlation between the responses,
but there is no closed-form expression to estimate it precisely.
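The marginal conversion and the Wald statistic just described can be reproduced from the probit joint-fit values of Table 3.9 in a few lines (a sketch; the numbers are simply read off the table, with magnitudes as printed there):

```python
import math

# Probit joint-fit values from Table 3.9 (magnitudes as reported).
beta20, beta21 = 2.340, 0.970   # malformation intercept and dose slope
sigma2 = 0.839                  # standard deviation of the malformation random effect
rho, se_rho = 0.666, 0.090      # correlation estimate and its standard error

# Marginal (population-averaged) probit coefficients: divide by sqrt(1 + sigma2^2).
scale = math.sqrt(1.0 + sigma2 ** 2)
marg_intercept = beta20 / scale
marg_slope = beta21 / scale
# Wald statistic for the correlation: (rho / SE)^2, chi-square with 1 df under H0.
wald = (rho / se_rho) ** 2
print(round(marg_intercept, 3), round(marg_slope, 3), round(wald, 1))
# prints 1.793 0.743 54.8, matching the values quoted in the text
```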
It is interesting to compare the joint and the separate fits of the two response
variables (Tables 3.13 and 3.14). Table 3.13 presents estimates obtained using Gaus
sian quadrature and shows that the estimates for the Normal response are identical
(within the reported precision) but the estimates for the Bernoulli response from
the separate fits are generally larger in absolute value with larger standard errors.
Table 3.13. Results from joint and separate Gauss-Hermite quadrature fits of the two
response variables in the ethylene glycol example.

Par.    Logit models                        Probit models
        Joint fit       Separate fits       Joint fit       Separate fits
        Est.    SE      Est.    SE          Est.    SE      Est.    SE
β10     0.952   0.014   0.952   0.014       0.952   0.014   0.952   0.014
β11     0.087   0.008   0.087   0.008       0.087   0.008   0.087   0.008
β20     4.335   0.411   4.356   0.438       2.340   0.213   2.426   0.230
β21     1.749   0.208   1.779   0.220       0.970   0.111   0.993   0.118
σ       0.075   0.002   0.075   0.002       0.075   0.002   0.075   0.002
σ1      0.086   0.007   0.086   0.007       0.086   0.007   0.086   0.007
σ2      1.513   0.196   1.577   0.211       0.839   0.107   0.881   0.117
ρ       0.682   0.088   -       -           0.666   0.090   -       -
Table 3.14. Results from joint and separate fitting of the Monte Carlo EM algorithm
for the two response variables in the ethylene glycol example

Par.    Logit models                        Probit models
        Joint fit       Separate fits       Joint fit       Separate fits
        Est.    SE      Est.    SE          Est.    SE      Est.    SE
β10     0.952   0.021   0.952   0.014       0.954   0.028   0.952   0.014
β11     0.088   0.008   0.088   0.007       0.088   0.012   0.088   0.007
β20     4.335   0.522   4.453   0.344       2.422   0.269   2.499   0.193
β21     1.752   0.225   1.822   0.211       0.982   0.129   1.025   0.098
σ       0.075   0.002   0.075   0.002       0.075   0.002   0.075   0.002
σ1      0.086   0.007   0.086   0.007       0.086   0.007   0.086   0.007
σ2      1.495   0.207   1.547   0.199       0.831   0.101   0.869   0.052
ρ       0.687   0.091   -       -           0.670   0.094   -       -
This may indicate small efficiency gains in fitting the responses together rather than
separately. More noticeable differences in the parameter estimates and especially in
the standard errors are observed in the results from the Monte Carlo EM algorithm
(Table 3.14). In this case the standard error estimates for the Bernoulli components
from the separate fits are smaller than their counterparts from the joint fits. These
results, however, may be deceiving because there is large variability in the standard
error estimates depending on the particular simulation sample used (see Table 3.10).
As a result of the conditional independence assumption the correlation between
birth weight and malformation within fetus, and the correlation between birth weight
and malformation measured on two different fetuses within the same litter, are as
sumed to be the same. However, in practice this assumption may not be satisfied. We
would expect that measurements on the same fetus are more highly correlated than
measurements on two different fetuses within a litter. Hence, it is very important
to be able to check the conditional independence assumption and to investigate how
departures from that assumption affect the parameter estimates. In the following
chapter score tests are used to check aspects of conditional independence without fit
ting more complicated models. When the score tests show that there is nonnegligible
departure from this assumption, alternative models should be considered. In the case
of a binary and a continuous response, a more general model is presented in Chapter
5. It has been considered by Catalano and Ryan (1992), who used GEE methods to
fit it. We propose "exact" maximum likelihood estimation which allows for direct
estimation of the variancecovariance structure.
3.4.2 Myoelectric Activity Study in Ponies
The second data set introduced in Section 2.4 is from a study on myoelectric activ
ity in ponies. The purpose of this analysis is to simultaneously assess the immediate
effects of six drugs and placebo on spike burst rate and spike burst duration within
a pony. Two features of this data example are immediately obvious from Table 2.2:
there is a large number of observations within cluster (pony) and the variance of the
spike burst rate response for each pony is much larger than the mean (see Table 2.2).
The implication of the first observation is that ordinary GaussHermite quadrature
as described in Section 3.2.1 may not work well because some of the integrands will
be essentially zero. Adaptive Gaussian quadrature on the other hand will be more
appropriate because it requires a smaller number of quadrature points. Also the ad
ditional computations to obtain the necessary modes and curvatures will not slow
down the algorithm because there are only 6 subjects in the data set and hence only
six additional maximizations will be performed. Adaptive Gaussian quadrature is
described in the next subsection of the dissertation.
The implication of the second observation concerns the response distribution of the
spike burst rate response. An obvious choice for count data is the Poisson distribution,
but the Poisson distribution imposes equality of the mean and the variance of the
response and this assumption is clearly not satisfied for the data even within a pony.
Therefore some way of accounting for the extra dispersion in addition to the pony
effect must be incorporated in the model. Such an approach based on the negative
binomial distribution is discussed later in this chapter.
Adaptive Gaussian Quadrature
Adaptive Gaussian quadrature has been considered by Liu and Pierce (1994),
Pinheiro and Bates (1999), and Wolfinger (1998). We now present that idea in the
context of the bivariate GLMM. Recall that the likelihood contribution of the ith
subject has the form

L_i = ∫_{R^q} f(y_i, b_i) db_i,

where

f(y_i, b_i) = { ∏_{j=1}^{n_i} f(y_i1j | b_i; β1, φ1) f(y_i2j | b_i; β2, φ2) } (2π)^{-q/2} |Σ|^{-1/2} exp(-½ b_i′ Σ⁻¹ b_i).

Let b̂_i be the mode of f(y_i, b_i) and let Γ_i = (-∂² ln f(y_i, b_i)/∂b_i ∂b_i′)⁻¹,
evaluated at b̂_i. Then

L_i = ∫_{R^q} f*(y_i, b_i) φ(b_i; b̂_i, Γ_i) db_i,

where f*(y_i, b_i) = f(y_i, b_i)/φ(b_i; b̂_i, Γ_i) and

φ(b_i; b̂_i, Γ_i) = (2π)^{-q/2} |Γ_i|^{-1/2} exp(-½ (b_i - b̂_i)′ Γ_i⁻¹ (b_i - b̂_i)).

The needed transformation of b_i is then b_i = b̂_i + √2 A_i z_i, where Γ_i = A_i A_i′.
Hence each integral is approximated by

L_i ≈ π^{-q/2} Σ_{k1=1}^{m} ⋯ Σ_{kq=1}^{m} ( ∏_{l=1}^{q} w_{kl} ) f*(y_i, b_i^{(k)}),

with the tabled univariate Gauss-Hermite weights w_{kl} and nodes b_i^{(k)} = b̂_i + √2 A_i d^{(k)}
for the multiple index k = (k1, ..., kq), where d^{(k)} collects the tabled nodes of
Gauss-Hermite integration of order m.
The difference between this procedure and the ordinary Gauss-Hermite approximation
is in the centering and spread of the nodes that are used. Here the nodes are
placed where most of the mass of the integrand lies, and this makes adaptive quadrature
more efficient. The computation of the modes b̂_i and the curvatures Γ_i requires
maximization of the integrand for each subject, which in general calls for a numerical
procedure, but this was not a major impediment for the pony data. We used
MaxBFGS nested within MaxBFGS to perform these maximizations, with the inner
mode search running inside the outer parameter maximization.
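To make the recentring and rescaling concrete, the following sketch (in Python, not from the dissertation's own programs) applies adaptive Gauss-Hermite quadrature to a one-dimensional toy integral of the same Poisson-times-normal form; the data, model, and ten-point rule are illustrative assumptions:

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermgauss

# Toy cluster: counts y_j | b ~ Poisson(exp(beta + b)), random effect b ~ N(0, s2).
y = [3, 5, 4, 6]
beta, s2 = 1.2, 0.5

def log_f(b):
    """Log of the joint density f(y, b), the integrand of the likelihood."""
    loglik = sum(yj * (beta + b) - math.exp(beta + b) - math.lgamma(yj + 1)
                 for yj in y)
    return loglik - 0.5 * math.log(2 * math.pi * s2) - b * b / (2 * s2)

# Mode b_hat and curvature gamma = (-d^2 log f / db^2)^{-1}, found numerically.
grid = np.linspace(-3, 3, 6001)
b_hat = float(grid[np.argmax([log_f(b) for b in grid])])
h = 1e-4
gamma = -h * h / (log_f(b_hat + h) - 2 * log_f(b_hat) + log_f(b_hat - h))

# Adaptive Gauss-Hermite: recentre the tabled nodes at the mode and rescale.
d, w = hermgauss(10)
nodes = b_hat + math.sqrt(2 * gamma) * d
# L ~ pi^{-1/2} sum_k w_k f*(node_k), which after expanding f* equals
# sqrt(2*gamma) * sum_k w_k f(node_k) exp((node_k - b_hat)^2 / (2*gamma)).
L_adaptive = math.sqrt(2 * gamma) * sum(
    wk * math.exp(log_f(bk) + (bk - b_hat) ** 2 / (2 * gamma))
    for wk, bk in zip(w, nodes))

# Brute-force Riemann sum over a fine grid for comparison.
L_brute = sum(math.exp(log_f(b)) for b in grid) * float(grid[1] - grid[0])
print(abs(L_adaptive - L_brute) / L_brute < 1e-3)
```

Because the nodes sit where the integrand is concentrated, a ten-point rule already matches the brute-force integral closely; ordinary Gauss-Hermite would need far more points if the mode were far from zero.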
Negative Binomial Model
When the dispersion in count data is greater than that predicted by the Poisson
model, a convenient parametric approach is to assume a Gamma prior for the Poisson
mean (McCullagh and Nelder, 1989). Letting Y ~ Poisson(μ), then

P(Y = y) = μ^y e^{-μ} / y!,   y = 0, 1, 2, ...,

and E(Y) = Var(Y) = μ. Suppose that we want to specify a larger variance for Y but
keep the mean the same. This can be accomplished as suggested by Booth and Hobert
(personal communication). Let Y | u ~ Poisson(uμ) and let u ~ Gamma(α, 1/α), so
that E(u) = 1. The density of u is f(u) = α^α u^{α-1} e^{-αu} / Γ(α). Then
unconditionally Y has a negative binomial distribution with probability mass function

P(Y = y) = [Γ(y + α) / (Γ(α) y!)] (α/(α + μ))^α (μ/(α + μ))^y,   y = 0, 1, 2, ...,

and E(Y) = μ and Var(Y) = μ + μ²/α. Notice that for large α the negative binomial
variance approaches the Poisson variance, meaning that there is little overdispersion,
but for small α the negative binomial variance can be much larger than the
Poisson variance.
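As a quick numerical check of the mixture derivation above, the following sketch evaluates the negative binomial pmf and its first two moments (truncating the support at 400 terms, which is ample for the illustrative parameter values chosen here):

```python
import math

def nb_pmf(y, mu, alpha):
    """Negative binomial pmf obtained by mixing Poisson(u * mu) over
    u ~ Gamma(alpha, 1/alpha), as derived in the text."""
    return math.exp(math.lgamma(y + alpha) - math.lgamma(alpha)
                    - math.lgamma(y + 1)
                    + alpha * math.log(alpha / (alpha + mu))
                    + y * math.log(mu / (alpha + mu)))

mu, alpha = 4.0, 2.0
probs = [nb_pmf(y, mu, alpha) for y in range(400)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum(y * y * p for y, p in enumerate(probs)) - mean ** 2
print(round(sum(probs), 6), round(mean, 4), round(var, 4))
# prints 1.0 4.0 12.0: the mean is mu and the variance is mu + mu^2/alpha
```

Raising alpha toward infinity pushes the variance back toward the Poisson value mu, which is the limiting no-overdispersion case noted above.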
The mean in the above model can be specified as a function of covariates using
a log link, ln(μ_i) = x_i′β, i = 1, ..., n. The log link is convenient because it is the
canonical link for the Poisson distribution and transforms the mean so that it can
take any real value. It is possible to obtain maximum likelihood estimates of α and
β by directly maximizing the negative binomial likelihood numerically, but there is
a more elegant way based on the EM algorithm, as proposed by Booth and Hobert
(personal communication).
The complete data consist of y and u, and the complete-data loglikelihood is

ln L_c = c + Σ_{i=1}^n [α ln α - ln Γ(α) + y_i ln μ_i + α ln u_i - u_i(μ_i + α)],

where c is a constant depending only on the complete data and not on the unknown
parameters. The lth E-step of the EM algorithm then involves calculating

E(ln L_c(α, β) | y; α^(l-1), β^(l-1)) = c + nα ln α - n ln Γ(α)
    + Σ_{i=1}^n [y_i x_i′β + α E(ln u_i | y_i; α^(l-1), β^(l-1)) - E(u_i | y_i; α^(l-1), β^(l-1))(exp(x_i′β) + α)].

But because of the choice of the prior distribution for u_i, the posterior distribution of
u_i | y_i evaluated at α^(l-1), β^(l-1) is Gamma(y_i + α^(l-1), 1/(exp(x_i′β^(l-1)) + α^(l-1))).
Therefore

E(ln u_i | y_i; α^(l-1), β^(l-1)) = ψ(y_i + α^(l-1)) - ln(exp(x_i′β^(l-1)) + α^(l-1))

and

E(u_i | y_i; α^(l-1), β^(l-1)) = (y_i + α^(l-1)) / (exp(x_i′β^(l-1)) + α^(l-1)),

where ψ(·) denotes the digamma function (the first derivative of the log-gamma
function).
At the lth M-step the expected loglikelihood is maximized with respect to α and
β. The two parameters appear in separate terms of the loglikelihood, and hence
two separate numerical maximizations are required. Note that if we adopted a direct
approach the parameters would need to be maximized over jointly, which could lead to
more numerical problems.
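The E- and M-steps just described can be sketched for an intercept-only model (μ_i = exp(β0)), for which the β0 update has a closed form; the simulated data, the seed, and the differenced-lgamma digamma below are illustrative choices, not part of the original analysis:

```python
import math
import random

rng = random.Random(7)
alpha_true, mu_true, n = 2.0, 5.0, 2000

def rpois(lam):
    # Knuth's Poisson sampler; adequate for the modest means used here.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            return k
        k += 1

# Overdispersed counts: y_i | u_i ~ Poisson(u_i * mu), u_i ~ Gamma(alpha, 1/alpha).
y = [rpois(rng.gammavariate(alpha_true, 1.0 / alpha_true) * mu_true)
     for _ in range(n)]

def digamma(x, h=1e-5):
    # Numerical derivative of lgamma, standing in for psi(x).
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

alpha, mu = 1.0, sum(y) / n
for _ in range(100):
    # E-step: u_i | y_i ~ Gamma(y_i + alpha, 1/(mu + alpha)).
    Eu = [(yi + alpha) / (mu + alpha) for yi in y]
    Elnu = [digamma(yi + alpha) - math.log(mu + alpha) for yi in y]
    sEu, S = sum(Eu), sum(Elnu) - sum(Eu)
    # M-step for beta0, in closed form: exp(beta0) = sum(y) / sum(E u_i).
    mu = sum(y) / sEu
    # M-step for alpha: maximize the concave profile
    #   Q(a) = n (a ln a - ln Gamma(a)) + a (sum Elnu - sum Eu)
    # by ternary search on [0.01, 100].
    Q = lambda a: n * (a * math.log(a) - math.lgamma(a)) + a * S
    lo, hi = 1e-2, 1e2
    for _ in range(200):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if Q(m1) < Q(m2):
            lo = m1
        else:
            hi = m2
    alpha = (lo + hi) / 2

print(round(mu, 2), round(alpha, 2))  # should recover roughly mu = 5, alpha = 2
```

With a covariate vector x_i the β update would require a numerical maximization as well, but α and β would still be handled in separate, simpler problems, which is the point made above.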
When there are random effects the model can be specified as follows:

y_ij | μ_ij, u_ij  indep. Poisson(μ_ij u_ij),
u_ij  i.i.d. Gamma(α, 1/α),
ln(μ_ij) = x_ij′β + z_ij′b_i,
b_i  i.i.d. N_q(0, Σ),

and the random effects b_i and u_ij are independent.
The unknown parameters α, β and Σ can be estimated by a nested EM algorithm.
Such an algorithm has been proposed by Booth and Hobert (personal communication)
for the overdispersed Poisson GLMM, and more generally by van Dyk (1999), who
motivated it from the point of view of computational efficiency. The complete data
for the outer loop consist of y and b, and the outer EM algorithm is performed as
outlined in Section 3.2.2 but with an EM maximization procedure for α and β as
introduced above. The complete-data loglikelihood has the form

ln L_c = c + Σ_{ij} [ln Γ(y_ij + α) - ln Γ(α) + α ln α - (α + y_ij) ln(α + μ_ij) + y_ij ln μ_ij]
    - (n/2) ln|Σ| - ½ Σ_i b_i′ Σ⁻¹ b_i,

and at the rth E-step its conditional expectation is approximated by the Monte Carlo
sum

(1/m) Σ_{k=1}^m { c + Σ_{ij} [ln Γ(y_ij + α) - ln Γ(α) + α ln α - (α + y_ij) ln(α + μ_ij^(k)) + y_ij ln μ_ij^(k)]
    - (n/2) ln|Σ| - ½ Σ_i b_i^(k)′ Σ⁻¹ b_i^(k) },

where the b_i^(k) are generated from the conditional distribution of b_i | y_i; ψ^(r-1) and
μ_ij^(k) = exp(x_ij′β + z_ij′b_i^(k)). The first part of the sum is what needs to be
maximized to obtain estimates of α and β. Notice that each term in the first Monte
Carlo sum is a loglikelihood for a negative binomial model with a different (but fixed)
mean μ_ij^(k), and hence it can be subjected to an EM algorithm by augmenting the
observed data y by u.
In summary, the nested EM algorithm is as follows:
1. Select an initial estimate ψ^(0). Set r = 0.
2. Increase r by 1.
   E-step: For each subject i, i = 1, ..., n, generate m random samples from the
   distribution of b_i | y_i; ψ^(r-1) using rejection sampling and approximate
   E(ln L_c(y, b; ψ) | y; ψ^(r-1)) by a Monte Carlo sum.
3. M-step:
   3.1. Maximize with respect to the elements of Σ as usual.
   3.2. Set l = 0, α_(0) = α̂^(r-1) and β_(0) = β̂^(r-1).
   3.3. Increase l by 1.
        Inner E-step: For the generated b^(k), compute
        E(ln L_c(y, u; α, β) | y; α_(l-1), β_(l-1)).
   3.4. Inner M-step: Find α_(l) and β_(l) to maximize the Monte Carlo sum
        of conditional expectations.
   3.5. Iterate between (3.3) and (3.4) until convergence for α and β is
        achieved.
4. Iterate between (2) and (3) until convergence for ψ is achieved.
In the multivariate GLMM there are two or more response variables but the nested
EM algorithm works in essentially the same way because of the conditional indepen
dence between the variables. In the next section we describe a bivariate GLMM for
the pony data.
Analysis of Pony Data
The model is defined as follows:

Y_i1j = jth duration measurement on the ith pony,
Y_i2j = jth spike burst rate measurement on the ith pony,

Y_i1j | b_i1  indep. Gamma with mean μ_i1j and variance μ_i1j²/ν,
Y_i2j | b_i2, u_ij  indep. Poisson(μ_i2j u_ij),
u_ij  i.i.d. Gamma(α, 1/α),
ln(μ_i1j) = x_i1j′β1 + b_i1,
ln(μ_i2j) = x_i2j′β2 + b_i2,

b_i = (b_i1, b_i2)′ ~ MVN(0, Σ),   Σ = ( σ1²     ρσ1σ2 )
                                       ( ρσ1σ2   σ2²   ).

Let d_ijk be 1 if drug k is administered to pony i at occasion j, k = 1, ..., 6, and 0
otherwise. Placebo is coded as drug 7 and is the reference group. Let t denote
time. Then the linear predictors are

x_i1j′β1 = β10 + β11 d_ij1 + β12 d_ij2 + β13 d_ij3 + β14 d_ij4 + β15 d_ij5 + β16 d_ij6,
x_i2j′β2 = β20 + β21 d_ij1 + β22 d_ij2 + β23 d_ij3 + β24 d_ij4 + β25 d_ij5 + β26 d_ij6
    + β27 t + β28 d_ij1 t + β29 d_ij2 t + β2,10 d_ij3 t + β2,11 d_ij4 t + β2,12 d_ij5 t + β2,13 d_ij6 t.
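A small sketch of how a row of the rate linear predictor above can be assembled; the function name and the hard-coded example are illustrative, not taken from the original programs:

```python
def rate_design_row(drug, t):
    """Row of the linear predictor x_i2j' beta_2 for the spike burst rate:
    intercept, six drug indicators (placebo, coded as drug 7, is the
    reference), a linear time term, and the six drug-by-time interactions."""
    d = [1.0 if drug == k else 0.0 for k in range(1, 7)]
    return [1.0] + d + [float(t)] + [dk * t for dk in d]

row = rate_design_row(drug=4, t=30.0)
print(len(row))          # prints 14, matching beta20 through beta2,13
print(row[4], row[11])   # prints 1.0 30.0: the drug-4 indicator and its time interaction
```

The duration predictor x_i1j uses only the first seven terms of this row, since the time effects were dropped for that response.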
In the univariate analyses we initially considered the same linear predictor for
both variables, but the drug-by-time interactions and the time main effect were clearly
not significant for the duration response and were dropped from the model. Also, we
performed fixed effects analysis using PROC GENMOD and fitted two random effects
models on logtransformed variables using PROC MIXED for the complete data (all
time points and all electrode groups). There was evidence of a threeway interaction
between electrode group, time and drug. Describing this interaction requires estimat
ing 76 regression parameters for a linear time trend as compared to 21 in the above
specification. Recall that in the other data example we estimated up to five regression
parameters, so the numerical maximization procedures in this example were expected
to be much more complicated. The complete pony data set was also about ten times
larger than the ethylene glycol data set and the time trends appeared complicated
and not well described by simple linear or quadratic time effects. One might need to
use splines to adequately describe the trends. Hence, we concentrated on a particular
research question, namely to describe differences in the immediate effects (up to 60
minutes after administration) of the drugs in the cecal base (corresponding to one
of the electrode groups). In the analyses performed by Lester et al. (1998a, 1998b,
1999c) there was evidence that some drugs led to an immediate significant increase in
spike burst count or spike burst duration, while others took longer or did not lead to
an increase at all, and we decided to address that issue in our analysis.
We programmed the nested Monte Carlo EM and the adaptive Gaussian quadra
ture algorithms both for the joint and for the separate fits. We used a negative
binomial distribution for the count response, a gamma distribution for the duration
response and log link functions for both. (Although the log function is not the canon
ical link function for the Gamma response, it is convenient because it transforms the
mean to the whole real line). As initial estimates we used the final estimates from the
pseudo-likelihood approach using the %GLIMMIX macro, rounded to one significant
digit after the decimal point. The macro does not allow specification of a negative
binomial distribution, and hence we fitted a Poisson distribution with an extra
dispersion parameter.
The results from the analysis are summarized in Table 3.15. MaxBFGS had
convergence problems when the responses were fitted together and adaptive Gaussian
quadrature was used, so no results are presented for this case. We report the results
using only one quadrature point because the estimates and their standard errors are
essentially the same if the number of quadrature points is increased. Adaptive
Gaussian quadrature with one quadrature point is equivalent to the Laplace
approximation (Liu and Pierce, 1994), which indicates that analytical approximations
work well for this data set. The simulation sample size for the separate fit of the Gamma
response was increased at each iteration. Convergence was achieved when the simula
tion sample size was 31174 after 20 iterations and took about 1 hour. For the negative
binomial response the final simulation sample size was still 100 after 542 iterations,
and it also took about 1 hour to converge. The joint fit took about 21 hours until
convergence with a final simulation sample size of 23381 and 201 iterations.
The results from the joint and from the separate fits are almost identical, and the
correlation between the two response variables is not significant (ρ̂ = 0.71, SE =
0.50). Notice that ρ̂ is quite large, and the fact that it is not significant may be
partially due to there being only six subjects in the data set. Notice also
that the standard error estimates from the three methods are similar, with only some
of the adaptive Gaussian quadrature standard errors being smaller than their Monte
Carlo counterparts. The Monte Carlo standard error estimates did not show as much
variability as in the ethylene glycol example.
For both responses the coefficients for drug 1 and for drug 4 are significantly
different from zero. This means that both drugs have significantly different immediate
effects on the response variables than saline solution. Drug 1 leads to a significant
decrease in the individual level of duration for one hour after the drug is administered
and drug 4 is associated with a significant increase. Of all the interactions between
drug and time only the one involving drug 1 for the count response is significant.
3.5 Additional Methods
The empirical Bayes posterior mode procedure, discussed in Fahrmeir and Tutz
(1994, pp.233238), can also be used to estimate the parameters. The Newton
Raphson equations for the extended model when a flat (or vague) prior is used and
the dispersion parameters are set to 1.0, are essentially no more complicated than
those for the generalized linear mixed models. Fahrmeir and Tutz, however, do not
discuss the estimation of φ1 and φ2 when they are unknown. The dispersion parameters
can be estimated together with Σ via maximum likelihood, treating the current
estimates of the fixed and the random effects as the true values. Another possibility is
to put noninformative priors on φ1 and φ2 and estimate them together with β and b.
Table 3.15. Final maximum likelihood estimates for pony data

            Monte Carlo EM                      Adaptive Gauss-Hermite quadrature
            Joint fit       Separate fits       Separate fits
Parameter   Estimate  SE    Estimate  SE        Estimate  SE
β10         0.11      0.06  0.13      0.05      0.12      0.03
β11         0.23      0.05  0.23      0.05      0.23      0.05
β12         0.11      0.05  0.11      0.05      0.11      0.05
β13         0.01      0.05  0.02      0.05      0.02      0.05
β14         0.32      0.05  0.32      0.05      0.32      0.05
β15         0.10      0.05  0.10      0.05      0.10      0.05
β16         0.18      0.05  0.18      0.05      0.18      0.05
β20         4.25      0.22  4.30      0.22      4.31      0.19
β21         1.30      0.28  1.31      0.28      1.31      0.28
β22         0.24      0.29  0.24      0.29      0.24      0.29
β23         0.28      0.28  0.27      0.28      0.27      0.28
β24         0.90      0.28  0.89      0.28      0.89      0.28
β25         0.15      0.29  0.14      0.29      0.14      0.28
β26         0.32      0.28  0.31      0.28      0.31      0.28
β27         0.04      0.07  0.04      0.07      0.04      0.07
β28         0.28      0.10  0.28      0.10      0.28      0.10
β29         0.05      0.11  0.05      0.11      0.05      0.10
β2,10       0.01      0.10  0.01      0.10      0.01      0.10
β2,11       0.17      0.10  0.17      0.10      0.17      0.10
β2,12       0.07      0.10  0.06      0.10      0.06      0.10
β2,13       0.14      0.10  0.13      0.10      0.13      0.10
ν           17.3      1.43  17.3      1.43      17.6      1.44
α           3.41      0.34  3.41      0.34      3.47      0.29
σ1          0.10      0.03  0.10      0.03      0.09      0.03
σ2          0.26      0.09  0.26      0.08      0.24      0.07
ρ           0.71      0.50  -         -         -         -
What noninformative priors should be used in order to avoid dealing with improper
posteriors is a topic for further research. The question about the propriety of the
posterior also arises when Gibbs sampling is applied for posterior mean estimation,
which is yet another method that can be used for fitting the extended model.
CHAPTER 4
INFERENCE IN THE MULTIVARIATE GENERALIZED LINEAR MIXED MODEL
The estimates considered in the previous chapter are approximate maximum like
lihood and hence confidence intervals and hypothesis tests can be constructed accord
ing to asymptotic maximum likelihood theory. A rigorous proof of the properties of
the estimates requires checking the regularity conditions for consistency and asymptotic
normality in the case of independent but not necessarily identically distributed
random vectors. Such conditions have been established by Hoadley (1971) but are
difficult to check for the generalized linear mixed model and its multivariate extension
because of the lack of a closed-form expression for the marginal likelihood. To our
the regularity conditions are satisfied and rely on the general results for maximum
likelihood estimates. Caution is applied to testing for the significance of variance
components because when a parameter falls on the boundary of the parameter space
the asymptotic distribution of its maximum likelihood estimate is no longer normal
(Moran, 1971; Chant, 1974; Self and Liang, 1987). Score tests may be a good alterna
tive to the usual Wald and likelihoodratio tests because their asymptotic properties
are retained on the boundary (Chant, 1974). Score test statistics are also computed
under the null hypothesis and do not require fitting of more complicated models and
hence can be used to check the conditional independence assumption.
Since there are no closed form expressions for the marginal likelihood, the score
and the information matrix, numerical, stochastic or analytical approximations must
be used when computing test statistics and constructing confidence intervals. This
aggravates the problem of determining actual error rates and coverage probabilities
and requires the use of simulations to study the behaviour of the tests. Analytical and
stochastic approximations can be improved by increasing the number of quadrature
points or simulated sample sizes, but the precision of analytical approximations can
not be directly controlled. For example, in the case of binary data pseudolikelihood
works well if the binomial denominator is large (Breslow and Clayton, 1993). But
the latter depends only on the data, so for a particular data set, the approximation
is either good or bad. In general the estimates based on analytical approximations
are asymptotically biased.
In this chapter we concentrate on analytical and stochastic approximations of
Wald, score and likelihood ratio statistics and study their performance for checking
the conditional independence assumption. We also show how these approximations
can be constructed for testing fixed effects and for estimating random effects, and con
sider them for checking the significance of variance components. Since the asymptotic
maximum likelihood results hold when the number of subjects increases to infinity
we focus on the ethylene glycol example. The pony data has only six subjects and
as suggested in Chapter 3 inference concerning correlation between the two response
variables is suspect.
Section 4.1 discusses Wald and likelihood ratio tests for testing the fixed effects.
We briefly consider estimation of random effects and prediction of future observations
in Section 4.2. The score approach is introduced in Section 4.3.1 by providing a
historical overview. We then propose score tests for checking the conditional independence
assumption (Section 4.3.2) and for testing the variance components (Section 4.3.3).
The ethylene glycol example is used for illustration in Section 4.4 and Section 4.5
contains the results from a small simulation study for the performance of the pro
posed conditional independence test. The chapter concludes with discussion of future
research topics.
4.1 Inference about Regression Parameters
The asymptotic properties of maximum likelihood estimates have been studied
under a variety of conditions. The usual assumption is that the observations on
which the maximum likelihood estimates are based are independent and identically
distributed (Foutz, 1977), but results are available also for models in which the obser
vations are independent but not identically distributed (Hoadley, 1971). In general,
let y_1, y_2, ..., y_n be independent random vectors with density or mass functions
f_1(y_1, ψ), f_2(y_2, ψ), ..., f_n(y_n, ψ) depending on a common unknown parameter
vector ψ. Then as n → ∞, under certain regularity conditions the maximum likelihood
estimator ψ̂ is consistent and asymptotically normal:

√n (ψ̂ - ψ) →d N(0, I⁻¹(ψ)),

where I(ψ) = lim_{n→∞} (1/n) Σ_{i=1}^n E[-∂² l_i(ψ)/∂ψ ∂ψ′] and l_i(ψ) = ln f_i(y_i, ψ).
Two basic principles of testing are directly based on the asymptotic distribution
of the maximum likelihood estimate: the Wald test (Wald, 1943) and the likelihood
ratio test (Neyman and Pearson, 1928). Let $\boldsymbol{\psi} = (\boldsymbol{\psi}_1^T, \boldsymbol{\psi}_2^T)^T$. The significance of a
subset $\boldsymbol{\psi}_1$ of the parameter vector $\boldsymbol{\psi}$ can be tested by either one of them as follows.
The null hypothesis is
$$H_0: \boldsymbol{\psi}_1 = \mathbf{0}, \qquad (4.1)$$
where $\mathbf{0}$ is in the interior of the parameter space. (In general the null hypothesis is
$H_0: \boldsymbol{\psi}_1 = \boldsymbol{\psi}_{10}$, but herein we consider the simpler test.) Partition the information
matrix as
$$I(\boldsymbol{\psi}) = \begin{pmatrix} I_{\psi_1\psi_1}(\boldsymbol{\psi}) & I_{\psi_1\psi_2}(\boldsymbol{\psi}) \\ I_{\psi_2\psi_1}(\boldsymbol{\psi}) & I_{\psi_2\psi_2}(\boldsymbol{\psi}) \end{pmatrix}.$$
Similarly, denote the corresponding blocks of the inverse by
$$I^{-1}(\boldsymbol{\psi}) = \begin{pmatrix} I^{\psi_1\psi_1}(\boldsymbol{\psi}) & I^{\psi_1\psi_2}(\boldsymbol{\psi}) \\ I^{\psi_2\psi_1}(\boldsymbol{\psi}) & I^{\psi_2\psi_2}(\boldsymbol{\psi}) \end{pmatrix}.$$
The Wald test statistic is then
$$T_W = n \hat{\boldsymbol{\psi}}_1^T \left[I^{\psi_1\psi_1}(\hat{\boldsymbol{\psi}})\right]^{-1} \hat{\boldsymbol{\psi}}_1,$$
with $\boldsymbol{\psi}_2$ in $I^{\psi_1\psi_1}$ replaced by the consistent maximum likelihood estimator $\hat{\boldsymbol{\psi}}_2$ under
the null hypothesis. Under $H_0$, $T_W \sim \chi^2_d$ asymptotically, where $d$ is the dimension of the parameter
vector $\boldsymbol{\psi}_1$. Because in the case of non-identically distributed data $nI(\boldsymbol{\psi})$ may not be available, and because in
random effects models the expected information matrix may be hard to calculate, we
use $J(\hat{\boldsymbol{\psi}}) = \sum_{i=1}^{n} \left(-\frac{\partial^2 l_i(\hat{\boldsymbol{\psi}})}{\partial \boldsymbol{\psi} \partial \boldsymbol{\psi}^T}\right)$ instead of $nI(\hat{\boldsymbol{\psi}})$. Some authors argue that it is more
appropriate to use the observed rather than the expected information (Efron and
Hinkley, 1978), but unlike the expected information matrix, the observed information
matrix is not guaranteed to be positive-definite. The latter problem is exacerbated
when the numerical or stochastic approximations discussed in Section 3.2 are applied
to approximate the observed information matrix $J(\boldsymbol{\psi})$, which is not available in closed
form. We already used Wald tests in Chapter 3 to test the significance of individual
regression coefficients, but we can also use Wald tests to check several regression
coefficients simultaneously.
The likelihood ratio statistic for the hypothesis (4.1) is defined as follows. Let
$M_1$ be the model with unknown parameter vector $\boldsymbol{\psi}$ and let $M_2$ be a reduced model
with $\boldsymbol{\psi}_1 = \mathbf{0}$ and $\boldsymbol{\psi}_2$ left unrestricted. Let also $l_1$ and $l_2$ denote the maximized
log-likelihoods for models $M_1$ and $M_2$ respectively. Under $H_0$, as $n \to \infty$,
$$T_{LR} = 2(l_1 - l_2) \sim \chi^2_d,$$
so the likelihood ratio and the Wald statistic have the same asymptotic distribution
under the null hypothesis. The absence of closed-form expressions for $l_1$ and $l_2$ necessitates
the use of either Gaussian quadrature or Monte Carlo approximations. Gaussian
quadrature approximations are applied exactly as described in Section 3.2.1. To obtain
Monte Carlo approximations, $m$ samples $\mathbf{b}_i^{(k)}$ from the random effects distribution
$\mathbf{b}_i \sim N_q(\mathbf{0}, \hat{\boldsymbol{\Sigma}})$ are generated. Then
$$\hat{l}_1 = \sum_{i=1}^{n} \ln\left(\frac{1}{m} \sum_{k=1}^{m} f(\mathbf{y}_i \mid \mathbf{b}_i^{(k)}; \hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\phi}})\right).$$
Here $\hat{\boldsymbol{\Sigma}}$, $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\phi}}$ are the final maximum likelihood estimates from the full model $M_1$.
The same type of approximation is used for $l_2$, but evaluated at $\boldsymbol{\psi}_1 = \mathbf{0}$ and at the
maximum likelihood estimator $\hat{\boldsymbol{\psi}}_2$ under $M_2$.
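The Monte Carlo approximation to a maximized log-likelihood can be sketched as follows. The model here is a hypothetical normal random-intercept model (not one of the examples in this chapter), chosen because its marginal likelihood is also available in closed form and so provides a check on the simulation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def mc_loglik(y, tau2, sigma2, m=20000):
    """Monte Carlo approximation  l ~ sum_i log((1/m) sum_k f(y_i | b_k)),
    with b_k ~ N(0, tau2), for the model y_ij = b_i + e_ij, e_ij ~ N(0, sigma2)."""
    total = 0.0
    b = rng.normal(0.0, np.sqrt(tau2), size=m)   # draws from the random-effects dist
    for yi in y:
        # f(y_i | b) = prod_j N(y_ij; b, sigma2), vectorised over the m draws
        dens = np.prod(stats.norm.pdf(yi[:, None], loc=b,
                                      scale=np.sqrt(sigma2)), axis=0)
        total += np.log(dens.mean())
    return total

def exact_loglik(y, tau2, sigma2):
    """Closed-form check: marginally y_i ~ MVN(0, sigma2*I + tau2*J)."""
    n_j = y.shape[1]
    cov = sigma2 * np.eye(n_j) + tau2 * np.ones((n_j, n_j))
    return sum(stats.multivariate_normal.logpdf(yi, mean=np.zeros(n_j), cov=cov)
               for yi in y)

y = rng.normal(size=(10, 4))     # 10 subjects, 4 observations each
approx = mc_loglik(y, tau2=1.0, sigma2=1.0)
exact = exact_loglik(y, tau2=1.0, sigma2=1.0)
```

Computing $\hat{l}_1$ and $\hat{l}_2$ this way and taking $2(\hat{l}_1 - \hat{l}_2)$ gives the approximate likelihood ratio statistic; the Monte Carlo error of each term shrinks at rate $m^{-1/2}$.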
Under ideal conditions, for a large enough number of quadrature points in the Gaussian
quadrature algorithm and a large enough simulation sample size in the MCEM
algorithm, the approximate maximum likelihood estimates will be arbitrarily close to
the true estimates. Also, the additional approximations needed to compute the Wald
and likelihood ratio statistics above can be made very precise, hence the statistics
should perform well for a large number of subjects. But in practice there can be
problems, because it is not clear how many quadrature points and what number of
simulation samples are needed for adequate approximations.
Even if all approximations are adequate and the sample size is large enough, the
Wald and likelihood ratio tests can still run into problems if a parameter is on the
boundary of the parameter space. This happens in testing the significance of a variance
term. Hence Wald and likelihood ratio tests should be used with caution in that situation.
It has been proven by several authors (Moran, 1971; Chant, 1974; Self and Liang,
1987) that when a parameter is on the boundary of the parameter space, the asymptotic
distribution of the maximum likelihood estimator is no longer normal but rather
a mixture of distributions. But the score tests discussed in Section 4.3 are not affected
and therefore may be a nice substitute for Wald and likelihood ratio tests.
4.2 Estimation of Random Effects
In some applications it is of interest to obtain estimates for the unobserved random
effects. A natural point estimator is the conditional mean
$$E[\mathbf{b}_i \mid \mathbf{y}_i; \boldsymbol{\psi}] = \frac{\int \mathbf{b}_i f(\mathbf{y}_i \mid \mathbf{b}_i; \boldsymbol{\beta}, \boldsymbol{\phi}) f(\mathbf{b}_i; \boldsymbol{\Sigma})\, d\mathbf{b}_i}{\int f(\mathbf{y}_i \mid \mathbf{b}_i; \boldsymbol{\beta}, \boldsymbol{\phi}) f(\mathbf{b}_i; \boldsymbol{\Sigma})\, d\mathbf{b}_i},$$
which is not available in closed form but can be approximated either numerically or
stochastically. Gauss-Hermite quadrature involves two separate approximations, of
the numerator and the denominator, while the stochastic approximation is via the
simple Monte Carlo sum
$$\hat{\mathbf{b}}_i = \frac{1}{m} \sum_{k=1}^{m} \mathbf{b}_i^{(k)},$$
where $\mathbf{b}_i^{(1)}, \ldots, \mathbf{b}_i^{(m)}$ are simulated from the conditional distribution of $\{\mathbf{b}_i \mid \mathbf{y}_i; \hat{\boldsymbol{\psi}}\}$ using
a technique such as rejection sampling. Note that this approximation is performed
anyway in the MCEM algorithm, and hence the random effects estimate is obtained
at no extra cost. Estimates of the random effects are needed for prediction. For
example, one might be interested in obtaining an estimate of the linear predictor for
a particular subject:
$$\hat{\boldsymbol{\eta}}_i = X_i \hat{\boldsymbol{\beta}} + Z_i \hat{\mathbf{b}}_i.$$
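The rejection-sampling route to the Monte Carlo sum above can be sketched as follows. The setting is again a hypothetical normal random-intercept model, where the conditional mean has a closed form that the simulation can be checked against; the proposal is the random-effects prior, and the acceptance bound uses the fact that the normal likelihood is maximized at $b = \bar{y}_i$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def cond_mean_rejection(yi, tau2, sigma2, m=4000):
    """Estimate E[b_i | y_i] = (1/m) sum_k b_i^(k), with the b_i^(k) drawn
    from b_i | y_i by rejection sampling: propose b from the prior N(0, tau2)
    and accept with probability f(y_i | b) / sup_b f(y_i | b)."""
    sd = np.sqrt(sigma2)
    # the supremum of the conditional likelihood over b is attained at mean(y_i)
    log_sup = stats.norm.logpdf(yi, loc=yi.mean(), scale=sd).sum()
    draws = []
    while len(draws) < m:
        b = rng.normal(0.0, np.sqrt(tau2))
        log_acc = stats.norm.logpdf(yi, loc=b, scale=sd).sum() - log_sup
        if np.log(rng.uniform()) < log_acc:
            draws.append(b)
    return np.mean(draws)

yi = np.array([1.0, 1.4, 0.8, 1.2])
est = cond_mean_rejection(yi, tau2=1.0, sigma2=1.0)
# closed-form check (normal-normal): E[b|y] = n*tau2/(sigma2 + n*tau2) * mean(y)
exact = len(yi) * 1.0 / (1.0 + len(yi) * 1.0) * yi.mean()
```

For non-normal responses the acceptance bound changes, but the structure of the sampler, and the fact that the same draws serve both the MCEM algorithm and the random effects estimate, is the same.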
Obtaining the variance of the random effects estimate is not
straightforward. The simplest approach is to use the variance
$$\operatorname{Var}(\mathbf{b}_i \mid \mathbf{y}_i; \boldsymbol{\psi})$$
evaluated at $\hat{\boldsymbol{\psi}}$, but this 'naive' estimate may underestimate the true variance as it
does not account for the sampling variability of $\hat{\boldsymbol{\psi}}$. As an alternative, Booth and
Hobert (1998) suggested using a 'conditional mean square error of prediction' as a
measure of prediction variance. Their approach can also be applied to the multivariate
GLMM.
Note that in the two 'real-life' examples that we consider, estimation of the random
effects is not of particular interest. The subjects are mice and ponies respectively, and
their individual characteristics are a source of variability that needs to be accounted
for but not necessarily precisely estimated for each subject. In other applications, for
example in small area estimation, prediction of the random effects is a very important
objective of the analysis.
4.3 Inference Based on Score Tests
4.3.1 General Theory
An overview of the historical development of the score test is provided in a research
paper by Bera and Bilias (1999). Rao (1947) was the first to introduce the
fundamental principle of testing based on the score function as an alternative to likelihood
ratio and Wald tests. Let $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ be i.i.d. observations with density
$f(\mathbf{y}_i; \boldsymbol{\psi})$. Denote the joint log-likelihood, score and expected information matrix of
those observations by $l(\boldsymbol{\psi})$, $s(\boldsymbol{\psi})$ and $I(\boldsymbol{\psi})$ respectively. Suppose that the interest is
in testing a simple hypothesis against a local alternative,
$$H_0: \boldsymbol{\psi} = \boldsymbol{\psi}_0 \quad \text{vs} \quad H_a: \boldsymbol{\psi} = \boldsymbol{\psi}_0 + \boldsymbol{\Delta},$$
where $\boldsymbol{\psi} = (\psi_1, \ldots, \psi_p)^T$ and $\boldsymbol{\Delta} = (\Delta_1, \ldots, \Delta_p)^T$. Then the score test is based on
$$T_s = s(\boldsymbol{\psi}_0)^T I(\boldsymbol{\psi}_0)^{-1} s(\boldsymbol{\psi}_0),$$
which has an asymptotic $\chi^2_p$ distribution. If the null hypothesis is composite, that is,
$$H_0: h(\boldsymbol{\psi}) = \mathbf{c},$$
where $h(\boldsymbol{\psi})$ is an $r \times 1$ vector function of $\boldsymbol{\psi}$ with $r \le p$ restrictions, and $\mathbf{c}$ is a known
constant vector, Rao (1947) suggested using
$$T_s(\tilde{\boldsymbol{\psi}}) = s(\tilde{\boldsymbol{\psi}})^T I(\tilde{\boldsymbol{\psi}})^{-1} s(\tilde{\boldsymbol{\psi}}),$$
where $\tilde{\boldsymbol{\psi}}$ is the restricted maximum likelihood estimate of $\boldsymbol{\psi}$, that is, the estimate
obtained by maximizing the log-likelihood function $l(\boldsymbol{\psi})$ under $H_0$. $T_s(\tilde{\boldsymbol{\psi}})$ has an
asymptotic $\chi^2_r$ distribution.
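In the simplest cases the score statistic is available in closed form. As a minimal illustration (a textbook one-parameter example, not from this dissertation), consider testing $H_0: \lambda = \lambda_0$ for i.i.d. Poisson observations, where $s(\lambda) = \sum_i y_i/\lambda - n$ and $I(\lambda) = n/\lambda$:

```python
import numpy as np
from scipy import stats

def poisson_score_test(y, lam0):
    """Rao score test of H0: lambda = lam0 for i.i.d. Poisson data.
    s(lam) = sum(y)/lam - n,  I(lam) = n/lam,  T = s(lam0)^2 / I(lam0).
    Only the null value lam0 is needed -- no model fitting is required."""
    n = len(y)
    s = y.sum() / lam0 - n           # score evaluated at the null
    info = n / lam0                  # expected information at the null
    T = s ** 2 / info
    p = stats.chi2.sf(T, df=1)       # asymptotic chi-square(1) reference
    return T, p

y = np.array([3, 5, 4, 6, 2, 4, 5, 3])
T, p = poisson_score_test(y, lam0=3.0)
```

The point this example makes concrete is the computational advantage exploited later in the chapter: the statistic uses only quantities evaluated under the null hypothesis.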
Independently of Rao's work, Neyman (1959) proposed C($\alpha$) tests. These tests
are specifically designed to deal with hypothesis testing of a parameter of primary
interest in the presence of nuisance parameters, and are more general than Rao's score
tests in that any $\sqrt{n}$-consistent estimates for the nuisance parameters can be used,
not only the maximum likelihood estimators. By design, C($\alpha$) tests maximize the
slope of the limiting power function under local alternatives to the null hypothesis.
Neyman assumed the same setup as in Rao's score test but considered the simple
null hypothesis
$$H_0: \psi_1 = \psi_{10},$$
where $\boldsymbol{\psi} = (\psi_1, \boldsymbol{\psi}_2^T)^T$. Notice that $\psi_1$ is a scalar. The score vector and the information
matrix are partitioned as follows:
$$s(\boldsymbol{\psi}) = \begin{pmatrix} s_{\psi_1}(\psi_1, \boldsymbol{\psi}_2) \\ s_{\psi_2}(\psi_1, \boldsymbol{\psi}_2) \end{pmatrix}, \qquad
I(\boldsymbol{\psi}) = \begin{pmatrix} I_{\psi_1\psi_1}(\psi_1, \boldsymbol{\psi}_2) & I_{\psi_1\psi_2}(\psi_1, \boldsymbol{\psi}_2) \\ I_{\psi_2\psi_1}(\psi_1, \boldsymbol{\psi}_2) & I_{\psi_2\psi_2}(\psi_1, \boldsymbol{\psi}_2) \end{pmatrix}.$$
Then the C($\alpha$) score statistic is
$$C(\alpha) = \left[s_{\psi_1}(\psi_{10}, \tilde{\boldsymbol{\psi}}_2) - I_{\psi_1\psi_2}(\psi_{10}, \tilde{\boldsymbol{\psi}}_2) I_{\psi_2\psi_2}^{-1}(\psi_{10}, \tilde{\boldsymbol{\psi}}_2) s_{\psi_2}(\psi_{10}, \tilde{\boldsymbol{\psi}}_2)\right]^T$$
$$\times \left[I_{\psi_1\psi_1}(\psi_{10}, \tilde{\boldsymbol{\psi}}_2) - I_{\psi_1\psi_2}(\psi_{10}, \tilde{\boldsymbol{\psi}}_2) I_{\psi_2\psi_2}^{-1}(\psi_{10}, \tilde{\boldsymbol{\psi}}_2) I_{\psi_2\psi_1}(\psi_{10}, \tilde{\boldsymbol{\psi}}_2)\right]^{-1}$$
$$\times \left[s_{\psi_1}(\psi_{10}, \tilde{\boldsymbol{\psi}}_2) - I_{\psi_1\psi_2}(\psi_{10}, \tilde{\boldsymbol{\psi}}_2) I_{\psi_2\psi_2}^{-1}(\psi_{10}, \tilde{\boldsymbol{\psi}}_2) s_{\psi_2}(\psi_{10}, \tilde{\boldsymbol{\psi}}_2)\right],$$
where $\tilde{\boldsymbol{\psi}}_2$ is a $\sqrt{n}$-consistent estimator of $\boldsymbol{\psi}_2$. Neyman's C($\alpha$) statistic reduces
to Rao's score statistic when $\tilde{\boldsymbol{\psi}}_2$ is the maximum likelihood estimate of $\boldsymbol{\psi}_2$ under
$H_0$.
Bühler and Puri (1966) extended the asymptotic and local optimality of Neyman's
C($\alpha$) statistic to a vector-valued $\psi_1$ and to the case of independent but not necessarily
i.i.d. random variables. They assumed that $\psi_{10}$ was interior to an open set in the
parameter space, but as pointed out later by Moran (1971) and Chant (1974), this
restriction is unnecessary.
Chant (1974) showed that when the parameter is on the boundary of a closed parameter
space, the score test retains its asymptotic properties, while the asymptotic
distributional forms of the test statistics based on the maximum likelihood estimators
are no longer $\chi^2$. In addition to this advantage, the score test has the
computational advantage that only estimates under the null hypothesis are needed
to compute the test statistic. We now use this feature to propose tests for checking
the conditional independence assumption.
4.3.2 Testing the Conditional Independence Assumption
Difficulties in testing the conditional independence of the response variables arise
because of the need to specify more complicated models for the joint response distribution.
Even if score tests are used (which, as discussed, do not require fitting of
more complicated models), extended models need to be specified, and the form of the
score function and of the information matrix need to be derived. We again consider
the bivariate GLMM case for simplicity. A convenient way to introduce conditional
dependence is to use one of the response variables as a covariate in the linear predictor
for the other response variable. In the bivariate case this leads to considering the
following model:
$$y_{i1j} \mid y_{i2j}, \mathbf{b}_{i1} \overset{indep}{\sim} f_1(y_{i1j} \mid y_{i2j}, \mathbf{b}_{i1}; \boldsymbol{\beta}_1, \gamma, \phi_1)$$
$$y_{i2j} \mid \mathbf{b}_{i2} \overset{indep}{\sim} f_2(y_{i2j} \mid \mathbf{b}_{i2}; \boldsymbol{\beta}_2, \phi_2)$$
$$g_1(\mu_{i1j}) = \mathbf{x}_{i1j}^T \boldsymbol{\beta}_1 + \gamma y_{i2j} + \mathbf{z}_{i1j}^T \mathbf{b}_{i1}$$
$$g_2(\mu_{i2j}) = \mathbf{x}_{i2j}^T \boldsymbol{\beta}_2 + \mathbf{z}_{i2j}^T \mathbf{b}_{i2}$$
$$\mathbf{b}_i = \begin{pmatrix} \mathbf{b}_{i1} \\ \mathbf{b}_{i2} \end{pmatrix} \overset{i.i.d.}{\sim} MVN(\mathbf{0}, \boldsymbol{\Sigma}) = MVN\left(\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{12}^T & \boldsymbol{\Sigma}_{22} \end{bmatrix}\right).$$
In general this setup leads to a complicated form of conditional dependence, which
is hard to interpret if there is no natural ordering of the two responses. The case
$\gamma = 0$ corresponds to conditional independence, but testing $\gamma = 0$ in the above model
is performed against a complicated alternative on the marginal scale. When the
identity link function is used for the first response, conditional on the random effects
$\operatorname{Cov}(y_{i1j}, y_{i2j} \mid \mathbf{b}_i) = \gamma \operatorname{Var}(y_{i2j} \mid \mathbf{b}_i)$, and the test is a test of conditional uncorrelatedness of
the two outcomes. If both outcomes are normally distributed then this is truly a test
of conditional independence.
An interesting case to consider, in view of the simulated data example and the
ethylene glycol application, is when one of the responses is normally distributed and
the other one has a Bernoulli distribution. Let in the above general specification $f_1$
be the normal density function and $f_2$ be the Bernoulli probability function. Also
assume that the random effects consist of two random intercepts and let
$$\mu_{i1j} = \mathbf{x}_{i1j}^T \boldsymbol{\beta}_1 + \gamma y^*_{i2j} + b_{i1},$$
where $y^*_{i2j} = 2y_{i2j} - 1$. Then
$$E(y_{i1j} \mid y^*_{i2j} = 1, \mathbf{b}_i) = \mathbf{x}_{i1j}^T \boldsymbol{\beta}_1 + b_{i1} + \gamma,$$
$$E(y_{i1j} \mid y^*_{i2j} = -1, \mathbf{b}_i) = \mathbf{x}_{i1j}^T \boldsymbol{\beta}_1 + b_{i1} - \gamma,$$
and hence testing $H_0: \gamma = 0$ against $H_1: \gamma \neq 0$ is equivalent to testing for a location
shift in the conditional distribution of the normal response.
The score test statistic is as follows:
$$T_s = \frac{s_\gamma^2}{I_{\gamma\gamma} - I_{\gamma\psi_2} I_{\psi_2\psi_2}^{-1} I_{\psi_2\gamma}},$$
where $s_\gamma$ is the element of the score vector corresponding to $\gamma$, $\boldsymbol{\psi}_2$ collects all
remaining parameters, and $I$ is the expected information matrix. Note that even in this simple case neither the score nor the
expected information matrix has a closed-form expression, and hence the score statistic
must be approximated. We again consider Gaussian quadrature and Monte Carlo
approximations.
The log-likelihood is $\ln L = \sum_{i=1}^{n} \ln L_i$, where
$$\ln L_i = \ln \int f_1(\mathbf{y}_{i1} \mid \mathbf{y}_{i2}, b_{i1}; \boldsymbol{\beta}_1, \gamma, \phi_1) f_2(\mathbf{y}_{i2} \mid b_{i2}; \boldsymbol{\beta}_2, \phi_2) f(\mathbf{b}_i; \boldsymbol{\Sigma})\, d\mathbf{b}_i,$$
$$f_1(\mathbf{y}_{i1} \mid \mathbf{y}_{i2}, b_{i1}; \boldsymbol{\beta}_1, \gamma, \phi_1) = (2\pi\sigma^2)^{-n_i/2} \exp\left\{-\frac{1}{2\sigma^2} \sum_{j=1}^{n_i} (y_{i1j} - \mathbf{x}_{i1j}^T \boldsymbol{\beta}_1 - b_{i1} - \gamma y^*_{i2j})^2\right\},$$
$$f_2(\mathbf{y}_{i2} \mid b_{i2}; \boldsymbol{\beta}_2) = \prod_{j=1}^{n_i} \frac{\exp\left(y_{i2j}(\mathbf{x}_{i2j}^T \boldsymbol{\beta}_2 + b_{i2})\right)}{1 + \exp(\mathbf{x}_{i2j}^T \boldsymbol{\beta}_2 + b_{i2})},$$
$$f(\mathbf{b}_i; \boldsymbol{\Sigma}) = |2\pi\boldsymbol{\Sigma}|^{-1/2} \exp\left\{-\frac{1}{2} \mathbf{b}_i^T \boldsymbol{\Sigma}^{-1} \mathbf{b}_i\right\}.$$
For the Monte Carlo approximation we notice that, under the assumption of interchangeability
of the integral and differential signs,
$$s_\gamma = \frac{\partial}{\partial\gamma} \sum_{i=1}^{n} \ln \int f_1(\mathbf{y}_{i1} \mid b_{i1}, \mathbf{y}_{i2}; \boldsymbol{\beta}_1, \gamma, \phi_1) f_2(\mathbf{y}_{i2} \mid b_{i2}; \boldsymbol{\beta}_2, \phi_2) f(\mathbf{b}_i; \boldsymbol{\Sigma})\, d\mathbf{b}_i$$
$$= \sum_{i=1}^{n} \frac{\int \frac{\partial}{\partial\gamma}\left(f_1(\mathbf{y}_{i1} \mid b_{i1}, \mathbf{y}_{i2}; \boldsymbol{\beta}_1, \gamma, \phi_1) f_2(\mathbf{y}_{i2} \mid b_{i2}; \boldsymbol{\beta}_2, \phi_2) f(\mathbf{b}_i; \boldsymbol{\Sigma})\right) d\mathbf{b}_i}{f(\mathbf{y}_i; \boldsymbol{\psi})}$$
$$= \sum_{i=1}^{n} \int \frac{\partial}{\partial\gamma} \ln f_1(\mathbf{y}_{i1} \mid b_{i1}, \mathbf{y}_{i2}; \boldsymbol{\beta}_1, \gamma, \phi_1)\, f(\mathbf{b}_i \mid \mathbf{y}_i; \boldsymbol{\psi})\, d\mathbf{b}_i$$
$$= \sum_{i=1}^{n} E\left(\frac{\partial}{\partial\gamma} \ln f_1(\mathbf{y}_{i1} \mid \mathbf{y}_{i2}, \mathbf{b}_i) \,\Big|\, \mathbf{y}_i\right).$$
The expectation above is taken with respect to the conditional distribution of the
random effects given the response vector. Differentiating with respect to $\gamma$,
$$\frac{\partial}{\partial\gamma} \ln f_1(\mathbf{y}_{i1} \mid \mathbf{y}_{i2}, \mathbf{b}_i) = \frac{1}{\sigma^2} \sum_{j=1}^{n_i} y^*_{i2j} (y_{i1j} - \mathbf{x}_{i1j}^T \boldsymbol{\beta}_1 - b_{i1} - \gamma y^*_{i2j}),$$
and therefore
$$E\left(\frac{\partial}{\partial\gamma} \ln f_1(\mathbf{y}_{i1} \mid \mathbf{y}_{i2}, \mathbf{b}_i) \,\Big|\, \mathbf{y}_i\right) = \frac{1}{\sigma^2} \sum_{j=1}^{n_i} y^*_{i2j} \left(y_{i1j} - \mathbf{x}_{i1j}^T \boldsymbol{\beta}_1 - E(b_{i1} \mid \mathbf{y}_i) - \gamma y^*_{i2j}\right).$$
So, to approximate the score under the null hypothesis we only need to approximate
the conditional mean $E(b_{i1} \mid \mathbf{y}_i)$ by the Monte Carlo sum $\frac{1}{m}\sum_{k=1}^{m} b_{i1}^{(k)}$, where the $b_{i1}^{(k)}$,
$k = 1, \ldots, m$, are generated for the estimation of the standard errors in the MCEM
algorithm (Section 3.2.2). The elements of the observed information matrix $J$, which
can be used in place of $I$, can also be approximated using Louis's method. $J_{\psi_2\psi_2}$
is available from the MCEM algorithm, and only $J_{\gamma\gamma}$, $J_{\gamma\psi_2}$ and $J_{\psi_2\gamma}$ need to be
computed. The latter can also be performed in the procedure for finding the standard
errors of the estimates in the MCEM algorithm.
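Once the conditional means are available, the score itself is a simple sum of residual products. A minimal sketch (the function name and argument layout are hypothetical; data are passed as per-subject arrays):

```python
import numpy as np

def score_gamma(y1, ystar2, eta1, b1_draws, sigma2):
    """Monte Carlo approximation of the score for gamma at gamma = 0:
    s_gamma = (1/sigma2) * sum_i sum_j ystar2_ij * (y1_ij - eta1_ij - E(b_i1|y_i)),
    with E(b_i1 | y_i) replaced by the mean of draws from b_i1 | y_i.

    y1, ystar2, eta1 : lists of per-subject arrays (ystar2 = 2*y2 - 1,
                       eta1 = x'beta_1); b1_draws : list of arrays of
                       conditional draws of b_i1 per subject."""
    s = 0.0
    for yi1, yst, eta, bdr in zip(y1, ystar2, eta1, b1_draws):
        b_mean = np.mean(bdr)                 # Monte Carlo E(b_i1 | y_i)
        s += np.sum(yst * (yi1 - eta - b_mean))
    return s / sigma2
```

The same draws can feed Louis's method for the observed information, so the score statistic comes essentially free with the MCEM standard error computation.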
Gaussian quadrature using numerical derivatives involves approximating the log-likelihood
once and then numerically differentiating with respect to $\gamma$ and the other
parameters to obtain the score and the observed information matrix. Denote the
Gauss-Hermite quadrature approximation of the log-likelihood by $l^{GQ}$ and let $s^{GQ}_\gamma = \frac{\partial l^{GQ}}{\partial\gamma}$ and $J^{GQ} = -\frac{\partial^2 l^{GQ}}{\partial\boldsymbol{\psi}\partial\boldsymbol{\psi}^T}$. Then the approximation to the score statistic is
$$T^{GQ} = \frac{(s^{GQ}_\gamma)^2}{J^{GQ}_{\gamma\gamma} - J^{GQ}_{\gamma\psi_2} \left(J^{GQ}_{\psi_2\psi_2}\right)^{-1} J^{GQ}_{\psi_2\gamma}}.$$
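The two ingredients of this approach, a Gauss-Hermite approximation to the log-likelihood and a finite-difference derivative of it, can be sketched as follows. The model here is a hypothetical random-intercept logistic model with a single intercept parameter, simpler than the bivariate model in the text but illustrating the same mechanics:

```python
import numpy as np

def gh_loglik(beta0, tau, y, n_quad=20):
    """Gauss-Hermite approximation to the log-likelihood of a
    random-intercept logistic model: logit P(y_ij = 1 | b_i) = beta0 + b_i,
    b_i ~ N(0, tau^2).  With the substitution b = sqrt(2)*tau*t, the
    integral becomes (1/sqrt(pi)) * sum_q w_q * prod_j p(y_ij | b_q)."""
    t, w = np.polynomial.hermite.hermgauss(n_quad)
    b = np.sqrt(2.0) * tau * t                    # quadrature nodes on b scale
    total = 0.0
    for yi in y:
        eta = beta0 + b[:, None]                  # (n_quad, 1), broadcasts over j
        pr = 1.0 / (1.0 + np.exp(-eta))
        lik_q = np.prod(np.where(yi == 1, pr, 1 - pr), axis=1)
        total += np.log((w * lik_q).sum() / np.sqrt(np.pi))
    return total

def gh_score_beta0(beta0, tau, y, h=1e-5):
    """Central-difference approximation to the score d l^GQ / d beta0,
    as used to build the approximate score statistic."""
    return (gh_loglik(beta0 + h, tau, y) - gh_loglik(beta0 - h, tau, y)) / (2 * h)
```

Repeating the differencing over all parameters gives the full numerical score and, with second differences, the observed information $J^{GQ}$ needed for $T^{GQ}$.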
The performance of the score statistic for conditional independence is studied
in more detail in Section 4.5. When there are more than two response variables,
this approach to testing for departure from conditional independence becomes very
complicated and not easily interpretable. It is also not easy to decide which variable
to use in the linear predictor for the other one, unless there is a natural ordering.
This issue is discussed in more detail for the ethylene glycol example.
4.3.3 Testing the Significance of Variance Components
Global Variance Components Test
Lin (1997) proposed a global variance component test for testing the significance
of all variance components in the univariate GLMM, which can be extended to the
multivariate GLMM. The null hypothesis for the global test is $H_0: \boldsymbol{\delta} = \mathbf{0}$, where $\boldsymbol{\delta}$ is
the vector of all variance components for the random effects. Suppose for simplicity
that $\phi_1 = \phi_2 = 1$ and that there are two response variables. The generalizations to
arbitrary $\phi_1$ and $\phi_2$, and to more than two response variables, are straightforward.
The form of the score test statistic is
$$T_s(\hat{\boldsymbol{\beta}}) = s_{\delta}(\hat{\boldsymbol{\beta}})^T \left(I_{\delta\delta}(\hat{\boldsymbol{\beta}}) - I_{\delta\beta}(\hat{\boldsymbol{\beta}}) I_{\beta\beta}^{-1}(\hat{\boldsymbol{\beta}}) I_{\beta\delta}(\hat{\boldsymbol{\beta}})\right)^{-1} s_{\delta}(\hat{\boldsymbol{\beta}}),$$
where $\boldsymbol{\beta} = (\boldsymbol{\beta}_1^T, \boldsymbol{\beta}_2^T)^T$ and $\hat{\boldsymbol{\beta}}_1$ and $\hat{\boldsymbol{\beta}}_2$ are the maximum likelihood estimates under $H_0$,
i.e. the maximum likelihood estimators from the two separate fixed effects generalized
linear models for the two response variables. Under $H_0$, $T_s(\hat{\boldsymbol{\beta}})$ has an asymptotic $\chi^2_d$
distribution, where $d$ is the number of variance-covariance parameters for the random
effects.
In the univariate GLMM considered by Lin, the $r$th element of the score vector has
the form
$$s_r(\hat{\boldsymbol{\beta}}) = \frac{1}{2} \sum_{i=1}^{n} \left\{(\mathbf{y}_i - \boldsymbol{\mu}_i)^T \Delta_i W_i Z_i \frac{\partial \boldsymbol{\Sigma}}{\partial \delta_r} Z_i^T W_i \Delta_i (\mathbf{y}_i - \boldsymbol{\mu}_i) - \operatorname{tr}\left(W_{0i} Z_i \frac{\partial \boldsymbol{\Sigma}}{\partial \delta_r} Z_i^T\right)\right\},$$
where $g(\boldsymbol{\mu}_i) = X_i \boldsymbol{\beta}$, $\boldsymbol{\Sigma} = \operatorname{Var}(\mathbf{b}_i)$, and the derivatives of $\boldsymbol{\Sigma}$ are evaluated at $\boldsymbol{\delta} = \mathbf{0}$. The matrices $\Delta_i$, $W_i$
and $W_{0i}$ are diagonal with elements $\Delta_{ij} = g'(\mu_{ij})$, $W_{ij} = \left(V(\mu_{ij})\{g'(\mu_{ij})\}^2\right)^{-1}$ and
$W_{0ij} = W_{ij} + e_{ij}(y_{ij} - \mu_{ij})$, where $e_{ij}$ is a known function of $\mu_{ij}$ (given by Lin, 1997) in general and $e_{ij} = 0$
for canonical link functions. The subscript $j$ refers to the $j$th observation on the $i$th
subject.
Following step by step Lin's derivation for the univariate GLMM, the corresponding
$r$th element of the score function for the multivariate GLMM is
$$s_{\delta_r}(\hat{\boldsymbol{\beta}}) = \frac{1}{2} \sum_{i=1}^{n} \left\{(\mathbf{y}_{i1} - \boldsymbol{\mu}_{i1})^T \Delta_{i1} W_{i1} Z_{i1} \frac{\partial \boldsymbol{\Sigma}_{11}}{\partial \delta_r} Z_{i1}^T W_{i1} \Delta_{i1} (\mathbf{y}_{i1} - \boldsymbol{\mu}_{i1}) - \operatorname{tr}\left(W_{0i1} Z_{i1} \frac{\partial \boldsymbol{\Sigma}_{11}}{\partial \delta_r} Z_{i1}^T\right)\right\}$$
$$+ \frac{1}{2} \sum_{i=1}^{n} \left\{(\mathbf{y}_{i2} - \boldsymbol{\mu}_{i2})^T \Delta_{i2} W_{i2} Z_{i2} \frac{\partial \boldsymbol{\Sigma}_{22}}{\partial \delta_r} Z_{i2}^T W_{i2} \Delta_{i2} (\mathbf{y}_{i2} - \boldsymbol{\mu}_{i2}) - \operatorname{tr}\left(W_{0i2} Z_{i2} \frac{\partial \boldsymbol{\Sigma}_{22}}{\partial \delta_r} Z_{i2}^T\right)\right\}$$
$$+ \sum_{i=1}^{n} (\mathbf{y}_{i1} - \boldsymbol{\mu}_{i1})^T \Delta_{i1} W_{i1} Z_{i1} \frac{\partial \boldsymbol{\Sigma}_{12}}{\partial \delta_r} Z_{i2}^T W_{i2} \Delta_{i2} (\mathbf{y}_{i2} - \boldsymbol{\mu}_{i2}), \qquad (4.2)$$
where the subscripts 1 and 2 refer to the parts of the vectors and matrices corresponding
to the first and to the second variable respectively.
The proof is as follows. Let
$$l_i(\mathbf{b}_i) = l_{i1}(b_{i1}) + l_{i2}(b_{i2}) = \ln\left(f_1(\mathbf{y}_{i1} \mid \mathbf{b}_{i1}; \boldsymbol{\beta}_1) f_2(\mathbf{y}_{i2} \mid \mathbf{b}_{i2}; \boldsymbol{\beta}_2)\right).$$
Then the marginal likelihood for the $i$th subject is
$$f(\mathbf{y}_i; \boldsymbol{\beta}, \boldsymbol{\Sigma}) = E_{\mathbf{b}_i}[\exp(l_i(\mathbf{b}_i))],$$
where the expectation is taken with respect to the marginal distribution of $\mathbf{b}_i$. Expanding
the integrand in a multivariate Taylor series around $\mathbf{b}_i = \mathbf{0}$ we get
$$\exp(l_i(\mathbf{b}_i)) = \exp(l_i(\mathbf{0}))\left[1 + \frac{\partial l_i(\mathbf{0})}{\partial \mathbf{b}_i^T} \mathbf{b}_i + \frac{1}{2} \mathbf{b}_i^T \left(\frac{\partial l_i(\mathbf{0})}{\partial \mathbf{b}_i} \frac{\partial l_i(\mathbf{0})}{\partial \mathbf{b}_i^T} + \frac{\partial^2 l_i(\mathbf{0})}{\partial \mathbf{b}_i \partial \mathbf{b}_i^T}\right) \mathbf{b}_i + \epsilon_i\right],$$
where $\epsilon_i$ contains third and higher order terms of $\mathbf{b}_i$. Notice that
$$\frac{\partial l_i(\mathbf{b}_i)}{\partial \mathbf{b}_i} = Z_i^T \frac{\partial l_i(\mathbf{b}_i)}{\partial \boldsymbol{\eta}_i}$$
and
$$\frac{\partial^2 l_i(\mathbf{b}_i)}{\partial \mathbf{b}_i \partial \mathbf{b}_i^T} = Z_i^T \frac{\partial^2 l_i(\mathbf{b}_i)}{\partial \boldsymbol{\eta}_i \partial \boldsymbol{\eta}_i^T} Z_i,$$
where $\boldsymbol{\eta}_i$ is the vector of linear predictors for the $i$th subject. Then, taking expectation
and using the moment assumptions for $\mathbf{b}_i$,
$$E_{\mathbf{b}_i}[\exp(l_i(\mathbf{b}_i))] = \exp(l_i(\mathbf{0}))\left[1 + \frac{1}{2} \operatorname{tr}\left(Z_i^T \left(\frac{\partial l_i(\mathbf{0})}{\partial \boldsymbol{\eta}_i} \frac{\partial l_i(\mathbf{0})}{\partial \boldsymbol{\eta}_i^T} + \frac{\partial^2 l_i(\mathbf{0})}{\partial \boldsymbol{\eta}_i \partial \boldsymbol{\eta}_i^T}\right) Z_i \boldsymbol{\Sigma}\right) + r_i\right]$$
and the marginal log-likelihood for the $i$th subject is
$$\ln f(\mathbf{y}_i; \boldsymbol{\beta}, \boldsymbol{\Sigma}) = l_i(\mathbf{0}) + \frac{1}{2} \operatorname{tr}\left(Z_i^T \left(\frac{\partial l_i(\mathbf{0})}{\partial \boldsymbol{\eta}_i} \frac{\partial l_i(\mathbf{0})}{\partial \boldsymbol{\eta}_i^T} + \frac{\partial^2 l_i(\mathbf{0})}{\partial \boldsymbol{\eta}_i \partial \boldsymbol{\eta}_i^T}\right) Z_i \boldsymbol{\Sigma}\right) + r_i.$$
Here $r_i$ contains terms that are products of variance components, and its derivative
will be $0$ when evaluated under $H_0$. Now we must take into consideration that there
are two response variables. Because $l_{i1}$ and $l_{i2}$ depend on different sets of random
effects,
$$\frac{\partial l_i(\mathbf{b}_i)}{\partial \boldsymbol{\eta}_i} = \begin{pmatrix} \dfrac{\partial l_{i1}(b_{i1})}{\partial \boldsymbol{\eta}_{i1}} \\[2ex] \dfrac{\partial l_{i2}(b_{i2})}{\partial \boldsymbol{\eta}_{i2}} \end{pmatrix}$$
and
$$\frac{\partial^2 l_i(\mathbf{b}_i)}{\partial \boldsymbol{\eta}_i \partial \boldsymbol{\eta}_i^T} = \begin{pmatrix} \dfrac{\partial^2 l_{i1}(b_{i1})}{\partial \boldsymbol{\eta}_{i1} \partial \boldsymbol{\eta}_{i1}^T} & \mathbf{0} \\[2ex] \mathbf{0} & \dfrac{\partial^2 l_{i2}(b_{i2})}{\partial \boldsymbol{\eta}_{i2} \partial \boldsymbol{\eta}_{i2}^T} \end{pmatrix}.$$
Then
$$\ln f(\mathbf{y}_i; \boldsymbol{\beta}, \boldsymbol{\Sigma}) = l_{i1}(\mathbf{0}) + l_{i2}(\mathbf{0})$$
$$+ \frac{1}{2} \operatorname{tr}\left(Z_{i1}^T \left(\frac{\partial l_{i1}(\mathbf{0})}{\partial \boldsymbol{\eta}_{i1}} \frac{\partial l_{i1}(\mathbf{0})}{\partial \boldsymbol{\eta}_{i1}^T} + \frac{\partial^2 l_{i1}(\mathbf{0})}{\partial \boldsymbol{\eta}_{i1} \partial \boldsymbol{\eta}_{i1}^T}\right) Z_{i1} \boldsymbol{\Sigma}_{11}\right)$$
$$+ \frac{1}{2} \operatorname{tr}\left(Z_{i2}^T \left(\frac{\partial l_{i2}(\mathbf{0})}{\partial \boldsymbol{\eta}_{i2}} \frac{\partial l_{i2}(\mathbf{0})}{\partial \boldsymbol{\eta}_{i2}^T} + \frac{\partial^2 l_{i2}(\mathbf{0})}{\partial \boldsymbol{\eta}_{i2} \partial \boldsymbol{\eta}_{i2}^T}\right) Z_{i2} \boldsymbol{\Sigma}_{22}\right)$$
$$+ \operatorname{tr}\left(Z_{i2}^T \frac{\partial l_{i2}(\mathbf{0})}{\partial \boldsymbol{\eta}_{i2}} \frac{\partial l_{i1}(\mathbf{0})}{\partial \boldsymbol{\eta}_{i1}^T} Z_{i1} \boldsymbol{\Sigma}_{12}\right) + r_i.$$
To obtain (4.2) one uses the fact that $l_{i1}$ and $l_{i2}$ do not depend on $\boldsymbol{\delta}$ and that for
exponential family distributions
$$\frac{\partial l_{ik}(\mathbf{0})}{\partial \boldsymbol{\eta}_{ik}} = \Delta_{ik} W_{ik} (\mathbf{y}_{ik} - \boldsymbol{\mu}_{ik}), \qquad \frac{\partial^2 l_{ik}(\mathbf{0})}{\partial \boldsymbol{\eta}_{ik} \partial \boldsymbol{\eta}_{ik}^T} = -W_{0ik},$$
for $k = 1, 2$.
Notice that in the bivariate GLMM it is likely that the two responses require
different variance components, in which case the expressions for the elements of the
score vector above simplify. Suppose that the random effects variance-covariance
matrix has the form
$$\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11}(\boldsymbol{\delta}_1) & \boldsymbol{\Sigma}_{12}(\boldsymbol{\delta}_{12}) \\ \boldsymbol{\Sigma}_{12}^T(\boldsymbol{\delta}_{12}) & \boldsymbol{\Sigma}_{22}(\boldsymbol{\delta}_2) \end{pmatrix}, \qquad (4.3)$$
where $\boldsymbol{\delta}_1$, $\boldsymbol{\delta}_2$ and $\boldsymbol{\delta}_{12}$ are different parameter vectors. Then
$$\frac{\partial \boldsymbol{\Sigma}_{22}}{\partial \boldsymbol{\delta}_1} = \frac{\partial \boldsymbol{\Sigma}_{12}}{\partial \boldsymbol{\delta}_1} = \frac{\partial \boldsymbol{\Sigma}_{11}}{\partial \boldsymbol{\delta}_2} = \frac{\partial \boldsymbol{\Sigma}_{12}}{\partial \boldsymbol{\delta}_2} = \frac{\partial \boldsymbol{\Sigma}_{11}}{\partial \boldsymbol{\delta}_{12}} = \frac{\partial \boldsymbol{\Sigma}_{22}}{\partial \boldsymbol{\delta}_{12}} = \mathbf{0}$$
and the score vector is $s_{\delta} = (s_{\delta_1}^T, s_{\delta_2}^T, s_{\delta_{12}}^T)^T$ with elements
$$s_{\delta_{1r}} = \frac{1}{2} \sum_{i=1}^{n} \left\{(\mathbf{y}_{i1} - \boldsymbol{\mu}_{i1})^T \Delta_{i1} W_{i1} Z_{i1} \frac{\partial \boldsymbol{\Sigma}_{11}}{\partial \delta_{1r}} Z_{i1}^T W_{i1} \Delta_{i1} (\mathbf{y}_{i1} - \boldsymbol{\mu}_{i1}) - \operatorname{tr}\left(W_{0i1} Z_{i1} \frac{\partial \boldsymbol{\Sigma}_{11}}{\partial \delta_{1r}} Z_{i1}^T\right)\right\},$$
$$s_{\delta_{2r}} = \frac{1}{2} \sum_{i=1}^{n} \left\{(\mathbf{y}_{i2} - \boldsymbol{\mu}_{i2})^T \Delta_{i2} W_{i2} Z_{i2} \frac{\partial \boldsymbol{\Sigma}_{22}}{\partial \delta_{2r}} Z_{i2}^T W_{i2} \Delta_{i2} (\mathbf{y}_{i2} - \boldsymbol{\mu}_{i2}) - \operatorname{tr}\left(W_{0i2} Z_{i2} \frac{\partial \boldsymbol{\Sigma}_{22}}{\partial \delta_{2r}} Z_{i2}^T\right)\right\},$$
$$s_{\delta_{12r}} = \sum_{i=1}^{n} (\mathbf{y}_{i1} - \boldsymbol{\mu}_{i1})^T \Delta_{i1} W_{i1} Z_{i1} \frac{\partial \boldsymbol{\Sigma}_{12}}{\partial \delta_{12r}} Z_{i2}^T W_{i2} \Delta_{i2} (\mathbf{y}_{i2} - \boldsymbol{\mu}_{i2}).$$
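Each element of the cross score $s_{\delta_{12r}}$ is just a bilinear form in the two weighted residual vectors. A minimal sketch of a single subject's contribution (all matrix names are hypothetical placeholders for the quantities defined above):

```python
import numpy as np

def cross_score(resid1, resid2, D1, W1, Z1, D2, W2, Z2, dSigma12):
    """One subject's contribution to the cross score s_{delta12,r}:

        (y1 - mu1)' D1 W1 Z1 (dSigma12/d delta_r) Z2' W2 D2 (y2 - mu2)

    resid1, resid2 : residual vectors for the two responses
    D*, W*         : diagonal weight matrices Delta_i and W_i
    Z*             : random-effect design matrices
    dSigma12       : derivative of Sigma_12 w.r.t. the tested parameter."""
    left = Z1.T @ (W1 @ (D1 @ resid1))    # weighted, projected residual, response 1
    right = Z2.T @ (W2 @ (D2 @ resid2))   # weighted, projected residual, response 2
    return float(left @ dSigma12 @ right)
```

Summing this quantity over subjects and evaluating at the separate-model fits gives $s_{\delta_{12r}}$, which depends on the joint model only through $\partial\boldsymbol{\Sigma}_{12}/\partial\delta_{12r}$.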
Lin showed that the information matrix in the univariate GLMM depends only on
the first two moments of the response variables, and its elements can be expressed in
closed form for exponential family responses. It is easy to verify that the information
matrix for the multivariate GLMM also depends only on the first two moments of the
two response variables and does not contain more complicated expressions than its
simpler counterpart. The latter property is due to the independence of the response
variables under $H_0$. Note that
$$I_{\delta\delta} = E(s_{\delta} s_{\delta}^T), \qquad I_{\beta\delta} = E(s_{\beta} s_{\delta}^T), \qquad I_{\beta\beta} = E(s_{\beta} s_{\beta}^T),$$
where the expectations are computed at $\boldsymbol{\delta} = \mathbf{0}$. Consider only $I_{\delta\delta}$ for now, and
let us assume that $\boldsymbol{\Sigma}$ has the structure in (4.3). Then $I_{\delta_1\delta_1}$ is exactly the same
as in a univariate GLMM and hence can be expressed as proposed by Lin. On the
other hand, $I_{\delta_1\delta_2} = E(s_{\delta_1} s_{\delta_2}^T) = E(s_{\delta_1}) E(s_{\delta_2}^T) = \mathbf{0}$ under $H_0$, because the two score
vectors depend on different response variables, which are independent under the null
hypothesis. Also, the elements of $I_{\delta_1\delta_{12}}$ are sums of terms of the form
$$\sum_{i=1}^{n} \sum_{i'=1}^{n} E\left[h_i(\mathbf{y}_{i1})(\mathbf{y}_{i'2} - \boldsymbol{\mu}_{i'2})\right] = \sum_{i=1}^{n} \sum_{i'=1}^{n} E[h_i(\mathbf{y}_{i1})]\, E[\mathbf{y}_{i'2} - \boldsymbol{\mu}_{i'2}] = \mathbf{0}.$$
Here $h_i(\mathbf{y}_{i1})$ is a function of the first response variable $\mathbf{y}_{i1}$ only. Similarly, all other
parts of the information matrix which correspond to partial derivatives with respect
to parameters for different response variables are zero, and hence the expected information
matrix has the form
$$I = \begin{pmatrix}
I_{\beta_1\beta_1} & \mathbf{0} & I_{\beta_1\delta_1} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & I_{\beta_2\beta_2} & \mathbf{0} & I_{\beta_2\delta_2} & \mathbf{0} \\
I_{\beta_1\delta_1}^T & \mathbf{0} & I_{\delta_1\delta_1} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & I_{\beta_2\delta_2}^T & \mathbf{0} & I_{\delta_2\delta_2} & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & I_{\delta_{12}\delta_{12}}
\end{pmatrix}$$
and the score statistic separates as follows:
$$T_s = s_{\delta_1}^T \left(I_{\delta_1\delta_1} - I_{\delta_1\beta_1} I_{\beta_1\beta_1}^{-1} I_{\beta_1\delta_1}\right)^{-1} s_{\delta_1}$$
$$+ s_{\delta_2}^T \left(I_{\delta_2\delta_2} - I_{\delta_2\beta_2} I_{\beta_2\beta_2}^{-1} I_{\beta_2\delta_2}\right)^{-1} s_{\delta_2}$$
$$+ s_{\delta_{12}}^T I_{\delta_{12}\delta_{12}}^{-1} s_{\delta_{12}}.$$
This factorization appears only when the variance-covariance matrix is structured as
in (4.3); otherwise the expression is more complicated but still depends only on the
first two moments of the response. The key to proving this is to notice that the
highest order expectation that needs to be computed is of the form $E(y_{ij} - \mu_{ij})^4$.
The same is true for the univariate GLMM, and hence Lin's arguments can be directly
applied.
Lin proves that the global score statistic in the univariate GLMM asymptotically follows a chi-squared
distribution with $d$ degrees of freedom ($d$ is equal to the number of random
effects) under $\boldsymbol{\delta} = \mathbf{0}$. The asymptotic result holds when the number
of subjects goes to infinity and the number of observations on each subject remains
bounded. In the multivariate GLMM the asymptotic distribution is also $\chi^2$, but with
the number of degrees of freedom adjusted accordingly.
The global score statistic is not very useful even in the GLMM context, because it
tests the significance of all variance components simultaneously, while in most cases
it will be more interesting to check a subset of the variance components. But in the
multivariate GLMM the global score test is even less appealing. Suppose that
the test is performed and that the null hypothesis is rejected. What information can
one get from that result? It will not be clear whether the rejection occurred because
of extra variability in one of the variables, or in the other one, or in both. It may be
more meaningful to perform score tests for each variable separately and then check
for correlation between the two responses.
Lin also develops score tests for specific variance components in the independent
random effects model. In contrast to the global test, here the score vector and the
efficient information matrix cannot in general be computed in closed form, and Lin
uses Laplace approximations. Not surprisingly, the approximation to the score statistic
does not work well in the binary case, as demonstrated by some simulations that
she performed. Lin's score tests can be used with the Breslow and Clayton, and the
Wolfinger and O'Connell, methods but are not adequate if used with Gaussian quadrature
or the MCEM algorithm. In that case it is natural to try to develop score tests
based on numerical or stochastic approximations. We now discuss a direct approach,
which is of limited use, and an indirect approach, which is more complicated but is
especially suited for variance components on the boundary of the parameter space.
Tests for Individual Variance Components
We consider a bivariate GLMM for simplicity. Suppose one is interested in testing
$H_0: \boldsymbol{\psi}_1 = \mathbf{0}$, where $\boldsymbol{\psi}_1$ is a subset of the variance components for the random effects.
Also let $\boldsymbol{\psi}_1$ have $L$ elements and let $\boldsymbol{\psi} = (\boldsymbol{\psi}_1^T, \boldsymbol{\psi}_2^T)^T$. The score statistic is
$$T_s = s_{\psi_1}^T(\tilde{\boldsymbol{\psi}}) \left(I_{\psi_1\psi_1} - I_{\psi_1\psi_2} I_{\psi_2\psi_2}^{-1} I_{\psi_2\psi_1}\right)^{-1} s_{\psi_1}(\tilde{\boldsymbol{\psi}}), \qquad T_s \sim \chi^2_L.$$
To develop a Monte Carlo approximation, we can try to follow the approach we used
for the conditional independence test. Under the assumption of interchangeability of
the integral and differential signs, the score vector can be rewritten as follows:
$$s_{\psi_1} = \sum_{i=1}^{n} E\left(\frac{\partial}{\partial \boldsymbol{\psi}_1} \ln f(\mathbf{b}_i; \boldsymbol{\Sigma}) \,\Big|\, \mathbf{y}_i; \boldsymbol{\psi}\right).$$
The random effects density $f(\mathbf{b}_i; \boldsymbol{\Sigma})$ is multivariate normal for our models, and hence
it is possible to obtain expressions for the partial derivatives inside the expectation
using the approach of Jennrich and Schluchter as outlined in Section 3.2.1. There is
no closed-form expression for the score vector, but we can approximate it by
$$\frac{1}{m} \sum_{i=1}^{n} \sum_{k=1}^{m} \frac{\partial}{\partial \boldsymbol{\psi}_1} \ln f(\mathbf{b}_i^{(k)}; \boldsymbol{\Sigma}),$$
where the $\mathbf{b}_i^{(k)}$ are simulated values from the conditional distribution $\mathbf{b}_i \mid \mathbf{y}_i$. As mentioned
before, the expected information matrix is much harder to work with, and hence the
observed information matrix can be approximated using Louis' method as shown in
Section 3.2.2. Notice though that the score vector must be evaluated at $\boldsymbol{\psi}_1 = \mathbf{0}$ and
at the restricted maximum likelihood estimates $\tilde{\boldsymbol{\psi}}_2$. Depending on the subset of
the variance components tested, the derivative at $\boldsymbol{\psi}_1 = \mathbf{0}$ may not exist. Consider
for example the case when $\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$ and when the null hypothesis is $H_0:$
$\sigma_1^2 = \sigma_{12} = 0$. Then $\boldsymbol{\Sigma}$ is singular under $H_0$ and we cannot evaluate the derivative.
If we test only $H_0: \sigma_{12} = 0$ then there is no problem with the test. In this case
the parameter is not on the boundary under the null hypothesis, but the score test
is useful because the correlation between the response variables can be tested from
the fit of two separate GLMMs. Recall that univariate GLMMs can be fitted using
standard software such as PROC NLMIXED in SAS. Hence one can decide whether
there is a need to fit the two responses together based only on univariate analyses.
An alternative method to compute an approximation to the score statistic above
is to use Gauss-Hermite quadrature. Two possible approaches can be followed. The
easier one is to compute first and second order numerical derivatives of the log-likelihood
and then compute the score statistic based on them. Exact derivatives
might be useful if the number of observations per subject is not very large. Both
approaches are inapplicable if the tested subset of variance components leads to a
non-positive-definite $\boldsymbol{\Sigma}$.
Denote the Gauss-Hermite quadrature approximation of the log-likelihood by $l^{GQ}$.
Also let $s^{GQ} = \frac{\partial l^{GQ}}{\partial \boldsymbol{\psi}}$ and $J^{GQ} = -\frac{\partial^2 l^{GQ}}{\partial \boldsymbol{\psi} \partial \boldsymbol{\psi}^T}$, i.e. the score and the observed information
matrix are approximated by the first and second order numerical derivatives of
the Gauss-Hermite approximation to the log-likelihood. The score statistic for testing
$H_0: \boldsymbol{\psi}_1 = \mathbf{0}$ is then approximated by
$$T_s^{GQ} = s_{\psi_1}^{GQ\,T} \left(J_{\psi_1\psi_1}^{GQ} - J_{\psi_1\psi_2}^{GQ} \left(J_{\psi_2\psi_2}^{GQ}\right)^{-1} J_{\psi_2\psi_1}^{GQ}\right)^{-1} s_{\psi_1}^{GQ},$$
where the score and the information matrix are evaluated at $\boldsymbol{\psi}_1 = \mathbf{0}$ and at the
restricted maximum likelihood estimates $\tilde{\boldsymbol{\psi}}_2$.
As mentioned before, this direct approach to testing for variance components will
not work in many interesting cases when the parameters are on the boundary of
the parameter space. To be able to handle such a problem, the integrand should be
modified to avoid dependence on the random effects with zero variances. Notice that
if we want to test whether, for example, $\sigma_k^2 = 0$, we have to set all corresponding
covariance terms $\sigma_{kk'}$ equal to $0$ as well. Suppose we want to test $\boldsymbol{\psi}_1 = \mathbf{0}$ and

Full Text 
xml version 1.0 encoding UTF8
REPORT xmlns http:www.fcla.edudlsmddaitss xmlns:xsi http:www.w3.org2001XMLSchemainstance xsi:schemaLocation http:www.fcla.edudlsmddaitssdaitssReport.xsd
INGEST IEID E9HF6LL4M_EJYYKK INGEST_TIME 20140405T00:08:44Z PACKAGE AA00020402_00001
AGREEMENT_INFO ACCOUNT UF PROJECT UFDC
FILES
PAGE 1
MODELS FOR REPEATED MEASURES OF A MULTIVARIATE RESPONSE By RALITZA GUEORGUIEVA A DISSERTATIO PRESE TED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 1999
PAGE 2
Copyright 1999 by Ralitza Gueorguieva
PAGE 3
To my family
PAGE 4
ACKNOWLEDGMENTS I would like to express my deepest gratitude to Dr. Alan Agresti for serving as my dissertation advisor. Without his guidance constant encouragement and valuable advice this work would not have been completed. My appreciation is extended to Drs. James Booth Randy Carter Malay Ghosh and Monika Ardelt for serving on my committee and for helping me with my research. I would also like to thank all the faculty staff and students in the Department of Statistics for their support and friendship. I also wish to acknowledge the members of the Perinatal Data Systems group whose constant encouragement and understanding have helped me in many ways. Special thanks go to Dr. Randy Carter who has supported me since my first day as a graduate student. Finally I would like to express my gratitude to my parents, Vessela and Vladislav for their loving care and confidence in my success to my sister, Annie and her fiance, Nathan for their encouragement and continual support and to my husband, Velizar for his constant love and inspiration. lV
PAGE 5
TABLE OF CONTENTS ACKNOWLEDGMENTS . .. ...... ......... ...... .. ... ................ ... 1v ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vil CHAPTERS 1 I TRODUCTION 1 1.1 Models for Univariate Repeated Measures . . . . . . . . . . . 3 1.1.1 General Linear Models . . . . . . . . . . . . . . . . . 3 1.1.2 1.1.3 1.1 .4 1.1.5 G e neralized Linear Models . . . . . . . . . . . . . . . 4 Marginal Models . . . . . . . . . . . . . . . . . . . . 5 Random Effects Models . . . . . . . . . . . . . . . . 7 'Transition Models ......................... .... ..... .. 1.2 Models for Multivariate Repeated Measures .. .. ..... ...... . 1.3 Simultaneous Modelling of Responses of Different Types ... ... 1.4 Format of Diss e rtation ............................ .... ... .... 9 10 15 17 2 MULTIVARIATE GENERALIZED LINEAR MIXED MODEL ...... 20 2 1 Introduction ...... .... ... .... .. ... ...... .... .. ..... .. .. ..... 20 2 2 Mod e l Definition .. .............. .......... .... .. ......... ..... 22 2.3 Model Properti e s . . . . . . . . . . . . . . . . . . . . . . . 24 2.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3 ESTIMATION IN THE MULTIVARIATE GENERALIZED LINEAR MIXED MODEL . . ... ... . ...... .. ......... ..... ..... ....... .. 29 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2 Maximum Likelihood Estimation ............. .. .... .. ........ 32 3 2.1 GaussHermite Quadrature ........ .... ........... .... 33 3.2.2 Mont e Carlo EM Algorithm . . . . . . . . . . . . . . 38 3.2.3 Pseudolikelihood Approach . ...... .... . .... .... .... 42 3.3 Simulat e d Data Example ........ .... ...... ...... ... . ... .. .... 47 3.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 53 V
PAGE 6
3.4.1 Developmental Toxicity Study in Mice . . . . . . . . . 53 3.4.2 Myoelectric Activity Study in Ponies ... ...... .. ....... 60 3.5 Additional Methods . . . . . . . . . . . . . . . . . . . . . . 69 4 INFERENCE IN THE MULTIVARIATE GENERALIZED LINEAR MIXED MODEL . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.1 Inference about Regression Parameters .......... .............. 74 4.2 Estimation of Random Effects ............ .. .................... 77 4.3 Inference Based on Score Tests . . . . . . . . . . . . . . . . 78 4.3.1 General Theory .... .... ............ ... .. ..... .. .... .... 78 4.3.2 Testing the Conditional Independence Assumption . . 80 4.3.3 Testing the Significance of Variance Components . . . 84 4.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.5 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . 97 4.6 Future Research ............................. .................. .. 102 5 CORRELATED PROBIT MODEL .......... .... ................. ... 104 5.1 Introduction .................. ... .............. .............. .... 105 5.2 Model Definition ................................................ 110 5.3 Maximum Likelihood Estimation ............................... 112 5.3.1 Monte Carlo EM Algorithm ............................. 112 5.3 .2 Stochastic Approximation EM Algorithm ............... 121 5.3.3 Standard Error Approximation ......................... 124 5.4 Application ...................................................... 125 5.5 Simulation Study ............................................... 142 5.6 Identifiability Issue .................................... ......... 145 5.7 Model Extensions .................. ......................... .. . 155 5.8 Future Research ................................................. 157 6 CONCLUSIONS ...................................................... 158 6.1 Summary ........................................................ 
158 6.2 Future Research ................................................. 161 REFERE CES ................................................................. 164 BIOGRAPHICAL SKETCH ........ .......... .. ...... ....... ......... .... ..... 171 vi
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

MODELS FOR REPEATED MEASURES OF A MULTIVARIATE RESPONSE

By Ralitza Gueorguieva

December 1999

Chairman: Alan Agresti
Major Department: Statistics

The goal of this dissertation is to propose and investigate random effects models for repeated measures situations when there are two or more response variables. The emphasis is on maximum likelihood estimation and on applications with outcomes of different types. We propose a multivariate generalized linear mixed model that can accommodate any combination of outcome variables in the exponential family. This model assumes conditional independence between the response variables given the random effects. We also consider a correlated probit model that is suitable for mixtures of binary, continuous, censored continuous, and ordinal outcomes. Although more limited in area of applicability, the correlated probit model allows for a more general correlation structure between the response variables than the corresponding multivariate generalized linear mixed model.

We extend three estimation procedures from the univariate generalized linear mixed model to the multivariate generalization proposed herein. The methods are Gauss-Hermite quadrature, a Monte Carlo EM algorithm, and pseudo-likelihood. Standard error approximations are considered along with parameter estimation. A simulated data example and two real-life examples are used for illustration. We also
consider hypothesis testing based on quadrature and Monte Carlo approximations to the Wald, score and likelihood ratio tests. The performance of the approximations to the test statistics is studied via a small simulation study for checking the conditional independence assumption.

We propose a Monte Carlo EM algorithm for maximum likelihood estimation in the correlated probit model. Because of the computational inefficiency of the algorithm, we consider a modification based on stochastic approximations, which leads to a significant decrease in the time for model fitting. To address the issue of advantages of joint over separate analyses of the response variables, we design a simulation study to investigate possible efficiency gains in a multivariate analysis. A noticeable efficiency gain in the estimated standard errors is observed only in the binary response case, for a small number of subjects and observations per subject and for high correlation between the outcomes. We also briefly consider an identifiability issue for one of the variance components.
CHAPTER 1
INTRODUCTION

Univariate repeated measures occur when one response variable is observed at several occasions for each subject. Hereafter, subject refers to any unit on which a measurement is taken, while occasion corresponds to time or to a specific condition. If more than one response is observed at each occasion, multivariate repeated measures are available. Univariate and multivariate repeated measures are very common in biomedical applications, for example when one or more variables are measured on each patient at a number of hospital visits, or when a number of questions are asked at a series of interviews. But the occasions do not necessarily refer to different times. For instance, dependent responses can be measured on litter mates, on members of the same family, or at different places on a subject's body.

Difficulties in analyzing repeated measures arise because of correlations usually present between observations on the same subject. Statistical methods and estimation techniques are well developed for repeated measures on a univariate normal variable, and lately much research has been dedicated to repeated observations on a binary variable and, more generally, on variables with distributions in the exponential family. Zeger and Liang (1992) provide an overview of methods for longitudinal data, and the books of Lindsey (1993), Diggle, Liang and Zeger (1994), and Fahrmeir and Tutz (1994) cover many details. Pendergast et al. (1996) present a comprehensive survey of models for correlated binary outcomes, including longitudinal data.

However, relatively little attention is concentrated on repeated measures of a multivariate response. General models for this situation are necessarily complex, as two
types of correlations must be taken into account: correlations between measurements on different variables at each occasion, and correlations between measurements at different occasions. Reinsel (1982, 1984), Lundbye-Christensen (1991), Matsuyama and Ohashi (1997), and Heitjan and Sharma (1997) consider models for normally distributed responses. Lefkopoulou, Moore and Ryan (1989), Liang and Zeger (1989), and Agresti (1997) propose models for multivariate binary data; Catalano and Ryan (1992), Fitzmaurice and Laird (1995), and Regan and Catalano (1999) introduce models for clustered bivariate discrete and continuous outcomes. Catalano (1994) considers an extension to ordinal data of the Catalano and Ryan model. Rochon (1996) demonstrates how generalized estimating equations can be used to fit extended marginal models for bivariate repeated measures of discrete or continuous outcomes. Rochon's approach is very general and allows for a large class of response distributions. However, in many cases, especially when subject-specific inference is of primary interest, marginal models are not appropriate as they may lead to attenuation of the estimates of the regression parameters (Zeger, Liang, and Albert, 1988).

Sammel, Ryan and Legler (1997) analyze mixtures of discrete and continuous responses in the exponential family using latent variable models. Their approach is based on numerical or stochastic approximations to maximum likelihood and allows for subject-specific inference. Blackwell and Catalano (1999a, 1999b) consider extensions for ordinal data and for repeated measures of ordinal responses.

The Generalized Linear Mixed Model (GLMM) forms a very general class of subject-specific models for discrete and continuous responses in the exponential family and is used for univariate repeated measures (Fahrmeir and Tutz, 1994).
In the current dissertation we demonstrate how the GLMM approach can be extended to multivariate repeated measures by assuming separate random effects for each outcome variable. This is in contrast to the Sammel et al. approach, in which common underlying latent variables are assumed. We also consider a more general correlated
probit model than the appropriate GLMM for the special case of clustered binary and continuous data.

The introduction to this dissertation contains an overview of approaches for modelling of univariate repeated measures (Section 1.1), existing models for multivariate repeated measures (Section 1.2), and simultaneous modeling of different types of responses (Section 1.3). The chapter concludes with an outline of the dissertation (Section 1.4).

1.1 Models for Univariate Repeated Measures

1.1.1 General Linear Models

Historically, the first models for repeated measures data to be considered were general linear models with correlated normally distributed errors (see Ware, 1982, for a review). The univariate representation of such models is as follows. Suppose each subject i is observed on J occasions and denote the column vector of responses by y_i. Assume that y_i arises from the linear model

y_i = X_i β + ε_i,   (1.1)

where X_i is a J x p model matrix for the i-th individual, β is a p x 1 unknown parameter vector, and ε_i is a J x 1 error vector with a multivariate normal distribution with mean 0 and arbitrary positive definite covariance matrix Σ: ε_i ~ N_J(0, Σ).

Σ can take certain special forms. The case Σ = σ²I corresponds to the usual linear model for cross-sectional data. The equicorrelation structure (Σ = σ²(ρJ + (1 − ρ)I), where J is a matrix of ones, 0 < ρ ≤ 1 and σ² > 0) is appropriate when the repeated measures are on subjects within a cluster, for example when a certain characteristic is observed on each of a number of litter mates. The autoregressive structure Σ = ((a_ij)), with a_ij = σ²ρ^|i−j|, is one of the most popular structures when the observations are over equally-spaced time periods.
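To make the two special covariance structures concrete, here is a small sketch (ours, not code from the dissertation) that builds the equicorrelation and first-order autoregressive matrices; the function names and numeric values are illustrative only.

```python
import numpy as np

def equicorrelation(J, sigma2, rho):
    # Sigma = sigma^2 * (rho * J_ones + (1 - rho) * I): constant variance
    # sigma^2 and constant within-cluster correlation rho
    return sigma2 * (rho * np.ones((J, J)) + (1 - rho) * np.eye(J))

def ar1(J, sigma2, rho):
    # Sigma with (i, j) entry sigma^2 * rho^|i - j|: correlation decays
    # geometrically with the separation between occasions
    idx = np.arange(J)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

Sigma_eq = equicorrelation(4, 2.0, 0.5)  # every off-diagonal covariance is 1.0
Sigma_ar = ar1(4, 2.0, 0.5)              # covariances 1.0, 0.5, 0.25 off the diagonal
```

Either matrix can then be plugged in as Σ in model (1.1) when simulating or fitting by generalized least squares.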
Usually of primary interest in general linear models is the estimation of the regression parameters while recognizing the likely correlation structure in the data. To achieve this, one either assumes an explicit parametric model for the covariance structure, or uses methods of inference that are robust to misspecification of the covariance structure. Weighted least squares, maximum likelihood and restricted maximum likelihood are the most popular estimation methods for general linear models.

1.1.2 Generalized Linear Models

Generalized linear models (GLM) are natural extensions of classical linear models allowing for a larger class of response distributions. Their specification consists of three parts (McCullagh and Nelder, 1989, pp. 27-30): a random component, a systematic component and a link function.

1. The random component is the probability distribution for the elements of the response vector. The y_i's, i = 1, ..., n, are assumed to be independent with a distribution in the exponential family

f(y_i; θ_i, φ) = exp{ [y_i θ_i − b(θ_i)] / a_i(φ) + c(y_i, φ) },   (1.2)

for some specified functions a_i(·), b(·) and c(·). Usually a_i(φ) = φ/w_i, where φ is called a dispersion parameter and the w_i are known weights. The mean μ_i and the variance Var(y_i) completely specify a member of the exponential family, because μ_i = b'(θ_i) and Var(y_i) = b''(θ_i) a_i(φ). For example, for the Poisson distribution b(θ) = exp(θ) and a_i(φ) = 1, so μ_i = exp(θ_i) and Var(y_i) = exp(θ_i) = μ_i. Important exponential family distributions are the normal, the binomial, the Poisson and the gamma distributions.
2. The systematic component is a linear function of the covariates, η_i = x_i'β, where η_i is commonly called a linear predictor.

3. The link function g(·) is a monotonic differentiable function which relates the expected value of the response distribution μ_i to the linear predictor η_i: g(μ_i) = η_i.

When the response distribution is normal and the link is the identity function g(μ) = μ, the GLM reduces to the usual linear regression model. For each member of the exponential family there is a special link function, called the canonical link function, which simplifies model fitting. For that link function, θ_i = η_i. Maximum likelihood estimates in GLM are obtained using iteratively reweighted least squares.

Just as modifications of linear models are used for analyzing Gaussian repeated measures, modifications of GLM can handle discrete and continuous outcomes. Extensions to GLM include marginal, random effects and transition models (Zeger and Liang, 1992). Hereafter we will use y_ij to denote the response at the j-th occasion for the i-th subject. The ranges of the subscripts will be i = 1, ..., n, and j = 1, ..., J for balanced and j = 1, ..., n_i for unbalanced data.

1.1.3 Marginal Models

Marginal models are designed to permit separate modeling of the regression of the response on the predictors and of the association among repeated observations for each individual. The models are defined by specifying expressions for the marginal mean and the marginal variance-covariance matrix of the response:

1. The marginal mean μ_ij = E(y_ij) is related to the predictors by a known link function g(·).
2. The marginal variance is a function of the marginal mean, and the marginal covariance is a function of the marginal means and of additional parameters δ.

Notice that if the correlation is ignored and the variance function is chosen to correspond to an exponential family distribution, the marginal model reduces to a GLM for independent data. But the variance function can be more general, and hence even if the responses are uncorrelated this model is more general than the corresponding GLM. Because only the first two moments are specified for the joint distribution of the response, additional assumptions are needed for likelihood inferences. Alternatively, the Generalized Estimating Equations (GEE) method can be used (Liang and Zeger, 1986; Zeger, Liang and Albert, 1988), as briefly summarized here.

Let y_i = (y_i1, ..., y_in_i)', μ_i = (μ_i1, ..., μ_in_i)', A_i = diag{V(μ_i1), ..., V(μ_in_i)}, and let R(δ) be a working correlation matrix for the i-th subject. The latter means that R(δ) is completely specified up to a parameter vector δ and may or may not be the true correlation matrix. The regression parameters β are then estimated by solving

Σ_i D_i' V_i^{-1}(δ) (y_i − μ_i) = 0,

where D_i = ∂μ_i/∂β and V_i(δ) = A_i^{1/2} R(δ) A_i^{1/2}. Liang and Zeger (1986) show that if the mean function is correctly specified, β̂, the solution to the above equation, is consistent and asymptotically normal as the number of subjects goes to infinity. They also propose a robust variance estimate which is consistent even when the variance-covariance structure is misspecified. Hence the GEE approach is appropriate when the regression relationship, and not the correlation structure of the data, is of primary interest.
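As an illustration (our own sketch, not from the dissertation), for an identity link with unit variance function the estimating equation is linear in β and can be solved directly, with any working correlation R(δ); the data below are simulated with a shared cluster effect so the within-subject correlation is real, and all numeric values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, J = 200, 4

# exchangeable working correlation R(delta)
delta = 0.5
R = (1 - delta) * np.eye(J) + delta * np.ones((J, J))
R_inv = np.linalg.inv(R)

# simulated clustered data: intercept 1, slope 2 on a cluster-level covariate
x = rng.normal(size=n)
cluster_effect = rng.normal(scale=0.7, size=(n, 1))
y = 1.0 + 2.0 * x[:, None] + cluster_effect + rng.normal(size=(n, J))

# identity link, V(mu) = 1: the GEE reduces to the linear system
# sum_i X_i' R^{-1} (y_i - X_i beta) = 0
lhs = np.zeros((2, 2))
rhs = np.zeros(2)
for i in range(n):
    Xi = np.column_stack([np.ones(J), np.full(J, x[i])])
    lhs += Xi.T @ R_inv @ Xi
    rhs += Xi.T @ R_inv @ y[i]
beta_hat = np.linalg.solve(lhs, rhs)  # close to (1, 2) for this simulation
```

For non-identity links the same estimating equation is solved iteratively, but the structure of the update is unchanged.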
1.1.4 Random Effects Models

An important feature of the marginal models is that the regression coefficients have the same interpretation as coefficients from a cross-sectional analysis. These models are preferred when the effects of explanatory variables on the average response within a population are of primary interest. However, when it is of interest to describe how the response for a particular individual changes as a result of a change in the covariates, a more pertinent approach is to consider random (mixed) effects models.

Random effects models assume that the correlation among repeated responses arises because there is a natural heterogeneity across individuals and that this heterogeneity can be represented as a probability distribution. More precisely,

1. The conditional distribution of the response y_ij given a subject-specific vector of random effects b_i satisfies a GLM with a linear predictor x_ij'β + z_ij'b_i, where z_ij in general is a subset of x_ij.

2. The responses on the same subject, y_i1, y_i2, ..., y_in_i, are conditionally independent given b_i.

3. b_i has a certain distribution F(·; δ) with mean 0 and variance-covariance matrix depending on a parameter vector δ.

Such models with a normal distribution for the random effects are considered in greater detail in Section 2.1. In contrast to marginal models, the regression coefficients in random effects models have subject-specific interpretations. To better illustrate that difference, let us consider a particular example. Let the response y_ij be binary and the subscript j refer to time. Consider the marginal model

logit(μ_ij) = β₀ + β₁j,

where μ_ij = E(y_ij), and the random effects model

logit(μ̃_ij) = β₀* + β₁*j + b_i,   b_i ~ i.i.d. N(0, σ²),
where μ̃_ij = E(y_ij | b_i). Then β₁* is the log-odds ratio for a positive response at time j + 1 relative to time j for any subject i, while β₁ is the population-averaged log-odds ratio. That is, β₁ describes the change in log-odds for a positive response from time j to time j + 1 for the population as a whole. In general these two interpretations are different, but in some special cases, such as with identity link functions, the subject-specific and population-averaged interpretations coincide. More discussion on the connection between marginal and random effects models follows in Chapter 2.

The presence of random effects enables the pooling of information across different subjects to result in better subject-specific, as opposed to population-averaged, inference, but complicates the estimation problem considerably. To obtain the likelihood function one has to integrate out the random effects, which, except for a few special cases, cannot be performed analytically. If the random effects are nuisance parameters, conditional likelihood estimates for the fixed effects may be easy to obtain for canonical link functions. This is accomplished by conditioning on the sufficient statistics for the unknown nuisance parameters and then maximizing the conditional likelihood.

When the dimension of the integral is not high, numerical methods such as Gaussian quadrature work well for normally distributed random effects (Fahrmeir and Tutz, 1994, pp. 357-362; Liu and Pierce, 1994). A variety of other methods have recently been proposed to handle more difficult cases. These include the approximate maximum likelihood estimates proposed by Schall (1991), the penalized quasi-likelihood approach of Breslow and Clayton (1993), the hierarchical maximum likelihood of Lee and Nelder (1996), the Gibbs sampling approach of Zeger and Karim (1991), the EM algorithm approach for GLMMs of Booth and Hobert (1999) and of McCulloch (1997), and others.
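For instance (a sketch in our own notation, not the dissertation's code), a 20-point Gauss-Hermite rule approximates one subject's marginal likelihood contribution in a random-intercept logit model; the substitution b = sqrt(2)·σ·t converts the N(0, σ²) integral into the standard Gauss-Hermite form. The responses, linear predictors and σ below are arbitrary illustrative values.

```python
import numpy as np

nodes, weights = np.polynomial.hermite.hermgauss(20)  # 20-point rule

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def marginal_likelihood(y, eta, sigma):
    # Approximate  int prod_j p_j^y_j (1 - p_j)^(1 - y_j) phi(b; 0, sigma^2) db,
    # p_j = expit(eta_j + b), via b = sqrt(2)*sigma*t:
    #   (1 / sqrt(pi)) * sum_k w_k f(sqrt(2)*sigma*t_k)
    total = 0.0
    for t, w in zip(nodes, weights):
        p = expit(eta + np.sqrt(2.0) * sigma * t)
        total += w * np.prod(p ** y * (1.0 - p) ** (1 - y))
    return total / np.sqrt(np.pi)

L_i = marginal_likelihood(np.array([1, 0, 1]), np.array([0.2, -0.1, 0.4]), 1.0)
```

Summing the logs of such contributions over subjects gives the marginal log-likelihood that the quadrature-based methods in this dissertation maximize.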
1.1.5 Transition Models

A third group of models for dealing with longitudinal data consists of transition models. Regression Markov chain models for repeated measures data have been considered by Zeger et al. (1985) and Zeger and Qaqish (1988). This approach involves modeling the conditional expectation of the response at each occasion given past outcomes. Specifically,

1. The conditional expectation of the response, μᶜ_ij = E(y_ij | y_i,j−1, ..., y_i1), depends on previous responses and current covariates as follows:

g(μᶜ_ij) = x_ij'β + Σ_{k=1}^{j−1} γ_k f_k(y_i,j−k),

where f_k(·), k = 1, ..., j − 1, are known functions.

2. The conditional variance of y_ij is a function of the conditional mean,

Var(y_ij | y_i,j−1, ..., y_i1) = φV(μᶜ_ij),

where V is a known function.

Transition models combine the assumptions about the dependence of the response on the explanatory variables and the correlation among repeated observations into a single equation. Conditional maximum likelihood and GEE have both been used for estimating the parameters.

Extensions of GLM are not the only methods for analyzing repeated measures data (see, for example, Vonesh (1992) for an overview of nonlinear models for longitudinal data), but as the proposed models in this dissertation are based on such extensions, our discussion in the following sections will be restricted to GLM types of models.
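To illustrate the transition-model idea (a simulation sketch of our own, taking the logit link, a single lag, f_1 the identity, and hypothetical parameter values), the conditional log-odds in a first-order binary transition model can be recovered from the empirical transition frequencies:

```python
import math
import random

random.seed(4)
beta0, gamma = -0.2, 1.5  # logit E(y_ij | y_i,j-1) = beta0 + gamma * y_i,j-1

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

# simulate n subjects over J occasions, counting transitions from each state
n, J = 5000, 6
n_prev = {0: 0, 1: 0}    # times the previous response was 0 / 1
n_next1 = {0: 0, 1: 0}   # ... of those, times the next response was 1
for _ in range(n):
    y_prev = 1 if random.random() < 0.5 else 0
    for _j in range(1, J):
        y = 1 if random.random() < expit(beta0 + gamma * y_prev) else 0
        n_prev[y_prev] += 1
        n_next1[y_prev] += y
        y_prev = y

p0 = n_next1[0] / n_prev[0]  # estimates P(y_j = 1 | y_{j-1} = 0) = expit(beta0)
p1 = n_next1[1] / n_prev[1]  # estimates P(y_j = 1 | y_{j-1} = 1) = expit(beta0 + gamma)
gamma_hat = math.log(p1 / (1 - p1)) - math.log(p0 / (1 - p0))
```

With covariates in the model this moment matching is no longer available, and conditional maximum likelihood (or GEE) is used instead, as noted above.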
1.2 Models for Multivariate Repeated Measures

In contrast to univariate longitudinal data, very few models have been discussed that specifically deal with multivariate repeated measures. These models are now briefly discussed, starting with normal theory linear models and proceeding with models for discrete outcomes.

A review of general linear models for the analysis of longitudinal studies is provided by Ware (1985). The general multivariate model is defined as in (1.1), but we assume that the J = KL repeated measures on each subject are made on K normally distributed variables rather than on only one normally distributed response. Hence the general multivariate model with unspecified covariance structure can be directly applied to multivariate repeated measures data. However, the number of parameters to be estimated increases quickly as the number of occasions and/or variables increases, and the estimation may become quite burdensome. Special types of correlation structures, such as bivariate autoregressive, can be specified directly as proposed by Galecki (1994). This multivariate linear model is also not well suited for unbalanced or incomplete data.

More parsimonious covariance structures are achieved in random effects models. The linear mixed effects model is defined in two stages. At stage 1 we assume

y_i | b_i = X_i β + Z_i b_i + ε_i,

where ε_i is distributed N_{n_i}(0, σ²I), X_i and Z_i are n_i x p and n_i x q model matrices for the fixed and the random effects respectively, and b_i is a q x 1 random effects vector. At stage 2, b_i ~ N_q(0, Σ), independently of ε_i. This corresponds to a special variance-covariance structure for the i-th subject: Σ_i = σ²I + Z_i Σ Z_i'.

Reinsel (1982) generalized this linear mixed effects model and showed that for balanced multivariate repeated measures models with random effects structure, closed-form solutions exist for both maximum likelihood (ML) and restricted maximum
likelihood (REML) estimates of the mean and covariance parameters. Matsuyama and Ohashi (1997) considered bivariate response mixed effects models that can handle missing data. Choosing a Bayesian viewpoint, they used the Gibbs sampler to estimate the parameters. The model of Reinsel is overly restrictive in some cases, as he prescribes the same growth pattern over all response variables for all individuals. In contrast, Heitjan and Sharma (1997) considered a model for repeated series longitudinal data, where each unit could yield multiple series of the same variable. The error term they used was a sum of a random subject effect and a vector autoregressive process, thus accounting for subject heterogeneity and time dependence in an additive fashion. A straightforward extension for multiple series of observations on distinct variables is possible.

All models discussed so far in this section are appropriate for continuous response variables that can be assumed to be normally distributed. More generally, GEE marginal models can easily accommodate multivariate repeated measures if all response variables have the same discrete or continuous distribution. We now restate the GEE marginal model from Section 1.1 in matrix notation to simplify the multivariate extension. We also consider balanced data. Let y_i represent the vector of observations for the i-th individual (i = 1, 2, ..., n) and let μ_i = E(y_i) be the marginal mean vector. We assume

g(μ_i) = X_i β,

where X_i is the model matrix for the i-th individual, β is an unknown parameter vector and g(·) is a known link function applied componentwise to μ_i. Also, Var(y_ij) = φV(μ_ij), and the working covariance matrix for y_i is V_i(δ) = A_i^{1/2} R(δ) A_i^{1/2}.
If R(δ) is the true correlation matrix, then V_i is the true covariance matrix of y_i. The J responses for each subject are usually repeated measures on the same variable, but as in the normal case they can be repeated observations on two or more outcome variables, as long as they have the same distribution. The only difference from univariate repeated measures is in the specification of the covariance matrix. The estimates of the regression parameters β will be consistent provided that the model for the marginal mean structure is specified correctly, but for better efficiency the working correlation matrix should be close to the true correlation matrix. Several correlation structures for multivariate repeated measures are discussed by Rochon (1996).

As a further extension, Rochon (1996) proposed a model for bivariate repeated measures that could accommodate both continuous and discrete outcomes. He used GEE models, as the one described above, to relate each set of repeated measures to important explanatory variables, and then applied seemingly unrelated regression (SUR; Zellner, 1962) methodology to combine the pair of GEE models into an overall analysis framework. If the response vector for the i-th subject is denoted by y_i = (y_i^(1)', y_i^(2)')', and all relevant quantities for the first and second response are superscripted by 1 and 2 respectively, the SUR model may be written as

g(μ_i) = X_i β,

where X_i = blockdiag(X_i^(1), X_i^(2)), β = (β^(1)', β^(2)')', and g(·) is a compound function consisting of g^(1)(·) and g^(2)(·). The joint covariance matrix among the sets of repeated measures may be written as

V_i(δ) = A_i^{1/2} R(δ) A_i^{1/2},
where R(δ) is the working correlation matrix among the two sets of repeated measures for each subject, and each of its elements is a function of the vector parameter δ. The suggested techniques may be extended to multiple outcome measures.

Rochon's approach provides a great deal of flexibility in modeling the effects of both within-subject and between-subject covariates on discrete and continuous outcomes, and is appropriate when the effects of covariates on the marginal distributions of the response are of interest. If subject-specific inference is preferred, however, a transitional or random effects model should be considered.

Liang and Zeger (1989) suggest a class of Markov chain logistic regression models for multivariate binary time series. Their approach is to model the conditional distribution of each component of the multivariate binary time series given the others. They use pseudo-likelihood estimation methods to reduce the computational burden associated with maximum likelihood estimation.

Liang and Zeger's transitional model is useful when the association among variables at one time is of interest, or when the purpose is to identify temporal relationships among the variables adjusting for covariates. However, one must use caution when interpreting the estimated parameters, because the regression parameter β in the model has a log-odds ratio interpretation conditional on the past and on the other outcomes at time t. Hence, if a covariate influences more than one component of the outcome vector or past outcomes, which is frequently the case, its regression coefficient will capture only that part of its influence that cannot be explained by the other outcomes it is also affecting. Another problem with the model is that in the fitting process all the information on the first q observations is ignored.
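As a small illustration of Rochon's SUR stacking described earlier in this section (all numbers are hypothetical), the two response-specific designs are combined into one block-diagonal model matrix so that a single linear predictor covers both sets of repeated measures:

```python
import numpy as np

# hypothetical designs: J1 = 2 occasions for response 1, J2 = 3 for response 2
X1 = np.array([[1.0, 0.5],
               [1.0, 1.5]])
X2 = np.array([[1.0, 2.0],
               [1.0, 3.0],
               [1.0, 4.0]])

def blockdiag(A, B):
    # X_i = blockdiag(X_i^(1), X_i^(2)); scipy.linalg.block_diag does the same
    out = np.zeros((A.shape[0] + B.shape[0], A.shape[1] + B.shape[1]))
    out[:A.shape[0], :A.shape[1]] = A
    out[A.shape[0]:, A.shape[1]:] = B
    return out

Xi = blockdiag(X1, X2)                  # 5 x 4 stacked design
beta = np.array([0.1, 0.2, -0.3, 0.4])  # (beta^(1)', beta^(2)')'
eta = Xi @ beta                         # linear predictor for both responses
```

The compound link g(·) is then applied componentwise to η, with g^(1) on the first J1 entries and g^(2) on the remaining J2.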
A random effects approach for dealing with multivariate longitudinal data is discussed by Agresti (1997), who develops a multivariate extension of the Rasch model for
repeated measures of a multivariate binary response. One disadvantage of the multivariate Rasch model, shared by many random effects models, is that it cannot model a negative covariance structure among repeated observations on the same variable. Although usually measurements on the same variable within a subject are positively correlated, there are cases when the correlation is negative. One such example is the infection data, first considered by Haber (1986). Frequencies of infection profiles of a sample of 263 individuals for four influenza outbreaks over four consecutive winters in Michigan were recorded. The first and fourth outbreaks are known to be caused by the same virus type, and because contracting influenza during the first outbreak provides an immunity against a subsequent outbreak, a subject's responses for these two outbreaks are negatively correlated. Coull (1997) analyzed these data using the multivariate binomial logit-normal model. As it is a special case of the models that we propose later in this dissertation, we define it here.

Let y = (y_1, ..., y_I)' given π = (π_1, ..., π_I)' be a random vector of independent binomial components with numbers of trials (n_1, ..., n_I)'. Also let logit(π) be N_I(μ, Σ). Then π has a multivariate logistic-normal distribution, and unconditionally y has a multivariate binomial logit-normal mixture distribution. If the I observations correspond to measurements on K variables at L occasions, this model can be used for analyzing multivariate repeated measures data. The mean of the multivariate random effects distribution can be assumed to be a function of covariates, μ = Xβ, and several groups of subjects with the same design matrices X_s, s = 1, ..., S, can be considered.

The multivariate binomial logit-normal model can be regarded as an analog for binary data of the multivariate Poisson-log normal model of Aitchison and Ho (1989) for count data. Aitchison and Ho assume that y = (y_1, ..., y_I)' given θ = (θ_1, ..., θ_I)' are independent Poisson with mean vector θ, and that log(θ) = (log(θ_1), ..., log(θ_I))' is N_I(μ, Σ). Then θ has a multivariate log-normal distribution. Like the multivariate
logit-normal model, the multivariate Poisson-log normal model can be used to model negative correlations and can be extended to incorporate covariates.

Chan and Kuk (1997) also consider random effects models for binary repeated measures, but assume an underlying threshold model with normal errors and random effects and use the probit link. They estimate the parameters via a Monte Carlo EM algorithm, regarding the observations from the underlying continuous model as the complete data. Their approach will be discussed in more detail in Chapter 5, where an extension of their model will be considered.

1.3 Simultaneous Modelling of Responses of Different Types

Difficulties in joint modelling of responses of different types arise because of the need to specify a multivariate joint distribution for the outcome variables. Most research so far has concentrated on simultaneous analysis of binary and continuous responses.

Olkin and Tate (1961) introduced a location model for discrete and continuous outcomes. It is based on a multinomial model for the discrete outcomes and a multivariate Gaussian model for the continuous outcomes conditional on the discrete outcomes. Fitzmaurice and Laird (1995) discussed a generalization of this model, which turns out to be a special case of the partly exponential model introduced by Zhao, Prentice and Self (1992). Partly exponential models for the regression analysis of multivariate (discrete, continuous or mixed) response data are parametrized in terms of the response mean and a general shape parameter. They encompass generalized linear models as well as certain multivariate distributions. A fully parametric approach to estimation leads to asymptotically independent maximum likelihood estimates of the mean and the shape parameters. The score equations for the mean parameters are essentially the same as in GEE. The authors point out two major drawbacks to their approach: one is the computational complexity of full maximum
likelihood estimation of the mean and the shape parameters together; another is the need to specify the shape function for the response. The latter will be especially hard if the response vector is a mixture of discrete and continuous outcomes. Zhao, Prentice and Self conclude that partly exponential models are mainly of theoretical interest and can be used to evaluate properties of other mean estimation procedures.

Cox and Wermuth (1992) compared a number of special models for the joint distribution of qualitative (binary) and quantitative variables. The joint distribution for all bivariate models is based on the marginal distribution of one of the components and a conditional distribution for the other component given the first one. A key distinction is between models in which, for each binary outcome (A), the quantitative response (Y) is assumed to be normally distributed, and models in which the marginal distribution of Y is normal. Typically, simplicity in the marginal distribution of Y corresponds to a fairly complicated conditional distribution of Y and vice versa, but normality often holds at least approximately. Estimation procedures differ from model to model, but essentially the same tests of independence of the two components of the response can be derived. This, however, is not true if trivariate distributions are considered, with two binary and one continuous or with two continuous and one binary component. In that case several different hypotheses of independence and conditional independence can be considered, and depending on the model, sometimes they may not be tested unless a stronger hypothesis of independence is assumed.

Catalano and Ryan (1992), Fitzmaurice and Laird (1995) and Regan and Catalano (1999) considered mixed models for a bivariate response consisting of a binary and of a quantitative variable.
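The latent-trait device used in several of these models can be sketched as follows (our own Monte Carlo illustration, not from the dissertation): the binary outcome A is a thresholded standard normal W that is correlated with the continuous Y, in which case Cov(A, Y) = ρφ(0), where φ is the standard normal density and ρ (here set to a hypothetical 0.6) is the latent correlation.

```python
import math
import random

random.seed(6)
rho = 0.6        # latent correlation between W and Y (hypothetical value)
n = 200_000

cov_sum = 0.0
for _ in range(n):
    w = random.gauss(0.0, 1.0)
    y = rho * w + math.sqrt(1.0 - rho * rho) * random.gauss(0.0, 1.0)
    a = 1.0 if w > 0.0 else 0.0   # observed binary outcome: A = 1{W > 0}
    cov_sum += a * y              # Cov(A, Y) = E[AY] since E[Y] = 0 here

cov_ay = cov_sum / n
# theory: Cov(A, Y) = rho * phi(0) = rho / sqrt(2 * pi), about 0.239 for rho = 0.6
```

This link between the observable mixed-outcome covariance and the latent correlation is what makes the dichotomized-normal formulation identifiable and estimable.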
Catalano and Ryan (1992) and Regan and Catalano (1999) treated the binary variable as a dichotomized continuous latent trait, which had a joint bivariate normal distribution with the other continuous response. Catalano and Ryan (1992) then parametrized the model so that the joint distribution was a product of a standard random effects model for the continuous variable and a correlated probit
model for the discrete variable. Estimation of the parameters of the two models is performed using quasi-likelihood techniques. Catalano (1994) extended the Catalano and Ryan procedure to ordinal instead of binary data. Regan and Catalano (1999) used exact maximum likelihood for estimation, which is computationally feasible because of the equicorrelation assumption between and within the binary and continuous outcomes. The maximum likelihood methodology used by the authors is an extension of the procedure suggested by Ochi and Prentice (1984) for binary data. Fitzmaurice and Laird (1995) assumed a logit model for the binary response and a conditional Gaussian model for the continuous response. Unlike in Catalano and Ryan's model, all regression parameters have marginal interpretations, and the estimates of the regression parameters (based on ML or GEE) are robust to misspecification of the association between the binary and the continuous responses.

Sammel, Ryan and Legler (1997) developed models for mixtures of outcomes in the exponential family. They assumed that all responses are manifestations of one or more common latent variables and that conditional independence between the outcomes given the value of the latent trait held. This allows one to use the EM algorithm with some numerical or stochastic approximations at each E-step. Blackwell and Catalano (1999) extended the Sammel, Ryan and Legler methodology to longitudinal ordinal data by assuming correlated latent variables at each time point. For simplicity of analysis, each outcome is assumed to depend only on one latent variable, and the latent variables are assumed to be independent at each time point.

1.4 Format of Dissertation

The purpose of this dissertation is to propose and investigate random effects models for repeated measures situations when there are two or more response variables.
Of special interest is the case when the response variables are of different types. The dissertation is organized as follows. In Chapter 2 we propose a multivariate generalized linear mixed model which can accommodate any combination of responses in the exponential family. We first describe the usual generalized linear mixed model (GLMM) and then define its extension. The relationship between marginal and conditional moments in the proposed model is briefly discussed and two motivating examples are presented. The key assumption of conditional independence is outlined.

Chapter 3 concentrates on maximum likelihood model fitting methods for the proposed model. Gauss-Hermite quadrature, a Monte Carlo EM algorithm and a pseudo-likelihood approach are extended from the univariate to the multivariate generalized linear mixed model. Standard error approximation is discussed along with point estimation. We use a simulated data example and the two motivating examples to illustrate the proposed methodology. We also address certain issues such as standard error variability and comparison between multivariate and univariate analyses.

In Chapter 4 we consider inference in the multivariate GLMM. We describe hypothesis testing for the fixed effects based on approximations to the Wald and likelihood ratio statistics, and propose score tests for the variance components and for testing the conditional independence assumption. The performance of the Gauss-Hermite quadrature and Monte Carlo approximations to the Wald, score and likelihood ratio statistics is compared via a small simulation study.

Chapter 5 introduces a correlated probit model as an alternative to the multivariate GLMM for binary and continuous data when conditional independence does not hold. We develop a Monte Carlo EM algorithm for maximum likelihood estimation and apply it to one of the motivating examples.
To address the issue of advantages of joint over separate analyses of the response variables, we design a simulation study
to investigate possible efficiency gains in a multivariate analysis. An identifiability issue concerning one of the variance components is also discussed. The dissertation concludes with a summary of the most important findings and discussion of future research topics (Chapter 6).
CHAPTER 2
MULTIVARIATE GENERALIZED LINEAR MIXED MODEL

The Generalized Linear Mixed Model (GLMM) is a very general class of random effects models, well suited for subject-specific inference for repeated measures data. It is a special case of the random effects model defined in Section 1.2. We start this chapter by providing a detailed definition of the GLMM for univariate repeated measures (Section 2.1) and then introduce the multivariate extension (Section 2.2). Some properties of the multivariate GLMM are discussed in Section 2.3, and the chapter concludes with a description of two data sets which will be used for illustration throughout the thesis. Hereafter we omit the superscript c in the conditional mean notation μ.

2.1 Introduction

Let Y_ij denote the jth response observed on the ith subject, j = 1, ..., n_i, i = 1, ..., n. The conditional distribution of Y_ij given an unobserved q x 1 subject-specific random vector b_i is assumed to be in the exponential family with density

f(y_ij | b_i; β, φ) = exp{ [y_ij θ_ij - b(θ_ij)] w_ij / φ + c(y_ij, φ) },

where μ_ij = b'(θ_ij) is the conditional mean, φ is the dispersion parameter, b(·) and c(·) are specific functions corresponding to the type of exponential family, and w_ij are known weights. Also, at this stage it is assumed that g(μ_ij) = g(E(Y_ij | b_i)) = η_ij, where η_ij = x_ij'β + z_ij'b_i is a linear predictor, b_i is a q x 1 random effect, β is a
p x 1 parameter vector, and the design vectors x_ij (p x 1) and z_ij (q x 1) are functions of the covariates.

At the second stage the subject-specific effects b_i are assumed to be i.i.d. N_q(0, Σ). In general the random effects can have other continuous or discrete distributions, but the normal distribution provides a full range of possible covariance structures and we will restrict our attention to this case. As an additional assumption, conditional independence of the observations within and between subjects is required, that is,

f(y | b; β, φ) = ∏_{i=1}^n f(y_i | b_i; β, φ)   with   f(y_i | b_i; β, φ) = ∏_{j=1}^{n_i} f(y_ij | b_i; β, φ),

where y = (y_1', ..., y_n')' and b = (b_1', ..., b_n')' are column vectors of all responses and all random effects respectively.

Maximum likelihood estimation for the GLMM is complicated because the marginal likelihood for the response does not have a closed form expression. Different approaches for dealing with this situation are discussed in Chapter 3. The parameters β in the GLMM have subject-specific interpretations; i.e., they describe the individual's rather than the average population response to changing the covariates. Under the GLMM, the marginal mean is

μ_ij = E(Y_ij) = E(E(Y_ij | b_i)) = ∫ g^{-1}(x_ij'β + z_ij'b_i) f(b_i; Σ) db_i,

and in general g(μ_ij) ≠ x_ij'β if g is a nonlinear function. However, this equation holds approximately if the standard deviations of the random effects distributions are small. Closed-form expressions for the marginal means exist for certain link functions (Zeger, Liang and Albert, 1988). For example, for the identity link function the marginal and conditional moments coincide. For the log link function the marginal mean is μ_ij = exp(x_ij'β + z_ij'Σz_ij/2). For the probit link function, the marginal mean is μ_ij = Φ(x_ij'β (1 + z_ij'Σz_ij)^{-1/2}).
For the logit link an exact closed-form expression for the marginal mean is unavailable, but using a cumulative Gaussian approximation to the logistic function leads to the expression

logit(μ_ij) ≈ a(Σ) x_ij'β,   where   a(Σ) = |c² Σ z_ij z_ij' + I_q|^{-1/2} = (c² z_ij'Σz_ij + 1)^{-1/2}   and   c = 16√3/(15π).

Unfortunately, there are no simple formulae for the higher order marginal moments except in the case of a linear link function. But the second-order moments can also be approximated, and then the GEE approach can be used to fit the mixed model (Zeger, Liang and Albert, 1988).

2.2 Model Definition

Let us first consider bivariate repeated measures. Denote the response vector for the ith subject by y_i = (y_i1', y_i2')', where y_i1 = (y_i11, ..., y_i1n_i)' and y_i2 = (y_i21, ..., y_i2n_i)' are the repeated measurements on the 1st and 2nd variable respectively at n_i occasions. The number of observations for the two variables within a subject need not be the same, and hence it would be more appropriate to denote them by n_i1 and n_i2, but for simplicity we will use n_i. We assume that y_i1j, j = 1, ..., n_i, are conditionally independent given b_i1 with density f_1(·) in the exponential family. Analogously, y_i2j, j = 1, ..., n_i, are conditionally independent given b_i2 with density f_2(·) in the exponential family. Note that f_1 and f_2 need not be the same. Also, y_i1 and y_i2 are conditionally independent given b_i = (b_i1', b_i2')', and the responses on different subjects are independent. Let g_1 and g_2 be appropriate link functions for f_1 and f_2. Denote the conditional means of y_i1j and y_i2j by μ_i1j and μ_i2j respectively. Let μ_i1 = (μ_i11, ..., μ_i1n_i)' and μ_i2 = (μ_i21, ..., μ_i2n_i)'. At stage one of the mixed model specification we assume

g_1(μ_i1) = X_i1 β_1 + Z_i1 b_i1,   (2.1)
g_2(μ_i2) = X_i2 β_2 + Z_i2 b_i2,   (2.2)
where β_1 and β_2 are p_1 x 1 and p_2 x 1 dimensional unknown parameter vectors, X_i1 and X_i2 are n_i x p_1 and n_i x p_2 dimensional design matrices for the fixed effects, Z_i1 and Z_i2 are n_i x q_1 and n_i x q_2 design matrices for the random effects, and g_1 and g_2 are applied componentwise to μ_i1 and μ_i2. At stage two, a joint distribution for b_i1 (q_1 x 1) and b_i2 (q_2 x 1) is specified. The normal distribution is a very good candidate, as it provides a rich covariance structure. Hence, we assume

b_i = (b_i1', b_i2')' ~ i.i.d. MVN(0, Σ),   Σ = [ Σ_11  Σ_12 ; Σ_12'  Σ_22 ],   (2.3)

where Σ, Σ_11 and Σ_22 are in general unknown positive-definite matrices. The extension of this model to higher dimensional multivariate repeated measures is straightforward, but for simplicity of discussion only the bivariate case will be considered.

When Σ_12 = 0 the above model is equivalent to two separate GLMM's for the two outcome variables. Advantages of joint over separate fitting include the ability to answer intrinsically multivariate questions, better control over the type I error rates in multiple tests, and possible gains in efficiency in the parameter estimates. For example, in the developmental toxicity application described in more detail at the end of this section, it is of interest to estimate the dose effect of ethylene glycol on both outcomes (malformation and low fetal weight) simultaneously. One might be interested in the probability of either malformation or low fetal weight at any given dose. This type of question cannot be answered by a univariate analysis as it requires knowledge of the correlation between the outcomes. Also, testing the significance of the dose effect on malformation and fetal weight simultaneously (using a Wald test for example) allows one to keep the significance level fixed at α.
If instead the two outcomes were tested separately, then an adjustment would have to be made to the levels of the individual tests to achieve an overall α level for both comparisons. Multivariate analysis also allows one to borrow strength from the observations on one outcome to estimate the other outcomes. This can lead to more precise estimates
of the parameters and therefore to gains in efficiency. Theoretically the gain will be the greatest if there is a common random effect for all responses (this means Σ_11 = Σ_12 = Σ_22 in our notation). Also, if the number of observations per subject is small and the correlation between the two outcomes is large, the gains in efficiency should increase.

If the two vectors of random effects are perfectly correlated, then this is equivalent to assuming a common latent variable with a multivariate normal distribution which does not depend on covariates. Hence, the multivariate GLMM reduces to a special case of the Sammel, Ryan and Legler model when the latent variable is not modelled as a function of the covariates. It is possible to allow the random effects in our models to depend on covariates, but we have not considered that extension here.

Several other models discussed in previous sections turn out to be special cases of this general formulation. For example, the bivariate Rasch model is obtained by specifying Bernoulli response distributions for both variables, using the logit link function and identity design matrices both for the fixed and for the random effects. The multivariate binomial-logit normal model is also a special case for binary responses with identity design matrix for the random effects and unrestricted variance-covariance structure. The Aitchison and Ho multivariate Poisson log-normal model also falls under this general structure when the response variables are assumed to have a Poisson distribution.

2.3 Model Properties

Exactly as in the GLMM, the conditional moments in the multivariate GLMM are directly modelled while marginal moments are harder to find. The marginal means and the marginal variances of y_i1 and y_i2 for the model defined by (2.1)-(2.3) are the same as those of the GLMM considering one variable at a time:

E(y_i1) = E[μ_i1(β_1, b_i1)],
E(y_i2) = E[μ_i2(β_2, b_i2)],
Var(y_i1) = E[Var(y_i1 | b_i1)] + Var[E(y_i1 | b_i1)] = E[φ_1 V(μ_i1)] + Var[μ_i1],
Var(y_i2) = E[φ_2 V(μ_i2)] + Var[μ_i2],

where V(μ_i1) and V(μ_i2) denote the variance functions corresponding to the exponential family distributions for the two response variables. The marginal covariance matrix between y_i1 and y_i2 is found to be equal to the covariance matrix between the conditional means μ_i1 and μ_i2:

Cov(y_i1, y_i2) = E(y_i1 y_i2') - E(y_i1)E(y_i2')
               = E[E(y_i1 y_i2' | b_i1, b_i2)] - E(μ_i1)E(μ_i2')
               = E[E(y_i1 | b_i1) E(y_i2' | b_i2)] - E(μ_i1)E(μ_i2')
               = Cov(μ_i1, μ_i2).

The latter property is a consequence of the key assumption of conditional independence between the two response variables. This assumption allows one to extend model fitting methods from the univariate to the multivariate GLMM, but it may not hold in certain situations. This issue will be discussed in more detail in Chapter 4, where score tests are proposed for verifying conditional independence.

2.4 Applications

The multivariate GLMM is fitted to two data sets. The first one is a data set from a developmental toxicity study of ethylene glycol (EG) in mice conducted through the National Toxicology Program (Price et al., 1985). The experiment involved four randomly chosen groups of pregnant mice, one group serving as a control and the other three exposed to three different levels of EG during major organogenesis. Following
Table 2.1. Descriptive statistics for the Ethylene Glycol data

                                  Fetal Weight (g)    Malformation
Dose (g/kg)  Dams  Live Fetuses   Mean      SD        Number  Percent
0            25    297            0.972    0.098        1      0.34
0.75         24    276            0.877    0.104       26      9.42
1.50         22    229            0.764    0.107       89     38.86
3.00         23    226            0.704    0.124      126     57.08

sacrifice, measurements were taken on each fetus in the uterus. The two outcome measures on each live fetus of interest to us are fetal weight (continuous) and malformation status (dichotomous). Some descriptive statistics for the data are available in Table 2.1. Fetal weight decreases monotonically with increasing dose, with the average weight ranging from 0.972 g in the control group to 0.704 g in the group administered the highest dose. At the same time the malformation rate increases with dose from 0.3% in the control group to 57% in the group administered the highest dose.

The goal of the analysis is to study the joint effects of increasing dose on fetal weight and on the probability of malformation. The analysis of these data is complicated by the correlations between the repeated measures on fetuses within litter. A multivariate GLMM with random intercepts for each variable allows one to explicitly model the correlation structure within litter and provides subject-specific estimates for the regression parameters.

The second data set is from a study to compare the effects of 9 drugs and a placebo on the patterns of myoelectric activity in the intestines of ponies (Lester et al., 1998a, 1998b, 1998c). For that purpose electrodes are attached to four different areas of the intestines of 6 ponies, and spike burst rate and duration are measured at 18 equally spaced time intervals around the time of each drug administration. Six of the drugs and the placebo are given twice to each pony in a randomized complete block design. The remaining three drugs are not given to all ponies and hence will not be analyzed here. There is a rest period after each drug's administration and no carryover effects
Table 2.2. Descriptive statistics for the pony data

         Duration            Count
Pony   Mean     SD       Mean      SD      Corr.
1      1.03    0.25      77.57    45.53    0.59
2      1.35    0.28     128.25    76.08    0.18
3      1.40    0.36      84.33    46.27    0.45
4      1.27    0.48     111.00    49.66    0.21
5      1.14    0.21      75.44    45.03    0.50
6      1.24    0.40      66.86    48.97    0.32

are expected. The spike burst rate is a count variable reflecting the number of contractions exceeding a certain threshold in 15 minutes. The duration variable reflects the average duration of the contractions in each 15-minute interval. Figure 2.1 shows graphical representations of the averaged responses by drug and time for one of the electrodes and one hour after the drug administration. Table 2.2 shows the sample means, standard deviations and correlations between the two outcome variables by pony for this smaller data set. We analyse a restricted data set for reasons of computational and practical feasibility. This issue is discussed in more detail in Chapter 3.
[Figure 2.1. Count and duration trends over time for the pony data. Each trajectory shows the change in mean response for one of the seven drugs.]
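Before turning to estimation, the covariance property of Section 2.3 is easy to check numerically. Below is a minimal Monte Carlo sketch (all parameter values are hypothetical) for a Poisson and a Bernoulli outcome sharing correlated random intercepts: under conditional independence, the sample covariance of the responses matches that of their conditional means.

```python
import numpy as np

# Monte Carlo illustration (hypothetical parameter values) that, under
# conditional independence, Cov(y1, y2) = Cov(mu1, mu2): the marginal
# covariance between the two outcomes comes entirely from the covariance
# of their conditional means.
rng = np.random.default_rng(2)
n = 1_000_000
Sigma = np.array([[0.5, 0.3], [0.3, 0.8]])
b = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)   # correlated intercepts
mu1 = np.exp(0.2 + b[:, 0])                 # Poisson conditional mean (log link)
mu2 = 1.0 / (1.0 + np.exp(-(0.5 + b[:, 1])))  # Bernoulli conditional mean (logit link)
y1 = rng.poisson(mu1)                       # conditionally independent draws
y2 = rng.binomial(1, mu2)
cov_y = np.cov(y1, y2)[0, 1]
cov_mu = np.cov(mu1, mu2)[0, 1]
print(cov_y, cov_mu)  # approximately equal
```

The agreement holds only because y1 and y2 are drawn independently given the random effects; injecting residual dependence between the two responses would break it, which is what the score tests of Chapter 4 are designed to detect.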
CHAPTER 3
ESTIMATION IN THE MULTIVARIATE GENERALIZED LINEAR MIXED MODEL

This chapter focuses on methods for obtaining maximum likelihood (or approximate maximum likelihood) estimates for the model parameters in the multivariate GLMM. We first mention some approaches proposed for the univariate GLMM (Section 3.1) and then describe in detail extensions of three of those approaches for the multivariate GLMM (Section 3.2). The proposed methods are then illustrated on a simulated data example (Section 3.3) and on the two 'real-life' data sets introduced in Chapter 2 (Section 3.4). Some issues such as standard error variability and advantages of one multivariate versus several univariate analyses are addressed in Sections 3.3 and 3.4. The chapter concludes with discussion of some additional model-fitting methods (Section 3.5).

3.1 Introduction

In the GLMM the marginal likelihood of the response is obtained by integrating out the random effects:

L(ψ; y) = ∏_{i=1}^n ∫ f(y_i | b_i; β, φ) f(b_i; Σ) db_i.

These integrals are intractable in general. A classical numerical approximation
for this type of integral is Gauss-Hermite quadrature. It involves evaluating the integrands at m prespecified quadrature points and substituting weighted sums in place of the intractable integrals for the n subjects. If the number of quadrature points m is large, the approximation can be made very accurate, but to keep the numerical effort low m should be kept as small as possible. Gauss-Hermite quadrature is appropriate when the dimension of the random effects is small. An alternative for high-dimensional random effects is to approximate the integrals by Monte Carlo sums. This involves generating m random values from the random effects distribution for each subject, evaluating the conditional densities f(y_i | b_i; β, φ) at those values, and taking averages. Details on how to perform Gauss-Hermite quadrature and Monte Carlo approximation for the GLMM can be found in Fahrmeir and Tutz (1994, pp. 357-365). We discuss Gauss-Hermite quadrature for the multivariate GLMM in Section 3.2. Liu and Pierce (1994) consider adaptive Gaussian quadrature, which allows a reduction in the required number of quadrature points by centering and scaling them around the mode of the integrand function. This procedure is described in more detail and applied to one of the data examples in Section 3.4.

Indirect approaches to maximum likelihood estimation use the EM algorithm (Dempster, Laird and Rubin, 1977), treating the random effects as the missing data. We apply these methods both to the multivariate GLMM and to the correlated probit model, and therefore we now introduce the basic ideas. Hereafter ψ denotes the vector of all unknown parameters in the problem, and ψ̂ denotes some estimate of ψ. The EM algorithm is an iterative technique for finding maximum likelihood estimates when direct maximization of the observed likelihood f(y; ψ) is not feasible. It involves augmenting the observed data by unobserved data so that maximization at each step of the algorithm is considerably simplified.
The unobserved data are denoted by b, and in the GLMM context these are the random effects. The EM algorithm can be summarized as follows:
1. Select a starting value ψ̂^(0). Set r = 0.
2. Increase r by 1. E-step: calculate E{ln f(y, b; ψ) | y; ψ̂^(r-1)}.
3. M-step: find a value ψ̂^(r) of ψ that maximizes this conditional expectation.
4. Iterate between (2) and (3) until convergence is achieved.

In the GLMM context the complete data is u = (y', b')' and the complete data log-likelihood is given by

ln L_u = Σ_{i=1}^n ln f(y_i | b_i; β, φ) + Σ_{i=1}^n ln f(b_i; Σ).

The rth E-step of the EM algorithm involves computing E(ln L_u | y, ψ̂^(r-1)), and the rth M-step maximizes this quantity with respect to ψ and updates the parameter estimates. Notice that because β and φ enter only the first term of L_u, the M-step with respect to β and φ uses only f(y | b), and so it is similar to a standard generalized linear model computation with the values of b treated as known. Maximizing with respect to Σ is just maximum likelihood using the distribution of b after replacing sufficient statistics with their conditional expected values. In general, the conditional expectations in the E-step cannot be computed in closed form, but Gauss-Hermite or different Monte Carlo approximations can be utilized (Fahrmeir and Tutz, 1994, pp. 362-365; McCulloch, 1997; Booth and Hobert, 1999).

Many authors consider tractable analytical approximations to the likelihood (Breslow and Clayton, 1993; Wolfinger and O'Connell, 1993). Although these methods lead to inconsistent estimates in some cases, they may have considerable advantage in computational speed over exact methods and can be fitted with standard software. Both Breslow and Clayton's and Wolfinger and O'Connell's procedures amount to iterative fitting of normal theory linear mixed models and can be implemented using the
%GLIMMIX macro in SAS. An extension of Wolfinger and O'Connell's approach is considered in Section 3.2.

A Bayesian paradigm with flat priors can also be used to approximate the maximum likelihood estimates, using posterior modes (Fahrmeir and Tutz, 1994, pp. 233-238) or posterior means (Zeger and Karim, 1991). Though the numerator in such computations is the same as for the maximum likelihood calculations, the posterior may not exist for diffuse priors (Natarajan and McCulloch, 1995). This may not be detected using computational techniques such as the Gibbs sampler, and can result in incorrect parameter estimates (Hobert and Casella, 1996).

3.2 Maximum Likelihood Estimation

For simplicity of presentation we again consider the case of only two response variables. The marginal likelihood in the bivariate GLMM is obtained, as in the usual GLMM, by integrating out the random effects:

L(ψ; y) = ∏_{i=1}^n ∫∫ { ∏_{j=1}^{n_i} f_1(y_i1j | b_i1; β_1, φ_1) f_2(y_i2j | b_i2; β_2, φ_2) } f(b_i1, b_i2; Σ) db_i1 db_i2,   (3.1)

where f denotes the multivariate normal density of the random effects. In this section we describe how Gauss-Hermite quadrature, a Monte Carlo EM algorithm, and pseudo-likelihood can be used to obtain estimates in the multivariate GLMM. Both Gaussian quadrature and the Monte Carlo EM algorithm are referred to as 'exact' maximum likelihood methods because they target the exact maximum likelihood estimates. In comparison, methods that use analytical approximations are referred to as 'approximate' maximum likelihood. We now start by describing the 'exact' maximum likelihood methods. Hereafter ψ = Vec(β_1, β_2, φ_1, φ_2, δ), where δ are the unknown variance components in Σ.
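To fix ideas before the formal development, the one-dimensional analogue of an integral like (3.1) can be handled by either route. The sketch below (all values hypothetical) approximates the marginal mean E[logit^{-1}(x'β + b)] with b ~ N(0, σ²) by Gauss-Hermite quadrature and by a Monte Carlo average, and compares both with the closed-form logit approximation quoted in Section 2.1.

```python
import numpy as np

# Gauss-Hermite vs. Monte Carlo approximation of a marginal mean
# E[expit(x*beta + b)], b ~ N(0, sigma^2) -- a 1-D stand-in for the
# integrals in (3.1).  All parameter values are hypothetical.
expit = lambda t: 1.0 / (1.0 + np.exp(-t))
xbeta, sigma = 0.7, 1.2

# Gauss-Hermite: substitute b = sqrt(2)*sigma*d; the weights then carry
# an overall 1/sqrt(pi) normalizing factor.
d, w = np.polynomial.hermite.hermgauss(20)
gh = np.sum(w * expit(xbeta + np.sqrt(2.0) * sigma * d)) / np.sqrt(np.pi)

# Monte Carlo: average the integrand over draws from N(0, sigma^2).
rng = np.random.default_rng(0)
mc = expit(xbeta + rng.normal(0.0, sigma, 500_000)).mean()

# Closed-form approximation: logit(mu) ~ a * x*beta, a = (c^2 sigma^2 + 1)^(-1/2).
c = 16.0 * np.sqrt(3.0) / (15.0 * np.pi)
approx = expit(xbeta / np.sqrt(c**2 * sigma**2 + 1.0))
print(gh, mc, approx)  # gh and mc agree closely; approx is close but not exact
```

With 20 nodes the quadrature value is effectively exact here; the Monte Carlo value carries sampling noise of order m^{-1/2}, which is the trade-off discussed throughout this chapter.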
3.2.1 Gauss-Hermite Quadrature

Parameter Estimation

The marginal log-likelihood in the multivariate GLMM is expressed as a sum of the individual log-likelihoods for all subjects, that is,

ln L(y; ψ) = Σ_{i=1}^n ln L_i(ψ),

where

L_i(ψ) = ∫∫ { ∏_{j=1}^{n_i} f_1(y_i1j | b_i1; β_1, φ_1) f_2(y_i2j | b_i2; β_2, φ_2) } f(b_i1, b_i2; Σ) db_i1 db_i2.

Hence Gauss-Hermite quadrature (Fahrmeir and Tutz, 1994) involves numerical approximations of n q-dimensional integrals (q = q_1 + q_2). The b_i's are first transformed so that each integral has the form

∫ exp(-z_i' z_i) h(z_i) dz_i.

The needed transformation is b_i = √2 L z_i, where Σ = LL'. L is the lower triangular Choleski factor of Σ, and it always exists because Σ is nonnegative definite. Here

h(z_i) = ∏_{j=1}^{n_i} f_1(y_i1j | z_i1; β_1, φ_1) f_2(y_i2j | z_i2; β_2, φ_2).

Each integral is then approximated by

L_i^GQ = Σ_{k_1=1}^m ... Σ_{k_q=1}^m v_{k_1}^(1) ... v_{k_q}^(q) h(z^(k)),

where z^(k) = √2 L d^(k) for the multiple index k = (k_1, ..., k_q), and d^(k) denote the tabled nodes of univariate Gauss-Hermite integration of order m (Abramowitz and Stegun,
1972). The corresponding weights are given by v_l^(j) = π^{-1/2} w_l^(j), where w_l^(j) are the tabled univariate weights, l = 1, ..., m. The maximization algorithm then proceeds as follows:

1. Choose an initial estimate ψ̂^(0) for the parameter vector. Set r = 0.
2. Increase r by 1. Approximate each of the integrals L_i(ψ̂^(r-1)) by L_i^GQ(ψ̂^(r-1)) using m quadrature points in each direction.
3. Maximize the approximation with respect to ψ using a numerical maximization routine.
4. Iterate between steps (2) and (3) until the parameter estimates have converged.

A popular numerical maximization procedure for step 3 is the Newton-Raphson method. It involves iteratively solving the equations

ψ̂^(r) = ψ̂^(r-1) + J(ψ̂^(r-1))^{-1} S(ψ̂^(r-1)),

where S(ψ̂^(r-1)) = ∂lnL/∂ψ evaluated at ψ̂^(r-1), and J(ψ̂^(r-1)) = -∂²lnL/∂ψ∂ψ' evaluated at ψ̂^(r-1) is the observed information matrix. One possible criterion for convergence is

max_s |ψ̂_s^(r) - ψ̂_s^(r-1)| / (|ψ̂_s^(r-1)| + δ_2) < δ_1,

where ψ̂_s^(r) denotes the estimate of the sth element of the parameter vector at step r of the algorithm, and δ_1 and δ_2 are chosen to be small positive numbers. The role of δ_2 is to prevent numerical problems stemming from estimates close to zero. Another frequently used criterion is

||ψ̂^(r) - ψ̂^(r-1)|| < δ_1,

where || · || denotes the Euclidean norm.
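The Newton-Raphson update and the relative-change stopping rule above can be sketched in a one-parameter toy problem (a Poisson log-likelihood maximized in θ = log λ; data and tolerances are hypothetical):

```python
import numpy as np

# Minimal Newton-Raphson sketch: maximize the Poisson log-likelihood in
# theta = log(lambda).  The update is theta_new = theta + S/J with score S
# and observed information J, stopped with the relative-change criterion.
y = np.array([3, 1, 4, 2, 5, 0, 2, 3])
theta, delta1, delta2 = 0.0, 1e-8, 1e-3
for _ in range(100):
    S = y.sum() - y.size * np.exp(theta)      # score  d lnL / d theta
    J = y.size * np.exp(theta)                # observed information  -d2 lnL / d theta2
    theta_new = theta + S / J                 # Newton-Raphson step
    if abs(theta_new - theta) / (abs(theta) + delta2) < delta1:
        theta = theta_new
        break
    theta = theta_new
print(np.exp(theta))  # converges to the sample mean, 2.5
```

Here the log scale keeps the intermediate estimates automatically inside the valid parameter space (λ > 0), the same device motivating the Choleski reparametrization of Σ later in this section.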
The numerical maximization procedure MaxBFGS, which we are going to use for the numerical examples later in this chapter, is based on two criteria for convergence:

|S_s^(r)| ≤ ε |ψ̂_s^(r)| for all s when ψ̂_s^(r) ≠ 0,   and   |S_s^(r)| ≤ ε for all s when ψ̂_s^(r) = 0;

|ψ̂_s^(r+1) - ψ̂_s^(r)| ≤ 10ε |ψ̂_s^(r)| for all s when ψ̂_s^(r) ≠ 0,   and   |ψ̂_s^(r+1) - ψ̂_s^(r)| ≤ 10ε for all s when ψ̂_s^(r) = 0,

where S_s denotes the sth component of the score vector.

Standard Error Estimation

After the algorithm has converged, estimates of the standard errors of the parameter estimates can be based on the observed information matrix. By asymptotic maximum likelihood theory, ψ̂ is approximately normally distributed around ψ with variance-covariance matrix I(ψ)^{-1}, where I(ψ) = E(-∂²lnL/∂ψ∂ψ') is the expected information matrix. But the observed information matrix J(ψ) is easier to obtain, and hence we will use the latter to approximate the standard errors:

s.e.(ψ̂_s) = (J^{ss})^{1/2},

where J^{ss} is the sth diagonal element of the inverse of the observed information matrix. Note that if the Newton-Raphson maximization method is used, the observed information matrix is a byproduct of the algorithm.
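When closed forms are awkward, the observed information itself can be obtained numerically. A toy sketch (hypothetical quadratic log-likelihood with known curvature matrix A, step length chosen ad hoc) approximates J by central finite differences and reads off s.e.(ψ̂_s) as the square root of the sth diagonal element of J^{-1}:

```python
import numpy as np

# Standard errors from the inverse observed information, with the
# information approximated by central finite differences (toy quadratic
# log-likelihood; in this case J recovers the curvature matrix A exactly).
A = np.array([[4.0, 1.0], [1.0, 2.0]])
psi_hat = np.array([0.3, -1.2])                 # pretend ML estimate

def loglik(psi):                                # l(psi), maximized at psi_hat
    d = psi - psi_hat
    return -0.5 * d @ A @ d

eps, p = 1e-4, 2
J = np.zeros((p, p))
I = np.eye(p)
for s in range(p):
    for t in range(p):                          # J_st = -d2 l / dpsi_s dpsi_t
        J[s, t] = -(loglik(psi_hat + eps*I[s] + eps*I[t])
                    - loglik(psi_hat - eps*I[s] + eps*I[t])
                    - loglik(psi_hat + eps*I[s] - eps*I[t])
                    + loglik(psi_hat - eps*I[s] - eps*I[t])) / (4.0 * eps**2)
se = np.sqrt(np.diag(np.linalg.inv(J)))
print(J)   # recovers A
print(se)  # sqrt of the diagonal of A^{-1}
```

The same four-point difference scheme, applied to the quadrature-approximated log-likelihood, is described formally in the next subsection.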
Numerical and Exact Derivatives

The observed information matrix and the score vector are not available in closed form and must be approximated. One can either compute numerical derivatives of the approximated log-likelihood or approximate the intractable integrals in the expressions for the exact derivatives. The first approach is simpler to implement but may require a large number of quadrature points. The method of finite differences can be used to find numerical derivatives. The sth element of the score vector and the (s,t)th element of the information matrix are approximated as follows:

S_s ≈ Σ_{i=1}^n [l_i^GQ(ψ + ε_1 i_s) - l_i^GQ(ψ - ε_1 i_s)] / (2ε_1),

J_st ≈ -Σ_{i=1}^n [l_i^GQ(ψ + ε_1 i_s + ε_2 i_t) - l_i^GQ(ψ - ε_1 i_s + ε_2 i_t) - l_i^GQ(ψ + ε_1 i_s - ε_2 i_t) + l_i^GQ(ψ - ε_1 i_s - ε_2 i_t)] / (4ε_1 ε_2),

where l_i^GQ = ln L_i^GQ, ε_1 and ε_2 are suitably chosen step lengths, and i_s and i_t are unit vectors. The unit vectors have ones in positions s and t respectively, and the rest of their elements are zeros.

An alternative to numerical derivatives is to approximate the exact derivatives. The latter are obtained by interchanging the differentiation and integration signs and are as follows:

∂lnL_i/∂ψ = (∫ f_i db_i)^{-1} ∫ (∂lnf_i/∂ψ) f_i db_i

and

∂²lnL_i/∂ψ∂ψ' = (∫ f_i db_i)^{-1} ∫ (∂²lnf_i/∂ψ∂ψ' + (∂lnf_i/∂ψ)(∂lnf_i/∂ψ')) f_i db_i - (∂lnL_i/∂ψ)(∂lnL_i/∂ψ'),
where

f_i = ∏_{j=1}^{n_i} { f_1(y_i1j | b_i; β_1, φ_1) f_2(y_i2j | b_i; β_2, φ_2) } f(b_i; Σ).

For each subject there are three intractable integrals which need to be approximated: ∫ f_i db_i, ∫ (∂lnf_i/∂ψ) f_i db_i, and ∫ (∂²lnf_i/∂ψ∂ψ' + (∂lnf_i/∂ψ)(∂lnf_i/∂ψ')) f_i db_i, and Gauss-Hermite quadrature is used for each one of them. Note that

ln f_i = Σ_{j=1}^{n_i} ln f_1(y_i1j | b_i; β_1, φ_1) + Σ_{j=1}^{n_i} ln f_2(y_i2j | b_i; β_2, φ_2) + ln f(b_i; Σ).

Hence ∂lnf_i/∂β_1 = ∂(Σ_j ln f_1)/∂β_1, ∂lnf_i/∂β_2 = ∂(Σ_j ln f_2)/∂β_2, and the derivatives with respect to the variance components involve only ln f(b_i; Σ). Similarly, the derivatives with respect to the scale parameters φ_1 and φ_2, if needed, are obtained by differentiating the first and second terms in the summation for ln f_i. This leads to simple expressions for the integrands because the functions that are differentiated are members of the exponential family (Fahrmeir and Tutz, 1994).

Note that the random effects distribution is always multivariate normal, which complicates the expressions for the second order derivatives with respect to the variance components. Jennrich and Schluchter (1986) propose an elegant approach to deal with this problem, and we now illustrate their method for the multivariate GLMM. For simplicity of notation consider bivariate random effects b_i. Let

Σ = [ δ_1²  ρδ_1δ_2 ; ρδ_1δ_2  δ_2² ]

and let Σ_s = ∂Σ/∂δ_s, s = 1, ..., 3, where (δ_1, δ_2, δ_3) = (δ_1, δ_2, ρ). Then

∂lnf(b_i; Σ)/∂δ_s = -(1/2) tr(Σ^{-1} Σ_s) + (1/2) b_i' Σ^{-1} Σ_s Σ^{-1} b_i
and

∂²lnf(b_i; Σ)/∂δ_s∂δ_t = (1/2) tr(Σ^{-1}Σ_t Σ^{-1}Σ_s) - (1/2) tr(Σ^{-1}Σ_st) + (1/2) b_i'Σ^{-1}(Σ_st - Σ_t Σ^{-1}Σ_s - Σ_s Σ^{-1}Σ_t)Σ^{-1} b_i,

where Σ_st = ∂²Σ/∂δ_s∂δ_t.

In the above parametrization of the variance-covariance matrix there are restrictions on the parameter space (δ_1 > 0, δ_2 > 0, -1 < ρ < 1), but the algorithm does not guarantee that the current estimates will stay in the restricted parameter space. It is then preferable to maximize with respect to the elements of the Choleski root of Σ. The Choleski root is a lower triangular matrix

L = [ l_11  0 ; l_21  l_22 ]

such that Σ = LL'. Its elements are held unrestricted in intermediate steps of the algorithm, and hence they cannot lead to a negative-definite final estimate of Σ; such an outcome would indicate either a numerical problem in the procedure or an inappropriate model. Moreover, if only intermediate estimates of Σ are negative-definite, then the maximization algorithm with respect to the elements of the Choleski root will work well, while the other one can have problems. Although the parametrization in terms of l_11, l_21 and l_22 has fewer numerical problems than the parametrization in terms of δ_1, δ_2 and ρ, it does not have a nice interpretation. It then seems reasonable to use the Choleski roots in the maximization procedure but to perform one extra partial step of the algorithm to approximate the information matrix for the meaningful parametrization δ_1, δ_2 and ρ.

3.2.2 Monte Carlo EM Algorithm

Parameter Estimation

The complete data for the multivariate GLMM is u = (y', b')', and hence the complete data log-likelihood is given by

ln L_u = Σ_{i=1}^n Σ_{j=1}^{n_i} ln f_1(y_i1j | b_i1; β_1, φ_1) + Σ_{i=1}^n Σ_{j=1}^{n_i} ln f_2(y_i2j | b_i2; β_2, φ_2) + Σ_{i=1}^n ln f(b_i; Σ).
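The separation of ln L_u into pieces involving (β_1, φ_1), (β_2, φ_2) and Σ is what keeps each M-step manageable. The same EM mechanics can be seen in a univariate toy model where the E-step has a closed form: a normal random-intercept model y_ij = μ + b_i + e_ij with b_i ~ N(0, τ²) and known error variance σ². (All values below are hypothetical; the E-step moments are the standard normal-normal conjugate results.)

```python
import numpy as np

# Toy EM illustration: normal random-intercept model with closed-form E-step.
# y_ij = mu + b_i + e_ij,  b_i ~ N(0, tau2),  e_ij ~ N(0, sigma2), sigma2 known.
rng = np.random.default_rng(4)
n, ni, sigma2 = 200, 5, 1.0
b_true = rng.normal(0.0, np.sqrt(2.0), n)
y = 1.5 + b_true[:, None] + rng.normal(0.0, 1.0, (n, ni))

mu, tau2 = 0.0, 1.0                                # starting values
for r in range(200):                               # iterate E- and M-steps
    # E-step: b_i | y_i is normal with the moments below
    shrink = ni * tau2 / (sigma2 + ni * tau2)
    m = shrink * (y.mean(axis=1) - mu)             # E(b_i | y_i)
    v = tau2 * sigma2 / (sigma2 + ni * tau2)       # Var(b_i | y_i)
    # M-step: separate maximizations for mu and tau2, exactly as the
    # complete-data log-likelihood separates into distinct terms
    mu_new = (y - m[:, None]).mean()
    tau2_new = np.mean(m**2 + v)                   # uses E(b_i^2 | y_i)
    if abs(mu_new - mu) + abs(tau2_new - tau2) < 1e-10:
        mu, tau2 = mu_new, tau2_new
        break
    mu, tau2 = mu_new, tau2_new
print(mu, tau2)   # close to the generating values 1.5 and 2.0
```

In the multivariate GLMM the conditional expectations are not available in closed form like this, which is where the Monte Carlo approximations below come in.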
Notice that (β_1, φ_1), (β_2, φ_2) and Σ appear in different terms in the log-likelihood, and therefore each M-step of the algorithm consists of three separate maximizations of E[Σ_{i=1}^n Σ_{j=1}^{n_i} ln f_1(y_i1j | b_i1; β_1, φ_1) | y], E[Σ_{i=1}^n Σ_{j=1}^{n_i} ln f_2(y_i2j | b_i2; β_2, φ_2) | y], and E[Σ_{i=1}^n ln f(b_i; Σ) | y] respectively, evaluated at the current parameter estimate of ψ. Recall that for the regular GLMM there are two separate terms to be maximized. These conditional expectations do not have closed form expressions but can be approximated by Monte Carlo estimates:

(1/m) Σ_{k=1}^m Σ_{i=1}^n Σ_{j=1}^{n_i} ln f_1(y_i1j | b_i1^(k); β_1, φ_1),   (3.2)
(1/m) Σ_{k=1}^m Σ_{i=1}^n Σ_{j=1}^{n_i} ln f_2(y_i2j | b_i2^(k); β_2, φ_2),   (3.3)
(1/m) Σ_{k=1}^m Σ_{i=1}^n ln f(b_i^(k); Σ),   (3.4)

where (b_i1^(k), b_i2^(k)), k = 1, ..., m, are simulated values from the distribution of b_i | y_i, evaluated at the current parameter estimate of ψ. In order to obtain simulated values for b_i, multivariate rejection sampling or importance sampling can be used, as proposed by Booth and Hobert (1999). At the rth step of the Monte Carlo EM algorithm, the multivariate rejection sampling algorithm is as follows. For each subject i, i = 1, ..., n:

1. Compute τ_i = sup_{b_i ∈ R^q} f(y_i | b_i; ψ̂^(r-1)).
2. Sample b_i^(k) from the multivariate normal density f(b_i; Σ̂^(r-1)) and independently sample w^(k) from Uniform(0, 1).
3. If w^(k) ≤ f(y_i | b_i^(k); ψ̂^(r-1)) / τ_i, then accept b_i^(k); if not, go to 2.

Iterate between (2) and (3) until m generated values of b_i are accepted. Once m simulated values for the random effects are available, the Monte Carlo estimates of the conditional expectations can be maximized using some numerical
maximization procedure such as the Newton-Raphson algorithm. Convergence can be claimed when successive values for the parameters are within some required precision. The same convergence criteria as those mentioned for Gauss-Hermite quadrature can be applied, but with a generally larger δ_1 value; Booth and Hobert (1999) suggest using δ_1 between 0.002 and 0.005. This is needed to avoid excessively large simulation sample sizes.

A nice feature of the Booth and Hobert algorithm is that, because independent random sampling is used, the Monte Carlo error can be assessed at each iteration and the simulation sample size can be automatically increased. The idea is that it is inefficient to start with a large simulation sample, but as the estimates get closer to the maximum likelihood estimates the simulation sample size must be increased to provide enough precision for convergence. Booth and Hobert propose to increase the simulation sample size m with m/t, where t is some positive number, if ψ̂^(r-1) lies in a 95% confidence ellipsoid constructed around ψ̂^(r), that is, if

(ψ̂^(r-1) - ψ̂^(r))' Var(ψ̂^(r))^{-1} (ψ̂^(r-1) - ψ̂^(r)) ≤ c²,

where c² denotes the 95th percentile of the Chi-square distribution with number of degrees of freedom equal to the dimension of the parameter vector. To describe what variance approximation is used above, some notation must be introduced. Denote by ψ*^(r) the maximizer of the true conditional expectation Q(ψ) = E(ln L_u(ψ) | y, ψ̂^(r-1)). Then

Var(ψ̂^(r)) ≈ Q_m^(2)(ψ*^(r))^{-1} Var(Q_m^(1)(ψ*^(r))) Q_m^(2)(ψ*^(r))^{-1}.

Here Q_m(ψ) = (1/m) Σ_{k=1}^m ln L_u(ψ; b^(k), y) for b^(k) drawn given ψ̂^(r-1), and Q_m^(1) and Q_m^(2) are the vector and matrix of first and second derivatives of Q_m respectively. A sandwich estimate of the variance
is obtained by substituting ψ̂^{(r)} in place of ψ*^{(r)} and using the corresponding Monte Carlo estimate of Var[Q_m^{(1)}]. In summary, the algorithm works as follows:

1. Select an initial estimate ψ^{(0)}. Set r = 0.

2. Increase r by 1. E-step: for each subject i, i = 1, ..., n, generate m random samples from the distribution of b_i | y_i; ψ^{(r-1)} using rejection sampling.

3. M-step: update the parameter estimate of ψ by maximizing (3.2), (3.3) and (3.4).

4. Iterate between (2) and (3) until convergence is achieved.

Standard Error Estimation

After the algorithm has converged, standard errors can be estimated using Louis' method of approximation of the observed information matrix (Tanner, 1993). Denote the observed-data log-likelihood by l; then the observed-data information matrix is

−∂²l(y; ψ)/∂ψ∂ψᵀ = E[−∂² ln L_u(b, y; ψ)/∂ψ∂ψᵀ | y] − Var[∂ ln L_u(b, y; ψ)/∂ψ | y]
                 = E[−∂² ln L_u(b, y; ψ)/∂ψ∂ψᵀ | y] − E[(∂ ln L_u(b, y; ψ)/∂ψ)(∂ ln L_u(b, y; ψ)/∂ψᵀ) | y]
                   + E[∂ ln L_u(b, y; ψ)/∂ψ | y] E[∂ ln L_u(b, y; ψ)/∂ψᵀ | y].

By simulating values b_i^{(k)}, k = 1, ..., m, i = 1, ..., n, from the conditional distributions of b_i | y_i at the final ψ̂, the conditional expectations above can be approximated by the
corresponding Monte Carlo sums

(1/m) Σ_{k=1}^{m} ∂² ln L_u(b^{(k)}, y; ψ̂)/∂ψ∂ψᵀ   and   (1/m) Σ_{k=1}^{m} (∂ ln L_u(b^{(k)}, y; ψ̂)/∂ψ)(∂ ln L_u(b^{(k)}, y; ψ̂)/∂ψᵀ).

In the above expressions the argument ψ̂ means that the derivatives are evaluated at the final parameter estimate, exactly as in the Gauss-Hermite quadrature case.

As the simulation sample size increases, the Monte Carlo EM algorithm behaves as a deterministic EM algorithm and usually converges to the exact maximum likelihood estimates, but it may also converge to a local instead of a global maximum (Wu, 1983). A drawback of the Monte Carlo EM algorithm is that it may be very computationally intensive. Hence, important alternatives to this 'exact' method for GLMMs are methods based on analytical approximations to the marginal likelihood, such as the procedures suggested by Breslow and Clayton (1993) and by Wolfinger and O'Connell (1993). Wolfinger and O'Connell propose finding approximate maximum likelihood estimates by approximating the non-normal responses using Taylor series expansions and then solving the linear mixed model equations for the associated normal model. The extension of their method is presented in the next section.

3.2.3 Pseudo-likelihood Approach

Parameter Estimation

Let us first assume for simplicity that the shape parameters φ₁ and φ₂ for both response distributions are equal to 1. By the model definition,

μ_{i1} = E(y_{i1} | b_i) = g₁⁻¹(X_{i1}β₁ + Z_{i1}b_{i1})   and   μ_{i2} = E(y_{i2} | b_i) = g₂⁻¹(X_{i2}β₂ + Z_{i2}b_{i2}),
where g₁⁻¹ and g₂⁻¹ above are evaluated componentwise at the elements of X_{i1}β₁ + Z_{i1}b_{i1} and X_{i2}β₂ + Z_{i2}b_{i2}, respectively. Let β̂₁, β̂₂, b̂_{i1} and b̂_{i2}, i = 1, ..., n, be some estimates of the fixed and the random effects. The corresponding mean estimates are denoted by μ̂_{i1} and μ̂_{i2}. In a neighborhood around the fixed and random effects estimates, the errors ε_{i1} = y_{i1} − μ_{i1} and ε_{i2} = y_{i2} − μ_{i2} are then approximated by first-order Taylor series expansions. Denote the two approximations by

ε̂_{i1} = y_{i1} − μ̂_{i1} − (g₁⁻¹)'(X_{i1}β̂₁ + Z_{i1}b̂_{i1}) [X_{i1}(β₁ − β̂₁) + Z_{i1}(b_{i1} − b̂_{i1})]

and

ε̂_{i2} = y_{i2} − μ̂_{i2} − (g₂⁻¹)'(X_{i2}β̂₂ + Z_{i2}b̂_{i2}) [X_{i2}(β₂ − β̂₂) + Z_{i2}(b_{i2} − b̂_{i2})],

where (g₁⁻¹)'(X_{i1}β̂₁ + Z_{i1}b̂_{i1}) and (g₂⁻¹)'(X_{i2}β̂₂ + Z_{i2}b̂_{i2}) are diagonal matrices with elements consisting of evaluations of the first derivatives of g₁⁻¹ and g₂⁻¹. Note that because the estimates of the random effects are not consistent as the number of subjects increases, this expansion may not work well when the number of observations per subject is small. Now, as in Wolfinger and O'Connell, we approximate the conditional distributions of ε_{i1} | b_i and ε_{i2} | b_i by normal distributions with 0 means and diagonal variance matrices V₁(μ̂_{i1}) = diag(v₁(μ̂_{i11}), ..., v₁(μ̂_{i1n_i})) and V₂(μ̂_{i2}) = diag(v₂(μ̂_{i21}), ..., v₂(μ̂_{i2n_i})), respectively, where v₁ and v₂ are the variance functions of the original responses. Note that it is reasonable to assume that the two normal distributions are uncorrelated, because the responses are assumed to be conditionally independent given the random effects. The normal approximation, on the other hand, may not be appropriate for some distributions, such as the Bernoulli distribution. Substituting μ̂_{i1} and μ̂_{i2} in the variance expressions, and using the fact that a linear transformation of a multivariate normal vector is again multi-
variate normal with appropriately transformed variance-covariance matrices. Defining

v_{i1} = g₁(μ̂_{i1}) + g₁'(μ̂_{i1})(y_{i1} − μ̂_{i1})   and   v_{i2} = g₂(μ̂_{i2}) + g₂'(μ̂_{i2})(y_{i2} − μ̂_{i2}),

it follows that, approximately,

v_{i1} | b_i ~ N(X_{i1}β₁ + Z_{i1}b_{i1}, W_{i1}⁻¹)   and   v_{i2} | b_i ~ N(X_{i2}β₂ + Z_{i2}b_{i2}, W_{i2}⁻¹).

Now, recalling that b_i has a multivariate normal distribution with mean 0 and variance-covariance matrix Σ, this approximation takes the form of a linear mixed model with response Vec(v_{i1}, v_{i2}), random effect Vec(b_{i1}, b_{i2}) and uncorrelated but heteroscedastic errors. Such a model is called a weighted linear mixed model with a diagonal weight matrix W = diag(W_i), where W_i = diag(W_{i1}, W_{i2}), W_{i1}⁻¹ = g₁'(μ̂_{i1}) V₁(μ̂_{i1}) g₁'(μ̂_{i1}) and W_{i2}⁻¹ = g₂'(μ̂_{i2}) V₂(μ̂_{i2}) g₂'(μ̂_{i2}). For canonical link functions V₁(μ̂_{i1}) = [g₁'(μ̂_{i1})]⁻¹ and V₂(μ̂_{i2}) = [g₂'(μ̂_{i2})]⁻¹, and then W_i⁻¹ = diag(g₁'(μ̂_{i1}), g₂'(μ̂_{i2})). Estimates of β = Vec(β₁, β₂) and b_i can be obtained by solving the mixed-model equations below (Harville, 1977):

[ XᵀWX            XᵀWZ    ] [ β̂ ]   [ XᵀWv ]
[ ZᵀWX    Σ̃⁻¹ + ZᵀWZ ] [ b̂ ] = [ ZᵀWv ]   (3.5)

Here X = (X₁ᵀ, ..., X_nᵀ)ᵀ, Z = diag(Z₁, ..., Z_n), v = Vec(v₁, ..., v_n), b = Vec(b₁, ..., b_n), Σ̃ = diag(Σ, ..., Σ), and

X_i = [ X_{i1}   0    ],   Z_i = [ Z_{i1}   0    ],   v_i = [ v_{i1} ].
      [ 0     X_{i2} ]         [ 0     Z_{i2} ]         [ v_{i2} ]
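As an illustration of the working variates just defined, the following sketch (in Python rather than the Ox used in the dissertation; the function name is ours) computes v and the diagonal of W⁻¹ for a Bernoulli response with the canonical logit link. For the normal response with identity link the variate reduces to v = y with constant weight.

```python
import numpy as np

def working_variate_logit(y, mu):
    """Pseudo-likelihood working variate for a Bernoulli response with the
    canonical logit link g(mu) = log(mu / (1 - mu)):
        v    = g(mu) + g'(mu) * (y - mu),
        W^-1 = g'(mu) * V(mu) * g'(mu) = 1 / (mu * (1 - mu)),
    since the variance function V(mu) = mu (1 - mu) cancels one factor g'."""
    gprime = 1.0 / (mu * (1.0 - mu))
    v = np.log(mu / (1.0 - mu)) + gprime * (y - mu)
    w_inv = gprime
    return v, w_inv

# at mu = 0.5: g(mu) = 0 and g'(mu) = 4, so v = (2, -2) with weights 1/4
v, w_inv = working_variate_logit(np.array([1.0, 0.0]), np.array([0.5, 0.5]))
```

The weight cancellation in the last line of the docstring is exactly the canonical-link simplification W_i⁻¹ = diag(g'(μ̂)) noted in the text.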
Variance component estimates can be obtained by numerically maximizing the weighted linear mixed model log-likelihood

l = −(1/2) Σ_{i=1}^{n} ln|V_i| − (1/2) Σ_{i=1}^{n} r̂_iᵀ V_i⁻¹ r̂_i,   (3.6)

where r̂_i = v_i − X_iβ̂ and V_i = W_i⁻¹ + Z_iΣZ_iᵀ. The algorithm then is as follows:

1. Obtain initial estimates μ̂_{i1}^{(0)} and μ̂_{i2}^{(0)} using the original data, i = 1, ..., n.

2. Compute the modified responses v_{i1} = g₁(μ̂_{i1}) + (y_{i1} − μ̂_{i1})g₁'(μ̂_{i1}) and v_{i2} = g₂(μ̂_{i2}) + (y_{i2} − μ̂_{i2})g₂'(μ̂_{i2}).

3. Maximize (3.6) with respect to the variance components. Stop if the difference between the new and the old variance component estimates is sufficiently small. Otherwise go to the next step.

4. Compute the mixed-model estimates for β and b_i by solving the mixed model equations (3.5): β̂ = (Σ_{i=1}^{n} X_iᵀV_i⁻¹X_i)⁻¹ (Σ_{i=1}^{n} X_iᵀV_i⁻¹v_i), b̂_i = Σ̂Z_iᵀV_i⁻¹r̂_i.

5. Compute the new estimates of μ_i = (μ_{i1}, μ_{i2}): μ̂_{i1} = g₁⁻¹(X_{i1}β̂₁ + Z_{i1}b̂_{i1}) and μ̂_{i2} = g₂⁻¹(X_{i2}β̂₂ + Z_{i2}b̂_{i2}). Go to step 2.
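The generalized least squares pass in step 4 can be sketched as follows (a minimal Python illustration with hypothetical names, assuming diagonal W_i⁻¹; it is not the dissertation's Ox implementation):

```python
import numpy as np

def mixed_model_solve(X_list, Z_list, v_list, winv_list, Sigma):
    """One GLS pass corresponding to step 4 (cf. equations (3.5)):
        V_i  = W_i^-1 + Z_i Sigma Z_i',
        beta = (sum_i X_i' V_i^-1 X_i)^-1 (sum_i X_i' V_i^-1 v_i),
        b_i  = Sigma Z_i' V_i^-1 (v_i - X_i beta)."""
    Vinv_list, A, c = [], 0.0, 0.0
    for X, Z, v, winv in zip(X_list, Z_list, v_list, winv_list):
        Vinv = np.linalg.inv(np.diag(winv) + Z @ Sigma @ Z.T)
        Vinv_list.append(Vinv)
        A = A + X.T @ Vinv @ X
        c = c + X.T @ Vinv @ v
    beta = np.linalg.solve(A, c)
    b = [Sigma @ Z.T @ Vinv @ (v - X @ beta)
         for X, Z, v, Vinv in zip(X_list, Z_list, v_list, Vinv_list)]
    return beta, b

# toy check: one subject, intercept-only model, v identically 1 gives
# beta = 1 and a zero random-effect prediction
beta, b = mixed_model_solve([np.ones((4, 1))], [np.ones((4, 1))],
                            [np.ones(4)], [np.ones(4)], np.array([[1.0]]))
```

Working with V_i⁻¹ subject by subject keeps every inversion at the (small) cluster size, which is the practical advantage of the per-subject form of (3.5) used in step 4.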
In comparison to the original Wolfinger and O'Connell algorithm, the above algorithm for multiple responses is computationally more intensive at step 3 because of the more complicated structure of the variance-covariance matrices V_i. As the number of response variables increases, the computational advantages of this approximate procedure over the exact maximum likelihood procedures may become less pronounced. The Newton-Raphson method is a logical choice for the numerical maximization in step 3 because of its relatively fast convergence and because, as a side product, it provides the Hessian matrix for the approximation of the standard errors of the variance components. In the general case, when φ₁ and φ₂ are arbitrary, the variance functions V₁(μ̂_{i1}) and V₂(μ̂_{i2}) are replaced by φ₁V₁(μ̂_{i1}) and φ₂V₂(μ̂_{i2}), and hence the weight matrices are accordingly modified: W_{i1}⁻¹ = φ₁g₁'(μ̂_{i1})V₁(μ̂_{i1})g₁'(μ̂_{i1}) and W_{i2}⁻¹ = φ₂g₂'(μ̂_{i2})V₂(μ̂_{i2})g₂'(μ̂_{i2}). The estimates of φ₁ and φ₂ can be obtained in step 3 together with the other variance component estimates. This approach is different from the approach taken by Wolfinger and O'Connell, who estimated the dispersion parameter for the response together with the fixed and random effects.

Standard Error Estimation

The standard errors for the regression parameter and for the random effects estimates are approximated from the linear mixed model:

Var(β̂, b̂) ≈ [ XᵀŴX             XᵀŴZ     ]⁻¹
            [ ZᵀŴX    Σ̂̃⁻¹ + ZᵀŴZ ]

As Σ and W are unknown, estimates of the variance components will be used. The standard errors for the variance components can be approximated by using exact or numerical derivatives of the likelihood function in step 3 of the algorithm.
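The objective maximized in step 3 can be evaluated along the following lines (a sketch with names of our choosing, with β profiled out by GLS at each trial Σ, so that a generic optimizer can drive (3.6) over the variance components alone):

```python
import numpy as np

def weighted_lmm_loglik(Sigma, X_list, Z_list, v_list, winv_list):
    """Objective (3.6) for the step-3 maximization, with beta profiled out
    by GLS at the trial Sigma:
        l   = -1/2 sum_i ( ln|V_i| + r_i' V_i^-1 r_i ),
        V_i = W_i^-1 + Z_i Sigma Z_i',   r_i = v_i - X_i beta."""
    Vinv_list, logdets, A, c = [], [], 0.0, 0.0
    for X, Z, v, winv in zip(X_list, Z_list, v_list, winv_list):
        V = np.diag(winv) + Z @ Sigma @ Z.T
        _, logdet = np.linalg.slogdet(V)
        Vinv = np.linalg.inv(V)
        Vinv_list.append(Vinv)
        logdets.append(logdet)
        A = A + X.T @ Vinv @ X
        c = c + X.T @ Vinv @ v
    beta = np.linalg.solve(A, c)
    ll = 0.0
    for X, v, Vinv, logdet in zip(X_list, v_list, Vinv_list, logdets):
        r = v - X @ beta
        ll -= 0.5 * (logdet + r @ Vinv @ r)
    return ll

# toy check: one subject with v = 0 forces beta = 0, so l = -ln|V| / 2
# with V = I + J (4 ones in a 2x2 block), i.e. l = -ln(3) / 2
ll = weighted_lmm_loglik(np.array([[1.0]]), [np.ones((2, 1))],
                         [np.ones((2, 1))], [np.zeros(2)], [np.ones(2)])
```

In practice the optimizer would be given the Cholesky factors of Σ, as recommended below, and the Hessian returned by a Newton-type maximizer of this function supplies the variance-component standard errors.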
For the same reasons as discussed in Section 3.2, it is better to maximize with respect to the Cholesky factors of Σ. The computation of the standard errors, however, should be carried out with respect to the original parameters σ₁, σ₂ and ρ (or σ₁², σ₂² and σ₁₂). This can be done only if the final variance-covariance estimate is positive definite. The approach discussed in Section 3.2 for representation of the variance-covariance matrix can also be adopted here to simplify computations.

3.3 Simulated Data Example

The methods discussed in the previous sections are illustrated on one simulated data set in which the true values of the parameters are known. The data consist of J = 10 repeated measures of a normal and of a Bernoulli response on each of I = 30 subjects. No covariates are considered, and a random intercept is assumed for each variable. In the notation introduced in Chapter 2,

Y_{i1j} | b_{i1} ~ indep. N(μ_{i1j}, σ²),   μ_{i1j} = β₁ + b_{i1},
Y_{i2j} | b_{i2} ~ indep. Bernoulli(μ_{i2j}),   logit(μ_{i2j}) = β₂ + b_{i2}.

This means that we define a normal-theory random intercept model for one of the responses, a logistic regression model with a random intercept for the other response, and we assume that the random intercepts are correlated. The data are simulated for σ² = 1, β₁ = 4, β₂ = 1 and

Σ = [ 1     0.5 ]
    [ 0.5   0.5 ].

The parameter estimates and their standard errors for the three fitting methods discussed above are presented in Table 3.1. The methods are programmed in Ox and are run on Sun Ultra workstations. Ox is an object-oriented matrix programming language developed at Oxford University
Table 3.1. Estimates and standard errors from model fitting for the simulated data example using Gauss-Hermite quadrature (GQ), the Monte Carlo EM algorithm (MCEM) and pseudo-likelihood (PL).

Parameter   True value   GQ Est.   GQ SE   MCEM Est.   MCEM SE   PL Est.   PL SE
β₁          4.0          3.72      0.19    3.72        0.20      3.72      0.19
β₂          1.0          1.08      0.18    1.08        0.19      1.08      0.17
σ²          1.0          0.97      0.08    0.97        0.08      0.97      0.08
σ₁²         1.0          0.94      0.27    0.94        0.27      0.94      0.27
σ₂²         0.5          0.39      0.26    0.41        0.35      0.35      0.22
σ₁₂         0.5          0.54      0.21    0.54        0.24      0.48      0.20

(Doornik, 1998). It is convenient because of the availability of matrix operations and a variety of predefined functions, and even more importantly because of the speed of computations. All standard errors, except those for the Monte Carlo EM algorithm and those for the regression parameters for the pseudo-likelihood algorithm, are based on numerical derivatives. The identity matrix is used as an initial estimate of the covariance matrix for the random components, and the initial estimates for β₁, β₂ and σ² are taken to be the sample means for the normal and the Bernoulli response and the sample variance for the normal response. Gauss-Hermite quadrature is used with the number of quadrature points in each dimension varying from 10 to 100. The MaxBFGS numerical maximization procedure in Ox is used for optimization in the Gaussian quadrature and the pseudo-likelihood algorithms, with a convergence criterion based on δ₁ = 0.0001. The convergence criterion used for the Monte Carlo EM algorithm is based on δ₁ = 0.003. MaxBFGS employs a quasi-Newton method for maximization developed by Broyden, Fletcher, Goldfarb and Shanno (BFGS) (Doornik, 1998), and it can use either analytical or numerical first derivatives. The times to achieve convergence by the three methods are 30 minutes for Gaussian quadrature (GQ) with 100 quadrature points in each dimension, approximately 3 hours for the Monte Carlo EM algorithm, and 1.5 hours for the pseudo-likelihood
(PL) algorithm. The numbers of iterations required are 22, 53 and 7, respectively. The final simulation sample size for the Monte Carlo EM algorithm is 9864. As expected, the Monte Carlo EM estimates are very close to the Gaussian quadrature estimates, with comparable standard errors. Only the estimate of σ₂² is slightly different and has a rather large standard error, but this is due to premature stopping of the EM algorithm. This problem can be solved by increasing the required precision for convergence, but at the expense of a very large simulation sample size. As can be seen from the table, the parameters corresponding to the normal response are well estimated by the pseudo-likelihood algorithm, but some of the estimates corresponding to the Bernoulli response are somewhat underestimated and have smaller standard errors. The estimate of the random intercept variance σ₂² is about 10% smaller than the corresponding Gaussian quadrature estimate, with an about 15% smaller standard error. The estimate of σ₁₂ is also about 15% smaller than the corresponding Gauss-Hermite quadrature and Monte Carlo estimates, but it is actually closer to the true value in this particular example. In general, the variance component estimates for binary data obtained using the pseudo-likelihood algorithm are expected to be biased downward, as indicated by previous research in the univariate GLMM (Breslow and Clayton, 1993; Wolfinger, 1998). Hence the Wolfinger and O'Connell method should be used with caution for the extended GLMMs with at least one Bernoulli response, keeping in mind the possible attenuation of the estimates and of the standard errors.

Gauss-Hermite Quadrature: 'Exact' vs. 'Numerical' Derivatives

As mentioned before, MaxBFGS can be used either with exact or with numerical first derivatives. Comparisons of these two approaches in terms of computational speed and accuracy are provided in Tables 3.2 and 3.3, respectively.
The 'Iteration' columns in Table 3.2 contain the number of iterations, and the 'Convergence' columns
Table 3.2. 'Exact' versus 'numerical' derivatives for the simulated data example: convergence using Gauss-Hermite quadrature.

Number of       Exact derivatives                   Numerical derivatives
quad. points    Time (min.)   Iteration   Conv.     Time (min.)   Iteration   Conv.
10              27.4          56          no        0.3           19          strong
20              21.3          22          weak      1.1           19          strong
30              75            35          weak      2.8           23          strong
40              120           34          strong    4.8           22          strong
50              246           47          strong    7.2           22          strong
60              237           34          strong    10.9          22          strong

show whether the convergence tests are passed at the prespecified δ₁ level (strong convergence), passed at the lower level δ₁ = 0.001 (weak convergence), or not passed at all (no convergence). Note that for the exact derivatives strong convergence is achieved with 40 or more quadrature points. It is rather surprising that the 'exact' Gaussian quadrature procedure does not perform better than the 'numerical' one. The parameter estimates from both procedures are the same (Table 3.3) for 40 or more quadrature points, but the 'exact' Gauss-Hermite quadrature converges much more slowly than the 'numerical' procedure (hours as compared to minutes). The 'numerical' procedure also has the advantage of simpler programming code, and hence for the rest of the dissertation it will be the preferred quadrature method.

Monte Carlo EM Algorithm: Variability of Estimated Standard Errors

We observed that the estimated Monte Carlo EM standard errors varied somewhat for different random seeds, so we generated 100 samples with the final simulation sample size and computed the average and the standard deviation of the standard error estimates for these samples. Table 3.4 shows the results. We include the Gaussian quadrature standard error estimates for comparison. There is much variability in the standard error estimates (especially in the estimate of σ₂²), which is primarily due to
Table 3.3. 'Exact' versus 'numerical' derivatives for the simulated data example: estimates and standard errors using Gauss-Hermite quadrature with varying number of quadrature points (m).

m    Derivatives   β₁     β₂     σ²     σ₁²    σ₂²    σ₁₂
10   Exact Est.    3.68   1.06   0.97   0.98   0.38   0.52
     Exact SE      0.13   0.16   0.08   0.28   0.25   0.22
     Num. Est.     3.72   1.08   0.97   0.85   0.35   0.48
     Num. SE       0.12   0.16   0.08   0.16   0.24   0.15
20   Exact Est.    3.75   1.10   0.96   0.95   0.39   0.54
     Exact SE      0.16   0.18   0.08   0.27   0.26   0.21
     Num. Est.     3.75   1.10   0.97   0.95   0.39   0.54
     Num. SE       0.16   0.18   0.08   0.27   0.26   0.21
30   Exact Est.    3.71   1.07   0.97   0.94   0.39   0.54
     Exact SE      0.19   0.18   0.08   0.27   0.26   0.21
     Num. Est.     3.71   1.07   0.97   0.95   0.39   0.54
     Num. SE       0.19   0.18   0.08   0.29   0.27   0.22
40   Exact Est.    3.72   1.08   0.97   0.94   0.39   0.54
     Exact SE      0.18   0.18   0.08   0.27   0.26   0.21
     Num. Est.     3.72   1.08   0.97   0.94   0.39   0.54
     Num. SE       0.18   0.18   0.08   0.26   0.26   0.20
50   Exact Est.    3.72   1.08   0.97   0.94   0.39   0.54
     Exact SE      0.19   0.18   0.08   0.27   0.26   0.21
     Num. Est.     3.72   1.08   0.97   0.94   0.39   0.54
     Num. SE       0.19   0.18   0.08   0.27   0.26   0.21
60   Exact Est.    3.72   1.08   0.97   0.94   0.39   0.54
     Exact SE      0.19   0.18   0.08   0.27   0.26   0.21
     Num. Est.     3.72   1.08   0.97   0.94   0.39   0.54
     Num. SE       0.19   0.18   0.08   0.27   0.26   0.21

several extreme observations. If the simulation sample size is increased by one-third, the standard error estimates are more stable (see the last two columns of Table 3.4). It is not surprising that there is more variability in the standard error estimates than in the parameter estimates, because only the latter are controlled by the convergence criterion. So it may be necessary to increase the simulation sample size to obtain better estimates of the standard errors. This, however, requires additional computational effort, and it is not clear by how much the simulation sample size should be increased. We consider a different approach to dealing with standard error instability in the next section of the dissertation.
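The Monte Carlo version of Louis' identity that underlies these standard error estimates can be sketched as follows. This is a minimal Python illustration (not the dissertation's Ox code); the per-draw complete-data scores and Hessians are assumed to be computed elsewhere from the simulated random effects:

```python
import numpy as np

def louis_information(scores, hessians):
    """Monte Carlo version of Louis' identity for the observed information:
        I_obs = E[-H_c | y] - E[s_c s_c' | y] + E[s_c | y] E[s_c | y]',
    where s_c and H_c are the complete-data score vector and Hessian,
    evaluated at the final estimate and averaged over draws from f(b | y)."""
    s = np.asarray(scores)            # shape (m, p): one score per draw
    H = np.asarray(hessians)          # shape (m, p, p): one Hessian per draw
    s_bar = s.mean(axis=0)
    outer = np.einsum('ki,kj->ij', s, s) / s.shape[0]
    return -H.mean(axis=0) - outer + np.outer(s_bar, s_bar)

# toy check: zero scores and a constant complete-data Hessian -I give I_obs = I
info = louis_information(np.zeros((5, 2)), np.tile(-np.eye(2), (5, 1, 1)))
```

Because the correction terms are themselves Monte Carlo averages, the resulting information matrix inherits the simulation noise discussed above, which is what motivates pooling it across iterations in the next section.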
Comparison of Simultaneous and Separate Fitting of the Response Variables

To investigate the possible efficiency gains of joint fitting of the two responses over separate fitting, we compared estimates from the joint fit to estimates from the two separate fits. The results for the three estimation methods are provided in Table
Table 3.4. Variability in Monte Carlo EM standard errors for the simulated data example. Means and standard deviations of the standard error estimates are computed for 100 samples using two different simulation sample sizes. The Gauss-Hermite quadrature standard errors are given as a reference in the second column.

Parameter   GQ SE   Mean SE (ss=9864)   SD of SE   Mean SE (ss=13152)   SD of SE
β₁          0.19    0.21                0.07       0.19                 0.03
β₂          0.18    0.20                0.09       0.19                 0.03
σ²          0.08    0.08                0.0008     0.08                 0.0001
σ₁²         0.27    0.27                0.03       0.27                 0.01
σ₂²         0.26    0.41                0.76       0.31                 0.13
σ₁₂         0.21    0.22                0.04       0.22                 0.02

Table 3.5. Results from joint and separate Gauss-Hermite quadrature fitting of the two response variables in the simulated data example.

Parameter   True value   Joint Est.   Joint SE   Separate Est.   Separate SE
β₁          4.0          3.72         0.19       3.72            0.19
β₂          1.0          1.08         0.18       1.07            0.18
σ²          1.0          0.97         0.08       0.97            0.08
σ₁²         1.0          0.94         0.27       0.94            0.27
σ₂²         0.5          0.39         0.26       0.37            0.25
σ₁₂         0.5          0.54         0.21

3.5 (Gaussian quadrature), Table 3.6 (Monte Carlo EM) and Table 3.7 (pseudo-likelihood). For the separate fits we used PROC NLMIXED for Gaussian quadrature and wrote our own programs in Ox for Monte Carlo EM and pseudo-likelihood. Notice that when the normal response is considered separately, the corresponding model is a one-way random effects ANOVA and can be fitted using PROC MIXED in SAS, for example. The estimates from PROC MIXED are as follows: β̂₁ = 3.72 (SE = 0.19), σ̂² = 0.97 (SE = 0.08) and σ̂₁² = 0.94 (SE = 0.27). It is surprising that although the estimated correlation between the random effects was rather high (ρ̂ = σ̂₁₂/(σ̂₁σ̂₂) = 0.86), the estimates and the estimated standard errors from the joint and from the separate fits were very similar. Having in mind that it is much faster to fit the responses separately, and that there is existing software for univariate repeated measures, it is not clear if there are any advantages to joint fitting
Table 3.6. Results from joint and separate Monte Carlo EM estimation of the two response variables in the simulated data example.

Parameter   True value   Joint Est.   Joint SE   Separate Est.   Separate SE
β₁          4.0          3.72         0.20       3.73            0.22
β₂          1.0          1.08         0.19       1.08            0.18
σ²          1.0          0.97         0.08       0.97            0.08
σ₁²         1.0          0.94         0.27       0.95            0.28
σ₂²         0.5          0.41         0.35       0.38            0.23
σ₁₂         0.5          0.54         0.24

Table 3.7. Results from joint and separate pseudo-likelihood estimation of the two response variables in the simulated data example.

Parameter   True value   Joint Est.   Joint SE   Separate Est.   Separate SE
β₁          4.0          3.72         0.19       3.72            0.19
β₂          1.0          1.08         0.17       1.05            0.17
σ²          1.0          0.97         0.08       0.97            0.08
σ₁²         1.0          0.94         0.27       0.94            0.27
σ₂²         0.5          0.35         0.22       0.35            0.22
σ₁₂         0.5          0.48         0.20

in this particular example. Advantages and disadvantages of joint and separate fitting will be discussed in Chapter 5 of the dissertation.

3.4 Applications

3.4.1 Developmental Toxicity Study in Mice

In this section the exact maximum likelihood methods are applied to the ethylene glycol (EG) example mentioned in the introduction. The model is defined as follows (d_i denotes the exposure level of EG for the i-th dam):

Y_{i1j}: birth weight of the j-th live fetus in the i-th litter,
Y_{i2j}: malformation status of the j-th live fetus in the i-th litter,
Y_{i1j} | b_{i1} ~ indep. N(μ_{i1j}, σ²),
Y_{i2j} | b_{i2} ~ indep. Bernoulli(μ_{i2j}),
Table 3.8. Gauss-Hermite quadrature and Monte Carlo EM estimates and standard errors from model fitting with a quadratic effect of dose on birth weight in the ethylene glycol data.

Parameter   GQ Est.   GQ SE    MCEM Est.   MCEM SE
β₁₀         0.962     0.016    0.964       0.037
β₁₁         -0.118    0.028    -0.120      0.053
β₁₂         0.010     0.009    0.010       0.015
β₂₀         -4.267    0.409    -4.287      0.455
β₂₁         1.720     0.207    1.731       0.217
σ²          0.006     0.0003   0.006       0.0003
σ₁²         0.007     0.001    0.007       0.001
σ₂²         2.287     0.596    2.238       0.560
σ₁₂         -0.082    0.020    -0.082      0.022

Table 3.9. Gauss-Hermite quadrature and Monte Carlo EM estimates and standard errors from model fitting with linear trends of dose in the ethylene glycol data.

Parameter   GQ (logit)        MCEM (logit)      GQ (probit)       MCEM (probit)
            Est.     SE       Est.     SE       Est.     SE       Est.     SE
β₁₀         0.952    0.014    0.952    0.021    0.952    0.014    0.954    0.028
β₁₁         -0.087   0.008    -0.088   0.008    -0.087   0.008    -0.088   0.012
β₂₀         -4.335   0.411    -4.335   0.522    -2.340   0.213    -2.422   0.269
β₂₁         1.749    0.208    1.752    0.225    0.970    0.111    0.982    0.129
σ           0.075    0.002    0.075    0.002    0.075    0.002    0.075    0.002
σ₁          0.086    0.007    0.086    0.007    0.086    0.007    0.086    0.007
σ₂          1.513    0.196    1.495    0.207    0.839    0.107    0.831    0.101
ρ           -0.682   0.088    -0.687   0.091    -0.666   0.090    -0.670   0.094

μ_{i1j} = β₁₀ + β₁₁d_i + b_{i1},   logit(μ_{i2j}) = β₂₀ + β₂₁d_i + b_{i2},

b_i = (b_{i1}, b_{i2})' ~ MVN(0, Σ),   Σ = [ σ₁²   σ₁₂ ]
                                           [ σ₁₂   σ₂² ].

We considered a linear trend in dose in both linear predictors. Although other authors have found a significant quadratic dose trend for birth weight, we found that trend to be nonsignificant and hence have not included it in the final model (see Table 3.8). We also considered a probit link. The Gaussian quadrature and Monte Carlo EM parameter estimates from the model fits are given in Table 3.9. Initial estimates for the regression parameters were
the estimates from the fixed effects models for the two responses, σ was set equal to the estimated standard deviation from the linear mixed model for birth weight, and the identity matrix was taken as an initial estimate for the variance-covariance matrix of the random effects. Gaussian quadrature estimates were obtained using 100 quadrature points in each dimension, which took 36 hours for both the logit and the probit links. Adaptive Gaussian quadrature will probably be much faster because it requires a smaller number of quadrature points. The Monte Carlo EM algorithm for the model with the logit link ran for 3 hours and 6 minutes, and required 41 iterations and a final sample size of 989 for convergence. The Monte Carlo EM algorithm for the model with the probit link ran for 5 hours and 13 minutes, and required 37 iterations and a final sample size of 1757 for convergence. The convergence precisions were the same as in the simulated data example. The estimates from the two algorithms are similar, but the standard error estimates from the Monte Carlo EM algorithm are larger than their counterparts from Gaussian quadrature. This is not always the case, as seen from Table 3.10. There is much variability in the standard error estimates from the Monte Carlo EM algorithm. The estimates can be improved by considering larger simulation sample sizes, but we adopt a different approach. The observed information matrix at each step of the algorithm is easily approximated using Louis' method, and this does not require extra computational effort because it relies on approximations needed for the sample size increase decision. Because the EM algorithm converges slowly in close proximity to the estimates, and because the convergence criterion must be satisfied in three consecutive iterations, the approximated observed information matrices from the last three iterations are likely to be equally good in approximating the variance of the parameter estimates.
Hence, if we average the three matrices, we are likely to obtain a better estimate of the variance-covariance matrix than if we rely on only one particular iteration.
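The pooling just described amounts to the following (a Python sketch with an invented helper name; the inputs would be the Louis-method information approximations from the last three EM iterations):

```python
import numpy as np

def pooled_standard_errors(info_matrices):
    """Average the observed-information approximations from the last few EM
    iterations, invert the pooled matrix, and return the square roots of the
    diagonal as standard errors. Valid only if the pooled matrix is positive
    definite, which, as noted in the text, is not guaranteed."""
    pooled = sum(info_matrices) / len(info_matrices)
    return np.sqrt(np.diag(np.linalg.inv(pooled)))

# toy check: symmetric perturbations of diag(4, 25) average back to it,
# giving standard errors (1/2, 1/5)
infos = [np.diag([4.0, 25.0]) + d * np.eye(2) for d in (-0.5, 0.0, 0.5)]
se = pooled_standard_errors(infos)
```

Averaging before inverting, rather than averaging the inverses, is the natural choice here because the information matrices, not their inverses, are the quantities approximated without bias by the Monte Carlo sums.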
Table 3.10. Variability in Monte Carlo EM standard errors for the ethylene glycol example. Means and standard deviations of the standard error estimates are computed for 100 samples for the logit and probit models. Gauss-Hermite quadrature standard errors are given as a reference.

Parameter   Logit model                        Probit model
            GQ      Mean of SE   SD of SE     GQ      Mean of SE   SD of SE
β₁₀         0.014   0.015        0.032        0.014   0.014        0.021
β₁₁         0.008   0.008        0.015        0.008   0.007        0.010
β₂₀         0.411   0.440        0.500        0.213   0.200        0.137
β₂₁         0.208   0.206        0.207        0.111   0.101        0.063
σ           0.002   0.002        0.0001       0.002   0.002        0.00002
σ₁          0.007   0.007        0.0005       0.007   0.007        0.0001
σ₂          0.196   0.212        0.115        0.107   0.110        0.020
ρ           0.088   0.094        0.052        0.090   0.092        0.012

Tables 3.11 and 3.12 give standard error estimates for the EG data based on Gaussian quadrature (column GQ), based on the Monte Carlo approximated observed information matrix in the third-to-last (column A1), second-to-last (column A2) and last (column A3) iterations, and based on the average of the observed information matrices of the last two (column A4) and of the last three (column A5) iterations. Clearly, using only the estimate of the observed information matrix from the last iteration is not satisfactory, because it depends heavily on the random seed and may contain some negative standard error estimates. Averaging over the three final iterations is better, although it is not guaranteed that even the pooled estimate of the observed information matrix will be positive definite or that it will lead to improved estimates. All parameter estimates in the model are significantly different from zero, with birth weight significantly decreasing with increasing dose and the probability of malformation significantly increasing with increasing dose. As expected, the regression parameter estimates using the logit link function are greater in absolute value than those obtained using the probit link function. The use of the logit link function facilitates the interpretation of the parameter estimates.
Thus, if the EG dose level for a litter is increased by 1 g/kg, the estimated odds of malformation for any fetus within that litter increase
Table 3.11. Monte Carlo EM standard errors using the logit link in the ethylene glycol example. The approximations are based on the estimate of the observed information matrix from the third-to-last iteration (A1), the second-to-last iteration (A2), the last iteration (A3), the average of the last two iterations (A4), and the average of the last three iterations (A5). The standard error estimates obtained using Gauss-Hermite quadrature are given as a reference in the column labeled GQ.

Parameter   GQ      A1      A2      A3      A4      A5
β₁₀         0.014   0.017   0.037   0.012   0.016   0.016
β₁₁         0.008   0.007   0.014   0.001   0.008   0.007
β₂₀         0.411   0.385   0.429   0.365   0.409   0.396
β₂₁         0.208   0.183   0.239   0.215   0.236   0.206
σ           0.002   0.002   0.002   0.002   0.002   0.002
σ₁          0.007   0.007   0.007   0.007   0.007   0.007
σ₂          0.196   0.199   0.202   0.221   0.204   0.199
ρ           0.088   0.081   0.084   0.136   0.096   0.089

Table 3.12. Monte Carlo EM standard errors using the probit link in the ethylene glycol example. The approximations are based on the estimate of the observed information matrix from the third-to-last iteration (A1), the second-to-last iteration (A2), the last iteration (A3), the average of the last two iterations (A4), and the average of the last three iterations (A5). The standard error estimates obtained using Gauss-Hermite quadrature are given as a reference in the column labeled GQ.

Parameter   GQ      A1      A2      A3      A4      A5
β₁₀         0.014   0.021   0.021   0.006   0.016   0.017
β₁₁         0.008   0.012   0.007   0       0.008   0.009
β₂₀         0.213   0.273   0.227   0       0.243   0.233
β₂₁         0.111   0.151   0.110   0       0.132   0.127
σ           0.002   0.002   0.002   0.002   0.002   0.002
σ₁          0.007   0.008   0.007   0.007   0.007   0.007
σ₂          0.107   0.127   0.097   0       0.124   0.110
ρ           0.090   0.094   0.102   0       0.101   0.095
exp(1.75) = 5.75 times. The probit link function does not offer a similarly easy interpretation, but it allows one to obtain population-averaged estimates for dose. If the subject-specific regression estimate is β̂, then the marginal regression parameters are estimated by β̂/(1 + σ̂₂²)^{1/2} (Section 2.1). Hence we obtain a marginal intercept of -1.793 and a marginal slope of 0.743. The latter can be interpreted as the amount by which the population-averaged probability of malformation changes on the probit scale for each unit increase in dose. The subject-specific slope estimate of 0.970, on the other hand, is interpreted as the increase in the individual fetus's probability of malformation on the probit scale for each unit increase in dose. A more meaningful interpretation of those numbers can be offered if one assumes a continuous underlying malformation variable for the observed binary malformation outcome. The latent variable reflects a cumulative detrimental effect which manifests itself in a malformation if it exceeds a certain threshold. The effect of dose on this underlying latent variable is linear, and hence β₂₁ is interpreted as the amount by which the cumulative effect is increased for one unit increase in dose. β̂₂₁ can also be related to the highest rate of change in the probability of malformation (Agresti, 1990, p. 103). This highest rate of change is estimated to be achieved at dose -β̂₂₀/β̂₂₁. As expected, birth weight and malformation are significantly negatively correlated, as judged from the Wald test from Table 3.9: (ρ̂/ŜE(ρ̂))² = (-0.666/0.090)² = 54.8 (p-value < 0.0001). This translates into a negative correlation between the responses, but there is no closed-form expression to estimate it precisely. It is interesting to compare the joint and the separate fits of the two response variables (Tables 3.13 and 3.14).
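The marginal probit figures quoted above follow from the attenuation factor (1 + σ̂₂²)^{-1/2} applied to the subject-specific estimates of Table 3.9; a quick numerical check:

```python
import math

# probit-link estimates from Table 3.9 (joint fit): subject-specific
# intercept and slope for malformation, and random-intercept SD sigma_2
beta20, beta21, sigma2 = -2.340, 0.970, 0.839

atten = 1.0 / math.sqrt(1.0 + sigma2 ** 2)   # probit attenuation factor
marginal_intercept = beta20 * atten          # population-averaged, ~ -1.793
marginal_slope = beta21 * atten              # population-averaged, ~ 0.743
```

The attenuation toward zero reflects the general fact that population-averaged effects are smaller in absolute value than subject-specific effects when a random intercept is integrated out of a probit model.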
Table 3.13 presents estimates obtained using Gaussian quadrature and shows that the estimates for the normal response are identical (within the reported precision), but the estimates for the Bernoulli response from the separate fits are generally larger in absolute value, with larger standard errors.
Table 3.13. Results from joint and separate Gauss-Hermite quadrature fitting of the two response variables in the ethylene glycol example.

Parameter   Logit models                           Probit models
            Joint fit         Separate fits        Joint fit         Separate fits
            Est.      SE      Est.      SE         Est.      SE      Est.      SE
β₁₀         0.952     0.014   0.952     0.014      0.952     0.014   0.952     0.014
β₁₁         -0.087    0.008   -0.087    0.008      -0.087    0.008   -0.087    0.008
β₂₀         -4.335    0.411   -4.356    0.438      -2.340    0.213   -2.426    0.230
β₂₁         1.749     0.208   1.779     0.220      0.970     0.111   0.993     0.118
σ           0.075     0.002   0.075     0.002      0.075     0.002   0.075     0.002
σ₁          0.086     0.007   0.086     0.007      0.086     0.007   0.086     0.007
σ₂          1.513     0.196   1.577     0.211      0.839     0.107   0.881     0.117
ρ           -0.682    0.088                        -0.666    0.090

Table 3.14. Results from joint and separate Monte Carlo EM estimation of the two response variables in the ethylene glycol example.

Parameter   Logit models                           Probit models
            Joint fit         Separate fits        Joint fit         Separate fits
            Est.      SE      Est.      SE         Est.      SE      Est.      SE
β₁₀         0.952     0.021   0.952     0.014      0.954     0.028   0.952     0.014
β₁₁         -0.088    0.008   -0.088    0.007      -0.088    0.012   -0.088    0.007
β₂₀         -4.335    0.522   -4.453    0.344      -2.422    0.269   -2.499    0.193
β₂₁         1.752     0.225   1.822     0.211      0.982     0.129   1.025     0.098
σ           0.075     0.002   0.075     0.002      0.075     0.002   0.075     0.002
σ₁          0.086     0.007   0.086     0.007      0.086     0.007   0.086     0.007
σ₂          1.495     0.207   1.547     0.199      0.831     0.101   0.869     0.052
ρ           -0.687    0.091                        -0.670    0.094
This may indicate small efficiency gains in fitting the responses together rather than separately. More noticeable differences in the parameter estimates, and especially in the standard errors, are observed in the results from the Monte Carlo EM algorithm (Table 3.14). In this case the standard error estimates for the Bernoulli components from the separate fits are smaller than their counterparts from the joint fits. These results, however, may be deceiving, because there is large variability in the standard error estimates depending on the particular simulation sample used (see Table 3.10). As a result of the conditional independence assumption, the correlation between birth weight and malformation within a fetus, and the correlation between birth weight and malformation measured on two different fetuses within the same litter, are assumed to be the same. However, in practice this assumption may not be satisfied. We would expect that measurements on the same fetus are more highly correlated than measurements on two different fetuses within a litter. Hence, it is very important to be able to check the conditional independence assumption and to investigate how departures from that assumption affect the parameter estimates. In the following chapter, score tests are used to check aspects of conditional independence without fitting more complicated models. When the score tests show that there is a nonnegligible departure from this assumption, alternative models should be considered. In the case of a binary and a continuous response, one more general model is presented in Chapter 5. It has been considered by Catalano and Ryan (1992), who used GEE methods to fit it. We propose exact maximum likelihood estimation, which allows for direct estimation of the variance-covariance structure.

3.4.2 Myoelectric Activity Study in Ponies

The second data set, introduced in Section 2.4, is from a study on myoelectric activity in ponies.
The purpose of this analysis is to simultaneously assess the immediate effects of six drugs and placebo on spike burst rate and spike burst duration within
a pony. Two features of this data example are immediately obvious from Table 2.2: there is a large number of observations within cluster (pony), and the variance of the spike burst rate response for each pony is much larger than the mean. The implication of the first observation is that ordinary Gauss-Hermite quadrature as described in Section 3.2.1 may not work well, because some of the integrands will be essentially zero at most of the quadrature points. Adaptive Gaussian quadrature, on the other hand, will be more appropriate because it requires a smaller number of quadrature points. Also, the additional computations to obtain the necessary modes and curvatures will not slow down the algorithm, because there are only six subjects in the data set and hence only six additional maximizations will be performed. Adaptive Gaussian quadrature is described in the next subsection. The implication of the second observation concerns the distribution of the spike burst rate response. An obvious choice for count data is the Poisson distribution, but the Poisson distribution imposes equality of the mean and the variance of the response, and this assumption is clearly not satisfied for these data, even within a pony. Therefore some way of accounting for the extra dispersion, in addition to the pony effect, must be incorporated in the model. Such an approach, based on the negative binomial distribution, is discussed later in this chapter.

Adaptive Gaussian Quadrature

Adaptive Gaussian quadrature has been considered by Liu and Pierce (1994), Pinheiro and Bates (1999), and Wolfinger (1998). We now present the idea in the context of the bivariate GLMM. Recall that the likelihood for the i-th subject has the form
L_i = ∫ f(y_i, b_i) db_i.

Let b̂_i be the mode of f(y_i, b_i) and let

Γ_i = [ -∂² log f(y_i, b_i) / ∂b_i ∂b_i^T |_{b_i = b̂_i} ]^{-1}.

The needed transformation of b_i is then b_i = b̂_i + √2 A_i z_i, where Γ_i = A_i A_i^T. Hence each integral is approximated by

L_i^{GQ} = |√2 A_i| Σ_{k_1=1}^m ⋯ Σ_{k_q=1}^m w^{(k_1)} ⋯ w^{(k_q)} f*(z_i^{(k)}),

where f*(z_i^{(k)}) = exp(d^{(k)T} d^{(k)}) f(y_i, b̂_i + √2 A_i d^{(k)}), with the tabled univariate weights w^{(k)} and nodes z_i^{(k)} = b̂_i + √2 A_i d^{(k)} for the multiple index k = (k_1, ..., k_q), where d^{(k)} are the tabled nodes of Gauss-Hermite integration of order m.

The difference between this procedure and the ordinary Gauss-Hermite approximation is in the centering and spread of the nodes. Here the nodes are placed where most of the mass of the integrand lies, and this makes adaptive quadrature more efficient. The computation of the modes b̂_i and the curvatures Γ_i requires maximization of the integrand for each subject, which in general requires a numerical procedure, but this was not a major impediment for the pony data. We used MaxBFGS within MaxBFGS (one optimizer call nested inside the other) to perform the maximization.
Negative Binomial Model

When the dispersion in count data is greater than that predicted by the Poisson model, a convenient parametric approach is to assume a Gamma prior for the Poisson mean (McCullagh and Nelder, 1989). If Y ~ Poisson(μ), then

P(Y = y) = μ^y e^{-μ} / y!,   y = 0, 1, 2, ...,

and E(Y) = Var(Y) = μ. Suppose that we want to specify a larger variance for Y but keep the mean the same. This can be accomplished as suggested by Booth and Hobert (personal communication). Let Y | u ~ Poisson(μu) and let u ~ Gamma(α, 1/α), with density f(u) = α^α u^{α-1} e^{-αu} / Γ(α). Then unconditionally Y has a negative binomial distribution with probability mass function

P(Y = y) = [Γ(y + α) / (Γ(α) y!)] (α / (α + μ))^α (μ / (α + μ))^y,

with E(Y) = μ and Var(Y) = μ + μ²/α. Notice that for large α the negative binomial variance approaches the Poisson variance, meaning that there is little overdispersion, but for small α the negative binomial variance can be much larger than the Poisson variance.

The mean in the above model can be specified as a function of covariates using a log link, ln(μ_i) = x_i^T β, i = 1, ..., n. The log link is convenient because it is the canonical link for the Poisson distribution and transforms the mean so that it can take any real value. It is possible to obtain maximum likelihood estimates of α and β by directly numerically maximizing the negative binomial likelihood, but there is a more elegant way based on the EM algorithm, as proposed by Booth and Hobert (personal communication). The complete data consist of y and u, and the complete-data log-likelihood is

ln L_v = c + Σ_{i=1}^n [α ln(α) - ln Γ(α) + y_i ln(μ_i) + α ln(u_i) - u_i(μ_i + α)],
where c is a constant depending only on the data and not on the unknown parameters. The l-th E-step of the EM algorithm then involves calculating

E(ln L_v(α, β) | y; α^{(l-1)}, β^{(l-1)}) = c + N α ln(α) - N ln Γ(α)
    + Σ_{i=1}^n [ y_i x_i^T β + α E(ln u_i | y_i; α^{(l-1)}, β^{(l-1)}) - E(u_i | y_i; α^{(l-1)}, β^{(l-1)}) (exp(x_i^T β) + α) ].

But because of the choice of the prior distribution for u_i, the posterior distribution of u_i | y_i evaluated at (α^{(l-1)}, β^{(l-1)}) is Gamma(y_i + α^{(l-1)}, 1/(α^{(l-1)} + exp(x_i^T β^{(l-1)}))). Therefore

E(u_i | y_i; α^{(l-1)}, β^{(l-1)}) = (y_i + α^{(l-1)}) / (α^{(l-1)} + exp(x_i^T β^{(l-1)}))

and

E(ln u_i | y_i; α^{(l-1)}, β^{(l-1)}) = ψ(y_i + α^{(l-1)}) - ln(α^{(l-1)} + exp(x_i^T β^{(l-1)})),

where ψ(·) denotes the digamma function (the first derivative of the log-gamma function). At the l-th M-step the expected log-likelihood is maximized with respect to α and β. The two parameters appear in different terms of the log-likelihood, and hence two separate numerical maximizations are required. Note that under a direct approach the two parameters would need to be maximized together, which could lead to more numerical problems.

When there are random effects, the model can be specified as follows:

y_ij | b_i, u_ij ~ indep. Poisson(u_ij μ_ij),   ln(μ_ij) = x_ij^T β + z_ij^T b_i,
b_i ~ i.i.d. N_q(0, Σ),   u_ij ~ i.i.d. Gamma(α, 1/α),
and the random effects b_i and u_ij are independent. The unknown parameters α, β, and Σ can be estimated by a nested EM algorithm. Such an algorithm has been proposed by Booth and Hobert (personal communication) for the overdispersed Poisson GLMM, and more generally by van Dyk (1999), who motivated it from the point of view of computational efficiency. The complete data for the outer loop consist of y and b, and the outer EM algorithm is performed as outlined in Section 3.2.2, but with an EM maximization procedure for α and β as introduced above. The complete-data log-likelihood has the form

ln L_u = c + Σ_{i,j} [ln Γ(y_ij + α) - ln Γ(α) + α ln α - (α + y_ij) ln(α + μ_ij) + y_ij ln(μ_ij)] + Σ_i ln f(b_i; Σ),

and at the r-th E-step its conditional expectation is approximated by the Monte Carlo sum

(1/m) Σ_{k=1}^m { Σ_{i,j} [ln Γ(y_ij + α) - ln Γ(α) + α ln α - (α + y_ij) ln(α + μ_ij^{(k)}) + y_ij ln(μ_ij^{(k)})] + Σ_i ln f(b_i^{(k)}; Σ) },

where the b_i^{(k)} are generated from the conditional distribution of b_i | y_i; ψ^{(r-1)} and μ_ij^{(k)} = exp(x_ij^T β + z_ij^T b_i^{(k)}). The first part of that sum is what must be maximized to obtain estimates of α and β. Notice that each term in the first Monte Carlo sum is a log-likelihood for a negative binomial model with a different (but fixed) mean μ_ij^{(k)}, and hence it can be maximized by an EM algorithm that augments the observed data y by u. In summary, the nested EM algorithm is as follows:

1. Select an initial estimate ψ^{(0)}. Set r = 0.
2. Increase r by 1. E-step: for each subject i, i = 1, ..., n, generate m random samples from the distribution of b_i | y_i; ψ^{(r-1)} using rejection sampling, and approximate E(ln L_u(y, b; ψ) | y; ψ^{(r-1)}) by a Monte Carlo sum.

3. M-step:
3.1. Maximize with respect to the elements of Σ as usual.
3.2. Set l = 0, α_(0) = α̂^{(r-1)}, and β_(0) = β̂^{(r-1)}.
3.3. Increase l by 1. Inner E-step: for each generated b^{(k)}, compute E(ln L_u(y, u; α, β) | y; α_(l-1), β_(l-1)).
3.4. Inner M-step: find α_(l) and β_(l) to maximize the Monte Carlo sum of conditional expectations.
3.5. Iterate between (3.3) and (3.4) until convergence for α and β is achieved.

4. Iterate between (2) and (3) until convergence for ψ is achieved.

In the multivariate GLMM there are two or more response variables, but the nested EM algorithm works in essentially the same way because of the conditional independence between the variables. In the next section we describe a bivariate GLMM for the pony data.

Analysis of Pony Data

The model is defined as follows. Let

y_i1j = j-th duration measurement on the i-th pony,
y_i2j = j-th spike burst rate measurement on the i-th pony,
y_i1j | b_i1 ~ indep. Gamma with mean μ_i1j and variance μ_i1j²/γ1,
y_i2j | b_i2, u_i2j ~ indep. Poisson(u_i2j μ_i2j),
u_i2j ~ i.i.d. Gamma(α, 1/α),
ln(μ_i1j) = x_i1j^T β_1 + b_i1,   ln(μ_i2j) = x_i2j^T β_2 + b_i2.

Let d_ijk equal 1 if drug k (k = 1, ..., 6) is administered to pony i at occasion j, and 0 otherwise. Placebo is coded as drug 7 and is the reference group. Let t_ij denote time. Then the linear predictors are

ln(μ_i1j) = β_10 + Σ_{k=1}^6 β_1k d_ijk + b_i1,
ln(μ_i2j) = β_20 + Σ_{k=1}^6 β_2k d_ijk + β_27 t_ij + Σ_{k=1}^6 β_{2,7+k} d_ijk t_ij + b_i2.

In the univariate analyses we initially considered the same linear predictor for both variables, but the time-by-drug interaction and the time main effect were clearly not significant for the duration response and were dropped from the model. We also performed a fixed-effects analysis using PROC GENMOD and fitted two random-effects models to log-transformed variables using PROC MIXED for the complete data (all time points and all electrode groups). There was evidence of a three-way interaction between electrode group, time, and drug. Describing this interaction requires estimating 76 regression parameters for a linear time trend, as compared to 21 in the above specification. Recall that in the other data example we estimated at most five regression parameters, so the numerical maximization procedures in this example were expected to be much more demanding. The complete pony data set is also about ten times larger than the ethylene glycol data set, and the time trends appeared complicated and not well described by simple linear or quadratic time effects; one might need splines to describe the trends adequately. Hence we concentrated on a particular
research question, namely to describe differences in the immediate effects (up to 60 minutes after administration) of the drugs in the cecal base (corresponding to one of the electrode groups). In the analyses performed by Lester et al. (1998a, 1998b, 1999c) there was evidence that some drugs led to an immediate significant increase in spike burst count or spike burst duration, while others took longer or did not lead to an increase at all, and we decided to address that issue in our analysis.

We programmed the nested Monte Carlo EM and the adaptive Gaussian quadrature algorithms, both for the joint and for the separate fits. We used a negative binomial distribution for the count response, a Gamma distribution for the duration response, and log link functions for both. (Although the log function is not the canonical link for the Gamma response, it is convenient because it maps the mean onto the whole real line.) As initial estimates we used the final estimates from the pseudo-likelihood approach using the %GLIMMIX macro, rounded to one significant digit after the decimal point. The macro does not allow specification of a negative binomial distribution, and hence we fitted a Poisson distribution with an extra dispersion parameter.

The results from the analysis are summarized in Table 3.15. MaxBFGS has convergence problems when the responses are fitted together and adaptive Gaussian quadrature is used, so no results are presented for this case. We report the results using only one quadrature point, because the estimates and their standard errors are essentially the same if the number of quadrature points is increased. Adaptive Gaussian quadrature with one quadrature point is equivalent to the Laplace approximation (Liu and Pierce, 1994), which indicates that the analytical approximations work well for this data set. The simulation sample size for the separate fit of the Gamma response was increased at each iteration. Convergence was achieved when the simulation sample size reached 31174, after 20 iterations, and took about 1 hour. For the negative binomial response the final simulation sample size was still 100 after 542 iterations
Convergence was achieved when the simula tion sample size was 31174 after 20 iterations and took about 1 hour. For the negative binomial response the final simulation sample size was still 100 after 542 iterations
and it also took about 1 hour to converge. The joint fit took about 21 hours until convergence, with a final simulation sample size of 23381 and 201 iterations. The results from the joint and from the separate fits are almost identical, and the correlation between the two response variables is not significant (ρ̂ = 0.71, SE = 0.50). Notice that ρ̂ is quite large, and the fact that it is not significant may be partly due to there being only six subjects in the data set. Notice also that the standard error estimates from the three methods are similar, with only some of the adaptive Gaussian quadrature standard errors being smaller than their Monte Carlo counterparts. Unlike in the ethylene glycol example, the Monte Carlo standard error estimates did not show much variability.

For both responses the coefficients for drug 1 and for drug 4 are significantly different from zero. This means that both drugs have immediate effects on the response variables that differ significantly from those of saline solution. Drug 1 leads to a significant decrease in the individual level of duration for one hour after the drug is administered, and drug 4 is associated with a significant increase. Of all the interactions between drug and time, only the one involving drug 1 for the count response is significant.

3.5 Additional Methods

The empirical Bayes posterior mode procedure discussed in Fahrmeir and Tutz (1994, pp. 233-238) can also be used to estimate the parameters. The Newton-Raphson equations for the extended model, when a flat (or vague) prior is used and the dispersion parameters are set to 1.0, are essentially no more complicated than those for generalized linear mixed models. Fahrmeir and Tutz, however, do not discuss the estimation of φ1 and φ2 when they are unknown. The dispersion parameters can be estimated together with Σ via maximum likelihood, treating the current estimates of the fixed and the random effects as the true values.
Another possibility is to put noninformative priors on φ1 and φ2 and estimate them together with β and b.
Table 3.15. Final maximum likelihood estimates for the pony data.

                Monte Carlo EM                       Adaptive Gauss-Hermite quadrature
           Joint fit          Separate fits          Separate fits
Parameter  Estimate   SE      Estimate   SE          Estimate   SE
β10        0.11       0.06    0.13       0.05        0.12       0.03
β11        0.23       0.05    0.23       0.05        0.23       0.05
β12        0.11       0.05    0.11       0.05        0.11       0.05
β13        0.01       0.05    0.02       0.05        0.02       0.05
β14        0.32       0.05    0.32       0.05        0.32       0.05
β15        0.10       0.05    0.10       0.05        0.10       0.05
β16        0.18       0.05    0.18       0.05        0.18       0.05
β20        4.25       0.22    4.30       0.22        4.31       0.19
β21        1.30       0.28    1.31       0.28        1.31       0.28
β22        0.24       0.29    0.24       0.29        0.24       0.29
β23        0.28       0.28    0.27       0.28        0.27       0.28
β24        0.90       0.28    0.89       0.28        0.89       0.28
β25        0.15       0.29    0.14       0.29        0.14       0.28
β26        0.32       0.28    0.31       0.28        0.31       0.28
β27        0.04       0.07    0.04       0.07        0.04       0.07
β28        0.28       0.10    0.28       0.10        0.28       0.10
β29        0.05       0.11    0.05       0.11        0.05       0.10
β2,10      0.01       0.10    0.01       0.10        0.01       0.10
β2,11      0.17       0.10    0.17       0.10        0.17       0.10
β2,12      0.07       0.10    0.06       0.10        0.06       0.10
β2,13      0.14       0.10    0.13       0.10        0.13       0.10
γ1         17.3       1.43    17.3       1.43        17.6       1.44
α          3.41       0.34    3.41       0.34        3.47       0.29
σ1         0.10       0.03    0.10       0.03        0.09       0.03
σ2         0.26       0.09    0.26       0.08        0.24       0.07
ρ          0.71       0.50    -          -           -          -
What noninformative priors should be used in order to avoid dealing with improper posteriors is a topic for further research. The question of the propriety of the posterior also arises when Gibbs sampling is applied for posterior mean estimation, which is yet another method that can be used for fitting the extended model.
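To make the Gamma-Poisson mixture of Section 3.4 concrete, the sketch below implements the inner EM algorithm for a hypothetical intercept-only negative binomial model (simulated data; this is an illustration, not the dissertation's code). The E-step uses the closed-form Gamma posterior moments E(u | y) and E(ln u | y); the M-step updates μ in closed form and α by a one-dimensional trisection search, which is valid because the expected log-likelihood is concave in α. The digamma function is approximated by differencing lgamma, since the Python standard library does not provide it.

```python
import numpy as np
from math import lgamma

def digamma(x, h=1e-5):
    # stdlib has no digamma; a central difference of lgamma is accurate enough here
    return (lgamma(x + h) - lgamma(x - h)) / (2 * h)

def fit_negbin_em(y, n_iter=300):
    """EM for y_i | u_i ~ Poisson(u_i * mu), u_i ~ Gamma(alpha, 1/alpha)."""
    n, alpha, mu = len(y), 1.0, y.mean() + 0.1
    for _ in range(n_iter):
        # E-step: u_i | y_i ~ Gamma(y_i + alpha, 1/(alpha + mu))
        e_u = (y + alpha) / (alpha + mu)
        e_logu = np.array([digamma(yi + alpha) for yi in y]) - np.log(alpha + mu)
        # M-step: mu has a closed form (sum y / sum E(u));
        # alpha maximizes a concave one-dimensional objective
        mu = y.sum() / e_u.sum()
        c = (e_logu - e_u).sum()
        q = lambda a: n * (a * np.log(a) - lgamma(a)) + a * c
        lo, hi = 1e-3, 200.0
        for _ in range(100):                      # trisection search
            m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            if q(m1) < q(m2):
                lo = m1
            else:
                hi = m2
        alpha = 0.5 * (lo + hi)
    return alpha, mu
```

On data simulated from the mixture the fitted mean converges to the sample mean (the negative binomial maximum likelihood estimate for an intercept-only model) and α is recovered up to sampling error, mirroring the two separate maximizations described in the M-step above.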
CHAPTER 4
INFERENCE IN THE MULTIVARIATE GENERALIZED LINEAR MIXED MODEL

The estimates considered in the previous chapter are approximate maximum likelihood estimates, and hence confidence intervals and hypothesis tests can be constructed according to asymptotic maximum likelihood theory. A rigorous proof of the properties of the estimates requires checking the regularity conditions for consistency and asymptotic normality in the case of independent but not necessarily identically distributed random vectors. Such conditions have been established by Hoadley (1971), but they are difficult to check for the generalized linear mixed model and its multivariate extension because of the lack of a closed-form expression for the marginal likelihood. To our knowledge these conditions have not been verified for the generalized linear mixed model, and we could not do so for the multivariate GLMM. Instead we assume that the regularity conditions are satisfied and rely on the general results for maximum likelihood estimates. Caution is needed in testing the significance of variance components, because when a parameter falls on the boundary of the parameter space the asymptotic distribution of its maximum likelihood estimate is no longer normal (Moran, 1971; Chant, 1974; Self and Liang, 1987). Score tests may be a good alternative to the usual Wald and likelihood ratio tests because their asymptotic properties are retained on the boundary (Chant, 1974). Score test statistics are also computed under the null hypothesis and do not require fitting more complicated models, and hence can be used to check the conditional independence assumption.

Since there are no closed-form expressions for the marginal likelihood, the score, and the information matrix, numerical, stochastic, or analytical approximations must
be used when computing test statistics and constructing confidence intervals. This aggravates the problem of determining actual error rates and coverage probabilities and requires the use of simulations to study the behaviour of the tests. Analytical and stochastic approximations can be improved by increasing the number of quadrature points or the simulation sample sizes, but the precision of analytical approximations cannot be directly controlled. For example, in the case of binary data pseudo-likelihood works well if the binomial denominator is large (Breslow and Clayton, 1993), but the latter depends only on the data, so for a particular data set the approximation is either good or bad. In general, estimates based on analytical approximations are asymptotically biased.

In this chapter we concentrate on analytical and stochastic approximations of Wald, score, and likelihood ratio statistics and study their performance for checking the conditional independence assumption. We also show how these approximations can be constructed for testing fixed effects and for estimating random effects, and consider them for checking the significance of variance components. Since the asymptotic maximum likelihood results hold when the number of subjects increases to infinity, we focus on the ethylene glycol example. The pony data have only six subjects, and as suggested in Chapter 3, inference concerning the correlation between the two response variables is suspect.

Section 4.1 discusses Wald and likelihood ratio tests for the fixed effects. We briefly consider estimation of random effects and prediction of future observations in Section 4.2. The score approach is introduced in Section 4.3.1 with a historical overview. We then propose score tests for checking the conditional independence assumption (Section 4.3.2) and for testing the variance components (Section 4.3.3). The ethylene glycol example is used for illustration in Section 4.4, and Section 4.5
contains the results from a small simulation study of the performance of the proposed conditional independence test. The chapter concludes with a discussion of future research topics.

4.1 Inference about Regression Parameters

The asymptotic properties of maximum likelihood estimates have been studied under a variety of conditions. The usual assumption is that the observations on which the maximum likelihood estimates are based are independent and identically distributed (Foutz, 1977), but results are also available for models in which the observations are independent but not identically distributed (Hoadley, 1971). In general, let y_1, y_2, ..., y_n be independent random vectors with density or mass functions f_1(y_1; ψ), f_2(y_2; ψ), ..., f_n(y_n; ψ) depending on a common unknown parameter vector ψ. Then as n → ∞, under certain regularity conditions, the maximum likelihood estimator ψ̂ is consistent and asymptotically normal:

√n (ψ̂ - ψ) →_d N(0, I^{-1}(ψ)),   where I(ψ) = lim_{n→∞} (1/n) Σ_{i=1}^n E[-∂² l_i(y_i; ψ) / ∂ψ ∂ψ^T],   l_i(y_i; ψ) = ln f_i(y_i; ψ).

Two basic principles of testing are directly based on the asymptotic distribution of the maximum likelihood estimate: the Wald test (Wald, 1943) and the likelihood ratio test (Neyman and Pearson, 1928). Let ψ = (ψ_1^T, ψ_2^T)^T. The significance of the subset ψ_1 of the parameter vector ψ can be tested by either of them as follows. The null hypothesis is

H_0: ψ_1 = 0,     (4.1)
where 0 is in the interior of the parameter space. (In general the null hypothesis is H_0: ψ_1 = ψ_10, but herein we consider the simpler test.) Let I^{ψ1ψ1}(ψ) denote the block of I^{-1}(ψ) corresponding to ψ_1. Similarly, let J^{ψ1ψ1}(ψ) denote the corresponding block of J^{-1}(ψ), where J(ψ) is the observed information defined below. The Wald test statistic is then

T_W = ψ̂_1^T [J^{ψ1ψ1}(ψ)]^{-1} ψ̂_1,

with ψ_2 in J^{ψ1ψ1} replaced by the consistent maximum likelihood estimator ψ̂_2 under the null hypothesis. Under H_0, T_W →_d χ²_d, where d is the dimension of the parameter vector ψ_1. Because in the case of independent data nI(ψ) may not be available, and in random effects models the expected information matrix may be hard to calculate, we use J(ψ) = Σ_{i=1}^n (-∂² l_i(y_i; ψ) / ∂ψ ∂ψ^T) instead of nI(ψ). Some authors argue that it is more appropriate to use the observed rather than the expected information (Efron and Hinkley, 1978), but unlike the expected information matrix, the observed information matrix is not guaranteed to be positive definite. The latter problem is exacerbated when the numerical or stochastic approximations discussed in Section 3.2 are applied to approximate the observed information matrix J(ψ), which is not available in closed form. We already used Wald tests in Chapter 3 to test the significance of individual regression coefficients, but Wald tests can also be used to check several regression coefficients simultaneously.

The likelihood ratio statistic for the hypothesis (4.1) is defined as follows. Let M_1 be the model with unknown parameter vector ψ, and let M_2 be the reduced model with ψ_1 = 0 and ψ_2 unrestricted. Let l̂_1 and l̂_2 denote the maximized
log-likelihoods for models M_1 and M_2, respectively. Under H_0, as n → ∞,

T_{LR} = 2(l̂_1 - l̂_2) →_d χ²_d,

so the likelihood ratio and the Wald statistic have the same asymptotic distribution under the null hypothesis. The absence of closed-form expressions for l̂_1 and l̂_2 necessitates the use of either Gaussian quadrature or Monte Carlo approximations. Gaussian quadrature approximations are applied exactly as described in Section 3.2.1. To obtain Monte Carlo approximations, m samples b_i^{(k)} from the random effects distribution b_i ~ N_q(0, Σ̂) are generated, and

l̂_1 ≈ Σ_{i=1}^n ln [ (1/m) Σ_{k=1}^m f(y_i | b_i^{(k)}; β̂, φ̂) ].

Here Σ̂, β̂, and φ̂ are the final maximum likelihood estimates from the full model M_1. The same type of approximation is used for l̂_2, but evaluated at ψ_1 = 0 and at the maximum likelihood estimator ψ̂ under M_2.

Under ideal conditions, for a large enough number of quadrature points in the Gaussian quadrature algorithm and a large enough simulation sample size in the MCEM algorithm, the approximate maximum likelihood estimates will be arbitrarily close to the true estimates. Also, the additional approximations needed to compute the Wald and likelihood ratio statistics above can be made very precise, so the statistics should perform well for a large number of subjects. But in practice there can be problems, because it is not clear how many quadrature points or how large a simulation sample are needed for adequate approximations.

Even if all approximations are adequate and the sample size is large enough, the Wald and likelihood ratio tests can still run into problems if a parameter is on the boundary of the parameter space. This happens in testing the significance of a variance term; hence Wald and likelihood ratio tests should be used with caution in that situation. It has been proven by several authors (Moran, 1971; Chant, 1974; Self and Liang,
1987) that when a parameter is on the boundary of the parameter space, the asymptotic distribution of the maximum likelihood estimator is no longer normal but rather a mixture of distributions. The score tests discussed in Section 4.3 are not affected, and therefore may be a good substitute for Wald and likelihood ratio tests.

4.2 Estimation of Random Effects

In some applications it is of interest to obtain estimates of the unobserved random effects. A natural point estimator is the conditional mean

b̂_i = E(b_i | y_i; ψ̂),

which is not available in closed form but can be approximated either numerically or stochastically. Gauss-Hermite quadrature involves two separate approximations, one for the numerator ∫ b_i f(y_i, b_i; ψ̂) db_i and one for the denominator f(y_i; ψ̂), while the stochastic approximation is the simple Monte Carlo sum

b̂_i ≈ (1/m) Σ_{k=1}^m b_i^{(k)},

where b_i^{(1)}, ..., b_i^{(m)} are simulated from the conditional distribution of {b_i | y_i; ψ̂} using a technique such as rejection sampling. Note that this approximation is performed anyway in the MCEM algorithm, and hence the random effects estimate is obtained at no extra cost.

Estimates of the random effects are needed for prediction. For example, one might be interested in obtaining an estimate of the linear predictor for a particular subject:

η̂_ij = x_ij^T β̂ + z_ij^T b̂_i.

Obtaining the variance of the random effects estimate is not straightforward. The simplest approach is to use the variance

Var(b_i | y_i; ψ)
evaluated at ψ̂, but this 'naive' estimate may underestimate the true variance, as it does not account for the sampling variability of ψ̂. As an alternative, Booth and Hobert (1998) suggested using a conditional mean squared error of prediction as a measure of prediction variance. Their approach can also be applied to the multivariate GLMM. Note that in the two 'real-life' examples that we consider, estimation of the random effects is not of particular interest. The subjects are mice and ponies, respectively, and their individual characteristics are a source of variability that needs to be accounted for, but not necessarily precisely estimated for each subject. In other applications, for example in small area estimation, prediction of the random effects is a very important objective of the analysis.

4.3 Inference Based on Score Tests

4.3.1 General Theory

An overview of the historical development of the score test is provided in a research paper by Bera and Bilias (1999). Rao (1947) was the first to introduce the fundamental principle of testing based on the score function as an alternative to likelihood ratio and Wald tests. Let y_1, y_2, ..., y_n be i.i.d. observations with density f(y_i; ψ). Denote the joint log-likelihood, score, and expected information matrix of those observations by l(ψ), s(ψ), and I(ψ), respectively. Suppose that the interest is in testing a simple hypothesis against a local alternative,

H_0: ψ = ψ_0   vs.   H_a: ψ = ψ_0 + δ,

where ψ = (ψ_1, ..., ψ_p)^T and δ = (δ_1, ..., δ_p)^T. Then the score test is based on
T_s(ψ_0) = s(ψ_0)^T I^{-1}(ψ_0) s(ψ_0),

which has an asymptotic χ²_p distribution. If the null hypothesis is composite, that is,

H_0: h(ψ) = c,

where h(ψ) is an r × 1 vector function of ψ with r ≤ p restrictions and c is a known constant vector, Rao (1947) suggested using

T_s(ψ̃) = s(ψ̃)^T I^{-1}(ψ̃) s(ψ̃),

where ψ̃ is the restricted maximum likelihood estimate of ψ, that is, the estimate obtained by maximizing the log-likelihood function l(ψ) under H_0. T_s(ψ̃) has an asymptotic χ²_r distribution.

Independently of Rao's work, Neyman (1959) proposed C(α) tests. These tests are specifically designed for hypothesis testing of a parameter of primary interest in the presence of nuisance parameters, and they are more general than Rao's score tests in that any √n-consistent estimates of the nuisance parameters can be used, not only the maximum likelihood estimators. By design, C(α) tests maximize the slope of the limiting power function under local alternatives to the null hypothesis. Neyman assumed the same setup as in Rao's score test but considered the simple null hypothesis H_0: ψ_1 = ψ_10, where ψ = (ψ_1, ψ_2^T)^T; notice that ψ_1 is a scalar. The score vector and the information matrix are partitioned as follows:

s(ψ) = (s_{ψ1}(ψ), s_{ψ2}(ψ)^T)^T,   I(ψ) = [ I_{ψ1ψ1}(ψ)  I_{ψ1ψ2}(ψ) ; I_{ψ2ψ1}(ψ)  I_{ψ2ψ2}(ψ) ].

Then the C(α) score statistic is
T_{C(α)} = [s_{ψ1}(ψ_10, ψ̃_2) - I_{ψ1ψ2}(ψ_10, ψ̃_2) I^{-1}_{ψ2ψ2}(ψ_10, ψ̃_2) s_{ψ2}(ψ_10, ψ̃_2)]^T
    × [I_{ψ1ψ1}(ψ_10, ψ̃_2) - I_{ψ1ψ2}(ψ_10, ψ̃_2) I^{-1}_{ψ2ψ2}(ψ_10, ψ̃_2) I_{ψ2ψ1}(ψ_10, ψ̃_2)]^{-1}
    × [s_{ψ1}(ψ_10, ψ̃_2) - I_{ψ1ψ2}(ψ_10, ψ̃_2) I^{-1}_{ψ2ψ2}(ψ_10, ψ̃_2) s_{ψ2}(ψ_10, ψ̃_2)],

where ψ̃_2 is a √n-consistent estimator of ψ_2. Neyman's C(α) statistic reduces to Rao's score statistic when ψ̃_2 is the maximum likelihood estimate of ψ_2 under H_0.

Bühler and Puri (1966) extended the asymptotic and local optimality of Neyman's C(α) statistic to a vector-valued parameter of interest and to the case of independent but not necessarily identically distributed random variables. They assumed that ψ_10 was interior to an open set in the parameter space, but as pointed out later by Moran (1971) and Chant (1974), this restriction is unnecessary. Chant (1974) showed that when the parameter is on the boundary of a closed parameter space, the score test retains its asymptotic properties, while the asymptotic distributional forms of the test statistics based on the maximum likelihood estimators are no longer χ². In addition to this advantage, the score test has the computational advantage that only estimates under the null hypothesis are needed to compute the test statistic. We now use this feature to propose tests for checking the conditional independence assumption.

4.3.2 Testing the Conditional Independence Assumption

Difficulties in testing the conditional independence of the response variables arise because of the need to specify more complicated models for the joint response distribution. Even if score tests are used (which, as discussed, do not require fitting more complicated models), extended models need to be specified, and the forms of the score function and of the information matrix need to be derived. We again consider the bivariate GLMM case for simplicity. A convenient way to introduce conditional
dependence is to use one of the response variables as a covariate in the linear predictor for the other response variable. In the bivariate case this leads to the following model:

y_i1j | b_i, y_i2j ~ indep. f_1,   with η_i1j = x_i1j^T β_1 + z_i1j^T b_i1 + γ y_i2j,
y_i2j | b_i ~ indep. f_2,   with η_i2j = x_i2j^T β_2 + z_i2j^T b_i2,

b_i = (b_i1^T, b_i2^T)^T ~ i.i.d. MVN(0, Σ),   Σ = [ Σ_11  Σ_12 ; Σ_12^T  Σ_22 ].

In general this setup leads to a complicated form of conditional dependence, which is hard to interpret if there is no natural ordering of the two responses. The case γ = 0 corresponds to conditional independence, but testing γ = 0 in the above model is performed against a complicated alternative on the marginal scale. When the identity link function is used for the first response, conditionally on the random effects Cov(y_i1j, y_i2j) = γ Var(y_i2j), and the test is a test of conditional uncorrelatedness of the two outcomes. If both outcomes are normally distributed, then this is truly a test of conditional independence.

An interesting case to consider, in view of the simulated data example and the ethylene glycol application, is when one of the responses is normally distributed and the other has a Bernoulli distribution. In the above general specification let f_1 be the normal density function and f_2 the Bernoulli probability function. Also assume that the random effects consist of two random intercepts, and let

η_i1j = x_i1j^T β_1 + b_i1 + γ y*_i2j,

where y*_i2j = 2 y_i2j - 1. Then
E(y_i1j | b_i, y_i2j) = x_i1j^T β_1 + b_i1 + γ y*_i2j,

so the conditional means for y_i2j = 1 and y_i2j = 0 differ by 2γ, and hence testing H_0: γ = 0 against H_1: γ ≠ 0 is equivalent to testing for a location shift in the conditional distribution of the normal response. The score test statistic is

T_γ = s_γ(ψ̃)² [I_{γγ}(ψ̃) - I_{γ,ψ2}(ψ̃) I^{-1}_{ψ2ψ2}(ψ̃) I_{ψ2,γ}(ψ̃)]^{-1},

where s_γ is the element of the score vector corresponding to γ, I is the expected information matrix, ψ_2 collects the remaining parameters, and ψ̃ is the restricted maximum likelihood estimate under H_0. Note that even in this simple case neither the score nor the expected information matrix has a closed-form expression, and hence the score statistic must be approximated. We again consider Gaussian quadrature and Monte Carlo approximations. The log-likelihood is ln L = Σ_{i=1}^n ln L_i, where

L_i = ∫ [ Π_{j=1}^{n_i} f_1(y_i1j | b_i1, y_i2j; β_1, γ, φ_1) ] × [ exp(Σ_{j=1}^{n_i} y_i2j (x_i2j^T β_2 + b_i2)) / Π_{j=1}^{n_i} (1 + exp(x_i2j^T β_2 + b_i2)) ] f(b_i; Σ) db_i.
    = \sum_{i=1}^n \int [ \partial/\partial\gamma ln f_1(y_{i1} | b_{i1}, y_{i2}; \beta_1, \gamma, \phi_1) ] f(y_i, b_i; \psi) / f(y_i; \psi) db_i
    = \sum_{i=1}^n E[ \partial ln f_1 / \partial\gamma | y_i ].

The expectation above is taken with respect to the conditional distribution of the random effects given the response vector. Differentiating with respect to \gamma,

    \partial ln f_1 / \partial\gamma = \sigma^{-2} \sum_j (y_{i1j} - \mu_{i1j}) y*_{i2j},

and therefore, under H_0,

    s_\gamma = \sigma^{-2} \sum_{i=1}^n \sum_j ( y_{i1j} - x_{i1j}^T \beta_1 - E(b_{i1} | y_i) ) y*_{i2j}.

So to approximate the score under the null hypothesis we only need to approximate the conditional mean E(b_{i1} | y_i) by the Monte Carlo sum (1/m) \sum_{k=1}^m b_{i1}^{(k)}, where b_i^{(k)}, k = 1, ..., m, are generated for the estimation of the standard errors in the MCEM algorithm (Section 3.2.2). The elements of the observed information matrix J, which can be used in place of I, can also be approximated using Louis's method. J_{\psi\psi} is available from the MCEM algorithm, and only J_{\gamma\gamma}, J_{\gamma\psi} and J_{\psi\gamma} need to be computed. The latter can also be performed in the procedure for finding the standard errors of the estimates in the MCEM algorithm.

Gaussian quadrature using numerical derivatives involves approximating the loglikelihood once and then numerically differentiating with respect to \gamma and the other parameters to obtain the score and the observed information matrix. Denote the Gauss-Hermite quadrature approximation of the loglikelihood by l^{GQ} and let s_\gamma^{GQ} =
\partial l^{GQ} / \partial\gamma, with the blocks of the matrix J^{GQ} of negative second derivatives of l^{GQ} defined analogously. Then the approximation to the score statistic is

    S^{GQ} = (s_\gamma^{GQ})^2 [ J_{\gamma\gamma}^{GQ} - J_{\gamma\psi}^{GQ} (J_{\psi\psi}^{GQ})^{-1} J_{\psi\gamma}^{GQ} ]^{-1}.

The performance of the score statistics for conditional independence is studied in more detail in Section 4.5.

When there are more than two response variables, this approach to testing for departure from conditional independence becomes very complicated and not easily interpretable. It is also not easy to decide which variable to use in the linear predictor for the other one, unless there is a natural ordering. This issue is discussed in more detail for the ethylene glycol example.

4.3.3 Testing the Significance of Variance Components

Global Variance Components Test

Lin (1997) proposed a global variance component test for testing the significance of all variance components in the univariate GLMM, which can be extended to the multivariate GLMM. The null hypothesis for the global test is H_0: \delta = 0, where \delta is the vector of all variance components for the random effects. Suppose for simplicity that q_1 = q_2 = 1 (one random effect per response) and that there are two response variables. The generalizations to arbitrary q_1 and q_2 and to more than two response variables are straightforward. The form of the score test statistic is

    T_S(\hat\beta) = S_\delta(\hat\beta)^T \tilde{I}(\hat\beta)^{-1} S_\delta(\hat\beta),

where \hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T, \hat\beta_1 and \hat\beta_2 are the maximum likelihood estimates under H_0, i.e. the maximum likelihood estimators from the two separate fixed-effects generalized linear models for the two response variables, and \tilde{I} is the efficient information matrix for \delta. Under H_0, T_S(\hat\beta) has an asymptotic \chi^2_d distribution, where d is the number of variance-covariance parameters for the random effects.
In the univariate GLMM considered by Lin, the r-th element of the score vector has the form

    S_{\delta_r}(\hat\beta) = (1/2) \sum_{i=1}^n { (y_i - \hat\mu_i)^T \Delta_i^{-1} W_i Z_i (\partial\Sigma/\partial\delta_r) Z_i^T W_i \Delta_i^{-1} (y_i - \hat\mu_i) - tr( W_{0i} Z_i (\partial\Sigma/\partial\delta_r) Z_i^T ) },

where g(\mu_i) = X_i \beta, \Sigma = Var(b_i), and \partial\Sigma/\partial\delta_r is evaluated at \delta = 0. The matrices \Delta_i, W_i and W_{0i} are diagonal with elements \delta_{ij} = g'(\mu_{ij}), w_{ij} = ( V(\mu_{ij}) {g'(\mu_{ij})}^2 )^{-1} and w_{0ij} = w_{ij} + e_{ij} (y_{ij} - \mu_{ij}), where

    e_{ij} = [ V'(\mu_{ij}) g'(\mu_{ij}) + V(\mu_{ij}) g''(\mu_{ij}) ] / [ V^2(\mu_{ij}) {g'(\mu_{ij})}^3 ]

in general, and e_{ij} = 0 for canonical link functions. The subscript j refers to the j-th observation on the i-th subject.

Following step by step Lin's derivation for the univariate GLMM, the corresponding r-th element of the score function for the multivariate GLMM is

    S_{\delta_r}(\hat\beta) = (1/2) \sum_{k=1,2} \sum_{i=1}^n { (y_{ik} - \hat\mu_{ik})^T \Delta_{ik}^{-1} W_{ik} Z_{ik} (\partial\Sigma_{kk}/\partial\delta_r) Z_{ik}^T W_{ik} \Delta_{ik}^{-1} (y_{ik} - \hat\mu_{ik}) - tr( W_{0ik} Z_{ik} (\partial\Sigma_{kk}/\partial\delta_r) Z_{ik}^T ) }
        + \sum_{i=1}^n { (y_{i2} - \hat\mu_{i2})^T \Delta_{i2}^{-1} W_{i2} Z_{i2} (\partial\Sigma_{12}^T/\partial\delta_r) Z_{i1}^T W_{i1} \Delta_{i1}^{-1} (y_{i1} - \hat\mu_{i1}) },    (4.2)

where the subscripts 1 and 2 refer to the parts of the vectors and matrices corresponding to the first and to the second variable, respectively.

The proof is as follows. Let l_i(b_i) = ln f(y_i | b_i; \psi) denote the conditional loglikelihood for the i-th subject. Then the marginal likelihood for the i-th subject is

    L_i = E_{b_i}[ exp( l_i(b_i) ) ],
where the expectation is taken with respect to the marginal distribution of b_i. Expanding the integrand in a multivariate Taylor series around b_i = 0, we get

    exp(l_i(b_i)) = exp(l_i(0)) [ 1 + (\partial l_i(0)/\partial b_i)^T b_i + (1/2) b_i^T { \partial^2 l_i(0)/\partial b_i \partial b_i^T + (\partial l_i(0)/\partial b_i)(\partial l_i(0)/\partial b_i)^T } b_i + \epsilon_i ],

where \epsilon_i contains third- and higher-order terms of b_i. Notice that

    \partial l_i/\partial b_i = Z_i^T \partial l_i/\partial\eta_i  and  \partial^2 l_i/\partial b_i \partial b_i^T = Z_i^T ( \partial^2 l_i/\partial\eta_i \partial\eta_i^T ) Z_i,

where \eta_i is the vector of linear predictors for the i-th subject. Then, taking expectation and using the moment assumptions for b_i, the marginal loglikelihood for the i-th subject is

    l_i = l_i(0) + (1/2) tr( Z_i^T { \partial l_i(0)/\partial\eta_i \partial l_i(0)/\partial\eta_i^T + \partial^2 l_i(0)/\partial\eta_i \partial\eta_i^T } Z_i \Sigma ) + r_i.

Here r_i contains terms that are products of variance components, and its derivative will be 0 when evaluated under H_0. Now we must take into consideration that there are two response variables. Because l_{i1} and l_{i2} depend on different sets of random effects,

    \partial l_i/\partial\eta_i = ( (\partial l_{i1}/\partial\eta_{i1})^T, (\partial l_{i2}/\partial\eta_{i2})^T )^T  and  Z_i = diag(Z_{i1}, Z_{i2}).
Then

    l_i = l_{i1}(0) + l_{i2}(0)
        + (1/2) tr( Z_{i1}^T { \partial l_{i1}(0)/\partial\eta_{i1} \partial l_{i1}(0)/\partial\eta_{i1}^T + \partial^2 l_{i1}(0)/\partial\eta_{i1}\partial\eta_{i1}^T } Z_{i1} \Sigma_{11} )
        + (1/2) tr( Z_{i2}^T { \partial l_{i2}(0)/\partial\eta_{i2} \partial l_{i2}(0)/\partial\eta_{i2}^T + \partial^2 l_{i2}(0)/\partial\eta_{i2}\partial\eta_{i2}^T } Z_{i2} \Sigma_{22} )
        + tr( Z_{i2}^T \partial l_{i2}(0)/\partial\eta_{i2} \partial l_{i1}(0)/\partial\eta_{i1}^T Z_{i1} \Sigma_{12} ) + r_i.

To obtain (4.2) one uses the fact that l_{i1} and l_{i2} do not depend on \delta and that, for exponential family distributions,

    \partial l_{ik}(0)/\partial\eta_{ik} = \Delta_{ik}^{-1} W_{ik} (y_{ik} - \mu_{ik}),  k = 1, 2.

Notice that in the bivariate GLMM it is likely that the two responses require different variance components, in which case the expressions for the elements of the score vector above simplify. Suppose that the random effects variance-covariance matrix has the form

    \Sigma = [ \Sigma_{11}(\delta_1)  \Sigma_{12}(\delta_{12}) ; \Sigma_{12}^T(\delta_{12})  \Sigma_{22}(\delta_2) ],    (4.3)

where \delta_1, \delta_2 and \delta_{12} are different parameter vectors. Then

    \partial\Sigma_{22}/\partial\delta_1 = \partial\Sigma_{12}/\partial\delta_1 = 0,  \partial\Sigma_{11}/\partial\delta_2 = \partial\Sigma_{12}/\partial\delta_2 = 0,  \partial\Sigma_{11}/\partial\delta_{12} = \partial\Sigma_{22}/\partial\delta_{12} = 0,

and the score vector is
    S_{\delta_1}(\hat\beta) = (1/2) \sum_{i=1}^n { (y_{i1} - \hat\mu_{i1})^T \Delta_{i1}^{-1} W_{i1} Z_{i1} (\partial\Sigma_{11}/\partial\delta_1) Z_{i1}^T W_{i1} \Delta_{i1}^{-1} (y_{i1} - \hat\mu_{i1}) - tr( W_{0i1} Z_{i1} (\partial\Sigma_{11}/\partial\delta_1) Z_{i1}^T ) },
    S_{\delta_2}(\hat\beta) = (1/2) \sum_{i=1}^n { (y_{i2} - \hat\mu_{i2})^T \Delta_{i2}^{-1} W_{i2} Z_{i2} (\partial\Sigma_{22}/\partial\delta_2) Z_{i2}^T W_{i2} \Delta_{i2}^{-1} (y_{i2} - \hat\mu_{i2}) - tr( W_{0i2} Z_{i2} (\partial\Sigma_{22}/\partial\delta_2) Z_{i2}^T ) },
    S_{\delta_{12}}(\hat\beta) = \sum_{i=1}^n (y_{i2} - \hat\mu_{i2})^T \Delta_{i2}^{-1} W_{i2} Z_{i2} (\partial\Sigma_{12}^T/\partial\delta_{12}) Z_{i1}^T W_{i1} \Delta_{i1}^{-1} (y_{i1} - \hat\mu_{i1}).

Lin showed that the information matrix in the univariate GLMM depends only on the first two moments of the response variables, and that its elements can be expressed in closed form for exponential family responses. It is easy to verify that the information matrix for the multivariate GLMM also depends only on the first two moments of the two response variables and does not contain more complicated expressions than its simpler counterpart. The latter property is due to the independence of the response variables under H_0. Note that the elements of the expected information matrix are expectations of products of the corresponding elements of the score vector, where the expectations are computed at \delta = 0. Consider only I_{\delta\delta} for now, and let us assume that \Sigma has the structure in (4.3). Then I_{\delta_1\delta_1} is exactly the same as in a univariate GLMM and hence can be expressed as proposed by Lin. On the other hand, I_{\delta_1\delta_2} = E(s_{\delta_1} s_{\delta_2}) = E(s_{\delta_1}) E(s_{\delta_2}) = 0 under H_0, because the two score vectors depend on different response variables, which are independent under the null hypothesis. Also
    I_{\delta_1\delta_{12}} = \sum_{i=1}^n \sum_{i'=1}^n E[ h_1(y_{i1}) (y_{i'2} - \mu_{i'2}) ] = \sum_{i=1}^n \sum_{i'=1}^n E[ h_1(y_{i1}) ] E[ y_{i'2} - \mu_{i'2} ] = 0.

Here h_1(y_{i1}) is a function of the first response variable y_{i1} only. Similarly, all other parts of the information matrix which correspond to partial derivatives with respect to parameters for different response variables are zero, and hence the expected information matrix has the form

    I = [ I_{\beta_1\beta_1}   0                 I_{\beta_1\delta_1}  0                  0 ;
          0                  I_{\beta_2\beta_2}  0                  I_{\beta_2\delta_2}  0 ;
          I_{\delta_1\beta_1} 0                 I_{\delta_1\delta_1} 0                  0 ;
          0                  I_{\delta_2\beta_2} 0                  I_{\delta_2\delta_2} 0 ;
          0                  0                 0                  0                  I_{\delta_{12}\delta_{12}} ],

and the score statistic separates as follows:

    T_S(\hat\beta) = T_{S,\delta_1}(\hat\beta_1) + T_{S,\delta_2}(\hat\beta_2) + S_{\delta_{12}}(\hat\beta)^T I_{\delta_{12}\delta_{12}}^{-1} S_{\delta_{12}}(\hat\beta).

This factorization appears only when the variance-covariance matrix is structured as in (4.3); otherwise the expression is more complicated, but it still depends only on the first two moments of the response. The key to proving this is to notice that the highest-order expectation that needs to be computed is of the form E(y_{ikj} - \mu_{ikj})^4. The same is true for the univariate GLMM, and hence Lin's arguments can be directly applied.

Lin proves that the global score statistic in the univariate GLMM follows a chi-squared distribution with d degrees of freedom (d is equal to the number of random effects) asymptotically under \delta = 0. The asymptotic result holds when the number
of subjects goes to infinity and the number of observations on each subject remains bounded. In the multivariate GLMM the asymptotic distribution is also \chi^2, but with the number of degrees of freedom adjusted accordingly.

The global score statistic is not very useful even in the GLMM context, because it tests the significance of all variance components simultaneously, while in most cases it will be more interesting to check a subset of the variance components. But in the multivariate GLMM the global score test is even less appealing. Suppose that the test is performed and that the null hypothesis is rejected. What information can one get from that result? It will not be clear whether the rejection occurred because of extra variability in one of the variables, or in the other one, or in both. It may be more meaningful to perform score tests for each variable separately and then check for correlation between the two responses.

Lin also develops score tests for specific variance components in the independent random effects model. In contrast to the global test, here the score vector and the efficient information matrix cannot be computed in closed form in general, and Lin uses Laplace approximations. Not surprisingly, the approximation to the score statistic does not work well in the binary case, as demonstrated by some simulations that she performed. Lin's score tests can be used with the Breslow and Clayton, and the Wolfinger and O'Connell methods, but are not adequate if used with Gaussian quadrature or the MCEM algorithm. In that case it is natural to try to develop score tests based on numerical or stochastic approximations. We now discuss a direct approach which is of limited use, and an indirect approach which is more complicated but is especially suited for variance components on the boundary of the parameter space.
Tests for Individual Variance Components

We consider a bivariate GLMM for simplicity. Suppose one is interested in testing H_0: \psi_l = 0, where \psi_l is a subset of the variance components for the random effects.
Also let \psi_l have L elements and let \psi = (\psi_l^T, \psi_{\bar l}^T)^T, where \psi_{\bar l} contains the remaining parameters. The score statistic is

    S = s_{\psi_l}^T ( I_{\psi_l\psi_l} - I_{\psi_l\psi_{\bar l}} I_{\psi_{\bar l}\psi_{\bar l}}^{-1} I_{\psi_{\bar l}\psi_l} )^{-1} s_{\psi_l},

which under H_0 is asymptotically \chi^2_L.

To develop a Monte Carlo approximation we can try to follow the approach we used for the conditional independence test. Under the assumption of interchangeability of the integral and differential signs, the score vector can be rewritten as follows:

    s_{\psi_l} = \sum_{i=1}^n E[ \partial ln f(b_i; \delta) / \partial\psi_l | y_i ].

The random effects density f(b_i; \delta) is multivariate normal for our models, and hence it is possible to obtain expressions for the partial derivatives inside the expectation using the approach of Jennrich and Schluchter as outlined in Section 3.2.1. There is no closed-form expression for the score vector, but we can approximate its i-th term by

    (1/m) \sum_{k=1}^m \partial/\partial\psi_l ln f(b_i^{(k)}; \delta),

where b_i^{(k)} are simulated values from the conditional distribution b_i | y_i. As mentioned before, the expected information matrix is much harder to work with, and hence the observed information matrix can be approximated using Louis's method as shown in Section 3.2.2. Notice, though, that the score vector must be evaluated at \psi_l = 0 and at the restricted maximum likelihood estimates \hat\psi_{\bar l}. Depending on the subset of the variance components tested, the derivative at \psi_l = 0 may not exist. Consider for example the case when \psi_l = (\sigma_1^2, \sigma_{12})^T and when the null hypothesis is H_0: \sigma_1^2 = \sigma_{12} = 0. Then \Sigma is singular under H_0 and we cannot evaluate the derivative. If we test only H_0: \sigma_{12} = 0, then there is no problem with the test. In this case
the parameter is not on the boundary under the null hypothesis, but the score test is useful because the correlation between the response variables can be tested from the fit of two separate GLMMs. Recall that univariate GLMMs can be fitted using standard software such as PROC NLMIXED in SAS. Hence one can decide whether there is a need to fit the two responses together based only on univariate analyses.

An alternative method to compute an approximation to the score statistic above is to use Gauss-Hermite quadrature. Two possible approaches can be followed. The easier one is to compute first- and second-order numerical derivatives of the loglikelihood and then compute the score statistic based on them. Exact derivatives might be useful if the number of observations per subject is not very large. Both approaches are not applicable if the tested subset of variance components leads to a non-positive-definite \Sigma. Denote the Gauss-Hermite quadrature approximation of the loglikelihood by l^{GQ}. Also let s_{\psi_l}^{GQ} = \partial l^{GQ}/\partial\psi_l and J^{GQ}_{\psi\psi} =
-\partial^2 l^{GQ}/\partial\psi\partial\psi^T; the score statistic is then computed from these quantities as before. For the indirect approach, let the corresponding vector of random effects with 0 variances under the null hypothesis be b_{i,l}. Let

    b_i = (b_{i,l}^T, b_{i,\bar l}^T)^T ~ N(0, \Sigma),  \Sigma = [ \Sigma_{l,l}  \Sigma_{l,\bar l} ; \Sigma_{\bar l,l}  \Sigma_{\bar l,\bar l} ].

Under H_0, \Sigma_{l,l} = 0 and \Sigma_{l,\bar l} = 0. We can rewrite the marginal likelihood for the i-th subject as follows:

    L_i = E_{b_{i,\bar l}} E_{b_{i,l} | b_{i,\bar l}} exp( l_i(b_i) ),

and then we can expand the integrand around the conditional mean of b_{i,l} given b_{i,\bar l}. Because the random effects distribution is assumed to be multivariate normal, the conditional distribution of b_{i,l} is also multivariate normal. Thus we obtain

    E_{b_{i,l} | b_{i,\bar l}} exp( l_i(b_i) ) = exp( l_i(\tilde b_i) ) [ 1 + (1/2) tr( \tilde Z_i^T { \partial l_i(\tilde b_i)/\partial\eta_i \partial l_i(\tilde b_i)/\partial\eta_i^T + \partial^2 l_i(\tilde b_i)/\partial\eta_i\partial\eta_i^T } \tilde Z_i \Sigma_{l|\bar l} ) ] + r_i,    (4.4)

where \tilde Z_i contains the columns of Z_i corresponding to b_{i,l}, \Sigma_{l|\bar l} is the conditional variance-covariance matrix of b_{i,l} given b_{i,\bar l}, and l_i(\tilde b_i) denotes the conditional loglikelihood for the i-th subject, ln f(y_i | b_i; \psi), evaluated at b_{i,\bar l} and at b*_{i,l} = \Sigma_{l,\bar l} (\Sigma_{\bar l,\bar l})^{-1} b_{i,\bar l}. To obtain the marginal likelihood of y_i we take the expectation of (4.4) with respect to b_{i,\bar l}, and because there is no closed-form expression for the integral, we need to use approximations. We ignore the remainder term r_i, which depends on second- and higher-order products of variance components and will be 0 under H_0. Now the problem of approximating the score statistic using the indirect approach is the same as the one using the direct approach, but with exp(l_i(b_i)) in the integrand replaced by (4.4) and the random vector b_i replaced by b_{i,\bar l}. Hence Gaussian quadrature and Monte Carlo methods can be used as described earlier in this section.
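Both the direct and indirect approaches rest on a Gauss-Hermite quadrature approximation l^{GQ} of a marginal loglikelihood, which is then differentiated numerically. A minimal sketch of the building block, for a hypothetical univariate random-intercept logit model (the data and parameter values below are illustrative, not from the ethylene glycol fits):

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def gh_loglik(y, x, beta, sigma_b, n_points=50):
    # Gauss-Hermite approximation of the marginal loglikelihood of a
    # random-intercept logit model: logit P(y_ij = 1 | b_i) = x_ij'beta + b_i,
    # with b_i ~ N(0, sigma_b^2).  The substitution b = sqrt(2)*sigma_b*t
    # turns each integral over b into a weighted sum over Hermite nodes.
    nodes, weights = hermgauss(n_points)
    b = np.sqrt(2.0) * sigma_b * nodes
    total = 0.0
    for yi, xi in zip(y, x):                       # loop over subjects
        eta = xi @ beta                            # fixed-effect part, shape (n_i,)
        # conditional loglikelihood of subject i at each quadrature node
        ll = np.array([np.sum(yi * (eta + bk) - np.log1p(np.exp(eta + bk)))
                       for bk in b])
        total += np.log(np.sum(weights * np.exp(ll))) - 0.5 * np.log(np.pi)
    return total
```

The score and observed information approximations described above would then be obtained by first- and second-order numerical differences of `gh_loglik` in the parameters; increasing `n_points` (the dissertation uses 50-100) makes the approximation internally consistent, as in Table 4.2.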
4.4 Applications

In this chapter we consider numerical and stochastic approximations to the Wald, likelihood ratio and score tests. The conditional independence test provides a nice framework to compare the performance of all three approaches. Recall that in the ethylene glycol example conditional independence implies that the correlation between birth weight and malformation measured on the same fetus is the same as the correlation between birth weight and malformation measured on two different fetuses within a litter. Hence it is likely that this assumption will not be satisfied. To check the assumption we specify a more complicated model, as suggested in Section 4.3.2:

    y_{i1j} = fetal weight of j-th fetus in i-th litter,
    y_{i2j} = malformation status of j-th fetus in i-th litter,
    d_i = dose administered to i-th litter,

    y_{i1j} | y_{i2j}, b_{i1} ~ indep. N( \beta_{10} + \beta_{11} d_i + \gamma y_{i2j} + b_{i1}, \sigma^2 ),
    y_{i2j} | b_{i2} ~ indep. Be( \mu_{i2j} ),  logit( \mu_{i2j} ) = \beta_{20} + \beta_{21} d_i + b_{i2},
    b_i = (b_{i1}, b_{i2})^T ~ N_2(0, \Sigma),  \Sigma = [ \sigma_1^2  \rho\sigma_1\sigma_2 ; \rho\sigma_1\sigma_2  \sigma_2^2 ].

The score test statistic for \gamma = 0 using 100 quadrature points is 17.45 (p-value < 0.0001), and therefore the hypothesis of conditional independence is rejected. Hence the model introduced above may be more appropriate for the ethylene glycol data. We fit this model in order to compare the score, Wald and likelihood-ratio statistics. The estimates using Gaussian quadrature and the Monte Carlo EM algorithm for logit and for probit links are given in Table 4.1. Gaussian quadrature used 100 quadrature points both for the logit and for the probit model. It took 3 hrs 40 min for the logit model and 5 hrs 30 min for the probit model to converge. The numbers of
iterations were 32 and 50, respectively. The MCEM algorithms took about 15 hrs to converge: the logit model used 63 iterations and had a final sample size of 7898; the probit model used 52 iterations with a final sample size of 5549. The starting values were the same for all algorithms: \gamma had an initial value of zero, and all the other parameters had initial values as in the models without \gamma (Section 3.4). The estimates of \gamma in all fits were identical up to three places after the decimal point, and the standard errors were also very similar. There is evidence that the fetal weight of a fetus significantly decreases if malformation status is changed from absent to present. Notice that in this example there is no natural ordering of the response variables, and hence the decision to include malformation in the linear predictor for fetal weight is for mathematical convenience. Because of the identity link function for the normal response, the expression for the Monte Carlo approximation of the score vector is simpler. Also the interpretation of the \gamma coefficient is easier to understand. In that case 2\gamma is interpreted as the amount by which a subject's fetal weight is expected to decrease if malformation status is changed from absence to presence. In contrast, if we were to include fetal weight in the linear predictor for malformation and we used a logit link, then the interpretation of \gamma would be the change in a subject's log odds of malformation per unit change in birth weight, controlling for dose.

To compare the performance of the stochastic and analytical approximations to the test statistic, we computed Gaussian quadrature and Monte Carlo approximations using different numbers of quadrature points and simulation sample sizes. Attention was restricted to the logit link, and we used the final parameter estimates from the Gaussian quadrature fits for the conditional independence (Section 3.4) and conditional dependence models.
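The p-values reported for these score statistics (e.g. the value 17.45 above) come from the chi-squared reference distribution with one degree of freedom, whose upper-tail probability reduces to the complementary error function, so it can be checked without any statistical library:

```python
import math

def chi2_1df_pvalue(stat):
    # For X = Z^2 with Z ~ N(0,1): P(X > s) = P(|Z| > sqrt(s)) = erfc(sqrt(s/2)).
    return math.erfc(math.sqrt(stat / 2.0))

print(chi2_1df_pvalue(17.45) < 0.0001)  # True: consistent with the reported p-value
```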
The results are provided in Tables 4.2 and 4.3. As the number of quadrature points increases, the Gauss-Hermite approximations improve, or at least they are internally
Table 4.1. Estimates from the conditional dependence model fit to the ethylene glycol data

    Par.      GQ (logit)        MCEM (logit)      GQ (probit)       MCEM (probit)
              Est      SE       Est      SE       Est      SE       Est      SE
    beta_10   0.936    0.014    0.937    0.024    0.936    0.014    0.937    0.027
    beta_11   0.081    0.008    0.082    0.010    0.081    0.008    0.082    0.010
    gamma     0.016    0.004    0.016    0.004    0.016    0.004    0.016    0.005
    beta_20   4.371    0.421    4.403    0.537    2.420    0.218    2.444    0.300
    beta_21   1.768    0.212    1.782    0.238    0.981    0.113    0.991    0.120
    sigma     0.074    0.002    0.074    0.002    0.074    0.002    0.074    0.002
    sigma_1   0.083    0.007    0.083    0.007    0.083    0.007    0.083    0.007
    sigma_2   1.534    0.201    1.522    0.206    0.851    0.109    0.844    0.112
    rho       0.612    0.100    0.615    0.102    0.594    0.101    0.599    0.103

Table 4.2. Gaussian quadrature approximations to the score, Wald and likelihood ratio statistics for testing for conditional independence in the ethylene glycol example

    Number of quadrature points   score    Wald     LR
    10                            12.10    11.07    19.65
    20                            19.33    18.54    17.37
    30                           -63.44    17.60    18.09
    40                            16.58    17.09    16.97
    50                            17.20    17.05    16.95
    60                            17.32    17.04    16.92
    70                            17.40    17.06    16.95
    80                            17.43    17.07    16.96
    90                            17.45    17.08    16.96
    100                           17.45    17.13    16.96

consistent and show decreasing variability (Table 4.2). The only negative estimate for the score statistic is due to a non-positive-definite estimate of the observed information matrix. As pointed out before, there is no guarantee that this will not happen. It seems that about 50 quadrature points are adequate to approximate the test statistics in this example.

It is not surprising that the Monte Carlo approximations show more variability (Table 4.3). We use two different random seeds, and hence for each test there are two columns of values. Of the three test statistics, the likelihood ratio shows the most variability. It is based on an approximation of the loglikelihood rather than on an
Table 4.3. Monte Carlo approximations to the score (S), Wald (W) and likelihood ratio (LR) statistics for conditional independence in the ethylene glycol example. Two different initial random seeds are used.

    Sample size   S1      S2      W1      W2      LR1     LR2
    100           13.56   13.85   16.44   15.92   17.84    5.17
    500           18.73   29.26   15.90   16.02   17.36   13.80
    1000          16.18   17.95   17.24   21.37   12.44   19.15
    5000          17.24   17.02   16.69   16.67   18.75   18.28
    10000         17.14   16.95   16.77   17.17   18.85   17.32
    20000         17.09   17.07   17.23   16.90   16.08   17.10

approximation of the information matrix, which may indicate that larger sample sizes are needed to approximate the loglikelihood precisely than are needed for the information matrix. A reasonable simulation sample size to use in view of computational efficiency is 5000. Note that rejection sampling is not used for the likelihood ratio statistic, so the approximation takes less time. In the next section we further study the performance of the score, Wald and likelihood ratio statistics for conditional independence via a simulation study.

4.5 Simulation Study

We use the structure of the simulated data example (Section 3.3) with either 30 or 100 subjects and 10 bivariate observations per subject. The parameters are as follows: \beta_1 = 4, \beta_2 = 1, \sigma^2 = 1, \sigma_1^2 = 1, \sigma_2^2 = 1, \sigma_{12} = 0.5, and five different values of \gamma: 0, 0.05, 0.1, 0.2 and 0.3. \gamma = 0 corresponds to conditional independence. A bivariate GLMM and a bivariate GLMM with conditional dependence are fitted using Gaussian quadrature with 50 quadrature points. Approximations to the Wald, likelihood ratio and score statistics for testing \gamma = 0 are computed with 50 quadrature points and with Monte Carlo simulation sample sizes of 5000. Fifty samples are generated, and means, standard deviations and rejection percentages are computed for all three statistics.

The results using Gaussian quadrature are summarized in Tables 4.4-4.7, and those using Monte Carlo approximations are summarized in Tables 4.8-4.11. Under
H_0: \gamma = 0, the mean and the standard deviation of the test statistics should be 1 and \sqrt{2} = 1.41, respectively. For \gamma = 0, in the larger simulation setting, all three Gaussian quadrature approximations give a mean of 0.91 and a standard deviation of about 1.37. The corresponding Monte Carlo approximations to the Wald and score tests give essentially the same means and slightly lower (1.34 as compared to 1.37) standard deviations. Only the Monte Carlo approximation to the likelihood ratio statistic has a bigger mean value and a bigger standard deviation. For the smaller sample size almost all values (except the mean of the Monte Carlo likelihood ratio statistic) are further away from the truth: about 0.86 and 1.10. We expect those values to be closer to the truth if the simulation sample size is increased. Interestingly, for all settings using Gaussian quadrature, the score statistic has the largest average value and the likelihood ratio statistic has the smallest. No such trend is obvious in the Monte Carlo approximations. The differences in the results for the Monte Carlo likelihood ratio statistic may be attributed to the different kind of approximation used, as compared to the approximation of the information matrix needed for the Wald and score statistics.

Another way to summarize the results is to look at the percentage of simulations in which the null hypothesis is rejected at different levels. These are given in Tables 4.5 and 4.7 for the Gaussian quadrature approximations, and in Tables 4.9 and 4.11 for the Monte Carlo approximations. From that perspective the three statistics are almost identical, and their type I error rates are close to the nominal levels. (Recall that the simulation sample size is only 50, and that is why only certain percentages could be observed.
The simulation sample size was chosen for reasons of computational feasibility.) Among the Gaussian quadrature approximations, the score statistic usually has the highest rejection rate, and the only case when it rejects less often is in the smaller sample setting when \gamma = 0.3. This is due to one sample in which
Table 4.4. Means and standard deviations of the Gaussian quadrature approximations to the score, Wald and likelihood ratio test statistics for conditional independence: sample size = 100 subjects

             score            Wald             likelihood ratio
    gamma    mean    s.d.     mean    s.d.     mean    s.d.
    0        0.91    1.37     0.91    1.37     0.91    1.36
    0.05     2.00    2.43     1.99    2.41     1.98    2.40
    0.1      8.48    4.68     8.38    4.57     8.33    4.52
    0.2      27.99   10.57    27.03   9.94     26.59   9.66
    0.3      64.13   16.85    59.62   14.56    57.64   13.61

the estimate of the variance-covariance matrix in the reduced model was non-positive definite and the score statistic was set equal to zero.

Among the Monte Carlo approximations, the Wald and score statistics show almost perfect agreement with each other. The only exception is the small sample setting when \gamma = 0.3, for the same reasons as discussed above. The Monte Carlo likelihood ratio statistic shows more variability and may require a larger simulation sample size. The Gaussian quadrature and Monte Carlo approximations of the Wald and score test statistics are similar, with the Monte Carlo approximations having slightly lower standard deviations.

This limited simulation study indicates that the Gaussian quadrature approximations of all three statistics and the Monte Carlo approximations of the Wald and score statistics perform similarly, and the choice of which one to use should probably be dictated by other practical considerations. For example, if fitting a more complicated model is computationally intensive, probably the score statistic should be used. If the extended model is going to be fit anyway, the least computationally demanding statistic is the likelihood ratio statistic. But if a Monte Carlo approximation is preferred, for the likelihood ratio statistic we may need a larger simulation sample size, which may obscure the computational advantage. Further study is needed to determine whether the observed behaviour is typical of the approximated test statistics or dictated by the particular setting under consideration.
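The null-hypothesis reference values used throughout Tables 4.4-4.11 (mean 1, standard deviation \sqrt{2} = 1.41, and the usual \chi^2(1) critical values) can be verified by simulating the reference distribution directly; a small sketch (the critical values 2.706, 3.841 and 6.635 are the standard \chi^2(1) quantiles for \alpha = 0.10, 0.05, 0.01):

```python
import numpy as np

rng = np.random.default_rng(12345)

# Under H0 the score, Wald and LR statistics are all asymptotically
# chi-squared(1); simulate that reference distribution as Z^2 with Z ~ N(0,1).
stats = rng.standard_normal(200000) ** 2
print(round(stats.mean(), 2), round(stats.std(), 2))      # close to 1 and 1.41

for alpha, crit in [(0.10, 2.706), (0.05, 3.841), (0.01, 6.635)]:
    print(alpha, round(float(np.mean(stats > crit)), 3))  # close to nominal levels
```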
Table 4.5. Rejection rates for the Gaussian quadrature approximations to the score, Wald and likelihood ratio test statistics for conditional independence: sample size = 100 subjects

             alpha = 0.01           alpha = 0.05           alpha = 0.10
    gamma    S      W      LR       S      W      LR       S      W      LR
    0        0      0      0        0.06   0.06   0.06     0.08   0.08   0.08
    0.05     0.06   0.06   0.04     0.20   0.18   0.18     0.30   0.30   0.30
    0.1      0.64   0.62   0.62     0.86   0.86   0.84     0.92   0.92   0.92
    0.2      1.00   1.00   1.00     1.00   1.00   1.00     1.00   1.00   1.00
    0.3      1.00   1.00   1.00     1.00   1.00   1.00     1.00   1.00   1.00

    S: score, W: Wald, LR: likelihood ratio

Table 4.6. Means and standard deviations of the Gaussian quadrature approximations to the score, Wald and likelihood ratio test statistics for conditional independence: sample size = 30 subjects

             score            Wald             likelihood ratio
    gamma    mean    s.d.     mean    s.d.     mean    s.d.
    0        0.87    1.11     0.86    1.10     0.86    1.09
    0.05     1.42    1.92     1.40    1.87     1.39    1.84
    0.1      3.24    3.41     3.16    3.26     3.12    3.19
    0.2      8.42    5.28     8.07    4.91     7.91    4.74
    0.3      19.65   8.96     18.24   7.38     17.56   6.86

Table 4.7. Rejection rates for Gaussian quadrature approximations to the score, Wald and likelihood ratio test statistics for conditional independence: sample size = 30 subjects

             alpha = 0.01           alpha = 0.05           alpha = 0.10
    gamma    S      W      LR       S      W      LR       S      W      LR
    0        0      0      0        0.02   0.02   0.02     0.10   0.10   0.10
    0.05     0.04   0.04   0.04     0.10   0.10   0.08     0.16   0.14   0.14
    0.1      0.12   0.12   0.12     0.30   0.30   0.30     0.46   0.44   0.44
    0.2      0.58   0.56   0.56     0.82   0.82   0.82     0.86   0.86   0.86
    0.3      0.96   0.98   0.98     0.98   1.00   1.00     0.98   1.00   1.00

    S: score, W: Wald, LR: likelihood ratio
Table 4.8. Means and standard deviations of the Monte Carlo approximations to the score, Wald and likelihood ratio test statistics for conditional independence: sample size = 100 subjects

             score            Wald             likelihood ratio
    gamma    mean    s.d.     mean    s.d.     mean    s.d.
    0        0.90    1.34     0.90    1.34     1.07    2.00
    0.05     1.98    2.37     1.98    2.37     2.00    2.78
    0.1      8.35    4.48     8.35    4.52     8.24    4.43
    0.2      26.55   9.47     26.92   9.78     26.21   10.09
    0.3      57.80   13.56    59.61   14.51    57.56   13.79

Table 4.9. Rejection rates for the Monte Carlo approximations to the score, Wald and likelihood ratio test statistics for conditional independence: sample size = 100 subjects

             alpha = 0.01           alpha = 0.05           alpha = 0.10
    gamma    S      W      LR       S      W      LR       S      W      LR
    0        0      0      0        0.06   0.06   0.10     0.08   0.08   0.12
    0.05     0.04   0.04   0.06     0.20   0.20   0.20     0.30   0.30   0.26
    0.1      0.64   0.62   0.58     0.86   0.86   0.84     0.92   0.92   0.96
    0.2      1.00   1.00   0.96     1.00   1.00   0.98     1.00   1.00   0.98
    0.3      1.00   1.00   1.00     1.00   1.00   1.00     1.00   1.00   1.00

    S: score, W: Wald, LR: likelihood ratio

Table 4.10. Means and standard deviations of the Monte Carlo approximations to the score, Wald and likelihood ratio test statistics for conditional independence: sample size = 30 subjects

             score            Wald             likelihood ratio
    gamma    mean    s.d.     mean    s.d.     mean    s.d.
    0        0.86    1.09     0.86    1.09     0.96    1.10
    0.05     1.39    1.84     1.40    1.87     1.40    2.01
    0.1      3.12    3.14     3.16    3.25     3.00    3.25
    0.2      7.91    4.72     8.11    4.94     7.93    4.69
    0.3      17.81   6.91     18.18   7.42     17.60   7.07
Table 4.11. Rejection rates for the Monte Carlo approximations to the score, Wald and likelihood ratio test statistics for conditional independence: sample size = 30 subjects

             alpha = 0.01           alpha = 0.05           alpha = 0.10
    gamma    S      W      LR       S      W      LR       S      W      LR
    0        0      0      0        0.02   0.02   0.04     0.10   0.10   0.08
    0.05     0.04   0.04   0.04     0.10   0.08   0.08     0.16   0.16   0.20
    0.1      0.12   0.12   0.14     0.30   0.30   0.28     0.44   0.44   0.44
    0.2      0.56   0.56   0.50     0.82   0.82   0.86     0.86   0.86   0.90
    0.3      0.96   0.98   0.98     0.98   1.00   1.00     0.98   1.00   1.00

    S: score, W: Wald, LR: likelihood ratio

4.6 Future Research

An important future research topic is to compare the proposed numerical and stochastic approximations of the variance component score statistics (Section 4.3.3) to the analytical approximations developed by Lin. The expectations are that the Gauss-Hermite quadrature and Monte Carlo methods will be more computationally intensive but will outperform the Laplace approximation methods when the data are far from normally distributed. It is also of interest to consider tests for conditional independence in settings other than the Bernoulli-normal combination discussed here. Alternative approaches to incorporating dependence in the model should also be investigated. One such approach is introduced in the next chapter, but it is applicable only to continuous and binary variables. What can be done in an application like the pony data is not clear.

In this chapter we did not address some important research questions, such as model goodness-of-fit and the effects of departures from the parametric assumptions on the estimates. In the case of discrete response variables it may be possible to use the deviance statistic to check the model fit. But when the data are continuous, or
both discrete and continuous, this question becomes much more complicated. The effects of departures from the parametric assumptions on the maximum likelihood estimates can probably be studied in the framework proposed by White (1982). He showed that the maximum likelihood estimates under a false model converge to a value which minimizes the Kullback-Leibler divergence between the true and misspecified models. Some approximations will need to be used to assess the magnitude of the introduced bias. Finally, the challenges of verifying the consistency and asymptotic normality of the maximum likelihood estimates in the GLMM, and of quantifying the precision of their numerical and stochastic approximations, are still unresolved and require further investigation.
CHAPTER 5
CORRELATED PROBIT MODEL

Using the GLMM for multivariate repeated measures allows for modelling of any mixture of outcomes in the exponential family, but requires the rather restrictive assumption of conditional independence between the responses given the random effects. It is difficult to construct a general fully parametric model that overcomes this drawback, because of the need to define multivariate distributions for mixtures of responses. However, in the special case of a mixture of one binary and one continuous response (as in the ethylene glycol example), one can fit a correlated probit model with an underlying latent variable for the binary response. Catalano and Ryan (1992) considered such a model and used GEE methodology for estimation. In this section we propose a Monte Carlo EM algorithm for finding 'exact' ML estimates. Chan and Kuk (1997) introduced such an algorithm for models with binary responses, but as we will show, it can also be used for a mixture of binary and normal variables and in the case of correlated errors. We also consider an acceleration of this algorithm using modifications proposed by Liao (1999) and by Lavielle, Delyon and Moulines (1999).

The introduction to this chapter contains a literature overview of latent variable models for binary, ordinal and censored continuous data. First we consider models for cross-sectional data, and then we mention some extensions to correlated data. Section 5.2 defines the correlated probit model, and Section 5.3 contains a description of the model-fitting method. Results from the analysis of the ethylene glycol example are presented in Section 5.4. Section 5.5 is devoted to a simulation study investigating efficiency gains in the correlated probit model, and an identifiability issue is discussed
in Section 5.6. Section 5.7 describes the extensions of this model to any mixture of binary, continuous, censored continuous and ordinal data with known thresholds, and the chapter concludes with a discussion of future research directions (Section 5.8).

5.1 Introduction

In many applications an observed binary variable y can be assumed to result from a dichotomization of an unobserved (latent) continuous variable y^* ranging from -\infty to +\infty. Larger values of y^* are observed as y = 1, while smaller values of y^* are observed as y = 0. Motivation for this representation often comes from studies of dose-response relationships in populations of biological organisms, where the response of interest is whether a randomly selected individual receiving a certain dose of a toxic chemical dies (Finney, 1964; Ashford and Sowden, 1970). The latent variable y^* is assumed to be linearly related to the observed covariates through the model

    y_i^* = x_i^T \beta + \epsilon_i,    (5.1)

where the \epsilon_i are i.i.d. with some continuous distribution. The observed binary variable is linked to the latent continuous variable in the following way:

    y_i = 1 if y_i^* > \tau,    y_i = 0 if y_i^* \le \tau.

In many applications \epsilon_i ~ N(0, \sigma^2) and the probability p = P(y_i = 1) is

    P(y_i^* > \tau) = P(x_i^T \beta + \epsilon_i > \tau) = P(\epsilon_i > \tau - x_i^T \beta) = 1 - \Phi((\tau - x_i^T \beta)/\sigma) = \Phi((x_i^T \beta - \tau)/\sigma).

Without loss of generality the threshold \tau can be set equal to zero, because it can be absorbed in the intercept of the linear predictor (\beta_0^* = \beta_0 - \tau). The error variance \sigma^2 is not estimable from the data, because y_i^* is not observed and it is only known whether y_i^* is positive or negative. Usually \sigma^2 is taken to equal 1.
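The dichotomization above can be checked numerically. The following sketch (the linear predictor value xb = 0.4 is invented for illustration) simulates the latent model with \sigma = 1 and \tau = 0 and compares the empirical proportion of y = 1 with the probit probability \Phi(x^T\beta):

```python
import random
from statistics import NormalDist

# Numerical check of the dichotomization: with sigma = 1 and tau = 0, the
# proportion of simulated latent values above the threshold should match
# the probit probability Phi(x'beta).  The value xb = 0.4 is invented.
random.seed(1)
xb = 0.4
n = 200_000
y = [1 if xb + random.gauss(0.0, 1.0) > 0.0 else 0 for _ in range(n)]
mc_prob = sum(y) / n
probit_prob = NormalDist().cdf(xb)
print(mc_prob, probit_prob)
```

With 200,000 draws the Monte Carlo proportion agrees with \Phi(0.4) \approx 0.655 to about two decimal places.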
The normal distribution assumption for the underlying latent variable leads to a probit model for the binary response:

    P(y_i = 1) = \Phi(x_i^T \beta),

where \Phi is the cumulative distribution function of a standard normal random variable. Gaddum (1933) and Bliss (1934a, 1934b, 1935a, 1935b) are credited with the development of the method of probit analysis.

Ordinal data are also often assumed to arise from an underlying latent continuous variable. Let y_i^* be related to covariates as described in (5.1) and let y_i = r if \tau_{r-1} \le y_i^* < \tau_r for r = 1, 2, ..., R. If the errors are assumed to be distributed N(0, \sigma^2), then

    P(y_i = r) = P(\tau_{r-1} \le y_i^* < \tau_r) = \Phi((\tau_r - x_i^T \beta)/\sigma) - \Phi((\tau_{r-1} - x_i^T \beta)/\sigma).

This ordered probit model was first suggested by Aitchison and Silvey (1957), who considered only a single independent variable. McKelvey and Zavoina (1975) extended Aitchison and Silvey's work to the case of multiple independent variables. For identifiability reasons the thresholds are usually reparametrized as \tau_1^* = 0, \tau_2^* = \tau_2 - \tau_1, ..., \tau_R^* = \tau_R - \tau_{R-1}, and \sigma^2 is set equal to 1.0. In some applications the actual thresholds may be known. For example, the response may be family income, but it may only be known what tax bracket the income falls in and not exactly how much it is. This latter case will be of interest in Section 5.7, where extensions of the correlated probit model are considered. An underlying latent variable is also appropriate for censored continuous data. When the censoring is on the left, observations at or below a certain value are set to
some predefined number. We can again assume an underlying regression model as in (5.1). The observed censored variable is defined as follows:

    y_i = y_i^* if y_i^* > \tau,    y_i = \tau_y if y_i^* \le \tau.

If we again assume that the errors have a N(0, \sigma^2) distribution, then the resulting model for the observed response is called the "tobit" model. The name stands for Tobin's probit model, in honor of Tobin's (1958) work on household expenditures for durable goods, which led to the introduction of the tobit model. Censoring from above, and from above and below simultaneously, can also occur. In general \tau \ne \tau_y \ne 0, but they are known. The tobit and probit models are related in the following way: in the tobit model we know the value of y^* when y^* > \tau, while in the probit model we only know whether y^* > \tau. The derivation of the probability of a case being censored is very similar to the derivation of the probability of an event in the probit model. In general the estimates of \beta from the tobit model are more efficient than the estimates that would be obtained from a probit model, and \sigma^2 can be estimated from the tobit but not from the probit model. All three models introduced so far can be fitted using maximum likelihood. Long (1997) provides details about the appropriate iterative algorithms.

Extensions of the above methods for multiple response variables or clustered data have been considered in the literature. Ashford and Sowden (1970) introduced a multivariate probit model based on an underlying multivariate normal distribution. In the bivariate case there are two correlated underlying latent variables,

    (y_{i1}^*, y_{i2}^*)^T ~ N_2(\mu_i, \Sigma),    \Sigma = [1, \rho; \rho, 1],

with y_{i1} = I{y_{i1}^* > 0} and y_{i2} = I{y_{i2}^* > 0}. Then
P(y_{i1} = 1, y_{i2} = 1) = \Phi^{(2)}(\mu_{i1}, \mu_{i2}; \rho),

where \Phi^{(2)} denotes the bivariate standard normal c.d.f. Also,

    P(y_{i1} = 0, y_{i2} = 1) = P(y_{i1}^* < 0, y_{i2}^* > 0) = P(y_{i1}^* < 0) - P(y_{i1}^* < 0, y_{i2}^* < 0),

and the remaining joint probabilities can be similarly determined. Ochi and Prentice (1984) introduced a correlated generalization of the multivariate probit model of Ashford and Sowden to fit regression models to exchangeable binary data. The equicorrelated variance-covariance structure \Sigma = \sigma^2 {(1 - \rho)I + \rho J} of the underlying normal distribution allows using approximations to equicorrelated normal integrals to simplify maximum likelihood estimation. Regan and Catalano (1999) considered a generalization of Ochi and Prentice's method for clustered binary and continuous outcomes.

Chan and Kuk (1997) proposed a probit-normal model for binary data with correlated random effects, which provides a great deal of flexibility in modelling diverse correlation structures. The general form of their model is

    \Phi^{-1}(p) = (\Phi^{-1}(p_1), ..., \Phi^{-1}(p_N))^T = X\beta + \sum_{r=1}^R Z_r b_r,

where p_i = Pr(y_i = 1), y_1, ..., y_N are the observed binary variables, X is an N x p model matrix, \beta is a p x 1 vector of fixed effects, and b_r is a q_r k_r x 1 vector of random effects with corresponding N x q_r k_r model matrix Z_r. It is assumed that b_1, ..., b_R are independent and normally distributed. Chan and Kuk viewed the above probit-linear mixed model as a threshold model resulting from dichotomizing the observations from a Gaussian mixed model. In other
words, they assumed that y_j = I{y_j^* > 0} and

    y^* = X\beta + \sum_{r=1}^R Z_r b_r + \epsilon,

where \epsilon ~ N(0, I) independently of the b_r. Maximum likelihood estimates are obtained via a Monte Carlo EM algorithm treating the latent variables as the missing data. The E-step is made computationally possible by using Gibbs sampling, and the M-step is simplified because of the assumption of a probit link. An extension of this method to a mixture of binary and continuous responses assuming correlated errors is proposed in this chapter.

Random effects regression models for ordinal regression have been considered by Harville and Mee (1984), Hedeker and Gibbons (1994) and Tutz and Hennevogl (1996). Tutz and Hennevogl used an EM algorithm treating the random effects and the observed ordinal counts as the complete data. They assumed the thresholds to be unknown and estimated identifiable transformations of them. In contrast, Chan and Kuk's approach would not allow estimation of the thresholds from the complete data if it were to be used for ordinal data. That is why in the extensions allowing multinomial components considered in Section 5.7 the thresholds are assumed to be known. If they are unknown, then the Tutz and Hennevogl algorithm may be extended to handle this case. A Monte Carlo EM algorithm for a multivariate probit model for ordinal data has been considered by Blackwell and Catalano (1999a, 1999b). Catalano (1994) used the GEE approach to fit a model to a bivariate response consisting of an ordinal and a continuous variable. Multivariate tobit analysis has been considered in the econometrics literature (Lee, 1993).
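The threshold view of the probit-normal model can be illustrated by simulation. The sketch below (all sizes and parameter values are invented) generates clustered binary data by dichotomizing a Gaussian random-intercept model; under this setup the marginal success probability is \Phi(\beta_0/\sqrt{1 + \sigma_b^2}), which the simulated proportion should approximate:

```python
import random
from statistics import NormalDist

# Invented sizes and parameters throughout.  Simulate a Gaussian random-
# intercept model y*_ij = beta0 + b_i + e_ij with e_ij ~ N(0, 1) and keep
# only the indicators y_ij = I{y*_ij > 0}.  Marginally,
# P(y_ij = 1) = Phi(beta0 / sqrt(1 + sigma_b^2)).
random.seed(5)
beta0, sigma_b = 0.3, 0.8
clusters, per_cluster = 2000, 4
data = []
for _ in range(clusters):
    b_i = random.gauss(0.0, sigma_b)            # shared cluster effect
    data.append([1 if beta0 + b_i + random.gauss(0.0, 1.0) > 0.0 else 0
                 for _ in range(per_cluster)])
obs = sum(sum(c) for c in data) / (clusters * per_cluster)
marg = NormalDist().cdf(beta0 / (1.0 + sigma_b ** 2) ** 0.5)
```

Note that observations within a cluster share b_i and are therefore positively correlated, which is exactly the feature the fitting algorithms in this chapter must accommodate.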
5.2 Model Definition

To develop the correlated probit model for a mixture of a binary and a continuous response, the binary response will be considered as arising from dichotomizing a latent continuous response. Let {y_{i1j}} denote the observed continuous measurement and {y_{i2j}^*} denote the latent continuous measurement underlying the binary response at the jth occasion for the ith subject, i = 1, ..., n, j = 1, ..., n_i. The observed binary variable is then the indicator y_{i2j} = I{y_{i2j}^* > 0}. The underlying linear mixed model is defined as follows:

    y_{i1j} = x_{i1j}^T \beta_1 + z_{i1j}^T b_{i1} + \epsilon_{i1j},    (5.2)
    y_{i2j}^* = x_{i2j}^T \beta_2 + z_{i2j}^T b_{i2} + \epsilon_{i2j},    (5.3)

where x_{i1j}, x_{i2j}, z_{i1j} and z_{i2j} are known p_1 x 1, p_2 x 1, q_1 x 1 and q_2 x 1 vectors and \beta_1 and \beta_2 are unknown p_1 x 1 and p_2 x 1 parameter vectors. The random effects and the random errors are assumed to be normally distributed:

    b_i = (b_{i1}^T, b_{i2}^T)^T ~ i.i.d. N(0, \Sigma) = N([0; 0], [\Sigma_{11}, \Sigma_{12}; \Sigma_{12}^T, \Sigma_{22}]),    (5.4)
    \epsilon_{ij} = (\epsilon_{i1j}, \epsilon_{i2j})^T ~ i.i.d. N(0, \Sigma_e) = N([0; 0], [\sigma_{e1}^2, \sigma_{e12}; \sigma_{e12}, \sigma_{e2}^2]),    (5.5)

and {b_i} and {\epsilon_{ij}} are assumed independent. The model for the complete data {y_{i1j}} and {y_{i2j}^*} translates into the following model for the observed data {y_{i1j}} and {y_{i2j}}: conditional on b_{i1} and b_{i2},

    y_{i1j} ~ N(\mu_{i1j}, \sigma_{e1}^2),    (5.6)
    P(y_{i2j} = 1 | b_i) = \Phi(\mu_{i2j}/\sigma_{e2}),    (5.7)

where \mu_{i1j} = x_{i1j}^T \beta_1 + z_{i1j}^T b_{i1} and \mu_{i2j} = x_{i2j}^T \beta_2 + z_{i2j}^T b_{i2} are the conditional means for the two observed variables, and \Phi denotes the standard normal cumulative distribution function.
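To make the data-generating mechanism of (5.2)-(5.5) concrete, the following sketch simulates one cluster with random intercepts only; every numeric value is invented for illustration:

```python
import random

# Sketch of generating one cluster from model (5.2)-(5.3) with random
# intercepts only; all parameter values below are invented.
def bivariate_normal(rho, s1, s2, rng):
    # standard normal pair with correlation rho, scaled by s1 and s2
    u = rng.gauss(0.0, 1.0)
    v = rng.gauss(0.0, 1.0)
    return s1 * u, s2 * (rho * u + (1.0 - rho * rho) ** 0.5 * v)

random.seed(6)
beta1, beta2 = 1.0, -0.5                             # fixed intercepts
b1, b2 = bivariate_normal(0.6, 0.1, 0.8, random)     # correlated random effects (5.4)
cluster = []
for _ in range(5):                                   # five repeated measures
    e1, e2 = bivariate_normal(0.2, 0.075, 1.0, random)  # correlated errors (5.5)
    y1 = beta1 + b1 + e1          # observed continuous response (5.2)
    y2_star = beta2 + b2 + e2     # latent response (5.3)
    cluster.append((y1, 1 if y2_star > 0.0 else 0))  # dichotomize (5.7)
```

The within-cluster association between the continuous and the binary response arises through two routes here: the correlated random effects (b_1, b_2) and the correlated within-occasion errors (e_1, e_2), which is exactly the extra flexibility the correlated probit model adds over the conditional-independence GLMM.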
The first equation is the same as in the complete model, while the second equation is derived as follows. Conditional on the random effects,

    P(y_{i2j} = 1 | b_i) = P(y_{i2j}^* > 0 | b_i) = P(\epsilon_{i2j} > -(x_{i2j}^T \beta_2 + z_{i2j}^T b_{i2})) = \Phi((x_{i2j}^T \beta_2 + z_{i2j}^T b_{i2})/\sigma_{e2}).
and Catalano model but without modelling the variance-covariance parameters. The latter model can be fitted using an extension of the maximum likelihood method of Ochi and Prentice (1984).

4. The model as defined is the same as the original model of Catalano and Ryan (1992), but they rewrote the model in terms of marginal moments and used GEE to obtain estimates. By marginalizing they lost the ability to estimate the model parameters as they were in the original specification and could only test certain hypotheses. If the discrete outcome is ordinal, the model is the same as that of Catalano (1997), who extended the estimation methods of Catalano and Ryan (1992) to a mixture of ordinal and continuous responses.

5.3 Maximum Likelihood Estimation

We now show how maximum likelihood estimates can be obtained using a modification of the EM algorithm for the correlated probit model. We first describe the extension of the Chan and Kuk approach using a Monte Carlo EM algorithm, then show how the stochastic approximation approach can be used to speed up the algorithm, and finally discuss standard error approximation.

5.3.1 Monte Carlo EM Algorithm

To describe the Monte Carlo EM algorithm we first define the complete data and show that there are closed-form expressions for the complete data maximum likelihood estimates. This leads to a simple M-step in the EM algorithm. Then we derive the conditional expectations needed in each E-step of the algorithm and explain how those are approximated by Monte Carlo sums. At the end we provide a summary of the algorithm and discuss identifiability and convergence of the algorithm.
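The Monte Carlo sums just mentioned can be previewed in one dimension: conditional expectations over a truncation region are estimated by averaging unrestricted normal draws that land in that region. The sketch below (sample size and target invented) estimates E[Y | Y > 0] for Y ~ N(0, 1) this way and compares it with the closed form \phi(0)/(1 - \Phi(0)) = \sqrt{2/\pi}:

```python
import random
from statistics import NormalDist

# One-dimensional analogue of the E-step Monte Carlo sums: estimate the
# conditional mean E[Y | Y > 0] for Y ~ N(0, 1) by averaging unrestricted
# draws that land in the acceptance region, and compare with the known
# value phi(0) / (1 - Phi(0)) = sqrt(2 / pi).
random.seed(4)
m = 200_000
draws = [random.gauss(0.0, 1.0) for _ in range(m)]
accepted = [y for y in draws if y > 0.0]          # region consistent with y = 1
ratio_estimate = sum(accepted) / len(accepted)
exact = NormalDist().pdf(0.0) / (1.0 - NormalDist().cdf(0.0))
```

In the correlated probit model the same idea is applied to the multivariate latent vector, with the acceptance region determined by the observed binary indicators.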
Complete Data and Complete Data Maximum Likelihood Estimates

The assumption of an underlying continuous variable for the binary response and of an underlying linear mixed model makes the EM algorithm an appealing method for model fitting, because the complete data maximum likelihood estimates are easy to compute. The complete data consist of b_i and {y_{ij}^*} = {(y_{i1j}, y_{i2j}^*)^T}, i = 1, ..., n. Let

    X_{ij} = [x_{i1j}^T, 0; 0, x_{i2j}^T],    Z_{ij} = [z_{i1j}^T, 0; 0, z_{i2j}^T],    \beta = (\beta_1^T, \beta_2^T)^T.

Then the complete data log-likelihood can be written as

    log L = -(1/2) \sum_{i=1}^n \sum_{j=1}^{n_i} log|\Sigma_e| - (1/2) \sum_{i=1}^n \sum_{j=1}^{n_i} (y_{ij}^* - X_{ij}\beta - Z_{ij}b_i)^T \Sigma_e^{-1} (y_{ij}^* - X_{ij}\beta - Z_{ij}b_i) - (n/2) log|\Sigma| - (1/2) \sum_{i=1}^n b_i^T \Sigma^{-1} b_i + const.

Because \Sigma appears only in the second part of the log-likelihood, which is the logarithm of a multivariate normal density, the complete data maximum likelihood estimate of \Sigma is

    \hat\Sigma = (1/n) \sum_{i=1}^n b_i b_i^T.    (5.8)

When the random effects are held fixed, the first part of the complete data log-likelihood is also multivariate normal, and hence for a fixed \Sigma_e the complete data maximum likelihood estimate of \beta is

    \hat\beta = (\sum_{i=1}^n \sum_{j=1}^{n_i} X_{ij}^T \Sigma_e^{-1} X_{ij})^{-1} (\sum_{i=1}^n \sum_{j=1}^{n_i} X_{ij}^T \Sigma_e^{-1} (y_{ij}^* - Z_{ij} b_i)).    (5.9)

Maximizing the complete data profile likelihood, we obtain a closed-form expression for the estimate of \Sigma_e:

    \hat\Sigma_e = (1/N) \sum_{i=1}^n \sum_{j=1}^{n_i} (y_{ij}^* - X_{ij}\hat\beta - Z_{ij}b_i)(y_{ij}^* - X_{ij}\hat\beta - Z_{ij}b_i)^T.    (5.10)
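The closed forms (5.9) and (5.10) can be illustrated in the simplest possible case: one response, a scalar covariate and \Sigma_e = \sigma_e^2, where generalized least squares collapses to ordinary least squares. The (x_ij, y_ij, b_i) triples below are invented:

```python
# Toy illustration of the closed-form M-step: with a single response, a
# scalar covariate and Sigma_e = sigma_e^2, the GLS estimate (5.9)
# collapses to least squares and (5.10) is the mean squared residual.
# The (x_ij, y_ij, b_i) triples are invented.
data = [(1.0, 2.1, 0.1), (2.0, 4.2, 0.1), (1.5, 3.0, -0.2), (3.0, 6.1, -0.2)]
num = sum(x * (y - b) for x, y, b in data)   # X' (y - Zb)
den = sum(x * x for x, y, b in data)         # X' X
beta_hat = num / den                         # scalar analogue of (5.9)
sigma2_hat = sum((y - x * beta_hat - b) ** 2
                 for x, y, b in data) / len(data)  # analogue of (5.10)
```

The point of the closed forms is that the M-step requires no iterative optimization; in the actual algorithm, the unobserved quantities inside these sums are replaced by their conditional expectations given the observed data.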
As demonstrated by Meng and Rubin (1993), iterating between the last two equations in the EM algorithm will lead to obtaining the true maximum likelihood estimates. This is the so-called ECM (expectation/conditional maximization) algorithm. At each step of the EM algorithm the new estimate of \Sigma_e is computed at the previous value of \beta, and then the new value of \Sigma_e is used to update \beta.

E-step of the Monte Carlo EM Algorithm

Because we do not observe the random effects and the latent malformation variable, at each E-step we need to compute conditional expectations of the expressions for the maximum likelihood estimates with respect to the observed data, evaluated at the current parameter estimates. Following the argument in Chan and Kuk (1997), we now show that all these conditional expectations depend only on two quantities without closed-form expressions: E(y_{i2}^* | y_{i1}, y_{i2}; \hat\psi^{(r)}) and Var(y_{i2}^* | y_{i1}, y_{i2}; \hat\psi^{(r)}), where y_{i1} = (y_{i11}, ..., y_{i1n_i})^T, y_{i2} = (y_{i21}, ..., y_{i2n_i})^T, y_{i2}^* = (y_{i21}^*, ..., y_{i2n_i}^*)^T, and \hat\psi^{(r)} denotes the parameter vector estimate at the rth step of the EM algorithm. From the model definition we obtain the joint distribution of the complete data,

    (y_i^{*T}, b_i^T)^T ~ multivariate normal.    (5.11)

From the properties of the multivariate normal distribution, the conditional distribution of the random effect b_i given the complete response y_i^* is
multivariate normal, with conditional mean E(b_i | y_i^*) and conditional variance Var(b_i | y_i^*) given by the usual normal-theory formulas (5.12). Therefore,

    E(b_i | y_i) = E[E(b_i | y_i^*) | y_i],

and

    E(b_i b_i^T | y_i) = E[E(b_i b_i^T | y_i^*) | y_i] = E[Var(b_i | y_i^*) + E(b_i | y_i^*) E(b_i^T | y_i^*) | y_i],    (5.13)

and

    \hat\beta^{(r+1)} = (\sum_{i=1}^n X_i^T (\hat\Sigma_e^{(r)})^{-1} X_i)^{-1} (\sum_{i=1}^n X_i^T (\hat\Sigma_e^{(r)})^{-1} E(y_i^* - Z_i b_i | y_i, \hat\psi^{(r)}))    (5.14)
    = (\sum_{i=1}^n X_i^T (\hat\Sigma_e^{(r)})^{-1} X_i)^{-1} (\sum_{i=1}^n X_i^T (\hat\Sigma_e^{(r)})^{-1} (E(y_i^* | y_i, \hat\psi^{(r)}) - Z_i E(b_i | y_i, \hat\psi^{(r)}))),
with the analogous conditional expectation substituted into (5.10) to give the update for the error variance-covariance matrix (5.15). But E(b_i | y_i^*) and Var(b_i | y_i^*) are explicit (linear and constant, respectively) functions of y_i^*, and hence all conditional expectations depend only on E(y_i^* | y_i, \hat\psi^{(r)}) and Var(y_i^* | y_i, \hat\psi^{(r)}). Note that E(y_{i1j}^* | y_i, \hat\psi^{(r)}) = y_{i1j} for j = 1, ..., n_i, and therefore we only need to approximate E(y_{i2}^* | y_i, \hat\psi^{(r)}) and Var(y_{i2}^* | y_i, \hat\psi^{(r)}). The Gibbs sampler can be used to approximate the integrals above. Notice that

    f(y_{i2}^* | y_{i1}, y_{i2}) = f(y_{i2} | y_{i2}^*) f(y_{i2}^* | y_{i1}) / f(y_{i2} | y_{i1}) = f(y_{i2}^* | y_{i1}) / c_i  if y_{i2}^* \in A_i,  and 0 otherwise,

where c_i = P(y_{i2}^* \in A_i) and A_i = {y_{i2}^* : y_{i2j}^* > 0 if y_{i2j} = 1 and y_{i2j}^* < 0 if y_{i2j} = 0}. Therefore a Monte Carlo approximation of E(y_{i2}^* | y_{i1}, y_{i2}, \hat\psi^{(r)}) is

    (1/m) \sum_{k=1}^m y_{i2}^{*(k)} I{y_{i2}^{*(k)} \in A_i} / ((1/m) \sum_{k=1}^m I{y_{i2}^{*(k)} \in A_i}) = (1/m^*) \sum_{l=1}^{m^*} y_{i2}^{*(l)},    (5.16)
where y_{i2}^{*(l)}, l = 1, ..., m^*, are the simulated values from the distribution of y_{i2}^* | y_{i1}, \hat\psi^{(r)} that fall in A_i. Similarly, Var(y_{i2}^* | y_{i1}, y_{i2}, \hat\psi^{(r)}) is approximated by

    (1/m^*) \sum_{l=1}^{m^*} y_{i2}^{*(l)} y_{i2}^{*(l)T} - ((1/m^*) \sum_{l=1}^{m^*} y_{i2}^{*(l)})((1/m^*) \sum_{l=1}^{m^*} y_{i2}^{*(l)T}).    (5.17)

Hence the problem becomes to simulate values from the distribution of {y_{i2}^* | y_{i1}, \hat\psi^{(r)}} that fall in A_i. As the conditional distribution of y_{i2}^* given y_{i1} at each step of the EM algorithm is multivariate normal with known mean and variance, multivariate rejection sampling can be used to generate values. Samples will be generated from the multivariate normal distribution of {y_{i2}^* | y_{i1}, \hat\psi^{(r)}} and only those in A_i will be accepted. But this may be very inefficient and may slow down the algorithm considerably. For example, simulating just 100 values from the initial distribution for one of the 94 subjects in the ethylene glycol example took more than 12 minutes. One iteration of the Monte Carlo EM algorithm took more than 1 hour for a simulation sample size of 100. The simulation sample size increases when the algorithm approaches convergence, and hence multivariate rejection sampling is not a practically feasible method for simulating values for the MCEM algorithm in this particular example.

A more practical alternative is to use Gibbs sampling. Because the conditional distribution of {y_{i2}^* | y_{i1}, \hat\psi^{(r)}} is multivariate normal with known mean and variance, the conditional distribution of {y_{i2j}^* | y_{i1}, y_{i2,-j}^*, \hat\psi^{(r)}} is univariate normal with known mean and variance, j = 1, ..., n_i. Here y_{i2,-j}^* denotes y_{i2}^* with y_{i2j}^* omitted. Also notice that

    f(y_{i2j}^* | y_{i1}, y_{i2}, y_{i2,-j}^*; \hat\psi^{(r)}) = f(y_{i2j}^* | y_{i1}, y_{i2j}, y_{i2,-j}^*; \hat\psi^{(r)}),

that is, the conditional univariate distribution for each latent variable depends only on the binary indicator for that variable and not on the other binary indicators in
the cluster. The proof is as follows: given the latent values y_{i2,-j}^*, the remaining binary indicators y_{i2,-j} are completely determined, so f(y_{i2,-j} | y_{i1}, y_{i2j}, y_{i2}^*; \hat\psi^{(r)}) = 1 and the indicators y_{i2,-j} drop out of the conditional density of y_{i2j}^*. Hence, the Gibbs sampler involves simulating values from several truncated univariate normal distributions.

Generating values from univariate truncated normal distributions can be carried out in several ways. One possibility is to generate values from the underlying normal distribution and accept only those falling in the area of interest. Another option is based on quantiles of a univariate normal distribution and is now explained in more detail. Suppose that one wants to generate values from the distribution of X, which is N(\mu, \sigma^2) truncated above a constant c. Then Y = (X - \mu)/\sigma is N(0, 1) truncated above (c - \mu)/\sigma. If there were no restrictions on Y, \Phi(Y) would be Uniform(0, 1), but because Y is truncated, \Phi(Y) can take only values between \Phi((c - \mu)/\sigma) and 1. Hence, if random numbers from Uniform(\Phi((c - \mu)/\sigma), 1) are generated and transformed back to the original scale using the inverse c.d.f. of the standard normal distribution, the resulting variables will have the univariate truncated normal distribution. If X were truncated below c, one must generate values from Uniform(0, \Phi((c - \mu)/\sigma)) and proceed as above. To obtain the (r + 1)st sample from the distribution of {y_{i2}^* | y_{i1}, y_{i2}; \hat\psi^{(r)}} one iteratively generates values from
    f(y_{i21}^{*(r+1)} | y_{i1}, y_{i2,-1}^{*(r)}, y_{i21}; \hat\psi^{(r)}),
    ...
    f(y_{i2n_i}^{*(r+1)} | y_{i1}, y_{i2,-n_i}^{*(r+1)}, y_{i2n_i}; \hat\psi^{(r)}).

Monte Carlo EM Algorithm: Summary and Potential Problems

In summary, the Monte Carlo EM algorithm is carried out as follows:

1. Select an initial estimate \hat\psi^{(0)} of the parameter vector. Set r = 1.

2. Increase r by 1. E-step: For each subject i, i = 1, ..., n, generate m^* random samples from the conditional distribution of {y_{i2}^* | y_{i1}, y_{i2}; \hat\psi^{(r-1)}} using the Gibbs sampler and compute the approximations (5.16) and (5.17).

3. M-step: Update the estimate of the parameter vector \hat\psi^{(r)} using (5.13), (5.14) and (5.15).

4. Iterate between (2) and (3) until convergence is achieved.

Louis's approximation to the observed information matrix can be used to estimate the standard errors of the parameters (see Chapter 3), but based on dependent random samples obtained using the Gibbs sampler rather than on i.i.d. random samples. One disadvantage of the Gibbs sampler is that the generated samples are not conditionally independent, and therefore extra work is needed to assess convergence of the algorithm. Chan and Kuk (1997) propose to use several independent runs of the algorithm to assess the extent of Monte Carlo variation. Another potential problem with the Monte Carlo EM algorithm arises because \sigma_{e2}^2 is estimable from the complete data but is not estimable from the observed data. If the usual approach of restricting \sigma_{e2}^2 to be equal to 1 is adopted, the maximization step becomes more complicated because there is no longer a closed-form expression for \hat\Sigma_e. On the other
hand, if \sigma_{e2}^2 is held unrestricted, then the EM algorithm will likely converge to unique estimates of the fully identifiable ratios \beta_2/\sigma_{e2}, \Sigma_{12}/\sigma_{e2} and \Sigma_{22}/\sigma_{e2}^2. Instead of the underlying continuous random variable y_{i2j}^*, consider y_{i2j}^{**} = y_{i2j}^*/\sigma_{e2}. Then the model in (5.2) and (5.3) can be rewritten with

    y_{i2j}^{**} = x_{i2j}^T \beta_2^* + z_{i2j}^T b_{i2}^* + \epsilon_{i2j}^*,

where \beta_2^* = \beta_2/\sigma_{e2}, b_{i2}^* = b_{i2}/\sigma_{e2} and \epsilon_{i2j}^* = \epsilon_{i2j}/\sigma_{e2}. For the new variables the error distribution is

    \epsilon_{ij}^* = (\epsilon_{i1j}, \epsilon_{i2j}^*)^T ~ i.i.d. N([0; 0], [\sigma_{e1}^2, \rho\sigma_{e1}; \rho\sigma_{e1}, 1]).

As shown by this reparametrization, \beta_2/\sigma_{e2}, \Sigma_{12}/\sigma_{e2} and \Sigma_{22}/\sigma_{e2}^2 are identifiable from the observed data. The resulting EM algorithm when \sigma_{e2} is held unrestricted can be regarded as a parameter-expanded (PX-EM) algorithm (Liu, Rubin and Wu, 1999). In applications where the traditional EM algorithm converges slowly, Liu, Rubin and Wu suggest expanding the complete-data model while preserving the observed-data model, and generating an EM based on the expanded complete-data model. The idea is that there is extra information in the imputed complete data which can be used to make the EM algorithm more efficient.

Because of the need to generate a large number of multidimensional samples from the conditional distribution of the latent response at each step of the algorithm, the procedure can be very computationally intensive. An alternative fitting procedure is based on an accelerated Monte Carlo EM algorithm proposed by Liao (1999). The
same algorithm has been independently proposed by Lavielle, Delyon and Moulines (1999), who called it a stochastic approximation EM algorithm (SAEM), because each expectation step of the EM algorithm is replaced by one iteration of a stochastic approximation procedure.

5.3.2 Stochastic Approximation EM Algorithm

The convergence results hold for a complete data log-likelihood of the form

    ln L_u(\psi) = [a(\psi)]^T z(u) - b(y, \psi),

where z(u) is a vector function of the complete data u, and b(y, \psi) is a function of the observed data y and the parameter vector \psi but not of the missing data. In the usual Monte Carlo EM algorithm the rth E-step uses the approximation

    E(z(u) | y, \hat\psi^{(r)}) \approx (1/m) \sum_{k=1}^m z(u^{(k)}),

where u^{(k)}, k = 1, ..., m, are generated values from the conditional distribution of {u | y, \hat\psi^{(r-1)}}. Liao proposed to calculate instead

    z^{(r)} = (1 - w_r) z^{(r-1)} + w_r z(u^{(r)}),

where u^{(r)} is only one generated value from the conditional distribution above. The Lavielle et al. (1999) approach is more general, because they propose to generate m_r values from {u | y, \hat\psi^{(r-1)}} and take the corresponding weighted average of the m_r simulated values of z(u). Here the w_r are chosen weights which must satisfy certain conditions, described by Liao and by Lavielle et al. Liao recommends a particular decreasing sequence w_r, but other weights may be more
appropriate in certain problems. The modified expectation step is followed by the usual maximization step after an initial 'stabilization period' of length r_0 for z_r. This algorithm can be applied to the correlated probit model as follows:

1. Select an initial estimate \hat\psi^{(0)} of the parameter vector. Generate r_0 samples from the distribution of {y_{i2}^* | y_{i1}, y_{i2}; \hat\psi^{(0)}} and compute the approximations

    z_1^{(r_0)} = (1/r_0) \sum_{k=1}^{r_0} y_{i2}^{*(k)},
    z_2^{(r_0)} = (1/r_0) \sum_{k=1}^{r_0} y_{i2}^{*(k)} y_{i2}^{*(k)T}.

Set r = r_0.

2. Increase r by 1. E-step: For each subject i, i = 1, ..., n, generate one random sample y_{i2}^{*(r)} from the conditional distribution of {y_{i2}^* | y_{i1}, y_{i2}; \hat\psi^{(r-1)}} using multivariate rejection sampling, and compute the approximations

    z_1^{(r)} = (1 - w_r) z_1^{(r-1)} + w_r y_{i2}^{*(r)}    (5.18)

and

    z_2^{(r)} = (1 - w_r) z_2^{(r-1)} + w_r y_{i2}^{*(r)} y_{i2}^{*(r)T}    (5.19)

to E(y_{i2}^* | y_{i1}, y_{i2}; \hat\psi^{(r-1)}) and E(y_{i2}^* y_{i2}^{*T} | y_{i1}, y_{i2}; \hat\psi^{(r-1)}), respectively.

3. M-step: Update the estimate of the parameter vector \hat\psi^{(r)} by substituting z_1^{(r)} and z_2^{(r)} in (5.13), (5.14) and (5.15).

4. Iterate between (2) and (3) until convergence is achieved.

As in the previous Monte Carlo EM algorithm, several runs of the algorithm can be used to assess convergence. The standard errors can be obtained using Louis's method based on additional random samples after the algorithm is stopped, but this
will be computationally inefficient and may be subject to similar problems as for the multivariate GLMM. An alternative approach is to apply the same type of stochastic approximation as used for the parameter estimates (Lavielle et al., 1999). We discuss that approach in the next subsection.

One more detail of the algorithm is worth pointing out. Multivariate rejection sampling was suggested for generating values from the needed conditional distributions. This is required to satisfy one of the conditions for convergence established by Liao (1999) and by Lavielle et al. (1999), namely that conditional on the parameter estimates \hat\psi^{(1)}, ..., \hat\psi^{(r-1)}, the simulated values u^{(1)}, ..., u^{(r)} are conditionally independent. Lavielle et al. mention that this condition can be relaxed to the case of Markovian dependence, i.e. when, conditional on \hat\psi^{(1)}, ..., \hat\psi^{(r-1)}, u^{(r)} depends only on u^{(r-1)} but not on u^{(0)}, ..., u^{(r-2)}, but this is still a result that needs to be proved. In the correlated probit model, if the Gibbs sampler can be used instead of multivariate rejection sampling to generate u^{(r)}, the algorithm can achieve convergence much faster. As will be demonstrated in the next section, because of the inefficiency of multivariate rejection sampling, Liao's algorithm does not provide an improvement in speed over the Chan and Kuk algorithm for the ethylene glycol data set unless the Gibbs sampler is used in place of multivariate rejection sampling. The modification to the rth E-step is as follows: For each subject i, i = 1, ..., n, generate one random sample y_{i2}^{*(r)} from the conditional distribution of {y_{i2}^* | y_{i1}, y_{i2}; \hat\psi^{(r-1)}} using the Gibbs sampler (as described in Chan and Kuk's algorithm) with starting values y_{i2}^{*(r-1)}, and compute the approximations (5.18) and (5.19).
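The stochastic-approximation updates (5.18)-(5.19) are just exponentially weighted running averages. The sketch below (scalar, with invented draws) shows the update rule; with the illustrative choice w_r = 1/r it reproduces the plain sample mean exactly, which is one way to see why it consistently estimates the conditional expectation:

```python
# Sketch of the stochastic-approximation update used in (5.18)-(5.19):
# z_r = (1 - w_r) z_{r-1} + w_r u_r.  With the (illustrative) choice
# w_r = 1/r, this running update reproduces the plain sample mean.
def saem_update(z_prev, u_r, w_r):
    return (1.0 - w_r) * z_prev + w_r * u_r

draws = [2.0, 4.0, 6.0, 8.0]     # invented "simulated" scalar values
z = draws[0]                     # initialization from the first draw
for r, u in enumerate(draws[1:], start=2):
    z = saem_update(z, u, 1.0 / r)
# z now equals the mean of the four draws
```

Faster-decaying weight sequences discount early draws more heavily, which is useful because early draws are simulated under poor parameter estimates.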
5.3.3 Standard Error Approximation

As outlined in Chapter 3, the observed data information matrix can be represented as follows:

    I(\psi) = -E[\partial^2 ln L_u(b, y^*, \psi)/\partial\psi \partial\psi^T + (\partial ln L_u(b, y^*, \psi)/\partial\psi)(\partial ln L_u(b, y^*, \psi)/\partial\psi^T) | y] + E[\partial ln L_u(b, y^*, \psi)/\partial\psi | y] E[\partial ln L_u(b, y^*, \psi)/\partial\psi^T | y].

Denote the first term above by G and the conditional expected score E[\partial ln L_u(b, y^*, \psi)/\partial\psi | y] by \Delta. By simulating values from the conditional distribution of {(b, y^*) | y}, both G and \Delta can be approximated at each step of the algorithm:

    G^{(r)} = (1 - w_r) G^{(r-1)} - w_r [\partial^2 ln L_u(b^{(r)}, y^{*(r)}, \psi)/\partial\psi \partial\psi^T + (\partial ln L_u(b^{(r)}, y^{*(r)}, \psi)/\partial\psi)(\partial ln L_u(b^{(r)}, y^{*(r)}, \psi)/\partial\psi^T)],
    \Delta^{(r)} = (1 - w_r) \Delta^{(r-1)} + w_r \partial ln L_u(b^{(r)}, y^{*(r)}, \psi)/\partial\psi.

These are the approximations for r > r_0. For r = r_0,

    \Delta^{(r_0)} = (1/r_0) \sum_{k=1}^{r_0} \partial ln L_u(b^{(k)}, y^{*(k)}, \psi)/\partial\psi,

with the analogous average used for G^{(r_0)}. It should be noted that it was not necessary to generate values from the random effects distribution for the parameter estimation, but for the standard error estimation this
cannot be avoided. Fortunately, not much extra effort is required and the modification can be incorporated in the algorithm easily. Notice that

    f(y^*, b | y) = f(y^* | y) f(b | y, y^*) = f(y^* | y) f(b | y^*),

so one can first simulate y_i^* as outlined for the parameter estimation and then simulate b from b | y^*, which is multivariate normal.

5.4 Application

The ethylene glycol data can be analyzed using the method outlined in the previous sections if we assume an underlying latent malformation variable. The model is then as follows:

    y_{i1j} = fetal weight of the jth live fetus in the ith litter,
    y_{i2j}^* = latent malformation of the jth live fetus in the ith litter,
    y_{i2j} = I{y_{i2j}^* > 0} = observed malformation status of the jth live fetus in the ith litter,

    y_{i1j} = \beta_{10} + d_i \beta_{11} + b_{i1} + \epsilon_{i1j},
    y_{i2j}^* = \beta_{20} + d_i \beta_{21} + b_{i2} + \epsilon_{i2j},

where d_i is the dose for the ith litter and

    \epsilon_{ij} = (\epsilon_{i1j}, \epsilon_{i2j})^T ~ i.i.d. N(0, \Sigma_e) = N([0; 0], [\sigma_{e1}^2, \sigma_{e12}; \sigma_{e12}, \sigma_{e2}^2]).

As seen from the last expression, this model is more general than a bivariate GLMM for a binary and a continuous outcome, because it allows different correlation between malformation and fetal weight within fetus and between two different fetuses within litter. When \sigma_{e12} = 0, this model reduces to the model in Chapter 3, but with a probit
Table 5.1. Maximum likelihood estimates for the ethylene glycol example using the Chan and Kuk method and the two versions of Liao's method, based on multivariate rejection sampling and the Gibbs sampler, respectively.

    Parameter    Chan-Kuk    Liao-MRS    Liao-Gibbs
    beta_10      0.952       0.952       0.952
    beta_11      0.087       0.087       0.087
    beta_20      2.377       2.388       2.369
    beta_21      0.969       0.967       0.966
    sigma_b1     0.086       0.086       0.086
    sigma_b2     0.824       0.844       0.822
    rho_b        0.607       0.633       0.608
    sigma_e1     0.075       0.075       0.075
    rho_e        0.202       0.203       0.201

instead of a logit link for malformation. Testing \sigma_{e12} = 0 is then equivalent to testing conditional independence between the two outcomes. The identifiable parameters in the above specification are \beta_{10}, \beta_{11}, \beta_{20}^* = \beta_{20}/\sigma_{e2}, \beta_{21}^* = \beta_{21}/\sigma_{e2}, \sigma_{b1}, \sigma_{b2}^* = \sigma_{b2}/\sigma_{e2}, \rho_b = \sigma_{b12}/(\sigma_{b1}\sigma_{b2}), \sigma_{e1} and \rho_e = \sigma_{e12}/(\sigma_{e2}\sigma_{e1}). Table 5.1 contains the estimates for the above model obtained using Chan and Kuk's algorithm and the two versions of Liao's algorithm. Initial estimates for \beta were the regression parameter estimates from the fixed effects models for the two responses, and \hat\sigma_{e1}^{(0)} was set equal to the estimated standard deviation from the linear mixed model for fetal weight. For the remaining variance components we used arbitrary values: \hat\Sigma^{(0)} = I_2, \hat\sigma_{e2}^{(0)} = 1 and \hat\sigma_{e12}^{(0)} = 0.

A comparison of the times until convergence for the parameter estimates is provided in Figures 5.1-5.4. The times are actual times (not CPU times) and are given in minutes. For comparison purposes, and to remove extraneous sources of variability, two algorithms were run on the same Sun Ultra 10 workstation simultaneously. There were no other big jobs running at the same time. Initially, Chan and Kuk's algorithm and Liao's algorithm using the Gibbs sampler were run simultaneously using the same number of simulated samples (10,000 in each case). This translated into
10,000 iterations for Liao's algorithm and 20 iterations for Chan and Kuk's algorithm, with a simulation sample size for the Gibbs sampler of 500 (a burn-in of 100) at each iteration. This number of iterations, however, was not sufficient for convergence of Chan and Kuk's algorithm, and hence the algorithms were rerun for 23 hours (corresponding to 137 iterations for Chan and Kuk's algorithm). Liao's algorithm using multivariate rejection sampling was run for 1000 iterations simultaneously with Chan and Kuk's algorithm. The 1000 iterations took 60 hours and some of the parameters had not converged yet (see the estimates for \sigma_{b2}/\sigma_{e2} and \rho_b in Figure 5.3). As can easily be seen from the graphs, Liao's algorithm using the Gibbs sampler is much faster than both Chan and Kuk's algorithm and Liao's algorithm using multivariate rejection sampling. A more detailed look at Liao's Gibbs sampler (see Figures 5.5-5.8) shows that the estimates appear to have converged (or nearly converged) after only about 500 iterations and in less than half an hour. The other two algorithms take several hours. All three algorithms proved to have convergence problems when the initial estimates were chosen far away from the true values. Note that we used a sample size of 500 with a burn-in of 100 for the Gibbs sampler at each iteration of Chan and Kuk's algorithm. This is very inefficient, especially when the estimates are still far away from the true maximum likelihood estimates, but it is unclear how to adaptively increase the sample size. Potentially, Chan and Kuk's algorithm can be accelerated significantly, but it is unlikely that it will outperform Liao's algorithm using the Gibbs sampler. Standard error computations were incorporated in Liao's Gibbs sampler algorithm.
The results after 10,000 iterations for a quadratic trend in dose for fetal weight are given in Table 5.2. The quadratic trend was not significant (p-value = 0.22), so the final model assumed only a linear relationship between dose and both responses. Table 5.3 contains the results from the fit of the final model, and also from the fits of two additional reduced models, which will be discussed later in this section.
Figure 5.2. Convergence of the intercept estimates (\beta_{20}) and the slope estimates (\beta_{21}) for malformation in the ethylene glycol example.
Figure 5.3. Convergence of the variance component estimates for the random effects in the ethylene glycol example.
Figure 5.4. Convergence of the variance component estimates for the random errors in the ethylene glycol example.
Figure 5.5. Convergence of the intercept and slope estimates for fetal weight in the ethylene glycol example. The estimates are from Liao's method using the Gibbs sampler. The horizontal lines represent the final values of the estimates.
Figure 5.6. Convergence of the intercept (β_20) and slope estimates for malformation in the ethylene glycol example. The estimates are from Liao's method using the Gibbs sampler.
O"bl O"b 2 .. ci ci "' ci al ci "' 9 ..,
Figure 5.8. Convergence of the variance component estimates for the random errors in the ethylene glycol example. The estimates are from Liao's method using the Gibbs sampler. The horizontal lines represent the final values of the estimates.
Table 5.2. Maximum likelihood estimates from a correlated probit model with a quadratic trend for fetal weight in the ethylene glycol example.

    Parameter   Estimate   Standard error
    β_10        0.963      0.017
    β_11        0.120      0.028
    β_12        0.011      0.009
    β_20        2.363      0.214
    β_21        0.957      0.110
    σ_b1        0.083      0.007
    σ_b2        0.836      0.107
    ρ_b         0.608      0.101
    σ_e1        0.075      0.002
    ρ_e         0.211      0.055

Table 5.3. Maximum likelihood estimates with linear dose effects for both variables in the ethylene glycol example.

    Parameter   Full model      Reduced model 1   Reduced model 2
                Est.    SE      Est.    SE        Est.    SE
    β_10        0.952   0.014   0.952   0.014     0.952   0.014
    β_11        0.087   0.008   0.087   0.008     0.087   0.008
    β_20        2.396   0.216   2.401   0.216     2.416   0.217
    β_21        0.971   0.110   0.972   0.110     0.988   0.112
    σ_b1        0.086   0.007   0.086   0.007     0.086   0.007
    σ_b2        0.837   0.106   0.839   0.107     0.873   0.113
    ρ_b         0.640   0.091   0.664   0.091     --      --
    σ_e1        0.075   0.002   0.075   0.002     0.075   0.002
    ρ_e         0.211   0.055   --      --        --      --

As expected, the standard error estimates converged more slowly than the parameter estimates (Figures 5.9 and 5.10). As pointed out by Lavielle et al. (1999), Δ^(r) → 0 as r → ∞, and therefore the limiting value of G^(r) can be used to assess the variability of the estimators. In practice, however, the algorithm may be stopped before reaching the maximum likelihood estimator, and it seems more reasonable to use both G^(r) and Δ^(r) to approximate the variances. In fact, in this particular example the two approximations lead to significantly different results (Table 5.4). The standard errors in the column SE2 are based only on G^(r), while those in the column SE3 are based on both G^(r) and Δ^(r).
Figure 5.9. Convergence of the intercept estimate and its estimated standard error for fetal weight in the ethylene glycol example.
Figure 5.10. Convergence of the intercept estimate and its estimated standard error for malformation in the ethylene glycol example.
Figure 5.11. Convergence of the slope estimate and its estimated standard errors for fetal weight. SE2 and SE3 denote the standard errors corresponding to two- and three-term approximations of the observed information matrix, respectively.
Figure 5.12. Convergence of the slope estimate and its estimated standard errors for malformation. SE2 and SE3 denote the standard errors corresponding to two- and three-term approximations of the observed information matrix, respectively.
Table 5.4. Standard error estimates using two- and three-term approximations of the observed information matrix in the ethylene glycol example.

    Parameter   SE2     SE3
    β_10        0.038   0.014
    β_11        0.011   0.008
    β_20        0.702   0.215
    β_21        0.270   0.110
The parameter estimates and standard error estimates from all three models are identical up to two significant digits after the decimal point. This is somewhat disappointing, because it is expected that when the responses are fitted together, and a better correlation structure is assumed, there will be efficiency gains in estimating the parameters. The next section is devoted to this issue.

5.5 Simulation Study

It was of interest to investigate possible efficiency gains in fitting the correlated probit model instead of the corresponding multivariate GLMM, and instead of separate univariate GLMMs for the individual response variables. A simulation study was designed as follows. The structure of the data was assumed to be the same as in the ethylene glycol example. The parameter values (except for the error correlation parameter ρ_e) were set equal to the final estimates from Liao's Gibbs sampler. There were six settings, corresponding to one large and two small data sets, and to strong and weak intra-fetus correlation (ρ_e = 0.804 and ρ_e = 0.201). The large data set had numbers of clusters and observations exactly as in the ethylene glycol example (94 clusters and an average of about 11 observations per cluster); both small data sets had 6 observations per cluster, but one had 32 clusters and the other had 24 clusters. The simulation with 32 clusters was added after the other simulations were finished. A total of 50 samples were generated at each of the six settings, and three models were fitted to each of those samples: a correlated probit model, referred to as the full model (FM); a multivariate GLMM, referred to as reduced model 1 (RM1); and two separate GLMMs, referred to as reduced model 2 (RM2). All models were fit using Liao's Gibbs sampler, but in RM1 ρ_e was assumed to be equal to 0, and in RM2 both ρ_e and ρ_b were assumed to be equal to 0. The algorithms for the larger data set were run for 5000 iterations, while those for the smaller data sets were run for 10,000
iterations. The standard errors were computed using the stochastic approximation approach. The initial estimates for the regression parameters were the estimates from the fixed-effects fits. The sample standard deviation was used as an initial estimate for σ_e1, and some reasonable values were used for the other variance components: σ_b1^(0) = σ_b2^(0) = 1, ρ_b^(0) = 0.5 and ρ_e^(0) = 0.5. In the reduced models the same initial parameter values were used, except for those parameters that were not included in the models. The simulation programs for the larger data sets ran for about 25 days each on a Sun Ultra 10 workstation with 128 MB of RAM; the simulation programs for the smaller data sets ran for about 10 days. The results are summarized in Tables 5.5-5.16. The data set with 32 clusters is referred to as the medium data set. For each setting there are two tables: one gives the average parameter estimates and the average standard errors, and the other gives the standard deviations of the parameter and standard error estimates. By comparing the averages of the standard error estimates with the standard deviations of the parameter estimates, the Monte Carlo error can be judged. For the large data set the sample standard deviations of the estimated parameters are very similar to the corresponding means of the estimated standard errors, indicating that the Monte Carlo error is small (see Tables 5.5-5.8). This, however, is not true for the small simulation sample size for the estimates of some of the parameters (β_20, β_21, σ_b2, ρ_b and ρ_e) in the full model, and to a lesser extent in the reduced models. For example, the standard deviations of β_20 and β_21 in the full model are larger than the mean standard errors (0.442 and 0.213, as compared to 0.323 and 0.175 in Tables 5.11 and 5.12). Similar observations can be made for the other small-sample settings.
It is not surprising that the Monte Carlo error increases when the number of observations per cluster and the number of clusters are reduced, but this requires caution in interpreting the simulation study results.
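The data-generating step of the simulation design described above can be sketched as follows. This is a minimal sketch, not the exact program used in the study: the parameter values are patterned after the full-model column of Table 5.3, while the dose levels, the signs of the slopes, and the function and variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative values patterned after Table 5.3 (full model); the dose
# coding and the slope signs are assumptions, not the study's exact design.
BETA1 = (0.952, -0.087)               # fetal weight: intercept, dose slope
BETA2 = (2.396, -0.971)               # malformation (latent probit scale)
SB1, SB2, RHO_B = 0.086, 0.837, 0.640 # random-effect SDs and correlation
SE1, RHO_E = 0.075, 0.804             # error SD (weight), intra-fetus corr.
                                      # (latent binary error SD fixed at 1)

def simulate(n_clusters=24, n_per=6, seed=0):
    """Generate one data set from the correlated probit model: a continuous
    response (weight) and a binary response (malformation) sharing correlated
    cluster-level random effects and correlated within-fetus errors."""
    rng = np.random.default_rng(seed)
    cov_b = np.array([[SB1**2, RHO_B * SB1 * SB2],
                      [RHO_B * SB1 * SB2, SB2**2]])
    cov_e = np.array([[SE1**2, RHO_E * SE1 * 1.0],
                      [RHO_E * SE1 * 1.0, 1.0]])
    rows = []
    for i in range(n_clusters):
        dose = rng.choice([0.0, 0.75, 1.5, 3.0])   # litter-level dose
        b = rng.multivariate_normal(np.zeros(2), cov_b)
        for _ in range(n_per):
            e = rng.multivariate_normal(np.zeros(2), cov_e)
            weight = BETA1[0] + BETA1[1] * dose + b[0] + e[0]
            latent = BETA2[0] + BETA2[1] * dose + b[1] + e[1]
            rows.append((i, dose, weight, int(latent > 0)))  # dichotomize
    return rows
```

Fitting FM, RM1 and RM2 to each generated data set then amounts to maximizing the likelihood with ρ_e free, with ρ_e = 0, and with ρ_e = ρ_b = 0, respectively.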
When the sample size is large and the correlation is weak, the average standard errors for all parameters are very similar, and the largest difference occurs for β_20 (0.211 compared to 0.216 in Table 5.5). There is hardly any gain in efficiency in fitting the responses together rather than fitting them separately, although both ρ_b and ρ_e are significantly different from zero. When the correlation ρ_e is strong, there is a very slight gain in efficiency for the regression parameters for the Bernoulli response (0.209 in the FM compared to 0.221 for RM1 and to 0.223 for RM2 for β_20, and 0.107 in the FM compared to 0.113 in RM1 and RM2 for β_21 in Table 5.7). For small sample size and strong intra-fetus correlation (ρ_e) the efficiency gains are much more pronounced: the average standard error estimate for β_21 goes up from 0.175 to 0.255 between FM and RM2 (Table 5.11). A slightly smaller difference is observed in the standard deviations (Table 5.12). Although such an increase in the standard errors will not lead to a different conclusion about the significance of the effects, it is important when setting up confidence intervals to estimate the strength of effects. Some efficiency gains are also observed in the small-sample case when the intra-fetus correlation is weak, although this may at least partially be due to Monte Carlo error (Tables 5.9 and 5.10). Notice that in that case the intra-fetus correlation is not significant (p-value = 0.24). The simulation results for the medium data set were very similar to those for the small data set and will not be discussed in more detail here. In summary, it appears that there are noticeable efficiency gains in parameter estimation only for small data sets and strong correlations between the responses within cluster. The efficiency gains are largest for the binary response regression parameters.
The cases considered are few, and so recommendations can only be tentative. However, it seems that unless the sample is small and there is evidence of strong intra-cluster correlations between the response variables, it may not be worth the extra computational effort to fit the responses jointly rather than separately. Joint fitting may still be necessary when multivariate questions must be
answered, and to obtain the true maximum likelihood estimates when the correlated probit model is the true underlying model, but the efficiency gains may not be great.

5.6 Identifiability Issue

For reasons of computational simplicity, the variance component σ_e2² was left unrestricted in the EM algorithm described in Section 5.2. It was hypothesized that the EM algorithm would converge to unique estimates of the identifiable parameters. We now address this issue in greater detail.

Let us consider a simpler case, where there is a closed-form expression for the maximum likelihood estimates. Suppose that we have n i.i.d. Bernoulli random variables Y_i ~ Be(π), i = 1, ..., n, and assume that they arise from dichotomizations of n i.i.d. unobserved normal random variables y*_i ~ N(μ, σ²), that is, Y_i = I{y*_i > 0}. Then π = Φ(μ/σ), as demonstrated earlier in this chapter. The maximum likelihood estimate of π is n₁/n, where n₁ is the number of ones in the observed sample, and hence the maximum likelihood estimate of μ/σ is Φ⁻¹(n₁/n). μ and σ are not individually estimable from the observed data, but are estimable from the complete data y*_i, i = 1, ..., n. For this simple case we can derive expressions for the estimates of μ and σ at the (r+1)st step of the EM algorithm in terms of the previous estimates at the rth step, and show that the EM algorithm converges to the true maximum likelihood estimate of μ/σ. Moreover, the estimates of μ and σ converge to some values μ* and σ* such that the ratio μ*/σ* is the maximum likelihood estimate of μ/σ.

The complete data log-likelihood is

    l_c = -(n/2) log(σ²) - (1/(2σ²)) Σ_{i=1}^n (y*_i - μ)².
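The non-identifiability of (μ, σ) in this dichotomized-normal setting, and the identifiability of the ratio μ/σ, can be illustrated numerically. A minimal sketch; the sample counts n and n₁ below are made up for illustration:

```python
from statistics import NormalDist

N01 = NormalDist()  # standard normal

# Two parameter pairs with the same ratio mu/sigma induce the same Bernoulli
# distribution, P(Y = 1) = P(y* > 0) = Phi(mu/sigma), so they cannot be
# distinguished from the observed data.
p_a = N01.cdf(1.0 / 2.0)   # (mu, sigma) = (1, 2)
p_b = N01.cdf(2.0 / 4.0)   # (mu, sigma) = (2, 4): observationally identical
assert abs(p_a - p_b) < 1e-12

# The MLE of the identifiable ratio, from n1 ones out of n observations:
n, n1 = 1000, 691          # hypothetical counts
ratio_hat = N01.inv_cdf(n1 / n)
```

Any pair (μ, σ) with μ/σ equal to `ratio_hat` attains the same observed-data likelihood, which is exactly why only the ratio is estimable.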
Table 5.5. Results from simulation study: average estimates and average estimated standard errors for large sample and weak correlation.

    Parameter   True value   Full model      Reduced model 1   Reduced model 2
                             Est.    SE      Est.    SE        Est.    SE
    β_10        0.952        0.955   0.014   0.955   0.014     0.955   0.014
    β_11        0.087        0.089   0.008   0.089   0.008     0.089   0.008
    β_20        2.369        2.471   0.211   2.462   0.209     2.466   0.216
    β_21        0.966        1.016   0.108   1.012   0.106     1.014   0.109
    σ_b1        0.086        0.086   0.007   0.086   0.007     0.086   0.007
    σ_b2        0.822        0.804   0.114   0.799   0.114     0.796   0.116
    ρ_b         0.608        0.620   0.098   0.642   0.098     --      --
    σ_e1        0.075        0.075   0.002   0.075   0.002     0.075   0.002
    ρ_e         0.201        0.195   0.059   --      --        --      --

Table 5.6. Results from simulation study: standard deviations of estimates and of estimated standard errors for large sample and weak correlation.

    Parameter   Full model        Reduced model 1   Reduced model 2
                Est.    SE        Est.    SE        Est.    SE
    β_10        0.014   0.00113   0.014   0.00112   0.014   0.00108
    β_11        0.008   0.00068   0.008   0.00065   0.008   0.00063
    β_20        0.197   0.02617   0.194   0.02235   0.202   0.02639
    β_21        0.115   0.01431   0.114   0.01049   0.119   0.01199
    σ_b1        0.007   0.00047   0.007   0.00047   0.007   0.00047
    σ_b2        0.102   0.01228   0.100   0.01163   0.102   0.01226
    ρ_b         0.098   0.01687   0.097   0.01692   --      --
    σ_e1        0.002   0.00004   0.002   0.00004   0.002   0.00004
    ρ_e         0.059   0.00277   --      --        --      --
Table 5.7. Results from simulation study: average estimates and average estimated standard errors for large sample and strong correlation.

    Parameter   True value   Full model      Reduced model 1   Reduced model 2
                             Est.    SE      Est.    SE        Est.    SE
    β_10        0.952        0.951   0.014   0.951   0.014     0.951   0.014
    β_11        0.087        0.088   0.008   0.088   0.008     0.088   0.008
    β_20        2.369        2.446   0.209   2.499   0.221     2.499   0.223
    β_21        0.966        1.005   0.107   1.027   0.113     1.028   0.113
    σ_b1        0.086        0.084   0.007   0.085   0.007     0.084   0.007
    σ_b2        0.822        0.808   0.115   0.847   0.120     0.844   0.123
    ρ_b         0.608        0.615   0.092   0.686   0.089     --      --
    σ_e1        0.075        0.075   0.002   0.075   0.002     0.075   0.002
    ρ_e         0.801        0.766   0.031   --      --        --      --

Table 5.8. Results from simulation study: standard deviations of estimates and of estimated standard errors for large sample and strong correlation.

    Parameter   Full model        Reduced model 1   Reduced model 2
                Est.    SE        Est.    SE        Est.    SE
    β_10        0.015   0.00094   0.015   0.00092   0.015   0.00090
    β_11        0.008   0.00054   0.008   0.00051   0.008   0.00057
    β_20        0.207   0.03556   0.218   0.03760   0.229   0.03145
    β_21        0.104   0.01602   0.112   0.01770   0.116   0.01510
    σ_b1        0.005   0.00037   0.005   0.00037   0.005   0.00037
    σ_b2        0.114   0.01579   0.122   0.01705   0.128   0.01602
    ρ_b         0.095   0.01476   0.095   0.01492   --      --
    σ_e1        0.002   0.00004   0.002   0.00004   0.002   0.00004
    ρ_e         0.027   0.00330   --      --        --      --
Table 5.9. Results from simulation study: average estimates and average estimated standard errors for small sample and weak correlation.

    Parameter   True value   Full model      Reduced model 1   Reduced model 2
                             Est.    SE      Est.    SE        Est.    SE
    β_10        0.952        0.950   0.028   0.950   0.028     0.950   0.028
    β_11        0.087        0.085   0.017   0.085   0.016     0.085   0.016
    β_20        2.369        2.610   0.489   2.567   0.560     2.618   0.605
    β_21        0.966        1.054   0.243   1.038   0.267     1.060   0.282
    σ_b1        0.086        0.083   0.014   0.083   0.013     0.083   0.014
    σ_b2        0.822        0.787   0.225   0.772   0.262     0.761   0.320
    ρ_b         0.608        0.646   0.177   0.725   0.141     --      --
    σ_e1        0.075        0.075   0.005   0.075   0.005     0.075   0.005
    ρ_e         0.201        0.188   0.159   --      --        --      --

Table 5.10. Results from simulation study: standard deviations of estimates and of estimated standard errors for small sample and weak correlation.

    Parameter   Full model      Reduced model 1   Reduced model 2
                Est.    SE      Est.    SE        Est.    SE
    β_10        0.027   0.004   0.027   0.004     0.027   0.004
    β_11        0.014   0.002   0.014   0.002     0.014   0.002
    β_20        0.734   0.160   0.648   0.582     0.802   0.393
    β_21        0.300   0.069   0.273   0.247     0.339   0.168
    σ_b1        0.012   0.002   0.012   0.006     0.012   0.002
    σ_b2        0.345   0.098   0.340   0.507     0.380   0.166
    ρ_b         0.315   0.151   0.202   0.141     --      --
    σ_e1        0.004   0.0003  0.004   0.0005    0.004   0.0003
    ρ_e         0.149   0.023   --      --        --      --
Table 5.11. Results from simulation study: average estimates and average estimated standard errors for small sample and strong correlation.

    Parameter   True value   Full model      Reduced model 1   Reduced model 2
                             Est.    SE      Est.    SE        Est.    SE
    β_10        0.952        0.950   0.025   0.950   0.028     0.950   0.028
    β_11        0.087        0.086   0.015   0.086   0.016     0.086   0.016
    β_20        2.369        2.316   0.323   2.377   0.455     2.401   0.533
    β_21        0.966        0.950   0.175   0.975   0.229     0.987   0.255
    σ_b1        0.086        0.082   0.014   0.082   0.013     0.082   0.014
    σ_b2        0.822        0.721   0.197   0.761   0.197     0.766   0.293
    ρ_b         0.608        0.637   0.185   0.788   0.130     --      --
    σ_e1        0.075        0.076   0.005   0.076   0.005     0.076   0.005
    ρ_e         0.804        0.767   0.058   --      --        --      --

Table 5.12. Results from simulation study: standard deviations of estimates and of estimated standard errors for small sample and strong correlation.

    Parameter   Full model      Reduced model 1   Reduced model 2
                Est.    SE      Est.    SE        Est.    SE
    β_10        0.026   0.008   0.026   0.004     0.026   0.004
    β_11        0.016   0.005   0.016   0.002     0.016   0.003
    β_20        0.442   0.169   0.482   0.128     0.561   0.196
    β_21        0.213   0.073   0.229   0.056     0.256   0.083
    σ_b1        0.015   0.002   0.015   0.007     0.015   0.002
    σ_b2        0.216   0.103   0.223   0.117     0.272   0.075
    ρ_b         0.235   0.132   0.219   0.147     --      --
    σ_e1        0.004   0.002   0.004   0.0003    0.004   0.0002
    ρ_e         0.093   0.061   --      --        --      --
Table 5.13. Results from simulation study: average estimates and average estimated standard errors for medium sample and weak correlation.

    Parameter   True value   Full model      Reduced model 1   Reduced model 2
                             Est.    SE      Est.    SE        Est.    SE
    β_10        0.952        0.953   0.025   0.953   0.024     0.953   0.024
    β_11        0.087        0.088   0.014   0.088   0.014     0.088   0.014
    β_20        2.369        2.550   0.429   2.532   0.467     2.569   0.501
    β_21        0.966        1.049   0.215   1.042   0.226     1.056   0.239
    σ_b1        0.086        0.084   0.012   0.084   0.012     0.084   0.012
    σ_b2        0.822        0.821   0.213   0.813   0.244     0.816   0.274
    ρ_b         0.608        0.605   0.181   0.657   0.193     --      --
    σ_e1        0.075        0.074   0.004   0.074   0.004     0.074   0.004
    ρ_e         0.201        0.231   0.135   --      --        --      --

Table 5.14. Results from simulation study: standard deviations of estimates and of estimated standard errors for medium sample and weak correlation.

    Parameter   Full model      Reduced model 1   Reduced model 2
                Est.    SE      Est.    SE        Est.    SE
    β_10        0.022   0.003   0.022   0.003     0.022   0.003
    β_11        0.016   0.002   0.016   0.002     0.016   0.002
    β_20        0.620   0.111   0.609   0.232     0.624   0.214
    β_21        0.291   0.055   0.286   0.094     0.292   0.088
    σ_b1        0.011   0.001   0.011   0.001     0.011   0.001
    σ_b2        0.282   0.067   0.280   0.189     0.302   0.080
    ρ_b         0.195   0.087   0.194   0.238     --      --
    σ_e1        0.005   0.0003  0.005   0.0003    0.005   0.0003
    ρ_e         0.121   0.014   --      --        --      --
Table 5.15. Results from simulation study: average estimates and average estimated standard errors for medium sample and strong correlation.

    Parameter   True value   Full model      Reduced model 1   Reduced model 2
                             Est.    SE      Est.    SE        Est.    SE
    β_10        0.952        0.951   0.023   0.951   0.024     0.951   0.024
    β_11        0.087        0.085   0.013   0.085   0.014     0.085   0.014
    β_20        2.369        2.468   0.315   2.537   0.412     2.537   0.481
    β_21        0.966        0.973   0.161   1.002   0.201     1.004   0.226
    σ_b1        0.086        0.082   0.011   0.082   0.010     0.082   0.012
Then the (r+1)st E-step involves finding

    E[l_c | y; μ̂^(r), σ̂^(r)].

From the formulae for the moments of the truncated normal distribution in Johnson et al. (1994), pp. 156-157,

    E(y*_i | Y_i = 1) = μ̂^(r) + σ̂^(r) φ(μ̂^(r)/σ̂^(r)) / Φ(μ̂^(r)/σ̂^(r)),
    E(y*_i | Y_i = 0) = μ̂^(r) - σ̂^(r) φ(μ̂^(r)/σ̂^(r)) / [1 - Φ(μ̂^(r)/σ̂^(r))].

Now, using the equalities φ(-x) = φ(x), Φ(-x) = 1 - Φ(x) and E{(y*_i)² | Y_i} = Var(y*_i | Y_i) + (E(y*_i | Y_i))², and simplifying the notation by writing φ^(r) = φ(μ̂^(r)/σ̂^(r)) and Φ^(r) = Φ(μ̂^(r)/σ̂^(r)), we obtain

    E[l_c | y; μ̂^(r), σ̂^(r)] = -(n/2) log(σ²)
        - (n/(2σ²)) [ σ̂²^(r) + (μ - μ̂^(r))² + (2μ - μ̂^(r)) σ̂^(r) (φ^(r)/(Φ^(r)(1 - Φ^(r)))) (Φ^(r) - n₁/n) ].
The (r+1)st M-step of the EM algorithm then reduces to finding the maximum of the above expected log-likelihood with respect to μ and σ. From

    ∂E(l_c | y; μ̂^(r), σ̂^(r))/∂μ = -(n/σ²) [ μ - μ̂^(r) + σ̂^(r) (φ^(r)/(Φ^(r)(1 - Φ^(r)))) (Φ^(r) - n₁/n) ] = 0

we obtain

    μ̂^(r+1) = μ̂^(r) - σ̂^(r) (φ^(r)/(Φ^(r)(1 - Φ^(r)))) (Φ^(r) - n₁/n).        (5.20)

And similarly, from ∂E(l_c | y; μ̂^(r), σ̂^(r))/∂σ² = 0 we obtain

    σ̂²^(r+1) = σ̂²^(r) { 1 + (μ̂^(r)/σ̂^(r)) (φ^(r)/(Φ^(r)(1 - Φ^(r)))) (Φ^(r) - n₁/n)
                         - [ (φ^(r)/(Φ^(r)(1 - Φ^(r)))) (Φ^(r) - n₁/n) ]² }.        (5.21)

Equations (5.20) and (5.21) define the iterative algorithm for finding the parameter estimates of μ and σ². Notice that if Φ^(r) = n₁/n, then μ̂^(r+1) = μ̂^(r) and σ̂^(r+1) = σ̂^(r), and hence Φ⁻¹(n₁/n) (the maximum likelihood estimate of μ/σ) is the only stationary point for both μ and σ in the EM algorithm. In fact, it is the only stationary point also for the ratio μ/σ. To prove that, define

    g(x; c) = φ(x)(Φ(x) - c) / [x Φ(x)(1 - Φ(x))],
    f₁(x; c) = 1 - g(x; c),
    h(x; c) = 1 + x² g(x; c) - x² [g(x; c)]².

Notice that μ̂^(r+1) = μ̂^(r) f₁(μ̂^(r)/σ̂^(r); n₁/n) and σ̂^(r+1) = σ̂^(r) [h(μ̂^(r)/σ̂^(r); n₁/n)]^{1/2} for h(μ̂^(r)/σ̂^(r); n₁/n) > 0. Suppose that there is another stationary point x_s ≠ Φ⁻¹(n₁/n) for the ratio μ/σ. Then g(x_s; c) ≠ 0, f₁(x_s; c)² = h(x_s; c), and f₁(x_s; c) must be positive. The equation
has the following nonzero solution:

    g(x_s; c) = (2 + x_s²) / (1 + x_s²),

which implies f₁(x_s; c) < 0 (with h(x_s; c) > 0), and therefore there is no other stationary point for the ratio μ/σ. Notice that the functions above are not defined at x = 0, but this is not a problem, since μ/σ = 0 cannot be a stationary point unless Φ⁻¹(n₁/n) = 0, and hence we can assume that x ≠ 0 for the definitions above. The finding that there are no stationary points other than the maximum likelihood estimate is important, because otherwise it is not guaranteed that the algorithm will converge to the maximum likelihood estimate of the identifiable ratio.

By the basic property of the EM algorithm, the observed data log-likelihood is nondecreasing at each step of the algorithm (Dempster, Laird and Rubin (1977), Wu (1983)), and this is true regardless of whether all parameters are identifiable from the observed data or not. Let l₀(μ/σ) = l₀(μ, σ) denote the observed data log-likelihood as a function of the parameters. Then l₀(μ, σ) = l_c(μ, σ) - l₁(μ, σ), where l₁(μ, σ) = ln f(y* | y; μ, σ). Taking conditional expectations with respect to the observed data,

    l₀(μ, σ) = Q(μ, σ; μ*, σ*) - H(μ, σ; μ*, σ*),

where Q(μ, σ | μ*, σ*) = E{l_c(μ, σ) | y; μ*, σ*} and H(μ, σ | μ*, σ*) = E{l₁(μ, σ) | y; μ*, σ*}, and for any μ*, σ* in the parameter space

    H(μ, σ | μ*, σ*) ≤ H(μ*, σ* | μ*, σ*)

by a consequence of Jensen's inequality (Dempster, Laird and Rubin (1977)). Therefore

    l₀(μ̂^(r+1), σ̂^(r+1)) ≥ l₀(μ̂^(r), σ̂^(r)).
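Both claims can be checked numerically. The sketch below (with made-up counts n₁ and n) iterates the updates (5.20)-(5.21) for the dichotomized-normal example and verifies the algebraic identity behind the stationary-point argument:

```python
from statistics import NormalDist
import random

N01 = NormalDist()  # standard normal

def em_ratio(n1, n, mu=0.5, var=1.0, iters=200):
    """Iterate (5.20)-(5.21) for the dichotomized-normal example and
    return the limiting ratio mu/sigma."""
    for _ in range(iters):
        sigma = var ** 0.5
        x = mu / sigma
        phi, Phi = N01.pdf(x), N01.cdf(x)
        K = phi / (Phi * (1 - Phi)) * (Phi - n1 / n)
        mu, var = mu - sigma * K, var * (1 + x * K - K * K)
    return mu / var ** 0.5

# The ratio converges to Phi^{-1}(n1/n), even though mu and sigma
# separately depend on the starting values.
assert abs(em_ratio(700, 1000) - N01.inv_cdf(0.7)) < 1e-8

# Identity used in the stationary-point argument:
# f1(x;c)^2 - h(x;c) = g * ((1 + x^2) g - (2 + x^2)),
# so f1^2 = h forces g = 0 or g = (2 + x^2)/(1 + x^2) > 1, whence f1 < 0.
random.seed(1)
for _ in range(1000):
    x, g = random.uniform(-3, 3), random.uniform(-2, 2)
    f1, h = 1 - g, 1 + x * x * g - x * x * g * g
    assert abs((f1 * f1 - h) - g * ((1 + x * x) * g - (2 + x * x))) < 1e-9
```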
Because the observed data log-likelihood is concave in the parameter μ/σ, for any (μ̂^(0), σ̂^(0)) inside the parameter space the sequence {l₀(μ̂^(r), σ̂^(r))} is bounded from above, and hence will converge to some l*. If we can ensure that the log-likelihood l₀ is strictly increasing for all μ/σ ≠ Φ⁻¹(n₁/n), then l* = l₀(Φ⁻¹(n₁/n)). But the likelihood will be strictly increasing if μ̂^(r+1)/σ̂^(r+1) ≠ μ̂^(r)/σ̂^(r), that is, if there are no other stationary points for the ratio μ/σ in the EM algorithm.

In more complicated cases, when there are no closed-form expressions for either the E-step or the M-step of the EM algorithm, it is not possible to algebraically verify that there are no stationary points of the algorithm other than those corresponding to stationary points of the observed data log-likelihood. But a modification may be implemented which, according to Liu, Rubin and Wu (1999), will avoid that problem. In the context of the example above, the modification amounts to setting σ_e2 equal to its null value at each step of the algorithm, rather than using the current estimate. Still, the general question of convergence of the algorithm as initially proposed is of significant interest and justifies further research.

5.7 Model Extensions

The correlated probit model can be generalized to incorporate any combination of binary, continuous, or censored continuous data. It can also accommodate ordinal data for which the cutoff points for the underlying continuous random variable are known. To demonstrate how the extensions proposed above can be carried out, we consider the underlying linear mixed model defined in Section 5.1 in equations (5.2)-(5.5), and denote the first variable by y*_i1j rather than by y_i1j. We may observe either (y*_i1j, y*_i2j)^T; or I{y*_i1j > 0}, I{y*_i2j > 0}; or I{y*_i1j > τ_11}, I{y*_i1j > τ_12}, ..., I{y*_i1j > τ_1p₁} with τ_11, ..., τ_1p₁
known; or

    y^c_i1j = y*_i1j  if y*_i1j > γ₁,     y^c_i1j = γ₁  if y*_i1j ≤ γ₁,
    y^c_i2j = y*_i2j  if y*_i2j > γ₂,     y^c_i2j = γ₂  if y*_i2j ≤ γ₂,

where γ₁ and γ₂ are known; or any combination of the above. For all those cases the complete data MLE will be the MLE for β₁, β₂, Σ_b and Σ_e from the mixed model in Section 5.2, and therefore the M-step in the EM algorithm will be the same. At each E-step, expressions or approximations of E(y*_i | y_i, ψ̂^(r)) and E(y*_i y*_i^T | y_i, ψ̂^(r)) are needed, where y_i is the observed response vector for subject i. If y_i1 = y*_i1, then we do not need to generate values for this response, and we just use E(y*_i1 | y_i, ψ̂^(r)) = y_i1 and E(y*_i1 y*_i1^T | y_i, ψ̂^(r)) = y_i1 y_i1^T, and similarly for the other response. Otherwise we do need to generate values. The binary case has already been described in detail earlier in this chapter. The ordinal case is handled in the same way, with the only difference that generated values from the truncated multivariate normal distribution must fall in the region specified by the ordinal response. The censored response is kept as it is if it corresponds to an uncensored observation, that is, E(y*_i1j | y_i, ψ̂^(r)) = y^c_i1j and E(y*_i1j y*_i1j | y_i, ψ̂^(r)) = (y^c_i1j)² if y^c_i1j ≠ γ₁. If, however, y^c_i1j = γ₁, then values are generated from the truncated normal distribution of y*_i1j given the rest of the data and y^c_i1j = γ₁, as in the binary case. (The truncation in this particular example is from above at γ₁.) Then E(y*_i1j | y_i, ψ̂^(r)) is approximated by the average of the generated values, and the variance is handled similarly.

Another possible extension of the correlated probit model is that more general correlation structures may be incorporated, both at the random effects and at the random error level. One example is an autoregressive structure, which would allow one to model a variety of longitudinal data sets.
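For a censored continuous response, the conditional moments needed at the E-step can be approximated by Monte Carlo draws from a truncated normal. A minimal univariate sketch, which ignores the conditioning on the other response and on the random effects; the function name, parameters, and tolerance are illustrative assumptions:

```python
from statistics import NormalDist
import numpy as np

def estep_moments(mu, sigma, gamma, y_obs, m=20000, seed=0):
    """Monte Carlo E-step moments for one normal observation left-censored
    at gamma: if the observation is uncensored (y_obs > gamma) its own value
    is used; otherwise y* is drawn from N(mu, sigma^2) truncated from above
    at gamma, mirroring the binary-case sampling described in the text."""
    if y_obs > gamma:
        return y_obs, y_obs ** 2
    rng = np.random.default_rng(seed)
    dist = NormalDist(mu, sigma)
    p_cut = dist.cdf(gamma)                 # P(y* <= gamma)
    u = rng.uniform(1e-12, p_cut, size=m)   # inverse-CDF sampling
    draws = np.array([dist.inv_cdf(v) for v in u])
    return draws.mean(), (draws ** 2).mean()

# Sanity check against the closed-form truncated-normal mean:
# E(Z | Z <= 0) = -phi(0)/Phi(0) for Z ~ N(0, 1).
mean_hat, _ = estep_moments(0.0, 1.0, 0.0, 0.0)
assert abs(mean_hat - (-NormalDist().pdf(0.0) / 0.5)) < 0.02
```

In the actual algorithm the draws come from the conditional multivariate normal given the other responses and the current parameter values ψ̂^(r), but the mechanics are the same.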
5.8 Future Research

There are several unanswered questions concerning the model and the algorithms proposed in this chapter which justify further research. Establishing convergence for the type of Markovian dependence implied by Liao's Gibbs sampler is certainly one of them. Gu and Kong (1998) propose a stochastic approximation algorithm with the Markov chain Monte Carlo method for incomplete-data estimation problems that uses basically the same type of idea. There are also some general results concerning stochastic algorithms with Markovian perturbations in the probability theory literature. These can serve as a basis for establishing convergence properties.

Another interesting question is the choice of weights to ensure faster convergence of the Monte Carlo EM algorithm. We noticed that when the initial estimates were far from the maximum likelihood estimates the algorithms had convergence problems, and this can probably be remedied if the weights are chosen wisely. A general result for the identifiability problem, as outlined in Section 5.6, is needed for completeness, and this can probably be achieved by extending some of the theorems of Wu (1983).

The problem of checking the assumptions of the correlated probit model is also very important, as misspecification of the model can lead to inconsistent parameter estimates (White, 1982). The effects of incorrect specification of the random effects distribution are of special interest, because the problem of joint versus separate fitting of the response variables can be addressed from that perspective. Neuhaus, Hauck and Kalbfleisch (1992) investigated the effects of misspecification of the random effects distribution in mixed-effects logistic models and found that the bias in the parameter estimates is usually small. It would not be surprising if this is the case for the correlated probit model, but the issue must be addressed in greater detail.
CHAPTER 6
CONCLUSIONS

6.1 Summary

The goal of this dissertation was to propose and investigate random effects models for repeated measures situations when there are two or more response variables. Our emphasis was on maximum likelihood estimation and on applications with outcomes of different types. We proposed a multivariate generalized linear mixed model that can accommodate any combination of responses in the exponential family. We also considered a correlated probit model that is suitable for mixtures of binary, continuous, censored continuous, and ordinal outcomes. Although more limited in its area of applicability, the correlated probit model allows for a more general correlation structure between the response variables than the corresponding multivariate generalized linear mixed model.

We used two real-life applications for illustration: a developmental toxicity study in mice and a myoelectric activity study in ponies. The first data set had a binary and a continuous response, which made it suitable for illustration of both models; the second data set had a count and a duration response, which were fitted as negative binomial and gamma variates in a multivariate generalized linear mixed model. The two data sets also differed in that the mice data had a relatively large number of subjects (about 100) and between 1 and 16 observations per subject, while the pony data had only 6 subjects but many observations per subject. Because the large-sample asymptotic theory holds when the number of subjects goes to infinity, we used the mice data for illustration of hypothesis testing in Chapter 4.
In Chapter 2 we defined the multivariate generalized linear mixed model by specifying a separate GLMM for each response variable and then combining the models by imposing a joint multivariate normal distribution on the subject-specific random effects. The responses on the same subject were assumed to be conditionally independent given the random effects, which allowed estimation procedures for the GLMM to be directly modified for this more general model.

In Chapter 3 we extended three approximate maximum likelihood estimation methods from the univariate to the multivariate generalized linear mixed model. The three methods were Gauss-Hermite quadrature, the Monte Carlo EM algorithm proposed by Booth and Hobert (1999), and the pseudo-likelihood approach proposed by Wolfinger and O'Connell (1993). In addition to parameter estimation, we considered approximations of the standard errors based on numerical derivatives in Gauss-Hermite quadrature, on Louis's approximation to the observed information matrix in the Monte Carlo EM algorithm, and on linear mixed model theory in the pseudo-likelihood method. We used a simulated data example and the two real-life data sets for illustration.

Some findings were as follows. The pseudo-likelihood method led to underestimation of some of the parameters and their standard errors for the Bernoulli response. The Gauss-Hermite and the Monte Carlo EM methods performed well but were computationally intensive. The Monte Carlo standard error estimates showed a high level of variability. In an attempt to counter that effect, we proposed pooling the estimates of the information matrix from the last three iterations of the algorithm. This seemed to alleviate, but did not solve, the problem. In none of the three examples was there evidence of significant efficiency gains from fitting the responses together rather than separately. This issue was further studied with a simulation study in Chapter 5.
Because of the extra-Poisson variability present in the count outcome of the pony
data, we had to apply a special nested EM algorithm to fit the negative binomial distribution for the count response.

In Chapter 4 we considered hypothesis testing in the multivariate GLMM. Under the assumption that the regularity conditions are satisfied, asymptotic likelihood theory can be applied to form hypothesis tests and confidence intervals concerning the parameters of interest. We suggested testing the significance of the fixed effects using approximations to the Wald and to the likelihood ratio statistics, and estimating the random effects using approximations to the conditional mean E(b_i | y_i). We extended a global variance component score test, originally proposed by Lin (1997) for the univariate GLMM, to the multivariate GLMM, and outlined an approach to approximate the score statistics for subsets of the variance components using Gauss-Hermite or Monte Carlo approximations. We also proposed a score test for checking the conditional independence assumption between the response variables, and developed Gaussian quadrature and Monte Carlo approximations for the case of one binary and one continuous response. Because we could use all three statistics (the Wald, the score, and the likelihood ratio) to test for conditional independence, we compared the performance of the approximations for checking the conditional independence assumption in the developmental toxicity example. We also designed a small simulation study and observed that the Gaussian quadrature approximations to the three statistics led to almost identical results, although usually the score statistic had the largest and the likelihood ratio statistic had the smallest value. The Monte Carlo approximation to the likelihood ratio statistic was not entirely consistent with the other approximations and may require a larger simulation sample size.
This model was first considered by Catalano and Ryan (1992), who marginalized it and used GEE methods to obtain estimates. We developed a Monte Carlo EM algorithm for maximum likelihood estimation, which can be regarded as an extension of an approach proposed by Chan and Kuk (1997) for binary data. We applied the method to the developmental toxicity example. Because of the computational inefficiency of the algorithm, we considered a modification based on stochastic approximations (Liao, 1999; Lavielle et al., 1999), which led to a significant decrease in the time for model fitting. To address the issue of advantages of joint over separate analyses of the response variables, we designed a simulation study to investigate possible efficiency gains in a multivariate analysis. A noticeable increase in the estimated standard errors was observed only in the binary response case, for a small number of subjects and observations per subject and for high correlation between the outcomes. We also briefly considered an identifiability issue for one of the variance components.

In conclusion, the proposed models are appropriate for multivariate repeated measures applications when subject-specific inference is of main interest. They allow one to answer intrinsically multivariate questions, such as estimation of the probability of malformation and/or low fetal weight at any given dose of ethylene glycol in the mice example. Multivariate analysis also allows one to maintain the proper significance levels in hypothesis tests concerning several outcome variables. However, efficiency gains concerning individual parameters are noticeable only for small samples and highly correlated outcomes. Therefore, if multivariate inference is not the emphasis of the analysis, separate analyses of the outcome variables may be preferable because of the availability of software for univariate repeated measures, such as SAS PROC NLMIXED.

6.2 Future Research

There are a variety of interesting topics for further research concerning the models proposed in this dissertation.
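Before turning to open problems, the Monte Carlo E-step at the heart of the Chapter 5 algorithm can be made concrete. The sketch below uses importance sampling from the current random-effects prior in a univariate random-intercept probit model; it is a simplified stand-in for the Gibbs-based scheme of Chan and Kuk (1997), not the dissertation's algorithm, and the function and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def mc_estep(y, X, beta, sigma, n_draws=5000, seed=0):
    """One Monte Carlo E-step for a random-intercept probit model
    (illustrative sketch).  Draws b ~ N(0, sigma^2) from the current
    prior; since the proposal equals the prior, the importance weight
    of each draw is proportional to the likelihood f(y | b).
    """
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, sigma, size=n_draws)
    eta = X @ beta                           # fixed-effect part, shape (n_obs,)
    p = norm.cdf(eta[:, None] + b[None, :])  # P(y_j = 1 | b) at each draw
    lik = np.prod(np.where(y[:, None] == 1, p, 1.0 - p), axis=0)
    w = lik / lik.sum()                      # normalized importance weights
    return b, w

# Conditional expectations such as E(b | y) are then weighted averages,
# e.g. float(b @ w); the M-step uses sums of such quantities over subjects.
```

Automated schemes such as that of Booth and Hobert (1999) grow the simulation sample size across iterations, which is one way to control the Monte Carlo error in such an E-step.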
One of them is a comparison of the numerical and stochastic approximations proposed here to the analytical approximations proposed by Lin (1997) for the score statistic for variance components. It is also of interest to develop methods to assess the error in the Gaussian quadrature and Monte Carlo approximations of the test statistics. This way the number of quadrature points and the Monte Carlo sample size can be chosen appropriately to achieve a certain precision. A better approach to dealing with the variability of Monte Carlo standard errors than just pooling several estimates of the information matrix is also needed. The good performance of the stochastic approximation to the observed information matrix proposed by Lavielle et al. (1999) is an indication that, if this idea could be incorporated in the Monte Carlo EM algorithm, more stable standard error estimates could possibly be achieved.

We addressed the issue of efficiency gains of joint over separate fitting of the response variables via a simulation study in Chapter 5. We restricted our attention to the case of one binary and one continuous response and to the particular structure of the developmental toxicity example, so our findings have limited applicability. Hence it is justified to design additional simulation studies to investigate different settings. We also did not address the issue of bias of the maximum likelihood estimates when the model is misspecified. The maximum likelihood estimates are guaranteed to be consistent only under the assumption that the model is chosen correctly. Hence if the responses are fitted separately while the true model requires them to be fitted jointly, the regression parameter estimates may show some bias. It is of special interest to determine whether this happens and, if it does, what the magnitude of the bias is. A variety of other questions are also not yet resolved.
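One crude but workable answer to the precision question raised above is sequential doubling: increase the number of quadrature points (or the Monte Carlo sample size) until two successive approximations agree to the desired tolerance. In the sketch below, `approx_fn` is a hypothetical callable that returns the approximated statistic for a given number of points.

```python
def points_for_precision(approx_fn, tol=1e-4, start=5, max_points=640):
    """Double the number of quadrature points until two successive
    approximations agree within `tol`.  Returns the number of points
    used and the final value; stops at `max_points` if no agreement.
    """
    n = start
    prev = approx_fn(n)
    while n < max_points:
        n *= 2
        cur = approx_fn(n)
        if abs(cur - prev) < tol:
            return n, cur
        prev = cur
    return n, prev  # hit the cap without converging
```

The same loop applies verbatim to a Monte Carlo sample size, with the caveat that for stochastic approximations `tol` must sit above the Monte Carlo noise floor, or the loop may never terminate before the cap.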
More research is needed on improving the computational methods for obtaining maximum likelihood estimates, investigating the performance of the asymptotic tests in small samples, assessing goodness-of-fit, performing variable and model selection, studying the effect of departures from the parametric assumptions on the estimates, and developing methods for residual diagnostics and outlier detection. One of the most important challenges, however, is to develop reliable, fast, and user-friendly software for generalized linear mixed models and their extensions. PROC NLMIXED in SAS is a step in this direction, but further extensions will be needed. Without widely available software, these models remain an interesting new development without much practical applicability.
REFERENCES

Abramowitz, M., & Stegun, I. (1972). Handbook of Mathematical Functions. New York: Dover.

Agresti, A. (1990). Categorical Data Analysis. New York: John Wiley.

Agresti, A. (1997). A model for repeated measurements of a multivariate binary response. Journal of the American Statistical Association, 92, 315-321.

Aitchison, J., & Ho, C. H. (1989). The multivariate Poisson-log normal distribution. Biometrika, 76, 643-653.

Aitchison, J., & Silvey, S. D. (1957). The generalization of probit analysis to the case of multiple responses. Biometrika, 44, 131-140.

Ashford, J. R., & Sowden, R. R. (1970). Multivariate probit analysis. Biometrics, 26, 535-546.

Bera, A. K., & Bilias, Y. (1999). Rao's score, Neyman's C(α) and Silvey's LM tests: An essay on historical developments and some new results. To appear in Journal of Statistical Planning and Inference.

Blackwell, B., & Catalano, P. J. (1999a). Correlated random effects latent variable models for multivariate ordinal repeated measures bioassays. Unpublished manuscript.

Blackwell, B., & Catalano, P. J. (1999b). A random effects latent variable model for ordinal data. Unpublished manuscript.

Bliss, C. I. (1934a). The method of probits. Science, 79, 38-39.

Bliss, C. I. (1934b). The method of probits: A correction. Science, 79, 409-410.

Bliss, C. I. (1935a). The calculation of the dosage-mortality curve. Annals of Applied Biology, 22, 307-333.

Bliss, C. I. (1935b). The comparison of dosage-mortality data. Annals of Applied Biology, 22, 307-333.

Booth, J. G., & Hobert, J. P. (1998). Standard errors of prediction in generalized linear mixed models. Journal of the American Statistical Association, 93, 262-272.
Booth, J. G., & Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society, Series B, 61, 265-285.

Breslow, N. E., & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9-25.

Buhler, W. J., & Puri, P. S. (1966). On optimal asymptotic tests of composite hypotheses with several constraints. Z. Wahrscheinlichkeitstheorie verw., 5, 71-88.

Catalano, P. J. (1994). Bivariate modeling of clustered continuous and ordered categorical outcomes. Statistics in Medicine, 16, 883-900.

Catalano, P. J., & Ryan, L. M. (1992). Bivariate latent variable models for clustered discrete and continuous outcomes. Journal of the American Statistical Association, 87, 651-658.

Chan, J. S. K., & Kuk, A. Y. C. (1997). Maximum likelihood estimation for probit-linear mixed models with correlated random effects. Biometrics, 53, 86-97.

Chant, D. (1974). On asymptotic tests of composite hypotheses in nonstandard conditions. Biometrika, 61, 291-298.

Coull, B. (1997). Subject-specific modelling of capture-recapture experiments. Ph.D. dissertation, Dept. of Statistics, University of Florida, Gainesville.

Cox, D. R. (1972). The analysis of multivariate binary data. Applied Statistics, 21, 113-120.

Cox, D. R., & Wermuth, N. (1992). Response models for mixed binary and quantitative variables. Biometrika, 79, 441-461.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1-22.

Diggle, P. J., Liang, K.-Y., & Zeger, S. L. (1994). Analysis of Longitudinal Data. Oxford: Clarendon Press.

Doornik, J. A. (1998). Object-Oriented Matrix Programming Using Ox Version 2.0. Kent: Timberlake Consultants.

Efron, B., & Hinkley, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information (with discussion). Biometrika, 65, 457-481.

Fahrmeir, L., & Tutz, G. (1994). Multivariate Statistical Modelling Based on Generalized Linear Models. New York: Springer-Verlag.

Finney, D. J. (1964). Probit Analysis. Cambridge: Cambridge University Press.
Fitzmaurice, G. M., & Laird, N. M. (1995). Regression models for a bivariate discrete and continuous outcome with clustering. Journal of the American Statistical Association, 90, 845-852.

Foutz, R. V. (1977). On the unique consistent solution to the likelihood equations. Journal of the American Statistical Association, 72, 147-148.

Gaddum, J. H. (1933). Reports on biological standards. III. Methods of biological assay depending on a quantal response. Spec. Rep. Ser. Med. Res. Coun., London, no. 183.

Galecki, A. T. (1994). General class of covariance structures for two or more repeated factors in longitudinal data analysis. Communications in Statistics, Part A: Theory and Methods, 23, 3105-3119.

Gu, M. G., & Kong, F. H. (1998). A stochastic approximation algorithm with Markov chain Monte Carlo method for incomplete data estimation problems. Proceedings of the National Academy of Sciences, 95, 7270-7274.

Haber, M. (1986). Testing for pairwise independence. Biometrics, 42, 429-435.

Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems (with discussion). Journal of the American Statistical Association, 72, 320-338.

Hedeker, D., & Gibbons, R. D. (1994). A random-effects ordinal regression model for multilevel analysis. Biometrics, 50, 933-944.

Heitjan, D. F., & Sharma, D. (1997). Modelling repeated-series longitudinal data. Statistics in Medicine, 16, 347-355.

Hoadley, B. (1971). Asymptotic properties of maximum likelihood estimators for the independent not identically distributed case. The Annals of Mathematical Statistics, 42, 1977-1991.

Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.

Jennrich, R. I., & Schluchter, M. D. (1986). Unbalanced repeated-measures models with structured covariance matrices. Biometrics, 42, 805-820.

Johnson, N. L., Kotz, S., & Balakrishnan, N. (1994). Continuous Univariate Distributions, Volume 1 (Second Edition). New York: Wiley-Interscience.

Lavielle, M., Delyon, B., & Moulines, E. (1999). Convergence of a stochastic approximation version of the EM algorithm. To appear in The Annals of Statistics.

Lee, L. (1993). Multivariate tobit models in econometrics. Handbook of Statistics, Volume 11: Econometrics, 145-173.
Lee, Y., & Nelder, J. A. (1996). Hierarchical generalized linear models (with discussion). Journal of the Royal Statistical Society, Series B, 58, 619-656.

Lefkopoulou, M., Moore, D., & Ryan, L. (1989). The analysis of multiple correlated binary outcomes: Application to rodent teratology experiments. Journal of the American Statistical Association, 84, 810-815.

Lesaffre, E., & Molenberghs, G. (1991). Multivariate probit analysis: A neglected procedure in medical statistics. Statistics in Medicine, 10, 1391-1403.

Lester, G., Merritt, A., Neuwirth, L., Widenhouse, T., Steible, C., & Rice, B. (1998a). Effect of α2-adrenergic, cholinergic, and nonsteroidal anti-inflammatory drugs on myoelectric activity of ileum, cecum, and right ventral colon and on cecal emptying of radiolabeled markers in clinically normal ponies. American Journal of Veterinary Medicine, 59, 320-327.

Lester, G., Merritt, A., Neuwirth, L., Widenhouse, T., Steible, C., & Rice, B. (1998b). Effect of erythromycin lactobionate on myoelectric activity of ileum, cecum, and right ventral colon, and cecal emptying of radiolabeled markers in clinically normal ponies. American Journal of Veterinary Medicine, 59, 328-335.

Lester, G., Merritt, A., Neuwirth, L., Widenhouse, T., Steible, C., & Rice, B. (1998c). Myoelectric activity of the ileum, cecum, and right ventral colon, and cecal emptying of radiolabeled markers in clinically normal ponies. American Journal of Veterinary Medicine, 59, 313-319.

Liang, K.-Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22.

Liang, K.-Y., & Zeger, S. L. (1989). A class of logistic regression models for multivariate binary time series. Journal of the American Statistical Association, 84, 447-451.

Liao, J. (1999). A simplified and accelerated Monte Carlo EM algorithm with application to a hierarchical mixture model. To appear in Statistica Sinica.

Lin, X. (1997). Variance component testing in generalised linear models with random effects. Biometrika, 84, 309-326.

Lindsey, J. K. (1993). Models for Repeated Measurements. Oxford: Clarendon Press.

Liu, C., Rubin, D. B., & Wu, Y. N. (1998). Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika, 85, 755-770.

Liu, Q., & Pierce, D. A. (1994). A note on Gauss-Hermite quadrature. Biometrika, 81, 624-629.

Long, S. (1997). Regression Models for Categorical and Limited Dependent Variables. London: Sage Publications.
Longford, N. T. (1993). Random Coefficient Models. Oxford: Oxford University Press.

Lundbye-Christensen, S. (1991). A multivariate growth curve model for pregnancy. Biometrics, 47, 637-657.

Matsuyama, Y., & Ohashi, Y. (1997). Mixed models for bivariate response repeated measures data using Gibbs sampling. Statistics in Medicine, 16, 1587-1601.

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (Second Edition). New York: Chapman & Hall.

McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association, 92, 162-170.

McKelvey, R. D., & Zavoina, W. (1975). A statistical model for the analysis of ordinal level dependent variables. The Journal of Mathematical Sociology, 4, 103-120.

Meng, X.-L., & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80, 267-278.

Moran, P. A. P. (1971). Maximum-likelihood estimation in non-standard conditions. Proceedings of the Cambridge Philosophical Society, 70, 441-450.

Natarajan, R., & McCulloch, C. E. (1995). A note on the existence of the posterior distribution for a class of mixed models for binomial responses. Biometrika, 82, 639-643.

Neuhaus, J. M., Hauck, W. W., & Kalbfleisch, J. D. (1992). The effects of mixture distribution misspecification when fitting mixed-effects logistic models. Biometrika, 79, 755-762.

Neyman, J. (1959). Optimal asymptotic test of composite statistical hypothesis. In U. Grenander (Ed.), Probability and Statistics: The Harald Cramer Volume. Uppsala: Almqvist and Wiksell, 213-234.

Neyman, J., & Pearson, E. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, 20, 175-240.

Ochi, Y., & Prentice, R. L. (1984). Likelihood inference in a correlated probit regression model. Biometrika, 71, 531-543.

Olkin, I., & Tate, R. F. (1961). Multivariate correlation models with mixed discrete and continuous variables. Annals of Mathematical Statistics, 32, 448-465.
Pendergast, J. F., Gange, S. J., Newton, M. A., Lindstrom, M. J., Palta, M., & Fisher, M. R. (1996). A survey of methods for analyzing clustered binary response data. International Statistical Review, 64, 89-118.
Pinheiro, J. C., & Bates, D. M. (1995). Approximations to the log-likelihood function in the nonlinear mixed-effects model. Journal of Computational and Graphical Statistics, 4, 12-35.

Price, C. J., Kimmel, C. A., Tyl, R. W., & Marr, M. C. (1985). The developmental toxicity of ethylene glycol in rats and mice. Toxicology and Applied Pharmacology, 81, 825-839.

Rao, C. R. (1947). Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Proceedings of the Cambridge Philosophical Society, 44, 50-57.

Regan, M., & Catalano, P. (1999). Likelihood models for clustered binary and continuous outcomes: Application to developmental toxicology. Unpublished manuscript.

Reinsel, G. (1982). Multivariate repeated-measurement or growth curve models with multivariate random-effects covariance structure. Journal of the American Statistical Association, 77, 190-195.

Reinsel, G. (1984). Estimation and prediction in a multivariate random effects generalized linear model. Journal of the American Statistical Association, 79, 406-414.

Rochon, J. (1996). Analyzing bivariate repeated measures for discrete and continuous outcome variables. Biometrics, 52, 740-750.

Rosner, B. (1992). Multivariate methods for clustered binary data with multiple subclasses, with application to binary longitudinal data. Biometrics, 48, 721-731.

Sammel, M. D., Ryan, L. M., & Legler, J. M. (1997). Latent variable models for mixed discrete and continuous outcomes. Journal of the Royal Statistical Society, Series B, 59, 667-678.

Schall, R. (1991). Estimation in generalized linear models with random effects. Biometrika, 78, 719-727.

Self, S. G., & Liang, K.-Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association, 82, 605-610.

Tanner, M. A. (1991). Tools for Statistical Inference: Observed Data and Data Augmentation Methods. New York: Springer-Verlag.

Ten Have, T. R. (1996). A mixed effects model for multivariate ordinal response data including correlated discrete failure times with ordinal responses. Biometrics, 52, 473-491.

Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica, 26, 24-36.
van Dyk, D. (1999). Nesting EM algorithms for computational efficiency. To appear in Statistica Sinica.

Vonesh, E. F. (1992). Non-linear models for the analysis of longitudinal data (with discussion). Statistics in Medicine, 11, 1929-1954.

Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426-482.

Ware, J. H. (1985). Linear models for the analysis of longitudinal studies. The American Statistician, 39, 95-101.

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1-26.

Wolfinger, R. (1998). Towards practical application of generalized linear mixed models. Proceedings of the 13th International Workshop on Statistical Modelling, Marx, B., and Friedl, H. (eds.), 388-395.

Wolfinger, R., & O'Connell, M. (1993). Generalized linear mixed models: A pseudo-likelihood approach. Journal of Statistical Computation and Simulation, 48, 233-243.

Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11, 95-103.

Zeger, S. L., & Karim, M. R. (1991). Generalized linear models with random effects: A Gibbs sampling approach. Journal of the American Statistical Association, 86, 79-86.

Zeger, S. L., & Liang, K.-Y. (1992). An overview of methods for the analysis of longitudinal data. Statistics in Medicine, 11, 1825-1839.

Zeger, S. L., Liang, K.-Y., & Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach (Corr: V45, p. 347). Biometrics, 44, 1049-1060.

Zeger, S. L., Liang, K.-Y., & Self, S. G. (1985). The analysis of binary longitudinal data with time-independent covariates. Biometrika, 72, 31-38.

Zeger, S. L., & Qaqish, B. (1988). Markov regression models for time series: A quasi-likelihood approach. Biometrics, 44, 1019-1031.

Zellner, A. (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association, 57, 348-368.
Zhao, L. P., Prentice, R. L., & Self, S. G. (1992). Multivariate mean parameter estimation by using a partly exponential model. Journal of the Royal Statistical Society, Series B, 54, 805-811.
BIOGRAPHICAL SKETCH

Ralitza Gueorguieva was born in Sofia on April 25, 1971. She graduated from the Mathematical High School of Sofia and simultaneously obtained a correspondence degree from the English Language High School in Sofia. In 1989 Ralitza was accepted to the University of Sofia and enrolled as a student in computer science. Five years later she graduated with a Master of Science degree in computer science and obtained additional certification as a teacher of mathematics and computer science. She also spent one semester as an exchange student at Slippery Rock University in Pennsylvania in the fall of 1991.

Ralitza was accepted as a graduate student in the Department of Statistics at the University of Florida in the fall of 1994. While at the University of Florida she worked as a teaching and research assistant. She obtained her Master of Statistics degree in August 1996 and then proceeded into the Ph.D. program. After graduating from the University of Florida, Ralitza will spend another year in Gainesville, teaching an undergraduate statistics class and working in the Perinatal Data Systems group.
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Alan Agresti, Chairman
Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

James Booth
Associate Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Ran[name illegible]
Professor of Statistics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

[name illegible]

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Monika Ardelt
Assistant Professor of Sociology
This dissertation was submitted to the Graduate Faculty of the Department of Statistics in the College of Liberal Arts and Sciences and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

December 1999

Dean, Graduate School

