On model fitting for multivariate polytomous response data


Material Information

Title:
On model fitting for multivariate polytomous response data
Physical Description:
vii, 200 leaves ; 29 cm.
Language:
English
Creator:
Lang, Joseph B., 1963-
Publication Date:

Subjects

Genre:
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1992.
Bibliography:
Includes bibliographical references (leaves 193-199).
Statement of Responsibility:
by Joseph B. Lang.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001753119
notis - AJG6082
oclc - 26576215
System ID:
AA00003698:00001

Full Text









ON MODEL FITTING FOR MULTIVARIATE POLYTOMOUS
RESPONSE DATA










By

JOSEPH B. LANG


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

1992















ACKNOWLEDGMENTS


I would like to express my appreciation to Dr. Alan Agresti for serving

as my dissertation advisor. For the many comments, ideas, and lessons he

has shared with me, I am greatly indebted. Through his advisement and

guidance, he has taught me to appreciate and respect good statistical research

and teaching. He is a mentor worthy of emulation. I also want to express

my gratitude to Dr. Jane Pendergast, who also served on my dissertation

committee. I learned a great deal from her during the two years that I worked

in the Biostatistics Department. To all of the faculty at the University of

Florida, I extend my thanks. The statistics department, with its scholarly

and friendly atmosphere, proved to be a wonderful place to learn.

The influences of persons from my past are not forgotten. Without

Patrick Kearin's stimulating teaching of high school math, I may never have

become interested in this subject. The genuine excitement delivered by Dr.

James Kepner, in his teaching of undergraduate statistics, was the reason I

decided to pursue an advanced degree in statistics.

I would like to thank my parents and the rest of my family for all of the

support and encouragement they have given over the course of my studies

and research. My friends and student colleagues deserve many thanks as

well. Finally, I would like to thank Kendra Paar for always being there to

support and encourage me while I was writing this paper.













TABLE OF CONTENTS


page
ACKNOWLEDGMENTS ............................................ ii

LIST OF TABLES ................................................ v

ABSTRACT ..................................................... vi

CHAPTERS

1 INTRODUCTION ............................................. 1

1.1 A Brief Introduction to the Problem...................... 1
1.2 Outline of Existing Methodologies-No Missing Data ...... 3
1.3 Outline of Existing Methodologies-Missing Data.......... 12
1.4 Format of Dissertation ...................................... 14

2 RESTRICTED MAXIMUM LIKELIHOOD FOR A
GENERAL CLASS OF MODELS FOR
POLYTOMOUS RESPONSE DATA .................... 17

2.1 Introduction ........................................ .. ..... 17
2.2 Parametric Modeling-An Overview....................... 24
2.2.1 Model Specification .................................. 25
2.2.2 Measuring Model Goodness of Fit ................... 33
2.3 Multivariate Polytomous Response Model Fitting .......... 43
2.3.1 A General Multinomial Response Model.............. 44
2.3.2 Maximum Likelihood Estimation .................... 48
2.3.3 Asymptotic Distribution of Product-Multinomial
ML Estimator ......................................... 56
2.3.4 Lagrange's Method-The Algorithm ................ 60
2.4 Comparison of Product-Multinomial and
Product-Poisson Estimators ........................... 67
2.5 Miscellaneous Results ....................................... 78
2.6 Discussion ................................................... 83










3 SIMULTANEOUSLY MODELING THE JOINT AND
MARGINAL DISTRIBUTIONS OF MULTIVARIATE
POLYTOMOUS RESPONSE VECTORS .................. 87

3.1 Introduction............................................... 87
3.2 Product-Multinomial Sampling Model..................... 88
3.3 Joint and Marginal Models................................. 93
3.4 Numerical Examples ....................................... 98
3.5 Product-Multinomial Versus Product-Poisson
Estimators: An Example .......................... 111
3.6 Well-Defined Models and the Computation of
Residual Degrees of Freedom ......................... 121
3.7 Discussion .............. .................................... 132

4 LOGLINEAR MODEL FITTING WITH
INCOMPLETE DATA...................................... 135

4.1 Introduction .............. ................................... 135
4.2 Review of the EM Algorithm................................137
4.2.1 General Results .................. ....................138
4.2.2 Exponential Family Results ........................... 140
4.3 Loglinear Model Fitting with Incomplete Data............. 144
4.3.1 The EM Algorithm for Poisson Loglinear Models..... 145
4.3.2 Obtaining the Observed Information Matrix ..........148
4.3.3 Inferences for Multinomial Loglinear Models ..........152
4.4 Latent Class Model Fitting-An Application .............. 160
4.5 Modified EM/Newton-Raphson Algorithm................. 166
4.6 Discussion .................................................. 170

APPENDICES

A CALCULATIONS FOR CHAPTER 2.........................172

B CALCULATIONS FOR CHAPTER 4.........................176

BIBLIOGRAPHY ................... ..........................193

BIOGRAPHICAL SKETCH ........................................ 200













LIST OF TABLES


page
2.1 Opinion Poll Data Configuration................................. 22

3.1 Interest in Political Campaigns ................................... 91

3.2 Cross-Over Data ............................................... 92

3.3 Joint Distribution Models-Goodness of Fit..................... 100

3.4 Marginal Distribution Models-Goodness of Fit............... 101

3.5 Candidate Models in J(L × L + D) ∩ M(U)-Goodness of Fit ... 102

3.6 Estimates of Freedom Parameters for
Model J(L × L + D) ∩ M(CU) ................................... 103

3.7 Freedom Parameter Estimates and Standard Errors.............. 105

3.8 Estimated Cell Means and Standard Errors ................. 106

3.9 Cross-Over Data Models-Goodness of Fit....................... 110

3.10 Freedom Parameter ML Estimates for Model J(UA) ∩ M(U) .... 110

3.11 Children's Respiratory Illness Data........................... 112

3.12 Product-Multinomial versus Product-Poisson Freedom
Parameter Estimation ........................................ 117

4.1 Observed cross-classification of 216 respondents
with respect to whether they tend toward
universalistic (1) or particularistic (2) values
in four situations (A,B,C,D) of role conflict ................. 162

4.2 Parameter and Standard Error Estimates ....................... 164

4.3 Classification Probability Estimates ..................... ....... 165








Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

ON MODEL FITTING FOR MULTIVARIATE POLYTOMOUS
RESPONSE DATA

By

Joseph B. Lang

May, 1992

Chairman: Dr. Alan Agresti
Major Department: Statistics

A broad class of models that imply structure on both the joint

and marginal distributions of multivariate categorical (ordinal or nominal)

responses is introduced. These parsimonious models can be used to

simultaneously describe the marginal distributions of the responses and the

association structure among the responses. As a special case, this class

of models includes classical log- and logit-linear models. In this sense,

we address model fitting for multivariate polytomous response data from

a very general perspective. Simultaneous models for joint and marginal

distributions are useful in a variety of applications, including longitudinal

studies and studies dealing with social mobility and inter-rater agreement.

We outline a maximum likelihood fitting algorithm that can be used for

fitting a large class of models that includes the class of simultaneous models.

The algorithm uses Lagrange's method of undetermined multipliers and a

modified Newton-Raphson iterative scheme. We also discuss goodness-of-fit

tests and model-based inferences. Inferences for certain model parameters

are shown to be equivalent for product-Poisson and product-multinomial








sampling assumptions. This useful equivalence result generalizes existing

results. The models and fitting method are illustrated for several applications.

Missing data are often a problem for multivariate response data. We

consider inferences about loglinear models for which only certain disjoint

sums of the data are observable. We derive an explicit formula for the

observed information matrix associated with the loglinear parameters that

is intuitively appealing and simple to evaluate. The observed information

matrix can be evaluated at the maximum likelihood estimates and inverted

to obtain an estimate of the precision of the loglinear parameter estimates.

The EM-algorithm can be used to fit these incomplete data loglinear models.

We describe this algorithm in some detail, paying special attention to the

Poisson loglinear model fitting case. Alternative fitting algorithms are also

outlined. One proposed alternative uses both the EM and Newton-Raphson

algorithms, thereby resulting in a faster, more stable algorithm. We illustrate

the utility of these results using latent class model fitting.












CHAPTER 1
INTRODUCTION


1.1 A Brief Introduction to the Problem

There are many situations when multiple responses are observed for each

'subject' in a group, or several groups. Here 'subject' is generically used to

refer to a randomly chosen object that generates responses. The multiple

responses could represent repeated measurements taken on subjects over time

or occasions. They could be the ratings assigned by several judges that all

viewed and rated the same set of slides (here, the 'subjects' are the slides).

Or, perhaps, it may be that several distinct or noncommensurate responses

are recorded for each subject. These responses are often categorical-ordinal

or nominal-and inevitably interrelated. This dissertation addresses issues

related to modeling and model fitting for multivariate categorical (ordinal or

nominal) responses.

Models for multivariate categorical response data are usually developed

to answer questions about (i) the association structure among the multiple

responses or (ii) the behavior of the marginal distributions of the response

variables. Specifically, a typical question of the first type is, "How are the

responses interrelated and is this interrelationship the same across the levels

of the covariates?" A typical type ii question is, "How do the (marginal)

responses depend on the covariates or occasions?" Historically, many models

(e.g. log- and logit-linear models) have been developed for the primary








purpose of answering the type i questions. Many of these models can easily

be fitted using maximum likelihood (ML) methods. These models typically,

however, are not useful for answering the type ii questions (Cox, 1972).

Marginal models-those models used to answer type ii questions-are not

as well developed. One reason for this is that ML fitting of these marginal

models is more difficult. At present, the method of weighted least squares

(WLS) is used almost exclusively for fitting these models.

Suppose that we are interested in answering questions of both types

i and ii. Usually the questions are addressed using two different models, a

joint distribution model and a marginal model, and fitting them separately. It

seems reasonable to want a model that can be used to address simultaneously

both questions. That is, we would like a model that simultaneously implies

structure on both the joint and marginal distribution parameters. To date,

there has been very little work done on the development and fitting of these

simultaneous models.

Whenever multiple responses are observed it is inevitable that there will

be missing data. There are several ways to fit the Poisson loglinear model with

incomplete data. One popular method is to use the EM algorithm to find the

ML estimates of the loglinear parameters. One drawback to this algorithm

is that a precision estimate of the ML estimators is not produced as a by-

product. Several numerical techniques have been developed to approximate

the observed information matrix, which, upon inversion, will act as the

precision estimate. However, it would be of some convenience to derive an

explicit formula for the observed information matrix, at least in some special

cases.







1.2 Outline of Existing Methodologies-No Missing Data



We begin our discussion by considering the case of no missing data.

There are many methods for analyzing multivariate categorical (ordinal or

nominal) response data. These methods usually involve fitting (separately)

models for the joint or the marginal distributions of the response vectors.

In rare instances, simultaneous models for both the joint and marginal

distributions are considered. Maximum likelihood fitting methods for the

joint distribution models are simple and described in almost every standard

text on categorical data analysis. The fitting of marginal models using

ML methods is more difficult. Maximum likelihood fitting of the marginal

homogeneity model was considered by Madansky (1963) and Lipsitz (1988).

The fitting of a more general class of marginal models was considered

by Haber (1985a). Finally, the fitting of simultaneous models using ML

methods has only been addressed in the bivariate response case. The fitting

technique becomes very complicated when there are more than two categorical

responses. To appreciate the complexity of extending the technique to

multivariate response data, see section 6.5 of McCullagh and Nelder (1989)

or perhaps Dale (1986). In contrast, the ML fitting method of Chapter 2 can

easily be used to fit many marginal and simultaneous models. In the next few

paragraphs, we briefly describe the existing methods for modeling and model

fitting for multivariate categorical response data.

Modeling Joint Distributions Separately. One common method for analyzing

multivariate categorical responses is to model the joint distribution only.

These models, which include classical log- and logit-linear models for the







joint probabilities, are useful for describing the association structure among

the responses. The last 30 years have seen the development of these methods

for analyzing multivariate categorical responses (Haberman, 1979; Bishop et

al., 1975; Agresti, 1984, 1990). For specificity, consider the following panel

study: One hundred randomly selected subjects were asked how interested

they were in the political campaigns. They were to respond on the 3 point

ordinal scale, (1) Not Much, (2) Somewhat, and (3) Very Much. Then four

years later the same group of subjects was asked to respond on the same

scale to the same question. A separate investigation into the association

structure would enable us to answer questions of a conditional nature. For

example, we could estimate the probability of responding 'Very Much' on the

second occasion given that the response at the first occasion was 'Not Much'.

The description of these 'transitional' probabilities, although very interesting,

may not be completely satisfactory. We may also be interested in addressing

questions with regard to the marginal distributions. Perhaps we would like

to answer the question, "Are the distributions of responses to the political

interest question the same for each occasion?" Laird (1991), in a nice review of

likelihood-based methods for longitudinal analysis, mentions that the utility

of classical log- and logit-linear models is restricted to two situations: (1)

modeling the dependence of a univariate response on a set of covariates and

(2) modeling the association structure between a set of multivariate responses.

These models place structure on the joint probabilities and so they are not

directly useful for studying the dependence of the marginal probabilities on

occasion and other covariates. This problem was pointed out by several

authors (Cox, 1972; Prentice, 1988; McCullagh and Nelder, 1989;







Liang et al., 1991). An advantage of these models is that they are simple to fit

using either WLS (Grizzle et al., 1969), ML (McCullagh and Nelder, 1989),

or iterative proportional fitting (Bishop et al., 1975) methods. There are

many standard statistical programs available for fitting these models (SAS,

SPSS, BMDP, GLIM, GENSTAT).

Modeling Marginal Distributions Separately. A second approach to

analyzing multivariate categorical responses is to model only the marginal

distributions and to ignore the joint distribution structure. Full likelihood

methods that consider only models for the marginal probabilities tacitly

assume a saturated model for the joint distribution. Therefore, the models

may be far from parsimonious. In the non-Gaussian response setting, there

is a distinction between these marginal models and the transitional (or

conditional) models of the previous paragraph. Marginal models describe the

occasion-specific distributions and the dependence of those distributions on

the covariates. Transitional or conditional models describe the distribution

of individual changes over occasions. Models for these transitions can be

represented as probability distributions for the future state 'given' the past

states. Questions regarding transition probabilities can only be investigated

with longitudinal data. On the other hand, questions regarding the marginal

probabilities could theoretically be answered using cross-sectional data,

provided the cohort (subject) effects were negligible. Panel studies resulting

in longitudinal data result in more powerful tests for significance of within

cluster factors, such as occasion effect. This follows because there is a reduced

cohort effect; we are using the same panel of subjects at each occasion. For







further discussion about the distinction between marginal and transitional

models, see Ware et al. (1988), Laird (1991), and Zeger (1988).

We will briefly discuss existing methods for making inferences about

the marginal probabilities separately. We will group these methods into 5

categories: (1) nonmodel-based methods, (2) WLS methods, (3) ML methods,

(4) semi-parametric methods, and (5) other methods.

Nonmodel-based methods can be used to derive test statistics used for

testing specific hypotheses regarding the marginal distributions. Examples

include the Cochran-Mantel-Haenszel (1950, 1959) statistic which can be used

for testing the hypothesis of marginal homogeneity (MH) (cf. White et al.,

1982), McNemar's (1947) statistic which can be used for testing the equality of

two dependent proportions, and Madansky's (1963) likelihood-ratio statistic

for MH. Madansky's statistic is a difference in fit of the model of marginal

homogeneity to the fit of the unstructured (saturated) model (see also Lipsitz,

1988 and Lipsitz et al., 1990). Many other relevant test statistics, some of

which are generalizations or modifications of the aforementioned (cf. Mantel,

1963; White et al., 1982), exist. Cochran's (1950) Q statistic and Darroch's

(1981) Wald-type statistic are examples of other test statistics that can be

used to test for marginal homogeneity.
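
As a concrete illustration of the simplest of these tests, the sketch below
(in Python, with hypothetical counts) computes McNemar's statistic for a
2 × 2 table of paired binary responses; only the discordant counts n12 and
n21 enter, and the statistic is referred to a chi-squared distribution with
one degree of freedom.

    import numpy as np
    from scipy.stats import chi2

    def mcnemar(table):
        """McNemar's test for equality of two dependent proportions.

        table : 2x2 array of paired counts, rows = occasion 1, cols = occasion 2.
        Only the discordant counts carry information about marginal homogeneity.
        """
        n12, n21 = table[0, 1], table[1, 0]
        stat = (n12 - n21) ** 2 / (n12 + n21)   # chi-squared with 1 df
        return stat, chi2.sf(stat, df=1)

    # Hypothetical paired data: the same subjects classified on two occasions.
    y = np.array([[40, 25],
                  [10, 25]])
    print(mcnemar(y))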

Presently, if one were to fit a marginal model, say a generalized loglinear

model of the form C log(Aμ) = Xβ, where μ is the vector of expected counts

in the full contingency table, he or she would most likely use the WLS fitting

algorithm. Most statistical software that fits these generalized loglinear

models does so using WLS. There are some advantages to using WLS. It

is computationally simple. Second-order marginal information is all that is







needed. And, the estimates are asymptotically equivalent to ML estimates.

Some disadvantages are that covariates must be categorical, sampling zeroes

create problems, and estimates are sensitive when second-order marginal

counts are small. The WLS method for analyzing categorical data was

originally outlined by Grizzle, Starmer and Koch (1969). Subsequently,

marginal models for longitudinal categorical data, or more generally mul-

tivariate categorical response data, have been introduced and fitted using the

WLS method (Koch et al., 1977; Landis and Koch, 1979; Landis et al., 1988;

Agresti, 1989).
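
To fix ideas, the following is a minimal sketch (not the exact algorithm of
any of the papers cited above) of the basic WLS computation for a model
C log(Aμ) = Xβ under multinomial sampling: evaluate the response function
at the sample proportions, obtain its asymptotic covariance by the delta
method, and solve a weighted least squares problem. The inputs y, C, A,
and X are assumed given.

    import numpy as np

    def wls_fit(y, C, A, X):
        """One-step WLS (GSK-style) estimate for the model C log(A mu) = X beta."""
        n = y.sum()
        p = y / n                                  # sample proportions
        F = C @ np.log(A @ p)                      # observed response functions
        G = C @ np.diag(1.0 / (A @ p)) @ A         # delta-method gradient of F
        covp = (np.diag(p) - np.outer(p, p)) / n   # multinomial covariance of p
        V = G @ covp @ G.T                         # asymptotic covariance of F
        Vinv = np.linalg.inv(V)
        beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ F)
        cov_beta = np.linalg.inv(X.T @ Vinv @ X)   # estimated covariance of beta
        return beta, cov_beta

The sampling-zero problem mentioned above is visible here: a zero marginal
count makes the logarithm and the weight matrix degenerate.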

Maximum likelihood fitting of marginal models is more difficult since

the model utilizes marginal probabilities, rather than joint probabilities to

which the likelihood refers. When the responses are correlated, as they

invariably are, the marginal counts do not follow a product-multinomial

distribution. The full-table likelihood must be maximized subject to the

constraint that the marginal probabilities satisfy the model. Haber (1985a)

considers fitting generalized loglinear models of the form C log(Aμ) = Xβ using

Lagrange multipliers and an unmodified Newton-Raphson iterative scheme.

The algorithm becomes very difficult to implement for even moderately large

tables. This is primarily due to the difficulty of inverting the large Hessian

matrix of the Lagrangian objective function. In this dissertation we consider a

modified Newton-Raphson that uses a much simpler matrix than the Hessian.

The matrix is easily inverted even for relatively large tables. Haber (1985b)

considers the estimation of the parameters β in the special case C log μ = Xβ.

We will use a modification of the method of Aitchison and Silvey (1958, 1960)

and Silvey (1959) to investigate the asymptotic behavior of the estimators of







β in the more general model C log(Aμ) = Xβ, thereby extending the work of

Haber (1985b). Another relevant paper, Haber and Brown (1986), considers

ML fitting of a model for the expected counts μ that has loglinear and

linear constraints. One can test hypotheses about the marginal probabilities

by comparing the fit of relevant models. Haber (1985a, 1985b) and Haber

and Brown (1986) only consider fitting the marginal models separately. No

attempt has been made to simultaneously model the joint and marginal

distributions.

Semi-parametric methods such as quasi-likelihood (Wedderburn, 1974)

and a multivariate extension, generalized estimating equations (GEE), have

become popular in recent years. The work of Liang and Zeger (1986), which

advocated the use of these GEEs, has been extended to cover the multivariate

categorical response data setting (Prentice, 1988; Zhao and Prentice, 1991;

Stram et al., 1988; Liang et al., 1991). With these semi-parametric methods,

the likelihood is not completely specified. Instead, generalized estimating

equations are chosen so that, when the marginal model holds, even if the

association among the multiple responses is misspecified, the estimators are

consistent and asymptotically normally distributed. These estimators, used

in conjunction with a robust estimator of their covariance (Liang and Zeger,

1986; Zeger and Liang, 1986; White, 1980, 1981, 1982; Royall, 1986), result

in consistent inference about the effects of interest. When the responses are

truly independent, the estimating equations with correlation matrix taken to

be the identity matrix, are equivalent to the likelihood equations. The GEE

approach requires the specification of a 'working' association or correlation

matrix. Examples of working associations include those that imply all







pairwise associations (measured in terms of odds ratios) are the same and

that the higher order associations are negligible (Liang et al., 1991).

A related approach is known as GEE2. The consistency of these

estimators follows only if both the marginal model and the pairwise association

model are correctly specified. This approach is a second order extension

of the GEEs of Liang and Zeger (1986) which are now termed GEE1. It

is second order because the estimation of the marginal model parameters

and the pairwise association model parameters is considered simultaneously.

The focus of both approaches, GEE1 and GEE2, is usually on modeling

the marginal distributions-investigating how the marginal distributions

depend on occasion and covariates. The association is considered a nuisance.

Presently, there are no tests for goodness-of-fit of these models and so the

investigation into how well both models fit can be done only at an empirical

level. The assumption that higher order effects are negligible may not be

tenable. Testing procedures to assess the validity of these assumptions have

yet to be developed. Also, in contrast to WLS and ML methods, which

require only that the missing data be 'missing at random' (MAR), the semi-

parametric approaches require the missing data to be 'missing completely

at random' (MCAR). The assumption that the missing data mechanism is

MCAR is a much stronger assumption than MAR (Little and Rubin, 1986).

Finally, there are many other approaches to analyzing the marginal

probability structure separately. There are random effects models, whereby

subject-specific random effects induce a correlation structure on the multiple

responses. The marginal approach-the full likelihood is obtained by

averaging across the random effects-is computationally difficult (Stiratelli






et al., 1984). An alternative is to condition on the sufficient statistics

for the subject effects and consider finding the estimates by maximizing

the conditional likelihood. For further details on these conditional and

unconditional methods see Rasch, 1961; Tjur, 1982; Agresti, 1991; Stiratelli

et al., 1984; Conaway, 1989, 1990. As yet another alternative, Koch et al.

(1980) give a bibliography for relevant nonparametric methods for analyzing

repeated measures data. Agresti and Pendergast (1986) consider replacing

the actual observations by their within cluster rank and testing for marginal

homogeneity using the ordinary ANOVA statistic for repeated measures data.

A three-stage estimator for repeated measures studies with possibly missing

binary responses has been developed by Lipsitz et al. (1992). This approach

is very similar to a generalized least squares approach, but it has some of

the nice features of the GEE approaches. One of these nice features is that

the estimators and their variance estimates are consistent under very mild

assumptions. An extension of this method to the polytomous response case

has yet to be developed.

Simultaneous Investigation of Joint and Marginal Distributions. There

has been very little work done to investigate simultaneously the joint and

marginal distribution structure. In some ways GEE2 is an attempt to

describe both distributions. However, only the pairwise (not the joint)

association structure is modeled; the higher-order associations are considered

a nuisance. Tests comparing nested models have not been developed in this

semi-parametric setting. Full likelihood approaches have been addressed

by Dale (1986), McCullagh and Nelder (1989, Chapt. 6), and Becker and

Balagtas (1991). Dale models the joint distributions of bivariate ordered







categorical responses by assuming that the log global odds ratios follow a

linear model. The marginal probabilities are assumed to follow a cumulative

logit model. McCullagh and Nelder consider simultaneously modeling the

joint and marginal probabilities of a bivariate dichotomous response (two

distinct responses) by assuming that the log odds-ratios follow a linear

model and that the marginal probabilities follow a logit-linear model. Their

example included age as a categorical covariate. Finally, Becker and Balagtas

consider models for two-period cross-over data. The bivariate dichotomous

response was the response to the two different treatments. Order of treatment

application was considered a covariate. They assumed that the two log odds

ratios followed a linear model and that the marginal probabilities satisfied a

loglinear model. Because it is the marginal probabilities and not the joint

probabilities that satisfy a loglinear model, Becker and Balagtas refer to the

model as log nonlinear.

The ML model fitting approach used by each of these authors involves

a reparameterization of the likelihood, which is a function of the joint

probabilities, in terms of the joint and marginal model parameters. The

reparameterization in the bivariate response case-the case each author

considered-is somewhat complicated especially for multi-level responses. To

make matters worse, the extension of this method to general multivariate

polytomous responses looks to be extremely difficult. If the

reparameterizations are made so that the full likelihood is expressible in terms of the

joint and marginal model parameters, the likelihood can be maximized using

a Newton-Raphson-type algorithm. Basically, one must solve for the root of

some nonlinear score equation. This maximization approach is very sensitive







to the starting value in that convergence to a local maximum is not likely

unless the starting estimate is very close to the actual maximum. Finding

reasonable starting values is not a simple task. Dale (1986) outlines a method,

specifically for the models considered in that paper, for finding a starting

estimate.

In this dissertation, we outline an ML fitting method that can easily be

used to fit a large class of simultaneous models, including those considered

by Dale, McCullagh and Nelder, and Becker and Balagtas. The approach

involves using Lagrange's method of undetermined multipliers along with a

modified Newton-Raphson iterative scheme. For all of the models considered,

an initial estimate for the algorithm is the data counts themselves along with

a vector of zeroes corresponding to a first guess at the values of the Lagrange

multipliers. The convergence of the algorithm is quite stable. The extension

to multivariate polytomous response data is straightforward.


1.3 Outline of Existing Methodologies-Missing Data

Missing data is often an issue when the response is multivariate in nature.

Missing data can also occur in more hypothetical situations. Examples

include loglinear latent class models (Goodman, 1974; Haberman, 1988)

and linear mixed or random effects models (Laird et al., 1987). In latent

class analyses, a latent variable, which is unobservable, is assumed to exist.

Mixed or random effects models posit the existence of some unobservable

random variables that affect the mean response. In this brief outline, we will

consider ML methods for model fitting when the data are not completely

observable. Little and Rubin (1986) provide a nice summary of methods






for model fitting with incomplete data. There are many ways to find the

maximum likelihood estimators when the data are not completely observable,

each method having its positive and negative features. We could work directly

with the incomplete-data likelihood, which is usually complicated relative to

the complete-data likelihood, and use a Newton-Raphson or Fisher-scoring

algorithm. Palmgren and Ekholm (1987) and Haberman (1988) use these

methods to obtain maximum likelihood estimates and their standard errors.

Alternatively, we could avoid the complicated likelihood altogether and use

the Expectation-Maximization algorithm (Dempster et al., 1977). Sundberg

(1976) discusses the properties of the EM algorithm when it is used to

fit models to data coming from the regular exponential family. The EM

algorithm is one of the more flexible ML fitting algorithms for missing data

situations. We will primarily focus on this method for fitting loglinear models

with incomplete data.
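
To make the E-step concrete for the loglinear case: when independent Poisson
cell counts are observable only through disjoint sums, the conditional
expectation of a cell count given its group total allocates that total
proportionally to the current fitted means. Below is a minimal sketch of one
such EM iteration for log μ = Xβ; the data and design are hypothetical, and
the M-step takes a single Newton step on the complete-data likelihood (a
generalized EM variant), not the exact algorithm developed in Chapter 4.

    import numpy as np

    def em_step(beta, X, sums, groups):
        """One EM iteration for a Poisson loglinear model log(mu) = X beta
        when only disjoint sums of the cell counts are observed."""
        mu = np.exp(X @ beta)
        # E-step: E[y_j | group total] = total * mu_j / (sum of mu in the group)
        y_hat = np.empty_like(mu)
        for g, idx in enumerate(groups):
            y_hat[idx] = sums[g] * mu[idx] / mu[idx].sum()
        # M-step: one Newton/Fisher-scoring step on the complete-data likelihood
        score = X.T @ (y_hat - mu)
        info = X.T @ (mu[:, None] * X)
        return beta + np.linalg.solve(info, score)

    # Hypothetical example: four cells, the first two observed only as a sum.
    X = np.array([[1., 0.], [1., 1.], [1., 0.], [1., 1.]])
    groups = [np.array([0, 1]), np.array([2]), np.array([3])]
    sums = np.array([30., 20., 35.])
    beta = np.array([np.log(sums.sum() / 4), 0.0])   # crude starting value
    for _ in range(50):
        beta = em_step(beta, X, sums, groups)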

Although the EM algorithm is easily implemented to fit loglinear models

with incomplete data, the algorithm does not provide an estimate of precision

of the model parameter estimators. Meng and Rubin (1991) outline a

supplemental EM (SEM) algorithm, whereby, upon convergence of the EM

algorithm, the variance matrix for the model estimators is adjusted to account

for missing data. The adjustment is a function of the rate of convergence of

the EM algorithm, which in turn is a function of how much information

is missing. Meng and Rubin numerically estimate the rate of convergence,

thereby obtaining an estimate of precision that reflects missingness. Although

this approach should prove to be applicable in the general situation, it still

is desirable to derive an explicit formula for the variance matrix that reflects







missingness. Other authors (Meilijson, 1989; Louis, 1982) have discussed

methods for estimating precision of model estimators when the data are

incomplete and the EM algorithm is used. Meilijson's method involves

EM-aided differentiation, which is essentially a numerical differentiation of the

score vector. The method relies on the assumption that the observed data

components are i.i.d. (independently and identically distributed). Louis

gives an analytic formula for the observed information matrix based on the

incomplete data. The computation of the observed information matrix based

on this formula is not straightforward and must be considered separately for

each special application.


1.4 Format of Dissertation

In Chapter 2, we develop a maximum likelihood method for fitting a large

class of models for multivariate categorical response data. This development

follows a general discussion about parametric modeling. Concepts such as

degrees of freedom and model distances (or goodness of fit) are described at

an intuitive level. We also describe and compare the asymptotic distributions

of freedom parameter estimators under product-multinomial and product-

Poisson sampling assumptions. Chapter 3 has more of an applied flavor.

We consider simultaneously modeling the joint and marginal distributions

of multivariate categorical response vectors. A broad class of simultaneous

models is introduced. The models can be fitted using the techniques of

Chapter 2. Several numerical examples are considered. Chapter 4 outlines the

ML fitting technique known as the EM algorithm. This algorithm is used to

fit models with incomplete data. Some advantages and disadvantages of using







the EM algorithm are addressed. The most important disadvantage is that

the algorithm does not provide, as a by-product, a precision estimate of the

ML estimators. We derive an explicit formula for the observed information

matrix for the Poisson loglinear model parameters when only disjoint sums of

the complete data are observable. An application to latent class modeling is

considered. We also propose an ML fitting algorithm that uses both EM and

Newton-Raphson steps. The modified algorithm should prove to have many

positive features.

In this dissertation, we do not distinguish typographically between

scalars, vectors, and matrices. Parameters and variables are treated as

objects, their dimensions either being explicitly stated or implied contextually.

By convention, functions that map scalars into scalars, when applied to

vectors, will be defined componentwise. For example, if μ represents an n × 1

vector, then

log μ = (log μ1, log μ2, ..., log μn)'.

We frequently use abbreviations that are common in the statistical

literature. They include ML (Maximum Likelihood), WLS (Weighted

Least Squares), IWLS (Iterative (Re)Weighted Least Squares), and EM

(Expectation-Maximization).

The range (or column) space of an n × p matrix X is denoted by M(X)

and is defined as {μ : μ = Xβ, β ∈ R^p}. The symbols ⊗ and ⊕ are the

binary operators 'direct product' and 'direct sum'. The direct (or Kronecker)

product is taken to be the right-hand product. That is,


A ⊗ B = {A b_ij}.






The direct sum, C, of two matrices A and B is defined as

    C = A ⊕ B = ( A   0 )
                ( 0   B ).

The symbol D(μ) represents a diagonal matrix with the elements of μ on the

diagonal. That is,

           ( μ1   0   ...   0  )
    D(μ) = ( 0    μ2  ...   0  )
           ( ..................)
           ( 0    0   ...   μn )
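
For readers who wish to experiment with this notation, the operators are
available in standard numerical libraries; a small Python illustration with
arbitrary matrices:

    import numpy as np
    from scipy.linalg import block_diag

    A = np.array([[1, 2],
                  [3, 4]])
    B = np.array([[5, 6]])
    mu = np.array([0.5, 1.0, 2.0])

    direct_product = np.kron(B, A)   # right-hand product: (i,j) block is A*b_ij
    direct_sum = block_diag(A, B)    # A and B stacked block-diagonally
    D_mu = np.diag(mu)               # D(mu): diagonal matrix built from mu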
In Chapter 4, we make use of the bracket notation often used by

statistical and mathematical programming languages (e.g. Splus, Matlab).

To illustrate the notation, consider a matrix A. The (sub)matrix A[, -2] is

then the matrix A with the second column deleted. Similarly, the matrix A[-3,]

is the matrix A with the third row deleted.
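
In a language without this bracket convention the same submatrices are
formed by deleting rows or columns explicitly; a numpy equivalent of the two
examples above (purely illustrative):

    import numpy as np

    A = np.arange(12).reshape(3, 4)
    A_no_col2 = np.delete(A, 1, axis=1)   # A[, -2]: second column deleted
    A_no_row3 = np.delete(A, 2, axis=0)   # A[-3, ]: third row deleted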

Equation numbering is consecutive within sections of a chapter, the

first number representing the chapter in which it appears. For example, the

thirteenth equation in section 2.3 is equation (2.3.13). Within each appendix,

the equations are numbered consecutively. For example, the third equation

in Appendix B is numbered (B.3). Tables are numbered consecutively within

chapters so that, for instance, Table 3.2 represents the second table within

Chapter 3. Theorems, lemmas, and corollaries are numbered independently

of each other. All are numbered consecutively within sections. Therefore,

Corollary 3.2.2 is the second corollary within section 3.2 and Theorem 2.3.1

is the first theorem within section 2.3.













CHAPTER 2
RESTRICTED MAXIMUM LIKELIHOOD FOR A GENERAL
CLASS OF MODELS FOR POLYTOMOUS RESPONSE DATA


2.1 Introduction


In this chapter, we consider using maximum likelihood methods to fit a

general class of parametric models for univariate or multivariate polytomous

response data. The models will be specified in terms of freedom equations

and/or constraint equations. These two ways of specifying models will be

discussed at length in section 2.2. The model specification equations may be

linear or nonlinear in the model parameters. Specifically, if μ represents the

s × 1 vector of expected cell means, the linear constraints will be of the form

Lμ = d and the nonlinear constraints will be of the form U'C log(Aμ) =

0. The freedom equations will have form C log(Aμ) = Xβ, where the

components of the vector β are referred to as the freedom parameters. In

Chapter 3 of this dissertation, we discuss more specifically models that can

be specified in terms of these constraint and freedom equations. The models

of that chapter allow one to simultaneously model the joint and marginal

distributions of multivariate polytomous response vectors.

The maximum likelihood, model fitting algorithm of this chapter utilizes

Lagrange multipliers and a modified Newton-Raphson iterative scheme. In

particular, the models will be specified in terms of constraint equations and

the log likelihood will be maximized subject to the constraint equations being







satisfied. One common optimization algorithm found in the mathematics

literature is Lagrange's method of undetermined multipliers. We show that

Lagrange's method is easily implemented for ML fitting of the models under

consideration in this chapter. One problem with Lagrange's method of

undetermined multipliers for ML fitting of statistical models has been that it

becomes computationally infeasible for large data sets. By using a modified

Newton-Raphson method which involves inverting a matrix of a simpler form

than the more complicated Hessian, we consider fitting models to relatively

large data sets.
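
The overall shape of such an iteration can be sketched generically. The
following is an Aitchison-Silvey-style update for maximizing a log likelihood
l(μ) subject to h(μ) = 0, with a caller-supplied matrix B standing in for the
Hessian block (the modification just described); the function names and
interfaces are illustrative, not the dissertation's exact algorithm.

    import numpy as np

    def constrained_ml(mu0, score, constraint, jacobian, B_matrix,
                       tol=1e-8, max_iter=100):
        """Lagrange-multiplier iteration: maximize l(mu) subject to h(mu) = 0.

        score(mu)      : gradient of the log likelihood
        constraint(mu) : h(mu), the constraint values
        jacobian(mu)   : H(mu) = dh/dmu', one row per constraint
        B_matrix(mu)   : positive-definite stand-in for -d2l/(dmu dmu')
        """
        mu = mu0.astype(float).copy()
        lam = np.zeros(len(constraint(mu)))          # multipliers start at zero
        for _ in range(max_iter):
            s, h, H = score(mu), constraint(mu), jacobian(mu)
            # Stationarity of l(mu) + lam'h(mu):  s + H'lam = 0  and  h = 0.
            if max(np.abs(s + H.T @ lam).max(), np.abs(h).max()) < tol:
                break
            B, v = B_matrix(mu), len(h)
            K = np.block([[B, -H.T],
                          [H, np.zeros((v, v))]])    # modified Newton system
            step = np.linalg.solve(K, np.concatenate([s + H.T @ lam, -h]))
            mu, lam = mu + step[:len(mu)], lam + step[len(mu):]
        return mu, lam

For the models of this chapter, mu0 would be the observed counts and lam the
zero vector, mirroring the initialization described later in this chapter.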

We also explore the asymptotic behavior of the estimators within the

framework of constraint-rather than freedom-models. Usually, asymptotic

properties of model and freedom parameter estimators are studied within the

framework of freedom models. Aitchison and Silvey (1958, 1960) and Silvey

(1959) studied the asymptotic behavior of the model parameter estimators

when the model is specified in terms of constraint equations. Following the

arguments of Aitchison and Silvey, we derive the asymptotic distributions of

both the model and freedom parameter estimators.

Previous work by Haber (1985a) addressed maximum likelihood methods

for fitting models of the form


C log(Aμ) = Xβ,

to categorical response data. Subsequently, Haber and Brown (1986)

discussed ML fitting for loglinear models that were also subject to the

linear constraints Lμ = d, where these constraints necessarily include the

identifiability constraint required of μ, the vector of product-multinomial







cell means. Both of these papers advocated the use of Lagrange's method

of undetermined multipliers to find the maximum likelihood estimates of

the model parameters μ. The method of Haber (1985a) involved using

the (unmodified) Newton-Raphson method which becomes computationally

unattractive as the number of components in μ gets moderately large. Both

Haber (1985a) and Haber and Brown (1986) were primarily concerned with

measuring model goodness of fit and therefore did not consider estimation

of freedom parameters. Haber (1985b) did consider estimation of freedom

parameters, but only when the simpler model C log μ = Xβ was used. One of

the several ways that we extend the work of Haber (1985a, 1985b) and Haber

and Brown (1986) is to consider estimation of the freedom parameters when

the more general model C log(Aμ) = Xβ is used.

Others have considered ML fitting of nonstandard models for

multivariate polytomous response data. Laird (1991) outlines the different approaches

taken by different authors. As an example, Dale (1986) considered ML fitting

for a particular class of models for bivariate polytomous ordered response data

which were of the form


C1 log(A1μ) = X1β1,   g(A2μ) = X2β2


Specifically, the first freedom equation specifies a loglinear model for the

association between the two responses measured by the global cross-ratios

(cross-product ratios of quadrant probabilities) so that C1 and A1 are of

a particular form. The second set of freedom equations specifies some

generalized linear model (McCullagh and Nelder, 1989) for the marginal

means or probabilities. Maximum likelihood estimators for the association







model freedom parameters β1 and the marginal model freedom parameters

β2 were simultaneously computed by iteratively solving the score equations

via a quasi-Newton approach. To use this maximization technique, the score

functions, which involve the cell probabilities, must be written explicitly

as a function of the freedom parameter β = vec(β1, β2). A nontrivial

approach to finding reasonable starting values for β is discussed by Dale

(1986). Along with Dale, McCullagh and Nelder (section 6.5, 1989) and

Becker and Balagtas (1991) consider writing the score as an explicit function

of the freedom parameters so that the marginal and association freedom

parameter estimates may be computed simultaneously. In general, when there

are more than two responses, this is not a simple task and so an extension

of this method to multivariate polytomous response data models will be very

messy indeed. Also, convergence of the iterative scheme requires good initial

estimates of the freedom parameter β. These may be very difficult to find. In

contrast, the maximization approach of this chapter, which is similar to Haber

(1985a) and Haber and Brown (1986), is shown to be easily implemented for

fitting multivariate polytomous response data models. With this technique,

it is not necessary to write the cell means as an explicit function of the

freedom parameters. Further, initial estimates of the freedom parameters,

which are difficult to find, are not needed for this technique. Instead, only

initial estimates of the cell means and undetermined multipliers are needed.

Reasonable initial estimates of the cell means are the cell counts themselves,

while a reasonable initial estimate of the vector of undetermined multipliers

is the zero vector-the value of the undetermined multipliers when the model

fits the data perfectly.







We will now introduce the class of models that we will consider for the

remainder of this chapter and the next, more applied chapter. The models

have form


C1 log(A1μ) = X1β1,   C2 log(A2μ) = X2β2,   Lμ = d


where the linear constraints include the identifiability constraints. Later,

when we study the asymptotic behavior of the ML estimators, we will

require the components of d to be zero unless they correspond to an

identifiability constraint. These models, which are of the form C log(Aμ) =

Xβ, Lμ = d, will allow us to model both the joint and marginal distributions

simultaneously when dealing with multivariate response data. The bivariate

association model of Dale (1986) is a special case of these models, as we

can specify the matrices C1 and A1 so that C1 log(A1μ) is the vector of log

bivariate global cross-ratios. Restricting the marginal models to have form

C2 log(A2μ) = X2β2, rather than allowing the marginal means to follow a

generalized linear model, as Dale (1986) did, is not overly restrictive. In

fact, many of the generalized linear models for multinomial cell means can be

written in this form. For example, loglinear, multiple logit, and cumulative

logit models are of this form. Also, unlike Haber (1985a) and Haber and

Brown (1986), we will be concerned with estimation of the freedom parameter

β = vec(β1, β2), thereby allowing for model-based inference.
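
To make the freedom-equation form concrete, here is one hand-built
illustration (assumed matrices, not taken from the text) casting the two
marginal logits of a 2 × 2 table as C log(Aμ) = Xβ: A maps the four cell
means to the four marginal totals and C forms the two logit contrasts.

    import numpy as np

    # Cells ordered (11, 12, 21, 22); mu holds the four expected cell counts.
    A = np.array([[1, 1, 0, 0],     # row-1 total,    mu_{1+}
                  [0, 0, 1, 1],     # row-2 total,    mu_{2+}
                  [1, 0, 1, 0],     # column-1 total, mu_{+1}
                  [0, 1, 0, 1]])    # column-2 total, mu_{+2}
    C = np.array([[1, -1, 0, 0],    # logit at occasion 1: log(mu_{1+}/mu_{2+})
                  [0, 0, 1, -1]])   # logit at occasion 2: log(mu_{+1}/mu_{+2})
    X = np.array([[1.0,  1.0],      # alpha + rho
                  [1.0, -1.0]])     # alpha - rho

    mu = np.array([40., 25., 10., 25.])   # hypothetical cell means
    print(C @ np.log(A @ mu))             # the two marginal logits

This is exactly the structure of the opinion poll model introduced below.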

Model-based inferences usually refer to inferences based on freedom

parameters. With freedom equations, we have the luxury of choosing a

parameterization that results in the freedom parameters having meaningful

interpretations. For instance, a freedom parameter β may be chosen to







represent a departure from independence in the form of a log odds ratio.

More generally, we usually will try to parameterize in such a way so that

certain parameters will measure the magnitude of an effect of interest.

For example, consider an opinion poll where a group of subjects were

asked on two different occasions whether they would vote for the President

again in the next election. Suppose they were asked immediately after the

President took office and again after the President had served for two years.

The researcher may be interested in determining whether the distribution of

response changed from Time 1 to Time 2 and if so, assess the magnitude of

the change. The data configuration can be displayed as in Table 2.1.



Table 2.1. Opinion Poll Data Configuration

                  Data                            Probabilities
                 Time 2                              Time 2
                yes    no                         yes    no
Time 1  yes     y11    y12        Time 1  yes     π11    π12    π1+
        no      y21    y22                no      π21    π22    π2+
                                                  π+1    π+2


We could formulate a model of the form C log(Aμ) = Xβ in such a way

that the freedom parameter β has a nice interpretation with respect to the

hypothesis of interest. One such model is

log( π(i) / (1 − π(i)) ) = α + ρi,   i = 1, 2                  (2.1.1)


where the parameter π(i) is a marginal probability, i.e.

π(i) = { π1+, if i = 1
       { π+1, if i = 2







and, for identifiability of the freedom parameters,


ρ1 = −ρ2 = ρ.

Model (2.1.1) is a simple logit model for the marginal probabilities {πi+} and

{π+j}. The parameter ρ measures the magnitude of departure from marginal

homogeneity in that ρ = 0 if and only if there is marginal homogeneity.

One could use the Wald statistic ρ̂/se(ρ̂) to test the hypothesis. If the

null hypothesis is rejected, we can assess the magnitude of departure from

marginal homogeneity by computing a confidence interval for 2ρ which is the

log odds ratio comparing the odds that a randomly chosen subject responds

'yes' at Time 2 to the odds that a randomly chosen subject responds 'yes' at

Time 1.
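
A quick unrestricted (sample-based) version of this Wald test can be sketched
as follows, with hypothetical counts for Table 2.1; ρ̂ is half the difference of
the two marginal logits, and the delta-method variance accounts for the
pairing through the covariance of the two margins. The restricted ML fit
described later in the chapter would replace these sample quantities.

    import numpy as np
    from scipy.stats import norm

    def marginal_logit_wald(table):
        """Wald test of marginal homogeneity based on model (2.1.1)."""
        n = table.sum()
        p = table / n
        p1, p2 = p[0].sum(), p[:, 0].sum()     # pi_{1+} and pi_{+1}
        rho = 0.5 * (np.log(p1 / (1 - p1)) - np.log(p2 / (1 - p2)))
        g1, g2 = 1 / (p1 * (1 - p1)), 1 / (p2 * (1 - p2))  # d logit / dp
        v11, v22 = p1 * (1 - p1) / n, p2 * (1 - p2) / n
        v12 = (p[0, 0] - p1 * p2) / n          # covariance induced by pairing
        se = 0.5 * np.sqrt(g1**2 * v11 + g2**2 * v22 - 2 * g1 * g2 * v12)
        return rho, se, 2 * norm.sf(abs(rho / se))

    y = np.array([[40, 25],                    # hypothetical counts, Table 2.1
                  [10, 25]])
    print(marginal_logit_wald(y))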

This simple example illustrates the utility of using freedom parameters

and the corresponding model-based inferences. For this reason, this chapter

will be concerned with making inferences about both the model parameters

μ and the freedom parameters β.

The contents of the following sections are as follows. In section 2.2,

we provide an overview of parametric modeling. The two ways of specifying

models-via constraint equations and via freedom equations-are discussed

at length in section 2.2.1. It is shown that a model specified in terms of

freedom equations can be respecified in terms of constraint equations. In

particular, the freedom equation C log(Aμ) = Xβ, which actually constrains

the function C log(Aμ) to lie in some manifold spanned by the columns of X,

is equivalent to the constraint equation U'C log(Aμ) = 0, where the columns

of U form a basis for the null space of X'. Other topics covered in section 2.2







include interpretation and calculation of 'degrees of freedom' and measuring

model goodness of fit.

We describe a general class of models for univariate or multivariate

polytomous response data in section 2.3.1. The data vector y is initially

assumed to be a realization of a product-multinomial random vector. We

describe the asymptotic behavior of the product-multinomial ML estimators

in section 2.3.3. Lagrange's method of undetermined multipliers is used to

find restricted maximum likelihood estimates of the model parameters and

the freedom parameters. The actual algorithm is described in detail in section

2.3.4.

In section 2.4, we explore the relationship between the product-multinomial

and product-Poisson ML estimators. General results that allow one to

ascertain when inferences based on product-Poisson estimates are the same as

inferences based on product-multinomial estimates are shown to follow quite

directly when one works within the framework of constraint models. Theorem

2.4.2 of this section represents a generalization of the results of Birch (1963)

and Palmgren (1981).


2.2 Parametric Modeling-An Overview

Inferences about the distribution of some n × 1 random vector Y are

often based solely on a particular realization y of Y. In parametric modeling

it is often the case that the distribution of Y is known up to an s × 1 vector

of model parameters θ; i.e. it is 'known' that


Y ~ F(y; θ),   θ ∈ Θ                  (2.2.1)






where Θ is some (s − q)-dimensional (q ≥ 0) subset of R^s known to contain the

true unknown parameter θ*. The cumulative distribution function F maps

points in R^n into the unit interval [0, 1] and is assumed to be known.

In general, we will allow the dimension s of θ to grow with n. For

example, let Y = (Y1, ..., Yn)' have independent components such that

Yi ~ ind G(yi; zi(θ)),   i = 1, ..., n,

where zi(θ) is some function of θ associated with the ith component of Y.

The function zi could be defined as zi(θ) = θi, in which case s = n. Or, on

the other hand, zi could be a mapping from R^s to R^1 with s fixed.

2.2.1 Model Specification.

In parametric settings, models for the data, or more precisely, models for

the distribution of Y, can be completely specified by recording the family of

candidate distributions that F may belong to. That is, one must specify the

form for F(·; θ) and the space Θ_M that is assumed to contain the true value

θ* of θ. In parametric modeling, the form of F(·; θ) is assumed known, but

the true value θ* is not. Denote a parametric model by [F(·; θ); θ ∈ Θ_M] or

more simply by [Θ_M]. We say the model [Θ_M] 'holds' if the true parameter

value θ* is a member of Θ_M, i.e.

[Θ_M] holds ⟺ θ* ∈ Θ_M.

A model does not hold if θ* ∉ Θ_M.

The objective of model fitting is to find a simple, parsimonious model

that holds (or nearly holds). By parsimonious, we mean that the vector θ can

be obtained as a function of relatively few unknown parameters. An example







of a parsimonious model for the distribution of an n-variate normal vector

with unknown mean vector μ and known covariance is [Θ_P], where

Θ_P = {μ ∈ R^n : μj = β, j = 1, ..., n, β unknown}.

Notice that all n components of μ can be obtained as a function of

one unknown parameter β. Thus, all of our estimation efforts can be

directed towards the estimation of the common mean β. An example of a

nonparsimonious model is the so-called saturated model [Θ], where

Θ = {μ : μ ∈ R^n} = R^n.

In this case, μ is a function of n unknown parameters.

The question of whether or not the parsimonious model holds is an

entirely different matter. Practically speaking, a model will rarely strictly

hold. Therefore, we will often say a model holds if it nearly holds, i.e. for

some small ε,

inf{ ||θ* − θ|| : θ ∈ Θ_M } < ε.

Without delving too much into the philosophy of model fitting and the

simplicity principle (Foster and Martin, 1966), we point out that for a model

to be practically useful it must be robust to the 'white noise' of the process

generating Y. That is, it should account for only the obvious systematic

variation. A model would be said to be robust to the white noise variability,

if the model parameter estimates based on different realizations of Y are very

similar. As an example, if instead of [Θ_P], the saturated model [Θ] was used

to draw inferences about the normal mean vector μ, we would find that the

model fit perfectly, but that upon repeated sampling the model estimates






would change dramatically. Thus, the model is not robust to the white noise

of the process. On the other hand, the parsimonious model [Θ_P] estimates

would change very little from sample to sample, varying with the sample

mean of n observations. This model is robust to the white noise variability.

Therefore, if the model would hold, or nearly hold, we would say it was a

good model.

Freedom Models. In the previous n-variate normal example we specified a

model [Θ_P] in terms of some unknown parameter β. Aitchison and Silvey

(1958, 1960) and Silvey (1959) refer to the parameter β as a 'freedom

parameter' and the model [Θ_P] as a 'freedom model'. These labels are

reasonable since we can measure the amount of freedom we have for estimating

θ by noting the number of independent freedom parameters there are in the

model. The model [Θ_P] has one degree of freedom for estimating the mean

vector μ. Thus, once an estimate of the single parameter β is obtained the

entire vector μ can be estimated; it is a function of the one parameter β.

Notice that 'degrees' of freedom correspond to integer dimension in that a

degree of freedom is gained (lost) if we introduce (omit) one independent

freedom parameter thereby increasing (decreasing) the dimensionality of Θ_P

by one.

In general we will denote a freedom model by [Θ_X], where

Θ_X = {θ ∈ Θ : g(θ) = Xβ, β ∈ R^p}.

The function g is some differentiable vector-valued function mapping θ ∈ Θ

into r-dimensional Euclidean space R^r. The 'model' matrix X is an r × p full

column rank matrix of known numbers. To calculate degrees of freedom for





[Θ_X] we will initially assume g satisfies

∀θ0 ∈ Θ_X,   ∂g(θ0)/∂θ'   is of full row rank r.

It also will be assumed that the constraints implied by g(θ) = Xβ are
independent of the q constraints implied by the model [Θ] of (2.2.1). Well-
defined models will satisfy these conditions. For example, any g that is
invertible satisfies the derivative condition. Actually this derivative condition
is not a necessary condition for the model to be well defined. Later, we will
show that g need only satisfy a milder derivative condition.

The degrees of freedom for the model [Θ_X] can be obtained by subtracting
the number of constraints implied by [Θ_X] from the total number of model
parameters, s. The number of constraints implied by [Θ_X] is (r − p) + q, the
dimension of the null space of X' plus the q constraints implied by model [Θ].
Hence, the model degrees of freedom for [Θ_X] is

df[Θ_X] = s − (r − p + q).                  (2.2.2)

In view of (2.2.2) the model degrees of freedom, an integer measure of the
freedom one has for estimating θ, is an increasing function of p, the number
of freedom parameters. In fact, for the special case when q = 0 and g(θ) = θ
(so s = r), we have that the number of degrees of freedom for model [Θ_X] is
simply p, the number of freedom parameters. This gives us another good
reason for calling β a freedom parameter and [Θ_X] a freedom model.
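
A trivial numeric check of (2.2.2), using Example 1 below (where g is the
identity, so r = s = n and q = 0) as the test case:

    def model_df(s, r, p, q=0):
        """Degrees of freedom for the freedom model, equation (2.2.2)."""
        return s - (r - p + q)

    n = 10
    print(model_df(s=n, r=n, p=1))   # 1, matching df = n - (n - 1) in Example 1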
Constraint Models. Notice that

{θ ∈ Θ : g(θ) = Xβ, β ∈ R^p}                  (2.2.3)

can be rewritten as


{θ ∈ Θ : U'g(θ) = 0},






where U is an r × (r − p) full column rank matrix satisfying U'X = 0, i.e. the

columns of U form a minimal spanning set, or basis, for the null space of X'.

Letting u = r − p and h2(θ) = 0 be the q constraints implied by [Θ], we can

write the (u + q) × 1 vector of constraining functions as h(θ) = [h1(θ), h2(θ)]',

where h1 = U'g. We rewrite the freedom model [Θ_X] of (2.2.3) as [Θ_h], where

Θ_h = {θ ∈ R^s : h(θ) = 0}.                  (2.2.4)

Aitchison and Silvey (1958,1960) refer to model [Oh] as a constraint model.

Every freedom model can be written as a constraint model.

We present a few simple examples to illustrate the equivalence between the two model formulations, freedom and constraint.

Example 1. Let Y_i ~ ind N(β, σ²), i = 1, …, n, where σ² is known. This model can be specified as the freedom model [Θ_X], where

    Θ_X = {μ ∈ R^n : μ = 1_n β, β unknown},

or equivalently it can be expressed as the constraint model [Θ_h], where

    Θ_h = {μ ∈ R^n : U'μ = 0}

and U' is the (n − 1) × n successive-difference matrix

    U' = [ 1 −1  0 ⋯  0  0
           0  1 −1 ⋯  0  0
           ⋮              ⋮
           0  0  0 ⋯  1 −1 ].

It is easily seen that Θ_X = Θ_h and that the model degrees of freedom is df[Θ_X] = n − (n − 1) = 1.
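To make the equivalence concrete, the following minimal numerical sketch (assuming numpy; the value of n and the mean vectors are illustrative only) builds the successive-difference matrix U' and checks both that U'X = 0 and that U'μ = 0 exactly when μ is constant:

    import numpy as np

    n = 5
    X = np.ones((n, 1))                          # model matrix: mu = 1_n * beta

    # U' is the (n-1) x n successive-difference matrix
    U_t = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)

    print(np.allclose(U_t @ X, 0.0))             # True: columns of U span N(X')

    mu_in = 3.7 * np.ones(n)                     # lies in the model space
    mu_out = np.arange(n, dtype=float)           # does not
    print(np.allclose(U_t @ mu_in, 0.0))         # True:  U' mu = 0
    print(np.allclose(U_t @ mu_out, 0.0))        # False: constraint violated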
Example 2. Let Y_i ~ ind N(μ_i = β_0 + β_1 x_i, σ²), i = 1, …, n, where σ² is known. This model can be specified as the freedom model [Θ_X], where

    Θ_X = {μ ∈ R^n : μ_i = β_0 + β_1 x_i, i = 1, …, n},

or, assuming that each x_i is distinct, as the constraint model [Θ_h], where

    Θ_h = {μ ∈ R^n : U'μ = 0}.

Here U' is the (n − 2) × n matrix whose jth row encodes the constraint

    (μ_2 − μ_1)/(x_2 − x_1) − (μ_{j+2} − μ_{j+1})/(x_{j+2} − x_{j+1}) = 0,  j = 1, …, n − 2.

Notice that U'μ = 0 implies that

    (μ_{j+1} − μ_j)/(x_{j+1} − x_j) = (μ_{k+1} − μ_k)/(x_{k+1} − x_k)  ∀ k, j.

That is, the n means fall on a line. As before, it can be seen that Θ_X = Θ_h and that the model degrees of freedom is df[Θ_h] = n − (n − 2) = 2.

Definitions. We will assume that the constraining function h satisfies some reasonable conditions so that the model is well defined. We first present some definitions.

(1) A model [Θ_h] is said to be 'consistent' if Θ_h ≠ ∅.

(2) A consistent model [Θ_h] is said to be 'well-defined' if the Jacobian matrix for h is of full row rank v = u + q at every point in Θ_h. That is,

    ∀ θ_0 ∈ Θ_h, ∂h(θ_0)/∂θ' is of full row rank v.

(3) A model [Θ_h] is said to be 'ill-defined' if it is not well-defined, i.e.

    ∃ θ_0 ∈ Θ_h such that ∂h(θ_0)/∂θ' is not of full row rank v.

(4) An ill-defined model [Θ_h] is said to be 'inconsistent' or 'incompatible' if Θ_h = ∅.

Briefly, any reasonable model will have a nonempty parameter space and

hence will be consistent. The Jacobian condition of definition (2) is similar

to the condition required in the Implicit Function Theorem (see Bartle, 1976).

Basically, this condition requires the constraints to be nonredundant so that,

at least theoretically, the constraint equations can be written uniquely as

a function of a smaller set of parameters. An ill-defined model has been

specified with a redundant set of constraint equations. Using the lingo of

the optimization literature, two constraints are redundant if, for each point

in the parameter space, both of the constraints are 'active' or both of the

constraints are 'inactive'. That is, for all parameter values, if one constraint

is active (inactive) then the other is necessarily active (inactive).

It should be noted that the above definitions are in terms of the constraint formulation of a model. This is sufficient since freedom models can be written as constraint models. For convenience, we give sufficient conditions for a freedom model to be well-defined.

A consistent freedom model is well-defined if it satisfies the following two conditions:

(i) The constraints implied by g(θ) = Xβ are independent of the q constraints implied by [Θ].

(ii) The Jacobian matrix of g evaluated at any point in Θ_X is of full row rank r, i.e.

    ∀ θ_0 ∈ Θ_X, ∂g(θ_0)/∂θ' is of full row rank r.






The sufficiency of conditions (i) and (ii) can be seen by observing that (ii) implies that h_1 = U'g has a full row rank Jacobian, since U' is of full row rank, and (i) then implies that h = (h_1', h_2')' has a full row rank Jacobian. These sufficient conditions are by no means necessary for a model to be well defined, as the Jacobian of h may be of full row rank v even when the Jacobian of g is not of full row rank.

Notice that the model matrix has nothing to do with whether or not a model is well defined. In particular, one may think that the model [Θ_X] is ill-defined whenever the r × p matrix X is not of full column rank, i.e. whenever the freedom parameters are nonestimable. However, the model can still be rewritten as a constraint model, with the full column rank matrix U now spanning the null space of X', whose dimension exceeds r − p. It follows that if g satisfies (i) and (ii), then the model [Θ_X] will be well-defined. The only reason we have taken X to be of full column rank is to avoid using generalized inverses when working with the freedom parameters.

To illustrate the use of these definitions, we consider the model [Θ_M], where

    Θ_M = {θ ∈ R^n : Mθ − d = 0}.

The model will be well defined if ∂h(θ)/∂θ' = M is of full row rank. It is inconsistent if the linear system of equations Mθ = d is inconsistent.

If a model [Θ_h] is well defined, then the constraints implied by the model are all independent, in that no constraint can be implied by the others. We will consider only well-defined models when calculating degrees of freedom.






As before, we calculate the degrees of freedom for a model as the difference between the number of model parameters s and the number of independent constraints v implied by the model, i.e.

    df[Θ_h] = s − (r − p + q) = s − (u + q) = s − v.

Notice that for the constraint model, the model degrees of freedom is a decreasing function of the number of independent constraints v.

Finally, it should be noted that models may be specified in terms of

both freedom equations and constraint equations. In fact, in subsequent

sections this will be the case. However, without loss of generality, we will

concentrate on constraint models since any model can be written in the form

of a constraint model.


2.2.2 Measuring Model Goodness of Fit

Inferences about model parameters are reliable only if the model is

'good'. A good model should be well defined (or at least consistent). It

should be simple and parsimonious. Finally, the model should be relatively

close to holding.

To assess whether or not the model holds, we will need the concept of a distance between two models. To begin, we will assume there is some measure of distance between two hierarchical parametric models. (Two models [Θ_1] and [Θ_2] are hierarchical if Θ_2 ⊂ Θ_1 and df[Θ_2] < df[Θ_1] whenever Θ_1 ≠ Θ_2.) This (parametric) distance will be a quantitative comparison of how close the two models are to holding. Thus, if both models hold the distance is zero. The distance will also be independent of the model degrees of freedom.

Recall that the form of F(·; θ) is assumed known. Therefore, the distance will measure how far the true parameter is from falling in the parametric model space. Suppose, firstly, that Θ_1 and Θ_2 are general parameter spaces. That is, θ ∈ Θ_1 ∪ Θ_2 does not necessarily define a probability distribution. In other words, θ need not fall in a subset of an (s − 1)-dimensional simplex. Let a(θ) and b(θ) be vector or matrix valued functions of the unknown parameter θ. Define a distance between two hierarchical models [Θ_1] and [Θ_2] (Θ_2 ⊂ Θ_1) as

    δ[Θ_2; Θ_1] = inf_{Θ_2} ‖b(θ)(a(θ) − a(θ*))‖² − inf_{Θ_1} ‖b(θ)(a(θ) − a(θ*))‖².

Notice that a and b can be chosen so that

(1) δ[Θ_2; Θ_1] ≥ 0;
(2) δ[Θ_2; Θ_1] = 0 if and only if both [Θ_1] and [Θ_2] hold.
For example, consider the case Y ~ MVN_n(μ, σ²I_n). Suppose that

    [Θ]   = {(μ, σ²) : μ ∈ R^n, σ² > 0},
    [Θ_1] = {(μ, σ²) : μ = μ_0, σ² > 0},
    [Θ_2] = {(μ, σ²) : μ = Xβ, β ∈ R^p, σ² > 0},
    [Θ_3] = {(μ, σ²) : μ = 1_n α, α ∈ R, σ² > 0}.

In this example, each component of Y has a common variance σ². It seems reasonable that differences between any μ_j and the true mean μ_j* are equally important. Hence, a natural distance between any two of these models is

    δ[Θ_{M_2}; Θ_{M_1}] = inf_{Θ_{M_2}} ‖μ − μ*‖² − inf_{Θ_{M_1}} ‖μ − μ*‖².

Notice that a(μ, σ²) = μ and b(μ, σ²) = I. Hence, the measure of distance





between [Θ] and [Θ_1] is

    δ[Θ_1; Θ] = inf_{Θ_1} ‖μ − μ*‖² = ‖μ_0 − μ*‖².

The second infimum is zero since the model [Θ] is known to hold.

The measure of distance between [Θ_2] and [Θ] is

    δ[Θ_2; Θ] = inf_{Θ_2} ‖μ − μ*‖² = inf_β ‖Xβ − μ*‖²
              = ‖X(X'X)^{-1}X'μ* − μ*‖²
              = ‖(I − X(X'X)^{-1}X')μ*‖²    (2.2.5)
              = μ*'(I − X(X'X)^{-1}X')μ*.

This is the squared length of the component of μ* orthogonal to its projection onto the range space of X. Notice that if μ* = Xβ*, that is, if [Θ_2] holds, then δ[Θ_2; Θ] = 0.

Finally, the distance between [Θ_3] and [Θ_2] is

    δ[Θ_3; Θ_2] = inf_{Θ_3} ‖μ − μ*‖² − inf_{Θ_2} ‖μ − μ*‖²
                = μ*'(I − n^{-1}1_n1_n')μ* − μ*'(I − X(X'X)^{-1}X')μ*    (2.2.6)
                = μ*'(X(X'X)^{-1}X' − n^{-1}1_n1_n')μ*.
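The projection formulas (2.2.5) and (2.2.6) are easy to evaluate numerically. The following sketch (assuming numpy; the design X and the vector μ* are arbitrary illustrative choices, with 1_n included among the columns of X so that Θ_3 ⊂ Θ_2) computes both distances:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 6, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # 1_n in M(X)
    mu_star = rng.normal(size=n)                 # an arbitrary 'true' mean

    P_X = X @ np.linalg.inv(X.T @ X) @ X.T       # projection onto M(X)
    P_1 = np.full((n, n), 1.0 / n)               # projection onto M(1_n)
    I = np.eye(n)

    delta_2 = mu_star @ (I - P_X) @ mu_star      # delta[Theta_2; Theta], (2.2.5)
    delta_3 = mu_star @ (P_X - P_1) @ mu_star    # delta[Theta_3; Theta_2], (2.2.6)
    print(delta_2 >= 0, delta_3 >= 0)            # both True: squared distances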

As another example, consider a random vector Y = (Y_1, …, Y_n)' with independent components following an exponential dispersion distribution (Jørgensen, 1989). That is,

    Y_i ~ indep ED(μ_i, σ²), i = 1, …, n,

where the density of Y_i, with respect to some measure, has form

    f_{Y_i}(y; γ_i, σ²) = a(y, σ²) exp{(yγ_i − κ(γ_i))/σ²},    (2.2.7)

where μ_i = κ'(γ_i) and var(Y_i) = σ²κ''(γ_i). Let V(μ) = ⊕_i κ''(γ_i) and θ = (μ_1, …, μ_n, σ²)'. Since the components of Y have different variances, a natural measure of distance is

    δ[Θ_{M_2}; Θ_{M_1}] = inf_{Θ_{M_2}} ‖V(μ)^{-1/2}(μ − μ*)‖² − inf_{Θ_{M_1}} ‖V(μ)^{-1/2}(μ − μ*)‖².    (2.2.8)

That is, a(θ) = μ and b(θ) = V(μ)^{-1/2}. Premultiplying the vector (μ − μ*) by V(μ)^{-1/2} has the effect of downplaying those differences (μ_i − μ_i*) for which the corresponding variance is large.

To assess the goodness of fit of a model, relative to another, we can estimate the distance δ via some statistic based on the observed data. It is interesting to note that when δ = 0, i.e. both models hold, our data-based estimate of this null distance will be some nonnegative (positive, if the model is unsaturated) number, reflecting the amount of white noise or random variability there is in Y. This is so because, if both models hold, then the only reason that our estimate of distance would be nonzero would be that Y has some random component. That is, the variability in Y that is not explained by the model causes the data to fit the model imperfectly.

Let D be an estimate of δ. That is, D[Θ_2; Θ_1] is a stochastic, data-based estimate of how far apart models [Θ_1] and [Θ_2] are. Potential candidates for D are the weighted least squares, likelihood ratio, Wald, deviance, and Lagrange multiplier statistics.

For example, consider the n-variate normal case and the four candidate models [Θ], [Θ_1], [Θ_2], and [Θ_3]. We will assume that both [Θ] and [Θ_2] hold. In view of (2.2.5), a reasonable estimate of δ[Θ_2; Θ] can be obtained by replacing μ* by Y, the estimate of μ* under model [Θ], i.e.

    D[Θ_2; Θ] = Y'(I − X(X'X)^{-1}X')Y = Σ_{i=1}^n (Y_i − Ŷ_i)².

Recall that, since δ[Θ_2; Θ] is known to be zero, D[Θ_2; Θ] serves as our 'estimate of error'.

Similarly, a reasonable estimate of δ[Θ_3; Θ_2] can be obtained by replacing μ* in (2.2.6) by Y, the least restrictive estimate of μ*, i.e.

    D[Θ_3; Θ_2] = Y'(X(X'X)^{-1}X' − n^{-1}1_n1_n')Y = Σ_{i=1}^n (Ŷ_i − Ȳ)².

Now Θ_3 ⊂ Θ_2 and

    df[Θ_3] = n + 1 − (n − 1) = 2,
    df[Θ_2] = n + 1 − (n − p) = p + 1.

The degrees of freedom associated with estimating the distance between two models will be called the distance (or residual, or goodness-of-fit) degrees of freedom. The distance degrees of freedom for the two models [Θ_{M_1}] and [Θ_{M_2}] is defined to be the difference between the two model degrees of freedom, i.e.

    df(δ[Θ_{M_2}; Θ_{M_1}]) = df[Θ_{M_1}] − df[Θ_{M_2}].

The number of distance degrees of freedom measures the dimensional distance between the two models, i.e. the difference in dimensions. It measures the difference in the amount of freedom one has for estimating θ under the two models. It seems intuitive that if the distance degrees of freedom is large, that is, if the dimensional difference between the two models is great, the significance of the distance statistic may be difficult to ascertain. This follows since we expect the fit to be quite different for two very different models, even when both models hold. The statistic then reflects both white noise and possibly lack of fit, so the distance statistic will tend to be large even when both models hold. But for many statistics, a large mean implies a large variance, thereby making significant findings more difficult to obtain. It is for this reason that we say it is better to concentrate our efforts on relatively few degrees of freedom to detect lack of fit. That is, one should use the smallest alternative space possible when testing a null hypothesis.

A more technical argument holds when the test statistic (distance

statistic) is a Chi-square or an F. Das Gupta and Perlman (1974) showed

that for a fixed noncentrality parameter, i.e. fixed distance between models,

the power of the F-test or the Chi-square test increases as the distance degrees

of freedom decreases.

Example 1: Continuing with the n-variate normal example, we see that

    df(δ[Θ_3; Θ_2]) = df[Θ_2] − df[Θ_3] = (p + 1) − 2 = p − 1.

Thus, Θ_3 is of p − 1 fewer dimensions than Θ_2. Now, if we knew σ², the white noise variance, we could test H_0 : θ* ∈ Θ_3 vs. H_1 : θ* ∈ Θ_2 − Θ_3 using the statistic

    D[Θ_3; Θ_2]/σ² = SS(Reg)/σ²,    (2.2.9)

which has a χ²(p − 1) null distribution. However, σ² is not generally known and we must estimate it. One way of estimating σ² is by estimating the distance between [Θ] and [Θ_2], two models that are known to hold, and dividing by the distance degrees of freedom. Since the distance degrees of freedom is df[Θ] − df[Θ_2] = n + 1 − (p + 1) = n − p, we have that the estimate of the white noise variance is D[Θ_2; Θ]/(n − p) = SS(Error)/(n − p).
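A small numerical sketch of this variance estimate and the resulting test ratio follows (assuming numpy; the simulated data and the design matrix are hypothetical, generated with [Θ_3] holding):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 50, 4
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    Y = 2.0 + rng.normal(size=n)                 # [Theta_3] holds; sigma^2 = 1

    P_X = X @ np.linalg.inv(X.T @ X) @ X.T
    P_1 = np.full((n, n), 1.0 / n)

    D_reg = Y @ (P_X - P_1) @ Y                  # D[Theta_3; Theta_2] = SS(Reg)
    D_err = Y @ (np.eye(n) - P_X) @ Y            # D[Theta_2; Theta] = SS(Error)

    sigma2_hat = D_err / (n - p)                 # white noise variance estimate
    F = (D_reg / (p - 1)) / sigma2_hat           # F-type ratio on (p-1, n-p) df
    print(sigma2_hat, F)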






Notice that in the above example the estimate of the parameter σ² was simply the estimated distance between two models that were known to hold, divided by their dimensional distance. Quite generally, when the data have an exponential dispersion distribution (2.2.7) with common dispersion parameter σ², the estimated distance between two models that are known to hold, divided by their dimensional distance, gives us an estimate of σ². This is true when the estimated distance is taken to be the LR, Wald, deviance, LM, or weighted least squares statistic. These statistics are natural estimators of the weighted distance given in (2.2.8) for the exponential dispersion models.

Now, let us assume that Θ_1 and Θ_2 are each subsets of an (s − 1)-dimensional simplex. For example, with count data, conditional on the total n, the distribution is often multinomial with index n and parameter (alternatively, probability distribution vector) θ*. Read and Cressie (1988) extensively study a family of distance measures called the power-divergence family. The power divergences have form

    I^λ(θ*, θ) = (λ(λ + 1))^{-1} Σ_i θ_i*[(θ_i*/θ_i)^λ − 1],  −∞ < λ < ∞,

where I^0 and I^{-1} are defined to be the continuous limiting values as λ → 0 and λ → −1. It is assumed that θ* and θ fall on an (s − 1)-dimensional simplex. As usual, let θ* represent the true unknown parameter. We define the family of distance measures between [Θ_1] and [Θ_2] (Θ_2 ⊂ Θ_1) to be proportional to

    δ[Θ_2; Θ_1] = 2n{ inf_{Θ_2} I^λ(θ*, θ) − inf_{Θ_1} I^λ(θ*, θ) }.

By properties of I^λ(θ*, θ) (Read and Cressie, 1988, pp. 110-113), it follows that δ ≥ 0, with equality if and only if both models hold.






To estimate δ[Θ_2; Θ_1] based on the data, we note that our least restrictive guess of θ* is Y/n, the vector of sample proportions. Intuitively, a good estimate of the quantity δ[Θ_2; Θ_1] would be

    D[Θ_2; Θ_1] = 2n{ inf_{Θ_2} I^λ(Y/n, θ) − inf_{Θ_1} I^λ(Y/n, θ) }
                = (2/(λ(λ+1))) Σ_i Y_i{ [(Y_i/(nθ̂_i^{(2)}))^λ − 1] − [(Y_i/(nθ̂_i^{(1)}))^λ − 1] },

where θ̂^{(1)} and θ̂^{(2)} are the 'minimum divergence' estimators obtained by minimizing I^λ(Y/n, θ) with respect to θ over Θ_1 and Θ_2, respectively. Read and Cressie (1988) point out that D[Θ_2; Θ_1] is equal to the likelihood ratio statistic when λ = 0. Also, if we assume that [Θ_1] holds so that the second infimum is zero, we have that, for λ = 1,

    D[Θ_2; Θ_1] = Σ_i (Y_i − nθ̂_i^{(2)})²/(nθ̂_i^{(2)}),

which is asymptotically equivalent to

    D[Θ_2; Θ_1] = Σ_i (Y_i − nθ̂_i^{(0)})²/(nθ̂_i^{(0)}),

where θ̂^{(0)} is the maximum likelihood estimator of θ* over the space Θ_2. This is the Pearson chi-square statistic. Other asymptotically equivalent distance estimates are the Wald statistic and the Lagrange multiplier statistic. A small computational sketch of the power-divergence statistic is given below; we then illustrate these results via examples.

Example 2: Suppose that Y = (Y_11, Y_12, Y_21, Y_22)' is a multinomial vector. That is,

    (Y_11, Y_12, Y_21, Y_22)' ~ Mult(n, (π_11, π_12, π_21, π_22)'), with Σ_i Σ_j π_ij = 1.

Thus, the model that is known to contain the true parameter vector π* is [Θ], where

    Θ = {π : π'1_4 = 1, π_ij ∈ (0, 1), i, j = 1, 2}.

Notice that Θ is really a 3-dimensional subset (simplex) of (0, 1)^4, so that df[Θ] = 4 − 1 = 3.

We wish to test the independence hypotheses

    H_0 : π_11π_22 = π_12π_21  vs.  H_1 : π_11π_22 ≠ π_12π_21.

Writing the model of interest [Θ_0] as

    Θ_0 = {π ∈ Θ : π_11π_22 − π_12π_21 = 0}
        = {π : π'1_4 = 1, π_11π_22 − π_12π_21 = 0},

we can state the independence hypotheses as

    H_0 : π ∈ Θ_0  vs.  H_1 : π ∉ Θ_0.

Now, the model degrees of freedom can be found by subtracting the number of constraints implied by [Θ_0] from the total number of parameters, which is 4. Hence, df[Θ_0] = 4 − 2 = 2. Thus, the distance degrees of freedom, or measure of dimensional distance, is df(δ[Θ_0; Θ]) = 3 − 2 = 1.

Two distance (goodness-of-fit) statistics commonly used are the Pearson chi-square X² (λ = 1) and the likelihood ratio statistic G² (λ = 0). The forms of these two statistics are

    D[Θ_0; Θ] = X² = Σ_i Σ_j (y_ij − nπ̂_ij,0)²/(nπ̂_ij,0)

and

    D[Θ_0; Θ] = G² = 2 Σ_i Σ_j y_ij log(y_ij/(nπ̂_ij,0)),

where π̂_ij,0 is the ML estimate of π_ij assuming that model [Θ_0] holds.

Under the null hypothesis, i.e. if independence truly holds, the asymptotic distribution of both distance statistics, X² and G², is χ²(1).
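For a concrete 2 × 2 illustration (assuming numpy; the counts are hypothetical), recall that under independence the ML fitted probabilities are the products of the sample marginal proportions, π̂_ij,0 = p_{i+} p_{+j}:

    import numpy as np

    y = np.array([[35.0, 15.0],
                  [20.0, 30.0]])                 # hypothetical 2 x 2 table
    n = y.sum()

    p_row = y.sum(axis=1) / n                    # sample row margins p_{i+}
    p_col = y.sum(axis=0) / n                    # sample column margins p_{+j}
    fitted = n * np.outer(p_row, p_col)          # n * pi-hat_{ij,0} under [Theta_0]

    X2 = np.sum((y - fitted) ** 2 / fitted)      # Pearson chi-square
    G2 = 2.0 * np.sum(y * np.log(y / fitted))    # likelihood ratio statistic
    print(X2, G2)                                # both referred to chi-square(1)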






Example 3: Continuing with Example 2, consider the model [Θ_MH], where

    Θ_MH = {π : π'1_4 = 1, π_{1+} − π_{+1} = 0}.

This model implies that there is marginal homogeneity, i.e. the marginal distributions for both factors are the same.

We would like to test the hypotheses

    H_0 : π ∈ Θ_MH  vs.  H_1 : π ∉ Θ_MH.

The model degrees of freedom is df[Θ_MH] = 4 − 2 = 2, and so the distance degrees of freedom is df(δ[Θ_MH; Θ]) = 3 − 2 = 1. Once again, to illustrate what model degrees of freedom means, we observe that if [Θ_MH] holds and we specify two of the four probabilities, the remaining two are completely determined. Thus, we are free to estimate two of the probabilities based on the data. The other two are determined.
the data. The other two are determined.

Two frequently used estimates of the model distance, or model goodness of fit, are the likelihood ratio statistic G² and the McNemar statistic M². For 2 × 2 tables, the McNemar statistic and the Lagrange multiplier statistic are equivalent, since both are score statistics (Agresti, 1990; Aitchison & Silvey, 1958). The statistics take the following forms:

    D[Θ_MH; Θ] = G² = 2 Σ_i Σ_j y_ij log(y_ij/(nπ̂_ij,0))

and

    D[Θ_MH; Θ] = M² = (y_12 − y_21)²/(y_12 + y_21),

where the π̂_ij,0 in the first expression is the ML estimate of π_ij under the model [Θ_MH].

Under the null, i.e. when the marginal distributions are homogeneous, both of these statistics have asymptotic χ²(1) distributions.

It is important to note that, had the constraint π_{2+} − π_{+2} = 0 been added, the model would remain consistent but would be ill defined. For 2 × 2 tables, this additional constraint is exactly the same as the constraint π_{1+} − π_{+1} = 0.

2.3 Multivariate Polytomous Response Model Fitting

In this section, we describe ML model fitting for an integer valued random vector Y that is assumed to be distributed product-multinomially. We also investigate the asymptotic behavior of the ML estimators within the framework of constraint models. The models we will consider have form

    Θ_X = {ξ ∈ Θ : C log(Ae^ξ) = Xβ, Le^ξ = 0}

or equivalently, for appropriately chosen U,

    Θ_X = Θ_h = {ξ ∈ Θ : U'C log(Ae^ξ) = 0, Le^ξ = 0},

where e^ξ is the s × 1 mean vector of Y, a product-multinomial random vector, and the model parameter space Θ is of dimension s − q, where q is the number of identifiability constraints. We use the parameter ξ rather than μ = e^ξ for several reasons. One reason will become evident when we explore the asymptotic behavior of the ML estimator of ξ. It turns out that the random variable μ̂ − μ_0 is not bounded in probability, whereas ξ̂ − ξ_0 is. In fact, the random variable ξ̂ − ξ_0 converges in probability to 0. Another reason for using ξ rather than μ is that the procedure for deriving the maximum likelihood estimate of ξ is less sensitive to small (or zero) counts. The range of possible ξ values is the whole real line, while the range of possible μ values is restricted to the positive half of the real line. By using ξ, the problem of intermediate out-of-range values (e.g. negative cell mean estimates) is avoided.

As stated above, we initially assume that the vector of cell counts Y has a product-multinomial distribution. This is not overly restrictive since it will be shown that inferences based on maximum (multinomial) likelihood estimates are often the same as inferences based on maximum (Poisson) likelihood estimates. We will present some results in section 2.4 that allow us to determine when these inferences are indeed the same.

We also consider an alternative method for computing the maximum likelihood estimators and their asymptotic covariances. The method of Lagrange undetermined multipliers is well suited for maximum likelihood fitting of the models we will be considering. This is so because we will specify the models in terms of constraint equations, and the fitting problem will be one of maximizing a function, namely the log likelihood, subject to some constraints, namely that ξ ∈ Θ_h.







2.3.1 A General Multinomial Response Model




In this section we specify a class of models that is directly applicable

to Chapter 3 of this dissertation. Specifically, the models will be specified in

such a way so as to include the class of simultaneous models for the joint and

marginal distributions considered in Chapter 3.






Let the random vector Y = vec(Y_1, …, Y_K) denote a product-multinomial random vector, i.e.

    Y_i = (Y_i1, …, Y_iR)' ~ ind Mult(n_i, π_i), i = 1, …, K, K ≥ 1,

where the R × 1 vectors of cell probabilities satisfy π_i'1_R = 1, i = 1, …, K.

Consider the 1:1 reparameterization from {π_i} to {ξ_i}, where ξ_i = log(μ_i) = log(n_iπ_i) is an R × 1 vector of log means. Under this parameterization,

    Y_i ~ ind Mult(n_i, e^{ξ_i}/n_i), e^{ξ_i'}1_R = n_i, i = 1, …, K,

or

    Y_i ~ ind Mult(n_i, e^{ξ_i}/n_i), i = 1, …, K, e^{ξ'}(⊕_{i=1}^K 1_R) = n',    (2.3.1)

where n' = (n_1, …, n_K) is the 1 × K vector of multinomial indices.

The kernel of the log likelihood for Y, written as a function of ξ, is

    ℓ^{(M)}(ξ; y) = y'ξ, with e^{ξ'}(⊕1_R) = n'.    (2.3.2)

We now posit a model for the vector of log means. Let s = RK be the total number of cell means. Our objectives are to test the model goodness of fit and to estimate the s × 1 model parameter vector ξ, as well as any freedom parameters of interest. It will be assumed that the model [Θ_X] can be specified as

    Θ_X = {ξ ∈ R^s : C_1 log A_1e^ξ = X_1β_1, C_2 log A_2e^ξ = X_2β_2, Le^ξ = 0, e^{ξ'}(⊕1_R) = n'},    (2.3.3)

where

    C_i = ⊕_{j=1}^K C_ij, with C_ij = C_i1 of dimension q_i × m_i, i = 1, 2;
    A_i = ⊕_{j=1}^K A_ij, with A_ij = A_i1 of dimension m_i × R, i = 1, 2;
    L = ⊕_{j=1}^K L_j, with L_j = L_1 of dimension d × R;
    ξ = vec(ξ_1, …, ξ_K), with each ξ_i of dimension R × 1;
    X_i is Kq_i × p_i of full rank p_i, i = 1, 2;
    n is the K × 1 vector of multinomial indices;
    s = RK, the total number of cells.
Let us say that a model that can be specified as in (2.3.3) satisfies assumption (A1). That is,

(A1) The multinomial response model can be specified as in (2.3.3).

Notice that the K matrices comprising C_i are all identical; likewise for the matrices comprising A_i and L. This requires that the model not change across the K populations (K multinomials). Also, the two sets of freedom equations in (2.3.3) will allow us to use two different types of models for the expected cell means. This provides us with enough generality to fit many interesting models. For example, we may wish to simultaneously fit a linear-by-linear association loglinear model for the joint distribution and a cumulative logit model for the marginal distributions.

We can conveniently rewrite (2.3.3) as

    Θ_X = {ξ ∈ R^s : C log(Ae^ξ) = Xβ, Le^ξ = 0, e^{ξ'}(⊕1_R) = n'},    (2.3.4)

where A' = [A_1', A_2'], C = C_1 ⊕ C_2, X = X_1 ⊕ X_2, and β = vec(β_1, β_2).

Notice that the model [Θ_X] is specified in terms of both freedom equations and constraint equations. We will rewrite [Θ_X] as a constraint model, keeping in the back of our minds that the freedom parameters may be of interest also.

Let U be a K(q_1 + q_2) × u matrix of full column rank u such that U'X = 0. Here u is the dimension of the null space of X', N(X'), i.e. u = K(q_1 + q_2) − (p_1 + p_2). Since U can be chosen to be of full column rank, it follows that the columns of U form a basis for the null space of X'. Thus, the range space of U equals the null space of X', i.e. M(U) = N(X'). Multiplying the right and left hand sides of the freedom equation C log(Ae^ξ) = Xβ by U', we can rewrite (2.3.4) as

    Θ_h = {ξ ∈ R^s : U'C log(Ae^ξ) = 0, Le^ξ = 0, e^{ξ'}(⊕1_R) − n' = 0}.    (2.3.5)

Thus, Θ_X = Θ_h and the models [Θ_X] and [Θ_h] are one and the same.

At this point, we will assume that the constraints implied by the model [Θ_h] are nonredundant, so that the model is well defined. More specifically, let h'(ξ) = [(U'C log(Ae^ξ))', e^{ξ'}L'] be the 1 × (u + l) (l = Kd) vector of constraint functions. We will assume that the u + l + K constraints implied by h(ξ) = 0 and e^{ξ'}(⊕1_R) = n' are nonredundant. Notice that the constraints in h(ξ) = 0 do not include the identifiability constraints. We treat the identifiability constraints separately for reasons that will become apparent when we actually fit the models.

As stated previously, one of our primary objectives is to estimate the model parameters ξ and the freedom parameters β under the assumption that [Θ_X] (equivalently, [Θ_h]) holds. We will use the maximum likelihood estimates, which can be found by maximizing the log likelihood of Y subject to the constraint that [Θ_h] holds.






The (kernel of the) log likelihood under the product-multinomial assumption is shown in (2.3.2). It is

    ℓ^{(M)}(ξ; y) = y'ξ.

Thus, we are to maximize the function ℓ^{(M)}(ξ; y) = y'ξ subject to ξ ∈ Θ_h.


2.3.2 Maximum Likelihood Estimation

In this section we will discuss two procedurally different approaches to maximizing the log likelihood ℓ^{(M)}(ξ; y) subject to ξ ∈ Θ_h. The first approach, which is the more commonly used, requires that the model be specified entirely in terms of freedom equations. Oftentimes, when there are no identifiability constraints, the model can be completely specified as a freedom model. Models amenable to this approach include the Poisson loglinear model and the normal linear model. The second approach, Lagrange's method of undetermined multipliers, can be applied directly when the model is specified completely in terms of constraint equations. Since the product-multinomial model includes identifiability constraints, it can more easily be specified in terms of constraint equations. For this reason, this second method is the preferred choice. In the following sections, we discuss some additional features of these two methods.

Freedom Parameter Approach. One approach often used in simple situations, namely those situations in which the model can be specified completely in terms of freedom equations, is to write the parameter ξ as a function of the freedom parameter β and maximize ℓ^{(M)}(ξ(β); y) with respect to β. The vector ξ(β̂) will be in the model space, since the model was specified completely in terms of β. For example, if the model could be specified as

    Θ_X = {ξ ∈ R^s : log e^ξ = Xβ},

then ξ(β) = Xβ. Notice that the multinomial model, which includes the K constraints e^{ξ'}(⊕1_R) = n', is not directly amenable to this approach. In fact, we would have to reparameterize to a smaller set of s − K model parameters that account for the K constraints. This reparameterization results in an asymmetric treatment of the ξ_i and for that reason is deemed undesirable. On the other hand, the Poisson model considered below will often lend itself to this maximization approach, since the K constraints e^{ξ'}(⊕1_R) = n' are not included.

Computationally, the method of maximizing the log likelihood with respect to the freedom parameters is usually simple. Assuming the log likelihood is concave and differentiable in β, we need only solve for the root of the 'score equations', viz.

    s(β; y) = ∂ℓ(ξ(β); y)/∂β = 0.

Many of the asymptotic properties of the maximum likelihood estimator β̂ of β are derived by formally expanding the score vector s(β̂; y) about the true value β = β* in a linear Taylor expansion. That is,

    s(β̂; y) = s(β*; y) + [∂s(β*; y)/∂β'](β̂ − β*) + O(‖β̂ − β*‖²).    (2.3.6)

In particular, in many situations,

    0 = s(β̂; Y) = s(β*; Y) + [∂s(β*; Y)/∂β'](β̂ − β*) + O_p(1),

so that β̂ − β* has the same asymptotic distribution as

    −[∂s(β*; Y)/∂β']^{-1} s(β*; Y).

Subsequently, we will derive the asymptotic distribution of β̂ − β* in a different way. This alternative derivation of the asymptotic distribution of the freedom parameter estimate will shed new light on the relationship between the asymptotic behavior of the estimates under the two sampling assumptions, product-Poisson and product-multinomial.

Expression (2.3.6) also gives some indication of how one might numerically solve for β̂, the root of the score equation. A Newton-Raphson type algorithm is often used. This root finding algorithm involves the inversion of the derivative matrix ∂s(β; y)/∂β', which is usually of small dimension since the model is usually specified in terms of a small number of freedom parameters. In fact, the dimension of the derivative matrix will not be larger than s × s, which occurs when the model is saturated.
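As a concrete illustration of this root-finding scheme, the sketch below (assuming numpy; the model and data are hypothetical) fits a Poisson loglinear freedom model ξ = Xβ, for which s(β; y) = X'(y − e^{Xβ}) and ∂s/∂β' = −X'D(e^{Xβ})X:

    import numpy as np

    def poisson_loglinear_mle(X, y, tol=1e-10, max_iter=50):
        """Newton-Raphson for xi = X beta under product-Poisson sampling."""
        beta = np.zeros(X.shape[1])
        for _ in range(max_iter):
            mu = np.exp(X @ beta)
            score = X.T @ (y - mu)               # s(beta; y)
            if np.linalg.norm(score) < tol:
                break
            info = X.T @ (mu[:, None] * X)       # -ds/dbeta' = X' D(mu) X
            beta = beta + np.linalg.solve(info, score)
        return beta

    # hypothetical 2 x 2 table; columns: intercept, row effect, column effect
    y = np.array([35.0, 15.0, 20.0, 30.0])
    X = np.array([[1.0, 1.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
    print(np.exp(X @ poisson_loglinear_mle(X, y)))   # fitted cell means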

Constraint Equations Approach. In many situations, it may be difficult to specify a model in terms of only freedom parameters, or perhaps it is possible but the researcher would like to treat the model parameters symmetrically, which would necessitate an additional constraint equation. It also could be that the function C log Ae^ξ is not a 1:1 function of ξ, so that for given β we cannot solve for ξ explicitly. In any of these cases, we may not be able to use the aforementioned maximization approach.

In this section, we consider an alternative method for finding the ξ̂ that maximizes the function ℓ^{(M)}(ξ; y) subject to ξ ∈ Θ_h. The method we will use is Lagrange's method of undetermined multipliers. Aitchison and Silvey (1958, 1960) and Silvey (1959) provide much of the essential underlying theory related to this approach. Three positive features of this method are that (i) estimation of both ξ and β is possible, (ii) the method provides us with another enlightening way of deriving the asymptotic distribution of the freedom parameter estimators, and (iii) the method works quite generally. A negative feature of this approach is the computational difficulty. Computationally, the method becomes burdensome as s, the number of log mean parameters, and u + l + K, the number of constraints implied by the model, become large. In fact, the algorithm involves the inversion of an (s + u + l) × (s + u + l) matrix. On a positive note, this potentially very large matrix does have a simple form, and one can invoke some simple matrix algebra results to reduce the inversion problem to one of inverting matrices of dimensions (u + l) × (u + l) and s × s.

To best illustrate the difference in computational difficulty of the two methods, we consider the following normal linear model example. Let

    Y_i ~ ind N(μ_i = β_0 + β_1 x_i, σ²), i = 1, …, 100,

with σ² known and the x_i distinct. The log likelihood can easily be written as a function of β = (β_0, β_1)'. Maximizing this likelihood with respect to β involves working with a 2 × 2 matrix. On the other hand, we could equivalently specify the linear model in terms of the 98 constraints

    (μ_{i+1} − μ_i)/(x_{i+1} − x_i) = (μ_{i+2} − μ_{i+1})/(x_{i+2} − x_{i+1}), i = 1, 2, …, 98,

and use Lagrange's method. In this case, we would need to invert a matrix of dimension (s + u + l) × (s + u + l) = 198 × 198.

Even when we use the matrix algebra results that simplify the problem of working with the 198 × 198 matrix, we are still left with a formidable task. It seems that when s is large and the model is parsimonious, i.e. when u + l + K, the number of constraints, is large, the undetermined multiplier method may not be the method of choice. However, in time, as computer efficiency gains are realized, we predict that the scope of candidate models fit using this method will increase tremendously. In fact, at present, many categorical models can easily be fit using Lagrange's method. We now discuss in more detail how we can use the method of undetermined multipliers to fit models like [Θ_h] of (2.3.5).
We are to maximize the function ℓ^{(M)}(ξ; y) = y'ξ subject to the constraint ξ ∈ Θ_h, where

    Θ_h = {ξ ∈ R^s : U'C log(Ae^ξ) = 0, Le^ξ = 0, e^{ξ'}(⊕1_R) − n' = 0}
        = {ξ ∈ R^s : h(ξ) = 0, e^{ξ'}(⊕1_R) = n'}

and h'(ξ) = [log(e^{ξ'}A')C'U, e^{ξ'}L'].

Consider the Lagrangian objective function

    F(γ) = ℓ^{(M)}(ξ; y) + (e^{ξ'}(⊕1_R) − n')τ + h'(ξ)λ,

where γ = vec(ξ, τ, λ). The K × 1 vector τ and the (u + l) × 1 vector λ are called either 'Lagrange multipliers' or 'undetermined multipliers'.

Provided a maximum exists and that the Jacobian of [e^{ξ'}(⊕1_R) − n', h'(ξ)] is of full row rank u + l + K for all ξ ∈ Θ_h, we can solve for the maximum by solving the system of equations

    ∂F(γ)/∂γ = [ y + D(e^{ξ̂^{(M)}})(⊕1_R)τ̂^{(M)} + H(ξ̂^{(M)})λ̂^{(M)}
                 (⊕1_R')e^{ξ̂^{(M)}} − n
                 h(ξ̂^{(M)}) ] = 0,    (2.3.7)






where the matrix H(ξ) = ∂h'(ξ)/∂ξ. The Jacobian condition basically requires the constraints to be nonredundant, thereby making [Θ_h] a well-defined model.

From this point on, for notational convenience, the indices for the direct sum will be omitted unless they are different from 1 and K.

We now require the matrices of models [Θ_X] and [Θ_h] to satisfy some additional conditions. Let us assume that

(A2) Either C_i = I_{q_iK} or C_i(⊕1_{m_i}) = 0, i = 1, 2,

and

(A3) If C_i = I_{q_iK}, then M(X_i) ⊇ M(⊕1_{m_i}).

These assumptions require each C_i to be either a contrast matrix (rows summing to zero), a zero matrix, or the identity matrix. If C_i is the identity matrix, it will be required that there exist a set of columns in X_i that spans a space containing the range space of ⊕1_{m_i}. For most models of interest these conditions are met. For example, any of the logit type models, such as cumulative or multiple logit models, can be specified with C being a contrast matrix. For loglinear models, condition (A3) is met whenever the model includes a parameter for each of the K multinomials.

The following lemma will be useful in showing that the maximum likelihood estimates of ξ and β are equivalent under both sampling schemes, product-Poisson and product-multinomial. The lemma will also enable us to reduce the number of equations in (2.3.7) that must be simultaneously solved when computing the maximum (multinomial) likelihood estimators.





LEMMA 2.3.1. If the matrices of models [Θ_X] and [Θ_h] satisfy (A1), (A2), and (A3), then, provided the model holds,

    (⊕1_R')H(ξ) = 0.

Proof. Using matrix derivatives (MacRae, 1974; Magnus and Neudecker, 1988), it follows that

    H(ξ) = [D(e^ξ)A'D^{-1}(Ae^ξ)C'U, D(e^ξ)L'].

Thus,

    (⊕1_R')H(ξ) = [(⊕e^{ξ_i'})A'D^{-1}(Ae^ξ)C'U, (⊕e^{ξ_i'})L']
                = [(⊕e^{ξ_i'})[A_1', A_2']D^{-1}(Ae^ξ)(C_1 ⊕ C_2)'U, ⊕e^{ξ_i'}L_1']
                = [[(⊕e^{ξ_i'})A_1'D^{-1}(A_1e^ξ), (⊕e^{ξ_i'})A_2'D^{-1}(A_2e^ξ)](C_1 ⊕ C_2)'U, 0]
                = [[(⊕1_{m_1}')C_1', (⊕1_{m_2}')C_2']U, 0]
                = [0, 0]
                = 0.

The third equality follows since the model holding implies that ⊕e^{ξ_i'}L_1' = 0; the fourth uses (⊕e^{ξ_i'})A_j'D^{-1}(A_je^ξ) = ⊕1_{m_j}'. The equality [[(⊕1_{m_1}')C_1', (⊕1_{m_2}')C_2']U, 0] = [0, 0] can be seen via the following argument.

If both C_i's are contrast matrices, or zero matrices, then (A2) implies that the matrix [(⊕1_{m_1}')C_1', (⊕1_{m_2}')C_2'] is the zero matrix. On the other hand, if both C_1 and C_2 are identity matrices, then since the columns of U span the null space of X', (A3) implies that the columns of U span a set contained in the null space of

    [⊕1_{m_1}, ⊕1_{m_2}]',

so that [(⊕1_{m_1}'), (⊕1_{m_2}')]U = 0. Any other combination of C_1 and C_2 can also be seen to result in the matrix equaling zero. □
The following theorem gives conditions under which we can find the ML estimators of ξ by solving a reduced set of equations. The smaller system of equations no longer includes the identifiability constraint equations.

THEOREM 2.3.1. Let vec(ξ̂^{(M)}, τ̂^{(M)}, λ̂^{(M)}) be the solution to (2.3.7). Assuming that (A1), (A2), and (A3) hold, the sub-vector vec(ξ̂^{(M)}, λ̂^{(M)}) is the solution to the reduced set of s + u + l equations

    [ y − e^{ξ̂^{(M)}} + H(ξ̂^{(M)})λ̂^{(M)}
      h(ξ̂^{(M)}) ] = 0.    (2.3.8)

Proof. Premultiplying the first set of equations in (2.3.7) by ⊕1_R', we arrive at

    (⊕1_R')y + (⊕1_R')D(e^{ξ̂^{(M)}})(⊕1_R)τ̂^{(M)} + (⊕1_R')H(ξ̂^{(M)})λ̂^{(M)} = 0.    (2.3.9)

Now, (⊕1_R')y = n and (⊕1_R')D(e^{ξ̂^{(M)}}) = ⊕e^{ξ̂_i^{(M)'}}. Also, since ξ̂^{(M)} ∈ Θ_h, it must be that (⊕e^{ξ̂_i^{(M)'}})(⊕1_R) = D(n), the diagonal matrix with the multinomial indices on the diagonal. Further, by Lemma 2.3.1, (⊕1_R')H(ξ̂^{(M)}) = 0. Therefore, (2.3.9) can be rewritten as

    n + D(n)τ̂^{(M)} = 0,

which implies that τ̂^{(M)} = −1_K. Now, since the identifiability constraints have been explicitly accounted for when solving for τ̂^{(M)}, we can replace τ̂^{(M)} in (2.3.7) by −1_K and omit the identifiability constraints; note that with this substitution the first block of (2.3.7) becomes y + D(e^ξ)(⊕1_R)(−1_K) + H(ξ)λ = y − e^ξ + H(ξ)λ. Thus, vec(ξ̂^{(M)}, λ̂^{(M)}) is the solution to the reduced set of equations

    [ y − e^{ξ̂^{(M)}} + H(ξ̂^{(M)})λ̂^{(M)}
      h(ξ̂^{(M)}) ] = 0.

This is what we set out to show. □

Before detailing the iterative scheme used for solving (2.3.8), we will explore the asymptotic behavior of the estimator θ̂^{(M)} = vec(ξ̂^{(M)}, λ̂^{(M)}) within the framework of constraint models.

2.3.3 Asymptotic Distribution of Product-Multinomial ML Estimators

In what follows, we will assume that K, the number of identifiability constraints, is some fixed integer, K ≥ 1. We also will assume that the asymptotics hold as n* = min{n_i} approaches infinity and that n* ~ n_i, i = 1, …, K. That is, we assume that the asymptotic approximations hold as each of the multinomial indices gets large at the same rate.

The derivation of the asymptotic distribution of θ̂^{(M)} will follow closely that of Aitchison and Silvey (1958). Briefly, Aitchison and Silvey show that if the score vector is o_p(n) and the constraints are such that the derivative matrices H(θ) and ∂H'(θ)/∂θ have elements that are bounded functions, then, provided certain mild regularity conditions hold, the maximum likelihood estimator θ̂ is an n^{-1/2}-consistent estimator of θ_0 and λ̂ is an n^{1/2}-consistent estimator of 0. They show that the joint distribution of (n^{1/2}(θ̂ − θ_0), n^{-1/2}λ̂) is multivariate normal with zero mean and covariance matrix

    [ B^{-1} − B^{-1}H(H'B^{-1}H)^{-1}H'B^{-1}    0
      0                                          (H'B^{-1}H)^{-1} ],

where B is the information matrix and H is the derivative of the constraint function.

In our application, however, there are some minor changes. With the parameterization we use, the information matrix is zero, since the (multinomial) log likelihood (2.3.2) is linear in the parameter ξ. This happens because the identifiability constraints e^{ξ'}(⊕1_R) = n' are ignored, to preserve symmetry, when differentiating. Also, in our parameterization, the constraints are in terms of e^ξ, the components of which are e^{ξ_ij} = n_iπ_ij. Thus, the constraints and the corresponding derivative matrices may not be bounded. For example, a typical constraint is of the form Le^ξ = 0. It follows that the components of Le^ξ and the derivatives are increasing without bound as the multinomial indices are allowed to increase without bound.

Fortunately, we can still use the results of Aitchison and Silvey (1958) by replacing the matrix H and the vector λ/n of Aitchison and Silvey by our H/n* and λ, where n* = min{n_i}. The zero information problem can be solved by identifying the vector Y − e^ξ as the 'score vector'. It is pointed out that, in this case, the asymptotic variance of ⊕D^{-1/2}(n_i1_R) times the score vector is not equal to the negative derivative matrix D(π_0), but instead is equal to D(π_0) − ⊕π_{0i}π_{0i}'. This happens because the components of Y are not independent; Y is product-multinomial. Using this reparameterization, all of the necessary regularity assumptions required by Aitchison and Silvey (1958) hold.

As previously mentioned, Aitchison and Silvey show that λ̂ is an n^{1/2}-consistent estimator of 0. With our parameterization, having replaced λ/n by λ, it follows that λ̂^{(M)} converges in probability to 0, with n*^{1/2}λ̂^{(M)} bounded in probability. We now derive the asymptotic distribution of θ̂^{(M)}.





Define the stochastic function g by

    g(θ; Y) = [ Y − e^ξ + H(ξ)λ
                h(ξ) ].

The maximum likelihood estimator θ̂^{(M)} is the solution to g(θ; Y) = 0. Under our parameterization, using the results of Aitchison and Silvey (1958), we have that each of the following holds:

    e^{ξ̂^{(M)}} − e^{ξ_0} = D(e^{ξ_0})(ξ̂^{(M)} − ξ_0) + O_p(1),

    H(ξ̂^{(M)}) = H(ξ_0) + O_p(n*^{1/2}),

    h(ξ̂^{(M)}) = h(ξ_0) + H'(ξ_0)(ξ̂^{(M)} − ξ_0) + O_p(1)
               = H'(ξ_0)(ξ̂^{(M)} − ξ_0) + O_p(1),

and

    H(ξ̂^{(M)})λ̂^{(M)} = H(ξ_0)λ̂^{(M)} + O_p(1).

Thus,

    0 = g(θ̂^{(M)}; Y) = [ Y − e^{ξ̂^{(M)}} + H(ξ̂^{(M)})λ̂^{(M)}
                          h(ξ̂^{(M)}) ]

can be rewritten as

    [ Y − e^{ξ_0}
      0 ] = [ D(e^{ξ_0})   −H(ξ_0)
              −H'(ξ_0)     0 ] [ ξ̂^{(M)} − ξ_0
                                 λ̂^{(M)} ] + O_p(1).

Therefore, it follows that

    [ ⊕D^{-1/2}(n_i1_R)(Y − e^{ξ_0})
      0 ] = [ D(π_0)       −n*^{-1}H
              −n*^{-1}H'   0 ] n*^{1/2}(θ̂^{(M)} − θ_0) + o_p(1),    (2.3.10)

since n* ~ n_i, i = 1, …, K, and π_0 = (⊕D^{-1}(n_i1_R))e^{ξ_0}.


Now, the random variable ⊕D^{-1/2}(n_i1_R)(Y − e^{ξ_0}) is a vector of normalized sample proportions, so that

    [ ⊕D^{-1/2}(n_i1_R)(Y − e^{ξ_0})
      0 ]

has an asymptotic normal distribution with zero mean and covariance matrix

    [ D(π_0) − ⊕π_{0i}π_{0i}'   0
      0                         0 ].

Therefore, by an extension of a theorem of Cramer (1949) and by equation (2.3.10), it follows that n*^{1/2}(θ̂^{(M)} − θ_0) = n*^{1/2} vec(ξ̂^{(M)} − ξ_0, λ̂^{(M)}) has an asymptotic normal distribution with mean zero and covariance

    [ D(π_0)      −n*^{-1}H ]^{-1} [ D(π_0) − ⊕π_{0i}π_{0i}'  0 ] [ D(π_0)      −n*^{-1}H ]^{-1}
    [ −n*^{-1}H'  0        ]       [ 0                        0 ] [ −n*^{-1}H'  0        ].    (2.3.11)

This covariance matrix is shown in the appendix to have the simple form

    [ M_1  0
      0    M_2 ],

where

    M_1 = D^{-1}(π_0) − D^{-1}(π_0)H(H'D^{-1}(π_0)H)^{-1}H'D^{-1}(π_0) − ⊕_{i=1}^K 1_R1_R'

and

    M_2 = n*²(H'D^{-1}(π_0)H)^{-1}.

Finally, using the fact that n* ~ n_i, i = 1, …, K, we can judiciously replace n* by the appropriate n_i to arrive at a simple, asymptotically equivalent expression for the asymptotic covariance of θ̂^{(M)} = vec(ξ̂^{(M)}, λ̂^{(M)}). It is

    [ D^{-1} − D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1} − ⊕n_i^{-1}1_R1_R'   0
      0                                                             (H'D^{-1}H)^{-1} ],    (2.3.12)

where D = D(μ_0) = D(e^{ξ_0}) and H = H(ξ_0).

2.3.4 Lagrange's Method-The Algorithm

In this section, we give details of how one can actually fit the models of (2.3.4), or equivalently (2.3.5). We show how Lagrange's undetermined multipliers method can be used in conjunction with a modified Newton-Raphson iterative scheme to compute the ML estimators and their asymptotic covariances. We will assume that the model assumptions (A1), (A2), and (A3) hold. This section includes an outline of the algorithm used in the FORTRAN program 'mle.restraint'.

Recall that our objective is to find the ξ̂^{(M)} ∈ Θ_X, where

    Θ_X = {ξ ∈ R^s : C log(Ae^ξ) = Xβ, Le^ξ = 0, e^{ξ'}(⊕1_R) = n'},

that maximizes the multinomial log likelihood

    ℓ^{(M)}(ξ; y) = y'ξ.

Since the assumptions (A1), (A2), and (A3) hold, we see by Theorem 2.3.1 that our problem is reduced to one of solving the system of equations (2.3.8), i.e. to find the ML estimator θ̂^{(M)} = vec(ξ̂^{(M)}, λ̂^{(M)}) we must simultaneously solve the system of s + u + l equations

    g(θ) = [ y − e^ξ + H(ξ)λ
             h(ξ) ] = 0,

where the (u + l) × 1 vector h and the s × (u + l) matrix H are defined as follows:

    h(ξ) = [ U'C log(Ae^ξ)
             Le^ξ ]

and

    H(ξ) = ∂h'(ξ)/∂ξ.

It will be shown in section 2.4 that g(θ) is actually the derivative of the Lagrangian objective function under the product-Poisson sampling assumption.
assumption.

The iterative scheme used in the FORTRAN program 'mle.restraint' is a modified Newton-Raphson algorithm. The algorithm can be sketched as follows.

(1) Find a starting value θ^{(0)} for θ.
(2) Replace θ^{(v)} by θ^{(v+1)} = θ^{(v)} − G^{-1}(θ^{(v)})g(θ^{(v)}).    (2.3.13)
(3) If ‖g(θ^{(v+1)})‖ > tol, go to (2); else stop.

The matrix G(θ) used in step (2) is actually

    G(θ) = [ −D(e^ξ)  H(ξ)
             H'(ξ)    0 ],

which satisfies G(θ) + O_p(n*^{1/2}) = ∂g(θ)/∂θ', and the inverse of G(θ) is of the very simple form (see Aitchison and Silvey, 1958, or Rao, 1974)

    G^{-1}(θ) = −[ D^{-1} − D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1}    −D^{-1}H(H'D^{-1}H)^{-1}
                   −(H'D^{-1}H)^{-1}H'D^{-1}                   −(H'D^{-1}H)^{-1} ],    (2.3.14)

where D = D(e^ξ). Since we use G(θ) in place of the Hessian matrix, the procedure is a modification of the Newton-Raphson method. Haber (1985a) used the more complicated Hessian matrix.

Notice that the inversion of G, which may be performed at each iteration, is not nearly as difficult as inverting a general matrix of dimension (s + u + l) × (s + u + l). First of all, in view of (2.3.14), to obtain the inverse of the partitioned matrix G, we need only invert the matrices D and H'D^{-1}H, which are of dimensions s × s and (u + l) × (u + l). Secondly, the inversion of D is simple, since D is a diagonal matrix with e^ξ on the diagonal. Hence, the most formidable task in the inversion process is the inversion of the symmetric positive definite matrix H'D^{-1}H. There are many efficient ways to invert large symmetric positive definite matrices.
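The following sketch (assuming numpy) shows how one iteration of (2.3.13) can be organized around these two inversions, using a Cholesky factorization of H'D^{-1}H; h_fun and H_fun are user-supplied routines, named here for illustration, returning h(ξ) and H(ξ):

    import numpy as np

    def lagrange_newton_step(xi, lam, y, h_fun, H_fun):
        """One update of (2.3.13), written around the two 'small' inversions."""
        mu = np.exp(xi)                      # D = D(e^xi) is diagonal with mu
        H = H_fun(xi)
        g_top = y - mu + H @ lam             # first block of g(theta), eq. (2.3.8)
        g_bot = h_fun(xi)                    # second block: h(xi)

        Dinv_H = H / mu[:, None]             # D^{-1} H  (no s x s inverse formed)
        S = H.T @ Dinv_H                     # H' D^{-1} H, symmetric positive definite
        c = np.linalg.cholesky(S)

        def solve_S(b):                      # solve S x = b via the Cholesky factor
            return np.linalg.solve(c.T, np.linalg.solve(c, b))

        # the update theta - G^{-1} g, using the partitioned inverse (2.3.14)
        w = solve_S(Dinv_H.T @ g_top)        # S^{-1} H' D^{-1} g_top
        v = solve_S(g_bot)                   # S^{-1} h(xi)
        xi_new = xi + g_top / mu - Dinv_H @ (w + v)
        lam_new = lam - (w + v)
        return xi_new, lam_new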

Upon convergence of the algorithm (2.3.13), estimates of the asymptotic covariances of ξ̂^{(M)} and λ̂^{(M)} are readily calculable. Write G^{-1}(θ) of (2.3.14) as

    G^{-1} = −[ P   Q
                Q'  R ],

where

    P = D^{-1} − D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1},
    Q = −D^{-1}H(H'D^{-1}H)^{-1},
    R = −(H'D^{-1}H)^{-1}.

By (2.3.12), the asymptotic covariance of θ̂^{(M)} = vec(ξ̂^{(M)}, λ̂^{(M)}) can be estimated by

    var(θ̂^{(M)}) = [ P̂ − ⊕n_i^{-1}1_R1_R'   0
                     0                       −R̂ ],

where P̂ and R̂ denote P and R evaluated at ξ̂^{(M)}. Variance estimates for other continuous functions of ξ̂^{(M)}, such as μ̂^{(M)} = e^{ξ̂^{(M)}} and β̂^{(M)} = (X'X)^{-1}X'C log(Ae^{ξ̂^{(M)}}), can be found by invoking the delta method. For example,

    var(μ̂^{(M)}) ≈ D(e^{ξ̂^{(M)}}) var(ξ̂^{(M)}) D(e^{ξ̂^{(M)}})

and

    var(β̂^{(M)}) ≈ (X'X)^{-1}X'CD^{-1}(Aμ̂^{(M)}) A var(μ̂^{(M)}) A' D^{-1}(Aμ̂^{(M)})C'X(X'X)^{-1}.

Evidently, Lagrange's method of undetermined multipliers provides us with a convenient procedure for maximum likelihood fitting of models in a very general class of parametric models for multivariate polytomous data, with covariates possible. We now briefly outline the steps needed to perform the iterations of (2.3.13).

Computing U. The first thing we must do is write the freedom model (2.3.4), which can easily be input by the user, as a constraint model (2.3.5). Therefore, we must compute a full column rank matrix U that satisfies U'X = 0. The method we use to find U is attributed to Haber (1985b).
method we use to find U is attributed to Haber (1985b).

Using the notation of 'mle.restraint', let X be a full column rank matrix of dimension q × r. Let u = q − r be the dimension of the null space of X'. Further, the matrices A and C of (2.3.4) have dimensions m × s and q × m, respectively. The relationship between these dimension variables and those used in sections 2.3.1 and 2.3.2 is as follows:

    q = K(q_1 + q_2),
    r = p_1 + p_2,
    m = K(m_1 + m_2).

We use the variables q, r, and m for notational convenience.

Consider the matrix U* = I_q − X(X'X)^{-1}X'. This q × q matrix is of rank u = q − r and satisfies the property

    U*'X = 0.

Let W denote a q × u matrix with random elements. Specifically,

    W_ij ~ Uniform(0, 100), i = 1, …, q, j = 1, …, u.

It follows that the matrix W is of full column rank with probability one, and hence that the q × u matrix U = U*W is of full column rank u with probability one. But the matrix U satisfies

    U'X = W'U*'X = W'0 = 0.

Therefore, at least with probability one, we have found a full column rank matrix U that satisfies the property U'X = 0. Using this U, we are able to write the freedom model (2.3.4) as a constraint model (2.3.5).
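A direct transcription of this device (assuming numpy; the function name and the small illustrative X are ours):

    import numpy as np

    def null_space_basis(X, seed=0):
        """Random full-column-rank U with U'X = 0, via U = U* W."""
        q, r = X.shape
        u = q - r                                # dim of null space of X'
        U_star = np.eye(q) - X @ np.linalg.inv(X.T @ X) @ X.T
        rng = np.random.default_rng(seed)
        W = rng.uniform(0.0, 100.0, size=(q, u)) # full column rank w.p. 1
        return U_star @ W

    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    U = null_space_basis(X)
    print(np.allclose(U.T @ X, 0.0))             # True
    print(np.linalg.matrix_rank(U))              # u = q - r = 2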

Computing h(ξ). We write the constraint model of (2.3.5) as

    {ξ ∈ R^s : h(ξ) = 0, e^{ξ'}(⊕1_R) = n'},    (2.3.15)

where the constraint function h is defined as

    h(ξ) = [ U'C log(Ae^ξ)
             Le^ξ ].

Computing g(θ). Notice that, since (A1), (A2), and (A3) hold, the identifiability constraints present in the product-multinomial model (2.3.4) can be accounted for explicitly. It will follow, by results of section 2.4, that under either sampling scheme, product-Poisson or product-multinomial, the maximum likelihood estimators for ξ and λ can be found by solving the equation

    g(θ) = [ y − e^ξ + H(ξ)λ
             h(ξ) ] = 0,    (2.3.16)

where the matrix H is the derivative of h' with respect to ξ.

Computing H(ξ). We will use matrix derivative results of MacRae (1974) to find the matrix of derivatives of the constraint function h'(ξ):

    H(ξ) = ∂h'(ξ)/∂ξ = ∂/∂ξ [log(e^{ξ'}A')C'U, e^{ξ'}L']
         = [D(e^ξ)A'D^{-1}(Ae^ξ)C'U, D(e^ξ)L'].

The equality follows upon using the matrix version of the chain rule. Notice that

    ∂(log(e^{ξ'}A')C'U)/∂ξ = (∂e^{ξ'}/∂ξ)(∂ log(e^{ξ'}A')/∂e^ξ)C'U = D(e^ξ)A'D^{-1}(Ae^ξ)C'U

and that

    ∂(e^{ξ'}L')/∂ξ = D(e^ξ)L'.
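In code, H(ξ) can be assembled directly from this expression. The sketch below (assuming numpy, with conformable A, C, U, and L supplied by the caller) exploits the fact that D(e^ξ) and D^{-1}(Ae^ξ) are diagonal, so no explicit diagonal matrices are formed:

    import numpy as np

    def constraint_jacobian(xi, A, C, U, L):
        """H(xi) = [ D(e^xi) A' D^{-1}(A e^xi) C' U ,  D(e^xi) L' ]."""
        mu = np.exp(xi)                          # e^xi, the cell means
        Amu = A @ mu                             # A e^xi
        left = mu[:, None] * (A.T @ ((C.T @ U) / Amu[:, None]))
        right = mu[:, None] * L.T
        return np.hstack([left, right])          # s x (u + l)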

Computing G(θ). The iterative scheme (2.3.13) used to solve the system of equations (2.3.16) is actually a slight modification of the Newton-Raphson algorithm. It is a modification because we do not use the derivative matrix G* = ∂g(θ)/∂θ' to adjust at each iteration, as Haber (1985a) did, but rather a simpler matrix G that is related to G* by G* = G + O_p(n*^{1/2}). The derivative matrix G* can be computed as follows:

    G*(θ) = ∂g(θ)/∂θ' = [ −D(e^ξ) + ∂(H(ξ)λ)/∂ξ'   H(ξ)
                          H'(ξ)                     0 ]
          = G(θ) + [ ∂(H(ξ)λ)/∂ξ'   0
                     0               0 ].

The matrix

    ∂(H(ξ)λ)/∂ξ'

is of order O_p(n*^{1/2}) when it is evaluated at θ̂ = vec(ξ̂, λ̂), since

    ∂H(ξ)/∂ξ = O_p(n*)

and

    λ̂ = O_p(n*^{-1/2}).

It follows that the matrix G, which is much simpler to invert than G*, can be used to adjust the estimate at each iteration.
Computing the inverse of G. Although the matrix G is of dimension (s + u + l) × (s + u + l), which may be very large in practice, its inverse is relatively simple to calculate. The inverse of the partitioned matrix

    G = [ −D  H
          H'  0 ]

is shown by Aitchison and Silvey (1958) to have the form

    G^{-1} = −[ D^{-1} − D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1}    −D^{-1}H(H'D^{-1}H)^{-1}
                −(H'D^{-1}H)^{-1}H'D^{-1}                   −(H'D^{-1}H)^{-1} ].

Therefore, only the matrices D and H'D^{-1}H, which are of dimensions s × s and (u + l) × (u + l), need to be inverted. The inverse of D is easily calculated, since D is a diagonal matrix with e^ξ on the diagonal. The inverse of H'D^{-1}H, a symmetric positive definite matrix, can be found quite easily, even when u + l, the number of constraints, is large. It should be pointed out that when s, the total number of cell means, is large, the number of constraints u + l may be large and of the same order as s. This will be the case for parsimonious models, i.e. those models with many constraints relative to the number of model parameters.

One could choose to invert the matrix G a limited number of times to mitigate the computational burden. In fact, in their 1958 and 1960 papers, Aitchison and Silvey advocate an iterative method whereby the inverse of G is computed only two times: once at the initial iteration and again at the final iteration, upon convergence. We feel, however, that in this special case, in which the matrix G has a particularly simple form, the inverse can be computed at each iteration. Along with increased computing power, there are many efficient algorithms for inverting large symmetric positive definite matrices.

2.4 Comparison of Product-Multinomial and Product-Poisson Estimators

We begin this section by introducing notation for a product-Poisson random vector.

The s × 1 random vector Y = vec(Y_1, …, Y_K) is said to be product-Poisson if

    Y_ij ~ ind Poisson(e^{ξ_ij}), i = 1, …, K, j = 1, …, R.    (2.4.1)

Suppose that the s = RK log means {ξ_ij} satisfy the model [Θ^{(P)}], where

    Θ^{(P)} = {ξ ∈ R^s : C log(Ae^ξ) = Xβ, Le^ξ = 0}

or equivalently, for appropriately chosen U,

    Θ^{(P)} = Θ_h^{(P)} = {ξ ∈ R^s : U'C log(Ae^ξ) = 0, Le^ξ = 0}.    (2.4.2)

This model implies all the same constraints on ξ as the product-multinomial model [Θ_h] of (2.3.5), with one exception: the identifiability constraints, e^{ξ'}(⊕1_R) = n', are not included.

Denote the maximum likelihood estimators computed assuming (2.4.1) and (2.4.2) by ξ̂^{(P)} and β̂^{(P)}. Similarly, denote the maximum likelihood estimators computed assuming (2.3.1) and (2.3.5) by ξ̂^{(M)} and β̂^{(M)}.

Recall that the three product-multinomial model assumptions are

(A1) The multinomial response model can be specified as in (2.3.3). That is, the model parameter space can be represented as

    Θ_X = {ξ ∈ R^s : C_1 log A_1e^ξ = X_1β_1, C_2 log A_2e^ξ = X_2β_2, Le^ξ = 0, e^{ξ'}(⊕1_R) = n'},

where

    C_i = ⊕_{j=1}^K C_ij, with C_ij = C_i1 of dimension q_i × m_i, i = 1, 2;
    A_i = ⊕_{j=1}^K A_ij, with A_ij = A_i1 of dimension m_i × R, i = 1, 2;
    L = ⊕_{j=1}^K L_j, with L_j = L_1 of dimension d × R;
    ξ = vec(ξ_1, …, ξ_K), with each ξ_i of dimension R × 1;
    X_i is Kq_i × p_i of full rank p_i, i = 1, 2;
    n is the K × 1 vector of multinomial indices;
    s = RK, the total number of cells.

(A2) Either C_i = I_{q_iK} or C_i(⊕1_{m_i}) = 0, i = 1, 2,

and

(A3) If C_i = I_{q_iK}, then M(X_i) ⊇ M(⊕1_{m_i}).

The following theorem states that the maximum likelihood estimators for ξ, and hence μ, are the same under the product-multinomial sampling scheme of (2.3.1) and the product-Poisson sampling scheme of (2.4.1), provided that the three assumptions (A1), (A2), and (A3) hold.

THEOREM 2.4.1. If the model (2.3.4) satisfies assumptions (A1), (A2), and (A3), then

    ξ̂^{(P)} = ξ̂^{(M)} and β̂^{(P)} = β̂^{(M)}.

That is, the maximum likelihood estimators of β and ξ are the same under both sampling schemes, product-Poisson (2.4.1) and product-multinomial (2.3.1).

Proof: Under the product-Poisson assumptions (2.4.1) and (2.4.2), the kernel of the log likelihood is

    ℓ^{(P)}(ξ; y) = y'ξ − e^{ξ'}1_s.

Therefore, letting θ = vec(ξ, λ), the corresponding Lagrangian objective function is

    Q(θ) = y'ξ − e^{ξ'}1_s + h'(ξ)λ,

and so to find the maximum (Poisson) likelihood estimator θ̂^{(P)} = vec(ξ̂^{(P)}, λ̂^{(P)}) we must solve the system of equations

    ∂Q(θ)/∂θ = [ y − e^{ξ̂^{(P)}} + H(ξ̂^{(P)})λ̂^{(P)}
                 h(ξ̂^{(P)}) ] = 0.    (2.4.3)

The conclusion of the theorem now follows, since the equations (2.3.8) of Theorem 2.3.1 and (2.4.3) yield exactly the same solutions, and

    β̂^{(P)} = (X'X)^{-1}X'C log(Ae^{ξ̂^{(P)}}) = (X'X)^{-1}X'C log(Ae^{ξ̂^{(M)}}) = β̂^{(M)}. □

As a corollary to Theorem 2.4.1 we have

COROLLARY 2.4.1. Provided the assumptions of Theorem 2.4.1 hold, the estimated undetermined multipliers are invariant with respect to sampling scheme, i.e.

    λ̂^{(M)} = λ̂^{(P)}.

Proof: The proof follows immediately upon noting that equations (2.3.8) and (2.4.3) yield exactly the same solutions. □

A remark is in order. Basically, Theorem 2.4.1 enables us to conclude that the necessary and sufficient condition of Birch (1963) holds. That condition is that the model be specified so that the Poisson ML estimators necessarily satisfy the identifiability constraints that are required for the multinomial model.

We now explore the asymptotic behavior of the (Poisson) ML estimator θ̂^{(P)} = vec(ξ̂^{(P)}, λ̂^{(P)}). For the product-Poisson assumptions (2.4.1) and (2.4.2), we can obtain the asymptotic distribution of θ̂^{(P)} by formally replacing n* = min{n_i} by μ* = min{e^{ξ_ij}} and using the same arguments as those used to derive the asymptotic distribution of θ̂^{(M)}.

Jørgensen (1989) discusses limiting distributions for Poisson random variables as the mean parameters, or equivalently μ*, go to infinity. In this case,

    g(θ_0; Y) = [ Y − e^{ξ_0}
                  0 ]

has an asymptotic normal distribution with mean zero and asymptotic covariance

    [ D(μ_0)  0
      0       0 ].

Using arguments similar to those used in the multinomial case, it follows that

    [ Y − e^{ξ_0}
      0 ] = [ D(e^{ξ_0})   −H(ξ_0)
              −H'(ξ_0)     0 ] [ ξ̂^{(P)} − ξ_0
                                 λ̂^{(P)} ] + O_p(1).

We conclude, as in the product-multinomial case, that θ̂^{(P)} − θ_0 has an asymptotic normal distribution with mean zero and asymptotic covariance

    [ D(μ_0)  −H ]^{-1} [ D(μ_0)  0 ] [ D(μ_0)  −H ]^{-1}
    [ −H'     0  ]      [ 0       0 ] [ −H'     0  ].

But this can again be simplified, as it was in the multinomial case. It can be shown that the asymptotic covariance can be rewritten as

    [ D^{-1} − D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1}   0
      0                                          (H'D^{-1}H)^{-1} ],    (2.4.4)

where D = D(μ_0) = D(e^{ξ_0}) and H = H(ξ_0).
Comparison of the Asymptotic Distributions. Provided assumptions (A1), (A2), and (A3) hold, both θ̂^{(P)} − θ_0 and θ̂^{(M)} − θ_0 have asymptotic normal distributions with zero means and respective covariances given in (2.4.4) and (2.3.12). Therefore, we have the following interesting results.

Result 1. The asymptotic covariances of ξ̂^{(P)} and ξ̂^{(M)} are related by

    var(ξ̂^{(M)}) = var(ξ̂^{(P)}) − ⊕n_i^{-1}1_R1_R'.    (2.4.5)

Result 2. The asymptotic distributions of λ̂^{(P)} and λ̂^{(M)} are identical, and it follows that the Lagrange multiplier statistic, which has form

    LM = λ̂'(var(λ̂))^{-1}λ̂ = λ̂'(H'D^{-1}H)λ̂,

is invariant with respect to the sampling scheme.
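In code, Result 2 amounts to the following small computation (a sketch assuming numpy; λ̂, H, and μ̂ may come from either fit, and the function name is ours):

    import numpy as np

    def lagrange_multiplier_stat(lam_hat, H, mu_hat):
        """LM = lam' (H' D^{-1} H) lam, with D = D(mu_hat) diagonal."""
        S = H.T @ (H / mu_hat[:, None])      # H' D^{-1} H
        return float(lam_hat @ S @ lam_hat)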

Result 3.

    var(μ̂^{(M)}) = var(μ̂^{(P)}) − ⊕_{i=1}^K n_i^{-1} μ_i^{(P)}μ_i^{(P)'}.    (2.4.6)

Result 4.

    var(β̂^{(M)}) = var(β̂^{(P)}) − Λ,    (2.4.7)

where

    Λ = (X'X)^{-1}X'C(⊕n_i^{-1}11')C'X(X'X)^{-1}

(the 1 vectors conforming with the column partition of C) and Λ is nonnegative definite.

The notation var(·) used in these results denotes the asymptotic variance. This is important since the finite sample variances may not even exist.

The proofs for Results 3 and 4 are straightforward. Basically, they involve using the delta method and equation (2.4.5). The interested reader will find an outline of the proofs in Appendix A.

In practice, it is of particular interest to evaluate the matrix Λ of equation (2.4.7). Often, for convenience, the models are fit assuming the vector Y is product-Poisson, and then inferences based on the maximum likelihood estimates are made assuming that they are invariant with respect to the sampling assumption. Birch (1963) and Palmgren (1981) derive rules for when these inferences, based on the two different sampling assumptions, will be equivalent. However, they assume that the model is of a simple loglinear form. That is, the Poisson model is assumed to have form

    Θ_X = {ξ ∈ R^s : ξ = Xβ}.

We will use the results of this section to derive more general rules for when the two inferences will be equal. As a special case of these results, we will arrive at the Birch and Palmgren results.

The following lemma will enable us to rewrite $\Lambda$ of (2.4.7) in a still
simpler form.

LEMMA 2.4.1 Let $Z = [Z_1, \ldots, Z_K]$ be an $r \times K$ matrix of full rank $K$.
Suppose that $X = [X_1, \ldots, X_p]$ is an $r \times p$ ($r \ge p \ge K$) matrix of full rank $p$
such that $M(X) \supseteq M(Z)$, i.e. the range space of $X$ contains the range space
of $Z$. Denote the $T$ ($K \le T \le p$) columns of $X$ that span a space containing
$M(Z)$ by $\{X_{v_1}, \ldots, X_{v_T}\}$. Without loss of generality, suppose that the set of
vectors $\{X_{v_1}, \ldots, X_{v_T}\}$ is a minimal spanning subset, i.e. the span
of any $\tau < T$ of these vectors does not contain the range space of $Z$. We
conclude that

$$\exists\, W \in R^{T \times K} \ni (X'X)^{-1}X'Z = JW,$$

where the $p \times T$ matrix $J = [e_{v_1}, \ldots, e_{v_T}]$ and $e_{v_i}$ is the $p \times 1$ elementary vector
$(0, \ldots, 0, 1, 0, \ldots, 0)'$ with the '1' in the $v_i$th position.






Proof: Let $X_* = [X_{v_1}, \ldots, X_{v_T}]$. Now, by assumption, $M(X_*) \supseteq M(Z)$.
Hence, there must exist a matrix $W \in R^{T \times K} \ni Z = X_*W$. Therefore,

$$(X'X)^{-1}X'Z = (X'X)^{-1}X'X_*W = [(X'X)^{-1}(X'X_*)]W = JW,$$

where $J = (X'X)^{-1}(X'X_*)$ is as stated in the conclusion of the lemma; indeed,
since $X_* = XJ$, we have $(X'X)^{-1}(X'X_*) = (X'X)^{-1}(X'X)J = [e_{v_1}, \ldots, e_{v_T}]$. □
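In words, the lemma says that the rows of $(X'X)^{-1}X'Z$ can be nonzero only in the positions $v_1, \ldots, v_T$. A small numerical illustration (a Python/NumPy sketch; the sizes, the matrix $X$, and the choice of spanning columns are arbitrary assumptions of mine):

    import numpy as np

    rng = np.random.default_rng(1)
    r, p = 8, 5
    X = rng.standard_normal((r, p))            # full column rank with probability one
    # Take Z in the span of columns 2 and 4 of X, so {X_2, X_4} is a
    # minimal spanning subset (K = 1, T = 2, v_1 = 2, v_2 = 4)
    Z = 2.0 * X[:, [1]] - 1.0 * X[:, [3]]

    coef = np.linalg.solve(X.T @ X, X.T @ Z)   # (X'X)^{-1} X'Z
    print(np.round(coef.ravel(), 10))
    # Only entries 2 and 4 are nonzero (2 and -1); all other rows vanish,
    # exactly the J W structure asserted by the lemma.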

Before stating the next important theorem, let us write $\Lambda$ in another
way. Assuming that (A1) holds, $\Lambda$ can be written as

$$\Lambda = \begin{pmatrix} \Lambda^{11} & \Lambda^{12} \\ \Lambda^{21} & \Lambda^{22} \end{pmatrix} \qquad (2.4.8)$$

where

$$\Lambda^{ij} = (X_i'X_i)^{-1}X_i'C_i\Big(\bigoplus_{k=1}^{K}\mathbf{1}_{m_i}\Big)D(n)^{-1}\Big(\bigoplus_{k=1}^{K}\mathbf{1}_{m_j}\Big)'C_j'X_j(X_j'X_j)^{-1}.$$

Now, if $C_i$ is a contrast matrix, then by assumption (A2) we can write

$$(X_i'X_i)^{-1}X_i'C_i\Big(\bigoplus_{k=1}^{K}\mathbf{1}_{m_i}\Big) = J^{(i)}W^{(i)}, \qquad (2.4.9)$$

where $J^{(i)}$ can arbitrarily be chosen to be equal to $X_i$, and so $W^{(i)} = 0$. On
the other hand, if $C_i$ is the identity matrix, then we have by (A3) that $M(X_i) \supseteq M(\oplus\mathbf{1}_{m_i})$.
Therefore, we can invoke the result of Lemma 2.4.1 by setting $Z = \oplus\mathbf{1}_{m_i}$.
Since $M(X_i) \supseteq M(\oplus\mathbf{1}_{m_i}) = M(Z)$, the conditions for the lemma are satisfied.
Let $X_{i*} = [X_{v_1^{(i)}}, \ldots, X_{v_{T_i}^{(i)}}]$ be the $m_iK \times T_i$ ($K \le T_i \le p_i$) submatrix of $X_i$
whose columns form a minimal spanning subset for $M(Z) = M(\oplus\mathbf{1}_{m_i})$.
By Lemma 2.4.1,

$$\exists\, W^{(i)} \in R^{T_i \times K} \ni (X_i'X_i)^{-1}X_i'\big(\oplus\mathbf{1}_{m_i}\big) = J^{(i)}W^{(i)}. \qquad (2.4.10)$$

Here, $J^{(i)} = [e_{v_1^{(i)}}, \ldots, e_{v_{T_i}^{(i)}}]$, where the $T_i$ elementary vectors correspond to the
columns $\{X_{v_1^{(i)}}, \ldots, X_{v_{T_i}^{(i)}}\}$ of $X_i$ that form a minimal spanning subset for the
range space of $\oplus\mathbf{1}_{m_i}$, i.e. the $T_i$ columns span a space that contains the range
space of $\oplus\mathbf{1}_{m_i}$, and any smaller set of columns will not span a space containing
the range space of $\oplus\mathbf{1}_{m_i}$.

It follows that the matrices $\Lambda^{ij}$ of (2.4.8) can be written as

$$\Lambda^{ij} = J^{(i)}W^{(i)}D(n)^{-1}W^{(j)'}J^{(j)'} \qquad (2.4.11)$$

where

$$J^{(i)} = \begin{cases} [e_{v_1^{(i)}}, \ldots, e_{v_{T_i}^{(i)}}], & \text{if } C_i \text{ is the identity matrix} \\ X_i, & \text{otherwise} \end{cases} \qquad (2.4.12)$$

and

$$W^{(i)} = \begin{cases} W^{(i)} \text{ of } (2.4.10), & \text{if } C_i \text{ is the identity matrix} \\ 0, & \text{otherwise.} \end{cases} \qquad (2.4.13)$$

We now state a theorem of substantive importance.

THEOREM 2.4.2 Suppose that assumptions (A1), (A2), and (A3) hold. For
$r = 1, 2$, if $C_r$ is the identity matrix, then let $\{v_1^{(r)}, \ldots, v_{T_r}^{(r)}\}$ be the set of
indices that index those columns of $X_r$ that form a minimal spanning subset
for $M(\oplus\mathbf{1}_{m_r})$. Then it follows that the relationship between the asymptotic
variances of the two estimators $\hat{\beta}^{(M)}$ and $\hat{\beta}^{(P)}$ is

$$\mathrm{var}(\hat{\beta}^{(M)}) = \mathrm{var}(\hat{\beta}^{(P)}) - \begin{pmatrix} \Lambda^{11} & \Lambda^{12} \\ \Lambda^{21} & \Lambda^{22} \end{pmatrix},$$

where the $p_i \times p_j$ matrix $\Lambda^{ij}$ is a zero matrix whenever at least one of $C_i$ or
$C_j$ is a contrast or zero matrix. Otherwise, if both $C_i$ and $C_j$ are identity
matrices, then

$$\Lambda^{ij}_{kl} = 0, \quad \text{if } (k, l) \notin \{v_1^{(i)}, \ldots, v_{T_i}^{(i)}\} \times \{v_1^{(j)}, \ldots, v_{T_j}^{(j)}\}.$$








Proof: Since (A1), (A2), and (A3) hold, we can rewrite $\Lambda^{ij}$ as in (2.4.11).
Now, if either $C_i$ or $C_j$ is a contrast or zero matrix, it is obvious by (2.4.9)
that $\Lambda^{ij}$ will have zero components, as stated in the theorem, since at least
one of $W^{(i)}$ or $W^{(j)}$ will be a zero matrix. On the other hand, if both $C_i$ and
$C_j$ are identity matrices, then $\Lambda^{ij}$ can be rewritten as in (2.4.11), where

$$J^{(i)} = [e_{v_1^{(i)}}, \ldots, e_{v_{T_i}^{(i)}}], \qquad J^{(j)} = [e_{v_1^{(j)}}, \ldots, e_{v_{T_j}^{(j)}}],$$

and the matrices $W^{(i)}$ and $W^{(j)}$ are elements of $R^{T_i \times K}$ and $R^{T_j \times K}$. Hence,

$$\Lambda^{ij} = J^{(i)}W^{ij}J^{(j)'},$$

where $W^{ij} = W^{(i)}D(n)^{-1}W^{(j)'}$ is some $T_i \times T_j$ matrix. Now, since the $\{e_v\}$ are elementary
vectors, we have that if

$$(k, l) \notin \{v_1^{(i)}, \ldots, v_{T_i}^{(i)}\} \times \{v_1^{(j)}, \ldots, v_{T_j}^{(j)}\},$$

then the component $\Lambda^{ij}_{kl} = 0$. Otherwise, if $(k, l)$ is a member of this set, it
must be that $\Lambda^{ij}_{kl}$ is one of the elements of the matrix $W^{ij}$. This completes
the proof. □

The next two corollaries follow immediately from Theorem 2.4.2.

COROLLARY 2.4.2 If both $C_1$ and $C_2$ are contrast matrices, then

$$\mathrm{var}(\hat{\beta}^{(M)}) = \mathrm{var}(\hat{\beta}^{(P)}).$$








Proof: Since both $C_1$ and $C_2$ are contrast matrices, it follows that $W^{(1)}$
and $W^{(2)}$ are zero matrices. Therefore, the matrices $\Lambda^{ij}$ of the theorem are
zero matrices. □

COROLLARY 2.4.3 Let $C_2 = 0$, $X_2 = 0$, and $C_1 = A_1 = I$, so that the model
(2.3.4) becomes

$$\Omega_h = \{\mu \in R_+^s : \log\mu = X\beta,\ \mu'(\textstyle\bigoplus_{k=1}^{K}\mathbf{1}_R) = n'\},$$

i.e. a simple loglinear model with $K$ subpopulations. Let $\{v_1, \ldots, v_T\}$ be the set
of indices that index the columns of $X$ that form a minimal spanning subset
for $M(\oplus\mathbf{1}_R)$. Then

$$\mathrm{var}(\hat{\beta}^{(M)}) = \mathrm{var}(\hat{\beta}^{(P)}) - \Lambda,$$

where the elements of $\Lambda$ are such that

$$\Lambda_{kl} = 0, \quad \text{if } (k, l) \notin \{v_1, \ldots, v_T\}^2.$$



Proof: The proof is an immediate consequence of the theorem upon
identifying $\Lambda^{11}$ of the theorem with $\Lambda$ of the corollary. The other matrices
$\Lambda^{12}$, $\Lambda^{21}$, and $\Lambda^{22}$ will be zero since $C_2 = 0$. □

Corollary 2.4.3 is of practical importance and is essentially the result
shown by Palmgren (1981). In particular, if we parameterize the model in
such a way that there is a parameter included for each of the $K$ independent
multinomials (or $K$ covariate levels), then the $K$ columns of $X$ corresponding
to these $K$ 'fixed by design' parameters will form a basis (and hence a minimal
spanning subset) for $M(\oplus\mathbf{1}_R)$. Therefore, if $\beta_i$ and $\beta_j$ are not among the
$K$ parameters fixed by design, then $\mathrm{cov}(\hat{\beta}_i^{(M)}, \hat{\beta}_j^{(M)}) = \mathrm{cov}(\hat{\beta}_i^{(P)}, \hat{\beta}_j^{(P)})$. We
will illustrate the utility of the above results in the next chapter of this
dissertation.
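To see the practical content of Corollary 2.4.3 before moving on, consider the following small simulation, a Python/NumPy sketch of my own construction: the two-population design, the sample sizes, the parameter values, and the helper fit_poisson are illustrative assumptions, not taken from the text. A loglinear model with one 'fixed by design' intercept per population is fitted by Poisson maximum likelihood to data generated under both sampling schemes. Empirically, the variance of the slope estimator is essentially the same under the two schemes, while the variances of the two intercept estimators are inflated under Poisson sampling, as the corollary predicts.

    import numpy as np

    rng = np.random.default_rng(2)
    # K = 2 populations with 3 cells each; log mu = X beta, with one
    # intercept column per population plus a shared cell-score column.
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 0, 2],
                  [0, 1, 0], [0, 1, 1], [0, 1, 2]], dtype=float)
    beta_true = np.array([np.log(40.0), np.log(60.0), 0.3])
    mu = np.exp(X @ beta_true)
    n1, n2 = int(mu[:3].sum()), int(mu[3:].sum())   # multinomial indices

    def fit_poisson(y):
        # Newton-Raphson for the Poisson loglinear model log mu = X beta
        b = np.array([np.log(y[:3].mean() + 0.5), np.log(y[3:].mean() + 0.5), 0.0])
        for _ in range(20):
            m = np.exp(X @ b)
            b = b + np.linalg.solve(X.T @ (m[:, None] * X), X.T @ (y - m))
        return b

    est_P, est_M = [], []
    for _ in range(4000):
        yP = rng.poisson(mu).astype(float)          # product-Poisson data
        yM = np.concatenate([rng.multinomial(n1, mu[:3] / mu[:3].sum()),
                             rng.multinomial(n2, mu[3:] / mu[3:].sum())]).astype(float)
        est_P.append(fit_poisson(yP))
        est_M.append(fit_poisson(yM))

    vP = np.array(est_P).var(axis=0)
    vM = np.array(est_M).var(axis=0)
    print("Poisson     var:", np.round(vP, 5))
    print("multinomial var:", np.round(vM, 5))
    # The slope variances agree; the intercept variances differ by roughly
    # 1/n_k, the 'fixed by design' adjustment.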

The next section considers issues that may arise when computing the

model degrees of freedom. It also states some other miscellaneous results

with regard to the Lagrange multiplier statistic.



2.5 Miscellaneous Results

We begin this section by addressing practical issues that may arise during

nonstandard model fitting. Specifically, we will consider computing the model

and distance (or residual) degrees of freedom.

Computing model and distance degrees of freedom. Assuming the model
$[\Omega_h]$ of (2.3.5) is well defined, i.e. the $u + l + K$ constraints are nonredundant,
we can compute the model degrees of freedom as in section 2.2. In that
section, we defined the model degrees of freedom as the number of model
parameters minus the number of independent constraints implied by the
model. Notice that in this application we have an additional $l$ linear
constraints; the $l$ constraints were not present in section 2.2. It follows
that the model degrees of freedom for $[\Omega_h]$ is

$$df[\Omega_h] = s - (u + l + K), \qquad (2.5.1)$$

where $s$ is the number of cell means, $u$ is the dimension of the null space of $X'$,
$l$ is the number of linear constraints, and $K$ is the number of identifiability
constraints.






To measure model goodness of fit, we can consider estimating some
hypothetical distance between the model $[\Omega_h]$ and the saturated model ($u = l = 0$)
$[\Omega]$. This distance, denoted $\delta[\Omega_h; \Omega]$, has degrees of freedom

$$df(\delta[\Omega_h; \Omega]) = df[\Omega] - df[\Omega_h] = (s - K) - (s - (u + l + K)) = u + l. \qquad (2.5.2)$$

Notice that, had we considered the product-Poisson model (2.4.2), the
distance degrees of freedom would be

$$df(\delta[\Omega_h^{(P)}; \Omega^{(P)}]) = s - (s - (u + l)) = u + l,$$

which is identical to the product-multinomial distance degrees of freedom of
(2.5.2).
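The bookkeeping in (2.5.1) and (2.5.2) is trivial but easy to get wrong by hand, so it can be worth automating. A minimal sketch in Python (the function names, and the split of the constraint count into $u$ and $l$ in the example, are my own choices):

    def model_df(s, u, l, K):
        # (2.5.1): number of cell means minus the nonredundant constraints
        return s - (u + l + K)

    def distance_df(s, u, l, K):
        # (2.5.2): df of the saturated model minus df of [Omega_h]
        return (s - K) - model_df(s, u, l, K)    # always simplifies to u + l

    # For instance, with s = 4, u = 0, l = 1, K = 1 (one linear constraint on a
    # 2 x 2 table, as in the marginal homogeneity example below; the u/l split
    # is my assumption, and only the sum u + l matters):
    print(model_df(4, 0, 1, 1), distance_df(4, 0, 1, 1))   # 2 1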

We have assumed that the $u + l + K$ constraints are nonredundant, i.e.
that no constraint is implied by the others. This may not always
be the case. To illustrate, consider the model specification for example 3 of
section 2.2.2. The model $[\Omega_{MH}]$ implies that the two marginal distributions
are equal. We stated at the end of that example that the additional constraint
$\pi_{2+} - \pi_{+2} = 0$ was redundant. This can be seen since

$$\pi_{2+} - \pi_{+2} = \pi_{21} - \pi_{12} = -(\pi_{1+} - \pi_{+1}) = 0.$$

That is, the constraints of model $[\Omega_{MH}]$ already imply that $\pi_{2+} - \pi_{+2}$ equals zero.
Had we blindly added this constraint, we would have incorrectly calculated
the model degrees of freedom as 1 and the distance degrees of freedom as 2.
Therefore, we must be very careful to have a set of nonredundant constraints
when computing degrees of freedom.







In practice, when models are more complicated, it may be difficult to ascertain
whether or not the model constraints are nonredundant. Fortunately,
there are two very useful results that help in this regard.

The first result is that when the constraints are redundant, the matrix
$H'D^{-1}H$, evaluated at any point in $\Omega_h$, is of less than full rank and hence is not
invertible. Therefore, in practice, if the algorithm (2.3.13) does not converge
because $G$ is singular, the cause may be redundant constraints, i.e. an ill-defined
model. The user should investigate and possibly respecify the model
should this occur. A caveat is that, owing to computational roundoff error, a
singularity may not occur even when the model is ill defined, because the
iterate estimates, including the final estimate, may not lie strictly in $\Omega_h$. The
next result may mitigate this problem.
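To make the rank criterion concrete, return to the 2 x 2 marginal homogeneity example above. If the redundant constraint $\pi_{2+} - \pi_{+2} = 0$ is imposed alongside $\pi_{1+} - \pi_{+1} = 0$, the constraint Jacobian $H$ has rank 1 rather than 2, so $H'D^{-1}H$ is singular. A minimal check (a Python/NumPy sketch; the cell ordering $(\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22})$ and the probabilities placed in $D$ are my choices):

    import numpy as np

    # Columns of H are gradients, with respect to (pi11, pi12, pi21, pi22), of
    # pi1+ - pi+1 = pi12 - pi21  and  pi2+ - pi+2 = pi21 - pi12.
    H = np.array([[0.0,  0.0],
                  [1.0, -1.0],
                  [-1.0, 1.0],
                  [0.0,  0.0]])
    D = np.diag([0.4, 0.1, 0.2, 0.3])      # any positive diagonal works here

    G = H.T @ np.linalg.inv(D) @ H
    print(np.linalg.matrix_rank(H), np.linalg.matrix_rank(G))   # 1 1 -> redundant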

A result that is useful in practice is that a necessary condition for the
constraints to be nonredundant, or equivalently for the model to be well
defined, is that the Lagrange multiplier statistic be invariant to the choice of
$U$, a matrix with columns spanning the null space of $X'$. Evidently, if the

user fits the model several times, each time using a different 'U' matrix, and

the Lagrange multiplier statistic varies (more so than can be explained by

roundoff error), then it must be that the model is ill defined.

Formally, this necessary condition can be stated as

THEOREM 2.5.1 Let $U_1$ and $U_2$ ($U_1 \ne U_2$) be any two full column rank
matrices satisfying $U_i'X = 0$, $i = 1, 2$. Denote the Lagrange multiplier statistic
evaluated using $U_i$ by $LM(U_i)$. If the matrix

$$H_i = \frac{\partial h_i'(\theta)}{\partial \theta} = D(e^{\theta})A'D(Ae^{\theta})^{-1}C'U_i$$

is such that $[H_i,\ D(e^{\theta})(\oplus_{k=1}^{K}\mathbf{1})]$ is of full column rank, $i = 1, 2$, and hence the models are well
defined, then

$$LM(U_1) = LM(U_2),$$

i.e. the value of the Lagrange multiplier statistic is invariant with respect to
the choice of $U$.
Proof: Denote the model specified in terms of $U_i$ by $[\Omega_{h_i}]$, $i = 1, 2$. By
the definition of $U$, we know that the constraints implied by $[\Omega_{h_1}]$ and $[\Omega_{h_2}]$
are equivalent. Hence, the solution to (2.3.8), or equivalently (2.4.3), under
either model is the same. Thus, in view of the first set of equations in (2.3.8),
any solution $\mathrm{vec}(\hat{\theta}, \hat{\lambda}_i)$ under model $[\Omega_{h_i}]$ must satisfy

$$-(y - e^{\hat{\theta}}) = H_i(\hat{\theta})\hat{\lambda}_i, \quad i = 1, 2. \qquad (2.5.3)$$

Notice that since $U_1 \ne U_2$, we have that $H_1(\hat{\theta}) \ne H_2(\hat{\theta})$ and, by (2.5.3), $\hat{\lambda}_1 \ne \hat{\lambda}_2$.
Now, (2.5.3) implies that

$$H_1(\hat{\theta})\hat{\lambda}_1 = H_2(\hat{\theta})\hat{\lambda}_2. \qquad (2.5.4)$$


Also, since $H_i(\hat{\theta})$ is assumed to be of full column rank, the variance of $\hat{\lambda}_i$,

$$\mathrm{var}(\hat{\lambda}_i) = \big(H_i'(\hat{\theta})D^{-1}(e^{\hat{\theta}})H_i(\hat{\theta})\big)^{-1}, \qquad (2.5.5)$$

exists. Therefore, the Lagrange multiplier statistics $LM(U_i)$, which have the form

$$LM(U_i) = \hat{\lambda}_i'[\mathrm{var}(\hat{\lambda}_i)]^{-1}\hat{\lambda}_i, \quad i = 1, 2, \qquad (2.5.6)$$







exist. Finally, by (2.5.4)-(2.5.6), it follows that

$$LM(U_1) = \hat{\lambda}_1'[\mathrm{var}(\hat{\lambda}_1)]^{-1}\hat{\lambda}_1
= \big(H_1(\hat{\theta})\hat{\lambda}_1\big)'D^{-1}(e^{\hat{\theta}})\big(H_1(\hat{\theta})\hat{\lambda}_1\big)
= \big(H_2(\hat{\theta})\hat{\lambda}_2\big)'D^{-1}(e^{\hat{\theta}})\big(H_2(\hat{\theta})\hat{\lambda}_2\big)
= \hat{\lambda}_2'[\mathrm{var}(\hat{\lambda}_2)]^{-1}\hat{\lambda}_2 = LM(U_2),$$

the middle equality following from (2.5.4). This completes the proof. □

The final result of this section states that the Lagrange multiplier
statistic is exactly the same as the Pearson chi-squared statistic whenever the
random vector $Y$ is product-Poisson or product-multinomial and the model
satisfies assumptions (A1), (A2), and (A3).

THEOREM 2.5.2 Assume that the product-multinomial model satisfies
assumptions (A1), (A2), and (A3). Let $X^2$ denote the Pearson chi-squared
statistic, i.e.

$$X^2 = (y - \hat{\mu})'D^{-1}(\hat{\mu})(y - \hat{\mu}),$$

where $\hat{\mu}$ is the ML estimator under either of the sampling schemes
(product-multinomial or product-Poisson). It follows that the Lagrange multiplier
statistic $LM$ is equivalent to $X^2$. That is,

$$LM = X^2.$$



Proof: By equations (2.5.3), (2.5.5), and (2.5.6) of the previous theorem's
proof, and the fact that $e^{\hat{\theta}} = \hat{\mu}$, we have that

$$LM = (y - \hat{\mu})'D^{-1}(\hat{\mu})(y - \hat{\mu}) = X^2.$$

This is what we set out to show. □







2.6 Discussion



In this chapter, we discussed in some detail issues related to parametric

modeling. In particular, we followed the lead of Aitchison and Silvey (1958,

1960) and Silvey (1959) and described two ways of specifying models: using

constraint equations and using freedom equations. In section 2.2, distance

measures for quantifying how far apart two models are, relative to how close

they are to holding, were discussed. In particular, the power-divergence

measures (Read and Cressie, 1988) were used when the parameter spaces were

subsets of an $(s - 1)$-dimensional simplex. Estimates of these distances were

developed based on very intuitive notions. Also, a geometric interpretation

of model and residual (or distance) degrees of freedom was given.

In section 2.3, we described a general class of multivariate polytomous

(categorical) response data models. The models in this class, which satisfy
assumptions (A1), (A2), and (A3), were shown to satisfy the necessary and
sufficient conditions of Birch (1963), so that the models could be fitted using
either the product-Poisson or the product-multinomial sampling assumption.

An ML fitting method was developed, using results of Aitchison and Sil-

vey (1958, 1960) and Haber (1985a, 1985b). The algorithm used Lagrangian

undetermined multipliers in conjunction with a modified Newton-Raphson

iterative scheme. The modification, which simplifies the method of Haber

(1985a), is to use a simpler matrix than the Hessian matrix. We replace

the Hessian matrix (of the Lagrangian objective function) by its dominant

part, which turns out to be easily inverted. Because the matrices used in the

algorithm proposed in this chapter are very large and must be inverted, this






modification is a very important one. A FORTRAN program 'mle.restraint'

has been written by the author to implement this modified algorithm.

The asymptotic behavior of the ML estimators computed under the two

sampling schemes (product-Poisson and product-multinomial) was investi-

gated. The method for deriving the asymptotic distributions represents a

modification to the technique of Aitchison and Silvey (1958). A comparison of

the limiting distributions of the two estimators was made in section 2.4. Some

very interesting results were obtained by studying the asymptotic behavior

in the constraint equation setting. In particular, Theorem 2.4.2 represents

a generalization of the results of Palmgren (1981). The theorem provides a

method for determining when the inferences about the freedom parameters

of a generalized loglinear model of the form $C\log A\mu = X\beta$ will be invariant

with respect to the sampling assumption. Palmgren (1981) developed some

similar results for the special case when the freedom parameters are part of

a loglinear model.

It is important to note that the asymptotic results are only valid if

the number of populations K is considered fixed and the expected counts

all get large at approximately the same rate. In particular, the asymptotic

arguments do not hold when the covariates are continuous, since the number

of populations (levels of the covariates) can theoretically run off to infinity.

The reason the arguments do not hold is that, when we use the method of
Aitchison and Silvey (1958), it is required that the vector $n_*^{-1}(Y - \mu)$ converge
in probability to zero as the total number of observations gets large. This is
the case only when $n_* = \min\{n_1, \ldots, n_K\}$ goes to infinity. This drawback







could prove to be temporary. It seems reasonable to assume, in many cases,
that as long as the 'information' about each parameter is increasing without

bound, the estimators will be consistent and asymptotically normally dis-

tributed. For example, consider the logistic regression model with continuous

covariates. Although the nk's may all be 1, the ML estimators of the

regression parameters are often consistent and asymptotically normal.

Section 2.5 outlines some miscellaneous results. One result that is

important to the practicing statistician, is that the Lagrange multiplier

statistic is shown to be invariant with respect to choice of the matrix U

(of $U'C\log A\mu = 0$) as long as the model is well defined. An important

implication of this result is that if one fits the model several times, each

time using a different 'U' matrix, and the Lagrange multiplier statistics

vary more so than can be explained by roundoff, then it could be that the

model is not well defined. Another interesting result is that the Lagrange

multiplier statistic is simply the Pearson chi-squared statistic $X^2$ whenever
the assumptions (A1), (A2), and (A3) are satisfied.

Theoretically the ML fitting algorithm will work for any size problem.

Practically, however, the algorithm is certainly not a model fitting panacea.

The number of parameters that must be estimated gets very large, very fast.

Consider the case where 7 raters rate the same set of objects on a 5 point

scale. Even without covariates, the number of cell probabilities that must be

estimated is $5^7 = 78,125$. It seems the ML fitting method developed in this

chapter is, at least for now, useful for moderate size problems only. It can be

used to analyze longitudinal categorical response data when the number







of measurements taken on each subject is somewhere in the neighborhood of

2 to 6. This is not to take away from the utility of this chapter's algorithm,

but rather to indicate its breadth of application. In time, with increasing

computer efficiency, much larger data sets may be fitted using this algorithm.













CHAPTER 3
SIMULTANEOUSLY MODELING THE JOINT AND MARGINAL
DISTRIBUTIONS OF MULTIVARIATE POLYTOMOUS
RESPONSE VECTORS


3.1 Introduction

Often, when given an opportunity to analyze multivariate response

data, the investigator may wish to describe both the joint and marginal

distributions simultaneously. We consider a broad class of models which

imply structure on both the joint and marginal distributions of multivariate

polytomous response vectors. To illustrate the need for such models, we

consider several settings where these models would be useful. For example,

when the multivariate responses represent repeated measures of the same

categorical response across time, one may be interested in how the marginal

distributions are changing across time and how strongly the responses are

associated. The simultaneous investigation of both joint and marginal

distributions is not restricted to the longitudinal data setting. Other examples

include the analysis of rater agreement, cross-over, and social mobility data.

The common thread tying all of these data types together is that the sampling

scheme is such that the different responses are correlated. In longitudinal

studies the same subject responds on several occasions. In rater agreement

studies, raters rate the same objects. In two-period cross-over studies, one

group of subjects receives the two treatments in one order and the other group
receives them in the reverse order. In social mobility studies, the socio-economic








status of a father-son pair is recorded. When the responses are positively

correlated, these designs result in increased power for detecting differences

between the marginal distributions (Laird, 1991; Zeger, 1988).

This chapter considers the modeling of multivariate categorical responses

in which the same response scale is used for each response. The classes

of models used in this chapter are of the form considered in Chapter 2 of

this dissertation and hence are readily fit using the ML methods of that

chapter. In section 3.2, we give several examples that may be analyzed by

simultaneously modeling the joint and marginal distributions. We introduce

the classes of simultaneous Joint-Marginal models in section 3.3. Several

models are fitted to the data sets of section 3.2.


3.2 Product-Multinomial Sampling Model

Initially, we assume that a random sample of nk subjects is taken from

population k, k = 1,..., K. The number of populations, or covariate profiles,

K is considered to be some fixed integer. The subscript k is allowed to be

compound, i.e. the subscript k is allowed to represent a vector of subscripts

such as

$k = (k_1, k_2, \ldots)$.


Suppose that there are $T$ categorical responses $V^{(1)}, \ldots, V^{(T)}$ of interest
and that each response is measured on the same response scale. Let
$V_k = (V_k^{(1)}, \ldots, V_k^{(T)})'$ be the random vector of responses for population $k$,
and let $V_{ku}$, $u = 1, \ldots, n_k$, be the $n_k$ independent and identically distributed
copies of $V_k$, where $V_{ku}$ denotes the response profile for the $u$th randomly






chosen person within population $k$. Notationally, we have

$$V_{ku} \sim \text{i.i.d. } V_k, \quad u = 1, \ldots, n_k.$$

For our purposes we can assume that each response takes on values in
$\{1, 2, \ldots, d\}$ with probability one. Denote the probability that a randomly
selected subject from population $k$ has response profile $i = (i_1, \ldots, i_T)'$ by $\pi_{ik}$,
i.e.

$$P(V_k = (i_1, \ldots, i_T)') = \pi_{ik},$$

where $i \in \{1, \ldots, d\} \times \cdots \times \{1, \ldots, d\}$.

The joint distribution of $V_k = (V_k^{(1)}, \ldots, V_k^{(T)})'$ is specified as $\{\pi_{ik}\}$. The
marginal distributions of $V_k$ will be denoted by $\{\phi_i(t; k)\}$, $t = 1, \ldots, T$, where

$$\phi_i(t; k) = P(V_k^{(t)} = i), \quad i = 1, \ldots, d.$$

Our objective is to model simultaneously the $K$ joint distributions

$$\{\pi_{ik}\}, \quad k = 1, \ldots, K,$$

and the $KT$ marginal distributions

$$\{\phi_i(t; k)\}, \quad t = 1, \ldots, T, \quad k = 1, \ldots, K.$$

To help the reader better understand the notation, we consider the one-population
bivariate case. When $T = 2$, the response profiles can be denoted
by $i = (i_1, i_2) = (i, j)$, where $i = 1, \ldots, d$ and $j = 1, \ldots, d$. Since there is
just one population (or covariate profile), the subscript $k$ is always 1 and is
therefore dropped. It follows that $\{\pi_{ij}\}$ is the joint distribution of $(V^{(1)}, V^{(2)})'$
and $\{\phi_i(t)\}$, $t = 1, 2$, are the two marginal distributions. That is,

$$\pi_{ij} = P(V^{(1)} = i, V^{(2)} = j), \quad i = 1, \ldots, d, \quad j = 1, \ldots, d,$$







and

$$\phi_i(t) = \begin{cases} \pi_{i+} = P(V^{(1)} = i), & \text{if } t = 1 \\ \pi_{+i} = P(V^{(2)} = i), & \text{if } t = 2 \end{cases}$$

for $i = 1, 2, \ldots, d$.

Now, for each population $k$, consider the $d^T \times 1$ random vector of
indicators

$$x_k = [I(V_k = \underline{1}), \ldots, I(V_k = \underline{d^T})]',$$

where $\underline{1}, \ldots, \underline{d^T}$ index the $d^T$ possible response profiles. Notice that no
information about $V_k$ is lost, since $x_k$ is a one-to-one function of $V_k$. Also,

$$x_k \sim \text{ind. } \mathrm{Mult}(1, \{\pi_{ik}\}), \quad k = 1, \ldots, K.$$


Therefore, since we have randomly sampled $n_k$ subjects from each of the $K$
populations, we have that, for given $k$,

$$x_{k1}, x_{k2}, \ldots, x_{kn_k} \sim \text{i.i.d. } \mathrm{Mult}(1, \{\pi_{ik}\}),$$

and hence the vector

$$Y_k = \sum_{u=1}^{n_k} x_{ku} \sim \mathrm{Mult}(n_k, \{\pi_{ik}\})$$

is sufficient for the family of distributions $\{\pi_{ik}\}$ and $\{\phi_i(t; k)\}$.

By independence across populations, the vector $\mathrm{vec}(Y_1, Y_2, \ldots, Y_K)$ is
sufficient for the joint and marginal distributions of $\mathrm{vec}(V_1, V_2, \ldots, V_K)$.
Further, the random vector $\mathrm{vec}(Y_1, Y_2, \ldots, Y_K)$ is product-multinomial, i.e.

$$Y_k = (Y_{\underline{1}k}, \ldots, Y_{\underline{R}k})' \sim \text{ind. } \mathrm{Mult}(n_k, \{\pi_{ik}\}), \quad k = 1, \ldots, K,$$

where $\underline{1}, \ldots, \underline{R}$ represent the $R = d^T$ different response profiles.







Evidently, $Y_{ik}$ represents the number of randomly selected subjects from
population $k$ who have response profile $i$. That is, the $\{Y_{ik}\}$ represent counts
resulting from a cross-classification of $N = \sum_{k=1}^{K} n_k$ subjects on $T$ response
variables and a population variable. The data can be displayed in a $d^T \times K$
contingency table. By convention, we use lower case Roman letters to denote
realizations of random quantities. For example, $y_{ik}$ represents a particular
realization of $Y_{ik}$.
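The bookkeeping that maps a response profile $(i_1, \ldots, i_T)'$ to a single cell of the $d^T \times K$ table is easy to mechanize. The sketch below (Python/NumPy; the lexicographic ordering of profiles is an assumed convention, and the profiles are simulated only to exercise the code) builds the count vectors $Y_1, \ldots, Y_K$ from raw responses:

    import numpy as np

    rng = np.random.default_rng(3)
    d, T, K = 3, 2, 2                  # d-point scale, T responses, K populations
    n = [50, 60]                       # multinomial indices n_1, ..., n_K

    Y = np.zeros((d**T, K), dtype=int)
    for k in range(K):
        # simulated response profiles; in practice these are the observed V_ku
        V = rng.integers(1, d + 1, size=(n[k], T))
        # lexicographic cell index: (i_1 - 1) d^(T-1) + ... + (i_T - 1)
        cell = np.zeros(n[k], dtype=int)
        for t in range(T):
            cell = cell * d + (V[:, t] - 1)
        Y[:, k] = np.bincount(cell, minlength=d**T)

    print(Y.sum(axis=0))               # column sums recover (n_1, ..., n_K)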

Consider Table 3.1, taken from Hout et al. (1987).



Table 3.1. Interest in Political Campaigns

                                   1960
                     Not Much   Somewhat   Very Much   Total
        Not Much        155        116          64       335
  1956  Somewhat         91        237         171       499
        Very Much        32         91         246       369
        Total           278        444         481      1203

Source: Hout et al. (1987), p. 166, Table 4




Each of 1203 randomly selected subjects was asked in 1956 how inter-

ested they were in the political campaigns. They responded on the 3-category

ordinal scale: 1 = Not Much, 2 = Somewhat, and 3 = Very Much.

Then, in 1960, each of the subjects was asked the same question and

responded on the same 3-category ordinal scale. Using the above notation,









we let $V^{(1)}$ and $V^{(2)}$ represent the responses in 1956 and 1960, respectively.
Let $y_{ij}$, $i, j = 1, 2, 3$, represent the number of the $N = 1203$ subjects responding
at level $i$ in 1956 and level $j$ in 1960. Since there is just one population
of interest, we drop the population subscript altogether. Finally, for this
bivariate response example, the compound subscript $i$ is replaced by $ij$. Table
3.1 summarizes the bivariate responses.
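Before any model is fitted, the sample analogues of the joint and marginal distributions of Table 3.1 are worth inspecting. A short Python/NumPy sketch:

    import numpy as np

    # Table 3.1 counts: rows are 1956 responses, columns are 1960 responses
    y = np.array([[155, 116, 64],
                  [91, 237, 171],
                  [32, 91, 246]])
    N = y.sum()                        # 1203

    p = y / N                          # sample joint distribution {p_ij}
    p_1956 = p.sum(axis=1)             # sample 1956 marginal: (0.278, 0.415, 0.307)
    p_1960 = p.sum(axis=0)             # sample 1960 marginal: (0.231, 0.369, 0.400)
    print(np.round(p_1956, 3), np.round(p_1960, 3))

The sample marginals already suggest a shift toward greater interest in 1960; the models of section 3.3 formalize such marginal comparisons.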

As another example, consider the cross-over data of Ezzet and White-

head (1991).


Table 3.2. Cross-over Data

        AB Sequence (Group 1)          BA Sequence (Group 2)

                  B                              B
            1    2    3    4               1    2    3    4
       1   59   35    3    2          1   63   40    7    2
   A   2   11   27    2    1      A   2   13   15    2    0
       3    0    0    0    0          3    0    0    1    1
       4    1    1    0    0          4    0    0    0    0


The counts displayed in Table 3.2 are from a study conducted by 3M

Health Care Ltd. to compare the suitability of two inhalation devices (A and

B) in patients who are currently using a standard inhaler device delivering

salbutamol. Two independent groups of subjects participated. Group 1 used

device A for a week followed by device B (sequence AB). Group 2 used the

devices in reverse order (sequence BA).

The response variables $V^{(1)}$ (device A) and $V^{(2)}$ (device B) are ordinal
polytomous. Specifically, they are the self-assessment on clarity of the leaflet
instructions accompanying the two devices, recorded on the ordinal four-point

scale,








1 = Easy
2 = Only clear after rereading
3 = Not very clear
4 = Confusing.

For this example there are two populations of interest: Group 1 and
Group 2. Let $y_{ijk}$ represent the number of the $n_k$ subjects responding at level
$i$ for device A and at level $j$ for device B, where $n_1 = 142$ and $n_2 = 144$. Again,
the bivariate response profiles can be denoted by $i = ij$, where $i, j = 1, 2, 3, 4$.
The bivariate responses are summarized in Table 3.2.


3.3 Joint and Marginal Models

Two types of questions that can be posed about Table 3.1 lead to quite

distinct types of models. One question is whether the interest in the political

campaigns was different at the two times. For example, the researcher

may wish to test the hypothesis that there was more interest in the 1960

political campaign than the 1956 political campaign. An investigation into the

marginal distributions is needed to test this hypothesis. For these bivariate

response data, the marginal distributions correspond to the row and column

distributions of Table 3.1. A second question that may be asked is whether

the two responses are associated and if so, how strong is the association. To

answer these questions, we must describe the dependence displayed in the

joint distribution of Table 3.1.

The marginal models we consider will be used to investigate whether

the probability that a randomly selected subject responds at level i or lower

in 1956 is different from the probability that a randomly selected subject

responds at level i or lower in 1960. In this sense, the comparison of marginal