ON MODEL FITTING FOR MULTIVARIATE POLYTOMOUS RESPONSE DATA

By

JOSEPH B. LANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
1992

ACKNOWLEDGMENTS

I would like to express my appreciation to Dr. Alan Agresti for serving as my dissertation advisor. For the many comments, ideas, and lessons he has shared with me, I am greatly indebted. Through his advisement and guidance, he has taught me to appreciate and respect good statistical research and teaching. He is a mentor worthy of emulation. I also want to express my gratitude to Dr. Jane Pendergast, who also served on my dissertation committee. I learned a great deal from her during the two years that I worked in the Biostatistics Department. To all of the faculty at the University of Florida, I extend my thanks. The statistics department, with its scholarly and friendly atmosphere, proved to be a wonderful place to learn.

The influences of persons from my past are not forgotten. Without Patrick Kearin's stimulating teaching of high school math, I may never have become interested in this subject. The genuine excitement delivered by Dr. James Kepner, in his teaching of undergraduate statistics, was the reason I decided to pursue an advanced degree in statistics.

I would like to thank my parents and the rest of my family for all of the support and encouragement they have given over the course of my studies and research. My friends and student colleagues deserve many thanks as well. Finally, I would like to thank Kendra Paar for always being there to support and encourage me while I was writing this paper.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
ABSTRACT

CHAPTERS

1 INTRODUCTION
1.1 A Brief Introduction to the Problem
1.2 Outline of Existing Methodologies: No Missing Data
1.3 Outline of Existing Methodologies: Missing Data
1.4 Format of Dissertation

2 RESTRICTED MAXIMUM LIKELIHOOD FOR A GENERAL CLASS OF MODELS FOR POLYTOMOUS RESPONSE DATA
2.1 Introduction
2.2 Parametric Modeling: An Overview
2.2.1 Model Specification
2.2.2 Measuring Model Goodness of Fit
2.3 Multivariate Polytomous Response Model Fitting
2.3.1 A General Multinomial Response Model
2.3.2 Maximum Likelihood Estimation
2.3.3 Asymptotic Distribution of Product-Multinomial ML Estimator
2.3.4 Lagrange's Method: The Algorithm
2.4 Comparison of Product-Multinomial and Product-Poisson Estimators
2.5 Miscellaneous Results
2.6 Discussion

3 SIMULTANEOUSLY MODELING THE JOINT AND MARGINAL DISTRIBUTIONS OF MULTIVARIATE POLYTOMOUS RESPONSE VECTORS
3.1 Introduction
3.2 Product-Multinomial Sampling Model
3.3 Joint and Marginal Models
3.4 Numerical Examples
3.5 Product-Multinomial Versus Product-Poisson Estimators: An Example
3.6 Well-Defined Models and the Computation of Residual Degrees of Freedom
3.7 Discussion
4 LOGLINEAR MODEL FITTING WITH INCOMPLETE DATA
4.1 Introduction
4.2 Review of the EM Algorithm
4.2.1 General Results
4.2.2 Exponential Family Results
4.3 Loglinear Model Fitting with Incomplete Data
4.3.1 The EM Algorithm for Poisson Loglinear Models
4.3.2 Obtaining the Observed Information Matrix
4.3.3 Inferences for Multinomial Loglinear Models
4.4 Latent Class Model Fitting: An Application
4.5 Modified EM/Newton-Raphson Algorithm
4.6 Discussion

APPENDICES
A CALCULATIONS FOR CHAPTER 2
B CALCULATIONS FOR CHAPTER 4

BIBLIOGRAPHY

BIOGRAPHICAL SKETCH

LIST OF TABLES

2.1 Opinion Poll Data Configuration
3.1 Interest in Political Campaigns
3.2 Cross-Over Data
3.3 Joint Distribution Models: Goodness of Fit
3.4 Marginal Distribution Models: Goodness of Fit
3.5 Candidate Models in J(L x L + D) ∩ M(U): Goodness of Fit
3.6 Estimates of Freedom Parameters for Model J(L x L + D) ∩ M(CU)
3.7 Freedom Parameter Estimates and Standard Errors
3.8 Estimated Cell Means and Standard Errors
3.9 Cross-Over Data Models: Goodness of Fit
3.10 Freedom Parameter ML Estimates for Model J(UA) ∩ M(U)
3.11 Children's Respiratory Illness Data
3.12 Product-Multinomial versus Product-Poisson Freedom Parameter Estimation
4.1 Observed cross-classification of 216 respondents with respect to whether they tend toward universalistic (1) or particularistic (2) values in four situations (A, B, C, D) of role conflict
4.2 Parameter and Standard Error Estimates
4.3 Classification Probability Estimates

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ON MODEL FITTING FOR MULTIVARIATE POLYTOMOUS RESPONSE DATA

By

Joseph B. Lang

May, 1992
Chairman: Dr. Alan Agresti
Major Department: Statistics

A broad class of models that imply structure on both the joint and marginal distributions of multivariate categorical (ordinal or nominal) responses is introduced. These parsimonious models can be used to simultaneously describe the marginal distributions of the responses and the association structure among the responses. As a special case, this class of models includes classical log- and logit-linear models. In this sense, we address model fitting for multivariate polytomous response data from a very general perspective. Simultaneous models for joint and marginal distributions are useful in a variety of applications, including longitudinal studies and studies dealing with social mobility and interrater agreement. We outline a maximum likelihood fitting algorithm that can be used for fitting a large class of models that includes the class of simultaneous models. The algorithm uses Lagrange's method of undetermined multipliers and a modified Newton-Raphson iterative scheme. We also discuss goodness-of-fit tests and model-based inferences. Inferences for certain model parameters are shown to be equivalent for product-Poisson and product-multinomial sampling assumptions.
This useful equivalence result generalizes existing results. The models and fitting method are illustrated for several applications.

Missing data are often a problem for multivariate response data. We consider inferences about loglinear models for which only certain disjoint sums of the data are observable. We derive an explicit formula for the observed information matrix associated with the loglinear parameters that is intuitively appealing and simple to evaluate. The observed information matrix can be evaluated at the maximum likelihood estimates and inverted to obtain an estimate of the precision of the loglinear parameter estimates. The EM algorithm can be used to fit these incomplete-data loglinear models. We describe this algorithm in some detail, paying special attention to the Poisson loglinear model fitting case. Alternative fitting algorithms are also outlined. One proposed alternative uses both the EM and Newton-Raphson algorithms, thereby resulting in a faster, more stable algorithm. We illustrate the utility of these results using latent class model fitting.

CHAPTER 1
INTRODUCTION

1.1 A Brief Introduction to the Problem

There are many situations when multiple responses are observed for each 'subject' in a group, or several groups. Here 'subject' is generically used to refer to a randomly chosen object that generates responses. The multiple responses could represent repeated measurements taken on subjects over time or occasions. They could be the ratings assigned by several judges that all viewed and rated the same set of slides (here, the 'subjects' are the slides). Or, perhaps, it may be that several distinct or noncommensurate responses are recorded for each subject. These responses are often categorical (ordinal or nominal) and inevitably interrelated. This dissertation addresses issues related to modeling and model fitting for multivariate categorical (ordinal or nominal) responses.
Models for multivariate categorical response data are usually developed to answer questions about (i) the association structure among the multiple responses or (ii) the behavior of the marginal distributions of the response variables. Specifically, a typical question of the first type is, "How are the responses interrelated, and is this interrelationship the same across the levels of the covariates?" A typical type (ii) question is, "How do the (marginal) responses depend on the covariates or occasions?" Historically, many models (e.g. log- and logit-linear models) have been developed for the primary purpose of answering the type (i) questions. Many of these models can easily be fitted using maximum likelihood (ML) methods. These models typically, however, are not useful for answering the type (ii) questions (Cox, 1972). Marginal models, those used to answer type (ii) questions, are not as well developed. One reason for this is that ML fitting of these marginal models is more difficult. At present, the method of weighted least squares (WLS) is used almost exclusively for fitting these models.

Suppose that we are interested in answering questions of both types (i) and (ii). Usually the questions are addressed using two different models, a joint distribution model and a marginal model, and fitting them separately. It seems reasonable to want a model that can be used to address both questions simultaneously. That is, we would like a model that simultaneously implies structure on both the joint and marginal distribution parameters. To date, there has been very little work done on the development and fitting of these simultaneous models.

Whenever multiple responses are observed, it is inevitable that there will be missing data. There are several ways to fit the Poisson loglinear model with incomplete data. One popular method is to use the EM algorithm to find the ML estimates of the loglinear parameters.
One drawback to this algorithm is that a precision estimate of the ML estimators is not produced as a by-product. Several numerical techniques have been developed to approximate the observed information matrix, which, upon inversion, will act as the precision estimate. However, it would be of some convenience to derive an explicit formula for the observed information matrix, at least in some special cases.

1.2 Outline of Existing Methodologies: No Missing Data

We begin our discussion by considering the case of no missing data. There are many methods for analyzing multivariate categorical (ordinal or nominal) response data. These methods usually involve fitting (separately) models for the joint or the marginal distributions of the response vectors. In rare instances, simultaneous models for both the joint and marginal distributions are considered. Maximum likelihood fitting methods for the joint distribution models are simple and described in almost every standard text on categorical data analysis. The fitting of marginal models using ML methods is more difficult. Maximum likelihood fitting of the marginal homogeneity model was considered by Madansky (1963) and Lipsitz (1988). The fitting of a more general class of marginal models was considered by Haber (1985a). Finally, the fitting of simultaneous models using ML methods has only been addressed in the bivariate response case. The fitting technique becomes very complicated when there are more than two categorical responses. To appreciate the complexity of extending the technique to multivariate response data, see section 6.5 of McCullagh and Nelder (1989) or perhaps Dale (1986). In contrast, the ML fitting method of Chapter 2 can easily be used to fit many marginal and simultaneous models. In the next few paragraphs, we briefly describe the existing methods for modeling and model fitting for multivariate categorical response data.

Modeling Joint Distributions Separately.
One common method for analyzing multivariate categorical responses is to model the joint distribution only. These models, which include classical log- and logit-linear models for the joint probabilities, are useful for describing the association structure among the responses. The last 30 years have seen the development of these methods for analyzing multivariate categorical responses (Haberman, 1979; Bishop et al., 1975; Agresti, 1984, 1990). For specificity, consider the following panel study: One hundred randomly selected subjects were asked how interested they were in the political campaigns. They were to respond on the 3-point ordinal scale, (1) Not Much, (2) Somewhat, and (3) Very Much. Then four years later the same group of subjects was asked to respond on the same scale to the same question. A separate investigation into the association structure would enable us to answer questions of a conditional nature. For example, we could estimate the probability of responding 'Very Much' on the second occasion given that the response at the first occasion was 'Not Much'. The description of these 'transitional' probabilities, although very interesting, may not be completely satisfactory. We may also be interested in addressing questions with regard to the marginal distributions. Perhaps we would like to answer the question, "Are the distributions of responses to the political interest question the same for each occasion?"

Laird (1991), in a nice review of likelihood-based methods for longitudinal analysis, mentions that the utility of classical log- and logit-linear models is restricted to two situations: (1) modeling the dependence of a univariate response on a set of covariates and (2) modeling the association structure between a set of multivariate responses. These models place structure on the joint probabilities, and so they are not directly useful for studying the dependence of the marginal probabilities on occasion and other covariates.
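To make the two kinds of questions concrete, here is a small numerical sketch for a panel study of this form. The counts and layout are invented for illustration (they are not the dissertation's data): type (i) questions concern transition probabilities computed within rows of the cross-classification, while type (ii) questions compare the two occasion-specific margins.

```python
import numpy as np

# Hypothetical 3x3 panel table for the political-interest question:
# rows = occasion 1, columns = occasion 2, categories
# (1) Not Much, (2) Somewhat, (3) Very Much.  Counts are invented.
n = np.array([[20., 12.,  8.],
              [10., 15., 10.],
              [ 5.,  8., 12.]])
p = n / n.sum()

# Type (i): transitional probabilities P(Y2 = k | Y1 = j), e.g.
# P('Very Much' at occasion 2 | 'Not Much' at occasion 1)
transition = n / n.sum(axis=1, keepdims=True)

# Type (ii): occasion-specific marginal distributions
marg1 = p.sum(axis=1)   # distribution at occasion 1
marg2 = p.sum(axis=0)   # distribution at occasion 2
```

Classical loglinear models structure the joint probabilities p; a comparison of marg1 with marg2 (marginal homogeneity) is a statement about the margins that the loglinear parameterization does not address directly.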
This problem was pointed out by several authors (Cox, 1972; Prentice, 1988; McCullagh and Nelder, 1989; Liang et al., 1991). An advantage of these models is that they are simple to fit using either WLS (Grizzle et al., 1969), ML (McCullagh and Nelder, 1989), or iterative proportional fitting (Bishop et al., 1975) methods. There are many standard statistical programs available for fitting these models (SAS, SPSS, BMDP, GLIM, GENSTAT).

Modeling Marginal Distributions Separately. A second approach to analyzing multivariate categorical responses is to model only the marginal distributions and to ignore the joint distribution structure. Full likelihood methods that consider only models for the marginal probabilities tacitly assume a saturated model for the joint distribution. Therefore, the models may be far from parsimonious. In the non-Gaussian response setting, there is a distinction between these marginal models and the transitional (or conditional) models of the previous paragraph. Marginal models describe the occasion-specific distributions and the dependence of those distributions on the covariates. Transitional or conditional models describe the distribution of individual changes over occasions. Models for these transitions can be represented as probability distributions for the future state 'given' the past states. Questions regarding transition probabilities can only be investigated with longitudinal data. On the other hand, questions regarding the marginal probabilities could theoretically be answered using cross-sectional data, provided the cohort (subject) effects were negligible. Panel studies resulting in longitudinal data result in more powerful tests for significance of within-cluster factors, such as occasion effect. This follows because there is a reduced cohort effect; we are using the same panel of subjects at each occasion. For further discussion about the distinction between marginal and transitional models, see Ware et al.
(1988), Laird (1991), and Zeger (1988).

We will briefly discuss existing methods for making inferences about the marginal probabilities separately. We will group these methods into five categories: (1) non-model-based methods, (2) WLS methods, (3) ML methods, (4) semiparametric methods, and (5) other methods.

Non-model-based methods can be used to derive test statistics for testing specific hypotheses regarding the marginal distributions. Examples include the Cochran-Mantel-Haenszel (1950, 1959) statistic, which can be used for testing the hypothesis of marginal homogeneity (MH) (cf. White et al., 1982); McNemar's (1947) statistic, which can be used for testing the equality of two dependent proportions; and Madansky's (1963) likelihood-ratio statistic for MH. Madansky's statistic is a difference in fit of the model of marginal homogeneity to the fit of the unstructured (saturated) model (see also Lipsitz, 1988, and Lipsitz et al., 1990). Many other relevant test statistics, some of which are generalizations or modifications of the aforementioned (cf. Mantel, 1963; White et al., 1982), exist. Cochran's (1950) Q statistic and Darroch's (1981) Wald-type statistic are examples of other test statistics that can be used to test for marginal homogeneity.

Presently, if one were to fit a marginal model, say a generalized loglinear model of the form C log Aμ = Xβ, where μ is the vector of expected counts in the full contingency table, he or she would most likely use the WLS fitting algorithm. Most statistical software that fits these generalized loglinear models does so using WLS. There are some advantages to using WLS. It is computationally simple. Second-order marginal information is all that is needed. And the estimates are asymptotically equivalent to ML estimates. Some disadvantages are that covariates must be categorical, sampling zeroes create problems, and estimates are sensitive when second-order marginal counts are small.
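As a rough sketch of how such a WLS (GSK-type) fit proceeds for a single multinomial sample, consider the model C log(Ap) = Xβ, where p is the cell-probability vector. Everything below (the 2×2 data, the choices of A, C, X, and the function name wls_fit) is invented for illustration: the response functions are the two marginal logits of a 2×2 table, and with a saturated design matrix the one-step WLS estimate reproduces them exactly.

```python
import numpy as np

def wls_fit(counts, A, C, X):
    """One-step WLS (GSK-style) estimate for the model C log(A p) = X beta,
    for a single multinomial sample.  (Illustrative sketch, not the
    dissertation's algorithm.)"""
    n = counts.sum()
    p = counts / n
    F = C @ np.log(A @ p)                    # observed response functions
    H = C @ np.diag(1.0 / (A @ p)) @ A       # delta-method Jacobian of F at p
    Vp = (np.diag(p) - np.outer(p, p)) / n   # covariance of the proportions
    VF = H @ Vp @ H.T                        # estimated covariance of F
    W = np.linalg.inv(VF)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ F)
    return beta, F

counts = np.array([30., 20., 10., 40.])  # hypothetical 2x2 table, cells (11,12,21,22)
A = np.array([[1., 1., 0., 0.],          # first-variable margin, category 1
              [0., 0., 1., 1.],          # first-variable margin, category 2
              [1., 0., 1., 0.],          # second-variable margin, category 1
              [0., 1., 0., 1.]])         # second-variable margin, category 2
C = np.array([[1., -1., 0., 0.],         # logit of the first margin
              [0., 0., 1., -1.]])        # logit of the second margin
X = np.eye(2)                            # saturated model for the two logits
beta, F = wls_fit(counts, A, C, X)
```

With categorical covariates the same construction applies per subpopulation, which is why only the (second-order) marginal counts are needed and why small marginal counts or sampling zeroes destabilize the weight matrix.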
The WLS method for analyzing categorical data was originally outlined by Grizzle, Starmer and Koch (1969). Subsequently, marginal models for longitudinal categorical data, or more generally multivariate categorical response data, have been introduced and fitted using the WLS method (Koch et al., 1977; Landis and Koch, 1979; Landis et al., 1988; Agresti, 1989).

Maximum likelihood fitting of marginal models is more difficult since the model utilizes marginal probabilities, rather than the joint probabilities to which the likelihood refers. When the responses are correlated, as they invariably are, the marginal counts do not follow a product-multinomial distribution. The full-table likelihood must be maximized subject to the constraint that the marginal probabilities satisfy the model. Haber (1985a) considers fitting generalized loglinear models of the form C log Aμ = Xβ using Lagrange multipliers and an unmodified Newton-Raphson iterative scheme. The algorithm becomes very difficult to implement for even moderately large tables. This is primarily due to the difficulty of inverting the large Hessian matrix of the Lagrangian objective function. In this dissertation we consider a modified Newton-Raphson that uses a much simpler matrix than the Hessian. The matrix is easily inverted even for relatively large tables. Haber (1985b) considers the estimation of the parameters β in the special case C log μ = Xβ. We will use a modification of the method of Aitchison and Silvey (1958, 1960) and Silvey (1959) to investigate the asymptotic behavior of the estimators of β in the more general model C log Aμ = Xβ, thereby extending the work of Haber (1985b). Another relevant paper, Haber and Brown (1986), considers ML fitting of a model for the expected counts μ that has loglinear and linear constraints. One can test hypotheses about the marginal probabilities by comparing the fit of relevant models.
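The constrained-maximization idea can be previewed with a general-purpose optimizer standing in for the tailored Lagrange/Newton-Raphson schemes discussed above. Below is a minimal sketch for the marginal homogeneity model on a hypothetical 2×2 table (counts invented); in this small case the constrained maximum is available in closed form, which makes the result easy to check.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical 2x2 table of counts, cells ordered (11, 12, 21, 22)
y = np.array([20., 5., 15., 60.])

def negloglik(p):
    # Multinomial kernel: -sum n_ij log p_ij
    return -np.sum(y * np.log(p))

# Maximize the full-table likelihood subject to the constraints that the
# probabilities sum to one and the margins agree (for a 2x2, p12 = p21).
constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
               {"type": "eq", "fun": lambda p: p[1] - p[2]}]
res = minimize(negloglik, np.full(4, 0.25), method="SLSQP",
               bounds=[(1e-8, 1.0)] * 4, constraints=constraints)
p_hat = res.x
# Closed-form ML fit under marginal homogeneity for a 2x2 table:
# p11 = n11/n, p22 = n22/n, p12 = p21 = (n12 + n21)/(2n)
```

A black-box optimizer works for a 4-cell table, but its cost and instability grow quickly with table size; the point of the modified Newton-Raphson scheme of Chapter 2 is to exploit the structure of the Lagrangian so that only a simple matrix need be inverted.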
Haber (1985a, 1985b) and Haber and Brown (1986) only consider fitting the marginal models separately. No attempt has been made to simultaneously model the joint and marginal distributions.

Semiparametric methods such as quasi-likelihood (Wedderburn, 1974) and a multivariate extension, generalized estimating equations (GEE), have become popular in recent years. The work of Liang and Zeger (1986), which advocated the use of these GEEs, has been extended to cover the multivariate categorical response data setting (Prentice, 1988; Zhao and Prentice, 1991; Stram et al., 1988; Liang et al., 1991). With these semiparametric methods, the likelihood is not completely specified. Instead, generalized estimating equations are chosen so that, when the marginal model holds, even if the association among the multiple responses is misspecified, the estimators are consistent and asymptotically normally distributed. These estimators, used in conjunction with a robust estimator of their covariance (Liang and Zeger, 1986; Zeger and Liang, 1986; White, 1980, 1981, 1982; Royall, 1986), result in consistent inference about the effects of interest. When the responses are truly independent, the estimating equations, with correlation matrix taken to be the identity matrix, are equivalent to the likelihood equations. The GEE approach requires the specification of a 'working' association or correlation matrix. Examples of working associations include those that imply all pairwise associations (measured in terms of odds ratios) are the same and that the higher order associations are negligible (Liang et al., 1991).

A related approach is known as GEE2. The consistency of these estimators follows only if both the marginal model and the pairwise association model are correctly specified. This approach is a second-order extension of the GEEs of Liang and Zeger (1986), which are now termed GEE1.
It is second-order because the estimation of the marginal model parameters and the pairwise association model parameters is considered simultaneously. The focus of both approaches, GEE1 and GEE2, is usually on modeling the marginal distributions, investigating how the marginal distributions depend on occasion and covariates. The association is considered a nuisance. Presently, there are no tests for goodness-of-fit of these models, and so the investigation into how well both models fit can be done only at an empirical level. The assumption that higher order effects are negligible may not be tenable. Testing procedures to assess the validity of these assumptions have yet to be developed. Also, in contrast to WLS and ML methods, which require only that the missing data be 'missing at random' (MAR), the semiparametric approaches require the missing data to be 'missing completely at random' (MCAR). The assumption that the missing data mechanism is MCAR is a much stronger assumption than MAR (Little and Rubin, 1986).

Finally, there are many other approaches to analyzing the marginal probability structure separately. There are random effects models, whereby subject-specific random effects induce a correlation structure on the multiple responses. The marginal approach, in which the full likelihood is obtained by averaging across the random effects, is computationally difficult (Stiratelli
Agresti and Pendergast (1986) consider replacing the actual observations by their within cluster rank and testing for marginal homogeneity using the ordinary ANOVA statistic for repeated measures data. A threestage estimator for repeated measures studies with possibly missing binary responses has been developed by Lipsitz et al. (1992). This approach is very similar to a generalized least squares approach, but it has some of the nice features of the GEE approaches. One of these nice features is that the estimators and their variance estimates are consistent under very mild assumptions. An extension of this method to the polytomous response case has yet to be developed. Simultaneous Investigation of Joint and Marginal Distributions. There has been very little work done to investigate simultaneously the joint and marginal distribution structure. In some ways GEE2 is an attempt to describe both distributions. However, only the pairwise (not the joint) association structure is modeled; the higherorder associations are considered a nuisance. Tests comparing nested models have not been developed in this semiparametric setting. Full likelihood approaches have been addressed by Dale (1986), McCullagh and Nelder (1989, Chapt. 6), and Becker and Balagtas (1991). Dale models the joint distributions of bivariate ordered 11 categorical responses by assuming that the log global odds ratios follow a linear model. The marginal probabilities are assumed to follow a cumulative logit model. McCullagh and Nelder consider simultaneously modeling the joint and marginal probabilities of a bivariate dichotomous response (two distinct responses) by assuming that the log oddsratios follow a linear model and that the marginal probabilities follow a logitlinear model. Their example included age as a categorical covariate. Finally, Becker and Balagtas consider models for twoperiod crossover data. The bivariate dichotomous response was the response to the two different treatments. 
Order of treatment application was considered a covariate. They assumed that the two log odds ratios followed a linear model and that the marginal probabilities satisfied a loglinear model. Because it is the marginal probabilities and not the joint probabilities that satisfy a loglinear model, Becker and Balagtas refer to the model as log nonlinear. The ML model fitting approach used by each of these authors involves a reparameterization of the likelihood, which is a function of the joint probabilities, in terms of the joint and marginal model parameters. The reparameterization in the bivariate response casethe case each author consideredis somewhat complicated especially for multilevel responses. To make matters worse, the extension of this method to general multivariate polytomous responses looks to be extremely difficult. If the repaparameter izations are made so that the full likelihood is expressible in terms of the joint and marginal model parameters, the likelihood can be maximized using a NewtonRaphsontype algorithm. Basically, one must solve for the root of some nonlinear score equation. This maximization approach is very sensitive 12 to the starting value in that convergence to a local maximum is not likely unless the starting estimate is very close to the actual maximum. Finding reasonable starting values is not a simple task. Dale (1986) outlines a method, specifically for the models considered in that paper, for finding a starting estimate. In this dissertation, we outline an ML fitting method that can easily be used to fit a large class of simultaneous models, including those considered by Dale, McCullagh and Nelder, and Becker and Balagtas. The approach involves using Lagrange's method of undetermined multipliers along with a modified NewtonRaphson iterative scheme. 
For all of the models considered, an initial estimate for the algorithm is the data counts themselves along with a vector of zeroes corresponding to a first guess at the values of the Lagrange multipliers. The convergence of the algorithm is quite stable. The extension to multivariate polytomous response data is straightforward.

1.3 Outline of Existing Methodologies: Missing Data

Missing data are often an issue when the response is multivariate in nature. Missing data can also occur in more hypothetical situations. Examples include loglinear latent class models (Goodman, 1974; Haberman, 1988) and linear mixed or random effects models (Laird et al., 1987). In latent class analyses, a latent variable, which is unobservable, is assumed to exist. Mixed or random effects models posit the existence of some unobservable random variables that affect the mean response. In this brief outline, we will consider ML methods for model fitting when the data are not completely observable. Little and Rubin (1986) provide a nice summary of methods for model fitting with incomplete data.

There are many ways to find the maximum likelihood estimators when the data are not completely observable, each method having its positive and negative features. We could work directly with the incomplete-data likelihood, which is usually complicated relative to the complete-data likelihood, and use a Newton-Raphson or Fisher-scoring algorithm. Palmgren and Ekholm (1987) and Haberman (1988) use these methods to obtain maximum likelihood estimates and their standard errors. Alternatively, we could avoid the complicated likelihood altogether and use the Expectation-Maximization algorithm (Dempster et al., 1977). Sundberg (1976) discusses the properties of the EM algorithm when it is used to fit models to data coming from the regular exponential family. The EM algorithm is one of the more flexible ML fitting algorithms for missing data situations.
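To fix ideas, here is the E/M iteration for the classic genetic-linkage example of Dempster, Laird and Rubin (1977), in which, as in the loglinear setting considered later, only a sum of two latent complete-data counts is observed. (This is a standard textbook illustration, not one of the dissertation's applications; the multinomial cell probabilities are (1/2 + θ/4, (1−θ)/4, (1−θ)/4, θ/4), with the first cell the sum of latent counts having probabilities 1/2 and θ/4.)

```python
# EM for the genetic-linkage example of Dempster, Laird and Rubin (1977).
y = (125.0, 18.0, 20.0, 34.0)   # observed multinomial counts
theta = 0.5                     # starting value
for _ in range(50):
    # E-step: expected latent count falling in the theta/4 part of cell 1
    x12 = y[0] * (theta / 4.0) / (0.5 + theta / 4.0)
    # M-step: complete-data ML update for theta given the imputed split
    theta = (x12 + y[3]) / (x12 + y[1] + y[2] + y[3])
# theta converges to the ML estimate, approximately 0.6268
```

Each iteration increases the observed-data likelihood, but nothing in the loop produces a standard error for theta; that is precisely the gap that numerical approximations to the observed information matrix, and the explicit formula derived in Chapter 4, are meant to fill.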
We will primarily focus on this method for fitting loglinear models with incomplete data. Although the EM algorithm is easily implemented to fit loglinear models with incomplete data, the algorithm does not provide an estimate of precision of the model parameter estimators. Meng and Rubin (1991) outline a supplemental EM (SEM) algorithm, whereby, upon convergence of the EM algorithm, the variance matrix for the model estimators is adjusted to account for missing data. The adjustment is a function of the rate of convergence of the EM algorithm, which in turn is a function of how much information is missing. Meng and Rubin numerically estimate the rate of convergence, thereby obtaining an estimate of precision that reflects missingness. Although this approach should prove to be applicable in the general situation, it still is desirable to derive an explicit formula for the variance matrix that reflects missingness. Other authors (Meilijson, 1989; Louis, 1982) have discussed methods for estimating precision of model estimators when the data are incomplete and the EM algorithm is used. Meilijson's method involves EM-aided differentiation, which is essentially a numerical differentiation of the score vector. The method relies on the assumption that the observed data components are i.i.d. (independent and identically distributed). Louis gives an analytic formula for the observed information matrix based on the incomplete data. The computation of the observed information matrix based on this formula is not straightforward and must be considered separately for each special application.

1.4 Format of Dissertation

In Chapter 2, we develop a maximum likelihood method for fitting a large class of models for multivariate categorical response data. This development follows a general discussion about parametric modeling. Concepts such as degrees of freedom and model distances (or goodness of fit) are described at an intuitive level.
We also describe and compare the asymptotic distributions of freedom parameter estimators under product-multinomial and product-Poisson sampling assumptions. Chapter 3 has more of an applied flavor. We consider simultaneously modeling the joint and marginal distributions of multivariate categorical response vectors. A broad class of simultaneous models is introduced. The models can be fitted using the techniques of Chapter 2. Several numerical examples are considered. Chapter 4 outlines the ML fitting technique known as the EM algorithm. This algorithm is used to fit models with incomplete data. Some advantages and disadvantages of using the EM algorithm are addressed. The most important disadvantage is that the algorithm does not provide, as a byproduct, a precision estimate of the ML estimators. We derive an explicit formula for the observed information matrix for the Poisson loglinear model parameters when only disjoint sums of the complete data are observable. An application to latent class modeling is considered. We also propose an ML fitting algorithm that uses both EM and Newton-Raphson steps. The modified algorithm should prove to have many positive features. In this dissertation, we do not distinguish typographically between scalars, vectors, and matrices. Parameters and variables are treated as objects, their dimensions either being explicitly stated or implied contextually. By convention, functions that map scalars into scalars, when applied to vectors, will be defined componentwise. For example, if μ represents an n × 1 vector, then log μ = (log μ₁, log μ₂, ..., log μₙ)'. We frequently use abbreviations that are common in the statistical literature. They include ML (Maximum Likelihood), WLS (Weighted Least Squares), IWLS (Iterative (Re)Weighted Least Squares), and EM (Expectation-Maximization). The range (or column) space of an n × p matrix X is denoted by M(X) and is defined as {μ : μ = Xβ, β ∈ Rᵖ}.
The symbols ⊗ and ⊕ are the binary operators 'direct product' and 'direct sum'. The direct (or Kronecker) product is taken to be the right-hand product. That is, A ⊗ B = {A bᵢⱼ}. The direct sum, C, of two matrices A and B is defined as

    C = A ⊕ B = ( A  0 )
                ( 0  B ).

The symbol D(μ) represents a diagonal matrix with the elements of μ on the diagonal. That is,

    D(μ) = ( μ₁  0  ...  0 )
           ( 0   μ₂ ...  0 )
           ( ...           )
           ( 0   0  ...  μₙ ).

In Chapter 4, we make use of the bracket notation often used by statistical and mathematical programming languages (e.g. Splus, Matlab). To illustrate the notation, consider a matrix A. The (sub)matrix A[, -2] is then the matrix A with the second column deleted. Similarly, the matrix A[-3, ] is the matrix A with the third row deleted. Equation numbering is consecutive within sections of a chapter, the first number representing the chapter in which it appears. For example, the thirteenth equation in section 2.3 is equation (2.3.13). Within each appendix, the equations are numbered consecutively. For example, the third equation in Appendix B is numbered (B.3). Tables are numbered consecutively within chapters so that, for instance, Table 3.2 represents the second table within Chapter 3. Theorems, lemmas, and corollaries are numbered independently of each other. All are numbered consecutively within sections. Therefore, Corollary 3.2.2 is the second corollary within section 3.2 and Theorem 2.3.1 is the first theorem within section 2.3.

CHAPTER 2
RESTRICTED MAXIMUM LIKELIHOOD FOR A GENERAL CLASS OF MODELS FOR POLYTOMOUS RESPONSE DATA

2.1 Introduction

In this chapter, we consider using maximum likelihood methods to fit a general class of parametric models for univariate or multivariate polytomous response data. The models will be specified in terms of freedom equations and/or constraint equations. These two ways of specifying models will be discussed at length in section 2.2. The model specification equations may be linear or nonlinear in the model parameters.
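The notational conventions above (direct product, direct sum, D(μ), and bracket deletion) have direct NumPy/SciPy analogues; the sketch below is an illustration under that assumption, not part of the original text. Note in particular that NumPy's kron uses the left-hand convention {aᵢⱼ B}, so the right-hand product {A bᵢⱼ} corresponds to np.kron(B, A).

```python
import numpy as np
from scipy.linalg import block_diag

A = np.arange(1, 10).reshape(3, 3)    # [[1,2,3],[4,5,6],[7,8,9]]
B = np.array([[0, 5],
              [6, 7]])
mu = np.array([1.0, 2.0, 3.0])

# Right-hand direct (Kronecker) product A (x) B = {A b_ij}: blocks of A
# scaled by the entries of B, i.e. np.kron(B, A) in NumPy's convention.
right_kron = np.kron(B, A)

# Direct sum: C = A (+) B is block-diagonal with A and B on the diagonal.
C = block_diag(A, B)

# D(mu): diagonal matrix with the elements of mu on the diagonal.
D = np.diag(mu)

# Bracket notation: A[, -2] deletes the second column, A[-3, ] the third row.
A_col_del = np.delete(A, 1, axis=1)
A_row_del = np.delete(A, 2, axis=0)
```

The top-left block of `right_kron` is b₁₁·A, matching the right-hand convention stated in the text.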
Specifically, if μ represents the s × 1 vector of expected cell means, the linear constraints will be of the form Lμ = d and the nonlinear constraints will be of the form U'C log(Aμ) = 0. The freedom equations will have the form C log(Aμ) = Xβ, where the components of the vector β are referred to as the freedom parameters. In Chapter 3 of this dissertation, we discuss more specifically models that can be specified in terms of these constraint and freedom equations. The models of that chapter allow one to simultaneously model the joint and marginal distributions of multivariate polytomous response vectors. The maximum likelihood model fitting algorithm of this chapter utilizes Lagrange multipliers and a modified Newton-Raphson iterative scheme. In particular, the models will be specified in terms of constraint equations and the log likelihood will be maximized subject to the constraint equations being satisfied. One common optimization algorithm found in the mathematics literature is Lagrange's method of undetermined multipliers. We show that Lagrange's method is easily implemented for ML fitting of the models under consideration in this chapter. One problem with Lagrange's method of undetermined multipliers for ML fitting of statistical models has been that it becomes computationally infeasible for large data sets. By using a modified Newton-Raphson method, which involves inverting a matrix of a simpler form than the more complicated Hessian, we consider fitting models to relatively large data sets. We also explore the asymptotic behavior of the estimators within the framework of constraint- rather than freedom-models. Usually, asymptotic properties of model and freedom parameter estimators are studied within the framework of freedom models. Aitchison and Silvey (1958, 1960) and Silvey (1959) studied the asymptotic behavior of the model parameter estimators when the model is specified in terms of constraint equations.
Following the arguments of Aitchison and Silvey, we derive the asymptotic distributions of both the model and freedom parameter estimators. Previous work by Haber (1985a) addressed maximum likelihood methods for fitting models of the form C log(Aμ) = Xβ to categorical response data. Subsequently, Haber and Brown (1986) discussed ML fitting for loglinear models that were also subject to the linear constraints Lμ = d, where these constraints necessarily include the identifiability constraint required of μ, the vector of product-multinomial cell means. Both of these papers advocated the use of Lagrange's method of undetermined multipliers to find the maximum likelihood estimates of the model parameters μ. The method of Haber (1985a) involved using the (unmodified) Newton-Raphson method, which becomes computationally unattractive as the number of components in μ gets moderately large. Both Haber (1985a) and Haber and Brown (1986) were primarily concerned with measuring model goodness of fit and therefore did not consider estimation of freedom parameters. Haber (1985b) did consider estimation of freedom parameters, but only when the simpler model C log μ = Xβ was used. One of the several ways that we extend the work of Haber (1985a, 1985b) and Haber and Brown (1986) is to consider estimation of the freedom parameters when the more general model C log(Aμ) = Xβ is used. Others have considered ML fitting of nonstandard models for multivariate polytomous response data. Laird (1991) outlines the different approaches taken by different authors. As an example, Dale (1986) considered ML fitting for a particular class of models for bivariate polytomous ordered response data which were of the form

    C₁ log(A₁μ) = X₁β₁,    g(A₂μ) = X₂β₂.

Specifically, the first freedom equation specifies a loglinear model for the association between the two responses measured by the global cross-ratios (cross-product ratios of quadrant probabilities), so that C₁ and A₁ are of a particular form.
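For concreteness, the global cross-ratios just mentioned can be computed from a two-way probability table by forming the four quadrant probabilities at each cutpoint (i, j). The function below is an illustrative sketch, not Dale's implementation; the 3 × 3 table used to exercise it is hypothetical, chosen to satisfy independence, under which every global cross-ratio equals 1 (log zero).

```python
import numpy as np

def log_global_cross_ratios(pi):
    """Log global (quadrant) cross-ratios of a two-way probability table pi.
    At cutpoint (i, j) the global cross-ratio is
    P(R<=i, C<=j) P(R>i, C>j) / [ P(R<=i, C>j) P(R>i, C<=j) ]."""
    r, c = pi.shape
    out = np.empty((r - 1, c - 1))
    for i in range(r - 1):
        for j in range(c - 1):
            p11 = pi[:i + 1, :j + 1].sum()   # low-low quadrant
            p12 = pi[:i + 1, j + 1:].sum()   # low-high quadrant
            p21 = pi[i + 1:, :j + 1].sum()   # high-low quadrant
            p22 = pi[i + 1:, j + 1:].sum()   # high-high quadrant
            out[i, j] = np.log(p11 * p22 / (p12 * p21))
    return out

# Hypothetical independence table: all (r-1)(c-1) log cross-ratios are zero.
row = np.array([0.2, 0.3, 0.5])
col = np.array([0.1, 0.4, 0.5])
print(log_global_cross_ratios(np.outer(row, col)))
```

In Dale's formulation, C₁ and A₁ are chosen so that C₁ log(A₁μ) produces exactly this vector of log global cross-ratios.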
The second set of freedom equations specifies some generalized linear model (McCullagh and Nelder, 1989) for the marginal means or probabilities. Maximum likelihood estimators for the association model freedom parameters β₁ and the marginal model freedom parameters β₂ were simultaneously computed by iteratively solving the score equations via a quasi-Newton approach. To use this maximization technique, the score functions, which involve the cell probabilities, must be written explicitly as a function of the freedom parameter β = vec(β₁, β₂). A nontrivial approach to finding reasonable starting values for β is discussed by Dale (1986). Along with Dale, McCullagh and Nelder (1989, section 6.5) and Becker and Balagtas (1991) consider writing the score as an explicit function of the freedom parameters so that the marginal and association freedom parameter estimates may be computed simultaneously. In general, when there are more than two responses, this is not a simple task, and so an extension of this method to multivariate polytomous response data models will be very messy indeed. Also, convergence of the iterative scheme requires good initial estimates of the freedom parameter β. These may be very difficult to find. In contrast, the maximization approach of this chapter, which is similar to Haber (1985a) and Haber and Brown (1986), is shown to be easily implemented for fitting multivariate polytomous response data models. With this technique, it is not necessary to write the cell means as an explicit function of the freedom parameters. Further, initial estimates of the freedom parameters, which are difficult to find, are not needed for this technique. Instead, only initial estimates of the cell means and undetermined multipliers are needed. Reasonable initial estimates of the cell means are the cell counts themselves.
A reasonable initial estimate of the vector of undetermined multipliers is the zero vector, the value of the undetermined multipliers when the model fits the data perfectly.

We will now introduce the class of models that we will consider for the remainder of this chapter and the next, more applied chapter. The models have form

    C₁ log(A₁μ) = X₁β₁,    C₂ log(A₂μ) = X₂β₂,    Lμ = d,

where the linear constraints include the identifiability constraints. Later, when we study the asymptotic behavior of the ML estimators, we will require the components of d to be zero unless they correspond to an identifiability constraint. These models, which are of the form C log(Aμ) = Xβ, Lμ = d, will allow us to model both the joint and marginal distributions simultaneously when dealing with multivariate response data. The bivariate association model of Dale (1986) is a special case of these models, as we can specify the matrices C₁ and A₁ so that C₁ log(A₁μ) is the vector of log bivariate global cross-ratios. Restricting the marginal models to have form C₂ log(A₂μ) = X₂β₂, rather than allowing the marginal means to follow a generalized linear model, as Dale (1986) did, is not overly restrictive. In fact, many of the generalized linear models for multinomial cell means can be written in this form. For example, loglinear, multiple logit, and cumulative logit models are of this form. Also, unlike Haber (1985a) and Haber and Brown (1986), we will be concerned with estimation of the freedom parameter β = vec(β₁, β₂), thereby allowing for model-based inference. Model-based inferences usually refer to inferences based on freedom parameters. With freedom equations, we have the luxury of choosing a parameterization that results in the freedom parameters having meaningful interpretations. For instance, a freedom parameter β may be chosen to represent a departure from independence in the form of a log odds ratio.
More generally, we usually will try to parameterize in such a way that certain parameters measure the magnitude of an effect of interest. For example, consider an opinion poll in which a group of subjects were asked on two different occasions whether they would vote for the President again in the next election. Suppose they were asked immediately after the President took office and again after the President had served for two years. The researcher may be interested in determining whether the distribution of response changed from Time 1 to Time 2 and, if so, in assessing the magnitude of the change. The data configuration can be displayed as in Table 2.1.

Table 2.1. Opinion Poll Data Configuration

                Data                              Probabilities
                Time 2                            Time 2
                yes    no                         yes    no
  Time 1  yes   y₁₁    y₁₂        Time 1  yes     π₁₁    π₁₂    π₁₊
          no    y₂₁    y₂₂                no      π₂₁    π₂₂    π₂₊
                                                  π₊₁    π₊₂

We could formulate a model of the form C log(Aμ) = Xβ in such a way that the freedom parameter β has a nice interpretation with respect to the hypothesis of interest. One such model is

    log( π(i) / (1 - π(i)) ) = α + ρᵢ,    i = 1, 2,    (2.1.1)

where the parameter π(i) is a marginal probability, i.e.

    π(i) = { π₁₊, if i = 1
           { π₊₁, if i = 2

and, for identifiability of the freedom parameters, -ρ₁ = ρ₂ = ρ. Model (2.1.1) is a simple logit model for the marginal probabilities {πⱼ₊} and {π₊ⱼ}. The parameter ρ measures the magnitude of departure from marginal homogeneity in that ρ = 0 if and only if there is marginal homogeneity. One could use the Wald statistic ρ̂/se(ρ̂) to test the hypothesis. If the null hypothesis is rejected, we can assess the magnitude of departure from marginal homogeneity by computing a confidence interval for 2ρ, which is the log odds ratio comparing the odds that a randomly chosen subject responds 'yes' at Time 2 to the odds that a randomly chosen subject responds 'yes' at Time 1. This simple example illustrates the utility of using freedom parameters and the corresponding model-based inferences.
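The saturated-fit estimates for a model like (2.1.1) can be written down directly from the sample marginal proportions. The sketch below uses hypothetical counts and assumes the identification in which the Time 1 and Time 2 effects are minus and plus a common ρ, so that ρ̂ is half the difference of the two marginal sample logits and 2ρ̂ estimates the log odds ratio discussed above.

```python
import numpy as np

# Hypothetical counts laid out as in Table 2.1:
# rows = Time 1 (yes, no), columns = Time 2 (yes, no).
y = np.array([[40, 10],
              [20, 30]])
n = y.sum()

p1 = y[0, :].sum() / n   # sample marginal P(yes at Time 1)
p2 = y[:, 0].sum() / n   # sample marginal P(yes at Time 2)

def logit(p):
    return np.log(p / (1 - p))

# Saturated fit of the two marginal logits under -rho_1 = rho_2 = rho:
alpha = (logit(p1) + logit(p2)) / 2
rho = (logit(p2) - logit(p1)) / 2

# 2*rho: log odds ratio of 'yes' at Time 2 versus 'yes' at Time 1.
print(2 * rho)
```

For these counts p1 = 0.5 and p2 = 0.6, so 2ρ̂ = log(1.5), indicating higher odds of 'yes' at Time 2 in this illustrative table.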
For this reason, this chapter will be concerned with making inferences about both the model parameters μ and the freedom parameters β. The contents of the following sections are as follows. In section 2.2, we provide an overview of parametric modeling. The two ways of specifying models, via constraint equations and via freedom equations, are discussed at length in section 2.2.1. It is shown that a model specified in terms of freedom equations can be respecified in terms of constraint equations. In particular, the freedom equation C log(Aμ) = Xβ, which actually constrains the function C log(Aμ) to lie in some manifold spanned by the columns of X, is equivalent to the constraint equation U'C log(Aμ) = 0, where the columns of U form a basis for the null space of X'. Other topics covered in section 2.2 include interpretation and calculation of 'degrees of freedom' and measuring model goodness of fit. We describe a general class of models for univariate or multivariate polytomous response data in section 2.3.1. The data vector y is initially assumed to be a realization of a product-multinomial random vector. We describe the asymptotic behavior of the product-multinomial ML estimators in section 2.3.3. Lagrange's method of undetermined multipliers is used to find restricted maximum likelihood estimates of the model parameters and the freedom parameters. The actual algorithm is described in detail in section 2.3.4. In section 2.4, we explore the relationship between the product-multinomial and product-Poisson ML estimators. General results that allow one to ascertain when inferences based on product-Poisson estimates are the same as inferences based on product-multinomial estimates are shown to follow quite directly when one works within the framework of constraint models. Theorem 2.4.2 of this section represents a generalization of the results of Birch (1963) and Palmgren (1981).
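The freedom-to-constraint translation described above can be checked numerically: given a model matrix X, any basis U for the null space of X' yields the equivalent constraint form. A minimal sketch with a hypothetical 4 × 2 matrix X, using scipy.linalg.null_space:

```python
import numpy as np
from scipy.linalg import null_space

# Hypothetical r x p model matrix (r = 4, p = 2, full column rank).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
r, p = X.shape

# Columns of U form an (orthonormal) basis for the null space of X',
# so U is r x (r - p) and U'X = 0.
U = null_space(X.T)

# Any vector of the form X beta automatically satisfies the constraints
# U'v = 0, illustrating that {v = X beta} = {v : U'v = 0}.
beta = np.array([2.0, -1.0])
v = X @ beta
```

The dimension count matches the text: the freedom form uses p parameters, the constraint form imposes r − p independent constraints.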
2.2 Parametric Modeling: An Overview

Inferences about the distribution of some n × 1 random vector Y are often based solely on a particular realization y of Y. In parametric modeling it is often the case that the distribution of Y is known up to an s × 1 vector of model parameters θ; i.e. it is 'known' that

    Y ~ F(y; θ),    θ ∈ Θ,    (2.2.1)

where Θ is some (s - q)-dimensional (q ≥ 0) subset of Rˢ known to contain the true unknown parameter θ*. The cumulative distribution function F maps points in Rⁿ into the unit interval [0, 1] and is assumed to be known. In general, we will allow the dimension s of θ to grow with n. For example, let Y = (Y₁, ..., Yₙ)' have independent components such that

    Yᵢ ~ ind G(yᵢ; zᵢ(θ)),    i = 1, ..., n,

where zᵢ(θ) is some function of θ associated with the ith component of Y. The function zᵢ could be defined as zᵢ(θ) = θᵢ, in which case s = n. Or, on the other hand, zᵢ could be a mapping from Rˢ to R¹ with s fixed.

2.2.1 Model Specification. In parametric settings, models for the data, or more precisely, models for the distribution of Y, can be completely specified by recording the family of candidate distributions that F may belong to. That is, one must specify the form of F(·; θ) and the space Θ_M that is assumed to contain the true value θ* of θ. In parametric modeling, the form of F(·; θ) is assumed known, but the true value θ* is not. Denote a parametric model by [F(·; θ); θ ∈ Θ_M] or more simply by [Θ_M]. We say the model [Θ_M] 'holds' if the true parameter value θ* is a member of Θ_M, i.e.

    [Θ_M] holds ⟺ θ* ∈ Θ_M.

A model does not hold if θ* ∉ Θ_M. The objective of model fitting is to find a simple, parsimonious model that holds (or nearly holds). By parsimonious, we mean that the vector θ can be obtained as a function of relatively few unknown parameters. An example of a parsimonious model for the distribution of an n-variate normal vector with unknown mean vector μ and known covariance is [Θ_β], where

    Θ_β = {μ ∈ Rⁿ : μⱼ = β, j = 1, ..., n, β unknown}.
Notice that all n components of μ can be obtained as a function of one unknown parameter β. Thus, all of our estimation efforts can be directed towards the estimation of the common mean β. An example of a nonparsimonious model is the so-called saturated model [Θ], where Θ = {μ : μ ∈ Rⁿ} = Rⁿ. In this case, μ is a function of n unknown parameters. The question of whether or not the parsimonious model holds is an entirely different matter. Practically speaking, a model will rarely strictly hold. Therefore, we will often say a model holds if it nearly holds, i.e. if for some small ε,

    inf_{θ ∈ Θ_M} ||θ* - θ|| < ε.

Without delving too much into the philosophy of model fitting and the simplicity principle (Foster and Martin, 1966), we point out that for a model to be practically useful it must be robust to the 'white noise' of the process generating Y. That is, it should account for only the obvious systematic variation. A model would be said to be robust to the white noise variability if the model parameter estimates based on different realizations of Y are very similar. As an example, if instead of [Θ_β] the saturated model [Θ] were used to draw inferences about the normal mean vector μ, we would find that the model fit perfectly, but that upon repeated sampling the model estimates would change dramatically. Thus, the model is not robust to the white noise of the process. On the other hand, the parsimonious model [Θ_β] estimates would change very little from sample to sample, varying with the sample mean of n observations. This model is robust to the white noise variability. Therefore, if the model holds, or nearly holds, we would say it is a good model.

Freedom Models. In the previous n-variate normal example we specified a model [Θ_β] in terms of some unknown parameter β. Aitchison and Silvey (1958, 1960) and Silvey (1959) refer to the parameter β as a 'freedom parameter' and the model [Θ_β] as a 'freedom model'.
These labels are reasonable since we can measure the amount of freedom we have for estimating θ by noting the number of independent freedom parameters in the model. The model [Θ_β] has one degree of freedom for estimating the mean vector μ. Thus, once an estimate of the single parameter β is obtained, the entire vector μ can be estimated; it is a function of the one parameter β. Notice that 'degrees' of freedom correspond to integer dimension in that a degree of freedom is gained (lost) if we introduce (omit) one independent freedom parameter, thereby increasing (decreasing) the dimensionality of the model space by one. In general we will denote a freedom model by [Θ_X], where

    Θ_X = {θ ∈ Θ : g(θ) = Xβ, β ∈ Rᵖ}.

The function g is some differentiable vector-valued function mapping θ ∈ Θ into r-dimensional Euclidean space Rʳ. The 'model' matrix X is an r × p full column rank matrix of known numbers. To calculate degrees of freedom for [Θ_X] we will initially assume g satisfies

    ∀θ ∈ Θ_X, ∂g(θ)/∂θ' is of full row rank r.

It also will be assumed that the constraints implied by g(θ) = Xβ are independent of the q constraints implied by the model [Θ] of (2.2.1). Well-defined models will satisfy these conditions. For example, any g that is invertible satisfies the derivative condition. Actually, this derivative condition is not a necessary condition for the model to be well defined. Later, we will show that g need only satisfy a milder derivative condition. The degrees of freedom for the model [Θ_X] can be obtained by subtracting the number of constraints implied by [Θ_X] from the total number of model parameters, s. The number of constraints implied by [Θ_X] is (r - p) + q, the dimension of the null space of X' plus the q constraints implied by model [Θ]. Hence, the model degrees of freedom for [Θ_X] is

    df[Θ_X] = s - (r - p + q).    (2.2.2)

In view of (2.2.2) the model degrees of freedom, an integer measure of the freedom one has for estimating θ, is an increasing function of p, the number of freedom parameters.
In fact, for the special case when q = 0 and g(θ) = θ (so that s = r), the number of degrees of freedom for model [Θ_X] is simply p, the number of freedom parameters. This gives us another good reason for calling β a freedom parameter and [Θ_X] a freedom model.

Constraint Models. Notice that

    {θ ∈ Θ : g(θ) = Xβ, β ∈ Rᵖ}    (2.2.3)

can be rewritten as

    {θ ∈ Θ : U'g(θ) = 0},

where U is an r × (r - p) full column rank matrix satisfying U'X = 0, i.e. the columns of U form a minimal spanning set, or basis, for the null space of X'. Letting u = r - p and h₂(θ) = 0 be the q constraints implied by [Θ], we can write the (u + q) × 1 vector of constraining functions as h(θ) = [h₁(θ), h₂(θ)]', where h₁ = U'g. We rewrite the freedom model [Θ_X] of (2.2.3) as [Θ_h], where

    Θ_h = {θ ∈ Rˢ : h(θ) = 0}.    (2.2.4)

Aitchison and Silvey (1958, 1960) refer to model [Θ_h] as a constraint model. Every freedom model can be written as a constraint model. We present a few simple examples to illustrate the equivalence between the two model formulations, freedom and constraint.

Example 1. Let Yᵢ ~ ind N(β, σ²), i = 1, ..., n, where σ² is known. This model can be specified as the freedom model [Θ_X], where

    Θ_X = {μ ∈ Rⁿ : μ = 1ₙβ, β unknown}

or equivalently it can be expressed as the constraint model [Θ_h], where Θ_h = {μ ∈ Rⁿ : U'μ = 0} and U' is the (n - 1) × n matrix

    U' = ( 1  -1   0   0  ...   0 )
         ( 1   0  -1   0  ...   0 )
         ( ...                    )
         ( 1   0   0   0  ...  -1 ).

It is easily seen that Θ_X = Θ_h and that the model degrees of freedom is df[Θ_X] = n - (n - 1) = 1.

Example 2. Let Yᵢ ~ ind N(μᵢ = β₀ + β₁xᵢ, σ²), i = 1, ..., n, where σ² is known. This model can be specified as the freedom model [Θ_X], where

    Θ_X = {μ ∈ Rⁿ : μᵢ = β₀ + β₁xᵢ, i = 1, ..., n}

or, assuming that each xᵢ is distinct, as the constraint model [Θ_h], where Θ_h = {μ ∈ Rⁿ : U'μ = 0}. Here U' is the (n - 2) × n matrix whose jth row, when multiplied by μ, yields the difference of slopes

    (μ₂ - μ₁)/(x₂ - x₁) - (μⱼ₊₂ - μⱼ₊₁)/(xⱼ₊₂ - xⱼ₊₁),    j = 1, ..., n - 2.

Notice that U'μ = 0 implies that

    (μⱼ₊₁ - μⱼ)/(xⱼ₊₁ - xⱼ) = (μₖ₊₁ - μₖ)/(xₖ₊₁ - xₖ)    ∀ k, j.

That is, the n means fall on a line. As before, it can be seen that Θ_X = Θ_h and that the model degrees of freedom is df[Θ_h] = n - (n - 2) = 2.

Definitions. We will assume that the constraining function h satisfies some reasonable conditions so that the model is well defined. We first present some definitions.

(1) A model [Θ_h] is said to be 'consistent' if Θ_h ≠ ∅.

(2) A consistent model [Θ_h] is said to be 'well-defined' if the Jacobian matrix of h is of full row rank v = u + q at every point in Θ_h. That is,

    ∀θ ∈ Θ_h, ∂h(θ)/∂θ' is of full row rank v.

(3) A model [Θ_h] is said to be 'ill-defined' if it is not well-defined, i.e.

    ∃θ ∈ Θ_h such that ∂h(θ)/∂θ' is not of full row rank v.

(4) An ill-defined model [Θ_h] is said to be 'inconsistent' or 'incompatible' if Θ_h = ∅.

Briefly, any reasonable model will have a nonempty parameter space and hence will be consistent. The Jacobian condition of definition (2) is similar to the condition required in the Implicit Function Theorem (see Bartle, 1976). Basically, this condition requires the constraints to be nonredundant so that, at least theoretically, the constraint equations can be written uniquely as a function of a smaller set of parameters. An ill-defined model has been specified with a redundant set of constraint equations. Using the lingo of the optimization literature, two constraints are redundant if, for each point in the parameter space, both of the constraints are 'active' or both of the constraints are 'inactive'. That is, for all parameter values, if one constraint is active (inactive) then the other is necessarily active (inactive). It should be noted that the above definitions are in terms of the constraint formulation of a model. This is sufficient since freedom models can be written as constraint models. For convenience, we give sufficient conditions for a freedom model to be well-defined.
A consistent freedom model is well-defined if it satisfies the following two conditions: (i) The constraints implied by g(θ) = Xβ are independent of the q constraints implied by [Θ]. (ii) The Jacobian matrix of g evaluated at any point in Θ_X is of full row rank r, i.e.

    ∀θ ∈ Θ_X, ∂g(θ)/∂θ' is of full row rank r.

The sufficiency of conditions (i) and (ii) can be seen by observing that (ii) implies that h₁ = U'g has a full row rank Jacobian, since U' is of full row rank, and (i) implies that h = (h₁, h₂)' has a full row rank Jacobian. These sufficient conditions are by no means necessary for a model to be well defined, as the Jacobian of h may be of full row rank v even when the Jacobian of g is not of full row rank. Notice that the model matrix has nothing to do with whether or not a model is well defined. In particular, one may think that the model [Θ_X] is ill-defined whenever the r × p matrix X is not of full column rank, i.e. the freedom parameters are nonestimable. However, the model can be rewritten as a constraint model with the full column rank matrix U spanning the null space of X', which in this case has dimension greater than r - p. It follows that if g satisfies (i) and (ii), then the model [Θ_X] will be well-defined. The only reason we have taken X to be of full column rank is to avoid using generalized inverses when working with the freedom parameters. To illustrate the use of these definitions, we consider the model [Θ_M], where Θ_M = {θ ∈ Rⁿ : Mθ - d = 0}. The model will be well defined if ∂h/∂θ' = M is of full row rank. It is inconsistent if the linear system of equations Mθ = d is inconsistent. If a model [Θ_h] is well defined, then the constraints implied by the model are all independent in that no constraint can be implied by the others. We will consider only well-defined models when calculating degrees of freedom.
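Example 2's constraint formulation, and the full-row-rank condition for well-definedness, can both be checked numerically. The sketch below builds the (n − 2) × n matrix U' row by row under the slope-difference structure described above, using hypothetical x values; a linear mean vector satisfies U'μ = 0 while a quadratic one does not, and U' has full row rank, so the constraint model is well defined.

```python
import numpy as np

# Hypothetical distinct design points.
x = np.array([0.0, 1.0, 2.5, 4.0, 7.0])
n = len(x)

# Row j of U' evaluates slope(1) - slope(j+1), where
# slope(k) = (mu_{k+1} - mu_k) / (x_{k+1} - x_k).
Ut = np.zeros((n - 2, n))
for j in range(n - 2):
    Ut[j, 0] -= 1 / (x[1] - x[0])
    Ut[j, 1] += 1 / (x[1] - x[0])
    Ut[j, j + 1] += 1 / (x[j + 2] - x[j + 1])
    Ut[j, j + 2] -= 1 / (x[j + 2] - x[j + 1])

mu_line = 3.0 - 2.0 * x   # means on a line: every constraint is satisfied
mu_quad = x ** 2          # means on a parabola: constraints are violated
```

The rank check below corresponds to the Jacobian condition of definition (2): for this linear constraint model, ∂h/∂μ' = U' must have full row rank n − 2.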
As before, we calculate degrees of freedom for a model as the difference between the number of model parameters s and the number of independent constraints v implied by the model, i.e.

    df[Θ_h] = s - (r - p + q) = s - (u + q) = s - v.

Notice that for the constraint model, model degrees of freedom is a decreasing function of the number of independent constraints v. Finally, it should be noted that models may be specified in terms of both freedom equations and constraint equations. In fact, in subsequent sections this will be the case. However, without loss of generality, we will concentrate on constraint models since any model can be written in the form of a constraint model.

2.2.2 Measuring Model Goodness of Fit

Inferences about model parameters are reliable only if the model is 'good'. A good model should be well defined (or at least consistent). It should be simple and parsimonious. Finally, the model should be relatively close to holding. To assess whether or not the model holds, we will need the concept of a distance between two models. To begin, we will assume there is some measure of distance between two hierarchical parametric models. (Two models [Θ₁] and [Θ₂] are hierarchical if Θ₂ ⊂ Θ₁ and df[Θ₂] < df[Θ₁] whenever Θ₁ ≠ Θ₂.) This (parametric) distance will be a quantitative comparison of how close the two models are to holding. Thus, if both models hold the distance is zero. The distance will also be independent of the model degrees of freedom. Recall that the form of F(·; θ) is assumed known. Therefore, the distance will measure how far the true parameter is from falling in the parametric model space. Suppose, firstly, that Θ₁ and Θ₂ are general parameter spaces. That is, θ ∈ Θ₁ ∪ Θ₂ does not necessarily define a probability distribution. In other words, θ need not fall in a subset of an (s - 1)-dimensional simplex. Let a(θ) and b(θ) be vector- or matrix-valued functions of the unknown parameter θ.
Define a distance between two hierarchical models [Θ₁] and [Θ₂] (Θ₂ ⊂ Θ₁) as

    δ[Θ₂; Θ₁] = inf_{Θ₂} ||b(θ)(a(θ) - a(θ*))||² - inf_{Θ₁} ||b(θ)(a(θ) - a(θ*))||².

Notice that a and b can be chosen so that (1) δ[Θ₂; Θ₁] ≥ 0 and (2) δ[Θ₂; Θ₁] = 0 if and only if [Θ₁] and [Θ₂] hold. For example, consider the case Y ~ MVNₙ(μ, σ²Iₙ). Suppose that

    [Θ]  = {(μ, σ²) : μ ∈ Rⁿ, σ² > 0}
    [Θ₁] = {(μ, σ²) : μ = μ₀, σ² > 0}
    [Θ₂] = {(μ, σ²) : μ = Xβ, β ∈ Rᵖ, σ² > 0}
    [Θ₃] = {(μ, σ²) : μ = 1ₙα, α ∈ R, σ² > 0}.

In this example, each component of Y has a common variance σ². It seems reasonable that differences between any μⱼ and the true mean μⱼ* are equally important. Hence, a natural distance between any two of these models is

    δ[Θ_M₂; Θ_M₁] = inf_{Θ_M₂} ||μ - μ*||² - inf_{Θ_M₁} ||μ - μ*||².

Notice that a(μ, σ²) = μ and b(μ, σ²) = Iₙ. Hence, the measure of distance between [Θ₁] and [Θ] is

    δ[Θ₁; Θ] = inf_{Θ₁} ||μ - μ*||² = ||μ₀ - μ*||².

The second infimum is zero since the model [Θ] is known to hold. The measure of distance between [Θ₂] and [Θ] is

    δ[Θ₂; Θ] = inf_{Θ₂} ||μ - μ*||² = inf_β ||Xβ - μ*||²
             = ||X(X'X)⁻¹X'μ* - μ*||²                        (2.2.5)
             = ||(Iₙ - X(X'X)⁻¹X')μ*||²
             = μ*'(Iₙ - X(X'X)⁻¹X')μ*.

This is the squared length of the vector orthogonal to the projection of μ* onto the range space of X. Notice that if μ* = Xβ*, that is, [Θ₂] holds, then δ[Θ₂; Θ] = 0. Finally, the distance between [Θ₃] and [Θ₂] is

    δ[Θ₃; Θ₂] = inf_{Θ₃} ||μ - μ*||² - inf_{Θ₂} ||μ - μ*||²
              = μ*'(Iₙ - (1/n)1ₙ1ₙ')μ* - μ*'(Iₙ - X(X'X)⁻¹X')μ*    (2.2.6)
              = μ*'(X(X'X)⁻¹X' - (1/n)1ₙ1ₙ')μ*.

As another example, consider a random vector Y = (Y₁, ..., Yₙ)' with independent components following an exponential dispersion distribution (Jorgensen, 1989). That is,

    Yᵢ ~ indep ED(μᵢ, σ²),    i = 1, ..., n,

where the density of Yᵢ, with respect to some measure, has form

    f(yᵢ; γᵢ, σ²) = a(yᵢ, σ²) exp{ (yᵢγᵢ - κ(γᵢ)) / σ² },    (2.2.7)

where μᵢ = κ'(γᵢ) and var(Yᵢ) = σ²κ''(γᵢ). Let V(μᵢ) = κ''(γᵢ) and θ = (μ₁, ..., μₙ, σ²)'. Since the components of Y have different variances, a natural measure of distance is

    δ[Θ_M₂; Θ_M₁] = inf_{Θ_M₂} ||V(μ)⁻¹ᐟ²(μ - μ*)||² - inf_{Θ_M₁} ||V(μ)⁻¹ᐟ²(μ - μ*)||².    (2.2.8)

That is, a(θ) = μ and b(θ) = V(μ)⁻¹ᐟ².
Premultiplying the vector (μ - μ*) by V(μ)⁻¹ᐟ² has the effect of downplaying those differences (μᵢ - μᵢ*) for which the corresponding variance is large. To assess the goodness of fit of a model, relative to another, we can estimate the distance δ via some statistic based on the observed data. It is interesting to note that when δ = 0, i.e. both models hold, our data-based estimate of this null distance will be some nonnegative (positive, if the model is unsaturated) number, reflecting the amount of white noise or random variability there is in Y. This is so because, if both models hold, then the only reason our estimate of distance would be nonzero is that Y has some random component. That is, the variability in Y that is not explained by the model causes the data to fit the model imperfectly. Let D be an estimate of δ. That is, D[Θ₂; Θ₁] is a stochastic, data-based estimate of how far apart models [Θ₁] and [Θ₂] are. Potential candidates for D are the weighted least squares, likelihood ratio, Wald, deviance, and Lagrange multiplier statistics. For example, consider the n-variate normal case and the four candidate models [Θ], [Θ₁], [Θ₂], and [Θ₃]. We will assume that both [Θ] and [Θ₂] hold. In view of (2.2.5) a reasonable estimate of δ[Θ₂; Θ] can be obtained by replacing μ* by Y, the estimate of μ* under model [Θ], i.e.

    D[Θ₂; Θ] = Y'(Iₙ - X(X'X)⁻¹X')Y = Σᵢ (Yᵢ - Ŷᵢ)².

Recall that, since δ[Θ₂; Θ] is known to be zero, D[Θ₂; Θ] serves as our 'estimate of error'. Similarly, a reasonable estimate of δ[Θ₃; Θ₂] can be obtained by replacing μ* in (2.2.6) by Y, the least restrictive estimate of μ*, i.e.

    D[Θ₃; Θ₂] = Y'(X(X'X)⁻¹X' - (1/n)1ₙ1ₙ')Y = Σᵢ (Ŷᵢ - Ȳ)².

Now Θ₃ ⊂ Θ₂ and

    df[Θ₃] = n + 1 - (n - 1) = 2,    df[Θ₂] = n + 1 - (n - p) = p + 1.

The degrees of freedom associated with estimating the distance between two models will be called the distance (or residual, or goodness-of-fit) degrees of freedom.
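The two distance estimates above can be verified numerically as the familiar error and regression sums of squares. A small sketch with hypothetical Y and X (a simple-linear-regression design, chosen only for illustration):

```python
import numpy as np

# Hypothetical response and model matrix for [Theta_2]: mu = X beta.
Y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
n = len(Y)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto M(X)
J = np.ones((n, n)) / n                # projection onto span(1_n)

Yhat = H @ Y
D_error = Y @ (np.eye(n) - H) @ Y      # D[Theta_2; Theta] = SS(Error)
D_reg = Y @ (H - J) @ Y                # D[Theta_3; Theta_2] = SS(Reg)
```

Because the intercept column puts 1ₙ in M(X), the fitted values average to Ȳ, which is what makes the quadratic-form and sum-of-squares expressions agree.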
The distance degrees of freedom for the two models $[\Theta_{M_1}]$ and $[\Theta_{M_2}]$ is defined to be the difference between the two model degrees of freedom, i.e.

$$df(\delta[\Theta_{M_2};\Theta_{M_1}]) = df[\Theta_{M_1}] - df[\Theta_{M_2}].$$

The number of distance degrees of freedom measures the dimensional distance between the two models, i.e. the difference in dimensions. It measures the difference in the amount of freedom one has for estimating $\theta$ under the two models. It seems intuitive that if the degrees of freedom is large, that is, the dimensional difference between the two models is great, the significance of the distance statistic may be difficult to ascertain. This follows since we expect the fit to be quite different for the two very different models, even when both models hold. This is a reflection of both white noise and possibly lack of fit. Therefore, the distance statistic will tend to be large, even when both models hold. But for many statistics, a large mean implies a large variance, thereby making significant findings more difficult to obtain. It is for this reason that it is better to concentrate our efforts on relatively few degrees of freedom to detect lack of fit; that is, one should use the smallest alternative space possible when testing a null hypothesis. A more technical argument holds when the test statistic (distance statistic) is a chi-square or an F. Das Gupta and Perlman (1974) showed that for a fixed noncentrality parameter, i.e. fixed distance between models, the power of the F-test or the chi-square test increases as the distance degrees of freedom decreases.

Example 1: Continuing with the $n$-variate normal example, we see that

$$df(\delta[\Theta_3;\Theta_2]) = df[\Theta_2] - df[\Theta_3] = (p+1) - 2 = p - 1.$$

Thus, $\Theta_3$ is of $p-1$ fewer dimensions than $\Theta_2$. Now, if we knew $\sigma^2$, the white noise variance, we could test $H_0: \theta^* \in \Theta_3$ versus $H_1: \theta^* \in \Theta_2 - \Theta_3$ using the statistic

$$\frac{D[\Theta_3;\Theta_2]}{\sigma^2} = \frac{SS(\mathrm{Reg})}{\sigma^2}, \quad (2.2.9)$$

which has a $\chi^2(p-1)$ null distribution. However, $\sigma^2$ is not generally known and must be estimated.
One way of estimating $\sigma^2$ is to estimate the distance between $[\Theta]$ and $[\Theta_2]$, two models that are known to hold, and divide by the distance degrees of freedom. Since the distance degrees of freedom is $df[\Theta] - df[\Theta_2] = n + 1 - (p + 1) = n - p$, the estimate of the white noise variance is

$$D[\Theta_2;\Theta]/(n-p) = SS(\mathrm{Error})/(n-p).$$

Notice that in the above example the estimate of the parameter $\sigma^2$ was simply the estimated distance between two models that were known to hold, divided by their dimensional distance. Quite generally, when the data have an exponential dispersion distribution (2.2.7) with common dispersion parameter $\sigma^2$, the estimated distance between two models that are known to hold, divided by their dimensional distance, gives an estimate of $\sigma^2$. This is true when the estimated distance is taken to be the LR, Wald, deviance, LM, or weighted least squares statistic; these statistics are natural estimators of the weighted distance given in (2.2.8) for exponential dispersion models.

Now, let us assume that $\Theta_1$ and $\Theta_2$ are each subsets of an $(s-1)$-dimensional simplex. For example, with count data, conditional on the total $n$, the distribution is often multinomial with index $n$ and parameter (alternatively, probability distribution vector) $\theta^*$. Read and Cressie (1988) extensively study a family of distance measures called the power-divergence family. The power divergences have form

$$I^\lambda(\theta^*, \theta) = \frac{1}{\lambda(\lambda+1)} \sum_i \theta_i^* \left[\left(\frac{\theta_i^*}{\theta_i}\right)^\lambda - 1\right],$$

where $I^0$ and $I^{-1}$ are defined to be the continuous limiting values as $\lambda \to 0$ and $\lambda \to -1$. It is assumed that $\theta^*$ and $\theta$ fall on an $(s-1)$-dimensional simplex. As usual, let $\theta^*$ represent the true unknown parameter. We define the family of distance measures between $[\Theta_1]$ and $[\Theta_2]$ ($\Theta_2 \subset \Theta_1$) to be proportional to

$$\delta[\Theta_2;\Theta_1] = 2n\left\{\inf_{\Theta_2} I^\lambda(\theta^*, \theta) - \inf_{\Theta_1} I^\lambda(\theta^*, \theta)\right\}.$$

By properties of $I^\lambda(\theta^*, \theta)$ (Read and Cressie, 1988, pp. 110-113), it follows that $\delta \ge 0$, with equality if and only if both models hold.
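The power-divergence family is straightforward to evaluate numerically. Below is a minimal sketch (Python with NumPy assumed; the counts and the equiprobable fitted model are invented for illustration) of $2n I^\lambda$ evaluated at the sample proportions, with the $\lambda \to 0$ and $\lambda \to -1$ limits handled separately; $\lambda = 1$ recovers the Pearson chi-square statistic and $\lambda = 0$ the likelihood-ratio statistic.

```python
import numpy as np

def power_divergence(y, fitted, lam):
    """2n * I^lambda(y/n, fitted/n) for observed counts y and fitted means."""
    y = np.asarray(y, dtype=float)
    m = np.asarray(fitted, dtype=float)
    if lam == 0:                  # continuous limit: likelihood-ratio G^2
        return 2.0 * np.sum(y * np.log(y / m))
    if lam == -1:                 # continuous limit as lambda -> -1
        return 2.0 * np.sum(m * np.log(m / y))
    return (2.0 / (lam * (lam + 1))) * np.sum(y * ((y / m) ** lam - 1.0))

y = np.array([30.0, 20.0, 25.0, 25.0])   # invented observed counts
m = np.full(4, y.sum() / 4)              # fitted under an equiprobable model
G2 = power_divergence(y, m, 0)           # likelihood-ratio statistic
X2 = power_divergence(y, m, 1)           # Pearson chi-square statistic
```

Evaluating the statistic at a $\lambda$ near zero approaches $G^2$ continuously, which is a quick numerical check of the limiting definition.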
To estimate $\delta[\Theta_2;\Theta_1]$ based on the data, we note that our least restrictive guess of $\theta^*$ is $Y/n$, the vector of sample proportions. Intuitively, a good estimate of the quantity $\delta[\Theta_2;\Theta_1]$ is

$$D[\Theta_2;\Theta_1] = 2n\left\{\inf_{\Theta_2} I^\lambda(Y/n, \theta) - \inf_{\Theta_1} I^\lambda(Y/n, \theta)\right\} = \frac{2}{\lambda(\lambda+1)} \sum_i Y_i\left[\left(\frac{Y_i}{n\hat{\theta}_i^{(2)}}\right)^\lambda - 1\right] - \frac{2}{\lambda(\lambda+1)} \sum_i Y_i\left[\left(\frac{Y_i}{n\hat{\theta}_i^{(1)}}\right)^\lambda - 1\right],$$

where $\hat{\theta}^{(1)}$ and $\hat{\theta}^{(2)}$ are the 'minimum divergence' estimators obtained by minimizing $I^\lambda(Y/n, \theta)$ with respect to $\theta$ over $\Theta_1$ and $\Theta_2$ respectively. Read and Cressie (1988) point out that $D[\Theta_2;\Theta_1]$ is equal to the likelihood ratio statistic when $\lambda = 0$. Also, if we assume that $[\Theta_1]$ holds, so that the second infimum is zero, we have that, for $\lambda = 1$,

$$D[\Theta_2;\Theta_1] = \sum_i \frac{(Y_i - n\hat{\theta}_i^{(2)})^2}{n\hat{\theta}_i^{(2)}},$$

which is asymptotically equivalent to

$$D[\Theta_2;\Theta_1] = \sum_i \frac{(Y_i - n\hat{\theta}_i^{(0)})^2}{n\hat{\theta}_i^{(0)}},$$

where $\hat{\theta}^{(0)}$ is the maximum likelihood estimator of $\theta^*$ over the space $\Theta_2$. This is the Pearson chi-square statistic. Other asymptotically equivalent distance estimates are the Wald statistic and the Lagrange multiplier statistic. We now illustrate these results via examples.

Example 2: Suppose that $Y = (Y_{11}, Y_{12}, Y_{21}, Y_{22})'$ is a multinomial vector. That is,

$$(Y_{11}, Y_{12}, Y_{21}, Y_{22})' \sim \mathrm{Mult}(n, (\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22})'), \qquad \sum_i\sum_j \pi_{ij} = 1.$$

Thus, the model that is known to contain the true parameter vector $\pi^*$ is $[\Theta]$, where

$$\Theta = \{\pi : \pi'1_4 = 1,\ \pi_{ij} \in (0,1),\ i,j = 1,2\}.$$

Notice that $\Theta$ is really a 3-dimensional subset (simplex) of $(0,1)^4$, so that $df[\Theta] = 4 - 1 = 3$. We wish to test the independence hypotheses

$$H_0: \pi_{11}\pi_{22} = \pi_{12}\pi_{21} \quad \text{vs.} \quad H_1: \pi_{11}\pi_{22} \ne \pi_{12}\pi_{21}.$$

Writing the model of interest $[\Theta_0]$ as

$$\Theta_0 = \{\pi \in \Theta : \pi_{11}\pi_{22} - \pi_{12}\pi_{21} = 0\} = \{\pi : \pi'1_4 = 1,\ \pi_{11}\pi_{22} - \pi_{12}\pi_{21} = 0\},$$

we can state the independence hypotheses as $H_0: \pi \in \Theta_0$ vs. $H_1: \pi \in \Theta - \Theta_0$. Now, the model degrees of freedom can be found by subtracting the number of constraints implied by $[\Theta_0]$ from the total number of parameters, which is 4. Hence, $df[\Theta_0] = 4 - 2 = 2$. Thus, the distance degrees of freedom, or measure of dimensional distance, is

$$df(\delta[\Theta_0;\Theta]) = 3 - 2 = 1.$$
Two distance (goodness-of-fit) statistics commonly used are the Pearson chi-square $X^2$ ($\lambda = 1$) and the likelihood ratio statistic $G^2$ ($\lambda = 0$). The forms of these two statistics are

$$D[\Theta_0;\Theta] = X^2 = \sum_i\sum_j \frac{(y_{ij} - n\hat{\pi}_{ij,0})^2}{n\hat{\pi}_{ij,0}}$$

and

$$D[\Theta_0;\Theta] = G^2 = 2\sum_i\sum_j y_{ij}\log\left(\frac{y_{ij}}{n\hat{\pi}_{ij,0}}\right),$$

where $\hat{\pi}_{ij,0}$ is the ML estimate of $\pi_{ij}$ assuming that model $[\Theta_0]$ holds. Under the null hypothesis, i.e. if independence truly holds, the asymptotic distribution of both distance statistics, $X^2$ and $G^2$, is $\chi^2(1)$.

Example 3: Continuing with Example 2, consider the model $[\Theta_{MH}]$, where

$$\Theta_{MH} = \{\pi : \pi'1_4 = 1,\ \pi_{1+} - \pi_{+1} = 0\}.$$

This model implies that there is marginal homogeneity, i.e. the marginal distributions for both factors are the same. We would like to test the hypotheses $H_0: \pi \in \Theta_{MH}$ vs. $H_1: \pi \notin \Theta_{MH}$. The model degrees of freedom is $df[\Theta_{MH}] = 4 - 2 = 2$, and so the distance degrees of freedom is

$$df(\delta[\Theta_{MH};\Theta]) = 3 - 2 = 1.$$

Once again, to illustrate what model degrees of freedom means, we observe that if $[\Theta_{MH}]$ holds and we specify two of the four probabilities, the remaining two are completely determined. Thus, we are free to estimate two of the probabilities based on the data; the other two are determined. Two frequently used estimates of the model distance, or model goodness of fit, are the likelihood ratio statistic $G^2$ and the McNemar statistic $M^2$. For $2 \times 2$ tables, the McNemar statistic and the Lagrange multiplier statistic are equivalent, since both are score statistics (Agresti, 1990; Aitchison & Silvey, 1958). The statistics take the following forms:

$$D[\Theta_{MH};\Theta] = G^2 = 2\sum_i\sum_j y_{ij}\log\left(\frac{y_{ij}}{n\hat{\pi}_{ij,0}}\right)$$

and

$$D[\Theta_{MH};\Theta] = M^2 = \frac{(y_{12} - y_{21})^2}{y_{12} + y_{21}},$$

where the $\hat{\pi}_{ij,0}$ in the first expression is the ML estimate of $\pi_{ij}$ under the model $[\Theta_{MH}]$. Under the null, i.e. when the marginal distributions are homogeneous, both of these statistics have asymptotic $\chi^2(1)$ distributions. It is important to note that, had the constraint $\pi_{2+} - \pi_{+2} = 0$ been added, the model would remain consistent but would be ill defined.
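For concreteness, the statistics of Examples 2 and 3 can be computed directly; the sketch below (Python with NumPy assumed; the $2 \times 2$ table is invented) uses the closed-form ML fits available in the $2 \times 2$ case: the product of marginal proportions under independence, and the average of the off-diagonal counts under marginal homogeneity.

```python
import numpy as np

y = np.array([[40.0, 15.0],
              [10.0, 35.0]])            # invented 2x2 table of counts
n = y.sum()

# Independence fit: n * pi_hat_ij = (row total)(column total)/n
fit_ind = np.outer(y.sum(axis=1), y.sum(axis=0)) / n
X2 = np.sum((y - fit_ind) ** 2 / fit_ind)          # Pearson statistic, df = 1
G2_ind = 2.0 * np.sum(y * np.log(y / fit_ind))     # likelihood-ratio statistic

# Marginal homogeneity fit (2x2): average the two off-diagonal cells
fit_mh = y.copy()
fit_mh[0, 1] = fit_mh[1, 0] = (y[0, 1] + y[1, 0]) / 2.0
G2_mh = 2.0 * np.sum(y * np.log(y / fit_mh))
M2 = (y[0, 1] - y[1, 0]) ** 2 / (y[0, 1] + y[1, 0])  # McNemar statistic
```

Both pairs of statistics would be referred to a $\chi^2(1)$ distribution, matching the single distance degree of freedom computed above.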
For $2 \times 2$ tables, this additional constraint is exactly the same as the constraint $\pi_{1+} - \pi_{+1} = 0$.

2.3 Multivariate Polytomous Response Model Fitting

In this section, we describe ML model fitting for an integer-valued random vector $Y$ that is assumed to be distributed product-multinomially. We also investigate the asymptotic behavior of the ML estimators within the framework of constraint models. The models we will consider have form

$$\Theta_X = \{\xi \in \Theta : C\log(Ae^\xi) = X\beta,\ Le^\xi = 0\}$$

or equivalently, for appropriately chosen $U$,

$$\Theta_X = \Theta_h = \{\xi \in \Theta : U'C\log(Ae^\xi) = 0,\ Le^\xi = 0\},$$

where $e^\xi$ is the $s \times 1$ mean vector of $Y$, a product-multinomial random vector, and the model parameter space $\Theta$ is of dimension $s - q$, where $q$ is the number of identifiability constraints. We use the parameter $\xi$ rather than $\mu = e^\xi$ for several reasons. One reason will become evident when we explore the asymptotic behavior of the ML estimator of $\xi$: it turns out that the random variable $\hat{\mu} - \mu_0$ is not bounded in probability, whereas $\hat{\xi} - \xi_0$ is. In fact, the random variable $\hat{\xi} - \xi_0$ converges in probability to 0. Another reason for using $\xi$ rather than $\mu$ is that the procedure for deriving the maximum likelihood estimate of $\xi$ is less sensitive to small (or zero) counts. The range of possible $\xi$ values is the whole real line, while the range of possible $\mu$ values is restricted to the positive half of the real line. By using $\xi$, the problem of intermediate out-of-range values (e.g. negative cell mean estimates) is avoided.

As stated above, we initially assume that the vector of cell counts $Y$ has a product-multinomial distribution. This is not overly restrictive, since it will be shown that inferences based on maximum (multinomial) likelihood estimates are often the same as inferences based on maximum (Poisson) likelihood estimates. We will present some results in section 2.4 that allow us to determine when these inferences are indeed the same.
We also consider an alternative method for computing the maximum likelihood estimators and their asymptotic covariances. The method of Lagrange undetermined multipliers is well suited for maximum likelihood fitting of the models we will be considering, because we will specify the models in terms of constraint equations, and the fitting problem will be one of maximizing a function, namely the log likelihood, subject to some constraints, namely that $\xi \in \Theta_h$.

2.3.1 A General Multinomial Response Model

In this section we specify a class of models that is directly applicable to Chapter 3 of this dissertation. Specifically, the models will be specified in such a way as to include the class of simultaneous models for the joint and marginal distributions considered in Chapter 3.

Let the random vector $Y = \mathrm{vec}(Y_1,\dots,Y_K)$ denote a product-multinomial random vector, i.e.

$$Y_i = (Y_{i1},\dots,Y_{iR})' \sim \mathrm{ind}\ \mathrm{Mult}(n_i, \pi_i), \quad i = 1,\dots,K,\ K \ge 1,$$

where the $R \times 1$ vectors of cell probabilities satisfy $\pi_i'1_R = 1$, $i = 1,\dots,K$. Consider the 1:1 reparameterization from $\{\pi_i\}$ to $\{\xi_i\}$, where $\xi_i = \log(\mu_i) = \log(n_i\pi_i)$ is an $R \times 1$ vector of log means. Under this parameterization,

$$Y_i \sim \mathrm{ind}\ \mathrm{Mult}(n_i, e^{\xi_i}/n_i), \quad e^{\xi_i'}1_R = n_i, \quad i = 1,\dots,K,$$

or

$$Y_i \sim \mathrm{ind}\ \mathrm{Mult}(n_i, e^{\xi_i}/n_i), \quad i = 1,\dots,K, \qquad e^{\xi'}(\oplus_{i=1}^K 1_R) = n', \quad (2.3.1)$$

where $n' = (n_1,\dots,n_K)$ is the $1 \times K$ vector of multinomial indices. The kernel of the log likelihood for $Y$, written as a function of $\xi$, is

$$\ell^{(M)}(\xi; y) = y'\xi, \qquad e^{\xi'}(\oplus 1_R) = n'. \quad (2.3.2)$$

We now posit a model for the vector of log means. Let $s = RK$ be the total number of cell means. Our objectives are to test the model goodness of fit and to estimate the $s \times 1$ model parameter vector $\xi$ as well as any freedom parameters of interest.
It will be assumed that the model $[\Theta_X]$ can be specified as

$$\Theta_X = \{\xi \in R^s : C_1\log A_1 e^\xi = X_1\beta_1,\ C_2\log A_2 e^\xi = X_2\beta_2,\ Le^\xi = 0,\ e^{\xi'}(\oplus 1_R) = n'\}, \quad (2.3.3)$$

where

$C_i = \oplus_{k=1}^K C_{ik}$, with $C_{ik} = C_{i1}$ of dimension $q_i \times m_i$, $i = 1,2$;
$A_i = \oplus_{k=1}^K A_{ik}$, with $A_{ik} = A_{i1}$ of dimension $m_i \times R$, $i = 1,2$;
$L = \oplus_{k=1}^K L_k$, with $L_k = L_1$ of dimension $d \times R$;
$\xi = \mathrm{vec}(\xi_1,\dots,\xi_K)$, each $\xi_i$ being $R \times 1$;
$X_i$ is $Kq_i \times p_i$ of full rank $p_i$, $i = 1,2$;
$n$ is the $K \times 1$ vector of multinomial indices;
$s = RK$, the total number of cells.

Let us say that a model that can be specified as in (2.3.3) satisfies assumption (A1). That is,

(A1) The multinomial response model can be specified as in (2.3.3).

Notice that the $K$ matrices comprising $C_i$ are all identical, and likewise for the matrices comprising $A_i$ and $L$. This requires that the model not change across the $K$ populations ($K$ multinomials). Also, the two sets of freedom equations in (2.3.3) allow us to use two different types of models for the expected cell means. This provides enough generality to fit many interesting models; for example, we may wish to simultaneously fit a linear-by-linear association loglinear model for the joint distribution and a cumulative logit model for the marginal distributions.

We can conveniently rewrite (2.3.3) as

$$\Theta_X = \{\xi \in R^s : C\log(Ae^\xi) = X\beta,\ Le^\xi = 0,\ e^{\xi'}(\oplus 1_R) = n'\}, \quad (2.3.4)$$

where $A' = [A_1', A_2']$, $C = C_1 \oplus C_2$, $X = X_1 \oplus X_2$, and $\beta = \mathrm{vec}(\beta_1, \beta_2)$.

Notice that the model $[\Theta_X]$ is specified in terms of both freedom equations and constraint equations. We will rewrite $[\Theta_X]$ as a constraint model, keeping in the back of our minds that the freedom parameters may also be of interest. Let $U$ be a $K(q_1+q_2) \times u$ matrix of full column rank $u$ such that $U'X = 0$. Here $u$ is the dimension of the null space of $X'$, $\mathcal{N}(X')$, i.e. $u = K(q_1+q_2) - (p_1+p_2)$. Since $U$ can be chosen to be of full column rank, the columns of $U$ form a basis for the null space of $X'$. Thus, the range space of $U$ equals the null space of $X'$, i.e. $\mathcal{M}(U) = \mathcal{N}(X')$.
Multiplying the right- and left-hand sides of the freedom equation $C\log(Ae^\xi) = X\beta$ by $U'$, we can rewrite (2.3.4) as

$$\Theta_h = \{\xi \in R^s : U'C\log(Ae^\xi) = 0,\ Le^\xi = 0,\ e^{\xi'}(\oplus 1_R) - n' = 0\}. \quad (2.3.5)$$

Thus $\Theta_X = \Theta_h$, and the models $[\Theta_X]$ and $[\Theta_h]$ are one and the same. At this point, we will assume that the constraints implied by the model $[\Theta_h]$ are nonredundant, so that the model is well defined. More specifically, let

$$h'(\xi) = [(U'C\log(Ae^\xi))',\ e^{\xi'}L']$$

be the $1 \times (u+l)$ ($l = Kd$) vector of constraint functions. We will assume that the $u + l + K$ constraints implied by $h(\xi) = 0$ and $e^{\xi'}(\oplus 1_R) = n'$ are nonredundant. Notice that the constraints in $h(\xi) = 0$ do not include the identifiability constraints. We treat the identifiability constraints separately, for reasons that will become apparent when we actually fit the models.

As stated previously, one of our primary objectives is to estimate the model parameters $\xi$ and the freedom parameters $\beta$ under the assumption that $[\Theta_X]$ (and $[\Theta_h]$) holds. We will use the maximum likelihood estimates, which can be found by maximizing the log likelihood of $Y$ subject to the constraint that $[\Theta_h]$ holds.

The (kernel of the) log likelihood under the product-multinomial assumption is shown in (2.3.2): it is $\ell^{(M)}(\xi; y) = y'\xi$. Thus, we are to maximize the function $\ell^{(M)}(\xi; y) = y'\xi$ subject to $\xi \in \Theta_h$.

2.3.2 Maximum Likelihood Estimation

In this section we discuss two procedurally different approaches to maximizing the log likelihood $\ell^{(M)}(\xi; y)$ subject to $\xi \in \Theta_h$. The first approach, which is the more commonly used, requires that the model be specified entirely in terms of freedom equations. Oftentimes, when there are no identifiability constraints, the model can be completely specified as a freedom model; models amenable to this approach include the Poisson loglinear model and the normal linear model. The second approach, Lagrange's method of undetermined multipliers, can be directly applied when the model is specified completely in terms of constraint equations.
Since the product-multinomial model includes identifiability constraints, it can more easily be specified in terms of constraint equations. For this reason the second method is the preferred choice. In the following sections, we discuss some additional features of these two methods.

Freedom Parameter Approach. One approach often used in simple situations, namely those in which the model can be specified completely in terms of freedom equations, is to write the parameter $\xi$ as a function of the freedom parameter $\beta$ and maximize $\ell^{(M)}(\xi(\beta); y)$ with respect to $\beta$. The vector $\xi(\hat{\beta})$ will be in the model space, since the model was specified completely in terms of $\beta$. For example, if the model could be specified as $\Theta_X = \{\xi \in R^s : \xi = X\beta\}$, then $\xi(\beta) = X\beta$. Notice that the multinomial model, which includes the $K$ constraints $e^{\xi'}(\oplus 1_R) = n'$, is not directly amenable to this approach. In fact, we would have to reparameterize to a smaller set of $s - K$ model parameters that account for the $K$ constraints. This reparameterization results in an asymmetric treatment of the $\xi_{ij}$ and for that reason is deemed undesirable. On the other hand, the Poisson model considered below will often lend itself to this maximization approach, since the $K$ constraints $e^{\xi'}(\oplus 1_R) = n'$ are not included.

Computationally, the method of maximizing the log likelihood with respect to the freedom parameters is usually simple. Assuming the log likelihood is concave and differentiable in $\beta$, we need only solve for the root of the 'score equations', viz.

$$s(\beta; y) = \frac{\partial \ell^{(M)}(\xi(\beta); y)}{\partial \beta} = 0.$$

Many of the asymptotic properties of the maximum likelihood estimator $\hat{\beta}$ of $\beta$ are derived by formally expanding the score vector $s(\beta; y)$ about the true value $\beta = \beta^*$ in a linear Taylor expansion. That is,

$$s(\hat{\beta}; y) = s(\beta^*; y) + \frac{\partial s(\beta^*; y)}{\partial \beta'}(\hat{\beta} - \beta^*) + O(\|\hat{\beta} - \beta^*\|^2). \quad (2.3.6)$$

In particular, in many situations,

$$0 = s(\hat{\beta}; Y) = s(\beta^*; Y) + \frac{\partial s(\beta^*; Y)}{\partial \beta'}(\hat{\beta} - \beta^*) + o_p(1),$$

so that $\hat{\beta} - \beta^*$ has the same asymptotic distribution as

$$-\left(\frac{\partial s(\beta^*; Y)}{\partial \beta'}\right)^{-1} s(\beta^*; Y).$$
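As a numerical sketch of the freedom-parameter approach (Python with NumPy assumed; the design and counts are invented), consider a Poisson loglinear model $\xi = X\beta$. The score is $s(\beta; y) = X'(y - e^{X\beta})$, its negative derivative matrix is $X'D(e^{X\beta})X$, and a Newton-Raphson iteration of the kind suggested by (2.3.6) works only with this small $p \times p$ system.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(8), np.arange(8.0)])   # invented design, p = 2
y = rng.poisson(np.exp(X @ np.array([1.0, 0.2]))).astype(float)

beta = np.array([np.log(y.mean()), 0.0])            # crude starting value
for _ in range(50):
    mu = np.exp(X @ beta)                           # fitted Poisson means
    score = X.T @ (y - mu)                          # score vector s(beta; y)
    info = X.T @ (mu[:, None] * X)                  # -(d s / d beta')
    beta = beta + np.linalg.solve(info, score)
    if np.max(np.abs(score)) < 1e-10:
        break
```

At the root, $X'(y - \hat{\mu}) = 0$; with an intercept column in $X$, this forces the fitted means to reproduce the observed total.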
Subsequently, we will derive the asymptotic distribution of $\hat{\beta} - \beta^*$ in a different way. This alternative derivation of the asymptotic distribution of the freedom parameter estimate will shed new light on the relationship between the asymptotic behavior of the estimates under the two sampling assumptions, product-Poisson and product-multinomial.

Expression (2.3.6) also gives some indication of how one might numerically solve for $\hat{\beta}$, the root of the score equation. A Newton-Raphson type algorithm is often used. This root-finding algorithm involves the inversion of the derivative matrix $\partial s(\beta; y)/\partial \beta'$, which is usually of small dimension, since the model is usually specified in terms of a small number of freedom parameters. In fact, the dimension of the derivative matrix will not be larger than $s \times s$, which occurs when the model is saturated.

Constraint Equations Approach. In many situations, it may be difficult to specify a model in terms of only freedom parameters; or it may be possible, but the researcher would like to treat the model parameters symmetrically, which necessitates an additional constraint equation. It also could be that the function $C\log Ae^\xi$ is not a 1:1 function of $\xi$, so that for given $\beta$ we cannot solve for $\xi$ explicitly. In any of these cases, we may not be able to use the aforementioned maximization approach. In this section, we consider an alternative method for finding the $\hat{\xi}$ that maximizes the function $\ell^{(M)}(\xi; y)$ subject to $\xi \in \Theta_h$: Lagrange's method of undetermined multipliers. Aitchison and Silvey (1958, 1960) and Silvey (1959) provide much of the essential underlying theory related to this approach. Three positive features of this method are that (i) estimation of both $\xi$ and $\beta$ is possible, (ii) the method provides another enlightening way of deriving the asymptotic distribution of the freedom parameter estimators, and (iii) the method works quite generally.
A negative feature of this approach is the computational difficulty. Computationally, the method becomes burdensome as $s$, the number of log mean parameters, and $u + l + K$, the number of constraints implied by the model, become large. In fact, the algorithm involves the inversion of an $(s+u+l) \times (s+u+l)$ matrix. On a positive note, this potentially very large matrix does have a simple form, and one can invoke some simple matrix algebra results to reduce the inversion problem to one of inverting matrices of dimensions $(u+l) \times (u+l)$ and $s \times s$.

To best illustrate the difference in computational difficulty of the two methods, we consider the following normal linear model example. Let

$$Y_i \sim \mathrm{ind}\ N(\mu_i = \beta_0 + \beta_1 x_i,\ \sigma^2), \quad i = 1,\dots,100.$$

The log likelihood can easily be written as a function of $\beta = (\beta_0, \beta_1)'$. Maximizing this likelihood with respect to $\beta$ involves working with a $2 \times 2$ matrix. On the other hand, we could equivalently specify the linear model in terms of the 98 constraints

$$\frac{\mu_{i+1} - \mu_i}{x_{i+1} - x_i} = \frac{\mu_{i+2} - \mu_{i+1}}{x_{i+2} - x_{i+1}}, \quad i = 1,\dots,98,$$

and use Lagrange's method. In this case, we would need to invert a matrix of dimension $(s+u+l) \times (s+u+l) = 198 \times 198$.

Even when we use the matrix algebra results that simplify the problem of working with the $198 \times 198$ matrix, we are still left with a formidable task. It seems that when $s$ is large and the model is parsimonious, i.e. when $u + l + K$, the number of constraints, is large, the undetermined multiplier method may not be the method of choice. However, in time, as computer efficiency gains are realized, we predict that the scope of candidate models fit using this method will increase tremendously. In fact, at present, many categorical models can easily be fit using Lagrange's method. We now discuss in more detail how the method of undetermined multipliers can be used to fit models like $[\Theta_h]$ of (2.3.5).
We are to maximize the function $\ell^{(M)}(\xi; y) = y'\xi$ subject to the constraint $\xi \in \Theta_h$, where

$$\Theta_h = \{\xi \in R^s : U'C\log(Ae^\xi) = 0,\ Le^\xi = 0,\ e^{\xi'}(\oplus 1_R) - n' = 0\} = \{\xi \in R^s : h(\xi) = 0,\ e^{\xi'}(\oplus 1_R) = n'\}$$

and $h'(\xi) = [\log(e^{\xi'}A')C'U,\ e^{\xi'}L']$. Consider the Lagrangian objective function

$$F(\gamma) = \ell^{(M)}(\xi; y) + (e^{\xi'}(\oplus 1_R) - n')\tau + h'(\xi)\lambda,$$

where $\gamma = \mathrm{vec}(\xi, \tau, \lambda)$. The $K \times 1$ vector $\tau$ and the $(u+l) \times 1$ vector $\lambda$ are called either 'Lagrange multipliers' or 'undetermined multipliers'. Provided a maximum exists, and the Jacobian of $[e^{\xi'}(\oplus 1_R) - n',\ h'(\xi)]$ is of full row rank $u + l + K$ for all $\xi \in \Theta_h$, we can solve for the maximum by solving the system of equations

$$\frac{\partial F}{\partial \gamma}(\hat{\gamma}^{(M)}) = \begin{pmatrix} y + D(e^{\hat{\xi}^{(M)}})(\oplus 1_R)\hat{\tau}^{(M)} + H(\hat{\xi}^{(M)})\hat{\lambda}^{(M)} \\ (\oplus 1_R')e^{\hat{\xi}^{(M)}} - n \\ h(\hat{\xi}^{(M)}) \end{pmatrix} = 0, \quad (2.3.7)$$

where the matrix $H(\xi) = \partial h'(\xi)/\partial \xi$. The Jacobian condition basically requires the constraints to be nonredundant, thereby making $[\Theta_h]$ a well defined model. From this point on, for notational convenience, the indices for the direct sum will be omitted unless they are different from 1 and $K$.

We now require the matrices of models $[\Theta_X]$ and $[\Theta_h]$ to satisfy some additional conditions. Let us assume that

(A2) Either $C_i = I_{q_iK}$ or $C_i(\oplus 1_{m_i}) = 0$, $i = 1,2$; and

(A3) If $C_i = I_{q_iK}$ then $\mathcal{M}(X_i) \supset \mathcal{M}(\oplus 1_{m_i})$.

The assumptions require each $C_i$ to be either a contrast matrix (rows sum to zero), a zero matrix, or the identity matrix. If $C_i$ is the identity matrix, it is required that there exist a set of columns in $X_i$ that spans a space containing the range space of $\oplus_{k=1}^K 1_{m_i}$. For most models of interest these conditions are met. For example, any of the logit-type models, such as cumulative or multiple logit models, can be specified with $C$ a contrast matrix. For loglinear models, condition (A3) is met whenever the model includes a parameter for each of the $K$ multinomials.

The following lemma will be useful in showing that the maximum likelihood estimates of $\xi$ and $\beta$ are equivalent under both sampling schemes, product-Poisson and product-multinomial.
The lemma will also enable us to reduce the number of equations in (2.3.7) that must be simultaneously solved when computing the maximum (multinomial) likelihood estimators.

LEMMA 2.3.1. If the matrices of models $[\Theta_X]$ and $[\Theta_h]$ satisfy (A1), (A2), and (A3), then, provided the model holds,

$$(\oplus 1_R')H(\xi) = 0.$$

Proof. Using matrix derivatives (MacRae, 1974; Magnus and Neudecker, 1988), it follows that

$$H(\xi) = [D(e^\xi)A'D^{-1}(Ae^\xi)C'U,\ D(e^\xi)L'].$$

Thus,

$$(\oplus 1_R')H(\xi) = [(\oplus e^{\xi_i'})A'D^{-1}(Ae^\xi)C'U,\ (\oplus e^{\xi_i'})L'] = [[(\oplus e^{\xi_i'})A_1'D^{-1}(A_1e^\xi),\ (\oplus e^{\xi_i'})A_2'D^{-1}(A_2e^\xi)](C_1' \oplus C_2')U,\ 0] = [[(\oplus 1_{m_1}')C_1',\ (\oplus 1_{m_2}')C_2']U,\ 0] = [0,\ 0] = 0.$$

The second displayed equality holds because the model holding implies $Le^\xi = 0$, so that $(\oplus e^{\xi_i'})L' = 0$; the third uses $(\oplus e^{\xi_i'})A_j'D^{-1}(A_je^\xi) = \oplus 1_{m_j}'$. The fourth equality can be seen via the following argument. If both $C_i$'s are contrast matrices, or zero matrices, then (A2) implies that the matrix $[(\oplus 1_{m_1}')C_1', (\oplus 1_{m_2}')C_2']$ is the zero matrix. On the other hand, if both $C_1$ and $C_2$ are identity matrices, then, since the columns of $U$ span the null space of $X'$, which by (A3) is contained in the null space of $[(\oplus 1_{m_1}), (\oplus 1_{m_2})]'$, we have that $[(\oplus 1_{m_1}'), (\oplus 1_{m_2}')]U = 0$. Any other combination of $C_1$ and $C_2$ can likewise be seen to yield the zero matrix. $\square$

The following theorem gives conditions under which we can find the ML estimators of $\xi$ by solving a reduced set of equations. The smaller system of equations no longer includes the identifiability constraint equations.

THEOREM 2.3.1. Let $\mathrm{vec}(\hat{\xi}^{(M)}, \hat{\tau}^{(M)}, \hat{\lambda}^{(M)})$ be the solution to (2.3.7). Assuming that (A1), (A2), and (A3) hold, the subvector $\mathrm{vec}(\hat{\xi}^{(M)}, \hat{\lambda}^{(M)})$ is the solution to the reduced set of $s + u + l$ equations

$$\begin{pmatrix} y - e^{\hat{\xi}^{(M)}} + H(\hat{\xi}^{(M)})\hat{\lambda}^{(M)} \\ h(\hat{\xi}^{(M)}) \end{pmatrix} = 0. \quad (2.3.8)$$

Proof: Premultiplying the first set of equations in (2.3.7) by $\oplus 1_R'$, we arrive at

$$(\oplus 1_R')y + (\oplus 1_R')D(e^{\hat{\xi}^{(M)}})(\oplus 1_R)\hat{\tau}^{(M)} + (\oplus 1_R')H(\hat{\xi}^{(M)})\hat{\lambda}^{(M)} = 0. \quad (2.3.9)$$

Now, $(\oplus 1_R')y = n$ and $(\oplus 1_R')D(e^{\hat{\xi}^{(M)}}) = \oplus e^{\hat{\xi}_i^{(M)\prime}}$.
Also, since $\hat{\xi}^{(M)} \in \Theta_h$, it must be that $(\oplus e^{\hat{\xi}_i^{(M)\prime}})(\oplus 1_R) = D(n)$, the diagonal matrix with the multinomial indices on the diagonal. Further, by Lemma 2.3.1, $(\oplus 1_R')H(\hat{\xi}^{(M)}) = 0$. Therefore, (2.3.9) can be rewritten as

$$n + D(n)\hat{\tau}^{(M)} = 0,$$

which implies that $\hat{\tau}^{(M)} = -1_K$. Now, since the identifiability constraints have been explicitly accounted for when solving for $\hat{\tau}^{(M)}$, we can replace $\hat{\tau}^{(M)}$ in (2.3.7) by $-1_K$ and omit the identifiability constraints. Thus, $\mathrm{vec}(\hat{\xi}^{(M)}, \hat{\lambda}^{(M)})$ is the solution to the reduced set of equations

$$\begin{pmatrix} y - e^{\hat{\xi}^{(M)}} + H(\hat{\xi}^{(M)})\hat{\lambda}^{(M)} \\ h(\hat{\xi}^{(M)}) \end{pmatrix} = 0.$$

This is what we set out to show. $\square$

Before detailing the iterative scheme used for solving (2.3.8), we will explore the asymptotic behavior of the estimator $\hat{\theta}^{(M)} = \mathrm{vec}(\hat{\xi}^{(M)}, \hat{\lambda}^{(M)})$ within the framework of constraint models.

2.3.3 Asymptotic Distribution of Product-Multinomial ML Estimators

In what follows, we will assume that $K$, the number of identifiability constraints, is some fixed integer, $K \ge 1$. We also assume that the asymptotics hold as $n_* = \min\{n_i\}$ approaches infinity and that $n_* \sim n_i$, $i = 1,\dots,K$. That is, we assume that the asymptotic approximations hold as each of the multinomial indices gets large at the same rate.

The derivation of the asymptotic distribution of $\hat{\theta}^{(M)}$ will follow closely that of Aitchison and Silvey (1958). Briefly, Aitchison and Silvey show that if the score vector is $o_p(n)$ and the constraints are such that the derivative matrices $H(\xi)$ and $\partial H'(\xi)/\partial \xi$ have elements that are bounded functions, then, provided certain mild regularity conditions hold, the maximum likelihood estimator $\hat{\theta}$ is an $n^{1/2}$-consistent estimator of $\theta_0$ and $\hat{\lambda}$ is an $n^{1/2}$-consistent estimator of 0. They show that the joint distribution of $(n^{1/2}(\hat{\theta} - \theta_0),\ n^{-1/2}\hat{\lambda})$ is multivariate normal with zero mean and covariance matrix

$$\begin{pmatrix} B^{-1} - B^{-1}H(H'B^{-1}H)^{-1}H'B^{-1} & 0 \\ 0 & (H'B^{-1}H)^{-1} \end{pmatrix},$$

where $B$ is the information matrix and $H$ is the derivative of the constraint function. In our application, however, there are some minor changes.
With the parameterization we use, the information matrix is zero, since the (multinomial) log likelihood (2.3.2) is linear in the parameter $\xi$. This happens because the identifiability constraints $e^{\xi'}(\oplus 1_R) = n'$ are ignored, to preserve symmetry, when differentiating. Also, in our parameterization, the constraints are in terms of $e^\xi$, the components of which are $e^{\xi_{ij}} = n_i\pi_{ij}$. Thus, the constraints and the corresponding derivative matrices may not be bounded. For example, a typical constraint is of the form $Le^\xi = 0$; the components of $Le^\xi$ and the derivatives increase without bound as the multinomial indices are allowed to increase without bound. Fortunately, we can still use the results of Aitchison and Silvey (1958) by replacing the matrix $H$ and the vector $\lambda/n$ of Aitchison and Silvey by our $H/n_*$ and $\lambda$, where $n_* = \min\{n_i\}$.

The zero information problem can be solved by identifying the vector $Y - e^\xi$ as the 'score vector'. It is pointed out that, in this case, the asymptotic variance of $\oplus D^{-1/2}(n_i 1_R)$ times the score vector is not equal to the negative derivative matrix $D(\pi_0)$, but instead equals $D(\pi_0) - \oplus \pi_{0i}\pi_{0i}'$. This happens because the components of $Y$ are not independent; $Y$ is product multinomial. Using this reparameterization, all of the necessary assumptions required by Aitchison and Silvey (1958) hold. As previously mentioned, Aitchison and Silvey show that $\hat{\lambda}$ is an $n^{1/2}$-consistent estimator of 0; with our parameterization, having replaced $\lambda/n$ by $\lambda$, it follows that $\hat{\lambda}^{(M)}$ will be $n_*^{1/2}$-consistent. We now derive the asymptotic distribution of $\hat{\theta}^{(M)}$.

Define the stochastic function $g$ by

$$g(\theta; Y) = \begin{pmatrix} Y - e^\xi + H(\xi)\lambda \\ h(\xi) \end{pmatrix}.$$

The maximum likelihood estimator $\hat{\theta}^{(M)}$ is the solution to $g(\theta; Y) = 0$.
Under our parameterization, using the results of Aitchison and Silvey (1958), each of the following holds:

$$e^{\hat{\xi}^{(M)}} - e^{\xi_0} = D(e^{\xi_0})(\hat{\xi}^{(M)} - \xi_0) + O_p(1),$$

$$H(\hat{\xi}^{(M)}) = H(\xi_0) + O_p(n_*^{1/2}),$$

$$h(\hat{\xi}^{(M)}) = h(\xi_0) + H'(\xi_0)(\hat{\xi}^{(M)} - \xi_0) + O_p(1) = H'(\xi_0)(\hat{\xi}^{(M)} - \xi_0) + O_p(1),$$

$$H(\hat{\xi}^{(M)})\hat{\lambda}^{(M)} = H(\xi_0)\hat{\lambda}^{(M)} + O_p(1).$$

Thus,

$$0 = g(\hat{\theta}^{(M)}; Y) = \begin{pmatrix} Y - e^{\hat{\xi}^{(M)}} + H(\hat{\xi}^{(M)})\hat{\lambda}^{(M)} \\ h(\hat{\xi}^{(M)}) \end{pmatrix}$$

can be rewritten as

$$0 = \begin{pmatrix} Y - e^{\xi_0} \\ 0 \end{pmatrix} - \begin{pmatrix} D(e^{\xi_0}) & -H(\xi_0) \\ -H'(\xi_0) & 0 \end{pmatrix} \begin{pmatrix} \hat{\xi}^{(M)} - \xi_0 \\ \hat{\lambda}^{(M)} \end{pmatrix} + O_p(1).$$

Therefore, it follows that

$$\begin{pmatrix} \oplus D^{-1/2}(n_i 1_R)(Y - e^{\xi_0}) \\ 0 \end{pmatrix} = \begin{pmatrix} D(\pi_0) & -n_*^{-1}H(\xi_0) \\ -n_*^{-1}H'(\xi_0) & 0 \end{pmatrix} \begin{pmatrix} n_*^{1/2}(\hat{\xi}^{(M)} - \xi_0) \\ n_*^{1/2}\hat{\lambda}^{(M)} \end{pmatrix} + o_p(1), \quad (2.3.10)$$

since $n_* \sim n_i$, $i = 1,\dots,K$, and $\pi_0 = (\oplus D^{-1}(n_i 1_R))e^{\xi_0}$.

Now, the random variable $\oplus D^{-1/2}(n_i 1_R)(Y - e^{\xi_0})$ is a vector of normalized sample proportions, so that $\mathrm{vec}(\oplus D^{-1/2}(n_i 1_R)(Y - e^{\xi_0}),\ 0)$ has an asymptotic normal distribution with zero mean and covariance matrix

$$\begin{pmatrix} D(\pi_0) - \oplus \pi_{0i}\pi_{0i}' & 0 \\ 0 & 0 \end{pmatrix}.$$

Therefore, by an extension of a theorem of Cramer (1949) and by equation (2.3.10), it follows that $n_*^{1/2}(\hat{\theta}^{(M)} - \theta_0) = n_*^{1/2}\mathrm{vec}(\hat{\xi}^{(M)} - \xi_0,\ \hat{\lambda}^{(M)})$ has an asymptotic normal distribution with mean zero and covariance

$$\Gamma^{-1}\begin{pmatrix} D(\pi_0) - \oplus \pi_{0i}\pi_{0i}' & 0 \\ 0 & 0 \end{pmatrix}\Gamma^{-1\prime}, \qquad \Gamma = \begin{pmatrix} D(\pi_0) & -n_*^{-1}H(\xi_0) \\ -n_*^{-1}H'(\xi_0) & 0 \end{pmatrix}. \quad (2.3.11)$$

This covariance matrix is shown in the appendix to have the simple form

$$\begin{pmatrix} M_1 & 0 \\ 0 & M_2 \end{pmatrix},$$

where, with $\tilde{H} = n_*^{-1}H(\xi_0)$,

$$M_1 = D^{-1}(\pi_0) - D^{-1}(\pi_0)\tilde{H}(\tilde{H}'D^{-1}(\pi_0)\tilde{H})^{-1}\tilde{H}'D^{-1}(\pi_0) - \oplus_{k=1}^K 1_R 1_R' \quad \text{and} \quad M_2 = (\tilde{H}'D^{-1}(\pi_0)\tilde{H})^{-1}.$$

Finally, using the fact that $n_* \sim n_i$, $i = 1,\dots,K$, we can judiciously replace $n_*$ by the appropriate $n_i$ to arrive at a simple, asymptotically equivalent expression for the asymptotic covariance of $\hat{\theta}^{(M)} = \mathrm{vec}(\hat{\xi}^{(M)}, \hat{\lambda}^{(M)})$. It is

$$\begin{pmatrix} D^{-1} - D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1} - \oplus n_i^{-1}1_R 1_R' & 0 \\ 0 & (H'D^{-1}H)^{-1} \end{pmatrix}, \quad (2.3.12)$$

where $D = D(\mu_0) = D(e^{\xi_0})$ and $H = H(\xi_0)$.

2.3.4 Lagrange's Method -- The Algorithm

In this section, we give details of how one can actually fit the models of (2.3.4), or equivalently (2.3.5). We show how Lagrange's undetermined multipliers method can be used in conjunction with a modified Newton-Raphson iterative scheme to compute the ML estimators and their asymptotic covariances. We will assume that the model assumptions (A1), (A2), and (A3) hold.
This section includes an outline of the algorithm used in the FORTRAN program 'mle.restraint'. Recall that our objective is to find the $\hat{\xi}^{(M)} \in \Theta_X$, where

$$\Theta_X = \{\xi \in R^s : C\log(Ae^\xi) = X\beta,\ Le^\xi = 0,\ e^{\xi'}(\oplus 1_R) = n'\},$$

that maximizes the multinomial log likelihood $\ell^{(M)}(\xi; y) = y'\xi$. Since the assumptions (A1), (A2), and (A3) hold, we see by Theorem 2.3.1 that our problem is reduced to one of solving the system of equations (2.3.8); i.e., to find the ML estimator $\hat{\theta}^{(M)} = \mathrm{vec}(\hat{\xi}^{(M)}, \hat{\lambda}^{(M)})$ we must simultaneously solve the system of $s + u + l$ equations

$$g(\theta) = \begin{pmatrix} y - e^\xi + H(\xi)\lambda \\ h(\xi) \end{pmatrix} = 0,$$

where the $(u+l) \times 1$ vector $h$ and the $s \times (u+l)$ matrix $H$ are defined as

$$h(\xi) = \begin{pmatrix} U'C\log(Ae^\xi) \\ Le^\xi \end{pmatrix}, \qquad H(\xi) = \frac{\partial h'(\xi)}{\partial \xi}.$$

It will be shown in section 2.4 that $g(\theta)$ is actually the derivative of the Lagrangian objective function under the product-Poisson sampling assumption.

The iterative scheme used in the FORTRAN program 'mle.restraint' is a modified Newton-Raphson algorithm. The algorithm can be sketched as follows.

(1) Find a starting value $\theta^{(0)}$ for $\theta$.

(2) Replace $\theta^{(v)}$ by

$$\theta^{(v+1)} = \theta^{(v)} - G^{-1}(\theta^{(v)})g(\theta^{(v)}). \quad (2.3.13)$$

(3) If $\|g(\theta^{(v+1)})\| > \mathrm{tol}$, go to (2). Else stop.

The matrix $G(\theta)$ used in step (2) is

$$G(\theta) = \begin{pmatrix} -D(e^\xi) & H(\xi) \\ H'(\xi) & 0 \end{pmatrix},$$

which satisfies $\partial g(\theta)/\partial\theta' = G(\theta) + O_p(n^{1/2})$, and the inverse of $G(\theta)$ has the very simple form (see Aitchison and Silvey, 1958, or Rao, 1974)

$$G^{-1}(\theta) = \begin{pmatrix} -D^{-1} + D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1} & D^{-1}H(H'D^{-1}H)^{-1} \\ (H'D^{-1}H)^{-1}H'D^{-1} & (H'D^{-1}H)^{-1} \end{pmatrix}, \quad (2.3.14)$$

where $D = D(e^\xi)$. Since we use $G(\theta)$ in place of the Hessian matrix, the procedure is a modification of the Newton-Raphson method; Haber (1985a) used the more complicated Hessian matrix. Notice that the inversion of $G$, which may be performed at each iteration, is not nearly as difficult as inverting a general matrix of dimension $(s+u+l) \times (s+u+l)$. First, in view of (2.3.14), to obtain the inverse of the partitioned matrix $G$ we need only invert the matrices $D$ and $H'D^{-1}H$, which are of dimensions $s \times s$ and $(u+l) \times (u+l)$. Secondly, the inversion of $D$ is simple, since $D$ is a diagonal matrix with $e^\xi$ on the diagonal.
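To make the iteration concrete, here is a minimal end-to-end sketch (Python with NumPy assumed; the counts are invented, and this is an illustration rather than the 'mle.restraint' program itself) that fits the marginal homogeneity model of Example 3 to a single $2 \times 2$ multinomial by solving the reduced system (2.3.8) with step (2.3.13). This model needs no freedom equations, so $h(\xi) = Le^\xi$ with $L = (0, 1, -1, 0)$, and the small matrix $G$ is assembled and solved directly rather than through the partitioned inverse (2.3.14).

```python
import numpy as np

# Cells ordered (y11, y12, y21, y22); marginal homogeneity <=> mu12 = mu21.
y = np.array([40.0, 15.0, 10.0, 35.0])     # invented counts
L = np.array([[0.0, 1.0, -1.0, 0.0]])      # constraint L e^xi = 0

xi = np.log(y)                              # start at the saturated fit
lam = np.zeros(1)
for _ in range(100):
    mu = np.exp(xi)
    H = mu[:, None] * L.T                   # H(xi) = D(e^xi) L'
    g = np.concatenate([y - mu + H @ lam, L @ mu])
    if np.max(np.abs(g)) < 1e-10:
        break
    G = np.block([[-np.diag(mu), H],        # modified Newton-Raphson matrix G
                  [H.T, np.zeros((1, 1))]])
    step = np.linalg.solve(G, g)
    xi, lam = xi - step[:4], lam - step[4:]

mu_hat = np.exp(xi)                         # ML fitted means under MH
```

At convergence the fitted means satisfy $\hat{\mu}_{12} = \hat{\mu}_{21} = (y_{12} + y_{21})/2$, agreeing with the closed-form ML fit under marginal homogeneity for a $2 \times 2$ table, and the fitted total reproduces $n$ even though the identifiability constraint was never imposed explicitly, as Theorem 2.3.1 guarantees.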
Hence, the most formidable task in the inversion process is the inversion of the symmetric positive definite matrix H'D^{-1}H. There are many efficient ways to invert large symmetric positive definite matrices. Upon convergence of the algorithm (2.3.13), estimates of the asymptotic covariances of \hat\xi^{(M)} and \hat\lambda^{(M)} are readily calculable. Write G^{-1} of (2.3.14) as

G^{-1} = [ -P, Q ; Q', R ],

where

P = D^{-1} - D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1},
Q = D^{-1}H(H'D^{-1}H)^{-1},
R = (H'D^{-1}H)^{-1}.

By (2.3.12), the asymptotic covariance of \hat\theta^{(M)} = vec(\hat\xi^{(M)}, \hat\lambda^{(M)}) can be estimated by

var(\hat\theta^{(M)}) = [ \hat P - \oplus_{k=1}^K n_k^{-1} 1_R 1_R', 0 ; 0, \hat R ],

where \hat P and \hat R denote P and R evaluated at \hat\xi^{(M)}. Variance estimates for other continuous functions of \hat\xi^{(M)}, such as \hat\mu^{(M)} = e^{\hat\xi^{(M)}} and \hat\beta^{(M)} = (X'X)^{-1}X'C log(Ae^{\hat\xi^{(M)}}), can be found by invoking the delta method. For example,

var(\hat\mu^{(M)}) \approx D(e^{\hat\xi^{(M)}}) var(\hat\xi^{(M)}) D(e^{\hat\xi^{(M)}})

and

var(\hat\beta^{(M)}) \approx (X'X)^{-1}X'C D^{-1}(A\hat\mu^{(M)}) A var(\hat\mu^{(M)}) A' D^{-1}(A\hat\mu^{(M)}) C'X(X'X)^{-1}.

Evidently, Lagrange's method of undetermined multipliers provides a convenient procedure for maximum likelihood fitting of a very general class of parametric models for multivariate polytomous data, with covariates possible. We now briefly outline the steps needed to perform the iterations of (2.3.13).

Computing U. The first thing we must do is write the freedom model (2.3.4), which can easily be input by the user, as a constraint model (2.3.5). Therefore, we must compute a full column rank matrix U that satisfies U'X = 0. The method we use to find U is attributed to Haber (1985b). Using the notation of 'mle.restraint', let X be a full column rank matrix of dimension q x r, and let u = q - r be the dimension of the null space of X'. Further, the matrices A and C of (2.3.4) have dimensions m x s and q x m, respectively. The relationship between these dimension variables and those used in sections 2.3.1 and 2.3.2 is

q = K(q_1 + q_2),  r = p_1 + p_2,  m = K(m_1 + m_2).

We use the variables q, r, and m for notational convenience. Consider the matrix U^* = I_q - X(X'X)^{-1}X'. This q x q matrix is of rank u = q - r and satisfies the property U^{*'}X = 0.
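Haber's device for computing U, completed in the next paragraph with a random matrix W, amounts to projecting a random matrix onto the null space of X'. A minimal NumPy sketch, with a hypothetical design matrix:

```python
import numpy as np

def null_basis(X, seed=None):
    """U = (I - X(X'X)^{-1}X') W for a random W: U has full column
    rank with probability one and satisfies U'X = 0 (Haber's device)."""
    rng = np.random.default_rng(seed)
    q, r = X.shape
    U_star = np.eye(q) - X @ np.linalg.solve(X.T @ X, X.T)  # projector onto null(X')
    W = rng.uniform(0.0, 100.0, size=(q, q - r))            # full rank w.p. 1
    return U_star @ W

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # q = 4, r = 2
U = null_basis(X, seed=0)
# U is 4 x 2, of full column rank u = 2, with U'X = 0
```

Any U with these two properties yields the same constraint set, which is what makes the random construction harmless.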
Let W denote a q x u matrix with random elements, specifically W_ij ~ Uniform(0, 100), i = 1,...,q, j = 1,...,u. The matrix W is of full column rank with probability one, and hence the q x u matrix U = U^*W is of full column rank u with probability one. But the matrix U satisfies

U'X = W'U^{*'}X = W'0 = 0.

Therefore, at least with probability one, we have found a full column rank matrix U that satisfies the property U'X = 0. Using this U, we are able to write the freedom model (2.3.4) as a constraint model (2.3.5).

Computing h(\xi). We write the constraint model of (2.3.5) as

{ \xi \in R^s : h(\xi) = 0, e^{\xi'}(\oplus_{k=1}^K 1_R) = n' }, (2.3.15)

where the constraint function h is defined as

h(\xi) = [ U'C log(Ae^\xi) ; Le^\xi ].

Computing g(\theta). Notice that since (A1), (A2), and (A3) hold, the identifiability constraints present in the product-multinomial model (2.3.4) can be accounted for explicitly. It will follow from the results of section 2.4 that, under either sampling scheme (product-Poisson or product-multinomial), the maximum likelihood estimators for \xi and \lambda can be found by solving the equation

g(\theta) = [ y - e^\xi + H(\xi)\lambda ; h(\xi) ] = 0, (2.3.16)

where the matrix H is the derivative of h' with respect to \xi.

Computing H(\xi). We use the matrix derivative results of MacRae (1974) to find the matrix of derivatives of the constraint function h'(\xi):

H(\xi) = \partial h'(\xi)/\partial\xi = \partial/\partial\xi [log(e^{\xi'}A')C'U, e^{\xi'}L'] = [D(e^\xi)A'D^{-1}(Ae^\xi)C'U, D(e^\xi)L'].

The equality follows upon using the matrix version of the chain rule. Notice that

\partial(log(e^{\xi'}A')C'U)/\partial\xi = (\partial e^{\xi'}/\partial\xi)(\partial(log(e^{\xi'}A')C'U)/\partial e^\xi) = D(e^\xi)A'D^{-1}(Ae^\xi)C'U

and that

\partial(e^{\xi'}L')/\partial\xi = (\partial e^{\xi'}/\partial\xi) L' = D(e^\xi)L'.

Computing G(\theta). The iterative scheme (2.3.13) used to solve the system of equations (2.3.16) is actually a slight modification of the Newton-Raphson algorithm. It is a modification because we do not use the derivative matrix G^* = \partial g(\theta)/\partial\theta' to adjust at each iteration, as Haber (1985a) did, but rather a simpler matrix G that is related to G^* by G^* = G + O_p(n_*^{1/2}).
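The analytic form of H(\xi) can be checked against a central-difference approximation. The toy matrices below (a 2x2 table with A computing the four margins, C = I, a single contrast U, and an empty L) are illustrative assumptions, not taken from any data set in the text:

```python
import numpy as np

def h_fun(xi, A, C, U, L):
    """Constraint function h(xi) = (U'C log(A e^xi); L e^xi)."""
    mu = np.exp(xi)
    return np.concatenate([U.T @ C @ np.log(A @ mu), L @ mu])

def H_fun(xi, A, C, U, L):
    """Analytic derivative of h'(xi):
    [D(e^xi) A' D^{-1}(A e^xi) C'U,  D(e^xi) L']."""
    mu = np.exp(xi)
    return np.hstack([(mu[:, None] * A.T) @ np.diag(1.0 / (A @ mu)) @ C.T @ U,
                      mu[:, None] * L.T])

A = np.array([[1.0, 1.0, 0.0, 0.0],    # row-1 total
              [0.0, 0.0, 1.0, 1.0],    # row-2 total
              [1.0, 0.0, 1.0, 0.0],    # column-1 total
              [0.0, 1.0, 0.0, 1.0]])   # column-2 total
C = np.eye(4)
U = np.array([[1.0, 0.0, -1.0, 0.0]]).T    # log(row-1 total) - log(col-1 total)
L = np.zeros((0, 4))
xi = np.log(np.array([3.0, 5.0, 2.0, 7.0]))

eps = 1e-6                              # central-difference check of H
num = np.array([(h_fun(xi + eps * e, A, C, U, L)
                 - h_fun(xi - eps * e, A, C, U, L)) / (2 * eps)
                for e in np.eye(4)])
# num agrees with H_fun(xi, A, C, U, L)
```

A numerical derivative check of this kind is a cheap safeguard when implementing the chain-rule expression for a new model class.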
The derivative matrix G^* can be computed as follows:

G^*(\theta) = \partial g(\theta)/\partial\theta' = [ -D(e^\xi) + \partial(H(\xi)\lambda)/\partial\xi', H(\xi) ; H'(\xi), 0 ] = G(\theta) + [ \partial(H(\xi)\lambda)/\partial\xi', 0 ; 0, 0 ].

The matrix \partial(H(\xi)\lambda)/\partial\xi' is of order O_p(n_*^{1/2}) when it is evaluated at \hat\theta = vec(\hat\xi, \hat\lambda), since \partial H(\xi)/\partial\xi = O_p(n_*) and \hat\lambda = O_p(n_*^{-1/2}). It follows that the matrix G, which is much simpler to invert than G^*, can be used to adjust the estimate at each iteration.

Computing the inverse of G. Although the matrix G is of dimension (s + u + l) x (s + u + l), which may be very large in practice, its inverse is relatively simple to calculate. The inverse of the partitioned matrix

G = [ -D, H ; H', 0 ]

is shown by Aitchison and Silvey (1958) to have the form

G^{-1} = [ -D^{-1} + D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1}, D^{-1}H(H'D^{-1}H)^{-1} ; (H'D^{-1}H)^{-1}H'D^{-1}, (H'D^{-1}H)^{-1} ].

Therefore, only the matrices D and H'D^{-1}H, which are of dimensions s x s and (u + l) x (u + l), need to be inverted. The inverse of D is easily calculated since D is a diagonal matrix with e^\xi on the diagonal. The inverse of H'D^{-1}H, a symmetric positive definite matrix, can be found quite easily, even when u + l, the number of constraints, is large. It should be pointed out that when s, the total number of cell means, is large, the number of constraints u + l may be large and of the same order as s. This will be the case for parsimonious models, i.e. models with many constraints relative to the number of model parameters. One could choose to invert the matrix G a limited number of times to mitigate the computational burden. In fact, in their 1958 and 1960 papers, Aitchison and Silvey advocate an iterative method whereby the inverse of G is computed only twice: once at the initial iteration and again at the final iteration, upon convergence. We feel, however, that in this special case, in which the matrix G has a particularly simple form, the inverse can be computed at each iteration. Along with increased computing power, there are many efficient algorithms for inverting large symmetric positive definite matrices.
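The partitioned-inverse formula can be checked against a brute-force inverse; only the diagonal D and the small matrix H'D^{-1}H are actually inverted. The matrices below are randomly generated stand-ins, not taken from any model in the text:

```python
import numpy as np

def partitioned_inverse(d, H):
    """Inverse of G = [[-D, H], [H', 0]], D = diag(d), per Aitchison
    and Silvey (1958): only D and H'D^{-1}H are inverted."""
    Dinv_H = H / d[:, None]                       # D^{-1} H
    S_inv = np.linalg.inv(H.T @ Dinv_H)           # (H'D^{-1}H)^{-1}
    P = np.diag(1.0 / d) - Dinv_H @ S_inv @ Dinv_H.T
    Q = Dinv_H @ S_inv
    return np.block([[-P, Q], [Q.T, S_inv]])

rng = np.random.default_rng(1)
d = rng.uniform(1.0, 5.0, size=6)                 # stands in for e^xi
H = rng.standard_normal((6, 2))
G = np.block([[-np.diag(d), H], [H.T, np.zeros((2, 2))]])
# partitioned_inverse(d, H) agrees with np.linalg.inv(G)
```

For s cells and c constraints, the cost drops from inverting an (s + c) x (s + c) matrix to inverting one c x c symmetric positive definite matrix plus a diagonal one.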
2.4 Comparison of Product-Multinomial and Product-Poisson Estimators

We begin this section by introducing notation for a product-Poisson random vector. The s x 1 random vector Y = vec(Y_1,...,Y_K) is said to be product-Poisson if

Y_kj ~ ind Poisson(e^{\xi_kj}), k = 1,...,K, j = 1,...,R. (2.4.1)

Suppose that the s = RK log means {\xi_kj} satisfy the model [\Omega^{(P)}], where

\Omega^{(P)} = { \xi \in R^s : C log(Ae^\xi) = X\beta, Le^\xi = 0 },

or equivalently, for appropriately chosen U,

\Omega^{(P)} = \Omega_h^{(P)} = { \xi \in R^s : U'C log(Ae^\xi) = 0, Le^\xi = 0 }. (2.4.2)

This model implies all the same constraints on \xi as the product-multinomial model [\Omega_h] of (2.3.5), with one exception: the identifiability constraints, e^{\xi'}(\oplus_{k=1}^K 1_R) = n', are not included. Denote the maximum likelihood estimators computed assuming (2.4.1) and (2.4.2) by \hat\xi^{(P)} and \hat\beta^{(P)}. Similarly, denote the maximum likelihood estimators computed assuming (2.3.1) and (2.3.5) by \hat\xi^{(M)} and \hat\beta^{(M)}. Recall that the three product-multinomial model assumptions are:

(A1) The multinomial response model can be specified as in (2.3.3). That is, the model parameter space can be represented as

\Omega_X = { \xi \in R^s : C_1 log A_1 e^\xi = X_1\beta_1, C_2 log A_2 e^\xi = X_2\beta_2, Le^\xi = 0, e^{\xi'}(\oplus_{k=1}^K 1_R) = n' },

where
C_i = \oplus_k C_ik, with C_ik \equiv C_i1 of dimension q_i x m_i, i = 1, 2;
A_i = \oplus_k A_ik, with A_ik \equiv A_i1 of dimension m_i x R, i = 1, 2;
L = \oplus_k L_k, with L_k \equiv L_1 of dimension d x R;
\xi = vec(\xi_1,...,\xi_K), where each \xi_k is R x 1;
X_i is Kq_i x p_i of full rank p_i, i = 1, 2;
n is the K x 1 vector of multinomial indices;
s = RK, the total number of cells.

(A2) Either C_i = I_{q_iK} or C_i(\oplus_k 1_{m_i}) = 0, i = 1, 2.

(A3) If C_i = I_{q_iK}, then M(X_i) \supseteq M(\oplus_k 1_{m_i}).

The following theorem states that the maximum likelihood estimators for \xi, and hence \mu, are the same under the product-multinomial sampling scheme of (2.3.1) and the product-Poisson sampling scheme of (2.4.1), provided that the three assumptions (A1), (A2), and (A3) hold.
THEOREM 2.4.1 If the model (2.3.4) satisfies assumptions (A1), (A2), and (A3), then \hat\xi^{(P)} = \hat\xi^{(M)} and \hat\beta^{(P)} = \hat\beta^{(M)}. That is, the maximum likelihood estimators of \beta and \xi are the same under both sampling schemes, product-Poisson (2.4.1) and product-multinomial (2.3.1).

Proof: Under the product-Poisson assumptions (2.4.1) and (2.4.2), the kernel of the log likelihood is

\ell^{(P)}(\xi; y) = y'\xi - e^{\xi'}1_s.

Therefore, letting \theta = vec(\xi, \lambda), the corresponding Lagrangian objective function is

Q(\theta) = y'\xi - e^{\xi'}1_s + h'(\xi)\lambda,

and so to find the maximum (Poisson) likelihood estimator \hat\theta^{(P)} = vec(\hat\xi^{(P)}, \hat\lambda^{(P)}) we must solve the system of equations

\partial Q(\theta)/\partial\theta = [ y - e^{\hat\xi^{(P)}} + H(\hat\xi^{(P)})\hat\lambda^{(P)} ; h(\hat\xi^{(P)}) ] = 0. (2.4.3)

The conclusion of the theorem now follows, since the equations (2.3.8) of Theorem 2.3.1 and (2.4.3) yield exactly the same solutions, and

\hat\beta^{(P)} = (X'X)^{-1}X'C log(Ae^{\hat\xi^{(P)}}) = (X'X)^{-1}X'C log(Ae^{\hat\xi^{(M)}}) = \hat\beta^{(M)}. □

As a corollary to Theorem 2.4.1 we have

COROLLARY 2.4.1 Provided the assumptions of Theorem 2.4.1 hold, the estimated undetermined multipliers are invariant with respect to sampling scheme, i.e. \hat\lambda^{(M)} = \hat\lambda^{(P)}.

Proof: The proof follows immediately upon noting that equations (2.3.8) and (2.4.3) yield exactly the same solutions. □

A remark is in order. Basically, Theorem 2.4.1 enables us to conclude that the necessary and sufficient condition of Birch (1963) holds, namely that the model be specified so that the Poisson ML estimators necessarily satisfy the identifiability constraints that are required for the multinomial model. We now explore the asymptotic behavior of the (Poisson) ML estimator \hat\theta^{(P)} = vec(\hat\xi^{(P)}, \hat\lambda^{(P)}). For the product-Poisson assumptions (2.4.1) and (2.4.2), we can obtain the asymptotic distribution of \hat\theta^{(P)} by formally replacing n_* = min{n_k} by \mu_* = min{e^{\xi_kj}} and using the same arguments as those used to derive the asymptotic distribution of \hat\theta^{(M)}. Jørgensen (1989) discusses limiting distributions for Poisson random variables as the mean parameters, or equivalently \mu_*, go to infinity.
In this case,

g(\theta_0; Y) = [ Y - e^{\xi_0} ; 0 ]

has an asymptotic normal distribution with mean zero and asymptotic covariance

[ D(\mu_0), 0 ; 0, 0 ].

Using arguments similar to those used in the multinomial case, it follows that

[ Y - e^{\xi_0} ; 0 ] = [ D(\mu_0), -H ; H', 0 ] [ \hat\xi^{(P)} - \xi_0 ; \hat\lambda^{(P)} ] + O_p(1).

We conclude, as in the product-multinomial case, that \hat\theta^{(P)} - \theta_0 has an asymptotic normal distribution with mean zero and asymptotic covariance

[ D(\mu_0), -H ; H', 0 ]^{-1} [ D(\mu_0), 0 ; 0, 0 ] ([ D(\mu_0), -H ; H', 0 ]^{-1})'.

But this can again be simplified, as it was in the multinomial case. It can be shown that the asymptotic covariance can be rewritten as

[ D^{-1} - D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1}, 0 ; 0, (H'D^{-1}H)^{-1} ], (2.4.4)

where D = D(\mu_0) = D(e^{\xi_0}) and H = H(\xi_0).

Comparison of the Asymptotic Distributions. Provided assumptions (A1), (A2), and (A3) hold, both \hat\theta^{(P)} - \theta_0 and \hat\theta^{(M)} - \theta_0 have asymptotic normal distributions with zero means and respective covariances given in (2.4.4) and (2.3.12). Therefore, we have the following interesting results.

Result 1. The asymptotic covariances of \hat\theta^{(P)} and \hat\theta^{(M)} are related by

var(\hat\theta^{(P)}) = var(\hat\theta^{(M)}) + [ \oplus_{k=1}^K n_k^{-1} 1_R 1_R', 0 ; 0, 0 ]. (2.4.5)

Result 2. The asymptotic distributions of \hat\lambda^{(P)} and \hat\lambda^{(M)} are identical, and it follows that the Lagrange multiplier statistic, which has form

LM = \hat\lambda'[var(\hat\lambda)]^{-1}\hat\lambda = \hat\lambda'(H'D^{-1}H)\hat\lambda,

is invariant with respect to the sampling scheme.

Result 3.

var(\hat\mu^{(M)}) = var(\hat\mu^{(P)}) - \oplus_{k=1}^K n_k^{-1} \mu_{0k}\mu_{0k}'. (2.4.6)

Result 4.

var(\hat\beta^{(M)}) = var(\hat\beta^{(P)}) - \Lambda, (2.4.7)

where

\Lambda = (X'X)^{-1}X'C D^{-1}(A\mu_0) A (\oplus_{k=1}^K n_k^{-1} \mu_{0k}\mu_{0k}') A' D^{-1}(A\mu_0) C'X(X'X)^{-1}

and \Lambda is nonnegative definite. The notation var(.) used in these results denotes the asymptotic variance. This is important since the finite sample variances may not even exist. The proofs of Results 3 and 4 are straightforward; basically, they involve using the delta method and equation (2.4.5). The interested reader will find an outline of the proofs in Appendix A. In practice, it is of particular interest to evaluate the matrix \Lambda of equation (2.4.7).
Often, for convenience, the models are fit assuming the vector Y is product-Poisson, and then inferences based on the maximum likelihood estimates are made assuming that they are invariant with respect to the sampling assumption. Birch (1963) and Palmgren (1981) derive rules for when these inferences, based on the two different sampling assumptions, will be equivalent. However, they assume that the model is of a simple loglinear form; that is, the Poisson model is assumed to have form

\Omega_X = { \xi \in R^s : \xi = X\beta }.

We will use the results of this section to derive more general rules for when the two inferences will be equal. As a special case of these results, we will arrive at the Birch and Palmgren results. The following lemma will enable us to rewrite \Lambda of (2.4.7) in a still simpler form.

LEMMA 2.4.1 Let Z = [Z_1,...,Z_K] be an r x K matrix of full rank K. Suppose that X = [X_1,...,X_p] is an r x p (r >= p >= K) matrix of full rank p such that M(X) \supseteq M(Z), i.e. the range space of X contains the range space of Z. Denote the T (K <= T <= p) columns of X that span a space containing M(Z) by {X_{v_1},...,X_{v_T}}. Without loss of generality, suppose that the set of vectors {X_{v_1},...,X_{v_T}} is a minimal spanning subset, i.e. the span of any \tau < T of these vectors does not contain the range space of Z. We conclude that there exists W \in R^{T x K} such that

(X'X)^{-1}X'Z = JW,

where the p x T matrix J = [e_{v_1},...,e_{v_T}] and e_{v_t} is the p x 1 vector (0,...,0,1,0,...,0)' with the '1' in the v_t-th position.

Proof: Let X_* = [X_{v_1},...,X_{v_T}]. Now, by assumption, M(X_*) \supseteq M(Z). Hence, there must exist a matrix W \in R^{T x K} such that Z = X_*W. Therefore,

(X'X)^{-1}X'Z = (X'X)^{-1}X'X_*W = JW,

where J = (X'X)^{-1}(X'X_*) is as stated in the conclusion of the lemma. □

Before stating the next important theorem, let us write \Lambda in another way.
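Lemma 2.4.1 can be illustrated numerically: when Z lies in the span of a subset of X's columns, the coefficient matrix (X'X)^{-1}X'Z is nonzero only in the rows indexed by that subset. All matrices below are randomly generated illustrations, not taken from any model in the text:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 4))          # r = 8, p = 4, full rank w.p. 1
W_true = np.array([[2.0], [-1.0]])
Z = X[:, [0, 2]] @ W_true                # Z lies in span of columns 0 and 2
B = np.linalg.solve(X.T @ X, X.T @ Z)    # (X'X)^{-1} X'Z
# B = JW: rows 0 and 2 hold W_true; rows 1 and 3 are zero
```

The key step is simply that (X'X)^{-1}X' applied to a column of X returns the corresponding elementary vector.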
Assuming that (A1) holds, \Lambda can be written as

\Lambda = [ \Lambda^{11}, \Lambda^{12} ; \Lambda^{21}, \Lambda^{22} ], (2.4.8)

where

\Lambda^{ij} = (X_i'X_i)^{-1}X_i'C_i D^{-1}(A_i\mu_0) A_i (\oplus_{k=1}^K n_k^{-1} \mu_{0k}\mu_{0k}') A_j' D^{-1}(A_j\mu_0) C_j' X_j(X_j'X_j)^{-1}.

Now, if C_i is a contrast matrix, then by assumption (A2) we can write

(X_i'X_i)^{-1}X_i'C_i(\oplus_k 1_{m_i}) = \tilde J^{(i)}\tilde W^{(i)}, (2.4.9)

where \tilde J^{(i)} can arbitrarily be chosen equal to X_i, so that \tilde W^{(i)} = 0. On the other hand, if C_i = I_{q_iK}, then we have by (A3) that M(X_i) \supseteq M(\oplus_k 1_{m_i}). Therefore, we can invoke the result of Lemma 2.4.1 by setting Z = \oplus_k 1_{m_i}. Since M(X_i) \supseteq M(\oplus_k 1_{m_i}) = M(Z), the conditions for the lemma are satisfied. Let X_{i*} = [X_{v_1^{(i)}},...,X_{v_{T_i}^{(i)}}] be the m_iK x T_i (K <= T_i <= p_i) submatrix of X_i whose columns form a minimal spanning subset for M(Z) = M(\oplus_k 1_{m_i}). By Lemma 2.4.1, there exists W^{(i)} \in R^{T_i x K} such that

(X_i'X_i)^{-1}X_i'(\oplus_k 1_{m_i}) = J^{(i)}W^{(i)}. (2.4.10)

Here, J^{(i)} = [e_{v_1^{(i)}},...,e_{v_{T_i}^{(i)}}], where the T_i elementary vectors correspond to the T_i columns {X_{v_1^{(i)}},...,X_{v_{T_i}^{(i)}}} of X_i that form a minimal spanning subset for the range space of \oplus_k 1_{m_i}; i.e., the T_i columns span a set containing the range space of \oplus_k 1_{m_i}, and no smaller set of columns spans a set containing it. It follows that the matrices \Lambda^{ij} of (2.4.8) can be written as

\Lambda^{ij} = \tilde J^{(i)} \tilde W^{(i)} \tilde W^{(j)'} \tilde J^{(j)'}, (2.4.11)

where

\tilde J^{(i)} = J^{(i)} if C_i = I_{q_iK}, and \tilde J^{(i)} = X_i otherwise, (2.4.12)

and

\tilde W^{(i)} = W^{(i)} if C_i = I_{q_iK}, and \tilde W^{(i)} = 0 otherwise. (2.4.13)

We now state a theorem of substantive importance.

THEOREM 2.4.2 Suppose that assumptions (A1), (A2), and (A3) hold. For r = 1, 2, if C_r is the identity matrix, then let {v_1^{(r)},...,v_{T_r}^{(r)}} be the set of indices that index those columns of X_r that form a minimal spanning subset for M(\oplus_k 1_{m_r}). Then it follows that the relationship between the asymptotic variances of the two estimators \hat\beta^{(M)} and \hat\beta^{(P)} is

var(\hat\beta^{(M)}) = var(\hat\beta^{(P)}) - [ \Lambda^{11}, \Lambda^{12} ; \Lambda^{21}, \Lambda^{22} ],

where the p_i x p_j matrix \Lambda^{ij} is a zero matrix whenever at least one of C_i or C_j is a contrast or zero matrix. Otherwise, if both C_i and C_j are identity matrices, then

\Lambda^{ij}_{kl} = 0, if (k, l) \notin {v_1^{(i)},...,v_{T_i}^{(i)}} x {v_1^{(j)},...,v_{T_j}^{(j)}}.

Proof: Since (A1), (A2), and (A3) hold, we can rewrite \Lambda^{ij} as in (2.4.11).
Now, if either C_i or C_j is a contrast or zero matrix, it is obvious by (2.4.9) that \Lambda^{ij} will have zero components, as stated in the theorem, since at least one of \tilde W^{(i)} or \tilde W^{(j)} will be a zero matrix. On the other hand, if both C_i and C_j are identity matrices, then \Lambda^{ij} can be rewritten as in (2.4.11), where

\tilde J^{(i)} = [e_{v_1^{(i)}},...,e_{v_{T_i}^{(i)}}],  \tilde J^{(j)} = [e_{v_1^{(j)}},...,e_{v_{T_j}^{(j)}}],

and the matrices W^{(i)} and W^{(j)} are elements of R^{T_i x K} and R^{T_j x K}. Hence,

\Lambda^{ij} = \tilde J^{(i)} W^{ij} \tilde J^{(j)'},

where W^{ij} = W^{(i)}W^{(j)'} is some T_i x T_j matrix. Now, since the {e_{v_t}} are elementary vectors, we have that if (k, l) \notin {v_1^{(i)},...,v_{T_i}^{(i)}} x {v_1^{(j)},...,v_{T_j}^{(j)}}, then the component \Lambda^{ij}_{kl} = 0. Otherwise, if (k, l) is a member of this set, \Lambda^{ij}_{kl} is one of the elements of the matrix W^{ij}. This completes the proof. □

The next two corollaries follow immediately from Theorem 2.4.2.

COROLLARY 2.4.2 If both C_1 and C_2 are contrast matrices, then var(\hat\beta^{(M)}) = var(\hat\beta^{(P)}).

Proof: Since both C_1 and C_2 are contrast matrices, it follows that \tilde W^{(1)} and \tilde W^{(2)} are zero matrices. Therefore, the matrices \Lambda^{ij} of the theorem are zero matrices. □

COROLLARY 2.4.3 Let C_2 = 0, X_2 = 0, C_1 = I, and A_1 = I, so that the model (2.3.4) becomes

\Omega_h = { \xi \in R^s : \xi = X\beta, e^{\xi'}(\oplus_{k=1}^K 1_R) = n' },

i.e. a simple loglinear model with K subpopulations. Let {v_1,...,v_T} be the set of indices that index the columns of X that form a minimal spanning subset for M(\oplus_k 1_R). Then

var(\hat\beta^{(M)}) = var(\hat\beta^{(P)}) - \Lambda,

where the elements of \Lambda are such that

\Lambda_{kl} = 0, if (k, l) \notin {v_1,...,v_T}^2.

Proof: The proof is an immediate consequence of the theorem upon identifying \Lambda^{11} of the theorem with \Lambda of the corollary. The other matrices \Lambda^{12}, \Lambda^{21}, and \Lambda^{22} are zero since C_2 = 0. □

Corollary 2.4.3 is of practical importance and is essentially the result shown by Palmgren (1981).
In particular, if we parameterize the model in such a way that a parameter is included for each of the K independent multinomials (or K covariate levels), then the K columns of X corresponding to these K 'fixed by design' parameters will form a basis (and hence a minimal spanning subset) for M(\oplus_k 1_R). Therefore, if \beta_i and \beta_j are not among the K parameters fixed by design, then

cov(\hat\beta_i^{(M)}, \hat\beta_j^{(M)}) = cov(\hat\beta_i^{(P)}, \hat\beta_j^{(P)}).

We will illustrate the utility of the above results in the next chapter of this dissertation. The next section considers issues that may arise when computing the model degrees of freedom. It also states some other miscellaneous results with regard to the Lagrange multiplier statistic.

2.5 Miscellaneous Results

We begin this section by addressing practical issues that may arise during nonstandard model fitting. Specifically, we will consider computing the model and distance (or residual) degrees of freedom.

Computing model and distance degrees of freedom. Assuming the model [\Omega_h] of (2.3.5) is well defined, i.e. the u + l + K constraints are nonredundant, we can compute the model degrees of freedom as in section 2.2. In that section, we defined the model degrees of freedom as the number of model parameters minus the number of independent constraints implied by the model. Notice that in this application we have an additional l linear constraints; the l constraints were not present in section 2.2. It follows that the model degrees of freedom for [\Omega_h] are

df[\Omega_h] = s - (u + l + K), (2.5.1)

where s is the number of cell means, u is the dimension of the null space of X', l is the number of linear constraints, and K is the number of identifiability constraints.

To measure model goodness of fit, we can consider estimating some hypothetical distance between model [\Omega_h] and the saturated model (u = l = 0) [\Omega]. This distance, denoted \delta[\Omega_h; \Omega], has degrees of freedom

df(\delta[\Omega_h; \Omega]) = df[\Omega] - df[\Omega_h] = (s - K) - (s - (u + l + K)) = u + l. (2.5.2)
Notice that, had we considered the product-Poisson model (2.4.2), the distance degrees of freedom would be

df(\delta[\Omega_h^{(P)}; \Omega^{(P)}]) = s - (s - (u + l)) = u + l,

which is identical to the product-multinomial distance degrees of freedom of (2.5.2). We have assumed that the u + l + K constraints are nonredundant, i.e. that no constraint is implied by the other constraints. This may not always be the case. To illustrate, consider the model specification for example 3 of section 2.2.2. The model [\Omega_MH] implies that the two marginal distributions are equal. We stated at the end of that example that the additional constraint \pi_{2+} - \pi_{+2} = 0 was redundant. This can be seen since

\pi_{2+} - \pi_{+2} = \pi_{21} - \pi_{12} = -(\pi_{1+} - \pi_{+1}) = 0.

That is, the constraints of model [\Omega_MH] imply that \pi_{2+} - \pi_{+2} equals zero. Had we blindly added this constraint, we would have incorrectly calculated the model degrees of freedom as 1 and the distance degrees of freedom as 2. Therefore, we must be very careful to have a set of nonredundant constraints when computing degrees of freedom. In practice, when models are more complicated, it may be difficult to ascertain whether or not the model constraints are nonredundant. Fortunately, there are two very useful results that help in this regard. The first result is that when the constraints are redundant, the matrix H'D^{-1}H evaluated at some point in \Omega_h is of less than full rank and is not invertible. Therefore, in practice, if the algorithm (2.3.13) fails to converge because G is singular, it may be due to redundant constraints, i.e. an ill-defined model. The user should investigate and possibly respecify the model should this occur. A caveat is that, due to computational roundoff error, a singularity may not occur even when the model is ill defined, because the iterate estimates, including the final estimate, may not lie strictly in \Omega_h. The next result may mitigate this problem.
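The rank-deficiency symptom just described is easy to check directly. The sketch below builds H'D^{-1}H for the 2x2 marginal-homogeneity example with the redundant second constraint included (cells ordered 11, 12, 21, 22); the fitted means are illustrative numbers only:

```python
import numpy as np

# Marginal homogeneity on a 2x2 table: the first row of L says
# mu_12 - mu_21 = 0; the second row, from the other margin, is
# implied by the first and hence redundant.
L = np.array([[0.0, 1.0, -1.0, 0.0],
              [0.0, -1.0, 1.0, 0.0]])
mu = np.array([155.0, 103.5, 103.5, 237.0])   # illustrative fitted means
H = mu[:, None] * L.T                         # H = D(e^xi) L'
M = H.T @ np.diag(1.0 / mu) @ H               # H'D^{-1}H
# rank(M) = 1 < 2: G would be singular, flagging an ill-defined model
```

Checking the numerical rank of H'D^{-1}H at the current iterate is a cheap diagnostic to run before (or upon failure of) the fitting algorithm.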
A result that is useful in practice is that a necessary condition for the constraints to be nonredundant, or equivalently for the model to be well defined, is that the Lagrange multiplier statistic be invariant to the choice of U, a matrix whose columns span the null space of X'. Evidently, if the user fits the model several times, each time using a different 'U' matrix, and the Lagrange multiplier statistic varies (more so than can be explained by roundoff error), then it must be that the model is ill defined. Formally, this necessary condition can be stated as follows.

THEOREM 2.5.1 Let U_1 and U_2 (U_1 \ne U_2) be any two full column rank matrices satisfying U_i'X = 0, i = 1, 2. Denote the Lagrange multiplier statistic evaluated using U_i by LM(U_i). If the matrix

H_i = \partial h_i'(\xi)/\partial\xi = \partial/\partial\xi [log(e^{\xi'}A')C'U_i, e^{\xi'}L']

is such that [H_i, e^\xi] is of full column rank, i = 1, 2, and hence the models are well defined, then LM(U_1) = LM(U_2); i.e., the value of the Lagrange multiplier statistic is invariant with respect to the choice of U.

Proof: Denote the model specified in terms of U_i by [\Omega_{h_i}], i = 1, 2. By the definition of U, we know that the constraints implied by [\Omega_{h_1}] and [\Omega_{h_2}] are equivalent. Hence, the solution to (2.3.8), or equivalently (2.4.3), under either model is the same. Thus, in view of the first set of equations in (2.3.8), any solution vec(\hat\xi, \hat\lambda_i) under model [\Omega_{h_i}] must satisfy

y - e^{\hat\xi} = -H_i(\hat\xi)\hat\lambda_i, i = 1, 2. (2.5.3)

Notice that since U_1 \ne U_2, we have that H_1(\hat\xi) \ne H_2(\hat\xi) and, by (2.5.3), \hat\lambda_1 \ne \hat\lambda_2. Now, (2.5.3) implies that

H_1(\hat\xi)\hat\lambda_1 = H_2(\hat\xi)\hat\lambda_2. (2.5.4)

Also, since H_i(\hat\xi) is assumed to be of full column rank, the variance estimate of \hat\lambda_i,

var(\hat\lambda_i) = (H_i'(\hat\xi)D^{-1}(e^{\hat\xi})H_i(\hat\xi))^{-1}, (2.5.5)

exists. Therefore, the Lagrange multiplier statistics LM(U_i), which have form

LM(U_i) = \hat\lambda_i'[var(\hat\lambda_i)]^{-1}\hat\lambda_i, i = 1, 2, (2.5.6)

exist. Finally, by (2.5.4)-(2.5.6), it follows that

LM(U_1) = \hat\lambda_1'[var(\hat\lambda_1)]^{-1}\hat\lambda_1 = \hat\lambda_1'H_1'(\hat\xi)D^{-1}(e^{\hat\xi})H_1(\hat\xi)\hat\lambda_1 = \hat\lambda_2'H_2'(\hat\xi)D^{-1}(e^{\hat\xi})H_2(\hat\xi)\hat\lambda_2 = \hat\lambda_2'[var(\hat\lambda_2)]^{-1}\hat\lambda_2 = LM(U_2).

This completes the proof. □
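The invariance argument can be illustrated numerically without running the full fitting procedure. Below, randomly generated stand-ins are used: H_2 = H_1 M for an invertible M plays the role of switching from U_1 to U_2, and r stands in for the residual y - \hat\mu of (2.5.3); everything here is a hypothetical sketch of the identities, not of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = rng.uniform(1.0, 4.0, size=6)            # stands in for fitted means
H1 = rng.standard_normal((6, 3))              # H built from one choice of U
H2 = H1 @ rng.standard_normal((3, 3))         # same column space: another 'U'
r = H1 @ rng.standard_normal(3)               # stands in for y - mu-hat
lam1 = np.linalg.lstsq(H1, r, rcond=None)[0]  # H_i lam_i = y - mu-hat, cf. (2.5.3)
lam2 = np.linalg.lstsq(H2, r, rcond=None)[0]
Dinv = np.diag(1.0 / mu)
LM1 = lam1 @ (H1.T @ Dinv @ H1) @ lam1
LM2 = lam2 @ (H2.T @ Dinv @ H2) @ lam2
# LM1 == LM2, and both equal the Pearson-type form r'D^{-1}r
```

The last comment previews the next result: substituting (2.5.3) into the quadratic form reduces the Lagrange multiplier statistic to (y - \hat\mu)'D^{-1}(\hat\mu)(y - \hat\mu).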
The final result of this section states that the Lagrange multiplier statistic is exactly the same as the Pearson chi-squared statistic whenever the random vector Y is product-Poisson or product-multinomial and the model satisfies assumptions (A1), (A2), and (A3).

THEOREM 2.5.2 Assume that the product-multinomial model satisfies assumptions (A1), (A2), and (A3). Let X^2 denote the Pearson chi-squared statistic, i.e.

X^2 = (y - \hat\mu)'D^{-1}(\hat\mu)(y - \hat\mu),

where \hat\mu is the ML estimator under either of the sampling schemes, product-multinomial or product-Poisson. It follows that the Lagrange multiplier statistic LM is equivalent to X^2. That is, LM = X^2.

Proof: By equations (2.5.3), (2.5.5), and (2.5.6) of the previous theorem's proof, and the fact that e^{\hat\xi^{(M)}} = \hat\mu^{(M)}, we have that

LM = \hat\lambda'(H'D^{-1}(e^{\hat\xi})H)\hat\lambda = (y - \hat\mu)'D^{-1}(\hat\mu)(y - \hat\mu) = X^2.

This is what we set out to show. □

2.6 Discussion

In this chapter, we discussed in some detail issues related to parametric modeling. In particular, we followed the lead of Aitchison and Silvey (1958, 1960) and Silvey (1959) and described two ways of specifying models: using constraint equations and using freedom equations. In section 2.2, distance measures for quantifying how far apart two models are, relative to how close they are to holding, were discussed. In particular, the power-divergence measures (Read and Cressie, 1988) were used when the parameter spaces were subsets of an (s - 1)-dimensional simplex. Estimates of these distances were developed based on very intuitive notions. Also, a geometric interpretation of model and residual (or distance) degrees of freedom was given. In section 2.3, we described a general class of multivariate polytomous (categorical) response data models. The models in this class, which satisfy assumptions (A1), (A2), and (A3), were shown to satisfy the necessary and sufficient conditions of Birch (1963), so that they can be fitted using either the product-Poisson or product-multinomial sampling assumption.
An ML fitting method was developed using results of Aitchison and Silvey (1958, 1960) and Haber (1985a, 1985b). The algorithm uses Lagrangian undetermined multipliers in conjunction with a modified Newton-Raphson iterative scheme. The modification, which simplifies the method of Haber (1985a), is to use a simpler matrix than the Hessian matrix: we replace the Hessian matrix (of the Lagrangian objective function) by its dominant part, which turns out to be easily inverted. Because the matrices used in the algorithm proposed in this chapter are very large and must be inverted, this modification is a very important one. A FORTRAN program 'mle.restraint' has been written by the author to implement this modified algorithm. The asymptotic behavior of the ML estimators computed under the two sampling schemes, product-Poisson and product-multinomial, was investigated. The method for deriving the asymptotic distributions represents a modification of the technique of Aitchison and Silvey (1958). A comparison of the limiting distributions of the two estimators was made in section 2.4. Some very interesting results were obtained by studying the asymptotic behavior in the constraint equation setting. In particular, Theorem 2.4.2 represents a generalization of the results of Palmgren (1981). The theorem provides a method for determining when inferences about the freedom parameters of a generalized loglinear model of the form C log A\mu = X\beta will be invariant with respect to the sampling assumption. Palmgren (1981) developed some similar results for the special case when the freedom parameters are part of a loglinear model. It is important to note that the asymptotic results are valid only if the number of populations K is considered fixed and the expected counts all get large at approximately the same rate.
In particular, the asymptotic arguments do not hold when the covariates are continuous, since the number of populations (levels of the covariates) can theoretically run off to infinity. The reason the arguments do not hold is that the method of Aitchison and Silvey (1958) requires that the vector n_*^{-1}(Y - e^{\xi_0}) converge in probability to zero as the total number of observations gets large. This is the case only when n_* = min{n_1,...,n_K} goes to infinity. This drawback could prove to be temporary. It seems reasonable to assume, in many cases, that as long as the 'information' about each parameter is increasing without bound, the estimators will be consistent and asymptotically normally distributed. For example, consider the logistic regression model with continuous covariates. Although the n_k's may all be 1, the ML estimators of the regression parameters are often consistent and asymptotically normal. Section 2.5 outlines some miscellaneous results. One result that is important to the practicing statistician is that the Lagrange multiplier statistic is shown to be invariant with respect to the choice of the matrix U (of U'C log A\mu = 0) as long as the model is well defined. An important implication of this result is that if one fits the model several times, each time using a different 'U' matrix, and the Lagrange multiplier statistics vary more so than can be explained by roundoff, then it could be that the model is not well defined. Another interesting result is that the Lagrange multiplier statistic is simply the Pearson chi-squared statistic X^2 whenever the assumptions (A1), (A2), and (A3) are satisfied. Theoretically, the ML fitting algorithm will work for any size problem. Practically, however, the algorithm is certainly not a model-fitting panacea. The number of parameters that must be estimated gets very large, very fast. Consider the case where 7 raters rate the same set of objects on a 5-point scale.
Even without covariates, the number of cell probabilities that must be estimated is 5^7 = 78,125. It seems the ML fitting method developed in this chapter is, at least for now, useful for moderate-sized problems only. It can be used to analyze longitudinal categorical response data when the number of measurements taken on each subject is somewhere in the neighborhood of 2 to 6. This is not to take away from the utility of this chapter's algorithm, but rather to indicate its breadth of application. In time, with increasing computer efficiency, much larger data sets may be fitted using this algorithm.

CHAPTER 3
SIMULTANEOUSLY MODELING THE JOINT AND MARGINAL DISTRIBUTIONS OF MULTIVARIATE POLYTOMOUS RESPONSE VECTORS

3.1 Introduction

Oftentimes, when given an opportunity to analyze multivariate response data, the investigator may wish to describe both the joint and marginal distributions simultaneously. We consider a broad class of models that imply structure on both the joint and marginal distributions of multivariate polytomous response vectors. To illustrate the need for such models, we consider several settings where these models would be useful. For example, when the multivariate responses represent repeated measures of the same categorical response across time, one may be interested in how the marginal distributions are changing across time and how strongly the responses are associated. The simultaneous investigation of both joint and marginal distributions is not restricted to the longitudinal data setting. Other examples include the analysis of rater agreement, crossover, and social mobility data. The common thread tying all of these data types together is that the sampling scheme is such that the different responses are correlated. In longitudinal studies, the same subject responds on several occasions. In rater agreement studies, raters rate the same objects.
In two-period crossover studies, one group of subjects receives the two treatments in one order and the other group receives them in the other order. In social mobility studies, the socioeconomic status of a father-son pair is recorded. When the responses are positively correlated, these designs result in increased power for detecting differences between the marginal distributions (Laird, 1991; Zeger, 1988). This chapter considers the modeling of multivariate categorical responses in which the same response scale is used for each response. The classes of models used in this chapter are of the form considered in Chapter 2 of this dissertation and hence are readily fit using the ML methods of that chapter. In section 3.2, we give several examples that may be analyzed by simultaneously modeling the joint and marginal distributions. We introduce the classes of simultaneous joint-marginal models in section 3.3. Several models are fitted to the data sets of section 3.2.

3.2 Product-Multinomial Sampling Model

Initially, we assume that a random sample of n_k subjects is taken from population k, k = 1,...,K. The number of populations, or covariate profiles, K is considered to be some fixed integer. The subscript k is allowed to be compound, i.e. k may represent a vector of subscripts such as k = (k_1, k_2,...). Suppose that there are T categorical responses V(1),...,V(T) of interest and that each response is measured on the same response scale. Let V_k = (V_k(1),...,V_k(T))' be the random vector of responses for population k, and let V_ku, u = 1,...,n_k, be the n_k independent and identically distributed copies of V_k, where V_ku denotes the response profile for the u-th randomly chosen person within population k. Notationally, we have

V_ku ~ i.i.d. V_k, u = 1,...,n_k.
Denote the probability that a randomly selected subject from population k has response profile i = (i_1,...,i_T)' by π_ik, i.e.

    P(V_k = (i_1,...,i_T)') = π_ik,

where i ∈ {1,...,d} × ··· × {1,...,d}. The joint distribution of V_k = (V_k(1),...,V_k(T))' is specified as {π_ik}. The marginal distributions of V_k will be denoted by {φ_i(t;k)}, t = 1,...,T, where

    φ_i(t;k) = P(V_k(t) = i),  i = 1,...,d.

Our objective is to model simultaneously the K joint distributions {π_ik}, k = 1,...,K, and the KT marginal distributions {φ_i(t;k)}, t = 1,...,T, k = 1,...,K.

To help the reader better understand the notation, we consider the one-population bivariate case. When T = 2, the response profiles can be denoted by i = (i_1,i_2) = (i,j), where i = 1,...,d and j = 1,...,d. Since there is just one population (or covariate profile), the subscript k is always 1 and is therefore dropped. It follows that {π_ij} is the joint distribution of (V(1),V(2))' and {φ_i(t)}, t = 1, 2, are the two marginal distributions. That is,

    π_ij = P(V(1) = i, V(2) = j),  i = 1,...,d,  j = 1,...,d,

and

    φ_i(t) = π_i+ = P(V(1) = i), if t = 1,
           = π_+i = P(V(2) = i), if t = 2,

for i = 1,2,...,d.

Now for each population k, consider the d^T × 1 random vector of indicators

    x_k = [I(V_k = i_1), ..., I(V_k = i_{d^T})]'.

Notice that no information about V_k is lost, since x_k is a one-to-one function of V_k. Also,

    x_k ~ ind. Mult(1, {π_ik}),  k = 1,...,K.

Therefore, since we have randomly sampled n_k subjects from each of the K populations, we have that for given k

    x_k1, x_k2, ..., x_kn_k ~ i.i.d. Mult(1, {π_ik}),

and hence the vector

    Y_k = Σ_{u=1}^{n_k} x_ku ~ Mult(n_k, {π_ik})

is sufficient for the family of distributions {π_ik} and {φ_i(t;k)}. By independence across populations, the vector vec(Y_1, Y_2, ..., Y_K) is sufficient for the joint and marginal distributions of vec(V_1, V_2, ..., V_K). Further, the random vector vec(Y_1, Y_2, ..., Y_K) is product-multinomial, i.e.

    Y_k = (Y_1k, ..., Y_Rk)' ~ ind. Mult(n_k, {π_ik}),  k = 1,...,K,

where i_1, ..., i_R represent the R = d^T different response profiles.
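The indicator-vector construction above can be sketched numerically. This is a hedged illustration (NumPy; the joint distribution `pi` and sample size `n_k` are simulated for the example, not taken from any data set in the text): each subject's profile is recorded as a Mult(1, {π_ik}) indicator vector, and summing those vectors over subjects yields the Mult(n_k, {π_ik}) count vector Y_k.

```python
import numpy as np

rng = np.random.default_rng(0)

d, T = 3, 2          # 3-category scale, T = 2 responses
R = d ** T           # number of response profiles
n_k = 500            # made-up sample size for population k

# A hypothetical joint distribution {pi_ik} over the R profiles
pi = rng.dirichlet(np.ones(R))

# Each subject's profile, drawn i.i.d.; x[u] is the Mult(1, {pi_ik})
# indicator vector for subject u (a single 1 marking the profile)
draws = rng.choice(R, size=n_k, p=pi)
x = np.eye(R, dtype=int)[draws]

# Y_k = sum over u of x_ku collects the counts: Y_k ~ Mult(n_k, {pi_ik})
y = x.sum(axis=0)
print(y.sum())       # equals n_k = 500; no information about V_k is lost
```

The one-hot rows of `x` make the one-to-one correspondence between V_k and x_k concrete, and the column sums are exactly the contingency-table counts discussed next.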
Evidently, Y_ik represents the number of randomly selected subjects from population k who have response profile i. That is, the {Y_ik} represent counts resulting from a cross-classification of N = Σ_{k=1}^{K} n_k subjects on T response variables and a population variable. The data can be displayed in a d^T × K contingency table. By convention, we use lower case Roman letters to denote realizations of random quantities. For example, y_ik represents a particular realization of Y_ik.

Consider Table 3.1, taken from Hout et al. (1987).

Table 3.1. Interest in Political Campaigns

                                1960
                  Not Much  Somewhat  Very Much  Total
      Not Much       155       116        64      335
1956  Somewhat        91       237       171      499
      Very Much       32        91       246      369
      Total          278       444       481     1203

Source: Hout et al. (1987), p. 166, Table 4

Each of 1203 randomly selected subjects was asked in 1956 how interested they were in the political campaigns. They responded on the 3-category ordinal scale: 1 = Not Much, 2 = Somewhat, and 3 = Very Much. Then, in 1960, each of the subjects was asked the same question and responded on the same 3-category ordinal scale. Using the above notation, we let V(1) and V(2) represent the responses in 1956 and 1960. Let y_ij, i,j = 1,2,3, represent the number of the N = 1203 subjects responding at level i in 1956 and level j in 1960. Since there is just one population of interest, we drop the population subscript altogether. Finally, for this bivariate response example, the compound subscript i is replaced by ij. Table 3.1 summarizes the bivariate responses.

As another example, consider the crossover data of Ezzet and Whitehead (1991).

Table 3.2. Crossover Data

      AB Sequence (Group 1)         BA Sequence (Group 2)
               B                             B
          1   2   3   4                 1   2   3   4
     1   59  35   3   2            1   63  40   7   2
A    2   11  27   2   1        A   2   13  15   2   0
     3    0   0   0   0            3    0   0   0   1
     4    1   1   0   0            4    1   0   0   0

The counts displayed in Table 3.2 are from a study conducted by 3M Health Care Ltd.
to compare the suitability of two inhalation devices (A and B) in patients who are currently using a standard inhaler device delivering salbutamol. Two independent groups of subjects participated. Group 1 used device A for a week followed by device B (sequence AB). Group 2 used the devices in the reverse order (sequence BA). The response variables V(1) (device A) and V(2) (device B) are ordinal polytomous. Specifically, they are the self-assessments on clarity of the leaflet instructions accompanying the two devices, recorded on the ordinal four-point scale

    1 = Easy,
    2 = Only clear after rereading,
    3 = Not very clear,
    4 = Confusing.

For this example there are two populations of interest: Group 1 and Group 2. Let y_ijk represent the number of the n_k subjects responding at level i for device A and level j for device B, where n_1 = 142 and n_2 = 144. Again, the bivariate response profiles can be denoted by i = ij, where i,j = 1,2,3,4. The bivariate responses are summarized in Table 3.2.

3.3 Joint and Marginal Models

Two types of questions that can be posed about Table 3.1 lead to quite distinct types of models. One question is whether the interest in the political campaigns was different at the two times. For example, the researcher may wish to test the hypothesis that there was more interest in the 1960 political campaign than in the 1956 political campaign. An investigation into the marginal distributions is needed to test this hypothesis. For these bivariate response data, the marginal distributions correspond to the row and column distributions of Table 3.1. A second question that may be asked is whether the two responses are associated and, if so, how strong the association is. To answer these questions, we must describe the dependence displayed in the joint distribution of Table 3.1.
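As an informal numerical look at the first (marginal) question, the row and column margins of Table 3.1 and their cumulative proportions can be computed directly. This is a descriptive sketch in NumPy, not one of the formal model fits developed in this chapter:

```python
import numpy as np

# Cell counts y_ij from Table 3.1 (rows: 1956 response, columns: 1960 response)
y = np.array([[155, 116,  64],
              [ 91, 237, 171],
              [ 32,  91, 246]])
n = y.sum()

# Marginal counts: row margins (1956) and column margins (1960)
row_margin = y.sum(axis=1)   # [335, 499, 369]
col_margin = y.sum(axis=0)   # [278, 444, 481]

# Sample marginal cumulative proportions P(V(t) <= i), i = 1, 2, 3
cum_1956 = np.cumsum(row_margin) / n
cum_1960 = np.cumsum(col_margin) / n

print(n)            # 1203 subjects in all
print(cum_1956)     # approximately 0.278, 0.693, 1.0
print(cum_1960)     # approximately 0.231, 0.600, 1.0
# The 1960 cumulative proportions are smaller at i = 1 and i = 2,
# hinting descriptively at a shift toward greater interest in 1960.
```

Comparing cumulative marginal proportions is a natural summary for an ordinal scale, and it is exactly the comparison that the marginal models below formalize.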
The marginal models we consider will be used to investigate whether the probability that a randomly selected subject responds at level i or lower in 1956 is different from the probability that a randomly selected subject responds at level i or lower in 1960. In this sense, the comparison of marginal 