COMPARING POISSON, HURDLE, AND ZIP MODEL FIT UNDER VARYING DEGREES OF SKEW AND ZERO-INFLATION

By

JEFFREY MONROE MILLER

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2007

Copyright 2007 by Jeffrey Monroe Miller

To the memory of my grandfather, Rev. Harold E. Cato.

ACKNOWLEDGMENTS

Several people helped make this study possible. I would like to thank my stepfather and mother, Dr. Daniel and Gail Jacobs, as well as my father and stepmother, Jerry and Darnelle Miller, for their many years of encouragement. I would also like to thank my supervisory committee chair, M. David Miller, for his unyielding guidance, patience, and support. I thank Dr. Jon Morris for the numerous training experiences. Many professors are appreciated for providing the educational foundations for the dissertation topic, including Dr. James Algina and Dr. Alan Agresti. The idea to research zero-inflation was inspired by experiences with data while consulting on projects. To this extent, I thank those clients: Dr. Courtney Zmach and Dr. Lori Burkhead. Undergraduate faculty I would like to acknowledge for their inspiration and direction include Blaine Peden, Patricia Quinn, and Lee Anna Rasar. Several friends have been a source of encouragement, including Matt Grezik and Rachael Wilkerson. Finally, I thank those who made it financially possible to complete this dissertation, including my consulting clients, the University of Florida College of Education, the Lastinger Center, and Adsam, LLC.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1: INTRODUCTION
    Statement of the Problem
    Rationale for the Study
    Purpose and Significance of the Study
    Research Questions

CHAPTER 2: REVIEW OF THE LITERATURE
    Zero-Inflated Count Data
        Count Data
        Zero-Inflation
        The Sources of Zero-Inflation
        Impact of Zero-Inflation on Analyses
        Simple Solutions to Zero-Inflation
            Deleting zeros
            Assuming normality
            Transforming zeros
    Generalized Linear Models
        The Binomial Distribution and the Logit Link
        Evaluating Model Fit
        The Poisson Distribution and the Log Link
        Iterative Estimation
        Interpretation of Coefficients
            Hypothesis testing
            Overdispersion
    Poisson and Negative Binomial Models with Zero-Inflation
        The Hurdle Model
        The Negative Binomial Hurdle Model
        The Zero-Inflated Poisson (ZIP) Model
        The Negative Binomial Zero-Inflated Poisson Model
    Model Comparison Testing for Zero-Inflated Data
    Review of Research Pertaining to and Using Zero-Inflated Count Data
        Hurdle Model
            Statistical
            Applications
        Zero-Inflated Poisson Model
            Statistical
            Applications
        ZIP and Hurdle Model Comparisons
            Statistical
            Applications
        Discrepant Findings

CHAPTER 3: METHODOLOGY
    Research Questions
    Monte Carlo Study Design
    Monte Carlo Sampling
        Pseudo-Population
        The Pre-Specified Zero Proportions
        Pre-Specified Skew
        Random Number Generation
        Sample Size
        Simulation Size
        Iteration Size
        Distribution Generation
    Monte Carlo Models
    Monte Carlo Analysis Procedures
    Analysis Design

CHAPTER 4: RESULTS
    Pseudo-Population Results
        Pseudo-Population Poisson Models
        Pseudo-Population Hurdle Models
        Hurdle vs. Negative Binomial Hurdle
        Poisson vs. Hurdle
        Negative Binomial Poisson vs. Negative Binomial Hurdle
        Pseudo-Population ZIP Models
        Comparing AIC's for All Models
    Monte Carlo Simulation Results
        Positively Skewed Distribution
        Normal Distribution
        Negatively Skewed Distribution
        Review of Positively Skewed Distribution Findings
        Review of Normal Distribution Findings
        Review of Negatively Skewed Distribution Findings

CHAPTER 5: DISCUSSION
    The Impact of the Event Stage Distribution
        Positively Skewed Event-Stage Distributions
        Normal Event-Stage Distributions
        Negatively Skewed Event-Stage Distributions
    Summary of Findings
    Limitations
        Discrete Conditions
        Convergence and Optimization
        Underdispersion
        Other models
    Validity of Model-Fitting and Model-Comparisons
    Suggestions for Future Research
    Application in Educational Research
    Major Contribution of Findings

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Five pairs of nested models valid for statistical comparison
2-2 Summary of literature on zero-inflation
3-1 Proportions of counts as a function of zeros and skew
3-2 Frequencies of counts as a function of zeros and skew
3-3 Descriptive statistics for each distribution
3-4 Poisson model: pseudo-population parameters
3-5 Negative Binomial Poisson model: pseudo-population parameters
3-6 Hurdle model (zeros): pseudo-population parameters
3-7 Hurdle model (events): pseudo-population parameters
3-8 Negative Binomial Hurdle model (zeros): pseudo-population parameters
3-9 Negative Binomial Hurdle model (events): pseudo-population parameters
3-10 ZIP model (zeros): pseudo-population parameters
3-11 ZIP model (events): pseudo-population parameters
3-12 Negative Binomial ZIP model (zeros): pseudo-population parameters
3-13 Negative Binomial ZIP model (events): pseudo-population parameters
4-1 Deviance statistics comparing Poisson and negative binomial Poisson models
4-2 Deviance statistics comparing Hurdle and negative binomial Hurdle models
4-3 Deviance statistics comparing Poisson and Hurdle models
4-4 Deviance statistics comparing NB Poisson and NB Hurdle models
4-5 Deviance statistics comparing ZIP and negative binomial ZIP models
4-6 Log-likelihood comparisons for positively skewed distribution with .10 zeros
4-7 AIC's for positively skewed distribution models with a .10 proportion of zeros
4-8 Log-likelihood comparisons for positively skewed distribution with .25 zeros
4-9 AIC's for positively skewed distribution models with a .25 proportion of zeros
4-10 Log-likelihood comparisons for positively skewed distribution with .50 zeros
4-11 AIC's for positively skewed distribution models with a .50 proportion of zeros
4-12 Log-likelihood comparisons for positively skewed distribution with .75 zeros
4-13 AIC's for positively skewed distribution models with a .75 proportion of zeros
4-14 Log-likelihood comparisons for positively skewed distribution with .90 zeros
4-15 AIC's for positively skewed distribution models with a .90 proportion of zeros
4-16 Log-likelihood comparisons for normal distribution with .10 zeros
4-17 AIC's for normal distribution models with a .10 proportion of zeros
4-18 Log-likelihood comparisons for normal distribution with .25 zeros
4-19 AIC's for normal distribution models with a .25 proportion of zeros
4-20 Log-likelihood comparisons for normal distribution with .50 zeros
4-21 AIC's for normal distribution models with a .50 proportion of zeros
4-22 Log-likelihood comparisons for normal distribution with .75 zeros
4-23 AIC's for normal distribution models with a .75 proportion of zeros
4-24 Log-likelihood comparisons for normal distribution with .90 zeros
4-25 AIC's for normal distribution models with a .90 proportion of zeros
4-26 Log-likelihood comparisons for negatively skewed distribution with .10 zeros
4-27 AIC's for negatively skewed models with a .10 proportion of zeros
4-28 Log-likelihood comparisons for negatively skewed distribution with .25 zeros
4-29 AIC's for negatively skewed models with a .25 proportion of zeros
4-30 Log-likelihood comparisons for negatively skewed distribution with .50 zeros
4-31 AIC's for negatively skewed models with a .50 proportion of zeros
4-32 Log-likelihood comparisons for negatively skewed distribution with .75 zeros
4-33 AIC's for negatively skewed models with a .75 proportion of zeros
4-34 Log-likelihood comparisons for negatively skewed distribution with .90 zeros
4-35 AIC's for negatively skewed models with a .90 proportion of zeros
4-36 Positively skewed distribution: percentage of simulations favoring complex model
4-37 AIC's: positively skewed distribution (all conditions)
4-38 Normal distribution: percentage of simulations favoring complex model
4-39 AIC's: normal distribution (all conditions)
4-40 Negatively skewed distribution: percentage of simulations favoring complex model
4-41 AIC's: negatively skewed distribution (all conditions)
4-42 Convergence frequencies: positively skewed distribution
4-43 Convergence frequencies: normal distribution
4-44 Convergence frequencies: negatively skewed distribution

LIST OF FIGURES

4-1 Boxplot of AIC's for all models for a .10 proportion of zeros
4-2 Boxplot of AIC's for all models for a .25 proportion of zeros
4-3 Boxplot of AIC's for all models for a .50 proportion of zeros
4-4 Boxplot of AIC's for all models for a .75 proportion of zeros
4-5 Boxplot of AIC's for all models for a .90 proportion of zeros
4-6 Boxplot of AIC's for all models for a .10 proportion of zeros
4-7 Boxplot of AIC's for all models for a .25 proportion of zeros
4-8 Boxplot of AIC's for all models for a .50 proportion of zeros
4-9 Boxplot of AIC's for all models for a .75 proportion of zeros
4-10 Boxplot of AIC's for all models for a .90 proportion of zeros
4-11 Boxplot of AIC's for all models for a .10 proportion of zeros
4-12 Boxplot of AIC's for all models for a .25 proportion of zeros
4-13 Boxplot of AIC's for all models for a .50 proportion of zeros
4-14 Boxplot of AIC's for all models for a .75 proportion of zeros
4-15 Boxplot of AIC's for all models for a .90 proportion of zeros
4-16 AIC rank order for positively skewed distribution models
4-17 AIC rank order for normal distribution models
4-18 AIC rank order for negatively skewed distribution models
5-1 Poisson, NB Poisson, and Hurdle over all proportions of zeros
5-2 Hurdle, NB Hurdle, and NB Poisson over all proportions of zeros
5-3 ZIP, NB ZIP, Hurdle, and NB Hurdle over all proportions of zeros

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

COMPARING POISSON, HURDLE, AND ZIP MODEL FIT UNDER VARYING DEGREES OF SKEW AND ZERO-INFLATION

By Jeffrey Monroe Miller

May 2007

Chair: M. David Miller
Major Department: Educational Psychology

Many datasets are characterized as count data with a preponderance of zeros. Such data are often analyzed by ignoring the zero-inflation and assuming a Poisson distribution. The Hurdle model is more sophisticated in that it considers the zeros to be completely separate from the nonzeros. The zero-inflated Poisson (ZIP) model is similar to the Hurdle model; however, it permits some of the zeros to be analyzed along with the nonzeros. Both models, as well as the Poisson, have negative binomial formulations for use when the Poisson assumption of an equal mean and variance is violated. The choice between the models should be guided by the researcher's beliefs about the source of the zeros. Beyond this substantive concern, the choice should be based on the model providing the closest fit between the observed and predicted values. Unfortunately, the literature presents anomalous findings in terms of model superiority. Datasets with zero-inflation may vary in terms of the proportion of zeros. They may also vary in terms of the distribution for the nonzeros. Our study used a Monte Carlo design to sample 1,000 cases from positively skewed, normal, and negatively skewed distributions with proportions of zeros of .10, .25, .50, .75, and .90.
The data were analyzed with each model over 2,000 simulations. The deviance statistic and Akaike's Information Criterion (AIC) were used to compare the fit between models. The results suggest that the literature is not entirely anomalous; however, the accuracy of the findings depends on the proportion of zeros and the distribution for the nonzeros. Although the Hurdle model tends to be the superior model, there are situations in which others, including the negative binomial Poisson model, are superior. The findings suggest that the researcher should consider the proportion of zeros and the distribution for the nonzeros when selecting a model to accommodate zero-inflated data.

CHAPTER 1
INTRODUCTION

Analyzing data necessitates determining the type of data being analyzed. The most basic assumption is that the data follow a normal distribution. However, there are many other types of distributions. The validity of the results can be affected by dissimilarity between the distribution of the data and the distribution assumed in the analysis. As such, it is imperative that the researcher choose a method for analyzing the data that maintains a distribution similar to that of the observed data.

Counts are an example of data that do not readily lend themselves to the assumption of a normal distribution. Counts are bounded by their lowest value, which is usually zero. A regression analysis assuming a normal distribution would permit results below zero. Further, counts are discrete integers, while the normal distribution assumes continuous data. Finally, counts often display positive skew such that the frequency of low counts is considerably higher than the frequencies at higher count levels. It is not uncommon to find count data analyzed in a more appropriate manner than assuming a normal distribution. Typically, more appropriate analysis includes specification of a Poisson distribution with a log link, rather than a normal distribution with a Gaussian link.
However, this does not guarantee accurate and valid results, as other features of the data may warrant an even more sophisticated model. An example of data requiring more rigorous treatment is the case of zero-inflation. In this scenario, there are far more zeros than would be expected under the Poisson distribution. As such, a number of methods are available, including the zero-inflated Poisson (ZIP) model and the Hurdle model. Further, there are negative binomial variations of these for use when particular assumptions appear to be violated. The choice between the models depends on whether the researcher believes the zeros are all a complete lack of the quantity being measured or that at least some of the zeros are purely random error.

Statement of the Problem

The results from both simulated and actual data sets in the zero-inflation literature are in much disagreement. Lambert (1992) found the ZIP model to be superior to the negative binomial Poisson model, which was superior to the Poisson model. Greene (1994) found the negative binomial Poisson model to be superior to the ZIP model, which was superior to the Poisson model. Slymen, Ayala, Arredondo, and Elder (2006) found the ZIP and negative binomial ZIP models to be equal. Welsh, Cunningham, Donnelly, and Lindenmayer (1996) found the Hurdle and ZIP models to be equal, while Pardoe and Durham (2003) found the negative binomial ZIP model to be superior to both the Poisson and Hurdle models.

One striking characteristic of these articles and others is their differences in terms of the proportion of zeros and the distribution for the nonzeros. Some research (Böhning, Dietz, Schlattmann, Mendonça, & Kirchner, 1999) analyzed data in which the proportion of zeros was as low as .216, while other research (Zorn, 1996) used proportions as high as .958. Further, the nonzeros varied in terms of their distributions, from highly positively skewed to normal to uniform.
It is possible that different models yield different results depending on the proportion of zeros and the distribution for the nonzeros.

Rationale for the Study

The best model is the one that appropriately answers the researcher's question. Beyond this, a superior model is one with close proximity between the observed data and the values predicted by the model. In other words, a superior model is one with good fit to the data. This study compared the fit of the Poisson, ZIP, and Hurdle models as well as their negative binomial formulations. Each analysis was performed for five different proportions of zeros and three different amounts of skew for the nonzero distribution. The intended results would clarify the discrepant findings of previous research.

Purpose and Significance of the Study

The primary purpose of this study was to determine superiority of fit for various models under varying proportions of zero-inflation and varying levels of skew. As such, determination can be made as to which model has better fit given data with a particular proportion of zeros and a particular distribution. The secondary purpose was to elucidate the reasons for discrepant findings in previous research.

The superior model is the appropriate model given the research question. However, there are situations in which the appropriate model is unknown or unclear. Further, there may be situations in which a simpler model such as the Poisson may be used in lieu of the more sophisticated Hurdle and ZIP models. This research provides results that aid researchers in determining the appropriate model to use given zero-inflated data.

Research Questions

Model comparisons in this research were based on two measures. One is the deviance statistic, which is a measure of the difference in log-likelihood between two models, permitting a probabilistic decision as to whether one model is adequate or whether an alternative model is superior.
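Both comparison measures used in this study, the deviance statistic and the AIC, reduce to simple arithmetic on fitted log-likelihoods. The sketch below illustrates that arithmetic; the log-likelihood values and parameter counts are hypothetical, chosen only to show the mechanics of a nested comparison (e.g., Poisson vs. negative binomial Poisson, which differ by one dispersion parameter).

```python
from scipy.stats import chi2

# Hypothetical fitted log-likelihoods: a simpler nested model and a
# more complex model with one additional parameter.
ll_simple = -1050.0   # hypothetical value
ll_complex = -1040.0  # hypothetical value

# Deviance statistic: twice the gain in log-likelihood, referred to a
# chi-square distribution with df = difference in parameter counts.
deviance = 2 * (ll_complex - ll_simple)
p_value = chi2.sf(deviance, df=1)

# AIC penalizes for model complexity: AIC = 2k - 2 * log-likelihood.
k_simple, k_complex = 3, 4  # hypothetical parameter counts
aic_simple = 2 * k_simple - 2 * ll_simple
aic_complex = 2 * k_complex - 2 * ll_complex

print(deviance, p_value, aic_simple, aic_complex)
```

Here the complex model would be favored on both counts: the deviance of 20 is improbably large under the chi-square reference, and the complex model's AIC is lower despite its penalty.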
This statistic is appropriate when one model is nested within another model. The other measure is Akaike's Information Criterion (AIC). This statistic penalizes for model complexity and permits comparison of non-nested models; however, it can only be used descriptively. These two measures of model fit were used to compare results from data simulations in which each dataset included 2,000 cases and each model was analyzed 1,000 times. Specifically, the measures of model fit were used to answer the following research questions:

* Given one two-level categorical covariate with known values and one continuous covariate with known values, what is the difference in the estimated log-likelihood between a) the negative binomial Poisson model vs. the Poisson model; b) the Hurdle model vs. the Poisson model; c) the negative binomial Hurdle model vs. the negative binomial Poisson model; d) the negative binomial Hurdle model vs. the Hurdle model; and e) the negative binomial ZIP model vs. the ZIP model?

* Given one two-level categorical covariate with known values and one continuous covariate with known values, what is the difference in the estimated AIC between all models?

CHAPTER 2
REVIEW OF THE LITERATURE

Zero-Inflated Count Data

Count Data

As the name implies, count data arise from counting. They are the "realization of a nonnegative integer-valued random variable" (Cameron & Trivedi, 1998, p. 1). As such, the response values take the form of discrete integers (Zorn, 1996). Although the lower boundary can feasibly be any integer, it is usually the case that its value is zero. Strictly speaking, there can be no negative numbers. Hence, the data are constrained by this lower bound of zero and no upper bound. Acknowledgment of concerns over zero-inflation, ignoring covariates, likely dates to Cohen (1954). Cameron and Trivedi (1989, pp.
10-11) identified many areas in which special models have been used to analyze count data, including "models of counts of doctor visits and other types of health care utilization; occupational injuries and illnesses; absenteeism in the workplace; recreational or shopping trips; automobile insurance rate making; labor mobility; entry and exits from industry; takeover activity in business; mortgage prepayments and loan defaults; bank failures; patent registration in connection with industrial research and development; and frequency of airline accidents ... as well as in many disciplines including demographic economics, in crime victimology, in marketing, political science and government, [and] sociology." Surprisingly, there was no mention of research in education. Examples of variables in educational research that yield count data include a student's number of days absent, number of test items scored correct or incorrect, and number of referrals for disciplinary action.

The lower bound constraint of zero presents the biggest obstacle to analyzing count data when assuming a normal distribution. It is common for count data to have a skewed distribution that is truncated at the lower bound:

    skew(Y) = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^3}{(N-1)\, s_Y^3}    (2-1)

Hence, the data are heteroscedastic, with variance increasing as the count increases. Therefore, standard models, such as ordinary least squares regression, are not appropriate, since they assume that the residuals are distributed normally with a mean of zero and a standard deviation of one (Slymen, Ayala, Arredondo, & Elder, 2006). Cameron and Trivedi (1998) clarify that the use of standard OLS regression "leads to significant deficiencies unless the mean of the counts is high, in which case normal approximation and related regression methods may be satisfactory" (p. 2).
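The skewness statistic in Equation 2-1 can be computed directly. The small count vector below is hypothetical, used only to show that a few large counts sitting above a mass of small counts drive the statistic well above zero.

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1, 2, 3, 9], dtype=float)  # hypothetical counts

# Skew per Equation 2-1: sum of cubed deviations from the mean,
# divided by (N - 1) times the cubed sample standard deviation.
n = y.size
s = y.std(ddof=1)  # sample standard deviation
skew = np.sum((y - y.mean()) ** 3) / ((n - 1) * s ** 3)

print(skew)
```

For these eight counts the statistic is roughly 1.64, a strong positive skew of the kind typical for count outcomes.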
An example of a count data variable is the number of household members under the age of 21 reported by respondents in the Adult Education for Work-Related Reasons (AEWR) survey administered by the National Center for Education Statistics in 2003 (Hagedorn, Montaquila, Vaden-Kiernan, Kim, & Chapman, 2004). The sample size was 12,725. This variable has a lower count boundary of zero and an upper count boundary of six. The count distribution is positively skewed at 1.971. The distribution mean of 0.54 is certainly not an accurate measure of central tendency; the median and mode are both zero (i.e., the lower bound itself), and the standard deviation of 0.999 permits negative values in the lower 68% confidence interval.

Zero-Inflation

It is not uncommon for the outcome variable in a count data distribution to be characterized by a preponderance of zeros. As Tooze, Grunwald, and Jones (2002, p. 341) explain:

Typically, [for count data] the outcome variable measures an amount that must be nonnegative and may in some cases be zero. The positive values are generally skewed, often extremely so ... Distributions of data of this type follow a common form: there is a spike of discrete probability mass at zero, followed by a bump or ramp describing positive values.

The occurrence is primarily in the case of interval/ratio count data and sometimes ordinal data (Böhning, Dietz, Schlattmann, Mendonça, & Kirchner, 1999). Regarding continuous data, Hall and Zhang (2004) explain that these distributions "have a null probability of yielding a zero ... there is little motivation for a model such as [zero-inflated] normal, because all observed zeros are unambiguous" (p. 162). If continuous zeros are inflated and those zeros are of concern, they can be analyzed separately from the nonzeros. The null probability of continuous zeros is evident in measures such as height and age.
The condition of excessive zeros is known as zero-inflation (Lachenbruch, 2002) or as a probability mass that clumps at zero (Tooze, Grunwald, & Jones, 2002). It has been recognized as an area of research since the mid-1960s¹ (Lachenbruch, 2002), when Weiler (1964) proposed a method for mixing discrete and continuous distributions. Min and Agresti (2005) formally define zero-inflation as "data for which a generalized linear model has lack of fit due to disproportionately many zeroes" (p. 1). There are simply "a greater number of zero counts than would be expected under the Poisson or some of its variations" (Zorn, 1996, p. 1).

¹ Alternatively, if a scale is bounded, it is reasonable to consider an inflated upper bound. In this case, scale reversal and subsequent appropriate analysis is justified (Lachenbruch, 2002).

The Sources of Zero-Inflation

The zeros can be classified as being either true zeros or sampling zeros. True zeros represent responses of zero that are truly null. Suppose an educational inventory item states, "How many college preparatory workshops have you attended?" Some of the respondents in the sample may have no intention of applying for college. Hence, their number of preparatory workshops attended may never be greater than zero. Sampling zeros, on the other hand, arise as a probability. There is a proportion of college-bound students who have not attended a workshop due to the possibility that the workshop was not (or is not yet) available. Alternatively, some college-bound students may feel prepared and have no reason to participate in a workshop. Hence, the mechanism underlying zero-inflation can arise from one or both of 1) a possibility that no other response is probable, or 2) the response falling within a random sample of potential count responses. Martin, Wintle, Rhodes, Kuhnert, Field, Low-Choy, Tyre, and Possingham (2005) term the sampling zeros 'false zeros' and include error as a source of zeros.
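The dual mechanism just described is easy to mimic with simulated data. The sketch below draws hypothetical zero-inflated counts: with probability pi an observation is a structural (true) zero; otherwise it is a Poisson draw, which can itself contribute a sampling zero. The parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, pi, lam = 10_000, 0.4, 3.0  # hypothetical: 40% structural zeros, Poisson mean 3

# Source 1: structural (true) zeros; source 2: Poisson counts,
# which occasionally produce sampling zeros of their own.
structural_zero = rng.random(n) < pi
counts = np.where(structural_zero, 0, rng.poisson(lam, n))

# A pure Poisson(3) yields P(Y = 0) = exp(-3), about .05; the mixture
# inflates this to pi + (1 - pi) * exp(-3), about .43.
prop_zero = (counts == 0).mean()
print(prop_zero)
```

The observed proportion of zeros lands near .43, far above what the Poisson alone would predict, which is precisely the lack of fit that ZIP and Hurdle models are designed to absorb.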
They state, "Zero inflation is often the result of a large number of 'true zero' observations caused by the real ... effect of interest ... However, the term [zero-inflation] can also be applied to data sets with 'false zero' observations because of sampling or observer errors in the course of data collection" (p. 1235). Often, the data contain both types of zeros. This is the result of a dual data-generating process (Cameron & Trivedi, 1998). For example, some adults in the AEWR sample may have had true-zero household members under the age of 21 because they are unable to bear children or do not desire to bear children. Alternatively, they may have random-zero household members under the age of 21 because these adults do have such children, but not as members of the household.

Impact of Zero-Inflation on Analyses

"Much of the interest in count data modeling appears to stem from the recognition that the use of continuous distributions to model integer outcomes might have unwelcome consequences including inconsistent parameter estimates" (Mullahy, 1986, p. 341). In the typical count data scenario, the zero left-bound implies heteroscedasticity (Zorn, 1996). An even greater problem with zero-inflated distributions, beyond the inadequacy of analyzing such a skewed and heteroscedastic distribution as if it were normal (Tooze, Grunwald, & Jones, 2002), is that they yield "surprisingly large inefficiencies and nonsensical results" (King, 1989, p. 126). Martin et al. (2005) and McCullagh and Nelder (1989) explain that zero-inflation is a special case of overdispersion, in which the variance is greater than it should be given a particular distributional shape and measure of central tendency. The impact is biased and inconsistent parameter estimates, inflated standard errors, and invalid inferences (Jang, 2005; Martin et al., 2005).
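That zero-inflation manifests as overdispersion can be checked by comparing the variance-to-mean ratio, which equals 1 for a Poisson variable. The parameters below are hypothetical, matching the illustrative mixture used above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, pi, lam = 10_000, 0.4, 3.0  # hypothetical mixture parameters

poisson_counts = rng.poisson(lam, n)
zip_counts = np.where(rng.random(n) < pi, 0, rng.poisson(lam, n))

# Dispersion ratio: variance divided by mean (1.0 under a Poisson model).
poisson_ratio = poisson_counts.var() / poisson_counts.mean()
zip_ratio = zip_counts.var() / zip_counts.mean()

print(poisson_ratio, zip_ratio)
```

The Poisson sample's ratio hovers near 1, while the zero-inflated sample's ratio is roughly 2, the excess variance that inflates standard errors when the zeros are ignored.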
Simple Solutions to Zero-Inflation

Deleting zeros

The simplest solution is to delete all cases having responses of zero on the variable of interest. A large proportion of the total responses would then be removed from the dataset, resulting in a loss of valuable information that impacts statistical conclusion validity (Tooze, Grunwald, & Jones, 2002). The sample size may also then be too small for analyses of the nonzero values.

Assuming normality

Another simple solution is to ignore the zero-inflation, assume asymptotic normality, and analyze the data using standard techniques such as ordinary least squares regression.

hhmᵢ = β₀ + β₁Sexᵢ + β₂Ageᵢ + εᵢ (2-2)

According to this model, the number of household members under the age of 21 for adult respondent i is predicted from the overall mean, a coefficient relating the respondent's sex to hhm, a coefficient relating the respondent's age to hhm, and error. The model assumes that the residuals for hhm are distributed normally with a mean of zero and a common variance, σ². In the equivalent matrix formulation, y = Xβ + ε, y is a vector of responses, X is a design matrix for the explanatory variable responses, β is a vector of regression coefficients relating y to X, and ε is a vector of residuals measuring the deviation between the observed values and those predicted from the fitted equation.

Transforming Zeros

Another simple solution is to transform the counts to coerce a more normal distribution (Slymen, Ayala, Arredondo, & Elder, 2006). Since count distributions often appear positively skewed, one reasonable transformation involves taking the natural logarithm of the responses. However, assuming the zeros have not been deleted, the transformation will not work, since the natural logarithm of zero is undefined (Zhou & Tu, 1999; King, 1989). Sometimes natural log transformations for zero are handled by adding a small value, such as .001, to the zeros.
However, this then leads to an inflation of the transformed adjusted value. If 70% of the scores are zero, the resulting transformed distribution will have a 70% abundance of the transformed value (Delucchi & Bostrom, 2004).² Further, this technique has been shown to yield biased parameter estimates that differ as a function of the adjustment quantity (King, 1989). Although the undefined log of zero has been handled, the original problems pervade. As Welsh, Cunningham, Donnelly, and Lindenmayer (1996) state, "It is clear for data with many zero values that such an approach will not be valid as the underlying distributional assumptions (linearity, homoscedasticity and Gaussianity) will [still] be violated" (p. 298). Finally, for any technique, transformations sometimes create a new problem while solving the old one; "a transform that produces constant variance may not produce normality ..." (Agresti, 1996, p. 73).

² This implies that, beyond the dual generating process for zeros, the problem can be generalized from inflated zeros to inflated lower boundaries for count data.

Generalized Linear Models

Bryk and Raudenbush (1996) state, "There are important cases ... for which the assumption of linearity and normality are not realistic, and no transformation can make them so" (p. 291). As previously demonstrated, count data are likely to be one such case. Instead of deleting cases or transforming the data, it is more reasonable to specify a different distribution. As explained by Hox (2002), although it is nice to be able to transform data, "modeling inherently nonlinear functions directly is sometimes preferable, because it may reflect some 'true' developmental process" (pp. 93-94). In order for a model to be 'inherently nonlinear' (Hox, 2002), there must be no transformation that makes it linear.³ These nonlinear models belong to the class of generalized linear models (GLM).
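Before turning to generalized linear models, the relocated spike described above can be verified numerically. This is a sketch assuming simulated low-mean Poisson counts and the .001 adjustment mentioned earlier; the simulation settings are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(0.5, size=10_000)  # low-mean counts: mostly zeros

# log(0) is undefined, so a small constant is added before transforming.
c = 0.001
transformed = np.log(counts + c)

# The zeros do not go away: every zero maps to the single value log(c),
# so the spike is relocated rather than removed.
prop_zero = (counts == 0).mean()
prop_spike = np.isclose(transformed, np.log(c)).mean()
```

The two proportions are identical: whatever fraction of the counts is zero, the same fraction of the transformed scores piles up at log(.001), roughly -6.9, leaving the distribution no closer to normal.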
The following explanation of generalized linear models is based on the seminal work of McCullagh and Nelder (1989), with additional clarification by Lawal (2003) and Agresti (1996). Lawal (2003) explains that generalized linear models extend the traditional linear models, permitting possibilities other than modeling the mean as a linear function of the covariates. All GLMs possess a random component, a systematic component, and a link function. As explained by Agresti (1996), the random component requires the specification of the distribution for the outcome variable. One could specify this distribution to be normal; hence, classical models such as ordinary least squares regression and analysis of variance are included within this broader class of generalized linear models. Other possible random components include the binomial distribution, negative binomial distribution, gamma distribution, and Poisson distribution. Specifying the random component depends on the expected population distribution of the outcome variable. Given both zero-inflation and truncated count data yielding an odd-shaped, skewed distribution, the random component plays an important part in obtaining valid results. In order to better understand the formulation of the three components, it is necessary to clarify the theoretical foundations of distributions. The probability density function for a normal distribution is

f(y; μ, σ) = (1/(σ√(2π))) exp(−(y − μ)²/(2σ²)), (2-3)

which, for a random variable Y ~ N(μ, σ²) standardized so that μ = 0 and σ = 1, reduces to the standard normal probability density function

f(y) = (1/√(2π)) exp(−y²/2), (2-4)

which, when integrated, yields the cumulative density function

Φ(y) = F(y) = ∫₋∞ʸ (1/√(2π)) exp(−u²/2) du. (2-5)

A convenient method for obtaining the parameters is to use the distribution's moment generating function (Rice, 1995).

³ Transforming covariates (e.g., including polynomial terms) may make a model appear graphically nonlinear while it remains linear in the parameters (Singer & Willett, 2003).
For the normal distribution, this function is

M_Y(t) = E[exp(tY)] = exp(μt + σ²t²/2). (2-6)

The logarithm of the moment generating function yields the cumulant generating function, which in turn yields the moments of the distribution. For the normal distribution, the first moment is the mean (μ), and the second central moment is the variance (σ²). Strictly speaking, a requirement for GLMs is that the outcome has a distribution within the exponential family of models (EFM) (McCullagh & Nelder, 1989). These distributions are defined primarily by a vector of natural parameters (θ) and a scale parameter (φ). The formulation is given by

f_Y(y; θ, φ) = exp{ (yθ − b(θ))/a(φ) + c(y, φ) }. (2-7)

At first glance, it seems odd to include the normal distribution in the EFM; however, first recall the probability density function

f(y; μ, σ) = (1/(σ√(2π))) exp(−(y − μ)²/(2σ²)). (2-8)

Algebraic manipulation reveals that the normal distribution is indeed of EFM form:

f_Y(y; θ, φ) = exp{ (yμ − μ²/2)/σ² − (y²/(2σ²) + ½ log(2πσ²)) }. (2-9)

Here, the natural (i.e., canonical) parameter is μ, and the scale parameter is σ². These parameters need to be estimated. McCullagh and Nelder (1989) explain the estimation as follows: "In the case of generalized linear models, estimation proceeds by defining a measure of goodness of fit between the observed data and the fitted values that minimizes the goodness-of-fit criterion. We shall be concerned primarily with estimates obtained by maximizing the likelihood or log likelihood of the parameters for the data observed" (pp. 23-24). This turns out to be the log of the EFM function:

ℓ(θ, φ; y) = log f_Y(y; θ, φ). (2-10)

The natural and scale parameters are estimated by derivations revealing the mean function

E(Y) = μ = b′(θ), (2-11)

and the variance function

var(Y) = b″(θ)a(φ). (2-12)

Note that the mean function depends on only one parameter. However, as McCullagh and Nelder (1989) explain, ...
the variance of Y is the product of two functions; one, b″(θ), depends on the [canonical] parameter (and hence on the mean) only and will be called the variance function [denoted V(μ)], while the other, a(φ), is independent of θ and depends only on φ ... The function a(φ) is commonly of the form a(φ) = φ/w (p. 29)

and is commonly called the dispersion parameter. For the normal distribution, the natural parameter is the mean (μ), the variance function V(μ) equals 1.0, and the dispersion parameter is σ². The systematic component is simply the model for the predictors, established as a linear combination and denoted η. The link function, g(·), brings together the random component and the systematic component, linking the function for the mean, μ, and the systematic component, η, as η = g(μ). In other words, it specifies how the population mean of the outcome variable with a particular distribution is related to the predictors in the model. If g(μ) redundantly equals μ, then the population mean itself is related to the predictors. This is termed the identity link and is exactly the function used to link the mean of the normal distribution to its covariates. A key advantage of GLMs is that they are not restricted to one particular link function. Many other links are available. For example, one could specify the log link as g(μ) = log(μ) or the logit link as g(μ) = log[μ/(1 − μ)]. However, each random component has one common 'canonical' link function that is best suited to it (McCullagh & Nelder, 1989). Alternatively, "Each potential probability distribution for the random component has one special function of the mean that is called its natural parameter" (Agresti, 1996, p. 73).
For example, a normal random component usually corresponds to an identity link, a Poisson-distributed random component usually corresponds to a log link, and a binomially distributed random component usually corresponds to a logit link. In sum, canonical link and natural link are two equivalent terms for the most suitable link connecting a particular distribution for the outcome variable with its linear systematic covariate function.

The Binomial Distribution and the Logit Link

Suppose we were interested in the differences between households with zero members under the age of 21 and households with one or more members under the age of 21. We could feasibly collapse all nonzero responses in the AEWR data into a value of one. Now, 71.58% of the values are zeros, and 28.42% of the values are ones. This distribution is obviously not normal. We have now introduced a lower bound (zero), an upper bound (one), and an inherently nonnormal distribution; hence, a different random component and link can be specified to accommodate these constraints. Variables that take on only one of two values are known as binary, or Bernoulli, variables, and the distribution of multiple independent trials of such variables is termed binomial. Bernoulli responses are modeled in terms of the probability (Pr) that the outcome variable (Y) is equal to either zero or one. The random component over multiple independent trials is thus a binomial distribution with parameters n for the number of trials and π for the probability that Y = 1:

Y ~ B(n, π). (2-13)

The binomial distribution assumes that the responses are dichotomous, mutually exclusive, independent, and randomly selected (Agresti, 1996). Since the responses are discrete, the probability density function is termed the probability mass function and is defined as

f(k; n, π) = [n!/(k!(n − k)!)] πᵏ(1 − π)ⁿ⁻ᵏ. (2-14)

This function gives the probability of k ones (i.e., heads, hits, successes) in n trials.
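The probability mass function above can be checked directly. The sketch below compares the factorial formula against scipy's implementation; the fair-coin example (n = 10, k = 3) is an arbitrary illustration.

```python
from math import comb
from scipy.stats import binom

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes (ones) in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of 3 heads in 10 fair-coin flips, by formula and by library.
manual = binomial_pmf(3, 10, 0.5)
library = binom.pmf(3, 10, 0.5)
# Both give 120/1024, about .117.
```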
Rice (1995) clarifies, "Any particular sequence of k successes occurs with probability πᵏ(1 − π)ⁿ⁻ᵏ from the multiplication principle [i.e., independent probabilities of realizations being multiplicative]. The total number of such sequences is n!/(k!(n − k)!) [permutations], since there are n!/(k!(n − k)!) ways to assign k successes to n trials" (p. 36). The moment generating function is {1 − π + π exp(t)}ⁿ, and the cumulant generating function is n log{1 − π + π exp(t)}. The log-likelihood function is virtually the same as the probability mass function; however, now we are determining the value of π as a function of n and k (rather than determining k as a function of π and n) while taking the log:

ℓ(π; n, k) = log{ [n!/(k!(n − k)!)] πᵏ(1 − π)ⁿ⁻ᵏ }. (2-15)

The estimates are obtained through derivations of the likelihood function, as was previously discussed for the normal distribution. Just as the normal distribution population mean, μ, has maximum likelihood estimate X̄, the binomial distribution population probability, π, has maximum likelihood estimate k/n: the proportion of ones, hits, or successes. This greatly reduces calculations when a quick estimate is needed and the random component is not linked to any predictors. The binomial distribution eventually converges to a normal distribution. However, the speed of this convergence is primarily a function of the skew,

skew(Y) = (1 − 2π)/√(nπ(1 − π)), (2-16)

with π = .50 yielding the fastest convergence (McCullagh & Nelder, 1989). The link function should account for the binomial outcome variable. If linear predictors are used to predict a probability directly, then the predicted values fall in an infinite range rather than being constrained between zero and one. What is needed is a link function that maps a bounded zero-one probability onto this infinite range. The canonical link for the binomial distribution is the logit link:
η = g(π) = log(π/(1 − π)). (2-17)

A logit is the natural log of an odds, log[π/(1 − π)]. An odds is equal to a probability divided by one minus that probability. Hence, if the probability is .5, then the odds are .5/(1 − .5) = 1, meaning that the odds of a one are the same as the odds of a zero [i.e., the odds of success and failure are identical]. If the probability is .75, then the odds are .75/(1 − .75) = 3, meaning that a response of one is three times more likely than a response of zero. The reciprocal odds of 1/3 means that a response of one is three times less likely than a response of zero, which is equivalent to stating that a response of zero is three times more likely than a response of one. When the logit link connects the binomial random component and the systematic component, the generalized linear model is

logit(π) = log(π/(1 − π)) = β₀ + β₁X₁ + ... + βₖXₖ. (2-18)

A probability of .50, which is an odds of one, corresponds to a logit of zero. Odds favoring a response of one yield a positive logit, and odds favoring a response of zero yield a negative logit. Hence, the mapping is satisfied, since the logit can be any real number (Agresti, 1996). The regression parameters are slope-like in that they determine the relative rate of change of the curve. The exact rate of change depends on the probability, with the best approximation at a given probability being βπ(1 − π) and the steepest rate of change at π = 0.50, which occurs where X = −α/β (Agresti & Finlay, 1997). Since natural logs can be reversed through exponentiation, and since odds can be converted to probabilities by dividing the odds by the sum of the odds and one, the fitted equation can be used to predict probabilities via

π = exp(α + β₁X₁ + ... + βₖXₖ) / [1 + exp(α + β₁X₁ + ... + βₖXₖ)]. (2-19)

It is more common to interpret logistic regression coefficients by only exponentiating them.
Then, the coefficient has a slope, rather than slope-like, interpretation; however, the relationship is multiplicative rather than additive. Specifically, the expected odds are multiplied by exp(β) for each one-unit increase in X.

Evaluating Model Fit

The next step in interpreting generalized linear model results is to determine how well the estimated model fits the observed data, where fit is the degree of discrepancy between the observed and predicted values. McCullagh and Nelder (1989) explain, "In general the μ's will not equal the y's exactly, and the question then arises of how discrepant they are, because while a small discrepancy might be tolerable a large discrepancy is not" (p. 33). The goodness of fit improves as the observed and predicted values approach equality. For example, if a scatterplot reveals that all points fall on a straight line, then the predictive power of the regression equation, and hence the fit, would be perfect. The comparison is usually performed through some statistical comparison of the observed outcome values and the predicted (i.e., fitted) outcome values. Rather than compare and summarize the actual observed and predicted values, it is common to gain summary information by inspecting the log-likelihood value produced by the estimation procedure. Since the model parameters are estimated from the data, perfect fit (i.e., observed − fitted = 0) is rare.⁴ Hence, the goodness of the fit is measured to determine whether the difference is small enough to be tolerated. There are many measures of model fit. Typically, the model is compared either to a null model, in which the only parameter is the mean, or to a full model, in which the number of parameters equals the sample size. "It is well-known that minus twice the LR statistic has a limiting chi-square distribution under the null hypothesis" (Vuong, 1989, p. 308).
McCullagh and Nelder (1989) equivalently state, "The discrepancy of a fit is proportional to twice the difference between the maximum log likelihood achievable and that achieved by the model under investigation" (p. 33). This deviance statistic (G²) is considered to be asymptotically distributed chi-square, with degrees of freedom equal to the number of parameters subtracted from the sample size.⁵ A significant p-value indicates that the deviance is greater than would be expected under the null hypothesis that the model with fewer parameters is adequate; hence, the model with an additional parameter or parameters is considered a significant improvement. Another measure of model fit is Pearson's X²; however, unlike G², it is not additive for nested models. Yet another measure of model fit is Akaike's Information Criterion (AIC), which penalizes the deviance for the number of parameters in the model.⁶ The notion is that increasing the number of parameters will increase the log-likelihood regardless of the model and the data. Hence, the AIC penalizes the log-likelihood with regard to the number of parameters. There are two variations that provide further penalties: the Bayesian Information Criterion (BIC) penalizes for sample size, and the Consistent Akaike Information Criterion (CAIC) penalizes even further by considering sample size and adding a small adjustment (Cameron & Trivedi, 1998).

⁴ Perfect fit is always obtained if the number of parameters and the sample size are identical (McCullagh & Nelder, 1989).

⁵ The relationship is not always exact, since sometimes the deviance is scaled and/or the likelihood is more difficult to estimate than in the simple logistic regression scenario presented here (McCullagh & Nelder, 1989).

⁶ Other measures, such as the Bayesian Information Criterion (BIC), penalize for both the number of parameters and the sample size.
These indices can be compared across competing models; however, this must be done descriptively, not inferentially. The disadvantage is that the AIC cannot be compared to a statistical distribution to obtain probabilities for significance testing; the advantage is that, as a descriptive statistic, it can be used to compare non-nested models. The explanation thus far points to the fact that models can be compared to null, full, or other models. Statistical comparison is valid to the extent that one model is nested within the other, which is to say that both models share the same parameters and one model has at least one parameter not included in the other. Alternatively, Clarke (2001) defines the models as follows: "Two models are nested if one model can be reduced to the other model by imposing a set of linear restrictions on the parameter vector ... Two models are nonnested, either partially or strictly, if one model cannot be reduced to the other model by imposing a set of linear restrictions on the parameter vector" (p. 727). The deviance for comparing two models is calculated as the difference in log-likelihood between the two models, multiplied by 2. This quantity is asymptotically distributed chi-square with degrees of freedom equal to the difference in the number of parameters between the two models (Agresti, 1996). A significant p-value indicates that the deviance is greater than would be expected under a null hypothesis of model equivalence; hence, the more complex model with an additional parameter or parameters is considered a significant improvement over the nested model. The difference in log-likelihood statistics (i.e., the deviance) cannot be used to statistically test non-nested models, because neither model can be considered the simpler one, with the more complex model's additional variables leading to a probabilistically higher log-likelihood.
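The nested-model deviance test just described can be sketched as follows. The two log-likelihood values are hypothetical, standing in for a fitted reduced model and a model with one added predictor.

```python
from scipy.stats import chi2

def deviance_test(ll_reduced: float, ll_full: float, df_diff: int) -> float:
    """Chi-square p-value for comparing two *nested* models:
    G2 = 2 * (difference in log-likelihoods), df = difference in parameters."""
    g2 = 2.0 * (ll_full - ll_reduced)
    return chi2.sf(g2, df_diff)  # upper-tail (survival) probability

# Hypothetical fits: one added predictor raises the log-likelihood by 3.5,
# so G2 = 7.0 on 1 df and the p-value falls below .05.
p_value = deviance_test(ll_reduced=-1450.2, ll_full=-1446.7, df_diff=1)
```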
A t-test (or F-test) is a sensible alternative that eliminates concern for nesting. However, Monte Carlo simulations have demonstrated that, for model comparison, the F-test lacks sufficient power and can be affected by multicollinearity (Clarke, 2001). The motivation for the AIC statistic is that, all else being equal, "the greater the number of coefficients, the greater the log-likelihoods" (Clarke, 2001, p. 731). Hence, model fit becomes impacted by the number of variables in the model along with the effects of those variables, and the AIC therefore penalizes for the number of parameters. The formula is

AIC = −2(LL) + 2K, (2-20)

where LL is the log-likelihood estimate and K is the number of parameters in the model, including the intercept. The log-likelihood is thus adjusted to accommodate simplicity and parsimony (Mazerolle, 2004). In actuality, one could compare log-likelihoods between non-nested models. However, beyond the lack of a parameter penalty, this technique might invite the statistical hypothesis testing associated with log-likelihood statistics (i.e., a test of the deviance approximated by X²). The AIC, on the other hand, should not be used in a formal statistical hypothesis test, regardless of whether the models are nested or non-nested (Clarke, 2001). Generally, the researcher examines several AIC indices and decides which model fits best based on a lower-is-better criterion. Mazerolle (2004) states, "The AIC is not a hypothesis test, does not have a p-value, and does not use notions of significance. Instead, the AIC focuses on the strength of evidence ... and gives a measure of uncertainty for each model" (p. 181). Logistic modeling necessitated treating all nonzero numbers of children as a value of one. Depending on the research question, this may be a loss of valuable information (Slymen, Ayala, Arredondo, & Elder, 2006).
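The AIC computation above is simple to apply. This sketch uses hypothetical log-likelihoods for two competing (possibly non-nested) models; the values are invented for illustration.

```python
def aic(log_likelihood: float, k: int) -> float:
    """AIC = -2(LL) + 2K, where K counts parameters including the intercept."""
    return -2.0 * log_likelihood + 2.0 * k

# Hypothetical models: model B spends one extra parameter but fits better.
aic_a = aic(-1446.7, k=3)
aic_b = aic(-1401.3, k=4)

# Lower is better; the comparison is descriptive, with no p-value involved.
preferred = "B" if aic_b < aic_a else "A"
```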
Although sometimes it is necessary to model zero-inflated binomial data (Hall, 2000), specifying a binary distribution and logit link is not an ideal method for handling zero-inflated count data. The generalized linear model that specifies a binomial distribution and a logit link becomes more relevant when discussing the technique of splitting zero-inflated data into a model for the probability of zero, separate from or combined with a model for the counts.

The Poisson Distribution and the Log Link

McCullagh and Nelder (1989), Lawal (2003), and Rice (1995) are the key references for the technical underpinnings of this model and distribution. The generalized linear 'Poisson' model is considered the benchmark model for count data (Cameron & Trivedi, 1998).⁷ This is primarily attributed to the fact that the Poisson distribution has a nonnegative mean (Agresti, 1996). If Y is a nonnegative random count variable, the Poisson probability mass function is given by

f(k; λ) = Pr(Y = k) = e⁻ᵝλᵏ/k!, k = 0, 1, 2, ..., (2-21)

where λ is standard Poisson notation for the mean (μ) and k is the range of counts. Derivations by Rice (1995) show that the expected value of a random Poisson variable is λ; hence, "the parameter λ of the Poisson distribution can thus be interpreted as the average count" (p. 113). Alternatively, lambda (λ) represents "the unobserved expected rate or occurrence of events ..." (Zorn, 1996, p. 1). The moment generating function is

M_Y(t) = E(eᵗʸ) = exp(λ(eᵗ − 1)). (2-22)

⁷ It is also commonly used to model event count data, which is "data composed of counts of the number of events occurring within a specific observation period ... [taking the] form of nonnegative integers" (Zorn, 1996, p. 1).
The resulting cumulant generating function is λ(exp(t) − 1), which, with a variance function of λ and a dispersion parameter equal to one, leads to the mean and variance both being equal to λ, and the skew equal to 1/√λ. This equivalence of the mean and variance, defined by a single parameter (Cameron & Trivedi, 1998; Agresti, 1996), is the result of a function that yields residuals that sum to zero (Jang, 2005); hence, the systematic portion of a Poisson GLM has no error term. The Poisson distribution is a generalization of a sequence of binomial distributions. Rodriguez (2006) explained that "the Poisson distribution can be derived as a limiting form of the binomial distribution if you consider the distribution of the number of successes in a very large number of Bernoulli trials with a small probability of success in each trial. Specifically, if Y ~ B(n, π), then the distribution of Y as n → ∞ and π → 0 with μ = nπ remaining fixed approaches a Poisson distribution with mean μ. Thus, the Poisson distribution provides an approximation to the binomial for the analysis of rare events, where π is small and n is large" (p. 3). Rice (1995) clarified, "The Poisson distribution can be derived as the limit of a binomial distribution as the number of trials, n, approaches infinity and the probability of success on each trial, p, approaches zero in such a way that np = λ" (p. 43). Scheaffer (1995) and Rice (1995) have derived the generalization. Further, just as a binomial distribution converges to a normal distribution given sufficient trials, the Poisson distribution converges to a normal distribution given a large mean. The log-likelihood function for the Poisson distribution is

ℓ(λ; y) = Σᵢ [yᵢ log λᵢ − λᵢ], (2-23)

with the maximum likelihood estimate of λ simply being the sample mean (Rice, 1995), and with the related deviance function being

Deviance(λ; y) = 2 Σᵢ { yᵢ log(yᵢ/λᵢ) − (yᵢ − λᵢ) }. (2-24)
McCullagh and Nelder (1989) state that the second term is often ignored: "Provided that the fitted model includes a constant term, or intercept, the sum over the units of the second term is identically zero, justifying its omission" (p. 34). The systematic portion of the generalized linear model takes the form

λᵢ = exp(xᵢ′β) = exp(x₁ᵢβ₁) exp(x₂ᵢβ₂) ... exp(xₖᵢβₖ), (2-25)

which is often equivalently expressed as

log(λᵢ) = xᵢ′β, (2-26)

with β derived by solving the estimating equation

Σᵢ (yᵢ − exp(xᵢ′β)) xᵢ = 0 (2-27)

using iterative computations such as Newton-Raphson. The canonical link for a generalized linear model with a Poisson random component is the log link (Stokes, Davis, & Koch, 1991):

η = log(λ), Y ~ Poisson(λ). (2-28)

The Poisson distribution is not limited to count variates. Cameron and Trivedi (1998) explain that, although counts are usually in the purview of directly observable cardinal numbers, they may also arise through a latent process. In other words, ordinal rankings such as school course grades may be discretized as pseudocounts and assumed to have a Poisson distribution. Hence, the results of an analysis based on a Poisson distribution and the results using an ordinal analytic technique are often comparable. As is almost always the case, it is common for variables associated with the count variable to be omitted from the model (i.e., misspecification). However, the Poisson model has an interesting feature in that it assumes that there are no variables excluded from the model that are related to the count variable. In other words, there is no stochastic variation (i.e., no error term) (Cameron & Trivedi, 1998). Modifications must be made when one wishes to use a Poisson model with stochastic variation.

Iterative Estimation

Agresti (1996) clarifies the iterative estimation procedure.
"The Newton-Raphson algorithm approximates the log-likelihood function in a neighborhood of the initial guess by a simpler polynomial function that has the shape of a concave (mound-shaped) parabola. It has the same slope and curvature at the location of the maximum of this approximating polynomial. That location comprises the second guess for the ML estimates. One then approximates the log-likelihood function in a neighborhood of the second guess by another concave parabolic function, and the third guess is the location of its maximum. The successive approximations converge rapidly to the ML estimates, often within a few cycles" (p. 94). The most common methods for estimating standard errors include Hessian maximum likelihood (MLH) (i.e., second-partial-derivative-based) standard errors and maximum likelihood outer product (MLOP) (i.e., summed outer product of first derivatives) estimation.

Interpretation of Coefficients

Standard ordinary least squares regression lends an interpretation of β as the predicted additive change in the response variable per one-unit change in the predictor variable. However, as was the case with the binomial distribution, the interpretation differs when considering exponential distributions. For the Poisson distribution, "a one-unit increase in X has a multiplicative effect of exp(β) on the mean: The mean of Y at x + 1 equals the mean of Y at x multiplied by exp(β)" (Agresti, 1996, p. 81).⁸ Due to the inherent difficulty in interpretation, it is common to express the coefficient in one of three alternative ways. First, the direction of the sign of β indicates a positive or negative 'effect' of the predictor on the count variable. Second, the fitted value can be calculated at the mean.⁹ Third, some interpret the coefficient in terms of percent change; hence, if exp(β) = 1.64, then as X increases to X + 1, the predicted mean increases by 64% (Agresti, 1996).
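The multiplicative interpretation can be made concrete. The sketch below uses hypothetical coefficients from a fitted Poisson regression; b0 and b1 are invented for illustration and do not come from the AEWR data.

```python
import numpy as np

# Hypothetical fitted Poisson model under the log link: log(mu) = b0 + b1*x
b0, b1 = 0.2, 0.47

def fitted_mean(x: float) -> float:
    """Predicted count mean at x, obtained by inverting the log link."""
    return float(np.exp(b0 + b1 * x))

# A one-unit increase in x multiplies the mean rather than adding to it:
ratio = fitted_mean(3.0) / fitted_mean(2.0)
# ratio equals exp(b1), about 1.60, at *every* x: a 60% increase per unit.
```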
Hypothesis testing

After conducting the analysis and estimating parameters, hypotheses can be tested in several ways, as explained by Agresti (1996). One could test the hypothesis that β = 0 using the traditional Wald test via z = b/se_b. Some programs provide Wald test results that are actually z²; this is the Wald X² statistic with one degree of freedom, appropriate only for a two-tailed test. A second method is the likelihood ratio test described earlier. A third method, the score test, is "based on the behavior of the log-likelihood function at the null value for β = 0" (Agresti, 1996, p. 94), yielding a chi-square distributed statistic with one degree of freedom.

Overdispersion

In practice, the assumption of an equal mean and variance is the exception rather than the norm (McCullagh & Nelder, 1989). It is often the case that the sample variance is greater than or less than the observed sample mean, with the two seldom being statistically equivalent (Cameron & Trivedi, 1998), especially for zero-inflated data (Welsh, Cunningham, Donnelly, and Lindenmayer, 1996). This condition is known as overdispersion¹⁰ (or underdispersion) and is a violation of a major tenet of the Poisson distribution: that the conditional mean and conditional variance of the dependent variable are equal (i.e., equidispersion, nonstochasticity) (Jang, 2005; Zorn, 1996).¹¹ This assumption of equidispersion is the analog of the ordinary least squares regression assumption of homoscedasticity.

⁸ This is similar to the interpretation for the binomial distribution with logit link; however, now the multiplicative effect is directly on μ rather than on the odds.

⁹ This can be particularly troublesome, since the fitted value holds only at the mean. It provides no valid inference for values greater or less than the mean, since the function is a curve whose steepness can vary drastically across values of the predictors.
The overdispersion has been explained as heterogeneity that "has not been accounted for [that is] unobserved (i.e., the population consists of several subpopulations, in this case of the Poisson type, but the subpopulation membership is not observed in the sample)" (Böhning, Dietz, Schlattmann, Mendonça, & Kirchner, 1999, p. 195). The impact of violation is one of incorrect conclusions due to inaccurate t-statistics and standard errors (Cameron & Trivedi, 1998; Agresti, 1996). "The estimates of the coefficients can still be consistent using Poisson regression, but the standard errors can be biased and they will be too small" (Jewell & Hubbard, 2006, p. 14). Alternatively, Slymen, Ayala, Arredondo, and Elder (2006) state that "confidence intervals for regression estimates may be too narrow and tests of association may yield p-values that are too small" (p. 2). The underlying mechanism for overdispersion is explained as unobserved heterogeneity in responses. It is apparent that some modification to the variance to accommodate overdispersion is ideal. Typically, maximum likelihood procedures are used to estimate parameters in the model. "The term pseudo- (or quasi-) maximum likelihood estimation is used to describe the situation in which the assumption of correct specification of the density is relaxed. Here the first moment [i.e., the mean] of the specified linear exponential family density is assumed to be correctly specified, while the second [i.e., the variance] and other moments are permitted to be incorrectly specified" (Cameron & Trivedi, 1998, p. 19).

10 Overdispersion is sometimes referred to as extra-Poisson variation (Böhning, Dietz, Schlattmann, Mendonça, & Kirchner, 1999).

11 The under- or overdispersion may disappear when predictors are added to the model; however, this is likely not the case if the variance is more than twice the mean (Cameron & Trivedi, 1998).
Hence, the Poisson distribution as a baseline (Ridout, Demetrio, & Hinde, 1998) can be modified to accommodate overdispersion and underdispersion (Cameron & Trivedi, 1998). Rice (1995) states that "gamma densities provide a fairly flexible class for modeling nonnegative random variables" (p. 52). One way to accommodate overdispersion is to consider the unobserved heterogeneity as a gamma-distributed disturbance added to the Poisson-distributed count data (Jang, 2005). In other words, an individual score may be distributed Poisson with a mean of λ, but this mean is "regarded as a random variable which we may suppose in the population to have a gamma distribution with mean μ and index μ/φ" (McCullagh & Nelder, 1989, p. 199). This mixture leads to the negative binomial distribution. Given the gamma function, Γ, and count y, the negative binomial probability mass function is

Pr(Y = y; μ, φ) = [Γ(y + φ) / (Γ(φ) y!)] · (φ / (φ + μ))^φ · (μ / (φ + μ))^y (2-29)

Cameron and Trivedi (1998) provide a formulation where the negative binomial extends from the Poisson rather than being explained as a mixture distribution. Given a set of predictor variables, we can define the variance function

ω_i = μ_i + α μ_i^p (2-30)

where α is a scalar parameter to be specified or estimated and p is a prespecified power term. If the scalar parameter is set to zero, then the resulting variance is equal to the mean and the Poisson distribution holds. Hence, the Poisson model is nested within the negative binomial model. The standard formulation of the negative binomial, sometimes called NB2 (Cameron & Trivedi, 1998), leads to a variance that is quadratic by setting p to 2:

ω_i = μ_i + α μ_i² (2-31)

This is the formulation seen in most textbooks (Rodriguez, 2006; Scheaffer, 1996). Its mean is the same as that of the Poisson distribution; however, its variance is derived from the gamma distribution (Cameron & Trivedi, 1998).
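The NB2 mean-variance relationship and the (μ, φ) probability mass function can be sketched as follows; the parameter values below are arbitrary illustrations, and the pmf is evaluated on the log scale for numerical stability:

```python
import math

def nb2_variance(mu, alpha):
    """NB2 variance function: omega = mu + alpha * mu**2.
    Setting alpha = 0 recovers the Poisson (variance = mean)."""
    return mu + alpha * mu ** 2

def nb2_pmf(y, mu, phi):
    """Negative binomial pmf in the (mu, phi) parameterization:
    Pr(Y=y) = Gamma(y+phi)/(Gamma(phi) y!) * (phi/(phi+mu))**phi
              * (mu/(phi+mu))**y."""
    log_p = (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
             + phi * math.log(phi / (phi + mu))
             + y * math.log(mu / (phi + mu)))
    return math.exp(log_p)

# With alpha = 0 the variance collapses to the mean (Poisson nesting)
assert nb2_variance(2.0, 0.0) == 2.0

# The pmf sums to (approximately) one over a long truncation
total = sum(nb2_pmf(y, mu=2.0, phi=1.5) for y in range(200))
```

With φ = 1 the pmf reduces to a geometric-type distribution, which is one way to see the flexibility of the family.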
Just as the binomial distribution converges to a Poisson distribution, the negative binomial distribution converges to a Poisson distribution (Jewell & Hubbard, 2006).

Poisson and Negative Binomial Models with Zero-Inflation

As elaborated upon previously, the zero-inflation problem is twofold. First, the proportion of zeros is higher than expected given the specified population distribution shape, resulting in an excess zeros problem. This can be descriptively determined by calculating the expected number of zeros as

E[fq(0)] = n · exp(−λ) (2-32)

For example, Zorn's example had a frequency of 4,052 with λ = 0.11. The expected frequency of zeros would be

E[fq(0)] = 4052 · exp(−0.109) ≈ 3,634 (2-33)

which is less than the 3,882 zeros observed in the data. The observed count turns out to be (3882/3634) · 100 ≈ 107% of the expectation (Zorn, 1996). Second, the zeros can be a mixture of structural (i.e., true) zeros (Ridout, Demetrio, & Hinde, 1998) and sampled zeros, reflecting the multiple-sources-of-zeros problem. Shankar, Milton, and Mannering (1997) state that if "a two-state process is modeled as a single process … applying traditional Poisson and NB distributions, the estimated models will be inherently biased because there will be an overrepresentation of zero observations in the data, many of which do not follow the assumed distribution of [the] frequencies" (p. 830). Shankar, Milton, and Mannering (1997) also note that the negative binomial model "can spuriously indicate overdispersion when the underlying process actually consists of a zero-altered splitting mechanism" (pp. 835-836). In sum, the sources of zeros arise from a dual generating process (i.e., structural and sampled), leading to two sources of unequal mean/variance dispersion (i.e., that due to unobserved heterogeneity of responses and that due to zero-inflation). Most complex methods for analyzing zero-inflated count data model a mixture of two different distributions.
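The excess-zeros diagnostic just described is a one-line calculation. The sketch below reproduces Zorn's figures using the values quoted above:

```python
import math

def expected_zeros(n, lam):
    """Expected count of zeros under a Poisson model: n * exp(-lambda)."""
    return n * math.exp(-lam)

# Zorn's example: n = 4052 observations, estimated lambda = 0.109
e_zeros = expected_zeros(4052, 0.109)            # about 3634
observed_zeros = 3882
pct_of_expected = observed_zeros / e_zeros * 100.0  # about 107%
```

When this ratio is materially above 100%, the Poisson baseline is under-predicting zeros and a two-part model becomes a candidate.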
The justification for splitting the distribution into two pieces is well reasoned by Delucchi and Bostrom (2004): "If it is deemed more reasonable to consider the zeros as indicators of cases without a problem, a more appropriate approach is to ask two questions: is there a difference in the proportion of subjects without the problem [i.e., structural true zeros], and, for those who have a problem [sampled false zeros], is there a difference in severity" (p. 1164). Zorn (1996) refers to 'dual regime' models "wherein an observation experiences a first stage in which there is some probability that its count will move from a 'zero-only' state to one in which it may be something other than zero" (p. 2). Typically, the dual regime is composed of a transition stage based on a binomial distribution and an events stage based on some type of Poisson distribution. There are many ways to model two-part distributions. For example, Mullahy (1986) and King (1989) proposed a Hurdle model in which the zeros are analyzed separately from the nonzeros. Lambert (1992) proposed a zero-inflated Poisson (ZIP) model in which different proportions of zeros are analyzed separately and along with the nonzeros. Another early formulation (Heilbron, 1989) was the zero-altered Poisson (ZAP) model, in which "arbitrary zeros are introduced by mixing point mass at 0 with a positive Poisson that assigns no mass to 0 rather than a standard Poisson" (Lambert, 1992, p. 1). Mullahy (1986) presented a variation of the Hurdle model based on a geometric distribution12 for use when specifying a Poisson distribution is not reasonable. Another possibility is to specify a log-gamma distribution for the event stage (Moulton, Curriero, & Barruso, 2002). Lambert (1992) presented a variation of the ZIP model known as ZIP(τ), which introduces a multiplicative constant to the event-stage covariance matrix in order to account for the relationship between the two models.
Gupta, Gupta, and Tripathi (1996) derived an adjusted generalized Poisson regression model for handling both zero-inflation and zero-deflation; however, accuracy was suggested to be contingent on the amount of inflation or deflation.13 It is also possible to formulate the model with different link functions. Lambert (1992) mentions the possibility of using the log-log link, the complementary log-log link (Ridout, Demetrio, & Hinde, 1998), and the additive log-log link, while Lachenbruch (2002) mentions the log-normal and log-gamma distributions. Hall (2000) formulated a two-part model for zero-inflated binomial data. Gurmu (1997) describes a semiparametric approach that avoids some distributional assumptions (Ridout, Demetrio, & Hinde, 1998). There is some research on the extension of two-part models to accommodate random effects (Min & Agresti, 2005; Hall & Zhang, 2004; Hall, 2004; Hall, 2002; Olsen, 1999). Hall's (2000) model for zero-inflated binomial data permits both fixed and random effects. Dobbie and Welsh (2001) permit correlated count data. Finally, Crepon and Duguet (1997) consider cases where the variables are latent and correlated. The evolution of the research to date has led to an emphasis on the standard Hurdle and ZIP models (along with their negative binomial extensions), with a binary distribution for the transition stage, a Poisson distribution for the events stage, and fixed covariates. For both models, estimates are obtained from maximum likelihood procedures, although there has been some research on the use of generalized estimating equations (GEE) (Hall & Zhang, 2004; Dobbie & Welsh, 2001).

12 This distribution is an extension of the binomial distribution where the sequence is infinite. It is typically used in cases where the researcher is concerned with probability up to and including the first success (Rice, 1995).

13 Min (2003) stated that the Hurdle model also has this feature.
The models are primarily distinguished by whether zeros are permitted in the event stage. In other words, their differences are a reflection of the researcher's notions about the potentially multiple sources of zeros in the data and their relationship to excess zeros. They also differ in terms of the transition-stage cumulative probability function. To be clarified in the sections that follow, Zorn (1996) summarizes the differences as follows: "… the hurdle model has an asymmetric hurdle probability while in the ZIP specification p_i is symmetrical. Also, the hurdle model does not permit zero values to occur once the 'hurdle' has been crossed, while in the ZIP model zeros may occur in either stage" (p. 4). Choosing between the models is a matter of validity; hence, the choice rests on substantive grounds as well as statistical considerations. As Martin, Wintle, Rhodes, Kuhnert, Field, Low-Choy, Tyre, and Possingham (2005) note, "it is imperative that the source of zero observations be considered and modeled accordingly, or we risk making incorrect inferences …" (pp. 1243-1244).

The Hurdle model

The Hurdle model was developed separately by Mullahy (1986) in economics and King (1989) in political science, although the term itself was most likely coined by Cragg (1971).14

14 Cragg (1971) proposed the basic two-step process in which the probability of occurrence is modeled separately from frequencies of occurrence. Welsh, Cunningham, Donnelly, and Lindenmayer (1996) refer to it as a 'conditional Poisson model'.

"The idea underlying the hurdle formulations is that a binomial probability model governs the binary outcome of whether a count variate has a zero or a positive realization [i.e., a transition stage].
If the realization is positive the 'hurdle' is crossed, and the conditional distribution of the positives is governed by a truncated-at-zero15 count data model [i.e., an events stage]" (Mullahy, 1986, p. 345), such as a truncated Poisson or truncated negative binomial distribution (Min & Agresti, 2005). In other words, one distribution addresses the zeros while another distribution addresses the positive nonzero counts. For example, with school dropout data, there would be a model for schools with no dropouts and a model for schools with at least one dropout. It is a "finite mixture generated by combining the zeros generated by one density with the zeros and positives generated by a second zero-truncated density …" (Cameron & Trivedi, 1998, p. 124). Log-likelihood values are estimated separately for each density. A key feature of the transition model is asymmetry, in that the probability of crossing the hurdle increases more quickly as the covariates increase than it decreases as the covariates decrease. The function is then "asymmetrical" (King, 1989), leading to validity concerns supporting or refuting the substantive theory underlying the model. The two-part distribution of the dependent variable is given first by the transition-stage probability mass function g1:

P(Y_i = 0) = g1(0)16 (2-34)

modeling whether the response crosses the hurdle of zero. Assuming a Poisson distribution and log link, Zorn (1996) expands the cumulative distribution function to include covariates as

p_i = 1 − exp[−exp(x_i β)] (2-35)

The basic model for the event stage is then the probability of a nonzero realization multiplied by the probability for the counts:

P(Y_i = j) = (1 − g1(0)) · g2(j) / (1 − g2(0)), j = 1, 2, … (2-36)

Greene (1994) notates the models with a binomial distribution and logit link for the transition stage and a Poisson distribution with a log link for the event stage as18

Transition: Pr(y_i = 0) = p_i (2-37)

Event: Pr(y_i = k) = (1 − p_i) · e^{−λ_i} λ_i^k / [(1 − e^{−λ_i}) k!], k = 1, 2, … (2-38)17

Here, p_i is the probability of a count of zero while λ_i is the truncated Poisson mean for the counts greater than zero. The generalized linear models as functions of covariates are then

Transition stage: log(p_i / (1 − p_i)) = x_i1 β1 (2-39)

Event stage: log(λ_i) = x_i2 β2 (2-40)

The two vectors of parameters can be estimated jointly, with log-likelihood components

Λ1 = Σ_{y_i=0} ln{exp[−exp(x_i β1)]} + Σ_{y_i>0} ln{1 − exp[−exp(x_i β1)]} (2-41)

Λ2 = Σ_{y_i>0} ln[ exp(x_i β2)^{y_i} / ({exp[exp(x_i β2)] − 1} y_i!) ] (2-42)

An alternative notation provided by Min (2003) is

ℓ1(β1) = Σ_{y_i=0} log π(y_i = 0; β1, x_i1) + Σ_{y_i>0} log[1 − π(y_i = 0; β1, x_i1)] (2-43)

Since the two models are functionally independent, the likelihood functions can be maximized separately (Min & Agresti, 2005; Min, 2003; Cameron & Trivedi, 1998; Mullahy, 1986):

Λ_hurdle = Λ1 + Λ2 (2-45)

ℓ(β1, β2) = ℓ1(β1) + ℓ2(β2) (2-46)

This is because "the large sample covariance between the two sets of parameters is zero so the joint covariance matrices can be obtained from the separate fits" (Welsh, Cunningham, Donnelly, & Lindenmayer, 1996, p. 300). Solving the likelihood equations uses either the Newton-Raphson algorithm or the Fisher scoring algorithm, both giving equivalent results (Min, 2003). The Poisson model is nested within the Hurdle model (Zorn, 1996); hence, the fit of these two models can be compared statistically.19

15 The lower bound is then one.

16 This is an alternative representation of the aforementioned Pr(Y = 0) = 1 − F.

17 The second term is an alternative representation of the aforementioned f(k; λ) = Pr(Y = k) = λ^k e^{−λ} / k!.

18 Note that p is used to represent the probability of a zero rather than the conventional representation of a probability of one.

The Negative Binomial Hurdle model

In the case of zero-inflated data, it is possible to have two sources of overdispersion. The variance can be greater than the mean due to the preponderance of zeros.
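Before turning to the negative binomial extension, the hurdle likelihood described above can be sketched numerically. This is a minimal illustration with made-up data and hypothetical coefficients: the transition stage assigns Pr(Y = 0) = exp(−exp(η)) as in Zorn's formulation, and the event stage is a zero-truncated Poisson. The two pieces can be computed, and maximized, separately:

```python
import math

def hurdle_loglik(y, x, beta1, beta2):
    """Log-likelihood pieces of a one-covariate Poisson hurdle model.
    Transition stage: Pr(Y = 0) = exp(-exp(b10 + b11 * x))
    Event stage: zero-truncated Poisson with log(lambda) = b20 + b21 * x.
    Returns (ll_transition, ll_event); the joint value is their sum."""
    ll1 = ll2 = 0.0
    for yi, xi in zip(y, x):
        p_zero = math.exp(-math.exp(beta1[0] + beta1[1] * xi))
        lam = math.exp(beta2[0] + beta2[1] * xi)
        if yi == 0:
            ll1 += math.log(p_zero)
        else:
            ll1 += math.log(1.0 - p_zero)  # the hurdle was crossed
            # truncated Poisson: lam**y * exp(-lam) / ((1 - exp(-lam)) * y!)
            ll2 += (yi * math.log(lam) - lam
                    - math.log(1.0 - math.exp(-lam)) - math.lgamma(yi + 1))
    return ll1, ll2

# Toy data and coefficients, made up purely for illustration
y = [0, 0, 1, 2, 0, 3]
x = [0.1, 0.4, 0.9, 1.3, 0.2, 1.8]
ll1, ll2 = hurdle_loglik(y, x, beta1=(-0.5, 1.0), beta2=(0.2, 0.3))
total_ll = ll1 + ll2  # separability is what makes the hurdle easy to fit
```

Because ll1 depends only on beta1 and ll2 only on beta2, each piece can be handed to an optimizer independently, which is the orthogonality property cited for the hurdle model.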
However, there is now the possibility of unobserved heterogeneity in the event stage (Mazerolle, 2004; Min & Agresti, 2004). The former scenario has been referred to as zero-driven overdispersion (Zorn, 1996); the latter is Poisson overdispersion. Just as was the case with the Poisson model, it is possible to nest the Hurdle model within a more general negative binomial framework. Further, the negative binomial Poisson model is nested within the negative binomial Hurdle model. Hence, the fit of a) the Poisson model and the Hurdle model, b) the negative binomial Poisson model and the negative binomial Hurdle model, and c) the Hurdle model and the negative binomial Hurdle model can be compared using statistical tests.20 Estimation is typically performed by solving the maximum likelihood equations using the Newton-Raphson algorithm.

19 Zhou and Tu (1999) developed a likelihood-ratio test for count data with zeros; however, it was not generalized to any particular zero-inflation model.

The Zero-Inflated Poisson (ZIP) model

The Zero-Inflated Poisson, or ZIP, model is another model that one can use when the zeros in a dataset are argued to be caused by both chance and systematic factors (Min & Agresti, 2005). The transition stage addresses zero-inflation while the event stage addresses unobserved heterogeneity of responses, including zeros (Jang, 2005). Welsh, Cunningham, Donnelly, and Lindenmayer (1996) refer to it as a mixture model. This two-part model, developed by Lambert (1992), permits zeros to occur in both the transition stage and the event stage (Cameron & Trivedi, 1998); "crossing the 'hurdle' in the ZIP model does not guarantee a positive realization of Y" (Zorn, 1996, p. 4). Further, the probability function in the transition stage is now symmetrical (Zorn, 1996). Lachenbruch (2002) explains that "ZIP regression inflates the number of zeros by mixing point mass at 0 with a Poisson distribution" (p. 12).
Zorn (1996, p. 4) clarifies the distinction between the ZIP and Hurdle models as follows: "As a special case of the general model, the ZIP regression is thus seen to make substantially different assumptions about the nature of the data generating process than the hurdle model. Whether parameterized as a logit or a probit, the probability of exiting the zero-only stage is assumed to follow a symmetric cumulative distribution. Likewise, even those cases which make the transition to the events stage may nevertheless have zero counts; crossing the 'hurdle' in the ZIP model does not guarantee a positive realization of Y … The sole difference in assumptions here is that the hurdle model's count distribution is assumed to be truncated at zero, whereas in the ZIP specification count data may take on zero values in the event stage." Another difference is that, unless the ZIP model is overly parameterized, only the Hurdle model can handle zero deflation (Min & Agresti, 2005). Compared to the Hurdle model, the equations for the event stage are very similar. The exception is that (1 − p_i) is divided by (1 − e^{−λ_i}) in the Hurdle model before being multiplied by the remaining elements of the equation. However, the transition-stage equations are strikingly different. For the Hurdle model, the equation is Pr(y_i = 0) = p_i; the ZIP model adds the probability of a nonzero multiplied by exp(−λ_i). This is the mathematical characteristic that distinguishes the Hurdle model's exclusion of zeros in the event stage from the ZIP model's potential inclusion of zeros in the event stage.

20 It is not the case that the negative binomial Poisson model is nested within the Hurdle model; hence, one cannot statistically compare all four models collectively.
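This distinction in the zero probabilities can be made concrete numerically. The sketch below uses arbitrary illustrative values of p and λ, not figures from any dataset discussed here:

```python
import math

# Illustrative values only: p is the transition-stage zero probability,
# lam is the Poisson mean operating in the event stage.
p, lam = 0.3, 1.5

# Hurdle model: zeros come only from the transition stage.
pr_zero_hurdle = p

# ZIP model: zeros come from the perfect state AND from the Poisson state,
# so (1 - p) * exp(-lam) is added on.
pr_zero_zip = p + (1.0 - p) * math.exp(-lam)

# Crossing the "hurdle" in the ZIP model does not guarantee a positive count,
# so for the same p and lam the ZIP zero probability is strictly larger:
assert pr_zero_zip > pr_zero_hurdle
```

The gap between the two zero probabilities shrinks as λ grows, since a large Poisson mean makes chance zeros rare.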
Rather than model only the probability of a zero in the transition stage, the ZIP also models the probability that the counts have a Poisson distribution, hence permitting zeros from both a perfect state and a Poisson state (Hur, 1999). Given this, λ parameterizes the mean of this Poisson distribution (Welsh, Cunningham, Donnelly, & Lindenmayer, 1996). When adding covariates, the Hurdle and ZIP generalized linear models appear the same:

Transition stage: log(p_i / (1 − p_i)) = x_i1 β1 (2-47)

Event stage: log(λ_i) = x_i2 β2 (2-48)

Unlike the Hurdle model, the ZIP model likelihood function cannot be maximized separately for the transition and event stages. Hence, the Hurdle model "has the attractive advantage of an orthogonal parameterization which makes it simpler to fit and interpret than the mixture model" (Welsh, Cunningham, Donnelly, & Lindenmayer, 1996), with the disadvantage of an asymmetrical transition to counts. The likelihood function derived by Lambert (1992) is

log L = Σ_{y_i=0} log[exp(G_i γ) + exp(−exp(B_i β))] + Σ_{y_i>0} [y_i B_i β − exp(B_i β)] − Σ_{i=1}^{n} log[1 + exp(G_i γ)] − Σ_{y_i>0} log(y_i!) (2-49)

where β is the vector of coefficients and B the matrix of scores for the event stage, γ is the vector of coefficients and G the matrix of scores for the transition stage, and iterations are based on the EM or Newton-Raphson algorithms (Min, 2003; Lambert, 1992).21 Strictly speaking, the Poisson model is not nested within the ZIP model; therefore, it would not be wise to conduct a formal model fit test (Zorn, 1996; Greene, 1994). However, it is interesting to note that the log-likelihood of 10,607 is slightly lower than that produced by the negative binomial Hurdle model. This is in line with Greene's (1994) observation when using the Vuong statistic as an alternative for testing non-nested models: "For present purposes, the important question is whether the ZIP models … provide any improvement over the basic negative binomial …
The log-likelihood functions are uniformly higher, but as noted earlier, since the models are not nested, these are not directly comparable. The Vuong statistic, however, is consistent with the observation" (Greene, 1994, p. 26).

The Negative Binomial Zero-Inflated Poisson model

The ZIP model can be extended to the negative binomial model, just as the Poisson was extended to the negative binomial and the Hurdle was extended to the negative binomial Hurdle. This may be necessary, as Min (2003) explains: "Sometimes such simple models for overdispersion are themselves inadequate. For instance, the data might be bimodal, with a clump at zero and a separate hump around some considerably higher value. This might happen for variables for which a certain fraction follows some distribution having positive probability of a zero outcome" (p. 13). He further explains, "The equality of the mean and variance assumed by the ZIP model … is often not realistic. Zero-inflated negative binomial models would likely often be more appropriate than ZIP models" (p. 15). Unfortunately, current iterative techniques lead to a greater risk of nonconvergence than when using the other two-part models (Famoye & Singh, 2006; Lambert, 1992). The key assumption in accepting these results over those produced from the Hurdle models is that there are some zeros that belong with the counts, representing no household members under the age of 21 for reasons other than never having such household members at all. Although it is not valid to statistically compare the fit of the ZIP model with the Hurdle and Poisson models, it is reasonable to test the fit of the ZIP model within the negative binomial ZIP model.

21 Alternatively, Martin, Wintle, Rhodes, Kuhnert, Field, Low-Choy, Tyre, and Possingham (2005) compared relative means and credible intervals estimated from a bootstrap procedure.
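Lambert's ZIP log-likelihood can be sketched directly from its mixture form. This is a minimal illustration with made-up data and one covariate; the logit link for the transition stage and log link for the event stage follow the equations given earlier, and all coefficient values are hypothetical:

```python
import math

def zip_loglik(y, x, gamma, beta):
    """ZIP log-likelihood for a one-covariate model, following Lambert (1992):
    logit(p_i) = g0 + g1 * x_i  (transition stage)
    log(lambda_i) = b0 + b1 * x_i  (event stage)."""
    ll = 0.0
    for yi, xi in zip(y, x):
        g_eta = gamma[0] + gamma[1] * xi   # transition-stage linear predictor
        lam = math.exp(beta[0] + beta[1] * xi)
        if yi == 0:
            # zeros may come from the perfect state or the Poisson state
            ll += math.log(math.exp(g_eta) + math.exp(-lam))
        else:
            ll += yi * (beta[0] + beta[1] * xi) - lam - math.lgamma(yi + 1)
        ll -= math.log(1.0 + math.exp(g_eta))  # normalizing term for every i
    return ll

# Toy data and coefficients, made up purely for illustration
y = [0, 0, 0, 1, 2, 0]
x = [0.2, 0.5, 0.1, 1.1, 1.6, 0.3]
ll = zip_loglik(y, x, gamma=(-0.4, 0.5), beta=(0.1, 0.4))
```

Note that, unlike the hurdle case, gamma and beta both appear in the y = 0 term, so the two stages cannot be maximized separately; sending the transition-stage intercept toward minus infinity recovers the plain Poisson log-likelihood.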
In sum, the choice between Hurdle models and ZIP models is ultimately guided by the assumptions one makes about the data generating process. Min (2004) states, "The zero-inflated models are more natural when it is reasonable to think of the population as a mixture, with one set of subjects that necessarily has a 0 response. However, they are more complex to fit, as the model components must be fitted simultaneously. By contrast, one can separately fit the two components in the hurdle model. The hurdle model is also suitable for modeling data with fewer zeros than would be expected under standard distributional assumptions" (p. 20).

Model Comparison Testing for Zero-Inflated Data

The Poisson model is nested within the negative binomial Poisson model, differing only by the dispersion parameter. In fact, the two models are equivalent when one uses the negative binomial model and restricts the dispersion parameter so that the variance equals the mean, achieving Poisson equidispersion (Cameron & Trivedi, 1998). Likewise, the Hurdle model is nested within the negative binomial Hurdle model, and the ZIP model is nested within the negative binomial ZIP model. However, since each involves estimation of a transition-stage and an event-stage model, the nesting rule applies to both equations. In other words, if the event stage contains 3 parameters and the transition stage 4 parameters for the nested model, then the more complex model must contain at least these same 3 parameters in the event stage and at least the same 4 parameters in the transition stage. According to Zorn (1996), the Poisson model is nested within the Hurdle model, and the negative binomial Poisson model is nested within the negative binomial Hurdle model. This is reasonable given that the Hurdle models are estimating the Poisson models in the event stage and that these likelihood statistics are independent of those produced in the transition stage.
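The nested comparisons just described rest on the deviance (likelihood-ratio) statistic, with AIC available for descriptive comparison of non-nested models. The following is a minimal sketch; the log-likelihood values are made up, and the models are assumed to differ by a single parameter so the chi-square has one degree of freedom:

```python
import math

def lr_test_1df(ll_simple, ll_complex):
    """Likelihood-ratio (deviance) test for nested models differing by one
    parameter: G2 = -2 * (ll_simple - ll_complex), referred to chi-square(1)."""
    g2 = -2.0 * (ll_simple - ll_complex)
    # chi-square(1) upper tail equals the two-sided normal tail of sqrt(G2)
    p_value = math.erfc(math.sqrt(g2) / math.sqrt(2.0))
    return g2, p_value

def aic(ll, n_params):
    """Akaike's Information Criterion, for descriptive comparison."""
    return -2.0 * ll + 2.0 * n_params

# Hypothetical fits: a Poisson model vs. its negative binomial extension
g2, p = lr_test_1df(ll_simple=-10650.0, ll_complex=-10607.0)
aic_simple = aic(-10650.0, 3)
aic_complex = aic(-10607.0, 4)
```

With these made-up values the deviance is 86 on 1 df, decisively favoring the more general model, and the AIC ordering agrees.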
The ZIP models, on the other hand, are not estimated in this manner; it is not reasonable to assume that the Poisson models are nested within the ZIP models (Greene, 1994). This leads to the hierarchy of permissible model testing displayed in Table 2-1. Other models can be compared descriptively using the aforementioned Akaike's Information Criterion (AIC). The remainder of the dissertation will use the deviance statistic for model comparison inference and the AIC for model comparison description.

Review of Research Pertaining to and Using Zero-Inflated Count Data

Hurdle Model

Statistical

As previously discussed, Mullahy (1986) presented the underlying statistical foundations for the Hurdle model. He specified it in terms of extending from both the Poisson and geometric distributions. He also presented an extension to the Hurdle model that he termed the "with-zeros" (WZ) model. The WZ model adjusts for zero-inflation by augmenting or reducing the probability by an additive constant (rather than having the probabilities specified by the parent distribution). Subsequent research has focused on the Hurdle model rather than the WZ model; this is most likely because the WZ model turns out to be a special case that collapses into a Hurdle model in some specifications, and estimates are often similar between the models (Mullahy, 1986). Finally, Mullahy presented tests for specification, including a technically complex information matrix test as well as the Score test typically provided in software output. King's (1989) key contribution to the Hurdle model was a Monte Carlo study confirming that the counts can be viewed as arising from a Poisson process (Civettini & Hines, 2005). Min and Agresti (2004) conducted a simulation to compare a zero-inflation and a zero-deflation condition. They found that the estimates were reasonable for the ZIP model under zero-inflation. However, the coefficient and standard error for the event stage were both very large under zero-deflation.
They explain that logit models simply cannot accommodate too few zeros. "The zero-inflated model is only suitable for zero-inflation problems. However, the hurdle model is also suitable for modeling data with fewer zeros than would be expected under standard distributional assumptions. In fact, when a data set is zero-deflated at some levels of the covariates, the zero-inflation model may fail" (Min & Agresti, 2004, p. 5). In contrast, Mullahy (1986) stated that "a particularly interesting feature of the modified count data specifications considered here [i.e., the Hurdle model] is that they provide a natural means for modeling overdispersion or underdispersion of the data. Specifically, overdispersion and underdispersion are viewed as arising from a misspecification of the maintained parent [data generating process] in which the relative probabilities of zero and nonzero (positive) realizations implied by the parent distribution are not supported by the data" (Mullahy, 1986, p. 342).

Applications

Mullahy (1986) researched daily consumption of coffee, tea, and milk, where each is a count variable. Covariates included age, years of completed schooling, family income, sex, race, and marital status. For the Hurdle model, the test statistics [p-values] for the coffee model were substantially smaller than those for the tea and milk models. Mullahy argued that this is not surprising given the ratio of estimates and standard errors. However, closer inspection of the data reveals that it is the coffee variable that has the lowest proportion of zeros, at only 26.26%. It is possible that the Hurdle model was unable to adequately account for the additional overdispersion witnessed in the other two models (61.63% and 40.37% zeros). In other words, a two-part model may be a necessary but not sufficient condition for handling overdispersion in zero-inflated count models, and negative binomial formulations may be increasingly necessary as the proportion of zeros increases.
King (1989) applied the Hurdle model to data on the relationship between the number of nations entering war in a period of time as a function of those in formal international alliances. The Hurdle model was formulated based on the premises of Mullahy (1986) and justified on the grounds that some countries will not go to war, while others will not at first but will later be 'dragged in' by alliances. Hence, this is a classic example of the justification for true zeros and event-driven zeros. This was demonstrated statistically by comparing Hurdle and Poisson results. The Poisson alliance coefficient of .007 was significant, the Hurdle model event-stage coefficient of .007 was significant, and the Hurdle model transition-stage coefficient of .001 was not significant. Hence, the Poisson interpretation would be that increased alliances lead to increased war frequency. However, the Hurdle results clarify that this is only true after the onset of war (i.e., the hurdle has been crossed). Further statistical evidence supported the Hurdle model based on the likelihood-ratio model comparison test.

Zero-Inflated Poisson Model

Statistical

As previously discussed, Lambert (1992) presented the original formulation for the ZIP model and its ZIP(τ) variation. She also presented the models' extension from the Poisson and negative binomial as well as the derivation of the maximum likelihood (EM) estimates. She ran several simulations to test the adequacy of the model. The first simulation varied sample size, with one covariate taking on a fixed coefficient and a fixed variance in both parts of the model. The result was an average of 50% zeros in the transition stage and 23% zeros in the event stage. The results suggest that the ZIP model consistently converges at n = 25 when using EM and at n = 100 when using the Newton-Raphson algorithm.
An examination of confidence intervals revealed that the normal-theory intervals are not reliable at n = 100; however, almost all simulated likelihood-ratio confidence intervals contained the true mean even at n = 25. "To summarize, these simulations with one covariate for both λ and p are encouraging. The ZIP and ZIP(τ) regressions were not difficult to compute, and as long as inference was applied only when the observed information matrix was nonsingular, estimated coefficients, standard errors based on observed information, and estimated properties of Y could be trusted" (Lambert, 1992, p. 7). Warton (2005) compared 20 datasets of varying sample sizes, proportions of zeros, and factors/levels. The ordinary least squares version included the addition of one to all counts before taking the logarithm. The other models not accommodating zero-inflation were the Poisson and four formulations of the negative binomial (including the aforementioned quasi-Poisson, where the variance is set to a multiple of the mean); the zero-inflated models were the ZIP model and the negative binomial ZIP model. The Akaike Information Criterion (AIC) values were calculated for a total of 1,672 variables, averaged over datasets, and rescaled to a minimum AIC of zero for each dataset. As expected, when overdispersion was present, the negative binomial formulations outperformed the models without these formulations. However, when overdispersion was not present, the reverse was true for 53% of the variables. This suggests that the level of skew in the model interacts with zero-inflation when measuring model adequacy. When the proportion of zeros is very small, the distribution looks more like a Poisson distribution truncated at zero. In other words, it shares features modeled by the event stage of a Hurdle model. This led to estimation problems in which the negative binomial ZIP model rarely converged. When it did converge, its fit was better than that of the ZIP model for only 11% of the datasets.
Possibly, as previously discussed, the zero-inflated Hurdle would converge more often since it can handle both zero-inflation and zero-deflation (Min & Agresti, 2005; Min & Agresti, 2004). A very interesting finding pertained to the transformed OLS fit indices. "Although transformed least squares was not the best fitting model for data, it fitted the data reasonably well. Surprisingly, transformed least squares appeared to fit data about as well as the zero-inflated negative binomial model ... The AIC for transformed least squares was not as small as for the negative binomial model overall, although it was smaller for 20 per cent of the variables considered here" (Warton, 2005, p. 283). Averaging over all datasets, the AIC was lowest for the negative binomial models, followed by a close tie between the transformed OLS model and the negative binomial ZIP model, which was followed by the ZIP model. All models were a drastic improvement over the standard Poisson model. The implication is that, although the Poisson is rarely adequate when the data are not equidispersed and/or are inflated or deflated, an intuitive climb up the ladder of models may not be reasonable. There were features of these datasets, including varying degrees of zero-inflation and differing overall distributions, that warrant further investigation toward appropriate model selection. "If one were to fit a zero-inflated model, it would be advisable to present quantitative evidence that the zero-inflation term was required. Based on the present results, it is likely that a term for extra zeros is not needed, and a simpler model will usually suffice ... special techniques are not generally necessary to account for the high frequency of zeros. The negative binomial was found to be a good model for the number of zeros in counted abundance datasets, suggesting that a good approach to analyzing such data will often be to use negative binomial log-linear models" (Warton, 2005, pp. 287-288).
In regard to the problems with zero-inflated negative binomial convergence, Famoye and Singh (2006) developed an extension that improves convergence, termed the zero-inflated generalized Poisson regression (ZIGP) model. Their research revealed convergence in fewer than 20 iterations for all trials. However, parameter estimates and standard errors were often very different from those produced by the ZIP model. They conclude, "Even though the ZIGP regression model is a good competitor of ZINB regression model, we do not know under what conditions, if any, which one will be better. The only observation we have in this regard at this time is that in all of the datasets fitted to both models, we successfully fitted the ZIGP regression model to all datasets. However, in a few cases, the iterative technique to estimate the parameters of ZINB regression model did not converge" (p. 128). Greene (1994) used the Vuong statistic when comparing the Poisson, negative binomial Poisson, and ZIP models. It was noted that the rank order of the Vuong statistics and the log-likelihood estimates were in alignment. The conclusion suggested future research using the Vuong statistic. "The use of Vuong's statistic to test the specification seems not to have appeared in the recent literature ... We conjecture that the Vuong testing procedure offers some real potential for testing the distributional assumption in the discrete data context. In the cases examined, it appears to perform well and in line with expectations" (Greene, 1994, p. 30). Shankar, Milton, and Mannering (1997) used the Vuong statistic to decide between the negative binomial, ZIP, and negative binomial ZIP model for traffic accident data. They clarify the interpretation of the statistic, stating, "A value > 1.96 (the 95% confidence level of the t-test) for V favors the ZINB while a value < -1.96 favors the parent NB (values in between -1.96 and 1.96 mean that the test is indecisive) ...
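The decision rule described by Shankar, Milton, and Mannering can be illustrated. The sketch below is my own hedged rendering of the Vuong statistic for non-nested models, computed as the standardized mean of per-observation log-likelihood differences; it is not the authors' code, and the toy log-likelihoods are hypothetical:

```python
import math

def vuong(loglik1, loglik2):
    """Vuong statistic from per-observation log-likelihoods of two
    non-nested models: V > 1.96 favors model 1, V < -1.96 favors
    model 2, and values in between are indecisive (5% level)."""
    m = [a - b for a, b in zip(loglik1, loglik2)]
    n = len(m)
    mean = sum(m) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in m) / n)
    return math.sqrt(n) * mean / sd

# Toy per-observation log-likelihoods: model 1 is uniformly better,
# so V comes out positive (here well above 1.96).
ll1 = [-1.0, -1.2, -0.9, -1.1, -1.0, -0.8]
ll2 = [-1.3, -1.4, -1.3, -1.2, -1.4, -1.1]
print(round(vuong(ll1, ll2), 2))
```

Swapping the two models simply flips the sign of V, which is why the single statistic supports a two-sided decision rule.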
This test can also be applied for the ZIP(τ) and ZIP cases" (p. 831). Civettini and Hines (2005) explored the effects of misspecification on negative binomial ZIP models. This included misspecification by leaving a variable out of the event stage that was present in the transition stage and misspecification by shifting a variable from the transition stage to the event stage.

Applications

Lambert (1992), in formulating the ZIP model, applied it to the analysis of defects in manufacturing. In terms of improperly soldered leads, 81% of circuit boards had zero defects relative to the 71% to be expected under a Poisson distribution linked to a model with a three-way interaction. This most complicated model had a log-likelihood of 638.20. This dropped to 511.2 for the ZIP model. Although comparing a different combination of covariates, the negative binomial Poisson model fit better than the Poisson model but not as well as the ZIP model. Greene (1994) used credit-reporting data to investigate differences between the Poisson, negative binomial Poisson, ZIP, and negative binomial ZIP models, as well as some of their aforementioned variants and the specification of a probit link rather than the logit link. The data consisted of 1,023 people who had been approved for credit cards. The count variable of concern was the number of major derogatory reports (MDR), which is the number of payment delinquencies in the past 60 days. For this sample, 89.4% had zero MDRs. Given a mean of 0.13, this frequency of 804 is nearly double the 418 we might expect in a Poisson distribution. The skew of 4.018 is reduced to 2.77 when ignoring the zeros, while the mean increases to 1.22. As expected, the negative binomial Poisson resulted in improved fit (based on the Vuong test statistic), increased standard errors, and different parameter estimates. The ZIP model resulted in slightly worse fit than the negative binomial Poisson while remaining much better compared to the Poisson model.
If all of the overdispersion were due to unobserved response heterogeneity, then the results should be similar for the negative binomial ZIP model. However, this model produced the best fit of all. It is interesting to note that, again, the standard errors increased while the parameter estimates differed relative to the ZIP model. In fact, of the 6 parameters, 4 estimates decreased, 2 increased, and 1 switched in sign. Hence, there are two implications. First, the negative binomial ZIP model was necessary to accommodate two sources of overdispersion to adjust standard errors. Second, ignoring the negative binomial formulations would have led to nonsensical parameter estimates driven by a sample mean of 0.13. Böhning, Dietz, Schlattmann, Mendonça, and Kirchner (1999) compared pre- and post-intervention scores on the decayed, missing, and filled teeth index (DMFT) for 797 children in one of six randomly assigned treatment conditions. The results were not exhaustive; however, the log-likelihood did decrease from 1473.20 to 1410.27 when going from the Poisson model to the ZIP model. This study was somewhat unique in that all the covariates (sex, ethnicity, and condition) were categorical and that the conditions were dummy-coded as five parameters. Also, this data had features that might suggest that zero-inflated models weren't necessary. For pre-intervention, the proportion of zeros was 21.58%, which increased to only 28.99% at post-intervention. The means, with the zeros in the data, were 3.24 and 1.85, respectively. Ignoring the zeros changed these means to 4.13 and 2.61, respectively. The skew, with the zeros in the data, was 0.20 and 0.65, respectively. Ignoring zeros changed the skew to 0.08 and 0.63, respectively. In other words, many features of the data were consistent with what would be expected of a Poisson, and possibly normal, distribution.
Nonetheless, with these means and frequencies, the Poisson distribution suggests overdispersion, with 31 permissible zeros for the pre-intervention and 125 permissible for the post-intervention, whereas the data revealed 173 zeros and 232 zeros, respectively. It then becomes a matter of whether the overdispersion was due to the proportion of zeros in each condition or unobserved heterogeneity in the event stage. The negative binomial ZIP model was not used to analyze this data. Xie, He, and Goh (2001) analyzed the number of computer hard disk read-write errors. Approximately 87% of the 208 cases were zeros. Given that the Poisson mean was 8.64, the authors noted that the ZIP model is to be preferred over the Poisson model. However, this mean is due to several values between 1 and 5, a few between 6 and 15, and 2 values of 75. These latter two values appear to be so far from the others that they should have been treated as outliers and addressed in some other manner. Jang (2005) analyzed the number of non-home-based trips per day from 4,416 households in Jeonju City, Korea. The provided bar graph suggested that approximately 45% of the cases were zeros. The Vuong statistic (Vuong, 1989) was used for model selection given that the Poisson is not nested within the ZIP or negative binomial ZIP models. The purpose of the article by Delucchi and Bostrom (2004) was to provide a brief introduction to many possible methods for handling zero-inflation, including standard t-tests, bootstrapping, and nonparametric methods.22 In doing so, they provided results from a study involving 179 patients with opioid dependence assigned to either a methadone-maintenance or methadone-assisted-detoxification treatment. Five out of seven ways to segment the sample resulted in zero-inflation ranging from 17% to approximately 66% zeros. The only two-part model to be used was the ZIP model.

22 See Jung, Jhun, and Lee (2005) for bootstrap procedures and simulation results for Type I and Type II errors.
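The "permissible zeros" figures quoted above for the DMFT data follow directly from the Poisson zero probability, n·e^(−λ). A quick sketch confirms the arithmetic:

```python
import math

def poisson_expected_zeros(n, mean):
    """Expected number of zeros among n observations if Y ~ Poisson(mean)."""
    return n * math.exp(-mean)

# Böhning et al.'s pre-/post-intervention DMFT means with n = 797:
print(round(poisson_expected_zeros(797, 3.24)))  # 31 expected vs. 173 observed
print(round(poisson_expected_zeros(797, 1.85)))  # 125 expected vs. 232 observed
```

The gap between expected and observed zeros is the simplest descriptive screen for whether a zero-inflated model is worth entertaining.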
The table of results revealed that, in terms of p-values, the ZIP model performs either very similarly or very differently from the Pearson χ² test for the proportion of zero values, the Mann-Whitney-Wilcoxon test of nonzero values, and/or the Mann-Whitney-Wilcoxon test of difference in mean scores between treatment groups. It is possible that these tests become more similar as the proportion of zeros declines, but such conclusions are based purely on the table of p-values. Desouhant, Debouzie, and Menu (1998) researched the frequency of immature weevils in chestnuts. One tree was measured over 16 years, another was measured over 11 years, and three trees were measured over 1 year. The means ranged from .06 to .63. "None of the 30 distributions fits a Poisson, χ² values being always very significant ... The ZIP distribution fits 25 out of 31 cases ... The NB distribution fits 20 out of the 31" (Desouhant, Debouzie, & Menu, 1998, p. 384). This led to the conclusion that researchers should consider both true zeros and overdispersion (i.e., trees noted as 'contagious' and trees varying in random oviposition behavior). Shankar, Milton, and Mannering (1997) analyzed a 2-year summary of traffic accident frequencies. For principal arterials, they chose a negative binomial model with data ranging from 0 to 84 (M = 0.294, SD = 1.09). For minor arterials, they chose the negative binomial ZIP model for data ranging from 0 to 7 (M = 0.09, SD = 0.346). For collector arterials, they chose the ZIP model for data ranging from 0 to 6 (M = 0.61, SD = 0.279). Model selection was based on the Vuong statistic. For example, they state, "As suspected previously, inherent overdispersion in the data is due to the parent NB process and this was validated when the [negative binomial ZIP] specification failed to provide a statistically better fit (the Vuong statistic < 1.96, which corresponds to the 95% confidence limit of the t-test)" (p. 833).
Slymen, Ayala, Arredondo, and Elder (2006) analyzed percent calories from fat and number of days of vigorous physical activity from 357 females participating in a baseline condition and one of three treatment conditions. Covariates included employment status, education, marital status, cigarette smoking, and self-reported health. The zero-inflation was 294 out of 357 (82.4%). They compared models using likelihood-ratio tests between the Poisson and negative binomial Poisson, and likewise between the ZIP and negative binomial ZIP. The AICs were inspected to compare the Poisson and ZIP models. Not surprisingly, the negative binomial model fit better than the Poisson model. However, the ZIP model did not fit better or worse than the negative binomial ZIP, and the parameter estimates and standard errors were nearly identical. This suggests almost no overdispersion in the data. Indeed, the nonzero percentages were as follows: 1 = 2.8%, 2 = 3.4%, 3 = 4.8%, 5 = 2.0%, 6 = 0.0%, and 7 = 2.0%. This suggests strong equidispersion leaning toward a uniform nonzero distribution. The AICs for both models were also nearly equal, although both were considerably smaller than the AIC for the Poisson model and somewhat smaller than the AIC for the negative binomial Poisson model. Based on a 'smaller-is-better' heuristic, the authors favored the ZIP model, with an AIC of 562.5, over the negative binomial ZIP model, with an AIC of 565.

ZIP and Hurdle Model Comparisons

The purpose of this section is to present a breadth of literature in which both the Hurdle and ZIP models were either compared statistically and/or used to analyze real data. This also includes extensions such as the negative binomial and tau formulations (e.g., ZIP(τ)). Some authors presented alternatives that seem to depart from the ZIP and Hurdle models too drastically to be within the scope of this dissertation.
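The 'smaller-is-better' heuristic rests on AIC = 2k − 2·logL, where k is the number of estimated parameters. A minimal sketch; the log-likelihood and parameter count below are hypothetical, chosen only to reproduce the reported AIC of 562.5:

```python
def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2*logL; smaller is better."""
    return 2 * k - 2 * loglik

# Hypothetical values: a model with 4 parameters and logL = -277.25
# yields the 562.5 figure reported for the favored ZIP model above.
print(aic(-277.25, 4))  # 562.5
```

Because the penalty term 2k grows with model complexity, a negative binomial ZIP must improve the log-likelihood by at least one unit per extra parameter to beat the plain ZIP on AIC.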
For example, Lachenbruch (2001) used a two-part model; however, the splitting formulation was not consistent with the literature. Further, the model was compared to atypical formulations such as the Wilcoxon, Kolmogorov-Smirnov, and z tests. As such, these types of articles are not included in the subsequent review. One exception is Xie, He, and Goh (2001), who included a likelihood-ratio test for comparing the Poisson and ZIP models.

Statistical

Greene (1994) proposed several 'zero-altered count models' for comparison. First, he took Mullahy's with-zeros (WZ) adaptation of the Hurdle model and included a scalar estimate to ease computational burdens. Greene also presented an adaptation of Lambert's ZIP known as ZIP(τ) and modified it for the negative binomial formulations, terming them ZINB and ZINB(τ). The intention was to identify a "procedure which will enable us to test the zero inflated model against the simple Poisson model or against the negative binomial model. The latter will allow us to make a statement as to whether the excess zeros are the consequence of the splitting mechanism or are a symptom of unobserved heterogeneity" (Greene, 1994, p. 10). Greene developed a method for comparing the models; however, he noted that there was no a priori reason to think that the Vuong statistic would be inferior.

Applications

Zorn (1996) examined the counts of actions taken by Congress addressing Supreme Court decisions between 1953 and 1987. The zeros were seen to arise from two sources since many cases will not be addressed unless there are lobbyists to pressure redress. Covariates included the year of the decision, the political orientation of the decision, the presence of lower court disagreement, the presence of precedence alteration, declaration of unconstitutionality, and unanimous vote. The number of actions ranged from 0 to 11 (M = 0.11, SD = .64); however, 3,882 (95.8%) of the 4,052 counts were zeros.
This contributed to an exceptionally high skew of 7.97. When ignoring the zeros, the skew was reduced to 1.86 (M = 2.59, SD = 1.53). The observed zeros were 107% of what would be Poisson-expected. Regardless, Poisson model results were in line with theory-driven expectations. However, the test of overdispersion was significant when comparing the Poisson and negative binomial Poisson, resulting in fewer significant predictors than if overdispersion were ignored. The author also fitted a generalized negative binomial model in which "the variance parameter is allowed to vary as an exponential function of the same independent variables included in the model of the count" (Zorn, 1996, p. 9), which led to even better model fit. However, due to zero-inflation, no model provided reasonable estimate sizes given the low mean count. Zorn's analyses using the ZIP and Hurdle models yielded several findings. First, the probability of remaining a zero in the transition stage was considerably lower for the Hurdle model than for the ZIP model at lower levels of a predictor. This is a reflection of the asymmetry of the Hurdle model. Second, parameter estimates and standard errors were similar between the two models. He concluded that "at least in some circumstances the performance of ZIP and hurdle Poisson models will be quite similar. This suggests that, as a practical matter and barring any strong theoretical considerations favoring one over the other, the choice between them may be made largely on the basis of convenience of estimation" (Zorn, 1996, p. 11). Pardoe and Durham (2003) compared the Poisson, Hurdle, and ZIP models, as well as their negative binomial formulations, using wine sales data. Of the 1,425 counts, 1,000 (70.2%) were zeros. The authors noted that this is greater than the 67.8% that a Poisson distribution is capable of predicting. Based on the AIC, the negative binomial ZIP model performed best.
Surprisingly, the Hurdle model fit more poorly than did the Poisson model. It is possible, given the unrealistically high AIC relative to the other models, that the value wasn't calculated correctly. Alternatively, the distribution may not have been correctly specified, since their analysis included Bayesian estimates of the prior distributions. No discussion pertaining to the Hurdle model was included. However, they did provide a novel procedure for comparing the range of fit statistics across the zero-inflation models. This 'parallel coordinate plot for goodness-of-fit measures' consists of an x-axis labeled Min on the left and Max on the right. The y-axis is a series of horizontal lines, each pertaining to a fit statistic (e.g., AIC, BIC). The ceiling x-axis contains the labels for the models being compared. Then, points for each model are plotted on the lines for the fit statistics at their relative location between Min and Max. A similar procedure restricted to the AIC was used by Warton (2005). This technique was adapted to display the coverage for simulated fit statistics. Welsh, Cunningham, Donnelly, and Lindenmayer (1996) used zero-inflated rare species count data to compare the Poisson, negative binomial Poisson, Hurdle, negative binomial Hurdle, and ZIP models. Approximately 66% of the observations were zeros. They found little difference between the Hurdle, negative binomial Hurdle, and ZIP model results. Since there was no overdispersion, the authors recommended using the Poisson model. This is in line with Warton's (2005) assertion that the more complex zero-inflation models may not always be necessary; at least, this appears to be the case with 66% zero-inflation and equidispersion.

Discrepant Findings

What exactly is meant by zero-inflation? Min and Agresti (2005) define zero-inflated count data as "data for which a generalized linear model has lack of fit due to disproportionately many zeros" (p. 1).
This raises the question, "At what point does the frequency of zeros become disproportionate to the frequency of nonzeros?" One statistical definition states that the proportion of zeros is greater than that to be expected given the posited distribution (Zorn, 1996). For example, for count data, the proportion of zeros should not be greater than that expected by a Poisson distribution. However, there are three problems with this. First, there may be many different proportions of zeros greater than that expected by a particular Poisson distribution. Second, the definition assumes that the full model is distributed Poisson. Third, it ignores the two potential sources of overdispersion for Poisson zero-inflated data. The aforementioned AEWR example displayed a zero proportion of .7158 with a mean of .54, a standard deviation of 1, and a skew of 1.971. Ignoring the zeros, although the event stage distribution remained positively skewed, the mean increased to 1.91 and the level of skew dropped to 0.96. The distribution for the categorical sex variable was binomial with approximately 43% males and 47% females. The distribution for the age variable was roughly normal with a mean of 48.86, a standard deviation of 17.41, and a skew of 0.87. Hence, 71% zeros, a heavily skewed overall distribution (skew of 1.971), a moderately skewed nonzero distribution (skew of 0.96), a normally distributed continuous predictor, and a two-level categorical predictor led to the following findings: 1) the Hurdle model fit better than the Poisson model; 2) the negative binomial Hurdle fit better than the negative binomial Poisson model; 3) the negative binomial Hurdle fit better than the Hurdle model; 4) the negative binomial ZIP fit better than the ZIP model; 5) the negative binomial ZIP model descriptively fit better than all others; and 6) the Hurdle and negative binomial Hurdle models yielded nearly identical estimates and p-values.
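The skew figures quoted throughout this review are sample skewness values, the average cubed standardized deviation. A minimal sketch (toy data of my own) illustrates how removing the zeros reduces skew, as in the AEWR example:

```python
def skew(xs):
    """Sample skewness: mean of the cubed standardized deviations."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / sd) ** 3 for x in xs) / n

# A zero-heavy sample is strongly right-skewed; its event stage
# (zeros removed) is far less so.
data = [0] * 7 + [1, 2, 6]
print(round(skew(data), 2))                        # 2.16
print(round(skew([x for x in data if x > 0]), 2))  # 0.6
```

This is the population-denominator (n) form of skewness; software often applies a small-sample correction, so reported values can differ slightly.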
Hence, findings between the zero-inflation models differed in terms of both fit and the significance of parameter estimates. Although not all research presented sufficient information (e.g., the data necessary to calculate skew), there is clearly enough variation in results to warrant further research. Mullahy's (1986) Hurdle model analyses were impacted by zero-inflation of .263 and not by zero-inflation of .616 or .404; however, this is in disagreement with findings that the Hurdle model adequately handles zero-deflation (Min, 2003). Lambert's ZIP analysis with 71.8% zeros favored the ZIP over the negative binomial ZIP. Greene's (1994) ZIP analyses resulted in nonsensical results under .894 zeros and heavy skew (4.02); the negative binomial ZIP corrected this. Slymen, Ayala, Arredondo, and Elder's (2006) ZIP and negative binomial ZIP results were virtually identical; however, their event stage distribution was uniform. This was confirmed by Warton's (2005) finding that the negative binomial fits better than the ZIP only when zero-inflation and overdispersion are both indeed present. Extending from this are Böhning, Dietz, Schlattmann, Mendonça, and Kirchner's (1999) findings that the ZIP actually fit better than the Poisson given .216 and .289 zero-deflation. This is again in contrast to the suggestion that the Hurdle, and not the ZIP, is appropriate for zero-deflated data. However, it could be argued that a normal distribution should have been assumed given that the event stage distribution was relatively normal. When comparing the Hurdle model to the ZIP model, Zorn (1996) found similar results given .958 zero-inflation, skew of 7.97, and a reduction of skew to 1.86 for the event stage. These findings are in contrast to Zorn's (1996) findings of greater zero-inflation and, subsequently, greater skew. Welsh, Cunningham, Donnelly, and Lindenmayer (1996) also found little difference between the Hurdle and ZIP models.
Table 2-2 summarizes the findings from the zero-inflation literature. There are three factors that may have caused the anomalous findings. The first possibility pertains to the effect of different types, quantities, and values of predictors. The second possibility is the proportion of zeros for the outcome variable. The third possibility is the degree of skew in the event stage for the outcome variable. It has already been suggested that the Hurdle and ZIP models should be chosen given a priori research about the source and nature of the zeros. Further, it has been established that the negative binomial formulations are meant to handle additional overdispersion in the event stage. However, the previous findings suggest that there are additional considerations, such as the proportion of zeros and the nature of the event stage distribution. The proportion of zeros in this research ranged from as low as .20 (Delucchi & Bostrom, 2004) to .958 (Zorn, 1996). Distributions for the event stage included those that were heavily positively skewed (Greene, 1994), moderately positively skewed (the AEWR example), distributed normally (Böhning, Dietz, Schlattmann, Mendonça, & Kirchner, 1999), and distributed uniformly (Slymen, Ayala, Arredondo, & Elder, 2006). The first possibility, pertaining to the covariates, can only be tested by varying an incredibly large set of conditions ranging from small to large quantities of predictors as well as their types (e.g., nominal, ordinal) and distributions. However, given a particular set of covariates and corresponding values, the other two possibilities, pertaining to zero-inflation and skew, can be explored. It is possible to vary the proportion of zeros for the outcome variable and to simultaneously vary the degree of skew of the nonzeros for this outcome variable.
Hence, the following research questions are presented:

* Given one two-level categorical covariate with known values and one continuous covariate with known values, what is the difference in the estimated log-likelihood between a) the negative binomial Poisson model vs. the Poisson model; b) the Hurdle model vs. the Poisson model; c) the negative binomial Hurdle model vs. the negative binomial Poisson model; d) the negative binomial Hurdle model vs. the Hurdle model; and e) the negative binomial ZIP model vs. the ZIP model?

* Given one two-level categorical covariate with known values and one continuous covariate with known values, what is the difference in the estimated AIC between all models?

These questions were answered by establishing several levels of skew and zero-inflation. The covariates and their values were fixed as one continuous variable from a standard normal distribution and one binary variable. Data for the outcome variable, for all levels of skew and zero-inflation, were simulated to estimate log-likelihood values, AIC indices, covariate coefficients, standard errors, and p-values. The details are delineated in the following chapter. The objective is consistent with Zorn's (1996) advice: "First and foremost, work should be undertaken to better ascertain the statistical properties of the various estimators outlined here. It is important that we determine the robustness of these techniques to skewness in the dependent variable, model misspecification, and the host of other problems that all too frequently plague political science researchers ... perhaps using Monte Carlo methods to assess under what circumstances the results of the two may diverge" (Zorn, 1996, p. 12). Table 2-1.
Five pairs of nested models valid for statistical comparison

Model       Comparison 1  Comparison 2  Comparison 3  Comparison 4  Comparison 5
Poisson     Simple        Simple
NB Poisson  Complex                     Simple
Hurdle                    Complex                     Simple
NB Hurdle                               Complex       Complex
ZIP                                                                 Simple
NB ZIP                                                              Complex

Table 2-2. Summary of literature on zero-inflation

Researcher(s)                         | Models compared                        | Zeros          | Superior model                                 | Comments
Min & Agresti (2004)                  | Hurdle vs. ZIP                         | Zero-deflation | Hurdle                                         | Simulation
Min & Agresti (2004)                  | Hurdle vs. ZIP                         | Zero-inflation | Equal                                          | Simulation
Mullahy (1986)                        | Hurdle                                 | .26, .62, .41  | Hurdle                                         | Impacted at .26
Lambert (1992)                        | ZIP vs. NB Poisson vs. Poisson         | .718           | ZIP over NB Poisson over Poisson               |
Greene (1994)                         | ZIP vs. NB ZIP vs. NB Poisson          | .894           | NB ZIP over ZIP; NB Poisson over ZIP over Poisson | Heavy skew; probit link
Slymen, Ayala, Arredondo, & Elder (2006) | Poisson vs. NB Poisson; ZIP vs. NB ZIP | .824        | Equal (ZIP vs. NB ZIP)                         | Uniform event stage; overall, AICs favor ZIP
Xie, He, & Goh (2001)                 | ZIP vs. Poisson                        | .87            | ZIP                                            | Outliers
Böhning et al. (1999)                 | Poisson vs. ZIP                        | .216, .289     | ZIP                                            | Zero-deflation; normal event stage
Zorn (1996)                           | Hurdle vs. ZIP                         | .958           | Equal                                          | Heavy skew
Pardoe & Durham (2003)                | Poisson vs. Hurdle vs. ZIP             | .702           | NB ZIP over Poisson over Hurdle                | Based on AICs
Warton (2005)                         | ZIP vs. NB ZIP                         | Various        | NB ZIP                                         | Overdispersion favored NB ZIP
Warton (2005)                         | ZIP vs. NB ZIP                         | Very low       |                                                | Rare convergence for NB ZIP
Warton (2005)                         | NB ZIP vs. OLS vs. Poisson             | Various        | ZIP over OLS/NB ZIP over Poisson               | Based on AICs
Welsh, Cunningham, Donnelly, & Lindenmayer (1996) | Hurdle vs. ZIP             | .66            | Equal (ZIP or NB ZIP)                          | No overdispersion

CHAPTER 3
METHODOLOGY

The previous chapter fulfilled three objectives. First, it described the statistical models and methods for analyzing count data, including data that are zero-inflated. Second, it presented research, both technical and applied, pertaining to three models (Poisson, Hurdle, and ZIP) as well as their negative binomial formulations.
Third, it was concluded that there is considerable divergence in findings between models and that such differences should be explored by examining different levels of zero-inflation and skew for the count outcome variable. This led to the following research questions:

Research Questions

* Given one two-level categorical covariate with known values and one continuous covariate with known values, what is the difference in the estimated log-likelihood between a) the negative binomial Poisson model vs. the Poisson model; b) the Hurdle model vs. the Poisson model; c) the negative binomial Hurdle model vs. the negative binomial Poisson model; d) the negative binomial Hurdle model vs. the Hurdle model; and e) the negative binomial ZIP model vs. the ZIP model?

* Given one two-level categorical covariate with known values and one continuous covariate with known values, what is the difference in the estimated AIC between all models?

As recommended by Zorn (1996), these questions were answered using a Monte Carlo study in which the proportion of zeros and the skew of the distribution of the event stage counts varied between simulations.

Monte Carlo Study Design

Monte Carlo studies begin in the same manner as other research methods. First, a problem is identified as a research question. Second, the problem is made concrete in the form of a hypothesis or set of hypotheses. Third, the hypotheses are tested using rigorous methods. Fourth, conclusions are drawn from these results. Fifth, the implications and limitations are elucidated for future researchers and practitioners. For this study, the problem was zero-inflation for count outcomes. The problem was then clarified in the form of the research questions stated above. It is at this point that Monte Carlo studies differ from most other methods. For the Monte Carlo study, no real data are gathered.
Rather, a set of samples is generated based on given parameter specifications, resulting in a sampling distribution that is considered to be equivalent to that which would have been obtained had this many real participants been available. This is a clear advantage over the true experiment, where statistics are used to infer from one sample to an entire population (Mooney, 1997). The Monte Carlo study is not limited to a finite number of participants and is subsequently not prone to violations of asymptotic theory (Paxton, Curran, Bollen, Kirby, & Chen, 2001). Paxton, Curran, Bollen, Kirby, and Chen (2001, p. 287) provide a succinct explanation as follows:

The researcher begins by creating a model with known population parameters (i.e., the values are set by the researcher). The analyst then draws repeated samples of size N from that population and, for each sample, estimates the parameters of interest. Next, a sampling distribution is estimated for each population parameter by collecting the parameter estimates from all the samples. The properties of that sampling distribution, such as its mean or variance, come from this estimated sampling distribution.

Similarly, Mooney (1997, p. 2) explains,

Monte Carlo simulation offers an alternative to analytical mathematics for understanding a statistic's sampling distribution and evaluating its behavior in random samples. Monte Carlo simulation does this empirically, using random samples from known populations of simulated data to track a statistic's behavior. The basic concept is straightforward: If a statistic's sampling distribution is the density function of the values it could take on in a given population, then its estimate is the relative frequency distribution of the values of that statistic that were actually observed in many samples drawn from that population.

The Monte Carlo simulations were performed using the R programming language (R Development Core Team, 2006).
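The procedure Paxton et al. describe can be illustrated with a toy version of the present design. The sketch below is my own illustration (not the dissertation's R code): it draws repeated zero-inflated samples and collects a naive Poisson estimate from each, forming the estimated sampling distribution:

```python
import math
import random

random.seed(1)

def draw_zip(n, lam, pi):
    """Simulate n zero-inflated Poisson counts: with probability pi emit
    a structural zero, otherwise a Poisson(lam) draw (Knuth's method)."""
    out = []
    for _ in range(n):
        if random.random() < pi:
            out.append(0)
            continue
        threshold, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= random.random()
            if p <= threshold:
                break
            k += 1
        out.append(k)
    return out

# Repeated samples -> sampling distribution of the naive Poisson MLE
# (the sample mean); under 50% zero-inflation it centers near
# (1 - pi) * lam = 1.0 rather than the event-stage mean lam = 2.0.
estimates = [sum(draw_zip(100, 2.0, 0.5)) / 100 for _ in range(500)]
print(round(sum(estimates) / len(estimates), 2))
```

The bias of the pooled mean as an estimate of the event-stage rate is exactly the distortion that motivates the two-part models compared in this study.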
R is an open-source language based on the commercially available S-Plus program (Insightful, 2005) and is just one of many programs that can be used to model zero-inflated data.23

Monte Carlo Sampling

Pseudo-Population

Mooney (1997) explains that defining the population parameters requires defining the pseudo-population. In the pseudo-population, the values for a categorical factor, X1, with two levels were constant at either 0 or 1; this is representative of a typical two-level categorical factor such as sex. In the dataset, these values alternated. The Excel 2003 random number generator was used to draw a random selection of 1,000 normally distributed values to represent the continuous variable, X2 ~ N(0, 1). This resulted in a pseudo-population of N = 1,000 where the values for categorical X1 and continuous X2 were known. The values for X2 ranged from -2.945 to 3.28 with a mean of 0, a median of 0.05, a standard deviation of 0.986, and skew of -0.127. These two sets of covariates and their distributions were chosen as a parsimonious generalization of the basic ANCOVA general linear model, which extends to analyses with either quantitative or qualitative predictors.24 The simulations varied in terms of a) the amount of zero-inflation present in the outcome variable scores; b) the amount of skew present in the event-stage outcome variable scores; and c) the generalized linear model. The outcome variable, Y, was established as a deterministic variable in that it varied systematically as a function of the specified distributions. As clarified by Mooney (1997), "Deterministic variables are vectors of numbers that take on a range of values in a prespecified, nonrandom manner" (p. 6). The regression coefficients for X1 and X2 are random variables that

23 Others include Stata, LIMDEP, COUNT, MATLAB, and SAS.
Preliminary analyses compared results between SAS code incorporated in research (Min & Agresti, 2004; Min, 2003) and publicly available R code written by Simon Jackman of the Stanford Political Science Computing Laboratory to verify the comparability of results.

24 The model is similar to that of the AEWR examples.

take on their realizations as a result of the relationship between deterministic Y and the two random X covariates.

The Prespecified Zero Proportions

Generating values with prespecified proportions of event counts, with frequencies also determined by prespecified proportions of zeros, is justified by Mooney (1997): "If we know (or are willing to make assumptions about) the components that make up a statistic, then we can simulate these components, calculate the statistic, and explore the behavior of the resulting estimates" (Mooney, 1997, p. 67). In his case, the concern was bias, determined by calculating statistics and inspecting graphs resulting from Monte Carlo simulations. For this study, the concern was goodness-of-fit for six models, compared via log-likelihoods and AICs and inspection of graphs resulting from Monte Carlo simulations over three levels of skew and five levels of zero proportions. Previous research displayed zero-inflation ranging from .20 (Mullahy, 1986) to .96 (Zorn, 1996). To reflect a range including both zero-deflation and zero-inflation, five pseudo-populations were established differing in the proportion of zeros present in the count outcome variable. The pseudo-populations contained either 0.10, 0.25, 0.50, 0.75, or 0.90 proportions of zeros.25

Prespecified Skew

To manipulate skew, the event-stage distributions for the count outcome variable were varied over three conditions. For each condition, proportions were specified and values were drawn randomly from a multinomial distribution such that their frequencies, added to the frequencies already drawn to represent zero-inflation, summed to 1,000.
In other words, if .50 (or 500) of the observations were assigned a value of zero, then the remaining .50 (or 500) observations had values drawn from the prespecified multinomial distribution. Event-stage values ranging from one to five were sampled in order to represent a range small enough to distinguish the distribution from one that might be analyzed as continuous given a particular shape. The prespecified probabilities for each of the five counts were determined primarily to achieve a particular level of skew and secondarily to achieve a convergent and admissible solution. Hence, the proportions were not always exactly equal to .10, .25, .50, .75, and .90; the approximates were selected to achieve convergence, leading to more trust in convergence for the Monte Carlo simulations. Table 3-1 displays the proportions of each count as a function of the three levels of skew and five levels of zeros. Table 3-2 displays this same information in terms of frequencies instead of proportions. Table 3-3 displays the descriptive statistics for each distribution.

25 Originally, a 0.00 proportion of zeros was included as a control condition. However, for the case of negative skew, this is simply a count model truncated at one. And, for a normal-distribution event stage, this is semicontinuous data, often tested with methods different from those for zero-inflation (Min, 2002).

Random Number Generation

By definition, a random number is one whose value cannot be determined a priori. Most statistical analysis software packages include random number generators. However, these generated numbers are not truly random. Usually, one specifies a seed; when replications are performed using the same seed, the generated numbers are identical to the first set. Hence, the values are pseudo-random (Bonate, 2001).
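This pseudo-random property is easy to demonstrate in R: re-setting the same seed reproduces a generated sample exactly. The sketch below is illustrative; it borrows the study's seed of 6770 and the Pos10 event-stage proportions from Table 3-1, but is not the study's own script.

```r
# Drawing twice from the same seed yields identical samples.
# sample() normalizes prob internally, so the event-stage proportions
# (which sum to .90 in Table 3-1) can be used as given.
p_pos10 <- c(.504, .227, .088, .054, .027)  # positive-skew event stage

set.seed(6770)
draw1 <- sample(1:5, size = 900, replace = TRUE, prob = p_pos10)

set.seed(6770)
draw2 <- sample(1:5, size = 900, replace = TRUE, prob = p_pos10)

identical(draw1, draw2)   # TRUE: runs can be checked and repeated
```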
However, this limitation is actually an advantage in that the researcher can check for errors in model programming and run the analysis again with the same generated sample (Mooney, 1997).26

26 Technically, this is only true for most random number generators. The R programming language, by default, bases its generation on the Wichmann-Hill algorithm and the system clock, resulting in a 626-integer seed (R Development Core Team, 2006).

Another feature of Monte Carlo random sampling pertains to the desired distributions. Typically, the random numbers are drawn from a uniform distribution, which is then transformed to the desired distribution. As Mooney (1997) explains, "In its standard form, U(0,1), the uniform distribution is the building block of all Monte Carlo simulation work in that from it, in one way or another, variables with all other distribution functions are derived. This is because the U(0,1) distribution, with its 0 < x < 1 range, can be used to simulate a set of random probabilities, which are used to generate other distribution functions through the inverse transformation and acceptance-rejection methods" (p. 10). The random number generation was performed using R 2.3.1 (R Development Core Team, 2006). The procedure requires the generic sample command, in which the following were specified: 1) the range of counts, which in this study was one to five (not zero to five, since the proportions of zeros were already drawn from the pseudo-population); 2) the number of values to draw, which in this study was 1,000 times one minus the prespecified proportion of zeros; 3) the proportions for each value in the range, which in this case was one of three possibilities determining skew; and 4) the specification to sample with replacement. The seed was arbitrarily set at 6770.

Sample Size

Determining the appropriate sample size for each simulate is an important concern. This could range from zero to infinity.
However, if the sample size is too small, then it is not safe to assume that estimates are asymptotically normal. On the other hand, computer time and burden increase as sample size increases. The sample size was based on the highest found in the literature pertaining to zero-inflated count data, which was n = 1,000 (Civettini & Hines, 2005).

Simulation Size

Determining the number of simulations is also an important concern, since too few replications may result in inaccurate estimates and too many replications may unnecessarily overburden computer time and performance (Bonate, 2001). Hur's (1999) research pertaining to the ZIP model with random effects set the number of simulations at 200. Min and Agresti (2005) were able to sufficiently compare the goodness of fit for several competing models using 1,000 simulations. Likewise, Civettini and Hines (2005) selected 1,000 simulations when researching misspecification in negative binomial ZIP models. Lambert (1989) set the number of simulations at 2,000 when researching the asymptotic properties of the ZIP model. Mooney (1997) states that "The best practical advice on how many trials are needed for a given experiment is 'lots'! Most simulations published recently report upward from 1,000 trials, and simulations of 10,000 and 25,000 trials are common" (p. 58). Given the previously noted problems with convergence for the negative binomial ZIP model, it seemed prudent to minimize the number of simulations as much as possible. However, it is also important to simulate under conditions already found to produce asymptotic results. Hence, similar to Lambert's (1989) seminal study and equal to the maximum found in the literature, the number of simulations was set at 2,000 for each condition (S = 2,000).

Iteration Size

Iteration size is not much of a concern for the basic Poisson and negative binomial Poisson models.
However, obtaining valid estimates for the Hurdle model, ZIP model, and their negative binomial counterparts requires selecting an appropriate maximum number of iterations. Too few iterations can lead to incorrect estimates or, even worse, premature declaration of nonconvergence. Too many iterations result in unnecessary computer time and burden. The various procedures for analyzing these models in R have maximum iterations of 500, 5,000, and 50,000. It was important to determine an iteration size that would be equal across analyses and large enough given some models' potential for nonconvergence. These concerns were deemed more important than excessive computational burden. Hence, the largest iteration size was selected for all procedures, leading to 50,000 iterations per analysis.

Distribution Generation

The following describes the procedure for generating the distributions for each simulation.

* The total proportion of counts out of 1,000 to be sampled was reduced by the prespecified proportion of zeros. Hence, if the proportion of zeros was .50, then the proportion of event-stage counts was 1.00 - .50 = .50. Translated into frequencies, this is 1,000(.50) = 500.

* The generic R sample procedure was used to sample with replacement from the event-stage counts according to the proportions specified for the skew condition. The seed was set at 6770.

* The values were sampled over N = 1,000.

* Each sample was simulated S = 2,000 times.

* The data over all S = 2,000 simulates at N = 1,000 were stored in separate files as they were created. The filenames conformed to the labeling format with the model replaced by an underscore (e.g., _25Pos was the filename for the S = 2,000 datasets at N = 1,000 where the proportion of zeros was .25 and the skew for the event stage was positive).

Monte Carlo Models

As previously discussed, generalized linear models include a random component, a systematic component, and a link function.
The X1 and X2 constants form the systematic component in the pseudo-population's generalized linear model. The random component specification for the distribution of the outcome mean varied from pseudo-population to pseudo-population. The base generalized linear model assuming a normally distributed outcome is given by

Y = b0 + b1(X1) + b2(X2) + e (3-1)

Subsequent models extended from this base model to form the six distributions for deterministic Y. The first model, which was the Poisson generalized linear model with a log link, is given by

log(lambda) = b0 + b1(X1) + b2(X2) (3-2)

Table 3-4 displays the parameters for this model over all conditions of skew and zero-inflation. For example, the analysis with a .10 proportion of zeros and a positively skewed event-stage distribution yielded

log(lambda) = .450 + .004(X1) - .037(X2) (3-3)

For both predictors, a coefficient near zero transforms to an exponentiation near one. The coefficient for X1 is lowest (.001) for the negatively distributed data with a .25 proportion of zeros and highest (.007) for the positively distributed data with a .50 proportion of zeros. The coefficient for X2 is lowest (.007) for the negatively distributed data with a .10 proportion of zeros and highest (.165) for the normally distributed data with a .90 proportion of zeros. Hence, for the two-level categorical X1 variable, changing from zero to one multiplies the mean of the outcome variable by approximately exp(0.00), which equals one. For the simulated data, the test for the coefficient estimates corresponding to these pseudo-population parameter values is approximately H0: b = 0. The second model, which was the negative binomial formulation of the Poisson model, is the same as the Poisson model with the addition of a dispersion parameter. Table 3-5 displays the parameters for this model over all conditions of skew and zero-inflation. Like the Poisson model, the test for the simulated data is H0: b = 0.
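A minimal R sketch of this first model follows, assuming one simulated dataset built under the .10-zeros, positive-skew condition. The data-generation lines are a reconstruction for illustration (covariate layout per the pseudo-population description; event-stage proportions from Table 3-1), not the study's own script.

```r
# Build one simulated dataset and fit the Poisson GLM of equation 3-2
set.seed(6770)
N  <- 1000
x1 <- rep(c(0, 1), length.out = N)   # two-level categorical covariate
x2 <- rnorm(N)                       # continuous covariate, ~N(0, 1)

# 100 zeros plus 900 positively skewed event-stage counts
# (Pos10 row of Table 3-1), shuffled across the N cases
y <- sample(c(rep(0, 100),
              sample(1:5, size = 900, replace = TRUE,
                     prob = c(.504, .227, .088, .054, .027))))

fit <- glm(y ~ x1 + x2, family = poisson(link = "log"))
exp(coef(fit)[-1])   # covariate effects: both near exp(0) = 1
```

Because the counts are generated independently of X1 and X2, both exponentiated coefficients land near one, mirroring the near-zero pseudo-population parameters described above.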
The dispersion parameter is also included in Table 3-5. For the simulated data, the test that this parameter equals zero has results equivalent to those of the model comparison tests conducted. The third and fifth models were the general formulations of the Hurdle and ZIP models, given by

logit(p) = log(p / (1 - p)) = g0 + g1(Z1) + g2(Z2) (3-4)

log(lambda) = b0 + b1(X1) + b2(X2) (3-5)

while the fourth and sixth models were their negative binomial formulations. Tables 3-6 through 3-13 display the pseudo-population coefficients, standard errors, log-likelihood values, and AIC values over all conditions of proportions of zeros and skew. Although there are four models (i.e., Hurdle, negative binomial Hurdle, ZIP, and negative binomial ZIP), there are eight tables since, for each model, there are separate results for the transition (zeros) stage and events (nonzero counts) stage.

Monte Carlo Analysis Procedures

A generic looping procedure was written in the R programming language to retrieve each of the 15 sets of simulated data. Each dataset was analyzed with each of the six models. The Poisson model used the generic glm procedure in R, which requires specification of the model, Poisson distribution, and log link. The negative binomial Poisson model used the glm.nb procedure in R from the MASS library. This procedure requires only the specification of the model. The R Development Core Team (2006) describes the procedure as follows: "An alternating iteration process is used. For given 'theta' [dispersion] the GLM is fitted using the same process as used by 'glm()'. For fixed means the 'theta' parameter is estimated using score and information iterations. The two are alternated until convergence of both is achieved" (R Development Core Team, 2006). The Hurdle model used the hurdle procedure in the pscl library authored by Simon Jackman, PhD, of the Political Science Computing Laboratory at Stanford University.
The procedure requires specification of the count response variable (Y), the Poisson distribution for the event stage, the logit link function for the transition stage, and the models for both the transition stage and event stage, which for this study were X1 and X2. The negative binomial Hurdle model also used the hurdle procedure but specified a negative binomial rather than a Poisson distribution for the event stage. The hurdle procedure maximizes the log-likelihood using either the Broyden-Fletcher-Goldfarb-Shanno (BFGS) or Nelder-Mead methods. The Nelder-Mead method (the default) was selected for solution optimization. The ZIP model used the zicounts procedure in the zicounts library authored by Samuel Mwalili, doctoral student in biostatistics at Katholieke Universiteit Leuven (Belgium). Similar to the hurdle procedure, zicounts requires specification of the count outcome variable and models for both the transition stage and event stage. The distribution is specified as "ZIP". Optimization procedures include the BFGS, Nelder-Mead, and conjugate gradient (CG) methods; for consistency, the Nelder-Mead method was chosen. The negative binomial ZIP model used the same procedure with the specification of the "ZINB" distribution. The specific procedure for analyzing the data with the models was as follows:

* Three separate loops were established for the negatively skewed, normal, and positively skewed distributions.

* Arrays were created to store the log-likelihood, AIC, coefficient estimates, standard errors, and p-values.

* The outermost loop for each distribution pertained to the five conditions of varying zero proportions; at each pass, the data corresponding to that zero condition and distribution were loaded.

* Within these loops, another looping procedure pertained to the six models that analyzed the data.

* Within these loops, another looping procedure was defined for the number of simulations in the dataset.
* It was at this point that the data corresponding to a particular distribution, proportion of zeros, model, and simulation were analyzed, with the calculated AIC, log-likelihood, coefficient estimates, standard errors, and p-values transferred to the appropriate array. Models that failed to converge were automatically coded as NA.27

* The results for each statistic over all simulations for a particular distribution, model, and proportion of zeros were then exported in comma-separated format for subsequent analyses. Hence, the three distributions by five proportions of zeros by six models yielded 90 data files, each containing columns pertaining to a particular statistic or set of statistics and 2,000 rows of simulated results.

Analysis Design

A series of tests was conducted using the simulated data. The results and graphical output were created using R 2.01 (R Development Core Team, 2006) and SPSS 14.0. The design for a particular analysis depended on the research question. These questions were as follows:

* Given one two-level categorical covariate with known values and one continuous covariate with known values, what is the difference in the estimated log-likelihood between a) the negative binomial Poisson model vs. the Poisson model; b) the Hurdle model vs. the Poisson model; c) the negative binomial Hurdle model vs. the negative binomial Poisson model; d) the negative binomial Hurdle model vs. the Hurdle model; and e) the negative binomial ZIP model vs. the ZIP model?

* Given one two-level categorical covariate with known values and one continuous covariate with known values, what is the difference in the estimated AIC between all models?

The first question was answered by calculating log-likelihood values for the six models over the five zero proportion conditions and the three skew conditions. The deviance statistic was then calculated as 2(LLc - LLs), where LLs is the log-likelihood for the model with fewer parameters (i.e., the simpler model) and LLc is the log-likelihood for the more complex model.
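For a single simulated dataset, the deviance comparison just described reads as below; the Poisson vs. negative binomial Poisson pair is shown because it differs by exactly one parameter. This is an illustrative reconstruction using glm and MASS::glm.nb as described in the procedures above, with data built under the .50-zeros, positive-skew condition of Table 3-1.

```r
library(MASS)   # glm.nb for the negative binomial Poisson model

# One simulated dataset: .50 proportion of zeros, positive event-stage skew
set.seed(6770)
N  <- 1000
x1 <- rep(c(0, 1), length.out = N)
x2 <- rnorm(N)
y  <- sample(c(rep(0, 500),
               sample(1:5, size = 500, replace = TRUE,
                      prob = c(.280, .125, .050, .030, .015))))

simple  <- glm(y ~ x1 + x2, family = poisson)   # LLs: fewer parameters
complex <- glm.nb(y ~ x1 + x2)                  # LLc: adds dispersion

# Deviance test: 2(LLc - LLs), one df for the dispersion parameter
dev <- as.numeric(2 * (logLik(complex) - logLik(simple)))
favors_complex <- as.numeric(dev > qchisq(.95, df = 1))   # coded 1 or 0
```

Across the 2,000 simulates of a condition, the mean of the 0/1 codes gives the proportion of datasets favoring the more complex model.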
Since there were 5 conditions for the proportions of zeros and 3 conditions for skew, this led to a) a total of 15 separate analyses comparing the fit of the negative binomial Poisson model and the Poisson model; b) 15 separate analyses comparing the fit of the Poisson model and the Hurdle model; c) 15 separate analyses comparing the fit of the negative binomial Poisson model and the negative binomial Hurdle model; and d) 15 separate analyses comparing the fit of the ZIP model and the negative binomial ZIP model. Test statistics for the deviance were assumed asymptotically chi-square with degrees of freedom equal to the difference in the number of parameters between the two models. Some models differed only by the dispersion parameter; these were a) the comparison of the Poisson model and negative binomial Poisson model; b) the comparison of the Hurdle model and negative binomial Hurdle model; and c) the comparison of the ZIP model and negative binomial ZIP model. The log-likelihood statistic for the Hurdle model is based on the log-likelihood statistics from each of its two parts. Given three parameters in the Poisson model (i.e., the intercept and the two predictor coefficients), there are six parameters in the Hurdle model. Hence, the degrees of freedom for the model comparison test are the difference, which is three. The same is true for comparing the negative binomial Poisson model to the negative binomial Hurdle model, where including the dispersion parameters leads to subtracting 4 parameters from 7 parameters. Each of the 2,000 goodness-of-fit statistics for the simpler model was subtracted from each of the 2,000 goodness-of-fit statistics for the more complex model. These results were then multiplied by 2. Each of these values was then compared to the chi-square distribution with the appropriate degrees of freedom.

27 This number was chosen to reduce the probability of obtaining a valid result that would be identified as a nonconvergent solution.
This yielded a p-value representing the probability of obtaining a statistic this high or higher given that the simpler model adequately fits the data. Values exceeding the critical chi-square statistic based on a Type I error rate of a = .05 were coded "1", with these results suggesting that the more complex model fits the data better than the simpler model. Values failing to exceed the critical value were coded "0", with these results suggesting adequate fit for the simpler model. The results were thus the proportion of simulated datasets favoring the more complex model over the simpler model for a particular proportion of zeros and level of skew. Output included 1) descriptive statistics for the goodness-of-fit for each model and 2) boxplots for the difference in goodness-of-fit between the two models.

Answering the second question was done in a similar fashion to answering the first question. This is due to the fact that the AIC is a linear transformation of the log-likelihood statistic with a result that is positive in sign and is interpreted in a lower-is-better fashion. However, these analyses did not involve comparisons to the chi-square distribution. As previously explained, the AIC should not be used in this manner. However, the advantage is that the AIC can be used to descriptively compare all models regardless of whether one is nested within another. Boxplots were created to display the range of AICs produced over simulations for each analysis.

Table 3-1.
Proportions of counts as a function of zeros and skew

Study    Proportion of zeros   Ones    Twos    Threes   Fours   Fives
Pos10    0.10                  0.504   0.227   0.088    0.054   0.027
Norm10                         0.099   0.198   0.306    0.198   0.099
Neg10                          0.027   0.054   0.091    0.227   0.501
Pos25    0.25                  0.418   0.190   0.075    0.046   0.021
Norm25                         0.083   0.166   0.254    0.166   0.081
Neg25                          0.022   0.045   0.075    0.188   0.420
Pos50    0.50                  0.280   0.125   0.050    0.030   0.015
Norm50                         0.053   0.107   0.175    0.110   0.055
Neg50                          0.015   0.030   0.050    0.125   0.280
Pos75    0.75                  0.140   0.062   0.025    0.015   0.008
Norm75                         0.028   0.052   0.089    0.053   0.028
Neg75                          0.008   0.015   0.025    0.062   0.140
Pos90    0.90                  0.056   0.025   0.010    0.006   0.003
Norm90                         0.011   0.021   0.035    0.024   0.009
Neg90                          0.004   0.005   0.010    0.025   0.056

Table 3-2. Frequencies of counts as a function of zeros and skew

Study    Frequency of zeros    Ones   Twos   Threes   Fours   Fives
Pos10    100                   504    227    88       54      27
Norm10                         99     198    306      198     99
Neg10                          27     54     91       227     501
Pos25    250                   418    190    75       46      21
Norm25                         83     166    254      166     81
Neg25                          22     45     75       188     420
Pos50    500                   280    125    50       30      15
Norm50                         53     107    175      110     55
Neg50                          15     30     50       125     280
Pos75    750                   140    62     25       15      8
Norm75                         28     52     89       53      28
Neg75                          8      15     25       62      140
Pos90    900                   56     25     10       6       3
Norm90                         11     21     35       24      9
Neg90                          4      5      10       25      56

Table 3-3. Descriptive statistics for each distribution (mean, standard deviation, and skew for each of the 15 study conditions, reported both for the full range of counts, 0 to 5, and for the event-stage range, 1 to 5)

Table 3-4.
Poisson model: pseudo-population parameters (intercepts, X1 and X2 coefficients, standard errors, log-likelihoods, and AICs for the 15 skew-by-zero-proportion conditions)

Table 3-5. Negative Binomial Poisson model: pseudo-population parameters (coefficients, standard errors, dispersion, log-likelihoods, and AICs for the 15 conditions)

Table 3-6. Hurdle model (zeros): pseudo-population parameters (transition-stage coefficients, standard errors, log-likelihoods, and AICs for the 15 conditions)

Table 3-7. Hurdle model (events): pseudo-population parameters (event-stage coefficients, standard errors, log-likelihoods, and AICs for the 15 conditions)

Table 3-8.
Negative Binomial Hurdle model (zeros): pseudo-population parameters (transition-stage coefficients, standard errors, log-likelihoods, and AICs for the 15 conditions)

Table 3-9. Negative Binomial Hurdle model (events): pseudo-population parameters (event-stage coefficients, standard errors, dispersion (theta), log-likelihoods, and AICs for the 15 conditions)

Table 3-10.
ZIP model (zeros): pseudo-population parameters (transition-stage coefficients, standard errors, log-likelihoods, and AICs for the 15 conditions)

Table 3-11. ZIP model (events): pseudo-population parameters (event-stage coefficients, standard errors, log-likelihoods, and AICs for the 15 conditions)

Table 3-12.
Negative Binomial ZIP model (zeros): pseudo-population parameters (transition-stage coefficients, standard errors, log-likelihoods, and AICs for the 15 conditions)

Table 3-13. Negative Binomial ZIP model (events): pseudo-population parameters (event-stage coefficients, standard errors, dispersion (theta), log-likelihoods, and AICs for the 15 conditions)

CHAPTER 4
RESULTS

This chapter presents the results based on the previously discussed methods and procedures for analyzing data with varying proportions of zeros and varying event-stage distributions.
First, the results are presented using the data for the pseudo-population, in which the proportions for each count level are exactly those from which the simulations randomly sampled. Tables and figures are included to support interpretation. Second, the results are presented outlined by the skew level (i.e., positive, normal, and negative) and by the proportion of zeros within that skew level (i.e., .10, .25, .50, .75, and .90). For each combination of conditions, the results are presented for the five model comparisons. Third, the results are summarized separately for the negative, normal, and positive event count distributions. Tables and figures are included to support interpretation. The primary purpose of the results was to assist in answering the research questions, which were as follows:

* Given one two-level categorical covariate with known values and one continuous covariate with known values, what is the difference in the estimated log-likelihood between a) the negative binomial Poisson model vs. the Poisson model; b) the Hurdle model vs. the Poisson model; c) the negative binomial Hurdle model vs. the negative binomial Poisson model; d) the negative binomial Hurdle model vs. the Hurdle model; and e) the negative binomial ZIP model vs. the ZIP model?

* Given one two-level categorical covariate with known values and one continuous covariate with known values, what is the difference in the estimated AIC between all models?

Pseudo-Population Results

Before addressing the results of the Monte Carlo simulations, it is necessary to discuss the results when each model was analyzed with the pseudo-population data. The prespecified proportions for each count were displayed in Table 3-1. Table 3-2 displayed this same information in terms of frequencies instead of proportions. Table 3-3 displayed the descriptive statistics for each distribution.
Tables 3-4 through 3-13 displayed the coefficients, standard errors, log-likelihood values, and AIC values, with each table pertaining to either a specific model or one of the two stages of a specific model. The following sections provide the results of comparing models using the pseudo-population data. First, the Poisson model results were compared to the negative binomial Poisson model results, both descriptively via AICs and inferentially via the deviance model comparison test. Second, the results are presented in a likewise manner for the three Hurdle model comparisons: (a) the Hurdle model vs. the negative binomial Hurdle model, (b) the Poisson model vs. the Hurdle model, and (c) the negative binomial Poisson model vs. the negative binomial Hurdle model. Third, the results are presented for the comparison of the ZIP model and the negative binomial ZIP model. Fourth, descriptive results are presented comparing all models for each of the five proportions of zeros. Pseudo-Population Poisson Models Based on the AIC, for all proportions of zeros the data fit the Poisson model better when the distribution was positively skewed than when it was normally distributed, and better when it was normally distributed than when it was negatively skewed. For example, for the negatively skewed distribution with a .10 proportion of zeros, the AIC was 4,044.5. As the curve shifted left to a normal distribution, the AIC dropped to 3,619.3, and as the curve shifted further left to a positively skewed distribution, the AIC dropped to 2,956.3. The same pattern emerged for the negative binomial Poisson models. For example, for the negatively skewed distribution with a .10 proportion of zeros, the AIC was 4,563.8. As the curve shifted left to a normal distribution, the AIC dropped to 4,005.9, and as the curve shifted further left to a positively skewed distribution, the AIC dropped to 3,196.9.
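These AIC comparisons follow directly from each model's maximized log-likelihood and its number of estimated parameters (AIC = 2k − 2·LL, lower indicating better fit). A minimal sketch of that computation, using illustrative log-likelihood values rather than figures from the tables:

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*LL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Illustrative values only (not taken from the tables above):
# a Poisson model with 3 estimated parameters and its negative
# binomial counterpart, which adds one dispersion parameter.
ll_poisson = -2020.0
ll_negbin = -1995.0

aic_poisson = aic(ll_poisson, 3)  # 4046.0
aic_negbin = aic(ll_negbin, 4)    # 3998.0

# The model with the lower AIC is preferred.
preferred = "negative binomial" if aic_negbin < aic_poisson else "Poisson"
print(aic_poisson, aic_negbin, preferred)
```

Note that the extra dispersion parameter costs the negative binomial model two AIC points, so its log-likelihood advantage must exceed one unit before the AIC favors it.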
For the .10 proportion of zeros condition, the Poisson model had a lower AIC than the negative binomial Poisson model. For the .25 proportion of zeros condition, the AICs were approximately equal between the two models. For the .50, .75, and .90 proportions of zeros conditions, the AIC was lower for the negative binomial Poisson model than for the Poisson model. The deviance statistic was calculated comparing the Poisson log-likelihood and the negative binomial Poisson log-likelihood for each skew condition and zero proportion condition. These statistics are displayed in Table 4-1. For all analyses, the Poisson model is nested within its negative binomial Poisson formulation, differing by one degree of freedom (i.e., the dispersion parameter in the negative binomial Poisson model). Hence, at the .05 Type I error rate, and assuming deviance statistics asymptotically distributed chi-square, a deviance exceeding 3.84 suggests better fit for the more complex negative binomial Poisson model. This was the result for all analyses in which the proportion of zeros was .50 or greater. However, for all analyses with .10 and .25 proportions of zeros, the deviance did not exceed this critical value, favoring the simpler Poisson model. Pseudo-Population Hurdle Models Hurdle vs. Negative Binomial Hurdle For the Hurdle models, regardless of the proportion of zeros, the positively skewed distributions had a lower AIC than did the normal distributions. These, in turn, had a lower AIC than the negatively skewed distributions. The same was true for the negative binomial Hurdle models. However, for both models, the difference between AICs for the three skew conditions decreased as the proportion of zeros decreased (i.e., the models became more similar in fit). When comparing the Hurdle models and negative binomial Hurdle models, the AICs appear to be similar regardless of the proportion of zeros and regardless of skew.
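The deviance comparisons used throughout this chapter all take the same form: D = 2(LL_complex − LL_nested), referred to a chi-square distribution with degrees of freedom equal to the number of extra parameters in the more complex model. A brief sketch using the two .05 critical values quoted in the text (3.84 for df = 1, 7.82 for df = 3); the log-likelihood values here are illustrative, not taken from the tables:

```python
# .05 chi-square critical values for the df used in this chapter
CHI_SQ_CRIT_05 = {1: 3.84, 3: 7.82}

def deviance_test(ll_nested, ll_complex, df):
    """Likelihood-ratio (deviance) test for nested count models.

    Returns the deviance and whether it exceeds the .05 critical
    value, i.e., whether the more complex model fits significantly
    better than the nested model.
    """
    deviance = 2 * (ll_complex - ll_nested)
    return deviance, deviance > CHI_SQ_CRIT_05[df]

# Illustrative: Poisson nested in negative binomial Poisson (df = 1)
d1, nb_better = deviance_test(-2020.0, -1995.0, df=1)      # 50.0, True

# Illustrative: Poisson nested in Hurdle (df = 3)
d3, hurdle_better = deviance_test(-2020.0, -2017.0, df=3)  # 6.0, False
print(d1, nb_better, d3, hurdle_better)
```

When the deviance falls below the critical value, the simpler nested model is retained on parsimony grounds rather than "winning" the test outright.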
The deviance statistic was calculated comparing the Hurdle log-likelihood and the negative binomial Hurdle log-likelihood for each skew condition and zero proportion condition. These statistics are displayed in Table 4-2. For all analyses, the Hurdle model is nested within its negative binomial Hurdle formulation, differing by one degree of freedom (i.e., the dispersion parameter in the negative binomial Hurdle model). Hence, at the .05 Type I error rate, and assuming deviance statistics asymptotically distributed chi-square, a deviance exceeding 3.84 suggests better fit for the more complex negative binomial Hurdle model. For the positively skewed distributions, the negative binomial Hurdle model fit significantly better than the Hurdle model, except when the proportion of zeros was .90; in that case the deviance statistic did not exceed 3.84. For the normal distributions, the Hurdle model was favored over the negative binomial Hurdle model when the proportion of zeros was .10 or .25. When the proportion of zeros was .50, .75, or .90, the deviance did not exceed 3.84. For the negatively skewed distributions, the Hurdle model was favored over the negative binomial Hurdle model when the proportion of zeros was .10, .25, .50, or .75. When the proportion of zeros was .90, the deviance statistic did not exceed 3.84. Poisson vs. Hurdle Deviance statistics were also calculated to compare the Poisson log-likelihood and the Hurdle log-likelihood for each skew condition and zero proportion condition. These statistics are displayed in Table 4-3. For all analyses, the Poisson model is nested within the Hurdle model, differing by three degrees of freedom. Hence, at the .05 Type I error rate, and assuming deviance statistics asymptotically distributed chi-square, a deviance exceeding 7.82 suggests better fit for the more complex Hurdle model. The deviances were large, supporting the Hurdle model fit over the Poisson model fit.
In fact, several of the deviances were over 1,000. There was only one analysis that did not favor the Hurdle model fit over the Poisson model fit: the deviance calculated for the positively skewed distribution with a .25 proportion of zeros. Negative Binomial Poisson vs. Negative Binomial Hurdle Deviance statistics were also calculated to compare the negative binomial Poisson log-likelihood and the negative binomial Hurdle log-likelihood for each skew condition and zero proportion condition. These statistics are displayed in Table 4-4. For all analyses, the negative binomial Poisson model is nested within the negative binomial Hurdle model, differing by three degrees of freedom (i.e., the duplication of parameters in the negative binomial Hurdle model to represent both a transition stage and an event stage; the dispersion parameter is redundant between the two models). Hence, at the .05 Type I error rate, and assuming deviance statistics asymptotically distributed chi-square, a deviance exceeding 7.82 suggests better fit for the more complex negative binomial Hurdle model. As was the case when comparing the Poisson model and the Hurdle model, the deviances were large. Likewise, there was one deviance that did not exceed the critical value; however, it was for the positively skewed distribution with .50 zeros rather than the positively skewed distribution with .25 zeros. Pseudo-Population ZIP Models The pseudo-population results for the ZIP models were rather similar to those obtained for the Hurdle models. Graphically, regardless of skew condition and proportion of zeros, the AICs for the ZIP models and negative binomial ZIP models were very similar. Additionally, the AICs appeared to be equal between the .10 and .25 proportions of zeros conditions. The deviance statistic was calculated comparing the ZIP log-likelihood and the negative binomial ZIP log-likelihood for each skew condition and zero proportion condition. These
statistics are displayed in Table 4-5. For all analyses, the ZIP model is nested within its negative binomial ZIP formulation, differing by one degree of freedom (i.e., the dispersion parameter in the negative binomial ZIP model). Hence, at the .05 Type I error rate, and assuming deviance statistics asymptotically distributed chi-square, a deviance exceeding 3.84 suggests better fit for the more complex negative binomial ZIP model. The deviance was significant for only two comparisons: when the distribution was positively skewed, the fit was significantly better for the negative binomial ZIP model than for the ZIP model when the proportion of zeros was either .50 or .75. All other results suggested that the ZIP model fit was adequate. Comparing AICs for All Models For the .10 proportion of zeros condition with negative skew, the AICs were approximately equal for all models except the Poisson models: the AIC for the Poisson model was higher than for the other models, and the AIC for the negative binomial Poisson model was highest of all. When the distribution was normal, the only model to have a noticeably different AIC was the negative binomial Hurdle model, which was again highest of all. For the positively skewed distribution, there was some separation between the ZIP and Hurdle models, with the Hurdle models having the lower AIC. The differences between these models and their negative binomial formulations appeared to be trivial. The Poisson model appeared to have the same AIC as those displayed for the ZIP models, and the AIC for the negative binomial Poisson model was again highest of all. Between the three distributions, the AIC declined from the negatively skewed distribution to the normal distribution to the positively skewed distribution. For the .25 proportion of zeros, there was little distinction between the Poisson models and negative binomial Poisson models for all distributions.
Further, there was no distribution displaying a nontrivial distinction between the Hurdle models and the ZIP models. For the positively skewed distribution, all six models appeared approximately equal, with a slightly higher AIC apparent for the negative binomial Poisson model. Between the three distributions, the AICs appeared equal for the negatively skewed distribution and the normal distribution; however, the AICs for the positively skewed distribution were considerably smaller than those for the other two distributions. For the .50 proportion of zeros condition with negative skew, the results for the Poisson model and the negative binomial Poisson model reversed: the AIC for the negative binomial Poisson model was higher than those for the ZIP and Hurdle models, while the AIC for the Poisson model was higher yet. For the .75 proportion of zeros condition, as in the .50 condition, the AIC appeared equal for the ZIP and Hurdle models. Also as in the .50 condition, there appeared to be no difference between any of the models for the positively skewed distribution. Between the three distributions, the AICs declined from the negatively skewed distribution to the normal distribution to the positively skewed distribution, with declines being most rapid for the Poisson and negative binomial Poisson models. Beyond the fact that the overall AIC was lower for the .90 proportion of zeros condition than for the .75 condition, these last two conditions displayed similar results. For each distribution, the Hurdle and ZIP model AICs were approximately equal. They declined slightly from the negatively skewed distribution to the normal distribution and declined a bit more from that distribution to the positively skewed distribution. For the negatively skewed distribution, the negative binomial Poisson AIC was considerably higher than the AICs for the ZIP and Hurdle models; the AIC for the Poisson model was highest of all.
For the normal distribution, both the Poisson and negative binomial Poisson model AICs declined by approximately 50% of their value for the negatively skewed distribution. For the positively skewed distribution, they