ROBUST PARAMETRIC ESTIMATORS FOR HEALTH ECONOMETRIC MODELS WITH SKEWED OUTCOMES AND ENDOGENOUS REGRESSORS

By
MUJDE ZEYNEP ERTEN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2010

© 2010 Mujde Z. Erten

To my family

ACKNOWLEDGMENTS

I am grateful to Dr. Joseph V. Terza, my advisor; without his guidance this dissertation would not have been possible. He was not only an advisor but also a mentor to me during this experience. His deep knowledge and great enthusiasm for the subject, combined with his tolerant and pleasant personality, helped me through this process. I would like to thank my committee members, Dr. David E. M. Sappington, Dr. Chunrong Ai and Dr. Bruce Vogel, for their help and guidance through constructive comments. I owe my deepest gratitude to Dr. Steven Slutsky and Dr. Jonathan Hamilton for their support throughout my studies. I would also like to thank all my professors in the Department of Economics at the University of Florida for generating the most supportive and creative environment in which to pursue my graduate studies. I have had invaluable experience during my teaching and research assistantships. I am grateful to the Institute for Child Health Policy at the University of Florida, the Department of Economics at the University of Florida, and the Agency for Healthcare Research and Quality, under a grant to Dr. Joseph V. Terza (#R01 HS017434 01), for providing me financial support throughout my graduate studies. I would like to extend my gratitude to Dr. Sema Aydede and Dr. Elizabeth Shenkman for introducing me to the health economics area. The work I did with them ignited my interest in health economics. I would also like to thank Dr. Bruce Stuart and Dr. John Mullahy for providing the data sets used in this dissertation.

My family has been there for me through all the good times and the bad times.
I am grateful for the support of my parents, Nail Erten and Muhterem Erten, and my sisters, Selda Erten, Ferah C. Erten, Hande K. Erten, and Hale Erten. Special thanks go to my sister Hale Erten for her invaluable support and incessant help. Last but not least, I would like to thank Ritwik Kumar for his endless support and help, especially for all the brainstorming and programming.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 SKEWED OUTCOMES AND ENDOGENOUS REGRESSORS: PRESCRIPTION DRUG UTILIZATION AND HOSPITAL COST OFFSETS
  2.1 Introduction and Background
  2.2 Accounting for Endogeneity in the Generalized Gamma Estimation Framework
    2.2.1 The Generalized Gamma Model and Popular Special Cases
    2.2.2 Accounting for Endogenous Regressors
  2.3 Simulation Analysis
    2.3.1 Sampling Designs
      2.3.1.1 The observable and unobservable confounders, the instrumental variables and the endogenous variable
      2.3.1.2 The outcome variable
    2.3.2 Estimators to be Evaluated and Compared
    2.3.3 Criteria for Evaluation and Comparison
    2.3.4 Simulation Results
  2.4 Prescription Drug Use and Hospital Cost Offsets
    2.4.1 The Econometric Model
    2.4.2 Data Source and Variables
    2.4.3 Estimation Results
  2.5 Summary, Discussion and Conclusion

3 THE GENERALIZED GAMMA ESTIMATOR WITH A FLEXIBLE FORM CONDITIONAL MEAN REGRESSION SPECIFICATION
  3.1 Introduction and Background
  3.2 Inverse Box-Cox Transformation in the GG Framework
    3.2.1 Inverse Box-Cox Transformation
    3.2.2 Generalized Gamma with a Flexible Form Conditional Mean Function
  3.3 Simulation Analysis
    3.3.1 Sampling Designs
      3.3.1.1 The observable variables
      3.3.1.2 The outcome variable
    3.3.2 Estimators Used in Evaluation and Comparison
    3.3.3 Criteria for Evaluation and Comparison
    3.3.4 Simulation Results
  3.4 The Effect of Cigarette Smoking on Birthweight
    3.4.1 Model
    3.4.2 Results
  3.5 Summary, Discussion and Conclusion

4 MODELING AND ESTIMATING FLEXIBLE FORM HEALTH ECONOMETRIC MODELS WITH ENDOGENEITY
  4.1 Introduction and Background
  4.2 Integrating IBC and Endogenous Confounders into the GG Model
  4.3 Simulation Analysis
    4.3.1 Sampling Designs
      4.3.1.1 The observable and unobservable confounders, instrumental variables and endogenous variable
      4.3.1.2 The outcome variable
    4.3.2 Estimators to be Evaluated and Compared
    4.3.3 Criteria for Evaluation and Comparison
    4.3.4 Simulation Results
  4.4 The Effect of Cigarette Smoking on Birthweight in Presence of Endogeneity
    4.4.1 Model
    4.4.2 Results
  4.5 Summary, Discussion and Conclusion

5 CONCLUSION
  5.1 Summary
  5.2 Limitations and Future Work

APPENDIX

A THE FORMAL DERIVATION OF THE REPARAMETRIZATION OF THE CONDITIONAL MEAN
B THE DERIVATION OF MARGINAL EFFECT OF THE ENDOGENOUS POLICY VARIABLE
C STANDARD ERROR OF THE MARGINAL EFFECT OF THE ENDOGENOUS POLICY VARIABLE
D THE DERIVATION OF THE NLS-IBC MODEL PARAMETER VALUES
E THE DERIVATION OF THE MARGINAL EFFECT FOR THE GG-IBC MODEL

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
2-1 For sample size 10,000, mean squared error of the marginal effect with percent relative efficiency gain
2-2 For sample size 500, mean squared error of the marginal effect with percent relative efficiency gain
2-3 For the large samples, average percentage absolute bias of the marginal effect
2-4 For the small samples, average percentage absolute bias of the marginal effect
2-5 Nested model selection tests from the GGE estimator for sample size 10,000
2-6 Descriptive statistics of the study sample: prescription drug use and hospital cost offsets
2-7 The estimation results of the real data analysis: prescription drug use and hospital cost offsets
3-1 For generalized gamma distributed data, average percentage absolute bias of the marginal effect (in percentages)
3-2 For various sampling designs, average percentage absolute bias of the marginal effect (N=10,000) (in percentages)
3-3 For various sampling designs, mean squared error of the marginal effect with percent relative efficiency gain (N=10,000)
3-4 Parameter estimates (N=10,000)
3-5 For generalized gamma distributed sampling design with various parameter values, average percentage absolute bias of the marginal effect (N=10,000) (in percentages)
3-6 The variable definitions from the birthweight analysis
3-7 Descriptive statistics for the birthweight sample (N=1,388)
3-8 The marginal effect and the cessation effect estimates from the birthweight analysis
4-1 For various sample sizes, average percentage absolute bias of the marginal effect (in percentages)
4-2 For various sampling designs, average percentage absolute bias of the marginal effect (N=10,000) (in percentages)
4-3 For various sampling designs, mean squared error of the marginal effect with percent relative efficiency gain (N=10,000)
4-4 Parameter estimates (N=10,000)
4-5 The variable definitions from the birthweight analysis
4-6 Descriptive statistics for the birthweight sample (N=1,388)
4-7 The 2SRI marginal effect and cessation effect estimates from the birthweight analysis

LIST OF FIGURES

Figure
2-1 Histogram and kernel density estimate of hospital expenditures (Overall)
2-2 Histogram and kernel density estimate of hospital expenditures (Non-zero)
3-1
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ROBUST PARAMETRIC ESTIMATORS FOR HEALTH ECONOMETRIC MODELS WITH SKEWED OUTCOMES AND ENDOGENOUS REGRESSORS

By
Mujde Z. Erten
August 2010

Chair: Joseph V. Terza
Major: Economics

Empirical models in health economics and health services are commonly characterized by skewness in the regression outcome and endogeneity among the regressors. Though some methods in the literature have addressed skewness and endogeneity separately, none comprehensively accounts for both of them together. A new estimator is introduced that combines the generalized gamma approach, which flexibly accommodates skewness in the dependent variable, and the two-stage residual inclusion method, designed to account for endogeneity. This extended generalized gamma model is the first application of the two-stage residual inclusion method to healthcare expenditure analysis in the full information maximum likelihood context. In order to evaluate the generalized gamma with endogeneity estimator, extensive simulation analyses were conducted with data generated using various sampling designs. The simulation results show that this method matches or outperforms other estimation methods that are widely used in the literature. Using the generalized gamma with endogeneity method, the offsetting effect of appropriate prescription drug utilization on hospital costs was estimated in a two-part modeling context.

The generalized gamma model, though effective in addressing skewness, imposes a fixed functional form for the conditional mean regression function. This may lead to misspecification bias. Using a variant of the inverse Box-Cox model, in this dissertation we developed a more robust version of the generalized gamma method.
Using simulated data, this new estimator is tested using a flexible conditional mean specification and compared with alternative estimators. The simulation analyses highlight the advantages of our model in terms of bias and precision. This estimator is applied to estimate the effect of cigarette smoking during pregnancy in a model of birthweight. Finally, this dissertation extends the generalized gamma model in both directions by incorporating the flexible inverse Box-Cox functional form and accommodation for endogenous regressors via the two-stage residual inclusion method. Using simulated data, the bias and efficiency properties of this new estimator are evaluated in comparison with the alternative parametric estimators. The suggested model is applied to birthweight data to estimate the effect of the potentially endogenous cigarette smoking variable.

CHAPTER 1
INTRODUCTION

Health outcomes data have unique properties that need to be carefully addressed in empirical modeling. The researcher is mainly interested in outcome variables such as health costs and utilization that are non-negative by nature. Healthcare expenditures and utilization in the typical population are distributed asymmetrically: a small proportion of the population, those with chronic illnesses, has high levels of utilization and spending, while a very high proportion of the population is healthy and spends very little. This high concentration of health care expenditures and utilization among the chronically ill generates a skewed distribution of the health outcomes data.

In the health econometrics area, researchers have proposed different estimators to deal with the skewness issue. These range from the nonlinear least squares (NLS) method, which is based on only a conditional mean assumption, to the full information maximum likelihood (FIML) estimation techniques, which allow for all of the conditional moments. All of these methods have their shortcomings.
The NLS model is unbiased, since it does not make any assumptions related to skewness, but it is less efficient. FIML methods are prone to misspecification bias. They do, however, take into account higher order conditional moments, including skewness. In the presence of this trade-off, the search is for models that are both efficient and flexible.

Endogeneity in the regressors is another common problem in health outcomes modeling. This problem is caused by the presence of unobservable confounders, i.e., latent variables that are correlated with both the dependent variable and one or more of the regressors. This typically causes bias in estimation because conventional regression methods that do not account for the problem will spuriously attribute the effects of the unobservable confounders to the observable regressors. There is a sizable literature on the development and application of methods to resolve this issue. One of the common methods used in the health outcomes literature is the two-stage least squares (2SLS) estimator. This method generates consistent estimates for linear models; for nonlinear models, however, the results are not consistent. Health outcomes models are generally nonlinear, so alternative techniques for endogeneity correction are needed. One method that has recently come to the fore in this context is the two-stage residual inclusion (2SRI) method (Terza et al., 2008a). Recent applications of this method can be found in Terza et al. (2008b), Carpio et al. (2008), Zhang (2008), Vandegrift and Yavas (2009), Bradford et al. (2010), Etile and Jones (2009), Richardson et al. (2010), etc. Terza et al. (2008a) show that rote analogs to 2SLS in the nonlinear context are not consistent, while proving that 2SRI is a consistent method in the nonlinear setting.
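To make the 2SRI idea concrete, the following is a minimal sketch in Python (standard library only), not the estimator developed in this dissertation. It assumes an exponential conditional mean, a single instrument `w`, and a simulated design in which the first-stage error is the unobservable confounder; the second stage is fitted by a Poisson pseudo-likelihood, which is a stand-in chosen for brevity. All names and parameter values (`xe`, `xu`, the coefficients 0.3 and 0.5) are illustrative assumptions.

```python
import math
import random

def solve(A, b):
    """Solve a small linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * d for a, d in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Ordinary least squares via the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def exp_mean_fit(X, y, iters=50):
    """Fit E[y|X] = exp(Xb) by Newton steps on the Poisson pseudo-likelihood."""
    k = len(X[0])
    b = [0.0] * k
    b[0] = math.log(sum(y) / len(y))  # assumes column 0 is the constant
    for _ in range(iters):
        mu = [math.exp(sum(bj * xj for bj, xj in zip(b, r))) for r in X]
        grad = [sum(r[i] * (yi - m) for r, yi, m in zip(X, y, mu)) for i in range(k)]
        hess = [[sum(r[i] * r[j] * m for r, m in zip(X, mu)) for j in range(k)]
                for i in range(k)]
        step = solve(hess, grad)
        b = [bj + s for bj, s in zip(b, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    return b

# Simulated data (hypothetical design): xu is unobserved by the analyst.
random.seed(0)
n = 2000
w = [random.gauss(0, 1) for _ in range(n)]        # instrument
xu = [random.gauss(0, 1) for _ in range(n)]       # unobservable confounder
xe = [0.5 * wi + ui for wi, ui in zip(w, xu)]     # endogenous regressor
y = [math.exp(0.3 * e + 0.5 * u) * random.gammavariate(2.0, 0.5)  # mean-one noise
     for e, u in zip(xe, xu)]

# Stage 1: auxiliary regression of xe on the instrument; keep the residuals.
a = ols([[1.0, wi] for wi in w], xe)
resid = [e - a[0] - a[1] * wi for e, wi in zip(xe, w)]

# Stage 2: outcome regression with the first-stage residual as an extra regressor.
b_2sri = exp_mean_fit([[1.0, e, r] for e, r in zip(xe, resid)], y)
b_naive = exp_mean_fit([[1.0, e] for e in xe], y)  # ignores endogeneity

print("true coefficient on xe: 0.3")
print("naive estimate:", round(b_naive[1], 3))
print("2SRI estimate: ", round(b_2sri[1], 3))
```

In this design the naive regression attributes part of the confounder's effect to `xe`, while including the first-stage residual in the second stage removes most of that distortion. Standard errors would need to account for the two-stage nature of the estimator, which this sketch does not attempt.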
In regression analysis the conditional mean is typically defined as a linear combination of the regressors. This linearity assumption might lead to bias if the true conditional mean regression model is actually nonlinear. Wooldridge (1992) employed the inverse Box-Cox (IBC) transformation for the conditional mean function for the first time in a regression analysis. Applications of the IBC transformation include Kenkel and Terza (2001), Terza et al. (2008a), Terza et al. (2008b) and Basu and Rathouz (2005). The first three of these implement NLS with an IBC conditional mean regression specification. This approach is unbiased yet inefficient since it is based on NLS. Basu and Rathouz (2005) included the IBC transformation in the generalized linear model (GLM) and sought to improve efficiency by explicitly accounting for heteroscedasticity in the estimation. GLM modeling is, however, prone to misspecification bias and is not as flexible as an NLS model.

In the literature there are some papers pointing out these challenges in modeling health outcomes; however, each of these studies focuses on only one of these problems at a time. Manning et al. (2005) propose the use of a FIML model that not only takes into account the higher order moments such as skewness, kurtosis, etc., but is also flexible by nature. The fully parametric formulation of this model specifies the density of the dependent variable conditional on the regressors as a generalized gamma (GG). The GG model subsumes widely used distributions such as the log-normal, the Weibull, the exponential and the standard gamma. Manning et al. (2005) concentrate on the efficiency properties of the GG model for skewed, thick-tailed and heteroscedastic outcomes.
The use of the GG model is also suggested in other papers such as Basu and Manning (2006) and Hill and Miller (2009).

The endogeneity issue is widely examined in all areas of empirical econometrics, since it is a common problem in empirical modeling. Most of the instrumental variable models proposed to handle this are very restrictive in their assumptions. In the case of nonlinear models there are some options, but they have their own shortcomings. In this regard the 2SRI method stands out since it is a consistent method.

The second chapter in this dissertation proposes a model that not only takes care of skewness in the outcome variable but also takes care of endogeneity in the nonlinear context by taking advantage of the two aforementioned methods: FIML GG (Manning et al., 2005) and 2SRI (Terza et al., 2008a). This new model incorporates the 2SRI approach into the GG framework. We name this new model generalized gamma with endogeneity (GGE). It is flexible since it subsumes all of the aforementioned FIML models that are members of the GG family of distributions. Because it is cast in the framework of a FIML estimator, GGE is likely to approach full efficiency in parametric estimation. The 2SRI component of the GGE method takes care of biases due to endogeneity, which ensures the consistency of the method.

In the third chapter the issue of endogeneity is, for the moment, set aside. The focus in this chapter is on misspecification bias due to the functional form of the conditional mean regression model. Here we incorporate the IBC conditional mean regression specification into the GG model. This new hybrid model, the generalized gamma with inverse Box-Cox transformation (GG-IBC), yields efficient parameter estimates because it is a FIML estimator. The inclusion of the IBC transformation leads to a more flexible regression estimator.
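As an illustration of why the IBC specification is called flexible, here is a minimal Python sketch of the inverse Box-Cox conditional mean in the form usually attributed to Wooldridge (1992): the mean is (1 + lam * index)^(1/lam) for lam not equal to zero, and exp(index) in the limit lam = 0. The function name `ibc_mean`, the argument names, and the example index value 0.7 are assumptions made for this illustration.

```python
import math

def ibc_mean(index, lam):
    """Inverse Box-Cox conditional mean for a linear index x*beta.

    lam = 1 gives the linear form 1 + index; lam = 0 gives exp(index).
    """
    if lam == 0:
        return math.exp(index)
    base = 1.0 + lam * index
    if base <= 0:
        raise ValueError("1 + lam * index must be positive")
    return base ** (1.0 / lam)

# The same index maps to different conditional means as lam varies,
# moving smoothly from the linear case toward the exponential case.
for lam in (1.0, 0.5, 1e-6, 0.0):
    print(lam, ibc_mean(0.7, lam))
```

Because the linear and exponential conditional means are both nested, estimating `lam` alongside the regression coefficients lets the data choose between them rather than fixing the functional form in advance.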
The NLS model with an IBC conditional mean regression specification (NLS-IBC) (Wooldridge, 1992; Kenkel and Terza, 2001; Terza et al., 2008a; Terza et al., 2008b; Basu and Rathouz, 2005) is a reasonable alternative to the GG-IBC model, because both models will yield consistent estimates, but the proposed GG-IBC method will likely be more efficient. We compare the two models in our simulation and real data analysis.

In the fourth chapter we introduce an estimator that brings together all the important elements of the second and the third chapters: conditional density flexibility (skewness accommodation) of the GG model, endogeneity correction as provided by the 2SRI approach, and flexibility in the conditional mean regression as provided by the IBC model. We call the proposed estimator the generalized gamma with endogeneity and IBC transformation (GGE-IBC). Comparisons of this model with the NLS with endogeneity and IBC transformation (NLS-IBC-2SRI) and the GGE are also provided as part of our simulation and real data analysis.

This dissertation is organized in the following manner. Chapter 2 introduces the GGE model, including a description of the GG and 2SRI models. A simulation analysis comparing the GGE estimator to widely used alternative estimators in the literature is conducted. The bases for comparison are the bias and the efficiency of the estimators. The estimator is then also applied to real-world data in an examination of the effect of prescription drug utilization on hospital cost offsets. Chapter 3 develops and details the proposed GG-IBC model and follows a simulation analysis and real data application agenda similar to that of Chapter 2. Here the GG-IBC estimator is compared to the GG and the NLS-IBC estimators. As our real data experiment we re-examine the effect of smoking during pregnancy on newborn birthweight (ignoring the potential endogeneity of smoking in this model).
Chapter 4 describes the GGE-IBC model and details simulation and real data comparisons with the GGE method and the 2SRI variant of the NLS-IBC estimator (NLS-IBC-2SRI). The analysis of the impact of smoking during pregnancy on birthweight introduced in Chapter 3 is extended in Chapter 4 to account for potential endogeneity. Finally, in Chapter 5 we summarize our findings, discuss topics for future research and present concluding remarks.

CHAPTER 2
SKEWED OUTCOMES AND ENDOGENOUS REGRESSORS: PRESCRIPTION DRUG UTILIZATION AND HOSPITAL COST OFFSETS

2.1 Introduction and Background

Skewness in the regression outcome (y) is prevalent in empirical models in health economics. The most common examples are health care expenditures and utilization data (e.g., hospital expenditure, the number of physician visits, the number of prescriptions filled, etc.).¹ There are a variety of parametric estimators that can be implemented in such cases. These range from the nonlinear least squares (NLS) methods that require only a conditional mean regression assumption to the full information maximum likelihood (FIML) approaches that require knowledge of the conditional probability density of the dependent variable given the regressors (x). The NLS methods are relatively robust to misspecification bias because only the conditional mean regression need be specified. They do little, however, to account for skewness [or any of the other higher order moments of the conditional distribution of y given the regressors (y | x), for that matter] and are, therefore, relatively inefficient. The FIML estimators are more susceptible to bias, but are more efficient because they impose maximal parametric structure. Manning et al. (2005) argue for the importance of taking into account skewness in health econometric models.
They suggest a FIML estimator that accounts for skewness but avoids misspecification bias because it is based on a flexible distributional form, the generalized gamma (GG). The flexibility of the GG is evidenced by the fact that it subsumes, as special cases, some of the commonly used distributions found in the literature such as the standard gamma, the exponential, and the Weibull. According to the simulation-based findings of Manning et al. (2005), the GG estimator offers substantial potential efficiency gains due to the fact that it accounts for skewness [and all other higher order moments of (y | x)].

¹ Health care utilization and expenditures data are generally skewed to the right due to the high frequency of healthy people with zero health care utilization or expenditures, and the low frequency of people with chronic diseases.

Another important issue commonly encountered in empirical models in health economics and health services research is endogeneity among the regressors, i.e., the presence of unobservable confounding influences resulting in biased estimates of the model parameters and related causal effects.² Endogeneity is commonly caused by omitted regressors, simultaneity between a regressor and the outcome variable y, or errors in regressors. For example, Shea et al. (2007) analyzed the effect of prescription drug coverage on the number of drug prescriptions filled, an outcome variable that is highly skewed. Here, prescription drug coverage is likely to be endogenous because the unobservable variables with which it is correlated are also likely to influence drug utilization (e.g., unobserved health status, a confounding influence). In nonlinear settings, endogeneity is often handled by linearizing the model and applying the conventional two-stage least squares (2SLS) method, or by implementing a two-stage predictor substitution (2SPS) approach, a direct analog to 2SLS for nonlinear models. Terza et al.
(2008b) demonstrate that the former is likely to be substantially biased and Terza et al. (2008a) prove that the latter is generally inconsistent. Terza et al. (2008a) also established the consistency of an alternative method for endogeneity correction in nonlinear settings called two-stage residual inclusion (2SRI).

² By definition a confounding influence (variable) is one that affects y but is correlated with one or more of the elements of x.

As noted above, though some methods in the literature have handled skewness and endogeneity separately (e.g., Manning et al., 2005; Terza et al., 2008a, respectively), none comprehensively accounts for both of them together. In this chapter, to address this void, we extend the GG estimator as presented by Manning et al. (2005) by incorporating the 2SRI method discussed by Terza et al. (2008a).³ We call this new estimator the GG model with endogeneity (GGE). This model simultaneously maintains the substantial potential efficiency gains provided by the GG specification and corrects for endogeneity via the 2SRI method. It is noteworthy that this is the first application of 2SRI to expenditure analysis in the full information maximum likelihood (FIML) context.

The rest of the chapter is organized as follows. In Section 2.2 we introduce the GGE estimator and describe how 2SRI can be implemented in conjunction with the GG model to deal with endogeneity. In Section 2.3, the details of a simulation study are presented. Therein we: (i) summarize the different data generation techniques used in the simulation analysis; (ii) briefly describe the estimation techniques to be included in the simulation comparison; (iii) detail the statistical criteria for comparison; and (iv) discuss the simulation analysis results. Then, for the purpose of illustration, in Section 2.4, the new estimator is applied to the same data analyzed by Stuart et al.
(2009) in their examination of the impact of prescription drug use on hospital costs. Finally, in Section 2.5, we summarize and conclude.

³ Note that, for the special case in which the outcome regression and the auxiliary regression are both linear, 2SRI is equivalent to the conventional linear IV (two-stage least squares) method (Terza et al., 2008a).

2.2 Accounting for Endogeneity in the Generalized Gamma Estimation Framework

A common problem in health outcomes data is asymmetry in the distribution of the dependent variable (y), namely skewness. Usually the existence of an upper or lower bound on y leads to this type of asymmetry, as is the case for many health outcome variables such as prescription drug utilization or outpatient expenditures. Estimation methods that require only a conditional mean regression assumption, while relatively robust to conventional misspecification error, do not account for skewness in the data and are, therefore, relatively inefficient. The FIML estimators that account for additional information on the other moments of the distribution of the outcome conditional on the regressors are more prone to bias but are relatively efficient. The GG estimator discussed by Manning et al. (2005) provides a reasonable and practical compromise for the robustness vs. efficiency trade-off in that it accounts for skewness in a parametrically flexible way.

Endogeneity among the regressors is another important problem widely observed in health data models. Endogeneity can result in biased estimates of the model parameters and related causal effects. Sampling is subject to endogeneity when one or more of the regressors are correlated with the unobservable determinants of the outcome variable. Specific forms of endogeneity include omitted regressors, measurement error and simultaneity (reverse causality) between regressors and the outcome. Our aim is to introduce a new version of the GG estimator that corrects for endogeneity.
The main source of difficulty encountered by applied researchers seeking to use instrumental variables (IV) in this context is the typical nonlinearity of the relevant regression model (and often of the other conditional moments). The GG model is no exception in this regard. We resolve this issue by incorporating the 2SRI approach into the GG model.

2.2.1 The Generalized Gamma Model and Popular Special Cases

The GG estimator is based on a full parametric assumption, i.e., the conditional probability density function is fully specified. The GG probability density function has three parameters: $\kappa$, $\mu$, and $\sigma$. It embodies many different distributions as special cases, such as the standard gamma, the log normal, the exponential, and the Weibull. Because these models are nested within the GG framework, standard likelihood ratio and/or Wald tests can be used as a means of detecting and selecting specific cases. The following gives a brief summary of these nested cases for particular values of the parameters $\kappa$ and $\sigma$, and the specification of the conditional mean in each case. The conditional probability density, f(y | x), is assumed to have the following form

$f(y \mid x) = \frac{\gamma^{\gamma}}{\sigma\, y\, \sqrt{\gamma}\, \Gamma(\gamma)} \exp\left(z\sqrt{\gamma} - u\right)$ (2-1)

where $\gamma = |\kappa|^{-2}$, $z = \mathrm{sign}(\kappa)\{\ln(y) - \mu\}/\sigma$, and $u = \gamma \exp(|\kappa| z)$; with $\kappa$, $\mu$, and $\sigma$ as the basic parameters of the distribution ($\sigma > 0$). Moreover, it is assumed that

$\mu = x\beta$ (2-2)

where x is the 1 × K* row vector of regressors and $\beta$ is a column vector of regression parameters conformable with x. Assuming that the first element of x is equal to one, the first element of $\beta$ is the regression constant term.
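As an illustration of how a density of this kind can be evaluated, the following is a minimal sketch that assumes (2-1) follows the generalized gamma parameterization of Manning et al. (2005), with parameters kappa, mu, and sigma. The function name `gg_logpdf` is hypothetical and is not part of the Stata/Mata implementation used in this chapter. The last two lines check the exponential and Weibull special cases numerically.

```python
import math

def gg_logpdf(y, mu, sigma, kappa):
    """Log density of the generalized gamma in the (kappa, mu, sigma) form.

    Assumption: the density is gamma^gamma / (sigma*y*sqrt(gamma)*Gamma(gamma))
    * exp(z*sqrt(gamma) - u), as in Manning et al. (2005). kappa = 0 (the
    log normal limit) is not handled here.
    """
    if y <= 0 or sigma <= 0 or kappa == 0:
        raise ValueError("requires y > 0, sigma > 0 and kappa != 0")
    gamma = 1.0 / kappa ** 2
    z = math.copysign(1.0, kappa) * (math.log(y) - mu) / sigma
    u = gamma * math.exp(abs(kappa) * z)
    return (gamma * math.log(gamma) - math.log(sigma) - math.log(y)
            - 0.5 * math.log(gamma) - math.lgamma(gamma)
            + z * math.sqrt(gamma) - u)

# kappa = sigma = 1, mu = 0 should reproduce the unit exponential: log f(2) = -2
exp_check = gg_logpdf(2.0, 0.0, 1.0, 1.0)
# kappa = 1, sigma = 0.5, mu = 0 should reproduce a Weibull with shape 2: f(y) = 2*y*exp(-y^2)
weibull_check = gg_logpdf(1.5, 0.0, 0.5, 1.0)
```

Working on the log scale with `math.lgamma` avoids overflow in the gamma function when kappa is close to zero.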
Under the assumed distribution in (2-1) it can be shown that

$E[y \mid x] = \exp(x\beta + \delta)$ (2-3)

where

$\delta = \frac{\sigma}{\kappa}\ln(\kappa^{2}) + \ln\Gamma\left(\frac{1}{\kappa^{2}} + \frac{\sigma}{\kappa}\right) - \ln\Gamma\left(\frac{1}{\kappa^{2}}\right)$ (2-4)

We can express equation (2-3) as

$E[y \mid x] = \exp(x\beta^{*})$ (2-5)

where $\beta^{*}$ is the same as $\beta$ except for the constant term, which becomes the original constant plus $\delta$. The GG estimator encompasses a number of familiar special cases corresponding to various combinations of values for the parameters $\kappa$ and $\sigma$.

Standard gamma: When the shape parameters are equal, i.e., $\kappa = \sigma$, and are strictly positive, the GG distribution reduces to the standard gamma distribution, i.e.,

$f(y \mid x) = \frac{\gamma^{\gamma}}{y\,\Gamma(\gamma)}\left(y e^{-\mu}\right)^{\gamma}\exp\left(-\gamma\, y e^{-\mu}\right)$ (2-6)

In this case, since $\kappa = \sigma$, the constant term shift (2-4) in the reparametrization of the conditional mean (2-5) is $\delta = 0$.4

Weibull: When $\kappa = 1$, the GG distribution reduces to the Weibull distribution, i.e.,

$f(y \mid x) = \frac{1}{\sigma y}\left(y e^{-\mu}\right)^{1/\sigma}\exp\left\{-\left(y e^{-\mu}\right)^{1/\sigma}\right\}$ (2-7)

Here the shift in the constant term is $\delta = \ln\Gamma(1 + \sigma)$.5

4 Appendix A includes the formal derivation of the reparametrization of the conditional mean of the standard gamma.
5 Appendix A includes the formal derivation of the reparametrization of the conditional mean of the Weibull.

Exponential: When both shape parameters are set to unity, i.e., $\kappa = \sigma = 1$, the GG distribution reduces to the exponential distribution, i.e.,

$f(y \mid x) = e^{-\mu}\exp\left(-y e^{-\mu}\right)$ (2-8)

For the exponential distribution, $\delta = 0$.6

Log normal: If $\kappa$ approaches zero in the limit, the GG distribution approaches the log normal distribution, i.e.,

$f(y \mid x) = \frac{1}{\sigma y\sqrt{2\pi}}\exp\left\{-\frac{(\ln y - \mu)^{2}}{2\sigma^{2}}\right\}$ (2-9)

For the log normal, $\delta = \sigma^{2}/2$.7

2.2.2 Accounting for Endogenous Regressors

Now suppose that some of the elements of x are endogenous. To allow for this possibility, we combine the GG formulation defined in (2-1) with the 2SRI method discussed by Terza et al. (2008a). There are many applications of 2SRI in the literature, including Shea et al. (2007), Stuart et al. (2009), DeSimone (2002), Baser et al. (2003), Norton and Van Houtven (2006), Gibson et al. (2006), Shin and Moon (2007), Lindrooth and Weisbrod (2007), Terza et al. (2008b), and Gavin et al. (2007). Our extended model is the first application of 2SRI to expenditure analysis in the FIML context. All other applications of 2SRI as discussed by Terza et al. (2008a), i.e., those in Terza et al.
(2008a) and those that reference Terza et al. (2008a), were cast and estimated in the NLS framework.

6 Appendix A includes the formal derivation of the reparametrization of the conditional mean of the exponential.
7 Computed using Maple.

Following the 2SRI approach, we partition x in the following way

$x = [x_o \;\; x_e \;\; x_u]$ (2-10)

where x_o is the (1 × K) vector of observable exogenous regressors whose first element is constant, x_e is the (1 × S) vector of endogenous variables, and x_u is the (1 × S) vector of unobservables that are correlated with both x_e and y (unobservable confounding influences). We can now rewrite (2-5) as

$E[y \mid x_o, x_e, x_u] = \exp(x_o\beta_o^{*} + x_e\beta_e + x_u\beta_u)$ (2-11)

The unobservable confounders are the source of the endogeneity problem. To control for x_u, potentially nonlinear auxiliary equations are specified with the following form

$x_{es} = r_s(w\alpha_s) + x_{us}$ (2-12)

where w = [x_o w+], w+ denotes a 1 × S* vector of identifying instrumental variables, and $\alpha_s$ is a (K + S*) × 1 vector of coefficient parameters.8 In the first stage of the 2SRI method, the appropriate estimator (e.g., NLS) is applied to (2-12) and the residuals from that regression are computed as

$\hat{x}_{us} = x_{es} - r_s(w\hat{\alpha}_s)$ (2-13)

where s = 1, ..., S. In the second stage, the predicted residuals (2-13) from the auxiliary equations are substituted for x_u in the regression model (2-11), and FIML (i.e., GG, as described above) is applied to obtain estimates of the model parameters. Terza et al. (2008a) show that this estimator is consistent. We analytically derive the details of this GGE estimator and program it, along with its correct asymptotic inferential statistics, in Stata/Mata 10.

2.3 Simulation Analysis

Our objectives in this simulation analysis are threefold. First, we seek to verify the theoretical properties of the GGE estimator, such as statistical efficiency (variance) and unbiasedness.
For the former, we compare the GGE estimator with 2SRI versions of two popular alternative methods, ordinary least squares (OLS) applied to a log-linear model and a generalized linear model with a log link and a gamma family (GLM-Gamma), using the mean squared error of the marginal effect as the criterion. For the latter, we seek to study the marginal effect of endogenous policy variables using the above-mentioned estimators, with average percentage absolute bias as our measurement. Secondly, we want to compare the performance of the above-mentioned estimators, both in terms of efficiency and bias, with the GG, OLS, and GLM models, respectively, when they do not account for endogeneity. Finally, we test the nested model selection capability of the GGE model. For all the cases outlined above, we used simulated data from a variety of sampling designs nested within the GG framework. The requisite data are generated using Monte Carlo simulations. The focus here is on strictly positive outcome data skewed to the right. There are five different data generation methods that are used in our study. Each of these data generation methods satisfies the exponential conditional mean property as specified in (2-5). These sampling designs include log normal distributed data, gamma distributed data, Weibull distributed data, exponential distributed data, and GG distributed data. The first four data generation techniques are chosen since they are specific versions of GG. We use average percentage absolute bias as our metric for comparing the marginal effects in the various sampling designs.
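The two-stage logic of Section 2.2.2 can be illustrated on a single simulated sample. The sketch below is not the GGE estimator: as a simplifying assumption, the second stage uses a Poisson-type pseudo-likelihood for an exponential conditional mean in place of the GG likelihood, and every design value (coefficients, sample size, gamma noise, seed) as well as the function names `fit_exp_mean` and `simulate_and_estimate` are hypothetical choices made for illustration only.

```python
import numpy as np

def fit_exp_mean(X, y, iters=100, tol=1e-10):
    """Newton-Raphson for the pseudo-likelihood of E[y | X] = exp(X b)."""
    b = np.zeros(X.shape[1])
    b[0] = np.log(y.mean())              # first column of X is the constant
    for _ in range(iters):
        m = np.exp(X @ b)
        step = np.linalg.solve(X.T @ (X * m[:, None]), X.T @ (y - m))
        b += step
        if np.max(np.abs(step)) < tol:
            break
    return b

def simulate_and_estimate(n=20000, seed=0):
    rng = np.random.default_rng(seed)
    x_o = rng.uniform(0.0, 2.0, n)       # observable exogenous regressor
    x_u = rng.uniform(0.0, 0.5, n)       # unobservable confounder
    w_plus = rng.uniform(0.0, 2.0, n)    # instrument
    x_e = 0.5 * x_o + 0.5 * w_plus + x_u # linear auxiliary equation (assumed values)
    mean_y = np.exp(1.0 + 0.5 * x_o + 0.5 * x_e + 1.0 * x_u)
    y = mean_y * rng.gamma(shape=2.0, scale=0.5, size=n)  # mean-one gamma noise
    ones = np.ones(n)
    # Stage 1: regress x_e on [1, x_o, w_plus] and keep the residuals
    W = np.column_stack([ones, x_o, w_plus])
    alpha_hat, *_ = np.linalg.lstsq(W, x_e, rcond=None)
    resid = x_e - W @ alpha_hat
    # Stage 2: include the first-stage residual as an additional regressor
    b_2sri = fit_exp_mean(np.column_stack([ones, x_o, x_e, resid]), y)
    # Uncorrected fit that ignores the confounder, for comparison
    b_naive = fit_exp_mean(np.column_stack([ones, x_o, x_e]), y)
    return b_naive[2], b_2sri[2]

naive_beta_e, tsri_beta_e = simulate_and_estimate()
```

Under these assumed values the true coefficient on x_e is 0.5. The residual-inclusion estimate should lie close to it, while the uncorrected estimate is pushed upward because x_u raises both x_e and y.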
2.3.1 Sampling Designs

2.3.1.1 The observable and unobservable confounders, the instrumental variables and the endogenous variable

As mentioned in Section 2.2.2, x_e is the (1 × S) vector of endogenous variables, x_o is the vector of observable exogenous regressors, and x_u is the (1 × S) vector of unobservable confounders (i.e., unobservable variables that are correlated with both x_e and y). For our simulations we generate a single observable regressor x_o and a single unobservable confounder x_u, uniformly distributed over the [0, 2] and [0, 0.5] intervals, respectively. The instrumental variable w+ is uniformly distributed over the [0, 2] interval. The endogenous variable x_e is defined as a linear function of x_o and w+ with constant term equal to zero. Specifically, r( ) in (2-12) is the identity function with w = [1 x_o w+]. The simulations are repeated 500 times for each of the seven sample sizes 500, 1,000, 2,500, 5,000, 10,000, 100,000, and 500,000. Here we generate samples of increasing size in order to explore the small sample properties and the asymptotic properties of the estimators.

2.3.1.2 The outcome variable

The outcome variable y is generated using five different distributions. The scale parameter in GG is specified as across all sampling designs for the outcome, where the constant c is chosen as 1.0. The coefficients of the endogenous and observable regressors are both equal to 0.5, whereas the coefficient of the unobservable regressor is equal to 1.0. In the following we describe the outcome data generator and define the values of the parameters of and the value of the variable for each of the particular sampling designs.

Log normal: The data generator we used for the log normal outcome variable y is where is standard normally distributed. Recall that the log and the parameter In generating the log normal data we as It follows from the discussion in Section 2.2.1 that

Standard gamma: For the next case we generate standard gamma distributed data. The Stata data generator command rgamma(a, b), which generates gamma random variates, is used to generate the outcome variable y, where a and b are the gamma shape and scale parameters, respectively. The standard gamma distribution has a shape parameter ( ) and a scale parameter ( ). We chose the shape parameter ( ) in order to create skewed outcome data. The independent variables are generated according to the above descriptions with the same coefficients. The conditional mean of the outcome variable is defined according to equations (2-3) and (2-4), where and following the discussion in Section 2.2.1.

Weibull: In the third case we generate a Weibull distributed outcome variable. We set the distribution parameters so as to obtain right-skewed outcome data for our simulation. We have used the Inverse Transform Method (Rubinstein, 1981) to create the Weibull distributed data. The random number generator for Weibull is where U is uniformly distributed over the interval [0, 1]. Here the scale parameter is equal to where is given by equation (2-2). We used two different values of the shape parameter ( and ) in order to generate data with varying degrees of skewness. Accordingly the conditional mean of y is and the constant term is defined as For the value of the shape parameter ; and for The independent variables are generated as described above.

Exponential: The fourth data generating process produces exponential distributed data. The rate parameter is defined as ( ). The exponential is a special case of the standard gamma in which both parameters are equal to unity. The Stata data generator command rgamma(a, b) is used to generate the exponential outcome variable, where the shape parameter is equal to unity.
The conditional mean of y is given by exp(xβ), since the shift in the constant term is zero in this case. We have the same endogenous, observable, unobservable, and instrumental variables as defined above.

Generalized gamma: The last data generation method is the GG. In choosing the values for the GG parameters we avoid the specific combinations of values corresponding to the distributions used in the other four sampling designs. We generate GG distributed random variables from a standard gamma using the following transformations, as demonstrated by Tadikamalla (1979): (2-14) where v is standard gamma distributed and y is the generalized gamma distributed random variable (i.e., the outcome variable in this case). The parameters b and c are defined as and respectively. Given the considerable flexibility of the GG distribution, we tried three different parameter settings in an attempt to explore the distributional landscape. We assign ( ), and use three different values for ( respectively) in our simulations. The conditional mean of y is given in (2-3), where and respectively.

2.3.2 Estimators to be Evaluated and Compared

The GG estimator is a maximum likelihood estimator. It is assumed that the parameter μ is defined as in (2-2), and the other two parameters are estimated from the data. For the estimation of GG we use the streg command in Stata.9 The alternative estimators that are included in the comparisons are OLS applied to a log-linear model and GLM with a log link and a gamma family. These are described below.

9 The streg command is used to estimate parametric survival models by maximum likelihood estimation. It can be used to estimate various distributions, such as Gompertz, logistic, Weibull, generalized gamma, etc. In our implementation we used streg by assigning the number of failures equal to zero with the generalized gamma model.

Log normal model: In this setting, the log normal model is based on the assumption that Next, the following two cases should be considered. In the first case the distribution can be assumed to be unspecified. On the other hand, in the second case can be assumed to be normally distributed. In the log normal model, the logarithm of the dependent variable y is regressed on the independent variables x using OLS. The conditional mean is specified by (2-15). In the first case above, when the distribution is assumed to be unspecified, we use Duan's smearing estimator in the estimation of the conditional mean of the outcome variable, (2-16), to take care of the retransformation issue (Duan, 1983). In the second case, when normality is assumed, the conditional mean of the outcome variable can be estimated as (2-17). The log normal model is easy to apply; therefore it is frequently used in health economics and labor economics. An important drawback of this model is that it is relatively inefficient. Between the two cases above, the first method is less susceptible to misspecification because it does not require the specification of the distribution of (y | x). Hence, we use this first, unspecified case in our experiment with the log normal model.

Gamma generalized linear models (GLM-Gamma): In the GLM class of models (McCullagh and Nelder, 1989), (y | x) is assumed to follow a specified distribution, g(y | x), and has a particular assumed form for its mean, E[y | x]. Estimation of the model is carried out by the FIML method. This approach offers substantial flexibility with respect to the choice of alternative specifications for g(y | x) and E[y | x]. Once these choices are made, however, the model is fixed and inflexible.
We specify the distribution of (y | x) as the standard gamma and the link function, which relates the mean of the distribution function to the linear predictor, in our GLM model specification as the log link.

2.3.3 Criteria for Evaluation and Comparison

In general, the object of interest is not only the estimation of the regression parameters but rather the marginal or incremental effects on y of changes in some (or all) of the elements of x. There are a number of ways to characterize such effects (see Terza, 2010 for a detailed discussion). Here we focus on the expected marginal effect defined as (2-18) where The marginal effect (2-18) is typically estimated as (2-19) where denotes For the various models in our analysis, the expected marginal effect (2-18) and the appropriate consistent estimator (2-19) are:

Log-linear OLS: In the log normal model with OLS estimation, the marginal effect of interest is (2-20). In the first case of the log normal model described above, where the distribution is assumed to be unspecified, and in the second case of the log normal model, where normality is assumed, is The consistent marginal effect estimator for the first case, where the distribution is assumed to be unspecified, is (2-21) where This is Duan's smearing estimator (Duan, 1983). The consistent marginal effect estimator for the second case, where normality is assumed, is (2-22).

GLM-Gamma: In the GLM-Gamma estimator the marginal effect of interest is (2-23), because the expected conditional mean of the outcome is assumed to be in the GLM-Gamma model. The appropriate consistent marginal effect estimator of (2-23) is (2-24) where the is the GLM estimate.

GGM: The marginal effect in the generalized gamma model is (2-25) where as given in equation (2-4). We can write the consistent marginal effect estimator as (2-26) where and are the GG parameter estimates.
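For models with an exponential conditional mean, an average marginal effect of the kind discussed above can be computed as a sample average. The following is a minimal sketch under stated assumptions: the marginal effect of x_e is taken to be its coefficient times the fitted conditional mean, and the retransformation factor is 1 for GLM-Gamma and the Duan (1983) smearing factor for log-linear OLS (for the GG model, the constant shift of (2-4) can be folded into the index). The function names are hypothetical and do not come from the dissertation's code.

```python
import math

def smearing_factor(log_scale_residuals):
    """Duan (1983) smearing factor: the sample mean of exp(OLS residual)."""
    return sum(math.exp(e) for e in log_scale_residuals) / len(log_scale_residuals)

def average_marginal_effect(beta_e, index_values, retransform=1.0):
    """Sample average of beta_e * retransform * exp(index_i).

    index_values holds the fitted linear index for each observation;
    retransform is an assumed multiplicative correction (1.0 by default).
    """
    total = sum(beta_e * retransform * math.exp(v) for v in index_values)
    return total / len(index_values)

# Two observations with fitted indices 0 and ln(2), coefficient 0.5
ame_example = average_marginal_effect(0.5, [0.0, math.log(2.0)])
```

Because the effect depends on the index of each observation, evaluating it at specific values of the endogenous regressor (for instance its quartiles) only requires passing the corresponding index values.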
The mean squared error of the marginal effect is used as a criterion for the statistical efficiency comparisons and it is measured as (2-27) where ME is defined as the true marginal effect, k indicates the sample, and d indicates the sampling design. To obtain the true value of ME for a particular sampling design, we simulated a large sample of 5 million observations and calculated the analog to equation (2-19) for this super sample. To measure bias relative to the true value of the marginal effect, we calculate the average percentage absolute bias as (2-28) where ME, k, j, and d are defined as given above. Since our model is a nonlinear model, we calculated the ME for the three quartile values of the endogenous regressor. We are interested in both comparisons among the various estimation techniques and the differences between the estimates that are corrected for endogeneity versus those that are not.

2.3.4 Simulation Results

In the following set of results we would like to highlight the efficiency as well as the bias performance of the GGE model as compared to the other popular models discussed above. The GG estimator is known to provide significant efficiency gains compared to log-linear models and GLM models (Manning et al., 2005). The objective here is to show that the GGE model maintains the statistical efficiency advantages of the conventional GG approach in the presence of endogeneity. To assess this, we focus on the mean squared error (MSE) of the estimated marginal effect of the endogenous variable as defined in (2-27). Furthermore, we also intend to show that the GGE model provides consistent and unbiased estimates across various sampling designs. This is accomplished using the average percentage absolute bias of the marginal effect as defined in (2-28). The results of the simulation study with respect to MSE are presented in Table 2-1 and Table 2-2. Here the number of observations is 10,000 and 500, respectively, and the number of repetitions for each simulation is 500.
In these tables we are interested in comparing MSE among the various estimators defined in Section 2.3.2 for the endogeneity corrected models.10 Since our model is a nonlinear model, the MSE is calculated at different quartiles of the endogenous variable. To highlight the differences between OLS versus GGE, and GLM versus GGE, we have included relative efficiency gain percentage measures in Tables 2-1 and 2-2. These percentage measures indicate the relative efficiency of the GGE estimator compared to the OLS and the GLM estimators, and are defined by where m = OLS, GLM. For the log normally distributed data in Table 2-1, with sample size 10,000, OLS and GGE have similar MSE values and both are lower than the GLM results (the efficiency gain of GGE relative to GLM is 37%). Similarly, for the gamma distributed samples, the GLM and GGE estimators are equally efficient and both are more efficient than the OLS estimator (the efficiency gain of GGE relative to OLS is 16%). For the Weibull, exponential, and generalized gamma distributed samples we see similar patterns, wherein the GGE estimator has lower or equal MSE compared to the other estimators (the efficiency gain from GGE goes up to 60%). In Table 2-2, for the sample size 500, we observe that for the Weibull, the exponential, and the GG distributed data the GGE estimator provides efficiency gains compared to the OLS and the GLM estimators. The efficiency gain of the GGE estimator as compared to the OLS estimator ranges from 35% to 80%, and the efficiency gain of the GGE estimator as compared to the GLM estimator ranges from 4% to 50%. For the log normally distributed data in this table, the OLS estimator is slightly more efficient (1.8%) compared to the GGE estimator. Similarly, for gamma distributed data the GLM estimator is slightly more efficient (0.45%) compared to the GGE estimator. We should be cautious in interpreting the results from the sample with 500 observations, since the sample size is very small. Overall, the findings of this experiment confirm that the GGE estimator provides efficiency gains relative to the OLS and GLM estimators in the presence of endogenous regressors.

10 For the models corrected for endogeneity, the values of the mean squared errors are significantly lower than in the cases when they are not corrected (where they go up to 1078.9250; not included in Tables 2-1 and 2-2). Endogeneity induces bias in the marginal effect of the endogenous variable, resulting in higher mean squared errors.

Next, we shift our focus to the average marginal effect of the endogenous variables. Table 2-3 shows the average percentage absolute bias of the estimated average marginal effect, as defined in (2-28), obtained from the various sampling designs for different numbers of observations (10,000, 100,000, and 500,000) with 500 replications. We have reported both the results from the estimators corrected for endogeneity using 2SRI and the results from the uncorrected versions of the estimators. The importance of correcting for endogeneity in the presence of an endogenous regressor is reinforced by the fact that the average percentage absolute bias ranges from 60% to 93% (for the values of x_e in the third quartile) when we do not correct for endogeneity. On the other hand, average percentage absolute bias is significantly lower when we do apply 2SRI to account for endogeneity. Furthermore, the asymptotic properties of the included estimators predict that the average percentage absolute bias should tend to zero as the number of observations increases. This trend can be readily noted in Table 2-3 in the cases where we correct for endogeneity, as the number of observations increases from 10,000 to 100,000 to 500,000. In contrast, when we do not correct for endogeneity, the average percentage absolute bias consistently remains high.
Next we focus only on the endogeneity corrected versions of the estimators, comparing their performance across the various sampling designs detailed in Section 2.3.1.2. In the case of log normally distributed data, we expect the OLS estimator to provide low average percentage absolute bias, and this is confirmed by our findings in Table 2-3. The important point to note is that the performance of the OLS model is matched by our flexible GGE model, whereas GLM fares worse than both of these estimators. Since we use the gamma distribution in our GLM model, we expect gamma and exponentially distributed data to provide low average percentage absolute bias when estimated with GLM, and this is corroborated by our results in Table 2-3. The crucial point to note here is that the GGE model, due to its flexible nature, provides similar or slightly better results than the GLM estimator across the different sample sizes and different quartile values of x_e. Next we look at the results obtained using the Weibull and the generalized gamma distributed data. For the case of Weibull distributed data, we used two different settings of the shape parameter in order to vary the amount of skewness in the data. Here we find that in the case with mild skewness, GGE performs slightly better than GLM and OLS for most of the sample sizes and quartile values. This advantage of the GGE model is magnified for the case in which the data are more skewed. We attribute this behavior of the GGE model to its flexible nature. For the generalized gamma distributed data, we chose three different parameter settings in order to test our estimators. For one of these settings, OLS performs significantly worse than GGE and GLM, while GGE performs noticeably better than GLM. For the other two settings, the performance of GGE is still better than that of the other two estimators. Table 2-4 presents the average percentage absolute bias results from small sample simulations, i.e., N = 500, 1,000, 2,500, and 5,000.
Here, although the main findings are similar to the large sample results, a word of caution is in order for the smallest sample sizes, 500 and 1,000, in the specific sampling designs where there is extreme skewness. For the affected Weibull and generalized gamma designs, the average percentage absolute bias ranges from 70% to 216% for various estimators for the sample sizes 500 and 1,000. The models that are not corrected for endogeneity lead to less average percentage absolute bias in these specific sampling designs. Both the OLS and GLM estimators, however, generate considerably higher average percentage absolute bias compared to GGE in the endogeneity corrected models. For the smallest sample sizes, where there is extreme skewness, none of the suggested estimators behaves well. Overall, it can be noted that due to its flexible nature, GGE consistently provides lower average percentage absolute bias across different data types, sample sizes, and quartiles, while the performance of the OLS and GLM estimators varies depending on data types and parameter settings. Results here reinforce our expectation that the use of the proposed GGE model has advantages over the OLS and the GLM estimators when we encounter endogenous regressors with skewed outcomes. Table 2-5 summarizes the results of the nested model selection tests from the GGE regressions in terms of the proportion of the replicated data sets of size 10,000 for which a particular null model (H_0) is rejected at the 5% level of significance. In order to demonstrate that the GGE model is useful for model selection from among the nested alternatives, these model selection tests should have two important features. First, they should manifest the same empirical size (i.e., likelihood of type I error) as the theoretical size of the test, in this case 5%.
Second, the empirical power of the test [1 - Pr(Type II error)] for a particular non-null sampling design should be extremely high, i.e., near 1. These two criteria appear to be supported by the results in Table 2-5. For example, for standard gamma generated data, the GGE estimator fails to reject the null hypothesis of equal shape parameters in 94.5% of the replications, implying that the inherent data distribution is standard gamma. Similar correct predictions are also obtained for the Weibull and the log normally distributed samples. For the GG distributed data, the null hypotheses corresponding to the standard gamma, log normal, exponential, and Weibull models are rejected in all repetitions (at the 5% significance level). Finally, in the exponential sampling design, the GGE model selection tests require further analysis, since the exponential is a special case of both the standard gamma and the Weibull distribution. For this reason, in ~95% of the replications, the GGE selection test fails to reject both the standard gamma and the Weibull null hypotheses. These results show that GGE correctly predicts and selects the inherent data distribution when the regressors are endogenous.

2.4 Prescription Drug Use and Hospital Cost Offsets

One of the main characteristics of healthcare utilization/expenditure data is skewness. The population generally consists of healthy individuals who have very small or zero healthcare expenditures or healthcare utilization. A very small percentage of the remaining unhealthy population incurs very large costs or needs excessive inpatient, outpatient, or emergency care. As a result, the expenditure distribution is typically skewed. Another noticeable characteristic of healthcare data which can lead to inconsistent estimation is endogeneity of one or more of the regressors (continuous, count, or dummy variable). This is commonly caused by the presence of unobservable confounding variables. Stuart et al.
(2009) examine the effect of outpatient prescription drug utilization on the cost of inpatient hospitalization for Medicare beneficiaries. In that paper they use a two-part model specification, correcting for endogeneity by applying 2SRI (Terza et al., 2008a). We re-estimate this model using the newly developed GGE estimator in the second part of the two-part model, whereas they implemented NLS in the second part of the model. The GGE estimator introduces efficiency gains by taking full account of skewness and other higher order moments of the hospital expenditure distribution. The literature suggests that hospital expenditures make up the highest percentage of total personal health care spending in all age groups, including the elderly (65+) (see Heffler et al., 2005). Heffler et al. (2005) indicate that, for the elderly, hospital spending as a percentage of total expenditure dropped from 43 percent in 1987 to 37 percent in 2004. In that same period, health expenditures shifted towards other services such as prescription drug utilization, nursing homes, etc. They also observe that prescription drug use has continued to rise since 1987. The observed concurrent increase in prescription drug use and decrease in hospital spending among the elderly leads one to suspect a possible causal relationship: increased prescription drug use might promote reduced inpatient costs. Stuart et al. (2009) analyze this relationship in a two-part regression framework that controls for confounding factors, e.g., health status known only to the patient and the prescribing doctor. Stuart et al. (2009) provide three possible explanations for an inverse relationship between prescription drug use and hospital costs. First, a high percentage of the prescription drug use by elderly Medicare beneficiaries prevents common chronic conditions. The preventive effect of prescription drug use would decrease hospitalization for such illnesses.
Second, even if prescription drug utilization does not prevent hospitalization, it may lower the cost of hospitalization significantly by decreasing length of stay. Third, people with higher prescription drug utilization are more likely to adhere to the prescribed dosages, which enhances medication effectiveness and reduces the likelihood of inpatient stays.

2.4.1 The Econometric Model

Here we implement the two-part model with 2SRI, comprising the following three steps. In the first step (also the first stage of 2SRI), an auxiliary regression, as in equation (2-12), is estimated using NLS, where the dependent variable is prescription drug usage (a count variable measured as the number of prescription drug fills). The auxiliary regression function is defined as (2-29) where x_e denotes prescription drug usage. In the second step (still the first part of the two-part model) we estimate the probability of hospital utilization using a conventional probit specification; the dependent variable (y*) is a binary variable taking the value one if hospital expenditure is positive (non-zero) and zero otherwise. We obtain estimates of the parameters of the first part of the two-part model, $\beta_1 = [\beta_{o1} \; \beta_{e1} \; \beta_{u1}]$, by maximizing the following probit log-likelihood function

$\sum_{i=1}^{n}\left\{ y_i^{*}\ln\Phi(x_{oi}\beta_{o1} + x_{ei}\beta_{e1} + \hat{x}_{ui}\beta_{u1}) + (1 - y_i^{*})\ln\left[1 - \Phi(x_{oi}\beta_{o1} + x_{ei}\beta_{e1} + \hat{x}_{ui}\beta_{u1})\right]\right\}$ (2-30)

where $\Phi$ is the standard normal cumulative distribution function, $\hat{x}_u$ is the residual of the NLS estimation from the first step in (2-29), and n denotes the size of the full sample. In the third and final step, the new GGE method is used to estimate the parameters of the second part of the two-part model. Specifically, we maximize the following log-likelihood function (2-31) where f( ) is defined as in (2-1), (2-32) y denotes the hospital expenditures for the subsample of individuals who had at least one inpatient stay, and n_y is the size of this subsample of hospitalized individuals.
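The second step above, a probit for any hospital use with the first-stage residual included as a regressor, can be sketched as follows. This is a minimal pure-Python illustration of the probit log-likelihood only; the function names and the layout of each data row (constant, exogenous regressors, the endogenous regressor, then the first-stage residual) are assumptions made for the example and do not reproduce the Stata/Mata code used for the estimates in this chapter.

```python
import math

def std_normal_cdf(t):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def probit_loglik(beta, rows, y_star):
    """Probit log-likelihood.

    rows:   one list per individual, e.g. [1, x_o, x_e, first_stage_residual]
    y_star: 1 if hospital expenditure is positive, 0 otherwise
    """
    total = 0.0
    for x, d in zip(rows, y_star):
        index = sum(b * v for b, v in zip(beta, x))
        p = std_normal_cdf(index)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += math.log(p) if d == 1 else math.log(1.0 - p)
    return total

# With all coefficients at zero every probability is 0.5
ll_at_zero = probit_loglik([0.0, 0.0, 0.0], [[1, 2, 3], [1, 0, 1]], [1, 0])
```

In practice this function would be passed to a numerical optimizer, and, as the text notes, the resulting standard errors would still need to be corrected for the multi-stage estimation.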
The conventional standard errors and t-statistics output by Stata cannot be used in a three-stage estimator; they must be corrected. The derivations of the average marginal effect and the corrected standard errors are presented in Appendices B and C, respectively. Following Mullahy (1998) and using the argument surrounding equations (2-3) through (2-5), it can be shown that the conditional mean of hospital expenditures (including the zeros and the positive values) is

$E[y \mid x] = \Phi(x\beta_1)\exp(x\beta_2^{*})$ (2-33)

where x is defined as in (2-2), and $\beta_2^{*}$ is the same as $\beta_2$ except for the constant term (the first element of $\beta_{o2}$), which becomes the original constant plus the shift defined as in (2-4). From (2-18) we get the expected marginal effect as

$ME = E\left[\beta_{e1}\,\phi(x\beta_1)\exp(x\beta_2^{*}) + \beta_{e2}\,\Phi(x\beta_1)\exp(x\beta_2^{*})\right]$ (2-34)

where $\phi$ is the standard normal probability density function. The appropriate estimator of (2-34) is

$\widehat{ME} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{\beta}_{e1}\,\phi(x_i\hat{\beta}_1) + \hat{\beta}_{e2}\,\Phi(x_i\hat{\beta}_1)\right]\exp(x_i\hat{\beta}_2^{*})$ (2-35)

2.4.2 Data Source and Variables

The data are from the 1999 and 2000 Medicare Current Beneficiary Survey (MCBS). It includes information on health status, health care use and expenditures, health insurance coverage, and socioeconomic and demographic characteristics of a nationally representative sample of Medicare beneficiaries. Stuart et al. (2009) created a subsample of Medicare beneficiaries who were enrolled for 24 months with continuous Medicare Part A and Part B coverage. Moreover, each Medicare beneficiary has to have continuous drug coverage during the study period of 24 months or no coverage at all.11 The overall data set includes 3,101 observations, with 20 percent of the sample having positive hospital costs.

11 For more information about the data set restrictions and properties, refer to Stuart et al. (2009).
The dependent variable in the first part of the two-part model (step two of three in our estimation) is the probability of any hospital stay, and in the second part of the model (step three of three in our estimation) the dependent variable is total hospital expenditures, which are obtained from Medicare Part A claims in the second year of the study period. Figures 2-1 and 2-2 show the histogram and kernel density estimate of hospital expenditures for the overall data (including the zero expenditures) and for the subsample with positive hospital expenditures, respectively. In both cases, hospital expenditures are highly positively skewed, as expected. The independent variable of interest is the number of prescription drug fills in the second part of the study period. The other control variables are demographic variables such as age, sex, educational attainment, marital status, residence (urban vs. rural, and census region: northeast, midwest, south, and west) and annual income; the DCG/HCC (Diagnostic Cost Group/Hierarchical Coexisting Conditions) risk adjuster; [12] and Medicare entitlement status [aged with no prior disability, aged and previously disabled, Social Security Disability Insurance (SSDI) disabled].

In the first stage of the 2SRI estimation (step one of our estimation) we include four instrumental variables to correct for the potential endogeneity of prescription drug usage. [13] The instruments are correlated with the endogenous policy variable, prescription drug utilization, through prescription drug coverage and are assumed to have no direct correlation with the dependent variable. These four variables are as follows. (1) The percent of the work force in the respondent's state that is unionized. A unionized work force has a higher probability of having coverage and, indirectly, more prescription drug usage. (2) The average premium for Medigap Plans H, I, and J in the state. Coverage for prescription drugs can be higher in states with lower average premiums. (3) A variable indicating whether the state has a pharmaceutical assistance plan for low-income elderly and/or disabled Medicare beneficiaries. (4) The state per capita income. Wealthier states can be associated with greater take-up of Medicare supplemental policies, including drug coverage. Table 2-6 summarizes the descriptive statistics for these variables.

2.4.3 Estimation Results

Table 2-7 presents the estimated average marginal effects, as given in equation (2-35), from the two-part model (with various estimators in the second part). [14] Results for both the endogeneity-corrected models using 2SRI and the uncorrected models are included. The first row shows the replicated results of Stuart et al. (2009) using the new instrumental variables, where we find that the average marginal effect of a prescription drug fill is a statistically significant -$140.48, i.e., a one-unit increase in prescription drug fills leads to a $140.48 decrease in hospital expenditures. Here NLS is used to estimate the second part of the model. The next three rows show that the results from using OLS for a semi-log model, GLM with a gamma distribution, and our GGE [15] model in the second part are -$87.29, -$109.32 and -$89.30, respectively. All of the endogeneity-corrected average marginal effect estimates are negatively signed. This is consistent with our expectation that appropriate usage of prescription drugs will offset inpatient hospital costs. For all the estimation procedures that we implemented, failure to correct for endogeneity leads to results that diverge substantially from the 2SRI estimates. The average marginal effect in all four uncorrected estimations is positive, ranging from $15.69 to $17.78, contrary to expectation.

Secondly, the Wald test result obtained from our GGE model fails to reject the hypothesis that the dependent variable is log-normally distributed. The result of this model specification test is corroborated by the estimated average marginal effects from OLS and GGE, which are -$87.29 and -$89.30, respectively. Thirdly, the GGE estimator, along with OLS, has the lowest standard errors for the estimated marginal effect. [16] The precision found in the simulations is thus borne out again in the actual data analysis. Finally, we must point out that although the estimated effects have the expected negative sign for both NLS and GGE, there is a substantial difference (approximately $50) between the Stuart et al. (2009) model and our model.

[12] See Pope et al. (2004) for more information on the DCG/HCC model.
[13] Stuart et al. (2009) utilize prescription drug coverage as the instrumental variable in their analysis. We incorporate other instrumental variables in our study and replicate their analysis using the new instrumental variables.
[14] We do not include the estimated coefficients from the three stages since we are primarily interested in the average marginal effect.
[15] Note that although our sample size is small (N of roughly 600), we are not dealing with one of the extreme skewness cases mentioned in the simulation analysis. The nested model selection tests suggest that our data are log-normally distributed, and the log-normal sampling design is not one of the severely skewed distributions.

2.5 Summary, Discussion and Conclusion

Health utilization and expenditure data are known to suffer from skewness and endogeneity. To address skewness, a FIML estimator based on a flexible distributional form (the GG) that offers increased efficiency relative to alternative estimators was suggested by Manning et al. (2005). In this chapter we have proposed a model that retains the precision advantages of the GG estimator but also includes a 2SRI component to account for endogeneity (the GGE estimator).
In order to test our model, we conducted extensive simulation analysis using endogenous regressors and outcomes generated by various distributions. We compared the results obtained by GGE to those from other commonly used models, OLS and GLM with gamma. Our findings suggest that GGE consistently has low average percentage absolute bias across different data distributions and parameter configurations. On the other hand, OLS and GLM provide low average percentage absolute bias only for certain specific data distributions and parameter settings. This versatility of GGE is a testament to its flexible nature and its ability to correct for endogeneity using the 2SRI model. We confirmed that the statistical efficiency inherent in the GG estimator carries over to the GGE model, both in our simulation analysis and in the real data analysis. In our simulation analysis, we were also able to verify the asymptotic properties of the GGE estimator. The distribution identification tests in our simulations showed that the GGE model was able to identify inherent nested models with high accuracy.

Finally, the application of the GGE model to health expenditures data (Stuart et al., 2009) predicted that an increase in prescription drug fills by one unit leads to an $89.30 decrease in hospital expenditures. This result is similar to the result obtained using the OLS model on these data. Note that the Wald test result using the GGE model fails to reject the hypothesis that the dependent variable is log-normally distributed, which makes the above two findings consistent with each other. The marginal effect predicted by the GGE model is roughly 36% smaller in magnitude than that predicted by the previously published NLS model (Stuart et al., 2009).

[16] See Appendix C for the detailed derivation of the asymptotic standard errors of the marginal effect.
Table 2-1. Mean squared error of the marginal effect with percent relative efficiency gain, for sample size 10,000

Data                Estimator        1st Quartile (2SRI)    2nd Quartile (2SRI)    3rd Quartile (2SRI)
Log Normal          OLS for ln(y)    0.5815 (0.382%)        0.8669 (0.348%)        1.2692 (0.320%)
                    GLM Gamma        0.9252 (37.389%)       1.3790 (37.355%)       2.0215 (37.419%)
                    GGM              0.5793                 0.8639                 1.2651
Gamma               OLS for ln(y)    0.1201 (16.134%)       0.1803 (15.989%)       0.2653 (15.890%)
                    GLM Gamma        0.1007 (0.002%)        0.1514 (0.005%)        0.2231 (0.008%)
                    GGM              0.1007                 0.1514                 0.2231
Weibull             OLS for ln(y)    0.0644 (35.553%)       0.0972 (35.015%)       0.1436 (34.551%)
                    GLM Gamma        0.0448 (7.204%)        0.0675 (6.417%)        0.0996 (5.687%)
                    GGM              0.0415                 0.0632                 0.0940
Weibull             OLS for ln(y)    99.2321 (39.006%)      149.5710 (39.736%)     226.7172 (40.964%)
                    GLM Gamma        125.8915 (51.923%)     184.9323 (51.259%)     275.1588 (51.357%)
                    GGM              60.5254                90.1380                133.8454
Exponential         OLS for ln(y)    0.3075 (38.448%)       0.4640 (38.726%)       0.6878 (39.055%)
                    GLM Gamma        0.1900 (0.388%)        0.2851 (0.260%)        0.4198 (0.168%)
                    GGM              0.1893                 0.2843                 0.4191
Generalized Gamma   OLS for ln(y)    34.8443 (60.072%)      50.0370 (58.997%)      72.8109 (58.668%)
                    GLM Gamma        18.4038 (24.404%)      27.1557 (24.447%)      40.0981 (24.949%)
                    GGM              13.9125                20.5169                30.0939
Generalized Gamma   OLS for ln(y)    0.1490 (57.411%)       0.2221 (57.268%)       0.3252 (57.182%)
                    GLM Gamma        0.0700 (9.292%)        0.1047 (9.297%)        0.1535 (9.302%)
                    GGM              0.0635                 0.0949                 0.1393
Generalized Gamma   OLS for ln(y)    0.0304 (51.370%)       0.0459 (50.942%)       0.0677 (50.640%)
                    GLM Gamma        0.0203 (27.383%)       0.0309 (27.104%)       0.0458 (26.914%)
                    GGM              0.0148                 0.0225                 0.0334

Note: The values given in parentheses are the percent relative efficiency gains, which measure the relative efficiency of the GGE estimator (the GGM rows) compared to the OLS and GLM estimators. The percent relative efficiency gain is defined by $100 \times (\mathrm{MSE}_m - \mathrm{MSE}_{GGE}) / \mathrm{MSE}_m$, where m = OLS, GLM.
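The parenthesized entries of Table 2-1 can be reproduced, up to rounding of the tabulated mean squared errors, by reading the percent relative efficiency gain as 100 × (MSE_m − MSE_GGE) / MSE_m. The short Python sketch below applies that reading to the log-normal design at the first quartile; the function and variable names are illustrative assumptions, not the author's code.

```python
def percent_relative_efficiency_gain(mse_alternative, mse_gge):
    """Percent reduction in MSE obtained by using GGE instead of estimator m."""
    return 100.0 * (mse_alternative - mse_gge) / mse_alternative


# Log-normal design, first quartile, sample size 10,000 (values from Table 2-1).
mse_ols, mse_glm, mse_gge = 0.5815, 0.9252, 0.5793

gain_over_ols = percent_relative_efficiency_gain(mse_ols, mse_gge)
gain_over_glm = percent_relative_efficiency_gain(mse_glm, mse_gge)
```

Because the inputs are already rounded to four decimals, the computed gains match the tabulated 0.382% and 37.389% only approximately. A negative gain indicates that the alternative estimator had the lower mean squared error.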
PAGE 51 51 Table 2 2 For sample size 500 mean squared error of the marginal effect with percent relative efficiency g ain 1 st Quartile of 2 nd Quartile of 3 rd Quartile of Data Estimator 2SRI 2SRI 2SRI Log Normal, OLS for ln(y) 11.0042 15.7595 22.0935 ( 1.883%) ( 1.778%) ( 1.646 %) GLM Gamma 17.9952 25.7775 36.7286 (37.698%) (37.776%) (38.857%) GGM 11.2114 16.0397 22.4570 Gamma, OLS for ln(y) 2.7292 3.8404 5.2698 (27.163%) (25.650%) (24.582%) GLM Gamma 1.9789 2.8384 3.9472 ( 0.452%) ( 0. 596%) ( 0.689%) GGM 1.9879 2.8553 3.9744 Weibull, OLS for ln(y) 1.3322 1.8927 2.6193 (38.303%) (37.026%) (35.841%) GLM Gamma 0.9128 1.3144 1.8417 (9.954%) (9.317%) (8.751%) GGM 0.8219 1.1919 1.6805 Weibull, OLS for ln(y) 4889.0390 4267.7430 5798.5690 (63.492%) (50.076%) (44.484%) GLM Gamma 3590.7890 3822.8940 6430.4850 (50.293%) (44.267%) (49.939%) GGM 1784.8660 2130.6130 3219.1520 Exponential OLS for ln(y) 7.1370 9.6358 13.091 6 (41.557%) (38.669%) (37.060%) GLM Gamma 4.1719 5.9140 8.2485 (0.019%) (0.073%) (0.104%) GGM 4.1711 5.9097 8.2399 Generalized Gamma, OLS for ln(y) 2103.2300 1568.9410 2027.5790 (80.647 %) (70.760%) (67.791%) GLM Gamma 572.4497 654.1119 1014.3330 (28.896%) (29.866%) (35.616%) GGM 407.0346 458.7557 653.0694 Generalized Gamma, OLS for ln(y) 3.1888 4.3580 5.9124 (58.239%) (5 6.109%) (54.804%) GLM Gamma 1.4409 2.0386 2.8122 (7.578%) (6.174%) (4.980%) GGM 1.3317 1.9128 2.6722 Generalized Gamma, OLS for ln(y) 0.5603 0.8076 1.1379 (54.001%) (53.064%) (51.400%) G LM Gamma 0.3577 0.5199 0.7444 (27.933%) (27.088%) (25.709%) GGM 0.2577 0.3790 0.5530 PAGE 52 52 Table 2 3 For the large samples a verage p ercent age a bsolute b ias of the marginal effect 1 st Quartile of 2 nd Quartile of 3 rd Quartile of Data Estimator N (1) 2SRI (%) (2) No IV (%) (3) 2SRI (%) (4) No IV (%) (5) 2SRI (%) (6) No IV (%) Log Normal, OLS for ln(y) 10,000 8.740 61.770 9.681 74.968 10.61 7 89.298 100,000 2.436 63.549 2.714 77.361 2.992 92.330 500,000 1.152 63.696 1.283 77.669 1.415 92.861 GLM 
Gamma 10,000 10.960 61.765 12.137 75.021 13.312 89.429 100,000 3.345 63.651 3.728 77.505 4.112 92.523 500,000 1.440 63.749 1.604 77.741 1.769 92.954 GGM 10,000 8.716 61.783 9.653 74.983 10.587 89.315 100,000 2.432 63.530 2.709 77.331 2.988 92.285 500,000 1.218 63.693 1.350 77.641 1.482 92.805 Gamma, OLS for ln(y) 10,000 6.589 62.859 7.319 76.341 8.044 90.986 100,000 2.153 63.621 2.400 77.449 2.648 92.434 500,000 0.870 63.656 0.969 77.613 1.068 92.786 GLM Gamma 10,000 6.035 62.839 6.709 76.304 7.378 90.928 100,000 1.837 63.619 2.043 77.445 2.250 92.427 500,000 0.775 63.626 0.865 77.573 0.95 2 92.734 GGM 10,000 6.029 62.846 6.702 76.313 7.371 90.940 100,000 1.837 63.619 2.043 77.444 2.250 92.426 500,000 0.776 63.628 0.865 77.575 0.953 92.737 Weibull, OLS for ln(y) 10,000 5.459 62.953 6.074 76.455 6.685 91.1 20 100,000 1.580 63.451 1.759 77.226 1.940 92.151 500,000 0.767 63.645 0.851 77.606 0.934 92.783 GLM Gamma 10,000 4.568 62.949 5.084 76.437 5.597 91.085 100,000 1.334 63.460 1.486 77.239 1.640 92.167 500,000 0.638 63.634 0.707 77.591 0.775 92 .765 GGM 10,000 4.342 62.812 4.844 76.239 5.341 90.817 100,000 2.232 63.425 2.459 77.194 2.686 92.112 500,000 0.567 63.070 0.632 76.871 0.697 91.870 Weibull, OLS for ln(y) 10,000 30.858 62.866 34.264 77.425 37.835 93.54 3 100,000 9.386 63.180 10.424 76.925 11.463 91.835 500,000 4.519 63.669 5.018 77.634 5.518 92.820 GLM Gamma 10,000 35.014 64.052 38.741 79.312 42.671 96.289 100,000 11.095 63.395 12.355 77.317 13.619 92.440 500,000 4.803 63.308 5.332 77.214 5 .862 92.335 GGM 10,000 24.100 63.464 26.700 77.467 29.373 92.930 100,000 7.590 63.139 8.450 76.889 9.311 91.799 500,000 3.374 63.927 3.755 77.697 4.138 92.757 PAGE 53 53 Table 2 3 C ontinued 1 st Quartile of 2 nd Quartil e of 3 rd Quartile of Data Estimator N (1) 2SRI (%) (2) No IV (%) (3) 2SRI (%) (4) No IV (%) (5) 2SRI (%) (6) No IV (%) Exponential OLS for ln(y) 10,000 10.445 62.924 11.608 76.527 12.775 91.329 1 00,000 3.098 63.374 3.450 77.131 3.802 92.037 500,000 
1.486 63.637 1.653 77.596 1.820 92.771 GLM Gamma 10,000 8.162 62.997 9.056 76.571 9.949 91.329 100,000 2.569 63.409 2.864 77.179 3.158 92.099 500,000 1.172 63.615 1.300 77.571 1.428 92.743 GGM 10,000 8.152 62.934 9.051 76.494 9.949 91.236 100,000 2.523 63.403 2.819 77.151 3.115 92.047 500,000 1.121 63.622 1.250 77.557 1.378 92.707 Generalized Gamma, OLS for ln(y) 10,000 36.967 61.470 40.589 75.822 44.394 91.744 100,000 11.317 62.711 12.582 76.431 13.851 91.325 500,000 4.974 63.884 5.550 77.957 6.127 93.271 GLM Gamma 10,000 26.417 62.903 29.227 77.258 32.124 93.075 100,000 8.513 62.969 9.471 76.703 10.429 91.599 500 ,000 3.750 63.601 4.176 77.565 4.601 92.750 GGM 10,000 23.697 63.021 26.231 77.177 28.817 92.714 100,000 7.321 62.958 8.141 76.693 8.961 91.586 500,000 3.227 63.571 3.596 77.534 3.965 92.717 Generalized Gamma, OLS for ln(y) 10,000 9.151 62.604 10.136 76.044 11.118 90.650 100,000 2.834 63.364 3.161 77.122 3.487 92.028 500,000 1.252 63.739 1.396 77.729 1.541 92.940 GLM Gamma 10,000 6.359 62.909 7.053 76.402 7.744 91.059 100,000 1.90 6 63.429 2.126 77.203 2.346 92.128 500,000 0.870 63.675 0.969 77.642 1.068 92.826 GGM 10,000 6.038 62.882 6.695 76.373 7.349 91.028 100,000 1.814 63.356 2.024 77.117 2.235 92.026 500,000 0.822 63.581 0.916 77.526 1.010 92.686 Generalized Gamma, OLS for ln(y) 10,000 3.915 62.823 4.363 76.253 4.806 90.832 100,000 1.142 63.482 1.278 77.263 1.414 92.195 500,000 0.511 63.705 0.570 77.681 0.628 92.875 GLM Gamma 10,000 3.255 62.847 3.636 76.279 4.012 90.861 100,000 0.897 63.494 1.003 77.280 1.110 92.216 500,000 0.414 63.678 0.462 77.645 0.508 92.828 GGM 10,000 2.845 62.677 3.175 76.084 3.501 90.638 100,000 0.825 63.266 0.919 76.998 1.014 91.873 500,000 0.347 63.394 0.387 77. 286 0.425 92.386 PAGE 54 54 Table 2 4. 
For the small samples a verage p ercent age a bsolute b ias of the marginal effect 1 st Quartile of 2 nd Quartile of 3 rd Quartile of Data Estim ator N (1) 2SRI (%) (2) No IV (%) (3) 2SRI (%) (4) No IV (%) (5) 2SRI (%) (6) No IV (%) Log Normal, OLS for ln(y) 500 37.783 60.643 41.321 74.114 44.282 85.732 1 000 23.861 61.503 26.876 76.014 29.720 90.594 2 500 17.114 61.793 18.815 76.099 20.495 91.122 5 000 11.747 62.492 13.021 75.792 14.321 90.576 GLM Gamma 500 48.551 61.641 53.220 75.731 57.289 88.153 1 000 31.118 60.692 34.989 75.492 38.705 90.500 2 500 21.925 61.706 24.104 76.209 26.258 91.496 5 000 1 4.927 61.965 16.539 75.197 18.185 89.921 GGM 500 38.209 61.062 41.779 74.572 44.748 86.232 1 000 23.961 61.767 26.987 76.308 29.845 90.918 2 500 17.101 61.891 18.799 76.196 20.471 39.645 5 000 11.784 62.521 13.060 75.820 14.364 90.603 Gamma, OLS for ln(y) 500 31.335 61.277 34.138 74.605 36.476 85.946 1 000 20.046 61.818 22.745 76.216 25.238 90.620 2 500 14.100 62.537 15.576 76.942 17.027 92.049 5 000 9.168 63.200 10.170 76.624 11.195 91.538 GLM Gamma 500 26 .375 61.569 28.854 74.685 30.913 85.745 1 000 18.156 62.080 20.651 76.485 22.929 90.878 2 500 12.567 62.640 13.852 77.036 15.129 92.122 5 000 8.253 63.275 9.166 76.703 10.104 91.616 GGM 500 26.430 61.703 28.925 74.869 30.995 85.980 1 000 18.1 18 62.017 20.606 76.397 22.880 90.761 2 500 12.543 62.883 13.823 77.432 15.096 92.723 5 000 8.274 63.277 9.190 76.706 10.130 91.621 Weibull, OLS for ln(y) 500 24.107 61.406 26.297 74.154 28.197 84.819 1 000 16.515 62.6 56 18.757 77.016 20.783 91.339 2 500 11.157 63.113 12.332 77.625 13.508 92.840 5 000 7.198 63.381 8.002 76.816 8.821 91.736 GLM Gamma 500 20.139 61.587 22.042 74.181 23.752 84.654 1 000 14.027 62.683 15.953 76.940 17.641 91.127 2 500 9.296 62 .991 10.250 77.414 11.212 92.518 5 000 5.883 63.200 6.532 76.556 7.190 91.380 GGM 500 19.398 61.544 21.265 74.104 22.960 84.541 1 000 13.391 62.501 15.231 76.685 16.836 90.790 2 500 8.951 62.898 9.867 77.285 10.790 92.347 5 
000 5.890 62.866 6 .532 76.099 7.183 90.779 PAGE 55 55 Table 2 4 C ontinued 1 st Quartile of 2 nd Quartile of 3 rd Quartile of Data Estimator N (1) 2SRI (%) (2) No IV (%) (3) 2SRI (%) (4) No IV (%) (5) 2SRI (%) (6) No IV (%) Weibull, OLS for ln(y) 500 174.135 116.031 172.859 142.537 182.334 172.394 1 000 101.121 86.337 111.586 110.611 125.538 138.336 2 500 66.614 67.601 72.859 86.281 80.146 107.268 5 000 4 3.326 63.716 48.000 79.191 53.135 96.915 GLM Gamma 500 160.658 111.766 164.120 140.805 177.166 174.705 1 000 105.551 85.606 113.430 109.821 125.954 138.026 2 500 74.288 67.939 79.617 85.865 86.070 106.019 5 000 52.077 61.773 56.457 76.897 61.483 94.298 GGM 500 121.808 91.341 127.821 114.549 137.313 138.841 1 000 75.819 72.453 84.836 93.184 95.300 116.018 2 500 52.713 63.002 57.271 79.360 62.134 97.268 5 000 35.294 61.407 38.010 75.648 41.333 91.767 Exponential OLS for ln(y) 500 48.557 63.166 52.357 77.666 55.660 90.502 1 000 31.862 61.911 36.088 77.058 40.148 92.453 2 500 21.732 62.660 24.061 77.486 26.359 93.149 5 000 14.365 63.348 15.970 76.982 17.617 92.171 GLM Gamma 500 38.354 61.325 41.779 74.930 44.726 86.672 1 000 24.855 62.285 28.249 77.048 31.438 91.914 2 500 17.303 62.247 19.065 76.715 20.788 91.923 5 000 11.341 62.853 12.565 76.238 13.815 91.117 GGM 500 38.279 61.293 41.694 74.868 44.640 86.576 1 000 24.960 62.292 28.359 77.064 31.565 91.938 2 500 17.311 62.162 19.095 76.611 20.844 91.800 5 000 11.419 62.840 12.691 76.223 13.998 91.100 Generalized Gamma, OLS for ln(y) 500 216.083 126.006 206.200 152.897 214.710 185.599 1 000 114.183 88 .418 124.841 114.146 141.059 144.522 2 500 77.942 65.292 84.611 83.302 92.881 103.588 5 000 49.045 62.777 53.992 78.469 59.547 96.566 GLM Gamma 500 134.372 89.954 140.609 110.875 152.430 132.945 1 000 82.625 72.716 92.260 94.069 103.881 117.789 2 500 54.857 61.421 59.606 77.536 64.760 95.174 5 000 37.119 63.859 41.183 78.916 45.563 95.998 GGM 500 112.709 81.574 118.408 100.969 127.005 120.607 1 000 70.589 67.607 78.887 
86.512 88.167 106.962 2 500 49.639 60.133 54.220 75.562 59.091 9 2.290 5 000 31.170 63.034 34.561 77.604 38.156 94.064 PAGE 56 56 Table 2 4 C ontinued 1 st Quartile of 2 nd Quartile of 3 rd Quartile of Data Estimator N (1) 2SRI (%) (2 ) No IV (%) (3) 2SRI (%) (4) No IV (%) (5) 2SRI (%) (6) No IV (%) Generalized Gamma, OLS for ln(y) 500 16.717 61.011 18.310 73.337 19.783 83.523 1 000 11.655 62.968 13.277 77.264 14.649 91.478 2 500 8.186 62.903 9.022 77.244 9.884 92.248 5 000 4.810 63.191 5.353 76.517 5.900 91.304 GLM Gamma 500 13.253 61.327 14.546 73.631 15.831 83.771 1 000 9.681 62.794 11.028 76.972 12.116 91.045 2 500 6.794 63.020 7.464 77.381 8.174 92.403 5 000 3.774 63.202 4.202 76.520 4.632 91.294 GGM 500 11.506 61.256 12.689 73.496 13.965 83.565 1 000 8.478 62.151 9.666 76.110 10.590 89.943 2 500 5.961 63.020 6.519 77.410 7.134 92.466 5 000 3.103 62.878 3.454 76.107 3.808 90.776 Generalized Gam ma, OLS for ln(y) 500 41.263 59.452 44.599 72.748 47.428 84.260 1 000 26.344 62.653 29.923 77.651 33.307 92.795 2 500 19.020 62.051 21.053 76.433 23.058 91.547 5 000 11.998 62.917 13.329 76 .323 14.691 91.236 GLM Gamma 500 28.237 59.696 30.783 72.272 32.943 82.829 1 000 18.178 62.454 20.633 76.871 22.872 91.270 2 500 13.341 62.467 14.729 76.804 16.105 91.828 5 000 7.976 63.299 8.871 76.723 9.782 91.635 GGM 500 27.108 59.582 29.66 3 72.060 31.868 82.511 1 000 17.803 62.332 20.216 76.690 22.429 91.019 2 500 12.620 62.433 13.923 76.755 15.220 91.758 5 000 7.732 63.312 8.593 76.741 9.471 91.657 PAGE 57 57 Table 2 5 Nested model s election tests from the GGE estimator for sample size 1 0,000 Data Proportion Significant at 5% Gamma, Lognormal, Weibull, Exponential, Log Normal, 1.0 0.056 1.0 1.0 Gamma, 0.055 1.0 1.0 1.0 Weibull, 1.0 1.0 0.034 1.0 Exponential 0.05 1.0 0.058 0.05 Generalized Gamma, 1.0 1.0 1 .0 1.0 PAGE 58 58 Figure 2 1 H istogram and Kernel density estimate of hospital expenditures (Overall) Figure 2 2 H istogram and Kernel density estimate of hospital 
expenditures (Non-zero)

Table 2-6. Descriptive statistics of the study sample: prescription drug use and hospital cost offsets

Variables                               Hospitalized (N = 624)    Overall Sample (N = 3,101)
Dependent variables
  Hospitalized                          NA                        20.1%
  Hospital expenditures                 $10,425 (11,627)          $2,098 (6,682)
Independent variables
  Number of prescription fills          40 (31)                   30 (27)
  Medicare entitlement status
    Aged/no prior disability            72.8%                     76.7%
    SSDI disabled (<65)                 15.7%                     16.2%
    Aged/previously disabled (>65)      11.5%                     7.1%
  Age
    <65                                 18.7%                     20.0%
    65-69                               9.3%                      8.5%
    70-74                               19.4%                     24.3%
    75-79                               18.9%                     18.8%
    80+                                 33.7%                     28.4%
  Female                                55.3%                     57.1%
  Marital status
    Married                             43.1%                     48.4%
    Single/widowed/divorced/separated   56.9%                     51.6%
  Educational attainment

Table 2-7. The estimation results of the real data analysis: prescription drug use and hospital cost offsets (average marginal effect)

Type of Estimation    2SRI       S.E.     No correction    S.E.
NLS                   -140.48    8.323    15.69            0.831
OLS for ln(y)         -87.29     4.057    17.78            1.846
GLM (Gamma)           -109.32    5.537    16.34            1.681
GG                    -89.30     4.538    16.32            1.774

CHAPTER 3
THE GENERALIZED GAMMA ESTIMATOR WITH A FLEXIBLE FORM CONDITIONAL MEAN REGRESSION SPECIFICATION

3.1 Introduction and Background

The most widely used models designed to take account of skewed data in health economics and health services research assume that the conditional mean is an exponential function of the regressors, e.g. generalized linear models (GLM) and log-linear models estimated via OLS. A misspecification of the conditional mean (e.g., assuming it has an exponential form when it does not) can lead to biased estimates of targeted marginal effects. It has been shown that the introduction of a more flexible conditional mean function can serve to alleviate such problems. One such model, which incorporates the inverse of the classical Box-Cox transformation (Box and Cox, 1964), was first used in a regression context by Wooldridge (1992).
This model encompasses the exponential, linear and power specifications as special cases. More recently, Kenkel and Terza (2001), Terza et al. (2008a) and Terza et al. (2008b) used a variant of the inverse Box-Cox (IBC) transformation with nonlinear least squares (NLS). Since NLS is not a full information maximum likelihood (FIML) estimator, it may not be efficient in all cases. It is not, however, prone to misspecification bias. The application of this IBC transformation to GLM was explored by Basu and Rathouz (2005). Their model is implemented using an extension to the estimating equations in GLM. Since this also is not a FIML estimator, it is not as efficient as a FIML model.

In Chapter 2 we introduced the generalized gamma (GG) estimator, which is known for the flexibility it allows in the conditional distribution of the outcome. In particular, it allows for skewness and subsumes many different distributions as special cases. Being a FIML estimator, GG is fully asymptotically efficient. Here we combine the distributional flexibility and asymptotic precision of the GG model with the conditional mean regression flexibility of the IBC specification and obtain a new model called the generalized gamma with inverse Box-Cox transformation (GG-IBC). This composite model, with additional flexibility in its parametric structure, is consistent, relatively robust, and efficient relative to alternatives such as the NLS and the GG model.

In this chapter, using extensive simulation analyses, we show that when the data are obtained from a distribution with a non-exponential conditional mean: (1) the GG-IBC estimator is consistent while GG is not; (2) the GG-IBC estimator is consistent for the average marginal effect; and (3) the GG-IBC estimator is more precise than both NLS applied to an IBC conditional mean regression specification (NLS-IBC) and the GG estimator.
We also used the GG-IBC model in a real data application to estimate the effect of cigarette smoking during pregnancy on newborn birthweight (ignoring the potential endogeneity of smoking). For the purpose of comparison, we also estimated the model using the NLS-IBC and the GG estimators.

The rest of this chapter is organized as follows. In Section 3.2 we begin by detailing the IBC transformation and discussing how it can be combined with the GG conditional distribution specification to obtain the GG-IBC estimator. Section 3.3 summarizes the simulation analysis, briefly outlining the sampling designs, the other estimators used in our comparisons, and our evaluation metrics (i.e., average percentage absolute bias and mean squared error). The section concludes with a summary of the simulation results. Section 3.4 presents the real data application, where we use GG-IBC to estimate the effect of cigarette smoking during pregnancy on infant birthweight. Finally, in Section 3.5 we present our concluding remarks for this chapter.

3.2 Inverse Box-Cox Transformation in the GG Framework

In this section we extend the GG model by specifying the conditional mean regression model using the IBC transformation. This transformation allows for more flexibility in the functional form of the regression model. Note that this is in contrast with popularly used estimators such as GLM and log-normal estimators, which rigidly assume an exponential conditional mean.

3.2.1 Inverse Box-Cox Transformation

First introduced by Box and Cox (1964), the Box-Cox transformation is an application of a power transformation to the outcome variable y that brings about linearity in the parameters of the regression model.
They suggested that the outcome variable y be transformed according to the following equation

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases} \qquad (3\text{-}1)$$

In their paper they were interested in inference on the transformation parameter $\lambda$. Wooldridge (1992) observed that, in general, the conditional mean of y given the regressors is of prime interest and that a transformation of the outcome variable as suggested by Box and Cox (1964) does not provide any direct help in estimation and inference in the conditional mean regression context. He instead used the Box-Cox transformation as a device for generalizing the functional form of the conditional mean. This generalization subsumes many special cases, such as the exponential form, the classical linear model, and the power specifications. This IBC transformation form of the conditional mean is given as

$$E[y \mid x] = \begin{cases} (1 + \lambda x\beta)^{1/\lambda}, & \lambda \neq 0 \\ \exp(x\beta), & \lambda = 0 \end{cases} \qquad (3\text{-}3)$$

Note that the transformation of $x\beta$ in (3-3) is the inverse of the Box-Cox transformation (of y) used in (3-1). The advantage of introducing more flexibility to the conditional mean is that it leads to a more robust estimator. The transformation given in equation (3-3) is modified and used in the following section.

3.2.2 Generalized Gamma with a Flexible Form Conditional Mean Function

The GG model encompasses FIML models based on a variety of distributions such as the standard gamma, log-normal, exponential and Weibull. [1] At the same time, in the GG framework the parameter $\mu$, which serves as the foundation for the formulation of the conditional mean regression, has a very restrictive form, as is seen in equation (2-2). By specifying $\mu$ using the IBC transformation, we seek to generate a more robust GG model (i.e., one that is less susceptible to misspecification bias). In the GG framework, we respecify (2-2) using the IBC transformation function in the following way

$$\mu = \ln\left[h(x\beta, \lambda)\right] \qquad (3\text{-}4)$$

where

$$h(x\beta, \lambda) = \left(\frac{\lambda}{2}\, x\beta + 1\right)^{2/\lambda} \qquad (3\text{-}5)$$

and $\lambda$ is a scalar parameter such that $\frac{\lambda}{2} x\beta + 1 > 0$. Equation (3-5) is a variant of the transformation used by Wooldridge (1992) given in equation (3-3). The conditional mean of the outcome implied by (3-4) will assume different specifications according to varying values of the parameter $\lambda$. These can be listed as follows:

Case 1 ($\lambda = 0$): As $\lambda$ approaches zero, equation (3-5) approaches $\exp(x\beta)$ and subsequently $\mu = x\beta$. The case $\lambda = 0$ is the conventional GG specification, where the conditional mean of the outcome takes on the exponential linear index form (3-6), as shown in equation (2-3).

Case 2 ($\lambda = 2$): For $\lambda = 2$, equation (3-5) becomes $x\beta + 1$ and $\mu = \ln(x\beta + 1)$. Substituting this into equation (2-3) yields the classical linear conditional mean regression model (3-7).

Case 3 ($\lambda \neq 0, 2$): Here $\mu = \frac{2}{\lambda} \ln\left(\frac{\lambda}{2} x\beta + 1\right)$, and substituting this into equation (2-3) yields the expression (3-8) for the conditional mean.

The GG-IBC estimator is generated by substituting (3-4) into the GG probability density function in equation (2-1). It is estimated by using the maximum likelihood (ml) procedure in Stata, and the parameters to be estimated are $\beta$, $\sigma$, $\kappa$ and $\lambda$. The vector $\beta$ comprises the regression coefficient parameters, $\sigma$ and $\kappa$ are the basic parameters of the GG distribution, and $\lambda$ is the transformation parameter in the IBC equation.

[1] We have described the GG model in detail in Chapter 2.

3.3 Simulation Analysis

The simulation analysis presented here aims to verify three primary statistical properties of the GG-IBC estimator. First, we validate the consistency of the GG-IBC method based on the estimated values of the marginal effect of a variable of interest, using the average percentage absolute bias as the evaluation criterion. The average percentage absolute bias of the marginal effect is evaluated for increasing sample sizes from 10,000 to 500,000. If the estimator is consistent, as theory predicts, the average percentage absolute bias of the marginal effect estimates should diminish as the sample size increases.
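The evaluation criterion described above can be sketched in a few lines. The exact formula for the average percentage absolute bias is not restated in this section, so the definition used below (the mean absolute deviation of the estimates from the true marginal effect, expressed as a percentage of the true value) is an assumption, as are the function name and the toy numbers.

```python
def average_percentage_absolute_bias(estimates, true_value):
    """Mean of |estimate - truth| / |truth| over simulation repetitions, in percent."""
    if true_value == 0:
        raise ValueError("true_value must be non-zero")
    total_abs_dev = sum(abs(est - true_value) for est in estimates)
    return 100.0 * total_abs_dev / (len(estimates) * abs(true_value))


# Hypothetical marginal-effect estimates from three repetitions at two sample sizes.
true_effect = 1.0
small_sample_estimates = [0.80, 1.25, 1.10]
large_sample_estimates = [0.98, 1.01, 1.02]

apab_small = average_percentage_absolute_bias(small_sample_estimates, true_effect)
apab_large = average_percentage_absolute_bias(large_sample_estimates, true_effect)
```

For a consistent estimator the second figure should be the smaller one, which is the pattern looked for across the sample sizes from 10,000 to 500,000.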
Similar analyses are conducted for the relevant alternative methods, GG and NLS-IBC. Second, using these same results, we analyze the relative unbiasedness (consistency) of the GG-IBC vis-a-vis GG and NLS-IBC. Finally, we seek to verify the asymptotic efficiency of the GG-IBC estimator using the mean squared error of the marginal effect as our evaluation metric. For this we present the percent relative efficiency gain values, which measure the relative efficiency of the GG-IBC estimator compared to the GG and the NLS-IBC estimators.

In order to accomplish these objectives we use various sampling designs to generate the data. A variety of conditional distributions are used, including the standard gamma, exponential, Weibull, GG, log-normal and beta, with different conditional mean specifications. Note that different conditional mean functions here correspond to different values of the IBC transformation parameter. The criteria for evaluation include the average percentage absolute bias and the mean squared error of the estimated marginal effect, [2] i.e., the expected derivative of the conditional mean function with respect to a continuous regressor.

3.3.1 Sampling Designs

3.3.1.1 The observable variables

In this simulation analysis our focus is on continuous regressors, although the analysis is also applicable to binary and count regressors. There are two observable confounders, each distributed uniformly on the [0, 4] interval. Without loss of generality we choose one of them as the policy variable of interest. Both regressor coefficients are chosen to be equal to 0.5 and the intercept is set to zero. For each repetition of the simulation, j (the total number of repetitions is 500), and for each sampling design, the same observable variables are used. The simulation analysis is implemented in Stata/Mata 10.

3.3.1.2 The outcome variable

In each instance the distribution of the outcome variable is chosen to be one of standard gamma, Weibull, exponential, log-normal, GG, or beta.
Then the value of the IBC parameter λ is picked to generate data for a particular conditional mean function. Our objectives in implementing these varied sampling designs are threefold. The first objective is to analyze the GG-IBC estimator in the context of distributions that are within the GG family. The second objective is to observe whether the GG-IBC estimator maintains its accuracy for a distribution that is outside the GG family but has an exponential conditional mean. Finally, we want to analyze the behavior of the GG-IBC estimator across varying values of λ. The various distributions used for data generation are detailed below:

Footnote 2: We can also estimate the treatment effect for a binary regressor and the incremental effect for a count regressor.

Standard gamma: The first sampling design for the outcome variable is the standard gamma, a special case of GG where κ = σ. In order to generate an outcome variable with alternative conditional mean functions, the IBC transformation is implemented in the Stata rgamma(a,b) command, where a is the shape parameter and b is the scale parameter. By assigning varying values to the parameter λ, the Stata command rgamma(1/σ^2, exp(ln(((λ/2)*xβ+1)^(2/λ)))*σ^2) is used to generate data in the simulation analysis. A right-skewed outcome variable is formed through the chosen value of the shape parameter.

Log-normal: The second data generation process used is the log-normal distribution. It is a special case of the GG distribution where the parameter κ goes to zero and σ can take any value. For the log-normal sampling design, in the first step we generate a normally distributed error term e using the Stata command rnormal(a,b), where a is the mean and b is the standard deviation. In the second step we substitute this error term into the regression function for the log of the outcome, where the vector x includes the policy variable of interest x_p and the other observable variable x_o. The Stata command used for generation of this outcome variable is exp(ln(((λ/2)*xβ+1)^(2/λ))+e). The parameter σ is set to 0.2.
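As a rough illustration of the standard gamma and log-normal sampling designs just described, the following Python sketch mirrors the structure of the quoted Stata commands using only the standard library. It is a sketch under stated assumptions rather than the author's code: the IBC form ((λ/2)·index + 1)^(2/λ) is read off the commands, the gamma parameter value 0.7 is a hypothetical choice for illustration, and all function names are made up here.

```python
import math
import random


def ibc(index, lam, tol=1e-8):
    """Assumed IBC form: ((lam/2)*index + 1)**(2/lam), exp(index) as lam -> 0."""
    if abs(lam) < tol:
        return math.exp(index)
    return ((lam / 2.0) * index + 1.0) ** (2.0 / lam)


def draw_gamma_outcome(index, lam, kappa, rng):
    """Standard gamma design: shape 1/kappa^2, scale ibc(index)*kappa^2.

    random.gammavariate(alpha, beta) takes shape alpha and scale beta,
    so the conditional mean of the draw is shape * scale = ibc(index, lam).
    """
    return rng.gammavariate(1.0 / kappa ** 2, ibc(index, lam) * kappa ** 2)


def draw_lognormal_outcome(index, lam, sigma, rng):
    """Log-normal design: exp(ln(ibc(index)) + e) with e ~ N(0, sigma^2)."""
    e = rng.gauss(0.0, sigma)
    return math.exp(math.log(ibc(index, lam)) + e)


if __name__ == "__main__":
    rng = random.Random(12345)
    lam = 1.0  # IBC parameter used in most of the reported tables
    for _ in range(5):
        # Two regressors uniform on [0, 4], coefficients 0.5, zero intercept.
        x_p, x_o = rng.uniform(0, 4), rng.uniform(0, 4)
        index = 0.5 * x_p + 0.5 * x_o
        print(draw_gamma_outcome(index, lam, 0.7, rng),
              draw_lognormal_outcome(index, lam, 0.2, rng))
```

In the gamma draw the conditional mean equals the IBC-transformed index, while in the log-normal draw the mean is that quantity scaled by exp(σ²/2). A constant scaling of this kind is what the constant shifter discussed for NLS-IBC has to absorb.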
Weibull: The third data generation process used is the Weibull. The Inverse Transform Method (Rubinstein, 1981) is applied to implement the Weibull sampling design, with the IBC transformation of equation (3-4) inserted in place of the corresponding GG parameter. Note that U is a uniformly distributed random variable. The Stata command used for generation of the outcome variable is …+1)^(2/λ)))*(… where the value of λ governs the form of the conditional mean function. The value chosen for the parameter σ is 0.5.

Exponential: The next data generating process chosen is the exponential distribution. The exponential is a special case of the standard gamma where κ = σ = 1. It can be generated using the rgamma(1,b) command in Stata. Again the scale parameter b is replaced by equation (3-4), and the full Stata command utilized to generate this outcome variable is rgamma(1, exp(ln(((λ/2)*xβ+1)^(2/λ)))) with varying values of λ.

Generalized gamma (GG): This is the last sampling design chosen from the GG class of distributions. This data generating process utilizes the standard gamma generator as given in equation (2-14) (Tadikamalla, 1979). The corresponding parameter is replaced by equation (3-4) to generate varying conditional mean functions. The Stata command used for generating the outcome variable using this sampling design is d*((rgamma(a,1))^(1/c))*(…). The selected parameter values are … i.e. … and … i.e. …

Beta: The last data generating process is based on a variant of the beta distribution, which is outside the GG class of distributions. It does, however, have an exponential conditional mean function.
The support of this distribution is the unit interval. The mean of the beta distribution is defined in terms of its two shape parameters, where z is the beta-distributed random variable. We extend the support of this random variable to the positive part of the real line and generate a new random variable by rescaling z, so that the mean of the new distribution is rescaled accordingly. This distribution can be used to generate a wide variety of distributional shapes, e.g., right-skewed, left-skewed, uniform, symmetric, modes at the extremes, etc. To develop a conditional (on x) version of the random variable y, we can specify equation (3-9), in which the scaling term is defined as in equation (3-5). This leads to the conditional mean function of y given in equation (3-10). The beta sampling design is implemented using the (((λ/2)*xβ+1)^(2/λ))*rbeta( ) command. In our experiments various values for the parameters were tried, but some of them failed to converge with one or more of the estimation methods that we detail in Section 3.3.2. Here we have reported results for … and …

3.3.2 Estimators Used in Evaluation and Comparison

The GG-IBC estimator is a FIML estimator. It is programmed in Stata using maximum likelihood programming via the ml command. The main difference between the GG and the GG-IBC estimators is the inclusion of an additional parameter (the IBC parameter λ) in the latter case. The parameter λ defines the form of the conditional mean, which in GG-IBC can be a function other than the exponential. We compare the GG-IBC estimator with two other estimators. The first one is the GG estimator, which is selected to analyze the effect of introducing a flexible conditional mean form in an estimator. Note that both the GG and the GG-IBC estimators are estimated with FIML. The second estimator is the NLS with IBC transformation.
This NLS-IBC estimator, like GG-IBC, also takes the flexible conditional mean form into account. But since NLS-IBC is not a FIML estimator, it does not take higher-order moments into account, which may lead to decreased precision. The NLS-IBC estimator is chosen in order to compare the GG-IBC with an estimator that also has a flexible conditional mean form but is not a FIML estimator. In the following we describe the three chosen estimators in detail.

Generalized gamma (GG): The GG estimator is the natural alternative estimator for comparison with GG-IBC. Since it is also a FIML estimator like GG-IBC, it is based on the full conditional moment assumptions. In GG the IBC parameter λ is assumed to be equal to zero, which, as we have seen in equation (2-3), implies that the conditional mean is exponential in the linear index. The main objective here is to observe how the GG estimator works when the data have a different conditional mean form than the one assumed by the GG estimator. The streg command in Stata is used to implement this model.

Nonlinear least squares with IBC (NLS-IBC): The objective of this chapter is to evaluate the usefulness of a flexible conditional mean function with a FIML estimator. So far in the literature the flexible conditional mean form has only been used with estimators that are not FIML estimators. NLS-IBC (Kenkel and Terza, 2001; Wooldridge, 1992) is one such estimator with which we want to compare our model.
The conditional mean function of the NLS-IBC model can be defined as in equation (3-11), which takes the form given in equation (3-8), but scaled by a factor and with the scaled constant term shifted (see footnote 3). The comparison between the GG-IBC and the NLS-IBC estimators will highlight the usefulness of FIML-based estimation even when flexible conditional mean forms are used. The NLS estimator with the exponential conditional mean assumption with IBC transformation was implemented in Stata using the nl command.

Generalized gamma with IBC (GG-IBC): This is the proposed GG-IBC estimator, which was detailed in Section 3.2.2. Note that both the GG-IBC estimator and the NLS-IBC estimator offer flexibility with respect to the conditional mean. Therefore, both are robust to conditional mean misspecification bias. When the data distribution is from the GG family, the GG-IBC estimator is consistent irrespective of the conditional mean form. The NLS-IBC estimator, on the other hand, is consistent irrespective of the data distribution, since it does not make any assumptions about it. The GG-IBC estimator is implemented in Stata using the maximum likelihood program and ml commands (see footnote 4).

Footnote 3: See Appendix D for the derivation of the scaling terms.

Footnote 4: For more information on programming maximum likelihood estimators in Stata, see Gould et al. (2003).

3.3.3 Criteria for Evaluation and Comparison

Health economic models are generally nonlinear by nature, and researchers are usually interested in estimating the marginal effects of one or more of the regressors. In a nonlinear setting the coefficient estimates of a model do not represent the marginal effects of the regressors. The marginal effect for a continuous policy variable, x_p, is defined as the derivative of the conditional mean of the outcome variable with respect to x_p. For the GG-IBC model, for each case summarized in Section 3.2.2, there is a different conditional mean function, as given in equations (3-6), (3-7) and (3-8).
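As a hedged sketch of how a marginal effect follows from a conditional mean of this kind, suppose (as an assumption made here, not a statement of the author's exact equations) that the conditional mean is a constant factor times the IBC transformation of the linear index, with the transformation written in the form suggested by the quoted Stata commands. The symbols C, β_p and λ below are illustrative notation.

```latex
% Assumed conditional mean: constant factor e^{C} times the IBC transform of x\beta
E[y \mid x] = e^{C}\left(\tfrac{\lambda}{2}\,x\beta + 1\right)^{2/\lambda}

% Differentiating with respect to the policy variable x_p (chain rule):
\frac{\partial E[y \mid x]}{\partial x_p}
  = e^{C}\,\frac{2}{\lambda}\cdot\frac{\lambda}{2}\,\beta_p
    \left(\tfrac{\lambda}{2}\,x\beta + 1\right)^{\frac{2}{\lambda}-1}
  = e^{C}\,\beta_p\left(\tfrac{\lambda}{2}\,x\beta + 1\right)^{\frac{2}{\lambda}-1}

% Special cases:
\lambda \to 0:\quad \frac{\partial E[y \mid x]}{\partial x_p} = e^{C}\,\beta_p\,e^{x\beta}
\qquad
\lambda = 2:\quad \frac{\partial E[y \mid x]}{\partial x_p} = e^{C}\,\beta_p
```

Under this assumption the linear case has a constant marginal effect, while in the other cases the marginal effect varies with x, which is why the simulation reports it at several quartiles of the policy variable.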
The marginal effects for these cases are defined in equations (3-12), (3-13) and (3-14), respectively (see footnote 5). The marginal effects as given in (3-12), (3-13) and (3-14) can be consistently estimated using equations (3-15), (3-16) and (3-17), respectively.

Footnote 5: Appendix E summarizes the derivation of these marginal effects.

Both the bias and the precision of the marginal effect as estimated from the GG-IBC model are of interest. We used the following two metrics in our comparison of models. The first is the average percentage absolute bias of the estimated marginal effect, which is defined in equation (3-18), where ME is the true average marginal effect, j indexes the repetitions in the simulation, k is the number of observations used in the simulation analysis (k = 10,000, 50,000, 100,000, 250,000 and 500,000) and d is the sampling design. The true value of the marginal effect is derived separately for each sampling design and each value of λ. To obtain the true values of (3-12), (3-13) and (3-14), we generated a large sample of 5 million observations using Monte Carlo simulation and calculated the true values from this super-sample average. The second metric for comparison is the mean squared error of the estimated marginal effect, which is defined in equation (3-19), where j, k and d are defined as above.

3.3.4 Simulation Results

This simulation analysis aims to examine the efficiency gains achieved with the use of a FIML estimator as well as the bias performance attained through the introduction of a flexible conditional mean form. As mentioned before in Section 2.3.4, Manning et al. (2005) showed that the GG estimator provides higher efficiency compared to the log-linear and the GLM models. Here we want to investigate whether this efficiency advantage of the GG model is preserved even after incorporation of the IBC transformation described above.
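The two evaluation metrics can be sketched in Python as follows. Because the exact displayed forms of equations (3-18) and (3-19) are not reproduced here, this is an assumed reading of their names: the bias metric averages the absolute percentage deviation of each repetition's estimate from the true marginal effect, and the mean squared error averages the squared deviations. The function names are hypothetical.

```python
def average_percentage_absolute_bias(estimates, true_value):
    """Average over repetitions of |estimate - truth| / |truth|, in percent.

    Assumed reading of the metric; `estimates` holds one marginal-effect
    estimate per simulation repetition.
    """
    total = sum(abs((est - true_value) / true_value) for est in estimates)
    return 100.0 * total / len(estimates)


def mean_squared_error(estimates, true_value):
    """Average over repetitions of the squared deviation from the truth."""
    total = sum((est - true_value) ** 2 for est in estimates)
    return total / len(estimates)


if __name__ == "__main__":
    # Toy numbers, not simulation output from the dissertation.
    toy_estimates = [0.9, 1.1, 1.0]
    print(average_percentage_absolute_bias(toy_estimates, 1.0))
    print(mean_squared_error(toy_estimates, 1.0))
```

In the simulation design described above, the list of estimates would contain 500 entries for each combination of estimator, sampling design, sample size and quartile of the policy variable.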
As before (Section 2.3.4), the mean squared error of the marginal effect, as defined in equation (3-19), will be our metric of evaluation and comparison as we examine the statistical efficiency of the various models. Along with efficiency, the consistency of the GG-IBC estimator is also of interest, since we have introduced additional flexibility through the IBC transformation. Here we use the average percentage absolute bias of the marginal effect, as defined in equation (3-18), as our metric for evaluation and comparison.

Table 3-1 summarizes the average percentage absolute bias of the marginal effect of the policy variable for the GG sampling design. Here we present results for increasing sample sizes: 10,000, 50,000, 100,000, 250,000 and 500,000. Since our model is nonlinear, the marginal effects are calculated at three different quartile values of the policy variable. The objective here is to examine the consistency of the GG-IBC estimator relative to the other estimators. As the number of observations increases, the average percentage absolute bias for the GG-IBC estimator goes to zero. For instance, at the first quartile it is 1.882% for samples of size 10,000, goes down to 0.820% for samples of size 50,000, further down to 0.576% for samples of size 100,000, and so on. It is noteworthy that the average percentage absolute bias for the GG estimator remains high even as the sample size is increased: for samples of size 10,000 it is 9.005%, it goes down a little to 8.752% for samples of size 50,000, but increases to 8.796% for samples of size 100,000, etc. It is evident that the GG estimator is not a consistent estimator here, since it does not take into account the non-exponential functional form of the conditional mean used to generate the outcome variable. The NLS-IBC estimator, on the other hand, is a consistent estimator, as can be seen in Table 3-1, although it has slightly higher average percentage absolute bias compared to the GG-IBC estimator.
As was discussed earlier, the NLS-IBC estimator is consistent irrespective of the data distribution.

Next we examine the bias performance of the GG-IBC estimator as compared to the GG and the NLS-IBC estimators. Table 3-2 presents the average percentage absolute bias of the estimated marginal effect for samples of size 10,000 for the various sampling designs detailed in Section 3.3.1.2. Note that for every sampling design, the GG-IBC estimator has lower average percentage absolute bias than the GG and the NLS-IBC estimators. The GG estimator is not consistent, since it does not correct for the conditional mean functional form. For example, for the standard gamma distributed samples, the GG estimator generates an 8.914% bias whereas the GG-IBC has a bias of 1.803% at the first quartile of x_p. At the second quartile of x_p the percentage biases from GG and GG-IBC are closer to each other, because the functional forms of the two percentage bias functions are closest to each other at the mean value of x_p. At the third quartile of x_p the percentage bias from the GG goes up to 19.662% whereas the percentage bias from the GG-IBC is still low at 2.910%. Similar results are observed for the other sampling designs too: Weibull, exponential, log-normal, generalized gamma, and beta.

Although NLS-IBC is a consistent estimator, the GG-IBC estimator generates slightly less percentage bias. For example, for Weibull distributed data at the first quartile of x_p, the average percentage absolute bias is 1.703% for the NLS-IBC whereas it is 1.263% for the GG-IBC. The values observed at the other quartiles are similar in nature. This trend is maintained for the other sampling designs, and similar results are observed in comparisons of the GG-IBC and the NLS-IBC estimators.

For examining the precision of estimators, mean squared error is the most widely used metric of evaluation and comparison.
Table 3-3 presents the mean squared error of the estimated marginal effect of the policy variable for samples of size 10,000 for the various sampling designs. Note that since the marginal effects generated by the three included estimators are small, the computed mean squared errors are quite small. In view of this, to highlight the efficiency gain attained by the use of GG-IBC compared to the GG and the NLS-IBC estimators, we used a percentage measure in our evaluation which indicates the efficiency of the GG-IBC estimator relative to that of the alternative estimators. It is defined as 100 × (MSE_m − MSE_GG-IBC) / MSE_m, where m = GG, NLS-IBC. The values next to the mean squared errors, in parentheses, represent this percentage efficiency gain. Since the GG-IBC estimator is a FIML estimator, it is expected to provide efficiency gains compared to NLS-IBC, and this is indeed what we observe. For the standard gamma distributed data, GG-IBC provides a 53% efficiency gain at the first quartile, 23% at the second quartile and 30% at the third quartile of x_p compared to NLS-IBC. Similarly, for the rest of the sampling designs, the efficiency gain from GG-IBC compared to NLS-IBC ranges from 9% to an impressive 91%. The percentage efficiency gain from GG-IBC compared to GG ranges from 43% to 99% across the sampling designs.

Most commonly, researchers are interested in the marginal effects of various policy variables, but in a nonlinear model setting the parameter estimates by themselves do not represent the marginal effects. However, in Table 3-4 we present the parameter estimates obtained by the various methods, for the sake of completeness and also to highlight the consistency in the estimates of the crucial parameter λ as obtained from the GG-IBC and NLS-IBC estimators. It can be noted that the parameter estimates obtained from the GG-IBC estimator are consistent for data distributed as gamma, Weibull, exponential, log-normal and GG.
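The percent relative efficiency gain can be computed directly from two mean squared errors. The short Python sketch below uses the ratio form 100 × (MSE_m − MSE_GG-IBC) / MSE_m, which reproduces the parenthesized entries of Table 3-3 for the gamma design at the first quartile; the function name is hypothetical.

```python
def percent_relative_efficiency_gain(mse_alternative, mse_gg_ibc):
    """Percentage reduction in MSE from using GG-IBC instead of estimator m."""
    return 100.0 * (mse_alternative - mse_gg_ibc) / mse_alternative


if __name__ == "__main__":
    # Gamma design, first quartile of the policy variable (values from Table 3-3).
    mse_gg, mse_nls_ibc, mse_gg_ibc = 0.0062978, 0.0006095, 0.0003734
    print(round(percent_relative_efficiency_gain(mse_gg, mse_gg_ibc), 1))
    print(round(percent_relative_efficiency_gain(mse_nls_ibc, mse_gg_ibc), 1))
```

A positive value means that GG-IBC has the smaller mean squared error, and a value near 100 means the alternative's mean squared error is many times larger.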
Also, the parameter λ is estimated correctly in all the cases. For the beta sampling design, since the beta distribution is not in the GG class of distributions, the parameter estimates for the regression coefficients and the constant are slightly different, yet the estimates of λ appear to be the same. However, the GG estimates of the regression coefficients and the constant are biased, due to conditional mean misspecification. In the case of the NLS-IBC estimator, the conditional mean is defined in terms of the IBC transformation without the constant shifter C. For the gamma and the exponential sampling designs this constant shifter is zero (see equation (3-11) and Appendix D for details). As a result, in the gamma and the exponential sampling designs we can compare the NLS-IBC parameter estimates with the GG-IBC parameter estimates. For the other sampling designs (Weibull, log-normal, GG and beta), the NLS-IBC estimator internally adjusts the parameter estimates to compensate for this constant shifter.

Although in Tables 3-1, 3-2, 3-3 and 3-4 the parameter λ is chosen to be unity, the results are similar for other values of λ, such as λ = 0.5, 1.0, 1.5, 3, 4. Recall that the parameter λ controls the form of the conditional mean function. Table 3-5 presents the average percentage absolute bias of the marginal effect of the policy variable for various values of λ. Here, as an example, the sampling design is chosen to be based on the GG distribution and the sample size is set to 10,000. Note that the percentage bias for the GG estimator at the first quartile of x_p is 6.169% for λ = 0.5 and goes up to 16.703% for λ = 4. The GG-IBC estimator, on the other hand, consistently generates lower percentage bias than GG for all values of λ. For example, for λ = 1.5 the percentage bias from GG is 11.106% and for GG-IBC it is 2.348%. Finally, the percentage bias for NLS-IBC is slightly higher than the percentage bias for GG-IBC for all values of λ.
These results show that the GG-IBC estimator generates less bias regardless of the conditional mean form, i.e. regardless of the value of λ.

3.4 The Effect of Cigarette Smoking on Birthweight

Here we demonstrate the practicality of the GG-IBC model via a real data application. We also compare the results obtained using the GG-IBC model with those from the GG and the NLS-IBC models mentioned earlier. We revisit the model of Mullahy (1997), which concerns the effect of cigarette smoking during pregnancy on infant birthweight. This data set has 1,388 observations, and the outcome variable and the covariates are from the Child Health Supplement to the 1988 National Health Interview Survey (Mullahy, 1997) (see footnote 6). The outcome variable is defined as the birthweight in pounds. Figure 3-1 shows the histogram and the kernel density of the outcome variable. The policy variable of interest is the number of cigarettes smoked during pregnancy. The definitions of all of the variables in the analysis are given in Table 3-6 and their descriptive statistics are given in Table 3-7.

Footnote 6: Refer to Mullahy (1997) for additional information on the data set.

Our objectives here are twofold. First, we want to verify that the results obtained by our estimator are empirically meaningful, i.e. that there is a negative impact of smoking on infant birthweight. Second, we want to study the practical applicability of our GG-IBC model, especially since it is a FIML method with multiple unknown parameters. For the purpose of this chapter, we ignore the potential endogeneity of smoking and use our GG-IBC model. We will revisit the effects of endogeneity in conjunction with the flexible conditional mean form for this application in Chapter 4.

3.4.1 Model

We first estimate the birthweight model assuming that the conditional mean regression for y follows the IBC formulation given in equation (3-11) of the text. Here we use the NLS-IBC estimator.
The metric of interest, the marginal effect (ME), in the NLS-IBC context can be defined as in equation (3-20), with its components given in equations (3-21), (3-22) and (3-23), where x_p is the policy variable of interest (in this application, the number of cigarettes smoked during pregnancy) and x_o is the vector of observable covariates. By combining equations (3-20) to (3-23), we obtain the expression for the ME given in equation (3-24). The estimator of the ME of cigarette smoking is the sample analog to (3-24), given in equation (3-25), where n_s is the size of the subsample of smokers and hatted symbols denote estimates. Note that the marginal effect is estimated for smokers only. The average effect of smoking cessation on birthweight is the expected (i.e. average) effect of forcing all smokers to quit, and it pertains to the smokers only. For the NLS-IBC model the cessation effect is given by equation (3-26), and it is estimated as in equation (3-27), where n_s is the size of the subsample of smokers.

We next estimate the model assuming that the conditional density function of y follows the GG-IBC formulation given in equation (3-8) of the text. The metric of interest is, as before, the ME, which in the GG-IBC context can be defined as in equation (3-28), with its components given in equations (3-29), (3-30) and (3-31), where x_p is the policy variable (in this application, the number of cigarettes smoked during pregnancy), x_o is the vector of observable covariates, and the constant C is given in equation (3-32). We therefore obtain the expression for the ME given in equation (3-33). The estimator of the ME of cigarette smoking is the sample analog to (3-33), given in equation (3-34), with its components given in equations (3-35) and (3-36). Here we can also derive the average effect of smoking cessation on birthweight. The appropriate expression for the average effect of smoking cessation is given in equation (3-37), and it is estimated as in equation (3-38), where n_s is the size of the subsample of smokers.

Finally, we estimated the model using the standard GG estimator.
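The sample-analog logic behind the average marginal effect and the cessation effect can be sketched in Python. This is an illustration under stated assumptions, not the dissertation's equations: the conditional mean is taken to be a constant `scale` times the assumed IBC form of the index, the cessation effect is taken as the average over smokers of the predicted mean at zero cigarettes minus the predicted mean at the observed level, and all names (`ibc`, `predicted_mean`, `rest`) are hypothetical.

```python
import math


def ibc(index, lam, tol=1e-8):
    """Assumed IBC form: ((lam/2)*index + 1)**(2/lam), exp(index) as lam -> 0."""
    if abs(lam) < tol:
        return math.exp(index)
    return ((lam / 2.0) * index + 1.0) ** (2.0 / lam)


def predicted_mean(x_p, rest, beta_p, lam, scale=1.0):
    """Conditional mean at policy value x_p; `rest` is the index part from x_o."""
    return scale * ibc(beta_p * x_p + rest, lam)


def average_marginal_effect(smokers, beta_p, lam, scale=1.0):
    """Average derivative of the conditional mean w.r.t. x_p over smokers."""
    total = 0.0
    for x_p, rest in smokers:
        index = beta_p * x_p + rest
        if abs(lam) < 1e-8:
            slope = math.exp(index)
        else:
            slope = ((lam / 2.0) * index + 1.0) ** (2.0 / lam - 1.0)
        total += scale * beta_p * slope
    return total / len(smokers)


def cessation_effect(smokers, beta_p, lam, scale=1.0):
    """Average over smokers of mean(x_p = 0) minus mean(observed x_p)."""
    total = sum(
        predicted_mean(0.0, rest, beta_p, lam, scale)
        - predicted_mean(x_p, rest, beta_p, lam, scale)
        for x_p, rest in smokers
    )
    return total / len(smokers)


if __name__ == "__main__":
    # Toy subsample: (cigarettes per day, covariate part of the index).
    toy_smokers = [(10.0, 1.0), (20.0, 2.0)]
    print(average_marginal_effect(toy_smokers, -0.1, 2.0))
    print(cessation_effect(toy_smokers, -0.1, 2.0))
```

Both quantities are averaged over the subsample of smokers only, matching the restriction described in the text. With a negative smoking coefficient the marginal effect is negative and the cessation effect is positive.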
The expressions for the ME and its estimator, along the same lines as above, are given by equations (3-39) and (3-40) respectively, which use the terms given in equations (3-32) and (3-36). The expressions for the cessation effect (CE) and its estimator, along the same lines as above, are given by equations (3-41) and (3-42) respectively.

3.4.2 Results

The results from the GG-IBC, NLS-IBC and GG estimators are given in Table 3-8. The marginal effects as in (3-25), (3-34) and (3-40), estimated for the NLS-IBC, the GG-IBC and the GG models respectively, are presented in the first row of Table 3-8. For these estimates the sample was restricted (after regression estimation) to those who smoked during pregnancy (i.e. CIGSPREG > 0). This subset comprised 212 individuals. The second row presents the average effect of smoking cessation on birthweight for the 212 mothers who smoked during pregnancy. The quantities here are computed using the expressions given in equations (3-27), (3-38) and (3-42) for the NLS-IBC, the GG-IBC and the GG models respectively.

Foremost, it can be noted that the ME of the policy variable of interest, as estimated by all three methods, is negative. This result is also consistent with those obtained by Mullahy (1997). For the case of GG-IBC, we find that the birthweight decreases by 0.489 ounces for every additional cigarette smoked per day by the mother during the pregnancy. Secondly, from the cessation effect, we see that if all smokers were forced to quit, i.e. if their observed level of daily smoking were decreased from the current level to zero, the birthweight would go up. For the GG-IBC method, we find this increase to be 6.913 ounces.

In addition to this quantitative evaluation, we also found that the GG-IBC estimator is very well behaved in terms of its convergence as compared to the NLS-IBC model, especially with a relatively small data set like the one used in our application.
We found that the NLS-IBC fails to converge with our Stata implementation, wherein the IBC parameter is estimated as part of a comprehensive (i.e. including all parameters simultaneously) optimization routine. We were only able to get it to work when the IBC parameter is estimated separately using a line search method in a Gauss implementation of NLS-IBC. Since the results we obtained are fairly similar across the two methods, GG-IBC and NLS-IBC, the ease of use suggests that GG-IBC may be more useful in a practical setting. It should also be noted that since our extensive simulations with various sampling designs and sizes showed that the GG-IBC is a comparatively more precise estimator than the NLS-IBC, we see it as the preferred estimator.

3.5 Summary, Discussion and Conclusion

In Chapter 2, we addressed two commonly observed characteristics of health economics data, namely skewness and endogeneity. In addition to these issues, health economics data can come from distributions which may have a non-exponential conditional mean. This could potentially be a problem if the model used for estimation imposes the exponential conditional mean assumption, as this can lead to conditional mean misspecification bias. In this chapter we propose a novel estimation model, called GG-IBC, which specifically targets this issue. Note that we have retained the GG model since it is a FIML estimator that has a flexible distributional form. We modify the conditional mean expression in this model, via the IBC transformation, in order to introduce conditional mean flexibility to handle the aforementioned misspecification bias problem. The composite model presented here retains the precision benefits of the GG model while providing consistent estimates.

In order to test our model, we conducted extensive simulation analyses using various sampling designs and sizes.
We compared the results obtained using our model to two commonly implemented alternatives, the NLS-IBC and the GG models. We showed that the GG-IBC model is a consistent estimator. This is true not only for data obtained from the family of GG distributions: GG-IBC remains consistent even when the data are obtained from a distribution not in the GG family, e.g. the beta distribution. As compared to the NLS-IBC and the GG models, we found that the proposed model consistently has lower average percentage absolute bias for the marginal effects across different distributions. We also confirmed that the GG-IBC model retains the statistical efficiency of the GG model.

Finally, we tested the proposed model on a real data application. We revisited the model of Mullahy (1997) on the effect of cigarette smoking during pregnancy on infant birthweight. We showed that, consistent with Mullahy (1997), our model predicts a decrease in birthweight as smoking during pregnancy increases. We also found that our model, unlike NLS-IBC, is well behaved with respect to convergence when the number of observations is small, which suggests that the GG-IBC can be quite suitable for real data scenarios.

Table 3-1. For generalized gamma distributed data, average percentage absolute bias of the marginal effect (λ = 1) (in percentages)

Estimator   N         1st quartile of x_p   2nd quartile of x_p   3rd quartile of x_p
GG          10,000    9.005                 3.610                 19.472
            50,000    8.752                 3.128                 18.658
            100,000   8.796                 3.173                 18.743
            250,000   8.634                 3.372                 19.039
            500,000   8.537                 3.543                 19.161
GG-IBC      10,000    1.882                 2.226                 3.089
            50,000    0.820                 0.991                 1.368
            100,000   0.576                 0.691                 0.968
            250,000   0.354                 0.438                 0.617
            500,000   0.261                 0.313                 0.435
NLS-IBC     10,000    2.609                 2.492                 3.733
            50,000    1.069                 1.121                 1.688
            100,000   0.831                 0.791                 1.219
            250,000   0.489                 0.521                 0.776
            500,000   0.354                 0.348                 0.544

Table 3-2. For various sampling designs, average percentage absolute bias of the marginal effect (λ = 1) (N = 10,000) (in percentages)

Data                Estimator   1st quartile of x_p   2nd quartile of x_p   3rd quartile of x_p
Gamma               GG          8.914                 3.618                 19.662
                    NLS-IBC     2.247                 2.184                 3.525
                    GG-IBC      1.803                 2.041                 2.910
Weibull             GG          8.942                 3.801                 19.972
                    NLS-IBC     1.703                 1.667                 2.655
                    GG-IBC      1.263                 1.542                 2.294
Exponential         GG          8.699                 4.407                 20.196
                    NLS-IBC     3.318                 3.221                 5.174
                    GG-IBC      2.511                 3.086                 4.585
Log-normal          GG          8.818                 4.242                 20.773
                    NLS-IBC     0.716                 0.644                 0.974
                    GG-IBC      0.532                 0.605                 0.871
Generalized gamma   GG          9.005                 3.610                 19.472
                    NLS-IBC     2.609                 2.492                 3.733
                    GG-IBC      1.882                 2.226                 3.089
Beta                GG          8.642                 3.667                 19.776
                    NLS-IBC     1.371                 1.310                 2.068
                    GG-IBC      0.889                 1.044                 1.479

Table 3-3. For various sampling designs, mean squared error of the marginal effect with percent relative efficiency gain (λ = 1) (N = 10,000)

Data                Estimator   1st quartile of x_p   2nd quartile of x_p   3rd quartile of x_p
Gamma               GG          0.0062978 (94.1%)     0.0018094 (63.9%)     0.050493 (96.6%)
                    NLS-IBC     0.0006095 (38.7%)     0.0007401 (11.78%)    0.0023845 (28.6%)
                    GG-IBC      0.0003734             0.0006531             0.001703
Weibull             GG          0.0053677 (97.1%)     0.0021317 (86.0%)     0.0409466 (98.0%)
                    NLS-IBC     0.0002827 (45.9%)     0.0003511 (15.3%)     0.0010945 (24.9%)
                    GG-IBC      0.0001530             0.0002975             0.0008215
Exponential         GG          0.0062624 (87.7%)     0.0028202 (46.3%)     0.0552056 (92.4%)
                    NLS-IBC     0.0013399 (42.4%)     0.0016627 (9.0%)      0.0052693 (20.7%)
                    GG-IBC      0.0007723             0.0015134             0.0041792
Log-normal          GG          0.0066033 (99.5%)     0.0048269 (98.8%)     0.0648765 (99.8%)
                    NLS-IBC     0.0000643 (46.0%)     0.0000676 (12.4%)     0.0001969 (17.9%)
                    GG-IBC      0.0000347             0.0000592             0.0001616
Generalized gamma   GG          0.0041126 (93.2%)     0.0011594 (58.0%)     0.0317138 (96.3%)
                    NLS-IBC     0.0005173 (45.7%)     0.000628 (22.5%)      0.0018073 (35.3%)
                    GG-IBC      0.0002808             0.000487              0.0011685
Beta                GG          0.0014405 (98.4%)     0.0003661 (88.4%)     0.0124458 (99.1%)
                    NLS-IBC     0.0000587 (61.8%)     0.0000683 (37.6%)     0.0002226 (50.9%)
                    GG-IBC      0.0000224             0.0000426             0.0001093

Note: The values given in parentheses are the percent relative efficiency gains, which measure the relative efficiency of the GG-IBC estimator compared to the GG and the NLS-IBC estimators. The percent relative efficiency gain is defined by 100 × (MSE_m − MSE_GG-IBC) / MSE_m, where m = GG, NLS-IBC.

Table 3-4. Parameter estimates (λ = 1) (N = 10,000)

Columns: Data, Parameter, GG, NLS-IBC, GG-IBC. Each row below lists the estimates for one parameter in the order printed; rows with two entries omit one estimator.

Gamma
  0.2590   0.5011   0.5014
  0.2591   0.5015   0.5015
  0.3040   0.0001   0.0010
  0.7002   0.7051
  0.3430   0.3470
  0.9962   1.0030
Weibull
  0.2590   0.4718   0.4986
  0.2591   0.4719   0.4990
  0.3041   0.1201   0.0002
  0.9870   0.9999
  0.6843   0.6934
  0.9975   0.9927
Exponential
  0.2599   0.5035   0.4984
  0.2601   0.5040   0.4991
  0.3002   0.0089   0.0009
  0.9960   0.9999
  0.0023   0.0003
  0.9853   0.9844
Log-normal
  0.2604   0.5047   0.5001
  0.2602   0.5047   0.5000
  0.3053   0.0206   0.0003
  0.0344   0.0012
  1.5732   1.6096
  0.9981   0.9999
Generalized gamma
  0.2587   0.4496   0.5021
  0.2591   0.4510   0.5028
  0.3040   0.2186   0.0014
  1.4041   1.4166
  0.3410   0.3480
  1.0071   1.0045
Beta
  0.2581   0.3537   0.4101
  0.2579   0.3539   0.4098
  0.1042   0.5865   0.3611
  1.8677   2.0110
  1.1282   1.1877
  1.0017   0.9993

Table 3-5. For the generalized gamma distributed sampling design with various values of the parameter λ, average percentage absolute bias of the marginal effect (N = 10,000) (in percentages)

λ      Estimator   1st quartile of x_p   2nd quartile of x_p   3rd quartile of x_p
0.5    GG          6.169                 2.564                 13.847
       NLS-IBC     2.411                 2.044                 3.261
       GG-IBC      1.460                 1.763                 2.600
1.0    GG          9.005                 3.610                 19.472
       NLS-IBC     2.609                 2.492                 3.733
       GG-IBC      1.882                 2.226                 3.089
1.5    GG          11.106                4.322                 22.774
       NLS-IBC     2.953                 2.958                 4.201
       GG-IBC      2.348                 2.675                 3.554
3      GG          15.084                5.881                 28.145
       NLS-IBC     4.234                 4.328                 5.558
       GG-IBC      3.725                 4.013                 4.963
4      GG          16.703                6.787                 30.211
       NLS-IBC     5.130                 5.221                 6.471
       GG-IBC      4.666                 5.008                 6.040

Table 3-6. The variable definitions from the birthweight analysis

Variable                             Definition
The outcome variable
  BIRTHWEIGHT                        the infant's birthweight, measured in lbs.
Policy variable (x_p)
  CIGSPREG                           the number of cigarettes smoked per day during pregnancy
The observable confounders (x_o)
  PARITY                             the birth order
  WHITE                              = 1 if white, 0 otherwise
  MALE                               = 1 if male, 0 otherwise

Table 3-7. Descriptive statistics for the birthweight sample (N = 1,388)

Variable                             Mean    Minimum   Maximum
The outcome variable
  BIRTHWEIGHT                        7.42    1.44      16.94
Policy variable (x_p)
  CIGSPREG                           2.09    0         50
The observable confounders (x_o)
  PARITY                             1.63    1         6
  WHITE                              78%
  MALE                               52%

Figure 3-1. Histogram and kernel density estimate

Table 3-8. The marginal effect and the cessation effect estimates from the birthweight analysis

Type of estimation          NLS-IBC   GG-IBC   GG
Average marginal effect     0.5383    0.4894   0.5080
Cessation effect            7.6853    6.9138   7.0811

CHAPTER 4
MODELING AND ESTIMATING FLEXIBLE FORM HEALTH ECONOMETRIC MODELS WITH ENDOGENEITY

4.1 Introduction and Background

One of the main properties of regression models in health economics and health services research is skewness in the data on the dependent variable. In the literature the generalized gamma (GG) model is one of the models suggested to deal with such skewness. It is also very often the case in empirical health econometric modeling that one or more of the regressors of interest are endogenous. The two-stage residual inclusion (2SRI) method has been offered as a means of producing consistent estimates in the presence of regressor endogeneity. The model introduced in Chapter 2, the generalized gamma with endogeneity (GGE), offers a way to correct for both skewness and endogeneity in health econometric regression models.
Though the GGE model is flexible with regard to accommodating skewness, in that it subsumes various asymmetric distributions as special cases, it is subject to classical misspecification bias because it imposes a fixed conditional mean regression form, the exponential. The objective of the current chapter is to extend the GGE model by introducing a flexible conditional mean form in order to lessen the chance of misspecification bias. The combined model, generalized gamma with endogeneity and inverse Box-Cox transformation (GGE-IBC), offers flexibility both in the distributional form (to account for skewness) and in the conditional mean function (to avoid misspecification bias).

To address skewness and thereby gain precision in health econometric models, Manning et al. (2005) advocated use of the GG estimator, a full information maximum likelihood (FIML) estimator based on a flexible distributional form. The substantial potential efficiency gains afforded by this model when estimated in a 2SRI framework (we called this GG-2SRI model GGE) were demonstrated by the simulation results presented in Chapter 2. There we also verified that, in the presence of endogeneity, the GGE estimator is indeed consistent, and that application of the GG model (ignoring endogeneity) can lead to substantial bias.

In addition to the issue of flexible distributional form addressed in Chapter 2, in Chapter 3 we explored the effect of introducing flexibility in the conditional mean function of the GG model. This modification to the GG model addressed the potential bias in the estimates of the targeted marginal effects that can result from assuming an incorrect conditional mean form. We introduced flexibility in the conditional mean function using the inverse Box-Cox (IBC) transformation, which was first applied in a regression context by Wooldridge (1992).
We showed that our model, called GG-IBC, combines the efficiency of a FIML estimator with the bias-related benefits obtained from the use of the IBC transformation. We also demonstrated the usefulness of the GG-IBC model as compared to the GG and nonlinear least squares with IBC (NLS-IBC) models through our simulation and real data analyses.

In this chapter we bring together the benefits obtained by the models introduced in Chapter 2 and Chapter 3. Here we present a new model for health econometric data analysis, called generalized gamma with endogeneity and IBC transformation (GGE-IBC), which is both robust to the presence of endogeneity in the regressors and flexible enough to accommodate exponential, linear and power specifications in the conditional mean function. Through extensive simulation and real data experiments, we verify that the GGE-IBC model possesses the useful properties inherited from the GGE and the GG-IBC models, viz., consistency and statistical efficiency.

The remainder of this chapter is organized as follows. In Section 4.2 the GGE-IBC model/method, which incorporates the IBC transformation into the GG model and implements the 2SRI method, is described. In Section 4.3 the simulation analysis is summarized, including the sampling designs that are used, the estimators included in the comparisons, the definition of the marginal effect in the GGE-IBC context, the criteria for evaluation and comparison, and a summary of the simulation results. Section 4.4 describes the real data analysis, where we apply the GGE-IBC model to the Mullahy (1997) data on the effect of cigarette smoking on infant birthweight. Finally, in Section 4.5 we conclude by summarizing the contributions and findings of this chapter.
4.2 Integrating IBC and Endogenous Confounders into the GG Model

In Chapters 2 and 3 we addressed two different forms of flexibility in health econometric regression modeling with nonnegative outcomes: the need for flexibility in the distribution of the dependent variable conditional on the regressors, as a means of accommodating skewness; and flexibility in the conditional mean regression specification, to avoid misspecification bias. We allow for distributional flexibility in Chapter 2 by incorporating the GG model, which subsumes various skewed distributions such as the log-normal, standard gamma, Weibull and exponential as its special cases. The choice among these distributions is controlled by the parameters of the GG model. In Chapter 3, we accommodate conditional mean flexibility through the use of the IBC transformation in the GG modeling framework. Our objective in this section is to build a model which combines both of these flexibility features and is consistent in the presence of endogenous regressors.

In Chapter 2 we proposed the GGE model to account for potential endogeneity in the regressors. In that model the GG parameter that carries the regressors is defined as a linear index in x_o, x_e and x_u, where x_o denotes the observable confounders, x_e the endogenous regressors and x_u the unobservable confounders. In Chapter 3, in the GG-IBC model without endogeneity, this linear index is replaced by an expression involving the function k(.,.), the IBC transformation. In the present chapter, since we are accounting for endogeneity, we define the parameter as a combination of these two, as given in equation (4-1), where x_o, x_e and x_u are as defined above. The formulation of the parameter given in (4-1) implies the form for the conditional mean given in equation (4-2), where C is defined in equation (4-3), k(.,.) is defined in equation (4-4), and the IBC transformation parameter is a scalar. This parameter controls the conditional mean form; a detailed discussion of the IBC transformation and its special cases can be found in Section 3.2.2.
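To make the role of the IBC parameter concrete, the following is a minimal sketch of the transformation and of a conditional mean built from it. It assumes the standard inverse Box-Cox form, (1 + lam*z)^(1/lam) with the exponential as the limit at lam = 0, and a conditional mean of the form C times that transformation; the symbol names (`z`, `lam`, `C`) and all three function names are illustrative choices, not notation taken from equations (4-2) to (4-4). The last function shows why, under this assumed form, multiplying the conditional mean by a positive constant amounts to rescaling the slope coefficients and shifting the constant term of the index.

```python
import math


def ibc(z, lam):
    """Inverse Box-Cox transform of a linear index z (assumed standard form)."""
    if abs(lam) < 1e-12:
        return math.exp(z)              # limiting case: exponential conditional mean
    base = 1.0 + lam * z
    if base <= 0.0:
        raise ValueError("1 + lam*z must be positive")
    return base ** (1.0 / lam)          # lam = 1 gives the linear form 1 + z


def conditional_mean(z, lam, C=1.0):
    """Assumed conditional mean: a positive constant C times the IBC transform."""
    return C * ibc(z, lam)


def rescaled_index(z, lam, C):
    """Index z_star with ibc(z_star, lam) == C * ibc(z, lam), for C > 0.

    For lam != 0 the index is scaled by C**lam and shifted by (C**lam - 1)/lam;
    for lam = 0 it is simply shifted by log(C).
    """
    if abs(lam) < 1e-12:
        return z + math.log(C)
    scale = C ** lam
    return (scale - 1.0) / lam + scale * z
```

With C = 1 the rescaled index equals the original one, so under these assumptions the scaling only matters when the distributional constant differs from one.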
With this modified conditional mean form, we now proceed as prescribed by the 2SRI method. To correct for the endogeneity caused by the presence of the unobservable confounders, we define one, possibly nonlinear, auxiliary equation as in equation (2-12) for each of the endogenous variables. Using these equations, we regress the endogenous variables on some or all of the observable regressors and an appropriate number of instrumental variables. The estimated residuals from these regressions (see equation (2-13) for details) are used as proxy variables for the unobservable confounders in the main GG regression. The composite model resulting from the incorporation of all of the aforementioned steps is called GGE-IBC. We have implemented this estimator in Stata/Mata 10 using maximum likelihood estimation via the ml procedure.

4.3 Simulation Analysis

There are three main objectives in this simulation analysis. The first objective is to examine the consistency properties of our model, GGE-IBC, as well as those of the NLS-IBC-2SRI and the GGE, the models with which we compare our results. For the class of GG-distributed data, the GGE-IBC estimator should be consistent irrespective of the conditional mean form. Here we seek to verify whether this consistency of GGE-IBC carries over to data distributions not belonging to the GG family, e.g. the beta distribution. We accomplish this by observing the behavior of GGE-IBC in experiments with increasing sample sizes: 10,000, 50,000, 100,000 and 250,000. Note that, parallel to the argument in Chapter 3, the NLS-IBC-2SRI model (part of the class of IBC models) should also be consistent for all of the data distributional cases due to the use of a flexible conditional mean form.
The GGE model, which is a special case of the IBC class of models, is subject to misspecification bias in the conditional mean form because it imposes a particular value of the IBC parameter (the one that yields the exponential form), and should be inconsistent when either the data distributional form is not from the GG family or the conditional mean does not have an exponential form. The second objective is to study the bias in the estimated marginal effects of the targeted policy variable obtained using the GGE-IBC estimator and compare it with the biases corresponding to the GGE and the NLS-IBC-2SRI based estimators. The latter of these is the NLS-IBC estimator used in Chapter 3, modified to account for endogeneity. The final objective is to observe the precision of the GGE-IBC marginal effect estimator in comparison to the estimators based on GGE and NLS-IBC-2SRI. For all of the above-mentioned cases, the data were generated using Monte Carlo simulations based on various sampling designs, details of which are provided below.

4.3.1 Sampling Designs

4.3.1.1 The observable and unobservable confounders, instrumental variables and endogenous variable

As mentioned in Section 4.2, in the presence of endogeneity the regressor vector x is partitioned into three types of variables: the observable confounders, the endogenous variables and the unobservable confounders. In our simulations we use only one variable as the observable confounder. It is generated as a uniformly distributed variate in the interval [0, 4]. We also use one variable for the unobservable confounder, and it is generated uniformly in the range [-0.55, 0.55]. To incorporate endogeneity in our simulation analysis, we define the endogenous variables as a function of the observable confounders, unobservable confounders and instrumental variables. In this simulation analysis we use only one endogenous regressor (hence only one instrumental variable is used).
For simplicity, the endogenous variable is defined as a linear combination of the observable confounder, the unobservable confounder and the instrumental variable, as given in equation (4-5), where x_e is the endogenous variable, x_o is the observable confounder, w is the instrumental variable and x_u is the unobservable confounder. The two coefficients in (4-5) are set to 0.3 in the simulation analysis. The instrumental variable is generated with a uniform distribution in the interval [0, 2]. The simulation analysis is repeated 500 times for four different sample sizes: 10,000, 50,000, 100,000 and 250,000.

4.3.1.2 The outcome variable

The outcome variable y is generated using six different distributions. The linear index which is used to generate the conditional means of these distributions is defined in equation (4-6), in which the unobservable confounder is proxied by the residual from the auxiliary equation. Two of the coefficients in (4-6) are set to 0.5 and one further coefficient is set to 1.0. Since the data generation process used here is very similar to that used in Chapter 3, we refer the reader to Section 3.3.1.2 for details. The only difference is the inclusion of the unobservable confounder (proxied by the first-stage residual) and explicit notation distinguishing the three types of regressors (endogenous variable, observable confounders and unobservable confounders) in equation (4-6). We have used the standard gamma, log-normal, Weibull, exponential, generalized gamma and beta distributions, with the IBC parameter fixed to unity in the IBC transformation. The distributional parameter values are the same as those used in Chapter 3.

4.3.2 Estimators to be Evaluated and Compared

The GGE-IBC estimator is a FIML estimator in which the GG parameter that carries the regressors is defined using the IBC transformation. For the estimation of GGE-IBC we implemented a maximum likelihood program using the ml command in Stata. For comparison we included the two other estimators detailed below:

Generalized gamma with endogeneity (GGE): The GGE model was defined in Chapter 2.
The GGE model is in the class of exponential conditional mean models, in which the IBC parameter is fixed at the value that yields the exponential form. Although this model has distributional flexibility, it lacks flexibility in the conditional mean. We use this model as one of our comparison models in order to assess the positive consequences of implementing the IBC transformation when the data are derived from a distribution with a non-exponential conditional mean. Note that, like the model proposed in this chapter, GGE-IBC, GGE is also a FIML estimator. It is estimated using the streg command in Stata. This model is described in detail in Section 2.2.

Nonlinear least squares with inverse Box-Cox transformation and endogeneity correction (NLS-IBC-2SRI): We have picked an NLS-based model for comparison since, in contrast to the GGE model, it is a non-FIML estimator. As the name suggests, we have modified this estimator to account for both endogeneity and conditional mean flexibility. The conditional mean function for the NLS-IBC-2SRI is defined in equation (4-7), where k(.,.) is given by equation (4-4). The elements of its coefficient vector are scaled versions of the elements of the coefficient vector used with the GGE-IBC estimator, and the constant term is additionally shifted (see footnote 1). The NLS-IBC-2SRI estimator offers flexibility with respect to the conditional mean and is therefore robust to misspecification bias, but it offers no correction for skewness since it makes no assumption about the distribution of the outcome variable. Like GGE-IBC, NLS-IBC-2SRI is a consistent estimator, since both use 2SRI for endogeneity correction and both are flexible with respect to the conditional mean. The NLS-IBC-2SRI estimator is implemented using the nl command in Stata.

4.3.3 Criteria for Evaluation and Comparison

Along the same lines as Chapters 2 and 3, our ultimate objective is to examine the expected marginal effect of a targeted policy variable as given in equation (2-18).
In general, the marginal effect of a potentially endogenous policy variable is defined as in equation (4-8) (see footnote 2).

Footnote 1: See Appendix D for a detailed description of this transformation.
Footnote 2: See Terza (2010) for details.

For the GG-IBC model, in which there is no endogeneity problem, the marginal effect as it pertains to the three cases of interest in the GG-IBC framework is presented in equations (3-12), (3-13) and (3-14). For the GGE-IBC model, in the presence of endogeneity, with the vector of regressors partitioned as x = [x_e x_o x_u], the marginal effect expressions analogous to equations (3-12), (3-13) and (3-14) are given in equations (4-9), (4-10) and (4-11), respectively. The consistent marginal effect estimators for (4-9), (4-10) and (4-11) are given in equations (4-12), (4-13) and (4-14), respectively. Note that these estimators are analogous to (3-15), (3-16) and (3-17) in Chapter 3.

In the NLS-IBC-2SRI framework the appropriate analogues to (4-9), (4-10) and (4-11) are given in equations (4-15), (4-16) and (4-17), and the corresponding consistent estimators are given in equations (4-18), (4-19) and (4-20), where the coefficient vector is scaled and the constant term is shifted as described for equation (4-7).

In our simulation analysis we are interested in the bias and the precision of the marginal effect estimator of the potentially endogenous policy variable. As in the simulation analyses conducted in Chapters 2 and 3, the metrics for comparison of the bias and precision are the average percentage absolute bias and the mean squared error of the estimated marginal effect, respectively. The former is defined in equation (3-18) and the latter in equation (3-19).

4.3.4 Simulation Results

In this simulation analysis we are interested in the bias and precision performance of the GGE-IBC model as compared to two possible alternative models, GGE and NLS-IBC-2SRI. In Chapter 2 we showed that in the presence of endogeneity the GGE model provides efficiency gains.
In Chapter 3 we showed that when the data are derived from a distribution with a non-exponential conditional mean, the GG-IBC model is more precise than the GG and the NLS-IBC alternatives. Here we investigate whether these efficiency gains prevail in the presence of both endogeneity and differing conditional mean forms when the estimation is carried out using the proposed GGE-IBC model. The metrics for comparison are defined in Section 4.3.3.

In Chapter 3 we showed the consistency of the GG-IBC model. Here our objective is to examine the consistency properties of the GGE-IBC model in the presence of endogeneity. Table 4-1 presents the average percentage absolute bias of the estimated marginal effect of the endogenous variable for different sampling designs and for various sample sizes: 10,000, 50,000, 100,000 and 250,000. The percentage biases are calculated at the three quartile values of x_e. For each sampling design we can observe that the average percentage absolute bias from the GGE-IBC estimator decreases as the number of observations in a sample increases. For instance, for the standard gamma distributed data the average percentage absolute bias for the sample size 10,000 is 9.507%, and it decreases to 4.265% for the sample size 50,000, etc. Similarly, for all of the other sampling designs (Weibull, exponential, log-normal, GG and beta) the average percentage absolute bias for GGE-IBC gets smaller as the number of observations increases. These results are as expected in light of the theoretical consistency of the GGE-IBC estimator. Although throughout Table 4-1 the average percentage absolute bias results from the NLS-IBC-2SRI model are higher than those from the GGE-IBC model, the findings validate that the NLS-IBC-2SRI model is also a consistent estimator. In the case of standard gamma distributed data, the percentage bias decreases from 14.189% to 6.585% as the sample size increases from 10,000 to 50,000, etc.
Similarly, the average percentage absolute bias from the GGE model decreases as the sample size increases, but at a slower pace.

The marginal effect estimates are expected to be biased both in the presence of endogeneity and when the conditional mean functional form is incorrectly specified. In Chapter 2 we showed that the 2SRI endogeneity correction in the GGE model compensates for the first problem, endogeneity, and generates unbiased estimates. In Chapter 3 we showed that the introduction of the IBC transformation in the GG-IBC model rectifies the second problem, conditional mean misspecification bias, and leads to consistent estimates. In this next experiment we examine the bias performance of the GGE-IBC method, which simultaneously accounts for skewness, corrects for endogeneity and is robust to conditional mean misspecification bias. All of the alternative estimators included in the simulation comparisons are corrected for endogeneity, so the focus in the first part of the analysis is on the importance of conditional mean flexibility. Both the GGE-IBC model and the NLS-IBC-2SRI model afford this type of flexibility. The GGE estimator does not. Table 4-2 presents the average percentage absolute bias of the marginal effect of x_e for various sampling designs for samples of size 10,000. It can be readily seen that the percentage bias from the GGE-IBC estimator is lower than the percentage bias from the other estimators in nearly every case. Although the average percentage absolute bias difference between GGE and GGE-IBC is relatively small in the first quartile of x_e, it increases significantly in the second and third quartiles.
For example, in the Weibull distributed data, the average percentage absolute bias values from the GGE estimator are 7.961%, 13.296% and 19.82% for the first, second and third quartile of x_e, respectively, whereas the corresponding average percentage absolute bias values from the GGE-IBC estimator are 7.132%, 7.619% and 7.983%, consistently lower. Similar findings can be observed for all of the other sampling designs. It is noteworthy that the average percentage absolute bias from the NLS-IBC-2SRI model is consistently higher than that of the GGE-IBC model for all sampling designs. Table 4-2 confirms that the bias performance of the GGE-IBC model is better than that of the GGE and the NLS-IBC-2SRI estimators.

As mentioned before, the NLS-IBC-2SRI model is not a FIML-based model; as a result, we would expect efficiency gains from the GGE-IBC model as compared to NLS-IBC-2SRI. In order to verify this, in Table 4-3 we present the mean squared error of the marginal effect of x_e for various sampling designs at a sample size of 10,000. The terms in parentheses represent the percentage efficiency gain from the GGE-IBC model compared to the GGE and the NLS-IBC-2SRI models, defined as $\frac{MSE_m - MSE_{GGE\text{-}IBC}}{MSE_m} \times 100$, where m = GGE, NLS-IBC-2SRI. As expected, the mean squared error from the GGE-IBC estimator is lower than that of its non-FIML counterpart, the NLS-IBC-2SRI estimator, for all of the sampling designs. For example, for the standard gamma distributed data, GGE-IBC provides an efficiency gain of approximately 54% compared to NLS-IBC-2SRI in every quartile of x_e. This statistical efficiency is observed for all of the other sampling designs, and the efficiency gain of GGE-IBC compared to NLS-IBC-2SRI reported in Table 4-3 ranges from about 27% to 78%.
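The percent relative efficiency gain reported in parentheses in Table 4-3 can be recomputed directly from the mean squared errors in the same table. The short sketch below does this for a few first-quartile entries; the function name and the dictionary layout are illustrative, while the MSE values are copied from Table 4-3.

```python
def efficiency_gain(mse_m, mse_gge_ibc):
    """Percent relative efficiency gain of GGE-IBC over a comparison estimator m."""
    return 100.0 * (mse_m - mse_gge_ibc) / mse_m


# First-quartile MSE values from Table 4-3 (N = 10,000).
mse_first_quartile = {
    "Gamma": {"GGE": 0.01408, "NLS-IBC-2SRI": 0.0210954, "GGE-IBC": 0.0095073},
    "Log normal": {"GGE": 0.0079024, "NLS-IBC-2SRI": 0.0137877, "GGE-IBC": 0.0100471},
}

gains = {
    data: {
        m: round(efficiency_gain(mse[m], mse["GGE-IBC"]), 1)
        for m in ("GGE", "NLS-IBC-2SRI")
    }
    for data, mse in mse_first_quartile.items()
}

for data, row in gains.items():
    print(data, row)
```

A negative value, as for GGE on the log-normal design in the first quartile, indicates an efficiency loss of GGE-IBC relative to the comparison estimator.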
Furthermore, GGE-IBC also provides efficiency gains compared to the GGE model, ranging from 5.7% to 85%. The one exception is the log-normal distributed data in the first quartile of x_e, where GGE-IBC shows an efficiency loss of 27% compared to GGE; for the second and third quartiles, however, the efficiency gains of GGE-IBC compared to GGE are 61% and 83%, respectively. All in all, the findings confirm that the GGE-IBC model is more efficient than the GGE and the NLS-IBC-2SRI models.

Finally, to confirm the consistency of the estimation of the IBC parameter in the GGE-IBC and the NLS-IBC-2SRI models, and to show the differences in parameter estimates between the GGE and the GGE-IBC, we examine the parameter values obtained by these methods. Table 4-4 presents the parameter estimates from various sampling designs for the sample size 10,000. Recall that the IBC parameter was fixed to unity. It can be observed that this parameter is estimated consistently in both the GGE-IBC and the NLS-IBC-2SRI models. As noted before, the estimates of the elements of the coefficient vector from NLS-IBC-2SRI are scaled (and shifted) versions of the elements of the GGE-IBC coefficient vector. For the specific cases of the exponential and the standard gamma, this scaling factor disappears. For the other distributions used in the sampling designs, the true values can be calculated using the scaling factor, as is done, for example, for the GG-distributed sampling design. The parameter estimates from the GG-distributed data also confirm the consistency of the NLS-IBC-2SRI model.

4.4 The Effect of Cigarette Smoking on Birthweight in Presence of Endogeneity

Here we demonstrate the usability of the GGE-IBC model for real data in the presence of endogeneity. We also compare the results with those obtained with the GGE and the NLS-IBC-2SRI methods described above.
Our application is the same as the one we used in Section 3.4 of Chapter 3, the effect of cigarette smoking on birthweight (Mullahy, 1997), but here we take the effect of endogeneity into account. As in Chapter 3, we are interested in the average marginal effect on infant birthweight of the mother smoking an additional cigarette per day. Here, however, we take into account the potential endogeneity of the cigarette smoking variable due to unobservable confounders such as the health consciousness of the mother, which may affect both her smoking and the infant's birthweight. To correct for potential endogeneity, we use paternal schooling, maternal schooling, family income and the per-pack state excise tax on cigarettes as our instrumental variables (the same ones used by Mullahy, 1997). The definitions of all of the variables used in the analysis are given in Table 4-5 and their descriptive statistics are given in Table 4-6. The histogram and the kernel density of the outcome variable are given in Figure 3-1.

The objectives of this study are threefold. First, we want to verify that the results obtained by our estimator, GGE-IBC, are empirically meaningful, i.e. that there is a negative effect of smoking on infant birthweight. Second, we would like to compare the results obtained in this chapter, where we explicitly take endogeneity into account, with the results obtained in Chapter 3, where the presence of endogeneity was ignored. Finally, as in Chapter 3, we would like to examine the practical applicability of the GGE-IBC model as compared to the NLS-IBC-2SRI model.

4.4.1 Model

We first estimate the model assuming that the conditional mean regression for y follows the IBC formulation given in equation (4-7) of the text. In addition, we assume that the auxiliary equation (4-21) holds, where w = [x_o w+] and w+ is a vector of identifying instrumental variables, i.e. variables that are correlated with x_e but not with y or x_u. To account for the potential endogeneity of x_e we implement the 2SRI estimator of Terza et al.
(2008a). In the first stage we estimate the parameters of the auxiliary equation by applying ordinary least squares (OLS) to (4-21), and use the result to compute the first-stage residual as the OLS residual for (4-21). In the second stage we estimate the coefficient vector, partitioned conformably with x = [x_e x_o x_u], and the remaining parameters of the model in equation (4-7) of the text by applying NLS to (4-22), where k(.,.) is defined as in equation (4-4) and x_u is replaced by the first-stage residual.

We use the parameter estimates obtained via NLS-IBC-2SRI to estimate the marginal effect (ME) of cigarette smoking during pregnancy on infant birthweight as given in equation (4-23), with its components defined in equations (4-24), (4-25) and (4-26), where the x vector is partitioned as x = [x_e x_o x_u] and x_u is substituted by the first-stage residual. The ME can therefore be written as in equation (4-27). The estimator of the ME of cigarette smoking is the sample analog to (4-27) given in equation (4-28), where n_s is the size of the subsample of smokers. The ME is estimated for smokers only. The average effect of smoking cessation on birthweight for the mothers who smoked during pregnancy is defined for the NLS-IBC-2SRI model in equation (4-29), and it can be estimated as in equation (4-30), where n_s is the size of the subsample of smokers.

Next, we estimate the model assuming that the conditional density function of y follows the GGE-IBC model (Section 4.2), with the conditional mean formulation given in equation (4-2) of the text. In addition, we assume that the auxiliary equation (4-21) holds. Here again, to account for the potential endogeneity of x_e we implement the 2SRI estimator of Terza et al. (2008a). In the first stage, as for NLS-IBC-2SRI, we estimate the parameters of the auxiliary equation by applying ordinary least squares (OLS) to (4-21), and use the result to compute the first-stage residual as the OLS residual for (4-21). In the second stage we estimate the coefficient vector and the remaining parameters of the model in equation (4-2) of the text by applying the GGE-IBC estimator described above, with the first-stage residual included as an additional regressor.
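The two stages described above can be sketched on simulated data. This is only an illustration of the residual-inclusion idea, not the Stata implementation used in this chapter: the second stage here is plain OLS, which corresponds to the linear special case of the conditional mean, rather than NLS or the GG likelihood. The data loosely follow the sampling design of Section 4.3.1; the constant of 1.0, the unit coefficient on x_u in the auxiliary equation and the additive normal noise are assumptions made for the sketch, and the function names `ols` and `two_stage_residual_inclusion` are hypothetical.

```python
import random


def ols(columns, y):
    """Least squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(columns)
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(ci * yi for ci, yi in zip(columns[i], y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def two_stage_residual_inclusion(y, x_e, x_o, w_plus):
    """2SRI with a linear outcome equation; returns [const, b_e, b_o, b_u]."""
    ones = [1.0] * len(y)
    # First stage: regress the endogenous variable on x_o and the instrument.
    a0, a1, a2 = ols([ones, x_o, w_plus], x_e)
    xu_hat = [e - (a0 + a1 * o + a2 * w) for e, o, w in zip(x_e, x_o, w_plus)]
    # Second stage: include the first-stage residual as an additional regressor.
    return ols([ones, x_e, x_o, xu_hat], y)


rng = random.Random(0)
n = 5000
x_o = [rng.uniform(0.0, 4.0) for _ in range(n)]        # observable confounder
x_u = [rng.uniform(-0.55, 0.55) for _ in range(n)]     # unobservable confounder
w_plus = [rng.uniform(0.0, 2.0) for _ in range(n)]     # instrumental variable
x_e = [0.3 * o + 0.3 * w + u for o, w, u in zip(x_o, w_plus, x_u)]
y = [1.0 + 0.5 * e + 0.5 * o + 1.0 * u + rng.gauss(0.0, 0.1)
     for e, o, u in zip(x_e, x_o, x_u)]

beta_2sri = two_stage_residual_inclusion(y, x_e, x_o, w_plus)
beta_naive = ols([[1.0] * n, x_e, x_o], y)             # ignores endogeneity
print("2SRI coefficient on x_e: ", round(beta_2sri[1], 3))
print("naive coefficient on x_e:", round(beta_naive[1], 3))
```

In this setup the naive regression attributes part of the effect of the omitted x_u to x_e, while including the first-stage residual recovers a coefficient close to the value of 0.5 used to generate the data.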
We use the parameter estimates obtained via GGE-IBC to estimate the ME of cigarette smoking during pregnancy on infant birthweight as given in equation (4-31), with its components defined in equations (4-32), (4-33) and (4-34), where the x vector is partitioned as x = [x_e x_o x_u], x_u is substituted by the first-stage residual, and C is defined in equation (4-35). The ME is therefore given by equation (4-36). The estimator of the ME of cigarette smoking is the sample analog to (4-36) given in equation (4-37), with its components defined in equations (4-38) and (4-39). For the GGE-IBC model the cessation effect is given by equation (4-40), which can be estimated as in equation (4-41), where n_s is the size of the subsample of smokers.

We also use the GGE model, defined in Chapter 2, to estimate the ME. The ME and the ME estimator of cigarette smoking for the GGE estimator are given in equations (2-25) and (2-26), respectively. The cessation effect and its estimator for the same model are given by equations (4-42) and (4-43), respectively.

4.4.2 Results

Table 4-7 summarizes the results from the GGE-IBC, the NLS-IBC-2SRI and the GGE estimators described above. The marginal effects, as defined in equations (4-28), (4-37) and (2-26) for the NLS-IBC-2SRI, the GGE-IBC and the GGE, respectively, are presented in the first row of Table 4-7. They are computed after estimation on the sample restricted to those who smoked during pregnancy (i.e. CIGSPREG > 0); this subset had 212 individuals. The average effect of smoking cessation on birthweight for these 212 mothers who smoked during pregnancy, as captured by the expressions in equations (4-30), (4-41) and (4-43) for the NLS-IBC-2SRI, the GGE-IBC and the GGE models, respectively, is presented in the second row of Table 4-7.

The first observation that can be made from this table is that the estimates of the ME obtained for all of the models are negative, indicating that birthweight decreases for each additional cigarette smoked per day during pregnancy.
For our GGE-IBC model, the weight of a newborn goes down by 1.143 ounces for each additional cigarette smoked per day. Recall that the same figure for the GG-IBC model in Chapter 3, which does not correct for endogeneity, was estimated to be 0.489 ounces. The marked difference between these values can be attributed to the relatively unbiased estimates obtained by GGE-IBC, a model which accounts for endogeneity. We have shown through extensive simulations in Chapter 2 that models that correct for endogeneity using 2SRI are relatively unbiased compared to those which do not. Next, from the cessation effect given in the second row of Table 4-7, it can be noted that if all of the smokers quit smoking with immediate effect, the birthweight goes up. For the GGE-IBC method, we find this increase to be 18.491 ounces. The same figure for the GG-IBC model in Chapter 3 was 6.913 ounces. We expect the results obtained from the GGE-IBC model to be relatively unbiased compared to those obtained from GG-IBC, for the same reason as described above. As in Chapter 3, we would again like to point out that, from a practical applicability point of view, GGE-IBC behaves better than the NLS-IBC-2SRI model. The latter model was found to have convergence issues in cases with a small number of observations, like our real data application.

4.5 Summary, Discussion and Conclusion

In Chapter 2 we address two common problems seen in health outcomes data, skewness and endogeneity. In Chapter 3 we take into account another important issue that plagues health econometric models, the bias that surfaces with the misspecification of the conditional mean. Here we are interested in providing solutions to all of these problems at the same time by proposing a novel model.
The GGE-IBC model offers a simultaneous remedy for all of these problems, since it is a FIML estimator that has a flexible distributional form, a flexible conditional mean form and a built-in mechanism for endogeneity correction. As in Chapters 2 and 3, we first tested our model via extensive simulation analyses and then on a real data application. In our simulation analyses, we used datasets of various sizes generated using various distributions. We compared our results with those obtained from popular alternative models, the GGE and the NLS-IBC-2SRI, both detailed above. Through our simulation analysis we found that the GGE-IBC model is consistent and does not suffer from conditional mean misspecification bias. We show empirically that our model generates lower average percentage absolute bias for the marginal effect estimates than the GGE and the NLS-IBC-2SRI models in nearly all of the cases considered. Like the other estimators introduced in this dissertation, the GGE-IBC model retains the precision properties of the GG model because it is a FIML-based estimator.

Finally, we tested the proposed GGE-IBC model on real data on the effect of cigarette smoking on infant birthweight. Note that this is the same dataset we used in Chapter 3, but here our model explicitly recognizes and corrects for endogeneity, which was not the case in Chapter 3. We found that the ME of smoking obtained when endogeneity is corrected for in the GGE-IBC model is more than twice the ME of smoking obtained by the GG-IBC model. The conclusions drawn in Chapter 3 about the practical usefulness of the GG-IBC model relative to the NLS-IBC model remain true for the GGE-IBC model as compared to the NLS-IBC-2SRI model.

Table 4-1.
For various sample sizes, average percentage absolute bias of the marginal effect (in percentages)

Data          Estimator      N         1st quartile of x_e   2nd quartile of x_e   3rd quartile of x_e
Gamma         GGE            10,000    10.338                15.655                22.167
                             50,000    4.196                 7.895                 14.030
                             100,000   3.876                 9.323                 16.294
                             250,000   2.958                 8.678                 15.419
              NLS-IBC-2SRI   10,000    14.189                15.051                15.784
                             50,000    6.585                 6.966                 7.323
                             100,000   4.570                 4.878                 5.151
                             250,000   2.885                 3.045                 3.192
              GGE-IBC        10,000    9.508                 10.146                10.676
                             50,000    4.265                 4.524                 4.763
                             100,000   2.916                 3.119                 3.292
                             250,000   1.854                 1.965                 2.071
Weibull       GGE            10,000    7.961                 13.296                19.820
                             50,000    3.150                 6.802                 13.182
                             100,000   2.691                 8.058                 14.769
                             250,000   1.996                 0.764                 14.095
              NLS-IBC-2SRI   10,000    10.809                11.475                12.031
                             50,000    5.017                 5.301                 5.558
                             100,000   3.511                 3.720                 3.902
                             250,000   2.073                 2.174                 2.268
              GGE-IBC        10,000    7.132                 7.620                 7.983
                             50,000    3.175                 3.334                 3.470
                             100,000   2.385                 2.539                 2.660
                             250,000   0.014                 0.014                 0.015
Exponential   GGE            10,000    14.473                20.159                26.897
                             50,000    7.436                 10.484                16.171
                             100,000   7.757                 11.714                18.824
                             250,000   6.682                 10.931                17.769
              NLS-IBC-2SRI   10,000    20.107                21.434                22.588
                             50,000    9.420                 9.960                 10.463
                             100,000   6.721                 7.118                 7.471
                             250,000   4.056                 4.281                 4.485
              GGE-IBC        10,000    14.186                15.155                15.999
                             50,000    6.757                 7.127                 7.459
                             100,000   5.141                 5.449                 5.714
                             250,000   2.795                 2.973                 3.134

Table 4-1.
Continued

Data                Estimator      N          1st quartile of x_e   2nd quartile of x_e   3rd quartile of x_e
Log normal          GGE            10,000     7.932                 13.989                21.124
                                   50,000     4.587                 10.941                22.347
                                   100,000    4.631                 11.501                19.531
                                   250,000    8.133                 19.785                37.046
                    NLS-IBC-2SRI   10,000     10.048                10.692                11.233
                                   50,000     5.169                 5.470                 5.747
                                   100,000    3.590                 3.813                 4.008
                                   250,000    2.226                 2.350                 2.465
                    GGE-IBC        10,000     8.125                 7.995                 7.828
                                   50,000     3.643                 3.853                 4.089
                                   100,000    2.669                 2.681                 2.690
                                   250,000    2.018                 1.819                 1.803
Generalized gamma   GGE            10,000     9.588                 15.009                21.641
                                   50,000     4.165                 7.236                 13.006
                                   100,000    3.102                 7.970                 14.627
                                   250,000    2.115                 7.291                 13.731
                    NLS-IBC-2SRI   10,000     14.702                15.697                16.557
                                   50,000     7.046                 7.464                 7.855
                                   100,000    4.890                 5.202                 5.477
                                   250,000    2.956                 3.133                 3.295
                    GGE-IBC        10,000     9.419                 10.065                10.576
                                   50,000     4.457                 4.725                 4.970
                                   100,000    3.069                 3.276                 3.454
                                   250,000    1.943                 2.057                 2.155
Beta                GGE            10,000     4.966                 10.272                16.877
                                   50,000     1.960                 5.089                 11.287
                                   100,000    1.364                 6.605                 13.043
                                   250,000    0.879                 6.045                 12.217
                    NLS-IBC-2SRI   10,000     9.460                 10.059                10.551
                                   50,000     4.195                 4.416                 4.622
                                   100,000    2.928                 3.135                 3.313
                                   250,000    1.874                 0.020                 2.096
                    GGE-IBC        10,000     4.517                 4.965                 5.354
                                   50,000     1.787                 1.893                 1.994
                                   100,000    1.888                 2.031                 2.117
                                   250,000    0.947                 0.992                 1.005

Table 4-2. For various sampling designs: average percentage absolute bias of the marginal effect (N=10,000) (in percentages)

Data                Estimator      1st quartile of x_e   2nd quartile of x_e   3rd quartile of x_e
Gamma               GGE            10.338                15.655                22.167
                    NLS-IBC-2SRI   14.189                15.051                15.784
                    GGE-IBC        9.508                 10.146                10.676
Weibull             GGE            7.961                 13.296                19.820
                    NLS-IBC-2SRI   10.809                11.475                12.031
                    GGE-IBC        7.132                 7.620                 7.983
Exponential         GGE            14.473                20.159                26.897
                    NLS-IBC-2SRI   20.107                21.434                22.588
                    GGE-IBC        14.186                15.155                15.999
Log normal          GGE            7.932                 13.989                21.124
                    NLS-IBC-2SRI   10.048                10.692                11.233
                    GGE-IBC        8.125                 7.995                 7.828
Generalized gamma   GGE            9.588                 15.009                21.641
                    NLS-IBC-2SRI   14.702                15.697                16.557
                    GGE-IBC        9.419                 10.065                10.576
Beta                GGE            4.966                 10.272                16.877
                    NLS-IBC-2SRI   9.460                 10.059                10.551
                    GGE-IBC        4.517                 4.965                 5.354

Table 4-3.
For various sampling designs: mean squared error of the marginal effect with percent relative efficiency gain (N=10,000)

Data                Estimator      1st quartile of x_e    2nd quartile of x_e    3rd quartile of x_e
Gamma               GGE            0.01408 (32.5%)        0.0329229 (63.3%)      0.0683225 (78.3%)
                    NLS-IBC-2SRI   0.0210954 (54.9%)      0.0265183 (54.4%)      0.0324618 (54.2%)
                    GGE-IBC        0.0095073              0.0120842              0.0148529
Weibull             GGE            0.005123 (20.3%)       0.0143617 (63.9%)      0.0329675 (80.8%)
                    NLS-IBC-2SRI   0.0092588 (55.9%)      0.011632 (55.5%)       0.0141949 (55.4%)
                    GGE-IBC        0.0040854              0.0051798              0.0063262
Exponential         GGE            0.0287104 (11.1%)      0.0541753 (41.3%)      0.1045475 (63.3%)
                    NLS-IBC-2SRI   0.0416459 (38.7%)      0.0529894 (40.0%)      0.0656875 (41.5%)
                    GGE-IBC        0.0255283              0.0318172              0.0384056
Log normal          GGE            0.0079024 (-27.1%)     0.0249051 (61.2%)      0.0584094 (83.4%)
                    NLS-IBC-2SRI   0.0137877 (27.1%)      0.0174717 (44.7%)      0.0214402 (54.7%)
                    GGE-IBC        0.0100471              0.0096598              0.0097130
Generalized gamma   GGE            0.0061122 (5.7%)       0.0156814 (53.1%)      0.034095 (73.5%)
                    NLS-IBC-2SRI   0.0148921 (61.3%)      0.0189276 (61.1%)      0.0233592 (61.3%)
                    GGE-IBC        0.0057623              0.0073595              0.0090409
Beta                GGE            0.000633 (14.3%)       0.0025785 (70.6%)      0.0069139 (85.2%)
                    NLS-IBC-2SRI   0.0024413 (77.8%)      0.0030791 (75.4%)      0.003761 (72.8%)
                    GGE-IBC        0.0005428              0.0007583              0.0010232

Note: The values given in parentheses are the percent relative efficiency gains, which measure the efficiency of the GGE-IBC estimator relative to the GGE and the NLS-IBC-2SRI estimators. The percent relative efficiency gain is defined by $100 \times (\mathrm{MSE}_m - \mathrm{MSE}_{\text{GGE-IBC}}) / \mathrm{MSE}_m$, where m = GGE, NLS-IBC-2SRI.

Table 4-4.
Parameter estimates (N=10,000)

Data / Parameters / GGE / NLS-IBC-2SRI / GGE-IBC
Gamma: 0.3241 0.5062 0.5088; 0.3033 0.4997 0.5005; 0.6087 0.9909 0.9912; 0.1168 0.0038 0.0068; 0.6952 0.7052; 0.3387 0.3471; 0.9963 1.0034
Weibull: 0.3220 0.4767 0.5108; 0.3009 0.4687 0.4966; 0.6021 0.9267 0.9822; 0.1209 0.1209 0.0058; 0.9716 1.0031; 0.6764 0.6999; 0.9910 0.9921
Exponential: 0.3263 0.5053 0.5146; 0.2997 0.4954 0.4946; 0.5987 0.9797 0.9869; 0.1196 0.0018 0.0008; 0.9915 0.9999; 0.0007 0.0026; 0.9797 1.0055
Log normal: 0.3262 0.5430 0.5494; 0.3023 0.5314 0.5108; 0.6058 1.0514 0.9966; 0.1225 0.1216 0.0152; 0.0102 0.0073; 0.6765 0.6911; 0.9958 1.0828

Table 4-4. Continued

Data / Parameters / GGE / NLS-IBC-2SRI / GGE-IBC
Generalized gamma: 0.3251 0.4609 0.5171; 0.3005 0.4450 0.4981; 0.6017 0.8818 0.9854; 0.1180 0.2197 0.0076; 1.3879 1.4169; 0.3312 0.3481; 1.0032 1.0025
Beta: 0.3153 0.3597 0.4197; 0.3005 0.3536 0.4071; 0.6010 0.7011 0.8083; 0.2931 0.5909 0.3653; 1.7565 1.9708; 1.0751 1.1711; 0.9981 1.0528

Table 4-5. The variable definitions from the birthweight analysis

Variable       Definition
The outcome variable
  BIRTHWEIGHT
Policy variable (x_e)
  CIGSPREG     the number of cigarettes smoked per day during pregnancy
The observable confounders (x_o)
  PARITY       the birth order
  WHITE        = 1 if white, 0 otherwise
  MALE         = 1 if male, 0 otherwise
Instrumental variables (w+)
  EDFATHER     paternal schooling in years
  EDMOTHER     maternal schooling in years
  FAMINCOM     family income (× 0.001)
  CIGTAX88     per pack state excise tax on cigarettes

Table 4-6.
Descriptive statistics for the birthweight sample (N=1,388)

Variable       Mean     Minimum   Maximum
The outcome variable
  BIRTHWEIGHT  7.42     1.44      16.94
Policy variable (x_e)
  CIGSPREG     2.09     0         50
The observable confounders (x_o)
  PARITY       1.63     1         6
  WHITE        78%
  MALE         52%
Instrumental variables (w+)
  EDFATHER     11.32    0         18
  EDMOTHER     12.93    0         18
  FAMINCOM     29.03    0.5       65
  CIGTAX88     19.55    2         38

Table 4-7. The 2SRI marginal effect and cessation effect estimates from the birthweight analysis

                             Type of estimation
Marginal effect              NLS-IBC-2SRI   GGE-IBC   GGE
Average Marginal Effect      1.1684         1.1426    1.1094
Cessation Effect             18.0042        18.4914   16.3405

CHAPTER 5
CONCLUSION

5.1 Summary

In this dissertation we have proposed three novel estimation techniques for modeling health outcomes data that address issues commonly encountered in health economics datasets: the generalized gamma with endogeneity (GGE) (Chapter 2), the generalized gamma with inverse Box-Cox transformation (GG-IBC) (Chapter 3) and the generalized gamma with endogeneity and inverse Box-Cox transformation (GGE-IBC) (Chapter 4). All of these newly proposed estimators are designed to deal with skewness. Beyond that, the issues we focus on are the endogeneity of the regressors and the possibility that the data are drawn from a distribution with a non-exponential conditional mean form. The former, endogeneity, is common since regression models are often affected by unobservable confounders. The latter is a problem for models that misspecify the conditional mean form, which leads to conditional mean misspecification bias. Our remedies for both problems are built upon the generalized gamma (GG) model.
We have chosen the GG model as the base for our models since it is a full information maximum likelihood (FIML) estimator that provides the distributional form flexibility needed to address skewness in health outcomes data. Furthermore, we demonstrate that, with the GG model as their foundation, our models inherit its efficiency properties.

We present our first model, GGE, in Chapter 2, where we address the issue of endogeneity. We propose the use of the two-stage residual inclusion (2SRI) method with the GG model to sidestep the effects of endogenous regressors due to unobservable confounders. This new model simultaneously provides robustness against both skewness and endogeneity, which, to the best of our knowledge, have not been addressed together in the literature. We thoroughly evaluate the new model using extensive simulation analysis as well as real data analyses. Note that in all of the analyses presented in this dissertation, we are interested in the marginal effect (ME) of a targeted policy variable on the outcome variable. In our simulation analysis we compared the performance of GGE with two popular alternative models, OLS and GLM with a log link function, using samples generated from various distributions and with various sample sizes. We found that although the OLS and GLM models can provide low average percentage absolute bias (a metric that captures bias in the computed ME) in some cases, GGE consistently provides better results across all datasets. We also found that another benefit of using a model with a flexible distributional form, like the GGE, is that it can be used to detect the distribution underlying a given dataset.
This was quite handy in our real data application (which looks at the relationship between hospital expenditures and prescription drug usage), where the GGE model, through a Wald test, detected that the dependent variable is distributed according to the log-normal distribution.

In Chapter 3 we build a model that can accommodate a flexible conditional mean form. We accomplish this by introducing the inverse Box-Cox (IBC) transformation in the conditional mean expression of the GG model. With this modification the resulting model, GG-IBC, has both distributional form flexibility and conditional mean form flexibility. We tested the proposed model through extensive simulation and real data analyses and compared the results with those of the NLS-IBC and the GG models. We found that the GG-IBC model consistently produces lower average percentage absolute bias than the NLS-IBC and the GG models across different sampling distributions and sample sizes when the data are drawn from distributions with non-exponential conditional mean forms. The application of GG-IBC to real data, which examined the impact of smoking by pregnant women on the birthweight of their infants, indicates that the former has a negative effect on the latter. Importantly, through this real data analysis we found that the GG-IBC model is well behaved in its convergence properties even when the dataset is small. This is in contrast to the NLS-IBC model, which has considerable convergence problems in this dataset.

Finally, in Chapter 4, we proposed the composite model GGE-IBC, which combines the two previously proposed models, GGE and GG-IBC, into a single versatile model. The GGE-IBC model, with its flexible distributional form, flexible conditional mean form and built-in mechanism to handle endogeneity, provides all the benefits previously summarized for the GGE and the GG-IBC models simultaneously.
Through simulation analysis we showed that the GGE-IBC model maintains the efficiency properties of the original GG model while at the same time providing lower average percentage absolute bias than alternatives such as NLS-IBC-2SRI. Note that the NLS-IBC-2SRI model is derived from NLS by modifying its conditional mean form with the IBC transformation and placing it in the 2SRI framework for handling endogeneity. In the real data experiments using GGE-IBC, we revisited the smoking and birthweight application used in Chapter 3, but this time we accounted for endogeneity in addition to allowing a flexible conditional mean in our estimation. We found that the ME of smoking during pregnancy with GGE-IBC is twice the value obtained in Chapter 3, where endogeneity is not accounted for.

In summary, we have presented new models for health outcome data estimation based on the GG model, which provide attractive alternatives to prevailing econometric models that are variants of conventional methods such as OLS, GLM and NLS.

5.2 Limitations and Future Work

The research presented in this dissertation is limited in a number of respects and, therefore, there are a number of natural extensions of this work that we intend to pursue in the future. The following is a list of a few of these possible extensions.

Foremost, in Chapter 2, we used the 2SRI method to correct for endogeneity in our GGE model. Since this is a two-stage model, the correct standard errors for the parameter estimates could not be obtained directly from Stata. Because of this, we derived the expressions for the correct standard errors of the GGE parameter estimates in Appendix C. Since we were also interested in the marginal effects of the policy variables, we derived the expressions for the correct standard errors of the marginal effects in the same appendix.
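The two-stage logic of 2SRI referred to above can be illustrated with a minimal sketch. It is not the estimator used in this dissertation: for brevity both stages are linear and fitted by ordinary least squares (in which case 2SRI coincides with conventional two-stage least squares), whereas the GGE and GGE-IBC second stages are nonlinear FIML estimators. The data-generating process, the function names and all numerical values below are hypothetical and chosen only for illustration.

```python
import random


def ols(X, y):
    """Least-squares coefficients from the normal equations (Gauss-Jordan)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def simulate(n, beta_e=0.5, seed=0):
    """Hypothetical design: x_u is unobserved and affects both x_e and y."""
    rng = random.Random(seed)
    w, xe, y = [], [], []
    for _ in range(n):
        wi = rng.gauss(0, 1)               # instrument
        xu = rng.gauss(0, 1)               # unobservable confounder
        xei = 0.3 + 1.0 * wi + xu          # endogenous policy variable
        yi = 1.0 + beta_e * xei + xu + rng.gauss(0, 1)
        w.append(wi)
        xe.append(xei)
        y.append(yi)
    return w, xe, y


def two_stage_residual_inclusion(w, xe, y):
    """Return the second-stage coefficient on x_e."""
    # Stage 1: auxiliary regression of x_e on the instrument; keep residuals.
    a = ols([[1.0, wi] for wi in w], xe)
    resid = [xi - (a[0] + a[1] * wi) for xi, wi in zip(xe, w)]
    # Stage 2: outcome regression with the residual as an extra regressor.
    b = ols([[1.0, xi, ri] for xi, ri in zip(xe, resid)], y)
    return b[1]


if __name__ == "__main__":
    w, xe, y = simulate(5000, beta_e=0.5, seed=1)
    naive = ols([[1.0, xi] for xi in xe], y)[1]
    print("naive OLS slope:", round(naive, 3))
    print("2SRI slope:     ", round(two_stage_residual_inclusion(w, xe, y), 3))
```

In this hypothetical design the naive slope absorbs the effect of the confounder, while including the first-stage residual removes it. Because the second stage conditions on an estimated residual, the usual second-stage standard errors are not valid, which is the reason the corrected expressions are needed.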
In Chapter 3, since we did not make use of a two-stage model and since all the parameters of our model, GG-IBC, were estimated jointly using a maximum likelihood routine, we were able to obtain the standard errors of our parameter estimates directly from Stata. However, as before, in the case of GG-IBC we are also interested in the correct standard errors for the marginal effects. Deriving the expressions for these standard errors is left for future research. Similarly, in Chapter 4, since our model, GGE-IBC, is a two-stage model, it still remains to derive the expressions for the correct standard errors of the parameter estimates and the marginal effect. For the parameter estimates, we would base our derivations on the asymptotic results given in White (1994) and Newey and McFadden (1994). For the marginal effect, we would base our derivation on Terza (2010).

Secondly, for any given estimator, both large and small sample analyses are of interest. This is so because the large sample analysis helps validate the theoretical asymptotic properties of the estimator, while the small sample analysis provides insight into its behavior in realistic data scenarios. In Chapter 2, for our GGE model, we presented both of these analyses in our simulation study. But for Chapters 3 and 4, where we presented the GG-IBC and the GGE-IBC models, although we presented results using real data with a small sample size, we did not include a small sample analysis in our simulation studies. Since the results from our real data application are encouraging, in the future we would like to explore the small sample properties of our models more rigorously via simulation studies.

Thirdly, in this dissertation, in both the simulations and the real data analyses, we have observed differences among the parameter estimates across the models, but we have not determined whether such differences are significant in a statistical sense.
A topic for future research would be the development of Hausman-type statistical tests for this purpose.

Fourthly, the properties of the GG estimator and the GG-based estimators presented in this dissertation were rigorously studied via detailed simulation for cases in which the data are distributed according to members of the GG family. Our study of data distributions outside the GG family was limited to the beta distribution in Chapters 3 and 4. Clearly, there are many distributions that do not belong to the GG family which could be of interest. In the future, we would like to explore this dimension in more detail.

Fifthly, one of the important benefits of using the GG and GG-based models is that they can be used to conduct nested model tests, wherein the GG-based estimators can detect which of the relevant special cases (Weibull, exponential, standard gamma and log-normal) describes the data best. We detailed the performance of such a test using the GGE model in Chapter 2. In the future, we would like to explore similar nested model test applications in the GG-IBC and GGE-IBC contexts.

Sixthly, as part of this dissertation, we have developed software for our proposed estimators GGE, GG-IBC and GGE-IBC using Stata 10. We would like to make these available to the research community at large. For this reason, we intend to develop distributable Stata routines.

Finally, we have explored two different real datasets as target applications for our proposed models. In the future, we plan to continue to search for good applications for our methods in empirical health economics and elsewhere.
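The nested model tests mentioned above can be sketched as Wald tests of a single linear restriction on the estimated shape parameters. The sketch below is illustrative only. It assumes the usual GG parametrization in which the log-normal, Weibull and standard gamma cases correspond to κ = 0, κ = 1 and κ = σ respectively; that correspondence, the function name `wald_test`, and the parameter estimates and covariance matrix are assumptions made for the example, not results from this dissertation. The exponential case imposes two restrictions jointly and is not covered by this single-restriction function.

```python
import math


def wald_test(theta, cov, r, c=0.0):
    """Wald statistic and p-value for one linear restriction r'theta = c.

    theta : list of parameter estimates
    cov   : estimated covariance matrix of theta (list of lists)
    r     : restriction weights, same length as theta
    """
    k = len(theta)
    diff = sum(r[i] * theta[i] for i in range(k)) - c
    var = sum(r[i] * cov[i][j] * r[j] for i in range(k) for j in range(k))
    w = diff * diff / var
    # Upper tail of a chi-square with 1 degree of freedom.
    p = math.erfc(math.sqrt(w / 2.0))
    return w, p


if __name__ == "__main__":
    # Hypothetical estimates ordered as (sigma, kappa).
    theta = [0.71, 0.05]
    cov = [[0.0004, 0.0001],
           [0.0001, 0.0009]]
    tests = {
        "log-normal (kappa = 0)": ([0.0, 1.0], 0.0),
        "Weibull (kappa = 1)": ([0.0, 1.0], 1.0),
        "standard gamma (kappa = sigma)": ([-1.0, 1.0], 0.0),
    }
    for name, (r, c) in tests.items():
        w, p = wald_test(theta, cov, r, c)
        print(f"{name}: W = {w:.2f}, p = {p:.4f}")
```

With these made-up numbers only the log-normal restriction is not rejected at the 5 percent level, which mirrors the kind of conclusion a nested test is meant to deliver.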
APPENDIX A
THE FORMAL DERIVATION OF THE REPARAMETRIZATION OF THE CONDITIONAL MEAN

Standard gamma: The constant term in the conditional mean is defined in equation (2-4)
(2-4)
In standard gamma the shape parameters are equal, i.e., Substituting in (2-4)
(A-1)
(A-2)
Note that As a result we can write and substitute in (A-2)
(A-3)
(A-4)
We derived that the constant term in the conditional mean for standard gamma is equal to zero.

Weibull: In the Weibull distribution the shape parameter is equal to unity. Substituting into equation (2-4) we get
(A-5)
The constant term is defined as
(A-6)

Exponential: The exponential distribution is a special case of the standard gamma distribution where the parameters are set equal to As a result the constant term is

APPENDIX B
THE DERIVATION OF MARGINAL EFFECT OF THE ENDOGENOUS POLICY VARIABLE

The marginal effect of the endogenous policy variable is defined as
(B-1)
where 1 o1 e1 u1 ], and 2 is the same as except for the constant term where According to (B-1) the marginal effect would be
(B-2)
(B-3)
where is the standard normal probability density function. The marginal effect can be consistently estimated by
(B-4)
if we have consistent estimates of in the auxiliary regression. (B-4) is the number of observations.

APPENDIX C
STANDARD ERROR OF THE MARGINAL EFFECT OF THE ENDOGENOUS POLICY VARIABLE

Asymptotic Covariance Matrix of the Two-Part Model GGE Estimator

The following notational conventions will be maintained for a scalar function s of two vector arguments r and t (i.e., s = s(r, t) where s is a scalar and r and t are vectors):
(C-1)
and
(C-2)
We also assume that the former is a row vector, and the latter is a matrix with row dimension equal to that of the first subscript on and column dimension equal to that of the second subscript.
The objective function in the first stage of the 2SRI that corresponds to equation (2-29) is:
(C-3)
where w = [1 ] and and the objective function for the second stage (two-part) model covering equations (2-30) and (2-31) is 1
(C-4)
1 Equations (C-3) and (C-4) are actually abbreviated versions of the actual estimation objective functions and suppressed for notational convenience, and n is the sample size.
where , and f( ) is defined as in (2-2). By Theorem 6.11 of White (1994) we have that
(C-5)
where and
(C-6)
and
(C-7)
The matrices A and B can be substantially simplified. First note that, because the second stage two-part model estimator based on (C-4) is FIML, using condition (29) in Murphy and Topel (1985), we can write
(C-8)
and
(C-9)
Moreover, because estimation in each of the components of the two-part model is FIML we have
(C-10)
(C-11)
Note also that all off-diagonal elements of B are null. In particular
(C-12)
because when x_e, x_o, x_u and w+ are fixed, so is Using (13.20) on p. 393 of Wooldridge (2002), however, we have that
(C-13)
because is the score of the likelihood function for the first part of the two-part model. Therefore
(C-14)
We can similarly establish that
(C-15)
Also note that we can write
(C-16)
because when x_e, x_o, x_u, w+ and y* are fixed, so is Here again, because is the score of the likelihood function for the second part of the two-part model we obtain
(C-17)
Summarizing these results, we rewrite matrices A and B as
(C-18)
and
(C-19)
This simplification is substantial because consistent estimates of all of the components of A and B can be obtained from conventional Stata output for the first stage (NLS), second stage (probit), and third stage (generalized gamma). We have derived the first element of matrix A using the first stage objective function.
(C-20)
The first component of the second element of matrix A can be calculated using the score equation from probit estimation in the first part of the two-part model.
(C-21)
The second component, can be derived using score equations from the probit and generalized gamma estimations. The objective function for q is given in equation (C-4) and the parameter enters this equation through ( ). We can define
(C-22)
and
(C-23)
Using the above equalities and the objective function for q we can derive
(C-24)
The third element of matrix A, can be derived from the negative inverse of the variance matrix from probit estimation.
The fourth element of matrix A has two parts that need to be derived separately. For the first part we can use scores from generalized gamma estimation.
(C-25)
(C-26)
(C-27)
For the second part we will use scores from the probit and GG estimations as given above
(C-28)
To derive the last element of matrix A, we can use the negative inverse of the variance matrix from generalized gamma estimation.
The first element of matrix B, can be derived using the following,
(C-29)
As we have mentioned above can be derived using the inverse of the variance matrix for probit estimation. We can use the method described above the inverse of variance matrix for generalized gamma estimation to derive

Asymptotic Standard Error of the Marginal Effect Estimator

The marginal effect (ME) can be consistently estimated using
(C-30)
Terza (2010) shows that the asymptotic variance of is
avar( ) (C-31)
where
(C-32)
(C-33)
Now
(C-34)
(C-35)
(C-36)
and denotes the asymptotic covariance matrix of given in (C-5). Following Terza (2010),
(C-37)
where
(C-38)
where
(C-39)
(C-40)
(C-41)
(C-42)
(C-43)
(C-44)
where is the digamma function and is the estimated asymptotic covariance matrix.
2 The digamma function is the logarithmic derivative of the gamma function, $\psi(z) = \frac{d}{dz} \ln \Gamma(z) = \Gamma'(z) / \Gamma(z)$.

APPENDIX D
THE DERIVATION OF THE NLS-IBC MODEL PARAMETER VALUES

Let
(D-1)
If the conditional mean of the random variable y conditional on the random vector x can be expressed as
(D-2)
of parameters and C is a scalar constant, then
(D-3)
and with the scaled constant term shifted by The derivation of these scaling factors is as follows: We substitute equation (D-1) into equation (D-2) and rewrite (D-2) as
(D-4)
(D-5)
where
(D-6)
(D-7)
(D-8)
(D-9)
(D-10)
(D-11)
where
(D-12)
and
(D-13)
where and are the constant terms.

Special cases of the generalized gamma: standard gamma, exponential, Weibull and log-normal

The generalized gamma has a flexible distributional form that subsumes standard gamma, exponential, Weibull and log-normal as its special cases. We can derive the scaling factors for each of these special cases by defining the constant C, which is a function of the parameters and
(D-14)
In the case of standard gamma and (See Appendix A for the derivation of C for each case.) After substituting into equations (D-12) and (D-13) the scaling factors for standard gamma would be
(D-15)
and similarly the constant shifter is
(D-16)
Exponential is a special case of standard gamma where as a result the scaling factors for exponential are the same as in the case of standard gamma.
For the Weibull case the parameter and The scaling factors (D-12) and (D-13) can be defined as
(D-17)
(D-18)
respectively. Finally, for the log-normal case parameter goes to zero in the limit and, We can rewrite (D-12) and (D-13) as
(D-19)
(D-20)
respectively.
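The behaviour of the constant C across the special cases discussed in Appendices A and D can be checked numerically. The sketch below assumes that C takes the form used for the generalized gamma in Manning, Basu and Mullahy (2005), namely C = (σ/κ) ln(κ²) + ln Γ(1/κ² + σ/κ) − ln Γ(1/κ²), with the conditional mean written as exp(xβ + C). That form, the symbols σ and κ, and the function name `gg_constant` are assumptions for this illustration and may differ from the parametrization in equation (2-4).

```python
import math


def gg_constant(sigma, kappa):
    """Constant C in E[y|x] = exp(x'beta + C) under the assumed GG form.

    Requires kappa > 0; the log-normal case is approached as kappa -> 0.
    """
    a = 1.0 / kappa ** 2
    return ((sigma / kappa) * math.log(kappa ** 2)
            + math.lgamma(a + sigma / kappa)
            - math.lgamma(a))


if __name__ == "__main__":
    # Standard gamma (kappa = sigma): the constant vanishes.
    print("standard gamma:", gg_constant(0.7, 0.7))
    # Weibull (kappa = 1): the constant reduces to ln Gamma(1 + sigma).
    print("Weibull:       ", gg_constant(0.5, 1.0), math.lgamma(1.5))
    # Exponential (kappa = sigma = 1): the constant vanishes.
    print("exponential:   ", gg_constant(1.0, 1.0))
    # Log-normal limit (kappa -> 0): the constant approaches sigma^2 / 2.
    print("log-normal:    ", gg_constant(0.5, 0.01), 0.5 ** 2 / 2)
```

Under this assumed form the standard gamma and exponential constants are zero, which agrees with the conclusion stated in Appendix A, and the log-normal limit reproduces the familiar σ²/2 retransformation term.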
APPENDIX E
THE DERIVATION OF THE MARGINAL EFFECT FOR THE GG-IBC MODEL

Here we derive the marginal effect for the model where parameter is defined as
(E-1)
where
(E-2)
In this case the conditional mean function can be written as
(E-3)
where
(E-4)
The marginal effect for a continuous policy variable, is defined as the derivative of the conditional mean of the outcome variable with respect to and defined as
(E-5)
When the conditional mean is defined using an inverse Box-Cox transformation as in equation (E-3) we have three cases according to the value of parameter These are summarized below:

Case 1: If parameter we can rewrite parameter as and following this the conditional mean of the outcome variable y becomes
(E-6)
which is the case for GG. Using (E-6) the marginal effect for the policy variable can be derived as
(E-7)
where is as given in equation (E-4). The marginal effects as given in (E-7) can be consistently estimated using
(E-8)
where
(E-9)

Case 2: For parameter becomes and the conditional mean of y is now defined as
(E-10)
where is as given in equation (E-4). Equation (E-10) above is the case of a linear conditional mean. For the linear case the marginal effect is simply defined as
(E-11)
where is as given in equation (E-4). The consistent estimator for the marginal effect given in equation (E-11) is
(E-12)
where is as given in (E-9).

Case 3: In the last case the parameter or and the parameter can be defined as Substituting this into the conditional mean equation gives
(E-13)
where is as given in equation (E-4). The derivative inside the expected value can be written as
(E-14)
where is as given in equation (E-4). After taking the expected value of the expression in equation (E-14) we can write the marginal effect of the policy variable as
(E-15)
where is as given in equation (E-4). The consistent estimator for the marginal effect given in equation (E-15) is
(E-16)
where is as given in (E-9).
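The three cases above can be illustrated with a small numerical sketch of the inverse Box-Cox transformation and its derivative. It assumes the standard form of the transformation, (λz + 1)^(1/λ) for λ ≠ 0 and exp(z) for λ = 0, applied to a linear index z, and it ignores the distribution-specific constant and any scaling that enter the GG-IBC conditional mean in (E-3). The function names, the symbol λ and the example values are hypothetical.

```python
import math


def ibc(index, lam):
    """Inverse Box-Cox transform of a linear index (needs lam*index + 1 > 0)."""
    if lam == 0.0:
        return math.exp(index)                  # exponential mean (Case 1)
    return (lam * index + 1.0) ** (1.0 / lam)   # lam = 1 gives a linear mean


def ibc_marginal_effect(index, beta_e, lam):
    """Derivative of ibc(index, lam) with respect to the policy variable,
    where beta_e is the coefficient of that variable in the index."""
    if lam == 0.0:
        return beta_e * math.exp(index)
    return beta_e * (lam * index + 1.0) ** (1.0 / lam - 1.0)


def average_marginal_effect(indices, beta_e, lam):
    """Sample average of the observation-level marginal effects."""
    return sum(ibc_marginal_effect(z, beta_e, lam) for z in indices) / len(indices)


if __name__ == "__main__":
    indices = [0.1, 0.3, 0.8, 1.2]   # hypothetical values of the linear index
    beta_e = 0.5                     # hypothetical policy-variable coefficient
    for lam in (0.0, 1.0, 0.5):
        ame = average_marginal_effect(indices, beta_e, lam)
        print(f"lambda = {lam}: average marginal effect = {ame:.4f}")
```

With λ = 1 every observation has the same marginal effect, equal to the coefficient itself, while for the other values the effect varies with the index, which is why the estimators in the three cases average over the sample.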
REFERENCES

Baser, O., Bradley, C.J., Gardiner, J.C., Given, C., 2003. Testing and correcting for non-random selection bias due to censoring: an application to medical costs. Health Services and Outcomes Research Methodology 4, 93-107.
Basu, A., Manning, W.G., 2006. A test for proportional hazards assumption within the class of exponential conditional mean framework. Health Services and Outcomes Research Methodology 6, 81-100.
Basu, A., Rathouz, P., 2005. Estimating marginal and incremental effects on health outcomes using flexible link and variance function models. Biostatistics 6, 93-109.
Box, G.E.P., Cox, D.R., 1964. An analysis of transformations. Journal of the Royal Statistical Society, Series B 26, 211-252.
Bradford, W.D., Zoller, J., Silvestri, G.A., 2010. Estimating the effect of individual time preferences on the use of disease screening. Southern Economic Journal 76 (4), 1005-1031.
Carpio, C.E., Wohlgenant, M.K., Boonsaeng, T., 2008. The demand for agritourism in the United States. Journal of Agricultural and Resource Economics 33 (2), 254-269.
DeSimone, J., 2002. Illegal drug use and employment. Journal of Labor Economics 20, 952-977.
Duan, N., 1983. Smearing estimate: a nonparametric retransformation method. Journal of the American Statistical Association 78, 605-610.
Etile, F., Jones, A.M., 2009. Smoking and education in France. Working Paper. http://www.paris.inra.fr/aliss/content/download/3255/31214/version/1/file/ALISSW P2009 04Etile.pdf
Gavin, N.I., Adams, K., Manning, W.G., Raskind-Hood, C., Urato, M., 2007. The impact of welfare reform on insurance coverage before pregnancy and the timing of prenatal care initiation. Health Services Research 42, 1564-1588.
Gibson, T.B., Mark, T.L., Axelsen, K., Baser, O., Rublee, D.A., McGuigan, K.A., 2006. Impact of statin copayments on adherence and medical care utilization and expenditures. American Journal of Managed Care 12, SP11-SP19.
Gould, W., Pitblado, J., Sribney, W., 2003.
Maximum Likelihood Estimation with Stata. Stata Press, Texas.
Heffler, S., Smith, S., Keehan, S., Borger, C., Clemens, M., Truffler, C., 2005. Trends: U.S. health spending projections for 2004-2014. Health Affairs 24 (suppl.), W5-W74.
Hill, S.C., Miller, G.E., 2009. Health expenditure estimation and functional form: applications of the generalized gamma and extended estimating equations models. Health Economics 19, 608-627.
Kenkel, D., Terza, J.V., 2001. The effect of physician advice on alcohol consumption: count regression with an endogenous treatment effect. Journal of Applied Econometrics 16, 165-184.
Lindrooth, R.C., Weisbrod, B.A., 2007. Do religious nonprofit and for-profit organizations respond differently to financial incentives? The hospice industry. Journal of Health Economics 26, 342-357.
Manning, W.G., Basu, A., Mullahy, J., 2005. Generalized modeling approaches to risk adjustment of skewed outcomes data. Journal of Health Economics 24, 245-488.
McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models. Chapman & Hall, London.
Mullahy, J., 1997. Instrumental variable estimation of count data models: applications to models of cigarette smoking behavior. The Review of Economics and Statistics 79 (4), 586-593.
Mullahy, J., 1998. Much ado about two: reconsidering retransformation and the two-part model in health econometrics. Journal of Health Economics 17, 247-281.
Murphy, K.M., Topel, R.H., 1985. Estimation and inference in two-step econometric models. Journal of Business and Economic Statistics 3 (4), 370-379.
Newey, W.K., McFadden, D.L., 1994. Large Sample Estimation and Hypothesis Testing. In: Engle and McFadden (Eds.), Handbook of Econometrics. Elsevier Science B.V., Amsterdam, Ch. 36.
Norton, E.C., Van Houtven, C.H., 2006. Inter-vivos transfers and exchange. Southern Economic Journal 73, 157-172.
Pope, G.C., Kautter, J., Ellis, R.P., Ash, A.S., Ayanian, J.Z., Iezzoni, L.I., Ingber, M.J., Levy, J.M., Robst, J., 2004.
Risk adjustment of Medicare capitation payments using the CMS-HCC model. Health Care Financing Review 25, 119-141.
Richardson, L., Loomis, J., Champ, P.A., 2010. A comparison of methodologies for valuing decreased health effects from wildfire smoke. Working Paper. http://ageconsearch.umn.edu/bitstream/61252/2/AAEA.pdf
Rubinstein, R.Y., 1981. Simulation and the Monte Carlo Method. Wiley, New York.
Shea, D.G., Terza, J.V., Stuart, B.C., Briesacher, B., 2007. Estimating the effects of prescription drug coverage for Medicare beneficiaries. Health Services Research 43, 933-949.
Shin, J., Moon, S., 2007. Do HMO plans reduce health care expenditure in the private sector? Economic Inquiry 45, 82-99.
Stuart, B.C., Doshi, J., Terza, J.V., 2009. Assessing the impact of drug use on hospital costs. Health Services Research 44, 128-144.
Tadikamalla, P.R., 1979. Random sampling from the generalized gamma distribution. Computing 23, 199-203.
Terza, J.V., 2010. Health policy analysis via nonlinear regression methods: estimation and inference in the presence of endogeneity. Working Paper.
Terza, J.V., Basu, A., Rathouz, P.J., 2008a. Two-stage residual inclusion estimation: addressing endogeneity in health econometric modeling. Journal of Health Economics 27, 531-543.
Terza, J.V., Bradford, W.D., Dismuke, C.E., 2008b. The use of linear instrumental variables methods in health services research and health economics: a cautionary note. Health Services Research 43, 1102-1120.
Vandegrift, D., Yavas, A., 2009. Men, women, and competition: an experimental test of behavior. Journal of Economic Behavior and Organization 72, 554-570.
Wooldridge, J.M., 1992. Some alternatives to the Box-Cox regression model. International Economic Review 33, 935-955.
Wooldridge, J.M., 2002. Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge, MA.
White, H., 1994. Estimation, Inference and Specification Analysis. Cambridge University Press, Cambridge.
Zhang, Y., 2008.
Cost saving effects of Olanzapine as long-term treatment for bipolar disorder. The Journal of Mental Health Policy and Economics 11, 135-146.

BIOGRAPHICAL SKETCH

Mujde Z. Erten completed her Bachelor of Arts degree in economics at Bogazici University, Istanbul, Turkey in 2001. She has been a Ph.D. student in the Department of Economics at the University of Florida, Gainesville, FL, USA since 2003. She earned her Master of Arts degree in economics from the University of Florida, Gainesville, FL, USA in 2007. During her graduate studies she has been a Research and Teaching Assistant in the Department of Economics and the Institute for Child Health Policy at the University of Florida, Gainesville, FL, USA. Her research interests include Applied Econometrics, Health Economics, Industrial Organization and Regulation. In her first year of study at the University of Florida she received the Rafael Lusky Prize for Best First Year Graduate Student.