1 THE PERFORMANCE OF PROPENSITY SCORE METHODS TO ESTIMATE THE AVERAGE TREATMENT EFFECT IN OBSERVATIONAL STUDIES WITH SELECTION BIAS: A MONTE CARLO SIMULATION STUDY By SUNGUR GUREL A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN EDUCATION UNIVERSITY OF FLORIDA 2012
2 2012 Sungur Gurel
3 To my family
4 ACKNOWLEDGMENTS First of all, I would like to thank Dr. Walter Leite and Dr. James Algina for guiding me in my thesis. I thank the faculty and students of the Research and Evaluation Methodology Program. I would like to thank the Turkish Government for its financial support. I also would like to thank Veysel Duman, Suleyman Tor, and Halit Yilmaz for trusting me. Finally, I would like to thank Dilek C. Gulten for her support from the beginning.
5 TABLE OF CONTENTS

page

ACKNOWLEDGMENTS 4
LIST OF TABLES 7
LIST OF FIGURES 8
LIST OF ABBREVIATIONS 9
ABSTRACT 10

CHAPTER

1 INTRODUCTION 12

2 THEORETICAL FRAMEWORK 15
  The Potential Outcomes Framework 15
  Propensity Score Methods for Reducing Selection Bias in ATE Estimates 17
    Inverse Probability of Treatment Weighting (IPTW) 17
    Truncated Inverse Probability of Treatment Weighting (TIPTW) 18
    Propensity Score Stratification (PSS) 19
    Optimal Full Propensity Score Matching (OFPSM) 20
  Standard Error Estimation with Propensity Score Methods 21
  Comparison of Propensity Score Methods' Performances 24

3 METHOD 27
  Data Simulation 27
  Estimation of ATE and Standard Errors 27
  Analyses 31

4 RESULTS 33

5 DISCUSSION 41

APPENDIX

A RELATIVE BIAS OF ATE ESTIMATES 44
B RELATIVE BIAS OF STANDARD ERRORS OF THE ATE ESTIMATES 49
C EMPIRICAL COVERAGE AND POWER TABLES 54

REFERENCES 56
6 BIOGRAPHICAL SKETCH 61
7 LIST OF TABLES

Table page

A-1 Relative bias of ATE estimates in the baseline condition 44
A-2 Relative bias of ATE estimates and the percent bias reduction of ATE estimates with IPTW 45
A-3 Relative bias of ATE estimates and the percent bias reduction of ATE estimates with TIPTW 46
A-4 Relative bias of ATE estimates and the percent bias reduction of ATE estimates with OFPSM 47
A-5 Relative bias of ATE estimates and the percent bias reduction of ATE estimates with PSS 48
B-1 Relative bias of standard error estimates in the baseline 49
B-2 Relative bias of standard error estimates with IPTW 50
B-3 Relative bias of standard error estimates with TIPTW 51
B-4 Relative bias of standard error estimates with PSS 52
B-5 Relative bias of standard error estimates with OFPSM 53
C-1 Empirical coverage rates of 95% confidence intervals across 1000 simulated data sets 54
C-2 Proportion of the estimated ATE that is significant at the α = .05 level across 1000 simulated data sets 55
8 LIST OF FIGURES

Figure page

4-1 Relative bias of different standard error estimation methods in the baseline 36
4-2 Relative bias of different standard error estimation methods in the IPTW 37
4-3 Relative bias of different standard error estimation methods in the TIPTW 38
4-4 Relative bias of different standard error estimation methods in the PSS 39
4-5 Relative bias of different standard error estimation methods in the OFPSM 40
9 LIST OF ABBREVIATIONS

ATE Average Treatment Effect
ATT Average Treatment Effect on the Treated
IPTW Inverse Probability of Treatment Weighting
JK Jackknife
NCES National Center for Education Statistics
NSF National Science Foundation
OFPSM Optimal Full Propensity Score Matching
PS Propensity Score
PSS Propensity Score Stratification
SUTVA Stable Unit Treatment Value Assumption
TIPTW Truncated Inverse Probability of Treatment Weighting
TSL Taylor Series Linearization
WLS Weighted Least Squares Regression
10 Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Arts in Education

THE PERFORMANCE OF PROPENSITY SCORE METHODS TO ESTIMATE THE AVERAGE TREATMENT EFFECT IN OBSERVATIONAL STUDIES WITH SELECTION BIAS: A MONTE CARLO SIMULATION STUDY

By Sungur Gurel

August 2012

Chair: Walter Leite
Major: Research and Evaluation Methodology

We investigated the performance of four different propensity score (PS) methods to reduce selection bias in estimates of the average treatment effect (ATE) in observational studies: inverse probability of treatment weighting (IPTW), truncated inverse probability of treatment weighting (TIPTW), optimal full propensity score matching (OFPSM), and propensity score stratification (PSS). We compared these methods in combination with three methods of standard error estimation: weighted least squares regression (WLS), Taylor series linearization (TSL), and jackknife (JK). We conducted a Monte Carlo simulation study manipulating the number of subjects and the ratio of treated to total sample size. The results indicated that IPTW and OFPSM removed almost all of the bias, while TIPTW and PSS removed about 90% of the bias. Some of the TSL and JK standard errors were acceptable, some marginally overestimated, and some moderately overestimated. For the lower ratios of treated to total sample size, all of the WLS standard errors were strongly underestimated; as the designs became more balanced, the underestimation became less severe. For the OFPSM in particular, all of the
11 TSL and JK standard errors were overestimated and all of the WLS standard errors were underestimated under all simulated conditions.
12 CHAPTER 1
INTRODUCTION

Estimating the effects of educational interventions using secondary data has become common in educational research because of the availability of various nationally representative databases collected by agencies such as the National Center for Education Statistics (NCES) and the National Science Foundation (NSF) (Strayhorn, 2009). However, because assignment of participants to interventions in these national studies is not random, estimates of the effects of interventions are vulnerable to selection bias due to both observed and unobserved covariates (Shadish, Cook, & Campbell, 2002). In the last three decades, several methods have emerged for estimating treatment effects and dealing with selection bias in studies that lack random assignment to treatment conditions (Heckman, 1978; Rosenbaum & Rubin, 1983; Abadie & Imbens, 2006; Heckman et al., 1997), which are referred to collectively as observational studies. Propensity score methods are among the most commonly used methods in social science research in the analysis of observational studies. In order to reduce selection bias in treatment effect estimates, propensity score methods attempt to balance differences between treated and untreated participants on observed covariates. Rosenbaum and Rubin (1983) used the term propensity score (PS) for the first time and defined it as the predicted probability of treatment assignment given observed covariates. They found that if selection into treatment depends on observed covariates, the observed difference between the treatment and control conditions at a given propensity score level is an unbiased estimate of the average treatment effect (ATE) at that level (Rosenbaum & Rubin, 1983). Propensity scores can be used to reduce selection bias in the ATE by matching observations based on their similarity in PS, weighting observations with the
13 inverse of the PS, or stratifying observations into homogeneous groups based on the PS (Stuart, 2010). In addition to dealing with the problem of obtaining unbiased treatment effect estimates despite the existence of selection bias, educational researchers using large-scale surveys also have to pay special attention to the estimation of standard errors, because changes in the sampling variability of the data due to the use of PS methods such as propensity score matching and stratification may require special methods to estimate standard errors, such as bootstrapping, the jackknife, and Taylor series linearization (Stapleton, 2008). Although several types of treatment effects have been defined in the literature (Guo & Fraser, 2010), the estimates most commonly found in the social sciences literature are the average treatment effect (ATE) and the average treatment effect on the treated (ATT) (Thoemmes & Kim, 2011). Although all PS methods can be used to estimate these two treatment effects, the specific implementation of a propensity score method differs depending on whether the ATE or the ATT is of interest. There have been several studies comparing implementations of PS methods to estimate the ATT (Gu & Rosenbaum, 1993; Cepeda, Boston, Farrar, & Storm, 2003; Austin, 2010b; Harder, Stuart, & Anthony, 2010), but there has not been a study comparing all major propensity score methods for the estimation of the ATE. Therefore, the first objective of this study is to compare optimal full propensity score matching, propensity score stratification, and inverse probability weighting with respect to their ability to reduce selection bias in estimates of the ATE. Because most studies comparing PS methods focused on treatment effect estimates and did not address estimation of standard errors, the
14 second objective of this study is to compare strategies for estimating standard errors for the ATE estimates obtained with each propensity score method.
15 CHAPTER 2
THEORETICAL FRAMEWORK

The Potential Outcomes Framework

Rubin's (1974) potential outcomes framework is commonly used to understand selection bias in observational studies. Its basic principle is that both treated and control individuals have potential outcomes in both the presence and absence of treatment. For instance, let the observed outcome of a treated participant i be $Y_i^1$, while $Y_i^0$ is the potential outcome if this participant had been placed in the control group. Similarly, $Y_i^0$ is the observed outcome for the control group participant i, and $Y_i^1$ is the potential outcome if this participant had been placed in the treatment group. In other words, a control group participant has a potential outcome under the treatment condition; conversely, a treatment group participant has a potential outcome under the control condition. In randomized studies, because treatment assignment is random, the expected value of the potential outcomes of the treatment group is equal to the expected value of the observed outcomes of the control group, $E(Y^0 \mid T=1) = E(Y^0 \mid T=0)$. Similarly, the expected value of the potential outcomes of the control group is equal to the expected value of the observed outcomes of the treatment group, $E(Y^1 \mid T=0) = E(Y^1 \mid T=1)$. Therefore, the unbiased estimate of the treatment effect is $E(Y^1) - E(Y^0) = E(Y \mid T=1) - E(Y \mid T=0)$. In observational studies, due to the non-randomness of the treatment assignment, the groups may not be equivalent in the absence of treatment. Consequently, we cannot assume that $E(Y^0 \mid T=1) = E(Y^0 \mid T=0)$ and $E(Y^1 \mid T=0) = E(Y^1 \mid T=1)$. As stated earlier, many different treatment effects are defined within the potential outcomes framework, but we will give special attention to those which are commonly used in the social science literature.
16 The average treatment effect on the treated is $ATT = E(Y^1 - Y^0 \mid T=1)$, which is defined as the difference between the observed outcome of the participant under the treatment condition and the potential outcome under the control condition. In contrast, the average treatment effect is $ATE = E(Y^1 - Y^0)$, which is defined as the difference between the potential outcome for all individuals if they are exposed to the treatment condition and the potential outcome for all individuals if they are exposed to the untreated condition (Winship & Morgan, 1999). In other words, the ATT is the effect for those in the treatment group, while the ATE is the effect for both treated and untreated individuals (Stuart, 2010). The estimate of the ATE based on the difference between the observed outcomes of the treated and control individuals will only be unbiased if the assignment to the treatment is independent of the potential outcomes. More formally, $(Y^1, Y^0) \perp T \mid X$, where $Y^1$ is the potential outcome if treated, $Y^0$ is the potential outcome if untreated, $X$ is the set of all potential confounders, and $T$ is the treatment assignment. This condition is known as strong ignorability of treatment assignment (Rubin, 1974). It is also necessary that the stable unit treatment value assumption (SUTVA) is met, which requires that the potential outcome of one unit is not affected by the particular treatment assignment of other units (Rubin, 2007). Random assignment meets these assumptions, but frequently in social research the randomized experiment is either unfeasible or unethical. In observational studies, both strong ignorability of treatment assignment and SUTVA may be violated, which leads to biased estimates of the ATE and poor internal validity of the study (Shadish, 2002).
PS methods attempt to achieve strong ignorability of treatment assignment, under the assumption that SUTVA holds, by balancing the distributions of observed covariates between treatment and control
17 groups. Violations of SUTVA, such as when the decision of a parent to enroll a student in a program affects the student's classmates, require special considerations in the estimation of propensity scores and in the implementation of the PS method that are discussed elsewhere (Arpino & Mealli, 2011; Thoemmes & West, 2011).

Propensity Score Methods for Reducing Selection Bias in ATE Estimates

The use of any PS method requires a multiple-step process that starts with the selection of observed covariates that are related to selection into treatment conditions. The second step is to estimate propensity scores, which is most commonly accomplished with logistic regression, but other methods such as boosted regression trees (McCaffrey, Ridgeway, & Morral, 2004) can be used. The third step is to evaluate the common support area of the estimated propensity scores, which is the area of the propensity score distribution where values exist for both treatment and control groups (Guo & Fraser, 2010). Lack of common support for a certain area of the propensity score distribution restricts the generalizability of the estimates to the sub-population for which common support exists. The fourth step is to verify the balance of the distribution of covariates given the propensity score method of choice. The fifth step is to apply the PS method in conjunction with a statistical method (e.g., ordinary least squares regression) to estimate the ATE and its standard error, and to reach conclusions about the statistical significance of the ATE. The last step is to evaluate the sensitivity of the results to the possible omission of important covariates (Rosenbaum, 2010).

Inverse Probability of Treatment Weighting (IPTW)

Inverse probability weighting was introduced around the middle of the 20th century by Horvitz and Thompson (1952) to account for the effect of the sampling design in survey
18 estimates. Robins, Hernan, and Brumback (2000) extended this concept to inverse probability of treatment weighting to control for selection bias in observational studies. The simple idea behind IPTW is to weight subjects by the inverse of the conditional probability of being in the group that they are actually in. Formally, let $T_i$ be the treatment indicator, with $T_i = 1$ indicating a member of the treatment group and $T_i = 0$ indicating a member of the control group, and let $e(X_i)$ be the estimated propensity score. To estimate the ATE, all individuals in the sample are given weights. For individual i, the weight $w_i$ is (Stuart, 2010):

$w_i = \dfrac{T_i}{e(X_i)} + \dfrac{1 - T_i}{1 - e(X_i)}$ (2-1)

IPTW models create a pseudo-population where observations are replicated based on the weights, so that participants not only account for themselves, but also for those who have similar characteristics in the other group (Hernan, Hernandez-Diaz, & Robins, 2004). Neugebauer and van der Laan (2005) claimed that the performance of IPTW depends on the experimental treatment assignment assumption, which requires that all of the weights are different from zero. They also found that if any treatment probability is close to zero, the new weighted sample may not be representative of the target population.

Truncated Inverse Probability of Treatment Weighting (TIPTW)

The IPTW method has been criticized regarding its performance when the weights or the propensity scores are extreme (Freedman & Berk, 2008). The extreme weights create overly influential observations and inflate the sampling variability of estimates. Several researchers have proposed different solutions to this problem. Bembom and van der Laan (2008) developed a data-adaptive selection
19 of truncation level for IPTW estimators. They were able to gain up to 7% efficiency in the mean square error of estimates. Freedman and Berk (2008) replaced the weights that are greater than 20 with 20, and also trimmed observations with weights greater than 20. However, they concluded that neither method was able to reduce the selection bias. Sturmer, Rothman, Avorn, and Glynn (2010) found that trimming propensity scores that are more extreme than the 2.5th and 97.5th percentiles reduces selection bias compared to not trimming any observations and to trimming more observations.

Propensity Score Stratification (PSS)

PSS consists of creating strata containing individuals that are similar with respect to propensity scores (Stuart, 2010), where each stratum should contain at least one treated and one untreated individual. PSS is usually accomplished by dividing the distribution of propensity scores into intervals of equal size. Stratification based on covariates to reduce selection bias was proposed by Cochran (1968), but Rosenbaum and Rubin (1984) showed that stratification into five strata based on the propensity scores removes about 90% of the selection bias. In applied social science research, Thoemmes and Kim (2011) found that most studies use between 5 and 10 strata. Obtaining strata to estimate the ATE requires that all members of the sample are placed into a stratum, while strata containing only untreated observations may be dropped in the estimation of the ATT. Furthermore, estimating the ATE requires that the cases are weighted by the number of individuals in each stratum based on the following formula:

$w_{tj} = \dfrac{N}{J \, n_{tj}}$ (2-2)

where j indexes strata, t indexes the treatment condition, N is the total sample size, J is the total number of strata, and $n_{tj}$ is the total number of treated or untreated participants
20 in the stratum j. In contrast, ATT weights are created according to the number of treated individuals in each stratum.

Optimal Full Propensity Score Matching (OFPSM)

Full matching is a method of stratification in which the number of subclasses is shaped based on the data. When picking a treated unit randomly and picking a control unit randomly from the same subclass, the expected difference between those two units with respect to a certain measure of distance is $\delta$. All of the observations can be matched using a greedy algorithm, where observations are reviewed one by one to select the closest observations with respect to $\delta$, without considering the minimization of the overall distance for the whole sample. The optimal full matching algorithm, in contrast, was established based on network flow theory to minimize $\delta$ within matched sets by finding a minimum cost flow for the whole sample (Rosenbaum, 1989, 1991). OFPSM is a special case of optimal full matching where the PS is used as the measure of distance in the matching procedure. Formally, let $e(X_i)$ be the fitted propensity score of the individual i exposed to treatment t and $e(X_j)$ be the fitted propensity score of an individual j exposed to the control condition c, given observed covariates, so that $\delta_{ij} = |e(X_i) - e(X_j)|$ is the distance between propensity scores. Rosenbaum (1991) found that there is always a full matching that is optimal, in the sense that the total of the distances $\delta_{ij}$ within matched sets is minimized, where T is the number of treated participants and C is the number of control participants in a sample. Estimating the ATE with OFPSM requires weights calculated in the same way as in PSS (see Equation 2-2).
21 Standard Error Estimation with Propensity Score Methods

Weighted least squares regression (WLS) can be used to obtain ATE estimates and standard errors with all of the PS methods presented above (Schafer & Kang, 2008). This method can be implemented by applying the proper weights for each PS method while fitting the regression models. For the IPTW and TIPTW methods, weights are obtained using Equation 2-1. For the OFPSM and PSS methods, weights are shaped based on the number of strata and the number of treated and untreated individuals according to Equation 2-2. Once the weights are obtained, the error variance in WLS is estimated with the following formula:

$\hat{\sigma}^2 = \dfrac{\sum_{i=1}^{n} w_i E_i^2}{n - 2}$ (2-3)

where $E_i$ is the residual and $w_i$ is the weight for observation i, and n is the sample size (Fox, 2008); the standard errors of the WLS coefficients follow from this estimated error variance. Several other methods can be used to obtain standard errors of ATE estimates from PS methods, such as Taylor series linearization, the jackknife, and bootstrapping (Rodgers, 1999). These methods have not been researched extensively with propensity score methods.

Taylor series linearization (TSL) can be used to obtain the variance of a statistic by approximating the estimator with a linear function of the observations (Wolter, 2007). Formally, let $u_i(ATE)$ be a function of the data for observation i and the ATE, and let the true population value $ATE_U$ solve the following equation:

$\sum_{i \in U} u_i(ATE_U) = 0$ (2-4)

then, in a complex sample, we are able to define $\widehat{ATE}$ as the solution of the weighted sample equation
22 $\sum_{i \in S} w_i u_i(\widehat{ATE}) = 0$ (2-5)

The variance of the ATE is then defined as follows, applying the delta method (Binder, 1983):

$\hat{V}(\widehat{ATE}) = \left[\sum_{i \in S} w_i \dfrac{\partial u_i}{\partial ATE}\right]^{-1} \hat{V}\!\left(\sum_{i \in S} w_i u_i(\widehat{ATE})\right) \left[\sum_{i \in S} w_i \dfrac{\partial u_i}{\partial ATE}\right]^{-1}$ (2-6)

For observation i, let $\hat{B}_0$ be the intercept and $\hat{B}_1$ be the slope of the regression equation, where $\hat{B}_1$ is the estimated ATE, let $w_i$ be the weight, and let $\bar{x}$ be the mean of x. Using Taylor series linearization, the standard error of the ATE is defined as (Lohr, 1999):

$\widehat{SE}(\widehat{ATE}) = \dfrac{\sqrt{\sum_{i} w_i^2 (x_i - \bar{x})^2 E_i^2}}{\sum_{i} w_i (x_i - \bar{x})^2}$ (2-7)

Jackknife and bootstrapping are both based on resampling from the original data. The most common implementation of the jackknife (JK) is the delete-1 jackknife, where at each iteration one member of the sample is removed and the parameters of interest are estimated using replicate weights, which are recalculated after removing the observation. For the delete-1 jackknife, let $w_i$ be the initial weight for an observation i and n be the sample size. Depending on the PS method selected, $w_i$ may be the IPTW weight obtained from Equation 2-1 or a truncated weight. When observation j is deleted, the replicate weights are:

$w_{i(j)} = \begin{cases} 0, & i = j \\ w_i \dfrac{n}{n-1}, & i \neq j \end{cases}$ (2-8)

At each iteration, the ATE ($\widehat{ATE}_{(j)}$), which is the parameter of interest, is recalculated. The standard error is obtained as follows (Lohr, 1999):
23 $\widehat{SE}_{JK} = \sqrt{\dfrac{n-1}{n} \sum_{j=1}^{n} \left(\widehat{ATE}_{(j)} - \widehat{ATE}\right)^2}$ (2-9)

The delete-n jackknife is used when a whole stratum is deleted at each iteration. Let $w_i$ be the weight for an observation in a particular stratum, calculated using Equation 2-2, let n be the sample size, and let $n_j$ be the stratum size for stratum j. The same rule applies for the estimation of standard errors using the delete-n jackknife, but the jackknife replicate weights are estimated in the following way:

$w_{i(j)} = \begin{cases} 0, & i \in \text{stratum } j \\ w_i \dfrac{n}{n - n_j}, & i \notin \text{stratum } j \end{cases}$ (2-10)

At each iteration, the ATE ($\widehat{ATE}_{(j)}$) is recalculated using the new weights. The standard error is obtained using the following equation (Lohr, 1999):

$\widehat{SE}_{JK} = \sqrt{\dfrac{J-1}{J} \sum_{j=1}^{J} \left(\widehat{ATE}_{(j)} - \widehat{ATE}\right)^2}$ (2-11)

Bootstrapping consists of resampling k times with replacement from the original sample to create samples of the same size as the original sample, and estimating the parameters with the k resampled datasets. Weights are adjusted for each bootstrapped sample by multiplying the initial weight by the number of times a particular observation is selected into the resample. Standard errors are simply the standard deviation of the parameter estimates across the multiple samples. Formally (Lohr, 1999):

$\widehat{SE}_{boot} = \sqrt{\dfrac{1}{k-1} \sum_{b=1}^{k} \left(\widehat{ATE}_b - \overline{\widehat{ATE}}\right)^2}$ (2-12)

However, Abadie and Imbens (2008) presented a mathematical proof and a simulation concerning the accuracy of variance estimation for ATT estimates in nearest-neighbor matched samples. They argued that bootstrapping is not appropriate for matched
24 data because when the ratio of treated to total sample size is greater than about .42, the standard bootstrap does not provide a valid estimate of the asymptotic variance of the ATT, and increasing the sample size is not a solution either. However, Abadie and Imbens claimed that bootstrapping provides valid inference with propensity score weighting, because the propensity score weighting estimator is asymptotically linear. Bootstrapping was not included in this study because it is not valid under all of the conditions that we investigated. Abadie and Imbens' argument concerned standard errors estimated with the bootstrap; because we do not have enough evidence on whether the same problem applies to the jackknife method, we did not exclude the jackknife method from the study.

Comparison of Propensity Score Methods' Performances

Much research has been conducted comparing different structures, distances, and algorithms for propensity score matching or probability weighting to estimate treatment effects (Gu & Rosenbaum, 1993; Cepeda, Boston, Farrar, & Storm, 2003; Lunceford & Davidian, 2004; Austin, 2009a; Austin, 2009b; Austin, 2010a; Harder, Stuart, & Anthony, 2010). For this study, it is particularly relevant that both Gu and Rosenbaum (1993) and Cepeda et al. (2003) found that optimal matching consistently outperforms matching with a greedy algorithm, which is the most commonly used algorithm for PS matching. For this reason, we did not investigate the greedy algorithm further. Austin (2009a) found that matching on the propensity score within a specified caliper and IPTW remove more systematic differences between groups than PSS and covariate adjustment. Austin (2010a) also found that a doubly robust IPTW method (i.e., where the IPTW is used as both a weight and a covariate) works better than PSS, matching on the propensity score, IPTW, and covariate adjustment methods in
25 terms of bias, variance estimation, coverage of confidence intervals, mean squared error, and Type I error rates. However, Austin only compared these methods under the condition of a binary outcome. Also, Austin (2009b) evaluated standard error estimation methods for propensity score matching and found that methods that considered the matched nature of the data resulted in smaller bias of standard errors and actual Type I error rates closer to the nominal Type I error rate. Harder et al. (2010) compared the interaction of three PS estimation models (i.e., multivariable logistic regression, multivariable logistic regression with product terms, and nonparametric generalized boosted modeling) and five PS methods (i.e., 1:1 greedy matching, full matching, weighting by the odds, stratification, and IPTW). Because in their study the ATT and ATE estimates were similar, they made comparisons across methods that estimate the ATT and the ATE. Their results indicated that using nonparametric generalized boosted modeling to estimate propensity scores with 1:1 greedy matching provides better covariate balance for the majority of the covariates. Lunceford and Davidian (2004) compared IPTW with PSS with respect to bias removal in the ATE. Both theoretical and empirical results indicated that using a fixed number of strata leads to biased estimates of the ATE. In this study, we address the scarcity of information in the literature about the relative performance of PS methods for estimating the ATE and its standard error with the following research questions:

1. Which propensity score method (OFPSM, IPTW, TIPTW, and PSS) performs best with respect to unbiased estimation of the ATE under conditions with different sample sizes and ratios of treated to total sample size?

2.
Which method (WLS, TSL, and JK) produces the most accurate standard errors when combined with the different propensity score methods (OFPSM, IPTW, TIPTW, and PSS)?
26 3. Which propensity score method (OFPSM, IPTW, TIPTW, and PSS) leads to the most power to test the ATE?
27 CHAPTER 3
METHOD

In order to answer the research questions, a Monte Carlo simulation study was conducted using the R 2.14.0 program (R Development Core Team, 2011).

Data Simulation

The data was simulated by manipulating sample size and the proportion of treated individuals. The Monte Carlo simulation studies comparing PS methods by Gu and Rosenbaum (1993), Freedman and Berk (2008), and Austin (2009a) used 1,000 as the only sample size. By manipulating sample size, we were able to determine whether there were differences between the PS methods in terms of power to test the ATE. We simulated data with sample sizes equal to 500, 1000, and 2000. We generated data where the proportion of the sample that was treated was set at 1/10, 1/7, 1/4, 1/3, and 1/2. These conditions are an extension of the Gu and Rosenbaum (1993) study, which only examined ratios of 1/7, 1/4, and 1/3. To measure the common support area of the propensity score, we used an overlap measure that is similar to Cohen's (1988) U1 function, which is the proportion of non-overlap of two distributions. Let A and C be the areas of non-overlap and B be the overlap area of the logits of the propensity scores. Then,

$U1 = \dfrac{A + C}{A + B + C}$ (3-1)

where increases in U1 correspond to decreases in the area of common support. In the simulated conditions, the mean U1 ranged from .131 to .280. As the sample size increased or the ratio of treated to sample size increased, overlap also improved.
In order to obtain reasonable population parameters to simulate data, we took estimates from the 2007-2008 School Survey on Crime and Safety (SSOCS), a nationally representative survey conducted by the United States Department of Commerce (NCES, 2011). The covariates were the number of students transferred from the school, the typical number of classroom changes, the percentage of students below the 15th percentile on standardized tests, and the total number of transfers to specialized schools. The grouping variable was whether an outside-school disciplinary plan was available or not, and the outcome was the total number of students involved in specified offenses. We generated multivariate normally distributed covariates for the simulation study using the MASS 7.3-16 package in R (Venables & Ripley, 2002). The first step of the data simulation was to simulate the covariates, which were normally distributed with population means of zero and a population covariance matrix based on estimates from the SSOCS data. At the second step, we simulated the residuals of the outcome regression from a normal distribution with a mean of zero and a standard deviation of 166.208. The population standard deviation of the residuals was defined so that the population R2 for the outcome regression was .211. Once the covariates and the residual of the outcome were simulated, we obtained the potential control outcomes and potential treatment outcomes for all individuals in the sample based on the following equations:
Yi(0) = β0 + β1X1i + β2X2i + β3X3i + β4X4i + ei,  Yi(1) = Yi(0) + ATE, (3-2)

where the population values of the coefficients β0, β1, β2, β3, and β4 were 0, 16.221, 58.642, 15.704, and 33.601. The population value of the ATE was 20, which was added to the potential control outcome to obtain the potential treatment outcome. The next step was to determine which individuals in the simulated samples were exposed to treatment. The population model for treatment assignment was

logit[P(Zi = 1 | Xi)] = γ0 + log[rt / (1 − rt)] + γ1X1i + γ2X2i + γ3X3i + γ4X4i, (3-3)

where the population values of γ0 through γ4 were 0, .127, .137, .166, and .101, and rt is the ratio of treated to the sample size. The strength of the selection bias was defined based on the McKelvey and Zavoina pseudo-R2, whose population value for the simulated data was .028 (McKelvey & Zavoina, 1975). The population model for treatment assignment defines the effect of the covariates on the logit of the probability of being in the treatment group. We included the log odds of being in the treated group in the treatment assignment model so that we could control the ratio of treated to the sample size. Finally, for each observation we compared the probability of being in the treatment group given the observed covariates to a random number drawn from a uniform distribution with a minimum value of zero and a maximum value of one. A case in each simulated sample was defined as treated if the probability of being in the treatment group was greater than or equal to the random number; otherwise, the case was defined as untreated. We did not manipulate the number of covariates because Gu and Rosenbaum (1993) concluded that as long as the treatment assignment mechanism is modeled
completely, the number of covariates does not affect the performance of the propensity score estimation process, with the exception of potential problems of multicollinearity among covariates and convergence problems. We used four continuous covariates that were related to both the outcome and treatment assignment in this simulation study. Since we simulated data based on four covariates and estimated propensity scores using all four covariates, the assumption that the treatment assignment is modeled completely was met for all conditions in the study.

Estimation of ATE and Standard Errors

We simulated 1,000 datasets per condition and analyzed the simulated data according to the following steps for each dataset:
1. We estimated the PS for each individual using logistic regression.
2. We computed the U1 value to measure the area of common support.
3. We estimated the ATE and its standard error ignoring selection bias, to represent the baseline, with the following model:

Yi = β0 + β1Zi + ei, (3-4)

where β0 is an intercept, Zi is a treatment indicator, and β1 is the ATE estimate.
4. Using Equation 3-4, we estimated the ATE and standard errors using weights equal to 1. We re-estimated standard errors with the TSL method using Equation 2-7. We implemented the delete-1 JK method with the survey 3.26-1 (Lumley, 2011) library in R; weights were re-estimated using Equation 2-8 and standard errors were re-estimated using Equation 2-9.
5. Using Equation 3-4, we estimated the ATE and standard errors using IPTW weights with weighted least squares (WLS) estimation as in Equation 2-3. The weights used in the analysis were formed with Equation 2-1, which was provided earlier. We re-estimated standard errors with the TSL method using Equation 2-7. We implemented the delete-1 JK method; weights were re-estimated using Equation 2-8 and standard errors were re-estimated using Equation 2-9.
6.
We replaced the weights that were greater than the 99th percentile of the IPTW weights with the 99th percentile, creating the TIPTW weights. We estimated the ATE and
standard error with Equation 3-4, using the TIPTW weights with WLS estimation as in Equation 2-3. We re-estimated standard errors with the TSL method using Equation 2-7. We implemented the delete-1 JK method; weights were re-estimated using Equation 2-8 and standard errors were re-estimated using Equation 2-9.
7. We grouped the treated and control individuals into five strata, based on similarity of their PSs, using the MatchIt 2.4-20 (Ho, Imai, King, & Stuart, 2007) library in R. We created 5 strata because that is the number of strata most commonly used by social science researchers (Thoemmes & Kim, 2011). We estimated the ATE and standard errors using weighted least squares estimation as in Equation 2-3, with weights calculated based on Equation 2-2, using the model in the following equation:

Yi = β0 + β1Zi + Σs γsDsi + ei, (3-5)

where Dsi is a dummy-coded indicator of membership in stratum s. We re-estimated the standard errors with the TSL method based on Equation 2-7. We implemented the delete-n JK method in the survey 3.26-1 (Lumley, 2011) library in R, where one stratum is deleted at each iteration; weights were re-estimated using Equation 2-10 and standard errors were re-estimated using Equation 2-11. While analyzing the data, we treated the data as stratified in nature.
8. We grouped the treated and control individuals, based on similarity of their PSs, into a data-defined number of strata using the OFPSM algorithm implemented in the optmatch 0.7-1 (Hansen & Fredrickson, 2009) library in R. We estimated the ATE based on Equation 3-5, with weights obtained using Equation 2-2, and standard errors using WLS estimation based on Equation 2-3. We implemented the delete-n JK method, where one stratum is deleted at each iteration; weights were re-estimated using Equation 2-10 and standard errors were re-estimated using Equation 2-11. While analyzing the data, we treated the data as stratified in nature.
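The data-generating model of Equations 3-2 and 3-3 and the IPTW/TIPTW estimation of steps 5 and 6 can be sketched as follows. This is an illustrative Python translation under stated assumptions, not the thesis's original R code: the covariates' population covariance matrix did not survive extraction, so an identity matrix stands in as a placeholder, and the weight form (1/e for treated, 1/(1−e) for controls) is assumed to correspond to Equation 2-1 from Chapter 2:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_dataset(n=500, rt=0.25):
    """Sketch of the data-generating model (Equations 3-2 and 3-3).

    Coefficient values are taken from the text; the true covariance
    matrix of the covariates is replaced by an identity placeholder.
    """
    X = rng.multivariate_normal(np.zeros(4), np.eye(4), size=n)
    beta = np.array([16.221, 58.642, 15.704, 33.601])   # beta0 = 0
    e = rng.normal(0.0, 166.208, size=n)                # residual SD from text
    y0 = X @ beta + e                                   # potential control outcome
    y1 = y0 + 20.0                                      # population ATE = 20
    gamma = np.array([0.127, 0.137, 0.166, 0.101])
    logit = np.log(rt / (1.0 - rt)) + X @ gamma         # log-odds offset controls
    p = 1.0 / (1.0 + np.exp(-logit))                    # the treated proportion
    z = (p >= rng.uniform(size=n)).astype(int)          # treated if p >= uniform draw
    return X, z, np.where(z == 1, y1, y0)               # observed outcome

def iptw_ate(y, z, ps, truncate_pct=None):
    """ATE via IPTW (step 5); with truncate_pct set, weights above that
    percentile are capped, giving the TIPTW variant of step 6."""
    w = np.where(z == 1, 1.0 / ps, 1.0 / (1.0 - ps))
    if truncate_pct is not None:
        w = np.minimum(w, np.percentile(w, truncate_pct))
    # WLS fit of y on [1, z]; the slope on z is the ATE estimate.
    Xd = np.column_stack([np.ones_like(y), z.astype(float)])
    b = np.linalg.solve(Xd.T @ (w[:, None] * Xd), Xd.T @ (w * y))
    return b[1]
```

In the study itself, step 1 estimates each unit's propensity score by logistic regression before weighting; here one could pass the true assignment probabilities for illustration.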
Analyses

We compared the PS methods in terms of the relative bias of the ATE estimates and the percent bias reduction of the ATE estimates. We compared the standard error estimation methods in terms of the relative bias of the standard errors and the coverage of confidence intervals. For the methods resulting in acceptable bias of the standard errors, we estimated the power to test the ATE. The relative bias of the ATE was calculated as B = (ATE_bar − ATE) / ATE, where ATE_bar is the mean of the ATE estimates for all iterations of one
condition and ATE is the population ATE. If the absolute value of the estimated relative bias is larger than .05, the bias is considered unacceptable (Hoogland & Boomsma, 1998). Although this criterion was developed within the structural equation modeling framework and we do not know whether it applies to treatment effect estimation procedures, we used it as a rule of thumb. Because the magnitude of the relative bias of the ATE depends not only on the difference between the mean ATE and the population ATE but also on the size of the ATE, we also evaluated the percent bias reduction, which is defined as

PBR = 100 × (B0 − Bm) / B0, (3-6)

where Bm is the mean relative bias of the ATE using a particular method and B0 is the initial bias (Cochran & Rubin, 1973; Steiner, Cook, Shadish, & Clark, 2010). The relative bias of the standard error is (SE_bar − SD) / SD, where SE_bar is the mean of the estimated standard errors of the ATE and SD is the empirical standard error, which is the standard deviation of the estimated ATEs. If the absolute value of the estimated relative bias is larger than .1, the bias is considered unacceptable (Hoogland & Boomsma, 1998). We estimated the power by calculating the proportion of ATE estimates that were statistically significant at the α = .05 level for each condition. We also calculated the proportion coverage of confidence intervals, which is the proportion of iterations where the population ATE falls within the 95% confidence interval for the estimated ATE (Austin, 2009b).
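The evaluation criteria above can be sketched as small helper functions. This is illustrative Python, with the percent bias reduction implemented as 100 × (initial bias − method bias) / initial bias, our reading of Equation 3-6:

```python
import numpy as np

def relative_bias(ate_hats, true_ate=20.0):
    """Relative bias of the ATE: (mean estimate - true ATE) / true ATE."""
    return (np.mean(ate_hats) - true_ate) / true_ate

def percent_bias_reduction(method_bias, initial_bias):
    """Percent of the initial bias removed by a method (Equation 3-6,
    as read from the text): 100 * (B0 - Bm) / B0."""
    return 100.0 * (initial_bias - method_bias) / initial_bias

def se_relative_bias(se_hats, ate_hats):
    """Relative bias of the standard error: the mean estimated SE compared
    against the empirical SE, i.e. the SD of the ATE estimates."""
    empirical = np.std(ate_hats, ddof=1)
    return (np.mean(se_hats) - empirical) / empirical
```

Under the study's rules of thumb, |relative_bias| > .05 for the ATE, or |se_relative_bias| > .1 for the standard error, is flagged as unacceptable.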
CHAPTER 4
RESULTS

Table A-1 shows that when we ignored the selection bias, the mean estimated ATE was more than twice the size of the population ATE. Tables A-2 through A-5 present the relative bias of the estimated ATE and the percent bias reduction for each PS method. With IPTW, the only unacceptable relative bias occurred when the sample size was 500 and the simulated ratio of treated to sample size was as low as .1. In terms of removing bias, in almost all conditions IPTW removed more than 98% of the initial bias. On the other hand, TIPTW was not as effective as IPTW in removing bias, especially when the ratio of treated to sample size was small. As the sample size increased, TIPTW removed more bias, but this difference was small. As the ratio increased, TIPTW removed more bias, and when the ratio reached .5, TIPTW worked almost as well as IPTW in reducing the selection bias in the estimate of the ATE. As shown in Table A-4, OFPSM worked as well as IPTW in all conditions. When the sample size was 500 and the ratio of treated to sample size was 1/10, the remaining bias was marginally unacceptable, but as the ratio increased, all ATE estimates became unbiased. Finally, Table A-5 shows that the PSS method failed to reduce the selection bias to acceptable levels in all conditions. It is reasonable to say that, among the four methods investigated, IPTW and OFPSM removed almost all of the selection bias in the estimated ATE, while TIPTW provided acceptable results only with proportions of treated individuals approaching 50%, and PSS never provided adequate results. Overall, we found that the performance of the PS methods tended to improve as the sample size and the ratio of treated to sample size increased.
Tables B-1 through B-5 show that the sample size did not have a sizeable effect on the bias of the standard error estimates. Figures 4-1 through 4-5 were therefore constructed by collapsing the biases across the sample sizes. Figure 4-1 shows that the relative biases of the standard errors of the estimated ATEs in the baseline condition were acceptable in all conditions with the TSL and JK methods. However, WLS standard errors were biased when the simulated ratio of treated to sample size was lower than .25; for WLS, the underestimation of standard errors declined as the ratio of treated to sample size approached .5. Figure 4-2 depicts the bias of the standard errors of the ATE estimates for the IPTW method with different approaches to estimating standard errors. We saw a pattern similar to the baseline model when using WLS: for the two lowest simulated ratios of treated to sample size, standard errors were underestimated, but underestimation did not occur with larger ratios. In contrast, TSL estimated standard errors accurately for the lowest ratio, but for the larger ratios this method overestimated standard errors either marginally or moderately. The jackknife provided marginally more extreme standard errors than TSL did. The same pattern was observed for the TIPTW method; the only difference is that TSL and JK produced more accurate estimates of the standard errors. In fact, for the lowest two ratios, TSL and JK produced acceptable standard errors, but for the other three ratios they did not produce standard errors with acceptable levels of bias. For PSS, WLS again produced underestimated standard errors for the lowest two ratios and marginally acceptable underestimated standard errors for a ratio of .25; for the other ratios, WLS standard errors were acceptable. In contrast, TSL and JK produced overestimated standard errors for most of the conditions. However, for the
smallest and largest ratios, the biases were either within acceptable levels or marginally overestimated; for the other ratios of treated to total sample size, TSL and JK marginally or moderately overestimated the standard errors. For OFPSM, standard error estimates obtained with WLS were underestimated in all conditions, while TSL and JK provided standard errors that were marginally or moderately overestimated, regardless of the sample size and the simulated ratio of treated to sample size. Considering that WLS provided substantially underestimated standard errors for the two smallest ratios of treated to sample size and that TSL produced more accurate standard errors than JK overall, the coverage of confidence intervals and the power analysis were conducted using TSL standard errors. For each iteration, we estimated a 95% confidence interval for each estimated ATE. The results are presented in Table C-1. For all conditions, the population ATE was within the confidence interval more than 95% of the time. Based on Table C-2, PSS was the most powerful method to test the ATE in most of the simulated conditions. The second most powerful PS method was TIPTW, IPTW was the third, and OFPSM was the least powerful. As expected, power increased as the sample size increased, but it also increased as the simulated ratio of treated to sample size increased.
Figure 4-1. Relative bias of different standard error estimation methods in the baseline condition.
Figure 4-2. Relative bias of different standard error estimation methods with IPTW.
Figure 4-3. Relative bias of different standard error estimation methods with TIPTW.
Figure 4-4. Relative bias of different standard error estimation methods with PSS.
Figure 4-5. Relative bias of different standard error estimation methods with OFPSM.
CHAPTER 5
DISCUSSION

Our first research question asked whether any PS method outperformed the others in terms of removal of selection bias. With large levels of selection bias, IPTW and OFPSM removed almost all of the bias in every condition, and OFPSM performed marginally better than IPTW overall. TIPTW was the third best method in terms of removing the selection bias, and PSS was the worst. Austin (2010a) found that IPTW removes more bias than PSS when the outcome is binomial. Cochran (1968) demonstrated that using 5 strata removes about 90% of the selection bias, which agrees with the percent bias reduction using the PSS method that we found. We also found that as the ratio of treated to sample size increased from .1 to .5, the methods tended to remove more bias. This is consistent with findings that as the number of possible matches increases, selection bias is reduced. Because we estimated the ATE, two-way matching is performed (i.e., treated to control and control to treated); in this case, the number of available matches is at its maximum when the ratio of treated to sample size is .5. We also observed that as the sample size increased, all methods became more efficient in terms of bias reduction.

The second research question was about the accuracy of the estimates of the standard errors. We are not able to conclude that a single standard error estimation procedure works best in every simulated condition. The data suggest that TSL and JK tend to overestimate standard errors; of the two, TSL is marginally more accurate than JK. Since TSL is a computation-based method and JK is a replication-based method, if TSL estimates are available, they should be more accurate than JK estimates. WLS tends to underestimate the standard errors. Even if the weights
are correct, WLS ignores the stratified nature of the data. This leads to underestimated standard errors when the strata are a key feature of the analysis, as in OFPSM and PSS. In addition, as the population ratio of treated to sample size increases, the standard error estimates come within acceptable levels.

Our final research question was about the coverage of confidence intervals and power. For all simulated conditions, the percent coverage of confidence intervals was greater than 95%. As expected, power increased with sample size. Also as expected, balanced designs were more powerful than unbalanced designs (Hsieh, Bloch, & Larsen, 1998); therefore, when the simulated ratio of treated to sample size was .5, we had more power to test the ATE.

In this study, we investigated conditions in which only the independence of treatment assignment was violated (i.e., selection bias arising from observed covariates). Our results are not generalizable to conditions where SUTVA may be violated; in that case, researchers are recommended to use appropriate model-based solutions such as multilevel modeling (Arpino & Mealli, 2011; Thoemmes & West, 2011). Because we estimated the ATE, the overlap between the propensity score distributions of the treated and untreated groups matters. Lechner (2008) concluded that poor overlap creates bias in the estimates of treatment effects. Ignoring the lack of common support may be misleading because the groups may not be comparable, and dropping observations that are outside of the common support area in either group is also undesirable for the estimation of the ATE because it conflicts with the definition of the ATE. When a stratum contains no treated or no untreated participants, the weight for that stratum becomes zero and the stratum is ignored in the ATE estimation; this may happen while estimating the ATE using OFPSM or PSS. In this
simulation, when an iteration dropped any observation due to limited overlap with any of the methods, we discarded that iteration and re-simulated the data. For this reason, these results are not generalizable to conditions where the overlap is poor and there is no available match for many of the observations. One other limitation is that the weights we used for IPTW and TIPTW ranged from 1 to 20 for most of the conditions; we did not investigate conditions with extreme weights, such as 100 or more. The generalizability of our results is thus limited to conditions that are not heavily affected by extreme weights. Additional simulation work should also be performed to explain why the standard errors were not accurate and to identify possible solutions to this problem.
APPENDIX A
RELATIVE BIAS OF ATE ESTIMATES

Table A-1. Relative bias of ATE estimates in the baseline condition.

Sample size   Ratio of treated   Relative bias of
              to total sample    ATE estimates
500           1/10               1.195
500           1/7                1.162
500           1/4                1.181
500           1/3                1.149
500           1/2                1.155
1000          1/10               1.155
1000          1/7                1.152
1000          1/4                1.193
1000          1/3                1.153
1000          1/2                1.188
2000          1/10               1.219
2000          1/7                1.170
2000          1/4                1.168
2000          1/3                1.162
2000          1/2                1.119

Note. Biases greater than 0.05 or less than −0.05 are unacceptable.
Table A-2. Relative bias and percent bias reduction of the ATE estimates with IPTW.

Sample size   Ratio   Relative bias   Percent bias reduction
500           1/10    0.073           93.930%
500           1/7     0.008           99.321%
500           1/4     0.039           96.716%
500           1/3     0.001           99.932%
500           1/2     0.014           101.218%
1000          1/10    0.020           101.736%
1000          1/7     0.002           100.136%
1000          1/4     0.023           98.090%
1000          1/3     0.009           100.821%
1000          1/2     0.021           98.262%
2000          1/10    0.022           98.207%
2000          1/7     0.000           99.986%
2000          1/4     0.004           99.693%
2000          1/3     0.012           98.971%
2000          1/2     0.029           102.610%

Note. Biases greater than 0.05 or less than −0.05 are unacceptable.
Table A-3. Relative bias and percent bias reduction of the ATE estimates with TIPTW.

Sample size   Ratio   Relative bias   Percent bias reduction
500           1/10    0.251           78.978%
500           1/7     0.131           88.765%
500           1/4     0.106           91.021%
500           1/3     0.047           95.925%
500           1/2     0.015           98.687%
1000          1/10    0.147           87.260%
1000          1/7     0.113           90.152%
1000          1/4     0.086           92.821%
1000          1/3     0.035           96.921%
1000          1/2     0.047           96.060%
2000          1/10    0.186           84.718%
2000          1/7     0.112           90.395%
2000          1/4     0.063           94.586%
2000          1/3     0.054           95.322%
2000          1/2     0.005           100.416%

Note. Biases greater than 0.05 or less than −0.05 are unacceptable.
Table A-4. Relative bias and percent bias reduction of the ATE estimates with OFPSM.

Sample size   Ratio   Relative bias   Percent bias reduction
500           1/10    0.071           94.096%
500           1/7     0.002           99.870%
500           1/4     0.028           97.607%
500           1/3     0.004           100.316%
500           1/2     0.021           101.793%
1000          1/10    0.021           101.838%
1000          1/7     0.008           100.692%
1000          1/4     0.038           96.826%
1000          1/3     0.022           101.874%
1000          1/2     0.034           97.160%
2000          1/10    0.013           98.956%
2000          1/7     0.012           101.066%
2000          1/4     0.001           100.046%
2000          1/3     0.013           98.841%
2000          1/2     0.026           102.367%

Note. Biases greater than 0.05 or less than −0.05 are unacceptable.
Table A-5. Relative bias and percent bias reduction of the ATE estimates with PSS.

Sample size   Ratio   Relative bias   Percent bias reduction
500           1/10    0.164           86.258%
500           1/7     0.110           90.552%
500           1/4     0.159           86.536%
500           1/3     0.116           89.913%
500           1/2     0.109           90.532%
1000          1/10    0.100           91.359%
1000          1/7     0.118           89.742%
1000          1/4     0.143           88.049%
1000          1/3     0.108           90.661%
1000          1/2     0.139           88.319%
2000          1/10    0.146           88.025%
2000          1/7     0.116           90.102%
2000          1/4     0.122           89.562%
2000          1/3     0.128           89.002%
2000          1/2     0.088           92.169%

Note. Biases greater than 0.05 or less than −0.05 are unacceptable.
APPENDIX B
RELATIVE BIAS OF STANDARD ERRORS OF THE ATE ESTIMATES

Table B-1. Relative bias of standard error estimates in the baseline.

Sample size   Ratio   Empirical SD   WLS     TSL     JK
500           1/10    28.903         0.423   0.042   0.026
500           1/7     24.435         0.318   0.040   0.030
500           1/4     19.107         0.127   0.002   0.003
500           1/3     16.873         0.011   0.045   0.048
500           1/2     17.308         0.034   0.034   0.032
1000          1/10    19.449         0.393   0.005   0.002
1000          1/7     17.153         0.312   0.029   0.025
1000          1/4     13.310         0.111   0.018   0.020
1000          1/3     12.244         0.035   0.021   0.023
1000          1/2     11.835         0.002   0.002   0.001
2000          1/10    13.853         0.398   0.011   0.007
2000          1/7     11.893         0.297   0.005   0.002
2000          1/4     9.553          0.125   0.002   0.003
2000          1/3     8.842          0.054   0.001   0.001
2000          1/2     8.535          0.021   0.021   0.020

Note. WLS = weighted least squares; TSL = Taylor series linearization; JK = jackknife. Relative biases greater than 0.1 or less than −0.1 are unacceptable.
Table B-2. Relative bias of standard error estimates with IPTW.

Sample size   Ratio   Empirical SD   WLS     TSL     JK
500           1/10    28.118         0.407   0.058   0.086
500           1/7     22.728         0.267   0.089   0.105
500           1/4     17.162         0.026   0.154   0.161
500           1/3     15.646         0.068   0.160   0.164
500           1/2     15.641         0.071   0.095   0.097
1000          1/10    18.392         0.358   0.113   0.125
1000          1/7     16.315         0.276   0.068   0.075
1000          1/4     12.167         0.026   0.150   0.154
1000          1/3     11.345         0.043   0.131   0.133
1000          1/2     10.786         0.098   0.119   0.120
2000          1/10    13.200         0.367   0.092   0.098
2000          1/7     11.158         0.250   0.109   0.112
2000          1/4     8.698          0.037   0.134   0.136
2000          1/3     8.132          0.030   0.114   0.115
2000          1/2     7.612          0.100   0.119   0.120

Note. WLS = weighted least squares; TSL = Taylor series linearization; JK = jackknife. Relative biases greater than 0.1 or less than −0.1 are unacceptable.
Table B-3. Relative bias of standard error estimates with TIPTW.

Sample size   Ratio   Empirical SD   WLS     TSL     JK
500           1/10    27.170         0.387   0.060   0.083
500           1/7     22.404         0.258   0.082   0.095
500           1/4     17.013         0.019   0.150   0.157
500           1/3     15.618         0.069   0.154   0.158
500           1/2     15.671         0.068   0.088   0.091
1000          1/10    18.234         0.354   0.092   0.102
1000          1/7     16.102         0.268   0.062   0.068
1000          1/4     12.155         0.027   0.141   0.144
1000          1/3     11.309         0.046   0.127   0.129
1000          1/2     10.783         0.097   0.116   0.117
2000          1/10    13.031         0.361   0.078   0.083
2000          1/7     11.043         0.244   0.100   0.103
2000          1/4     8.680          0.037   0.126   0.128
2000          1/3     8.111          0.032   0.111   0.112
2000          1/2     7.609          0.100   0.117   0.117

Note. WLS = weighted least squares; TSL = Taylor series linearization; JK = jackknife. Relative biases greater than 0.1 or less than −0.1 are unacceptable.
Table B-4. Relative bias of standard error estimates with PSS.

Sample size   Ratio   Empirical SD   WLS     TSL     JK
500           1/10    29.261         0.463   0.046   0.080
500           1/7     23.210         0.324   0.082   0.100
500           1/4     17.329         0.093   0.142   0.149
500           1/3     15.811         0.009   0.146   0.150
500           1/2     15.880         0.011   0.075   0.077
1000          1/10    18.768         0.410   0.096   0.109
1000          1/7     16.238         0.320   0.076   0.084
1000          1/4     12.228         0.097   0.141   0.145
1000          1/3     11.433         0.036   0.118   0.120
1000          1/2     10.879         0.013   0.105   0.106
2000          1/10    13.169         0.410   0.093   0.098
2000          1/7     11.160         0.303   0.104   0.108
2000          1/4     8.698          0.107   0.129   0.130
2000          1/3     8.112          0.042   0.112   0.113
2000          1/2     7.639          0.015   0.110   0.111

Note. WLS = weighted least squares; TSL = Taylor series linearization; JK = jackknife. Relative biases greater than 0.1 or less than −0.1 are unacceptable.
Table B-5. Relative bias of standard error estimates with OFPSM.

Sample size   Ratio   Empirical SD   WLS     TSL     JK
500           1/10    28.527         0.522   0.075   0.087
500           1/7     24.278         0.439   0.087   0.093
500           1/4     18.574         0.247   0.171   0.173
500           1/3     17.407         0.183   0.154   0.156
500           1/2     17.179         0.150   0.104   0.105
1000          1/10    19.932         0.531   0.122   0.126
1000          1/7     17.622         0.466   0.098   0.101
1000          1/4     13.499         0.282   0.164   0.165
1000          1/3     12.308         0.191   0.162   0.163
1000          1/2     11.662         0.116   0.136   0.136
2000          1/10    14.687         0.568   0.144   0.146
2000          1/7     12.419         0.483   0.165   0.166
2000          1/4     9.687          0.309   0.178   0.179
2000          1/3     8.769          0.207   0.169   0.169
2000          1/2     8.006          0.084   0.151   0.151

Note. WLS = weighted least squares; TSL = Taylor series linearization; JK = jackknife. Relative biases greater than 0.1 or less than −0.1 are unacceptable.
APPENDIX C
EMPIRICAL COVERAGE AND POWER TABLES

Table C-1. Empirical coverage rates of 95% confidence intervals across 1,000 simulated datasets.

Sample size   Ratio   IPTW    TIPTW   OFPSM   PSS
500           1/10    0.959   0.958   0.967   0.951
500           1/7     0.953   0.954   0.962   0.957
500           1/4     0.975   0.976   0.980   0.975
500           1/3     0.979   0.975   0.983   0.970
500           1/2     0.968   0.971   0.971   0.962
1000          1/10    0.969   0.956   0.975   0.961
1000          1/7     0.961   0.957   0.963   0.957
1000          1/4     0.977   0.975   0.969   0.976
1000          1/3     0.977   0.973   0.979   0.971
1000          1/2     0.971   0.970   0.974   0.961
2000          1/10    0.972   0.954   0.973   0.961
2000          1/7     0.962   0.962   0.977   0.956
2000          1/4     0.975   0.971   0.980   0.968
2000          1/3     0.977   0.972   0.981   0.966
2000          1/2     0.969   0.970   0.971   0.966

Note. Standard errors were estimated with Taylor series linearization.
Table C-2. Proportion of estimated ATEs significant at the α = .05 level across 1,000 simulated datasets.

Sample size   Ratio   IPTW    TIPTW   OFPSM   PSS
500           1/10    0.110   0.132   0.085   0.118
500           1/7     0.110   0.139   0.098   0.127
500           1/4     0.153   0.173   0.123   0.187
500           1/3     0.165   0.179   0.134   0.192
500           1/2     0.190   0.203   0.161   0.248
1000          1/10    0.135   0.188   0.118   0.169
1000          1/7     0.200   0.240   0.150   0.229
1000          1/4     0.280   0.317   0.230   0.354
1000          1/3     0.314   0.348   0.250   0.393
1000          1/2     0.370   0.392   0.319   0.470
2000          1/10    0.280   0.365   0.194   0.339
2000          1/7     0.357   0.445   0.243   0.440
2000          1/4     0.530   0.585   0.404   0.636
2000          1/3     0.616   0.651   0.506   0.713
2000          1/2     0.635   0.664   0.579   0.759

Note. Standard errors were estimated with Taylor series linearization.
56 REFERENCES Ab a die, A & I mbens, G W ( 2006 ) L a r g e S a mp l e pro pe rties o f m a tchi n g e st i mato r s for a v e rage tr ea t m e nt e f f e c ts. E c on o m e tr i c a 74 235 2667. Ab a die, A & I mbens, G W ( 2008 ) On the failure of the bootstrap for matching estimators. Econome trica, 76, 1537 1557. Arpino, B. & Mealli, F. (2011). The specification of the propensity score in multilevel observational studies. Computational Statistics & Data Analysis, 55, 1770 1780. Austin, P C. (2009 a ) The re lative a bi l i t y o f di f ferent prop e nsi t y sc o re met h ods to b a lan c e me a sur e d c ov a ri a t e s be t w ee n tr e a ted a nd un t r e a t e d subj ec ts i n obs e rv a t i o n a l s t udies. Me dical D e c is i on M a k i n g 29 661 677. Austin, P. C. (2009b). Type I error rates, coverage of confidence intervals, and variance estimation in pr opensity score matched analyses. The International Journal of Biostatistics. 5(1), Art. 13. Austin, P. C. (2010a). The performance of different propensity score methods for estimating differences in proportions (risk differences or absolute risk reductions ) in observational studies. Statistics in Medicine, 29, 2137 2148. Austin, P. C. (2010b). Statistical criteria for selecting the optimal number of untreated subjects matched to each treated subject when using many to one matching on propensity score. Pract ice of Epidemiology, 172(9), 1092 1097. Bembom, O., & van der Laan M. J. (2008). Data adaptive selection of the truncation level for inverse probability of treatment weighted estimators. U.C. Berkeley Division of Biostatistics Working Paper Series. Paper 23 0. Cepeda M. S., Boston, R., Farrar, J. T., & Strom, B. L., (2003). Optimal matching with a variable number of controls vs. a fixed number of controls for a cohrot study: trade offs. Journal of Clinical Epidemiology, 56, 230 237. Cochran, W. G. (1968). 
The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24, 295-313.
Cochran, W. G., & Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhya: The Indian Journal of Statistics, Series A, 35(4), 417-446.
Cohen, J. (1973). Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educational and Psychological Measurement, 33, 107-112.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.
Freedman, D. A., & Berk, R. A. (2008). Weighting regressions by propensity scores. Evaluation Review, 32(4), 392-409.
Gu, X. S., & Rosenbaum, P. R. (1993). Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 4, 405-420.
Guo, S., & Fraser, M. W. (2010). Propensity score analysis: Statistical methods and applications. Thousand Oaks, CA: Sage.
Hansen, B. B., & Fredrickson, M. (2009). optmatch: Functions for optimal matching [R package]. Variable ratio, optimal, and full matching; can also be implemented through MatchIt.
Harder, V. S., Stuart, E. A., & Anthony, J. C. (2010). Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychological Methods, 15(3), 234-249.
Heckman, J. J. (1978). Dummy endogenous variables in a simultaneous equations system. Econometrica, 47, 931-960.
Heckman, J. J., Ichimura, H., & Todd, P. E. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. Review of Economic Studies, 65, 261-294.
Hernan, M. A., Hernandez-Diaz, S., & Robins, J. M. (2004). A structural approach to selection bias. Epidemiology, 82, 387-394.
Ho, D., Imai, K., King, G., & Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15(3), 199-236.
Hogan, J. W. (2004). Instrumental variables and inverse probability weighting for causal inference from longitudinal observational studies. Statistical Methods in Medical Research, 13, 17-48.
Hoogland, J. J., & Boomsma, A. (1998). Robustness studies in covariance structure modeling: An overview and meta-analysis. Sociological Methods & Research, 26, 523-539.
Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663-685.
Hsieh, F. Y., Bloch, D. A., & Larsen, M. D. (1998). A simple method of sample size calculation for linear and logistic regression. Statistics in Medicine, 17, 1623-1634.
Lechner, M. (2008). A note on the common support problem in applied evaluation studies. Annals of Economics and Statistics (Econometric Evaluation of Public Policies: Methods and Applications), 91/92, 217-235.
Lohr, S. L. (1999). Sampling: Design and analysis. Pacific Grove, CA: Duxbury Press.
Lumley, T. (2011). R package version 3.62.1.
Lunceford, J. K., & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine, 23, 2937-2960.
McCaffrey, D. F., Ridgeway, G., & Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9, 403-425.
McKelvey, R. D., & Zavoina, W. (1975). A statistical model for the analysis of ordinal level dependent variables. Journal of Mathematical Sociology, 4, 103-120.
National Center for Education Statistics. (2010). School survey on crime and safety. Retrieved June 1, 2011, from http://nces.ed.gov/surveys/ssocs
Neugebauer, R., & van der Laan, M. (2005). Why prefer double robust estimates in causal inference? Journal of Statistical Planning and Inference, 129, 405-426.
Petersen, M. L., Wang, Y., van der Laan, M. J., & Bangsberg, D. R. (2006). Assessing the effectiveness of antiretroviral adherence interventions using marginal structural models to replicate findings of randomized controlled trials. Journal of Acquired Immune Deficiency Syndromes, 43, 96-103.
R Development Core Team. (2011). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org
Robins, J. M., Hernan, M. A., & Brumback, B. (2000). Marginal structural models and causal inference in epidemiology. Epidemiology, 11, 550-560.
Rosenbaum, P. R. (1989). Optimal matching for observational studies. Journal of the American Statistical Association, 84(408), 1024-1032.
Rosenbaum, P. R. (1991). A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society, Series B, 53, 597-610.
Rosenbaum, P. R. (2010). Design of observational studies. New York: Springer.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.
Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516-524.
Rodgers, J. L. (1999). The bootstrap, the jackknife, and the randomization test: A sampling taxonomy. Multivariate Behavioral Research, 34, 441-456.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688-701.
Rubin, D. B. (2007). Statistical inference for causal effects, with emphasis on applications in epidemiology and medical statistics. 27, 28-63.
Schafer, J. L., & Kang, J. (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13(4), 279-313.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Stapleton, L. (2008). Analysis of data from complex surveys. In E. D. de Leeuw, J. J. Hox, & D. A. Dillman (Eds.), International handbook of survey methodology. New York: Psychology Press.
Steiner, P. M., Cook, T. D., Shadish, W. R., & Clark, M. H. (2010). The importance of covariate selection in controlling for selection bias in observational studies. Psychological Methods, 15(3), 250-276.
Strayhorn, T. L. (2009). Accessing and analyzing national databases. In T. J. Kowalski & T. J. Lasley II (Eds.), Handbook of data-based decision making in education (pp. 105-122). New York, NY: Routledge.
Strumer, T., Rothman, K. J., Avorn, J., & Glynn, R. J. (2010). Treatment effects in the presence of unmeasured confounding: Dealing with observations in the tails of the propensity score distribution: A simulation study. Practice of Epidemiology, 172(7), 842-854.
Stuart, E. A. (2010). Matching methods for causal inference: A review and look forward. Statistical Science, 25(1), 1-21.
Thoemmes, F. J., & Kim, E. S. (2011). A systematic review of propensity score methods in the social sciences. Multivariate Behavioral Research, 46, 90-118.
Thoemmes, F., & West, S. (2011). The use of propensity scores for nonrandomized designs with clustered data. Multivariate Behavioral Research, 46, 514-543.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer.
Wang, Y., Petersen, M. L., Bangsberg, D., & van der Laan, M. J. (2006). Diagnosing bias in the inverse probability of treatment weighted estimator resulting from violation of experimental treatment assignment. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 211.
Winship, C., & Morgan, S. (1999). The estimation of causal effects from observational data. Annual Review of Sociology, 25, 659-706.
Wolter, K. M. (2007). Introduction to variance estimation. New York: Springer.
BIOGRAPHICAL SKETCH
Sungur Gurel was born in Osmaniye, Turkey. He received his B.A. in mathematics teaching from Istanbul University in 2003 and then served the Turkish government as a mathematics teacher for 20 months. After qualifying for a scholarship to study abroad, he enrolled in fall 2010 for graduate studies in the School of Human Development and Organizational Studies at the College of Education at the University of Florida. In April 2012 he received an Honorable Mention award from the Educational Statisticians Special Interest Group of the American Educational Research Association. He will receive his M.A.E. in the Research and Evaluation Methodology program in August 2012.