THE PERFORMANCE OF GENETIC MATCHING TO REDUCE SELECTION BIAS IN OBSERVATIONAL STUDIES: A MONTE CARLO SIMULATION STUDY

By SEYFULLAH TINGIR

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN EDUCATION

UNIVERSITY OF FLORIDA 2013

© 2013 Seyfullah Tingir

To my family

ACKNOWLEDGMENTS

First of all, I would like to express my deepest gratitude to my advisor, Dr. Walter Leite, for his excellent guidance, caring, and patience, and for providing me with an excellent atmosphere for doing research. I would like to thank Dr. David Miller for guiding me in my thesis. I give thanks to the faculty and students of the research and evaluation methodology program in the School of Human Development and Organizational Studies in Education.

I would like to thank my parents and sister for being next to me whenever I needed them. They were always supporting me and encouraging me with their best wishes. I would like to thank Dr. Yakup Keskin for trusting me. I would like to thank my fiancée, Seyma Intepe, for her generous support and for being highly understanding and caring. She was always there cheering me up and stood by me through the good times and bad. I would also like to extend my thanks to my new family, the Intepe family, for their full support and encouragement in finishing my MAE. I would also like to thank the Turkish Government for the financial support.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION

2 THEORETICAL FRAMEWORK
    Neyman
    Propensity Score Methods
    Matching Methods
        Distance Measures
        Mahalanobis Distance Matching (MDM)
        Propensity Score Matching
        Number of Matches
            1-to-1 matching
            1-to-k matching
            1-to-many matching
            Full matching
        Matching Algorithms
            Nearest neighbor greedy matching
            Radius matching
            Caliper matching
            Optimal matching
            Optimal full propensity score matching (OFPSM)
            Genetic matching (GM)
            Genetic matching implementation in R
    Estimation of Weights and Standard Errors
    Comparison of Matching Methods

3 METHOD
    Data Simulation
    Analysis

4 RESULTS

5 DISCUSSION

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
3-1 The coefficients of the population parameters for 4 covariates
3-2 The values of the residual variances at the population level
3-3 The manipulated conditions
4-1 ANOVA and effect size results of percentage of unbalanced covariates
4-2 Percentages of unbalanced covariates for method by Pseudo R-squared interaction
4-3 Percentages of unbalanced covariates for method by ratio interaction
4-4 Percentages of treated observations with no common support of propensity score distributions
4-5 ANOVA and effect size results of ATT estimates
4-6 Relative bias and percent bias reductions of ATT bias estimates for method by Pseudo R-squared interaction
4-7 ANOVA and effect size results for standard errors of ATT estimates
4-8 Relative bias in standard error of ATT estimates for method by Pseudo R-squared interaction
4-9 Relative bias of standard errors of ATT bias estimates for method by R-squared interaction
4-10 ANOVA and effect size results of power
4-11 by R-squared interaction in the methods
LIST OF FIGURES

Figure
4-1 Percentages of unbalanced covariates for method by Pseudo R-squared interaction.
4-2 Percentages of unbalanced covariates for method by ratio interaction.
4-3 Relative bias of ATT estimates for methods under the two different Pseudo R-squared conditions.
4-4 Relative bias of standard errors of the ATT for Pseudo R-squared levels.
4-5 Relative bias of standard errors of the ATT for R-squared levels.

LIST OF ABBREVIATIONS

ANOVA   Analysis of Variance
ATC     Average Treatment Effect on Controls
ATE     Average Treatment Effect
ATT     Average Treatment Effect on Treated
CAUPS   Covariate Adjustment Using Propensity Score
EPBR    Equal Percent Bias Reduction
GM      Genetic Matching
IPTW    Inverse Probability of Treatment Weighting
MDM     Mahalanobis Distance Matching
OFPSM   Optimal Full Propensity Score Matching
PSM     Propensity Score Matching
PSS     Propensity Score Stratification
RCT     Randomized Controlled Trials
RMSE    Root Mean Squared Error
SUTVA   Stable Unit Treatment Values Assumption
TSL     Taylor Series Linearization

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree Master of Arts in Education

THE PERFORMANCE OF GENETIC MATCHING TO REDUCE SELECTION BIAS IN OBSERVATIONAL STUDIES: A MONTE CARLO SIMULATION STUDY

By Seyfullah Tingir

August 2013

Chair: Walter Leite
Major: Research and Evaluation Methodology

In observational studies, differences in the distributions of covariates between treatment and control groups produce biased treatment effect estimates if these covariates are related to both treatment assignment and the outcome. In randomized designs, it is assumed that the covariates
for both the control and treatment groups are the same. However, this is a problem in nonrandomized designs. Genetic Matching was introduced as a solution to this problem in nonrandomized studies (Sekhon & Mebane, 1998). According to previous studies, Genetic Matching helps to improve balance on covariates (Sekhon & Grieve, 2012; Diamond & Sekhon, 2012). The purpose of this study was to evaluate how well Genetic Matching performs. A Monte Carlo simulation study was conducted by manipulating sample size, ratio of treated to total sample size, number of covariates, the magnitude of the relationship between covariates and treatment assignment (measured by pseudo R-squared), and the magnitude of the relationship between covariates and the outcome (measured by R-squared). The methods evaluated were a baseline model that ignores selection bias, optimal full propensity score matching (OFPSM), and Genetic Matching with four different strategies. These methods were compared in terms of covariate balance, Average Treatment Effect on Treated (ATT) estimates, and standard error estimates based on Taylor series linearization (TSL).

The results showed that 1-to-many GM performs better than OFPSM. 1-to-1 GM had the worst performance in terms of relative bias of the ATT estimates and of the standard errors of the ATT estimates. Increasing the sample size improved both power and covariate balance. Increasing the pseudo R-squared increased the bias of the ATT estimates and reduced covariate balance, but improved the standard error estimates. However, increasing the R-squared level produced biased standard errors of the ATT estimates.

CHAPTER 1
INTRODUCTION

Recently, the use of propensity score methods in observational studies has increased dramatically. The propensity score is the probability of being treated given a set of covariates (Rosenbaum & Rubin, 1983).
The propensity score is a balancing score that assures that the distribution of observed covariates will be similar between treated and control units (Austin, 2011). Because the treated group is selected at random, the Average Treatment Effect (ATE), which is the difference in outcome means, will be unbiased in Randomized Controlled Trials (RCT). Another commonly used estimate is the Average Treatment Effect on Treated (ATT) (Austin, 2011), which is the average treatment effect for the persons who have been treated. The choice between ATE and ATT should be based on the research goals and the accessibility of samples.

Selection bias is the most common problem in quasi-experimental and observational studies. Selection bias originates from a nonrandom treatment selection mechanism, such as self-selection, measurement selection, or researcher selection (Gu & Fraser, 2010). Gu and Fraser (2010) also state that selection bias is a systematic error that violates validity, while random errors violate reliability. Thus, lack of randomization is a major concern for such studies. When selection is an issue in observational studies, nonignorable treatment assignment appears as a problem.

Propensity score methods were introduced as a way to correct the effects of selection bias in quasi-experimental designs, which lack random assignment. The unbalanced covariate distributions of treated and untreated groups that can result from the lack of random assignment produce biased outcomes in observational studies. In randomized designs, the covariate distributions are assumed to be balanced across the control and treatment groups. Genetic Matching was introduced as a solution to this problem (Sekhon & Mebane, 1998). According to previous studies, Genetic Matching helps to improve balance on covariates (Sekhon & Grieve, 2012; Sekhon & Grieve, 2008; Diamond & Sekhon, 2012).
Genetic Matching (GM) consists of finding the best match for a treated unit among the untreated control units via an iterative genetic algorithm. Many studies comparing matching methods have been conducted (Gu & Rosenbaum, 1993; Rosenbaum, 1989; Rosenbaum, 2002; Steiner & Cook, 2013; Zhao, 2004; Sekhon & Grieve, 2009), and Sekhon and Grieve (2009) compared the performance of Genetic Matching and propensity score matching. This thesis extends their study by comparing Genetic Matching and optimal full propensity score matching under different conditions: sample size, ratio of treated to total sample size, magnitude of the effect of covariates on treatment assignment (pseudo R-squared), and magnitude of the effect of covariates on the outcome (R-squared).

CHAPTER 2
THEORETICAL FRAMEWORK

Neyman

A counterfactual is also known as a potential outcome. The first counterfactual is the potential outcome for a treated unit under the control condition. The second counterfactual is the potential outcome for a control unit under the treatment condition (Gu & Fraser, 2010). The causal effect is defined as the difference between an observed outcome and its counterfactual. The observed outcome is:

$$Y_i = T_i Y_{i1} + (1 - T_i) Y_{i0} \qquad (2\text{-}1)$$

where $T_i$ is the treatment indicator (1 if unit $i$ is treated, 0 otherwise), $Y_{i1}$ is the potential outcome for unit $i$ when treated, and $Y_{i0}$ is the outcome of the same unit $i$ when untreated. In addition, the treatment effect for unit $i$ is denoted as $\tau_i = Y_{i1} - Y_{i0}$.

The Average Treatment Effect (ATE) is the effect of moving the entire population from untreated to treated (Austin, 2011) and can be calculated by:

$$\mathrm{ATE} = E(Y_{i1} - Y_{i0}) = E(Y_{i1}) - E(Y_{i0}) \qquad (2\text{-}2)$$

Another common quantity of interest, the Average Treatment Effect for the Treated (ATT), is the treatment effect for the units who take the treatment (Austin, 2011):

$$\mathrm{ATT} = E(Y_{i1} - Y_{i0} \mid T_i = 1) \qquad (2\text{-}3)$$

While the ATT is the effect for the treated units only, the ATE is the effect across both the treated and control groups.
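The distinction between the observed outcome, the ATE, and the ATT can be made concrete with a small numerical sketch. The four units and their potential outcomes below are made up purely for illustration; in real data only one potential outcome per unit is ever observed, so the quantities computed here are not available to the analyst.

```python
# Toy illustration of the observed outcome, ATE, and ATT.
# The potential outcomes are hypothetical values chosen for this sketch.

def mean(values):
    return sum(values) / len(values)

# Each unit: (treatment indicator T, outcome if treated Y1, outcome if untreated Y0)
units = [
    (1, 12.0, 8.0),
    (1, 10.0, 7.0),
    (0, 9.0, 8.0),
    (0, 6.0, 6.0),
]

# Observed outcome: Y1 for treated units, Y0 for untreated units.
observed = [t * y1 + (1 - t) * y0 for t, y1, y0 in units]

# ATE averages the unit-level effects over everyone;
# ATT averages them over the treated units only.
ate = mean([y1 - y0 for _, y1, y0 in units])
att = mean([y1 - y0 for t, y1, y0 in units if t == 1])

# Naive comparison of observed group means, which is all that raw data offer.
naive = (mean([y for y, (t, _, _) in zip(observed, units) if t == 1])
         - mean([y for y, (t, _, _) in zip(observed, units) if t == 0]))

print(observed, ate, att, naive)
```

In this toy sample the treated units happen to benefit more than the untreated units would have, so the ATT exceeds the ATE, and the naive difference in observed means matches neither.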
Thus, when deciding which to use, the accessibility of the treatment and control groups should be taken into consideration. In order to identify the ATE, strong ignorability of treatment assignment is needed, which means that the treatment assignment is independent of the potential outcome distributions conditional on covariates. Therefore, adding the covariates Z to the model, according to Rubin (1974, 1977):

$$E(Y_{ij} \mid Z_i, T_i = 1) = E(Y_{ij} \mid Z_i, T_i = 0) = E(Y_i \mid Z_i, T_i = j) \qquad (2\text{-}4)$$

where j = 0, 1. Then the ATT can be estimated as:

$$\mathrm{ATT} = E\{E(Y_i \mid Z_i, T_i = 1) - E(Y_i \mid Z_i, T_i = 0) \mid T_i = 1\} \qquad (2\text{-}5)$$

Both the control and treatment groups have potential outcomes (Rubin, 1974). In randomized studies, an unbiased estimate of the ATE can be obtained from the difference between the treatment and control groups because of randomization. In other words, the difference in potential outcomes can be estimated by the difference in mean outcomes. In observational studies, where the lack of randomization is the basic problem, the treatment effect cannot be calculated from the simple difference between the treatment and control groups. The distributions of the two groups may differ systematically, so the assumption that the treatment and control groups are equivalent cannot be made.

In experimental designs, randomization guarantees that the treatment condition is not confounded with unobserved characteristics. Thus, comparing treated and untreated outcomes gives the treatment effect directly (Greenland, Pearl, & Robins, 1999). In observational studies, subject characteristics influence treatment selection, so the baseline characteristics differ between the treatment and control groups. Previously, regression adjustment has been used to account for these differences (Austin, 2011). Rosenbaum and Rubin (1983) introduced propensity score methods to solve this problem, subject to conditions based on observed covariates.
That is, the treatment selection depends on the covariates Z in observational studies, while the treatment assignment is independent of the potential outcomes given the observed covariates in experimental designs. Propensity score methods allow us to reproduce some features of randomized controlled trials and experimental designs in observational studies. If subjects have the same propensity score, the distribution of observed baseline covariates will be similar between the treated and untreated units. In this manner, the outcomes of treated and control units can be compared to estimate the treatment effect. Although there are other methods, the propensity score is most commonly estimated using logistic regression, in which the treatment condition is regressed on the observed baseline covariates (Austin, 2011).

Many different covariates may affect the treatment outcome. Strong ignorability of treatment assignment is an assumption that allows unbiased comparison of outcomes for treated and control units in observational studies. In Randomized Controlled Trials (RCT), the treatment assignment T and the response $(Y_{i1}, Y_{i0})$ are conditionally independent given Z:

$$(Y_{i1}, Y_{i0}) \perp T_i \mid Z_i \qquad (2\text{-}6)$$

However, the situation is different in observational studies. Because not all participants have an equal opportunity to be in the treatment group, treatment assignment will be strongly ignorable only if

$$(Y_{i1}, Y_{i0}) \perp T_i \mid Z_i \quad \text{and} \quad 0 < \Pr(T_i = 1 \mid Z_i) < 1 \qquad (2\text{-}7)$$

When these assumptions hold, the treatment assignment is strongly ignorable (Rosenbaum & Rubin, 1983). Basically, the assumption states that the assignment of participants to the control and treatment groups is independent of the potential outcomes.

Another important assumption is the stable unit treatment values assumption (SUTVA): the outcome value for an individual who is treated or untreated will be the same regardless of the treatment status of other individuals. More generally, the outcome of unit $i$ could depend on the treatment conditions of all $n$ units, $Y_i(t_1, \dots, t_n)$.
SUTVA restricts the outcome of person $i$ to be affected only by that person's own condition $t_i$: $Y_i(t_1, \dots, t_n) = Y_i(t_i)$. In RCTs, the conditions for the two aforementioned assumptions are met, but they are frequently violated in quasi-experimental and observational designs because of the lack of random assignment. A SUTVA violation makes causal inference more problematic. If there are multiple versions of the treatment, an individual may have multiple potential outcomes, which is also a SUTVA violation. This situation makes estimates of the causal effect unstable. Differences in potential outcomes may arise from the measurement aspect of the treatment (first part of SUTVA) or from treatment received by others (second part of SUTVA) (Schwartz, Gatto, & Campbell, 2012).

Propensity Score Methods

Propensity Score Matching (PSM) requires that the treated and untreated subjects be matched on the closeness of their propensity score values. Various kinds of matching methods (e.g., greedy matching, full matching, genetic matching) have been used. Once outcomes are estimated, the treatment effect can be obtained by comparing the treated and untreated outcomes. Different kinds of propensity score methods are currently used in the social sciences, econometrics, and health sciences. Some of these are described below.

Propensity Score Stratification (PSS) creates strata based on the similarity of propensity scores between subjects. The propensity scores of the subjects are ranked, and the subjects are then placed into strata. A common number of strata is five. Increasing the number of strata generally results in reduced bias. Control and treatment units are compared within each stratum. It is assumed that the features of subjects in the same stratum are similar (Austin, 2009; Austin, 2011; Rosenbaum & Rubin, 1983).
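The stratification steps just described (rank the scores, cut them into strata, compare groups within each stratum) can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names, the toy data, and the choice to pool the within-stratum differences using stratum sizes as weights are all assumptions made for the example, and the propensity scores are taken as already estimated.

```python
# Minimal sketch of propensity score stratification on hypothetical data.

def mean(values):
    return sum(values) / len(values)

def assign_strata(scores, n_strata=5):
    """Rank units by propensity score and cut the ranking into equal-sized strata."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(scores)
    return strata

def stratified_effect(scores, treated, outcomes, n_strata=5):
    """Average the within-stratum treated-minus-control mean differences,
    weighting each stratum by its size (an assumption of this sketch)."""
    strata = assign_strata(scores, n_strata)
    total, used = 0.0, 0
    for s in range(n_strata):
        y_t = [y for y, t, k in zip(outcomes, treated, strata) if k == s and t == 1]
        y_c = [y for y, t, k in zip(outcomes, treated, strata) if k == s and t == 0]
        if not y_t or not y_c:
            continue  # a stratum lacking one of the groups cannot be compared
        total += (len(y_t) + len(y_c)) * (mean(y_t) - mean(y_c))
        used += len(y_t) + len(y_c)
    return total / used

# Toy data: 15 units; treatment is more common at higher propensity scores,
# the outcome rises with the stratum, and the true effect is 2 for everyone.
scores = [i / 16 for i in range(1, 16)]
treated = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
outcomes = [(i // 3) + 2.0 * t for i, t in enumerate(treated)]

stratified = stratified_effect(scores, treated, outcomes)
naive = (mean([y for y, t in zip(outcomes, treated) if t == 1])
         - mean([y for y, t in zip(outcomes, treated) if t == 0]))
print(stratified, naive)
```

In this constructed example the within-stratum comparison recovers the effect of 2, while the unadjusted difference in means is larger because treated units are concentrated in strata with higher outcomes.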
Inverse Probability of Treatment Weighting (IPTW) uses the propensity score to estimate weights that create a pseudo-sample in which the covariate distributions of treated and untreated individuals are balanced. IPTW is a system of survey sampling weights that leads the sample to represent a hypothetical population. Variance estimation that accounts for sampling weights should be used for weighted samples (Austin, 2011; Joffe et al., 2004).

With Covariate Adjustment Using Propensity Score (CAUPS), the treatment effect is estimated through logistic or linear regression, depending on the type of outcome. Treated and untreated subjects with the same propensity score will have similar baseline characteristics. The propensity score is used as a covariate in the regression model: the outcome is regressed on an indicator of treatment status and the estimated propensity score. This approach assumes that the relationship between the propensity score and the outcome (i.e., the regression model) has been correctly modeled (Austin, 2006; Austin, 2009; Austin, 2011).

Matching Methods

Matching is usually performed by defining a distance between the treated unit and the potential control units (Ming & Rosenbaum, 2001). One cannot assign the control units arbitrarily, so algorithms are used to minimize this distance. Matching is used mostly for nonrandomized data, in which the treatment variable is not assigned by the researcher, but it also works for experimental data. Many variants of matching are currently used. They are classified according to the distance measure, the number of matches, and the matching algorithm.

Distance Measures

When using matching methods to estimate causal effects, it is important to decide how to perform the matching procedure. Two commonly used approaches are the Mahalanobis distance and propensity scores.
Mahalanobis Distance Matching (MDM)

The Mahalanobis distance (MD) is a widely used distance measure for matching without a propensity score. Let $A_1$ and $A_2$ be samples from populations of treated and control units, respectively. The $A_1$ units are randomly ordered, and then the nearest control unit ($j$) is selected for the treated unit ($i$) with respect to MD. This step is repeated until all treated participants are matched. Formally, letting $X_1$ and $X_2$ be the covariate values of $i$ and $j$ from the samples, the distance is:

$$md(X_1, X_2) = \left\{ (X_1 - X_2)^{T} S^{-1} (X_1 - X_2) \right\}^{1/2} \qquad (2\text{-}8)$$

where $S^{-1}$ is the inverse of the sample covariance matrix of the matching variables (X), computed from the treated units and the full pool of untreated units. If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. Because MD does not perform well when covariates have non-ellipsoidal distributions, Rosenbaum and Rubin (1985) recommended that, in addition to propensity score matching, one should match on individual covariates by minimizing the MD of X to obtain balance on X.

MDM is used for estimating the bias reduction in observational studies. If $\mu_i$ is the mean of X in $A_i$, $i = 1, 2$, and $\mu_{2*}$ is the mean of X in the paired $G_2$, then $(\mu_1 - \mu_2) = k(\mu_1 - \mu_{2*})$, where $k$ indexes the percent reduction in bias (Rubin, 1980). Mahalanobis metric matching within a propensity score caliper can be obtained by following the same procedure with the propensity score as an additional covariate (Gu & Fraser, 2010).

Propensity Score Matching

Propensity Score Matching is an alternative method of conditioning on X that uses the probability of assignment to treatment given a vector of covariates that predict receipt of treatment. Matching on the propensity score involves matching each treated unit to the nearest untreated unit on the unidimensional metric of the propensity score. PSM helps to reduce bias due to confounding variables. In experimental designs, randomization provides unbiased estimates of the treatment effect.
Matching attempts to mimic this randomization in observational studies by creating a sample of treated units that is more comparable on all observed covariates to a sample of untreated units. The procedure is as follows. First, the propensity score is estimated for treated and untreated units. Second, each participant is matched to one or more nonparticipants, depending on the method. Last, multivariate analysis is conducted on the new sample (Rosenbaum & Rubin, 1983).

The main goal of estimating the propensity score is to reduce the dimensionality of conditioning. This makes it possible to condition on a scalar variable rather than in a general n-dimensional space. In observational studies, the propensity score is not known. The logit model for estimating the propensity score is:

$$\Pr(T_i = 1 \mid X_i) = \frac{\exp\{h(X_i)\}}{1 + \exp\{h(X_i)\}} \qquad (2\text{-}9)$$

where $T_i$ is the treatment indicator and $h(X_i)$ is made up of the linear and higher order terms of the covariates needed to obtain ignorable treatment assignment. Matching on the propensity score is essentially a weighting scheme that assigns weights to the comparison units. The average effect on the treated units is then:

$$\widehat{\mathrm{ATT}} = \frac{1}{|N|} \sum_{i \in N} \left( Y_i - \frac{1}{|J_i|} \sum_{j \in J_i} Y_j \right) \qquad (2\text{-}10)$$

where $|N|$ is the number of treated units in the treated sample ($N$), and $|J_i|$ is the number of comparison units in the set of comparison units ($J_i$) matched to treated unit $i$ (Dehejia & Wahba, 1998; Abadie & Imbens, 2009). When assigning matches between treated and untreated units, this method seeks the closest propensity scores, so that PSM mimics the characteristics of an experimental design.

Number of Matches

1-to-1 matching

In observational studies, the treatment and control units are matched on similar or close values of the covariates (X). When each treated unit has one control match, it is called 1-to-1 matching or pair matching. There is no need for a second or third match for each treatment unit, which minimizes the relative bias. As a drawback, 1-to-1 matching discards the unmatched control units (Steiner & Cook, 2013).
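The matching estimator of the average effect on the treated described above (each treated outcome minus the mean outcome of its comparison set, averaged over treated units) can be written in a few lines. This is a sketch under stated assumptions: the outcomes and matched sets are invented, the matching itself is taken as already done, and skipping treated units with an empty comparison set is a choice made for the example.

```python
# Sketch of the matching-based ATT estimate on hypothetical matched sets.

def att_from_matches(outcomes, matched_sets):
    """Average over treated units of (treated outcome - mean outcome of its comparison set).

    outcomes     : list of observed outcomes, indexed by unit
    matched_sets : dict mapping a treated unit's index to the indices of its comparison units
    """
    diffs = []
    for i, comparison in matched_sets.items():
        if not comparison:
            continue  # assumption: treated units without a match are dropped
        comparison_mean = sum(outcomes[j] for j in comparison) / len(comparison)
        diffs.append(outcomes[i] - comparison_mean)
    return sum(diffs) / len(diffs)

# Units 0 and 1 are treated; unit 0 has one match (pair matching),
# unit 1 has two comparison units.
outcomes = [10.0, 12.0, 7.0, 8.0, 9.0]
matched_sets = {0: [2], 1: [3, 4]}

att_hat = att_from_matches(outcomes, matched_sets)
print(att_hat)
```

With 1-to-1 matching every comparison set has a single member, so the inner mean is just the matched control's outcome.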
1-to-k matching

If each treated unit has k control units, then it is considered 1-to-k matching. Each treated unit is matched to k ≥ 1 control units, so it can have up to k matches. Both Propensity Score Matching and Mahalanobis Distance Matching concern the average differences in X within each match. If the distance is small, the matched units are more similar. More formally, if $d_i$ is the difference between the X of the treated and control units for 1-to-1 matching, then $d_i$ for 1-to-k matching is the difference between the X for the treated unit in matched set $i$ and the mean X for the k controls in matched set $i$. In addition, the average of all $d_i$ in the sample represents the imbalance in X in the matched sample. The distance is:

$$d_i = X_{t,i} - \bar{X}_{cm,i} \qquad (2\text{-}11)$$

where $\bar{X}_{cm,i}$ is the average of X for the matched control units in set $i$. Distance and imbalance are different. When the distance is large, the average of the $d_i$ ($\bar{d}$) can still be small (Gu & Rosenbaum, 1993). If k matches are enforced but fewer than k adequate matches are available for certain individuals, the quality of matching is reduced.

1-to-many matching

If each treatment unit is matched to several control units, 1-to-many matching or variable matching occurs. Bias removal is more successful with a variable number of controls than with a fixed number of control units (Ming & Rosenbaum, 2000). This kind of matching is useful if the number of control group units is much larger than the number of treatment units (Ming & Rosenbaum, 2001). The benefit of this method is that each treated unit has the opportunity to be paired with the best available untreated units if the number of treated units is much smaller than the number of untreated units (Ming & Rosenbaum, 2001; Rassen, Shelat, Myers, Glynn, & Rothman, 2012). The number of untreated units matched to each treated unit can be the same or can vary from one treated unit to another.
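The difference between distance and imbalance in the 1-to-k discussion above can be seen with a single covariate. The sketch below uses two invented matched sets; the variable names are illustrative only. For each set it computes the treated value minus the mean of its controls, and separately the average absolute distance between the treated unit and each of its controls.

```python
# Sketch: within-set differences, overall imbalance, and average distance
# for hypothetical 1-to-k matched sets on one covariate.

def mean(values):
    return sum(values) / len(values)

# Each matched set: (covariate value of the treated unit, values of its k = 2 controls)
matched_sets = [
    (5.0, [3.0, 7.0]),
    (4.0, [4.5, 5.5]),
]

# d_i: treated value minus the mean of the matched controls in set i
d = [x_t - mean(x_c) for x_t, x_c in matched_sets]

# Imbalance in the matched sample: the average of the d_i
imbalance = mean(d)

# Average absolute distance between the treated unit and each of its controls
avg_distance = [mean([abs(x_t - x) for x in x_c]) for x_t, x_c in matched_sets]

print(d, imbalance, avg_distance)
```

The first set shows how the two notions can diverge: its controls are each two units away from the treated unit, yet they sit symmetrically around it, so its contribution to imbalance is zero.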
Full matching

Rosenbaum (1991) introduced full matching, which forms a particular subclassification in an optimal manner. In full matching, each matched set contains either one treated unit with at least one control unit or one control unit with at least one treated unit. Full matching with a propensity score is preferable to the use of a Mahalanobis distance (Gu & Rosenbaum, 1993). The propensity score also handles the missing data problem without rejecting observations. Full matching either merges missing data with an appropriate level of the covariate or treats it as a category of its own. Thus, missing values become part of the profiles that determine the strata in which subjects are placed. The quality of the stratification also improves the handling of missing data (Hansen, 2004).

Matching Algorithms

Nearest neighbor greedy matching

With greedy matching, the best available control unit is selected for the treated unit without taking into account the overall quality of matching. This aspect of greedy matching is a drawback. Greedy matching can be performed using nearest neighbor propensity score matching with or without a caliper, Mahalanobis metric matching, or Mahalanobis metric matching within a propensity score caliper (Gu & Fraser, 2010). While trying to maximize the exact matches, this method may exclude some cases because of incomplete matches. Matching with replacement helps to solve this problem. Greedy matching minimizes the distance between treated and untreated matches (Austin, 2011). However, greedy matching rarely yields a globally optimal matching (Steiner & Cook, 2013).

Nearest neighbor matching iteratively finds the best match. Let $p_i$ and $p_j$ be the propensity scores for treated and untreated units, and let $A_1$ and $A_0$ be the samples of treated and untreated units, respectively.
Also, $i \in A_1$ and $j \in A_0$. Then the neighborhood $C(p_i)$ is the smallest absolute difference between the two propensity score values. Formally:

$$C(p_i) = \min_{j} \lvert p_i - p_j \rvert, \quad j \in A_0 \qquad (2\text{-}12)$$

When an untreated participant is matched to a treated participant, that untreated unit is removed from the control group (matching without replacement).

Radius matching

Dehejia and Wahba (1998) state that, in order to allow more than one best match, a tolerance level is set for all units so that more than one control unit can be selected. Thus, treated and untreated units with almost equal propensity scores are selected, and each treatment unit is matched to multiple control units within the radius. If a treated unit has no matches within this level, its nearest control is selected. Also, an untreated unit may be matched to multiple treated units. Using radius matching with additional control units has two effects: a decrease in the quality of matching and an increase in sample size. The larger sample may yield more precise estimates, but the poorer matches may introduce bias (Dehejia & Wahba, 1998).

Caliper matching

In caliper matching, there is a restriction on the difference between $p_i$ and $p_j$, and matches are formed according to this caliper. In other words, a match is allowed only if the absolute value of the propensity score difference falls within the specified caliper. After that, both the matched treated and control units are removed from further matching, and the next treated unit is selected for matching. This cycle continues iteratively (Gu & Fraser, 2010). If there are no matches for a treated unit within the caliper, the treated unit is excluded from matching. This is a drawback of caliper matching.

Optimal matching

The optimal matching algorithm places the treated and control units in the same subclass so that they are as close as possible in terms of the observed covariates (Rosenbaum, 1991).
The matched sets contain either one treated unit with multiple control units or one control unit with multiple treated units. This structure is more flexible than other matching structures. Optimal matching is performed using an expected distance ($\Delta$) between treated and control units:

$$\Delta = \sum_{s=1}^{S} \omega(\lvert A_s \rvert, \lvert B_s \rvert)\, \delta(A_s, B_s) \tag{2-13}$$

where $A_s$ and $B_s$ are the sets of treated and control units in subclass $s$, $\omega(\lvert A_s \rvert, \lvert B_s \rvert)$ is a weight based on the proportion of treated, control, and sample units in the subclass, and $\delta(A_s, B_s)$ is the distance between treated and control units in subclass $s$. This algorithm is based on network flow optimization, which finds the matches with minimal distance (Rosenbaum, 1991).

When comparing greedy and optimal matching, optimal matching performs as well as, and often better than, greedy matching. Rosenbaum (1999) found, using network flow theory, that greedy matching produced less than optimal results. In particular, there is no guarantee of a complete pair matching with greedy matching when a caliper is used, whereas optimal matching usually achieves a complete pair matching.

Optimal matching has the advantage of minimizing a global distance measure through network flow theory. Minimizing a global distance means that some treated units receive only their second-best or an even more distant untreated unit when their nearest neighbor must be paired with another treated unit whose second-best or more distant match would have been worse (Steiner & Cook, 2013).

Optimal full propensity score matching (OFPSM)

Using full matching makes it possible to estimate both the ATE and the ATT in optimal matching. Full matching is optimal in minimizing the distance between the treated and control units within each matched set (Stuart, 2010). Full matching uses all available untreated observations as control units. In OFPSM, a propensity score distance is additionally calculated. The propensity distance measure is defined as the difference between the propensity score of the treated unit and the propensity score of the control unit. In other words, treated and control units with similar propensity scores (PS) are matched.
Rosenbaum (1989) states that using the PS both increases covariate balance and speeds up the algorithm's search for the optimal matches.

Genetic matching (GM)

Genetic matching uses a genetic algorithm to find a weight for each covariate such that optimal balance is achieved after matching. The genetic algorithm minimizes a multivariate weighted distance in which the weights are chosen to maximize a measure of covariate balance. The basic idea of GM is to generalize the Mahalanobis metric by including an additional weight matrix, for cases in which the Mahalanobis distance (MD) is not optimal for the dataset. Formally:

$$d(X_i, X_j) = \left\{ (X_i - X_j)^{\top} \left(S^{-1/2}\right)^{\top} W\, S^{-1/2} (X_i - X_j) \right\}^{1/2} \tag{2-14}$$

where $W$ is a square weight matrix with rows and columns equal to the number of covariates in $X$, and $S^{1/2}$ is the Cholesky decomposition of $S$, the variance-covariance matrix of $X$ (Sekhon, 2011). GM eliminates the need to iteratively estimate and check propensity scores (Diamond & Sekhon, in press). Instead, the genetic algorithm checks the covariate balance iteratively. The matrix $W$ in Equation 2-14 holds the weight given to each matching variable, and all elements of $W$ off the main diagonal are restricted to zero. This evolutionary search algorithm was first introduced by Mebane and Sekhon (1998). Diamond and Sekhon (in press) state that GM can be used without propensity scores, but the propensity score can be added to the covariates.

The genetic algorithm starts from an initial value of $W$ and generates a population of candidate $W$ matrices whose number equals the population size. The algorithm then performs matching with each $W$ in the current generation, computes the loss for each matched sample, and retains the $W$ with the minimum loss. If the loss has not reached the desired level, the procedure is repeated with a new generation. Otherwise, the procedure ends by performing GM with the optimal weights (Diamond & Sekhon, in press).
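As an illustration of Equation 2-14, the following Python sketch computes the generalized Mahalanobis distance for a diagonal weight matrix. It is a minimal sketch under stated assumptions, not the implementation used by the Matching package: the function names (cholesky, forward_solve, gm_distance) are hypothetical, W is assumed to be diagonal and is passed as a list of weights, and S is assumed to be symmetric positive definite. With all weights equal to one, the result reduces to the ordinary Mahalanobis distance.

```python
import math


def cholesky(S):
    """Return lower-triangular L with S = L L^T (S symmetric positive definite)."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L z = b by forward substitution, i.e. z = L^{-1} b."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z


def gm_distance(xi, xj, S, w):
    """Weighted distance of Equation 2-14 with W = diag(w) (assumed diagonal).

    S^{-1/2} is taken as L^{-1}, where L is the Cholesky factor of S,
    so that w = [1, ..., 1] gives the ordinary Mahalanobis distance.
    """
    diff = [a - b for a, b in zip(xi, xj)]
    z = forward_solve(cholesky(S), diff)
    return math.sqrt(sum(wk * zk * zk for wk, zk in zip(w, z)))


if __name__ == "__main__":
    S = [[4.0, 0.0], [0.0, 1.0]]  # illustrative covariance matrix
    print(gm_distance([2.0, 0.0], [0.0, 0.0], S, [1.0, 1.0]))  # prints 1.0
    print(gm_distance([2.0, 0.0], [0.0, 0.0], S, [4.0, 1.0]))  # prints 2.0
```

In this sketch, raising the weight of the first covariate from 1 to 4 doubles the distance between the two units, which is how a larger weight makes a covariate count more heavily when matches are chosen.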
Genetic matching implementation in R

The Matching package (Sekhon, 2011) implements genetic matching as well as the commonly used greedy nearest neighbor matching. Within the Matching package, the GenMatch function uses the genetic algorithm to find the optimal balance. This algorithm finds the weight for each covariate and is controlled by several arguments. Here I will only mention the arguments that change the genetic matching implementation.

The pop.size argument determines the number of individuals that genoud, the underlying function that performs the optimization, uses. Thus, it is important to assign a large value to this argument.

The GenMatch function can run the genetic algorithm while optimizing several different criteria, which are specified in the fit.func argument: pvals maximizes the p values from the t tests and Kolmogorov-Smirnov tests; qqmean.mean calculates the mean standardized difference in the eQQ plot for each variable and minimizes the mean of these differences across variables; qqmean.max calculates the mean standardized difference in the eQQ plot for each variable and minimizes the maximum of these differences; qqmedian.mean calculates the median standardized difference in the eQQ plot for each variable and minimizes the mean of these differences; qqmedian.max calculates the median standardized difference in the eQQ plot for each variable and minimizes the maximum of these differences; qqmax.mean calculates the maximum standardized difference in the eQQ plot for each variable and minimizes the mean of these differences; and qqmax.max calculates the maximum standardized difference in the eQQ plot for each variable and minimizes the maximum of these differences. The qqmax.max criterion will be used in this research.

The estimand argument selects the type of estimate: the average treatment effect (ATE), the average treatment effect on the treated (ATT), or the average treatment effect on the controls (ATC). The ATT is estimated in this study by using this argument.
M is the scalar for the number of matches to be found; the default is 1-to-1 matching. The ties argument determines how ties are handled. If multiple control units are tied for a treated unit, ties=TRUE weights the matched data to reflect the multiple matches. If ties is set to FALSE, all ties are randomly broken. When ties are allowed, 1-to-many Genetic Matching is used; otherwise, 1-to-1 Genetic Matching is used. The replace argument determines whether matching is performed with or without replacement. In the simulation study, different conditions for ties and replace will be used. The caliper option sets the maximum distance allowed for a match. There may also be different conditions in which a caliper is used or not used (Sekhon & Mebane, 1998; Sekhon, 2011).

Estimation of Weights and Standard Errors

In 1-to-1 genetic matching, a weight of one is assigned to matched units and a weight of zero to unmatched units. Weight estimation in 1-to-many genetic matching with replacement and within a caliper is given by Equation 2-15, whose terms are: a treatment condition indicator (equal to one for treated cases and zero for untreated cases); the number of treated items that untreated item i is matched to; the number of untreated items matched to the same treated case as item i; the total number of matched items; and the total number of treated items (King, 2011).

Next, the weights for OFPSM are obtained with Equation 2-16, where j represents the stratum, t represents the treatment condition, N is the total sample size, J is the number of strata, and the remaining term is the total number of treated units in stratum j.

The procedure of Taylor series linearization (TSL) provides an approximate formula for variance estimation via a linear function of random variables (Mendoza, 1982).
Taylor Series Linearization (TSL) standard errors were estimated for the models using Equation 2-17, whose terms are the intercept, the slope (which is the ATT), the weight, and the mean of x for observation i (Lohr, 1999).

Comparison of Matching Methods

Matching is a commonly used alternative to stratification, weighting, or regression adjustment in the social sciences, economics, medicine, and political science (Thoemmes & Kim, 2011). As described above, many different types of matching methods (e.g., greedy, optimal, kernel) have been used. Gu and Rosenbaum (1993) found that optimal matching performed better than greedy matching with regard to producing close matches. They also found that full matching was a better choice than 1-to-k matching in terms of covariate balance. The main disadvantage of greedy matching is that its result depends on the order in which the treated subjects are matched, which can change the quality of matching. Optimal matching, on the other hand, avoids this issue by considering the entire set of matches (Rosenbaum, 2002). Furthermore, Steiner and Cook (2013) stated that optimal matching performed better on average. Zhao (2004) showed that propensity score matching outperformed Mahalanobis distance matching when the correlations among the covariates and the sample size are high. In addition, full matching with a propensity score is preferred to full matching with a Mahalanobis metric (Gu & Rosenbaum, 1993).

Thus, optimal full propensity score matching is used as the comparison method for the simulated data in this research, with the propensity score as the distance measure. If greedy matching were used, some treated units would be prevented from matching to some control units. In other words, a complete matching may not occur when using greedy matching. Optimal matching solves this problem by finding a complete pair matching (Rosenbaum, 1989).
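The contrast between greedy and optimal pair matching described above can be illustrated with a small Python sketch. It is a toy example under stated assumptions, not the network flow algorithm of Rosenbaum (1989): the function names are hypothetical, the propensity scores are made up, the greedy routine processes treated units in the order given, and the "optimal" routine simply enumerates every complete pairing, which is only feasible for a handful of units and requires at least as many control units as treated units.

```python
from itertools import permutations


def greedy_match(treated, control, caliper=None):
    """1-to-1 nearest neighbor matching without replacement.

    Treated units are processed in the order given; each takes the closest
    control unit still available. Returns (treated_index, control_index) pairs.
    """
    available = set(range(len(control)))
    pairs = []
    for i, p in enumerate(treated):
        if not available:
            break
        j = min(available, key=lambda k: (abs(p - control[k]), k))
        if caliper is not None and abs(p - control[j]) > caliper:
            continue  # no control within the caliper: unit i stays unmatched
        pairs.append((i, j))
        available.remove(j)
    return pairs


def optimal_match(treated, control):
    """Complete pair matching with minimal total distance (brute force)."""
    best = min(
        permutations(range(len(control)), len(treated)),
        key=lambda perm: sum(abs(p - control[j]) for p, j in zip(treated, perm)),
    )
    return list(enumerate(best))


def total_distance(pairs, treated, control):
    """Sum of absolute propensity score differences over matched pairs."""
    return sum(abs(treated[i] - control[j]) for i, j in pairs)


if __name__ == "__main__":
    treated = [0.55, 0.60]  # hypothetical propensity scores
    control = [0.58, 0.40]
    g = greedy_match(treated, control)
    o = optimal_match(treated, control)
    print(g, round(total_distance(g, treated, control), 2))
    print(o, round(total_distance(o, treated, control), 2))
    print(greedy_match(treated, control, caliper=0.1))
```

In this toy example the first treated unit takes the control unit that the second treated unit needed more, so the greedy total distance (0.23) exceeds the optimal one (0.17); with a caliper of 0.1, greedy matching leaves the second treated unit unmatched, which is the incomplete pair matching mentioned above.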
Another key issue is whether matching is performed with or without replacement. Stuart (2010) recommends matching with replacement over matching without replacement, because control units that look similar can be used by different treated units, which reduces bias. Gu and Rosenbaum (1993) found that when many covariates are available, the propensity score performed better than either the Mahalanobis distance or the Mahalanobis distance within propensity calipers. If there are a small number of covariates, then the Mahalanobis distance and the Mahalanobis distance within propensity calipers are preferred.

An alternative to optimal matching is genetic matching. Sekhon and Grieve (2009) compared Genetic Matching (GM) with replacement and propensity score matching (PSM) using clinical intervention data and found that GM performed better than PSM in terms of covariate balance. They estimated the ATT with 1-to-1 matching with replacement and included the propensity scores in the genetic matching to improve the covariate balance. They used a misspecified propensity score with a population size of 5000 in genetic matching. An incorrect but richly specified model was used instead of the true propensity score because it was assumed that the functional form of the true propensity score was unknown. The simulations showed that when the true propensity score was unknown, genetic matching improved balance on observed characteristics and produced much lower bias and root mean square error (RMSE). In their simulations, improving covariate balance always led to less bias and lower RMSE for both methods.

In this study, it is hypothesized that Genetic Matching performs better than optimal full matching and the baseline model in terms of the treatment effect estimate and the balance of covariates.
In light of this hypothesis, the following research questions are investigated:

1. How does Genetic Matching compare with OFPSM with respect to achieving covariate balance in conditions with different sample sizes, treated ratios, magnitudes of covariate effect on treatment assignment, and magnitudes of covariate effect on the outcome?
2. How does Genetic Matching compare with OFPSM with respect to the relative bias of the ATT estimates in conditions with different sample sizes, treated ratios, magnitudes of covariate effect on treatment assignment, and magnitudes of covariate effect on the outcome?
3. How does Genetic Matching compare with OFPSM with respect to the standard errors of the ATT estimates in conditions with different sample sizes, treated ratios, magnitudes of covariate effect on treatment assignment, and magnitudes of covariate effect on the outcome?
4. How does Genetic Matching compare with OFPSM with respect to power in conditions with different sample sizes, treated ratios, magnitudes of covariate effect on treatment assignment, and magnitudes of covariate effect on the outcome?

CHAPTER 3
METHOD

A Monte Carlo simulation study was conducted to show differences among the methods in reducing bias when sample size, ratio of treated to total sample size, number of covariates, Pseudo R-squared, and R-squared were manipulated.

Data Simulation

The sample sizes were simulated as 300 and 600 to observe differences between the power of the different methods when testing the ATT. The ratios of treated to total sample size were simulated as 1/3 and 1/10, so that a large difference in ratios would produce larger differences in the performance of the methods. Consequently, the number of potential matches per treated unit is smaller for the ratio of 1/3 than for the ratio of 1/10. Overlap for the treated units can be assessed through the percentage of treated units whose propensity scores lie above the maximum propensity score of the untreated plus 0.1 standard deviation units.
This study used 0.1 standard deviations from the maximum propensity score of the untreated to delimit adequate overlap, because it represents the caliper width used to match the treated and untreated units. This shows how the overlap measure behaves when the conditions are manipulated.

Thoemmes and Kim (2011) reviewed 79 studies to examine the usage of propensity score methods in social and educational research. The number of covariates in these studies ranged from 2 to 238 (the 75th percentile was 29, the mean was 31.3, and the median was 16). Genetic Matching takes longer than other matching methods, so to reduce the computational burden, 4 covariates were used in this study. The covariates were generated from a population with a mean of zero and a variance of one [N(0, 1)]. The correlations between covariates were randomly generated between 0 and 0.50.

McKelvey and Zavoina's Pseudo R-squared was considered an appropriate measure of the total variability of the treatment assignment explained by the covariates. Therefore, Pseudo R-squared was used to manipulate the magnitude of the effect of the covariates on treatment assignment, which determines how strong selection bias is in a particular observational study. A higher proportion of explained variability indicates a better logistic regression model for treatment assignment. The Pseudo R-squared is given more formally by Equation 3-1, which is based on the explained sum of squares in the model. Pseudo R-squared was manipulated at the population level as 0.2 (small) and 0.5 (medium).

R-squared was used to manipulate the magnitude of the effect of the covariates on the outcome. Stronger relationships between the covariates and the outcome correspond to larger selection bias if the covariates are not included in the analysis. At the population level, R-squared for the outcome model was manipulated as 0.2 and 0.5, and the residual mean was zero.
In this research, the R-squared was manipulated by selecting the residual variance that yields the desired R-squared in the formula:

$$R^2 = \frac{SS_{E}}{SS_{E} + SS_{R}} \tag{3-2}$$

where $SS_{E}$ is the explained sum of squares in the model and $SS_{R}$ is the residual sum of squares. The population values of the coefficients are shown in Table 3-1. The residual variances for the population parameters are shown in Table 3-2. Potential outcomes were obtained for treated and untreated units based on the simulated covariates and the outcome residuals. In addition, the population value of the ATT was 0.9.

Table 3-1. The coefficients of the population parameters for 4 covariates

              Pseudo R-squared
Coefficient   0.2     0.5
1             0.45    0.90
2             0.46    0.91
3             0.45    0.91
4             0.45    0.90

Table 3-2. The values of the residual variances at the population level (4 covariates)

                    R-squared
                    0.2     0.5
Residual variance   1.51    0.37

The probability of being treated, which determines the treatment assignment, was calculated from three components. First, an intercept was added so that the ratio of treated to total sample size was controlled. Second, the population values of the regression coefficients were used to take the covariate effects into account. Lastly, a residual was added to the equation. Although residuals cannot be estimated in logistic regression models, in order to simulate treatment assignment I used a residual randomly sampled from a logistic distribution to produce random selection given the probabilities defined by the model. More formally, the treatment assignment was based on Equation 3-3. Finally, a case was classified as treated if the resulting value of the treatment assignment model was greater than zero. Otherwise, the case was classified as untreated. Table 3-3 shows the conditions that were manipulated at the population level.
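The treatment assignment rule described above (intercept, covariate effects, logistic residual, and a cut at zero) can be sketched in Python as follows. This is a simplified illustration, not the thesis's R code: the function name simulate_treatment is hypothetical, the covariates are drawn as independent N(0, 1) variables (the study used correlated covariates), and the intercept is set to the log-odds of the target ratio as an assumption, because the intercept definition used in the study is not reproduced here. The coefficients are the Pseudo R-squared = 0.2 values from Table 3-1.

```python
import math
import random


def simulate_treatment(n, coefs, intercept, seed=0):
    """Simulate covariates and a binary treatment indicator.

    latent = intercept + sum(coef_k * x_k) + logistic residual
    A case is treated (1) when the latent value exceeds zero.
    Returns a list of (covariates, treatment) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Assumption: independent standard normal covariates.
        x = [rng.gauss(0.0, 1.0) for _ in coefs]
        # Logistic residual via the inverse CDF of a uniform draw.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        residual = math.log(u / (1.0 - u))
        latent = intercept + sum(b * xk for b, xk in zip(coefs, x)) + residual
        rows.append((x, 1 if latent > 0 else 0))
    return rows


if __name__ == "__main__":
    coefs = [0.45, 0.46, 0.45, 0.45]           # Table 3-1, Pseudo R-squared = 0.2
    intercept = math.log((1 / 3) / (2 / 3))    # assumed: log-odds of a 1/3 ratio
    data = simulate_treatment(300, coefs, intercept, seed=1)
    share = sum(t for _, t in data) / len(data)
    print(f"proportion treated: {share:.3f}")
```

Because the covariate effects widen the spread of the latent variable, a plain log-odds intercept gives a treated proportion that is close to, but not exactly, the target ratio; controlling the ratio precisely would require adjusting the intercept.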
Based on these manipulations, there were 2 x 2 x 2 x 2 = 16 different conditions, resulting from the combination of 2 levels of sample size, 2 levels of ratio, 2 levels of Pseudo R-squared, and 2 levels of R-squared. For each condition, 500 datasets were generated. The manipulated conditions are presented in Table 3-3.

Table 3-3. The manipulated conditions

Sample size   Ratio   Number of covariates   Pseudo R-squared   R-squared
300           1/3     4                      0.2                0.2
600           1/10                           0.5                0.5

The statistical software used in the simulation was R, version 2.15.2 (R Core Team, 2012), with the following functions performed by contributed packages. Genetic matching was performed with the Matching package, version 4.8-2 (Sekhon, 2011). The MatchIt package (Ho et al., 2011) was used to perform optimal full propensity score matching. The survey package was used to obtain the balance measures. Finally, Taylor Series Linearization (TSL) standard errors were estimated with the survey package, version 3.29-4 (Lumley, 2012).

A baseline model and five propensity score methods were used to estimate the ATT with each simulated dataset. Propensity scores for each treated and control unit were estimated using logistic regression in order to implement the methods. The methods used were:

1. A baseline model in which the ATT was estimated by ignoring selection bias.
2. Optimal full propensity score matching. This was a 1-to-many matching strategy that was equivalent to creating the maximum number of strata with at least one treated unit and a variable number of untreated units. Weights were obtained by using Equation 2-16.
3. 1-to-1 Genetic Matching using propensity scores, with matches specified within a caliper distance and without replacement.
4. 1-to-1 Genetic Matching using propensity scores, with matches specified without a caliper and without replacement.
5. 1-to-many Genetic Matching using propensity scores, without a caliper. Weights were estimated differently from the other methods: they were re-estimated based on Equation 2-15.
6. 1-to-many Genetic Matching using propensity scores, with a caliper. Weights were estimated as in the previous method, using Equation 2-15.

ATTs for all models were estimated with the weighted regression model:

$$Y_i = \beta_0 + \beta_1 T_i + e_i \tag{3-4}$$

where $T_i$ is the treatment indicator, $\beta_1$ is the slope of the regression equation, which is the estimated ATT, and $e_i$ is the residual. This model was estimated with the survey package in R (2.15.2). TSL standard errors were estimated for the methods using Equation 2-17.

Analysis

A balance measure for each covariate was estimated by:

$$d = \frac{\bar{x}_t - \bar{x}_c}{s} \tag{3-5}$$

where $\bar{x}_t$ is the mean of the treated units, $\bar{x}_c$ is the mean of the control units, and $s$ is the pooled standard deviation of the treated and control groups (Cohen, 1988). A covariate was considered to have an acceptable degree of balance if its effect size was between -0.1 and +0.1 standard deviations. The percentage of unbalanced covariates was calculated as the number of unbalanced covariates divided by the total number of covariates, which was four in the simulated data.

The performance of Genetic Matching was compared across the different conditions. A baseline model and optimal full propensity score matching were also evaluated. The relative bias of the ATT estimate, $B(\hat{\theta})$, was:

$$B(\hat{\theta}) = \frac{\bar{\hat{\theta}} - \theta}{\theta} \tag{3-6}$$

where $\bar{\hat{\theta}}$ is the mean of the ATT estimates obtained over the 500 replications of each condition of the simulation study and $\theta$ is the population value of the ATT.
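The balance criterion of Equation 3-5 and the relative bias of Equation 3-6 can be sketched in Python as follows. This is an illustrative, unweighted sketch rather than the survey-package computation used in the study: the function names are hypothetical, and the pooled standard deviation is assumed to be the usual degrees-of-freedom-weighted pooling of the two group variances.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Effect size d = (mean_t - mean_c) / pooled SD (Equation 3-5)."""
    n1, n0 = len(treated), len(control)
    # Assumption: pooled variance weighted by degrees of freedom.
    pooled_var = ((n1 - 1) * variance(treated) + (n0 - 1) * variance(control)) / (n1 + n0 - 2)
    return (mean(treated) - mean(control)) / math.sqrt(pooled_var)


def percent_unbalanced(treated_cols, control_cols, threshold=0.1):
    """Percentage of covariates whose |d| exceeds the threshold."""
    ds = [standardized_mean_difference(t, c) for t, c in zip(treated_cols, control_cols)]
    return 100.0 * sum(abs(d) > threshold for d in ds) / len(ds)


def relative_bias(estimates, true_value):
    """Relative bias of the mean estimate (Equation 3-6)."""
    return (mean(estimates) - true_value) / true_value


if __name__ == "__main__":
    # Two toy covariates: the first is unbalanced, the second is balanced.
    treated_cols = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
    control_cols = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0]]
    print(percent_unbalanced(treated_cols, control_cols))  # prints 50.0
    # Hypothetical ATT estimates evaluated against the population ATT of 0.9.
    print(round(relative_bias([0.945, 0.945, 0.945], 0.9), 3))  # prints 0.05
```

A relative bias of 0.05 in this example corresponds to ATT estimates that are on average 5% larger than the population value of 0.9.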
Relative bias can be interpreted as the percentage bias in the ATT estimates relative to the population parameter. A relative bias with an absolute value below 0.05 was considered acceptable (Hoogland & Boomsma, 1998). Next, equal percent bias reduction (EPBR) was estimated using Equation 3-7, which is based on the relative bias of the ATT estimate in the baseline model and the relative bias of the ATT for the method used. In addition, the relative bias of the standard error of the ATT estimates was estimated according to:

$$B(\widehat{SE}) = \frac{\overline{SE} - SD(\hat{\theta})}{SD(\hat{\theta})} \tag{3-8}$$

where $\overline{SE}$ is the mean of the estimated standard errors and $SD(\hat{\theta})$ is the standard deviation of the estimates of the ATT, which corresponds to an empirical population standard error. Furthermore, the statistical power was estimated for testing ATT = 0. The purpose of the power analysis was not to determine the required sample size, but rather to compare statistical power across propensity score methods and sample sizes (Krhne, 2011).

CHAPTER 4
RESULTS

In this section, ANOVA results are presented for covariate balance, the overlap measure, the relative bias of the ATT, the standard errors of the ATT, and power. In the ANOVA tables, if a factor had both a significant main effect and significant interactions, the interpretation focused on the interactions. In addition, only significant factors with effect sizes larger than 0.01 were selected, because they correspond to interpretable differences in outcomes.

F tests

The between-subject factors were sample size, ratio of treated to total sample size, Pseudo R-squared, and R-squared. Sample size was set to 300 and 600, the ratio of treated to total sample size was set to 1/3 and 1/10, and both Pseudo R-squared and R-squared were manipulated at 0.2 and 0.5. Each condition was then analyzed by six methods: a method ignoring the selection bias, OFPSM, 1-to-1 Genetic Matching with and without a caliper, and 1-to-many Genetic Matching with and without a caliper.
For balance of covariates, an Analysis of Variance (ANOVA) with a mixed (split-plot) design was conducted to determine which of the manipulated conditions affect the covariate balance. The outcome was the percentage of unbalanced covariates. The between-subject factors were sample size, ratio of treated to total sample size, Pseudo R-squared, and R-squared. The within-subjects factor was the method. The significant results from the ANOVA and the effect sizes are shown in Table 4-1.

Based on the information in Table 4-1, the main effect of sample size was significant, F(1, 7989) = 1489.71, p < .0001. When sample size was increased from 300 to 600, covariate balance improved. Next, the method by Pseudo R-squared interaction was significant, F(4, 31956) = 845.94, p < .0001. Table 4-2 and Figure 4-1 were established by collapsing the Pseudo R-squared levels over sample size, ratio, and R-squared.

Based on Table 4-2 and Figure 4-1, a higher level of Pseudo R-squared tended to decrease the covariate balance. OFPSM had the best covariate balance at the Pseudo R-squared level of 0.2. On the other hand, 1-to-1 Genetic Matching with caliper had the best covariate balance at the Pseudo R-squared level of 0.5. For the Genetic Matching methods, use of a caliper at both of the Pseudo R-squared levels provided better covariate balance.

Table 4-1. ANOVA and effect size results of percentage of unbalanced covariates

Source                       df      F         p         Eta squared
Between
  Sample size                1       1489.71   <.0001*   0.029
  Pseudo R-squared           1       3574.39   <.0001*   0.069
  Error                      7989
Within
  Method                     4       4442.20   <.0001*   0.140
  Method*Pseudo R-squared    4       845.94    <.0001*   0.026
  Method*ratio               4       1877.20   <.0001*   0.059
  Error (Method)             31956
Note: * p < .05

Table 4-2.
Percentages of unbalanced covariates for method by Pseudo R-squared interaction

                            1-to-1 Genetic Matching          1-to-many Genetic Matching
Pseudo R-squared   OFPSM    with caliper   without caliper   with caliper   without caliper
0.2                18.41%   24.56%         49.06%            38.29%         30.82%
0.5                48.66%   26.16%         85.03%            51.16%         49.13%

According to Table 4-1, the method by ratio interaction was also significant, F(4, 31956) = 1877.20, p < .0001. Table 4-3 and Figure 4-2 were established by collapsing the ratio levels over sample size, Pseudo R-squared, and R-squared. Based on Table 4-3 and Figure 4-2, OFPSM had the best covariate balance at the ratio of 0.1. On the other hand, 1-to-1 Genetic Matching with caliper had the best covariate balance at the ratio of 0.33. For the Genetic Matching methods, using a caliper at both of the ratio levels provided better covariate balance.

Table 4-3. Percentages of unbalanced covariates for method by ratio interaction

                 1-to-1 Genetic Matching          1-to-many Genetic Matching
Ratio   OFPSM    with caliper   without caliper   with caliper   without caliper
1/10    31.49%   35.99%         51.96%            52.16%         44.85%
1/3     35.58%   14.73%         82.13%            37.28%         35.10%

There was no general trend showing that an increase in ratio results in an increase or a decrease of covariate balance across the methods. For example, when the ratio was increased, 1-to-1 Genetic Matching without caliper and OFPSM tended to have worse balance. On the other hand, the remaining methods tended to have better balance when the ratio was increased.

Table 4-4. Percentages of treated observations with no common support of propensity score distributions

Sample size   Ratio   Pseudo R-squared   R-squared   No overlap
300           1/3     0.2                0.2         3.30%
                                         0.5         3.00%
                      0.5                0.2         8.63%
                                         0.5         8.88%
              1/10    0.2                0.2         1.82%
                                         0.5         1.91%
                      0.5                0.2         5.05%
                                         0.5         4.99%
600           1/3     0.2                0.2         3.31%
                                         0.5         3.30%
                      0.5                0.2         9.65%
                                         0.5         9.42%
              1/10    0.2                0.2         1.84%
                                         0.5         1.72%
                      0.5                0.2         6.13%
                                         0.5         5.94%

Table 4-4 shows the percentages of treated observations without overlap.
When sample size was increased, overlap increased at both Pseudo R-squared levels. Also, higher Pseudo R-squared levels tended to have lower overlap than lower Pseudo R-squared levels.

For the relative bias of the ATT estimates, an Analysis of Variance (ANOVA) with a mixed (split-plot) design was conducted. The between-subject factors were sample size, ratio, Pseudo R-squared, and R-squared. The within-subject factor was the method used to analyze the data. The significant results from the ANOVA and the effect sizes are shown in Table 4-5.

Table 4-5. ANOVA and effect size results of ATT estimates

Source                       df      F         p         Eta squared
Between
  Pseudo R-squared           1       158.38    <.0001*   0.010
  Error                      7989
Within
  Method                     5       9610.63   <.0001*   0.013
  Method*Pseudo R-squared    5       535.32    <.0001*   0.242
  Error (Method)             39945
Note: * p < .05

Based on Table 4-5, the interaction of method by Pseudo R-squared was significant, F(5, 39945) = 535.32, p < .0001. Table 4-6 and Figure 4-3 were established by collapsing the biases of the ATT estimates across the sample sizes, ratios, and R-squared levels within the methods. The only unacceptable relative bias of the ATT occurred with 1-to-1 Genetic Matching without caliper when the Pseudo R-squared was 0.5. The 1-to-many Genetic Matching with caliper did not perform as well as the other Genetic Matching methods with acceptable bias, but it still had an acceptable degree of relative bias. Also, 1-to-many Genetic Matching without caliper and OFPSM performed almost equally at both Pseudo R-squared levels. Lastly, 1-to-1 Genetic Matching with caliper performed the best among the methods under the two Pseudo R-squared levels. On average, however, OFPSM performed slightly better than the 1-to-many Genetic Matching strategies. OFPSM also performed better than the 1-to-1 Genetic Matching strategies on average. Furthermore, adding a caliper to the 1-to-many Genetic Matching methods produced more bias, while adding a caliper to 1-to-1 Genetic Matching produced less bias.
Lastly, percent bias reduction was better for all methods at the higher Pseudo R-squared level than at the lower one.

Table 4-6. Relative bias and percent bias reduction of ATT estimates for method by Pseudo R-squared interaction

                                                1-to-1 Genetic Matching             1-to-many Genetic Matching
Pseudo R-squared   Baseline   OFPSM             with caliper      without caliper   with caliper     without caliper
0.2                0.26       0.002 (35.92%)    0.0002 (36.29%)   0.047 (29.93%)    0.031 (32.08%)   0.001 (36.06%)
0.5                0.40       0.013 (65.97%)    0.002 (67.87%)    0.162 (41.15%)    0.047 (60.12%)   0.011 (66.29%)
Note: Biases greater than the absolute value of 0.05 were considered unacceptable. Percentages in parentheses represent the average percent bias reduction for the ATTs.

In addition to the relative bias of the ATT estimates, the relative bias of the standard error of the ATT estimates was also evaluated. The significant results from the ANOVA and the effect sizes are shown in Table 4-7. Results with an absolute relative bias of the standard error larger than 0.1 were considered unacceptable (Hoogland & Boomsma, 1998). Based on Table 4-7, the ANOVA results showed that the method by Pseudo R-squared interaction was significant, F(5, 39945) = 657.22, p < .0001, as was the method by R-squared interaction, F(5, 39945) = 702.43, p < .0001. Table 4-8 was established by collapsing Pseudo R-squared over sample size, ratio, and R-squared for the method by Pseudo R-squared interaction. Figure 4-4 shows the two-way ANOVA interaction based on the information in Table 4-8.
Table 4-7. ANOVA and effect size results for standard errors of ATT estimates

Source                       df      F         p         Eta squared
Between
  Pseudo R-squared           1       502.64    <.0001*   0.024
  R-squared                  1       1776.12   <.0001*   0.085
  Error                      7989
Within
  Method                     5       1671.99   <.0001*   0.071
  Method*Pseudo R-squared    5       657.22    <.0001*   0.028
  Method*R-squared           5       702.43    <.0001*   0.030
  Error (Method)             39945
Note: * p < .05

Based on the information in Table 4-8 and Figure 4-4, 1-to-many Genetic Matching with caliper was the only matching method with an acceptable degree of bias in the standard error of the ATT estimates at the Pseudo R-squared level of 0.2. The worst method at this level was OFPSM. At the Pseudo R-squared level of 0.5, 1-to-many Genetic Matching without caliper was the best method. Higher levels of Pseudo R-squared tended to have smaller bias in the standard errors, except for 1-to-1 Genetic Matching with caliper. In general, 1-to-many Genetic Matching performed better than OFPSM. The worst method was 1-to-1 Genetic Matching.

Table 4-8. Relative bias in standard error of ATT estimates for method by Pseudo R-squared interaction

                                       1-to-1 Genetic Matching          1-to-many Genetic Matching
Pseudo R-squared   Baseline   OFPSM    with caliper   without caliper   with caliper   without caliper
0.2                0.007      0.138    0.134          0.132             0.089          0.101
0.5                0.008      0.018    0.150          0.102             0.037          0.015
Note: Biases greater than the absolute value of 0.1 were considered unacceptable.

In addition to the method by Pseudo R-squared interaction, Table 4-9 was established by collapsing sample size, ratio, and Pseudo R-squared for the method by R-squared interaction. Figure 4-5 shows the performance of the methods at each R-squared level with respect to bias in the standard errors of the estimates.
Table 4-9. Relative bias of standard errors of ATT estimates for method by R-squared interaction

                                1-to-1 Genetic Matching          1-to-many Genetic Matching
R-squared   Baseline   OFPSM    with caliper   without caliper   with caliper   without caliper
0.2         0.003      0.017    0.056          0.062             0.014          0.022
0.5         0.019      0.137    0.229          0.171             0.112          0.108
Note: Biases greater than the absolute value of 0.1 are unacceptable.

Based on Table 4-9 and Figure 4-5, all methods produced an acceptable degree of bias only at the R-squared of 0.2. OFPSM did not perform as well as the 1-to-many Genetic Matching methods. The 1-to-1 Genetic Matching methods performed worst at both R-squared levels. At the R-squared of 0.2, the Genetic Matching methods performed better with a caliper than without one. At the R-squared of 0.5, the Genetic Matching methods performed better without a caliper. In general, increasing the R-squared resulted in larger bias of the standard errors of the estimates.

For power, an Analysis of Variance (ANOVA) with a mixed (split-plot) design was conducted to determine which of the manipulated conditions affect the power. The outcome was the percentage of significant ATT estimates per unique condition of the simulation. The between-subject factors were sample size and ratio of treated to total sample size. The Pseudo R-squared of 0.2 was not included because this level resulted in biased ATT estimates. Also, the R-squared of 0.5 was not included because this level resulted in biased standard errors of the estimates. The within-subjects factor was the method, evaluated through the significance of the hypothesis test for the unbiased methods. The significant results from the ANOVA and the effect sizes are shown in Table 4-10.

According to Table 4-10, the sample size main effect was significant, F(1, 1996) = 159.19, p < .0001, as was the method main effect, F(4, 3993) = 75.44, p < .0001.
Table 4-11 was established by collapsing the sample size and R squared levels over the ratio and Pseudo R squared.

Table 4-10. ANOVA and effect size results of power

Source           df      F        p         Eta squared
Between
  Sample size    1       159.19   <.0001*   0.028
  Error          1996
Within
  Method         4       75.44    <.0001*   0.010
  Error          3993
Note: *p < .05

The percentage of significant ATT estimates was 86.46% for the sample size of 300 and 97.42% for the sample size of 600. Therefore, as the sample size increased, power increased. According to Table 4-11, 1-to-1 Genetic Matching was more powerful than OFPSM, while the least powerful method was 1-to-many Genetic Matching. In 1-to-1 Genetic Matching, the power was better without a caliper, but the opposite was true for 1-to-many Genetic Matching.

Table 4-11. Percentage of the estimated ATT that is significant for the methods, collapsed over sample size and R squared

            1-to-1 Genetic Matching          1-to-many Genetic Matching
OFPSM       with caliper   without caliper   with caliper   without caliper
89.65%      94.85%         97.05%            89.08%         88.35%

Figure 4-1. Percentages of unbalanced covariates for method by Pseudo R squared interaction. Note: GenMatch_c indicates genetic matching with a caliper while GenMatch does not use a caliper.

Figure 4-2. Percentages of unbalanced covariates for method by ratio interaction. Note: GenMatch_c indicates genetic matching with a caliper while GenMatch does not use a caliper.

Figure 4-3. Relative bias of ATT estimates for methods under the two different Pseudo R squared conditions. Note: GenMatch_c indicates genetic matching with a caliper while GenMatch does not use a caliper.

Figure 4-4. Relative bias of standard errors of the ATT for Pseudo R squared levels. Note: GenMatch_c indicates genetic matching with a caliper while GenMatch does not use a caliper.

Figure 4-5. Relative bias of standard errors of the ATT for R squared levels. Note: GenMatch_c indicates genetic matching with a caliper while GenMatch does not use a caliper.
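The acceptability rule applied to Tables 4-8 and 4-9 can be illustrated with a short Python sketch. The 0.1 cutoff and the Table 4-8 values are taken from this chapter. The formula for the relative bias of the standard error (mean estimated standard error compared with the empirical standard deviation of the ATT estimates across replications) is an assumption stated here for illustration, as are all function and variable names.

```python
from statistics import mean, stdev

ACCEPTABLE_ABS_BIAS = 0.1  # cutoff from the notes to Tables 4-8 and 4-9


def relative_bias_of_se(estimated_ses, att_estimates):
    """Relative bias of the standard error across replications.

    ASSUMPTION: the reference value is the empirical standard deviation of
    the ATT estimates over replications; this definition is not quoted from
    the chapter.
    """
    empirical_se = stdev(att_estimates)
    return (mean(estimated_ses) - empirical_se) / empirical_se


def is_acceptable(relative_bias, cutoff=ACCEPTABLE_ABS_BIAS):
    """Biases greater than the cutoff in absolute value are unacceptable."""
    return abs(relative_bias) <= cutoff


# Values reported in Table 4-8 for a Pseudo R squared of 0.2.
table_4_8_pseudo_r2_02 = {
    "Baseline": 0.007,
    "OFPSM": 0.138,
    "1-to-1 GM with caliper": 0.134,
    "1-to-1 GM without caliper": 0.132,
    "1-to-many GM with caliper": 0.089,
    "1-to-many GM without caliper": 0.101,
}
acceptable_methods = [
    method for method, bias in table_4_8_pseudo_r2_02.items() if is_acceptable(bias)
]
```

Applied to the Pseudo R squared level of 0.2, the rule retains only the baseline and 1-to-many Genetic Matching with caliper, which agrees with the interpretation of Table 4-8 given in the text.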
CHAPTER 5
DISCUSSION

The first research goal was the evaluation of the covariate balance. First, as expected, increasing sample size produced better covariate balance. There were more control unit options to be matched with treated units, so the quality of matching improved and covariate balance after matching was better. Second, a smaller value of Pseudo R squared tended to produce better covariate balance, with 1-to-1 GM with a caliper having the best covariate balance. The reason for this was that the magnitude of the covariate effect on treatment assignment was smaller at lower Pseudo R squared values, and therefore the amount of discrepancy between the covariate distributions of the treated and untreated was smaller. The second best method was OFPSM. The 1-to-many GM methods were still better than the 1-to-1 GM without caliper. Using a caliper produced better covariate balance in GM methods. Third, ratio also affected the covariate balance. The best covariate balance was achieved by 1-to-1 GM with caliper, and OFPSM had the second best covariate balance. However, there was no unidirectional relationship between ratio and covariate balance.

In addition to balance, overlap increased when the sample size was increased. However, an increase in Pseudo R squared resulted in decreasing overlap. As the effect of covariates on selection bias became stronger, the distributions of the probability of treatment for the treated and control groups became more distinct. Shadish, Cook, and Campbell (2002) stated that sufficient overlap is needed between the treated and the control groups. Thus, one should select the sample sizes carefully.

The third research question was whether there were significant differences in relative bias of ATT estimates among the methods when the factors were manipulated. Pseudo R squared was the only significant factor affecting the bias of ATT estimates.
When Pseudo R squared was increased, the bias of the ATT estimates became worse. Thus, Pseudo R squared and bias reduction were inversely related. Pseudo R squared shows how well the regressors explain the participation probability. After matching was performed, the distributions of the covariates for the treated and control groups were more similar. A lower Pseudo R squared value provided better bias reduction. Huber, Lechner, and Wunsch (2013) observed that lower Pseudo R squared values provided lower RMSE and bias. Also, Sianesi (2004) suggested that lower Pseudo R squared values provide better covariate balance after matching.

The absolute difference between matched treated and control units should be smaller than the caliper width, which was equal to 0.1 standard deviations of the logit of the propensity score in this study. Matching based on this threshold produced better covariate balance because the treated and the control units were distributed similarly (Austin, 2011). Cochran and Rubin (1973) found that using a caliper removed almost all the bias.

The 1-to-many GM without caliper was the best method and slightly better than OFPSM at the lower Pseudo R squared level. The worst method was 1-to-1 GM without caliper at the higher Pseudo R squared level. Matching with replacement performed in line with the recommendation to use replacement. Using replacement guarantees that the closest untreated observation in terms of propensity score is always available to match to a certain treated observation, regardless of the number of other treated observations with similar propensity scores. A fixed number of control units, which was 1 in this study, restricted the flexibility and the quality of the matching. Cepeda, Boston, Farrar, and Strom (2003) found that use of a variable number of untreated subjects was more effective in reducing the bias than use of a fixed number of untreated units in different conditions (i.e.
sample size, ratio of treated to untreated, etc.). They also found that as the ratio of treated to untreated increased, the bias decreased. As expected, at smaller ratios, 1-to-many matching performed no better than 1-to-1 matching because the available number of untreated units was large. The 1-to-1 matching method can provide a similar quality of match compared to 1-to-many matching at lower ratios of treated to untreated. At higher ratios, it was expected that 1-to-many matching would perform better than 1-to-1 matching because the number of untreated units was limited. However, ratio did not substantially affect the relative bias in this study. This study partially confirmed the work of Sekhon and Grieve (2009), who compared Genetic Matching using replacement with propensity score matching and concluded that Genetic Matching performed better in terms of covariate balance. The performance of OFPSM confirmed the results of many previous researchers (Gu & Rosenbaum, 1993; Hansen, 2004; Steiner & Cook, 2013).

The fourth research question was whether there were significant differences in relative bias of the standard error of the ATT estimates obtained with Taylor Series Linearization among the models when the factors were manipulated. Pseudo R squared and R squared had significant effects among the models. First, 1-to-many GM without caliper was the best method on average across both Pseudo R squared levels. OFPSM was the second best approach on average after 1-to-many GM. Consequently, 1-to-many GM performed better than the 1-to-1 GM methods. An increase in Pseudo R squared resulted in smaller bias in the standard error of the ATT estimates. Second, a decrease in R squared resulted in smaller bias in the standard error of the ATT estimates. Again, 1-to-many GM methods produced smaller bias in standard errors than OFPSM, and 1-to-1 GM methods were the worst methods at both R squared levels.
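The caliper that distinguishes the with-caliper from the without-caliper methods discussed above, 0.1 standard deviations of the logit of the propensity score, can be illustrated with a minimal Python sketch. The sketch uses simple nearest-neighbor 1-to-1 matching with replacement. It is not Genetic Matching and not optimal full matching, and it is offered only to show how a caliper discards poor matches. Computing the standard deviation over all units, the function names, and the example propensity scores are assumptions made for illustration.

```python
import math
from statistics import stdev


def logit(p):
    """Log-odds of a propensity score strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def caliper_width(propensity_scores, multiplier=0.1):
    """Caliper of `multiplier` standard deviations of the logit propensity score.

    ASSUMPTION: the standard deviation is taken over all units (treated and
    control pooled).
    """
    return multiplier * stdev([logit(p) for p in propensity_scores])


def match_within_caliper(treated_ps, control_ps, multiplier=0.1):
    """Nearest-neighbor 1-to-1 matching with replacement on the logit scale.

    Returns, for each treated unit, the index of its matched control unit,
    or None when the nearest control lies outside the caliper.
    """
    width = caliper_width(list(treated_ps) + list(control_ps), multiplier)
    control_logits = [logit(p) for p in control_ps]
    matches = []
    for p in treated_ps:
        target = logit(p)
        nearest = min(
            range(len(control_logits)),
            key=lambda k: abs(control_logits[k] - target),
        )
        within = abs(control_logits[nearest] - target) < width
        matches.append(nearest if within else None)
    return matches


# Hypothetical propensity scores for two treated and three control units.
example_matches = match_within_caliper([0.60, 0.90], [0.58, 0.30, 0.40])
```

In this hypothetical example the first treated unit is matched to the first control unit, whereas the second treated unit has no control unit within the caliper and is left unmatched. This is the trade-off behind the caliper results: closer matches at the cost of dropping some treated units.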
The last research goal was the evaluation of the power in different conditions. As expected, an increase in sample size resulted in more power, with 1-to-1 GM providing higher power than OFPSM and 1-to-many GM being the least powerful method.

This study was limited to the sample size conditions used. As the findings showed, much larger sample sizes may produce better results. Thus, additional sample size levels should be investigated. Therefore, the findings of this study cannot be generalized to other sample size conditions. Ratio affected only covariate balance and did not have a substantial effect on relative bias. A very small number of treated units as compared to control units may lead to convergence problems in logistic regression. On the other hand, a ratio of treated to control close to 50% is likely to result in overlap problems for the estimation of the ATT. This factor should be investigated at different levels with different sample size conditions. For standard error estimation, several other methods, such as the jackknife or bootstrapping (Rao & Shao, 2007; Efron, 1981), can be used to observe the performance of these methods in terms of reducing bias of standard errors. The number of covariates was four because of the large computational burden that would have been brought about by including larger numbers of covariates. Therefore, this factor might be further investigated as well. Additional propensity score methods (e.g., PSS, IPTW, CAUPS, etc.) can be added to compare the performance of these methods under the expanded levels of factors.

The 1-to-many Genetic Matching and OFPSM provided similar results and were better than 1-to-1 Genetic Matching. In applied research, use of either 1-to-many Genetic Matching or OFPSM will provide acceptable results.

LIST OF REFERENCES

Abadie, A., & Imbens, G. W. (2009). Matching on the estimated propensity score.
Unpublished manuscript. Retrieved from https://search.ebscohost.com/login.aspx?direct=true&AuthType=ip,uid&db=ecn&AN=1062080&site=ehost-live; http://www.nber.org/papers/w15301.pdf

Austin, P. C. (2009). The relative ability of different propensity score methods to balance measured covariates between treated and untreated subjects in observational studies. Medical Decision Making, 29(6), 661-677. doi:10.1177/0272989X09341755

Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3), 399-424. doi:10.1080/00273171.2011.568786

Austin, P. C., & Mamdani, M. M. (2006). A comparison of propensity score methods: A case study estimating the effectiveness of post-AMI statin use. Statistics in Medicine, 25(12), 2084-2106. doi:10.1002/sim.2328

Cepeda, M. S., Boston, R., Farrar, J. T., & Strom, B. L. (2003). Optimal matching with a variable number of controls vs. a fixed number of controls for a cohort study: Trade-offs. Journal of Clinical Epidemiology, 56(3), 230-237. doi:http://dx.doi.org/10.1016/S0895-4356(02)00583-8

Cochran, W. G., & Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 35(4, Dedicated to the Memory of P. C. Mahalanobis), 417-446. Retrieved from http://www.jstor.org/stable/25049893

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.

Dehejia, R. H., & Wahba, S. (1998). Propensity score matching methods for non-experimental causal studies. Unpublished manuscript. Retrieved from https://search.ebscohost.com/login.aspx?direct=true&AuthType=ip,uid&db=ecn&AN=0717259&site=ehost-live; http://www.nber.org/papers/w6829.pdf

Diamond, A., & Sekhon, J. S. (in press). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Unpublished manuscript.

Efron, B. (1981).
Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods. Biometrika, 68(3), 589-599. Retrieved from http://www.jstor.org/stable/2335441

Greenland, S., Robins, J. M., & Pearl, J. (1999). Confounding and collapsibility in causal inference. Statistical Science, 14(1), 29-46. Retrieved from http://www.jstor.org/stable/2676645

Gu, X. S., & Rosenbaum, P. R. (1993). Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2(4), 405-420. Retrieved from http://www.jstor.org/stable/1390693

Guo, S., & Fraser, M. W. (2010). Propensity score analysis: Statistical methods and applications. Thousand Oaks, CA, US: Sage Publications, Inc. Retrieved from https://search.ebscohost.com/login.aspx?direct=true&AuthType=ip,uid&db=psyh&AN=2010-00734-000&site=ehost-live

Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467), 609-618. doi:10.1198/016214504000000647

Heeringa, S. G., West, B. T., & Berglund, P. A. (2010). Applied survey data analysis. Boca Raton, FL: CRC Press.

Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2011). MatchIt: Nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, 42(8).

Hoogland, J. J., & Boomsma, A. (1998). Robustness studies in covariance structure modeling: An overview and a meta-analysis. Sociol. Methods Res., 26(3), 329-367. Retrieved from https://search.ebscohost.com/login.aspx?direct=true&AuthType=ip,uid&db=fcs&AN=2452011&site=ehost-live

Huber, M., Lechner, M., & Wunsch, C. (2013). The performance of estimators based on the propensity score. Journal of Econometrics, 175(1), 1-21. doi:http://dx.doi.org/10.1016/j.jeconom.2012.11.006

Jasjeet, S. S. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R.
Journal of Statistical Software, 42(7).

Joffe, M. M., Have, T. R. T., Feldman, H. I., & Kimmel, S. E. (2004). Model selection, confounder control, and marginal structural models: Review and new applications. The American Statistician, 58(4), 272-279. Retrieved from http://www.jstor.org/stable/27643582

Kröhne, J. (2011). Estimation of average total effects in quasi-experimental designs: Nonlinear constraints in structural equation models. Jena: Thüringer Universitäts- und Landesbibliothek Jena.

Lohr, S. L. (1999). Sampling: Design and analysis. Pacific Grove, CA: Duxbury Press.

Lumley, T. (2010). Complex surveys: A guide to analysis using R. New York: Wiley.

Lumley, T. (2012). "Survey: Analysis of complex survey samples". R package version 3.29-4.

McKelvey, R. D., & Zavoina, W. (1975). A statistical model for the analysis of ordinal level dependent variables. The Journal of Mathematical Sociology, 4(1), 103-120. doi:10.1080/0022250X.1975.9989847

Mendoza, O. M. (1982). Taylor series variance estimation for selected indirect demographic estimators. (Dr.P.H., The University of North Carolina at Chapel Hill). ProQuest Dissertations and Theses. Retrieved from http://search.proquest.com/docview/303251613?accountid=10920 (303251613).

Ming, K., & Rosenbaum, P. R. (2000). Substantial gains in bias reduction from matching with a variable number of controls. Biometrics, 56(1), 118-124. doi:10.1111/j.0006-341X.2000.00118.x

Ming, K., & Rosenbaum, P. R. (2001). A note on optimal matching with variable controls using the assignment algorithm. Journal of Computational & Graphical Statistics, 10(3), 455. Retrieved from https://search.ebscohost.com/login.aspx?direct=true&AuthType=ip,uid&db=a2h&AN=5268954&site=ehost-live

R Core Team. (2012). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

Rao, J. N. K., & Shao, J. (1992). Jackknife variance estimation with survey data under hot deck imputation.
Biometrika, 79(4), 811-822. Retrieved from http://www.jstor.org/stable/2337236

Rassen, J. A., Shelat, A. A., Myers, J., Glynn, R. J., Rothman, K. J., & Schneeweiss, S. (2012). One-to-many propensity score matching in cohort studies. Pharmacoepidemiology and Drug Safety, 21, 69-80. doi:10.1002/pds.3263

Rosenbaum, P. R. (1989). Optimal matching for observational studies. Journal of the American Statistical Association, 84(408), 1024. Retrieved from https://search.ebscohost.com/login.aspx?direct=true&AuthType=ip,uid&db=bah&AN=4606611&site=ehost-live

Rosenbaum, P. R. (1991). A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society. Series B (Methodological), 53(3), 597-610. Retrieved from http://www.jstor.org/stable/2345589

Rosenbaum, P. R. (2002). Observational studies (2nd ed.). New York, NY: Springer-Verlag.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55. Retrieved from http://www.jstor.org/stable/2335942

Rosenbaum, P. R., & Rubin, D. B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1), 33-38. Retrieved from http://www.jstor.org/stable/2683903

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688-701. doi:10.1037/h0037350

Rubin, D. B. (1980). Bias reduction using Mahalanobis metric matching. Biometrics, 36(2), 293-298. Retrieved from http://www.jstor.org/stable/2529981

Rubin, D. B., & Thomas, N. (1996). Matching using estimated propensity scores: Relating theory to practice. Biometrics, 52(1), 249-264. Retrieved from http://www.jstor.org/stable/2533160

Schwartz, S., Gatto, N. M., & Campbell, U. B. (2012).
Extending the sufficient component cause model to describe the stable unit treatment value assumption (SUTVA). Epidemiologic Perspectives & Innovations, 9(1), 3-13. doi:10.1186/1742-5573-9-3

Sekhon, J. S. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1-52.

Sekhon, J. S., & Grieve, R. (2009). A new non-parametric matching method for covariate adjustment with application to economic evaluation. Experiments in Political Science 2008 Conference Paper.

Sekhon, J. S., & Mebane, W. R. (1998). Genetic optimization using derivatives. Political Analysis, 7(1), 187-210. doi:10.1093/pan/7.1.187

Sekhon, J. S., & Grieve, R. D. (2012). A matching method for improving covariate balance in cost-effectiveness analyses. Health Economics, 21(6), 695-714. doi:10.1002/hec.1748

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

Sianesi, B. (2004). An evaluation of the active labour market programmes in Sweden. The Review of Economics and Statistics, 86(1), 133-155.

Steiner, P. M., & Cook, D. (2013). Matching and propensity scores. In T. D. Little (Ed.), The Oxford handbook of quantitative methods (1st ed.). Oxford University Press.

Stuart, E. A. (2010). Matching methods for causal inference: A review and look forward. Stat. Sci., 25(1), 1-21. doi:10.1214/09-STS313

Thoemmes, F. J., & Kim, E. S. (2011). A systematic review of propensity score methods in the social sciences. Multivariate Behavioral Research, 46(1), 90-118. doi:10.1080/00273171.2011.540475

Zhao, Z. (2004). Using matching to estimate treatment effects: Data requirements, matching metrics, and Monte Carlo evidence. The Review of Economics and Statistics, 86(1), 91-107.
Retrieved from http://www.jstor.org/stable/3211662

BIOGRAPHICAL SKETCH

Seyfullah Tingir was born in Gaziantep, Turkey, and is from Samsun, Turkey. He received his B.S. in mathematics education from Cumhuriyet University, Turkey. He served at Eskisehir Osmangazi University and Samsun Ondokuzmayis University as a research assistant for one year. He later qualified for a scholarship to study abroad in the fall of 2011, enrolled for graduate studies in Human Development and Organizational Studies in Education at the College of Education at the University of Florida, and will receive his M.A.E. in Research and Evaluation Methodology from the Department of Educational Psychology in August 2013.